Review Reports - DOL-DETR: An Efficient Small Object Detection Algorithm for Unmanned Aerial Vehicle Remote Sensing

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript proposes DOL-DETR, a modified RT-DETR-r18-based architecture for small object detection in UAV imagery. The main architectural changes include DAIFI instead of AIFI, OMFF for feature fusion, and LDConv for downsampling. On VisDrone-DET2019, the authors report an improvement from 48.2 to 52.4 mAP@0.5, together with an increase from 106.6 to 120.1 FPS and a nearly unchanged parameter count (19.8M to 20.1M). The paper also includes an additional result on the DOTA dataset, where the reported mAP@0.5 increases from 72.3 to 76.1 compared with the baseline RT-DETR.

Main concerns

Insufficient experimental transparency and reproducibility. The experimental section (Section 4.2) provides the hardware environment and a few training hyperparameters, but several important details are missing, such as the exact training procedure, augmentation strategy, learning-rate schedule, warmup policy, FPS measurement protocol, number of repeated runs, random seeds, result variance, and the full protocol used for the DOTA experiment. In particular, the DOTA evaluation (Section 4.7) is presented only through a short summary table, without sufficient methodological detail. This seriously limits reproducibility.
Some numerical results in Section 4.4 are not convincingly explained. The most striking example is the increase in speed from 106.6 FPS to 158.2 FPS after introducing DAIFI, even though GFLOPs slightly increase (57.0 to 57.5). The manuscript attributes this to reduced “Memory Access Cost (MAC),” but this remains only a verbal explanation without profiling results, latency breakdowns, or evaluation across different batch sizes or input resolutions. This claim requires much stronger experimental support.
The comparative study (Section 4.5) is useful, but it is not documented in a sufficiently fair and rigorous way. It is not clear whether all competing models were trained under exactly the same conditions, with the same input resolution, number of epochs, augmentation strategy, and hardware platform, or whether some of the reported numbers were taken directly from prior publications. Without this clarification, the comparative table has limited evidential strength.
The DOTA generalization experiment (Section 4.7) is too weak to support broad claims. The generalization claim is based only on a short table comparing the baseline and the proposed model on DOTA. There is no detailed description of the training and evaluation protocol, no per-class analysis, no discussion of the dataset-specific challenges, and no comparison with other methods on that dataset. In its current form, this section is better interpreted as an additional cross-dataset test than as a strong validation of generalization ability.
Parts of the methodological presentation are overly promotional rather than analytical. In several places, the manuscript uses strong expressions such as “fully validates,” “outstanding balance,” “dual breakthrough,” and similar claims without providing sufficiently rigorous evidence. This tone weakens the scientific credibility of the paper.
There are problems with technical precision and presentation quality. The manuscript contains several inconsistencies in section numbering. For example, after Section 3.2.1, the text continues with “3.3.2. OKNet” and “3.3.3. MFM,” even though both are part of the OMFF description. In addition, Table 2 reports the learning rate as “0.001–1.0,” which appears either incorrect or at least highly unclear. These issues suggest insufficient proofreading and technical polishing.
Some references are outdated in form or not optimally matched to the claims they support. Several arXiv preprints are cited where final published versions are likely available, and a few references are only weakly aligned with the specific statements they are used to support. The reference list should therefore be revised more carefully.

Specific comments

Figure 1 provides a useful high-level overview, but it is not sufficiently self-explanatory. The authors should add either a concise legend for the module abbreviations and block types or a more explicit walk-through in the main text clarifying the data flow and the role of each newly introduced component.
The mathematical formulation is not fully rigorous. Several equations suffer from notation inconsistencies, and some formulations do not accurately reflect the described operations. More specifically:
a) Please unify the notation in the DAIFI/DAttention formulation. The symbols for the sampled feature map and the grid dimensions are currently inconsistent.
b) Please make the OKNet and MFM formulations fully explicit by specifying the exact tensor dimensions and the exact fusion operator used in the implementation.
c) Please align the MFM description, equation, and Figure 6. At present, the figure shows Softmax, whereas the text and equation use Sigmoid; moreover, the equation describes channel-wise rather than spatial modulation.
d) Please correct the LDConv mathematical formulation. The current convolution equation does not include the learned offsets, although the text states that the sampling positions are dynamically deformed.
It is recommended that all equations be numbered. At present, it is difficult to refer precisely to problematic formulations during review. In addition, the notation used in line 291 is programming-oriented rather than standard mathematical notation and should be formally explained or rewritten. There are also minor typesetting issues in the mathematical text. For example, the raw LaTeX expression in line 434 appears unresolved, which suggests insufficient proofreading of the manuscript.
Table 4 would be easier to interpret if the best values in each column were explicitly highlighted. This would make the relative position of the proposed method clearer, especially since the table is intended to demonstrate the trade-off between detection accuracy and computational efficiency.
Figure 8 contains multiple subpanels, but the caption does not explain the meaning of labels (a), (b), and (c). Please add these labels explicitly in the caption so that the individual subfigures can be clearly linked to the text. In addition, Figure 8 is insufficiently explained. The authors should clarify what the displayed heatmaps represent, how they were generated, and how they should be interpreted in the context of object detection on RGB images.

Author Response

Comment 1: Insufficient experimental transparency and reproducibility. The experimental section (Section 4.2) provides the hardware environment and a few training hyperparameters, but several important details are missing, such as the exact training procedure, augmentation strategy, learning-rate schedule, warmup policy, FPS measurement protocol, number of repeated runs, random seeds, result variance, and the full protocol used for the DOTA experiment. In particular, the DOTA evaluation (Section 4.7) is presented only through a short summary table, without sufficient methodological detail. This seriously limits reproducibility.

Response: Thank you for your valuable feedback. We entirely agree with your perspective regarding the significance of experimental transparency and reproducibility in scientific research.

To address this, we have substantially revised Section 4.2 to include comprehensive details concerning the model training process. Furthermore, in Section 4.7, we have provided additional specifics regarding the DOTA dataset, along with a clear description of the experimental environment and hyperparameter settings. These revisions can be found in lines 615–651 and 890–905 of the revised manuscript.
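For readers who want a concrete picture of the FPS measurement protocol and run-to-run variance that the reviewer asks for, the following is a minimal sketch, not the protocol used in the manuscript. The function name `measure_fps`, its parameters, and the zero-argument callable `infer_fn` are hypothetical. The sketch discards warm-up iterations, times several repeats, and reports the mean and standard deviation.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_fps(infer_fn: Callable[[], None],
                n_warmup: int = 10,
                n_iters: int = 50,
                n_repeats: int = 3) -> Dict[str, object]:
    """Time `infer_fn` and return per-repeat FPS with mean and std.

    `infer_fn` is a hypothetical callable that runs one forward pass on one
    image. For a GPU model it should also wait for the device to finish
    before returning; otherwise the timer mostly measures launch overhead.
    """
    # Warm-up iterations are excluded from timing (caches, lazy initialization).
    for _ in range(n_warmup):
        infer_fn()

    fps_runs: List[float] = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        for _ in range(n_iters):
            infer_fn()
        elapsed = time.perf_counter() - start
        fps_runs.append(n_iters / elapsed)

    return {
        "fps_runs": fps_runs,
        "fps_mean": statistics.mean(fps_runs),
        # Sample standard deviation needs at least two repeats.
        "fps_std": statistics.stdev(fps_runs) if len(fps_runs) > 1 else 0.0,
    }


# Usage with a stand-in workload instead of a real detector.
result = measure_fps(lambda: sum(range(1000)), n_warmup=2, n_iters=5, n_repeats=3)
```

Reporting the batch size, input resolution, and hardware alongside `fps_mean` and `fps_std` would make such a figure comparable across papers.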

Comment 2: Some numerical results in Section 4.4 are not convincingly explained. The most striking example is the increase in speed from 106.6 FPS to 158.2 FPS after introducing DAIFI, even though GFLOPs slightly increase (57.0 to 57.5). The manuscript attributes this to reduced “Memory Access Cost (MAC),” but this remains only a verbal explanation without profiling results, latency breakdowns, or evaluation across different batch sizes or input resolutions. This claim requires much stronger experimental support.

Response: We sincerely thank the reviewer for this constructive comment. We fully agree that the underlying reasons for the anomalous increase in the FPS value require thorough validation.

To address this concern, we have supplemented Section 4.4 of the revised manuscript with additional validation experiments to substantiate our hypothesis. Specifically, the new experiments compare the GPU memory consumption during training between the original model and the model incorporating DAIFI. Furthermore, we analyzed the variations in FPS when altering the input batch sizes and image resolutions. These results comprehensively validate our claims. The detailed revisions can be found in lines 717–745, as well as in Table 4 and Table 5 of the revised manuscript.
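The claim that speed can rise while GFLOPs rise slightly can be illustrated with a simple roofline-style estimate. This is a sketch under stated assumptions: the hardware peaks and the two layer profiles below are made-up illustrative numbers, not measurements from the manuscript, and the function names are hypothetical. It shows only that a memory-bound layer can become faster when memory traffic falls, even if FLOPs grow; it does not replace the profiling the reviewer requested.

```python
def roofline_latency(flops: float, bytes_moved: float,
                     peak_flops: float, peak_bandwidth: float) -> float:
    """Lower-bound latency in seconds under a simple roofline model."""
    compute_time = flops / peak_flops           # time if limited by arithmetic
    memory_time = bytes_moved / peak_bandwidth  # time if limited by memory traffic
    return max(compute_time, memory_time)


def is_memory_bound(flops: float, bytes_moved: float,
                    peak_flops: float, peak_bandwidth: float) -> bool:
    """True when memory traffic, not arithmetic, dominates the estimate."""
    return bytes_moved / peak_bandwidth > flops / peak_flops


# Illustrative, made-up numbers (assumptions, not data from the paper).
PEAK_FLOPS = 10e12      # 10 TFLOP/s
PEAK_BANDWIDTH = 500e9  # 500 GB/s

layer_a = {"flops": 1.00e9, "bytes_moved": 400e6}
layer_b = {"flops": 1.01e9, "bytes_moved": 250e6}  # more FLOPs, less traffic

latency_a = roofline_latency(layer_a["flops"], layer_a["bytes_moved"],
                             PEAK_FLOPS, PEAK_BANDWIDTH)
latency_b = roofline_latency(layer_b["flops"], layer_b["bytes_moved"],
                             PEAK_FLOPS, PEAK_BANDWIDTH)

# Under these assumptions layer_b is estimated to be faster than layer_a.
print(latency_b < latency_a)
```

Because both hypothetical layers are memory-bound, the estimate is driven entirely by `bytes_moved`, which is the kind of effect a latency breakdown would need to confirm on real hardware.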

Comment 3: The comparative study (Section 4.5) is useful, but it is not documented in a sufficiently fair and rigorous way. It is not clear whether all competing models were trained under exactly the same conditions, with the same input resolution, number of epochs, augmentation strategy, and hardware platform, or whether some of the reported numbers were taken directly from prior publications. Without this clarification, the comparative table has limited evidential strength.

Response: We thank the reviewer for the valuable feedback, and we fully agree with your perspective.

To address this concern, we have added a clarification in Section 4.2 of the revised manuscript, explicitly stating that unless otherwise specified, all experiments in this study were conducted under the experimental conditions detailed in this section. Additionally, we have included a note in Section 4.5 to clarify that the data used in this experiment were not directly extracted from previously published literature. The specific modifications can be found in lines 649–651 and 795–798 of the revised manuscript.

Comment 4: The DOTA generalization experiment (Section 4.7) is too weak to support broad claims. The generalization claim is based only on a short table comparing the baseline and the proposed model on DOTA. There is no detailed description of the training and evaluation protocol, no per-class analysis, no discussion of the dataset-specific challenges, and no comparison with other methods on that dataset. In its current form, this section is better interpreted as an additional cross-dataset test than as a strong validation of generalization ability.

Response: We sincerely appreciate the reviewer’s valuable feedback, and we fully agree with your perspective.

Upon a careful review of Section 4.7, we acknowledge the limitation pointed out by the reviewer. To address this shortcoming, we have incorporated comparative experiments using current mainstream object detectors (e.g., YOLOv7 and YOLOv8). Furthermore, we have elaborated on the specific challenges associated with the DOTA dataset and provided an in-depth analysis. We believe that the comprehensively revised generalization experiments now meet the rigorous standards of the journal. The detailed modifications can be found in Section 4.7 of the revised manuscript.

Comment 5: Parts of the methodological presentation are overly promotional rather than analytical. In several places, the manuscript uses strong expressions such as “fully validates,” “outstanding balance,” “dual breakthrough,” and similar claims without providing sufficiently rigorous evidence. This tone weakens the scientific credibility of the paper.

Response: We sincerely thank the reviewer for this valuable feedback. We acknowledge that the original manuscript contained certain overly promotional language. In the revised manuscript, we have carefully reviewed the text and either removed these statements or replaced them with more objective and scientifically rigorous expressions.

Comment 6: There are problems with technical precision and presentation quality. The manuscript contains several inconsistencies in section numbering. For example, after Section 3.2.1, the text continues with “3.3.2. OKNet” and “3.3.3. MFM,” even though both are part of the OMFF description. In addition, Table 2 reports the learning rate as “0.001–1.0,” which appears either incorrect or at least highly unclear. These issues suggest insufficient proofreading and technical polishing.

Response: We sincerely thank the reviewer for pointing out these issues. We offer our deepest apologies for the careless oversights that occurred during the proofreading of the original manuscript.

To address these errors, we have made the following corrections in the revised manuscript: First, we have corrected the section numbering throughout the paper. Second, we have revised the imprecise expression regarding the learning rate in Table 2 to ensure scientific rigor.

Comment 7: Some references are outdated in form or not optimally matched to the claims they support. Several arXiv preprints are cited where final published versions are likely available, and a few references are only weakly aligned with the specific statements they are used to support. The reference list should therefore be revised more carefully.

Response: We sincerely thank the reviewer for the valuable feedback. We fully agree that standard and rigorous citation practices are a crucial aspect of academic writing.

In the revised manuscript, we have removed all citations to arXiv preprints and replaced them with their final peer-reviewed published versions. Furthermore, to address the concern regarding certain references having weak relevance to the main text, we have carefully reviewed and updated our bibliography. We have ensured that all cited literature is now highly relevant and closely aligned with the research content of this study.

Comment 8: Figure 1 provides a useful high-level overview, but it is not sufficiently self-explanatory. The authors should add either a concise legend for the module abbreviations and block types or a more explicit walk-through in the main text clarifying the data flow and the role of each newly introduced component.

Response: We sincerely thank the reviewer for the valuable feedback, and we fully agree with your perspective. We acknowledge the issues you pointed out regarding the high-level component overview depicted in Figure 1.

To address this concern, we have augmented Section 3.1 of the main text with a detailed description of the model's data flow, as well as a clear explanation of the specific role each newly introduced component plays within the overall model. The detailed modifications can be found in lines 235–266 of the revised manuscript.

Comment 9: The mathematical formulation is not fully rigorous. Several equations suffer from notation inconsistencies, and some formulations do not accurately reflect the described operations. More specifically:

a) Please unify the notation in the DAIFI/DAttention formulation. The symbols for the sampled feature map and the grid dimensions are currently inconsistent.
b) Please make the OKNet and MFM formulations fully explicit by specifying the exact tensor dimensions and the exact fusion operator used in the implementation.
c) Please align the MFM description, equation, and Figure 6. At present, the figure shows Softmax, whereas the text and equation use Sigmoid; moreover, the equation describes channel-wise rather than spatial modulation.
d) Please correct the LDConv mathematical formulation. The current convolution equation does not include the learned offsets, although the text states that the sampling positions are dynamically deformed.
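Point (c) turns on two distinctions: Sigmoid versus Softmax gating, and channel-wise versus spatial modulation. The sketch below illustrates both in plain Python. It is not the authors' MFM implementation; the function names and the tiny feature map are hypothetical and chosen only to make the difference visible.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def sigmoid_gates(logits: List[float]) -> List[float]:
    """Independent gate per channel, each in (0, 1); gates need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_gates(logits: List[float]) -> List[float]:
    """Competing gates across channels; gates are positive and sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_channels(features: FeatureMap, gates: List[float]) -> FeatureMap:
    """Channel-wise modulation: one scalar per channel, shared by all pixels.

    Spatial modulation would instead use one weight per (row, column)
    position, shared across channels.
    """
    return [[[g * v for v in row] for row in channel]
            for g, channel in zip(gates, features)]


logits = [2.0, 0.0, -1.0]
features = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]  # 3 channels, 1x2 each

sig = sigmoid_gates(logits)
soft = softmax_gates(logits)
gated = modulate_channels(features, soft)
```

With Sigmoid, strengthening one channel leaves the others untouched; with Softmax, the channels compete for a fixed budget, which is why the figure, text, and equation need to agree on which operator is used.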

Response: We sincerely appreciate the reviewer’s rigorous and meticulous review. We fully agree with your suggestions. To ensure that the mathematical formulas in our manuscript strictly conform to the journal's standards and academic conventions, we have conducted a comprehensive check and thoroughly revised the equations based on your valuable feedback. The specific modifications are detailed as follows:

Regarding the DAttention formulations: We have unified and aligned the mathematical notations used to represent the sampled feature maps and grid dimensions to ensure consistency.
Regarding the ambiguity in the OKNet and MFM equations: We have replaced the previous vague expressions with standard mathematical symbols for channel concatenation. Additionally, we have explicitly specified the dimensions of the tensors involved.
Regarding the inaccuracy and lack of clarity in the MFM formulations: We have replaced the previous abbreviation "S" with the standard operator "Softmax" to accurately denote the Softmax operation. Furthermore, we have explicitly clarified in the text that MFM employs a channel modulation mechanism, rather than a spatial modulation method.
Regarding the omission of offsets in the LDConv equation: We have corrected the formulation of LDConv. The updated equation in the revised manuscript now properly incorporates the learned offsets.
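
For reference, the difference the reviewer points to can be written in the generic notation of deformable convolution. This is the standard textbook form, not the corrected equation from the revised manuscript, whose symbols may differ. Here p_0 is an output location, p_n the n-th initial sampling position, w_n its weight, N the number of sampled points, and \Delta p_n the learned offset; because p_0 + p_n + \Delta p_n is generally fractional, x is evaluated there by bilinear interpolation.

```latex
% Convolution on a fixed sampling grid (no offsets)
y(p_0) = \sum_{n=1}^{N} w_n \, x(p_0 + p_n)

% Deformable variant: each sampling position is shifted by a learned offset
y(p_0) = \sum_{n=1}^{N} w_n \, x(p_0 + p_n + \Delta p_n)
```

The first form describes fixed sampling only; the \Delta p_n term is what expresses dynamically deformed sampling positions.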

Comment 10: It is recommended that all equations be numbered. At present, it is difficult to refer precisely to problematic formulations during review. In addition, the notation used in line 291 is programming-oriented rather than standard mathematical notation and should be formally explained or rewritten. There are also minor typesetting issues in the mathematical text. For example, the raw LaTeX expression in line 434 appears unresolved, which suggests insufficient proofreading of the manuscript.

Response: We sincerely thank the reviewer for this constructive feedback, and we fully agree with your observations. To address your concerns, we have made the following improvements:

First, we have ensured that all mathematical equations are properly and sequentially numbered throughout the revised manuscript.

Second, we sincerely apologize for the residual LaTeX expressions present in the initial submission. Originally, the first draft was prepared using a LaTeX editor. However, to facilitate transparent collaborative editing and effectively track changes among the co-authors, we subsequently migrated the draft to Microsoft Word. Unfortunately, during this conversion process, a few LaTeX formatting artifacts were inadvertently left behind. We have now thoroughly proofread the entire manuscript and converted all such residual LaTeX code into the standard Word equation format.

Finally, regarding the programming-oriented notation used in Equation 3, we have added explicit definitions and detailed explanations within the text to ensure that the mathematical logic is clear and easily comprehensible to readers.

Comment 11: Table 4 would be easier to interpret if the best values in each column were explicitly highlighted. This would make the relative position of the proposed method clearer, especially since the table is intended to demonstrate the trade-off between detection accuracy and computational efficiency.

Response: We sincerely thank the reviewer for this valuable suggestion. We fully recognize that clearly presented experimental data is crucial for enhancing the readability of the paper. Accordingly, we have highlighted the best values in each column of Table 6 (Table 4 in the original manuscript) in bold.

Comment 12: Figure 8 contains multiple subpanels, but the caption does not explain the meaning of labels (a), (b), and (c). Please add these labels explicitly in the caption so that the individual subfigures can be clearly linked to the text. In addition, Figure 8 is insufficiently explained. The authors should clarify what the displayed heatmaps represent, how they were generated, and how they should be interpreted in the context of object detection on RGB images.

Response: Thank you for your valuable and constructive comments regarding Figure 8 (corresponding to Figure 10 in the revised draft). We have carefully revised the manuscript to address all the concerns raised.

First, we have explicitly clarified the meanings of the subfigure labels (a), (b), and (c) in the caption. Specifically, the revised caption now clearly states that:

(a) corresponds to the original UAV RGB images,

(b) represents the heatmaps generated by the baseline RT-DETR model, and

(c) represents the heatmaps generated by the proposed DOL-DETR model.

This modification ensures that each subfigure can be directly and unambiguously linked to the corresponding discussion in the main text.

Second, we have substantially expanded the explanation of the heatmaps to improve clarity and scientific rigor. In the revised manuscript (Section 4.6), we now explicitly describe that the heatmaps are generated using the Grad-CAM (Gradient-weighted Class Activation Mapping) method. We further explain that the heatmaps are obtained by computing the gradients of the target class score with respect to the feature maps of the last convolutional layer, followed by a weighted aggregation and ReLU activation to highlight regions that positively contribute to detection.
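The Grad-CAM computation described above (gradient-derived channel weights, weighted aggregation, then ReLU) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function name `grad_cam` and the nested-list inputs are hypothetical, the gradients are assumed to be already computed for a chosen target score, and the final rescaling to [0, 1] is an added display convenience.

```python
from typing import List

Matrix = List[List[float]]


def grad_cam(activations: List[Matrix], gradients: List[Matrix]) -> Matrix:
    """Compute a Grad-CAM heatmap from one layer's feature maps.

    activations[k][i][j]: value of channel k at spatial position (i, j).
    gradients[k][i][j]:   d(target score) / d(activations[k][i][j]).
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = spatial average of that channel's gradients.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum of the feature maps over channels.
    cam = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * channel[i][j]

    # ReLU keeps only regions that contribute positively to the score.
    cam = [[max(0.0, value) for value in row] for row in cam]

    # Assumption: rescale to [0, 1] so the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: channel 0 supports the score, channel 1 opposes it.
acts = [[[2.0, 0.0], [0.0, 0.0]], [[0.0, 3.0], [0.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the RGB image; which detection score serves as the target is a choice that should be stated alongside the figure.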

Finally, we have clarified how these heatmaps should be interpreted in the context of object detection on RGB images. Specifically, we now explain that the heatmaps visualize the spatial attention distribution of the model, where warmer colors (e.g., red) indicate regions with higher importance for object detection. We also provide a more detailed comparative analysis, showing how the proposed DOL-DETR model achieves more precise localization of small and dense objects, better suppression of background noise, and more complete coverage of object contours compared to the baseline. This interpretation directly supports the quantitative improvements reported in the experimental results.

These revisions have significantly improved the clarity and interpretability of Figure 10 and its role in demonstrating the effectiveness of the proposed method. For details, please refer to the caption of Figure 10 in the revised draft and lines 827 to 839.

Reviewer 2 Report

Comments and Suggestions for Authors

Unmanned aerial vehicles (UAVs) have found wide application in such domains as precision agriculture, urban security, traffic monitoring, and disaster response, among others. Object detection in UAV imagery faces significant challenges associated with the small scale of targets and complex backgrounds. In this study, the authors propose an efficient small object detection algorithm based on the RT-DETR-R18 architecture. The developed algorithm has been evaluated on a dedicated small-object dataset.

The paper demonstrates both scientific relevance and practical value.

Suggestions:

The title of the paper contains an abbreviation; it is recommended that abbreviations be avoided in the title.
The manuscript includes abbreviated terms; it is advisable to define all abbreviations at their first occurrence in the text.
The Related Work section should be expanded.
The quality of Figure 2 should be improved.
In Figure 2 (“A structural comparison diagram between the DAIFI module and the original AIFI module”), the differences between the two diagrams should be more clearly highlighted and explicitly described.
The title declares the development of An Efficient Small Object Detection Algorithm; therefore, it is recommended to include a dedicated section describing this algorithm in detail.
The conclusions should be aligned with the contributions outlined in the Introduction.
The quality of Figure 1 should also be improved.
The reference list should be updated to include recent publications (within the last 3–5 years) from high-impact journals.

Author Response

Comment 1: The title of the paper contains an abbreviation; it is recommended that abbreviations be avoided in the title.

Response: Thank you for your valuable suggestion regarding the title's clarity. We have carefully revised the title to follow the academic convention of avoiding general abbreviations. Specifically, we have expanded the abbreviation "UAV" to its full form, "Unmanned Aerial Vehicle." Regarding "DOL-DETR," we have retained it as it represents the specific name of the proposed algorithm architecture introduced in this study.

The revised title is as follows:

Original Title: DOL-DETR: An Efficient Small Object Detection Algorithm for UAV Remote Sensing

Revised Title: DOL-DETR: An Efficient Small Object Detection Algorithm for Unmanned Aerial Vehicle Remote Sensing

We have updated this throughout the revised manuscript.

Comment 2: The manuscript includes abbreviated terms; it is advisable to define all abbreviations at their first occurrence in the text.

Response: Thank you very much for this helpful suggestion. We agree that clear definitions of abbreviations are essential for the readability of the manuscript. We have conducted a thorough check of the entire text and ensured that all abbreviations are defined in full upon their first appearance, both in the Abstract and the main body of the paper. Specifically, we have verified and/or updated the following key terms:

UAV: Unmanned Aerial Vehicle.

RT-DETR: Real-Time DEtection TRansformer.

DAIFI: DAttention-based Intra-scale Feature Interaction.

OMFF: Omni-Modulated Feature Fusion.

LDConv: Linear De-redundancy Convolution.

CNN: Convolutional Neural Network.

NMS: Non-Maximum Suppression.

IoU: Intersection over Union.

We have carefully reviewed all other technical abbreviations (such as SPDConv, OKNet, and MFM) to ensure they follow academic standards.

Comment 3: The Related Work section should be expanded.

Response: Thank you for this constructive suggestion. We agree that a more comprehensive literature review provides a better context for our research. In the revised manuscript, we have significantly expanded the Related Work (Section 2) to include more recent developments and specialized methodologies. The key updates are as follows:

Addition of Section 2.3 (High-Precision Small Object Detection Algorithms): We have added a new subsection to specifically discuss recent advancements in high-precision detection for small targets. This section reviews various specialized strategies, such as feature enhancement through generative models and the application of refined loss functions (e.g., NWD), providing a more focused theoretical foundation for our proposed model.
Significant Expansion of Section 2.4 (UAV-Specific Detection): We have introduced a substantial amount of new content to Section 2.4, focusing on the latest state-of-the-art (SOTA) research in Unmanned Aerial Vehicle (UAV) remote sensing. This includes detailed discussions on recent lightweight architectures and multi-scale fusion frameworks published between 2024 and 2026 (e.g., MSFE-DETR, HPS-DETR, and TAF-YOLO).
Updated References: Along with these expansions, we have incorporated additional relevant references to ensure the manuscript reflects the current state of the field.

We believe these additions strengthen the paper by clearly positioning our work within the evolving landscape of UAV object detection. All major changes are highlighted in the revised manuscript. For specific revisions, please refer to lines 129 to 160 and lines 181 to 200 of the revised manuscript.

Comment 4: The quality of Figure 2 should be improved.

Response: Thank you for pointing this out. We apologize for the insufficient clarity of the original Figure 2. In the revised manuscript, we have completely redrawn Figure 2 to ensure it meets the high standards for academic publication. The specific improvements are as follows:

Enhanced Resolution: The new figure has been rendered with a minimum resolution of 600 DPI to ensure that all technical details and text remain sharp even when zoomed in.
Enlarged Text: We have enlarged the text in Figure 2 to ensure it is clearly visible.
Standardized Notation: All symbols and technical terms in the figure have been cross-checked to ensure consistency with the descriptions in the main text.

We believe the revised Figure 2 now provides a much clearer visual representation of the proposed DAIFI architecture.

Comment 5: In Figure 2 (“A structural comparison diagram between the DAIFI module and the original AIFI module”), the differences between the two diagrams should be more clearly highlighted and explicitly described.

Response: Thank you for this insightful suggestion. We agree that a clearer visual and textual distinction between the proposed DAIFI module and the original AIFI module is essential for readers to grasp the technical innovations of our work.

To address your concern, we have implemented the following revisions in the updated manuscript:

Visual Highlighting in Figure 2: We have modified Figure 2 by using boxes with different background colors to explicitly highlight the structural differences between the two modules. Specifically, the components unique to the DAIFI module are now visually distinguished from the standard elements of the AIFI module, making the architectural changes immediately apparent.
Explicit Description in Annotations: We have added brief yet explicit descriptions within the annotations of Figure 2.

We believe these modifications effectively resolve the ambiguity and provide a much clearer comparison of the two structures.

Comment 6: The title declares the development of An Efficient Small Object Detection Algorithm; therefore, it is recommended to include a dedicated section describing this algorithm in detail.

Response: We sincerely thank the reviewer for this insightful and constructive recommendation. We fully agree that a comprehensive and dedicated description of the proposed algorithm is vital for the reader's understanding of its core contributions and technical novelty. In accordance with your suggestion, we have performed a thorough revision of the manuscript:

Addition of Section 2.3 (Related Work):

To better situate our work within the current research landscape, we have incorporated a new subsection, Section 2.3 (Efficient Small Object Detection Algorithms), in the "Related Work" section. This subsection provides a systematic review of existing methodologies focused on efficient object detection, including lightweight architectural designs and optimized feature fusion mechanisms. This addition serves to establish a solid theoretical foundation and highlights the motivation behind the development of our proposed algorithm.

We believe these revisions have significantly enhanced the depth and clarity of our methodological presentation. We hope the updated manuscript now meets your expectations.

Comment 7: The conclusions should be aligned with the contributions outlined in the Introduction.

Response: We sincerely appreciate the reviewer’s constructive feedback regarding the structural consistency of the manuscript. Ensuring that the research contributions presented in the Introduction are explicitly echoed in the Conclusion is essential for the logical integrity of the paper. We have revised the manuscript accordingly:

We have thoroughly restructured the Conclusion (Section 5) to directly correspond with the core contributions defined in the Introduction. The revised conclusion now explicitly summarizes how the DAIFI module, the OMFF mechanism, and LDConv specifically address the challenges of small object detection in Unmanned Aerial Vehicle (UAV) remote sensing. This ensures that the final summary provides a clear and direct answer to the research objectives established at the beginning of the paper.

We believe these adjustments have significantly strengthened the logical flow and clarity of our work. For specific revisions, please refer to lines 962 to 986 of the revised manuscript.

Comment 8: The quality of Figure 1 should also be improved.

Response: We appreciate the reviewer’s suggestion to enhance the visual quality of our illustrations. We agree that clear labeling is fundamental to the accessibility of the proposed architecture. In the revised manuscript, we have updated Figure 1 with the following improvements:

Enhanced Legibility: We have significantly increased the font size of all text labels, including module names (DAIFI, OMFF, LDConv), layer dimensions, and data flow descriptions. This ensures that the technical details of the DOL-DETR framework remain clearly legible even when the figure is viewed at a reduced scale or in print.
Optimized Visual Clarity: In addition to font adjustments, we have refined the thickness of the connection lines and arrows to better match the larger text, ensuring a balanced and professional visual presentation that accurately reflects the hierarchical structure of our algorithm.

We believe that these modifications successfully address the accessibility concerns and improve the overall quality of Figure 1.

Comment 9: The reference list should be updated to include recent publications (within the last 3–5 years) from high-impact journals.

Response: We sincerely appreciate the reviewer’s suggestion to enhance the timeliness and academic impact of our references. We agree that incorporating the latest high-quality research is essential for maintaining the rigor and relevance of our study. In the revised manuscript, we have updated the reference list by incorporating several recent publications (2023–2025) from high-impact journals.

Integration of Recent High-Impact Research: We have added new citations from top-tier journals such as IEEE Transactions on Geoscience and Remote Sensing (TGRS), ISPRS Journal of Photogrammetry and Remote Sensing, and Remote Sensing. These additions focus on state-of-the-art developments in small object detection and transformer-based architectures. Specific examples include:

Wang et al. (2025) and Zhao et al. (2025) in IEEE TGRS, which discuss advanced feature extraction and position-guided detection for remote sensing.

Zhuo et al. (2025) in Remote Sensing, focusing on adaptive fusion for UAV aerial imagery.

Zhu et al. (2024) in IEEE TCSVT, regarding global multi-level perception and dynamic region aggregation.

Updating the Related Work:

These recent works have been integrated into Section 2 (Related Work) to provide a more comprehensive comparative context. This ensures that our proposed DOL-DETR is evaluated against the most current benchmarks in the field, further highlighting the novelty and effectiveness of our approach.

We believe the updated reference list now reflects the most recent trends and significantly strengthens the academic foundation of our paper.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors propose DOL-DETR, an interesting algorithm for detecting small objects, which is based on the RT-DETR-R18 architecture, with the aim of mitigating the interference and noise inherent in existing deep learning models.

Section 2.3, 'Object Detection from UAV Perspectives', should cover more research related to the topic, detailing what the authors did, how they did it and the results they obtained. There is extensive literature on this subject.
Please increase the font size of the text in the figures, as some of them are difficult to read. We suggest using the same font size as in the main text.
Please number the equations and define the variables.
All figures must be referenced and described within the text. For example, Figures 4, 6 and 7 are not mentioned in the text.
Line 435: What does $(0, 0)$ mean?
Section 4.1: Describe the technical characteristics of the dataset, e.g. image resolution. Does this dataset contain images with distortions, such as geometric variations (camera angle, perspective)?
Table 2: Were these parameters always used? How were they chosen? These hyperparameters must be described clearly and in detail, explaining how and why they were chosen.
Line 518: What does $C$ mean?
Did all the models in Table 4 use the same dataset as the one used in your work? Have you tested these yourself, or have they been reported in other studies?
In the figure caption for Figure 8, please describe what (a), (b) and (c) correspond to, to avoid confusion.
Although the computational complexity variables were obtained correctly, what is the inference time? Would it be possible to implement the proposed method in real-time environments?
I think the numerical results presented are fine, but some images should also be included to provide a visual representation of the results.
Has your proposal been tested on a different database? It is important to carry out tests on a different dataset to verify the robustness of the proposed method.
What are the main limitations of your proposal? Based on this, what future work is proposed?
Please check that the references are in the specified format.

Author Response

Comment 1: Section 2.3, 'Object Detection from UAV Perspectives', should cover more research related to the topic, detailing what the authors did, how they did it and the results they obtained. There is extensive literature on this subject.

Response: We sincerely thank the reviewer for this valuable suggestion. We agree that a more thorough review of existing UAV-based object detection research provides essential context for our work. In the revised manuscript, we have addressed this comment with the following updates:

Restructuring and Expansion (Section 2.4):

We have moved the discussion regarding "Object Detection from UAV Perspectives" to the newly designated Section 2.4. This section has been extensively expanded to include a wider range of high-impact research from the past 3–5 years.

Detailed Literature Analysis:

The updated Section 2.4 now provides a more granular analysis of recent methodologies. Specifically, we have detailed the technical approaches (e.g., normalized Wasserstein distance and dynamic region aggregation) and the specific results obtained by various authors. This detailed comparison highlights the current limitations in UAV remote sensing—such as feature attenuation and noise interference—which our proposed DOL-DETR aims to overcome.

We believe that the expanded Section 2.4 now provides a much more comprehensive and rigorous foundation for the subsequent presentation of our algorithm. For specific revisions, please refer to lines 181 to 200 of the revised draft.

Comment 2: Please increase the font size of the text in the figures, as some of them are difficult to read. We suggest using the same font size as in the main text.

Response: We sincerely thank the reviewer for the constructive feedback regarding the legibility of our figures. We agree that clear and accessible visual aids are crucial for communicating the technical details of our model. In accordance with your suggestion, we have updated the figures in the revised manuscript:

We have specifically updated Figure 1 (The DOL-DETR framework) and Figure 2 (The proposed modules) to increase the font size of all text labels, including module names, layer parameters, and data flow descriptions, ensuring that all components are clearly visible and easy to read.

We believe that these modifications significantly improve the readability and overall quality of the illustrations.

Comment 3: Please number the equations and define the variables.

Response: We sincerely thank the reviewer for this helpful suggestion. We agree that standardized equation numbering and clear variable definitions are essential for the technical clarity and readability of the manuscript. In the revised version, we have performed the following updates:

Systematic Equation Numbering:

We have assigned sequential numbers to all mathematical expressions throughout the manuscript. For instance, the core formulations for the DAttention-based Intra-scale Feature Interaction (DAIFI), the Omni-Modulated Feature Fusion (OMFF) mechanism, and the Linear De-redundancy Convolution (LDConv) are now clearly labeled to facilitate easy reference.

Comprehensive Variable Definitions:

We have meticulously reviewed all equations and provided formal definitions for every mathematical symbol and variable immediately following their first appearance.

We believe these modifications have significantly improved the mathematical rigor and accessibility of the paper.

Comment 4: All figures must be referenced and described within the text. For example, Figures 4, 6 and 7 are not mentioned in the text.

Response: We sincerely thank the reviewer for pointing out this oversight. We agree that all visual evidence and experimental results must be explicitly integrated into the narrative of the paper to provide a coherent analysis. We have thoroughly revised the manuscript to ensure that every figure is appropriately referenced and described.

We believe that these additions have improved the transparency of our experimental validation and the overall flow of the manuscript.

Comment 5: Line 435: What does $(0, 0)$ mean?

Response: We apologize for the confusion caused by this notation. We have carefully reviewed Line 435 and identified that the symbol "$(0, 0)$" was a typographical artifact resulting from the LaTeX-to-Word text conversion process. In the revised manuscript, we have performed the following corrections:

Symbol Removal/Correction: The erroneous symbol has been removed, and the intended expression has been restored to its original and correct form to ensure clarity.
Full Text Audit: We have conducted a comprehensive review of the entire manuscript to identify and rectify any other potential formatting errors or artifacts generated during the document format conversion.

We thank the reviewer for their meticulous reading and for pointing out this technical error.

Comment 6: Section 4.1: Describe the technical characteristics of the dataset, e.g. image resolution. Does this dataset contain images with distortions, such as geometric variations (camera angle, perspective)?

Response: We thank the reviewer for this valuable suggestion. Regarding the technical characteristics and geometric variations of the dataset, we have addressed these points in Section 4.1 and other relevant parts of the manuscript:

We explicitly discussed in Section 4.1 that UAV imagery is characterized by variable viewpoints and steep nadir angles, which induce complex background clutter and irregular target distributions.

To specifically address these geometric distortions, we proposed the Linear De-redundancy Convolution (LDConv). This operator utilizes learnable offsets to allow convolutional kernels to dynamically adapt to the morphological deformations of targets from UAV perspectives.

We have refined Section 4.1 in the revised manuscript to more explicitly detail these technical attributes. For specific revisions, please refer to lines 603 to 607 of the revised manuscript.
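The idea of sampling a kernel at learnable offsets can be illustrated with a small, generic sketch. It is not the LDConv formulation from the manuscript: the function names, the three-point horizontal kernel, and the fixed offset values are all hypothetical, and in a trained model the offsets would be predicted by a layer rather than written by hand.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bilinear_sample(feat: Sequence[Sequence[float]], y: float, x: float) -> float:
    """Sample a 2D map at a fractional (y, x); positions outside the map read as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours with weights given by the fractional parts.
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feat[yy][xx]
    return value


def offset_conv_at(feat: Sequence[Sequence[float]], center: Point,
                   base_points: Sequence[Point], offsets: Sequence[Point],
                   weights: Sequence[float]) -> float:
    """One output value: weighted sum of samples taken at base position + offset."""
    cy, cx = center
    return sum(
        w * bilinear_sample(feat, cy + by + oy, cx + bx + ox)
        for (by, bx), (oy, ox), w in zip(base_points, offsets, weights)
    )


feat = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
base = [(0, -1), (0, 0), (0, 1)]            # hypothetical 3-point horizontal kernel
ones = [1.0, 1.0, 1.0]
rigid = offset_conv_at(feat, (1, 1), base, [(0.0, 0.0)] * 3, ones)
shifted = offset_conv_at(feat, (1, 1), base, [(0.5, 0.0)] * 3, ones)
print(rigid, shifted)
```

With zero offsets the kernel reads the middle row of the map; shifting every sampling point half a pixel downward makes it read a blend of the middle and bottom rows, which is the sense in which the sampling grid adapts to the input.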

Comment 7: Table 2: Were these parameters always used? How were they chosen? These hyperparameters must be described clearly and in detail, explaining how and why they were chosen.

Response: We sincerely appreciate the reviewer's suggestion. We have added a detailed description of the hyperparameter selection criteria and implementation details in Section 4.2 of the revised manuscript.

All hyperparameters listed in Table 2 (e.g., Epochs, Learning Rate, Optimizer, etc.) were maintained consistently throughout all comparative and ablation experiments in this study to ensure the fairness and comparability of the experimental results.

These detailed descriptions have been added to ensure transparency and the full reproducibility of our experiments. For specific revisions, please refer to lines 615 to 651 of the revised manuscript.

Comment 8: Line 518: What does $C$ mean?

Response: We apologize for the confusion caused by this notation. We have carefully reviewed Line 518 and identified that the symbol "$C$" was a typographical artifact resulting from the LaTeX-to-Word text conversion process. In the revised manuscript, we have performed the following corrections:

Symbol Removal/Correction: The erroneous symbol has been removed, and the intended expression has been restored to its original and correct form to ensure clarity.
Full Text Audit: We have conducted a comprehensive review of the entire manuscript to identify and rectify any other potential formatting errors or artifacts generated during the document format conversion.

We thank the reviewer for their meticulous reading and for pointing out this technical error.

Comment 9: Did all the models in Table 4 use the same dataset as the one used in your work? Have you tested these yourself, or have they been reported in other studies?

Response: Thank you for this question. We would like to clarify that all models presented in Table 4 were evaluated on exactly the same dataset used in our work. The comparison results were not taken from other published studies — all experiments were conducted by ourselves under identical experimental conditions to ensure a fair and consistent comparison.

We acknowledge that this point was not stated explicitly enough in the original manuscript. We have revised the relevant section in the paper to clearly state that all baseline models in Table 4 were re-implemented and tested by our team on the same dataset, using the same data splits and evaluation protocols as our proposed method.

We hope this clarification addresses your concern. Please let us know if you have any further questions. For specific revisions, please refer to line 608 and lines 795 to 798 of the revised manuscript.

Comment 10: In the figure caption for Figure 8, please describe what (a), (b) and (c) correspond to, to avoid confusion.

Response: Thank you for this helpful suggestion. We would like to inform you that this has already been addressed in the revised manuscript. The caption for this figure (now renumbered as Figure 10 following structural revisions to the paper) explicitly describes what each subfigure corresponds to.

The updated caption clearly identifies each subfigure to avoid any confusion for the reader. We hope this revision satisfactorily addresses your concern.

Comment 11: Although the computational complexity variables were obtained correctly, what is the inference time? Would it be possible to implement the proposed method in real-time environments?

Response: Thank you for this question. We would like to point out that the inference time of our proposed method has been explicitly addressed in Section 4.4 of the revised manuscript. Specifically, we report the frames per second (FPS) achieved by our model, which serves as a direct measure of inference speed.

In Section 4.4, we also provide a dedicated discussion on the real-time applicability of our method, analyzing whether the achieved FPS meets the requirements for deployment in real-time environments. The results demonstrate that our model achieves 120.1 FPS, which satisfies the standard threshold for real-time processing.

We believe this discussion sufficiently addresses the concern regarding inference time and real-time feasibility. We hope this clarification is satisfactory. For specific revisions, please refer to lines 767 to 775 of the revised manuscript.
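Because the comment asks for inference time while the response reports throughput, it may help to convert one into the other. The sketch below does that arithmetic for the FPS figures quoted in this review record; the 30 FPS real-time threshold is an assumed common convention, not a value taken from the manuscript, and the function name is illustrative.

```python
def latency_ms(fps: float) -> float:
    """Convert throughput in frames per second to mean per-frame latency in ms."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS values quoted in the review record for the baseline and the proposed model.
REPORTED_FPS = {"RT-DETR-r18": 106.6, "DOL-DETR": 120.1}

# Assumed real-time threshold (a common convention, not stated in the manuscript).
REALTIME_FPS = 30.0

for name, fps in REPORTED_FPS.items():
    print(f"{name}: {latency_ms(fps):.2f} ms/frame, real-time={fps >= REALTIME_FPS}")
```

Under these assumptions the reported 120.1 FPS corresponds to roughly 8.3 ms per frame, measured on the authors' hardware; latency on an embedded UAV platform would have to be measured separately.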

Comment 12: I think the numerical results presented are fine, but some images should also be included to provide a visual representation of the results.

Response: Thank you for your valuable suggestion. We fully agree that visual representations can greatly enhance the clarity and interpretability of the results. We are pleased to inform you that the revised manuscript now includes intuitive graphical illustrations of our results. Specifically, we refer the reviewer to Figure 8 and Figure 9 in the revised manuscript, which provide visual comparisons and representations that complement the numerical results reported in the tables.

We hope these additions address your concern and improve the overall presentation of our work.

Comment 13: Has your proposal been tested on a different database? It is important to carry out tests on a different dataset to verify the robustness of the proposed method.

Response: Thank you for raising this concern. We fully agree that cross-dataset evaluation is essential for verifying the generalization capability of the proposed method. In response to this suggestion, we have conducted additional ablation experiments on the DOTA dataset, which is a widely recognized benchmark that differs from our primary dataset. The results demonstrate the robustness and generalizability of our proposed method across different data distributions.

The details of these experiments and the corresponding results are presented in Section 4.7 of the revised manuscript. We hope this additional evaluation addresses your concern regarding cross-dataset validation.

Comment 14: What are the main limitations of your proposal? Based on this, what future work is proposed?

Response: Thank you for raising this point. We would like to direct the reviewer to Section 5 (Conclusion) of the revised manuscript, where we have explicitly discussed the main limitations of our proposed method as well as the directions for future research. We believe that an honest acknowledgment of current limitations and a clear outline of future work are essential components of a rigorous scientific contribution, and we have addressed both aspects in that section.

We hope this clarification is satisfactory, and we encourage the reviewer to refer to Section 5 for the full details. For specific revisions, please refer to lines 987 to 1000 of the revised manuscript.

Comment 15: Please check that the references are in the specified format.

Response: Thank you for this reminder. We have carefully reviewed all references in the manuscript. During this process, we identified that a considerable number of citations were previously listed in arXiv preprint format. We have updated all such references to reflect their final published versions, including the correct journal/conference names, volume numbers, page numbers, and publication years, in accordance with the required citation format.

We hope the reference list now meets the formatting requirements of the journal.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revised manuscript shows a clear and substantial improvement compared to the previous version. The authors have addressed most of the reviewer’s comments in a constructive manner, and several important aspects of the paper have been significantly strengthened.

In particular, the experimental section has been improved by adding more detailed information about the training procedure, hyperparameters, and evaluation protocol, which enhances transparency and reproducibility. The authors have also clarified the fairness of the comparative study and explicitly stated that the reported results were obtained under consistent experimental conditions. The explanation of Fig. 1 has been expanded, making the overall architecture and data flow much clearer. Furthermore, the mathematical formulations have been revised: the notation is more consistent, key operations are now better defined, and the LDConv formulation correctly incorporates the learned offsets. The presentation of the heatmap analysis has also been improved, including clearer captions and a more precise explanation of the Grad-CAM visualization.

However, some issues are not fully resolved. (1) The explanation of the FPS improvement associated with the DAIFI module is more detailed than before, but it still relies on indirect evidence rather than on a thorough profiling-based analysis, which limits the strength of the claim. (2) The DOTA generalization experiment has been extended with additional comparisons and discussion, but it remains relatively limited in scope: the evaluation protocol is still not fully detailed, and the analysis does not yet provide a strong basis for broad generalization claims.

A more rigorous profiling-based analysis for point (1) is not realistically feasible at this stage of the revision. Therefore, it would be sufficient to explicitly state that the explanation is a hypothesis supported by indirect evidence rather than a definitive analysis, and to add a sentence such as: “A detailed kernel-level profiling is left for future work.”

The main issue in point (2) is the mismatch between the strength of the generalization claim and the relatively limited supporting evidence. What can realistically be achieved at this stage is the following:

The claim should be softened. Instead of “demonstrates strong generalization ability,” it should be phrased along the lines of “provides preliminary evidence of cross-dataset generalization.”
It should be clarified what the experiment actually represents, namely, a cross-dataset validation rather than a full benchmark, that it uses horizontal bounding boxes (if applicable), and that no dataset-specific optimization for DOTA is applied.
A few sentences outlining the limitations should be added, such as the absence of per-class analysis, rotated-box evaluation, and tiling strategies.
Optionally, the authors may include per-class AP for a few key categories or provide a brief discussion of where the model achieves the most significant gains or losses.

In addition, although the manuscript has been revised to reduce overly promotional language, some statements are still stronger than what is fully supported by the presented evidence, for example:

“This result strongly demonstrates…”
“compelling evidence…”
“optimal balance between accuracy and speed”
“highly suitable for time-critical tasks…”

Simply removing or replacing the strongest qualifiers would be sufficient to tone down these claims.

Finally, a few minor issues related to technical precision and proofreading are still present. For example:

Ln. 779 – “This provide strong evidence the synergistic advantages …” should be corrected to “This provides strong evidence of …” (grammatical error).
Ln. 328 – “where, ” should be corrected to “where ” (punctuation issue).
Ln. 463 – “where[⋅,⋅]denotes” contains missing spaces (typesetting issue).

These issues are minor and do not require substantial revision, but only careful final proofreading and minor corrections.

Overall, the manuscript is significantly improved and is close to being ready for publication. The remaining issues are mostly related to the strength of certain claims, completeness of experimental validation, and final polishing of the presentation.

Author Response

Comment 1: (1) The explanation of the FPS improvement associated with the DAIFI module is more detailed than before, but it still relies on indirect evidence rather than on a thorough profiling-based analysis, which limits the strength of the claim. A more rigorous profiling-based analysis for point (1) is not realistically feasible at this stage of the revision. Therefore, it would be sufficient to explicitly state that the explanation is a hypothesis supported by indirect evidence rather than a definitive analysis, and to add a sentence such as: “A detailed kernel-level profiling is left for future work.”

Response: We sincerely appreciate the reviewer’s insightful observation and highly pragmatic suggestion. We fully agree that without a thorough kernel-level profiling, our current explanation regarding the FPS improvement remains a hypothesis based on indirect observations, and the previous claims were indeed overly definitive. For specific changes, please refer to lines 711-746 of the revised draft.

We are also very grateful for your understanding regarding the feasibility of conducting a comprehensive hardware profiling at this stage of the revision. Following your valuable advice, we have carefully toned down the assertions in the revised manuscript. We explicitly clarified that the explanation serves as a "hypothesis supported by indirect evidence," and we have incorporated the exact sentence you recommended to acknowledge this limitation.

We believe these modifications accurately reflect the nature of our current evidence and make the scientific claims much more rigorous and objective. Thank you again for guiding us to improve the manuscript.
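For context, a coarse wall-clock FPS measurement with warm-up and repeated runs might look like the sketch below. This is not the authors' measurement protocol and it is not kernel-level profiling; `measure_fps` and `dummy_inference` are hypothetical names, and the dummy workload merely stands in for a model forward pass.

```python
import statistics
import time
from typing import Callable, Dict


def measure_fps(run_once: Callable[[], None], warmup: int = 10,
                iters: int = 50, repeats: int = 3) -> Dict[str, float]:
    """Wall-clock throughput of `run_once`, as mean and std over repeated runs."""
    for _ in range(warmup):              # warm-up iterations are excluded from timing
        run_once()
    fps_runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iters):
            run_once()
        # On a GPU, the device would need to be synchronized before reading the clock.
        elapsed = time.perf_counter() - start
        fps_runs.append(iters / elapsed)
    return {
        "mean_fps": statistics.mean(fps_runs),
        "std_fps": statistics.stdev(fps_runs) if repeats > 1 else 0.0,
    }


def dummy_inference() -> None:
    """Hypothetical stand-in for one forward pass of a detector."""
    sum(i * i for i in range(10_000))


print(measure_fps(dummy_inference))
```

Reporting the spread across repeats alongside the mean indicates whether a speed difference between two models exceeds run-to-run noise, but it cannot attribute the difference to memory access cost; that attribution would still require the kernel-level profiling deferred to future work.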

Comment 2: (2) The DOTA generalization experiment has been extended with additional comparisons and discussion, but it remains relatively limited in scope: the evaluation protocol is still not fully detailed, and the analysis does not yet provide a strong basis for broad generalization claims.

The claim should be softened. Instead of “demonstrates strong generalization ability,” it should be phrased along the lines of “provides preliminary evidence of cross-dataset generalization.”
It should be clarified what the experiment actually represents, namely, a cross-dataset validation rather than a full benchmark, that it uses horizontal bounding boxes (if applicable), and that no dataset-specific optimization for DOTA is applied.
A few sentences outlining the limitations should be added, such as the absence of per-class analysis, rotated-box evaluation, and tiling strategies.
Optionally, the authors may include per-class AP for a few key categories or provide a brief discussion of where the model achieves the most significant gains or losses.

Response: We sincerely thank the reviewer for pointing out the mismatch between our previous claims and the actual scope of the DOTA experiments. We fully agree with your rigorous assessment. We are also deeply grateful for your pragmatic and highly constructive suggestions on how to properly frame these results at this revision stage. The relevant changes can be found in Section 4.7.

Following your guidance, we have extensively revised the corresponding section to tone down the assertions, explicitly define the boundaries of the experiment, and provide an objective discussion of the model's gains, losses, and limitations.

Specifically, we have made the following revisions in the manuscript:

1. Softening the claims and clarifying the experimental scope (HBB & no optimizations):

"To provide preliminary evidence of the cross-dataset generalization capability of DOL-DETR, we conducted extended experiments on the DOTA dataset..."
"...It is important to clarify that this experiment represents a cross-dataset validation rather than a complete, comprehensive benchmark evaluation. In this experiment, we deliberately maintained the raw integrity of the DOTA dataset without applying any DOTA-specific optimizations. All images were directly resized to a standard 640 × 640 input resolution, and the evaluation was strictly conducted using horizontal bounding boxes (HBB)."

2. Acknowledging the limitations (class-wise analysis, OBB, and tiling):

"...However, we must explicitly acknowledge the limitations of this validation. The current scope lacks a detailed class-wise analysis, which restricts a fine-grained understanding of the model's behavior across all 15 categories. Furthermore, the absence of rotated bounding box (OBB) evaluation and high-resolution image tiling (cropping) strategies inherently limits the absolute performance ceiling."

3. Briefly discussing the significant gains and losses:

"...Regarding specific performance variations, our observations suggest that the model achieves the most significant gains in detecting relatively isolated small objects (e.g., standard vehicles) due to the adaptive multi-scale fusion mechanism. Conversely, the model experiences performance bottlenecks (losses) in scenes with extremely dense and arbitrarily rotated clusters (e.g., tightly packed ships at ports), where the reliance on HBBs inevitably causes severe overlap ambiguity."

4. Adjusting the conclusion of this subsection:

"In conclusion, while the scope of this DOTA experiment is relatively limited, the findings provide preliminary evidence that DOL-DETR is not overfitted to the VisDrone scenario. A more exhaustive benchmark evaluation—incorporating OBB metrics, image tiling strategies, and detailed class-wise analysis—is left for future work to fully establish its generalized capabilities."

We believe these modifications accurately reflect the nature of the validation and make our scientific claims much more rigorous, objective, and proportional to the provided evidence. Thank you again for your invaluable guidance in improving the quality of our manuscript.
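The HBB protocol described above can be sketched in a few lines. DOTA annotates each object as a four-point polygon (x1, y1, ..., x4, y4), so a horizontal box is the axis-aligned box enclosing those points. The helper names are hypothetical, and the rescaling assumes a direct resize to 640 × 640 that does not preserve aspect ratio, which is one reading of "directly resized" rather than a detail confirmed in the response.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]


def quad_to_hbb(quad: Sequence[float]) -> Box:
    """Axis-aligned box (xmin, ymin, xmax, ymax) enclosing a 4-point polygon."""
    if len(quad) != 8:
        raise ValueError("expected 8 coordinates: x1, y1, ..., x4, y4")
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def rescale_box(box: Box, src_size: Tuple[int, int],
                dst_size: Tuple[int, int] = (640, 640)) -> Box:
    """Map a box from an image of src_size=(width, height) to dst_size after a direct resize."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    xmin, ymin, xmax, ymax = box
    return xmin * sx, ymin * sy, xmax * sx, ymax * sy


# Hypothetical rotated object in a 1280 x 1280 image.
quad = [100, 200, 300, 180, 320, 380, 120, 400]
hbb = quad_to_hbb(quad)
print(hbb, rescale_box(hbb, (1280, 1280)))
```

The enclosing box is always at least as large as the rotated object, which is why tightly packed, arbitrarily oriented targets produce heavily overlapping HBBs, as the response notes for ships at ports.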

Comment 3: In addition, although the manuscript has been revised to reduce overly promotional language, some statements are still stronger than what is fully supported by the presented evidence, for example:

“This result strongly demonstrates…”
“compelling evidence…”
“optimal balance between accuracy and speed”
“highly suitable for time-critical tasks…”

Simply removing or replacing the strongest qualifiers would be sufficient to tone down these claims.

Response: We sincerely appreciate the reviewer’s careful reading and constructive feedback regarding the tone of our manuscript. We fully agree that these statements were overly promotional and stronger than what the current evidence can fully support.

Following your excellent suggestion, we have conducted a thorough and careful review of the entire manuscript. We have systematically removed or replaced overly strong qualifiers—such as "compelling," "strongly," "optimal," and "highly"—with more objective and restrained academic language.

We believe these modifications have successfully toned down the claims, making the manuscript much more rigorous, objective, and proportional to the presented experimental results. We are very grateful for your guidance in helping us improve the professional quality of our writing.

Comment 4: Finally, a few minor issues related to technical precision and proofreading are still present. For example:

Ln. 779 – “This provide strong evidence the synergistic advantages …” should be corrected to “This provides strong evidence of …” (grammatical error).
Ln. 328 – “where, ” should be corrected to “where ” (punctuation issue).
Ln. 463 – “where[⋅,⋅]denotes” contains missing spaces (typesetting issue).

These issues are minor and do not require substantial revision, but only careful final proofreading and minor corrections.

Response: We are deeply grateful to the reviewer for their meticulous reading and for pointing out these grammatical and typographical errors. We sincerely appreciate your patience and rigorous attention to detail.

We have carefully corrected all the specific issues you highlighted. Furthermore, we have conducted a comprehensive final proofreading of the entire manuscript to ensure technical precision, proper punctuation, and correct typesetting throughout.

Specifically, the corrections for your examples have been made as follows:

For Ln. 779, we have corrected the grammatical error by changing "provide" to "provides" and adding "of". (Note: In accordance with your previous comment regarding overly promotional language, we also removed the word "strong".) The sentence now correctly reads: "This provides evidence of the synergistic advantages..."
For Ln. 328, we have removed the unnecessary comma, correcting it to "where " before introducing the variables.
For Ln. 463, we have fixed the typesetting issue by adding the proper spaces, so it now reads: "where [⋅,⋅] denotes".

Thank you once again for your invaluable guidance and for helping us refine our manuscript to a much higher standard of publication quality.

Reviewer 2 Report

Comments and Suggestions for Authors

All the reviewer’s comments and suggestions have been taken into account in the revised version of the manuscript. The paper has been improved after revision. Therefore, the reviewer considers that the manuscript can be recommended for publication.

Author Response

Comment 1: All the reviewer’s comments and suggestions have been taken into account in the revised version of the manuscript. The paper has been improved after revision. Therefore, the reviewer considers that the manuscript can be recommended for publication.

Response: We are very grateful for the reviewer’s positive feedback and the recommendation for publication. We would like to express our sincere appreciation for your professionalism and patience throughout the peer-review process. Your insightful comments and constructive suggestions have been invaluable in significantly improving the quality and clarity of our manuscript.