AUP-DETR: A Foundational UAV Object Detection Framework for Enabling the Low-Altitude Economy
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The authors should conduct additional ablation experiments on the VisDrone dataset, since two datasets are adopted in the manuscript but ablation results are reported for only one. If the results on the two datasets differ, the authors should adjust the proposed approach or explain their design decisions and reasons.
- The authors should perform additional optimization experiments to determine the optimal number of adaptive residual blocks and the optimal number of attention heads in SAT on the VisDrone dataset. As with the comment above, the authors should report globally optimal numbers for both datasets or formulate a joint optimization strategy if the results on the two datasets differ.
- The authors should add a figure visually comparing detection results between the baseline and AUP-DETR on the UCA-Det dataset in Section 4.6.1 (The Results on UCA-Det).
- The authors should use the same set of models to compare overall performance on both the UCA-Det and VisDrone datasets; that is, the models in Table 5 and Table 6 should be identical.
- The authors should provide confusion matrices for all models on both the UCA-Det and VisDrone datasets.
- The authors should unify the problem statements and application scenes throughout the manuscript.
(1) For example, the issues or scenes raised in the following four places are not entirely consistent.
- Line 5: vast scale variations, dense small objects, and complex backgrounds.
- Lines 42-44: interplay of sea, land, and air spaces, vast scale variations, and high target density.
- Lines 54-55: mixed land-sea features, heterogeneous target coexistence, and complex interactions.
- Lines 69-70: land-sea mixed scenes, extreme scale variations, and dense object distributions.
(2) The three issues mentioned in lines 69-70 (land-sea mixed scenes, extreme scale variations, and dense object distributions) are recommended for use throughout the manuscript, since scale variations already cover small objects, and complex backgrounds are not detailed in the other three places above.
(3) The authors need to show results under all three scenes (land-sea mixed scenes, extreme scale variations, and dense object distributions) in Figure 7, Figure 8, Figure 9, and the figure to be added in Section 4.6.1 (The Results on UCA-Det).
- The authors should redraw Figure 5.
- The authors should further divide the above three scenes into corresponding subscenes, as follows: two subscenes (Sea mainly, Land mainly) to represent land-sea mixed scenes; three subscenes (Small mainly, Medium mainly, Large mainly) to represent extreme scale variations; and two subscenes (Dense distribution, Sparse distribution) to represent dense object distributions.
- First, the authors should substitute a new figure for the original Figure 5, displaying a total of 12 images in four rows and three columns: the first column containing Small mainly images, the second column Medium mainly images, and the third column Large mainly images; the first row containing Sea mainly and Dense distribution images, the second row Sea mainly and Sparse distribution images, the third row Land mainly and Dense distribution images, and the fourth row Land mainly and Sparse distribution images. Then, the authors should place the labels Small mainly, Medium mainly, and Large mainly from left to right along the top of the new figure, and the labels Sea mainly and Dense distribution, Sea mainly and Sparse distribution, Land mainly and Dense distribution, and Land mainly and Sparse distribution from top to bottom along the left side.
- 2.1. Gneral-Purpose Object Detection -> 2.1. General-Purpose Object Detection
- The authors need to unify the numbers of input and output components for both the Fusion-SHC module and SSCF. For example, the Fusion-SHC module has 3 input components and 2 output components in Figure 1, but 3 input components and 1 output component in Figure 2. Similarly, SSCF appears to have 2 inputs and 1 output in Figure 1, but 3 inputs and 1 output in Figure 3.
- The authors should use the label Y4 for the same component in both Figure 1 and Figure 2. In addition, the authors should add the label P3 to Figure 2.
- The authors should use the same annotation labels for the same inputs and outputs in both Figure 1 and Figure 3: P3, P4 (, P5) vs. X1, X2, X3?
- The authors should place Figure 3(A) on top, swap the order of (B) and (C), and then place Figure 3(B) at the bottom left and Figure 3(C) at the bottom right.
- The authors should use the same term for the same component. For example, Figure 4 contains three similar labels: Spatial-Aware FFN vs. SpatialAware FFN vs. Spatial FFN.
- The authors should provide the full name the first time each abbreviation is used, in both the abstract and the main text, and use only the abbreviation thereafter. For example, since SAT is introduced for Spatial Agent Transformer in line 80, Spatial Agent Transformer should be replaced by SAT in Figure 1. The authors should check such usages throughout the manuscript.
- The authors should replace Figure 7 with a vector figure, since most of the text in Figure 7 cannot be easily recognized. It is also recommended to use vector figures for the other figures in the manuscript. In addition, the authors should reduce the size of the Baseline and AUP-DETR labels in Figure 7.
Comments on the Quality of English Language
The English of the manuscript could be further polished.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for the submission. The introduction provides a strong contextual basis linking UAV perception challenges to the emerging "low-altitude economy" concept. The literature review is comprehensive and up to date, covering the evolution of DETR and YOLO and identifying the research gap. The architecture is well illustrated and logically structured, highlighting clear modular innovation (Fusion-SHC, SSCF, SAT). The UCA-Det dataset is a significant contribution: its description includes resolution, density, and class distribution metrics, ensuring reproducibility. The mathematical formulations of the fusion and attention mechanisms are precise and consistent. The ablation studies (Table 2) effectively validate the impact of each component. Recommendations for future publications:
- Clarify dataset accessibility (currently "available on request"): consider partial open access for transparency and benchmarking.
- Include error bars or variance indicators for key performance metrics (mAP, P, R) to quantify consistency, e.g., mean and standard deviation over several training seeds, as in the sketch below.
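A minimal sketch of what such a variance indicator could look like; the mAP values are hypothetical and only illustrate the reporting format:

```python
import numpy as np

# mAP from three training runs with different random seeds (hypothetical values)
map_runs = np.array([0.412, 0.405, 0.418])

# Report as mean ± sample standard deviation, e.g., in the results tables
print(f"mAP = {map_runs.mean():.3f} ± {map_runs.std(ddof=1):.3f}")
```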
Overall, this is a well-structured contribution.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
General Comments
The paper tackles object detection in UAV imagery, with a specific focus on scenes near the urban-sea border (port interfaces). The authors argue that this domain is under-served by current datasets and methods, and present (i) a new dataset, UCA-Det, and (ii) a DETR-based model, AUP-DETR, claimed to be effective and lightweight for such scenes. They also argue that the widely used VisDrone benchmark does not adequately reflect shoreline challenges. The topic is timely given the growth of UAV and remote-sensing applications. However, the motivation and detection objectives are presented somewhat vaguely; the goals should be defined more precisely. The paper correctly lists common UAV challenges (scale variation, motion blur, clutter, rotations), but the specific need for an urban-coastal dataset/method remains under-substantiated; this requires cross-domain tests and shoreline-specific error analyses. Presentation quality needs work: the figures, especially the architecture diagrams, are low-quality JPEGs with heavy compression, which is particularly problematic in a computer-vision paper; they should be vector graphics or high-resolution PNGs. The experimental section contains notable inconsistencies (e.g., Tables 5 vs. 6 report conflicting Params/FLOPs for the same models). While the empirical results are relatively convincing, the Conclusions are brief, and the final sentence claims a "release" of UCA-Det without corresponding public links; for the dataset to function as a benchmark, and for any method to serve as a new baseline, both data and code must be publicly available. As written, the data availability statements are inconsistent.
Detailed comments
Title / Abstract / Introduction
- Acronyms at first mention. Expand all non-standard acronyms once (e.g., Urban Coastal Aerial Detection (UCA-Det), Occlusion-Guided Multi-task Network (OGMN)). Borderline terms like DINO should also be expanded once (DETR with Improved DeNoising Anchor Boxes) to avoid confusion.
- Domain motivation is light. The paper lists general UAV challenges beyond scale, which is fair, but does not demonstrate why the urban-sea border specifically requires a dedicated dataset/method. Please (i) add cross-domain generalization tests (e.g., train on urban/open-sea -> test on urban-coastal and vice versa), (ii) provide shoreline-aware analyses (e.g., AP vs. distance to shoreline, false positives induced by glare/whitecaps), and (iii) clarify the detection objectives/use-cases for the port interface.
- VisDrone references. Since you compare on VisDrone, cite the full protocol and list the 10 categories, and mention occlusion/truncation/ignored regions and the official toolkit.
Section 3 (Method)
- Figure 1 (and other diagrams): quality and readability. Replace low-quality JPEG rasters with vector (PDF/SVG) or high-resolution PNG files. Align legend colors with overlays, use thicker strokes, make captions self-contained, and add zoom insets where needed.
- SAT only at S5. You apply SAT solely on the highest-level feature (S5). Please justify this design and consider/report variants with mid-level or cross-scale usage.
- Fusion-SHC alone degrades mAP. The ablations show that adding Fusion-SHC alone reduces mAP. Provide learning/PR curves and discuss the interactions/stability effects that explain this behavior.
- “Lightweight” vs higher cost. FLOPs/params increase (e.g., ~74.2→85.3 G). Report per-module cost and FPS/latency on edge GPUs to justify the efficiency claim.
- Scale alignment & aliasing. Specify the upsampling method (nearest/bilinear?) and any anti-aliasing used when aligning scales—this is critical for tiny targets.
- “Spatial vs frequency domain” rationale. The claim that avoiding frequency-domain transforms saves cost is unconvincing in the context of modern detectors; please clarify or remove.
- Orientation vs. AABB. You acknowledge rotation challenges but evaluate with axis-aligned bounding boxes (AABB). Explain why oriented bounding boxes (OBB) are not used, and report errors vs. object angle.
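On the scale-alignment point above, a minimal PyTorch sketch of the distinction the authors should make explicit; the tensor shapes are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 256, 20, 20)   # illustrative coarse (e.g., S5-level) feature map
y = torch.randn(1, 256, 80, 80)   # illustrative fine feature map

# Upsampling to match a finer pyramid level: nearest is cheapest but blocky;
# bilinear is smoother and usually safer for tiny targets.
up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# Downsampling a fine level to a coarse one: antialias=True applies a low-pass
# filter first, reducing aliasing that can erase small objects entirely.
down = F.interpolate(y, scale_factor=0.5, mode="bilinear",
                     align_corners=False, antialias=True)
```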
Section 4.1 (Dataset: UCA-Det)
- Purpose & object repertoire. Clearly define the intended use-cases at urban-coastal ports and justify the choice of four classes.
- Viewpoint geometry (nadir vs oblique). State explicitly whether nadir views are included; provide distributions of pitch/roll/yaw, altitude, and FOV. If nadir is excluded, justify why; also report AP by viewpoint (nadir vs oblique).
- Class imbalance & tiny classes.Ship ≈84.6%, car ≈12%, people ≈2.8%, cycle ≈0.6%. Report macro-averaged metrics, per-class AP/PR, AP_S/M/L per class, confusion matrices, and exact counts per split. Clarify the minimum annotation size policy; discuss mitigation (class-balanced loss, oversampling, crop/tiling).
- “Crowdedness” claim. “06 objects/image” does not prove crowded scenes.
- COCO size bins: definitions & wording. You report small 35.97%, medium 50.84%, large 13.19%, but do not define "medium." Use COCO's original-image area thresholds (A<32²; 32²≤A<96²; A≥96²), or state your variant clearly (e.g., post-resize/mosaic); see the binning sketch after this list. Avoid implying that small alone is "predominant" (it is below 50%); if you mean small+medium, say so explicitly (35.97% + 50.84% ≈ 86.8%) and provide per-class size distributions and AP_S/M/L per class.
- Public availability (critical). The paper calls UCA-Det "public," yet the data availability statement says "upon request." To function as a benchmark, the dataset must be publicly released (link/DOI, license, versioning, official splits, baseline code). Otherwise, soften the language and clarify the access conditions.
- Sample/figure quality. Provide legible annotated samples consistent with the dataset's resolution; align legend colors; export vector/high-resolution figures.
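For the class-imbalance comment above, a sketch of one standard mitigation, class-balanced re-weighting via the effective number of samples (Cui et al., CVPR 2019); the absolute counts are hypothetical and chosen only to match the reported percentages:

```python
import numpy as np

# Hypothetical per-class instance counts matching the reported ratios
# (ship ~84.6%, car ~12%, people ~2.8%, cycle ~0.6%).
counts = np.array([84600, 12000, 2800, 600])

# Class-balanced weights: w_c ∝ (1 - beta) / (1 - beta^n_c)
beta = 0.999
weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
weights = weights / weights.sum() * len(counts)  # normalize to mean 1

print(weights)  # rare classes (people, cycle) receive larger loss weights
```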
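And for the size-bin comment, COCO's standard definition in code, with areas measured on the original image before any resize or mosaic:

```python
def coco_size_bin(width: float, height: float) -> str:
    """Classify a box by COCO's area thresholds on the original image."""
    area = width * height
    if area < 32 ** 2:
        return "small"      # A < 1024 px^2
    elif area < 96 ** 2:
        return "medium"     # 1024 <= A < 9216 px^2
    else:
        return "large"      # A >= 9216 px^2
```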
Sections 4.2–4.3 (Experimental setup / training)
- 2K+ -> 640×640 preprocessing. All images are ≥2K, but training uses 640×640 with Mosaic=1. Specify the exact pipeline: letterbox vs random-resized crop vs tiling, aspect-ratio handling, bbox remapping, inference size(s), and report resolution sensitivity (e.g., 640 vs 960/1280) and objects-per-640-crop. Consider tiling/high-res inference for tiny objects.
- Multi-scale features vs input. Clarify that “multi-scale” refers to feature-pyramid levels (S2–S5) (SSCF), not multi-scale input. Provide feature-map shapes/strides; disentangle input-size effects from architectural gains.
- Training-budget parity. Ensure identical input size, schedule, augmentations, and epochs for all baselines on UCA-Det and VisDrone; document configs.
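To make the preprocessing question concrete, a minimal sketch of one of the candidate pipelines (letterbox resize with box remapping), assuming xyxy pixel-coordinate boxes; whether the authors actually use this, random-resized crops, or tiling is exactly what should be stated:

```python
import numpy as np
import cv2

def letterbox(img: np.ndarray, boxes: np.ndarray, size: int = 640,
              pad_value: int = 114):
    """Resize keeping aspect ratio, pad to a square canvas, and remap
    (x1, y1, x2, y2) boxes into the new coordinate frame."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = round(w * scale), round(h * scale)
    resized = cv2.resize(img, (new_w, new_h))
    left, top = (size - new_w) // 2, (size - new_h) // 2
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    canvas[top:top + new_h, left:left + new_w] = resized
    boxes = boxes * scale + np.array([left, top, left, top])
    return canvas, boxes
```

Note that under this pipeline a 2K image shrinks by a factor of more than 3 per dimension, which is why the tiling alternative matters for tiny objects.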
Results (Tables 5–6, Figs. 6–7)
- Params/FLOPs inconsistencies. For the same model, the parameter count must be identical across datasets, and FLOPs must be nearly identical at the same input size. Tables 5 vs. 6 contradict each other (e.g., RT-DETR R18/R50 swapped, UAV-DETR R18/R50 swapped, YOLOv11-L at 7.6 GFLOPs is implausible). Regenerating both columns from a single profiling script, as sketched below, would prevent such divergence.
- Visual comparisons (Figs. 6-7). Add ground-truth overlays (or separate panels), mark TP/FP/FN, state the IoU threshold, and ensure identical confidence thresholds/NMS across methods. Provide zoomed insets for dense regions and export vector/high-resolution figures.
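A sketch of such a profiling script, using the thop profiler as one possible tool; the ResNet-18 stand-in and the 640×640 input are illustrative assumptions:

```python
import torch
import torchvision
from thop import profile  # pip install thop

# Stand-in model; substitute each detector being compared.
model = torchvision.models.resnet18().eval()
dummy = torch.randn(1, 3, 640, 640)  # same input size for every table

with torch.no_grad():
    macs, params = profile(model, inputs=(dummy,))

# Note: thop counts MACs; many papers report MACs as "FLOPs" (1 MAC ≈ 2 FLOPs),
# so the convention should be stated explicitly alongside the tables.
print(f"Params: {params / 1e6:.2f} M, GMACs: {macs / 1e9:.2f}")
```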
Conclusions
- Too brief; strengthen. Summarize concrete takeaways per factor (scale, blur, clutter) and acknowledge limitations (class imbalance, 2K -> 640 downscaling, dataset scope).
- Release claim must match practice. “By releasing the UCA-Det dataset…” implies a public release. Provide public links/DOIs for dataset and code (training/inference configs, FLOPs script). Otherwise, soften the claim and clarify access.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The figures are nice and big and easy to read, which is nice, but they have compression artifacts.
I do not agree that ports are more complex than urban streets. Can you defend this? I DO believe they have been less studied and may therefore be ripe for experimentation. In fact, I would imagine it would be easier to add things like visible fiducials to port objects, which would make it easier for vision systems to operate. Reference 13 refers to more difficult uncontrolled environments.
Section 2.1: "general" is misspelled.
However, the paper is well written, with minimal or no language problems.
Why isn't the UCA-Det dataset available publicly like VisDrone?
I need more explanation of the missed/false detections in Figure 6, especially the middle column.
Figure 7 is only partially convincing, although that is a common problem with object-detection examples. In particular, I am not convinced that the object in the middle column of the last row is not really a van.
What is the baseline in Figures 8 and 9?
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
I appreciate the authors' detailed responses and the updated manuscript. Most of the major comments have been addressed, and corresponding changes have been introduced in the paper. Some points were left "for a subsequent version," which is understandable given that several requests (e.g., dataset restructuring and additional analyses) would have required substantial effort that may not be feasible within the current timeline.
A particular concern remains the claim of a benchmark-grade dataset while the data are not publicly available. If public release is not possible at this stage, I strongly recommend softening any benchmark/release language across the paper (including the Contributions and Conclusions) to avoid over-claiming. Conversely, if the authors wish to retain the benchmark framing, they should provide public access (link/DOI, license, official splits) before publication.
Although a few recommended modifications were not implemented, the authors provided reasonably convincing justifications and inserted clarifications in the manuscript that acknowledge these limitations. Overall coherence has improved, and the work appears suitable for publication after minor corrections.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf

