Author Contributions
Conceptualization, Y.T.; methodology, Y.T. and Y.D.; software, B.Z.; validation, B.Z. and D.L.; formal analysis, C.Z.; investigation, B.Z.; resources, B.M.; data curation, B.Z. and D.L.; writing—original draft preparation, Y.T.; writing—review and editing, Y.D.; visualization, C.Z.; supervision, Y.D.; project administration, B.M. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Example images and object size distribution of three representative UAV aerial datasets. (a) The VisDrone-2019 dataset. (b) The UAVDT dataset. (c) The DOTA1.0 dataset.
Figure 2.
Overall architecture of EAGLE-DET. The framework consists of three core modules: a CMENet backbone network for edge-aware feature extraction, AMFFN for attention-guided multi-scale feature fusion, and EUCBSC for enhanced upsampling with channel-spatial coordination. The modules work collaboratively to address the three-stage degradation problem of small object features.
Figure 3.
Architecture of the CSPMEE module in the CMENet backbone network. The CSPMEE module adopts a branch-fusion feature-processing paradigm, consisting of bypass and enhancement branches. The enhancement branch contains MREAM units that construct multi-scale feature pyramids through parallel adaptive pooling branches with different scales. Each branch incorporates EdgeEnhance for adaptive edge enhancement based on high-frequency and low-frequency decomposition.
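The high-frequency/low-frequency decomposition named in this caption can be illustrated with a minimal sketch. It assumes the AvgPool-based variant listed in Table 3: average pooling acts as a low-pass filter, the residual is treated as the edge (high-frequency) component, and that residual is added back to the input. The function names, the 3x3 kernel, the replicate padding, and the fixed `alpha` weight are illustrative assumptions; the actual EdgeEnhance block presumably operates on feature maps with learned weights.

```python
def avg_pool_3x3(img):
    """3x3 average pooling, stride 1, with edge-replicated padding (assumed settings)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)  # clamp row index at the border
                    jj = min(max(j + dj, 0), w - 1)  # clamp column index at the border
                    total += img[ii][jj]
            out[i][j] = total / 9.0
    return out


def edge_enhance(img, alpha=1.0):
    """Add the high-frequency residual (input minus low-pass) back to the input.

    alpha is a hypothetical fixed weight standing in for a learned one.
    """
    low = avg_pool_3x3(img)
    return [
        [x + alpha * (x - l) for x, l in zip(row, low_row)]
        for row, low_row in zip(img, low)
    ]


flat = [[2.0] * 4 for _ in range(4)]             # no edges: output equals input
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]  # vertical step edge
flat_out = edge_enhance(flat)
step_out = edge_enhance(step)
```

On the flat input the residual is zero, so nothing changes; on the step input the values on either side of the edge are pushed apart, which is the sharpening effect the decomposition is meant to provide.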
Figure 4.
Architecture of the AMFFN feature fusion network. The network adopts a hierarchical multi-scale feature interaction architecture through a cascaded combination of SDEC and HSAT. (a) The HSAT module. (b) Detailed structure of the PSAttn mechanism. (c) The SDEC module.
Figure 5.
Architecture of EUCBSC. (a) EUCBSC module; (b) BSCR sub-module.
Figure 6.
Module ablation experimental results on the VisDrone-2019 validation set. In the configuration labels, C, A, and E represent CMENet, AMFFN, and EUCBSC, respectively. (a) Detection performance comparison across different module configurations. (b) Computational efficiency comparison.
Figure 7.
Comparison of backbone and neck network architectures. (a) Backbone architecture comparison. (b) Neck architecture comparison.
Figure 8.
Comparison with state-of-the-art methods on the VisDrone-2019 test set. The scatter plot shows the trade-off between detection performance and model parameters. Circle size represents FPS (frames per second).
Figure 9.
Visualization of detection results on the VisDrone-2019 test set. Comparison between Ground Truth, Baseline (RT-DETR-R18), FBRT-YOLO-m, DAB-DETR, and EAGLE-DET across various challenging scenarios: (a) Enhanced small object identification capability in urban intersection scenes. (b) Accurate detection and category discrimination for ultra-small objects. (c) Multi-scale object detection in complex scenes with both nearby large vehicles and distant small motorcycles. (d) Fine-texture perception for ultra-small pixel objects, correcting baseline misdetections. (e) Dense object scene processing with precise boundary distinction. (f) Robust detection under extreme lighting conditions (night low-light). (g) Accurate identification of occluded objects while suppressing false detections. Red boxes indicate missed detections or misclassifications by the baseline or other methods that are correctly handled by EAGLE-DET.
Figure 10.
Feature activation heatmaps under different scenarios comparing Ground Truth, Baseline, CMENet + AMFFN, and EAGLE-DET. The visualization covers various challenging aerial scenarios, including dense parking lots, oblique viewing angles, multi-scale intersections, and night low-light environments.
Figure 11.
Visualization of detection results on the UAVDT dataset. Comparison between Ground Truth, Baseline (RT-DETR), and EAGLE-DET across various traffic scenarios including highways, intersections, and urban roads. EAGLE-DET demonstrates superior detection capability for small and distant vehicles under different viewing angles and lighting conditions. Red boxes highlight improved detections compared to baseline.
Figure 12.
Visualization of detection results on the DOTA1.0 dataset. Comparison shows EAGLE-DET’s performance on various object categories in remote sensing imagery including storage tanks, ships, harbors, and aircraft. Red boxes highlight improved detections compared to baseline.
Table 1.
Experimental system environment.
| Item | Model/Parameters |
|---|---|
| Operating System | Ubuntu 22.04 LTS |
| Programming Language | Python 3.10 |
| CPU | Intel Core(TM) i7-13700KF (Intel Corporation, Santa Clara, CA, USA) |
| Graphics card | NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) |
| GPU Memory | 24 GB |
Table 2.
Multi-scale pooling kernel configuration comparison experiments.
| Config | Type | Pooling Kernel Config | AP50 | APs | APm |
|---|---|---|---|---|---|
| A | Single-scale | | 45.8 | 18.1 | 35.2 |
| B | Dual-scale | | 46.0 | 19.2 | 36.4 |
| C | Triple-scale | | 46.0 | 19.8 | 36.1 |
| D | Four-scale | | 46.2 | 18.5 | 35.8 |
| E | Four-scale | | 46.3 | 19.1 | 36.2 |
| F | Four-scale | | 46.3 | 19.3 | 36.6 |
| G | Four-scale | | 47.6 | 18.9 | 36.0 |
| H | Four-scale Progressive (Ours) | | 47.2 | 20.1 | 38.0 |
| I | Five-scale | | 46.2 | 19.2 | 36.8 |
Table 3.
Edge enhancement method comparison experiments (%).
| Edge Enhancement Method | AP50 | APs | APm | APl |
|---|---|---|---|---|
| MREAM (w/o EdgeEnhancer) | 45.9 | 18.9 | 36.9 | 39.5 |
| Sobel operator | 46.0 | 19.3 | 37.6 | 40.2 |
| Laplacian operator | 45.8 | 19.6 | 37.6 | 40.1 |
| AvgPool-based (Ours) | 47.2 | 20.1 | 38.0 | 40.7 |
Table 4.
AMFFN module component ablation experiments (%).
| Config | SDEC | HSAT | AP50 | APs | APm |
|---|---|---|---|---|---|
| Baseline | | | 44.3 | 18.6 | 36.4 |
| Only SDEC | ✔ | | 45.6 | 19.3 | 37.6 |
| Only HSAT | | ✔ | 45.2 | 19.0 | 36.9 |
| AMFFN (Ours) | ✔ | ✔ | 46.8 | 19.5 | 38.5 |
Table 5.
EAGLE-DET algorithm module performance analysis on VisDrone-2019 validation set (%).
| Model | C | A | E | GFLOPs | Params (M) | AP50 | AP50:95 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. baseline | | | | 58.0 | 19.9 | 44.3 | 26.5 | 18.6 | 36.4 | 40.3 |
| 2 | ✔ | | | 48.4 | 14.4 | 47.2 | 27.8 | 20.1 | 38.0 | 40.7 |
| 3 | | ✔ | | 58.9 | 20.2 | 46.8 | 27.9 | 19.5 | 38.5 | 41.3 |
| 4 | | | ✔ | 57.6 | 20.6 | 46.1 | 26.9 | 19.7 | 37.6 | 40.9 |
| 5 | ✔ | ✔ | | 61.2 | 17.5 | 48.2 | 29.0 | 20.9 | 39.4 | 41.8 |
| 6 | ✔ | | ✔ | 52.3 | 14.8 | 47.9 | 28.2 | 20.3 | 39.6 | 42.1 |
| 7 | | ✔ | ✔ | 62.4 | 21.1 | 47.6 | 28.8 | 19.9 | 37.7 | 41.5 |
| 8. Ours | ✔ | ✔ | ✔ | 65.7 | 18.4 | 49.5 ± 0.2 | 29.8 ± 0.1 | 21.8 ± 0.2 | 40.1 ± 0.2 | 42.5 ± 0.3 |
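The net effect of the full model relative to the baseline can be read off rows 1 and 8 with simple arithmetic. The sketch below is illustrative only: it copies the mean values from those two rows (the ± spreads are omitted), and the helper names `absolute_gain` and `relative_change` are hypothetical.

```python
# Rows 1 (baseline) and 8 (full model) of Table 5; the ± spreads are omitted.
baseline = {"GFLOPs": 58.0, "Params (M)": 19.9, "AP50": 44.3,
            "AP50:95": 26.5, "APs": 18.6, "APm": 36.4, "APl": 40.3}
eagle_det = {"GFLOPs": 65.7, "Params (M)": 18.4, "AP50": 49.5,
             "AP50:95": 29.8, "APs": 21.8, "APm": 40.1, "APl": 42.5}


def absolute_gain(metric):
    """Difference in percentage points between the full model and the baseline."""
    return round(eagle_det[metric] - baseline[metric], 1)


def relative_change(metric):
    """Relative change in percent with respect to the baseline."""
    delta = eagle_det[metric] - baseline[metric]
    return round(100.0 * delta / baseline[metric], 1)


for metric in ("AP50", "AP50:95", "APs", "APm", "APl"):
    print(f"{metric}: +{absolute_gain(metric)} points")
for metric in ("GFLOPs", "Params (M)"):
    print(f"{metric}: {relative_change(metric):+.1f}%")
```

With the Table 5 values this reports gains of 5.2 points in AP50 and 3.2 points in APs, with about 7.5% fewer parameters and about 13.3% more GFLOPs than the baseline.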
Table 6.
Backbone network architecture comparison experiments on VisDrone-2019 validation set.
| Backbone | GFLOPs | Params (M) | AP50 | AP50:95 | APs | APm | APl |
|---|---|---|---|---|---|---|---|
| ResNet18 (baseline) | 58.0 | 19.9 | 44.3 | 26.5 | 18.6 | 36.4 | 40.3 |
| Swin-T [6] | 98.4 | 36.6 | 41.7 | 24.6 | 16.6 | 35.0 | 35.5 |
| ConvNeXt-T [7] | 33.3 | 12.6 | 38.2 | 22.2 | 14.3 | 32.0 | 37.7 |
| RepViT [8] | 38.3 | 13.6 | 40.9 | 24.3 | 16.5 | 34.4 | 40.2 |
| MambaOut [37] | 41.9 | 15.9 | 41.5 | 24.5 | 16.9 | 34.3 | 34.9 |
| CMENet (Ours) | 48.4 | 14.4 | 47.2 | 27.8 | 20.1 | 38.0 | 40.7 |
Table 7.
Neck architecture comparison experiments on VisDrone-2019 validation set.
| Method | GFLOPs | Params (M) | AP50 | AP50:95 | APs | APm | APl |
|---|---|---|---|---|---|---|---|
| CCFF (baseline) | 58.0 | 19.9 | 44.3 | 26.5 | 18.6 | 36.4 | 40.3 |
| BiFPN [9] | 64.3 | 20.3 | 45.8 | 27.4 | 19.5 | 37.6 | 43.1 |
| HSFPN [38] | 58.1 | 20.7 | 43.5 | 25.8 | 18.1 | 35.5 | 38.9 |
| RFPN [39] | 56.4 | 19.6 | 43.6 | 26.0 | 18.3 | 35.4 | 39.4 |
| HyperACE [40] | 60.4 | 21.9 | 43.5 | 26.1 | 17.9 | 36.5 | 43.4 |
| DPCF [41] | 57.0 | 19.9 | 45.2 | 27.1 | 19.3 | 37.0 | 40.3 |
| AMFFN (Ours) | 58.9 | 20.2 | 46.8 | 27.9 | 19.5 | 38.5 | 41.3 |
Table 8.
Upsampling comparison experiments on VisDrone-2019 validation set.
| Method | AP50 | AP50:95 | APs | APm | APl |
|---|---|---|---|---|---|
| CARAFE [11] | 44.7 | 26.7 | 18.7 | 36.9 | 42.2 |
| DySample [12] | 44.5 | 26.5 | 18.3 | 36.7 | 42.1 |
| EUCB [42] | 44.6 | 26.5 | 19.0 | 35.8 | 38.1 |
| Converse2D [43] | 44.0 | 26.4 | 18.5 | 36.0 | 40.8 |
| EUCBSC (Ours) | 46.1 | 26.9 | 19.7 | 37.6 | 40.9 |
Table 9.
Performance of different methods on the VisDrone-2019 test set (%).
| Method | AP50 | AP50:95 | APs | APm | APl | GFLOPs | Params (M) | FPS |
|---|---|---|---|---|---|---|---|---|
| Two-stage Object Detector |
| Faster R-CNN [44] | 32.9 | 19.4 | 9.5 | 30.9 | 42.3 | 208.9 | 41.3 | 53.1 |
| Cascade R-CNN [45] | 32.6 | 19.7 | 9.9 | 30.9 | 40.6 | 236.6 | 62.3 | 44.3 |
| One-stage Object Detector |
| CNN-based |
| GFL [46] | 32.1 | 19.3 | 9.4 | 30.0 | 40.9 | 205.8 | 32.3 | 50.5 |
| ATSS [47] | 33.8 | 20.4 | 10.0 | 31.7 | 46.5 | 110.4 | 38.9 | 53.6 |
| TOOD [48] | 33.9 | 20.4 | 10.2 | 31.7 | 40.3 | 199.1 | 32.0 | 43.9 |
| RetinaNet [49] | 27.6 | 16.4 | 6.0 | 27.4 | 42.5 | 210.6 | 36.5 | 57.2 |
| YOLOv5m [50] | 28.8 | 15.2 | 7.3 | 23.3 | 30.6 | 48.0 | 20.9 | 113.5 |
| YOLOv8m [51] | 33.2 | 19.0 | 9.0 | 29.4 | 41.7 | 78.7 | 25.9 | 135.7 |
| YOLOv10m [52] | 34.5 | 19.5 | 9.7 | 30.0 | 41.4 | 58.9 | 15.3 | 105.8 |
| YOLOv11m [53] | 33.9 | 19.5 | 9.2 | 30.1 | 42.6 | 67.7 | 20.0 | 117.1 |
| YOLOv12m [54] | 33.6 | 19.2 | 9.4 | 29.8 | 38.6 | 67.2 | 20.1 | 106.4 |
| FBRT-YOLO-m [55] | 34.4 | 19.6 | 9.4 | 30.9 | 42.1 | 58.7 | 7.4 | 104.8 |
| Transformer-based |
| DETR [56] | 25.8 | 12.9 | 5.6 | 23.4 | 30.2 | 96.5 | 41.6 | 26.8 |
| Deformable-DETR [57] | 30.0 | 16.3 | 8.7 | 25.4 | 33.0 | 193.0 | 40.1 | 32.1 |
| Conditional-DETR [58] | 30.4 | 16.2 | 7.9 | 24.4 | 45.8 | 101.3 | 43.5 | 48.5 |
| DAB-DETR [59] | 37.2 | 19.2 | 10.7 | 29.3 | 48.3 | 102.9 | 44.0 | 50.9 |
| VRF-DETR [13] | 33.9 | 19.3 | 10.7 | 28.1 | 41.1 | 45.7 | 13.8 | 79.6 |
| RT-DETR-R34 [18] | 37.6 | 21.8 | 12.8 | 29.6 | 29.4 | 90.0 | 20.1 | 62.4 |
| RT-DETR-R50 [18] | 39.1 ± 0.2 | 22.5 ± 0.1 | 13.2 ± 0.1 | 32.6 ± 0.2 | 42.2 ± 0.2 | 129.6 | 42.0 | 67.7 |
| UAV-DETR-R18 [14] | 37.4 ± 0.3 | 20.3 ± 0.2 | 12.3 ± 0.2 | 30.4 ± 0.2 | 41.5 ± 0.3 | 73.9 | 21.6 | 64.3 |
| RT-DETR-R18 (Baseline) [18] | 35.2 ± 0.2 | 20.1 ± 0.2 | 11.4 ± 0.2 | 28.9 ± 0.3 | 36.1 ± 0.2 | 57.0 | 19.9 | 78.4 |
| EAGLE-DET (Ours) | 39.7 ± 0.2 | 23.0 ± 0.2 | 13.5 ± 0.2 | 33.5 ± 0.1 | 42.6 ± 0.3 | 65.7 | 18.4 | 71.7 |
Table 10.
Experimental results on the UAVDT and DOTA1.0 datasets.
| Dataset | Method | Params (M) | GFLOPs | AP50:95 | APs | APm |
|---|---|---|---|---|---|---|
| UAVDT | RT-DETR-R18 | 19.9 | 58.9 | 84.9 | 76.0 | 87.9 |
| UAVDT | EAGLE-DET | 18.3 | 65.4 | 87.1 | 79.2 | 89.8 |
| DOTA1.0 | RT-DETR-R18 | 19.9 | 58.0 | 49.1 | 25.8 | 53.0 |
| DOTA1.0 | EAGLE-DET | 18.3 | 65.5 | 50.6 | 27.7 | 55.1 |
Table 11.
Per-category results on the DOTA1.0 dataset.
| Class | RT-DETR-R18 AP50 | RT-DETR-R18 AP50:95 | EAGLE-DET AP50 | EAGLE-DET AP50:95 |
|---|---|---|---|---|
| small_vehicle | 67.8 | 42.1 | 69.2 | 43.1 |
| large_vehicle | 85.3 | 76.1 | 87.3 | 73.8 |
| plane | 91.9 | 69.4 | 92.5 | 69.9 |
| storage_tank | 74.1 | 45.4 | 74.3 | 45.8 |
| ship | 87.1 | 62.2 | 86.7 | 62.1 |
| harbor | 82.6 | 47.8 | 83.4 | 49.0 |
| ground_track_field | 64.0 | 41.5 | 64.7 | 44.3 |
| soccer_ball_field | 60.9 | 43.9 | 56.8 | 41.1 |
| tennis_court | 92.5 | 84.3 | 91.9 | 83.6 |
| swimming_pool | 60.0 | 25.9 | 62.9 | 28.3 |
| baseball_diamond | 74.9 | 44.8 | 72.2 | 44.0 |
| roundabout | 55.7 | 31.6 | 53.0 | 28.1 |
| basketball_court | 59.5 | 47.4 | 45.3 | 35.6 |
| bridge | 49.1 | 23.3 | 52.6 | 25.9 |
| helicopter | 53.8 | 32.4 | 69.9 | 46.1 |