STAIR-DETR: A Synergistic Transformer Integrating Statistical Attention and Multi-Scale Dynamics for UAV Small Object Detection
Abstract
1. Introduction
- (1) To replace the less effective AIFI module in the neck, we introduce the Statistical Feature Attention (SFA) mechanism, a token-level statistical self-attention module. By exploiting feature statistics such as mean and variance, SFA emphasizes salient local regions, suppresses background interference, and improves the model's sensitivity to small objects at negligible computational cost.
- (2) Within the backbone, the conventional Basic Block is redesigned into the Diverse Semantic Enhancement Block (DSEB). This module incorporates multi-branch pathways and dynamic convolutions to extract features under varied receptive fields, enriching semantic representation while retaining the spatial precision that complex UAV imagery requires.
- (3) A new resolution conversion strategy, the Adaptive Scale Transformation Operator (ASTO), is devised by unifying Context-Guided Downsampling (CGD) and DySample. Unlike conventional operators that treat down- and up-sampling separately, ASTO performs context-aware compression and content-adaptive reconstruction in a joint framework. This design preserves feature fidelity across scales and mitigates detail loss, particularly for fine-grained small object cues.
- (4) To improve the representation of extremely small targets, an extra P2 detection head is introduced in the Transformer decoder. It leverages high-resolution shallow features for accurate classification and regression of tiny targets.
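The benefit of a P2 level for tiny targets can be illustrated with simple stride arithmetic. The sketch below assumes the common feature-pyramid convention that level P_k has stride 2^k, a 640 × 640 input, and a 12-pixel-wide object; these values and the helper `level_stats` are illustrative assumptions, not settings taken from the paper.

```python
def level_stats(input_size: int, object_size: int, level: int) -> dict:
    """Grid size and object footprint at pyramid level P_level (stride 2**level)."""
    stride = 2 ** level
    grid = input_size // stride          # feature-map cells per side
    return {
        "stride": stride,
        "grid": grid,
        "tokens": grid * grid,           # number of spatial positions
        "object_cells": object_size / stride,  # object width measured in cells
    }

# Assumed 640x640 input and a 12-pixel-wide target (hypothetical values).
stats = {f"P{k}": level_stats(640, 12, k) for k in (2, 3, 4, 5)}

for name, s in stats.items():
    print(name, s)
```

Under these assumptions the object spans 3 cells at P2 but only 1.5 cells at P3 and less than one cell at P4 and P5, while P2 carries four times as many positions as P3. This is one reason a P2 head can help tiny objects and also one reason it adds computation.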
2. Related Work
2.1. Evolution of Object Detection Paradigms
2.2. Challenges and Strategies for Small Object Detection in UAV Perspectives
2.3. Improvements and Applications Based on the DETR Architecture
| Algorithm | References | Limitations | Environment Type | Convergence | Scalability |
|---|---|---|---|---|---|
| Faster R-CNN | [9], 2017 | Incurs high computational cost and slower inference than real-time detectors. | Suited to general detection that prioritizes accurate region proposals. | Improves via region proposal networks for stable, precise detection. | Robust in general scenarios but unsuitable for strict real-time or low-power deployments. |
| SSD | [20], 2016 | Weak performance on small objects and inconsistent accuracy across scales. | Optimized for general-purpose detection with an emphasis on single-shot efficiency. | Stabilizes training through multi-scale feature fusion. | Applicable to general scenarios yet constrained in small-object-dominant or highly complex environments. |
| YOLO | [18], 2016; [19], 2018; [33], 2025; [34], 2025 | Relies on CNN-based hierarchies and NMS post-processing; weaker long-range dependency modeling than DETR. | Primarily designed for general real-time tasks. | Benefits from mature optimization strategies and dense supervision. | Adapts well to a variety of hardware platforms but hits a bottleneck in densely occluded scenes. |
| DETR | [13], 2020; [15], 2021; [26], 2022; [27], 2022 | Slow convergence; poor adaptation to small or occluded objects. | Aimed at end-to-end detection with minimal post-processing. | Deformable variants accelerate convergence and lower computational cost. | Suitable for general detection, yet constrained in small-object and real-time applications. |
| RT-DETR | [16], 2023; [28], 2024; [29], 2025; [30], 2024; [31], 2022; [32], 2024; [35], 2024 | Shows limited performance on small or aerial targets; earlier versions may lag in real-time speed. | Developed to support real-time object detection across aerial, UAV, and remote-sensing applications. | Stabilizes accuracy through feature enhancement and variable receptive fields. | Delivers strong results on aerial and real-time tasks, yet capability declines in non-aerial or resource-constrained settings, and consistency across scales is difficult to ensure. |
3. Proposed Method
3.1. Overall Framework
3.2. Backbone Network Enhancement
The DSEB comprises five parallel branches:

- (1) Main Branch: A standard 3 × 3 convolution capturing core local features;
- (2) Multi-scale Branch: A 1 × 1 convolution enabling channel-wise interaction and fine-scale detail modeling;
- (3) Sequential Convolution Branch: A composite of 1 × 1 followed by 3 × 3 convolutions, expanding the effective receptive field;
- (4) Average Pooling Branch: Aggregates local statistical information, enhancing robustness to subtle geometric variations;
- (5) Identity Mapping Branch: Preserves information flow through residual connections.

The branches are then merged into a single convolution through the following steps:

- (1) Batch normalization parameters are absorbed into the preceding convolution weights and biases;
- (2) 1 × 1 kernels are zero-padded to 3 × 3 dimensions;
- (3) Average pooling is converted into a fixed-weight convolution kernel;
- (4) Identity mapping is represented by a 3 × 3 identity kernel;
- (5) All resulting kernels and biases are summed to produce a unified convolutional module.
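
The five merging steps above can be checked numerically. The following NumPy sketch is an illustration under stated assumptions rather than the authors' implementation: stride 1, equal input and output channels, zero padding, average pooling that counts padded zeros, batch normalization applied only to the main branch in inference mode, and the sequential 1 × 1 → 3 × 3 branch omitted for brevity. All function and variable names are hypothetical.

```python
import numpy as np

def conv2d_same(x, weight, bias):
    """Zero-padded, stride-1 convolution. x: (C_in, H, W); weight: (C_out, C_in, k, k)."""
    c_out, _, k, _ = weight.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) at every output position.
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out + bias[:, None, None]

def avgpool3x3(x):
    """Reference 3x3 average pooling, stride 1, zero padding (padded zeros are counted)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    h, w = x.shape[1:]
    return sum(xp[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Step 1: absorb inference-mode batch normalization into conv weight and bias."""
    scale = gamma / np.sqrt(var + eps)
    return weight * scale[:, None, None, None], beta + (bias - mean) * scale

def pad_1x1_to_3x3(weight):
    """Step 2: place the 1x1 kernel at the center of a zero 3x3 kernel."""
    return np.pad(weight, ((0, 0), (0, 0), (1, 1), (1, 1)))

def avgpool_kernel(channels):
    """Step 3: per-channel 3x3 kernel filled with 1/9."""
    w = np.zeros((channels, channels, 3, 3))
    idx = np.arange(channels)
    w[idx, idx] = 1.0 / 9.0
    return w

def identity_kernel(channels):
    """Step 4: per-channel 3x3 kernel with a single 1 at the center."""
    w = np.zeros((channels, channels, 3, 3))
    idx = np.arange(channels)
    w[idx, idx, 1, 1] = 1.0
    return w

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
x = rng.normal(size=(C, H, W))
w3, b3 = rng.normal(size=(C, C, 3, 3)), rng.normal(size=C)
w1, b1 = rng.normal(size=(C, C, 1, 1)), rng.normal(size=C)
gamma, beta = rng.normal(size=C), rng.normal(size=C)
mean, var = rng.normal(size=C), rng.uniform(0.5, 2.0, size=C)
eps = 1e-5

# Multi-branch forward pass: 3x3 conv + BN, 1x1 conv, average pooling, identity.
main = conv2d_same(x, w3, b3)
main = gamma[:, None, None] * (main - mean[:, None, None]) / np.sqrt(var[:, None, None] + eps) + beta[:, None, None]
multi_branch = main + conv2d_same(x, w1, b1) + avgpool3x3(x) + x

# Step 5: sum all kernels and biases into one 3x3 convolution.
w3_folded, b3_folded = fold_bn(w3, b3, gamma, beta, mean, var, eps)
w_fused = w3_folded + pad_1x1_to_3x3(w1) + avgpool_kernel(C) + identity_kernel(C)
b_fused = b3_folded + b1
fused = conv2d_same(x, w_fused, b_fused)

max_abs_diff = float(np.abs(multi_branch - fused).max())
print("max |multi-branch - fused| =", max_abs_diff)
```

The two outputs agree up to floating-point rounding because every branch is linear in the input, so their sum equals one convolution whose kernel is the sum of the equivalent 3 × 3 kernels.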
3.3. Enhancement of Feature Fusion Network
3.4. Adaptive Scale Transformation Operators
3.5. STAIR-DETR Detection Head
4. Experimental Results and Analysis
4.1. Dataset and Experimental Setup
4.1.1. Dataset
4.1.2. Data Augmentation Strategy
4.1.3. Experimental Environment and Configuration
4.2. Evaluation Metrics
4.3. Ablation Experiments
4.4. Performance Comparison
4.5. Visualization Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep Learning for UAV-Based Object Detection and Tracking: A Survey. Remote Sens. 2024, 16, 149.
- Chen, R.; Guo, Y.; Zheng, H.; Jiang, H. A Comprehensive Approach for UAV Small Object Detection with Simulation-based Transfer Learning and Adaptive Fusion. arXiv 2021, arXiv:2109.01800.
- Tsouros, D.C.; Bibi, S.; Sarigiannidis, P.G. A review on UAV-based applications for precision agriculture. Information 2019, 10, 349.
- Perks, M.T.; Russell, A.J.; Large, A.R.G. Advances in Flash-Flood Monitoring Using UAVs. Hydrol. Earth Syst. Sci. 2016, 20, 387–396.
- Khosravi, P.; Roy, D.; Feng, X.; Xiao, X. A Survey of Small Object Detection Based on Deep Learning in Aerial Imagery. Intell. Data Anal. 2025, 29, 1–25.
- Viola, P.; Jones, M. Boosted Cascade of Simple Features for Rapid Object Detection. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; pp. 511–518.
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR′05), San Diego, CA, USA, 20–26 June 2005; pp. 886–893.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong SAR, China, 20–24 August 2006; pp. 850–855.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End to End Object Detection. arXiv 2021, arXiv:2104.12841.
- Zhao, H.; Li, X.; Lin, Z.; Wang, Y.; Zhai, Z.; Yu, G. DETRs Beat YOLOs on Real-Time Object Detection. arXiv 2023, arXiv:2304.08069.
- Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 3–19.
- Lv, W.; Shang, Y.; Xu, C.; Wang, Y.; Sun, P.; Lin, S. DETRs with Hybrid Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 11593–11603.
- Wang, C.; Yang, C.; Zhang, Y.; Fan, H.; Wang, Z.; Tai, Y.; Li, J. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. arXiv 2022, arXiv:2103.09136.
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 13609–13617.
- Cao, J.; Xie, E.; Zhang, X.; Sun, P.; Qi, L.; Luo, P.; Li, Z.; Wang, J. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605.
- Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842.
- Wenbin, L. VRF-DETR: Efficient Aerial Image Detection with Variable Receptive Fields. arXiv 2025, arXiv:2504.15165.
- Cao, X.; Wang, H.; Wang, X.; Hu, B. DFS-DETR: Detailed-feature-sensitive detector for small object detection in aerial images using transformer. Electronics 2024, 13, 3404.
- Li, Z.; Cui, H.; Lin, T.Y.; Dollar, P.; Hariharan, B.; Girshick, R. Group DETR: Fast Training Converging DETR with Group-wise One-to-Many Assignment. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 2022, 17299–17309.
- Hu, Z.; Wu, P.; Chen, J.; Zhu, H.; Wang, Y.; Peng, Y.; Li, H.; Sun, X. Dome-DETR: Density-Oriented Feature-Query Manipulation for Tiny Object Detection. arXiv 2024, arXiv:2405.05741.
- Wang, Z.; Yan, F.; Wang, L.; Yin, Y.; Lin, J. S-YOLO: An enhanced small object detection method based on adaptive gating strategy and dynamic multi-scale focus module. Neural Netw. 2025, 191, 107782.
- Li, B.; Wang, B.; Hu, X. Bridging semantic discretization: DR-YOLO with dual-stream decomposition and dynamic feature weaving for low-light object detection. Inf. Sci. 2025, 731, 122896.
- Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140.
- Wan, D.; Lu, R.; Hu, B.; Yin, J.; Shen, S.; Xu, T.; Lang, X. YOLO-MIF: Improved YOLOv8 with Multi-Information fusion for object detection in Gray-Scale images. Adv. Eng. Inform. 2024, 62, 102709.
- Wu, Z.; Ding, T.; Lu, Y.; Pai, D.; Zhang, J.; Wang, W.; Yu, Y.; Ma, Y.; Haeffele, B.D. Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction. arXiv 2024, arXiv:2412.17810.
- Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179.
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6027–6037.
- Xu, R.; Chen, Z.; Zuo, W.; Yan, J.; Lin, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO; Version 8.0.0 [Computer Software]; Ultralytics Inc.: Los Angeles, CA, USA, 2023; Available online: https://github.com/ultralytics/ultralytics (accessed on 15 May 2024).
- Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
- Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733.

| Enabled modules (of DSEB, SFA, P2, CGD, DySample) | mAP50 (%) ↑ | mAP50:95 (%) ↑ | FLOPs (G) ↓ | Param (M) ↓ | FPS ↑ |
|---|---|---|---|---|---|
| None (RT-DETR-r18 baseline) | 36.2 | 20.7 | 57.0 | 20.2 | 103.0 |
| 1 of 5 | 38.0 | 22.0 | 57.0 | 19.8 | 100.7 |
| 1 of 5 | 37.4 | 21.8 | 57.1 | 19.7 | 107.2 |
| 2 of 5 | 40.2 | 23.9 | 78.2 | 18.6 | 70.1 |
| 2 of 5 | 39.8 | 23.7 | 78.3 | 18.4 | 72.9 |
| 3 of 5 | 40.7 | 24.3 | 78.3 | 18.4 | 72.6 |
| 4 of 5 | 41.2 | 23.2 | 84.8 | 21.0 | 66.1 |
| All 5 | 41.7 | 23.4 | 86.6 | 21.2 | 40.2 |
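
The accuracy and cost trade-off in the ablation can be made explicit by differencing the first row (no modules) and the last row (all five modules). The short sketch below only re-computes numbers already listed in the table; the dictionary names are illustrative.

```python
# First and last rows of the ablation table (values copied from the table).
baseline = {"mAP50": 36.2, "mAP50:95": 20.7, "FLOPs": 57.0, "Param": 20.2, "FPS": 103.0}
full = {"mAP50": 41.7, "mAP50:95": 23.4, "FLOPs": 86.6, "Param": 21.2, "FPS": 40.2}

# Absolute change per metric, rounded to the table's one-decimal precision.
deltas = {k: round(full[k] - baseline[k], 1) for k in baseline}

# Relative change in percent for the cost-related metrics.
relative = {k: round(100.0 * (full[k] - baseline[k]) / baseline[k], 1) for k in ("FLOPs", "FPS")}

print(deltas)
print(relative)
```

With all modules enabled, mAP50 rises by 5.5 points and mAP50:95 by 2.7 points, while FLOPs grow by 29.6 G (about 52%) and FPS drops by 62.8 (about 61%).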
| Models | mAP50 (%) ↑ | mAP50:95 (%) ↑ | FLOPs (G) ↓ | Param (M) ↓ | FPS ↑ |
|---|---|---|---|---|---|
| Two-stage methods | |||||
| Faster R-CNN [9] | 32.3 | 12.0 | 127.0 | 41.7 | 36.2 |
| Cascade R-CNN [41] | 32.8 | 19.5 | 236 | 69.4 | 31.6 |
| One-stage methods | |||||
| YOLOv8m [42] | 33.2 | 19.0 | 79.8 | 25.9 | 256.0 |
| YOLOv11m [43] | 35.7 | 20.6 | 59.0 | 15.3 | 752.8 |
| YOLOv12m [44] | 31.7 | 18.3 | 67.5 | 20.2 | 122.4 |
| YOLOv13n [45] | 26.3 | 14.5 | 6.1 | 2.5 | 264.3 |
| S-YOLO [33] | 36.4 | 21.3 | 25.3 | 2.7 | 285.0 |
| DR-YOLO [34] | 38.5 | 20.9 | 8.1 | 3.1 | 57.8 |
| End-to-end methods | |||||
| Deformable DETR [15] | 30.2 | 16.4 | 172.5 | 40.2 | 19 |
| RT-DETR-r18 [16] | 36.8 | 20.7 | 57.0 | 20.2 | 103.0 |
| RT-DETRv2-r18 [35] | 39.1 | 22.1 | 60.0 | 20.0 | - |
| D-FINE-M [28] | 39.6 | 22.1 | 56.4 | 19.2 | 480.9 |
| VRF-DETR [29] | 39.9 | 23.3 | 44.6 | 13.8 | 158.7 |
| DFS-DETR [30] | 40.7 | - | - | 19.8 | - |
| Ours | 41.7 | 23.4 | 86.6 | 21.2 | 40.2 |
| Dataset | Models | mAP_s (%) ↑ | mAP_m (%) ↑ | mAP_l (%) ↑ | mAP50 (%) ↑ | mAP50:95 (%) ↑ |
|---|---|---|---|---|---|---|
| DOTA | RT-DETR-r18 [16] | 21.1 | 44 | 51.6 | 65.1 | 41.7 |
| DOTA | VRF-DETR [29] | 23.2 | 47.3 | 53.8 | 68.6 | 43.8 |
| DOTA | Ours | 28.2 | 56.3 | 60.5 | 69.7 | 43.4 |
| VisDrone | RT-DETR-r18 [16] | 9.1 | 26.3 | 33.7 | 36.2 | 20.7 |
| VisDrone | Ours | 10.7 | 27.4 | 35.8 | 41.7 | 23.4 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hu, L.; Xue, P.; Guo, B.; Chen, Y.; Zha, W.; Tian, J. STAIR-DETR: A Synergistic Transformer Integrating Statistical Attention and Multi-Scale Dynamics for UAV Small Object Detection. Sensors 2025, 25, 7681. https://doi.org/10.3390/s25247681

