MPI-DETR: Multi-Grain Prompt and Intensity-Guided Transformer for Small-Object Detection in UAV Imagery
Highlights
- We propose MPI-DETR, a novel detection framework utilizing a Dual-Stream Ranked Self-Attention (DRSA) module that maps spatial features into ordered intensity sequences.
- The framework achieves state-of-the-art AP50 scores of 43.8%, 87.5%, and 92.5% on the highly challenging AI-TOD, DIOR, and NWPU VHR-10 datasets, respectively.
- MPI-DETR mitigates severe background clutter and semantic misalignment in UAV imagery through bounded noise filtering and prompt-driven feature fusion.
- Relative to the baseline, it reduces the parameter count from 20.1 M to 16.8 M and FLOPs from 58.6 G to 48.3 G while raising inference speed from 115 to 132 FPS, which makes it a practical candidate for resource-constrained UAV platforms and edge deployment.
Abstract
1. Introduction
- We propose MPI-DETR, an end-to-end detection framework specifically designed for small-object detection in UAV remote sensing images. By incorporating intensity guidance and prompt learning mechanisms, the framework effectively alleviates the challenges posed by background clutter interference and the spatially scattered distribution of targets.
- We design the DRSA module, which overcomes the limitations of conventional spatial attention by introducing an intensity reordering strategy and a dual-stream interaction mechanism, thereby enabling efficient global aggregation of features from spatially dispersed targets.
- We propose the Bilateral Tanh Gating and Cosine Attention Feature Alignment Module (BTC-FAM) and the Prompt-Driven Multi-Grain Fusion (PMGF) module, which reconstruct the pathways of feature encoding and fusion from the perspectives of noise suppression and cross-level feature alignment, respectively, enhancing the model’s ability to perceive weak target signals.
2. Related Works
2.1. Object Detection in UAV Imagery
2.2. Global Context Modeling and Attention Mechanisms
2.3. Feature Fusion and Pyramid Networks
3. Methods
3.1. Overall Architecture of MPI-DETR
3.2. Dual-Stream Ranked Self-Attention
- Intra-bucket dense stream: Through consecutive grouping, attention is focused on adjacent elements in the sequence. Since the sequence has already been sorted, this stream concentrates on modeling the continuity of local fine-grained intensity variations.
- Inter-bucket sparse stream: Through strided sampling, long-range connections are established across different intensity levels. This stream is intended to capture global statistical distribution relationships across hierarchical levels.
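
The two streams can be pictured as two ways of grouping the same intensity-ranked token sequence. The sketch below is a minimal illustration in plain Python, not the authors' implementation: the function names, the `bucket_size` parameter, the ascending sort order, and the toy intensities are all assumptions made for the example, and the attention computation inside each group is omitted.

```python
def rank_tokens(intensities):
    """Return token indices sorted by ascending intensity (ties keep input order)."""
    return sorted(range(len(intensities)), key=lambda i: intensities[i])


def intra_bucket_groups(order, bucket_size):
    """Consecutive grouping: each group holds neighbours in the ranked sequence."""
    return [order[start:start + bucket_size]
            for start in range(0, len(order), bucket_size)]


def inter_bucket_groups(order, bucket_size):
    """Strided sampling: each group takes every bucket_size-th ranked position."""
    return [order[offset::bucket_size] for offset in range(bucket_size)]


# Toy per-token intensities for a flattened feature map of 8 tokens (made-up values).
intensities = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
order = rank_tokens(intensities)
intra = intra_bucket_groups(order, bucket_size=4)
inter = inter_bucket_groups(order, bucket_size=4)

print(order)  # [1, 5, 3, 7, 2, 4, 6, 0]
print(intra)  # [[1, 5, 3, 7], [2, 4, 6, 0]]
print(inter)  # [[1, 2], [5, 4], [3, 6], [7, 0]]
```

In this toy case each intra-bucket group contains tokens of similar intensity regardless of their spatial position, while each inter-bucket group pairs a low-intensity token with a high-intensity one.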
3.3. Bilateral Tanh Gating and Cosine Attention Feature Alignment Module
3.4. Prompt-Driven Multi-Grain Fusion Module
4. Experiments
4.1. Datasets
- AI-TOD: This dataset is specifically designed for small-object detection. It contains 28,036 images and 700,621 instances across 8 categories. The average object size is only 12.8 pixels, and the proportion of small objects (smaller than 16 pixels) is far higher than in conventional datasets. We follow the official split, using 14,536 images for training, 4272 for validation, and 9228 for testing.
- DIOR: This is one of the largest and most category-rich remote sensing object detection datasets. It contains 23,463 images and 192,472 instances covering 20 common object categories, such as airplanes, ships, and storage tanks. Object scales vary dramatically and backgrounds are complex, including urban areas, ports, and wild fields, so the dataset tests the robustness and generalization of MPI-DETR under high-frequency background noise and cross-scale feature alignment. We divide the dataset into training, validation, and test sets with a ratio of 7:1:2.
- NWPU VHR-10: This is a classic high-resolution geospatial object detection dataset. It contains 800 ultra-high-resolution images covering 10 categories. Although the dataset is relatively small in scale, it includes rich high-resolution texture details and dense spatial distributions. The training, validation, and test sets contain 550, 100, and 150 images, respectively. We follow this split for model training and evaluation.
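
A ratio-based split such as the 7:1:2 division of DIOR can be reproduced along the following lines. This is a sketch under stated assumptions: the text does not give the resulting image counts, the random seed, or the rounding rule, so the helper `split_by_ratio`, the seed value, and the choice to give the remainder to the test set are hypothetical.

```python
import random


def split_by_ratio(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items deterministically and cut them into train/val/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total  # floor of the training share
    n_val = len(items) * ratios[1] // total    # floor of the validation share
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]             # the remainder goes to the test set
    return train, val, test


# DIOR contains 23,463 images; integer ids stand in for image file names.
train, val, test = split_by_ratio(range(23463))
print(len(train), len(val), len(test))  # 16424 2346 4693

# The fixed splits quoted for the other two datasets add up to their totals.
assert 14536 + 4272 + 9228 == 28036  # AI-TOD
assert 550 + 100 + 150 == 800        # NWPU VHR-10
```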
4.2. Experimental Setup
4.3. Evaluation Indicators
4.4. Comparative Experiment
4.5. Ablation Experiment
- DRSA: As shown in Row 2 of Table 5, replacing the spatial attention module of the baseline with the proposed DRSA raises the primary detection metric (first metric column of Table 5) from 34.5% to 35.8%. Because DRSA drops the computationally intensive spatial attention in favor of a parameter-free intensity sorting mechanism, the parameter count also falls from 20.1 M to 18.5 M and FLOPs from 58.6 G to 53.0 G. This suggests that, in UAV-view imagery, cross-region feature aggregation based on physical response intensity can be both more precise and more lightweight than conventional spatial search.
- BTC-FAM: On top of DRSA, we further embed BTC-FAM into the feature pyramid pathway. This configuration improves the primary metric by 0.8 percentage points (35.8% to 36.6%) and AP50 by 0.8 percentage points (42.1% to 42.9%). The gain is consistent with our assumption that the high-frequency background noise prevalent in shallow features of UAV images interferes with the representation of weak targets. By employing a bilateral tanh gating mechanism, BTC-FAM filters out this impulse noise before cross-level feature propagation, thereby providing a cleaner semantic environment. Because it also replaces part of the heavy spatial convolutions, the computational cost is further reduced, to 17.9 M parameters and 50.5 GFLOPs.
- PMGF: Finally, we integrate PMGF into the network to form the complete MPI-DETR (Row 4 of Table 5). Compared with the previous configuration, the primary metric gains another 0.8 percentage points, reaching the best result of 37.4%, while the model size is reduced to 16.8 M. This result indicates that PMGF moves away from passive feature concatenation: deep global prompts actively retrieve and activate the fine-grained textures of small objects that are submerged in the shallow network, which helps to address the problem of “semantic misalignment” during cross-level fusion.
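
The step-by-step gains quoted above can be recomputed from the rows of Table 5. The snippet below is only a bookkeeping sketch: the numbers are transcribed from the table in percent, the name of the first metric column was not preserved so it is labeled generically, and the third metric column is taken to be AP50 because its final value, 43.8%, matches the AI-TOD AP50 reported in the Highlights.

```python
# Rows of Table 5: (configuration, params in M, FLOPs in G, first metric %, AP50 %).
ABLATION = [
    ("Baseline", 20.1, 58.6, 34.5, 40.8),
    ("+DRSA", 18.5, 53.0, 35.8, 42.1),
    ("+BTC-FAM", 17.9, 50.5, 36.6, 42.9),
    ("MPI-DETR", 16.8, 48.3, 37.4, 43.8),
]


def step_deltas(rows):
    """Return the change of every numeric column between consecutive rows."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        diff = tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:]))
        deltas.append((cur[0],) + diff)
    return deltas


for name, d_params, d_flops, d_metric, d_ap50 in step_deltas(ABLATION):
    print(f"{name}: params {d_params:+.1f} M, FLOPs {d_flops:+.1f} G, "
          f"first metric {d_metric:+.1f} pt, AP50 {d_ap50:+.1f} pt")
```

Each added module lowers both the parameter count and the FLOPs while raising both metrics, so the accuracy gains in this ablation do not come at the cost of a larger model.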
4.6. Necessity Analysis of the Dual-Flow Mechanism Within DRSA
4.7. Visualization of Discriminative Regions
4.8. Analysis of Feature Representations in Dual-Stream Ranked Attention
4.9. Robustness and Discussion
4.10. Computational Complexity and Efficiency Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, N.; Ye, M.; Zhou, L.; Tang, S.; Gan, Y.; Liang, Z.; Zhu, X. Self-prompting analogical reasoning for UAV object detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 18412–18420.
2. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 370–386.
3. Jiao, Z.; Wang, M.; Qiao, S.; Zhang, Y.; Huang, Z. Transformer-based object detection in low-altitude maritime UAV remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4210413.
4. Jankovic, B.; Jangirova, S.; Ullah, W.; Khan, L.U.; Guizani, M. UAV-assisted real-time disaster detection using optimized transformer model. In Proceedings of the IEEE Symposium on Computers and Communications; IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.
5. Kelly, M.; Feirer, S.; Hogan, S.; Lyons, A.; Lin, F.; Jacygrad, E. Mapping orchard trees from UAV imagery through one growing season: A comparison between OBIA-based and three CNN-based object detection methods. Drones 2025, 9, 593.
6. Das, A.; Yang, Y.; Subburaj, V.H. YOLOv7 for weed detection in cotton fields using UAV imagery. AgriEngineering 2025, 7, 313.
7. Luo, M.; Zhao, R.; Zhang, S.; Chen, L.; Shao, F.; Meng, X. IM-CMDet: An intra-modal enhancement and cross-modal fusion network for small object detection in UAV aerial RGBT imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5008316.
8. Qin, H.; Xu, T.; Li, T.; Chen, Z.; Feng, T.; Li, J. MUST: The first dataset and unified framework for multispectral UAV single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 16882–16891.
9. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
10. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154.
11. Nian, Z.; Yang, W.; Chen, H. AEFFNet: Attention enhanced feature fusion network for small object detection in UAV imagery. IEEE Access 2025, 13, 26494–26505.
12. Zhao, D.; Gu, L.; Qian, K.; Zhou, H.; Yang, T.; Cheng, K. Target tracking from infrared imagery via an improved appearance model. Infrared Phys. Technol. 2020, 104, 103116.
13. Zhao, D.; Zhang, H.; Arun, P.V.; Jiao, C.; Zhou, H.; Xiang, P.; Cheng, K. SiamSTU: Hyperspectral video tracker based on spectral spatial angle mapping enhancement and state aware template update. Infrared Phys. Technol. 2025, 150, 105919.
14. Zhao, D.; Hu, B.; Jiang, W.; Zhong, W.; Arun, P.V.; Cheng, K.; Zhao, Z.; Zhou, H. Hyperspectral video tracker based on spectral difference matching reduction and deep spectral target perception features. Opt. Lasers Eng. 2025, 194, 109124.
15. Zhao, D.; Zhong, W.; Ge, M.; Jiang, W.; Zhu, X.; Arun, P.V.; Zhou, H. SiamBSI: Hyperspectral video tracker based on band correlation grouping and spatial-spectral information interaction. Infrared Phys. Technol. 2025, 151, 106063.
16. Zhao, D.; Xu, X.; You, M.; Arun, P.V.; Zhao, Z.; Ren, J.; Wu, L.; Zhou, H. Local sub-block contrast and spatial-spectral gradient features fusion for hyperspectral anomaly detection. Remote Sens. 2025, 17, 695.
17. Zheng, D.; Dong, W.; Hu, H.; Chen, X.; Wang, Y. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6674–6683.
18. Liao, N.; Zhang, Y.; Yu, Z.; Huang, J.; Zhu, M.; Peng, B. UAV-DETR: Few-parameter DETR for small object detection in high-altitude UAV images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 19, 2575–2587.
19. Zhang, J.; Zhang, Y.; Easa, S.M.; Xie, B.; Lin, L.; Zhou, X.; Zeng, N.; Zhang, W.; Song, M. E2-Former: An edge-enhanced transformer for UAV-based small object detection. IEEE Internet Things J. 2026; in press.
20. Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and better for real-time aerial image detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8673–8681.
21. Yin, H.; Zhu, Z.; Wang, H. SED-DETR: A scale-enhanced deformable detection transformer for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5624412.
22. Zhou, S.; Chen, D.; Pan, J.; Shi, J.; Yang, J. Adapt or perish: Adaptive sparse transformer with attentive feature refinement for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2952–2963.
23. Song, L.; Chen, Y.; Yang, S.; Ding, X.; Ge, Y.; Chen, Y.C.; Shan, Y. Low-rank approximation for sparse attention in multi-modal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 13763–13773.
24. Chen, W.; Bruzzone, L.; Dang, B.; Gao, Y.; Deng, Y.; Yu, J.G.; Yuan, L.; Li, Y. REST: Holistic learning for end-to-end semantic segmentation of whole-scene remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 48, 693–710.
25. Pu, Y.; Xia, Z.; Guo, J.; Han, D.; Li, Q.; Li, D.; Yuan, Y.; Li, J.; Han, Y.; Song, S.; et al. Efficient diffusion transformer with step-wise dynamic attention mediators. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 424–441.
26. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the International Conference on Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 3791–3798.
27. Lin, H.; Liu, J.; Li, X.; Wei, L.; Liu, Y.; Han, B.; Wu, Z. DCEA: DETR with concentrated deformable attention for end-to-end ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17292–17307.
28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
29. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16901–16911.
30. Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-based visual segmentation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.
31. Liu, K.; Fu, Z.; Jin, S.; Chen, Z.; Zhou, F.; Jiang, R.; Chen, Y.; Ye, J. ESOD: Efficient small object detection on high-resolution images. IEEE Trans. Image Process. 2024, 34, 183–195.
32. Zhang, Y.; Wu, C.; Zhang, T.; Zheng, Y. Full-scale feature aggregation and grouping feature reconstruction-based UAV image target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11.
33. Huo, Y.; Dong, Y.; Wang, C.; Zhang, M.; Wang, H. Multi-scale memory network with separation training for hyperspectral anomaly detection. Inf. Process. Manag. 2026, 63, 104494.
34. Huo, Y.; Wang, S.; Wang, C.; Zhang, M.; Wang, H. Dual-stream background modeling network with anomaly suppression for hyperspectral anomaly detection. Int. J. Appl. Earth Obs. Geoinf. 2026, 148, 105233.
35. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
36. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
37. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
39. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
40. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
41. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://docs.ultralytics.com/models/yolov8 (accessed on 15 May 2026).
42. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
43. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://docs.ultralytics.com/models/yolo11 (accessed on 15 May 2026).
44. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
45. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
46. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140.
47. Ma, S.; Zhang, Y.; Peng, L.; Sun, C.; Ding, L.; Zhu, Y. OWRT-DETR: A novel real-time transformer network for small object detection in open water search and rescue from UAV aerial imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4205313.
48. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842.
49. Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with improved matching for fast convergence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 15162–15171.
50. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv 2016, arXiv:1610.02391.

Table 1. Statistics of the datasets used in the experiments.

| Dataset | Images | Instances | Classes | Resolution | Main Challenge |
|---|---|---|---|---|---|
| AI-TOD | 28,036 | 700,621 | 8 | | Extremely Small Objects (≈12.8 px) |
| DIOR | 23,463 | 192,472 | 20 | | Large Scale & Multi-class |
| NWPU VHR-10 | 800 | 3651 | 10 | Variable sizes | High-Res & Dense Dist. |

Table 2. Comparison with other detectors on AI-TOD.

| Model | Epochs | Param (M) | FLOPs (G) | | | AP50 | | |
|---|---|---|---|---|---|---|---|---|
| CNN-Based | ||||||||
| YOLO8n [41] | 200 | 3.1 | 8.9 | 0.321 | 0.152 | 0.386 | 0.185 | 0.170 |
| YOLO10n [42] | 200 | 2.8 | 8.7 | 0.302 | 0.148 | 0.365 | 0.178 | 0.164 |
| YOLO11n [43] | 200 | 2.6 | 6.6 | 0.325 | 0.155 | 0.389 | 0.188 | 0.169 |
| YOLO12n [44] | 200 | 2.6 | 6.6 | 0.331 | 0.158 | 0.393 | 0.192 | 0.173 |
| YOLO13n [45] | 200 | 2.4 | 6.4 | 0.335 | 0.160 | 0.396 | 0.196 | 0.175 |
| FBRT-YOLO-N [20] | 300 | 0.9 | 6.9 | 0.342 | 0.165 | 0.402 | 0.201 | 0.179 |
| Transformer-Based | ||||||||
| RT-DETR-R18 [40] | 120 | 20.1 | 58.6 | 0.345 | 0.171 | 0.408 | 0.205 | 0.182 |
| RT-DETRv2-R18 [46] | 120 | 20.1 | 58.6 | 0.351 | 0.175 | 0.415 | 0.211 | 0.187 |
| L-OWRT-DETR [47] | 100 | 19.1 | 54.2 | 0.348 | 0.172 | 0.410 | 0.207 | 0.184 |
| D-FINE-N [48] | 160 | 3.7 | 7.3 | 0.326 | 0.162 | 0.395 | 0.193 | 0.173 |
| DEIM [49] | 160 | 3.7 | 7.3 | 0.355 | 0.178 | 0.421 | 0.216 | 0.190 |
| MPI-DETR (Ours) | 120 | 16.8 | 48.3 | 0.374 | 0.185 | 0.438 | 0.231 | 0.202 |

Table 3. Comparison with other detectors on DIOR.

| Model | Epochs | Param (M) | FLOPs (G) | | | | AP50 | | |
|---|---|---|---|---|---|---|---|---|---|
| CNN-Based | |||||||||
| YOLO8n [41] | 200 | 3.1 | 8.9 | 0.225 | 0.462 | 0.751 | 0.824 | 0.658 | 0.602 |
| YOLO10n [42] | 200 | 2.8 | 8.7 | 0.231 | 0.456 | 0.760 | 0.829 | 0.668 | 0.615 |
| YOLO11n [43] | 200 | 2.6 | 6.6 | 0.221 | 0.465 | 0.771 | 0.836 | 0.678 | 0.619 |
| YOLO12n [44] | 200 | 2.6 | 6.6 | 0.227 | 0.473 | 0.787 | 0.841 | 0.688 | 0.629 |
| YOLO13n [45] | 200 | 2.4 | 6.4 | 0.223 | 0.474 | 0.782 | 0.845 | 0.686 | 0.632 |
| FBRT-YOLO-N [20] | 300 | 0.9 | 6.9 | 0.208 | 0.433 | 0.723 | 0.792 | 0.628 | 0.572 |
| Transformer-Based | |||||||||
| RT-DETR-R18 [40] | 120 | 20.1 | 58.6 | 0.276 | 0.517 | 0.803 | 0.865 | 0.705 | 0.651 |
| RT-DETRv2-R18 [46] | 120 | 20.1 | 58.6 | 0.277 | 0.520 | 0.791 | 0.861 | 0.704 | 0.647 |
| L-OWRT-DETR [47] | 100 | 19.1 | 54.2 | 0.293 | 0.505 | 0.792 | 0.854 | 0.698 | 0.643 |
| D-FINE-N [48] | 160 | 3.7 | 7.3 | 0.262 | 0.485 | 0.768 | 0.839 | 0.672 | 0.635 |
| DEIM [49] | 160 | 3.7 | 7.3 | 0.264 | 0.471 | 0.763 | 0.837 | 0.662 | 0.623 |
| MPI-DETR (Ours) | 120 | 16.8 | 48.3 | 0.305 | 0.531 | 0.815 | 0.875 | 0.718 | 0.662 |

Table 4. Comparison with other detectors on NWPU VHR-10.

| Model | Epochs | Param (M) | FLOPs (G) | | | | AP50 | | |
|---|---|---|---|---|---|---|---|---|---|
| CNN-Based | |||||||||
| YOLO8n [41] | 200 | 3.1 | 8.9 | 0.319 | 0.526 | 0.619 | 0.904 | 0.623 | 0.552 |
| YOLO10n [42] | 200 | 2.8 | 8.7 | 0.149 | 0.477 | 0.582 | 0.840 | 0.543 | 0.512 |
| YOLO11n [43] | 200 | 2.6 | 6.6 | 0.322 | 0.513 | 0.624 | 0.896 | 0.628 | 0.556 |
| YOLO12n [44] | 200 | 2.6 | 6.6 | 0.295 | 0.484 | 0.595 | 0.859 | 0.591 | 0.524 |
| YOLO13n [45] | 200 | 2.4 | 6.4 | 0.301 | 0.483 | 0.594 | 0.852 | 0.587 | 0.529 |
| FBRT-YOLO-N [20] | 300 | 0.9 | 6.9 | 0.313 | 0.472 | 0.615 | 0.866 | 0.555 | 0.521 |
| Transformer-Based | |||||||||
| RT-DETR-R18 [40] | 120 | 20.1 | 58.6 | 0.216 | 0.529 | 0.662 | 0.886 | 0.630 | 0.570 |
| RT-DETRv2-R18 [46] | 120 | 20.1 | 58.6 | 0.234 | 0.537 | 0.619 | 0.894 | 0.649 | 0.573 |
| L-OWRT-DETR [47] | 100 | 19.1 | 54.2 | 0.267 | 0.550 | 0.646 | 0.906 | 0.637 | 0.577 |
| D-FINE-N [48] | 160 | 3.7 | 7.3 | 0.283 | 0.547 | 0.632 | 0.904 | 0.633 | 0.570 |
| DEIM [49] | 160 | 3.7 | 7.3 | 0.275 | 0.556 | 0.647 | 0.912 | 0.644 | 0.580 |
| MPI-DETR (Ours) | 120 | 16.8 | 48.3 | 0.332 | 0.576 | 0.685 | 0.925 | 0.662 | 0.598 |

Table 5. Ablation of the proposed modules.

| Model | DRSA | BTC-FAM | PMGF | Params (M) | FLOPs (G) | (%) | (%) | AP50 (%) |
|---|---|---|---|---|---|---|---|---|
| Baseline | × | × | × | 20.1 | 58.6 | 34.5 | 17.1 | 40.8 |
| +DRSA | ✓ | × | × | 18.5 | 53.0 | 35.8 | 17.6 | 42.1 |
| +BTC-FAM | ✓ | ✓ | × | 17.9 | 50.5 | 36.6 | 18.0 | 42.9 |
| MPI-DETR | ✓ | ✓ | ✓ | 16.8 | 48.3 | 37.4 | 18.5 | 43.8 |

Table 6. Ablation of the two streams within DRSA.

| Model Variant | Intra-Bucket Stream | Inter-Bucket Stream | (%) | (%) | AP50 (%) |
|---|---|---|---|---|---|
| Baseline (No DRSA) | × | × | 34.5 | 17.1 | 40.8 |
| DRSA (Intra-only) | ✓ | × | 35.1 | 17.3 | 41.3 |
| DRSA (Inter-only) | × | ✓ | 35.3 | 17.4 | 41.6 |
| DRSA (Full Dual-Stream) | ✓ | ✓ | 35.8 | 17.6 | 42.1 |

Table 7. Average attention weights at different spatial distances.

| Model | Avg. Weights (Dist. 0–20 px) | Avg. Weights (Dist. 50–100 px) | Avg. Weights (Dist. 150–300 px) | Avg. Weights (Dist. > 400 px) |
|---|---|---|---|---|
| Baseline (Standard Multi-Head Self-Attention) | 0.85 | 0.22 | 0.08 | 0.02 |
| DRSA (Intensity-Ranked) | 0.65 | 0.61 | 0.58 | 0.55 |

Table 8. Computational complexity and efficiency.

| Model | Params (M) | FLOPs (G) | Training VRAM (GB) | Inference Speed (FPS) |
|---|---|---|---|---|
| Baseline | 20.1 | 58.6 | 7.8 | 115 |
| MPI-DETR (Ours) | 16.8 | 48.3 | 5.4 | 132 |
| Relative Change | -16.4% | -17.6% | -30.8% | +14.8% |
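
The relative change between the baseline and MPI-DETR follows directly from the two rows of the efficiency table. The snippet below is a small arithmetic sketch with values transcribed from that table; the helper name `relative_change` and the one-decimal rounding are choices made for this example.

```python
def relative_change(baseline, new):
    """Percentage change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (baseline, MPI-DETR) pairs taken from the efficiency table.
EFFICIENCY = {
    "Params (M)": (20.1, 16.8),
    "FLOPs (G)": (58.6, 48.3),
    "Training VRAM (GB)": (7.8, 5.4),
    "Inference Speed (FPS)": (115, 132),
}

for name, (baseline, ours) in EFFICIENCY.items():
    print(f"{name}: {relative_change(baseline, ours):+.1f}%")
# Params (M): -16.4%
# FLOPs (G): -17.6%
# Training VRAM (GB): -30.8%
# Inference Speed (FPS): +14.8%
```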
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zhang, J.; Xie, B.; Lin, L.; Yang, L.; Zhang, X.; Meng, Y.; Xie, X.; Zhang, Y.; Zhang, W. MPI-DETR: Multi-Grain Prompt and Intensity-Guided Transformer for Small-Object Detection in UAV Imagery. Remote Sens. 2026, 18, 1763. https://doi.org/10.3390/rs18111763

