Hierarchical Attention-Driven Detection of Small Objects in Remote Sensing Imagery
Highlights
- Constraining the network with classical statistical models (e.g., DoG, CLAHE) provides a principled prior that enriches the features extracted for small objects and makes the model more robust.
- The proposed hierarchical attention-driven framework integrating statistically constrained pre-extraction, top–down guidance, and bottom–up fusion is validated as an effective solution for the specific challenges of small object detection in remote sensing.
- For the field: The findings demonstrate that a hybrid strategy, which integrates model-based guidance with data-driven learning, yields superior results compared to using either approach in isolation. This provides a validated path forward for optimizing small object detection models in remote sensing.
- For practice: Combining the complementary strengths of top–down and bottom–up feature fusion leads to experimentally validated improvements in detection stability, especially for small objects.
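The highlights name difference-of-Gaussians (DoG) and CLAHE as the statistical priors behind the pre-extraction stage. As a rough illustration of why a DoG prior suits small objects, the following NumPy sketch (function names are ours, not the paper's) builds a separable Gaussian blur and shows the band-pass DoG response peaking at a small bright blob:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """1-D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur: convolve each row, then each column."""
    k = gaussian_kernel1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def dog_prior(img, sigma_fine=1.0, sigma_coarse=2.0):
    """Difference-of-Gaussians: a band-pass response that highlights
    small, blob-like structures against smooth backgrounds."""
    return gaussian_blur(img, sigma_fine) - gaussian_blur(img, sigma_coarse)

# Toy example: one small bright "object" on a flat background.
img = np.zeros((32, 32))
img[15:17, 15:17] = 1.0
response = dog_prior(img)
peak = np.unravel_index(np.argmax(response), response.shape)
```

The DoG response is largest at the blob, which is the sense in which such a filter acts as a prior that steers the network toward small-object structure; CLAHE plays the analogous role for weak local contrast.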
Abstract
1. Introduction
- Adaptive enhancement of small-target patterns: Local adaptive filtering kernels [25] are combined with a channel attention mechanism to build up the representation of small-target patterns layer by layer, while channel-wise feature screening stabilizes the small-target features.
- Adaptive enhancement of weak-contrast patterns: A local adaptive enhancement algorithm strengthens the weak-contrast structure of targets and improves the model's ability to learn weak-contrast features. Depthwise separable convolution (DWConv) [26] expands patterns channel-wise and filters for beneficial feature channels, enhancing the structural representation of target features.
- Detailed feature extraction guided by macroscopic structure: Drawing on the hierarchical backbone of OverLoCK [16], our method uses macroscopic spatial structure to guide and align the local detailed features crucial for identifying small targets. This top–down, context-guided design keeps local feature extraction focused on salient regions, improving both accuracy and efficiency.
- Bidirectional feature propagation architecture: The network combines a top–down fusion path, in which macro-structural characteristics guide detailed feature extraction, with a bottom–up path, realized by the C2f [27,28] structure, which abstracts those details. Together they form a closed-loop, multi-scale propagation scheme that couples global semantic reasoning with local fine-grained features, improving the environmental adaptability of small-target detection.
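The first two contributions both rely on channel-wise feature screening, i.e., an SE-style channel attention [17] that gates each feature channel by a learned weight in (0, 1). A minimal NumPy sketch of that gating step follows; the weights here are random stand-ins (in the network they are learned), and in the described modules a DWConv [26] would precede the gate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """SE-style channel attention: global-average "squeeze", a small
    bottleneck "excitation" MLP, then channel-wise rescaling.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    squeeze = feat.mean(axis=(1, 2))                       # (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))   # (C,), each in (0, 1)
    return feat * excite[:, None, None]                    # reweight each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 2
feat = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C)) * 0.1   # hypothetical bottleneck weights
w2 = rng.normal(size=(C, C // r)) * 0.1
out = channel_attention(feat, w1, w2)
```

Each channel is multiplied by a single scalar gate, which is exactly the "filter for beneficial feature channels" behavior the contributions describe.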
2. Related Work
2.1. Feature Enhancement Methods
2.2. Feature Fusion Methods
3. Materials and Methods
3.1. Datasets
3.2. Experimental Details
3.3. Evaluation Metrics
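The metric definitions are not reproduced in this excerpt; the results tables report AP, AP0.5, AP0.75, and size-stratified APvt/APt/APs/APm. Assuming the COCO-style protocol used by the cited benchmarks, every metric rests on the intersection-over-union between predicted and ground-truth boxes, which a minimal sketch makes concrete:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half: intersection 50, union 150.
val = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A prediction counts as a true positive for AP0.5 when its IoU with a ground-truth box is at least 0.5; AP averages precision over IoU thresholds from 0.5 to 0.95.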
3.4. Hierarchical Attention-Driven Methodological Framework
3.4.1. SpotEnh Net
3.4.2. AHEEnh Net
3.4.3. Macro Attention-Guided Hierarchical Net
Basic Feature Extraction Module
Macrostructure Feature Extraction Module
Detailed Feature Extraction Module
3.4.4. Loss Function
4. Results
4.1. Qualitative Comparison
4.2. Quantitative Comparison
5. Discussion
5.1. Ablation Study
5.2. The Effectiveness of the SpotEnh Module
5.3. The Effectiveness of the AHEEnh Module
5.4. Model Complexity Analysis
5.5. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- Fan, X.; Hu, Z.; Zhao, Y.; Chen, J.; Wei, T.; Huang, Z. A small ship object detection method for satellite remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11886–11898. [Google Scholar] [CrossRef]
- Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network. Remote Sens. 2020, 12, 1432. [Google Scholar] [CrossRef]
- Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017. [Google Scholar]
- Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019. [Google Scholar]
- Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from Google Earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [Google Scholar] [CrossRef]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
- Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8573–8581. [Google Scholar]
- Chen, H.; Chu, X.; Ren, Y.; Zhao, X.; Huang, K. Pelk: Parameter-efficient large kernel convnets with peripheral convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11030–11039. [Google Scholar]
- Cao, C.; Liu, X.; Yang, Y.; Yu, Y.; Wang, J.; Wang, Z.; Huang, Y.; Wang, L.; Huang, C.; Xu, W.; et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2956–2964. [Google Scholar]
- Cao, C.; Huang, Y.; Yang, Y.; Wang, L.; Wang, Z.; Tan, T. Feedback convolutional neural network for visual localization and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1627–1640. [Google Scholar] [CrossRef]
- Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11963–11975. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
- Lou, M.; Yu, Y. OverLoCK: An overview-first-look-closely-next ConvNet with context-mixing dynamic kernels. arXiv 2025, arXiv:2502.20087. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
- Chen, Y.; Li, Y.; Kong, T. Scale-aware automatic augmentation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Zuiderveld, K. Contrast limited adaptive histogram equalization (CLAHE). In Graphics Gems IV; AP Professional: Boston, MA, USA, 1994. [Google Scholar]
- Lindeberg, T. Scale-Space Theory in Computer Vision; Kluwer Academic Publishers: Boston, MA, USA, 1994. [Google Scholar]
- Wu, F.; Liu, A.; Zhang, T.; Zhang, L.; Luo, J.; Peng, Z. Saliency at the helm: Steering infrared small target detection with learnable kernels (L2SKNet). IEEE Trans. Geosci. Remote Sens. 2024, 63, 5000514. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Li, X.; Wang, W.; Hu, X.; Yang, J. C2f module: A cross-stage partial fusion approach for efficient object detection. arXiv 2023, arXiv:2301.12345. [Google Scholar]
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; Volume 33, pp. 21002–21012. [Google Scholar] [CrossRef]
- Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S. AI-TOD: A benchmark for tiny object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar]
- Zhang, J.; Huang, J.; Li, X.; Zhang, Y. SODA-A: A large-scale small object detection benchmark for autonomous driving. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 456–472. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. SCRDet++: Detecting small, cluttered and rotated objects via instance-level feature denoising and rotation loss smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2384–2399. [Google Scholar] [CrossRef] [PubMed]
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence 2020, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar] [CrossRef]
- Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, 17–22 June 2006. [Google Scholar]
- Tan, X.; Triggs, B. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process. 2010, 19, 1635–1650. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
- Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
- Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Wang, L.; Lu, Y.; Wang, Y.; Zheng, Y.; Ye, X.; Guo, Y. MAV23: A multi-altitude aerial vehicle dataset for tiny object detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. ADAS-GPM: Attention-driven adaptive sampling for ground penetrating radar object detection. IEEE Trans. Intell. Transp. Syst. 2022, 24, 1–14. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; p. 28. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Online, 14–19 June 2020. [Google Scholar]
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Lin, D. M-CenterNet: Multi-scale CenterNet for tiny object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar]
- Yang, F.; Choi, W.; Lin, Y. FSANet: Feature-and-scale adaptive network for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking the evaluation of object detectors via normalized Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021. [Google Scholar]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes for transformer-based detection. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022. [Google Scholar]
- Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images. arXiv 2024, arXiv:2404.06180. [Google Scholar] [CrossRef]
- Li, H.; Liu, W.; Li, N.; Gui, Z. Adaptive domain-aware network for airport runway subsurface defect detection. Autom. Constr. 2025, 171, 105969. [Google Scholar] [CrossRef]
- Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for aerial object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Online, 11–17 October 2021. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 568–578. [Google Scholar]
- Yang, S.; Pei, Z.; Zhou, F.; Wang, G. Rotated Faster R-CNN for Oriented Object Detection in Aerial Images. In Proceedings of the 2020 3rd International Conference on Robot Systems and Applications, Chengdu, China, 14–16 June 2020. [Google Scholar]
- Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K. Gliding vertex for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
- Han, J.; Ding, J.; Xue, N.; Xia, G.S. S2A-Net: Scale-aware feature alignment for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar]
- Chen, L.; Zhang, H.; Xiao, J.; He, Q.; Yang, S. DODet: Dual-oriented object detection in remote sensing images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
- Wang, Z.; Huang, J.; Li, X.; Zhang, Y. DHRec: Dynamic hierarchical representation for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 December 2023. [Google Scholar]
- Li, R.; Zheng, S.; Duan, C.; Chen, J.; Li, Y.; Liu, X.; Liu, B. M2Vdet: Multi-view multi-scale detection for UAV imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar]
- Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. CFINet: Contextual feature interaction for tiny object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
| Configuration | Name | Specification | Manufacturer (City, Country) |
|---|---|---|---|
| Hardware environment | GPU | NVIDIA RTX 4090 | NVIDIA Corporation (Santa Clara, CA, USA) |
| | CPU | Intel Core i9-14900 | Intel Corporation (Santa Clara, CA, USA) |
| | VRAM | 40 GB | Kingston Technology (Fountain Valley, CA, USA) |
| | RAM | 256 GB | |
| | Operating System | Windows Server 2019 Standard | Microsoft Corporation (Redmond, WA, USA) |
| Software environment | Python | 3.9.19 | Python Software Foundation (Wilmington, DE, USA) |
| | PyTorch | 2.3.1 | Meta Platforms, Inc. (Menlo Park, CA, USA) |
| | CUDA | 12.1 | NVIDIA Corporation (Santa Clara, CA, USA) |
| | cuDNN | 8.9.0.7 | NVIDIA Corporation (Santa Clara, CA, USA) |
| Hyperparameter | Settings |
|---|---|
| Epochs | 150 |
| Initial learning rate (lr0) | 0.01 |
| Final learning rate factor (lrf) | 0.01 |
| Optimizer | SGD |
| Batch size | 4 |
| Momentum | 0.937 |
| Method | Publication | AP | AP0.5 | APvt | APt | APs | APm |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [47] | 2015 | 11.6 | 26.9 | 0.0 | 7.8 | 24.4 | 34.1 |
| SSD-512 [40] | 2016 | 7.0 | 21.7 | 1.0 | 5.4 | 11.5 | 13.5 |
| RetinaNet [48] | 2017 | 4.7 | 13.6 | 2.0 | 5.4 | 6.3 | 7.6 |
| Cascade R-CNN [49] | 2018 | 13.7 | 30.5 | 0.0 | 9.9 | 26.1 | 36.4 |
| TridentNet [50] | 2019 | 7.5 | 20.9 | 1.0 | 5.8 | 12.6 | 14.0 |
| ATSS [51] | 2020 | 14.0 | 33.8 | 2.2 | 12.2 | 21.5 | 31.9 |
| M-CenterNet [52] | 2021 | 14.5 | 40.7 | 6.1 | 15.0 | 19.4 | 20.4 |
| FSANet [53] | 2022 | 16.3 | 41.4 | 4.4 | 14.6 | 23.4 | 33.3 |
| FCOS [54] | 2022 | 13.9 | 35.5 | 2.7 | 12.0 | 20.2 | 32.2 |
| NWD [55] | 2022 | 19.2 | 48.5 | 7.6 | 19.0 | 23.9 | 31.6 |
| DAB-DETR [56] | 2022 | 4.9 | 16.0 | 1.7 | 3.6 | 7.0 | 18.0 |
| DAB-Deformable-DETR [56] | 2022 | 16.5 | 42.6 | 7.9 | 15.2 | 23.8 | 31.9 |
| MAV23 [44] | 2023 | 17.2 | 47.7 | 8.9 | 18.1 | 21.2 | 28.4 |
| ADAS-GPM [45] | 2023 | 20.1 | 49.7 | 7.4 | 19.8 | 24.9 | 32.1 |
| SAFF-SSD [42] | 2023 | 21.1 | 49.9 | 7.0 | 20.8 | 30.1 | 38.8 |
| YOLOv8s [38] | 2023 | 11.6 | 27.4 | 3.4 | 11.1 | 14.9 | 22.8 |
| YOLOv11-s [41] | 2024 | 18.7 | 42.8 | 6.7 | 16.2 | 17.5 | 24.0 |
| YOLC [57] | 2024 | 19.6 | 44.9 | 7.7 | 16.0 | 22.5 | 26.8 |
| AD-Det [58] | 2025 | 20.1 | 34.2 | - | - | - | - |
| PRNet [59] | 2025 | 20.8 | 32.3 | - | - | - | - |
| HAD | - | 21.4 | 52.6 | 7.9 | 23.3 | 32.3 | 33.6 |
| Method | Publication | AP | AP0.5 | AP0.75 |
|---|---|---|---|---|
| Rotated Faster RCNN [61] | 2017 | 32.5 | 70.1 | 24.3 |
| RoI Transformer [46] | 2019 | 36.0 | 73.0 | 30.1 |
| Rotated RetinaNet [47] | 2020 | 26.8 | 63.4 | 16.2 |
| Gliding Vertex [62] | 2021 | 31.7 | 70.8 | 22.6 |
| Oriented RCNN [49] | 2021 | 34.4 | 70.7 | 28.6 |
| S2A-Net [63] | 2022 | 28.3 | 69.6 | 13.1 |
| DODet [64] | 2022 | 31.6 | 68.1 | 23.4 |
| Oriented RepPoints [59] | 2022 | 26.3 | 58.8 | 19.0 |
| DHRec [65] | 2022 | 30.1 | 68.8 | 19.8 |
| M2Vdet [66] | 2023 | 37.0 | 75.3 | 31.4 |
| CFINet [67] | 2023 | 34.4 | 73.1 | 26.1 |
| YOLOv8s [38] | 2023 | 30.6 | 72.1 | 40.6 |
| YOLOv11s [41] | 2024 | 42.7 | 74.2 | 45.2 |
| YOLC [57] | 2024 | 35.8 | 73.5 | 44.6 |
| HAD | - | 43.2 | 76.7 | 45.7 |
| SpotEnh | AHEEnh | Bi-FF | AP | AP0.5 | APvt | APt | APs | APm |
|---|---|---|---|---|---|---|---|---|
| - | - | - | 20.2 | 50.9 | 7.6 | 21.7 | 30.6 | 33.1 |
| √ | - | - | 20.8 | 51.2 | 7.7 | 22.6 | 30.9 | 33.3 |
| - | √ | - | 20.5 | 51.1 | 7.7 | 22.2 | 30.7 | 33.2 |
| - | - | √ | 20.4 | 51.0 | 7.6 | 22.0 | 30.7 | 33.2 |
| √ | √ | √ | 21.4 | 51.6 | 7.8 | 23.3 | 31.3 | 33.4 |
| Method | Image Size | AP | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| SSD | 800 × 800 × 3 | 7.00 (±2.5) | 26.80 | 62.80 |
| YOLOv8s | 800 × 800 × 3 | 11.60 (±1.8) | 20.63 | 28.60 |
| YOLOv11s | 800 × 800 × 3 | 18.70 (±1.8) | 9.43 | 16.85 |
| HAD | 800 × 800 × 3 | 21.40 (±1.9) | 19.81 | 32.10 |
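The Params and FLOPs figures above come from profiling the full models, but for a single convolution the counts follow closed forms, and the saving from the DWConv used in the AHEEnh module can be read off directly. A sketch under the usual one-MAC-per-weight-use convention (function names are ours):

```python
def conv_params(c_in, c_out, k, bias=True):
    """Weights (and optional biases) of a standard k x k convolution."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates of a standard k x k conv over an
    h_out x w_out output map."""
    return c_out * h_out * w_out * c_in * k * k

def dwconv_flops(c_in, k, h_out, w_out, c_out=None):
    """Depthwise separable conv = depthwise k x k + pointwise 1 x 1."""
    c_out = c_in if c_out is None else c_out
    depthwise = c_in * h_out * w_out * k * k
    pointwise = c_out * h_out * w_out * c_in
    return depthwise + pointwise

# Example: 3x3 conv, 64 -> 128 channels, over a 100 x 100 output map.
std = conv_flops(64, 128, 3, 100, 100)
dws = dwconv_flops(64, 3, 100, 100, c_out=128)
# Depthwise separable costs a fraction 1/k^2 + 1/c_out of the standard MACs.
```

For this example the depthwise separable variant needs about 12% of the standard convolution's MACs (1/9 + 1/128), which is the efficiency argument behind using DWConv in the enhancement modules.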
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Liu, X.; Sun, X.; Wang, J. Hierarchical Attention-Driven Detection of Small Objects in Remote Sensing Imagery. Remote Sens. 2026, 18, 455. https://doi.org/10.3390/rs18030455

