Vision-Language Guided Semantic Diffusion Sampling for Small Object Detection in Remote Sensing Imagery
Abstract
1. Introduction
- CDATOD-Diff, an integrated detection framework, is designed specifically for small object detection under scale variation, improving detection robustness and localization accuracy where conventional architectures struggle.
- A CLIP-driven dynamic anchor sampling strategy is introduced, injecting vision-language guidance into the sampling phase to relieve the shortage of positive samples for small objects (see the first sketch after this list).
- A robust adaptive regression loss, BC-IoU, is proposed for the box regression stage, easing the regression difficulties of small targets and reducing the impact of scale fluctuations (see the second sketch after this list).
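The paper's full sampling procedure is given in Section 3.2; the following is only a minimal sketch of how CLIP image-text similarity could re-score candidate anchor regions before label assignment. The open_clip model name, the prompt wording, and the idea of promoting high-scoring candidates are illustrative assumptions rather than the authors' exact settings.

```python
# Hedged sketch: CLIP image-text similarity as a semantic score for
# candidate anchor regions. Model, prompts, and threshold are assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

prompts = ["a satellite image of a ship", "a satellite image of an aircraft"]
with torch.no_grad():
    text_feat = model.encode_text(tokenizer(prompts))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

def clip_scores(image: Image.Image, boxes):
    """Return the best prompt similarity for each candidate box (x1, y1, x2, y2)."""
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes])
    with torch.no_grad():
        img_feat = model.encode_image(crops)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        sim = img_feat @ text_feat.T      # cosine similarity per prompt
    return sim.max(dim=-1).values

# Candidates whose semantic score clears a threshold could be promoted to
# positives, easing the positive-sample shortage for tiny objects.
```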
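The exact BC-IoU formulation is given in Section 3.3 and is not reproduced here. As a hedged sketch, the loss below pairs a standard IoU term with a corner-distance term normalized by the enclosing-box diagonal, mirroring the separate "IoU" and "Corner" items ablated in Section 4.6.3; the equal weighting of the two terms is an assumption.

```python
import torch

def bc_iou_style_loss(pred, target, eps=1e-7):
    """Illustrative IoU + corner-distance loss for (x1, y1, x2, y2) boxes.

    Not the paper's exact BC-IoU: a standard IoU term plus a corner-distance
    term normalized by the enclosing-box diagonal, so the penalty stays
    scale-invariant even for very small boxes.
    """
    # Intersection and IoU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Corner distances, normalized by the squared enclosing-box diagonal
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    diag2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps
    corner = (((pred[:, :2] - target[:, :2]) ** 2).sum(dim=1)
              + ((pred[:, 2:] - target[:, 2:]) ** 2).sum(dim=1)) / diag2

    return (1 - iou) + corner
```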
2. Related Works
2.1. Object Detection
2.2. Tiny Object Detection
2.3. Remote Sensing Image Processing Based on the Vision–Language Model
3. Methodology
3.1. CLIP Feature Extraction for Object Detection
3.2. CLIP-Driven Diffusion Anchor Point Sampling
3.2.1. Dynamic Anchor Point Sampling Procedure
3.2.2. Diffusion-Based Anchor Point Sampling Procedure
3.2.3. CLIP-Driven Conditional Encoding
3.3. Adaptive Loss
4. Results
4.1. Datasets
4.1.1. SAR Datasets
- (1) MSAR-1.0 [47]: The MSAR-1.0 dataset comprises 28,449 detection slices acquired by the HISEA-1 and Gaofen-3 satellites, covering the HH, HV, VH, and VV polarization modes. Its scenes include airports, ports, nearshore and offshore waters, islands, and urban areas, and its targets span four classes: 1851 bridges, 39,858 ships, 12,319 oil tanks, and 6368 aircraft. We selected the 256 × 256 pixel slices and used 80% of the images for training and the remainder for testing (see the first sketch after this list).
- (2) HRSID [48]: The High-Resolution SAR Image Dataset (HRSID) supports ship detection, semantic segmentation, and instance segmentation in high-resolution SAR imagery. It contains 5604 images with 16,951 ship instances, spanning resolutions of 0.5 m, 1 m, and 3 m as well as varied polarizations, sea states, sea areas, and coastal ports. We followed the official split defined by the dataset authors: 3642 images for training and 1962 for testing (see the second sketch after this list).
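A minimal sketch of the 80/20 split described for MSAR-1.0 above; the directory layout, file extension, and fixed seed are assumptions for illustration.

```python
import random
from pathlib import Path

# Hedged sketch of the random 80/20 split; path and seed are assumptions.
slices = sorted(Path("MSAR-1.0/images_256").glob("*.png"))  # hypothetical layout
random.Random(0).shuffle(slices)
cut = int(0.8 * len(slices))
train, test = slices[:cut], slices[cut:]
print(f"{len(train)} training slices, {len(test)} test slices")
```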
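For HRSID, the official split can be read straight from its COCO-style annotation files; a small verification sketch follows, where the annotation file names are assumptions about the release layout.

```python
import json

# Hedged sketch: verifying the official HRSID split sizes from COCO-style
# annotation files; the file names are assumptions.
for name, expected in [("train2017.json", 3642), ("test2017.json", 1962)]:
    with open(f"HRSID/annotations/{name}") as f:
        coco = json.load(f)
    print(name, len(coco["images"]), "images,", len(coco["annotations"]), "ships")
    assert len(coco["images"]) == expected
```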
4.1.2. Optical Datasets
- (1) AI-TOD [49]: The AI-TOD benchmark contains 28,036 aerial images with 700,621 annotated instances and is designed specifically for small object detection in remote sensing. It covers eight object categories: airplanes, bridges, storage tanks, ships, swimming pools, vehicles, persons, and wind turbines. The mean object size in AI-TOD is only 12.8 pixels, substantially smaller than in conventional aerial detection datasets, making it particularly valuable for evaluating a model's ability to identify the sub-20-pixel targets prevalent in high-altitude imagery (see the first sketch after this list).
- (2) VEDAI [50]: The VEDAI benchmark is a dedicated resource for vehicle recognition in aerial imagery, containing 1210 high-resolution (1024 × 1024 pixel) images with about 3700 annotated instances. It provides fine-grained labels across six vehicle types: campers, cars, pickups, tractors, trucks, and vans. Unlike remote sensing datasets that focus on larger objects such as storage tanks or athletic facilities, VEDAI targets micro-scale objects whose average coverage is below 0.05% of the image area (see the second sketch after this list). The small spatial footprint of the vehicles (on the order of 32 × 32 pixels) and their spectral similarity to cluttered urban backgrounds make VEDAI a demanding evaluation platform for dense, complex scenes.
- (3) USOD [50]: Built from the UNICORN2008 collection, the USOD benchmark targets small-vehicle detection in aerial surveillance. Its source imagery comes from electro-optical sensors with 0.4 m spatial resolution and is refined through spectral filtering, image segmentation, and expert verification to isolate vehicle targets. The curated set contains 3000 annotated scenes with 43,378 vehicle instances, divided into 70% training and 30% testing subsets via randomized stratification (see the third sketch after this list). Notably, 96.3% of targets are no larger than 16 × 16 pixels, and 99.9% fall below 32 × 32 pixels, making USOD a challenging benchmark for detection in high-resolution surveillance scenarios. On this dataset, we performed quantitative evaluations and visual comparisons against the baseline methods.
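The sub-20-pixel character of AI-TOD is easy to verify from its COCO-style annotations: take the square root of each box area and bucket it with the 8/16/32/64-pixel thresholds that the APvt/APt/APs/APm columns in Section 4.5.1 follow. The annotation path below is an assumption.

```python
import json
import math
from collections import Counter

# Sketch: mean absolute object size and AI-TOD-style size buckets from a
# COCO-style annotation file; the path is a hypothetical placeholder.
with open("AI-TOD/annotations/aitod_trainval_v1.json") as f:
    anns = json.load(f)["annotations"]

sizes = [math.sqrt(a["bbox"][2] * a["bbox"][3]) for a in anns]  # bbox = [x, y, w, h]
print(f"mean object size: {sum(sizes) / len(sizes):.1f} px")    # ~12.8 px reported

# very tiny < 8, tiny < 16, small < 32; everything larger lumped as medium here
buckets = Counter("very tiny" if s < 8 else "tiny" if s < 16
                  else "small" if s < 32 else "medium" for s in sizes)
print(buckets)
```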
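The sub-0.05% coverage figure for VEDAI follows from simple arithmetic on a 1024 × 1024 frame, as the sketch below shows; the example target sizes are illustrative.

```python
# Fraction of a 1024 x 1024 frame covered by a single square target.
# A roughly 23 x 23 px vehicle sits right at the 0.05% figure quoted above.
image_area = 1024 * 1024
for side in (16, 23, 32):
    print(f"{side} x {side} px -> {side * side / image_area:.4%} of the image")
# 16x16 -> 0.0244%, 23x23 -> 0.0505%, 32x32 -> 0.0977%
```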
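For USOD's randomized stratified 70/30 split, a hedged sketch using scikit-learn; stratifying on a coarse per-scene target-count bucket is an assumption about the procedure, and the stand-in counts would come from the real annotations.

```python
import random
from sklearn.model_selection import train_test_split

# Hedged sketch of a randomized stratified 70/30 scene split. The
# target-count strata and the stand-in counts are assumptions.
rng = random.Random(0)
scenes = [f"scene_{i:04d}" for i in range(3000)]      # 3000 annotated scenes
counts = {s: rng.randint(1, 60) for s in scenes}      # stand-in for real counts
strata = [min(counts[s] // 10, 5) for s in scenes]    # coarse count buckets
train, test = train_test_split(scenes, test_size=0.3,
                               random_state=0, stratify=strata)
print(len(train), len(test))                          # 2100 900
```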
4.2. Experimental Setup
4.3. Experimental Metrics
4.4. Comparison with State-of-the-Art Methods on SAR Datasets
4.4.1. Results on MSAR-1.0
4.4.2. Results on HRSID
4.5. Comparison with State-of-the-Art Methods on Visible Datasets
4.5.1. Results on AI-TOD
4.5.2. Results on VEDAI
4.6. Ablation Study
4.6.1. Baseline Setup
4.6.2. Ablation Study of the CLIP-Driven Diffusion Sampling Module
4.6.3. Ablation Study of Each Loss Item
4.6.4. Effectiveness of Balanced Loss Function
4.7. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Sun, Z.; Leng, X.; Zhang, M.; Ren, H.; Ji, K. SAR Image Object Detection and Information Extraction: Methods and Applications. Remote Sens. 2025, 17, 2098.
2. Cao, S.; Deng, J.; Luo, J.; Li, Z.; Hu, J.; Peng, Z. Local convergence index-based infrared small target detection against complex scenes. Remote Sens. 2023, 15, 1464.
3. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327.
4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
5. Li, X.; Wen, C.; Hu, Y.; Yuan, Z.; Zhu, X.X. Vision-language models in remote sensing: Current progress and future trends. IEEE Geosci. Remote Sens. Mag. 2024, 12, 32–66.
6. Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. Skyscript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5805–5813.
7. Tao, L.; Zhang, H.; Jing, H.; Liu, Y.; Yan, D.; Wei, G.; Xue, X. Advancements in Vision–Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques. Remote Sens. 2025, 17, 162.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015.
10. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
11. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
13. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
14. Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. SOD-YOLO: Small-object-detection algorithm based on improved YOLOv8 for UAV images. Remote Sens. 2024, 16, 3057.
15. Yao, B.; Zhang, C.; Meng, Q.; Sun, X.; Hu, X.; Wang, L.; Li, X. SRM-YOLO for Small Object Detection in Remote Sensing Images. Remote Sens. 2025, 17, 2099.
16. Hu, J.; Wei, Y.; Chen, W.; Zhi, X.; Zhang, W. CM-YOLO: Typical Object Detection Method in Remote Sensing Cloud and Mist Scene Images. Remote Sens. 2025, 17, 125.
17. Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
18. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
19. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859.
20. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666.
21. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. FoveaBox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398.
22. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
23. Zhou, J.; Xiao, C.; Peng, B.; Liu, Z.; Liu, L.; Liu, Y.; Li, X. DiffDet4SAR: Diffusion-based aircraft target detection network for SAR images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4007905.
24. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883.
25. Tang, X.; Du, D.K.; He, Z.; Liu, J. PyramidBox: A context-assisted single shot face detector. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 797–813.
26. Cui, L.; Lv, P.; Jiang, X.; Gao, Z.; Zhou, B.; Zhang, L.; Shao, L.; Xu, M. Context-aware block net for small object detection. IEEE Trans. Cybern. 2020, 52, 2300–2313.
27. Zhang, M.; Yue, K.; Li, B.; Guo, J.; Li, Y.; Gao, X. Single-frame infrared small target detection via gaussian curvature inspired network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005013.
28. Wu, S.; Xiao, C.; Wang, Y.; Yang, J.; An, W. Sparsity-Aware Global Channel Pruning for Infrared Small-target Detection Networks. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5615011.
29. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. FaceBoxes: A CPU real-time face detector with high accuracy. In Proceedings of the 2017 IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, 1–4 October 2017; pp. 1–9.
30. Xu, C.; Ding, J.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Dynamic coarse-to-fine learning for oriented tiny object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7318–7328.
31. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2965–2974.
32. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463.
33. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Yuan, Z.; Luo, P. Sparse R-CNN: An end-to-end framework for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15650–15664.
34. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. S3FD: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 192–201.
35. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
36. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496.
37. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 526–543.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
40. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
41. Qiu, C.; Yu, A.; Yi, X.; Guan, N.; Shi, D.; Tong, X. Open self-supervised features for remote-sensing image scene classification using very few samples. IEEE Geosci. Remote Sens. Lett. 2022, 20, 2500505.
42. Basso, L.D. CLIP-RS: A Cross-Modal Remote Sensing Image Retrieval Based on CLIP. Ph.D. Thesis, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA, 2022.
43. Bazi, Y.; Al Rahhal, M.M.; Mekhalfi, M.L.; Al Zuair, M.A.; Melgani, F. Bi-modal transformer-based approach for visual question answering in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4708011.
44. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5612822.
45. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117.
46. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026.
47. Chen, J.; Huang, Z.; Xia, R.; Wu, B.; Sheng, L.; Sun, L.; Yao, B. Large-scale multi-class SAR image target detection dataset-1.0. J. Radars 2022, 14, 1488.
48. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254.
49. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3791–3798.
50. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203.
51. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6054–6063.
52. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
53. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
54. Zhou, Y.; Liu, H.; Ma, F.; Pan, Z.; Zhang, F. A sidelobe-aware small ship detection network for synthetic aperture radar imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5205516.
55. Ma, Y.; Guan, D.; Deng, Y.; Yuan, W.; Wei, M. 3SD-Net: SAR small ship detection neural network. IEEE Trans. Geosci. Remote Sens. 2024, 62.
56. Xu, C.; Wang, J.; Yang, W.; Yu, L. Dot distance for tiny object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1192–1201.
57. Chen, S.; Sun, P.; Song, Y.; Luo, P. DiffusionDet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19830–19843.
58. Cai, Z.; Liu, S.; Wang, G.; Ge, Z.; Zhang, X.; Huang, D. Align-DETR: Improving DETR with simple IoU-aware BCE loss. arXiv 2023, arXiv:2304.07527.
Comparison with state-of-the-art methods on the MSAR-1.0 dataset:

Method | Backbone | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
Faster R-CNN [9] | ResNet50 | 43.0 | 65.6 | 49.2 | 36.5 | 64.6 | 56.9 |
TridentNet [51] | ResNet50 | 41.0 | 62.9 | 46.3 | 28.7 | 62.7 | 59.6 |
Sparse R-CNN [32] | ResNet50 | 42.2 | 65.0 | 46.4 | 33.4 | 62.2 | 57.6 |
SSD513 [52] | ResNet50 | 41.9 | 69.6 | 46.0 | 36.6 | 60.1 | 45.8 |
RetinaNet [53] | ResNet50 | 38.1 | 62.2 | 40.7 | 27.6 | 64.3 | 55.6 |
ATSS [35] | ResNet50 | 51.1 | 72.5 | 56.9 | 44.2 | 73.9 | 61.2 |
RepPoints [20] | ResNet50 | 44.4 | 69.0 | 50.3 | 34.7 | 67.2 | 60.6 |
AutoAssign [36] | ResNet50 | 52.5 | 76.4 | 59.8 | 47.5 | 71.1 | 64.3 |
Foveabox [21] | ResNet50 | 43.2 | 66.2 | 48.2 | 34.9 | 69.2 | 60.4 |
FCOS [22] | ResNet50 | 24.3 | 48.5 | 20.9 | 15.3 | 41.6 | 43.0 |
SASSDN [54] | CSPDarknet53 | 55.6 | 77.9 | 51.1 | 48.7 | 78.2 | 65.1 |
3SD-Net [27,55] | ResNet50 | 63.4 | 80.6 | 62.2 | 51.6 | 77.9 | 68.4 |
Proposed | ResNet50 | 64.1 | 83.8 | 68.9 | 58.9 | 80.9 | 71.9 |
Comparison with state-of-the-art methods on the HRSID dataset:

Method | Backbone | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
Faster R-CNN [9] | ResNet50 | 50.8 | 80.5 | 56.2 | 33.0 | 67.6 | 43.2 |
TridentNet [51] | ResNet50 | 48.5 | 78.4 | 53.0 | 28.7 | 67.1 | 48.4 |
Sparse R-CNN [32] | ResNet50 | 40.8 | 65.8 | 44.8 | 22.8 | 59.3 | 34.1 |
SSD513 [52] | ResNet50 | 44.6 | 76.9 | 47.0 | 25.8 | 63.0 | 26.5 |
RetinaNet [53] | ResNet50 | 41.6 | 70.1 | 44.7 | 18.6 | 63.6 | 33.3 |
ATSS [35] | ResNet50 | 50.2 | 81.2 | 55.0 | 31.3 | 67.9 | 39.6 |
RepPoints [20] | ResNet50 | 50.4 | 83.8 | 53.8 | 32.9 | 66.9 | 41.3 |
AutoAssign [36] | ResNet50 | 52.0 | 85.6 | 55.8 | 35.0 | 67.3 | 44.2 |
Foveabox [21] | ResNet50 | 44.2 | 75.3 | 47.2 | 23.0 | 64.1 | 36.3 |
FCOS [22] | ResNet50 | 44.7 | 75.1 | 48.0 | 24.6 | 64.0 | 35.1 |
SASSDN [54] | CSPDarknet53 | 48.6 | 78.7 | 56.7 | 37.3 | 66.9 | 46.9 |
3SD-Net [27,55] | ResNet50 | 51.9 | 81.3 | 60.8 | 36.9 | 67.7 | 48.6 |
Proposed | ResNet50 | 55.4 | 85.5 | 61.5 | 39.7 | 69.8 | 50.7 |
Comparison with state-of-the-art methods on the AI-TOD dataset:

Method | Backbone | AP | AP50 | AP75 | APvt | APt | APs | APm |
---|---|---|---|---|---|---|---|---|
Faster R-CNN [9] | ResNet50 | 11.1 | 26.3 | 7.6 | 0.0 | 7.2 | 23.3 | 33.6 |
TridentNet [51] | ResNet50 | 7.5 | 20.9 | 3.6 | 1.0 | 5.8 | 12.6 | 14.0 |
DotD [56] | ResNet50 | 16.1 | 39.2 | 10.6 | 8.3 | 17.6 | 18.1 | 22.1 |
Sparse R-CNN [32] | ResNet50 | 7.2 | 19.0 | 4.1 | 3.5 | 8.4 | 7.2 | 7.3 |
SSD513 [52] | ResNet50 | 7.0 | 21.7 | 2.8 | 1.0 | 4.7 | 11.5 | 13.5 |
RetinaNet [53] | ResNet50 | 8.7 | 22.3 | 4.8 | 2.4 | 8.9 | 12.2 | 16.0 |
ATSS [35] | ResNet50 | 12.8 | 30.6 | 8.5 | 1.9 | 11.6 | 19.5 | 29.2 |
RepPoints [20] | ResNet50 | 9.2 | 23.6 | 5.3 | 2.5 | 9.2 | 12.9 | 14.4 |
AutoAssign [36] | ResNet50 | 12.2 | 32.0 | 6.8 | 3.4 | 13.7 | 16.0 | 19.1 |
Foveabox [21] | ResNet50 | 8.7 | 21.1 | 5.4 | 1.1 | 6.7 | 13.4 | 26.4 |
FCOS [22] | ResNet50 | 10.7 | 26.9 | 6.5 | 2.3 | 11.0 | 15.1 | 20.7 |
RFLA [37] | ResNet50 | 16.3 | 39.1 | 11.3 | 7.3 | 18.5 | 19.8 | 21.8 |
DiffusionDet [57] | ResNet50 | 11.0 | 30.0 | 5.7 | 4.0 | 10.7 | 14.3 | 19.1 |
Proposed | ResNet50 | 19.4 | 47.0 | 13.0 | 8.2 | 20.8 | 22.9 | 24.5 |
Per-class AP and mAP on the VEDAI dataset (BO: boat, CP: camping car, CA: car, OT: other, PI: pickup, TR: tractor, TK: truck, VA: van):

Method | BO | CP | CA | OT | PI | TR | TK | VA | mAP
---|---|---|---|---|---|---|---|---|---
Faster R-CNN [9] | 0.112 | 0.274 | 0.413 | 0.117 | 0.356 | 0.173 | 0.147 | 0.206 | 0.225
RetinaNet [53] | 0.176 | 0.365 | 0.398 | 0.113 | 0.330 | 0.197 | 0.207 | 0.144 | 0.241
ATSS [35] | 0.315 | 0.371 | 0.445 | 0.185 | 0.417 | 0.274 | 0.245 | 0.348 | 0.315
RFLA [37] | 0.220 | 0.402 | 0.413 | 0.087 | 0.422 | 0.326 | 0.197 | 0.355 | 0.302
Align-DETR [58] | 0.220 | 0.402 | 0.413 | 0.087 | 0.422 | 0.326 | 0.197 | 0.355 | 0.302
DiffusionDet [57] | 0.433 | 0.427 | 0.442 | 0.177 | 0.443 | 0.317 | 0.175 | 0.298 | 0.340
Proposed | 0.458 | 0.444 | 0.495 | 0.162 | 0.457 | 0.353 | 0.193 | 0.359 | 0.365
Ablation of the CLIP-driven diffusion sampling module with FCOS on AI-TOD, USOD, and VEDAI:

Dataset | Method | AP | AP50 | AP75 | APvt | APt | APs | APm
---|---|---|---|---|---|---|---|---
AI-TOD | FCOS | 0.107 | 0.269 | 0.065 | 0.023 | 0.110 | 0.151 | 0.207
 | FCOS + RFLA | 0.163 | 0.391 | 0.113 | 0.073 | 0.185 | 0.198 | 0.218
 | FCOS + Diffusion | 0.179 | 0.448 | 0.102 | 0.061 | 0.167 | 0.246 | 0.317
 | FCOS + Diff-CLIP | 0.194 | 0.470 | 0.130 | 0.082 | 0.208 | 0.229 | 0.245
USOD | FCOS | 0.106 | 0.268 | 0.063 | 0.020 | 0.103 | 0.140 | 0.208
 | FCOS + RFLA | 0.192 | 0.613 | 0.050 | 0.087 | 0.206 | 0.302 | 0.182
 | FCOS + Diffusion | 0.232 | 0.712 | 0.062 | 0.118 | 0.247 | 0.281 | 0.320
 | FCOS + Diff-CLIP | 0.246 | 0.732 | 0.077 | 0.117 | 0.264 | 0.296 | 0.245
VEDAI | FCOS | 0.017 | 0.086 | 0.003 | - | - | 0.012 | 0.055
 | FCOS + RFLA | 0.324 | 0.585 | 0.310 | - | - | 0.311 | 0.342
 | FCOS + Diffusion | 0.313 | 0.600 | 0.284 | - | - | 0.328 | 0.332
 | FCOS + Diff-CLIP | 0.365 | 0.683 | 0.342 | - | - | 0.345 | 0.388
Ablation of each loss item (IoU term and corner term):

IoU | Corner | AP | AP50 | AP75 | APvt | APt | APs | APm
---|---|---|---|---|---|---|---|---
✓ |  | 10.6 | 26.8 | 6.3 | 2.0 | 10.3 | 14.0 | 20.8
 | ✓ | 11.2 | 27.7 | 6.6 | 2.4 | 11.1 | 15.7 | 21.9
✓ | ✓ | 12.5 | 30.5 | 8.0 | 2.6 | 11.7 | 17.0 | 24.9