A Review of Visual Grounding on Remote Sensing Images
Abstract
1. Introduction
- Scale Heterogeneity: Remote sensing images span from square-kilometer urban clusters down to sub-meter individual targets, so small objects (e.g., ships, vehicles) coexist with large structures (e.g., airports, ports). This diversity challenges traditional detectors built around fixed receptive fields.
- Semantic Complexity: Owing to resolution limits in remote sensing imagery, small targets often have complex textures and insufficient edge detail [7]. This causes bidirectional ambiguity in visual–language mapping: accurate semantic expressions tend to be complex, which can lead to information loss and attention drift during feature extraction, while overly simple expressions easily cause target confusion in dense scenes. Target extraction is further hampered by background interference.
- Annotation Scarcity: Restricted data accessibility and expertise-dependent annotation yield datasets significantly smaller than natural-scene benchmarks, severely undermining model generalization.
Exclusion criteria:
- Duplicate work.
- Articles without publicly available reproducible code or peer review.
- Articles whose full text could not be obtained from the publisher.
Inclusion criteria:
- Articles written in English.
- Studies whose images are remote sensing images (RSI) rather than other types of images.
- Methods that integrate complete sentence text (rather than individual or multiple discrete words).
2. Background
2.1. Concept Definition
2.2. Datasets and Benchmark
2.3. Evaluation Metrics
3. Evolutionary Trajectory
3.1. Transformer-Based Methods
3.2. MLLM-Based Methods
4. Characteristics and Innovations
4.1. Scale Heterogeneity
4.2. Semantic Complexity
4.3. Annotation Scarcity
5. Challenges and Outlook
5.1. Intelligent Annotation Agent for Multisource Heterogeneous Data
5.2. Cross-Temporal Perception Modeling for Dynamic Scenarios
5.3. Edge Computing-Oriented Lightweight Deployment
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Full Form |
---|---|
RSVG | Remote Sensing Visual Grounding |
GeoVG | Geospatial Visual Grounding |
MSVG | Multidimensional Semantic-Guidance Visual Grounding |
MGVLF | Multi-Granularity Visual Language Fusion |
QGVA | Query-Guided Visual Attention |
MFE | Multilevel Feature Enhancement |
RIG | Regional Indication Generator |
AFS | Adaptive Feature Selection |
LQVG | Language Query-Based Visual Grounding |
MSCMA | Multistage Cross-Modal Alignment |
MTAM | Multidimensional Text–Image Alignment Module |
TACMT | Text-Aware Cross-Modal Transformer |
PEFT | Parameter-Efficient Fine-Tuning |
MSITFM | Multi-Scale Image-to-Text Fusion Module |
TCM | Text Confidence Matching |
MB-ORES | Multi-Branch Object Reasoner for Visual Grounding |
CLIP | Contrastive Language–Image Pretraining |
BLIP | Bootstrapping Language–Image Pretraining |
ViT | Vision Transformer |
DETR | Detection Transformer |
mAP | Mean Average Precision |
GFLOPS | Giga Floating-Point Operations Per Second |
LoRA | Low-Rank Adaptation |
VQA | Visual Question Answering |
LSTM | Long Short-Term Memory |
CNN | Convolutional Neural Network |
HBB | Horizontal Bounding Box |
OBB | Oriented Bounding Box |
RRSIS | Referring Remote Sensing Image Segmentation |
DIOR | Object Detection in Optical Remote Sensing Images |
HRRSD | High-Resolution Remote Sensing Detection |
SPCD | Swimming Pool and Car Detection |
SAR | Synthetic Aperture Radar |
VRSBench | Versatile Vision–Language Benchmark for Remote Sensing Image Understanding |
COREval | Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision–Language Models |
SOTA | State-of-the-Art |
IoU | Intersection-over-Union |
meanIoU | Mean Intersection-over-Union |
cumIoU | Cumulative Intersection-over-Union |
BERT | Bidirectional Encoder Representations from Transformers |
FQRNet | Frequency and Query Refinement Network |
LLM | Large Language Model |
LHRS | Language Helps Remote Sensing |
PAL | Prompt-Assisted Learning |
GGL | Geometry-Guided Learning |
DOTA | Dataset for Object Detection in Aerial Images |
OFA | One for All |
BitFit | Bias-Term Fine-Tuning |
Appendix A
Methods | Link |
---|---|
MGVLF [12] | https://github.com/ZhanYang-nwpu/RSVG-pytorch (accessed on 1 May 2025) |
LQVG [19] | https://github.com/LANMNG/LQVG (accessed on 1 May 2025) |
LPVA [20] | https://github.com/like413/OPT-RSVG (accessed on 1 May 2025) |
TACMT [23] | https://github.com/CAESAR-Radi/TACMT (accessed on 1 May 2025) |
MSANet [39] | https://github.com/waynamigo/MSAM (accessed on 1 May 2025) |
CSDNet [47] | https://github.com/WUTCM-Lab/CSDNet (accessed on 1 May 2025) |
GeoChat [14] | https://github.com/mbzuai-oryx/geochat (accessed on 1 May 2025) |
SkySenseGPT [24] | https://github.com/Luo-Z13/SkySenseGPT (accessed on 1 May 2025) |
LHRS-Bot [50] | https://github.com/NJU-LHRS/LHRS-Bot (accessed on 1 May 2025) |
EarthDial [53] | https://github.com/hiyamdebary/EarthDial (accessed on 1 May 2025) |
VHM [51] | https://github.com/opendatalab/VHM (accessed on 1 May 2025) |
GeoPix [52] | https://github.com/Norman-Ou/GeoPix (accessed on 1 May 2025) |
Appendix B
Dataset | Training (%) | Validation (%) | Test (%) |
---|---|---|---|
DIOR-RSVG(1) [12] | 40 | 10 | 50 |
DIOR-RSVG(2) [12] | 70 | 10 | 20 |
RSVG-HR [19] | 80 | - | 20 |
OPT-RSVG [20] | 40 | 10 | 50 |
RSSVG [29] | 70 | 10 | 20 |
SARVG-T [23] | 80 | 10 | 10 |
SARVG-S [29] | 70 | 10 | 20 |
No. | Class Name | Training | Validation | Test |
---|---|---|---|---|
C01 | vehicle | 2888 | 714 | 3559 |
C02 | dam | 401 | 91 | 518 |
C03 | airplane | 664 | 199 | 842 |
C04 | stadium | 471 | 119 | 591 |
C05 | overpass | 908 | 203 | 1090 |
C06 | ground track field | 984 | 223 | 1237 |
C07 | golf field | 426 | 91 | 523 |
C08 | baseball field | 1457 | 353 | 1800 |
C09 | basketball court | 510 | 139 | 637 |
C10 | tennis court | 611 | 133 | 765 |
C11 | expressway toll station | 443 | 108 | 561 |
C12 | expressway service area | 552 | 157 | 703 |
C13 | windmill | 1175 | 312 | 1466 |
C14 | bridge | 1027 | 285 | 1277 |
C15 | harbor | 238 | 49 | 291 |
C16 | train station | 351 | 92 | 447 |
C17 | airport | 494 | 143 | 646 |
C18 | chimney | 502 | 116 | 620 |
C19 | storage tank | 477 | 123 | 630 |
C20 | ship | 749 | 182 | 957 |
- | Total | 15,328 | 3832 | 19,160 |
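As a quick consistency check, the per-class totals above reproduce exactly the 40/10/50 split listed for DIOR-RSVG(1) in the preceding table. The snippet below is plain arithmetic of our own, shown only to connect the two tables:

```python
# DIOR-RSVG(1) object counts from the class table above
train, val, test = 15_328, 3_832, 19_160
total = train + val + test                      # 38,320 objects, matching the dataset table
for name, count in (("training", train), ("validation", val), ("test", test)):
    print(f"{name}: {count / total:.1%}")       # prints 40.0%, 10.0%, 50.0%
```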
No. | Class Name | Training | Validation | Test |
---|---|---|---|---|
C01 | airplane | 979 | 230 | 1142 |
C02 | ground track field | 1600 | 365 | 2066 |
C03 | tennis court | 1093 | 284 | 1313 |
C04 | bridge | 1699 | 452 | 2212 |
C05 | basketball court | 1036 | 263 | 1385 |
C06 | storage tank | 1050 | 271 | 1264 |
C07 | ship | 1084 | 243 | 1241 |
C08 | baseball diamond | 1477 | 361 | 1744 |
C09 | T junction | 1663 | 425 | 2055 |
C10 | crossroad | 1670 | 405 | 2088 |
C11 | parking lot | 1049 | 268 | 1368 |
C12 | harbor | 758 | 209 | 953 |
C13 | vehicle | 3294 | 811 | 4083 |
C14 | swimming pool | 1128 | 308 | 1563 |
- | Total | 19,580 | 4895 | 24,477 |
No. | Class Name | Training | Test |
---|---|---|---|
C01 | Baseball field | 443 | 81 |
C02 | Basketball court | 201 | 38 |
C03 | Ground track field | 163 | 50 |
C04 | Roundabout | 534 | 122 |
C05 | Swimming pool | 150 | 40 |
C06 | Storage tank | 163 | 50 |
C07 | Tennis court | 497 | 118 |
- | Total | 2151 | 499 |
References
1. Zhao, B. A Systematic Survey of Remote Sensing Image Captioning. IEEE Access 2021, 9, 154086–154111.
2. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297.
3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
4. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Volume 162, pp. 12888–12900.
5. Wang, J.; Ma, A.; Chen, Z.; Zheng, Z.; Wan, Y.; Zhang, L.; Zhong, Y. EarthVQANet: Multi-Task Visual Question Answering for Remote Sensing Image Understanding. ISPRS J. Photogramm. Remote Sens. 2024, 212, 422–439.
6. Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote Sensing Object Detection Meets Deep Learning: A Metareview of Challenges and Advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44.
7. Xie, Y.; Liu, S.; Chen, H.; Cao, S.; Zhang, H.; Feng, D.; Wan, Q.; Zhu, J.; Zhu, Q. Localization, Balance, and Affinity: A Stronger Multifaceted Collaborative Salient Object Detector in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17.
8. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1307–1315.
9. Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-Stage Visual Grounding by Recursive Sub-Query Construction. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 387–404.
10. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020.
11. Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual Grounding in Remote Sensing Images. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; ACM: Lisboa, Portugal, 2022; pp. 404–412.
12. Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
13. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022.
14. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. GeoChat: Grounded Large Vision-Language Model for Remote Sensing. arXiv 2023, arXiv:2311.15826.
15. Zhou, Y.; Lan, M.; Li, X.; Feng, L.; Ke, Y.; Jiang, X.; Li, Q.; Yang, X.; Zhang, W. GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv 2025, arXiv:2411.11904.
16. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; The PRISMA Group. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med. 2009, 6, e1000097.
17. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1749–1759.
18. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
19. Lan, M.; Rong, F.; Jiao, H.; Gao, Z.; Zhang, L. Language Query-Based Transformer with Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
20. Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
21. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548.
22. Swimming Pool and Car Detection. Available online: https://www.kaggle.com/datasets/kbhartiya83/swimming-pool-and-car-detection (accessed on 23 May 2025).
23. Li, T. TACMT: Text-Aware Cross-Modal Transformer for Visual Grounding on High-Resolution SAR Images. ISPRS J. Photogramm. Remote Sens. 2025, 222, 152–166.
24. Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding. arXiv 2024, arXiv:2406.10100.
25. Li, X.; Ding, J.; Elhoseiny, M. VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. arXiv 2024, arXiv:2406.12384.
26. Ding, J.; Xue, N.; Xia, G.-S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796.
27. An, X.; Sun, J.; Gui, Z.; He, W. COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models. arXiv 2024, arXiv:2411.18145.
28. Wang, F.; Wang, H.; Chen, M.; Wang, D.; Wang, Y.; Guo, Z.; Ma, Q.; Lan, L.; Yang, W.; Zhang, J.; et al. XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? arXiv 2025, arXiv:2503.23771.
29. Chen, Y.; Zhan, L.; Zhao, Y.; Xiong, S.; Lu, X. VGRSS: Datasets and Models for Visual Grounding in Remote Sensing Ship Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–11.
30. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
32. Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4682–4692.
33. Sadhu, A.; Chen, K.; Nevatia, R. Zero-Shot Grounding of Objects from Natural Language Queries. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4693–4702.
34. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10877–10886.
35. Ye, J.; Tian, J.; Yan, M.; Yang, X.; Wang, X.; Zhang, J.; He, L.; Lin, X. Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15481–15491.
36. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 6877–6886.
37. Du, Y.; Fu, Z.; Liu, Q.; Wang, Y. Visual Grounding with Transformers. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
38. Xie, Y.; Zhan, N.; Zhu, J.; Xu, B.; Chen, H.; Mao, W.; Luo, X.; Hu, Y. Landslide Extraction from Aerial Imagery Considering Context Association Characteristics. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103950.
39. Wang, F.; Wu, C.; Wu, J.; Wang, L.; Li, C. Multistage Synergistic Aggregation Network for Remote Sensing Visual Grounding. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
40. Qiu, H.; Wang, L.; Zhang, M.; Zhao, T.; Li, H. Attribute-Prompting Multi-Modal Object Reasoning Transformer for Remote Sensing Visual Grounding. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 9029–9032.
41. Hu, Z.; Gao, K.; Zhang, X.; Yang, Z.; Cai, M.; Zhu, Z.; Li, W. Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14.
42. Hang, R.; Xu, S.; Liu, Q. A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11.
43. Choudhury, S.; Kurkure, P.; Talwar, P.; Banerjee, B. CrossVG: Visual Grounding in Remote Sensing with Modality-Guided Interactions. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 2858–2862.
44. Ding, Y.; Xu, H.; Wang, D.; Li, K.; Tian, Y. Visual Selection and Multistage Reasoning for RSVG. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
45. Li, C.; Zhang, W.; Bi, H.; Li, J.; Li, S.; Yu, H.; Sun, X.; Wang, H. Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
46. Ding, Y.; Wang, D.; Li, K.; Zhao, X.; Wang, Y. Visual Grounding of Remote Sensing Images with Multi-Dimensional Semantic-Guidance. Pattern Recognit. Lett. 2025, 189, 85–91.
47. Zhao, Y.; Chen, Y.; Yao, R.; Xiong, S.; Lu, X. Context-Driven and Sparse Decoding for Remote Sensing Visual Grounding. Inf. Fusion 2025, 123, 103296.
48. Zhan, Y.; Xiong, Z.; Yuan, Y. SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77.
49. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20.
50. Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 440–457.
51. Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.-S.; et al. VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis. arXiv 2024, arXiv:2403.20213.
52. Ou, R.; Hu, Y.; Zhang, F.; Chen, J.; Liu, Y. GeoPix: A Multimodal Large Language Model for Pixel-Level Image Understanding in Remote Sensing. IEEE Geosci. Remote Sens. Mag. 2025, 2–16.
53. Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues. arXiv 2025, arXiv:2412.15190.
54. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Li, J.; Mao, X. EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–19.
55. Zhao, E.; Wan, Z.; Zhang, Z.; Nie, J.; Liang, X.; Huang, L. A Spatial-Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
56. Zhang, X.; Wu, C.; Zhang, Y.; Xie, W.; Wang, Y. Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images. Nat. Commun. 2023, 14, 4542.
57. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752.
58. Irvin, J.A.; Liu, E.R.; Chen, J.C.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data. arXiv 2024, arXiv:2410.06234.
59. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
60. Frantar, E.; Alistarh, D. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv 2023, arXiv:2301.00774.
61. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge Distillation of Large Language Models. arXiv 2024, arXiv:2306.08543.
62. Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-Bit Quantization of Neural Networks for Efficient Inference. arXiv 2019, arXiv:1902.06822.
63. Xia, Q.; Ye, W.; Tao, Z.; Wu, J.; Li, Q. A Survey of Federated Learning for Edge Computing: Research Problems and Solutions. High-Confid. Comput. 2021, 1, 100008.
| Traditional Object Detection | Visual Grounding | Referring Image Segmentation |
---|---|---|---|
Input Modality | visual modality | visual modality + linguistic modality | visual modality + linguistic modality |
Output Form | bounding boxes | bounding boxes | pixel-level masks |
Target Definition | predefined closed categories | open-vocabulary descriptions | fine-grained semantic descriptions |
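To make the interface differences in this comparison concrete, the sketch below writes out the three tasks' input/output signatures as Python types; the class and field names are illustrative assumptions of ours, not drawn from any cited codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) horizontal bounding box

@dataclass
class DetectionOutput:
    """Traditional detection: image -> all instances of a closed category set."""
    boxes: List[Box]
    class_ids: List[int]                  # indices into the predefined class list

@dataclass
class GroundingOutput:
    """Visual grounding: (image, referring expression) -> one referred region."""
    box: Box

@dataclass
class SegmentationOutput:
    """Referring image segmentation: (image, expression) -> pixel-level mask."""
    mask: np.ndarray                      # boolean H x W mask of the referred object
```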
Datasets | Type | Image Sources | Ann. Format | Total Images | Total Objects | Avg. Expr. Length (words) | Image Size (px) |
---|---|---|---|---|---|---|---|
RSVGD [11] | RGB | DIOR | HBB | 4239 | 7933 | 28.33 | 1024 × 1024 |
DIOR-RSVG [12] | RGB | DIOR | HBB | 17,402 | 38,320 | 7.47 | 800 × 800 |
RSVG-HR [19] | RGB | DIOR | HBB | 2650 | 2650 | 19.6 | 1024 × 1024 |
OPT-RSVG [20] | RGB | HRRSD, DIOR, SPCD | HBB | 25,452 | 48,952 | 10.10 | - |
RSSVG [29] | RGB | FAIR1M, CGWX, DIOR-RSVG | HBB | 11,157 | 25,237 | 9.77 | - |
SARVG-T [23] | SAR | CAPELLA, GF-3, ICEYE SAR | HBB | 2465 | 7617 | - | 512 × 512 |
SARVG-S [29] | SAR | SAR-ship-Dataset | HBB | 43,798 | 54,429 | 7.72 | - |
Benchmark | |||||||
VRSBench [25] | RGB | DOTA-v2, DIOR | OBB | 29,614 | 52,472 | 14.31 | 512 × 512 |
COREval [27] | RGB | Google Earth | HBB/OBB | - | 200 | - | 800 × 800 |
XLRS-Bench [28] | RGB | DOTA-v2, ITCVD | OBB | - | 12,619 | - | 8500 × 8500 |
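The Ann. Format column distinguishes horizontal bounding boxes (HBBs) from oriented bounding boxes (OBBs). As a hedged illustration of the difference (exact OBB parameterizations vary across datasets; this sketch assumes a center–size–angle encoding), an OBB can be expanded to its four corner points and enclosed by an axis-aligned HBB:

```python
import math

def obb_to_corners(cx, cy, w, h, theta):
    """Corner points of an OBB given as center, width, height, rotation (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset about the center, then translate by (cx, cy)
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in offsets]

def obb_to_hbb(cx, cy, w, h, theta):
    """Smallest axis-aligned HBB (x1, y1, x2, y2) enclosing the rotated box."""
    xs, ys = zip(*obb_to_corners(cx, cy, w, h, theta))
    return (min(xs), min(ys), max(xs), max(ys))

# A 10 x 4 ship-like box rotated 30 degrees about its center
print(obb_to_hbb(50.0, 50.0, 10.0, 4.0, math.radians(30.0)))
```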
Methods | Visual Enc. | Text Enc./LLM | Params. | Training Set | Test Set | Pr@0.5 | mIoU |
---|---|---|---|---|---|---|---|
Transformer-Based Methods |||||||
GeoVG [11] | - | - | - | 26,991 | 7500 | 57.78 | - |
MGVLF [12] | ResNet-50 | BERT | 152.5 | 26,991 | 7500 | 76.78 | 68.04 |
LQVG [19] | ResNet-50 | BERT | 166.3 | 26,991 | 7500 | 83.41 | 74.02 |
APMOR [40] | ResNet-101 | BERT | - | 26,991 | 7500 | 79.37 | 68.86 |
Eff-Grounding DINO [41] | ResNet-50 | BERT | 169.3 | 26,991 | 7500 | 83.05 | 73.41
RINet [42] | DarkNet-53 | BERT | - | 26,991 | 7500 | 64.14 | - |
CrossVG [43] | ViT-B/16 | BERT | - | 26,991 | 7500 | 77.51 | 70.56 |
VGRSS [29] | ResNet-50 | BERT | - | 26,991 | 7500 | 83.01 | 74.85 |
MSANet [39] | DarkNet-53 | BERT | - | 26,991 | 7500 | 74.23 | 64.88
VSMR [44] | ResNet-50 | BERT | - | 15,328 | 19,160 | 78.24 | 68.88 |
QAMFN [45] | ResNet-50 | BERT | 128.4 | 15,328 | 19,160 | 81.67 | 71.48 |
MSVG [46] | ResNet-101 | BERT | - | 15,328 | 19,160 | 83.61 | 72.87 |
LPVA [20] | ResNet-50 | BERT | 156.2 | 15,328 | 19,160 | 82.27 | 72.35 |
FQRNet [55] | ResNet-50 | BERT | - | 15,328 | 19,160 | 77.23 | 68.35 |
CSDNet [47] | ResNet-101 | BERT | 154.64 | 27,133 | 7422 | 80.92 | 70.88 |
TACMT [23] | ResNet-50 | BERT | 150.9 | - | - | - | -
MLLM-Based Methods |||||||
GeoChat [14] | CLIP-ViT | Vicuna-v1.5 | ~7 B | GeoChat-Instruction | 555 | - | - |
SkyEyeGPT [48] | EVA-CLIP | LLaMA2 | ~7 B | SkyEye-968k | 7500 | 88.59 | - |
EarthGPT [49] | DINO-ViT + CLIP-ConvNeXt | LLaMA2 | ~7 B | MMRS-1M | 7500 | 76.65 | 69.34
SkySenseGPT [24] | CLIP-ViT | Vicuna-v1.5 | ~7 B | FIT-RS | - | - | - |
LHRS-Bot [50] | CLIP-ViT | LLaMA2 | ~7 B | LHRS-Instruct | 7500 | 88.10 | - |
VHM [51] | CLIP-ViT | Vicuna-v1.5 | ~7 B | VariousRS-Instruct | - | - | - |
GeoPix [52] | CLIP-ViT | Vicuna-v1.5 | ~7 B | GeoPixInstruct | - | - | - |
EarthDial [53] | InternViT | Phi-3-mini | ~4 B | EarthDial-Instruct | - | - | - |
GeoGround [15] | CLIP-ViT | Vicuna-v1.5 | ~7 B | refGeo | 7500 | 77.73 | - |
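For reference, the Pr@0.5 and mIoU columns above (along with the cumIoU metric defined in Section 2.3) can be computed from paired predicted and ground-truth boxes as in the minimal sketch below; the function names are our own, and real evaluations should follow each paper's released code.

```python
import numpy as np

def box_area(b):
    """Area of an HBB given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    """Intersection area of two HBBs."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def grounding_metrics(preds, gts, thr=0.5):
    """Pr@thr, meanIoU, and cumIoU over (prediction, ground truth) pairs."""
    inters = np.array([intersection(p, g) for p, g in zip(preds, gts)])
    unions = np.array([box_area(p) + box_area(g) for p, g in zip(preds, gts)]) - inters
    ious = inters / np.maximum(unions, 1e-9)
    return {
        f"Pr@{thr}": float((ious >= thr).mean()),      # share of samples with IoU >= thr
        "meanIoU": float(ious.mean()),                 # average of per-sample IoUs
        "cumIoU": float(inters.sum() / unions.sum()),  # total intersection / total union
    }

# One perfect hit and one complete miss -> Pr@0.5 = 0.5
print(grounding_metrics([(0, 0, 10, 10), (0, 0, 10, 10)],
                        [(0, 0, 10, 10), (20, 20, 30, 30)]))
```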
| Transformer-Based | MLLM-Based |
---|---|---|
Visual Encoder | ResNet-50/DarkNet-53 | ViT |
Text Encoder | BERT | - |
Methods | Scale Heterogeneity (S.H.) | Semantic Complexity (S.C.) | Annotation Scarcity (A.S.) |
---|---|---|---|
GeoVG [11] | √ | √ | × |
MGVLF [12] | √ | × | × |
VSMR [44] | √ | √ | × |
LQVG [19] | √ | √ | × |
QAMFN [45] | √ | √ | × |
MSVG [46] | √ | √ | × |
LPVA [20] | √ | √ | × |
FQRNet [55] | √ | √ | × |
MSANet [39] | √ | × | × |
CrossVG [43] | √ | √ | × |
APMOR [40] | × | √ | × |
TACMT [23] | √ | √ | × |
RINet [42] | √ | √ | × |
Eff-Grounding DINO [41] | √ | √ | √ |
VGRSS [29] | √ | √ | × |
CSDNet [47] | √ | √ | × |
GeoChat [14] | √ | √ | |
SkyEyeGPT [48] | √ | √ | |
EarthGPT [49] | √ | √ | √ |
SkySenseGPT [24] | √ | × | √ |
LHRS-Bot [50] | √ | × | √ |
VHM [51] | √ | × | √ |
GeoPix [52] | √ | √ | √ |
EarthDial [53] | √ | √ | √ |
GeoGround [15] | √ | × | √ |