Review

A Review of Visual Grounding on Remote Sensing Images

Space Information Academic, Space Engineering University, Beijing 101407, China
*
Authors to whom correspondence should be addressed.
Electronics 2025, 14(14), 2815; https://doi.org/10.3390/electronics14142815
Submission received: 24 May 2025 / Revised: 3 July 2025 / Accepted: 10 July 2025 / Published: 13 July 2025

Abstract

Remote sensing visual grounding, a pivotal technology bridging natural language and high-resolution remote sensing images, holds significant application value in disaster monitoring, urban planning, and related fields. However, it faces critical challenges due to the inherent scale heterogeneity, semantic complexity, and annotation scarcity of remote sensing data. This paper first reviews the development history of remote sensing visual grounding, providing an overview of the basic background knowledge, including fundamental concepts, datasets, and evaluation metrics. Then, it categorizes methods by whether they employ large language models as a foundation and provides in-depth analyses of the innovations and limitations of Transformer-based and multimodal large language model-based methods. Furthermore, focusing on the characteristics of remote sensing images, it discusses cutting-edge techniques such as cross-modal feature fusion, language-guided visual optimization, multi-scale and hierarchical feature processing, open-set expansion, and efficient fine-tuning. Finally, it outlines current bottlenecks and proposes valuable directions for future research. As the first comprehensive review dedicated to remote sensing visual grounding, this work serves as a reference resource for researchers to grasp domain-specific concepts and track the latest developments.

1. Introduction

The leapfrog advancement of remote sensing technology has propelled remote sensing image quality and resolution into the sub-meter-level era, where massive high-precision earth observation data provides unprecedented informational support for urban planning, environmental monitoring, national defense, and related domains. To overcome the efficiency bottleneck of traditional manual interpretation, researchers have dedicated efforts to establishing intelligent interaction bridges between natural language and remote sensing images. In remote sensing image captioning [1], natural language processing techniques enable the automated generation of precise and descriptive textual summaries, facilitating rapid comprehension of scene-level characteristics. For cross-modal image–text retrieval [2], alignment algorithms such as Contrastive Language-Image Pretraining (CLIP) [3] and Bootstrapping Language–Image Pretraining (BLIP) [4] empower bidirectional querying: retrieving images via textual descriptions or extracting textual metadata from visual content, thereby significantly enhancing data mining efficiency. In visual question answering (VQA) [5], algorithms demonstrate capabilities to resolve complex queries about object locations, area estimations, and other geospatial attributes within remote sensing scenes, offering intelligent decision-making support for resource management and disaster assessment. Despite these achievements, the more practical and challenging task of Remote Sensing Visual Grounding (RSVG) remains underexplored, constrained by theoretical and technical limitations.
RSVG aims to localize target objects in remote sensing images through natural language queries (phrases or sentences) and output corresponding bounding boxes. Unlike conventional object detection [6], which identifies all instances of predefined categories, RSVG simulates real-world referential dialogs, addressing complex localization demands in specialized scenarios. Compared to visual grounding in natural images, RSVG has three domain-specific characteristics:
  • Scale Heterogeneity: Remote sensing images span square-kilometer-scale urban clusters to sub-meter-scale individual targets, where small objects (e.g., ships, vehicles) coexist with large structures (e.g., airports, ports). This diversity challenges traditional detectors with fixed receptive fields.
  • Semantic Complexity: Due to resolution limits in remote sensing imagery, small targets often have complex textures and insufficient edge detail [7]. This causes bidirectional ambiguity in visual–language mapping, where the complexity of accurate semantic expression can lead to information loss and attention drift during feature extraction, while simple expressions can easily cause target confusion in dense scenes. Additionally, target extraction is heavily impacted by background interference.
  • Annotation Scarcity: Restricted data accessibility and expertise-dependent annotation result in significantly smaller datasets than natural-scene benchmarks, severely undermining model generalization capabilities.
While visual grounding in natural images has evolved over a decade, progressing from two-stage [8] proposal matching and one-stage [9] end-to-end regression through Transformer-based cross-modal encoding to the Multimodal Large Language Model (MLLM) [10], direct adaptation to remote sensing suffers significant performance degradation. The fundamental contradiction lies in the essential difference between natural images and remote sensing data: the former has prominent subjects and straightforward semantics, whereas the latter must parse compound semantics, and its dense, small targets are easily submerged in complex backgrounds. The introduction of Geospatial Visual Grounding (GeoVG) [11] in 2022 marked the initial effort to adapt visual grounding to remote sensing by constructing geospatial relational graphs to compress search spaces, albeit limited by coarse-grained feature alignment. Subsequent work [12] improved accuracy through multiscale cross-modal fusion but struggled with missed detections of small targets and inadequate parsing of complex linguistic descriptions.
As MLLMs break through the scaling law [13] of traditional machine learning, their hundreds of billions of parameters and massive cross-modal pretraining data inject new momentum into remote sensing visual grounding tasks that require cross-modal understanding. GeoChat [14] pioneered conversational interaction with high-resolution remote sensing images, enabling coordinate outputs via natural language instructions, though its precision lags behind dedicated models. GeoGround [15] unified horizontal bounding boxes (HBBs), oriented bounding boxes (OBBs), and segmentation masks through text-mask serialization, yet its computational complexity hinders real-time deployment.
Although Transformer-based and MLLM-based methods have achieved staged breakthroughs in remote sensing visual grounding, they still face three new types of challenges. First, the semantic fragmentation of multisource heterogeneous data may lead to inefficient cross-modal image–text alignment, and existing methods, which rely on an annotation system dominated by optical remote sensing images, struggle to support the generalization requirements of open-set scenarios. Second, static single-temporal-phase models cannot capture the spatiotemporal heterogeneity of dynamic processes such as flood inundation and urban expansion, and the decoupling of temporal semantic description from spatial localization capability severely restricts decision-making effectiveness in key applications such as disaster warning. Finally, the harsh resource constraints of spaceborne and airborne platforms conflict sharply with the demands of high-resolution image processing, so the real-time performance bottleneck of existing models urgently needs to be overcome. To address these challenges, this study systematically dissects three core issues (intelligent annotation of multisource data, spatiotemporal modeling for dynamic scenes, and lightweight edge deployment) and proposes knowledge-driven approaches, spatiotemporal coupling strategies, and computational resource coordination mechanisms, offering novel insights to advance RSVG research and practical implementation. Our contributions can be summarized as follows: First, we collected and tracked literature related to RSVG on Web of Science and Google Scholar, defined RSVG, traced its development history, and classified methods based on technical details. Then, we discussed RSVG benchmark datasets and evaluation metrics, analyzed and compared the performance of existing methods on classic datasets, and summarized research trends. Second, we focused on the characteristics of visual grounding tasks in the field of remote sensing and analyzed the innovations proposed by scholars to address these characteristics. Finally, we integrated current research challenges and provided valuable directions for future research to inspire subsequent researchers. To the best of our knowledge, this is the first systematic review of RSVG, and it aims to offer strategic and detailed references for researchers in this field.
This review systematically examines the technological framework of RSVG. To identify high-quality publications relevant to RSVG, we adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [16], as illustrated in Figure 1. We first collected relevant articles through Web of Science, Google Scholar, and IEEE Xplore, using keywords such as “remote sensing visual grounding,” “remote sensing referring expression comprehension,” and “remote sensing phrase grounding” to search for all articles up to 10 May 2025. We then established exclusion and inclusion criteria to screen records and full-text articles. The exclusion criteria were as follows:
  • Duplicate work.
  • Articles without publicly available reproducible code or peer review.
  • Articles where the full text could not be obtained from the publisher.
The inclusion criteria were as follows:
  • Articles written in English.
  • Images used in the study must be remote sensing images (RSI) and cannot be other types of images.
  • Methods that integrate complete sentence text (rather than individual or multiple discrete words).
A total of 62 papers and articles were initially identified. After applying the eligibility criteria and manually screening out redundant papers and articles, 28 papers and articles were ultimately selected for this review. The remaining sections are structured as follows: Section 2 outlines the background, including concept definition, mainstream datasets, and evaluation metrics. Section 3 deconstructs existing methodologies into Transformer-driven and MLLM-driven paradigms based on architectural differences, offering in-depth critiques of their design philosophies and performance limitations. Section 4 addresses domain-specific innovations in RSVG, exploring advancements in cross-modal feature fusion, language-guided visual optimization, multi-scale and hierarchical feature processing, open-set expansion, and efficient fine-tuning. Section 5 examines application scenarios, reveals the core challenges, and provides an outlook.

2. Background

2.1. Concept Definition

Remote sensing visual grounding, also known as remote sensing referring expression comprehension [17], aims to achieve accurate spatial localization of specific targets in remote sensing images through natural language descriptions. The technique maps unstructured textual instructions to bounding box coordinates in the image by establishing cross-modal associations between visual features and semantic descriptions.
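To make this task definition concrete, the following Python sketch shows the basic input/output contract described above: one remote sensing image plus one free-form referring expression in, one bounding box out. The class and function names are illustrative assumptions, not those of any published implementation.

```python
# Minimal sketch of the RSVG task interface (illustrative only; names are hypothetical).
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GroundingResult:
    box_xyxy: tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels
    score: float                                  # model confidence for the prediction


class RSVGModel(Protocol):
    def ground(self, image, expression: str) -> GroundingResult:
        """Map one free-form referring expression to one bounding box."""
        ...


# Usage: unlike closed-set detection, the query is open-vocabulary natural language.
# result = model.ground(image, "the white vehicle parked in the sunlight")
# print(result.box_xyxy, result.score)
```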
Compared with traditional object detection methods, remote sensing visual grounding is fundamentally different in its task paradigm. Traditional object detection [6] is usually based on supervised learning with a closed set of categories, and its performance is limited by the completeness of predefined categories and labeled data. Figure 2a shows the localization results of a ‘vehicle’ target in a remote sensing image. Remote sensing visual grounding instead takes open-vocabulary descriptions as input and can handle compound semantic queries, dynamically parsing implicit semantic constraints through multimodal feature fusion. As shown in Figure 2b, the visual grounding task can not only locate ‘vehicle’ targets but also filter the candidates down to the one that matches the modifiers ‘white’ and ‘in the sunlight’. This openness makes visual grounding more suitable for emergency response, environmental monitoring, and other fields that require flexible queries.
In particular, RSVG differs conceptually from Referring Remote Sensing Image Segmentation (RRSIS). While both involve remote sensing images and text descriptions, RSVG provides coarse target localization via bounding boxes, whereas RRSIS generates pixel-level masks for precise contour extraction, as shown in Figure 2c. Technically, RSVG focuses on cross-modal reasoning efficiency and often employs attention-based region proposal networks to quickly filter candidate regions through spatial relationship modeling, while referring image segmentation is oriented toward more accurate pixel-level feature extraction, which must address problems such as edge blurring and target adhesion and usually relies on language-guided cascade segmentation architectures to handle fine-grained features. Table 1 compares the significant differences among the three tasks of traditional object detection, visual grounding, and referring image segmentation.

2.2. Datasets and Benchmark

The construction and annotation of relevant datasets have become critical factors driving advancements in this field. Sun et al. established the first remote sensing visual grounding dataset, the RSVG dataset (RSVGD) [11], laying the foundation for research in this domain. DIOR-RSVG [12], created with manually validated automatic generation algorithms based on the large-scale object detection dataset Detection in Optical Remote Sensing (DIOR) [18], covers 20 target categories with high inter-class similarity and intra-class diversity, providing rich and reliable annotation data for visual grounding experiments. Lan et al. [19] proposed the RSVG-HR dataset, focusing on high-resolution remote sensing images, which contains 2650 image–text pairs. By re-annotating high-resolution images in the RSVGD dataset using an absolute-position and relative-position schema, RSVG-HR offers more challenging task scenarios. Li et al. [20] constructed the OPT-RSVG dataset by collecting 25,452 images from the High-Resolution Remote Sensing Detection (HRRSD) [21], DIOR, and swimming pool and car detection (SPCD) [22] datasets and creating 48,952 image–text pairs. It introduces more complex scenes, a wider spatial resolution span, and richer object categories, providing more challenging data resources for RSVG tasks. To address the limitation of datasets being confined to optical remote sensing images, Li et al. [23] proposed the visual grounding for high-resolution synthetic aperture radar images (SARVG) dataset, focusing on visual grounding for synthetic aperture radar (SAR) images. Containing 2465 high-resolution SAR images and 7617 image–text pairs, it provides a valuable resource for visual grounding studies of SAR images; however, the dataset contains only a single target category, the power transmission tower, so target diversity is lacking.
Current instruction-tuning datasets for multimodal large models are primarily constructed from established benchmarks such as RSVG, DIOR-RSVG, and OPT-RSVG. Notably, SkySenseGPT [24] extends visual grounding to object reasoning tasks, annotating over 210,000 targets in rotated bounding box format with coordinate precision elevated to the millimeter level (0.01), significantly advancing spatial localization accuracy. Benchmarks designed to evaluate multimodal large models are evolving toward more complex scenarios, safety-oriented requirements, and large-scale scene challenges. For instance, the Versatile vision-language Benchmark for Remote Sensing image understanding (VRSBench) [25] proposed by Li et al. supports both horizontal and rotated bounding box localization across 26 object categories. While the DIOR-based detection dataset contains an average of 3.3 instances per image, the DOTA-v2 [26] subset escalates instance density to 14.2 instances per image, creating highly challenging scenarios for dense small-object localization. The Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models (COREval) [27] addresses potential data leakage risks by constructing test sets from six satellite platforms, such as Landsat-8 and Sentinel-1/2, to ensure objective generalization assessment. Wang et al. [28] developed a benchmark for ultra-large-scale scenes, manually annotating 12,619 visually grounded instances with unique attributes (e.g., color, shape, position, size, and relative location) from ultra-high-resolution images with an average size of 8500 × 8500 pixels. Some objects in this dataset occupy as few as five pixels, establishing new standards for microscopic-level localization and anti-background interference testing in massive scenes. Table 2 provides a comparative analysis of the mainstream datasets for RSVG tasks, and Appendix B comprehensively documents the partitioning methodologies employed by the datasets cataloged in Table 2.

2.3. Evaluation Metrics

The evaluation framework for remote sensing visual grounding tasks typically adopts the classic assessment paradigm proposed by Zhan et al. [12]. Precision-based metrics (Pr@0.5, Pr@0.6, Pr@0.7, Pr@0.8, Pr@0.9) measure the proportion of samples for which the intersection-over-union (IoU) between the predicted bounding box and the ground-truth box exceeds thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9, respectively. These metrics intuitively reflect the model’s prediction accuracy under different overlap requirements. Additionally, mean intersection-over-union (meanIoU) calculates the arithmetic mean of IoU values over all image–language pairs in the dataset, as shown in Equation (1), evaluating the model’s localization capability across diverse scenarios. Cumulative intersection-over-union (cumIoU) is defined as the ratio of the total intersection area to the total union area across all samples, as shown in Equation (2), emphasizing the model’s performance under the overall data distribution.
$$\mathrm{meanIoU} = \frac{1}{M}\sum_{t=1}^{M}\frac{I_t}{U_t} \tag{1}$$
$$\mathrm{cumIoU} = \frac{\sum_{t} I_t}{\sum_{t} U_t} \tag{2}$$
where $t$ is the sample index, $M$ is the dataset size, and $I_t$ and $U_t$ denote the intersection and union areas between the predicted and ground-truth boxes, respectively.
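For clarity, the metrics above can be computed as in the following minimal sketch, which assumes axis-aligned boxes in (x1, y1, x2, y2) pixel format and treats a prediction as correct when its IoU exceeds the threshold; it is an illustration, not the official evaluation code of any benchmark.

```python
# Sketch of Pr@K, meanIoU, and cumIoU for axis-aligned (x1, y1, x2, y2) boxes.
import numpy as np


def box_intersection_union(pred, gt):
    """Return the intersection and union areas of two boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter, area_p + area_g - inter


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    inters, unions = zip(*(box_intersection_union(p, g) for p, g in zip(preds, gts)))
    ious = np.array(inters) / np.array(unions)
    pr_at = {f"Pr@{t}": float((ious > t).mean()) for t in thresholds}  # Equation-style Pr@K
    mean_iou = float(ious.mean())                                      # Equation (1)
    cum_iou = float(sum(inters) / sum(unions))                         # Equation (2)
    return pr_at, mean_iou, cum_iou


# Example with two image-expression pairs:
preds = [(10, 10, 50, 50), (0, 0, 100, 80)]
gts = [(12, 12, 48, 52), (5, 0, 100, 90)]
print(evaluate(preds, gts))
```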

3. Evolutionary Trajectory

Early visual grounding research for natural images began to emerge in 2014, with the proposal of two-stage [8] and one-stage [9] architectures using Long Short-Term Memory (LSTM) [30] as the language encoder and a Convolutional Neural Network (CNN) as the visual encoder to address cross-modal understanding requirements for visual grounding tasks. Two-stage visual grounding methods often treat visual grounding as an object retrieval task, identifying the object with the highest semantic match to the text expression from a set of object proposals. This approach consists of two stages: the first stage generates sparse object region proposals using object detection methods or unsupervised methods; the second stage matches object regions with text expressions to select the optimal object as the prediction result. A common practice is to encode candidate objects and text expressions using CNN and LSTM, respectively, and determine the output by calculating similarity. However, the second stage of the two-stage method relies heavily on the object detection results of the first stage. If the target region is not accurately detected in the first stage, the second stage cannot perform matching and localization effectively. Additionally, the computation of cross-modal similarity between numerous object region proposals and text expressions consumes substantial computational resources.
One-stage methods overcome the dependency on the first stage’s results through end-to-end training. This approach directly performs visual–language fusion in the intermediate layers of the object detector and outputs the bounding box with the highest score on predefined dense anchors. Inspired by YOLOv3 [31], Yang et al. [32] utilized DarkNet and Bidirectional Encoder Representations from Transformers (BERT) to extract visual and text features, respectively, and integrated text features into the YOLOv3 detector for visual grounding. Researchers have also made diverse attempts to improve model performance: Sadhu et al. [33] extended the task to zero-shot localization; Liao et al. [34] transformed visual grounding into a correlation filtering problem; Yang et al. [9] designed a recursive subquery construction module to handle long and complex sentences; Ye et al. [35] proposed a filter-based cross-modal fusion network to filter visual feature maps using structured knowledge and context. Although one-stage methods simplify the training process and reduce computational costs, they rely on manually designed mechanisms and complex modules for multimodal reasoning, making them prone to overfitting on scenario-specific datasets and lacking generality and generalization.
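As a minimal illustration of the two-stage matching step described above, the following sketch scores pre-extracted region proposals against a pooled expression feature and keeps the best match. The CNN and LSTM encoders of the cited works are stubbed out with random tensors, so all names and dimensions are assumptions.

```python
# Toy sketch of two-stage proposal-expression matching (features are stand-ins).
import torch
import torch.nn.functional as F


def pick_best_proposal(region_feats: torch.Tensor,  # (N, D), one vector per proposal
                       text_feat: torch.Tensor,      # (D,), pooled expression feature
                       boxes: torch.Tensor):         # (N, 4), proposal boxes
    # Cosine similarity between every proposal and the expression embedding.
    sims = F.cosine_similarity(region_feats, text_feat.unsqueeze(0), dim=-1)  # (N,)
    best = sims.argmax()
    return boxes[best], sims[best]


# Example with random stand-in features for 5 proposals:
boxes = torch.tensor([[10, 10, 40, 40], [50, 60, 90, 120], [0, 0, 20, 20],
                      [30, 35, 70, 80], [5, 90, 60, 140]], dtype=torch.float)
box, score = pick_best_proposal(torch.randn(5, 256), torch.randn(256), boxes)
print(box, score)
```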
In 2020, the Transformer model was first applied to image classification tasks [36] and achieved better results than CNN models. This inspired researchers to explore Transformer architectures for unifying visual–text feature representations, bringing new insights to cross-modal tasks such as visual grounding. Deng et al. [17] proposed a Transformer-based object coordinate regression model that performs cross-modal fusion through stacked Transformer encoder layers and predicts target locations; Du et al. [37] utilized text-guided self-attention to mine semantic information and improve coordinate regression performance. Transformer-based visual grounding methods surpass traditional approaches by utilizing attention mechanisms to develop efficient feature extraction modules. These modules are adept at capturing cross-scale visual information and improving contextual relationships [38], demonstrating high versatility and effectiveness. Owing to their improved performance in object detection tasks and wide applicability, Transformer-based methods have gradually replaced earlier two-stage and one-stage approaches.
Due to significant differences in target characteristics and scale between remote sensing images and natural images, both two-stage and one-stage methods exhibit limited applicability. Following the success of Transformer-based visual grounding in natural image processing, Sun et al. first adapted this approach for remote sensing in October 2022 [11], as shown in Figure 3. Later, Zhan et al. [12] advanced the field by releasing the DIOR-RSVG benchmark dataset and open-sourcing their multi-granularity visual language fusion (MGVLF) implementation, providing essential resources for deep learning-based visual grounding. Although RSGPT appeared in 2023 as the first multimodal large language model for remote sensing, it did not achieve region-level localization capabilities. A key breakthrough occurred in November 2023 with GeoChat [14], which successfully integrated visual grounding into the MLLM framework. This milestone laid the groundwork for subsequent MLLM-based methods, driving faster development and notable performance gains from 2024 onward.
This section categorizes remote sensing visual grounding methods into Transformer-based architectures and MLLM-based methods, depending on whether they utilize large language models as foundational backbones. Representative frameworks of both paradigms are introduced, with their architectural differences comparatively illustrated in Figure 4.

3.1. Transformer-Based Methods

Transformer has become the mainstream architecture for remote sensing visual grounding tasks due to its powerful parallel processing capabilities and modeling of long-range dependencies. Transformer-based methods typically use ResNet-50 or DarkNet-53 as the visual encoder and BERT as the language encoder during the encoding stage, as shown in Table 3. They construct visual–text interaction modules based on self-attention mechanisms to enhance alignment between visual and language modalities, thereby improving visual grounding accuracy. The basic architecture is shown in Figure 4a.
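The following sketch illustrates the generic fusion-and-regression pattern of Figure 4a: visual tokens, text tokens, and a learnable regression token pass through a Transformer encoder, after which a small head predicts a normalized box from the regression token. Layer sizes and token layouts are illustrative assumptions, not the configuration of any specific paper.

```python
# Minimal sketch of Transformer-based cross-modal fusion with a [REG] token.
import torch
import torch.nn as nn


class CrossModalGroundingHead(nn.Module):
    def __init__(self, dim=256, layers=6, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))          # learnable regression token
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, dim) from a CNN/ViT backbone; txt_tokens: (B, Nt, dim) from BERT
        b = vis_tokens.size(0)
        seq = torch.cat([self.reg_token.expand(b, -1, -1), vis_tokens, txt_tokens], dim=1)
        fused = self.encoder(seq)
        return self.box_head(fused[:, 0]).sigmoid()  # (B, 4) normalized (cx, cy, w, h)


head = CrossModalGroundingHead()
print(head(torch.randn(2, 400, 256), torch.randn(2, 20, 256)).shape)  # torch.Size([2, 4])
```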
Sun et al. [11] pioneered the introduction of visual grounding to the remote sensing field in 2022 with their GeoVG method, which uses a graph-based strategy to establish relationships between visual and language features. The numerical context module in the language encoder represents complex expressions as geospatial relationship graphs, reducing the search space in large-scale scenes and improving localization accuracy. However, creating relationship graphs requires high computational costs and tends to overlook relationships between small target objects. Wang et al. [39] proposed an autoregressive discrete coordinate sequence generation method to explore interactions between direct regression features and encoded multimodal features, achieving a mean IoU of 64.88% on RSVG. The novel frequency and query refinement network (FQRNet) [55] captures global structural information from remote sensing data using Fourier transforms and enhances spatial features with frequency information. Li et al. [23] designed a cross-modal encoder for SAR data in their TACMT model to strengthen text-guided scattering feature extraction, improving accuracy by 6.2% on the SARVG1.0 dataset.
In summary, Transformer-based methods exhibit strong feature extraction capabilities and flexible architectural designs, performing well in handling complex backgrounds and small-target localization. However, there is still room for optimization in multimodal feature fusion, addressing non-salient target features, and mitigating background interference.

3.2. MLLM-Based Methods

With the breakthrough of large language models (LLMs), multimodal large language models have provided new opportunities for automated analysis of earth observation data. By integrating visual and language modality information, MLLMs can more naturally process human language instructions and enhance open-scene understanding through large-scale data training and instruction tuning. General MLLMs support a wide range of tasks, including classification, detection, captioning, question answering, and visual reasoning. As an important downstream task, visual grounding has also achieved breakthroughs with the development of large models. These models typically rely on pretrained large language models and use specific visual encoders to align image and language features for unified multimodal task processing. Figure 4b illustrates the basic framework of MLLM-based visual grounding.
GeoChat [14], proposed by Kuckreja et al., first demonstrated remote sensing visual grounding capabilities in an MLLM. It uses CLIP-ViT as the visual backbone to align visual and language modalities, inserts positional encoding, and scales image input sizes to handle larger images. However, it has obvious limitations in localizing small targets, with an acc@0.5 value of only 2.9%. Zhan et al. [48] designed SkyEyeGPT, which maps remote sensing visual features to the language domain through a simple projection layer, significantly improving the precise localization of small objects.
Compared to the massive data scale in natural scenes that supports the development of multimodal large models, the lack of training data in remote sensing has become a bottleneck. Muhtar et al. [50] pretrained on the large 4-million-image–text-pair dataset LHRS-Align and fine-tuned on the 30,000-instruction-pair dataset LHRS-Instruct, combining multi-level visual–language alignment strategies to unleash the potential of MLLMs. EarthDial [53], proposed by Soni et al., supports natural language dialog for multispectral, multitemporal, and multiresolution remote sensing data, using 11.11 million instruction pairs containing RGB, Sentinel-2, SAR, near-infrared, and infrared data for comprehensive instruction tuning to achieve stronger generalization.
Additionally, some scholars have introduced visual prompt models or combined rotated bounding boxes and masks to localize finer-grained targets. SkySenseGPT [24] extends visual grounding to object reasoning tasks, using rotated bounding boxes to improve the accuracy of target fitting compared to traditional horizontal bounding boxes. EarthMarker [54], proposed by Zhang et al., first introduced visual prompt learning into remote sensing multimodal large models, allowing users to interact with AI using prompts such as boxes, points, and free-form shapes, breaking the limitations of language instructions and enhancing flexibility. GeoPix [52] extends visual grounding to the pixel level and introduces a Class-wise Learnable Memory (CLM) module to dynamically extract and store category-specific geographic context, improving the model’s understanding of diverse instances in complex remote sensing scenes. However, when user instructions for referring segmentation and visual context analysis accidentally include instance location queries, task interference can occur, causing errors in segmentation masks in visual grounding tasks, as shown in Figure 5. GeoGround [15] leverages the powerful multi-task learning capabilities of LLMs, combining prompt-assisted learning (PAL) and geometry-guided learning (GGL) to unify visual grounding tasks with OBB, HBB, and mask annotations, allowing flexible output choices.
MLLM-based methods facilitate alignment and interaction between visual and language modalities by embedding text and images into a unified semantic space. Built on large-scale pretraining data, they exhibit excellent zero-shot generation capabilities and open-domain adaptability. MLLMs often inherit the reasoning capabilities of LLMs, supporting deep semantic understanding and reasoning for complex language instructions, with more flexible output formats. The strong language generation and multimodal fusion capabilities of MLLMs provide new approaches for remote sensing visual grounding. Nonetheless, the substantial parameter count of these models imposes heavy demands on both training costs and cross-modal data acquisition. Furthermore, when LLMs are used for coordinate generation, their inherently autoregressive, sequential decoding conflicts with the parallel processing demanded by dense object detection. This conflict, together with high computational complexity and the mismatch between the model’s output structure and the requirements of object detection, constrains the performance of MLLMs in localizing small objects. Table 4 compares the Transformer-based and MLLM-based methods.
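To illustrate the coordinate-as-text interface used by MLLM-based grounding (Figure 4b), the following sketch serializes a box into a token-friendly string and parses generated text back into coordinates. The prompt wording and the 0-to-100 coordinate binning are assumptions for illustration; the exact format varies across models such as GeoChat and GeoGround.

```python
# Sketch of serializing and parsing bounding boxes as plain text for an MLLM.
import re


def format_answer(box, size=100):
    """Serialize a normalized (x1, y1, x2, y2) box into bracketed integer bins."""
    return "[" + ", ".join(str(round(v * size)) for v in box) + "]"


def parse_answer(text, size=100):
    """Recover normalized boxes from generated text."""
    boxes = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return [tuple(int(v) / size for v in b) for b in boxes]


instruction = "Locate the storage tank next to the road. Answer with a bounding box."
response = "The storage tank is at " + format_answer((0.12, 0.40, 0.18, 0.47))
print(response)                # ... [12, 40, 18, 47]
print(parse_answer(response))  # [(0.12, 0.4, 0.18, 0.47)]
```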

4. Characteristics and Innovations

Compared to generic visual grounding, the development of remote sensing visual grounding has consistently centered on three core characteristics: scale heterogeneity, semantic complexity, and annotation scarcity.

4.1. Scale Heterogeneity

Acquired from nadir satellite perspectives, remote sensing imagery exhibits extensive spatial coverage, capturing objects ranging from square-kilometer-scale landforms to sub-meter-level fine targets. This target diversity challenges conventional detectors with fixed receptive fields, which struggle to accommodate multiscale objects and suffer severe background noise interference. Consequently, researchers have pursued multiscale hierarchical feature processing to mitigate scale heterogeneity and background clutter in large-scale scenes, thereby expanding dynamic perceptual coverage. Zhan et al. [12] proposed an MGVLF module, which integrates multiscale visual features and multi-granularity textual embeddings to adaptively filter irrelevant noise, effectively tackling challenges posed by significant scale variations and cluttered backgrounds. Further advancing this approach, Wang et al. [39] introduced a Multistage Synergistic Aggregation Module (MSAM), achieving multi-scale contextual fusion through generative coordinate sequence prediction. This method attained an accuracy of 83.61% on the DIOR-RSVG dataset. To resolve confusion between targets and similar objects, Qiu et al. [40] developed a learnable attribute prompter that adaptively explores diverse attribute information based on common object characteristics in remote sensing images. RINet [42] adopted a local-to-object scheme to progressively localize target regions via a Regional Indication Generator (RIG), enhancing localization capabilities for small targets. Ding et al. [44] designed an Adaptive Feature Selection (AFS) module to suppress noise and combined it with a multistage decoder to iteratively infer target attributes, achieving a precision of 78.24% in complex scenarios. While innovations in multi-scale and hierarchical feature processing effectively mitigate scale variations and background interference, challenges persist in slow model convergence and insufficient validation of detection performance in high-density scenarios.
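As a simplified illustration of the multi-scale fusion idea shared by the modules above, the following sketch projects features from several backbone stages to a common width, gates them with a sentence embedding, and merges them at the finest resolution so that both small and large targets stay represented. It is a generic sketch under assumed channel sizes, not a reimplementation of MGVLF, MSAM, or AFS.

```python
# Generic text-gated multi-scale feature fusion (illustrative dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGatedMultiScaleFusion(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), dim=256, text_dim=768):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        self.gate = nn.Linear(text_dim, dim)

    def forward(self, feats, text_emb):
        # feats: list of (B, C_i, H_i, W_i) stage outputs; text_emb: (B, text_dim) sentence feature
        g = torch.sigmoid(self.gate(text_emb))[:, :, None, None]  # language-conditioned channel gate
        target = feats[0].shape[-2:]                              # finest spatial resolution
        fused = 0
        for f, proj in zip(feats, self.proj):
            f = proj(f) * g                                       # suppress channels irrelevant to the query
            fused = fused + F.interpolate(f, size=target, mode="bilinear", align_corners=False)
        return fused                                              # (B, dim, H_0, W_0)


m = TextGatedMultiScaleFusion()
feats = [torch.randn(2, 512, 80, 80), torch.randn(2, 1024, 40, 40), torch.randn(2, 2048, 20, 20)]
print(m(feats, torch.randn(2, 768)).shape)  # torch.Size([2, 256, 80, 80])
```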

4.2. Semantic Complexity

Remote sensing images contain diverse and numerous targets, with small objects such as ships, vehicles, and buildings constituting a significant proportion. A single target may correspond to varied textual descriptions, while identical expressions can refer to different objects. Precise semantic expressions are often lengthy and complex, frequently encountering issues like description forgetting and attention drift during feature extraction and matching. Conversely, oversimplified expressions risk ambiguity, particularly in large-scale scenes with densely clustered targets, where object confusion severely impedes precise localization. To enhance semantic consistency between visual and textual features, researchers employ multilevel alignment and dynamic interaction mechanisms to optimize cross-modal feature fusion. Early studies achieved coarse-grained alignment by constructing geospatial relation graphs but struggled to parse complex descriptions. Zhan et al. [12] proposed the MGVLF module to integrate multiscale visual features and multi-granularity textual embeddings. Lan et al. [19] introduced the Language Query-based Visual Grounding (LQVG) framework, which retrieves multiscale visual features using textual features as queries and incorporates a Multistage Cross-Modal Alignment (MSCMA) module to strengthen semantic correlations. Their approach achieved accuracies (Pr@0.5) of 83.41% and 87.37% on the DIOR-RSVG and RSVG-HR datasets, respectively, though with higher computational complexity. Choudhury et al. [43] simplified traditional multimodal fusion modules in their CrossVG model, relying solely on stacked Transformer encoder layers to achieve efficient cross-modality interaction. Multidimensional Semantic-Guidance Visual Grounding (MSVG) [45] further employed a Multidimensional Text–Image Alignment Module (MTAM) to increase the relevance between visual features and textual descriptions. Overall, cross-modality fusion strategies—such as constructing geospatial relation graphs and multi-granularity feature fusion—effectively narrow the search space and improve semantic matching capabilities.
Furthermore, language-guided visual optimization can mitigate attention drift and enhance focus on target regions by embedding textual semantics into the visual feature extraction process. Li et al. [45] proposed a Query-Guided Visual Attention (QGVA) module for the visual encoder, which dynamically focuses on language-described regions by injecting textual semantics into the visual encoding process. This method achieved a 4.98% accuracy improvement over MGVLF on the DIOR-RSVG dataset. LPVA [20] innovatively adopted a channel-spatial dual-dimensional dynamic weight adjustment strategy, combined with a Multilevel Feature Enhancement (MFE) decoder to suppress background interference and enhance feature distinctiveness. It achieved an accuracy of 82.27% on DIOR-RSVG, though its capability remains limited to single-object localization and it lacks proficiency in multi-target matching. For scenarios involving lengthy and complex linguistic expressions, RINet [42] introduced a word contribution learner to evaluate the importance of each word in the language description. Through iterative fine-tuning, it improved the comprehension of intricate linguistic information (Figure 6 illustrates the workflow of RINet). In summary, these methods leverage attention mechanisms and domain-specific training data to enhance the robustness of complex description parsing and to guide visual optimization.
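The following sketch captures the common core of language-guided visual optimization: visual tokens attend to the word embeddings so that regions matching the description are emphasized while the original visual content is preserved through a residual connection. It is inspired by, but not identical to, the QGVA and LPVA designs cited above, and all dimensions are assumptions.

```python
# Generic language-guided cross-attention over visual tokens (illustrative).
import torch
import torch.nn as nn


class LanguageGuidedAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, word_tokens, word_mask=None):
        # vis_tokens: (B, Nv, dim); word_tokens: (B, Nt, dim); word_mask: True marks padding words
        attended, _ = self.cross_attn(vis_tokens, word_tokens, word_tokens,
                                      key_padding_mask=word_mask)
        return self.norm(vis_tokens + attended)  # residual keeps the original visual content


layer = LanguageGuidedAttention()
out = layer(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 400, 256])
```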

4.3. Annotation Scarcity

Specialized datasets for RSVG are limited by domain knowledge and scarce data. Commonly used RSVG datasets, typically derived from annotated DIOR object detection benchmarks, contain at most 50,000 instances, significantly fewer than natural-scene counterparts such as RefCOCO (142,209 referring expressions for 50,000 objects). On one hand, the professional interpretation requirements of multimodal remote sensing imagery pose a significant challenge to annotation quality. For example, SAR targets exhibit varying signatures depending on incidence angle and polarization mode, requiring annotators to possess electromagnetic scattering knowledge and cross-modal interpretation skills, thereby escalating annotation costs. On the other hand, data access restrictions further exacerbate annotation scarcity: publicly available remote sensing data sources are primarily concentrated in civilian scenarios, while military or sensitive targets remain inaccessible due to security protocols, limiting dataset diversity and scenario coverage. This dual scarcity undermines model generalization for complex scenes or novel targets. Consequently, researchers explore open-set methods to expand candidate categories. Hu et al. adapted Grounding DINO to RSVG via a Multi-scale Image-to-Text Fusion Module (MSITFM) and Text Confidence Matching (TCM), optimizing cross-modal interaction and label assignment for robust open-world localization. Multimodal foundation models address scarcity by integrating multisensor data into unified instruction frameworks. EarthDial constructed an 11.11-million-pair multimodal instruction dataset, leveraging complementary cross-modal information from RGB, SAR, near-infrared, and other data to drive the model’s learning of universal visual–language mapping relationships and reduce reliance on single-modal annotations. GeoGround unifies HBB, OBB, and mask tasks through prompt-assisted and geometry-guided learning, converting annotations into reusable instruction–response pairs, as shown in Figure 7. While multitask interference remains a concern, these approaches establish new paradigms for mitigating RSVG data limitations. To more comprehensively analyze whether each RSVG model has been improved to address the three typical challenges of scale heterogeneity, semantic complexity, and annotation scarcity, Table 5 provides a systematic summary of the relevant content and clearly identifies the innovative features of each model.
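The conversion of existing box annotations into reusable instruction–response pairs mentioned above can be sketched as follows; the prompt templates and coordinate conventions are illustrative assumptions rather than GeoGround's actual templates.

```python
# Toy conversion of HBB/OBB annotations into instruction-tuning pairs (hypothetical templates).
def hbb_to_pair(expression, box, image_id):
    x1, y1, x2, y2 = box
    return {
        "image": image_id,
        "instruction": f"Output the horizontal bounding box of: {expression}",
        "response": f"[{x1}, {y1}, {x2}, {y2}]",
    }


def obb_to_pair(expression, obb, image_id):
    cx, cy, w, h, angle = obb
    return {
        "image": image_id,
        "instruction": f"Output the oriented bounding box of: {expression}",
        "response": f"[{cx}, {cy}, {w}, {h}, {angle}]",
    }


print(hbb_to_pair("the windmill near the field", (412, 230, 460, 288), "DIOR_04512.jpg"))
```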

5. Challenges and Outlook

Remote sensing visual grounding has achieved notable progress, yet critical bottlenecks persist in multisource data fusion, dynamic scene understanding, and edge computing adaptability. Future research needs to focus on the following directions to carry out an in-depth exploration:

5.1. Intelligent Annotation Agent for Multisource Heterogeneous Data

Current RSVG datasets predominantly rely on optical imagery, resulting in weak generalization across modalities such as SAR, hyperspectral, multispectral, and LiDAR point clouds. This limitation stems from the differences in physical properties among the data sources. For example, SAR images are formed by an electromagnetic scattering mechanism, so their texture features have no intuitive correspondence with RGB channels; hyperspectral data contain hundreds of highly redundant bands, which leads to the curse of dimensionality when fed directly into a model; and LiDAR point clouds accurately characterize three-dimensional geometric structure but lack semantic labels and exhibit significant alignment errors at large scales. Traditional manual annotation requires specialized knowledge and is time-consuming, and cross-modal image–text alignment relies on empirically designed rules, making it difficult to support the training needs of open-set models. Thus, a knowledge-driven [56] annotation agent can be designed that integrates geographic knowledge graphs with a domain-specific large language model to embed professional terminology into the text description generation process, and that synthesizes multisource data-text pairs with Diffusion [57] models to scale up training sets, satisfying the training demands of multimodal large models and exploring their potential in visual grounding tasks.
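As a toy illustration of the simplest building block of such an annotation agent, the following sketch generates a referring expression from an existing detection label using hand-written spatial templates; a knowledge-driven agent would replace these templates with a domain large language model and a geographic knowledge graph, and every rule here is a simplified assumption.

```python
# Template-based referring-expression generation from a detection label (toy rules).
def describe(target, others, image_w, image_h):
    cx = (target["box"][0] + target["box"][2]) / 2
    cy = (target["box"][1] + target["box"][3]) / 2
    horiz = "left" if cx < image_w / 3 else "right" if cx > 2 * image_w / 3 else "center"
    vert = "top" if cy < image_h / 3 else "bottom" if cy > 2 * image_h / 3 else "middle"
    same_class = sum(o["category"] == target["category"] for o in others)
    qualifier = "the only" if same_class == 0 else "the"
    return f"{qualifier} {target['category']} in the {vert}-{horiz} part of the image"


target = {"category": "storage tank", "box": [700, 120, 760, 180]}
others = [{"category": "ship", "box": [100, 500, 260, 560]}]
print(describe(target, others, 800, 800))  # "the only storage tank in the top-right part of the image"
```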

5.2. Cross-Temporal Perception Modeling for Dynamic Scenarios

Existing remote sensing visual grounding methods mainly focus on static single-temporal-phase data, which makes it difficult to capture dynamic processes such as flood inundation and urban sprawl and results in insufficient temporal correlation capability. This is mainly because multi-temporal remote sensing data often exhibit significant heterogeneity, while natural language lacks quantitative representation models for temporal semantics. Although TEOChat [58] attempts to construct a temporal vision–language large model to resolve temporal evolution patterns, it fails to establish an accurate mapping between spatiotemporal coordinates and textual commands. Future work could adopt memory-augmented networks with deformable temporal attention mechanisms to dynamically correlate historical feature maps with current observation data. Multitemporal image–text pairs with temporal adverbs could enhance cross-temporal tracking, enabling autonomous change-trajectory identification and disaster-spread prediction, ultimately achieving a “change perception–precise localization–decision feedback” loop.
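The following sketch outlines one possible form of the memory-augmented temporal attention suggested above, in which tokens from the current acquisition query a memory of earlier feature maps tagged with how long ago they were observed. It is purely conceptual and does not reproduce TEOChat or any published RSVG model.

```python
# Conceptual memory-augmented cross-temporal attention (illustrative only).
import torch
import torch.nn as nn


class TemporalMemoryAttention(nn.Module):
    def __init__(self, dim=256, heads=8, memory_len=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_embed = nn.Embedding(memory_len, dim)  # encodes "how long ago" each map was acquired

    def forward(self, current, memory):
        # current: (B, N, dim) tokens from the latest acquisition
        # memory:  (B, T, N, dim) tokens from up to T earlier acquisitions
        b, t, n, d = memory.shape
        mem = memory + self.time_embed(torch.arange(t, device=memory.device))[None, :, None, :]
        mem = mem.reshape(b, t * n, d)
        delta, _ = self.attn(current, mem, mem)  # temporal context gathered for each current token
        return current + delta


m = TemporalMemoryAttention()
print(m(torch.randn(2, 400, 256), torch.randn(2, 3, 400, 256)).shape)  # torch.Size([2, 400, 256])
```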

5.3. Edge Computing-Oriented Lightweight Deployment

High-resolution remote sensing image processing depends on large-parameter models to create effective representations. However, satellite and airborne platforms face multiple constraints, including limited computing resources, radiation-hardening requirements, and environmental adaptability, which severely restrict computational and storage capabilities. This makes supporting in-orbit training and real-time inference of complex models difficult. Research shows that even on specialized embedded AI hardware like the Jetson series or Movidius Myriad 2, inference times for unoptimized models are still too slow to meet the high-timeliness needs of tasks like disaster response. Although techniques like model quantization can reduce computational load, they often cause significant drops in accuracy (mAP). This limits their usefulness in scenarios with strict real-time demands, such as military reconnaissance and disaster monitoring. Future research must take a dual approach: first, systematically compress models, reduce computational complexity, and improve inference efficiency and accuracy retention on edge platforms by integrating optimization methods like lightweight network design [59], model pruning [60], knowledge distillation [61], and low-bit-width quantization [62]; second, develop a federated edge learning framework [63] based on low-orbit satellite constellations. Through multi-node collaboration and knowledge sharing, this approach aims to overcome the data sample and model generalization bottlenecks of single satellites, gradually building distributed core models. Only in this way can we bridge the gap between laboratory models and practical deployment, enabling a real-time perception-decision loop of “what you see is what you get” in extreme, dynamic scenarios like battlefield perception and rapid disaster area assessment.
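As a concrete example of one compression step listed above, the following sketch applies post-training dynamic quantization to the linear layers of a stand-in grounding head with PyTorch; real on-board deployment would combine this with pruning, distillation, and hardware-specific compilation, and the toy model below is not an RSVG network.

```python
# Post-training dynamic quantization of linear layers (CPU-only toy example).
import torch
import torch.nn as nn

model = nn.Sequential(              # stand-in for a small grounding head
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # weights stored as 8-bit integers
)

x = torch.randn(1, 256)
print(model(x).shape, quantized(x).shape)   # both torch.Size([1, 4])


def size_mb(m):
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6


# Quantized weights are packed inside the module, so only the fp32 size is reported here.
print(f"fp32 params: {size_mb(model):.2f} MB")
```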

6. Conclusions

In this review, we systematically track and summarize the research progress and future directions of remote sensing visual grounding technology over the last five years. First, we establish the basic concept of RSVG by comparing its scope with object detection and referring image segmentation, and we compile mainstream datasets such as DIOR-RSVG and RSVG-HR together with the corresponding evaluation metrics. Subsequently, the two major approaches to remote sensing visual grounding, namely Transformer-based methods and multimodal large language model-based methods, are introduced in detail and analyzed comparatively. In addition, we discuss in depth the innovative techniques of remote sensing visual grounding that target the key characteristics of remote sensing images. Finally, we summarize the challenges faced by research on remote sensing visual grounding and propose valuable directions for future work. This paper is suitable for both beginners and experienced researchers in the field of remote sensing visual grounding and serves as a valuable resource for tracking the latest research progress.

Author Contributions

Conceptualization, L.L., G.W. and G.S.; methodology, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); software, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); validation, All authors; formal analysis, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); investigation, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); resources, Z.W., L.L., G.W., G.S.; data curation, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); writing—original draft preparation, Z.W., L.L., W.Z., B.Z. and H.C.; writing—review and editing, All authors; visualization, Z.W. and X.L. (Xinyi Li); supervision, L.L., G.W. and G.S.; project administration, L.L., G.W. and G.S.; funding acquisition, L.L., G.W. and G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Committee Project, grant number 2024-YYTJ-QD-011-00 (Intelligent Compilation and Rapid Generation of Thematic Maps: Key Technology Research and Application). The APC was funded by the same grant.

Data Availability Statement

The data supporting the findings of this study are openly available in publicly accessible repositories, with details provided in the cited publications. Specific datasets and access links are as follows: RSVGD: Available at https://sunyuxi.github.io/publication/GeoVG (accessed on) (DOI: 10.1145/3503161.3548316). DIOR-RSVG: Hosted on GitHub at https://github.com/ZhanYang-nwpu/RSVG-pytorch (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2023.3250471). RSVG-HR: Accessible via GitHub at https://github.com/LANMNG/LQVG (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2024.3407598). OPT-RSVG: Available on GitHub at https://github.com/like413/OPT-RSVG (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2024.3423663). SARVG-T: Hosted on GitHub at https://github.com/CAESAR-Radi/TACMT (accessed on 1 May 2025) (DOI: 10.1016/j.isprsjprs.2025.02.022). RSSVG and SARVG-S: Hosted on GitHub at https://github.com/LwZhan-WUT/VGRSS (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2025.3562717). VRSBench: Available on GitHub at https://github.com/lx709/VRSBench (accessed on 1 May 2025) (DOI: 10.48550/arXiv.2406.12384). COREval: Accessible via arXiv at https://doi.org/10.48550/arXiv.2411.18145 (accessed on 1 May 2025). XLRSBench: Hosted at https://xlrs-bench.github.io/ (accessed on 1 May 2025) (DOI: 10.48550/arXiv.2503.23771).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RSVG: Remote Sensing Visual Grounding
GeoVG: Geospatial Visual Grounding
MSVG: Multidimensional Semantic-Guidance Visual Grounding
MGVLF: Multi-Granularity Visual Language Fusion
QGVA: Query-Guided Visual Attention
MFE: Multilevel Feature Enhancement
RIG: Regional Indication Generator
AFS: Adaptive Feature Selection
LQVG: Language Query-Based Visual Grounding
MSCMA: Multistage Cross-Modal Alignment
MTAM: Multidimensional Text–Image Alignment Module
TACMT: Text-Aware Cross-Modal Transformer
PEFT: Parameter-Efficient Fine-Tuning
MSITFM: Multi-Scale Image-to-Text Fusion Module
TCM: Text Confidence Matching
MB-ORES: Multi-Branch Object Reasoner for Visual Grounding
CLIP: Contrastive Language–Image Pretraining
BLIP: Bootstrapping Language–Image Pretraining
ViT: Vision Transformer
DETR: Detection Transformer
mAP: Mean Average Precision
GFLOPS: Giga Floating-Point Operations Per Second
LoRA: Low-Rank Adaptation
VQA: Visual Question Answering
LSTM: Long Short-Term Memory
CNN: Convolutional Neural Network
HBB: Horizontal Bounding Box
OBB: Oriented Bounding Box
RRSIS: Referring Remote Sensing Image Segmentation
DIOR: Detection in Optical Remote Sensing
HRRSD: High Resolution Remote Sensing Detection
SPCD: Swimming Pool and Car Detection
SAR: Synthetic Aperture Radar
VRSBench: Versatile Vision–Language Benchmark for Remote Sensing Image Understanding
COREval: Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision–Language Models
SOTA: State-of-the-Art
IoU: Intersection-over-Union
meanIoU: Mean Intersection-over-Union
cumIoU: Cumulative Intersection-over-Union
BERT: Bidirectional Encoder Representations from Transformers
FQRNet: Novel Frequency and Query Refinement Network
LLM: Large Language Model
LHRS: Language Helps Remote Sensing
PAL: Prompt-Assisted Learning
GGL: Geometry-Guided Learning
DOTA: Detection of Objects from Top-Down Perspectives
OFA: One for All
BitFit: Bias-Term Fine-Tuning

Appendix A

We have compiled the GitHub links corresponding to the Transformer-based and MLLM-based methods mentioned in Table 3, which are presented in Table A1. Among them, the algorithms GeoVG [11], FQRNet [55], VSMR [44], QAMFN [45], MSVG [46], CrossVG [43], APMOR [40], RINet [42], VGRSS [29], Eff-Grounding DINO [41], SkyEyeGPT [48], EarthGPT [49], and GeoGround [15] have no publicly available code.
Table A1. GitHub links for Transformer-based and MLLM-based methods (partial).
Methods | Link
MGVLF [12] | https://github.com/ZhanYang-nwpu/RSVG-pytorch (accessed on 1 May 2025)
LQVG [19] | https://github.com/LANMNG/LQVG (accessed on 1 May 2025)
LPVA [20] | https://github.com/like413/OPT-RSVG (accessed on 1 May 2025)
TACMT [23] | https://github.com/CAESAR-Radi/TACMT (accessed on 1 May 2025)
MSANet [39] | https://github.com/waynamigo/MSAM (accessed on 1 May 2025)
CSDNet [47] | https://github.com/WUTCM-Lab/CSDNet (accessed on 1 May 2025)
GeoChat [14] | https://github.com/mbzuai-oryx/geochat (accessed on 1 May 2025)
SkySenseGPT [24] | https://github.com/Luo-Z13/SkySenseGPT (accessed on 1 May 2025)
LHRS-Bot [50] | https://github.com/NJU-LHRS/LHRS-Bot (accessed on 1 May 2025)
EarthDial [53] | https://github.com/hiyamdebary/EarthDial (accessed on 1 May 2025)
VHM [51] | https://github.com/opendatalab/VHM (accessed on 1 May 2025)
GeoPix [52] | https://github.com/Norman-Ou/GeoPix (accessed on 1 May 2025)

Appendix B

This appendix comprehensively documents the partitioning methodologies employed by the datasets cataloged in Table 2. Critically, VRSBench, COREval, and XLRSBench function exclusively as evaluation-only benchmarks and are thus excluded from partitioning documentation. A consolidated summary appears in Table A2, with granular partition specifications for DIOR-RSVG, OPT-RSVG, and RSVG-HR detailed in Table A3, Table A4, and Table A5, respectively. Notably, DIOR-RSVG implements two distinct partitioning schemes, 4:1:5 and 7:1:2; Table A3 shows only the former.
Table A2. Summary of partition schemes.
Dataset | Training (%) | Validation (%) | Test (%)
DIOR-RSVG(1) [12] | 40 | 10 | 50
DIOR-RSVG(2) [12] | 70 | 10 | 20
RSVG-HR [19] | 80 | - | 20
OPT-RSVG [20] | 40 | 10 | 50
RSSVG [29] | 70 | 10 | 20
SARVG-T | 80 | 10 | 10
SARVG-S [29] | 70 | 10 | 20
Table A3. Training, validation, and test instance numbers for DIOR-RSVG (4:1:5).
No. | Class Name | Training | Validation | Test
C01 | vehicle | 2888 | 714 | 3559
C02 | dam | 401 | 91 | 518
C03 | airplane | 664 | 199 | 842
C04 | stadium | 471 | 119 | 591
C05 | overpass | 908 | 203 | 1090
C06 | ground track field | 984 | 223 | 1237
C07 | golf field | 426 | 91 | 523
C08 | baseball field | 1457 | 353 | 1800
C09 | basketball court | 510 | 139 | 637
C10 | tennis court | 611 | 133 | 765
C11 | expressway toll station | 443 | 108 | 561
C12 | expressway service area | 552 | 157 | 703
C13 | windmill | 1175 | 312 | 1466
C14 | bridge | 1027 | 285 | 1277
C15 | harbor | 238 | 49 | 291
C16 | train station | 351 | 92 | 447
C17 | airport | 494 | 143 | 646
C18 | chimney | 502 | 116 | 620
C19 | storage tank | 477 | 123 | 630
C20 | ship | 749 | 182 | 957
- | Total | 15,328 | 3832 | 19,160
Table A4. Training, validation, and test sample numbers for OPT-RSVG.
Table A4. Training, validation, and test sample numbers for OPT-RSVG.
No.Class NameTrainingValidationTest
C01airplane9792301142
C02ground track field16003652066
C03tennis court10932841313
C04bridge16994522212
C05basketball court10362631385
C06storage tank10502711264
C07ship10842431241
C08baseball diamond14773611744
C09T junction16634252055
C10crossroad16704052088
C11parking lot10492681368
C12harbor758209953
C13vehicle32948114083
C14swimming pool11283081563
-Total19,580489524,477
Table A5. Training and test sample numbers for RSVG-HR.
Table A5. Training and test sample numbers for RSVG-HR.
No.Class NameTrainingTest
C01Baseball field44381
C02Basketball court20138
C03Ground track field16350
C04Roundabout534122
C05Swimming pool15040
C06Storage tank16350
C07Tennis court497118
-Total2151499

References

  1. Zhao, B. A Systematic Survey of Remote Sensing Image Captioning. IEEE Access 2021, 9, 154086–154111. [Google Scholar] [CrossRef]
  2. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
  3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  4. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Volume 162, pp. 12888–12900. [Google Scholar]
  5. Wang, J.; Ma, A.; Chen, Z.; Zheng, Z.; Wan, Y.; Zhang, L.; Zhong, Y. EarthVQANet: Multi-Task Visual Question Answering for Remote Sensing Image Understanding. ISPRS J. Photogramm. Remote Sens. 2024, 212, 422–439. [Google Scholar] [CrossRef]
  6. Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote Sensing Object Detection Meets Deep Learning: A Metareview of Challenges and Advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar] [CrossRef]
  7. Xie, Y.; Liu, S.; Chen, H.; Cao, S.; Zhang, H.; Feng, D.; Wan, Q.; Zhu, J.; Zhu, Q. Localization, Balance, and Affinity: A Stronger Multifaceted Collaborative Salient Object Detector in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17. [Google Scholar] [CrossRef]
  8. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1307–1315. [Google Scholar]
  9. Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-Stage Visual Grounding by Recursive Sub-Query Construction. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 387–404. [Google Scholar]
  10. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  11. Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual Grounding in Remote Sensing Images. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; ACM: Lisboa, Portugal, 2022; pp. 404–412. [Google Scholar]
  12. Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  13. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
  14. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. GeoChat: Grounded Large Vision-Language Model for Remote Sensing. arXiv 2023, arXiv:2311.15826. [Google Scholar]
  15. Zhou, Y.; Lan, M.; Li, X.; Feng, L.; Ke, Y.; Jiang, X.; Li, Q.; Yang, X.; Zhang, W. GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv 2025, arXiv:2411.11904. [Google Scholar]
  16. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; The PRISMA Group. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med. 2009, 6, e1000097. [Google Scholar] [CrossRef]
  17. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1749–1759. [Google Scholar]
  18. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  19. Lan, M.; Rong, F.; Jiao, H.; Gao, Z.; Zhang, L. Language Query-Based Transformer with Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  20. Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  22. Swimming Pool and Car Detection. Available online: https://www.kaggle.com/datasets/kbhartiya83/swimming-pool-and-car-detection (accessed on 23 May 2025).
  23. Li, T. TACMT: Text-Aware Cross-Modal Transformer for Visual Grounding on High-Resolution SAR Images. ISPRS J. Photogramm. Remote Sens. 2025, 222, 152–166. [Google Scholar] [CrossRef]
  24. Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding. arXiv 2024, arXiv:2406.10100. [Google Scholar]
  25. Li, X.; Ding, J.; Elhoseiny, M. VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. arXiv 2024, arXiv:2406.12384. [Google Scholar]
  26. Ding, J.; Xue, N.; Xia, G.-S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796. [Google Scholar] [CrossRef]
  27. An, X.; Sun, J.; Gui, Z.; He, W. COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models. arXiv 2024, arXiv:2411.18145v1. [Google Scholar]
  28. Wang, F.; Wang, H.; Chen, M.; Wang, D.; Wang, Y.; Guo, Z.; Ma, Q.; Lan, L.; Yang, W.; Zhang, J.; et al. XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? arXiv 2025, arXiv:2503.23771. [Google Scholar]
  29. Chen, Y.; Zhan, L.; Zhao, Y.; Xiong, S.; Lu, X. VGRSS: Datasets and Models for Visual Grounding in Remote Sensing Ship Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–11. [Google Scholar] [CrossRef]
  30. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  32. Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4682–4692. [Google Scholar]
  33. Sadhu, A.; Chen, K.; Nevatia, R. Zero-Shot Grounding of Objects from Natural Language Queries. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4693–4702. [Google Scholar]
  34. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10877–10886. [Google Scholar]
  35. Ye, J.; Tian, J.; Yan, M.; Yang, X.; Wang, X.; Zhang, J.; He, L.; Lin, X. Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15481–15491. [Google Scholar]
  36. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 6877–6886. [Google Scholar]
  37. Du, Y.; Fu, Z.; Liu, Q.; Wang, Y. Visual Grounding with Transformers. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  38. Xie, Y.; Zhan, N.; Zhu, J.; Xu, B.; Chen, H.; Mao, W.; Luo, X.; Hu, Y. Landslide Extraction from Aerial Imagery Considering Context Association Characteristics. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103950. [Google Scholar] [CrossRef]
  39. Wang, F.; Wu, C.; Wu, J.; Wang, L.; Li, C. Multistage Synergistic Aggregation Network for Remote Sensing Visual Grounding. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  40. Qiu, H.; Wang, L.; Zhang, M.; Zhao, T.; Li, H. Attribute-Prompting Multi-Modal Object Reasoning Transformer for Remote Sensing Visual Grounding. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 9029–9032. [Google Scholar]
  41. Hu, Z.; Gao, K.; Zhang, X.; Yang, Z.; Cai, M.; Zhu, Z.; Li, W. Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14. [Google Scholar] [CrossRef]
  42. Hang, R.; Xu, S.; Liu, Q. A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
  43. Choudhury, S.; Kurkure, P.; Talwar, P.; Banerjee, B. CrossVG: Visual Grounding in Remote Sensing with Modality-Guided Interactions. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 2858–2862. [Google Scholar]
  44. Ding, Y.; Xu, H.; Wang, D.; Li, K.; Tian, Y. Visual Selection and Multistage Reasoning for RSVG. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  45. Li, C.; Zhang, W.; Bi, H.; Li, J.; Li, S.; Yu, H.; Sun, X.; Wang, H. Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  46. Ding, Y.; Wang, D.; Li, K.; Zhao, X.; Wang, Y. Visual Grounding of Remote Sensing Images with Multi-Dimensional Semantic-Guidance. Pattern Recognit. Lett. 2025, 189, 85–91. [Google Scholar] [CrossRef]
  47. Zhao, Y.; Chen, Y.; Yao, R.; Xiong, S.; Lu, X. Context-Driven and Sparse Decoding for Remote Sensing Visual Grounding. Inf. Fusion 2025, 123, 103296. [Google Scholar] [CrossRef]
  48. Zhan, Y.; Xiong, Z.; Yuan, Y. SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77. [Google Scholar] [CrossRef]
  49. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
  50. Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 440–457. [Google Scholar]
  51. Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.-S.; et al. VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis. arXiv 2024, arXiv:2403.20213. [Google Scholar] [CrossRef]
  52. Ou, R.; Hu, Y.; Zhang, F.; Chen, J.; Liu, Y. GeoPix: A Multimodal Large Language Model for Pixel-Level Image Understanding in Remote Sensing. IEEE Geosci. Remote Sens. Mag. 2025, 2–16. [Google Scholar] [CrossRef]
  53. Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues. arXiv 2025, arXiv:2412.15190. [Google Scholar]
  54. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Li, J.; Mao, X. EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–19. [Google Scholar] [CrossRef]
  55. Zhao, E.; Wan, Z.; Zhang, Z.; Nie, J.; Liang, X.; Huang, L. A Spatial-Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  56. Zhang, X.; Wu, C.; Zhang, Y.; Xie, W.; Wang, Y. Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images. Nat. Commun. 2023, 14, 4542. [Google Scholar] [CrossRef]
  57. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752. [Google Scholar]
  58. Irvin, J.A.; Liu, E.R.; Chen, J.C.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data. arXiv 2024, arXiv:2410.06234. [Google Scholar]
  59. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  60. Frantar, E.; Alistarh, D. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv 2023, arXiv:2301.00774. [Google Scholar]
  61. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge Distillation of Large Language Models. arXiv 2024, arXiv:2306.08543. [Google Scholar]
  62. Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-Bit Quantization of Neural Networks for Efficient Inference. arXiv 2019, arXiv:1902.06822. [Google Scholar]
  63. Xia, Q.; Ye, W.; Tao, Z.; Wu, J.; Li, Q. A Survey of Federated Learning for Edge Computing: Research Problems and Solutions. High-Confid. Comput. 2021, 1, 100008. [Google Scholar] [CrossRef]
Figure 1. Process of extracting relevant papers.
Figure 2. Comparison of three task definitions. (a) Object detection; (b) Visual grounding; (c) Referring image segmentation. The red box/mask marks the position of the target in the image.
Figure 3. An overview of the research progress in RSVG from the perspective of the technical roadmap. The corresponding citations for abbreviated works can be found in the main text [11,12,14,19,20,23,24,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54].
Figure 4. Comparison of two method architectures. (a) Architecture of traditional transformer-based methods; (b) Architecture of MLLM-based methods. The red box marks the position of the target in the image.
Figure 5. Failure case of GeoPix [52]: task confusion causes segmentation tokens to appear in visual grounding results, so the model outputs the corresponding instance mask when the user requests visual grounding.
Figure 6. Flowchart of RINet [42]. The features extracted by DarkNet-53 and BERT are fed into RIG to generate an initial regional indication map, which is used to produce high-resolution features via CAM for grounding the target object. These high-resolution features in turn refine the indication map through CG. Dashed arrows represent the flow of language information, and circular arrows indicate the fine-tuning path.
Figure 7. Architecture of GeoGround [15]. The CLIP-ViT visual encoder processes input images. Features are projected through a two-layer MLP connector and fed into the LLM along with language queries. For Referring Expression Comprehension (REC) tasks, the model outputs Horizontal Bounding Boxes (HBBs) or Oriented Bounding Boxes (OBBs). For Referring Expression Segmentation (RES) tasks, it generates segmentation masks. Additionally, the architecture supports multi-object localization beyond single-target outputs.
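To make the data flow in Figure 7 concrete, the following Python sketch (PyTorch-style) illustrates how visual tokens from a CLIP-ViT encoder could be projected through a two-layer MLP connector and concatenated with language-query embeddings before an LLM generates bounding-box text. The module names, dimensions, output format, and the llm_decode placeholder are illustrative assumptions, not GeoGround's released implementation.

import torch
import torch.nn as nn

class TwoLayerConnector(nn.Module):
    """Illustrative two-layer MLP mapping visual tokens into the LLM embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_patches, vis_dim) produced by a frozen CLIP-ViT encoder
        return self.proj(vis_tokens)

def ground(vis_tokens, query_embeds, connector, llm_decode):
    """Hypothetical grounding step: fuse projected visual tokens with the tokenized
    referring expression and let the LLM emit a textual bounding box."""
    fused = torch.cat([connector(vis_tokens), query_embeds], dim=1)
    # The exact serialization of the HBB/OBB (e.g., "[120, 48, 311, 190]") is an assumption.
    return llm_decode(fused)

The key design point suggested by Figure 7 is that localization is expressed as ordinary text generation, so the same interface can emit horizontal boxes, oriented boxes, or compressed mask descriptions without changing the visual pathway.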
Table 1. A comparison of traditional object detection, visual grounding, and referring image segmentation.
| Traditional Object Detection | Visual Grounding | Referring Image Segmentation
Input Modality | visual modality | visual modality + linguistic modality | visual modality + linguistic modality
Output Form | bounding boxes | bounding boxes | pixel-level masks
Target Definition | Predefined closed categories | Open-vocabulary descriptions | Fine-grained semantic descriptions
Table 2. Major datasets for remote sensing visual grounding tasks.
Datasets | Type | Image Sources | Ann. Format | Total Images | Total Objects | Avg. Length | Image Size
RSVGD [11] | RGB | DIOR | HBB | 4239 | 7933 | 28.33 | 1024 × 1024
DIOR-RSVG [12] | RGB | DIOR | HBB | 17,402 | 38,320 | 7.47 | 800 × 800
RSVG-HR [19] | RGB | DIOR | HBB | 2650 | 2650 | 19.6 | 1024 × 1024
OPT-RSVG [20] | RGB | HRRSD, DIOR, SPCD | HBB | 25,452 | 48,952 | 10.10 | -
RSSVG [29] | RGB | FAIR1M, CGWX, DIOR-RSVG | HBB | 11,157 | 25,237 | 9.77 | -
SARVG-T [23] | SAR | CAPELLA, GF-3, ICEYE SAR | HBB | 2465 | 7617 | - | 512 × 512
SARVG-S [29] | SAR | SAR-ship-Dataset | HBB | 43,798 | 54,429 | 7.72 | -
Benchmark
VRSBench [25] | RGB | DOTA-v2, DIOR | OBB | 29,614 | 52,472 | 14.31 | 512 × 512
COREval [27] | RGB | Google Earth | HBB/OBB | - | 200 | - | 800 × 800
XLRSBench [28] | RGB | DOTA-v2, ITCVD | OBB | - | 12,619 | - | 8500 × 8500
Since both Li et al. [23] and Chen et al. [29] proposed SARVG datasets, they are distinguished here as SARVG-T, whose targets are power transmission towers, and SARVG-S, whose targets are ships.
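To illustrate the two annotation formats listed in Table 2, the short Python sketch below represents an oriented bounding box (OBB) by its centre, size, and rotation, and derives the smallest horizontal bounding box (HBB) that encloses it. The parameterization and function names are assumptions for illustration and are not tied to any particular dataset's loader.

import math

def obb_corners(cx, cy, w, h, angle_rad):
    """Corner points of an oriented box given centre, width, height, and rotation."""
    dx, dy = w / 2.0, h / 2.0
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a)
            for x, y in [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]]

def obb_to_hbb(cx, cy, w, h, angle_rad):
    """Smallest axis-aligned box (xmin, ymin, xmax, ymax) enclosing the OBB."""
    xs, ys = zip(*obb_corners(cx, cy, w, h, angle_rad))
    return min(xs), min(ys), max(xs), max(ys)

# Example: a 40 x 20 box rotated by 30 degrees around (100, 100).
print(obb_to_hbb(100, 100, 40, 20, math.radians(30)))

The conversion is lossy in one direction only: an HBB can always be recovered from an OBB, which is why OBB-annotated benchmarks such as VRSBench can also be evaluated with HBB-based metrics.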
Table 3. Summary of Transformer-based and MLLM-based methods: architectural configurations, dataset, and performance on the DIOR-RSVG benchmark. Appendix A compiles the GitHub links for the Transformer-based and MLLM-based methods.
Methods | Visual Enc. | Text Enc./LLM | Params. | Training Set | Test Set | Pr@0.5 | mIoU
  • Transformer-based methods
GeoVG [11] | - | - | - | 26,991 | 7500 | 57.78 | -
MGVLF [12] | ResNet-50 | BERT | 152.5 | 26,991 | 7500 | 76.78 | 68.04
LQVG [19] | ResNet-50 | BERT | 166.3 | 26,991 | 7500 | 83.41 | 74.02
APMOR [40] | ResNet-101 | BERT | - | 26,991 | 7500 | 79.37 | 68.86
Eff-Grounding DINO [41] | ResNet-50 | BERT | 169.3 | 26,991 | 7500 | 83.05 | 73.41
RINet [42] | DarkNet-53 | BERT | - | 26,991 | 7500 | 64.14 | -
CrossVG [43] | ViT-B/16 | BERT | - | 26,991 | 7500 | 77.51 | 70.56
VGRSS [29] | ResNet-50 | BERT | - | 26,991 | 7500 | 83.01 | 74.85
MSANet [39] | DarkNet-53 | BERT | - | 26,991 | 7500 | 74.23 | 64.88
VSMR [44] | ResNet-50 | BERT | - | 15,328 | 19,160 | 78.24 | 68.88
QAMFN [45] | ResNet-50 | BERT | 128.4 | 15,328 | 19,160 | 81.67 | 71.48
MSVG [46] | ResNet-101 | BERT | - | 15,328 | 19,160 | 83.61 | 72.87
LPVA [20] | ResNet-50 | BERT | 156.2 | 15,328 | 19,160 | 82.27 | 72.35
FQRNet [55] | ResNet-50 | BERT | - | 15,328 | 19,160 | 77.23 | 68.35
CSDNet [47] | ResNet-101 | BERT | 154.64 | 27,133 | 7422 | 80.92 | 70.88
TACMT [23] | ResNet-50 | BERT | 150.9 | - | - | - | -
  • MLLM-based methods
GeoChat [14] | CLIP-ViT | Vicuna-v1.5 | ~7 B | GeoChat-Instruction | 555 | - | -
SkyEyeGPT [48] | EVA-CLIP | LLaMA2 | ~7 B | SkyEye-968k | 7500 | 88.59 | -
EarthGPT [49] | DINO-ViT + CLIP-ConvNeXt | LLaMA2 | ~7 B | MMRS-1M | 7500 | 76.65 | 69.34
SkySenseGPT [24] | CLIP-ViT | Vicuna-v1.5 | ~7 B | FIT-RS | - | - | -
LHRS-Bot [50] | CLIP-ViT | LLaMA2 | ~7 B | LHRS-Instruct | 7500 | 88.10 | -
VHM [51] | CLIP-ViT | Vicuna-v1.5 | ~7 B | VariousRS-Instruct | - | - | -
GeoPix [52] | CLIP-ViT | Vicuna-v1.5 | ~7 B | GeoPixInstruct | - | - | -
EarthDial [53] | InternViT | Phi-3-mini | ~4 B | EarthDial-Instruct | - | - | -
GeoGround [15] | CLIP-ViT | Vicuna-v1.5 | ~7 B | refGeo | 7500 | 77.73 | -
For Transformer-based methods, Training set refers to the number of training instances from DIOR-RSVG. For MLLM-based methods, Training set refers to the fine-tuning dataset used.
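For readers unfamiliar with the two metrics in Table 3: Pr@0.5 is the fraction of test expressions whose predicted box overlaps the ground-truth box with IoU ≥ 0.5, and mIoU is the mean IoU over all test samples. The following minimal Python sketch computes both for horizontal boxes in (xmin, ymin, xmax, ymax) format, assuming exactly one predicted box per referring expression; it is an illustration of the standard definitions rather than any paper's evaluation script.

def iou(box_a, box_b):
    """IoU of two horizontal boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def evaluate(preds, gts, threshold=0.5):
    """Return (Pr@threshold, mIoU) over paired predicted and ground-truth boxes."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    precision = sum(v >= threshold for v in ious) / len(ious)
    mean_iou = sum(ious) / len(ious)
    return precision, mean_iou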
Table 4. Comparative analysis of Transformer-based vs. MLLM-based approaches.
| Transformer-Based | MLLM-Based
Visual Encoder | ResNet-50/DarkNet-53 | ViT
Text Encoder | BERT | -
Strengths | Computational efficiency; task specificity; low training cost | Strong cross-modal alignment; open-world adaptability; advanced semantic reasoning
Weaknesses | Limited generalization; shallow semantic understanding; coarse alignment | High resource demands; localization constraints from text regression; data dependency
Table 5. Comparative analysis of innovations across RSVG methods.
Methods | S.H. | S.C. | A.S. | Innovations
GeoVG [11]×
  • The numeric context module represents complex expressions as geospatial relation graphs.
  • The adaptive region attention module extracts key visual content.
MGVLF [12]××
  • Multi-scale visual features and multi-granularity text embeddings are utilized to learn more discriminative representations.
  • Irrelevant noise is adaptively filtered, and salient features are enhanced.
VSMR [44]×
  • The multimodal enhancer and adaptive feature selection module focuses visual feature attention on language-related regions.
  • The multistage decoder (MSD) reduces ambiguity in reasoning by continuously considering visual and language information and performing iterative queries.
LQVG [19]×
  • Sentence-level text features are utilized as language query features for target retrieval.
  • The MSCMA module enhances semantic relevance.
QAMFN [45]×
  • The QGVA mechanism enhances visual features.
  • A text-semantic attention-guided masking (TAM) module filters redundant information.
MSVG [46]×
  • The MTAM enhances the correlation between visual features and text descriptions.
  • A visual enhancement fusion module (VEFM) strengthens feature relevance through contextual information.
  • Multistage decoding achieves final feature fusion and visual grounding.
LPVA [20]×
  • A progressive attention (PA) module dynamically generates multi-scale weights and biases to enable the visual backbone to gradually focus on features related to language expressions.
  • The MFE decoder aggregates visual contextual information to enhance the distinctiveness of target object features.
FQRNet [55]×
  • A spatial–frequency fusion strategy based on language query refinement addresses challenges of scale variation and blurred boundaries.
  • A frequency-guided spatial (FGS) module enhances spatial representation using spectral features.
  • A query-aware original attention (QOA) mechanism enables deep multimodal fusion.
MSANet [39]××
  • The MSAM aggregates multi-scale contextual information through a stacking strategy.
  • A generative paradigm is introduced to directly generate discrete coordinate sequences, enhancing interaction between the regression process and encoded features.
CrossVG [43]×
  • A cross-modal guidance encoder (CMGE) uses visual features to guide multi-granularity text embeddings.
  • A cross-modal decoder explores word-level attributes to improve target recognition accuracy.
APMOR [40]××
  • A learnable attribute prompter dynamically explores rich attribute information in remote sensing images.
  • An attribute-prompting multimodal fusion encoder establishes fine-grained interaction between visual and language features.
  • A multimodal progressive object reasoning decoder gradually queries more comprehensive object features.
TACMT [23]×
  • A text-aware query selection module optimizes decoder queries.
  • A cross-scale fusion module handles features of different scales.
RINet [42]×
  • A local-to-object strategy is adopted to locate target regions via a regional indication generator.
  • A word contribution learner evaluates the importance of each word in language expressions.
  • A multi-round fine-tuning process fully utilizes complex language information.
Eff-Grounding DINO [41]
  • A multi-scale image-to-text fusion module (MSITFM) updates text features via self-attention and uses scale-specific cross-attention for multi-scale visual feature fusion to reduce learning complexity.
  • A text confidence matching (TCM) mechanism introduces IoU-based confidence in label assignment to reduce mismatches.
VGRSS [29]×
  • The Language-Guided Visual Feature Enhancement (LVFE) module enhances visual features through text guidance before feature fusion to address the problem of insufficient utilization of text information.
  • The Visual–Language Fusion (VLF) module preserves spatial information through non-compressive stacking fusion and residual mechanisms.
  • The EIoU loss function is introduced into bounding box regression, and geometric constraints are utilized to improve the convergence accuracy of the model, which is especially suitable for the multi-scale characteristics of ship targets.
CSDNet [47]×
  • The Text-aware Fusion Module (TFM) modulates visual features using textual cues aggregated from image context to reduce target feature confusion.
  • The Context-Enhanced Interaction Module (CIM) harmonizes the differences between visual and textual features by modeling multimodal contexts.
  • The Text-Guided Sparse Decoder (TSD) addresses the issue of surface information redundancy.
GeoChat [14]
  • Supports image-level and region-level conversations.
SkyEyeGPT [48]
  • A high-quality RS instruction fine-tuning dataset with 968,000 instances enhances instruction fine-tuning of different granularities via a two-stage adjustment method.
EarthGPT [49]
  • The visual-enhanced perception mechanism refines and integrates coarse-scale semantic and detailed perceptual information.
  • The cross-modal mutual comprehension method enhances interactions between visual perception and language understanding to deepen multimodal comprehension.
  • Optimization is conducted using the unified instruction-following dataset MMRS-1M.
SkySenseGPT [24]×
  • Fine-grained instruction tuning with the high-quality FIT-RS dataset significantly improves the complex scene comprehension ability of remote sensing multimodal models.
LHRS-Bot [50]×
  • The large-scale weakly-labeled LHRS-Align dataset trains the visual perception module in the pretraining stage, followed by multi-task and instruction fine-tuning.
VHM [51]×
  • The large-scale high-quality VersaD dataset has detailed in-context examples, coupled with a fine-grained prompt framework and quality inspection mechanism.
GeoPix [52]
  • The class-wise learnable memory (CLM) module stores and retrieves intra-class shared geographic context to enhance model understanding of diverse instances in complex RS scenes.
  • The two-stage training strategy mitigates conflicts between generation and segmentation tasks.
EarthDial [53]
  • The adaptive high-resolution module meets the requirements of high-resolution RS imagery.
  • The data fusion module processes multi-band or multi-temporal data streams.
  • The three-stage training strategy integrates RGB pretraining, temporal fine-tuning, and multi-band optimization.
GeoGround [15]×
  • The text-mask paradigm compresses mask information into compact text sequences for efficient learning by VLMs.
  • Hybrid supervision integrates PAL and GGL to fine-tune models using three types of signals.
S.H.: Scale Heterogeneity; S.C.: Semantic Complexity; A.S.: Annotation Scarcity. √ indicates that the model has been improved for the corresponding issue, while × indicates no improvement.
