Review

A Review of Visual Grounding on Remote Sensing Images

Space Information Academic, Space Engineering University, Beijing 101407, China
*
Authors to whom correspondence should be addressed.
Electronics 2025, 14(14), 2815; https://doi.org/10.3390/electronics14142815
Submission received: 24 May 2025 / Revised: 3 July 2025 / Accepted: 10 July 2025 / Published: 13 July 2025

Abstract

Remote sensing visual grounding, a pivotal technology bridging natural language and high-resolution remote sensing images, holds significant application value in disaster monitoring, urban planning, and related fields. However, it faces critical challenges due to the inherent scale heterogeneity, semantic complexity, and annotation scarcity of remote sensing data. This paper first reviews the development history of remote sensing visual grounding, providing an overview of the basic background knowledge, including fundamental concepts, datasets, and evaluation metrics. Then, it categorizes methods by whether they employ large language models as a foundation and provides in-depth analyses of the innovations and limitations of Transformer-based and multimodal large language model-based methods. Furthermore, focusing on the characteristics of remote sensing images, it discusses cutting-edge techniques such as cross-modal feature fusion, language-guided visual optimization, multi-scale and hierarchical feature processing, open-set expansion, and efficient fine-tuning. Finally, it outlines current bottlenecks and proposes valuable directions for future research. As the first comprehensive review dedicated to remote sensing visual grounding, this work serves as a reference resource for researchers to grasp domain-specific concepts and track the latest developments.

1. Introduction

The leapfrog advancement of remote sensing technology has propelled remote sensing image quality and resolution into the sub-meter-level era, where massive high-precision earth observation data provides unprecedented informational support for urban planning, environmental monitoring, national defense, and related domains. To overcome the efficiency bottleneck of traditional manual interpretation, researchers have dedicated efforts to establishing intelligent interaction bridges between natural language and remote sensing images. In remote sensing image captioning [1], natural language processing techniques enable the automated generation of precise and descriptive textual summaries, facilitating rapid comprehension of scene-level characteristics. For cross-modal image–text retrieval [2], alignment algorithms such as Contrastive Language-Image Pretraining (CLIP) [3] and Bootstrapping Language–Image Pretraining (BLIP) [4] empower bidirectional querying: retrieving images via textual descriptions or extracting textual metadata from visual content, thereby significantly enhancing data mining efficiency. In visual question answering (VQA) [5], algorithms demonstrate capabilities to resolve complex queries about object locations, area estimations, and other geospatial attributes within remote sensing scenes, offering intelligent decision-making support for resource management and disaster assessment. Despite these achievements, the more practical and challenging task of Remote Sensing Visual Grounding (RSVG) remains underexplored, constrained by theoretical and technical limitations.
RSVG aims to localize target objects in remote sensing images through natural language queries (phrases or sentences) and output corresponding bounding boxes. Unlike conventional object detection [6], which identifies all instances of predefined categories, RSVG simulates real-world referential dialogs, addressing complex localization demands in specialized scenarios. Compared to visual grounding in natural images, RSVG has three domain-specific characteristics:
  • Scale Heterogeneity: Remote sensing images span square-kilometer-scale urban clusters to sub-meter-scale individual targets, where small objects (e.g., ships, vehicles) coexist with large structures (e.g., airports, ports). This diversity challenges traditional detectors with fixed receptive fields.
  • Semantic Complexity: Due to resolution limits in remote sensing imagery, small targets often have complex textures and insufficient edge detail [7]. This causes bidirectional ambiguity in visual–language mapping, where the complexity of accurate semantic expression can lead to information loss and attention drift during feature extraction, while simple expressions can easily cause target confusion in dense scenes. Additionally, target extraction is heavily impacted by background interference.
  • Annotation Scarcity: Restricted data accessibility and expertise-dependent annotation result in significantly smaller datasets than natural-scene benchmarks, severely undermining model generalization capabilities.
While visual grounding in natural images has evolved over a decade, progressing from two-stage [8] proposal matching and one-stage [9] end-to-end regression through Transformer-based cross-modal encoding to the Multimodal Large Language Model (MLLM) [10], direct adaptation to remote sensing suffers significant performance degradation. The fundamental contradiction lies in the essential difference between natural images and remote sensing data: the former has prominent subjects and straightforward semantics, whereas the latter must parse compound semantics, and its dense, small targets are easily submerged in complex backgrounds. The introduction of Geospatial Visual Grounding (GeoVG) [11] in 2022 marked the initial effort to adapt visual grounding to remote sensing by constructing geospatial relational graphs to compress search spaces, albeit limited by coarse-grained feature alignment. Subsequent work [12] improved accuracy through multiscale cross-modal fusion but struggled with missed detections of small targets and inadequate parsing of complex linguistic descriptions.
As MLLMs break through the scaling law [13] of traditional machine learning, their hundreds of billions of parameters and massive cross-modal pretraining data inject new momentum into remote sensing visual grounding tasks that require cross-modal understanding. GeoChat [14] pioneered conversational interaction with high-resolution remote sensing images, enabling coordinate outputs via natural language instructions, though its precision lags behind dedicated models. GeoGround [15] unified horizontal bounding boxes (HBBs), oriented bounding boxes (OBBs), and segmentation masks through text-mask serialization, yet its computational complexity hinders real-time deployment.
Although Transformer-based and MLLM-based methods have achieved staged breakthroughs in remote sensing visual grounding, they still face three new types of challenges. First, the semantic fragmentation of multisource heterogeneous data may lead to inefficient cross-modal image–text alignment, and existing methods, which rely on an annotation system dominated by optical remote sensing images, struggle to support the generalization requirements of open-set scenarios. Second, static single-temporal-phase models cannot capture the spatiotemporal heterogeneity of dynamic processes such as flood inundation and urban expansion, and the decoupling of temporal semantic description from spatial localization capability severely restricts decision-making effectiveness in key applications such as disaster warning. Finally, the harsh resource constraints of spaceborne and airborne platforms conflict sharply with the demands of high-resolution image processing, so the real-time performance bottleneck of existing models urgently needs to be overcome. To address these challenges, this study systematically dissects three core issues (intelligent annotation of multisource data, spatiotemporal modeling for dynamic scenes, and lightweight edge deployment) and proposes knowledge-driven approaches, spatiotemporal coupling strategies, and computational resource coordination mechanisms, offering novel insights to advance RSVG research and practical implementation. Our contributions can be summarized as follows: First, we collected and tracked literature related to RSVG on Web of Science and Google Scholar, defined RSVG, traced its development history, and classified methods based on technical details. Then, we discussed RSVG benchmark datasets and evaluation metrics, analyzed and compared the performance of existing methods on classic datasets, and summarized research trends. Second, we focused on the characteristics of visual grounding tasks in the field of remote sensing and analyzed the innovations proposed by scholars to address these characteristics. Finally, we integrated current research challenges and provided valuable directions for future research to inspire subsequent researchers. To the best of our knowledge, this is the first systematic review of RSVG, and it aims to offer strategic and detailed references for researchers in this field.
This review systematically examines the technological framework of RSVG. To identify high-quality publications relevant to RSVG, we adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [16], as illustrated in Figure 1. We first collected relevant articles through Web of Science, Google Scholar, and IEEE Xplore, using keywords such as “remote sensing visual grounding,” “remote sensing referring expression comprehension,” and “remote sensing phrase grounding” to search for all articles up to 10 May 2025. We then established exclusion and inclusion criteria to screen records and full-text articles. The exclusion criteria were as follows:
  • Duplicate work.
  • Articles without publicly available reproducible code or peer review.
  • Articles where the full text could not be obtained from the publisher.
The inclusion criteria were as follows:
  • Articles written in English.
  • Images used in the study must be remote sensing images (RSI) and cannot be other types of images.
  • Methods that integrate complete sentence text (rather than individual or multiple discrete words).
A total of 62 papers and articles were initially identified. After applying the eligibility criteria and manually screening out redundant papers and articles, 28 papers and articles were ultimately selected for this review. The remaining sections are structured as follows: Section 2 outlines the background, including concept definition, mainstream datasets, and evaluation metrics. Section 3 deconstructs existing methodologies into Transformer-driven and MLLM-driven paradigms based on architectural differences, offering in-depth critiques of their design philosophies and performance limitations. Section 4 addresses domain-specific innovations in RSVG, exploring advancements in cross-modal feature fusion, language-guided visual optimization, multi-scale and hierarchical feature processing, open-set expansion, and efficient fine-tuning. Section 5 examines application scenarios, reveals the core challenges, and provides an outlook.

2. Background

2.1. Concept Definition

Remote sensing visual grounding, also known as remote sensing referring expression comprehension [17], aims to achieve accurate spatial localization of specific targets in remote sensing images through natural language descriptions. The technique maps unstructured textual instructions to bounding box coordinates in the image by establishing cross-modal associations between visual features and semantic descriptions.
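To make this task definition concrete, the following Python sketch shows the basic input/output contract described above: one remote sensing image plus one free-form referring expression in, one bounding box out. The class and function names are illustrative assumptions, not those of any published implementation.

```python
# Minimal sketch of the RSVG task interface (illustrative only; names are hypothetical).
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GroundingResult:
    box_xyxy: tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels
    score: float                                  # model confidence for the prediction


class RSVGModel(Protocol):
    def ground(self, image, expression: str) -> GroundingResult:
        """Map one free-form referring expression to one bounding box."""
        ...


# Usage: unlike closed-set detection, the query is open-vocabulary natural language.
# result = model.ground(image, "the white vehicle parked in the sunlight")
# print(result.box_xyxy, result.score)
```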
Compared with traditional object detection methods, remote sensing visual grounding is fundamentally different in its task paradigm. Traditional object detection [6] is usually based on supervised learning with a closed set of categories, and its performance is limited by the completeness of predefined categories and labeled data. Figure 2a shows the localization results of a ‘vehicle’ target in a remote sensing image. Remote sensing visual grounding instead takes open-vocabulary descriptions as input and can handle compound semantic queries, dynamically parsing implicit semantic constraints through multimodal feature fusion. As shown in Figure 2b, the visual grounding task can not only locate ‘vehicle’ targets but also filter the candidates down to the one that matches the modifiers ‘white’ and ‘in the sunlight’. This openness makes visual grounding more suitable for emergency response, environmental monitoring, and other fields that require flexible queries.
In particular, RSVG differs conceptually from Referring Remote Sensing Image Segmentation (RRSIS). While both involve remote sensing images and text descriptions, RSVG provides coarse target localization via bounding boxes, whereas RRSIS generates pixel-level masks for precise contour extraction, as shown in Figure 2c. Technically, RSVG focuses on cross-modal reasoning efficiency and often employs attention-based region proposal networks to quickly filter candidate regions through spatial relationship modeling, while referring image segmentation is oriented toward more accurate pixel-level feature extraction, which must address problems such as edge blurring and target adhesion and usually relies on language-guided cascade segmentation architectures to handle fine-grained features. Table 1 compares the significant differences among the three tasks of traditional object detection, visual grounding, and referring image segmentation.

2.2. Datasets and Benchmark

The construction and annotation of relevant datasets have become critical factors driving advancements in this field. Sun et al. established the first remote sensing visual grounding dataset, the RSVG dataset (RSVGD) [11], laying the foundation for research in this domain. DIOR-RSVG [12], created with manually validated automatic generation algorithms based on the large-scale object detection dataset Detection in Optical Remote Sensing (DIOR) [18], covers 20 target categories with high inter-class similarity and intra-class diversity, providing rich and reliable annotation data for visual grounding experiments. Lan et al. [19] proposed the RSVG-HR dataset, focusing on high-resolution remote sensing images, which contains 2650 image–text pairs. By re-annotating high-resolution images in the RSVGD dataset using an absolute-position and relative-position schema, RSVG-HR offers more challenging task scenarios. Li et al. [20] constructed the OPT-RSVG dataset by collecting 25,452 images from the High-Resolution Remote Sensing Detection (HRRSD) [21], DIOR, and swimming pool and car detection (SPCD) [22] datasets and creating 48,952 image–text pairs. It introduces more complex scenes, a wider spatial resolution span, and richer object categories, providing more challenging data resources for RSVG tasks. To address the limitation of datasets being confined to optical remote sensing images, Li et al. [23] proposed the visual grounding for high-resolution synthetic aperture radar images (SARVG) dataset, focusing on visual grounding for synthetic aperture radar (SAR) images. Containing 2465 high-resolution SAR images and 7617 image–text pairs, it provides a valuable resource for visual grounding studies of SAR images; however, the dataset contains only a single target category, the power transmission tower, so target diversity is lacking.
Current instruction-tuning datasets for multimodal large models are primarily constructed from established benchmarks such as RSVG, DIOR-RSVG, and OPT-RSVG. Notably, SkySenseGPT [24] extends visual grounding to object reasoning tasks, annotating over 210,000 targets in rotated bounding box format with coordinate precision elevated to the millimeter level (0.01), significantly advancing spatial localization accuracy. Benchmarks designed to evaluate multimodal large models are evolving toward more complex scenarios, safety-oriented requirements, and large-scale scene challenges. For instance, the Versatile vision-language Benchmark for Remote Sensing image understanding (VRSBench) [25] proposed by Li et al. supports both horizontal and rotated bounding box localization across 26 object categories. While the DIOR-based detection dataset contains an average of 3.3 instances per image, the DOTA-v2 [26] subset escalates instance density to 14.2 instances per image, creating highly challenging scenarios for dense small-object localization. The Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models (COREval) [27] addresses potential data leakage risks by constructing test sets from six satellite platforms, such as Landsat-8 and Sentinel-1/2, to ensure objective generalization assessment. Wang et al. [28] developed a benchmark for ultra-large-scale scenes, manually annotating 12,619 visually grounded instances with unique attributes (e.g., color, shape, position, size, and relative location) from ultra-high-resolution images with an average size of 8500 × 8500 pixels. Some objects in this dataset occupy as few as five pixels, establishing new standards for microscopic-level localization and anti-background interference testing in massive scenes. Table 2 provides a comparative analysis of the mainstream datasets for RSVG tasks, and Appendix B comprehensively documents the partitioning methodologies employed by the datasets cataloged in Table 2.

2.3. Evaluation Metrics

The evaluation framework for remote sensing visual grounding tasks typically adopts the classic assessment paradigm proposed by Zhan et al. [12]. Precision-based metrics (Pr@0.5, Pr@0.6, Pr@0.7, Pr@0.8, Pr@0.9) measure the proportion of samples for which the intersection-over-union (IoU) between the predicted bounding box and the ground-truth box exceeds thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9, respectively. These metrics intuitively reflect the model’s prediction accuracy under different overlap requirements. Additionally, mean intersection-over-union (meanIoU) calculates the arithmetic mean of IoU values over all image–language pairs in the dataset, as shown in Equation (1), evaluating the model’s localization capability across diverse scenarios. Cumulative intersection-over-union (cumIoU) is defined as the ratio of the total intersection area to the total union area across all samples, as shown in Equation (2), emphasizing the model’s performance under the overall data distribution.
$$\mathrm{meanIoU} = \frac{1}{M}\sum_{t=1}^{M}\frac{I_t}{U_t} \tag{1}$$
$$\mathrm{cumIoU} = \frac{\sum_{t} I_t}{\sum_{t} U_t} \tag{2}$$
where $t$ is the sample index, $M$ is the dataset size, and $I_t$ and $U_t$ denote the intersection and union areas between the predicted and ground-truth boxes, respectively.
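For clarity, the metrics above can be computed as in the following minimal sketch, which assumes axis-aligned boxes in (x1, y1, x2, y2) pixel format and treats a prediction as correct when its IoU exceeds the threshold; it is an illustration, not the official evaluation code of any benchmark.

```python
# Sketch of Pr@K, meanIoU, and cumIoU for axis-aligned (x1, y1, x2, y2) boxes.
import numpy as np


def box_intersection_union(pred, gt):
    """Return the intersection and union areas of two boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter, area_p + area_g - inter


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    inters, unions = zip(*(box_intersection_union(p, g) for p, g in zip(preds, gts)))
    ious = np.array(inters) / np.array(unions)
    pr_at = {f"Pr@{t}": float((ious > t).mean()) for t in thresholds}  # Equation-style Pr@K
    mean_iou = float(ious.mean())                                      # Equation (1)
    cum_iou = float(sum(inters) / sum(unions))                         # Equation (2)
    return pr_at, mean_iou, cum_iou


# Example with two image-expression pairs:
preds = [(10, 10, 50, 50), (0, 0, 100, 80)]
gts = [(12, 12, 48, 52), (5, 0, 100, 90)]
print(evaluate(preds, gts))
```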

3. Evolutionary Trajectory

Early visual grounding research for natural images began to emerge in 2014, with the proposal of two-stage [8] and one-stage [9] architectures using Long Short-Term Memory (LSTM) [30] as the language encoder and a Convolutional Neural Network (CNN) as the visual encoder to address cross-modal understanding requirements for visual grounding tasks. Two-stage visual grounding methods often treat visual grounding as an object retrieval task, identifying the object with the highest semantic match to the text expression from a set of object proposals. This approach consists of two stages: the first stage generates sparse object region proposals using object detection methods or unsupervised methods; the second stage matches object regions with text expressions to select the optimal object as the prediction result. A common practice is to encode candidate objects and text expressions using CNN and LSTM, respectively, and determine the output by calculating similarity. However, the second stage of the two-stage method relies heavily on the object detection results of the first stage. If the target region is not accurately detected in the first stage, the second stage cannot perform matching and localization effectively. Additionally, the computation of cross-modal similarity between numerous object region proposals and text expressions consumes substantial computational resources.
One-stage methods overcome the dependency on the first stage’s results through end-to-end training. This approach directly performs visual–language fusion in the intermediate layers of the object detector and outputs the bounding box with the highest score on predefined dense anchors. Inspired by YOLOv3 [31], Yang et al. [32] utilized DarkNet and Bidirectional Encoder Representations from Transformers (BERT) to extract visual and text features, respectively, and integrated text features into the YOLOv3 detector for visual grounding. Researchers have also made diverse attempts to improve model performance: Sadhu et al. [33] extended the task to zero-shot localization; Liao et al. [34] transformed visual grounding into a correlation filtering problem; Yang et al. [9] designed a recursive subquery construction module to handle long and complex sentences; Ye et al. [35] proposed a filter-based cross-modal fusion network to filter visual feature maps using structured knowledge and context. Although one-stage methods simplify the training process and reduce computational costs, they rely on manually designed mechanisms and complex modules for multimodal reasoning, making them prone to overfitting on scenario-specific datasets and lacking generality and generalization.
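As a minimal illustration of the two-stage matching step described above, the following sketch scores pre-extracted region proposals against a pooled expression feature and keeps the best match. The CNN and LSTM encoders of the cited works are stubbed out with random tensors, so all names and dimensions are assumptions.

```python
# Toy sketch of two-stage proposal-expression matching (features are stand-ins).
import torch
import torch.nn.functional as F


def pick_best_proposal(region_feats: torch.Tensor,  # (N, D), one vector per proposal
                       text_feat: torch.Tensor,      # (D,), pooled expression feature
                       boxes: torch.Tensor):         # (N, 4), proposal boxes
    # Cosine similarity between every proposal and the expression embedding.
    sims = F.cosine_similarity(region_feats, text_feat.unsqueeze(0), dim=-1)  # (N,)
    best = sims.argmax()
    return boxes[best], sims[best]


# Example with random stand-in features for 5 proposals:
boxes = torch.tensor([[10, 10, 40, 40], [50, 60, 90, 120], [0, 0, 20, 20],
                      [30, 35, 70, 80], [5, 90, 60, 140]], dtype=torch.float)
box, score = pick_best_proposal(torch.randn(5, 256), torch.randn(256), boxes)
print(box, score)
```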
In 2020, the Transformer model was first applied to image classification tasks [36] and achieved better results than CNN models. This inspired researchers to explore Transformer architectures for unifying visual–text feature representations, bringing new insights to cross-modal tasks such as visual grounding. Deng et al. [17] proposed a Transformer-based object coordinate regression model that performs cross-modal fusion through stacked Transformer encoder layers and predicts target locations; Du et al. [37] utilized text-guided self-attention to mine semantic information and improve coordinate regression performance. Transformer-based visual grounding methods surpass traditional approaches by utilizing attention mechanisms to develop efficient feature extraction modules. These modules are adept at capturing cross-scale visual information and improving contextual relationships [38], demonstrating high versatility and effectiveness. Owing to their improved performance in object detection tasks and wide applicability, Transformer-based methods have gradually replaced earlier two-stage and one-stage approaches.
Due to significant differences in target characteristics and scale between remote sensing images and natural images, both two-stage and one-stage methods exhibit limited applicability. Following the success of Transformer-based visual grounding in natural image processing, Sun et al. first adapted this approach for remote sensing in October 2022 [11], as shown in Figure 3. Later, Zhan et al. [12] advanced the field by releasing the DIOR-RSVG benchmark dataset and open-sourcing their multi-granularity visual language fusion (MGVLF) implementation, providing essential resources for deep learning-based visual grounding. Although RSGPT appeared in 2023 as the first multimodal large language model for remote sensing, it did not achieve region-level localization capabilities. A key breakthrough occurred in November 2023 with GeoChat [14], which successfully integrated visual grounding into the MLLM framework. This milestone laid the groundwork for subsequent MLLM-based methods, driving faster development and notable performance gains from 2024 onward.
This section categorizes remote sensing visual grounding methods into Transformer-based architectures and MLLM-based methods, depending on whether they utilize large language models as foundational backbones. Representative frameworks of both paradigms are introduced, with their architectural differences comparatively illustrated in Figure 4.

3.1. Transformer-Based Methods

Transformer has become the mainstream architecture for remote sensing visual grounding tasks due to its powerful parallel processing capabilities and modeling of long-range dependencies. Transformer-based methods typically use ResNet-50 or DarkNet-53 as the visual encoder and BERT as the language encoder during the encoding stage, as shown in Table 3. They construct visual–text interaction modules based on self-attention mechanisms to enhance alignment between visual and language modalities, thereby improving visual grounding accuracy. The basic architecture is shown in Figure 4a.
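The following sketch illustrates the generic fusion-and-regression pattern of Figure 4a: visual tokens, text tokens, and a learnable regression token pass through a Transformer encoder, after which a small head predicts a normalized box from the regression token. Layer sizes and token layouts are illustrative assumptions, not the configuration of any specific paper.

```python
# Minimal sketch of Transformer-based cross-modal fusion with a [REG] token.
import torch
import torch.nn as nn


class CrossModalGroundingHead(nn.Module):
    def __init__(self, dim=256, layers=6, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))          # learnable regression token
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, dim) from a CNN/ViT backbone; txt_tokens: (B, Nt, dim) from BERT
        b = vis_tokens.size(0)
        seq = torch.cat([self.reg_token.expand(b, -1, -1), vis_tokens, txt_tokens], dim=1)
        fused = self.encoder(seq)
        return self.box_head(fused[:, 0]).sigmoid()  # (B, 4) normalized (cx, cy, w, h)


head = CrossModalGroundingHead()
print(head(torch.randn(2, 400, 256), torch.randn(2, 20, 256)).shape)  # torch.Size([2, 4])
```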
Sun et al. [11] pioneered the introduction of visual grounding to the remote sensing field in 2022 with their GeoVG method, which uses a graph-based strategy to establish relationships between visual and language features. The numerical context module in the language encoder represents complex expressions as geospatial relationship graphs, reducing the search space in large-scale scenes and improving localization accuracy. However, creating relationship graphs requires high computational costs and tends to overlook relationships between small target objects. Wang et al. [39] proposed an autoregressive discrete coordinate sequence generation method to explore interactions between direct regression features and encoded multimodal features, achieving a mean IoU of 64.88% on RSVG. The novel frequency and query refinement network (FQRNet) [55] captures global structural information from remote sensing data using Fourier transforms and enhances spatial features with frequency information. Li et al. [23] designed a cross-modal encoder for SAR data in their TACMT model to strengthen text-guided scattering feature extraction, improving accuracy by 6.2% on the SARVG1.0 dataset.
In summary, Transformer-based methods exhibit strong feature extraction capabilities and flexible architectural designs, performing well in handling complex backgrounds and small-target localization. However, there is still room for optimization in multimodal feature fusion, addressing non-salient target features, and mitigating background interference.

3.2. MLLM-Based Methods

With the breakthrough of large language models (LLMs), multimodal large language models have provided new opportunities for automated analysis of earth observation data. By integrating visual and language modality information, MLLMs can more naturally process human language instructions and enhance open-scene understanding through large-scale data training and instruction tuning. General MLLMs support a wide range of tasks, including classification, detection, captioning, question answering, and visual reasoning. As an important downstream task, visual grounding has also achieved breakthroughs with the development of large models. These models typically rely on pretrained large language models and use specific visual encoders to align image and language features for unified multimodal task processing. Figure 4b illustrates the basic framework of MLLM-based visual grounding.
GeoChat [14], proposed by Kuckreja et al., first demonstrated remote sensing visual grounding capabilities in an MLLM. It uses CLIP-ViT as the visual backbone to align visual and language modalities, inserts positional encoding, and scales image input sizes to handle larger images. However, it has obvious limitations in localizing small targets, with an acc@0.5 value of only 2.9%. Zhan et al. [48] designed SkyEyeGPT, which maps remote sensing visual features to the language domain through a simple projection layer, significantly improving the precise localization of small objects.
Compared to the massive data scale in natural scenes that supports the development of multimodal large models, the lack of training data in remote sensing has become a bottleneck. Muhtar et al. [50] pretrained on the large 4-million-image–text-pair dataset LHRS-Align and fine-tuned on the 30,000-instruction-pair dataset LHRS-Instruct, combining multi-level visual–language alignment strategies to unleash the potential of MLLMs. EarthDial [53], proposed by Soni et al., supports natural language dialog for multispectral, multitemporal, and multiresolution remote sensing data, using 11.11 million instruction pairs containing RGB, Sentinel-2, SAR, near-infrared, and infrared data for comprehensive instruction tuning to achieve stronger generalization.
Additionally, some scholars have introduced visual prompt models or combined rotated bounding boxes and masks to localize finer-grained targets. SkySenseGPT [24] extends visual grounding to object reasoning tasks, using rotated bounding boxes to improve the accuracy of target fitting compared to traditional horizontal bounding boxes. EarthMarker [54], proposed by Zhang et al., first introduced visual prompt learning into remote sensing multimodal large models, allowing users to interact with AI using prompts such as boxes, points, and free-form shapes, breaking the limitations of language instructions and enhancing flexibility. GeoPix [52] extends visual grounding to the pixel level and introduces a Class-wise Learnable Memory (CLM) module to dynamically extract and store category-specific geographic context, improving the model’s understanding of diverse instances in complex remote sensing scenes. However, when user instructions for referring segmentation and visual context analysis accidentally include instance location queries, task interference can occur, causing errors in segmentation masks in visual grounding tasks, as shown in Figure 5. GeoGround [15] leverages the powerful multi-task learning capabilities of LLMs, combining prompt-assisted learning (PAL) and geometry-guided learning (GGL) to unify visual grounding tasks with OBB, HBB, and mask annotations, allowing flexible output choices.
MLLM-based methods facilitate alignment and interaction between visual and language modalities by embedding text and images into a unified semantic space. Built on large-scale pretraining data, they exhibit excellent zero-shot generation capabilities and open-domain adaptability. MLLMs often inherit the reasoning capabilities of LLMs, supporting deep semantic understanding and reasoning for complex language instructions, with more flexible output formats. The strong language generation and multimodal fusion capabilities of MLLMs provide new approaches for remote sensing visual grounding. Nonetheless, the substantial parameter count of these models imposes heavy demands on both training costs and cross-modal data acquisition. Furthermore, when LLMs are used for coordinate generation, their inherently autoregressive, sequential decoding conflicts with the parallel processing demanded by dense object detection. This conflict, together with high computational complexity and the mismatch between the model’s output structure and the requirements of object detection, constrains the performance of MLLMs in localizing small objects. Table 4 compares the Transformer-based and MLLM-based methods.
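To illustrate the coordinate-as-text interface used by MLLM-based grounding (Figure 4b), the following sketch serializes a box into a token-friendly string and parses generated text back into coordinates. The prompt wording and the 0-to-100 coordinate binning are assumptions for illustration; the exact format varies across models such as GeoChat and GeoGround.

```python
# Sketch of serializing and parsing bounding boxes as plain text for an MLLM.
import re


def format_answer(box, size=100):
    """Serialize a normalized (x1, y1, x2, y2) box into bracketed integer bins."""
    return "[" + ", ".join(str(round(v * size)) for v in box) + "]"


def parse_answer(text, size=100):
    """Recover normalized boxes from generated text."""
    boxes = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return [tuple(int(v) / size for v in b) for b in boxes]


instruction = "Locate the storage tank next to the road. Answer with a bounding box."
response = "The storage tank is at " + format_answer((0.12, 0.40, 0.18, 0.47))
print(response)                # ... [12, 40, 18, 47]
print(parse_answer(response))  # [(0.12, 0.4, 0.18, 0.47)]
```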

4. Characteristics and Innovations

Compared to generic visual grounding, the development of remote sensing visual grounding has consistently centered on three core characteristics: scale heterogeneity, semantic complexity, and annotation scarcity.

4.1. Scale Heterogeneity

Acquired from nadir satellite perspectives, remote sensing imagery exhibits extensive spatial coverage, capturing objects ranging from square-kilometer-scale landforms to sub-meter-level fine targets. This target diversity challenges conventional detectors with fixed receptive fields, which struggle to accommodate multiscale objects and suffer severe background noise interference. Consequently, researchers have pursued multiscale hierarchical feature processing to mitigate scale heterogeneity and background clutter in large-scale scenes, thereby expanding dynamic perceptual coverage. Zhan et al. [12] proposed an MGVLF module, which integrates multiscale visual features and multi-granularity textual embeddings to adaptively filter irrelevant noise, effectively tackling challenges posed by significant scale variations and cluttered backgrounds. Further advancing this approach, Wang et al. [39] introduced a Multistage Synergistic Aggregation Module (MSAM), achieving multi-scale contextual fusion through generative coordinate sequence prediction. This method attained an accuracy of 83.61% on the DIOR-RSVG dataset. To resolve confusion between targets and similar objects, Qiu et al. [40] developed a learnable attribute prompter that adaptively explores diverse attribute information based on common object characteristics in remote sensing images. RINet [42] adopted a local-to-object scheme to progressively localize target regions via a Regional Indication Generator (RIG), enhancing localization capabilities for small targets. Ding et al. [44] designed an Adaptive Feature Selection (AFS) module to suppress noise and combined it with a multistage decoder to iteratively infer target attributes, achieving a precision of 78.24% in complex scenarios. While innovations in multi-scale and hierarchical feature processing effectively mitigate scale variations and background interference, challenges persist in slow model convergence and insufficient validation of detection performance in high-density scenarios.
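As a simplified illustration of the multi-scale fusion idea shared by the modules above, the following sketch projects features from several backbone stages to a common width, gates them with a sentence embedding, and merges them at the finest resolution so that both small and large targets stay represented. It is a generic sketch under assumed channel sizes, not a reimplementation of MGVLF, MSAM, or AFS.

```python
# Generic text-gated multi-scale feature fusion (illustrative dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGatedMultiScaleFusion(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), dim=256, text_dim=768):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        self.gate = nn.Linear(text_dim, dim)

    def forward(self, feats, text_emb):
        # feats: list of (B, C_i, H_i, W_i) stage outputs; text_emb: (B, text_dim) sentence feature
        g = torch.sigmoid(self.gate(text_emb))[:, :, None, None]  # language-conditioned channel gate
        target = feats[0].shape[-2:]                              # finest spatial resolution
        fused = 0
        for f, proj in zip(feats, self.proj):
            f = proj(f) * g                                       # suppress channels irrelevant to the query
            fused = fused + F.interpolate(f, size=target, mode="bilinear", align_corners=False)
        return fused                                              # (B, dim, H_0, W_0)


m = TextGatedMultiScaleFusion()
feats = [torch.randn(2, 512, 80, 80), torch.randn(2, 1024, 40, 40), torch.randn(2, 2048, 20, 20)]
print(m(feats, torch.randn(2, 768)).shape)  # torch.Size([2, 256, 80, 80])
```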

4.2. Semantic Complexity

Remote sensing images contain diverse and numerous targets, with small objects such as ships, vehicles, and buildings constituting a significant proportion. A single target may correspond to varied textual descriptions, while identical expressions can refer to different objects. Precise semantic expressions are often lengthy and complex, frequently encountering issues like description forgetting and attention drift during feature extraction and matching. Conversely, oversimplified expressions risk ambiguity, particularly in large-scale scenes with densely clustered targets, where object confusion severely impedes precise localization. To enhance semantic consistency between visual and textual features, researchers employ multilevel alignment and dynamic interaction mechanisms to optimize cross-modal feature fusion. Early studies achieved coarse-grained alignment by constructing geospatial relation graphs but struggled to parse complex descriptions. Zhan et al. [12] proposed the MGVLF module to integrate multiscale visual features and multi-granularity textual embeddings. Lan et al. [19] introduced the Language Query-based Visual Grounding (LQVG) framework, which retrieves multiscale visual features using textual features as queries and incorporates a Multistage Cross-Modal Alignment (MSCMA) module to strengthen semantic correlations. Their approach achieved accuracies (Pr@0.5) of 83.41% and 87.37% on the DIOR-RSVG and RSVG-HR datasets, respectively, though with higher computational complexity. Choudhury et al. [43] simplified traditional multimodal fusion modules in their CrossVG model, relying solely on stacked Transformer encoder layers to achieve efficient cross-modality interaction. Multidimensional Semantic-Guidance Visual Grounding (MSVG) [45] further employed a Multidimensional Text–Image Alignment Module (MTAM) to increase the relevance between visual features and textual descriptions. Overall, cross-modality fusion strategies—such as constructing geospatial relation graphs and multi-granularity feature fusion—effectively narrow the search space and improve semantic matching capabilities.
Furthermore, language-guided visual optimization can mitigate attention drift and enhance focus on target regions by embedding textual semantics into the visual feature extraction process. Li et al. [45] proposed a Query-Guided Visual Attention (QGVA) module for the visual encoder, which dynamically focuses on language-described regions by injecting textual semantics into the visual encoding process. This method achieved a 4.98% accuracy improvement over MGVLF on the DIOR-RSVG dataset. LPVA [20] innovatively adopted a channel-spatial dual-dimensional dynamic weight adjustment strategy, combined with a Multilevel Feature Enhancement (MFE) decoder to suppress background interference and enhance feature distinctiveness. It achieved an accuracy of 82.27% on DIOR-RSVG, though its capability remains limited to single-object localization and it lacks proficiency in multi-target matching. For scenarios involving lengthy and complex linguistic expressions, RINet [42] introduced a word contribution learner to evaluate the importance of each word in the language description. Through iterative fine-tuning, it improved the comprehension of intricate linguistic information (Figure 6 illustrates the workflow of RINet). In summary, these methods leverage attention mechanisms and domain-specific training data to enhance the robustness of complex description parsing and to guide visual optimization.
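The following sketch captures the common core of language-guided visual optimization: visual tokens attend to the word embeddings so that regions matching the description are emphasized while the original visual content is preserved through a residual connection. It is inspired by, but not identical to, the QGVA and LPVA designs cited above, and all dimensions are assumptions.

```python
# Generic language-guided cross-attention over visual tokens (illustrative).
import torch
import torch.nn as nn


class LanguageGuidedAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, word_tokens, word_mask=None):
        # vis_tokens: (B, Nv, dim); word_tokens: (B, Nt, dim); word_mask: True marks padding words
        attended, _ = self.cross_attn(vis_tokens, word_tokens, word_tokens,
                                      key_padding_mask=word_mask)
        return self.norm(vis_tokens + attended)  # residual keeps the original visual content


layer = LanguageGuidedAttention()
out = layer(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 400, 256])
```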

4.3. Annotation Scarcity

Specialized datasets for RSVG are limited by domain knowledge and scarce data. Commonly used RSVG datasets, typically derived from annotated DIOR object detection benchmarks, contain at most 50,000 instances, significantly fewer than natural-scene counterparts such as RefCOCO (142,209 referring expressions for 50,000 objects). On one hand, the professional interpretation requirements of multimodal remote sensing imagery pose a significant challenge to annotation quality. For example, SAR targets exhibit varying signatures depending on incidence angle and polarization mode, requiring annotators to possess electromagnetic scattering knowledge and cross-modal interpretation skills, thereby escalating annotation costs. On the other hand, data access restrictions further exacerbate annotation scarcity: publicly available remote sensing data sources are primarily concentrated in civilian scenarios, while military or sensitive targets remain inaccessible due to security protocols, limiting dataset diversity and scenario coverage. This dual scarcity undermines model generalization for complex scenes or novel targets. Consequently, researchers explore open-set methods to expand candidate categories. Hu et al. adapted Grounding DINO to RSVG via a Multi-scale Image-to-Text Fusion Module (MSITFM) and Text Confidence Matching (TCM), optimizing cross-modal interaction and label assignment for robust open-world localization. Multimodal foundation models address scarcity by integrating multisensor data into unified instruction frameworks. EarthDial constructed an 11.11-million-pair multimodal instruction dataset, leveraging complementary cross-modal information from RGB, SAR, near-infrared, and other data to drive the model’s learning of universal visual–language mapping relationships and reduce reliance on single-modal annotations. GeoGround unifies HBB, OBB, and mask tasks through prompt-assisted and geometry-guided learning, converting annotations into reusable instruction–response pairs, as shown in Figure 7. While multitask interference remains a concern, these approaches establish new paradigms for mitigating RSVG data limitations. To more comprehensively analyze whether each RSVG model has been improved to address the three typical challenges of scale heterogeneity, semantic complexity, and annotation scarcity, Table 5 provides a systematic summary of the relevant content and clearly identifies the innovative features of each model.
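The conversion of existing box annotations into reusable instruction–response pairs mentioned above can be sketched as follows; the prompt templates and coordinate conventions are illustrative assumptions rather than GeoGround's actual templates.

```python
# Toy conversion of HBB/OBB annotations into instruction-tuning pairs (hypothetical templates).
def hbb_to_pair(expression, box, image_id):
    x1, y1, x2, y2 = box
    return {
        "image": image_id,
        "instruction": f"Output the horizontal bounding box of: {expression}",
        "response": f"[{x1}, {y1}, {x2}, {y2}]",
    }


def obb_to_pair(expression, obb, image_id):
    cx, cy, w, h, angle = obb
    return {
        "image": image_id,
        "instruction": f"Output the oriented bounding box of: {expression}",
        "response": f"[{cx}, {cy}, {w}, {h}, {angle}]",
    }


print(hbb_to_pair("the windmill near the field", (412, 230, 460, 288), "DIOR_04512.jpg"))
```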

5. Challenges and Outlook

Remote sensing visual grounding has achieved notable progress, yet critical bottlenecks persist in multisource data fusion, dynamic scene understanding, and edge computing adaptability. Future research needs to focus on the following directions to carry out an in-depth exploration:

5.1. Intelligent Annotation Agent for Multisource Heterogeneous Data

Current RSVG datasets predominantly rely on optical imagery, resulting in weak generalization across modalities such as SAR, hyperspectral, multispectral, and LiDAR point clouds. This limitation stems from the differences in physical properties among the data sources. For example, SAR images are formed by an electromagnetic scattering mechanism, so their texture features have no intuitive correspondence with RGB channels; hyperspectral data contain hundreds of highly redundant bands, which leads to the curse of dimensionality when fed directly into a model; and LiDAR point clouds accurately characterize three-dimensional geometric structure but lack semantic labels and exhibit significant alignment errors at large scales. Traditional manual annotation requires specialized knowledge and is time-consuming, and cross-modal image–text alignment relies on empirically designed rules, making it difficult to support the training needs of open-set models. Thus, a knowledge-driven [56] annotation agent can be designed that integrates geographic knowledge graphs with a domain-specific large language model to embed professional terminology into the text description generation process, and that synthesizes multisource data-text pairs with Diffusion [57] models to scale up training sets, satisfying the training demands of multimodal large models and exploring their potential in visual grounding tasks.
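As a toy illustration of the simplest building block of such an annotation agent, the following sketch generates a referring expression from an existing detection label using hand-written spatial templates; a knowledge-driven agent would replace these templates with a domain large language model and a geographic knowledge graph, and every rule here is a simplified assumption.

```python
# Template-based referring-expression generation from a detection label (toy rules).
def describe(target, others, image_w, image_h):
    cx = (target["box"][0] + target["box"][2]) / 2
    cy = (target["box"][1] + target["box"][3]) / 2
    horiz = "left" if cx < image_w / 3 else "right" if cx > 2 * image_w / 3 else "center"
    vert = "top" if cy < image_h / 3 else "bottom" if cy > 2 * image_h / 3 else "middle"
    same_class = sum(o["category"] == target["category"] for o in others)
    qualifier = "the only" if same_class == 0 else "the"
    return f"{qualifier} {target['category']} in the {vert}-{horiz} part of the image"


target = {"category": "storage tank", "box": [700, 120, 760, 180]}
others = [{"category": "ship", "box": [100, 500, 260, 560]}]
print(describe(target, others, 800, 800))  # "the only storage tank in the top-right part of the image"
```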

5.2. Cross-Temporal Perception Modeling for Dynamic Scenarios

Existing remote sensing visual grounding methods mainly focus on static single-temporal-phase data, which makes it difficult to capture dynamic processes such as flood inundation and urban sprawl and results in insufficient temporal correlation capability. This is mainly because multi-temporal remote sensing data often exhibit significant heterogeneity, while natural language lacks quantitative representation models for temporal semantics. Although TEOChat [58] attempts to construct a temporal vision–language large model to resolve temporal evolution patterns, it fails to establish an accurate mapping between spatiotemporal coordinates and textual commands. Future work could adopt memory-augmented networks with deformable temporal attention mechanisms to dynamically correlate historical feature maps with current observation data. Multitemporal image–text pairs with temporal adverbs could enhance cross-temporal tracking, enabling autonomous change-trajectory identification and disaster-spread prediction, ultimately achieving a “change perception–precise localization–decision feedback” loop.
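The following sketch outlines one possible form of the memory-augmented temporal attention suggested above, in which tokens from the current acquisition query a memory of earlier feature maps tagged with how long ago they were observed. It is purely conceptual and does not reproduce TEOChat or any published RSVG model.

```python
# Conceptual memory-augmented cross-temporal attention (illustrative only).
import torch
import torch.nn as nn


class TemporalMemoryAttention(nn.Module):
    def __init__(self, dim=256, heads=8, memory_len=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_embed = nn.Embedding(memory_len, dim)  # encodes "how long ago" each map was acquired

    def forward(self, current, memory):
        # current: (B, N, dim) tokens from the latest acquisition
        # memory:  (B, T, N, dim) tokens from up to T earlier acquisitions
        b, t, n, d = memory.shape
        mem = memory + self.time_embed(torch.arange(t, device=memory.device))[None, :, None, :]
        mem = mem.reshape(b, t * n, d)
        delta, _ = self.attn(current, mem, mem)  # temporal context gathered for each current token
        return current + delta


m = TemporalMemoryAttention()
print(m(torch.randn(2, 400, 256), torch.randn(2, 3, 400, 256)).shape)  # torch.Size([2, 400, 256])
```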

5.3. Edge Computing-Oriented Lightweight Deployment

High-resolution remote sensing image processing depends on large-parameter models to create effective representations. However, satellite and airborne platforms face multiple constraints, including limited computing resources, radiation-hardening requirements, and environmental adaptability, which severely restrict computational and storage capabilities. This makes supporting in-orbit training and real-time inference of complex models difficult. Research shows that even on specialized embedded AI hardware like the Jetson series or Movidius Myriad 2, inference times for unoptimized models are still too slow to meet the high-timeliness needs of tasks like disaster response. Although techniques like model quantization can reduce computational load, they often cause significant drops in accuracy (mAP). This limits their usefulness in scenarios with strict real-time demands, such as military reconnaissance and disaster monitoring. Future research must take a dual approach: first, systematically compress models, reduce computational complexity, and improve inference efficiency and accuracy retention on edge platforms by integrating optimization methods like lightweight network design [59], model pruning [60], knowledge distillation [61], and low-bit-width quantization [62]; second, develop a federated edge learning framework [63] based on low-orbit satellite constellations. Through multi-node collaboration and knowledge sharing, this approach aims to overcome the data sample and model generalization bottlenecks of single satellites, gradually building distributed core models. Only in this way can we bridge the gap between laboratory models and practical deployment, enabling a real-time perception-decision loop of “what you see is what you get” in extreme, dynamic scenarios like battlefield perception and rapid disaster area assessment.
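As a concrete example of one compression step listed above, the following sketch applies post-training dynamic quantization to the linear layers of a stand-in grounding head with PyTorch; real on-board deployment would combine this with pruning, distillation, and hardware-specific compilation, and the toy model below is not an RSVG network.

```python
# Post-training dynamic quantization of linear layers (CPU-only toy example).
import torch
import torch.nn as nn

model = nn.Sequential(              # stand-in for a small grounding head
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # weights stored as 8-bit integers
)

x = torch.randn(1, 256)
print(model(x).shape, quantized(x).shape)   # both torch.Size([1, 4])


def size_mb(m):
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6


# Quantized weights are packed inside the module, so only the fp32 size is reported here.
print(f"fp32 params: {size_mb(model):.2f} MB")
```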

6. Conclusions

In this review, we systematically track and summarize the research progress and future directions of remote sensing visual grounding technology over the last five years. First, we establish the basic concept of RSVG by comparing its scope with object detection and referring image segmentation, and we compile mainstream datasets such as DIOR-RSVG and RSVG-HR together with the corresponding evaluation metrics. Subsequently, the two major approaches to remote sensing visual grounding, namely Transformer-based methods and multimodal large language model-based methods, are introduced in detail and analyzed comparatively. In addition, we discuss in depth the innovative techniques of remote sensing visual grounding that target the key characteristics of remote sensing images. Finally, we summarize the challenges faced by research on remote sensing visual grounding and propose valuable directions for future work. This paper is suitable for both beginners and experienced researchers in the field of remote sensing visual grounding and serves as a valuable resource for tracking the latest research progress.

Author Contributions

Conceptualization, L.L., G.W. and G.S.; methodology, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); software, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); validation, All authors; formal analysis, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); investigation, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); resources, Z.W., L.L., G.W., G.S.; data curation, Z.W., W.Z., B.Z., H.C., X.L. (Xinyi Li) and X.L. (Xiaoxuan Liu); writing—original draft preparation, Z.W., L.L., W.Z., B.Z. and H.C.; writing—review and editing, All authors; visualization, Z.W. and X.L. (Xinyi Li); supervision, L.L., G.W. and G.S.; project administration, L.L., G.W. and G.S.; funding acquisition, L.L., G.W. and G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Committee Project, grant number 2024-YYTJ-QD-011-00 (Intelligent Compilation and Rapid Generation of Thematic Maps: Key Technology Research and Application). The APC was funded by the same grant.

Data Availability Statement

The data supporting the findings of this study are openly available in publicly accessible repositories, with details provided in the cited publications. Specific datasets and access links are as follows: RSVGD: Available at https://sunyuxi.github.io/publication/GeoVG (accessed on) (DOI: 10.1145/3503161.3548316). DIOR-RSVG: Hosted on GitHub at https://github.com/ZhanYang-nwpu/RSVG-pytorch (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2023.3250471). RSVG-HR: Accessible via GitHub at https://github.com/LANMNG/LQVG (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2024.3407598). OPT-RSVG: Available on GitHub at https://github.com/like413/OPT-RSVG (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2024.3423663). SARVG-T: Hosted on GitHub at https://github.com/CAESAR-Radi/TACMT (accessed on 1 May 2025) (DOI: 10.1016/j.isprsjprs.2025.02.022). RSSVG and SARVG-S: Hosted on GitHub at https://github.com/LwZhan-WUT/VGRSS (accessed on 1 May 2025) (DOI: 10.1109/TGRS.2025.3562717). VRSBench: Available on GitHub at https://github.com/lx709/VRSBench (accessed on 1 May 2025) (DOI: 10.48550/arXiv.2406.12384). COREval: Accessible via arXiv at https://doi.org/10.48550/arXiv.2411.18145 (accessed on 1 May 2025). XLRSBench: Hosted at https://xlrs-bench.github.io/ (accessed on 1 May 2025) (DOI: 10.48550/arXiv.2503.23771).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RSVG: Remote Sensing Visual Grounding
GeoVG: Geospatial Visual Grounding
MSVG: Multidimensional Semantic-Guidance Visual Grounding
MGVLF: Multi-Granularity Visual Language Fusion
QGVA: Query-Guided Visual Attention
MFE: Multilevel Feature Enhancement
RIG: Regional Indication Generator
AFS: Adaptive Feature Selection
LQVG: Language Query-Based Visual Grounding
MSCMA: Multistage Cross-Modal Alignment
MTAM: Multidimensional Text–Image Alignment Module
TACMT: Text-Aware Cross-Modal Transformer
PEFT: Parameter-Efficient Fine-Tuning
MSITFM: Multi-Scale Image-to-Text Fusion Module
TCM: Text Confidence Matching
MB-ORES: Multi-Branch Object Reasoner for Visual Grounding
CLIP: Contrastive Language–Image Pretraining
BLIP: Bootstrapping Language–Image Pretraining
ViT: Vision Transformer
DETR: Detection Transformer
mAP: Mean Average Precision
GFLOPS: Giga Floating-Point Operations Per Second
LoRA: Low-Rank Adaptation
VQA: Visual Question Answering
LSTM: Long Short-Term Memory
CNN: Convolutional Neural Network
HBB: Horizontal Bounding Box
OBB: Oriented Bounding Box
RRSIS: Referring Remote Sensing Image Segmentation
DIOR: Detection in Optical Remote Sensing
HRRSD: High Resolution Remote Sensing Detection
SPCD: Swimming Pool and Car Detection
SAR: Synthetic Aperture Radar
VRSBench: Versatile Vision–Language Benchmark for Remote Sensing Image Understanding
COREval: Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision–Language Models
SOTA: State-of-the-Art
IoU: Intersection-over-Union
meanIoU: Mean Intersection-over-Union
cumIoU: Cumulative Intersection-over-Union
BERT: Bidirectional Encoder Representations from Transformers
FQRNet: Novel Frequency and Query Refinement Network
LLM: Large Language Model
LHRS: Language Helps Remote Sensing
PAL: Prompt-Assisted Learning
GGL: Geometry-Guided Learning
DOTA: Detection of Objects from Top-Down Perspectives
OFA: One for All
BitFit: Bias-Term Fine-Tuning

Appendix A

We have compiled the GitHub links corresponding to the Transformer-based and MLLM-based methods mentioned in Table 3, which are presented in Table A1. Among them, the algorithms GeoVG [11], FQRNet [55], VSMR [44], QAMFN [45], MSVG [46], CrossVG [43], APMOR [40], RINet [42], VGRSS [29], Eff-Grounding DINO [41], SkyEyeGPT [48], EarthGPT [49], and GeoGround [15] have no publicly available code.
Table A1. GitHub links for Transformer-based and MLLM-based methods (partial).
Methods | Link
MGVLF [12] | https://github.com/ZhanYang-nwpu/RSVG-pytorch (accessed on 1 May 2025)
LQVG [19] | https://github.com/LANMNG/LQVG (accessed on 1 May 2025)
LPVA [20] | https://github.com/like413/OPT-RSVG (accessed on 1 May 2025)
TACMT [23] | https://github.com/CAESAR-Radi/TACMT (accessed on 1 May 2025)
MSANet [39] | https://github.com/waynamigo/MSAM (accessed on 1 May 2025)
CSDNet [47] | https://github.com/WUTCM-Lab/CSDNet (accessed on 1 May 2025)
GeoChat [14] | https://github.com/mbzuai-oryx/geochat (accessed on 1 May 2025)
SkySenseGPT [24] | https://github.com/Luo-Z13/SkySenseGPT (accessed on 1 May 2025)
LHRS-Bot [50] | https://github.com/NJU-LHRS/LHRS-Bot (accessed on 1 May 2025)
EarthDial [53] | https://github.com/hiyamdebary/EarthDial (accessed on 1 May 2025)
VHM [51] | https://github.com/opendatalab/VHM (accessed on 1 May 2025)
GeoPix [52] | https://github.com/Norman-Ou/GeoPix (accessed on 1 May 2025)

Appendix B

This appendix comprehensively documents the partitioning methodologies employed by the datasets cataloged in Table 2. Critically, VRSBench, COREval, and XLRSBench function exclusively as evaluation-only benchmarks and are thus excluded from partitioning documentation. A consolidated summary appears in Table A2, with granular partition specifications for DIOR-RSVG, OPT-RSVG, and RSVG-HR detailed in Table A3, Table A4, and Table A5, respectively. Notably, DIOR-RSVG implements two distinct partitioning schemes, 4:1:5 and 7:1:2; Table A3 shows only the former.
Table A2. Summary of partition schemes.
Dataset | Training (%) | Validation (%) | Test (%)
DIOR-RSVG(1) [12] | 40 | 10 | 50
DIOR-RSVG(2) [12] | 70 | 10 | 20
RSVG-HR [19] | 80 | - | 20
OPT-RSVG [20] | 40 | 10 | 50
RSSVG [29] | 70 | 10 | 20
SARVG-T | 80 | 10 | 10
SARVG-S [29] | 70 | 10 | 20
Table A3. Training, validation, and test instance numbers for DIOR-RSVG (4:1:5).
No. | Class Name | Training | Validation | Test
C01 | vehicle | 2888 | 714 | 3559
C02 | dam | 401 | 91 | 518
C03 | airplane | 664 | 199 | 842
C04 | stadium | 471 | 119 | 591
C05 | overpass | 908 | 203 | 1090
C06 | ground track field | 984 | 223 | 1237
C07 | golf field | 426 | 91 | 523
C08 | baseball field | 1457 | 353 | 1800
C09 | basketball court | 510 | 139 | 637
C10 | tennis court | 611 | 133 | 765
C11 | expressway toll station | 443 | 108 | 561
C12 | expressway service area | 552 | 157 | 703
C13 | windmill | 1175 | 312 | 1466
C14 | bridge | 1027 | 285 | 1277
C15 | harbor | 238 | 49 | 291
C16 | train station | 351 | 92 | 447
C17 | airport | 494 | 143 | 646
C18 | chimney | 502 | 116 | 620
C19 | storage tank | 477 | 123 | 630
C20 | ship | 749 | 182 | 957
- | Total | 15,328 | 3832 | 19,160
Table A4. Training, validation, and test sample numbers for OPT-RSVG.
Table A4. Training, validation, and test sample numbers for OPT-RSVG.
No.Class NameTrainingValidationTest
C01airplane9792301142
C02ground track field16003652066
C03tennis court10932841313
C04bridge16994522212
C05basketball court10362631385
C06storage tank10502711264
C07ship10842431241
C08baseball diamond14773611744
C09T junction16634252055
C10crossroad16704052088
C11parking lot10492681368
C12harbor758209953
C13vehicle32948114083
C14swimming pool11283081563
-Total19,580489524,477
Table A5. Training and test sample numbers for RSVG-HR.
Table A5. Training and test sample numbers for RSVG-HR.
No.Class NameTrainingTest
C01Baseball field44381
C02Basketball court20138
C03Ground track field16350
C04Roundabout534122
C05Swimming pool15040
C06Storage tank16350
C07Tennis court497118
-Total2151499

References

  1. Zhao, B. A Systematic Survey of Remote Sensing Image Captioning. IEEE Access 2021, 9, 154086–154111. [Google Scholar] [CrossRef]
  2. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
  3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  4. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Volume 162, pp. 12888–12900. [Google Scholar]
  5. Wang, J.; Ma, A.; Chen, Z.; Zheng, Z.; Wan, Y.; Zhang, L.; Zhong, Y. EarthVQANet: Multi-Task Visual Question Answering for Remote Sensing Image Understanding. ISPRS J. Photogramm. Remote Sens. 2024, 212, 422–439. [Google Scholar] [CrossRef]
  6. Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote Sensing Object Detection Meets Deep Learning: A Metareview of Challenges and Advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar] [CrossRef]
  7. Xie, Y.; Liu, S.; Chen, H.; Cao, S.; Zhang, H.; Feng, D.; Wan, Q.; Zhu, J.; Zhu, Q. Localization, Balance, and Affinity: A Stronger Multifaceted Collaborative Salient Object Detector in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17. [Google Scholar] [CrossRef]
  8. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1307–1315. [Google Scholar]
  9. Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-Stage Visual Grounding by Recursive Sub-Query Construction. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 387–404. [Google Scholar]
  10. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  11. Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual Grounding in Remote Sensing Images. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; ACM: Lisboa, Portugal, 2022; pp. 404–412. [Google Scholar]
  12. Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  13. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
  14. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. GeoChat: Grounded Large Vision-Language Model for Remote Sensing. arXiv 2023, arXiv:2311.15826. [Google Scholar]
  15. Zhou, Y.; Lan, M.; Li, X.; Feng, L.; Ke, Y.; Jiang, X.; Li, Q.; Yang, X.; Zhang, W. GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv 2025, arXiv:2411.11904. [Google Scholar]
  16. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; The PRISMA Group. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med. 2009, 6, e1000097. [Google Scholar] [CrossRef]
  17. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1749–1759. [Google Scholar]
  18. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  19. Lan, M.; Rong, F.; Jiao, H.; Gao, Z.; Zhang, L. Language Query-Based Transformer with Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  20. Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  22. Swimming Pool and Car Detection. Available online: https://www.kaggle.com/datasets/kbhartiya83/swimming-pool-and-car-detection (accessed on 23 May 2025).
  23. Li, T. TACMT: Text-Aware Cross-Modal Transformer for Visual Grounding on High-Resolution SAR Images. ISPRS J. Photogramm. Remote Sens. 2025, 222, 152–166. [Google Scholar] [CrossRef]
  24. Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding. arXiv 2024, arXiv:2406.10100. [Google Scholar]
  25. Li, X.; Ding, J.; Elhoseiny, M. VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. arXiv 2024, arXiv:2406.12384. [Google Scholar]
  26. Ding, J.; Xue, N.; Xia, G.-S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796. [Google Scholar] [CrossRef]
  27. An, X.; Sun, J.; Gui, Z.; He, W. COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models. arXiv 2024, arXiv:2411.18145v1. [Google Scholar]
  28. Wang, F.; Wang, H.; Chen, M.; Wang, D.; Wang, Y.; Guo, Z.; Ma, Q.; Lan, L.; Yang, W.; Zhang, J.; et al. XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? arXiv 2025, arXiv:2503.23771. [Google Scholar]
  29. Chen, Y.; Zhan, L.; Zhao, Y.; Xiong, S.; Lu, X. VGRSS: Datasets and Models for Visual Grounding in Remote Sensing Ship Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–11. [Google Scholar] [CrossRef]
  30. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  32. Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4682–4692. [Google Scholar]
  33. Sadhu, A.; Chen, K.; Nevatia, R. Zero-Shot Grounding of Objects from Natural Language Queries. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4693–4702. [Google Scholar]
  34. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10877–10886. [Google Scholar]
  35. Ye, J.; Tian, J.; Yan, M.; Yang, X.; Wang, X.; Zhang, J.; He, L.; Lin, X. Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15481–15491. [Google Scholar]
  36. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 6877–6886. [Google Scholar]
  37. Du, Y.; Fu, Z.; Liu, Q.; Wang, Y. Visual Grounding with Transformers. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  38. Xie, Y.; Zhan, N.; Zhu, J.; Xu, B.; Chen, H.; Mao, W.; Luo, X.; Hu, Y. Landslide Extraction from Aerial Imagery Considering Context Association Characteristics. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103950. [Google Scholar] [CrossRef]
  39. Wang, F.; Wu, C.; Wu, J.; Wang, L.; Li, C. Multistage Synergistic Aggregation Network for Remote Sensing Visual Grounding. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  40. Qiu, H.; Wang, L.; Zhang, M.; Zhao, T.; Li, H. Attribute-Prompting Multi-Modal Object Reasoning Transformer for Remote Sensing Visual Grounding. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 9029–9032. [Google Scholar]
  41. Hu, Z.; Gao, K.; Zhang, X.; Yang, Z.; Cai, M.; Zhu, Z.; Li, W. Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14. [Google Scholar] [CrossRef]
  42. Hang, R.; Xu, S.; Liu, Q. A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
  43. Choudhury, S.; Kurkure, P.; Talwar, P.; Banerjee, B. CrossVG: Visual Grounding in Remote Sensing with Modality-Guided Interactions. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 2858–2862. [Google Scholar]
  44. Ding, Y.; Xu, H.; Wang, D.; Li, K.; Tian, Y. Visual Selection and Multistage Reasoning for RSVG. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  45. Li, C.; Zhang, W.; Bi, H.; Li, J.; Li, S.; Yu, H.; Sun, X.; Wang, H. Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  46. Ding, Y.; Wang, D.; Li, K.; Zhao, X.; Wang, Y. Visual Grounding of Remote Sensing Images with Multi-Dimensional Semantic-Guidance. Pattern Recognit. Lett. 2025, 189, 85–91. [Google Scholar] [CrossRef]
  47. Zhao, Y.; Chen, Y.; Yao, R.; Xiong, S.; Lu, X. Context-Driven and Sparse Decoding for Remote Sensing Visual Grounding. Inf. Fusion 2025, 123, 103296. [Google Scholar] [CrossRef]
  48. Zhan, Y.; Xiong, Z.; Yuan, Y. SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77. [Google Scholar] [CrossRef]
  49. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
  50. Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 440–457. [Google Scholar]
  51. Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.-S.; et al. VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis. arXiv 2024, arXiv:2403.20213. [Google Scholar] [CrossRef]
  52. Ou, R.; Hu, Y.; Zhang, F.; Chen, J.; Liu, Y. GeoPix: A Multimodal Large Language Model for Pixel-Level Image Understanding in Remote Sensing. IEEE Geosci. Remote Sens. Mag. 2025, 2–16. [Google Scholar] [CrossRef]
  53. Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues. arXiv 2025, arXiv:2412.15190. [Google Scholar]
  54. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Li, J.; Mao, X. EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–19. [Google Scholar] [CrossRef]
  55. Zhao, E.; Wan, Z.; Zhang, Z.; Nie, J.; Liang, X.; Huang, L. A Spatial-Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  56. Zhang, X.; Wu, C.; Zhang, Y.; Xie, W.; Wang, Y. Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images. Nat. Commun. 2023, 14, 4542. [Google Scholar] [CrossRef]
  57. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752. [Google Scholar]
  58. Irvin, J.A.; Liu, E.R.; Chen, J.C.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data. arXiv 2024, arXiv:2410.06234. [Google Scholar]
  59. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  60. Frantar, E.; Alistarh, D. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv 2023, arXiv:2301.00774. [Google Scholar]
  61. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge Distillation of Large Language Models. arXiv 2024, arXiv:2306.08543. [Google Scholar]
  62. Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-Bit Quantization of Neural Networks for Efficient Inference. arXiv 2019, arXiv:1902.06822. [Google Scholar]
  63. Xia, Q.; Ye, W.; Tao, Z.; Wu, J.; Li, Q. A Survey of Federated Learning for Edge Computing: Research Problems and Solutions. High-Confid. Comput. 2021, 1, 100008. [Google Scholar] [CrossRef]
Figure 1. Process of extracting relevant papers.
Figure 2. Comparison of three task definitions. (a) Object detection; (b) Visual grounding; (c) Referring image segmentation. The red box/mask marks the position of the target in the image.
Figure 3. An overview of the research progress in RSVG from the perspective of the technical roadmap. The corresponding citations for abbreviated works can be found in the main text [11,12,14,19,20,23,24,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54].
Figure 4. Comparison of two method architectures. (a) Architecture of traditional transformer-based methods; (b) Architecture of MLLM-based methods. The red box marks the position of the target in the image.
Figure 5. Failure case of GeoPix [52]: task confusion causes segmentation tokens to appear in visual grounding results, so the model outputs the corresponding instance mask when the user requests visual grounding.
Figure 6. Flowchart of RINet [42]. The features extracted by DarkNet-53 and BERT are fed into RIG to generate an initial regional indication map, which is used to produce high-resolution features via CAM for grounding the target object. These high-resolution features in turn refine the indication map through CG. Dashed arrows represent the flow of language information, and circular arrows indicate the fine-tuning path.
Figure 7. Architecture of GeoGround [15]. The CLIP-ViT visual encoder processes input images. Features are projected through a two-layer MLP connector and fed into the LLM along with language queries. For Referring Expression Comprehension (REC) tasks, the model outputs Horizontal Bounding Boxes (HBBs) or Oriented Bounding Boxes (OBBs). For Referring Expression Segmentation (RES) tasks, it generates segmentation masks. Additionally, the architecture supports multi-object localization beyond single-target outputs.
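To make the data flow in Figure 7 concrete, the following Python sketch (PyTorch-style) illustrates how visual tokens from a CLIP-ViT encoder could be projected through a two-layer MLP connector and concatenated with language-query embeddings before an LLM generates bounding-box text. The module names, dimensions, output format, and the llm_decode placeholder are illustrative assumptions, not GeoGround's released implementation.

import torch
import torch.nn as nn

class TwoLayerConnector(nn.Module):
    """Illustrative two-layer MLP mapping visual tokens into the LLM embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_patches, vis_dim) produced by a frozen CLIP-ViT encoder
        return self.proj(vis_tokens)

def ground(vis_tokens, query_embeds, connector, llm_decode):
    """Hypothetical grounding step: fuse projected visual tokens with the tokenized
    referring expression and let the LLM emit a textual bounding box."""
    fused = torch.cat([connector(vis_tokens), query_embeds], dim=1)
    # The exact serialization of the HBB/OBB (e.g., "[120, 48, 311, 190]") is an assumption.
    return llm_decode(fused)

The key design point suggested by Figure 7 is that localization is expressed as ordinary text generation, so the same interface can emit horizontal boxes, oriented boxes, or compressed mask descriptions without changing the visual pathway.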
Table 1. A comparison of traditional object detection, visual grounding, and referring image segmentation.
| Traditional Object Detection | Visual Grounding | Referring Image Segmentation
Input Modality | visual modality | visual modality + linguistic modality | visual modality + linguistic modality
Output Form | bounding boxes | bounding boxes | pixel-level masks
Target Definition | Predefined closed categories | Open-vocabulary descriptions | Fine-grained semantic descriptions
Table 2. Major datasets for remote sensing visual grounding tasks.
Datasets | Type | Image Sources | Ann. Format | Total Images | Total Objects | Avg. Length | Image Size
RSVGD [11] | RGB | DIOR | HBB | 4239 | 7933 | 28.33 | 1024 × 1024
DIOR-RSVG [12] | RGB | DIOR | HBB | 17,402 | 38,320 | 7.47 | 800 × 800
RSVG-HR [19] | RGB | DIOR | HBB | 2650 | 2650 | 19.6 | 1024 × 1024
OPT-RSVG [20] | RGB | HRRSD, DIOR, SPCD | HBB | 25,452 | 48,952 | 10.10 | -
RSSVG [29] | RGB | FAIR1M, CGWX, DIOR-RSVG | HBB | 11,157 | 25,237 | 9.77 | -
SARVG-T [23] | SAR | CAPELLA, GF-3, ICEYE SAR | HBB | 2465 | 7617 | - | 512 × 512
SARVG-S [29] | SAR | SAR-ship-Dataset | HBB | 43,798 | 54,429 | 7.72 | -
Benchmark
VRSBench [25] | RGB | DOTA-v2, DIOR | OBB | 29,614 | 52,472 | 14.31 | 512 × 512
COREval [27] | RGB | Google Earth | HBB/OBB | - | 200 | - | 800 × 800
XLRSBench [28] | RGB | DOTA-v2, ITCVD | OBB | - | 12,619 | - | 8500 × 8500
Since both Li et al. [23] and Chen et al. [29] proposed SARVG datasets, they are distinguished here as SARVG-T, whose targets are power transmission towers, and SARVG-S, whose targets are ships.
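To illustrate the two annotation formats listed in Table 2, the short Python sketch below represents an oriented bounding box (OBB) by its centre, size, and rotation, and derives the smallest horizontal bounding box (HBB) that encloses it. The parameterization and function names are assumptions for illustration and are not tied to any particular dataset's loader.

import math

def obb_corners(cx, cy, w, h, angle_rad):
    """Corner points of an oriented box given centre, width, height, and rotation."""
    dx, dy = w / 2.0, h / 2.0
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a)
            for x, y in [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]]

def obb_to_hbb(cx, cy, w, h, angle_rad):
    """Smallest axis-aligned box (xmin, ymin, xmax, ymax) enclosing the OBB."""
    xs, ys = zip(*obb_corners(cx, cy, w, h, angle_rad))
    return min(xs), min(ys), max(xs), max(ys)

# Example: a 40 x 20 box rotated by 30 degrees around (100, 100).
print(obb_to_hbb(100, 100, 40, 20, math.radians(30)))

The conversion is lossy in one direction only: an HBB can always be recovered from an OBB, which is why OBB-annotated benchmarks such as VRSBench can also be evaluated with HBB-based metrics.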
Table 3. Summary of Transformer-based and MLLM-based methods: architectural configurations, dataset, and performance on the DIOR-RSVG benchmark. Appendix A compiles the GitHub links for the Transformer-based and MLLM-based methods.
Methods | Visual Enc. | Text Enc./LLM | Params. | Training Set | Test Set | Pr@0.5 | mIoU
  • Transformer-based methods
GeoVG [11] | - | - | - | 26,991 | 7500 | 57.78 | -
MGVLF [12] | ResNet-50 | BERT | 152.5 | 26,991 | 7500 | 76.78 | 68.04
LQVG [19] | ResNet-50 | BERT | 166.3 | 26,991 | 7500 | 83.41 | 74.02
APMOR [40] | ResNet-101 | BERT | - | 26,991 | 7500 | 79.37 | 68.86
Eff-Grounding DINO [41] | ResNet-50 | BERT | 169.3 | 26,991 | 7500 | 83.05 | 73.41
RINet [42] | DarkNet-53 | BERT | - | 26,991 | 7500 | 64.14 | -
CrossVG [43] | ViT-B/16 | BERT | - | 26,991 | 7500 | 77.51 | 70.56
VGRSS [29] | ResNet-50 | BERT | - | 26,991 | 7500 | 83.01 | 74.85
MSANet [39] | DarkNet-53 | BERT | - | 26,991 | 7500 | 74.23 | 64.88
VSMR [44] | ResNet-50 | BERT | - | 15,328 | 19,160 | 78.24 | 68.88
QAMFN [45] | ResNet-50 | BERT | 128.4 | 15,328 | 19,160 | 81.67 | 71.48
MSVG [46] | ResNet-101 | BERT | - | 15,328 | 19,160 | 83.61 | 72.87
LPVA [20] | ResNet-50 | BERT | 156.2 | 15,328 | 19,160 | 82.27 | 72.35
FQRNet [55] | ResNet-50 | BERT | - | 15,328 | 19,160 | 77.23 | 68.35
CSDNet [47] | ResNet-101 | BERT | 154.64 | 27,133 | 7422 | 80.92 | 70.88
TACMT [23] | ResNet-50 | BERT | 150.9 | - | - | - | -
  • MLLM-based methods
GeoChat [14] | CLIP-ViT | Vicuna-v1.5 | ~7 B | GeoChat-Instruction | 555 | - | -
SkyEyeGPT [48] | EVA-CLIP | LLaMA2 | ~7 B | SkyEye-968k | 7500 | 88.59 | -
EarthGPT [49] | DINO-ViT + CLIP-ConvNeXt | LLaMA2 | ~7 B | MMRS-1M | 7500 | 76.65 | 69.34
SkySenseGPT [24] | CLIP-ViT | Vicuna-v1.5 | ~7 B | FIT-RS | - | - | -
LHRS-Bot [50] | CLIP-ViT | LLaMA2 | ~7 B | LHRS-Instruct | 7500 | 88.10 | -
VHM [51] | CLIP-ViT | Vicuna-v1.5 | ~7 B | VariousRS-Instruct | - | - | -
GeoPix [52] | CLIP-ViT | Vicuna-v1.5 | ~7 B | GeoPixInstruct | - | - | -
EarthDial [53] | InternViT | Phi-3-mini | ~4 B | EarthDial-Instruct | - | - | -
GeoGround [15] | CLIP-ViT | Vicuna-v1.5 | ~7 B | refGeo | 7500 | 77.73 | -
For Transformer-based methods, Training set refers to the number of training instances from DIOR-RSVG. For MLLM-based methods, Training set refers to the fine-tuning dataset used.
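For readers unfamiliar with the two metrics in Table 3: Pr@0.5 is the fraction of test expressions whose predicted box overlaps the ground-truth box with IoU ≥ 0.5, and mIoU is the mean IoU over all test samples. The following minimal Python sketch computes both for horizontal boxes in (xmin, ymin, xmax, ymax) format, assuming exactly one predicted box per referring expression; it is an illustration of the standard definitions rather than any paper's evaluation script.

def iou(box_a, box_b):
    """IoU of two horizontal boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def evaluate(preds, gts, threshold=0.5):
    """Return (Pr@threshold, mIoU) over paired predicted and ground-truth boxes."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    precision = sum(v >= threshold for v in ious) / len(ious)
    mean_iou = sum(ious) / len(ious)
    return precision, mean_iou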
Table 4. Comparative analysis of Transformer-based vs. MLLM-based approaches.
| Transformer-Based | MLLM-Based
Visual Encoder | ResNet-50/DarkNet-53 | ViT
Text Encoder | BERT | -
Strengths | Computational efficiency; task specificity; low training cost | Strong cross-modal alignment; open-world adaptability; advanced semantic reasoning
Weaknesses | Limited generalization; shallow semantic understanding; coarse alignment | High resource demands; localization constraints from text regression; data dependency
Table 5. Comparative analysis of innovations across RSVG methods.
Methods | S.H. | S.C. | A.S. | Innovations
GeoVG [11]×
  • The numeric context module represents complex expressions as geospatial relation graphs.
  • The adaptive region attention module extracts key visual content.
MGVLF [12]××
  • Multi-scale visual features and multi-granularity text embeddings are utilized to learn more discriminative representations.
  • Irrelevant noise is adaptively filtered, and salient features are enhanced.
VSMR [44]×
  • The multimodal enhancer and adaptive feature selection module focuses visual feature attention on language-related regions.
  • The multistage decoder (MSD) reduces ambiguity in reasoning by continuously considering visual and language information and performing iterative queries.
LQVG [19]×
  • Sentence-level text features are utilized as language query features for target retrieval.
  • The MSCMA module enhances semantic relevance.
QAMFN [45]×
  • The QGVA mechanism enhances visual features.
  • A text-semantic attention-guided masking (TAM) module filters redundant information.
MSVG [46]×
  • The MTAM enhances the correlation between visual features and text descriptions.
  • A visual enhancement fusion module (VEFM) strengthens feature relevance through contextual information.
  • Multistage decoding achieves final feature fusion and visual grounding.
LPVA [20]×
  • A progressive attention (PA) module dynamically generates multi-scale weights and biases to enable the visual backbone to gradually focus on features related to language expressions.
  • The MFE decoder aggregates visual contextual information to enhance the distinctiveness of target object features.
FQRNet [55]×
  • A spatial–frequency fusion strategy based on language query refinement addresses challenges of scale variation and blurred boundaries.
  • A frequency-guided spatial (FGS) module enhances spatial representation using spectral features.
  • A query-aware original attention (QOA) mechanism enables deep multimodal fusion.
MSANet [39]××
  • The MSAM aggregates multi-scale contextual information through a stacking strategy.
  • A generative paradigm is introduced to directly generate discrete coordinate sequences, enhancing interaction between the regression process and encoded features.
CrossVG [43]×
  • A cross-modal guidance encoder (CMGE) uses visual features to guide multi-granularity text embeddings.
  • A cross-modal decoder explores word-level attributes to improve target recognition accuracy.
APMOR [40]××
  • A learnable attribute prompter dynamically explores rich attribute information in remote sensing images.
  • An attribute-prompting multimodal fusion encoder establishes fine-grained interaction between visual and language features.
  • A multimodal progressive object reasoning decoder gradually queries more comprehensive object features.
TACMT [23]×
  • A text-aware query selection module optimizes decoder queries.
  • A cross-scale fusion module handles features of different scales.
RINet [42]×
  • A local-to-object strategy is adopted to locate target regions via a regional indication generator.
  • A word contribution learner evaluates the importance of each word in language expressions.
  • A multi-round fine-tuning process fully utilizes complex language information.
Eff-Grounding DINO [41]
  • A multi-scale image-to-text fusion module (MSITFM) updates text features via self-attention and uses scale-specific cross-attention for multi-scale visual feature fusion to reduce learning complexity.
  • A text confidence matching (TCM) mechanism introduces IoU-based confidence in label assignment to reduce mismatches.
VGRSS [29]×
  • The Language-Guided Visual Feature Enhancement (LVFE) module enhances visual features through text guidance before feature fusion to address the problem of insufficient utilization of text information.
  • The Visual–Language Fusion (VLF) module preserves spatial information through non-compressive stacking fusion and residual mechanisms.
  • The EIoU loss function is introduced into bounding box regression, and geometric constraints are utilized to improve the convergence accuracy of the model, which is especially suitable for the multi-scale characteristics of ship targets.
CSDNet [47]×
  • The Text-aware Fusion Module (TFM) modulates visual features using textual cues aggregated from image context to reduce target feature confusion.
  • The Context-Enhanced Interaction Module (CIM) harmonizes the differences between visual and textual features by modeling multimodal contexts.
  • The Text-Guided Sparse Decoder (TSD) addresses the issue of surface information redundancy.
GeoChat [14]
  • Supports image-level and region-level conversations.
SkyEyeGPT [48]
  • A high-quality RS instruction fine-tuning dataset with 968,000 instances enhances instruction fine-tuning of different granularities via a two-stage adjustment method.
EarthGPT [49]
  • The visual-enhanced perception mechanism refines and integrates coarse-scale semantic and detailed perceptual information.
  • The cross-modal mutual comprehension method enhances interactions between visual perception and language understanding to deepen multimodal comprehension.
  • Optimization is conducted using the unified instruction-following dataset MMRS-1M.
SkySenseGPT [24]×
  • Fine-grained instruction tuning with the high-quality FIT-RS dataset significantly improves the complex scene comprehension ability of remote sensing multimodal models.
LHRS-Bot [50]×
  • The large-scale weakly-labeled LHRS-Align dataset trains the visual perception module in the pretraining stage, followed by multi-task and instruction fine-tuning.
VHM [51]×
  • The large-scale high-quality VersaD dataset has detailed in-context examples, coupled with a fine-grained prompt framework and quality inspection mechanism.
GeoPix [52]
  • The class-wise learnable memory (CLM) module stores and retrieves intra-class shared geographic context to enhance model understanding of diverse instances in complex RS scenes.
  • The two-stage training strategy mitigates conflicts between generation and segmentation tasks.
EarthDial [53]
  • The adaptive high-resolution module meets the requirements of high-resolution RS imagery.
  • The data fusion module processes multi-band or multi-temporal data streams.
  • The three-stage training strategy integrates RGB pretraining, temporal fine-tuning, and multi-band optimization.
GeoGround [15]×
  • The text-mask paradigm compresses mask information into compact text sequences for efficient learning by VLMs.
  • Hybrid supervision integrates PAL and GGL to fine-tune models using three types of signals.
S.H.: Scale Heterogeneity; S.C.: Semantic Complexity; A.S.: Annotation Scarcity. √ indicates that the model has been improved for the corresponding issue, while × indicates no improvement.
