Cascaded Hierarchical Attention with Adaptive Fusion for Visual Grounding in Remote Sensing

Zhu, Huming; Gao, Tianqi; Li, Zhixian; Chen, Zhipeng; Li, Qiuming; Miao, Kongmiao; Hou, Biao; Jiao, Licheng

doi:10.3390/rs17172930


by Huming Zhu ¹,²,*, Tianqi Gao ¹, Zhixian Li ², Zhipeng Chen ², Qiuming Li ², Kongmiao Miao ³, Biao Hou ² and Licheng Jiao ²

¹ Hangzhou Institute of Technology, Xidian University, Hangzhou 311231, China
² School of Artificial Intelligence, Xidian University, Xi’an 710071, China
³ China Telecom Corporation Limited Shaoxing Branch, Shaoxing 312000, China
* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(17), 2930; https://doi.org/10.3390/rs17172930

Submission received: 22 July 2025 / Revised: 14 August 2025 / Accepted: 21 August 2025 / Published: 23 August 2025

(This article belongs to the Section AI Remote Sensing)


Abstract

Visual grounding for remote sensing (RSVG) is the task of localizing the referred object in remote sensing (RS) images by parsing free-form language descriptions. However, RSVG faces the challenge of low detection accuracy due to unbalanced multi-scale grounding capabilities, where large objects have more prominent grounding accuracy than small objects. Based on Faster R-CNN, we propose Faster R-CNN in Visual Grounding for Remote Sensing (FR-RSVG), a two-stage method for grounding RS objects. Building on this foundation, to enhance the ability to ground multi-scale objects, we propose Faster R-CNN with Adaptive Vision-Language Fusion (FR-AVLF), which introduces a layered Adaptive Vision-Language Fusion (AVLF) module. Specifically, this method can adaptively fuse deep or shallow visual features according to the input text (e.g., location-related or object characteristic descriptions), thereby optimizing semantic feature representation and improving grounding accuracy for objects of different scales. Given that RSVG is essentially an expanded form of RS object detection, and considering the knowledge the model acquired in prior RS object detection tasks, we propose Faster R-CNN with Adaptive Vision-Language Fusion Pretrained (FR-AVLF_PRE). To further enhance model performance, we propose Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion Pretrained (FR-CHAGAVLF_PRE), which introduces a cascaded hierarchical attention grounding mechanism, employs a more advanced language encoder, and improves upon AVLF by proposing Multi-Level AVLF, significantly improving localization accuracy in complex scenarios. Extensive experiments on the DIOR-RSVG dataset demonstrate that our model surpasses most existing advanced models. 
To validate the generalization capability of our model, we conducted zero-shot inference experiments on shared categories between DIOR-RSVG and both Complex Description DIOR-RSVG (DIOR-RSVG-C) and OPT-RSVG datasets, achieving performance superior to most existing models.

Keywords:

visual grounding for remote sensing; Faster R-CNN; vision-language fusion; DIOR-RSVG

1. Introduction

Remote sensing processing technology has a wide range of applications, including disaster detection [1], soil moisture [2], urban management [3], and military [4]. Recent advances in deep neural networks have addressed many challenges in computer vision and natural language processing, enabling the multi-modal fusion of vision and language for remote sensing image processing. Common vision-language multimodal tasks in remote sensing include text-based image retrieval [5,6], image captioning [7,8,9], text-based image generation [10,11], zero-shot applications [12,13], few-shot applications [14,15,16], and visual question answering [17,18].

RSVG is an RS image processing method based on multi-modal fusion of vision and language, and it can be considered an extension of visual remote sensing image object detection [19]. Visual remote sensing image object detection involves detecting all target objects and then locating and classifying them. In contrast, RSVG locates only the object in a remote sensing image that best matches a natural-language phrase, which makes it a vision-language object detection task.

In visual object detection [20,21,22], methods can be divided into one-stage [23,24,25,26], two-stage [19,27,28], and transformer-based approaches [29,30]. These methods have achieved great results in natural image visual grounding, but visual grounding in remote sensing is still in the development stage.

One-stage VG methods are computationally efficient but have limitations, such as relying on preset anchors or complex fusion mechanisms, which may lead to inaccurate positioning and overfitting [24]. Additionally, they often fail to fully utilize multimodal information, limiting performance in complex scenes. Transformer-based RSVG methods input vision-language features into fully connected layers to predict detection boxes [31], but this requires complex fusion of features. Two-stage visual grounding methods are underexplored in remote sensing and can significantly enhance detection precision [32,33,34]. Designing a two-stage visual grounding method for remote sensing images is thus highly beneficial.

To address this, we propose FR-RSVG, an effective two-stage RSVG method. We use the Swin Transformer-FPN backbone to generate five visual feature maps at different resolutions through hierarchical processing, while language features are generated via a language encoder. These features are fused using Multi-Level AVLF and then input into the object detection network to obtain a series of detection boxes. The fused features influence the confidence score during the region proposal network (RPN) stage, and final detection boxes are obtained through proposal ranking.

Due to large-scale differences among objects in remote sensing images, shallow visual encoders are effective at extracting small targets, while deep encoders excel with large targets. Existing methods typically determine the final detection based only on deep visual features. To address this, we propose a multi-level adaptive vision-language fusion module to determine object sizes from text information and adaptively combine shallow or deep visual features accordingly. Additionally, a Cascaded Hierarchical Attention Grounding module is proposed to improve detection accuracy for small targets.

In summary, our main contributions are as follows:

(1): We propose an effective two-stage RSVG model named FR-RSVG. It is based on the Faster R-CNN framework, which uses an RPN to obtain proposals, and the best-matching target is grounded via confidence ranking.
(2): Building upon FR-RSVG, we propose FR-AVLF, which is equipped with a layered adaptive vision-language fusion module. The visual characteristics of this model are derived through a flexible fusion of deep and shallow visual encoders, leveraging the supplied textual input to refine the hierarchical and semantic feature representation and augment the grounding accuracy for objects across diverse scales. Furthermore, based on FR-RSVG and FR-AVLF, we also propose FR-CHAGAVLF, which is equipped with a multi-level adaptive vision-language fusion module and a Cascaded Hierarchical Attention Grounding module.
(3): To investigate the effectiveness of weight transfer from remote sensing object detection to RSVG, we conducted extensive experiments using weights pretrained on ImageNet-1K and ImageNet-22K. We employed various backbone networks such as Swin-T, Swin-S, Swin-B, and Swin-L to construct different model architectures. Additionally, we compare the performance of different language encoders, including BERT, RoBERTa, and DeepSeek. The proposed weight transfer model FR-CHAGAVLF_PRE shows excellent grounding performance, with a Pr@0.9 of 59.42% on the DIOR-RSVG dataset, which reveals that this approach outperforms the direct RSVG dataset training in enhancing grounding accuracy.
(4): To validate the generalization performance of our model, we constructed the Complex-Description DIOR-RSVG (DIOR-RSVG-C) dataset based on the DIOR-RSVG dataset and conducted zero-shot inference using the FR-CHAGAVLF_PRE model weights. To further verify the model’s generalization capability, we also performed zero-shot inference experiments on shared categories between the DIOR-RSVG and OPT-RSVG datasets. The experimental results demonstrate that our model achieved excellent localization performance on both datasets, fully validating the model’s cross-dataset generalization capability.

2. Related Work

This section provides an overview of current research on visual grounding, highlighting the latest advancements in remote sensing.

2.1. Visual Grounding on Natural Image

Two-Stage Methods. Two-stage methods first predict a set of potential object regions in the image and then select the most matched proposal based on a specific language expression. Zhang et al. [32] proposed a variational Bayesian method called variational context to address the challenge of complex context modeling in visual grounding. This method introduces a cue-specific language-vision model that learns from end to end, reducing the search space for context and improving localization accuracy. Liu et al. [35] presented a cross-modal attention-guided erasing method, which selectively eliminates the most salient feature—either visual or linguistic—to explore complementary visual linguistic correspondences. Wang et al. [36] used node and edge attention components to discover inter-object relationships. Hong et al. [37] introduced RVG-TREE, an end-to-end model that generates a binary tree structure for parsing language. Zhong et al. [38] proposed RegionCLIP, a novel approach that enables learning region-level visual representations and aligns them precisely with textual concepts. However, there is currently no two-stage visual grounding method in remote sensing.

One-Stage Methods. One-stage methods are fast because they seamlessly combine the visual and linguistic modules to directly predict bounding boxes, eliminating time-consuming proposal generation. Liao et al. [39] proposed RCCF, a real-time method that remaps the understanding of referring sentences as a correlation filtering process and regression for result generation. Yang et al. [25] introduced a recursive sub-query construction method that iteratively processes visual and linguistic features, progressively improving performance. Huang et al. [26] developed a landmark feature convolution module that processes visual features guided by text in multiple directions, encoding position features to locate objects by combining contextual information with visual data. Liao et al. [40] proposed PLV, where text features guide the extraction of visual features in the early stages, followed by encoding and decoding to predict the results. While one-stage methods are efficient, they often rely on point features for object representation, which may lack the flexibility needed to capture intricate details in text expressions.

Transformer-Based Methods. Recent breakthroughs in transformers within computer vision and natural language processing have drawn significant attention from the research community, leading to the development of transformer-based methods for visual grounding. Carion et al. [41] proposed DETR, a groundbreaking end-to-end object detection transformer. Deng et al. [29] introduced TransVG, which establishes multi-modal correspondence using transformers and directly outputs the bounding box of the referred object. TransVG++ [42] improved on this by using both convolutional neural networks and a Transformer structure in the image branch, while the language branch relies solely on a Transformer structure, adopting the ViT method to process input images. Despite their successes, transformer-based methods for visual grounding have several drawbacks. First, they are computationally intensive, requiring long training and inference times and powerful hardware. Second, multimodal fusion remains a challenge, as aligning and integrating image and language features effectively can affect visual grounding accuracy. These aspects require further improvement.

2.2. Visual Grounding for Remote Sensing

Zhan et al. [31] proposed the DIOR-RSVG dataset and the MGVLF model for this task. MGVLF feeds the fused visual and language features into a regression network and directly predicts the single best-matching object; it is a widely used transformer-based visual grounding method. Sun et al. [43] also proposed an RSVG dataset and a model named GeoVG, which comprises a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relationships and represents intricate expressions as a geospatial relation graph. The image encoder analyzes large remote sensing scenes using adaptive region attention. The fusion module integrates the textual and visual features for visual grounding. Wang et al. [44] proposed the MSAM module, which enhances the extraction of multimodal features across scales by efficiently integrating visual and textual contexts, and adopted a generative paradigm that directly produces sequences of discrete coordinates in an autoregressive fashion; both components improve overall performance. Li et al. [45] proposed the LPVA module, which dynamically generates multi-scale language-adaptive weights so that the visual backbone progressively learns expression-relevant visual features at each layer, and the MFE module, which consolidates visual contextual information related to the specified target while mitigating the interference of complex background noise. This makes the object more distinctive and thereby increases localization accuracy. In this paper, our goal is to balance the ability to ground objects of different scales, and we propose the first two-stage grounding model with a layered adaptive vision-language fusion module for RS objects.

3. Materials and Methods

In this section, we introduce our vision-language object detection method. Section 3.1, Section 3.2, Section 3.3 and Section 3.4 describe our FR-RSVG, FR-AVLF, FR-AVLF_PRE, and FR-CHAGAVLF_PRE models in detail, respectively. The components and differences of these models are summarized in Table 1.

3.1. FR-RSVG: Faster R-CNN in Visual Grounding for Remote Sensing

Figure 1 shows the method of Faster R-CNN in Visual Grounding for Remote Sensing (FR-RSVG), which includes the extraction of visual features, extraction of language features, vision-language fusion, and detection head. The input is the query text and RS image pair, and the output is the grounded target object.

Extraction of visual features: In the visual backbone of the model, we use Swin Transformer [46] pretrained on ImageNet [47]. Suppose that the input image is $I \in \mathbb{R}^{H \times W \times C}$. Then, the visual feature process is shown in Equation (1), where $VF_{\times 1}^{'}$ is the visual feature extracted by the Swin Transformer:

$$VF_{\times 1}^{'} = \text{Swin Transformer}(I) \tag{1}$$

Extraction of language features: We use the pretrained BERT encoder [48] as the language feature extractor for our model. Because the query sentences are short and contain few words, we use BERT for inference only and do not train it. We therefore only project the features extracted by BERT through a fully connected layer so that they can be used by the downstream stages. The detailed extraction process is shown in Equation (2):

$$\begin{aligned}
LF_{0} &= \text{BERT Encoder}_{0}(x) \\
LF_{i} &= \text{BERT Encoder}_{i}(LF_{i-1}), \quad i \in [1, 11] \\
LF_{pool} &= \mathrm{BERTPooler}(LF_{11}) \\
LF' &= \mathrm{Concat}\left(LF_{pool},\ (LF_{8} \oplus LF_{9} \oplus LF_{10} \oplus LF_{11}) / 4\right) \\
LF &= \mathrm{ReLU}\left(\mathrm{FC}_{768 \to 625}(LF')\right)
\end{aligned} \tag{2}$$

In Equation (2), $x$ is the text, $\text{BERT Encoder}_{i}$ denotes the $i$-th transformer encoding layer of BERT, $LF_{i}$ is the feature extracted by the $i$-th BERT encoder layer, $\mathrm{BERTPooler}(\cdot)$ can be subdivided into $\mathrm{Tanh}(\mathrm{FC}(\cdot))$, where $\mathrm{Tanh}(\cdot)$ is the hyperbolic tangent activation function, $\mathrm{Concat}(\cdot)$ represents the concatenation of feature vectors, and $LF$ is the final language feature.

Vision-language fusion: The visual feature has a size of 25 × 25, whereas the language feature is a vector of length 625. To align the two, the language feature is first reshaped and then fused with the visual feature. The specific calculation steps are shown in Equation (3):

$$\begin{aligned}
LF_{\times 1}^{'} &= \mathrm{Reshape}_{625 \to 25 \times 25}(LF) \\
VLF &= \mathrm{Concat}(VF_{\times 1}^{'}, LF_{\times 1}^{'}) \\
VLF^{'} &= \mathrm{Conv}_{3 \times 3, 512 \to 256}\left(\mathrm{Conv}_{3 \times 3, 512 \to 512}(VLF)\right)
\end{aligned} \tag{3}$$

In Equation (3), $LF$ is the input language feature, $VF_{\times 1}^{'}$ is the visual feature, $LF_{\times 1}^{'}$ is the language feature after reshaping, $\mathrm{Conv}_{3 \times 3, 512 \to 512}(\cdot)$ is a 3 × 3 convolution, and $\mathrm{Conv}_{3 \times 3, 512 \to 256}(\cdot)$ is a second 3 × 3 convolution that changes the number of channels from 512 to 256.

Detection and sorting by confidence: The object detection stage is the same as in the traditional Faster R-CNN [49] model. The RPN phase produces the positive and negative (object versus background) confidences and the box coordinates of the proposals. The second detection stage then outputs a series of object boxes, each with a score value, and we select the box with the largest score as the final detection result.
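The ranking step can be illustrated with a minimal Python sketch. The `Detection` container, the function name, and the example boxes are hypothetical and only serve to show the selection rule; they are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float                            # confidence from the detection head


def ground_by_confidence(detections: List[Detection]) -> Optional[Detection]:
    """Return the single highest-scoring box, or None if nothing was detected."""
    if not detections:
        return None
    return max(detections, key=lambda d: d.score)


# Hypothetical second-stage outputs for one image/query pair.
candidates = [
    Detection((10.0, 20.0, 110.0, 90.0), 0.42),
    Detection((300.0, 40.0, 380.0, 120.0), 0.91),
    Detection((15.0, 25.0, 100.0, 95.0), 0.37),
]
best = ground_by_confidence(candidates)
```

Because grounding returns exactly one box per query, no score threshold is applied in this sketch.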

3.2. FR-AVLF: Layered Adaptive Vision-Language Fusion in RSVG

Figure 2 shows the adaptive vision-language fusion method. The visual feature size extracted by the Swin Transformer will gradually become smaller as the number of block layers increases. Therefore, for the Swin Transformer, the visual features in the deep layers pay more attention to global information and are suitable for detecting large objects. In the shallow layers, more attention is paid to local information, which is suitable for detecting small objects. FR-AVLF automatically determines the size of the object to be detected from the text information and finally adaptively combines the deep or shallow layers of visual features according to the size of the object.

The FR-AVLF method can improve the multi-scale detection ability of vision-language models, and it is added to FR-RSVG as a module in this subsection. Figure 3 depicts FR-AVLF, which includes the extraction of multi-scale visual features, alignment of vision-language features, and hierarchical adaptive vision-language feature fusion.

Extraction of multi-scale visual features: Since remote sensing images frequently exhibit significant variations in the scale of observed targets, we propose layered adaptive vision-language fusion to adaptively fuse multi-scale visual and language features. To obtain the multi-scale visual features, we employed the Swin Transformer as the backbone network, which provides hierarchical feature representations through its four-stage architecture with patch merging operations that progressively reduce spatial resolution while increasing the feature dimensions. The Swin Transformer’s shifted window mechanism enables efficient cross-window information exchange while maintaining linear computational complexity. These multi-scale features from different stages are then enhanced through a feature pyramid network (FPN), which constructs a feature pyramid with strong semantic information at all scales via its top-down pathway and lateral connections. The FPN’s lateral connections use 1 × 1 convolutions to align channel dimensions and element-wise addition to fuse features from the bottom-up pathway (Swin Transformer stages) with the top-down pathway (upsampled higher-level features). After feature extraction by the Swin Transformer with an FPN [50], the output consists of 256-channel feature maps at 5 different scales: 13 × 13, 25 × 25, 50 × 50, 100 × 100, and 200 × 200. These feature maps are denoted as $VF_{\times 1/2}^{'}$, $VF_{\times 1}^{'}$, $VF_{\times 2}^{'}$, $VF_{\times 4}^{'}$, and $VF_{\times 8}^{'}$, respectively, as shown in Equation (4):

$$VF_{\times 1/2}^{'},\ VF_{\times 1}^{'},\ VF_{\times 2}^{'},\ VF_{\times 4}^{'},\ VF_{\times 8}^{'} = \text{Swin Transformer with FPN}(I) \tag{4}$$

Align language and visual features: We reshaped the language feature to 25 × 25 and then upsampled and downsampled it. The resulting language features have sizes of 13 × 13, 25 × 25, 50 × 50, 100 × 100, and 200 × 200, denoted as $LF_{\times 1/2}^{'}$, $LF_{\times 1}^{'}$, $LF_{\times 2}^{'}$, $LF_{\times 4}^{'}$, and $LF_{\times 8}^{'}$, respectively. The language features are resized as shown in Equation (5) so that they can be concatenated with the multi-scale image features:

$$\begin{aligned}
LF_{\times 1}^{'} &= \mathrm{Reshape}_{625 \to 25 \times 25}(LF) \\
LF_{\times 2}^{'},\ LF_{\times 4}^{'},\ LF_{\times 8}^{'} &= \mathrm{Upsample}_{scale = 2,\ bilinear}(LF_{\times 1}^{'},\ LF_{\times 2}^{'},\ LF_{\times 4}^{'}) \\
LF_{\times 1/2}^{'} &= \mathrm{MaxPool}(LF_{\times 1}^{'})
\end{aligned} \tag{5}$$

In Equation (5), $LF$ represents the input language feature, $\mathrm{Reshape}_{625 \to 25 \times 25}$ means transforming the 625-dimensional vector into a 25 × 25 matrix, $\mathrm{Upsample}_{scale = 2,\ bilinear}(\cdot)$ is bilinear upsampling with a scale factor of 2, and $\mathrm{MaxPool}(\cdot)$ represents the max pooling operation.
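The resizing chain of Equation (5) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the text does not give the pooling kernel or the interpolation convention, so a 2 × 2 max pool with stride 2 and ceil-mode padding (which maps 25 to 13) and half-pixel-centre bilinear interpolation are assumed, and all function names are hypothetical.

```python
import numpy as np


def upsample2x_bilinear(x):
    """Bilinear 2x upsampling of a 2-D map (half-pixel centres, edges clamped)."""
    h, w = x.shape
    # Source coordinate of every output pixel centre, clamped to the valid range.
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def maxpool2x_ceil(x):
    """2x2 max pooling with stride 2; odd sizes are padded with -inf (assumed)."""
    h, w = x.shape
    padded = np.pad(x, ((0, (-h) % 2), (0, (-w) % 2)), constant_values=-np.inf)
    ph, pw = padded.shape
    return padded.reshape(ph // 2, 2, pw // 2, 2).max(axis=(1, 3))


def language_pyramid(lf):
    """Return [x1/2, x1, x2, x4, x8] language maps from a 625-d vector."""
    base = lf.reshape(25, 25)            # Reshape_{625 -> 25x25}
    x2 = upsample2x_bilinear(base)       # 50 x 50
    x4 = upsample2x_bilinear(x2)         # 100 x 100
    x8 = upsample2x_bilinear(x4)         # 200 x 200
    half = maxpool2x_ceil(base)          # 13 x 13
    return [half, base, x2, x4, x8]


levels = language_pyramid(np.linspace(0.0, 1.0, 625))
shapes = [m.shape for m in levels]
```

Each upsampled map is derived from the previous one, which mirrors the chained arguments of the Upsample term in Equation (5).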

Layered adaptive vision-language feature fusion: Vision-language feature fusion is carried out level by level, and the specific calculation steps are shown in Equation (6):

$$\begin{aligned}
VLF &= \{\mathrm{Concat}(VF_{\times 1/2}^{'}, LF_{\times 1/2}^{'}),\ \mathrm{Concat}(VF_{\times 1}^{'}, LF_{\times 1}^{'}),\ \mathrm{Concat}(VF_{\times 2}^{'}, LF_{\times 2}^{'}),\ \mathrm{Concat}(VF_{\times 4}^{'}, LF_{\times 4}^{'}),\ \mathrm{Concat}(VF_{\times 8}^{'}, LF_{\times 8}^{'})\} \\
VLF^{'} &= \mathrm{Conv}_{3 \times 3, 512 \to 256}\left(\mathrm{Conv}_{3 \times 3, channel \to 512}(VLF)\right)
\end{aligned} \tag{6}$$

In Equation (6), $\mathrm{Conv}_{3 \times 3, channel \to 512}(\cdot)$ is a 3 × 3 convolution that changes the number of channels from $channel$ to 512, $\mathrm{Conv}_{3 \times 3, 512 \to 256}(\cdot)$ is a second 3 × 3 convolution that changes the number of channels from 512 to 256, and $VLF^{'}$ is the final multi-scale vision-language fusion feature.

3.3. FR-AVLF_PRE: Transfer Remote Sensing Image Object Detection Model Weights to FR-AVLF

In FR-AVLF_PRE, we transfer the weights of a remote sensing image object detection model to the visual grounding task for remote sensing images (Figure 4). The transferred components include the Swin Transformer (combined with an FPN), RPN, and detection module. Since the remote sensing image object detection model already possesses strong object detection capabilities, we adopted a freezing strategy during the transfer process; before the fusion of visual features and language features, all parameters of the Swin Transformer backbone, FPN, RPN, and detection module were kept frozen, and only the fusion module and subsequent network layers were updated. During training, we did not apply layer-wise unfreezing or differential learning rates to the frozen parts; instead, they remained completely frozen to ensure that the pretrained detection features were preserved.
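The freezing policy can be expressed as a simple rule over parameter names. The sketch below is framework-free and purely illustrative: the prefixes and parameter names are hypothetical and do not come from the authors' code. In a framework such as PyTorch, the resulting flag would typically be written to each parameter's `requires_grad` attribute.

```python
# Hypothetical name prefixes for the transferred (frozen) components.
FROZEN_PREFIXES = ("backbone.", "fpn.", "rpn.", "det_head.")


def trainable_flags(param_names):
    """Map each parameter name to True (updated) or False (kept frozen)."""
    return {name: not name.startswith(FROZEN_PREFIXES) for name in param_names}


# Hypothetical parameter names of a grounding model built on a detector.
names = [
    "backbone.stage1.attn.qkv.weight",
    "fpn.lateral0.weight",
    "rpn.cls_logits.weight",
    "det_head.box_regressor.weight",
    "fusion.conv1.weight",
    "fusion.conv2.weight",
]
flags = trainable_flags(names)
updated = [name for name, is_trainable in flags.items() if is_trainable]
```

Here only the two hypothetical fusion parameters remain trainable, which matches the idea of preserving the pretrained detection features while learning the fusion.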

3.4. FR-CHAGAVLF_PRE: Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion Pretrained

Based on FR-AVLF_PRE, we further conduct architectural optimization and propose the FR-CHAGAVLF_PRE model. This model achieves two key improvements. First, in terms of language feature extraction, we replace the original BERT language encoder with the DeepSeek encoder. Given the characteristics of DeepSeek as a generative model, where key semantic information typically aggregates at the sequence end, we adopted the strategy of extracting the features of the last valid token of each sentence to obtain optimal language representation. Second, we redesigned the multimodal fusion architecture and target localization mechanism, proposing the Multi-Level Adaptive Vision-Language Fusion module and Cascaded Hierarchical Attention Grounding (CHAG), respectively.

Language feature extraction: To further enhance the expressive capability of language features in the RSVG task, we replaced the language encoder of the previous-stage model with a generative language model, leveraging its tendency to integrate key information during generation to optimize the feature extraction process. Specifically, we first obtained the hidden states from all layers of the language model, $H = \{h_{1}, h_{2}, \dots, h_{L}\}$, and then took the hidden state of the last layer, $h_{last} = h_{L}$. We used the attention mask to calculate the actual length of each sentence and hence the index of its last valid token, $l_{i} = \sum_{j = 1}^{T} mask_{i, j} - 1$, extracted the features of that token, $s_{i} = h_{last}[i, l_{i}]$, and calculated the average of the last four layers of hidden states, $f_{l} = \frac{h_{L} + h_{L - 1} + h_{L - 2} + h_{L - 3}}{4}$. These quantities are combined as shown in Equation (7):

$$\begin{aligned}
LF' &= \mathrm{Concat}(f_{l};\ s_{i}) \\
LF &= \mathrm{ReLU}\left(\mathrm{FC}_{hidden\ size \to 768}(LF')\right) \\
LF &= \mathrm{ReLU}\left(\mathrm{FC}_{768 \to 625}(LF)\right)
\end{aligned} \tag{7}$$
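The indexing used to obtain $l_{i}$, $s_{i}$, and $f_{l}$ can be sketched in NumPy as follows. The array layout (layers, batch, tokens, hidden size), the right-padded attention mask, and the function name are assumptions made for illustration. How $f_{l}$ is reduced to a vector before the concatenation in Equation (7) is not specified in the text, so the sketch stops before that step.

```python
import numpy as np


def last_token_features(hidden_states, mask):
    """hidden_states: (L, B, T, D) per-layer outputs; mask: (B, T), 1 = real token.

    Assumes right padding, so the valid tokens of each sentence are contiguous.
    """
    h_last = hidden_states[-1]                        # h_L, shape (B, T, D)
    last_idx = mask.sum(axis=1) - 1                   # l_i: index of last valid token
    batch_idx = np.arange(h_last.shape[0])
    s = h_last[batch_idx, last_idx]                   # s_i = h_last[i, l_i], shape (B, D)
    f = hidden_states[-4:].mean(axis=0)               # f_l: mean of the last four layers
    return last_idx, s, f


rng = np.random.default_rng(0)
hidden = rng.standard_normal((6, 2, 5, 3))            # toy sizes: L=6, B=2, T=5, D=3
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
last_idx, s, f = last_token_features(hidden, mask)
```

With a left-padded tokenizer the index computation would differ, so the padding side should be checked before reusing this pattern.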

Then, we performed vision-language feature alignment on $LF$ to obtain $LF_{\times 1/2}^{'}$, $LF_{\times 1}^{'}$, $LF_{\times 2}^{'}$, $LF_{\times 4}^{'}$, and $LF_{\times 8}^{'}$, with sizes of 13 × 13, 25 × 25, 50 × 50, 100 × 100, and 200 × 200, respectively, as shown in Equation (5).

Multi-Level Adaptive Vision-Language Fusion (Multi-Level AVLF): This is designed to effectively combine multi-scale visual and textual features, thereby enhancing object detection performance in remote sensing images. The input to this module includes visual features extracted by the Swin Transformer-FPN, $VF_{1}, VF_{2}, VF_{3}, VF_{4}, VF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$, and language features obtained from the language encoder, $LF_{1}, LF_{2}, LF_{3}, LF_{4}, LF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$. In lines 1–5 of Algorithm A1, the text and visual features are first aligned in terms of dimensions. Specifically, the language features are reduced in dimensionality through a 1 × 1 convolution ($W^{proj1} \in \mathbb{R}^{41 \times 128 \times 1 \times 1}$) with ReLU activation and batch normalization and then expanded to 256 channels via another 1 × 1 convolution ($W^{proj2} \in \mathbb{R}^{128 \times 256 \times 1 \times 1}$), aligning the dimensions with the visual features and resulting in $PLF_{i}$ (line 4 of Algorithm A1).

Next, in lines 6–20 of Algorithm A1, the module uses a multi-head cross-modal attention mechanism to dynamically determine the relationship between the text description and visual features. Specifically, when the text description includes spatial information (e.g., “upper left”), the projected language features $PLF_{i}$ will generate a query, and the model will focus on the corresponding spatial regions in the visual features. When the text emphasizes object attributes (e.g., “circular tank”), the attention will focus on geometric and texture features, automatically selecting deep and shallow features. For example, when the input language query emphasizes global attributes (e.g., “a large oval-shaped playground”), the cross-modal attention mechanism tends to assign higher weights to low-resolution, semantically rich deep feature maps (such as Vf′13, Vf′25, or Vf′50) among the multi-scale features, as these layers are better suited for capturing the overall shape and contextual semantics of the scene. Conversely, for queries that require precise localization of small objects (e.g., “the white vehicle in the bottom right” or “a smaller vehicle”), the cross-modal attention mechanism tends to give higher weights to high-resolution, fine-grained shallow feature maps (such as Vf′100 or Vf′200), ensuring greater accuracy for the boundaries and positioning. Through this implicit, data-driven dynamic weighting strategy, the Multi-Level AVLF module can flexibly adapt to specific linguistic demands, achieving an effective balance between global semantic understanding and local spatial precision while highlighting the most discriminative feature levels. This mechanism enables the model to adaptively focus on different types of visual features based on the text content.

To ensure computational efficiency in high-resolution remote sensing image processing, Flash-Attention is employed in lines 9–12 of Algorithm A1 to compute exact attention with memory that grows linearly in the sequence length, as shown in Equation (8):

$$A_{i} = \mathrm{FlashAttention}\left(Q_{i}, K_{i}, V_{i},\ scale = \frac{1}{\sqrt{d_{k}}}\right) \tag{8}$$

where the scaling factor $1/\sqrt{d_{k}}$ keeps the dot products from saturating the softmax, which would otherwise lead to vanishing gradients, and Flash-Attention ensures efficient computation while saving memory.
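For reference, the quantity that Flash-Attention evaluates is standard scaled dot-product attention; Flash-Attention changes how the computation is tiled in memory, not the result. The NumPy sketch below shows this plain reference computation for a single head and is not Flash-Attention itself. Following the text, the queries are taken from language features and the keys and values from visual features; the toy shapes and variable names are assumptions.

```python
import numpy as np


def cross_attention(q, k, v):
    """Plain scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (Nq, Nk), scaled by 1/sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # queries from projected language features (toy)
k = rng.standard_normal((6, 8))   # keys from visual features (toy)
v = rng.standard_normal((6, 8))   # values from visual features (toy)
attended, weights = cross_attention(q, k, v)
```

This reference version materializes the full score matrix, which is the memory cost that Flash-Attention avoids.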

Subsequently, in lines 15–19 of Algorithm A1, a gating mechanism is used to determine the optimal fusion weight between the attention output and the original visual features to handle feature demands under different prompts. The gating computation is presented below in Equation (9):

$$\mathrm{gate}_{i} = \sigma\left(\mathrm{Conv2D}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Concat}[\mathrm{attended}_{i}, VF_{i}]\right)\right), W_{g2}\right)\right) \tag{9}$$

where $\mathrm{attended}_{i}$ is the result from the multi-head cross-modal attention mechanism, $VF_{i}$ is the original visual feature, and $W_{g2}$ is the corresponding weight matrix. Then, element-wise fusion is performed as shown in Equation (10):

$$\mathrm{gated\_fusion}_{i} = \mathrm{gate}_{i} \odot \mathrm{attended}_{i} + (1 - \mathrm{gate}_{i}) \odot VF_{i} \tag{10}$$

This process does not require explicit keyword matching. Instead, the model learns to adaptively adjust the scale of the fusion guided by text, dynamically tuning the fusion ratio between cross-modal features and the original visual features through the gating weight.
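The element-wise blend of Equation (10) can be sketched in NumPy as follows. In this illustration the convolutional branch of Equation (9) is replaced by a precomputed array of gate logits, which is an assumption made to keep the example self-contained; the function and variable names are hypothetical.

```python
import numpy as np


def gated_fusion(attended, vf, gate_logits):
    """Blend attention output and visual features with a sigmoid gate."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))        # sigma(.) in Equation (9)
    fused = gate * attended + (1.0 - gate) * vf      # Equation (10)
    return fused, gate


rng = np.random.default_rng(0)
attended = rng.standard_normal((256, 25, 25))   # toy cross-modal attention output
vf = rng.standard_normal((256, 25, 25))         # toy original visual feature
fused_open, _ = gated_fusion(attended, vf, np.full(vf.shape, 50.0))
fused_half, gate_half = gated_fusion(attended, vf, np.zeros(vf.shape))
```

A gate close to 1 passes the text-conditioned features through, a gate close to 0 keeps the original visual features, and a logit of zero yields an equal mix of the two.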

Next, in line 19 of Algorithm A1, a residual connection and LayerNorm are used to ensure gradient stability, as shown in Equation (11):

$$CA_{i} = \mathrm{LayerNorm}\left(\mathrm{gated\_fusion}_{i} + VF_{i}\right) \tag{11}$$

Then, multi-head self-attention is applied to the cross-modal fusion result $CA_{i}$, and learnable scaling parameters $\gamma$ (initialized to zero) are introduced to implement progressive fusion, ultimately obtaining $SA_{i}$ (lines 21–30 of Algorithm A1).

Following this, feature-wise linear modulation (FiLM) is used to fine-tune the visual features in a text-semantic-driven manner (lines 31–38 of Algorithm A1), as shown in Equations (12) and (13):

$$\gamma_{i} = \mathrm{Conv2D}\left(\mathrm{ReLU}\left(\mathrm{Conv2D}(PLF_{i}, W_{\gamma 1})\right), W_{\gamma 2}\right) \tag{12}$$

$$\beta_{i} = \mathrm{Conv2D}\left(\mathrm{ReLU}\left(\mathrm{Conv2D}(PLF_{i}, W_{\beta 1})\right), W_{\beta 2}\right) \tag{13}$$

where $W_{\gamma 1}$, $W_{\gamma 2}$, $W_{\beta 1}$, and $W_{\beta 2}$ are the corresponding weight matrices. Finally, we have Equation (14):

$$FILM_{i} = \gamma_{i} \odot SA_{i} + \beta_{i} \tag{14}$$

where $\gamma_{i}$ can multiplicatively amplify or suppress $SA_{i}$ and $\beta_{i}$ introduces an additive bias. This dual modulation strategy enables the algorithm to conditionally adjust the visual features based on text semantics, achieving more precise multi-modal feature fusion.
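The modulation in Equation (14) reduces to one multiply and one add per element. The NumPy sketch below assumes that $\gamma_{i}$ and $\beta_{i}$ have already been produced by the convolutional branches of Equations (12) and (13) and have the same shape as $SA_{i}$; the constant values used here are placeholders chosen only to make the effect visible.

```python
import numpy as np


def film(sa, gamma, beta):
    """Feature-wise linear modulation: gamma * sa + beta, element-wise."""
    return gamma * sa + beta


rng = np.random.default_rng(0)
sa = rng.standard_normal((256, 25, 25))      # toy self-attention output SA_i

# gamma = 1, beta = 0 leaves the features unchanged.
identity = film(sa, np.ones_like(sa), np.zeros_like(sa))
# gamma = 0 suppresses the features entirely, leaving only the additive bias.
bias_only = film(sa, np.zeros_like(sa), np.full_like(sa, 0.3))
# gamma = 2 amplifies every activation.
amplified = film(sa, np.full_like(sa, 2.0), np.zeros_like(sa))
```

In the module itself both modulation maps depend on the text, so the same visual feature can be amplified for one query and suppressed for another.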

In lines 39–47 of Algorithm A1, a design is introduced to address the scale imbalance issue in multi-scale object detection. The global context (AdaptiveAvgPool2D) and local context (3 × 3 AvgPool2D) are extracted simultaneously and summed to generate the scale attention $\mathrm{scale\_attn}_{i}$, from which the scale-calibrated features are obtained, as shown in Equation (15):

$$SC_{i} = FILM_{i} \odot \mathrm{scale\_attn}_{i} \tag{15}$$

This can adaptively enhance the important channels at different levels, effectively balancing the feature representation for objects of different sizes.

In lines 50–51 of Algorithm A1, global pooling is applied to each {SC}_{i} to obtain the statistical quantities μ_{i} and σ_{i}, which are then concatenated into a 10-dimensional vector and fed into a two-layer MLP to compute size_{logits}, followed by a softmax function to obtain size_{weights}. The target resolution is calculated as shown in Equation (16):

{target}_{size} \leftarrow Round(\sum_{i = 1}^{5} size_{weights}[:, i] \times shapes[i])

(16)
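As a rough sketch of Equation (16) for a single sample, the target resolution is the softmax-weighted average of the per-level feature-map sizes, rounded to an integer. The list `shapes` below is an assumption that borrows the feature-map sizes mentioned later for Figure 9; the logits are made up, and Python's built-in `round` (which rounds halves to even) stands in for the paper's Round operator.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_size(size_logits, shapes):
    """Equation (16) for one sample: round the weighted sum of level sizes."""
    weights = softmax(size_logits)
    return round(sum(w * s for w, s in zip(weights, shapes)))


shapes = [200, 100, 50, 25, 13]  # assumed side lengths of the five levels

# Equal logits give equal weights, so the result is the plain average.
uniform = target_size([0.0, 0.0, 0.0, 0.0, 0.0], shapes)
# A strongly peaked logit selects (almost exactly) one level.
peaked = target_size([20.0, 0.0, 0.0, 0.0, 0.0], shapes)
print(uniform, peaked)  # 78 200
```

Because the weights come from a softmax, the selected resolution always lies between the smallest and largest level sizes.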

Based on this dynamically selected optimal resolution, all feature levels are aligned to this resolution. Then, in line 63 of Algorithm A1, learnable fusion weights are used to generate the global context GC, and in lines 66–73, the global context is weighted and fused according to the statistical quantities α_{i}, as shown in Equation (17):

{VLF}_{i} = {SC}_{i} + α_{i} \times GC_{{resized}_{i}}

(17)

The learned weight parameters α_{i} perform content-adaptive integration of the global context, ensuring that each level receives the most suitable global information supplement.

In summary, this algorithm achieves intelligent selection of deep and shallow features based on text semantics through a cross-modal attention mechanism. When the text description contains spatial location information, the model automatically focuses on the corresponding region features, and when the text emphasizes object attributes, it focuses on geometric and texture features, ensuring adaptive attention to the corresponding scale of visual features. The adaptive gating mechanism dynamically adjusts the fusion weights between cross-modal features and original visual features based on text semantics, determining the optimal feature combination strategy without explicit keyword matching. FiLM conditioning with the γ_{i} and β_{i} parameters provides multiplicative and additive modulation, offering fine-grained feature control based on text semantics. The remote sensing scale calibration module combines global and local context information, balancing feature representation for objects of different sizes through scale attention weights. Finally, through a statistical analysis-based MLP network, adaptive resolution selection is achieved, and content-aware weight parameters complete intelligent global context fusion. This multi-level, progressive, scale-adaptive strategy ensures precise matching of different-sized objects and text descriptions in various remote sensing scenarios, significantly improving object detection performance in complex remote sensing environments. This design is shown in Figure 5.

Cascaded hierarchical attention grounding (CHAG) is based on the cascaded region convolutional neural network architecture and dual-layer attention mechanism design. (The algorithm pseudocode can be found in Algorithm A2 of Appendix A). Through multi-stage region proposal generation and bounding box regression, CHAG adopts incremental intersection over union (IoU) thresholds to progressively refine object localization, significantly improving small object detection performance compared with single-stage methods. This iterative optimization strategy is particularly effective in remote sensing scenarios with uneven object distribution and large-scale spans, being capable of significantly reducing missed detection and false detection rates and thereby optimizing high-threshold detection metrics. The mathematical formula definition of this cascaded structure is shown in Equation (18).

Let B_{k} denote the bounding box predictions at stage k with corresponding features F_{k}. The refinement process iterates over k \in {1, 2, 3} stages:

B_{k + 1} = {Regressor}_{k}(Pool(F_{k}, B_{k}))

(18)

where each regressor is trained with progressively increasing IoU thresholds τ_{k} = {0.55, 0.65, 0.75}. This tiered threshold strategy is a principled design choice grounded in the Cascade R-CNN framework, which addresses the “high-quality detection paradox” by sequentially resampling proposals to ensure sufficient positive samples for each stage [51]. The specific thresholds were determined via a systematic ablation study on the DIOR-RSVG dataset, which confirmed this configuration strikes an optimal balance between initial proposal recall and final localization precision for the unique challenges of remote sensing data. (See Table A6 in Appendix A for detailed results). This approach ensures early-stage robustness by preserving small-object proposals through a moderate initial threshold (0.55), while enforcing late-stage precision via stricter localization criteria (0.75).

It also maintains adaptive learning through stage-specific loss weights w_{k} = {1.0, 1.0, 1.0} to balance error propagation [51]. The key implementation details are as follows: multi-scale region of interest (RoI) pooling at a 7 × 7 resolution extracts aligned features from the {P2, P3, P4, P5} FPN layers; 1024-dimensional fully connected layers then transform the features, with batch normalization and ReLU activations to enhance training stability and nonlinearity; and balanced sampling uses 512 proposals per image with a 25% positive fraction for robust optimization [51].
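The effect of the increasing thresholds τ_{k} in Equation (18) can be sketched with a toy cascade. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) form and replaces each learned regressor with a hypothetical `toy_refine` step that moves a box halfway toward the ground truth; a real regressor has no access to the ground truth at inference time, so this only illustrates why refined boxes keep enough positives at stricter thresholds.

```python
STAGE_IOU_THRESHOLDS = (0.55, 0.65, 0.75)  # tau_k from the text


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def toy_refine(box, gt, step=0.5):
    """Hypothetical stand-in for Regressor_k: move part-way toward gt."""
    return tuple(c + step * (g - c) for c, g in zip(box, gt))


def run_cascade(proposals, gt, thresholds=STAGE_IOU_THRESHOLDS):
    """Count positives at each stage, refining the boxes between stages."""
    boxes = list(proposals)
    positives = []
    for tau in thresholds:
        positives.append(sum(iou(b, gt) >= tau for b in boxes))
        boxes = [toy_refine(b, gt) for b in boxes]
    return positives, boxes


gt = (0.0, 0.0, 10.0, 10.0)
# Four proposals with IoU 0.5, 0.6, 0.7, and 0.8 against the ground truth.
proposals = [(0.0, 0.0, 10.0, h) for h in (5.0, 6.0, 7.0, 8.0)]
positives, refined = run_cascade(proposals, gt)
# Without refinement, only one proposal would pass the strictest threshold.
static_positives = sum(iou(p, gt) >= 0.75 for p in proposals)
print(positives, static_positives)  # [3, 4, 4] 1
```

In this toy setting, applying the 0.75 threshold directly to the initial proposals leaves a single positive, whereas the cascade still has four positives at its final stage.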

To further improve the region proposal quality, CHAG integrates a dual-layer attention mechanism. Before the RPN, cross-modal attention fuses multi-scale visual features with language features, embedding the semantic information of text descriptions into the RPN input to ensure high relevance between the region proposals and query statements. This global semantic alignment strategy is particularly adapted to the complex interactions between objects and backgrounds in remote sensing images.

Within the RPN, the multi-head attention mechanism dynamically adjusts the weights of fused features. Specifically, we employ a 4-head attention configuration. The input dimension for this mechanism is determined by the number of anchors per location plus an additional channel for language embedding. This combined feature vector is subsequently partitioned evenly across the four heads, with a standard dropout rate of 0.0. This process strengthens the semantic expression of local features, significantly improving the capability of capturing small objects and detailed features. This hierarchical attention design combines global guidance with local enhancement, effectively meeting the requirements of multi-scale object detection and compensating for the shortcomings of traditional methods. CHAG is implemented through efficient attention computation, substantially improving high-threshold detection performance in RSVG tasks while maintaining computational efficiency, demonstrating its robustness and superiority in remote sensing vision-language tasks. Figure 6 illustrates the architectural diagram of CHAG.

For FR-CHAGAVLF_PRE, this model is constructed based on cascaded hierarchical attention grounding (CHAG), Multi-Level Adaptive Vision-Language Feature Fusion (Multi-Level AVLF), DeepSeek, and pretrained weights from remote sensing image object detection. Figure 7 illustrates the architectural diagram of transferring remote sensing image object detection model weights to remote sensing image visual grounding tasks. The left part of the figure presents the RSVG framework for vision-language multi-scale fusion. The main transfer modules include the Swin Transformer backbone network integrated with an FPN and the detection module. Given the superior object detection performance of the remote sensing image object detection model, we adopted a parameter freezing strategy, namely freezing the pretrained weights before vision-language feature fusion and only fine-tuning the model parameters after the fusion process. The synergistic effect of the cascaded structure and dual-layer attention mechanism not only improves localization accuracy but also provides an innovative theoretical and technical framework for multimodal processing of remote sensing images.

3.5. Loss

Faster R-CNN [49] has loss functions in both the RPN and R-CNN [52] stages, with a classification loss and a regression loss in each stage. The classification loss in the RPN stage, L_{cls}^{rpn}, is a binary classification loss, and that in the R-CNN stage, L_{cls}^{rcnn}, is a multi-class classification loss. The number of classes is the number of target classes plus 1, where the added class denotes the background. L_{reg}^{rpn} and L_{reg}^{rcnn} are the regression losses in the RPN and R-CNN stages, respectively. Thus, the loss function of Faster R-CNN can be expressed by Equation (19):

L_{Faster RCNN} = L_{cls}^{rpn} + L_{reg}^{rpn} + L_{cls}^{rcnn} + L_{reg}^{rcnn}

(19)

In order to give the model a more powerful ability to identify locations during training, as demonstrated in [31], we added the GIoU loss [53] based on the original loss function of Faster R-CNN. Finally, the loss function we used is shown in Equation (20):

L_{FR - RSVG} = L_{Faster RCNN} + λ \cdot L_{GIoU} (b, \hat{b})

(20)

where b is the predicted target, \hat{b} is the true target, and λ is a hyperparameter that balances the two loss functions. To determine the optimal value, we conducted ablation experiments with λ set to 0.6, 1.0, and 1.4, as shown in Appendix A, Table A5. The results demonstrate that λ = 1 achieved the best performance. Additionally, the MGVLF [31] method also set λ = 1 when balancing the two loss functions. Therefore, we adopted λ = 1 for all subsequent experiments.
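For reference, the GIoU term in Equation (20) can be sketched for a single pair of boxes. The sketch assumes axis-aligned boxes in (x1, y1, x2, y2) form with positive area and uses the common definition of the GIoU loss as 1 − GIoU; the function names and the placeholder value for the Faster R-CNN loss are illustrative.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    # Smallest axis-aligned box that encloses both inputs.
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))
    return inter / union - (enclose - union) / enclose


def giou_loss(pred, target):
    """GIoU loss, ranging from 0 (perfect match) to 2."""
    return 1.0 - giou(pred, target)


def total_loss(faster_rcnn_loss, pred, target, lam=1.0):
    """Equation (20): detection loss plus the weighted GIoU term."""
    return faster_rcnn_loss + lam * giou_loss(pred, target)


same = giou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
disjoint = giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
combined = total_loss(0.5, (0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
print(same, round(disjoint, 4), combined)  # 0.0 1.3333 0.5
```

Unlike the plain IoU, the GIoU loss stays informative for non-overlapping boxes: the disjoint pair above yields a loss greater than 1 that shrinks as the boxes move closer together.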

4. Results

4.1. Dataset

This study systematically investigates the cross-domain generalization challenge in remote sensing vision-language tasks. We constructed a progressive evaluation pipeline that spanned from single-domain baselines to zero-shot cross-domain transfer, utilizing DIOR-RSVG, its linguistically complex extension DIOR-RSVG-C, and the external OPT-RSVG dataset:

(1): The baseline benchmark was DIOR-RSVG [31]. Built upon DIOR [54], this dataset comprises 17,402 high-resolution remote-sensing images and 38,320 concise captions aligned with 20 object categories. With an average caption length of 7.47 words and a vocabulary size of 100, it provides a controlled setting for evaluating model performance under limited linguistic complexity.
(2): The linguistic augmentation benchmark was DIOR-RSVG-C. To investigate the robustness against complex semantics, we constructed an enhanced dataset by randomly selecting 5202 images and their 11,436 original captions from DIOR-RSVG. The augmentation process involved several key steps. First, we randomly sampled 20% of the original dataset to ensure diverse representation across all 20 categories. Then, each caption was systematically elaborated using the Qwen-Plus large language model through carefully designed prompt engineering. Our prompt strategy specifically instructs the model to (1) preserve the original spatial logic (e.g., “lower left”, “center”, or “upper right”) to maintain accurate visual grounding, (2) enrich descriptions with texture, color, shape, and background context relevant to remote sensing scenarios, (3) use natural and professional language suitable for remote sensing object localization, (4) avoid introducing contradictory information, and (5) limit expansions to within 20 English words to maintain practical usability. The multimodal prompt combines both the original image (encoded as base64) and the textual description, enabling the vision-language model to generate contextually appropriate elaborations while preserving spatial accuracy. To ensure quality control, we implemented a multi-stage validation process: (1) automated filtering to remove responses that significantly deviated from the original caption length constraints or contain obvious contradictions, (2) random sampling of 500 generated descriptions for manual review to assess semantic consistency and spatial accuracy, and (3) iterative prompt refinement based on identified issues. Additionally, we employed consistency checks by comparing generated descriptions against the original annotations to ensure no spatial information was lost or distorted. This process yielded captions with an average length of 20.52 words and a vocabulary of 1354 terms. 
DIOR-RSVG-C retains the 20-category label space while significantly elevating linguistic diversity, enabling systematic analysis of performance degradation under increased textual complexity and providing a more challenging benchmark for evaluating model robustness in real-world remote sensing applications.
(3): Zero-shot cross-domain evaluation was carried out on OPT-RSVG [45]. To quantify out-of-domain generalization, we adopted the DIOR-RSVG-pretrained weights as a frozen feature extractor and performed zero-shot inference on OPT-RSVG. Focusing on the six semantic classes shared by both datasets—airplane, basketballcourt, ship, storagetank, tenniscourt, and vehicle—we constructed a sub-test set by uniformly and randomly sampling 20% of the image-text pairs per class (1092 pairs in total), thereby controlling sample bias and ensuring statistical significance. The sub-set accuracy served as a quantitative proxy for cross-domain consistency and out-of-domain robustness.

Collectively, the DIOR-RSVG → DIOR-RSVG-C → OPT-RSVG evaluation framework offers a seamless progression from baseline performance to complex linguistic adaptation and finally to zero-shot cross-domain transfer, providing a comprehensive gauge of the model’s generalization capacity.

4.2. Evaluation Metrics

The evaluation index of RSVG is consistent with the evaluation index used in [31,55]. If the IoU value between the bounding box predicted by the method and the annotated true box exceeds a threshold, then the predicted box of the method is considered to be correct. The indicators with IoU thresholds of 0.5, 0.6, 0.7, 0.8 and 0.9 are denoted as Pr@0.5, Pr@0.6, Pr@0.7, Pr@0.8, and Pr@0.9, respectively. We also used the meanIoU and cumIoU as our evaluation metrics. The formulas for these calculations are shown in Equations (21) and (22):

meanIoU = \frac{1}{M} \sum_{t} \frac{I_{t}}{U_{t}}

(21)

cumIoU = \frac{\sum_{t} I_{t}}{\sum_{t} U_{t}}

(22)

where t is the index of the image-query pair, M represents the size of the dataset, and I_{t} and U_{t} are the intersection and union between the predicted bounding box and the ground truth bounding box, respectively.
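The metrics above can be computed with a short sketch. It assumes boxes in (x1, y1, x2, y2) form, reports values as percentages, and counts a prediction as correct when its IoU strictly exceeds the threshold, following the wording above; the function names and toy boxes are illustrative.

```python
def inter_union(pred, gt):
    """Intersection and union areas of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter, area_p + area_g - inter


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pr@threshold, meanIoU (Eq. 21), and cumIoU (Eq. 22), in percent."""
    pairs = [inter_union(p, g) for p, g in zip(preds, gts)]
    ious = [i / u for i, u in pairs]
    m = len(ious)
    metrics = {f"Pr@{t}": 100.0 * sum(v > t for v in ious) / m
               for t in thresholds}
    metrics["meanIoU"] = 100.0 * sum(ious) / m
    metrics["cumIoU"] = (100.0 * sum(i for i, _ in pairs)
                         / sum(u for _, u in pairs))
    return metrics


# Toy data (assumed): one large object localized well (IoU 0.8)
# and one small object localized poorly (IoU 1/3).
gts = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 2.0, 2.0)]
preds = [(0.0, 0.0, 10.0, 8.0), (1.0, 0.0, 3.0, 2.0)]
metrics = evaluate(preds, gts)
print(metrics["Pr@0.5"], round(metrics["meanIoU"], 2), round(metrics["cumIoU"], 2))
# 50.0 56.67 77.36
```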

4.3. Implementation Details

4.3.1. Training for FR-RSVG and FR-AVLF

The models for the FR-RSVG and FR-AVLF tasks were trained using the settings in Table 2, with a frozen BERT model initialized with pretrained public weights.

4.3.2. Pretraining for Visual Object Detection

Since it was also necessary to test whether pretraining weights learned on remote sensing images affected the detection accuracy of the RSVG task, we trained the visual object detection model on the DIOR [54] dataset. The training data and test data were divided at a 1:1 ratio, and the input image size was 800 × 800. We used the OneCycle learning rate schedule for training, with the maximum learning rate set to 0.0002 and the minimum learning rate set to 0.00002. This model used the Adam optimizer (β_{2} = 0.99), and β_{1} was adjusted according to the OneCycle method, with max β_{1} = 0.9 and min β_{1} = 0.8. The batch size for training was set to 2. The OneCycle schedule was used for the first 10 epochs, and the last two epochs formed a fine-tuning phase in which the learning rate was fixed at 0.000004 and the first-moment coefficient β_{1} was fixed at 0.9.
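The schedule described above can be sketched as follows. The text gives only the endpoint values, so the shape used here is an assumption: a symmetric linear ramp over the 10 OneCycle epochs, with the first-moment coefficient β_{1} moving opposite to the learning rate, followed by two fine-tuning epochs at fixed values. The function names are illustrative.

```python
def one_cycle(progress, low, high):
    """Linear ramp low -> high over the first half of the cycle and
    high -> low over the second half; progress lies in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    frac = 1.0 - abs(2.0 * progress - 1.0)  # 0 at both ends, 1 at midpoint
    return low + frac * (high - low)


def schedule(epoch, cycle_epochs=10):
    """Return (learning_rate, beta1) at a (possibly fractional) epoch."""
    if epoch < cycle_epochs:
        p = epoch / cycle_epochs
        lr = one_cycle(p, 2e-5, 2e-4)
        # beta1 is at its maximum when the learning rate is at its minimum.
        beta1 = 0.9 - 0.1 * one_cycle(p, 0.0, 1.0)
    else:
        lr, beta1 = 4e-6, 0.9  # fine-tuning phase, fixed values
    return lr, beta1


for epoch in (0, 5, 10, 11):
    lr, beta1 = schedule(epoch)
    print(epoch, f"{lr:.6f}", f"{beta1:.2f}")
```

Under these assumptions, the learning rate peaks at 0.0002 in epoch 5, where β_{1} reaches its minimum of 0.8, and both return to their starting values before the fine-tuning phase begins.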

4.4. FR-RSVG Results

We first evaluated our FR-RSVG model. The experimental results are shown in Table 3. FR-RSVG achieved only moderate detection results on the DIOR-RSVG dataset, and the detection accuracy at a high threshold, such as Pr@0.9, was only 29.74. In addition, the cumIoU values were significantly higher than the meanIoU values, which was caused by the unbalanced recognition of small and large targets. Suppose that {IS}_{t} and {US}_{t} are the intersection and union, respectively, between the predicted and true boxes of small targets, and {IB}_{t} and {UB}_{t} are the intersection and union, respectively, between the predicted and true boxes of large targets. Since the area of a small target is significantly smaller than that of a large target, we have ({IS}_{t} + {IB}_{t}) / ({US}_{t} + {UB}_{t}) \approx {IB}_{t} / {UB}_{t}, and the cumIoU can be approximated as shown in Equation (23):

cumIoU = \frac{\sum_{t} ({IS}_{t} + {IB}_{t})}{\sum_{t} ({US}_{t} + {UB}_{t})} \approx \frac{\sum_{t} {IB}_{t}}{\sum_{t} {UB}_{t}}

(23)

From Equation (23), we can see that cumIoU primarily reflects the model’s performance in recognizing large objects. Therefore, rather than maximizing the cumIoU, the goal should be to bring it as close as possible to the meanIoU. When the meanIoU is significantly smaller than the cumIoU, this suggests that the method struggles to effectively detect small objects. As a result, the FR-RSVG method fails to balance the recognition of both large and small objects.
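A small worked example makes the gap concrete. The values below are assumed for illustration and are not taken from the experiments: one large object with I_{1} = 8000 and U_{1} = 10000 (IoU of 0.80) and one small object with I_{2} = 20 and U_{2} = 100 (IoU of 0.20).

```latex
\begin{align}
meanIoU &= \frac{1}{2}\left(\frac{8000}{10000} + \frac{20}{100}\right) = 0.50 \\
cumIoU  &= \frac{8000 + 20}{10000 + 100} = \frac{8020}{10100} \approx 0.794
\end{align}
```

The cumIoU stays within 0.01 of the large object's IoU, while the meanIoU drops to 0.50 because the poorly localized small object contributes equally to the average.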

We further analyzed the attention map of FR-RSVG, shown in Figure 8. Phrase (a) aims to detect the stadium, while phrase (b) focuses on detecting the track and field ground. The stadium contained the track and field ground, and the two objects were in close proximity. Since the input images and model weights were identical, the visual features were the same. However, the language features of phrases (a) and (b) differed due to their distinct descriptions. In Figure 8, we can observe that the vision-language features of phrases (a) and (b) were quite similar, with some differences in detail. When zooming in on the attention map around the stadium, we can find that the attention map for phrase (a) focused more on the stadium’s periphery, with warmer areas indicating more attention. Due to the attention map in (a) exhibiting a more intense focus around the stadium, the model produced a larger bounding box.

4.5. FR-AVLF Results

Table 4 shows the results of our FR-AVLF model on the DIOR-RSVG dataset. As can be seen from Table 4, the FR-AVLF method had excellent detection accuracy. Compared with FR-RSVG, the FR-AVLF method using the Swin-T backbone achieved the most significant growth in several indicators. The Pr@0.5, Pr@0.6, Pr@0.7, Pr@0.8, Pr@0.9, meanIoU, and cumIoU improved by 19.63, 21.94, 23.43, 23.39, 15.23, 17.09, and 2.62, respectively, and the value of the meanIoU for the FR-AVLF method was quite close to that of cumIoU, which means our method can balance the grounding ability of large and small objects.

We conducted a study on the influence of large and small objects on an attention map. As shown in Figure 9, we wanted to identify the large object: the track and field ground. We analyzed the situation of the attention map after the RPN and vision-language backbone were utilized.

As shown in Figure 9, for large objects, the language features adaptively combined with the visual features extracted by the deep layer network. This allowed valuable visual-language information to be extracted from the smaller feature maps. For instance, when the feature map size was 200 × 200, 100 × 100, or 50 × 50, no warm areas were visible on the track and field ground. However, when the feature map was 25 × 25, warmer regions on the track and field ground became visible, and they were more prominent at the 13 × 13 size. Additionally, when comparing the attention maps after the RPN stage and after the vision-language backbone, we observed that the warmer areas in the attention map after the vision-language backbone were more scattered. This indicates that while the network exhibited some vision-language recognition capabilities, they were still relatively weak. In contrast, the attention map after the RPN stage showed more concentrated warmer areas, suggesting that vision-language features had been better integrated at this point.

Figure 10 presents the attention map for detecting small objects, specifically the vehicle in the lower right corner. As seen in Figure 10, the attention map after the vision-language backbone’s implementation was nearly unrelated to the present small objects. However, the attention map after the RPN stage clearly shows the warmer regions corresponding to the small objects. The warmer areas for small objects are the opposite of those for large objects. Notably, the warmer area for the vehicle only appeared in the 200 × 200 and 100 × 100 feature maps, with the warmer area becoming less noticeable at 100 × 100. This demonstrates that our FR-AVLF method effectively balances the grounding of both large and small objects.

4.6. FR-AVLF_PRE Results

We conducted experiments to investigate the impact of transferring a pretrained model for RS object detection to the RSVG task on detection accuracy.

First, we trained a visual RS object detector, FPN-Faster R-CNN, on the DIOR dataset. Compared with current advanced methods, FPN-Faster R-CNN demonstrated excellent detection accuracy. We then compared the effects of using pretrained weights from the ImageNet-1K and ImageNet-22K datasets on accuracy. The experimental results, conducted on the Swin-T backbone model, show that pretraining on the ImageNet-22K dataset yielded better performance. The detailed experimental results are presented in Table 5.

Next, after obtaining the visual model pretrained on remote sensing images, we loaded the model weights into the vision-language remote sensing object detection model for training. The experimental parameters remained the same as those used in the vision-language model trained earlier. Since the visual backbone has a strong feature extraction capability for remote sensing, we froze the backbone during training. The results are shown in Table 6.

As seen in Table 6, FR-AVLF_PRE showed significant improvement across various metrics compared with FR-AVLF. In particular, for the fine-grained metrics, the Swin-T backbone model showed improvement of 13.91 for Pr@0.9. The detection performance of FR-AVLF_PRE surpassed that of FR-AVLF, indicating that traditional visual RS image object detection is closely intertwined with vision-language RS image object detection. This transferability significantly boosts the model’s performance.

4.7. FR-CHAGAVLF_PRE Results

As can be seen from the experimental results in Section 4.3, Section 4.4 and Section 4.5, the best-performing visual encoder was Swin-L. Therefore, based on this, we demonstrate the results of FR-CHAGAVLF_PRE by replacing it with different language encoders and conducting ablation studies on different modules.

As shown in Table 7, both RoBERTa and Deepseek-1.5b significantly improved the model’s detection accuracy. However, our evaluation revealed a critical performance trade-off. While the BERT-based encoders demonstrated better initial precision at high IoU thresholds, the Deepseek-1.5b model achieved superior performance at lower thresholds (e.g., Pr@0.5). This result indicates that Deepseek-1.5b provided more robust semantic grounding, excelling at reliably identifying the correct object instance even if its initial bounding box lacked high spatial acuity. We selected Deepseek-1.5b based on this observation, as its primary strength is highly synergistic with our proposed architecture. The role of the language encoder in our framework is to furnish a high-quality semantic signal to the fusion and proposal modules. The subsequent CHAG module is then specifically tasked with iterative spatial refinement. By choosing the encoder with the best initial semantic performance, we provide a more reliable starting point for the CHAG module’s specialized localization process. This strategic division of labor—robust semantic identification followed by precision refinement—is validated in our subsequent ablation studies as yielding the best overall system performance.

Additionally, we experimented with the larger Deepseek-7b model, but its accuracy was slightly lower than that of Deepseek-1.5b. This is mainly because in vision-language object detection tasks, the language encoder’s core responsibility is to understand short target description texts, and the average length of the texts in our dataset was only 7.47 words. Such concise and straightforward descriptions (e.g., “red car” or “parked airplane”) have relatively low semantic complexity and do not require the complex reasoning capabilities of larger models. Meanwhile, the language encoder weights were frozen and only used to extract pretrained language features without joint training, and thus increasing the model size mainly brought about parameter redundancy rather than performance gains. Moreover, the computational cost and memory consumption of Deepseek-7b are significantly higher, resulting in higher practical application costs. Based on these considerations, we selected Deepseek-1.5b as the language encoder for subsequent ablation experiments.

As shown in Table 8, the ablation studies indicate that while the Multi-Level Adaptive Vision-Language Fusion (Multi-Level AVLF) module alone improved detection performance at lower IoU thresholds (e.g., +1.37% Pr@0.5), it severely degraded the high-precision localization ability (−8.13% Pr@0.9). In contrast, the cascaded hierarchical attention grounding (CHAG) module consistently enhanced performance across all IoU thresholds, with particularly notable improvements at strict thresholds (+3.97% Pr@0.9). When combined, CHAG effectively compensated for the shortcomings of Multi-Level AVLF at high IoU thresholds while leveraging its strengths at moderate IoU levels, achieving an optimal balance and boosting Pr@0.9 to 59.42%. This synergy highlights the superior localization capability of FR-CHAGAVLF_PRE compared with FR-AVLF, underscoring the critical role of the CHAG module in precise object detection.

A detailed analysis of the ablation results revealed a clear trade-off inherent in the Multi-Level AVLF module; by adaptively fusing multi-scale visual features with the global linguistic context (Algorithm A1, lines 66–73), it introduced rich semantic information but inevitably also caused certain localization noise. This noise interferes with the accurate delineation of object boundaries, leading to degraded localization performance at high IoU thresholds. In other words, while Multi-Level AVLF enhanced semantic disambiguation, the localization errors introduced by global context fusion compromised spatial precision, resulting in semantically correct but insufficiently precise localization for stringent high-IoU requirements.

To effectively complement the limitations of Multi-Level AVLF, the CHAG module employs an iterative refinement mechanism that decomposes the high-precision localization task into a series of progressively challenging stage-wise regression problems. Based on high-quality, semantically accurate proposals generated by the AVLF-enhanced RPN, CHAG utilizes a cascaded architecture trained with incrementally increasing IoU thresholds, enabling each stage’s regressor to focus on correcting finer localization errors. Through this process, CHAG further refines the rich semantic information provided by Multi-Level AVLF into more precise spatial localization capabilities. In essence, Multi-Level AVLF addresses the “what” and “roughly where” problems, providing a robust starting point, while CHAG tackles the “exactly where” problem through meticulous iterative optimization. This coarse-to-fine synergy effectively mitigates the inherent trade-offs of the fusion module, allowing the FR-CHAGAVLF_PRE model to achieve optimal performance across all IoU metrics.

Based on Table 9, the FR-CHAGAVLF_PRE model achieved progressive performance improvement through stepwise integration of the Multi-Level AVLF and CHAG modules. Compared with the baseline configuration (Swin-L + Deepseek-1.5b), the complete model maintains the visual backbone parameters at 197M while adding only 55 GFLOPs of computational cost (+3.5%), with the frame rate decreasing from 10.06 FPS to 9.23 FPS (−8.3%). The Multi-Level AVLF module addresses multi-scale object detection imbalance through adaptive vision-language feature fusion, while the CHAG module introduces cascaded hierarchical attention mechanisms to enhance small object detection accuracy. The synergistic effect of these two modules achieves organic integration of global semantic alignment and local feature enhancement, demonstrating excellent performance-efficiency trade-off characteristics in remote sensing image visual grounding tasks by exchanging minimal efficiency loss for significant accuracy improvements.

4.8. Comparison with Other Advanced Research Results

We compared our method with current advanced methods, and the results are summarized in Table 10. As shown in Table 10, our method achieved state-of-the-art results in several metrics, particularly Pr@0.7, Pr@0.8, and Pr@0.9 (78.34%, 72.78%, and 59.42%, respectively). The accuracy improvement over the other methods was most pronounced for the metrics with high threshold requirements.

To assess the cross-domain generalizability of FR-CHAGAVLF_PRE, we directly applied its DIOR-RSVG-pretrained weights to zero-shot inference on (1) the full DIOR-RSVG-C set, (2) the intersection subset of DIOR-RSVG and DIOR-RSVG-C, and (3) the six shared classes in OPT-RSVG, namely airplane, basketballcourt, ship, storagetank, tenniscourt, and vehicle.

Table 11 presents the zero-shot transfer evaluation of FR-CHAGAVLF_PRE on the linguistically augmented DIOR-RSVG-C dataset. MGVLF was employed as the comparative baseline due to (1) its co-development with the DIOR-RSVG dataset, establishing domain-specific representativeness, (2) its methodological consistency in image processing protocols, contrasting with approaches like LQVG that employ resolution modifications for ultra-high-resolution scenarios, and (3) its established performance as a dedicated remote sensing visual grounding architecture.

Quantitative analysis demonstrates FR-CHAGAVLF_PRE’s superior cross-linguistic generalization. On the intersection subset (DIOR-RSVG ∩ DIOR-RSVG-C), our model reached 84.96% for Pr@0.5 versus MGVLF’s 77.73%. When evaluated on the complete DIOR-RSVG-C dataset with enhanced linguistic complexity, FR-CHAGAVLF_PRE maintained 72.29% for Pr@0.5 compared with MGVLF’s 66.24%, indicating greater robustness to semantic variations. This performance difference is attributed to the Multi-Level Adaptive Vision-Language Feature Fusion (Multi-Level AVLF) module’s capacity for adaptive multi-scale feature weighting and the cascaded hierarchical attention grounding (CHAG) mechanism’s enhanced spatial reasoning under textual complexity variations.

The degradation in the high-threshold metrics (Pr@0.8 and Pr@0.9) with increased semantic complexity was observed under frozen-parameter conditions, that is, without any additional training on DIOR-RSVG-C. The Pr@0.5 gap between the two datasets was of similar size for FR-CHAGAVLF_PRE (12.67 percentage points) and MGVLF (11.49 percentage points), while FR-CHAGAVLF_PRE retained the higher absolute accuracy on both, which indicates stronger overall linguistic robustness. These zero-shot transfer capabilities support the suitability of the architecture for deployment in heterogeneous remote sensing environments with variable linguistic complexity.

Table 12 evaluates the cross-dataset generalization of FR-CHAGAVLF_PRE on the six shared classes between DIOR-RSVG and OPT-RSVG under zero-shot conditions. MGVLF served as the comparative baseline due to its establishment as the canonical model for remote sensing visual grounding and methodological consistency with standard evaluation protocols.

The results demonstrate FR-CHAGAVLF_PRE’s superior cross-domain transfer capabilities, achieving 68.39% for Pr@0.5 compared with MGVLF’s 62.96%, with performance gains amplifying at stricter thresholds (34.71% vs. 19.49% at Pr@0.9, respectively). The 6.20 percentage point meanIoU improvement (60.50% vs. 54.30%, respectively) indicates enhanced spatial precision across heterogeneous visual domains.
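For reference, the metrics compared in this section can be computed from paired predicted and ground truth boxes as sketched below. The sketch assumes the definitions commonly used for RSVG benchmarks: Pr@t is the share of samples whose IoU exceeds t, meanIoU averages the per-sample IoU, and cumIoU divides the summed intersection area by the summed union area. The function names and the (x1, y1, x2, y2) box format are illustrative assumptions.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_union(pred, gt):
    """Return (intersection area, union area) for two boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(pred) + box_area(gt) - inter
    return inter, union


def grounding_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute Pr@t, meanIoU and cumIoU (all in percent) for paired boxes."""
    pairs = [intersection_union(p, g) for p, g in zip(preds, gts)]
    ious = [i / u if u > 0 else 0.0 for i, u in pairs]
    n = len(ious)
    metrics = {f"Pr@{t}": 100.0 * sum(iou > t for iou in ious) / n for t in thresholds}
    metrics["meanIoU"] = 100.0 * sum(ious) / n
    metrics["cumIoU"] = 100.0 * sum(i for i, _ in pairs) / sum(u for _, u in pairs)
    return metrics


# Toy example: one perfect prediction and one prediction with IoU = 1/3.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
metrics = grounding_metrics(preds, gts)
```

Because cumIoU weights each sample by its box area, it can diverge noticeably from meanIoU for categories that mix very large and very small objects.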

We attribute this performance difference to our model’s capacity for domain-invariant feature learning, which captures transferable visual-linguistic correspondences under frozen parameter conditions. Although the absolute performance was, as expected, lower than that obtained with training on the target domain, the competitive zero-shot results indicate practical deployment capability across diverse remote sensing scenarios where domain-specific optimization may be impractical.

4.9. Vision-Language Detection Results

Figure 11 shows the vision-language detection results. Sub-figures (a), (b), and (c) correspond to small object detection, (d), (e), and (f) correspond to large object detection, and (g), (h), and (i) correspond to long sentence detection. Sub-figures (j), (k), and (l) represent the detection results of MSAM, (m), (n), and (o) represent the detection results of FR-CHAGAVLF_PRE, and (p), (q), and (r) represent the detection results of LQVG. To compare the visualization results of MSAM, FR-CHAGAVLF_PRE, and LQVG, the predicted labels of FR-CHAGAVLF_PRE were removed in the figure, and the colors of the predicted boxes and ground truth boxes were kept consistent with those of the MSAM and LQVG models, where green indicates ground truth boxes and red indicates predicted boxes. From the comparison of the three sets of images, it can be observed that the detection results of FR-CHAGAVLF_PRE and LQVG were roughly similar, while the detection results of MSAM were slightly inferior.

4.10. Limitations and Future Work

Although the model we proposed demonstrated excellent performance in most scenarios and showed strong detection capabilities, there were still certain failure cases in specific situations. We selected three failure detection examples and conducted an in-depth analysis of the causes to better understand the limitations of the model and provide directions for future improvements.

Figure 12a–c shows the original images, while (d), (e), and (f) show the inference results of the model. In some images, such as sub-figure (d), the target (e.g., vehicle) is similar in texture and color to the background, and the contours of the target are unclear, making it prone to background noise interference during detection. This issue typically arises when there is insufficient contrast between the target and the background, which makes it difficult for the model to accurately differentiate between them. To address this challenge, future work could focus on optimizing the model’s ability to extract detailed features, particularly in environments with complex backgrounds and blurred target contours. Additionally, employing data augmentation techniques, such as adjusting the contrast, brightness, and texture, could help enhance the model’s robustness in such complex scenarios.
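As one concrete form of the photometric augmentation mentioned above, the sketch below randomly perturbs the contrast and brightness of an image array with NumPy. The function name, the parameter ranges, and the assumption of 8-bit RGB input are illustrative and were not part of the reported experiments. Because only pixel intensities change, the bounding box annotations remain valid.

```python
import numpy as np


def jitter_contrast_brightness(image, rng,
                               contrast_range=(0.7, 1.3),
                               brightness_range=(-30.0, 30.0)):
    """Randomly rescale contrast around the image mean and shift brightness.

    image: uint8 array of shape (H, W, 3); rng: numpy.random.Generator.
    The ranges are hypothetical defaults chosen for illustration only.
    """
    contrast = rng.uniform(*contrast_range)
    brightness = rng.uniform(*brightness_range)
    pixels = image.astype(np.float32)
    mean = pixels.mean()
    # Scale deviations from the mean (contrast), then add an offset (brightness).
    adjusted = (pixels - mean) * contrast + mean + brightness
    return np.clip(adjusted, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = jitter_contrast_brightness(image, rng)
```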

In sub-figure (e), the ground truth box covers an excessively large area, leading to a mismatch with the predicted box. Such discrepancies can also affect training, because ground truth boxes that are too large or too small introduce localization errors. To mitigate this issue, future work could focus on improving the accuracy of the annotations, ensuring that the ground truth boxes more closely align with the actual target. Moreover, optimization algorithms could be applied to automatically correct excessively large ground truth boxes, thereby improving the accuracy of the model’s predictions.

Sub-figure (f) demonstrates another challenge the model faces when detecting targets, such as tennis courts, alongside adjacent objects, like vehicles. This is particularly problematic when the descriptions provided are vague or unclear. While the model performs well in detecting individual objects, its performance can suffer when targets overlap significantly or when the descriptions fail to clearly specify the target. To address this, future work could enhance the model’s language understanding capabilities by incorporating more contextual information, enabling better handling of complex scenarios. Additionally, utilizing multimodal learning, such as vision-language models, could improve the model’s ability to interpret and differentiate fine-grained targets by providing a deeper understanding of both visual and linguistic cues.

5. Conclusions

Multimodal machine learning methods allow machines to understand the relationship between images and text. The RSVG method therefore enables non-professionals to retrieve key information from remote sensing images with the help of AI systems. This has a wide range of application scenarios in military target detection, natural disaster detection, agricultural production, the urban scale, and other fields. In this paper, we proposed an RSVG method and achieved excellent results on the DIOR-RSVG dataset. The major conclusions are summarized as follows:

(1): We proposed a novel model named FR-RSVG. However, its detection performance on the DIOR-RSVG dataset was not satisfactory. We analyzed the experimental results and found that this was due to the model’s unbalanced recognition of large and small objects. To solve this problem, we proposed FR-AVLF, which adaptively fuses deep or shallow visual features according to the input text.
As RSVG is fundamentally an expanded version of RS object detection, we applied the model pretrained on RS images to FR-AVLF. The results show that the detection performance of FR-AVLF_PRE surpassed that of FR-AVLF, indicating a close connection between remote sensing object detection methods and vision-language methods. Future work could therefore consider not only the multimodal integration of vision and language but also the transfer from visual object detection to vision-language object detection.
(2): We found that a Transformer backbone with more parameters generally improved the performance of the model. However, the results with Swin-B and Swin-L were similar, although Swin-L has more than twice as many parameters as Swin-B. Using Swin-L is therefore not cost-effective, because it requires substantially more training resources. The lack of improvement may be caused by factors such as an unsuitable training strategy for Swin-L or an insufficient amount of training data. In the future, we can explore a training strategy suitable for Swin-L.
(3): We further proposed FR-CHAGAVLF_PRE based on FR-AVLF_PRE, whose detection results surpassed those of FR-AVLF_PRE, indicating that our adaptive fusion module and Cascaded Hierarchical Attention Grounding module are effective. In the future, we can continue to explore more complex fusion strategies and more effective cascading mechanisms.
(4): We conducted zero-shot inference experiments on shared categories between DIOR-RSVG and both the DIOR-RSVG-C and OPT-RSVG datasets using the FR-CHAGAVLF_PRE model weights trained on DIOR-RSVG, demonstrating the model’s good robustness and generalization capability. This result provides strong support for cross-dataset transfer in remote sensing visual grounding tasks. In the future, we can further explore using model weights trained on one remote sensing dataset and achieve higher accuracy on another dataset through low-cost fine-tuning strategies, thereby enabling low-cost or even cost-free efficient model deployment.
(5): Additionally, we identified several failure cases and analyzed their underlying causes. These analyses highlight the current limitations of our model and guide future work aimed at addressing these issues to further improve detection performance and robustness.

Author Contributions

Conceptualization, H.Z.; methodology, T.G.; software, H.Z.; validation, Z.L.; formal analysis, Z.L.; investigation, T.G., Q.L.; resources, H.Z., B.H. and L.J.; data curation, K.M.; writing—original draft preparation, T.G. and K.M.; writing—review and editing, H.Z., Z.C. and T.G.; visualization, Z.C.; supervision, L.J.; project administration, B.H.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY01-09).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Kongmiao Miao was employed by the company China Telecom Corporation Limited Shaoxing Branch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1. Performance of ablation experiment 1 model for 20 categories in the dataset. Referring to the ablation experiment in Table 8 of the paper, this category-specific accuracy was achieved when neither Multi-Level AVLF nor CHAG were used.

Visual Encoder	Language Encoder	Category	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-L	Deepseek-1.5b	airplane	88.63	89.29	87.50	81.55	57.74	79.44	71.76
Swin-L	Deepseek-1.5b	airport	90.55	89.34	83.61	73.77	38.52	79.53	82.50
Swin-L	Deepseek-1.5b	baseballfield	86.55	87.80	87.27	85.94	65.25	79.48	76.95
Swin-L	Deepseek-1.5b	basketballcourt	81.01	81.45	80.65	75.81	55.65	74.36	69.75
Swin-L	Deepseek-1.5b	bridge	73.38	68.01	61.03	46.32	14.34	61.59	62.46
Swin-L	Deepseek-1.5b	chimney	87.30	88.55	87.02	85.50	58.78	80.02	82.27
Swin-L	Deepseek-1.5b	dam	82.26	78.35	65.98	50.52	16.49	68.48	72.33
Swin-L	Deepseek-1.5b	Expressway-Service-area	85.20	84.52	77.42	67.74	23.87	73.63	72.67
Swin-L	Deepseek-1.5b	Expressway-toll-station	75.17	74.53	70.75	62.26	38.68	65.67	55.55
Swin-L	Deepseek-1.5b	golffield	83.29	82.47	79.38	76.29	42.27	75.37	77.74
Swin-L	Deepseek-1.5b	groundtrackfield	87.03	88.28	85.35	78.02	50.18	77.40	87.97
Swin-L	Deepseek-1.5b	harbor	52.75	52.00	46.00	28.00	10.00	45.63	40.22
Swin-L	Deepseek-1.5b	overpass	69.65	65.61	62.43	47.62	17.46	59.05	60.71
Swin-L	Deepseek-1.5b	ship	73.63	73.40	71.92	64.04	33.99	65.22	64.01
Swin-L	Deepseek-1.5b	stadium	93.04	94.29	93.33	81.90	46.67	81.17	85.76
Swin-L	Deepseek-1.5b	storagetank	83.90	83.90	83.90	83.05	70.34	77.11	62.77
Swin-L	Deepseek-1.5b	tenniscourt	75.44	76.69	75.46	74.85	57.67	69.59	53.53
Swin-L	Deepseek-1.5b	trainstation	82.42	78.57	64.29	45.92	14.29	68.21	65.21
Swin-L	Deepseek-1.5b	vehicle	73.01	72.84	68.32	52.33	19.94	61.46	34.06
Swin-L	Deepseek-1.5b	windmill	94.95	90.25	79.42	59.57	20.94	76.72	80.45

Table A2. Performance of ablation experiment 2 model for 20 categories in the dataset. Referring to the ablation experiment in Table 8 of the paper, this category-specific accuracy was achieved when Multi-Level AVLF was not used but CHAG was used.

Visual Encoder	Language Encoder	Category	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-L	Deepseek-1.5b	airplane	87.50	86.90	86.90	85.12	74.40	82.31	75.76
Swin-L	Deepseek-1.5b	airport	95.08	94.26	91.80	86.89	70.49	88.62	90.84
Swin-L	Deepseek-1.5b	baseballfield	86.47	86.47	86.21	85.15	78.78	82.17	83.21
Swin-L	Deepseek-1.5b	basketballcourt	80.65	80.65	80.65	79.84	77.42	77.90	74.54
Swin-L	Deepseek-1.5b	bridge	73.16	68.75	63.24	54.04	38.60	64.57	75.51
Swin-L	Deepseek-1.5b	chimney	90.84	90.84	90.84	90.08	82.44	86.67	89.55
Swin-L	Deepseek-1.5b	dam	90.72	87.63	79.38	65.98	41.24	79.44	81.20
Swin-L	Deepseek-1.5b	Expressway-Service-area	81.29	80.65	80.00	76.13	59.35	76.89	77.10
Swin-L	Deepseek-1.5b	Expressway-toll-station	79.25	78.30	75.47	66.98	54.72	72.04	83.75
Swin-L	Deepseek-1.5b	golffield	82.47	82.47	81.44	76.29	71.13	79.03	82.61
Swin-L	Deepseek-1.5b	groundtrackfield	87.55	87.55	86.81	82.42	71.43	81.79	91.00
Swin-L	Deepseek-1.5b	harbor	58.00	58.00	52.00	48.00	36.00	52.94	45.99
Swin-L	Deepseek-1.5b	overpass	70.90	68.78	64.02	58.73	39.68	63.78	72.29
Swin-L	Deepseek-1.5b	ship	77.34	77.34	76.35	73.40	60.10	72.15	73.66
Swin-L	Deepseek-1.5b	stadium	94.29	94.29	93.33	91.43	78.10	88.11	93.99
Swin-L	Deepseek-1.5b	storagetank	83.90	83.90	83.90	83.05	81.36	80.56	70.88
Swin-L	Deepseek-1.5b	tenniscourt	79.14	79.14	79.14	77.91	75.46	76.22	65.00
Swin-L	Deepseek-1.5b	trainstation	83.67	77.55	67.35	58.16	38.78	74.40	71.51
Swin-L	Deepseek-1.5b	vehicle	74.12	72.98	71.15	61.39	41.02	65.75	59.42
Swin-L	Deepseek-1.5b	windmill	94.58	92.06	85.56	71.12	43.32	81.74	85.91

Table A3. Performance of ablation experiment 3 model for 20 categories in the dataset. Referring to the ablation experiment in Table 8 of the paper, this category-specific accuracy was achieved when Multi-Level AVLF was used but CHAG was not used.

Visual Encoder	Language Encoder	Category	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-L	Deepseek-1.5b	airplane	87.50	86.31	85.71	82.14	62.50	80.05	76.96
Swin-L	Deepseek-1.5b	airport	91.80	86.89	85.25	76.23	47.54	83.38	84.79
Swin-L	Deepseek-1.5b	baseballfield	87.80	87.53	87.27	83.55	71.35	81.55	78.98
Swin-L	Deepseek-1.5b	basketballcourt	83.06	83.06	82.26	79.03	63.71	77.92	73.24
Swin-L	Deepseek-1.5b	bridge	75.74	69.49	63.24	52.21	28.68	64.90	69.01
Swin-L	Deepseek-1.5b	chimney	89.31	89.31	87.79	85.50	66.41	82.86	82.96
Swin-L	Deepseek-1.5b	dam	87.63	77.32	68.04	50.52	27.84	73.42	75.76
Swin-L	Deepseek-1.5b	Expressway-Service-area	85.81	85.16	82.58	74.84	43.87	77.21	77.13
Swin-L	Deepseek-1.5b	Expressway-toll-station	77.36	75.47	70.75	66.04	50.00	69.07	66.30
Swin-L	Deepseek-1.5b	golffield	85.57	82.47	78.35	75.26	51.55	77.45	80.52
Swin-L	Deepseek-1.5b	groundtrackfield	86.08	85.35	83.88	78.39	59.34	78.28	87.74
Swin-L	Deepseek-1.5b	harbor	62.00	54.00	48.00	42.00	14.00	54.61	39.50
Swin-L	Deepseek-1.5b	overpass	76.19	72.49	68.25	59.79	30.16	67.07	72.09
Swin-L	Deepseek-1.5b	ship	76.85	76.35	72.91	68.97	38.92	69.20	71.50
Swin-L	Deepseek-1.5b	stadium	94.29	94.29	90.48	80.95	50.48	83.12	87.63
Swin-L	Deepseek-1.5b	storagetank	86.44	86.44	85.59	83.90	76.27	80.99	78.17
Swin-L	Deepseek-1.5b	tenniscourt	80.37	80.37	79.75	79.14	68.10	75.51	73.38
Swin-L	Deepseek-1.5b	trainstation	82.65	71.43	64.29	51.02	33.67	70.24	63.72
Swin-L	Deepseek-1.5b	vehicle	73.41	71.57	68.60	56.01	30.27	63.32	60.21
Swin-L	Deepseek-1.5b	windmill	92.78	91.34	80.87	62.45	32.85	79.58	81.46

Table A4. Performance of ablation experiment 4 model for 20 categories in the dataset. Referring to the ablation experiment in Table 8 of the paper, this category-specific accuracy was achieved when both Multi-Level AVLF and CHAG were used.

Visual Encoder	Language Encoder	Category	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-L	Deepseek-1.5b	airplane	86.90	86.31	85.12	82.74	72.62	81.33	78.98
Swin-L	Deepseek-1.5b	airport	95.08	93.44	89.34	86.07	72.95	89.02	91.17
Swin-L	Deepseek-1.5b	baseballfield	87.00	87.00	86.74	86.21	79.31	82.57	83.00
Swin-L	Deepseek-1.5b	basketballcourt	82.26	82.26	82.26	79.84	76.61	79.20	81.08
Swin-L	Deepseek-1.5b	bridge	75.74	70.22	63.97	58.46	40.44	67.63	80.69
Swin-L	Deepseek-1.5b	chimney	89.31	88.55	88.55	88.55	79.39	85.29	88.24
Swin-L	Deepseek-1.5b	dam	83.51	78.35	73.20	64.95	42.27	77.07	79.43
Swin-L	Deepseek-1.5b	Expressway-Service-area	81.94	81.94	80.65	78.71	61.94	78.04	76.20
Swin-L	Deepseek-1.5b	Expressway-toll-station	78.30	77.36	74.53	68.87	57.55	71.72	86.17
Swin-L	Deepseek-1.5b	golffield	86.60	85.57	82.47	77.32	67.01	81.11	84.20
Swin-L	Deepseek-1.5b	groundtrackfield	86.08	86.08	85.35	81.68	68.86	80.90	91.79
Swin-L	Deepseek-1.5b	harbor	60.00	60.00	54.00	48.00	38.00	56.67	53.24
Swin-L	Deepseek-1.5b	overpass	73.54	71.43	67.20	58.20	43.39	66.24	77.15
Swin-L	Deepseek-1.5b	ship	77.83	77.34	76.35	72.91	58.62	72.18	76.06
Swin-L	Deepseek-1.5b	stadium	91.43	91.43	89.52	87.62	76.19	85.58	94.09
Swin-L	Deepseek-1.5b	storagetank	85.59	85.59	85.59	83.90	83.90	82.18	81.77
Swin-L	Deepseek-1.5b	tenniscourt	80.98	80.98	80.98	79.75	77.91	78.31	79.30
Swin-L	Deepseek-1.5b	trainstation	88.78	77.55	70.41	63.27	45.92	76.32	74.76
Swin-L	Deepseek-1.5b	vehicle	73.27	72.84	71.15	61.67	44.27	65.90	66.96
Swin-L	Deepseek-1.5b	windmill	94.22	91.70	84.84	68.23	44.40	81.98	85.93

Table A5. Ablation study on hyperparameter λ.

Visual Encoder	Language Encoder	λ	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-L	BERT	0.6	77.11	75.00	70.69	60.51	33.40	67.49	68.83
Swin-L	BERT	1.0	78.39	77.56	75.58	70.38	53.99	71.59	73.43
Swin-L	BERT	1.4	76.25	74.19	69.65	60.72	34.91	66.89	67.54

Algorithm A1: Multi-level adaptive visual language fusion.

Input: Visual features $VF_{1}, VF_{2}, VF_{3}, VF_{4}, VF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$; language features $LF_{1}, LF_{2}, LF_{3}, LF_{4}, LF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$, where $H_{i} \times W_{i}$ represents the spatial resolution of the ith pyramid level.

Output: Vision-language features $VLF_{1}, VLF_{2}, VLF_{3}, VLF_{4}, VLF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$.

01: for i = 1 to 5 do
02:     // Text feature projection to visual feature space
03:     $F_{1i} \leftarrow \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2D}(LF_{i}, W^{proj1})))$, where $W^{proj1} \in \mathbb{R}^{41 \times 128 \times 1 \times 1}$
04:     $PLF_{i} \leftarrow \mathrm{Conv2D}(F_{1i}, W^{proj2})$, where $W^{proj2} \in \mathbb{R}^{128 \times 256 \times 1 \times 1}$
05: end for
06: for i = 1 to 5 do
07:     // Multi-head cross-modal attention
08:     ${PLF_{norm}}_{i} \leftarrow \mathrm{LayerNorm}(PLF_{i})$, ${VF_{norm}}_{i} \leftarrow \mathrm{LayerNorm}(VF_{i})$
09:     $Q_{i} \leftarrow \mathrm{Reshape}(W_{Q} \cdot {PLF_{norm}}_{i}, [B, H, H_{i} \times W_{i}, d_{k}])$ // Flatten spatial dimensions
10:     $K_{i} \leftarrow \mathrm{Reshape}(W_{K} \cdot {VF_{norm}}_{i}, [B, H, H_{i} \times W_{i}, d_{k}])$ // $H_{i} \times W_{i}$ = spatial resolution
11:     $V_{i} \leftarrow \mathrm{Reshape}(W_{V} \cdot {VF_{norm}}_{i}, [B, H, H_{i} \times W_{i}, d_{v}])$
12:     $A_{i} \leftarrow \mathrm{FlashAttention}(Q_{i}, K_{i}, V_{i}, \mathrm{scale} = \frac{1}{\sqrt{d_{k}}})$
13:     $attended_{i} \leftarrow \mathrm{Reshape}(A_{i}, [B, 256, H_{i}, W_{i}])$
14:     // Gating mechanism for selective fusion
15:     ${gate_{input}}_{i} \leftarrow \mathrm{Concat}([attended_{i}, VF_{i}], \mathrm{dim} = 1)$
16:     ${g1}_{i} \leftarrow \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2D}({gate_{input}}_{i}, W_{g1})))$, where $W_{g1}: 512 \to 128$
17:     $gate_{i} \leftarrow \mathrm{Sigmoid}(\mathrm{Conv2D}({g1}_{i}, W_{g2}))$, where $W_{g2}: 128 \to 256$
18:     ${gated_{fusion}}_{i} \leftarrow gate_{i} \odot attended_{i} + (1 - gate_{i}) \odot VF_{i}$
19:     $CA_{i} \leftarrow \mathrm{LayerNorm}({gated_{fusion}}_{i} + VF_{i})$
20: end for
21: for i = 1 to 5 do
22:     // Multi-head self-attention on cross-modal features
23:     ${X_{norm}}_{i} \leftarrow \mathrm{LayerNorm}(CA_{i})$
24:     ${Q_{self}}_{i} \leftarrow \mathrm{Reshape}(W_{Q}^{self} \cdot {X_{norm}}_{i}, [B, H, H_{i} \times W_{i}, d_{k}])$ // Spatial resolution preserved
25:     ${K_{self}}_{i} \leftarrow \mathrm{Reshape}(W_{K}^{self} \cdot {X_{norm}}_{i}, [B, H, H_{i} \times W_{i}, d_{k}])$
26:     ${V_{self}}_{i} \leftarrow \mathrm{Reshape}(W_{V}^{self} \cdot {X_{norm}}_{i}, [B, H, H_{i} \times W_{i}, d_{v}])$
27:     ${A_{self}}_{i} \leftarrow \mathrm{FlashAttention}({Q_{self}}_{i}, {K_{self}}_{i}, {V_{self}}_{i})$
28:     $out_{i} \leftarrow \mathrm{Reshape}({A_{self}}_{i}, [B, 256, H_{i}, W_{i}])$
29:     $SA_{i} \leftarrow \mathrm{LayerNorm}(\gamma \cdot out_{i} + CA_{i})$, where $\gamma$ is initialized to 0
30: end for
31: for i = 1 to 5 do
32:     // FiLM conditioning for fine-grained modulation
33:     ${\gamma_{temp}}_{i} \leftarrow \mathrm{ReLU}(\mathrm{Conv2D}(PLF_{i}, W_{\gamma 1}, \mathrm{kernel} = 3, \mathrm{padding} = 1))$
34:     $\gamma_{i} \leftarrow \mathrm{Conv2D}({\gamma_{temp}}_{i}, W_{\gamma 2}, \mathrm{kernel} = 1)$
35:     ${\beta_{temp}}_{i} \leftarrow \mathrm{ReLU}(\mathrm{Conv2D}(PLF_{i}, W_{\beta 1}, \mathrm{kernel} = 3, \mathrm{padding} = 1))$
36:     $\beta_{i} \leftarrow \mathrm{Conv2D}({\beta_{temp}}_{i}, W_{\beta 2}, \mathrm{kernel} = 1)$
37:     $FILM_{i} \leftarrow \gamma_{i} \odot SA_{i} + \beta_{i}$
38: end for
39: for i = 1 to 5 do
40:     // Remote sensing scale calibration
41:     ${global_{ctx}}_{i} \leftarrow \mathrm{AdaptiveAvgPool2D}(FILM_{i}, \mathrm{output\_size} = 1)$
42:     ${local_{ctx}}_{i} \leftarrow \mathrm{AvgPool2D}(FILM_{i}, \mathrm{kernel} = 3, \mathrm{stride} = 1, \mathrm{padding} = 1)$
43:     ${combined_{ctx}}_{i} \leftarrow {global_{ctx}}_{i} + {local_{ctx}}_{i}$
44:     ${h1}_{i} \leftarrow \mathrm{ReLU}(\mathrm{Conv2D}({combined_{ctx}}_{i}, W_{s1}))$, where $W_{s1}: 256 \to 64$
45:     ${scale_{attn}}_{i} \leftarrow \mathrm{Sigmoid}(\mathrm{Conv2D}({h1}_{i}, W_{s2}))$, where $W_{s2}: 64 \to 256$
46:     $SC_{i} \leftarrow FILM_{i} \odot {scale_{attn}}_{i}$
47: end for
48: // Adaptive weighting fusion
49: for i = 1 to 5 do
50:     $T_{i} \leftarrow \mathrm{BN}(\mathrm{Conv2D}(SC_{i}, W_{t,i}))$
51:     $\mu_{i} \leftarrow \mathrm{Mean}(T_{i}, \mathrm{dims} = [1, 2, 3])$, $\sigma_{i} \leftarrow \mathrm{Std}(T_{i}, \mathrm{dims} = [1, 2, 3])$
52:     $stats_{i} \leftarrow \mathrm{Concat}([\mu_{i}, \sigma_{i}])$
53: end for
54: $stats_{all} \leftarrow \mathrm{Concat}([stats_{1}, stats_{2}, stats_{3}, stats_{4}, stats_{5}])$ // Dimension change: [B, 2] × 5 → [B, 10]
55: $h_{mlp} \leftarrow \mathrm{ReLU}(\mathrm{Linear}(stats_{all}, W_{mlp1}))$ // Input dimension: [B, 10], Linear layer: [B, 10] × [10, 16] → [B, 16], ReLU: [B, 16]
56: $size_{logits} \leftarrow \mathrm{Linear}(h_{mlp}, W_{mlp2})$ // Linear layer: [B, 16] × [16, 5] → [B, 5]
57: $size_{weights} \leftarrow \mathrm{Softmax}(size_{logits})$
58: $shapes \leftarrow [(H_{1}, W_{1}), (H_{2}, W_{2}), (H_{3}, W_{3}), (H_{4}, W_{4}), (H_{5}, W_{5})]$ // Spatial resolutions of each level
59: $target_{size} \leftarrow \mathrm{Round}(\sum_{i = 1}^{5} size_{weights}[:, i] \times shapes[i])$ // Adaptive target resolution
60: for i = 1 to 5 do
61:     $T_{aligned,i} \leftarrow \mathrm{Interpolate}(T_{i}, \mathrm{size} = target_{size}, \mathrm{mode} = \text{'bilinear'})$
62: end for
63: $fusion_{weights} \leftarrow \mathrm{Softmax}(w_{fusion})$, where $w_{fusion} \in \mathbb{R}^{5}$ is learnable
64: $GC \leftarrow \sum_{i = 1}^{5} fusion_{weights}[i] \times T_{aligned,i}$
65: // Feature enhancement with global context integration
66: for i = 1 to 5 do
67:     $\mu_{SC_{i}} \leftarrow \mathrm{Mean}(SC_{i}, \mathrm{dims} = [1, 2, 3])$ // Global mean of $SC_{i}$: $\mathbb{R}^{B}$
68:     $\sigma_{SC_{i}} \leftarrow \mathrm{Std}(SC_{i}, \mathrm{dims} = [1, 2, 3])$ // Global std of $SC_{i}$: $\mathbb{R}^{B}$
69:     $level_{stats_{i}} \leftarrow \mathrm{Concat}([\mu_{SC_{i}}, \sigma_{SC_{i}}])$ // Statistical representation: $\mathbb{R}^{B \times 2}$
70:     $\alpha_{i} \leftarrow \mathrm{Sigmoid}(\mathrm{Linear}(level_{stats_{i}}, W_{\alpha}))$, where $W_{\alpha} \in \mathbb{R}^{2 \times 1}$
71:     $GC_{resized_{i}} \leftarrow \mathrm{Interpolate}(GC, \mathrm{size} = (H_{i}, W_{i}), \mathrm{mode} = \text{'bilinear'})$
72:     $VLF_{i} \leftarrow SC_{i} + \alpha_{i} \times GC_{resized_{i}}$
73: end for
74: return $VLF_{1}, VLF_{2}, VLF_{3}, VLF_{4}, VLF_{5}$

Notation:
B: batch size; H: number of attention heads = 8
$d_{k} = d_{v} = \frac{256}{H}$: dimension per attention head
$H_{i}, W_{i}$: spatial dimensions (height, width) of the feature map at pyramid level i ∈ {1, 2, 3, 4, 5} (e.g., 200 × 200, 100 × 100, 50 × 50, 25 × 25, 13 × 13)
$W_{*}$: learnable weight matrices; $\gamma$, $\beta$: FiLM modulation parameters
$\odot$: element-wise multiplication; $\sum$: summation operator
BN: BatchNorm; Conv2D: 2D convolution operation
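Three steps of Algorithm A1 are easy to illustrate in isolation: the gated fusion of line 18, the FiLM modulation of line 37, and the adaptive target resolution of lines 57 to 59. The NumPy sketch below is a simplified illustration under stated assumptions, not the authors' implementation: the gate logits, γ, and β are passed in directly instead of being produced by the learned convolutions, and all function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def gated_fusion(attended, vf, gate_logits):
    """Line 18: per-element convex combination of attended and visual features."""
    gate = sigmoid(gate_logits)  # stands in for the learned conv gate (lines 15-17)
    return gate * attended + (1.0 - gate) * vf


def film(sa, gamma, beta):
    """Line 37: feature-wise linear modulation, gamma * SA + beta."""
    return gamma * sa + beta


def adaptive_target_size(size_logits, shapes):
    """Lines 57-59: softmax-weighted average of the pyramid resolutions, rounded."""
    weights = softmax(size_logits, axis=-1)        # [B, 5]
    shapes = np.asarray(shapes, dtype=np.float64)  # [5, 2]
    return np.rint(weights @ shapes).astype(int)   # [B, 2] as (height, width)


B, C, H, W = 2, 256, 25, 25
rng = np.random.default_rng(0)
vf = rng.standard_normal((B, C, H, W))
attended = rng.standard_normal((B, C, H, W))
fused = gated_fusion(attended, vf, gate_logits=rng.standard_normal((B, C, H, W)))
# Identity modulation (gamma = 1, beta = 0) leaves the features unchanged.
modulated = film(fused, gamma=np.ones((B, C, H, W)), beta=np.zeros((B, C, H, W)))

shapes = [(200, 200), (100, 100), (50, 50), (25, 25), (13, 13)]
target = adaptive_target_size(np.zeros((B, 5)), shapes)
```

With uniform weights, the target resolution is the plain average of the five level sizes; a peaked softmax pulls it toward the resolution of a single pyramid level.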

Algorithm A2: Cascaded hierarchical attention grounding.

Input:
$V_{multiscale}$: A set of multi-scale visual features {$VF_{1}, VF_{2}, VF_{3}, VF_{4}, VF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$} from the backbone network.
$L_{multiscale}$: A set of multi-scale language features {$LF_{1}, LF_{2}, LF_{3}, LF_{4}, LF_{5} \in \mathbb{R}^{B \times 256 \times H_{i} \times W_{i}}$} aligned with visual features.
$L_{sentence}$: A sentence-level language embedding vector.
$T_{gt}$: A set of ground-truth targets, where each target contains $B_{gt}$ (box) and $C_{gt}$ (class label).

Output:
$B_{final}$, $S_{final}$ (Inference): The final predicted bounding box and its confidence score.
$L_{total}$ (Training): The total loss for optimization.

procedure begin:
01: CHAG($V_{multiscale}$, $L_{multiscale}$, $L_{sentence}$, $T_{gt}$)
02: // Hierarchical Attention Layer 1: Global Semantic Alignment.
03: $F_{fused}$ ← MultiLevelFusion($V_{multiscale}$, $L_{multiscale}$)
04: // Fuses language features into multi-scale visual features via attention.
05:
06: // Hierarchical Attention Layer 2: Local Feature Enhancement in RPN.
07: $P_{init}$, $L_{rpn}$ ← RPN_with_Attention($F_{fused}$, $L_{sentence}$)
08: // Generates text-relevant initial proposals by applying attention to objectness scores within the RPN head.
09:
10: // Cascaded Grounding Stage.
11: $P_{current}$ ← $P_{init}$
12: $L_{cascade}$ ← 0
13: for $k$ ← 1 to $N_{stages}$ do
14:     $t_{iou}$ ← $T_{iou}[k - 1]$
15:     if is_training then
16:         $P_{sampled}$, $C_{k}^{gt}$, $B_{k}^{gt}$ ← SelectTrainingSamples($P_{current}$, $T_{gt}$, $t_{iou}$)
17:     else
18:         $P_{sampled}$ ← $P_{current}$
19:     end if
20:
21:     $F_{roi}$ ← RoIPool($F_{fused}$, $P_{sampled}$)
22:     $F_{head}$ ← Head_k($F_{roi}$)
23:     $C_{k}^{pred}$, $B_{k}^{pred}$ ← Predictor_k($F_{head}$)
24:
25:     if is_training then
26:         $L_{cls}^{(k)}$ ← CrossEntropyLoss($C_{k}^{pred}$, $C_{k}^{gt}$)
27:         $L_{reg}^{(k)}$ ← SmoothL1Loss($B_{k}^{pred}$, $B_{k}^{gt}$)
28:         $L_{stage}^{(k)}$ ← $L_{cls}^{(k)}$ + $L_{reg}^{(k)}$
29:         $L_{cascade}$ ← $L_{cascade}$ + $W[k - 1] \times L_{stage}^{(k)}$
30:     end if
31:
32:     $P_{current}$ ← Decode($B_{k}^{pred}$.detach(), $P_{sampled}$)
33: end for
34:
35: // Post-processing and Final Output.
36: $D_{final}$ ← PostProcess($C_{N_{stages}}^{pred}$, $B_{N_{stages}}^{pred}$, $P_{current}$)
37: // Includes NMS and clipping.
38: $B_{final}$, $S_{final}$ ← SelectBestDetection($D_{final}$)
39:
40: if is_training then
41:     $L_{total}$ ← $L_{rpn}$ + $L_{cascade}$
42:     return $L_{total}$
43: else
44:     return $B_{final}$, $S_{final}$
45: end if
end procedure

Notation:
$N_{stages}$ = total number of cascade stages ($N_{stages}$ = 3);
$T_{iou}$ = a set of IoU thresholds for each stage ($T_{iou}$ = {0.55, 0.65, 0.75});
$W$ = a set of loss weights for each stage ($W$ = {1.0, 1.0, 1.0});
$F_{fused}$ = fused multi-modal features after the global semantic alignment stage;
$P_{init}$ = initial proposals generated by the RPN;
$L_{rpn}$ = loss calculated from the region proposal network;
$P_{current}$ = proposals used as input for the current cascade stage, which are refined in each iteration;
$L_{cascade}$ = cumulative loss from all stages of the cascade head;
$t_{iou}$ = the specific intersection over union (IoU) threshold for the current stage;
$P_{sampled}$ = sampled proposals (positive and negative) for training at the current stage;
$C_{k}^{gt}$ = ground-truth class labels for the sampled proposals at stage k;
$B_{k}^{gt}$ = ground-truth bounding box regression targets for stage k;
$F_{roi}$ = RoI features extracted via the RoIPool layer;
$F_{head}$ = features processed by the stage-specific detection head;
$C_{k}^{pred}$ = predicted class logits at stage k;
$B_{k}^{pred}$ = predicted box regression deltas at stage k;
$L_{cls}^{(k)}$ = classification loss for stage k;
$L_{reg}^{(k)}$ = bounding box regression loss for stage k;
$L_{stage}^{(k)}$ = total loss for a single cascade stage k;
$D_{final}$ = final set of detections after all post-processing steps (e.g., NMS);
$C_{N_{stages}}^{pred}$, $B_{N_{stages}}^{pred}$ = predicted class logits and box deltas from the final cascade stage;
$L_{total}$ = the total loss for the entire model.
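The effect of the stage-wise IoU thresholds in Algorithm A2 can be seen in a small pure-Python sketch. It assumes that SelectTrainingSamples labels a proposal as positive when its IoU with the ground-truth box reaches the stage threshold, which is the usual cascade convention; the helper names and toy boxes are hypothetical, and the actual procedure additionally samples negatives, computes regression targets, and refines the proposals between stages.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_box, t_iou):
    """Assumed positive/negative assignment: positive if IoU >= t_iou."""
    return [iou(p, gt_box) >= t_iou for p in proposals]


T_IOU = (0.55, 0.65, 0.75)  # stage thresholds listed in the notation of Algorithm A2
gt_box = (0, 0, 100, 100)
# Toy proposals whose IoU with gt_box is 0.6, 0.7 and 0.8, respectively.
proposals = [(0, 0, 100, 60), (0, 0, 100, 70), (0, 0, 100, 80)]

positives_per_stage = [sum(label_proposals(proposals, gt_box, t)) for t in T_IOU]
```

In this toy setting, fewer of the fixed proposals count as positives at each later stage, which illustrates why later heads are trained on progressively better-localized samples.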

Table A6. The influence of different IoU threshold settings for CHAG.

IoU Threshold of CHAG	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
No-CHAG	78.39	77.56	75.58	70.38	53.99	71.59	73.43
IoU = {0.50,0.60,0.70}	78.96	77.83	75.63	70.55	54.36	72.26	73.62
IoU = {0.55,0.65,0.75}	78.63	77.65	76.19	71.22	57.16	73.43	75.48
IoU = {0.60,0.70,0.80}	77.08	75.14	71.67	63.10	43.30	69.39	76.80
IoU = {0.65,0.75,0.85}	74.82	73.07	70.04	63.07	47.83	67.86	68.17

References

Simantiris, G.; Panagiotakis, C. Unsupervised Color-Based Flood Segmentation in UAV Imagery. Remote Sens. 2024, 16, 2126.
Senanayake, I.P.; Pathira Arachchilage, K.R.L.; Yeo, I.-Y.; Khaki, M.; Han, S.-C.; Dahlhaus, P.G. Spatial Downscaling of Satellite-Based Soil Moisture Products Using Machine Learning Techniques: A Review. Remote Sens. 2024, 16, 2067.
Lei, X.; Jiang, J.; Deng, Z.; Wu, D.; Wang, F.; Lai, C.; Wang, Z.; Chen, X. An Ensemble Machine Learning Model to Estimate Urban Water Quality Parameters Using Unmanned Aerial Vehicle Multispectral Imagery. Remote Sens. 2024, 16, 2246.
Cao, S.; Li, Z.; Deng, J.; Huang, Y.; Peng, Z. TFCD-Net: Target and False Alarm Collaborative Detection Network for Infrared Imagery. Remote Sens. 2024, 16, 1758.
Ali, T.A.; Bazi, Y.; Al Rahhal, M.M.; Mekhalfi, M.L.; Rangarajan, L.; Zuair, M.A. TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images. Remote Sens. 2020, 12, 405.
Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297.
Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z.X. Remote Sensing Image Change Captioning with Dual-Branch Transformers: A New Method and a Large Scale Dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5633520.
Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Li, X. RSGPT: A Remote Sensing Vision Language Model and Benchmark. arXiv 2023, arXiv:2307.15266.
Wei, T.; Yuan, W.; Luo, J.; Zhang, W.; Lu, L. VLCA: Vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning. J. Syst. Eng. Electron. 2023, 34, 9–18.
Bejiga, M.B.; Melgani, F.; Vascotto, A. Retro-Remote Sensing: Generating Images from Ancient Texts. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 950–960.
Xu, Y.; Yu, W.; Ghamisi, P.; Kopp, M.; Hochreiter, S. Txt2Img-MHN: Remote Sensing Image Generation from Text Using Modern Hopfield Networks. IEEE Trans. Image Process. 2022, 32, 5737–5750.
Li, A.; Lu, Z.; Wang, L.; Xiang, T.; Wen, J. Zero-Shot Scene Classification for High Spatial Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4157–4167.
Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
Jiang, X.; Zhou, N.; Li, X. Few-Shot Segmentation of Remote Sensing Images Using Deep Metric Learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6507405.
Zhang, S.; Song, F.; Liu, X.; Hao, X.; Liu, Y.; Lei, T.; Jiang, P. Text Semantic Fusion Relation Graph Reasoning for Few-Shot Object Detection on Remote Sensing Images. Remote Sens. 2023, 15, 1187.
Lu, X.; Sun, X.; Diao, W.; Mao, Y.; Li, J.; Zhang, Y.; Wang, P.; Fu, K. Few-Shot Object Detection in Aerial Imagery Guided by Text-Modal Knowledge. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604719.
Liu, G.; He, J.; Li, P.; Zhong, S.; Li, H.; He, G. Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering. Remote Sens. 2023, 15, 4682.
Yuan, Z.; Mou, L.; Wang, Q.; Zhu, X.X. From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623111.
Yu, Z.; Yu, J.; Xiang, C.; Zhao, Z.; Tian, Q.; Tao, D. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 1114–1120.
Chen, J.; Hong, H.; Song, B.; Guo, J.; Chen, C.; Xu, J. MDCT: Multi-Kernel Dilated Convolution and Transformer for One-Stage Object Detection of Remote Sensing Images. Remote Sens. 2023, 15, 371.
Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984.
Shiu, Y.-S.; Lee, R.-Y.; Chang, Y.-C. Pineapples’ Detection and Segmentation Based on Faster and Mask R-CNN in UAV Imagery. Remote Sens. 2023, 15, 814.
Sadhu, A.; Chen, K.; Nevatia, R. Zero-Shot Grounding of Objects from Natural Language Queries. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4693–4702.
Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4682–4692. [Google Scholar]
Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-stage Visual Grounding by Recursive Sub-query Construction. In Proceedings of the Computer Vision-ECCV 2020:16th Eurpean Conference, Glasgow, UK, 23–28 August 2020; pp. 387–404. [Google Scholar]
Huang, B.; Lian, D.; Luo, W.; Gao, S. Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 16888–16897. [Google Scholar]
Lu, X.; Zhang, Y.; Yuan, Y.; Feng, Y. Gated and Axis-Concentrated Localization Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2020, 58, 179–192. [Google Scholar] [CrossRef]
Yang, S.; Li, G.; Yu, Y. Graph-structured referring expression reasoning in the wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9952–9961. [Google Scholar]
Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1749–1759. [Google Scholar]
Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 9499–9508. [Google Scholar]
Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604513. [Google Scholar] [CrossRef]
Zhang, H.; Niu, Y.; Chang, S. Grounding Referring Expressions in Images by Variational Context. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4158–4166. [Google Scholar]
Yang, S.; Li, G.; Yu, Y. Dynamic Graph Attention for Referring Expression Comprehension. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4643–4652. [Google Scholar]
Liu, D.; Zhang, H.; Zha, Z.; Wu, F. Learning to Assemble Neural Module Tree Networks for Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4672–4681. [Google Scholar]
Liu, X.; Wang, Z.; Shao, J.; Wang, X.; Li, H. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1950–1959. [Google Scholar]
Wang, P.; Wu, Q.; Cao, J.; Shen, C.; Gao, L.; van den Hengel, A. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1960–1968. [Google Scholar]
Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to compose and reason with language tree structures for visual grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 684–696. [Google Scholar] [CrossRef]
Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. Regionclip: Region-based language-image pretraining. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16793–16803. [Google Scholar]
Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10880–10889. [Google Scholar]
Liao, Y.; Zhang, A.; Chen, Z.; Hui, T.; Liu, S. Progressive language customized visual feature learning for one-stage visual grounding. IEEE Trans. Image Process. 2022, 31, 4266–4277. [Google Scholar] [CrossRef] [PubMed]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision-ECCV 2020:16th Eurpean Conference, Glasgow, UK, 23–28 August 2020; Proceeding, Part I 17. Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. Transvg++: End-to-end visual grounding with language conditioned vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 48, 13636–13652. [Google Scholar] [CrossRef]
Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual Grounding in Remote Sensing Images. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; Volume 9, pp. 404–412. [Google Scholar]
Wang, F.; Wu, C.; Wu, J.; Wang, L.; Li, C. Multistage Synergistic Aggregation Network for Remote Sensing Visual Grounding. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6007605. [Google Scholar] [CrossRef]
Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5631413. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 10012–10022. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1440–1448. [Google Scholar]
Rezatofighi, S.H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.D.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition(CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark. ISPRS J. Photogram. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
Wu, C.; Lin, Z.; Cohen, S.D.; Bui, T.; Maji, S. PhraseCut: Language-Based Image Segmentation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10213–10222. [Google Scholar]
Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Defense + Commercial Sensing; SPIE: Bellingham, WA, USA, 2019; Volume 1100612. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Yang, X.; Yan, J.; Yang, X.; Tang, J.; Liao, W.; He, T. SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Trans. Pattern Anal. Machine. 2023, 45, 2384–2399. [Google Scholar] [CrossRef] [PubMed]
Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5612822. [Google Scholar] [CrossRef]
Lan, M.; Rong, F.; Jiao, H.; Gao, Z.; Zhang, L. Language Query-Based Transformer with Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5626513. [Google Scholar] [CrossRef]

Figure 1. Faster R-CNN in Visual Grounding for Remote Sensing (FR-RSVG). The upper part shows how language features are obtained from the BERT encoder and vision features from the Swin Transformer. The lower part shows the region proposal stage, which generates the candidate results.

Figure 2. FR-AVLF: Adaptive vision-language fusion method for RSVG. On the left is the visual branch for extracting image features, on the right is the language branch for generating text embeddings, and in the middle is our adaptive vision-language fusion module.

Figure 3. FR-AVLF: Layered adaptive vision-language fusion in RSVG. The upper part shows the process of fusing vision-language features. The lower part shows the region proposal stage, which generates the candidate results.

Figure 4. FR-AVLF_PRE: Transfer remote sensing image object detection model weights. The weights transferred are represented by a dark color, while the weights that need to be learned are represented by a light color.

Figure 5. Multi-Level AVLF: Multi-Level Adaptive Vision-Language Feature Fusion. Vf′[4] represents Vf′₁₃, Vf′₂₅, Vf′₅₀, Vf′₁₀₀, and Vf′₂₀₀. Lf′[4] represents Lf′₁₃, Lf′₂₅, Lf′₅₀, Lf′₁₀₀, and Lf′₂₀₀. VLf′[4] represents VLf′₁₃, VLf′₂₅, VLf′₅₀, VLf′₁₀₀, and VLf′₂₀₀.

Figure 6. The architecture of CHAG. Here, “I1” is the vision-language fusion feature, “I2” is the multi-sentence language embeddings, “conv” is the backbone convolutions, “NMS” is non-maximum suppression, “RoI Pool” is region-wise feature extraction, “H” is the network head, “B” is the bounding box, and “C” is classification, while “B0” is the initial selected region proposals.

Figure 7. FR-CHAGAVLF_PRE: Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion Pretrained. The weights transferred are represented by a dark color, while the weights that need to be learned are represented by a light color.

Figure 8. Attention maps of FR-RSVG. In subfigure (a) the target is the stadium, whereas in subfigure (b) the target is the track-and-field ground.

Figure 9. Large-object attention maps after the RPN and the vision-language backbone. Color intensity represents the attention weight distribution: red and yellow regions indicate higher attention scores and highlight the areas the model attends to strongly, while blue regions indicate lower attention scores and correspond to areas with little or no attention.

Figure 10. Small-object attention maps after the RPN and the vision-language backbone. Color intensity represents the attention weight distribution: red and yellow regions indicate higher attention scores and highlight the areas the model attends to strongly, while blue regions indicate lower attention scores and correspond to areas with little or no attention. The circles mark the objects to be detected.

Figure 11. Vision-language detection results. In subfigures (j–r), green boxes are the ground truth and red boxes are the predictions. (a) Phrase: The ground track field on the upper right. (b) Phrase: A small overpass. (c) Phrase: The ground track field on the lower left. (d) Phrase: The oval green and orange large ground track field. (e) Phrase: The green baseball field. (f) Phrase: A baseball field on the right. (g) Phrase: A textured gray frustum of a cone-shaped chimney is centered in the industrial zone, partially shadowed, with a metallic surface and surrounded by flat rooftops. (h) Phrase: The small gray bridge spans a narrow river in the lower left, its rectangular shape and smooth texture contrasting with the surrounding rugged terrain. (i) Phrase: A prominent gray circular storage tank with a smooth metallic texture is centered in the image, surrounded by paved ground and linear roadways. (j) Phrase (MSAM): A long dam. (k) Phrase (MSAM): The large baseball field on the upper left. (l) Phrase (MSAM): The windmill on the lower right. (m) Phrase (FR-CHAGAVLF_PRE): A long dam. (n) Phrase (FR-CHAGAVLF_PRE): The large baseball field on the upper left. (o) Phrase (FR-CHAGAVLF_PRE): The windmill on the lower right. (p) Phrase (LQVG): A long dam. (q) Phrase (LQVG): The large baseball field on the upper left. (r) Phrase (LQVG): The windmill on the lower right.

Figure 12. Failure result visualization. (a) Phrase: A vehicle is on the lower right of the gray basketball court. (b) Phrase: An expressway service area is below the cyan vehicle. (c) Phrase: A tennis court is on the lower left of the vehicle on the upper left. (d) Phrase: A vehicle is on the lower right of the gray basketball court. (e) Phrase: An expressway service area is below the cyan vehicle. (f) Phrase: A tennis court is on the lower left of the vehicle on the upper left.

Table 1. Components and differences of each model.

Model Abbreviation	Full Model Name	Core Features
FR-RSVG	Faster R-CNN for Visual Grounding in Remote Sensing	Simple fusion of visual and language features only.
FR-AVLF	Faster R-CNN with Adaptive Vision-Language Fusion	Layer-wise adaptive fusion of visual features from Swin Transformer with language features.
FR-AVLF_PRE	Faster R-CNN with Adaptive Vision-Language Fusion (Pretrained)	Transfers visual weights pretrained on the DIOR dataset to FR-AVLF.
FR-CHAGAVLF_PRE	Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion (Pretrained)	Builds on FR-AVLF_PRE by replacing AVLF with Multi-Level AVLF and adding the CHAG module.

Table 2. Experimental parameter settings.

Parameter	Value or Setting
GPU	NVIDIA RTX A6000 Ada, 48GB
CPU	Intel Xeon Gold 6530
Framework	PyTorch 1.13
Dataset Split Ratio	Training:Validation:Test = 7:1:2
Input Image Size	800 × 800
BERT Model	Pretrained public weights, frozen during training
Learning Rate Scheduler	OneCycle [56], max LR = 0.0001, min LR = 0.000001
Optimizer	Adam [57], β₂ = 0.99, β₁ adjusted via OneCycle (max β₁ = 0.9, min β₁ = 0.8)
Training Epochs	12
Batch Size	16
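
The schedule in Table 2 can be illustrated with a minimal pure-Python sketch of a one-cycle policy. Only the four endpoint values (max/min LR and max/min β₁) come from the table; the cosine shape, the 30% warm-up fraction, the choice to start from the minimum LR, and the function name `one_cycle` are assumptions made for illustration and are not confirmed by the paper.

```python
import math


def one_cycle(step, total_steps, lr_min=1e-6, lr_max=1e-4,
              beta1_min=0.8, beta1_max=0.9, warmup_frac=0.3):
    """Return (learning_rate, beta1) for one optimizer step.

    Endpoints follow Table 2; the cosine shape and warmup_frac are assumed.
    """
    if total_steps < 2 or not 0 <= step < total_steps:
        raise ValueError("need total_steps >= 2 and 0 <= step < total_steps")
    if not 0.0 < warmup_frac < 1.0:
        raise ValueError("warmup_frac must lie strictly between 0 and 1")

    last = total_steps - 1
    peak = warmup_frac * last  # position of the LR maximum
    if step <= peak:
        # Warm-up phase: weight rises from 0 to 1.
        t = step / peak
        weight = 0.5 * (1.0 - math.cos(math.pi * t))
    else:
        # Annealing phase: weight falls from 1 back to 0.
        t = (step - peak) / (last - peak)
        weight = 0.5 * (1.0 + math.cos(math.pi * t))

    lr = lr_min + weight * (lr_max - lr_min)
    # beta1 moves opposite to the learning rate: lowest when the LR peaks.
    beta1 = beta1_max - weight * (beta1_max - beta1_min)
    return lr, beta1


if __name__ == "__main__":
    total = 1201  # hypothetical number of optimizer steps
    for s in (0, 360, 1200):
        lr, b1 = one_cycle(s, total)
        print(f"step {s:4d}: lr={lr:.2e}, beta1={b1:.3f}")
```

In a PyTorch training loop the same behaviour would normally come from the built-in one-cycle scheduler combined with Adam (β₂ = 0.99) rather than from hand-written code; the sketch only makes the coupling between LR and β₁ explicit.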

Table 3. Results of FR-RSVG on DIOR-RSVG test set.

Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-T	BERT	56.63	52.40	46.76	36.64	16.47	49.41	64.03
Swin-S	BERT	62.77	59.21	53.91	44.64	24.31	55.31	67.68
Swin-B	BERT	63.67	60.31	55.08	46.82	28.22	56.71	69.52
Swin-L	BERT	64.12	60.61	55.77	47.63	29.74	57.48	70.71
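
The columns of Table 3 can be reproduced from predicted and ground-truth boxes with a short sketch. It assumes the usual RSVG definitions: Pr@X is the percentage of samples whose IoU exceeds X, meanIoU averages the per-sample IoU, and cumIoU divides the summed intersection area by the summed union area. The box format `(x1, y1, x2, y2)`, the strict `>` comparison, and all function names are illustrative assumptions, not the authors' evaluation code.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def grounding_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return Pr@X, meanIoU and cumIoU (all in percent) for paired boxes."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")

    ious, total_inter, total_union = [], 0.0, 0.0
    for pred, gt in zip(preds, gts):
        inter = intersection_area(pred, gt)
        union = box_area(pred) + box_area(gt) - inter
        ious.append(inter / union if union > 0 else 0.0)
        total_inter += inter
        total_union += union

    n = len(ious)
    metrics = {f"Pr@{t}": 100.0 * sum(iou > t for iou in ious) / n
               for t in thresholds}
    metrics["meanIoU"] = 100.0 * sum(ious) / n
    metrics["cumIoU"] = (100.0 * total_inter / total_union
                         if total_union > 0 else 0.0)
    return metrics


if __name__ == "__main__":
    # Two toy samples: a perfect match and a half-shifted box.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
    for name, value in grounding_metrics(preds, gts).items():
        print(f"{name}: {value:.2f}")
```

Because cumIoU pools areas over the whole test set, it is dominated by large objects, whereas meanIoU weights every sample equally; this is one way to read the different behaviour of the two columns across the tables.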

Table 4. Results of FR-AVLF on DIOR-RSVG test set. The number following the arrow (↑) indicates the improvement of FR-AVLF over FR-RSVG.

Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-T	BERT	${76.26}_{↑ 19.63}$	${74.34}_{↑ 21.94}$	${70.19}_{↑ 23.43}$	${60.03}_{↑ 23.39}$	${31.70}_{↑ 15.23}$	${66.50}_{↑ 17.09}$	${66.65}_{↑ 2.62}$
Swin-S	BERT	${77.43}_{↑ 14.66}$	${75.72}_{↑ 16.51}$	${72.24}_{↑ 18.33}$	${63.46}_{↑ 18.82}$	${38.50}_{↑ 14.19}$	${68.33}_{↑ 13.02}$	${69.98}_{↑ 2.30}$
Swin-B	BERT	${77.84}_{↑ 14.17}$	${76.20}_{↑ 15.89}$	${72.66}_{↑ 17.58}$	${63.94}_{↑ 17.12}$	${38.61}_{↑ 10.39}$	${68.68}_{↑ 11.97}$	${69.96}_{↑ 0.44}$
Swin-L	BERT	${78.07}_{↑ 13.95}$	${76.80}_{↑ 16.19}$	${73.72}_{↑ 17.95}$	${66.32}_{↑ 18.69}$	${43.43}_{↑ 13.69}$	${69.66}_{↑ 12.18}$	${70.98}_{↑ 0.27}$
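
The ↑ subscripts in Table 4 are plain differences between the FR-AVLF scores and the corresponding FR-RSVG scores in Table 3. The following sketch recomputes them for two backbones; the numbers are copied from the two tables, while the dictionary layout and the helper name `improvement` are illustrative assumptions.

```python
# Column order: Pr@0.5, Pr@0.6, Pr@0.7, Pr@0.8, Pr@0.9, meanIoU, cumIoU
FR_RSVG = {  # Table 3
    "Swin-T": [56.63, 52.40, 46.76, 36.64, 16.47, 49.41, 64.03],
    "Swin-L": [64.12, 60.61, 55.77, 47.63, 29.74, 57.48, 70.71],
}
FR_AVLF = {  # Table 4
    "Swin-T": [76.26, 74.34, 70.19, 60.03, 31.70, 66.50, 66.65],
    "Swin-L": [78.07, 76.80, 73.72, 66.32, 43.43, 69.66, 70.98],
}


def improvement(baseline, candidate):
    """Per-metric gain of candidate over baseline, rounded to two decimals."""
    return [round(new - old, 2) for old, new in zip(baseline, candidate)]


if __name__ == "__main__":
    for backbone in FR_RSVG:
        gains = improvement(FR_RSVG[backbone], FR_AVLF[backbone])
        print(backbone, gains)
```

The same helper can be pointed at any pair of result rows to check the subscripts reported in the later tables.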

Table 5. Results of FPN-Faster R-CNN on DIOR [54] test set.

Model	Backbone	Backbone Parameters	Backbone Pretrain Dataset	mAP
SCRDet++(FPN) [58]	ResNet-101	45M	-	73.2
SCRDet++(RetinaNet) [58]	ResNet-101	45M	-	75.1
FPNISP [59]	Swin-B	88M	-	74.7
FPN-RingMo [59]	Swin-B	88M	-	75.9
FPN-Faster R-CNN (Ours)	Swin-T	28M	ImageNet-1K	73.29
FPN-Faster R-CNN (Ours)	Swin-T	28M	ImageNet-22K	74.21
FPN-Faster R-CNN (Ours)	Swin-S	50M	ImageNet-22K	76.40
FPN-Faster R-CNN (Ours)	Swin-B	88M	ImageNet-22K	78.52
FPN-Faster R-CNN (Ours)	Swin-L	197M	ImageNet-22K	78.89

Table 6. Results of FR-AVLF_PRE on DIOR-RSVG test set. The number following the arrow (↑) indicates the improvement of FR-AVLF_PRE over FR-AVLF.

Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-T	BERT	${76.68}_{↑ 0.42}$	${75.32}_{↑ 0.98}$	${72.52}_{↑ 2.33}$	${65.77}_{↑ 5.74}$	${45.61}_{↑ 13.91}$	${68.84}_{↑ 2.34}$	${69.39}_{↑ 2.74}$
Swin-S	BERT	${78.08}_{↑ 0.65}$	${77.19}_{↑ 1.47}$	${74.88}_{↑ 2.64}$	${69.09}_{↑ 5.63}$	${51.33}_{↑ 12.83}$	${70.99}_{↑ 2.66}$	${72.58}_{↑ 2.60}$
Swin-B	BERT	${78.62}_{↑ 0.78}$	${77.62}_{↑ 1.42}$	${75.30}_{↑ 2.64}$	${69.42}_{↑ 5.48}$	${51.75}_{↑ 13.14}$	${71.37}_{↑ 2.69}$	${73.26}_{↑ 3.30}$
Swin-L	BERT	${78.39}_{↑ 0.32}$	${77.56}_{↑ 0.76}$	${75.58}_{↑ 1.86}$	${70.38}_{↑ 4.06}$	${53.99}_{↑ 10.56}$	${71.59}_{↑ 1.93}$	${73.43}_{↑ 2.45}$

Table 7. Test results of different language encoders on DIOR-RSVG. The number following the up arrow (↑) indicates the improvement of FR-AVLF_PRE over FR-AVLF. The number following the down arrow (↓) indicates the decrease.

Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
Swin-L	BERT	${78.39}_{↑ 0.32}$	${77.56}_{↑ 0.76}$	${75.58}_{↑ 1.86}$	${70.38}_{↑ 4.06}$	${53.99}_{↑ 10.56}$	${71.59}_{↑ 1.93}$	${73.43}_{↑ 2.45}$
Swin-L	RoBERTa	${79.82}_{↑ 1.75}$	${78.75}_{↑ 1.95}$	${76.17}_{↑ 2.45}$	${70.39}_{↑ 4.07}$	${54.28}_{↑ 10.85}$	${71.76}_{↑ 2.10}$	${73.59}_{↑ 2.61}$
Swin-L	Deepseek-1.5b	${80.94}_{↑ 2.87}$	${79.93}_{↑ 3.13}$	${75.67}_{↑ 1.95}$	${66.75}_{↑ 0.43}$	${44.65}_{↑ 1.22}$	${71.70}_{↑ 2.04}$	${74.83}_{↑ 3.85}$
Swin-L	Deepseek-7b	${80.29}_{↑ 2.22}$	${77.74}_{↑ 0.94}$	${73.09}_{↓ 0.63}$	${62.34}_{↓ 3.98}$	${36.95}_{↓ 6.48}$	${70.56}_{↑ 0.9}$	${74.68}_{↑ 3.70}$

Table 8. Ablation studies. The number following the arrow (↑) indicates the improvement of FR-CHAGAVLF_PRE over FR-AVLF.

Visual Encoder	Language Encoder	Multi-Level AVLF	CHAG	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9
Swin-L	Deepseek-1.5b	$\times$	$\times$	${80.94}_{↑ 2.87}$	${79.93}_{↑ 3.13}$	${75.67}_{↑ 1.95}$	${66.75}_{↑ 0.43}$	${44.65}_{↑ 1.22}$
Swin-L	Deepseek-1.5b	$\times$	$\checkmark$	${81.52}_{↑ 3.45}$	${80.77}_{↑ 3.97}$	${78.42}_{↑ 4.70}$	${72.52}_{↑ 6.20}$	${58.32}_{↑ 14.89}$
Swin-L	Deepseek-1.5b	$\checkmark$	$\times$	${82.31}_{↑ 4.24}$	${80.11}_{↑ 3.31}$	${76.70}_{↑ 2.98}$	${68.53}_{↑ 2.21}$	${46.22}_{↑ 2.9}$
Swin-L	Deepseek-1.5b	$\checkmark$	$\checkmark$	${82.12}_{↑ 4.05}$	${80.77}_{↑ 3.97}$	${78.34}_{↑ 4.62}$	${72.78}_{↑ 6.46}$	${59.42}_{↑ 15.99}$

Table 9. Various performance metrics of FR-CHAGAVLF_PRE.

Visual Encoder	Language Encoder	Multi-Level AVLF	CHAG	Backbone Params	FLOPs	FPS
Swin-L	Deepseek-1.5b	$\times$	$\times$	197 M	1551.70 GFLOPs	10.06 FPS
Swin-L	Deepseek-1.5b	$\times$	$\checkmark$	197 M	1607.22 GFLOPs	9.96 FPS
Swin-L	Deepseek-1.5b	$\checkmark$	$\times$	197 M	1550.94 GFLOPs	9.73 FPS
Swin-L	Deepseek-1.5b	$\checkmark$	$\checkmark$	197 M	1606.48 GFLOPs	9.23 FPS
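
To put the Table 9 figures into perspective, the sketch below converts FPS into per-image latency and expresses the compute cost of each configuration relative to the plain baseline (no Multi-Level AVLF, no CHAG). The values are copied from the table; the data layout and helper names are illustrative assumptions, and the latency is simply the reciprocal of the reported FPS.

```python
# (Multi-Level AVLF, CHAG) -> (GFLOPs, FPS), values from Table 9
TABLE_9 = {
    (False, False): (1551.70, 10.06),
    (False, True): (1607.22, 9.96),
    (True, False): (1550.94, 9.73),
    (True, True): (1606.48, 9.23),
}


def latency_ms(fps):
    """Average time per image in milliseconds for a given throughput."""
    return 1000.0 / fps


def relative_change(baseline, value):
    """Percentage change of value with respect to baseline."""
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    base_flops, base_fps = TABLE_9[(False, False)]
    for (ml_avlf, chag), (gflops, fps) in TABLE_9.items():
        print(
            f"Multi-Level AVLF={ml_avlf!s:5} CHAG={chag!s:5} "
            f"latency={latency_ms(fps):7.2f} ms "
            f"FLOPs change={relative_change(base_flops, gflops):+.2f}% "
            f"FPS change={relative_change(base_fps, fps):+.2f}%"
        )
```

Read this way, the full model costs a few percent more compute and a few milliseconds more per image than the baseline configuration.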

Table 10. Comparison with other advanced research results on DIOR-RSVG test set. The best results are bolded, and the second-best results are underlined.

Methods	Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
One-stage:
ZSGNet [23,31]	ResNet-50	BiLSTM	51.67	48.13	42.3	32.41	10.15	44.12	51.65
FAOA-no Spatial [24,31]	DarkNet-53	BERT	63.63	61.20	56.92	50.15	38.83	57.53	62.66
FAOA [24,31]	DarkNet-53	LSTM	70.86	67.37	62.04	53.19	36.44	62.86	67.28
ReSC [25,31]	DarkNet-53	BERT	72.71	68.92	63.01	53.70	33.37	64.24	68.10
LBYL-Net [26,31]	DarkNet-53	BERT	73.78	69.22	65.56	47.89	15.69	65.92	76.37
Transformer-based:
TransVG [29,31]	ResNet-50	BERT	72.41	67.38	60.05	49.1	27.84	63.56	76.27
VLTVG [30,31]	ResNet-101	BERT	75.97	72.22	66.33	55.17	33.11	66.32	77.85
MGVLF [31]	ResNet-50	BERT	76.78	72.68	66.74	56.42	35.07	68.04	78.41
MSAM [44]	DarkNet	BERT	74.23	69.01	61.32	49.04	24.26	64.88	77.13
LQVG [60]	ResNet-50	BERT	83.41	81.03	75.91	65.52	43.53	74.02	82.22
Two-stage:
FR-AVLF_PRE (Ours)	Swin-B	BERT	78.62	77.62	75.30	69.42	51.75	71.37	73.26
FR-AVLF_PRE (Ours)	Swin-L	BERT	78.39	77.56	75.58	70.38	53.99	71.59	73.43
FR-CHAGAVLF_PRE (Ours)	Swin-L	Deepseek-1.5b	82.12	80.76	78.34	72.78	59.42	75.78	83.21

Table 11. Zero-shot performance of FR-CHAGAVLF_PRE on DIOR-RSVG-C.

Methods	Dataset	Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
MGVLF [31]	DIOR-RSVG ∩ DIOR-RSVG-C	Swin-L	Deepseek-1.5b	77.73	74.48	67.07	56.23	32.36	69.76	78.36
MGVLF [31]	DIOR-RSVG-C	Swin-L	Deepseek-1.5b	66.24	63.15	56.51	47.81	28.85	59.30	70.45
FR-CHAGAVLF_PRE (Ours)	DIOR-RSVG ∩ DIOR-RSVG-C	Swin-L	Deepseek-1.5b	84.96	84.02	81.94	76.60	63.36	78.68	85.46
FR-CHAGAVLF_PRE (Ours)	DIOR-RSVG-C	Swin-L	Deepseek-1.5b	72.29	71.26	69.22	65.24	53.48	66.90	78.11

Table 12. Zero-shot comparison across models on the six shared classes of OPT-RSVG.

Methods	Visual Encoder	Language Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	meanIoU	cumIoU
MGVLF [31]	ResNet-50	BERT	62.96	59.88	53.93	41.58	19.49	54.30	57.28
FR-CHAGAVLF_PRE (Ours)	Swin-L	Deepseek-1.5b	68.39	66.58	64.50	53.62	34.71	60.50	62.95

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, H.; Gao, T.; Li, Z.; Chen, Z.; Li, Q.; Miao, K.; Hou, B.; Jiao, L. Cascaded Hierarchical Attention with Adaptive Fusion for Visual Grounding in Remote Sensing. Remote Sens. 2025, 17, 2930. https://doi.org/10.3390/rs17172930

