Cascaded Hierarchical Attention with Adaptive Fusion for Visual Grounding in Remote Sensing
Abstract
1. Introduction
- (1) We propose an effective two-stage RSVG model named FR-RSVG. It is based on the Faster R-CNN framework: an RPN generates proposals, and the best-matching target is grounded via confidence ranking.
- (2) Building upon FR-RSVG, we propose FR-AVLF, which is equipped with a layered adaptive vision-language fusion module. Its visual features are derived through an adaptive fusion of deep and shallow visual encoders, guided by the input text, which refines the hierarchical and semantic feature representation and improves grounding accuracy for objects across diverse scales. Furthermore, based on FR-RSVG and FR-AVLF, we also propose FR-CHAGAVLF, which is equipped with a multi-level adaptive vision-language fusion module and a Cascaded Hierarchical Attention Grounding module.
- (3) To investigate the effectiveness of weight transfer from remote sensing object detection to RSVG, we conducted extensive experiments using weights pretrained on ImageNet-1K and ImageNet-22K. We employed various backbone networks (Swin-T, Swin-S, Swin-B, and Swin-L) to construct different model architectures. Additionally, we compare the performance of different language encoders, including BERT, RoBERTa, and DeepSeek. The proposed weight transfer model FR-CHAGAVLFPRE shows excellent grounding performance, with a Pr@0.9 of 59.42% on the DIOR-RSVG dataset, which indicates that weight transfer yields higher grounding accuracy than training directly on the RSVG dataset.
- (4) To validate the generalization performance of our model, we constructed the Complex-Description DIOR-RSVG (DIOR-RSVG-C) dataset based on the DIOR-RSVG dataset and conducted zero-shot inference using the FR-CHAGAVLFPRE model weights. To further verify the model’s generalization capability, we also performed zero-shot inference experiments on the categories shared between the DIOR-RSVG and OPT-RSVG datasets. The experimental results demonstrate that our model achieved excellent localization performance on both datasets, validating its cross-dataset generalization capability.
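The confidence-ranking step described in contribution (1) can be illustrated with a minimal sketch. It assumes the detector has already produced proposal boxes and one text-conditioned confidence score per proposal; the function name `ground_by_confidence` and the toy values are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def ground_by_confidence(boxes: np.ndarray, scores: np.ndarray):
    """Return the proposal with the highest confidence and its score.

    boxes:  [N, 4] array of [x1, y1, x2, y2] proposals (assumed layout).
    scores: [N] array of confidences for the referred object.
    """
    if boxes.shape[0] == 0:
        raise ValueError("no proposals to rank")
    best = int(np.argmax(scores))  # index of the top-ranked proposal
    return boxes[best], float(scores[best])

# Toy proposals standing in for RPN / detection-head output.
proposals = np.array([[10, 20, 50, 60], [30, 30, 90, 80], [5, 5, 25, 25]], dtype=float)
confidences = np.array([0.12, 0.87, 0.40])
box, score = ground_by_confidence(proposals, confidences)
```

Only one box is returned because visual grounding localizes a single referred object per expression, unlike generic detection, which keeps every box above a score threshold.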
2. Related Work
2.1. Visual Grounding on Natural Image
2.2. Visual Grounding for Remote Sensing
3. Materials and Methods
3.1. FR-RSVG: Faster R-CNN in Visual Grounding for Remote Sensing
3.2. FR-AVLF: Layered Adaptive Vision-Language Fusion in RSVG
3.3. FR-AVLFPRE: Transfer Remote Sensing Image Object Detection Model Weights to FR-AVLF
3.4. FR-CHAGAVLFPRE: Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion Pretrained
3.5. Loss
4. Results
4.1. Dataset
- (1) The baseline benchmark was DIOR-RSVG [31]. Built upon DIOR [54], this dataset comprises 17,402 high-resolution remote-sensing images and 38,320 concise captions aligned with 20 object categories. With an average caption length of 7.47 words and a vocabulary size of 100, it provides a controlled setting for evaluating model performance under limited linguistic complexity.
- (2) The linguistic augmentation benchmark was DIOR-RSVG-C. To investigate the robustness against complex semantics, we constructed an enhanced dataset by randomly selecting 5202 images and their 11,436 original captions from DIOR-RSVG. The augmentation process involved several key steps. First, we randomly sampled 20% of the original dataset to ensure diverse representation across all 20 categories. Then, each caption was systematically elaborated using the Qwen-Plus large language model through carefully designed prompt engineering. Our prompt strategy specifically instructs the model to (1) preserve the original spatial logic (e.g., “lower left”, “center”, or “upper right”) to maintain accurate visual grounding, (2) enrich descriptions with texture, color, shape, and background context relevant to remote sensing scenarios, (3) use natural and professional language suitable for remote sensing object localization, (4) avoid introducing contradictory information, and (5) limit expansions to within 20 English words to maintain practical usability. The multimodal prompt combines the original image (encoded as base64) with the textual description, enabling the vision-language model to generate contextually appropriate elaborations while preserving spatial accuracy. To ensure quality control, we implemented a multi-stage validation process: (1) automated filtering to remove responses that significantly deviated from the caption length constraints or contained obvious contradictions, (2) random sampling of 500 generated descriptions for manual review to assess semantic consistency and spatial accuracy, and (3) iterative prompt refinement based on the identified issues. Additionally, we employed consistency checks by comparing generated descriptions against the original annotations to ensure that no spatial information was lost or distorted. This process yielded captions with an average length of 20.52 words and a vocabulary of 1354 terms. DIOR-RSVG-C retains the 20-category label space while significantly elevating linguistic diversity, enabling systematic analysis of performance degradation under increased textual complexity and providing a more challenging benchmark for evaluating model robustness in real-world remote sensing applications.
- (3) Zero-shot cross-domain evaluation was carried out on OPT-RSVG [45]. To quantify out-of-domain generalization, we adopted the DIOR-RSVG-pretrained weights as a frozen feature extractor and performed zero-shot inference on OPT-RSVG. Focusing on the six semantic classes shared by both datasets (airplane, basketballcourt, ship, storagetank, tenniscourt, and vehicle), we constructed a sub-test set by uniformly and randomly sampling 20% of the image-text pairs per class (1092 pairs in total), thereby controlling sample bias and ensuring statistical significance. The sub-set accuracy served as a quantitative proxy for cross-domain consistency and out-of-domain robustness.
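The per-class sampling used to build the OPT-RSVG sub-test set in item (3) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the record layout (`category`, `id`), the rounding rule, the seed, and the helper name `sample_per_class` are all hypothetical.

```python
import random
from collections import defaultdict

def sample_per_class(pairs, fraction=0.2, seed=0):
    """Uniformly sample `fraction` of the image-text pairs within each class."""
    rng = random.Random(seed)            # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for pair in pairs:
        by_class[pair["category"]].append(pair)
    subset = []
    for category in sorted(by_class):    # deterministic class order
        items = by_class[category]
        k = max(1, round(len(items) * fraction))  # assumed rounding rule
        subset.extend(rng.sample(items, k))       # sampling without replacement
    return subset

# The six classes shared by DIOR-RSVG and OPT-RSVG, with toy records.
SHARED = ["airplane", "basketballcourt", "ship", "storagetank", "tenniscourt", "vehicle"]
toy_pairs = [{"category": c, "id": f"{c}-{i}"} for c in SHARED for i in range(50)]
subset = sample_per_class(toy_pairs)
```

Sampling within each class, rather than over the pooled pairs, keeps the class proportions of the subset equal to those of the full shared-category set.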
4.2. Evaluation Metrics
4.3. Implementation Details
4.3.1. Training for FR-RSVG and FR-AVLF
4.3.2. Pretraining for Visual Object Detection
4.4. FR-RSVG Results
4.5. FR-AVLF Results
4.6. FR-AVLFPRE Results
4.7. FR-CHAGAVLFPRE Results
4.8. Comparison with Other Advanced Research Results
4.9. Vision-Language Detection Results
4.10. Limitations and Future Work
5. Conclusions
- (1) We proposed a novel model named FR-RSVG. However, its grounding performance on the DIOR-RSVG dataset was not satisfactory. Analysis of the experimental results attributed this to the model's unbalanced recognition of large and small objects. To solve this problem, we proposed FR-AVLF, which extracts visual features through an adaptive combination of deep and shallow vision encoders, guided by the input text. As RSVG is fundamentally an expanded version of RS object detection, we applied the model pretrained on RS images to FR-AVLF. The results show that FR-AVLFPRE outperformed FR-AVLF, indicating a close connection between visual remote sensing object detection and vision-language methods. Future work could therefore consider not only the multimodal integration of vision and language but also the transfer from visual object detection to vision-language object detection.
- (2) We found that, in general, the more parameters the Transformer backbone has, the better the model performs. However, Swin-B and Swin-L produced similar results even though Swin-L has more than twice as many parameters as Swin-B, so Swin-L is not cost-effective given the training resources it requires. The lack of further improvement may be caused by factors such as an unsuitable training strategy for Swin-L or an insufficient amount of training data. In the future, we can explore a training strategy suitable for Swin-L.
- (3) We further proposed FR-CHAGAVLFPRE based on FR-AVLFPRE, whose detection results surpassed those of FR-AVLFPRE, indicating that our adaptive fusion module and Cascaded Hierarchical Attention Grounding module are effective. In the future, we can continue to explore more complex fusion strategies and more effective cascading mechanisms.
- (4) We conducted zero-shot inference experiments on DIOR-RSVG-C and on the categories shared between DIOR-RSVG and OPT-RSVG, using the FR-CHAGAVLFPRE model weights trained on DIOR-RSVG. The results demonstrate the model’s good robustness and generalization capability and provide strong support for cross-dataset transfer in remote sensing visual grounding tasks. In the future, we can further explore using model weights trained on one remote sensing dataset to achieve higher accuracy on another dataset through low-cost fine-tuning strategies, thereby enabling low-cost or even cost-free model deployment.
- (5) Additionally, we identified several failure cases and analyzed their underlying causes. These analyses highlight the current limitations of our model and guide future work aimed at addressing these issues to further improve detection performance and robustness.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
| Visual Encoder | Language Encoder | Category | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| Swin-L | Deepseek-1.5b | airplane | 88.63 | 89.29 | 87.50 | 81.55 | 57.74 | 79.44 | 71.76 |
| Swin-L | Deepseek-1.5b | airport | 90.55 | 89.34 | 83.61 | 73.77 | 38.52 | 79.53 | 82.50 |
| Swin-L | Deepseek-1.5b | baseballfield | 86.55 | 87.80 | 87.27 | 85.94 | 65.25 | 79.48 | 76.95 |
| Swin-L | Deepseek-1.5b | basketballcourt | 81.01 | 81.45 | 80.65 | 75.81 | 55.65 | 74.36 | 69.75 |
| Swin-L | Deepseek-1.5b | bridge | 73.38 | 68.01 | 61.03 | 46.32 | 14.34 | 61.59 | 62.46 |
| Swin-L | Deepseek-1.5b | chimney | 87.30 | 88.55 | 87.02 | 85.50 | 58.78 | 80.02 | 82.27 |
| Swin-L | Deepseek-1.5b | dam | 82.26 | 78.35 | 65.98 | 50.52 | 16.49 | 68.48 | 72.33 |
| Swin-L | Deepseek-1.5b | Expressway-Service-area | 85.20 | 84.52 | 77.42 | 67.74 | 23.87 | 73.63 | 72.67 |
| Swin-L | Deepseek-1.5b | Expressway-toll-station | 75.17 | 74.53 | 70.75 | 62.26 | 38.68 | 65.67 | 55.55 |
| Swin-L | Deepseek-1.5b | golffield | 83.29 | 82.47 | 79.38 | 76.29 | 42.27 | 75.37 | 77.74 |
| Swin-L | Deepseek-1.5b | groundtrackfield | 87.03 | 88.28 | 85.35 | 78.02 | 50.18 | 77.40 | 87.97 |
| Swin-L | Deepseek-1.5b | harbor | 52.75 | 52.00 | 46.00 | 28.00 | 10.00 | 45.63 | 40.22 |
| Swin-L | Deepseek-1.5b | overpass | 69.65 | 65.61 | 62.43 | 47.62 | 17.46 | 59.05 | 60.71 |
| Swin-L | Deepseek-1.5b | ship | 73.63 | 73.40 | 71.92 | 64.04 | 33.99 | 65.22 | 64.01 |
| Swin-L | Deepseek-1.5b | stadium | 93.04 | 94.29 | 93.33 | 81.90 | 46.67 | 81.17 | 85.76 |
| Swin-L | Deepseek-1.5b | storagetank | 83.90 | 83.90 | 83.90 | 83.05 | 70.34 | 77.11 | 62.77 |
| Swin-L | Deepseek-1.5b | tenniscourt | 75.44 | 76.69 | 75.46 | 74.85 | 57.67 | 69.59 | 53.53 |
| Swin-L | Deepseek-1.5b | trainstation | 82.42 | 78.57 | 64.29 | 45.92 | 14.29 | 68.21 | 65.21 |
| Swin-L | Deepseek-1.5b | vehicle | 73.01 | 72.84 | 68.32 | 52.33 | 19.94 | 61.46 | 34.06 |
| Swin-L | Deepseek-1.5b | windmill | 94.95 | 90.25 | 79.42 | 59.57 | 20.94 | 76.72 | 80.45 |

| Visual Encoder | Language Encoder | Category | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| Swin-L | Deepseek-1.5b | airplane | 87.50 | 86.90 | 86.90 | 85.12 | 74.40 | 82.31 | 75.76 |
| Swin-L | Deepseek-1.5b | airport | 95.08 | 94.26 | 91.80 | 86.89 | 70.49 | 88.62 | 90.84 |
| Swin-L | Deepseek-1.5b | baseballfield | 86.47 | 86.47 | 86.21 | 85.15 | 78.78 | 82.17 | 83.21 |
| Swin-L | Deepseek-1.5b | basketballcourt | 80.65 | 80.65 | 80.65 | 79.84 | 77.42 | 77.90 | 74.54 |
| Swin-L | Deepseek-1.5b | bridge | 73.16 | 68.75 | 63.24 | 54.04 | 38.60 | 64.57 | 75.51 |
| Swin-L | Deepseek-1.5b | chimney | 90.84 | 90.84 | 90.84 | 90.08 | 82.44 | 86.67 | 89.55 |
| Swin-L | Deepseek-1.5b | dam | 90.72 | 87.63 | 79.38 | 65.98 | 41.24 | 79.44 | 81.20 |
| Swin-L | Deepseek-1.5b | Expressway-Service-area | 81.29 | 80.65 | 80.00 | 76.13 | 59.35 | 76.89 | 77.10 |
| Swin-L | Deepseek-1.5b | Expressway-toll-station | 79.25 | 78.30 | 75.47 | 66.98 | 54.72 | 72.04 | 83.75 |
| Swin-L | Deepseek-1.5b | golffield | 82.47 | 82.47 | 81.44 | 76.29 | 71.13 | 79.03 | 82.61 |
| Swin-L | Deepseek-1.5b | groundtrackfield | 87.55 | 87.55 | 86.81 | 82.42 | 71.43 | 81.79 | 91.00 |
| Swin-L | Deepseek-1.5b | harbor | 58.00 | 58.00 | 52.00 | 48.00 | 36.00 | 52.94 | 45.99 |
| Swin-L | Deepseek-1.5b | overpass | 70.90 | 68.78 | 64.02 | 58.73 | 39.68 | 63.78 | 72.29 |
| Swin-L | Deepseek-1.5b | ship | 77.34 | 77.34 | 76.35 | 73.40 | 60.10 | 72.15 | 73.66 |
| Swin-L | Deepseek-1.5b | stadium | 94.29 | 94.29 | 93.33 | 91.43 | 78.10 | 88.11 | 93.99 |
| Swin-L | Deepseek-1.5b | storagetank | 83.90 | 83.90 | 83.90 | 83.05 | 81.36 | 80.56 | 70.88 |
| Swin-L | Deepseek-1.5b | tenniscourt | 79.14 | 79.14 | 79.14 | 77.91 | 75.46 | 76.22 | 65.00 |
| Swin-L | Deepseek-1.5b | trainstation | 83.67 | 77.55 | 67.35 | 58.16 | 38.78 | 74.40 | 71.51 |
| Swin-L | Deepseek-1.5b | vehicle | 74.12 | 72.98 | 71.15 | 61.39 | 41.02 | 65.75 | 59.42 |
| Swin-L | Deepseek-1.5b | windmill | 94.58 | 92.06 | 85.56 | 71.12 | 43.32 | 81.74 | 85.91 |

| Visual Encoder | Language Encoder | Category | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| Swin-L | Deepseek-1.5b | airplane | 87.50 | 86.31 | 85.71 | 82.14 | 62.50 | 80.05 | 76.96 |
| Swin-L | Deepseek-1.5b | airport | 91.80 | 86.89 | 85.25 | 76.23 | 47.54 | 83.38 | 84.79 |
| Swin-L | Deepseek-1.5b | baseballfield | 87.80 | 87.53 | 87.27 | 83.55 | 71.35 | 81.55 | 78.98 |
| Swin-L | Deepseek-1.5b | basketballcourt | 83.06 | 83.06 | 82.26 | 79.03 | 63.71 | 77.92 | 73.24 |
| Swin-L | Deepseek-1.5b | bridge | 75.74 | 69.49 | 63.24 | 52.21 | 28.68 | 64.90 | 69.01 |
| Swin-L | Deepseek-1.5b | chimney | 89.31 | 89.31 | 87.79 | 85.50 | 66.41 | 82.86 | 82.96 |
| Swin-L | Deepseek-1.5b | dam | 87.63 | 77.32 | 68.04 | 50.52 | 27.84 | 73.42 | 75.76 |
| Swin-L | Deepseek-1.5b | Expressway-Service-area | 85.81 | 85.16 | 82.58 | 74.84 | 43.87 | 77.21 | 77.13 |
| Swin-L | Deepseek-1.5b | Expressway-toll-station | 77.36 | 75.47 | 70.75 | 66.04 | 50.00 | 69.07 | 66.30 |
| Swin-L | Deepseek-1.5b | golffield | 85.57 | 82.47 | 78.35 | 75.26 | 51.55 | 77.45 | 80.52 |
| Swin-L | Deepseek-1.5b | groundtrackfield | 86.08 | 85.35 | 83.88 | 78.39 | 59.34 | 78.28 | 87.74 |
| Swin-L | Deepseek-1.5b | harbor | 62.00 | 54.00 | 48.00 | 42.00 | 14.00 | 54.61 | 39.50 |
| Swin-L | Deepseek-1.5b | overpass | 76.19 | 72.49 | 68.25 | 59.79 | 30.16 | 67.07 | 72.09 |
| Swin-L | Deepseek-1.5b | ship | 76.85 | 76.35 | 72.91 | 68.97 | 38.92 | 69.20 | 71.50 |
| Swin-L | Deepseek-1.5b | stadium | 94.29 | 94.29 | 90.48 | 80.95 | 50.48 | 83.12 | 87.63 |
| Swin-L | Deepseek-1.5b | storagetank | 86.44 | 86.44 | 85.59 | 83.90 | 76.27 | 80.99 | 78.17 |
| Swin-L | Deepseek-1.5b | tenniscourt | 80.37 | 80.37 | 79.75 | 79.14 | 68.10 | 75.51 | 73.38 |
| Swin-L | Deepseek-1.5b | trainstation | 82.65 | 71.43 | 64.29 | 51.02 | 33.67 | 70.24 | 63.72 |
| Swin-L | Deepseek-1.5b | vehicle | 73.41 | 71.57 | 68.60 | 56.01 | 30.27 | 63.32 | 60.21 |
| Swin-L | Deepseek-1.5b | windmill | 92.78 | 91.34 | 80.87 | 62.45 | 32.85 | 79.58 | 81.46 |

| Visual Encoder | Language Encoder | Category | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| Swin-L | Deepseek-1.5b | airplane | 86.90 | 86.31 | 85.12 | 82.74 | 72.62 | 81.33 | 78.98 |
| Swin-L | Deepseek-1.5b | airport | 95.08 | 93.44 | 89.34 | 86.07 | 72.95 | 89.02 | 91.17 |
| Swin-L | Deepseek-1.5b | baseballfield | 87.00 | 87.00 | 86.74 | 86.21 | 79.31 | 82.57 | 83.00 |
| Swin-L | Deepseek-1.5b | basketballcourt | 82.26 | 82.26 | 82.26 | 79.84 | 76.61 | 79.20 | 81.08 |
| Swin-L | Deepseek-1.5b | bridge | 75.74 | 70.22 | 63.97 | 58.46 | 40.44 | 67.63 | 80.69 |
| Swin-L | Deepseek-1.5b | chimney | 89.31 | 88.55 | 88.55 | 88.55 | 79.39 | 85.29 | 88.24 |
| Swin-L | Deepseek-1.5b | dam | 83.51 | 78.35 | 73.20 | 64.95 | 42.27 | 77.07 | 79.43 |
| Swin-L | Deepseek-1.5b | Expressway-Service-area | 81.94 | 81.94 | 80.65 | 78.71 | 61.94 | 78.04 | 76.20 |
| Swin-L | Deepseek-1.5b | Expressway-toll-station | 78.30 | 77.36 | 74.53 | 68.87 | 57.55 | 71.72 | 86.17 |
| Swin-L | Deepseek-1.5b | golffield | 86.60 | 85.57 | 82.47 | 77.32 | 67.01 | 81.11 | 84.20 |
| Swin-L | Deepseek-1.5b | groundtrackfield | 86.08 | 86.08 | 85.35 | 81.68 | 68.86 | 80.90 | 91.79 |
| Swin-L | Deepseek-1.5b | harbor | 60.00 | 60.00 | 54.00 | 48.00 | 38.00 | 56.67 | 53.24 |
| Swin-L | Deepseek-1.5b | overpass | 73.54 | 71.43 | 67.20 | 58.20 | 43.39 | 66.24 | 77.15 |
| Swin-L | Deepseek-1.5b | ship | 77.83 | 77.34 | 76.35 | 72.91 | 58.62 | 72.18 | 76.06 |
| Swin-L | Deepseek-1.5b | stadium | 91.43 | 91.43 | 89.52 | 87.62 | 76.19 | 85.58 | 94.09 |
| Swin-L | Deepseek-1.5b | storagetank | 85.59 | 85.59 | 85.59 | 83.90 | 83.90 | 82.18 | 81.77 |
| Swin-L | Deepseek-1.5b | tenniscourt | 80.98 | 80.98 | 80.98 | 79.75 | 77.91 | 78.31 | 79.30 |
| Swin-L | Deepseek-1.5b | trainstation | 88.78 | 77.55 | 70.41 | 63.27 | 45.92 | 76.32 | 74.76 |
| Swin-L | Deepseek-1.5b | vehicle | 73.27 | 72.84 | 71.15 | 61.67 | 44.27 | 65.90 | 66.96 |
| Swin-L | Deepseek-1.5b | windmill | 94.22 | 91.70 | 84.84 | 68.23 | 44.40 | 81.98 | 85.93 |

| Visual Encoder | Language Encoder | λ | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| Swin-L | BERT | 0.6 | 77.11 | 75.00 | 70.69 | 60.51 | 33.40 | 67.49 | 68.83 |
| Swin-L | BERT | 1.0 | 78.39 | 77.56 | 75.58 | 70.38 | 53.99 | 71.59 | 73.43 |
| Swin-L | BERT | 1.4 | 76.25 | 74.19 | 69.65 | 60.72 | 34.91 | 66.89 | 67.54 |

Algorithm A1: Multi-level adaptive vision-language fusion.

Input: visual features and a language feature, given per pyramid level together with the spatial resolution of the ith level. Output: vision-language features.

- Lines 01–05 (for i = 1 to 5): text feature projection to the visual feature space.
- Lines 06–20 (for i = 1 to 5): multi-head cross-modal attention over the flattened spatial dimensions, followed by a gating mechanism for selective fusion.
- Lines 21–30 (for i = 1 to 5): multi-head self-attention on the cross-modal features, with the spatial resolution preserved.
- Lines 31–38 (for i = 1 to 5): FiLM conditioning for fine-grained modulation.
- Lines 39–47 (for i = 1 to 5): remote sensing scale calibration.
- Lines 48–57: adaptive weighting fusion. A loop over i = 1 to 5 produces per-level terms whose dimensions change as [B, 2] × 5 → [B, 10]; a linear layer with ReLU maps [B, 10] × [10, 16] → [B, 16], and a second linear layer maps [B, 16] × [16, 5] → [B, 5].
- Lines 58–64: the spatial resolutions of each level and an adaptive target resolution are determined, followed by a loop over i = 1 to 5.
- Lines 65–73 (for i = 1 to 5): feature enhancement with global context integration.
- Line 74: return the vision-language features.

Notation: B: batch size; H: number of attention heads (8); γ, β: FiLM modulation parameters; ⊙: element-wise multiplication; ∑: summation operator; BN: BatchNorm; Conv2D: 2D convolution operation. The spatial resolutions of the feature maps at pyramid levels i ∈ {1, 2, 3, 4, 5} are, e.g., 200 × 200, 100 × 100, 50 × 50, 25 × 25, and 13 × 13. The remaining symbols denote the dimension per attention head, the spatial dimensions (height, width) at each pyramid level, and learnable weight matrices.
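The adaptive weighting step of Algorithm A1 states only its tensor shapes ([B, 2] × 5 → [B, 10] → [B, 16] → [B, 5]). The NumPy sketch below walks through those shapes under explicit assumptions: the two per-level statistics are taken to be the mean and the maximum, the weights are random stand-ins for learned parameters, and a softmax normalizes the five level weights. None of these choices is confirmed by the listing.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 2, 8                        # toy batch size and channel count (assumed)
sizes = [200, 100, 50, 25, 13]     # spatial resolutions named in the notation

# Toy per-level vision-language features of shape [B, C, H_i, W_i].
features = [rng.standard_normal((B, C, s, s)) for s in sizes]

def level_stats(x):
    # Two scalars per sample, giving [B, 2]; mean and max are assumptions.
    return np.stack([x.mean(axis=(1, 2, 3)), x.max(axis=(1, 2, 3))], axis=1)

stats = np.concatenate([level_stats(f) for f in features], axis=1)  # [B, 10]

# Random stand-ins for the two learned linear layers.
W1, b1 = rng.standard_normal((10, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, 5)) * 0.1, np.zeros(5)

hidden = np.maximum(stats @ W1 + b1, 0.0)   # linear + ReLU -> [B, 16]
logits = hidden @ W2 + b2                   # linear -> [B, 5]

# Softmax over the five levels (assumed normalisation), numerically stabilised.
shifted = logits - logits.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)  # [B, 5]
```

Each row of `weights` then provides one coefficient per pyramid level, which could scale the level features after they are resized to a common target resolution.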
Algorithm A2: Cascaded hierarchical attention grounding.

Input: a set of multi-scale visual features from the backbone network; a set of multi-scale language features aligned with the visual features; a sentence-level language embedding vector; and a set of ground-truth targets, each containing a box and a class label. Output: at inference, the final predicted bounding box and its confidence score; in training, the total loss for optimization.

Procedure CHAG:

- Lines 02–04, Hierarchical Attention Layer 1 (global semantic alignment): MultiLevelFusion fuses the language features into the multi-scale visual features via attention, yielding the fused multi-modal features.
- Lines 06–08, Hierarchical Attention Layer 2 (local feature enhancement in the RPN): RPN_with_Attention generates text-relevant initial proposals by applying attention to the objectness scores within the RPN head, and returns the loss calculated from the region proposal network.
- Lines 10–12, cascaded grounding stage: the proposals used as input for the current cascade stage are initialized, and the cumulative loss of the cascade head is set to 0.
- Lines 13–33, for each cascade stage k:
  - Line 14: the IoU threshold for the current stage is selected.
  - Lines 15–19: in training, SelectTrainingSamples returns the sampled (positive and negative) proposals, their ground-truth class labels, and their bounding box regression targets; at inference, no sampling is performed.
  - Lines 21–23: RoIPool extracts RoI features, the stage-specific head Head_k processes them, and Predictor_k outputs the predicted class logits and box regression deltas.
  - Lines 25–30: in training, CrossEntropyLoss gives the classification loss and SmoothL1Loss gives the bounding box regression loss; their sum is the total loss of stage k, which is added to the cumulative loss.
  - Line 32: Decode refines the proposals for the next stage from detached inputs and the predicted box deltas.
- Lines 35–38, post-processing and final output: PostProcess (including NMS and clipping) produces the final set of detections from the class logits and box deltas of the final cascade stage, and SelectBestDetection returns the best box and its score.
- Lines 40–45: in training, the total loss for the entire model is the sum of the RPN loss and the cumulative cascade loss, and it is returned; otherwise, the predicted box and its confidence score are returned.

Notation: total number of cascade stages = 3; IoU thresholds for the stages = {0.55, 0.65, 0.75}; loss weights for the stages = {1.0, 1.0, 1.0}.
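To make the effect of the stage-wise IoU thresholds in Algorithm A2 concrete, the sketch below counts how many proposals qualify as positives at each stage and combines per-stage losses with the listed weights. It is a simplified reading, not the authors' code: the `>=` comparison, the helper names `box_iou` and `total_loss`, the toy boxes and loss values, and the exact way the weights enter the sum are assumptions.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an [N, 4] array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

STAGE_IOU = (0.55, 0.65, 0.75)     # thresholds given in the notation
STAGE_WEIGHT = (1.0, 1.0, 1.0)     # loss weights given in the notation

gt = np.array([0.0, 0.0, 100.0, 100.0])
proposals = np.array([
    [0.0, 0.0, 100.0, 60.0],        # IoU 0.60 with the ground truth
    [0.0, 0.0, 100.0, 70.0],        # IoU 0.70
    [0.0, 0.0, 100.0, 80.0],        # IoU 0.80
    [200.0, 200.0, 300.0, 300.0],   # no overlap
])
ious = box_iou(gt, proposals)

# A proposal is treated as positive at stage k if its IoU reaches that stage's threshold.
positives_per_stage = [int((ious >= t).sum()) for t in STAGE_IOU]

def total_loss(rpn_loss, stage_losses):
    """RPN loss plus the weighted sum of (classification + regression) per stage."""
    cascade = sum(w * (cls + reg) for w, (cls, reg) in zip(STAGE_WEIGHT, stage_losses))
    return rpn_loss + cascade

loss = total_loss(0.5, [(1.0, 0.5), (0.8, 0.4), (0.6, 0.3)])
```

The shrinking positive count from stage to stage shows why later stages see a stricter training signal: only proposals already close to the target remain positives, which pushes the later heads toward high-IoU localization.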
| IoU Threshold of CHAG | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|
| No-CHAG | 78.39 | 77.56 | 75.58 | 70.38 | 53.99 | 71.59 | 73.43 |
| IoU = {0.50,0.60,0.70} | 78.96 | 77.83 | 75.63 | 70.55 | 54.36 | 72.26 | 73.62 |
| IoU = {0.55,0.65,0.75} | 78.63 | 77.65 | 76.19 | 71.22 | 57.16 | 73.43 | 75.48 |
| IoU = {0.60,0.70,0.80} | 77.08 | 75.14 | 71.67 | 63.10 | 43.30 | 69.39 | 76.80 |
| IoU = {0.65,0.75,0.85} | 74.82 | 73.07 | 70.04 | 63.07 | 47.83 | 67.86 | 68.17 |
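The columns of the appendix tables (Pr@0.5 to Pr@0.9, meanIoU, cumIoU) can be computed as in the sketch below. It follows the definitions commonly used in the RSVG literature, which are assumed to match this paper's usage: Pr@X is the percentage of samples whose IoU exceeds X, meanIoU averages the per-sample IoU, and cumIoU divides the summed intersection area by the summed union area. The function names and toy boxes are hypothetical.

```python
import numpy as np

def intersection_and_union(pred, gt):
    """Per-sample intersection and union areas for paired [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 2], gt[:, 2])
    y2 = np.minimum(pred[:, 3], gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter, area_pred + area_gt - inter

def grounding_metrics(pred, gt, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    inter, union = intersection_and_union(pred, gt)
    iou = inter / union
    # Pr@X: share of samples with IoU above the threshold, in percent.
    metrics = {f"Pr@{t}": 100.0 * float((iou > t).mean()) for t in thresholds}
    metrics["meanIoU"] = 100.0 * float(iou.mean())                 # average of per-sample IoU
    metrics["cumIoU"] = 100.0 * float(inter.sum() / union.sum())   # area-weighted IoU
    return metrics

# One large object grounded perfectly, one small object grounded poorly.
gt = np.array([[0, 0, 100, 100], [0, 0, 10, 10]], dtype=float)
pred = np.array([[0, 0, 100, 100], [0, 0, 10, 4]], dtype=float)
metrics = grounding_metrics(pred, gt)
```

In this toy case meanIoU is 70 while cumIoU is above 99, because cumIoU is dominated by large objects. The same effect explains why the two columns can diverge strongly for categories of small objects in the per-category tables.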
References
- Simantiris, G.; Panagiotakis, C. Unsupervised Color-Based Flood Segmentation in UAV Imagery. Remote Sens. 2024, 16, 2126.
- Senanayake, I.P.; Pathira Arachchilage, K.R.L.; Yeo, I.-Y.; Khaki, M.; Han, S.-C.; Dahlhaus, P.G. Spatial Downscaling of Satellite-Based Soil Moisture Products Using Machine Learning Techniques: A Review. Remote Sens. 2024, 16, 2067.
- Lei, X.; Jiang, J.; Deng, Z.; Wu, D.; Wang, F.; Lai, C.; Wang, Z.; Chen, X. An Ensemble Machine Learning Model to Estimate Urban Water Quality Parameters Using Unmanned Aerial Vehicle Multispectral Imagery. Remote Sens. 2024, 16, 2246.
- Cao, S.; Li, Z.; Deng, J.; Huang, Y.; Peng, Z. TFCD-Net: Target and False Alarm Collaborative Detection Network for Infrared Imagery. Remote Sens. 2024, 16, 1758.
- Ali, T.A.; Bazi, Y.; Al Rahhal, M.M.; Mekhalfi, M.L.; Rangarajan, L.; Zuair, M.A. TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images. Remote Sens. 2020, 12, 405.
- Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297.
- Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z.X. Remote Sensing Image Change Captioning with Dual-Branch Transformers: A New Method and a Large Scale Dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5633520.
- Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Li, X. RSGPT: A Remote Sensing Vision Language Model and Benchmark. arXiv 2023, arXiv:2307.15266.
- Wei, T.; Yuan, W.; Luo, J.; Zhang, W.; Lu, L. VLCA: Vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning. J. Syst. Eng. Electron. 2023, 34, 9–18.
- Bejiga, M.B.; Melgani, F.; Vascotto, A. Retro-Remote Sensing: Generating Images from Ancient Texts. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 950–960.
- Xu, Y.; Yu, W.; Ghamisi, P.; Kopp, M.; Hochreiter, S. Txt2Img-MHN: Remote Sensing Image Generation from Text Using Modern Hopfield Networks. IEEE Trans. Image Process. 2022, 32, 5737–5750.
- Li, A.; Lu, Z.; Wang, L.; Xiang, T.; Wen, J. Zero-Shot Scene Classification for High Spatial Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4157–4167.
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
- Jiang, X.; Zhou, N.; Li, X. Few-Shot Segmentation of Remote Sensing Images Using Deep Metric Learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6507405.
- Zhang, S.; Song, F.; Liu, X.; Hao, X.; Liu, Y.; Lei, T.; Jiang, P. Text Semantic Fusion Relation Graph Reasoning for Few-Shot Object Detection on Remote Sensing Images. Remote Sens. 2023, 15, 1187.
- Lu, X.; Sun, X.; Diao, W.; Mao, Y.; Li, J.; Zhang, Y.; Wang, P.; Fu, K. Few-Shot Object Detection in Aerial Imagery Guided by Text-Modal Knowledge. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604719.
- Liu, G.; He, J.; Li, P.; Zhong, S.; Li, H.; He, G. Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering. Remote Sens. 2023, 15, 4682.
- Yuan, Z.; Mou, L.; Wang, Q.; Zhu, X.X. From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623111.
- Yu, Z.; Yu, J.; Xiang, C.; Zhao, Z.; Tian, Q.; Tao, D. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 1114–1120.
- Chen, J.; Hong, H.; Song, B.; Guo, J.; Chen, C.; Xu, J. MDCT: Multi-Kernel Dilated Convolution and Transformer for One-Stage Object Detection of Remote Sensing Images. Remote Sens. 2023, 15, 371.
- Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984.
- Shiu, Y.-S.; Lee, R.-Y.; Chang, Y.-C. Pineapples’ Detection and Segmentation Based on Faster and Mask R-CNN in UAV Imagery. Remote Sens. 2023, 15, 814.
- Sadhu, A.; Chen, K.; Nevatia, R. Zero-Shot Grounding of Objects from Natural Language Queries. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4693–4702.
- Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4682–4692.
- Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-stage Visual Grounding by Recursive Sub-query Construction. In Proceedings of the Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 387–404.
- Huang, B.; Lian, D.; Luo, W.; Gao, S. Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 16888–16897. [Google Scholar]
- Lu, X.; Zhang, Y.; Yuan, Y.; Feng, Y. Gated and Axis-Concentrated Localization Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2020, 58, 179–192. [Google Scholar] [CrossRef]
- Yang, S.; Li, G.; Yu, Y. Graph-structured referring expression reasoning in the wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9952–9961. [Google Scholar]
- Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1749–1759. [Google Scholar]
- Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 9499–9508. [Google Scholar]
- Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604513. [Google Scholar] [CrossRef]
- Zhang, H.; Niu, Y.; Chang, S. Grounding Referring Expressions in Images by Variational Context. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4158–4166. [Google Scholar]
- Yang, S.; Li, G.; Yu, Y. Dynamic Graph Attention for Referring Expression Comprehension. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4643–4652. [Google Scholar]
- Liu, D.; Zhang, H.; Zha, Z.; Wu, F. Learning to Assemble Neural Module Tree Networks for Visual Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4672–4681. [Google Scholar]
- Liu, X.; Wang, Z.; Shao, J.; Wang, X.; Li, H. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1950–1959. [Google Scholar]
- Wang, P.; Wu, Q.; Cao, J.; Shen, C.; Gao, L.; van den Hengel, A. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1960–1968. [Google Scholar]
- Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to compose and reason with language tree structures for visual grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 684–696. [Google Scholar] [CrossRef]
- Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based language-image pretraining. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16793–16803. [Google Scholar]
- Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10880–10889. [Google Scholar]
- Liao, Y.; Zhang, A.; Chen, Z.; Hui, T.; Liu, S. Progressive language customized visual feature learning for one-stage visual grounding. IEEE Trans. Image Process. 2022, 31, 4266–4277. [Google Scholar] [CrossRef] [PubMed]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 17. Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-end visual grounding with language conditioned vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652. [Google Scholar] [CrossRef]
- Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual Grounding in Remote Sensing Images. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; Volume 9, pp. 404–412. [Google Scholar]
- Wang, F.; Wu, C.; Wu, J.; Wang, L.; Li, C. Multistage Synergistic Aggregation Network for Remote Sensing Visual Grounding. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6007605. [Google Scholar] [CrossRef]
- Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5631413. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
- Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Rezatofighi, S.H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.D.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- Wu, C.; Lin, Z.; Cohen, S.D.; Bui, T.; Maji, S. PhraseCut: Language-Based Image Segmentation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10213–10222. [Google Scholar]
- Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Defense + Commercial Sensing; SPIE: Bellingham, WA, USA, 2019; Volume 11006, p. 1100612. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Yang, X.; Yan, J.; Yang, X.; Tang, J.; Liao, W.; He, T. SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2384–2399. [Google Scholar] [CrossRef] [PubMed]
- Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5612822. [Google Scholar] [CrossRef]
- Lan, M.; Rong, F.; Jiao, H.; Gao, Z.; Zhang, L. Language Query-Based Transformer with Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5626513. [Google Scholar] [CrossRef]

| Model Abbreviation | Full Model Name | Core Features |
|---|---|---|
| FR-RSVG | Faster R-CNN for Visual Grounding in Remote Sensing | Simple fusion of visual and language features only. |
| FR-AVLF | Faster R-CNN with Adaptive Vision-Language Fusion | Layer-wise adaptive fusion of visual features from Swin Transformer with language features. |
| FR-AVLFPRE | Faster R-CNN with Adaptive Vision-Language Fusion (Pretrained) | Transfers visual weights pretrained on the DIOR dataset to FR-AVLF. |
| FR-CHAGAVLFPRE | Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion (Pretrained) | Builds on FR-AVLFPRE by replacing AVLF with Multi-Level AVLF and adding the CHAG module. |

| Parameter | Value or Setting |
|---|---|
| GPU | NVIDIA RTX A6000 Ada, 48 GB |
| CPU | Intel Xeon Gold 6530 |
| Framework | PyTorch 1.13 |
| Dataset Split Ratio | Training:Validation:Test = 7:1:2 |
| Input Image Size | 800 × 800 |
| BERT Model | Pretrained public weights, frozen during training |
| Learning Rate Scheduler | OneCycle [56], max LR = 0.0001, min LR = 0.000001 |
| Optimizer | Adam [57], β2 = 0.99, β1 adjusted via OneCycle (max β1 = 0.9, min β1 = 0.8) |
| Training Epochs | 12 |
| Batch Size | 16 |
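
The OneCycle settings in the table above can be illustrated with a small, dependency-free sketch. It is not the authors' training code: the cosine shape, the 30% warm-up fraction, and the choice to start and end at the minimum learning rate are assumptions made here for illustration. Only the learning-rate bounds (0.0001 and 0.000001) and the β1 bounds (0.9 and 0.8) come from the table, and `one_cycle` and `cosine_anneal` are hypothetical helper names. β2 stays fixed at 0.99 and is therefore not scheduled.

```python
import math


def cosine_anneal(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))


def one_cycle(step: int, total_steps: int,
              max_lr: float = 1e-4, min_lr: float = 1e-6,
              max_beta1: float = 0.9, min_beta1: float = 0.8,
              warmup_frac: float = 0.3):
    """Return (learning rate, beta1) for a given optimizer step.

    Assumed shape: the learning rate rises from min_lr to max_lr during
    the warm-up phase and then decays back to min_lr, while beta1 moves
    in the opposite direction (high when the learning rate is low).
    """
    if total_steps < 3:
        raise ValueError("total_steps must be at least 3")
    last = total_steps - 1
    # Step index at which the learning rate peaks (assumed 30% of the run).
    peak = min(max(1, round(warmup_frac * last)), last - 1)
    if step <= peak:
        pct = step / peak
        lr = cosine_anneal(min_lr, max_lr, pct)
        beta1 = cosine_anneal(max_beta1, min_beta1, pct)
    else:
        pct = (step - peak) / (last - peak)
        lr = cosine_anneal(max_lr, min_lr, pct)
        beta1 = cosine_anneal(min_beta1, max_beta1, pct)
    return lr, beta1


if __name__ == "__main__":
    total = 101  # illustrative step count, not taken from the paper
    for s in (0, 15, 30, 65, 100):
        lr, b1 = one_cycle(s, total)
        print(f"step {s:3d}: lr={lr:.2e}, beta1={b1:.3f}")
```

In an actual run, `total_steps` would be the 12 training epochs multiplied by the number of batches per epoch, which depends on the size of the training split.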

| Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|
| Swin-T | BERT | 56.63 | 52.40 | 46.76 | 36.64 | 16.47 | 49.41 | 64.03 |
| Swin-S | BERT | 62.77 | 59.21 | 53.91 | 44.64 | 24.31 | 55.31 | 67.68 |
| Swin-B | BERT | 63.67 | 60.31 | 55.08 | 46.82 | 28.22 | 56.71 | 69.52 |
| Swin-L | BERT | 64.12 | 60.61 | 55.77 | 47.63 | 29.74 | 57.48 | 70.71 |
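
The metrics reported in these tables can be sketched in a few lines of Python, assuming the definitions commonly used in the RSVG literature: Pr@X is the percentage of samples whose predicted box reaches an IoU of at least X with the ground truth, meanIoU averages the per-sample IoU, and cumIoU divides the total intersection area by the total union area over the whole test set. The function names, the `(x1, y1, x2, y2)` box format, and the non-strict `>=` comparison are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def intersection_and_union(pred: Box, gt: Box) -> Tuple[float, float]:
    """Return the intersection and union areas of two axis-aligned boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter, area_pred + area_gt - inter


def grounding_metrics(preds: Sequence[Box], gts: Sequence[Box],
                      thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)) -> Dict[str, float]:
    """Compute Pr@X, meanIoU and cumIoU (all in percent) for paired boxes."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    pairs = [intersection_and_union(p, g) for p, g in zip(preds, gts)]
    ious = [i / u if u > 0 else 0.0 for i, u in pairs]
    n = len(ious)
    metrics = {f"Pr@{t}": 100.0 * sum(iou >= t for iou in ious) / n
               for t in thresholds}
    # meanIoU: every sample counts equally, regardless of object size.
    metrics["meanIoU"] = 100.0 * sum(ious) / n
    # cumIoU: pooled areas, so large objects weigh more than small ones.
    total_union = sum(u for _, u in pairs)
    metrics["cumIoU"] = (100.0 * sum(i for i, _ in pairs) / total_union
                         if total_union > 0 else 0.0)
    return metrics


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
    print(grounding_metrics(preds, gts))
```

Because cumIoU pools areas across samples, it is dominated by large objects, whereas meanIoU and Pr@X treat every sample equally. This is one reason the two IoU columns can rank methods differently.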

| Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|
| Swin-T | BERT | | | | | | | |
| Swin-S | BERT | | | | | | | |
| Swin-B | BERT | | | | | | | |
| Swin-L | BERT | | | | | | | |

| Model | Backbone | Backbone Parameters | Backbone Pretrain Dataset | mAP |
|---|---|---|---|---|
| SCRDet++(FPN) [58] | ResNet-101 | 45M | - | 73.2 |
| SCRDet++(RetinaNet) [58] | ResNet-101 | 45M | - | 75.1 |
| FPNISP [59] | Swin-B | 88M | - | 74.7 |
| FPN-RingMo [59] | Swin-B | 88M | - | 75.9 |
| FPN-Faster R-CNN (Ours) | Swin-T | 28M | ImageNet-1K | 73.29 |
| FPN-Faster R-CNN (Ours) | Swin-T | 28M | ImageNet-22K | 74.21 |
| FPN-Faster R-CNN (Ours) | Swin-S | 50M | ImageNet-22K | 76.40 |
| FPN-Faster R-CNN (Ours) | Swin-B | 88M | ImageNet-22K | 78.52 |
| FPN-Faster R-CNN (Ours) | Swin-L | 197M | ImageNet-22K | 78.89 |

| Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|
| Swin-T | BERT | | | | | | | |
| Swin-S | BERT | | | | | | | |
| Swin-B | BERT | | | | | | | |
| Swin-L | BERT | | | | | | | |

| Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|
| Swin-L | BERT | | | | | | | |
| Swin-L | RoBERTa | | | | | | | |
| Swin-L | Deepseek-1.5b | | | | | | | |
| Swin-L | Deepseek-7b | | | | | | | |

| Visual Encoder | Language Encoder | Multi-Level AVLF | CHAG | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 |
|---|---|---|---|---|---|---|---|---|
| Swin-L | Deepseek-1.5b | | | | | | | |
| Swin-L | Deepseek-1.5b | | | | | | | |
| Swin-L | Deepseek-1.5b | | | | | | | |
| Swin-L | Deepseek-1.5b | | | | | | | |

| Visual Encoder | Language Encoder | Multi-Level AVLF | CHAG | Backbone Params | FLOPs | FPS |
|---|---|---|---|---|---|---|
| Swin-L | Deepseek-1.5b | | | 197 M | 1551.70 GFLOPs | 10.06 FPS |
| Swin-L | Deepseek-1.5b | | | 197 M | 1607.22 GFLOPs | 9.96 FPS |
| Swin-L | Deepseek-1.5b | | | 197 M | 1550.94 GFLOPs | 9.73 FPS |
| Swin-L | Deepseek-1.5b | | | 197 M | 1606.48 GFLOPs | 9.23 FPS |

| Methods | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| One-stage: | | | | | | | | | |
| ZSGNet [23,31] | ResNet-50 | BiLSTM | 51.67 | 48.13 | 42.3 | 32.41 | 10.15 | 44.12 | 51.65 |
| FAOA-no Spatial [24,31] | DarkNet-53 | BERT | 63.63 | 61.20 | 56.92 | 50.15 | 38.83 | 57.53 | 62.66 |
| FAOA [24,31] | DarkNet-53 | LSTM | 70.86 | 67.37 | 62.04 | 53.19 | 36.44 | 62.86 | 67.28 |
| ReSC [25,31] | DarkNet-53 | BERT | 72.71 | 68.92 | 63.01 | 53.70 | 33.37 | 64.24 | 68.10 |
| LBYL-Net [26,31] | DarkNet-53 | BERT | 73.78 | 69.22 | 65.56 | 47.89 | 15.69 | 65.92 | 76.37 |
| Transformer-based: | | | | | | | | | |
| TransVG [29,31] | ResNet-50 | BERT | 72.41 | 67.38 | 60.05 | 49.1 | 27.84 | 63.56 | 76.27 |
| VLTVG [30,31] | ResNet-101 | BERT | 75.97 | 72.22 | 66.33 | 55.17 | 33.11 | 66.32 | 77.85 |
| MGVLF [31] | ResNet-50 | BERT | 76.78 | 72.68 | 66.74 | 56.42 | 35.07 | 68.04 | 78.41 |
| MSAM [44] | DarkNet | BERT | 74.23 | 69.01 | 61.32 | 49.04 | 24.26 | 64.88 | 77.13 |
| LQVG [60] | ResNet-50 | BERT | 83.41 | 81.03 | 75.91 | 65.52 | 43.53 | 74.02 | 82.22 |
| Two-stage: | | | | | | | | | |
| FR-AVLFPRE (Ours) | Swin-B | BERT | 78.62 | 77.62 | 75.30 | 69.42 | 51.75 | 71.37 | 73.26 |
| FR-AVLFPRE (Ours) | Swin-L | BERT | 78.39 | 77.56 | 75.58 | 70.38 | 53.99 | 71.59 | 73.43 |
| FR-CHAGAVLFPRE (Ours) | Swin-L | Deepseek-1.5b | 82.12 | 80.76 | 78.34 | 72.78 | 59.42 | 75.78 | 83.21 |

| Methods | Dataset | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| MGVLF [31] | DIOR-RSVG ∩ DIOR-RSVG-C | Swin-L | Deepseek-1.5b | 77.73 | 74.48 | 67.07 | 56.23 | 32.36 | 69.76 | 78.36 |
| MGVLF [31] | DIOR-RSVG-C | Swin-L | Deepseek-1.5b | 66.24 | 63.15 | 56.51 | 47.81 | 28.85 | 59.30 | 70.45 |
| FR-CHAGAVLFPRE (Ours) | DIOR-RSVG ∩ DIOR-RSVG-C | Swin-L | Deepseek-1.5b | 84.96 | 84.02 | 81.94 | 76.60 | 63.36 | 78.68 | 85.46 |
| FR-CHAGAVLFPRE (Ours) | DIOR-RSVG-C | Swin-L | Deepseek-1.5b | 72.29 | 71.26 | 69.22 | 65.24 | 53.48 | 66.90 | 78.11 |

| Methods | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---|---|---|---|---|---|---|---|---|---|
| MGVLF [31] | ResNet-50 | BERT | 62.96 | 59.88 | 53.93 | 41.58 | 19.49 | 54.30 | 57.28 |
| FR-CHAGAVLFPRE (Ours) | Swin-L | Deepseek-1.5b | 68.39 | 66.58 | 64.50 | 53.62 | 34.71 | 60.50 | 62.95 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhu, H.; Gao, T.; Li, Z.; Chen, Z.; Li, Q.; Miao, K.; Hou, B.; Jiao, L. Cascaded Hierarchical Attention with Adaptive Fusion for Visual Grounding in Remote Sensing. Remote Sens. 2025, 17, 2930. https://doi.org/10.3390/rs17172930

