TVI-MFAN: A Text–Visual Interaction Multilevel Feature Alignment Network for Visual Grounding in Remote Sensing
Abstract
1. Introduction
2. Related Works
3. Materials and Methods
3.1. Overview
3.1.1. Visual Feature Extraction
3.1.2. Textual Feature Extraction
3.2. Text–Visual Interaction Attention
3.3. Multi-Level Feature Alignment Network
Algorithm 1 TVI-MFAN

Require: Linguistic expression, remote sensing image, learnable token, ground-truth Bbox.
Ensure: Predicted Bbox.
1: Initialize all weights.
2: for epoch < epochs do
3:  Execute the linguistic backbone to extract linguistic features.
4:  Execute the TVIA to dynamically generate adaptive weights that guide the visual backbone in extracting visual features by Equations (1)–(4).
5:  Execute the MFAN to enhance the uniqueness of the object and obtain the refined multimodal features for localization by Equations (5)–(7).
6:  Feed the refined multimodal features into the localization module of the TVI-MFAN framework to predict the coordinates of the Bbox.
7:  Compute the loss between the predicted Bbox and the ground-truth Bbox.
8: end for
9: Obtain the predicted Bbox.
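To make the data flow of Algorithm 1 concrete, the sketch below mirrors steps 1–9 in PyTorch. It is a minimal illustration, not the authors' implementation: the LinguisticBackbone, TVIA, VisualBackbone, MFAN, and Localizer classes are placeholder stubs with assumed dimensions, whereas the real modules follow Equations (1)–(7) and the BERT/ResNet-50 backbones described in Section 3.

```python
# Minimal PyTorch sketch of the training loop in Algorithm 1.
# All module internals below are illustrative stubs with assumed dimensions,
# not the authors' TVIA/MFAN implementations.
import torch
import torch.nn as nn


class LinguisticBackbone(nn.Module):          # step 3: stands in for the BERT text encoder
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, tokens):                # (B, L) token ids -> (B, L, dim) word features
        return self.embed(tokens)


class TVIA(nn.Module):                        # step 4: text-conditioned adaptive weights (assumed form)
    def __init__(self, dim=256):
        super().__init__()
        self.to_weight = nn.Linear(dim, dim)

    def forward(self, text_feat):             # pool word features, predict per-channel weights
        return torch.sigmoid(self.to_weight(text_feat.mean(dim=1)))


class VisualBackbone(nn.Module):              # stands in for the ResNet-50 visual encoder
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)

    def forward(self, image, weights):        # TVIA weights modulate the visual channels
        feat = self.stem(image)
        return feat * weights[:, :, None, None]


class MFAN(nn.Module):                        # step 5: multimodal feature fusion/refinement (assumed form)
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis_feat, text_feat):
        v = vis_feat.flatten(2).mean(dim=2)   # global visual descriptor
        t = text_feat.mean(dim=1)             # global textual descriptor
        return self.fuse(torch.cat([v, t], dim=-1))


class Localizer(nn.Module):                   # step 6: regress a normalized (cx, cy, w, h) box
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Linear(dim, 4)

    def forward(self, fused):
        return self.head(fused).sigmoid()


# step 1: initialize all weights
text_enc, tvia, vis_enc, mfan, loc = LinguisticBackbone(), TVIA(), VisualBackbone(), MFAN(), Localizer()
params = [p for m in (text_enc, tvia, vis_enc, mfan, loc) for p in m.parameters()]
optimizer = torch.optim.AdamW(params, lr=1e-4)

# dummy batch; real training would iterate over an RSVG-style dataloader
tokens = torch.randint(0, 1000, (2, 20))
images = torch.randn(2, 3, 640, 640)
gt_bbox = torch.rand(2, 4)

for epoch in range(2):                        # step 2: training loop
    text_feat = text_enc(tokens)              # step 3: linguistic features
    weights = tvia(text_feat)                 # step 4: adaptive weights guiding the visual backbone
    vis_feat = vis_enc(images, weights)
    fused = mfan(vis_feat, text_feat)         # step 5: refined multimodal features
    pred_bbox = loc(fused)                    # step 6: predicted Bbox
    loss = nn.functional.l1_loss(pred_bbox, gt_bbox)  # step 7: L1 loss (the paper may also use GIoU)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```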
4. Discussion
4.1. Data Description
4.2. Implementation Details
4.3. Comparisons with SOTA Methods
4.4. Ablation Study
4.4.1. Effectiveness of TVI-MFAN Components
4.4.2. Effectiveness of TVIA Module Position
4.4.3. Effectiveness of the Modules in Balancing Accuracy and Computational Efficiency
4.4.4. Effectiveness of TVIA Module Components
4.4.5. Effectiveness of Linguistic Feature Granularity
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1.
- Prince, S.J. Understanding Deep Learning; MIT Press: Cambridge, MA, USA, 2023.
- Sejnowski, T.J. The Deep Learning Revolution; MIT Press: Cambridge, MA, USA, 2018.
- Khanal, S.; Kc, K.; Fulton, J.P.; Shearer, S.; Ozkan, E. Remote sensing in agriculture-accomplishments, limitations, and opportunities. Remote Sens. 2020, 12, 3783.
- Shirmard, H.; Farahbakhsh, E.; Müller, R.D.; Chandra, R. A review of machine learning in processing remote sensing data for mineral exploration. Remote Sens. Environ. 2022, 268, 112750.
- Jiang, N.; Li, H.B.; Li, C.J.; Xiao, H.X.; Zhou, J.W. A fusion method using terrestrial laser scanning and unmanned aerial vehicle photogrammetry for landslide deformation monitoring under complex terrain conditions. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
- Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566.
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195.
- Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual grounding in remote sensing images. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 404–412.
- Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1769–1779.
- Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-end visual grounding with language conditioned vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652.
- Su, W.; Miao, P.; Dou, H.; Wang, G.; Qiao, L.; Li, Z.; Li, X. Language adaptive weight generation for multi-task visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10857–10866.
- Zhu, C.; Zhou, Y.; Shen, Y.; Luo, G.; Pan, X.; Lin, M.; Chen, C.; Cao, L.; Sun, X.; Ji, R. SeqTR: A simple yet universal network for visual grounding. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 598–615.
- Shi, F.; Gao, R.; Huang, W.; Wang, L. Dynamic MDETR: A dynamic multimodal transformer decoder for visual grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1181–1198.
- Ye, J.; Tian, J.; Yan, M.; Yang, X.; Wang, X.; Zhang, J.; He, L.; Lin, X. Shifting more attention to visual backbone: Query-modulated refinement networks for end-to-end visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15502–15512.
- Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9499–9508.
- Yang, C.; Li, Z.; Zhang, L. MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5649113.
- Huang, Z.; Yan, H.; Zhan, Q.; Yang, S.; Zhang, M.; Zhang, C.; Lei, Y.; Liu, Z.; Liu, Q.; Wang, Y. A Survey on Remote Sensing Foundation Models: From Vision to Multimodality. arXiv 2025, arXiv:2503.22081.
- Xiao, L.; Yang, X.; Lan, X.; Wang, Y.; Xu, C. Towards Visual Grounding: A Survey. arXiv 2024, arXiv:2412.20206.
- Li, X.; Wen, C.; Hu, Y.; Yuan, Z.; Zhu, X.X. Vision-language models in remote sensing: Current progress and future trends. IEEE Geosci. Remote Sens. Mag. 2024, 12, 32–66.
- Tao, L.; Zhang, H.; Jing, H.; Liu, Y.; Yan, D.; Wei, G.; Xue, X. Advancements in Vision–Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques. Remote Sens. 2025, 17, 162.
- Huo, C.; Chen, K.; Zhang, S.; Wang, Z.; Yan, H.; Shen, J.; Hong, Y.; Qi, G.; Fang, H.; Wang, Z. When Remote Sensing Meets Foundation Model: A Survey and Beyond. Remote Sens. 2025, 17, 179.
- Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring data and models for visual grounding on remote sensing data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
- Hang, R.; Xu, S.; Liu, Q. A regionally indicated visual grounding network for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5647411.
- Ding, Y.; Xu, H.; Wang, D.; Li, K.; Tian, Y. Visual selection and multi-stage reasoning for RSVG. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6007305.
- Zhao, E.; Wan, Z.; Zhang, Z.; Nie, J.; Liang, X.; Huang, L. A Spatial Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5409413.
- Lan, M.; Rong, F.; Jiao, H.; Gao, Z.; Zhang, L. Language query based transformer with multi-scale cross-modal alignment for visual grounding on remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5626513.
- Radouane, K.; Azzag, H. MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing. arXiv 2025, arXiv:2503.24219.
- Corley, I.; Nsutezo, S.F.; Ortiz, A.; Robinson, C.; Dodhia, R.; Ferres, J.M.L.; Najafirad, P. FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing. arXiv 2025, arXiv:2501.08490.
- Scheibenreif, L.; Hanna, J.; Mommert, M.; Borth, D. Self-supervised vision transformers for land-cover segmentation and classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1422–1431.
- Guo, X.; Lao, J.; Dang, B.; Zhang, Y.; Yu, L.; Ru, L.; Zhong, L.; Huang, Z.; Wu, K.; Hu, D.; et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27672–27683.
- Wang, Y.; Albrecht, C.M.; Braham, N.A.A.A.; Liu, C.; Xiong, Z.; Zhu, X.X. Decoupling common and unique representations for multimodal self-supervised learning. arXiv 2024, arXiv:2309.05300.
- Fuller, A.; Millard, K.; Green, J. CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. Adv. Neural Inf. Process. Syst. 2023, 36, 5506–5538.
- Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 5805–5813.
- Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642123.
- Mo, S.; Kim, M.; Lee, K.; Shin, J. S-CLIP: Semi-supervised vision-language learning using few specialist captions. Adv. Neural Inf. Process. Syst. 2023, 36, 61187–61212.
- Wang, F.; Wu, C.; Wu, J.; Wang, L.; Li, C. Multistage synergistic aggregation network for remote sensing visual grounding. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PMLR, 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 20 November 2024).
- Li, X.; Wen, C.; Hu, Y.; Zhou, N. RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103497.
- Xu, S.; Zhang, C.; Fan, L.; Meng, G.; Xiang, S.; Ye, J. AddressCLIP: Empowering vision-language models for city-wide image address localization. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 76–92.
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Zhou, J. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning; PMLR, 2022; pp. 12888–12900. Available online: https://proceedings.mlr.press/v162/li22n.html (accessed on 20 November 2024).
- Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; Kiela, D. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15638–15650.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-guided progressive attention for visual grounding in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5631413.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Huang, B.; Lian, D.; Luo, W.; Gao, S. Look before you leap: Learning landmark features for one-stage visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16888–16897.
- Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving one-stage visual grounding by recursive sub-query construction. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XIV; Springer International Publishing: Cham, Switzerland, 2020; pp. 387–404.
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 38–55.
- Zhang, P.; Zhang, Y.; Wu, H.; Liu, X.; Hou, Y.; Wang, L. Language-guided Object Localization via Refined Spotting Enhancement in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5621315.
| Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 (%) | Pr@0.6 (%) | Pr@0.7 (%) | Pr@0.8 (%) | Pr@0.9 (%) | meanIoU (%) | cumIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| One-stage: | | | | | | | | | | |
| ZSGNet | ICCV’19 | ResNet-50 | BiLSTM | 48.64 | 47.32 | 43.85 | 27.69 | 6.33 | 43.01 | 47.71 |
| FAOA | ICCV’19 | DarkNet-53 | BERT | 68.13 | 64.30 | 57.15 | 41.83 | 15.33 | 58.79 | 65.20 |
| ReSC | ECCV’20 | DarkNet-53 | BERT | 69.12 | 64.63 | 58.20 | 43.01 | 14.85 | 60.18 | 65.84 |
| LBYL-Net | CVPR’21 | DarkNet-53 | BERT | 70.22 | 65.39 | 58.65 | 37.54 | 9.46 | 60.57 | 70.28 |
| Transformer-based: | | | | | | | | | | |
| TransVG | CVPR’21 | ResNet-50 | BERT | 69.96 | 64.17 | 54.68 | 38.01 | 12.75 | 59.80 | 69.31 |
| QRNet | CVPR’22 | Swin | BERT | 72.03 | 65.94 | 56.90 | 40.70 | 13.35 | 60.82 | 73.39 |
| VLTVG | CVPR’22 | ResNet-50 | BERT | 71.84 | 66.54 | 57.79 | 41.63 | 14.62 | 60.78 | 70.69 |
| MGVLF | TGRS’23 | ResNet-50 | BERT | 72.19 | 66.86 | 58.02 | 42.51 | 15.30 | 61.51 | 71.80 |
| FQRNet | TGRS’24 | Swin | BERT | 73.23 | 71.16 | 60.20 | 45.33 | 16.37 | 62.35 | 72.76 |
| TVI-MFAN (ours) | | ResNet-50 | BERT | 75.65 | 72.32 | 61.22 | 49.51 | 21.61 | 65.19 | 74.30 |
| Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 (%) | Pr@0.6 (%) | Pr@0.7 (%) | Pr@0.8 (%) | Pr@0.9 (%) | meanIoU (%) | cumIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| One-stage: | | | | | | | | | | |
| ZSGNet | ICCV’19 | ResNet-50 | BiLSTM | 50.74 | 48.32 | 43.19 | 32.41 | 10.13 | 44.12 | 50.61 |
| FAOA | ICCV’19 | DarkNet-53 | BERT | 68.32 | 64.21 | 59.30 | 50.73 | 34.51 | 59.63 | 64.37 |
| ReSC | ECCV’20 | DarkNet-53 | BERT | 72.64 | 68.63 | 62.97 | 52.68 | 33.27 | 64.31 | 68.02 |
| LBYL-Net | CVPR’21 | DarkNet-53 | BERT | 73.61 | 69.38 | 65.14 | 47.42 | 15.65 | 65.87 | 76.39 |
| Transformer-based: | | | | | | | | | | |
| TransVG | CVPR’21 | ResNet-50 | BERT | 72.38 | 67.29 | 60.01 | 49.36 | 27.73 | 63.57 | 76.38 |
| QRNet | CVPR’22 | Swin | BERT | 75.81 | 70.79 | 62.26 | 49.57 | 25.61 | 66.43 | 82.97 |
| VLTVG | CVPR’22 | ResNet-50 | BERT | 69.36 | 65.17 | 58.39 | 46.53 | 24.32 | 59.91 | 71.93 |
| MGVLF | TGRS’23 | ResNet-50 | BERT | 75.76 | 72.03 | 65.21 | 54.84 | 35.67 | 67.46 | 78.61 |
| FQRNet | TGRS’24 | Swin | BERT | 77.14 | 74.08 | 68.92 | 59.71 | 36.87 | 68.92 | 79.36 |
| TVI-MFAN (ours) | | ResNet-50 | BERT | 80.24 | 76.25 | 71.41 | 59.74 | 38.95 | 70.23 | 83.51 |
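For reference, Pr@τ in the two tables above is the percentage of test expressions whose predicted box overlaps the ground truth with IoU ≥ τ, meanIoU averages the per-sample IoU, and cumIoU divides the summed intersection areas by the summed union areas over the whole test set. The snippet below is a minimal sketch of these conventional definitions, assuming boxes in (x1, y1, x2, y2) corner format; it is not taken from the authors' evaluation code.

```python
# Conventional RSVG evaluation metrics (Pr@tau, meanIoU, cumIoU); a sketch, not the authors' code.
# Boxes are assumed to be (N, 4) tensors in (x1, y1, x2, y2) corner format.
import torch


def box_iou_terms(pred: torch.Tensor, gt: torch.Tensor):
    """Return per-sample intersection and union areas."""
    lt = torch.max(pred[:, :2], gt[:, :2])        # top-left corner of the overlap
    rb = torch.min(pred[:, 2:], gt[:, 2:])        # bottom-right corner of the overlap
    wh = (rb - lt).clamp(min=0)                   # zero width/height when boxes do not overlap
    inter = wh[:, 0] * wh[:, 1]
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_pred + area_gt - inter
    return inter, union


def grounding_metrics(pred, gt, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    inter, union = box_iou_terms(pred, gt)
    iou = inter / union.clamp(min=1e-6)
    metrics = {f"Pr@{t}": (iou >= t).float().mean().item() * 100 for t in thresholds}
    metrics["meanIoU"] = iou.mean().item() * 100                                   # average per-sample IoU
    metrics["cumIoU"] = (inter.sum() / union.sum().clamp(min=1e-6)).item() * 100   # dataset-level IoU
    return metrics


# toy example with two predictions and their ground-truth boxes
pred = torch.tensor([[10.0, 10.0, 50.0, 50.0], [0.0, 0.0, 30.0, 30.0]])
gt = torch.tensor([[12.0, 8.0, 48.0, 52.0], [5.0, 5.0, 30.0, 30.0]])
print(grounding_metrics(pred, gt))
```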
| TVIA | MFAN | Params. | Pr@0.5 (%) |
|---|---|---|---|
| | | 122.4M | 66.53 |
| ✓ | | 123.3M | 71.96 |
| | ✓ | 127.5M | 69.37 |
| ✓ | ✓ | 128.3M | 75.65 |
| Cases | TVIA at Stage(s) i of Visual Backbone | Params. | Pr@0.5 (%) |
|---|---|---|---|
| 1 | i = 4 | 127.7M | 68.32 |
| 2 | i = 3, 4 | 128.1M | 73.63 |
| 3 | i = 1, 2, 3, 4 | 128.6M | 72.92 |
| 4 | i = 0, 1, 2, 3, 4 | 128.7M | 70.15 |
| 5 | i = 2, 3, 4 | 128.3M | 75.65 |
| Type | GFLOPs | FPS | Params. | Pr@0.5 (%) |
|---|---|---|---|---|
| TransVG | 64.135 | 3.15 | 149.8M | 69.31 |
| TVI-MFAN | 60.327 | 5.36 | 128.3M | 75.65 |
| w/o TVIA | 59.861 | 6.51 | 127.5M | 69.37 |
| w/o MFAN | 60.197 | 5.79 | 123.3M | 71.96 |
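Efficiency figures such as the GFLOPs, FPS, and parameter counts above are commonly obtained with a FLOP counter and a timed forward pass. The sketch below uses fvcore's FlopCountAnalysis and wall-clock timing as one possible setup; the `model` handle, the 640 × 640 input, and the run counts are assumptions, the paper does not state which profiling tool was used, and a grounding model would additionally receive the text input as part of the inputs tuple.

```python
# One way to obtain GFLOPs / FPS / parameter counts (a sketch; the profiling tool,
# input resolution, and `model` handle are assumptions, not taken from the paper).
import time

import torch
from fvcore.nn import FlopCountAnalysis


def profile(model: torch.nn.Module, input_size=(1, 3, 640, 640), runs=50, warmup=5):
    model.eval()
    dummy = torch.randn(*input_size)   # a grounding model would also need a text input,
                                       # passed to FlopCountAnalysis as a tuple of inputs
    with torch.no_grad():
        gflops = FlopCountAnalysis(model, dummy).total() / 1e9   # forward-pass FLOPs in GFLOPs
        for _ in range(warmup):                                  # warm-up iterations before timing
            model(dummy)
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        fps = runs / (time.perf_counter() - start)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6  # parameters in millions
    return gflops, fps, params_m


# usage (hypothetical): gflops, fps, params_m = profile(my_grounding_model)
```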
| Channel Weights | Spatial Weights | Biases | Params. | Pr@0.5 (%) |
|---|---|---|---|---|
| ✓ | ✓ | | 128.2M | 72.14 |
| ✓ | | ✓ | 127.6M | 74.69 |
| | ✓ | ✓ | 127.5M | 75.01 |
| ✓ | ✓ | ✓ | 128.3M | 75.65 |
| Word | Sentence | Pr@0.5 (%) |
|---|---|---|
| ✓ | | 75.65 |
| | ✓ | 70.32 |
| ✓ | ✓ | 75.01 |