Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning
Highlights
- The proposed RSCoVLM, trained on a well-curated data recipe, is a fully open-sourced vision-language model (VLM) that excels at multiple remote sensing (RS) tasks. It achieves state-of-the-art performance across various tasks, even on aerial object detection.
- The proposed dynamic resolution strategies enable the processing of RS images of arbitrary size. Among these strategies, the Zoom-in Chain method significantly enhances reasoning performance on ultra-high-resolution RS images (see the sketch after this list). Additionally, VLMs exhibit clear limitations under the commonly used mAP metric, which is influenced by confidence scores; based on the proposed evaluation metric, RSCoVLM is shown to achieve detection performance comparable to that of conventional object detection models.
- From the perspective of RS VLM development, RSCoVLM serves as a new baseline that demonstrates substantial progress in capability and flexibility, bringing us one step closer to a general-purpose generative agent for RS image processing.
- From the perspective of RS multi-task learning, the proposed framework offers greater extensibility, facilitating future expansion to more numerous and increasingly complex tasks and moving toward a unified multi-task model.
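The Zoom-in Chain highlighted above can be pictured as a coarse-to-fine inference loop. Below is a minimal sketch assuming a generic `vlm_query(image, prompt)` helper and a simple quadrant-cropping policy; it illustrates the zooming principle only and is not the authors' released implementation.

```python
# Minimal Zoom-in Chain style inference loop (illustrative sketch).
# Assumption: `vlm_query(image, prompt)` returns the model's text answer.
from PIL import Image

def zoom_in_chain(image_path: str, question: str, vlm_query, max_steps: int = 3,
                  min_side: int = 1024) -> str:
    """Iteratively crop toward the quadrant the model deems relevant, then answer."""
    image = Image.open(image_path)
    for _ in range(max_steps):
        w, h = image.size
        if min(w, h) <= min_side:
            break  # the crop is small enough for direct reasoning
        choice = vlm_query(
            image,
            f"Question: {question}\n"
            "Which quadrant should we zoom into to answer it? "
            "Reply with one of: top-left, top-right, bottom-left, bottom-right."
        ).strip().lower()
        boxes = {
            "top-left": (0, 0, w // 2, h // 2),
            "top-right": (w // 2, 0, w, h // 2),
            "bottom-left": (0, h // 2, w // 2, h),
            "bottom-right": (w // 2, h // 2, w, h),
        }
        image = image.crop(boxes.get(choice, (0, 0, w, h)))
    return vlm_query(image, question)  # final answer on the zoomed-in view
```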
Abstract
1. Introduction
- We present RSCoVLM, a fully open-sourced VLM baseline for RS MTL. Experiments show that the model achieves leading performance across benchmarks covering various datasets and tasks.
- We develop a universal VLM-based framework for RS MTL and design a data curation procedure that enables unified training across multiple datasets spanning diverse RS tasks.
- We propose a dynamic resolution strategy for RS imagery, together with the Zoom-in Chain strategy and the LRS-VQA-Zoom dataset, to further enhance the model’s reasoning ability on ultra-high-resolution (UHR) images (a resizing sketch follows this list).
- We develop an aerial detection method for RS VLMs and propose an evaluation metric that enables a fair comparison between RS VLMs and conventional methods.
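To make the dynamic resolution strategy above concrete, the sketch below maps an arbitrary-size RS image to a target size whose sides are multiples of the vision patch size while the total pixel count stays within a budget, in the spirit of Qwen2-VL-style dynamic resolution. The patch size, pixel budget, and function name are illustrative assumptions, not the paper's exact settings.

```python
import math

def dynamic_resize(width: int, height: int, patch: int = 28,
                   min_pixels: int = 56 * 56, max_pixels: int = 1_000_000):
    """Return a patch-aligned (width, height) whose area fits [min_pixels, max_pixels]."""
    # Round each side to the nearest patch multiple first.
    w = max(patch, round(width / patch) * patch)
    h = max(patch, round(height / patch) * patch)
    if w * h > max_pixels:            # too large: scale down uniformly
        scale = math.sqrt(max_pixels / (w * h))
        w = max(patch, math.floor(width * scale / patch) * patch)
        h = max(patch, math.floor(height * scale / patch) * patch)
    elif w * h < min_pixels:          # too small: scale up uniformly
        scale = math.sqrt(min_pixels / (w * h))
        w = math.ceil(width * scale / patch) * patch
        h = math.ceil(height * scale / patch) * patch
    return w, h

# Example: an 8000 x 6000 UHR tile is mapped to a patch-aligned size within the budget.
print(dynamic_resize(8000, 6000))
```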
2. Related Works
2.1. General-Purpose Vision-Language Models
2.2. Remote Sensing Vision-Language Models
2.3. Remote Sensing Multi-Task Learning
3. Method
3.1. The Universal RS Multi-Task Framework
3.2. Data Curation Procedure
3.2.1. Data Acquisition
3.2.2. Data Processing and Integrating
3.2.3. Data Loading and Weighting
3.3. Dynamic Resolution Strategy
3.3.1. Full-Scale Visual Input
3.3.2. Scalable Bounding Boxes
3.3.3. Random Resizing
3.4. Zoom-In Chain for UHR RS Images
3.4.1. Template-Generated Data (60 k)
3.4.2. GPT-4V-Synthesized Data (159 k)
3.4.3. Multi-Choice-Query Data (83 k)
3.5. Aerial Detection Method for Vision-Language Models
3.5.1. Response Normalization
3.5.2. Evaluation Metrics
4. Experiment
4.1. Reproducibility Details
We have released the complete training and evaluation code in the GitHub repository and uploaded the full curated data folder and model weights to the HuggingFace repository. The codebase is implemented concisely, leveraging resource-efficient and effective training techniques. To save GPU memory, we adopt DeepSpeed ZeRO Stage 1 [55] and gradient checkpointing. For improved computational efficiency, we use BFloat16 precision and FlashAttention-2 [56] during both training and evaluation. Additionally, Liger Kernel [57] is employed to accelerate training, and vLLM [58] is used for faster inference. All experiments are conducted on VolcEngine high-performance computing clusters equipped with NVIDIA A800 GPUs. We will maintain the repositories and keep the code, models, and data up to date as our research progresses.
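The setup described above can be wired together roughly as follows. This is a minimal sketch using Hugging Face Transformers and DeepSpeed; the base model identifier, dataset objects, and hyperparameters are placeholders rather than the released training configuration.

```python
# Illustrative training setup: DeepSpeed ZeRO-1, gradient checkpointing,
# BFloat16, and FlashAttention-2. Model id and hyperparameters are placeholders.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, Trainer, TrainingArguments

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder base model
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                # BFloat16 weights
    attn_implementation="flash_attention_2",   # requires the flash-attn package
)
processor = AutoProcessor.from_pretrained(model_id)

args = TrainingArguments(
    output_dir="./rs-vlm-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,                      # BFloat16 mixed-precision training
    gradient_checkpointing=True,    # trade recomputation for GPU memory
    deepspeed={                     # DeepSpeed ZeRO stage-1 optimizer-state sharding
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 1},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    },
)

# `train_dataset` and `data_collator` would wrap the curated multi-task data:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=data_collator)
# trainer.train()
```

4.2. Evaluation on Large RS Imagery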
4.2.1. Benchmark and Metric
4.2.2. Results
4.3. Evaluation on Visual Grounding
4.3.1. Benchmark and Metric
4.3.2. Results
4.4. Evaluation on Object Detection
4.4.1. Benchmark, Metric, and Comparison Setting
4.4.2. Results
4.5. Evaluation on Scene Classification
4.5.1. Benchmark and Metric
4.5.2. Results
4.6. Evaluation on Visual Question Answering
4.6.1. Benchmark and Metric
4.6.2. Results
5. Discussion
5.1. Failure Cases Analysis
5.2. Limitations and Outlook
6. Conclusions
Author Contributions
Funding
Data Availability Statement
The curated data and model weights are openly available in the HuggingFace repository.
Conflicts of Interest
References
- Zhang, L.; Zhang, L. Artificial Intelligence for Remote Sensing Data Analysis: A review of challenges and opportunities. IEEE Geosci. Remote Sens. Mag. 2022, 10, 270–294. [Google Scholar] [CrossRef]
- Zhou, Y.; Feng, L.; Ke, Y.; Jiang, X.; Yan, J.; Yang, X.; Zhang, W. Towards Vision-Language Geo-Foundation Models: A Survey. arXiv 2024, arXiv:2406.09385. [Google Scholar]
- Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2022, 34, 5586–5609. [Google Scholar] [CrossRef]
- Li, Q.; Chen, Y.; He, X.; Huang, L. Co-training transformer for remote sensing image classification, segmentation, and detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
- Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984. [Google Scholar] [CrossRef]
- Han, J.; Gong, K.; Zhang, Y.; Wang, J.; Zhang, K.; Lin, D.; Qiao, Y.; Gao, P.; Yue, X. OneLLM: One Framework to Align All Modalities with Language. arXiv 2023, arXiv:2312.03700. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
- Li, Q.; Chen, Z.; Wang, W.; Wang, W.; Ye, S.; Jin, Z.; Chen, G.; He, Y.; Gao, Z.; Cui, E.; et al. OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text. In Proceedings of the The Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 23716–23736. [Google Scholar]
- Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24185–24198. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 34892–34916. [Google Scholar]
- Wang, D.; Zhang, J.; Xu, M.; Liu, L.; Wang, D.; Gao, E.; Han, C.; Guo, H.; Du, B.; Tao, D.; et al. MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11632–11654. [Google Scholar] [CrossRef]
- Li, Y.; Li, X.; Li, Y.; Zhang, Y.; Dai, Y.; Hou, Q.; Cheng, M.M.; Yang, J. Sm3det: A unified model for multi-modal remote sensing object detection. arXiv 2024, arXiv:2412.20665. [Google Scholar]
- Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.S. Vhm: Versatile and honest vision language model for remote sensing image analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6381–6388. [Google Scholar]
- Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. GeoChat: Grounded Large Vision-Language Model for Remote Sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 27831–27840. [Google Scholar]
- Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 440–457. [Google Scholar]
- Zhou, Y.; Lan, M.; Li, X.; Feng, L.; Ke, Y.; Jiang, X.; Li, Q.; Yang, X.; Zhang, W. GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv 2024, arXiv:2411.11904. [Google Scholar]
- Li, Q.; Chen, Y.; Shu, X.; Chen, D.; He, X.; Yu, Y.; Yang, X. A Simple Aerial Detection Baseline of Multimodal Language Models. arXiv 2025, arXiv:2501.09720. [Google Scholar] [CrossRef]
- Luo, J.; Zhang, Y.; Yang, X.; Wu, K.; Zhu, Q.; Liang, L.; Chen, J.; Li, Y. When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning. arXiv 2025, arXiv:2503.07588. [Google Scholar]
- Wang, F.; Chen, M.; Li, Y.; Wang, D.; Wang, H.; Guo, Z.; Wang, Z.; Shan, B.; Lan, L.; Wang, Y.; et al. GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution. arXiv 2025, arXiv:2505.21375. [Google Scholar]
- Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
- Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv 2024, arXiv:2406.10100. [Google Scholar]
- Chen, J.; Guo, H.; Yi, K.; Li, B.; Elhoseiny, M. VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 16–24 June 2022; pp. 18030–18040. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 19730–19742. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 49250–49267. [Google Scholar]
- Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv 2023, arXiv:2304.15010. [Google Scholar]
- Lin, Z.; Liu, C.; Zhang, R.; Gao, P.; Qiu, L.; Xiao, H.; Qiu, H.; Lin, C.; Shao, W.; Chen, K.; et al. SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models. arXiv 2023, arXiv:2311.07575. [Google Scholar] [CrossRef]
- Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 61501–61513. [Google Scholar]
- Wu, J.; Zhong, M.; Xing, S.; Lai, Z.; Liu, Z.; Chen, Z.; Wang, W.; Zhu, X.; Lu, L.; Lu, T.; et al. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 69925–69975. [Google Scholar]
- Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J.; Ruiz, C.R.; Goodman, S.; Wang, X.; Tay, Y.; et al. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv 2023, arXiv:2305.18565. [Google Scholar]
- Yue, Z.; Lin, Z.; Song, Y.; Wang, W.; Ren, S.; Gu, S.; Li, S.; Li, P.; Zhao, L.; Li, L.; et al. MiMo-VL Technical Report. arXiv 2025, arXiv:2506.03569. [Google Scholar]
- Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Song, X.; et al. CogVLM: Visual Expert for Pretrained Language Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 121475–121499. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
- Irvin, J.A.; Liu, E.R.; Chen, J.C.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. Teochat: A large vision-language assistant for temporal earth observation data. arXiv 2024, arXiv:2410.06234. [Google Scholar] [CrossRef]
- Zhan, Y.; Xiong, Z.; Yuan, Y. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77. [Google Scholar] [CrossRef]
- Zhang, W.; Cai, M.; Ning, Y.; Zhang, T.; Zhuang, Y.; Lu, S.; Chen, H.; Li, J.; Mao, X. EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4709221. [Google Scholar]
- Liu, X.; Lian, Z. Rsunivlm: A unified vision language model for remote sensing via granularity-oriented mixture of experts. arXiv 2024, arXiv:2412.05679. [Google Scholar]
- Cui, F.; Jiang, J. MTSCD-Net: A network based on multi-task learning for semantic change detection of bitemporal remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103294. [Google Scholar] [CrossRef]
- Wang, Y.; Zhao, L.; Hu, Y.; Dai, H.; Zhang, Y. Multitask semantic change detection guided by spatiotemporal semantic interaction. Sci. Rep. 2025, 15, 16003. [Google Scholar] [CrossRef]
- Niu, Y.; Guo, H.; Lu, J.; Ding, L.; Yu, D. SMNet: Symmetric multi-task network for semantic change detection in remote sensing images based on CNN and transformer. Remote Sens. 2023, 15, 949. [Google Scholar] [CrossRef]
- Lin, H.; Wang, X.; Li, M.; Huang, D.; Wu, R. A multi-task consistency enhancement network for semantic change detection in HR remote sensing images and application of non-agriculturalization. Remote Sens. 2023, 15, 5106. [Google Scholar] [CrossRef]
- Lu, M.; Liu, J.; Wang, F.; Xiang, Y. Multi-task learning of relative height estimation and semantic segmentation from single airborne RGB images. Remote Sens. 2022, 14, 3450. [Google Scholar] [CrossRef]
- Carvalho, M.; Le Saux, B.; Trouvé-Peloux, P.; Champagnat, F.; Almansa, A. Multitask learning of height and semantics from aerial images. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1391–1395. [Google Scholar] [CrossRef]
- Bastani, F.; Wolters, P.; Gupta, R.; Ferdinando, J.; Kembhavi, A. SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 16772–16782. [Google Scholar]
- Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, Shanghai, China, 15–18 October 2025; pp. 14303–14313. [Google Scholar]
- Li, B.; Zhang, Y.; Guo, D.; Zhang, R.; Li, F.; Zhang, H.; Zhang, K.; Zhang, P.; Li, Y.; Liu, Z.; et al. LLaVA-OneVision: Easy Visual Task Transfer. Trans. Mach. Learn. Res. 2025. [Google Scholar]
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Li, Y.; Luo, J.; Zhang, Y.; Tan, Y.; Yu, J.G.; Bai, S. Learning to Holistically Detect Bridges From Large-Size VHR Remote Sensing Imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 44, 7778–7796. [Google Scholar] [CrossRef]
- Li, Y.; Wang, L.; Wang, T.; Yang, X.; Luo, J.; Wang, Q.; Deng, Y.; Wang, W.; Sun, X.; Li, H.; et al. STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery. arXiv 2024, arXiv:2406.09410. [Google Scholar] [CrossRef]
- Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogram. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
- Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Georgia, 9–19 November 2020. [Google Scholar]
- Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 11 May 2024. [Google Scholar]
- Hsu, P.L.; Dai, Y.; Kothapalli, V.; Song, Q.; Tang, S.; Zhu, S.; Shimizu, S.; Sahni, S.; Ning, H.; Chen, Y. Liger-Kernel: Efficient Triton Kernels for LLM Training. In Proceedings of the Championing Open-Source DEvelopment in ML Workshop @ ICML25, Vancouver, BC, Canada, 19 July 2025. [Google Scholar]
- Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, New York, NY, USA, 23–26 October 2023; pp. 611–626. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Zhang, Y.; Liu, Y.; Guo, Z.; Zhang, Y.; Yang, X.; Zhang, X.; Chen, C.; Song, J.; Zheng, B.; Yao, Y.; et al. LLaVA-UHD v2: An MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer. arXiv 2024, arXiv:2412.13871. [Google Scholar]
- Wang, W.; Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Zhu, J.; Zhu, X.; Lu, L.; Qiao, Y.; et al. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. arXiv 2024, arXiv:2411.10442. [Google Scholar] [CrossRef]
- Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv 2025, arXiv:2504.10479. [Google Scholar]
- Wang, W.; Gao, Z.; Gu, L.; Pu, H.; Cui, L.; Wei, X.; Liu, Z.; Jing, L.; Ye, S.; Shao, J.; et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv 2025, arXiv:2508.18265. [Google Scholar]
- Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
- Li, X.; Ding, J.; Elhoseiny, M. VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. arXiv 2024, arXiv:2406.12384. [Google Scholar] [CrossRef]
- Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. MMRotate: A Rotated Object Detection Benchmark using PyTorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022. [Google Scholar]
- Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
- Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021. [Google Scholar]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
- Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
- Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
- Yao, K.; Xu, N.; Yang, R.; Xu, Y.; Gao, Z.; Kitrungrotsakul, T.; Ren, Y.; Zhang, P.; Wang, J.; Wei, N.; et al. Falcon: A Remote Sensing Vision-Language Foundation Model. arXiv 2025, arXiv:2503.11070. [Google Scholar] [CrossRef]
- Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a unified representation for a variety of vision tasks. arXiv 2023, arXiv:2311.06242. [Google Scholar] [CrossRef]
- Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, New York, NY, USA, 2–5 November 2010; pp. 270–279. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
- Dai, D.; Yang, W. Satellite Image Classification via Two-Layer Sparse Coding With Biased Image Representation. IEEE Trans. Geosci. Remote Sens. 2011, 8, 173–176. [Google Scholar] [CrossRef]
- Zhu, B.; Lui, N.; Irvin, J.; Le, J.; Tadwalkar, S.; Wang, C.; Ouyang, Z.; Liu, F.Y.; Ng, A.Y.; Jackson, R.B. METER-ML: A multi-sensor earth observation benchmark for automated methane source mapping. arXiv 2022, arXiv:2207.11166. [Google Scholar]
- Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478. [Google Scholar]
- Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Cui, E.; Zhu, J.; Ye, S.; Tian, H.; Liu, Z.; et al. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv 2024, arXiv:2412.05271. [Google Scholar]
- Li, Z.; Muhtar, D.; Gu, F.; He, Y.; Zhang, X.; Xiao, P.; He, G.; Zhu, X. Lhrs-bot-nova: Improved multimodal large language model for remote sensing vision-language interpretation. ISPRS J. Photogramm. Remote Sens. 2025, 227, 539–550. [Google Scholar] [CrossRef]
- Muhtar, D.; Zhang, E.; Li, Z.; Gu, F.; He, Y.; Xiao, P.; Zhang, X. Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025. [Google Scholar]
- Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]









| Method | Model Size | Max Pixels | LRS-FAIR | LRS-Bridge | LRS-STAR | Avg. Acc |
|---|---|---|---|---|---|---|
| LLaVA-1.5 [59] | 7B | 0.1 M | 18.76 | 30.70 | 22.63 | 24.03 |
| LLaVA-UHD-v2 [60] | 7B | 0.7 M | 22.82 | 32.57 | 26.08 | 27.16 |
| Qwen2-VL [36] | 7B | 11.1 M | 23.80 | 38.12 | 27.87 | 29.93 |
| Qwen2.5-VL [37] | 7B | 12.8 M | 19.66 | 35.82 | 26.70 | 27.39 |
| Qwen3-VL [37] | 8B | 16.8 M | 27.98 | 38.56 | 32.04 | 32.86 |
| Qwen3-VL [37] | A3B-30B | 16.8 M | 27.63 | 38.81 | 30.54 | 32.33 |
| InternVL2.5-MPO [61] | 8B | 2.4 M | 24.95 | 34.59 | 25.14 | 28.23 |
| InternVL3 [62] | 8B | 2.4 M | 22.49 | 38.09 | 26.36 | 28.98 |
| InternVL3.5 [63] | 8B | 2.4 M | 25.14 | 35.50 | 26.86 | 29.17 |
| InternVL3.5 [63] | A3B-30B | 2.4 M | 16.83 | 37.05 | 22.15 | 25.34 |
| Mimo-VL [34] | 7B | 12.8 M | 16.51 | 20.04 | 27.11 | 21.22 |
| GeoChat [17] | 7B | 0.3 M | 20.18 | 24.54 | 13.75 | 19.49 |
| LLaVA-1.5 + SFT. on LRS-VQA [21] | 7B | 0.1 M | 22.97 | 36.89 | 27.48 | 29.11 |
| LLaVA-Next + SFT. on LRS-VQA [21] | 7B | 2.8 M | 21.85 | 38.24 | 26.67 | 28.92 |
| RSCoVLM | 7B | 1.0 M | 27.37 | 42.42 | 31.77 | 33.85 |
| + Zoom-in Chain | | | 42.42 | 49.56 | 45.15 | 45.71 |
| Method | Input Size | DIOR-RSVG Val | DIOR-RSVG Test | RSVG Val | RSVG Test | Geo.-VG | VRS.-VG | AVVG | Avg. Acc. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat [36] | | 32.01 | 32.22 | 4.66 | 2.04 | 35.36 | 31.07 | 0.31 | 19.66 |
| GeoChat [17] | | 23.35 | 24.05 | 3.08 | 2.04 | 22.74 | 11.52 | 0.28 | 12.44 |
| LHRS-Bot [18] | | 17.04 | 17.59 | 0.95 | 1.56 | 3.25 | 1.19 | 0.00 | 5.94 |
| VHM [16] | | - | 48.04 | - | - | - | - | - | - |
| RSUniVLM [41] | | - | 72.47 | - | - | - | 69.31 | - | - |
| Qwen2.5-VL [37] | Dynamic | 43.64 | 45.26 | 19.73 | 21.27 | 42.99 | 44.50 | 7.64 | 32.15 |
| Qwen-VL + SFT. on refGeo [19] | | 58.65 | 58.76 | 12.99 | 10.59 | 41.75 | 47.38 | 9.53 | 34.24 |
| GeoChat + SFT. on refGeo [19] | | 60.27 | 61.96 | 16.32 | 14.67 | 56.99 | 51.36 | 11.52 | 39.01 |
| LLaVA-1.5-7B + SFT. on refGeo [19] | | 64.46 | 65.98 | 19.98 | 20.95 | 63.76 | 57.17 | 15.05 | 43.91 |
| GeoGround [19] | | 77.18 | 77.73 | 27.64 | 26.65 | 70.24 | 66.04 | 21.58 | 52.44 |
| RSCoVLM | Dynamic | 83.56 | 84.55 | 54.04 | 53.79 | 76.39 | 79.73 | 29.40 | 65.92 |
| + Min Size | | 66.56 | 67.64 | 21.23 | 20.70 | 21.43 | 67.50 | 0.85 | 37.99 |
| + Small Size | | 75.22 | 75.86 | 34.72 | 35.79 | 70.17 | 75.79 | 25.10 | 56.09 |
| Method | Score | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | APnc50 | APnc75 | APnc50:95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GWD [67] | 0.40 | 72.07 | 58.95 | 24.26 | 35.92 | 63.37 | 52.24 | 66.63 | 86.85 | 61.54 | 59.15 | 23.45 | 45.84 | 36.10 | 49.50 | 26.08 | 50.80 | 28.67 | 28.98 |
| R3Det [68] | 0.45 | 73.32 | 59.51 | 31.59 | 43.37 | 64.94 | 63.37 | 75.61 | 89.35 | 63.84 | 66.95 | 34.99 | 45.60 | 46.54 | 50.17 | 14.48 | 54.91 | 29.21 | 30.08 |
| ATSS [69] | 0.35 | 72.98 | 60.67 | 25.82 | 42.91 | 65.23 | 65.32 | 75.22 | 89.78 | 71.61 | 70.12 | 28.04 | 43.19 | 47.79 | 58.28 | 28.60 | 56.37 | 34.62 | 33.05 |
| Faster RCNN [70] | 0.85 | 73.49 | 67.37 | 32.50 | 43.19 | 62.92 | 63.13 | 73.95 | 88.79 | 73.77 | 66.33 | 25.23 | 48.89 | 53.03 | 56.94 | 31.17 | 57.38 | 32.98 | 32.96 |
| FCOS [71] | 0.30 | 72.09 | 56.32 | 32.02 | 27.79 | 64.28 | 63.83 | 75.75 | 89.21 | 68.11 | 67.82 | 27.30 | 37.15 | 46.64 | 58.98 | 19.53 | 53.79 | 32.13 | 31.49 |
| CSL [72] | 0.40 | 71.57 | 53.57 | 19.82 | 35.90 | 64.04 | 44.96 | 66.35 | 87.54 | 61.33 | 59.26 | 29.27 | 39.56 | 36.51 | 49.80 | 17.38 | 49.12 | 29.03 | 28.72 |
| S2A-Net [73] | 0.50 | 72.96 | 61.96 | 36.00 | 45.99 | 66.24 | 65.61 | 77.08 | 89.34 | 73.27 | 69.25 | 31.78 | 46.64 | 55.02 | 52.02 | 35.40 | 58.58 | 29.53 | 31.98 |
| RSCoVLM | - | 77.15 | 64.86 | 23.90 | 45.34 | 44.87 | 38.96 | 57.64 | 87.22 | 57.73 | 49.42 | 23.31 | 51.87 | 37.01 | 54.92 | 54.91 | 51.27 | 25.75 | 27.60 |
| + Max Mode | - | 73.95 | 63.01 | 27.84 | 40.41 | 56.86 | 55.37 | 71.00 | 89.12 | 61.69 | 64.95 | 19.54 | 41.91 | 44.21 | 55.01 | 52.27 | 54.48 | 31.04 | 31.38 |
| RSCoVLM-det | - | 73.52 | 64.68 | 26.89 | 47.18 | 52.57 | 52.71 | 59.33 | 89.16 | 63.17 | 61.43 | 18.91 | 45.96 | 47.62 | 59.19 | 70.24 | 55.50 | 30.78 | 31.75 |
| + Max Mode | - | 69.04 | 64.44 | 33.32 | 44.67 | 56.21 | 66.47 | 73.71 | 87.59 | 61.38 | 63.95 | 22.41 | 46.63 | 47.90 | 59.82 | 50.82 | 56.56 | 33.88 | 33.66 |
| Method | Model Size | AID | UCMerced | METER-ML | NWPU-RESISC45 | WHU-RS19 |
|---|---|---|---|---|---|---|
| MiniGPTv2 [81] | 7B | - | 32.96 | - | 14.29 | 28.15 | 64.80 |
| LLaVA-1.5 [59] | 7B | - | 31.10 | - | 21.73 | 34.96 | 54.55 |
| Qwen-VL-Chat [36] | 7B | - | 55.30 | - | 38.77 | 42.73 | 72.25 |
| Qwen2.5-VL [37] | 7B | 63.63 | 62.73 | 70.90 | 56.64 | 64.98 | 76.20 |
| Qwen3-VL [37] | 8B | 70.84 | 66.67 | 79.90 | 60.88 | 68.86 | 87.80 |
| Qwen3-VL [37] | A3B-30B | 71.75 | 68.87 | 80.19 | 64.07 | 70.22 | 87.70 |
| InternVL2.5-MPO [61] | 8B | 69.38 | 64.23 | 62.90 | 55.04 | 59.21 | 80.20 |
| InternVL3 [62] | 8B | 67.78 | 63.40 | 67.29 | 59.65 | 64.32 | 86.40 |
| InternVL3.5 [63] | 8B | 77.03 | 75.00 | 83.43 | 51.33 | 92.57 | 91.70 |
| InternVL3.5 [63] | A3B-30B | 82.45 | 79.17 | 86.00 | 46.19 | 98.38 | 97.10 |
| Mimo-VL [34] | 7B | 66.13 | 67.20 | 69.14 | 54.51 | 64.35 | 86.10 |
| LHRSBot [18] | 7B | - | 91.26 | - | 69.81 | 83.94 | 93.17 |
| GeoChat [17] | 7B | 72.00 | - | 84.40 | - | - | - |
| TEOChat [38] | 7B | 80.90 | - | 86.30 | - | - | - |
| EarthGPT-X [40] | 13B | 78.09 | - | 87.89 | - | - | - |
| RSUniVLM [41] | 1B | - | 81.18 | - | - | 86.86 | 84.91 |
| LHRS-Bot-Nova [83] | 7B | 83.06 | - | - | 72.74 | 83.97 | 96.20 |
| SkysenseGPT [24] | 7B | 88.16 | - | - | 40.00 | 90.06 | 95.50 |
| VHM [16] | 7B | - | 91.70 | - | 72.74 | 94.54 | 95.80 |
| ScoreRS [84] | 7B | - | 85.90 | - | 74.42 | 91.59 | 96.30 |
| RSCoVLM | 7B | 88.44 | 94.30 | 94.52 | 75.93 | 98.25 | 95.80 |
| Method | RSVQA HR-Comp. | RSVQA HR-Pres. | RSVQA LR-Comp. | RSVQA LR-Pres. | RSVQA LR-R-U | RSVQA Avg. | VRSBench VQA |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5 [59] | 67.30 | 69.80 | 68.20 | 55.50 | 59.00 | 63.96 | - |
| LLaVA-1.6 [84] | 68.60 | 64.40 | 64.32 | 56.84 | 61.00 | 63.03 | - |
| Qwen2-VL [36] | 75.60 | 63.30 | 75.47 | 62.00 | 73.00 | 69.87 | - |
| Qwen2.5-VL [37] | 75.28 | 67.30 | 73.86 | 64.67 | 66.00 | 69.42 | 51.21 |
| Qwen3-VL [37] | 81.00 | 78.10 | 70.32 | 56.42 | 72.00 | 71.57 | 54.75 |
| InternVL-2.5 [82] | 75.50 | 65.80 | 71.16 | 66.21 | 72.00 | 70.13 | 47.20 |
| InternVL3 [62] | 74.15 | 62.35 | 73.06 | 66.23 | 74.00 | 69.96 | 50.68 |
| InternVL3.5 [63] | 80.14 | 53.80 | 92.00 | 91.26 | 96.00 | 82.64 | 53.74 |
| Mimo-VL [34] | 66.00 | 77.80 | 74.42 | 59.98 | 65.00 | 68.64 | 48.37 |
| LHRS-Bot-Nova [83] | 89.30 | 87.60 | 88.11 | 83.89 | 79.00 | 85.58 | - |
| GeoChat [17] | 83.30 | 59.10 | 90.52 | 90.63 | 97.00 | 84.11 | 40.80 |
| SkyEyeGPT [39] | 80.28 | 83.50 | 88.63 | 88.93 | 75.00 | 83.27 | - |
| VHM [16] | 83.30 | 68.30 | 90.11 | 89.89 | 87.00 | 83.72 | - |
| RSCoVLM | 82.60 | 68.50 | 93.16 | 92.18 | 94.00 | 86.09 | 58.08 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Li, Q.; Ma, S.; Luo, J.; Yu, Y.; Zhou, Y.; Wang, F.; Lu, X.; Wang, X.; He, X.; Chen, Y.; et al. Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning. Remote Sens. 2026, 18, 222. https://doi.org/10.3390/rs18020222

