Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning

Li, Qingyun; Ma, Shuran; Luo, Junwei; Yu, Yi; Zhou, Yue; Wang, Fengxiang; Lu, Xudong; Wang, Xiaoxing; He, Xin; Chen, Yushi; Yang, Xue

doi:10.3390/rs18020222

Open AccessArticle

Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning^†

by

Qingyun Li

^1,‡

,

Shuran Ma

^2,‡,

Junwei Luo

³,

Yi Yu

⁴

,

Yue Zhou

^5,§

,

Fengxiang Wang

⁶,

Xudong Lu

⁷,

Xiaoxing Wang

⁸,

Xin He

¹

,

Yushi Chen

^1,*

and

Xue Yang

^9,§

¹

School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China

²

School of Telecommunications Engineering, Xidian University, Xi’an 710071, China

³

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China

⁴

School of Automation, Southeast University, Nanjing 210096, China

⁵

School of Geospatial Artificial Intelligence, East China Normal University, Shanghai 200241, China

⁶

College of Computer Science and Technology, National University of Defense Technology, Changsha 410005, China

⁷

Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR 999077, China

⁸

School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China

⁹

School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai 200240, China

^*

Author to whom correspondence should be addressed.

^†

This paper is an extended version of our paper published in the International Geoscience and Remote Sensing Symposium (IGARSS) 2025.

^‡

These authors contributed equally to this work.

^§

These authors are project leaders of this work.

Remote Sens. 2026, 18(2), 222; https://doi.org/10.3390/rs18020222

Submission received: 7 December 2025 / Revised: 1 January 2026 / Accepted: 6 January 2026 / Published: 9 January 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

The proposed RSCoVLM, trained on a well-curated data recipe, is a fully open-sourced vision-language model (VLM) that excels in multiple remote sensing (RS) tasks. It achieves state-of-the-art performance across various tasks, even aerial object detection.
The proposed dynamic resolution strategies enable the processing of RS images of arbitrary sizes. Among these strategies, the Zoom-in Chain method significantly enhances the performance on ultra-high-resolution RS images reasoning. Additionally, VLMs exhibit clear limitations under the commonly used mAP metric, which is influenced by confidence scores. Based on the proposed ${mAP}_{nc}$ , RSCoVLM is shown to achieve detection performance comparable to conventional object detection models.

What are the implication of the main findings?

From the perspective of RS VLM development, as a new baseline, RSCoVLM demonstrates substantial progress in capability and flexibility. It brings us one step closer to realizing a general-purpose generative agent for RS image processing.
From the perspective of RS multi-task learning, the proposed framework offers greater extensibility. It will facilitate expansion to more and increasingly complex tasks in the future, moving toward a unified multi-task model.

Abstract

With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision-language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation procedure, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data procedure effectively addresses complex RS data enviroments and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model’s object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

Keywords:

vision-language model; remote sensing; multi-task learning

1. Introduction

Earth observation systems have acquired extensive remote sensing (RS) data, necessitating the development of automated RS image interpretation techniques [1]. The emergence of artificial general intelligence has inspired researchers in the RS community to develop versatile agents capable of performing multiple tasks, such as scene classification, visual question answering, and object detection [2].

Most RS image processing methods typically train a specifically designed model on isolated datasets to achieve optimal performance on individual tasks. Due to the heterogeneity of data and model architecture, developing a unified model capable of handling multiple RS tasks, i.e., multi-task learning (MTL), remains challenging [3].

MTL provides several advantages for RS applications. First, a single MTL model with shared parameters can handle multiple tasks at once, unlike traditional task-specific models, which is closer to human perception. Second, by sharing knowledge across tasks, MTL mitigates the shortage of annotated data and reduces overfitting on individual tasks. Third, MTL learns joint representations that capture correlations among tasks, improving generalization. RS foundation models also benefit by obtaining consistent representations through pre-training on upstream tasks and fine-tuning on various downstream tasks. Overall, MTL helps advance RS foundation models by expanding pre-training tasks and enhancing cross-task learning [4].

Transformer [5] has demonstrated remarkable flexibility and generalization capabilities across various domains, including computer vision [6], natural language processing, speech processing, and remote sensing data analysis [7]. This progress has brought the goal of a unified multimodal and multi-task architecture increasingly within reach [8]. Consequently, vision-language models (VLMs), that bridge the gap between the two modalities by learning from vast amounts of paired [9] and interleaved image–text data [10], have been proposed and become the most commonly adopted foundation model paradigm in the multimodal domain [11].

In this study, we focus on generative VLMs, also named multimodal language models. These models are typically constructed upon vision and language foundation models, enabling them to process visual inputs and effectively interpret textual instructions. By harnessing the capabilities of powerful pre-trained foundation models and leveraging a versatile text interface, VLMs are positioned as a crucial element in the progression toward the unified MTL [12].

We consider that VLMs represent an ideal paradigm for RS MTL. Firstly, the textual interface of VLMs provides a unified representation for diverse task objectives, because the outputs of different RS tasks, such as classification, grounding, captioning, or question answering, can all be expressed in text form. Secondly, instruction tuning has demonstrated that VLMs can generalize beyond the tasks seen during training [13], enabling them to handle novel or composite tasks through in-context learning [11]. Finally, with sufficiently strong foundational capabilities, VLMs offer the potential to evolve toward more autonomous RS agents, where task reasoning and workflow design can be accomplished within a single, coherent framework.

In the RS community, MTL has been preliminarily explored, including several attempts leveraging VLMs. Nevertheless, existing approaches still exhibit notable limitations. Figure 1 summarizes the key differences among representative paradigms.

Early RS MTL approaches were typically designed for multiple pure-vision tasks [4,14,15], such as classification, segmentation, and detection. These methods generally adopt a shared feature extraction backbone with task-specific output heads. With carefully crafted training strategies, their performance on individual benchmarks is comparable to that of expert models trained on the specific dataset. However, they suffer from limited scalability and architectural rigidity. As the number of tasks increases, the heterogeneous design of multiple heads makes optimization increasingly difficult and less robust. Consequently, this paradigm struggles to scale up, resulting in insufficient model generalization. Nevertheless, when deployed on resource-constrained platforms such as satellites, this kind of MTL model remains highly valuable for its computational and storage efficiency.

As general-purpose VLMs increasingly exhibit early signs of the universal model, they have emerged as a scalable paradigm for MTL. In the RS domain, several studies have explored VLM-based MTL. However, their investigations into unified and generalizable paradigms remain limited: The regular VLMs focus primarily on language-centric description tasks such as image captioning and visual question answering, where text descriptions are synthesized for RS images to enable semantic understanding [16,17,18]. Others extend VLMs to purely visual tasks such as visual grounding and object detection, leveraging the flexible language interface of VLMs to learn from abundant localization annotations and achieve precise detection capabilities [19,20]. In addition, several approaches target ultra-high-resolution (UHR) RS image reasoning, often employing token pruning to alleviate the computational burden caused by extremely large inputs [21,22].

These studies collectively highlight the great potential of VLMs for RS MTL, yet each remains constrained within a limited scope. As shown in Figure 1, the first four types of works focus mainly on tasks involving regular images (images with regular resolutions) [16,17,18,19,20], whereas the fifth is specialized for UHR scenarios [21,22]. The detection VLMs [20] and grounded VLMs [19] excel at spatial grounding but pay little attention to semantic understanding, while the regular VLMs [16,17,18] rarely explore crucial object detection capabilities which are essential for RS image analysis. Hence, a unified framework that addresses these limitations in an integrated MTL setting is still lacking.

In this paper, we present a novel foundation model named RSCoVLM (Remote Sensing Cooperatively-trained Vision Language Model). We cooperatively train (co-train) it for multiple tasks in a unified framework that handles the following problems.

Firstly, the large-scale multi-task data must be curated to enable effective MTL. However, RS data are inherently complex, often exhibiting inconsistencies in format, noisy annotations, and heterogeneous bounding box definitions. Therefore, careful data curation is required to construct a well-organized and sustainable data environment for model training.

Secondly, we need to address the challenge of diverse input sizes of RS images. The classification task often uses sizes of a few megapixels, such as 256 × 256. Common object detection models typically use input sizes such as 512 × 512, 800 × 800, or 1024 × 1024. However, UHR images can have widths and heights exceeding 4000 pixels. Therefore, a dynamic resolution strategy is required, along with efficient and highly compatible solutions for UHR scenarios.

Finally, previous VLMs have shown limited capability in object detection tasks. They either perform only sparse visual grounding [17,19], provide detection results for a single category [23], or are evaluated leniently under low IoU thresholds [24]. However, aerial detection is particularly challenging due to issues such as dense object distribution, which places high demands on the visual input resolution and output sequence length of MLLMs. Moreover, VLMs cannot directly output the confidence scores of predicted objects, making it difficult to fairly compare them with traditional models using commonly used evaluation metrics.

To make RSCoVLM a competitive MTL baseline, we address the aforementioned challenges, respectively. Firstly, we create a data curation procedure, compromising the acquisition of raw data, offline processing and integration, as well as online loading and weighting. Moreover, we propose a dynamic resolution strategy that enables the model to simultaneously learn from images of various sizes. To further enhance the reasoning performance on UHR images, we propose the Zoom-in Chain strategy, which mimics how humans reason over UHR images. We also construct a corresponding dataset, LRS-VQA-Zoom, to specifically strengthen this capability. Additionally, we apply VLMs to object detection and propose a fair evaluation method that does not rely on confidence thresholds. Based on this, our RSCoVLM is validated as the first VLM that achieves performance comparable to traditional models on dense aerial detection task.

We evaluate RSCoVLM on multiple tasks across various benchmarks, achieving state-of-the-art performance in all of them. Our unified MTL framework greatly improves the model’s generalization ability, scalability, and usability.

To ensure transparency and reproducibility, we have made all details of this work, including the codes, model weights, and data folder, fully open-source. We will continuously maintain the open-source resources and update them with our latest research progress, aiming to build a user-friendly platform for the community.

The main contributions are summarized as follows:

We present RSCoVLM, a fully open-sourced VLM baseline for RS MTL. The experiment shows that our model achieves leading performance across benchmarks of various datasets and tasks.
We develop the universal framework for RS MTL based on VLM and create the data curation procedure to facilitate unified training across multi datasets of various RS tasks.
We proposed a dynamic resolution strategy for RS, along with the Zoom-in Chain strategy and the LRS-VQA-Zoom dataset to further enhance the model’s reasoning ability on UHR images.
We develop an aerial detection method for RS VLMs and propose an evaluation metric that enables a fair comparison between RS VLMs and conventional methods.

This manuscript is an extended and improved version of our conference paper [20] published in IGARSS 2025, which only investigated VLMs for detection tasks. The autoregressive object detection scheme in Section 3.5 is primarily derived from the conference version. Building upon it, we not only refine detection details in Section 3.5.1 but also further upgrade the VLMs with unified multi-task learning, accompanied by additional methods, models, and experimental results.

2. Related Works

2.1. General-Purpose Vision-Language Models

Early works such as VisualGPT [25], BLIP-2 [26], and Flamingo [11] explored different ways of integrating visual features with large language models or training joint image–text encoders, showing improved multimodal reasoning and understanding.

Later instruction-tuned frameworks, including LLaVA [13], MiniGPT-4 [27], and InstructBLIP [28], further enhanced interactive comprehension by fine-tuning LLMs with visual–text instructions. Lightweight adaptation methods such as LLaMA-Adapter V2 [29] and SPHINX [30] improved efficiency through visual adapters and zero-shot attention fusion, reducing the cost of multimodal alignment.

In parallel, the VisionLLM series [31,32] unified vision-centric tasks under the LLM paradigm, enabling open-ended reasoning over diverse visual inputs. More recent large-scale MLLMs, including PaLI-X [33], MiMo-VL [34], InternVL [12], CogVLM [35], and the Qwen-VL series [36,37], further integrate vision and language through end-to-end pre-training on massive multimodal data and scalable architectures. These models show improved visual grounding, OCR, and cross-domain reasoning, representing a shift from adapter-based fusion toward deeply coupled vision-language modeling. Collectively, these advances lay the foundation for adapting MLLMs to specialized domains such as remote sensing, where complex visual semantics and open-ended reasoning are required.

2.2. Remote Sensing Vision-Language Models

Recently, integrating vision–language models into remote sensing (RS) has attracted growing attention, giving rise to several domain-specific VLMs. GeoChat [17] pioneered this direction as the first RS-oriented VLM, addressing multiple optical imagery tasks via conversational interaction. EarthGPT [23] introduced a unified multimodal framework for multi-source RS data and diverse vision-language tasks. LHRS-Bot [18] leverages multi-level vision-language alignment and curriculum learning for RS image understanding.

Beyond static image interpretation, recent works explore temporal and fine-grained understanding. TEOChat [38] supports temporal Earth observation imagery and instruction following over sequential frames. SkySenseGPT [24] extends instruction tuning to fine-grained RS comprehension, achieving strong performance on public datasets and complex comprehension tasks. VHM [16] demonstrates capabilities on tasks such as building vectorization, multi-label classification, and honest question answering. SkyEyeGPT [39] employs multi-stage tuning to enable region-aware dialogue and grounding. EarthGPT-X [40] introduces spatial visual prompting to unify referring expression comprehension and grounding across heterogeneous RS imagery. RSUniVLM [41] adopts a granularity-oriented mixture-of-experts architecture to support unified image-level, region-level, and pixel-level RS image understanding.

2.3. Remote Sensing Multi-Task Learning

While recent advancements in remote sensing VLMs have enabled versatile multi-task capabilities, their general-purpose architectures often implicitly handle task interdependencies, potentially overlooking the intrinsic challenges of multi-task optimization, such as task interference and imbalance.

One line of research focuses on leveraging shared representations for synergistic task pairing. For example, several studies [42,43,44,45] combine semantic segmentation with change detection from bi-temporal imagery, showing that joint learning enhances feature sharing and reduces redundancy. Another direction jointly models geometric and semantic information, such as height estimation with semantic segmentation [46,47], demonstrating gains over single-task baselines.

Beyond specific task pairs, generalized MTL frameworks have been proposed. RSCoTr [4] performs classification, segmentation, and detection simultaneously, illustrating the potential of unified RS analysis. Large-scale datasets like SatlasPretrain [48] with multiple annotation modalities facilitate advanced MTL model development. SM3Det [15] uses a mixture-of-experts structure for multimodal detection of horizontal and rotated bounding boxes. EarthDial [49] leverages multiple multi-task decoders to transfer knowledge across diverse tasks, enriching shared feature learning.

3. Method

3.1. The Universal RS Multi-Task Framework

As shown in Figure 2, we propose a universal framework for RS MTL based on VLM. The model follows a popular VLM paradigm. It uses a vision encoder and text tokenizer to process images and text inputs, respectively. The unified decoder based on a language model then processes the bi-modal features and perform various tasks, such as RS image scene classification, question answering, captioning, grounding, and object detection.

Specifically, we develop a data curation procedure consisting of data acquisition, offline processing, and online loading, which provides diverse images with textual prompts and golden responses for model training. To enable the model supporting images of arbitrary sizes, we design a dynamic resolution strategy, which handles input images of small, regular and UHR sizes, respectively. The proposed Zoom-in Chain is designed to further enhance reasoning on UHR RS images. The final model can perform multiple tasks simultaneously. With the proposed aerial detection method for vision-language models, the model can perform the challenging aerial detection.

For the language branch, the input text is tokenized into a sequence of indices, where each index i corresponds to a learnable embedding

t_{i} \in R^{D}

. The output sequence is then de-tokenized to produce the final textual response.

For the vision branch, a RS image is preprocessed (e.g., resized or dynamically rescaled) and encoded by a vision Transformer to obtain features

F \in R^{N_{I} \times D_{I}}

, where

N_{I}

and

D_{I}

denote the feature number and dimension. The prompt text is tokenized into

N_{t}

embedding

T_{t} \in R^{N_{t} \times D}

. A bi-modal projection aligns visual embeddings with the language token space, generating

N_{v}

visual tokens

T_{v} \in R^{N_{v} \times D}

, with

N_{v} \propto N_{I}

. The language model input is as follows:

T = concat (T_{v}, T_{t}) \in R^{(N_{v} + N_{t}) \times D},

(1)

where

concat (\cdot, \cdot)

denotes token-wise concatenation.

During training, parameters

θ

are optimized via next-token prediction using cross-entropy loss:

L = - \sum_{j = 1}^{| r |} P_{j} (r, T), P_{j} (r, T) = log P_{θ} (r j ∣ r < j, T),

(2)

where

r = (r_{1}, \dots, r_{T})

is the response token sequence.

During inference, the model generates tokens auto-regressively until an end-of-sequence token is reached:

r_{j} = argmax P_{j} (r, T) or r_{j} \sim P_{j} (r, T),

(3)

where the first denotes deterministic decoding (e.g., greedy or beam search) and the second stochastic sampling (e.g., top-k, nucleus sampling).

3.2. Data Curation Procedure

In contrast to conventional approaches that mainly conduct standardized evaluations on a single benchmark, this section highlights the crucial role of data curation in developing RS MTL models. Given the diversity and complexity of RS data, characterized by heterogeneous formats, noisy annotations, and inconsistent bounding box definitions, a well-curated data recipe is indispensable. To serve as a robust foundation for training RS multi-task VLMs, we design a data curation procedure, which is not a fixed dataset but a comprehensive and sustainable data framework encompassing the following three main parts.

3.2.1. Data Acquisition

In this work, the dataset was curated through three sequential stages. Initially, we collected data by following the data recipes of several representative open-source vision–language models. Specifically, we adopted the description-related subsets from the instruction tuning data of VHM [16] and GeoChat [17], which cover tasks such as image classification, captioning, and visual question answering. These tasks can be further decomposed into subtasks, including modality recognition and resolution estimation. The refGeo [19] dataset was employed as the main grounding data source, while temporal multi-image data were drawn from TEOChatlas [38]. To prevent degradation of the model’s general reasoning ability during continued training, we also incorporated a subset of general-purpose data sampled from LLaVA-OneVision’s recipe [50], including chart interpretation, optical character recognition, and so on. By following these open-source data recipes, we indirectly surveyed and integrated diverse data sources.

Subsequently, we analyzed the limitations of the collected data and expanded the dataset using a task-specific training set. We observed that existing RS VLMs rarely address object detection, which is crucial for fine-grained perception in RS. To fill this gap, we incorporated the DOTA-v1.0 dataset [51], thereby enriching the model’s detection-related learning capabilities.

Finally, for abilities that could not be obtained from open datasets, we constructed a synthetic data pipeline to generate new annotations. To enable the model’s zoom-in chain capability, we curated large-scale RS images and synthesized image–region–question triples. The detailed construction process is described in Section 3.4.

3.2.2. Data Processing and Integrating

Due to the diverse formats and task requirements of the collected datasets, as well as potential systematic noise, we performed additional offline preprocessing to integrate all data into our training framework.

We first removed all task descriptors, such as the “[grounding]”, “[refer]”, and “[identify]” tags used in previous works [16,17]. These descriptors tag the specific tasks. However, in open-world scenarios or novel tasks, instructions are typically expressed in natural language rather than through fixed descriptor tokens. Therefore, we replaced these descriptors with natural language prompts to better align with the real-world usage.

Next, we examined all bounding boxes in the datasets and categorized them into horizontal boxes, oriented boxes, and quadrilateral boxes. Their representations were then unified through consistent normalization and ordering to avoid any information mismatch. Corresponding prompts were designed for each box type. By default, horizontal boxes were used in grounding tasks, while quadrilateral boxes were adopted for detection tasks.

A unified data format was further established to standardize the integration. Conversational data followed the messages structure defined by OpenAI, object detection data were formatted according to the COCO convention, and grounding data adhered to the refGeo [19] schema.

Finally, we performed rule-based cleaning on systematic irregularities, such as removing redundant punctuation and spaces, and correcting typographical errors. For the Zoom-in Chain dataset, we applied tool-call formatting. The evaluation set was also processed in a similar manner to ensure consistency with the training data.

3.2.3. Data Loading and Weighting

After integration, the curated dataset was organized into multiple subset units. During training, we applied online preprocessing and dynamically controlled the sampling ratio of each subset. Consequently, the model was trained in a flexible and adaptive multi-task environment rather than on a fixed, predefined dataset.

We argue that the model should not rely solely on predefined prompts from the training stage. To enhance robustness, multiple agent prompts were designed for certain tasks, and one was randomly selected during training. For grounding and detection data, a unified formatting scheme was adopted. We also incorporated the JSON-based output format used in Qwen2.5-VL [37], accompanied by specific prompts, and randomly switched between standard and JSON outputs during training. The prompts are exhibited in Figure 3, where the colored text indicates placeholder. In addition, a synonym replacement module was implemented to randomly substitute words with their synonyms, improving the model’s linguistic generalization. Standard data augmentation techniques, such as random resizing, were also applied to enhance multi-scale learning.

Each subset unit was assigned a sampling weight to guide data selection during training, analogous to controlling the flow rate of different ingredients in an automatic beverage dispenser. The sampling ratio is critical for multi-task learning: increasing the weight of more challenging tasks facilitates deeper learning, while adjusting the others helps mitigate catastrophic forgetting. In exploring optimal weighting strategies, we first conducted experiments with uniform ratios. Then, we increased the sampling proportion of tasks that underperformed relative to expectations. Finally, once all tasks reached or exceeded satisfactory performance, we fine-tuned the weights to achieve the best overall multi-task balance.

3.3. Dynamic Resolution Strategy

Most existing RS VLMs (such as GeoChat [17], VHM [16], and GeoGround [19]) have the only fixed square input shape (such as

336 \times 336

or

504 \times 504

). For each input image, they first pad the image to a square with zeros on the right or bottom, and then resize it to the input shape. Additionally, LRS-VQA [21] and GeoLLaVA-8k [22] scale the input size to 2 k × 2 k and 8 k × 8 k, respectively. They first cut a UHR RS image into slices of the fixed size and encode them into visual tokens. Then they prune the tokens to an amount comparable to the normal cases. In total, they pre-process the images only on a fixed shape or a small set of image shapes.

The proposed dynamic resolution strategy involves three interconnected aspects: supporting full-size input processing, scaling coordinate precision with input resolution, and curating training data to enhance learning across multiple resolutions.

3.3.1. Full-Scale Visual Input

The native resolution scheme in Qwen2-VL [36] inspired us to advance RS VLMs to accept inputs of arbitrary shapes. As shown in Figure 4, let H and W indicate the height and width of a given RS image.

L_{p a t c h}

is the patch length corresponding to each visual token from the vision encoder. They first calculate the tightest shape that can wrap the input image by

(\hat{H}, \hat{W}) = (⌈ H / L_{p a t c h} ⌉ \times L_{p a t c h}, ⌈ W / L_{p a t c h} ⌉ \times L_{p a t c h}),

(4)

where the

⌈ \cdot ⌉

means the ceiling function. Then, they resize the image to

(\hat{H}, \hat{W})

so that it can be exactly processed by the visual patch embedding.

This strategy allows the model to ingest images of arbitrary input sizes, which is well-suited to the diverse RS data. However, we should still set a range with a minimum scale to ensure adequate visual signal and a maximum scale due to constrained training resources. We divide the image sizes with the two bounds into three parts: small, regular, and UHR large. The small images are enlarged to ensure that there are enough visual tokens for the decoder to understand. For the UHR image, we design a zoom-in chain, which is introduced in Section 3.4.

3.3.2. Scalable Bounding Boxes

For grounded or detection VLMs, spatial localization is achieved by directly generating numerical coordinates within textual outputs, which are extracted through regular expressions during inference.

However, existing RS VLMs often suffer from a mismatch between the coordinate resolution and the input image resolution. For instance, GeoChat [17] processes images at a fixed resolution of 504 × 504, but its coordinate resolution is only 100 × 100, leading to a fivefold loss in localization precision and poor performance on small objects. Conversely, GeoGround [19] employs a 336 × 336 input resolution but defines coordinates at a much higher 1000 × 1000 scale, resulting in more than half of the coordinate space being unused and excessive localization precision.

In this work, we adopt scalable bounding boxes, whose coordinate resolution dynamically aligns with the input image resolution, thereby avoiding both under- and over-precision issues. This design naturally adapts to varying input sizes and allows flexible control of inference cost depending on the required localization accuracy.

3.3.3. Random Resizing

To ensure robust performance across varying input image sizes, we applied dynamic scale augmentation during training. For each task, input images were randomly rescaled to different resolutions. In grounding and detection tasks, the corresponding bounding boxes were synchronously scaled to maintain spatial consistency. We observed that this scale-based augmentation significantly improved the model’s robustness to input-size variation. Moreover, the model trained under such conditions exhibited enhanced performance when performing high-resolution inference. This also enables a practical inference-time strategy, allowing users to adjust image resolution according to task requirements and computational constraints.

3.4. Zoom-In Chain for UHR RS Images

Previous works on understanding UHR RS images, such as LRS-VQA [21] and GeoLLaVA-8k [22], primarily focus on addressing the issue of excessive image tokens through visual token pruning. Although this approach has proven effective and computationally efficient, it typically requires additional training and is not well-suited for joint training with tasks using standard image resolutions.

We observed that when humans analyze UHR RS images on electronic devices, their workflow typically involves first scanning the entire image to identify regions of interest, then zooming into these regions before performing the actual task. Inspired by this workflow, we designed the Zoom-in Chain strategy for RS VLMs, as illustrated below:

\begin{matrix} User : < Prompt > + I_{q} + < Question > \\ Assistant : \begin{matrix} [x 1, y 1, x 2, y 2] \end{matrix} \\ User : Zoom — in (I_{q}, [x 1, y 1, x 2, y 2]) \\ Assistant : \begin{matrix} < Ground Truth > \end{matrix} \end{matrix}

(5)

The blue portions indicate the training labels, while the others are ignored for loss. Specifically, given a UHR RS image, we first downsample the image for initial processing. The model is prompted as seen in Figure 5 with instructions to predict the RoI, which is then cropped and fed into the model in native resolution. The final answer is obtained from both the initial and the new inputs, effectively mimicking the human zoom-in workflow for improved localization and task performance. Since we perform one step for locating and one for task execution, the computational cost is approximately double that of the original inference.

To enable the model to learn zoom-in capabilities during training, we construct a specialized instruction tuning dataset for UHR RS image perception, named LRS-VQA-Zoom. The data pipeline is initiated by collecting three public, large-scale UHR RS datasets: DOTA1.0 [51], GLH-Bridge [52], and STAR [53].

The methodology for generating the LRS-VQA-Zoom is extended from the pipeline in LRS-VQA [21]. The final training corpus, totaling 302 k samples, comprises three distinct subsets: 60 k open-ended samples generated via rule-based templates, 159 k open-ended samples synthesized using GPT-4V, and 83 k samples in multi-choice-query form. Figure 6 exhibits the examples from each subset.

3.4.1. Template-Generated Data (60 k)

This subset focuses on two open-ended question categories: counting and comparison. For the counting data, the UHR image is first divided into a

3 \times 3

grid (nine regions). Depending on the density of the target category, questions are formulated to query either the total count across the entire image or the count within a specific region. For the comparison data, these tasks involve comparing the relative quantities of two different object categories. For all samples in this subset, the absolute coordinates of the corresponding bounding boxes are preserved in the training data.

3.4.2. GPT-4V-Synthesized Data (159 k)

This subset is designed to introduce greater question diversity. First, we filter the original object detection labels to identify unique target instances, which serve as “unique references”. Subsequently, the “coarse region” around each unique reference is cropped by applying a predefined padding margin. The dimensions of these coarse regions are suitable for processing by the GPT-4V model. We then prompt GPT-4V to generate diverse question–answer pairs based on these cropped regions. This process yields a rich variety of question types, including queries related to color, category, shape, status, spatial reasoning, and scene context (e.g., rural/urban). In this part of data, the coordinates of the horizontal bounding box defining the coarse region are recorded.

3.4.3. Multi-Choice-Query Data (83 k)

To enhance the model’s proficiency with mainstream evaluation formats (i.e., multiple-choice query (MCQ)) and to further diversify the training data, we converted a subset of 83 k open-ended question answering samples into an MCQ format using an automated pipeline centered around the large language models. For each question–answer pair, excluding simple binary (yes/no) queries, we prompted the GPT-4 to generate three plausible but incorrect “distractors” and return them alongside the original correct answer in a structured JSON format. This output was then systematically validated to ensure it contained four unique options. Finally, to prevent positional bias, the options were randomly shuffled, and the sample was formatted to include the question, four choices prefixed with letters (A, B, C, D), and the letter corresponding to the ground truth answer.

3.5. Aerial Detection Method for Vision-Language Models

In this paper, we investigate multi-class oriented aerial object detection. To enable the RS VLM to perform dense detection in aerial images, we propose a detection paradigm for vision-language models, representing detection outputs directly in textual form, as illustrated in the right part of Figure 2. Specifically, we propose a normalization procedure for model responses and a novel evaluation metric to facilitate fair comparisons between the RS VLM detectors and conventional detectors.

3.5.1. Response Normalization

In the aerial object detection task, each object is represented by its class label and an eight-parameter quadrilateral bounding box

o = (n_{o}, x_{1 o}, y_{1 o}, x_{2 o}, y_{2 o}, x_{3 o}, y_{3 o}, x_{4 o}, y_{4 o})

, where

(x_{i o}, y_{i o})

denote the coordinates of the polygon vertices in clockwise order. The vertex with the smallest vertical coordinate is designated as the starting point. The class label

n_{o}

corresponds to one of the c predefined categories

{C_{1}, C_{2}, . . ., C_{c}}

.

To standardize detection annotations, a consistent template is employed to ensure both uniqueness and order. For each input image, the model outputs detected objects in a structured sequence. Specifically, detection results are first grouped by category and sorted alphabetically by category name. Within each category, the bounding boxes are further ordered according to the position of their designated starting vertex.

During our extension of LMMRotate [20], we observed a subtle yet important issue. In the LMMRotate, images without any objects were removed from the training set to improve efficiency, following common practices in conventional aerial detector. However, this approach can be detrimental when training a VLM, as encountering object-free images during inference often leads the model to hallucinate, producing false positive detections. To address this, RSCoVLM retains images without objects in the training process and explicitly trains the model to output “There is none.” for such cases, thereby mitigating hallucinations and improving detection reliability.

Our VLM is capable of detecting multiple object categories within an aerial image, with both category labels and bounding box coordinates included in its output. During inference, detection results can be retrieved directly from the model response using straightforward regular expression parsing. Furthermore, unlike most traditional detectors that require postprocessing procedures such as non-maximum suppression (NMS) to address overlapping or redundant detections, the VLM inherently avoids these issues.

3.5.2. Evaluation Metrics

In conventional aerial detection tasks, mean average precision (mAP) is widely employed as the evaluation metric, requiring bounding boxes, class labels, and confidence scores for all detected objects. However, as discussed earlier, our model responses only include object categories and their corresponding spatial coordinates, which implies that vision-language models based on the proposed detection approach cannot directly produce mAP results.

In mAP calculation, confidence scores are used to rank detector predictions, which directly influences the accumulation of true and false positives along the PR curve. The conventional object detection models often retain low-confidence yet actually false-positive predictions to maintain higher mAP scores. For instance, several DETR-based models evaluate all 900 proposals per image even when only a few objects are present. Since vision-language models typically yield only a limited number of confident predictions, we are constrained to assign fixed or randomized confidence scores to enable mAP computation. However, the lack of well-calibrated confidence estimates still places vision-language models at a substantial disadvantage in mAP-based evaluation with the adaptation, even when visual inspection suggests that VLMs achieve performance comparable to conventional detectors.

In practice, an object detection prediction is composed of a location and a category. The confidence scores are primarily used to filter out low-confidence detections via thresholding, as part of the detection result visualization procedure. If we first filter out low-confidence predictions using a threshold and then randomize the confidence scores, we could remove the influence of confidence on the mAP calculation. We denoted this as mAP with no conference scores, i.e.,

{mAP}_{nc}

.

Figure 7 illustrates the

{mAP}_{nc}

of several detectors. The horizontal axis illustrates the variation of

{mAP}_{nc}

with increasing filtering thresholds. For each detector, the validation results were computed over ten runs with different random seeds, and the solid line represents the mean of these ten runs. The solid line in the figure is enveloped by a light-colored error band to indicate the effect of randomness on the results. Initially, as the filtering threshold increases, the detector’s performance improves because low-confidence false positives are progressively filtered out. After reaching a peak, overly strict filtering begins to remove true positives as well, causing the results to decline. Notably, this peak is still lower than the mAP metric that incorporates confidence scores. Furthermore, the error band is barely visible without magnification, indicating that although randomness is involved, the variance of

{mAP}_{nc}

is very small.

Instead of introducing an additional mechanism to estimate confidence for VLM-based detectors, we argue that confidence should not be a prerequisite when evaluating or comparing detection performance between VLMs and conventional detectors. Detection annotations and outputs inherently consist of class labels and bounding boxes, while confidence scores are auxiliary byproducts generated during inference. They may facilitate postprocessing but are not indispensable for evaluating model accuracy. Therefore, we advocate employing confidence-independent metrics such as mean F1-score (

{mF}_{1}

) and

{mAP}_{nc}

for a more equitable evaluation. Additionally, the small variance of

{mAP}_{nc}

also demonstrates its stability as a metric.

Finally, for benchmarks such as DOTA [51] and FAIR1M [54], where public test sets are unavailable and online evaluation servers rely solely on mAP, we recommend adopting

{mAP}_{nc}

as the primary evaluation metric to ensure consistent and fair assessment across different model types.

4. Experiment

In this section, the RSCoVLM is evaluated on benchmarks across various tasks, demonstrating the promising multi-task capabilities. We firstly provide detailed implementation specifications to facilitate reproducibility. Then, we compare our model with state-of-the-art methods on various RS understanding and perception tasks with different input resolutions.

Figure 8 presents the demonstration of RSCoVLM’s capabilities on several commonly used tasks. Notably, all tasks are accomplished using a single RSCoVLM model, demonstrating its impressive multi-task capability.

4.1. Reproducibility Details

We use Qwen2.5-VL-7B-Instruct [37] as the foundation model of RSCoVLM. The model is optimized with AdamW, employing a weight decay of 0.1. We train the full model with a base learning rate of 2 × 10⁻⁶, following a cosine learning rate schedule with a linear warmup over the first 5% of training steps. The total batch size is set to 32, and the maximum sequence length is 6144 tokens. The input images are constrained to resolutions between 224 × 224 and 1008 × 1008 pixels. During training, we set the maximum length of the token sequence to 6144, with tokens beyond this limit being truncated. Our design already accounts for constraints related to object quantity and context window size, ensuring sufficient context is available.

We have released the codebase on the Remotesensing 18 00222 i001

GitHub repository and uploaded the whole well-collected data folder and model weights to the Remotesensing 18 00222 i002

HuggingFace repository. The codebase is implemented concisely, leveraging resource-efficient and effective training techniques. To save GPU memory, we adopt DeepSpeed-ZeRO-Stage-1 [55] and gradient checkpointing. For improved computational efficiency, we utilize BFloat16 precision and Flash-Attention-2 [56] during both training and evaluation. Additionally, Liger Kernel [57] is employed to accelerate training, and vLLM [58] is used for faster inferencing. All experiments are conducted on VolcEngine high-performance computing clusters equipped with NVIDIA A800 GPUs. We will maintain the repositories and update the latest code, model and data in our future research progress.

4.2. Evaluation on Large RS Imagery

4.2.1. Benchmark and Metric

The LRS-VQA [21] is the latest visual question answering benchmark for large RS images. It features 7333 question–answer pairs across 8 categories, including count, color, category, shape, status, reasoning, rural/urban classification, and target background. The images in this benchmark reach up to 27,328 pixels in length and have an average size of 7099 × 6329 pixels.

There are three subsets, corresponding to three data sources: FAIR1M [54], GLH-Bridge [52], and STAR [53]. The official scoring implementation first calculates accuracy for each source and task, and then computes average accuracy (AA) across tasks for each source. The AAs for each subset are reported.

4.2.2. Results

The results are presented in Table 1. The maximum pixel numbers of each models are also provided. It can be seen that the average pixel number of LRS-VQA (about 45 million) has been larger than the largest pixels uplimit (16.8 million for Qwen3-VL [37]).

As shown in the table, the proposed Zoom-in Chain approach substantially enhances the model’s performance, achieving an overall improvement of 35% compared to the baseline that solely infers.

Furthermore, our model demonstrates stronger foundational capabilities than other competing models, approaching the performance of the leading Qwen3-VL-8B [37], while utilizing a slightly smaller parameter count and a significantly lower maximum input resolution. Our model also outperforms other RS foundation models, including GeoChat and the officially fine-tuned LLaVA-Next model for LRS-VQA [21].

The table further reveals that the performance gain achieved by the Zoom-in Chain on the LRS-Bridge subset is less pronounced compared to the other two subsets. We attribute this to the fact that this subset focuses solely on the bridgecategory. Given the slender, elongated structure and often limited pixel coverage (many within 200–300 pixels) of bridges, accurately localizing them is inherently more challenging for the model. In contrast, the other two subsets involve detecting multiple object categories.

4.3. Evaluation on Visual Grounding

4.3.1. Benchmark and Metric

We follow GeoGround [19] for visual grounding evaluation because of its strong emphasis on comprehensiveness, fairness, and transparency. The evaluation incorporates the validation and test sets of DIOR-RSVG and RSVG [64], the visual grounding portions of GeoChat-Bench [17] and VRSBench [65], as well as AVVG benchmark [19] for images captured by unmanned aerial vehicle. The evaluation details have strictly aligned with GeoGround. We directly adopted the splits and annotations provided by GeoGround for all benchmarks [19].

We follow common practice to utilize Acc@0.5 as the evaluation metric, which regards the predicting that has an Intersection over Union (IoU) greater than 0.5 with the ground truth as a successful localing.

4.3.2. Results

Table 2 presents the results, along with the corresponding input sizes for each model. The input resolutions of existing RS VLMs, including GeoChat [17], LHRS-Bot [18], VHM [16], and GeoGround [19], are fixed and typically smaller than 512 × 512. In contrast, only general-purpose VLMs such as Qwen2.5-VL [37] and MiMo-VL [34] support dynamic input resolution, enabling flexible adaptation to varying input sizes.

Our model demonstrates substantially superior performance across all benchmarks. It surpasses the previously best-performing visual-language model specialized for RS grounding, GeoGround, by approximately 25.7%, and outperforms all baselines that were supervised-finetuned on refGeo.

We further conducted experiments using fixed low-resolution inputs to intentionally weaken our model’s performance. Even at the minimal input size of 224 × 224, our model maintains strong capability; however, such a small resolution severely limits image clarity, causing small objects to occupy only a few pixels and become indistinguishable. In particular, performance on AVVG drops sharply, indicating that a 224 × 224 resolution is highly impractical for RS grounding. When evaluated at 336 × 336, which aligns with the input size of other comparison methods, our model still achieves state-of-the-art results.

We attribute this performance advantage to three primary factors. First, the support for dynamic input resolution allows the model to perform inference at native resolution without downsampling, preserving visual detail. Second, the multi-resolution augmentation strategy employed during training enables the model to generalize effectively across diverse resolutions and computational budgets. Finally, auxiliary localization-related tasks, such as object detection and zoom-in refinement, further strengthen the model’s grounding ability and robustness.

4.4. Evaluation on Object Detection

4.4.1. Benchmark, Metric, and Comparison Setting

We selected the most widely used aerial image object detection benchmark, DOTA-v1.0 [51], for our evaluation. The whole DOTA-v1.0 dataset comprises 2806 high-resolution aerial images and 188,282 object instances across 15 common categories. The proportions of testing set are 1/3. These images were collected from multiple sensors and platforms, and each instance is annotated with an 8 degrees-of-freedom oriented bounding box, capturing the wide variations in object scale, shape, and orientation typical of aerial imagery.

We adopt the Average Precision with no confidence (

{AP}_{nc}

) and report three specific variants:

{AP}_{nc 50}

(IoU threshold is 0.50),

{AP}_{nc 75}

(IoU threshold is 0.75), and

{AP}_{nc 50 : 95}

(the average

{AP}_{nc}

computed over IoU thresholds from 0.50 to 0.95 at increments of 0.05). The evaluation is based on the standard MMRotate [66] evaluation procedure. And the splitting length is set 512 with an overlap of 100.

The conventional object detection baselines are trained using the latest MMRotate [66], and the details necessary for reproducibility are also provided in the released code. We obtain a reasonable

{AP}_{nc}

for comparison methods using the following procedure: We first select a threshold for confidence scores to filter out low-score predictions, and then randomize (or, set to 1) the remaining prediction scores. The AP computed under this condition is denoted as

{AP}_{nc}

of the conventional detector. To determine an appropriate threshold for each detector, we evaluate

{AP}_{nc}

on the validation set by varying the confidence threshold from 0.00 to 0.95 in increments of 0.05, and select the threshold that yields the highest

{AP}_{nc}

for subsequent evaluation on the test set.

4.4.2. Results

We compare our model with state-of-the-art RS object detection methods in Table 3. Our multi-task model achieves detection performance comparable to conventional detectors, even though it is not specifically optimized for the single dataset as the comparison methods are. When trained solely on object detection data, denoted as RSCoVLM-det, the model exhibits further improvement and even surpasses half of the conventional methods. It is a remarkable achievement for RS vision-language models.

Thanks to the dynamic resolution strategy, our model can further enhance detection performance by maximizing the inference scale, referred to as the “Max Mode.” Specifically, each input image is upsampled to the model’s upper input limit of 1008 × 1008, and the outputs are then downsampled back to the original scale for evaluation. We observe a substantial increase in overall

{AP}_{nc}

, although certain categories such as plane (PL) and bridge (BD) experience minor degradation. The enhanced RSCoVLM-det even outperforms all competing approaches, while the conventional detectors are trained and evaluated at fixed resolutions without such a feature of test-time augmentation.

To the best of our knowledge, only two existing RS VLMs, LMMRotate [20] (our conference version) and Falcon [74], are capable of performing aerial object detection effectively. However, their common foundation model, Florence-2 [75], employs a fixed and relatively large input size of 1024 × 1024, which already exceeds the input limit of RSCoVLM. Moreover, LMMRotate is trained specifically for detection, while Falcon performs well only on its training set and does not report test results. In addition, Falcon requires multiple inferences per image, making separate predictions for each category, which results in extremely high computational cost. Therefore, we consider RSCoVLM to be the only vision-language model capable of performing multiple tasks while achieving detection performance that is fairly comparable to specialized object detection models, currently.

4.5. Evaluation on Scene Classification

4.5.1. Benchmark and Metric

We evaluate our model on five standard remote-sensing scene-classification benchmarks. The AID [76] dataset comprises approximately 10,000 images of size 600 × 600 pixels across 30 classes. The UCMerced [77] dataset consists of 2100 images of size 256 × 256 pixels covering 21 classes. The NWPU-RESISC45 [78] dataset contains 31,500 images of size 256 × 256 pixels across 45 classes, with large variation in resolution and scene complexity. The WHU-RS19 [79] dataset includes around 1000 high-resolution patches of size 600 × 600 pixels spanning 19 classes. The METER-ML [80] benchmark offers a large-scale multi-sensor setup with varied image sizes for extended generalisation evaluation. Together these benchmarks allow a robust assessment of our model’s generalisation across dataset scale, class-set size, imaging conditions and spatial resolutions.

We report overall accuracy of the test set for each benchmark. For METER-ML, NWPU-RESISC45, and WHU-RS19, we adopt the test set splits defined by VHM [16]. For UCMerced, we follow the split defined by GeoChat [17]. For AID, we present results using both the VHM and GeoChat splits to facilitate fair comparison.

4.5.2. Results

Table 4 presents the comparative results across the five scene classification benchmarks. The compared methods include classical VLM baselines (MiniGPTv2 [81] and LLaVA-1.5 [59]), leading open-source VLMs (the QwenVL [37], InternVL [82], and MiMo-VL [34] series), and latest RS VLMs (GeoChat [17], TEOChat [38], LHRS-Bot-Nova [83], SkysenseGPT [24], VHM [16], and ScoreRS [84]). As shown, our model consistently surpasses all compared approaches across all benchmarks.

4.6. Evaluation on Visual Question Answering

4.6.1. Benchmark and Metric

We evaluate our model’s visual question answering capability using two established benchmarks in the RS domain, including the RSVQA benchmark [85] and the VQA portion of VRSBench [65]. The RSVQA comprises two subsets of image–question–answer triplet derived from high-resolution (HR) orthorectified imagery and low-resolution (LR) RS data, enabling evaluation of model reasoning across spatial scales. The VRSBench dataset is a large-scale vision-language benchmark for RS image understanding that comprises 37,408 question-answer pairs in test set, supporting a broad range of understanding instructions. The standard question answering accuracy is used as the metric.

4.6.2. Results

Table 5 presents the results of visual question answering, demonstrating the strong understanding and conversational capabilities of our model. Our approach surpasses all open-source VLMs (including the LLaVA, Qwen, InternVL, and MiMo-VL series) as well as RS VLMs (GeoChat [17], LHRS-Bot-Nova [83], and VHM [16]) across the two benchmarks. In the zero-shot question answering evaluation on VRSBench-VQA [65], our model outperforms the latest general-purpose models, showing superior generalization ability on RS image question answering.

5. Discussion

5.1. Failure Cases Analysis

The left part of Figure 9 illustrates a failure case of the proposed Zoom-in Chain applied to UHR image question answering. The locating for the presented question is multi-step: the model should first localize all bridges in the image, identify the smallest one, and then ground the query (“the larger end”) within that specific bridge region. Since our method performs grounding in one round, it is insufficient for such multi-step localization. As a result, the model grounded the query at one end of a certain bridge. The grounded region corresponds to an urban green area, leading to an incorrect answer. It reveals that our current approach is still insufficient to handle more complex reasoning scenarios, and requires longer visual reasoning chains to support the model’s sophisticated cognitive processes.

The right part of Figure 9 presents a failure case of oriented object detection. The scene is a dense parking lot containing small, tightly packed vehicle targets, which has commonly been recognized as a challenging scenario in RS object detection. As shown, our model successfully detects the majority of vehicles. However, in the bottom-right corner, some detections are slightly offset due to interference from vehicle shadows, and a few instances of missed detection can be observed within the area marked by the white box. It indicates that our model requires further evolution to enhance the detection capability for dense and small objects.

5.2. Limitations and Outlook

In this section, we discuss several limitations of the present work and provide an outlook for future research.

Long-chain Reasoning Ability: The RS image understanding tasks could benefit from longer reasoning chains. The current framework and collected data only support single-pass output for a given task. As described in Section 5.1, for the “Zoom-in Chai”, our model still fail in grounding or produce incorrect answers due to an insufficient number of reasoning steps. Future work could design supervisory signals that guide the model’s multi-step reasoning process, thereby enhancing its deliberative capabilities and enabling it to handle more complex tasks.

Limitation of Context Length: In this work, the adopted bounding box representation consumes a substantial number of tokens. Due to computational constraints, we set the patch size to

512 \times 512

for cropping the DOTA dataset, thereby controlling the number of objects to stay within the context window. Future work could investigate more lightweight coordinate representations or explore methods to reduce the computational overhead in long-context scenarios, which would help mitigate this issue.

Detection efficiency: The proposed aerial detection methods based on autoregressive vision-language models still require serial output of objects. Although the performance has already been comparable to that of conventional detectors, there remains significant room for improvement in inference efficiency. Future work could explore multi-token prediction or investigate training-free parallel inference strategies to flexibly address this issue.

Task Interference: Despite carefully designed prompts, data sampling strategies, and training schedules, we continue to grapple with task interference in multi-task joint training, i.e., gains in one task often come at the expense of another. Although our model already achieves state-of-the-art results across multiple tasks, its full potential remains constrained by such inter-task competition. Future work could focus on explicitly modeling or mitigating this interference to unlock further performance improvements.

6. Conclusions

In this paper, we introduce RSCoVLM, the latest generation of versatile vision-language model. We carefully curated RS data, detailing the processes of data collection, offline integration, and online loading with adaptive weighting. To handle the wide range of image resolutions in RS images, we developed a dynamic resolution strategy and proposed the Zoom-in Chain mechanism with the LRS-VQA-Zoom dataset for ultra-high-resolution images. Moreover, we improved the model’s object detection capabilities and designed a fair evaluation protocol for comparison with conventional methods. Comprehensive experiments show that RSCoVLM consistently delivers state-of-the-art results across multiple tasks, surpassing previous RS VLMs and matching task-specific expert models. By releasing all code, models, and datasets, we aim to enable reproducibility and foster progress toward general-purpose remote sensing models.

Author Contributions

Conceptualization, Q.L. and X.Y.; methodology, Q.L. and X.L.; software, Q.L. and S.M.; validation, S.M. and Q.L.; formal analysis, X.W. and F.W.; investigation, X.H.; resources, X.Y.; data curation, J.L. and Y.Z.; writing, Q.L., S.M. and J.L.; visualization, Y.Y.; supervision, Y.C.; project administration, Y.Z. and X.Y.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China under the Grant 62371169 and 62506229, Natural Science Foundation of Shanghai under 25ZR1402268, and Shanghai QiYuan Innovation Foundation.

Data Availability Statement

All the data and the usage can be found at the Remotesensing 18 00222 i003

HuggingFace repository.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, L.; Zhang, L. Artificial Intelligence for Remote Sensing Data Analysis: A review of challenges and opportunities. IEEE Geosci. Remote Sens. Magaz. 2022, 10, 270–294. [Google Scholar] [CrossRef]
Zhou, Y.; Feng, L.; Ke, Y.; Jiang, X.; Yan, J.; Yang, X.; Zhang, W. Towards Vision-Language Geo-Foundation Models: A Survey. arXiv 2024, arXiv:2406.09385. [Google Scholar]
Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Know. Data Engin. 2022, 34, 5586–5609. [Google Scholar] [CrossRef]
Li, Q.; Chen, Y.; He, X.; Huang, L. Co-training transformer for remote sensing image classification, segmentation, and detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984. [Google Scholar] [CrossRef]
Han, J.; Gong, K.; Zhang, Y.; Wang, J.; Zhang, K.; Lin, D.; Qiao, Y.; Gao, P.; Yue, X. OneLLM: One Framework to Align All Modalities with Language. arXiv 2023, arXiv:2312.03700. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
Li, Q.; Chen, Z.; Wang, W.; Wang, W.; Ye, S.; Jin, Z.; Chen, G.; He, Y.; Gao, Z.; Cui, E.; et al. OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text. In Proceedings of the The Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 23716–23736. [Google Scholar]
Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24185–24198. [Google Scholar]
Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 34892–34916. [Google Scholar]
Wang, D.; Zhang, J.; Xu, M.; Liu, L.; Wang, D.; Gao, E.; Han, C.; Guo, H.; Du, B.; Tao, D.; et al. MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11632–11654. [Google Scholar] [CrossRef]
Li, Y.; Li, X.; Li, Y.; Zhang, Y.; Dai, Y.; Hou, Q.; Cheng, M.M.; Yang, J. Sm3det: A unified model for multi-modal remote sensing object detection. arXiv 2024, arXiv:2412.20665. [Google Scholar]
Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.S. Vhm: Versatile and honest vision language model for remote sensing image analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6381–6388. [Google Scholar]
Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. GeoChat: Grounded Large Vision-Language Model for Remote Sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 27831–27840. [Google Scholar]
Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 440–457. [Google Scholar]
Zhou, Y.; Lan, M.; Li, X.; Feng, L.; Ke, Y.; Jiang, X.; Li, Q.; Yang, X.; Zhang, W. GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv 2024, arXiv:2411.11904. [Google Scholar]
Li, Q.; Chen, Y.; Shu, X.; Chen, D.; He, X.; Yu, Y.; Yang, X. A Simple Aerial Detection Baseline of Multimodal Language Models. arXiv 2025, arXiv:2501.09720. [Google Scholar] [CrossRef]
Luo, J.; Zhang, Y.; Yang, X.; Wu, K.; Zhu, Q.; Liang, L.; Chen, J.; Li, Y. When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning. arXiv 2025, arXiv:2503.07588. [Google Scholar]
Wang, F.; Chen, M.; Li, Y.; Wang, D.; Wang, H.; Guo, Z.; Wang, Z.; Shan, B.; Lan, L.; Wang, Y.; et al. GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution. arXiv 2025, arXiv:2505.21375. [Google Scholar]
Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv 2024, arXiv:2406.10100. [Google Scholar]
Chen, J.; Guo, H.; Yi, K.; Li, B.; Elhoseiny, M. VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 16–24 June 2022; pp. 18030–18040. [Google Scholar]
Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 19730–19742. [Google Scholar]
Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 49250–49267. [Google Scholar]
Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv 2023, arXiv:2304.15010. [Google Scholar]
Lin, Z.; Liu, C.; Zhang, R.; Gao, P.; Qiu, L.; Xiao, H.; Qiu, H.; Lin, C.; Shao, W.; Chen, K.; et al. SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models. arXiv 2023, arXiv:2311.07575. [Google Scholar] [CrossRef]
Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 61501–61513. [Google Scholar]
Wu, J.; Zhong, M.; Xing, S.; Lai, Z.; Liu, Z.; Chen, Z.; Wang, W.; Zhu, X.; Lu, L.; Lu, T.; et al. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 69925–69975. [Google Scholar]
Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J.; Ruiz, C.R.; Goodman, S.; Wang, X.; Tay, Y.; et al. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv 2023, arXiv:2305.18565. [Google Scholar]
Yue, Z.; Lin, Z.; Song, Y.; Wang, W.; Ren, S.; Gu, S.; Li, S.; Li, P.; Zhao, L.; Li, L.; et al. MiMo-VL Technical Report. arXiv 2025, arXiv:2506.03569. [Google Scholar]
Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Song, X.; et al. CogVLM: Visual Expert for Pretrained Language Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 121475–121499. [Google Scholar]
Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
Irvin, J.A.; Liu, E.R.; Chen, J.C.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. Teochat: A large vision-language assistant for temporal earth observation data. arXiv 2024, arXiv:2410.06234. [Google Scholar] [CrossRef]
Zhan, Y.; Xiong, Z.; Yuan, Y. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77. [Google Scholar] [CrossRef]
Zhang, W.; Cai, M.; Ning, Y.; Zhang, T.; Zhuang, Y.; Lu, S.; Chen, H.; Li, J.; Mao, X. EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4709221. [Google Scholar]
Liu, X.; Lian, Z. Rsunivlm: A unified vision language model for remote sensing via granularity-oriented mixture of experts. arXiv 2024, arXiv:2412.05679. [Google Scholar]
Cui, F.; Jiang, J. MTSCD-Net: A network based on multi-task learning for semantic change detection of bitemporal remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103294. [Google Scholar] [CrossRef]
Wang, Y.; Zhao, L.; Hu, Y.; Dai, H.; Zhang, Y. Multitask semantic change detection guided by spatiotemporal semantic interaction. Sci. Rep. 2025, 15, 16003. [Google Scholar] [CrossRef]
Niu, Y.; Guo, H.; Lu, J.; Ding, L.; Yu, D. SMNet: Symmetric multi-task network for semantic change detection in remote sensing images based on CNN and transformer. Remote Sens. 2023, 15, 949. [Google Scholar] [CrossRef]
Lin, H.; Wang, X.; Li, M.; Huang, D.; Wu, R. A multi-task consistency enhancement network for semantic change detection in HR remote sensing images and application of non-agriculturalization. Remote Sens. 2023, 15, 5106. [Google Scholar] [CrossRef]
Lu, M.; Liu, J.; Wang, F.; Xiang, Y. Multi-task learning of relative height estimation and semantic segmentation from single airborne RGB images. Remote Sens. 2022, 14, 3450. [Google Scholar] [CrossRef]
Carvalho, M.; Le Saux, B.; Trouvé-Peloux, P.; Champagnat, F.; Almansa, A. Multitask learning of height and semantics from aerial images. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1391–1395. [Google Scholar] [CrossRef]
Bastani, F.; Wolters, P.; Gupta, R.; Ferdinando, J.; Kembhavi, A. SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 16772–16782. [Google Scholar]
Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, Shanghai, China, 15–18 October 2025; pp. 14303–14313. [Google Scholar]
Li, B.; Zhang, Y.; Guo, D.; Zhang, R.; Li, F.; Zhang, H.; Zhang, K.; Zhang, P.; Li, Y.; Liu, Z.; et al. LLaVA-OneVision: Easy Visual Task Transfer. Trans. Mach. Learn. Res. 2025. [Google Scholar]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Li, Y.; Luo, J.; Zhang, Y.; Tan, Y.; Yu, J.G.; Bai, S. Learning to Holistically Detect Bridges From Large-Size VHR Remote Sensing Imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 44, 7778–7796. [Google Scholar] [CrossRef]
Li, Y.; Wang, L.; Wang, T.; Yang, X.; Luo, J.; Wang, Q.; Deng, Y.; Wang, W.; Sun, X.; Li, H.; et al. STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery. arXiv 2024, arXiv:2406.09410. [Google Scholar] [CrossRef]
Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogram. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Georgia, 9–19 November 2020. [Google Scholar]
Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 11 May 2024. [Google Scholar]
Hsu, P.L.; Dai, Y.; Kothapalli, V.; Song, Q.; Tang, S.; Zhu, S.; Shimizu, S.; Sahni, S.; Ning, H.; Chen, Y. Liger-Kernel: Efficient Triton Kernels for LLM Training. In Proceedings of the Championing Open-Source DEvelopment in ML Workshop @ ICML25, Vancouver, BC, Canada, 19 July 2025. [Google Scholar]
Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, New York, NY, USA, 23–26 October 2023; pp. 611–626. [Google Scholar] [CrossRef]
Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Zhang, Y.; Liu, Y.; Guo, Z.; Zhang, Y.; Yang, X.; Zhang, X.; Chen, C.; Song, J.; Zheng, B.; Yao, Y.; et al. LLaVA-UHD v2: An MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer. arXiv 2024, arXiv:2412.13871. [Google Scholar]
Wang, W.; Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Zhu, J.; Zhu, X.; Lu, L.; Qiao, Y.; et al. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. arXiv 2024, arXiv:2411.10442. [Google Scholar] [CrossRef]
Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv 2025, arXiv:2504.10479. [Google Scholar]
Wang, W.; Gao, Z.; Gu, L.; Pu, H.; Cui, L.; Wei, X.; Liu, Z.; Jing, L.; Ye, S.; Shao, J.; et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv 2025, arXiv:2508.18265. [Google Scholar]
Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
Li, X.; Ding, J.; Elhoseiny, M. VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. arXiv 2024, arXiv:2406.12384. [Google Scholar] [CrossRef]
Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. MMRotate: A Rotated Object Detection Benchmark using PyTorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022. [Google Scholar]
Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021. [Google Scholar]
Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
Yao, K.; Xu, N.; Yang, R.; Xu, Y.; Gao, Z.; Kitrungrotsakul, T.; Ren, Y.; Zhang, P.; Wang, J.; Wei, N.; et al. Falcon: A Remote Sensing Vision-Language Foundation Model. arXiv 2025, arXiv:2503.11070. [Google Scholar] [CrossRef]
Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a unified representation for a variety of vision tasks. arXiv 2023, arXiv:2311.06242. [Google Scholar] [CrossRef]
Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, New York, NY, USA, 2–5 November 2010; pp. 270–279. [Google Scholar] [CrossRef]
Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
Dai, D.; Yang, W. Satellite Image Classification via Two-Layer Sparse Coding With Biased Image Representation. IEEE Trans. Geosci. Remote Sens. 2011, 8, 173–176. [Google Scholar] [CrossRef]
Zhu, B.; Lui, N.; Irvin, J.; Le, J.; Tadwalkar, S.; Wang, C.; Ouyang, Z.; Liu, F.Y.; Ng, A.Y.; Jackson, R.B. METER-ML: A multi-sensor earth observation benchmark for automated methane source mapping. arXiv 2022, arXiv:2207.11166. [Google Scholar]
Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478. [Google Scholar]
Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Cui, E.; Zhu, J.; Ye, S.; Tian, H.; Liu, Z.; et al. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv 2024, arXiv:2412.05271. [Google Scholar]
Li, Z.; Muhtar, D.; Gu, F.; He, Y.; Zhang, X.; Xiao, P.; He, G.; Zhu, X. Lhrs-bot-nova: Improved multimodal large language model for remote sensing vision-language interpretation. ISPRS J. Photogramm. Remote Sens. 2025, 227, 539–550. [Google Scholar] [CrossRef]
Muhtar, D.; Zhang, E.; Li, Z.; Gu, F.; He, Y.; Xiao, P.; Zhang, X. Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025. [Google Scholar]
Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]

Figure 1. Comparisons with existing MTL methods across the resolutions of input images, network architectures, and supported tasks (i.e., detection, grounding, description, and classification).

Figure 2. Overall schematic diagram of the proposed method. The overall RS MTL framework based on VLM is presented in Section 3.1. The data curation procedure is introduced in Section 3.2. The dynamic resolution strategy is proposed in Section 3.3. We introduce the proposed Zoom-in Chain strategy and the corresponding LRS-VQA-Zoom dataset in Section 3.4. Finally, we describe the aerial detection scheme and propose the fair metric

{AP}_{nc}

in Section 3.5.

Figure 2. Overall schematic diagram of the proposed method. The overall RS MTL framework based on VLM is presented in Section 3.1. The data curation procedure is introduced in Section 3.2. The dynamic resolution strategy is proposed in Section 3.3. We introduce the proposed Zoom-in Chain strategy and the corresponding LRS-VQA-Zoom dataset in Section 3.4. Finally, we describe the aerial detection scheme and propose the fair metric

{AP}_{nc}

in Section 3.5.

Figure 3. The multiple agent prompts designed for object detection task. We exhibit two of the multiple prompts for each bounding box type.

Figure 4. Schematic diagram of the native resolution input.

Figure 5. The input prompt for Zoom-in Chain. The Blue “Question” is a placeholder for the user prompt.

Figure 6. Examples of the three types of annotated data in the proposed LRS-VQA-Zoom.

Figure 7. The impact of confidence scores on

{mAP}_{nc}

with error bands. The colored lines record the variation trends of

{mAP}_{nc}

for the popular conventional detectors on DOTA-v1.0 [51] (trained and evaluated on both the ‘train’ split and the ‘validation’ split) dataset under different confidence thresholds.

Figure 7. The impact of confidence scores on

{mAP}_{nc}

with error bands. The colored lines record the variation trends of

{mAP}_{nc}

for the popular conventional detectors on DOTA-v1.0 [51] (trained and evaluated on both the ‘train’ split and the ‘validation’ split) dataset under different confidence thresholds.

Figure 8. Demonstration of RSCoVLM’s capabilities on several commonly used tasks, including scene classification, open-ended and multiple-choice question answering for regular and UHR images, visual grounding in aerial and UAV images, and aerial object detection. In particular, the visualized results of aerial detection are especially impressive.

Figure 9. Failure examples of “Zoom-in Chain” and “dense detection”.

Table 1. Comparison results of state-of-the-art vision-language models and our model on the LRS-VQA benchmark.

Method	Model Size	Max Pixels	LRS-FAIR	LRS-Bridge	LRS-STAR	Avg. Acc
LLaVA-1.5 [59]	7B	0.1 M	18.76	30.70	22.63	24.03
LLaVA-UHD-v2 [60]	7B	0.7 M	22.82	32.57	26.08	27.16
Qwen2-VL [36]	7B	11.1 M	23.80	38.12	27.87	29.93
Qwen2.5-VL [37]	7B	12.8 M	19.66	35.82	26.70	27.39
Qwen3-VL [37]	8 B	16.8 M	27.98	38.56	32.04	32.86
Qwen3-VL [37]	A3B-30B	16.8 M	27.63	38.81	30.54	32.33
InternVL2.5-MPO [61]	8B	2.4 M	24.95	34.59	25.14	28.23
InternVL3 [62]	8B	2.4 M	22.49	38.09	26.36	28.98
InternVL3.5 [63]	8B	2.4 M	25.14	35.50	26.86	29.17
InternVL3.5 [63]	A3B-30B	2.4 M	16.83	37.05	22.15	25.34
Mimo-VL [34]	7B	12.8 M	16.51	20.04	27.11	21.22
GeoChat [17]	7B	0.3 M	20.18	24.54	13.75	19.49
LLaVA-1.5 + SFT. on LRS-VQA [21]	7B	0.1 M	22.97	36.89	27.48	29.11
LLaVA-Next + SFT. on LRS-VQA [21]	7B	2.8 M	21.85	38.24	26.67	28.92
RSCoVLM	7B	1.0 M	27.37	42.42	31.77	33.85
+ Zoom-in Chain	7B	1.0 M	42.42	49.56	45.15	45.71

Table 2. Comparison results of state-of-the-art vision-language models and our model on visual grounding benchmarks.

Method	Input Size	DIOR-RSVG		RSVG		Geo.-VG	VRS.-VG	AVVG	Avg. Acc.
Method	Input Size	Val	Test	Val	Test	Geo.-VG	VRS.-VG	AVVG	Avg. Acc.
Qwen-VL-Chat [36]	$448 \times 448$	32.01	32.22	4.66	2.04	35.36	31.07	0.31	19.66
GeoChat [17]	$504 \times 504$	23.35	24.05	3.08	2.04	22.74	11.52	0.28	12.44
LHRS-Bot [18]	$224 \times 224$	17.04	17.59	0.95	1.56	3.25	1.19	0.00	5.94
VHM [16]	$336 \times 336$	-	48.04	-	-	-	-	-	-
RSUniVLM [41]	$336 \times 336$	-	72.47	-	-	-	69.31	-	-
Qwen2.5-VL [37]	Dynamic	43.64	45.26	19.73	21.27	42.99	44.50	7.64	32.15
Qwen-VL + SFT. on refGeo [19]	$448 \times 448$	58.65	58.76	12.99	10.59	41.75	47.38	9.53	34.24
GeoChat + SFT. on refGeo [19]	$504 \times 504$	60.27	61.96	16.32	14.67	56.99	51.36	11.52	39.01
LLaVA-1.5-7B + SFT. on refGeo [19]	$336 \times 336$	64.46	65.98	19.98	20.95	63.76	57.17	15.05	43.91
GeoGround [19]	$336 \times 336$	77.18	77.73	27.64	26.65	70.24	66.04	21.58	52.44
RSCoVLM	Dynamic	83.56	84.55	54.04	53.79	76.39	79.73	29.40	65.92
+ Min Size	$224 \times 224$	66.56	67.64	21.23	20.70	21.43	67.50	0.85	37.99
+ Small Size	$336 \times 336$	75.22	75.86	34.72	35.79	70.17	75.79	25.10	56.09

Table 3. Comparison results of state-of-the-art aerial detectors and our model on DOTA-v1.0 benchmark.

Method	Score	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	AP_nc50	AP_nc75	AP_nc50:95
GWD [67]	0.40	72.07	58.95	24.26	35.92	63.37	52.24	66.63	86.85	61.54	59.15	23.45	45.84	36.10	49.50	26.08	50.80	28.67	28.98
R3Det [68]	0.45	73.32	59.51	31.59	43.37	64.94	63.37	75.61	89.35	63.84	66.95	34.99	45.60	46.54	50.17	14.48	54.91	29.21	30.08
ATSS [69]	0.35	72.98	60.67	25.82	42.91	65.23	65.32	75.22	89.78	71.61	70.12	28.04	43.19	47.79	58.28	28.60	56.37	34.62	33.05
Faster RCNN [70]	0.85	73.49	67.37	32.50	43.19	62.92	63.13	73.95	88.79	73.77	66.33	25.23	48.89	53.03	56.94	31.17	57.38	32.98	32.96
FCOS [71]	0.30	72.09	56.32	32.02	27.79	64.28	63.83	75.75	89.21	68.11	67.82	27.30	37.15	46.64	58.98	19.53	53.79	32.13	31.49
CSL [72]	0.40	71.57	53.57	19.82	35.90	64.04	44.96	66.35	87.54	61.33	59.26	29.27	39.56	36.51	49.80	17.38	49.12	29.03	28.72
S2A-Net [73]	0.50	72.96	61.96	36.00	45.99	66.24	65.61	77.08	89.34	73.27	69.25	31.78	46.64	55.02	52.02	35.40	58.58	29.53	31.98
RSCoVLM		77.15	64.86	23.90	45.34	44.87	38.96	57.64	87.22	57.73	49.42	23.31	51.87	37.01	54.92	54.91	51.27	25.75	27.60
+ Max Mode		73.95	63.01	27.84	40.41	56.86	55.37	71.00	89.12	61.69	64.95	19.54	41.91	44.21	55.01	52.27	54.48	31.04	31.38
RSCoVLM-det		73.52	64.68	26.89	47.18	52.57	52.71	59.33	89.16	63.17	61.43	18.91	45.96	47.62	59.19	70.24	55.50	30.78	31.75
+ Max Mode		69.04	64.44	33.32	44.67	56.21	66.47	73.71	87.59	61.38	63.95	22.41	46.63	47.90	59.82	50.82	56.56	33.88	33.66

Table 4. Comparison results of state-of-the-art vision-language models and our model on five scene classification benchmarks.

Method	Model Size	AID	UCMerced	METER-ML	NWPU-RESISC45	WHU-RS19
MiniGPTv2 [81]	7B	- \| 32.96	-	14.29	28.15	64.80
LLaVA-1.5 [59]	7B	- \| 31.10	-	21.73	34.96	54.55
Qwen-VL-Chat [36]	7B	- \| 55.30	-	38.77	42.73	72.25
Qwen2.5-VL [37]	7B	63.63 \| 62.73	70.90	56.64	64.98	76.20
Qwen3-VL [37]	8B	70.84 \| 66.67	79.90	60.88	68.86	87.80
Qwen3-VL [37]	A3B-30B	71.75 \| 68.87	80.19	64.07	70.22	87.70
InternVL2.5-MPO [61]	8B	69.38 \| 64.23	62.90	55.04	59.21	80.20
InternVL3 [62]	8B	67.78 \| 63.40	67.29	59.65	64.32	86.40
InternVL3.5 [63]	8B	77.03 \| 75.00	83.43	51.33	92.57	91.70
InternVL3.5 [63]	A3B-30B	82.45 \| 79.17	86.00	46.19	98.38	97.10
Mimo-VL [34]	7B	66.13 \| 67.20	69.14	54.51	64.35	86.10
LHRSBot [18]	7B	- \| 91.26	-	69.81	83.94	93.17
GeoChat [17]	7B	72.00 \| -	84.40	-	-	-
TEOChat [38]	7B	80.90 \| -	86.30	-	-	-
EarthGPT-X [40]	13B	78.09 \| -	87.89	-	-	-
RSUniVLM [41]	1B	- \| 81.18	-	-	86.86	84.91
LHRS-Bot-Nova [83]	7B	83.06 \| -	-	72.74	83.97	96.20
SkysenseGPT [24]	7B	88.16 \| -	-	40.00	90.06	95.50
VHM [16]	7B	- \| 91.70	-	72.74	94.54	95.80
ScoreRS [84]	7B	- \| 85.90	-	74.42	91.59	96.30
RSCoVLM	7B	88.44 \| 94.30	94.52	75.93	98.25	95.80

Table 5. Comparison results of state-of-the-art vision-language models and our model on two VQA benchmarks.

Method	RSVQA Benchmark						VRSBench VQA
Method	HR-Comp.	HR-Pres.	LR-Comp.	LR-Pres.	LR-R-U	Avg.	VRSBench VQA
LLaVA-1.5 [59]	67.30	69.80	68.20	55.50	59.00	63.96	-
LLaVA-1.6 [84]	68.60	64.40	64.32	56.84	61.00	63.03	-
Qwen2-VL [36]	75.60	63.30	75.47	62.00	73.00	69.87	-
Qwen2.5-VL [37]	75.28	67.30	73.86	64.67	66.00	69.42	51.21
Qwen3-VL [37]	81.00	78.10	70.32	56.42	72.00	71.57	54.75
InternVL-2.5 [82]	75.50	65.80	71.16	66.21	72.00	70.13	47.20
InternVL3 [62]	74.15	62.35	73.06	66.23	74.00	69.96	50.68
InternVL3.5 [63]	80.14	53.80	92.00	91.26	96.00	82.64	53.74
Mimo-VL [34]	66.00	77.80	74.42	59.98	65.00	68.64	48.37
LHRS-Bot-Nova [83]	89.30	87.60	88.11	83.89	79.00	85.58	-
GeoChat [17]	83.30	59.10	90.52	90.63	97.00	84.11	40.80
SkyEyeGPT [39]	80.28	83.50	88.63	88.93	75.00	83.27	-
VHM [16]	83.30	68.30	90.11	89.89	87.00	83.72	-
RSCoVLM	82.60	68.50	93.16	92.18	94.00	86.09	58.08

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Q.; Ma, S.; Luo, J.; Yu, Y.; Zhou, Y.; Wang, F.; Lu, X.; Wang, X.; He, X.; Chen, Y.; et al. Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning. Remote Sens. 2026, 18, 222. https://doi.org/10.3390/rs18020222

AMA Style

Li Q, Ma S, Luo J, Yu Y, Zhou Y, Wang F, Lu X, Wang X, He X, Chen Y, et al. Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning. Remote Sensing. 2026; 18(2):222. https://doi.org/10.3390/rs18020222

Chicago/Turabian Style

Li, Qingyun, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, and et al. 2026. "Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning" Remote Sensing 18, no. 2: 222. https://doi.org/10.3390/rs18020222

APA Style

Li, Q., Ma, S., Luo, J., Yu, Y., Zhou, Y., Wang, F., Lu, X., Wang, X., He, X., Chen, Y., & Yang, X. (2026). Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning. Remote Sensing, 18(2), 222. https://doi.org/10.3390/rs18020222

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning †