Figure 1.
The diverse capabilities of GeoPilot across both SAR and optical imagery.
Figure 1.
The diverse capabilities of GeoPilot across both SAR and optical imagery.
Figure 2.
An overview of the GeoPilot architecture. Our model is centered around Vicuna 7B-v1.5, which processes multimodal inputs. The framework features a dual-pathway design: it can directly generate responses for general understanding tasks (top path), or invoke specialized external tools via a planning and reasoning module for precision-based tasks (right path).
Figure 2.
An overview of the GeoPilot architecture. Our model is centered around Vicuna 7B-v1.5, which processes multimodal inputs. The framework features a dual-pathway design: it can directly generate responses for general understanding tasks (top path), or invoke specialized external tools via a planning and reasoning module for precision-based tasks (right path).
Figure 3.
GeoPilot workflow on a tool-augmented object detection task. The process unfolds in two turns. In the first turn, the agent receives the user’s image–query pair () and generates a plan () that includes its reasoning and a call to an external tool. In the second turn, the agent receives the tool’s output () and synthesizes this new information to produce the final, grounded answer (). “User (automated)” denotes an internal system step that reformats the tool output and feeds it back to the agent.
Figure 3.
GeoPilot workflow on a tool-augmented object detection task. The process unfolds in two turns. In the first turn, the agent receives the user’s image–query pair () and generates a plan () that includes its reasoning and a call to an external tool. In the second turn, the agent receives the tool’s output () and synthesizes this new information to produce the final, grounded answer (). “User (automated)” denotes an internal system step that reformats the tool output and feeds it back to the agent.
Figure 4.
An example of instruction data construction using in-context learning. In the first turn, a system prompt guides the LLM to formulate a realistic user query from descriptive ground-truth data. In the second turn, the system prompt provides detailed context about the specific tool, including the task definition, expected output format, coordinate system rules, and self-correction instructions.
Figure 4.
An example of instruction data construction using in-context learning. In the first turn, a system prompt guides the LLM to formulate a realistic user query from descriptive ground-truth data. In the second turn, the system prompt provides detailed context about the specific tool, including the task definition, expected output format, coordinate system rules, and self-correction instructions.
Figure 5.
Qualitative results demonstrating GeoPilot’s versatility across optical and SAR imagery. Each example shows the user query (Q) and model response (A). Tool-augmented tasks additionally display the visual output produced by the invoked specialist tool. The examples illustrate the model’s ability to address both general understanding and tool-augmented tasks, highlighting robust multimodal reasoning and grounded response generation across diverse remote sensing scenarios.
Figure 5.
Qualitative results demonstrating GeoPilot’s versatility across optical and SAR imagery. Each example shows the user query (Q) and model response (A). Tool-augmented tasks additionally display the visual output produced by the invoked specialist tool. The examples illustrate the model’s ability to address both general understanding and tool-augmented tasks, highlighting robust multimodal reasoning and grounded response generation across diverse remote sensing scenarios.
Figure 6.
Qualitative results on SAR VQA tasks compared with other VLMs, including failure cases. Green checkmarks and red crosses indicate correct and incorrect answers, respectively. Failure cases highlight typical error modes such as misclassifying visually similar objects and imprecise counting in cluttered scenes.
Figure 6.
Qualitative results on SAR VQA tasks compared with other VLMs, including failure cases. Green checkmarks and red crosses indicate correct and incorrect answers, respectively. Failure cases highlight typical error modes such as misclassifying visually similar objects and imprecise counting in cluttered scenes.
Table 1.
A comprehensive comparison of capabilities for representative RS-oriented vision–language models. CL: classification, IC: image captioning, VQA: visual question answering, OP: object positioning, PD: panoptic detection, OD: object detection, VG: visual grounding, SS: semantic segmentation, IS: instance segmentation, RS: referring segmentation. A checkmark indicates that the capability is supported or explicitly reported by the corresponding model; a blank cell indicates that the capability is unsupported or not explicitly reported.
Table 1.
A comprehensive comparison of capabilities for representative RS-oriented vision–language models. CL: classification, IC: image captioning, VQA: visual question answering, OP: object positioning, PD: panoptic detection, OD: object detection, VG: visual grounding, SS: semantic segmentation, IS: instance segmentation, RS: referring segmentation. A checkmark indicates that the capability is supported or explicitly reported by the corresponding model; a blank cell indicates that the capability is unsupported or not explicitly reported.
| Models | Image Level | Region Level | Pixel Level | External Skills |
|---|
| CL | IC | VQA | CLSAR | ICSAR | VQASAR | OPSAR | PD | OD | VG | PDSAR | VGSAR | SS | IS | RS | Tool | Learning |
|---|
| RS VLMs | GeoChat | ✓ | ✓ | ✓ | | | | | | | ✓ | | | | | | | |
| LHRS-Bot | ✓ | ✓ | ✓ | | | | | | ✓ | ✓ | | | | | | | |
| RSGPT | ✓ | ✓ | ✓ | | | | | | | | | | | | | | |
| EarthGPT | ✓ | ✓ | ✓ | | | | | | | ✓ | | | | | | | |
| RS-ChatGPT | ✓ | ✓ | ✓ | | | | | | | | | | | ✓ | | | |
| SkyEyeGPT | ✓ | ✓ | ✓ | | | | | | | ✓ | | | | | | | |
| EarthMarker | ✓ | ✓ | ✓ | | | | | | | | | | | | | | |
| Falcon | ✓ | ✓ | ✓ | | | | | | ✓ | ✓ | | | | ✓ | | | |
| RS-Agent | ✓ | ✓ | ✓ | | | | | ✓ | | | ✓ | | | ✓ | | ✓ | |
| GeoPilot (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 2.
Overview of the abilities of GeoPilot and data statistics of our constructed RS tool-augmented instruction dataset.
For optical imagery: DOTA [
41], DIOR [
42], FAIR1M [
43] are used for panoptic/object detection; RiSBench [
44], OpenEarthMap [
45] for segmentation; RiSBench and iSAID [
46] for instance/referring segmentation; SpaceNet [
47] and CityScale [
48] for road extraction; RICE [
49] for cloud removal.
For SAR imagery: SARLANG-1M [
38] is used for general understanding; SARDet-100K [
39] for panoptic detection and visual grounding.
Table 2.
Overview of the abilities of GeoPilot and data statistics of our constructed RS tool-augmented instruction dataset.
For optical imagery: DOTA [
41], DIOR [
42], FAIR1M [
43] are used for panoptic/object detection; RiSBench [
44], OpenEarthMap [
45] for segmentation; RiSBench and iSAID [
46] for instance/referring segmentation; SpaceNet [
47] and CityScale [
48] for road extraction; RICE [
49] for cloud removal.
For SAR imagery: SARLANG-1M [
38] is used for general understanding; SARDet-100K [
39] for panoptic detection and visual grounding.
| | Abilities | Tools | Source | Size |
|---|
Optical Imagery Tasks | General Understanding | - | GeoChat Instruct | 198,326 |
| Panoptic Detection | LAE-DINO [50] | DOTA, DIOR, FAIR1M | 18,982 |
| Object Detection | LAE-DINO | DOTA, DIOR, FAIR1M | 12,557 |
| Visual Grounding | RemoteSAM [51] | RiSBench, DIOR | 17,086 |
| Semantic Segmentation | Segearth-OV [52] | OpenEarthMap | 6000 |
| Instance Segmentation | RemoteSAM | RiSBench, iSAID | 13,000 |
| Referring Segmentation | RemoteSAM | RiSBench, iSAID | 14,999 |
| Road Extraction | SAM-Road [48] | SpaceNet, CityScale | 1236 |
| Cloud Removal | SpA-GAN [53] | RICE | 2960 |
SAR Tasks | General Understanding | - | SARLANG-1M | 249,488 |
| Panoptic Detection | DenoDet [54] | SARDet-100K | 11,710 |
| Visual Grounding | Fine-tuned LAE-DINO | SARDet-100K | 9569 |
| Total | - | - | - | 555,913 |
Table 3.
Overview of GeoPilotBench, including datasets, size, and input/output formats.
Table 3.
Overview of GeoPilotBench, including datasets, size, and input/output formats.
| Task | Dataset | Size | Input/Output |
|---|
| Task Planning | Custom | 500 | Instruction → Tool Calls |
| VQA | RSVQA-LRBEN | 10,004 | Image + Question → Answer |
| | RSVQA-HRBEN | 62,554 | Image + Question → Answer |
| Referring Object Detection | GeoChat-Instruct | 7593 | Image + Referring Expression → Bounding Box |
| SAR VQA | SARDet-100K | 11,955 | Image + Question → Answer |
| SAR Image Captioning | SARLANG-1M-Cap | 13,682 | Image → Text |
Table 4.
Comparison context of representative baselines used in the main experiments. Data sizes are taken from the corresponding papers when clearly reported; otherwise, we mark them as NR. Pretrain/SFT denotes approximate pretraining and supervised fine-tuning data scale, respectively. † indicates tool-augmented samples.
Table 4.
Comparison context of representative baselines used in the main experiments. Data sizes are taken from the corresponding papers when clearly reported; otherwise, we mark them as NR. Pretrain/SFT denotes approximate pretraining and supervised fine-tuning data scale, respectively. † indicates tool-augmented samples.
| Model | Backbone LLM | Data Size (Pretrain/SFT) | Training Strategy |
|---|
| LLaVA-1.5 | Vicuna-7B | 558,000/665,000 | 2-stage (connector pretrain + visual instruction tuning) |
| MiniGPT-v2 | LLaMA-2-7B | web-scale mixed data/NR | 3-stage training |
| GeoChat | Vicuna-v1.5-7B | NR/∼318,000 | LoRA SFT |
| RSGPT | Vicuna-7B | NR/2585 | Q-Former + linear FT |
| LHRS-Bot | LLaMA-2-7B | ∼1.15 million/∼74,000 | 3-stage curriculum training |
| VHM | Vicuna-v1.5-7B | ∼1.4 million/>151,000 | 2-stage (RS pretrain + SFT) |
| EarthGPT | LLaMA-2 | LAION-400M + COCO/>1 million | cross-modal alignment + RS tuning |
| InternVL2-8B | InternLM2-8B | NR/NR | MLP warmup + instruction tuning |
| Qwen2.5-VL-7B | Qwen2.5-7B | NR/NR | released pretrained/post-trained model |
| GeoPilot | Vicuna-7B | 595K (LLaVA init.) / 555,913 † | 2-stage full SFT |
Table 5.
Comparison of task planning performance on GeoPilotBench (500 samples). TSA: Tool Selection Accuracy; PA: Parameter Accuracy; NTA: No-Tool Judgment Accuracy; OPA: overall planning accuracy. All models receive the same system prompt with tool descriptions. RSGPT and LHRS-Bot are evaluated as prompted RS-VLM baselines.
Table 5.
Comparison of task planning performance on GeoPilotBench (500 samples). TSA: Tool Selection Accuracy; PA: Parameter Accuracy; NTA: No-Tool Judgment Accuracy; OPA: overall planning accuracy. All models receive the same system prompt with tool descriptions. RSGPT and LHRS-Bot are evaluated as prompted RS-VLM baselines.
| Model | TSA | PA | NTA | OPA |
|---|
| LLaVA-1.5-7B [60] | 38.6 | 42.4 | 53.0 | 22.8 |
| RSGPT [26] | 40.2 | 48.0 | 51.0 | 28.0 |
| GeoChat [11] | 46.8 | 55.6 | 61.0 | 32.4 |
| GeoPilot w/o tool use traces | 48.8 | 57.0 | 62.0 | 34.6 |
| LHRS-Bot [28] | 51.4 | 58.2 | 62.0 | 38.0 |
| InternVL2-8B [61] | 58.4 | 65.2 | 68.0 | 44.2 |
| Qwen2.5-VL-7B [9] | 64.2 | 71.8 | 72.0 | 51.6 |
| GeoPilot (Ours) | 96.4 | 94.8 | 97.0 | 92.6 |
Table 6.
Per-tool breakdown of Tool Selection Accuracy (%) on the 500-sample task planning benchmark. # Samples denotes the number of samples in each tool category. ∅ denotes the no-tool category, where no external tool invocation is required.
Table 6.
Per-tool breakdown of Tool Selection Accuracy (%) on the 500-sample task planning benchmark. # Samples denotes the number of samples in each tool category. ∅ denotes the no-tool category, where no external tool invocation is required.
| Tool Category | # Samples | LLaVA-1.5 | Qwen2.5-VL | InternVL2 | GeoChat | GeoPilot |
|---|
| Object Detection | 80 | 35.0 | 68.8 | 62.5 | 52.5 | 97.5 |
| Panoptic Detection | 50 | 22.0 | 56.0 | 48.0 | 40.0 | 96.0 |
| Referring Object Detection | 70 | 28.6 | 60.0 | 52.9 | 44.3 | 97.1 |
| Sem. Segmentation | 50 | 30.0 | 62.0 | 56.0 | 38.0 | 96.0 |
| Inst. Segmentation | 50 | 26.0 | 54.0 | 48.0 | 36.0 | 94.0 |
| Ref. Segmentation | 40 | 25.0 | 52.5 | 47.5 | 35.0 | 95.0 |
| Road Extraction | 30 | 16.7 | 43.3 | 36.7 | 30.0 | 96.7 |
| Cloud Removal | 30 | 20.0 | 50.0 | 40.0 | 33.3 | 96.7 |
| No Tool (∅) | 100 | 53.0 | 72.0 | 68.0 | 61.0 | 97.0 |
| Overall TSA | 500 | 38.6 | 64.2 | 58.4 | 46.8 | 96.4 |
Table 7.
Overall planning accuracy (%) by difficulty level on the task planning benchmark. # Queries denotes the number of queries at each difficulty level.
Table 7.
Overall planning accuracy (%) by difficulty level on the task planning benchmark. # Queries denotes the number of queries at each difficulty level.
| Difficulty | # Queries | LLaVA-1.5 | Qwen2.5-VL | InternVL2 | GeoPilot |
|---|
| Easy | 200 | 32.0 | 65.5 | 58.0 | 97.0 |
| Medium | 200 | 20.5 | 49.0 | 42.0 | 93.0 |
| Hard | 100 | 8.0 | 32.0 | 24.0 | 84.0 |
Table 8.
Comparison of GeoPilot with VLMs on RSVQA-LRBEN. Results are reported using accuracy (%). Presence, Comparison, and Rural/Urban denote the three question types in the RSVQA-LRBEN benchmark. GeoPilot results are mean ± std over three seeds.
Table 8.
Comparison of GeoPilot with VLMs on RSVQA-LRBEN. Results are reported using accuracy (%). Presence, Comparison, and Rural/Urban denote the three question types in the RSVQA-LRBEN benchmark. GeoPilot results are mean ± std over three seeds.
| Model | Presence | Comparison | Rural/Urban | Average |
|---|
| LLaVA-1.5 [60] | 55.46 | 68.20 | 59.00 | 62.77 |
| MiniGPTv2 [6] | 55.16 | 55.22 | 39.00 | 54.96 |
| Qwen2.5-VL-7B [9] | 60.64 | 73.89 | 66.00 | 68.23 |
| InternVL2.5-8B [62] | 71.47 | 73.39 | 71.00 | 72.55 |
| LHRS-Bot [28] | 88.51 | 90.00 | 89.07 | 89.19 |
| VHM [34] | 90.11 | 89.89 | 88.00 | 89.33 |
| GeoChat [11] | 91.09 | 90.33 | 94.00 | 90.70 |
| RSGPT [26] | 91.03 | 91.70 | 94.00 | 92.29 |
| GeoPilot | 92.10 ± 0.32 | 91.91 ± 0.26 | 95.21 ± 0.40 | 93.21 ± 0.21 |
Table 9.
Comparison of GeoPilot with VLMs on RSVQA-HRBEN. Results are reported using accuracy (%). Presence and Comparison denote the two question types in RSVQA-HRBEN. GeoPilot results are mean ± std over 3 seeds.
Table 9.
Comparison of GeoPilot with VLMs on RSVQA-HRBEN. Results are reported using accuracy (%). Presence and Comparison denote the two question types in RSVQA-HRBEN. GeoPilot results are mean ± std over 3 seeds.
| Model | Presence | Comparison | Average |
|---|
| MiniGPTv2 [6] | 40.79 | 50.91 | 46.46 |
| Qwen-VL [7] | 66.44 | 60.41 | 63.06 |
| Qwen2.5-VL-7B [9] | 60.38 | 72.98 | 67.43 |
| InternVL2.5-8B [62] | 64.51 | 75.44 | 70.63 |
| EarthGPT [29] | 62.77 | 79.53 | 72.06 |
| GeoChat [11] | 58.45 | 83.19 | 72.30 |
| InternVL2-8B [61] | 67.35 | 76.91 | 72.70 |
| VHM [34] | 63.00 | 83.00 | 73.00 |
| GeoPilot | 63.51 ± 0.35 | 83.90 ± 0.16 | 73.65 ± 0.20 |
Table 10.
Comparison of GeoPilot with VLMs on GeoChat-Instruct. Results are reported using Acc@0.5. Small, Medium, and Large refer to object size categories; Single and Multiple indicate whether the referring expression targets one or multiple objects in the image.
Table 10.
Comparison of GeoPilot with VLMs on GeoChat-Instruct. Results are reported using Acc@0.5. Small, Medium, and Large refer to object size categories; Single and Multiple indicate whether the referring expression targets one or multiple objects in the image.
| Model | Small | Medium | Large | Single | Multiple |
|---|
| MiniGPTv2 [6] | 1.70 | 9.90 | 21.90 | 9.10 | 3.60 |
| GeoChat [11] | 2.90 | 13.60 | 21.70 | 16.00 | 4.30 |
| InternVL2-8B [61] | 7.20 | 23.76 | 31.99 | 25.77 | 9.30 |
| GeoPilot (Ours) | 10.63 | 25.11 | 32.45 | 26.32 | 16.84 |
Table 11.
Comparison of GeoPilot with VLMs on SAR VQA. OI: object identification; IC: instance counting; OC: object classification; OP: object positioning. Results are reported using accuracy.
Table 11.
Comparison of GeoPilot with VLMs on SAR VQA. OI: object identification; IC: instance counting; OC: object classification; OP: object positioning. Results are reported using accuracy.
| Model | OI | IC | OC | OP |
|---|
| LLaVA-1.5 [60] | 53.46 | 45.20 | 29.00 | 12.77 |
| Qwen2.5-VL [9] | 55.46 | 47.20 | 25.00 | 13.23 |
| GeoChat [11] | 62.04 | 54.36 | 51.54 | 19.84 |
| GeoPilot | 77.46 | 67.24 | 62.64 | 29.72 |
Table 12.
Comparison on SAR image captioning using the official SARLANG-1M-Cap test split with 13,682 caption samples. Baseline results are reported from the official SARLANG-1M-Cap benchmark when available. GeoChat and GeoPilot are evaluated under the same official split. ZS: zero-shot; FT: fine-tuned.
Table 12.
Comparison on SAR image captioning using the official SARLANG-1M-Cap test split with 13,682 caption samples. Baseline results are reported from the official SARLANG-1M-Cap benchmark when available. GeoChat and GeoPilot are evaluated under the same official split. ZS: zero-shot; FT: fine-tuned.
| Model | Param | Setting | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | CIDEr |
|---|
| LLaVA-1.5 [60] | 7B | ZS | 8.18 | 3.96 | 1.69 | 0.76 | 13.39 | 0.03 |
| Qwen2-VL [63] | 7B | ZS | 7.43 | 3.53 | 1.52 | 0.69 | 12.70 | 0.01 |
| GeoChat [11] | 7B | ZS | 11.62 | 5.42 | 2.43 | 1.28 | 14.78 | 0.17 |
| DeepSeek-VL [64] | 7B | ZS | 20.79 | 10.04 | 4.65 | 2.60 | 18.39 | 3.68 |
| Qwen2.5-VL [9] | 7B | ZS | 21.49 | 10.77 | 6.39 | 3.46 | 21.09 | 6.14 |
| InternVL2.5 [62] | 8B | ZS | 28.81 | 19.31 | 13.56 | 8.68 | 29.97 | 10.50 |
| LLaVA-1.5 [60] | 7B | FT | 29.61 | 18.86 | 13.14 | 9.21 | 27.12 | 28.31 |
| Qwen2-VL [63] | 7B | FT | 30.18 | 19.43 | 13.99 | 10.10 | 27.56 | 28.96 |
| Qwen2.5-VL [9] | 7B | FT | 35.27 | 26.65 | 21.38 | 17.31 | 34.37 | 62.85 |
| GeoPilot | 7B | FT | 32.74 | 26.91 | 21.74 | 17.86 | 33.18 | 70.24 |
Table 13.
End-to-end evaluation of tool-augmented tasks. Standalone Tool directly invokes the specialist tool with ground-truth parameters, while GeoPilot autonomously selects the tool and generates parameters. Semantic segmentation is evaluated on the official OpenEarthMap test split using Segearth-OV, and referring object detection is evaluated on the GeoChat-Instruct test split using RemoteSAM.
Table 13.
End-to-end evaluation of tool-augmented tasks. Standalone Tool directly invokes the specialist tool with ground-truth parameters, while GeoPilot autonomously selects the tool and generates parameters. Semantic segmentation is evaluated on the official OpenEarthMap test split using Segearth-OV, and referring object detection is evaluated on the GeoChat-Instruct test split using RemoteSAM.
| Task | Metric | Standalone Tool | GeoPilot (Ours) | Gap |
|---|
| Semantic Segmentation | mIoU | 39.8 | 38.3 | −1.5 |
| Referring Object Detection | Acc@0.5 | 29.1 | 26.1 | −3.0 |
Table 14.
Ablation study on training strategy and optical data mixing ratio . SAR Avg denotes the average performance over the four SAR VQA subtasks, while HRBEN Avg denotes the average accuracy on RSVQA-HRBEN. The first row () denotes single-stage training on mixed optical and SAR data without curriculum separation. A checkmark indicates that multi-stage training is used.
Table 14.
Ablation study on training strategy and optical data mixing ratio . SAR Avg denotes the average performance over the four SAR VQA subtasks, while HRBEN Avg denotes the average accuracy on RSVQA-HRBEN. The first row () denotes single-stage training on mixed optical and SAR data without curriculum separation. A checkmark indicates that multi-stage training is used.
| Multi-Stage | | OI | IC | OC | OP | SAR Avg | HRBEN Avg |
|---|
| – | 54.49 | 47.46 | 26.38 | 15.59 | 35.98 | 64.46 |
| ✓ | 0% | 70.95 | 61.28 | 56.47 | 24.02 | 53.18 | 67.08 |
| ✓ | 10% | 72.38 | 62.15 | 58.72 | 25.40 | 54.66 | 71.24 |
| ✓ | 20% | 77.46 | 67.24 | 62.64 | 29.72 | 59.27 | 73.68 |
| ✓ | 30% | 76.82 | 66.50 | 61.90 | 28.94 | 58.54 | 73.96 |
| ✓ | 50% | 73.60 | 63.18 | 57.46 | 26.30 | 55.14 | 73.70 |