1. Introduction
Remote sensing has emerged as a vital technology for observing and analyzing Earth's surface, offering important insights for areas such as urban development, agriculture, environmental monitoring, and disaster response. Within this field, high-spatial resolution (HSR) imagery is particularly valuable due to its ability to capture detailed visual information about ground features. These images provide clear and close-up views of both natural landscapes and built environments, making them especially useful for studying human settlements, vegetation, water bodies, and various land cover types on large scales [1,2].
The value of HSR imagery lies in its practical applications, including land use classification, environmental change detection, and resource management. Despite its advantages, interpreting HSR images remains a significant challenge due to the complexity and diversity of the objects they depict. Ground elements in these images often exhibit a wide range of shapes, textures, and spectral properties. This complexity creates a growing demand for advanced and intelligent techniques that can effectively analyze and extract meaningful information from these detailed and information-rich images [3,4].
Visual Question Answering (VQA) requires generating accurate natural language answers to open-ended questions about images. While VQA has advanced significantly with natural images, its extension to remote sensing, known as Remote Sensing VQA (RSVQA), remains limited. Traditional methods rely on convolutional networks for visual feature extraction and recurrent networks for language processing, enabling basic image and text interaction but lacking strong reasoning and knowledge integration. As a result, early RSVQA systems struggle with complex contextual and relational analysis in high-resolution imagery, highlighting the need for more advanced knowledge-driven approaches [5,6,7].
This lack of interpretability presents a major challenge for RSVQA, particularly because high-spatial resolution remote sensing images are often complex and abstract. Properly understanding these images relies heavily on spatial relationships, domain-specific patterns, and multi-scale information. For example, answering questions about the number of residential buildings in a region, identifying crop types in agricultural zones, detecting flooded areas after heavy rainfall, or locating newly constructed roads requires not only recognizing features in the image but also reasoning across spatial patterns and interpreting the question in context. As a result, there is a strong need for models that can not only generate accurate answers but also provide human-understandable explanations, helping improve transparency and reliability in geospatial decision-making [8,9].
Recent advances in large multimodal models (LMMs) such as Claude 3.7 Sonnet [10], Grok 3 [11], Gemini [12], GPT-4 [13], LLaMA 4 Scout [14], and DeepSeek-R1 [15] have significantly improved visual reasoning and cross-modal understanding by jointly processing visual and textual inputs, making them well suited for tasks that require interpreting images with natural language. In remote sensing, specialized models such as GeoChat [16], LHRSBot [17], and RSGPT [18] address domain-specific challenges in geospatial imagery and question answering. Approaches such as the Semantic Object Awareness (SOBA) framework integrate object-aware segmentation with hybrid attention mechanisms and a numerical difference loss to unify classification and regression tasks, achieving 78.14% accuracy on EarthVQA [1]. Other methods employ two-step fine-tuning with domain-adaptive pretraining and prompt-based adaptation to enable natural language answer generation without predefined categories, demonstrating strong performance on the RSVQAxBEN dataset [2]. Additionally, the MM-RSVQA model fuses multimodal, multi-resolution remote sensing imagery, including high-resolution RGB, multi-spectral, and SAR data, using a VisualBERT-based transformer and achieving 65.56% accuracy on the TAMMI dataset [19], which highlights the benefits of multimodal integration for remote sensing VQA.
Previous studies have improved accuracy and enhanced the fusion of diverse data sources, but they share key limitations. They lack interpretability, providing no clear explanation of how inputs and processing steps lead to predictions. They rely mainly on traditional metrics such as accuracy, precision, and recall, which do not fully capture the quality or trustworthiness of answers, especially for questions requiring deeper reasoning and contextual understanding. Evaluations typically focus only on the correctness of the final answer, ignoring the clarity and coherence of the reasoning process, which is critical for complex remote sensing tasks. Furthermore, robust evaluation frameworks that integrate automated metrics with human judgment are lacking, limiting the transparency and reliability of these models in real-world applications [20,21].
In response to these challenges, this study introduces three frameworks, Zero-GeoVision, CoT-GeoReason, and Self-GeoSense, aimed at enhancing the reasoning abilities of LMMs in RSVQA without requiring task-specific fine-tuning. For our experiments, we work with the EarthVQA dataset, which was originally developed by another research group [1] and consists of 6000 high-resolution satellite images and 208,593 question–answer pairs. From this large dataset, we selected a representative subset of 200 images (100 rural and 100 urban), along with their associated question–answer pairs (29 per rural image and 42 per urban image), to conduct a focused evaluation.
The Zero-GeoVision framework applies zero-shot prompting to draw direct answers from the pretrained knowledge of LMMs, serving as a baseline. The CoT-GeoReason framework introduces chain-of-thought prompting, guiding models step by step through feature detection, spatial analysis, and answer synthesis to improve reasoning transparency. Building upon this, the Self-GeoSense framework incorporates self-consistency by generating five independent reasoning chains per question and combining their outputs through majority voting, thereby improving robustness against ambiguous or complex inputs.
This study made the following contributions to the field of remote sensing visual question answering:
Introduced three task-specific frameworks—Zero-GeoVision, CoT-GeoReason, and Self-GeoSense—which employed zero-shot prompting, chain-of-thought reasoning, and self-consistency, respectively, to improve LMMs’ performance without fine-tuning.
Proposed an LMM Judge that automatically evaluated model outputs for remote sensing tasks. It assessed both the final answer and reasoning chain, assigning one of three labels (Matched, Partially Matched, or Not Matched) to support the computation of strict accuracy metrics while enhancing interpretability.
Demonstrated that the Self-GeoSense framework, especially with Grok 3, outperformed the existing Semantic Object Awareness (SOBA) framework baseline across all tasks, showing the effectiveness of self-consistency and structured reasoning in complex RSVQA tasks.
Integrated human expert evaluations to align model outputs with domain knowledge, addressing the limitations of purely automated metrics and enhancing the trustworthiness of the results.
The paper is organized as follows:
Section 2 reviews prior work on task-specific deep learning and vision-language model-based approaches for RSVQA, highlighting their strengths and limitations.
Section 3 describes the EarthVQA dataset and evaluation metrics such as accuracy, RMSE, and the Geo-Judge framework, forming the foundation for the experiments.
Section 4 details the proposed frameworks, Zero-GeoVision, CoT-GeoReason, and Self-GeoSense, including their task-specific prompt designs and inference processes.
Section 5 presents a detailed task-wise comparison of these frameworks against the SOBA baseline, analyzing the effects of different reasoning strategies and evaluation methods.
Section 6 evaluates the computational cost and efficiency of the proposed frameworks, reporting the estimated token usage, average inference cost, and their implications for real-time and resource-constrained applications.
Section 7 discusses the limitations of the proposed frameworks, including dataset biases and dependence on human evaluations.
Section 8 proposes future research directions, such as optimizing computational efficiency, expanding dataset diversity, and reducing reliance on human evaluations. Finally,
Section 9 summarizes the key findings, emphasizing the effectiveness of the proposed frameworks and their implications for RSVQA research.
4. Proposed Methodology
In this study, we analyze high-spatial resolution remote sensing images to assess residential environments, transportation infrastructure, and renovation needs for water bodies and unsurfaced roads. The overall end-to-end process flow for this analysis is illustrated in Figure 6. We propose three frameworks integrated with LMMs: Zero-GeoVision, which enables zero-shot [13,32] instruction for immediate image interpretation; CoT-GeoReason, which applies chain-of-thought [33] instruction to support detailed step-by-step reasoning; and Self-GeoSense, which extends this approach with self-consistency to iteratively refine and validate the analysis.
4.1. Zero-GeoVision Framework
The Zero-GeoVision framework leverages the pretrained knowledge embedded within LMMs to perform remote sensing image analysis without any task-specific fine-tuning or additional training. It operates in a zero-shot setting by constructing tailored prompts that guide the model to interpret the input image and question directly based on its general understanding of visual and textual data.
In a zero-shot learning setup, the model inference relies solely on its prior knowledge without any gradient updates. Formally, the predicted answer $\hat{A}$ is obtained as follows:

$$\hat{A} = M(I, T, Q, P_t)$$

where $M$ is the pretrained LMM, $I$ is the input image, $T$ is the image type, $Q$ is the question, and $P_t$ is the task-specific prompt. No model parameters are updated, i.e., $\Delta\theta = 0$, indicating zero gradient descent on model weights $\theta$. The model responds based on its internal representations learned during pretraining.
For each task type defined in the previous section, a unique prompt template is carefully designed to emphasize the particular reasoning or analysis required. These prompts incorporate the remote sensing image I, its type T, and the question Q, explicitly instructing the model on the expected form of the response. By customizing prompts according to task characteristics, the framework maximizes the LMM’s ability to generalize across diverse remote sensing scenarios.
During inference, we adopt a consistent set of decoding parameters to ensure stable and diverse response generation. Specifically, the generation is performed using a temperature of 0.7 to balance creativity and precision, a nucleus sampling probability (top-p) of 0.9 to filter unlikely tokens, a top-k sampling value of 5 to limit token choices at each step, and a maximum token length of 512. No task-specific tuning or gradient updates are applied to the model weights, keeping the evaluation fully in the zero-shot regime. These configurations help maintain consistency while allowing the model’s pretrained reasoning capabilities to handle a wide range of question types across different remote sensing tasks.
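For concreteness, the sketch below shows how a Zero-GeoVision query could be assembled in Python under these settings. The query_lmm wrapper, the template wording, and the task keys (e.g., "Bas Ju", "Bas Co") are illustrative placeholders rather than the exact prompts used in the study; only the decoding parameters (temperature 0.7, top-p 0.9, top-k 5, 512 max tokens) follow the values stated above.

```python
from typing import Dict

def query_lmm(image: str, prompt: str, **decoding) -> str:
    """Placeholder for a chat-style multimodal LMM call (GPT-4o, Grok 3, Gemini 2.5 Pro,
    or Claude 3.7 Sonnet); swap in the provider SDK of your choice."""
    raise NotImplementedError("connect an actual LMM client here")

# Hypothetical, abbreviated task-specific prompt templates (P_t).
ZERO_SHOT_TEMPLATES: Dict[str, str] = {
    "Bas Ju": ("You are analyzing a {image_type} remote sensing image. "
               "Answer the question using straightforward visual cues. Question: {question}"),
    "Bas Co": ("You are analyzing a {image_type} remote sensing image. "
               "Count the requested objects and reply with a single number. Question: {question}"),
}

def zero_geovision_answer(image_path: str, image_type: str, question: str, task_type: str) -> str:
    """Build the task-specific prompt and query the frozen LMM (no gradient updates)."""
    prompt = ZERO_SHOT_TEMPLATES[task_type].format(image_type=image_type, question=question)
    return query_lmm(image=image_path, prompt=prompt,
                     temperature=0.7, top_p=0.9, top_k=5, max_tokens=512)
```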
Steps for Applying the Zero-GeoVision framework:
Input Preparation: Each data instance is represented as a tuple (I, T, Q, A), where I is the raw remote sensing image, T denotes the image type (e.g., optical or SAR), Q is the question to be answered, and A is the ground-truth answer. The image is processed and encoded in its native format, while the type and question are formatted as natural language text.
Task-Specific Prompt Design: Distinct prompt templates are created for each task type t. Some examples include the following:
Prompting Guideline for Bas Ju: Identify and classify based on straightforward visual cues.
Prompting Guideline for Rel Ju: Evaluate the image features logically to make a reasoned decision.
Prompting Guideline for Bas Co: Count specific objects visible in the image.
Prompting Guideline for Rel Co: Combine object detection with contextual reasoning to refine counting results.
Prompting Guideline for Obj An: Focus on the condition or contextual state of particular objects.
Prompting Guideline for Com An: Describe multiple aspects of the image in detail.
These tailored prompts help the LMM focus its reasoning and interpretation aligned with the demands of each task.
Model Inference: The LMM takes the remote sensing image, its type, the associated question, and a task-specific prompt as input, and processes them to generate a predicted answer. In other words, the model’s inference function maps the combination of image, image type, question, and prompt to produce the predicted answer.
The Zero-GeoVision framework employs task-specific prompt designs tailored to each task type, which guide the LMM in interpreting the remote sensing image and question to generate accurate zero-shot responses. Figure 7, Figure 8 and Figure 9 illustrate the prompt designs for the six task types grouped in pairs.
We performed evaluation with the Zero-GeoVision framework to assess the performance of four LMMs: GPT-4o [13], Grok 3 [11], Gemini 2.5 Pro [12], and Claude 3.7 Sonnet [10]. We measured Accuracy for all task types and RMSE for the counting-related tasks. Unlike our other reasoning-based frameworks, the Zero-GeoVision evaluation relied solely on basic zero-shot prompting without incorporating any explicit reasoning steps. Therefore, the results reflect model capabilities under straightforward prompt conditions. A summary of the evaluation results categorized by task type and model is presented in Table 3.
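As a reference for how these two metrics can be computed, the snippet below gives a minimal sketch: exact-match Accuracy over normalized answer strings and RMSE over predicted counts. It covers only the mechanical part of the evaluation; the semantic, judge-assisted scoring used for the reasoning-intensive tasks is described with the later frameworks.

```python
from math import sqrt
from typing import List

def accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of answers that exactly match the ground truth after simple normalization."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)

def rmse(predicted_counts: List[float], true_counts: List[float]) -> float:
    """Root mean squared error for the counting tasks (Bas Co, Rel Co)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_counts, true_counts)]
    return sqrt(sum(squared) / len(squared))

# Example: accuracy(["Yes", "No"], ["Yes", "Yes"]) -> 0.5; rmse([3, 7], [4, 7]) -> ~0.707
```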
4.2. CoT-GeoReason Framework
The CoT-GeoReason framework builds upon zero-shot capabilities by employing chain-of-thought (CoT) prompting to guide LMMs in performing structured, step-by-step reasoning without any task-specific training or fine-tuning. This approach leverages the model’s inherent general knowledge and reasoning abilities by prompting it to articulate intermediate reasoning steps before producing the final answer. For each task type, carefully crafted CoT prompts emphasize the specific reasoning process required, enabling the model to handle complex remote sensing questions more effectively.
In this framework, rather than producing a direct answer, the model is encouraged to generate a sequence of intermediate reasoning steps $R = (r_1, r_2, \ldots, r_n)$, followed by a final answer $\hat{A}$. This process can be formally modeled as the following factorized conditional generation:

$$P(\hat{A}, R \mid I, T, Q; M) = \left[\prod_{i=1}^{n} P(r_i \mid r_{<i}, I, T, Q; M)\right] P(\hat{A} \mid R, I, T, Q; M)$$

Here, the following hold:
I is the input image.
T is the image type (e.g., optical, SAR).
Q is the natural language question.
M is the fixed pretrained LMM.
$r_i$ is the i-th reasoning step, $i = 1, \ldots, n$, and
$R = (r_1, r_2, \ldots, r_n)$ is the complete sequence of reasoning steps.
No gradient updates are applied to the model parameters during inference, i.e., $\Delta\theta = 0$.
This enforces the zero-shot inference constraint while still enabling complex reasoning chains through prompt engineering.
During inference, decoding parameters are configured to balance output quality and diversity. We use a temperature of 0.6 to promote focused yet flexible responses, a nucleus sampling (top-p) value of 0.85 to restrict token selection to likely candidates, and a top-k value of 10 to limit token choices at each step. The maximum token length is set to 768 to allow detailed reasoning while maintaining computational efficiency. This zero-shot CoT prompting framework enables comprehensive reasoning over diverse remote sensing tasks without additional model training.
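A minimal sketch of this prompting step is shown below. The instruction wording, the "Answer:" output convention, and the parsing logic are assumptions made for illustration; only the decoding parameters (temperature 0.6, top-p 0.85, top-k 10, 768 max tokens) match the configuration described above.

```python
from typing import List, Tuple

COT_SUFFIX = ("Think step by step. Put each reasoning step on its own line, "
              "then give the final answer on a line starting with 'Answer:'.")

def build_cot_prompt(image_type: str, task_instruction: str, question: str) -> str:
    """Compose a task-specific chain-of-thought prompt for a remote sensing question."""
    return (f"You are analyzing a {image_type} remote sensing image. "
            f"{task_instruction} Question: {question}\n{COT_SUFFIX}")

def parse_cot_output(text: str) -> Tuple[List[str], str]:
    """Separate the intermediate reasoning steps r_1..r_n from the final answer."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    steps = [ln for ln in lines if not ln.lower().startswith("answer:")]
    answers = [ln for ln in lines if ln.lower().startswith("answer:")]
    final = answers[-1].split(":", 1)[1].strip() if answers else ""
    return steps, final

# Decoding used with this prompt: temperature=0.6, top_p=0.85, top_k=10, max_tokens=768.
```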
Steps for Applying the CoT-GeoReason Framework:
Input Preparation: As in the zero-shot framework, prepare the input tuple (I, T, Q, A), ensuring the image I, type T, and question Q are formatted appropriately for the LMM.
Task-Specific CoT Prompt Design: Construct a prompt that instructs the model to follow a step-by-step reasoning process as follows:
Prompting Strategy for Rel Ju: Identify relevant image features, analyze their implications, and make a reasoned judgment.
Prompting Strategy for Bas Ju: Identify key visual characteristics and directly link them to a judgment.
Prompting Strategy for Bas Co: Detect objects and systematically count them.
Prompting Strategy for Com An: Segment the image, describe each region, and synthesize a comprehensive response.
Prompting Strategy for Obj An: Identify objects and analyze their condition or context.
Prompting Strategy for Rel Co: Combine object detection with reasoning about contextual factors to produce a count.
Model Inference with Reasoning Steps: The LMM generates a sequence of reasoning steps $r_1, r_2, \ldots, r_n$, culminating in the predicted answer $\hat{A}$. The overall reasoning flow can be viewed as follows:

$$(I, T, Q, P_t^{\mathrm{CoT}}) \xrightarrow{\;M\;} r_1 \Rightarrow r_2 \Rightarrow \cdots \Rightarrow r_n \Rightarrow \hat{A}$$

where $P_t^{\mathrm{CoT}}$ is the chain-of-thought prompt designed for task type t.
Output Evaluation: Compare $\hat{A}$ with the ground-truth answer A to evaluate performance. The generated reasoning chain provides insights into the model's internal reasoning and decision-making process.
By applying task-specific CoT prompts, the CoT-GeoReason framework ensures that the reasoning process aligns with the demands of each task type, enabling a comparative analysis of performance across the six tasks.
The CoT-GeoReason framework employs task-specific prompt designs tailored to each task type, which guide the LMM in interpreting the remote sensing image and question through chain-of-thought reasoning to generate accurate responses, as illustrated in Figure 10, Figure 11 and Figure 12. These figures illustrate the prompt designs for the six task types grouped in pairs. The performance of the LMMs within the CoT-GeoReason framework is summarized in Table 4.
We applied the CoT-GeoReason framework to comprehensively evaluate the performance of four LMMs: GPT-4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Our assessment involved measuring Accuracy across all task types and RMSE specifically for counting-related tasks. For tasks such as Basic Judging and Basic Counting, Accuracy was calculated directly by comparing model outputs against objective ground truth labels. In contrast, for more complex tasks, namely Reasoning-Based Judging, Comprehensive Analysis, and Object Situation Analysis, the Accuracy metric incorporates semantic evaluation based on in-depth reasoning. This semantic evaluation was conducted through a combination of human expert judgment and LMM-as-a-judge approaches, allowing us to capture subtle and context-dependent aspects of model performance that are not fully reflected by automatic metrics alone. By integrating these multiple evaluation layers, the CoT-GeoReason framework delivers a comprehensive and nuanced assessment of model capabilities. Detailed results broken down by task type and model are presented in Table 4.
4.3. Self-GeoSense Framework
The Self-GeoSense framework extends the CoT-GeoReason framework by generating multiple independent chains of reasoning for each input and then aggregating these outputs to produce a more reliable and robust final answer. This Self-Consistency with Chain of Thought (CoT-SC) approach effectively reduces variability in LMM outputs, which is especially important for complex remote sensing tasks where image interpretation can be ambiguous. By leveraging the same task-specific CoT prompts as the CoT-GeoReason framework, Self-GeoSense improves the consistency of results through the generation and integration of multiple reasoning paths for each task type.
Example:
Task Type: Basic Judging
Question: Are there any roads in this scene?
Correct Answer: Yes
Example Reasoning Chains:
Chain 1: Observe the image for linear features ⇒ Linear patterns detected, resembling paved surfaces ⇒ No vegetation covering these lines ⇒ Conclude roads are present ⇒ Answer: Yes.
Chain 2: Identify straight or curved lines in the image ⇒ Lines appear man-made and consistent with roads ⇒ Confirm with surrounding context (e.g., buildings) ⇒ Roads are present ⇒ Answer: Yes.
Chain 3: Check for natural features like rivers or paths ⇒ No water or organic patterns detected ⇒ Linear features suggest infrastructure ⇒ Likely roads ⇒ Answer: Yes.
Chain 4: Look for vehicle movement or road-like structures ⇒ No vehicles visible, but linear paths exist ⇒ Paths align with road characteristics ⇒ Roads exist ⇒ Answer: Yes.
Chain 5: Analyze image for urban elements ⇒ Linear features don’t match natural terrain ⇒ Patterns suggest man-made roads ⇒ No roads detected ⇒ Answer: No.
Formal Description:
For each input $(I, T, Q)$, the model generates $k$ independent reasoning chains as follows:

$$C_i = \left(r_{i,1}, r_{i,2}, \ldots, r_{i,n_i}, \hat{A}_i\right), \quad i = 1, 2, \ldots, k$$

where the following hold:
$C_i$ is the i-th reasoning chain.
$r_{i,j}$ is the j-th reasoning step in the i-th chain.
$\hat{A}_i$ is the answer predicted by the i-th chain.
The full probabilistic generation process for each chain is as follows:

$$P(C_i \mid I, T, Q; M) = \left[\prod_{j=1}^{n_i} P(r_{i,j} \mid r_{i,<j}, I, T, Q; M)\right] P(\hat{A}_i \mid r_{i,1}, \ldots, r_{i,n_i}, I, T, Q; M)$$
Each chain is generated with stochastic decoding parameters to encourage diversity: temperature = 0.8, top-p = 0.9, and top-k = 60.
After all $k$ chains are generated, their final predicted answers $\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_k$ are aggregated using a function $g$ to compute the final answer as follows:

$$\hat{A} = g\left(\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_k\right)$$
The aggregation function g is defined based on the task type as follows:
For categorical tasks (e.g., Basic Judging), majority voting is used:

$$\hat{A} = \arg\max_{a} \sum_{i=1}^{k} \mathbb{1}\left[\hat{A}_i = a\right]$$
For numerical tasks (e.g., counting tasks), a robust central value of the chain predictions is taken, such as the median:

$$\hat{A} = \operatorname{median}\left(\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_k\right)$$
Example Aggregation:
For the example, the answers from the five chains are Yes, Yes, Yes, Yes, No. Using majority voting for this categorical task yields the following:

$$\hat{A} = \arg\max_{a \in \{\text{Yes}, \text{No}\}} \left\{\text{Yes}: 4,\; \text{No}: 1\right\} = \text{Yes}$$

The final answer is Yes, as it has the majority.
The final output probability can be approximated by marginalizing over the chains as follows:

$$P(\hat{A} = a \mid I, T, Q) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\left[\hat{A}_i = a\right]$$
This ensemble-style aggregation reduces sensitivity to any single reasoning path, enhancing consistency and reliability across tasks characterized by high ambiguity or complexity.
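The aggregation step itself is simple to implement. The sketch below applies majority voting for categorical tasks, mirroring the worked example above; the median used for the counting tasks is an assumed choice of robust aggregator, and the task-type keys are shorthand introduced here for illustration.

```python
from collections import Counter
from statistics import median
from typing import List, Union

NUMERICAL_TASKS = {"Bas Co", "Rel Co"}   # counting tasks

def aggregate_chain_answers(answers: List[str], task_type: str) -> Union[str, float]:
    """Combine the k per-chain answers into one final answer."""
    if task_type in NUMERICAL_TASKS:
        # Assumed robust aggregation for counts: the median of the chain predictions.
        return median(float(a) for a in answers)
    # Categorical tasks: majority vote across chains.
    return Counter(answers).most_common(1)[0][0]

# Worked example from the Basic Judging question above:
print(aggregate_chain_answers(["Yes", "Yes", "Yes", "Yes", "No"], "Bas Ju"))  # -> Yes
```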
Steps for Applying the Self-GeoSense Framework:
Input Preparation: Prepare the input tuple (I, T, Q, A) as described in the previous frameworks. For the example, I is the satellite image, the task type is Basic Judging, Q is "Are there any roads in this scene?", and A is "Yes".
Task-Specific CoT Prompt Reuse: Use the task-specific CoT prompt designed for the CoT-GeoReason framework, as already shown in the previous framework, tailored for Basic Judging to guide the model in identifying roads in the image.
Multiple Reasoning Chains: For each input, generate k independent CoT reasoning chains $C_1, C_2, \ldots, C_k$, each producing a sequence of reasoning steps (separated by ⇒) and a predicted answer $\hat{A}_i$. The number of chains k is chosen to balance robustness and computational cost.
Answer Aggregation: Aggregate the k answers using the appropriate aggregation function g. For the example, majority voting selects "Yes" as the final answer.
Output Evaluation: Compare the final answer ("Yes") with the ground-truth answer A ("Yes") to assess performance. The ensemble process approximates a more reliable response distribution across chains.
The Self-GeoSense framework enhances the reliability of remote sensing analysis by generating multiple independent chains of reasoning for each input and aggregating their outputs to produce a more robust final answer. While the prompting approach is based on the standard Chain of Thought (CoT) technique, Self-GeoSense employs task-specific prompts that are carefully tailored to the unique characteristics of each task type. This customization ensures that each reasoning chain effectively addresses the specific demands of its respective task. Figure 10, Figure 11 and Figure 12 illustrate the CoT-style prompt designs used in the Self-GeoSense framework for the six task types, grouped in pairs.
We conducted an evaluation using the Self-GeoSense framework to assess the performance of four LMMs: GPT-4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Accuracy was measured across all task types, while RMSE was calculated specifically for counting-related tasks. For Basic Judging and Basic Counting, Accuracy was determined directly based on objective criteria and ground truth labels. In contrast, for Reasoning-Based Judging, Comprehensive Analysis, and Object Situation Analysis tasks, Accuracy reflects a semantic evaluation incorporating both human expert judgment and LMM-as-a-judge methodologies. This combined approach harnesses automated reasoning alongside domain expertise to capture nuanced facets of model performance that exceed purely automatic metrics. A detailed comparison of model performance across task types within the Self-GeoSense framework is presented in Table 5.
5. Result Analysis
5.1. Detailed Task-Wise Comparison of Zero-GeoVision Framework vs. Semantic Object Awareness Framework (SOBA) Method
Table 3 presents the performance of the Zero-GeoVision framework, leveraging zero-shot prompting with LMMs (GPT-4o, Grok 3, Gemini 2.5 Pro, Claude 3.7 Sonnet), against the Semantic Object Awareness framework (SOBA) across six task categories: Basic Judging (Bas Ju), Reasoning-Based Judging (Rel Ju), Basic Counting (Bas Co), Reasoning-Based Counting (Rel Co), Object Situation Analysis (Obj An), and Comprehensive Analysis (Com An).
For Bas Ju, Grok 3 led Zero-GeoVision at 78.55%, while SOBA achieved 89.63%, outperforming the best Zero-GeoVision model by 11.08%. In Rel Ju, Grok 3 scored 73.12%, with SOBA at 82.64% (+9.52%). For Bas Co and Rel Co, top Zero-GeoVision models scored 78.12% (Claude 3.7 Sonnet) and 64.85% (Grok 3), respectively, while SOBA maintained higher accuracies of 80.17% and 67.86%. In Obj An and Com An, Zero-GeoVision peaked at 59.23% and 48.34%, whereas SOBA achieved 61.40% and 49.30%, highlighting challenges in tasks requiring relational reasoning.
Human expert analysis confirmed that Zero-GeoVision models often failed to capture relational cues and complex spatial-semantic interactions, producing superficial answers in Obj An and Com An tasks. To address this, we introduced CoT-GeoReason and Self-GeoSense frameworks, which incorporate structured, stepwise reasoning to guide models in analyzing and synthesizing visual information, mimicking human expert workflows. Initial evaluations show that reasoning-augmented frameworks significantly improve performance, particularly in Obj An and Com An, validating the importance of integrating human-like reasoning for high-level remote sensing visual question answering.
5.2. Detailed Task-Wise Comparison of CoT-GeoReason Framework vs. Semantic Object Awareness Framework (SOBA) Method
Table 4 presents the results of the CoT-GeoReason framework, which applies chain-of-thought prompting to LMMs (GPT-4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet) in comparison with the SOBA baseline across six remote sensing VQA tasks. In Bas Ju, Grok 3 achieves 88.69%, closely matching SOBA's 89.63% with only a 0.94% difference, while the other CoT-GeoReason models remain competitive. For Rel Ju, Grok 3 leads with 85.42%, exceeding SOBA's 82.64%, and all CoT-GeoReason models surpass the baseline. In counting tasks, Bas Co and Rel Co, Grok 3 achieves 90.37% and 78.27%, representing improvements of +10.20% and +10.41% over SOBA, with the other models also outperforming the baseline. On the reasoning-intensive tasks, Obj An and Com An, Grok 3 records 69.64% (+8.24%) and 58.29% (+8.99%) respectively, showing consistent gains over SOBA and confirming the strength of chain-of-thought reasoning in complex analyses.
Error analysis with RMSE further supports these findings. CoT-GeoReason models maintain lower error rates than SOBA, with Grok 3 achieving the best balance between accuracy and reliability (0.7214 in Bas Co and 0.8725 in Rel Co compared to SOBA’s higher 0.7856 and 1.1457). Overall, CoT-GeoReason consistently matches or surpasses SOBA across all categories, demonstrating that chain-of-thought prompting enhances both accuracy and reasoning quality in multimodal remote sensing VQA tasks. To ensure robustness, outputs from CoT-GeoReason were first evaluated by a proprietary GPT-4o-mini LMM Judge, which assessed logical coherence and correctness. These judgments, along with detailed reasoning traces, were independently reviewed by human domain experts. This two-tier evaluation, combining automated reasoning validation with expert review, helped quantify agreement, identify model strengths and weaknesses, and refine prompting strategies. Integrating human oversight ensured that the observed improvements from CoT-GeoReason translated into meaningful performance gains, bridging the gap between automated model outputs and expert-level understanding.
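To make the automated half of this two-tier evaluation concrete, the sketch below outlines one way an LMM judge such as GPT-4o-mini could be queried. The prompt wording and the call_judge hook are assumptions introduced here for illustration, while the three labels (Matched, Partially Matched, Not Matched) follow the judging scheme described earlier.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are an impartial judge for remote sensing visual question answering.\n"
    "Question: {question}\nGround-truth answer: {reference}\n"
    "Model answer and reasoning: {prediction}\n"
    "Considering both the final answer and the coherence of the reasoning chain, "
    "reply with exactly one label: Matched, Partially Matched, or Not Matched."
)
VALID_LABELS = {"Matched", "Partially Matched", "Not Matched"}

def judge_prediction(question: str, reference: str, prediction: str,
                     call_judge: Callable[[str], str]) -> str:
    """Query the judge LMM and normalize its verdict; unparsable replies count as Not Matched."""
    reply = call_judge(JUDGE_PROMPT.format(question=question,
                                           reference=reference,
                                           prediction=prediction)).strip()
    return reply if reply in VALID_LABELS else "Not Matched"
```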
5.3. Detailed Task-Wise Comparison of Self-GeoSense Framework vs. Semantic Object Awareness Framework (SOBA) Method
Building on insights from CoT-GeoReason, we observed that multiple valid reasoning paths often lead to correct answers in complex remote sensing VQA tasks. Outputs from LMMs were first assessed by a GPT-4o-mini LMM Judge for logical coherence, then blindly reviewed by human experts. This revealed the value of self-consistency: instead of relying on a single deterministic chain, Self-GeoSense samples multiple reasoning trajectories and selects the answer most consistently supported across them. Combined with spatial grounding, this approach improves reliability and captures intricate object–context relationships.
Table 5 compares Self-GeoSense performance with the SOBA baseline. In Bas Ju, Grok 3 achieves the highest accuracy of 94.69% (+5.06%). For Rel Ju, Grok 3 again leads with 89.42% (+6.78%). In Bas Co, Grok 3 records the top score of 93.18% (+12.99%). For Rel Co, Gemini 2.5 Pro achieves the best performance at 85.48% (+17.62%).
Performance gains are particularly notable in higher-level reasoning tasks. In Obj An, Grok 3 attains the highest score of 77.64% (+16.24%). In Com An, Grok 3 again leads with 65.29% (+15.99%). For error sensitivity, Grok 3 obtains the lowest RMSE values (0.7102 Bas Co, 0.8551 Rel Co), followed closely by Gemini 2.5 Pro (0.7283 and 0.8387).
Overall, Self-GeoSense consistently outperforms SOBA across all tasks and error metrics, underscoring the effectiveness of integrating self-consistent reasoning with spatial grounding for robust multimodal interpretation in remote sensing VQA.
6. Computational Cost and Efficiency Analysis
This section evaluates the computational cost and efficiency of the proposed frameworks: Zero-GeoVision, CoT-GeoReason, and Self-GeoSense. We report the estimated average token usage per inference and the corresponding estimated average cost in USD per inference, based on per-one-million-token API pricing and considering variable prompt lengths for different reasoning frameworks. These metrics provide insight into the practical applicability of each framework, particularly in real-time or resource-constrained scenarios. The Self-GeoSense framework generates multiple independent reasoning chains, which increases both token usage and computational cost (see Table 6).
6.1. Approach to Cost Estimation
Token usage is composed of both input tokens (prompts and image embeddings) and output tokens (answers and reasoning chains). Image embeddings for high-resolution images are typically fixed-size (e.g., 512 tokens), while prompt length varies depending on the task and framework as follows:
Zero-GeoVision: concise prompts (≈100 tokens).
CoT-GeoReason: detailed prompts for step-by-step reasoning (≈500 tokens).
Self-GeoSense: comprehensive prompts for multiple independent reasoning chains (≈800 tokens).
Output tokens include short answers in Zero-GeoVision, step-by-step reasoning chains in CoT-GeoReason, and multiple reasoning chains in Self-GeoSense (five chains per input). Cost is calculated using API pricing (September 2025) as follows:
GPT-4o, owned by OpenAI: $2.50 per 1 M input tokens, $10.00 per 1 M output tokens.
Claude 3.7 Sonnet, owned by Anthropic: $3.00 per 1 M input tokens, $15.00 per 1 M output tokens.
Grok 3, owned by xAI: $3.00 per 1 M input tokens, $15.00 per 1 M output tokens.
Gemini 2.5 Pro, owned by Google DeepMind: $1.25 per 1 M input tokens, $10.00 per 1 M output tokens.
6.2. Cost Computation
Costs are computed using the following formula:

$$\text{Cost per inference} = \frac{\text{Input Tokens}}{10^{6}} \times \text{Input Price} + \frac{\text{Output Tokens}}{10^{6}} \times \text{Output Price}$$
Representative token counts for a single high-resolution image with a complex prompt are outlined as follows:
Zero-GeoVision: Input = 100 + 512 = 612 tokens, Output = 150 tokens.
CoT-GeoReason: Input = 500 + 512 = 1012 tokens, Output = 500 tokens.
Self-GeoSense: Input = 800 + 512 = 1312 tokens, Output = 5 × 500 = 2500 tokens.
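The per-inference ranges reported below can be reproduced directly from these token counts and the September 2025 prices listed above; the short script that follows is a sketch of that arithmetic.

```python
# Estimated cost per inference from representative token counts and per-million-token prices.
PRICING_USD_PER_M = {          # (input, output) prices in USD per 1 M tokens
    "GPT-4o": (2.50, 10.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
    "Grok 3": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
}
TOKENS = {                     # (input tokens incl. 512-token image embedding, output tokens)
    "Zero-GeoVision": (612, 150),
    "CoT-GeoReason": (1012, 500),
    "Self-GeoSense": (1312, 2500),   # five chains of roughly 500 output tokens each
}

def cost_per_inference(model: str, framework: str) -> float:
    in_tok, out_tok = TOKENS[framework]
    in_price, out_price = PRICING_USD_PER_M[model]
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

for framework in TOKENS:
    costs = [cost_per_inference(model, framework) for model in PRICING_USD_PER_M]
    print(f"{framework}: ${min(costs):.5f} to ${max(costs):.5f} per inference")
# Reproduces, up to rounding, the per-inference ranges reported in Section 6.3.
```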
6.3. Observations and Implications
Zero-GeoVision remains the most cost-efficient, with estimated average costs between $0.00226 and $0.00409 per inference, suitable for low-latency applications.
CoT-GeoReason balances reasoning performance and cost, with estimated average costs between $0.00626 and $0.01054 per inference.
Self-GeoSense incurs the highest cost due to multiple reasoning chains, ranging from $0.02664 to $0.04144 per inference, and may be less practical for real-time scenarios.
Token usage and estimated average cost vary depending on image resolution, prompt length, and reasoning chain complexity. Optimization strategies, such as selective chain generation, parallel processing, or adaptive prompt truncation, can help reduce computational overhead for Self-GeoSense.
7. Limitations
Despite the promising results of the proposed frameworks, several limitations warrant consideration. First, the computational cost of the Self-GeoSense framework, which generates multiple reasoning chains (k = 5), is significantly higher than that of Zero-GeoVision or CoT-GeoReason, potentially limiting its scalability for real-time applications or resource-constrained environments. Additionally, the EarthVQA dataset, while diverse, is geographically limited to regions in Nanjing, Changzhou, and Wuhan, which may introduce biases and restrict generalization to other global contexts or varied satellite imagery types (e.g., SAR or hyperspectral). Furthermore, the reliance on human expert evaluations in the Geo-Judge framework introduces subjectivity and time-intensive processes, potentially affecting scalability and consistency across large datasets. Moreover, the large multimodal models (LMMs) tested (GPT-4o, Grok 3, Gemini 2.5 Pro, Claude 3.7 Sonnet) operate in a zero-shot setting without fine-tuning, which may cap their performance compared to models specifically adapted for remote sensing tasks. These limitations highlight the need for further optimization and broader dataset inclusion to enhance practical applicability.

In addition, the effectiveness of the Self-GeoSense and CoT-GeoReason frameworks heavily depends on carefully crafted prompts and reasoning templates, and this reliance on prompt engineering may reduce robustness when adapting to new or unforeseen question types, thereby requiring domain experts for optimal configuration. Moreover, the frameworks currently do not explicitly model or quantify uncertainty in predictions or reasoning chains, potentially leading to overconfident outputs in ambiguous or low-quality imagery scenarios. Also, the EarthVQA dataset's composition may bias evaluation results toward certain land cover types or seasonal conditions, thereby limiting real-world generalizability. While the dual-stage evaluation approach improves assessment accuracy, it introduces a scalability bottleneck due to dependence on human-in-the-loop judgment for large-scale datasets or continuous model monitoring. Furthermore, the frameworks do not account for temporal changes in remote sensing imagery, limiting their suitability for monitoring dynamic environmental phenomena or detecting changes over time. Lastly, the LMMs may inherit biases from their pretraining data, potentially influencing their interpretation of remote sensing inputs and leading to skewed or culturally biased reasoning outcomes.
8. Future Work
Future research can address the identified limitations by exploring several directions to improve the reliability and applicability of the proposed frameworks for remote sensing VQA. First, optimizing the Self-GeoSense framework to reduce computational overhead is critical for real-time applications; this could involve dynamically adjusting the number of reasoning chains based on task complexity or leveraging efficient sampling techniques, such as adaptive top-k or top-p strategies, to balance performance and resource usage. Furthermore, developing multimodal agent architectures tailored to each task type (for example, Basic Judging, Reasoning-Based Counting, Comprehensive Analysis) could improve performance by assigning specialized submodules for visual processing, reasoning, and answer synthesis. For instance, a dedicated agent for spatial analysis could enhance object detection in Basic Counting, while a reasoning-focused agent could improve Comprehensive Analysis. In addition, expanding the dataset to include remote sensing images from a broader range of cities within Bangladesh beyond major urban centers like Dhaka, Chittagong, and Khulna, such as Sylhet, Rajshahi, Barisal, or Rangpur, would enhance geographical diversity and mitigate regional biases. Moreover, incorporating diverse types of remote sensing imagery, including Synthetic Aperture Radar (SAR), hyperspectral, thermal infrared, and LiDAR data, could improve model generalization across varied environmental conditions and imaging modalities. This would enable the frameworks to handle complex scenarios such as nighttime imaging or cloud-covered scenes, which are critical for applications like disaster response. Additionally, reducing reliance on human expert evaluations within the Geo-Judge framework could be achieved by developing advanced LMM-based judges or incorporating inter-rater agreement metrics (for example, Cohen's kappa) to quantify consistency between automated and human judgments, which would streamline evaluation while maintaining robustness. Finally, fine-tuning LMMs using few-shot learning with carefully curated examples from the EarthVQA dataset could enhance performance without requiring extensive retraining. For instance, providing five to ten task-specific QA pairs per task type could help models better adapt to the nuances of remote sensing imagery.
9. Conclusions
This study advances remote sensing VQA by showing how structured reasoning strategies improve LMMs on the EarthVQA dataset, which includes 6000 satellite images and over 208,000 question–answer pairs. From this dataset, we selected 200 images (100 rural and 100 urban) together with their associated question–answer pairs. To evaluate LMM performance on diverse geospatial reasoning tasks, we proposed three hybrid frameworks—Zero-GeoVision, CoT-GeoReason, and Self-GeoSense—targeting six task types—Basic Judging, Reasoning-Based Judging, Basic Counting, Reasoning-Based Counting, Object Situation Analysis, and Comprehensive Analysis. Using LMMs such as GPT-4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet, the reasoning-augmented frameworks consistently outperformed the baseline Semantic Object Awareness framework (SOBA), as confirmed by the Geo-Judge framework combining automated and expert human evaluations. Zero-GeoVision leverages LMMs' zero-shot capabilities with task-specific prompts to interpret images and questions directly, but it underperformed compared to SOBA in tasks requiring deeper geospatial understanding. To overcome this, the CoT-GeoReason framework incorporates chain-of-thought prompting, guiding models through step-by-step reasoning and intermediate analyses. Evaluated with the Geo-Judge framework and human experts, CoT-GeoReason outperformed Zero-GeoVision and surpassed SOBA in several reasoning-intensive tasks. Building further, the Self-GeoSense framework enhances robustness by generating multiple reasoning chains per input with stochastic decoding. Validated by Geo-Judge, Self-GeoSense achieved the highest performance across all tasks, with Grok 3 exceeding SOBA by 5.06% in Basic Judging, 12.99% in Basic Counting, 6.78% in Reasoning-Based Judging, 15.43% in Reasoning-Based Counting, 16.24% in Object Situation Analysis, and 15.99% in Comprehensive Analysis. This work demonstrates that structured reasoning and self-consistency significantly enhance LMM performance for complex geospatial VQA tasks while ensuring reliable evaluation through combined automated and human expert assessments.