Article

DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-Guided Difference Perception

by Pei Deng, Wenqian Zhou and Hanlin Wu *
School of Information Science and Technology, Beijing Foreign Studies University, Beijing 100875, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2026, 18(4), 541; https://doi.org/10.3390/rs18040541
Submission received: 1 January 2026 / Revised: 2 February 2026 / Accepted: 5 February 2026 / Published: 8 February 2026
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • A novel interactive paradigm that supports the multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images.
  • We present ChangeChat-105k, a large-scale instruction-following dataset of 105K+ query–response pairs across six change analysis tasks, and DeltaVLM, a tailored vision language model.
What are the implications of the main findings?
  • Enables the natural language-based monitoring of dynamic Earth processes for applications such as urban planning, environmental monitoring, and disaster response.
  • Provides the first benchmark and architecture for interactive change interpretation in remote sensing, connecting visual detection with human-centered reasoning.

Abstract

The accurate interpretation of land cover changes in multi-temporal satellite imagery is critical for Earth observation. However, existing methods typically yield static outputs—such as binary masks or fixed captions—lacking interactivity and user guidance. To address this limitation, we introduce remote sensing image change analysis (RSICA), a novel paradigm that enables the instruction-guided, multi-turn exploration of temporal differences in bi-temporal images through visual question answering. To realize RSICA, we propose DeltaVLM, a vision language model specifically designed for interactive change understanding. DeltaVLM comprises three key components: (1) a fine-tuned bi-temporal vision encoder that independently extracts semantic features from each image in the input pair; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former that selects query-relevant change features and aligns them with a frozen large language model to generate context-aware responses. We also present ChangeChat-105k, a large-scale instruction-following dataset containing over 105k diverse samples. Extensive experiments show that DeltaVLM achieves state-of-the-art performance in both single-turn captioning and multi-turn interactive change analysis, surpassing both general multimodal models and specialized remote sensing vision language models.

Graphical Abstract

1. Introduction

The continuous acquisition of vast amounts of data by Earth observation satellites opens up opportunities to monitor our dynamic planet through remote sensing images (RSIs). These images enable the extraction and interpretation of temporal changes, offering significant value for applications such as disaster management [1], deforestation monitoring [2], and environmental surveillance [3]. However, analyzing RSIs poses distinct challenges compared to natural image processing, particularly in interpreting multi-temporal changes. Factors such as atmospheric variations, sensor differences, and geometric distortions [4] complicate the accurate detection and interpretation of changes over time. Early efforts to analyze temporal changes in RS imagery primarily relied on change detection techniques [5], such as pixel-level or object-based methods. While effective in locating changes, these methods often fail to provide insights into the nature of these changes.
Integrating natural language processing (NLP) techniques into RSI interpretation has narrowed the gap between raw visual data and human understanding. Remote sensing image captioning (RSIC) [6] aims to generate descriptive captions for a single observation. To enable more interactive exploration, remote sensing visual question answering (RSVQA) [7] was introduced, allowing users to query RSIs with natural language questions and receive corresponding textual responses. However, RSIC and RSVQA are constrained to single-image analysis and lack the ability to capture temporal changes. To address this limitation, RS image change captioning (RSICC) [8] extends RSIC to bi-temporal analysis, generating textual descriptions of spatiotemporal differences.
More recently, the advent of large language models (LLMs) [9] and their evolution into vision language models (VLMs) [10] have introduced interactive capabilities for RS interpretation, accommodating follow-up inquiries beyond static captions. However, these models, primarily trained on natural scenes, show limited performance on RS tasks due to significant data distribution differences, as illustrated in Figure 1. To address this domain gap, recent work has adapted VLMs to RS tasks through instruction tuning on domain-specific datasets [11]. This approach efficiently transfers VLMs’ reasoning capabilities to RS interpretation, achieving strong performance on complex, open-ended tasks with relatively few training samples. For example, RSGPT [12] extends LLMs to RS tasks like RSIC and RSVQA via domain-adaptive fine-tuning. Similarly, models like GeoChat [13] and RS-LLaVA [14], built upon the LLaVA [15] framework, excel in single-image region-based question answering (QA) and visual grounding. However, these models are limited to single-image analysis and cannot generate instruction-specific change descriptions from bi-temporal data. Moreover, most existing RS-VLMs primarily focus on fine-tuning LLM backbones, while overlooking RS-specific visual challenges in multi-temporal analysis, which can mask meaningful temporal differences [16]. This limitation is further compounded by the absence of large-scale instruction-following datasets for interactive bi-temporal RS analysis.
To address these challenges, we introduce remote sensing image change analysis (RSICA), a novel task that integrates the semantic grounding of change captioning with the interactive and reasoning capabilities of VQA. RSICA enables multi-turn, multi-task dialogs, allowing users to dynamically explore and interpret changes in bi-temporal RSIs. To support this task, we meticulously construct ChangeChat-105k, a large-scale instruction-following dataset with 105,107 instruction–response pairs, generated via a hybrid approach combining rule-based methods and ChatGPT (GPT-4o)’s in-context learning capabilities [17]. Our proposed dataset covers a wide range of instruction types for interactive change analysis, including (1) change captioning, (2) binary change detection, (3) category-specific change quantification, (4) change localization, (5) open-ended QA, and (6) multi-turn conversation.
Building upon the ChangeChat-105k dataset, we present DeltaVLM, an innovative end-to-end architecture for query-driven RSICA. Unlike existing VLMs that target a single image as input, DeltaVLM extends the conventional three-stage change captioning pipeline into a specialized vision language framework that enables the instruction-guided, multi-task change interpretation of bi-temporal RSIs. Its core components include (1) a bi-temporal vision encoder (Bi-VE), which processes bi-temporal RSIs to extract and compare features, capturing temporal differences at multiple scales; (2) an instruction-guided difference perception module (IDPM), incorporating a cross-semantic relation measuring (CSRM) mechanism and a Q-former to perceive subtle visual changes, filter out irrelevant semantics and noise in context, and dynamically align these differences with user-specific instructions; and (3) an LLM that decodes the aligned difference information into context-aware language responses.
To validate the effectiveness of DeltaVLM, we conduct comprehensive experiments and ablation studies on the proposed RSICA task using ChangeChat-105k. The results demonstrate that DeltaVLM achieves state-of-the-art (SOTA) performance, outperforming existing general-purpose large VLMs and RS change captioning models in interactive change analysis scenarios.
The main contributions of this paper can be summarized as follows:
  • We introduce RSICA, a novel task that integrates change captioning and VQA into a unified, interactive, and user-driven framework for analyzing bi-temporal RSIs.
  • We present ChangeChat-105k, a large-scale RS instruction-following dataset covering diverse change-related tasks including captioning, classification, counting, localization, open-ended QA, and multi-turn dialog.
  • We propose DeltaVLM, an innovative VLM architecture tailored to RSICA, which integrates a bi-temporal vision encoder, a visual difference perception module with CSRM, and an instruction-guided Q-former for dynamic, context-aware multi-task and multi-turn interactions.
  • We conduct comprehensive evaluations, demonstrating DeltaVLM’s superior performance compared to existing baselines on the RSICA task, validating its effectiveness in addressing complex change analysis challenges.

2. Related Work

Our work builds on the evolution of change analysis in RS—from task-specific methods to unified vision language frameworks enabling flexible, interactive interpretation. We first summarize representative approaches to change detection and captioning and then discuss interactive analysis via VQA; finally, we review recent advances in general-purpose and RS-oriented VLMs.

2.1. Task-Specific Methods for Change Analysis

Early research in RS change analysis primarily focused on designing specialized architectures for individual tasks, such as pixel-level change detection or scene-level change captioning.

2.1.1. Change Detection

Change detection aims to identify and localize differences between satellite images of the same geographic area acquired at different times. Early approaches were largely algebraic or statistical, including image differencing, ratioing, and change vector analysis [18]. While computationally efficient, these methods are sensitive to noise, illumination variations, and radiometric inconsistencies.
To improve the robustness, researchers have proposed index-based [19], transformation-based [20], and classification-based methods [21]. Among them, transformation-based approaches have played an important role in radiometric normalization. For example, slow feature analysis (SFA) [22] extracts temporally invariant features to suppress irrelevant variations, while iteratively reweighted multivariate alteration detection (IR-MAD) [23] provides an automatic and robust solution for relative radiometric correction between bi-temporal images. Object-based image analysis [24] further introduces spatial context into the change detection pipeline, improving the performance on high-resolution imagery, although its effectiveness depends heavily on accurate segmentation.
With the rise of deep learning, convolutional neural network-based methods have significantly improved change detection performance. Siamese networks [25] and U-Net–based architectures [26] enable the end-to-end learning of change maps through pairwise feature comparison or direct segmentation. More recently, Transformer-based models [27,28] have shown strong performance by leveraging self-attention to capture long-range dependencies and multi-scale context. In parallel, state space models (SSMs) have emerged as an efficient alternative to attention-based architectures. ChangeMamba [29] adapts the Mamba architecture to remote sensing change detection and achieves competitive accuracy with a reduced computational cost. To further alleviate the reliance on large labeled datasets, self-supervised learning strategies—such as contrastive learning [30] and masked image modeling [31]—have also been explored.
Beyond algorithmic advances, recent work has emphasized practical applications in disaster response. For instance, Zheng et al. [32] propose ChangeOS, a deep object-based framework for building damage assessment from bi-temporal high-resolution imagery. The BRIGHT dataset [33] complements these efforts by providing a globally distributed, multimodal benchmark for AI-driven disaster response across diverse events and regions. A comprehensive overview of change detection methods is provided in [5].

2.1.2. Change Captioning

Unlike change detection, which produces binary or multi-class masks, change captioning focuses on generating natural language descriptions of observed changes between bi-temporal remote sensing images. Early approaches followed an encoder–decoder paradigm, where CNNs extracted visual features from paired images and RNNs generated textual descriptions [34]. Attention mechanisms were later introduced to guide the decoder toward salient change regions, improving the relevance and fluency of generated captions [35].
More recent methods adopt a three-stage pipeline consisting of visual encoding, bi-temporal feature fusion, and language decoding. Within this framework, several representative approaches have been proposed. RSICCFormer [8] employs ResNet-101 as the visual encoder and a dual-branch Transformer decoder to enhance change representation. The sparse focus Transformer (SFT) [36] introduces sparse attention to selectively attend to change regions while reducing the computational overhead. ViT-based methods such as PSNet [37] further incorporate multi-scale feature fusion to generate more detailed descriptions.
More recently, RSCaMa [38] introduced a Mamba-based architecture for change captioning, leveraging state space models to capture long-range temporal dependencies between bi-temporal features with linear computational complexity. In addition, multi-task learning and semantic alignment have been investigated to improve the caption quality. PromptCC [39] decomposes change captioning into binary change classification and fine-grained perception, while Semantic-CC [40] incorporates pixel-level semantic guidance from auxiliary change detection tasks. LLM-based approaches, such as CDChat [41], further explore the use of large language models to enhance the expressiveness and robustness of change descriptions.

2.2. Interactive Analysis via VQA

While change detection and captioning provide predefined outputs, they are typically limited to single-turn analysis. To support more flexible and user-driven interaction, researchers have drawn inspiration from VQA [42] and introduced RSVQA. RSVQA allows users to pose natural language questions tailored to specific analytical needs, with answers grounded in the visual content of remote sensing images. Lobry et al. [7] established two benchmark datasets, RSVQA-LR and RSVQA-HR, covering agricultural and urban scenes.
Most RSVQA approaches follow an encoder–fusion–decoder architecture. To enhance multimodal reasoning, several works focus on improving the fusion module. The spatial hierarchical reasoning network [43] integrates multi-scale attention and semantic segmentation priors to better capture the spatial structure, while object-aware methods [44] emphasize relational modeling at the object level. An alternative direction is Prompt-RSVQA [45], which converts visual information into textual prompts and relies on a language model for implicit fusion.
Despite these advances, most RSVQA methods focus on single-temporal imagery and do not explicitly model temporal changes. Change-aware VQA [46] represents an early attempt to incorporate multi-temporal information, but it formulates the problem as answer classification rather than open-ended generation, limiting the response flexibility. These limitations motivate the development of generative vision language models capable of interactive change analysis.

2.3. Unified VLMs

Recent research has shifted toward generalist VLMs that unify multiple tasks through large-scale pretraining. These models offer a promising foundation for flexible and interactive remote sensing analysis.

2.3.1. General-Purpose VLMs

General-purpose VLMs [15,47,48] integrate visual encoders with large language models to support a wide range of multimodal tasks, including image captioning, cross-modal retrieval, and VQA. Representative examples include GPT-4o [49], Qwen-VL-Plus [50], GLM-4V-Plus [51], Gemini-1.5-Pro [52], and DeepSeek-VL2 [53]. Although these models exhibit strong zero-shot reasoning capabilities, their performance in remote sensing scenarios is often limited by domain-specific challenges such as scale variation, atmospheric effects, and specialized semantics.

2.3.2. VLMs for RS

To bridge this gap, several studies have adapted VLMs to the remote sensing domain by constructing large-scale RS-specific vision language datasets and fine-tuning pretrained models [54]. GeoChat [13] enables region-level conversational question answering, while RSGPT [12] builds upon InstructBLIP [55] with an annotation-aware Q-former to improve visual–text alignment. SkyEyeGPT [56] further extends this line of work by supporting multiple tasks, including segmentation, detection, grounding, and conversational QA.
More recently, RSUniVLM [57] was introduced, which unifies a wide range of remote sensing interpretation tasks—such as captioning, VQA, and pixel-level segmentation—within a single autoregressive language model. By representing segmentation masks as discrete token sequences, RSUniVLM eliminates the need for specialized visual decoders and demonstrates the potential of language-centric architectures for comprehensive RS scene understanding.
Despite rapid progress in RS-oriented VLMs, most existing models focus on single-temporal imagery and are not designed for the interactive analysis of bi-temporal changes. ChangeChat [58] takes an important step in this direction by providing an instruction-following dataset and fine-tuning VLMs for temporally aware geospatial tasks. However, ChangeChat does not explicitly adapt visual feature extraction to different user instructions, which may limit the response precision in complex scenarios.
Building on this observation, we expand the ChangeChat dataset and propose DeltaVLM, which introduces instruction-guided differential feature extraction to support multi-task, multi-turn dialog for the user-driven interpretation of geospatial changes.

3. ChangeChat-105k Dataset

Recent advances in RS change detection and captioning have produced benchmark datasets like LEVIR-CC [8] and LEVIR-MCI [59]. LEVIR-CC provides 10,077 bi-temporal image pairs with five human-written captions each, supporting change captioning but lacking fine-grained annotations for object counts or precise locations. LEVIR-MCI, built upon LEVIR-CC, provides pixel-level change maps to support binary change detection but lacks the deeper exploration of change information and cannot support interactive analysis. Both datasets are derived from the LEVIR-CD benchmark, which comprises very-high-resolution Google Earth images (0.5 m/pixel) covering 20 urban regions. The original images were co-registered to ensure spatial alignment, and the annotations explicitly label changes in human-made structures—primarily buildings and roads. Since the annotations exclude phenological variations and radiometric artifacts, the resulting change labels reflect genuine land cover changes rather than pseudo-changes caused by seasonal or illumination differences.
To support RSICA, we introduce ChangeChat-105k, a large-scale dataset with 105,107 instruction–response pairs derived from LEVIR-CC and LEVIR-MCI. We employ a hybrid pipeline combining rule-based methods and LLM-based generation, leveraging ChatGPT’s [17] in-context learning. For structured tasks like object counting and localization, we use rule-based methods with LEVIR-MCI’s pixel-level change maps and OpenCV-based contour detection to extract precise information. For open-ended tasks, we generate diverse instructions using ChatGPT with curated prompts and seed examples. The dataset includes six instruction types, ranging from structured information extraction to open-ended reasoning, as detailed in Figure 2, enabling the comprehensive evaluation of the multi-task, interactive change analysis capabilities.

3.1. Change Captioning

In this task, we extend the original LEVIR-CC triplets ( I t 1 , I t 2 , C) into an instruction–response format. For each bi-temporal image pair ( I t 1 , I t 2 ) and its corresponding change caption (C), we design a fixed instruction Q: “Please briefly describe the changes in these two images.” The instruction–response pair is formatted as follows:
$$\text{Human}: I_{t_1}\ I_{t_2}\ Q\ \text{<STOP>} \qquad \text{Assistant}: C\ \text{<STOP>}.$$
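As a concrete illustration, the template above can be serialized in a few lines of Python. The image placeholder tokens and exact spacing below are illustrative assumptions, not the dataset's actual serialization.

```python
# Sketch of serializing one ChangeChat-105k change-captioning sample into the
# "Human ... Assistant ..." format described above. The image placeholder
# tokens (<img_t1>, <img_t2>) are hypothetical names for illustration.
CAPTION_INSTRUCTION = "Please briefly describe the changes in these two images."

def format_sample(caption: str,
                  instruction: str = CAPTION_INSTRUCTION,
                  img1_token: str = "<img_t1>",
                  img2_token: str = "<img_t2>",
                  stop: str = "<STOP>") -> str:
    """Build the instruction-response string for one bi-temporal image pair."""
    human = f"Human: {img1_token} {img2_token} {instruction} {stop}"
    assistant = f"Assistant: {caption} {stop}"
    return human + " " + assistant

sample = format_sample("a road has been built across the field")
```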

3.2. Binary Change Classification

In this task, we generate instructions that ask DeltaVLM to determine whether changes occurred, expecting a binary “yes” or “no” answer. The instructions are designed with the template “Please judge whether these two images have changed or not. Answer yes or no.” The ground truth for each image pair is derived from the change map in LEVIR-MCI. Phenological variations and pseudo-changes due to illumination or shadow differences are not considered positive labels.

3.3. Category-Specific Change Quantification

We create instructions to guide DeltaVLM in quantifying changes in specific categories, such as calculating the number of newly added buildings or roads. These quantity-related instructions are generated based on templates and use the OpenCV library’s contour detector to calculate the number of objects.
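The counting step can be sketched as connected-component labeling on a binary change map. The pipeline itself uses OpenCV's contour detector (e.g., `cv2.findContours`); the dependency-free flood-fill below is a stand-in for illustration only.

```python
import numpy as np

def count_change_objects(change_map: np.ndarray) -> int:
    """Count connected change regions in a binary change map.

    Stand-in for the OpenCV contour counting used in the dataset pipeline;
    here we label 4-connected components with a simple flood fill.
    """
    mask = change_map.astype(bool)
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not seen[r, c]:
                count += 1                      # found a new change object
                stack = [(r, c)]
                seen[r, c] = True
                while stack:                    # flood-fill its pixels
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return count

demo = np.zeros((8, 8), dtype=np.uint8)
demo[1:3, 1:3] = 1          # one new building
demo[5:7, 4:8] = 1          # a second, separate change region
```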

3.4. Change Localization

To localize changes spatially, we design instructions that ask DeltaVLM to return changed regions in a 3 × 3 grid, with cells denoted as
P = { TL , TC , TR , CL , CC , CR , BL , BC , BR } ,
where TL = top-left, TC = top-center, …, BR = bottom-right. The ground truth is obtained by splitting each change map into nine blocks; any block with >5% changed pixels is labeled as changed.
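The ground-truth rule above can be sketched directly: split the change map into nine blocks and mark any block whose changed-pixel fraction exceeds 5%.

```python
import numpy as np

# 3x3 grid labels as defined in the text, row-major from top-left.
GRID = ["TL", "TC", "TR", "CL", "CC", "CR", "BL", "BC", "BR"]

def localize_changes(change_map: np.ndarray, threshold: float = 0.05):
    """Return grid cells whose fraction of changed pixels exceeds threshold."""
    h, w = change_map.shape
    changed = []
    for i in range(3):
        for j in range(3):
            block = change_map[i * h // 3:(i + 1) * h // 3,
                               j * w // 3:(j + 1) * w // 3]
            if block.mean() > threshold:        # >5% of pixels changed
                changed.append(GRID[i * 3 + j])
    return changed

mask = np.zeros((9, 9))
mask[0:3, 0:3] = 1   # fully changed top-left block
```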

3.5. Open-Ended QA

To generate more diverse instruction-following data, we leverage ChatGPT’s in-context learning capabilities to automatically generate instruction–response pairs. As shown in Figure 3, we begin by providing ChatGPT with a system message to guide its responses. Then, we manually design a few seed examples for each task type to help it understand the desired output structure. Specifically, it generates two types of conversational data: (i) QA pairs from change captions and (ii) fine-grained queries incorporating extracted contour and quantification information.
Notably, we did not provide any visual information to ChatGPT. All questions and answers were derived from prompts that we constructed based on five captions, as well as the change contours and counting information extracted from the change map, as illustrated in Figure 3b.

3.6. Multi-Turn Conversation

We design multi-turn dialogs to encourage DeltaVLM to perform change analysis using a chain-of-thought (CoT) approach. The instructions are presented in increasing difficulty, beginning with simple binary change classification, followed by change object and quantity identification and progressing to the complex and detailed change captioning task.

4. Proposed Method

In this section, we provide a detailed explanation of the architecture of DeltaVLM.

4.1. Overview

As shown in Figure 4, DeltaVLM is an end-to-end framework tailored to interactive RSICA, comprising three key steps: (1) bi-temporal visual feature encoding, (2) instruction-guided difference feature extraction, and (3) language decoding based on an LLM.
First, the bi-temporal vision encoder (Bi-VE) extracts features from the paired input images I t 1 and I t 2 ,
$$F_{t_1}, F_{t_2} = \Phi_{\text{Bi-VE}}(I_{t_1}, I_{t_2}).$$
Then, the IDPM enhances the bi-temporal features F t 1 , F t 2 with a CSRM mechanism, followed by a Q-former that aligns the refined features with user instruction P and learnable queries Q, producing instruction-guided difference representations F ^ diff ,
$$F'_{t_1}, F'_{t_2} = \Phi_{\text{enhancer}}(F_{t_1}, F_{t_2}), \qquad \hat{F}_{\text{diff}} = \Phi_{\text{Q-former}}([F'_{t_1}, F'_{t_2}]; P, Q).$$
Finally, $\hat{F}_{\text{diff}}$ is decoded by an LLM into a natural language response $T$, conditioned on the instruction $P$:
$$T = \Phi_{\text{LLM}}(\hat{F}_{\text{diff}}, P).$$

4.2. Bi-Temporal Vision Encoding

To leverage the power of large-scale pretraining for our Bi-VE, we adopt EVA-ViT-g/14 [60] as its backbone. To adapt it to RSICA while mitigating catastrophic forgetting, we employ selective fine-tuning: the first 37 Transformer layers are frozen, and only the final two blocks are fine-tuned.
Given a bi-temporal RS image pair $I_{t_1}, I_{t_2} \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the height and width, respectively, $\Phi_{\text{Bi-VE}}$ processes each image independently to avoid the early fusion of temporal information and prevent biases in initial feature extraction. Each image is first divided into a sequence of $16 \times 16$ patch embeddings, which are then passed through the Transformer encoder layers to capture complex visual patterns. Features are extracted from the second-to-last layer (bypassing the classification head) for task-specific semantics, yielding $F_{t_1}, F_{t_2} \in \mathbb{R}^{N \times D}$, where $N$ is the number of patches and $D$ is the hidden dimension.
Mathematically, the Bi-VE’s operation is given by
$$F_{t_1} = \Phi_{\text{ViT}}(I_{t_1}; \Theta_{\text{fine-tuned}}) \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times D}, \qquad F_{t_2} = \Phi_{\text{ViT}}(I_{t_2}; \Theta_{\text{fine-tuned}}) \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times D},$$
where Φ ViT is the EVA-ViT-g/14 encoder, and Θ fine - tuned denotes the parameters of the fine-tuned layers. The spatial resolution of F t 1 and F t 2 is reduced by a factor of 16 due to patch-based processing. These feature maps are then fed into the IDPM.
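The patch-based /16 spatial reduction can be checked with a small shape helper. The hidden dimension used below (1408, the EVA-ViT-g width) is an assumption not stated in the text.

```python
def bive_feature_shape(h: int, w: int, patch: int = 16, dim: int = 1408):
    """Spatial size of Bi-VE output features for an h x w input image.

    The /16 reduction follows the patch-based processing described above;
    dim=1408 (EVA-ViT-g width) is an illustrative assumption.
    """
    assert h % patch == 0 and w % patch == 0, "input must tile into patches"
    return h // patch, w // patch, dim

# A 224 x 224 input (the training resolution) yields a 14 x 14 feature grid.
shape_224 = bive_feature_shape(224, 224)
```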

4.3. Instruction-Guided Difference Perception

Given bi-temporal features F t 1 , F t 2 , we first compute the raw visual difference,
$$F_{\text{diff}} = F_{t_2} - F_{t_1},$$
where $F_{\text{diff}} \in \mathbb{R}^{N \times D}$ captures all pixel-level changes. Directly decoding $F_{\text{diff}}$ into language may introduce interference such as sensor differences, lighting, or seasonal variations. To address this problem, we first explore the semantic relationships between $F_{\text{diff}}$, $F_{t_1}$, and $F_{t_2}$ through a CSRM mechanism to eliminate irrelevant change interference. Subsequently, cross-modal alignment is achieved through an instruction-guided Q-former.

4.3.1. Cross-Semantic Relation Measuring

The CSRM mechanism works in three steps: contextualizing, gating, and filtering.
Step 1: contextualizing. To understand how changes relate to each temporal state, we compute context vectors by fusing difference features with original features:
$$C_{t_1} = \tanh(W_c [F_{\text{diff}}; F_{t_1}] + b_c), \qquad C_{t_2} = \tanh(W'_c [F_{\text{diff}}; F_{t_2}] + b'_c),$$
where $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation; $W_c, W'_c \in \mathbb{R}^{D \times 2D}$ and $b_c, b'_c \in \mathbb{R}^{D}$ are learnable weights and biases. These linear projections transform the concatenated features into a new space that emphasizes semantic connections, while the tanh activation constrains the outputs to $[-1, 1]$.
Step 2: Gating. Through contextualizing, these context vectors capture change–context relationships. To further weight each detected change by its semantic relevance, we then employ a gating mechanism, inspired by gated recurrent units (GRUs) [61]. This step generates gate vectors G t 1 and G t 2 via a sigmoid activation:
$$G_{t_1} = \sigma(W_g [F_{\text{diff}}; F_{t_1}] + b_g), \qquad G_{t_2} = \sigma(W'_g [F_{\text{diff}}; F_{t_2}] + b'_g),$$
where $\sigma$ is the sigmoid function producing values within $(0, 1)$ as relevance scores; $W_g, W'_g \in \mathbb{R}^{D \times 2D}$ and $b_g, b'_g \in \mathbb{R}^{D}$ are learnable weights and biases.
Step 3: Filtering. Finally, we selectively retain semantically relevant information through element-wise multiplication (⊙) between gate vectors and their corresponding context vectors:
$$F'_{t_1} = G_{t_1} \odot C_{t_1}, \qquad F'_{t_2} = G_{t_2} \odot C_{t_2}.$$
This multiplication refines $C_{t_1}$ and $C_{t_2}$ under the guidance of $G_{t_1}$ and $G_{t_2}$ by suppressing irrelevant components (e.g., noise) with low gate values while preserving important changes (e.g., new structures or land cover shifts) with high values. Consequently, the resulting filtered features $F'_{t_1}$ and $F'_{t_2}$ retain only semantically relevant changes, which are then passed to the subsequent Q-former module.
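The three CSRM steps can be sketched for one temporal branch in numpy. Weights are random and sizes are toy values for illustration; the second branch would use the primed weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8   # toy token count and feature dim (real features are much larger)

def csrm_branch(F_diff, F_t, W_c, b_c, W_g, b_g):
    """One branch of cross-semantic relation measuring.

    Contextualize: C = tanh(W_c [F_diff; F_t] + b_c)
    Gate:          G = sigmoid(W_g [F_diff; F_t] + b_g)
    Filter:        F' = G * C   (element-wise)
    """
    X = np.concatenate([F_diff, F_t], axis=-1)        # [N, 2D] concatenation
    C = np.tanh(X @ W_c.T + b_c)                      # context vectors, in (-1, 1)
    G = 1.0 / (1.0 + np.exp(-(X @ W_g.T + b_g)))      # gate vectors, in (0, 1)
    return G * C                                      # filtered features [N, D]

F_t1 = rng.normal(size=(N, D))
F_t2 = rng.normal(size=(N, D))
F_diff = F_t2 - F_t1
W_c, W_g = rng.normal(size=(D, 2 * D)), rng.normal(size=(D, 2 * D))
b_c, b_g = np.zeros(D), np.zeros(D)
F_t1_filtered = csrm_branch(F_diff, F_t1, W_c, b_c, W_g, b_g)
```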

4.3.2. Q-Former for Cross-Modal Alignment

Inspired by InstructBLIP [55], our Q-former module is specifically designed to generate change-aware features aligned with the given instruction P. This process begins with a set of learnable query embeddings $Q \in \mathbb{R}^{L \times d}$, where $L = 32$ is the number of queries and $d$ is the feature dimension matching the LLM’s input space. These queries are first refined through self-attention layers:
$$Q_{\text{SA}} = \text{SelfAttention}(Q).$$
Next, the refined queries Q SA attend to both the concatenated visual features and the instruction prompt through cross-attention:
$$Q_{\text{CA}} = \text{CrossAttention}(Q_{\text{SA}}, [F'_{t_1}; F'_{t_2}], P).$$
This step dynamically aligns the change features with the task-specific instruction, tailoring them to the user’s query. Finally, the instruction-aware change features pass through a feed-forward network to yield the final compact output:
$$\hat{F}_{\text{diff}} = \text{FFN}(Q_{\text{CA}}) \in \mathbb{R}^{32 \times d}.$$
By incorporating these steps, the Q-former ensures that the extracted features effectively capture instruction-relevant changes while maintaining computational efficiency through the query bottleneck.
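The query bottleneck can be illustrated with a single-head cross-attention sketch: however many visual tokens come in, exactly L = 32 vectors come out. The real Q-former interleaves self-attention, instruction conditioning, and a feed-forward network, all omitted here; sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, N = 32, 16, 6   # queries, feature dim, visual tokens per image (toy)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, KV):
    """Single-head scaled dot-product cross-attention (bottleneck sketch).

    Q:  [L, d] learnable queries; KV: [M, d] visual tokens used as both
    keys and values. Output is always [L, d], independent of M.
    """
    scores = Q @ KV.T / np.sqrt(Q.shape[-1])   # [L, M] attention logits
    return softmax(scores, axis=-1) @ KV       # [L, d] attended features

queries = rng.normal(size=(L, d))              # learnable query embeddings
visual = rng.normal(size=(2 * N, d))           # concatenated bi-temporal tokens
F_hat_diff = cross_attention(queries, visual)
```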

4.4. LLM-Based Language Decoder

We choose Vicuna-7B [62] as our language decoder, which is a powerful decoder-only LLM fine-tuned from LLaMA [63]. Our decoder takes the visual features $\hat{F}_{\text{diff}}$ and instruction prompt $P$ as input, generating instruction-specific change descriptions. The process begins by tokenizing and embedding the instruction prompt $P$:
$$E = \Phi_{\text{embedding}}(P),$$
where Φ embedding denotes the tokenizer and embedding function, transforming the raw text into a sequence of embeddings E suitable for the language model. These embeddings capture the semantic essence of the prompt by mapping words or sub-words to high-dimensional vectors.
The learned prompt-aligned features $\hat{F}_{\text{diff}}$, along with the embedded prompt $E$, serve as input to the language decoder, which generates a descriptive caption $T = \{t_1, \ldots, t_N\}$ that summarizes the bi-temporal changes:
$$T = \Phi_{\text{LLM}}(\hat{F}_{\text{diff}}, E) \in \mathcal{C}^{N},$$
where Φ LLM denotes the Vicuna-7B decoder, and T is a sequence of N tokens from the vocabulary C . This process effectively interprets the visual differences in the context of the user’s query, generating task-specific change descriptions.

4.5. Training Objective

Unlike typical RSICC methods that generate fixed descriptions for RS image pairs, DeltaVLM is trained on instruction-conditioned data. To enable this, we augment the dataset with instruction prompts P j , creating D train = { ( I 1 , P 1 , T 1 ) , , ( I M , P M , T M ) } , where each P j is a user query corresponding to the bi-temporal image pair I j and its target description T j , allowing the model to adapt its output to diverse user instructions. The model is trained using the cross-entropy loss function:
$$\mathcal{L}_{\text{train}} = -\frac{1}{K} \sum_{i=1}^{K} w_i \log(\hat{w}_i),$$
where K represents the total number of tokens in the target description, w i is the one-hot-encoded ground truth token at position i, and w ^ i is DeltaVLM’s predicted probability for the i-th token. By minimizing this loss over the augmented dataset D train , we train DeltaVLM to generate accurate and contextually relevant descriptions tailored to user-specific instructions.
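Since each one-hot $w_i$ picks out a single predicted probability per position, the loss reduces to the average negative log-likelihood of the ground-truth tokens, as in this numpy sketch with a toy vocabulary:

```python
import numpy as np

def token_cross_entropy(probs: np.ndarray, target_ids: np.ndarray) -> float:
    """Average negative log-likelihood of the target tokens.

    probs: [K, V] predicted token distributions; target_ids: [K] ground-truth
    indices (the one-hot w_i selects one probability per position).
    """
    K = target_ids.shape[0]
    picked = probs[np.arange(K), target_ids]   # probability of each true token
    return float(-np.log(picked).mean())

# Toy example: 3 token positions over a 4-token vocabulary.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.10, 0.80, 0.05, 0.05],
                  [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 1, 3])
loss = token_cross_entropy(probs, targets)
```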

5. Results

In this section, we present comprehensive experiments to evaluate the effectiveness of DeltaVLM for RSICA. We first describe the experimental setup and then report quantitative results across multiple tasks; finally, we analyze key components through ablation studies.

5.1. Experimental Setup

5.1.1. Dataset

We evaluate DeltaVLM on the ChangeChat-105k dataset, which comprises 105,107 instruction–response pairs aligned with bi-temporal image patches of size 256 × 256 at a spatial resolution of 0.5 m/pixel. Each image pair is annotated with multiple task types: binary change detection, object counting, change localization, and change captioning. For evaluation, we split the dataset into training and test sets; the detailed distribution of instruction–response pairs across tasks and subsets is provided in Table 1.

5.1.2. Implementation Details

All experiments are conducted on Ubuntu 20.04 using the PyTorch 2.0.1 framework on NVIDIA L20 GPUs. For data augmentation, we first apply random cropping that removes 0–5% of the image content, followed by random rotation within $[-15^{\circ}, +15^{\circ}]$. The augmented images are then resized to 224 × 224 pixels to match the ViT-g/14 backbone's patch embedding requirements. We employ the AdamW optimizer [64] with a weight decay of 0.05 for regularization. The initial learning rate is set to $1 \times 10^{-5}$, with a batch size of 24 and a maximum of 30 training epochs.
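The parameter sampling behind this augmentation pipeline can be sketched as follows (interpreting the 0–5% crop as a per-dimension fraction, an assumption on our part; the helper and its defaults are illustrative, not the released training code):

```python
import random

def augment_params(h=256, w=256, max_crop_frac=0.05, max_rot_deg=15.0, seed=None):
    """Sample the augmentation parameters described above: a random crop
    removing up to max_crop_frac of each spatial dimension, a rotation
    angle in [-max_rot_deg, +max_rot_deg] degrees, and the fixed
    224x224 resize target required by the ViT-g/14 patch embedding."""
    rng = random.Random(seed)
    frac = rng.uniform(0.0, max_crop_frac)
    ch, cw = int(h * (1 - frac)), int(w * (1 - frac))  # cropped size
    top = rng.randint(0, h - ch)                        # crop offset
    left = rng.randint(0, w - cw)
    angle = rng.uniform(-max_rot_deg, max_rot_deg)
    return {"crop": (top, left, ch, cw), "angle": angle, "resize": (224, 224)}
```

An actual pipeline would apply these parameters identically to both images of a bi-temporal pair so that their spatial alignment is preserved.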

5.1.3. Evaluation Metrics

To comprehensively evaluate DeltaVLM's multi-task capabilities, we employ task-specific metrics that are widely adopted in the field.
  • Change Captioning: We adopt BLEU-N (N = 1, 2, 3, 4) [65], METEOR [66], ROUGE-L [67], and CIDEr [68] to assess the quality of generated change descriptions. These metrics evaluate the n-gram overlap, semantic similarity, sentence structure, and human consensus alignment, respectively.
  • Binary Change Classification: We use accuracy, precision, recall, and the F1-score to measure classification performance. The F1-score provides a balanced measure, which is particularly important given imbalanced change/no-change distributions.
  • Category-Specific Change Quantification: We use the mean absolute error (MAE) and root mean squared error (RMSE) to evaluate the counting accuracy, with the MAE capturing the average deviation and the RMSE penalizing larger errors.
  • Change Localization: The change localization task requires returning the location of changes in a 3 × 3 grid format, making it a multi-class classification task. We use precision, recall, the F1-score, and overall accuracy to evaluate localization quality; precision, recall, and the F1-score are computed with micro-averaging across all images.
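The micro-averaging used for localization can be made concrete: per-cell true positives, false positives, and false negatives are pooled across all images before the ratios are computed (a minimal sketch; representing each image's prediction as a set of changed cell indices 0–8 is our illustration):

```python
def micro_prf(preds, gts):
    """Micro-averaged precision/recall/F1 for grid-cell change
    localization. preds and gts are parallel lists of sets of changed
    cell indices on the 3x3 grid; TP/FP/FN counts are pooled over all
    images first, then the ratios are taken once (micro-averaging)."""
    tp = fp = fn = 0
    for p, g in zip(preds, gts):
        tp += len(p & g)  # cells predicted changed and truly changed
        fp += len(p - g)  # predicted changed but not truly changed
        fn += len(g - p)  # truly changed but missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Pooling before dividing means images with many changed cells weigh more heavily than sparse ones, unlike macro-averaging over per-image scores.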

5.2. Comparison with Baselines

We compare DeltaVLM against SOTA baselines across various tasks: change captioning, binary change classification, category-specific change quantification, and change localization. The baselines are categorized into three groups: (1) RS-specific change captioning models, which focus on domain-adapted architectures for RS imagery, including RSICCFormer [8], Prompt-CC [39], PSNet [37], RSCaMa [38], and SFT [36]; (2) an RS-specific VLM designed for remote sensing visual understanding, RSUniVLM [57], which unifies multiple RS tasks within a single vision language framework; and (3) general-purpose large VLMs, including GPT-4o [49], Qwen-VL-Plus [50], GLM-4V-Plus [51], DeepSeek-VL2 [53], and Gemini-1.5-Pro [52].

5.2.1. Change Captioning

We first evaluate DeltaVLM against both specialized RS change captioning models and VLMs on the ChangeChat-105k test set. As shown in Table 2, DeltaVLM achieves competitive performance across most metrics. Among general-purpose VLMs, the performance drops substantially, with BLEU-4 ranging from 13.85 (GLM-4V-Plus) to 25.68 (DeepSeek-VL2). This gap suggests that, without remote sensing-aware adaptation, general VLMs struggle to capture the subtle and domain-specific changes in bi-temporal imagery. RSUniVLM attains the highest CIDEr score (138.61) among all methods, while its BLEU-4 score (56.27) remains below those of specialized captioning models. Among RS change captioning models, RSCaMa achieves the best performance in terms of BLEU-1/-2/-3/-4 and ROUGE-L, benefiting from its Mamba-based architecture for capturing long-range temporal dependencies. SFT obtains the highest METEOR score (39.93). DeltaVLM ranks second in terms of BLEU-1/-2/-3 and ROUGE-L, closely matching the performance of the strongest task-specific captioning models. Importantly, unlike task-specific captioning models, DeltaVLM simultaneously supports multiple interactive tasks within a unified framework, demonstrating the versatility of our instruction-guided approach.

5.2.2. Binary Change Classification

Table 3 reports the binary classification results, where the task is to determine whether any change occurs between the bi-temporal pair. DeltaVLM obtains 93.99% accuracy and a 93.83% F1-score. Among general-purpose VLMs, GPT-4o and Gemini-1.5-Pro reach F1-scores of 85.07% and 83.77%, respectively, while Qwen-VL-Plus and DeepSeek-VL2 show very low recall (19.81–25.52%), suggesting a conservative bias toward predicting “no change”. The RS-specific RSUniVLM scores 90.99% in F1, confirming the benefit of domain-specific training. DeltaVLM improves upon RSUniVLM by 2.84 percentage points in F1 and upon GPT-4o by 8.76 points, with a balanced precision–recall trade-off.

5.2.3. Category-Specific Change Quantification

We further examine object-level counting for roads and buildings (Table 4). DeltaVLM records the lowest MAE/RMSE in both categories (roads: 0.24/0.70; buildings: 1.32/2.89). RSUniVLM fails to produce valid numerical outputs, primarily due to its limited model capacity (only 1B parameters), which restricts its ability to flexibly follow instructions. Relative to GPT-4o, DeltaVLM reduces the road MAE by 51% and building MAE by 29%. Across all methods, building counting yields higher errors than road counting, which we attribute to the greater morphological diversity and frequent occlusion in dense urban scenes.
These quantification results indicate that DeltaVLM can reliably estimate category-level change counts, complementing the coarse-grained binary decision with finer numerical information.

5.2.4. Change Localization

Table 5 presents the localization results on a 3 × 3 grid. We report the micro-averaged precision, recall, F1-score, and overall accuracy. For road changes, DeltaVLM attains the highest F1 (67.94%) and accuracy (70.92%). RSUniVLM reaches higher recall (89.93%) but at the cost of lower precision (42.04%) and accuracy (10.99%), indicating frequent over-prediction. A similar pattern appears for buildings: RSUniVLM obtains 96.36% recall yet only 23.28% accuracy, whereas DeltaVLM maintains a balanced precision–recall trade-off (77.79%/80.22%) and the highest accuracy (65.53%). These observations suggest that explicit difference modeling in DeltaVLM leads to the more reliable spatial grounding of changes.

5.2.5. Open-Ended QA

Finally, we evaluate open-ended QA, which requires free-form responses to diverse user queries (Table 6). Among general-purpose VLMs, DeepSeek-VL2 performs the best (BLEU-4 = 19.51, CIDEr = 170.08), while Qwen-VL-Plus lags notably (CIDEr = 31.75). RSUniVLM, an RS-specialized VLM, achieves substantially lower scores (CIDEr = 70.54), largely because it tends to generate overly terse responses (e.g., “no”) or rigidly produce change captioning outputs without flexibly addressing user instructions. DeltaVLM obtains the highest scores across all metrics (BLEU-4 = 20.87, CIDEr = 203.34), corresponding to a 19.6% relative CIDEr gain over DeepSeek-VL2. These results confirm that instruction-guided difference perception enables more informative and contextually grounded answers in multi-turn RSICA scenarios.

5.2.6. Qualitative Analysis

Beyond the quantitative results, we qualitatively evaluate DeltaVLM in multi-turn dialog settings to further demonstrate its capabilities. Figure 5 presents representative examples where the model answers a sequence of queries grounded in the same bi-temporal remote sensing image pair. Across multiple turns, DeltaVLM produces consistent responses to change detection, description, quantification, and localization queries, indicating its ability to maintain contextual information over dialog turns.
This ability is enabled by three factors: (i) the IDPM, which aligns visual change features with task-specific instructions; (ii) the domain-specific fine-tuning of the bi-temporal visual encoder for remote sensing imagery; and (iii) the diversity of the instruction-following samples in ChangeChat-105k. Together, these components allow the model to support interactive change analysis beyond static, single-task outputs.

6. Discussion

6.1. Ablation Analysis

To assess the contributions of Bi-VE fine-tuning and the cross-semantic relation measuring (CSRM) mechanism to remote sensing change analysis performance, we conducted ablation studies on the change captioning and binary change classification tasks.
Table 7 presents the ablation results for the change captioning task under three configurations: (1) the complete DeltaVLM model, (2) without the cross-semantic relation measuring module (denoted as “w/o CSRM”), and (3) without bi-temporal visual encoder fine-tuning (denoted as “w/o Bi-VE FT”). The results reveal that removing the CSRM module leads to substantial performance degradation across all evaluation metrics, with BLEU-1 dropping from 85.78 to 64.42 and CIDEr decreasing from 136.72 to 101.92. This decline underscores the critical role of the CSRM module in enabling DeltaVLM to effectively perceive and represent visual differences between bi-temporal images. Without CSRM, the model struggles to identify semantically meaningful change regions, resulting in less accurate and less descriptive change captions. When the Bi-VE parameters are frozen, the model still achieves competitive performance but falls slightly short of the complete model, indicating that the domain-specific fine-tuning of the visual encoder enhances feature extraction for RS imagery.
In the binary change classification task, the findings are consistent with those observed in the captioning task (see Table 8). Notably, fine-tuning the Bi-VE, compared to the w/o Bi-VE FT condition, led to a considerable improvement in the F1-score. Removing the CSRM module again resulted in poor performance across all evaluation metrics. However, the model exhibited a strong bias toward predicting the “no change” class, indicating that it failed to detect meaningful differences without the semantic filtering provided by CSRM.

6.2. Model Scale and Efficiency Analysis

We compare the total and trainable parameter counts of VLMs and RS change captioning methods to assess model scale and efficiency. Note that general-purpose VLMs, i.e., Gemini-1.5-Pro [52], GLM-4V-Plus [51], GPT-4o [49], and Qwen-VL-Plus [50], do not disclose their exact parameter counts, precluding direct comparison.
As shown in Table 9, DeltaVLM exhibits a larger total parameter count (∼8.2 B) compared to existing RS change captioning models, whose sizes range from 172.80 M to 647 M. This increase in scale stems from DeltaVLM’s unified vision language architecture, which integrates EVA-ViT-g/14 as the vision encoder and Vicuna-7B as the language decoder. However, only 288M parameters are trainable during fine-tuning, significantly reducing the adaptation costs while preserving the rich prior knowledge of the frozen backbone.

6.3. Reliability of LLM-Generated Annotations

A key concern in using LLMs like ChatGPT for dataset generation is the potential for hallucination or factual inaccuracy. To mitigate this, our data generation pipeline ensures that all GPT-generated responses are grounded in reliable, image-derived evidence. Specifically, for each bi-temporal image pair, we provide ChatGPT with five human-written change captions from LEVIR-CC, together with structured information extracted from LEVIR-MCI’s pixel-level change maps, including object contours (via OpenCV) and precise counts of changed instances per category. This rich context enables the LLM to generate answers not from imagination but by reasoning over factual, multimodal evidence. Consequently, the generated instruction–response pairs reflect accurate interpretations of actual scene changes. To validate this, we randomly sampled 500 GPT-generated QA pairs across all open-ended and conversational tasks and conducted manual verification by expert annotators. The evaluation confirmed a 100% factual accuracy rate. This result indicates that, when properly constrained by high-quality auxiliary signals, LLMs can be used to reliably scale instruction-aware datasets for remote sensing change analysis.
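The grounding step can be sketched as assembling all image-derived evidence into a single context block before querying the LLM (the template wording, helper name, and field layout are illustrative, not the paper's exact prompt):

```python
def build_grounded_prompt(captions, counts):
    """Assemble the image-derived evidence (human-written change
    captions and per-category changed-instance counts) into one
    grounding context, so that generated QA pairs stay tied to
    factual scene information rather than free imagination."""
    lines = ["Human-written change captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Changed-instance counts:")
    lines += [f"- {cat}: {n}" for cat, n in sorted(counts.items())]
    lines.append("Answer only from the evidence above.")
    return "\n".join(lines)
```

Constraining generation to such a context is what allows the sampled QA pairs to be verified against the underlying change maps.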

6.4. Limitations

We identify two primary failure cases. First, small-scale objects (e.g., narrow roads or isolated buildings) are sometimes missed or imprecisely localized due to the limited spatial resolution induced by patch-based visual encoding. Second, in change localization, changes near grid boundaries may lead to inconsistent predictions because continuous spatial information is discretized into a coarse grid. These issues stem from coarse spatial representations and grid-based supervision. Future work may mitigate these limitations through higher-resolution encoders, multi-scale feature fusion, or finer-grained spatial supervision.

7. Conclusions

In this paper, we introduce remote sensing image change analysis (RSICA), a novel paradigm that extends beyond traditional change detection and captioning by enabling the interactive, instruction-guided exploration of bi-temporal satellite imagery. To address this task, we propose DeltaVLM, an end-to-end vision language model that comprises four key components: (1) a selectively fine-tuned bi-temporal visual encoder adapted for remote sensing characteristics, (2) an instruction-guided difference perception module (IDPM) with a cross-semantic relation measuring (CSRM) mechanism for extracting task-relevant change features, (3) an instruction-guided Q-former for cross-modal alignment, and (4) a large language model decoder for generating natural language responses.
We also constructed ChangeChat-105k, a large-scale instruction-following dataset with over 105,000 samples across six task types. Experiments showed that DeltaVLM achieved state-of-the-art performance on multiple RSICA subtasks, including change captioning, binary classification, quantification, and localization. Ablation studies validated the effectiveness of each proposed component, particularly the CSRM module for capturing visual differences.
In future work, we will explore higher-resolution visual encoders to improve the detection of subtle changes, enhance the instruction diversity to support zero-shot generalization, and develop unified architectures capable of generating both textual and visual outputs for more comprehensive change analysis.

Author Contributions

Conceptualization, P.D. and H.W.; methodology, P.D. and W.Z.; software, P.D.; validation, W.Z.; formal analysis, P.D.; investigation, W.Z.; resources, H.W.; data curation, W.Z.; writing—original draft preparation, P.D.; writing—review and editing, H.W.; visualization, W.Z.; supervision, H.W.; project administration, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62401064 and in part by the Fundamental Research Funds for the Central Universities under Grant 2024TD001.

Data Availability Statement

The code, dataset, and pretrained weights presented in this study are available at https://github.com/hanlinwu/DeltaVLM (accessed on 4 February 2026).

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RS      Remote Sensing
RSIs    Remote Sensing Images
VQA     Visual Question Answering
VLM     Vision Language Model
LLM     Large Language Model
RSICA   Remote Sensing Image Change Analysis
Bi-VE   Bi-Temporal Vision Encoder
IDPM    Instruction-Guided Difference Perception Module
CSRM    Cross-Semantic Relation Measuring

References

  1. Van Westen, C. Remote sensing for natural disaster management. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2000, 33, 1609–1617. [Google Scholar]
  2. Chowdhury, R.R. Driving forces of tropical deforestation: The role of remote sensing and spatial models. Singap. J. Trop. Geogr. 2006, 27, 82–101. [Google Scholar] [CrossRef]
  3. Navalgund, R.R.; Jayaraman, V.; Roy, P. Remote sensing applications: An overview. Curr. Sci. 2007, 93, 1747–1766. [Google Scholar]
  4. Bannari, A.; Morin, D.; Bénié, G.; Bonn, F. A theoretical review of different mathematical models of geometric corrections applied to remote sensing images. Remote Sens. Rev. 1995, 13, 27–47. [Google Scholar] [CrossRef]
  5. Ding, L.; Hong, D.; Zhao, M.; Chen, H.; Li, C.; Deng, J.; Yokoya, N.; Bruzzone, L.; Chanussot, J. A Survey of Sample-Efficient Deep Learning for Change Detection in Remote Sensing: Tasks, strategies, and challenges. IEEE Geosci. Remote Sens. Mag. 2025, 13, 164–189. [Google Scholar] [CrossRef]
  6. Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [Google Scholar]
  7. Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
  8. Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5633520. [Google Scholar] [CrossRef]
  9. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  10. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  11. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
  12. Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Liu, Y.; Li, X. RSGPT: A remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 2025, 224, 272–286. [Google Scholar] [CrossRef]
  13. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 27831–27840. [Google Scholar]
  14. Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. RS-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sens. 2024, 16, 1477. [Google Scholar] [CrossRef]
  15. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
  16. Rasti, B.; Scheunders, P.; Ghamisi, P.; Licciardi, G.; Chanussot, J. Noise reduction in hyperspectral imagery: Overview and application. Remote Sens. 2018, 10, 482. [Google Scholar] [CrossRef]
  17. OpenAI. ChatGPT: Optimizing Language Models for Dialogue. 2022. Available online: https://openai.com/blog/chatgpt (accessed on 19 May 2025).
  18. Singh, A. Review article digital change detection techniques using remotely-sensed data. Int. J. Remote Sens. 1989, 10, 989–1003. [Google Scholar] [CrossRef]
  19. Nelson, R.F. Detecting forest canopy change due to insect activity using Landsat MSS. Photogramm. Eng. Remote Sens. 1983, 49, 1303–1314. [Google Scholar]
  20. Nielsen, A.A.; Conradsen, K.; Simpson, J.J. Multivariate alteration detection (MAD) and MAF postprocessing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sens. Environ. 1998, 64, 19. [Google Scholar] [CrossRef]
  21. Serra, P.; Pons, X.; Sauri, D. Post-classification change detection with data from different sensors: Some accuracy considerations. Int. J. Remote Sens. 2003, 24, 3311–3340. [Google Scholar] [CrossRef]
  22. Wu, C.; Du, B.; Zhang, L. Slow feature analysis for change detection in multitemporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2013, 52, 2858–2874. [Google Scholar] [CrossRef]
  23. Nielsen, A.A. Regularized iteratively reweighted MAD method for change detection in multi-and hyperspectral data. IEEE Trans. Image Process. 2007, 16, 463–478. [Google Scholar] [CrossRef]
  24. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [Google Scholar] [CrossRef]
  25. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4063–4067. [Google Scholar]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  27. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514. [Google Scholar] [CrossRef]
  28. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 207–210. [Google Scholar]
  29. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
  30. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PmLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  31. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT pre-training of image transformers. In Proceedings of the 10th International Conference on Learning Representations, Online, 5–29 April 2022. [Google Scholar]
  32. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sens. Environ. 2021, 265, 112636. [Google Scholar] [CrossRef]
  33. Chen, H.; Song, J.; Dietrich, O.; Broni-Bediako, C.; Xuan, W.; Wang, J.; Shao, X.; Wei, Y.; Xia, J.; Lan, C.; et al. BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response. Earth Syst. Sci. Data Discuss. 2025, 17, 6217–6253. [Google Scholar] [CrossRef]
  34. Hoxha, G.; Chouaf, S.; Melgani, F.; Smara, Y. Change captioning: A new paradigm for multitemporal remote sensing image analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5627414. [Google Scholar] [CrossRef]
  35. You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image captioning with semantic attention. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4651–4659. [Google Scholar]
  36. Sun, D.; Bao, Y.; Liu, J.; Cao, X. A lightweight sparse focus transformer for remote sensing image change captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18727–18738. [Google Scholar] [CrossRef]
  37. Liu, C.; Yang, J.; Qi, Z.; Zou, Z.; Shi, Z. Progressive scale-aware network for remote sensing image change captioning. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 6668–6671. [Google Scholar]
  38. Liu, C.; Chen, K.; Chen, B.; Zhang, H.; Zou, Z.; Shi, Z. Rscama: Remote sensing image change captioning with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6010405. [Google Scholar] [CrossRef]
  39. Liu, C.; Zhao, R.; Chen, J.; Qi, Z.; Zou, Z.; Shi, Z. A decoupling paradigm with prompt learning for remote sensing image change captioning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622018. [Google Scholar] [CrossRef]
  40. Zhu, Y.; Li, L.; Chen, K.; Liu, C.; Zhou, F.; Shi, Z.X. Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5648916. [Google Scholar] [CrossRef]
  41. Noman, M.; Ahsan, N.; Naseer, M.; Cholakkal, H.; Anwer, R.M.; Khan, S.; Khan, F.S. CDCHAT: A large multimodal model for remote sensing change description. arXiv 2024, arXiv:2409.16261. [Google Scholar] [CrossRef]
  42. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  43. Zhang, Z.; Jiao, L.; Li, L.; Liu, X.; Chen, P.; Liu, F.; Li, Y.; Guo, Z. A spatial hierarchical reasoning network for remote sensing visual question answering. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4400815. [Google Scholar] [CrossRef]
  44. Wang, J.; Zheng, Z.; Chen, Z.; Ma, A.; Zhong, Y. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 5481–5489. [Google Scholar]
  45. Chappuis, C.; Zermatten, V.; Lobry, S.; Le Saux, B.; Tuia, D. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 1372–1381. [Google Scholar]
  46. Yuan, Z.; Mou, L.; Zhu, X.X. Change-aware visual question answering. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 227–230. [Google Scholar]
  47. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning Research, PMLR, Online, 28 November–9 December 2022; pp. 12888–12900. [Google Scholar]
  48. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  49. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
  50. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
  51. GLM, T.; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Zhao, H.; et al. ChatGLM: A family of large language models from glm-130b to glm-4 all tools. arXiv 2024, arXiv:2406.12793. [Google Scholar]
  52. Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  53. Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
  54. Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642123. [Google Scholar] [CrossRef]
  55. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 49250–49267. [Google Scholar]
  56. Zhan, Y.; Xiong, Z.; Yuan, Y. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77. [Google Scholar] [CrossRef]
  57. Liu, X.; Lian, Z. RSUniVLM: A unified vision language model for remote sensing via granularity-oriented mixture of experts. arXiv 2024, arXiv:2412.05679. [Google Scholar]
  58. Deng, P.; Zhou, W.; Wu, H. Changechat: An interactive model for remote sensing change analysis via multimodal instruction tuning. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
  59. Liu, C.; Chen, K.; Zhang, H.; Qi, Z.; Zou, Z.; Shi, Z. Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5635616. [Google Scholar] [CrossRef]
  60. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19358–19369. [Google Scholar]
  61. Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [Google Scholar]
  62. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
  63. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  64. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  65. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  66. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  67. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  68. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
Figure 1. Performance of DeltaVLM against state-of-the-art VLMs on five RS change analysis tasks. Each axis corresponds to a task-specific metric: change captioning (CIDEr), binary classification (F1-score), quantification (inverted MAE for roads), localization (F1-score for roads), and open-ended QA (METEOR). Grey dashed lines indicate equidistant performance levels from the center, enabling relative assessment of model strength across tasks.
Figure 2. Instruction types and examples in the ChangeChat-105k dataset. The dataset employs rule-based methods for structured tasks (Types 1–4) and ChatGPT-generated responses for open-ended reasoning tasks (Types 5–6).
Figure 3. Overview of GPT-based data generation for open-ended QA. (a) The system message that defines the AI’s role and task. (b) A few-shot seed example, showing the structured input (captions, counts, contours) and the desired output format. (c) Examples of the two main types of conversational data generated by our method.
Figure 4. An overview of our proposed DeltaVLM. (a) Bi-temporal image encoder for visual feature extraction. (b) Instruction-guided difference perception module with CSRM mechanism. (c) Language decoder based on Vicuna-7B.
Figure 5. Demonstration of multi-round dialog capabilities of DeltaVLM.
Table 1. Overview of the ChangeChat-105k dataset: instruction types, generation methods, and training/testing splits.
| Instruction Type | Source Data | Gen. Method | Response Format | Train | Test |
|---|---|---|---|---|---|
| Change Captioning | LEVIR-CC | Rule-based | Descriptive Text | 34,075 | 1929 |
| Binary Change Classification | LEVIR-MCI | Rule-based | Yes/No | 6815 | 1929 |
| Category-Specific Change Quantification | LEVIR-MCI | Rule-based | Object Count | 6815 | 1929 |
| Change Localization | LEVIR-MCI | Rule-based | Grid Location | 6815 | 1929 |
| Open-Ended QA | Derived | GPT-assisted | Q&A Pair | 26,600 | 7527 |
| Multi-Turn Conversation | Derived | GPT-assisted | Dialog | 6815 | 1929 |
| Total | | | | 87,935 | 17,172 |
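As a sanity check, the per-task splits in Table 1 sum exactly to the reported totals, and the combined size (105,107 samples) matches the "105k+" in the dataset name. A minimal sketch with the counts transcribed from the table:

```python
# Train/test sample counts per instruction type, transcribed from Table 1.
splits = {
    "Change Captioning":                       (34_075, 1_929),
    "Binary Change Classification":            (6_815, 1_929),
    "Category-Specific Change Quantification": (6_815, 1_929),
    "Change Localization":                     (6_815, 1_929),
    "Open-Ended QA":                           (26_600, 7_527),
    "Multi-Turn Conversation":                 (6_815, 1_929),
}

train_total = sum(train for train, _ in splits.values())
test_total = sum(test for _, test in splits.values())
print(train_total, test_total, train_total + test_total)  # 87935 17172 105107
```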
Table 2. Comparison with SOTA methods on the change captioning task on the ChangeChat-105k dataset. Bold denotes the best performance and underline indicates the second best.
| Category | Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|---|
| RS Change Captioning Models | PromptCC [39] | 83.66 | 75.73 | 69.10 | 63.54 | 38.82 | 73.72 | 136.44 |
| | PSNet [37] | 83.86 | 75.13 | 67.89 | 62.11 | 38.80 | 73.60 | 132.62 |
| | RSICCFormer [8] | 84.72 | 76.27 | 68.87 | 62.77 | 39.61 | 74.12 | 134.12 |
| | SFT [36] | 84.56 | 75.87 | 68.64 | 62.87 | 39.93 | 74.69 | 137.05 |
| | RSCaMa [38] | 85.79 | 77.99 | 71.04 | 65.24 | 39.91 | 75.24 | 136.56 |
| VLMs | Deepseek-VL2 [53] | 40.94 | 34.50 | 30.26 | 25.68 | 19.48 | 54.37 | 101.79 |
| | Gemini-1.5-Pro [52] | 45.68 | 33.59 | 25.53 | 19.01 | 22.64 | 56.25 | 91.37 |
| | GLM-4V-Plus [51] | 35.59 | 24.26 | 18.54 | 13.85 | 20.13 | 54.39 | 93.16 |
| | GPT-4o [49] | 46.03 | 33.09 | 24.66 | 18.05 | 22.50 | 56.49 | 90.92 |
| | Qwen-VL-Plus [50] | 41.31 | 33.19 | 27.96 | 22.95 | 18.04 | 51.24 | 92.99 |
| | RSUniVLM [57] | 82.07 | 72.56 | 63.94 | 56.27 | 36.10 | 74.07 | 138.61 |
| | DeltaVLM | 85.78 | 77.15 | 69.24 | 62.51 | 39.47 | 75.01 | 136.72 |
Table 3. Results for binary change classification. Bold indicates the best performance.
| Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| Deepseek-VL2 [53] | 59.72 | 97.95 | 19.81 | 32.96 |
| Gemini-1.5-Pro [52] | 83.83 | 84.03 | 83.51 | 83.77 |
| GLM-4V-Plus [51] | 79.83 | 88.38 | 68.67 | 77.29 |
| GPT-4o [49] | 84.81 | 83.58 | 86.62 | 85.07 |
| Qwen-VL-Plus [50] | 58.22 | 73.65 | 25.52 | 37.90 |
| RSUniVLM [57] | 91.24 | 93.63 | 88.49 | 90.99 |
| DeltaVLM | 93.99 | 96.29 | 91.49 | 93.83 |
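The classification metrics in Table 3 follow their standard definitions over the binary confusion matrix. A minimal sketch of those definitions (the confusion counts below are hypothetical, chosen only to illustrate the formulas, not taken from the paper):

```python
def binary_change_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics, reported in percent as in Table 3."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {name: round(100 * value, 2)
            for name, value in [("accuracy", accuracy), ("precision", precision),
                                ("recall", recall), ("f1", f1)]}

# Hypothetical confusion counts, for illustration only:
print(binary_change_metrics(tp=90, fp=5, fn=10, tn=95))
```

Note how a skewed confusion matrix reproduces the pattern seen for some baselines in Table 3: very high precision can coexist with low recall when a model rarely predicts "change".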
Table 4. Results for change quantification. "–" indicates that the model failed to produce a valid response for this task; bold indicates the best performance.
| Method | Roads MAE | Roads RMSE | Buildings MAE | Buildings RMSE |
|---|---|---|---|---|
| Deepseek-VL2 [53] | 0.58 | 0.95 | 4.17 | 8.84 |
| Gemini-1.5-Pro [52] | 0.58 | 1.25 | 2.56 | 8.71 |
| GLM-4V-Plus [51] | 0.82 | 1.62 | 2.05 | 4.61 |
| GPT-4o [49] | 0.49 | 1.00 | 1.86 | 4.57 |
| Qwen-VL-Plus [50] | 0.90 | 1.50 | 4.41 | 9.03 |
| RSUniVLM [57] | – | – | – | – |
| DeltaVLM | 0.24 | 0.70 | 1.32 | 2.89 |
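MAE and RMSE in Table 4 are the usual error aggregates over predicted versus ground-truth object counts. A small sketch under hypothetical per-image counts (not from the paper):

```python
import math

def mae_rmse(pred, true):
    """Mean absolute error and root-mean-square error over count predictions."""
    errors = [p - t for p, t in zip(pred, true)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Hypothetical predicted vs. ground-truth object counts, for illustration only:
pred = [2, 0, 3, 1]
true = [2, 1, 1, 1]
print(mae_rmse(pred, true))  # → MAE 0.75, RMSE ≈ 1.118
```

Since RMSE squares each error before averaging, it penalizes large miscounts more heavily than MAE, which is why the two columns can diverge for the same model.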
Table 5. Results for change localization for roads and buildings. Bold indicates the best performance.
| Category | Method | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Roads | Deepseek-VL2 [53] | 30.07 | 10.43 | 15.49 | 64.90 |
| | Gemini-1.5-Pro [52] | 43.01 | 40.55 | 41.74 | 48.63 |
| | GLM-4V-Plus [51] | 21.99 | 33.32 | 26.49 | 6.79 |
| | GPT-4o [49] | 30.44 | 27.01 | 28.62 | 33.85 |
| | Qwen-VL-Plus [50] | 15.42 | 1.40 | 2.56 | 67.19 |
| | RSUniVLM [57] | 42.04 | 89.93 | 57.29 | 10.99 |
| | DeltaVLM | 69.63 | 66.32 | 67.94 | 70.92 |
| Buildings | Deepseek-VL2 [53] | 61.98 | 14.52 | 23.52 | 57.59 |
| | Gemini-1.5-Pro [52] | 65.71 | 51.75 | 57.90 | 45.62 |
| | GLM-4V-Plus [51] | 38.98 | 57.83 | 46.57 | 17.11 |
| | GPT-4o [49] | 55.63 | 33.70 | 41.98 | 41.47 |
| | Qwen-VL-Plus [50] | 22.23 | 20.78 | 21.48 | 7.26 |
| | RSUniVLM [57] | 66.95 | 96.36 | 79.00 | 23.28 |
| | DeltaVLM | 77.79 | 80.22 | 78.99 | 65.53 |
Table 6. Open-ended QA results. Bold indicates the best performance.
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| Deepseek-VL2 [53] | 40.57 | 30.16 | 23.84 | 19.51 | 22.26 | 40.01 | 170.08 |
| Gemini-1.5-Pro [52] | 28.76 | 19.46 | 13.68 | 9.88 | 19.66 | 32.03 | 85.45 |
| GLM-4V-Plus [51] | 31.46 | 22.31 | 16.66 | 12.80 | 21.32 | 35.93 | 129.48 |
| GPT-4o [49] | 29.47 | 20.34 | 14.56 | 10.71 | 20.42 | 33.08 | 93.71 |
| Qwen-VL-Plus [50] | 20.66 | 11.70 | 6.70 | 4.25 | 14.26 | 22.83 | 31.75 |
| RSUniVLM [57] | 14.10 | 7.12 | 4.37 | 2.84 | 9.75 | 33.36 | 70.54 |
| DeltaVLM | 43.25 | 32.53 | 25.71 | 20.87 | 21.65 | 49.24 | 203.34 |
Table 7. Ablation analysis on the change captioning task. Bold indicates the best performance.
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| w/o CSRM | 64.42 | 56.52 | 53.08 | 51.40 | 29.31 | 60.54 | 101.92 |
| w/o Bi-VE FT | 84.24 | 75.62 | 67.91 | 61.40 | 39.29 | 74.73 | 134.76 |
| DeltaVLM | 85.78 | 77.15 | 69.24 | 62.51 | 39.47 | 75.01 | 136.72 |
Table 8. Ablation analysis on the binary change classification task. Bold indicates the best performance.
| Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| w/o CSRM | 50.13 | 75.00 | 0.31 | 0.62 |
| w/o Bi-VE FT | 90.57 | 99.49 | 81.54 | 89.62 |
| DeltaVLM | 93.99 | 96.29 | 91.49 | 93.83 |
Table 9. Comparison of total and trainable parameters across VLMs and RS change captioning methods. Parameters are reported in millions (M) or billions (B).
| Category | Method | Total Params | Trainable Params |
|---|---|---|---|
| RS Change Captioning Models | PromptCC [39] | 408.58 M | 196.28 M |
| | PSNet [37] | 319.76 M | 231.53 M |
| | RSICCFormer [8] | 172.80 M | 81.51 M |
| | SFT [36] | 647 M | 647 M |
| | RSCaMa [38] | 176.90 M | 176.90 M |
| VLMs | Deepseek-VL2 [53] | 27 B | 27 B |
| | RSUniVLM [57] | ∼1.0 B | ∼1.0 B |
| | DeltaVLM | ∼8.2 B | 288 M |
Share and Cite

MDPI and ACS Style

Deng, P.; Zhou, W.; Wu, H. DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-Guided Difference Perception. Remote Sens. 2026, 18, 541. https://doi.org/10.3390/rs18040541
