A Multi-Scale Remote Sensing Image Change Detection Network Based on Vision Foundation Model

Liu, Shenbo; Zhao, Dongxue; Tang, Lijun

doi:10.3390/rs18030506


by Shenbo Liu, Dongxue Zhao and Lijun Tang ^*

School of Physics and Electronic Science, Changsha University of Science and Technology, Changsha 410114, China

^* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 506; https://doi.org/10.3390/rs18030506

Submission received: 1 January 2026 / Revised: 24 January 2026 / Accepted: 2 February 2026 / Published: 4 February 2026

(This article belongs to the Section AI Remote Sensing)


Highlights

What are the main findings?

A multi-scale change detection framework (SAM-MSCD) is proposed that effectively adapts a vision foundation model (SAM) to remote sensing change detection via LoRA-based parameter-efficient fine-tuning and cross-temporal feature interaction.
The designed bi-temporal interaction module and multi-scale change feature enhancement module significantly improve detection accuracy, achieving state-of-the-art F1-score and IoU across four public RSCD benchmarks, especially in complex and fine-grained change scenarios.

What are the implications of the main findings?

The study demonstrates that vision foundation models can be efficiently and reliably transferred to remote sensing change detection tasks without full fine-tuning, reducing computational cost while maintaining high performance.
The proposed framework provides a practical and extensible paradigm for large-scale, multi-scene remote sensing change detection, supporting real-world applications such as urban monitoring, land-use analysis, and environmental assessment.

Abstract

As a key technology in the intelligent interpretation of remote sensing, remote sensing image change detection aims to automatically identify surface changes from images of the same area acquired at different times. Although vision foundation models have demonstrated outstanding capabilities in image feature representation, their inherent patch-based processing and global attention mechanisms limit their effectiveness in perceiving multi-scale targets. To address this, we propose a multi-scale remote sensing image change detection network based on a vision foundation model, termed SAM-MSCD. This network integrates a parameter-efficient fine-tuning strategy with a cross-temporal multi-scale feature fusion mechanism, significantly improving change perception accuracy in complex scenarios. Specifically, Low-Rank Adaptation (LoRA) is adopted for parameter-efficient fine-tuning of the Segment Anything Model (SAM) image encoder, adapting it to the remote sensing change detection task. A bi-temporal feature interaction module (BIM) is designed to enhance semantic alignment and the modeling of change relationships between feature maps from different time phases. Furthermore, a change feature enhancement module (CFEM) is proposed to fuse and highlight differential information from different levels, achieving precise capture of multi-scale changes. Comprehensive experimental results on four public remote sensing change detection datasets, namely LEVIR-CD, WHU-CD, NJDS, and MSRS-CD, demonstrate that SAM-MSCD surpasses current state-of-the-art (SOTA) methods on several key evaluation metrics, including the F1-score and Intersection over Union (IoU), indicating its broad prospects for practical application.

Keywords:

remote sensing; change detection; vision foundation model; multi-scale analysis

1. Introduction

Remote Sensing Change Detection (RSCD) is a key technology that identifies and extracts dynamic changes on the Earth’s surface by analyzing images of the same geographical area acquired at different times [1]. This technology provides a core scientific basis for understanding the evolutionary patterns of land use and land cover, thereby offering crucial support for decision-making in fields such as urban expansion monitoring [2], forest resource management [3], disaster damage assessment [4], and ecological environment protection [5].

In recent years, deep learning models, represented by Convolutional Neural Networks (CNNs) [6,7,8,9] and Transformers [10,11,12], have significantly advanced RSCD technology with their powerful feature learning and contextual modeling capabilities. CNNs excel at capturing texture and spatial details through their inherent local perception and hierarchical feature extraction. Vision Transformers (ViTs), on the other hand, effectively model global dependencies via a self-attention mechanism, showing great potential for understanding large-scale scene changes. These data-driven methods rely on large-scale, high-quality RSCD datasets. However, annotating remote sensing images requires substantial professional expertise and manual effort, making the process costly and complex [13].

Vision foundation models (VFMs), pretrained on massive general-purpose image datasets, have learned rich and universal visual priors. Their strong generalization and cross-domain transfer capabilities allow them to handle complex object structures and semantic information effectively, even without extensive fine-tuning on remote sensing-specific data. This opens a new paradigm for improving change detection accuracy and robustness.

As VFMs were initially designed for natural images, their direct application is often limited when facing issues unique to remote sensing imagery, such as multi-scale objects, multi-spectral data, and large viewpoint differences. To enable the effective transfer of VFMs to RSCD tasks, existing studies have explored various adaptive modeling strategies. For instance, Chen et al. [14] proposed the time-traveling pixels method, which transfers the latent knowledge within VFMs to the change modeling process of bi-temporal images by simulating pixel evolution paths over time. Dong et al. [15] reconfigured the classic CLIP [16] architecture to enhance its ability to extract semantically consistent yet change-sensitive features from bi-temporal remote sensing images. While these methods focus on adapting existing VFMs for RSCD tasks, they have paid less attention to fully leveraging their deep semantic representation capabilities and optimizing the design by incorporating the intrinsic characteristics of remote sensing imagery.

To address this issue, this paper proposes a multi-scale remote sensing image change detection method based on a Vision Foundation Model, named SAM-MSCD. The method begins by introducing the Low-Rank Adaptation (LoRA) module [17] to perform parameter-efficient fine-tuning on a weight-shared Segment Anything Model (SAM) [18], enabling the effective extraction of semantic features from remote sensing images. Next, a bi-temporal image feature interaction module (BIM) is designed to model the fine-grained semantic differences between the bi-temporal images and to enhance the consistent representation of cross-temporal features. To explicitly address the multi-scale nature of change targets in remote sensing imagery, a change feature enhancement module (CFEM) is constructed as the core multi-scale modeling component of the proposed framework. CFEM generates more discriminative multi-scale difference feature maps and adopts a multi-scale fusion strategy to further strengthen the response in key change areas, thereby improving the overall performance of change detection. Finally, extensive experiments are conducted on four public RSCD benchmark datasets, where SAM-MSCD outperforms thirteen current state-of-the-art methods across multiple evaluation metrics, demonstrating excellent generalization ability and robustness. The main contributions of this paper are as follows:

A bi-temporal image feature interaction module is proposed, which effectively models the semantic differences between two temporal images, enhancing the model’s sensitivity to cross-temporal changes and its capacity for consistency modeling.
A change feature enhancement module is designed, which significantly improves the recognition accuracy of key change regions and the model’s robustness through a mechanism for generating and fusing multi-scale difference features.
A semantic feature extraction framework is built based on a shared SAM with LoRA. By combining this with BIM and CFEM, we propose SAM-MSCD, which fully leverages the semantic representation strengths of the vision foundation model and achieves its efficient adaptation and application in the RSCD task.

The remainder of this paper is organized as follows. Section 2 briefly introduces deep learning-based RSCD methods and VFM-based RSCD related work. Section 3 details the SAM-MSCD network design. Section 4 provides an extensive experimental analysis of SAM-MSCD. Section 5 presents a discussion of this paper. Section 6 concludes the paper.

2. Related Work

2.1. Deep Learning-Based RSCD

In recent years, with significant advancements in feature extraction by deep learning techniques such as CNNs and ViTs, deep learning-based change detection methods have made remarkable progress. Early research primarily focused on introducing CNNs into RSCD tasks and achieved good results. For example, Daudt et al. [7] proposed three models based on fully convolutional networks and siamese architectures: FC-EF, FC-Siam-conc, and FC-Siam-diff, which laid the foundation for the application of deep learning in the field of change detection. Lin et al. [19] designed a network structure composed of two symmetrical CNNs and combined it with a linear outer product operation to model the pixel-level relationships between bi-temporal images, thereby effectively identifying change regions. Fang et al. [20] proposed a densely connected siamese network structure, introducing a compact information transfer mechanism between the encoder and decoder, as well as between decoder levels, to alleviate the problem of spatial information loss in deep networks.

To further enhance the model’s ability to perceive global contextual information, researchers began to explore the application of ViTs to RSCD tasks. Zhang et al. [21] proposed a pure Transformer-based siamese U-shaped network that can effectively capture spatio-temporal global dependencies in bi-temporal images. Bandara et al. [10] employed a hierarchical Transformer structure to extract coarse-grained and fine-grained features from images at different scales, enabling the modeling and fusion of multi-scale difference features.

However, due to the limited annotated data available for RSCD tasks, pure Transformer models struggle to fully realize their potential. Consequently, many researchers have attempted to combine CNNs and Transformers to leverage the advantages of both local feature extraction and global semantic modeling. Chen et al. [11] proposed the BIT model, which first uses a ResNet [22] to extract initial features from bi-temporal images and then employs a Transformer encoder to further mine spatio-temporal contextual information. Building on this, Jiang et al. [23] introduced a Graph Neural Network (GNN) to enhance the model’s ability to model structured change information.

Furthermore, methods based on the state-space model Mamba have gradually gained attention. Chen et al. [24] were the first to introduce the Mamba architecture into the remote sensing change detection field, proposing the ChangeMamba model to explore its potential in modeling long-sequence remote sensing data.

Although the aforementioned methods have demonstrated good performance on multiple benchmark datasets, their generalization ability and robustness remain insufficient when dealing with complex and dynamic scenes of ground feature changes. Meanwhile, as model sizes continue to grow, the scarcity of high-quality annotated data severely constrains the performance of these methods in practical applications. To overcome this limitation, the research paradigm is shifting towards Vision Foundation Models. Unlike task-specific models trained from scratch, VFMs learned on massive generalized datasets offer robust visual priors, providing a new pathway to achieve high-performance change detection with limited supervision.

2.2. VFM-Based RSCD

VFMs are typically pre-trained on large-scale natural image datasets and possess rich, general feature representation capabilities [25]. They have demonstrated powerful generalization ability and transfer potential in natural image understanding tasks, and have been widely applied in fields such as semantic segmentation, image classification, and object detection [26,27,28]. Consequently, transferring VFMs to RSCD tasks has become a current research hotspot. Ding et al. [29] proposed a change detection method based on the SAM framework called SAM-CD. This method utilizes the visual encoder of FastSAM [30] to extract general visual representations from remote sensing images and designs a lightweight convolutional adapter to aggregate task-oriented change information. Li et al. [31] designed “bridging modules” aimed at closing the domain gap between natural and remote sensing images, thereby improving the applicability and transfer performance of VFMs in RSCD tasks. Additionally, Dong et al. [15] reconfigured the CLIP architecture, proposing the ChangeCLIP model, which enhances the expression of change-related semantic features in bi-temporal images through a contrastive learning mechanism to achieve effective change detection. Zhang et al. [32] fine-tuned SAM and combined it with a boundary-aware loss function to enhance the model’s detail characterization ability and adaptability in RSCD tasks. These methods have, to some extent, advanced the application of VFMs in the RSCD field.

Although existing research has made some achievements in transferring VFMs to remote sensing tasks, most work still focuses on how to adapt existing models to meet the needs of remote sensing tasks, with less attention paid to fully leveraging the deep semantic modeling capabilities of VFMs and performing targeted optimization design combined with the intrinsic characteristics of remote sensing images. Therefore, exploring more efficient and interpretable VFM transfer strategies and constructing novel change detection frameworks oriented toward the characteristics of remote sensing imagery hold significant theoretical importance and practical application value.

3. Methodology

3.1. Overall Framework

As shown in Figure 1, the overall architecture of SAM-MSCD adopts a siamese network structure, which takes a pair of bi-temporal remote sensing images as input. To meet the demand for rapid and automated processing, the model does not rely on any prompt inputs during either training or inference. This prompt-free design is motivated by the characteristics of remote sensing change detection, which is a fully automatic, pixel-wise dense prediction task, where interactive or manually defined prompts are impractical and inconsistent across large-scale datasets. Therefore, the prompt encoder of SAM is removed, retaining only the image encoder. It is worth noting that the core strength of SAM lies in its powerful image encoder pretrained on large-scale data, while the prompt encoder mainly serves interactive segmentation scenarios. Removing the prompt encoder avoids task–model mismatch and reduces unnecessary architectural complexity, without sacrificing semantic representation capability. The network extracts high-level semantic features from the two temporal images separately using the pre-trained SAM image encoder. To better adapt the image encoder for remote sensing image feature extraction, a LoRA adapter is introduced for parameter-efficient fine-tuning. Furthermore, a Bi-temporal Interaction Module is embedded between the image encoders, serving as an information bridge between the two branches to achieve effective fusion of cross-temporal features and contextual modeling, thereby enhancing the model’s ability to perceive change regions. Subsequently, a Change Feature Enhancement Module is designed to further strengthen the feature representation of key change areas through a mechanism for generating and fusing multi-scale difference features. Finally, these change features are decoded by a detection head to produce the final change detection result.

The image encoder of SAM demonstrates strong general-purpose visual representation capabilities, providing a solid foundation for various vision tasks. However, directly applying such a large pre-trained model to RSCD tasks and fully fine-tuning it poses significant challenges, including high computational costs, large memory requirements, and the risk of catastrophic forgetting. To address these issues, adapter-based methods offer an efficient solution by enabling quick adaptation to specific tasks through the insertion of lightweight adapter modules between certain layers, without the need to retrain the entire model.

These adapter layers introduce only a small number of additional parameters that work in conjunction with the original model parameters, significantly reducing the computational resources and training time required, while also lowering the complexity of parameter adjustment. Compared with conventional full-parameter fine-tuning, this approach enables more efficient knowledge transfer while maintaining model performance.

Based on these advantages, SAM-MSCD incorporates the LoRA adapter, as shown in Figure 2, to achieve efficient adaptation of the VFM to RSCD tasks with minimal parameter overhead, thus striking a good balance between model capability, task adaptability and resource consumption.

As illustrated in Figure 2, given a pretrained weight matrix $W \in \mathbb{R}^{d \times d}$, the parameter update $\Delta W$ is represented as a low-rank decomposition:

$$W + \Delta W = W + BA \qquad (1)$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and the rank $r$ satisfies $r \ll d$. The rank $r$ is a tunable hyperparameter, similar to the learning rate.

During training, the original weight matrix $W$ is kept frozen, and only the parameters in $A$ and $B$ are updated. For a given input $x$, the forward propagation is expressed as:

$$y = Wx + \Delta W x = Wx + BAx \qquad (2)$$
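The low-rank update in Equations (1) and (2) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions `d` and `r`, the scale of `A`, and the helper name `lora_forward` are illustrative assumptions, and the zero initialization of `B` follows the original LoRA paper rather than anything stated in this article.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Eq. (2): y = W x + B (A x). W stays frozen; only A and B would be trained."""
    # Computing B @ (A @ x) avoids materializing the full d x d update B @ A.
    return W @ x + B @ (A @ x)

rng = np.random.default_rng(0)
d, r = 1024, 4                          # illustrative values, not the paper's settings

W = rng.standard_normal((d, d))         # frozen pretrained weight, shape (d, d)
A = rng.standard_normal((r, d)) * 0.01  # down-projection, shape (r, d)
B = np.zeros((d, r))                    # up-projection, shape (d, r), zero-initialized (assumption)

x = rng.standard_normal(d)
y = lora_forward(x, W, A, B)

# With B = 0 the update B A vanishes, so the adapted layer starts out
# identical to the frozen pretrained layer.
assert np.allclose(y, W @ x)

full_params = W.size                    # d * d weights touched by full fine-tuning
lora_params = A.size + B.size           # 2 * d * r trainable LoRA weights
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Because $r \ll d$, the trainable parameter count grows as $2dr$ rather than $d^2$, which is the source of the memory and compute savings described above.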

3.2. Bi-Temporal Image Feature Interaction Module

As a VFM designed for image segmentation tasks, SAM takes a single image as input, making its structure not directly applicable to RSCD tasks that require bi-temporal image inputs. As shown in Figure 1, the SAM-Large architecture contains a total of 24 Transformer layers, where the 6th, 12th, 18th, and 24th layers use a global self-attention mechanism, while the remaining layers use a local window-based attention mechanism. Based on these architectural characteristics, this paper proposes the Bi-temporal Image Feature Interaction Module (BIM) and selectively embeds it into layers 6, 12, 18, and 24, that is, all Transformer layers employing global self-attention. This design stems from three key considerations.

First, global self-attention layers capture contextual information across the entire image, offering stronger semantic abstraction and spatial modeling capabilities. Introducing cross-temporal feature interaction at these layers effectively fuses high-level semantic representations from the two temporal images, enhancing the model’s ability to distinguish genuine change areas while suppressing false changes caused by variations in lighting, viewpoint, or season.

Second, layers 6, 12, 18, and 24 reside in the shallow, intermediate, deep, and output regions of the network, respectively, spanning multiple abstraction levels from local details to global semantics. Implementing bi-temporal interactions at these distributed depths facilitates modeling temporally varying features across different semantic granularities. This enables multi-level perception ranging from pixel-level changes to object-level changes, enhancing adaptability to multi-scale change targets.

Finally, introducing cross-temporal interactions across all 24 layers would not only significantly increase computational overhead and memory consumption but also risk error propagation and training instability due to frequent interactions with low-level noisy features. In contrast, embedding BIM only in key layers with global modeling capabilities ensures sufficient information fusion while avoiding redundant operations and noise accumulation, achieving a favorable balance between performance, efficiency, and robustness.

The structure of BIM is illustrated in Figure 3. Given two bi-temporal feature maps $T_{1} \in \mathbb{R}^{H \times W \times C}$ and $T_{2} \in \mathbb{R}^{H \times W \times C}$, BIM first computes their absolute difference to generate an initial difference feature map $E \in \mathbb{R}^{H \times W \times C}$, highlighting regions with significant spatial changes:

$$E = |T_{1} - T_{2}| \qquad (3)$$

Next, $E$ is refined via convolutional operations to enhance its local continuity and contextual awareness. To further emphasize the change regions while preserving original information, a residual enhancement strategy is applied. Specifically, each temporal feature $T_{i}$ is weighted by the difference map $E$ and combined with the original feature. The result is passed through a Convolutional Block Attention Module (CBAM) [33] to extract more discriminative salient regions and feature dimensions, producing the enhanced features $T_{i}' \in \mathbb{R}^{H \times W \times C}$:

$$T_{i}' = \mathrm{CBAM}(T_{i} + T_{i} \times E) \qquad (4)$$

A cross-enhancement strategy is then applied to mutually refine and guide the bi-temporal features, improving information complementarity and distinguishing change regions while suppressing redundancy. This process enhances feature alignment and semantic expressiveness. Finally, each enhanced feature is added to the original feature of the other time phase, and the two sums are concatenated to generate the final difference feature map $E' \in \mathbb{R}^{H \times W \times 2C}$:

$$E' = \mathrm{Concat}\big((T_{1}' + T_{2}), (T_{2}' + T_{1})\big) \qquad (5)$$

This module design not only improves the sensitivity and discriminative ability of the model to subtle and complex changes but also maintains computational efficiency, making it well-suited for large-scale RSCD applications.
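The data flow of Equations (3) to (5) can be traced with a small NumPy sketch. This is a shape-level illustration under stated assumptions, not the authors' module: the convolutional refinement and CBAM are replaced by identity placeholders (`refine` and `cbam` are hypothetical hooks), and the function name `bim_sketch` and the tensor sizes are chosen only for the example.

```python
import numpy as np

def bim_sketch(t1, t2, refine=lambda e: e, cbam=lambda t: t):
    """Trace Eqs. (3)-(5) on (H, W, C) arrays.

    refine and cbam are identity placeholders standing in for the
    convolutional refinement and the CBAM block (assumption).
    """
    e = np.abs(t1 - t2)             # Eq. (3): initial difference map
    e = refine(e)                   # convolutional refinement (placeholder)
    t1_enh = cbam(t1 + t1 * e)      # Eq. (4) for i = 1
    t2_enh = cbam(t2 + t2 * e)      # Eq. (4) for i = 2
    # Eq. (5): cross residual addition, then channel-wise concatenation.
    return np.concatenate([t1_enh + t2, t2_enh + t1], axis=-1)

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16                  # illustrative sizes only
t1 = rng.standard_normal((H, W, C))
t2 = rng.standard_normal((H, W, C))

out = bim_sketch(t1, t2)
print(out.shape)                    # (H, W, 2C)
```

With the identity placeholders, two identical inputs give $E = 0$, so each half of the output reduces to $2T_{1}$; any deviation from that baseline comes from the difference-weighted term $T_{i} \times E$.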

3.3. Change Feature Enhancement Module

Through the BIM, four change feature maps with the same spatial resolution are obtained. However, in actual remote sensing imagery, change targets often present challenges such as large scale variations, subtle change magnitudes, and easily confused change types. Therefore, enhancing the model’s sensitivity and discriminative ability for change regions, especially in complex backgrounds and scenes with multi-scale or fine-grained targets, is a critical problem. To this end, this paper proposes a Change Feature Enhancement Module (CFEM). This module integrates techniques for generating and fusing multi-scale differential features, aiming to fully extract multi-level semantic information from input features and further enhance the ability to identify regions of change. Unlike traditional methods (such as the Feature Pyramid Network (FPN) or U-Net) that typically employ simple element-wise addition or channel concatenation for feature fusion, CFEM is built from three cooperating submodules: multi-scale feature generation, multi-scale feature fusion, and multi-scale feature enhancement. This design not only preserves details and contextual information across different scales but also enhances focus on critical change regions through adaptive mechanisms, providing robust support for high-precision remote sensing change detection.

The overall architecture of the CFEM is illustrated in Figure 4. The module first receives four input feature maps $I_{i} \in \mathbb{R}^{H \times W \times 512}$ with identical spatial resolutions. Subsequently, transposed convolutions with strides of $\{1, 2, 4, 8\}$ are applied to generate multi-scale feature maps $F_{scale}^{i}$ ($i \in \{1, 2, 3, 4\}$) with progressively decreasing spatial resolutions. This process constructs a hierarchical feature pyramid that preserves high-level semantic information while introducing richer spatial details across multiple scales, thereby providing a solid foundation for subsequent feature fusion:

$$F_{scale}^{i} = \mathrm{ConvTranspose}(I_{i}) \qquad (6)$$

To achieve efficient fusion while maintaining information integrity, a $1 \times 1$ convolution is employed to project each scale-specific feature map into a unified channel dimension, resulting in the projected features $F_{proj}^{i}$. This operation effectively reduces computational complexity and ensures dimensional consistency among multi-scale features during the fusion process.

On this basis, a top-down multi-scale feature fusion strategy is introduced. Specifically, starting from the coarsest low-resolution features, the feature maps are progressively upsampled and fused with adjacent higher-resolution features through element-wise addition, yielding the fused representations $F_{fuse}^{i}$. This mechanism allows high-level semantic information to enhance shallow features, effectively bridging the semantic gap between deep and shallow layers and improving the model’s holistic perception of change targets, particularly for regions with significant scale variations and complex morphologies:

$$F_{fuse}^{i} = F_{proj}^{i} + u(F_{fuse}^{i+1}) + d(F_{fuse}^{i-1}) \qquad (7)$$

where $u(\cdot)$ and $d(\cdot)$ denote the upsampling and downsampling operations, respectively.

After feature fusion, a lightweight multi-layer perceptron (MLP) is applied to further enhance the nonlinear representational capacity of the fused features and to explore latent change relationships, thereby strengthening the modeling of fine-grained change patterns:

$$F_{mlp}^{i} = \mathrm{MLP}(F_{fuse}^{i}) \qquad (8)$$

Finally, the multi-scale features $F_{mlp}^{i}$ are resized to a unified spatial resolution and concatenated along the channel dimension to produce the final output feature $F_{final}$:

$$F_{final} = \mathrm{Concat}\big[\mathrm{Resize}(F_{mlp}^{1}), \mathrm{Resize}(F_{mlp}^{2}), \mathrm{Resize}(F_{mlp}^{3}), \mathrm{Resize}(F_{mlp}^{4})\big] \qquad (9)$$

Through this design, multi-scale contextual information is effectively aggregated into a unified feature space, providing richer and more discriminative representations for subsequent change detection tasks and enhancing the robustness and accuracy of the model across diverse and complex scenarios.
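As a rough illustration of the fusion and aggregation steps, the NumPy sketch below implements only the top-down term $u(\cdot)$ of Equation (7), followed by the resize-and-concatenate step of Equation (9). Several things are assumptions made for the example rather than details taken from the paper: the projected maps are taken as given with level 1 finest and each level halving the resolution, nearest-neighbour repetition stands in for both $u(\cdot)$ and $\mathrm{Resize}$, and the $d(\cdot)$ term and the MLP of Equation (8) are omitted. The helper names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array; stands in for u(.)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def resize_to(x, size):
    """Nearest-neighbour upsampling to (size, size, C); size must be a multiple of x.shape[0]."""
    k = size // x.shape[0]
    return x.repeat(k, axis=0).repeat(k, axis=1)

def top_down_fuse(proj):
    """Top-down pass: proj[0] is the finest level, proj[-1] the coarsest (assumption)."""
    fused = [None] * len(proj)
    fused[-1] = proj[-1]                        # the coarsest level has nothing above it
    for i in range(len(proj) - 2, -1, -1):
        # Element-wise addition of the level's own features and the
        # upsampled fused features from the next coarser level.
        fused[i] = proj[i] + upsample2x(fused[i + 1])
    return fused

rng = np.random.default_rng(0)
C = 8                                           # unified channel width after the 1x1 projection (illustrative)
proj = [rng.standard_normal((s, s, C)) for s in (64, 32, 16, 8)]

fused = top_down_fuse(proj)
# Eq. (9): bring every level to the finest resolution and concatenate channels.
f_final = np.concatenate([resize_to(f, 64) for f in fused], axis=-1)
print(f_final.shape)                            # (64, 64, 4 * C)
```

The sketch shows why the fused finest level carries information from every coarser level: each addition folds in the already-fused map above it, so semantic context propagates down the pyramid one level at a time.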

4. Experiments

4.1. Datasets and Evaluation Metrics

To comprehensively validate the effectiveness of the proposed SAM-MSCD method, we conducted extensive experiments on four representative public RSCD datasets, selected based on data volume and the variety of change targets. These include LEVIR-CD [6], WHU-CD [34], NJDS [35], and MSRS-CD [36]. All datasets were cropped into non-overlapping 512 × 512-pixel segments in a left-to-right, top-to-bottom order, and divided into training, validation, and test sets in a 7:1:2 ratio. A brief introduction to these four datasets is provided below.

The LEVIR-CD dataset, released in 2020, contains 637 pairs of images with a spatial resolution of 0.5 m, each sized 1024 × 1024 pixels. These images were captured between 2002 and 2018, covering changes of 31,333 buildings including high-rise apartments, villas, and garages of various sizes in Texas, USA.

The WHU-CD dataset, released in 2019, contains a pair of images with a spatial resolution of 0.2 m and image dimensions of 32,207 × 15,354 pixels. The images are from Christchurch, New Zealand, taken in 2012 and 2016, covering changes in 12,796 buildings over a 20.5 square kilometer area.

The NJDS dataset, released in 2022, contains one pair of images with a spatial resolution of 0.3 m and dimensions of 14,231 × 11,381 pixels. It records various types of building change instances in Nanjing, China, between 2014 and 2018, including low-rise, mid-rise, and high-rise buildings. Due to its relatively small data volume, the NJDS dataset is particularly suitable for evaluating model performance in low-data scenarios, providing a key basis for studying model adaptability.

The MSRS-CD dataset, released in 2024, contains 841 pairs of images with a spatial resolution of 0.5 m, each sized 1024 × 1024 pixels. The images are from southern Chinese cities during 2019 to 2023, covering change targets of various scales, including new constructions, urban sprawl, vegetation changes, and road development, thus comprehensively representing complex real-world change scenarios.
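The cropping and splitting procedure described at the start of this subsection can be sketched in plain Python. The article does not say how partial tiles at image borders are handled or how tiles are assigned to the three subsets, so dropping border remainders and splitting sequentially are both assumptions here, and the function names are hypothetical.

```python
def tile_origins(height, width, tile=512):
    """Top-left (row, col) corners of non-overlapping tiles.

    Tiles are enumerated left-to-right, then top-to-bottom. Partial tiles
    at the right and bottom borders are dropped (assumption).
    """
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]

def split_7_1_2(items):
    """Sequential 7:1:2 split into train, validation, and test (assignment order is an assumption)."""
    n = len(items)
    n_train = n * 7 // 10       # integer arithmetic avoids floating-point rounding
    n_val = n // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# A 1024 x 1024 image yields four 512 x 512 tiles.
origins = tile_origins(1024, 1024)
print(origins)

# Tile identifiers for 637 image pairs of that size, as (pair index, row, col).
tiles = [(pair, r, c) for pair in range(637) for (r, c) in origins]
train, val, test = split_7_1_2(tiles)
print(len(train), len(val), len(test))
```

The subset sizes printed here follow only from the assumed rounding rule and should not be read as the counts used in the experiments.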

This paper adopts five common evaluation metrics for change detection algorithms to assess model performance: Precision (P), Recall (R), Intersection over Union (IoU), Overall Accuracy (OA), and the F1 score. Their specific definitions are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (10)$$

$$\mathrm{Recall} = \frac{TP}{FN + TP} \qquad (11)$$

$$\mathrm{IoU} = \frac{TP}{TP + FN + FP} \qquad (12)$$

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FN + FP} \qquad (13)$$

$$F1 = \frac{2}{\mathrm{Recall}^{-1} + \mathrm{Precision}^{-1}} \qquad (14)$$

where $TP$ denotes true positives (the number of samples correctly predicted as change), $TN$ true negatives (correctly predicted as no change), $FP$ false positives (predicted as change but actually no change), and $FN$ false negatives (predicted as no change but actually change).
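Equations (10) to (14) translate directly into a few lines of Python. The sketch below is illustrative: the function name `change_metrics` and the pixel counts in the example are hypothetical, and the function assumes at least one true positive so that no denominator is zero.

```python
def change_metrics(tp, tn, fp, fn):
    """Compute Eqs. (10)-(14) from pixel counts; assumes tp > 0."""
    precision = tp / (tp + fp)                    # Eq. (10)
    recall = tp / (fn + tp)                       # Eq. (11)
    iou = tp / (tp + fn + fp)                     # Eq. (12)
    oa = (tp + tn) / (tp + tn + fn + fp)          # Eq. (13)
    f1 = 2 / (1 / recall + 1 / precision)         # Eq. (14): harmonic mean of P and R
    return {"precision": precision, "recall": recall,
            "iou": iou, "oa": oa, "f1": f1}

# Hypothetical counts for a 1000-pixel prediction.
m = change_metrics(tp=80, tn=900, fp=10, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Since F1 simplifies to $2TP / (2TP + FP + FN)$, it is tied to IoU by $\mathrm{IoU} = F1 / (2 - F1)$, so the two metrics always rank methods in the same order. OA, by contrast, includes $TN$ and is dominated by the unchanged class when changes are rare.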

4.2. Implementation Details

SAM-MSCD is implemented in PyTorch 2.4.0, using the Large version of SAM as the visual backbone. The model was trained for a total of 300 epochs with the cross-entropy loss function and the AdamW optimizer. All experiments were conducted on a single NVIDIA Tesla L40 (48 GB) GPU. The initial learning rate was set to 0.0004, with a weight decay of 0.05 and a batch size of 4. Data augmentation was limited to simple transformations such as flipping, rotation, and brightness adjustment to enhance the model’s generalization ability. For all comparison methods, we used the officially released implementations or configurations. Dataset partitioning and evaluation metrics were kept consistent across all methods. Where strict uniformity was infeasible due to architectural differences, each method was evaluated under its recommended settings to ensure fair and representative performance comparisons. Specifically, all methods based on CNNs and Transformers use an input resolution of 256 × 256, while methods based on vision foundation models use an input resolution of 512 × 512.

4.3. Comparative Experiments with State-of-the-Art Algorithms

We conduct comparative experiments between the proposed SAM-MSCD and a series of RSCD methods, including CNN-based methods (FCCDN [37], SGSLN [38], AANet [39], SEIFNet [40]), Transformer-based methods (BIT [11], ChangeFormer [10], VcT [23], EATDer [41], MDIPNet [42]), and methods based on vision foundation models (SAMCD [29], BAN [31], TTP [14], SFCD [32]). Although methods such as SAMCD and BAN have successfully achieved domain adaptation for vision foundation models by introducing adapters, their focus has been primarily on cross-domain generalization. In contrast, SAM-MSCD addresses both the domain adaptation of VFMs to remote sensing scenarios and the detection of multi-scale change targets, jointly optimizing domain adaptation and multi-scale modeling. The following briefly introduces these thirteen methods.

FCCDN: This network is a remote sensing change detection network with feature constraints. It introduces a constraint mechanism during bi-temporal feature extraction and fusion, and adopts a self-supervised learning strategy to achieve more accurate change area recognition.
SGSLN: This method is a binary change detection approach based on a switchable dual encoder-decoder structure. It integrates semantic guidance and spatial localization strategies to effectively address the limitations of traditional architectures in handling bi-temporal feature interference and intraclass variation, as well as multi-view building changes.
AANet: This network employs a fuzzy refinement module to locate pseudo-changes and occluded true change regions, and utilizes a weight rearrangement module to fuse multi-scale difference features, enhancing adaptability to objects with varying change scales.
SEIFNet: A lightweight change detection network that combines spatio-temporal enhancement and inter-layer fusion. Through multi-level feature extraction, a spatio-temporal difference enhancement module, and an adaptive context fusion module, it improves feature representation of change areas and mitigates issues of pseudo changes and scale variations.
BIT: This network introduces Transformer into remote sensing change detection. By modeling spatio-temporal context via semantic tokens, it overcomes the limitations of convolutional methods in handling long-range dependencies and complex scenes.
ChangeFormer: A Transformer-based Siamese architecture for remote sensing image change detection. Unlike traditional convolutional methods, it combines a hierarchical Transformer encoder with an MLP decoder to effectively capture multi-scale and long-range dependencies, thereby improving detection accuracy.
VcT: This network extracts features via a shared backbone and incorporates a graph neural network to exploit shared contextual information between image pairs, thus enhancing change detection accuracy.
EATDer: This method combines edge-awareness with adaptive Transformers for remote sensing change detection. It uses a Siamese encoder structure integrated with adaptive vision Transformer blocks and a full-range fusion module to capture spatio-temporal variations, and employs an edge-aware decoder to refine change boundaries.
MDIPNet: A multi-scale dual-space interactive perception change detection network designed to address the high computational cost and insufficient semantic information utilization in existing RSCD models.
SAMCD: This network applies FastSAM to high-resolution RS images for change detection. Through a convolutional adapter and semantic learning branch, it improves the model’s adaptability and accuracy in remote sensing scenarios, outperforming fully supervised methods with sample-efficient learning capabilities.
BAN: This method introduces a frozen foundation model, along with bi-temporal adaptation branches and a connection module, effectively fusing the general knowledge of the foundation model with task-specific features for RSCD.
TTP: This network integrates latent knowledge from the SAM foundation model into RSCD, effectively addressing domain shifts and multi-temporal image heterogeneity in knowledge transfer.
SFCD: Specifically designed for RSCD, this network combines SAM with a feature interaction mechanism. It enhances fine-grained feature extraction, change sensitivity, and boundary recognition through parameter-efficient fine-tuning, a bi-temporal feature interaction module, and a boundary-aware loss function.

On the LEVIR-CD dataset, the results of the comparative experiments with current SOTA algorithms are shown in Table 1. SAM-MSCD achieves the best performance on four core metrics, P, F1, IoU, and OA, with values of 93.64%, 92.54%, 85.94%, and 99.24%, respectively. This advantage primarily stems from the model’s multi-scale feature extraction and fusion mechanism. The Recall of SAM-MSCD is 1.71% lower than that of EATDer, which attains the highest Recall (92.97%). EATDer improves recall by introducing a boundary change detection module, but this also introduces considerable pseudo-change noise, so its Precision is 7.91% lower than that of SAM-MSCD and its overall balance suffers. Figure 5 illustrates the inference results of different algorithms on the LEVIR-CD dataset. As seen in Figure 5a, SAM-MSCD performs excellently in boundary detection. Furthermore, Figure 5b–d show that SAM-MSCD has significantly fewer false and missed detections in dense building areas, demonstrating stronger robustness. Figure 5e further shows that, for changes to a single very small building, the model accurately identifies the change region and delineates fine boundary details, highlighting its advantages in fine-grained change detection tasks.
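The precision-recall trade-off discussed for EATDer can be made concrete from confusion-matrix counts. The sketch below uses the standard definitions of P, R, F1, IoU, and OA for the change class; the pixel counts are hypothetical and are not taken from any of the benchmark datasets.

```python
def change_metrics(tp, fp, fn, tn):
    """Return P, R, F1, IoU and OA (as fractions) for the change class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + fn + tn)
    return {"P": precision, "R": recall, "F1": f1, "IoU": iou, "OA": oa}

# Hypothetical counts: detector B recovers more changed pixels than A
# (higher recall) but pays for it with many more false alarms.
a = change_metrics(tp=900, fp=60, fn=100, tn=8940)
b = change_metrics(tp=930, fp=160, fn=70, tn=8840)
for name, m in (("A", a), ("B", b)):
    print(name, {k: round(100 * v, 2) for k, v in m.items()})
```

In this toy case B has the higher recall but the lower F1 and IoU, which mirrors the pattern described for EATDer.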

The results of the comparative experiments on the WHU-CD dataset are shown in Table 2. SAM-MSCD attains 96.73%, 88.31%, 92.33%, 85.75%, and 99.29% on P, R, F1, IoU, and OA, respectively. Compared to current mainstream SOTA methods, SAM-MSCD achieves the best performance on four metrics: R, F1, IoU, and OA. The visualization results in Figure 6a,b show that SAM-MSCD is significantly superior to other methods in controlling missed detections, effectively identifying change regions of smaller buildings. Figure 6e demonstrates that the model also controls false detections well: it does not misidentify changes in non-building categories such as containers and vehicles as building changes, thus exhibiting stronger class discrimination. Furthermore, as shown in Figure 6c,d, while achieving low rates of missed and false detections, SAM-MSCD still maintains good boundary detection performance, showing its comprehensive capability in complex scenes.

The NJDS dataset contains only a small amount of data, which allows a model’s change detection capability to be evaluated under low-data conditions. Table 3 shows the results of the comparative experiments on the NJDS dataset. SAM-MSCD achieves the best values for the key evaluation metrics of F1-score and IoU, at 79.33% and 65.75%, respectively. These scores are 3.95% and 5.26% higher than those of the second-best method, SAMCD. The visualization results in Figure 7 further support this conclusion, showing that SAM-MSCD delivers excellent detection performance even with a relatively small amount of data. Specifically, the instances in Figure 7 show that SAM-MSCD can accurately identify change regions of different sizes and provide clear boundaries, indicating its robustness and efficiency in data-scarce scenarios.

The results of the SOTA comparative experiments on the MSRS-CD dataset are shown in Table 4. SAM-MSCD achieves the best performance on the three key evaluation metrics of F1, IoU, and OA, with values of 79.67%, 65.84%, and 94.37%, respectively. This demonstrates the model’s outstanding detection capabilities in complex and variable real-world scenarios. The visualization results in Figure 8a–c further validate the advantages of SAM-MSCD in edge detection. Compared to other methods, it delineates the boundaries of change regions more accurately and is markedly more effective at reducing missed detections. Furthermore, as shown in Figure 8d, when faced with natural changes in non-target categories (such as grass growth), existing mainstream methods exhibit a certain degree of false detections, whereas SAM-MSCD effectively avoids such errors, demonstrating stronger class discrimination ability and robustness. Figure 8e further demonstrates the superior performance of SAM-MSCD in detecting small changes. Even for such changes, the model recognizes the change region accurately and maintains good boundary integrity, showing strong adaptability in fine-grained change detection tasks.

4.4. Model Complexity Analysis

To comprehensively evaluate the performance of the proposed SAM-MSCD in terms of computational efficiency and model size, a comparative analysis of model complexity was conducted against other mainstream change detection methods on the MSRS-CD dataset. As shown in Table 5, different methods are compared across six key dimensions: Network Type (CNN, Transformer, VFM), F1 score, IoU, FLOPs, Parameters (Params), and Inference time.

Overall, VFM-based methods generally outperform CNN-based and Transformer-based methods on F1 and IoU. Among them, SAM-MSCD achieves 79.67% in F1 score and 65.84% in IoU, the best performance, clearly outperforming other VFM-based methods such as SAMCD (76.53% in F1 and 61.98% in IoU) and BAN (75.99% in F1 and 61.27% in IoU). This result suggests that SAM-MSCD has stronger feature modeling capability and better generalization performance.

In terms of model complexity, SAM-MSCD has 54.76 G FLOPs and 42.65 M parameters, which is slightly higher than some lightweight models such as BIT, VcT, and SEIFNet, but much lower than ChangeFormer, which requires 202.79 G FLOPs. It is worth emphasizing that, compared with SAMCD, which is also based on a vision foundation model, SAM-MSCD improves the F1 score by 3.14% and the IoU by 3.86% while reducing the number of parameters by nearly 40%, reflecting a more efficient structural design.

In terms of the performance-complexity trade-off, although TTP and SFCD also achieve strong F1 scores, their computational overhead and parameter size exceed those of SAM-MSCD. SFCD, for example, has 44.09 M parameters and 52.14 G FLOPs, yet it achieves only 78.42% in F1 score and 64.49% in IoU, slightly below SAM-MSCD. In addition, methods such as BAN and MDIPNet, although close in accuracy, have larger model sizes and are harder to deploy, which limits their potential application in resource-constrained environments.
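The figures quoted in this subsection can be placed side by side with a few lines of arithmetic. The snippet below uses only numbers stated in the text (it does not reproduce Table 5), and the helper name `delta` is a hypothetical choice for illustration.

```python
# Values quoted in the text for the MSRS-CD dataset.
methods = {
    "SAM-MSCD": {"f1": 79.67, "iou": 65.84, "gflops": 54.76, "params_m": 42.65},
    "SFCD":     {"f1": 78.42, "iou": 64.49, "gflops": 52.14, "params_m": 44.09},
}

def delta(a, b):
    """Return a - b for each field, rounded to two decimals."""
    return {k: round(a[k] - b[k], 2) for k in a}

diff = delta(methods["SAM-MSCD"], methods["SFCD"])
print(diff)
```

Based on these quoted values, SAM-MSCD is 1.25 points higher in F1 and 1.35 points higher in IoU than SFCD with 1.44 M fewer parameters, at 2.62 G more FLOPs.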

In summary, SAM-MSCD balances high detection accuracy with model computational efficiency and parameter size, demonstrating excellent overall performance and practical application potential.

4.5. Ablation Studies

4.5.1. Ablation Study of Different Modules

To validate the effectiveness of the BIM and CFEM modules in SAM-MSCD, we conducted systematic ablation experiments on the LEVIR-CD and MSRS-CD datasets. The results are shown in Table 6.

Under the baseline setting without any additional modules (both BIM and CFEM removed) and using only a conventional absolute difference for feature comparison, the model’s F1-scores on the LEVIR-CD and MSRS-CD datasets were 83.87% and 71.34%, respectively, with IoU scores of 65.45% and 46.67%. This indicates that the overall performance was significantly limited. As shown in Figure 9d, the traditional absolute difference method, while capable of capturing some change regions, has notable deficiencies when dealing with complex scenes, ambiguous boundaries, and minute targets.
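For reference, the absolute-difference baseline in this ablation can be sketched as an element-wise operation on the two feature maps. The code below is an illustration in plain Python, with nested lists standing in for single-channel feature maps. The threshold and the toy values are assumptions, and thresholding stands in for the detection head that the actual baseline applies to encoder features.

```python
def abs_diff(feat_t1, feat_t2):
    """Element-wise |F1 - F2| for two equally sized 2D feature maps."""
    return [[abs(a - b) for a, b in zip(r1, r2)]
            for r1, r2 in zip(feat_t1, feat_t2)]

def binarize(diff, threshold):
    """Mark positions whose difference exceeds the threshold as changed."""
    return [[1 if d > threshold else 0 for d in row] for row in diff]

# Toy features: the top-left value shifts only mildly (e.g. illumination),
# the bottom-right value changes strongly (a real change).
f_t1 = [[0.50, 0.10], [0.20, 0.10]]
f_t2 = [[0.65, 0.10], [0.20, 0.90]]
change_map = binarize(abs_diff(f_t1, f_t2), threshold=0.1)
print(change_map)  # [[1, 0], [0, 1]]
```

With a fixed threshold, the mild shift at the top left is flagged alongside the real change, which is the kind of pseudo-change that plain differencing cannot separate on its own.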

After introducing the BIM module, the model’s performance showed a marked improvement on both datasets. On LEVIR-CD, the F1 and IoU scores rose to 90.76% and 82.88%, respectively. On MSRS-CD, the F1 and IoU increased to 75.77% and 60.23%. This result indicates that BIM, through its bi-temporal feature interaction and difference enhancement mechanism, effectively enhances the model’s perception of change regions and its capability for symmetrical modeling. Furthermore, introducing the CFEM module also brought performance gains, achieving an F1 of 89.87% and an IoU of 81.67% on LEVIR-CD, and an F1 of 75.45% and an IoU of 60.11% on MSRS-CD. This demonstrates that the module, through the generation and fusion of multi-scale features with semantic context, improves the model’s ability to characterize fine-grained change features.

Further analysis of the experimental results reveals the distinct roles of the proposed modules in different scenarios. The BIM module demonstrates superior robustness in pseudo-change scenarios caused by seasonal variations or lighting differences (e.g., in LEVIR-CD). By enforcing feature interaction within the encoder, BIM aligns the semantic distribution of bi-temporal images, effectively suppressing false positives caused by spectral inconsistency. The CFEM module, on the other hand, proves critical in multi-scale scenarios (e.g., MSRS-CD). We observed that without CFEM, the model struggles to detect large building footprints and narrow roads at the same time. The multi-scale fusion mechanism of CFEM recovers the boundary details of small targets while maintaining the internal integrity of large objects.

When both BIM and CFEM were integrated simultaneously, the model’s performance reached its peak. On LEVIR-CD, the F1 score increased to 92.54% and the IoU rose to 85.94%. On the more challenging MSRS-CD dataset, the F1 and IoU also significantly improved to 79.67% and 65.84%, respectively. The results, in conjunction with Figure 9g, fully illustrate that the BIM and CFEM modules are complementary in modeling spatial-structural relationships and enhancing difference representations, jointly improving the model’s robustness and accuracy.

4.5.2. Analysis of the Impact of Different LoRA Rank Settings on Model Performance

To further analyze the impact of the parameter-efficient fine-tuning strategy LoRA on model performance, we conducted comparative experiments by setting different rank values in the proposed SAM-MSCD model. Experiments were carried out on the LEVIR-CD and WHU-CD datasets, and the results are summarized in Table 7. It can be observed that GPU memory consumption increases progressively with larger rank values.

On the LEVIR-CD dataset, as the rank value increases, the model performance across all evaluation metrics exhibits a trend of gradual improvement followed by saturation. The best overall performance is achieved when the rank is set to 32, where the model attains an F1 score of 92.54% and an IoU of 85.94%. Further increasing the rank does not yield noticeable performance gains. On the WHU-CD dataset, the detection performance continues to improve with increasing rank values, and the optimal results under the current experimental configuration are obtained when the rank is set to 128. This observation suggests that the WHU-CD dataset poses higher demands on model capacity than LEVIR-CD.

Considering the trade-off between detection accuracy and computational cost, setting the rank to 32 provides the most favorable overall balance under the current configuration. This setting not only delivers high detection performance but also effectively controls parameter scale and computational overhead. Therefore, a rank value of 32 is recommended for subsequent model training and practical deployment.
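Why cost grows with the rank follows from how LoRA parameterizes the update: a frozen weight W of shape d × k is augmented by a product BA, with B of shape d × r and A of shape r × k, so each adapted matrix adds r(d + k) trainable parameters. The sketch below tabulates this for several ranks. The width of 1024 corresponds to the ViT-L encoder of SAM. Which projections receive LoRA and the exact rank grid are assumptions here, not settings confirmed by the text.

```python
def lora_trainable_params(d, k, rank):
    """Trainable parameters added by one LoRA pair: B (d x r) and A (r x k)."""
    return rank * (d + k)

def lora_fraction(d, k, rank):
    """Added parameters relative to the frozen d x k weight."""
    return lora_trainable_params(d, k, rank) / (d * k)

D = K = 1024  # ViT-L hidden width; assumed shape of an adapted projection
for r in (4, 8, 16, 32, 64, 128):  # hypothetical rank grid
    n = lora_trainable_params(D, K, r)
    print(f"rank={r:3d}  added={n:7d}  fraction={lora_fraction(D, K, r):.3%}")
```

The added count is linear in the rank, so moving from rank 32 to rank 128 quadruples the trainable parameters per adapted matrix, which is consistent with memory use rising as the rank grows.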

5. Discussion

This study demonstrates that vision foundation models can be effectively adapted to remote sensing image change detection through parameter-efficient fine-tuning combined with task-oriented architectural design. By integrating Low-Rank Adaptation (LoRA) into the SAM image encoder, the proposed SAM-MSCD framework successfully transfers rich semantic priors to the RSCD domain while mitigating the computational burden and overfitting risks associated with full fine-tuning. The strong performance observed on the low-data NJDS dataset further validates the advantage of foundation-model-based representations in annotation-scarce scenarios.

The proposed Bi-temporal Image Feature Interaction Module (BIM) plays a critical role in enhancing cross-temporal semantic consistency. Unlike conventional feature differencing strategies, BIM explicitly models interactions between bi-temporal features at multiple semantic levels, enabling more effective discrimination of subtle and complex changes. Ablation studies show that BIM consistently improves recall and overall detection robustness, particularly in dense urban environments and heterogeneous scenes where pseudo-changes are prevalent.

The Change Feature Enhancement Module (CFEM) further addresses the inherent multi-scale nature of remote sensing changes. By generating and fusing multi-scale difference features, CFEM enhances the representation of both fine-grained and large-area changes, leading to improved boundary integrity and reduced false detections in complex backgrounds, as demonstrated on the MSRS-CD dataset. To enable efficient fusion, CFEM resizes multi-scale features to a unified resolution of 128 × 128. Although such resizing may theoretically constrain the representation of extremely small objects, the fused features are derived from difference-enhanced representations that already emphasize change regions. Experimental results on datasets containing fine-grained changes indicate that this resolution provides a favorable balance between spatial detail preservation and computational efficiency.
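The resize-then-fuse step can be illustrated with a toy example. The sketch below brings feature maps of different sizes to a common grid with nearest-neighbour sampling and averages them. Both choices are assumptions made for brevity: the text does not specify CFEM's interpolation or fusion operator here, and the 4 × 4 target stands in for the 128 × 128 resolution.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D grid to out_h x out_w."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)]
            for i in range(out_h)]

def fuse_mean(maps):
    """Average equally sized 2D grids element-wise."""
    n = len(maps)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*maps)]

# Coarse 2 x 2 map: a large change occupying the top-right quadrant.
coarse = [[0.0, 1.0],
          [0.0, 0.0]]
# Fine 8 x 8 map: a small change in the top-right corner only.
fine = [[1.0 if (i < 2 and j >= 6) else 0.0 for j in range(8)]
        for i in range(8)]

target = 4  # stand-in for the unified resolution
fused = fuse_mean([resize_nearest(coarse, target, target),
                   resize_nearest(fine, target, target)])
for row in fused:
    print(row)
```

In the fused map, the corner cell where both scales agree receives the highest response, while cells supported only by the coarse map receive a weaker one.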

Despite these promising results, several limitations remain. First, BIM relies on pixel-wise correspondence between bi-temporal images and may therefore be sensitive to residual registration errors. Second, while CFEM improves cross-scale feature consistency, repeated up- and down-sampling operations may introduce minor aliasing effects in extremely high-resolution imagery. Addressing these issues by incorporating registration-invariant attention mechanisms or adaptive resolution strategies constitutes an important direction for future research.

From a practical perspective, SAM-MSCD achieves a favorable trade-off between detection accuracy and computational complexity. Compared with existing VFM-based change detection methods, it delivers superior performance with a moderate parameter count, supporting its potential for real-world deployment. Nevertheless, the current framework is designed for bi-temporal optical imagery. Extending the proposed approach to multi-temporal or multi-modal remote sensing data, as well as improving robustness under severe domain shifts, remains an open and promising avenue for future work.

6. Conclusions

In this paper, we propose SAM-MSCD, a multi-scale change detection framework for remote sensing imagery that integrates a vision foundation model. The method adopts SAM as the backbone for semantic feature extraction, employs LoRA for parameter-efficient adaptation, and incorporates the BIM and CFEM modules to enhance temporal feature alignment and multi-scale change region modeling, respectively. Extensive experiments on four benchmark datasets, namely LEVIR-CD, WHU-CD, NJDS, and MSRS-CD, show that SAM-MSCD achieves the best F1 scores of 92.54%, 92.33%, 79.33%, and 79.67%, respectively, and performs especially well in complex-background and fine-grained change scenarios. In summary, SAM-MSCD not only effectively unleashes the potential of vision foundation models in remote sensing change detection but also significantly enhances the detection accuracy and robustness for multi-scale change targets. This method provides a practical technical pathway for achieving more efficient and precise remote sensing change detection.

Author Contributions

Conceptualization, S.L. and L.T.; methodology, S.L.; software, S.L.; validation, S.L. and D.Z.; formal analysis, S.L.; investigation, S.L.; resources, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L. and L.T.; visualization, S.L. and D.Z.; supervision, S.L. and L.T.; project administration, S.L. and L.T.; funding acquisition, L.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Water Resources Science and Technology Program of Hunan Province, grant number XSKJ2025056-32, and the Changsha University of Science and Technology Graduate Research Innovation Project, grant number CLKYCX25078.

Data Availability Statement

Data available upon request.

Acknowledgments

The authors sincerely thank the School of Physical Electronics at Changsha University of Science and Technology for generously providing GPU computing resources, which greatly facilitated our work. We also appreciate the support from the Soil and Water Conservation Monitoring Center of the Ministry of Water Resources of China for providing experimental environments and conditions. Additionally, we would like to express our heartfelt gratitude to the editors and anonymous reviewers who contributed their constructive feedback behind the scenes, which helped to improve the quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RSCD	Remote Sensing Change Detection
CNNs	Convolutional Neural Networks
ViTs	Vision Transformers
VFMs	Vision foundation models
LoRA	Low-Rank Adaptation
SAM	Segment Anything Model

References

1. Singh, A. Review article digital change detection techniques using remotely-sensed data. Int. J. Remote Sens. 1989, 10, 989–1003.
2. Wang, H.; Liu, Y.; Wang, Y.; Yao, Y.; Wang, C. Land cover change in global drylands: A review. Sci. Total Environ. 2023, 863, 160943.
3. Mulverhill, C.; Coops, N.C.; Achim, A. Continuous monitoring and sub-annual change detection in high-latitude forests using Harmonized Landsat Sentinel-2 data. ISPRS J. Photogramm. Remote Sens. 2023, 197, 309–319.
4. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sens. Environ. 2021, 265, 112636.
5. Zhao, B.; Zhang, M.; Li, W.; Song, X.; Gao, Y.; Zhang, Y.; Wang, J. Intermediate domain prototype contrastive adaptation for spartina alterniflora segmentation using multitemporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401314.
6. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662.
7. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067.
8. Chang, S.; Tang, S.; Deng, Y.; Zhang, H.; Liu, D.; Wang, W. An advanced scheme for deceptive jammer localization and suppression in elevation multichannel SAR for underdetermined scenarios. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 1–18.
9. Zheng, H.; Gong, M.; Liu, T.; Jiang, F.; Zhan, T.; Lu, D.; Zhang, M. HFA-Net: High frequency attention siamese network for building change detection in VHR remote sensing images. Pattern Recognit. 2022, 129, 108717.
10. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210.
11. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514.
12. Chen, J.; Wu, D.; Ma, Q.; Xu, S.; Zheng, Y. AGFormer: An anchor-guided transformer for class imbalance in remote sensing change detection. Pattern Recognit. 2025, 168, 111839.
13. Zheng, Z.; Ermon, S.; Kim, D.; Zhang, L.; Zhong, Y. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 725–741.
14. Chen, K.; Liu, C.; Li, W.; Liu, Z.; Chen, H.; Zhang, H.; Zou, Z.; Shi, Z. Time travelling pixels: Bitemporal features integration with foundation model for remote sensing image change detection. In Proceedings of the IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 8581–8584.
15. Dong, S.; Wang, L.; Du, B.; Meng, X. ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning. ISPRS J. Photogramm. Remote Sens. 2024, 208, 53–69.
16. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763.
17. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25 April 2022.
18. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4015–4026.
19. Lin, Y.; Li, S.; Fang, L.; Ghamisi, P. Multispectral change detection with bilinear convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1757–1761.
20. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8007805.
21. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713.
22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
23. Jiang, B.; Wang, Z.; Wang, X.; Zhang, Z.; Chen, L.; Wang, X.; Luo, B. VcT: Visual change transformer for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2005214.
24. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720.
25. Awais, M.; Naseer, M.; Khan, S.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.-H.; Khan, F.S. Foundation models defining a new era in vision: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2245–2264.
26. Zhao, F.; Zhang, C.; Zhang, R.; Wang, T. Visual Prompt Learning of Foundation Models for Post-Disaster Damage Evaluation. Remote Sens. 2025, 17, 1664.
27. Lee, M.Y.; Lee, C.D.W.; Li, J.; Ang Jr, M.H. DINO-MOT: 3D Multi-Object Tracking with Visual Foundation Model for Pedestrian Re-Identification Using Visual Memory Mechanism. IEEE Robot. Autom. Lett. 2024, 10, 1202–1208.
28. Liu, X.; Shi, G.; Wang, R.; Lai, Y.; Zhang, J.; Han, W.; Lei, M.; Li, M.; Zhou, X.; Wu, Y.; et al. Segment Any Tissue: One-shot reference guided training-free automatic point prompting for medical image segmentation. Med. Image Anal. 2025, 102, 103550.
29. Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting segment anything model for change detection in VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611711.
30. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast segment anything. arXiv 2023, arXiv:2306.12156.
31. Li, K.; Cao, X.; Meng, D. A new learning paradigm for foundation model-based remote-sensing change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610112.
32. Zhang, D.; Wang, F.; Ning, L.; Zhao, Z.; Gao, J.; Li, X. Integrating SAM with feature interaction for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4513011.
33. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
34. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586.
35. Shen, Q.; Huang, J.; Wang, M.; Tao, S.; Yang, R.; Zhang, X. Semantic feature-constrained multitask siamese network for building change detection in high-spatial-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 78–94.
36. Liu, S.; Zhao, D.; Zhou, Y.; Tan, Y.; He, H.; Zhang, Z.; Tang, L. Network and dataset for multiscale remote sensing image change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 2851–2866.
37. Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature constraint network for VHR image change detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119.
38. Zhao, S.; Zhang, X.; Xiao, P.; He, G. Exchanging dual-encoder–decoder: A new strategy for change detection with semantic guidance and spatial localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4508016.
39. Hang, R.; Xu, S.; Yuan, P.; Liu, Q. AANet: An Ambiguity-Aware Network for Remote-Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11.
40. Huang, Y.; Li, X.; Du, Z.; Shen, H. Spatiotemporal enhancement and interlevel fusion network for remote sensing images change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5609414.
41. Ma, J.; Duan, J.; Tang, X.; Zhang, X.; Jiao, L. Eatder: Edge-assisted adaptive transformer detector for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5602015.
42. Chang, H.; Wang, P.; Diao, W.; Xu, G.; Sun, X. Remote sensing change detection with bitemporal and differential feature interactive perception. IEEE Trans. Image Process. 2024, 33, 4543–4555.

Figure 1. Detailed illustration of SAM-MSCD. Includes SAM image encoder, LoRA adapter, bi-temporal image feature interaction module, change feature enhancement module and detection head.

Figure 2. LoRA adapter structure.

Figure 3. The structure of the BIM.

Figure 4. The structure of the CFEM.

Figure 5. The visual results of 11 algorithms tested on the LEVIR-CD dataset. Among them, (a–e) are the detection results on LEVIR-CD.

Figure 6. The visual results of 11 algorithms tested on the WHU-CD dataset. Among them, (a–e) are the detection results on WHU-CD.

Figure 7. The visual results of 11 algorithms tested on the NJDS dataset. Among them, (a–e) are the detection results on NJDS.

Figure 8. The visual results of 11 algorithms tested on the MSRS-CD dataset. Among them, (a–e) are the detection results on MSRS-CD.

Figure 9. Visualization results of the ablation experiments on the LEVIR-CD and MSRS-CD datasets: (a) image t1, (b) image t2, (c) ground truth, (d) without BIM and CFEM, (e) with BIM only, (f) with CFEM only, and (g) SAM-MSCD.

Table 1. Comparison with Other SOTA Models on the LEVIR-CD Dataset, where the Bolded Values are the Best.

Methods	Type	P	R	F1	IoU	OA
FCCDN	CNN	92.17	90.49	91.32	83.04	99.12
SGSLN	CNN	92.63	90.71	91.66	83.64	99.16
AANet	CNN	92.10	87.74	89.87	81.60	98.99
SEIFNet	CNN	91.82	88.03	89.89	81.63	98.99
BIT	Transformer	90.33	89.56	89.94	81.72	98.98
Changeformer	Transformer	92.05	88.80	90.40	82.48	99.04
VcT	Transformer	92.57	87.65	90.04	81.89	99.01
EATDer	Transformer	85.73	92.97	89.20	80.51	98.85
MDIPNet	Transformer	92.04	90.22	91.12	83.69	99.14
SAMCD	VFM	90.72	91.42	91.06	83.60	99.09
BAN	VFM	93.55	90.70	92.10	85.36	99.21
TTP	VFM	93.18	91.35	92.26	85.62	99.22
SFCD	VFM	92.97	91.69	92.33	85.75	99.22
SAM-MSCD	VFM	93.64	91.26	92.54	85.94	99.24
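
The P, R, F1, IoU, and OA columns above are reported in percent. As a minimal sketch, assuming the standard pixel-level definitions for the changed class (the authors' evaluation code may differ in detail), such scores can be derived from confusion counts as follows. The names `Confusion` and `cd_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Pixel counts for the 'changed' class of a binary change map."""
    tp: int  # changed pixels predicted as changed
    fp: int  # unchanged pixels predicted as changed
    fn: int  # changed pixels predicted as unchanged
    tn: int  # unchanged pixels predicted as unchanged


def cd_metrics(c: Confusion) -> dict:
    """Return P, R, F1, IoU and OA as percentages (0.0 when undefined)."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    union = c.tp + c.fp + c.fn
    iou = c.tp / union if union else 0.0
    total = c.tp + c.fp + c.fn + c.tn
    oa = (c.tp + c.tn) / total if total else 0.0
    scores = {"P": precision, "R": recall, "F1": f1, "IoU": iou, "OA": oa}
    return {name: round(100 * value, 2) for name, value in scores.items()}


# Hypothetical counts, used only to exercise the function.
example = cd_metrics(Confusion(tp=90, fp=10, fn=10, tn=890))
print(example)
```

Because unchanged pixels usually dominate a change map, OA tends to stay high even when recall is low, which is one reason F1 and IoU are the more discriminative columns in these tables.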

Table 2. Comparison with Other SOTA Models on the WHU-CD Dataset, where the Bolded Values are the Best.

Methods	Type	P	R	F1	IoU	OA
FCCDN	CNN	95.24	83.58	89.03	79.85	99.01
SGSLN	CNN	95.53	84.38	89.61	80.79	99.06
AANet	CNN	77.12	80.97	79.00	65.29	97.92
SEIFNet	CNN	88.72	82.08	85.27	74.33	98.63
BIT	Transformer	78.48	84.13	81.21	68.46	98.12
Changeformer	Transformer	88.08	76.92	82.12	69.67	98.38
VcT	Transformer	83.07	80.09	81.55	68.85	98.25
EATDer	Transformer	87.97	85.48	86.71	76.53	98.73
MDIPNet	Transformer	96.44	82.76	89.08	80.31	99.02
SAMCD	VFM	96.90	78.63	86.81	76.70	98.85
BAN	VFM	96.09	83.52	89.37	80.78	99.04
TTP	VFM	96.11	87.06	91.36	84.10	99.21
SFCD	VFM	96.76	85.97	91.04	83.56	99.18
SAM-MSCD	VFM	96.73	88.31	92.33	85.75	99.29

Table 3. Comparison with Other SOTA Models on the NJDS Dataset, where the Bolded Values are the Best.

Methods	Type	P	R	F1	IoU	OA
FCCDN	CNN	87.79	48.73	62.68	45.38	96.19
SGSLN	CNN	83.59	58.42	68.77	52.22	96.51
AANet	CNN	79.57	54.19	64.47	47.57	96.08
SEIFNet	CNN	85.66	54.27	66.45	49.75	96.40
BIT	Transformer	71.81	55.79	62.79	45.77	95.66
Changeformer	Transformer	85.53	31.59	46.14	29.99	95.15
VcT	Transformer	80.14	46.15	58.57	41.41	95.71
EATDer	Transformer	61.63	50.85	55.61	38.52	94.67
MDIPNet	Transformer	85.90	59.98	70.64	54.61	96.72
SAMCD	VFM	87.52	66.20	75.38	60.49	97.16
BAN	VFM	90.85	46.08	61.14	44.03	96.15
TTP	VFM	57.27	82.53	67.62	51.08	98.00
SFCD	VFM	62.67	58.13	60.32	43.18	94.97
SAM-MSCD	VFM	87.42	72.62	79.33	65.75	97.51

Table 4. Comparison with Other SOTA Models on the MSRS-CD Dataset, where the Bolded Values are the Best.

Methods	Type	P	R	F1	IoU	OA
FCCDN	CNN	75.56	71.31	73.37	55.42	92.36
SGSLN	CNN	77.39	69.73	73.36	56.28	92.52
AANet	CNN	71.94	77.03	74.40	59.23	92.17
SEIFNet	CNN	75.24	75.46	75.24	60.30	92.67
BIT	Transformer	75.73	70.79	73.18	57.70	92.34
Changeformer	Transformer	72.22	72.94	72.58	56.96	91.86
VcT	Transformer	76.64	69.72	73.02	57.50	92.39
EATDer	Transformer	66.73	84.17	74.44	59.29	91.47
MDIPNet	Transformer	81.14	70.24	75.30	60.38	93.19
SAMCD	VFM	72.90	80.53	76.53	61.98	92.71
BAN	VFM	79.36	72.89	75.99	61.27	93.20
TTP	VFM	77.81	80.16	78.97	65.25	93.70
SFCD	VFM	76.64	80.28	78.42	64.49	93.47
SAM-MSCD	VFM	79.99	79.36	79.67	65.84	94.73

Table 5. Results of Model Complexity Analysis of Different Networks on MSRS-CD Dataset.

Methods	Type	F1	IoU	FLOPs (G)	Param (M)	Inference Time (ms)
FCCDN	CNN	73.37	55.42	12.49	6.31	15.29
SGSLN	CNN	73.36	56.28	11.50	6.04	10.31
AANet	CNN	74.40	59.23	24.21	15.82	10.63
SEIFNet	CNN	75.24	60.30	8.37	27.91	11.54
BIT	Transformer	73.18	57.70	8.75	3.04	11.58
Changeformer	Transformer	72.58	56.96	202.79	41.03	28.78
VcT	Transformer	73.02	57.50	10.64	3.57	14.68
EATDer	Transformer	74.44	59.29	23.46	6.61	21.18
MDIPNet	Transformer	75.30	60.38	14.10	40.21	23.58
SAMCD	VFM	76.53	61.98	8.60	70.59	201.59
BAN	VFM	75.99	61.27	33.17	80.30	76.22
TTP	VFM	78.97	65.25	51.06	39.89	88.56
SFCD	VFM	78.42	64.49	52.14	44.09	89.06
SAM-MSCD	VFM	79.67	65.84	54.76	42.65	90.34

Table 6. Results of Ablation Experiments on the LEVIR-CD Dataset and the MSRS-CD Dataset.

BIM	CFEM	LEVIR-CD F1	LEVIR-CD IoU	MSRS-CD F1	MSRS-CD IoU	FLOPs (G)	Param (M)
✗	✗	83.87	65.45	71.34	46.67	46.77	31.48
✓	✗	90.76	82.88	75.77	60.23	52.96	38.78
✗	✓	89.87	81.67	75.45	60.11	48.86	36.32
✓	✓	92.54	85.94	79.67	65.84	54.76	42.65

Table 7. Impact of LoRA Rank Values in SAM-MSCD on the LEVIR-CD Dataset and the WHU-CD Dataset.

Rank	LEVIR-CD F1	LEVIR-CD IoU	WHU-CD F1	WHU-CD IoU	GPU Memory (MB)
1	92.10	85.35	90.87	83.91	21,161
2	92.03	85.23	91.45	84.13	21,227
4	92.17	85.48	91.82	84.54	21,233
8	92.21	85.55	91.97	85.15	21,241
16	92.31	85.71	92.24	85.43	21,297
32	92.54	85.94	92.33	85.75	21,301
64	92.37	85.83	92.97	86.49	21,329
128	92.32	85.73	93.57	86.90	21,771

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, S.; Zhao, D.; Tang, L. A Multi-Scale Remote Sensing Image Change Detection Network Based on Vision Foundation Model. Remote Sens. 2026, 18, 506. https://doi.org/10.3390/rs18030506

