Article

ASS-CD: Adapting Segment Anything Model and Swin-Transformer for Change Detection in Remote Sensing Images

1 Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China
2 Image and Intelligence Laboratory, School of Information Science and Technology, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 369; https://doi.org/10.3390/rs17030369
Submission received: 16 December 2024 / Revised: 20 January 2025 / Accepted: 21 January 2025 / Published: 22 January 2025
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Change detection (CD) is a critical task in analyzing geographic information changes in remote sensing images (RSIs), yet it still faces challenges such as complex background interference, multi-scale varying objects, and class imbalance between positive and negative samples. Recently, with the development of pre-training and fine-tuning techniques, transferring the general knowledge embedded in large-scale pre-trained visual foundation models (PVFMs) to downstream tasks has attracted significant attention. However, directly applying these PVFMs to CD in RSIs often yields unsatisfactory results because of differences in domain knowledge. To address these issues, we propose ASS-CD, a novel hierarchical adapter framework that efficiently adapts PVFMs such as FastSAM and the Swin-Transformer to the CD task in RSIs. The proposed method leverages lightweight adapter modules with a cross-attention mechanism, which not only preserve the general knowledge of the PVFMs but also integrate global and local information, significantly enhancing CD accuracy. Furthermore, a convolutional block attention module (CBAM) is adopted to reduce interference from complex backgrounds and focus on multi-scale objects, and a hierarchical deep supervision module (HDSM) imposes deep supervision on multi-scale feature maps and computes the Dice loss, addressing the issue of class imbalance in CD datasets. The experimental results on three widely used datasets demonstrate that ASS-CD achieves state-of-the-art performance, with an improvement of approximately 5% on the LEVIR-CD dataset compared to other CD methods.

1. Introduction

Change detection (CD) in remote sensing images (RSIs) refers to the process of analyzing two images captured at different times over the same geographic area to detect changes [1]. The applications of CD are extensive, covering urban landscape monitoring [2], agricultural surveys [3], land cover mapping [4], natural resource management [5], and disaster warning [6], among others. CD can be performed on data from various sensors, such as high-resolution optical imagery, synthetic aperture radar (SAR) imagery, and hyperspectral imagery, which has led to the development of CD methods tailored to different data sources [7,8,9,10]. In recent years, advancements in optical sensors have generated a large volume of bi-temporal RSIs for CD tasks [11], leading to significant interest in automatic CD methods, which have made great progress in model performance. Figure 1 presents examples of CD datasets in the field of multi-spectral remote sensing, including image A acquired at the earlier time, image B acquired at the later time, and the ground truth label.
Traditional CD methods, which are based on technologies such as change vector analysis (CVA) [12], principal component analysis (PCA) [13], and independent component analysis (ICA) [14], primarily use only spectral information to identify changes [1]. Traditional CD methods are easy to implement and computationally efficient but rely heavily on handcrafted features [15], which limit their generalization capability and robustness. In recent years, with the rapid development of artificial intelligence, neural network-based methods have become the mainstream approach for CD due to their powerful feature learning and nonlinear fitting capabilities. Depending on the network architecture, these methods can be mainly categorized into two types: convolutional neural network (CNN)-based methods and Transformer-based methods. CNN-based methods typically employ Siamese CNNs as the backbone network to process bi-temporal images [16,17,18,19]. Others may integrate attention modules into the encoder to emphasize important features [20,21,22,23]. Although CNN-based methods can effectively extract local features, the constraints of intrinsic locality in CNNs limit their ability to capture global information. Transformer-based methods, composed of Vision Transformer (ViT) [24] blocks, excel in capturing global information and long-range dependencies. The initial approach employed a pure Transformer architecture [25,26,27] to develop CD networks, while the latest methods integrate CNNs and Transformers to simultaneously capture local and global features, thereby enhancing CD accuracy [28,29,30,31,32].
The above methods have made significant contributions to improving CD performance, yet they remain limited to traditional training designs for models with relatively small parameter scales. These methods can only extract semantic information specific to CD tasks but fail to learn the general knowledge embedded in more complex pre-trained models, thereby limiting the overall performance enhancement. Notably, with the advancement of pre-training and fine-tuning technology for large-scale models, applying the powerful generalization capabilities of pre-trained models to visual tasks in RSIs has emerged as a highly promising research direction [33,34,35]. Large-scale pre-trained visual foundation models (PVFMs), such as CLIP [37], the segment anything model (SAM) [38], and vision mamba (Vim) [39], have demonstrated strong adaptability across various downstream visual tasks [36]. Among these large-scale models, SAM has driven the rapid growth of PVFMs in the field of computer vision. Benefiting from its sophisticated architecture and the extensive SA-1B training dataset with billions of mask labels, SAM has exhibited impressive segmentation performance across numerous applications [38]. However, when SAM is directly applied to RSIs, its performance often falls short in pixel-level tasks such as CD, primarily due to differences in domain knowledge [40]. First, SAM is pre-trained on natural images, which limits its capacity to parse remote sensing data. Second, full fine-tuning of large-scale pre-trained models is impractical because labeled data for downstream remote sensing tasks are limited and difficult to obtain [41]. Effectively utilizing the feature extraction capabilities and general knowledge of large-scale models has therefore become a research focus in the field of CD in RSIs.
Parameter-Efficient Fine-Tuning (PEFT) [36] offers a promising strategy that not only overcomes the above limitations but also effectively leverages the general knowledge of large-scale models. PEFT updates only a minimal number of trainable parameters during training while keeping the majority of the pre-trained model’s weights frozen, achieving performance comparable to or even superior to full fine-tuning. The “adapter” [40], as a method of PEFT, incorporates domain-specific information or visual prompts into neural networks through lightweight adapter modules. By simply updating the minimal parameters in these adapters, the model can effectively integrate specific knowledge from downstream tasks with the general knowledge derived from pre-trained models, successfully bridging the gap between domain knowledge. As collecting large amounts of annotated bi-temporal image pairs for CD is both time-consuming and costly [42], adapter methods enable us to leverage the general knowledge of large-scale models while alleviating the need for large amounts of training data. However, there remain significant challenges in the design of adapter modules and the selection of appropriate PVFMs. Most current adapters rely solely on widely used Transformer-based large-scale models, which may limit their ability to capture local information. Therefore, it is crucial to develop an adapter network capable of learning both local and global feature processing capabilities simultaneously from CNN and Transformer frameworks.
Furthermore, there are several issues with bi-temporal image data for CD in RSIs that require special attention: (1) Compared to natural images, RSIs captured from high altitudes have more complex backgrounds [11]. Eliminating the negative impact of background interference and enabling the CD model to focus on the regions where changes have occurred remains a significant challenge. (2) CD objects often span multiple scales, which requires the CD model to perceive multi-scale information. (3) CD datasets often suffer from class imbalance between positive and negative samples due to the relatively low proportion of changed pixels in RSIs [21]. This class imbalance often causes CD models to focus more on large unchanged areas, resulting in unstable training and a decline in detection accuracy, which ultimately harms overall performance.
To solve the above problems, we propose an adapter network, termed ASS-CD, which adapts the segment anything model and the Swin-Transformer [43] for CD in RSIs. The proposed ASS-CD is built upon U-Net [44] with a weight-sharing Siamese encoder–decoder structure, integrating the general knowledge of SAM with that of the Swin-Transformer through a carefully designed adapter. During the encoding stage, we employ the visual encoder of the pre-trained FastSAM [45] (a SAM variant based on the CNN architecture) as the backbone network and introduce an auxiliary branch composed of a pre-trained Swin-Transformer to provide global information. The parameters of both FastSAM and the Swin-Transformer remain frozen during training. We then design a lightweight adapter module using a cross-attention mechanism, which effectively adapts the general knowledge of the two pre-trained vision models, covering both local and global information, to CD tasks by updating only the adapter's parameters. Furthermore, we insert a convolutional block attention module (CBAM) [46] into the skip connections, which combines channel and spatial attention without significantly increasing the computational cost compared with other attention mechanisms. CBAM enables ASS-CD to suppress complex background interference and focus on multi-scale change objects. During the decoding phase, we apply the hierarchical deep supervision module (HDSM) to the feature maps output by intermediate layers of the decoder and compute the Dice loss [47] for each processed feature map. With the help of the HDSM, the issue of class imbalance between positive and negative samples in CD datasets can be effectively addressed. The experimental results on three widely used datasets demonstrate that the proposed ASS-CD consistently outperforms other state-of-the-art (SOTA) CD methods.
The main contributions of our work can be briefly summarized as follows.
(1)
A novel CD network called ASS-CD is proposed, which is the first work to combine two pre-trained foundation models (FastSAM and Swin-Transformer) and transfer their general knowledge to the CD task through the adapter module. It also represents the first adapter framework to effectively fuse local and global information through a cross-attention mechanism in the remote sensing domain. By training a small number of parameters, the proposed ASS-CD achieves high-precision CD results.
(2)
To address the challenges of background interference and multi-scale change targets, our work is the first to integrate CBAM, which consists of channel and spatial attention blocks, into the skip connections of the U-Net architecture.
(3)
The HDSM is designed to impose deep supervision on multi-scale feature maps generated by the decoder and to compute the Dice loss. This design helps to address the class imbalance between positive and negative samples in CD datasets and encourages the model to learn more discriminative features.
The rest of this paper is organized as follows: Section 2 reviews the current research on deep learning-based CD methods and PEFT for PVFMs. Section 3 details the proposed method. Section 4 evaluates the performance of the proposed method through experiments. Section 5 and Section 6 discuss and conclude this work, respectively.

2. Related Works

2.1. Deep Learning-Based CD Methods

In recent years, CD methods based on deep learning have made significant progress. Early deep learning-based CD techniques primarily relied on CNN architectures with Siamese encoders, such as FC-EF, FC-Siam-conc, and FC-Siam-diff [16]. These fully convolutional Siamese networks extract features from bi-temporal images using two parallel encoders. Other approaches have adapted the U-Net framework, originally designed for semantic segmentation, to the CD task. SNUNet [17] is a densely connected Siamese network that leverages dense skip connections from U-Net++ [48]. DARNet-CD [49] proposes a dense attention refinement network based on a U-shaped encoder–decoder architecture. Other studies have introduced various attention modules to enhance CD performance, where STANet [20] incorporates a spatial–temporal attention module after a weight-shared feature extractor and DMINet [22] designs a joint attention module to facilitate the exchange of information between bi-temporal images. Although techniques like pyramid pooling, atrous convolution, and attention mechanisms can expand the receptive field to improve the ability of CNNs to model longer-range dependencies, the intrinsic locality of CNNs still limits their capacity to capture global information.
Transformers, widely used in natural language processing (NLP) due to their ability to capture long-range dependencies and global information, have also significantly impacted computer vision tasks such as image classification, object detection, and semantic segmentation. In CD tasks, some works directly employ Transformer as the backbone for feature extraction. Changeformer [25] is the first Transformer-based Siamese network designed specifically for CD. It introduces the self-attention mechanism and demonstrates the potential of Transformer architectures in the CD field. In addition, SwinSUNet [26] is built upon Swin-Transformer blocks and utilizes a fusion module to integrate encoder information. Although Transformer-based approaches have made progress, the lack of local information limits their overall performance. To address this limitation, recent research has focused on combining the strengths of CNNs and Transformers. BIT [28] introduces a bi-temporal image Transformer module, and employs a Siamese encoder based on ResNet18 to extract high-level features, which are then converted into semantic tokens and processed by Transformer blocks to capture global context across spatial and temporal domains. In addition, ConvTransNet [29] integrates a CNN and Transformer in parallel within its encoder, featuring a CNN branch for local feature extraction and a Transformer branch for global feature extraction. CTST [50] is a CNN- and Transformer-based edge enhancement and spatial–temporal synchronized CD network. ConvFormerSR [51] presents a novel cross-sensor framework that combines Transformers and CNNs to address the challenges of heterogeneous ground features and domain shift for CD in RSIs.
However, the above methods are still constrained by traditional model training designs. These approaches can only learn semantic information specific to CD and lack the ability to understand general knowledge inherent in natural images. With the development of large models, it is possible to fully utilize the general knowledge learned by pre-trained models and apply it to downstream tasks such as CD to achieve better results.

2.2. PEFT for PVFMs

Leveraging pre-training techniques such as MAE [52], many Transformer-based visual models have demonstrated excellent performance across a wide range of computer vision tasks [36]. The core challenge lies in efficiently transferring the general knowledge of large-scale models to downstream tasks. Fine-tuning PVFMs is a widely adopted paradigm. However, the full fine-tuning of large-scale models, which often contain billions of parameters, is usually impractical due to the high computational costs. In addition, Transformer-based models may not fully achieve their potential when applied to downstream tasks like remote sensing, due to the limited availability of data in these domains [53].
Fortunately, PEFT, initially proposed in NLP, offers a promising solution. It introduces a minimal number of trainable parameters while keeping the majority of the pre-trained model frozen. By updating these parameters, PVFMs can achieve performance on downstream tasks that is comparable to, or even surpasses, that of fully fine-tuned models. The current visual PEFT methods can be categorized into four main types: adapter tuning, prompt tuning, prefix tuning, and side tuning [36]. Some optimize a small set of task-specific parameters by modifying or appending the input embedding (prompt/prefix tuning). Others introduce an auxiliary branch parallel to the main model (side tuning). We chose adapter tuning due to its ability to introduce lightweight trainable modules directly to the model layers, which is much easier to implement and offers a better balance between efficiency and performance. We will briefly review the application of adapter tuning in PEFT. ViT-Adapter [54] combines adapters with spatial prior modules and feature interaction operations, enabling plain ViT models to handle various downstream tasks by embedding image priors. SAM-Adapter [40] first attempts to adapt SAM for downstream tasks such as shadow detection, by inserting an adapter module consisting of two MLPs, which allows the combination of general knowledge learned by SAM with specific task knowledge. ClassWise-SAM-Adapter [42] incorporates adapters into the image encoder of ViT to bridge the domain gap between natural scenes and SAR images, introducing low-frequency information from SAR images. In addition, several studies employ adapter modules to leverage PVFMs for change detection. BAN [53] proposes a bi-temporal adapter network as a general PVFM adaptation framework for CD, which includes frozen base models like CLIP [37], adapter branches, and bridging modules. SAM-CD [55] enhances CD accuracy by designing trainable adapters between a frozen SAM encoder and a CNN-based decoder. TTP [56] introduces an approach called “Time Traveling Pixel” that leverages the universal segmentation capabilities of PVFMs.
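For illustration, the following minimal PyTorch sketch shows the core idea behind adapter tuning: the pre-trained backbone is frozen and only a small residual adapter is optimized. The backbone choice (torchvision's SwinV2-B), the bottleneck width, and the placement of the adapter on the classification output are illustrative assumptions rather than details of any of the cited methods.

```python
import torch
import torch.nn as nn
import torchvision

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Frozen pre-trained backbone (weights are downloaded on first use)
backbone = torchvision.models.swin_v2_b(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False

adapter = BottleneckAdapter(dim=1000)          # toy adapter on the 1000-d output
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

x = torch.randn(2, 3, 256, 256)
with torch.no_grad():
    feats = backbone(x)                        # frozen forward pass
out = adapter(feats)                           # only the adapter receives gradients
print(sum(p.numel() for p in adapter.parameters()))   # ~0.13 M trainable vs. ~88 M frozen weights
```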
Although the above methods have promoted the adaptation of general knowledge in large-scale models to CD in RSIs, they all rely on single PVFMs for knowledge transfer. Furthermore, relying solely on the Transformer to provide general knowledge may result in the model’s insufficient capability to extract local information.

3. Proposed Method

In this section, we will elaborate on the details of the proposed ASS-CD and its corresponding components, including the adapter module, CBAM, HDSM, and loss function. The overall architecture of ASS-CD is illustrated in Figure 2, which shows the relationship between the proposed modules and their positions within the network. On the other hand, Figure 3 focuses more on the flow of image information and the details of the network.
As shown in Figure 2, the ASS-CD employs an encoder–decoder architecture based on U-Net with a Siamese encoder. The dual-channel networks employ the same structure and shared weights to ensure consistency in extracting features from bi-temporal images. During the encoding phase, we select FastSAM’s visual encoder as the backbone network, utilizing it as a PVFM to acquire general knowledge. To assist FastSAM in extracting image features and enhance its capability to capture global information, we also introduce an auxiliary branch composed of a pre-trained Swin-Transformer, ensuring that all parameters of the FastSAM encoder and Swin-Transformer branch remain frozen during training.
Typically, in the U-Net architecture, feature maps of different scales are connected through skip connections between the encoder and decoder. To address the negative impact of complex backgrounds and enable the proposed ASS-CD to focus on multi-scale changes in the objects, we introduce CBAM, comprising a channel attention block and a spatial attention block, into the skip connections to perceive changing information.
During the decoding stage, we employ a lightweight decoder composed of convolutional and up-sampling layers to generate multi-scale feature maps. Subsequently, these outputs from the decoder are processed by HDSM to compute the Dice loss. By providing supervision in intermediate layers of the decoder, HDSM helps to solve the issue of class imbalance in change detection datasets. Finally, a prediction head which includes a 1 × 1 convolution layer and a Softmax function is used to generate a change prediction map. This final output from ASS-CD is applied to calculate the binary cross-entropy (BCE) loss.
As shown in Figure 3, the bi-temporal images $T_1, T_2 \in \mathbb{R}^{H \times W \times 3}$ are first projected by a weight-sharing Siamese FastSAM encoder, where $H$ and $W$ denote the height and width of the input images, respectively. FastSAM is built on CNNs and outputs feature maps at four spatial scales: $(H/4) \times (W/4)$, $(H/8) \times (W/8)$, $(H/16) \times (W/16)$, and $(H/32) \times (W/32)$. The output features of FastSAM can be represented as $f_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times C_i}$ ($i = 1, 2, 3, 4$), with the scale factors defined as $n_1 = 4$, $n_2 = 8$, $n_3 = 16$, and $n_4 = 32$, respectively. The Swin-Transformer in the auxiliary branch also extracts multi-scale features $s_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times C_i}$ ($i = 1, 2, 3, 4$), corresponding to the features of the FastSAM encoder. Next, we construct trainable lightweight adapter modules based on the cross-attention mechanism, which consolidate the general knowledge of the two PVFMs (FastSAM and Swin-Transformer) to capture local and global dependencies while exploiting their combined strengths in multi-scale information extraction. The multi-scale features from the two pre-trained vision models are fed into a trainable $\mathrm{Adapter}^{(i)}$ ($i = 1, 2, 3, 4$) at each scale, yielding outputs $\hat{f}_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times C_i}$ ($i = 1, 2, 3, 4$). Finally, we concatenate the adapter outputs of the two temporal images at each scale to obtain the final encoder outputs $F_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times 2C_i}$ ($i = 1, 2, 3, 4$). Subsequently, CBAM is inserted into the skip connection paths of the U-Net architecture, and the HDSM applies deep supervision to the multiple outputs of the decoder. Finally, we obtain the change prediction map $C \in \mathbb{R}^{H \times W \times 1}$ after processing by the prediction head.

3.1. Adapter Module

The proposed ASS-CD employs FastSAM as the backbone network of the Siamese encoder to generate multi-scale features. Here, FastSAM is a pre-trained model composed of CNN structures, which excels at capturing and understanding local information. However, when dealing with global information, it struggles to model long-range dependencies. In contrast, the Transformer is particularly suited for this task. To better capture both global and local information, we integrate an auxiliary branch based on Swin-Transformer into the network. Specifically, we employ the SwinV2-B [57] model pre-trained on ImageNet as the backbone network for this auxiliary branch, and all parameters of the Swin-Transformer remain frozen during the training phase.
The adapter module is primarily composed of a cross-attention block and a feed-forward network (FFN), as shown in Figure 4. The cross-attention block integrates local multi-scale information from FastSAM with global multi-scale information from Swin-Transformer. The multi-scale feature outputs from FastSAM serve as queries, while the outputs from the Swin-Transformer branch act as keys and values in the computation of the cross-attention mechanism.
The self-attention mechanism is a core component of the Transformer, and the formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
The cross-attention mechanism is a specialized form of the self-attention mechanism, designed to analyze the relationships between the features of two images in vision tasks. In the adapter module, we take $f_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times C_i}$ as the query and $s_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times C_i}$ as the key and value. The formulas of $\mathrm{Adapter}^{(i)}$ ($i = 1, 2, 3, 4$) are as follows:
$$\tilde{f}_i = \mathrm{Attention}\big(\mathrm{norm}(f_i), \mathrm{norm}(s_i)\big)$$
$$\hat{f}_i = \mathrm{FFN}(\tilde{f}_i)$$
where $\mathrm{norm}(\cdot)$ denotes LayerNorm, $\hat{f}_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times C_i}$ denotes the updated features of the FastSAM encoder, and $\mathrm{Attention}(\cdot)$ indicates the use of sparse attention. During the training phase, we only update the parameters of the lightweight adapter modules, which consist of the cross-attention mechanism and the FFN.
The adapter module design allows the proposed ASS-CD to simultaneously acquire task-specific knowledge from change detection data and absorb general knowledge from two pre-trained foundation models. At the same time, it integrates the strengths of CNN and Transformer architectures through the cross-attention mechanism, enabling the proposed ASS-CD to effectively process local information while capturing the long-distance dependencies. The adapter module introduces relatively few trainable parameters, making it a computationally efficient fine-tuning solution.
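For clarity, a minimal PyTorch sketch of one such cross-attention adapter is given below. The channel width, the number of heads, and the use of standard dense multi-head attention (rather than the sparse attention mentioned above) are simplifying assumptions for illustration, not details of the released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Fuses FastSAM features (query) with Swin-Transformer features (key/value)."""
    def __init__(self, dim, num_heads=8, ffn_ratio=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_ratio),
            nn.GELU(),
            nn.Linear(dim * ffn_ratio, dim),
        )

    def forward(self, f, s):
        # f, s: (B, C, H, W) feature maps from FastSAM and the Swin branch
        B, C, H, W = f.shape
        q = self.norm_q(f.flatten(2).transpose(1, 2))     # (B, H*W, C) queries
        kv = self.norm_kv(s.flatten(2).transpose(1, 2))   # (B, H*W, C) keys/values
        x, _ = self.attn(q, kv, kv)                       # cross-attention
        x = self.ffn(x)                                   # FFN, as in the equations above
        return x.transpose(1, 2).reshape(B, C, H, W)

# Example: fuse the coarsest-scale (H/32) features of a 256 x 256 input
f4 = torch.randn(2, 256, 8, 8)        # FastSAM features (channel width assumed)
s4 = torch.randn(2, 256, 8, 8)        # Swin-Transformer features
print(CrossAttentionAdapter(dim=256)(f4, s4).shape)       # torch.Size([2, 256, 8, 8])
```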

3.2. CBAM

CBAM [46] is integrated into the skip connection of the U-Net architecture, enabling the model to focus on the most relevant regions of multi-scale changing objects and eliminate the negative effects of background interference.
CBAM applies attention mechanisms in both the spatial and channel dimensions with two sequential blocks: the channel attention block (CAB) and spatial attention block (SAB). CAB aggregates spatial information by employing global average pooling (GAP) and global maximum pooling (GMP) to capture inter-channel relationships. The formula for CAB is as follows:
$$M_c = \sigma\big(\mathrm{MLP}(\mathrm{GAP}(F)) + \mathrm{MLP}(\mathrm{GMP}(F))\big)$$
where $F$ represents the input feature map with dimensions $H \times W \times C$, $\sigma$ is the sigmoid activation function, and $M_c$ is the channel attention map output by CAB.
The subsequent SAB primarily focuses on the spatial relationships of the features. The formula for SAB is as follows:
$$M_s = \sigma\big(\mathrm{Conv}\big([\mathrm{GAP}(F; \dim = c), \mathrm{GMP}(F; \dim = c)]\big)\big)$$
where $M_s$ is the spatial attention map output by SAB, and $\mathrm{Conv}$ denotes the convolution operation with a $7 \times 7$ filter.
In the final step, the attention maps from CAB and SAB are multiplied element-wise with $F_i \in \mathbb{R}^{(H/n_i) \times (W/n_i) \times 2C_i}$ to refine it. The refined feature map $\hat{F}_i$ is obtained as follows:
$$\hat{F}_i = M_s\big(F_i \otimes M_c(F_i)\big) \otimes \big(F_i \otimes M_c(F_i)\big)$$
where $i = 1, 2, 3$, and $\otimes$ denotes element-wise multiplication.
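A compact PyTorch sketch of CBAM, following the equations above, is shown below. The reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper [46]; the exact hyperparameters used in ASS-CD are an assumption here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention block (CAB): shared MLP over GAP and GMP descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention block (SAB): 7 x 7 convolution over channel-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        B, C, _, _ = x.shape
        gap = x.mean(dim=(2, 3))                       # (B, C) global average pooling
        gmp = x.amax(dim=(2, 3))                       # (B, C) global maximum pooling
        m_c = self.sigmoid(self.mlp(gap) + self.mlp(gmp)).view(B, C, 1, 1)
        x = x * m_c                                    # channel-refined features
        avg = x.mean(dim=1, keepdim=True)              # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)               # (B, 1, H, W)
        m_s = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * m_s                                 # spatially refined features

print(CBAM(channels=512)(torch.randn(2, 512, 64, 64)).shape)   # torch.Size([2, 512, 64, 64])
```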

3.3. HDSM

Deep supervision is a technique applied to deep neural networks, where auxiliary classifiers and loss functions are introduced at intermediate layers, supervising not only the final output layer but also the other layers. By providing supervision at multiple layers, deep supervision promotes the network to learn more discriminative features at different depths, thereby accelerating convergence and enhancing generalization ability [58].
To implement the HDSM, we apply deep supervision to the feature maps of three intermediate layers generated by the decoder. Specifically, we select the following three scales, which are connected to the corresponding feature maps of the encoder through skip connections: $D_1 \in \mathbb{R}^{(H/4) \times (W/4) \times d_1}$, $D_2 \in \mathbb{R}^{(H/8) \times (W/8) \times d_2}$, and $D_3 \in \mathbb{R}^{(H/16) \times (W/16) \times d_3}$. As shown in Figure 3, the Dice loss is computed for each of these feature maps to achieve deep supervision. The HDSM enhances our model's ability to capture multi-scale information. In addition, the introduction of Dice loss effectively addresses the common issue of class imbalance between positive and negative samples in change detection datasets.
Our method achieves deep supervision in a lightweight manner without excessively increasing the computational cost. As shown in Figure 5, the HDSM consists of a stack of layers (a 3 × 3 deconvolution layer, a batch normalization layer, a ReLU activation function, and a 1 × 1 convolution layer) and performs two deconvolution operations to progressively up-sample the feature maps. Subsequently, a 1 × 1 convolution layer adjusts the channel dimension to 1, matching the dimension of the change prediction map. Finally, bilinear interpolation resizes the feature maps of the three scales to the original spatial size, yielding outputs $DS_i \in \mathbb{R}^{H \times W \times 1}$ ($i = 1, 2, 3$). The feature maps $DS_i$ from the HDSM are subsequently used to compute the Dice loss against the label map, achieving deep supervision for the multi-scale intermediate layers. The first and second deconvolution operations are formulated as follows:
$$F_{up1} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{DeConv}_{3 \times 3}(D_i))\big)$$
$$F_{up2} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{DeConv}_{3 \times 3}(F_{up1}))\big)$$
where $D_i$ ($i = 1, 2, 3$) represents the feature maps output by the multi-scale intermediate layers of the decoder, $\mathrm{DeConv}_{3 \times 3}$ denotes the 3 × 3 deconvolution operation, $\mathrm{BN}$ stands for the batch normalization layer, and $\mathrm{ReLU}$ is the activation function. $F_{up1}$ and $F_{up2}$ are the feature maps after the two up-sampling steps. The final 1 × 1 convolution layer and bilinear interpolation are formulated as follows:
$$F_{conv} = \mathrm{Conv}_{1 \times 1}(F_{up2})$$
$$DS_i = \mathrm{BilinearInterpolate}(F_{conv}, H, W)$$
where $\mathrm{Conv}_{1 \times 1}$ represents the 1 × 1 convolution layer, and $DS_i$ is the output of the HDSM with a size of $H \times W$, obtained through bilinear interpolation.
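The following minimal sketch implements one HDSM branch as described above: two stride-2 deconvolution stages, a 1 × 1 convolution to a single channel, and bilinear resizing to H × W. The intermediate channel width and the exact kernel/stride/padding settings are assumptions consistent with 2× up-sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HDSMBranch(nn.Module):
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        def up_block(cin, cout):
            # 3 x 3 deconvolution -> BN -> ReLU, doubling the spatial resolution
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=3, stride=2,
                                   padding=1, output_padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.up1 = up_block(in_channels, mid_channels)
        self.up2 = up_block(mid_channels, mid_channels)
        self.head = nn.Conv2d(mid_channels, 1, kernel_size=1)   # channel dim -> 1

    def forward(self, d, out_size):
        x = self.up2(self.up1(d))            # two deconvolution stages
        x = self.head(x)
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)

# Example: deep supervision on the (H/8) x (W/8) decoder output of a 256 x 256 input
d2 = torch.randn(2, 128, 32, 32)
print(HDSMBranch(in_channels=128)(d2, out_size=(256, 256)).shape)   # torch.Size([2, 1, 256, 256])
```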

3.4. Loss Function

The overall loss function of ASS-CD comprises two components due to the application of deep supervision. The BCE loss is commonly used to calculate the discrepancy between the predicted change map and the ground truth label. The expression for the BCE loss is as follows:
$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{n=1}^{N}\big(y_n \log p_n + (1 - y_n)\log(1 - p_n)\big)$$
where $y_n$ represents the ground truth value of the $n$th pixel, $p_n$ represents the predicted probability that the $n$th pixel is classified as changed, and $N$ represents the total number of pixels in the image.
In CD tasks for RSIs, significant class imbalance often exists between changed and unchanged pixels, which impairs training stability and model generalization capability. In addition, the model’s perception of change regions and its ability to process multi-scale information are also diminished. Dice loss, by considering the overlap between the predicted and ground truth regions, helps mitigate the impact of class imbalance. We select Dice loss as the second part of the overall loss function, whose formula is as follows:
$$\mathcal{L}_{Dice} = 1 - \frac{2\sum_{n=1}^{N} p_n y_n}{\sum_{n=1}^{N}(p_n + y_n)}$$
where the summation term $\sum_{n=1}^{N} p_n y_n$ represents the intersection between the predicted and ground truth pixels, while $\sum_{n=1}^{N}(p_n + y_n)$ represents the union. Since we apply deep supervision at three different scales of the decoder, the Dice loss includes three terms corresponding to these scales. The overall loss function of the proposed ASS-CD is composed of the BCE loss and the Dice loss terms for the three scales. The formula is expressed as follows:
$$\mathcal{L} = \mathcal{L}_{BCE} + \sum_{i=1}^{3}\mathcal{L}_{Dice}^{i}$$
where $\mathcal{L}_{Dice}^{i}$ denotes the Dice loss computed on the $i$th deeply supervised output of the decoder.
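A sketch of the overall loss computation is given below: BCE on the final change map plus Dice losses on the three deeply supervised outputs. The sigmoid applied to the HDSM outputs and the small epsilon added for numerical stability are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1e-6):
    """prob, target: (B, 1, H, W); target is binary (1 = changed)."""
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - 2.0 * inter / (union + eps)).mean()

def total_loss(pred, ds_outputs, target):
    """pred: final change probabilities; ds_outputs: list of the three DS_i maps."""
    loss = F.binary_cross_entropy(pred, target)
    for ds in ds_outputs:
        loss = loss + dice_loss(torch.sigmoid(ds), target)
    return loss

# Example with random tensors
target = (torch.rand(2, 1, 256, 256) > 0.9).float()
pred = torch.rand(2, 1, 256, 256)
ds_outs = [torch.randn(2, 1, 256, 256) for _ in range(3)]
print(total_loss(pred, ds_outs, target))
```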

4. Experimental Results and Analysis

This section first introduces the datasets used for model training and testing, as well as the evaluation metrics and implementation details. Subsequently, we conduct comparative experiments with other SOTA methods. Finally, ablation studies and a complexity analysis are presented to evaluate and verify the effectiveness of each proposed module.

4.1. Dataset Description

We chose the LEVIR-CD [20], WHU-CD [59], and DSIFN-CD [21] datasets for the experimental analysis. These datasets are classic benchmarks for CD in RSIs, serving as important databases for detecting changes in ground objects. The LEVIR-CD and WHU-CD datasets mainly focus on building change detection, while the DSIFN-CD dataset contains various types of objects. Thus, utilizing these datasets allows for a more comprehensive and in-depth analysis of model generalization capabilities across various data sources.
(1)
LEVIR-CD [20]: This dataset is employed for CD, consisting of 637 bi-temporal high-resolution Google Earth image pairs with dimensions of 1024 × 1024 pixels and a spatial resolution of 0.5 m. These images capture substantial land-use changes, particularly building development and demolition, over time periods ranging from 5 to 14 years. To accommodate the input size requirements of the PVFMs and address GPU memory limitations, we cropped the images into 256 × 256 patches. The final dataset was split into 7120, 1024, and 2048 image pairs for the training, validation, and test sets, respectively. The dataset can be obtained from https://justchenhao.github.io/LEVIR/ accessed on 10 October 2022.
(2)
WHU-CD [59]: This high-resolution aerial imagery dataset is utilized for building change detection tasks, focusing on the Christchurch region in New Zealand, where it captures changes between 2012 and 2016 due to post-earthquake reconstruction efforts following the 2011 earthquake. The original aerial images have a resolution of 0.2 m per pixel and cover a large area of 32,507 × 15,354 pixels. For practical use, the images were cropped into 256 × 256 patches. The dataset was divided into 6096, 762, and 762 image pairs for the training, validation, and test sets, respectively. The dataset can be obtained from https://gpcv.whu.edu.cn/data/building_dataset.html accessed on 1 October 2023.
(3)
DSIFN-CD [21]: This is a dataset specifically designed for change detection, with data sourced from Google Earth covering six cities (i.e., Beijing, Chengdu, Shenzhen, Chongqing, Wuhan, and Xi’an) in China. DSIFN-CD includes various types of change objects, such as roads, buildings, farmlands, and water bodies. In this paper, the default 512 × 512 samples were cropped into 256 × 256 image patches. The dataset was randomly divided into training, validation, and test sets, containing 14,400, 1360, and 192 RSI pairs, respectively. The dataset can be obtained from https://paperswithcode.com/dataset/dsifn-cd accessed on 15 December 2023.

4.2. Implementation Details and Evaluation Metrics

The proposed ASS-CD was implemented using the PyTorch 2.3 framework and trained on a single NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of video memory. We trained the model for 100 epochs on each of the three datasets. The Adam optimizer was used to achieve model convergence. The initial learning rate was set to 1 × 10−3 and was linearly decreased to zero over the course of training. The batch size was set to 8 and the weight decay was configured as 5 × 10−4. The proposed ASS-CD was constructed using the U-Net architecture. Both the FastSAM encoder and the Swin-Transformer branch generated feature maps at four different scales, which were connected to the decoder through skip connections.
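The optimizer and learning-rate schedule described above can be expressed as follows; the trainable module is a stand-in placeholder, and LambdaLR is only one possible way to realize the linear decay to zero.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder for the trainable parts
epochs, batch_size = 100, 8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
# Linear decay of the learning rate from 1e-3 to zero over 100 epochs
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)

for epoch in range(epochs):
    # One dummy training step per epoch; a real run iterates over the CD dataset
    out = model(torch.randn(batch_size, 3, 256, 256)).mean()
    optimizer.zero_grad()
    out.backward()
    optimizer.step()
    scheduler.step()

print(optimizer.param_groups[0]["lr"])               # 0.0 after the final epoch
```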
To evaluate the performance of ASS-CD and other comparative methods, we employed five widely used metrics to measure the similarity between the predicted change maps and the ground truth labels: precision (Pre), recall (Rec), F1 score (F1), intersection over union (IoU), and overall accuracy (OA). These evaluation metrics are defined as follows:
$$\mathrm{Pre} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$$
$$\mathrm{Rec} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$$
$$\mathrm{F1} = 2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}/(\mathrm{Pre} + \mathrm{Rec})$$
$$\mathrm{IoU} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$$
$$\mathrm{OA} = (\mathrm{TP} + \mathrm{TN})/(\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$$
where TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively.
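These definitions translate directly into code; the function below computes all five metrics from a binary prediction and a binary ground truth map (no smoothing term is added for empty classes, which is a simplification).

```python
import numpy as np

def cd_metrics(pred, gt):
    """pred, gt: binary arrays of the same shape (1 = changed, 0 = unchanged)."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    tn = np.sum((pred == 0) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return pre, rec, f1, iou, oa

pred = np.random.randint(0, 2, (256, 256))
gt = np.random.randint(0, 2, (256, 256))
print(cd_metrics(pred, gt))
```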

4.3. Comparison Experiments

(1) Comparison Methods: To validate the performance of the proposed ASS-CD, we compared it with the following SOTA CD methods, which fall into three categories: (1) CNN-based methods (FC-Siam-conc [16] and FC-Siam-diff [16] with simple Siamese networks, SNUNet [17] with a U-Net structure, and STANet [20] and HANet [23] with attention modules); (2) Transformer-based methods (Changeformer [25] with a Transformer and SwinSUNet [26] with a Swin-Transformer); (3) hybrid frameworks combining CNNs and Transformers (BIT [28], ConvTransNet [29], and WNet [30]). These three categories cover representative SOTA CD methods, summarize the development of CD approaches, and are suitable for the experimental setup. We used publicly available implementations from GitHub for most comparison models, running them from scratch under the same conditions with default hyperparameters. Some results were taken from the original papers and are indicated with *. This ensured consistency and comparability among the comparison models during the evaluation process.
It is noteworthy that, despite methods like ConvTransNet also combining CNNs and Transformers within their frameworks, their traditional model training paradigms differ significantly from the proposed ASS-CD. These methods are constrained by data volume, parameter scale, and the narrow scope of specific domain knowledge, resulting in limited model expressiveness. In contrast, ASS-CD does not require updating the parameters of the PVFMs. By introducing trainable adapter modules, it effectively learns domain-specific knowledge in RSIs while flexibly adapting the general knowledge of the pre-trained model to the CD task.
(2) Quantitative Analysis: Table 1, Table 2 and Table 3 present the experimental results of ASS-CD and other SOTA methods on the three datasets.
For the LEVIR-CD dataset, ASS-CD attains the highest scores in precision, recall, F1, IoU, and OA, reaching 93.78%, 90.85%, 92.29%, 85.69%, and 99.14%, respectively. ASS-CD outperforms the second-best model by 2.62%, 0.67%, 1.62%, 2.76%, and 0.08% on these metrics.
For the WHU-CD dataset, ASS-CD achieves the best performance in four of the five evaluation metrics, with only the precision score trailing behind the Changeformer method. Although Changeformer achieves higher precision, this improvement comes at the cost of significantly lower recall, making it much inferior to the proposed ASS-CD in overall performance. ASS-CD exceeds ConvTransNet, the second-best method, by 1.22%, 1.13%, 1.18%, 2.04%, and 0.07% in precision, recall, F1, IoU, and OA, respectively.
For the DSIFN-CD dataset, ASS-CD still achieves outstanding performance. The experimental results on DSIFN-CD demonstrate that the proposed hierarchical adapter framework exhibits excellent generalization capability when dealing with non-building targets such as rivers and farmland, which is a key strength of our method. This could be attributed to the general knowledge learned from the two pre-trained models.
(3) Visualization Analysis: In the visualization analysis, we used white for TP, black for TN, red for FP, and green for FN to clearly illustrate the model’s performance. Figure 6 presents examples of the CD results on the LEVIR-CD dataset using different methods, while Figure 7 shows similar results on the WHU-CD dataset. The visualization figures sequentially present the experimental results of Image A, Image B, Label, FC-Siam-conc, FC-Siam-diff, STANet, SNUNet, Changeformer, BIT, ConvTransNet, and the proposed ASS-CD from left to right.
From the visualization results, we can observe that the methods based on CNN architectures (FC-Siam-conc and FC-Siam-diff) are constrained by their relatively simple network design and the intrinsic locality of CNNs. The CNN-based methods tend to focus on local changes and neglect the global information of the bi-temporal images. They often produce large areas of false negatives (FNs, shown in green) and false positives (FPs, shown in red), leaving significant room for model improvement. In contrast, the method incorporating the Transformer architecture (Changeformer) excels in capturing global features by modeling long-distance dependency. This advantage helps reduce large-scale FN and FP regions and leads to more accurate results.
However, the Transformer-based methods struggle with detecting local details such as the contours and edges of buildings, as shown in Figure 6 and Figure 7. BIT and ConvTransNet, which combine CNNs and Transformers within an architecture, partly address this issue. The proposed ASS-CD also leverages the general knowledge from two types of networks and gains the ability to simultaneously process local and global information.
In summary, the proposed ASS-CD demonstrates the best performance in the visual analysis. For instance, the last row of Figure 7 shows the results on the WHU-CD dataset, where the comparison methods exhibit large areas marked in green and red, indicating poor detection accuracy. For the FC-Siam-diff method, the building in the upper left corner is almost entirely covered in green, meaning it was incorrectly classified as a negative sample. In contrast, the designs that integrate CNNs with Transformers significantly reduce the rate of missed detections. Taking ASS-CD as an example, only a small portion of the building outlines is misclassified, indicating a significant improvement in the visualized CD results.
(4) Complexity Analysis: Table 4 presents the complexity metrics of the comparative methods, using the number of parameters (Params) and floating-point operations (FLOPs) as evaluation criteria. Params reflect the capacity and learning requirements of the model during training, indicating its spatial complexity, while FLOPs represent the total number of floating-point operations performed by the model, serving as a measure of temporal complexity.
As shown in Table 4, compared to CNN-based methods, Transformer-based methods generally have more parameters. This is primarily due to the self-attention mechanism requiring a large number of model parameters to construct. Among the compared methods, BIT integrates the attention mechanism into the CNN backbone without a significant increase in the number of parameters. Changeformer reaches the highest values in both Params and FLOPs, making it the most complex algorithm in the experiments. The proposed ASS-CD, when adapted to the CD task in RSIs, requires only the training of lightweight adapter modules. This leads to lower levels of trainable parameters and FLOPs, thereby minimizing computational costs.
Additionally, it is worth noting that despite the relatively small number of trainable parameters in the proposed ASS-CD, the overall model parameters can still be quite large when deployed on hardware devices. This is because, even though the FastSAM backbone network and the Swin-Transformer auxiliary branch consist of frozen parameters that do not participate in gradient updates, they are still responsible for processing data and generating feature maps during the encoding phase. The parameters of these two pre-trained models significantly exceed those of the trainable parameters in ASS-CD, reaching 68 M and 88 M, respectively. This undoubtedly presents a challenge for the efficient deployment of ASS-CD, particularly in the context of limited hardware resources. We will explore possible solutions to this issue in Section 5.

4.4. Ablation Study

In this section, we conduct ablation experiments on the LEVIR-CD and WHU-CD datasets to evaluate the effectiveness of the different modules. The baseline model is defined as a U-Net structure that retains only the pre-trained FastSAM as the backbone for feature extraction. We create three ablation models, Model_a, Model_b, and Model_c, based on the types of modules included in the architecture. Moreover, to highlight the advantages of pre-trained models over traditional neural networks, we also provide results for a pure U-Net under the same settings. The configurations are as follows:
  • Baseline: U-Net + FastSAM;
  • Model_a: U-Net + FastSAM + Adapter;
  • Model_b: U-Net + FastSAM + Adapter + CBAM;
  • Model_c: U-Net + FastSAM + Adapter + HDSM;
  • ASS-CD: U-Net + FastSAM + Adapter + CBAM + HDSM.
Table 5 presents the quantitative results of the ablation study. Meanwhile, to visually demonstrate the capabilities of different models, Figure 8 shows the heatmaps generated by the feature map outputs from the prediction head on the LEVIR-CD dataset. The final output from different models is first normalized to the range [0, 255] and then color-mapped to a JET-style heatmap. In these heatmaps, pixels closer to red indicate areas where the model has allocated more attention. In order to emphasize the impact of different modules, we added red bounding boxes to the heatmaps to highlight the area of interest.
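The heatmap generation described above can be reproduced roughly as follows; OpenCV's JET colormap is used here as an assumed stand-in for the authors' plotting tool.

```python
import cv2
import numpy as np

feat = np.random.randn(256, 256).astype(np.float32)      # stand-in for a prediction-head feature map
norm = (feat - feat.min()) / (feat.max() - feat.min() + 1e-8)
heat = cv2.applyColorMap((norm * 255).astype(np.uint8), cv2.COLORMAP_JET)
cv2.imwrite("heatmap.png", heat)                          # red areas = high model attention
```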
(1) Adapter Module: The adapter module integrates a pre-trained Swin-Transformer branch, which assists the FastSAM encoder in processing image features through the cross-attention mechanism, enabling the baseline model to effectively fuse CNNs with Transformers to capture both local and global information. By updating the parameters of the adapter modules, general knowledge from pre-trained models is adapted for CD tasks.
Table 5 shows that the introduction of the adapter module significantly enhances model performance. The baseline model and Model_a surpass the pure U-Net, demonstrating that large pre-trained models do offer an advantage for CD tasks compared to traditional neural networks. In addition, Model_a improves on all evaluation metrics compared with the baseline model, indicating that directly applying a vision foundation model with frozen parameters (such as SAM) to remote sensing downstream tasks does not yield satisfactory results. The visualization intuitively illustrates the impact of the adapter module. In the heatmap of the baseline model shown in Figure 8d, the background regions of the feature maps are predominantly green, indicating that the model pays unnecessary attention to the background, which is counterproductive to the assessment of building changes. After the introduction of the adapter module, the background in the heatmap (Figure 8e) shifts to a predominantly blue tone. This change indicates that Model_a focuses more on the object areas that have actually changed, rather than on irrelevant background information.
(2) CBAM: In the ASS-CD design, we integrated CBAM into the full-scale skip-connection paths between the encoder and decoder of the U-Net architecture. As shown in Table 5, the introduction of CBAM improved the overall performance of Model_b. As for the heatmap, we can observe that in the red-boxed region of Figure 8f, the edges of background buildings become more blurred, while the multi-scale changing objects are highlighted with deeper red, compared with Figure 8e. This confirms that CBAM indeed helps the model to mitigate the negative impact of background interference and focus more on multi-scale changing objects.
(3) HDSM and Loss Function: HDSM aims to apply deep supervision to intermediate layers at various scales within the network decoder. In addition, integrating Dice loss into the overall loss function helps address the issue of class imbalance between positive and negative samples in CD datasets. Table 5 shows that Model_c has improved in all evaluation metrics compared to Model_a. From the red-boxed regions in Figure 8g, it can be observed that the outlines of background buildings become more blurred compared with Figure 8e. With the introduction of HDSM, the model reduces unnecessary attention to background areas, which belong to negative samples in CD datasets.
Table 6 presents the results of the ablation study on different combinations of loss functions. It can be observed that BCE loss and Dice loss contribute to improving model performance in different ways, and only their combination achieves the best results. Specifically, BCE loss focuses on pixel-wise classification accuracy, assigning equal importance to both background and foreground regions. As a result, it may lead to a decrease in precision in cases where the dataset suffers from class imbalance or significant background interference. In contrast, Dice loss emphasizes the overlap between predicted and ground truth regions, making it more effective in optimizing precision under severe class imbalance. However, this often comes at the cost of a slight decline in recall. By combining the BCE and Dice loss, it is possible to balance pixel-wise classification with regional overlap optimization, thereby addressing the problem of class imbalance and enhancing the overall performance of the model.
Ultimately, when all modules, including the adapter module, CBAM, and HDSM, are applied to the baseline model simultaneously, the resulting ASS-CD demonstrates superior performance. As shown in Figure 8h, ASS-CD not only excels in quantitative performance but also stands out in its heatmap visualization, achieving precise capture of local details while comprehensively considering the global information of the bi-temporal images.

5. Discussion

In summary, our extensive experimental results show that the proposed ASS-CD exhibits excellent performance on multiple datasets. Its performance not only surpasses classical CD methods based on Siamese CNNs but also outperforms many Transformer-based methods with larger parameter counts, highlighting its significant practical value and great potential for future development. We attribute its success to the following three main aspects of the model design:
(1)
We fuse two PVFMs, FastSAM and Swin-Transformer, to adapt the general knowledge of image semantic understanding to the specific CD task for RSIs, which allows our method to benefit from both local and global information from the pre-trained model.
(2)
By integrating the CBAM and HDSM modules, the model’s ability to extract multi-scale varying objects, especially local details, is significantly improved, and these modules also effectively reduce the interference of irrelevant background information in RSIs.
(3)
The introduction of Dice loss helps to overcome the category imbalance problem in the CD dataset, thereby improving the stability of model training.
However, the proposed ASS-CD also has several limitations that need to be addressed in future work:
(1)
Adaptability to more advanced pre-trained models. In recent years, with the rapid advancement of large-scale model technology, more sophisticated PVFMs have emerged. For instance, the highly acclaimed vision Mamba model has achieved remarkable results in various computer vision tasks. Given this, we might explore the latest visual models to further enhance CD accuracy without altering the hierarchical architecture of ASS-CD.
(2)
Deployment complexity on hardware devices. As mentioned in the complexity analysis subsection, although ASS-CD only updates a small number of parameters, the frozen parameters from FastSAM and Swin-Transformer lead to significant storage requirements when deploying the model on hardware devices. This underscores the necessity of employing model lightweighting techniques such as model pruning and knowledge distillation to reduce complexity while selecting appropriate PVFMs.
(3)
Balance between model complexity and performance gain. Although we adopt an adapter tuning strategy to ensure that the number of parameters updated during training is minimal, the architecture of the proposed ASS-CD remains relatively complex due to the use of two large pre-trained models to learn general knowledge. In addition, the introduction of other modules, such as deep supervision, adds extra computational burden and training complexity. We should explore more lightweight ways to supervise intermediate layers and simplify the parameters of the HDSM, which are important factors during model implementation. In future work, we plan to simplify the model while maintaining its performance. This may involve reducing the number of adapters, using lighter pre-trained models, or optimizing other modules to strike a balance between model complexity and performance gain.

6. Conclusions

This paper proposes a novel hierarchical adapter framework, namely ASS-CD, designed for CD in RSIs. It leverages adapter modules to integrate general knowledge from pre-trained FastSAM and Swin-Transformer, allowing for the effective combination of specific knowledge for downstream tasks with the general knowledge of large-scale models by simply updating the adapter parameters. Furthermore, the introduction of CBAM and HDSM demonstrates the robust multi-scale feature representation capabilities of the proposed method. In addition, integrating Dice loss into model training through deep supervision significantly reduces the negative impact of class imbalance on model accuracy. The experimental results on three widely used public datasets demonstrate that the proposed ASS-CD achieved SOTA performance.
However, ASS-CD still has some limitations. In future work, we will explore the application of newly emerging advanced PVFMs within the proposed adapter framework to achieve higher CD accuracy in RSIs. In addition, applying model lightweighting techniques to the proposed adapter framework will help mitigate the challenges associated with the large number of parameters during model deployment.

Author Contributions

Conceptualization, C.W., X.W. and B.W.; methodology, C.W. and X.W.; software, C.W.; validation, C.W., X.W. and B.W.; formal analysis, C.W., X.W. and B.W.; investigation, C.W.; writing—original draft preparation, C.W., X.W. and B.W.; writing—review and editing, C.W., X.W. and B.W.; supervision, X.W. and B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62371140 and the National Key Research and Development Program of China under Grant 2022YFB3903404.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, G.; Huang, Y.; Li, X.; Lyu, S.; Xu, Z.; Zhao, H.; Zhao, Q.; Xiang, S. Change Detection Methods for Remote Sensing in the Last Decade: A Comprehensive Review. Remote Sens. 2024, 16, 2355. [Google Scholar] [CrossRef]
  2. Coppin, P.R.; Jonckheere, I.G.C.; Nackaerts, K.; Muys, B.; Lambin, E.F. Digital Change Detection Methods in Ecosystem Monitoring: A Review. Int. J. Remote Sens. 2004, 25, 1565–1596. [Google Scholar] [CrossRef]
  3. Bruzzone, L.; Prieto, D.F. Automatic Analysis of the Difference Image for Unsupervised Change Detection. IEEE Trans. Geosci. Remote Sens. 2000, 38, 1171–1182. [Google Scholar] [CrossRef]
  4. Feranec, J.; Hazeu, G.W.; Christensen, S.; Jaffrain, G. Corine Land Cover Change Detection in Europe (case studies of the Netherlands and Slovakia). Land Use Policy 2007, 24, 234–247. [Google Scholar] [CrossRef]
  5. Shi, S.; Zhong, Y.; Zhao, J.; Lv, P.; Liu, Y.; Zhang, L. Land-Use/Land-Cover Change Detection Based on Class-Prior Object-Oriented Conditional Random Field Framework for High Spatial Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  6. Qiao, H.; Wan, X.; Wan, Y.; Li, S.; Zhang, W. A Novel Change Detection Method for Natural Disaster Detection and Segmentation from Video Sequence. Sensors 2020, 20, 5076. [Google Scholar] [CrossRef] [PubMed]
  7. Yu, Q.; Zhang, M.; Yu, L.; Wang, R.; Xiao, J. SAR Image Change Detection Based on Joint Dictionary Learning with Iterative Adaptive Threshold Optimization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5234–5249. [Google Scholar] [CrossRef]
  8. Amitrano, D.; Guida, R.; Iervolino, P. Semantic Unsupervised Change Detection of Natural Land Cover with Multitemporal Object-Based Analysis on SAR Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5494–5514. [Google Scholar] [CrossRef]
  9. Wang, L.; Wang, L.; Wang, Q.; Atkinson, P.M. SSA-SiamNet: Spatial-Wise Attention-Based Siamese Network for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5510018. [Google Scholar] [CrossRef]
  10. Luo, F.; Zhou, T.; Liu, J.; Guo, T.; Gong, X.; Ren, J. Multiscale Diff-Changed Feature Fusion Network for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502713. [Google Scholar] [CrossRef]
  11. Khelifi, L.; Mignotte, M. Deep Learning for Change Detection in Remote Sensing Images: Comprehensive Review and Meta-Analysis. IEEE Access 2020, 8, 126385–126400. [Google Scholar] [CrossRef]
  12. Ertürk, S. Fuzzy Fusion of Change Vector Analysis and Spectral Angle Mapper for Hyperspectral Change Detection. In Proceedings of the IGARSS 2018, Valencia, Spain, 22–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5045–5048. [Google Scholar]
  13. Celik, T. Unsupervised Change Detection in Satellite Images Using Principal Component Analysis and k-Means Clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776. [Google Scholar] [CrossRef]
  14. Wu, C.; Du, B.; Zhang, L. Hyperspectral Change Detection Based on Independent Component Analysis. Int. J. Remote Sens. 2012, 16, 545–561. [Google Scholar]
  15. Zheng, Z.; Du, S.; Taubenböck, H.; Zhang, X. Remote Sensing Techniques in The Investigation of Aeolian Sand Dunes: A Review of Recent Advances. Remote Sens. Environ. 2022, 271, 112913. [Google Scholar] [CrossRef]
  16. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  17. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  18. Zheng, Z.; Ma, A.; Zhang, L.; Zhong, Y. Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15173–15182. [Google Scholar]
  19. Fang, S.; Li, K.; Li, Z. Changer: Feature Interaction is What You Need for Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
  20. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  21. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A Deeply Supervised Image Fusion Network for Change Detection in High Resolution Bi-temporal Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  22. Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change Detection on Remote Sensing Images Using Dual-Branch Multilevel Intertemporal Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  23. Han, C.; Wu, C.; Guo, H.; Hu, M.; Chen, H. HANet: A Hierarchical Attention Network for Change Detection with Bitemporal Very-High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3867–3878. [Google Scholar] [CrossRef]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  25. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210. [Google Scholar]
  26. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  27. Wang, Q.; Jing, W.; Chi, K.; Yuan, Y. Cross-Difference Semantic Consistency Network for Semantic Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  28. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  29. Li, W.; Xue, L.; Wang, X.; Li, G. ConvTransNet: A CNN–Transformer Network for Change Detection with Multiscale Global–Local Representations. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  30. Tang, X.; Zhang, T.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. WNet: W-Shaped Hierarchical Network for Remote-Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  31. Cui, Y.; Chen, H.; Dong, S.; Wang, G.; Zhuang, Y. U-Shaped CNN-ViT Siamese Network with Learnable Mask Guidance for Remote Sensing Building Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11402–11418. [Google Scholar] [CrossRef]
  32. Gao, Y.; Pei, G.; Sheng, M.; Sun, Z.; Chen, T.; Yao, Y. Relating CNN-Transformer Fusion Network for Remote Sensing Change Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  33. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Xuee, R.; Yang, Z.; Chang, H.; et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–22. [Google Scholar] [CrossRef]
  34. Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  35. Hong, D.; Zhang, B.; Li, X.; Li, Y.; Li, C.; Yao, J.; Yokoya, N.; Li, H.; Ghamisi, P.; Jia, X.; et al. SpectralGPT: Spectral Remote Sensing Foundation Model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5227–5244. [Google Scholar] [CrossRef] [PubMed]
  36. Xin, Y.; Luo, S.; Zhou, H.; Du, J.; Liu, X.; Fan, Y.; Li, Q.; Du, Y. Parameter-Efficient Fine-Tuning for Pre-trained Vision Models: A Survey. arXiv 2024, arXiv:2402.02242. [Google Scholar]
  37. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 1 July 2021; pp. 8748–8763. [Google Scholar]
  38. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  39. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  40. Chen, T.; Zhu, L.; Ding, C.; Cao, R.; Wang, Y.; Li, Z.; Sun, L.; Mao, P.; Zang, Y. SAM Fails to Segment Anything?—SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More. arXiv 2023, arXiv:2304.09148. [Google Scholar]
  41. Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4088–4099. [Google Scholar]
  42. Pu, X.; Jia, H.; Zheng, L.; Wang, F.; Xu, F. Classwise-SAM-Adapter: Parameter Efficient Fine-Tuning Adapts Segment Anything to Sar Domain for Semantic Segmentation. arXiv 2024, arXiv:2401.02326. [Google Scholar]
  43. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  44. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  45. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar]
  46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  47. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice Overlap as A Deep Learning Loss Function for Highly Unbalanced Segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, 14 September 2017; pp. 240–248. [Google Scholar]
  48. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  49. Li, Z.; Yan, C.; Sun, Y.; Xin, Q. A Densely Attentive Refinement Network for Change Detection Based on Very-High-Resolution Bitemporal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  50. Wang, S.; Wu, W.; Zheng, Z.; Li, J. CTST: CNN and Transformer-Based Spatio-Temporally Synchronized Network for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16272–16288. [Google Scholar] [CrossRef]
  51. Li, J.; Meng, Y.; Tao, C.; Zhang, Z.; Yang, X.; Wang, Z.; Wang, X.; Li, L.; Zhang, W. ConvFormerSR: Fusing Transformers and Convolutional Neural Networks for Cross-Sensor Remote Sensing Imagery Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  52. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  53. Li, K.; Cao, X.; Meng, D. A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  54. Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; Qiao, Y. Vision Transformer Adapter for Dense Predictions. arXiv 2022, arXiv:2205.08534. [Google Scholar]
  55. Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
  56. Chen, K.; Liu, C.; Li, W.; Liu, Z.; Chen, H.; Zhang, H.; Zou, Z.; Shi, Z. Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 8581–8584. [Google Scholar]
  57. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11999–12009. [Google Scholar]
  58. Li, R.; Wang, X.; Huang, G.; Yang, W.; Zhang, K.; Gu, X.; Tran, S.N.; Garg, S.; Alty, J.; Bai, Q. A Comprehensive Review on Deep Supervision: Theories and Applications. arXiv 2022, arXiv:2207.02376. [Google Scholar]
  59. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
Figure 1. Examples of change detection images with (a) image A captured at time one, (b) image B captured at time two, and (c) the ground truth label.
Figure 2. The overall architecture of the proposed ASS-CD.
Figure 3. The encoder–decoder network based on U-Net for the proposed ASS-CD.
Figure 4. The details of the adapter module.
Figure 5. The details of HDSM.
Figure 6. Visualization results of different methods on the LEVIR-CD dataset, where TP (white), TN (black), FP (red), and FN (green) are distinguished by different colors.
Figure 7. Visualization results of different methods on the WHU-CD dataset, where TP (white), TN (black), FP (red), and FN (green) are distinguished by different colors.
Figure 8. Visualization of the heatmaps from different models on the LEVIR-CD dataset. (a) Image A. (b) Image B. (c) Label. (d) Baseline. (e) Model_a: Baseline + Adapter. (f) Model_b: Baseline + Adapter + CBAM. (g) Model_c: Baseline + Adapter + HDSM. (h) Our ASS-CD: Baseline + Adapter + CBAM + HDSM. Red bounding boxes highlight the region of interest for convenient observation. Pixels closer to red indicate areas where the model has allocated more attention.
Table 1. The evaluation results on the LEVIR-CD dataset. The bold values indicate the best experimental results. All metrics are reported as percentages (%). * denotes the results from the original paper.
Methods | Pre | Rec | F1 | IoU | OA
FC-Siam-conc [16] | 88.79 | 86.57 | 87.67 | 78.04 | 98.75
FC-Siam-diff [16] | 89.25 | 82.62 | 85.81 | 75.14 | 98.46
STANet [20] | 90.68 | 87.70 | 89.17 | 80.45 | 98.91
SNUNet [17] | 91.25 | 85.55 | 88.30 | 79.06 | 98.85
HANet [23] * | 91.21 | 89.36 | 90.28 | 82.27 | 99.02
Changeformer [25] | 91.85 | 87.88 | 89.82 | 81.52 | 98.99
SwinSUNet [26] | 90.76 | 86.92 | 88.80 | 79.85 | 98.88
BIT [28] | 91.74 | 88.25 | 89.96 | 81.76 | 99.00
ConvTransNet [29] | 92.64 | 88.58 | 90.56 | 82.75 | 99.06
WNet [30] * | 91.16 | 90.18 | 90.67 | 82.93 | 99.06
ASS-CD (Ours) | 93.78 | 90.85 | 92.29 | 85.69 | 99.14
Table 2. The evaluation results on the WHU-CD dataset. The bold values indicate the best experimental results. All metrics are reported as percentages (%). * denotes the results from the original paper.
Methods | Pre | Rec | F1 | IoU | OA
FC-Siam-conc [16] | 83.42 | 86.15 | 84.76 | 73.56 | 98.77
FC-Siam-diff [16] | 86.55 | 85.01 | 85.77 | 75.09 | 98.88
STANet [20] | 89.40 | 87.10 | 88.23 | 78.95 | 99.07
SNUNet [17] | 90.25 | 87.46 | 88.83 | 79.91 | 99.09
HANet [23] * | 88.30 | 88.01 | 88.16 | 78.82 | 99.16
Changeformer [25] | 95.35 | 83.83 | 89.22 | 80.54 | 99.19
SwinSUNet [26] | 91.35 | 87.42 | 89.34 | 80.74 | 99.20
BIT [28] | 93.16 | 88.74 | 90.90 | 83.31 | 99.29
ConvTransNet [29] | 92.66 | 91.57 | 92.11 | 85.38 | 99.38
WNet [30] * | 92.37 | 90.15 | 91.25 | 83.91 | 99.31
ASS-CD (Ours) | 93.88 | 92.70 | 93.29 | 87.42 | 99.45
Table 3. The evaluation results on the DSIFN-CD dataset. The bold values indicate the best experimental results. All metrics are reported as percentages (%). * denotes the results from the original paper.
Methods | Pre | Rec | F1 | IoU | OA
FC-Siam-conc [16] * | 66.45 | 54.21 | 59.71 | 42.56 | 87.57
FC-Siam-diff [16] * | 59.67 | 65.71 | 62.54 | 45.50 | 86.63
STANet [20] * | 67.71 | 61.68 | 64.56 | 47.66 | 88.49
SNUNet [17] * | 60.60 | 72.89 | 66.18 | 49.45 | 87.34
HANet [23] * | 56.52 | 70.33 | 62.67 | 45.64 | 85.76
Changeformer [25] | 73.61 | 75.28 | 74.44 | 59.28 | 89.10
SwinSUNet [26] | 67.50 | 71.45 | 69.42 | 53.16 | 88.65
BIT [28] * | 68.36 | 70.18 | 69.26 | 52.97 | 89.41
ConvTransNet [29] | 66.30 | 68.74 | 67.50 | 50.94 | 88.20
WNet [30] | 68.71 | 70.52 | 69.60 | 53.38 | 88.66
ASS-CD (Ours) | 75.20 | 78.35 | 76.74 | 62.26 | 90.29
Table 4. Complexity analysis of different CD methods. The table lists both the trainable and frozen parameters of the proposed ASS-CD, which are presented in bold.
Methods | Params (M) | FLOPs (G)
FC-Siam-conc [16] | 1.55 | 5.33
FC-Siam-diff [16] | 1.35 | 4.73
STANet [20] | 16.93 | 13.16
SNUNet [17] | 12.03 | 54.83
Changeformer [25] | 20.75 | 21.18
BIT [28] | 3.50 | 10.63
ConvTransNet [29] | 7.13 | 30.53
ASS-CD (trainable only) | 5.30 | 12.72
+FastSAM (frozen) | +68
+Swin-Transformer (frozen) | +88
Table 5. The results of the ablation study on the LEVIR-CD and WHU-CD datasets. The bold values indicate the best experimental results. √ means that the module is introduced into the baseline model.
Methods | Adapter | CBAM | HDSM | LEVIR-CD (Pre / Rec / F1 / IoU / OA) | WHU-CD (Pre / Rec / F1 / IoU / OA)
Pure U-Net | – | – | – | 81.29 / 78.68 / 79.96 / 66.62 / 97.96 | 79.55 / 78.10 / 78.82 / 65.04 / 98.13
Baseline | – | – | – | 86.62 / 84.35 / 85.47 / 74.63 / 98.50 | 83.27 / 81.33 / 82.29 / 69.91 / 98.45
Model_a | √ | – | – | 91.34 / 87.55 / 89.40 / 80.84 / 98.95 | 90.81 / 88.67 / 89.73 / 81.37 / 99.17
Model_b | √ | √ | – | 92.55 / 89.25 / 90.87 / 83.27 / 99.10 | 91.90 / 90.10 / 90.99 / 83.47 / 99.28
Model_c | √ | – | √ | 92.67 / 89.70 / 91.16 / 83.76 / 99.11 | 92.54 / 91.05 / 91.79 / 84.82 / 99.36
ASS-CD | √ | √ | √ | 93.78 / 90.85 / 92.29 / 85.69 / 99.14 | 93.88 / 92.70 / 93.29 / 87.42 / 99.45
Table 6. Ablation study on different combinations of the loss functions. The bold values indicate the best experimental results.
Dataset | BCE | Dice | Pre | Rec | F1 | IoU | OA
LEVIR-CD | √ | – | 92.55 | 89.25 | 90.87 | 83.27 | 99.10
LEVIR-CD | – | √ | 93.13 | 88.87 | 90.95 | 83.40 | 99.10
LEVIR-CD | √ | √ | 93.78 | 90.85 | 92.29 | 85.69 | 99.14
WHU-CD | √ | – | 91.90 | 90.10 | 90.99 | 83.47 | 99.28
WHU-CD | – | √ | 92.60 | 89.76 | 91.16 | 83.75 | 99.30
WHU-CD | √ | √ | 93.88 | 92.70 | 93.29 | 87.42 | 99.45
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
