SMBCNet: A Transformer-Based Approach for Change Detection in Remote Sensing Images through Semantic Segmentation

Feng, Jiangfan; Yang, Xinyu; Gu, Zhujun; Zeng, Maimai; Zheng, Wei

doi:10.3390/rs15143566


by Jiangfan Feng ¹,†, Xinyu Yang ¹,*,†, Zhujun Gu ², Maimai Zeng ² and Wei Zheng ¹

¹ School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

² Pearl River Water Resources Research Institute, Pearl River Water Resources Commission, Guangzhou 510610, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2023, 15(14), 3566; https://doi.org/10.3390/rs15143566

Submission received: 7 June 2023 / Revised: 11 July 2023 / Accepted: 13 July 2023 / Published: 16 July 2023

(This article belongs to the Special Issue Convolutional Neural Network Applications in Remote Sensing II)


Abstract

Remote sensing change detection (RSCD) is crucial for our understanding of the dynamic pattern of the Earth’s surface and human influence. Recently, transformer-based methodologies have advanced RSCD thanks to their powerful global modeling capabilities. Nevertheless, they remain over-parameterized, which leaves them severely constrained by time and computational resources. Here, we present a transformer-based RSCD model called the Segmentation Multi-Branch Change Detection Network (SMBCNet). Our proposed approach combines a hierarchically structured transformer encoder with a cross-scale enhancement module (CEM) to extract global information with lower complexity. To account for the diverse nature of changes, we introduce a plug-and-play multi-branch change fusion module (MCFM) that integrates temporal features. Within this module, we transform the change detection task into a semantic segmentation problem. Moreover, we introduce a temporal feature aggregation module (TFAM) to facilitate integrating features from diverse spatial scales. Our results demonstrate that semantic segmentation is an effective solution to change detection (CD) problems in remote sensing images.

Keywords: change detection; remote sensing image; semantic segmentation; transformer

1. Introduction

In recent years, there has been a significant increase in the use of remote sensing-based approaches for monitoring land use changes, particularly in the context of human-induced land-use changes [1]. Effective land resource management and sustainable urban development rely on accurate quantification of these changes. Change Detection (CD) techniques aim to identify and analyze changes in images captured at different time points over the same geographical area. This has led to a growing interest in RSCD technology, which involves classifying each pixel as a changed or unchanged location [2,3,4].

Traditional RSCD methods can be classified into pixel-based change detection (PBCD) and object-based change detection (OBCD) [5,6,7,8]. However, these methods face challenges due to various factors such as atmospheric conditions, seasonal variations, sensor differences, and subjective definitions of change. These factors have led to two major challenges: missed detections and pseudo-changes. Consequently, deep learning architectures, specifically convolutional neural networks (CNNs) [9], have emerged as powerful tools for extracting discriminative features in RSCD [10,11,12,13]. However, CNN-based methods still struggle to accurately capture edge features and handle pseudo-change information in their prediction results [14]. Additionally, the complexity of remote sensing images (RSIs), with their diverse colors, sizes, rotations, and spatial distributions, makes traditional CNN encoders prone to the same missed detections and pseudo-changes [15,16,17].

To overcome these limitations, researchers have explored the use of transformers in RSCD [18,19,20,21,22]. Transformers [23] have demonstrated a remarkable semantic representation capability and have shown themselves to be effective in various image processing tasks [24,25]. By incorporating self-attention mechanisms, transformers can model spatial–temporal relationships and capture long-range dependencies for improved change prediction results. However, existing transformer-based methods often involve a significant number of model parameters and entail substantial computational overhead. Additionally, most researchers design custom networks from scratch to predict change masks and face challenges in general segmentation tasks, such as handling ground objects at varying scales and improving mask details, and they fail to capture the interconversion relationship between RSCD and semantic segmentation. Therefore, there is a need to develop transformer-based RSCD methods that can efficiently learn global dependencies and capture local details while reducing computational costs. By accomplishing this, we can make models more correctable and ultimately practical for broader RSCD tasks.

Here, we address these issues by reducing RSCD to semantic segmentation, which means tailoring a powerful semantic segmentation network to solve CD. This new paradigm, called the Segmentation Multi-Branch Change Detection Network (SMBCNet), leverages mainstream semantic segmentation techniques to tackle general problems in RSCD. First, we replace convolutional operations with transformer blocks, which generate multi-scale features using a hierarchically structured transformer encoder. To enhance feature extraction, we incorporate a cross-scale enhancement module (CEM) that enriches semantic information and captures fine-grained object details. Second, our approach emphasizes the importance of incorporating information related to diverse change types into the fusion feature. Therefore, a multi-branch change fusion module (MCFM) is introduced to classify changes into three types (“Appear”, “Disappear”, and “Exchange” in Figure 1), with each type learned separately. The MCFM integrates various change forms into the change detection task using a multi-branch structure. It efficiently preserves semantic information related to altered areas within a pair of multi-temporal images while effectively filtering out background information from regions that remain unchanged. This procedure converts the change detection problem into a binary semantic segmentation task, where pixel values of 0 and 1 indicate unchanged and changed regions, respectively. This shift allows us to leverage advanced semantic segmentation networks to address the RSCD task, leading to improved information extraction and enhanced accuracy. Additionally, we employ a temporal feature aggregation module (TFAM) to efficiently recalibrate multilevel features and enable progressive learning. SMBCNet achieves a larger effective receptive field (ERF) compared to traditional CNN encoders and strikes a better balance between accuracy and model size in RSCD tasks.

Our contributions in this study are:

We propose SMBCNet, a transformer-based network for RSCD that incorporates mainstream semantic segmentation techniques to address various challenges in this field. SMBCNet outperforms various types of previous approaches, achieving superior performance while having a smaller parameter size and computational complexity.
We introduce the MCFM, which classifies changes into three types and enhances the responsiveness of the network towards change regions or objects. Unlike previous fusion approaches, the MCFM not only provides enhanced interpretability but also adeptly captures the inherent features of RSCD.
We perform ablation experiments and comparative experiments on the WHU-CD and LEVIR-CD datasets to demonstrate the effectiveness of our proposed method. Our results show that the transformer backbone outperforms CNN-based backbones in RSCD tasks, even with relatively fewer parameters. This advantage stems from the transformer’s larger receptive field, which enables stronger model characterization capabilities with lower computational resources.

2. Related Work

2.1. Remote Sensing Change Detection with CNN

In previous work, it was demonstrated that CNN-based methods can achieve superior accuracy compared to traditional methods for remote sensing change detection. Specifically, Daudt et al. [13] proposed a method that combines the benefits of both a fully convolutional neural network (FCN) and a Siamese architecture, and used skip connections to enhance the spatial accuracy of the results. UNet++ [11] incorporates a residual block strategy, which captures more detailed information, and combines the weighted binary cross-entropy loss with the dice-coefficient loss to mitigate the imbalance between changed and unchanged objects. Zhang [26] utilized the strengths of the DeepLabv2 [27] model and extended it to RSCD tasks by adopting atrous spatial pyramid pooling (ASPP) and atrous convolution operations. Liu [28] employed depth-wise separable convolution to lighten the FCN and improve its performance compared to the original FCN. Collectively, these studies have paved the way for exploring the combination of CNN-based methodologies with RSCD. However, the fixed size of the convolutional kernel limits CNN-based approaches, as remote sensing images often require a larger receptive field [14]. This limitation makes it challenging to capture more comprehensive spatial and contextual information. Therefore, various studies have attempted to address this limitation through the use of self-attention mechanisms, such as transformers, which are particularly effective in modeling long-range dependencies for RSCD.

2.2. Remote Sensing Change Detection with Attention Mechanisms

To address the limitations of the fixed receptive field in CNN-based RSCD methods, attention mechanisms have been introduced to extend the receptive field and improve the accuracy of change detection in remote sensing images (RSIs). One approach proposed by Chen [19] is the use of a spatial–temporal attention module, which computes attention weights between any two pixels at different times and positions, effectively enhancing the discriminative features. Another method, DASNet [18], captures long-range dependencies to obtain more discriminative feature representations, thereby strengthening the recognition performance of the model. A novel approach, TFI-GR [29], leverages temporal feature interaction and guided refinement to locate changes in RSIs. However, the incorporation of self-attention mechanisms tends to increase the number of network parameters significantly. Furthermore, while attention mechanisms can address some of the limitations of CNNs, their ability to capture large receptive fields remains limited.

2.3. Remote Sensing Change Detection with a Transformer

Building on the pioneering ideas of ViT [30], transformers have achieved remarkable success in natural image processing and have also brought new solutions to RSCD research. Recently, some researchers have incorporated transformer-based methodologies into RSCD, providing novel research ideas. For example, Chen [31] leveraged a transformer encoder to compactly model the context of a bi-temporal image using only a few tokens. SwinSUNet [21] is the first pure transformer network to use a Siamese U-shaped structure for the CD task, with Swin transformer blocks [25] as the basic units for global feature extraction. ChangeFormer [22] efficiently utilizes SegFormer blocks to capture multi-scale, long-range details required for accurate change detection. However, these transformer-based RSCD methods may either integrate transformer encoder–decoder blocks with a CNN backbone to fuse and enhance multi-scale features, or they may have too many parameters for practical use in real-world applications.

3. Methodology

In this section, we will start by providing an overview of the proposed approach. Then, we will provide detailed information about the encoder, CEM, MCFM, TFAM, and decoder modules. Lastly, we will elaborate on the hybrid loss function that is integrated into our method.

3.1. Overview

The proposed network is based on a Siamese encoder–decoder architecture, depicted in Figure 2. The primary distinction between our network and earlier models is that we divide change detection into feature fusion and semantic segmentation, utilizing a hierarchical encoder–decoder structure. Firstly, a bi-temporal image pair of size $H \times W \times 3$ is provided, and a transformer encoder generates four hierarchical feature maps $P_{i}$. Following this, all feature maps pass through CEM to enhance the representational capacity of bi-temporal features. Lastly, an effective decoder performs top-down multilevel feature fusion, allowing for change maps’ prediction with superior details.

3.2. Transformer Encoder

Our proposed transformer encoder comprises two critical components: transformer blocks for feature extraction and the Cross-scale Enhancement Module (CEM) for improving temporal features.

Transformer blocks. As illustrated in Figure 3, the hierarchical transformer encoder produces multi-level features that encompass both high-resolution coarse features and low-resolution fine-grained features, which are essential for detecting changes. This approach effectively minimizes redundancies and leads to the extraction of more cohesive and coherent RSI object details. Specifically, given a pre-change and post-change image pair with the same $H \times W \times 3$ resolution, the transformer encoder generates hierarchically structured feature maps $p_{i_{j}} \in R^{\frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}} \times C_{i}}$, where $i \in \{1, 2, 3, 4\}$ indexes the encoder layer and $j \in \{0, 1\}$ indicates pre- or post-change features. The channel counts $C_{i}$ differ across layers, with $C_{i + 1} > C_{i}$; we set $C_{i}$ to 32, 64, 160, and 256 in sequence. The number of times a transformer operation is repeated in a transformer block is represented by $D_{i}$. Thus, $C_{i}$ and $D_{i}$ together determine the size of our encoder.
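As a quick check of the feature-map sizes implied by this definition, the sketch below lists the shape of each encoder stage for a given input size. The helper name `encoder_feature_shapes` is hypothetical, and the sketch assumes that the input height and width are divisible by 32; only the strides $2^{i+1}$ and the channel counts 32, 64, 160, 256 come from the text.

```python
# Channel counts C_i for stages i = 1..4, as stated in the text.
CHANNELS = (32, 64, 160, 256)


def encoder_feature_shapes(height, width, channels=CHANNELS):
    """Return (H / 2^(i+1), W / 2^(i+1), C_i) for stages i = 1..len(channels).

    Hypothetical helper for illustration only; it assumes the input size is
    divisible by every stage stride.
    """
    shapes = []
    for i, c in enumerate(channels, start=1):
        stride = 2 ** (i + 1)
        if height % stride or width % stride:
            raise ValueError(f"input size must be divisible by {stride}")
        shapes.append((height // stride, width // stride, c))
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(encoder_feature_shapes(256, 256), start=1):
        print(f"stage {stage}: {shape}")
```

For a 256 × 256 input, the four stages have spatial sizes 64, 32, 16, and 8, which correspond to the 1/4, 1/8, 1/16, and 1/32 scales used later in the decoder.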

To minimize the number of parameters while retaining the transformer’s essential feature extraction capabilities, we replace conventional self-attention with a more efficient self-attention mechanism. As described in the seminal work by Vaswani et al. [23], the self-attention procedure can be defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{1}$$

where the matrices $Q$, $K$, and $V$ correspond to Query, Key, and Value, respectively. As described in [23], these three matrices have identical dimensions of $N \times C$, where $N = H \times W$ represents the sequence’s length. The $O(N^{2})$ computational complexity of self-attention makes it unsuitable for deployment on high-resolution RSIs. To tackle this challenge, we adopt a reduction strategy [24] that shortens the sequence by introducing a reduction ratio, $R$, as follows:

$$\hat{D} = \mathrm{Reshape}\left(\frac{N}{R}, C \cdot R\right)(D) \tag{2}$$

$$D = \mathrm{Linear}(C \cdot R, C)(\hat{D}) \tag{3}$$

where $D$ represents the input sequence, consisting of $Q$, $K$, and $V$. The $\mathrm{Reshape}(\frac{N}{R}, C \cdot R)(D)$ operation reshapes $D$ into a new tensor with dimensions $\frac{N}{R} \times (C \cdot R)$. $\mathrm{Linear}(C_{in}, C_{out})(\hat{D})$ denotes a linear layer that transforms a $C_{in}$-dimensional tensor $\hat{D}$ into a $C_{out}$-dimensional tensor. This approach reduces the complexity of the self-attention mechanism from $O(N^{2})$ to $O(\frac{N^{2}}{R})$ by minimizing redundant computations, and it also maintains the ability to extract features from RSIs.
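To make the saving concrete, the following sketch counts the entries of the attention score matrix with and without sequence reduction. It assumes, as in SegFormer [24], that the reduction shortens the key/value sequence to $N/R$ while the queries keep length $N$; the function name and the example value $R = 16$ are illustrative choices, not settings reported here.

```python
def attention_score_entries(seq_len, reduction=1):
    """Number of entries in the Q K^T score matrix.

    Assumption: queries keep length N, while keys are shortened to N / R by
    the reshape-plus-linear step, so the matrix has N x (N / R) entries.
    """
    if seq_len % reduction:
        raise ValueError("sequence length must be divisible by the reduction ratio")
    return seq_len * (seq_len // reduction)


if __name__ == "__main__":
    n = 64 * 64  # sequence length of a 64 x 64 feature map
    full = attention_score_entries(n)
    reduced = attention_score_entries(n, reduction=16)  # R = 16 is hypothetical
    print(full, reduced, full // reduced)
```

The ratio between the two counts equals $R$, which matches the stated drop from $O(N^{2})$ to $O(N^{2}/R)$.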

To downsize the feature maps of the hierarchical transformer encoder, we apply an overlapped patch merging strategy that collapses, for example, $F_{i}$ ($\frac{H}{4} \times \frac{W}{4} \times C_{i}$) to $F_{j}$ ($\frac{H}{8} \times \frac{W}{8} \times C_{j}$) in size. To maintain local consistency across different patches, we define the patch size, $K$, the stride between adjacent patches, $S$, and the padding size, $Pa$. Specifically, we use two settings for the overlapped patch merging: $K = 7$, $S = 4$, $Pa = 3$ and $K = 3$, $S = 2$, $Pa = 1$. Because $K > S$ in both settings, adjacent patches overlap, which yields features of the same size as a non-overlapping process while maintaining local continuity across different patches.
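The output sizes produced by the two $(K, S, Pa)$ settings can be checked with the standard output-size formula for a strided convolution. This sketch assumes that patch merging is implemented as such a convolution, as in SegFormer [24]; the helper name `merged_size` is hypothetical.

```python
def merged_size(size, kernel, stride, padding):
    """Spatial size after a strided convolution: floor((size + 2*Pa - K) / S) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    h = 256
    h4 = merged_size(h, kernel=7, stride=4, padding=3)   # 4x downsampling
    h8 = merged_size(h4, kernel=3, stride=2, padding=1)  # further 2x downsampling
    print(h4, h8)
```

With a 256-pixel side, the first setting gives 64 and the second gives 32, the same sizes that non-overlapping 4 × 4 and 2 × 2 patches would produce.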

To enable transformer models to effectively incorporate positional information, our approach uses two MLP layers coupled with $3 \times 3$ depth-wise convolutions. This enables the model to capture the positional relationships required for good performance. The process can be described as:

$$X_{out} = \mathrm{MLP}(\mathrm{GELU}(\mathrm{Conv}_{3 \times 3}(\mathrm{MLP}(X_{in})))) + X_{in} \tag{4}$$

where $X_{in}$ is the feature from the self-attention module.

Cross-scale Enhancement Module (CEM). Multilevel features play a crucial role in object recognition, as they provide both detailed and semantic information about objects. High-level features are responsible for locating objects and contain semantic information, while low-level features provide finer-grained boundary and texture information. By incorporating multiple levels of features, we can strengthen the capacity of the extracted temporal features to represent information. This enables us to capture more nuanced details from lower-level features and derive semantic insights from higher-level features, thereby enhancing the overall representation capability. We propose the Cross-scale Enhancement Module (CEM) for this purpose, as shown in Figure 4.

CEM enhances features by combining features from adjacent stages. As shown in Figure 3, we use a residual learning scheme to perform feature fusion operations on feature maps $\{p_{0_{j}}, p_{1_{j}}\}$, $\{p_{0_{j}}, p_{1_{j}}, p_{2_{j}}\}$, $\{p_{1_{j}}, p_{2_{j}}, p_{3_{j}}\}$, and $\{p_{2_{j}}, p_{3_{j}}\}$ separately, where $j \in \{0, 1\}$ denotes pre-change and post-change. For example, let us consider the features extracted from the $t_{0}$ image.

Figure 4 shows the processing steps employed in CEM. For the top feature $p_{1_{0}}$, we use a residual branch that downsamples it to match the resolution of $p_{2_{0}}$ while reducing its number of channels. Next, we apply a $3 \times 3$ convolutional operation to adjust the channel numbers of the mid feature layer $p_{2_{0}}$. Similarly, we process the lower-level feature $p_{3_{0}}$ with a $3 \times 3$ convolutional layer, combined with normalization and a ReLU activation function, and upscale it in both height and width by a factor of two using linear interpolation. Finally, we concatenate the results obtained from the three branches to obtain the reinforced feature maps. These processes can be mathematically formulated as follows:

$$\begin{aligned} d_{0}^{top} &= \mathrm{Conv}_{3 \times 3}(p_{1_{0}}) + \mathrm{SP}(p_{1_{0}}) \\ d_{0}^{mid} &= \mathrm{Conv}_{3 \times 3}(p_{2_{0}}) \\ d_{0}^{bott} &= \mathrm{Upsampling}(\mathrm{Conv}_{3 \times 3}(p_{3_{0}})) \\ d_{0} &= \mathrm{Conv}_{1 \times 1}(\mathrm{Cat}(d_{0}^{top}, d_{0}^{mid}, d_{0}^{bott})) \end{aligned} \tag{5}$$

where $\mathrm{SP}(\cdot)$ denotes a stochastic pooling operation, $\mathrm{Conv}_{3 \times 3}$ means a convolutional layer with a kernel size of $3 \times 3$, and $\mathrm{Cat}$ represents the concatenation operation. $d_{0}$ denotes the enhanced pre-change feature map.

3.3. Multi-Branch Change Fusion Module (MCFM)

To effectively account for the diverse nature of changes in RSCD, we have divided the variations in RSCD into three categories: “Appear”, “Disappear”, and “Exchange”. As illustrated in Figure 1, “Appear” indicates the presence of an object solely in $t_{1}$; “Disappear” denotes the presence of an object solely in $t_{0}$; and “Exchange” signifies the differences between objects in $t_{0}$ and $t_{1}$ in the same location. We argue that these three changes cover the vast majority of change requirements in RSCD, and the network should model these three different change types separately.

Figure 5 explains our proposed MCFM. The upper three branches utilize the subtraction operation to capture the three distinct types of changes, while the “Distance” branch is a widely used method in RSCD to generate different feature maps and has been proven to have the ability to further enhance the network’s detection of the changed regions [19,21]. We employ channel self-attention instead of spatial self-attention in the “Exchange” branch for the following reason: in the “Appear” and “Disappear” branches, there is a transition from “0” to “1” or from “1” to “0”, respectively, as indicated by the green and yellow boxes shown in Figure 1. The spatial information of the changing objects effectively captures the differences between these states. However, in the “Exchange” branch, the transition occurs between two instances of “1”, as depicted by the blue box in Figure 1 (two different cars in the same location). Therefore, it is more pertinent to focus on the channel information changes of the feature map in the “Exchange” branch.

To provide further elaboration, the encoder outputs, $p_{i_{0}}$ and $p_{i_{1}}$, have dimensions of $\frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}} \times C_{i}$. Specifically, $i_{0}$ denotes the $i$-th layer of pre-change features, whereas $i_{1}$ represents the $i$-th layer of post-change features obtained through the encoder process. The four branches responsible for extracting change regions are described as follows:

$$C_{Appear} = \mathrm{SA}(\mathrm{ReLU}(p_{i_{0}} - p_{i_{1}})) \tag{6}$$

$$C_{Disappear} = \mathrm{SA}(\mathrm{ReLU}(p_{i_{1}} - p_{i_{0}})) \tag{7}$$

$$C_{Exchange} = \mathrm{CA}(\mathrm{ReLU}(\max(p_{i_{0}}, p_{i_{1}}) - \min(p_{i_{0}}, p_{i_{1}}))) \tag{8}$$

$$C_{Distance} = \mathrm{BN}(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(\mathrm{Cat}(p_{i_{0}}, p_{i_{1}})))) \tag{9}$$

where $\mathrm{BN}$ refers to batch normalization, $\mathrm{SA}$ represents spatial attention [32], and $\mathrm{CA}$ stands for channel attention [32]. As illustrated in Figure 5, spatial attention is utilized to enhance the network’s sensitivity to regions where positional information plays a crucial role in detecting changes. Similarly, channel attention facilitates reinforcement and refinement of the change regions by enabling interactions between corresponding feature maps. Moreover, merging the two feature maps across their channel dimension broadens the coverage of change regions. This strategy effectively integrates information from both pre-change and post-change feature maps, thereby enhancing the representation of change patterns. To further consolidate the change branches, a branch-fuse operation involving a $1 \times 1$ convolutional layer is applied after the activation map. This fusion process yields enhanced multi-branch change fusion features, facilitating a more comprehensive representation of changes. Overall, these operations collectively contribute to improved change detection performance within the proposed framework. The calculation formula for the change fusing process is shown below:

$$m_{i} = \mathrm{Conv}_{1 \times 1}(\mathrm{CAT}(C_{Appear} + C_{Disappear} + C_{Exchange}, C_{Distance})) \tag{10}$$

where $m_{i}$ denotes the output of the $i$-th layer, $\mathrm{Conv}_{1 \times 1}$ means a $1 \times 1$ convolutional layer, and $\mathrm{CAT}$ means the concatenation operation along the channel dimension. Notably, the feature channel numbers do not change through the MCFM.
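The raw difference terms inside Equations (6)–(8) can be illustrated on toy values. The sketch below follows the equations as written and deliberately omits the spatial attention, channel attention, and convolution steps, so it is not the MCFM itself; the function name and the flat-list representation of a feature map are assumptions made for illustration.

```python
def relu(x):
    """Rectified linear unit for a single value."""
    return x if x > 0 else 0.0


def change_differences(pre, post):
    """Difference terms of Eqs. (6)-(8) before any attention is applied.

    pre, post: flat lists of feature values for p_{i_0} and p_{i_1}.
    """
    appear = [relu(a - b) for a, b in zip(pre, post)]            # inner term of Eq. (6)
    disappear = [relu(b - a) for a, b in zip(pre, post)]         # inner term of Eq. (7)
    exchange = [max(a, b) - min(a, b) for a, b in zip(pre, post)]  # inner term of Eq. (8)
    return appear, disappear, exchange


if __name__ == "__main__":
    pre = [0.25, 1.0, 0.5]
    post = [0.75, 0.25, 0.5]
    for name, values in zip(("appear", "disappear", "exchange"),
                            change_differences(pre, post)):
        print(name, values)
```

Since $\max(a, b) - \min(a, b) = |a - b|$, the third term equals the element-wise sum of the first two before attention. In this simplified view, the branches differ through the spatial or channel attention applied afterwards.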

3.4. Decoder

The decoder is responsible for further multi-size fusion of the fused change region features output from the MCFM module, and a simple but effective decoder is used to generate the final change map.

Temporal Feature Aggregation Module (TFAM). Integrating diverse spatial features has been shown to be an effective strategy for addressing multi-scaled change objects [33]. To this end, we propose a simple yet effective temporal feature aggregation module to merge spatial change features derived from MCFM. Our approach is inspired by the success of the Feature Pyramid Network (FPN) [34], and we utilize a simplified FPN to merge spatial features. Using a top-down pathway strategy, we compute a feature hierarchy consisting of feature maps at different scales, merging the semantic features of the higher level with those of the lower layer through upsampling, as illustrated in Figure 6. With four different scale features ($\frac{1}{4}$, $\frac{1}{8}$, $\frac{1}{16}$, and $\frac{1}{32}$), all feature maps undergo further processing with a $3 \times 3$ convolutional layer before the downscaled feature maps are subjected to $2\times$ bilinear upsampling and merged with the original lower-level feature map via element-wise addition. This simple architecture allows for the fusion of multi-scale feature maps with a minor increase in parameter count, thereby improving the accuracy of the final prediction results.
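The top-down pathway can be sketched on single-channel maps stored as nested lists. This is a simplified illustration rather than the TFAM itself: it uses nearest-neighbour upsampling in place of bilinear upsampling, skips the $3 \times 3$ convolutions, and all function names are hypothetical.

```python
def upsample2x_nearest(fmap):
    """Repeat every value twice along both axes (nearest-neighbour, not bilinear)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise addition of two equally sized 2-D maps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same size")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(pyramid):
    """Fuse maps ordered fine -> coarse, each half the size of the previous one.

    The coarsest map is kept as is; every finer map receives the upsampled
    fused map from the level above it.
    """
    fused = [pyramid[-1]]
    for fmap in reversed(pyramid[:-1]):
        fused.append(add_maps(fmap, upsample2x_nearest(fused[-1])))
    return fused[::-1]


if __name__ == "__main__":
    pyramid = [[[1] * 4 for _ in range(4)], [[10] * 2 for _ in range(2)], [[100]]]
    for level in top_down_fuse(pyramid):
        print(level)
```

Each finer level accumulates the information of all coarser levels, which is the property the top-down pathway relies on.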

MLP Lightweight Decoder. After constructing the final feature map, we employ a simple All-MLP decoder, as shown in Figure 7, inspired by SegFormer [24]. The primary aim of the decoder is to reduce parameters while maintaining a powerful decoding ability. Firstly, we use a linear function to adjust the channel dimension of the feature map to $C_{ebd}$. This makes it possible to control the overall parameter size of the decoder by changing the value of $C_{ebd}$. In the next step, the feature maps are upsampled to $\frac{1}{4}$ of their original resolution and concatenated to allow for the integration of multi-level information. Similar to the previous step, we employ a linear function to decrease the number of channels in the concatenated feature map. This helps to reduce parameters without compromising model performance. Finally, we use another MLP to generate the segmentation mask, which is upsampled to a resolution of $H \times W$, giving us the final change map. This process enables the integration of multi-level features and facilitates accurate classification of the input data. We can express the decoder as:

$$\begin{aligned} \hat{F}_{i} &= \mathrm{Linear}(C_{i}, C_{ebd})(F_{i}), \forall i \\ \hat{F}_{i} &= \mathrm{Upsample}\left(\frac{W}{4} \times \frac{H}{4}\right)(\hat{F}_{i}), \forall i \\ F &= \mathrm{Linear}(4 C_{ebd}, C_{ebd})(\mathrm{Concat}(\hat{F}_{i})), \forall i \\ M_{out} &= \mathrm{Linear}(C_{ebd}, N_{cls})(F), \\ Mask &= \mathrm{Sigmoid}(\mathrm{Upsample}(W \times H)(M_{out})) \end{aligned} \tag{11}$$

where $F_{i}$ is the $i$-th feature map entering the decoder, $F$ is the fused feature, and $M_{out}$ and $Mask$ refer to the predicted mask before and after upsampling, respectively. $\mathrm{Linear}(C_{in}, C_{out})(\cdot)$ refers to a linear function that maps the number of channels of its input from $C_{in}$ to $C_{out}$ dimensions. $C_{ebd}$ is the number of channels of the feature after the MLP, and its value determines the scale of our model.
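To see how $C_{ebd}$ controls the decoder size, the sketch below counts the weights of the linear layers in Equation (11). It assumes plain linear layers with biases and no normalization layers, and the embedding sizes 256 and 128 as well as $N_{cls} = 1$ are hypothetical values chosen for illustration; the input channel counts 32, 64, 160, 256 follow from the encoder, since the MCFM does not change channel numbers.

```python
def mlp_decoder_params(in_channels, embed_dim, num_classes, bias=True):
    """Parameter count of the linear layers in an All-MLP decoder.

    Assumption: one Linear(C_i, C_ebd) per level, one fusing
    Linear(len(in_channels) * C_ebd, C_ebd), and one Linear(C_ebd, N_cls).
    """
    b = 1 if bias else 0
    per_level = sum(c * embed_dim + b * embed_dim for c in in_channels)
    fuse = len(in_channels) * embed_dim * embed_dim + b * embed_dim
    head = embed_dim * num_classes + b * num_classes
    return per_level + fuse + head


if __name__ == "__main__":
    channels = (32, 64, 160, 256)
    for c_ebd in (256, 128):  # hypothetical embedding sizes
        print(c_ebd, mlp_decoder_params(channels, c_ebd, num_classes=1))
```

The fusing layer grows quadratically with $C_{ebd}$, so halving the embedding size reduces this count by roughly a factor of three under these assumptions.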

3.5. Hybrid Loss Function

Change detection tasks are fundamentally binary classification problems. However, the highly imbalanced proportion of changed and unchanged regions in the input data can have a significant deleterious effect on the model’s performance. To address this issue and guide the network to learn from complex scenes, we propose a hybrid loss function that consists of two parts: binary cross-entropy (BCE) loss and dice (Dice) loss [35]. The BCE loss is formulated as

$$L_{bce}(\hat{y}, y) = -\left(y \cdot \log \hat{y} + (1 - y) \cdot \log(1 - \hat{y})\right) \tag{12}$$

where $\cdot$ denotes a dot-product operation, and $y$ and $\hat{y}$ are the ground truth and the corresponding predicted mask, respectively. The Dice loss can be formulated as

$$Dice = \frac{2 \cdot y \cdot \hat{y}}{\|y\| + \|\hat{y}\|} \tag{13}$$

$$L_{dice}(\hat{y}, y) = 1 - Dice \tag{14}$$

where $\|\cdot\|$ denotes the $\ell_{1}$ norm. Then, the total loss is represented as

$$L(\hat{y}, y) = \lambda_{1} L_{bce}(\hat{y}, y) + \lambda_{2} L_{dice}(\hat{y}, y) \tag{15}$$

where $\lambda_{1}$ and $\lambda_{2}$ are the weights of each loss function, and we set them to 0.4 and 0.6, respectively.
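A minimal pure-Python sketch of the weighted BCE plus Dice loss is given below. The mean reduction over pixels, the clamping constant `eps`, and the function names are assumptions made for numerical safety and illustration; only the two loss terms and the weights 0.4 and 0.6 come from the text.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy with the conventional leading minus sign."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, eps=1e-7):
    """1 - Dice, with l1 norms in the denominator."""
    inter = sum(p * y for p, y in zip(pred, target))
    denom = sum(abs(p) for p in pred) + sum(abs(y) for y in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def hybrid_loss(pred, target, w_bce=0.4, w_dice=0.6):
    """Weighted sum of the BCE and Dice terms."""
    return w_bce * bce_loss(pred, target) + w_dice * dice_loss(pred, target)


if __name__ == "__main__":
    target = [1.0, 0.0, 1.0, 0.0]
    print(hybrid_loss([0.9, 0.1, 0.8, 0.2], target))
    print(hybrid_loss([0.5, 0.5, 0.5, 0.5], target))
```

The Dice term depends on the overlap with the changed class rather than on the total number of pixels, which is why it is commonly paired with BCE when changed pixels are rare.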

4. Experiment

4.1. Dataset

To verify our proposed method’s performance, we used two publicly available CD datasets named LEVIR-CD and WHU-CD. The detailed information is listed as follows.

The LEVIR-CD dataset [19] is a binary CD dataset comprising 637 pairs of very-high-resolution (VHR) image patches. Each patch has a size of $1024 \times 1024$ pixels, with a resolution of approximately 0.5 m/pixel. These image pairs were derived from Google Earth global images of Texas, spanning the years 2002 to 2018. In order to conduct our experiment, we cropped non-overlapping patches of size $256 \times 256$ and randomly split them into three parts: 70% for training, 10% for validation, and 20% for testing. Finally, we obtained 7120/1024/2048 image pairs for train/val/test, respectively.

The WHU-CD dataset [36] contains just one pair of images, with a size of 32,507 × 15,354 pixels, cropped from a wider geographic area. This dataset consists of aerial images obtained in April 2012 that contain 12,796 buildings in 20.5 km² (16,077 buildings in the same area in the 2016 dataset) with 1.6-pixel accuracy. Following [37], we cropped the original image pairs in a non-overlapping manner, and after cropping, we formed 7434 small images of size 256 × 256. After that, we randomly divided all the images into training, validation, and test sets with ratios of 70%, 10%, and 20%, respectively. Finally, we obtained 5203/743/1488 image pairs for train, val, and test, respectively.
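The reported patch counts can be reproduced with simple arithmetic. The sketch assumes that incomplete tiles at the image border are discarded when cropping; the helper name is hypothetical.

```python
def non_overlapping_patches(height, width, patch=256):
    """Number of full patch x patch tiles that fit into a height x width image."""
    return (height // patch) * (width // patch)


if __name__ == "__main__":
    # LEVIR-CD: 637 pairs of 1024 x 1024 images.
    levir = 637 * non_overlapping_patches(1024, 1024)
    print(levir, 7120 + 1024 + 2048)

    # WHU-CD: one pair of 32,507 x 15,354 images.
    whu = non_overlapping_patches(32507, 15354)
    print(whu, 5203 + 743 + 1488)
```

Both totals agree with the sum of the reported train/val/test splits, 10,192 pairs for LEVIR-CD and 7434 pairs for WHU-CD.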

4.2. Metrics

We calculated six widely used metrics to evaluate the performance of the proposed method [38,39]: precision (Pre), recall (Rec), F1-score (F1), overall accuracy (OA), the $\kappa$ coefficient, and Intersection-over-Union (IoU). Among these indicators, the F1-score is the most important one. Higher Pre and Rec indicate fewer false detections and fewer omissions, respectively. The larger their values, the better the prediction results. The calculation formulas for the six metrics are as follows:

$$Pre = \frac{TP}{FP + TP} \tag{16}$$

$$Rec = \frac{TP}{TP + FN} \tag{17}$$

$$F1 = \frac{2}{Pre^{-1} + Rec^{-1}} \tag{18}$$

$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{19}$$

$$P_{e} = \frac{(TP + FN) \cdot (TP + FP) + (TN + FP) \cdot (TN + FN)}{(TP + FP + TN + FN)^{2}} \tag{20}$$

$$\kappa = \frac{OA - P_{e}}{1 - P_{e}} \tag{21}$$

$$IoU = \frac{TP}{TP + FP + FN} \tag{22}$$

where $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
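The six metrics can be computed directly from the four confusion counts, as in the sketch below. It uses the standard chance-agreement term of Cohen’s kappa for $P_{e}$ and assumes that none of the denominators is zero; the function name and the example counts are illustrative.

```python
def change_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, OA, kappa, and IoU from confusion counts."""
    total = tp + fp + tn + fn
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 / (1 / pre + 1 / rec)
    oa = (tp + tn) / total
    # Chance agreement: P(actual pos) * P(pred pos) + P(actual neg) * P(pred neg).
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    iou = tp / (tp + fp + fn)
    return {"Pre": pre, "Rec": rec, "F1": f1, "OA": oa, "Kappa": kappa, "IoU": iou}


if __name__ == "__main__":
    for name, value in change_metrics(tp=40, fp=10, tn=40, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Because IoU counts both error types in its denominator without the doubled true-positive term of F1, it is never larger than F1 for the same predictions.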

4.3. Training Details

In our experiments, we implemented our model using PyTorch and trained it on an NVIDIA RTX 3090 Ti GPU. The backbone is initialized with parameters from the MiT-B0 model [24] pretrained on ImageNet-1K, while the remaining parts are randomly initialized. We utilized data augmentation techniques, including random flipping, random rescaling (0.8–1.2), and random temporal exchange. The AdamW [40] optimizer was applied to optimize the loss function with a weight decay of 0.0001 and beta values of (0.9, 0.99). The learning rate is initially set to 0.0005 and decays linearly to 0 by the final epoch. To account for GPU memory limitations, the model was trained with a batch size of 32. During the training process, we implemented a strategy to reduce overfitting by selecting difficult samples. As with the comparison methods, we ran multiple experiments and chose the best result as the final result. These measures are intended to improve the robustness of our model.
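The linear learning-rate decay can be written as a one-line schedule. In the sketch below, the total number of epochs is a hypothetical parameter, since it is not stated in this section, and per-epoch (rather than per-iteration) updates are assumed.

```python
def linear_decay_lr(epoch, total_epochs, base_lr=5e-4):
    """Learning rate that falls linearly from base_lr at epoch 0 to 0 at total_epochs."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return base_lr * (1.0 - epoch / total_epochs)


if __name__ == "__main__":
    total = 200  # hypothetical number of epochs
    for epoch in (0, 50, 100, 200):
        print(epoch, linear_decay_lr(epoch, total))
```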

4.4. Baselines

To demonstrate the effectiveness of our approach, we compared our results with those reported in [22]. We also considered the three models presented in [10]. Moreover, to compare our model with other works adopting both spatial and channel attention mechanisms, we included [20,41,42]. Finally, given the success of transformers in computer vision, we also compared our results with those obtained in [22,31]. We reproduced all baseline methods using the modified codes [43] under their suggested parameters for a fair comparison. To further evaluate the validity of our proposed method, we also report the model parameters and computation costs of the above methods for reference.

4.5. Compared with the State-of-the-Art

In this section, we present a comprehensive comparison of our proposed model with several existing methods on two benchmark datasets, namely, LEVIR-CD and WHU-CD. The compared methods can be categorized into attention-based approaches and other efficient encoder–decoder structures. To ensure a fair and unbiased evaluation, we meticulously re-implemented all these methods and replicated their results within the same experimental environment. For each comparative method, we carefully selected a set of optimal hyperparameters that maximized the F1 score on the validation subset. This approach guarantees that all methods are fine-tuned under the same criteria, enabling a meaningful and consistent performance comparison.

The quantitative evaluation results for the two datasets are presented in Table 1 and Table 2. Furthermore, the qualitative assessment of the comparative methods is visualized in Figure 8 and Figure 9. These figures depict true positive (TP) regions in white, false positive (FP) regions in blue, false negative (FN) regions in red, and true negative (TN) regions in black. These visualizations allow for a comprehensive comparison of the methods’ performance. To evaluate our proposed model in terms of both accuracy and model size, we compared it with a method that strikes a good balance between these factors.
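The color coding used in these visualizations can be reproduced with a short Python sketch. The function name `error_map` and the use of nested lists of 0/1 values are assumptions made for illustration; only the color legend (white, blue, red, black) comes from the text.

```python
# RGB colors following the legend described above
COLORS = {
    "TP": (255, 255, 255),  # white: change correctly detected
    "FP": (0, 0, 255),      # blue: spurious change
    "FN": (255, 0, 0),      # red: missed change
    "TN": (0, 0, 0),        # black: unchanged, correctly ignored
}


def error_map(pred, truth):
    """Turn binary prediction and ground-truth grids into an RGB error map.

    `pred` and `truth` are equally sized lists of rows holding 0 (unchanged)
    or 1 (changed). Returns a grid of RGB tuples.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of rows")
    out = []
    for p_row, t_row in zip(pred, truth):
        if len(p_row) != len(t_row):
            raise ValueError("pred and truth rows must have the same length")
        row = []
        for p, t in zip(p_row, t_row):
            if p and t:
                key = "TP"
            elif p:
                key = "FP"
            elif t:
                key = "FN"
            else:
                key = "TN"
            row.append(COLORS[key])
        out.append(row)
    return out


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [1, 0]]
    for row in error_map(prediction, ground_truth):
        print(row)
```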

Experimental Results on the LEVIR-CD Dataset: Although FC-EF, FC-Siam-conc, FC-Siam-diff, and CDNet have the advantageous feature of occupying smaller memory footprints, they exhibit the poorest performance in terms of κ, F1, and IoU metrics. On the other hand, attention-based models such as BIT and SNUNet demonstrate similar performance, but BIT achieves comparable results with only one-tenth of the parameters compared to SNUNet (4.02 M vs. 42.38 M). It is worth noting that SNUNet, incorporating self-attention mechanisms, shows enhanced accuracy over CNN-based methods, albeit with a larger model size of 18.68 M.

When comparing our proposed network with transformer-based approaches, it is evident that BIT, despite its smaller parameter count, lags behind our network by 4.74%, 6.72%, and 7.49% in terms of κ, F1, and IoU on the LEVIR-CD dataset. Conversely, Changeformer, which utilizes the transformer as its backbone, surpasses CNN-based and self-attention-based methods in terms of κ, F1, and IoU, indicating the superiority of transformer-based feature extraction for remote sensing images. However, Changeformer exhibits a significantly larger parameter size of 40.5 M, roughly four times the size of our proposed network. Ultimately, our proposed SMBCNet achieves the best overall performance with a κ of 0.9032, an F1 score of 0.9087, and an IoU of 0.8316. This is attributed to the ability of our method to identify pseudo-change regions from multi-temporal images. Notably, SMBCNet demonstrates its superiority despite having a moderate parameter count of 10.14 M, which is significantly smaller than pure transformer-based Changeformer (47.3 M) and CNN-based DSIFN (42.38 M) methods.

Among traditional CNN methods, the FC-ef, FC-conc, and FC-diff networks have much smaller parameter sizes, but their performance is not competitive. In terms of the F1 metric on the LEVIR-CD dataset, our network improves by 4.3% over the best-performing network among the three. Although DTCDSCN performs better than these three methods, its parameter count makes it difficult to use in practical applications. This indicates that CNN methods have limitations in remote sensing change detection.

Figure 8 presents a visual comparison of various CD methods. Misclassified changed pixels are prevalent in the results of all methods except ours. Specifically, the results of SNUNet, BIT, and Changeformer exhibit a noticeable missing part of the building. In the top row of Figure 8, FC-diff, FC-ef, FC-conc, DSIFN, SNUNet, BIT, P2V-CD, and Changeformer fail to accurately localize the changed buildings, resulting in erroneous predictions. In contrast, our proposed method accurately and comprehensively detects the changed buildings. Notably, the change maps produced by SMBCNet exhibit the most favorable visual effect, appearing the closest to the ground truth.

Experimental Results on the WHU-CD Dataset: Table 2 shows the performance metrics for the WHU-CD dataset. The experimental results indicate that FC-EF, FC-Siam-conc, and FC-Siam-diff do not perform better than the other methods. Although DSIFN yields higher precision and recall than the aforementioned methods, its F1 score lags behind that of Changeformer, the pure transformer-based method. On a positive note, P2V-CD proves to be a promising solution, delivering favorable results across different datasets. Despite being the largest model on the list, DSIFN effectively prevents overfitting by utilizing pretrained encoders. It achieves a precision score of 0.9626 and the second-highest F1 score of 0.9127 on this dataset. Meanwhile, SNUNet fails to deliver competitive results, despite having more network parameters than the comparatively smaller CDNet model. Notably, our proposed SMBCNet outperforms the other methods by at least 2.82% in F1 score, indicating its superior performance in detecting change between two remote sensing images.

Figure 9 presents a qualitative evaluation of the change detection techniques on the WHU-CD dataset, providing a more intuitive comparison of the methods. The comparison shows that the majority of the CD methods produce spurious changes or missed detections, especially in heavily built-up areas. For example, in row 3 of Figure 9, all methods besides our proposed technique misclassify an unremarkable region as an actual change region, producing the false positives shown as blue regions in the figure.

Table 3 shows the comparison between our proposed network and the selected method in terms of parameter count and accuracy on the LEVIR-CD dataset. We can observe that our proposed approach demonstrates an advancement compared to existing lightweight RSCD methods, establishing its efficacy in the task of change detection between RSIs. This outcome substantiates the effectiveness of our method and its superiority in addressing this specific challenge. Furthermore, the scalability and generalizability of our approach make it a promising solution for future research in the field of RSCD. Despite not attaining state-of-the-art results in terms of parameter size, SMBCNet exhibits a noteworthy enhancement in both performance and parameter size when compared to Changeformer, which also employs a transformer-based architecture. Particularly, SMBCNet showcases enhanced computational effectiveness, as reflected in its improved performance metrics and reduced parameter requirements.

On the other hand, our method stands out, as it accurately identifies change objects while effectively suppressing background interference between the bi-temporal images. Our method utilizes the strong contextual-dependency-capturing ability of transformers by progressively aggregating multilevel temporal difference features in a coarse-to-fine manner. This approach results in a more refined change map for RSCD. The qualitative results indicate that our proposed method outperforms other techniques in terms of detecting change objects with better accuracy and mitigating the negative influence of background interference in the bi-temporal images.

4.6. Ablation Study

To verify the effectiveness of the components and configurations of the proposed SMBCNet, we conduct comprehensive ablation studies on two RSCD datasets.

Effectiveness of transformer backbone. To verify the effectiveness of the transformer-based encoder in our network, we conducted ablation experiments using different lightweight CNN-based backbones. The results are summarized in Table 4, which includes the parameters and accuracy for both LEVIR-CD and WHU-CD datasets. It is important to note that the “Params” column in Table 4 refers to the size of a single backbone but not the size of the entire model. Additionally, “TB” refers to “transformer blocks”, as illustrated in Figure 2. The table shows that our transformer-based encoder outperforms other CNN-based backbones in both model size and feature extraction capabilities for RSI.

Furthermore, we combined our proposed CEM with the compared backbones to validate its effectiveness. The results demonstrate that MobileNetV2 combined with our proposed CEM achieves the highest accuracy on the LEVIR-CD dataset, while the encoder used by our network comes in second place. However, it is important to note that MobileNetV2 with CEM contains 14 M parameters, which is nearly double the number of parameters in our proposed encoder.

Effectiveness of MCFM. We devise MCFM to extract the change information and fuse temporal features; it enjoys high interpretability and reveals the essential characteristic of CD. The MCFM aims to account for the diverse nature of changes in RSCD and to enhance changes. It has four branches, as illustrated in Figure 5. To validate the effect of the different branches on the whole network, we enabled the four branches in turn, and the results are shown in Table 5. Since the data augmentation we employ includes a strategy that swaps the two images of a pair, “A + DA” is much more accurate than “A”, improving the F1 score by nearly 4.72%. After adding the last two branches, we achieve the highest accuracy, which shows that each branch plays a significant role in the change feature fusion. We argue that with the proposed MCFM, we are able to “reduce” CD to semantic segmentation, that is, to tailor an existing and powerful semantic segmentation network to solve CD.

In the course of these MCFM ablation experiments, we also tested the cases of choosing only “DA”, “E”, and “D”, but the results were not optimal. This is because in practice, “Appear” and “Disappear” changes occur randomly, so using only one of the two branches will not achieve the best result. To simulate this process and increase the robustness and generalizability of our model, we used random temporal exchange as a data augmentation technique. Finally, our experiments show that the results obtained by using all four branches are optimal.
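Random temporal exchange can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `random_temporal_exchange` and the swap probability of 0.5 are assumptions, since the text does not state the probability used.

```python
import random


def random_temporal_exchange(img_t0, img_t1, p=0.5, rng=random):
    """Swap the two acquisition dates of a bi-temporal pair with probability p.

    The binary change mask stays valid after the swap because "changed" is
    symmetric in time, while an "Appear" sample effectively becomes a
    "Disappear" sample and vice versa. `p=0.5` is an assumed default.
    """
    if rng.random() < p:
        return img_t1, img_t0
    return img_t0, img_t1


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for a repeatable demo
    pair = ("image_at_t0", "image_at_t1")  # placeholders for real image arrays
    for _ in range(4):
        print(random_temporal_exchange(*pair, rng=rng))
```

Passing the random generator as an argument keeps the augmentation reproducible when a seeded generator is supplied.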

As shown in Figure 10, we utilize heatmaps to effectively visualize the feature maps derived from the MCFM. These heatmaps are generated by analyzing the variance of all feature maps. By applying MCFM, regions that have undergone changes exhibit increased energy, particularly with enhanced intensity at the edges of the target. We can observe that the low-resolution feature map (e) is responsible for localizing the change area, while the high-resolution feature map (d) is responsible for making the edges of the change object more accurate. In essence, this approach strengthens and precisely localizes the edges of the modified target, leading to improved detection performance.
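One plausible reading of "analyzing the variance of all feature maps" is a per-pixel variance across channels, rescaled to [0, 1] for display. The sketch below follows that reading; it is an assumption rather than the authors' stated procedure, and the function name `variance_heatmap` is hypothetical. Plain nested lists are used so the example has no dependencies.

```python
def variance_heatmap(features):
    """Per-pixel variance across channels, min-max scaled to [0, 1].

    `features` is a list of C channels, each an H x W list of floats.
    Returns an H x W grid; a constant map is returned as all zeros.
    """
    c = len(features)
    h, w = len(features[0]), len(features[0][0])
    heat = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [features[k][i][j] for k in range(c)]
            mean = sum(vals) / c
            heat[i][j] = sum((v - mean) ** 2 for v in vals) / c
    lo = min(min(row) for row in heat)
    hi = max(max(row) for row in heat)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant maps
    return [[(v - lo) / scale for v in row] for row in heat]


if __name__ == "__main__":
    # Two channels over a 1 x 2 grid: the channels agree at the first pixel
    # and disagree at the second, so only the second pixel lights up.
    feats = [[[0.0, 1.0]], [[0.0, 3.0]]]
    print(variance_heatmap(feats))
```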

Influence of the size of the transformer blocks. We conducted an analysis of the effect of increasing the size of the encoder on the performance and model efficiency, and Table 6 summarizes the results for the three datasets. The $C_{i}$ and $D_{i}$ shown in Figure 3 control the size of our encoder. We observed that increasing the size of the encoder does not lead to consistent improvements in performance: $C_{i} = \{64, 128, 320, 512\}$ and $D_{i} = \{2, 2, 2, 2\}$ achieve the best performance on the LEVIR-CD dataset, but accuracy does not improve consistently as $C_{i}$ and $D_{i}$ grow. We believe that this inconsistency may be due to variations in overfitting tendencies that we observed during the model training, which are directly influenced by the size of the encoder.

While the combination of $C_{i} = \{64, 128, 320, 512\}$ and $D_{i} = \{2, 2, 2, 2\}$ yielded the best accuracy results, it also exhibited potential drawbacks regarding real-time processing. The encoder sizes presented in $C_{i} = \{32, 64, 160, 256\}$ and $D_{i} = \{3, 6, 16, 3\}$ strike a balance between accuracy and computational cost. By selecting these parameters, we aim to achieve a reasonable level of accuracy while ensuring that the model can operate efficiently in real-time scenarios.

Influence of $C_{ebd}$, the MLP decoder channel dimension. We present an analysis of the impact of the channel dimension $C_{ebd}$ in the MLP decoder, as discussed in Section 3.4. Table 7 reports the model’s performance and parameter count as a function of this dimension. Our findings indicate that setting $C_{ebd} = 256$ yields highly competitive performance with minimal computational cost. As $C_{ebd}$ increases, the model’s performance improves, but at the cost of larger and less efficient models. It is worth noting that an excessively large $C_{ebd}$ may lead to overfitting and a subsequent decrease in the model’s accuracy, despite the increase in the number of parameters. Based on these findings, we select $C_{ebd} = 256$ as the optimal dimension for our final SMBCNet.

5. Discussion

CD can be viewed as a semantic segmentation problem, which also suggests that techniques can be transferred between the two tasks. Specifically, there are approximately three types of changes, namely, “Appear”, “Disappear”, and “Exchange”. Recent advances in transformer-based models provide a promising way to capture global context information. However, limited by excessive computational complexity and inaccurate self-attention calculation, accurate and relatively lightweight CD of RSIs using transformer-based models still requires improvement. We address the CD problem by introducing a mechanism for multi-branch change feature fusion. In particular, we combine and enhance the features of the change region in remote sensing images. As a result, we can use existing semantic segmentation networks to solve the CD problem.

Unlike previous RSCD methods that incorporated self-attention with CNN, our approach utilizes a pure transformer-based network featuring a Siamese structure to address CD problems. As shown in Table 4, the transformer-based encoder outperforms CNN-based encoders in RSCD. This is because RSIs present challenges for object analysis and interpretation, including difficulties with accurately identifying and delineating objects with complex shapes or overlapping features, and challenges in distinguishing between objects with similar spectral reflectance and texture.

However, further improvements are still necessary. Figure 11 displays some failure cases in two RSCD datasets, where the proposed SMBCNet either incorrectly predicts the changed regions or fails to capture them because the objects in question exhibit minimal differences from their surroundings. Nevertheless, as discussed in Section 4.5, our approach outperforms state-of-the-art methods in addressing the most challenging scenarios for the RSCD task.

6. Conclusions

This study introduces a novel RSCD network termed SMBCNet. Unlike current CD approaches that employ large CNNs as backbones, we utilize a hierarchically structured transformer encoder combined with an MLP decoder, achieving higher accuracy while maintaining a relatively lightweight model size compared to other pure transformer-based RSCD methods. To extract global information efficiently, we propose CEM, which balances information extraction and computational complexity. Additionally, we introduce the MCFM to address the diverse nature of changes, effectively transforming the CD task into a semantic segmentation problem. Furthermore, we introduce the TFAM to integrate features from various spatial scales. Our network achieves state-of-the-art performance on two publicly available CD datasets with a significantly lower parameter count than transformer-based networks of the same type. The superior performance of our proposed method in RSCD also showcases its potential for practical RS applications. Through the incorporation of the CEM, MCFM, and TFAM modules, we achieve improved results in change detection while keeping the computational cost in check. These findings hold promising implications for various RS applications.

There are certain limitations in our study that should be acknowledged, particularly the fact that a majority of the samples in the LEVIR-CD and WHU-CD datasets primarily consist of buildings. However, it is important to emphasize that none of the steps involved in our proposed pipeline are exclusive to building change detection. Therefore, our method can be readily extended to other forms of change detection in RSIs. To illustrate this, we conducted tests on the HRSCD [36] dataset, encompassing a range of changing objects, including artificial surfaces, agricultural areas, forests, wetlands, water, and unidentified regions. Nevertheless, it is important to consider another limitation when dealing with fine-grained change detection, particularly in scenarios where the changed area has a small spatial scale and an irregular distribution and geometric appearance. In future research, we aim to address this limitation by devising techniques to effectively discriminate between uncertain pixel-level differences across the entire image pair.

Author Contributions

Conceptualization, J.F. and X.Y.; methodology, J.F. and X.Y.; software, X.Y. and W.Z.; validation, J.F. and X.Y.; writing original draft preparation, J.F. and X.Y.; writing—review and editing, Z.G. and M.Z.; visualization, J.F., X.Y. and W.Z.; investigation, Z.G., M.Z. and W.Z.; funding acquisition, J.F., Z.G. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) (41971365), the Major Science and Technology Project of the Ministry of Water Resources (No. SKR-2022037), the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0855), and the Chongqing Graduate Research Innovation Project (CYS22448).

Data Availability Statement

The LEVIR-CD dataset and WHU-CD dataset in this study are downloaded at https://justchenhao.github.io/LEVIR/ (accessed on 13 April 2023) and http://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 14 April 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871.
2. Lv, Z.; Liu, T.; Benediktsson, J.A.; Falco, N. Land cover change detection techniques: Very-high-resolution optical images: A review. IEEE Geosci. Remote Sens. Mag. 2021, 10, 44–63.
3. Li, X.; He, M.; Li, H.; Shen, H. A combined loss-based multiscale fully convolutional network for high-resolution remote sensing image change detection. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
4. Frick, A.; Tervooren, S. A framework for the long-term monitoring of urban green volume based on multi-temporal and multi-sensoral remote sensing data. J. Geovisualization Spat. Anal. 2019, 3, 6.
5. Celik, T. Unsupervised change detection in satellite images using principal component analysis and k-means clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776.
6. Liu, S.; Bruzzone, L.; Bovolo, F.; Du, P. Hierarchical unsupervised change detection in multitemporal hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2014, 53, 244–260.
7. Ferraris, V.; Dobigeon, N.; Wei, Q.; Chabert, M. Detecting changes between optical images of different spatial and spectral resolutions: A fusion-based approach. IEEE Trans. Geosci. Remote Sens. 2017, 56, 1566–1578.
8. Bruzzone, L.; Prieto, D.F. Automatic analysis of the difference image for unsupervised change detection. IEEE Trans. Geosci. Remote Sens. 2000, 38, 1171–1182.
9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
10. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
11. Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382.
12. Zhan, Y.; Fu, K.; Yan, M.; Sun, X.; Wang, H.; Qiu, X. Change detection based on deep siamese convolutional network for optical aerial images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1845–1849.
13. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067.
14. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
15. Shen, L.; Lu, Y.; Chen, H.; Wei, H.; Xie, D.; Yue, J.; Chen, R.; Lv, S.; Jiang, B. S2Looking: A satellite side-looking dataset for building change detection. Remote Sens. 2021, 13, 5094.
16. Toker, A.; Kondmann, L.; Weber, M.; Eisenberger, M.; Camero, A.; Hu, J.; Hoderlein, A.P.; Şenaras, Ç.; Davis, T.; Cremers, D.; et al. Dynamicearthnet: Daily multi-spectral satellite dataset for semantic change segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21158–21167.
17. Verma, S.; Panigrahi, A.; Gupta, S. Qfabric: Multi-task change detection dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–25 June 2021; pp. 1052–1061.
18. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206.
19. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662.
20. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
21. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
22. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
24. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
26. Zhang, M.; Xu, G.; Chen, K.; Yan, M.; Sun, X. Triplet-based semantic relation learning for aerial remote sensing image change detection. IEEE Geosci. Remote Sens. Lett. 2018, 16, 266–270.
27. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
28. Liu, R.; Jiang, D.; Zhang, L.; Zhang, Z. Deep depthwise separable convolutional network for change detection in optical aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1109–1118.
29. Li, Z.; Tang, C.; Wang, L.; Zomaya, A.Y. Remote sensing change detection via temporal feature interaction and guided refinement. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
31. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
33. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 18–22 June 2023; pp. 7464–7475.
34. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
35. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
36. Daudt, R.; Le Saux, B.; Boulch, A.; Gousseau, Y. Multitask Learning for Large-scale Semantic Change Detection. Comput. Vis. Image Underst. 2019, 187, 102783.
37. Bandara, W.G.C.; Patel, V.M. Revisiting consistency regularization for semi-supervised change detection in remote sensing images. arXiv 2022, arXiv:2204.08454.
38. Ding, L.; Guo, H.; Liu, S.; Mou, L.; Zhang, J.; Bruzzone, L. Bi-temporal semantic reasoning for the semantic change detection in HR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
39. Huang, J.; Shen, Q.; Wang, M.; Yang, M. Multiple attention Siamese network for high-resolution image change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16.
40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
41. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200.
42. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model. IEEE Geosci. Remote Sens. Lett. 2020, 18, 811–815.
43. Fang, S.; Li, K.; Li, Z. Changer: Feature Interaction Is What You Need for Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5610111.
44. Lin, M.; Yang, G.; Zhang, H. Transition Is a Process: Pair-to-Video Change Detection Networks for Very High Resolution Remote Sensing Images. IEEE Trans. Image Process. 2022, 32, 57–71.
45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
46. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
47. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514.

Figure 1. Illustration of the three types of change. In our view, the object changes in RSCD can be divided into three categories: “Appear”, “Disappear”, and “Exchange”. (a) $t_{0}$ images. (b) $t_{1}$ images.

Figure 2. Our proposed SMBCNet network framework consists of three main components: a transformer encoder, a multi-change fusion module, and a transformer decoder. Each of these components has a specific role in the overall process of feature extraction, change feature fusion, and feature decoding, respectively.

Figure 3. This illustration depicts the functionality of the transformer blocks, which are composed of four stages. The feature map downsamples after each stage.

Figure 4. Illustration of the proposed CEM. We have achieved feature enhancement by fusing the outputs of the four stages of the transformer block. “Top”, “Mid”, and “Bottom” denote the three features from neighboring stages. Note that “Top” and “Bottom” can be omitted.

Figure 5. Illustration of the proposed MCFM. We extract the texture features of the change region by fusing the four change branches. In order to highlight that the features are enhanced after the spatial attention and channel attention change regions, we use a gradient color to represent the change feature vector after the self-attention operation.

Figure 6. Illustration of the proposed TFAM. TFAM upsamples and fuses feature maps at multiple scales to achieve feature enhancement in regions of change.

Figure 7. Illustration of the proposed MLP decoder. The lightweight MLP decoder combines multiple feature maps with different shapes to finally predict the change map.

Figure 8. Visual comparisons of the proposed method and the state-of-the-art approaches on the LEVIR-CD dataset. (a) $t_{0}$ images. (b) $t_{1}$ images. (c) Ground truth. (d) FC-ef. (e) FC-conc. (f) FC-diff. (g) DSIFN. (h) SNUNet. (i) BIT. (j) P2V-CD. (k) Changeformer. (l) Ours. We use different colors to represent true positives (white), false positives (blue), true negatives (black), and false negatives (red).
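The color convention in this caption can be reproduced with a few lines of code. The following is a minimal sketch in plain Python, not the authors' plotting code: the function names `classify_pixel` and `render_error_map` are hypothetical, masks are assumed to be nested lists of 0/1 values, and the RGB triples are assumed encodings of the four named colors.

```python
# Colors follow the caption: TP white, FP blue, TN black, FN red (RGB triples).
TP_COLOR = (255, 255, 255)
FP_COLOR = (0, 0, 255)
TN_COLOR = (0, 0, 0)
FN_COLOR = (255, 0, 0)


def classify_pixel(pred, gt):
    """Map one predicted/ground-truth pixel pair to its display color."""
    if pred and gt:
        return TP_COLOR  # change predicted and present
    if pred and not gt:
        return FP_COLOR  # change predicted but absent
    if not pred and gt:
        return FN_COLOR  # change missed
    return TN_COLOR      # no change predicted, none present


def render_error_map(pred_mask, gt_mask):
    """Return a nested list of RGB triples with the same height and width as the masks."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same number of rows")
    rows = []
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        if len(pred_row) != len(gt_row):
            raise ValueError("mask rows must have the same length")
        rows.append([classify_pixel(bool(p), bool(g)) for p, g in zip(pred_row, gt_row)])
    return rows


# Toy 2 x 2 example containing one pixel of each outcome.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 1]]
error_map = render_error_map(pred, gt)
```

For full-size images, the same mapping would normally be done with array masking in NumPy or PyTorch; the loop version is used here only to keep the sketch dependency-free.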

Figure 9. Visual comparisons of the proposed method and the state-of-the-art approaches on the WHU-CD dataset. (a) $t_{0}$ images. (b) $t_{1}$ images. (c) Ground truth. (d) FC-ef. (e) FC-conc. (f) FC-diff. (g) DSIFN. (h) SNUNet. (i) BIT. (j) P2V-CD. (k) Changeformer. (l) Ours. We use different colors to represent true positives (white), false positives (blue), true negatives (black), and false negatives (red).

Figure 10. Visualization of MCFM on the LEVIR-CD dataset. (a) $t_{0}$ images. (b) $t_{1}$ images. (c) Ground truth. (d,e) Heatmaps after MCFM with shapes $\frac{H}{4} \times \frac{W}{4} \times C_{1}$ and $\frac{H}{8} \times \frac{W}{8} \times C_{2}$ in Figure 6, respectively.

Figure 11. Visualization of some failure cases of the proposed SMBCNet on two RSCD datasets. (a) $t_{0}$ images. (b) $t_{1}$ images. (c) Ground truth. (d) Ours. The rendered colors represent true positives (white), false positives (blue), true negatives (black), and false negatives (red).

Table 1. Comparison with other models on the LEVIR-CD dataset. The best performance is indicated in bold.

Methods	Year	$κ$	F1	IoU	OA	Pre	Rec
FC-ef [13]	2018	0.7644	0.7755	0.6333	0.9839	0.6917	0.8823
FC-conc [13]	2018	0.7759	0.7863	0.6479	0.9802	0.7988	0.7742
FC-diff [13]	2018	0.8600	0.8665	0.7644	0.9877	0.8836	0.8500
DSIFN [41]	2020	0.8746	0.8803	0.7861	0.9892	0.8932	0.8677
BIT [31]	2020	0.8558	0.8615	0.7567	0.9880	0.8787	0.8449
SNUNet [20]	2022	0.8585	0.8647	0.7567	0.9880	0.8787	0.8449
Changeformer [22]	2022	0.9029	0.9073	0.8303	0.9916	0.9154	0.8994
P2V-CD [44]	2023	0.8937	0.8981	0.8150	0.9911	0.9102	0.8863
Ours	-	0.9032	0.9087	0.8316	0.9908	0.8961	0.9205
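The six column headers above ($κ$, F1, IoU, OA, Pre, Rec) are all functions of the pixel-level confusion counts. The sketch below uses the standard textbook definitions for the binary case and assumes the paper's metrics match them; the function name `change_metrics` and the toy counts are hypothetical, and nothing here reproduces the reported numbers.

```python
def change_metrics(tp, fp, tn, fn):
    """Compute binary change-detection metrics from confusion counts.

    Uses the standard definitions (assumed, not taken from the authors' code).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    iou = safe_div(tp, tp + fp + fn)
    oa = (tp + tn) / total  # overall accuracy

    # Cohen's kappa: observed agreement (OA) corrected for chance agreement.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (total * total)
    kappa = safe_div(oa - chance, 1 - chance)

    return {
        "kappa": kappa,
        "f1": f1,
        "iou": iou,
        "oa": oa,
        "precision": precision,
        "recall": recall,
    }


# Toy counts: 100 pixels, 10 of which truly changed.
metrics = change_metrics(tp=8, fp=2, tn=88, fn=2)
```

Because unchanged pixels dominate most change-detection datasets, OA stays high even for weak models, which is one reason $κ$, F1, and IoU are reported alongside it.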

Table 2. Comparison with other models on the WHU-CD dataset. The best performance is indicated in bold.

Methods	Year	$κ$	F1	IoU	OA	Pre	Rec
FC-ef [13]	2018	0.7439	0.7599	0.6127	0.9742	0.6599	0.8955
FC-conc [13]	2018	0.8230	0.8111	0.6822	0.9796	0.7588	0.8711
FC-diff [13]	2018	0.8732	0.8790	0.7841	0.9897	0.8999	0.8590
DSIFN [41]	2020	0.8741	0.8832	0.7909	0.9902	0.9445	0.8294
BIT [31]	2020	0.7192	0.7154	0.5569	0.9698	0.6067	0.8715
SNUNet [20]	2022	0.8307	0.8248	0.7018	0.9826	0.7609	0.9003
P2V-CD [44]	2023	0.9038	0.9106	0.8360	0.9924	0.9279	0.8941
Changeformer [22]	2022	0.9057	0.9101	0.8351	0.9916	0.9165	0.9038
Ours	-	0.9288	0.9383	0.8831	0.9933	0.9470	0.9297

Table 3. Parameters, complexity, and performance comparison on the LEVIR-CD dataset. The best performance is indicated in bold.

Methods	Backbone	Param(M)	FLOPs	$κ$	F1	IoU
FC-ef [13]	U-Net	1.35	1.21	0.7644	0.7755	0.6333
FC-conc [13]	U-Net	1.55	3.54	0.7759	0.7863	0.6479
FC-diff [13]	U-Net	1.35	2.94	0.8600	0.8665	0.7644
DSIFN [41]	VGG-16	42.38	61.18	0.8746	0.8803	0.7861
SNUNet [20]	U-Net++	18.68	27.44	0.8585	0.8647	0.7567
BIT [31]	ResNet-18 [45]	4.02	4.35	0.8558	0.8615	0.7567
P2V-CD [44]	-	5.49	16.61	0.8937	0.8981	0.8150
Changeformer [22]	Transformer	47.3	-	0.9029	0.9073	0.8303
Ours	Transformer	10.14	16.3	0.9032	0.9087	0.8316

Table 4. Comparison with lightweight CNN-based backbones on the LEVIR-CD and WHU-CD datasets. Our lightweight transformer encoder has significant advantages in parameter count and accuracy. “+” denotes combining the backbone with our proposed CEM. The best performance is indicated in bold.

Encoder	Params	$κ$ (LEVIR-CD)	F1 (LEVIR-CD)	IoU (LEVIR-CD)	$κ$ (WHU-CD)	F1 (WHU-CD)	IoU (WHU-CD)
ResNet-18 [45]	11.2 M	0.9021	0.9063	0.8287	0.9172	0.9208	0.8533
MobileNetV2 [46]	9.8 M	0.8873	0.8921	0.8053	0.9193	0.9228	0.8567
HRNet-W18 [47]	13.9 M	0.8681	0.8739	0.7761	0.9058	0.9100	0.8407
ResNet-18+ [45]	15.4 M	0.8898	0.8944	0.8090	0.9163	0.920	0.8519
MobileNetV2+ [46]	14.0 M	0.9100	0.9108	0.8413	0.9153	0.9190	0.8501
HRNet-W18+ [47]	18.2 M	0.9012	0.9057	0.8276	0.9130	0.9168	0.8464
Ours	3.2 M	0.9005	0.9049	0.8263	0.9112	0.9151	0.8436
Ours+	7.6 M	0.908	0.9032	0.8316	0.9288	0.9383	0.8831

Table 5. MCFM ablation experiments on the LEVIR-CD dataset. “A”, “DA”, “E”, and “D” denote the Appear, Disappear, Exchange, and Distance branches in Figure 5, respectively. The best performance is indicated in bold.

Module	A	DA	E	D	F1	$κ$	IoU
MCFM	√				0.8316	0.8182	0.7768
MCFM	√	√			0.8788	0.8589	0.8078
MCFM	√	√	√		0.8814	0.8651	0.8121
MCFM	√	√	√	√	0.9078	0.9032	0.8316

Table 6. CEM ablation experiments on the LEVIR-CD dataset. “+” denotes combining the backbone with our proposed CEM, and “TB” denotes the transformer block illustrated in Figure 2. The best performance is indicated in bold.

Method	$C_{i}$	$D_{i}$	Param	F1	$κ$	IoU
DSIFN [41]	-	-	53.12 M	0.8803	0.8746	0.7861
P2V-CD [44]	-	-	3.02 M	0.8981	0.8937	0.8150
Changeformer [22]	-	-	40.5 M	0.9073	0.9029	0.8303
TB (Ours)	$\{32, 64, 160, 256\}$	$\{3, 6, 16, 3\}$	3.4 M	0.8935	0.8879	0.8257
TB (Ours)	$\{64, 128, 320, 512\}$	$\{2, 2, 2, 2\}$	13.1 M	0.9094	0.9005	0.8332
TB (Ours)	$\{64, 128, 320, 512\}$	$\{3, 4, 6, 3\}$	24.2 M	0.8872	0.8901	0.8291
TB (Ours)	$\{64, 128, 320, 512\}$	$\{3, 4, 18, 3\}$	40.7 M	0.8971	0.8925	0.8135
TB + (Ours)	$\{32, 64, 160, 256\}$	$\{3, 6, 16, 3\}$	7.6 M	0.9087	0.9032	0.8316
TB + (Ours)	$\{64, 128, 320, 512\}$	$\{2, 2, 2, 2\}$	17.3 M	0.9103	0.9043	0.8356
TB + (Ours)	$\{64, 128, 320, 512\}$	$\{3, 4, 6, 3\}$	28.4 M	0.8999	0.8955	0.8181
TB + (Ours)	$\{64, 128, 320, 512\}$	$\{3, 4, 18, 3\}$	48.2 M	0.8712	0.8651	0.7718

Table 7. Ablation study of $C_{ebd}$ in the MLP decoder on the LEVIR-CD and WHU-CD datasets. The best performance is indicated in bold.

$C_{ebd}$	Params	$κ$ (LEVIR-CD)	F1 (LEVIR-CD)	IoU (LEVIR-CD)	$κ$ (WHU-CD)	F1 (WHU-CD)	IoU (WHU-CD)
256	13.05 M	0.9032	0.9080	0.8316	0.9231	0.9264	0.8629
512	13.47 M	0.9058	0.9100	0.8348	0.9070	0.9111	0.8367
1024	15.11 M	0.9084	0.9030	0.8232	0.9101	0.9141	0.8418
2048	21.63 M	0.8913	0.8915	0.8043	0.8947	0.8993	0.8170

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Feng, J.; Yang, X.; Gu, Z.; Zeng, M.; Zheng, W. SMBCNet: A Transformer-Based Approach for Change Detection in Remote Sensing Images through Semantic Segmentation. Remote Sens. 2023, 15, 3566. https://doi.org/10.3390/rs15143566
