SMBCNet: A Transformer-Based Approach for Change Detection in Remote Sensing Images through Semantic Segmentation

Abstract: Remote sensing change detection (RSCD) is crucial for our understanding of the dynamic patterns of the Earth's surface and of human influence. Recently, transformer-based methods have advanced RSCD thanks to their powerful global modeling capabilities. Nevertheless, they remain heavily over-parameterized, and their use continues to be severely constrained by time and computational resources. Here, we present a transformer-based RSCD model called the Segmentation Multi-Branch Change Detection Network (SMBCNet). Our proposed approach combines a hierarchically structured transformer encoder with a cross-scale enhancement module (CEM) to extract global information with lower complexity. To account for the diverse nature of changes, we introduce a plug-and-play multi-branch change fusion module (MCFM) that integrates temporal features. Within this module, we transform the change detection task into a semantic segmentation problem. Moreover, we introduce the Temporal Feature Aggregation Module (TFAM) to facilitate integrating features from diverse spatial scales. Our results demonstrate that semantic segmentation is an effective solution to change detection (CD) problems in remote sensing images.


Introduction
In recent years, there has been a significant increase in the use of remote sensing-based approaches for monitoring land use changes, particularly in the context of human-induced land-use changes [1]. Effective land resource management and sustainable urban development rely on accurate quantification of these changes. Change Detection (CD) techniques aim to identify and analyze changes in images captured at different time points over the same geographical area. This has led to a growing interest in RSCD technology, which involves classifying each pixel as a changed or unchanged location [2][3][4].
Traditional RSCD methods can be classified into pixel-based change detection (PBCD) and object-based change detection (OBCD) [5][6][7][8]. However, these methods face challenges due to various factors such as atmospheric conditions, seasonal variations, sensor differences, and subjective definitions of change. These factors have led to two major challenges: missed detections and pseudo-changes. Consequently, deep learning architectures, specifically convolutional neural networks (CNNs) [9], have emerged as powerful tools for extracting discriminative features in RSCD [10][11][12][13]. However, CNN-based methods still struggle to accurately capture edge features and handle pseudo-change information in their prediction results [14]. Additionally, the complexity of remote sensing images (RSIs), with their diverse colors, sizes, rotations, and spatial distributions, leads to missed detections and pseudo-changes for traditional CNN encoders [15][16][17].
To overcome these limitations, researchers have explored the use of transformers in RSCD [18][19][20][21][22]. Transformers [23] have demonstrated a remarkable semantic representation capability and have shown themselves to be effective in various image processing tasks [24,25]. By incorporating self-attention mechanisms, transformers can model spatial-temporal relationships and capture long-range dependencies for improved change prediction results. However, existing transformer-based methods often involve a significant number of model parameters and entail substantial computational overhead. Additionally, most researchers design custom networks from scratch to predict change masks and face challenges in general segmentation tasks, such as handling ground objects at varying scales and improving mask details, and they fail to capture the interconversion relationship between RSCD and semantic segmentation. Therefore, there is a need to develop transformer-based RSCD methods that can efficiently learn global dependencies and capture local details while reducing computational costs. By accomplishing this, we can make models more correctable and ultimately practical for broader RSCD tasks.
Here, we address these issues by reducing RSCD to semantic segmentation, which means tailoring a powerful semantic segmentation network to solve CD. This new paradigm, called the Segmentation Multi-Branch Change Detection Network (SMBCNet), leverages mainstream semantic segmentation techniques to tackle general problems in RSCD. First, we replace convolutional operations with transformer blocks, which generate multi-scale features using a hierarchically structured transformer encoder. To enhance feature extraction, we incorporate a cross-scale enhancement module (CEM) that enriches semantic information and captures fine-grained object details. Second, our approach emphasizes the importance of incorporating information related to diverse change types into the fusion feature. Therefore, a multi-branch change fusion module (MCFM) is introduced to classify changes into three types ("Appear", "Disappear", and "Exchange" in Figure 1), with each type learned separately. The MCFM integrates various change forms into the change detection task using a multi-branch structure. It efficiently preserves semantic information related to altered areas within a pair of multi-temporal images while effectively filtering out background information from regions that remain unchanged. This procedure converts the change detection problem into a binary semantic segmentation task, where pixel values of 0 and 1 indicate unchanged and changed regions, respectively. This shift allows us to leverage advanced semantic segmentation networks to address the RSCD task, leading to improved information extraction and enhanced accuracy. Additionally, we employ a temporal feature aggregation module (TFAM) to efficiently recalibrate multilevel features and enable progressive learning. SMBCNet achieves a larger effective receptive field (ERF) compared to traditional CNN encoders and strikes a better balance between accuracy and model size in RSCD tasks.
Our contributions in this study are:
1. We propose SMBCNet, a transformer-based network for RSCD that incorporates mainstream semantic segmentation techniques to address various challenges in this field. SMBCNet outperforms various types of previous approaches, achieving superior performance while having a smaller parameter size and computational complexity.
2. We introduce the MCFM, which classifies changes into three types and enhances the responsiveness of the network towards change regions or objects. Unlike previous fusion approaches, the MCFM not only provides enhanced interpretability but also adeptly captures the inherent features of RSCD.
3. We perform ablation experiments and comparative experiments on the WHU-CD and LEVIR-CD datasets to demonstrate the effectiveness of our proposed method. Our results show that the transformer backbone outperforms CNN-based backbones in RSCD tasks, even with relatively fewer parameters. This advantage stems from the transformer's larger receptive field, which enables stronger model characterization capabilities with lower computational resources.

Remote Sensing Change Detection with CNN
In previous work, it was demonstrated that CNN-based methods can achieve superior accuracy compared to traditional methods for remote sensing change detection. Specifically, Daudt et al. [13] proposed a method that combines the benefits of both a fully convolutional neural network (FCN) and a Siamese architecture. Furthermore, they used skip connections to enhance the spatial accuracy of the results. UNet++ [11] incorporates a residual block strategy, which captures more detailed information. Moreover, the approach combines the weighted binary cross-entropy loss and the dice-coefficient loss to mitigate the imbalance between changed and unchanged objects. Zhang [26] utilized the strengths of the DeepLabv2 [27] model and extended it to RSCD tasks by adopting atrous spatial pyramid pooling (ASPP) and atrous convolution operations. In addition, Liu [28] employed depth-wise separable convolution to lighten the FCN and improve its performance compared to the original FCN. Collectively, these studies have paved the way for exploring the combination of CNN-based methodologies with RSCD. However, the fixed kernel of the convolutional operation may limit CNN-based approaches, as remote sensing images often require a larger receptive field [14]. This limitation makes it challenging to capture more comprehensive spatial and contextual information. Therefore, various studies have attempted to address this limitation through the use of self-attention mechanisms, such as transformers, which are particularly effective in modeling long-range dependencies for RSCD.

Remote Sensing Change Detection with Attention Mechanisms
To address the limitations of the fixed receptive field in CNN-based RSCD methods, attention mechanisms have been introduced to extend the receptive field and improve the accuracy of change detection in remote sensing images (RSIs). One approach proposed by Chen [19] is the use of a spatial-temporal attention module, which computes attention weights between any two pixels at different times and positions, effectively enhancing the discriminative features. Another method, DASNet [18], captures long-range dependencies to obtain more discriminative feature representations, thereby strengthening the recognition performance of the model. A novel approach, TFI-GR [29], leverages temporal feature interaction and guided refinement to locate changes in RSIs. However, the incorporation of self-attention mechanisms tends to increase the number of network parameters significantly. Furthermore, while attention mechanisms can address some of the limitations of CNNs, their ability to capture large receptive fields remains limited.

Remote Sensing Change Detection with a Transformer
Building on the pioneering ideas of ViT [30], transformers have achieved remarkable success in natural image processing and have also brought new solutions to RSCD research. Recently, some researchers have incorporated transformer-based methodologies into RSCD, providing novel research ideas. For example, Chen [31] leveraged a transformer encoder to compactly model the context of a bi-temporal image using only a few tokens. Swin-SUNet [21] is the first pure transformer network to use a Siamese U-shaped structure for the CD task, with Swin transformer blocks [25] as the basic units for global feature extraction. ChangeFormer [22] efficiently utilizes SegFormer blocks to capture multi-scale, long-range details required for accurate change detection. However, these transformer-based RSCD methods may either integrate transformer encoder-decoder blocks with a CNN backbone to fuse and enhance multi-scale features, or they may have too many parameters for practical use in real-world applications.

Methodology
In this section, we start by providing an overview of the proposed approach. Then, we provide detailed information about the encoder, CEM, MCFM, TFAM, and decoder modules. Lastly, we elaborate on the hybrid loss function that is integrated into our method.

Overview
The proposed network is based on a Siamese encoder-decoder architecture, depicted in Figure 2. The primary distinction between our network and earlier models is that we divide change detection into feature fusion and semantic segmentation, utilizing a hierarchical encoder-decoder structure. Firstly, a bi-temporal image pair of size H × W × 3 is provided, and a transformer encoder generates four hierarchical feature maps $P_i$. Following this, all feature maps pass through the CEM to enhance the representational capacity of the bi-temporal features. Lastly, an effective decoder performs top-down multilevel feature fusion, allowing change maps to be predicted with superior detail.

Transformer Encoder
Our proposed transformer encoder comprises two critical components: transformer blocks for feature extraction and the Cross-scale Enhancement Module (CEM) for improving temporal features.
Transformer blocks. As illustrated in Figure 3, the hierarchical transformer encoder produces multi-level features that encompass both high-resolution coarse features and low-resolution fine-grained features, which are essential for detecting changes. This approach effectively minimizes redundancies and leads to the extraction of more cohesive and coherent RSI object details. Specifically, given a pre-change and post-change image pair with the same H × W × 3 resolution, the transformer encoder generates hierarchically structured feature maps $p^i_j \in \mathbb{R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i}$, where i ∈ {1, 2, 3, 4} indexes the distinct layers, and j ∈ {0, 1} indicates pre- or post-change features. The channel counts $C_i$ differ across layers, with $C_{i+1} > C_i$. We set $C_i$ to 32, 64, 160, and 256 in sequence. The number of times a transformer operation is repeated in a transformer block is represented by $D_i$. Thus, $C_i$ and $D_i$ together determine the size of our encoder.
To minimize the number of parameters while retaining the transformer's essential feature extraction capabilities, we replace conventional self-attention with a more efficient self-attention mechanism. As described in the seminal work by Vaswani et al. [23], the self-attention procedure can be defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V$$

where the matrices Q, K, and V correspond to Query, Key, and Value, respectively, and $d_{head}$ is the per-head channel dimension. As described in [23], these three matrices have identical dimensions of N × C, where N = H × W represents the sequence length. The O(N²) computational complexity of self-attention makes it unsuitable for deployment on high-resolution RSIs. To tackle this challenge, we adopt a reduction strategy [24] that shortens the sequence by introducing a reduction ratio, R, as follows:

$$\hat{D} = \mathrm{Reshape}\left(\frac{N}{R},\, C \cdot R\right)(D), \qquad D = \mathrm{Linear}(C \cdot R,\, C)(\hat{D})$$

where D represents the input sequence, consisting of Q, K, and V. Reshape(N/R, C·R)(·) reshapes the sequence to size N/R × (C·R), and Linear(C·R, C)(·) is a linear layer that maps the result back to C channels, so the attention cost falls from O(N²) to O(N²/R).

To downsize the feature maps of the hierarchical transformer encoder, we apply an overlapped patch merging strategy. To maintain local consistency across different patches, we define the patch size, K, the stride between adjacent patches, S, and the padding size, Pa. Specifically, we set K = 7, S = 4, Pa = 3 and K = 3, S = 2, Pa = 1 for the overlapping patch merging. This enables us to obtain features of the same size as the non-overlapping process while maintaining local continuity across different patches.
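As a quick check on the patch-merging settings and on why sequence reduction matters, the sketch below computes the feature-map side length after each merging stage with the standard strided-convolution size formula, and compares the number of query-key pairs with and without reduction. The 256-pixel input, the assignment of (K, S, Pa) = (7, 4, 3) to the first stage and (3, 2, 1) to the later ones, and the per-stage reduction ratios are assumptions made for illustration (they follow the SegFormer convention), not values stated in the text.

```python
def merged_size(size, k, s, pa):
    """Output side length of a strided convolution used for patch merging."""
    return (size + 2 * pa - k) // s + 1


def stage_sizes(size=256):
    """Side length of the feature map after each of the four stages."""
    # Assumed schedule: (K, S, Pa) = (7, 4, 3) once, then (3, 2, 1) three times.
    schedule = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
    sizes = []
    for k, s, pa in schedule:
        size = merged_size(size, k, s, pa)
        sizes.append(size)
    return sizes


def attention_pairs(n, r=1):
    """Query-key pairs when the key/value sequence is shortened from N to N/R."""
    return n * (n // r)


sizes = stage_sizes(256)  # [64, 32, 16, 8], i.e. 1/4, 1/8, 1/16, 1/32 of the input
ratios = [64, 16, 4, 1]   # hypothetical reduction ratios R, one per stage
for side, r in zip(sizes, ratios):
    n = side * side
    print(f"side={side:3d}  full={attention_pairs(n):>10d}  reduced={attention_pairs(n, r):>8d}")
```

Under these assumptions the first stage, which has the longest sequence (N = 4096), benefits most: the number of query-key pairs drops by a factor of R.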
To enable the transformer to incorporate positional information effectively, our approach uses two MLP layers coupled with a 3 × 3 depth-wise convolution. This lets the model capture the positional relationships required for good performance. The process can be described as:

$$X_{out} = \mathrm{MLP}\left(\mathrm{GELU}\left(\mathrm{Conv}_{3\times3}\left(\mathrm{MLP}(X_{in})\right)\right)\right) + X_{in}$$

where $X_{in}$ is the feature from the self-attention module, $X_{out}$ is the output feature, and GELU is the activation function.

Cross-scale Enhancement Module (CEM).
Multilevel features play a crucial role in object recognition, as they provide both detailed and semantic information about objects. High-level features are responsible for locating objects and contain semantic information, while low-level features provide finer-grained boundary and texture information. By incorporating multiple levels of features, we can strengthen the capacity of the extracted temporal features to represent information. This enables us to capture more nuanced details from lower-level features and derive semantic insights from higher-level features, thereby enhancing the overall representation capability. We propose the Cross-scale Enhancement Module (CEM) for this purpose, as shown in Figure 4. Feature enhancement is achieved by fusing the outputs of the four stages of the transformer block. "Top", "Mid", and "Bottom" denote the three features from neighboring stages. Note that "Top" and "Bottom" can be omitted.
CEM enhances features by combining features from adjacent stages. As shown in Figure 3, we use a residual learning scheme to perform feature fusion operations on the feature-map groups ($p^0_j$, $p^1_j$), ($p^0_j$, $p^1_j$, $p^2_j$), ($p^1_j$, $p^2_j$, $p^3_j$), and ($p^2_j$, $p^3_j$) separately, where j ∈ {0, 1} denotes pre-change and post-change. For example, let us consider the features extracted from the $t_0$ image.
Figure 4 shows the processing steps employed in CEM. We use a residual concatenation operation to merge the top feature $p^1_0$, which is downsampled to match the resolution of $p^2_0$ while the number of channels in the top feature map is reduced. Next, we apply a 3 × 3 convolutional operation to adjust the channel number of the mid feature layer $p^2_0$. Similarly, we process the lower-level feature with a 3 × 3 convolutional layer, integrated with both normalization and a ReLU activation function, and then upscale it in both height and width by a factor of two using linear interpolation. Finally, we concatenate the results obtained from the three branches to obtain the reinforced feature maps. In the formulation of these steps, SP(·) denotes a stochastic pooling operation, $\mathrm{Conv}_{3\times3}$ denotes a convolutional layer with a kernel size of 3 × 3, Cat represents the concatenation operation, and $d_0$ denotes the enhanced feature maps before the change.
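The three-branch procedure described above could be written along the following lines. This is only one reading of the prose, not the authors' formulation: the intermediate names $T$, $M$, and $B$, the operators $\mathrm{Up}_{\times 2}$, BN, and ReLU as composed here, and the placement of a convolution on the top branch are assumptions, and the residual connection mentioned in the text is omitted because its exact form is not specified.

```latex
\begin{aligned}
T   &= \mathrm{Conv}_{3\times3}\big(\mathrm{SP}(p^{1}_{0})\big)
      && \text{top branch: pool down to the resolution of } p^{2}_{0} \\
M   &= \mathrm{Conv}_{3\times3}\big(p^{2}_{0}\big)
      && \text{mid branch: adjust channels} \\
B   &= \mathrm{Up}_{\times 2}\Big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{3\times3}(p^{3}_{0})\big)\big)\Big)
      && \text{bottom branch: refine, then upsample by } 2 \\
d_0 &= \mathrm{Cat}(T,\, M,\, B)
      && \text{enhanced pre-change feature}
\end{aligned}
```

All three branch outputs share the spatial size of $p^{2}_{0}$, which is what makes the channel-wise concatenation well defined.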

Multi-Branch Change Fusion Module (MCFM)
To effectively account for the diverse nature of changes in RSCD, we divide the variations into three categories: "Appear", "Disappear", and "Exchange". As illustrated in Figure 1, "Appear" indicates the presence of an object solely in $t_1$; "Disappear" denotes the presence of an object solely in $t_0$; and "Exchange" signifies differences between objects in $t_0$ and $t_1$ at the same location. We argue that these three change types cover the vast majority of change requirements in RSCD, and that the network should model them separately.
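The taxonomy can be made concrete at the label level. The sketch below is a toy illustration only: it classifies each location from hypothetical object labels at $t_0$ and $t_1$ (0 meaning background), whereas the MCFM itself operates on learned feature maps, not on labels. The function names and the example label maps are invented for this illustration.

```python
BACKGROUND = 0


def change_type(before, after):
    """Classify one location from its object label at t0 and t1 (0 = background)."""
    if before == after:
        return "unchanged"
    if before == BACKGROUND:
        return "appear"      # object present only at t1
    if after == BACKGROUND:
        return "disappear"   # object present only at t0
    return "exchange"        # different objects at the same location


def binary_change_mask(t0, t1):
    """1 where any of the three change types occurs, 0 elsewhere."""
    return [[int(change_type(a, b) != "unchanged") for a, b in zip(row0, row1)]
            for row0, row1 in zip(t0, t1)]


# Hypothetical 2x3 label maps: 0 = background, 1 and 2 = two different objects.
t0 = [[0, 1, 1],
      [0, 1, 0]]
t1 = [[1, 0, 2],
      [0, 1, 0]]

types = [[change_type(a, b) for a, b in zip(row0, row1)] for row0, row1 in zip(t0, t1)]
mask = binary_change_mask(t0, t1)
```

The binary mask is the union of the three types, which is how three separately modeled change forms still yield a single two-class segmentation target.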
Figure 5 explains our proposed MCFM. The upper three branches utilize the subtraction operation to capture the three distinct types of changes, while the "Distance" branch is a widely used method in RSCD to generate difference feature maps and has been proven to further enhance the network's detection of the changed regions [19,21]. We employ channel self-attention instead of spatial self-attention in the "Exchange" branch for the following reason: in the "Appear" and "Disappear" branches, there is a transition from "0" to "1" or from "1" to "0", respectively, as indicated by the green and yellow boxes shown in Figure 1. The spatial information of the changing objects effectively captures the differences between these states. However, in the "Exchange" branch, the transition occurs between two instances of "1", as depicted by the blue box in Figure 1 (two different cars in the same location). Therefore, it is more pertinent to focus on the channel information changes of the feature map in the "Exchange" branch.

To provide further elaboration, the decoder outputs, $p^i_0$ and $p^i_1$, have dimensions of $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$. Specifically, $p^i_0$ denotes the i-th layer of pre-change features, whereas $p^i_1$ represents the i-th layer of post-change features obtained through the decoder process. In the formulation of the four branches responsible for extracting change regions, BN refers to batch normalization, SA represents spatial attention [32], and CA stands for channel attention [32]. As illustrated in Figure 5, spatial attention is utilized to enhance the network's sensitivity to regions where positional information plays a crucial role in detecting changes. Similarly, channel attention facilitates reinforcement and refinement of the change regions by enabling interactions between corresponding feature maps. Moreover, merging the two feature maps along their channel dimension broadens the coverage of change regions. This strategy effectively integrates information from both pre-change and post-change feature maps, thereby enhancing the representation of change patterns. To further consolidate the change branches, a branch-fuse operation involving a 1 × 1 convolutional layer is applied after the activation map. This fusion process yields enhanced multi-branch change fusion features, facilitating a more comprehensive representation of changes. In the change-fusing formulation, $m_i$ denotes the output of the i-th layer, $\mathrm{Conv}_{1\times1}$ denotes a 1 × 1 convolutional layer, and CAT denotes the concatenation operation along the channel dimension. Notably, the number of feature channels does not change through the MCFM.

Decoder
The decoder is responsible for further multi-size fusion of the fused change region features output from the MCFM module, and a simple but effective decoder is used to generate the final change map.
Temporal Feature Aggregation Module (TFAM). Integrating diverse spatial features has been shown to be an effective strategy for addressing multi-scaled change objects [33]. To this end, we propose a simple yet effective temporal feature aggregation module to merge spatial change features derived from the MCFM. Our approach is inspired by the success of the Feature Pyramid Network (FPN) [34]. We utilize a simplified FPN to merge spatial features. Using a top-down pathway strategy, we compute a feature hierarchy consisting of feature maps at different scales, merging the semantic features of the higher level with those of the lower layer through upsampling, as illustrated in Figure 6. With four different scale features (1/4, 1/8, 1/16, and 1/32), all feature maps undergo further processing with a 3 × 3 convolutional layer before the downscaled feature maps are subjected to 2× bilinear upsampling and merged with the original lower-level feature map via element-wise addition. This simple architecture allows for the fusion of multi-scale feature maps with a minor increase in parameter count, thereby improving the accuracy of the final prediction results.
MLP Lightweight Decoder. After constructing the final feature map, we employ a simple All-MLP decoder, as shown in Figure 7, inspired by SegFormer [24]. The primary aim of the decoder is to reduce parameters while maintaining a powerful decoding ability. Firstly, we use a linear function to adjust the channel dimension of the feature map to $C_{ebd}$. This makes it possible to control the overall parameter size of the decoder by adjusting the value of $C_{ebd}$. In the next step, the feature maps are upsampled to 1/4 of their original resolution and concatenated to allow for the integration of multi-level information. Similar to the previous step, we employ a linear function to decrease the number of channels in the concatenated feature map. This helps to reduce parameters without compromising model performance. Finally, we use another MLP to generate the segmentation mask, which is upsampled to a resolution of H × W × 3, giving us the final change map. This process enables the integration of multi-level features and facilitates accurate classification of the input data. In the decoder formulation, $M_{out}$ and Mask refer to the predicted mask before and after upsampling, respectively. $\mathrm{Linear}(C_{in}, C_{out})(\cdot)$ refers to a linear function that maps the number of channels of . . . from $C_{in}$ to $C_{out}$ dimensions. $C_{ebd}$ is the number of channels of the feature after the MLP, and its size determines the scale of our model.
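To see how $C_{ebd}$ controls the decoder's size, the following sketch counts the parameters of the linear layers in an All-MLP decoder of the kind described: one projection per stage, one fusion layer over the concatenated stages, and one classification layer. It is a rough estimate under stated assumptions, not the paper's parameter budget: the two-class output, the candidate values of $C_{ebd}$, and the omission of any normalization layers are all choices made here for illustration.

```python
def linear_params(c_in, c_out):
    """Weights plus biases of one linear (fully connected) layer."""
    return c_in * c_out + c_out


def mlp_decoder_params(stage_channels, c_ebd, n_classes=2):
    """Parameters of the linear layers in an All-MLP decoder sketch."""
    # One projection per stage, mapping C_i channels to C_ebd.
    project = sum(linear_params(c, c_ebd) for c in stage_channels)
    # Fusion layer applied to the concatenation of all projected stages.
    fuse = linear_params(len(stage_channels) * c_ebd, c_ebd)
    # Final layer producing per-pixel class scores (assumed: changed/unchanged).
    classify = linear_params(c_ebd, n_classes)
    return project + fuse + classify


channels = [32, 64, 160, 256]   # C_i from the encoder description
for c_ebd in (128, 256):        # hypothetical embedding widths
    print(c_ebd, mlp_decoder_params(channels, c_ebd))
```

Because the fusion layer scales with the square of $C_{ebd}$, doubling the width roughly triples the count in this example (131,970 versus 395,010 parameters).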

Hybrid Loss Function
Change detection tasks are fundamentally binary classification problems. However, the highly imbalanced proportion of changed and unchanged regions in the input data can have a significant deleterious effect on the model's performance. To address this issue and guide the network to learn from complex scenes, we propose a hybrid loss function that consists of three parts: binary cross-entropy (BCE) loss, dice (Dice) loss [35], and an additional term. The BCE loss is formulated as

$$L_{bce} = -\left(y \cdot \log \hat{y} + (1 - y) \cdot \log(1 - \hat{y})\right)$$

where · denotes a dot-product operation, and y and ŷ are the ground truth and the corresponding predicted mask, respectively. The Dice loss can be formulated as

$$L_{dice} = 1 - \frac{2\,\lVert y \cdot \hat{y} \rVert_1}{\lVert y \rVert_1 + \lVert \hat{y} \rVert_1}$$

where $\lVert \cdot \rVert_1$ denotes the l1 norm. The total loss is then a weighted combination of the individual losses, where $\lambda_1$ and $\lambda_2$ are the weights of each loss function, and we set them to 0.4 and 0.6, respectively.
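A minimal sketch of a weighted BCE plus Dice loss on flattened masks is given below, to show why the Dice term helps with class imbalance. It uses the standard definitions of both losses, and it assumes that the weight 0.4 applies to BCE and 0.6 to Dice (the order in which they are listed); the additional term mentioned in the text is not modeled. The function names, the `eps` smoothing constant, and the example masks are illustrative.

```python
import math


def bce_loss(y, y_hat, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for t, p in zip(y, y_hat):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y)


def dice_loss(y, y_hat, eps=1e-7):
    """1 - Dice coefficient; sums act as l1 norms for non-negative masks."""
    intersection = sum(t * p for t, p in zip(y, y_hat))
    return 1.0 - (2.0 * intersection + eps) / (sum(y) + sum(y_hat) + eps)


def hybrid_loss(y, y_hat, lam_bce=0.4, lam_dice=0.6):
    """Weighted sum of BCE and Dice (weight assignment is an assumption)."""
    return lam_bce * bce_loss(y, y_hat) + lam_dice * dice_loss(y, y_hat)


# Imbalanced toy mask: one changed pixel out of eight.
y = [1, 0, 0, 0, 0, 0, 0, 0]
good = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]   # finds the changed pixel
all_unchanged = [0.1] * 8                         # misses it entirely

print(bce_loss(y, good), bce_loss(y, all_unchanged))
print(dice_loss(y, good), dice_loss(y, all_unchanged))
print(hybrid_loss(y, good), hybrid_loss(y, all_unchanged))
```

On this toy mask the prediction that misses the single changed pixel is penalized far more heavily by the Dice term than by BCE, since BCE averages over the many correctly classified unchanged pixels.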

Dataset
To verify our proposed method's performance, we used two publicly available CD datasets named LEVIR-CD and WHU-CD.The detailed information is listed as follows.
The LEVIR-CD dataset [19] is a binary CD dataset comprising 637 pairs of very-high-resolution (VHR) image patches. Each patch has a size of 1024 × 1024 pixels, with a resolution of approximately 0.5 m/pixel. These image pairs were derived from Google Earth global images of Texas, spanning the years 2002 to 2018. To conduct our experiment, we cropped non-overlapping patches of size 256 × 256 and randomly split them into three parts: 70% for training, 10% for validation, and 20% for testing. Finally, we obtained 7120/1024/2048 image pairs for train/val/test, respectively.
The WHU-CD dataset [36] contains just one pair of images, with a size of 32,507 × 15,354 pixels, cropped from a wider geographic area. This dataset consists of aerial images obtained in April 2012 that contain 12,796 buildings in 20.5 km² (16,077 buildings in the same area in the 2016 dataset) with 1.6-pixel accuracy. Following [37], we cropped the original image pair in a non-overlapping manner, which yielded 7434 small images of size 256 × 256. We then randomly divided all the images into training, validation, and test sets with ratios of 70%, 10%, and 20%, respectively. Finally, we obtained 5203/743/1488 image pairs for train, val, and test, respectively.
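The patch counts quoted above can be reproduced with integer arithmetic. The sketch below counts non-overlapping 256 × 256 tiles (discarding incomplete border tiles, which is an assumption about how the cropping was done) and applies a 70/10/20 split with the remainder assigned to the test set; the helper names are hypothetical.

```python
def count_tiles(width, height, tile=256):
    """Number of complete non-overlapping tiles; partial border tiles are dropped."""
    return (width // tile) * (height // tile)


def split_counts(n, train_pct=70, val_pct=10):
    """Train/val/test sizes, with the remainder going to the test set."""
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return n_train, n_val, n - n_train - n_val


# WHU-CD: one 32,507 x 15,354 image pair.
whu_tiles = count_tiles(32507, 15354)   # 126 * 59 = 7434
whu_split = split_counts(whu_tiles)     # (5203, 743, 1488)

# LEVIR-CD: 637 pairs of 1024 x 1024 images, 16 tiles each.
levir_tiles = 637 * count_tiles(1024, 1024)   # 10192 = 7120 + 1024 + 2048

print(whu_tiles, whu_split, levir_tiles)
```

Both totals agree with the figures reported for the two datasets.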

Metrics
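We calculated six widely used metrics to evaluate the performance of the proposed method [38,39]: precision (Pre), recall (Rec), F1-score (F1), overall accuracy (OA), κ coefficient, and Intersection-over-Union (IoU). Among these indicators, the F1-score is the most important. Pre and Rec reflect false detections and omissions, respectively: higher precision means fewer false detections, and higher recall means fewer missed changes. The larger their values, the better the prediction results. The calculation formulas for the six metrics are as follows:

$$\mathrm{Pre} = \frac{TP}{TP + FP}, \qquad \mathrm{Rec} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$$

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

$$\kappa = \frac{\mathrm{OA} - p_e}{1 - p_e}, \qquad p_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(TP + TN + FP + FN)^2}$$

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively, and $p_e$ is the agreement expected by chance.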
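For reference, the six metrics can be computed from a pixel-level confusion matrix as sketched below, using their standard definitions. The function name and the example counts are illustrative and are not taken from the experiments.

```python
def cd_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, OA, IoU, and Cohen's kappa from confusion counts."""
    total = tp + fp + tn + fn
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    oa = (tp + tn) / total
    iou = tp / (tp + fp + fn)
    # Chance agreement: product of predicted and actual marginals, per class.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return {"Pre": pre, "Rec": rec, "F1": f1, "OA": oa, "IoU": iou, "Kappa": kappa}


# Hypothetical counts for an imbalanced scene: 10% of pixels are changed.
m = cd_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this example OA is 0.96 while F1 is only 0.80 and IoU about 0.67, which illustrates why F1 is treated as the primary indicator when unchanged pixels dominate.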

Training Details
In our experiments, we implemented our model using PyTorch and trained it on an NVIDIA RTX 3090 Ti GPU. The backbone was initialized with parameters from the MiT-B0 model [24] pretrained on ImageNet-1K, while the remaining parts were randomly initialized. We utilized data augmentation techniques, including random flipping, random rescaling (0.8-1.2), and random temporal exchange. The AdamW [40] optimizer was applied to optimize the loss function with a weight decay of 0.0001 and beta values of (0.9, 0.99). The learning rate was initially set to 0.0005 and decayed linearly to 0 by the final epoch. To account for GPU memory limitations, the model was trained with a batch size of 32. During the training process, we implemented a strategy to reduce overfitting by selecting difficult samples. As with the comparison methods, we evaluated multiple experimental results and chose the best one as the final result. These measures ensure the robustness of our model.
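Two of the training choices above are simple enough to sketch in plain Python: the linear learning-rate decay and the random temporal exchange augmentation. The total number of epochs (200 here), the swap probability of 0.5, and the function names are assumptions for illustration, as none of them are specified in the text.

```python
import random


def linear_decay_lr(epoch, total_epochs, base_lr=5e-4):
    """Learning rate that falls linearly from base_lr at epoch 0 to 0 at the end."""
    return base_lr * (1.0 - epoch / total_epochs)


def random_temporal_exchange(img_t0, img_t1, p=0.5, rng=random):
    """Swap the pre- and post-change images with probability p.

    A binary change mask is symmetric in time, so the label stays the same.
    """
    if rng.random() < p:
        return img_t1, img_t0
    return img_t0, img_t1


total_epochs = 200  # hypothetical; the number of epochs is not given
schedule = [linear_decay_lr(e, total_epochs) for e in (0, 100, 200)]
print(schedule)

a, b = random_temporal_exchange("t0_image", "t1_image", p=1.0)
print(a, b)
```

Temporal exchange is a natural fit for binary change detection because swapping the two dates turns every "Appear" into a "Disappear" and vice versa without altering which pixels changed.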

Baselines
To demonstrate the effectiveness of our approach, we compared our results with those reported in [22]. We considered the three models presented in [10]. Moreover, to compare our model with other works adopting both spatial and channel attention mechanisms, we included [20,41,42]. Finally, given the success achieved by transformers in the computer vision field, we also compared our results with those obtained in [22,31]. We reproduced all baseline methods using the modified codes [43] under their suggested parameters for fair comparisons. To further evaluate the validity of our proposed method, we also report the model parameters and computation costs of the above methods for reference.

Compared with the State-of-the-Art
In this section, we present a comprehensive comparison of our proposed model with several existing methods on two benchmark datasets, namely, LEVIR-CD and WHU-CD.The compared methods can be categorized into attention-based approaches and other efficient encoder-decoder structures.To ensure a fair and unbiased evaluation, we meticulously re-implemented all these methods and replicated their results within the same experimental environment.For each comparative method, we carefully selected a set of optimal hyperparameters that maximized the F1 score on the validation subset.This approach guarantees that all methods are fine-tuned under the same criteria, enabling a meaningful and consistent performance comparison.
The quantitative evaluation results for the two datasets are presented in Tables 1 and 2. Furthermore, the qualitative assessment of the comparative methods is visualized in Figures 8 and 9. These figures depict true positive (TP) regions in white, false positive (FP) regions in blue, false negative (FN) regions in red, and true negative (TN) regions in black. These visualizations allow for a comprehensive comparison of the methods' performance. To evaluate our proposed model in terms of both accuracy and model size, we compared it with a method that strikes a good balance between these factors. When comparing our proposed network with transformer-based approaches, it is evident that BIT, despite its smaller parameter count, lags behind our network by 4.74%, 6.72%, and 7.49% in terms of κ, F1, and IoU on the LEVIR-CD dataset. Conversely, ChangeFormer, which utilizes the transformer as its backbone, surpasses CNN-based and self-attention-based methods in terms of κ, F1, and IoU, indicating the superiority of transformer-based feature extraction for remote sensing images. However, ChangeFormer exhibits a significantly larger parameter size of 40.5 M, roughly four times the size of our proposed network. Ultimately, our proposed SMBCNet achieves the best overall performance with a κ of 0.9032, an F1 score of 0.9087, and an IoU of 0.8316. This is attributed to the ability of our method to identify pseudo-change regions from multi-temporal images. Notably, SMBCNet demonstrates its superiority despite having a moderate parameter count of 10.14 M, which is significantly smaller than the pure transformer-based ChangeFormer (47.3 M) and CNN-based DSIFN (42.38 M) methods.
Among traditional CNN methods, the FC-EF, FC-Siam-conc, and FC-Siam-diff networks have much smaller parameter sizes, but their performance metrics are not impressive. In terms of the F1 metric on the LEVIR-CD dataset, our network shows a significant improvement of 4.3% compared to the best-performing network among the three. Although DTCDSCN performs better than these three methods, its number of parameters makes it difficult to use in practical applications. This indicates that CNN methods have limitations in remote sensing object detection.
Figure 8 presents a visual comparison of various CD methods. It is apparent that misclassified changed pixels are prevalent in the results of all methods except for our approach. Specifically, the results of SNUNet, BIT, and ChangeFormer exhibit a noticeable missing part of the building. In the top row of Figure 8, FC-Siam-diff, FC-EF, FC-Siam-conc, DSIFN, SNUNet, BIT, P2V-CD, and ChangeFormer fail to accurately localize the changed buildings, resulting in erroneous predictions. In contrast, our proposed method accurately and comprehensively detects the changed buildings. Notably, the change maps produced by SMBCNet exhibit the most favorable visual effect, appearing the closest to the ground truth.
Experimental Results on the WHU-CD Dataset: Table 2 shows the performance metrics for the WHU-CD dataset. The experimental results indicate that FC-EF, FC-Siam-conc, and FC-Siam-diff do not perform better than the other methods. Although DSIFN yields higher precision and recall than the aforementioned methods, its F1 score lags behind that of Changeformer, the pure transformer-based method. P2V-CD proves to be a promising solution, delivering favorable results across different datasets. Despite being the largest model on the list, DSIFN effectively prevents overfitting by utilizing pretrained encoders. It achieves a precision score of 0.9626 and the second-highest F1 score of 0.9127 on this dataset. Meanwhile, SNUNet fails to deliver competitive results, despite having more network parameters than the comparatively smaller CDNet model. Notably, our proposed SMBCNet outperforms the other methods by at least 2.82% in F1 score, indicating its superior performance in detecting changes between two remote sensing images.
Figure 9 presents a qualitative evaluation of the change detection techniques on the WHU-CD dataset, providing a more intuitive comparison of the methods. The comparison shows that the majority of the CD methods produce spurious changes or missed detections, especially in heavily built-up areas. For example, in row 3 of Figure 9, all methods besides our proposed technique misclassify unchanged regions as changed, leading to the incorrectly classified areas shown in blue in the figure. Table 3 shows the comparison between our proposed network and the selected method in terms of parameter count and accuracy on the LEVIR-CD dataset. Our proposed approach improves on existing lightweight RSCD methods, which supports its efficacy in the task of change detection between RSIs. Furthermore, the scalability and generalizability of our approach make it a promising solution for future research in the field of RSCD. Although it does not attain state-of-the-art results in terms of parameter size, SMBCNet improves on Changeformer, which also employs a transformer-based architecture, in both accuracy and parameter count. Qualitatively, our method stands out, as it accurately identifies change objects while effectively suppressing background interference between the bi-temporal images. Our method utilizes the strong contextual-dependency-capturing ability of transformers by progressively aggregating multilevel temporal difference features in a coarse-to-fine manner, which results in a more refined change map for RSCD. The qualitative results indicate that our proposed method outperforms other techniques in detecting change objects accurately and in mitigating the negative influence of background interference in the bi-temporal images.

Ablation Study
To verify the effectiveness of the components and configurations of the proposed SMBCNet, we conduct comprehensive ablation studies on two RSCD datasets.
Effectiveness of transformer backbone. To verify the effectiveness of the transformer-based encoder in our network, we conducted ablation experiments using different lightweight CNN-based backbones. The results are summarized in Table 4, which includes the parameters and accuracy for both the LEVIR-CD and WHU-CD datasets. Note that the "Params" column in Table 4 refers to the size of a single backbone, not the size of the entire model. Additionally, "TB" refers to "transformer blocks", as illustrated in Figure 2. The table shows that our transformer-based encoder outperforms the CNN-based backbones in both model size and feature extraction capability for RSIs. Furthermore, we combined our proposed CEM with the compared backbones to validate its effectiveness. The results demonstrate that MobileNetV2 combined with our proposed CEM achieves the highest accuracy on the LEVIR-CD dataset, while the encoder used by our network comes in second place. However, MobileNetV2+ contains 14 M parameters, which is nearly double the number of parameters in our proposed encoder.
Effectiveness of MCFM. We devise the MCFM to extract the change information and fuse temporal features; it offers high interpretability and reveals the essential characteristics of CD. The MCFM aims to account for the diverse nature of changes in RSCD and to enhance changed regions. It has four branches, as illustrated in Figure 5. To validate the effect of the different branches on the whole network, we added the four branches in turn, and the results are shown in Table 5. Since our data augmentation includes a strategy that swaps the two images of each pair, "A + DA" is much more accurate than "A", improving the F1 score by nearly 4.72%. After adding the last two branches, we achieve the highest accuracy, which shows that each branch plays a significant role in the change feature fusion. We argue that, with the proposed MCFM, we are able to "reduce" CD to semantic segmentation, which means tailoring an existing and powerful semantic segmentation network to solve CD.
In the course of these MCFM ablation experiments, we also tested configurations that use only "DA", "E", and "D", but the results were not optimal. This is because, in real scenes, "Appear" and "Disappear" changes occur randomly, so using only one of the two branches will not achieve the best result. To simulate this process and increase the robustness and generalizability of our model, we use a random temporal exchange data augmentation technique. Finally, our experiments show that the results obtained by using all four branches are optimal.
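The random temporal exchange augmentation mentioned above can be sketched as follows. This is a minimal illustration rather than our training code: the function name `random_temporal_exchange` and the swap probability `p=0.5` are assumptions. Swapping the two dates leaves a binary change label unchanged, but turns every "Appear" change into a "Disappear" change and vice versa, so both branches see both kinds of samples.

```python
import numpy as np


def random_temporal_exchange(img_t0, img_t1, label, rng, p=0.5):
    """Swap the two acquisition dates with probability p.

    The binary change label is symmetric in time, so it is returned unchanged.
    """
    if rng.random() < p:
        return img_t1, img_t0, label
    return img_t0, img_t1, label


# Usage with dummy 4x4 RGB images and a binary label.
rng = np.random.default_rng(0)
t0 = np.zeros((4, 4, 3), dtype=np.uint8)
t1 = np.full((4, 4, 3), 255, dtype=np.uint8)
label = np.ones((4, 4), dtype=np.uint8)
a, b, y = random_temporal_exchange(t0, t1, label, rng)
print(a.mean(), b.mean(), y.sum())
```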
As shown in Figure 10, we utilize heatmaps to visualize the feature maps derived from the MCFM. These heatmaps are generated by analyzing the variance of all feature maps. After applying the MCFM, regions that have undergone changes exhibit increased energy, with particularly enhanced intensity at the edges of the target. We can observe that the low-resolution feature map (e) is responsible for localizing the change area, while the high-resolution feature map (d) is responsible for making the edges of the change object more accurate. In essence, this approach strengthens and precisely localizes the edges of the changed target, leading to improved detection performance.

Influence of the size of the transformer blocks. We analyzed the effect of increasing the size of the encoder on performance and model efficiency, and Table 6 summarizes the results for the three datasets. The $C_i$ and $D_i$ mentioned in Section 3 control the size of our encoder. We observed that increasing the size of the encoder does not lead to consistent improvements in accuracy on the LEVIR-CD dataset; $C_i = \{64, 128, 320, 512\}$ with $D_i = \{2, 2, 2, 2\}$ achieves the best performance on this dataset. We believe that this inconsistency may be due to variations in overfitting tendencies that we observed during model training, which are directly influenced by the size of the encoder. While the combination of $C_i = \{64, 128, 320, 512\}$ and $D_i = \{2, 2, 2, 2\}$ yielded the best accuracy, it also exhibited potential drawbacks regarding real-time processing. The configuration $C_i = \{32, 64, 160, 256\}$ and $D_i = \{3, 6, 16, 3\}$ strikes a balance between accuracy and computational cost. By selecting these parameters, we aim to achieve a reasonable level of accuracy while ensuring that the model can operate efficiently in real-time scenarios.
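Returning to the variance-based heatmaps used for Figure 10, one plausible reading is that each pixel's value is the variance of its feature vector across channels, rescaled for display. The sketch below follows that assumption; the function name `variance_heatmap` and the min-max normalization are illustrative choices, not a description of the exact plotting code.

```python
import numpy as np


def variance_heatmap(features):
    """Map a (C, H, W) feature tensor to an (H, W) heatmap in [0, 1].

    Each pixel's value is the variance of its feature vector across the
    channel axis, followed by min-max normalization for display.
    """
    features = np.asarray(features, dtype=np.float64)
    var = features.var(axis=0)
    span = var.max() - var.min()
    if span == 0:
        return np.zeros_like(var)
    return (var - var.min()) / span


# Dummy features: only the top-left pixel varies across channels.
feats = np.zeros((4, 2, 2))
feats[:, 0, 0] = [0.0, 1.0, 2.0, 3.0]
print(variance_heatmap(feats))
```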
Influence of $C_{ebd}$, the MLP decoder channel dimension. We analyze the impact of the channel dimension $C_{ebd}$ in the MLP decoder, as discussed in Section 3.4. Table 7 reports the model's performance and parameter count as a function of this dimension. Our findings indicate that setting $C_{ebd} = 256$ yields highly competitive performance at a low computational cost. As the value of $C_{ebd}$ increases, the model's performance improves, but the resulting models are larger and less efficient. Moreover, an excessively large value of $C_{ebd}$ may lead to overfitting and a subsequent decrease in the model's accuracy, despite the increase in the number of parameters. Based on these findings, we select $C_{ebd} = 256$ as the dimension for our final SMBCNet.
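To give a rough sense of how the decoder size scales with $C_{ebd}$, the sketch below counts the weights of a generic all-MLP decoder. It assumes a SegFormer-style layout (one linear projection per encoder stage, one linear fusion layer over the concatenated features, and a linear classifier) and ignores normalization layers, so the numbers are only indicative and need not match Table 7. The function name `mlp_decoder_params` and the candidate values of $C_{ebd}$ are hypothetical; the stage channels reuse the encoder configuration $\{32, 64, 160, 256\}$ discussed above.

```python
def mlp_decoder_params(stage_channels, c_ebd, num_classes=2):
    """Count weights and biases of an assumed SegFormer-style MLP decoder."""
    # One Linear(C_i, C_ebd) per encoder stage.
    proj = sum(c * c_ebd + c_ebd for c in stage_channels)
    # One Linear(len(stages) * C_ebd, C_ebd) fusing the concatenated features.
    fuse = len(stage_channels) * c_ebd * c_ebd + c_ebd
    # One Linear(C_ebd, num_classes) classifier.
    head = c_ebd * num_classes + num_classes
    return proj + fuse + head


for c_ebd in (64, 128, 256, 512):
    print(c_ebd, mlp_decoder_params((32, 64, 160, 256), c_ebd))
```

Under these assumptions the fusion layer grows quadratically with $C_{ebd}$, which is consistent with the observation that larger values quickly make the model less efficient.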

Discussion
The CD task can be viewed as a semantic segmentation problem, which suggests that techniques developed for one task can be transferred to the other. Specifically, there are approximately three types of changes, namely, "Appear", "Disappear", and "Exchange". Recent advances in transformer-based models provide a promising way to capture global context information. However, limited by excessive computational complexity and inaccurate self-attention calculation, accurate and relatively lightweight CD of RSIs using transformer-based models still requires improvement. We address the CD problem by introducing a mechanism for multi-branch change feature fusion. In particular, we combine and enhance the features of the change region in remote sensing images. As a result, we can use existing semantic segmentation networks to solve the CD problem.
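The three change types can be made concrete with a small example. The sketch below assumes that per-pixel semantic label maps are available for both dates, with 0 denoting background; this is a hypothetical setting used only for illustration, since LEVIR-CD and WHU-CD provide binary change labels, and the function name `change_types` is likewise an assumption.

```python
import numpy as np


def change_types(sem_t0, sem_t1, background=0):
    """Split changes into 'Appear', 'Disappear' and 'Exchange' masks."""
    sem_t0 = np.asarray(sem_t0)
    sem_t1 = np.asarray(sem_t1)
    obj_t0 = sem_t0 != background
    obj_t1 = sem_t1 != background
    appear = ~obj_t0 & obj_t1                       # object only at t1
    disappear = obj_t0 & ~obj_t1                    # object only at t0
    exchange = obj_t0 & obj_t1 & (sem_t0 != sem_t1)  # object class replaced
    return appear, disappear, exchange


# One-dimensional toy example with classes 1 and 2.
appear, disappear, exchange = change_types([0, 1, 1, 2], [1, 0, 2, 2])
binary_change = appear | disappear | exchange
print(appear, disappear, exchange, binary_change)
```

The union of the three masks recovers the usual binary change map, which is why a network that segments these cases can be used for CD.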
Unlike previous RSCD methods that incorporated self-attention into CNNs, our approach utilizes a pure transformer-based network with a Siamese structure to address CD problems. As shown in Table 4, the transformer-based encoder outperforms CNN-based encoders in RSCD. This is because RSIs present challenges for object analysis and interpretation, including difficulties in accurately identifying and delineating objects with complex shapes or overlapping features, and in distinguishing between objects with similar spectral reflectance and texture.
However, further improvements are still necessary. Figure 11 displays some failure cases on the two RSCD datasets, where the proposed SMBCNet either incorrectly predicts the changed regions or fails to capture them because the objects in question exhibit minimal differences from their surroundings. Nevertheless, as discussed in Section 4.5, our approach outperforms state-of-the-art methods in addressing the most challenging scenarios for the RSCD task.

Conclusions
This study introduces a novel RSCD network termed SMBCNet. Unlike current CD approaches that employ large CNNs as backbones, we utilize a hierarchically structured transformer encoder combined with an MLP decoder, achieving higher accuracy with a lighter model than other pure transformer-based RSCD methods. To extract global information efficiently, we propose the CEM, which balances information extraction and computational complexity. Additionally, we introduce the MCFM to address the diverse nature of changes, effectively transforming the CD task into a semantic segmentation problem. Furthermore, we introduce the TFAM to integrate features from various spatial scales. Our network achieves state-of-the-art performance on two publicly available CD datasets with a significantly lower parameter count than comparable transformer-based networks. Through the incorporation of the CEM, MCFM, and TFAM modules, we achieve improved results in change detection while keeping the computational cost moderate. These findings showcase the potential of the proposed method for practical RS applications.
There are certain limitations in our study that should be acknowledged, particularly the fact that the majority of the samples in the LEVIR-CD and WHU-CD datasets consist of buildings. However, none of the steps involved in our proposed pipeline are exclusive to building change detection. Therefore, our method can be readily extended to other forms of change detection in RSIs. To illustrate this, we conducted tests on the HRSCD [36] dataset, which encompasses a range of changing objects, including artificial surfaces, agricultural areas, forests, wetlands, water, and unidentified regions. Another limitation concerns fine-grained change detection, particularly in scenarios where the changed area has a small spatial scale and an irregular distribution and geometric appearance. In future research, we aim to address this limitation by devising techniques to effectively discriminate between uncertain pixel-level differences across the entire image pair.

Figure 1. Illustration of the three types of change. In our view, the object changes in RSCD can be divided into three categories: "Appear", "Disappear", and "Exchange". (a) $t_0$ images. (b) $t_1$ images.

Figure 2. Our proposed SMBCNet framework consists of three main components: a transformer encoder, a multi-branch change fusion module (MCFM), and an MLP decoder. These components are responsible for feature extraction, change feature fusion, and feature decoding, respectively.
The $\mathrm{Reshape}\left(\frac{N}{R}, C \cdot R\right)(D)$ operation reshapes $D$ into a new tensor with dimensions $\frac{N}{R} \times (C \cdot R)$. $\mathrm{Linear}(C_{in}, C_{out})(D)$ denotes a linear layer that transforms a $C_{in}$-dimensional tensor $D$ into a $C_{out}$-dimensional tensor. This approach reduces the complexity of the self-attention mechanism from $O(N^2)$ to $O\left(\frac{N^2}{R}\right)$ by minimizing redundant computations, while maintaining the ability to extract features from RSIs.
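The sequence reduction described above can be sketched in a few lines of NumPy. This is a single-head illustration under simplifying assumptions: the query, key, and value projections are omitted, the same reduction weights are reused for keys and values, and the names `reduce_sequence` and `efficient_attention` are hypothetical. It is meant only to show why the attention matrix shrinks from $N \times N$ to $N \times \frac{N}{R}$.

```python
import numpy as np


def reduce_sequence(x, R, weight, bias):
    """Apply Reshape(N/R, C*R) followed by Linear(C*R, C) to x of shape (N, C)."""
    N, C = x.shape
    assert N % R == 0, "sequence length must be divisible by R"
    x_hat = x.reshape(N // R, C * R)   # (N/R, C*R)
    return x_hat @ weight + bias       # (N/R, C)


def efficient_attention(q, k, v, R, weight, bias):
    """Single-head attention with reduced keys and values."""
    k_r = reduce_sequence(k, R, weight, bias)
    v_r = reduce_sequence(v, R, weight, bias)
    scores = q @ k_r.T / np.sqrt(q.shape[1])             # (N, N/R), not (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)        # row-wise softmax
    return attn @ v_r                                    # (N, C)


# Usage with random data: N = 16 tokens, C = 4 channels, R = 4.
rng = np.random.default_rng(0)
N, C, R = 16, 4, 4
x = rng.normal(size=(N, C))
w = rng.normal(size=(C * R, C)) * 0.1
b = np.zeros(C)
out = efficient_attention(x, x, x, R, w, b)
print(out.shape)
```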

Figure 3. Illustration of the transformer blocks, which are composed of four stages. The feature map is downsampled after each stage.

Figure 4. Illustration of the proposed CEM. Feature enhancement is achieved by fusing the outputs of the four stages of the transformer block. "Top", "Mid", and "Bottom" denote the three features from neighboring stages. Note that "Top" and "Bottom" can be omitted.

Figure 5. Illustration of the proposed MCFM. We extract the texture features of the change region by fusing the four change branches. To highlight that the change-region features are enhanced after spatial attention and channel attention, we use a gradient color to represent the change feature vector after the self-attention operation.

Figure 6. Illustration of the proposed TFAM. The TFAM upsamples and fuses feature maps at multiple scales to achieve feature enhancement in regions of change.

Figure 7. Illustration of the proposed MLP decoder. The lightweight MLP decoder combines multiple feature maps with different shapes to predict the final change map.

Figure 9. Visual comparisons of the proposed method and the state-of-the-art approaches on the WHU-CD dataset. (a) $t_0$ images. (b) $t_1$ images. (c) Ground truth. (d) FC-ef. (e) FC-conc. (f) FC-diff. (g) DSIFN. (h) SNUNet. (i) BIT. (j) P2V-CD. (k) Changeformer. (l) Ours. We use different colors to represent true positives (white), false positives (blue), true negatives (black), and false negatives (red).

Experimental Results on the LEVIR-CD Dataset: Although FC-EF, FC-Siam-conc, FC-Siam-diff, and CDNet have the advantage of smaller memory footprints, they exhibit the poorest performance in terms of the κ, F1, and IoU metrics. On the other hand, attention-based models such as BIT and SNUNet demonstrate similar performance, but BIT achieves comparable results with only one-tenth of the parameters compared to SNUNet (4.02 M vs. 42.38 M). It is worth noting that SNUNet, incorporating self-attention mechanisms, shows enhanced accuracy over CNN-based methods, albeit with a larger model size of 18.68 M.

Table 1. Comparison experiment results with other models on the LEVIR-CD dataset. The best performance is indicated in bold.

Table 2. Comparison experiment results with other models on the WHU-CD dataset. The best performance is indicated in bold.

Table 3. Parameters, complexity, and performance comparison on the LEVIR-CD dataset. The best performance is indicated in bold.

Table 4. Comparison with lightweight CNN-based backbones on the LEVIR-CD dataset. Our lightweight transformer encoder has significant advantages in parameter count and accuracy. "+" denotes combining the backbone with our proposed CEM. The best performance is indicated in bold.

Table 5. MCFM ablation experiments on the LEVIR-CD dataset. "A", "DA", "E", and "D" denote the Appear, Disappear, Exchange, and Distance branches in Figure 5, respectively. The best performance is indicated in bold.

Table 6. CEM ablation experiments on the LEVIR-CD dataset. "+" denotes combining the backbone with our proposed CEM, and "TB" denotes the transformer block illustrated in Figure 2. The best performance is indicated in bold.

Table 7. Ablation study of $C_{ebd}$ in the MLP decoder on the LEVIR-CD and WHU-CD datasets. The best performance is indicated in bold.