Article

BCTDNet: Building Change-Type Detection Networks with the Segment Anything Model in Remote Sensing Images

1 College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao 266580, China
2 Land Surveying and Mapping Institute of Shandong Province, Jinan 250061, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2742; https://doi.org/10.3390/rs17152742
Submission received: 20 June 2025 / Revised: 25 July 2025 / Accepted: 6 August 2025 / Published: 7 August 2025

Abstract

Observing building changes in remote sensing images plays a crucial role in monitoring urban development and promoting sustainable urbanization. Mainstream change detection methods have demonstrated promising performance in identifying building changes. However, buildings have large intra-class variance and high similarity with other objects, limiting the generalization ability of models in diverse scenarios. Moreover, most existing methods only detect whether changes have occurred but ignore change types, such as new construction and demolition. To address these issues, we present a building change-type detection network (BCTDNet) based on the Segment Anything Model (SAM) to identify newly constructed and demolished buildings. We first construct a dual-feature interaction encoder that employs SAM to extract image features, which are then refined through trainable multi-scale adapters for learning architectural structures and semantic patterns. Moreover, an interactive attention module bridges SAM with a Convolutional Neural Network, enabling seamless interaction between fine-grained structural information and deep semantic features. Furthermore, we develop a change-aware attribute decoder that integrates building semantics into the change detection process via an extraction decoding network. Subsequently, an attribute-aware strategy is adopted to explicitly generate distinct maps for newly constructed and demolished buildings, thereby establishing clear temporal relationships among different change types. To evaluate BCTDNet’s performance, we construct the JINAN-MCD dataset, which covers Jinan’s urban core area over a six-year period, capturing diverse change scenarios. Moreover, we adapt the WHU-CD dataset into WHU-MCD to include multiple change types. Experimental results on both datasets demonstrate the superiority of BCTDNet. On JINAN-MCD, BCTDNet achieves improvements of 12.64% in IoU and 11.95% in F1 compared to the second-best methods. Similarly, on WHU-MCD, it outperforms the second-best approaches by 2.71% in IoU and 1.62% in F1. BCTDNet’s effectiveness and robustness in complex urban scenarios highlight its potential for applications in land-use analysis and urban planning.

1. Introduction

Building change-type detection seeks to identify and classify various types of building changes—such as construction and demolition—from multi-temporal remote sensing images (RSIs). This task is both crucial and challenging in remote sensing change detection. Unlike traditional binary change detection, which only distinguishes between “change” and “no change,” building change-type detection offers richer semantic insights and better supports real-world applications like urban planning, disaster assessment, and building regulation. However, most existing studies still focus on binary tasks [1,2,3,4,5,6], with limited research on the fine-grained recognition of construction and demolition. The lack of high-quality datasets further impedes the development and adoption of building change-type detection methods.
As remote sensing technology advances and high-resolution imagery becomes more accessible, building change-type detection places greater demands on models’ ability to discriminate features. Construction and demolition often manifest as subtle local differences, making them susceptible to pseudo-changes. These changes also exhibit semantic asymmetry, requiring models to possess strong temporal modeling and contextual understanding. Deep learning has recently shown significant promise in change detection tasks. Convolutional Neural Networks (CNNs), known for their strong local feature extraction, were among the earliest mainstream approaches [1,5,7,8,9,10,11,12]; however, their limited receptive fields restrict long-range dependency modeling. Transformers, which use self-attention to enhance global modeling, are well-suited for complex change detection but face high computational costs and resource demands, hindering large-scale deployment [4,13,14,15,16,17,18]. The emerging Mamba architecture [19,20], based on state-space models, enables long-range modeling with linear computational complexity, making it a promising alternative to Transformers [6,21,22,23,24,25,26,27,28,29,30]. However, regardless of the architecture, CNNs, Transformers, or Mamba, most studies still emphasize binary classification, with limited progress in modeling complex semantic changes, such as construction and demolition.
Although a few studies have attempted to address building change-type detection, they face inherent limitations. For instance, BCE-Net [31] extracts change features by using labels from the earlier period and optical images from the later period. In contrast, our method uses optical images from both time periods and employs a dual encoder comprising SAM and CNN to extract robust features. Furthermore, via a three-branch change-aware attribute decoder, it simultaneously extracts buildings and identifies change types across bi-temporal images. Moreover, BCE-Net’s SI-BU and WHU-C1 datasets provide only the later-period optical images along with their corresponding labels but lack the earlier-period optical images. This composition differs from that of conventional change detection datasets, thus limiting the datasets’ applicability, as shown in Figure 1. Another recent effort uses a FastSAM-based [32] framework with a dual-branch mask supervision strategy to enhance temporal modeling. However, its insufficient capability in interacting with encoded features and leveraging temporal features restricts its capacity to effectively capture diverse change types.
To address these challenges, we present a deep learning network with the Segment Anything Model (SAM) designed for building change-type detection. The main contributions of this study include the following:
  • We present a building change-type detection network, BCTDNet, which utilizes dual-feature interaction and attribute-aware decoding to identify newly constructed and demolished buildings.
  • To improve building recognition, we design a dual-feature interaction encoder that integrates multi-granularity features from SAM and CNN, adopting interactive attention. Furthermore, we develop a change-aware attribute decoder that incorporates an attribute-aware strategy to explicitly generate discriminative maps for newly constructed and demolished buildings, ensuring clear change type separation.
  • We construct the JINAN-MCD dataset specifically for the change-type detection task. Covering urban core areas over a six-year period, the JINAN-MCD dataset captures diverse change scenarios. It contains bi-temporal images, extraction labels, and change-type labels, thus meeting the needs of multi-task execution.

2. Related Work

Recent advancements in deep learning have introduced innovative solutions for change detection, significantly enhancing both accuracy and efficiency. This section reviews the applications and limitations of deep learning-based methods in the context of building change detection.

2.1. Binary Building Change Detection Methods

Most deep learning-based change detection methods are based on CNNs, which leverage strong local feature extraction capabilities [9,33,34,35]. CNN-based approaches are widely used for binary change detection in remote sensing imagery. Daudt et al. [1] first introduced fully convolutional networks (FCNs) with the FC-EF model, followed by two Siamese variants—FC-Siam-conc and FC-Siam-diff—to model feature differences between bi-temporal images. Later research extended this framework with multi-scale fusion, deep supervision, and attention mechanisms. For instance, DSIFN [7] improves feature contrast using difference supervision; DSAMNet [3] combines convolution and attention for better discrimination; DTCDSCN [8] jointly models change detection and semantic segmentation to optimize feature representation. However, the limited receptive field of CNNs hinders the modeling of long-range semantic dependencies, leading to false positives and missed detections in complex or large-scale scenes.
To address CNNs’ global modeling limitations, Transformer architectures have been introduced for change detection. Transformers use self-attention to model dependencies across arbitrary positions, improving spatiotemporal understanding. Early methods, such as those by Chen et al. [36], used Transformer encoders for context modeling. ChangeFormer [15] combines hierarchical Transformers and MLP decoders in a Siamese structure to extract multi-scale global features from bi-temporal images. SwinSUNet [37] and SiamSwin adopt U-shaped Swin Transformer architectures with windowed attention to enhance sensitivity to local structural changes. Hybrid models like TransUNetCD [38] and ConvTransNet [39] combine CNN and Transformer advantages, preserving local detail and enhancing global semantic understanding. However, Transformer-based methods suffer from quadratic complexity with respect to input size, limiting scalability for high-resolution or large-scale applications.
Recently, state space models (SSMs), particularly the Mamba architecture, have emerged as promising alternatives, offering linear computational complexity and strong long-range dependency modeling. Zhao et al. [22] first applied Mamba to remote sensing change detection, designing a multi-directional scanning module to extract spatial features at multiple scales. Later models, such as ChangeMamba [40] and CDMamba [28], introduced spatiotemporal interaction and scaled residual modules to improve fine-grained detection. Compared to Transformers, Mamba offers strong global modeling at significantly lower computational cost, making it suitable for high-resolution, large-scale data. However, Mamba’s original design targets sequential data, posing challenges for adaptation to complex spatial structures and semantics in remote sensing images. Current research remains focused on structural optimizations and feature interaction improvements, mainly for binary change detection.
Although existing methods have achieved promising results in binary change detection, studies on fine-grained building change-type detection—particularly distinguishing between building construction and demolition—remain limited. Most models simply identify whether change has occurred, without differentiating the types of changes, which limits their practical utility in applications such as urban planning and building regulation. In contrast, building change-type detection provides richer semantic information, supporting tasks like automated map updating, urban expansion monitoring, and demolition tracking. Thus, extending current methods to building change-type detection not only addresses a key research gap but also offers significant practical value.

2.2. Building Change-Type Detection Methods

Research on building change-type detection remains limited, and publicly available datasets are scarce. Liao et al. [31] proposed BCE-Net, which uses historical maps and a contrastive learning framework to extract building change features, and introduced two annotated datasets, SI-BU and WHU-C1, to support building change-type detection. While this work offers initial data resources, its dependence on historical building labels and later imagery results in a “pseudo-bitemporal” setting, which diverges from standard bi-temporal change detection and limits generalizability to real-world applications.
Zhang et al. [32] introduced a multiclass building change detection framework based on FastSAM [41], which encodes fine-grained building features using a modified FastSAM and incorporates a general adaptation mechanism. A dual-branch mask supervision strategy was also designed to integrate semantic features from both temporal images into the change map, enhancing temporal modeling capabilities. Although this approach presents a novel perspective, it still has limitations in feature interaction and change discrimination. Specifically, the encoder adopts a single frozen FastSAM structure, limiting its ability to detect multi-granular features. Moreover, the decoder directly predicts multiple change types and fails to fully construct temporal relationships, which restricts the model’s ability to learn discriminative features specific to change types.
To address these limitations, this paper proposes a building change-type detection network based on SAM [42]. The method designs a feature interaction mechanism to integrate lightweight CNN and adapt SAM to extract multi-granular features. In the decoding phase, an attribute-aware strategy enables feature association across spatiotemporal scales, explicitly constructing the temporal relationship of building changes such as construction and demolition.

3. Methodology

3.1. Architecture Overview

Our proposed method comprises two dual-feature interaction encoders and a change-aware attribute decoder, as shown in Figure 2. Bi-temporal inputs (T1 and T2) are processed through dual-feature interaction encoders that perform deep feature extraction. The encoder incorporates two components: (1) a multi-scale adapter that learns SAM features, and (2) an interactive attention module (IAM) that effectively fuses SAM features with CNN features. The resulting dual-temporal fused features, along with their merged representations, are then fed into the change-aware attribute decoder. Within this decoder, the dual-temporal fused features undergo up-sampling while being supervised by building semantic labels, thereby providing crucial semantic guidance for change-type detection. Simultaneously, an attribute-aware strategy (AAS) leverages both the merged representations and dual-temporal fused features to generate distinct newly constructed and demolished attribute information. The final outputs contain three prediction results: two building extraction results and a change-type detection result.

3.2. Dual-Feature Interaction Encoder

As illustrated in Figure 2, the dual-feature interaction encoder fuses the SAM and CNN architectures to integrate their global and local modeling advantages, aiming to enhance the perception of global context and texture details. Specifically, SAM contributes robust global context modeling through its attention mechanisms, enabling long-range dependency capture and holistic scene understanding. Meanwhile, the CNN architecture excels in extracting fine-grained local features, such as texture details and spatial hierarchies, through its inductive bias for locality and translation invariance. Then, we introduce a simple and effective multi-scale adapter to learn features from SAM and transfer them to the remote sensing domain. To effectively incorporate global context and texture details, an interactive attention module is further designed to hierarchically fuse multi-scale features of the adapter and CNN.
To better adapt SAM to our building change-type detection task, we rescale the input images from 1024 × 1024 to 512 × 512 pixels. Moreover, we modify the output of SAM into four multi-scale features to improve the semantic perception ability for different types of buildings. These multi-scale features are 1/32, 1/16, 1/8, and 1/4 of the input image size. For the CNN encoder, it consists of four cascaded lightweight convolutional blocks, each structured as follows (illustrated in the bottom of Figure 2): First, the input passes through a batch normalization (BN) layer to stabilize its distribution. Next, a 1 × 1 convolution expands the channel dimension to four times its original size. A 3 × 3 depth-wise convolution then efficiently captures local spatial dependencies. Then, another 1 × 1 convolution reduces the channel dimension back to its original size, enabling seamless residual connection. Finally, a Gaussian Error Linear Unit (GELU [43]) activation is used to enhance nonlinear representational capacity. This design balances computational efficiency and representation power, thus obtaining local features of buildings.
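As a concrete illustration, the following PyTorch sketch shows one possible implementation of this lightweight convolutional block. The class name, the expansion factor of four, and the placement of the GELU activation after the residual sum are our assumptions for illustration rather than the authors' exact implementation.

```python
# A minimal sketch of the lightweight convolutional block described above
# (BN -> 1x1 expansion -> 3x3 depth-wise conv -> 1x1 reduction -> residual -> GELU).
import torch
import torch.nn as nn

class LightweightConvBlock(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.bn = nn.BatchNorm2d(channels)            # stabilize the input distribution
        self.expand = nn.Conv2d(channels, hidden, 1)  # 1x1 conv: expand channels 4x
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # 3x3 depth-wise conv
        self.reduce = nn.Conv2d(hidden, channels, 1)  # 1x1 conv: back to the original width
        self.act = nn.GELU()                          # nonlinearity applied after the residual sum

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.bn(x)
        y = self.expand(y)
        y = self.dwconv(y)
        y = self.reduce(y)
        return self.act(x + y)                        # residual connection, then GELU
```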

3.2.1. Multi-Scale Adapter

Although SAM possesses powerful semantic feature extraction capabilities, it still faces challenges in identifying small and indistinguishable buildings in RSI. Therefore, to enhance the model’s adaptability for change-type detection, we designed trainable multi-scale adapters to adjust the extracted SAM features, enabling learning of building structures and semantic patterns.
First, since SAM is built with Transformers, its feature maps are 3D (B × N × C), where B denotes the batch size, N the number of tokens, and C the number of channels. We transform SAM’s original four 3D features into 4D features (B × C × H × W) to ensure compatibility with convolutional operations, where H and W denote the spatial dimensions. Subsequently, we employ a 1 × 1 convolutional layer and a 3 × 3 convolutional layer to extract both fine architectural details and semantic information from the transformed multi-scale features.
Second, we denote the multi-scale features at 1/32, 1/16, 1/8, and 1/4 spatial scales as $F_{T_j}^{1}$, $F_{T_j}^{2}$, $F_{T_j}^{3}$, and $F_{T_j}^{4}$. Each feature is processed by a corresponding convolution block to obtain the adapted feature $F_{T_j}^{A_i}$, denoted as follows:
$F_{T_j}^{A_i} = \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}(F_{T_j}^{i})\big)\big)$
where Conv(·) denotes a 1 × 1 convolutional layer and BN(·) denotes batch normalization. The index i takes the values 1, 2, 3, and 4, indicating the four scales, and j takes the values 1 and 2, representing the two temporal phases.
Finally, we can obtain multi-scale adaptation features processed by the adapter. By incorporating multi-scale information, the network gains enhanced capability to handle varying object scales while simultaneously acquiring richer contextual information. Therefore, these features are better adapted to the subsequent decoder, enabling more accurate building extraction and change-type detection.
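For clarity, a minimal PyTorch sketch of the adapter is given below. It assumes the 1 × 1/3 × 3 refinement precedes the ReLU(BN(Conv(·))) adaptation of the equation above and that the token count equals the spatial grid size; class and argument names are illustrative.

```python
# A hedged sketch of the multi-scale adapter: SAM's token features (B x N x C) are
# reshaped to (B x C x H x W) and refined per scale.
import torch
import torch.nn as nn

class MultiScaleAdapter(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(                    # 1x1 + 3x3 convs: details and semantics
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.adapt = nn.Sequential(                     # F^A = ReLU(BN(Conv(F)))
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape                          # SAM outputs B x N x C token features
        assert n == h * w, "token count must match the spatial grid"
        x = tokens.transpose(1, 2).reshape(b, c, h, w)  # -> B x C x H x W
        return self.adapt(self.refine(x))
```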

3.2.2. Interactive Attention Module

To more effectively integrate the global semantic features from the SAM encoder with local structural features from the CNN encoder, we designed an interactive attention module serving as a bridge for dual-stream feature complementarity enhancement, as shown in Figure 3. This module effectively captures both geometric structures and high-level semantic information of buildings, thereby improving segmentation accuracy.
Specifically, we first perform unified dimension alignment on the multi-scale features extracted by both SAM and CNN encoders, reshaping them into flattened features. These features are then projected into a unified token space through linear projection layers, where the global semantic token from SAM is denoted as TSAM and the local detail token from CNN as TCNN.
Then, we generate query token QSAM based on the global semantic information from TSAM to guide the model’s focus towards building-relevant key regions. Simultaneously, key token KCNN and value token VCNN are derived from the local structural features of TCNN to enhance the model’s capacity to characterize fine-grained geometric features of buildings. These tokens are represented as follows:
$Q_{\mathrm{SAM}} = T_{\mathrm{SAM}} \cdot \omega_{Q}$
$K_{\mathrm{CNN}} = T_{\mathrm{CNN}} \cdot \omega_{K}$
$V_{\mathrm{CNN}} = T_{\mathrm{CNN}} \cdot \omega_{V}$
where $\omega_{Q}$, $\omega_{K}$, and $\omega_{V}$ are different linear projection weights.
Subsequently, through interactive attention and residual connection, the interactive token TSC is calculated as follows:
$T_{SC} = \mathrm{Softmax}\!\left(\dfrac{Q_{\mathrm{SAM}} K_{\mathrm{CNN}}^{\top}}{\sqrt{d}}\right) V_{\mathrm{CNN}} + T_{\mathrm{SAM}}$
where d represents the channel dimension of QSAM and KCNN. This design enables global semantic information to dynamically adjust attention weights across local features. For instance, it can emphasize edge features when detecting irregular building shapes while enhancing planar structural awareness during building segmentation.
Finally, the output feature $\hat{T}_{SC}$ of IAM is obtained by a Multi-Layer Perceptron (MLP) and a residual connection, which is calculated as follows:
$\hat{T}_{SC} = \mathrm{MLP}\big(\mathrm{LN}(T_{SC})\big) + T_{SC}$
where LN(·) denotes the layer normalization.
The above operation is performed independently on the four features of different scales from each temporal image, enabling the model to capture multi-level visual information from fine to coarse granularity. Through this design, local features progressively integrate with global context in a targeted manner, while global semantics simultaneously refine the local representations. Thus, IAM strengthens the semantic consistency between features at different scales, thereby facilitating building extraction and change-type detection.
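A minimal single-head sketch of the interactive attention module is shown below. The projection, attention, residual, and MLP(LN(·)) steps follow the equations above; multi-head attention and exact layer widths are omitted, and all names are illustrative.

```python
# A hedged sketch of the interactive attention module: the SAM token provides the query,
# the CNN token provides key/value, followed by a residual MLP over layer-normed features.
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)      # Q from SAM tokens
        self.w_k = nn.Linear(dim, dim, bias=False)      # K from CNN tokens
        self.w_v = nn.Linear(dim, dim, bias=False)      # V from CNN tokens
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, t_sam: torch.Tensor, t_cnn: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(t_sam), self.w_k(t_cnn), self.w_v(t_cnn)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        t_sc = attn @ v + t_sam                         # interactive attention + residual
        return self.mlp(self.norm(t_sc)) + t_sc         # MLP(LN(.)) + residual
```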

3.3. Change-Aware Attribute Decoder

To achieve change-type detection, we design a change-aware attribute decoder and a change attribute detection strategy integrated with bi-temporal semantics guidance, as shown in Figure 2. The decoder consists of two weight-shared extraction decoding networks and an attribute-aware strategy. The twin extraction decoders are specifically designed for building semantic prediction from bi-temporal imagery. The attribute-aware strategy discriminates temporal relationships between the fused bi-temporal encoded features and the outputs of extraction decoders, generating newly constructed and demolished attributes. Meanwhile, these attributes are correlated with building semantics to definitively distinguish newly constructed from demolished structures.
Initially, we input the pre-temporal features $\hat{T}_{SC,i}^{t_1}$ and post-temporal features $\hat{T}_{SC,i}^{t_2}$ at scales $i \in \{1, 2, 3, 4\}$ generated by the dual-feature interaction encoder, together with their concatenated bi-temporal features $\hat{T}_{SC,i}^{t_{1,2}}$, into the change-aware attribute decoder.
Extraction decoding networks employ a progressive fusion mechanism to gradually align semantic information across features at different levels. Specifically, taking the pre-temporal features as an example, the two shallowest features ($\hat{T}_{SC,1}^{t_1}$ and $\hat{T}_{SC,2}^{t_1}$) undergo down-sampling and up-sampling operations, respectively, followed by element-wise addition to the original $\hat{T}_{SC,2}^{t_1}$ and $\hat{T}_{SC,1}^{t_1}$ features. Subsequently, this fusion process is sequentially applied to two feature groups: ($\hat{T}_{SC,1}^{t_1}$, $\hat{T}_{SC,2}^{t_1}$, $\hat{T}_{SC,3}^{t_1}$) and ($\hat{T}_{SC,1}^{t_1}$, $\hat{T}_{SC,2}^{t_1}$, $\hat{T}_{SC,3}^{t_1}$, $\hat{T}_{SC,4}^{t_1}$), where each feature is resampled and added to the other features in the group. Finally, all four feature maps are resampled to the same resolution as $\hat{T}_{SC,1}^{t_1}$ and concatenated to output the building semantic features. Thanks to its weight-shared architecture, the extraction decoding network simultaneously generates the pre-temporal building semantic feature $F^{t_1}$ and the post-temporal building semantic feature $F^{t_2}$.
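The sketch below illustrates one plausible reading of this progressive fusion, assuming all scales share the same channel width after the encoder; it is not the authors' exact implementation.

```python
# A hedged sketch of the progressive fusion in the extraction decoding network:
# neighbouring scales are resampled and added element-wise group by group, and all
# scales are finally brought to the finest resolution and concatenated.
import torch
import torch.nn.functional as F

def progressive_fuse(feats):
    """feats: multi-scale features ordered fine -> coarse (1/4, 1/8, 1/16, 1/32),
    assumed to share the same channel width."""
    fused = list(feats)
    for k in range(1, len(feats)):               # groups (f1,f2), (f1,f2,f3), (f1,...,f4)
        group = fused[: k + 1]
        new_group = []
        for i, fi in enumerate(group):
            agg = fi
            for j, fj in enumerate(group):
                if i != j:                        # resample every other member to fi's size and add
                    agg = agg + F.interpolate(fj, size=fi.shape[-2:],
                                              mode="bilinear", align_corners=False)
            new_group.append(agg)
        fused[: k + 1] = new_group
    target = feats[0].shape[-2:]                  # bring everything to the finest resolution
    up = [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in fused]
    return torch.cat(up, dim=1)                   # building semantic feature
```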
Furthermore, we first perform layer-wise up-sampling on the deepest feature from $\hat{T}_{SC,i}^{t_{1,2}}$, concatenating it with its adjacent higher-level feature map at each step. This process ultimately yields the shallowest bi-temporal feature $F^{t_{1,2}}$, which encapsulates comprehensive building features from both bi-temporal images, thereby serving as prior knowledge to guide change attribute generation. Then, the attribute-aware strategy utilizes the pre-temporal building semantic feature $F^{t_1}$, the post-temporal building semantic feature $F^{t_2}$, and the bi-temporal feature $F^{t_{1,2}}$ to generate the newly constructed attribute map $M_n$ and the demolished attribute map $M_d$, calculated as follows:
$M_n = F^{t_{1,2}} - F^{t_1}$
$M_d = F^{t_{1,2}} - F^{t_2}$
The newly constructed attribute map is obtained by computing the difference between the bi-temporal feature and the pre-temporal building semantic feature, reflecting newly constructed buildings in the post-temporal phase. Conversely, the demolished attribute map is generated by subtracting the post-temporal building semantic feature from the bi-temporal feature, representing building areas that disappeared from the pre-temporal phase. This method leverages triple feature space differencing to enhance change characteristics while preserving inter-temporal semantic relationships, ensuring clear semantic distinction between newly constructed and demolished attributes. Therefore, the attribute maps can directly support accurate building change-type detection.
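In code, the attribute-aware differencing reduces to two feature subtractions; the minimal sketch below assumes the three inputs share the same shape, and the names are illustrative.

```python
# A minimal sketch of the attribute-aware strategy from the equations above.
import torch

def attribute_maps(f_bitemporal: torch.Tensor, f_t1: torch.Tensor, f_t2: torch.Tensor):
    m_new = f_bitemporal - f_t1          # M_n: present in the fused feature but not at t1
    m_demolished = f_bitemporal - f_t2   # M_d: present in the fused feature but not at t2
    return m_new, m_demolished
```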
Finally, our proposed method outputs five prediction maps, namely the pre-temporal building semantic feature $F^{t_1}$, the post-temporal building semantic feature $F^{t_2}$, the bi-temporal feature $F^{t_{1,2}}$, the newly constructed attribute map $M_n$, and the demolished attribute map $M_d$.

3.4. Loss Function

We adopt two loss functions to supervise the proposed method, including the binary cross-entropy loss function $L_{\mathrm{BCE}}$ and the change attribute loss function $L_{CA}$ designed for the attribute-aware strategy. We define the pre-temporal extraction label as $G^{t_1}$, the post-temporal extraction label as $G^{t_2}$, and the change type label as $G^{SC}$.
For building extraction, we decompose the supervision into single-temporal supervision and bi-temporal fusion supervision. In single-temporal supervision, we employ a composite loss combining $L_{\mathrm{BCE}}^{T_1}$ and $L_{\mathrm{BCE}}^{T_2}$ with an equal weighting factor $\lambda = 0.5$ to jointly optimize building semantic learning across temporal domains. In bi-temporal fusion supervision, we assign the bi-temporal fusion labels to the bi-temporal fusion predictions and use $L_{\mathrm{BCE}}^{TF}$ to calculate their distribution difference. We generate bi-temporal fusion labels by combining building labels from both temporal phases during training, where any pixel containing buildings in either phase is labeled as foreground. These fused labels are further refined to account for overlapping building regions, ensuring consistent supervision signals. Therefore, the building extraction loss function $L_{BE}$ is as follows:
$L_{BE} = \lambda L_{\mathrm{BCE}}^{T_1}(F^{t_1}, G^{t_1}) + \lambda L_{\mathrm{BCE}}^{T_2}(F^{t_2}, G^{t_2}) + L_{\mathrm{BCE}}^{TF}(F^{t_{1,2}}, G^{t_1} + G^{t_2})$
For change-type detection, we assign the $L_{\mathrm{BCE}}^{n}$ and $L_{\mathrm{BCE}}^{d}$ functions to the newly constructed and demolished attribute predictions, respectively, supervised by their corresponding change indexes ($G_{n}^{SC}$ and $G_{d}^{SC}$) in the change type labels. Meanwhile, to jointly align bi-temporal attribute variations with binary change signals, we introduce a semantic change loss $L_{SC}$ [44]. This loss strengthens the synergy between building extraction and change attribute detection by computing the cosine loss between the extracted predictions and the binary change indices $G_{\mathrm{binary}}^{SC}$ derived from the change type labels. Therefore, the change attribute loss function $L_{CA}$ is as follows:
$L_{CA} = L_{\mathrm{BCE}}^{n}(M_n, G_{n}^{SC}) + L_{\mathrm{BCE}}^{d}(M_d, G_{d}^{SC}) + L_{SC}(F^{t_1}, F^{t_2}, G_{\mathrm{binary}}^{SC})$
The final loss function L is denoted as follows:
$L = L_{BE} + L_{CA}$
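A hedged PyTorch sketch of the combined objective is given below. The BCE terms mirror the equations above; the internals of the cosine-based semantic change loss $L_{SC}$ are only one plausible reading of [44], and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_change_loss(feat_t1, feat_t2, g_binary, eps=1e-6):
    # One plausible reading of the cosine-based L_SC [44]: bi-temporal building features
    # should be dissimilar on changed pixels and similar on unchanged ones.
    cos = F.cosine_similarity(feat_t1, feat_t2, dim=1, eps=eps)      # (B, H, W)
    return (g_binary * cos + (1.0 - g_binary) * (1.0 - cos)).mean()

def bctdnet_loss(pred, labels, lam=0.5):
    """pred: single-channel logits 'f_t1', 'f_t2', 'f_bi', 'm_new', 'm_dem' plus
    multi-channel features 'feat_t1', 'feat_t2'; labels: float masks 'g_t1', 'g_t2',
    'g_new', 'g_dem', 'g_binary'. Names are illustrative."""
    bce = F.binary_cross_entropy_with_logits
    g_fused = torch.clamp(labels["g_t1"] + labels["g_t2"], max=1.0)  # union of building labels
    l_be = (lam * bce(pred["f_t1"], labels["g_t1"])                  # single-temporal terms
            + lam * bce(pred["f_t2"], labels["g_t2"])
            + bce(pred["f_bi"], g_fused))                            # bi-temporal fusion term
    l_ca = (bce(pred["m_new"], labels["g_new"])                      # newly constructed map
            + bce(pred["m_dem"], labels["g_dem"])                    # demolished map
            + semantic_change_loss(pred["feat_t1"], pred["feat_t2"], labels["g_binary"]))
    return l_be + l_ca                                               # L = L_BE + L_CA
```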

4. Experiment

4.1. Datasets

To evaluate the effectiveness of the proposed method, we construct two change-type detection datasets, JINAN-MCD and WHU-MCD, which both contain bi-temporal images, building labels, and change-type labels.
JINAN-MCD: As a key central city in China’s eastern coastal economic powerhouse, Jinan City in Shandong Province boasts a permanent population exceeding 9.2 million and covers approximately 10,244 km2 in its metropolitan area. Our study specifically focuses on Jinan’s urban core region. Between 2017 and 2023, this area underwent significant architectural transformation with accelerated urban renewal processes, making it an ideal case study for our research. Meanwhile, the area features diverse building heights, scales, and appearances. Moreover, high-rise buildings exhibit off-nadir displacement, further increasing the complexity of change-type detection.
The parameter information of the JINAN-MCD dataset is detailed in Table 1. The study area covers a 50 km2 urban area with 0.5 m resolution. Considering the continuity of surface change processes, we established a one-year sampling interval, collecting images from six temporal points: 2017, 2018, 2019, 2021, 2022, and 2023. The 2020 imagery was excluded from analysis as it showed minimal building changes compared to adjacent years (2019 and 2021). Then, a team of experts manually delineated building footprints for each temporal image to generate building distribution maps in “.shp” format. These vector data were subsequently converted to raster format to create building extraction labels. For change-type labeling, we performed differencing operations between consecutive building distribution maps, categorizing each building instance as either unchanged, newly constructed, or demolished. Since our method specifically targets building construction and demolition detection, unchanged attributes were excluded from consideration.
The dataset comprises six bi-temporal image-label pairs: five pairs formed from consecutive acquisition years and one pair spanning the maximum temporal interval (2017–2023). During preprocessing, the images and labels were cropped into non-overlapping 512 × 512 patches and split into training and testing sets at a 4:1 ratio. To ensure data quality, we conducted manual verification to filter out samples with annotation errors or poor imaging quality, resulting in six sets of valid data. Given the reduced sample size, we implemented data augmentation strategies to expand the data scale. In total, 13,639 pairs were allocated for model training, while the remaining 1608 pairs were reserved for independent testing and performance evaluation. The change-type labels contain three classes: background (black), newly built buildings (green), and demolished buildings (red).
WHU-MCD: This dataset is derived from the WHU-CD dataset. The WHU-CD dataset [45] focuses on post-earthquake reconstruction areas and includes two images with a resolution of 0.3 m and dimensions of 32,507 × 15,354 pixels, captured in 2012 and 2016. These images cover an area of 20.5 square kilometers and document changes in 12,796 buildings. However, the WHU-CD dataset provides only binary change labels and does not distinguish between newly constructed and demolished buildings, nor does it define a standardized division between the training and test sets. To overcome these limitations, we constructed the WHU-MCD dataset based on the WHU-CD dataset. Detailed parameters of the WHU-MCD dataset are provided in Table 2. The dataset contains 7620 images, each of size 256 × 256. The training and test sets are divided in a 4:1 ratio, consisting of 6096 and 1524 images, respectively. In the change labels, black indicates the background, green indicates newly built buildings, and red indicates demolished buildings.
The process of converting binary classification labels from the WHU-CD dataset to change-type labels for the WHU-MCD dataset is as follows: First, the building labels from the previous and subsequent images are compared using the Intersection over Union (IoU) metric, with the subsequent image as the reference image. If a building appears in the previous image but not in the subsequent one, it is classified as a demolished building. Conversely, if a building appears in the subsequent image but not in the previous one, it is classified as a newly built building. Buildings with an IoU greater than 50% are classified as unchanged, while those with an IoU between 0 and 50% are manually validated. To balance the newly constructed and demolished samples, we randomly reverse the temporal order of half of the image pairs in the dataset. After processing and manual validation by experts, the final WHU-MCD dataset is generated.
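The rule above can be sketched as follows. Connected-component labeling is used here as a stand-in for the instance masks (the original labels are vector building footprints), and the helper names are illustrative assumptions.

```python
# A hedged sketch of the per-building IoU rule for deriving change-type labels.
import numpy as np
from scipy import ndimage

def _instance_iou(inst, other_labeled):
    """IoU between one building instance and the union of the buildings it overlaps
    in the other epoch (0.0 if it overlaps nothing)."""
    ids = np.unique(other_labeled[inst])
    ids = ids[ids > 0]
    if ids.size == 0:
        return 0.0
    other = np.isin(other_labeled, ids)
    inter = np.logical_and(inst, other).sum()
    union = np.logical_or(inst, other).sum()
    return float(inter) / float(union)

def classify_buildings(mask_prev, mask_next, thr=0.5):
    """Return newly_built / demolished / needs_review masks from two binary building masks."""
    lab_prev, n_prev = ndimage.label(mask_prev)
    lab_next, n_next = ndimage.label(mask_next)
    newly_built = np.zeros_like(mask_next, dtype=np.uint8)
    demolished = np.zeros_like(mask_prev, dtype=np.uint8)
    review = np.zeros_like(mask_prev, dtype=np.uint8)
    for idx in range(1, n_next + 1):              # buildings present only in the later image
        inst = lab_next == idx
        iou = _instance_iou(inst, lab_prev)
        if iou == 0.0:
            newly_built[inst] = 1
        elif iou <= thr:
            review[inst] = 1                      # 0 < IoU <= 50%: manual validation
    for idx in range(1, n_prev + 1):              # buildings present only in the earlier image
        inst = lab_prev == idx
        iou = _instance_iou(inst, lab_next)
        if iou == 0.0:
            demolished[inst] = 1
        elif iou <= thr:
            review[inst] = 1
        # IoU > thr: unchanged, excluded from the change-type labels
    return newly_built, demolished, review
```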

4.2. Implementation Details

The experiments were implemented using PyTorch 2.4.0 on four NVIDIA Tesla V100 GPUs (32 GB VRAM each). To ensure reproducibility, we maintained identical experimental configurations across both datasets, conducting 50-epoch training with a fixed batch size of 5. For model optimization, we employed the AdamW [46] optimizer (β1 = 0.9, β2 = 0.999) with a weight decay of 0.001, implementing an initial learning rate of 0.00035. Model performance was assessed based on the results obtained from the JINAN-MCD and WHU-MCD test sets. Moreover, the background class was excluded during evaluation. During the inference stage, the newly constructed attribute map Mn and the demolished attribute map Md are independently transformed into prediction maps through argmax operations. Subsequently, these maps are element-wise summed to generate the change-type detection result.
To improve computational efficiency while preserving the global modeling capabilities of the original SAM, we adopt MobileSAM [38], a lightweight variant optimized for real-time applications. MobileSAM’s image encoder retains the core transformer-based architecture of SAM but significantly reduces computational overhead through decoupled distillation. Moreover, while maintaining the original MobileSAM parameters for patch embedding, we reduce the patch stride by half to better preserve semantic information at patch boundaries.
We employed the standard cross-entropy loss to optimize the performance of the compared change detection models. This is because change-type detection essentially constitutes a multi-class semantic segmentation task, as it requires pixel-wise classification of both invariant features and changes (newly constructed or demolished buildings) in bi-temporal imagery.

4.3. Evaluation Metrics

We quantitatively evaluate the proposed method using four standard change detection metrics: Intersection over Union (IoU), Precision (Pre), Recall (Rec), and F1-score (F1).
$\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$
$\mathrm{Pre} = \dfrac{TP}{TP + FP}$
$\mathrm{Rec} = \dfrac{TP}{TP + FN}$
$F1 = \dfrac{2 \times \mathrm{Pre} \times \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$
where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.
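For completeness, a small sketch of these metrics computed directly from the confusion counts is given below; the function name is illustrative.

```python
# Metrics from per-class confusion counts (guards against empty denominators).
def change_metrics(tp: int, fp: int, fn: int):
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return iou, pre, rec, f1
```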

4.4. Experimental Results

We benchmark our proposed method against state-of-the-art (SOTA) change detection methods through comprehensive quantitative and qualitative analyses. The comparison methods include ChangeFormer [15], TFI-GR [47], DTCDSCN [8], SEIFNet [48], SwinSUNet [37], DMINet [33], SAM-CD [44], and CD-STMamba [6].
Quantitative comparison results are presented in Table 3. It is evident that our proposed method outperforms all other approaches, achieving the highest IoU (53.42%/84.47%) and F1 (69.81%/91.57%) on the JINAN-MCD and WHU-MCD datasets. This superiority primarily stems from our network architecture design. Compared to CNN-based and Transformer-based methods, our approach incorporates the SAM visual foundation model and leverages a multi-scale adapter within the dual-feature interaction encoder to effectively harness SAM’s powerful generalization capabilities, significantly enhancing the robustness of semantic modeling. Although SAM-CD also utilizes SAM to encode image features, our method further introduces an interactive attention module to deeply integrate SAM features with CNN features, thereby improving fine-grained feature representation. More importantly, compared to other methods, our designed change-aware attribute decoder couples an attribute-aware strategy with dual-temporal semantic auxiliary supervision, enabling refined modeling of building changes, including both newly constructed and demolished buildings. Compared to the second-best method, ours improves IoU and F1 by 12.64% and 11.95% on the JINAN-MCD dataset and by 2.71% and 1.62% on the WHU-MCD dataset.
Furthermore, we present the detection performance of newly constructed and demolished buildings, as shown in Table 4 and Table 5. On the JINAN-MCD dataset, compared to the second-best approach, our method achieves a notable improvement in detecting newly constructed and demolished buildings, with a 16.02%/9.25% increase in IoUs and a 15.53%/8.37% gain in F1s. This enhancement is also evident on the WHU-MCD dataset, where we observe a 2.28%/3.15% rise in IoUs and a 1.33%/1.92% boost in F1s. The substantial performance gap between our method and others stems from the limitations of traditional approaches, which rely heavily on dataset-specific biases, lack effective multi-feature interaction mechanisms, and fail to achieve change-type semantic modeling.
Additionally, we observe that most methods achieve balanced IoU and F1 scores on the JINAN-MCD dataset, which contains a substantially larger sample size. Although Table 1 indicates a significant predominance of newly constructed samples over demolished samples, urban expansion typically manifests as increased construction of buildings with similar visual characteristics, particularly in residential areas or communities. Consequently, newly constructed samples contain substantial homogeneous visual features, whose diversity may approximate that of demolished samples. This feature parity enables the model to learn both newly constructed and demolished change patterns in a balanced manner. In contrast, for the WHU-MCD dataset with relatively limited samples, the IoU and F1 performance for newly constructed buildings are significantly superior to that for demolished buildings. Table 2 shows that newly constructed buildings exhibit substantially larger sample sizes and total pixel counts than demolished buildings. This disparity suggests that newly constructed buildings possess richer visual characteristics and potentially more identifiable contextual information (e.g., adjacent roads and vegetation). Therefore, under limited sample conditions, the model demonstrates a distinct learning preference for temporal patterns associated with newly constructed buildings, ultimately achieving higher detection accuracy.
Qualitative comparison results are presented in Figure 4 and Figure 5. The results demonstrate that the proposed method exhibits particularly outstanding performance in detecting newly constructed and demolished buildings with structurally diverse and easily confusable features. This superiority stems from our approach’s ability to fully leverage the powerful generalization capabilities of the SAM visual encoder through the multi-scale adapter, thereby enhancing the robustness of feature extraction. The interactive attention module further integrates multi-granularity features from both SAM and CNN, significantly improving the representation of complex building structures. On the other hand, a change-aware attribute decoder constructs explicit semantic constraints through bi-temporal building extraction supervision, better preserving building boundaries and detecting fine-grained structures. Moreover, the attribute-aware strategy effectively enhances the discriminability of temporal feature spaces, enabling the model to better distinguish between newly constructed and demolished buildings under complex seasonal variations and illumination conditions. In summary, the proposed method demonstrates clear detection advantages, offering an effective solution for high-precision urban building construction and demolition detection applications.

4.5. Ablation Studies

Ablation studies analyze the impact of each core component on model performance. We first validate the performance advantages of the dual-feature interaction encoder. We then specifically demonstrate the effectiveness of the interactive attention module to evaluate the impact of the interaction between SAM and CNN on image encoding. Next, focusing on the change-aware attribute decoder, we analyze and study the contribution of the attribute-aware strategy to the new construction and demolition performance. Finally, we discuss the impact of the supervision of extraction decoding networks on change-type detection performance.

4.5.1. The Effectiveness of the Dual-Feature Interaction Encoder

Ablation results in Table 6 reveal that employing either CNN or SAM exclusively as the encoder leads to substantial performance degradation, with IoU dropping by up to 5.42% and F1 by up to 5%, particularly for newly constructed building detection. This phenomenon occurs because the SAM encoder excels at capturing large-scale architectural structures, while the CNN encoder specializes in extracting fine-grained visual features such as building edges and textures. The dual-feature interaction encoder that integrates CNN and SAM demonstrates significant performance improvements (achieving an overall IoU of 84.47% and F1 of 91.57%), validating the complementary nature of local features and global contextual information. These results conclusively demonstrate the superiority of our fusion model for change-type detection.

4.5.2. The Impact of the Interactive Attention Module

The efficacy of the interactive attention module is validated through ablation studies in Table 7 and Table 8. In the absence of the IAM, feature fusion is implemented through simple element-wise addition and channel-level concatenation. IAM effectively bridges the complementary strengths of the SAM encoder and CNN encoder, enabling efficient extraction of multi-scale change features. Without the interactive attention module, the change-type detection performance on the JINAN-MCD dataset is limited to 51.47%/51.34% IoUs and 67.93%/67.85% F1s. However, incorporating this module significantly improved these metrics to 53.42% IoU and 69.81% F1. Furthermore, on the WHU-MCD dataset, the introduction of the interactive attention module yielded additional gains of 2.47% in IoU and 2.27% in F1, demonstrating its consistent effectiveness across different datasets.
Notably, despite its significant performance improvements, the proposed interactive attention module introduces only 8.67 GFLOPs of additional computational overhead, which remains within a reasonable and acceptable range for practical deployment. This lightweight design ensures that the module enhances multi-scale feature fusion without imposing excessive computational burdens, striking an optimal balance between accuracy and efficiency. Thus, the IAM not only bridges the complementary strengths of SAM and CNN encoders but does so in a computationally efficient manner, making it a viable solution for change-type detection.

4.5.3. The Effectiveness of the Attribute-Aware Strategy

The attribute-aware strategy generates distinctive physical attribute features and incorporates supervisory signals to provide more discriminative guidance for change-type detection in challenging scenarios. When the AAS is not employed, we directly supervise the bi-temporal features using standard cross-entropy loss. Moreover, we utilize multi-task segmentation branches from BCE-Net [31] to generate a newly constructed attribute map and a demolished attribute map for comparison to verify the superiority of AAS. As demonstrated in Table 9 and Table 10, our strategy yields significant performance improvements. On the JINAN-MCD dataset, it achieves gains of 3.10% in IoU and 2.95% in F1, while on the WHU-MCD dataset, the improvements reach 3.19% in IoU and 1.93% in F1. These consistent enhancements validate the effectiveness of the attribute-aware strategy in strengthening the model’s discriminative capability for both newly constructed and demolished buildings.
Further analysis of the visualization results reveals that the attribute-aware strategy effectively mitigates interference from tree occlusions while enhancing the model’s capability to distinguish buildings from roads. As clearly demonstrated in Figure 6 and Figure 7, compared to models using simple cross-entropy supervision for bi-temporal features, the attribute-aware supervised approach generates more complete building predictions in heavily occluded regions. Moreover, the semantic attribute constraints significantly improve discriminative capability for architecturally ambiguous structures. For instance, the model exhibits fewer false associations or prediction holes when detecting large buildings with road-like features. These performance gains in complex scenarios highlight the strategy’s effectiveness in leveraging physical semantic cues for more reliable change-type detection.

4.5.4. The Impact of the Supervision of Extraction Decoding Networks

The change-aware attribute decoder introduces supervision signals from extraction decoding networks, constructing an explicit semantic constraint that provides more precise semantic guidance for change-type detection. As demonstrated in Table 11 and Table 12, this supervision strategy significantly enhances model performance. On the JINAN-MCD dataset, IoU improves by 1.68% and F1 increases by 3.04%, while on the WHU-MCD dataset, the gains are even more pronounced, with IoU and F1 rising by 1.1% and 1.61%. This improvement validates the effectiveness of extraction supervision in enhancing change-aware capabilities, demonstrating that explicit semantic constraints can boost the accuracy of change-type detection.
On the other hand, the dual-supervision strategy effectively mitigates boundary ambiguity while enhancing the model’s sensitivity to detecting small buildings. As clearly evidenced by the comparative experimental results in Figure 8 and Figure 9, the model with dual extraction supervision yields significantly sharper building contour predictions compared to its unsupervised counterpart. Moreover, the supervision of extraction decoding networks substantially improves the discriminability of densely distributed small-scale buildings (such as low-rise houses and shantytown structures), demonstrating superior capability in preserving architectural details.

5. Discussion

5.1. Visualization Performance in Complex Environments

To further demonstrate the robustness of the proposed method, we discuss the performance of both the attribute-aware strategy and auxiliary supervision extraction in complex scenarios. As illustrated in Figure 10, the attribute-aware strategy effectively enhances the model’s capability to detect jagged structures in demolished buildings. Notably, since this strategy independently learns temporal dependencies for both newly constructed and demolished buildings, it successfully avoids missing small demolished buildings while preventing false detection of irrelevant roof-like categories. Figure 11 demonstrates that the auxiliary supervision extraction significantly improves the detection of newly constructed and demolished buildings in high-density building areas under complex conditions. This improvement stems from the complete integration of building semantic signals into change-type detection, resulting in sharper boundaries for changed buildings.

5.2. Interpretability of Independent Ablation Studies

Our ablation study aims to isolate the contribution of each module by removing it while keeping other components fixed. Although baseline performances vary (due to dependencies between modules), this design directly measures how each module impacts the overall system. For instance, if removing our attribute-aware strategy causes a significant drop in performance while removing our interactive attention module shows different negative effects, this clearly highlights their relative importance.
More specifically, without the attribute-aware strategy, the accuracy on the JINAN-MCD dataset drops by 3.10% in IoU and 2.95% in F1 (Table 9), while on the WHU-MCD dataset, IoU decreases by 3.19% and F1 declines by 1.93% (Table 10). In contrast, the absence of the interactive attention module leads to performance degradations of 1.95% in IoU and 1.88% in F1 on the JINAN-MCD dataset (Table 7). For the WHU-MCD dataset, reductions of 1.62% in IoU and 2.27% in F1 are observed (Table 8). Both proposed components are therefore effective, with the attribute-aware strategy contributing more to overall performance than the interactive attention module.

5.3. Effectiveness of the Temporal Augmentation Strategy on the WHU-MCD Dataset

The original WHU-CD dataset exhibits a significant class imbalance, with newly constructed samples substantially outnumbering demolished samples. This bias causes the model to predominantly learn newly constructed features, resulting in suboptimal demolished detection performance (IoU: 57.32% and F1: 72.87%), as shown in Table 13. To address this, we implement a temporal augmentation strategy to expand the demolished samples. Experimental results demonstrate the strategy’s effectiveness, with demolished detection metrics improving by 25.02% (IoU) and 17.44% (F1), respectively.

6. Conclusions

In this study, we present a SAM-based building change-type detection network designed to identify newly constructed and demolished buildings. First, we incorporate SAM to construct a dual-feature interaction encoder for extracting fine-grained bi-temporal image features. To enhance the network’s adaptability to building characteristics, we design a trainable multi-scale adapter to refine SAM features, enabling effective learning of architectural structures and semantic patterns. The developed interactive attention module further bridges SAM and CNN to facilitate the interaction between fine-grained geometric structures and high-level semantic information of buildings. Second, we construct a change-aware attribute decoder that injects building semantic information into change-type detection through extraction decoding networks. Meanwhile, the attribute-aware strategy explicitly generates newly constructed and demolished maps to establish clear temporal relationships.
To validate the proposed building change-type detection network, we establish the JINAN-MCD dataset covering Jinan’s urban core from 2017 to 2023 (a six-year span) and adapt the WHU-CD dataset into WHU-MCD to cover multiple change types. Experimental results on both datasets demonstrate that our method achieves the best quantitative and qualitative performance. It outperforms the second-best approaches by 12.64%/2.71% in IoU and 11.95%/1.62% in F1 on JINAN-MCD/WHU-MCD, respectively, with superior robustness and generalization. Our network shows strong potential for practical applications in land-use analysis and urban planning. Future research should expand building temporal categories (e.g., unchanged/renovated buildings) to construct more comprehensive datasets and investigate building change detection foundation models.

Author Contributions

Conceptualization, W.Z. and J.L.; methodology, W.Z. and J.L.; software, J.L.; validation, W.Z., J.L. and S.W.; formal analysis, W.Z.; investigation, S.W.; resources, W.Z. and J.W.; data curation, W.Z., J.L. and S.W.; writing—original draft preparation, J.L. and S.W.; writing—review and editing, J.W.; visualization, W.Z. and J.L.; supervision, J.W.; project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shandong Provincial Key Research and Development Program, grant number 2024TSG00181.

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BCTDNet: building change-type detection network
SAM: Segment Anything Model
CNN: Convolutional Neural Network
BCD: binary change detection
FCNs: fully convolutional networks
SSMs: state space models
IAM: interactive attention module
AAS: attribute-aware strategy
BN: batch normalization
GELU: Gaussian Error Linear Unit
RSIs: remote sensing images
MLP: Multi-Layer Perceptron

References

  1. Caye Daudt, R.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: New York, NY, USA, 2018; pp. 4063–4067. [Google Scholar]
  2. Abdi, G.; Jabari, S. A Multi-Feature Fusion Using Deep Transfer Learning for Earthquake Building Damage Detection. Can. J. Remote Sens. 2021, 47, 337–352. [Google Scholar] [CrossRef]
  3. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  4. Zhu, D.; Huang, X.; Huang, H.; Shao, Z.; Cheng, Q. ChangeViT: Unleashing Plain Vision Transformers for Change Detection. arXiv 2024, arXiv:2406.12847. [Google Scholar]
  5. Zhao, S.; Zhang, X.; Xiao, P.; He, G. Exchanging Dual-Encoder–Decoder: A New Strategy for Change Detection With Semantic Guidance and Spatial Localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  6. Liu, S.; Wang, S.; Zhang, W.; Zhang, T.; Xu, M.; Yasir, M.; Wei, S. CD-STMamba: Toward Remote Sensing Image Change Detection With Spatio-Temporal Interaction Mamba Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10471–10485. [Google Scholar] [CrossRef]
  7. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A Deeply Supervised Image Fusion Network for Change Detection in High Resolution Bi-Temporal Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  8. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building Change Detection for Remote Sensing Images Using a Dual-Task Constrained Deep Siamese Convolutional Network Model. IEEE Geosci. Remote Sens. Lett. 2021, 18, 811–815. [Google Scholar] [CrossRef]
  9. Chen, H.; Wu, C.; Du, B.; Zhang, L.; Wang, L. Change Detection in Multisource VHR Images via Deep Siamese Convolutional Multiple-Layers Recurrent Neural Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2848–2864. [Google Scholar] [CrossRef]
  10. He, R.; Li, W.; Mei, S.; Dai, Y.; He, M. EFP-Net: A Novel Building Change Detection Method Based on Efficient Feature Fusion and Foreground Perception. Remote Sens. 2023, 15, 5268. [Google Scholar] [CrossRef]
  11. Yasir, M.; Liu, S.; Pirasteh, S.; Xu, M.; Sheng, H.; Wan, J.; De Figueiredo, F.A.P.; Aguilar, F.J.; Li, J. YOLOShipTracker: Tracking Ships in SAR Images Using Lightweight YOLOv8. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104137. [Google Scholar] [CrossRef]
  12. Yasir, M.; Shanwei, L.; Mingming, X.; Jianhua, W.; Nazir, S.; Islam, Q.U.; Dang, K.B. SwinYOLOv7: Robust Ship Detection in Complex Synthetic Aperture Radar Images. Appl. Soft Comput. 2024, 160, 111704. [Google Scholar] [CrossRef]
  13. Tao, C.; Kuang, D.; Wu, K.; Zhao, X.; Zhao, C.; Du, X.; Zhang, Y. A Siamese Network with a Multiscale Window-Based Transformer via an Adaptive Fusion Strategy for High-Resolution Remote Sensing Image Change Detection. Remote Sens. 2023, 15, 2433. [Google Scholar] [CrossRef]
  14. Zou, Y.; Shen, T.; Chen, Z.; Chen, P.; Yang, X.; Zan, L. A Transformer-Based Neural Network with Improved Pyramid Pooling Module for Change Detection in Ecological Redline Monitoring. Remote Sens. 2023, 15, 588. [Google Scholar] [CrossRef]
  15. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022. [Google Scholar]
  16. Ke, Q.; Zhang, P. Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via Token Aggregation. IJGI 2022, 11, 263. [Google Scholar] [CrossRef]
  17. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  18. Li, B.; Wang, G.; Zhang, T.; Yang, H.; Zhang, S. Remote Sensing Image-Change Detection with Pre-Generation of Depthwise-Separable Change-Salient Maps. Remote Sens. 2022, 14, 4972. [Google Scholar] [CrossRef]
  19. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar]
  20. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  21. Wang, Z.; Zheng, J.-Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
  22. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. RS-Mamba for Large Remote Sensing Image Dense Prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  23. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. SegMamba: Long-Range Sequential Modeling Mamba for 3D Medical Image Segmentation. arXiv 2024, arXiv:2401.13560. [Google Scholar]
  24. Ruan, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
  25. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  26. He, X.; Cao, K.; Yan, K.; Li, R.; Xie, C.; Zhang, J.; Zhou, M. Pan-Mamba: Effective Pan-Sharpening with State Space Model. Inf. Fusion 2025, 115, 102779. [Google Scholar] [CrossRef]
  27. Liao, W.; Zhu, Y.; Wang, X.; Pan, C.; Wang, Y.; Ma, L. LightM-UNet: Mamba Assists in Lightweight UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.05246. [Google Scholar]
  28. Zhang, H.; Chen, K.; Liu, C.; Chen, H.; Zou, Z.; Shi, Z. CDMamba: Remote Sensing Image Change Detection with Mamba. arXiv 2024, arXiv:2406.04207. [Google Scholar]
  29. Xie, X.; Cui, Y.; Ieong, C.-I.; Tan, T.; Zhang, X.; Zheng, X.; Yu, Z. FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba. Vis. Intell. 2024, 2, 37. [Google Scholar] [CrossRef]
  30. Yang, A.; Li, M.; Ding, Y.; Fang, L.; Cai, Y.; He, Y. GraphMamba: An Efficient Graph Structure Learning Vision Mamba for Hyperspectral Image Classification. arXiv 2024, arXiv:2407.08255. [Google Scholar] [CrossRef]
  31. Liao, C.; Hu, H.; Yuan, X.; Li, H.; Liu, C.; Liu, C.; Fu, G.; Ding, Y.; Zhu, Q. BCE-Net: Reliable Building Footprints Change Extraction Based on Historical Map and up-to-Date Images Using Contrastive Learning. ISPRS J. Photogramm. Remote Sens. 2023, 201, 138–152. [Google Scholar] [CrossRef]
  32. Zhang, W.; Li, F.; Meng, J.; Li, J.; Wang, S.; Wan, J. Segment Anything Model for Multiclass Building Change Detection in Remote Sensing Images. In Proceedings of the Third International Conference on Environmental Remote Sensing and Geographic Information Technology (ERSGIT 2024), Xi’an, China, 22–24 November 2024; Tan, K., Yao, G., Eds.; SPIE: Bellingham, WA, USA, 2025; p. 6. [Google Scholar]
  33. Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change Detection on Remote Sensing Images Using Dual-Branch Multilevel Intertemporal Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  34. Du, Z.; Li, X.; Miao, J.; Huang, Y.; Shen, H.; Zhang, L. Concatenated Deep-Learning Framework for Multitask Change Detection of Optical and SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 719–731. [Google Scholar] [CrossRef]
  35. Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature Constraint Network for VHR Image Change Detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119. [Google Scholar] [CrossRef]
  36. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  37. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  38. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A Hybrid Transformer Network for Change Detection in Optical Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  39. Li, W.; Xue, L.; Wang, X.; Li, G. ConvTransNet: A CNN–Transformer Network for Change Detection With Multiscale Global–Local Representations. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  40. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
  41. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar]
  42. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  43. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  44. Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
  45. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  46. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  47. Li, Z.; Tang, C.; Wang, L.; Zomaya, A.Y. Remote Sensing Change Detection via Temporal Feature Interaction and Guided Refinement. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  48. Huang, Y.; Li, X.; Du, Z.; Shen, H. Spatiotemporal Enhancement and Interlevel Fusion Network for Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
Figure 1. Samples from common binary change detection (BCD) datasets, along with those from the JINAN-MCD and WHU-MCD datasets we constructed.
Figure 2. Overview of the proposed method.
Figure 3. Detailed structure of IAM. This module enables complementary learning of SAM and CNN features, as indicated by the red circle.
Figure 4. Qualitative comparison with different methods on the JINAN-MCD dataset.
Figure 5. Qualitative comparison with different methods on the WHU-MCD dataset.
Figure 6. Qualitative results with and without the attribute-aware strategy on the JINAN-MCD dataset.
Figure 7. Qualitative results with and without the attribute-aware strategy on the WHU-MCD dataset.
Figure 8. Qualitative results of supervised and unsupervised extraction decoding networks on the JINAN-MCD dataset.
Figure 9. Qualitative results of supervised and unsupervised extraction decoding networks on the WHU-MCD dataset.
Figure 10. Qualitative results with and without the attribute-aware strategy in complex environments on the WHU-MCD dataset.
Figure 11. Qualitative results of supervised and unsupervised extraction decoding networks in complex environments on the WHU-MCD dataset.
Table 1. Parameter information of the JINAN-MCD dataset.

| Source Image | Coverage | Resolution | Number of Bands | Temporal Subset | Image Size | Train/Test | Change-Type Scale |
|---|---|---|---|---|---|---|---|
| 2017 | 50 km² | 0.5 m/pixel | 3 (RGB) | 2017–2018 | 512 × 512 | 13,639/1608 | Newly constructed buildings (Number: 30,598; Area: 12.59 km²); Demolished buildings (Number: 25,616; Area: 8.32 km²) |
| 2018 |  |  |  | 2018–2019 |  |  |  |
| 2019 |  |  |  | 2019–2021 |  |  |  |
| 2021 |  |  |  | 2021–2022 |  |  |  |
| 2022 |  |  |  | 2022–2023 |  |  |  |
| 2023 |  |  |  | 2017–2023 |  |  |  |
Table 2. Parameter information of the WHU-MCD dataset.

| Coverage | Resolution | Number of Bands | Temporal Subset | Image Size | Train/Test | Change-Type Scale |
|---|---|---|---|---|---|---|
| 20.5 km² | 0.3 m/pixel | 3 (RGB) | 2012–2016 | 256 × 256 | 6096/1524 | Newly constructed buildings (Number: 3054; Area: 0.89 km²); Demolished buildings (Number: 2577; Area: 0.85 km²) |
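For readers who wish to reproduce the change-type statistics in Tables 1 and 2, the following minimal sketch shows one common way to derive newly constructed and demolished masks from a pair of binary building masks. The class indices and the function name are illustrative assumptions, not the annotation pipeline actually used to build JINAN-MCD or WHU-MCD.

```python
import numpy as np

# Illustrative class indices (assumed, not necessarily those used in the datasets):
# 0 = unchanged, 1 = newly constructed building, 2 = demolished building.
UNCHANGED, NEW, DEMOLISHED = 0, 1, 2

def change_type_mask(buildings_t1: np.ndarray, buildings_t2: np.ndarray) -> np.ndarray:
    """Derive a change-type map from two binary building masks (H x W, values 0/1).

    A pixel is labelled 'newly constructed' if it is building only at the later
    date, and 'demolished' if it is building only at the earlier date.
    """
    t1 = buildings_t1.astype(bool)
    t2 = buildings_t2.astype(bool)
    out = np.full(t1.shape, UNCHANGED, dtype=np.uint8)
    out[~t1 & t2] = NEW          # appears only in the later image
    out[t1 & ~t2] = DEMOLISHED   # present earlier, gone later
    return out
```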
Table 3. Quantitative results of different change detection methods on the JINAN-MCD and WHU-MCD datasets; all results are expressed as percentages (%). Computational complexity (FLOPs) and parameters (Param.) are evaluated on 256 × 256 images, except for SwinSUNet, which is assessed on 224 × 224 images.

| Method | Param. (M) | FLOPs (G) | JINAN-MCD IoU | JINAN-MCD F1 | JINAN-MCD Pre | JINAN-MCD Rec | WHU-MCD IoU | WHU-MCD F1 | WHU-MCD Pre | WHU-MCD Rec |
|---|---|---|---|---|---|---|---|---|---|---|
| ChangeFormer | 41.03 | 202.79 | 30.59 | 46.80 | 64.20 | 36.90 | 69.36 | 81.82 | 87.24 | 77.08 |
| TFI-GR | 28.37 | 20.37 | 26.84 | 42.32 | **72.48** | 30.02 | 77.73 | 87.39 | 90.26 | 84.72 |
| DTCDSCN | 41.07 | 15.24 | 36.03 | 52.95 | 67.70 | 43.53 | 77.85 | 87.53 | 92.58 | 83.09 |
| SEIFNet | 8.38 | 27.91 | 35.29 | 52.12 | 64.16 | 44.03 | 76.57 | 86.64 | 92.89 | 81.38 |
| SwinSUNet | 43.57 | 12.43 | 22.09 | 35.71 | 52.45 | 27.40 | 81.76 | 89.95 | 90.25 | 89.66 |
| DMINet | 6.76 | 17.43 | 33.39 | 50.04 | 69.81 | 39.07 | 78.42 | 87.85 | 92.43 | 83.78 |
| SAM-CD | 5.49 | 39.06 | 39.63 | 56.69 | 59.07 | 54.89 | 80.60 | 89.24 | 91.67 | 86.95 |
| CD-STMamba | 63.33 | 67.00 | 40.78 | 57.86 | 56.25 | 59.72 | 72.13 | 83.81 | 78.06 | **90.50** |
| BCTDNet (Ours) | 118.14 | 79.20 | **53.42** | **69.81** | 64.50 | **75.90** | **84.47** | **91.57** | **92.95** | 90.04 |

Bold values indicate the best result in each column.
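The IoU, F1, precision (Pre), and recall (Rec) values reported in Tables 3–13 are standard pixel-wise metrics. The sketch below shows one way to compute them for a single change class with NumPy; the helper name and the small epsilon used to avoid division by zero are assumptions rather than the authors' exact evaluation code.

```python
import numpy as np

def binary_metrics(pred: np.ndarray, gt: np.ndarray, cls: int) -> dict:
    """Pixel-wise IoU, F1, precision, and recall for one change class.

    `pred` and `gt` are integer change-type maps of the same shape;
    `cls` selects the class to score (e.g. newly constructed or demolished).
    """
    p = pred == cls
    g = gt == cls
    tp = np.logical_and(p, g).sum()   # true positives
    fp = np.logical_and(p, ~g).sum()  # false positives
    fn = np.logical_and(~p, g).sum()  # false negatives
    eps = 1e-9                        # guards against empty classes
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": 100 * iou, "F1": 100 * f1,
            "Pre": 100 * precision, "Rec": 100 * recall}
```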
Table 4. Quantitative results of different change detection methods on the newly constructed and demolished buildings of the JINAN-MCD dataset.

| Method | Newly Constructed IoU | Newly Constructed F1 | Newly Constructed Pre | Newly Constructed Rec | Demolished IoU | Demolished F1 | Demolished Pre | Demolished Rec |
|---|---|---|---|---|---|---|---|---|
| ChangeFormer | 28.28 | 44.09 | 57.46 | 35.77 | 32.90 | 49.51 | 70.93 | 38.03 |
| TFI-GR | 25.95 | 41.21 | 64.25 | 30.33 | 27.73 | 43.42 | **80.71** | 29.70 |
| DTCDSCN | 34.29 | 51.07 | 62.94 | 42.97 | 37.76 | 54.82 | 72.45 | 44.09 |
| SEIFNet | 32.79 | 49.39 | 57.53 | 43.27 | 37.78 | 54.84 | 70.70 | 44.79 |
| SwinSUNet | 15.52 | 26.87 | 45.90 | 19.00 | 28.66 | 44.55 | 59.00 | 35.79 |
| DMINet | 31.94 | 48.41 | 64.41 | 38.78 | 34.83 | 51.67 | 75.20 | 39.36 |
| SAM-CD | 36.37 | 53.33 | 51.83 | 54.93 | 42.90 | 60.04 | 66.31 | 54.86 |
| CD-STMamba | 37.52 | 54.56 | 55.04 | 54.09 | 44.05 | 61.16 | 57.47 | 65.34 |
| BCTDNet (Ours) | **53.54** | **70.09** | **66.70** | **73.04** | **53.30** | **69.53** | 62.30 | **78.75** |

Bold values indicate the best result in each column.
Table 5. Quantitative results of different change detection methods on the newly constructed and demolished buildings of the WHU-MCD dataset.

| Method | Newly Constructed IoU | Newly Constructed F1 | Newly Constructed Pre | Newly Constructed Rec | Demolished IoU | Demolished F1 | Demolished Pre | Demolished Rec |
|---|---|---|---|---|---|---|---|---|
| ChangeFormer | 74.01 | 85.06 | 89.18 | 81.31 | 64.71 | 78.58 | 85.29 | 72.84 |
| TFI-GR | 82.71 | 90.54 | 92.19 | 88.94 | 72.75 | 84.23 | 88.33 | 80.49 |
| DTCDSCN | 80.36 | 89.11 | 91.88 | 86.50 | 75.34 | 85.94 | **93.27** | 79.68 |
| SEIFNet | 81.40 | 89.74 | 92.66 | 87.01 | 71.73 | 83.54 | 93.12 | 75.75 |
| SwinSUNet | 84.33 | 91.50 | 92.79 | 90.24 | 79.19 | 88.39 | 87.71 | **89.08** |
| DMINet | 82.20 | 90.23 | 92.74 | 87.84 | 74.63 | 85.47 | 92.11 | 79.72 |
| SAM-CD | 83.01 | 90.72 | 92.37 | 89.12 | 78.19 | 87.76 | 90.97 | 84.77 |
| CD-STMamba | 73.36 | 84.63 | 77.82 | **92.74** | 70.91 | 82.98 | 78.29 | 88.26 |
| BCTDNet (Ours) | **86.61** | **92.83** | **94.59** | 91.10 | **82.34** | **90.31** | 91.66 | 88.98 |

Bold values indicate the best result in each column.
Table 6. The influence of different encoders on the WHU-MCD dataset.

| Encoder | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| Only CNN | 79.05 | 86.57 | 80.00 | 87.62 | 78.09 | 85.51 |
| Only SAM | 81.32 | 88.60 | 83.15 | 90.00 | 79.49 | 87.20 |
| Our Encoder | **84.47** | **91.57** | **86.61** | **92.83** | **82.34** | **90.31** |

Bold values indicate the best result in each column.
Table 7. The influence of the interactive attention module on the JINAN-MCD dataset.

| Feature Fusion | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| Addition | 51.47 | 67.93 | 51.62 | 68.39 | 51.32 | 67.47 |
| Concatenation | 51.34 | 67.85 | 51.83 | 68.27 | 50.86 | 67.42 |
| IAM | **53.42** | **69.81** | **53.54** | **70.09** | **53.30** | **69.53** |

Bold values indicate the best result in each column.
Table 8. The influence of the interactive attention module on the WHU-MCD dataset.

| Feature Fusion | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| Addition | 82.85 | 89.30 | 84.31 | 90.95 | 81.39 | 87.65 |
| Concatenation | 82.00 | 90.08 | 84.81 | 91.78 | 79.20 | 88.39 |
| IAM | **84.47** | **91.57** | **86.61** | **92.83** | **82.34** | **90.31** |

Bold values indicate the best result in each column.
Table 9. The impact of the attribute-aware strategy on the JINAN-MCD dataset.

| Module | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| - | 50.32 | 66.86 | 51.19 | 67.72 | 49.27 | 66.01 |
| Multi-task segmentation branches | 51.58 | 68.07 | 52.19 | 68.59 | 50.97 | 67.55 |
| AAS | **53.42** | **69.81** | **53.54** | **70.09** | **53.30** | **69.53** |

Bold values indicate the best result in each column.
Table 10. The impact of the attribute-aware strategy on the WHU-MCD dataset.

| Module | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| - | 81.28 | 89.64 | 84.44 | 91.57 | 78.12 | 87.71 |
| Multi-task segmentation branches | 82.19 | 90.21 | 84.52 | 91.61 | 79.86 | 88.80 |
| AAS | **84.47** | **91.57** | **86.61** | **92.83** | **82.34** | **90.31** |

Bold values indicate the best result in each column.
Table 11. The influence of supervising the extraction decoding networks on the JINAN-MCD dataset.

| Supervising Extraction Decoding Networks | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| - | 51.74 | 66.77 | 51.77 | 66.80 | 51.71 | 66.74 |
| ✓ | **53.42** | **69.81** | **53.54** | **70.09** | **53.30** | **69.53** |

Bold values indicate the best result in each column.
Table 12. The influence of supervising the extraction decoding networks on the WHU-MCD dataset.

| Supervising Extraction Decoding Networks | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| - | 83.37 | 89.96 | 86.09 | 90.64 | 80.65 | 89.28 |
| ✓ | **84.47** | **91.57** | **86.61** | **92.83** | **82.33** | **90.31** |

Bold values indicate the best result in each column.
Table 13. The influence of the temporal augmentation strategy on the WHU-MCD dataset.

| Temporal Augmentation Strategy | Change-Type Detection IoU | Change-Type Detection F1 | Newly Constructed IoU | Newly Constructed F1 | Demolished IoU | Demolished F1 |
|---|---|---|---|---|---|---|
| - | 72.31 | 83.05 | **87.31** | **93.22** | 57.32 | 72.87 |
| ✓ | **84.47** | **91.57** | 86.61 | 92.83 | **82.34** | **90.31** |

Bold values indicate the best result in each column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
