Article

DCSC Mamba: A Novel Network for Building Change Detection with Dense Cross-Fusion and Spatial Compensation

School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 975; https://doi.org/10.3390/info16110975
Submission received: 16 October 2025 / Revised: 3 November 2025 / Accepted: 9 November 2025 / Published: 11 November 2025

Abstract

Change detection in remote sensing imagery plays a vital role in urban planning, resource monitoring, and disaster assessment. However, current methods, including CNN-based approaches and Transformer-based detectors, still suffer from false change interference, irregular regional variations, and the loss of fine-grained details. To address these issues, this paper proposes a novel building change detection network named Dense Cross-Fusion and Spatial Compensation Mamba (DCSC Mamba). The network adopts a Siamese encoder–decoder architecture, where dense cross-scale fusion is employed to achieve multi-granularity integration of cross-modal features, thereby enhancing the overall representation of multi-scale information. Furthermore, a spatial compensation module is introduced to effectively capture both local details and global contextual dependencies, improving the recognition of complex change patterns. By integrating dense cross-fusion with spatial compensation, the proposed network exhibits a stronger capability in extracting complex change features. Experimental results on the LEVIR-CD and SYSU-CD datasets demonstrate that DCSC Mamba achieves superior performance in detail preservation and robustness against interference. Specifically, it achieves F1 scores of 90.29% and 79.62%, and IoU scores of 82.30% and 66.13% on the two datasets, respectively, validating the effectiveness and robustness of the proposed method in challenging change detection scenarios.

1. Introduction

With the rapid development of remote sensing technologies and the continuous improvement of image acquisition capabilities, change detection based on remote sensing imagery has become a research hotspot. It plays an essential role in urban development planning [1], disaster assessment [2], and resource monitoring [3,4]. Among these applications, building change detection is a key task that aims to accurately identify variations in the location, scale, morphology, and quantity of buildings by comparing multi-temporal images of the same geographic region. Such information provides critical support for dynamic urban monitoring and management [5,6,7].
Existing methods for building change detection can be broadly categorized into traditional approaches and deep learning-based approaches [8]. Traditional techniques typically rely on pixel-level, feature-level, object-level, or temporal analysis strategies [9]. While these methods achieved certain success in earlier studies, they generally suffer from several limitations, including heavy dependence on manual intervention, poor adaptability, and weak generalization capability in complex remote sensing scenarios. For example, the Conditional Mixed Markov model (CXM) proposed by Benedek et al. [10] and the symmetric convolutional coupling network (SCCN) proposed by Liu et al. [11], using the experimental settings and metrics reported by Zhan et al. [12], achieved precision values of 36.5% and 24.4%, respectively, on the Szada dataset. With the increasing availability of large-scale remote sensing data and the growing demand for higher accuracy, traditional methods have encountered significant performance bottlenecks when applied to high-resolution imagery for building change detection.
To overcome the limitations of traditional methods on high-resolution imagery, deep learning techniques have emerged as the dominant solution due to their powerful feature learning capabilities. Convolutional neural network (CNN)-based change detection models are commonly built upon encoder–decoder architectures, enabling end-to-end extraction of change information. For instance, Alcantarilla et al. [13] employed convolution–deconvolution structures for change detection in street-view images, demonstrating for the first time the feasibility of deep networks in this task. Depending on the stage at which feature fusion occurs, CNN-based methods can generally be divided into early fusion and late fusion strategies. Early fusion directly concatenates bi-temporal images along the channel dimension before encoding; however, variations in illumination, occlusion, and other factors may cause pixel-level mismatches, leading to spatial misalignment and reduced detection accuracy [12]. To mitigate this issue, late fusion strategies adopt a Siamese network structure, where features from each temporal image are extracted independently and subsequently fused at deeper levels. This design better preserves temporal semantic information and enhances model robustness. Zhan et al. [12] introduced Siamese CNNs for optical remote sensing image change detection, laying the foundation for CNN-based Siamese architectures in this field; their precision reached 41.2% on the Szada dataset. Building upon this, Daudt et al. [14] systematically proposed fully convolutional Siamese networks, namely FC-Siam-Conc and FC-Siam-Diff, which fuse features through concatenation or differencing, establishing the groundwork for late fusion. On the Szada dataset, their precision values were only 40.93% and 41.38%, respectively. Since then, research has increasingly focused on more advanced feature interaction mechanisms, leading to continuous improvements in detection accuracy and robustness over earlier CNN-based methods. Chen et al. [15] proposed STANet, which incorporates a spatial–temporal attention mechanism into the fusion stage, enabling the model to adaptively focus on key change regions across time; on the LEVIR-CD dataset, this model achieved a precision of 83.8%. Subsequently, Chen et al. [16] developed DASNet, introducing a dual attention module to further enhance feature discrimination and suppress pseudo changes. Notably, the evolution of late fusion strategies is also reflected in the design of feature extraction backbones. Wang et al. [17] proposed a hybrid architecture that combines CNNs and Transformers to capture both local details and global dependencies, providing a promising direction for improving change detection accuracy.
When processing multi-temporal images, convolutional neural networks (CNNs) face inherent limitations. As illustrated in Figure 1a, the convolutional operation scans images progressively with a restricted receptive field, making it difficult to effectively capture global spatial features. In contrast, as shown in Figure 1b, Transformers treat an image as a sequence of patches and leverage the self-attention mechanism, allowing each patch to attend to the global context. This enables Transformers to excel at modeling long-range feature dependencies [18,19,20]. For example, Bandara et al. [21] proposed ChangeFormer, which integrates a hierarchical Transformer encoder with a multilayer perceptron decoder to enhance multi-scale detail representation. Similarly, Liu et al. [22] introduced the Swin Transformer, which improves the efficiency of global modeling through a shifted window mechanism. Nevertheless, the self-attention mechanism in Transformers comes with high computational complexity, which significantly limits training and inference efficiency on large-scale image datasets [23].
In recent years, compared with self-attention-based Transformer models, the Mamba model, built upon state space models (SSMs), has demonstrated a remarkable advantage of linear computational complexity [24]. Originally designed for natural language processing tasks, its structured state space sequence model (S4) processes input sequences in a causal manner, enabling efficient inference in language modeling [25]. However, since image data are inherently non-causal, directly applying Mamba’s causal modeling mechanism to vision tasks poses significant challenges [26]. To address this issue, visual Mamba models typically partition images into patches and flatten them into sequences for processing [27]. As illustrated in Figure 1c, by scanning the flattened image sequence, the model can effectively integrate positional information from all directions, thereby establishing a global receptive field without introducing additional linear computational overhead. This design not only preserves computational efficiency but also substantially enhances the expressive power of learned features [28,29]. Furthermore, Goel et al. [30] improved the original S4 by introducing a selection mechanism, enabling Mamba to dynamically choose relevant information in an input-dependent manner. Building upon this, Chen et al. [31] proposed ChangeMamba, a change detection model based on Visual State Space Model (VSSM), which effectively captures spatiotemporal dependencies and achieves accurate change prediction.
Despite the advantages of the aforementioned global scanning approaches, their ability to capture fine-grained local details in high-resolution imagery remains insufficient, leaving the rich geometric structures and texture information underutilized [32]. In change detection tasks, this limitation often leads to incomplete boundary delineation and the omission of small-scale change regions, reflecting inadequate detail representation. Moreover, semantic discrepancies may arise among multi-scale features; if fusion is improperly handled, certain details may be mistakenly classified as change regions. The insufficient exploitation of feature complementarity across multiple scales has thus constrained the further improvement of existing models.
To address these challenges, this paper proposes a novel Dense Cross-fusion and Spatial Compensation Mamba network (DCSC Mamba), as illustrated in Figure 1d. Specifically, the model employs the Selective State Space layer (S6) of Mamba to perform global relational modeling, while incorporating a Spatial Compensation (SC) mechanism during the scanning process to enhance the capture of local detail features, thereby mitigating the limitations of a pure Mamba architecture in local information extraction. In addition, a Dense Cross-fusion (DCF) module is designed to realize multi-granularity fusion of cross-modal features through dense interaction pathways. This design strengthens the semantic integration of bi-temporal features during the encoding stage, enabling effective integration from local details to global context, and ultimately improving the model’s capability in handling complex change information.

2. Principles and Methods

2.1. Fundamentals of State Space Models

The Structured State Space Model (SSM), as a sequence modeling approach, maps a one-dimensional function or sequence x(t) ∈ ℝ to a hidden state h(t) ∈ ℝ^{m×N} and subsequently to an output y(t) ∈ ℝ, where m denotes the state dimension and N represents the sequence length. This process can be formulated as a system of linear ordinary differential equations (ODEs), expressed as follows:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$
In the above equations, A ∈ ℝ^{m×m} denotes the state matrix, while B ∈ ℝ^{m×N} and C ∈ ℝ^{N×m} represent the projection parameters. Here, h′(t) denotes the time derivative of the hidden state, and h(t) is the current hidden state driven by the input x(t).
To integrate SSM into deep learning frameworks, the continuous system must be discretized. In SSM, the S4 method applies the zero-order hold (ZOH) rule to discretize the continuous parameters A and B into $\bar{A}$ and $\bar{B}$, which are formulated as follows:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\bar{A} - I\right)\Delta B$$
Here, Δ is the time-scale parameter used to discretize A and B, and I denotes the identity matrix. Through this discretization, the continuous-time SSM can be integrated into deep learning models, and the discretized SSM equations can be expressed as:
$$h_p = \bar{A}\,h_{p-1} + \bar{B}\,x_p, \qquad y_p = C\,h_p$$
In the above equations, h_p denotes the hidden state at time step p, while x_p and y_p represent the input and output at the same time step, respectively. Since the S4 model possesses a linear time-invariant (LTI) property, its parameters A, B, and C remain constant during inference, which limits its ability to adapt dynamically to contextual variations. To address this limitation, Mamba introduces the Selective State Space model (S6). By incorporating a selection mechanism, the discrete parameters $\bar{A}$, $\bar{B}$, and Δ become input-dependent, allowing the model to dynamically adjust according to the input sequence. This enables selective processing of key information and facilitates modeling of complex temporal dependencies. The parameters Δ, B, and C are made dependent on the input x as follows:
$$S_B(x) = \mathrm{Linear}_N(x), \qquad S_C(x) = \mathrm{Linear}_N(x), \qquad S_\Delta(x) = \mathrm{softplus}\left(\mathrm{Linear}_1(x) + b\right)$$
In the above equations, Linear_N and Linear_1 are fully connected (FC) layers that project the embedding dimension of x to N and 1, respectively, softplus serves as the activation function, and b denotes the bias term. However, the model described so far is limited to processing unidirectional sequential data and cannot directly handle images, which are inherently non-causal. To overcome this limitation, a general two-dimensional vision model, VMamba (Visual State Space Model), was proposed. By introducing a cross-scan module, VMamba scans the input image along both horizontal and vertical directions, enabling one-dimensional selective scanning in 2D image space. This design addresses the non-causality of images without incurring additional computational complexity.
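For readers who prefer code, the following is a minimal PyTorch sketch of the ZOH-discretized selective scan described by the equations above; it is an illustration only, not the authors' implementation, and the diagonal parameterization of A, the per-channel Δ, and the first-order simplification of $\bar{B} \approx \Delta B$ are assumptions made for this example.

```python
# Minimal selective-SSM (S6-style) recurrence: a sketch, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Log-parameterized diagonal state matrix A (one diagonal per channel), assumed here.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.proj_B = nn.Linear(d_model, d_state)   # S_B(x)
        self.proj_C = nn.Linear(d_model, d_state)   # S_C(x)
        self.proj_dt = nn.Linear(d_model, d_model)  # S_Delta(x), one Delta per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -- a flattened patch sequence.
        b, L, d = x.shape
        A = -torch.exp(self.A_log)                   # (d, N), kept negative for stability
        B = self.proj_B(x)                           # (b, L, N)
        C = self.proj_C(x)                           # (b, L, N)
        dt = F.softplus(self.proj_dt(x))             # (b, L, d)
        A_bar = torch.exp(dt.unsqueeze(-1) * A)      # ZOH: A_bar = exp(Delta * A)
        B_bar = dt.unsqueeze(-1) * B.unsqueeze(2)    # first-order approximation of B_bar
        h = x.new_zeros(b, d, self.d_state)
        ys = []
        for p in range(L):                           # h_p = A_bar h_{p-1} + B_bar x_p ; y_p = C h_p
            h = A_bar[:, p] * h + B_bar[:, p] * x[:, p].unsqueeze(-1)
            ys.append((h * C[:, p].unsqueeze(1)).sum(-1))
        return torch.stack(ys, dim=1)                # (b, L, d_model)
```

In VMamba-style vision models, such a scan would be applied to the flattened patch sequence along each of the four scanning directions and the resulting sequences merged.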

2.2. DCSC Mamba Model Architecture

The overall architecture of the proposed DCSC Mamba network is illustrated in Figure 2, adopting a Siamese encoder–decoder design. Initially, input images from different temporal phases are partitioned into fixed-size patches via Patch Partition and subsequently fed into the Siamese encoders. The encoder consists of four hierarchical layers, each formed by stacking two VSS-Conv modules. Each VSS-Conv module performs bidirectional sequential scanning along both horizontal and vertical directions to model global contextual dependencies, while simultaneously employing convolution operations to extract local detailed features, thus achieving joint modeling of global and local information.
After Patch Merging downsampling, features are passed to the next layer. Simultaneously, features prior to downsampling are preserved and fed into the Dense Cross-Fusion (DCF) module. This module aligns cross-temporal features to mitigate registration errors between bi-temporal images and fuses features from different hierarchical levels to maximally retain critical information.
In the decoding stage, the SCMamba module constructs global and local relationships through scanning and utilizes features from the Dense Cross-Fusion module. During this process, a Spatial Compensation (SC) mechanism is applied to recover missing local information. Finally, the processed features are progressively upsampled and reconstructed into the final change prediction map.

2.3. Dense Cross Fusion

The overall structure of the Dense Cross-Fusion module is illustrated in the boxed region of Figure 3. This module takes as input the feature outputs from the encoder across two temporal phases and multiple hierarchical levels, and employs a cross-dense connection mechanism to deliver the fused multi-scale features to the corresponding decoder layers. The detailed fusion process is shown in Figure 3. First, the feature maps from the two temporal phases are concatenated along the channel dimension, followed by a convolution layer with a 3 × 3 kernel, stride of 2, and padding of 1 for downsampling, thereby reducing the spatial dimensions of the feature maps. The downsampled features are then concatenated with features from the next layer, further integrating contextual information from different depths and scales. Finally, a 1 × 1 convolution is applied to compress the concatenated high-dimensional features along the channel dimension and consolidate them into the target number of channels. This layer-wise nested and cross-scale dense fusion mechanism effectively aggregates multi-level features and enhances global contextual representation.
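As a concrete illustration of this fusion step, the hedged PyTorch sketch below concatenates the bi-temporal features of one level, downsamples them with a 3 × 3 stride-2 convolution, concatenates the result with the next-level bi-temporal features, and compresses the channels with a 1 × 1 convolution; the channel widths, the BatchNorm/ReLU after downsampling, and the DCFStep name are assumptions rather than the released implementation.

```python
# One Dense Cross-Fusion step between two adjacent encoder levels (illustrative sketch).
import torch
import torch.nn as nn

class DCFStep(nn.Module):
    def __init__(self, c_cur: int, c_next: int, c_out: int):
        super().__init__()
        # 3x3 convolution, stride 2, padding 1: halves the spatial size of the fused current level.
        self.down = nn.Sequential(
            nn.Conv2d(2 * c_cur, 2 * c_cur, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(2 * c_cur),
            nn.ReLU(inplace=True),
        )
        # 1x1 convolution compresses the concatenated features to the target channel count.
        self.compress = nn.Conv2d(2 * c_cur + 2 * c_next, c_out, kernel_size=1)

    def forward(self, t1_cur, t2_cur, t1_next, t2_next):
        fused_cur = self.down(torch.cat([t1_cur, t2_cur], dim=1))   # bi-temporal concat + downsample
        fused = torch.cat([fused_cur, t1_next, t2_next], dim=1)     # cross-scale concatenation
        return self.compress(fused)

# Example with assumed channel widths: level-1 features (96 ch, 64x64), level-2 features (192 ch, 32x32).
t1a, t2a = torch.randn(1, 96, 64, 64), torch.randn(1, 96, 64, 64)
t1b, t2b = torch.randn(1, 192, 32, 32), torch.randn(1, 192, 32, 32)
out = DCFStep(96, 192, 192)(t1a, t2a, t1b, t2b)                     # -> (1, 192, 32, 32)
```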

2.4. SCMamba

The SCMamba module consists of two components, namely the VSS-Conv Block and Spatial Compensation, which are applied to each layer of the encoder and decoder. The VSS-Conv Block is an enhanced variant of the Visual State Space (VSS) block, where a convolutional scanning mechanism is incorporated to capture richer local information. Its data processing flow is illustrated in Figure 4. Specifically, the input features are processed through two parallel branches. In the first branch, the features are sequentially passed through Layer Normalization (LN) and a linear projection for dimensional adjustment, followed by a 3 × 3 depthwise separable convolution and the SiLU activation function. The resulting feature block is then scanned along four different paths to comprehensively capture contextual information. The scanned feature sequences are fused with the SiLU-activated original input and subsequently processed by another LN and a multi-layer perceptron (MLP) before being output. In the second branch, the input first undergoes Batch Normalization (BN), followed by two parallel dilated convolutions (kernel size = 3 × 3, dilation rates = 1 and 3, respectively) to perceive regional variations at multiple scales. The convolutional outputs are activated by ReLU, summed together, and further passed through another ReLU activation. Finally, the outputs of both branches are summed to form the final representation of the VSS-Conv Block.
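The convolutional branch of the VSS-Conv Block can be sketched as follows; the scan branch is omitted, and the channel count and module name are illustrative assumptions based only on the description above (BN, two parallel 3 × 3 dilated convolutions with dilation rates 1 and 3, ReLU, summation, ReLU).

```python
# Convolutional branch of the VSS-Conv Block (sketch of the second branch only).
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        # Padding matches the dilation rate so the spatial size is preserved.
        self.conv_d1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, dilation=1)
        self.conv_d3 = nn.Conv2d(channels, channels, kernel_size=3, padding=3, dilation=3)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(x)
        y = self.relu(self.conv_d1(x)) + self.relu(self.conv_d3(x))   # multi-scale local context
        return self.relu(y)

# In the full block the output would be summed with the scan (VSS) branch:
# out = vss_branch(x) + ConvBranch(channels)(x)
```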
In the SCMamba module, the features output from the VSS-Conv Block are added to those from the dense cross-fusion module. Although the fused features contain abundant local details and global information, they also include a considerable amount of irrelevant background and noise, which compromises the integrity of change feature extraction. To address this issue, spatial compensation is applied to the fused features in order to filter out redundant details, suppress noise, and enhance effective spatial features, thereby enabling more complete and accurate extraction of change regions. The overall structure of SCMamba is illustrated in Figure 5, where the small circles denote the local features extracted by the VSS-Conv Block.
The specific operation of spatial compensation is as follows. First, the input feature X is subjected to channel shuffle (Group = 4), in which the channels are rearranged after group convolution to promote information interaction across different groups. The shuffled feature map Y is then processed by two parallel branches. In the first branch, a channel attention mechanism is applied. Specifically, global average pooling (GAP) is used to compress the W × H × C feature map into a 1 × 1 × C feature vector z, where the feature of each channel zc is calculated as follows:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} y_c(i, j)$$
Subsequently, in order to exploit the compressed global information, the vector is passed through two fully connected (FC) layers, where the first FC layer reduces the channel dimension to C/r and the second restores it to C. A ReLU activation after the first FC layer introduces nonlinearity, and a Sigmoid activation after the second generates the normalized channel weight vector s. The process can be expressed as:
$$s = \sigma\left(W_2\,\delta\left(W_1 z\right)\right)$$
W1 and W2 denote the weights of the two FC layers, δ represents the ReLU function, and σ denotes the Sigmoid function. Finally, the weight vector s is multiplied with the original feature map Y in a channel-wise manner to achieve adaptive calibration of channel features, thereby obtaining the output of the first branch:
$$\tilde{y}_c = s_c \times y_c$$
In the above equation, y_c denotes the c-th channel of Y and $\tilde{y}_c$ its calibrated counterpart. By multiplying each channel with its corresponding weight, informative channels are enhanced while irrelevant ones are suppressed, achieving dynamic and selective enhancement of the channel-wise feature responses.
The second branch adopts a spatial self-attention mechanism. Specifically, three independent 1 × 1 convolutional layers are applied to the input Y to project the query (Q), key (K), and value (V) matrices. The channel dimensions of Q and K are reduced to C/8, while that of V remains unchanged. The self-attention weight map is then computed as:
$$\mathrm{SelfAttention}(Y) = \mathrm{softmax}\left(Q K^{T}\right)V$$
Finally, the outputs of the two branches are summed to obtain the result of spatial compensation. Through this process, each local feature can effectively aggregate global contextual information and dynamically emphasize key spatial locations. This mechanism compensates for the limited local detail perception caused by the unidirectional scanning in Mamba, thereby enabling efficient enhancement of spatial change features.
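Putting the two branches together, a hedged PyTorch sketch of the spatial compensation operation is shown below; the reduction ratio r, the tensor layout, and the module name are assumptions for illustration and do not reproduce the authors' exact implementation.

```python
# Spatial Compensation: channel shuffle, channel-attention branch, spatial self-attention branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x: torch.Tensor, groups: int = 4) -> torch.Tensor:
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class SpatialCompensation(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Branch 1: squeeze-and-excitation style channel attention (z_c and s above).
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)
        # Branch 2: spatial self-attention with Q/K reduced to C/8 channels.
        self.q = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.k = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = channel_shuffle(x, groups=4)
        b, c, h, w = y.shape
        # Channel attention branch: GAP -> FC (C/r) -> ReLU -> FC (C) -> Sigmoid -> rescale.
        z = y.mean(dim=(2, 3))
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(z))))
        out1 = y * s.view(b, c, 1, 1)
        # Spatial self-attention branch: softmax(Q K^T) V over the flattened spatial positions.
        q = self.q(y).flatten(2).transpose(1, 2)       # (b, hw, c/8)
        k = self.k(y).flatten(2)                       # (b, c/8, hw)
        v = self.v(y).flatten(2)                       # (b, c, hw)
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw)
        out2 = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return out1 + out2
```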

3. Experiments and Analysis

3.1. Experimental Dataset

The experiments in this study were conducted on two publicly available datasets, namely LEVIR-CD [15] and SYSU-CD [33].
The LEVIR-CD dataset, released in 2020, is a building change detection benchmark. It contains 637 pairs of high-resolution RGB images with a spatial resolution of 0.5 m, each of size 1024 × 1024 pixels, captured between 2002 and 2018 over several urban areas in Texas, USA. The temporal span ranges from 5 to 14 years, and the dataset covers a wide variety of building types, including villas, high-rise apartments, small garages, and large warehouses. In this study, all original images were cropped into 256 × 256 patches, resulting in 7120, 1024, and 2048 samples for the training, validation, and testing sets, respectively.
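For reference, the 1024 × 1024 tiles can be cropped into non-overlapping 256 × 256 patches with a short script such as the one below; the use of Pillow and the output naming scheme are assumptions, since the paper does not describe its preprocessing tooling.

```python
# Crop a LEVIR-CD tile into non-overlapping 256x256 patches (illustrative preprocessing sketch).
from pathlib import Path
from PIL import Image

def crop_to_patches(img_path: str, out_dir: str, patch: int = 256) -> None:
    img = Image.open(img_path)
    w, h = img.size                                  # 1024 x 1024 for LEVIR-CD tiles
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = img.crop((left, top, left + patch, top + patch))
            tile.save(out / f"{Path(img_path).stem}_{top}_{left}.png")
```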
The SYSU-CD dataset was released in 2021 by the research team at Sun Yat-sen University. It is a large-scale, diverse, and highly challenging benchmark for general remote sensing change detection. The dataset consists of 20,000 pairs of high-resolution images, each with a size of 256 × 256 pixels and a spatial resolution of 0.5 m. All image pairs were acquired in Hong Kong, China, spanning the years from 2007 to 2014. The dataset mainly focuses on urban development-related changes, such as newly constructed urban buildings and suburban expansion.

3.2. Experimental Environment and Parameters

All experiments were conducted on a 64-bit Windows 10 platform equipped with a 13th Gen Intel Core i5-13490F CPU @ 2.5 GHz and an NVIDIA GeForce RTX 5060 Ti GPU with 16 GB memory. The implementation was based on Python 3.10, PyTorch 2.2.0, and CUDA 11.8. During training, the Adam optimizer was adopted with an initial learning rate of 0.0001, an L2 regularization coefficient of 0.0005, a batch size of 4, and 100 training epochs. The binary cross-entropy loss function was employed.
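A minimal training-loop sketch consistent with these settings is given below; model and train_loader are placeholders, and mapping the L2 regularization coefficient to Adam weight decay is an assumption about the authors' setup.

```python
# Training configuration sketch: Adam, lr 1e-4, weight decay 5e-4, batch size 4, 100 epochs, BCE loss.
import torch
import torch.nn as nn

def train(model, train_loader, device: str = "cuda", epochs: int = 100) -> None:
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()               # binary cross-entropy on raw logits
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
    for epoch in range(epochs):
        model.train()
        for img_t1, img_t2, label in train_loader:   # bi-temporal image pair and change mask
            img_t1, img_t2 = img_t1.to(device), img_t2.to(device)
            label = label.to(device).float()
            loss = criterion(model(img_t1, img_t2), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```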

3.3. Evaluation Indicators

To evaluate the effectiveness of the proposed algorithm, the experiments employed Precision, Recall, F1-score, and Intersection over Union (IoU) as performance metrics for change detection. Higher precision indicates fewer false positives, while higher recall indicates fewer false negatives. The F1-score, defined as the harmonic mean of precision and recall, serves as a comprehensive measure of model performance. The metrics are calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FN + FP}$$
In the above formulas, TP (True Positive) denotes the number of pixels correctly identified as changed, TN (True Negative) denotes the number of pixels correctly identified as unchanged, FN (False Negative) denotes the number of changed pixels incorrectly predicted as unchanged, and FP (False Positive) denotes the number of unchanged pixels incorrectly predicted as changed.
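These metrics can be computed directly from binary prediction and ground-truth masks, as in the short sketch below; the function name and the small epsilon guard against empty masks are illustrative additions.

```python
# Compute Precision, Recall, F1, and IoU from binary change masks, following the formulas above.
import numpy as np

def change_detection_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()              # changed pixels predicted as changed
    fp = np.logical_and(pred, ~gt).sum()             # unchanged pixels predicted as changed
    fn = np.logical_and(~pred, gt).sum()             # changed pixels predicted as unchanged
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```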

3.4. Experimental Results and Analysis

To evaluate the effectiveness of DCSC Mamba for high-resolution remote sensing change detection, six representative change detection models were selected for comparison: CDNet [34], DSIFN [35], BIT [36], ChangeFormer [21], SarasNet [37], and ChangeMamba [31]. Among these, CDNet introduces a hierarchical fully convolutional architecture for pixel-wise change detection; DSIFN leverages deep Siamese feature extraction with multi-level fusion; BIT expresses the bi-temporal images as compact semantic tokens and models their context with a transformer encoder–decoder to refine the change features; ChangeFormer utilizes a transformer with global context modeling for accurate change detection; SarasNet combines CNN and transformer modules to integrate local and global features; while ChangeMamba and the proposed DcscMamba are based on the Mamba architecture, using spatiotemporal state-space modeling to capture both temporal dynamics and spatial context for robust change detection.

3.4.1. Experimental Results and Analysis of the LEVIR-CD Dataset

The quantitative evaluation results on the LEVIR-CD dataset are presented in Table 1. As shown, compared with the six baseline methods, the proposed DcscMamba achieves the best performance in terms of Precision, F1-score, and IoU, with values of 89.57%, 90.29%, and 82.30%, respectively. The Recall of DcscMamba is 91.02%, slightly lower than that of the SarasNet model. Among the Transformer-based models, ChangeFormer demonstrates relatively weaker performance, with Precision, Recall, F1-score, and IoU lower than DcscMamba by 5.45%, 1.46%, 3.53%, and 5.68%, respectively. Compared with ChangeMamba, which is also based on the Mamba architecture, DcscMamba still shows superior performance, with improvements of 2.96%, 0.62%, 1.83%, and 2.98% in Precision, Recall, F1-score, and IoU, respectively. These results indicate that DcscMamba has a clear advantage in accurately identifying change regions.
Figure 6 presents the change detection results of different baseline models on the LEVIR-CD dataset. To visually distinguish the elements in the images, different colors are used: white represents changed buildings, black represents the background, blue indicates false positives, and red indicates false negatives.
In the first-row scenario, a row of small-scale buildings in the earlier image is replaced by grass in the later image. Due to the weak features of these buildings and the low contrast between the changed areas and the grassy background, CDNet, SarasNet, and ChangeMamba failed to detect any changes, while DSIFN, BIT, and ChangeFormer still exhibited considerable false negatives. In contrast, the proposed method, although not fully covering all changed regions, successfully identified partial segments, demonstrating improved sensitivity to weak change signals and mitigating the issue of missed small-scale targets.
In the second-row scenario, an irregularly shaped building change is present. CDNet, BIT, ChangeFormer, SarasNet, and ChangeMamba produced incomplete detections in this region, whereas DSIFN and the proposed method could largely recover the overall contour. Furthermore, several small-scale building changes near the irregular structure were missed by DSIFN but successfully detected by the proposed method. This demonstrates that our method possesses strong contextual awareness and can effectively capture fine-grained surrounding changes while identifying the main change region.
In the third scene, an empty space in the upper-left corner is replaced by a newly constructed large building. Most baseline models failed to detect this change accurately. Specifically, DSIFN, BIT, ChangeFormer, SarasNet, and ChangeMamba exhibited significant missed detections, while CDNet captured only part of the change. In contrast, the proposed method successfully identified the full extent of the large building, demonstrating superior performance in detecting large structures. For the building in the lower-right corner, except for DSIFN, which showed severe missed detections, all other models—including CDNet, BIT, ChangeFormer, SarasNet, ChangeMamba, and the proposed method—were able to detect it relatively well.
In the fourth-row scenario, DSIFN’s over-reliance on local features and limited global modeling capability led to missed detections. While CDNet, SarasNet, and ChangeMamba were able to detect the main change regions, their predictions were fragmented and lacked internal consistency. In contrast, BIT, ChangeFormer, and the proposed method maintained the overall continuity of the building change regions, effectively suppressing fragmentation errors and demonstrating stronger structural preservation.
In the last-row scenario, the change region consisted of white buildings surrounded by highly similar white-gray ground, creating a challenging background for detection. BIT and SarasNet missed parts of the changes, while DSIFN, ChangeFormer, and ChangeMamba produced false positives. In contrast, the proposed method consistently detected the true change regions and effectively distinguished buildings from background interference, highlighting its robustness in complex environments.

3.4.2. Experimental Results and Analysis of the SYSU-CD Dataset

The quantitative evaluation results on the SYSU-CD dataset are presented in Table 2. Compared with the LEVIR-CD dataset, SYSU-CD features more complex scenes, which are prone to producing a large number of false changes. Under such circumstances, maintaining high precision is particularly important, as excessive false positives can significantly increase manual verification costs. The proposed model demonstrates strong suppression of false positives on the SYSU-CD dataset, achieving a precision of 82.56%, which is notably higher than that of the other compared models.
The proposed DcscMamba achieves the best performance on the SYSU-CD dataset in terms of F1-score and IoU, with values of 79.62% and 66.13%, respectively. The Recall is 76.88%, which is relatively lower, indicating that the model tends to exclude certain boundary-ambiguous or feature-unclear change points. The F1-score, as the harmonic mean of precision and recall, demonstrates that DcscMamba achieves a good overall balance between Precision and Recall. Compared with CDNet, DSIFN, BIT, ChangeFormer, SarasNet, and ChangeMamba, the F1-score of the proposed model is higher by 3.78%, 0.89%, 3.10%, 3.08%, 0.58%, and 1.46%, respectively. In terms of IoU, CDNet, BIT, and ChangeFormer perform relatively poorly, with scores lower than DcscMamba by 5.05%, 4.16%, and 4.14%, respectively. DSIFN, SarasNet, and ChangeMamba perform better, but DcscMamba still outperforms them by 1.21%, 0.78%, and 1.97%, respectively.
Figure 7 shows the change detection results of different models on the SYSU-CD dataset. In the first-row scenario, a partial region within a building complex has changed. Due to the high spectral and textural similarity between the changed objects and their surroundings, as well as blurred boundaries, detection is extremely challenging. CDNet, DSIFN, and ChangeFormer failed to identify any changed regions. BIT, SarasNet, and ChangeMamba detected the changes but produced a considerable number of false positives, failing to distinguish real changes from background interference. In contrast, DcscMamba was able to extract the changed region almost completely, with slight boundary deviations but an overall contour closely matching the true change, demonstrating strong sensitivity and discriminative ability in complex backgrounds.
In the second-row scenario, CDNet, DSIFN, and BIT failed to detect the small building changes, ChangeFormer and ChangeMamba identified them only partially, and only SarasNet and DcscMamba were able to extract the changes relatively completely.
In the third-row scenario, the change region has an irregular polygonal shape. CDNet, DSIFN, and BIT exhibited false positives or missed detections along the boundaries, resulting in incomplete contours. Compared to ChangeFormer, SarasNet, and ChangeMamba, DcscMamba performs equally well, effectively capturing the boundaries with similar integrity and accuracy.
In the fourth-row scenario, CDNet and BIT missed large-scale building changes, while DSIFN, ChangeFormer, and ChangeMamba detected most changes but produced numerous false positives, often misclassifying small surrounding features as changes. Only SarasNet and DcscMamba were able to extract the change regions relatively completely while maintaining low false positive rates.
In the fifth-row scenario, the changed buildings are small and surrounded by dense vegetation, with spectral similarity between the ground and buildings further complicating detection. CDNet, DSIFN, ChangeFormer, and ChangeMamba failed to detect these changes, resulting in missed detections. BIT produced confusion by misclassifying the surrounding ground as changes. SarasNet and DcscMamba produced similar results, both effectively reducing misclassifications caused by color similarity.
Overall, DcscMamba demonstrates strong robustness, consistently extracting building change regions more completely in complex backgrounds.

3.5. Ablation Experiments

To further validate the effectiveness of the proposed method, ablation experiments and analyses were conducted. VMamba was selected as the baseline network, and the contributions of the dense cross-fusion module and the SCMamba module with spatial compensation were evaluated. The two variants are referred to as Baseline+Dc and Baseline+Sc, respectively. The experiments were performed on the LEVIR-CD dataset, and the quantitative results are summarized in Table 3.
Compared with the original baseline, introducing the dense cross-fusion module led to improvements across all metrics, with Precision, Recall, F1-score, and IoU increasing by 5.08%, 5.64%, 5.36%, and 8.06%, respectively. This indicates that the module effectively preserves change features and significantly enhances the model’s detection accuracy.
When the SCMamba module with spatial compensation was introduced, the Recall reached 92.23%, the F1-score 87.86%, and the IoU 78.34%. However, compared with the dense cross-fusion variant, the improvement in Precision was limited, reaching only 83.88%. This suggests that the spatial compensation mechanism enhances the model’s sensitivity to changed regions and provides partial compensation for spatial detail information, allowing more true change targets to be captured, but it also introduces some false positives, limiting the gain in Precision.
Benefiting from the synergy of the dense cross-fusion module and the spatial compensation mechanism, the proposed DCSC Mamba achieves significant performance improvements, with an F1-score of 90.29% and IoU of 82.30%. Compared with using only the dense cross-fusion module, these represent increases of 2.95% and 4.77%, respectively; compared with using only the spatial compensation mechanism, the improvements are 2.43% and 3.96%, respectively. These results fully demonstrate the effectiveness and superiority of the proposed method.
In addition, compared with the original baseline, incorporating the dense cross-fusion module increases the model parameters to 71.73 M, while adding the SCMamba module results in 73.08 M parameters. The proposed method has a total of 74.73 M parameters. This indicates that while each module improves the model’s accuracy, it also inevitably increases the number of parameters.

4. Conclusions

This paper proposes a novel Mamba-based network incorporating dense cross-fusion and spatial compensation mechanisms for high-resolution remote sensing change detection. The core of the model lies in leveraging the selective state space layer (S6) of Mamba to capture global dependencies, while innovatively introducing a spatial compensation mechanism during the scanning process to effectively address the limitations of the standard Mamba architecture in modeling local details. Additionally, a dense cross-fusion module is designed to achieve deep integration of multi-temporal features through multi-granularity interaction paths, enhancing the encoder’s ability to model multi-scale features and significantly improving the extraction of complex change information.
In the experimental evaluation, the proposed method was compared with several state-of-the-art change detection models on two publicly available datasets, LEVIR-CD and SYSU-CD. The results demonstrate that the proposed DcscMamba achieves superior performance, attaining F1 scores of 90.29% and 79.62%, and IoU scores of 82.30% and 66.13% on the two datasets, respectively, significantly surpassing the baseline methods and confirming its effectiveness. Further ablation studies confirm the individual contributions of the dense cross-fusion module and the spatial compensation mechanism, as well as their synergistic effect, indicating that their combination achieves an optimal balance between global context modeling and local detail capture.
Although the method still has room for improvement in extremely complex scenarios, this study not only extends the application of the Mamba architecture in remote sensing change detection but also provides valuable insights for achieving efficient and robust intelligent interpretation of remote sensing imagery in more complex domains, such as agriculture and forestry. Future work will focus on further enhancing the model’s generalization and robustness in challenging scenarios.

Author Contributions

Conceptualization, R.X. and R.M.; methodology, R.X., R.M., Y.Y., W.Z., Y.L. and Y.Z.; software, R.M.; validation, R.X., R.M., Y.Y., W.Z., Y.L. and Y.Z.; formal analysis, R.X., R.M. and Y.Y.; investigation, R.X., R.M., W.Z., Y.L. and Y.Z.; resources, R.X. and R.M.; data curation, R.M., W.Z. and Y.L.; writing—original draft preparation, R.X. and R.M.; writing—review and editing, R.X. and R.M.; visualization, R.X. and R.M.; supervision, R.X.; project administration, R.X. and R.M.; funding acquisition, R.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Startup Foundation of Fujian University of Technology, grant number GY-Z24009, and in part by the Social Science Planning Project of Fujian Province under Grant FJ2025B185.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in Zenodo at https://doi.org/10.5281/zenodo.17305264; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bernhard, M.; Strauß, N.; Schubert, M. MapFormer: Boosting Change Detection by Using Pre-change Information. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 11657–11667. [Google Scholar] [CrossRef]
  2. Cheng, G.; Huang, Y.; Li, X.; Lyu, S.; Xu, Z.; Zhao, H.; Zhao, Q.; Xiang, S. Change Detection Methods for Remote Sensing in the Last Decade: A Comprehensive Survey. Remote Sens. 2024, 16, 2355. [Google Scholar] [CrossRef]
  3. Liu, B.; Chen, H.; Li, K.; Yang, M. Transformer-Based Multimodal Change Detection with Multitask Consistency Constraints. Inf. Fusion 2024, 108, 102358. [Google Scholar] [CrossRef]
  4. Gan, Y.; Xuan, W.; Chen, H.; Liu, J.; Du, B. RFL-CDNet: Towards Accurate Change Detection via Richer Feature Learning. Pattern Recognit. 2024, 153, 110515. [Google Scholar] [CrossRef]
  5. Chen, Z.; Zhou, Y.; Wang, B.; Xu, X.; He, N.; Jin, S. EGDE-Net: A Building Change Detection Method for High-Resolution Remote Sensing Imagery Based on Edge Guidance and Differential Enhancement. ISPRS J. Photogramm. Remote Sens. 2022, 191, 203–222. [Google Scholar] [CrossRef]
  6. He, F.; Chen, H.; Yang, S.; Guo, Z. Building Change Detection Network Based on Multilevel Geometric Representation Optimization Using Frame Fields. Remote Sens. 2024, 16, 4223. [Google Scholar] [CrossRef]
  7. Chen, Z.; Wang, R.; Xu, Y. Semi-Supervised Remote Sensing Building Change Detection with Joint Perturbation and Feature Complementation. Remote Sens. 2024, 16, 3424. [Google Scholar] [CrossRef]
  8. Qin, R.; Tian, J.; Reinartz, P. 3D Change Detection–Approaches and Applications. ISPRS J. Photogramm. Remote Sens. 2016, 122, 41–56. [Google Scholar] [CrossRef]
  9. Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change Detection from Remotely Sensed Images: From Pixel-Based to Object-Based Approaches. ISPRS J. Photogramm. Remote Sens. 2013, 80, 91–106. [Google Scholar] [CrossRef]
  10. Benedek, C.; Szirányi, T. Change Detection in Optical Aerial Images by a Multilayer Conditional Mixed Markov Model. IEEE Trans. Geosci. Remote Sens. 2009, 47, 3416–3430. [Google Scholar] [CrossRef]
  11. Liu, J.; Gong, M.; Qin, K.; Zhang, P. A Deep Convolutional Coupling Network for Change Detection Based on Heterogeneous Optical and Radar Images. IEEE Trans. Neural Netw. Learn. Syst. 2016, 29, 545–559. [Google Scholar] [CrossRef]
  12. Zhan, Y.; Fu, K.; Yan, M.; Sun, X.; Wang, H.; Qiu, X. Change Detection Based on Deep Siamese Convolutional Network for Optical Aerial Images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1845–1849. [Google Scholar] [CrossRef]
  13. Alcantarilla, P.F.; Stent, S.; Ros, G.; Arroyo, R.; Gherardi, R. Street-View Change Detection with Deconvolutional Networks. Auton. Robots 2018, 42, 1301–1322. [Google Scholar] [CrossRef]
  14. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4063–4067. [Google Scholar] [CrossRef]
  15. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  16. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206. [Google Scholar] [CrossRef]
  17. Wang, G.; Li, B.; Zhang, T.; Zhang, S. A Network Combining a Transformer and a Convolutional Neural Network for Remote Sensing Image Change Detection. Remote Sens. 2022, 14, 2228. [Google Scholar] [CrossRef]
  18. Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
  19. Liu, M.; Chai, Z.; Deng, H.; Liu, R. A CNN-Transformer Network with Multiscale Context Aggregation for Fine-Grained Cropland Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4297–4306. [Google Scholar] [CrossRef]
  20. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Álvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  21. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2022), Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 207–210. [Google Scholar] [CrossRef]
  22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  23. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef]
  24. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  25. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. In Proceedings of the European Conference on Computer Vision (ECCV 2024), Cham, Switzerland, 8–14 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 12–22. [Google Scholar] [CrossRef]
  26. Zhang, Z.; Liu, A.; Reid, I.; Hartley, R.; Zhuang, B.; Tang, H. Motion Mamba: Efficient and Long Sequence Motion Generation. In Proceedings of the European Conference on Computer Vision (ECCV 2024), Cham, Switzerland, 8–14 October 2024; Springer Nature: Cham, Switzerland, 2024. Lecture Notes in Computer Science. Volume 15059, pp. 265–282. [Google Scholar] [CrossRef]
  27. Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A Survey on Visual Mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
  28. Liu, L.; Ma, L.; Wang, S.; Wang, J.; Melo, S. Two-Stage Mamba-Based Diffusion Model for Image Restoration. Sci. Rep. 2025, 15, 22265. [Google Scholar] [CrossRef]
  29. Shi, Y.; Dong, M.; Xu, C. Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 25687–25708. [Google Scholar]
  30. Goel, K.; Gu, A.; Donahue, C.; Ré, C. It’s Raw! Audio Generation with State-Space Models. In Proceedings of the International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; PMLR: Cambridge, MA, USA, 2022; pp. 7616–7633. [Google Scholar]
  31. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote Sensing Change Detection with Spatiotemporal State Space Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
  32. Tang, Y.; Li, Y.; Zou, H.; Zhang, X. Interactive Segmentation for Medical Images Using Spatial Modeling Mamba. Information 2024, 15, 633. [Google Scholar] [CrossRef]
  33. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604816. [Google Scholar] [CrossRef]
  34. Jin, W.D.; Xu, J.; Han, Q.; Zhang, Y.; Cheng, M.M. CDNet: Complementary Depth Network for RGB-D Salient Object Detection. IEEE Trans. Image Process. 2021, 30, 3376–3390. [Google Scholar] [CrossRef]
  35. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A Deeply Supervised Image Fusion Network for Change Detection in High-Resolution Bi-Temporal Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  36. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  37. Chen, C.P.; Hsieh, J.W.; Chen, P.Y.; Hsieh, Y.H.; Wang, B.S. SARAS-Net: Scale and Relation Aware Siamese Network for Change Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2023), Washington, DC, USA, 7–14 February 2023; AAAI Press: Palo Alto, CA, USA, 2023; Volume 37, pp. 14187–14195. [Google Scholar] [CrossRef]
Figure 1. Illustration of different feature interaction mechanisms in deep learning. (a) CNN-based method: conceptualize an image as structured feature grids, employing convolutional layers that slide over the local space with a certain stride; (b) Transformer-based method: uses self-attention, treating an image as tokens, enabling each token to interact with other tokens; (c) Mamba-based method: uses the cross-scan module to integrate the pixels from different directions, achieving a global receptive field with linear complexity; (d) Our proposed method: While the cross-scan module captures a global receptive field, it simultaneously performs spatial compensation on neighboring pixels to enhance the preservation of local details. Arrows indicate the scanning direction of the model.
Figure 2. Overall architecture of the proposed model.
Figure 3. Dense Cross Fusion.
Figure 4. VSS-Conv Block.
Figure 5. SCMamba. The red dot represents the feature information of the current image block, the green dots represent the feature information of the surrounding image blocks, and the red arrows indicate the self-attention process.
Figure 6. Visual analysis of the results of each model on the LEVIR-CD dataset. Red: missed detections; Blue: false detections.
Figure 7. Visual analysis of the results of each model on the SYSU-CD dataset. Red: missed detections; Blue: false detections.
Table 1. Quantitative comparison results of various contrast models on the LEVIR-CD dataset. The best results are shown in bold.
Method          Precision (%)   Recall (%)   F1 (%)   IoU (%)
CDNet           82.74           89.95        86.20    75.74
DSIFN           88.82           90.64        89.72    81.36
BIT             88.37           90.95        89.64    81.23
ChangeFormer    84.12           89.56        86.76    76.62
SarasNet        84.65           92.75        88.51    79.40
ChangeMamba     86.61           90.40        88.46    79.32
DcscMamba       89.57           91.02        90.29    82.30
Table 2. Quantitative comparison results of various contrast models on the SYSU-CD dataset. The best results are shown in bold.
Method          Precision (%)   Recall (%)   F1 (%)   IoU (%)
CDNet           71.78           80.38        75.84    61.08
DSIFN           75.97           81.69        78.73    64.92
BIT             77.39           76.03        76.52    61.97
ChangeFormer    74.16           78.78        76.54    61.99
SarasNet        80.03           77.21        79.04    65.35
ChangeMamba     76.24           80.18        78.16    64.16
DcscMamba       82.56           76.88        79.62    66.13
Table 3. Quantitative results of ablation experiments on the LEVIR-CD dataset. The best results are shown in bold.
Method          Precision (%)   Recall (%)   F1 (%)   IoU (%)   Parameters (M)
Baseline        80.70           83.31        81.98    69.47     70.08
Baseline+Dc     85.78           88.95        87.34    77.53     71.73
Baseline+Sc     83.88           92.23        87.86    78.34     73.08
DcscMamba       89.57           91.02        90.29    82.30     74.73

