Article

VMMCD: VMamba-Based Multi-Scale Feature Guiding Fusion Network for Remote Sensing Change Detection

1 State Key Laboratory of Multispectral Information Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
2 Aerospace Automatic Control Institute, Beijing 100190, China
3 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(11), 1840; https://doi.org/10.3390/rs17111840
Submission received: 3 April 2025 / Revised: 14 May 2025 / Accepted: 23 May 2025 / Published: 24 May 2025

Abstract

Remote sensing image change detection, as a pixel-level dense prediction task, demands both high speed and high accuracy. Model redundancy and detection errors, particularly missed detections, generally degrade accuracy and merit further research; redundancy also reduces speed. To guarantee the efficiency of change detection in terms of both speed and accuracy, a VMamba-based Multi-scale Feature Guiding Fusion Network (VMMCD) is proposed. The network promptly models global relationships and realizes multi-scale feature interaction. Specifically, the Mamba backbone replaces the commonly used CNN and Transformer backbones; by leveraging VMamba's global modeling ability with linear computational complexity, the computational resources needed for extracting global features are reduced. Secondly, considering the characteristics of the VMamba model, a compact and efficient lightweight network architecture is devised to reduce model redundancy and avoid extracting or introducing interfering and redundant information, thereby enhancing both speed and accuracy. Finally, the Multi-scale Feature Guiding Fusion (MFGF) module is developed, which strengthens the global modeling ability of VMamba and enriches the interaction among multi-scale features to address the common issue of missed detections in changed areas. The proposed network achieves competitive results on three publicly available datasets (SYSU-CD, WHU-CD, and S2Looking) and surpasses the current state-of-the-art (SOTA) methods on the SYSU-CD dataset, with an F1 of 83.35% and IoU of 71.45%. Moreover, for 256 × 256 inputs, it is more than three times faster than the current SOTA VMamba-based change detection model. These results demonstrate the effectiveness of the proposed approach.

1. Introduction

With the increasing importance of remote sensing technology, the domain of change detection has progressively gained prominence. Change detection aims to identify changes between scenes at different time phases in order to achieve effective scene monitoring. Consequently, change detection finds extensive application in a diverse array of fields, encompassing forest vegetation monitoring [1], urban planning [2,3,4], land cover change analysis [5], disaster monitoring and evaluation [6], and military reconnaissance [7].
Conventional change detection methods are generally divided into pixel-based [8] and object-based approaches [9], depending on the analysis unit. However, advances in optical Very-High-Resolution (VHR) imaging have greatly improved image detail, revealing the finer texture and geometric structure of ground objects. This increased detail also introduces greater heterogeneity within regions of the same spatial scale, which limits the robustness and effectiveness of traditional methods.
With the evolution of software and hardware, deep learning-based change detection methods have gained broader adoption, significantly enhancing detection efficiency and accuracy. Currently, methods based on CNNs and Transformers are widely used. Since Daudt et al. [10] proposed the Fully Convolutional Early Fusion (FC-EF) approach, integrating FCNs into change detection, CNN-based methods have dominated for an extended period. During this era, many outstanding CNN-based methods emerged [11,12,13]. Nevertheless, these methods are subject to CNNs’ inherent limitations. Specifically, the receptive field is confined by the kernel size and number of layers, hindering global modeling and limiting their competence in complex scenes. To address this issue, various techniques such as deeper networks [14,15,16] or attention mechanisms [17,18] have been used to mitigate CNN limitations, albeit with increased computational cost.
Later, Dosovitskiy et al. [19] introduced the Vision Transformer (ViT) [20] to the visual domain. The ViT divides an image into fixed-size patches (e.g., 16 × 16 pixels) and treats them as a sequence. Via self-attention, each patch interacts with all others, enabling the model to consider the entire image holistically, thus addressing limited receptive fields [21]. Owing to this global modeling capability, the ViT has gained popularity in change detection [22,23,24]. Transformer-based models outperform CNNs in accuracy, owing to their potent global self-attention mechanism [19,25]. However, this also brings quadratic computational complexity. According to Andrew et al. [26], when computational cost and training time are comparable, the accuracy of CNNs and ViTs is nearly identical. Hence, it is difficult to optimize both accuracy and computational cost by choosing either a CNN or ViT alone, without considering model design and data resources.
While grappling with the challenge of balancing modeling capability against computational cost, we found inspiration in the Mamba model. The Mamba [27] model, an emerging sequence modeling approach predicated on the Structured State Space Sequence Model (S4) [28], is devised to address long-term dependency concerns. Mamba introduces input-dependent, time-varying parameters to the state space model (SSM) and mitigates the modeling limitations of CNNs via a global receptive field and dynamic weighting, consequently enhancing the model's context-based reasoning capacity. At the same time, Mamba exhibits linear computational complexity and effectively curtails computational expenditure. This efficiency attests to Mamba's significant potential as a foundational model, and its success has prompted its integration into the vision domain [29,30]. To date, numerous visual tasks built on the Mamba model have attained satisfactory outcomes [31,32,33].
Despite these advancements, the application of Mamba in change detection is still in its infancy and has raised several open challenges, mainly in three aspects. Firstly, change detection inherently requires capturing spatial dependencies across bi-temporal images, yet the original VMamba scanning strategy is often insufficient to fully encode cross-scale and cross-location contextual interactions. However, existing methods attempt to address this challenge by introducing additional scan paths, which is an inherently flawed approach, as it still fails to guarantee the capture of dependencies in all directions [34,35]. Meanwhile, many existing models used for this task have unnecessary architectural complexity and inflated parameter counts [34,35,36], which are inconsistent with the semantic simplicity of change detection. Change detection is a process of reducing information volume, benefiting from the simplicity rather than the redundancy of the model [37]. Moreover, in practical applications, the missed-detection issue is frequently observed, which is an intolerable type of error [38]. In high-risk scenarios, such as disaster monitoring [6] or military surveillance [7], the cost of such omissions can be catastrophic. These challenges underscore the need for lightweight models that are specifically tailored for change detection, with global spatial awareness and heightened sensitivity to missed detections.
To overcome the aforementioned issues, in this paper we develop the VMMCD model by leveraging the characteristics of the Mamba model. Specifically, inspired by the concept of integrating CNNs and Transformers, we devise a lightweight, Transformer-like architecture for VMMCD, which significantly augments the global modeling capacity of the Mamba model while circumventing structural redundancy. We introduce a plug-and-play Multi-scale Feature Guiding Fusion (MFGF) module, an enhanced self-attention module. On the one hand, it further boosts global modeling ability; on the other, it intensifies feature interaction across multiple scales, thereby resolving or mitigating the issue of missed detections. Overall, VMMCD achieved an excellent balance between speed and accuracy, as well as between missed and false detections, in our results.
The main contributions of this paper are as follows:
  • We propose VMMCD, a lightweight yet effective model designed for change detection by adapting the VMamba backbone. The model employs a hierarchical architecture with Patch Merging, allowing it to preserve the long-range modeling strength of Mamba while significantly reducing structural redundancy and computational overhead.
  • We introduce a plug-and-play Multi-scale Feature Guiding Fusion (MFGF) module to enhance global modeling and inter-scale information interaction. This module reinforces feature fusion from deep to shallow layers, addressing incomplete contextual encoding and substantially reducing missed detections in complex scenarios.
  • We conduct extensive qualitative and quantitative evaluations on three benchmark datasets—SYSU-CD, WHU-CD, and S2Looking. The results demonstrate that VMMCD not only achieves competitive performance with high accuracy and efficiency but also effectively mitigates both redundancy and omission-related errors.
The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 provides a detailed introduction to the proposed VMMCD architecture. Section 4 discusses the results of numerous comparative experiments. Finally, Section 5 draws conclusions.

2. Related Works

2.1. VMamba Model

Liu et al. [29] proposed VMamba, which incorporated Mamba into the domain of vision. The Mamba model is founded on the Structured State Space Sequence Model (S4 model). The S4 model stems from the prevalent Linear Time-Invariant (LTI) system within classical state space models:
$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t)$$
These equations express a one-dimensional linear mapping through a hidden intermediate state $h(t) \in \mathbb{R}^{N}$, where $t$ denotes time, $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$.
In the realm of deep learning, the model under consideration is discretized, thereby necessitating a transformation from continuous-time to discrete-time formulations. A prevalent approach pertains to the utilization of the Zero-Order Hold (ZOH) method, which can be expounded as follows:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$
The ultimate discretized form is presented as follows:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$
This allows the results to be computed in parallel through a global convolution. Nevertheless, the Mamba model makes specific alterations to the above process. Precisely, Mamba ties certain parameters to the input, transforming the system into a linear time-varying one, surmounting the limitations of LTI systems, and acquiring additional learnable parameters. Furthermore, VMamba introduces the Cross-Scan Module (CSM), specially designed for two-dimensional image data, which plays a key role in endowing VMamba with linear complexity. The CSM merely requires scanning all patches along four distinct paths to ascertain the correlation between any patch and the others.
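To make the discretization concrete, the following is a minimal sketch (in PyTorch, which the authors use for their experiments) of the ZOH discretization and the resulting recurrence for a diagonal state matrix. It omits Mamba's input-dependent (selective) parameterization of Δ, B, and C and is purely illustrative; all names are ours.

```python
import torch

def zoh_discretize(A, B, delta):
    """Zero-Order Hold discretization of a diagonal SSM (illustrative shapes).
    A: (N,) diagonal state matrix, B: (N,), delta: scalar step size."""
    A_bar = torch.exp(delta * A)                      # exp(ΔA)
    B_bar = (torch.expm1(delta * A) / A) * B          # (ΔA)^{-1}(exp(ΔA)-I)ΔB for diagonal A
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta):
    """Sequential form of h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    A_bar, B_bar = zoh_discretize(A, B, delta)
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:                                     # x: (L,) scalar input sequence
        h = A_bar * h + B_bar * x_t
        ys.append((C * h).sum())
    return torch.stack(ys)

# toy usage with a stable (negative) diagonal state matrix
N, L = 16, 64
A = -torch.rand(N)
B, C = torch.randn(N), torch.randn(N)
y = ssm_scan(torch.randn(L), A, B, C, delta=0.1)
```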
VMamba-based methods have shown promising performance in various vision tasks, including medical image segmentation (e.g., P-Mamba [39], VM-UNet [33]), hyperspectral image classification (Mamba-in-Mamba [40]), and multi-modal learning (VL-Mamba [41]).
Recently, VMamba has been applied to change detection. Zhao et al. [34] extended the CSM by adding more scanning paths to better model global spatial relations, while [35] adapted the CSM for bi-temporal inputs to capture inter-image dependencies. The above two methods achieved satisfactory accuracy metrics. However, they still have certain deficiencies and room for improvement. For instance, their approaches greatly increased the computational complexity of the model. The former doubled the computational complexity, while the latter increased it by 1.75 times, which is very detrimental to the inference speed of the methods. Secondly, they introduced additional structures and parameters. As we discussed in the Introduction, this may lead to structural redundancy of the models, which is not necessarily beneficial to the accuracy performance of the model. In contrast, our proposed VMMCD adopts a lightweight architecture that reduces both computational burden and potential structural redundancy.

2.2. Feature Fusion and Interaction

Feature fusion has been extensively employed in the realm of deep learning. According to the input sources, it can be categorized into multi-level fusion, multi-scale fusion, and heterogeneous feature fusion [42]. Straightforward feature fusion methods usually entail operations devoid of extra parameters, including addition [33], weighted sum, concatenation, pooling [43], and others. These methods are generally stable and do not substantially augment computational costs. Nevertheless, the performance of such feature fusion is frequently not outstanding, rendering it appropriate as a baseline approach for feature fusion. Numerous studies have put forward more efficacious feature fusion techniques. For instance, Huang et al. [44] utilized a feature fusion strategy predicated on coordinated attention to concentrate on the disparities between bi-temporal images. Wang et al. [45] integrated pixel-level and object-level features to accentuate geographical proximity.
Feature interaction is frequently considered as a stage within the process of feature fusion [38]. Nevertheless, certain scholars define it as an independent procedure distinct from feature fusion, which pertains to the correlation or interaction of homogeneous/heterogeneous features during the feature extraction stage preceding the fusion process [42]. In this study, we subscribe to the former perspective that feature interaction constitutes a stage within the feature fusion process. Notably, our primary focus lies on the feature interaction process that occurs across multiple scales.

3. Proposed Method

3.1. Overall Architecture

The proposed VMMCD architecture is illustrated in Figure 1. The model employs a typical U-Net architecture, but its distinction lies in a meticulously designed three-stage encoder–decoder backbone, which, unlike many counterparts of similar depth, is entirely based on Vision State Space (VSS) layers. This specific VSS-based configuration is tailored to optimize the tradeoff between representational capacity and computational load. The architecture comprises an encoder and decoder built with VSS layers, a patch embedding layer, a final classification layer, and the MFGFs. Our objective is to design a lightweight change detection network to prevent the extraction or introduction of interfering and irrelevant information that can arise from overly complex model structures. Moreover, a lightweight model conserves computing resources. Motivated by the work in [46], we devised the overall architecture of the lightweight model VMMCD. In addition, given that the VMamba block is a plug-and-play module analogous to the ViT block with an identical number of input and output channels, we also incorporated the design from [47] to augment the global modeling capacity of VMamba.
Let us denote the tensors for the images at times T1 and T2 as $X \in \mathbb{R}^{C_0 \times H \times W}$ and $Y \in \mathbb{R}^{C_0 \times H \times W}$. Conventionally, $C_0$ represents the number of channels ($C_0 = 3$), while $H$ and $W$ represent the height and width of the tensors, respectively.
Firstly, the inputs X and Y are processed by the Patch Embedding layer. This layer employs a 4 × 4 convolution with a stride of 4 to partition the input images into multiple non-overlapping 4 × 4 patches. Such downsampling mitigates interfering and irrelevant information, thereby facilitating subsequent global modeling by VMamba and MFGF. Layer normalization follows, yielding embedded images $F_0^X, F_0^Y \in \mathbb{R}^{C \times \frac{H}{4} \times \frac{W}{4}}$, with C set to 96 by default. Afterwards, they are forwarded to the VMamba-based Siamese encoder (details in Section 3.2) with shared weights for feature extraction. Several studies [34,35] have indicated that the default scanning path of the CSM in VMamba is imperfect. Our strategy is to maintain the original structure of the VSS block and externally implement two enhancement methods. The first is to conduct downsampling via Patch Merging, whose global linear mapping can effectively model global relationships. The second is to leverage the self-attention block in MFGF (details in Section 3.3) to progressively guide the multi-scale features extracted by the encoder, which effectively models multi-scale global feature relationships and augments the expression of deep abstract features. The features are then fully fused through the decoder, and the feature map is gradually restored, via linear mapping, to the same size as the input tensor of the same-level encoder. Finally, the decoder output is passed through a final classification layer, restoring it to the original image dimensions, and a convolution layer generates the final change detection result.
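As an illustration of the patch embedding step described above (a 4 × 4 convolution with stride 4 followed by layer normalization, with C = 96), a minimal PyTorch sketch is given below; the class and variable names are ours, and the authors' implementation may differ in detail.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping 4x4 patches and embed them (sketch)."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 96):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, 96, H/4, W/4)
        x = x.permute(0, 2, 3, 1)             # channels-last for LayerNorm
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)          # back to (B, 96, H/4, W/4)

# bi-temporal inputs share the same embedding layer (Siamese setting)
embed = PatchEmbedding()
F0_X = embed(torch.randn(1, 3, 256, 256))     # -> (1, 96, 64, 64)
F0_Y = embed(torch.randn(1, 3, 256, 256))
```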

3.2. VMamba-Based Encoder and Decoder

We designed the encoder and decoder based on VMamba. Diverging from conventional deeper architectures, which often employ four or more stages, our proposed encoder was strategically designed with three stages of VSS layers with shared weights. This streamlined configuration was chosen to optimize computational efficiency and directly address the challenge of extracting salient change features without introducing the redundancy or noise often associated with excessive depth, which is particularly pertinent for the focused task of change detection. Each VSS layer incorporates a set of VSS blocks, with the default number set to 2, which is utilized for modeling global context information. Figure 2a presents the detailed structure of the VSS block. Notably, the 2D Selective Scanning (SS2D) module proposed in VMamba [29] effectively addresses the issue of the suboptimal performance of 1D Selective Scanning (SS1D) [27] in the modeling of 2D image data.
The data processing within the SS2D module entails three steps: cross-scanning, selective scanning via the S6 block, and cross-merging. Herein, cross-scanning and cross-merging are jointly referred to as the Cross-Scan Module (CSM). Upon data input, cross-scanning unfolds the image patches into sequences along four distinct traversal paths. Subsequently, the four sequences are processed in parallel by four S6 blocks. Eventually, cross-merging reshapes and combines the resulting sequences to generate the output map. We illustrate the self-attention method with quadratic complexity and the cross-scanning method with linear complexity in Figure 2c and Figure 2d, respectively. Because the scanning paths traversed by different patches overlap substantially, redundant computation is inherently avoided. In contrast, the self-attention method, which offers equivalent global modeling capability, requires computing the correlation between all patches, leading to a computational complexity of $O(n^2)$.
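The sketch below illustrates one plausible form of the cross-scan and cross-merge operations: a feature map is unfolded along four traversal paths (row-major, column-major, and their reverses), and the four processed sequences are folded back and summed. It is a simplified illustration under our own assumptions, not the SS2D implementation released with VMamba [29].

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into 4 directional sequences (B, 4, C, H*W)."""
    row = x.flatten(2)                                   # row-major scan
    col = x.transpose(2, 3).flatten(2)                   # column-major scan
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)

def cross_merge(seqs: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Reverse the four scans and sum them back into a (B, C, H, W) map."""
    B, _, C, L = seqs.shape
    row, col, row_r, col_r = seqs.unbind(dim=1)
    out = row + row_r.flip(-1)
    out = out + (col + col_r.flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
    return out.view(B, C, H, W)

# each of the four sequences would be processed by its own S6 block before merging
feat = torch.randn(2, 96, 64, 64)
merged = cross_merge(cross_scan(feat), 64, 64)
```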
Within the first two layers of our three-stage encoder, the feature maps at each level are processed by a Patch Merging layer. Specifically, to enhance the global modeling capability of the model and circumvent the issue of overly complex models mentioned at the outset, a Patch Merging layer was appended after the VSS block to operate on the channel dimension of the feature map, as shown in Figure 2b. Given an input $f \in \mathbb{R}^{C \times H \times W}$, the Patch Merging layer first performs interval sampling, which splits each group of adjacent patches into four smaller sub-maps. These are then concatenated along the channel axis, transforming the original feature map into one of size $(4C) \times (H/2) \times (W/2)$. Thereafter, a fully connected layer compresses the channels, yielding a feature map of size $(2C) \times (H/2) \times (W/2)$. This downsampling methodology effectively transforms the bi-temporal images into image tokens, preserving information integrity and facilitating global feature extraction by the VMamba-based encoder. Crucially, at the third and deepest stage of our encoder, the Patch Merging layer is omitted. It should be noted that, in the absence of channel compression, the Patch Merging layer would quadruple the number of channels of the output feature map relative to the input, resulting in an excessively large channel dimension at the bottom layer (reaching 1536) and leading to model redundancy. As stated in Section 1, change detection is a process in which the amount of information is substantially reduced from input to output. In other words, it can also be regarded as a particular process of "denoising" or "eliminating". Therefore, minimizing the introduction of superfluous features or an overly expansive feature space is paramount. Excessive feature channel dimensions and an overly large feature space slow the model down and impede its ability to identify key feature information within a vast amount of interfering or irrelevant information.
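A compact sketch of the Patch Merging step described above (interval sampling into four sub-maps, concatenation to 4C channels, and linear compression to 2C) might look as follows; names and layout are our own.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample (C, H, W) -> (2C, H/2, W/2) via interval sampling + linear compression (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # interval sampling: the four interleaved sub-maps of each 2x2 neighbourhood
        x0, x1 = x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2]
        x2, x3 = x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]
        x = torch.cat([x0, x1, x2, x3], dim=1)            # (B, 4C, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                         # channels-last for the linear layers
        x = self.reduction(self.norm(x))                  # compress 4C -> 2C
        return x.permute(0, 3, 1, 2)                      # (B, 2C, H/2, W/2)

merge = PatchMerging(dim=96)
out = merge(torch.randn(1, 96, 64, 64))                   # -> (1, 192, 32, 32)
```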
The inputs at each encoder stage, represented by $\{F_i^X\}_{i=1}^{3}$ and $\{F_i^Y\}_{i=1}^{3}$, are combined via MFGFs at the skip connections. Moreover, the output streams at the lowest level are aggregated and then input into the decoder.
The proposed decoder adopts a structure similar to that of the encoder, where each VSS layer also contains VSS blocks. Notably, we use only one VSS block by default in each decoder VSS layer. This simplification further improves computational efficiency without compromising performance. In addition, we employ a patch expansion layer to perform upsampling within the decoder. Finally, residual connections in the decoder fuse the upsampled features with the corresponding features from the encoder at the same level, thereby refining and reweighting pixel-level information in the decoder’s feature maps.
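The patch expansion layer is not spelled out in the text; the sketch below shows one plausible realization that mirrors Patch Merging, expanding channels with a linear layer and rearranging them into a 2 × 2 spatial neighbourhood. It is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchExpanding(nn.Module):
    """Upsample (2C, H, W) -> (C, 2H, 2W); a sketch of the inverse of Patch Merging."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)    # produce 2x2 x (dim/2) features
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        x = self.expand(x.permute(0, 2, 3, 1))                # (B, H, W, 2C)
        # rearrange the expanded channels into a 2x2 spatial neighbourhood (C/2 channels each)
        x = x.view(B, H, W, 2, 2, C // 2)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * H, 2 * W, C // 2)
        return self.norm(x).permute(0, 3, 1, 2)               # (B, C/2, 2H, 2W)

up = PatchExpanding(dim=384)
out = up(torch.randn(1, 384, 16, 16))                         # -> (1, 192, 32, 32)
```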

3.3. Multi-Scale Feature Guiding Fusion (MFGF) Module

To significantly enhance the global modeling capabilities of the VMamba backbone, which inherently processes information sequentially, we devised the MFGF module. As expounded in Section 1, several other change detection approaches predicated on VMamba endeavor to incorporate additional scanning paths to bolster the modeling prowess of VMamba [34,35]. Nevertheless, these strategies remain incapable of encompassing all conceivable patch dependency relationships. Motivated by certain methodologies integrating CNN and ViT [23], we employ non-local self-attention to address this issue. The proposed MFGF block is engineered as a lightweight, plug-and-play module leveraging the principles of non-local attention [48]. Its core innovation lies in its ability to transcend the limitations of VMamba's restricted scanning paths by establishing direct, long-range dependencies across all feature map locations via its global self-attention mechanism. This effectively compensates for VMamba's local processing bias, enabling a more holistic understanding of feature interrelations and thereby enhancing the model's precision in identifying complex and distributed changes. The non-local attention operation can be formulated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{head}}}\right) V$$
Here, Q, K, and V represent the Query, Key, and Value, respectively. Furthermore, the MFGF module incorporates a sophisticated multi-scale feature integration strategy. Crucially, each MFGF unit receives an auxiliary input stream from the MFGF module at the subsequent deeper layer. This deeper-level input, rich in abstract semantic information, is then fused with the shallower features via a residual connection. This hierarchical feature fusion, leveraging richer semantic information from deeper layers to refine shallower representations, significantly strengthens the interaction of multi-scale feature information. This design not only guides the effective fusion of shallow, detail-rich features with deep, context-aware features but also demonstrably improves the module's discriminative power for subtle change detection without a commensurate increase in model complexity.
The structure of MFGF is illustrated in Figure 3. Regarding the inputs $\{F_i^X\}_{i=1}^{3}$ and $\{F_i^Y\}_{i=1}^{3}$ for each level of the encoder, the two signals are first summed directly to form the input of MFGF. This represents the characteristics of the two input feature maps on a single feature map at a reduced computational cost:
$$F_i = F_i^X + F_i^Y, \qquad i = 1, 2, 3$$
Subsequently, a residual connection is constructed between the input of the i-th level MFGF, $F_i$, and the output of the deeper MFGF, $F'_{i+1}$. Before the multiplication operation, Patch Expanding and a sigmoid activation function are applied to $F'_{i+1}$. The Patch Expanding operation resizes $F'_{i+1}$ to match the dimensions of $F_i$, and the sigmoid function transforms the values into weights ranging from 0 to 1:
$$F_i^G = F_i \cdot \mathrm{Sigmoid}\left(\mathrm{PE}(F'_{i+1})\right) + F_i$$
The resultant $F_i^G$ is subsequently fed into the QKV Conv layer, thereby generating Q, K, and V outputs via three separate convolution layers:
$$(Q_i, K_i, V_i) = \mathrm{QKVConv}(F_i^G)$$
The self-attention result is computed from $Q_i$, $K_i$, and $V_i$, and then incorporated into $F_i$ to yield the MFGF output $F'_i$:
$$F'_i = F_i + \mathrm{Trans}(A)$$
$$A = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
Notably, to ensure high computational efficiency, MFGF modules are exclusively integrated within the skip connections at all stages. Under this setting, MFGF operates on feature maps that have already been downsampled by factors of 4, 8, and 16 relative to the input resolution. This targeted application judiciously balances the potent global modeling capability of self-attention with computational tractability, effectively mitigating the substantial computational overhead typically associated with applying such mechanisms to high-resolution feature maps. This allows MFGF to deliver enhanced representational power and improved accuracy while preserving the overall efficiency of the change detection network.
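Putting Equations (5)–(9) together, a schematic single-head PyTorch sketch of an MFGF unit is given below. The exact form of the Patch Expanding branch, the QKV convolutions, and the Trans(·) reshaping are assumptions on our part, and all layer names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFGF(nn.Module):
    """Multi-scale Feature Guiding Fusion (schematic sketch, single-head attention)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_conv = nn.Conv2d(dim, dim, 1)
        self.k_conv = nn.Conv2d(dim, dim, 1)
        self.v_conv = nn.Conv2d(dim, dim, 1)
        # assumed Patch Expanding: map the deeper MFGF output (2*dim channels) to this scale
        self.expand = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Upsample(scale_factor=2))

    def forward(self, f_x, f_y, deeper=None):
        f = f_x + f_y                                           # sum the bi-temporal features
        if deeper is not None:                                  # deep-to-shallow sigmoid guidance
            f = f * torch.sigmoid(self.expand(deeper)) + f
        B, C, H, W = f.shape
        q = self.q_conv(f).flatten(2).transpose(1, 2)           # (B, HW, C)
        k = self.k_conv(f).flatten(2)                           # (B, C, HW)
        v = self.v_conv(f).flatten(2).transpose(1, 2)           # (B, HW, C)
        attn = F.softmax(q @ k / C ** 0.5, dim=-1)              # global self-attention map
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)    # reshape back to a spatial map
        return f + out                                          # residual output

# deepest level has no deeper guidance; shallower levels receive the deeper MFGF output
f3 = MFGF(dim=384)(torch.randn(1, 384, 16, 16), torch.randn(1, 384, 16, 16))
f2 = MFGF(dim=192)(torch.randn(1, 192, 32, 32), torch.randn(1, 192, 32, 32), deeper=f3)
```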

3.4. Loss Function

Two loss functions are considered: Binary Cross-Entropy (BCE) Loss and Focal Loss [49]. The BCE Loss, which is prevalently utilized in binary classification tasks, is defined by Equation (10), where $y_i$ denotes the ground-truth value and $\hat{y}_i$ represents the predicted value. However, when $\hat{y}_i$ approaches 0 or 1, $\log(\hat{y}_i)$ or $\log(1-\hat{y}_i)$ approaches negative infinity, resulting in an infinite loss and the collapse of training. To prevent this issue, we employ the Sigmoid BCE Loss, which keeps the predicted probabilities away from exactly 0 or 1.
Nevertheless, the CD task represents an extremely foreground–background class-imbalanced binary classification problem. This imbalance induces the model to be inclined towards making negative predictions, ultimately resulting in significant missed detections. The conventional BCE Loss fails to adequately address this problem. Hence, we incorporate Focal Loss to diminish the model’s focus on the numerous easy samples during the training phase and enhance its attention towards the hard samples. Focal Loss can be formulated as per Equation (11):
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right)$$
$$L_{\mathrm{Focal}} = -\frac{1}{N}\sum_{i=1}^{N}\left(\alpha (1 - \hat{y}_i)^{\gamma}\, y_i \log(\hat{y}_i) + (1 - \alpha)\, \hat{y}_i^{\gamma}\,(1 - y_i)\log(1 - \hat{y}_i)\right)$$
The terms α and γ here refer to the hyperparameters in Focal loss, which are used to balance the contributions of easy and hard samples. Typically, they are set to α = 0.75 and γ = 2 .
BCE Loss and Focal Loss are combined through a weighted sum. The final loss function is defined as
$$L_{\mathrm{ours}} = \lambda L_{\mathrm{BCE}} + L_{\mathrm{Focal}}$$
The hyperparameter λ serves to modulate the respective contributions of BCE Loss and Focal Loss. Since BCE Loss, unlike Focal Loss, does not incorporate the positive–negative sample adjustment parameter α , λ can be regarded as the intensity of applying positive–negative sample adjustment. Consequently, a higher λ value assigns greater weight to the BCE Loss, weakening the positive–negative sample adjustment.
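The combined loss can be written compactly as below. This is a hedged sketch assuming sigmoid-probability outputs and the default values α = 0.75, γ = 2, and λ = 0.2 mentioned in the text; the clamping constant is our own safeguard against log(0).

```python
import torch

def combined_loss(logits, target, alpha=0.75, gamma=2.0, lam=0.2, eps=1e-7):
    """Weighted sum of BCE and Focal loss (sketch; probabilities clamped to avoid log(0))."""
    p = torch.sigmoid(logits).clamp(eps, 1.0 - eps)
    bce = -(target * torch.log(p) + (1 - target) * torch.log(1 - p)).mean()
    focal = -(alpha * (1 - p) ** gamma * target * torch.log(p)
              + (1 - alpha) * p ** gamma * (1 - target) * torch.log(1 - p)).mean()
    return lam * bce + focal

# usage: logits and binary ground-truth maps of shape (B, 1, H, W)
loss = combined_loss(torch.randn(4, 1, 256, 256),
                     torch.randint(0, 2, (4, 1, 256, 256)).float())
```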

4. Experiments and Results

4.1. Datasets

In the experimental section of this paper, we utilized three publicly available datasets: SYSU-CD [50], WHU-CD [51], and S2Looking [52]. These three datasets are currently among the most representative publicly available benchmarks for change detection. They cover several typical challenges encountered in change detection tasks, including extreme class imbalance (WHU-CD and S2Looking), category-related changes (WHU-CD and S2Looking), and category-agnostic changes (SYSU-CD). Moreover, based on the performance discrepancies of most existing methods across these datasets, they can be roughly categorized by difficulty level: WHU-CD as easy, SYSU-CD as moderate, and S2Looking as challenging. Therefore, we selected these datasets for our experiments to evaluate the proposed method under varying conditions.

4.1.1. SYSU-CD

SYSU-CD is a category-agnostic remote sensing image change detection dataset proposed by Sun Yat-sen University. It contains 20,000 pairs of 256 × 256 aerial images taken in Hong Kong from 2007 to 2014. The dataset is split into training, validation, and test sets with a ratio of 6:2:2. The main types of changes include urban construction, vegetation, road modifications, and offshore developments.

4.1.2. WHU-CD

WHU-CD is a remote sensing image change detection dataset developed by Wuhan University, primarily for building change detection tasks. The original dataset consists of a pair of aerial images with dimensions of 32,507 × 15,354 pixels. For our experiments, we obtained non-overlapping 256 × 256 image patches from the researchers' webpage. The training, validation, and test sets contain 4536, 504, and 2760 image pairs, respectively.

4.1.3. S2Looking

The S2Looking dataset, released by the Chinese Academy of Sciences in 2021, focuses on building change detection. It comprises 5000 pairs of bi-temporal images, each 1024 × 1024 , with a spatial resolution ranging from 0.5 to 0.8 m per pixel (m/pixel). Compared to many previous datasets, S2Looking is distinguished by its wider viewing angles, significant variations in illumination, and more complex scene characteristics. For our research, we utilized a publicly available, cropped version of this dataset consisting of non-overlapping 256 × 256 pixel patches. This version was split into training, validation, and test sets in a 7:1:2 ratio.

4.2. Experimental Setup

4.2.1. Implementation Details

Our model was implemented in PyTorch and trained and tested on an NVIDIA RTX 4090. The AdamW optimizer [53] was employed with a weight decay of 2.5 × 10⁻³, a learning rate of 2.5 × 10⁻⁴, a batch size of 8, and a maximum of 50 epochs. Before training, data augmentation methods, including random noise addition, random rotation, and random cropping, were applied.

4.2.2. Evaluation Metrics

To conduct a comprehensive performance evaluation of VMMCD, we selected Precision, Recall, F1-score, and Intersection over Union (IoU) as the evaluation metrics. These metrics are formulated as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2}{Recall^{-1} + Precision^{-1}}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The IoU is considered the most compelling evaluation metric, as it reflects the overlap ratio between inference results and ground truth. F1 is also highly informative, as it balances Precision and Recall, which are to some extent mutually restrictive.
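For reference, all four metrics follow directly from the confusion counts; a small helper (names ours) is shown below.

```python
def metrics_from_counts(tp: int, fp: int, fn: int):
    """Compute Precision, Recall, F1, and IoU from confusion counts (sketch)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / recall + 1 / precision)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```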

4.3. Comparison to State-of-the-Art (SOTA) Methods

In the comparative experiment, we selected several state-of-the-art (SOTA) methods for comparison. According to their feature extraction approaches, these methods can be classified into three categories: CNN-based methods (FC-EF [10], FC-Siam-Conc [10], FC-Siam-Diff [10], TinyCD [46], SNUNet [54], and CGNet [38]), Transformer-based methods (BIT [55] and ChangeFormer [36]), and VMamba-based methods (RS-Mamba [34] and ChangeMamba [35]).
When comparing the aforementioned methods, if the dataset partitioning in the original works was consistent with ours, we directly adopted the reported metric values. Otherwise, we reproduced these open-source methods for experimentation, using all the hyperparameters recommended in the original papers. It is noteworthy that the performance metric values obtained from most of our reproductions are higher than those reported in the original papers.

4.3.1. Quantitative Results

Table 1 presents the comparative experimental results with the selected state-of-the-art methods on the three datasets. For ease of viewing, we highlight the first, second, and third places in red, blue, and black, respectively.
For the category-agnostic dataset SYSU-CD, the proposed method achieved the best results in the key metrics of F1-score and IoU, with values of 83.35% and 71.45%, respectively. Notably, these scores surpass the previous state-of-the-art F1 and IoU scores (83.11% and 71.10%) reported in the literature. This demonstrates the effectiveness of the proposed VMamba-based multi-scale feature guiding change detection method.
However, for the building change detection dataset WHU-CD, the proposed method did not achieve the best results (92.52% and 86.08%) in the comparison but rather ranked third. Nevertheless, as shown in Table 2, our model’s inference speed (73.05 fps) was higher than that of the top two methods (which were 58.58 fps and 16.89 fps, respectively), and the model size is also more lightweight.
For the highly imbalanced dataset S2Looking, observing the performance metrics of the compared methods reveals that, overall, the various metrics of each method are not high. This indicates that the S2Looking dataset is very challenging due to its severe class imbalance between positive and negative samples. The proposed method achieved the best F 1 and I o U results in the comparative experiments, highlighting its superiority on this dataset.
In addition, in our experience, researchers typically focus more on comprehensive quantitative metrics such as F1 and IoU, while comparatively less attention is given to indicators that, though less comprehensive, can still elucidate certain issues, such as Precision and Recall. We posit that these indicators are also of value. To further examine the imbalance between missed detections and false detections, we applied mathematical transformations and statistical analyses to Precision and Recall, as depicted in Figure 4. Specifically, we computed the proportions of false detections and missed detections within the total errors. In this context, green and red denote missed detections and false detections, respectively.
Upon observing the statistical outcomes, it becomes evident that the compared methods exhibit a particular phenomenon: in the preponderant majority of cases, the proportion of missed detections exceeds that of false detections. Among them, although FC-Siam-Diff [10] had a lower proportion of missed detections on SYSU-CD, it went to the other extreme. This implies that for the majority of methods, missed detections constitute the most prominent factor constraining the model's accuracy score. The occurrence of this phenomenon is not accidental. It primarily stems from the imbalance between positive and negative samples in the training data. In change detection, positive and negative samples are defined based on whether changes occur between remote sensing images captured at two different time points. Pixels within the changed regions are considered positive samples, while those outside the changed regions are considered negative samples. In most cases, the quantity of negative samples is typically several-fold or even dozens-fold that of positive samples. Consequently, the model is predisposed to make negative predictions, thereby giving rise to missed detections. Nevertheless, our method mitigates this issue to a substantial extent. As can be discerned, in the incorrect predictions of VMMCD, the ratio of missed detections to false detections is more balanced compared to that of other methods. This indicates that our method can effectively address the problem of missed detections in the change area, thereby achieving a better balance between missed detections and false detections. The visualization results presented in Figure 4 also corroborate this contention. Although the proportion of FN is slightly higher on the WHU-CD dataset, this difference does not contradict the aforementioned conclusion. As shown in Table 1, unlike SYSU-CD and S2Looking, the performance metrics on WHU-CD are generally higher. This indicates that the overall prediction error (i.e., FP + FN) for most methods is relatively low. In such cases, even slight imbalances between FP and FN can lead to large fluctuations in the ratios FN/(FP + FN) and FP/(FP + FN). Nevertheless, according to the results in Table 1, the proposed method achieved very close Precision and Recall (93.84% and 91.23%, respectively), indicating that the proportions of missed detections and false detections remain well-balanced.
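For clarity, the missed/false-detection proportions of the kind shown in Figure 4 can be recovered from Precision and Recall alone, since the common factor TP cancels out. The sketch below shows one such transformation; it is our reconstruction of how the statistic can be computed, not necessarily the authors' exact procedure.

```python
def error_proportions(precision: float, recall: float):
    """Split total errors into missed (FN) and false (FP) detection shares.
    Uses FP = TP*(1/precision - 1) and FN = TP*(1/recall - 1); TP cancels in the ratio."""
    fp_rel = 1.0 / precision - 1.0
    fn_rel = 1.0 / recall - 1.0
    total = fp_rel + fn_rel
    return fn_rel / total, fp_rel / total   # (missed share, false share)

# e.g. the WHU-CD scores reported above: Precision 93.84%, Recall 91.23%
missed, false = error_proportions(0.9384, 0.9123)
```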

4.3.2. Qualitative Visualization Results

In addition to the aforementioned quantitative indicator analysis, to visually illustrate the efficacy of the proposed method in minimizing missed detections, we showcase the qualitative visualization outcomes of the binary change detection using the proposed method and other comparative methods in Figure 5, Figure 6 and Figure 7. For clarity, we use red and green to represent FP and FN, respectively, and mark challenging sample areas with red boxes.
In the visualization results of the SYSU-CD dataset, extensive red regions (representing false detections) and green regions (denoting missed detections) are observed, with the green regions being more conspicuous, suggesting a more pronounced problem of missed detections. Through qualitative examination of the bi-temporal change images, it is apparent that in the majority of cases, the extent of change within the regions is not highly pronounced (Figure 5(4)–(6)), and the boundaries are hard to define (Figure 5(2)). These factors account for the challenging characteristics of the SYSU-CD dataset. The visualization results depicted in Figure 5(1)–(3) seem comparatively better on account of more prominent changes, larger areas of change, and a less complex background. Nevertheless, they still display diverse levels of missed and false detections. For example, in Figure 5(1), the red-boxed area exhibits analogous texture characteristics in the T1 and T2 images (sandy soil and cement surfaces, respectively), resulting in missed detections in numerous results. Overall, our proposed method generated the most favorable visualization results with the minimal amount of green (missed detections) regions, demonstrating its capacity to mitigate or resolve the problem of missed detections in change regions to a certain extent.
Regarding the WHU-CD dataset, the visualization outcomes indicate a relatively smaller quantity of red and green areas. Generally, in comparison with the higher-scoring ChangeMamba and ChangeFormer, our proposed method possesses both merits and demerits, which is evident from the visualization results. By taking Figure 6(3) as an instance, the observation of the red-boxed area discloses that our method mitigated missed detections in numerous scenarios. It is worthy of clarification that, notwithstanding the missed detections exhibited by our method in Figure 6(1), these missed regions are diminutive, spatially scattered, and predominantly situated within the change area (i.e., internal voids), thereby exerting a negligible influence on the qualitative evaluation of the overall change region.
The S2Looking dataset presents a challenge because of its severe class imbalance, which is clearly shown in the GT images of Figure 7. In these images, the foreground regions account for a small portion and are precisely annotated. This notably augmented the complexity of change detection tasks within this dataset, leading to generally low performance scores for the existing methods. The visualization results follow the quantitative analysis, indicating a high ratio of missed detections and false detections. Nevertheless, our proposed method still attained favorable results with a minimal extent of green (missed detections) regions. Meanwhile, the red (false detections) regions are comparable to those of several methods with high scores [34,35,36].
This demonstrates that our proposed method can alleviate or even solve the problem of missed detections.

4.3.3. Model Efficiency

To conduct a more comprehensive assessment of the performance of the proposed method, we further compared the inference speed, computational complexity, parameter quantity, and corresponding performance metrics of diverse methods on the SYSU-CD dataset, as presented in Table 2. In comparison with certain CNN-based methods, our proposed approach exhibits a moderate level of computational complexity and inference speed. Although it has a slightly larger parameter quantity, it demonstrated a distinct advantage in accuracy. CGNet [38] and SNUNet [54], both of which integrate a self-attention mechanism, display a higher level of computational complexity and accuracy than other CNN-based methods. Compared to certain Transformer-based approaches, our proposed method exhibits substantially reduced computational cost and a decreased number of parameters. Moreover, it outperformed these methods in both inference speed and accuracy. Compared to the latest VMamba-based methods, our method attains substantial advantages in computational complexity, parameter quantity, and inference time while preserving comparable or even higher accuracy. Overall, our proposed method achieves a favorable balance among computational complexity, parameter quantity, inference speed, and accuracy, thereby fulfilling the requirements for high efficiency in change detection tasks.
In summary, based on the obtained results, the proposed VMMCD demonstrates a remarkable balance across two key performance aspects.
First, it achieves a balance between accuracy and speed. VMMCD incorporates the VMamba backbone and the multi-scale feature fusion module (MFGF), thereby acquiring robust global modeling capabilities and achieving high accuracy. It surpasses the highest recorded performance on the SYSU-CD dataset and shows competitive results on the other two benchmark datasets. Meanwhile, its lightweight architecture enables high inference speed.
Second, it maintains a balance between missed detections and false detections. A wealth of quantitative results indicates that, compared to other methods, the ratio of missed detections to false detections in VMMCD’s classification errors is more balanced. It is worth emphasizing that many existing methods exhibit a substantially higher missed-detection rate than false-detection rate, which contradicts the objective of change detection—that is, to accurately identify changed areas. As noted in Section 1, users typically show lower tolerance for missed detections than for false detections in change detection tasks. Consequently, our method effectively mitigates missed detections, thereby achieving a more desirable balance between missed detections and false detections.

4.4. Ablation Study

In this section, our ablation experiments were carried out from the following four aspects:
  • Backbone networks.
  • Model magnitude.
  • The number of MFGFs.
  • The coefficient λ of the loss function.

4.4.1. Ablation on Backbone Networks

A comparison of several representative backbone networks was conducted. As presented in Table 3, the VMamba backbone achieved the highest performance score, with only EfficientNet-B4 having a lower computational cost and parameter count. This is attributed to the fact that the VMamba backbone is capable of conducting global modeling with linear computational complexity, thus simultaneously achieving gains in accuracy and speed. The Swin-small backbone in the table exhibited suboptimal performance, which we ascribe to the local, window-based attention mechanism employed by the Swin Transformer: it reduces computational cost at the expense of the model's global modeling ability. In contrast, VMamba does not require such a compromise to reduce computational cost, endowing it with robust modeling ability.

4.4.2. Ablation on Model Magnitude

We elucidated the rationale behind designing a lightweight model in Section 1 and Section 3. Specifically, the change detection process inherently entails a significant reduction in image information. Consequently, the model ought to refrain from excessive complexity to prevent, as far as possible, the extraction or introduction of interfering and irrelevant information. To validate this hypothesis, we augmented the complexity of VMMCD along two distinct axes for comparative analysis. Firstly, with respect to the number of feature dimensions per layer, we adjusted the channel compression within the Patch Merging operation. This alteration modifies the magnitude of the feature space at each layer of the model. We described the transformation of the feature channel dimensions before and after Patch Merging and channel compression in Section 3.2. Briefly, the Patch Merging layer quadruples the channel dimension of the features, after which a linear layer is employed for channel compression.
Secondly, in relation to the model depth, we extended the proposed VMMCD by augmenting its depth, transforming the three-layer model into a four-layer architecture. Given that the feature space of the model is the cumulative sum of its feature spaces at each scale, increasing the model depth is tantamount to incorporating an additional scale of feature space into the model. For instance, in the case of VMMCD-S4 herein, we appended the deepest level of the feature space. In accordance with some previously established viewpoints [59], this will render the features extracted by the model more abstract and sophisticated. Nevertheless, our experimental results demonstrate that augmenting the layer count is not always appropriate for the change detection task.
We verified the rationality and superiority of the proposed lightweight VMMCD, as shown in Table 4. The values [×1, ×0.5, ×0.25] denote the compression ratios, where "×1" implies no compression, and "×0.25" indicates compressing the channel dimension to one-fourth, effectively reducing the expanded channels back to the original count.
Firstly, with regard to the various scenarios of channel compression, we presented their corresponding quantitative metrics. Notably, given that the pretrained VMamba model weights cannot be loaded following a change in the channel dimension, for the sake of a fair comparison, we consistently employed randomly initialized weights during training. We horizontally compared the compression scenarios of different layers, with the performance of the model under 0.5 times channel compression serving as the reference. In the absence of compression, a certain redundancy exists in the channel dimension of the features. This redundancy causes a reduction in the density of crucial features within the feature space, posing difficulties for the model to extract key features amidst substantial interference. Moreover, it may potentially trigger an Out Of Memory (OOM) error. Conversely, when a compression rate of 0.25 is applied, the number of channels in the feature maps across all layers of the model becomes a fixed value of 96. This fixed number is disadvantageous for the model to extract profound abstract features. In the case of complex datasets, the model might lack the requisite capacity to learn the intricate patterns within the data, leading to underfitting and suboptimal performance on both the training and test sets.
Secondly, by comparing the cases in which the four-stage model and the three-stage model employ the same number of channels, it is evident that the performance scores of the four-stage model are considerably lower than those of the three-stage model. This is consistent with our initial intuition that for the change detection task, a lightweight model should be adopted to the greatest extent possible to prevent the introduction of interference and irrelevant information.
To further elucidate the feature space in the two distinct settings, we present the feature activation maps output by each stage of MFGF for both models, as illustrated in Figure 8. The four-stage model exhibited certain unfavorable characteristics during its operation. Firstly, the feature maps at the middle two scales of the model were in a state of weak activation. This implies that the features extracted by the middle two stages contributed relatively little to the overall classification process. Secondly, the abstract features extracted by the deepest stage of the model, namely, “4-MFGF”, were relatively inferior, and the features extracted by the shallowest stage, namely, “1-MFGF”, were not the edge features typically extracted by a general shallow network (despite the overall appearance being highly consistent with the ground truth) but rather certain texture features within the change area.
Based on these two characteristics, we can draw a conclusion that the weakly activated features at the middle scale of the model impede the transmission of deep low-frequency feature information to the shallowest layer, compelling the shallow network to learn low-frequency features that are challenging to acquire and simultaneously hindering the transmission of the gradient during the training process to the deepest layer. This is the reason why the shallow features appear to be superficially consistent with the ground truth, whereas the deep features yield poor results.
In contrast, our three-stage model, by eliminating redundant network structures, enabled the shallow and deep networks to primarily extract high-frequency edge features and low-frequency abstract features, respectively, each achieving a more reasonable outcome. Simultaneously, the features of the middle layer were in a striped activation state, augmenting the features of the other two scales in certain areas and facilitating the normal transmission of the gradient. The model in this state exhibits reduced redundancy, permitting each layer to extract features at their respective level in a normal manner.

4.4.3. Ablation on MFGFs

We validated the efficacy of the proposed MFGF in addressing the issue of missed detections. To this end, we compared the performance metrics of the MFGF module under varying numbers and positions, as presented in Table 5. Given that MFGF represents a flexible plug-and-play constituent, we further carried out experiments on the extended depth VMMCD-S4 model. Through a comparison of the varying positions and quantities of the MFGF module within the three-stage VMMCD model, it is evident that the number of MFGF modules was positively correlated with the model’s performance. When a single MFGF module was incorporated, the overall performance score exceeded that of the baseline model without MFGF but was inferior to that of the model with two MFGF modules. Notably, when three MFGF modules were utilized, the overall performance score reached its peak. It is noteworthy that in the experiments involving the four-stage VMMCD, the addition of the plug-and-play MFGF module also yielded a relatively pronounced performance enhancement, which further validates the efficacy of the proposed model and the MFGF module.
Concurrently, we present the visualization results of certain experiments, as depicted in Figure 9. The challenging areas are marked with red boxes. The “000-111” notations in the figure denote whether the MFGF was added at the corresponding positions (e.g., “011” indicates the addition of MFGFs at positions 2 and 3), maintaining the same overall order as in Table 5. It is observable that the visualization results are congruent with the quantitative results in Table 5. As the number of MFGF modules increased, the missed detections phenomenon within the red-boxed areas were mitigated.

4.4.4. Ablation on the Coefficient λ of the Loss Function

A comparison of the performances on the three datasets with respect to different loss-function coefficients is presented in Table 6. In practical experiments, the magnitude of Focal Loss is typically an order of magnitude smaller than that of BCE Loss. Consequently, a coefficient term λ is added only to the BCE Loss. When λ is relatively small (specifically, λ ∈ [0, 0.3]), finer-grained values are tested. As can be seen from Table 6, the relationship between performance metrics and coefficients differs among the datasets. For SYSU-CD, when the coefficient λ was small (λ < 0.3), there was no significant change in F1 and IoU; when λ became larger (λ ≥ 0.3), F1 and IoU began to decline. For WHU-CD, F1 and IoU were positively correlated with λ. For S2Looking, F1 and IoU fluctuated within a certain range under different coefficients, without exhibiting a clear correlation. Considering the performance across the datasets, λ was ultimately set to 0.2 for the aforementioned experiments.

5. Conclusions

In this research, a lightweight VMamba-based Multi-scale Feature Guiding Fusion change detection approach was devised. Certain recent deep learning-based change detection techniques have, on the one hand, adopted excessively intricate model architectures, giving rise to model redundancy and the extraction or introduction of interfering or extraneous information, thereby impeding the extraction of pivotal features. On the other hand, the utilization of the feature information retrieved by the encoder is often inadequate, culminating in missed detections. In light of these issues, a lightweight VMamba-based model was designed to mitigate the introduction of interfering or irrelevant information. Additionally, the information exchange between layers was strengthened via the proposed MFGF block, further augmenting the modeling capability of the VMamba model. This, in turn, enhanced the utilization of change feature information and mitigated or resolved the problem of missed detections within the change area. In comparison with CNN-based and Transformer-based methods, the proposed VMMCD attained a remarkable balance between speed and accuracy, as well as between missed and false detections. Its performance on SYSU-CD surpassed the previous state of the art, and competitive scores were secured on the other two datasets. Given the above discussion, the proposed VMMCD is highly suitable for real-world rapid change detection tasks, particularly in category-agnostic scenarios where the model must generalize across diverse change types without relying on semantic class information. It is worth noting, however, that VMMCD may still encounter missed detections in highly complex scenes, such as those involving subtle appearance variations, scale ambiguity, or severe background interference. These scenarios pose inherent challenges for lightweight models due to their limited representational capacity. To address this, future improvements may include the use of deep change cues to guide multi-scale feature fusion more effectively, as well as the adoption of class-weighted loss functions to alleviate sample imbalance and improve sensitivity to minor changes.

Author Contributions

Z.C. conceived the idea; Z.C. and H.C. verified the idea and designed the methodology; H.C. wrote the paper; J.L. and X.Z. reviewed and provided technical suggestions; Q.G. and W.D. provided resources and financial support. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Civil Aerospace Technology Pre-research Project of China's 14th Five-Year Plan (Guide Number: D040404) and the Key Laboratory of Target Cognition and Application Technology (Project Number: 2023-CXPT-LC-005).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Desclée, B.; Bogaert, P.; Defourny, P. Forest change detection by statistical object-based method. Remote Sens. Environ. 2006, 102, 1–11.
2. Bolorinos, J.; Ajami, N.K.; Rajagopal, R. Consumption Change Detection for Urban Planning: Monitoring and Segmenting Water Customers During Drought. Water Resour. Res. 2020, 56, e2019WR025812.
3. Ridd, M.K.; Liu, J. A Comparison of Four Algorithms for Change Detection in an Urban Environment. Remote Sens. Environ. 1998, 63, 95–100.
4. Hegazy, I.R.; Kaloop, M.R. Monitoring urban growth and land use change detection with GIS and remote sensing techniques in Daqahlia governorate Egypt. Int. J. Sustain. Built Environ. 2015, 4, 117–124.
5. Alqurashi, A.F.; Kumar, L. Investigating the Use of Remote Sensing and GIS Techniques to Detect Land Use and Land Cover Change: A Review. Adv. Remote Sens. 2013, 2, 193–204.
6. Sublime, J.; Kalinicheva, E. Automatic Post-Disaster Damage Mapping Using Deep-Learning Techniques for Change Detection: Case Study of the Tohoku Tsunami. Remote Sens. 2019, 11, 1123.
7. Se, S.; Firoozfam, P.; Goldstein, N.; Wu, L.; Dutkiewicz, M.; Pace, P.; Naud, J.L.P. Automated UAV-based mapping for airborne reconnaissance and video exploitation. In Proceedings of the Airborne Intelligence, Surveillance, Reconnaissance (ISR) Systems and Applications VI; Henry, D.J., Ed.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2009; Volume 7307, p. 73070M.
8. Nielsen, A.A. The Regularized Iteratively Reweighted MAD Method for Change Detection in Multi- and Hyperspectral Data. IEEE Trans. Image Process. 2007, 16, 463–478.
9. Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change detection from remotely sensed images: From pixel-based to object-based approaches. ISPRS J. Photogramm. Remote Sens. 2013, 80, 91–106.
10. Daudt, R.C.; Saux, B.L.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018.
11. Zhang, H.; Lin, M.; Yang, G.; Zhang, L. ESCNet: An End-to-End Superpixel-Enhanced Change Detection Network for Very-High-Resolution Remote Sensing Images. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 28–42.
12. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200.
13. Lei, T.; Wang, J.; Ning, H.; Wang, X.; Xue, D.; Wang, Q.; Nandi, A.K. Difference Enhancement and Spatial–Spectral Nonlocal Network for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4507013.
14. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1194–1206.
15. Zhang, M.; Xu, G.; Chen, K.; Yan, M.; Sun, X. Triplet-Based Semantic Relation Learning for Aerial Remote Sensing Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 266–270.
16. Zhang, M.; Shi, W. A Feature Difference Convolutional Neural Network-Based Change Detection Method. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7232–7246.
17. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
18. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
21. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662.
22. Lin, H.; Hang, R.; Wang, S.; Liu, Q. DiFormer: A Difference Transformer Network for Remote Sensing Change Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6003905.
23. Li, W.; Xue, L.; Wang, X.; Li, G. ConvTransNet: A CNN–Transformer Network for Change Detection With Multiscale Global–Local Representations. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610315.
24. Liu, M.; Shi, Q.; Chai, Z.; Li, J. PA-Former: Learning Prior-Aware Transformer for Remote Sensing Building Change Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6515305.
25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
26. Smith, S.L.; Brock, A.; Berrada, L.; De, S. ConvNets Match Vision Transformers at Scale. arXiv 2023, arXiv:2310.16764.
27. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752.
28. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396.
29. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166.
30. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
31. Wang, F.; Wang, J.; Ren, S.; Wei, G.; Mei, J.; Shao, W.; Zhou, Y.; Yuille, A.; Xie, C. Mamba-R: Vision Mamba ALSO Needs Registers. arXiv 2024, arXiv:2405.14858.
32. Patro, B.N.; Agneeswaran, V.S. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv 2024, arXiv:2403.15360.
33. Ruan, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2402.02491.
34. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. RS-Mamba for Large Remote Sensing Image Dense Prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5633314.
35. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720.
36. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210.
37. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
38. Han, C.; Wu, C.; Guo, H.; Hu, M.; Li, J.; Chen, H. Change Guiding Network: Incorporating Change Prior to Guide Change Detection in Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8395–8407.
39. Ye, Z.; Chen, T.; Wang, F.; Zhang, H.; Zhang, L. P-Mamba: Marrying Perona Malik Diffusion with Mamba for Efficient Pediatric Echocardiographic Left Ventricular Segmentation. arXiv 2024, arXiv:2402.08506.
40. Zhou, W.; Kamata, S.I.; Wang, H.; Wong, M.S.; Hou, H.C. Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral Image Classification. arXiv 2024, arXiv:2405.12003.
41. Qiao, Y.; Yu, Z.; Guo, L.; Chen, S.; Zhao, Z.; Sun, M.; Wu, Q.; Liu, J. VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv 2024, arXiv:2403.13600.
42. Fang, S.; Li, K.; Li, Z. Changer: Feature Interaction is What You Need for Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610111.
43. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN Models for Fine-Grained Visual Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
44. Huang, Y.; Li, X.; Du, Z.; Shen, H. Spatiotemporal Enhancement and Interlevel Fusion Network for Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5609414.
45. Wang, M.; Li, X.; Tan, K.; Mango, J.; Pan, C.; Zhang, D. Position-Aware Graph-CNN Fusion Network: An Integrated Approach Combining Geospatial Information and Graph Attention Network for Multiclass Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4402016.
46. Codegoni, A.; Lombardi, G.; Ferrari, A. TINYCD: A (Not So) Deep Learning Model For Change Detection. arXiv 2022, arXiv:2207.13159.
47. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713.
48. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
49. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2018, arXiv:1708.02002.
50. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604816.
51. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
52. Shen, L.; Lu, Y.; Chen, H.; Wei, H.; Xie, D.; Yue, J.; Chen, R.; Lv, S.; Jiang, B. S2Looking: A Satellite Side-Looking Dataset for Building Change Detection. Remote Sens. 2021, 13, 5094.
53. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101.
54. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8007805.
55. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514.
56. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
57. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
58. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946.
59. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90.
Figure 1. Overall architecture of the proposed VMMCD.
Figure 2. (a) Architecture of VSS Block. (b) Process of Patch Merging. (c,d) Comparison of self-attention and cross-scan.
Figure 3. Architecture of MFGF module.
Figure 4. Qualitative and quantitative analyses of missed detections and false detections.
Figure 5. Visualization results of different models on SYSU-CD [50]. TP (white), TN (black), FP (red), and FN (green).
Figure 6. Visualization results of different models on WHU-CD [51].
Figure 7. Visualization results of different models on S2Looking [52].
Figure 8. Visualization of the feature maps at different resolutions and the final binary output.
Figure 9. Visualization results of different MFGF settings on SYSU-CD.
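For reference, the color coding used in Figures 5–7 can be produced with a short sketch like the one below (not the authors' visualization code); `pred` and `gt` are assumed to be binary H × W arrays.

```python
import numpy as np

def error_map(pred, gt):
    """Return an H x W x 3 uint8 image: TP white, TN black, FP red, FN green."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    img = np.zeros((*pred.shape, 3), dtype=np.uint8)  # TN pixels stay black
    img[pred & gt] = (255, 255, 255)                  # true positives  -> white
    img[pred & ~gt] = (255, 0, 0)                     # false positives -> red
    img[~pred & gt] = (0, 255, 0)                     # false negatives -> green
    return img
```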
Table 1. A comparison with other SOTA change detection methods on SYSU-CD, WHU-CD, and S2Looking. The first, second, and third places are highlighted in red, blue, and black, respectively. Among the metrics, F1 and IoU are the most compelling.

Type | Method | SYSU-CD [50] Pre./Rec./F1/IoU | WHU-CD [51] Pre./Rec./F1/IoU | S2Looking [52] Pre./Rec./F1/IoU
CNN-based | FC-EF [10] | 80.22/68.62/73.97/58.69 | 74.56/73.94/74.25/59.05 | –/–/–/–
CNN-based | FC-Siam-Conc [10] | 81.44/69.93/75.25/60.32 | 38.47/84.25/52.82/35.89 | 84.16/21.53/34.29/20.69
CNN-based | FC-Siam-Diff [10] | 40.54/78.95/53.57/36.58 | 40.54/78.95/53.57/36.58 | 80.70/23.14/35.97/21.93
CNN-based | TinyCD [46] | 85.84/75.80/80.51/67.38 | 89.62/88.44/89.03/80.22 | 72.47/53.15/61.32/44.22
CNN-based | SNUNet [54] | 83.31/76.39/79.70/66.25 | 80.79/87.03/83.80/72.11 | 75.49/45.05/56.43/39.30
CNN-based | CGNet [38] | 85.60/78.45/81.87/69.30 | 90.78/90.21/90.50/82.64 | 70.18/59.38/64.33/47.41
Transformer-based | BIT [55] | 83.22/72.60/77.55/63.33 | 84.62/88.00/86.28/75.87 | 75.35/49.44/59.71/42.56
Transformer-based | ChangeFormer [36] | 86.47/77.42/81.70/69.06 | 95.58/89.83/92.62/86.25 | 73.33/57.62/64.54/47.64
Mamba-based | RS-Mamba [34] | 85.38/73.27/78.86/65.10 | 93.70/91.08/92.37/85.83 | 71.49/56.80/63.30/46.31
Mamba-based | ChangeMamba [35] * | 88.79/77.74/82.89/70.79 | 91.92/92.36/94.03/88.73 | 68.59/61.25/64.71/47.84
Mamba-based | VMMCD (ours) | 84.76/81.97/83.35/71.45 | 93.84/91.23/92.52/86.08 | 65.45/64.86/65.16/48.32
* The method proposed in our paper employs Mamba-small; thus, it is compared here with MambaBCD-Small, which is of a similar Mamba magnitude. Even when compared with MambaBCD-Base, the proposed method still outperforms it on both SYSU-CD and S2Looking datasets.
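The reported metrics follow the standard pixel-level definitions; the short sketch below (not tied to the authors' code) shows how Pre., Rec., F1, and IoU relate to the confusion counts, and why F1 and IoU move together.

```python
def change_metrics(tp, fp, fn):
    """Standard pixel-level change detection metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)          # equivalently, iou = f1 / (2 - f1)
    return precision, recall, f1, iou

# Example: an F1 of 0.8335 corresponds to an IoU of 0.8335 / (2 - 0.8335) ≈ 0.7145.
```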
Table 2. A comparison with other SOTA change detection methods on model efficiency. We report GFlops, the number of parameters (in millions), and inference fps (in image pairs per second), as well as the F1 and IoU on SYSU-CD. We highlight the metrics of the proposed method in bold. The input images have been resized to 256 × 256 × 3.

Type | Method | SYSU-CD F1/IoU | GFlops | Params (M) | fps (pair/s)
C | FC-EF [10] | 73.97/58.69 | 3.24 | 1.35 | 160.26
C | FC-Siam-Conc [10] | 53.57/36.58 | 4.99 | 1.55 | 119.75
C | FC-Siam-Diff [10] | 75.25/60.32 | 4.39 | 1.35 | 122.77
C | TinyCD [46] | 80.51/67.38 | 1.45 | 0.29 | 85.47
C | SNUNet [54] | 79.70/66.25 | 11.73 | 3.01 | 67.46
C | CGNet [38] | 81.87/69.30 | 87.55 | 38.98 | 74.70
T | BIT [55] | 77.55/63.33 | 26.00 | 11.33 | 62.82
T | ChangeFormer [36] | 81.70/69.06 | 202.79 | 41.03 | 58.58
M | RS-Mamba [34] | 78.86/65.10 | 18.33 | 42.30 | 22.58
M | ChangeMamba [35] | 82.89/70.79 | 28.70 | 49.94 | 16.89
M | VMMCD (ours) | 83.35/71.45 | 4.51 | 4.93 | 73.05
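Efficiency figures of this kind can be obtained with a sketch like the one below, assuming a PyTorch Siamese model that takes a bi-temporal image pair; the `model` object and its two-input call signature are assumptions, and GFlops counting (which typically relies on a third-party profiler) is omitted.

```python
import time
import torch

@torch.no_grad()
def measure_efficiency(model, device="cuda", n_warmup=10, n_iters=100):
    """Return (params in millions, image pairs per second) for 256 x 256 x 3 inputs."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    t1 = torch.randn(1, 3, 256, 256, device=device)
    t2 = torch.randn(1, 3, 256, 256, device=device)
    for _ in range(n_warmup):              # warm-up to exclude startup overhead
        model(t1, t2)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(t1, t2)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    return params_m, n_iters / (time.time() - start)
```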
Table 3. Ablation on different backbone networks. We report the F1 and IoU scores of the model on SYSU-CD under different backbone network settings, including 3 CNN-based backbones and 1 ViT-based backbone.

Backbone | GFlops | Params (M) | SYSU-CD F1/IoU
VGG16 [56] | 50.41 | 18.62 | 75.92/61.19
ResNet18 [57] | 5.49 | 13.21 | 78.50/64.61
EfficientNet-B4 [58] | 2.71 | 1.44 | 81.57/68.87
Swin-small [25] | 15.27 | 24.61 | 68.85/52.50
VMamba-small (ours) | 4.51 | 4.93 | 83.35/71.45
Table 4. Ablation on model magnitude. We report the F1 and IoU scores of the model on SYSU-CD under different model magnitudes, covering the S3 and S4 models under several dimension settings.

Model | Dims | SYSU-CD F1/IoU
VMMCD-S4 | ×1 | OOM
VMMCD-S4 | ×0.5 | 80.42/67.25
VMMCD-S4 | ×0.25 | 80.23/66.98
VMMCD-S3 | ×1 | 80.49/67.36
VMMCD-S3 | ×0.5 (Ours) | 81.12/68.24
VMMCD-S3 | ×0.25 | 80.25/67.02
Table 5. Ablation on different numbers of MFGF layers. We report the F1 and IoU scores of the model on SYSU-CD under different MFGF settings, including 8 scenarios in the S3 model and 2 scenarios in the S4 model.

Model | MFGF modules enabled | SYSU-CD F1/IoU
VMMCD-S4 | 0 of 4 | 81.88/69.33
VMMCD-S4 | 4 of 4 | 82.63/70.41
VMMCD-S3 | 0 of 3 | 82.36/70.01
VMMCD-S3 | 1 of 3 | 82.98/70.90
VMMCD-S3 | 1 of 3 | 82.70/70.50
VMMCD-S3 | 1 of 3 | 82.83/70.69
VMMCD-S3 | 2 of 3 | 83.00/70.94
VMMCD-S3 | 2 of 3 | 82.99/70.92
VMMCD-S3 | 2 of 3 | 83.13/71.13
VMMCD-S3 | 3 of 3 | 83.35/71.45
Table 6. Ablation on the coefficient λ of the loss function.

λ | SYSU-CD F1/IoU | WHU-CD F1/IoU | S2Looking F1/IoU
0 | 83.34/71.44 | 92.45/85.97 | 64.83/47.97
0.1 | 83.30/71.38 | 92.47/86.00 | 65.18/48.35
0.2 | 83.35/71.45 | 92.52/86.08 | 65.16/48.32
0.3 | 83.26/71.32 | 92.50/86.04 | 65.07/48.22
0.5 | 83.25/71.30 | 92.50/86.05 | 64.90/48.03
1 | 83.23/71.28 | 92.59/86.20 | 65.21/48.38
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
