4.1. Datasets and Evaluation Metrics
To comprehensively validate the effectiveness of the proposed SAM-MSCD method, we conducted extensive experiments on four representative public RSCD datasets, selected based on data volume and the variety of change targets: LEVIR-CD [6], WHU-CD [34], NJDS [35], and MSRS-CD [36]. All datasets were cropped into non-overlapping 512 × 512-pixel segments in a left-to-right, top-to-bottom order, and divided into training, validation, and test sets in a 7:1:2 ratio. A brief introduction to these four datasets is provided below.
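The cropping and splitting procedure described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names are hypothetical, dropping partial border tiles is an assumption (the text does not say how borders are handled), and the split is taken in scan order because the text does not state whether the tiles were shuffled.

```python
import numpy as np

def tile_image(image, tile=512):
    """Crop an (H, W, ...) array into non-overlapping tile x tile patches,
    scanning left to right within a row of tiles, then top to bottom.
    Assumption: partial tiles at the right/bottom borders are dropped."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            patches.append(image[top:top + tile, left:left + tile])
    return patches

def split_7_1_2(items):
    """Split a sequence into train/val/test parts in a 7:1:2 ratio.
    Assumption: the split follows the given order (no shuffling)."""
    n = len(items)
    n_train = n * 7 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

In a bi-temporal setting, the same tiling would be applied to both images of a pair and to the label mask so that the patches stay aligned.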
The LEVIR-CD dataset, released in 2020, contains 637 pairs of images with a spatial resolution of 0.5 m, each sized 1024 × 1024 pixels. These images were captured between 2002 and 2018 and cover changes in 31,333 buildings, including high-rise apartments, villas, and garages of various sizes, in Texas, USA.
The WHU-CD dataset, released in 2019, contains a pair of images with a spatial resolution of 0.2 m and image dimensions of 32,207 × 15,354 pixels. The images are from Christchurch, New Zealand, taken in 2012 and 2016, covering changes in 12,796 buildings over a 20.5 square kilometer area.
The NJDS dataset, released in 2022, contains one pair of images with a spatial resolution of 0.3 m and dimensions of 14,231 × 11,381 pixels. It records various types of building change instances in Nanjing, China, between 2014 and 2018, including low-rise, mid-rise, and high-rise buildings. Due to its relatively small data volume, the NJDS dataset is particularly suitable for evaluating model performance in low-data scenarios, providing a key basis for studying model adaptability.
The MSRS-CD dataset, released in 2024, contains 841 pairs of images with a spatial resolution of 0.5 m and each sized pixels. The images are from southern Chinese cities during 2019 to 2023, covering various scale change targets including new constructions, urban sprawl, vegetation changes, and road development, thus comprehensively representing complex real-world change scenarios.
This paper adopts five common evaluation metrics for change detection algorithms to assess model performance: Precision (P), Recall (R), Intersection over Union (IoU), Overall Accuracy (OA), and the F1 score. Their specific definitions are as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad IoU = \frac{TP}{TP + FP + FN},$$
$$OA = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$
where TP denotes true positives (the number of samples correctly predicted as change), TN true negatives (correctly predicted as no change), FP false positives (predicted as change but actually no change), and FN false negatives (predicted as no change but actually change).
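The five metrics can be computed directly from the four confusion counts. The sketch below assumes the standard pixel-level definitions of these metrics and binary masks in which nonzero means "change"; the function names are illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the paper.

```python
import numpy as np

def confusion_counts(pred, target):
    """Count TP, TN, FP, FN between two binary masks (nonzero = change)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(pred & target))      # predicted change, actually change
    tn = int(np.sum(~pred & ~target))    # predicted no change, actually no change
    fp = int(np.sum(pred & ~target))     # predicted change, actually no change
    fn = int(np.sum(~pred & target))     # predicted no change, actually change
    return tp, tn, fp, fn

def change_detection_metrics(tp, tn, fp, fn):
    """Return P, R, IoU, OA and F1 as fractions in [0, 1].
    Assumption: a metric with an empty denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    total = tp + tn + fp + fn
    oa = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"P": precision, "R": recall, "IoU": iou, "OA": oa, "F1": f1}
```

Multiplying the returned fractions by 100 gives percentages in the form reported in the tables.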
4.3. Comparative Experiments with State-of-the-Art Algorithms
We conduct comparative experiments between the proposed SAM-MSCD and a series of RSCD methods, including CNN-based methods (FCCDN [37], SGSLN [38], AANet [39], SEIFNet [40]), Transformer-based methods (BIT [11], ChangeFormer [10], VcT [23], EATDer [41], MDIPNet [42]), and methods based on Vision Foundation Models (SAMCD [29], BAN [31], TTP [14], SFCD [32]). Although methods such as SAMCD and BAN have successfully achieved domain adaptation for vision foundation models by introducing adapters, their research focus has primarily centered on addressing cross-domain generalization challenges. In contrast, SAM-MSCD not only effectively tackles the domain adaptation challenges of VFMs in remote sensing scenarios but also simultaneously resolves the critical issue of detecting multi-scale variable objects, achieving synergistic optimization of domain adaptation and multi-scale modeling. The following briefly introduces these thirteen methods.
FCCDN: This network is a remote sensing change detection network with feature constraints. It introduces a constraint mechanism during bi-temporal feature extraction and fusion, and adopts a self-supervised learning strategy to achieve more accurate change area recognition.
SGSLN: This method is a binary change detection approach based on a switchable dual encoder-decoder structure. It integrates semantic guidance and spatial localization strategies to effectively address the limitations of traditional architectures in handling bi-temporal feature interference and intraclass variation, as well as multi-view building changes.
AANet: This network employs a fuzzy refinement module to locate pseudo-changes and occluded true change regions, and utilizes a weight rearrangement module to fuse multi-scale difference features, enhancing adaptability to objects with varying change scales.
SEIFNet: A lightweight change detection network that combines spatio-temporal enhancement and inter-layer fusion. Through multi-level feature extraction, a spatio-temporal difference enhancement module, and an adaptive context fusion module, it improves feature representation of change areas and mitigates issues of pseudo changes and scale variations.
BIT: This network introduces Transformer into remote sensing change detection. By modeling spatio-temporal context via semantic tokens, it overcomes the limitations of convolutional methods in handling long-range dependencies and complex scenes.
ChangeFormer: A Transformer-based Siamese architecture for remote sensing image change detection. Unlike traditional convolutional methods, it combines a hierarchical Transformer encoder with an MLP decoder to effectively capture multi-scale and long-range dependencies, thereby improving detection accuracy.
VcT: This network extracts features via a shared backbone and incorporates a graph neural network to exploit shared contextual information between image pairs, thus enhancing change detection accuracy.
EATDer: This method combines edge-awareness with adaptive Transformers for remote sensing change detection. It uses a Siamese encoder structure integrated with adaptive vision Transformer blocks and a full-range fusion module to capture spatio-temporal variations, and employs an edge-aware decoder to refine change boundaries.
MDIPNet: A multi-scale dual-space interactive perception change detection network designed to address the high computational cost and insufficient semantic information utilization in existing RSCD models.
SAMCD: This network applies FastSAM to high-resolution RS images for change detection. Through a convolutional adapter and semantic learning branch, it improves the model’s adaptability and accuracy in remote sensing scenarios, outperforming fully supervised methods with sample-efficient learning capabilities.
BAN: This method introduces a frozen foundation model, along with bi-temporal adaptation branches and a connection module, effectively fusing the general knowledge of the foundation model with task-specific features for RSCD.
TTP: This network integrates latent knowledge from the SAM foundation model into RSCD, effectively addressing domain shifts and multi-temporal image heterogeneity in knowledge transfer.
SFCD: Specifically designed for RSCD, this network combines SAM with a feature interaction mechanism. It enhances fine-grained feature extraction, change sensitivity, and boundary recognition through parameter-efficient fine-tuning, a bi-temporal feature interaction module, and a boundary-aware loss function.
On the LEVIR-CD dataset, the results of the comparative experiments with current SOTA algorithms are shown in Table 1. The experimental results indicate that SAM-MSCD achieves the best performance on four core metrics: P, F1, IoU, and OA, with values of 93.64%, 92.54%, 85.94%, and 99.24%, respectively. This advantage primarily stems from the model’s multi-scale feature extraction and fusion mechanism. The Recall of SAM-MSCD is 1.71% lower than that of the second-ranked EATDer. EATDer raises its recall by introducing a boundary change detection module, but this also admits a large amount of pseudo-change noise: its Precision is 7.91% lower than that of SAM-MSCD, which upsets the balance of its overall performance.
Figure 5 illustrates the visualization of the inference results of different algorithms on the LEVIR-CD dataset. As seen in Figure 5a, SAM-MSCD performs excellently in boundary detection. Furthermore, Figure 5b–d show that SAM-MSCD has significantly fewer false and missed detections in dense building areas, demonstrating stronger robustness. Figure 5e further illustrates the model’s ability to accurately identify the change region and to delineate finer boundary details when detecting changes in single, very small buildings, highlighting its advantages in fine-grained change detection tasks.
The results of the comparative experiments on the WHU-CD dataset are shown in Table 2. The specific values for SAM-MSCD on P, R, F1, IoU, and OA are 96.73%, 88.31%, 92.33%, 85.73%, and 99.29%, respectively. Compared to current mainstream SOTA methods, SAM-MSCD achieves the best performance on four metrics: R, F1, IoU, and OA. The visualization results in Figure 6a,b show that SAM-MSCD is significantly superior to other methods in controlling missed detections, effectively identifying change regions of smaller buildings. Figure 6e demonstrates that the model also performs excellently in controlling false detections, as it does not incorrectly identify changes in non-building categories like containers and vehicles as building changes, thus exhibiting a stronger class discrimination capability. Furthermore, as shown in Figure 6c,d, while achieving low rates of missed and false detections, SAM-MSCD still maintains good boundary detection performance, showcasing its excellent comprehensive capabilities in complex scenes.
The NJDS dataset has a small amount of data, which allows for an effective evaluation of a model’s change detection capabilities under low-data conditions. Table 3 shows the results of the comparative experiments on the NJDS dataset. The experimental results indicate that SAM-MSCD achieved the best values for the key evaluation metrics of F1 and IoU, at 79.33% and 65.75%, respectively. These scores are 3.95% and 5.26% higher than those of the second-best method, SAMCD. The visualization results in Figure 7 further validate this conclusion, showing that SAM-MSCD demonstrates excellent detection performance even with a relatively small amount of data. Specifically, the instances in Figure 7 show that SAM-MSCD can accurately identify change regions of different sizes and provide clear boundary detection results, indicating its robustness and efficiency in data-scarce scenarios.
The results of the SOTA comparative experiments on the MSRS-CD dataset are shown in Table 4. The experimental results indicate that SAM-MSCD achieves the best performance on the three key evaluation metrics of F1, IoU, and OA, with specific values of 79.67%, 65.84%, and 94.37%, respectively. This demonstrates the model’s outstanding detection capabilities in complex and variable real-world scenarios. The visualization results in Figure 8a–c further validate the advantages of SAM-MSCD in edge detection. Compared to other methods, it delineates the boundaries of change regions more accurately and is markedly more effective at reducing missed detections. Furthermore, as shown in Figure 8d, when faced with natural changes in non-target categories (such as grass growth), existing mainstream methods exhibit a certain degree of false detection, whereas SAM-MSCD effectively avoids such errors, demonstrating stronger class discrimination ability and robustness. Figure 8e further demonstrates the superior performance of SAM-MSCD in detecting small-size changes. Even for small changes, the model recognizes them accurately and maintains good boundary integrity, showing strong adaptability in fine-grained change detection tasks.
4.4. Model Complexity Analysis
To comprehensively evaluate the performance of the proposed SAM-MSCD in terms of computational efficiency and model size, a comparative analysis of model complexity was conducted against other mainstream change detection methods on the MSRS-CD dataset. As shown in Table 5, different methods are compared across six key dimensions: Network Type (CNN, Transformer, VFM), F1 score, IoU, FLOPs, Parameters (Params), and Inference time.
Overall, VFM-based methods generally outperform CNN-based and Transformer-based methods on the performance metrics of F1 and IoU. Among them, SAM-MSCD achieves the best performance, with 79.67% F1 and 65.84% IoU, clearly outperforming other VFM-based methods such as SAMCD (76.53% F1, 61.98% IoU) and BAN (75.99% F1, 61.27% IoU). This result suggests that SAM-MSCD has stronger feature modeling capability and better generalization performance.
In terms of model complexity, SAM-MSCD has 54.76 G FLOPs and 42.65 M parameters, which is slightly higher than some lightweight models such as BIT, VcT, and SEIFNet, but much lower than ChangeFormer, which consumes a very large amount of computational resources at 202.79 G FLOPs. It is worth emphasizing that, compared with SAMCD, which is also built on a vision foundation model, SAM-MSCD improves the F1 score by 3.14% and the IoU by 3.86% while reducing the number of parameters by nearly 40%, reflecting better structural design and expression efficiency.
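The accuracy gaps quoted in this section can be checked from the reported MSRS-CD scores. The short sketch below uses only F1 and IoU values stated in the text; the dictionary layout and the function name are illustrative.

```python
# F1 and IoU (%) on MSRS-CD, as quoted in the text of this section.
REPORTED = {
    "SAM-MSCD": {"F1": 79.67, "IoU": 65.84},
    "SFCD": {"F1": 78.42, "IoU": 64.49},
    "SAMCD": {"F1": 76.53, "IoU": 61.98},
    "BAN": {"F1": 75.99, "IoU": 61.27},
}

def gain_over(baseline, method="SAM-MSCD", scores=REPORTED):
    """Absolute gain of `method` over `baseline`, rounded to 2 decimals."""
    return {
        metric: round(scores[method][metric] - scores[baseline][metric], 2)
        for metric in ("F1", "IoU")
    }

for name in ("SFCD", "SAMCD", "BAN"):
    print(name, gain_over(name))
```

The differences are absolute differences between percentage scores, which is how the 3.14% and 3.86% figures relative to SAMCD arise.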
In terms of the performance-complexity trade-off, although TTP and SFCD also achieve relatively strong F1 scores, their computational overhead and parameter size exceed those of SAM-MSCD. SFCD, for example, has 44.09 M parameters and 52.14 G FLOPs, yet it reaches only 78.42% F1 and 64.49% IoU, slightly below SAM-MSCD. In addition, methods such as BAN and MDIPNet, although close in accuracy, have larger model sizes and are harder to deploy, which limits their potential application in resource-constrained environments.
In summary, SAM-MSCD balances high detection accuracy with model computational efficiency and parameter size, demonstrating excellent overall performance and practical application potential.
4.5. Ablation Studies
4.5.1. Ablation Study of Different Modules
To validate the effectiveness of the BIM and CFEM modules in SAM-MSCD, we conducted systematic ablation experiments on the LEVIR-CD and MSRS-CD datasets. The results are shown in Table 6.
Under the baseline setting without any additional modules (both BIM and CFEM removed) and using only a conventional absolute difference for feature comparison, the model’s F1-scores on the LEVIR-CD and MSRS-CD datasets were 83.87% and 71.34%, respectively, with IoU scores of 65.45% and 46.67%. This indicates that the overall performance was significantly limited. As shown in Figure 9d, the traditional absolute difference method, while capable of capturing some change regions, has notable deficiencies when dealing with complex scenes, ambiguous boundaries, and minute targets.
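For reference, the absolute-difference comparison used in this baseline can be sketched in a few lines. This is a generic illustration rather than the authors' implementation: the (C, H, W) feature layout and the function names are assumptions, and the channel-wise mean is used here only to produce a per-pixel score for inspection.

```python
import numpy as np

def absolute_difference(feat_t1, feat_t2):
    """Element-wise |F1 - F2| between two bi-temporal feature maps.
    Assumption: both maps have the same (C, H, W) shape."""
    return np.abs(feat_t1 - feat_t2)

def change_score(feat_t1, feat_t2):
    """Collapse the difference over channels into an (H, W) change magnitude.
    The channel-wise mean is an illustrative choice, not the paper's decoder."""
    return absolute_difference(feat_t1, feat_t2).mean(axis=0)
```

Because the operation treats every location and channel independently and has no learned interaction between the two dates, any appearance difference (for example, lighting or season) raises the score just as a real change does, which is consistent with the weaknesses noted for this baseline.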
After introducing the BIM module, the model’s performance showed a marked improvement on both datasets. On LEVIR-CD, the F1 and IoU scores rose to 90.76% and 82.88%, respectively. On MSRS-CD, the F1 and IoU increased to 75.77% and 60.23%. This result indicates that BIM, through its bi-temporal feature interaction and difference enhancement mechanism, effectively enhances the model’s perception of change regions and its capability for symmetrical modeling. Furthermore, introducing the CFEM module also brought performance gains, achieving an F1 of 89.87% and an IoU of 81.67% on LEVIR-CD, and an F1 of 75.45% and an IoU of 60.11% on MSRS-CD. This demonstrates that the module, through the generation and fusion of multi-scale features with semantic context, improves the model’s ability to characterize fine-grained change features.
Further analysis of the experimental results reveals the distinct roles of the proposed modules in different scenarios. The BIM module demonstrates superior robustness in pseudo-change scenarios caused by seasonal variations or lighting differences (e.g., in LEVIR-CD). By enforcing feature interaction within the encoder, BIM aligns the semantic distribution of bi-temporal images, effectively suppressing false positives derived from spectral inconsistency. The CFEM module, on the other hand, proves critical in multi-scale scenarios (e.g., MSRS-CD). We observed that without CFEM, the model struggles to simultaneously detect large building footprints and narrow roads. The multi-scale fusion mechanism of CFEM recovers the boundary details of small targets while maintaining the internal integrity of large objects.
When both BIM and CFEM were integrated simultaneously, the model’s performance reached its peak. On LEVIR-CD, the F1 score increased to 92.54% and the IoU rose to 85.94%. On the more challenging MSRS-CD dataset, the F1 and IoU also significantly improved to 79.67% and 65.84%, respectively. The results, in conjunction with Figure 9g, illustrate that the BIM and CFEM modules are complementary in modeling spatial-structural relationships and enhancing difference representations, jointly improving the model’s robustness and accuracy.
4.5.2. Analysis of the Impact of Different LoRA Rank Settings on Model Performance
To further analyze the impact of the parameter-efficient fine-tuning strategy LoRA on model performance, we conducted comparative experiments by setting different rank values in the proposed SAM-MSCD model. Experiments were carried out on the LEVIR-CD and WHU-CD datasets, and the results are summarized in Table 7. It can be observed that GPU memory consumption increases progressively with larger rank values.
On the LEVIR-CD dataset, as the rank value increases, the model performance across all evaluation metrics exhibits a trend of gradual improvement followed by saturation. The best overall performance is achieved when the rank is set to 32, where the model attains an F1 score of 92.54% and an IoU of 85.94%. Further increasing the rank does not yield noticeable performance gains. On the WHU-CD dataset, the detection performance continues to improve with increasing rank values, and the optimal results under the current experimental configuration are obtained when the rank is set to 128. This observation suggests that the WHU-CD dataset poses higher demands on model capacity compared to LEVIR-CD.
Considering the trade-off between detection accuracy and computational cost, setting the rank to 32 provides the most favorable overall balance under the current configuration. This setting not only delivers high detection performance but also effectively controls parameter scale and computational overhead. Therefore, a rank value of 32 is recommended for subsequent model training and practical deployment.
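To make the role of the rank concrete, the sketch below shows a generic LoRA-style linear layer in NumPy together with its trainable-parameter count. It follows the standard LoRA formulation (a frozen weight plus a scaled low-rank update, with the up-projection initialized to zero) and is not the SAM-MSCD implementation; the class and function names, the `alpha` default, and the hidden width of 768 in the usage lines are hypothetical.

```python
import numpy as np

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

class LoRALinear:
    """Frozen weight W plus a low-rank update: y = x W^T + (alpha / rank) * x A^T B^T."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))      # trainable down-projection
        self.B = np.zeros((d_out, rank))                       # trainable up-projection, zero init
        self.scale = (alpha if alpha is not None else rank) / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

# Hypothetical hidden width of 768, used only to show how the count scales with rank.
for r in (8, 32, 128):
    print(r, lora_param_count(768, 768, r))
```

The added parameter count grows linearly with the rank, which is in line with the observed increase in GPU memory, although actual memory use also depends on activations and optimizer state.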