Article

MBFI-Net: Multi-Branch Feature Interaction Network for Semantic Change Detection

1. State Key Laboratory of Deep Earth Exploration and Imaging, College of Geo-Exploration Science and Technology, Jilin University, Changchun 130026, China
2. Key Laboratory of Space Precision Measurement Technology, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 179; https://doi.org/10.3390/rs18010179
Submission received: 5 December 2025 / Revised: 28 December 2025 / Accepted: 29 December 2025 / Published: 5 January 2026

Highlights

What are the main findings?
  • Global-local modeling applied to bi-temporal images effectively enhances the perception of semantic changes;
  • Multi-branch information interaction improves both feature diversity and representational capability.
What are the implications of the main findings?
  • Cross-granularity integrated features improve the detection performance of multi-scale semantic changes in complex scenes;
  • Prior constraints on logical relationships improve both the accuracy and robustness of semantic change detection.

Abstract

Semantic change detection (SCD) captures ground-object transition information within change regions, delivering more comprehensive and detailed results than binary change detection (BCD). Existing multi-task SCD models process segmentation and BCD of bi-temporal remote sensing images in parallel, but they still fall short in feature mining, feature interaction, and cross-task transfer. To address these limitations, a multi-branch feature interaction network (MBFI-Net) is proposed. MBFI-Net designs parallel encoding branches with attention mechanisms that enhance semantic change perception by jointly modeling global contextual patterns and local details. In addition, MBFI-Net introduces bi-temporal feature interaction (BTFI) and cross-task feature transfer (CTFT) modules to improve feature diversity and representativeness, and combines them with prior logical-relationship constraints to improve SCD performance. Comparative and ablation studies on the SECOND and Landsat-SCD datasets highlight the superiority and robustness of MBFI-Net, which achieves SeK values of 0.2117 and 0.5543, respectively. Furthermore, MBFI-Net strikes a balance between SCD accuracy and model complexity, and performs particularly well on under-represented semantic change categories.

1. Introduction

Accurate acquisition of land use and land cover (LULC) changes provides critical support for resource conservation, post-disaster emergency response, and sustainable urban development [1,2]. Traditional manual field surveys suffer from high costs, low efficiency, and subjective biases, making them inadequate for rapidly obtaining large-scale change information. Recent advancements in remote sensing technology have opened new avenues for acquiring LULC changes [3,4]. Leveraging their strengths of high resolution, broad coverage, and short revisit cycles, remote sensing images supported by computer technologies have emerged as the predominant source of data for detecting surface changes [5,6,7].
Binary change detection (BCD) studies identify change regions from bi-temporal images with identical spatial coverage. Most existing BCD studies primarily concentrate on detecting single-category LULC changes, such as buildings [8], croplands [9], and forests [10]. In contrast, semantic change detection (SCD) enables precise extraction of change regions while identifying “from-to” transformations in LULC [11,12], providing more detailed surface change information.
Traditional SCD studies can be categorized into sequential-based [13,14], reversed-based [15,16], and parallel-based [17] methods according to their result acquisition processes. Sequential-based methods first calculate change magnitude using spectral and texture features. Subsequently, pixels are categorized into unchanged or changed groups using thresholds based on change intensity. Finally, machine learning algorithms or domain knowledge are employed to identify LULC transitions in the detected change regions. A representative method is change vector analysis, whose workflow is illustrated in Figure 1a. Reversed-based methods, also termed post-classification comparison methods, follow the workflow shown in Figure 1b. These methods initially obtain LULC classification results for each temporal image using spectral and texture features. They then compare LULC category differences between corresponding pixels in bi-temporal images to detect change regions and their associated transformations. Furthermore, parallel-based methods treat SCD as a classification task, as depicted in Figure 1c. These methods define each “from-to” LULC transformation as an individual class and use machine learning and other methods to extract SCD results from bi-temporal images directly.
However, each category of traditional SCD methods exhibits inherent limitations. Among these, sequential-based approaches neglect the correlation between BCD and semantic segmentation tasks [18]; reversed-based methods are heavily influenced by the classification accuracy of individual images, leading to pronounced error accumulation [19], while parallel-based methods face challenges in sample annotation and model convergence due to the excessive number of semantic change categories [20]. Furthermore, these traditional methods predominantly rely on expert knowledge to extract handcrafted features from medium-to-low-resolution images. Their information extraction capability becomes constrained when applied to high-resolution images, thus limiting their broader applicability in current SCD tasks.
Leveraging their powerful feature extraction abilities, SCD studies based on deep learning methods have recently attracted considerable attention [21]. These studies also adopt sequential [22,23], reversed [24,25], and parallel [26] data processing paradigms. While neural network-derived features have better representation ability than handcrafted features, the problems of task decoupling, error propagation, and excessive “from-to” categories have not been properly addressed. To mitigate these limitations, emerging research proposes multi-task networks [27,28,29] that synergistically integrate semantic segmentation with BCD tasks (Figure 1d). Bi-temporal images are processed through weight-shared siamese encoders to extract hierarchical multi-scale features, which are subsequently decoded into temporally specific LULC maps. Concurrently, the temporal image pair is fed into a dedicated change encoder to capture deep discrepancy patterns, generating pixel-wise BCD outputs. Advanced implementations employ skip connections to fuse multi-level features between the segmentation and BCD branches, enhancing SCD performance through cross-task knowledge transfer. A weighted combination of branch-specific loss functions drives gradient updates via backpropagation, enabling end-to-end optimization of the multi-task model. The SCD results are obtained by applying BCD masks to extract spatiotemporal “from-to” transformations from the bi-temporal LULC classification results.
Although current multi-task SCD models outperform traditional methods, SCD tasks in high-resolution images present unique challenges:
(1) Complex imaging scenarios and the rich variety of semantic change categories make it difficult for traditional single-type encoders (e.g., CNN or Transformer) to fully characterize the complexity of object changes, making false detections more likely.
(2) The scale of changed objects varies extremely, ranging from the construction of a single small house to the conversion of an entire farmland into forests. This makes it difficult for models with fixed receptive fields to adapt to the detection requirements of multi-scale changes.
(3) Changed areas in bi-temporal images should have different LULC categories. Most existing models fail to fully leverage logical relationships between tasks. HRSCD-str3 [24] performs BCD and segmentation tasks separately, while HRSCD-str4 [24] and SMNet [30] only transmit limited information across tasks through skip connections. This results in information inconsistency between detected binary changes and segmentation results across bi-temporal images.
To tackle these challenges, we introduce a multi-branch feature interaction network (MBFI-Net) for high-precision SCD tasks. The primary contributions of our study can be encapsulated as follows:
(1) To comprehensively capture local details and long-range dependencies of changing objects, we construct dual-Siamese encoders in MBFI-Net, consisting of Siamese convolutional neural networks (CNNs) and Siamese transformers. This hybrid architecture effectively characterizes complex change objects across diverse scenarios, providing robust feature representations for SCD tasks.
(2) To enhance SCD performance, we design bi-temporal feature interaction (BTFI) modules in MBFI-Net. These modules enrich the diversity of discrepancy features through multi-scale convolution fusion and bi-temporal interactive attention mechanisms, improving detection accuracy for semantic changing objects.
(3) Incorporating prior logical constraints, we develop cross-task feature transfer (CTFT) modules that facilitate information exchange and mutual enhancement between semantic segmentation and BCD tasks. This mechanism improves the semantic consistency of outputs from different task branches.
(4) Compared with other SCD methods, MBFI-Net achieves optimal performance on different datasets. Qualitative and quantitative results demonstrate its ability to detect change regions and identify LULC transitions accurately.
The remainder of this paper is organized as follows: Section 2 reviews the relevant work on BCD and SCD; Section 3 elaborates on the specific structure of MBFI-Net, and also introduces the datasets, competing methods, and experimental setup; Section 4 presents the SCD results on the SECOND and Landsat-SCD datasets; Section 5 validates the model performance through ablation studies and complexity analysis, and proposes future improvement directions; Section 6 summarizes the research work.

2. Related Work

2.1. Deep-Learning-Based BCD Methods

High-resolution satellite imagery has become increasingly accessible and of higher quality. Meanwhile, deep learning techniques have been extensively adopted for BCD tasks in complex scenarios. These approaches effectively capture detailed and semantic features of changing objects, demonstrating superior detection performance compared to traditional change vector analysis and machine learning-based approaches.
Mainstream end-to-end BCD networks adopt single-branch or dual-branch architectures, corresponding to the early and late fusion of bi-temporal images. Among them, the single-branch architectures concatenate or difference the bi-temporal images as network inputs to obtain BCD results. Zhu et al. [31] enhanced SegNet with morphological post-processing to refine detection results, effectively reducing omission and commission errors. Peng et al. [32] developed a single-branch convolutional network, demonstrating the feasibility of the network in change region identification. Sun et al. [33] integrated different types of networks in a single-branch architecture, validating its effectiveness across two benchmark datasets. However, such early fusion strategies may inadequately exploit information within individual images, resulting in incomplete identification of change regions.
In contrast, dual-branch architectures employ Siamese or pseudo-Siamese encoders to extract image features, followed by a fusion of cross-temporal features in decoders for accurate change region identification. Ding et al. [34] introduced a pseudo-Siamese network that extracts bi-temporal features and combines attention mechanism with spatial pyramid structure to achieve high-quality BCD results. Zhang et al. [8] developed an edge refinement network under a dual-branch framework, incorporating adaptive cross-entropy loss to enhance detection performance. Ning et al. [35] introduced reverse change discovery into a multi-stage progressive dual-branch network, significantly improving BCD effectiveness and robustness.
Recently, advanced networks like transformer and mamba have demonstrated exceptional performance in computer vision tasks and are also being adapted to enhance BCD capabilities [36,37]. Pan et al. [38] introduced a transformer-based feature fusion network to better extract small building changes. Chen et al. [36] introduced the Mamba into BCD tasks, effectively capturing the global context for precise change identification. Meanwhile, emerging paradigms including semi-supervised [39,40], weakly supervised [41], object-level [42], and instance-level [43] approaches continue to advance BCD, driving continuous improvements in detection accuracy and generalization capacity.

2.2. Deep-Learning-Based SCD Methods

Current BCD results lack the granularity required for practical applications. In contrast, SCD provides finer-grained insights by identifying LULC transitions, albeit with increased task complexity. Some studies formulate SCD as a multi-class classification task, employing single-branch networks to predict “from-to” transition results directly. Daudt et al. [24] modified the prediction head of BCD networks by regarding each “from-to” transition as a separate class, enabling direct multi-class SCD outputs. Mou et al. [44] combined spectral-spatial-temporal representations for end-to-end prediction of multi-class semantic changes. Zhu et al. [26] utilized Siamese networks to extract bi-temporal features while suppressing background interference, achieving direct SCD without intermediate processing. The number of change categories grows quadratically with LULC classes in single-temporal images, creating a severe class imbalance that imposes significant challenges on model performance and annotation efforts. Alternative approaches perform bi-temporal LULC classification using neural networks and then derive semantic changes by comparing classification results. However, these methods suffer from error propagation issues. For instance, Peng et al. [45] utilized Siamese end-to-end networks to obtain LULC maps, which are subsequently compared for semantic change identification.
To further enhance SCD performance, multi-task networks that jointly optimize bi-temporal semantic segmentation and BCD have emerged as mainstream methods. These networks reduce error accumulation through parameter and feature sharing. Zheng et al. [46] explored semantic-change causality and temporal symmetry, integrated bi-temporal semantic segmentation with BCD, effectively outputting semantic change maps. Cui and Jiang [47] developed Siamese semantic-aware encoders to aggregate image features, enhancing feature representation and obtaining SCD results. Ding et al. [48] achieved efficient feature reuse via multi-level feature aggregation modules and combined prior knowledge for symmetric feature fusion of bi-temporal images, further optimizing SCD performance. Dai et al. [49] proposed a difference-enhanced network for farmland SCD, achieving reliable farmland SCD results in a semi-supervised manner.
The emergence of multi-task networks has effectively advanced SCD research. However, there remain limitations in bi-temporal feature mining and inter-task feature interaction that require improvement. Therefore, this study proposes MBFI-Net that comprehensively represents local details and global semantics of complex change objects. MBFI-Net enables feature interaction across temporal dimensions and information transfer across tasks, enhancing the precision and reliability of SCD results.

3. Materials and Methods

3.1. Overall Structure of MBFI-Net

MBFI-Net (Figure 2) is an end-to-end network that simultaneously performs segmentation and BCD of bi-temporal images through different task branches. MBFI-Net incorporates dual Siamese encoding branches to capture bi-temporal local and global information. This hybrid design enhances the model’s representational capacity for complex changing objects. Furthermore, MBFI-Net designs bi-temporal feature interaction and cross-task feature transfer modules, introduces attention mechanisms to improve the representativeness and richness of extracted features, thereby improving the SCD results.
The growing image resolution and the complexity of changing objects pose significant challenges for accurate SCD. To address this, MBFI-Net designs shared-weight multi-branch encoders to process bi-temporal remote sensing images, enabling global-local feature extraction and supporting SCD results.
ResNet34 [50] is employed to detect subtle changes. The residual block, shown in Figure 3a, consists of two 3 × 3 convolutions, batch normalization (BN), ReLU non-linear activation, and a residual connection. The residual connection selectively adjusts the feature size and channel dimensions, effectively preserving original information. The residual block can be expressed as:
$$F_{\mathrm{main}} = \mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(F_{\mathrm{in}}\right)\right)\right)\right)\right)$$
$$F_{\mathrm{skip}} = F_{\mathrm{in}} \ \ \text{or} \ \ \mathrm{BN}\left(\mathrm{Conv}_{1\times 1}\left(F_{\mathrm{in}}\right)\right)$$
$$F_{\mathrm{out}} = \mathrm{ReLU}\left(F_{\mathrm{main}} + F_{\mathrm{skip}}\right)$$
where $F_{\mathrm{in}}$ and $F_{\mathrm{out}}$ are the input and output features, $F_{\mathrm{main}}$ and $F_{\mathrm{skip}}$ are the features extracted by the main and skip-connection branches, Conv denotes convolution with the indicated kernel size, and $+$ represents element-wise feature addition.
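As a concrete illustration, the basic block above can be sketched in PyTorch. This is a minimal sketch following the standard ResNet34 basic block, not the authors' implementation; channel sizes and strides are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal sketch of the ResNet34 basic block described above."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Main branch: Conv3x3 -> BN -> ReLU -> Conv3x3 -> BN
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Skip branch: identity, or 1x1 Conv + BN when size/channels change
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        # F_out = ReLU(F_main + F_skip)
        return torch.relu(self.main(x) + self.skip(x))
```

The 1 × 1 projection is applied only when the spatial size or channel count changes, matching the "or" case in the skip-branch equation.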
In addition, the swin transformer [51] is integrated into MBFI-Net to capture contextual dependencies, achieving global modeling of semantic change information. Patch embedding adjusts the image size and channel number, generating non-overlapping patches that are fed into the swin transformer stages. Between adjacent stages, patch merging is introduced to downsample feature maps, apply layer normalization, and adjust channel dimensions (Figure 3b). Each stage comprises a varying number of swin transformer blocks (Figure 3c). Feature extraction by a pair of swin transformer blocks proceeds as follows:
$$\hat{x}^{l} = \mathrm{W\text{-}MSA}\left(\mathrm{LN}\left(x^{l-1}\right)\right) + x^{l-1}$$
$$x^{l} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{x}^{l}\right)\right) + \hat{x}^{l}$$
$$\hat{x}^{l+1} = \mathrm{SW\text{-}MSA}\left(\mathrm{LN}\left(x^{l}\right)\right) + x^{l}$$
$$x^{l+1} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{x}^{l+1}\right)\right) + \hat{x}^{l+1}$$
where the feature $x^{l-1}$ passes through the layer normalization (LN) and window-based multi-head self-attention (W-MSA) layers to obtain $\hat{x}^{l}$, which then passes through LN and MLP layers to obtain $x^{l}$. The SW-MSA layer shares the structure of the W-MSA layer, but its windows are shifted by half the window size.

3.2. Channel Attention and Spatial Attention Modules

The multi-branch encoding structure designed in MBFI-Net can fully extract global information and local representations of bi-temporal images. However, not all deep features help identify semantic changes; redundant features may interfere with accurate SCD. Therefore, MBFI-Net introduces channel attention mechanisms (CAMs) and spatial attention mechanisms (SAMs) to optimize multi-scale features from different encoding branches, suppressing redundant information [52]. These attention mechanisms enhance the distinctiveness and effectiveness of image features and strengthen the model’s focus on semantic changes, as shown in Figure 4.
The swin transformer encoding branch divides images into non-overlapping patches and performs self-attention to capture long-range dependencies effectively. However, it does not fully explore the correlations and relative importance of features across different channels. To address this, MBFI-Net uses CAMs to optimize the features extracted by the swin transformer branches, as follows:
$$\mathrm{CAM} = \sigma\left(\mathrm{MLP}\left(mp\left(F_{st}\right)\right) + \mathrm{MLP}\left(ap\left(F_{st}\right)\right)\right) \times F_{st}$$
where $F_{st}$ represents the features from the swin transformer encoding branch, $mp$ and $ap$ denote channel-wise max pooling and average pooling, MLP is the multi-layer perceptron, $\sigma$ is the sigmoid function, and $+$ and $\times$ represent feature addition and multiplication operations, respectively.
Additionally, ResNet34 extracts features within a local receptive field, capturing local spatial patterns. However, traditional convolutions process the entire image uniformly without distinguishing the importance of different spatial regions, which may prevent the model from focusing sufficiently on key areas when processing complex images. Therefore, MBFI-Net uses SAMs to optimize the multi-scale features of the convolutional branches, enhancing the distinguishability between changed regions and background information, as follows:
$$\mathrm{SAM} = \sigma\left(\mathrm{Conv}_{7\times 7}\left(\left[MP\left(F_{res}\right),\ AP\left(F_{res}\right)\right]\right)\right) \times F_{res}$$
where $F_{res}$ represents the multi-scale features from the residual encoding branch, $MP$ and $AP$ denote spatial max pooling and average pooling, $\mathrm{Conv}_{7\times 7}$ is a convolution with a 7 × 7 kernel, and $[\ ,\ ]$ represents feature concatenation.
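Both attention modules mirror the well-known CBAM formulation and can be sketched compactly in PyTorch. This is an illustrative sketch; the channel-reduction ratio in the MLP is an assumption, not a setting reported by the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: channel weights from pooled descriptors through a shared MLP."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        # Shared MLP implemented as two 1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
        )

    def forward(self, f_st):
        # sigma(MLP(mp(F)) + MLP(ap(F))) * F
        mp = f_st.amax(dim=(2, 3), keepdim=True)   # per-channel max pooling
        ap = f_st.mean(dim=(2, 3), keepdim=True)   # per-channel average pooling
        return torch.sigmoid(self.mlp(mp) + self.mlp(ap)) * f_st

class SpatialAttention(nn.Module):
    """SAM: spatial weights from channel-wise max/avg maps and a 7x7 conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_res):
        # sigma(Conv7x7([MP(F), AP(F)])) * F
        m = torch.cat([f_res.amax(dim=1, keepdim=True),
                       f_res.mean(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(m)) * f_res
```

Both modules preserve the input shape, so they can be dropped after any encoder stage.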

3.3. Bi-Temporal Feature Interaction Module

MBFI-Net proposes the BTFI module (Figure 5) to improve the diversity and distinctiveness of bi-temporal features. The BTFI module generates multi-scale fused features via multi-branch convolutions and strengthens the mining of bi-temporal discrepancies using interactive attention. It enables adaptation to semantic changes of varying sizes and enhances the model’s ability to characterize complex semantic changes.
$$F_{ms} = \mathrm{Conv}_{1\times 1}\left(\left[\mathrm{Conv}_{1\times 1}\left(F_{in}\right),\ \mathrm{Conv}_{3\times 1}\left(\mathrm{Conv}_{1\times 3}\left(F_{in}\right)\right),\ \mathrm{Conv}_{5\times 1}\left(\mathrm{Conv}_{1\times 5}\left(F_{in}\right)\right)\right]\right)$$
$$Q = \mathrm{Per}\left(\mathrm{Reshape}_{q}\left(\mathrm{Conv}_{1\times 1}^{q}\left(F_{in}\right)\right)\right)$$
$$V = \mathrm{Reshape}_{v}\left(\mathrm{Conv}_{1\times 1}^{v}\left(F_{in}\right)\right)$$
$$K = \mathrm{Reshape}_{k}\left(\mathrm{Conv}_{1\times 1}^{k}\left(F_{in}\right)\right)$$
$$F_{out}^{t1} = F_{in}^{t1} + F_{ms}^{t1} + Para^{t1} \times \mathrm{Reshape}^{t1}\left(V^{t1} \times \mathrm{Per}\left(\mathrm{SM}\left(K^{t1} \times Q^{t2}\right)\right)\right)$$
$$F_{out}^{t2} = F_{in}^{t2} + F_{ms}^{t2} + Para^{t2} \times \mathrm{Reshape}^{t2}\left(V^{t2} \times \mathrm{Per}\left(\mathrm{SM}\left(K^{t2} \times Q^{t1}\right)\right)\right)$$
where $F_{in}$ denotes the input feature of the BTFI module; $F_{ms}$ is the multi-scale fusion feature obtained through multi-branch convolution; $Para$ is a trainable parameter used to adaptively weight the interactive feature; $Q$, $K$, and $V$ are the query, key, and value vectors; Reshape indicates feature size adjustment; Per indicates feature dimension permutation; SM is the softmax function; and $F_{out}^{t1}$ and $F_{out}^{t2}$ are the output features of the BTFI module, which fully integrate bi-temporal information.
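The interaction above amounts to multi-scale fusion plus a cross-temporal non-local attention, in which each epoch's keys and values attend to the other epoch's queries. The sketch below illustrates that structure under stated assumptions: the softmax scaling, shared Q/K/V projections across epochs, and zero-initialized Para weights are our choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_br(in_ch, out_ch, k, p):
    """Conv + BN + ReLU helper."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=p),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class BTFI(nn.Module):
    """Sketch of bi-temporal feature interaction: multi-scale convolution
    fusion plus cross-temporal attention (queries from the other epoch)."""
    def __init__(self, ch):
        super().__init__()
        # Multi-scale branches: 1x1, 1x3->3x1, and 1x5->5x1 convolutions
        self.b1 = conv_br(ch, ch, 1, 0)
        self.b3 = nn.Sequential(conv_br(ch, ch, (1, 3), (0, 1)),
                                conv_br(ch, ch, (3, 1), (1, 0)))
        self.b5 = nn.Sequential(conv_br(ch, ch, (1, 5), (0, 2)),
                                conv_br(ch, ch, (5, 1), (2, 0)))
        self.fuse = nn.Conv2d(3 * ch, ch, 1)
        # Q/K/V projections and trainable interaction weights (Para)
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.para1 = nn.Parameter(torch.zeros(1))
        self.para2 = nn.Parameter(torch.zeros(1))

    def _ms(self, x):
        return self.fuse(torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1))

    def _attend(self, kv, q_src):
        # K, V from one epoch; Q from the other epoch
        b, c, h, w = kv.shape
        q = self.q(q_src).flatten(2).transpose(1, 2)    # B x HW x C
        k = self.k(kv).flatten(2)                       # B x C x HW
        v = self.v(kv).flatten(2)                       # B x C x HW
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # B x HW x HW
        return (v @ attn.transpose(1, 2)).view(b, c, h, w)

    def forward(self, x1, x2):
        y1 = x1 + self._ms(x1) + self.para1 * self._attend(x1, x2)
        y2 = x2 + self._ms(x2) + self.para2 * self._attend(x2, x1)
        return y1, y2
```

Note that the full HW × HW attention map is memory-heavy, so such modules are typically applied at the deeper, spatially smaller encoder stages.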

3.4. Cross-Task Feature Transfer Module

Semantic segmentation and BCD of bi-temporal images are closely related yet different tasks. Semantic segmentation aims to perform pixel-level classification on single-temporal images to identify different LULC categories. BCD focuses on comparing bi-temporal images to determine whether each pixel has changed. There is a prior logical relationship between the two tasks: generally, LULC differs in changed regions across different times, while unchanged regions should have the same LULC categories in bi-temporal images. To fully utilize the logical relationship between related tasks and further enhance the model’s SCD performance, MBFI-Net designs CTFT modules in the decoding stage, as shown in Figure 6.
$$F_{out}^{SSt1} = F_{in}^{SSt1} + \sigma\left(\mathrm{Conv}_{7\times 7}\left(\left[MP\left(F_{CD}\right),\ AP\left(F_{CD}\right)\right]\right)\right) \times F_{in}^{SSt1}$$
$$F_{out}^{SSt2} = F_{in}^{SSt2} + \sigma\left(\mathrm{Conv}_{7\times 7}\left(\left[MP\left(F_{CD}\right),\ AP\left(F_{CD}\right)\right]\right)\right) \times F_{in}^{SSt2}$$
where $F_{in}^{SSt1}$ and $F_{in}^{SSt2}$ are the features from the residual encoding branch for semantic segmentation, $F_{CD}$ represents the output features of the feature detail enhancement (FDE) modules in the BCD decoding branch, and $F_{out}^{SSt1}$ and $F_{out}^{SSt2}$ are the features ultimately passed to the semantic segmentation decoding task. The BCD information, acting as an additional constraint, indicates the areas that require attention, helping the semantic segmentation branches identify LULC in changing areas more accurately and thereby improving overall SCD accuracy.
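In code, the transfer is a spatial attention map computed from the BCD features and used to re-weight both segmentation branches. This is a sketch mirroring the equations above; the 7 × 7 kernel follows the formulation, while everything else is illustrative.

```python
import torch
import torch.nn as nn

class CTFT(nn.Module):
    """Sketch: BCD decoder features gate the segmentation features via a
    spatial attention map, encoding the prior that changed regions deserve
    extra attention in both segmentation branches."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_ss1, f_ss2, f_cd):
        # Attention map from channel-wise max/avg of the BCD features
        m = torch.cat([f_cd.amax(dim=1, keepdim=True),
                       f_cd.mean(dim=1, keepdim=True)], dim=1)
        gate = torch.sigmoid(self.conv(m))
        # F_out = F_in + gate * F_in, applied to both temporal branches
        return f_ss1 + gate * f_ss1, f_ss2 + gate * f_ss2
```

The same gate is applied to both temporal segmentation features, reflecting that the binary change mask is shared across epochs.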

3.5. Feature Detail Enhancement Module

To obtain high-quality semantic segmentation and BCD results, MBFI-Net designs identical FDE modules (Figure 7) for different branches during feature decoding. The FDE module uses deconvolution to upsample features, gradually restoring features to the original image resolution. Additionally, the FDE module employs parallel convolutions with diverse kernel sizes to capture horizontal and vertical detail features, achieving multi-scale receptive field fusion and improving the accuracy of SCD results. Moreover, the FDE module introduces residual connections to preserve key information during decoding. After the last FDE module of each decoder, convolution operation is used to adjust the channel dimension and output the corresponding results.
$$F_{up} = \mathrm{DeConv}_{2\times 2}\left(F_{in}\right)$$
$$F_{out} = \mathrm{Conv}_{1\times 1}\left(\left[\mathrm{Conv}_{1\times 5}\left(\mathrm{Conv}_{3\times 3}\left(F_{up}\right)\right),\ \mathrm{Conv}_{5\times 1}\left(\mathrm{Conv}_{3\times 3}\left(F_{up}\right)\right)\right]\right) + F_{up}$$
where $F_{in}$ denotes the input feature, $\mathrm{DeConv}_{2\times 2}$ represents the 2 × 2 deconvolution, $F_{up}$ is the upsampled feature, and Conv represents convolutions with different kernel sizes; both deconvolution and convolution are followed by BN and ReLU activation, and $F_{out}$ is the output feature.
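The decoding step can be sketched as follows. This is a minimal sketch of the structure described above, assuming that the deconvolution halves the channel count; the exact channel plan is not specified in the text.

```python
import torch
import torch.nn as nn

def conv_br(in_ch, out_ch, k, p):
    """Conv + BN + ReLU helper."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=p),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FDE(nn.Module):
    """Sketch of the feature detail enhancement module: 2x2 deconvolution
    upsampling, parallel horizontal/vertical detail branches, and a
    residual connection back to the upsampled feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Parallel asymmetric branches capture horizontal/vertical details
        self.h = nn.Sequential(conv_br(out_ch, out_ch, 3, 1),
                               conv_br(out_ch, out_ch, (1, 5), (0, 2)))
        self.v = nn.Sequential(conv_br(out_ch, out_ch, 3, 1),
                               conv_br(out_ch, out_ch, (5, 1), (2, 0)))
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, x):
        f_up = self.up(x)  # 2x spatial upsampling
        return self.fuse(torch.cat([self.h(f_up), self.v(f_up)], dim=1)) + f_up
```

Each FDE doubles the spatial resolution, so stacking them gradually restores the original image size as the text describes.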

3.6. Loss Function and Performance Assessment

To optimize model parameters, we selected cross-entropy as the loss function to supervise pixel-level SCD results. In the multi-task network, binary cross-entropy ( L BCE ) and multi-class cross-entropy ( L MCE ) are chosen to supervise BCD and semantic segmentation results, respectively. Overall loss is the weighted combination of losses from various task branches, with BCD and semantic segmentation weights set to 1 and 0.5, respectively.
$$L_{BCE} = -y\log p_{c} - \left(1-y\right)\log\left(1-p_{c}\right)$$
$$L_{MCE} = -\sum_{i=1}^{N} y_{i}\log p_{i}$$
where $y$ is the binary change ground truth (1 for changed, 0 for unchanged), $p_{c}$ denotes the predicted change probability, and $N$ is the number of semantic segmentation classes. $y_{i}$ equals 1 if the ground-truth class is $i$ and 0 otherwise, and $p_{i}$ is the predicted probability that the pixel belongs to class $i$.
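The weighted combination can be written as a small loss function in PyTorch. This sketch assumes logits (not probabilities) as inputs and uses the weights 1 and 0.5 stated above; the argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def scd_loss(cd_logit, cd_gt, ss1_logit, ss1_gt, ss2_logit, ss2_gt,
             w_cd=1.0, w_ss=0.5):
    """Weighted multi-task loss: binary cross-entropy supervises the BCD
    branch; multi-class cross-entropy supervises each segmentation branch.
    BCD weight = 1, segmentation weight = 0.5, as in the text."""
    l_bce = F.binary_cross_entropy_with_logits(cd_logit, cd_gt)
    l_mce = F.cross_entropy(ss1_logit, ss1_gt) + F.cross_entropy(ss2_logit, ss2_gt)
    return w_cd * l_bce + w_ss * l_mce
```

During training, this scalar drives backpropagation through all three branches jointly, which is what enables end-to-end optimization of the multi-task network.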
Consistent with existing studies, intersection over union (IoU) and F1 are selected to assess the BCD accuracies [8,39]. Moreover, we selected separated Kappa (SeK) to assess the SCD accuracies [27,28,45,53], reducing the impact of class imbalance by excluding correctly detected unchanged pixels.
$$\mathrm{IoU} = \frac{TP}{TP+FP+FN}$$
$$F1 = \frac{2\,TP}{2\,TP+FP+FN}$$
$$\hat{\rho} = \frac{\sum_{i=2}^{C} Q_{ii}}{\sum_{i=1}^{C}\sum_{j=1}^{C} Q_{ij} - Q_{11}}$$
$$\hat{\eta} = \frac{\sum_{j=1}^{C} \hat{Q}_{j+}\cdot \hat{Q}_{+j}}{\left(\sum_{i=1}^{C}\sum_{j=1}^{C} Q_{ij} - Q_{11}\right)^{2}}$$
$$\mathrm{SeK} = \frac{\hat{\rho}-\hat{\eta}}{1-\hat{\eta}} \times e^{\mathrm{IoU}-1}$$
where TP and TN represent the counts of correctly detected changed and unchanged pixels, respectively; FP and FN represent the counts of falsely detected and missed changed pixels, respectively; and $Q \in \mathbb{R}^{C\times C}$ denotes the confusion matrix of semantic changes. $\hat{Q}_{j+}$ and $\hat{Q}_{+j}$ are the sums of elements in the $j$-th row and $j$-th column after removing the unchanged pixel count $Q_{11}$.
Furthermore, we use a comprehensive evaluation metric Score that considers both BCD and semantic segmentation results [46,53], evaluating the joint performance of “change region localization” and “semantic category recognition” in SCD tasks, as follows:
$$\mathrm{mIoU} = 0.5\times\left(\frac{TN}{TN+FP+FN}+\frac{TP}{TP+FP+FN}\right)$$
$$\mathrm{Score} = 0.3\times \mathrm{mIoU} + 0.7\times \mathrm{SeK}$$
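These metrics can be computed directly from the pixel counts and the semantic-change confusion matrix. A NumPy sketch follows, with row/column 0 taken as the unchanged class (i.e., $Q_{11}$ in the equations); function names are ours.

```python
import numpy as np

def binary_metrics(tp, tn, fp, fn):
    """IoU and F1 of the changed class, plus the two-class mIoU."""
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    miou = 0.5 * (tn / (tn + fp + fn) + tp / (tp + fp + fn))
    return iou, f1, miou

def sek(conf, iou):
    """Separated Kappa from the semantic-change confusion matrix `conf`
    (index 0 = unchanged) and the binary-change IoU."""
    q = conf.astype(float).copy()
    total = q.sum() - q[0, 0]              # all pixels except correct unchanged
    rho = (np.trace(q) - q[0, 0]) / total  # observed agreement (rho-hat)
    q[0, 0] = 0.0                          # remove the unchanged count Q_11
    eta = (q.sum(axis=1) * q.sum(axis=0)).sum() / total ** 2  # chance (eta-hat)
    return (rho - eta) / (1.0 - eta) * np.exp(iou - 1.0)

def score(miou, sek_val):
    """Combined metric: 0.3 * mIoU + 0.7 * SeK."""
    return 0.3 * miou + 0.7 * sek_val
```

Excluding $Q_{11}$ keeps the dominant unchanged pixels from inflating the agreement, which is exactly how SeK counteracts class imbalance.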

3.7. Semantic Change Detection Datasets

To demonstrate MBFI-Net’s advantages in SCD tasks, two public datasets are selected for comparative and ablation experiments. These datasets reflect real-world change scenarios, in which changed pixels are far fewer than unchanged pixels and the proportions of different semantic change categories are imbalanced. Representative SCD samples are shown in Figure 8.
The SECOND dataset contains diverse semantic changes across six LULC categories [28]. The imagery exhibits spatial resolutions between 0.5 and 3 m, primarily covering Shanghai, Hangzhou, and Chengdu. With 512 × 512 pixel samples, we randomly allocate 2078 pairs for network training and 445 pairs each for model validation and testing.
The Landsat-SCD dataset covers Tumushuke, Xinjiang, China, with imagery acquired between 1990 and 2020 at a 30-m spatial resolution [29]. Compared to SECOND, Landsat-SCD has fewer semantic change categories and is limited to mutual conversions among water, farmland, desert, and building. Using original 416 × 416 pixel samples, we randomly select 1435 pairs for training, 475 pairs for validation, and 475 pairs for SCD performance evaluation.

3.8. Competing Methods and Experimental Setup

To verify the advantages of MBFI-Net in SCD tasks, we selected nine representative methods for qualitative and quantitative comparison. PSPNet [54] and U-Net [55], classical neural networks for image segmentation, improve result accuracy via multi-scale feature extraction and skip connections, respectively. DSA-Net [34], a two-branch encoding network for BCD tasks, enhances the model’s focus on changed areas via attention mechanisms and deep supervision. HRSCD-str3 and HRSCD-str4 [24], both three-branch SCD networks, differ in that HRSCD-str4 enables feature transfer between different task branches. ChangeMask [46] considers the temporal symmetry of binary changes to build a multi-task SCD model. BiSRNet [56] proposes a semantic reasoning module to improve the information interaction between different task branches. SCanNet [48] proposes a transformer model that combines spatiotemporal dependencies to optimize SCD performance. MLFA-Net [27], a multi-task SCD network, focuses on multi-scale feature extraction, fusion, and full utilization. GLAI-Net [53] designs three encoding branches to extract contextual information from bi-temporal images and detect semantic changes in multitasking mode. Among the comparison methods, networks for semantic segmentation and BCD achieve SCD in a parallel manner, directly outputting multi-class semantic change maps.
All comparative and ablation experiments are implemented based on the PyTorch framework. To ensure the fairness and reliability of the results, all methods adopt identical training strategies and hyperparameter settings, with a fixed random seed to guarantee full reproducibility. The training configuration sets the number of epochs to 60 and the batch size to 8. Sample augmentation strategies include random flipping and rotation. The learning rate decays following $0.1 \times (1 - iter/total\_iter)^{1.5}$ during training. Stochastic gradient descent is used as the optimizer with weight decay and momentum. All reported performance metrics are the stable test results after model convergence.
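The polynomial decay schedule above can be written as a one-line function (the base learning rate of 0.1 follows the formula in the text; other values are illustrative):

```python
def poly_lr(base_lr, cur_iter, total_iter, power=1.5):
    """Polynomial learning-rate decay:
    lr = base_lr * (1 - cur_iter / total_iter) ** power"""
    return base_lr * (1.0 - cur_iter / total_iter) ** power
```

The rate starts at `base_lr`, falls smoothly over training, and reaches zero at the final iteration.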

4. Results

The proposed MBFI-Net and other methods are tested on different datasets, with both qualitative and quantitative comparisons of their SCD performance. Additionally, ablation studies confirm the importance of various modules in improving feature interaction and thus enhancing SCD accuracy. Finally, the strengths and weaknesses of different methods are analyzed based on model performance and complexity.

4.1. SCD Results on SECOND Dataset

Various deep learning-based methods are evaluated on the SECOND dataset; the SCD results are visualized in Figure 9, and the accuracy of the corresponding SCD results is shown in Table 1.
Compared to PSPNet and U-Net, DSA-Net, which incorporates deep supervision and attention mechanisms, achieves the highest accuracy among direct SCD methods, with SeK and Score reaching 0.1339 and 0.2943, respectively. However, DSA-Net’s accuracy is lower than that of all multi-task models except HRSCD-str3. This confirms that direct SCD methods struggle to achieve excellent results due to the excessive number of semantic change categories and the imbalanced sample ratios among categories. HRSCD-str4 outperforms HRSCD-str3 by 0.0494 in F1 and 0.0481 in Score, indicating that feature interaction among different branches can effectively improve SCD results. GLAI-Net and SCanNet, which better extract image features and model semantic changes, show relatively excellent performance among the comparison methods, with SeKs of 0.2063 and 0.2037, respectively. Compared to the other methods, MBFI-Net fully extracts global and local information and enables feature interaction across different branches, achieving the best SCD performance, with SeK and Score reaching 0.2117 and 0.3667, respectively.
Representative SCD results from various methods are visualized in Figure 9 to compare and verify the advantages of the proposed MBFI-Net. MBFI-Net obtains predictions closest to the ground truth, with better identification of changed regions and a more accurate representation of LULC transitions between bi-temporal images. In the first group of results, most methods detect fragmented changed regions, while MBFI-Net obtains more complete and accurate detections. In the second group, the proportion of changed areas is relatively small, and the comparison methods show varying degrees of omission (DSA-Net, BiSRNet) and misclassification (HRSCD-str4, ChangeMask, SCanNet, MLFA-Net, and GLAI-Net). Although the results of MBFI-Net still show some gaps relative to the ground truth, they represent the LULC changes more accurately and have more reliable detection boundaries. In the third group, MBFI-Net again shows better SCD performance than the comparison methods, more accurately identifying transitions from buildings and low vegetation to buildings.

4.2. SCD Results on Landsat-SCD Dataset

Comparative experiments are also carried out on the Landsat-SCD dataset, with specific quantitative SCD results shown in Table 2.
PSPNet, used for semantic segmentation, performs the worst, with a SeK of only 0.2345. DSA-Net shows the best performance among direct SCD methods; owing to the smaller number of semantic change classes in the Landsat-SCD dataset, its detection capability is fully exploited, surpassing some multi-task SCD networks. In contrast, GLAI-Net achieves better feature extraction and utilization using three encoders and has the highest SCD accuracy among the comparison methods. The proposed MBFI-Net leverages a multi-branch structure to extract global and local features and effectively transfers features across periods and tasks, achieving SCD performance superior to the comparison methods, with F1, SeK, and Score reaching 0.8917, 0.5543, and 0.6508, respectively.
As shown in Figure 10, HRSCD-str4 and ChangeMask exhibit severe false detections among the multi-task methods, failing to detect water-to-other-object transitions. In BiSRNet’s results, rivers are significantly misclassified, with river widths much larger than in the ground truth. SCanNet and MLFA-Net achieve good detection performance among the comparison methods but still suffer from fragmented results and false detection or misclassification of buildings and deserts. The visualization also indicates that GLAI-Net performs well among the comparison methods, but a certain proportion of farmland in the bi-temporal images is still misclassified. In comparison, the proposed MBFI-Net outperforms the other methods in detecting changed regions and identifying LULC transitions, with SCD results closer to the ground truth.

5. Discussion

5.1. Ablation Studies

To confirm the importance of various modules for enhancing SCD performance, we design ablation experiments on different datasets, and the quantitative results are presented in Table 3 and Table 4. The modules involved mainly include BTFI, CTFT, and SAM+CAM in MBFI-Net.
The modules introduced in MBFI-Net significantly improve the SCD results. After introducing the BTFI module, the bi-temporal multi-scale features are thoroughly fused, improving the SeK of MBFI-Net by 0.0044 and 0.0372 on the SECOND and Landsat-SCD datasets, respectively. After introducing the CTFT modules, the semantic information of the BCD branch is effectively passed to the segmentation branch, and the Scores of MBFI-Net on the SECOND and Landsat-SCD datasets improve by 0.0117 and 0.0295, respectively. Additionally, the SAM and CAM modules introduced in the different feature extraction branches help the model focus on changed areas and improve LULC classification; the resulting improvements in F1 and Score on the Landsat-SCD dataset reach 0.0303 and 0.0740, respectively.
Ablation experiment results of the BTFI and CTFT modules across different datasets are selected for visualization analysis. Figure 11 shows that the BTFI module improves the representation of changed regions through multi-scale feature mining and bi-temporal feature interaction, significantly reducing missed and false detections. The boundaries of changed objects are more intact. Figure 12 also shows that the CTFT modules can improve the consistency between BCD and semantic segmentation results, enhance LULC classification accuracy in changed regions, and more accurately indicate complex semantic changes. The visual comparison results again demonstrate the relevant modules’ important role in improving SCD performance.

5.2. Model Performance and Complexity

To fully demonstrate MBFI-Net’s performance advantages, we further conduct a detailed comparison of the ground object transition matrices of the SCD results from different methods (Figure 13) on the SECOND dataset. These transition matrices intuitively display the proportion of each semantic change category in the different results. For a single temporal image, the semantic categories include water (W), non-vegetated surface (NVS), low vegetation (LV), trees (T), buildings (B), and playground (P).
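A transition matrix of this kind can be tallied directly from a pair of label maps; a minimal NumPy sketch, assuming both maps share one integer class encoding (names are illustrative):

```python
import numpy as np

def transition_matrix(labels_t1: np.ndarray, labels_t2: np.ndarray,
                      num_classes: int) -> np.ndarray:
    """Proportion of pixels moving from each T1 class (rows) to each T2 class (cols)."""
    # Encode each (t1, t2) class pair as a single index, then count occurrences.
    pair = labels_t1.astype(np.int64) * num_classes + labels_t2.astype(np.int64)
    counts = np.bincount(pair.ravel(), minlength=num_classes * num_classes)
    mat = counts.reshape(num_classes, num_classes).astype(float)
    return mat / mat.sum()
```

Diagonal entries correspond to unchanged pixels; off-diagonal entries are the semantic change categories discussed below.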
PSPNet and DSA-Net only detect a few major change categories; semantic change categories accounting for less than 1% of the ground truth (such as “from W to NVS” and “from LV to W”) are severely missed, so these methods fail to reflect LULC changes accurately. The multi-task SCD methods outperform the direct SCD methods, identifying more LULC change categories. However, there are significant false detections of “from NVS to NVS” and “from LV to LV” in the transition matrices of HRSCD-str4 and ChangeMask. Among all methods, the proposed MBFI-Net demonstrates superior accuracy in detecting semantic changes and even effectively identifies change categories with a relatively small proportion (such as “from P to NVS”, “from W to P”, and “from W to T”, which each account for less than 0.5% of the ground truth). The pixel proportions of most semantic change categories in MBFI-Net’s transition matrix are closer to those in the ground truth. These comparative results again confirm the significant advantage of MBFI-Net in fully extracting, fusing, and utilizing global and local features to improve SCD performance.
As shown in Figure 14, ChangeMask has the smallest computational cost and parameter count, i.e., 34.72G FLOPs and 10.62M, respectively. However, it yields poor SCD results on the SECOND dataset, with SeK and Score of only 0.1475 and 0.3123, respectively. The proposed MBFI-Net has a computational cost of 253.26G FLOPs, comparable to SCanNet, MLFA-Net, and GLAI-Net, but shows superior SCD performance. The parameter count of MBFI-Net is 38.13M, significantly lower than those of PSPNet and MLFA-Net, indicating that MBFI-Net attains a more favorable balance between model complexity and SCD accuracy.
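For convolution-dominated networks such as these, the FLOPs and parameter counts in Figure 14 are sums of per-layer terms of the following form; a back-of-the-envelope sketch (bias terms and activations are ignored, and one multiply-accumulate is counted as two FLOPs — a common but not universal convention):

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a k x k convolution layer (bias omitted)."""
    return c_out * c_in * k * k

def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """FLOPs for one forward pass over an h_out x w_out output feature map."""
    return 2 * conv2d_params(c_in, c_out, k) * h_out * w_out
```

In practice such figures are produced by profiling tools rather than by hand, but summing these terms over all layers reproduces the order of magnitude of the reported GFLOPs.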

5.3. Limitations and Prospects

Although MBFI-Net achieves a balance between accuracy and efficiency, it still has room for simplification compared with lightweight networks. Its deployment on resource-constrained devices is limited, which restricts its wide application in practical scenarios. In addition, the Swin Transformer is used only as a single-image feature extractor, leaving its sequence modeling advantages for bi-temporal images underexploited.
To address these issues, future work will introduce lightweight architectures to replace some network components, combined with model distillation and channel pruning techniques, to further reduce the parameter count and computational cost and thereby improve the model’s versatility. On the other hand, we will fully explore the sequence modeling capability of the Swin Transformer, design a cross-temporal patch spatiotemporal modeling mechanism, and combine it with hierarchical attention strategies to achieve dynamic fusion of local and global features, further improving detection accuracy.

6. Conclusions

SCD of remote sensing images is crucial for accurately and efficiently capturing LULC changes. In this paper, we propose MBFI-Net to enhance SCD performance. MBFI-Net uses residual convolutional networks and Swin Transformers as feature encoders, effectively capturing the global semantics and local details of diverse changed objects in complex scenes. Additionally, MBFI-Net incorporates CAM and SAM modules in its encoding branches to enhance the model’s focus on changed regions. Furthermore, the BTFI module is designed to enrich the diversity and representativeness of image features, initially reducing pseudo-changes. Based on the prior relationship between the LULC classification and BCD tasks, MBFI-Net introduces CTFT modules, which enhance the semantic consistency of the different branches and improve the model’s SCD performance.
Comparative and ablation experimental results show that MBFI-Net outperforms the comparison methods, achieving Scores of 0.3667 and 0.6508 on the SECOND and Landsat-SCD datasets, respectively. In addition, the BTFI and CTFT modules, as well as the attention mechanisms, contribute to improving SCD performance. The SCD advantage of MBFI-Net is also well reflected in the minor change categories. Future research will further enhance MBFI-Net’s performance, reduce its parameters and computational cost, and promote its in-depth application in urban planning and resource conservation.

Author Contributions

Q.D.: Writing—original draft, Methodology, Software; F.W.: Methodology, Writing—review & editing; K.S.: Data curation, Visualization; W.C.: Validation; M.W.: Conceptualization; G.C.: Visualization. All authors have read and agreed to the published version of this manuscript.

Funding

This work was supported by the Natural Science Foundation of China under Grant 42501577, the Natural Science Foundation of Jilin Province under Grant 20250102183JC, and the Jilin Province Youth Talent Support Project under Grant QT202401.

Data Availability Statement

The Landsat-SCD dataset is available at https://doi.org/10.6084/m9.figshare.19946135.v1 (accessed on 16 May 2025); the SECOND dataset is available at https://captain-whu.github.io/SCD/ (accessed on 16 May 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships.

Figure 1. A schematic diagram of the SCD task and representative methods: (a) unsupervised analysis; (b) post-classification comparison; (c) direct semantic change detection; (d) integrated classification and change detection.
Figure 2. The overall structure of the proposed MBFI-Net.
Figure 3. Basic structures of (a) residual block, (b) patch merging, and (c) swin transformer blocks.
Figure 4. Basic structures of CAM and SAM.
Figure 5. The basic structure of the BTFI module.
Figure 6. The basic structure of the CTFT module.
Figure 7. The basic structure of the FDE module.
Figure 8. Representative samples in different datasets.
Figure 9. Representative visual SCD results on the SECOND dataset.
Figure 10. Representative visual SCD results on the Landsat-SCD dataset.
Figure 11. The visual SCD results of the BTFI module’s ablation studies.
Figure 12. The visual SCD results of the CTFT module’s ablation studies.
Figure 13. Transition matrices of the SCD results from different methods.
Figure 14. Comparison of SCD accuracy with (a) model computational cost and (b) parameter quantity.
Table 1. SCD accuracies of various methods on the SECOND dataset.
Method | IoU | F1 | SeK | Score
PSPNet | 0.3804 | 0.5511 | 0.0761 | 0.2395
U-Net | 0.4451 | 0.6160 | 0.1145 | 0.2782
DSA-Net | 0.4590 | 0.6292 | 0.1339 | 0.2943
HRSCD-str3 | 0.4939 | 0.6612 | 0.1289 | 0.2939
HRSCD-str4 | 0.5511 | 0.7106 | 0.1821 | 0.3420
ChangeMask | 0.5217 | 0.6857 | 0.1475 | 0.3123
BiSRNet | 0.5599 | 0.7179 | 0.1964 | 0.3546
SCanNet | 0.5591 | 0.7172 | 0.2037 | 0.3591
MLFA-Net | 0.5633 | 0.7206 | 0.2011 | 0.3581
GLAI-Net | 0.5621 | 0.7197 | 0.2063 | 0.3609
MBFI-Net | 0.5690 | 0.7253 | 0.2117 | 0.3667
Table 2. SCD accuracies of various methods on the Landsat-SCD dataset.
Method | IoU | F1 | SeK | Score
PSPNet | 0.5808 | 0.7348 | 0.2345 | 0.3845
U-Net | 0.6335 | 0.7757 | 0.2955 | 0.4373
DSA-Net | 0.7534 | 0.8593 | 0.4685 | 0.5812
HRSCD-str3 | 0.6187 | 0.7644 | 0.2588 | 0.4085
HRSCD-str4 | 0.6563 | 0.7925 | 0.3078 | 0.4496
ChangeMask | 0.6028 | 0.7522 | 0.2460 | 0.3963
BiSRNet | 0.7157 | 0.8343 | 0.4055 | 0.5292
SCanNet | 0.7638 | 0.8661 | 0.4821 | 0.5920
MLFA-Net | 0.7822 | 0.8778 | 0.5182 | 0.6214
GLAI-Net | 0.7850 | 0.8796 | 0.5182 | 0.6216
MBFI-Net | 0.8046 | 0.8917 | 0.5543 | 0.6508
Table 3. The SCD accuracies of ablation studies on the SECOND dataset.
BTFI | CTFT | SAM+CAM | IoU | F1 | SeK | Score
✕ | ✓ | ✓ | 0.5656 | 0.7225 | 0.2073 | 0.3629
✓ | ✕ | ✓ | 0.5610 | 0.7187 | 0.1978 | 0.3550
✓ | ✓ | ✕ | 0.5658 | 0.7227 | 0.2040 | 0.3604
✓ | ✓ | ✓ | 0.5690 | 0.7253 | 0.2117 | 0.3667
Note: ✕ indicates the module is not used; ✓ indicates the module is adopted.
Table 4. The SCD accuracies of ablation studies on the Landsat-SCD dataset.
BTFI | CTFT | SAM+CAM | IoU | F1 | SeK | Score
✕ | ✓ | ✓ | 0.7845 | 0.8793 | 0.5171 | 0.6208
✓ | ✕ | ✓ | 0.7860 | 0.8802 | 0.5173 | 0.6213
✓ | ✓ | ✕ | 0.7565 | 0.8614 | 0.4621 | 0.5768
✓ | ✓ | ✓ | 0.8046 | 0.8917 | 0.5543 | 0.6508
Note: ✕ indicates the module is not used; ✓ indicates the module is adopted.