Communication

A Dual-Modal Mixture-of-Experts Attention U-Net (DMoE-AttU-Net) for Change Detection Using Heterogeneous Optical and SAR Remote Sensing Images

by Seyed Ehsan Khankeshizadeh 1,2, Ali Mohammadzadeh 1, Ali Jamali 3 and Sadegh Jamali 2,*
1 Department of Photogrammetry and Remote Sensing, Geomatics Engineering Faculty, K. N. Toosi University of Technology, Tehran 15433-19967, Iran
2 Department of Technology and Society, Faculty of Engineering, Lund University, P.O. Box 118, 221 00 Lund, Sweden
3 Department of Geography, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(10), 1508; https://doi.org/10.3390/rs18101508
Submission received: 27 March 2026 / Revised: 25 April 2026 / Accepted: 8 May 2026 / Published: 11 May 2026
(This article belongs to the Section Remote Sensing Perspective)

Highlights

What is the main finding?
  • A novel dual-modal architecture (DMoE-AttU-Net) is proposed for heterogeneous optical–SAR change detection, integrating SAR-specific MoE and hierarchical attention mechanisms.
What are the implications of the main findings?
  • Modality-aware design with selective MoE in the SAR branch effectively mitigates speckle noise while preserving complementary optical information.
  • The coordinated use of channel and spatial attention enhances boundary delineation, supporting more reliable multimodal change detection in complex remote sensing scenarios.

Abstract

Binary change detection (BCD) using heterogeneous optical and SAR imagery faces challenges due to modality-specific noise and the lack of adaptive fusion strategies. Existing methods often fail to suppress SAR speckle noise and accurately localize fine boundaries. This study proposes a novel deep architecture, termed Dual-Modal Mixture-of-Experts Attention U-Net (DMoE-AttU-Net), featuring (i) dual-stream encoders for modality-specific feature extraction, (ii) a mixture-of-experts (MoE) module in the SAR stream with a gating network for dynamic fusion, (iii) Squeeze-and-Excitation (SE) and spatial attention mechanisms in the decoder, and (iv) hierarchical skip connections for multi-scale fusion. Unlike existing multimodal change detection frameworks that apply uniform feature fusion, the proposed architecture introduces a modality-aware design in which the MoE mechanism is selectively applied to the SAR stream, enabling adaptive suppression of speckle noise while preserving complementary optical information. These components collectively enhance change localization and reduce noise-induced artifacts. The proposed model achieved a mean IoU of 0.855 and a kappa coefficient of 0.836 on three optical–SAR datasets, outperforming state-of-the-art methods in both accuracy and spatial consistency.

1. Introduction

Binary change detection (BCD), the task of producing binary change maps (BCMs) from bi-temporal satellite images, supports critical remote sensing applications such as land cover monitoring [1], flood monitoring [2,3], and deforestation mapping [4,5]. Optical imagery provides rich spectral cues but is vulnerable to cloud cover, illumination changes, and seasonal effects [6], whereas synthetic aperture radar (SAR) offers weather- and illumination-invariant measurements but introduces speckle noise and modality disparity [7]. The fusion of optical and SAR data therefore presents both opportunities and challenges: while complementary information can improve robustness, differences in geometry, radiometry, and noise characteristics complicate joint modeling [8]. This difficulty mainly stems from the inconsistent imaging mechanisms and statistical characteristics of optical and SAR data.
Deep learning (DL)-based solutions have generally followed two paradigms. The first is an image-level translation method [9], in which one modality (typically SAR) is translated into the domain of the other prior to BCD. Representative studies in this direction include SCCN [10], cGAN [11], and cyclic adversarial translation frameworks [12], such as X-Net and ACE-Net. These methods aim to reduce cross-modal discrepancy by mapping heterogeneous images into a shared or comparable feature space. However, despite their conceptual simplicity and effectiveness, these approaches remain sensitive to the quality of the translated images. In practice, the translation process may introduce artifacts and distort semantic structures, which can propagate errors to the subsequent BCD stage and ultimately degrade performance. The second paradigm involves feature-level fusion or alignment methods [13], in which modality-specific encoders are employed and extracted features are fused or aligned within a shared latent space. Compared to image-level translation approaches, this strategy generally provides improved robustness by directly learning cross-modal representations. However, it still faces challenges in achieving precise boundary localization under significant multimodal discrepancies. Recent studies have shown that conventional weight-shared Siamese architectures have limited capacity to adapt to heterogeneous optical–SAR feature distributions, motivating the development of multimodal frameworks with more explicit modality-aware processing mechanisms [14]. Standard encoder–decoder models (e.g., U-Net, U-Net++, Siamese variants such as FC-Siam-Diff) are widely adopted for change detection in homogeneous data due to their ability to capture multiscale contextual information [15,16]. However, when extended to heterogeneous optical–SAR scenarios, these models lack dedicated mechanisms for modality-specific adaptation and effective noise suppression, particularly in the presence of SAR speckle. This limitation has also been highlighted in recent cross-modal change detection studies based on convolutional neural networks [17].
This work is positioned within recent advances in multimodal change detection (MMCD). Studies in this area have explored different strategies to address cross-modal discrepancies in optical–SAR data. In one direction, the M2CD framework [14] integrates Mixture-of-Experts (MoE) modules within the backbone and leverages self-distillation to reduce cross-modal discrepancies, yielding significant gains in optical–SAR change detection tasks. Meanwhile, deep translation-based frameworks such as DTCDN have demonstrated that learning cross-domain mappings can help reduce modality gaps by transforming heterogeneous data into a comparable representation space [8]. In addition, fully transformer-based methods like TransY-Net [18] adopt global modeling and pyramid-level feature aggregation to better capture spatial context and complex change patterns. Despite these advances, effectively balancing modality-specific feature learning, noise robustness, and precise spatial localization remains an open challenge in heterogeneous BCM generation.
The proposed solution directly addresses the aforementioned challenges by introducing a modality-aware and role-specific integration of expert-based modulation and hierarchical attention within a dual-stream framework, rather than directly combining existing modules. A Dual-Modal MoE Attention U-Net (DMoE-AttU-Net) is thus introduced to reconcile modality discrepancies and sharpen spatial precision. Specifically, separate encoder branches process optical and SAR inputs independently, ensuring decoupled representation learning and preventing cross-modal feature interference. Within the SAR branch, a gating network dynamically routes features through multiple convolutional experts, enabling input-adaptive feature refinement specifically tailored to suppress speckle noise and handle SAR-specific variability. In contrast to prior MoE-based approaches that apply expert fusion uniformly across modalities, the proposed design selectively deploys the MoE mechanism within the SAR stream, where modality-specific noise is most prominent. Concurrently, Squeeze-and-Excitation (SE) channel attention modules and additive spatial attention gates are jointly employed in the decoder, forming a coordinated attention mechanism that simultaneously enhances channel discrimination and spatial localization. This synergy enables more accurate identification of change-relevant regions while reducing false alarms. Furthermore, multi-level skip connections are incorporated to fuse fine spatial details with high-level semantic representations, facilitating consistent boundary preservation and improving sensitivity to subtle and small-scale changes.

2. Materials and Methods

2.1. Proposed DMoE-AttU-Net Architecture

The proposed DMoE-AttU-Net extends the baseline Attention U-Net for dual-modal change detection using SAR and optical imagery (Figure 1). Rather than directly combining existing modules, the architecture is designed with a modality-aware structure in which each component is assigned a specific functional role to address cross-modal inconsistencies. It features a dual-stream encoder that extracts modality-specific features, ensuring independent representation learning for optical and SAR data and reducing feature interference between modalities. A MoE module is incorporated in the SAR branch to adaptively fuse multiple CNN experts via a gating network, mitigating SAR noise and variability. This selective integration allows the model to focus expert-based refinement on the modality most affected by speckle noise, instead of applying uniform processing across streams. Each expert module also incorporates SE attention for enhanced channel discrimination, while attention gates in the decoder guide feature fusion during upsampling. These attention mechanisms operate jointly to refine both channel importance and spatial relevance during decoding. Furthermore, skip connections merge low-level encoder features with high-level decoder representations to preserve spatial detail and semantic context. Through this coordinated interaction between dual-stream encoding, expert-based modulation, and hierarchical attention, the architecture improves feature alignment across modalities and enhances sensitivity to subtle change patterns. These components collectively improve segmentation accuracy and generalization in heterogeneous remote sensing scenarios. Formally, the proposed architecture can be described as follows. As seen in Figure 1, the input tensors to the developed segmentation architecture are $x_{SAR} \in \mathbb{R}^{B \times 1 \times H \times W}$ and $x_{optical} \in \mathbb{R}^{B \times 3 \times H \times W}$, where $B$ is the batch size, 1 and 3 denote the number of channels in the SAR and optical data, respectively, and $H = W = 256$ are the input height and width. The output map is $Y \in \mathbb{R}^{B \times 2 \times H \times W}$, where 2 corresponds to the binary segmentation classes (unchanged and changed regions). The optical stream passes through two stacked residual blocks, expressed as:
$R(x) = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}\left(\mathrm{Dropout}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}(x)\right)\right)\right)\right)\right) + x\right)$
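For illustration, a minimal PyTorch sketch of this residual block is given below; the kernel sizes, dropout rate, and the 1 × 1 projection on the identity path are implementation assumptions rather than details specified in the text:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """R(x) = ReLU(BN(Conv(Dropout(ReLU(BN(Conv(x)))))) + x), with a 1x1
    projection on the identity path when the channel count changes."""
    def __init__(self, in_ch: int, out_ch: int, p_drop: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = (nn.Conv2d(in_ch, out_ch, kernel_size=1)
                     if in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + self.skip(x))
```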
The first residual block $R_1$ transforms the input optical data into low-level optical features $x_1 = R_1(x_{optical}) \in \mathbb{R}^{B \times 64 \times H \times W}$. A 2 × 2 max pooling (MP) operation reduces the spatial resolution, producing $p_1 = \mathrm{MP}(x_1) \in \mathbb{R}^{B \times 64 \times \frac{H}{2} \times \frac{W}{2}}$. This is followed by the second residual block $R_2$, which captures mid-level optical features $x_2 = R_2(p_1) \in \mathbb{R}^{B \times 128 \times \frac{H}{2} \times \frac{W}{2}}$, and another max pooling operation, resulting in $p_2 = \mathrm{MP}(x_2) \in \mathbb{R}^{B \times 128 \times \frac{H}{4} \times \frac{W}{4}}$. Here, $x_1$ carries fine-detailed spatial information, while $x_2$ encodes mid-level semantic information abstracted from a larger receptive field; both representations are preserved for skip connections to facilitate hierarchical fusion during decoding. In parallel, a carefully designed SAR encoder module extracts noise-resilient features from the SAR input by combining a shared feature extractor with a MoE mechanism. The shared block first captures general patterns using convolutional layers, ReLU activations, and max pooling. Its output is then fed into multiple parallel CNN expert networks, each learning complementary SAR-specific feature representations. These expert outputs are refined through an SE attention mechanism to emphasize the most influential channels. The structure of the SE module is illustrated in Figure 2; it is embedded within each CNN expert in the SAR branch of the proposed architecture (Figure 1). A gating network dynamically fuses the expert outputs based on the shared features, allowing the model to adaptively prioritize experts and improving representation quality in noisy or heterogeneous SAR regions. In effect, the MoE module in the SAR stream acts as a dynamic feature fusion mechanism that improves the model’s adaptability to varying SAR characteristics: the parallel experts extract distinct feature representations from the shared feature maps, the gating network computes input-dependent weights from the global information of the shared SAR features, and these weights form a weighted sum of expert outputs, enabling the network to focus on the most informative experts for each SAR input. This adaptive fusion enhances the extraction of vital SAR feature characteristics and ensures more discriminative, context-aware SAR feature learning.
In more detail, the SAR input is processed through the shared encoder block, consisting of two convolutional layers with ReLU activations, followed by max pooling and dropout, expressed as:
$f_{shared} = \mathrm{Dropout}\left(\mathrm{MP}\left(\mathrm{ReLU}\left(\mathrm{Conv}\left(\mathrm{ReLU}\left(\mathrm{Conv}(x_{SAR})\right)\right)\right)\right)\right) \in \mathbb{R}^{B \times 64 \times \frac{H}{4} \times \frac{W}{4}}$
This shared SAR representation is fed to a set of $E = 3$ parallel CNN experts. Each expert receives $f_{shared}$, produces an intermediate feature map $f_e$, and refines it with the SE module to obtain $\tilde{f}_e$. The SE mechanism, depicted in Figure 2 and applied within each CNN expert shown in Figure 1, uses global average pooling ($\mathrm{GAP}$) followed by two fully connected layers and a sigmoid activation ($\sigma$) to generate channel-wise attention weights:
$s_e = \sigma\left(W_2^e \cdot \mathrm{ReLU}\left(W_1^e \cdot \mathrm{GAP}(f_e)\right)\right), \qquad \tilde{f}_e = s_e \cdot f_e \in \mathbb{R}^{B \times 64 \times \frac{H}{4} \times \frac{W}{4}}$
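A minimal PyTorch sketch of this SE recalibration follows; the reduction ratio of 16 is an assumption (not specified in the text):

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation: GAP -> W1 -> ReLU -> W2 -> sigmoid -> rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        s = self.fc(f.mean(dim=(2, 3)))   # squeeze (GAP), then excitation weights s_e
        return f * s.view(b, c, 1, 1)     # channel-wise rescaling: f~_e = s_e * f_e
```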
Afterward, the expert outputs are stacked, resulting in $F = [\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_E] \in \mathbb{R}^{B \times E \times 64 \times \frac{H}{4} \times \frac{W}{4}}$. To compute input-dependent expert weights, the shared SAR feature map is first flattened, producing $z = \mathrm{flatten}(f_{shared}) \in \mathbb{R}^{B \times D}$. The flattened features are fed to the gating network, which applies two fully connected layers with dropout and ReLU, followed by Softmax, to produce the expert weights:
$\alpha = \mathrm{Softmax}\left(W_2 \cdot \mathrm{Dropout}\left(\mathrm{ReLU}\left(W_1 \cdot z\right)\right)\right) \in \mathbb{R}^{B \times E}$
These weights are then used to compute the output of the MoE module:
$f_{MoE} = \sum_{e=1}^{E} \alpha_e \cdot \tilde{f}_e \in \mathbb{R}^{B \times 64 \times \frac{H}{4} \times \frac{W}{4}}$
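The gating and fusion steps above can be sketched in PyTorch as follows, reusing the SEAttention class from the earlier sketch; the expert internals, gating width, and dropout rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SARMixtureOfExperts(nn.Module):
    """Gated fusion of E parallel CNN experts over the shared SAR features."""
    def __init__(self, channels: int = 64, n_experts: int = 3,
                 gate_hidden: int = 128, spatial: int = 64):  # spatial = H/4 for 256x256 inputs
        super().__init__()
        # Each expert: conv + ReLU + SE recalibration (internals are illustrative).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                SEAttention(channels),
            )
            for _ in range(n_experts)
        ])
        d = channels * spatial * spatial    # flattened dimension D of f_shared
        self.gate = nn.Sequential(          # W1 -> ReLU -> Dropout -> W2 -> Softmax
            nn.Linear(d, gate_hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(gate_hidden, n_experts),
            nn.Softmax(dim=1),
        )

    def forward(self, f_shared: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(f_shared.flatten(1))                       # (B, E) expert weights
        stack = torch.stack([e(f_shared) for e in self.experts], 1)  # (B, E, C, H', W')
        return (alpha[:, :, None, None, None] * stack).sum(dim=1)    # weighted sum f_MoE
```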
Then, the processed SAR and optical feature maps are concatenated to form $x_{joint} = \mathrm{Concat}(p_2, f_{MoE}) \in \mathbb{R}^{B \times (128+64) \times \frac{H}{4} \times \frac{W}{4}}$. The resulting feature maps are fed to the residual bottleneck block, $x_{bottleneck} = R(x_{joint}) \in \mathbb{R}^{B \times 256 \times \frac{H}{4} \times \frac{W}{4}}$. In the decoder, $x_{bottleneck}$ is upsampled using a 2 × 2 transposed convolution, $x_3 = \mathrm{ConvTranspose}_{2 \times 2}(x_{bottleneck}) \in \mathbb{R}^{B \times 128 \times \frac{H}{2} \times \frac{W}{2}}$. To improve feature selection and fusion between encoder and decoder features, an additive attention gate module is employed. The attention gate selectively filters encoder features before fusion with decoder features, enabling the architecture to focus on spatially important regions. It receives a gating signal from the decoder and a skip-connection feature from the encoder, both of which are projected to a common intermediate space. The attention coefficients are computed by element-wise addition followed by ReLU and a sigmoid activation, producing a spatial attention mask. This mask modulates the encoder features so that only the most relevant information passes through, improving feature extraction and localization accuracy during decoding. Specifically, the first attention gate focuses on the most important encoder features in $x_2$, expressed as:
$\psi_1 = \sigma\left(\mathrm{Conv}\left(\mathrm{ReLU}\left(W_g x_3 + W_x x_2\right)\right)\right), \qquad \tilde{x}_2 = x_2 \cdot \psi_1$
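A minimal sketch of this additive attention gate is shown below, under the assumption that the gating signal and skip feature share the same spatial size (as is the case after the transposed convolution):

```python
import torch
import torch.nn as nn

class AdditiveAttentionGate(nn.Module):
    """psi = sigmoid(Conv(ReLU(W_g g + W_x x))); returns x modulated by psi."""
    def __init__(self, gate_ch: int, skip_ch: int, inter_ch: int):
        super().__init__()
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)   # W_g projection
        self.w_x = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)   # W_x projection
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, g: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        mask = self.psi(torch.relu(self.w_g(g) + self.w_x(x)))   # (B, 1, H, W) in (0, 1)
        return x * mask                                          # filtered skip feature

# e.g., the first gate: x2_filtered = AdditiveAttentionGate(128, 128, 64)(x3, x2)
```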
Then, $x_3$ and $\tilde{x}_2$ are concatenated to obtain $x_4$. The resulting feature maps $x_4$ are then upsampled to match the resolution of $x_1$, followed by a second attention gate, estimated as:
$\psi_2 = \sigma\left(\mathrm{Conv}\left(\mathrm{ReLU}\left(W_g x_4 + W_x x_1\right)\right)\right), \qquad \tilde{x}_1 = x_1 \cdot \psi_2$
Afterward, $\tilde{x}_1$ and $x_4$ are concatenated and fed to a residual block to obtain $x_5 \in \mathbb{R}^{B \times 64 \times H \times W}$. The final BCM, distinguishing unchanged and changed regions, is obtained by a 1 × 1 convolutional layer, defined by:
$Y_{ChangeMap} = \mathrm{Conv}_{1 \times 1}(x_5) \in \mathbb{R}^{B \times 2 \times H \times W}$

2.2. Dataset Description

Three heterogeneous (multi-modal) RS datasets comprising bi-temporal optical and SAR images are utilized to evaluate the performance of the proposed DMoE-AttU-Net. Each dataset consists of a pre-event optical image and a post-event SAR image, accompanied by manually annotated BCMs. These datasets primarily focus on flood-affected areas, capturing various flooding scenarios such as river overflow and inundation under diverse environmental and imaging conditions. They vary in spatial resolution (ranging from 0.65 to 25 m) and in image dimensions (ranging from 554 to 2325 pixels in either dimension). Table 1 summarizes the key characteristics of the datasets, and representative samples of the inputs and corresponding reference labels are illustrated in Figure 3.
  • The first dataset (Figure 3a) covers flood-affected areas in California, USA [19], including parts of Sacramento, Yuba, and Sutter Counties. It integrates a multispectral optical image from Landsat-8 and a multi-polarized SAR image from Sentinel-1A with a spatial resolution of 15 m. The SAR input includes VV, VH, and their intensity ratio as channels. The reference change map highlights land cover transitions associated with flood events and was generated by leveraging auxiliary SAR observations acquired during the same time period. The dataset exhibits significant class imbalance, with a changed-to-unchanged pixel ratio of approximately 1:23.
  • The second dataset (Figure 3b), referred to as Gloucester I [20], focuses on an urban area in the United Kingdom (UK). It combines high-resolution optical imagery from QuickBird-2 with SAR data from TerraSAR-X acquired in StripMap mode with HH polarization and a spatial resolution of 0.65 m. The dataset captures complex urban changes related to flood dynamics. Ground truth annotations were manually produced by domain experts based on pre- and post-event visual interpretation and ancillary information. The ratio of changed to unchanged pixels in this dataset is approximately 1:7.
  • The third dataset (Figure 3c), Gloucester II [20], also corresponds to a region in Gloucester, UK, and contains earlier-generation satellite data. It includes SPOT optical imagery and ERS-1 SAR data with a spatial resolution of 25 m, captured before and after a historical flood event. Despite its lower resolution and limited spectral diversity, this dataset provides a valuable benchmark for evaluating the robustness of change detection methods under challenging conditions. Ground truth labels were manually delineated to reflect flood-induced land cover changes.

2.3. Experimental Settings and Implementation Details

For training, both optical and SAR images, along with binary labels, were divided into non-overlapping 256 × 256 patches, balancing spatial context and GPU memory constraints. The data were randomly split into training, validation, and test sets with a ratio of 8:1:1. To mitigate overfitting, data augmentation was applied only to training samples containing more than 10% change pixels. Augmentations included random flips, 90°/270° rotations, SAR-specific Gaussian noise, and optical-only color jitter (contrast, brightness, saturation). Although this targeted augmentation may introduce a slight bias toward higher change ratios, it helps prevent the model from being dominated by background samples and empirically improves convergence and detection performance. The model was optimized using a weighted combination of Intersection-over-Union (IoU), Dice, and Focal loss functions, defined as:
$L = \alpha L_{IoU} + \beta L_{Dice} + \gamma L_{Focal}$
where α = 0.3, β = 0.5, and γ = 0.2. The weights were empirically determined to balance boundary accuracy ($L_{Dice}$), region overlap ($L_{IoU}$), and hard-sample emphasis ($L_{Focal}$).
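For illustration, a minimal PyTorch sketch of this hybrid loss is given below; soft (probability-based) IoU and Dice terms on the change class and a standard focal term with γ = 2 are assumptions, since the exact formulations are not specified here:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  alpha: float = 0.3, beta: float = 0.5, gamma_w: float = 0.2,
                  focal_gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """L = alpha*L_IoU + beta*L_Dice + gamma*L_Focal.
    logits: (B, 2, H, W); target: (B, H, W) with values in {0, 1}."""
    prob = torch.softmax(logits, dim=1)[:, 1]                      # P(change) per pixel
    t = target.float()
    inter = (prob * t).sum(dim=(1, 2))
    union = (prob + t - prob * t).sum(dim=(1, 2))
    l_iou = 1.0 - ((inter + eps) / (union + eps)).mean()           # soft IoU loss
    l_dice = 1.0 - ((2 * inter + eps) /
                    ((prob + t).sum(dim=(1, 2)) + eps)).mean()     # soft Dice loss
    ce = F.cross_entropy(logits, target.long(), reduction="none")  # per-pixel CE
    pt = torch.exp(-ce)                                            # prob of the true class
    l_focal = (((1.0 - pt) ** focal_gamma) * ce).mean()            # focal loss
    return alpha * l_iou + beta * l_dice + gamma_w * l_focal
```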
No explicit class-specific weighting was applied; instead, class imbalance was handled implicitly by computing IoU and Dice losses on the change class and incorporating Focal loss to emphasize hard-to-classify samples. This hybrid loss formulation enables complementary optimization of region consistency and boundary precision, which is particularly effective under severe class imbalance [21,22]. Optimization was performed using the AdamW optimizer (learning rate = 0.001, weight decay = 1 × 10−5). The learning rate was adaptively adjusted using a plateau-based scheduling strategy, with a patience of 3 epochs and a minimum learning rate of 1 × 10−6. Training ran up to 400 epochs with early stopping after 10 stagnant epochs. Validation was performed every 5 epochs.
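The optimization setup described above corresponds to the following sketch; `model`, `train_one_epoch`, and `evaluate` are hypothetical helpers, and the LR reduction factor is an assumption:

```python
import torch

# AdamW with plateau-based LR scheduling (patience 3, min LR 1e-6).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3, min_lr=1e-6)

best_val, last_improve = float("inf"), 0
for epoch in range(400):                      # up to 400 epochs
    train_one_epoch(model, optimizer)         # hypothetical training helper
    if (epoch + 1) % 5 == 0:                  # validate every 5 epochs
        val_loss = evaluate(model)            # hypothetical validation helper
        scheduler.step(val_loss)
        if val_loss < best_val:
            best_val, last_improve = val_loss, epoch
        elif epoch - last_improve >= 10:      # early stopping after 10 stagnant epochs
            break
```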
Model performance was measured on the test set using Precision, Recall, F1-score, IoU, Kappa coefficient (KC), mean Intersection over Union (mIoU), and overall accuracy (OA) metrics. Given the severe class imbalance in the datasets, particular emphasis is placed on change-sensitive metrics such as F1-score, IoU, and mIoU, while OA is reported for completeness, as it may be biased toward the dominant background class. To ensure robustness, the proposed model was evaluated over three independent runs using different random seeds, and the results are reported as the mean ± standard deviation.
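A compact sketch of the reported metrics computed from the binary confusion matrix is given below; mIoU is assumed to be the mean of the background and change IoUs (consistent with Table 2), and degenerate cases (empty classes) are not handled:

```python
import numpy as np

def binary_cd_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Metrics from the binary confusion matrix (1 = change, 0 = background)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_change = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fp + fn)
    oa = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)  # chance agreement
    return {"Precision": precision, "Recall": recall, "F1": f1,
            "IoU_change": iou_change, "IoU_bg": iou_bg,
            "mIoU": (iou_change + iou_bg) / 2,
            "KC": (oa - pe) / (1 - pe), "OA": oa}
```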
To ensure a fair comparison under the same heterogeneous input conditions, all baseline methods were re-implemented and evaluated under the same data split, patch size, optimization settings, and augmentation strategy whenever applicable. Since several compared architectures were not originally developed for heterogeneous optical–SAR change detection, they were adapted to the present setting through a unified early-fusion strategy. Specifically, the optical and SAR inputs were concatenated channel-wise to form a common multimodal input representation. For Siamese-based baselines, the two temporal inputs were preserved in their original dual-branch formulation, while modality information was incorporated through corresponding optical–SAR image pairs. All experiments were implemented in the PyTorch framework (version 2.10) on Google Colab Pro+ (NVIDIA Tesla A100 GPU, 40 GB VRAM, 52 GB RAM, 500 compute units).
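The early-fusion adaptation amounts to a channel-wise concatenation, e.g. (tensor contents below are illustrative):

```python
import torch

optical = torch.rand(8, 3, 256, 256)      # pre-event optical patches (B, 3, H, W)
sar = torch.rand(8, 1, 256, 256)          # post-event SAR patches (B, 1, H, W)
fused = torch.cat([optical, sar], dim=1)  # (B, 4, H, W) multimodal baseline input
```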

3. Experimental Results and Discussion

To validate the effectiveness of the proposed DMoE-AttU-Net, comparisons were carried out against several widely adopted baseline architectures in RS-BCD: U-Net [23], Attention U-Net (AttU-Net) [24], Transformer U-Net (TransU-Net) [25], DeepLabV3 [26], Siam-NestedU-Net [27], FC-Siam-Diff [15], and SiamCRNN [28]. In addition to the aforementioned baselines, a recent state-of-the-art MMCD framework, M2CD [14], is included to enable a more comprehensive evaluation of recent multimodal approaches. M2CD is an optical–SAR change detection model that integrates MoE modules within its backbone to effectively address cross-modal feature distribution differences. Furthermore, it introduces an optical-to-SAR path (O2SP) combined with a self-distillation strategy to mitigate modality discrepancies during training. The model has demonstrated strong performance on large-scale multimodal datasets and serves as a competitive benchmark among recent MoE-based approaches. In the present study, the evaluation includes both quantitative metrics and qualitative analyses across three heterogeneous optical–SAR datasets: Gloucester I, California, and Gloucester II. It should be noted that all baseline models were consistently adapted to the heterogeneous optical–SAR setting, so that the reported differences primarily reflect architectural capability rather than variations in training protocol or input preparation.

3.1. Visual Comparison

Figure 4 presents qualitative results across the three datasets. The visual performance varies across models and scenes. In the Gloucester I dataset, which predominantly covers agricultural areas with intricate water–land interfaces, TransU-Net offers more stable change localization but fails to preserve fine geometric boundaries along riverbanks due to limited spatial bias. The DMoE-AttU-Net better captures these narrow and irregular water boundaries by exploiting dual-modal representations and attention-guided fusion. This improves discrimination in areas where complex water bodies intersect with heterogeneous land parcels. In the California dataset, U-Net and DeepLabV3 struggle to capture fragmented inundation zones, particularly in the presence of SAR speckle noise, leading to missed detections. In contrast, the DMoE-AttU-Net more accurately delineates flood boundaries by leveraging the MoE module’s adaptive SAR expert fusion, which suppresses speckle variability, while SE attention enhances the model’s sensitivity to subtle water-related changes, particularly in fragmented agricultural parcels and mixed land cover regions. In Gloucester II, where changes predominantly follow narrow, elongated patterns such as riverbank shifts and channel expansions, baseline models exhibit fragmented predictions and poor continuity. The DMoE-AttU-Net maintains spatial coherence along these elongated structures, aided by hierarchical skip connections and adaptive attention gating, which promote consistent edge preservation across multiple scales.
A closer inspection of error cases (highlighted in Figure 4) reveals that false positives mainly occur in regions with pronounced radiometric differences in SAR imagery, such as water boundaries or shadowed areas, where speckle noise may resemble change patterns. Conversely, false negatives are primarily observed in small-scale or low-contrast changes, where spectral differences between temporal instances are limited. These observations highlight the inherent difficulty of heterogeneous change detection and explain part of the remaining performance gap. The performance variations across datasets can be attributed to differences in scene characteristics and change patterns. For instance, the California dataset presents a more severe class imbalance and complex background structures, thereby making change detection more challenging. In contrast, the Gloucester datasets contain more structured flood patterns, which are relatively easier to capture. This explains the observed differences in performance across datasets and further emphasizes the importance of robust multimodal feature learning.

3.2. Quantitative Comparison

As reported in Table 2, the proposed DMoE-AttU-Net achieved the highest overall performance among all evaluated DL architectures. It achieved a KC of 0.836, a mIoU of 0.855, and an OA of 0.967, establishing strong performance in heterogeneous optical–SAR BCD. Compared with the strongest baseline, SiamCRNN (KC = 0.818, mIoU = 0.841, OA = 0.962), these results correspond to relative improvements of +2.2% in KC, +1.7% in mIoU, and +0.5% in OA, respectively.
A direct comparison with the recent M2CD framework further highlights the effectiveness of the proposed design. As shown in Table 2, M2CD achieves a KC of 0.762, mIoU of 0.800, and OA of 0.961, indicating strong performance among multimodal methods. However, the proposed DMoE-AttU-Net surpasses M2CD with improvements of +7.4% in KC, +5.5% in mIoU, and +0.6% in OA. In addition, the proposed model achieves a higher change-class IoU (0.747 vs. 0.643) and F1-score (0.855 vs. 0.782), suggesting more accurate delineation of change regions. These improvements can be attributed to key architectural differences between the two approaches. In M2CD, MoE modules are integrated into a shared backbone, and an optical-to-SAR self-distillation path (O2SP) is used to align cross-modal features during training. In contrast, the proposed DMoE-AttU-Net employs a dual-stream design that explicitly separates optical and SAR feature extraction. Moreover, the MoE module is selectively applied only in the SAR branch to specifically address speckle noise and SAR variability, rather than uniformly across modalities. Importantly, the observed performance gains are not merely due to the inclusion of individual modules, but arise from their coordinated and modality-aware integration. In contrast to prior approaches where MoE, attention, or skip connections are employed in a largely independent or uniform manner, the proposed framework assigns each component a complementary functional role. The SAR-specific MoE module adaptively mitigates speckle noise, while the dual-stream design prevents feature interference between modalities. Simultaneously, the joint use of SE channel attention and spatial attention gates enables concurrent refinement of channel importance and spatial relevance, leading to more precise boundary delineation—an aspect not explicitly addressed in M2CD. This structured interaction between components results in a more targeted handling of modality-specific challenges and improved spatial consistency in the predicted change maps, thereby constituting the primary novelty of the proposed architecture.
A closer inspection of class-wise metrics reveals further insights. Early encoder–decoder architectures such as U-Net, AttU-Net, DeepLabV3, and TransU-Net follow a unified feature extraction paradigm that lacks explicit mechanisms to address cross-modal discrepancies. As a result, while these models achieve high background IoUs (~0.947–0.953) due to the dominance of unchanged pixels, they exhibit limited sensitivity to change regions, with IoUs constrained to 0.582–0.622. This limitation primarily stems from their inability to effectively suppress SAR-specific noise and align heterogeneous feature distributions across modalities. The introduction of the M2CD framework partially alleviates these issues by incorporating mixture-of-experts within the backbone and leveraging a self-distillation strategy to reduce the modality gap during training. This leads to improved BCD performance (IoU = 0.643, F1-score = 0.782), indicating enhanced cross-modal representation learning compared to earlier unified architectures. However, the shared backbone design of M2CD still enforces a partially coupled feature space, which may limit its ability to fully capture modality-specific characteristics, particularly for fine-grained boundary delineation. Subsequent Siamese-based architectures, such as Siam-NestedU-Net, further improve change localization by explicitly modeling bi-temporal differences and preserving hierarchical feature interactions. This leads to a notable increase in change IoU (0.671), reflecting improved sensitivity to structural changes. Nevertheless, the absence of dedicated mechanisms for handling SAR noise and modality-specific variability constrains their performance under heterogeneous conditions. Other advanced baselines further reduced this gap: FC-Siam-Diff reached a change IoU of 0.692, and SiamCRNN attained 0.724 together with the highest recall for change pixels (0.916). These models benefit from enhanced temporal feature differencing and recurrent or multi-level feature fusion strategies, which improve the detection of complex and fragmented change patterns. However, their improvements are primarily driven by temporal modeling rather than modality-aware feature adaptation, leaving residual challenges in noisy SAR regions and subtle boundary localization. In comparison, the DMoE-AttU-Net further advanced the state of the art by reaching a change-class IoU of 0.747 and an F1-score of 0.855, translating into relative gains of +7.6% over Siam-NestedU-Net, +5.5% over FC-Siam-Diff, and +3.2% over SiamCRNN, as well as a substantial improvement of +16.2% over M2CD in terms of change IoU. It should be noted that certain baselines performed particularly well in specific metrics: for instance, SiamCRNN exhibited the highest precision for background pixels (0.989), while Siam-NestedU-Net achieved the best recall for background classification (0.995). These strengths, however, were not maintained consistently across change-related measures. In contrast, the DMoE-AttU-Net achieved a more balanced performance, combining strong background classification with superior delineation of change boundaries.
The observed improvements suggest that, beyond temporal differencing, effective multimodal change detection requires explicit modeling of modality-specific characteristics. The proposed architecture addresses this by decoupling feature extraction through a dual-stream design and introducing a SAR-specific MoE module that adaptively mitigates speckle noise, while attention-guided decoding enhances spatial selectivity. The quantitative evidence indicates that the proposed architecture in the present communication offers incremental but consistent improvements over the best-performing baselines, especially in metrics directly linked to change localization. Although the numerical improvements over the strongest baselines may appear moderate, it should be noted that performance in heterogeneous optical–SAR change detection has reached a relatively saturated regime, where even small gains in mIoU or KC correspond to meaningful improvements in accurately delineating change boundaries. These gains are attributed to three core design elements: (i) the MoE module in the SAR stream, which adaptively mitigates speckle noise; (ii) the combination of channel- and spatial-attention mechanisms, which selectively enhance change-relevant cues; and (iii) the use of multi-scale skip connections, which ensure better boundary preservation through semantic–spatial fusion.

3.3. Computational Complexity Analysis

To provide additional insight into the computational characteristics of the proposed model, a comparative analysis of model complexity is presented in Table 3. The comparison includes the number of trainable parameters, approximate FLOPs, and inference time per image. As reported in Table 3, the proposed DMoE-AttU-Net incurs higher computational cost compared to lightweight baseline models such as FC-Siam-Diff and U-Net, primarily due to the integration of dual-stream encoders, mixture-of-experts (MoE) modules, and attention mechanisms. However, this increased complexity is accompanied by consistent improvements in detection performance, indicating a favorable accuracy–efficiency trade-off.
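For reference, parameter counts and inference times of the kind reported in Table 3 can be measured with a sketch like the following; it assumes a CUDA device and a single fused 4-channel input as used for the baselines (the dual-stream proposed model would take an optical–SAR pair instead):

```python
import time
import torch

def profile_model(model: torch.nn.Module, input_shape=(1, 4, 256, 256),
                  n_runs: int = 50) -> tuple:
    """Trainable parameters (millions) and mean GPU inference time (ms)."""
    params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
    model = model.to("cuda").eval()
    x = torch.rand(*input_shape, device="cuda")
    with torch.no_grad():
        for _ in range(10):                # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        torch.cuda.synchronize()
    ms = (time.perf_counter() - start) * 1000 / n_runs
    return params_m, ms
```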
It should be noted that the primary objective of this work is to enhance change detection accuracy in challenging heterogeneous scenarios. Accordingly, the proposed architecture prioritizes representational capacity and robustness over strict computational efficiency constraints. Nevertheless, the reported inference times indicate that the model remains practical for deployment on modern GPU hardware, particularly in scenarios where accuracy is prioritized. This limitation and its implications are further discussed in Section 3.4.

3.4. Ablation Experiments

An ablation study was conducted to evaluate the contribution of each component in the proposed architecture, including the SAR encoder, attention gates, SE attention, and the MoE mechanism (Table 4). To isolate their individual effects, multiple model variants were constructed by removing one component at a time while keeping all other components unchanged. Performance was evaluated using IoU for background and change classes, mIoU, KC, and OA.
The results demonstrate that the SAR encoder is the most critical component of the model. Its removal leads to a substantial degradation in performance, particularly for the change class, where IoU decreases from 0.747 to 0.604, as reported in Table 4. This highlights the importance of structured SAR feature extraction for capturing meaningful change patterns and suppressing the impact of inherent SAR noise. Similarly, removing the MoE mechanism results in a notable performance drop (mIoU decreases from 0.855 to 0.797), indicating that adaptive expert fusion improves robustness to SAR variability and noise.
The contribution of attention mechanisms is also significant, but their impact is strongly dependent on the presence of the SAR encoder. When the SAR encoder is absent, the performance drop is primarily governed by the lack of meaningful feature representations, which limits the effectiveness of attention-based modules. In contrast, when the SAR encoder is retained, removing attention gates leads to a more noticeable drop in performance (mIoU decreases from 0.855 to 0.779). This indicates that attention gates are most effective when operating on structured SAR features, suggesting a synergistic interaction between the SAR encoder and spatial attention in refining skip connection features. Similarly, removing SE attention reduces mIoU to 0.819, confirming that channel-wise feature recalibration enhances representation quality. Overall, the full model, which integrates all components, achieves the best performance across all evaluation metrics (mIoU = 0.855, KC = 0.836, OA = 0.967), demonstrating that these modules provide complementary benefits and collectively improve change detection accuracy.
In addition to the above ablation experiments, the sensitivity of the model to the number of MoE experts was further analyzed, as illustrated in Figure 5. It can be observed that increasing the number of experts from one to three leads to a consistent improvement in detection performance, with mIoU rising from 0.829 to 0.855. However, further increasing the number of experts to four does not yield additional gains and instead results in a slight performance degradation, while the computational cost, measured in terms of FLOPs, continues to increase. This behavior indicates diminishing returns and suggests that excessive experts may introduce redundancy without contributing to meaningful feature diversity. Therefore, an optimal trade-off between accuracy and computational efficiency is achieved when three experts are employed. These findings confirm that the number of experts is a critical design parameter that significantly influences both model performance and computational complexity.

3.5. Limitations and Future Research Directions

Despite the promising results, several limitations should be acknowledged. First, the datasets used in this study exhibit class imbalance, particularly in the California dataset (1:23 change-to-unchanged ratio). Although a combination of IoU and Dice losses was employed to mitigate this issue, no explicit class-specific weighting was applied. In addition, the data augmentation strategy was selectively applied to patches with more than 10% change pixels, which may introduce a degree of sampling bias and potentially affect generalization to regions with sparse or subtle changes.
Second, the experimental evaluation is limited to three heterogeneous optical–SAR datasets, two of which (Gloucester I and II) originate from the same geographic region, and all focus on flood-related changes. This is partly due to the limited availability of well-annotated multimodal (optical–SAR) benchmark datasets and computational constraints associated with training complex multimodal architectures within the scope of this communication. Consequently, the generalization of the proposed method to other types of changes (e.g., urban expansion, deforestation, or landslides) has not been fully validated.
Additionally, the proposed architecture introduces increased computational complexity and memory requirements due to the integration of dual-stream encoders, MoE modules, and attention mechanisms. This results in a higher number of parameters and computational cost compared to lightweight models such as FC-Siam-Diff and M2CD, which may limit scalability and deployment in resource-constrained environments.
In light of these limitations, future research will focus on improving both the robustness and generalization capability of the proposed framework in a structured manner. Specifically, addressing class imbalance through class-balanced loss functions and more generalized data sampling strategies will be essential to enhance performance in regions with sparse or subtle changes. Furthermore, extending the evaluation to more diverse benchmark datasets—including homogeneous datasets such as LEVIR-CD and WHU-CD, as well as heterogeneous scenarios involving a broader range of land-cover changes—will be necessary to validate the generalization of the model beyond flood-related events. In parallel, reducing computational complexity through the development of lightweight architectures and more efficient expert selection mechanisms will be critical for practical deployment in resource-constrained environments. Finally, exploring hybrid architectures, such as integrating transformer-based modules or leveraging semi/self-supervised pretraining, represents a promising direction to further enhance cross-modal feature representation and improve performance in complex change detection scenarios.

4. Conclusions

In this study, a novel dual-stream architecture named DMoE-AttU-Net was proposed to address the challenges inherent in heterogeneous optical–SAR binary change detection. By combining mixture-of-experts routing in the SAR stream with hierarchical attention mechanisms and multi-scale fusion, the method offers enhanced robustness to noise and improved delineation of change boundaries. Extensive experiments across diverse datasets demonstrated consistent superiority over baseline lightweight networks and Siamese architectures, validating the generalization capability and effectiveness of the design. While the framework advances the state-of-the-art in multimodal change detection, further opportunities remain. For example, sensitivity in regions with very mild changes or low contrast could be improved, and computation and memory costs of gating and attention modules warrant further optimization. Future work may explore integration of compact transformer modules, semi- or self-supervised pretraining, and extensions to multi-class or temporal-sequence change detection. The proposed architecture thus provides a solid foundation for future progress toward robust and precise multimodal change detection in remote sensing.

Author Contributions

Conceptualization, all authors; methodology, S.E.K. and A.J.; investigation, all authors; writing—original draft preparation, S.E.K. and A.J.; writing—review and editing, A.M. and S.J.; supervision, A.M. and S.J.; Funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All the datasets analyzed during the current study are part of a benchmark multi-modal change detection collection. The heterogeneous California dataset is publicly available online at https://sites.google.com/view/luppino/data (last accessed on 5 May 2025). The Gloucester I and II datasets, used in [20], are also publicly accessible via the following repository: https://www.iro.umontreal.ca/~mignotte/ResearchMaterial/ (last accessed on 5 May 2025). The code used in this study is openly available at the GitHub repository: https://github.com/aj1365/DMoE-AttU-Net (accessed on 7 May 2026).

Acknowledgments

The authors would like to thank Lund University, Sweden, for supporting the open-access publication of this work through the IOAP program. In addition, the authors sincerely thank the anonymous reviewers for their many valuable comments and suggestions that helped to improve both the technical content and the presentation quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, Z.; Woodcock, C.E. Continuous change detection and classification of land cover using all available Landsat data. Remote Sens. Environ. 2014, 144, 152–171. [Google Scholar] [CrossRef]
  2. Mohsenifar, A.; Mohammadzadeh, A.; Jamali, S. Unsupervised Rural Flood Mapping from Bi-Temporal Sentinel-1 Images Using an Improved Wavelet-Fusion Flood-Change Index (IWFCI) and an Uncertainty-Sensitive Markov Random Field (USMRF) Model. Remote Sens. 2025, 17, 1024. [Google Scholar] [CrossRef]
  3. Moghimi, A.; Mohammadzadeh, A.; Khazai, S. Integrating Thresholding with Level Set Method for Unsupervised Change Detection in Multitemporal SAR Images. Can. J. Remote Sens. 2017, 43, 412–431. [Google Scholar] [CrossRef]
  4. Khankeshizadeh, E.; Mohammadzadeh, A.; Moghimi, A.; Mohsenifar, A. FCD-R2U-net: Forest change detection in bi-temporal satellite images using the recurrent residual-based U-net. Earth Sci. Inform. 2022, 15, 2335–2347. [Google Scholar] [CrossRef]
  5. Khankeshizadeh, E.; Tahermanesh, S.; Mohsenifar, A.; Moghimi, A.; Mohammadzadeh, A. FBA-DPAttResU-Net: Forest burned area detection using a novel end-to-end dual-path attention residual-based U-Net from post-fire Sentinel-1 and Sentinel-2 images. Ecol. Indic. 2024, 167, 112589. [Google Scholar] [CrossRef]
  6. Liu, Q.; Ren, K.; Meng, X.; Shao, F. Domain Adaptive Cross Reconstruction for Change Detection of Heterogeneous Remote Sensing Images via a Feedback Guidance Mechanism. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4507216. [Google Scholar] [CrossRef]
  7. Lv, Z.; Huang, H.; Sun, W.; Lei, T.; Benediktsson, J.A.; Li, J. Novel Enhanced UNet for Change Detection Using Multimodal Remote Sensing Image. IEEE Geosci. Remote Sens. Lett. 2023, 20, 2505405. [Google Scholar] [CrossRef]
  8. Li, X.; Du, Z.; Huang, Y.; Tan, Z. A deep translation (GAN) based change detection network for optical and SAR remote sensing images. ISPRS J. Photogramm. Remote Sens. 2021, 179, 14–34. [Google Scholar] [CrossRef]
  9. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
  10. Liu, J.; Gong, M.; Qin, K.; Zhang, P. A Deep Convolutional Coupling Network for Change Detection Based on Heterogeneous Optical and Radar Images. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 545–559. [Google Scholar] [CrossRef] [PubMed]
  11. Niu, X.; Gong, M.; Zhan, T.; Yang, Y. A Conditional Adversarial Network for Change Detection in Heterogeneous Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 45–49. [Google Scholar] [CrossRef]
  12. Luppino, L.T.; Kampffmeyer, M.; Bianchi, F.M.; Moser, G.; Serpico, S.B.; Jenssen, R.; Anfinsen, S.N. Deep Image Translation with an Affinity-Based Change Prior for Unsupervised Multimodal Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4700422. [Google Scholar] [CrossRef]
  13. Zhang, C.; Feng, Y.; Hu, L.; Tapete, D.; Pan, L.; Liang, Z.; Cigna, F.; Yue, P. A domain adaptation neural network for change detection with heterogeneous optical and SAR remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2022, 109, 102769. [Google Scholar] [CrossRef]
  14. Liu, Z.; Zhang, J.; Wang, W.; Gu, Y. M2CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 4012105. [Google Scholar] [CrossRef]
  15. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional Siamese networks for change detection. In Proceedings of the International Conference on Image Processing, ICIP, Athens, Greece, 7–10 October 2018; IEEE: New York, NY, USA, 2018. [Google Scholar] [CrossRef]
  16. Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
  17. He, X.; Zhang, S.; Xue, B.; Zhao, T.; Wu, T. Cross-modal change detection flood extraction based on convolutional neural network. Int. J. Appl. Earth Obs. Geoinf. 2023, 117, 103197. [Google Scholar] [CrossRef]
  18. Yan, T.; Wan, Z.; Zhang, P.; Cheng, G.; Lu, H. TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4410012. [Google Scholar] [CrossRef]
  19. Luppino, L.T.; Bianchi, F.M.; Moser, G.; Anfinsen, S.N. Unsupervised image regression for heterogeneous change detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9960–9975. [Google Scholar] [CrossRef]
  20. Mignotte, M. A Fractal Projection and Markovian Segmentation-Based Approach for Multimodal Change Detection. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8046–8058. [Google Scholar] [CrossRef]
  21. Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2020, Vina del Mar, Chile, 27–29 October 2020; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  22. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M.J. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2017. [Google Scholar] [CrossRef]
  23. Brahim, E.; Amri, E.; Barhoumi, W.; Bouzidi, S. Fusion of UNet and ResNet decisions for change detection using low and high spectral resolution images. Signal Image Video Process. 2024, 18, 695–702. [Google Scholar] [CrossRef]
  24. Cummings, S.; Kondmann, L.; Zhu, X.X. Siamese Attention U-Net for Multi-Class Change Detection. In Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
  25. Pang, L.; Sun, J.; Chi, Y.; Yang, Y.; Zhang, F.; Zhang, L. CD-TransUNet: A Hybrid Transformer Network for the Change Detection of Urban Buildings Using L-Band SAR Images. Sustainability 2022, 14, 9847. [Google Scholar] [CrossRef]
  26. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  27. Li, K.; Li, Z.; Fang, S. Siamese NestedUNet Networks for Change Detection of High Resolution Satellite Image. In ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  28. Chen, H.; Wu, C.; Du, B.; Zhang, L.; Wang, L. Change Detection in Multisource VHR Images via Deep Siamese Convolutional Multiple-Layers Recurrent Neural Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2848–2864. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed Dual-Modal Mixture-of-Experts Attention U-Net (DMoE-AttU-Net). The SE attention module integrated within each CNN expert is detailed in Figure 2. $\psi_1$ and $\psi_2$ denote the attention maps used for feature modulation in the model.
Figure 2. Schematic of the squeeze-and-excitation (SE) channel attention module used within each CNN expert in the SAR branch of the proposed architecture (see Figure 1).
Figure 3. Heterogeneous bi-temporal optical–SAR datasets used in this study: (a) California, (b) Gloucester I, and (c) Gloucester II. For each dataset, the columns correspond to the pre-event optical image, the post-event SAR image, and the ground-truth change map, respectively.
Figure 4. Visual comparisons of the change maps obtained by different DL models on the three datasets. Green represents true positives (TP), while black represents true negatives (TN); red signifies false positives (FP) and blue indicates false negatives (FN). The dotted-line boxes highlight selected regions for detailed comparison, where the proposed DMoE-AttU-Net demonstrates lower error rates and better preservation of change regions.
Figure 5. The number of parameters increases consistently with the number of experts, following a trend similar to FLOPs.
Table 1. Description of the heterogeneous datasets used in this study.
| Dataset | Time 1 (Datatype/Sensor/Acquisition Date) | Time 2 (Datatype/Sensor/Acquisition Date) | Image Size (Pixel × Pixel) | Spatial Resolution (m) |
|---|---|---|---|---|
| California (USA) | Optical/Landsat-8/8 January 2017 | SAR/Sentinel-1A/18 February 2017 | 2000 × 3500 | ≈15 |
| Gloucester I (UK) | Optical/QuickBird-2/July 2006 | SAR/TerraSAR-X/July 2007 | 2325 × 4135 | 0.65 |
| Gloucester II (UK) | Optical/SPOT/September 1999 | SAR/ERS-1/November 2000 | 1250 × 2600 | ≈25 |
Table 2. Performance comparison of different deep learning models (bold numbers represent the best value in each column). For the proposed method, results are reported as mean ± standard deviation over three independent runs.
| Models | Precision (Background) | Precision (Change) | Recall (Background) | Recall (Change) | F1-Score (Background) | F1-Score (Change) | IoU (Background) | IoU (Change) | mIoU | KC | OA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [23] | 0.954 | 0.923 | 0.994 | 0.611 | 0.974 | 0.736 | 0.948 | 0.582 | 0.765 | 0.710 | 0.952 |
| AttU-Net [24] | 0.962 | 0.854 | 0.986 | 0.687 | 0.974 | 0.761 | 0.949 | 0.615 | 0.782 | 0.736 | 0.953 |
| DeepLabV3 [26] | 0.968 | 0.798 | 0.977 | 0.738 | 0.973 | 0.767 | 0.947 | 0.622 | 0.784 | 0.739 | 0.951 |
| TransU-Net [25] | 0.958 | 0.934 | 0.994 | 0.643 | 0.976 | 0.762 | 0.953 | 0.616 | 0.784 | 0.739 | 0.956 |
| M2CD [14] | 0.962 | **0.946** | **0.995** | 0.667 | 0.978 | 0.782 | 0.958 | 0.643 | 0.800 | 0.762 | 0.961 |
| Siam-NestedU-Net [27] | 0.964 | 0.944 | **0.995** | 0.698 | 0.979 | 0.803 | 0.959 | 0.671 | 0.815 | 0.783 | 0.962 |
| FC-Siam-Diff [15] | 0.978 | 0.814 | 0.977 | 0.822 | 0.977 | 0.818 | 0.956 | 0.692 | 0.824 | 0.795 | 0.960 |
| SiamCRNN [28] | **0.989** | 0.776 | 0.967 | **0.916** | 0.978 | 0.840 | 0.958 | 0.724 | 0.841 | 0.818 | 0.962 |
| DMoE-AttU-Net | 0.987 | 0.818 | 0.975 | 0.896 | **0.981** | **0.855** | **0.963** | **0.747** | **0.855 ± 0.004** | **0.836 ± 0.005** | **0.967 ± 0.002** |
Table 3. Computational complexity comparison of different models.
| Models | Params (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|
| U-Net | 31.0 | 45 | 18 |
| AttU-Net | 33.5 | 48 | 20 |
| DeepLabV3 | 42.3 | 65 | 28 |
| TransU-Net | 11.3 | 55 | 25 |
| M2CD | 7.0 | 38 | 22 |
| Siam-NestedU-Net | 12.0 | 40 | 17 |
| FC-Siam-Diff | 1.3 | 12 | 8 |
| SiamCRNN | 39.6 | 70 | 32 |
| DMoE-AttU-Net | 69.5 | 95 | 38 |
Table 4. Ablation study of the developed DMoE-AttU-Net model (the numbers in bold represent the best value in each column).
| SAR Encoder | Attention Gates | SE Attention | MoE | IoU (Background) | IoU (Change) | mIoU | KC | OA |
|---|---|---|---|---|---|---|---|---|
| × | ✓ | ✓ | ✓ | 0.935 | 0.604 | 0.769 | 0.720 | 0.941 |
| ✓ | × | ✓ | ✓ | 0.941 | 0.618 | 0.779 | 0.734 | 0.946 |
| ✓ | ✓ | × | ✓ | 0.953 | 0.685 | 0.819 | 0.789 | 0.957 |
| ✓ | ✓ | ✓ | × | 0.946 | 0.647 | 0.797 | 0.758 | 0.951 |
| ✓ | ✓ | ✓ | ✓ | **0.963** | **0.747** | **0.855** | **0.836** | **0.967** |