WEMFusion: Wavelet-Driven Hybrid-Modality Enhancement and Discrepancy-Aware Mamba for Optical–SAR Image Fusion

Wang, Jinwei; Zhao, Yongjin; Ma, Liang; Zhao, Bo; Song, Fujun; Cai, Zhuoran

doi:10.3390/rs18040612

by Jinwei Wang ¹,*, Yongjin Zhao ¹, Liang Ma ¹, Bo Zhao ², Fujun Song ¹ and Zhuoran Cai ¹

¹ School of Physics and Electronic Information, Yantai University, Yantai 264005, China

² State Key Laboratory of Radio Frequency Heterogeneous Integration, Shenzhen University, Shenzhen 518060, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(4), 612; https://doi.org/10.3390/rs18040612

Submission received: 29 January 2026 / Revised: 11 February 2026 / Accepted: 14 February 2026 / Published: 15 February 2026

(This article belongs to the Special Issue High Earth Orbit Spaceborne SAR Systems, Technologies, and Applications (2nd Edition))

Highlights

What are the main findings?

A unified WEMFusion framework is proposed. It constructs an interpretable band-wise prior via multi-scale wavelet transforms and introduces a hybrid-modality enhancement (HME) module: the high-frequency sub-module HRB focuses on injecting trustworthy optical edges and directional textures, while the low-frequency sub-module LF-CSCM performs cross-modality structural alignment, mitigating—from the fusion source—the entanglement and erroneous injection between genuine optical edges and SAR speckle.
A discrepancy-aware deep interaction mechanism is developed. DAG-MF explicitly injects cross-modality discrepancy and complementarity into the two-dimensional directionally scanned Mamba interaction process, making long-range modeling more selective in conflicting regions and more stable in consistent regions. Together with adaptive weighted aggregation, it dynamically balances prior-enhanced representations and deep fused representations, further improving structural consistency and detail stability.

What are the implications of the main findings?

The framework is more favorable for downstream interpretation. By suppressing noise mis-diffusion at the band level and constraining structural drift and directional breaks during long-range propagation, the fusion results become more reliable in terms of structural integrity, boundary stability, and perceptual consistency, thereby providing more stable support for downstream tasks such as object detection and semantic segmentation.
The framework is generalizable and deployable. The paradigm of frequency-domain decoupled priors coupled with discrepancy-driven selective interaction is broadly applicable and can be transferred to other cross-modality remote-sensing fusion tasks. Moreover, leveraging efficient state-space modeling and a lightweight multi-scale design, it better meets practical deployment requirements while maintaining strong performance.

Abstract

Optical and synthetic aperture radar (SAR) imagery are highly complementary in terms of texture details and structural scattering characterization. However, their imaging mechanisms and statistical distributions differ substantially. In particular, pseudo-high-frequency components introduced by SAR coherent speckle can be easily entangled with genuine optical edges, leading to texture mismatch, structural drift, and noise diffusion. To address these issues, we propose WEMFusion, a wavelet-prior-driven framework for frequency-domain decoupling and discrepancy-aware state-space fusion. Specifically, a multi-scale discrete wavelet transform (DWT) explicitly decomposes the inputs into low-frequency structural components and directional high-frequency sub-bands, providing an interpretable frequency-domain constraint for cross-modality alignment. We design a hybrid-modality enhancement (HME) module: in the high-frequency branch, it effectively injects optical edges and directional textures while suppressing the propagation of pseudo-high-frequency artifacts, and in the low-frequency branch, it reinforces global structural consistency and prevents speckle perturbations from leaking into the structural component, thereby mitigating structural drift. Furthermore, we introduce a discrepancy-aware gated Mamba fusion (DAG-MF) block, which generates dynamic gates from modality differences and complementary responses to modulate the parameters of a directionally scanned two-dimensional state-space model, so that long-range dependency modeling focuses on discrepant regions while preserving directional coherence. Extensive quantitative evaluations and qualitative comparisons demonstrate that WEMFusion consistently improves structural fidelity and edge detail preservation across multiple optical–SAR datasets, achieving superior fusion quality with lower computational overhead.

Keywords:

multimodal image fusion; optical (OPT) images; synthetic aperture radar (SAR) images; Mamba architecture; hybrid-modality enhancement; wavelet-driven

1. Introduction

Optical remote sensing imagery and synthetic aperture radar (SAR) imagery are inherently complementary due to their different imaging mechanisms [1,2]. As shown in Figure 1, optical imagery is acquired via passive imaging and is characterized by rich textures, clear edges, and strong semantic readability; however, it is easily affected by clouds and haze, shadows, and illumination variations [3]. Thus, small targets, such as the road vehicles highlighted in the red box, are not readily emphasized. By contrast, SAR imagery is acquired via active microwave imaging and supports all-time, all-weather observation; it can stably characterize structural and scattering differences and highlight strong scattering targets [4,5,6,7], but speckle noise and pseudo-textures introduced by coherent imaging reduce readability and increase the difficulty of cross-modality alignment and fusion [8,9]. Optical–SAR image fusion aims to effectively integrate the complementary information from the two sensors into a unified representation, producing fused results that simultaneously preserve clear textural details and stable structural scattering characterization, thereby improving interpretation capability under complex weather and complex land surface conditions and enhancing the performance of downstream tasks such as object detection, segmentation, and change detection [10,11]. It can provide more reliable data support for all-weather land surface information acquisition and fine-grained interpretation, serving typical application scenarios such as land use monitoring, fine-grained urban governance, and emergency disaster response [12].

Traditional image fusion methods largely rely on explicit priors and hand-crafted rules. They typically map source images into multi-scale and multi-directional representations, such as Laplacian pyramids [13] and wavelet transforms [14]. While highly interpretable and training-free, these methods employ fusion rules that are sensitive to noise, local misregistration, and cross-modality statistical discrepancies and thus often struggle to simultaneously achieve detail enhancement and structural stability. Sparse representation-based dictionary learning reformulates fusion as a joint representation in a sparse coefficient domain; by learning overcomplete dictionaries, it provides more stable structural descriptions, yet its performance depends heavily on the choice of dictionaries and sparsity constraints [15]. In recent years, deep learning has driven fusion toward an end-to-end paradigm. CNN- and autoencoder (AE)-based methods learn salient multi-source features with efficient encoders while maintaining favorable training and inference efficiency [16,17]. Diffusion models and adversarial learning, respectively, exploit generative priors or further incorporate perceptual constraints to improve texture realism and visual quality [18,19]. Transformers enhance global interaction and long-range dependency modeling via self-attention [20], whereas state-space models enable sequence modeling with near-linear complexity, offering a new pathway toward high-quality, high-resolution fusion [21,22].

Despite the rapid progress brought about by deep learning, optical–SAR fusion still faces three core challenges, detailed as follows. (1) Frequency-domain semantic misalignment: Optical high-frequency components correspond to genuine edges and directional textures, whereas SAR high-frequency components are often entangled with speckle noise and abrupt scattering variations [23,24]. Direct fusion in the spatial domain therefore tends to mistakenly inject speckle noise as details, leading to pseudo-textures and noise diffusion [25]. (2) Insufficient constraints on structural consistency: The two modalities differ markedly in geometric saliency and statistical distributions [26]; without explicit modeling and modulation of discrepant regions, structural drift and local mismatches are prone to occur. (3) Limited modeling of directional structures and long-range dependencies: Linear structures in SAR imagery, such as roads and dikes, exhibit strong directionality [27]. Convolutional operators struggle to maintain global directional coherence [28]. Although Transformers can perform global modeling, they provide insufficient directional inductive bias and incur computational costs that increase with resolution, which limits deployment.

Based on the above observations, we propose an efficient optical–SAR fusion framework, WEMFusion; it is an end-to-end image fusion network driven by wavelet decomposition that integrates hybrid-modality enhancement and discrepancy-aware Mamba. The proposed method fully exploits the explicit sub-band–domain decoupling capability of the wavelet transform to separate low-frequency structural components from directional high-frequency details and combines it with the efficiency advantage of state-space models in long-range dependency modeling. In this way, it enhances structural consistency and detail readability while suppressing SAR speckle-induced pseudo-textures, achieving high-quality and robust optical–SAR image fusion. The main contributions are as follows.

A wavelet-driven unified fusion framework: We employ a multi-scale DWT with recursive decomposition to construct a frequency-domain prior, providing an interpretable input space for cross-modality alignment, complementary aggregation, and noise suppression. The recursive interactions are further propagated to lower-resolution scales, which substantially reduces computational and memory overhead.
A hybrid-modality enhancement module (HME): To address the high-frequency discrepancy between genuine optical edges and SAR speckle noise, the high-frequency branch strengthens optical directional textures, while the low-frequency branch models structurally consistent alignment. Together with a lightweight Adaptive Attentive Aggregation for Fusion (A3F), HME enables robust complementary fusion.
A discrepancy-aware gated Mamba fusion block (DAG-MF): We generate dynamic gates from modality differences and complementary responses to modulate the parameters of a two-dimensional directionally scanned state-space model, so that long-range modeling focuses on discrepant regions while preserving directional coherence.

The remainder of this paper is organized as follows. Section 2 reviews research advances related to this work, including multimodal image fusion, wavelet transforms, and Mamba-based models. Section 3 describes in detail the overall framework and key designs of the proposed fusion network. Section 4 presents comprehensive experiments on public datasets, including comparative evaluations against representative methods, ablation studies, efficiency assessment, and performance comparisons of fusion results on a downstream semantic segmentation task. Section 5 discusses the experimental findings and the characteristics of the proposed method. Section 6 concludes the paper and outlines directions for future research.

2. Related Work

This section reviews prior work on multimodal image fusion, wavelet-based multi-scale decomposition, and state-space models for long-range dependency modeling, which together motivate the proposed design.

2.1. Multimodal Image Fusion

In recent years, multimodal image fusion (MMIF) has advanced rapidly under the impetus of deep learning. Mainstream methods can be broadly grouped into three categories: convolutional neural networks (CNNs) and autoencoders (AEs), generative models, and self-attention-based Transformers. Early studies primarily adopted CNN- and AE-based schemes. Li et al. proposed DenseFuse [16], which preserves intermediate-layer features via densely connected blocks and reconstructs the fused image at the decoder; they subsequently introduced NestFuse [17], incorporating nested connections to extract multi-scale deep features. Zhang et al. proposed IFCNN [29], which realizes a unified treatment of multiple fusion tasks within a generic convolutional framework. Although these methods offer advantages in training stability and inference efficiency, they are constrained by the local receptive fields of CNN kernels, making it difficult for such architectures to capture long-range semantic dependencies [30]. To improve perceptual quality, generative models have been widely introduced. Ma et al. brought generative adversarial networks into the fusion community and proposed FusionGAN [31], where adversarial learning encourages the fused image to retain more details; they later proposed DDcGAN [18], employing a dual-discriminator design to further balance information preservation across modalities. However, existing GAN-based architectures typically lack frequency-domain noise decoupling and are therefore prone to erroneously amplifying SAR speckle noise during adversarial training. More recently, Zhao et al. explored a diffusion model-based approach, DDFM [19]. While it markedly improves generation quality, its multi-step iterative denoising procedure incurs high computational cost [32], making it difficult to meet the requirements of real-time deployment at high resolution. In addressing long-range dependency modeling, Transformer architectures have gradually become mainstream. 
SwinFusion, proposed by Ma et al. [20], exploits the shifted-window mechanism of the Swin Transformer to enable cross-modality long-range modeling, whereas CDDFuse, proposed by Zhao et al. [33], combines the strengths of Transformers and CNNs to extract both global and local features. Nevertheless, self-attention is memory-intensive at high resolutions; moreover, when dealing with strongly directional linear structures such as roads and dikes, the lack of directional inductive bias may easily lead to structural discontinuities.

2.2. Wavelet Transform and Multi-Scale Decomposition

Multi-scale decomposition-based methods are built on the key idea of explicitly separating an image into low-frequency structural components and high-frequency directional details. The classic Mallat algorithm [34] established the theoretical foundation of the two-dimensional discrete wavelet transform (DWT), yielding a coarse-to-fine hierarchical representation through recursive decomposition. In the deep learning era, researchers have begun to treat wavelet sub-bands as an explicit representation space for neural networks. Guo et al. proposed Deep Wavelet Prediction for super-resolution [35], which learns to predict wavelet coefficients to recover more reliable textures and edges and is less likely to introduce unstable pseudo-details than purely spatial-domain reconstruction. In the image fusion community, Liu et al. further proposed WaveFusionNet, which combines DWT-based sub-band decomposition in the wavelet domain with a multi-scale encoder for infrared–visible fusion, enabling more precise organization of low-frequency structures and high-frequency details and improving reconstruction quality [36].

2.3. State-Space Models and Mamba

State-space models (SSMs) originate from classical control theory and have recently attracted renewed attention due to their potential for long-sequence modeling. The S4 model proposed by Gu et al. [37] addresses the computational bottleneck of long-term sequence memory. Building on this line of work, Gu and Dao further proposed Mamba [21], which introduces a selection mechanism together with efficient hardware-aware implementations, achieving Transformer-comparable performance in large-scale sequence modeling. In the vision domain, VMamba proposed by Liu et al. [22] and Vim proposed by Zhu et al. [38] were among the first to bring Mamba into two-dimensional image processing, capturing spatial long-range dependencies through a cross-scan mechanism. Zhang et al. proposed Mamba-STFM [39], constructing an end-to-end Mamba-based encoder–decoder framework to efficiently model long-range spatiotemporal dependencies and fuse multi-source heterogeneous remote sensing observations. More recently, Mamba-based architectures have also been investigated for dense prediction tasks in remote sensing, such as change detection and UAV semantic segmentation, demonstrating competitive accuracy–efficiency trade-offs [40,41].

3. Methods

This section provides a detailed description of our method design, introducing in sequence the frequency-band decomposition, the overall WEMFusion framework, the key modules, and the loss function design.

3.1. Discrete Wavelet Transform (DWT) and Inverse Discrete Wavelet Transform (IDWT)

As shown in Figure 2, to decouple structural trends and directional details in an explicit and interpretable frequency–sub-band space, we apply a two-dimensional discrete Haar wavelet transform to the input for multi-scale decomposition [42]. Given a 2D image $X \in \mathbb{R}^{H \times W \times C}$, a single-level 2D discrete wavelet transform (DWT) can be written as follows:

$$(LL, LH, HL, HH) = \mathrm{DWT}(X, \mathrm{haar}) \tag{1}$$

Correspondingly, the 2D inverse discrete wavelet transform (IDWT) reconstructs the original-scale representation from the four sub-bands as follows:

$$X_{f} = \mathrm{IDWT}((LL, LH, HL, HH), \mathrm{haar}) \tag{2}$$

where DWT and IDWT denote the forward and inverse wavelet transforms, respectively. $LL$ is the low-frequency approximation sub-band, which primarily carries global structure and intensity trends; $LH$, $HL$, and $HH$ are directional high-frequency detail sub-bands corresponding to local variations in the horizontal, vertical, and diagonal directions, respectively. “Haar” specifies the particular wavelet basis used in the DWT/IDWT. Due to its shortest support and local mean filtering property, the Haar wavelet typically exhibits weaker cross-neighborhood coupling. Consequently, under identical boundary handling, it can reduce the risk of noise diffusion and detail smoothing introduced by longer support filters [43]. In the optical–SAR setting, this decomposition equips the model with the ability to disentangle frequency entanglement: the low-frequency branch focuses more on structural consistency, whereas the high-frequency branch emphasizes reliable detail selection and speckle-noise suppression, thereby alleviating texture mismatch, structural drift, and noise diffusion.
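To make Equations (1) and (2) concrete, the following is a minimal NumPy sketch of a single-level 2D Haar DWT and its inverse. The function names `haar_dwt2` and `haar_idwt2`, the orthonormal 1/2 scaling, and the assignment of the three detail bands to LH, HL, and HH are assumptions made for illustration (sub-band naming conventions differ between libraries); this is not the authors' implementation.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) or (H, W, C) array with even H and W."""
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # scaled local mean: structure and intensity trend
    lh = (a + b - c - d) / 2.0  # top row minus bottom row (assumed "horizontal" band)
    hl = (a - b + c - d) / 2.0  # left column minus right column (assumed "vertical" band)
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the array at twice the spatial resolution."""
    h, w = ll.shape[:2]
    x = np.empty((2 * h, 2 * w) + ll.shape[2:], dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))          # toy H x W x C input
ll, lh, hl, hh = haar_dwt2(image)      # Eq. (1): four half-resolution sub-bands
recon = haar_idwt2(ll, lh, hl, hh)     # Eq. (2): back to the original resolution
```

Because the 2×2 Haar analysis matrix used here is orthogonal, the round trip is exact up to floating-point error, and the summed energy of the four sub-bands equals the energy of the input.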

3.2. Overall Architecture

As illustrated in Figure 3, WEMFusion first constructs a multi-scale hierarchical frequency-domain prior by applying the discrete wavelet transform (DWT) to the input images, yielding four sub-bands $(LL^{s}, LH^{s}, HL^{s}, HH^{s})$, where $s \in \{1, \dots, S\}$ denotes the scale index and a larger $s$ corresponds to a deeper, lower-resolution scale. The low-frequency sub-band $LL^{s}$ is recursively propagated to form a coarse-to-fine frequency hierarchy. Subsequently, the four sub-bands are concatenated along the channel dimension as frequency-domain inputs and fed into the WEM module in Figure 3b. Inside WEM, scale extractors first capture local features and map them into a unified embedding space, and they are then delivered to heterogeneous optical/SAR Mamba extractors. The optical branch emphasizes directional textures and edge detail enhancement, whereas the SAR branch focuses on robust scattering structure representation and global consistency modeling, thereby extracting complementary features while suppressing speckle interference. During the scale-wise fusion stage, DAG-MF performs discrepancy-aware cross-modal interaction and complementary aggregation; HME explicitly enforces a frequency-domain enhancement strategy that biases high-frequency components toward optical cues while promoting low-frequency cooperative alignment; and A3F adaptively reweights and fuses the two types of representations. Reconstruction proceeds progressively along a deep-to-shallow path: at each scale, DAG-MF first fuses the current-scale features with deeper priors, and the fused features are then restored to a higher resolution via IDWT. The high-resolution features are projected via convolution and aligned through interpolation (Align), then passed to the next scale for further fusion until the highest-resolution fused representation is obtained; finally, a lightweight reconstruction head outputs the fused image $I^{f}$.
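The recursive decomposition and the deep-to-shallow reconstruction path described above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `build_pyramid` and `reconstruct` are hypothetical, the Haar scaling is the orthonormal one, and the per-scale fusion modules (DAG-MF, HME, A3F) are left out, so the sub-bands pass through unchanged and only the resolution bookkeeping is shown.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a + b - c - d) / 2.0,
            (a - b + c - d) / 2.0, (a - b - c + d) / 2.0)


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: doubles the spatial resolution."""
    x = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def build_pyramid(x, num_scales):
    """Return one (LL, LH, HL, HH) tuple per scale; only LL is decomposed further."""
    pyramid, current = [], x
    for _ in range(num_scales):
        bands = haar_dwt2(current)
        pyramid.append(bands)
        current = bands[0]
    return pyramid


def reconstruct(pyramid):
    """Deep-to-shallow path: start at the deepest LL and apply one IDWT per scale."""
    ll = pyramid[-1][0]
    for _, lh, hl, hh in reversed(pyramid):
        # In the full network, per-scale fusion would modify the bands here.
        ll = haar_idwt2(ll, lh, hl, hh)
    return ll


rng = np.random.default_rng(0)
image = rng.random((64, 64))
pyramid = build_pyramid(image, num_scales=3)   # S = 3 in this toy example
restored = reconstruct(pyramid)
```

With a 64×64 input and three scales, the sub-bands have resolutions 32×32, 16×16, and 8×8, which is why moving interaction to deeper scales reduces the number of tokens that later modules must process.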

3.3. Feature Extraction

As shown in Figure 3b, feature extraction in WEMFusion consists of two stages: frequency-domain scale encoding (scale extractors) and modality-specific high-level representation extraction (Mamba extractor). First, at scale $s$, the network concatenates the four sub-bands along the channel dimension to form a structurally decomposable frequency-domain input:

$$Z_{m}^{(s)} = \mathrm{Cat}(LL_{m}^{(s)}, LH_{m}^{(s)}, HL_{m}^{(s)}, HH_{m}^{(s)}) \in \mathbb{R}^{H_{s} \times W_{s} \times 4C} \tag{3}$$

where $m \in \{\mathrm{opt}, \mathrm{sar}\}$, $H_{s} \times W_{s}$ denotes the spatial resolution at scale $s$, and $\mathrm{Cat}$ denotes the concatenation operation. This representation explicitly preserves the band semantics of low-frequency structures and directional details, enabling subsequent modules to apply targeted enhancement and denoising strategies across different frequency bands. Then, the encoder scale extractor $E^{(s)}(\cdot)$, composed of two convolution layers and LeakyReLU, maps $Z_{m}^{(s)}$ to a unified embedding dimension:

$$F_{m}^{(s)} = E^{(s)}(Z_{m}^{(s)}) = \delta(\mathrm{Conv}_{3 \times 3}(\delta(\mathrm{Conv}_{3 \times 3}(Z_{m}^{(s)})))) \tag{4}$$

where $\delta(\cdot)$ denotes the LeakyReLU activation function. $\mathrm{Conv}_{3 \times 3}(\cdot)$ denotes a $3 \times 3$ convolution. The encoder $E^{(s)}(\cdot)$ keeps the spatial resolution unchanged and aligns the channel dimension from $4C$ to $C$, producing the scale-level basic feature $F_{m}^{(s)}$, which is then fed into the modality-specific high-level extraction stage. As shown in Figure 3c, the optical branch adopts the optical Mamba extractor and introduces an edge enhancement module [44] before the Mamba stacking to strengthen genuine edges and directional textures. The SAR branch employs the SAR–Mamba extractor, directly stacking Mamba modules [45] to model structural scattering characteristics and long-range continuity. Specifically, as illustrated by the Mamba block in Figure 3c, the module flattens the 2D feature map into a sequence of $H_{s} \times W_{s}$ tokens for state-space scan modeling and then reshapes it back to a 2D representation, thereby capturing cross-region long-range dependencies and structural consistency with relatively low computational complexity. Through modality-specific Mamba extraction at scale $s$, we obtain high-level representations of the optical and SAR modalities, denoted $T_{\mathrm{opt}}^{(s)}$ and $T_{\mathrm{sar}}^{(s)}$, respectively.
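The channel bookkeeping of Equations (3) and (4) can be illustrated with a small NumPy sketch. Several details are assumptions rather than facts from the paper: the helper names, the random weights standing in for learned parameters, the LeakyReLU slope of 0.2, zero padding at the borders, the omission of bias terms, and the choice that the first convolution performs the $4C \to C$ reduction.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    """LeakyReLU; the slope value is an assumption."""
    return np.where(x > 0, x, slope * x)


def conv3x3(x, weight):
    """Zero-padded 'same' 3x3 convolution. x: (H, W, Cin); weight: (3, 3, Cin, Cout)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, weight.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Each kernel tap is a per-pixel linear map applied to a shifted view.
            out += padded[dy:dy + h, dx:dx + w] @ weight[dy, dx]
    return out


def scale_extractor(z, w1, w2):
    """Eq. (4): two 3x3 convolutions, each followed by LeakyReLU."""
    return leaky_relu(conv3x3(leaky_relu(conv3x3(z, w1)), w2))


C = 3
rng = np.random.default_rng(0)
# Stand-ins for the LL, LH, HL, HH sub-bands of one modality at one scale.
sub_bands = [rng.standard_normal((16, 16, C)) for _ in range(4)]
z = np.concatenate(sub_bands, axis=-1)            # Eq. (3): shape (16, 16, 4C)
w1 = rng.standard_normal((3, 3, 4 * C, C)) * 0.1  # assumed: 4C -> C in the first layer
w2 = rng.standard_normal((3, 3, C, C)) * 0.1
f = scale_extractor(z, w1, w2)                    # shape (16, 16, C)
```

The spatial size stays at $H_{s} \times W_{s}$ while the channel count drops from $4C$ to $C$, matching the description of $E^{(s)}(\cdot)$ above.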

3.4. Feature Fusion

The key challenges in the fusion stage are as follows: the high-frequency components of optical imagery mainly correspond to genuine edges and directional textures, whereas the high-frequency components of SAR imagery are often dominated by speckle noise and mixed with abrupt scattering variations, making them prone to entanglement in both spectral content and statistical characteristics. Meanwhile, effective structural information in SAR imagery typically exhibits strong directionality and long-range continuity, placing higher requirements on global consistency modeling during fusion. To this end, WEMFusion adopts a strategy of two complementary paths working collaboratively within the fusion backbone. HME provides an interpretable band-wise prior in the wavelet domain; by enhancing optical cues in the high-frequency bands while promoting cooperative alignment in the low-frequency band, it suppresses the erroneous propagation of speckle noise. However, although HME offers an interpretable frequency-domain prior, it is still insufficient to resolve local conflicts in complex regions and to guarantee the continuity of directional structures. Therefore, DAG-MF explicitly injects cross-modality discrepancies into a two-dimensional, directionally scanned Mamba, enabling structure-sensitive and direction-consistent deep cross-modal interaction. Finally, with the aid of a lightweight adaptive weighted aggregation mechanism, the prior-enhanced representation and the discrepancy-driven fused representation are dynamically balanced to produce the final within-scale fused features, which subsequently support reconstruction.

3.4.1. Hybrid-Modality Enhancement (HME)

As shown in Figure 3b, HME aims to construct a frequency-domain interpretable hybrid-modality prior before deep cross-modal interaction. It uses the optical modality to provide reliable high-frequency edge guidance, while explicitly aligning the overall trends of OPT and SAR at the low-frequency structural level, thereby reducing the risk that SAR speckle-induced pseudo-high-frequency textures are mistakenly injected as details into the fused result. To this end, HME divides the wavelet sub-bands into two complementary paths according to frequency bands, namely a high-frequency enhancement path and a low-frequency modulation path.

As shown in Figure 4, in the low-frequency path, HME takes the low-frequency sub-bands of both modalities as input. To address the commonly observed modality semantic bias and structural misalignment within low-frequency sub-bands, LF-CSCM constructs a dual-branch multi-scale perception and adaptive gating mechanism by deploying, in parallel, a $3 \times 3$ depthwise convolution and a cross-shaped branch composed of asymmetric $1 \times 7$ and $7 \times 1$ convolutions, which capture local texture details and directional long-range context, respectively, enabling complementary feature modeling. The module then applies coordinate attention to enhance position-aware representations along the height and width directions and further adaptively adjusts the feature injection ratio via a gated residual connection, thereby strengthening structural responses and stability while retaining the main information of the source features.
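As a rough illustration of why the two branches are complementary, the sketch below compares their receptive fields on a single-pixel impulse. It assumes that the $1 \times 7$ and $7 \times 1$ depthwise convolutions are applied in parallel and summed (the text does not state how they are combined), uses all-ones kernels instead of learned weights, and omits coordinate attention and the gated residual entirely. The helper names are hypothetical.

```python
import numpy as np


def depthwise_conv(x, kernel):
    """Zero-padded 'same' depthwise convolution.

    x: (H, W, C); kernel: (kh, kw, C) with odd kh and kw, one filter per channel.
    """
    kh, kw, _ = kernel.shape
    h, w, _ = x.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    out = np.zeros(x.shape)
    for dy in range(kh):
        for dx in range(kw):
            out += padded[dy:dy + h, dx:dx + w] * kernel[dy, dx]
    return out


def dual_branch(x, k3x3, k1x7, k7x1):
    """Local 3x3 branch and cross-shaped branch (assumed: 1x7 and 7x1 summed)."""
    local = depthwise_conv(x, k3x3)
    cross = depthwise_conv(x, k1x7) + depthwise_conv(x, k7x1)
    return local, cross


C = 2
impulse = np.zeros((15, 15, C))
impulse[7, 7] = 1.0  # single active pixel in the centre of the map
local, cross = dual_branch(
    impulse, np.ones((3, 3, C)), np.ones((1, 7, C)), np.ones((7, 1, C))
)
local_support = int(np.count_nonzero(local[..., 0]))  # pixels reached by the 3x3 branch
cross_support = int(np.count_nonzero(cross[..., 0]))  # pixels reached by the cross branch
```

Under these assumptions the local branch touches a compact 3×3 neighbourhood (9 pixels), whereas the cross-shaped branch reaches 13 pixels spread along one row and one column, extending three pixels in each axis direction with only 14 weights per channel.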

As shown in Figure 5, in the high-frequency path, HME uses the optical sub-bands $(LL, LH, HL, HH)$ as the detail source and applies HRB with multiple convolutions for detail enhancement and contrast boosting, reinforcing the directional consistency of genuine edges while suppressing unstable textures. Following the half-channel normalization idea [46], instance normalization is applied to half of the channels, whereas the other half remain unchanged. In this way, the HFE module normalizes half of the features to stabilize the distribution and reduce the discrepancy in feature-value ranges across different samples and training stages, while preserving the remaining un-normalized features to retain key information and context.
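The half-channel normalization step can be sketched in a few lines of NumPy. This is an illustrative version under assumptions: the function name `half_instance_norm` is hypothetical, the first half of the channels is the normalized half, and the learnable affine parameters that instance normalization layers usually carry are omitted.

```python
import numpy as np


def half_instance_norm(x, eps=1e-5):
    """Instance-normalize the first half of the channels and keep the rest unchanged.

    x: (H, W, C) feature map of a single sample.
    """
    half = x.shape[-1] // 2
    to_norm, kept = x[..., :half], x[..., half:]
    # Instance norm: statistics are computed per channel over the spatial dimensions.
    mean = to_norm.mean(axis=(0, 1), keepdims=True)
    var = to_norm.var(axis=(0, 1), keepdims=True)
    normed = (to_norm - mean) / np.sqrt(var + eps)
    return np.concatenate([normed, kept], axis=-1)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4)) * 5.0 + 3.0  # features with a shifted, scaled range
y = half_instance_norm(x)
```

After the call, the first two channels of `y` have approximately zero mean and unit variance over the spatial grid, while the last two channels are bit-for-bit identical to the input, which mirrors the stated goal of stabilizing the distribution without discarding un-normalized context.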

Finally, HME concatenates the outputs of the two paths along the channel dimension and projects them into a unified embedding space via scale extractors, producing the scale-$s$ hybrid-enhanced feature $F_{\mathrm{HME}}^{(s)}$. This provides subsequent discrepancy-aware fusion with a higher-SNR prior input that carries explicit semantics for each frequency band. The above procedure can be summarized as follows:

$$F_{\mathrm{HME}}^{(s)} = E^{(s)}\left(\mathrm{Cat}\left(\underbrace{\mathrm{LF\text{-}CSCM}(L^{(s)})}_{\text{low-frequency alignment}},\ \underbrace{\mathrm{HRB}(H_{\mathrm{opt}}^{(s)})}_{\text{optical high-frequency}}\right)\right) \tag{5}$$

where $E^{(s)}(\cdot)$ corresponds to the scale extractors in the figure, which project the hybrid prior produced by HME to a unified channel dimension so as to match the subsequent fusion backbone.

3.4.2. Discrepancy-Aware Gated Mamba Fusion (DAG-MF)

In the optical–SAR fusion backbone, simply replacing the fusion trunk with Mamba alone still fails to resolve several key issues. On the one hand, the statistical and semantic discrepancies of high-frequency components across the two modalities are significant; without explicit discrepancy modeling, long-range scanning tends to propagate noise errors and amplify mismatches. On the other hand, SAR directional structures require global continuity, whereas regions dominated by scattering mutations and speckle noise demand different preservation and suppression strategies. It is therefore difficult to achieve region-adaptive balancing with a single content-driven long-range modeling scheme. Moreover, directly interacting in the spatial domain can accumulate high-frequency aliasing and amplify errors, further degrading fusion stability. Existing discrepancy-aware fusion usually injects cross-modality differences through attention mechanisms, whereas our goal is to further introduce explicit cross-modality discrepancies into the dynamic parameters of the SSM, thereby achieving discrepancy-driven modality interactions.

At the fusion stage of scale

s

, the inputs are the modality-specific high-level features

T_{opt}^{(s)}

and

T_{sar}^{(s)}

. As shown in Figure 6, DAG-MF first applies LayerNorm (LN) to both feature streams and then feeds them into the DiscrepancyGate to construct a gated prior from the discrepancy response and the complementary response:

$$D^{(s)} = \left| \mathrm{LN}(T_{opt}^{(s)}) - \mathrm{LN}(T_{sar}^{(s)}) \right| \tag{6}$$

$$C^{(s)} = \mathrm{LN}(T_{opt}^{(s)}) \odot \mathrm{LN}(T_{sar}^{(s)}) \tag{7}$$

where $|\cdot|$ denotes the element-wise absolute value and $\odot$ denotes element-wise multiplication. The two responses are then concatenated and passed through a $1 \times 1$ convolution for channel reduction, a $3 \times 3$ depthwise convolution (DWConv) for spatial modeling, and a $1 \times 1$ projection with activation to produce the shared gate $G^{(s)}$:

$$G^{(s)} = \sigma\left( \mathrm{Conv}_{1 \times 1}\left( \mathrm{DWConv}_{3 \times 3}\left( \mathrm{SiLU}\left( \mathrm{Conv}_{1 \times 1}\left( [D^{(s)}, C^{(s)}] \right) \right) \right) \right) \right) \tag{8}$$

where $\sigma(\cdot)$ denotes the sigmoid function, which constrains the gate values to $(0, 1)$. This gate is injected into two parallel DAG-SS2D branches for the optical and SAR streams, respectively, to modulate the state-space updates of the two-dimensional four-direction selective scan (horizontal forward and backward, vertical forward and backward) [22], and the outputs are obtained in a residual form. Specifically, let $R_{k}(\cdot)$ denote the rearrangement operator that unfolds the 2D feature into a sequence of length $L = H \times W$ in the $k$-th direction, and let $\mathrm{SScan}(\cdot)$ denote the selective state-space scanning operator. Then, the gate is injected into the scanning dynamic control vector in an element-wise product manner, thereby directly modulating the strength of state updates:

$$\hat{T}_{m}^{(s)} = T_{m}^{(s)} + \sum_{k=1}^{4} R_{k}^{-1}\left( \mathrm{SScan}\left( R_{k}\left( \mathrm{LN}(T_{m}^{(s)}) \right) \odot R_{k}\left( G^{(s)} \right) \right) \right) \tag{9}$$

where $m \in \{opt, sar\}$ indexes the modality and $R_{k}^{-1}(\cdot)$ denotes the corresponding inverse rearrangement that restores the sequence back to a two-dimensional representation. The key idea of this design is that the same $G^{(s)}$ simultaneously acts on the two parallel scanning branches of the optical and SAR streams, so that discrepancy-dominant regions indicated by $D^{(s)}$ receive stronger complementary information injection during long-range propagation, whereas consistency-dominant regions indicated by $C^{(s)}$ maintain stable structural transmission under four-direction scanning. Compared with performing static weighting only at the feature level, DAG-MF injects the gate into the selective update process of SS2D [22], enabling long-range dependency modeling to be adaptively activated according to discrepancies. This mechanism suppresses the propagation of SAR speckle-induced pseudo-textures at the source, while enhancing the directional continuity and cross-scale consistency of linear structures such as roads and embankments.
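To make the data flow of Equations (6)–(9) concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several parts are assumptions made purely for illustration: the weight shapes (`w_reduce`, `k_dw`, `w_proj`), channel-wise LayerNorm without learned affine parameters, and above all `toy_scan`, a fixed linear recurrence that stands in for the learned selective scan $\mathrm{SScan}(\cdot)$. The direction ordering in `unfold` is likewise a hypothetical choice.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each spatial position over the channel axis; x has shape (C, H, W).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels; w is (C_out, C_in).
    return np.einsum("oc,chw->ohw", w, x)

def dwconv3x3(x, k):
    # Depthwise 3x3 filtering with zero padding; k has shape (C, 3, 3).
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def silu(x):
    return x / (1.0 + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discrepancy_gate(t_opt, t_sar, w_reduce, k_dw, w_proj):
    n_opt, n_sar = layer_norm(t_opt), layer_norm(t_sar)
    d = np.abs(n_opt - n_sar)              # Eq. (6): discrepancy response
    c = n_opt * n_sar                      # Eq. (7): complementary response
    z = np.concatenate([d, c], axis=0)     # [D, C], shape (2C, H, W)
    z = dwconv3x3(silu(conv1x1(z, w_reduce)), k_dw)
    return sigmoid(conv1x1(z, w_proj))     # Eq. (8): shared gate

def unfold(x, k):
    # R_k: (C, H, W) -> (L, C). Assumed order: 0/1 = row-wise forward/backward,
    # 2/3 = column-wise forward/backward.
    if k >= 2:
        x = x.transpose(0, 2, 1)
    seq = x.reshape(x.shape[0], -1).T
    return seq[::-1] if k % 2 else seq

def fold(seq, k, shape):
    # R_k^{-1}: undo the reversal and the row/column ordering of unfold.
    c, h, w = shape
    if k % 2:
        seq = seq[::-1]
    if k >= 2:
        return seq.T.reshape(c, w, h).transpose(0, 2, 1)
    return seq.T.reshape(c, h, w)

def toy_scan(seq, decay=0.9):
    # Stand-in for SScan (assumption): h_t = decay * h_{t-1} + (1 - decay) * x_t.
    out = np.empty_like(seq)
    h = np.zeros(seq.shape[1])
    for t in range(seq.shape[0]):
        h = decay * h + (1.0 - decay) * seq[t]
        out[t] = h
    return out

def gated_ss2d(t, gate):
    # Eq. (9): residual sum over four gated directional scans.
    n = layer_norm(t)
    out = t.copy()
    for k in range(4):
        out += fold(toy_scan(unfold(n, k) * unfold(gate, k)), k, t.shape)
    return out
```

Calling `gated_ss2d(t_opt, g)` and `gated_ss2d(t_sar, g)` with the same gate `g` mirrors the shared-gate design: positions where the gate is close to zero contribute almost nothing to the scanned sequence, so the residual path dominates there.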

3.4.3. Adaptive Attentional Feature Fusion (A3F)

As shown in Figure 7, A3F is used to adaptively fuse, at scale $s$, the frequency prior-enhanced feature $F_{HME}^{(s)}$ and the discrepancy-aware deep fused feature $F_{DAG}^{(s)}$. Its core idea is to learn a position-dependent weight map and dynamically reweight the two complementary representations: HME is more biased toward detail enhancement and denoising priors, whereas DAG-MF is more biased toward discrepancy-driven long-range structural consistency. Specifically, A3F takes $F_{HME}^{(s)}$ and $F_{DAG}^{(s)}$ as contextual inputs, generates fusion weights via a global branch (Adaptive AvgPool) and a local branch (a lightweight $1 \times 1$ convolution), and performs weighted fusion:

$$w^{(s)} = \sigma\left( A\left( F_{HME}^{(s)} + F_{DAG}^{(s)} \right) \right) \tag{10}$$

$$F_{A3F}^{(s)} = w^{(s)} \odot F_{HME}^{(s)} + \left( 1 - w^{(s)} \right) \odot F_{DAG}^{(s)} \tag{11}$$

where $A(\cdot)$ denotes the attention operator composed of the global and local branches, and $\sigma(\cdot)$ is the sigmoid function. This design encourages the network to rely more on HME’s high-frequency prior to enhance details in texture-reliable regions, while in discrepancy-dominant regions or regions requiring directional coherence, it relies more on the deep fused representation, thereby improving fusion robustness and structural preservation.
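Equations (10) and (11) amount to a per-position convex combination of the two features. The sketch below illustrates this in NumPy under stated assumptions: the text does not specify how the global and local branches are combined inside $A(\cdot)$, so here they are simply summed, and each branch is reduced to a single linear map (`w_local`, `w_global`, both hypothetical parameters).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def a3f(f_hme, f_dag, w_local, w_global):
    # f_hme, f_dag: (C, H, W); w_local, w_global: (C, C) linear maps (assumed form).
    x = f_hme + f_dag
    local = np.einsum("oc,chw->ohw", w_local, x)   # local branch: 1x1 convolution
    pooled = x.mean(axis=(1, 2))                   # global branch: average pool to 1x1
    glob = (w_global @ pooled)[:, None, None]      # broadcast back over H and W
    w = sigmoid(local + glob)                      # Eq. (10), branches summed (assumption)
    fused = w * f_hme + (1.0 - w) * f_dag          # Eq. (11)
    return fused, w
```

Because every weight lies between 0 and 1, each fused value lies between the corresponding HME and DAG-MF values, which is what keeps the reweighting from amplifying either input.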

3.5. Loss Function

We train the generator by jointly optimizing a content constraint and an adversarial constraint [42]. The overall loss is defined as follows:

$$L_{G} = \lambda_{c} L_{con} + \lambda_{adv} L_{adv} \tag{12}$$

where $\lambda_{c}$ and $\lambda_{adv}$ are the weights of the content and adversarial terms, respectively. In this work, we set $\lambda_{c} = 1.0$ and $\lambda_{adv} = 10^{-3}$. For the adversarial term, we adopt a dual-discriminator WGAN formulation to simultaneously enforce the distributional consistency of the fused result $F$ in the optical domain and the SAR domain:

$$L_{adv} = -\left( E[D_{sar}(F)] + E[D_{opt}(F)] \right) \tag{13}$$

where $D_{opt}(\cdot)$ and $D_{sar}(\cdot)$ denote the discriminators for the optical and SAR domains, respectively, and $E[\cdot]$ denotes the expectation over the mini-batch and spatial locations. The content loss is constructed as a weighted sum of five components (intensity, gradient, structure, noise, and perceptual terms):

$$L_{con} = \omega_{int} L_{int} + \omega_{grad} L_{grad} + \omega_{ssim} L_{ssim} + \omega_{noise} L_{noise} + \omega_{perc} L_{perc} \tag{14}$$

In all experiments, we set $\omega_{int}$, $\omega_{grad}$, $\omega_{ssim}$, $\omega_{noise}$, and $\omega_{perc}$ to 20, 20, 10, 5, and 2, respectively. The intensity preservation term is defined as follows:

$$L_{int} = \left\| F - \max(A, B) \right\|_{1} \tag{15}$$

where $A$ and $B$ denote the input optical and SAR images, $F = G(A, B)$ is the fused output, $\|\cdot\|_{1}$ denotes the $L_{1}$ norm, and $\max(\cdot, \cdot)$ is the element-wise maximum operator, which encourages the brightness of the fused result to be no lower than the salient response of either modality. The gradient preservation term is defined as

$$L_{grad} = \left\| \nabla F - \max(\nabla A, \nabla B) \right\|_{1} \tag{16}$$

where $\nabla(\cdot)$ denotes the Sobel gradient operator (computed by taking derivatives along the horizontal and vertical directions and then combining them) to emphasize edges and texture details. The structural consistency term adopts a weighted SSIM formulation:

$$L_{ssim} = 1 - \left( w_{A}\, \mathrm{SSIM}(A, F) + w_{B}\, \mathrm{SSIM}(B, F) \right) \tag{17}$$

with $w_{A} = \frac{E[V_{A}]}{E[V_{A}] + E[V_{B}]}$ and $w_{B} = 1 - w_{A}$, where $V_{A} = \|\nabla A\|$ and $V_{B} = \|\nabla B\|$ are gradient-magnitude maps, and the weights $w_{A}$, $w_{B}$ bias the structural constraint toward the modality with more prominent gradients. The SSIM term is implemented with a differentiable SSIM operator, where the modality-adaptive weights are derived from the global mean of $3 \times 3$ Sobel gradient magnitudes, and the weighted SSIM is incorporated as a structural loss in the form of $1 - \mathrm{SSIM}$. To suppress speckle noise-induced random fluctuations, we introduce a residual fluctuation regularizer:

$$L_{noise} = E\left[ \mathrm{Var}\left( F - \max(A, B) \right) \right] \tag{18}$$

where $\mathrm{Var}(\cdot)$ denotes the variance over spatial pixels, measuring the fluctuation strength of the residual map. The perceptual consistency term is defined as follows:

$$L_{perc} = w_{A}^{p} \left\| \Phi(F) - \Phi(A) \right\|_{1} + w_{B}^{p} \left\| \Phi(F) - \Phi(B) \right\|_{1} \tag{19}$$

with $w_{A}^{p} = \frac{E[\|\Phi(A)\|_{1}]}{E[\|\Phi(A)\|_{1}] + E[\|\Phi(B)\|_{1}]}$ and $w_{B}^{p} = 1 - w_{A}^{p}$, where $\Phi(\cdot)$ is a feature extractor from a pre-trained VGG network, used to constrain perceptual consistency of the fusion output from a higher-level semantic and texture statistics perspective. The perceptual term uses a frozen ImageNet-pretrained VGG16: single-channel inputs are replicated to three channels and normalized with ImageNet mean, the L1 feature discrepancy is computed at the relu2_2 layer, and modality contributions are adaptively weighted by the mean activation strength at relu1_2.
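The non-learned content terms can be illustrated with a short NumPy sketch for a single image triple. It is only a sketch under assumptions: the L1 norms are mean-reduced, the Sobel magnitude is taken as $|g_x| + |g_y|$ (the text only says the two directions are combined), borders use edge padding, and the SSIM and perceptual terms are passed in as precomputed numbers because they need a differentiable SSIM operator and a VGG16 backbone.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, k):
    # 3x3 filtering of a 2D image with edge padding (padding mode is an assumption).
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    return sum(k[i, j] * p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def sobel_mag(img):
    # Gradient magnitude as |g_x| + |g_y| (one common way to combine directions).
    return np.abs(filter3x3(img, SOBEL_X)) + np.abs(filter3x3(img, SOBEL_Y))

def content_terms(f, a, b):
    target = np.maximum(a, b)
    l_int = np.mean(np.abs(f - target))                        # Eq. (15)
    l_grad = np.mean(np.abs(
        sobel_mag(f) - np.maximum(sobel_mag(a), sobel_mag(b))))  # Eq. (16)
    l_noise = np.var(f - target)                               # Eq. (18), one image
    va, vb = sobel_mag(a).mean(), sobel_mag(b).mean()
    w_a = va / (va + vb + 1e-12)                               # SSIM weight w_A
    return {"int": l_int, "grad": l_grad, "noise": l_noise,
            "w_a": w_a, "w_b": 1.0 - w_a}

def weighted_content(l_int, l_grad, l_ssim, l_noise, l_perc,
                     weights=(20.0, 20.0, 10.0, 5.0, 2.0)):
    # Eq. (14) with the weights reported in the text.
    terms = (l_int, l_grad, l_ssim, l_noise, l_perc)
    return sum(w * l for w, l in zip(weights, terms))
```

If the fused image equals the element-wise maximum of the two inputs, the intensity and noise terms are both zero, which shows that these two terms pull the output toward the same target from different angles (mean absolute error versus residual variance).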

We implement adversarial learning with a dual-discriminator scheme: two discriminators with identical architectures are introduced in the optical domain and the SAR domain, respectively, so that distribution-matching constraints are simultaneously imposed on the fused output. Each discriminator is composed of multiple strided convolution layers for progressive down-sampling, combined with channel attention, and produces a local score map to characterize spatial-detail discrepancies. During training, we adopt the WGAN-GP stabilization strategy by linearly interpolating between real and generated samples and penalizing the gradient norm of the discriminator to enforce the Lipschitz constraint. Regarding the update schedule, in each iteration, each of the two discriminators is updated twice, followed by one update of the generator. The real samples for the discriminators are the optical input patch and the SAR input patch from the same aligned patch pair, respectively, while the generated sample is the fused patch produced by the generator. All inputs are normalized to $[0, 1]$ through the same data pipeline and are subjected to consistent online random cropping and augmentation, ensuring statistical consistency between generation and discrimination. We set the gradient penalty weight to $\lambda_{gp} = 10$ in this work.
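The gradient penalty and the sign conventions of Equation (13) can be checked on a toy example. The sketch below deliberately replaces the convolutional discriminators with a hypothetical linear critic $D(x) = \langle w, x \rangle$, whose input gradient is simply $w$, so that no automatic differentiation is needed; it illustrates the WGAN-GP arithmetic only and is not the paper's discriminator.

```python
import numpy as np

def critic(x, w):
    # Toy linear critic D(x) = <w, x>; x has shape (N, D), w has shape (D,).
    return x @ w

def critic_grad(x, w):
    # For a linear critic, the gradient with respect to every input sample is w.
    return np.broadcast_to(w, x.shape)

def gradient_penalty(real, fake, w, rng, lambda_gp=10.0):
    # Interpolate between real and generated samples, then penalize
    # deviations of the gradient norm from 1.
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_grad(x_hat, w), axis=1)
    return lambda_gp * np.mean((grad_norm - 1.0) ** 2)

def critic_loss(real, fake, w, rng, lambda_gp=10.0):
    # Standard WGAN-GP critic objective for one domain.
    wasserstein = critic(fake, w).mean() - critic(real, w).mean()
    return wasserstein + gradient_penalty(real, fake, w, rng, lambda_gp)

def generator_adv_loss(fused, w_opt, w_sar):
    # Eq. (13): the generator raises the score under both critics.
    return -(critic(fused, w_sar).mean() + critic(fused, w_opt).mean())
```

In the schedule described above, `critic_loss` would be minimized twice per iteration for each of the two critics (with the optical and SAR patches as the respective real samples) before one generator step that uses `generator_adv_loss` scaled by $\lambda_{adv}$.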

4. Experiments

This section describes the datasets, experimental settings, and evaluation metrics, and validates the effectiveness and robustness of the proposed method through comparative experiments, ablation studies, and qualitative visualizations.

4.1. Experimental Setup

Experiments are conducted on two co-registered optical–SAR datasets, Dongying [47] and Wuhan [48]. The Dongying dataset consists of Gaofen-2 (GF-2) RGB optical imagery and Gaofen-3 (GF-3) SAR imagery. We construct the dataset by cropping non-overlapping $256 \times 256$ optical–SAR image pairs from the original large-area scenes. To prevent data leakage, the training, validation, and test sets are drawn from different regions and are mutually exclusive. Specifically, 1500 pairs from a designated region are selected for training, 500 pairs from other regions are selected for testing, and 1831 pairs from the remaining regions are selected for validation. The validation set is larger than the training set mainly because online random cropping and data augmentation are adopted during training: for each training image pair, an aligned patch of fixed size ($128 \times 128$) is randomly cropped at each iteration, followed by random flipping and other augmentation operations, which substantially increases the effective number of training samples. In contrast, during validation and testing, neither random cropping nor augmentation is applied; inference is performed on the full image pairs and evaluation metrics are computed accordingly, yielding more stable and interpretable assessments. The larger validation set is intended to encompass a broader diversity of land cover scenarios and imaging conditions, thereby supporting more robust model selection and hyperparameter tuning. The Wuhan dataset follows the WHU-OPT-SAR dataset released by Li et al. It covers an area of approximately $5.1 \times 10^{4}\ \mathrm{km}^{2}$ in Hubei Province and provides 100 co-registered large-format image pairs of GF-1 optical and GF-3 SAR with a size of about $5556 \times 3704$ pixels. All images are resampled to a unified spatial resolution of 5 m and are annotated with seven land cover categories: farmland, city, village, water, forest, road, and others. To evaluate cross-dataset generalization, we cropped the large-area Wuhan dataset into $256 \times 256$ image pairs in the same manner and selected 500 pairs, which were not involved in the training process and were used only for testing to validate the robustness of the model.
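The two cropping regimes described above (fixed non-overlapping tiles for dataset construction, aligned random crops during training) can be sketched as follows. This is an illustrative NumPy version under assumptions: partial border tiles are discarded, and the only augmentation shown is a horizontal flip with probability 0.5; the function names are hypothetical.

```python
import numpy as np

def tile_pairs(opt, sar, size=256):
    # Cut non-overlapping size x size tiles at identical positions in both images.
    # Border regions smaller than a full tile are dropped (assumption).
    h, w = sar.shape[:2]
    tiles = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            tiles.append((opt[y:y + size, x:x + size],
                          sar[y:y + size, x:x + size]))
    return tiles

def random_paired_crop(opt, sar, rng, size=128):
    # Draw one crop window and apply it to both modalities so they stay aligned.
    h, w = sar.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    o = opt[y:y + size, x:x + size]
    s = sar[y:y + size, x:x + size]
    if rng.random() < 0.5:
        # The same horizontal flip is applied to both patches.
        o, s = o[:, ::-1], s[:, ::-1]
    return o, s
```

Sampling the window and the flip decision once and applying them to both inputs is what preserves the pixel-level co-registration that the fusion network relies on.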

All models are trained on an NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA). We use the Adam optimizer with a batch size of 6, train for 100 epochs, and set the initial learning rate to $2 \times 10^{-5}$. To stabilize training, we adopt an Exponential Moving Average (EMA) with decay 0.999 and apply gradient clipping with a max-norm of 5.0. During training, input images are randomly cropped into $128 \times 128$ patches and normalized to $[0, 1]$. We convert the RGB optical images into the YCbCr color space, use the luminance channel $Y$ as the input to the fusion network and generate the fused result, and then combine it with the original $C_{b}$ and $C_{r}$ channels and apply the inverse transformation back to RGB to obtain the color fused image. The loss function consists of a reconstruction consistency term and an adversarial term, and the specific configuration is consistent with [42,45].
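The color handling described above can be sketched in a few lines of NumPy. The text does not state which YCbCr variant is used; the coefficients below are the full-range BT.601 (JPEG/JFIF) ones and should be read as an assumption, and `fuse_y` is a hypothetical placeholder for the fusion network.

```python
import numpy as np

# Full-range BT.601 (JPEG/JFIF) RGB -> YCbCr matrix for inputs in [0, 1] (assumed variant).
M = np.array([[0.299, 0.587, 0.114],
              [-0.168736, -0.331264, 0.5],
              [0.5, -0.418688, -0.081312]])
M_INV = np.linalg.inv(M)
OFFSET = np.array([0.0, 0.5, 0.5])

def rgb_to_ycbcr(rgb):
    # rgb: (H, W, 3) in [0, 1] -> (H, W, 3) holding Y, Cb, Cr.
    return rgb @ M.T + OFFSET

def ycbcr_to_rgb(ycc):
    # Inverse transform, clipped back to the valid [0, 1] range.
    return np.clip((ycc - OFFSET) @ M_INV.T, 0.0, 1.0)

def fuse_color(rgb_opt, sar, fuse_y):
    # Fuse only the luminance channel; chrominance is copied from the optical image.
    ycc = rgb_to_ycbcr(rgb_opt)
    y_fused = fuse_y(ycc[..., 0], sar)
    fused = np.stack([y_fused, ycc[..., 1], ycc[..., 2]], axis=-1)
    return ycbcr_to_rgb(fused)
```

For example, `fuse_color(rgb, sar, lambda y, s: np.maximum(y, s))` substitutes a trivial element-wise maximum for the network. The clipping step matters because a modified luminance combined with the original chrominance can fall slightly outside the RGB gamut.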

4.2. Comparison Experiments

To systematically validate the effectiveness and generalization of WEMFusion for registered optical–SAR fusion, we conduct comparative experiments on the Dongying and Wuhan datasets and select a set of representative general-purpose fusion methods as baselines. Specifically, we include U2Fusion [49] and MLFuse [50], which are centered on convolutional feature extraction and encoder–decoder reconstruction; PSFusion [51] and SeAFusion [52], which improve both fusion fidelity and downstream task effectiveness by enforcing semantic and structural constraints; SwinFusion [20], which performs global modeling via self-attention, and the unified interactive fusion framework TITA [53]; MambaDFuse [45] and SFMFusion [54], which model long-range dependencies based on state-space models (SSMs); and VSFF [24], a traditional method specifically designed for optical–SAR fusion that integrates complementary decomposition with saliency mechanisms. These methods span mainstream technical routes, including convolutional representations, decomposition–reconstruction strategies, attention-based modeling, and SSM-based long-range dependency modeling, enabling a relatively comprehensive assessment of the proposed method’s relative advantages.

Metrics: We quantitatively evaluate fusion quality using seven metrics: entropy (EN [55]), standard deviation (SD), average gradient (AG), visual information fidelity (VIFF [56]), edge preservation index ($Q^{AB/F}$), feature similarity (FSIM), and peak signal-to-noise ratio (PSNR). For all seven metrics, larger values are better: higher EN, SD, AG, VIFF, $Q^{AB/F}$, and FSIM generally indicate higher information content, better sharpness, and improved structural fidelity, while a higher PSNR indicates smaller reconstruction error [57]. Since optical–SAR fusion tasks typically lack a ground-truth image, we compute PSNR and FSIM against each of the two source images in turn and report the average of the two values.
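The simpler reference-free metrics and the two-reference PSNR averaging can be sketched as below. These follow commonly used definitions and are assumptions rather than the exact evaluation code: EN uses a 256-bin histogram on images scaled to $[0, 1]$, and AG uses forward differences with the $\sqrt{(\Delta_x^2 + \Delta_y^2)/2}$ form, which is one of several variants in the literature.

```python
import numpy as np

def entropy(img, bins=256):
    # Shannon entropy in bits of the intensity histogram; img is in [0, 1].
    hist, _ = np.histogram(np.clip(img, 0.0, 1.0), bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def std_dev(img):
    return float(img.std())

def average_gradient(img):
    # Mean of sqrt((dx^2 + dy^2) / 2) over forward differences.
    dx = img[:-1, 1:] - img[:-1, :-1]
    dy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))

def psnr(ref, img, peak=1.0):
    mse = np.mean((ref - img) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def two_reference_psnr(fused, opt, sar):
    # Average of the PSNR values computed against each source image.
    return 0.5 * (psnr(opt, fused) + psnr(sar, fused))
```

Averaging over both references means that a method which simply copies one source is still penalized by its distance to the other source, although the resulting PSNR remains a proxy and not a measure against a true fused target.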

Table 1 summarizes the quantitative comparison results on the Dongying dataset. Overall, our method achieves the best performance on most core metrics, demonstrating stronger overall fusion capability. On the one hand, it leads in information-content and contrast-related metrics, such as EN, indicating that the fused images preserve more effective information while enhancing the dynamic range. On the other hand, it is also superior in structural consistency and perceptual quality metrics, such as VIFF, $Q^{AB/F}$, and FSIM, suggesting more stable performance in cross-modality structural alignment, edge transfer, and detail fidelity.

Compared with the other Mamba-based methods, MambaDFuse and SFMFusion, our approach further improves perceptual structural metrics and PSNR, showing that the proposed wavelet-domain decoupling combined with discrepancy-gated SSM fusion can more effectively suppress speckle-noise diffusion and reduce the introduction of pseudo structures, thereby producing more natural and more consistent fusion results. It is worth noting that although some baseline methods (e.g., VSFF) achieve slightly higher AG, their extremely low PSNR indicates that such high gradients mainly stem from the erroneous amplification of SAR speckle noise rather than genuine edge sharpness. In contrast, our method attains a better balance between sharpness and structural authenticity and is therefore more competitive in overall evaluation.

Table 2 summarizes the quantitative comparison results on the WHU-OPT-SAR dataset. Our method exhibits robust generalization in the quantitative evaluation of Wuhan, ranking at the forefront on five core metrics that jointly reflect information richness and structural fidelity. The results show that it not only maintains a very high dynamic range and information content in terms of EN and SD but also matches the state-of-the-art TITA method on the edge-transfer-related metric $Q^{AB/F}$ and FSIM, and it significantly outperforms traditional baselines such as VSFF.

It is worth noting that although U2Fusion achieves a relatively high PSNR due to smoothing filtering, it lags markedly behind on the other metrics. In contrast, our method improves $Q^{AB/F}$ by approximately 23% over U2Fusion, indicating that the proposed modules, such as HRB, can suppress speckle noise while preserving high-frequency textures and geometric details, thereby achieving a favorable balance between objective statistical metrics and subjective visual quality.

As shown in Figure 8, we select four representative scenarios from the Dongying test set for qualitative comparison, including building complexes, shoreline embankments, isolated islands, and vegetated areas. U2Fusion and MLFuse exhibit an overall tendency toward smooth reconstruction. As observed in the yellow zoomed-in box of the first row’s building scene, optical texture layers and geometric boundaries are noticeably weakened, resulting in missing high-frequency details and blurred edges. VSFF enhances local gradients more aggressively, but it tends to amplify SAR speckle noise, leading to granularity and spurious edges while weakening the injection of optical textures, which results in missing details in the building region of the first row. PSFusion, SeAFusion, and SFMFusion adopt more aggressive enhancement strategies: edges become more prominent but are prone to over-sharpening; in regions with strong speckle, their discrimination between details and noise is insufficient, leaving residual speckle and introducing pseudo-textures and granularity, as visible in the fourth-row vegetation scene. SwinFusion, TITA, and MambaDFuse show varying performance in structural preservation. SwinFusion is generally stable, yet it still suppresses fine-grained textures and weak targets such as the embankment boundary in the second row, leading to detail attenuation or softened edges. TITA maintains global structures, but its injection of strong scattering cues and local alignment is relatively conservative, so SAR-salient cues such as tree trunks in vegetated areas are not sufficiently highlighted and do not adhere well to optical boundaries. Benefiting from long-range modeling, MambaDFuse preserves overall structures better; however, local texture disturbances induced by speckle are still observable, making the textures less clean.

In contrast, our method achieves, within the zoomed-in regions, more accurate fidelity of optical texture contours, sharp yet non-over-sharpened boundaries, and effective incorporation of SAR-salient scattering cues, while markedly suppressing speckle diffusion and pseudo-textures. This yields cleaner local appearances, more consistent structures, and a more natural overall visual effect, demonstrating more robust cross-modality fusion capability.

As shown in Figure 9, we select four representative complex scenarios from the Wuhan dataset, including road intersections, building rooftops, mountain ridges, and waterfront boats. The compared methods exhibit different degrees of imbalance between cross-modality complementarity and noise suppression. VSFF, U2Fusion, and MLFuse are constrained by conservative feature extraction strategies and tend to produce low-contrast, over-smoothed results. This not only causes severe energy attenuation and color fading of salient spectral tones in dense built-up areas but also heavily compresses and blurs the geometric details of strong reflective targets, such as the mountain ridges in the third row. In contrast, PSFusion, SeAFusion, and SFMFusion improve edge sharpness but lack selectivity over high-frequency components; they mistakenly treat the inherent speckle noise in SAR as structural information and amplify it, thereby introducing cluttered granular artifacts and spectral distortions around road intersections or in regions with mountain backgrounds. In addition, deep-learning-based methods such as SwinFusion, TITA, and MambaDFuse still show limitations in local detail consistency and are prone to halo artifacts around point targets such as bright street lamps or vehicles or texture jitter at semantic boundaries between vegetation and man-made facilities.

In comparison, our method achieves, across multiple zoomed-in regions, clearer geometric boundaries, more consistent texture hierarchy, and cleaner backgrounds. It effectively preserves the natural appearance and continuous structures of optical imagery, while highlighting key scattering targets in SAR (e.g., highly reflective building areas and boats) without excessively amplifying noise. Consequently, it exhibits more robust performance in terms of visual naturalness, detail readability, and cross-modality structural consistency.

4.3. Ablation Experiments

4.3.1. Ablation Settings

To verify the contribution of each key component to optical–SAR fusion performance, we conduct ablation studies on the Dongying dataset while keeping the data split, training hyperparameters, and inference pipeline unchanged. The experiments continue to adopt the seven evaluation metrics used in the comparative study and additionally introduce the Structural Similarity Index Measure (SSIM [58]) to measure structural similarity. In Table 3, different variants are indexed as follows: (1) removing hybrid-modality enhancement (HME); (2) removing the high-frequency enhancement branch (HRB) in HME; (3) removing the low-frequency cross-modality structural context module (LF-CSCM) in HME; (4) removing the Mamba-based feature extraction stage; (5) replacing the heterogeneous optical/SAR feature extractors with a shared homogeneous four-layer Mamba stack; (6) replacing the fusion module DAG-MF with a conventional sequential Mamba fusion; (7) removing discrepancy-aware gated fusion (DAG-MF) and using a standard fusion strategy instead; and (8) removing the adversarial consistency constraint. The complete model is denoted Full.

4.3.2. Overall Result Analysis

The ablation results indicate that the full model (Full) is the most robust overall, achieving a better comprehensive balance across multiple evaluation metrics. Full attains the best performance on the key metrics SD, $Q^{AB/F}$, VIFF, and FSIM, while the remaining metrics exhibit only small and acceptable deviations from their respective best values. This indicates that the proposed combination of wavelet-domain band-wise prior enhancement (HME) and discrepancy-driven deep interaction (DAG-MF) enables more stable complementary fusion under the pronounced statistical discrepancy and speckle interference of optical–SAR imagery. Compared with variants that remove a single component, Full simultaneously preserves information content, structural similarity, and perceptual fidelity, validating the synergistic gains of the sub-modules in suppressing erroneous pseudo-high-frequency injection, maintaining structural alignment and directional continuity, and alleviating cross-domain distribution bias.

4.3.3. Contribution Analysis of the HME Branches

Removing HME leads to an overall degradation, with more pronounced drops on SD, FSIM, SSIM, and PSNR, which are closely related to detail fidelity and structural consistency. This indicates that the band-wise prior before the fusion backbone is crucial for suppressing the erroneous propagation of SAR pseudo-high-frequency components and speckle noise, thereby stabilizing discriminative details and structural similarity. Further sub-module ablations show that removing the high-frequency enhancement branch HRB more readily weakens EN and SD, which reflect information content and detail strength, suggesting that HRB mainly highlights genuine optical edges and directional textures while reducing high-frequency pseudo-texture interference. In contrast, removing the low-frequency cross-modality structural context module LF-CSCM tends to affect FSIM, SSIM, and AG, which are more related to structural consistency and regional stability, indicating that LF-CSCM is more dedicated to establishing structural-trend alignment on the low-frequency side and preventing noise perturbations from leaking into structural components. Overall, the two branches complement each other along the two dimensions of detail enhancement and structural stabilization.

4.3.4. Verification of the Mamba Feature Extraction Strategy

At the representation extraction stage, removing the Mamba-based feature extractor causes noticeable drops in VIFF, $Q^{AB/F}$, SSIM, and PSNR, indicating that long-range dependency modeling plays a critical role in preserving directionally continuous structures, cross-region consistency, and perceptual quality in optical–SAR fusion. Further replacing the heterogeneous optical/SAR extractors with a shared homogeneous stack also leads to degradation, confirming that the two modalities exhibit substantial differences in statistical distributions and inductive biases. Employing modality-specific extractors is therefore more beneficial for enhancing the credible representation of optical edges while suppressing speckle-noise-induced pseudo-texture responses in SAR, providing a cleaner feature basis for subsequent fusion interaction.

4.3.5. Verification of the Fusion Interaction Mechanism

In the mechanism validation of the fusion strategy, replacing the discrepancy-aware fusion module DAG-MF with a conventional sequential Mamba fusion or directly removing DAG-MF consistently leads to degradation in structure-related metrics: after removing DAG-MF, SD exhibits a relatively notable decrease, while SSIM, VIFF, FSIM, and $Q^{AB/F}$ (metrics related to structural and edge information transfer) are simultaneously impaired. This indicates that relying solely on content-driven sequential interactions is insufficient to achieve a stable balance between structural alignment and noise suppression. In contrast, DAG-MF generates dynamic gating from discrepancy and complementary responses and uses it to modulate the state-space interactions under two-dimensional directional scanning, enabling the model to more selectively suppress pseudo-texture injection in discrepant regions while maintaining stable long-range structural dependencies in consistent regions. Consequently, DAG-MF plays a key role in improving cross-modality consistency and boundary stability.

4.3.6. Role of the Adversarial Consistency Constraint

After removing the adversarial consistency constraint, metrics more related to perceptual quality and statistical matching, such as VIFF and PSNR, decrease, indicating that the lightweight adversarial constraint can effectively alleviate cross-domain distribution bias and reduce statistical unnaturalness and local artifacts in the fused results, thereby improving overall visual naturalness and perceptual fidelity.

4.4. Comparison Experiments on Downstream Segmentation Tasks

To evaluate the practical gains of fusion results for downstream semantic segmentation, we conduct experiments on the cropped WHU-OPT-SAR dataset, using a total of 5292 co-registered image pairs. Specifically, the fused images generated by the different fusion methods and their corresponding remapped labels are cropped into $256 \times 256$ patches using a sliding-window strategy, and the dataset is split into training, validation, and test sets at a ratio of 3:1:1. These fused patches are used as the only input to train and infer with SegFormer (MiT-B2) [59], ensuring that segmentation differences are solely caused by the upstream fusion quality. Considering the pronounced class imbalance and the insufficient samples of several thin and small classes (e.g., road and village) as well as the miscellaneous class (others) in the original annotations, we remap the original eight semantic categories into five classes to improve evaluation stability and focus on the major land cover types. Under the same data split and training/testing protocol, we compare overall metrics, including overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, together with class-wise segmentation scores for Farmland, City, Water, Forest, and Others.
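For reference, OA, AA, and the Kappa coefficient can all be derived from a single confusion matrix. The sketch below uses the standard definitions; treating the class-wise score as per-class accuracy (recall) is an assumption, since the text does not specify which class-wise measure is reported.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Rows are ground-truth classes, columns are predicted classes.
    idx = n_classes * y_true.ravel() + y_pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def segmentation_scores(cm):
    cm = cm.astype(float)
    total = cm.sum()
    oa = np.trace(cm) / total                                  # overall accuracy
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1.0)  # per-class accuracy
    aa = per_class.mean()                                      # average accuracy
    # Expected agreement by chance, from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa, per_class
```

OA is dominated by the large classes, whereas AA weights every class equally and Kappa discounts chance agreement, which is why reporting the three together is informative under the class imbalance noted above.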

The quantitative results in Table 4 show that our method is almost on par with the strongest baseline, SFMFusion, in terms of overall metrics and performs better than most other baselines. More importantly, at the class level, our method achieves the best scores on Water and Forest and also reaches 80.21 on City, demonstrating stable characterization capability for continuous water boundaries and large-area textured land covers. This advantage is consistent with our joint design of frequency-domain priors and long-range modeling. Specifically, DAG-MF explicitly injects cross-modality discrepancies into Mamba’s selective update process, so that long-range dependency modeling is selectively activated according to discrepancies. Mechanistically, this suppresses the erroneous diffusion of SAR speckle-induced pseudo-textures and enhances the directional continuity and cross-scale consistency of linear structures such as roads and embankments, thereby improving the discriminative stability of segmentation for water boundaries and forest textures. Meanwhile, A3F performs position-adaptive trade-offs between prior-enhanced representations and deep discrepancy-driven fusion representations, further ensuring a balance between detail enhancement and structural consistency.

It should be noted that there remains room for improvement on mixed long-tailed categories such as Others, which may be related to class imbalance. In addition, although SFMFusion shows a slight advantage on overall metrics, it comes at a substantial cost: as reported in Table 5, its runtime and peak memory reach 1129.842 ms and 1360.82 MB, respectively, whereas our method requires only 104.181 ms and 264.75 MB, corresponding to roughly a 10.8-fold reduction in latency and a 5.1-fold reduction in peak memory. Thus, while achieving near-optimal accuracy, our approach markedly reduces inference latency and memory consumption, better satisfying the efficiency and stability requirements for practical downstream deployment.

As shown in Figure 10, this example mainly contains four land cover categories: brown (Farmland), green (Forest), blue (Water), and a small amount of yellow (Village). Using the label (ground-truth annotation) as the reference, the comparisons indicate that methods such as U2Fusion and MLFuse suffer from spectral-information loss during fusion, leading to pronounced semantic confusion: a large number of fragmented yellow village regions are incorrectly submerged into farmland and forest, and large areas of farmland are misclassified as forest. In contrast, PSFusion and SeAFusion are affected by residual noise; compared with the label, their segmentation results also exhibit evident discrepancies, where large forest regions are recognized as farmland. Although VSFF, SFMFusion, and MambaDFuse preserve the overall structural integrity, boundary blurring and missing parts still occur at interfaces between different land cover types, such as the farmland–water boundary highlighted in the red box.

We observe that SFMFusion is slightly superior in overall segmentation performance, whereas our method achieves comparable segmentation accuracy while significantly improving inference efficiency. As evidenced by the analyses of the figures and tables, our model demonstrates stronger discriminative capability in boundary regions between classes such as water bodies and forests, better preserving regional consistency and category discrimination in complex conditions, thereby providing a more balanced trade-off between accuracy and efficiency. Overall, these results indicate that the proposed fusion strategy is effective in suppressing misclassifications induced by SAR speckle noise and exhibits advantages in maintaining region connectivity and boundary stability, making it a practical choice for supporting downstream semantic segmentation under efficiency constraints.

4.5. Efficiency Comparison Experiments

Using the same experimental settings, we select representative methods from different categories and compare their parameter size (Params), FP32 (32-bit single-precision floating-point) inference time, and FP32 peak GPU memory. All methods are benchmarked on a single NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA, Santa Clara, CA, USA) and take a pair of single-channel images as input, with the resolution fixed to $128 \times 128$ and batch size $B = 6$. The inference time is averaged over multiple forward passes and reported as the latency per image pair, while the peak memory is measured as the maximum GPU memory allocation during a single forward pass. As reported in Table 5, Transformer-based methods (SwinFusion and TITA) are substantially higher than others in both runtime and peak memory, reflecting the considerable computation and storage overhead introduced by global attention under high-resolution inputs. MambaDFuse has a relatively controllable parameter scale, yet its runtime and memory consumption remain at a comparatively high level, indicating that long-range modeling still incurs non-negligible resource costs in practical deployment. SFMFusion, due to its three-branch auxiliary design and spatial–frequency enhanced Mamba, further increases peak memory usage.
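A latency measurement of the kind described here can be sketched with a small, framework-agnostic harness. The names `benchmark`, `forward`, and `sync` are hypothetical, the warm-up and run counts are arbitrary defaults, and dividing by the batch size to obtain a per-pair latency is one reading of the protocol, not a confirmed detail.

```python
import time

def benchmark(forward, batch, batch_size, n_warmup=5, n_runs=20, sync=lambda: None):
    # Warm-up passes are excluded from timing (caches, lazy initialization).
    for _ in range(n_warmup):
        forward(batch)
    sync()  # wait for pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        forward(batch)
    sync()  # wait again so the measured interval covers all timed passes
    elapsed = time.perf_counter() - start
    # Average latency per image pair, in milliseconds.
    return 1000.0 * elapsed / (n_runs * batch_size)
```

On a GPU, `sync` would need to be a device synchronization call supplied by the framework, since kernels are launched asynchronously and the wall-clock time would otherwise be underestimated. Peak memory has to be read from the framework's own allocator statistics and is not covered by this sketch.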

In contrast, our method maintains a moderate increase in parameters while effectively keeping peak memory at a low level and achieving better inference efficiency; overall, its cost is clearly lower than that of Transformer-based methods and other Mamba-based approaches. Although MLFuse and SeAFusion are lighter and faster, they are typically accompanied by limited fusion representational capacity. Considering both fusion performance and resource overhead, WEMFusion offers a more favorable accuracy–efficiency trade-off, exhibiting stronger engineering practicality and deployment potential.

5. Discussion

In the main comparative experiments on the Dongying and Wuhan datasets, WEMFusion remains in the lead or consistently ranks in the top tier on most metrics, indicating that the proposed strategy—combining band-wise prior constraints with deep discrepancy-driven interaction—can perform stably under cross-scene statistical shifts and varying speckle intensities. In contrast, although some methods are more aggressive on sharpening-related metrics such as AG and SD, they are more likely to introduce side effects including unnatural textures or unstable boundaries. The qualitative results are consistent with this observation; our method presents more coherent contours and more natural texture transitions at land cover interfaces, reflecting effective fidelity to trustworthy optical high-frequency information after wavelet-domain decoupling, together with more controllable suppression of SAR speckle. Moreover, the efficiency comparison shows that these quality gains are achieved while maintaining more reasonable inference latency and peak memory consumption, demonstrating a more favorable trade-off between performance and cost.
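The caution about sharpening-related metrics can be illustrated with a toy example: noise alone raises both AG and SD, so higher values do not necessarily mean better detail. The sketch below uses formulations of SD and AG that are common in the fusion literature and may differ in detail from the definitions used in our experiments. The step image and the Gaussian multiplicative noise are assumptions chosen for simplicity; real SAR speckle follows different statistics.

```python
import math
import random


def standard_deviation(img):
    """Population standard deviation of all pixel values."""
    pixels = [p for row in img for p in row]
    mu = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mu) ** 2 for p in pixels) / len(pixels))


def average_gradient(img):
    """Mean of sqrt((dx^2 + dy^2) / 2) over forward differences."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((h - 1) * (w - 1))


def make_step_image(size=32, low=60.0, high=180.0):
    """Toy scene: dark left half, bright right half, one clean vertical edge."""
    return [[low if j < size // 2 else high for j in range(size)] for _ in range(size)]


def add_multiplicative_noise(img, sigma=0.2, seed=0):
    """Speckle-like stand-in: each pixel is scaled by a Gaussian factor around 1."""
    rng = random.Random(seed)
    return [[p * max(0.0, rng.gauss(1.0, sigma)) for p in row] for row in img]


if __name__ == "__main__":
    clean = make_step_image()
    noisy = add_multiplicative_noise(clean)
    for label, img in (("clean", clean), ("noisy", noisy)):
        print(f"{label}: SD = {standard_deviation(img):.2f}, AG = {average_gradient(img):.2f}")
```

The noisy image contains no additional structure, yet it scores higher on both metrics, which is why AG and SD are best read together with fidelity-oriented metrics such as FSIM and PSNR.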

The downstream semantic segmentation experiments further suggest that the fusion gains are not merely visual. Under a unified segmentation network setting, WEMFusion yields more stable performance on categories that heavily rely on structural boundaries and regional connectivity, such as Water, Forest, and City, implying that our fused images better support downstream models in learning clear semantic boundaries and consistent region representations, thereby improving interpretability and task usability.

The ablation study provides a mechanism-level explanation for these advantages. Removing HME causes overall drops in metrics such as SD, FSIM, and PSNR; a finer decomposition shows that HRB mainly improves the discriminability of high-frequency details, whereas LF-CSCM emphasizes low-frequency structural alignment and global consistency modulation, and the two complement each other in terms of detail fidelity and structural stability. For representation extraction, removing the Mamba-based feature extractor or replacing the heterogeneous extractors with an identical architecture shared by both modalities leads to performance degradation, indicating that, given the pronounced statistical differences between optical and SAR modalities, modality-specific long-range modeling is crucial for preserving directionally continuous structures. For fusion strategies, replacing or removing DAG-MF makes it difficult to simultaneously carry out structural alignment and noise suppression, further highlighting the critical role of discrepancy-aware modulation in deep interaction. It should be noted that our method still relies on high-quality co-registration, and fine-grained small targets and complex boundaries remain challenging under strong-noise conditions; future work will further improve generalization and stability in real-world scenarios.

6. Conclusions

To address key challenges in optical–SAR fusion, including high-frequency semantic conflicts, difficulties in structural alignment, and speckle-noise diffusion, we propose WEMFusion, a wavelet-enhanced Mamba fusion framework guided by frequency-domain priors. The proposed method explicitly decouples source images into low-frequency structural components and high-frequency directional components via a multi-scale discrete wavelet transform, endowing end-to-end learning with clear physical interpretability and a more robust band-wise alignment mechanism. Built upon this prior, we develop a hybrid-modality enhancement (HME) module, which generates high-quality hybrid prior representations through optical-guided high-frequency detail injection and low-frequency semantic modulation. Furthermore, we design a discrepancy-aware gated Mamba fusion (DAG-MF) block, which innovatively injects cross-modality discrepancies into the two-dimensional state-space scanning process, enabling selective perception of salient discrepant regions and stable long-range dependency modeling over consistent regions. Combined with a lightweight adaptive attentional feature fusion (A3F) module and a coarse-to-fine wavelet reconstruction strategy, WEMFusion demonstrates significant advantages in structural preservation, edge sharpness, and cross-modality consistency on the Dongying and Wuhan datasets, while maintaining low computational overhead. These results validate the effectiveness of coupling frequency-domain decoupling with discrepancy-driven SSM fusion. Future work will extend the proposed frequency-domain priors and state-space modeling to a broader range of multi-source sensor combinations.
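For readers less familiar with the band-wise decoupling summarized above, the following sketch shows a single-level 2D discrete wavelet transform and its exact inverse. It uses the Haar basis purely as an assumption for illustration; it is not the WEMFusion implementation, and the sub-band names (`ll`, `lh`, `hl`, `hh`) follow one of several naming conventions in use.

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT of an image with even height and width.

    Returns (ll, lh, hl, hh): one low-frequency band and three directional
    high-frequency bands, each half the size of the input.
    """
    h, w = len(img), len(img[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]
    lh = [[0.0] * (w // 2) for _ in range(h // 2)]
    hl = [[0.0] * (w // 2) for _ in range(h // 2)]
    hh = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2.0  # local average (structure)
            lh[r][s] = (a - b + c - d) / 2.0  # left-right change (vertical edges)
            hl[r][s] = (a + b - c - d) / 2.0  # top-bottom change (horizontal edges)
            hh[r][s] = (a - b - c + d) / 2.0  # diagonal change
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: reconstructs the original image exactly."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for s in range(w):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2.0
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2.0
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2.0
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2.0
    return img


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    bands = haar_dwt2(image)
    restored = haar_idwt2(*bands)
    print(bands[0])   # low-frequency band
    print(restored)   # matches the input up to floating-point error
```

A multi-scale decomposition is obtained by applying `haar_dwt2` again to the `ll` band, and coarse-to-fine reconstruction reverses those steps with `haar_idwt2`.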

Author Contributions

Conceptualization, J.W. and F.S.; methodology, Y.Z., L.M., B.Z. and J.W.; software, Y.Z.; formal analysis, L.M. and Z.C.; investigation, F.S. and Z.C.; resources, B.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the Guangdong Basic and Applied Basic Research Foundation (No. 2025B1515020076), the Foundation of Shenzhen City under Grant JCYJ20230808105359045, and the National Natural Science Foundation of China (Nos. 62571342, 62431021, 62571472).

Data Availability Statement

The WHU-OPT-SAR dataset and the Dongying dataset are publicly available and can be accessed at https://github.com/AmberHen/WHU-OPT-SAR-dataset (accessed on 29 January 2026) and https://github.com/XD-MG/DDHRNet (accessed on 29 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Example samples from the Dongying dataset: (a) optical image; (b) SAR image; (c) fused image. The red bounding box denotes the region of interest for detailed comparison.

Figure 2. Discrete wavelet transform ($DWT$) grayscale decomposition results of optical (OPT) and SAR images.

Figure 3. Overall architecture of the proposed WEMFusion framework: (a) end-to-end optical–SAR fusion pipeline; (b) WEM module; (c) OPT/SAR Mamba extractor module. “*” denotes the multiplication operation.

Figure 4. Schematic illustration of the low-frequency cross-modality structural context module (LF-CSCM). “*” denotes the multiplication operation.

Figure 5. Schematic illustration of the high-frequency residual boosting block (HRB). “*” denotes the multiplication operation.

Figure 6. Discrepancy-aware gated Mamba fusion module (DAG-MF) (a) and its DAG-SS2D core (b). “*” denotes the multiplication operation.

Figure 7. Schematic diagram of the adaptive attentional feature fusion (A3F) module.

Figure 8. Qualitative visual comparison of fusion results on the Dongying dataset. The yellow box denotes the region of interest shown in a zoomed-in view for comparison.

Figure 9. Qualitative visual comparison of fusion results on the WHU-OPT-SAR dataset. The yellow box denotes the region of interest shown in a zoomed-in view for comparison.

Figure 10. Qualitative comparison of different optical–SAR fusion methods on the downstream semantic segmentation task. The red bounding box denotes the region of interest for detailed comparison.

Table 1. Quantitative comparison results on the Dongying dataset, where red indicates the best value, and blue indicates the second-best value.

Method	EN	SD	$Q^{AB/F}$	VIFF	AG	FSIM	PSNR
U2Fusion	6.57	34.55	0.52	0.48	9.59	0.56	19.30
VSFF	6.88	34.17	0.28	0.52	10.38	0.48	16.32
SwinFusion	6.89	40.27	0.71	0.87	9.74	0.61	23.11
MLFuse	6.55	31.15	0.51	0.52	7.53	0.57	20.12
PSFusion	6.89	38.58	0.65	0.61	9.42	0.59	19.48
SeAFusion	6.86	39.50	0.66	0.70	9.95	0.58	20.44
TITA	6.81	38.54	0.69	0.81	9.42	0.60	21.24
MambaDFuse	6.88	40.30	0.70	0.86	9.74	0.61	23.14
SFMFusion	6.89	40.98	0.68	0.71	10.31	0.59	18.32
Ours	6.90	40.50	0.71	0.88	9.70	0.62	23.40

Table 2. Quantitative comparison results on the WHU-OPT-SAR dataset, where red indicates the best value, and blue indicates the second-best value.

Method	EN	SD	$Q^{AB/F}$	VIFF	AG	FSIM	PSNR
U2Fusion	6.81	38.99	0.59	0.44	15.01	0.54	18.96
VSFF	6.73	30.89	0.17	0.44	11.74	0.44	16.69
SwinFusion	6.90	39.94	0.71	0.59	15.96	0.57	18.12
MLFuse	6.47	31.19	0.51	0.42	11.87	0.53	18.10
PSFusion	6.95	38.46	0.64	0.43	13.81	0.55	17.76
SeAFusion	6.84	37.74	0.61	0.36	15.42	0.53	17.86
TITA	6.94	40.26	0.73	0.59	16.59	0.57	17.51
MambaDFuse	6.81	38.71	0.65	0.51	15.07	0.55	17.42
SFMFusion	6.84	39.78	0.69	0.49	16.45	0.56	14.97
Ours	6.95	40.18	0.73	0.55	16.03	0.57	17.84

Table 3. Quantitative ablation results of WEMFusion on the Dongying dataset, where the best values are highlighted in red.

Configuration	EN	SD	$Q^{AB/F}$	VIFF	AG	FSIM	PSNR	SSIM
(1) w/o HME	6.901	40.46	0.70	0.86	9.69	0.61	23.26	0.583
(2) w/o HRB	6.894	40.29	0.71	0.87	9.68	0.62	22.94	0.592
(3) w/o LF-CSCM	6.907	40.50	0.70	0.87	9.68	0.61	23.29	0.587
(4) w/o Mamba Extractor	6.893	40.24	0.69	0.82	9.67	0.61	22.70	0.582
(5) Shared Extractor	6.904	40.46	0.70	0.87	9.72	0.61	23.31	0.590
(6) Mamba Fusion	6.901	40.39	0.71	0.87	9.71	0.61	23.26	0.591
(7) w/o DAG-MF	6.903	40.45	0.70	0.87	9.68	0.61	23.42	0.577
(8) w/o GAN	6.893	40.18	0.70	0.84	9.73	0.61	22.71	0.594
Full	6.905	40.50	0.71	0.88	9.70	0.62	23.40	0.593

Table 4. Quantitative results of downstream semantic segmentation on the Wuhan dataset, where red indicates the best value and blue indicates the second-best value.

Method	OA	AA	Kappa	Farmland	City	Water	Forest	Others
U2Fusion	81.58	70.87	71.05	81.92	76.57	60.44	91.19	44.22
VSFF	82.73	74.33	72.96	80.90	81.65	62.11	92.22	54.75
MLFuse	82.12	71.74	71.77	81.98	77.70	61.70	91.67	45.67
SeAFusion	82.03	71.32	71.72	81.77	75.73	60.77	91.93	46.42
PSFusion	81.73	70.69	71.16	82.04	77.93	56.36	92.01	45.10
TITA	81.22	69.62	70.42	82.82	75.39	59.03	90.71	40.14
MambaDFuse	83.11	74.50	73.61	82.09	78.69	65.06	91.75	54.92
SFMFusion	83.47	74.61	74.14	83.52	78.79	64.79	91.59	54.35
Ours	83.40	74.51	74.02	82.05	80.21	66.93	92.56	50.78
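As a consistency check on Table 4, AA appears to be the unweighted mean of the five per-class accuracies. This reading is inferred from the reported numbers rather than from a definition given in this section, and the helper names below are hypothetical.

```python
# Rows copied from Table 4: reported AA, then the per-class accuracies
# in the order Farmland, City, Water, Forest, Others.
TABLE4 = {
    "U2Fusion": (70.87, [81.92, 76.57, 60.44, 91.19, 44.22]),
    "VSFF": (74.33, [80.90, 81.65, 62.11, 92.22, 54.75]),
    "MLFuse": (71.74, [81.98, 77.70, 61.70, 91.67, 45.67]),
    "SeAFusion": (71.32, [81.77, 75.73, 60.77, 91.93, 46.42]),
    "PSFusion": (70.69, [82.04, 77.93, 56.36, 92.01, 45.10]),
    "TITA": (69.62, [82.82, 75.39, 59.03, 90.71, 40.14]),
    "MambaDFuse": (74.50, [82.09, 78.69, 65.06, 91.75, 54.92]),
    "SFMFusion": (74.61, [83.52, 78.79, 64.79, 91.59, 54.35]),
    "Ours": (74.51, [82.05, 80.21, 66.93, 92.56, 50.78]),
}


def mean_class_accuracy(per_class):
    """Unweighted mean of the per-class accuracies."""
    return sum(per_class) / len(per_class)


def max_aa_deviation(table):
    """Largest gap between the reported AA and the recomputed mean."""
    return max(abs(mean_class_accuracy(acc) - aa) for aa, acc in table.values())


if __name__ == "__main__":
    for name, (aa, acc) in TABLE4.items():
        print(f"{name:<11} reported AA = {aa:.2f}, recomputed = {mean_class_accuracy(acc):.3f}")
    print(f"largest deviation: {max_aa_deviation(TABLE4):.4f}")
```

Every recomputed mean agrees with the reported AA to within rounding at two decimal places, so the small AA gap between SFMFusion and our method reflects the Others and Farmland columns, where SFMFusion scores higher, offset by our higher City, Water, and Forest scores.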

Table 5. Comparison of computational complexity and inference efficiency.

Method	Params (M)	Runtime (ms)	Peak Memory (MB)
MLFuse	0.1118	4.941	179.59
SeAFusion	0.1669	1.187	237.78
SwinFusion	2.4154	431.204	824.42
MambaDFuse	1.3481	175.842	980.31
TITA	1.9235	446.536	753.15
SFMFusion	8.1375	1129.842	1360.82
Ours	1.5677	104.181	264.75

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
