MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images

Wang, Jinwei; Ma, Liang; Zhao, Bo; Gou, Zhenguang; Yin, Yingzheng; Sun, Guangcai

doi:10.3390/rs17223740

Open AccessArticle

MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images

by

Jinwei Wang

^1,*

,

Liang Ma

¹

,

Bo Zhao

²

,

Zhenguang Gou

¹

,

Yingzheng Yin

¹ and

Guangcai Sun

³

¹

School of Physics and Electronic Information, Yantai University, Yantai 264005, China

²

State Key Laboratory of Radio Frequency Heterogeneous Integration, Shenzhen University, Shenzhen 518060, China

³

National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(22), 3740; https://doi.org/10.3390/rs17223740

Submission received: 19 September 2025 / Revised: 31 October 2025 / Accepted: 13 November 2025 / Published: 17 November 2025

(This article belongs to the Special Issue Advanced Algorithms and Techniques for Remote Sensing Image Processing)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

The MRLF model outperforms state-of-the-art methods like DenseFuse, SwinFusion, and others on both the Dongying and Xi’an datasets. Quantitatively, it achieves an information entropy (EN) of 6.72 and a structural similarity (SSIM) of 0.63 on the Dongying dataset, indicating enhanced information richness and structural preservation. On the Xi’an dataset, it excels in spatial frequency (SF = 24.10) and gradient similarity (GS = 0.862), highlighting its ability to retain fine details and textures in complex urban scenes.
The model’s hierarchical fusion mechanism, incorporating multi-resolution decomposition and dual-attention modules, successfully addresses cross-modal radiometric discrepancies and multi-scale mismatches. Ablation experiments confirm that modules like the Complementary Feature Extraction Model (CFEM) and brightness distribution alignment reduce artifacts, with the fusion strategy yielding stable results even on large-pixel data, as shown in quantitative metrics like PSNR and FSIM.

What is the implication of the main finding?

By integrating optical and SAR images with high radiometric consistency, MRLF provides a robust scheme for continuous environmental monitoring. This is crucial for scenarios like disaster assessment and urban planning, where weather-independent data fusion ensures reliable information extraction under varying conditions, as noted in the document’s emphasis on “all-weather remote sensing monitoring”.
The model’s ability to preserve complementary features supports downstream applications such as semantic segmentation and target recognition. The document highlights that future work will leverage this fusion technology to improve higher-level tasks, enabling more accurate analysis in fields like agricultural monitoring and resource management.

Abstract

To enhance the comprehensive representation capability and fusion accuracy of remote sensing information, this paper proposes a multi-resolution hierarchical fusion network (MRLF) tailored to the heterogeneous characteristics of optical and synthetic aperture radar (SAR) images. By constructing a hierarchical feature decoupling mechanism, the method decomposes input images into low-resolution global structural features and high-resolution local detail features. A residual compression module is employed to preserve multi-scale information, laying a complementary feature foundation for subsequent fusion. To address cross-modal radiometric discrepancies, a pre-trained complementary feature extraction model (CFEM) is introduced. The brightness distribution differences between SAR and fusion results are quantified using the Gram matrix, and mean-variance alignment constraints are applied to eliminate radiometric discontinuities. In the feature fusion stage, a dual-attention collaborative mechanism is designed, integrating channel attention to dynamically adjust modal weights and spatial attention to focus on complementary regions. Additionally, a learnable radiometric enhancement factor is incorporated to enable efficient collaborative representation of SAR textures and optical semantics. To maintain spatial consistency, hierarchical deconvolution and skip connections are further used to reconstruct low-resolution features, gradually restoring them to the original resolution. Experimental results demonstrate that MRLF significantly outperforms mainstream methods such as DenseFuse and SwinFusion on the Dongying and Xi’an datasets. The fused images achieve an information entropy (EN) of 6.72 and a structural similarity of 1.25, while maintaining stable complementary feature retention under large-scale scenarios. By enhancing multi-scale complementary features and optimizing radiometric consistency, this method provides a highly robust multi-modal representation scheme for all-weather remote sensing monitoring and disaster emergency response.

Keywords:

multi-source image fusion; deep learning; optical (OPT) and synthetic aperture radar (SAR) image fusion; feature correlation

1. Introduction

Multisource remote sensing information fusion technology overcomes the limitations of single-sensor systems, such as insufficient information, poor environmental adaptability, and susceptibility to interference. By fusing data from multiple sensors, it enhances the accuracy and diversity of regional observations [1]. Optical imaging relies on spectral reflectance, providing rich color details but limited by weather conditions; SAR (Synthetic Aperture Radar) achieves all-weather detection through active microwave, capturing target structure and morphological information. These two modalities are complementary in imaging methods and features [2]. Their fusion integrates spectral information with structural information, enabling comprehensive information monitoring [3,4]. Fused images, which combine spectral and all-weather characteristics, support subsequent analysis and target recognition, with broad application prospects in fields such as environmental monitoring, urban planning, agricultural monitoring, disaster assessment, and resource management [5,6].

Image fusion algorithms are categorized into traditional methods and deep learning-based methods. Traditional methods include wavelet transform [7,8], pyramid transform [9,10], non-downsampled shearlet transform [11,12], saliency detection [13,14], subspace methods [15,16,17], and sparse representation [18,19]. Although these methods achieve information fusion, they suffer from limitations such as slow processing, information loss, noise sensitivity, and poor flexibility, driving research into new fusion technologies.

Deep learning-based fusion methods leverage deep neural networks to learn adaptive feature extraction and fusion rules, overcoming traditional limitations. Examples include DeepFuse [20], DenseFuse [21], and bilinear pooling layer fusion networks [22], which aim to improve fusion performance across modalities using deep learning architectures. However, existing methods often exhibit insufficient feature utilization or fail to fully capture complementary information between modalities. For instance, DeepFuse generates images via low-level feature fusion but has low feature utilization; DenseFuse introduces dense connection blocks to enhance inter-layer feature extraction but still lacks in learning associations between different modalities; while bilinear pooling networks optimize deep semantic features, they require further strengthening of second-order feature associations. In contrast, branch CNN networks [23] effectively address imaging differences between optical and SAR images through multi-scale feature extraction, outperforming single-branch networks. Methods like SOSTF [24] and MSFNet [25] enhance feature extraction and cross-level fusion to some extent, improving feature utilization. Nevertheless, these methods focus on merging pixel brightness and texture details or strengthening information transmission via residual/dense connections, yet fail to deeply exploit complementary information arising from imaging mechanism differences between optical and SAR images—such differences directly cause radiometric inconsistency-induced brightness discontinuities. Existing strategies often neglect cross-modal radiometric correction and adaptive alignment, struggling to bridge brightness gaps; additionally, single-resolution processing frameworks cannot effectively resolve multi-scale mismatches between global structures and local details, leaving issues like loss of small-scale textures and weakening of large-scale structures unresolved.

To address the above challenges—including insufficient complementary information mining due to overlooked imaging mechanism differences, radiometric inconsistency-induced brightness discontinuities, and multi-scale mismatches under single-resolution frameworks—this paper proposes a Multi-resolution Hierarchical Fusion Network (MRLF). This network optimizes multi-resolution feature extraction, attention-based fusion, and brightness distribution alignment to break through modality gaps. Specifically, the multi-resolution interpretation module hierarchically processes inputs using multi-resolution analysis [26], decomposing images into multi-scale features: a high-resolution branch captures edge/texture details, while a low-resolution branch extracts global context [27], balancing local details and global structures. The dual-attention fusion module focuses on integrating SAR’s topographic reflection intensity information [28], restoring a unified resolution via low-resolution reconstruction to preserve information integrity while effectively combining SAR’s all-weather advantages with optical images’ rich spectral information. Additionally, to mitigate brightness discontinuities caused by radiometric inconsistency, a brightness distribution alignment module is introduced to adaptively correct brightness discrepancies and alleviate multi-scale mismatches. This design significantly enhances fusion quality. The main contributions of this paper include:

A Multi-Resolution Layered Fusion Network (MRLF) is proposed for optical and SAR image fusion. Its multi-resolution module decomposes input features into distinct levels to capture details and semantic information, while intensity and gradient terms ensure controllable fusion across resolutions.
To coordinate multi-scale mismatches between global structures and local details, spatial and channel attention mechanisms are applied between features at different resolutions, along with designed generation mechanisms for spatial attention tensors and channel attention vectors, enhancing feature interaction under varying resolutions.
A contrast analysis method is proposed to address brightness discontinuities caused by radiometric inconsistency. By quantifying differences in brightness distribution and feature correlation between SAR images and fused results, this method reduces brightness discontinuity and strengthens the expression of key SAR information during fusion.
Experimental validation uses public datasets (Dongying and Xi’an) with large-pixel optical and SAR image pairs to comprehensively evaluate the model’s advantages and effectiveness in practical applications. Results demonstrate strong robustness and accuracy in fusing large-pixel image data.

The structure of this paper is organized as follows: Section 2 reviews related work that inspired our study. Section 3 details the proposed Multi-Resolution Multi-Layer Fusion Model (MRLF). Section 4 evaluates the model through experiments comparing deep learning-based image fusion methods. Section 5 concludes the paper.

2. Related Works

In this section, we reviewed some relevant previous studies, with a focus on inspiring related work, including the design ideas of loss functions and innovative modality enhancement methods.

2.1. Deep Learning-Based Image Fusion Framework

DenseFuse adopts a densely connected structure to integrate information from different modalities for multi-source image fusion. In this structure, each layer is connected to all previous layers, enabling effective information flow and integration across multiple scales. Its loss function consists of pixel loss (pixel loss) and Structural Similarity Index (SSIM) loss (SSIM loss), formulated as:

L = λ L_{ssim} + L_{p}

(1)

where

λ

is the weight balancing the two losses,

L_{ssim}

denotes the SSIM loss function, and

L_{p}

represents the pixel loss. The pixel loss

L_{p} = {∥O - I∥}_{2}

(where

{∥\cdot∥}_{2}

denotes the L2-norm calculation) measures the Euclidean distance between the input I and the output O. Since DenseFuse’s dense connection structure aims to integrate multi-modal information, the pixel loss ensures that local details from the input images are preserved during fusion. The SSIM loss function

L_{ssim} = 1 - SSIM (O, I)

(where

SSIM (\cdot)

denotes the structural similarity operation) is used to maintain structural similarity between the fused image and the input images. This addresses potential structural disruption caused by multi-modal integration in the dense connection structure, ensuring the fused image retains both multi-modal local details and structural similarity to the original inputs. These loss functions collectively help preserve image details, reduce artifacts, and maintain high structural and visual quality in the fused image during training.

NestFuse is an image fusion architecture based on nested connections and spatial-channel attention mechanisms. By introducing nested connections and attention mechanisms, this method enhances detail preservation and noise suppression. The architecture extracts multi-level features from multi-source images using Convolutional Neural Networks (CNNs). The nested connection module facilitates information flow between layers, ensuring effective combination of details and semantics. The attention module uses spatial and channel attention mechanisms to highlight important regions and feature channels. The fusion and reconstruction module integrates these features to generate the final fused image. NestFuse also employs the SSIM loss

L_{ssim}

and pixel loss

L_{p}

to form the total loss

L_{total}

, and adds a perceptual loss focusing on high-level semantic features of the image. The perceptual loss is defined as:

L = \frac{1}{Q} \sum_{q = 1}^{Q} L_{total} (I, O_{q})

(2)

where Q is the total number of layers in NestFuse’s feature extraction model, q denotes a specific layer in the feature extraction process, and

O_{q}

represents the input to the q-th layer of the feature extraction model. This multi-loss design aligns closely with NestFuse’s architecture. Nested connections and attention mechanisms aim to preserve details and suppress noise while highlighting critical information during multi-level feature extraction. Pixel and SSIM losses help maintain local details and structural similarity. The perceptual loss complements these by working with the nested structure and attention mechanisms: different feature maps (e.g., shallow-level

O_{1}

with edge/texture details, deep-level

O_{3}

with object structure semantics) capture hierarchical semantic information. By constraining features at multiple layers via the perceptual loss, NestFuse ensures the fused image retains multi-scale semantic information, improving overall quality. Additionally, this design supports multi-scale supervision, where features at different levels (

O_{1}, O_{2}, O_{3}

) are supervised to optimize multi-scale deep features.

The RFN-Nest framework addresses infrared-visible image fusion strategy challenges by proposing a Residual Fusion Network (RFN) based on residual structures to replace traditional fusion methods. For loss functions, RFN-Nest retains pixel loss and SSIM loss, and introduces a new detail-preserving loss and a feature enhancement loss to train the RFN, formulated as:

L_{RFN} = α L_{detail} + L_{feature}

(3)

where

α

balances the detail-preserving loss and feature enhancement loss. The detail-preserving loss

L_{detail}

is the SSIM loss between the fused output and the input visible light image. This constraint leverages the RFN’s residual structure—effective for retaining details—to ensure the fused image preserves structural similarity to the visible light input, thus retaining more details. The feature enhancement loss

L_{feature}

constrains the deep features of the fused image, defined as:

L_{feature} = \sum_{m = 1}^{M} w_{1} (m) {∥Φ_{f}^{m} - (w_{vi} Φ_{vi}^{m} + w_{ir} Φ_{ir}^{m})∥}_{F}^{2}

(4)

where M is the number of multi-scale deep features,

w_{1}

is a vector of trade-off parameters,

Φ_{f}^{m}

denotes the fused features extracted by the RFN at the m-th layer,

Φ_{ir}^{m}

and

Φ_{vi}^{m}

represent infrared and visible light features at the m-th layer, respectively, and

w_{vi}, w_{ir}

control the influence of different source images. This loss design synergizes with RFN’s residual structure: the residual structure enhances feature expression, while the feature enhancement loss ensures deep features better fuse infrared and visible light information, improving overall fusion performance.

2.2. Style Features and Feature Correlation

Style transfer is an image processing technique that utilizes deep learning to merge the content of one image with the artistic style of another, generating a new image that preserves the structure of the original image while exhibiting the target style. To obtain the style representation of the input image, texture information is typically captured through the feature space at higher layers of the Convolutional Neural Network (CNN). The style representation is obtained by calculating the relationships between different layers and features in the network, especially by measuring the relative positions and similarities of individual features in the convolutional feature maps. This representation effectively captures artistic style details such as texture, tone, and local structure, and subsequently transfers these style features to the generated image.

The feature correlation of style features is measured by calculating the Gram matrix of the image, using the feature space to obtain texture information [29]. This feature space can be built upon the filter responses from any layer in the network, consisting of correlations between different filter responses. These feature correlations are given by the Gram matrix

G^{l} \in R^{N_{l} \times N_{l}}

, where

G_{i j}^{l}

is the inner product between the vectorized feature maps

F_{i k}^{l}

and

F_{j k}^{l}

from layer l, as defined in Equation (5):

G_{i j}^{l} = \sum_{k} F_{i k}^{l} F_{j k}^{l}

(5)

where the indices i and j serve as free indices that denote the row and column positions, respectively, of elements within the resulting Gram matrix

G^{I}

. Specifically, i identifies the row index, corresponding to the channel or position in the original feature map

F^{I}

that contributes to the row of

G^{I}

, while j indicates the column index, marking the channel or position in

F^{I}

that defines the column of

G^{I}

. Together, they anchor the location of each element

G_{i j}^{I}

in the matrix, which is computed by summing (via the dummy index k) the products of corresponding elements from the i-th and j-th channels of

F^{I}

across all spatial positions (or other dimensions indexed by k). This formulation effectively captures channel-wise correlations or inner products between different features in

F^{I}

, with i and j structuring how these relationships are organized into the final matrix

G^{I}

.

The overall style loss function is the weighted sum of the style representations of the original image and the generated image at each layer, as defined in Equation (6):

L_{s t y l e} = \sum_{l = 0}^{L} w_{l} \frac{1}{4 N_{l}^{2} M_{l}^{2}} {(G_{i j}^{l} - A_{i j}^{l})}^{2}

(6)

where

M_{l}

is the size of the feature map at layer l of the model,

N_{l}

is the number of feature maps at layer l,the denominator

4 N_{i}^{2} M_{i}^{2}

is used for normalization to make the loss scales consistent across different layers.

G_{i j}^{l}

and

A_{i j}^{l}

are the style representations of the generated image and the original image, respectively, and

w_{l}

is the weight factor that represents the contribution of each layer to the total loss function.

3. Our Proposed Method

In this section, we will provide a detailed introduction to the proposed MRLF framework. First, we will introduce the overall network architecture of MRLF. Next, we will describe the multimodal information extraction module and the multi-resolution hierarchical fusion structure. Based on this, we propose a method for enhancing radiometric information to complement the radiometric information of SAR and optical images. Finally, we define the relevant loss functions to constrain the results of hierarchical fusion.

3.1. Overall Architecture

To generate high-quality fused images of optical and SAR data, we propose the MRLF network (Figure 1), which explores multi-resolution feature fusion and leverages complementary information to address the weaknesses of each modality. The proposed MRLF network addresses the weaknesses of each modality through multi-resolution feature fusion and complementary information utilization. The model’s main architecture comprises three key components: a feature extraction module for initial processing and multi-resolution feature interpretation, a fusion module for effectively combining same/different resolution features, and a feature reconstruction module that restores low-resolution features and cascades specific dimensions.

Based on the characteristics of SAR images, particularly the single-channel nature of specific polarizations, representing SAR data in grayscale form effectively highlights the intensity information within the image [30]. Grayscale images, by nature single-channel, emphasize target intensity and facilitate feature extraction. Their single-channel structure inherently reduces data complexity compared to multi-channel images, simplifying subsequent processing steps, improving computational efficiency, and accelerating overall processing speed [31,32,33]. Therefore, its grayscale format

I_{S A R} \in R^{H \times W \times 1}

is directly utilized as the input to the tensor transformation operation.

However, when considering multi-modal image fusion tasks involving both SAR and optical imagery, it is important to note that while SAR’s simplified single-channel representation streamlines its own preprocessing, optical images—characterized by multi-channel RGB formats—require more elaborate preprocessing to align their information with SAR’s intensity-focused data. To better control the brightness information in the fusion of optical and SAR images, the optical image can be converted from the RGB color space to the YCrCb space [34], allowing for more precise extraction of luminance data, i.e., from

I_{O P T} \in R^{H \times W \times 3}

to

I_{O P T}^{'} \in R^{H \times W \times 3}

. After completing the conversion of the optical image to the YCrCb color space, we extract the Y channel from the optical image to integrate its luminance channel information. Next, we fuse the optical luminance channel with the grayscale SAR image and provide the corresponding loss function for the subsequent fusion process.

In the initial feature extraction phase, we first apply a convolutional layer with a kernel size of

3 \times 3

, a stride of

1 \times 1

, and PReLU [35] as the activation function to

I_{OPT}^{'}

and

I_{SAR}

(both of which have a channel dimension of 1). This convolutional layer is configured to output 32 channels, explicitly expanding the channel dimension of the input features from 1 to 32. Through this operation, we extract heterogeneous remote sensing features

F_{OPT} \in R^{H \times W \times 32}

and

F_{SAR} \in R^{H \times W \times 32}

from

I_{OPT}^{'}

and

I_{SAR}

, respectively. This step not only completes the initial feature extraction but also enriches the feature representation by leveraging the channel expansion capability of convolution, providing a more comprehensive foundation for subsequent multi-resolution analysis.The Multi-resolution Interpretation Module (MIM) consists of residual downsampling blocks and their variants, aiming to perform multi-resolution separation on the expanded-channel features

F_{OPT}

and

F_{SAR}

. This enables hierarchical feature extraction across different resolutions. Specifically, assuming the spatial resolution of

F_{OPT}

and

F_{SAR}

is

H \times W

, after passing through MIM, the original single-resolution

H \times W

features are decomposed into three levels: the original resolution

H \times W

, half-resolution

\frac{H}{2} \times \frac{W}{2}

, and quarter-resolution

\frac{H}{4} \times \frac{W}{4}

.

The fusion module mainly consists of two types: one is the same-resolution fusion block, and the other is the different-resolution fusion block. The primary function of the same-resolution fusion block is to fuse features at the same resolution, which are divided into three resolution levels:

H \times W

,

\frac{H}{2} \times \frac{W}{2}

, and

\frac{H}{4} \times \frac{W}{4}

.

The different-resolution fusion block focuses on feature fusion across different resolutions. However, when there is a large difference in resolution, feature fusion can introduce considerable noise. Therefore, we use adjacent resolution layers for fusion rather than fusing layers with a significant resolution difference, such as

F_{O P T}^{H \times W}

and

F_{S A R}^{\frac{H}{4} \times \frac{W}{4}}

. Due to sensor differences, optical images generally have higher clarity and richer detail information compared to SAR images. Therefore, we perform layer-wise fusion of high-resolution optical features with lower-resolution SAR image features to fully leverage the advantages of both [36,37].

In the feature reconstruction stage, the fused features

M_{fusion}^{1}

,

M_{fusion}^{2}

,

M_{fusion}^{3}

,

M_{fusion}^{4}

,

M_{fusion}^{5}

output by different fusion modules are processed separately by the Low-resolution Reconstruction Module according to different resolution conditions. Specifically,

M_{fusion}^{1}

and

M_{fusion}^{2}

are directly concatenated because their resolutions remain unchanged;

M_{fusion}^{3}

and

M_{fusion}^{5}

enter different

U_{1}

modules to be restored to the original resolution;

M_{fusion}^{4}

enters the

U_{2}

module to be restored to the original resolution. Finally, all tensors that have completed restoration and reconstruction undergo a concatenation operation. Subsequently, the concatenated features pass through two

3 \times 3

convolutional layers (stride

1 \times 1

, Tanh activation). The final output is a grayscale fusion image reconstructed from these features. This grayscale image is then merged with the Cr and Cb data from the optical image to produce the final fused color image.

3.2. Decomposition and Reconstruction at Multiple Resolution Scales

The method proposed in this paper, aimed at achieving the fusion of high-level semantics and detailed textures, employs the Multi-resolution Interpretation Module (MIM) and the Low-resolution Reconstruction Module (LRM) to process their ive input features.

The Multi-resolution Interpretation Module (MIM) decomposes input features into hierarchical multi-scale representations via specialized residual downsampling blocks. As illustrated in Figure 2, MIM incorporates three distinct residual blocks

D_{1}

,

D_{2}

, and

D_{3}

, each constructed using a basic residual downsampling module denoted as

D (k, s)

.

D_{1}

, configured with

s = 1

, preserves the full input resolution

H \times W

(

F_{SAR}^{H \times W}, F_{OPT}^{H \times W} \in R^{H \times W \times 32}

) and is shown in Figure 2a;

D_{2}

, with

s = 2

, produces half-resolution features at

\frac{H}{2} \times \frac{W}{2}

(

F_{SAR}^{\frac{H}{2} \times \frac{W}{2}}, F_{OPT}^{\frac{H}{2} \times \frac{W}{2}} \in R^{\frac{H}{2} \times \frac{W}{2} \times 32}

) and is illustrated in Figure 2b;

D_{3}

, composed of two cascaded

D (k, s)

blocks (with

s = 2

), generates quarter-resolution representations at

\frac{H}{4} \times \frac{W}{4}

(

F_{SAR}^{\frac{H}{4} \times \frac{W}{4}}, F_{OPT}^{\frac{H}{4} \times \frac{W}{4}} \in R^{\frac{H}{4} \times \frac{W}{4} \times 32}

) and is depicted in Figure 2c. Resolution decoupling is coordinated through strategic stride configurations. This hierarchical design facilitates progressive feature extraction while mitigating information degradation.

The Low-resolution Reconstruction Module (LRM) reconstructs multi-scale features back to the original resolution via residual upsampling blocks

U_{1}

and

U_{2}

. Specifically,

U_{1}

(depicted in Figure 3b and

U_{2}

(illustrated in Figure 3c are employed to facilitate the upsampling process. Each module incorporates a Residual Branch with two PReLU-activated convolutional layers, followed by a transposed convolution

ConvT (k, s)

(3 × 3 kernel, stride

s = 2

) for resolution restoration, and a final refinement convolution. The Upsampling Branch combines a convolutional layer with

ConvT (k, s)

(with PReLU) to enhance spatial coherence in upsampled features.

3.3. Multi-Resolution Feature Fusion Module

As shown in Figure 4, the proposed multi-resolution feature fusion module focuses on the “longitudinal screening of channels” in channel attention [38] and the “horizontal positioning of locations” in spatial attention [39,40], dynamically weighting features from both the feature channel dimension and spatial location dimension to form complementarities. Meanwhile, the module not only addresses dual-attention fusion under the same resolution but also extends dual-attention fusion to adjacent resolution features. In Figure 4, for feature maps with different resolutions

F_{OPT}^{p}

(resolution p) and

F_{SAR}^{q}

(resolution q and

q < p

), resolution conversion is performed to obtain feature maps with the same resolution

F_{OPT}^{p}

and

F_{SAR}^{p}

(both with resolution p); for feature maps with the same resolution

F_{OPT}^{p}

and

F_{SAR}^{p}

, no upsampling resolution conversion is required.

For the feature maps

F_{O P T}^{p}

(with resolution p) and

F_{S A R}^{q}

(with resolution q, where

q < p

), resolution conversion is performed to obtain feature maps

F_{O P T}^{p}

and

F_{S A R}^{p}

, both with the same resolution p. For the feature maps

F_{O P T}^{p}

and

F_{S A R}^{p}

, which have the same resolution, no upsampling resolution conversion is needed. After resolution conversion, the features

F_{OPT}^{p}

and

F_{SAR}^{p}

are subjected to element-wise summation to obtain a feature tensor. This tensor is then passed through a global average pooling layer to reduce its dimensionality from two-dimensional to one-dimensional. Subsequently, a convolutional downsampling layer that reduces the number of channels further compresses the important information into a channel feature vector C. This process can be represented by Equation (7):

C = D o w n_{C} (G (F_{O P T}^{p} \oplus F_{S A R}^{p}))

(7)

where

D o w n_{C} (\cdot)

represents the channel downsampling operation,

G (\cdot)

represents the global pooling operation, and ⊕ represents the element-wise addition.

In the generation process of the spatial feature tensor, the same resolution conversion method as that for generating channel feature vectors is used to transform the feature maps into

F_{OPT}^{P}

and

F_{SAR}^{P}

, which are then concatenated. The concatenated feature maps are passed through a

1 \times 1

convolutional layer to reduce the number of input channels, and finally, a spatial downsampling layer is used for spatial dimensionality reduction to further compress the important spatial information into a spatial feature tensor S. This process can be expressed by the following Equation (8):

S = D o w n_{S} (P R e L U (C o n v_{1 \times 1} (c o n c a t (F_{O P T}^{p}, F_{S A R}^{p}))))

(8)

where

D o w n_{S} (\cdot)

represents the spatial downsampling operation,

P R e L U (\cdot)

represents the activation operation using the PReLU activation function,

C o n v_{1 \times 1} (\cdot)

represents the convolution operation with a 1 × 1 kernel to compress the input channels, and

c o n c a t (\cdot)

represents the concatenation operation.

After the generation of the channel feature vector C and the spatial feature tensor S, we will use them to generate the attention fusion feature map.In this stage, we need to upsample the extracted channel feature vector C and the spatial feature tensor S, and use the sigmoid function to restore them to their original form before compression. Then, the corresponding information representations are fused with the extracted feature maps at each resolution to obtain the attention fusion feature map. Taking the a-th fusion module as an example, the optical image representation map

M_{O P T}^{a}

generated by the a-th fusion block is represented by Equation (9), and the spatial information representation map of the SAR image, denoted as

M_{S A R}^{a}

, is represented by Equation (10). The final attention fusion feature map generated by the fusion block is denoted as

M_{f u s i o n}^{a}

, and the specific steps can be expressed by Equation (11):

M_{O P T}^{a} = F_{O P T}^{p} \otimes s i g m o i d (U p_{S} (S))

(9)

M_{S A R}^{a} = F_{S A R}^{p} \otimes s i g m o i d (U p_{S} (S))

(10)

\begin{matrix} M_{f u s i o n}^{a} = s i g m o i d (U p_{C} (C)) \otimes M_{O P T}^{a} + s i g m o i d (U p_{C} (C)) \otimes M_{S A R}^{a} \end{matrix}

(11)

where

U p_{S} (\cdot)

and

U p_{c} (\cdot)

represent the spatial upsampling operation and the channel upsampling operation, respectively.

s i g m o i d (\cdot)

represents the sigmoid activation operation, and ⊗ represents element-wise multiplication. During the multiplication operation, the broadcasting mechanism extends the channel attention map to a shape suitable for multiplication, enabling the weighted processing of the feature map.

The training of the fusion module considers the training efficiency of the attention module. It also takes into account the adaptation of multi-resolution multi-level features.The L2 norm is commonly used in image information processing to measure the extension of fine detail texture information. By calculating the L2 norm of an image, the texture intensity of a region can be captured. Regions with higher L2 norms generally contain more texture details. The Gaussian Laplacian operator first applies Gaussian filtering to the image and then performs the Laplacian operation. The Laplacian operator is a two-dimensional isotropic tool used to compute the second-order spatial derivatives of an image, and it is often applied in edge detection tasks. Prior to performing the Laplacian operation, the image is typically smoothed using a Gaussian filter to reduce the influence of noise on the Laplacian operator. To address this, we introduce the multi-scale perception loss

L_{p e r}

, as shown in Equation (12):

L_{p e r} = L_{h i g h} + η \cdot L_{l o w}

(12)

L_{h i g h} = \sum_{a = 1}^{5} λ_{a} {∥\nabla M_{f u s i o n}^{a} - \nabla F_{O P T}^{a}∥}_{F}^{2}

(13)

L_{l o w} = \sum_{a = 1}^{5} λ_{a} {∥M_{f u s i o n}^{a} - F_{S A R}^{a}∥}_{F}^{2}

(14)

where ∇ represents the Gaussian Laplacian operator,

L_{h i g h}

is the multi-scale perception loss for high-resolution visible light remote sensing images, obtained by Equation (13),

L_{l o w}

is the multi-scale perception loss for low-resolution SAR images, obtained by Equation (14),

{∥\cdot∥}_{F}

is the Frobenius norm,

η

represents the weight balancing these two loss functions, and

λ_{a}

represents the constraint weight for each fusion module. When a = 2, 3, 4,

λ_{a}

represents the constraint weight for fusion modules of the same resolution; when a = 1, 5,

λ_{a}

represents the constraint weight for fusion modules of different resolutions. A pair of input images undergoes a total of 5 dual-attention feature fusion modules, so the total multi-scale perception loss for 5 layers must be accumulated.

3.4. Complementary Feature Extraction Model (CFEM) and Radiometric Consistency Enhancement

To enable reasonable distribution and enhance the expressiveness of highlight information in the fusion results, this paper optimizes by comparing the brightness distribution consistency between the input SAR image and the fusion result to enhance the radiometric consistency of the fusion results. To this end, a complementary feature extraction model is added to the proposed model. Given the feature extraction capability of the VGG model [41,42], we add a CFEM (auxiliary model) with conv layers, PReLU, and max-pooling from a pre-trained VGG. It has 4 task layers using 3 × 3 kernels, as shown in Figure 5. In the figure,

{(\cdot)}_{n}

represents the features or computations inside the parentheses that are performed in the n-th task layer. For example, in the figure indicates that the feature inside the parentheses is extracted in the first task layer. Table 1 displays the structure of the n-th layer within the Complementary Feature Extraction Model (CFEM).

Taking the first task layer (n = 1) as an example, in the first task layer, the input image

I \in R^{H \times W \times 3}

is first processed by the first convolutional layer, which maps the image to 64 feature maps and extracts preliminary features using a 3 × 3 convolution kernel. Then, the second convolutional layer further extracts more complex features, outputting 64 new feature maps. A ReLU activation function is applied after each convolutional layer to introduce nonlinear transformations, enhancing the network’s expressive power and feature extraction capability. In the first task layer,

C o n v_{3 \times 3} (3, 64)

represents a convolutional layer with 3 input channels and 64 output channels.

To quantify the luminance distribution consistency between the input SAR image

I_{SAR}

and the fusion result

I_{result}

, we first feed

I_{SAR}

and

I_{result}

into a complementary feature extraction model to extract multi-scale features. Specifically, the feature maps

φ (I_{SAR})

and

φ (I_{result})

extracted at the n-th task layer are utilized to compute the feature correlation discrepancy. By accumulating the discrepancies across all N feature maps, we characterize the feature correlation using the Gram matrix. This process ultimately yields the luminance distribution loss function

L_{d i s}

, formulated as Equation (15):

L_{d i s} = \sum_{n = 1}^{N} {({∥G m (φ (I_{S A R})) - G m (φ (I_{r e s u l t}))∥}_{2}^{2})}_{n}

(15)

where

G m (\cdot)

represents the Gram matrix that measures feature correlation.

3.5. Loss Function

A reasonable loss function plays a crucial role in image fusion, as it defines the optimization objective of the model and accurately measures the difference between the fused image and the target image. In the fusion method we propose, the information in optical images is mainly concentrated in the Y channel, while the information in SAR images is represented by radar reflectance characteristics. We need to more effectively measure and preserve the complementary information from both types of remote sensing images. To better capture local features and details, we introduce a similarity loss

L_{i m a g e}

to evaluate the fusion results and guide model optimization.In our model, fusion of features at different resolutions is applied under channel attention and spatial attention modules. Therefore, we use a multi-layer perception loss

L_{p e r}

to ensure the fusion efficiency of multi-resolution features and improve the visual quality of the fusion results. When complementing the brightness information in the two images, we introduce a luminance distribution loss

L_{d i s}

to adjust the grayscale values between the two images, ensuring the fusion of high-brightness areas in the SAR image with their corresponding parts in the optical image. The overall loss function of our method is shown in Equation (16):

L = L_{i m a g e} + α \cdot L_{p e r} + β \cdot L_{d i s}

(16)

where

α

and

β

control the weights of

L_{p e r}

and

L_{d i s}

, respectively.

Entropy can be used to quantify the amount of information in an image. A high entropy value typically indicates that the image contains rich content and many details, while a low entropy value suggests that the image is relatively simple or uniform. Standard deviation not only reflects the contrast of the image but also provides insights into the overall pixel intensity distribution [43]. To measure the degree of dispersion and variability in the pixel value distribution of the fused image, we combine the standard deviation and entropy to design the weights

w_{l}^{O P T}

and

w_{l}^{S A R}

for constraining the luminance intensity term, as shown in Equation (17):

\begin{matrix} w_{l}^{O P T} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{1}{C_{O}} \sum_{c = 1}^{C_{O}} (\begin{matrix} E N (φ_{c} ({I^{'}}_{O P T})) + k_{1} \cdot S T D (φ_{c} ({I^{'}}_{O P T})) \end{matrix}))}_{n} \\ w_{l}^{S A R} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{1}{C_{O}} \sum_{c = 1}^{C_{O}} (\begin{matrix} E N (φ_{c} (I_{S A R})) + k_{2} \cdot S T D (φ_{c} (I_{S A R})) \end{matrix}))}_{n} \end{matrix}

(17)

where

E N (\cdot)

and

S T D (\cdot)

represent the calculated entropy and standard deviation for each image,

k_{1}

and

k_{2}

control the weights between entropy and standard deviation,

φ_{c} (\cdot)

denotes the feature maps extracted by the complementary feature extraction model from each channel at the n-th task layer,

{(\cdot)}_{n}

represents the operations within the parentheses applied to the features extracted at the n-th task layer, c denotes the c-th output channel,

C_{O}

is the total number of output channels, and N is the total number of task layers used by the complementary feature extraction model. In designing the gradient terms

w_{g}^{O P T}

and

w_{g}^{S A R}

, we also combine the L2 norm with the Gaussian Laplacian operator [44] as defined in Equation (18):

\begin{matrix} w_{g}^{O P T} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{1}{C_{O}} \sum_{c = 1}^{C_{O}} {∥\nabla φ_{c} ({I^{'}}_{O P T})∥}_{2}^{2})}_{n} \\ w_{g}^{S A R} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{1}{C_{O}} \sum_{c = 1}^{C_{O}} {∥\nabla φ_{c} (I_{S A R})∥}_{2}^{2})}_{n} \end{matrix}

(18)

where ∇ represents the Gaussian Laplacian operator.

We measure pixel differences by utilizing the Mean Structural Similarity Index (MSSIM) loss. This loss function is used to constrain the difference between the fused result and the input image [45]. To apply the MSSIM loss constraint on global brightness intensity and gradients to the fused result, we define the gradient similarity loss

L_{g}

as shown in Equation (19) and the brightness similarity loss

L_{l}

as shown in Equation (20) in the design:

\begin{matrix} L_{g} = 1 - w_{g}^{S A R} \cdot M S S I M (I_{r e s u l t}, I_{S A R}) - w_{g}^{O P T} \cdot M S S I M (I_{r e s u l t}, {I^{'}}_{O P T}) \end{matrix}

(19)

\begin{matrix} L_{l} = w_{l}^{O P T} \cdot {∥I_{r e s u l t} - {I^{'}}_{O P T}∥}_{2}^{2} + w_{l}^{S A R} \cdot {∥I_{r e s u l t} - I_{S A R}∥}_{2}^{2} \end{matrix}

(20)

where

M S S I M (a, b)

represents the mean structural similarity between a and b. The similarity loss

L_{i m a g e}

is obtained by combining these two losses through a weighted sum, as shown in Equation (21):

L_{i m a g e} = L_{g} + ζ \cdot L_{l}

(21)

where

ζ

is the weight controlling the balance between the two losses.

4. Implementation

Section 4 evaluates the performance of the proposed model through experimental comparisons with various deep learning image fusion models. Additionally, ablation experiments are conducted to further analyze the effectiveness of the proposed model.

4.1. Preparation for Implementation

In this section, we will introduce the training data, testing data, and the parameter settings used during the training process in this method.

4.1.1. Data Description

During the training phase, this paper uses the Dongying dataset [46] as the foundation for training the network’s image fusion performance. This dataset contains 6173 pairs of optical and SAR images from Dongying, Shandong, China. The optical images are from the GF-2 satellite and consist of RGB channels, while the SAR images are VV polarization data from the GF-3 satellite. The images used in the experiments have a size of 256 × 256 pixels, with a spatial resolution of 1 m after preprocessing. The efficacy of the proposed image fusion method was to be assessed, for which the dataset underwent an initial screening process. During this preprocessing stage, image pairs with improper alignment or incomplete data were eliminated. Then, 1500 image pairs were randomly selected as the training dataset for training the overall network. During the testing phase, this paper used both the Dongying dataset and the Xi’an dataset [46] as the test datasets to evaluate the network’s performance. For the Dongying dataset, 500 image pairs were randomly selected from the remaining 4673 optical and SAR image pairs (excluding the 1500 training pairs) to evaluate the network’s fusion results. The Xi’an dataset contains 2122 pairs of 128 × 128 pixel optical and SAR images from Xi’an, China, where the optical images are from the GF-2 satellite and the SAR images are VV polarization data from the GF-3 satellite. For testing with this dataset, 500 pairs of visible light and SAR images were randomly selected as the test set.

4.1.2. Training Parameters

The experiments were conducted on an NVIDIA GeForce RTX 3060 Laptop GPU (NVIDIA, Santa Clara, CA, USA) and an AMD Ryzen 75800H with Radeon Graphics CPU (AMD, Santa Clara, CA, USA). The batch size and epoch for training were set to 6 and 100, respectively. The values for the parameters were set as follows:

α

= 1 ×

10^{- 12}

,

β

= 1 ×

10^{4}

,

η

= 1,

ζ

= 5, and N = 1 (the setting of this parameter value will be discussed in the following subsubsection). In this paper, the Adam optimizer with a learning rate of 2.4 ×

10^{- 3}

was used to optimize the loss function.

4.2. Comparative Experiments

In this section, we conducted experiments on the Dongying dataset and Xi’an dataset by comparing the proposed method with several advanced methods. Additionally, we fused a set of data with larger pixel sizes and verified the effectiveness of the proposed fusion network through quantitative evaluation.

To evaluate the proposed method, we selected several deep learning-based multi-source image fusion methods (DensFuse, NestFuse, RFN-Nest, U2Fusion [47], CDDFuse [48], PSFusion [49], SwinFusion [50]) for performance comparison. By analyzing their fusion results, we comprehensively assessed their performance via subjective visual evaluation, focusing on detail preservation, image clarity, and visual realism.

For quantitative evaluation, we used several commonly used metrics, including Entropy (EN), Spatial Frequency (SF) [51], SSIM, Feature Similarity Index (FSIM), Gradient Similarity (GS), and Peak Signal-to-Noise Ratio (PSNR). These metrics were used to quantitatively measure the performance of each model in terms of fusion quality. Entropy (EN) measures the complexity or amount of information in an image; Spatial Frequency (SF) measures the extent of pixel value changes in the image; Feature Similarity Index [52] (FSIM) evaluates the similarity of image features (e.g., edges, textures); Gradient Similarity [52] (GS) assesses the similarity of image gradients to reflect structural preservation; Peak Signal-to-Noise Ratio (PSNR) quantifies the ratio between the maximum possible pixel value and the power of the noise; and SSIM is used to evaluate the similarity in terms of brightness, contrast, and structure.

4.2.1. Comparison Experiments on the Dongying Dataset

We compared the proposed model with other deep learning-based image fusion methods on the Dongying dataset to validate its performance in optical and SAR images fusion tasks. Figure 6 shows the comparison results of different models on the Dongying dataset. The first column represents the input optical images, and the second column represents the input SAR images. Columns three to nine display the fusion results of DensFuse, NestFuse, RFN-Nest, U2Fusion, SwinFusion, CDDFuse, and PSFusion, respectively. The last column shows the fusion result obtained by the model proposed in this paper.

By comparing the original images with the fusion results of the method proposed in this paper, it can be observed that in the yellow box of the first-row images, the optical image does not highlight the complete ground structural information of the islands. While the SAR image emphasizes the island contours and ground structural details not visible in the optical image, it blurs the buildings that were clearly depicted in the optical image. In the fusion result of the method proposed here, these advantages are well integrated: it not only preserves the island contours and ground structures but also retains the buildings that were blurred in the SAR image. Similar effects are also evident in the images of the second and third rows, where the fusion results enhance surface structures while incorporating interpretability not originally present in the SAR image. In the fourth-row images (as indicated by the yellow box), the optical image only shows the general shadows of vegetation, whereas the SAR image accurately localizes tree positions. The fusion result retains the actual positions of the trees and enriches the information by adding the interpretability of the optical image.

DenseFuse fuses multi-modal information using a dense connection structure, yet its low feature utilization rate results in blurred local details in the fused image, as evidenced in Figure 2. In the fourth row of Figure 2, the DenseFuse output displays blurrier vegetation texture compared to our proposed method, accompanied by indistinct island contours, unclear building boundaries, and visible artifacts. Relying on pixel loss and SSIM loss, DenseFuse fails to address radiometric inconsistency—this is apparent in the second row of Figure 2, where the fusion of SAR’s strongly reflective regions with optical details appears unnatural, leading to a significant visual loss of SAR feature information. Our method dynamically weights SAR and optical features via a dual attention mechanism: in the same region (e.g., the island within the yellow box in the first row), it preserves sharp contours while eliminating artifacts. Compared to NestFuse, MRLF’s multi-scale decomposition design overcomes noise sensitivity, enhancing small-scale textures in vegetation areas (as seen in the fourth row of Figure 2) without compromising overall hierarchical structure. Although RFN-Nest optimizes feature representation using a residual structure, it struggles to highlight structural information—for example, in the fourth row of Figure 2, tree positions from the SAR image fail to naturally merge with optical shadows in the RFN-Nest result, causing blurred edges. As shown in the first row of Figure 2, MRLF addresses the issue of SAR’s strong radiation overwhelming optical details through a brightness distribution alignment module, outperforming U2Fusion and SwinFusion by producing more natural fusion of island contours and building clusters (avoiding local over-darkening). To mitigate CDDFuse’s color distortion, MRLF’s YCrCb color space conversion and CFEM module ensure spectral fidelity, with no observable color tone deviation. These improvements collectively grant MRLF higher edge sharpness and radiometric consistency visually.

The quantitative evaluation results are presented in Table 2. In quantitative evaluations on the Dongying dataset, MRLF excels in key metrics owing to its innovative design. Information entropy (EN = 6.72) reaches the highest value, benefiting from MIM—residual downsampling blocks separate input images into low-resolution global structure and high-resolution local details, while residual compression retains multi-scale information to maximize richness. Structural similarity (SSIM = 0.63) and gradient similarity (GS = 0.853) demonstrate strong performance due to the dual attention fusion mechanism: channel attention dynamically weights the modality contributions of SAR and optical features, spatial attention focuses on complementary regions, and LRM’s deconvolution combined with skip connections maintains spatial consistency to prevent edge degradation. The advantages in FSIM (0.600, second) and PSNR (21.325, second) stem from CFEM, which quantifies brightness distribution differences via a pre-trained VGG network and aligns SAR and fused radiometric features using Gram matrices, reducing brightness discontinuity artifacts. The competitiveness of SF (19.36, third) reflects the multi-resolution framework’s capability to preserve high-frequency textures. Collectively, these results validate that MRLF achieves balanced enhancements in detail, structure, and naturalness through synergistic hierarchical fusion, attention weighting, and radiometric correction.

4.2.2. Comparison Experiments on the Xi’an Dataset

We further validated the proposed model by comparing it with other deep learning-based image fusion methods on the Xi’an dataset. featuring 128 × 128 urban building images (more complex than the Dongying dataset, raising greater fusion challenges), we obtained comparable results as shown in Figure 7: the first row presents input optical images, the second column input SAR images, columns 3–9 fusion results from DensFuse, NestFuse, RFN-Nest, U2Fusion, SwinFusion, CDDFuse, and PSFusion, respectively, and the last column the proposed model’s result.

By comparing the original images with the fusion results of this experiment, it can be observed that in the yellow boxes of the first-row images, the optical image fails to highlight ground building information, while the SAR image emphasizes building contours but blurs the road information visible in the optical image. In the fusion result of the method proposed in this paper, both the information of building clusters and the previously blurred roads are prominently enhanced. Similarly, in the second row, the fusion result retains the structures that are missing in the optical image within the yellow boxes but are highlighted in the SAR image. In the results of the fourth row (as indicated by the yellow box), the optical image adds blue buildings overlooked by the SAR image, while the road information emphasized by the SAR image is also incorporated into the fusion result.

For the more complex Xi’an dataset, MRLF’s dual attention mechanism and resolution conversion technology effectively address fusion challenges in structurally dense areas. As shown in the first and fourth rows of Figure 7, compared to RFN-Nest, MRLF performs better in handling building shadows—retaining SAR-emphasized building contours and optical image road networks—while RFN-Nest suffers from blurred edges due to insufficient detail retention. In the third row of Figure 7, relative to U2Fusion, MRLF’s hierarchical reconstruction mechanism avoids edge information loss, enabling natural transitions between building clusters and their surroundings without abruptness or excessive shadows. Also in the third row of Figure 7, compared to SwinFusion, MRLF’s multi-resolution balance suppresses excessive global brightness retention, preventing building surfaces from being overly bright and losing spectral details. In the third row of Figure 7, against PSFusion, MRLF’s gradient similarity design ensures texture realism and avoids unnatural fusion results. Visually, MRLF achieves efficient integration of complementary information in yellow-marked regions—highlighting SAR’s structural details while incorporating optical color interpretability—providing more reliable visual outputs for urban monitoring applications.

The quantitative evaluation results are presented in Table 3. In quantitative evaluations on the Xi’an dataset, MRLF leverages its multi-resolution hierarchical fusion architecture and attention mechanism to demonstrate significant advantages in key metrics. Spatial Frequency (SF = 24.10), reflecting the model’s excellent retention of gradient details, is primarily attributed to MIM’s hierarchical decomposition design. Leading performance in Structural Similarity (SSIM = 0.62) and Gradient Similarity (GS = 0.862) stems from the synergistic effect of the dual attention fusion module: channel attention dynamically adjusts the weights of SAR and optical modalities, spatial attention focuses on complementary regions, and LRM’s deconvolution combined with skip connections restores spatial consistency. Outstanding Information Entropy (EN = 6.95, ranked second) reflects the model’s retention of information richness, benefiting from mutual CFEM ensuring efficient integration of multi-source features. The competitiveness of Peak Signal-to-Noise Ratio (PSNR = 19.625, third) and Feature Similarity (FSIM = 0.602) is closely related to radiometric consistency enhancement strategies—quantifying brightness distribution differences between SAR and the fused results via Gram matrices, and introducing a brightness loss function to align cross-modal radiometric features. Collectively, these results validate that MRLF achieves the optimal balance between information content, structural fidelity, and visual naturalness in complex urban scenes through its closed-loop design of multi-resolution decomposition, attention weighting, and radiometric correction.

4.2.3. Comparison Experiments on Larger Pixel Size Data Fusion Tasks

To further verify the experimental results, we selected a pair of optical and SAR images with larger pixel sizes as test samples and compared the proposed model with other deep learning-based image fusion methods. The experimental sample consists of imaging from a SAR flight test conducted in a park lake area in Shanxi.

This paper employs a patch-based processing strategy to address the input size mismatch between large-sized images and model requirements during testing. Large remote sensing or high-resolution images often cannot be directly fed into models either because they exceed the model’s GPU memory capacity or fail to meet the divisibility constraints of the model’s downsampling layers for input dimensions. Specifically, we first set the patch size to 192 × 192 pixels based on the model architecture and memory limitations. Using a sliding window, we then traverse the original full-resolution image to be fused, sequentially cropping subpatches compatible with the model’s input. Subsequently, each cropped subpatch and its corresponding subpatch from the other modality are fed into the pre-trained fusion model to generate the fused result for that subpatch.

The aim of this experiment was to evaluate the performance of the proposed method in handling large-size image data fusion tasks. In the experiment described in this section, we preprocessed the selected optical and SAR image pairs by cropping the images to 1000 × 1000 pixels, followed by registration and stretching to ensure that the image data could effectively undergo the fusion task.In this group of experiments, the visual comparison results obtained are shown in Figure 8.

The first row displays, from left to right, the input optical image, SAR image, and fusion results from DensFuse, NestFuse, and RFN-Nest. The second row, from left to right, shows the fusion results from U2Fusion, SwinFusion, CDDFuse, PSFusion, and the proposed model. In the test results of large-scale images, by comparing the original images with the fusion results, it can be observed that within the yellow box of the optical image, the contours of the islands and bridges in the lake do not exhibit the same prominence as highlighted in the SAR image. Meanwhile, the SAR image lacks the road information within the islands. In the fusion result, however, the advantages of both are well integrated.

For large-scale images of 1000 × 1000 pixels, compared to all contrastive methods, MRLF exhibits notable advantages in edge enhancement and color consistency. As shown in the Figure 8j, the floating bridge contour highlighted is clear and sharp in the MRLF result, whereas DenseFuse (Figure 8c) and U2Fusion (Figure 8f) suffer from blurred edges due to single-resolution limitations. In Figure 8h, although CDDFuse attempts driven fusion, it exhibits local color distortion; MRLF, however, maintains large-area regions without brightness discontinuities via a radiation enhancement factor. In Figure 8g (SwinFusion) and Figure 8i (PSFusion), prominent texture smoothing issues arise in large-pixel processing, while MRLF’s channel-spatial attention dynamically weights complementary features, preserving small-scale details. Visually, MRLF’s fusion result is artifact-free and highly natural; the leading quantitative metrics EN (6.95) and GS (0.835) further validate its robustness in large-pixel tasks.

The quantitative evaluation results are provided in Table 4 for the fusion of 1000 × 1000 pixel optical and SAR image pairs. In quantitative evaluations for large-pixel optical and SAR image fusion tasks, MRLF—leveraging its multi-resolution hierarchical architecture—demonstrates significant advantages in key metrics: both Information Entropy (EN = 6.95) and Gradient Similarity (GS = 0.835) reach the highest values, directly reflecting the model’s excellent ability to retain image information richness and gradient consistency. The leading advantage of EN stems from decomposing input features into global low-resolution semantics and local high-resolution details, combined with a residual compression module that retains multi-scale information to avoid information loss in large-pixel data. The highest value of GS is attributed to the synergy of the dual attention fusion module and multi-scale perceptual loss: channel attention dynamically weights SAR’s intensity features and optical’s spectral features, spatial attention focuses on edge and texture regions, and the Gaussian-Laplacian operator constrains gradient similarity to ensure natural fusion of structural details in large-scale scenes. Additionally, the excellent performance of FSIM (0.660, ranked second) is closely related to the radiometric consistency enhancement design: the Complementary Feature Extraction Model (CFEM) quantifies brightness distribution differences between SAR and fused results via a pre-trained VGG network, aligns radiometric features using Gram matrices, reduces cross-modal artifacts, and thus improves feature matching accuracy. The competitiveness of PSNR (ranked third) together with these results validates that MRLF achieves the optimal balance between information fidelity, structural integrity, and visual clarity in large-pixel fusion tasks through its integrated design of multi-resolution decomposition, attention mechanisms, and adaptive radiometric correction.

4.3. Ablation Experiment

This section aims to validate the MRLF model’s effectiveness via ablation experiments on the Dongying dataset, including those for Multi-Resolution Hierarchical Fusion, Complementary Feature Extraction Model, fusion strategies, and sampling strategies, to analyze and evaluate its performance under different conditions.

4.3.1. Multi-Resolution Hierarchical Fusion

In this section, we discuss the impact of the multi-resolution fusion modules proposed in this paper on fusion results. In NO.1, we only use the fusion module with the same resolution; in NO.2, we only use the fusion module with different resolutions; in NO.3, we only use the fusion module for high-resolution features (i.e., modules

a = 1

and

a = 2

); in NO.4, we use the fusion module for low-resolution features (i.e.,

a = 3

,

a = 4

, and

a = 5

). The remaining experiments all employ a single fusion module to fuse features from different levels. This set of experiments was also conducted on the Dongying dataset, and Table 5 presents the quantitative evaluation results.

As shown in Table 5, the quantitative results of the fusion modules with different resolutions indicate: NO.1 (same resolution) achieves the best performance in SF (18.32) and GS (0.854); high-resolution fusion leads in EN with 6.72 and SSIM stabilizes at 0.63; among the parameterized experiments, a = 2 performs best comprehensively (with the highest EN, SF, FSIM, and PSNR). Overall, multi-resolution methods (e.g., NO.3 and a = 2) outperform single-resolution methods, while key metrics decline for low-resolution cases (a = 4, a = 5), validating the importance of balancing resolutions for retaining both detail and global information.

4.3.2. Number of Layers in the Complementary Feature Extraction Model

This section will delve into the impact of the complementary feature extraction model proposed in this paper on the fusion results, primarily by comparing the effect of different total task layers (N) on the fusion performance. Table 6 provides the quantitative evaluation results for different values of N. In the experiment, we measured the average physical memory usage during the training process (PM), the average of peak GPU utilization (pGPUu), and the average training duration per epoch (time).

The experimental results show that deeper features in the complementary feature extraction model have certain advantages in terms of mutual information for the fusion results, but they also lead to some loss of original information. Additionally, a larger value of (N) requires more resources during the training process, and the training time will also increase accordingly. After comprehensively considering both fusion quality and training efficiency, we ultimately set the experimental parameter N to 1.

4.3.3. Fusion Strategy

In the fusion strategy experiment, we compared the fusion results of our multi-resolution hierarchical fusion strategy with those of summation and concatenation strategies. The resulting fusion images are shown in Figure 9. The first and second columns display the input SAR and optical images, respectively. The third and fourth columns show the fusion results obtained by replacing the proposed fusion module with summation and concatenation operations. The final column presents the fusion result using the proposed model.As shown in Table 7, quantitative analysis further validates these observations.

The fusion results of summation and concatenation are inferior to the proposed method in key metrics. For summation, its EN (4.83) and SF (17.52) are lower than the proposed method’s 6.95 and 13.73, indicating weaker information richness and texture detail preservation; its SSIM (0.62) and FSIM (0.465) also lag behind the proposed method’s 0.63 and 0.660, reflecting poorer structural and feature similarity. Although summation’s GS (0.859) and PSNR (21.047) are slightly better, these gains are overshadowed by its significant deficits in critical texture and structure metrics. For concatenation, the gaps are more pronounced: its EN (4.61) and SF (11.14) are markedly lower than the proposed method’s values, SSIM (0.43) and FSIM (0.347) are far weaker, and PSNR (16.169) drops drastically, directly reflecting the negative impact of redundant information and ineffective cross-modal interaction on fusion quality.

4.3.4. Sampling Strategy

This section conducts an ablation experiment where down-sampling blocks and up-sampling blocks with a stride of 4 × 4 are used to replace the combination of two down-sampling blocks and two up-sampling blocks (each with a stride of 2 × 2) employed in this paper for multi-resolution feature extraction and restoration. The quantitative evaluation results of this ablation experiment are presented in Table 7.

The fusion effect of this method is generally comparable to that of the sampling strategy proposed in this paper, yet the latter demonstrates a marginal advantage in key detail metrics: EN, FSIM, GS, and PSNR all achieve superior values. Although these minor differences are not statistically significant, they intuitively reflect that the sampling strategy adopted in this paper, while maintaining the output size, more precisely controls the receptive field, reduces detail loss, stabilizes gradient propagation, and provides greater flexibility for multi-resolution feature fusion.

4.3.5. Ablation of $α$ and $β$

This section quantitatively investigates the impact of the hyperparameters

α

and

β

in the loss function on multi-modal image fusion results through ablation experiments, as shown in Table 7. Eight typical value groups were selected for the experiments:

α = 1 \times 10^{4}

,

α = 1 \times 10^{- 5}

,

α = 1 \times 10^{- 10}

,

α = 1 \times 10^{- 14}

,

β = 1 \times 10^{- 12}

,

β = 1 \times 10^{- 8}

,

β = 1 \times 10^{- 4}

, and

β = 1 \times 10^{8}

, combined with the proposed method for comparison.

From the indicator changes in the

α

-related rows, it can be observed that

α

has a significant and nonlinear impact on the fusion results: when

α

is excessively large (e.g.,

1 \times 10^{4}

) or excessively small (e.g.,

1 \times 10^{- 5}

), all evaluation metrics decrease significantly. For example, when

α = 1 \times 10^{4}

, the EN (Entropy) is only 3.95 (far lower than that of Ours at 6.72), SSIM (Structural Similarity Index) drops to 0.23, and FSIM (Feature Similarity Index) is merely 0.161, indicating that an overly strong weight of

L_{feature}

disrupts the balance between features. In contrast, when

α = 1 \times 10^{- 10}

, EN (6.71), SF (Spectral Correlation) (17.35), SSIM (0.62), and PSNR (Peak Signal-to-Noise Ratio) (20.150) all reach the optimal level within the

α

group, yet they still fall short of the performance of the proposed method. Compared with

α

,

β

exerts a more stable influence on the indicators: when

β

varies, the EN fluctuates only within the range of 6.69–6.71, SSIM ranges between 0.58–0.62, and FSIM remains stable at 0.547–0.601, suggesting that minor adjustments to the weight

L_{style}

have limited impact on preserving global structures. Notably, when

β = 1 \times 10^{- 4}

, EN (6.71), SF (17.45), SSIM (0.61), and GS (Gradient Similarity) (0.847) all achieve relatively high levels within the

β

group, forming suboptimal hyperparameter combinations together with

α = 1 \times 10^{- 10}

. The proposed method outperforms all others across all metrics (EN = 6.72, SF = 19.36, SSIM = 0.63, FSIM = 0.600, GS = 0.853, PSNR = 21.325), validating the rationality of the selected

α

and

β

values in the proposed method.

4.4. Convergence Analysis

Figure 10 illustrates the variation trends of the overall loss value L during training for the Dongying dataset (red diamond markers) and Xi’an dataset (blue square markers) across epochs. Both datasets exhibit convergence after 50 epochs: the Xi’an dataset stabilizes within the range of 0.15–0.18, while the Dongying dataset further drops below 0.1. Overall, both curves decrease monotonically with increasing epochs and converge, indicating the effectiveness of the training process. Notably, the Dongying dataset shows more stable fitting in the early stages and achieves slightly better final accuracy, whereas the Xi’an dataset undergoes more pronounced fluctuations before realizing efficient convergence. These observations reflect the influence of different data size characteristics on the training dynamics.

5. Conclusions

This paper introduces the MRLF model, which converts multi-source images into features of various resolutions and fuses them using dual-attention and resolution-specific modules. The low-resolution image is reconstructed, and layer fusion results are integrated for the final output. We employ a pre-trained model to calculate adaptive loss function parameters and enhance SAR-fusion feature correlation via a brightness distribution loss, boosting SAR representation. Experiments on Dongying and Xi’an datasets with large optical-SAR image pairs show significant performance improvements, suiting optical-SAR fusion tasks. The future work aims to enhance SAR image real-time processing, hardware acceleration, and parallel computing through algorithm optimization, while shifting the focus of research efforts to higher-level visual tasks—such as semantic segmentation of multi-source remote sensing images. Specifically, we will leverage image fusion technology to promote the improvement of semantic segmentation accuracy and utilize semantic segmentation to enhance the performance of image fusion. By establishing this bidirectional empowerment technical closed-loop, we aim to elevate the utilization value of remote sensing data.

Author Contributions

Conceptualization, B.Z., Z.G., Y.Y. and G.S.; Methodology, J.W.; Formal analysis, L.M.; Investigation, L.M.; Writing—original draft, J.W.; Writing—review & editing, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the Guangdong Basic and Applied Basic Research Foundation (No. 2025B1515020076), the Foundation of Shenzhen City under Grant JCYJ20230808105359045, and the National Natural Science Foundation of China (Nos. 62571342, 62431021).

Data Availability Statement

The datasets used in this study, Dongying dataset and Xi’an dataset, can be obtained from the provided https://github.com/XD-MG/DDHRNet (accessed on 18 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, S.; Kang, X.; Fang, L.; Hu, J.; Yin, H. Pixel-level image fusion: A survey of the state of the art. Inf. Fusion 2017, 33, 100–112. [Google Scholar] [CrossRef]
Liu, C.; Sun, Y.; Xu, Y.; Sun, Z.; Zhang, X.; Lei, L.; Kuang, G. A Review of Optical and SAR Image Deep Feature Fusion in Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12910–12930. [Google Scholar] [CrossRef]
Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43. [Google Scholar] [CrossRef]
Wu, J.; Li, Y.; Zhong, B.; Zhang, Y.; Liu, Q.; Shi, X.; Ji, C.; Wu, S.; Sun, B.; Li, C.; et al. Synergistic Coupling of Multi-Source Remote Sensing Data for Sandy Land Detection and Multi-Indicator Integrated Evaluation. Remote Sens. 2024, 16, 4322. [Google Scholar] [CrossRef]
Baier, G.; Deschemps, A.; Schmitt, M.; Yokoya, N. Synthesizing Optical and SAR Imagery From Land Cover Maps and Auxiliary Raster Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4701312. [Google Scholar] [CrossRef]
Sun, Q.; Liu, M.; Chen, S.; Lu, F.; Xing, M. Ship Detection in SAR Images Based on Multilevel Superpixel Segmentation and Fuzzy Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5206215. [Google Scholar] [CrossRef]
Kedar, M.; Rege, P.P. Wavelet Transform-Based Fusion of SAR and Multispectral Images. In Nanoelectronics, Circuits and Communication Systems; Nath, V., Mandal, J.K., Eds.; Springer: Singapore, 2020; pp. 261–275. [Google Scholar]
Li, C.; Luo, Z.; Wang, Q. Research on fusion method of SAR and RGB image based on wavelet transform. In Proceedings of the 13h International Conference on Digital Image Processing (ICDIP 2021), Virtual, 20–23 May 2021. [Google Scholar]
Wang, W.; Chang, F. A Multi-focus Image Fusion Method Based on Laplacian Pyramid. J. Comput. 2011, 6, 2559–2566. [Google Scholar] [CrossRef]
Siheng, M.; Li, Z.; Hong, P.; Jun, W. Medical image fusion based on DTNP systems and Laplacian pyramid. J. Membr. Comput. 2021, 3, 284–295. [Google Scholar] [CrossRef]
Tianyong, C.; Yumin, T.; Qiang, L.; Bingxin, B. Novel fusion method for SAR and optical images based on non-subsampled shearlet transform. Int. J. Remote Sens. 2020, 41, 4590–4604. [Google Scholar]
Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese Networks for Object Tracking. arXiv 2016, arXiv:1606.09549. [Google Scholar] [CrossRef]
Zhang, S.; Li, X.; Zhang, X.; Zhang, S. Infrared and visible image fusion based on saliency detection and two-scale transform decomposition. Infrared Phys. Technol. 2021, 114, 103626. [Google Scholar] [CrossRef]
Ye, Y.; Zhang, J.; Zhou, L.; Li, J.; Ren, X.; Fan, J. Optical and SAR image fusion based on complementary feature decomposition and visual saliency features. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205315. [Google Scholar] [CrossRef]
Batur, E.; Maktav, D. Assessment of Surface Water Quality by Using Satellite Images Fusion Based on PCA Method in the Lake Gala, Turkey. IEEE Trans. Geosci. Remote Sens. 2019, 57, 2983–2989. [Google Scholar] [CrossRef]
Bavirisetti, D.P.; Rajesh, K.N.; Dhuli, R.; Polinati, S. Multimodal Medical Image Fusion Based on Content-based and PCA-sigmoid. Curr. Med. Imaging 2022, 18, 546–562. [Google Scholar] [CrossRef]
Zhang, Z.; Shi, Z.; An, Z. Hyperspectral and panchromatic image fusion using unmixing-based constrained nonnegative matrix factorization. Optik 2013, 124, 1601–1608. [Google Scholar] [CrossRef]
Zong, J.-J.; Qiu, T.-S. Medical image fusion based on sparse representation of classified image patches. Biomed. Signal Process. Control. 2017, 34, 195–205. [Google Scholar] [CrossRef]
Zhang, Q.; Fu, Y.; Li, H.; Zou, J. Dictionary learning method for joint sparse representation-based image fusion. Opt. Eng. 2013, 52, 057006. [Google Scholar] [CrossRef]
Prabhakar, K.R.; Srikar, V.S.; Babu, R.V. DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4724–4732. [Google Scholar] [CrossRef]
Li, H.; Wu, X.J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
Xiao, L.; Lin, L.; Yuli, S.; Ming, L.; Gangyao, K. Multimodal Bilinear Fusion Network With Second-Order Attention-Based Channel Selection for Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1011–1026. [Google Scholar]
Feng, Q.; Zhu, D.; Yang, J.; Li, B. Multisource Hyperspectral and LiDAR Data Fusion for Urban Land-Use Mapping based on a Modified Two-Branch Convolutional Neural Network. ISPRS Int. J.-Geo 2019, 8, 28. [Google Scholar] [CrossRef]
Ye, Y.; Liu, W.; Zhou, L.; Peng, T.; Xu, Q. An Unsupervised SAR and Optical Image Fusion Network Based on Structure-Texture Decomposition. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4028305. [Google Scholar] [CrossRef]
Chenwu, W.; Junsheng, W.; Zhixiang, Z.; Hao, C. MSFNet: MultiStage Fusion Network for infrared and visible image fusion. Neurocomputing 2022, 507, 26–39. [Google Scholar] [CrossRef]
Hui, L.; Yanfeng, T.; Ruiliang, P.; Liang, L. Remotely Sensing Image Fusion Based on Wavelet Transform and Human Vision System. Int. J. Signal Process. Image Process. Pattern Recognit. 2015, 8, 291–298. [Google Scholar] [CrossRef]
Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Learning Enriched Features for Real Image Restoration and Enhancement. arXiv 2020. [Google Scholar] [CrossRef]
Kulkarni, S.C.; Rege, P.P. Pixel level fusion techniques for SAR and optical images: A review. Inf. Fusion 2020, 59, 13–29. [Google Scholar] [CrossRef]
Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NE, USA, 26 June–1 July 2016; pp. 2414–2423. [Google Scholar] [CrossRef]
Chandrakanth, R.; Saibaba, J.; Varadan, G.; Ananth Raj, P. Feasibility of high resolution SAR and multispectral data fusion. In Proceedings of the 2011 IEEE International Geoscience and Remote Sensing Symposium, Vancouver, BC, Canada, 24–29 July 2011; pp. 356–359. [Google Scholar] [CrossRef]
Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar] [CrossRef]
Fein-Ashley, J.; Wickramasinghe, S.; Zhang, B.; Kannan, R.; Prasanna, V. A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification. arXiv 2024, arXiv:2402.00564. [Google Scholar] [CrossRef]
Hu, P.; Guo, W.; Chapman, S.C.; Guo, Y.; Zheng, B. Pixel size of aerial imagery constrains the applications of unmanned aerial vehicle in crop breeding. ISPRS J. Photogramm. Remote Sens. 2019, 154, 1–9. [Google Scholar] [CrossRef]
Baofeng, T.; Jun, L.; Xin, W. Lossy compression algorithm of remotely sensed multispectral images based on YCrCb transform and IWT. In Proceedings of the International Symposium on Photoelectronic Detection and Imaging 2007: Image Processing, Beijing, China, 9–12 September 2007. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015, arXiv:1502.01852. [Google Scholar]
Sonobe, R.; Yamaya, Y.; Tani, H.; Wang, X.; Kobayashi, N.; Mochizuki, K.-i. Mapping crop cover using multi-temporal Landsat 8 OLI imagery. Int. J. Remote Sens. 2017, 38, 4348–4361. [Google Scholar] [CrossRef]
Singh, P.; Diwakar, M.; Shankar, A.; Shree, R.; Kumar, M. A Review on SAR Image and its Despeckling. Arch. Comput. Methods Eng. 2021, 28, 4633–4653. [Google Scholar] [CrossRef]
Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Jie, H.; Li, S.; Samuel, A.; Gang, S.; Enhua, W. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2011–2023. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
Mishra, N.; Jahan, I.; Nadeem, M.R.; Sharma, V. A Comparative Study of ResNet50, EfficientNetB7, InceptionV3, VGG16 models in Crop and Weed classification. In Proceedings of the 2023 4th International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 9–11 May 2023; pp. 1–5. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Yang, C.; Zhang, J.Q.; Wang, X.R.; Liu, X. A novel similarity based quality metric for image fusion. Inf. Fusion 2008, 9, 156–160. [Google Scholar] [CrossRef]
Nonato, L.G.; do Carmo, F.P.; Silva, C.T. GLoG: Laplacian of Gaussian for Spatial Pattern Detection in Spatio-Temporal Data. IEEE Trans. Vis. Comput. Graph. 2021, 27, 3481–3492. [Google Scholar] [CrossRef] [PubMed]
Roberts, J.W.; van Aardt, J.; Ahmed, F. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522. [Google Scholar]
Ren, B.; Ma, S.; Hou, B.; Hong, D.; Chanussot, J.; Wang, J.; Jiao, L. A dual-stream high resolution network: Deep fusion of GF-2 and GF-3 data for land cover classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102896. [Google Scholar] [CrossRef]
Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 502–518. [Google Scholar] [CrossRef] [PubMed]
Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf. Fusion 2023, 99, 101870. [Google Scholar] [CrossRef]
Wang, Z.; Chen, Y.; Shao, W.; Li, H.; Zhang, L. SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images. IEEE Trans. Instrum. Meas. 2022, 71, 5016412. [Google Scholar] [CrossRef]
Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, L.; Mou, X.; Zhang, D. FSIM: A Feature Similarity Index for Image Quality Assessment. IEEE Trans. Image Process. 2011, 20, 2378–2386. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overall architecture of the MRLF network.

Figure 2. Illustrates the schematic of the Multi-resolution Interpretation Module (MIM). In the figure, (a) illustrates the Basic Residual Downsampling Module D(k,s); (b) illustrates the structural diagram of

D_{1}

; (c) illustrates the structural diagram of

D_{2}

; (d) illustrates the structural diagram of

D_{3}

.

Figure 2. Illustrates the schematic of the Multi-resolution Interpretation Module (MIM). In the figure, (a) illustrates the Basic Residual Downsampling Module D(k,s); (b) illustrates the structural diagram of

D_{1}

; (c) illustrates the structural diagram of

D_{2}

; (d) illustrates the structural diagram of

D_{3}

.

Figure 3. This figure illustrates the schematic of the Low-resolution Reconstruction Module (LRM). In the figure, (a) illustrates the Basic Residual Upsampling Module

U (k, s)

; (b) illustrates the structural diagram of

U_{1}

; (c) illustrates the structural diagram of

U_{2}

.

Figure 3. This figure illustrates the schematic of the Low-resolution Reconstruction Module (LRM). In the figure, (a) illustrates the Basic Residual Upsampling Module

U (k, s)

; (b) illustrates the structural diagram of

U_{1}

; (c) illustrates the structural diagram of

U_{2}

.

Figure 4. This figure shows the overall structure of the Multi-resolution Feature Fusion Module.

Figure 5. Complementary Feature Extraction Model.

Figure 6. This figure shows the comparison results on the Dongying dataset: (a) Optical image, (b) SAR, (c) DenseFuse, (d) NestFuse, (e) RFN-Nest, (f) U2Fusion, (g) SwinFusion, (h) CDDFusion, (i) PSFusion, (j) Ours. The yellow box highlights an enlarged view of a local region in each figure.

Figure 7. This figure shows the comparison results on the Xian dataset: (a) Optical image, (b) SAR, (c) DenseFuse, (d) NestFuse, (e) RFN-Nest, (f) U2Fusion, (g) SwinFusion, (h) CDDFusion, (i) PSFusion, (j) Ours. The yellow box highlights an enlarged view of a local region in each figure.

Figure 8. This figure presents a comparison of the fusion results obtained from image pairs, each consisting of 1000 × 1000 pixels: (a) Optical image, (b) SAR, (c) DenseFuse, (d) NestFuse, (e) RFN-Nest, (f) U2Fusion, (g) SwinFusion, (h) CDDFusion, (i) PSFusion, (j) Ours. The yellow box highlights an enlarged view of a local region in each figure.

Figure 9. Thisfigure shows the results of the fusion strategy ablation experiment: (a) SAR, (b) Optical image, (c) Summation, (d) Concatenation, (e) Ours.

Figure 10. The Changing Trend of the Overall Loss Function during Training.

Table 1. This table displays the structure of the n-th layer within the Complementary Feature Extraction Model (CFEM), where Conv + ReLU denotes a convolutional layer followed by an activation layer, and the subsequent indicators pertain solely to the Conv convolutional layer.

Layer	Type	Channel (Input)	Channel (Output)	Output Feature Map
n = 1	Conv + ReLU	3	64	$φ (I) \in R^{\frac{H}{2} \times \frac{W}{2} \times 64}$ ,
	Conv + ReLU	64	64	$φ_{c} (I) \in R^{\frac{H}{2} \times \frac{W}{2} \times 1}$
n = 2	Conv + ReLU	64	128	$φ (I) \in R^{\frac{H}{4} \times \frac{W}{4} \times 128}$ ,
	Conv + ReLU	128	128	$φ_{c} (I) \in R^{\frac{H}{4} \times \frac{W}{4} \times 1}$
n = 3	Conv + ReLU	128	256	$φ (I) \in R^{\frac{H}{8} \times \frac{W}{8} \times 256}$ ,
	Conv + ReLU	256	256	$φ_{c} (I) \in R^{\frac{H}{8} \times \frac{W}{8} \times 1}$
n = 4	Conv + ReLU	256	512	$φ (I) \in R^{\frac{H}{16} \times \frac{W}{16} \times 512}$ ,
	Conv + ReLU	512	512	$φ_{c} (I) \in R^{\frac{H}{16} \times \frac{W}{16} \times 1}$

Table 2. This table shows the quantitative evaluation results on the Dongying dataset. Best performance is highlighted in bold, while second-best and third-best performances are displayed in red and blue, respectively.

	EN	SF	SSIM	FSIM	GS	PSNR (dB)
DensFuse	6.33	14.21	0.61	0.588	0.845	19.897
NestFuse	6.31	16.60	0.62	0.600	0.852	20.060
RFN-Nest	6.32	11.49	0.57	0.559	0.836	20.010
U2Fusion	6.09	18.53	0.60	0.567	0.844	20.448
SwinFusion	6.46	18.86	0.61	0.605	0.847	21.502
CDDFusion	6.41	19.87	0.61	0.591	0.839	19.839
PSFusion	6.63	19.89	0.59	0.601	0.841	19.438
Ours	6.72	19.36	0.63	0.600	0.853	21.325

Table 3. This table shows the quantitative evaluation results on the Xian dataset. Best performance is highlighted in bold, while second-best and third-best performances are displayed in red and blue, respectively.

	EN	SF	SSIM	FSIM	GS	PSNR (dB)
DensFuse	6.65	15.51	0.61	0.595	0.860	19.165
NestFuse	6.74	18.75	0.62	0.609	0.859	19.839
RFN-Nest	6.70	13.59	0.59	0.573	0.847	19.235
U2Fusion	6.55	20.92	0.60	0.573	0.852	19.451
SwinFusion	6.91	21.60	0.60	0.617	0.853	19.829
CDDFusion	6.95	23.17	0.60	0.600	0.846	18.905
PSFusion	6.97	22.09	0.61	0.611	0.853	18.147
Ours	6.95	24.10	0.62	0.602	0.862	19.625

Table 4. This table shows the quantitative evaluation results of fusion for 1000 × 1000 pixel image pairs. Best performance is highlighted in bold, while second-best and third-best performances are displayed in red and blue, respectively.

	EN	SF	SSIM	FSIM	GS	PSNR (dB)
DensFuse	6.73	8.01	0.61	0.634	0.816	20.241
NestFuse	6.68	12.00	0.68	0.517	0.754	19.919
RFN-Nest	6.86	12.19	0.57	0.522	0.749	20.353
U2Fusion	6.94	18.21	0.52	0.568	0.774	19.518
SwinFusion	6.54	10.55	0.68	0.525	0.746	20.117
CDDFusion	6.92	18.23	0.54	0.631	0.816	20.543
PSFusion	6.75	12.77	0.66	0.680	0.826	20.848
Ours	6.95	13.73	0.63	0.660	0.835	20.383

Table 5. This table shows the quantitative evaluation results of different resolution feature fusion modules. Best performance is highlighted in bold.

	EN	SF	SSIM	FSIM	GS	PSNR (dB)
NO.1	6.71	18.32	0.63	0.596	0.854	21.201
NO.2	6.71	18.05	0.63	0.598	0.846	21.123
NO.3	6.72	17.33	0.63	0.600	0.847	21.157
NO.4	6.68	16.61	0.63	0.600	0.847	21.288
a = 1	6.65	16.96	0.63	0.592	0.853	21.340
a = 2	6.72	18.31	0.63	0.601	0.851	21.342
a = 3	6.68	16.90	0.63	0.601	0.849	21.335
a = 4	6.68	15.04	0.60	0.602	0.848	21.235
a = 5	6.68	15.48	0.61	0.594	0.853	21.012
Ours	6.72	19.36	0.63	0.600	0.853	21.325

Table 6. This table shows the quantitative evaluation results for different values of N. Best performance is highlighted in bold.

	EN	SF	FSIM	PSNR (dB)	PM (MB)	pGPUu	Time
N = 1 (Ours)	6.72	19.36	0.600	21.324	1380.8	76.8 %	334 s
N = 2	6.71	19.36	0.591	21.324	1444.7	82.5 %	488 s
N = 3	6.68	18.33	0.602	21.057	1445.1	87 %	708 s
N = 4	6.72	17.41	0.592	21.524	1450.1	87 %	1180 s

Table 7. Thistable presents the results of several ablation experiments, including quantitative evaluations of fusion strategies, quantitative evaluations of sampling strategies, and quantitative evaluations of ablation experiments for hyperparameters

α

and

β

. Best performance is highlighted in bold, while second-best and third-best performances are displayed in red and blue, respectively.

Table 7. Thistable presents the results of several ablation experiments, including quantitative evaluations of fusion strategies, quantitative evaluations of sampling strategies, and quantitative evaluations of ablation experiments for hyperparameters

α

and

β

. Best performance is highlighted in bold, while second-best and third-best performances are displayed in red and blue, respectively.

	EN	SF	SSIM	FSIM	GS	PSNR (dB)
summation	4.83	17.52	0.62	0.465	0.859	21.047
concatenation	4.61	11.14	0.43	0.347	0.797	16.169
stride = 4 × 4	6.69	19.33	0.63	0.597	0.847	21.224
$α$ = 1 × $10^{4}$	3.95	13.10	0.23	0.161	0.722	14.555
$α$ = 1 × $10^{- 5}$	2.03	0.55	0.07	0.048	0.711	12.650
$α$ = 1 × 10 $10^{- 10}$	6.71	17.35	0.62	0.582	0.847	20.150
$α$ = 1 × $10^{- 14}$	6.68	16.32	0.58	0.577	0.831	20.512
$β$ = 1 × $10^{- 12}$	6.69	16.36	0.58	0.601	0.842	20.437
$β$ = 1 × $10^{- 8}$	6.70	17.41	0.62	0.583	0.841	20.105
$β$ = 1 × $10^{- 4}$	6.71	17.45	0.61	0.583	0.847	20.190
$β$ = 1 × $10^{8}$	4.66	22.88	0.58	0.547	0.810	20.690
Ours	6.72	19.36	0.63	0.600	0.853	21.325

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, J.; Ma, L.; Zhao, B.; Gou, Z.; Yin, Y.; Sun, G. MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images. Remote Sens. 2025, 17, 3740. https://doi.org/10.3390/rs17223740

AMA Style

Wang J, Ma L, Zhao B, Gou Z, Yin Y, Sun G. MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images. Remote Sensing. 2025; 17(22):3740. https://doi.org/10.3390/rs17223740

Chicago/Turabian Style

Wang, Jinwei, Liang Ma, Bo Zhao, Zhenguang Gou, Yingzheng Yin, and Guangcai Sun. 2025. "MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images" Remote Sensing 17, no. 22: 3740. https://doi.org/10.3390/rs17223740

APA Style

Wang, J., Ma, L., Zhao, B., Gou, Z., Yin, Y., & Sun, G. (2025). MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images. Remote Sensing, 17(22), 3740. https://doi.org/10.3390/rs17223740

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MRLF: Multi-Resolution Layered Fusion Network for Optical and SAR Images

Highlights

Abstract

1. Introduction

2. Related Works

2.1. Deep Learning-Based Image Fusion Framework

2.2. Style Features and Feature Correlation

3. Our Proposed Method

3.1. Overall Architecture

3.2. Decomposition and Reconstruction at Multiple Resolution Scales

3.3. Multi-Resolution Feature Fusion Module

3.4. Complementary Feature Extraction Model (CFEM) and Radiometric Consistency Enhancement

3.5. Loss Function

4. Implementation

4.1. Preparation for Implementation

4.1.1. Data Description

4.1.2. Training Parameters

4.2. Comparative Experiments

4.2.1. Comparison Experiments on the Dongying Dataset

4.2.2. Comparison Experiments on the Xi’an Dataset

4.2.3. Comparison Experiments on Larger Pixel Size Data Fusion Tasks

4.3. Ablation Experiment

4.3.1. Multi-Resolution Hierarchical Fusion

4.3.2. Number of Layers in the Complementary Feature Extraction Model

4.3.3. Fusion Strategy

4.3.4. Sampling Strategy

4.3.5. Ablation of α and β

4.4. Convergence Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

4.3.5. Ablation of $α$ and $β$