Article

Local Information-Driven Hierarchical Fusion of SAR and Visible Images via Refined Modal Salient Features

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2466; https://doi.org/10.3390/rs17142466
Submission received: 27 May 2025 / Revised: 30 June 2025 / Accepted: 14 July 2025 / Published: 16 July 2025

Abstract

Compared with other multi-source image fusion tasks, the fusion of visible and SAR images suffers from a shortage of training data for deep learning-based methods, and introducing structural priors into the design of fusion networks is a viable solution. We incorporate the feature hierarchy concept from computer vision, dividing deep features into low-, mid-, and high-level tiers. Based on the complementary modal characteristics of SAR and visible images, we design a fusion architecture that fully analyzes and exploits the differences among hierarchical features. Specifically, our framework has two stages. In the cross-modal enhancement stage, a CycleGAN generator-based method for cross-modal interaction and input data enhancement is employed to generate pseudo-modal images. In the fusion stage, we introduce three innovations: (1) we design level-specific feature extraction branches and fusion strategies according to the characteristics of each feature level and the complementary modal properties of SAR and visible images, so as to fully exploit cross-modal complementary features; (2) we propose the Layered Strictly Nested Framework (LSNF), which emphasizes hierarchical differences and exploits hierarchical characteristics to reduce feature redundancy; and (3) based on visual saliency theory, we propose a Gradient-Weighted Pixel Loss (GWPL), which dynamically assigns higher weights to regions with large gradient magnitudes, emphasizing the preservation of high-frequency detail during fusion. Experiments on the YYX-OPT-SAR and WHU-OPT-SAR datasets show that our method outperforms 11 state-of-the-art methods, and ablation studies confirm the contribution of each component. The proposed framework effectively meets the high-precision image fusion needs of remote sensing applications.

1. Introduction

Due to the inherent theoretical and technical limitations of hardware devices, images captured by a single sensor or under a single shooting condition often fail to effectively and comprehensively represent the imaging scene [1]. Image fusion naturally emerges as a solution by integrating meaningful information from different source images to produce a single image that contains richer information, thereby enhancing its utility for subsequent applications [2]. In the realm of remote sensing, image fusion tasks encompass a variety of modalities, each with unique characteristics and applications. Among these, the fusion of visible and synthetic aperture radar (SAR) images is of significant importance and has garnered considerable attention in recent years.
SAR actively emits pulsed electromagnetic waves and receives reflected signals to acquire surface information about the targets [3]. Unlike visible imaging technologies, SAR can penetrate clouds, rain, snow, and fog, providing all-weather and all-day imaging capabilities that are not affected by weather and lighting conditions. Although SAR is rich in spatial information, it lacks the spectral information that visible imagery provides, which is critical for numerous remote sensing applications [4]. Furthermore, SAR images often suffer from poor interpretability and are affected by speckle noise [5]. Visible images depend on sunlight reflected from Earth’s objects. Two differently structured objects may appear identical in visible imagery because of their similar spectral responses, yet such structural differences can often be distinguished in SAR imagery. Therefore, SAR and visible image fusion is advantageous for generating images with enriched spatial and spectral information. The analysis of fused images enhances our understanding and interpretation of the imaged area, serving a wide range of applications [4].
Early image fusion methods relied on mathematical transformations for manual analysis and fusion rule design in spatial or transform domains, also known as traditional fusion methods. These include multi-scale transform-based [6], sparse representation-based [7], subspace-based [8], saliency-based [9], and total variation-based methods [10]. However, these methods have limitations. They often enforce uniform transformations across different source images, neglecting inherent differences and leading to poorly expressive features. Additionally, traditional fusion strategies are too coarse for deep feature fusion, limiting performance. The rapid development of deep learning has shown superiority in various fields. Its powerful feature learning and generalization capabilities make it suitable for remote sensing image fusion tasks [11]. Common methods include convolutional neural networks (CNNs) [12], auto-encoder (AE) networks [13], generative adversarial networks (GANs) [14], and transformer-based approaches [15]. These techniques are crucial for autonomous feature learning and the efficient processing of large-scale remote sensing data.
The optimal fusion of complementary features constitutes a fundamental paradigm in image fusion research, where systematic investigation of feature decomposition methodologies has profoundly advanced theoretical understanding and algorithmic development in this domain. The seminal work by Vese and Osher [16] pioneered the application of finite difference techniques for the decomposition of images into cartoon and texture components, establishing the foundation for exploiting modality-specific characteristics. In the context of visible and SAR image fusion, Ye et al. [17] developed the VSFF framework to extract complementary features from optical and SAR modalities. However, conventional approaches rely on hand-crafted transformation heuristics, consequently limiting their adaptability to complex inter-modal discrepancies. In deep learning, Ye et al. [18] proposed SOSTF by incorporating structural–textural decomposition with customized loss functions. Its unsupervised learning approach, while advantageous in avoiding manual labeling, still relies on predefined structural and textural components, which hinders its ability to fully capture the complex interactions between SAR and visible images. Studies in other multi-modal fusion fields [19,20,21] have also delved into the hierarchical decomposition of complementary features. However, limitations still exist. On the one hand, there is a lack of a detailed refinement of the decomposed features at each hierarchical level and differentiated processing. In addition, complex decomposition methods and the insufficient use of hierarchical characteristics continue to present challenges, such as a loss of detail and the blurring of edges. It should be noted that the development of deep learning methodologies specifically for SAR-visible fusion remains relatively underexplored. To our knowledge, beyond the aforementioned works [17,18], Kong et al. [22] developed a semi-supervised approach that employs dense UGAN with the Gram–Schmidt transformation, which nevertheless requires annotated data. A major bottleneck that hinders progress in this field is the paucity of SAR-visible paired datasets, where both the scarcity and suboptimal quality of existing public repositories present significant obstacles. Although contemporary fusion algorithms focus predominantly on model architecture design, the quality and quantity of raw input images constitute critical determinants that significantly influence the effectiveness of deep learning methodologies [23]. Addressing these data-related challenges is imperative for advancing SAR-visible image fusion.
To address these technical challenges and compensate for the lack of detailed analysis and full utilization of hierarchical feature differences in existing studies, this paper proposes a method for the local information-driven hierarchical fusion of SAR and visible images via refined modal salient features. The methodology comprises two sequential processing phases: cross-modal enhancement and modal fusion. In the cross-modal enhancement phase, to mitigate the effects of data quality limitations on fusion network training, we employ CycleGAN generators [24] to produce pseudo-modal images. This approach establishes cross-modal correlations through CycleGAN’s cycle-consistency constraints, effectively enabling cross-modal interaction via bidirectional generation. Subsequently, the generated pseudo-modal images are concatenated with the source images along the channel dimension to achieve cross-modal enhancement. In the modal fusion phase, to fully exploit complementary modality information, we classify features into low-level, middle-level, and high-level groups [25] according to their degree of abstraction, based on computer vision principles [26]. In particular, the texture features of visible images and the structural contour features of SAR images are defined as low-level and middle-level features, respectively. In contrast to other fusion networks [20,21] based on complementary feature decomposition, our methodology explicitly distinguishes characteristic properties across different feature hierarchies and employs a targeted topological design to construct differentiated processing branches for feature extraction and fusion, along with a strict hierarchical reconstruction mechanism. Specifically, feature extraction is designed to accommodate the high dynamic range of SAR: a modified ResNet-50 architecture serves as the backbone network, with hierarchically partitioned subnetworks optimized for distinct feature levels. For feature fusion, we implement saliency-aware selective fusion for low-/middle-level features while maintaining maximum integrity preservation of high-level semantic-carrying features. For image reconstruction, we propose a Layered Strictly Nested Framework (LSNF) that systematically preserves and propagates refined hierarchical deep features by establishing strict intra-layer and inter-layer nested connections. To address detail degradation and edge blurring in the fusion process, we devise a Gradient-Weighted Pixel Loss (GWPL) that incorporates human visual system sensitivity priors. Unlike conventional pixel-wise losses, such as the mean squared error (MSE), which treat all regions equally, the GWPL prioritizes high-frequency components during the fusion process, which aligns with the cognitive principle that humans perceive scenes through structural contrasts. In summary, the main contributions of this paper are as follows.
  • To mitigate the effects of data quality limitations on the training of fusion networks, we propose a two-stage hierarchical fusion paradigm for SAR-visible images. The cross-modal enhancement stage employs CycleGAN generators to synthesize pseudo-modal images, effectively facilitating cross-modal interaction between SAR-visible images and enhancing cross-modal consistency. In the fusion stage, departing from conventional decomposition-based methods, we introduce computer vision-guided hierarchical decomposition to explicitly refine modality salient features and implement strict layer-wise processing of complementary characteristics.
  • To fully exploit and utilize complementary modality information while jointly considering the characteristic properties of hierarchical features and the high dynamic range specific to SAR images, we construct a topology where differentiated feature extraction and fusion branches target distinct hierarchical features and devise an LSNF. This hierarchical differentiation mechanism facilitates improved learning of complementary cross-modal features between visible and SAR modalities.
  • To address detail degradation and edge blurring in the fusion process, we formulate a GWPL that guides pixel-level optimization through feature saliency spatial variations, thereby bridging global structural constraints with local detail preservation and enhancing detail representation in fused results.
Experimental results on two datasets of different scales demonstrate that our method achieves superior image fusion performance compared to existing deep learning-based and traditional methods. This is evident in both human visual perception and objective metrics, with particularly outstanding performance in metrics related to image detail quality.

2. Materials and Methods

In this section, we detail the framework and workflow of the method. Our basic idea for fusion is to first convert the visible images from the RGB space to the YCbCr space, extract the Y channel images, and perform grayscale fusion with the SAR images; the fused grayscale result is then recombined with the Cb and Cr channels to reconstruct the fused color images. For simplicity, the visible images discussed below refer to visible grayscale images. The proposed framework (Figure 1) employs a generative adversarial network with dual discriminators. The generator’s workflow comprises four sequential processing stages: First, source SAR and visible images are processed by CycleGAN generators (Figure 2) to produce pseudo-modality images, thereby establishing bidirectional interactions between the SAR and visible domains. These generated images are then channel-concatenated with the original inputs for cross-modal enhancement. Subsequently, as illustrated in Figure 3, the enhanced representations undergo multi-layer deep feature extraction. The extracted features then progress through hierarchical fusion operations. Finally, multi-level fused features are reconstructed into the output image via an LSNF.
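As a concrete illustration of this color handling, the following minimal sketch (in Python, using OpenCV) converts the visible image to YCbCr, fuses only the luminance channel with the SAR image, and reattaches the chrominance channels. The function fuse_grayscale is a placeholder for the full fusion network described in the remainder of this section, and the function name and the simple averaging rule in the usage example are illustrative assumptions rather than part of the proposed method.

```python
# Minimal sketch of the color-space handling described above (illustrative only):
# the visible image is converted to YCbCr, its Y channel is fused with the SAR
# image in grayscale, and the result is recombined with Cb/Cr. `fuse_grayscale`
# is a placeholder for the full fusion network.
import cv2
import numpy as np

def fuse_color(visible_bgr: np.ndarray, sar_gray: np.ndarray,
               fuse_grayscale) -> np.ndarray:
    ycrcb = cv2.cvtColor(visible_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    fused_y = fuse_grayscale(y, sar_gray)          # grayscale fusion of Y and SAR
    fused_y = np.clip(fused_y, 0, 255).astype(np.uint8)
    fused_ycrcb = cv2.merge([fused_y, cr, cb])     # reattach chrominance channels
    return cv2.cvtColor(fused_ycrcb, cv2.COLOR_YCrCb2BGR)

# Example usage with a trivial stand-in fusion rule (average of Y and SAR):
# fused = fuse_color(vis_img, sar_img,
#                    lambda y, s: 0.5 * y.astype(np.float32) + 0.5 * s.astype(np.float32))
```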

2.1. CycleGAN Generators for Cross-Modal Enhancement

To provide the data support required for network training, we input the real SAR image $I_{SAR\_Real}$ and the real visible image $I_{Vis\_Real}$ into the two generators of the CycleGAN network, $G_{SV}$ and $G_{VS}$, as shown in Figure 2:
$I_{SAR\_Fake} = G_{VS}(I_{Vis\_Real})$  (1)
$I_{Vis\_Fake} = G_{SV}(I_{SAR\_Real})$  (2)
The pseudo-SAR images $I_{SAR\_Fake}$ and the pseudo-visible images $I_{Vis\_Fake}$ are used to enhance data quality by supplementing the real images with additional information. These pseudo-images provide complementary features that may not be present in the real images, thereby enriching the dataset. Ultimately, they serve as input to the multi-layer deep feature extraction module of the fusion network. The integration of real and pseudo-images is performed through a convolutional operation, which is defined as follows:
$I_{SAR} = \mathrm{Conv}(\mathrm{Concat}(I_{SAR\_Real}, I_{SAR\_Fake}))$  (3)
$I_{Vis} = \mathrm{Conv}(\mathrm{Concat}(I_{Vis\_Real}, I_{Vis\_Fake}))$  (4)
Here, Conv represents a convolutional layer that processes the concatenated real and pseudo-images, extracting meaningful features for subsequent analysis. This approach ensures that the fusion network receives a comprehensive set of features, which improves its ability to perform accurate and robust image fusion.
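A minimal PyTorch-style sketch of this cross-modal enhancement step, following Equations (1)-(4), is given below. The generators g_vs and g_sv stand in for the pretrained CycleGAN generators, the single-channel input assumption and the 3 × 3 convolution size are illustrative choices, and freezing the generators during fusion training is an assumption rather than a detail stated in the text.

```python
# Sketch of the cross-modal enhancement step in Equations (1)-(4), assuming
# single-channel (grayscale) inputs of shape (B, 1, H, W). `g_vs` and `g_sv`
# stand in for the pretrained CycleGAN generators; the channel counts of the
# Conv layers are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

class CrossModalEnhancement(nn.Module):
    def __init__(self, g_vs: nn.Module, g_sv: nn.Module, out_channels: int = 1):
        super().__init__()
        self.g_vs, self.g_sv = g_vs, g_sv
        self.conv_sar = nn.Conv2d(2, out_channels, kernel_size=3, padding=1)
        self.conv_vis = nn.Conv2d(2, out_channels, kernel_size=3, padding=1)

    def forward(self, sar_real: torch.Tensor, vis_real: torch.Tensor):
        with torch.no_grad():                      # generators assumed frozen here
            sar_fake = self.g_vs(vis_real)         # Eq. (1): visible -> pseudo-SAR
            vis_fake = self.g_sv(sar_real)         # Eq. (2): SAR -> pseudo-visible
        i_sar = self.conv_sar(torch.cat([sar_real, sar_fake], dim=1))  # Eq. (3)
        i_vis = self.conv_vis(torch.cat([vis_real, vis_fake], dim=1))  # Eq. (4)
        return i_sar, i_vis
```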

2.2. Multi-Layer Deep Feature Extraction

Given the limited research on fusing deep features from the low, middle, and high levels in existing methods, this module adopts the ResNet-50 [27] architecture for its excellent performance in extracting deep features. The same modified ResNet-50 architecture is used as the backbone, with two branches separately extracting multi-layer deep features from the SAR and visible images. Since SAR images typically have a higher dynamic range [28], we normalize the pixels to [−1, 1] instead of [0, 1], which better captures the negative components of the input data. We also replace the ReLU [29] activation function in the original ResNet-50 architecture with Leaky_ReLU [30], and the activation functions used in the remaining modules are likewise designed as or modified to Leaky_ReLU, as shown in Figure 3a. First, the input SAR and visible images are passed through the backbone network to extract multi-depth features:
$Out1_S, Out2_S, Out3_S, Out4_S, Out5_S = F_S(I_{SAR})$  (5)
$Out1_V, Out2_V, Out3_V, Out4_V, Out5_V = F_V(I_{Vis})$  (6)
In this study, the modules $F_S(\cdot)$ and $F_V(\cdot)$ denote the feature extraction components of the SAR and visible branches, respectively. $Out1$, $Out2$, $Out3$, $Out4$, and $Out5$ denote the deep features extracted from the backbone, ranging from low to high levels, and the subscript indicates whether the output comes from the SAR or the visible branch. The distinction between low-, middle-, and high-level features is not sharply defined and is usually based on their degree of abstraction. As shown in Figure 4, we perform a visual analysis of the outputs $Out1$ to $Out5$ and categorize them into distinct feature levels according to their complexity and abstraction. $Out1$ represents low-level features, capturing fundamental patterns and edges crucial for initial image comprehension, with a dimensionality of 32. $Out2$ and $Out3$ represent middle-level features, which extend the low-level features by capturing more intricate patterns and textures; their dimensionalities of 128 and 256, respectively, offer a richer depiction of image content and facilitate better differentiation of structures. $Out4$ and $Out5$ are identified as high-level features, encapsulating the most abstract and complex representations and capturing semantic information and intricate details; with dimensionalities of 512 and 1024, they can represent detailed and nuanced aspects of the images. This hierarchical division from low- to high-level features ensures a comprehensive analysis that effectively captures both fundamental and complex information during feature extraction.
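The following sketch illustrates the backbone modification described above: inputs rescaled to [−1, 1], every ReLU replaced by Leaky_ReLU, and five intermediate outputs tapped as Out1 to Out5. For brevity it builds on a stock torchvision ResNet-50, whose channel widths (64/256/512/1024/2048) differ from the dimensionalities reported above, so it should be read as an illustration of the idea rather than the exact network.

```python
# Sketch of the modified backbone: pixels scaled to [-1, 1], every ReLU replaced
# by Leaky_ReLU, and five intermediate outputs tapped as Out1..Out5. A stock
# torchvision ResNet-50 is used here for brevity, so the channel widths differ
# from the paper's (32/128/256/512/1024); treat this as illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def replace_relu_with_leaky(module: nn.Module, slope: float = 0.2) -> None:
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(slope, inplace=True))
        else:
            replace_relu_with_leaky(child, slope)

class MultiLevelBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        replace_relu_with_leaky(net)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)   # -> Out1
        self.layer1, self.layer2 = net.layer1, net.layer2         # -> Out2, Out3
        self.layer3, self.layer4 = net.layer3, net.layer4         # -> Out4, Out5
        self.pool = net.maxpool

    def forward(self, x: torch.Tensor):
        x = x * 2.0 - 1.0                 # assume input in [0, 1]; rescale to [-1, 1]
        out1 = self.stem(x)
        out2 = self.layer1(self.pool(out1))
        out3 = self.layer2(out2)
        out4 = self.layer3(out3)
        out5 = self.layer4(out4)
        return out1, out2, out3, out4, out5
```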
Subsequently, acknowledging that low-level features in images of the same scene across different modalities are generally more consistent and stable [31], we integrated the DenseASPP module [32] to enhance these low-level features. This module is specifically designed to densify feature maps, thereby capturing richer and more intricate details within the images. This enhancement not only improves the granularity of the feature representation but also significantly bolsters the robustness and accuracy of the fusion results by providing a more stable foundation for subsequent processing stages.
In the final stage, the densification of low-level features using the DenseASPP module, along with the seamless transmission of middle- and high-level features, culminates in the output of the multi-depth feature extraction module. The equations below describe the process:
$Low_S = \mathrm{DenseASPP}(Out1_S)$  (7)
$Low_V = \mathrm{DenseASPP}(Out1_V)$  (8)
For both the SAR and visible modalities, the outputs $Out2$ and $Out3$ capture perceptible structural details with progressively enriched feature combinations through channel-wise expansion, forming complex pattern representations categorized as middle-level features (designated $Middle2$ and $Middle1$, respectively). In addition, the outputs $Out4$ and $Out5$ encode abstract global contextual information beyond direct perceptual details, constituting high-level features (identified as $High2$ and $High1$).
In this context, $Low_S$, $Middle1_S$, $Middle2_S$, $High1_S$, and $High2_S$ denote the feature hierarchy of the SAR image. Similarly, $Low_V$, $Middle1_V$, $Middle2_V$, $High1_V$, and $High2_V$ represent the corresponding features of the visible image. This structured approach ensures that all feature levels are used effectively, maintains the integrity and richness of the data throughout the extraction process, and ultimately contributes to more precise and reliable image fusion.
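The snippet below sketches a much-simplified DenseASPP-style block for the low-level branch, in which each atrous convolution sees the concatenation of the input and all previous outputs; the dilation rates, growth width, and projection layer are illustrative assumptions, and the original module design is described in [32].

```python
# A much-simplified, DenseASPP-style block for the low-level branch: each atrous
# convolution sees the concatenation of the input and all previous outputs,
# densifying multi-scale context. Dilation rates and channel counts are
# illustrative; see Yang et al. [32] for the original design.
import torch
import torch.nn as nn

class MiniDenseASPP(nn.Module):
    def __init__(self, in_ch: int, growth: int = 32, dilations=(3, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=d, dilation=d),
                nn.LeakyReLU(0.2, inplace=True)))
            ch += growth                               # dense concatenation grows width
        self.project = nn.Conv2d(ch, in_ch, kernel_size=1)   # back to in_ch channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        return self.project(torch.cat(feats, dim=1))   # Low = DenseASPP(Out1), Eqs. (7)-(8)
```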

2.3. Deep Feature Hierarchical Fusion

As depicted in Figure 3b, we integrated the lightweight attention module CBAM [33] into the fusion component to effectively merge low- and middle-level features. This module enhances feature representation by selectively highlighting advantageous modal features, thereby improving the overall quality of the fusion. For the integration of high-level features, an adder mechanism is employed to ensure the retention of all feature information, which is crucial for guiding the reconstruction of the fused image while preserving semantic integrity. To illustrate the fusion process, we present the merging of $Low_S$ and $Low_V$ as a representative example, demonstrating the methodology applied to the fusion of low-level and middle-level features.
The channel attention $M_C(F)$ and spatial attention $M_S(F)$ mechanisms are defined as follows:
$M_C(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$  (9)
$M_S(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]))$  (10)
The fusion of low-level features is achieved through a two-step process. Initially, the low-level features of the SAR and visible modalities, denoted as $Low_S$ and $Low_V$, are concatenated and processed through a convolutional layer. This operation is defined as
$Low_x = \mathrm{Conv}(\mathrm{Concat}(Low_S, Low_V))$  (11)
Here, $\mathrm{Conv}$ represents a convolutional layer that extracts integrated features from the concatenated input, enhancing the representation by capturing interactions between the modalities. Subsequently, the fused features are enhanced through the spatial and channel attention mechanisms, denoted as $M_S$ and $M_C$, respectively. This enhancement is expressed as
$Low_F = M_S\big(M_C(Low_x) \times Low_x\big) \times \big(M_C(Low_x) \times Low_x\big)$  (12)
This approach ensures that the essential characteristics of each modality are preserved and effectively combined, resulting in robust and comprehensive feature fusion. Using attention mechanisms, the fusion process selectively emphasizes important features, thereby enhancing the quality and robustness of the final output.
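A compact implementation sketch of this CBAM-based fusion of two same-level features, following Equations (9)-(12), is shown below; the MLP reduction ratio and the 7 × 7 spatial kernel follow common CBAM defaults and are assumptions rather than the paper's exact settings.

```python
# Sketch of the CBAM-based fusion of two same-level features, following
# Equations (9)-(12). The MLP reduction ratio and kernel sizes follow common
# CBAM defaults and are assumptions, not the paper's exact settings.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):                        # M_C(F), Eq. (9)
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, channels, 1))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):                        # M_S(F), Eq. (10)
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(f, dim=1, keepdim=True)
        mx, _ = torch.max(f, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAMFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, 3, padding=1)   # Eq. (11)
        self.m_c = ChannelAttention(channels)
        self.m_s = SpatialAttention()

    def forward(self, feat_s: torch.Tensor, feat_v: torch.Tensor) -> torch.Tensor:
        x = self.conv(torch.cat([feat_s, feat_v], dim=1))   # Low_x
        x = self.m_c(x) * x                                  # channel-refined features
        return self.m_s(x) * x                               # Eq. (12): Low_F
```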

2.4. Layered Strictly Nested Framework (LSNF)

Once the deeply fused features have been obtained at the various levels, the focus shifts to how to effectively apply these features in a layered manner. Drawing inspiration from previous works [34,35], in which nested connection-based decoders were found to be effective in preserving feature information, we further advanced this concept and proposed the Layered Strictly Nested Framework (LSNF), illustrated in part (c) of Figure 3. This framework purposefully nests features within the same level of the topological structure according to their hierarchical differences, instead of nesting all features indiscriminately. This design better highlights hierarchical differences, leverages hierarchical characteristics to reduce feature redundancy, and significantly reduces the demand for computational resources. Specifically, we denote $k$ as the feature level, where $k \in \{1\ \text{(high-level)},\ 2\ \text{(middle-level)},\ 3\ \text{(low-level)}\}$. We use $m_k \in \{1, 2, 3, \ldots, M_k\}$ to represent the channel levels from high to low within the $k$th feature level, and $n \in \{1, 2, 3, \ldots, N_k\}$ to distinguish the outputs $x^{(k, m_k, n)}$ of different convolutional groups. Here, $M_k$ is determined by the network design, and $N_k$ follows the conditional relationship (13), where the dependency between $N_k$ and $M_k$ explicitly reflects the hierarchical nesting strategy of the features.
$N_k = \begin{cases} M_k & \text{if } k = 1 \\ M_k + 1 & \text{if } k = 2, 3 \end{cases}$  (13)
The hierarchical feature nesting mechanism is illustrated in Figure 5. The outputs of the convolutional groups are defined according to (14) and (15):
$x^{(1, m_1, n)} = \begin{cases} \mathrm{High}F_{m_1} & n = 1 \\ \mathrm{Conv}\big(\big[x^{(1, m_1, i)}\big]_{i=1}^{n-1}, \mathrm{up}\big(x^{(1, m_1 - 1, n - 1)}\big)\big) & m_1 \geq n > 1 \end{cases}$  (14)
$x^{(k, m_k, n)} = \begin{cases} \mathrm{Middle/Low}F_{m_k} & n = 1 \\ \mathrm{Conv}\big(x^{(k, 1, 1)}, \mathrm{up}\big(x^{(k-1, M_{k-1}, N_{k-1})}\big)\big) & n = 2,\ m_k = 1 \\ \mathrm{Conv}\big(\big[x^{(k, m_k, i)}\big]_{i=1}^{n-1}, \mathrm{up}\big(x^{(k, m_k - 1, n - 1)}\big)\big) & m_k \geq n \geq 2 \end{cases}$  (15)
Here, $\mathrm{High}F_{m_k}$ and $\mathrm{Middle/Low}F_{m_k}$ represent the fused features at level $m_k$, derived from the feature extraction and fusion branches. $\mathrm{Conv}$ indicates the processing performed by the convolutional group, while $\mathrm{up}$ denotes the upsampling procedure. In our approach, the parameters are set as $M_1 = 2$, $M_2 = 2$, and $M_3 = 1$.
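To make the notation in Equations (14) and (15) concrete, the sketch below implements a single LSNF node, i.e., the operation Conv([same-row inputs], up(lower-row input)); the channel counts, the bilinear upsampling mode, and the depth of the convolutional group are illustrative assumptions.

```python
# Illustrative building block for the LSNF decoder: each node x^(k, m_k, n) in
# Equations (14)-(15) concatenates the earlier outputs of its own row with an
# upsampled output from the row below, then applies a small convolutional group.
# Channel counts, the upsampling mode, and the conv-group depth are assumptions.
import torch
import torch.nn as nn

class LSNFNode(nn.Module):
    """Computes Conv([same-row inputs], up(lower-row input))."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True))

    def forward(self, same_row, lower_row):
        feats = same_row + [self.up(lower_row)]        # "up" in Eqs. (14)-(15)
        return self.conv(torch.cat(feats, dim=1))      # "Conv" in Eqs. (14)-(15)

# Example: x^(1,2,2) = Conv([x^(1,2,1)], up(x^(1,1,1))) with illustrative shapes.
x_1_2_1 = torch.randn(1, 64, 32, 32)    # HighF_2 (same row, n = 1)
x_1_1_1 = torch.randn(1, 64, 16, 16)    # HighF_1 (row below, half resolution)
node = LSNFNode(in_channels=128, out_channels=64)
x_1_2_2 = node([x_1_2_1], x_1_1_1)      # shape (1, 64, 32, 32)
```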

2.5. Discriminator Structure

As shown in Figure 6, the two discriminators share the same architecture. Each consists of seven convolutional layers followed by a linear layer. The first convolutional module comprises a convolutional layer and a Leaky_ReLU activation function, without batch normalization. The next six modules have a consistent structure: each includes a convolutional layer, a batch normalization (BN) layer, and a Leaky_ReLU activation function. All convolutional layers use a kernel size of 3 × 3 with a stride of 2, facilitating a rapid reduction in the width and height of the feature maps. The final linear layer transforms the flattened feature map into an output that quantifies the relative distance between the generated image and the real image. To improve parameter efficiency, the weights of the second through seventh convolutional layers are shared.
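A minimal sketch of this discriminator is given below; the channel width, the assumed 512 × 512 input size, and the use of a single shared module to realize the weight sharing of convolutions two through seven are illustrative choices.

```python
# Sketch of the discriminator described above: seven stride-2, 3x3 convolutions
# followed by a linear layer; the first module has no batch normalization, and
# one shared conv module stands in for layers 2-7 (whose weights are shared).
# Channel width and the assumed 512x512 input size are illustrative choices.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_channels: int = 1, width: int = 64, input_size: int = 512):
        super().__init__()
        self.first = nn.Sequential(                      # conv 1: no batch norm
            nn.Conv2d(in_channels, width, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True))
        self.shared = nn.Sequential(                     # reused for convs 2-7
            nn.Conv2d(width, width, 3, stride=2, padding=1),
            nn.BatchNorm2d(width),
            nn.LeakyReLU(0.2, inplace=True))
        final_size = input_size // (2 ** 7)              # spatial size after 7 stride-2 convs
        self.linear = nn.Linear(width * final_size * final_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.first(x)
        for _ in range(6):                               # convs 2-7 share weights
            x = self.shared(x)
        return self.linear(torch.flatten(x, start_dim=1))

# score = Discriminator()(torch.randn(2, 1, 512, 512))   # -> shape (2, 1)
```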

2.6. Loss Function

Here, we focus on the loss function settings during the fusion phase. In the fusion phase, we constructed a generative adversarial network (GAN), which naturally divides the loss into generator loss and discriminator loss.

2.6.1. Loss Function of the Generator

The generator loss function, denoted as $Loss_G$, is a composite of two primary components: the content loss, $Loss_{Content}$, and the adversarial loss, $Loss_{GAN}$. The content loss ensures that the generated images retain the essential features and structure of the input images, thereby preserving semantic integrity during the generation process. The adversarial loss, on the other hand, drives the generator to produce outputs that are indistinguishable from real images, enhancing the realism of the generated content. The formulation of the generator’s loss function is given by
$Loss_G = \lambda_1 Loss_{Content} + \lambda_2 Loss_{GAN}$  (16)
where $\lambda_1$ and $\lambda_2$ are weighting factors that balance the contributions of the content and adversarial losses, respectively. By appropriately tuning these parameters, the generator can be optimized to achieve a desirable trade-off between content fidelity and visual realism.

2.6.2. Content Loss

In this section, we propose an innovative Gradient-Weighted Pixel Loss (GWPL) to ensure that the generated image retains as much detailed information from both the SAR image and the visible image as possible while preserving the overall structure and pixel integrity within a local scope.
First, for the input image I, the Sobel operator [36] is employed to compute its gradient:
$G_x(I) = I \otimes K_x, \quad G_y(I) = I \otimes K_y$  (17)
where $K_x$ and $K_y$ are the horizontal and vertical kernels of the Sobel operator, and $\otimes$ denotes the convolution operation. The gradient magnitude is then calculated as
$G(I) = |G_x(I)| + |G_y(I)|$  (18)
For the real SAR image $I_{SAR}$, the real visible image $I_{Vis}$, and the fused image $I_F$, we compute their gradient magnitude matrices $G(SAR)$, $G(Vis)$, and $G(F)$ using (17) and (18). The joint gradient matrix is defined as
$G_{joint} = \max(G(Vis), G(SAR))$  (19)
The gradient loss is defined as the L1 loss [37] between the fused image gradient matrix and the joint gradient matrix:
$Loss_{grad} = \big\| G(F) - G_{joint} \big\|_1$  (20)
Specifically, we define the weight matrices for SAR and visible images as follows:
$W_S(i, j) = \begin{cases} 1, & \text{if } G(SAR)(i, j) = G_{joint}(i, j) \\ 0, & \text{otherwise} \end{cases} \qquad W_V(i, j) = \begin{cases} 1, & \text{if } G(Vis)(i, j) = G_{joint}(i, j) \\ 0, & \text{otherwise} \end{cases}$  (21)
The GWPL is then formulated as
$Loss_{GWPL} = \frac{1}{N} \Big( \big\| W_S \odot (I_F - I_{SAR}) \big\|_2^2 + \big\| W_V \odot (I_F - I_{Vis}) \big\|_2^2 \Big)$  (22)
Here, $N$ represents the number of pixels in the input image, and $\odot$ denotes element-wise multiplication. The content loss, obtained by further incorporating the Structural Similarity Index (SSIM) [38], is formulated as follows:
$Loss_{SSIM} = w_1 (1 - SSIM_{F, SAR}) + w_2 (1 - SSIM_{F, Vis})$  (23)
$Loss_{Content} = \alpha Loss_{grad} + \beta Loss_{GWPL} + \gamma Loss_{SSIM}$  (24)
In these equations, $w_1$ and $w_2$ are weighting factors that determine the relative importance of the SSIM loss components for the SAR and visible images, respectively. These weights allow the balance between preserving the structural similarity of the fused image with respect to each modality to be adjusted; by tuning $w_1$ and $w_2$, one can emphasize the preservation of details from the SAR or visible images, depending on the specific requirements of the application. The parameters $\alpha$, $\beta$, and $\gamma$ are additional weighting factors that balance the contributions of the gradient loss $Loss_{grad}$ [39], the $Loss_{GWPL}$, and the SSIM loss $Loss_{SSIM}$ to the overall content loss. This formulation ensures a comprehensive approach to maintaining both the structural integrity and the detailed information in the fused image.
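The content loss of Equations (17)-(24) can be sketched as follows for single-channel tensors of shape (B, 1, H, W). The SSIM term is delegated to a user-supplied ssim_fn (e.g., from an off-the-shelf SSIM implementation), and the L1 and GWPL terms are normalized over pixels for numerical convenience, which differs from the raw norms in (20) and (22) only by a constant factor; these normalization choices are assumptions.

```python
# Sketch of the content loss in Equations (17)-(24) for single-channel tensors of
# shape (B, 1, H, W). The SSIM term is delegated to a user-supplied `ssim_fn`;
# the default weights follow Section 3.1.3 but are passed in explicitly here.
import torch
import torch.nn.functional as F

_KX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_KY = _KX.transpose(2, 3)

def sobel_magnitude(img: torch.Tensor) -> torch.Tensor:
    gx = F.conv2d(img, _KX.to(img), padding=1)          # Eq. (17)
    gy = F.conv2d(img, _KY.to(img), padding=1)
    return gx.abs() + gy.abs()                          # Eq. (18)

def content_loss(i_f, i_sar, i_vis, ssim_fn,
                 alpha=8.0, beta=15.0, gamma=8.0, w1=0.5, w2=0.5):
    g_f, g_sar, g_vis = map(sobel_magnitude, (i_f, i_sar, i_vis))
    g_joint = torch.maximum(g_vis, g_sar)               # Eq. (19)
    loss_grad = F.l1_loss(g_f, g_joint)                 # Eq. (20), mean over pixels

    w_s = (g_sar == g_joint).float()                    # Eq. (21): SAR-dominant pixels
    w_v = (g_vis == g_joint).float()                    #           visible-dominant pixels
    n = i_f.numel()
    loss_gwpl = ((w_s * (i_f - i_sar)).pow(2).sum()
                 + (w_v * (i_f - i_vis)).pow(2).sum()) / n           # Eq. (22)

    loss_ssim = w1 * (1 - ssim_fn(i_f, i_sar)) + w2 * (1 - ssim_fn(i_f, i_vis))  # Eq. (23)
    return alpha * loss_grad + beta * loss_gwpl + gamma * loss_ssim  # Eq. (24)
```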

2.6.3. Adversarial Loss

Adversarial loss is a key component in facilitating the interactive update of parameters between the generator and the discriminator. The goal is for the discriminator’s evaluation of the generated image to approach a target value of 0.9, rather than 1. This adjustment helps mitigate potential issues such as gradient vanishing, which can occur when the discriminator’s output is too close to the extremes of 0 or 1. In this framework, two discriminators are employed to evaluate the generated images, resulting in two components of adversarial loss for the generator, and they are defined as follows:
$Loss_{GAN\_Vis} = \frac{1}{N} \sum_{n=1}^{N} \big( D_V(I_F^{(n)}) - 0.9 \big)^2$  (25)
$Loss_{GAN\_SAR} = \frac{1}{N} \sum_{n=1}^{N} \big( D_S(I_F^{(n)}) - 0.9 \big)^2$  (26)
In these equations, $n \in \{1, \ldots, N\}$ denotes the index of the fused image, $I_F^{(n)}$ represents the $n$-th fused image, and $D_V(I_F^{(n)})$ and $D_S(I_F^{(n)})$ are the outputs of the visible and SAR discriminators, respectively, when evaluating the generated image.
The overall adversarial loss, which combines the contributions from both discriminators, is expressed as
$Loss_{GAN} = Loss_{GAN\_Vis} + Loss_{GAN\_SAR}$  (27)
This formulation ensures that the generator is effectively trained to produce images that are indistinguishable from real images, thereby enhancing the realism and quality of the generated outputs.
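A minimal sketch of the generator's adversarial loss in Equations (25)-(27) is given below; d_vis and d_sar are the two discriminators, and the fused batch is assumed to have already been produced by the generator.

```python
# Minimal sketch of the generator's adversarial loss, Equations (25)-(27): both
# discriminators score the fused batch, and the squared distance to the softened
# target of 0.9 is averaged over the batch.
import torch

def generator_adversarial_loss(d_vis, d_sar, fused: torch.Tensor) -> torch.Tensor:
    target = 0.9
    loss_vis = torch.mean((d_vis(fused) - target) ** 2)   # Eq. (25)
    loss_sar = torch.mean((d_sar(fused) - target) ** 2)   # Eq. (26)
    return loss_vis + loss_sar                             # Eq. (27)
```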

2.6.4. Loss Function of the Discriminator

The loss function $Loss_D$ of the discriminator is designed to evaluate the input images, with the objective of pushing the scores of real samples toward 0.9 and those of generated samples toward 0.1. This soft-labeling approach likewise helps maintain a stable training process by avoiding extreme values, which can lead to issues such as gradient vanishing. The loss functions for the two discriminators are defined as follows:
$Loss_{D\_Vis} = \frac{1}{N} \sum_{n=1}^{N} \big( D_V(I_{Vis}^{(n)}) - 0.9 \big)^2 + \frac{1}{N} \sum_{n=1}^{N} \big( D_V(I_F^{(n)}) - 0.1 \big)^2$  (28)
$Loss_{D\_SAR} = \frac{1}{N} \sum_{n=1}^{N} \big( D_S(I_{SAR}^{(n)}) - 0.9 \big)^2 + \frac{1}{N} \sum_{n=1}^{N} \big( D_S(I_F^{(n)}) - 0.1 \big)^2$  (29)
In these equations, $n \in \{1, \ldots, N\}$ is the image index, $I_{Vis}^{(n)}$ and $I_{SAR}^{(n)}$ denote the $n$-th real visible and SAR images, respectively, and $I_F^{(n)}$ is the $n$-th generated (fused) image. The terms $D_V(I_{Vis}^{(n)})$ and $D_S(I_{SAR}^{(n)})$ are the outputs of the visible and SAR discriminators when evaluating real images, while $D_V(I_F^{(n)})$ and $D_S(I_F^{(n)})$ are their outputs for the generated images.
This loss ensures that discriminators are effectively trained to distinguish between real and generated images, thus enhancing the overall robustness and accuracy of the adversarial network.
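The corresponding discriminator losses of Equations (28) and (29) can be sketched as follows; detaching the fused batch so that gradients do not flow back into the generator is a standard practice assumed here rather than a detail stated in the text.

```python
# Minimal sketch of the discriminator losses, Equations (28)-(29): real images
# are pushed toward 0.9 and fused images toward 0.1. Detaching the fused batch
# is an assumed (standard) practice, not stated explicitly in the text.
import torch

def discriminator_loss(d, real: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
    real_term = torch.mean((d(real) - 0.9) ** 2)
    fake_term = torch.mean((d(fused.detach()) - 0.1) ** 2)
    return real_term + fake_term

# loss_d_vis = discriminator_loss(d_vis, i_vis, i_fused)   # Eq. (28)
# loss_d_sar = discriminator_loss(d_sar, i_sar, i_fused)   # Eq. (29)
```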

3. Experimental Results

Here we cover the details of implementing and configuring our SAR and visible image fusion network. The experiments are designed to demonstrate the rationality of our model and network structure.

3.1. Experimental Setup

3.1.1. Datasets

The primary dataset used in our experiments consists of 90 pairs of SAR and visible images of size 512 × 512 pixels with sub-meter spatial resolution, sourced from the YYX-OPT-SAR dataset [40]. These high-resolution images provide detailed information about the shape, structure, and texture of objects, making them ideal for evaluating the performance of fusion algorithms in terms of structural and textural integration, and the fusion results derived from them offer a compelling assessment of the effectiveness of the fusion method. Furthermore, to assess the performance of the fusion method on different datasets, we used medium-resolution SAR and visible image pairs from the WHU-OPT-SAR dataset [41]. The original images in this dataset have a size of 5556 × 3704 pixels, with the optical images containing four channels. To conform to the three-channel requirement for visible images and to better illustrate the fusion effect, we excluded the fourth (α) channel and cropped the images to obtain 2566 pairs of 512 × 512 SAR and visible image patches. Examples of these images are presented in Figure 7.
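The WHU-OPT-SAR preprocessing described above can be sketched as follows: the fourth channel of each optical image is discarded and co-registered 512 × 512 patches are cropped from each scene; the non-overlapping grid cropping and the in-memory array interface are illustrative assumptions.

```python
# Sketch of the WHU-OPT-SAR preprocessing described above: drop the fourth
# channel of each optical image and crop co-registered 512x512 patches from the
# 5556x3704 scenes. The non-overlapping grid cropping is an illustrative choice.
import numpy as np

def crop_pairs(optical: np.ndarray, sar: np.ndarray, patch: int = 512):
    """optical: (H, W, 4) array; sar: (H, W) array, co-registered."""
    optical_rgb = optical[:, :, :3]                   # discard the fourth channel
    h, w = sar.shape[:2]
    pairs = []
    for top in range(0, h - patch + 1, patch):        # non-overlapping grid
        for left in range(0, w - patch + 1, patch):
            opt_patch = optical_rgb[top:top + patch, left:left + patch]
            sar_patch = sar[top:top + patch, left:left + patch]
            pairs.append((opt_patch, sar_patch))
    return pairs
```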

3.1.2. Metrics

We used seven metrics to quantitatively evaluate the fusion results, which fully cover all four aspects of the evaluation system [42,43] (a minimal computation sketch for two of these metrics follows the list):
  • Information-Based:
    - Entropy (EN) quantifies the amount of information contained in an image, with higher values indicating richer information diversity.
    - The standard deviation (SD) reflects the contrast and intensity distribution of the fused image, where a higher SD suggests a better utilization of the dynamic range.
    - The edge-based metric (Qabf) evaluates the edge information transferred from the source images to the fused result, with higher values implying superior edge preservation.
  • Image-Feature-Based:
    - The average gradient (AG) measures the sharpness and texture details of an image; higher AG values correspond to richer gradient information.
    - The spatial frequency (SF) characterizes the overall activity level of the image details, where an elevated SF indicates enhanced edge and texture representation.
  • Structural-Similarity-Based:
    - The Structural Similarity Index (SSIM) assesses the structural consistency between the fused and source images, with values closer to one denoting minimal structural distortion.
  • Human Perception-Inspired:
    - Visual information fidelity (VIF) quantifies perceptual similarity through natural scene statistics and human visual system modeling, where higher scores align better with human visual expectations.
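As referenced above, the following NumPy sketch makes two of these metrics concrete, the average gradient (AG) and the spatial frequency (SF), for a single-channel image; exact definitions vary slightly across the literature, so these follow common formulations and are illustrative rather than the precise implementations used for Table 1 and Table 2.

```python
# Minimal NumPy sketches of the average gradient (AG) and spatial frequency (SF)
# for a single-channel image. These follow common formulations and are meant
# only to make the quantities concrete.
import numpy as np

def average_gradient(img: np.ndarray) -> float:
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]                # horizontal differences
    gy = np.diff(img, axis=0)[:, :-1]                # vertical differences
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def spatial_frequency(img: np.ndarray) -> float:
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```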

3.1.3. Implementation Details

The computational experiments were executed on a cloud-based infrastructure utilizing NVIDIA L20 GPUs with CUDA 11.3 acceleration, managed through a Miniconda3 environment (Python 3.8, Ubuntu 20.04). Our training pipeline comprised two sequential phases: domain adaptation via CycleGAN pretraining for 2400 iterations to generate photorealistic pseudo-images, followed by fusion network optimization with dataset-specific configurations. We maintained a uniform batch size of two across all experiments while adapting training durations to dataset characteristics: 55 epochs for the YYX dataset and 25 epochs for the WHU dataset. The optimization strategy employed Adam with differential learning rates, $1.0 \times 10^{-2}$ for the generator network versus $1.0 \times 10^{-5}$ for the discriminators, with a scheduled decay policy that reduced the learning rates by 25% every 10 epochs. Loss function hyperparameters were calibrated through empirical validation: Equation (16) maintained equilibrium through identical weighting coefficients ($\lambda_1 = \lambda_2 = 1$), while structural preservation in Equation (23) utilized symmetric weights ($w_1 = w_2 = 0.5$). The content-loss terms in Equation (24) were governed by the coefficient triad $\alpha = 8$, $\beta = 15$, and $\gamma = 8$, selected to balance gradient magnitudes across the different loss components. This parametric configuration demonstrated stable convergence behavior while preserving multi-modal feature representations throughout the optimization trajectory.
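The optimizer setup described above can be sketched as follows; generator and discriminators are placeholders for the actual modules, and the learning-rate values follow the exponents as reconstructed in this subsection.

```python
# Sketch of the optimization setup described above: Adam with different learning
# rates for the generator and the discriminators, and a step decay that reduces
# the learning rate by 25% every 10 epochs. `generator` and `discriminators`
# are placeholders for the actual modules.
import torch

def build_optimizers(generator, discriminators):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-2)
    opt_d = torch.optim.Adam(
        [p for d in discriminators for p in d.parameters()], lr=1e-5)
    # multiply lr by 0.75 every 10 epochs (a 25% reduction)
    sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=10, gamma=0.75)
    sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=10, gamma=0.75)
    return opt_g, opt_d, sched_g, sched_d
```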

3.2. Comparison with State-of-the-Art Methods

Our experimental evaluation establishes a rigorous comparison framework using two benchmark datasets, incorporating both cutting-edge deep learning techniques and classical fusion methodologies. The evaluation encompasses eight representative deep learning approaches: FusionGAN [14], ResNetFusion [44], and DenseFuse [45] as foundational architectures, followed by more recent developments including TarDal [46], ReCoNet [47], SDNet [48], CDDFuse [49], and MACTFusion [50]. For traditional baselines, we implement three well-established methods: weighted least-squares (WLS) optimization [51], perceptual quality-oriented Hybrid-MSD [52], and the recent variational framework VSFF [17]. To ensure methodological fairness, the supervised models (TarDal and ReCoNet) employ their official pre-trained implementations without architectural modifications, while other deep learning competitors undergo full training convergence on our experimental datasets. All conventional methods strictly adhere to their original parameter configurations as specified in respective publications. This comprehensive evaluation protocol facilitates a direct performance comparison across different methodological paradigms while maintaining implementation consistency.

3.2.1. Qualitative Comparison

Figure 8 and Figure 9 provide a qualitative comparison of the fusion results. In Figure 8, our proposed method effectively fuses SAR and visible images with sub-meter resolution, retaining both significant structural and spectral information conducive to interpretation. The red rectangle highlights areas where our method excels in filling in the “bad areas” in the source SAR images with visible images, ensuring seamless integration with adjacent regions. This results in a more coherent and visually appealing image, enhancing interpretability.
In Figure 9, our method demonstrates its ability to fuse medium-resolution SAR and visible images effectively. It preserves crucial target contours and maintains good contrast in building information within the red rectangle. The method successfully balances detail retention and noise suppression, providing a clear and detailed representation of the scene. This balance is crucial for applications requiring precise image analysis and interpretation.

3.2.2. Quantitative Comparison

We conducted a comprehensive quantitative comparison between our proposed method and state-of-the-art techniques on two datasets: the compact YYX dataset and the extensive WHU dataset. Table 1 and Table 2 summarize the evaluation results using seven established metrics (EN, Qabf, SD, AG, SF, SSIM, and VIF), with the top three performances marked in red, green, and blue, respectively. On the YYX dataset (Table 1), traditional methods, including WLS, Hybrid-MSD, and VSFF, demonstrate competitive performance, achieving notable scores on multiple metrics; nevertheless, our method maintains relatively good performance. The results on the WHU dataset (Table 2) reveal distinct advantages of deep learning approaches, particularly our proposed method, which achieves top-three performance across all metrics. In particular, our method excels at preserving important details and textures, as evidenced by its high scores in AG (19.541) and SF (52.833), which aligns with our original intention of using detailed information to guide the training process. Furthermore, the consistently high performance across different datasets and metrics highlights the robustness and adaptability of our approach, and the impressive results on the larger WHU dataset provide strong evidence that our approach can effectively leverage large amounts of data for more comprehensive training, leading to better performance. In summary, the quantitative results validate the effectiveness of our proposed method in achieving high-quality image fusion, particularly on larger datasets. We have reason to believe that as dataset sizes continue to increase, our method will demonstrate even better performance owing to more thorough training and stronger generalization. Our approach is thus well suited to SAR and visible image fusion tasks at different resolutions.

3.3. Ablation Studies

In this section, we validate the effectiveness of different modules and design choices through ablation experiments conducted on the YYX dataset. As shown in Table 3, we use metrics such as EN, AG, SF, SD, VIF, and the number of model parameters to quantitatively assess fusion performance. The number of parameters directly affects the model’s performance and complexity and is computed from the specific architecture and configuration of the model. Generally, more parameters may lead to better performance owing to the model’s increased capacity to learn complex patterns, but they also result in greater computational complexity and resource demands; finding a balance between parameter count, performance, and complexity is therefore vital for optimal model design. The results of our method are highlighted in bold in the table, while results from other ablation configurations that outperform our method are underlined. These experiments demonstrate the contribution of each component to the overall performance, confirming the robustness and efficiency of our design. The analysis provides insight into how each module enhances the fusion process, ensuring an optimal balance between detail preservation and computational efficiency.

3.3.1. Improved ResNet-50 and [−1, 1] Normalization

We verify the validity of the [−1, 1] normalization and the modified Leaky_ReLU activation function. In Exp. I, we revert to the [0, 1] normalization and the original ReLU activation function, and the experimental results show that the modified Leaky_ReLU activation function combined with the [−1, 1] normalization retains more details when dealing with negative values. This normalization extends the dynamic range of the data, allowing the network to better capture subtle changes in features during training. Leaky_ReLU avoids the dying-ReLU problem by allowing negative values to pass, thus retaining more useful information during the feature extraction phase. In addition, a slight reduction in the number of parameters also helps to improve the computational efficiency and training speed of the model. Although this approach may have some impact on structural consistency, it provides higher image quality and better visual effects in tasks that require rich detail.

3.3.2. Fusion Method of Joint Attention Mechanism CBAM and Addition

We verify the necessity of the proposed fusion method that combines the CBAM attention mechanism with addition. In Exp. II, we used a fusion method based entirely on the CBAM; similarly, in Exp. III, we used a fusion method based entirely on addition. The experimental results show that although simple additive fusion completely retains the feature information and slightly reduces the number of parameters, it cannot distinguish the importance of features, which hinders the effective fusion of deep features. Conversely, fusion using the CBAM can selectively highlight features, but it leads to a significant increase in computational complexity and in the number of model parameters, especially when dealing with high-dimensional features, and may introduce unnecessary computational overhead. In addition, attention mechanisms can overemphasize certain features, causing other important features to be ignored and information to be lost. We therefore use the CBAM to select the low- and middle-level features and use addition to retain the high-level semantic features completely, so as to obtain a fused image with rich details and complete semantic features during image reconstruction.

3.3.3. Layered Strictly Nested Framework (LSNF)

We verify the validity of the proposed LSNF. In Exp. IV, we revert to the dense nested connection architecture [35]. The experimental results show that, first, our LSNF significantly reduces the number of model parameters and, second, such refined feature connections allow the features to interact more fully during the transfer process. This architecture not only improves the transmission efficiency of features but also enhances their diversity and expressive ability, improving the visual quality and detail of the image while keeping the model lightweight.

3.3.4. Densely Connected Atrous Spatial Pyramid Pooling (DenseASPP) Module

We validate the necessity of the DenseASPP module. In Exp. V, we removed this module, and the experimental results show that the DenseASPP module plays an important role in the low-level feature branches of the feature extraction phase, although using it results in a small increase in the number of model parameters. It captures multi-scale context information through densely connected atrous (dilated) convolutions and enhances the feature expression ability. The module helps improve feature resolution and detail retention, especially when dealing with complex scenes, enabling better extraction and fusion of detailed information. In addition, the DenseASPP module allows the model to process diverse input data more efficiently while maintaining high accuracy, improving overall performance and robustness.

3.3.5. Modal Cross-Enhancement with CycleGAN Generators

We verify the necessity of using the bidirectional generation capability of the CycleGAN generators to produce pseudo-modal images in the cross-modal enhancement stage. In Exp. VI, we directly input the source images into the fusion network. The experimental results show that performing cross-modal enhancement by generating pseudo-modal images before fusion helps overcome the differences between modalities. This approach provides a more consistent feature representation before fusion, allowing the fusion network to integrate information from different modalities more efficiently and thereby preserving more detail and texture, which is especially important for tasks that require high precision and rich detail. However, in applications where the requirements for detail are not high, direct fusion can achieve satisfactory results while simplifying the process and reducing the computational overhead. Therefore, whether to use cross-modal enhancement can be decided according to the specific task requirements.

3.3.6. Gradient-Weighted Pixel Loss (GWPL)

Finally, in Exp. VII, we modify the definition of Equation (22) by using a simple MSE loss [53] instead of the GWPL. The results show that while this configuration has a small advantage in the EN score, it ignores the weight allocation of image details. The GWPL enables a sharper focus on edges and detailed areas within images and enhances the retention of details by assigning greater weights to these critical regions. This approach aligns well with the human visual perception system, which prioritizes high-gradient areas. In contrast, a simple MSE loss treats every pixel uniformly, often leading to blurred details and information loss. Thus, the GWPL is particularly beneficial for tasks demanding high detail retention, as it delivers more precise image reconstruction that is more attuned to human visual perception.

4. Conclusions

This paper presented a novel local information-driven hierarchical fusion method for SAR and visible images. This method operates in two phases: In the cross-modal enhancement phase, CycleGAN generator-driven pseudo-modal image generation is employed to strengthen cross-modal consistency. In the fusion phase, the deep features are hierarchically decomposed. Guided by modality prior knowledge and the distinctive characteristics of hierarchical features, differentiated feature extraction and fusion branches are designed. The Layered Strictly Nested Framework (LSNF) establishes strict intra-layer and inter-layer nested connections to reduce computation and feature redundancy. The Gradient-Weighted Pixel Loss (GWPL) constraint increases the expression of high-frequency information, emphasizing edge details while maintaining global structural fidelity. Experiments on two datasets show that this method achieves state-of-the-art performance, highlighting its robustness, efficiency, and potential for practical applications. This approach not only provides a new solution for modality-adaptive remote sensing image fusion but also offers a novel perspective on feature interpretation. However, the model has limitations, such as performance degradation in small-scale datasets and room for improvement in visual perceptual quality. Future research will focus on lightweight variants for real-time deployment and extending the method to other modalities.

Author Contributions

Conceptualization, Y.Y., J.L. and Z.L.; Methodology, Y.Y.; Software, Y.Y.; Validation, Y.Y.; Formal analysis, Y.Y.; Investigation, Y.Y.; Resources, J.L. and Z.L.; Writing – original draft, Y.Y.; Writing – review & editing, Y.Y., L.J., J.L., S.L. and Z.L.; Supervision, J.L., S.L. and Z.L.; Project administration, J.L. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The “YYX datasets” supporting this study are openly available from https://github.com/yeyuanxin110/YYX-OPT-SAR, accessed on 20 May 2025. The “WHU datasets” are excerpted from https://github.com/AmberHen/WHU-OPT-SAR-dataset, accessed on 20 May 2025; additional data can be provided by the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 502–518. [Google Scholar] [CrossRef] [PubMed]
  2. Ma, J.; Ma, Y.; Li, C. Infrared and Visible Image Fusion Methods and Applications: A Survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  3. Franceschetti, G.; Lanari, R. Synthetic Aperture Radar Processing; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  4. Kulkarni, S.C.; Rege, P.P. Pixel level fusion techniques for SAR and optical images: A review. Inf. Fusion 2020, 59, 13–29. [Google Scholar] [CrossRef]
  5. Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43. [Google Scholar] [CrossRef]
  6. Liu, Y.; Liu, S.; Wang, Z. A General Framework for Image Fusion Based on Multi-Scale Transform and Sparse Representation. Inf. Fusion 2015, 24, 147–164. [Google Scholar] [CrossRef]
  7. Yang, B.; Li, S. Multi-Focus Image Fusion and Restoration with Sparse Representation. IEEE Trans. Instrum. Meas. 2010, 59, 884–892. [Google Scholar] [CrossRef]
  8. Harsanyi, J.C.; Chang, C.I. Hyperspectral Image Classification and Dimensionality Reduction: An Orthogonal Subspace Projection Approach. IEEE Trans. Geosci. Remote Sens. 1994, 32, 779–785. [Google Scholar] [CrossRef]
  9. Han, J.; Pauwels, E.J.; De Zeeuw, P. Fast Saliency-Aware Multi-Modality Image Fusion. Neurocomputing 2013, 111, 70–80. [Google Scholar] [CrossRef]
  10. Ma, J.; Chen, C.; Li, C.; Huang, J. Infrared and Visible Image Fusion via Gradient Transfer and Total Variation Minimization. Inf. Fusion 2016, 31, 100–109. [Google Scholar] [CrossRef]
  11. Lian, Z.; Zhan, Y.; Zhang, W.; Wang, Z.; Liu, W.; Huang, X. Recent Advances in Deep Learning-Based Spatiotemporal Fusion Methods for Remote Sensing Images. Sensors 2025, 25, 1093. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A General Image Fusion Framework Based on Convolutional Neural Network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
  13. Qi, B.; Zhang, Y.; Nie, T.; Yu, D.; Lv, H.; Li, G. A Novel Infrared and Visible Image Fusion Network Based on Cross-Modality Reinforcement and Multi-Attention Fusion Strategy. Expert Syst. Appl. 2025, 264, 125682. [Google Scholar] [CrossRef]
  14. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A Generative Adversarial Network for Infrared and Visible Image Fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  15. Qu, L.; Liu, S.; Wang, M.; Li, S.; Yin, S.; Qiao, Q.; Song, Z. TransFuse: A Unified Transformer-Based Image Fusion Framework Using Self-Supervised Learning. arXiv 2022, arXiv:2201.07451. [Google Scholar] [CrossRef]
  16. Vese, L.A.; Osher, S.J. Modeling Textures with Total Variation Minimization and Oscillating Patterns in Image Processing. J. Sci. Comput. 2003, 19, 553–572. [Google Scholar] [CrossRef]
  17. Ye, Y.; Zhang, J.; Zhou, L.; Li, J.; Ren, X.; Fan, J. Optical and SAR Image Fusion Based on Complementary Feature Decomposition and Visual Saliency Features. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205315. [Google Scholar] [CrossRef]
  18. Ye, Y.; Liu, W.; Zhou, L.; Peng, T.; Xu, Q. An Unsupervised SAR and Optical Image Fusion Network Based on Structure-Texture Decomposition. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4028305. [Google Scholar] [CrossRef]
  19. Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral Image Classification with Deep Feature Fusion Network. IEEE Trans. Geosci. Remote. Sens. 2018, 56, 3173–3184. [Google Scholar] [CrossRef]
  20. Li, Z.; Huang, L.; He, J. A Multiscale Deep Middle-level Feature Fusion Network for Hyperspectral Classification. Remote. Sens. 2019, 11, 695. [Google Scholar] [CrossRef]
  21. Aslam, M.A.; Salik, M.N.; Chughtai, F.; Ali, N.; Dar, S.H.; Khalil, T. Image Classification Based on Mid-Level Feature Fusion. In Proceedings of the 2019 15th International Conference on Emerging Technologies (ICET), Peshawar, Pakistan, 2–3 December 2019; pp. 1–6. [Google Scholar]
  22. Kong, Y.; Hong, F.; Leung, H.; Peng, X. A Fusion Method of Optical Image and SAR Image Based on Dense-UGAN and Gram–Schmidt Transformation. Remote Sens. 2021, 13, 4274. [Google Scholar] [CrossRef]
  23. Liu, Y.; Chen, X.; Wang, Z.; Wang, Z.J.; Ward, R.K.; Wang, X. Deep Learning for Pixel-Level Image Fusion: Recent Advances and Future Prospects. Inf. Fusion 2018, 42, 158–173. [Google Scholar] [CrossRef]
  24. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar]
  25. Yuqiang, F. Research on Mid-level Feature Learning Methods and Applications for Image Target Recognition. Ph.D. Thesis, National University of Defense Technology, Changsha, China, 2015. (In Chinese). [Google Scholar]
  26. Marr, D. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information; MIT Press: Cambridge, MA, USA, 1982. [Google Scholar]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  28. Yao, Z.; Fang, L.; Yang, J.; Zhong, L. Nonlinear Quantization Method of SAR Images with SNR Enhancement and Segmentation Strategy Guidance. Remote Sens. 2025, 17, 557. [Google Scholar] [CrossRef]
  29. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  30. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 30, p. 3. [Google Scholar]
  31. Zhang, Y.; Li, X.; Chen, W.; Zang, Y. Image Classification Based on Low-Level Feature Enhancement and Attention Mechanism. Neural Process. Lett. 2024, 56, 217. [Google Scholar] [CrossRef]
  32. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  34. Li, H.; Wu, X.J.; Durrani, T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
  35. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; LNCS; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11. [Google Scholar]
  36. Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973. [Google Scholar]
  37. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss Functions for Image Restoration with Neural Networks. IEEE Trans. Comput. Imaging 2017, 3, 47–57. [Google Scholar] [CrossRef]
  38. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  39. Ge, L.; Dou, L. G-Loss: A Loss Function with Gradient Information for Super-Resolution. Optik 2023, 280, 170750. [Google Scholar] [CrossRef]
  40. Li, J.; Zhang, J.; Yang, C.; Liu, H.; Zhao, Y.; Ye, Y. Comparative Analysis of Pixel-Level Fusion Algorithms and a New High-Resolution Dataset for SAR and Optical Image Fusion. Remote Sens. 2023, 15, 5514. [Google Scholar] [CrossRef]
  41. Li, X.; Zhang, G.; Cui, H.; Hou, S.; Wang, S.; Li, X.; Chen, Y.; Li, Z.; Zhang, L. MCANet: A Joint Semantic Segmentation Framework of Optical and SAR Images for Land Use Classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102638. [Google Scholar] [CrossRef]
  42. Liu, Z.; Blasch, E.; Xue, Z.; Zhao, J.; Laganière, R.; Wu, W. Objective Assessment of Multiresolution Image Fusion Algorithms for Context Enhancement in Night Vision: A Comparative Study. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 94–109. [Google Scholar] [CrossRef]
  43. Zhang, X.; Ye, P.; Xiao, G. VIFB: A Visible and Infrared Image Fusion Benchmark. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 468–478. [Google Scholar]
  44. Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and Visible Image Fusion via Detail Preserving Adversarial Learning. Inf. Fusion 2020, 54, 85–98. [Google Scholar] [CrossRef]
  45. Li, H.; Wu, X.J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef]
  46. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5792–5801. [Google Scholar]
  47. Huang, Z.; Liu, J.; Fan, X.; Liu, R.; Zhong, W.; Luo, Z. ReCoNet: Recurrent Correction Network for Fast and Efficient Multi-Modality Image Fusion. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 539–555. [Google Scholar]
  48. Zhang, H.; Ma, J. SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion. Int. J. Comput. Vis. 2021, 129, 2761–2785. [Google Scholar] [CrossRef]
  49. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  50. Xie, X.; Zhang, X.; Tang, X.; Zhao, J.; Xiong, D.; Ouyang, L.; Yang, B.; Zhou, H.; Ling, B.W.K.; Teo, K.L. MACTFusion: Lightweight Cross Transformer for Adaptive Multimodal Medical Image Fusion. IEEE J. Biomed. Health Inform. 2024, 29, 3317–3328. [Google Scholar] [CrossRef]
  51. Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and Visible Image Fusion Based on Visual Saliency Map and Weighted Least Square Optimization. Infrared Phys. Technol. 2017, 82, 8–17. [Google Scholar] [CrossRef]
  52. Zhou, Z.; Wang, B.; Li, S.; Dong, M. Perceptual Fusion of Infrared and Visible Images Through a Hybrid Multi-Scale Decomposition with Gaussian and Bilateral Filters. Inf. Fusion 2016, 30, 15–26. [Google Scholar] [CrossRef]
  53. Zhao, J.; Liu, Y.; Wang, L.; Zhang, W.; Chen, H. A Comprehensive Review of Image Fusion Techniques: From Pixel-Based to Deep Learning Approaches. IEEE Trans. Image Process. 2021, 30, 1–15. [Google Scholar]
Figure 1. Architecture of the GAN for SAR-visible image fusion with dual discriminators (discriminatorVis and discriminatorSAR). The generator integrates two core components: (1) a CycleGAN generator-based cross-modal enhancement module that enhances source image quality and enforces cross-modal consistency and (2) a modal fusion module performing multi-layer feature extraction, deep feature hierarchical fusion, and Layered Strictly Nested operations for fused image synthesis.
Figure 2. The CycleGAN architecture incorporates two generators (G_SV and G_VS) and two domain-specific discriminators (D_S and D_V). It employs cycle-consistency constraints to ensure invertible mappings for image translation. These generators, which produce pseudo-SAR and pseudo-visible images, are used for cross-modal enhancement between SAR and visible images.
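As a concrete illustration of the cross-modal enhancement stage sketched in Figure 2, the following PyTorch snippet shows how two pretrained, opposite-direction generators could be reused at inference time to produce the pseudo-modal inputs. The toy generator architecture, the direction convention (G_VS: visible to SAR, G_SV: SAR to visible), and the tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Minimal stand-in for a CycleGAN-style generator (assumed single-channel in/out)."""
    def __init__(self, channels=1, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, width, 7, padding=3), nn.InstanceNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, width, 3, padding=1), nn.InstanceNorm2d(width), nn.ReLU(True),
            nn.Conv2d(width, channels, 7, padding=3), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

# Assumed direction convention: G_VS maps visible -> pseudo-SAR, G_SV maps SAR -> pseudo-visible.
G_VS, G_SV = TinyGenerator(), TinyGenerator()

def cross_modal_enhance(vis, sar):
    """Produce the pseudo-modal images used as additional inputs to the fusion stage."""
    with torch.no_grad():
        pseudo_sar = G_VS(vis)   # visible scene rendered in the SAR domain
        pseudo_vis = G_SV(sar)   # SAR scene rendered in the visible domain
    return pseudo_vis, pseudo_sar

vis = torch.rand(1, 1, 256, 256)   # toy single-channel inputs in [0, 1]
sar = torch.rand(1, 1, 256, 256)
pseudo_vis, pseudo_sar = cross_modal_enhance(vis, sar)

# Training-time cycle-consistency constraint: vis should be recovered by G_SV(G_VS(vis)),
# and symmetrically for the SAR image.
cycle_loss = nn.L1Loss()(G_SV(G_VS(vis)), vis) + nn.L1Loss()(G_VS(G_SV(sar)), sar)
```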
Figure 3. Architecture of the modal fusion module. The proposed model integrates multi-layer deep feature extraction, deep feature hierarchical fusion, and the Layered Strictly Nested Framework (LSNF) to achieve robust fusion of SAR and visible images while preserving structural integrity and high-frequency details.
Figure 4. Comparative visualization of multi-level feature representations from cross-modal inputs (Vis vs. SAR); aligned feature maps from identical image pairs reveal modality-specific characteristics through divergent channel activations. Four representative channels were sampled from each feature tensor for visualization. For Vis images, the outputs are categorized as follows: Out_1^V (low-level fundamental patterns), Out_2^V and Out_3^V (middle-level structural patterns), and Out_4^V and Out_5^V (high-level semantics). For SAR images, they are categorized as follows: Out_1^S (low-level fundamental patterns), Out_2^S and Out_3^S (middle-level structural patterns), and Out_4^S and Out_5^S (high-level target interactions).
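To make the level grouping in Figure 4 concrete, the sketch below shows one way the five outputs of a per-modality encoder could be collected into low-, mid-, and high-level tiers. The placeholder encoder (a plain strided-convolution stack) and its channel widths are assumptions; only the Out_1 / Out_2–Out_3 / Out_4–Out_5 grouping follows the caption.

```python
import torch
import torch.nn as nn

class TieredEncoder(nn.Module):
    """Placeholder five-stage encoder; each stage halves the spatial resolution."""
    def __init__(self, in_ch=1, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        stages, prev = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.LeakyReLU(0.2, inplace=True),
            ))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)           # Out_1 ... Out_5 of this modality
        # Grouping used in Figure 4: Out_1 -> low, Out_2-Out_3 -> mid, Out_4-Out_5 -> high.
        return {"low": outs[:1], "mid": outs[1:3], "high": outs[3:]}

feats = TieredEncoder()(torch.rand(1, 1, 256, 256))
print([f.shape for f in feats["mid"]])   # two mid-level maps at 1/4 and 1/8 resolution
```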
Figure 5. The nested method in the Layered Strictly Nested Framework (LSNF). The solid lines denote components common to low-, middle-, and high-level features, while the dashed lines indicate operations exclusive to low- and middle-level features.
Figure 6. Dual-discriminator architecture for adversarial evaluation. Both discriminators (D_Vis and D_SAR) share convolutional layers (2–7) to reduce parameter redundancy. Each branch employs a sequence of strided convolutions, Leaky_ReLU activations, and batch normalization to assess the realism of fused images relative to source modalities.
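A minimal PyTorch sketch of the weight-sharing idea in Figure 6: each discriminator keeps its own first convolution while the deeper strided-convolution, batch-normalization, and Leaky_ReLU stack is shared. Channel widths, kernel sizes, and the patch-level decision head are assumptions; only the sharing of the middle layers follows the caption. Sharing the trunk is what keeps the parameter count of the two-branch critic close to that of a single discriminator.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    """Strided convolution + batch normalization + Leaky_ReLU, as in Figure 6."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 4, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

class DualDiscriminator(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        # Modality-specific first layers (layer 1 of each branch).
        self.head_vis = conv_block(in_ch, 32)
        self.head_sar = conv_block(in_ch, 32)
        # Shared trunk standing in for the shared layers 2-7 of the caption.
        self.shared = nn.Sequential(
            conv_block(32, 64), conv_block(64, 128), conv_block(128, 256),
        )
        self.decision = nn.Conv2d(256, 1, 3, padding=1)  # patch-level realism score

    def forward(self, fused):
        score_vis = self.decision(self.shared(self.head_vis(fused)))
        score_sar = self.decision(self.shared(self.head_sar(fused)))
        return score_vis, score_sar

d = DualDiscriminator()
fused = torch.rand(2, 1, 256, 256)
s_vis, s_sar = d(fused)   # two realism maps, one per source modality
```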
Figure 7. Examples of images used in the experiment: (a) SAR images from the “YYX Datasets”, (b) SAR images from the “WHU Datasets”, (c) visible images from the “YYX Datasets”, and (d) visible images from the “WHU Datasets”.
Figure 8. Fusion results on the YYX datasets. Red boxes highlight key enhancement areas with zoomed insets.
Figure 9. Fusion results on the WHU datasets. Red boxes highlight key enhancement areas with zoomed insets.
Table 1. Quantitative results of the YYX datasets.

| Method | EN | Qabf | SD | AG | SF | SSIM | VIF |
|---|---|---|---|---|---|---|---|
| WLS [51] | 6.992 | 0.539 | 35.601 | 12.086 | 32.839 | 0.967 | 0.499 |
| Hybrid-MSD [52] | 7.018 | 0.565 | 38.034 | 12.867 | 34.572 | 0.978 | 0.543 |
| VSFF [17] | 7.333 | 0.44 | 49.279 | 13.126 | 33.263 | 0.919 | 0.49 |
| FusionGAN [14] | 6.855 | 0.314 | 36.152 | 7.188 | 17.926 | 0.325 | 0.477 |
| DenseFuse [45] | 6.853 | 0.408 | 30.961 | 8.498 | 22.092 | 0.914 | 0.442 |
| ResNetFusion [44] | 6.537 | 0.57 | 32.044 | 11.384 | 32.698 | 0.872 | 0.7 |
| SDNet [48] | 6.434 | 0.445 | 27.009 | 9.94 | 27.274 | 0.903 | 0.418 |
| TarDal [46] | 6.261 | 0.178 | 24.679 | 5.186 | 13.143 | 0.656 | 0.294 |
| RecoNet [47] | 6.461 | 0.338 | 40.681 | 8.044 | 17.008 | 0.781 | 0.42 |
| CDDFuse [49] | 7.197 | 0.618 | 41.234 | 12.546 | 32.799 | 0.929 | 0.651 |
| MACTFusion [50] | 7.271 | 0.561 | 41.735 | 12.2 | 32.58 | 0.914 | 0.484 |
| Ours | 7.118 | 0.62 | 40.277 | 13.244 | 34.402 | 0.966 | 0.638 |

Note: The dataset contains 90 image pairs (81 for training and 9 for testing). Color coding: red = best, green = second-best, and blue = third-best.
Table 2. Quantitative results of the WHU datasets.

| Method | EN | Qabf | SD | AG | SF | SSIM | VIF |
|---|---|---|---|---|---|---|---|
| WLS [51] | 6.775 | 0.662 | 39.67 | 15.69 | 42.722 | 0.733 | 0.545 |
| Hybrid-MSD [52] | 6.687 | 0.688 | 48.527 | 16.854 | 46.773 | 0.947 | 0.655 |
| VSFF [17] | 6.241 | 0.202 | 24.767 | 7.881 | 23.634 | 0.784 | 0.465 |
| FusionGAN [14] | 7.223 | 0.569 | 47.873 | 17.242 | 44.19 | 0.471 | 0.434 |
| DenseFuse [45] | 6.582 | 0.562 | 34.942 | 11.539 | 31.36 | 0.807 | 0.518 |
| ResNetFusion [44] | 6.21 | 0.74 | 48.291 | 16.812 | 47.194 | 0.954 | 0.834 |
| SDNet [48] | 6.457 | 0.655 | 36.128 | 14.322 | 38.646 | 0.682 | 0.512 |
| TarDal [46] | 6.386 | 0.272 | 36.766 | 8.185 | 21.446 | 0.574 | 0.319 |
| RecoNet [47] | 6.398 | 0.204 | 25.028 | 7.902 | 16.805 | 0.649 | 0.309 |
| CDDFuse [49] | 6.849 | 0.749 | 51.654 | 18.777 | 51.274 | 0.783 | 0.837 |
| MACTFusion [50] | 7.027 | 0.62 | 41.183 | 17.893 | 54.201 | 0.651 | 0.441 |
| Ours | 7.036 | 0.73 | 52.443 | 19.541 | 52.833 | 0.845 | 0.744 |

Note: The dataset contains 2566 image pairs (2552 for training and 14 for testing). Color coding: red = best, green = second-best, and blue = third-best.
Table 3. Ablation experiment results on the YYX datasets.

| Exp | Configuration | EN | AG | SF | SD | VIF | Parameter Quantity |
|---|---|---|---|---|---|---|---|
| I | Leaky_ReLU with normalization to [−1, 1] → ReLU with normalization to [0, 1] | 7.111 | 12.497 | 32.011 | 38.71 | 0.561 | 27,939,735 |
| II | 3 CBAM with 2 Addition → 5 CBAM | 7.046 | 12.481 | 32.99 | 38.266 | 0.621 | 30,726,257 |
| III | 3 CBAM with 2 Addition → 5 Addition | 7.057 | 12.568 | 32.819 | 38.299 | 0.595 | 27,762,769 |
| IV | LSNF → Nest Connection | 7.068 | 12.536 | 32.637 | 38.83 | 0.61 | 35,372,257 |
| V | with DenseASPP → without DenseASPP | 7.064 | 12.762 | 33.643 | 39.009 | 0.621 | 27,737,841 |
| VI | with CycleGAN generators → without CycleGAN generators | 7.099 | 12.969 | 33.709 | 39.671 | 0.635 | 27,938,865 |
| VII | Loss_GWPL → Loss_MSE | 7.276 | 11.724 | 29.807 | 42.982 | 0.578 | 27,939,441 |
| Ours |  | 7.118 | 13.244 | 34.402 | 40.277 | 0.638 | 27,939,441 |

Note: Boldface denotes results obtained by our method; underline indicates results that outperform ours.
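Experiment VII replaces the gradient-weighted pixel loss (Loss_GWPL) with a plain MSE loss. The sketch below shows one plausible form of a gradient-weighted pixel term, in which the Sobel gradient magnitude of the reference image scales the per-pixel squared error; the exact weighting formula used in the paper is not reproduced here, so this is an illustrative assumption only.

```python
import torch
import torch.nn.functional as F

def sobel_magnitude(img):
    """Per-pixel gradient magnitude via fixed Sobel kernels (single-channel input)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def gradient_weighted_pixel_loss(fused, reference):
    """Pixel loss whose weights grow with local gradient magnitude (illustrative form)."""
    grad = sobel_magnitude(reference)
    weight = 1.0 + grad / (grad.amax(dim=(2, 3), keepdim=True) + 1e-8)  # weights in [1, 2]
    return (weight * (fused - reference) ** 2).mean()

fused = torch.rand(1, 1, 64, 64, requires_grad=True)
reference = torch.rand(1, 1, 64, 64)
loss = gradient_weighted_pixel_loss(fused, reference)
loss.backward()   # high-gradient (edge) regions contribute more to the update
```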