Local Extrema Adaptive Pyramid Decomposition for Optical and SAR Image Fusion

Huang, Zhiyang; Xiao, Qianwen; Liu, Qiao

doi:10.3390/electronics15102129


by Zhiyang Huang ¹, Qianwen Xiao ² and Qiao Liu ³,*

¹ School of Artificial Intelligence, Jiangxi Industry Polytechnic College, Nanchang 33000, China
² School of Design and Art, Jiangxi Industry Polytechnic College, Nanchang 33000, China
³ National Center for Applied Mathematics, Chongqing Normal University, Chongqing 401331, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2129; https://doi.org/10.3390/electronics15102129

Submission received: 1 April 2026 / Revised: 11 May 2026 / Accepted: 12 May 2026 / Published: 15 May 2026

(This article belongs to the Section Computer Science & Engineering)


Abstract

Optical and Synthetic Aperture Radar (SAR) sensors capture complementary and consistent information, and their fusion enhances remote sensing image quality. Existing pyramid decomposition-based methods suffer from insufficient texture–edge discrimination. Additionally, the manual setting of parameters during pyramid decomposition introduces uncertainty in the fusion results. To address these problems, we propose an optical and SAR image fusion framework based on local extrema adaptive pyramid decomposition (LEAPFusion), which enhances edge preservation and improves parameter adaptability. Specifically, by leveraging the edge-preserving properties of local extrema, we introduce them into the image pyramid decomposition framework to construct complementary local extrema and Laplacian pyramids. Then, we introduce an explicit parameter adaptation strategy in which the decomposition levels and local extrema kernel sizes are automatically determined from image size and pyramid scale, enabling consistent multi-scale representation and reducing parameter sensitivity compared to empirically tuned settings. Finally, by exploiting the complementary properties of the two pyramids, we implement a multi-type fusion strategy: weighted averaging for low-frequency components and a parameter-adaptive pulse-coupled neural network (PAPCNN) for high-frequency details. Our decomposition framework seamlessly integrates three representative edge-preserving filters (a median filter, a guided filter, and a rolling guidance filter), demonstrating strong generalization capability across different filtering paradigms. Extensive experiments on two benchmark datasets demonstrate that our method outperforms seven state-of-the-art algorithms, achieving the best results across diverse scenes with improvements of up to 13.38% in SF and 18.90% in SCD compared to the second-best methods.

Keywords: local extrema; scale-adaptive; pyramid decomposition; edge preservation; image fusion; remote sensing

1. Introduction

With the rapid advancement of remote sensing satellite sensors, a multitude of remote sensing imagery sources have emerged, including optical images and Synthetic Aperture Radar (SAR) images. Integrating optical and SAR images into a single composite can significantly enhance the accuracy of downstream remote sensing tasks, such as land cover classification [1], building extraction [2], ship detection and recognition [3,4], and cloud removal [5]. The fusion of optical and SAR remote sensing imagery therefore remains an active research area. More generally, image fusion integrates complementary information from multiple sources to improve downstream tasks; examples include the fusion of visible and infrared images [6,7,8], multispectral and panchromatic images [9,10,11], and optical and SAR images [12,13,14,15]. Infrared images provide thermal information but lack contrast and detail, while visible images offer rich texture and high contrast, making them well suited to human perception, although they are sensitive to environmental factors. Fusing them combines the detailed edges of visible images with the thermal information of infrared images. Multispectral images provide rich spectral data at lower spatial resolution, while panchromatic images offer high spatial resolution but no spectral information. Combining the two yields high-quality images with fine spatial detail and rich spectral information. As shown in Figure 1, optical images capture surface spectral information but are limited by weather and lighting conditions, while SAR images are unaffected by these factors and can penetrate shallow soil and vegetation, but suffer from speckle noise. This speckle noise and the lower visual quality complicate the interpretation of SAR images [16]. The fusion of optical and SAR images combines their strengths, significantly improving remote sensing image analysis efficiency [17].

However, despite their complementary advantages, effective fusion of optical and SAR images remains challenging due to their significant modality differences. From a statistical perspective, SAR images are generated through coherent imaging mechanisms and are inherently affected by multiplicative speckle noise, which follows a non-Gaussian distribution. Moreover, SAR backscatter intensity is highly dependent on surface roughness, dielectric properties, and imaging geometry, leading to strong spatial heterogeneity. Furthermore, SAR images typically exhibit heavy-tailed distributions, where high-frequency components simultaneously contain structural edges and stochastic noise. These statistical characteristics make it difficult for conventional filtering or fixed-scale decomposition methods to reliably distinguish meaningful texture and structural edges from speckle interference during feature extraction and fusion, which in turn hinders accurate edge preservation and cross-modal consistency. Therefore, a fusion framework that can adaptively capture local structural variations while being less sensitive to statistical assumptions is highly desirable.

Pixel-level image fusion, particularly using multi-scale decomposition (MSD) methods, has gained significant attention due to its superior performance across various domains [13,14,15,18]. MSD-based fusion generally involves three steps: (1) decomposing input images into multiple scales to capture high- and low-frequency information, (2) applying fusion rules at each scale, and (3) reconstructing the fused image via inverse transformation. Among MSD methods, pyramidal transforms are widely adopted. Pyramid decomposition generates multiscale representations by recursively applying Gaussian filtering and downsampling, capturing low-frequency (Gaussian pyramid) and high-frequency (Laplacian pyramid) components for efficient analysis and image reconstruction [18,19]. However, the Gaussian filter’s uniform smoothing often results in detail loss, particularly in texture-rich or structurally complex images, where excessive smoothing diminishes recognizability. Additionally, it uniformly smooths all regions, potentially over-blurring high-contrast or textured areas. Thus, preserving edge sharpness and fine details while leveraging the benefits of multi-scale analysis remains a critical challenge.

Recent advances have incorporated edge-preserving filters (EPFs) into fusion frameworks [20,21,22,23] to solve the aforementioned problems. However, most EPFs assume that image details have low contrast, an assumption that may not accurately capture fine-scale spatial variations. To distinguish textures from individual edges, Subr et al. [24] utilize the local extrema (LE) of the input image to extract information about oscillations and define detail as oscillations between local minima and maxima. The theoretical motivation for employing LE in optical and SAR fusion lies in its ability to bypass the limitations of gradient-based filtering. In SAR imagery, structural edges are often submerged in multiplicative speckle noise, so traditional filters struggle to differentiate between noise-induced fluctuations and meaningful textures. By characterizing detail as oscillations between local minima and maxima (envelopes), the LE algorithm provides a more robust mathematical criterion for isolating structural information, ensuring that sharp edges are preserved even under heavy-tailed noise distributions. LE-based methods have demonstrated robust edge-preserving and statistical capabilities. Xu [25] and Du et al. [26] employ LE for multiscale image decomposition using varying sliding kernel sizes. Du et al. [19] further integrate LE into an Information of Interest-based fusion strategy to improve detail extraction.

However, current EPF-based multi-scale image fusion approaches face two fundamental challenges: (1) decomposition levels, typically determined empirically, may cause either over-decomposition with excessive computational costs or under-decomposition with insufficient detail preservation, particularly due to the absence of adaptive mechanisms linking EPF’s sliding kernel size to decomposition levels; (2) sophisticated fusion strategies often require manual parameter adjustment, potentially leading to distortion artifacts and limited adaptability to varying land cover types in remote sensing scenarios. While parameter adaptation has been explored in previous studies, most existing methods either rely on empirical initializations or are optimized for specific modalities. In contrast, our proposed method establishes a direct, deterministic mapping between the input’s spatial dimensions (height/width) and the decomposition hierarchy. This ensures that the scale of analysis is inherently tailored to the image resolution and the spatial density of features, providing a parameter-free framework that is robust across diverse remote sensing scenarios without requiring prior tuning.

To address these challenges, particularly the difficulty of distinguishing texture and edge information under complex SAR statistical characteristics, in this paper, we propose an optical and SAR image fusion framework based on local extrema adaptive pyramid decomposition (LEAPFusion), which enhances edge preservation and improves parameter adaptability. Specifically, we propose a local-extrema-based adaptive pyramid decomposition method that automatically determines the level of decomposition and the size of the kernel. This is particularly crucial for effectively eliminating parameter-induced uncertainties associated with manual configuration. Furthermore, we design a simple yet effective pyramid fusion strategy according to the complementary features between the LE pyramid and the Laplacian pyramid. It demonstrates robust performance across various land cover types. We integrate the proposed method into three representative edge-preserving filters—a median filter, a guided filter, and a rolling guidance filter—to validate its effectiveness and robustness. Extensive experimental results on two datasets demonstrate that our method achieves competitive performance.

The main contributions of this work are summarized as follows:

(1) We propose an LE-based adaptive pyramid decomposition for optical and SAR image fusion, leveraging edge preservation and scale adaptivity to extract complementary information from both image sources.
(2) We design a simple yet effective multi-scale multi-type pyramid fusion strategy, which is robust across various land cover types.
(3) We integrate the proposed method into three EPFs and compare it with several state-of-the-art fusion methods. Extensive experimental results demonstrate strong generalization capabilities, delivering satisfactory results even when alternative EPFs are employed.

The rest of the paper is organized as follows. Section 2 reviews related works, including traditional and deep learning-based optical and SAR image fusion methods. Section 3 gives the details of the proposed methodology. Section 4 presents the experiments, including datasets, evaluation metrics, and comparative analysis. Section 5 provides a detailed discussion of the results, and Section 6 summarizes the paper with a conclusion.

2. Related Works

2.1. Traditional Fusion Methods

In recent decades, optical and SAR image fusion has advanced through four main approaches: component substitution (CS), multiscale decomposition (MSD), model-based, and hybrid methods [17]. CS methods (e.g., GS [27], IHS [28], PCA [29], BT [30]) are computationally efficient and enhance spatial details, but rely heavily on inter-image correlation. MSD methods (e.g., typical MSD [13,14], pyramid [18], NSST [15]) decompose images into frequency components to fuse complementary features, preserving spectral fidelity while capturing multiscale textures. Model-based approaches, including variational and sparse representation (SR) methods [18,31], treat fusion as a restoration problem, generating high-resolution results with reduced sensitivity to registration errors but higher computational cost. Hybrid methods combine CS and MSD strategies (e.g., IHS-DWT [32], modified BT [33], IHS-NSCT with PCNN [34]) to balance spatial and spectral quality, albeit with increased complexity.

In MSD-based fusion, pyramid-based approaches are widely employed. Traditional Gaussian and Laplacian pyramids rely on fixed linear filters and pre-defined levels, which often compromise edge preservation. To address these limitations, various locally adaptive frameworks have been developed, where decomposition parameters—including kernel size and pyramid levels—are dynamically derived from the signal’s spatial dimensions and local statistics. Such scale-adaptive strategies ensure that the decomposition granularity aligns precisely with the inherent structural complexity of optical and SAR imagery. For instance, Zhang et al. [18] propose a Laplacian-pyramid-based multispectral and SAR fusion method leveraging an overcomplete dictionary for improved feature representation, but their sparse representation of low-frequency components incurs high computational costs. Du et al. [19] adopt the local Laplacian filter for medical image fusion, yet its performance is highly parameter-sensitive and requires manual tuning. Similarly, Kou et al. [35] use a Laplacian–Gaussian pyramid for multi-exposure HDR fusion, but the fixed number of pyramid levels limits adaptability. While these methods have achieved promising results, they generally face two critical challenges: (1) difficulty in effectively extracting fine edge information and (2) reliance on manually set parameters, which can introduce uncertainty and reduce reproducibility. In contrast, our proposed method addresses both issues by enabling robust edge extraction and automated parameter determination, thereby improving the reliability and consistency of fusion results.

To better preserve fine structural details without excessive blurring, the LE algorithm offers a robust theoretical basis grounded in spatial envelope characterization. Unlike uniform convolution, LE extracts the regional maxima and minima of the input signal to construct upper and lower envelopes. The arithmetic mean of these envelopes serves as the local baseline, while the difference between the original signal and the baseline isolates high-frequency oscillations. By mathematically linking the decomposition process to the regional extrema of the signal, the LE algorithm effectively circumvents the over-smoothing artifacts typical of linear filtering, making it highly effective for distinguishing structural edges from stochastic noise in SAR imagery.

Furthermore, in the feature integration phase, the parameter-adaptive pulse-coupled neural network (PAPCNN) provides a solid theoretical foundation for adaptive weight generation. The standard PCNN originates from the biological synchronous pulse-firing phenomena observed in the feline visual cortex. PAPCNN systematically improves this by implementing a mathematical parameter-mapping strategy, where internal neural variables—such as the linking strength—are calculated directly from the input signal’s spatial features, such as spatial frequency. This spatially dependent mechanism allows the network to generate optimal firing maps for different regions without manual parameter tuning, providing a biologically inspired and mathematically rigorous criterion for multi-modal feature fusion.

2.2. Deep Learning-Based Fusion Methods

Significant progress has been made in image fusion using deep learning frameworks. Early CNN-based methods, such as IFCNN [36] and U2Fusion [37], demonstrated strong capability in extracting complementary features and performing unsupervised fusion across diverse modalities.

Subsequent studies have focused on enhancing cross-modal representation learning for optical–SAR fusion tasks. For instance, Ye et al. [38] proposed an unsupervised framework that combines structural features from SAR with texture information from optical images. More recent works have emphasized addressing modality heterogeneity and improving structural consistency. Duan et al. [5] introduced a feature pyramid network (FPNet) to improve multiscale feature integration for cloud removal tasks. Liu et al. [39] developed OSHFNet, a heterogeneous dual-branch network that explicitly models the differences between optical and SAR imaging mechanisms, pointing out that traditional shared-weight architectures often suffer from insufficient texture–edge discrimination. Similarly, MGFNet [40] employs a multi-path feature extraction strategy to reduce spatial information loss and enhance semantic alignment in complex scenes.

Despite these advances, a key limitation of current deep learning-based fusion methods remains insufficient texture–edge discrimination. Due to the significant heterogeneity between optical and SAR modalities, high-frequency components often contain both structural edges and speckle noise. To alleviate cross-modal inconsistency, many fusion networks tend to suppress high-frequency responses during feature alignment, which leads to blurred edges and loss of fine texture details. This issue is further exacerbated in complex urban scenes and has been widely observed in recent heterogeneous fusion frameworks. While deep learning-based methods achieve strong performance in task-specific applications, they still face challenges in interpretability and robustness under significant modality gaps.

3. Methodology

The proposed framework for fusing optical and SAR images is depicted in Figure 2 and Algorithm 1. The process begins with an IHS transformation of the optical image to extract its intensity (I) component. This I component, along with the SAR image, undergoes adaptive pyramid decomposition, generating LE pyramids and Laplacian pyramids. The LE pyramids are fused using a weighted-averaging strategy, while the Laplacian pyramids are fused with the parameter-adaptive pulse coupled neural network (PAPCNN) [41]. The fused pyramids are then reconstructed using Laplacian pyramid reconstruction to produce the fused image. Finally, the inverse IHS transformation is applied to obtain the final fusion result.

Algorithm 1 The Proposed LEAPFusion for Optical and SAR image fusion

Require: Optical image O, SAR image S
Ensure: Fused image F

1: Perform IHS transform on O to obtain intensity $I_{c}$ and chrominance channels $(v_{1}, v_{2})$
2: Determine the number of pyramid levels L, based on the width and height of $I_{c}$ or S (refer to Equation (8))
3: for $l = 1$ to L do
4:     Construct both local extrema and Laplacian pyramids for $I_{c}$ and S, denoted as $LE\{I_{c}\}_{k}^{l}$ and $LE\{S\}_{k}^{l}$, $LP\{I_{c}\}^{l}$ and $LP\{S\}^{l}$ (refer to Equations (6) and (7)).
5:     Update sliding window k (refer to Equation (9)).
6:     Fuse the l-th layer of the local extrema and Laplacian pyramids to obtain the fused results $F_{LE}^{l}$ and $F_{LP}^{l}$, respectively (refer to Equation (10) and PAPCNN [41]).
7:     Compute the final pyramid fused result as $F_{1}^{l} = F_{LE}^{l} + F_{LP}^{l}$
8: end for
9: Reconstruct $F_{2}$ from $F_{1}$ via the inverse Laplacian pyramid transform, then obtain F via the inverse IHS transform using $F_{2}$, $v_{1}$, and $v_{2}$.
10: return F

3.1. Edge-Preserving Scale-Adaptive Pyramid Decomposition

The Gaussian–Laplacian pyramid is a widely adopted multi-resolution technique in image processing, particularly for tasks such as feature extraction and image fusion. However, the Gaussian filter, which forms the core of this approach, applies uniform smoothing to all pixels without distinguishing between edges and background regions. This isotropic operation inevitably blurs edge information and fails to preserve local structural details. Moreover, the performance of the Gaussian filter heavily depends on the selection of its standard deviation, and manually tuning this parameter introduces uncertainty, limiting its adaptability and robustness in automated processing pipelines. These shortcomings highlight the need for more advanced filtering methods capable of adaptively preserving edges while maintaining smoothness.

Given these limitations, we introduce a local-extrema-based EPF to replace the Gaussian filter, enabling more effective extraction of edge and texture information. Additionally, we propose an adaptive pyramid decomposition framework capable of automatically determining two pivotal parameters—the decomposition levels and the local extrema computation kernel sizes—thereby mitigating the parameter-related uncertainties inherent in manual configurations.

3.1.1. Local Extrema

Local extrema was first introduced by Subr et al. [24] as a novel EPF that can effectively smooth highly contrasted oscillations while preserving salient edges. The process consists of three main steps: (1) identify the local maxima and minima of the input image I within a specified sliding window; (2) construct the maximal and minimal envelopes using interpolation techniques; (3) compute the average of these two envelopes to obtain the smoothed mean, referred to as the coarse layer (C). Figure 3 illustrates the three steps of the LE algorithm using 1D slices of the input image in Figure 4a (red), along with its extrema, maximal envelope (green), minimal envelope (blue), and the coarse layer (black). The residual layer is then calculated as $R = I - C$. Further details are provided below.

Locate extrema. A pixel p is identified as a local maximum (or minimum) if no more than $k - 1$ pixels within its $k \times k$ neighborhood have a greater (or smaller) value. Oscillations, with maxima detected using a $k \times k$ kernel, exhibit wavelengths of at least $k / 2$ pixels. Larger kernels may fail to detect finer oscillations.
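The counting rule above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `local_extrema_masks` and the edge-replicating border handling are assumptions, since the text does not say how image borders are treated.

```python
import numpy as np

def local_extrema_masks(img, k=3):
    """Return boolean masks (is_max, is_min) under the k x k counting rule."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    r = k // 2
    # Assumption: replicate border values so every pixel sees a full window.
    padded = np.pad(img, r, mode="edge")
    greater = np.zeros((h, w), dtype=int)  # neighbours with a larger value
    smaller = np.zeros((h, w), dtype=int)  # neighbours with a smaller value
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            window = padded[dy:dy + h, dx:dx + w]
            greater += window > img
            smaller += window < img
    # A pixel is a maximum (minimum) if at most k - 1 neighbours exceed
    # (fall below) it.
    return greater <= k - 1, smaller <= k - 1
```

Note that under this rule every pixel of a perfectly flat region qualifies as both a maximum and a minimum, because no neighbour is strictly greater or smaller.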

Construct extremal envelopes. Given an image I and a set of pixels S representing local extrema, we compute an extremal envelope E using an interpolation method adapted from the technique proposed by Levin et al. [42]. The objective is to find an interpolant E such that $E(r)$ and $E(s)$ are similar when $I(r)$ and $I(s)$ are alike. Formally, we minimize the following functional

$$\sum_{r} \Bigl( E(r) - \sum_{s \in N(r)} \omega_{rs} E(s) \Bigr)^{2}, \qquad (1)$$

subject to the constraint

$$\forall p \in S, \quad E(p) = I(p), \qquad (2)$$

where $N(r)$ denotes the neighbors of r, and the weights

$$\omega_{rs} \propto \exp\left( -\frac{(I(r) - I(s))^{2}}{2 \sigma_{r}^{2}} \right), \qquad (3)$$

are computed using the local variance $\sigma_{r}^{2}$ around r. Here, r ranges over the image pixels, s over the neighbors of r, and p over the extrema in S; $I(r)$, $I(s)$, and $I(p)$ represent their corresponding intensity values; the envelope values at r, s, and p are denoted by $E(r)$, $E(s)$, and $E(p)$, respectively; and $\sigma_{r}^{2}$ is the variance of the intensities within a window centered at r. We adopt the approach of Levin et al. [42] and minimize the quadratic functional using their weighted least squares formulation.

Calculate coarse layer. The envelope is constructed independently for the minima and maxima of the image, yielding the minimal and maximal envelopes, respectively. The coarse layer is then obtained by averaging these two envelopes, as expressed in Equation (4)

$$C = (E_{max} + E_{min}) / 2, \qquad (4)$$

where $E_{max}$ and $E_{min}$ represent the maximal and minimal envelopes, and C represents the coarse layer.
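To make the three LE steps concrete, here is a simplified 1D sketch. It deliberately swaps in plain linear interpolation (`np.interp`) for the weighted least squares interpolation of Levin et al. [42], and it uses a strict 1D extremum test (no sample in the k-wide window is larger or smaller). Both choices, and the function name `coarse_layer_1d`, are assumptions for illustration only.

```python
import numpy as np

def coarse_layer_1d(signal, k=3):
    """Return (coarse, residual) for a 1D signal.

    coarse = (E_max + E_min) / 2 and residual = signal - coarse.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    r = k // 2
    padded = np.pad(x, r, mode="edge")
    # Row i is the signal shifted by (i - r) samples, so column j holds
    # the window of width 2r + 1 centred on sample j.
    windows = np.stack([padded[i:i + n] for i in range(2 * r + 1)])
    is_max = (windows > x).sum(axis=0) == 0
    is_min = (windows < x).sum(axis=0) == 0
    idx = np.arange(n)
    # Simplification: linear interpolation between extrema, held constant
    # beyond the first and last extremum.
    e_max = np.interp(idx, idx[is_max], x[is_max])
    e_min = np.interp(idx, idx[is_min], x[is_min])
    coarse = 0.5 * (e_max + e_min)
    return coarse, x - coarse
```

For a pure oscillation such as `[0, 2, 0, 2, 0, 2, 0]`, the maximal envelope sits at 2 and the minimal envelope at 0, so the coarse layer is the constant 1 and the whole oscillation ends up in the residual.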

3.1.2. Scale-Adaptive Pyramid Decomposition

The scale-adaptive pyramid construction is characterized by two key aspects: (1) the adaptive number of pyramid levels, i.e., the decomposition levels, and (2) the adaptive kernel size of the LE algorithm. Applying the LE algorithm to the input image extracts a coarse layer. Furthermore, by iteratively downsampling the image, an LE pyramid can be constructed. Specifically, the $(l + 1)$-th level is generated by downsampling the l-th level after it has been processed by the LE filter:

$$LE\{I\}^{l + 1} = \mathrm{Down}(\mathrm{LE}(LE\{I\}^{l}, k)), \qquad (5)$$

where $\mathrm{Down}(\cdot)$ denotes a downsampling operation by a factor of 2, ensuring that each successive level represents a coarser scale of the original signal. The LE component at each level is defined as:

$$LE\{I\}_{k}^{l} = (E_{max}\{I\}_{k}^{l} + E_{min}\{I\}_{k}^{l}) / 2, \qquad (6)$$

where I represents the input image, k is the computation kernel in the LE algorithm, and l indicates the level. In the context of the Laplacian pyramid, the l-th level $LP\{I\}^{l}$ can be derived using Equation (7),

$$LP\{I\}^{l} = LE\{I\}^{l} - \mathrm{upsample}\{LE\{I\}^{l + 1}\}, \qquad (7)$$

where $\mathrm{upsample}\{\cdot\}$ represents upsampling the image. In this study, the $\mathrm{upsample}\{\cdot\}$ operation is specifically implemented using bilinear interpolation to recover the spatial dimensions of the coarser layers for feature integration. Further implementation details of the pyramid structure can be found in [43].
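The recursion in Equations (5) and (7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: a k x k box mean (`box_smooth`) stands in for the LE filter, level 1 is taken to be the input image itself, and all function names are hypothetical. Bilinear upsampling is written with `np.interp` applied along rows and then columns.

```python
import numpy as np

def box_smooth(img, k):
    """Stand-in smoother (box mean). NOT the LE filter."""
    r = k // 2
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2

def upsample_bilinear(img, shape):
    """Resize a 2D array to `shape` by separable linear interpolation."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, shape[0])
    xs = np.linspace(0, w - 1, shape[1])
    rows = np.stack([np.interp(xs, np.arange(w), row) for row in img])
    return np.stack([np.interp(ys, np.arange(h), col) for col in rows.T], axis=1)

def build_pyramids(img, levels, smooth=box_smooth, k=3):
    """Return (coarse levels, Laplacian levels) following Eqs. (5) and (7)."""
    le = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        le.append(smooth(le[-1], k)[::2, ::2])  # filter, then Down by 2
    lp = [le[l] - upsample_bilinear(le[l + 1], le[l].shape)
          for l in range(levels - 1)]
    return le, lp

def reconstruct(lp, top):
    """Invert Eq. (7): add each Laplacian level back, coarsest first."""
    out = top
    for band in reversed(lp):
        out = band + upsample_bilinear(out, band.shape)
    return out
```

Because each Laplacian level stores exactly what the upsampled coarser level misses, `reconstruct(lp, le[-1])` recovers the input up to floating-point error regardless of which smoother is plugged in.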

In Subr et al.’s [24] work, the number of decomposition levels is manually specified, and the kernel size is uniformly incremented by a constant value of 8 for each iteration within the multi-scale image decomposition process. This approach introduces uncertainty regarding the extent of image decomposition and increases the computational complexity. Inspired by the methods of Li et al. [44] and Kou et al. [35], our proposed method defines the total number of decomposition layers, denoted as L, based on a systematic criterion

$$L = \lfloor \log_{2}(\min(W, H)) \rfloor - T, \qquad (8)$$

where $\lfloor \cdot \rfloor$ means rounding down, W and H are the width and height of the input image, and the value of T is the initial kernel size, which was set to 3 according to reference [24]. Equation (8) establishes an explicit relationship between the number of decomposition levels and the spatial resolution of the input image. Unlike conventional methods that employ a fixed number of layers, this formulation ensures that larger images are decomposed into deeper pyramid structures, enabling more effective multi-scale representation. Consequently, the decomposition depth can adapt to different image sizes without manual tuning.
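As a quick check of Equation (8), the level count can be computed directly from the image size. The function name `decomposition_levels` is hypothetical; the integer `bit_length` trick is used only to avoid floating-point rounding in the logarithm.

```python
def decomposition_levels(width, height, t=3):
    """Eq. (8): L = floor(log2(min(W, H))) - T, with T = 3 by default."""
    shortest = min(width, height)
    if shortest < 1:
        raise ValueError("image dimensions must be positive")
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
    return (shortest.bit_length() - 1) - t
```

For example, a 256 × 256 image gives $L = 8 - 3 = 5$, while a 1024 × 1024 image gives $L = 10 - 3 = 7$.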

As the image is iteratively down-sampled, its spatial resolution decreases, meaning that a fixed-size kernel will cover a progressively larger relative area of the original scene. To maintain a consistent receptive field and effectively capture features across different scales, the kernel size should be adjusted according to the decomposition level. In our framework, the kernel size is slightly increased in deeper layers to ensure that structural information at coarser scales is adequately characterized without being overlooked due to the reduced resolution. Thus, the filter size at each scale is adaptive rather than a fixed constant, as expressed in the following equation:

$$k^{l} = \max(l - 2, 3), \quad l = 1, \dots, L, \qquad (9)$$

where $k^{l}$ represents the kernel size at the l-th scale. Equation (9) further introduces a scale-dependent kernel adaptation mechanism. As the pyramid level increases, the image becomes progressively smoother due to downsampling and filtering. Accordingly, larger kernels are employed at coarser scales to better capture structural information while suppressing noise. This design maintains consistency of feature representation across scales and avoids the limitations of fixed kernel sizes.
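Evaluating Equation (9) exactly as written gives the per-level kernel schedule. The helper name `kernel_schedule` is hypothetical and the sketch makes no claim about how the authors handle the resulting sizes in practice.

```python
def kernel_schedule(num_levels):
    """Eq. (9): k^l = max(l - 2, 3) for l = 1, ..., L."""
    return [max(level - 2, 3) for level in range(1, num_levels + 1)]
```

Read literally, the formula keeps the kernel at 3 for the first five levels and increases it by one per level from level 6 onward, so `kernel_schedule(7)` yields `[3, 3, 3, 3, 3, 4, 5]`.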

Unlike existing pyramid decomposition methods that typically rely on fixed or empirically tuned parameters, the proposed strategy provides an explicit and unified parameterization driven by image size and pyramid scale. This design eliminates the uncertainty introduced by manual parameter selection and improves the robustness and generalization ability of the fusion framework across images with diverse resolutions and scene characteristics.

3.2. Multi-Type Pyramid Fusion

Applying the adaptive pyramid decomposition described above to each input image yields both an LE pyramid and a Laplacian pyramid. We then fuse the corresponding layers of the optical and SAR pyramids to synthesize a fused pyramid.

3.2.1. LE Pyramid Fusion

The LE pyramid contains the low-frequency components of an image, representing the global intensity distribution and background information. Optical and SAR images have distinct characteristics in their coarse layers. Optical images capture visible light and provide rich spectral information, making them ideal for identifying colors, textures, and fine details. In contrast, SAR images use microwave signals to penetrate clouds and darkness, offering strong capabilities in detecting surface structures and terrain features, but often with lower resolution and more noise. The coarse layer of an optical image typically contains smooth and continuous information, while the coarse layer of an SAR image tends to emphasize structural information.

We employ a simple weighted-averaging fusion rule for each layer, as described by Equation (10):

$$F_{LE}^{l} = \alpha_{1} LE_{optical}^{l} + \alpha_{2} LE_{SAR}^{l}, \qquad (10)$$

where l denotes the level index, with maximum value L, $LE_{optical}^{l}$ and $LE_{SAR}^{l}$ denote the LE pyramid images at the l-th layer, and $F_{LE}^{l}$ represents the fusion result of the optical and SAR images at the l-th layer in the LE pyramid. The weighted-averaging rule is chosen for its simplicity, computational efficiency, and ability to preserve the complementary strengths of both modalities. It effectively integrates the spectral richness of optical images with the structural robustness of SAR images, while mitigating noise through averaging. In this study, we set $\alpha_{1} = 0.5$ and $\alpha_{2} = 0.5$ to retain shared features and minimize redundancy.
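Equation (10) amounts to a per-level weighted sum. A minimal sketch, assuming each pyramid is a list of equally shaped 2D arrays (the function name `fuse_le_pyramids` is hypothetical):

```python
import numpy as np

def fuse_le_pyramids(le_optical, le_sar, alpha1=0.5, alpha2=0.5):
    """Eq. (10): F_LE^l = alpha1 * LE_optical^l + alpha2 * LE_SAR^l per level."""
    return [alpha1 * np.asarray(opt, dtype=float) + alpha2 * np.asarray(sar, dtype=float)
            for opt, sar in zip(le_optical, le_sar)]
```

With the default weights of 0.5 this is a plain average of the two coarse layers at every level.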

3.2.2. Laplacian Pyramid Fusion

The Laplacian pyramid represents the high-frequency components of an image. In optical imaging, high-frequency components capture micro-textural variations (e.g., foliated surfaces, architectural details, pavement patterns) while containing stochastic/quantization noise. SAR imagery exhibits high-frequency signatures at abrupt intensity transitions (target edges, geometric contours) and terrain-sensitive textures (urban structures, vegetation gradients), coupled with inherent speckle noise from coherent scattering. Laplacian pyramid fusion necessitates adaptive enhancement of modality-specific high-frequency features alongside hierarchical noise suppression through multi-scale decomposition.

Given the importance of preserving these high-frequency details for fusion tasks, the PAPCNN [41] is employed. Although the PAPCNN was originally designed for medical images, its bionic pulse-coupling mechanism is inherently effective for remote sensing. The linking mechanism allows neurons to fire based on spatial correlation. Since SAR speckle noise is spatially uncorrelated, while physical structures are spatially continuous, the network can selectively enhance meaningful textures while suppressing isolated noise spikes. This makes it a suitable choice for the Laplacian pyramid fusion.

The absolute pixel values at each level of the Laplacian pyramid are fed into the network as inputs, denoted as $F_{ij}[n] = |LP_{S}^{l}|$, where $S \in \{optical, SAR\}$. Furthermore, the firing times are accumulated by adding an update step at the end of each iteration to improve fusion accuracy:

$$T_{ij}[n] = T_{ij}[n-1] + Y_{ij}[n], \tag{11}$$

Here, the firing time of each neuron is denoted as $T_{ij}[N]$, where $i, j$ denote the position of neuron $N_{ij}$, N represents the total number of iterations, n denotes the current iteration, and $Y_{ij}[n]$ denotes the output of the n-th iteration of the PCNN. At each level of the Laplacian pyramid, the firing times for images $LP_{optical}^{l}$ and $LP_{SAR}^{l}$ are represented as $T_{optical,ij}^{l}[N]$ and $T_{SAR,ij}^{l}[N]$, respectively. The fused result at each level is computed using the following rule:

$$F_{LP}^{l} = \begin{cases} LP_{SAR}^{l}, & T_{SAR,ij}^{l}[N] > T_{optical,ij}^{l}[N] \\ LP_{optical}^{l}, & \text{otherwise,} \end{cases} \tag{12}$$

where l denotes the l-th level (with maximum value L) and $F_{LP}$ represents the fusion result of the optical and SAR images at the l-th layer in the Laplacian pyramid; the coefficient associated with the larger firing time is chosen as the fusion coefficient for the Laplacian pyramid. In the PAPCNN framework, the iteration number N is empirically set to 60 based on ablation studies (see Section 4.4.3). Regarding the model parameters, the linking strength β and attenuation coefficient α are adaptively determined by the algorithm according to the local statistical characteristics of the input images. This adaptive parameterization, combined with the optimized iteration number N, ensures that the PAPCNN can effectively distinguish between structural features and stochastic speckle noise across different scales.
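The firing-time accumulation of Equation (11) and the selection rule of Equation (12) can be illustrated with a simplified pulse-coupled network. The sketch below is not the PAPCNN of [41]: it uses fixed, illustrative parameters (`alpha_f`, `beta`, `v_l`, `alpha_e`, `v_e`) and an assumed 3 × 3 linking kernel `W` instead of the adaptively estimated values. All function and parameter names are hypothetical.

```python
import numpy as np

# 8-neighbour linking weights (an illustrative choice, not taken from the paper).
W = np.array([[0.5, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.5]])

def neighbour_sum(y):
    """Weighted sum of the 8 neighbours of every pixel, zero-padded at the borders."""
    h, w = y.shape
    padded = np.pad(y, 1)
    out = np.zeros((h, w), dtype=np.float64)
    for di in range(3):
        for dj in range(3):
            out += W[di, dj] * padded[di:di + h, dj:dj + w]
    return out

def firing_times(lp_level, n_iter=60, alpha_f=0.1, beta=0.5,
                 v_l=1.0, alpha_e=1.0, v_e=20.0):
    """Run a simplified PCNN and return T[N], the accumulated firing times."""
    s = np.abs(np.asarray(lp_level, dtype=np.float64))  # stimulus F_ij[n] = |LP|
    u = np.zeros_like(s)  # internal activity
    e = np.ones_like(s)   # dynamic threshold
    y = np.zeros_like(s)  # pulse output Y_ij[n]
    t = np.zeros_like(s)  # firing times T_ij[n]
    for _ in range(n_iter):
        link = v_l * neighbour_sum(y)
        u = np.exp(-alpha_f) * u + s * (1.0 + beta * link)
        y = (u > e).astype(np.float64)
        e = np.exp(-alpha_e) * e + v_e * y
        t += y  # Eq. (11): T[n] = T[n-1] + Y[n]
    return t

def fuse_lp_level(lp_optical, lp_sar, n_iter=60):
    """Eq. (12): keep the SAR coefficient where it fired more often, else the optical one."""
    lp_optical = np.asarray(lp_optical, dtype=np.float64)
    lp_sar = np.asarray(lp_sar, dtype=np.float64)
    t_opt = firing_times(lp_optical, n_iter)
    t_sar = firing_times(lp_sar, n_iter)
    return np.where(t_sar > t_opt, lp_sar, lp_optical)
```

Because the linking term couples each neuron to its neighbours, spatially coherent coefficients reinforce one another's firing, whereas an isolated spike receives no such support. This is the property the text relies on for speckle suppression.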

3.3. IHS Method

In this study, the optical image contains three RGB channels, while the SAR image has a single channel. To enable fusion, the multi-channel optical image is first converted into a single channel by extracting its intensity (I) component through the IHS transformation. The I component is then fused with the SAR image across multiple scales. After the fusion process, the result is converted back to the RGB color space using the inverse IHS transformation. While the IHS transform is traditionally employed for multispectral and panchromatic fusion, it has also been widely adopted in optical–SAR fusion studies, demonstrating its effectiveness in cross-modality fusion scenarios [12,13,14,15,18].

The IHS transform [45] is mathematically expressed as follows:

$$\begin{bmatrix} I \\ H \\ S \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ -\sqrt{2}/6 & -\sqrt{2}/6 & 2\sqrt{2}/6 \\ 1/\sqrt{2} & -1/\sqrt{2} & 0 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}, \tag{13}$$

where R, G, and B denote the red, green, and blue channels of the optical image, respectively, and I, H, and S are the three components after the IHS transformation. The inverse IHS transform can be expressed as

$$\begin{bmatrix} R_{new} \\ G_{new} \\ B_{new} \end{bmatrix} = \begin{bmatrix} 1 & -1/\sqrt{2} & 1/\sqrt{2} \\ 1 & -1/\sqrt{2} & -1/\sqrt{2} \\ 1 & \sqrt{2} & 0 \end{bmatrix} \begin{bmatrix} I' \\ H \\ S \end{bmatrix}, \tag{14}$$

where $I'$ denotes the new intensity component after processing, and $R_{new}$, $G_{new}$, and $B_{new}$ are the new channels obtained after the inverse IHS transformation.
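The two matrices in Equations (13) and (14) are exact inverses of each other, which can be checked numerically. The following sketch applies them to an image stored as an (H, W, 3) NumPy array. The helper names (`rgb_to_ihs`, `ihs_to_rgb`, `replace_intensity`) are hypothetical, and the fused intensity is assumed to be supplied by the multi-scale fusion stage.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

# Forward matrix of Equation (13): rows yield I, H, S from (R, G, B).
IHS_FORWARD = np.array([
    [1 / 3, 1 / 3, 1 / 3],
    [-SQRT2 / 6, -SQRT2 / 6, 2 * SQRT2 / 6],
    [1 / SQRT2, -1 / SQRT2, 0.0],
])

# Inverse matrix of Equation (14).
IHS_INVERSE = np.array([
    [1.0, -1 / SQRT2, 1 / SQRT2],
    [1.0, -1 / SQRT2, -1 / SQRT2],
    [1.0, SQRT2, 0.0],
])

def rgb_to_ihs(rgb):
    """Map an (H, W, 3) RGB image to its (H, W, 3) IHS representation."""
    return np.asarray(rgb, dtype=np.float64) @ IHS_FORWARD.T

def ihs_to_rgb(ihs):
    """Map an (H, W, 3) IHS image back to RGB."""
    return np.asarray(ihs, dtype=np.float64) @ IHS_INVERSE.T

def replace_intensity(rgb, fused_intensity):
    """Swap the I component for a fused intensity I' and return the new RGB image."""
    ihs = rgb_to_ihs(rgb)
    ihs[..., 0] = fused_intensity
    return ihs_to_rgb(ihs)
```

Since the first column of the inverse matrix is all ones, any change in the intensity is added equally to the three output channels, which leaves the H and S components of the optical image unchanged.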

4. Experiments

4.1. Experiment Settings

4.1.1. Experiment Data

To evaluate the proposed method, all algorithms are tested on the publicly available YYX-OPT-SAR dataset [46] and the optical–SAR dataset [47]. YYX-OPT-SAR includes 150 pairs of optical and SAR images, with optical images from Google Earth and SAR images from an unmanned aerial vehicle. These images cover areas around Baicheng City (Jilin Province) and Weinan City (Shaanxi Province), representing three typical land cover types: urban, suburban, and mountainous regions. Each image pair, with a size of $512 \times 512$ pixels, has been registered.

Ren et al. [47] publicly released a co-registered optical–SAR dataset, with optical images acquired by the GF-2 satellite and SAR images by GF-3. The dataset covers diverse land cover types, including buildings, farmland, vegetation, water bodies, and roads. For this study, we selected the first 150 optical–SAR image pairs ($256 \times 256$ pixels each) from the Dongying region of Shandong Province. This selection was made to complement the first dataset (YYX-OPT-SAR), which primarily focuses on mountainous and urban terrains. By including the Dongying region, a typical plain area, we can more comprehensively evaluate the adaptability and robustness of the proposed fusion algorithm across diverse topographical conditions.

4.1.2. Compared Methods

We compare our method with seven existing approaches spanning three categories: EPF-based MSD (MLGCF [23], MSGRGF [22]); saliency-based VSFF [12]; and general deep learning-based methods (IFCNN [36], U2Fusion [37], MUFusion [48], LFDT [49]).

4.1.3. Evaluation Indices

The experimental evaluation incorporates both qualitative and quantitative assessments. Qualitative analysis examines perceptual differences in fused images through visual inspection, while quantitative analysis provides objective performance measurement through five established metrics: entropy (EN) [50], spatial frequency (SF) [51], sum of correlation of differences (SCD) [52], spectral distortion index ($D_{\lambda}$) [53], and spectral angular mapper (SAM) [54].
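For orientation, three of these metrics can be written compactly. The sketch below uses common textbook definitions: Shannon entropy of the grey-level histogram, spatial frequency from first-order row and column differences, and the mean spectral angle in radians. The normalization and units used in [50,51,54] and in the reported tables may differ, so the functions are illustrative assumptions rather than the evaluation code behind the results.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (bits) of the grey-level histogram of an 8-bit image."""
    hist, _ = np.histogram(np.asarray(img), bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2), with squared differences normalized by the pixel count."""
    img = np.asarray(img, dtype=np.float64)
    rf2 = np.sum(np.diff(img, axis=1) ** 2) / img.size  # horizontal differences
    cf2 = np.sum(np.diff(img, axis=0) ** 2) / img.size  # vertical differences
    return float(np.sqrt(rf2 + cf2))

def spectral_angle_mapper(fused, reference, eps=1e-12):
    """Mean angle (radians) between per-pixel spectral vectors of two (H, W, C) images."""
    f = np.asarray(fused, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    f = f.reshape(-1, f.shape[-1])
    r = r.reshape(-1, r.shape[-1])
    cos = np.sum(f * r, axis=1) / (
        np.linalg.norm(f, axis=1) * np.linalg.norm(r, axis=1) + eps
    )
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Higher EN and SF indicate more information and sharper gradients in the fused image, while a smaller SAM indicates that the fused spectra stay closer to those of the optical reference.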

4.2. Result Analysis of YYX-OPT-SAR Dataset

The dataset used in this study consists of three land cover types. For each land cover type, one image is selected for comparative display. To facilitate comparison, the detailed regions of the fusion results are enlarged and highlighted with red rectangular boxes, as shown in Figure 5.

The first and second rows show the fusion results for urban land cover images. Due to the limited grayscale information in SAR images, all methods except VSFF and LFDT fail to effectively preserve the optical image’s color, resulting in a generally darker tone. Specifically, the roof color in the IFCNN and MUFusion fusion results is distorted, and U2Fusion produces color artifacts. While MLGCF and MSGRGF show some improvement in color preservation, they do not effectively retain local energy from the SAR images. The VSFF and LFDT results preserve some tree shadows from the SAR image but introduce halo artifacts, indicating insufficient edge preservation. In contrast, the proposed method retains both the optical image’s color and the SAR image’s local energy while demonstrating superior edge preservation.

The third and fourth rows show the fusion results for suburban areas. The input images clearly demonstrate that optical and SAR images provide complementary information. While all methods except VSFF and LFDT effectively integrate this information, they fail to preserve the optical image’s color. Although VSFF produces better visual effects, it does not adequately fuse the complementary information. In contrast, the proposed method effectively retains the complementary features of both optical and SAR images.

The fifth and sixth rows show the fusion results for mountain land cover images. Optical images provide rich spectral information, while SAR images capture detailed texture. Due to SAR’s side-looking imaging mechanism, significant elevation changes in the terrain cause higher grayscale values on the slope-facing side. Thus, preserving SAR’s texture information is crucial during fusion. The results show that the MSGRGF, VSFF, and LFDT methods are less effective at preserving the edges of terraced fields. In contrast, the proposed method demonstrates stronger edge preservation, effectively retaining the SAR texture details.

Table 1 presents a quantitative comparison of different fusion methods across three land cover types in the YYX-OPT-SAR dataset. The values represent the average score for each method across the fusion tasks. Red indicates the best value among the eight methods, green denotes the second-best, and blue represents the third-best. ↑ and ↓ indicate the higher the better and the lower the better, respectively.

Overall, among the eight methods, our approach ranks in the top three across all metrics for the three fusion tasks, demonstrating its robustness and broad applicability. Specifically, it achieves the best EN in the urban and mountain scenarios and the second-best EN in suburban areas. Its SF is comparable to that of the VSFF method overall. Notably, the SCD index ranked first across all three fusion tasks. For the $D_{\lambda}$ and SAM metrics, our method maintains competitive performance within the top tier. Specifically, in the urban and suburban scenarios, both metrics rank within the top three (with the SAM index in the suburban task achieving the second-best value). In the mountainous terrain, although $D_{\lambda}$ and SAM show a slight decrease in ranking compared to structural metrics, they still remain within a reasonable and effective range.

The first-place SCD ranking holds across all three fusion tasks with a significant margin. This superiority stems from the fact that our scale-adaptive LE decomposition effectively extracts complementary features from the heterogeneous source images, ensuring that the fused product maintains a high degree of informational correlation with the inputs.

Regarding the $D_{\lambda}$ and SAM metrics, which consistently ranked third, this performance reflects a deliberate trade-off between structural injection and spectral preservation. In complex scenes like urban and mountain areas, our method prioritizes the integration of salient geometric structures from SAR (evidenced by the top-tier SF and EN scores), which inevitably introduces minor spectral deviations. However, these values remain highly competitive, striking a practical balance for interpretation.

4.3. Result Analysis of Optical–SAR Dataset

The fusion results on the optical–SAR dataset are qualitatively compared in Figure 6. The bridge fusion results in Figure 6 (Rows 1–2) reveal that VSFF fails to adequately preserve SAR intensity characteristics, while IFCNN and MUFusion exhibit spectral distortion. Although other methods maintain both spectral and SAR intensity profiles, our approach demonstrates superior detail preservation. The roof fusion comparison in Figure 6 (Rows 3–4) demonstrates that while all methods preserve roof coloration effectively, our approach shows superior retention of SAR ground intensity characteristics. Furthermore, edge preservation analysis confirms our method maintains significantly richer structural details compared to other techniques, with SAR imagery particularly well capturing roof and ground-edge intensity profiles. The field boundary fusion results in Figure 6 (Rows 5–6) reveal that while both LFDT and our method excel in edge preservation and information integration, MUFusion shows compromised texture despite maintaining intensity, and IFCNN exhibits spectral distortion. Our approach uniquely preserves all critical features—structural, textural, and spectral—demonstrating comprehensive fusion superiority.

The quantitative comparison results on the optical–SAR dataset are presented in Table 2. Among the five evaluation metrics, our method achieved the top ranking in EN, SF, and SCD, reinforcing the strength of the PAPCNN-based high-frequency fusion strategy. It was slightly inferior to VSFF in the $D_{\lambda}$ and SAM indices, although it still secured second place in SAM. The lower ranking in $D_{\lambda}$ compared to VSFF is due to the inherent sensitivity of $D_{\lambda}$ to intensity variations when SAR structural extrema are heavily injected. Nevertheless, the overall ranking demonstrates that the proposed framework maintains a robust spectral–spatial balance even in varied sensor configurations.

Based on experimental results from the two datasets, the overall performance of general deep learning-based fusion methods remains suboptimal, which can be attributed to several factors. First, these methods do not explicitly model the modality differences between optical (spectral-focused) and SAR (structure- and texture-focused) images, potentially leading to inadequate integration of complementary information or the introduction of interference. Second, despite leveraging deep architectures, they lack semantic guidance and typically perform pixel- or feature-level fusion using generic representations, failing to capture class-discriminative features essential for land cover classification. Third, most models are optimized for image reconstruction or perceptual quality rather than classification performance, leading to the loss of task-relevant information. Finally, limited training data and the absence of remote sensing-specific regularization reduce their generalizability, particularly in complex or dynamic land cover scenarios.

To further validate the proposed method, both qualitative and quantitative analyses were performed. Qualitatively, visual comparisons confirm these results, showing sharper boundaries, finer structures, and a more natural spectral appearance. Quantitatively, local extrema decomposition enhances EN and SF by accurately extracting textures across contrasts, enriching gradients and preserving edges. The scale-adaptive pyramid with local extrema constraints improves SCD and reduces $D_{\lambda}$ by suppressing cross-modal mixing and ensuring spectral consistency, while oversmoothing avoidance improves SAM. Overall, the method achieves a balanced integration of spatial detail and spectral fidelity, explaining consistent metric improvements and generalizability.

4.4. Ablation Study

4.4.1. Effectiveness of Parameter-Adaptive Strategy

To demonstrate the effectiveness of the proposed parameter-adaptive approach compared to conventional fixed-parameter settings, two comparative experiments were designed using a control-variable method: (1) fixing the decomposition level $L = 4$ [13,22,55] and setting a constant kernel size $k \in \{3, 5, 7\}$ and (2) fixing a constant $k = 3$ while varying $L \in \{3, 4, 5\}$. These fixed-parameter configurations serve as the control groups to be compared against our adaptive method. A comparative evaluation based on the EN metric is presented in Figure 7.

As shown in Figure 7, the proposed adaptive method outperforms the fixed-parameter fusion approach in both experimental groups. This result validates that the adaptive k (Equation (9)) and adaptive L (Equation (8)) can dynamically adjust to the image resolution and content, achieving a better balance of information extraction than any single fixed combination. In the first group, our method achieves superior EN values in urban areas compared to other methods, although it shows slightly lower performance in suburban and mountainous regions. Notably, in the second experimental group, our approach demonstrates optimal performance across all three terrain types. The results indicate that kernel size significantly influences EN values, with larger kernel sizes generally improving EN values at the cost of increased computational resources. However, it is crucial to note that the kernel size must be smaller than the dimensions of the corresponding layer image, suggesting that excessively large kernel sizes may not necessarily benefit the experimental process.
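The constraint that the kernel must remain smaller than each pyramid layer can be checked with a short helper. The sketch below is not the adaptive rule of Equations (8) and (9). It only assumes dyadic downsampling (each layer half the size of the previous one, rounded up) and tests the fixed control configurations listed above. The function names are hypothetical.

```python
def layer_sizes(height, width, levels):
    """Sizes of pyramid layers 1..levels, assuming each layer halves the previous one."""
    sizes = []
    h, w = height, width
    for _ in range(levels):
        sizes.append((h, w))
        h, w = (h + 1) // 2, (w + 1) // 2
    return sizes

def kernel_fits(height, width, levels, k):
    """True if a k x k kernel is strictly smaller than every layer of the pyramid."""
    return all(k < min(h, w) for h, w in layer_sizes(height, width, levels))

# Fixed-parameter control settings (L, k) from the two ablation groups, on 512 x 512 inputs.
control_settings = [(4, 3), (4, 5), (4, 7), (3, 3), (5, 3)]
for levels, k in control_settings:
    print(f"L={levels}, k={k}: kernel fits = {kernel_fits(512, 512, levels, k)}")
```

Under this assumption all control settings are valid for 512 × 512 inputs, whereas a deep pyramid on a small image (for example, seven levels on a 256 × 256 image with k = 7) would violate the constraint at the coarsest layer.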

4.4.2. Influence of Color Space Transformation

To evaluate the impact of different color space transformations, we compared our IHS-based framework with three alternative configurations: (1) Direct Intensity, where the optical image is converted to a grayscale image using a weighted combination of RGB channels, fused with the SAR image, and subsequently projected back into the RGB space via the Brovey transform; (2) YCbCr, fusing the luminance (Y) channel; and (3) HSV, fusing the value (V) channel. While the color space transformation was varied, all other components of the proposed framework remained unchanged.

The results are shown in Figure 8 and Table 3. Visually (Figure 8), the Direct Intensity and HSV methods exhibit localized color artifacts (e.g., blue/red patches) due to spectral distortions during the back-projection process. While the YCbCr result is visually comparable to the proposed method in terms of color fidelity, quantitative analysis (Table 3) reveals a clear performance gap. The proposed IHS-based method achieves the highest entropy (EN = 6.8233) and spatial frequency (SF = 24.6603), significantly outperforming YCbCr (SF = 22.7192). This demonstrates that the IHS transformation provides a more compatible mathematical foundation for our hierarchical fusion rules, allowing for a more profound injection of SAR structural details while maintaining natural color fidelity.

4.4.3. Sensitivity Analysis of the PAPCNN Iteration Count N

The number of iterations N in the PAPCNN is a key parameter that influences the extraction of structural features. To evaluate its impact, we tested N values ranging from 40 to 100. As shown in Figure 9, the evaluation metrics for different terrain types (urban, mountain, and suburban) stabilize as N increases.

Specifically, the spectral fidelity metrics (SAM and $D_{\lambda}$) and information-based metrics (EN and SF) show negligible fluctuations for N ≥ 50. The structural correlation metric (SCD) reaches a plateau at N = 60. This convergence behavior indicates that the PAPCNN effectively captures the salient features of the source images within 60 iterations. Continuing to increase N beyond this point would lead to unnecessary computational overhead without further enhancing fusion quality. Consequently, N = 60 is determined as the standard setting for all experiments to ensure a balance between performance and efficiency.

4.4.4. Effectiveness of Local Extrema Pyramid Fusion Rules

To evaluate the effectiveness of the proposed weighted-averaging rule for LE pyramid fusion, we conducted an ablation study by replacing it with a hybrid strategy combining weighted local energy (WLE) and weighted sum of eight-neighborhood-based modified Laplacian (WSEML) [13,15]. The comparative experimental results are presented in Figure 10 and Table 4.

As indicated in Table 4, while the WLE and WSEML rules achieve higher SF scores, this is primarily due to the Laplacian operator’s aggressive amplification of local gradients. While increasing visual sharpness, such operators risk introducing spectral artifacts and amplifying SAR speckle noise.

In contrast, our weighted-averaging rule prioritizes radiometric and spectral integrity, yielding superior scores in SCD, $D_{\lambda}$, and SAM. According to the Wald protocol [56], preserving the sensors’ physical characteristics is the primary criterion for fusion quality. In optical–SAR fusion, high EN and SF often represent “pseudo-gain” where noise is misinterpreted as information [52]. Our approach acts as a conservative integrator, ensuring high-frequency injection remains within physically plausible bounds to provide a reliable foundation for subsequent analysis.

Furthermore, the observed trade-off suggests that a hybrid strategy—applying conservative averaging to low-frequency layers while utilizing adaptive weights for high-frequency details—could further reconcile structural clarity with spectral consistency, which will be explored in future work.

4.4.5. Effectiveness of Laplacian Pyramid Fusion Rules

To verify the effectiveness of the PAPCNN in high-frequency fusion, we replace it with the max-absolute (max-abs) selection rule [13,14] while keeping the rest of the framework unchanged. As shown in Figure 11, the max-abs strategy introduces noticeable noise and artifacts, whereas the proposed method preserves clearer structures with better visual consistency.

The quantitative results in Table 5 further confirm this observation. Compared with max-abs, the proposed method improves EN (6.7479→6.8233), SF (20.6155→24.6603), and SCD (1.6214→1.6868), while reducing $D_{\lambda}$ (0.0300→0.025) and SAM (0.3721→0.3302). These results demonstrate that PAPCNN achieves a better balance between detail enhancement and noise suppression, outperforming the max-abs rule.
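For reference, the max-abs baseline used in this ablation can be sketched as a per-pixel selection of the coefficient with the larger magnitude. The function name `fuse_max_abs` and the tie-breaking choice (keeping the optical coefficient when magnitudes are equal) are assumptions made for illustration.

```python
import numpy as np

def fuse_max_abs(lp_optical, lp_sar):
    """Per pixel, keep the Laplacian coefficient with the larger absolute value."""
    a = np.asarray(lp_optical, dtype=np.float64)
    b = np.asarray(lp_sar, dtype=np.float64)
    return np.where(np.abs(b) > np.abs(a), b, a)

# Toy coefficients (illustrative values only).
lp_opt = np.array([[1.0, -3.0], [0.5, 2.0]])
lp_sar = np.array([[-2.0, 1.0], [0.5, -4.0]])
fused = fuse_max_abs(lp_opt, lp_sar)
```

Because the decision depends only on the magnitude at a single pixel, an isolated speckle spike in the SAR coefficients is always selected, which is consistent with the noise and artifacts reported for this baseline.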

4.4.6. Analysis of the Complementarity Between LE and Laplacian Pyramids

To verify the complementary roles of the LE and Laplacian pyramids, we conducted an ablation study by comparing the proposed dual-pyramid framework with two single-pyramid variants. The quantitative results and visual comparisons are presented in Table 6 and Figure 12, respectively.

From a quantitative perspective (Table 6), the “LE + Laplacian pyramid” scheme achieves the best performance across almost all metrics, including EN, SF, SCD, and $D_{\lambda}$. Specifically, the combination significantly boosts the spatial frequency (SF) from 15.6182 (Laplacian only) and 12.1003 (LE only) to 24.6603, indicating a much higher capacity for capturing and retaining fine-grained structural details. While the “only Laplacian pyramid” scheme shows a slightly lower SAM value, it comes at the cost of significant information loss in other dimensions.

The qualitative analysis in Figure 12 further illustrates the necessity of the dual-pyramid structure. The “only Laplacian pyramid” result appears noticeably dark and suffers from low contrast, failing to effectively integrate the radiometric information of the optical image. Conversely, the “only LE pyramid” result is over-smoothed and exhibits a “whitening” effect, where sharp edges and fine textures from the SAR image are blurred. In contrast, the “LE + Laplacian pyramid” result achieves a superior balance: it retains the sharp structural features of the Laplacian decomposition while maintaining the rich energy and radiometric stability provided by the LE pyramid. This complementarity ensures that the fused image is both spatially sharp and radiometrically natural, validating our design of the hybrid pyramid framework.

4.4.7. Runtime Analysis of Fusion Methods

Table 7 presents the average runtimes of various fusion methods evaluated on the datasets, all executed on an Intel Core i7-12700H CPU. For traditional fusion methods, theoretical time complexity is reported to reflect their algorithmic characteristics, while for deep learning-based methods, the computational cost is quantified in terms of GFLOPs, which is commonly used to measure model-level complexity during inference.

As illustrated in Table 7, the proposed method exhibits an average runtime of 4.8873 s with a theoretical complexity of $O(M \times N \times K)$. This performance is primarily due to the iterative nature of the LE decomposition, which requires K iterations to construct stable upper and lower envelopes for high-fidelity edge preservation. The proposed method was implemented in MATLAB 2021b. It is important to emphasize that the current performance was evaluated under a single-threaded, unoptimized implementation. Given the high degree of pixel-level parallelism in envelope construction, an optimized implementation in a compiled language (e.g., C++ or CUDA) is estimated to yield a speedup of 10 to 30 times (approximately 10× from compilation and 4–8× from multi-core parallelization). In the context of high-precision image fusion, this computational cost is considered an acceptable trade-off for the superior performance in maintaining radiometric consistency and suppressing spectral artifacts.

5. Discussion

5.1. Generalization Analysis via Controlled Decomposition Substitution

To further analyze the robustness and generalization capability of the proposed framework, we conduct a controlled study by replacing the proposed LE decomposition operator with several representative edge-preserving filters (EPFs), including a median filter (MF), a guided filter (GF), and a rolling guidance filter (RGF). In this experiment, all other components of the framework are kept unchanged, including the decomposition levels, adaptive parameter settings, fusion strategies, and reconstruction procedures. This controlled setting ensures that the influence of the decomposition operator can be isolated and fairly evaluated.

As shown in Table 8 and Table 9, all EPF-based variants achieve competitive performance within the proposed framework. This indicates that the effectiveness of the proposed method does not rely solely on a specific decomposition operator, but is largely attributed to the overall adaptive decomposition and fusion mechanism. Although the LE-based method does not consistently achieve the best performance across all metrics, it exhibits more balanced behavior across different evaluation criteria and scene types. In particular, it avoids significant performance degradation on any single metric, suggesting a stable trade-off between structural detail enhancement and spectral fidelity preservation. More importantly, the consistently strong performance of MF, GF, and RGF within the same framework highlights the generality and flexibility of the proposed method. This demonstrates that the framework is compatible with different EPF operators and maintains robust performance across diverse configurations. This property also indicates that the proposed framework can be flexibly extended by incorporating alternative filtering strategies to meet different application requirements.

5.2. Applicability in Downstream Tasks: Building Extraction

To evaluate the practical utility of the proposed framework, we conducted a targeted building extraction experiment following the evaluation strategy in [57]. We selected building-dense samples from the YYX-OPT-SAR dataset (indices 1–50) and the optical–SAR dataset (indices 26–33, 75–89, 101–106, 115–119, and 132–138). The Morphological Building Index (MBI) [58] was employed to quantitatively and qualitatively assess the structural enhancement, with results detailed in Table 10 and Figure 13.

The results in Table 10 demonstrate that the proposed method achieves superior structural saliency. On the optical–SAR dataset, it yields the highest MBI score of 0.1752, doubling the optical baseline (0.0787) and outperforming state-of-the-art methods like LFDT (0.1490). Similar competitive performance on the YYX-OPT-SAR dataset (0.1169) confirms that our framework effectively injects robust SAR backscattering features to enhance urban target representation.

Visual comparisons in Figure 13 further validate these findings. While single-modality extraction often produces fragmented footprints due to shadow interference, our method generates the most cohesive and continuous building structures (red ellipses). Additionally, our approach exhibits superior noise immunity; compared to methods such as MUFusion, it maintains a clean background by effectively suppressing SAR-induced false alarms (blue ellipses). This underscores the framework’s capability to balance spatial detail injection with robust noise suppression.

6. Conclusions

In this paper, we propose a novel adaptive pyramid decomposition method based on local extrema for optical and SAR image fusion, which enhances edge preservation and parameter adaptivity while effectively extracting complementary information from both modalities. Experimental results demonstrate that the proposed method achieves highly competitive overall performance compared to seven state-of-the-art fusion approaches. Generalization experiments further validate its robustness across various EPF-based architectures. Objectively, the primary advantage of our framework lies in its superior ability to preserve spatial structures and maximize informational correlation (evidenced by top-tier SF, EN, and SCD scores). However, a limitation of this method is that the strong injection of SAR structural extrema inevitably introduces a slight spectral deviation (reflected in average rankings for SAM and $D_{\lambda}$), representing a trade-off between spatial enhancement and spectral fidelity. To address these limitations, future research will focus on two specific directions: (1) designing spatially varying adaptive weights during the color space transformation to better constrain spectral distortion in homogeneous regions and (2) integrating lightweight deep learning modules to automatically optimize the LE decomposition levels and PAPCNN hyperparameters for real-time remote sensing applications.

Author Contributions

Conceptualization, Z.H.; methodology, Z.H.; validation, Z.H. and Q.X.; formal analysis, Z.H.; investigation, Z.H.; resources, Z.H.; data curation, Z.H. and Q.X.; writing—original draft, Z.H.; writing—review and editing, Z.H., Q.X. and Q.L.; visualization, Z.H. and Q.X.; supervision, Q.L.; project administration, Z.H.; funding acquisition, Z.H. and Q.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grant No. 62302073), the Natural Science Foundation of Chongqing (Grant No. CSTB2024NSCQ-LZX0039), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200501), and the Jiangxi Provincial Department of Education Science and Technology Youth Foundation (Grant No. GJJ181334).

Data Availability Statement

The data analyzed in this study were obtained from a publicly available dataset, and the relevant source has been cited in the reference section of this article. The source code used in this study will be publicly available at https://github.com/Zhiyang-Huang/LEAPFusion, accessed on 11 May 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Quan, Y.; Tong, Y.; Feng, W.; Dauphin, G.; Huang, W.; Xing, M. A novel image fusion method of multi-spectral and SAR images for land cover classification. Remote Sens. 2020, 12, 3801.
Li, X.; Zhang, G.; Cui, H.; Hou, S.; Chen, Y.; Li, Z.; Li, H.; Wang, H. Progressive fusion learning: A multimodal joint segmentation framework for building extraction from optical and SAR images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 178–191.
Sun, Q.; Liu, M.; Chen, S.; Lu, F.; Xing, M. Ship detection in SAR images based on multilevel superpixel segmentation and fuzzy fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5206215.
Zhang, Z.; Zhang, L.; Wu, J.; Guo, W. Optical and Synthetic Aperture Radar Image Fusion for Ship Detection and Recognition: Current state, challenges, and future prospects. IEEE Geosci. Remote Sens. Mag. 2024, 12, 132–168.
Duan, C.; Belgiu, M.; Stein, A. Efficient Cloud Removal Network for Satellite Images Using SAR-optical Image Fusion. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6008605.
Li, H.; Zhao, J.; Li, J.; Yu, Z.; Lu, G. Feature dynamic alignment and refinement for infrared–visible image fusion: Translation robust fusion. Inf. Fusion 2023, 95, 26–41.
Li, M.; Sun, J.; Ma, H.; Wang, F.; Sun, F. Infrared and Visible Image Fusion Based on Multi-modal and Multi-scale Cross-compensation. Knowl.-Based Syst. 2026, 338, 115441.
Yue, J.; Fang, L.; Xia, S.; Deng, Y.; Ma, J. Dif-Fusion: Toward High Color Fidelity in Infrared and Visible Image Fusion With Diffusion Models. IEEE Trans. Image Process. 2023, 32, 5705–5720.
Wang, S.; Xie, Q.; Zhao, Q.; Meng, D. AMHF-Net: A Multispectral and Hyperspectral Image Fusion Network for Arbitrary-Band Hyperspectral Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 12367–12383.
Zhang, Z.; Xu, J.; Wang, X.; Xie, G.; Wei, L. Multiscale fusion of panchromatic and multispectral images based on adaptive iterative filtering. Remote Sens. 2023, 16, 7.
Deng, S.; Ma, J.; Deng, L.J.; Wei, P. OTIAS: OcTree implicit adaptive sampling for multispectral and hyperspectral image fusion. Proc. AAAI Conf. Artif. Intell. 2025, 39, 2708–2716.
Ye, Y.; Zhang, J.; Zhou, L.; Li, J.; Ren, X.; Fan, J. Optical and SAR image fusion based on complementary feature decomposition and visual saliency features. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205315.
Gong, X.; Hou, Z.; Wan, Y.; Zhong, Y.; Zhang, M.; Lv, K. Multispectral and SAR image fusion for multi-scale decomposition based on least squares optimization rolling guidance filtering. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401920.
Gong, X.; Hou, Z.; Ma, A.; Zhong, Y.; Zhang, M.; Lv, K. An adaptive multi-scale Gaussian co-occurrence filtering decomposition method for multispectral and SAR image fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8215–8229.
Li, W.; Wu, J.; Liu, Q.; Zhang, Y.; Cui, B.; Jia, Y.; Gui, G. An effective multimodel fusion method for SAR and optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5881–5892.
Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43.
Kulkarni, S.C.; Rege, P.P. Pixel level fusion techniques for SAR and optical images: A review. Inf. Fusion 2020, 59, 13–29.
Zhang, H.; Shen, H.; Yuan, Q.; Guan, X. Multispectral and SAR image fusion based on Laplacian pyramid and sparse representation. Remote Sens. 2022, 14, 870.
Du, J.; Li, W.; Xiao, B. Anatomical-functional image fusion by information of interest in local Laplacian filtering domain. IEEE Trans. Image Process. 2017, 26, 5855–5866.
Parida, P.; Panda, M.K.; Rout, D.K.; Panda, S.K. Infrared and visible image fusion using quantum computing induced edge preserving filter. Image Vis. Comput. 2025, 153, 105344. [Google Scholar] [CrossRef]
Zhang, Y.; Lee, H.J. Multisensor infrared and visible image fusion via double joint edge preservation filter and nonglobally saliency gradient operator. IEEE Sens. J. 2023, 23, 10252–10267. [Google Scholar] [CrossRef]
Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys. Technol. 2017, 82, 8–17. [Google Scholar] [CrossRef]
Tan, W.; Zhou, H.; Song, J.; Li, H.; Yu, Y.; Du, J. Infrared and visible image perceptive fusion through multi-level Gaussian curvature filtering image decomposition. Appl. Opt. 2019, 58, 3064–3073. [Google Scholar] [CrossRef]
Subr, K.; Soler, C.; Durand, F. Edge-preserving multiscale image decomposition based on local extrema. ACM Trans. Graph. (TOG) 2009, 28, 1–9. [Google Scholar] [CrossRef]
Xu, Z. Medical image fusion using multi-level local extrema. Inf. Fusion 2014, 19, 38–48. [Google Scholar] [CrossRef]
Du, J.; Li, W.; Xiao, B.; Nawaz, Q. Medical image fusion by combining parallel features on multi-scale local extrema scheme. Knowl.-Based Syst. 2016, 113, 4–12. [Google Scholar] [CrossRef]
Yang, J.; Ren, G.; Ma, Y.; Fan, Y. Coastal wetland classification based on high resolution SAR and optical image fusion. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS); IEEE: New York, NY, USA, 2016; pp. 886–889. [Google Scholar]
Chen, C.M.; Hepner, G.; Forster, R. Fusion of hyperspectral and radar data using the IHS transformation to enhance urban surface features. ISPRS J. Photogramm. Remote Sens. 2003, 58, 19–30. [Google Scholar] [CrossRef]
Pal, S.; Majumdar, T.; Bhattacharya, A.K. ERS-2 SAR and IRS-1C LISS III data fusion: A PCA approach to improve remote sensing based geological interpretation. ISPRS J. Photogramm. Remote Sens. 2007, 61, 281–297. [Google Scholar] [CrossRef]
Dupas, C.A. SAR and LANDSAT TM image fusion for land cover classification in the Brazilian Atlantic Forest Domain. Int. Arch. Photogramm. Remote Sens. 2000, 33, 96–103. [Google Scholar]
Zhang, W.; Yu, L. SAR and Landsat ETM+ image fusion using variational model. In Proceedings of the 2010 International Conference on Computer and Communication Technologies in Agriculture Engineering; IEEE: New York, NY, USA, 2010; Volume 3, pp. 205–207. [Google Scholar]
Huang, Y.; Liao, J.; Guo, H.; Zhong, X. The fusion of multispectral and SAR images based wavelet transformation over urban area. In Proceedings of the 2005 IEEE International Geoscience and Remote Sensing Symposium (IGARSS’05); IEEE: New York, NY, USA, 2005; Volume 6, pp. 3942–3944. [Google Scholar]
Chibani, Y. Additive integration of SAR features into multispectral SPOT images by means of the à trous wavelet decomposition. ISPRS J. Photogramm. Remote Sens. 2006, 60, 306–314. [Google Scholar] [CrossRef]
Wang, X.; Chen, C. Image fusion for synthetic aperture radar and multispectral images based on sub-band-modulated non-subsampled contourlet transform and pulse coupled neural network methods. Imaging Sci. J. 2016, 64, 87–93. [Google Scholar] [CrossRef]
Kou, F.; Li, Z.; Wen, C.; Chen, W. Edge-Preserving Smoothing Pyramid Based Multi-Scale Exposure Fusion. J. Vis. Commun. Image Represent. 2018, 53, 235–244. [Google Scholar] [CrossRef]
Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
Ye, Y.; Liu, W.; Zhou, L.; Peng, T.; Xu, Q. An unsupervised SAR and optical image fusion network based on structure-texture decomposition. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4028305. [Google Scholar] [CrossRef]
Liu, C.; Sun, Y.; Zhang, X.; Xu, Y.; Lei, L.; Kuang, G. OSHFNet: A heterogeneous dual-branch dynamic fusion network of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 2025, 141, 104609. [Google Scholar] [CrossRef]
Wei, K.; Dai, J.K.; Hong, D.; Ye, Y. MGFNet: An MLP-dominated gated fusion network for semantic segmentation of high-resolution multi-modal remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2024, 135, 104241. [Google Scholar] [CrossRef]
Yin, M.; Liu, X.; Liu, Y.; Chen, X. Medical image fusion with parameter-adaptive pulse coupled neural network in nonsubsampled shearlet transform domain. IEEE Trans. Instrum. Meas. 2018, 68, 49–64. [Google Scholar] [CrossRef]
Levin, A.; Lischinski, D.; Weiss, Y. Colorization using optimization. In ACM SIGGRAPH 2004 Papers; ACM: New York, NY, USA, 2004; pp. 689–694. [Google Scholar]
Burt, P.J.; Adelson, E.H. The Laplacian pyramid as a compact image code. In Readings in Computer Vision; Elsevier: Amsterdam, The Netherlands, 1987; pp. 671–679. [Google Scholar]
Li, Z.; Wei, Z.; Wen, C.; Zheng, J. Detail-enhanced multi-scale exposure fusion. IEEE Trans. Image Process. 2017, 26, 1243–1252. [Google Scholar] [CrossRef]
Haydn, R. Application of the IHS color transform to the processing of multisensor data and image enhancement. In Proceedings of the International Symposium on Remote Sensing of Arid and Semi-Arid Lands, Cairo, Egypt, 19–25 January 1982. [Google Scholar]
Li, J.; Zhang, J.; Yang, C.; Liu, H.; Zhao, Y.; Ye, Y. Comparative analysis of pixel-level fusion algorithms and a new high-resolution dataset for SAR and optical image fusion. Remote Sens. 2023, 15, 5514. [Google Scholar] [CrossRef]
Ren, B.; Ma, S.; Hou, B.; Hong, D.; Chanussot, J.; Wang, J.; Jiao, L. A dual-stream high resolution network: Deep fusion of GF-2 and GF-3 data for land cover classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102896. [Google Scholar] [CrossRef]
Cheng, C.; Xu, T.; Wu, X.J. MUFusion: A general unsupervised image fusion network based on memory unit. Inf. Fusion 2023, 92, 80–92. [Google Scholar] [CrossRef]
Yang, B.; Jiang, Z.; Pan, D.; Yu, H.; Gui, G.; Gui, W. LFDT-Fusion: A latent feature-guided diffusion Transformer model for general image fusion. Inf. Fusion 2025, 113, 102639. [Google Scholar] [CrossRef]
Liu, Y.; Liu, S.; Wang, Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf. Fusion 2015, 24, 147–164. [Google Scholar] [CrossRef]
Eskicioglu, A.M.; Fisher, P.S. Image quality measures and their performance. IEEE Trans. Commun. 1995, 43, 2959–2965. [Google Scholar] [CrossRef]
Aslantas, V.; Bendes, E. A new image quality metric for image fusion: The sum of the correlations of differences. AEU Int. J. Electron. Commun. 2015, 69, 1890–1896. [Google Scholar] [CrossRef]
Alparone, L.; Aiazzi, B.; Baronti, S.; Garzelli, A.; Nencini, F.; Selva, M. Multispectral and panchromatic data fusion assessment without reference. Photogramm. Eng. Remote Sens. 2008, 74, 193–200. [Google Scholar] [CrossRef]
Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the JPL, Summaries of the Third Annual JPL Airborne Geoscience Workshop; Volume 1: AVIRIS Workshop; NASA: Washington, DC, USA, 1992. [Google Scholar]
Zhou, Z.; Wang, B.; Li, S.; Dong, M. Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with Gaussian and bilateral filters. Inf. Fusion 2016, 30, 15–26. [Google Scholar] [CrossRef]
Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699. [Google Scholar]
Li, Q.; Mou, L.; Sun, Y.; Hua, Y.; Shi, Y.; Zhu, X.X. A Review of Building Extraction From Remote Sensing Imagery: Geometrical Structures and Semantic Attributes. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4702315. [Google Scholar] [CrossRef]
Huang, X.; Zhang, L. Morphological Building/Shadow Index for Building Extraction From High-Resolution Imagery Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 161–172. [Google Scholar] [CrossRef]

Figure 1. Representative heterogeneous image pair from the benchmark dataset: (a) optical image; (b) SAR image.

Figure 2. The flowchart of the proposed method, which mainly consists of an adaptive local extrema module and a multi-scale multi-type pyramid fusion module.

Figure 3. Detailed 1D profile analysis of the LE algorithm steps. To clearly demonstrate the pixel-level mathematical mechanism, a 1D horizontal slice is extracted from the 2D data in Figure 4. Step 1: Locate the local maxima and minima of the input profile (red). Step 2: Construct the maximal (purple) and minimal (yellow) envelopes via interpolation. Step 3: Calculate the average of the two envelopes to obtain the coarse layer profile (green). This 1D process is applied across the entire image to achieve full 2D decomposition.
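
The three steps in the Figure 3 caption can be illustrated with a minimal 1D sketch. This is not the authors' implementation. The function name `local_extrema_1d`, the window-based extrema test, and the use of linear interpolation (`np.interp`) for the envelopes are all assumptions made for illustration, since the caption does not specify the interpolation scheme.

```python
import numpy as np

def local_extrema_1d(profile, k=3):
    """Illustrative LE split of a 1D profile into coarse and residual layers.

    k is an odd window size (hypothetical parameter name) used to decide
    whether a sample is a local maximum or minimum.
    """
    x = np.asarray(profile, dtype=float)
    if k % 2 == 0:
        raise ValueError("k must be odd")
    r = k // 2
    idx = np.arange(x.size)

    # Step 1: a sample is a local maximum (minimum) if it equals the
    # maximum (minimum) of the k-sample window centred on it.
    padded = np.pad(x, r, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    is_max = x == windows.max(axis=1)
    is_min = x == windows.min(axis=1)

    # Step 2: build the maximal and minimal envelopes by interpolating
    # through the detected extrema (linear interpolation is an assumption).
    upper = np.interp(idx, idx[is_max], x[is_max])
    lower = np.interp(idx, idx[is_min], x[is_min])

    # Step 3: the coarse layer is the mean of the two envelopes;
    # the residual layer is whatever the coarse layer does not explain.
    coarse = 0.5 * (upper + lower)
    residual = x - coarse
    return coarse, residual

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0.0, 6.0, 64)) + 0.1 * rng.standard_normal(64)
coarse, residual = local_extrema_1d(signal, k=5)
```

By construction the two layers sum back to the input, which mirrors the relation between the input, coarse, and residual layers shown in Figure 4. A 2D version would need a 2D window and a 2D interpolation scheme, which this sketch does not attempt.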

Figure 4. Visualization of the LE decomposition on a 2D SAR image. (a–c) show the 2D input image I, the decomposed coarse layer C, and the residual layer R ($R = I - C$), respectively. (d) presents the 1D intensity profiles extracted from the same horizontal scanline (marked by colored lines in (a–c)). The plots show the input intensities (red) and their separation into the coarse (green) and residual (blue) layers by the LE algorithm.

Figure 5. Fusion results of the YYX-OPT-SAR dataset for three land cover types. The first and second rows correspond to urban land cover, the third and fourth rows to suburban land cover, and the fifth and sixth rows to mountainous land cover. The SCD values are labeled in the bottom-right corner of each fused image.

Figure 6. Fusion results on the optical–SAR dataset. The SCD values are labeled in the bottom-right corner of each fused image.

Figure 7. A comparative analysis of the EN metric with fixed decomposition levels L and kernel sizes k. In the first row, L is fixed at 4 and $k = 3, 5, 7$. In the second row, k is fixed at 3, with L taking values of 3, 4, 5.

Figure 8. Visual comparison of the ablation study on the IHS transformation. Compared with YCbCr and the proposed IHS-based framework, direct gray-scale fusion and HSV-based methods exhibit significant color distortion, resulting in a loss of spectral consistency relative to the original optical image.

Figure 9. The trend of evaluation metrics for fusion results across different scenes as a function of the number of iterations N.

Figure 10. Comparison of results using different fusion rules for the LE pyramid. (a1,a2,b1,b2) The source optical and SAR images, respectively. (c1,c2) WLE and WSEML. (d1,d2) The weighted-averaging method used in this study.

Figure 11. Comparison of results using different fusion rules for the Laplacian pyramid. Compared with the Max-abs rule, the proposed method (with PAPCNN) effectively suppresses noise and artifacts while preserving clearer structural textures and maintaining color consistency.

Figure 12. Visual comparison of fusion results using different pyramid decomposition schemes.

Figure 13. Visual comparison of building extraction results using the Morphological Building Index (MBI) across different methods. The red and blue ellipses highlight the regions where our proposed method exhibits superior performance in preserving structural completeness and suppressing background noise, respectively.

Table 1. Quantitative evaluation of fusion results on the YYX-OPT-SAR dataset, with the best results highlighted in red, the second-best results in green and the third-best results in blue.

Images	Metrics	Traditional Methods			Deep Learning Methods				Ours
Images	Metrics	MLGCF [23]	MSGRGF [22]	VSFF [12]	IFCNN [36]	U2Fusion [37]	MUFusion [48]	LFDT [49]	Ours
Urban	EN↑	6.9981	7.0117	7.3801	6.8666	6.4437	7.0480	7.2629	7.4115
	SF↑	31.2822	30.7613	31.9297	29.3705	19.8326	16.6442	27.6946	33.2573
	SCD↑	1.6290	1.6146	1.6826	1.5785	1.3698	0.7205	1.5066	1.8673
	$D_{\lambda}$↓	0.0114	0.0111	0.0037	0.0132	0.0305	0.0242	0.0014	0.0063
	SAM↓	0.3213	0.2983	0.2315	0.3128	0.3751	0.3149	0.2168	0.2405
Suburban	EN↑	7.0070	7.0346	7.4003	6.8681	6.4847	7.0487	7.2299	7.3331
	SF↑	32.7038	31.8651	35.6181	30.7929	20.7293	16.6110	28.7598	34.7725
	SCD↑	1.6540	1.6411	1.6615	1.6265	1.4414	0.8229	1.5053	1.8412
	$D_{\lambda}$↓	0.0128	0.0121	0.0028	0.0131	0.0306	0.0184	0.0021	0.0077
	SAM↓	0.2931	0.2806	0.1684	0.2665	0.3161	0.2654	0.2010	0.1923
Mountain	EN↑	6.6286	6.6774	6.8321	6.5435	6.2931	6.7637	6.7431	6.9061
	SF↑	30.7202	30.6092	26.0934	28.7523	19.6907	13.6077	25.6193	30.0530
	SCD↑	1.7035	1.6988	1.5416	1.6895	1.5209	0.8322	1.4151	1.8264
	$D_{\lambda}$↓	0.0224	0.0243	0.0130	0.0279	0.0502	0.0578	0.0135	0.0198
	SAM↓	0.3613	0.3425	0.2630	0.3368	0.4507	0.3979	0.2507	0.2736

Table 2. Quantitative evaluation of fusion results on the optical–SAR dataset, with the best results highlighted in red, the second-best results in green and the third-best results in blue.

Metrics	Traditional Methods			Deep Learning Methods				Ours
Metrics	MLGCF [23]	MSGRGF [22]	VSFF [12]	IFCNN [36]	U2Fusion [37]	MUFusion [48]	LFDT [49]	Ours
EN↑	6.5457	6.5855	6.1318	6.6455	6.3556	6.5048	6.7383	6.8233
SF↑	20.7167	20.7122	16.4576	21.7502	18.3417	12.3369	20.6209	24.6603
SCD↑	1.4067	1.4187	1.1533	1.4130	1.3338	0.2668	1.4042	1.6868
$D_{\lambda}$↓	0.0309	0.0272	0.0067	0.0353	0.0226	0.0481	0.0153	0.0258
SAM↓	0.4728	0.4518	0.2023	0.4458	0.4458	0.3950	0.4487	0.3302

↑ indicates the higher the better; ↓ indicates the lower the better.
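
The relative improvements quoted in the abstract (13.38% in SF and 18.90% in SCD over the second-best method) appear to correspond to the optical–SAR results in Table 2. The short sketch below reproduces that arithmetic from the table values. The helper name `relative_gain` is hypothetical, and pairing each metric with its runner-up (IFCNN for SF, MSGRGF for SCD) is read off the table rather than stated by the authors.

```python
def relative_gain(ours, runner_up):
    """Percentage improvement of `ours` over `runner_up` for a higher-is-better metric."""
    return 100.0 * (ours - runner_up) / runner_up

# (ours, second-best) pairs taken from Table 2.
table2 = {
    "SF": (24.6603, 21.7502),   # second-best: IFCNN [36]
    "SCD": (1.6868, 1.4187),    # second-best: MSGRGF [22]
}

gains = {metric: relative_gain(ours, second) for metric, (ours, second) in table2.items()}
for metric, gain in gains.items():
    print(f"{metric}: {gain:.2f}% over the second-best method")
```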

Table 3. Ablation study on the necessity of the IHS transformation.

Methods	EN↑	SF↑	SCD↑	$D_{\lambda}$↓	SAM↓
w/o IHS (Direct Intensity)	6.7939	22.5166	1.6935	0.0015	0.3286
YCbCr	6.7805	22.7192	1.6942	0.0252	0.3332
HSV	6.7313	22.1061	1.6478	0.0007	0.3163
Proposed (with IHS)	6.8233	24.6603	1.6868	0.0250	0.3302

↑ indicates the higher the better; ↓ indicates the lower the better. The bold values indicate the best performance for each metric.

Table 4. Numerical values of objective evaluation metrics for different fusion rules in the LE pyramid, with the best results highlighted in bold.

Metrics	WLE and WSEML	Weighted-Averaging
EN↑	6.8395	6.8233
SF↑	29.2602	24.6603
SCD↑	1.3236	1.6868
$D_{\lambda}$↓	0.0333	0.0258
SAM↓	0.3458	0.3302

↑ indicates the higher the better; ↓ indicates the lower the better. The bold values indicate the best performance for each metric.

Table 5. Ablation study on Laplacian pyramid fusion rules.

Methods	EN↑	SF↑	SCD↑	$D_{\lambda}$↓	SAM↓
Max-abs	6.7479	20.6155	1.6214	0.0300	0.3721
Proposed (with PAPCNN)	6.8233	24.6603	1.6868	0.025	0.3302

↑ indicates the higher the better; ↓ indicates the lower the better. The bold values indicate the best performance for each metric.

Table 6. Ablation study on the complementary effects of LE and Laplacian pyramids.

Methods	EN↑	SF↑	SCD↑	$D_{\lambda}$↓	SAM↓
Only Laplacian pyramid	5.4166	15.6182	0.4670	0.0263	0.3042
Only LE pyramid	6.7038	12.1003	1.4855	0.0412	0.3128
Laplacian + LE pyramid	6.8233	24.6603	1.6868	0.0258	0.3302

↑ indicates the higher the better; ↓ indicates the lower the better. The bold values indicate the best performance for each metric.

Table 7. Computational characteristics of different image fusion methods, where M and N denote the height and width of the input image, respectively; L is the number of decomposition levels; $\omega$ represents the sliding window size; K denotes the number of iterations; and $\alpha$ is the iteration factor. Note that the values of K, L, $\omega$, and $\alpha$ vary across different algorithms due to their distinct optimization mechanisms. For traditional methods, computational complexity is expressed in Big-O notation. For deep learning-based methods, GFLOPs are reported only to reflect model-scale complexity and are not intended for direct comparison. Average runtime is evaluated on a computer equipped with an Intel Core i7-12700H CPU @ 2.30 GHz and 16 GB RAM to ensure fairness.

Method	Category	Complexity (Time/GFLOPs)	Avg. Runtime (CPU, s)
MLGCF [23]	Traditional	$O (K \times M \times N)$	15.5413
MSGRGF [22]	Traditional	$O (L \cdot M N \cdot (K + \alpha))$	0.5445
VSFF [12]	Traditional	$O (M \times N \times K)$	1.2841
Ours	Traditional	$O (M \times N \times K)$	4.8873
IFCNN [36]	Deep Learning	16.9869 GFLOPs	0.0549
U2Fusion [37]	Deep Learning	86.3408 GFLOPs	0.5479
MUFusion [48]	Deep Learning	5.474 GFLOPs	0.8683
LFDT [49]	Deep Learning	238.36 GFLOPs	0.8633

Table 8. EPF fusion performance for three land cover types in the YYX-OPT-SAR dataset.

Images	Metrics	MF	GF	RGF	LE
Urban	EN↑	7.3986	7.4075	7.4221	7.4115
	SF↑	30.8506	30.7297	32.0366	33.2573
	SCD↑	1.8842	1.8866	1.8655	1.8673
	$D_{\lambda}$↓	0.0065	0.0064	0.0065	0.0063
	SAM↓	0.2342	0.2279	0.2328	0.2405
Suburban	EN↑	7.3240	7.3288	7.3390	7.3331
	SF↑	33.5040	32.9712	34.3770	34.7725
	SCD↑	1.8591	1.8636	1.8362	1.8414
	$D_{\lambda}$↓	0.0077	0.0077	0.0077	0.0077
	SAM↓	0.1898	0.1863	0.1887	0.1923
Mountain	EN↑	6.8758	6.8768	6.9077	6.9061
	SF↑	28.7433	28.2268	29.6700	30.0530
	SCD↑	1.8382	1.8442	1.8129	1.8264
	$D_{\lambda}$↓	0.0203	0.0201	0.0205	0.0198
	SAM↓	0.2645	0.2636	0.2658	0.2736

↑ indicates the higher the better; ↓ indicates the lower the better.

Table 9. EPF fusion performance in the optical–SAR dataset.

Metrics	MF	GF	RGF	LE
EN↑	6.7912	6.7911	6.7980	6.8233
SF↑	19.1434	18.3988	19.7711	24.6603
SCD↑	1.7099	1.7063	1.7008	1.6868
$D_{\lambda}$↓	0.0256	0.0258	0.0259	0.0258
SAM↓	0.3160	0.3094	0.3105	0.3302

↑ indicates the higher the better; ↓ indicates the lower the better.

Table 10. Comparison of MBI values of different image fusion methods on two datasets (higher values indicate better performance; red: best, green: second best, blue: third best).

Method	YYX-OPT-SAR (MBI)	Optical–SAR (MBI)
Optical	0.0755	0.0787
SAR	0.0624	0.1085
MLGCF [23]	0.1123	0.1312
MSGRGF [22]	0.1195	0.1341
VSFF [12]	0.1032	0.1217
IFCNN [36]	0.1022	0.1351
U2Fusion [37]	0.0961	0.1179
MUFusion [48]	0.1129	0.1226
LFDT [49]	0.1186	0.1490
Ours	0.1169	0.1752

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, Z.; Xiao, Q.; Liu, Q. Local Extrema Adaptive Pyramid Decomposition for Optical and SAR Image Fusion. Electronics 2026, 15, 2129. https://doi.org/10.3390/electronics15102129


