1. Introduction
With the rapid advancement of remote sensing satellite sensors, a multitude of remote sensing imagery sources have emerged, including optical images and Synthetic Aperture Radar (SAR) images. Integrating optical and SAR images into a single composite can significantly enhance the accuracy of downstream remote sensing tasks, such as land cover classification [1], building extraction [2], ship detection and recognition [3,4], and cloud removal [5]. The fusion of optical and SAR remote sensing imagery remains a pivotal and dynamic area within current research topics. The image fusion task integrates complementary information from multiple sources to improve downstream tasks, for example, fusion of visible and infrared images [6,7,8], multispectral and panchromatic images [9,10,11], and optical and SAR images [12,13,14,15]. Infrared images provide thermal information but lack contrast and detail, while visible images offer rich texture and high contrast, making them ideal for human perception, although they are sensitive to environmental factors. Their fusion combines the detailed edges of visible with infrared thermal information. Multispectral images provide rich spectral data but lower spatial resolution, while panchromatic images offer high spatial resolution but no spectral information. Combining both results in high-quality images with fine spatial detail and rich spectral information. As shown in Figure 1, optical images capture surface spectral information but are limited by weather and lighting conditions, while SAR images are unaffected by these factors, capable of penetrating shallow soil and vegetation, but suffer from speckle noise. The presence of speckle noise and lower visual quality complicates the interpretation of SAR images [16]. The fusion of optical and SAR images combines their strengths, significantly improving remote sensing image analysis efficiency [17].
However, despite their complementary advantages, effective fusion of optical and SAR images remains challenging due to their significant modality differences. From a statistical perspective, SAR images are generated through coherent imaging mechanisms and are inherently affected by multiplicative speckle noise, which follows a non-Gaussian distribution. Moreover, SAR backscatter intensity is highly dependent on surface roughness, dielectric properties, and imaging geometry, leading to strong spatial heterogeneity. Furthermore, SAR images typically exhibit heavy-tailed distributions, where high-frequency components simultaneously contain structural edges and stochastic noise. This characteristic makes it difficult to distinguish meaningful texture information from speckle interference during feature extraction and fusion, thereby increasing the difficulty of achieving accurate edge preservation and cross-modal consistency. These statistical characteristics make it difficult to reliably distinguish structural edges from stochastic fluctuations using conventional filtering or fixed-scale decomposition methods. Therefore, a fusion framework that can adaptively capture local structural variations while being less sensitive to statistical assumptions is highly desirable.
Pixel-level image fusion, particularly using multi-scale decomposition (MSD) methods, has gained significant attention due to its superior performance across various domains [13,14,15,18]. MSD-based fusion generally involves three steps: (1) decomposing input images into multiple scales to capture high- and low-frequency information, (2) applying fusion rules at each scale, and (3) reconstructing the fused image via inverse transformation. Among MSD methods, pyramidal transforms are widely adopted. Pyramid decomposition generates multiscale representations by recursively applying Gaussian filtering and downsampling, capturing low-frequency (Gaussian pyramid) and high-frequency (Laplacian pyramid) components for efficient analysis and image reconstruction [18,19]. However, the Gaussian filter smooths all regions uniformly, which often results in detail loss: in texture-rich, high-contrast, or structurally complex images, the excessive smoothing over-blurs fine structures and diminishes recognizability. Thus, preserving edge sharpness and fine details while leveraging the benefits of multi-scale analysis remains a critical challenge.
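The three MSD steps can be illustrated with a classical Gaussian/Laplacian pyramid. The following is a minimal NumPy sketch and not the decomposition proposed in this paper: the 5-tap binomial kernel, the nearest-neighbour upsampling, the max-absolute rule for detail layers, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def blur(img):
    """Separable 5-tap binomial (Gaussian-like) low-pass filter with edge padding."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    padded = np.pad(img, 2, mode="edge")
    # Filter along columns of each row first, then along rows.
    rows = sum(k[i] * padded[:, i:i + img.shape[1]] for i in range(5))
    return sum(k[i] * rows[i:i + img.shape[0], :] for i in range(5))


def downsample(img):
    """Smooth, then keep every second pixel in both directions."""
    return blur(img)[::2, ::2]


def upsample(img, shape):
    """Nearest-neighbour upsampling to `shape`, followed by smoothing."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return blur(up[:shape[0], :shape[1]])


def laplacian_pyramid(img, levels):
    """Step (1): decompose into `levels` detail layers plus a low-frequency base."""
    gaussian = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels):
        gaussian.append(downsample(gaussian[-1]))
    laplacian = [g - upsample(g_next, g.shape)
                 for g, g_next in zip(gaussian[:-1], gaussian[1:])]
    return laplacian, gaussian[-1]


def reconstruct(laplacian, base):
    """Step (3): invert the decomposition from coarse to fine."""
    img = base
    for lap in reversed(laplacian):
        img = upsample(img, lap.shape) + lap
    return img


def fuse(img_a, img_b, levels=3):
    """Step (2): max-absolute rule on detail layers, averaging on the base."""
    lap_a, base_a = laplacian_pyramid(img_a, levels)
    lap_b, base_b = laplacian_pyramid(img_b, levels)
    fused = [np.where(np.abs(a) >= np.abs(b), a, b) for a, b in zip(lap_a, lap_b)]
    return reconstruct(fused, 0.5 * (base_a + base_b))
```

Because each detail layer stores exactly what the upsampled coarser level misses, reconstruction is lossless regardless of the smoothing kernel; the choice of kernel only affects how information is split between scales, which is where the uniform Gaussian smoothing criticized above comes in.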
Recent advances have incorporated edge-preserving filters (EPFs) into fusion frameworks [20,21,22,23] to address these problems. Most EPFs assume low contrast in image details, which may not accurately capture fine-scale spatial variations. To distinguish textures from individual edges, Subr et al. [24] utilize the local extrema (LE) of the input image to extract information about oscillations and define detail as oscillations between local minima and maxima. The theoretical motivation for employing LE in optical and SAR fusion lies in its ability to bypass the limitations of gradient-based filtering. In SAR imagery, structural edges are often submerged in multiplicative speckle noise, so traditional filters struggle to differentiate between noise-induced fluctuations and meaningful textures. By characterizing detail as oscillations between local minima and maxima (envelopes), the LE algorithm provides a more robust mathematical criterion for isolating structural information, ensuring that sharp edges are preserved even under heavy-tailed noise distributions. LE-based methods have demonstrated robust edge-preserving and statistical capabilities. Xu [25] and Du et al. [26] employ LE for multiscale image decomposition using varying sliding kernel sizes. Du et al. [19] further integrate LE into an Information of Interest-based fusion strategy to improve detail extraction. However, current EPF-based multi-scale image fusion approaches face two fundamental challenges: (1) decomposition levels, typically determined empirically, may cause either over-decomposition with excessive computational costs or under-decomposition with insufficient detail preservation, particularly due to the absence of adaptive mechanisms linking EPF’s sliding kernel size to decomposition levels; (2) sophisticated fusion strategies often require manual parameter adjustment, potentially leading to distortion artifacts and limited adaptability to varying land cover types in remote sensing scenarios. While parameter adaptation has been explored in previous studies, most existing methods either rely on empirical initializations or are optimized for specific modalities. In contrast, our proposed method establishes a direct, deterministic mapping between the input’s spatial dimensions (height/width) and the decomposition hierarchy. This ensures that the scale of analysis is inherently tailored to the image resolution and the spatial density of features, providing a parameter-free framework that is robust across diverse remote sensing scenarios without requiring prior tuning.
To address these challenges, particularly the difficulty of distinguishing texture and edge information under complex SAR statistical characteristics, in this paper, we propose an optical and SAR image fusion framework based on local extrema adaptive pyramid decomposition (LEAPFusion), which enhances edge preservation and improves parameter adaptability. Specifically, we propose a local-extrema-based adaptive pyramid decomposition method that automatically determines the level of decomposition and the size of the kernel. This is particularly crucial for effectively eliminating parameter-induced uncertainties associated with manual configuration. Furthermore, we design a simple yet effective pyramid fusion strategy according to the complementary features between the LE pyramid and the Laplacian pyramid. It demonstrates robust performance across various land cover types. We integrate the proposed method into three representative edge-preserving filters—a median filter, a guided filter, and a rolling guidance filter—to validate its effectiveness and robustness. Extensive experimental results on two datasets demonstrate that our method achieves competitive performance.
The main contributions of this work are summarized as follows:
We propose an LE-based adaptive pyramid decomposition for optical and SAR image fusion, leveraging edge preservation and scale adaptivity to extract complementary information from both image sources.
We design a simple yet effective multi-scale multi-type pyramid fusion strategy, which has good robustness to different challenges.
We integrate the proposed method into three EPFs and compare it with several state-of-the-art fusion methods. Extensive experimental results demonstrate strong generalization capabilities, delivering satisfactory results even when alternative EPFs are employed.
The rest of the paper is organized as follows. Section 2 reviews related works, including traditional and deep learning-based optical and SAR image fusion methods. Section 3 gives the details of the proposed methodology. Section 4 presents the experiments, including datasets, evaluation metrics, and comparative analysis. Section 5 provides a detailed discussion of the results, and Section 6 summarizes the paper with a conclusion.
3. Methodology
The proposed framework for fusing optical and SAR images is depicted in Figure 2 and Algorithm 1. The process begins with an IHS transformation of the optical image to extract its intensity (I) component. This I component, along with the SAR image, undergoes adaptive pyramid decomposition, generating LE pyramids and Laplacian pyramids. The LE pyramids are fused using a weighted-averaging strategy, while the Laplacian pyramids are fused with the parameter-adaptive pulse coupled neural network (PAPCNN) [41]. The fused pyramids are then reconstructed using Laplacian pyramid reconstruction to produce the fused image. Finally, the inverse IHS transformation is applied to obtain the final fusion result.
Algorithm 1. The proposed LEAPFusion for optical and SAR image fusion
Require: Optical image O, SAR image S
Ensure: Fused image F
1: Perform the IHS transform on O to obtain the intensity and chrominance channels
2: Determine the number of pyramid levels L based on the width and height of the intensity channel or S (refer to Equation (8))
3: for each pyramid level l up to L do
4:   Construct both the local extrema and Laplacian pyramids for the intensity channel and S (refer to Equations (6) and (7))
5:   Update the sliding window k (refer to Equation (9))
6:   Fuse the l-th layer of the local extrema and Laplacian pyramids to obtain the respective fused results (refer to Equation (10) and PAPCNN [41])
7:   Compute the final pyramid fused result at level l
8: end for
9: Reconstruct the fused intensity from the fused pyramid via the inverse Laplacian pyramid transform, then obtain F via the inverse IHS transform using the fused intensity and the chrominance channels
10: return F
3.1. Edge-Preserving Scale-Adaptive Pyramid Decomposition
The Gaussian–Laplacian pyramid is a widely adopted multi-resolution technique in image processing, particularly for tasks such as feature extraction and image fusion. However, the Gaussian filter, which forms the core of this approach, applies uniform smoothing to all pixels without distinguishing between edges and background regions. This isotropic operation inevitably blurs edge information and fails to preserve local structural details. Moreover, the performance of the Gaussian filter heavily depends on the selection of its standard deviation, and manually tuning this parameter introduces uncertainty, limiting its adaptability and robustness in automated processing pipelines. These shortcomings highlight the need for more advanced filtering methods capable of adaptively preserving edges while maintaining smoothness.
Given these limitations, we introduce a local-extrema-based EPF to replace the Gaussian filter, enabling more effective extraction of edge and texture information. Additionally, we propose an adaptive pyramid decomposition framework capable of automatically determining two pivotal parameters—the decomposition levels and the local extrema computation kernel sizes—thereby mitigating the parameter-related uncertainties inherent in manual configurations.
3.1.1. Local Extrema
Local extrema (LE) filtering was first introduced by Subr et al. [24] as a novel EPF that can effectively smooth highly contrasted oscillations while preserving salient edges. The process consists of three main steps: (1) identify the local maxima and minima of the input image $I$ within a specified sliding window; (2) construct the maximal and minimal envelopes using interpolation techniques; (3) compute the average of these two envelopes to obtain the smoothed mean, referred to as the coarse layer ($C$). Figure 3 illustrates the three steps of the LE algorithm using 1D slices of the input image in Figure 4a (red), along with its extrema, maximal envelope (green), minimal envelope (blue), and the coarse layer (black). The residual layer is then calculated as $D = I - C$. Further details are provided below.
Locate extrema. A pixel $p$ is identified as a local maximum (or minimum) if no more than $k-1$ pixels within its $k \times k$ neighborhood have a greater (or smaller) value. Oscillations, with maxima detected using a $k \times k$ kernel, exhibit wavelengths of at least $k/2$ pixels. Larger kernels may fail to detect finer oscillations.
Construct extremal envelopes. Given an image $I$ and a set of pixels $S$ representing local extrema, we compute an extremal envelope $E$ using an interpolation method adapted from the technique proposed by Levin et al. [42]. The objective is to find an interpolant $E$ whose values $E(r)$ and $E(s)$ at neighboring pixels are similar when $I(r)$ and $I(s)$ are alike. Formally, we minimize the following functional

$$J(E) = \sum_{r} \Big( E(r) - \sum_{s \in N(r)} w_{rs}\, E(s) \Big)^{2} \quad (1)$$

subject to the constraint

$$E(p) = I(p), \quad \forall p \in S, \quad (2)$$

where $N(r)$ denotes the neighbors of $r$, and the weights

$$w_{rs} \propto \exp\!\Big( -\frac{\big(I(r) - I(s)\big)^{2}}{2\sigma_r^{2}} \Big) \quad (3)$$

are computed using the local variance $\sigma_r^{2}$ around $r$. Here, $r$ and $s$ denote neighboring pixels and $p$ denotes an extremum pixel in $S$; $I(r)$, $I(s)$, and $I(p)$ represent their corresponding intensity values; the envelope values at points $r$, $s$, and $p$ are denoted by $E(r)$, $E(s)$, and $E(p)$, respectively; and $\sigma_r^{2}$ is the variance of the intensities within a window centered at $r$. We adopt the approach of Levin et al. [42] and minimize the quadratic functional using their weighted least squares formulation.
Calculate coarse layer. The envelope is constructed independently for the minima and maxima of the image, yielding the minimal and maximal envelopes, respectively. The coarse layer is then obtained by averaging these two envelopes, as expressed in Equation (4):

$$C = \frac{E_{\max} + E_{\min}}{2}, \quad (4)$$

where $E_{\max}$ and $E_{\min}$ represent the maximal and minimal envelopes, and $C$ represents the coarse layer.
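The three LE steps can be sketched on a 1D signal, in the spirit of the 1D slices used for illustration. This is a simplified sketch under stated assumptions, not the implementation used in this work: extrema are taken as strict window maxima and minima over `k` samples rather than with the 2D counting rule, and plain linear interpolation (`np.interp`) stands in for the weighted least squares interpolation of Levin et al. [42]. The function names are hypothetical.

```python
import numpy as np


def local_extrema_1d(signal, k=3):
    """Step 1: indices of samples that are the max (or min) of a k-sample window."""
    n = len(signal)
    half = k // 2
    maxima, minima = [], []
    for p in range(n):
        window = signal[max(0, p - half):min(n, p + half + 1)]
        if signal[p] >= window.max():
            maxima.append(p)
        if signal[p] <= window.min():
            minima.append(p)
    return np.array(maxima), np.array(minima)


def le_decompose_1d(signal, k=3):
    """Steps 2 and 3: envelopes, coarse layer (their mean), and residual detail."""
    signal = np.asarray(signal, dtype=np.float64)
    x = np.arange(len(signal))
    maxima, minima = local_extrema_1d(signal, k)
    # Assumption: linear interpolation replaces the edge-aware interpolant.
    upper = np.interp(x, maxima, signal[maxima])
    lower = np.interp(x, minima, signal[minima])
    coarse = 0.5 * (upper + lower)
    detail = signal - coarse
    return coarse, detail, upper, lower
```

For an oscillation riding on a slow trend, the coarse layer follows the trend while the oscillation ends up in the detail layer; increasing `k` skips closely spaced extrema, so only longer-wavelength oscillations are treated as detail.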
3.1.2. Scale-Adaptive Pyramid Decomposition
The scale-adaptive pyramid construction is characterized by two key aspects: (1) the adaptive number of pyramid levels, i.e., the decomposition levels, and (2) the adaptive kernel size of the LE algorithm. Applying the LE algorithm to the input image extracts a coarse layer. Furthermore, by iteratively downsampling the image, an LE pyramid can be constructed. Specifically, the $(l+1)$-th level is generated by down-sampling the $l$-th level after it has been processed by the LE filter:
where
denotes a down-sampling operation by a factor of 2, ensuring that each successive level represents a coarser scale of the original signal. The LE component at each level is defined as:
where
I represents the input image,
k is the computation kernel in the LE algorithm, and
l indicates the level. In the context of the Laplacian pyramid, the
l-th level of
can be derived using the Equation (
7),
where
represents upsampling the image. In this study, the upsampling operation is implemented using bilinear interpolation to recover the spatial dimensions of the coarser layers for feature integration. Further implementation details of the pyramid structure can be found in [43].
In Subr et al.’s [
24] work, the number of decomposition levels is manually specified, and the kernel size is uniformly incremented by a constant value of 8 for each iteration within the multi-scale image decomposition process. This approach introduces uncertainty regarding the extent of image decomposition and increases the computational complexity. Inspired by the methods of Li et al. [
44] and Kou et al. [
35], our proposed method defines the total number of decomposition layers, denoted as
L, based on a systematic criterion
where $\lfloor \cdot \rfloor$ denotes rounding down, $W$ and $H$ are the width and height of the input image, and $T$ is the initial kernel size, which is set to 3 following [24]. Equation (8) establishes an explicit relationship between the number of decomposition levels and the spatial resolution of the input image. Unlike conventional methods that employ a fixed number of layers, this formulation ensures that larger images are decomposed into deeper pyramid structures, enabling more effective multi-scale representation. Consequently, the decomposition depth can adapt to different image sizes without manual tuning.
As the image is iteratively down-sampled, its spatial resolution decreases, meaning that a fixed-size kernel will cover a progressively larger relative area of the original scene. To maintain a consistent receptive field and effectively capture features across different scales, the kernel size should be adjusted according to the decomposition level. In our framework, the kernel size is slightly increased in deeper layers to ensure that structural information at coarser scales is adequately characterized without being overlooked due to the reduced resolution. Thus, the filter size at each scale is adaptive rather than a fixed constant, as expressed in the following equation:
where $k_l$ represents the kernel size at the $l$-th scale. Equation (9) further introduces a scale-dependent kernel adaptation mechanism. As the pyramid level increases, the image becomes progressively smoother due to downsampling and filtering. Accordingly, larger kernels are employed at coarser scales to better capture structural information while suppressing noise. This design maintains consistency of feature representation across scales and avoids the limitations of fixed kernel sizes.
Unlike existing pyramid decomposition methods that typically rely on fixed or empirically tuned parameters, the proposed strategy provides an explicit and unified parameterization driven by image size and pyramid scale. This design eliminates the uncertainty introduced by manual parameter selection and improves the robustness and generalization ability of the fusion framework across images with diverse resolutions and scene characteristics.
3.2. Multi-Type Pyramid Fusion
Applying the adaptive pyramid decomposition described above to each input image yields both an LE pyramid and a Laplacian pyramid. We then fuse each pair of corresponding layers from the optical and SAR pyramids to form a fused pyramid.
3.2.1. LE Pyramid Fusion
The LE pyramid contains the low-frequency components of an image, representing the global intensity distribution and background information. Optical and SAR images have distinct characteristics in their coarse layers. Optical images capture visible light and provide rich spectral information, making them ideal for identifying colors, textures, and fine details. In contrast, SAR images use microwave signals to penetrate clouds and darkness, offering strong capabilities in detecting surface structures and terrain features, but often with lower resolution and more noise. The coarse layer of an optical image typically contains smooth and continuous information, while the coarse layer of an SAR image tends to emphasize structural information.
We employ a simple weighted-averaging fusion rule for each layer, as described by Equation (10):

$$F^{LE}_{l} = w_{1}\, LE^{O}_{l} + w_{2}\, LE^{S}_{l}, \quad (10)$$

where $l$ denotes the $l$-th level, whose maximum value is $L$; $LE^{O}_{l}$ and $LE^{S}_{l}$ denote the LE pyramid images of the optical and SAR inputs at the $l$-th layer; and $F^{LE}_{l}$ represents the fusion result of the optical and SAR images at the $l$-th layer in the LE pyramid. The weighted-averaging rule is chosen for its simplicity, computational efficiency, and ability to preserve the complementary strengths of both modalities. It effectively integrates the spectral richness of optical images with the structural robustness of SAR images, while mitigating noise through averaging. In this study, the weights $w_{1}$ and $w_{2}$ are set to retain shared features and minimize redundancy.
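The layer-wise weighted-averaging rule amounts to a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and the equal weights of 0.5 are placeholder defaults, not the values used in this study.

```python
import numpy as np


def fuse_le_pyramids(le_opt, le_sar, w_opt=0.5, w_sar=0.5):
    """Weighted average of corresponding LE pyramid layers.

    le_opt, le_sar: lists of 2D arrays, one per level, with matching shapes.
    w_opt, w_sar: scalar weights (placeholders; equal weighting is an assumption).
    """
    if len(le_opt) != len(le_sar):
        raise ValueError("both pyramids must have the same number of levels")
    return [w_opt * o + w_sar * s for o, s in zip(le_opt, le_sar)]
```

Since the rule is linear and applied independently per level, its cost grows only with the total number of pyramid pixels.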
3.2.2. Laplacian Pyramid Fusion
The Laplacian pyramid represents the high-frequency components of an image. In optical imaging, high-frequency components capture micro-textural variations (e.g., foliated surfaces, architectural details, pavement patterns) while containing stochastic/quantization noise. SAR imagery exhibits high-frequency signatures at abrupt intensity transitions (target edges, geometric contours) and terrain-sensitive textures (urban structures, vegetation gradients), coupled with inherent speckle noise from coherent scattering. Laplacian pyramid fusion necessitates adaptive enhancement of modality-specific high-frequency features alongside hierarchical noise suppression through multi-scale decomposition.
Given the importance of preserving these high-frequency details for fusion tasks, the PAPCNN [41] is employed. Although the PAPCNN was originally designed for medical images, its bionic pulse-coupling mechanism is inherently effective for remote sensing. The linking mechanism allows neurons to fire based on spatial correlation. Since SAR speckle noise is spatially uncorrelated, while physical structures are spatially continuous, the network can selectively enhance meaningful textures while suppressing isolated noise spikes. This makes it a suitable choice for the Laplacian pyramid fusion.
The absolute pixel values at each level of the Laplacian pyramid of each source image are fed into the network as inputs. Furthermore, the firing times are accumulated by incorporating an iterative step at the end of each iteration to improve fusion accuracy:

$$T_{ij}[n] = T_{ij}[n-1] + Y_{ij}[n]. \quad (11)$$

Here, the firing time of each neuron is denoted as $T_{ij}[N]$, where $(i,j)$ denotes the position of the neuron, $N$ represents the total number of iterations, $n$ denotes the current iteration, and $Y_{ij}[n]$ denotes the output of the $n$th iteration of the PCNN. At each level of the Laplacian pyramid, the firing times for the optical and SAR images $LP^{O}_{l}$ and $LP^{S}_{l}$ are represented as $T^{O}_{l,ij}[N]$ and $T^{S}_{l,ij}[N]$, respectively. The fused result at each level is computed using the following rule:

$$F^{LP}_{l}(i,j) = \begin{cases} LP^{O}_{l}(i,j), & T^{O}_{l,ij}[N] \ge T^{S}_{l,ij}[N], \\ LP^{S}_{l}(i,j), & \text{otherwise}, \end{cases} \quad (12)$$

where $l$ denotes the $l$-th level, whose maximum value is $L$; $F^{LP}_{l}$ represents the fusion result of the optical and SAR images at the $l$-th layer in the Laplacian pyramid; and the coefficient associated with the larger firing time is chosen as the fusion coefficient for the Laplacian pyramid. In the PAPCNN framework, the iteration number $N$ is empirically set to 60 based on ablation studies (see Section 4.4.3). Regarding the model parameters, the linking strength and attenuation coefficient are adaptively determined by the algorithm according to the local statistical characteristics of the input images. This adaptive parameterization, combined with the optimized iteration number $N$, ensures that the PAPCNN can effectively distinguish between structural features and stochastic speckle noise across different scales.
3.3. IHS Method
In this study, the optical image contains three RGB channels, while the SAR image has a single channel. To enable fusion, the multi-channel optical image is first converted into a single channel by extracting its intensity (I) component through the IHS transformation. The I component is then fused with the SAR image across multiple scales. After the fusion process, the result is converted back to the RGB color space using the inverse IHS transformation. While the IHS transform is traditionally employed for multispectral and panchromatic fusion, it has also been widely adopted in optical–SAR fusion studies, demonstrating its effectiveness in cross-modality fusion scenarios [12,13,14,15,18].
The IHS transform [
45] is mathematically expressed as follows:
where
R,
G, and
B denote the red, green and blue channels of the optical image, respectively.
I,
H,
S are the three components after IHS transformation. The inverse IHS transform can be expressed as
where
denotes the new intensity component after processing.
,
,
are the new channels obtained after the IHS inverse transformation.
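The intensity-substitution idea can be sketched in code. The snippet below assumes the common linear IHS model in which intensity is the mean of the three channels; this may differ from the exact transform matrix of [45]. Under that assumption, replacing the intensity and inverting the transform is equivalent to adding the intensity difference to each channel, which is the shortcut used here. Function names are hypothetical.

```python
import numpy as np


def intensity(rgb):
    """Intensity of an H x W x 3 image under the linear model I = (R + G + B) / 3."""
    return rgb.mean(axis=2)


def inject_intensity(rgb, new_intensity):
    """Substitute the intensity while leaving hue and saturation unchanged.

    Assumes the linear IHS model, for which forward transform, intensity
    replacement, and inverse transform reduce to a per-channel additive shift.
    The result is not clipped to the valid display range.
    """
    delta = new_intensity - intensity(rgb)
    return rgb + delta[..., None]
```

In the fusion pipeline, `new_intensity` would be the image reconstructed from the fused pyramid; clipping to the valid range can be applied afterwards if needed.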
4. Experiments
4.1. Experiment Settings
4.1.1. Experiment Data
To evaluate the proposed method, all algorithms are tested on the publicly available YYX-OPT-SAR dataset [
46] and optical–SAR dataset [
47]. YYX-OPT-SAR includes 150 pairs of optical and SAR images, with optical images from Google Earth and SAR images from an unmanned aerial vehicle. These images cover areas around Baicheng City (Jilin Province) and Weinan City (Shaanxi Province), representing three typical land cover types: urban, suburban, and mountainous regions. Each image pair, with a size of
pixels, has been registered.
Ren et al. [
47] publicly released a co-registered optical–SAR dataset, with optical images acquired by the GF-2 satellite and SAR images by GF-3. The dataset covers diverse land cover types, including buildings, farmland, vegetation, water bodies, and roads. For this study, we selected the first 150 optical–SAR image pairs (
pixels each) from the Dongying region of Shandong Province. This selection was made to complement the first dataset (YYX-OPT-SAR), which primarily focuses on mountainous and urban terrains. By including the Dongying region—a typical plain area—we can more comprehensively evaluate the adaptability and robustness of the proposed fusion algorithm across diverse topographical conditions.
4.1.2. Compared Methods
We compare our method with seven existing approaches spanning three categories: EPF-based MSD (MLGCF [23], MSGRGF [22]); saliency-based VSFF [12]; and general deep learning-based methods (IFCNN [36], U2Fusion [37], MUFusion [48], LFDT [49]).
4.1.3. Evaluation Indices
The experimental evaluation incorporates both qualitative and quantitative assessments. Qualitative analysis examines perceptual differences in fused images through visual inspection, while quantitative analysis provides objective performance measurement through five established metrics: entropy (EN) [50], spatial frequency (SF) [51], sum of correlation of differences (SCD) [52], spectral distortion index ($D_\lambda$) [53], and spectral angular mapper (SAM) [54].
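Three of these metrics have compact standard definitions, sketched below with NumPy. The sketch follows the commonly used formulas (Shannon entropy of the 8-bit gray-level histogram, root of the summed squared row and column frequencies, and the mean angle between per-pixel spectral vectors); normalization details may differ from the implementations in [50,51,54], and the function names are hypothetical.

```python
import numpy as np


def entropy(img_u8):
    """EN: Shannon entropy (bits) of the gray-level histogram of an 8-bit image."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())


def spatial_frequency(img):
    """SF: sqrt(RF^2 + CF^2) from horizontal and vertical first differences."""
    img = np.asarray(img, dtype=np.float64)
    rf2 = np.mean((img[:, 1:] - img[:, :-1]) ** 2)
    cf2 = np.mean((img[1:, :] - img[:-1, :]) ** 2)
    return float(np.sqrt(rf2 + cf2))


def spectral_angle_mapper(ref, fused, eps=1e-12):
    """SAM: mean angle (radians) between per-pixel spectral vectors."""
    a = np.asarray(ref, dtype=np.float64).reshape(-1, ref.shape[-1])
    b = np.asarray(fused, dtype=np.float64).reshape(-1, fused.shape[-1])
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    )
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Higher EN and SF indicate more information and sharper detail, while a lower SAM indicates less spectral distortion; SAM is invariant to a uniform brightness scaling of the fused image.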
4.2. Result Analysis of YYX-OPT-SAR Dataset
The dataset used in this study consists of three land cover types. For each land cover type, one image is selected for comparative display. To facilitate comparison, the detailed regions of the fusion results are enlarged and highlighted with red rectangular boxes, as shown in Figure 5.
The first and second rows show the fusion results for urban land cover images. Due to the limited grayscale information in SAR images, all methods except VSFF and LFDT fail to effectively preserve the optical image’s color, resulting in a generally darker tone. Specifically, the roof color in the IFCNN and MUFusion fusion results is distorted, and U2Fusion produces color artifacts. While MLGCF and MSGRGF show some improvement in color preservation, they do not effectively retain local energy from the SAR images. VSFF and LFDT fusion preserves some tree shadows from the SAR image but introduces halo artifacts, indicating insufficient edge preservation. In contrast, the proposed method retains both the optical image’s color and the SAR image’s local energy while demonstrating superior edge preservation.
The third and fourth rows show the fusion results for suburban areas. The input images clearly demonstrate that optical and SAR images provide complementary information. While all methods except VSFF and LFDT effectively integrate this information, they fail to preserve the optical image’s color. Although VSFF produces better visual effects, it does not adequately fuse the complementary information. In contrast, the proposed method effectively retains the complementary features of both optical and SAR images.
The fifth and sixth rows show the fusion results for mountain land cover images. Optical images provide rich spectral information, while SAR images capture detailed texture. Due to SAR’s side-looking imaging mechanism, significant elevation changes in the terrain cause higher grayscale values on the slope-facing side. Thus, preserving SAR’s texture information is crucial during fusion. The results show that the MSGRGF, VSFF, and LFDT methods are less effective at preserving the edges of terraced fields. In contrast, the proposed method demonstrates stronger edge preservation, effectively retaining the SAR texture details.
Table 1 presents a quantitative comparison of different fusion methods across three land cover types in the YYX-OPT-SAR dataset. The values represent the average score for each method across the fusion tasks. Red indicates the best value among the eight methods, green denotes the second-best, and blue represents the third-best. ↑ and ↓ indicate the higher the better and the lower the better, respectively.
Overall, among the eight methods, our approach ranks in the top three across all metrics for the three fusion tasks, demonstrating its robustness and broad applicability. Specifically, the EN metric achieved top-ranking performance in urban and mountain scenarios, while securing second place in suburban areas. The SF metric demonstrated comparable performance to the VSFF method overall. For the $D_\lambda$ and SAM metrics, our method maintains competitive performance within the top tier. Specifically, in the urban and suburban scenarios, both metrics rank within the top three (with the SAM index in the suburban task achieving the second-best value). In the mountainous terrain, although $D_\lambda$ and SAM show a slight decrease in ranking compared to structural metrics, they still remain within a reasonable and effective range.
Notably, the SCD index consistently ranked first across all three fusion tasks with a significant margin. This superiority stems from the fact that our scale-adaptive LE decomposition effectively extracts complementary features from the heterogeneous source images, ensuring the fused product maintains a high degree of informational correlation with the inputs.
Regarding the $D_\lambda$ and SAM metrics, which mostly ranked third, this performance reflects a deliberate trade-off between structural injection and spectral preservation. In complex scenes like urban and mountain areas, our method prioritizes the integration of salient geometric structures from SAR (evidenced by the top-tier SF and EN scores), which inevitably introduces minor spectral deviations. However, these values remain highly competitive, striking a reasonable balance for practical interpretation.
4.3. Result Analysis of Optical–SAR Dataset
The fusion results on the optical–SAR dataset are qualitatively compared in Figure 6. The bridge fusion results in Figure 6 (Rows 1–2) reveal that VSFF fails to adequately preserve SAR intensity characteristics, while IFCNN and MUFusion exhibit spectral distortion. Although other methods maintain both spectral and SAR intensity profiles, our approach demonstrates superior detail preservation. The roof fusion comparison in Figure 6 (Rows 3–4) demonstrates that while all methods preserve roof coloration effectively, our approach shows superior retention of SAR ground intensity characteristics. Furthermore, edge preservation analysis confirms that our method maintains significantly richer structural details than the other techniques, capturing the roof and ground-edge intensity profiles of the SAR imagery particularly well. The field boundary fusion results in Figure 6 (Rows 5–6) reveal that while both LFDT and our method excel in edge preservation and information integration, MUFusion shows compromised texture despite maintaining intensity, and IFCNN exhibits spectral distortion. Our approach preserves all critical features (structural, textural, and spectral), demonstrating comprehensive fusion superiority.
The quantitative comparison results on the optical–SAR dataset are presented in
Table 2. Among the five evaluation metrics, our method achieved top rankings in EN, SF, and SCD. While it demonstrated slightly inferior performance to VSFF in
and SAM, it still ranked second in SAM. Its continued lead in EN, SF, and SCD on the optical–SAR dataset reinforces the strength of the PAPCNN-based high-frequency fusion strategy. The slightly inferior ranking in
compared to VSFF is due to the inherent sensitivity of
to intensity variations when SAR structural extrema are heavily injected. Nevertheless, the overall ranking demonstrates that the proposed framework maintains robust spectral–spatial balance even in varied sensor configurations.
Based on experimental results from the two datasets, the overall performance of general deep learning-based fusion methods remains suboptimal, which can be attributed to several factors. First, these methods do not explicitly model the modality differences between optical (spectral-focused) and SAR (structure- and texture-focused) images, potentially leading to inadequate integration of complementary information or the introduction of interference. Second, despite leveraging deep architectures, they lack semantic guidance and typically perform pixel- or feature-level fusion using generic representations, failing to capture class-discriminative features essential for land cover classification. Third, most models are optimized for image reconstruction or perceptual quality rather than classification performance, leading to the loss of task-relevant information. Finally, limited training data and the absence of remote sensing-specific regularization reduce their generalizability, particularly in complex or dynamic land cover scenarios.
To further validate the proposed method, both qualitative and quantitative analyses were performed. Qualitatively, visual comparisons confirm these results, showing sharper boundaries, finer structures, and more natural spectral appearance. Quantitatively, local extrema decomposition enhances EN and SF by accurately extracting textures across contrasts, enriching gradients and preserving edges. The scale-adaptive pyramid with local extrema constraints reduces SCD and by suppressing cross-modal mixing and ensuring spectral consistency, while oversmoothing avoidance improves SAM. Overall, the method achieves a balanced integration of spatial detail and spectral fidelity, explaining consistent metric improvements and generalizability.
4.4. Ablation Study
4.4.1. Effectiveness of Parameter-Adaptive Strategy
To demonstrate the effectiveness of the proposed parameter-adaptive approach compared to conventional fixed-parameter settings, two comparative experiments were designed using a control-variable method: (1) fixing the decomposition level
[
13,
22,
55] and setting a constant kernel size
and (2) fixing a constant
while varying
. These fixed-parameter configurations serve as the control groups to be compared against our adaptive method. A comparative evaluation based on the EN metric is presented in
Figure 7.
As shown in Figure 7, the proposed adaptive method generally outperforms the fixed-parameter fusion approach in both experimental groups. This result indicates that the adaptive k (Equation (9)) and adaptive L (Equation (8)) adjust to the image resolution and content, achieving a better balance of information extraction than a single fixed combination. In the first group, our method achieves higher EN values in urban areas than the other configurations, although it shows slightly lower performance in suburban and mountainous regions. In the second experimental group, our approach achieves the best EN across all three terrain types. The results also indicate that kernel size significantly influences EN, with larger kernels generally improving EN at the cost of increased computation. However, the kernel size must remain smaller than the dimensions of the corresponding layer image, so it cannot be enlarged indefinitely, and an excessively large kernel does not necessarily improve the result.
4.4.2. Influence of Color Space Transformation
To evaluate the impact of different color space transformations, we compared our IHS-based framework with three alternative configurations: (1) Direct Intensity, where the optical image is converted to a grayscale image using a weighted combination of RGB channels, fused with the SAR image, and subsequently projected back into the RGB space via the Brovey transform; (2) YCbCr, fusing the luminance (Y) channel; and (3) HSV, fusing the value (V) channel. While the color space transformation was varied, all other components of the proposed framework remained unchanged.
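The Direct Intensity configuration can be sketched as follows. The grayscale weights are the common BT.601 luma coefficients, chosen here only as an assumption because the exact weights are not specified, and the ratio-based back-projection is one common form of the Brovey transform. The function names are illustrative.

```python
import numpy as np

def direct_intensity(rgb, weights=(0.299, 0.587, 0.114)):
    """Weighted RGB-to-grayscale conversion (weights are an assumption)."""
    return np.asarray(rgb, dtype=np.float64) @ np.asarray(weights)

def brovey_back_projection(rgb, fused_intensity, eps=1e-6):
    """Rescale each RGB band by the ratio of fused to original intensity."""
    rgb = np.asarray(rgb, dtype=np.float64)
    intensity = direct_intensity(rgb)
    # eps avoids division by zero in dark pixels.
    ratio = fused_intensity / np.maximum(intensity, eps)
    return rgb * ratio[..., None]
```

Because every band is multiplied by the same per-pixel ratio, large local changes in the fused intensity scale all three channels at once, which is one way such a back-projection can produce saturated or clipped colors.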
The results are shown in Figure 8 and Table 3. Visually (Figure 8), the Direct Intensity and HSV methods exhibit localized color artifacts (e.g., blue/red patches) due to spectral distortions during the back-projection process. While the YCbCr result is visually comparable to the proposed method in terms of color fidelity, quantitative analysis (Table 3) reveals a clear performance gap. The proposed IHS-based method achieves the highest entropy (EN = 6.8233) and spatial frequency (SF = 24.6603), clearly outperforming YCbCr (SF = 22.7192). This suggests that the IHS transformation is better matched to our hierarchical fusion rules, allowing stronger injection of SAR structural details while maintaining natural color fidelity.
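For readers unfamiliar with the two information-based metrics, the sketch below follows their usual definitions: EN as the Shannon entropy of the 8-bit gray-level histogram, and SF as the root of the summed squared row and column frequencies. It is an illustration under those assumptions, not the evaluation code behind Table 3.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (bits) of an 8-bit grayscale image."""
    hist = np.bincount(np.asarray(img, dtype=np.uint8).ravel(), minlength=bins)
    p = hist[hist > 0] / hist.sum()   # drop empty bins before taking the log
    return float(-(p * np.log2(p)).sum())

def spatial_frequency(img):
    """Spatial frequency from horizontal and vertical first differences."""
    img = np.asarray(img, dtype=np.float64)
    # Row frequency: differences between horizontally adjacent pixels.
    rf2 = (np.diff(img, axis=1) ** 2).sum() / img.size
    # Column frequency: differences between vertically adjacent pixels.
    cf2 = (np.diff(img, axis=0) ** 2).sum() / img.size
    return float(np.sqrt(rf2 + cf2))
```

Both quantities grow with gradient activity regardless of whether that activity is structure or noise, so they are best read alongside the spectral fidelity metrics.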
4.4.3. Sensitivity Analysis of the PAPCNN Iteration Count N
The number of iterations N in the PAPCNN is a key parameter that influences the extraction of structural features. To evaluate its impact, we tested N values ranging from 40 to 100. As shown in Figure 9, the evaluation metrics for different terrain types (urban, mountain, and suburban) stabilize as N increases.
Specifically, the spectral fidelity metrics (SAM and ) and information-based metrics (EN and SF) show negligible fluctuations for N ≥ 50. The structural correlation metric (SCD) reaches a plateau at N = 60. This convergence behavior indicates that the PAPCNN effectively captures the salient features of the source images within 60 iterations. Continuing to increase N beyond this point would lead to unnecessary computational overhead without further enhancing fusion quality. Consequently, N = 60 is determined as the standard setting for all experiments to ensure a balance between performance and efficiency.
4.4.4. Effectiveness of Local Extrema Pyramid Fusion Rules
To evaluate the effectiveness of the proposed weighted-averaging rule for LE pyramid fusion, we conducted an ablation study by replacing it with a hybrid strategy combining weighted local energy (WLE) and weighted sum of eight-neighborhood-based modified Laplacian (WSEML) [13,15]. The comparative experimental results are presented in Figure 10 and Table 4.
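The WLE and WSEML activity measures used in this ablation can be sketched as follows. The 3 × 3 weighting window, the edge-replication padding, and the product-based selection rule follow a formulation commonly seen in the literature and are assumptions here; the exact variant used in [13,15] may differ, and all function names are illustrative.

```python
import numpy as np

# 3x3 weighting window (assumed; sums to 1).
W = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64) / 16.0

def _window_sum(x, kernel):
    """Weighted 3x3 neighborhood sum with edge replication."""
    h, w = x.shape
    p = np.pad(x, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * p[di:di + h, dj:dj + w]
    return out

def wle(x):
    """Weighted local energy: windowed sum of squared coefficients."""
    x = np.asarray(x, dtype=np.float64)
    return _window_sum(x * x, W)

def wseml(x):
    """Weighted sum of the eight-neighborhood modified Laplacian."""
    x = np.asarray(x, dtype=np.float64)
    h, w = x.shape
    p = np.pad(x, 1, mode="edge")

    def shift(di, dj):
        return p[1 + di:1 + di + h, 1 + dj:1 + dj + w]

    c = shift(0, 0)
    eml = (np.abs(2 * c - shift(-1, 0) - shift(1, 0))
           + np.abs(2 * c - shift(0, -1) - shift(0, 1))
           + np.abs(2 * c - shift(-1, -1) - shift(1, 1)) / np.sqrt(2)
           + np.abs(2 * c - shift(-1, 1) - shift(1, -1)) / np.sqrt(2))
    return _window_sum(eml, W)

def fuse_by_activity(a, b):
    """Per pixel, keep the source with the larger WLE * WSEML activity."""
    return np.where(wle(a) * wseml(a) >= wle(b) * wseml(b), a, b)
```

Since the modified Laplacian responds to any second-order variation, a hard selection driven by it tends to favor whichever source has the stronger local gradients.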
As indicated in Table 4, the WLE and WSEML rules achieve higher SF scores, but this is primarily due to the Laplacian operator’s aggressive amplification of local gradients. Although such operators increase visual sharpness, they risk introducing spectral artifacts and amplifying SAR speckle noise.
In contrast, our weighted-averaging rule prioritizes radiometric and spectral integrity, yielding superior scores in SCD,
, and SAM. According to the Wald protocol [56], preserving the sensors’ physical characteristics is the primary criterion for fusion quality. In optical–SAR fusion, high EN and SF often represent “pseudo-gain” where noise is misinterpreted as information [52]. Our approach acts as a conservative integrator, ensuring high-frequency injection remains within physically plausible bounds to provide a reliable foundation for subsequent analysis.
Furthermore, the observed trade-off suggests that a hybrid strategy—applying conservative averaging to low-frequency layers while utilizing adaptive weights for high-frequency details—could further reconcile structural clarity with spectral consistency, which will be explored in future work.
4.4.5. Effectiveness of Laplacian Pyramid Fusion Rules
To verify the effectiveness of the PAPCNN in high-frequency fusion, we replace it with the max-absolute (max-abs) selection rule [13,14] while keeping the rest of the framework unchanged. As shown in Figure 11, the max-abs strategy introduces noticeable noise and artifacts, whereas the proposed method preserves clearer structures with better visual consistency.
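The max-abs baseline is simple enough to state in a few lines. The sketch below shows the rule applied to one pair of high-frequency (detail) layers; the function name `max_abs_fuse` is illustrative.

```python
import numpy as np

def max_abs_fuse(detail_a, detail_b):
    """Per pixel, keep the detail coefficient with the larger magnitude."""
    detail_a = np.asarray(detail_a, dtype=np.float64)
    detail_b = np.asarray(detail_b, dtype=np.float64)
    # Ties are resolved in favor of the first source.
    return np.where(np.abs(detail_a) >= np.abs(detail_b), detail_a, detail_b)
```

Because the decision is made independently at each pixel and depends only on magnitude, an isolated speckle spike in the SAR detail layer wins over a weaker but meaningful optical coefficient.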
The quantitative results in
Table 5 further confirm this observation. Compared with max-abs, the proposed method improves EN (6.7479→6.8233), SF (20.6155→24.6603), and SCD (1.6214→1.6868), while reducing
(0.0300→0.025) and SAM (0.3721→0.3302). These results demonstrate that PAPCNN achieves a better balance between detail enhancement and noise suppression, outperforming the max-abs rule.
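The relative changes implied by these Table 5 values can be checked with a few lines of arithmetic. The sketch uses only the EN, SF, SCD, and SAM figures quoted in the text; the helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Signed percentage change from the max-abs value to the PAPCNN value."""
    return 100.0 * (after - before) / before

# (max-abs, proposed) pairs as quoted in the text for Table 5.
table5 = {
    "EN": (6.7479, 6.8233),
    "SF": (20.6155, 24.6603),
    "SCD": (1.6214, 1.6868),
    "SAM": (0.3721, 0.3302),
}

changes = {name: relative_change(b, a) for name, (b, a) in table5.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

This gives roughly +1.1% for EN, +19.6% for SF, +4.0% for SCD, and −11.3% for SAM, so the largest gains from replacing max-abs with PAPCNN are in spatial frequency and spectral angle.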
4.4.6. Analysis of the Complementarity Between LE and Laplacian Pyramids
To verify the complementary roles of the LE and Laplacian pyramids, we conducted an ablation study by comparing the proposed dual-pyramid framework with two single-pyramid variants. The quantitative results and visual comparisons are presented in Table 6 and Figure 12, respectively.
From a quantitative perspective (
Table 6), the “LE + Laplacian pyramid” scheme achieves the best performance across almost all metrics, including EN, SF, SCD, and
. Specifically, the combination significantly boosts the spatial frequency (SF) from 15.6182 (Laplacian only) and 12.1003 (LE only) to 24.6603, indicating a much higher capacity for capturing and retaining fine-grained structural details. While the “only Laplacian pyramid” scheme shows a slightly lower SAM value, it comes at the cost of significant information loss in other dimensions.
The qualitative analysis in Figure 12 further illustrates the necessity of the dual-pyramid structure. The “only Laplacian pyramid” result appears noticeably dark and suffers from low contrast, failing to effectively integrate the radiometric information of the optical image. Conversely, the “only LE pyramid” result is over-smoothed and exhibits a “whitening” effect, where sharp edges and fine textures from the SAR image are blurred. In contrast, the “LE + Laplacian pyramid” result achieves a better balance: it retains the sharp structural features of the Laplacian decomposition while maintaining the rich energy and radiometric stability provided by the LE pyramid. This complementarity makes the fused image both spatially sharp and radiometrically natural, supporting our design of the hybrid pyramid framework.
4.4.7. Runtime Analysis of Fusion Methods
Table 7 presents the average runtimes of the fusion methods evaluated on the datasets, all executed on an Intel Core i7-12700H CPU. For traditional fusion methods, theoretical time complexity is reported to reflect their algorithmic characteristics, while for deep learning-based methods, the computational cost is quantified in GFLOPs, which are commonly used to measure model-level complexity during inference.
As illustrated in
Table 7, the proposed method exhibits an average runtime of 4.8873 s with a theoretical complexity of
. This performance is primarily due to the iterative nature of the LE decomposition, which requires
K iterations to construct stable upper and lower envelopes for high-fidelity edge preservation. The proposed method was implemented in MATLAB 2021b, and the reported runtime reflects a single-threaded, unoptimized implementation. Given the high degree of pixel-level parallelism in envelope construction, an optimized implementation in a compiled language (e.g., C++ or CUDA) is estimated to yield a speedup of 10 to 30 times (approximately 10× from compilation and 4–8× from multi-core parallelization). For high-precision image fusion, we consider this computational cost an acceptable trade-off for maintaining radiometric consistency and suppressing spectral artifacts.