2.1. SFIM
SFIM [14] is a high-fidelity image fusion method for processing optical remote sensing images; it can divide the image to be processed into multiple blocks for parallel computation and is therefore suitable for on-board data processing on satellites. The classic SFIM model is defined in Equation (1):

$$\mathrm{DN}_{\mathrm{sim}}(\lambda)_{\mathrm{high}} = \frac{\mathrm{DN}(\lambda)_{\mathrm{low}} \times \mathrm{DN}(\gamma)_{\mathrm{high}}}{\mathrm{DN}(\gamma)_{\mathrm{mean}}} \tag{1}$$
where $\mathrm{DN}(\lambda)_{\mathrm{low}}$ is the DN value of a low-resolution image with a wavelength of $\lambda$, and $\mathrm{DN}(\gamma)_{\mathrm{high}}$ is the DN value of a high-resolution image with a wavelength of $\gamma$. $\mathrm{DN}_{\mathrm{sim}}(\lambda)_{\mathrm{high}}$ is the simulated high-resolution pixel corresponding to $\mathrm{DN}(\lambda)_{\mathrm{low}}$, and $\mathrm{DN}(\gamma)_{\mathrm{mean}}$ is the local average of $\mathrm{DN}(\gamma)_{\mathrm{high}}$ in a neighborhood equivalent to the resolution of $\mathrm{DN}(\lambda)_{\mathrm{low}}$. Each DN value can be written as the product of the surface reflectance $\rho$ and the solar irradiance $E$; if the solar radiation is given and constant, the surface reflectance depends only on the terrain. If two images are quantized to the same DN value range and have the same resolution, it is assumed that $E(\gamma)_{\mathrm{high}} = E(\gamma)_{\mathrm{mean}}$ [28], so that the irradiance terms cancel each other out. Meanwhile, because the surface reflectance of images with different resolutions does not change much, it is assumed that $\rho(\lambda)_{\mathrm{low}} = \rho(\gamma)_{\mathrm{mean}}$, so that the reflectance terms cancel each other out. Equation (1) is thus transformed into Equation (2):

$$\mathrm{DN}_{\mathrm{sim}}(\lambda)_{\mathrm{high}} = \rho(\gamma)_{\mathrm{high}} \, E(\lambda)_{\mathrm{low}} \tag{2}$$
For panchromatic–multispectral fusion, Equation (2) is simplified as Equation (3):

$$\mathrm{Fused} = \mathrm{MS}_{\mathrm{up}} \times \frac{\mathrm{PAN}}{\mathrm{PAN}_{\mathrm{low}}} \tag{3}$$

In the above formula, $\mathrm{MS}$ is a multispectral image, $\mathrm{PAN}$ is a panchromatic image, $\mathrm{MS}_{\mathrm{up}}$ is the MS image upsampled to the resolution of the PAN image, $\mathrm{PAN}_{\mathrm{low}}$ is a low-resolution panchromatic image, and $\mathrm{Fused}$ is the fusion result. The ratio between $\mathrm{PAN}$ and $\mathrm{PAN}_{\mathrm{low}}$ preserves only the edge details of the high-resolution image while essentially eliminating its spectral and contrast information.
The classic SFIM method performs poorly when fusing panchromatic and multispectral images of different scales because, in the degradation process, the convolution kernel (a mean kernel or an improved Gaussian kernel) must be supplied in advance. Consequently, different convolution kernels have to be set for different satellites, and a single fixed kernel cannot properly filter out the spatial information of remote sensing images at different scales, which leads to blurring of the fused image.
2.2. Method
The aim of the method in this paper is to generate a high-quality fusion result by obtaining a low-resolution panchromatic image that is consistent with the spatial and spectral characteristics of the multispectral image. The improvement focuses on obtaining a downscaled panchromatic image ($\mathrm{PAN}_{\mathrm{ds}}$) that maintains both the spatial information and the spectral features of the multispectral image. During the fusion process, the multispectral image and the low-resolution panchromatic image have to be resampled to consistent sizes. As such, the ideal low-resolution panchromatic image, $\mathrm{PAN}_{\mathrm{ds}}$ (where downsampling is denoted by the subscript ds), should possess spatial characteristics similar to those of the multispectral image. To give the downscaled image a spatial structure similar to that of the MS image, a low-pass filter is necessary to eliminate some of the high-frequency information. Gaussian filtering is selected because its sharpness can be adjusted by controlling the kernel parameters. Based on these improvements, this paper proposes an adaptive iterative filtering fusion method for panchromatic and multispectral images of varying scales. The algorithm can be summarized in the following steps:
Step 1. Calculate the scale ratio of the panchromatic and multispectral images to be fused;
Step 2. Adaptively construct convolution kernels of various scales based on the scale ratio;
Step 3. Use the constructed convolution kernels to iteratively degrade the panchromatic image;
Step 4. Upscale the multispectral and degraded panchromatic images to match the panchromatic scale;
Step 5. Fuse the panchromatic and multispectral images using a ratio-based method.
The algorithm flow of this paper is shown in Figure 1.
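As a concrete illustration, the five steps can be wired together in a minimal NumPy sketch. This is not the paper's implementation: the helper names (`blur`, `resize`, `adaptive_sfim`), the 3σ kernel truncation radius, and the σ rule for the residual layer are assumptions; only the overall flow (ratio, kernel schedule, iterative degradation, bilinear resampling, ratio fusion) follows the text.

```python
import numpy as np

def gauss1d(sigma):
    """1-D Gaussian taps; the 3-sigma truncation radius is an assumption."""
    r = max(1, int(round(3 * sigma)))
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur with edge replication."""
    k = gauss1d(sigma)
    r = len(k) // 2
    p = np.pad(img, ((r, r), (0, 0)), mode="edge")
    img = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, p)
    p = np.pad(img, ((0, 0), (r, r)), mode="edge")
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, p)

def resize(img, out_h, out_w):
    """Bilinear resize to an arbitrary (possibly non-integer-scale) grid."""
    h, w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    return (img[np.ix_(y0, x0)] * (1 - wy) * (1 - wx) +
            img[np.ix_(y0, x1)] * (1 - wy) * wx +
            img[np.ix_(y1, x0)] * wy * (1 - wx) +
            img[np.ix_(y1, x1)] * wy * wx)

def adaptive_sfim(pan, ms, eps=1e-6):
    """pan: (H, W); ms: (bands, h, w). Returns the fused cube at PAN scale."""
    H, W = pan.shape
    ratio = H / ms.shape[1]                      # Step 1: scale ratio
    n = int(np.floor(np.log2(ratio)))            # Step 2: kernel schedule
    sigmas, factors = [1.6] * n, [2.0] * n
    if 2.0 ** n != ratio:                        # extra layer if not a power of 2
        sigmas.append(1.6 * ratio / 2.0 ** (n + 1))  # assumed residual sigma
        factors.append(ratio / 2.0 ** n)
    pan_ds = pan.astype(float)                   # Step 3: iterative degradation
    for s, f in zip(sigmas, factors):
        b = blur(pan_ds, s)
        pan_ds = resize(b, int(round(b.shape[0] / f)), int(round(b.shape[1] / f)))
    pan_low = resize(pan_ds, H, W)               # Step 4: upsample to PAN grid
    ms_up = np.stack([resize(band.astype(float), H, W) for band in ms])
    return ms_up * pan / (pan_low + eps)         # Step 5: ratio fusion
```

On spectrally flat inputs the ratio PAN/PAN_low is close to 1 everywhere, so the fused result reduces to the upsampled MS image, which is the behavior the derivation above predicts.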
In Step 1, the scale ratio is determined by examining whether there is geographic information on the input panchromatic and multispectral images. If geographic information is present, the overlapping range of the panchromatic and multispectral images in geographic space is calculated. The overlapping range can then be back-calculated to obtain the pixel coordinates of the panchromatic and multispectral images and their corresponding overlapping areas.

$$(x_1, y_1) = T^{-1}(X_{\min}, Y_{\max}), \qquad (x_2, y_2) = T^{-1}(X_{\max}, Y_{\min}) \tag{4}$$

Here, $(x_1, y_1)$ and $(x_2, y_2)$ are the pixel coordinates of the overlapping region between the two images, obtained by applying the inverse geographic transform $T^{-1}$ of each image to the corners of the geographic overlap $[X_{\min}, X_{\max}] \times [Y_{\min}, Y_{\max}]$. $(x_1, y_1)$ corresponds to the upper left corner of the overlapping area, and $(x_2, y_2)$ corresponds to the lower right corner. Additionally, the scale ratio of the panchromatic image and the multispectral image can be expressed by the following formula:

$$\mathrm{ratio} = \frac{x_2^{\mathrm{PAN}} - x_1^{\mathrm{PAN}}}{x_2^{\mathrm{MS}} - x_1^{\mathrm{MS}}} \tag{5}$$
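The back-calculation from a geographic overlap to pixel coordinates, and the ratio of the resulting pixel widths, can be sketched as follows. The simplified north-up geotransform `(x0, y0, res)` and the function names are assumptions for illustration; real imagery would use the full six-coefficient geotransform.

```python
def pixel_rect(geo, bbox):
    """Map a geographic bbox to pixel corners for one image.

    geo  = (x0, y0, res): top-left geographic corner and square pixel size
           (a simplified north-up geotransform, assumed here).
    bbox = (xmin, ymin, xmax, ymax) in geographic coordinates.
    """
    x0, y0, res = geo
    xmin, ymin, xmax, ymax = bbox
    upper_left = (int((xmin - x0) / res), int((y0 - ymax) / res))   # (x1, y1)
    lower_right = (int((xmax - x0) / res), int((y0 - ymin) / res))  # (x2, y2)
    return upper_left, lower_right

def scale_ratio(pan_geo, ms_geo, bbox):
    """Scale ratio from the pixel widths of the shared overlap bbox."""
    (px1, _), (px2, _) = pixel_rect(pan_geo, bbox)
    (mx1, _), (mx2, _) = pixel_rect(ms_geo, bbox)
    return (px2 - px1) / (mx2 - mx1)
```

For example, a 1 m PAN image and a 4 m MS image covering the same 100 m × 100 m extent yield a ratio of 4.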
The goal of the second step is to create Gaussian convolution kernels of differing scales. To do this, we adapt the construction process of the Gaussian pyramid, which lets us build the required kernels layer by layer. The first step is to calculate the number of convolution kernels, $n$, needed based on the target scale; this value is an integer:

$$n = \left\lfloor \log_2(\mathrm{ratio}) \right\rfloor \tag{6}$$
Following the calculation of the integer number of convolution kernels, $n$, we construct the floating-point quantity, $n_f$, for the convolution kernel scale:

$$n_f = \log_2(\mathrm{ratio}) \tag{7}$$
If $n = n_f$, the difference between scales is exactly a power of 2. In such cases, we can directly construct a multiscale convolution kernel using a traditional Gaussian pyramid. However, if $n$ and $n_f$ are not equal, the scale difference is not a power of 2; in such cases, to construct a multiscale convolution kernel, we need to add one more layer, resulting in $n + 1$ scale layers.
To represent the Gaussian convolution kernel, we use the following equation:

$$G(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{(x - x_c)^2 + (y - y_c)^2}{2\sigma^2}} \tag{8}$$

In the above formula, $(x, y)$ represents the coordinates of any point in the convolution kernel, while $(x_c, y_c)$ represents the coordinates of the kernel's center point. In each of the first $n$ layers, Gaussian convolution kernels with a standard deviation of $\sigma = 1.6$ [29] are used. According to the suggestion of SIFT, $\sigma = 1.6$ achieves optimal results when performing 2-fold downsampling, so the value of 1.6 is chosen in this paper. However, if $n \neq n_f$, a different standard deviation must be estimated for the $(n+1)$-th layer. The estimation method is as follows:

$$\sigma_{n+1} = 1.6 \times \frac{\mathrm{ratio}}{2^{\,n+1}} \tag{9}$$
If $n$ and $n_f$ are equal, the $n$-th layer simply uses a standard deviation of $\sigma = 1.6$. The construction of the convolution kernel used in the last layer is otherwise identical to that of the previous layers, i.e., a Gaussian convolution kernel.
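A discrete kernel built from the Gaussian equation above can be sketched as follows; the 3σ truncation radius and the renormalization of the truncated kernel are implementation assumptions.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 2-D Gaussian kernel; the 3-sigma radius is an assumption."""
    r = radius if radius is not None else max(1, int(round(3 * sigma)))
    ax = np.arange(-r, r + 1, dtype=float)        # offsets (x - xc), (y - yc)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return k / k.sum()                            # renormalize the truncated kernel
```

Normalizing the truncated kernel to unit sum keeps the mean brightness of the blurred image unchanged, which matters for the later ratio-based fusion.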
The third step performs iterative degradation based on the number of layers and the corresponding convolution kernels calculated in the second step. Each layer convolves the image with its kernel, and after the convolution is complete, downsampling is performed to obtain the ideal low-resolution panchromatic image, $\mathrm{PAN}_{\mathrm{ds}}$. Considering both computational efficiency and the downsampling effect, bilinear resampling is adopted as the downsampling method, and the formula is shown as follows:

$$f(x, y) \approx f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1) \tag{10}$$

where $Q_{11} = (x_1, y_1)$, $Q_{21} = (x_2, y_1)$, $Q_{12} = (x_1, y_2)$, and $Q_{22} = (x_2, y_2)$ are the four grid points around the downsampling target point $P = (x, y)$, with unit grid spacing so that $x_2 = x_1 + 1$ and $y_2 = y_1 + 1$. In this way, it is possible to resample images even when the scale ratio is not an integer.
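The four-neighbour formula above can be written directly as a point sampler; the clamping of out-of-range coordinates to the image extent is an implementation assumption.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Value at fractional (x, y) = (column, row) via the four neighbours
    Q11, Q21, Q12, Q22 on a unit grid, as in the equation above."""
    x = min(max(x, 0.0), img.shape[1] - 1.0)   # clamp to the image extent
    y = min(max(y, 0.0), img.shape[0] - 1.0)
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2, y2 = min(x1 + 1, img.shape[1] - 1), min(y1 + 1, img.shape[0] - 1)
    dx, dy = x - x1, y - y1
    return (img[y1, x1] * (1 - dx) * (1 - dy) + img[y1, x2] * dx * (1 - dy) +
            img[y2, x1] * (1 - dx) * dy + img[y2, x2] * dx * dy)
```

Because the sample point may sit at any fractional position, the same routine serves both integer and non-integer scale ratios.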
Next, in the fourth step, the original MS image and the low-resolution panchromatic image, $\mathrm{PAN}_{\mathrm{ds}}$, are upsampled to the PAN scale to obtain $\mathrm{MS}_{\mathrm{up}}$ and $\mathrm{PAN}_{\mathrm{low}}$. Considering computational efficiency and the resampling effect, the same bilinear resampling model described in the previous step is used. By building a Gaussian pyramid in this way, a degraded panchromatic image of the corresponding scale can be obtained.
Finally, the fifth step obtains the fusion image, $\mathrm{Fused}$, using the ratio method:

$$\mathrm{Fused} = \mathrm{MS}_{\mathrm{up}} \times \frac{\mathrm{PAN}}{\mathrm{PAN}_{\mathrm{low}}} \tag{11}$$

This adaptive estimation ensures a smooth transition between convolution layers of different scales, which helps maintain the method's overall performance.
2.3. Quality Indices
To conduct an objective evaluation of the algorithm's performance, this study adopts a reduced-resolution assessment and a full-resolution assessment without reference. The reduced-resolution assessment includes the following four indicators: cross correlation (CC), structural similarity index measure (SSIM), spectral angle mapper (SAM), and erreur relative globale adimensionnelle de synthese (ERGAS). The full-resolution assessment without reference comprises three evaluation metrics: the spectral distortion index ($D_\lambda$), the spatial distortion index ($D_S$), and hybrid quality with no reference (HQNR).
- (1) Cross Correlation

CC represents the spectral similarity between the MS and fused images, with larger values indicating greater similarity between the two. CC is defined in Equation (12), where the subscript $i$ specifies the position of the pixel. The ideal value of CC is 1.

$$\mathrm{CC} = \frac{\sum_i \left(\mathrm{MS}_i - \mu_{\mathrm{MS}}\right)\left(\mathrm{Fused}_i - \mu_{\mathrm{Fused}}\right)}{\sqrt{\sum_i \left(\mathrm{MS}_i - \mu_{\mathrm{MS}}\right)^2 \sum_i \left(\mathrm{Fused}_i - \mu_{\mathrm{Fused}}\right)^2}} \tag{12}$$
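A minimal sketch of the CC computation; the function name and the flattening of the images to vectors are implementation choices.

```python
import numpy as np

def cc(ms, fused):
    """Cross correlation between two images; the ideal value is 1."""
    a = ms.astype(float).ravel() - ms.mean()
    b = fused.astype(float).ravel() - fused.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))
```

Note that CC is invariant to affine rescaling of one image, so it measures correlation rather than absolute agreement.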
- (2) Structural Similarity Index Measure
Structural similarity (SSIM) [30] is used to evaluate the degree of similarity between two images, $x$ and $y$; it has strong spatial interdependence and reflects well the correlation between the structural information of the two images. SSIM is defined as follows:

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)} \tag{13}$$

where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$, respectively, $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$, respectively, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are constants used to maintain stability, where $L$ is the dynamic range of the pixel values and, by default, $k_1 = 0.01$ and $k_2 = 0.03$. The ideal value of SSIM is 1.
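The SSIM equation above can be sketched as a single global computation; note this is a simplification, since standard SSIM averages the index over local sliding windows.

```python
import numpy as np

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM from the equation above (no local windowing)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return (((2 * mx * my + c1) * (2 * cov + c2)) /
            ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

The stabilizing constants $c_1$ and $c_2$ prevent division by near-zero denominators on flat, low-contrast regions.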
- (3) Spectral Angle Mapper

The spectral angle mapper (SAM) [31] is a spectral measure that represents the angle between the reference vector and the processing vector of a given pixel in the spectral feature space of an image, which is defined as

$$\mathrm{SAM} = \arccos\left( \frac{\left\langle \mathbf{MS}_i, \mathbf{Fused}_i \right\rangle}{\left\lVert \mathbf{MS}_i \right\rVert_2 \, \left\lVert \mathbf{Fused}_i \right\rVert_2} \right) \tag{14}$$

where $\langle \mathbf{MS}_i, \mathbf{Fused}_i \rangle$ is the inner product between the fused image and MS at the $i$-th pixel. SAM is calculated as the spectral angle between the MS and fusion vectors of a given pixel, and smaller values of SAM indicate greater similarity between the multispectral and fusion vectors [32]. The ideal value of SAM is 0.
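The per-pixel angle of Equation (14), averaged over all pixels, can be sketched as follows; the `(bands, h, w)` layout, the `eps` guard against zero vectors, and the averaging are implementation assumptions.

```python
import numpy as np

def sam(ms, fused, eps=1e-12):
    """Mean spectral angle in radians; ms and fused are (bands, h, w) cubes."""
    a = ms.reshape(ms.shape[0], -1).astype(float)       # spectra as columns
    b = fused.reshape(fused.shape[0], -1).astype(float)
    dot = np.sum(a * b, axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    return float(np.mean(np.arccos(np.clip(dot / denom, -1.0, 1.0))))
```

Because SAM depends only on the direction of the spectral vectors, a uniform intensity scaling of the fused pixel leaves the angle at 0.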
- (4) Erreur Relative Globale Adimensionnelle de Synthese

The erreur relative globale adimensionnelle de synthese (ERGAS) [33] provides a global indication of the distortion of the test multiband image with respect to the reference. It is defined as

$$\mathrm{ERGAS} = 100\, r \sqrt{\frac{1}{N} \sum_{k=1}^{N} \frac{\mathrm{RMSE}(k)^2}{\mu(k)^2}} \tag{15}$$

where $r$ is the ratio between the pixel sizes of the PAN and MS images, $N$ is the number of bands, $\mathrm{RMSE}(k)$ is the root mean square error of the $k$-th band, and $\mu(k)$ is the average of the $k$-th band of the reference. The ideal value of ERGAS is 0.
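A direct sketch of Equation (15); the `(bands, h, w)` layout is an assumption, and `ratio` is the PAN/MS pixel-size ratio (e.g., 1/4 for 4× pansharpening).

```python
import numpy as np

def ergas(ref, fused, ratio):
    """ERGAS over (bands, h, w) cubes; ratio = PAN/MS pixel-size ratio."""
    terms = []
    for k in range(ref.shape[0]):
        rmse2 = np.mean((ref[k].astype(float) - fused[k].astype(float)) ** 2)
        terms.append(rmse2 / ref[k].mean() ** 2)        # RMSE(k)^2 / mu(k)^2
    return float(100.0 * ratio * np.sqrt(np.mean(terms)))
```

Dividing each band's RMSE by its mean makes the index dimensionless, so bands with different radiometric ranges contribute comparably.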
- 2. The full-resolution assessment without reference evaluates the quality of pansharpened images at the resolution of the PAN image without relying on a reference image; the evaluation is performed using the actual observed images.
- (1) Spectral Distortion Index
The spectral distortion index $D_\lambda^{K}$ [34] of the Khan protocol is defined as

$$D_\lambda^{K} = 1 - Q2^{n}\!\left(\widehat{\mathrm{Fused}}_{\mathrm{LP}}, \mathrm{MS}\right) \tag{16}$$

where $\widehat{\mathrm{Fused}}_{\mathrm{LP}}$ is the fused image low-pass filtered and degraded to the scale of the MS image. $Q2^{n}$ is a multiband extension of the universal image quality index, used for the quality assessment of pansharpened MS images, first for 4 bands and later extended to $2^{n}$ bands [35,36,37]. Each pixel of an image with $N$ spectral bands is placed into a hypercomplex (HC) number with one real part and $N - 1$ imaginary parts. Let $z$ and $\hat{z}$ denote the HC representations of the reference and test spectral vectors at pixel $(c, r)$. $Q2^{n}$ can be written as a product of three components:

$$Q2^{n} = \frac{\left|\sigma_{z\hat{z}}\right|}{\sigma_{z}\,\sigma_{\hat{z}}} \cdot \frac{2\,\bar{z}\,\bar{\hat{z}}}{\bar{z}^{2} + \bar{\hat{z}}^{2}} \cdot \frac{2\,\sigma_{z}\,\sigma_{\hat{z}}}{\sigma_{z}^{2} + \sigma_{\hat{z}}^{2}} \tag{17}$$

The first factor is the modulus of the HC correlation coefficient between $z$ and $\hat{z}$, which measures the degree of linear correlation. The second and third factors measure luminance distortion and contrast distortion on all bands simultaneously, respectively [35]. The value of $Q2^{n}$ ranges from 0 to 1, and $Q2^{n}$ is equal to 1 if, and only if, $z = \hat{z}$.
- (2) Spatial Distortion Index

The spatial distortion index $D_S$ [38] is defined as

$$D_S = \left| Q\!\left(I_{\mathrm{Fused}}, \mathrm{PAN}\right) - Q\!\left(I_{\mathrm{MS}}, \mathrm{PAN}_{\mathrm{low}}\right) \right| \tag{18}$$

where $Q$ is the universal image quality index, and $I_{\mathrm{Fused}}$ and $I_{\mathrm{MS}}$ are the intensities of the fused and MS images, respectively, which are defined as

$$I = \frac{1}{N} \sum_{k=1}^{N} B_{k} \tag{19}$$

where $B_{k}$ is the $k$-th spectral band.
- (3) Hybrid Quality with No Reference

Hybrid quality with no reference (HQNR) [39] borrows the spatial distortion index $D_S$ from QNR and the spectral distortion index $D_\lambda^{K}$ from the Khan protocol. It is defined as

$$\mathrm{HQNR} = \left(1 - D_\lambda^{K}\right)^{\alpha} \left(1 - D_S\right)^{\beta} \tag{20}$$

where usually $\alpha = \beta = 1$. The ideal value of HQNR is 1.
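The combination rule above is a one-liner once the two distortion indices are available; the function name and the default exponents of 1 follow the usual setting stated in the text.

```python
def hqnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Combine the Khan spectral index and the QNR spatial index.

    Both inputs lie in [0, 1]; zero distortion on both axes gives the
    ideal score of 1.
    """
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

Because the two factors multiply, a method must keep both spectral and spatial distortion low to score well; excelling on one axis cannot compensate for failing the other.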