Article

Shift-Invariant Unsupervised Pansharpening Based on Diffusion Model

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
4 Institute of Remote Sensing Application in Public Security, People’s Public Security University of China, Beijing 100038, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 27; https://doi.org/10.3390/rs18010027
Submission received: 19 October 2025 / Revised: 16 December 2025 / Accepted: 17 December 2025 / Published: 22 December 2025
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • A shift-invariant unsupervised diffusion-based pansharpening framework is proposed.
  • A translation error estimation module and loss are introduced to handle spatial offsets.
What are the implications of the main findings?
  • The work establishes a new framework for training unsupervised diffusion models in pansharpening.
  • The study offers new insights into achieving shift-invariant unsupervised fusion.

Abstract

Pansharpening is a crucial topic in remote sensing, and numerous deep learning-based methods have recently been proposed to explore the potential of deep neural networks (DNNs). However, existing approaches are often sensitive to spatial translation errors between high-resolution panchromatic (HRPan) and low-resolution multispectral (LRMS) images, leading to noticeable artifacts in the fused results. To address this issue, we propose an unsupervised pansharpening method that is robust to translation misalignment between HRPan and LRMS inputs. The proposed framework integrates a shift-invariant module to estimate subpixel spatial offsets and a diffusion-based generative model to progressively enhance spatial and spectral details. Moreover, a multi-scale detail injection module is designed to guide the diffusion process with fine-grained structural information. In addition, a carefully formulated loss function is established to preserve the fidelity of fusion results and facilitate the estimation of translation errors. Experiments conducted on the GaoFen-2, GaoFen-1, and WorldView-2 datasets demonstrate that the proposed method achieves superior fusion quality compared with state-of-the-art approaches and effectively suppresses artifacts caused by translation errors.

1. Introduction

Remote sensing applications, ranging from fine-scale land-cover mapping to urban monitoring, often require imagery that simultaneously exhibits rich spatial textures and accurate spectral information [1,2,3]. However, most existing satellites typically acquire high-resolution panchromatic (HRPan) and low-resolution multispectral (LRMS) signals separately, leading to a fundamental resolution imbalance between spatial and spectral modalities. Pansharpening has emerged as an effective strategy to bridge this gap by reconstructing high-resolution multispectral (HRMS) images from the two complementary sources. Despite the long-standing interest in this task, the rapid evolution of imaging sensors, data volume, and real-world deployment demands has highlighted practical challenges that remain insufficiently addressed by conventional fusion paradigms.
Existing pansharpening approaches can be broadly categorized into two classes: traditional methods and deep learning-based (DL-based) methods. Traditional methods are typically divided into component substitution-based (CS-based) methods [4,5,6], multiresolution analysis-based (MRA-based) methods [7,8,9], and variational optimization-based (VO-based) methods [10,11,12]. CS-based methods can effectively preserve spatial details but often suffer from spectral distortion [13]. In contrast, MRA-based methods generally achieve better spectral fidelity, while their ability to retain fine spatial details is limited [14]. Although VO-based methods can overcome these limitations, they are typically complex and computationally expensive [7,14].
DL-based methods have achieved remarkable progress by leveraging the strong capacity of neural networks and can be divided into two classes: supervised and unsupervised. Supervised methods typically train models on reduced-resolution images, yet their generalization to full-resolution scenarios remains inconsistent due to scale discrepancies and distribution shifts [15]. In contrast, unsupervised methods train models directly on full-resolution datasets, thereby avoiding the scale generalization difficulties of supervised methods [16,17,18].
A large number of DL-based unsupervised pansharpening methods have been proposed in recent years. For instance, Lin et al. [19] utilized convolutional neural networks (CNNs) to estimate the blur kernel of LRMS, effectively preserving the spectral fidelity. Zhou et al. [18] developed a cycle-consistent generative adversarial network (UCGAN) to bridge the gap between reduced and full-resolution images. Recently, diffusion models (DMs) have attracted extensive attention owing to their strong generative ability across various vision tasks. Compared with GANs, DMs not only produce more accurate results but also provide a more stable training process without the risk of mode collapse [20]. For example, Rui et al. [21] proposed an unsupervised hyperspectral pansharpening method based on a pretrained DM. However, existing unsupervised pansharpening methods based on DMs mostly depend on pretrained DMs [22,23], while the direct training of native DMs for unsupervised pansharpening from scratch remains underexplored and lacks theoretical justification.
In addition, most methods, whether supervised or unsupervised, assume perfect alignment between HRPan and LRMS. However, in practical satellite imaging, HRPan and LRMS are rarely perfectly aligned due to differences in sensor geometry, acquisition time, or preprocessing, inevitably leading to shift (translation) errors between the two images [24]. Even subpixel misalignment causes spatial details from HRPan to be injected into incorrect LRMS locations, producing ghosting edges, duplicated contours, and spectral-spatial inconsistencies in the fused images, as shown in Figure 1. Therefore, a shift-invariant pansharpening method that is robust to translation offsets between LRMS and HRPan is essential for reliable fusion, as it enables the model to compensate for unknown spatial misalignments and effectively prevent the propagation of fusion artifacts.
Several studies have recognized this issue and proposed corresponding solutions [16,25,26,27,28]. For example, Lee et al. [25] introduced a feature alignment module to generate an adjusted LRMS that better matches the HRPan input. However, this adjusted LRMS is only used as input to the fusion network, while the spectral loss computation still relies on multiple spatially shifted LRMS versions. Similarly, Ciotola et al. [16] estimated the offset between HRPan and LRMS by measuring their correlation across several spatial shifts before training. During loss computation, the LRMS is then shifted according to the estimated offset. Although these strategies improve fusion results compared with those ignoring translation errors, they still suffer from several limitations: (1) redundant computation, as multiple LRMS shifts are required; (2) the need to predefine a limited range of possible displacements; and (3) the offset-adjusted LRMS is not directly involved in loss optimization, limiting the effectiveness of these correction strategies.
To address the above limitations, we first present a framework for training DMs in the context of unsupervised pansharpening and provide a mathematical proof to support their effectiveness. To better guide the diffusion process, we design a Multiscale Detail Injection module (MSDI) that progressively injects fine-grained spatial and spectral information during generation. Second, to achieve robustness against translation errors, we develop a Translation Error Estimation Module (TEM) together with a dedicated translation estimation loss, which adaptively estimates subpixel offsets between HRPan and LRMS. This design effectively enhances spatial consistency and suppresses misalignment-induced fusion artifacts. Based on these designs, we propose Shift-Invariant Unsupervised Pansharpening based on Diffusion Model (SIUPan). The main contributions of our work are summarized as follows:
  • We propose a training framework for diffusion models in unsupervised pansharpening that directly trains a native DM in an unsupervised setting. We design an MSDI to guide the diffusion process by injecting hierarchical spatial and spectral details, ensuring a balanced improvement in both spatial and spectral fidelity.
  • We develop a TEM to adaptively estimate subpixel translation offsets between HRPan and LRMS, enabling shift-invariant fusion without redundant computations or predefined displacement ranges.
  • A new translation error estimation loss function is developed to guide the TEM in estimating the translation error. A tailored loss function is constructed to jointly maintain spectral fidelity and spatial sharpness.
In the rest of this article, we describe the proposed method in Section 2. In Section 3, we conduct extensive experiments to evaluate the performance of our method. Finally, a brief conclusion is presented in Section 4.

2. Methodology

2.1. Overall Network Architecture

As shown in Figure 2, SIUPan consists of two primary modules: (1) a TEM for estimating the displacements and (2) a diffusion-based fusion module for image fusion. Let $I_{\mathrm{pan}} \in \mathbb{R}^{W \times H \times 1}$, $I_{\mathrm{lrms}} \in \mathbb{R}^{w \times h \times C}$, and $I_{\mathrm{hrms}} \in \mathbb{R}^{W \times H \times C}$ represent the HRPan, LRMS, and HRMS images, respectively, where $W$ ($w$) and $H$ ($h$) are the width and height of HRPan (LRMS), and $C$ is the number of bands of LRMS. LRMS is upsampled with an 11th-order polynomial interpolator [29,30] to obtain the upsampled LRMS (MS), denoted as $I_{\mathrm{ms}} \in \mathbb{R}^{W \times H \times C}$. In SIUPan, $I_{\mathrm{pan}}$ and $I_{\mathrm{ms}}$ are first fed into the TEM to obtain the displacements $(\Delta m_{\theta}, \Delta n_{\theta})$. Next, we resample $I_{\mathrm{ms}}$ to obtain the shifted image $I_{\mathrm{Rms}}^{\theta}$. In the fusion module, we first define an initial estimate of HRMS, denoted as $\hat{I}_{\mathrm{hrms}} = I_{\mathrm{pan}}^{C}$, where $I_{\mathrm{pan}}^{C}$ represents $I_{\mathrm{pan}}$ stacked $C$ times along the band axis. The initial residual image $\hat{I}_{\mathrm{res}} = \hat{I}_{\mathrm{hrms}} - I_{\mathrm{Rms}}^{\theta}$ is then perturbed with Gaussian noise $\epsilon_t$ and fed into the diffusion model. The diffusion model with MSDI generates the residual image $I_{\mathrm{res}}^{\theta}$, which is added to $I_{\mathrm{Rms}}^{\theta}$ to obtain the fused image $I_{\mathrm{hrms}}^{\theta} = I_{\mathrm{Rms}}^{\theta} + I_{\mathrm{res}}^{\theta}$. Notations used in the following discussion are listed in Table 1.
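To make the data flow concrete, the following PyTorch-style sketch composes the modules described below. The names `tem`, `shift_resample`, and `diffusion` are hypothetical callables standing in for the components detailed in Sections 2.2 and 2.3, not the authors' actual implementation.

```python
import torch

def siupan_forward(pan, ms, tem, shift_resample, diffusion, T):
    """High-level sketch of the SIUPan pipeline (module names are placeholders).

    pan: (B, 1, H, W) HRPan;  ms: (B, C, H, W) upsampled LRMS (MS).
    """
    dm, dn = tem(pan, ms)                        # Section 2.3: subpixel offsets
    rms = shift_resample(ms, dm, dn)             # I_Rms^theta on the HRPan grid
    hrms_init = pan.repeat(1, ms.shape[1], 1, 1) # I_pan stacked C times
    res_init = hrms_init - rms                   # initial residual estimate
    res_pred = diffusion.sample(res_init, cond=(pan, rms), T=T)
    return rms + res_pred                        # fused image I_hrms^theta
```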

2.2. Diffusion Model

Similar to most diffusion models [20,31,32,33,34], we utilize a U-Net as the backbone of our denoising network. Here, the SR3 [32] architecture is employed. We modify both the forward and reverse processes of diffusion to adapt it for unsupervised pansharpening. Additionally, MSDI is designed to inject spatial details and spectral information into U-Net for better fusion results.

2.2.1. Unsupervised Diffusion Model

In Denoising Diffusion Probabilistic Models (DDPM) [35], the forward process is defined on samples drawn from the true data distribution. However, in unsupervised pansharpening, the true data
$$x_0 = I_{\mathrm{res}} = I_{\mathrm{hrms}} - I_{\mathrm{Rms}} \tag{1}$$
is unavailable. To address this challenge, we adopt a proxy estimate
$$\hat{x}_0 = \hat{I}_{\mathrm{res}} = \hat{I}_{\mathrm{hrms}} - I_{\mathrm{Rms}}^{\theta}, \tag{2}$$
where $\hat{I}_{\mathrm{hrms}}$ is an initial HRMS estimate, which is set as the stacked HRPan along the band axis in our work. The relationship between $x_0$ and $\hat{x}_0$ can be described as
$$x_0 = \hat{x}_0 + z, \tag{3}$$
where $z$ is the difference between $x_0$ and $\hat{x}_0$. We assume that $z$ follows a Gaussian distribution
$$p(z) = \mathcal{N}(z; 0, \delta^2 \mathbf{I}), \tag{4}$$
where $\mathcal{N}$ denotes the Gaussian distribution and $\delta^2$ is the variance. Thus, $\hat{x}_0$ is a Gaussian-perturbed version of $x_0$. This assumption allows us to define a valid diffusion forward process without access to the true data distribution.
The forward process can be expressed as
$$p(\hat{x}_t \mid \hat{x}_{t-1}) = \mathcal{N}\big(\hat{x}_t; \sqrt{1-\beta_t}\,\hat{x}_{t-1},\, \beta_t \mathbf{I}\big), \tag{5}$$
where $t \in [1:T]$, $T$ is the number of timesteps, and $\beta_t$ is a pre-defined variance. In the forward process, we can sample $\hat{x}_t$ from $\hat{x}_0$ in a closed form:
$$p(\hat{x}_t \mid \hat{x}_0) = \mathcal{N}\big(\hat{x}_t; \sqrt{\bar{\alpha}_t}\,\hat{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big), \tag{6}$$
where $\bar{\alpha}_t = \prod_{i=0}^{t} \alpha_i$ and $\alpha_i = 1 - \beta_i$. Substituting (3) into (6), we obtain
$$\hat{x}_t = \sqrt{\bar{\alpha}_t}\, x_0 + \eta_t, \quad \eta_t \sim \mathcal{N}\big(0,\, (1-\bar{\alpha}_t + \delta^2 \bar{\alpha}_t)\mathbf{I}\big), \tag{7}$$
indicating that using $\hat{x}_0$ simply modifies the variance of the diffusion kernel but preserves the trajectory structure. The pseudocode of the forward process is given in Algorithm 1.
Algorithm 1 The Forward Process.
Input: the number of timesteps $T$; the estimate of $x_0$, $\hat{x}_0$
Output: the noised image $\hat{x}_T$
1: for $t = 1:T$ do
2:        select $\beta_t$
3:        sample $\hat{x}_t$ according to (6)
4: end for
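As a concrete reference, a minimal PyTorch sketch of the closed-form forward sampling in (6) is given below; the linear $\beta_t$ schedule is an assumption borrowed from DDPM, not a choice stated in this section.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed DDPM-style schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # cumulative products of alpha_i

def forward_diffuse(x0_hat, t):
    """Sample x_hat_t from x_hat_0 via the closed form in Eq. (6).

    x0_hat: (B, C, H, W) proxy residual; t: (B,) integer timesteps in [0, T).
    """
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0_hat)
    return a.sqrt() * x0_hat + (1.0 - a).sqrt() * eps
```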
In the reverse process, our network predicts $x_0$ directly rather than the noise. Thus, we can sample $\hat{x}_t$ from the conditional distribution $p(\hat{x}_t \mid x_0)$, which also follows a Gaussian distribution:
$$p(\hat{x}_t \mid x_0) = \mathcal{N}\big(\hat{x}_t; \sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t + \delta^2 \bar{\alpha}_t)\mathbf{I}\big). \tag{8}$$
Subsequently, we iteratively update $x_0$ using the sampled $\hat{x}_t$ until the sampling process is finished. The pseudocode of the reverse process is given in Algorithm 2. The forward and reverse processes of the diffusion model in our work are illustrated in Figure 3.
Algorithm 2 The Reverse Process.
Input: the number of timesteps $T$; the Gaussian noise image $\hat{x}_T$; the network $f_{\theta}$
Output: the sampled image $x_0$
1: for $t = T:1$ do
2:        $x_0 = f_{\theta}(\hat{x}_t, t)$
3:        sample $\hat{x}_{t-1}$ according to (8)
4: end for
5: $t = t - 1$
6: $x_0 = f_{\theta}(\hat{x}_0, t)$
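The loop below is a minimal sketch of Algorithm 2 with ancestral sampling from (8); the $\delta^2$ value and the zero-indexed storage of $\bar{\alpha}_t$ are implementation assumptions of this sketch.

```python
import torch

@torch.no_grad()
def reverse_sample(f_theta, x_hat_T, alpha_bar, delta2, T):
    """Sketch of Algorithm 2: the network predicts x0 directly, and
    x_hat_{t-1} is resampled from p(x_hat_{t-1} | x0) per Eq. (8)."""
    x_hat = x_hat_T
    for t in range(T, 0, -1):
        x0 = f_theta(x_hat, t)                 # direct x0 prediction
        if t > 1:
            a = alpha_bar[t - 2]               # alpha_bar_{t-1}, 0-indexed
            var = 1.0 - a + delta2 * a         # variance in Eq. (8)
            x_hat = a.sqrt() * x0 + var.sqrt() * torch.randn_like(x0)
    return f_theta(x_hat, 0)                   # final refinement (lines 5-6)
```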
Since our goal is to obtain $x_0$, the training objective is to maximize the log-likelihood of $p(x_0)$, i.e., $\ln p(x_0)$:
$$\max \ln p(x_0) = \max \ln \int p(x_0, \hat{x}_{0:T})\, d\hat{x}_{0:T}. \tag{9}$$
However, we cannot maximize it directly because of the latent variables $\hat{x}_{0:T}$. Therefore, we use its Evidence Lower Bound (ELBO), which can be written as
$$\ln p(x_0) \geq \mathcal{L}_{\mathrm{ELBO}} \propto -\| x_0 - x_0^{\theta} \|_2^2, \tag{10}$$
where $x_0^{\theta}$ is the output of our network. The detailed proof of (10) is provided in Appendix A. Let $\Delta e = \| x_0 - x_0^{\theta} \|_2^2$; then, the objective function in (9) can be converted to
$$\min \Delta e = \min \| x_0 - x_0^{\theta} \|_2^2. \tag{11}$$
Clearly, there is no $I_{\mathrm{res}}$ (equivalently, no $I_{\mathrm{hrms}}$) available in unsupervised pansharpening. Fortunately, we have two transformations of $I_{\mathrm{hrms}}$, namely $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$, that can be used for training:
$$I_{\mathrm{pan}} = T_{\mathrm{pan}}(I_{\mathrm{hrms}}), \quad I_{\mathrm{Rms}}^{\theta} = T_{\mathrm{Rms}}(I_{\mathrm{hrms}}), \tag{12}$$
where $T_{\mathrm{pan}}$ ($T_{\mathrm{Rms}}$) represents the corresponding transformation function. Accordingly, we modify the objective function (11) into
$$\min\; \Delta e_{\mathrm{pan}} + \Delta e_{\mathrm{Rms}}, \tag{13}$$
where $\Delta e_{\mathrm{pan}} = f_{\mathrm{pan}}\big(T_{\mathrm{pan}}(I_{\mathrm{Rms}}^{\theta} + I_{\mathrm{res}}^{\theta}),\, I_{\mathrm{pan}}\big)$ and $\Delta e_{\mathrm{Rms}} = f_{\mathrm{Rms}}\big(T_{\mathrm{Rms}}(I_{\mathrm{Rms}}^{\theta} + I_{\mathrm{res}}^{\theta}),\, I_{\mathrm{Rms}}^{\theta}\big)$; $f_{\mathrm{pan}}$ and $f_{\mathrm{Rms}}$ are the spatial and spectral loss functions, respectively, which will be given in the following.
In fact, we can employ any sample solver that is compatible with DDPM, as the structure of our DM is consistent with that of DDPM. The Denoising Diffusion Implicit Model (DDIM) sample solver [36] is used in our work.
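A sketch of how the two transformations enter one training step is shown below. The text does not specify $T_{\mathrm{pan}}$ and $T_{\mathrm{Rms}}$ at this point, so modeling them as a band average and bilinear decimation is purely an assumption for illustration; `f_pan` and `f_rms` stand for the loss functions of Section 2.4.2.

```python
import torch
import torch.nn.functional as F

def unsupervised_objective(res_pred, rms, pan, f_pan, f_rms, ratio=4):
    """Evaluate Eq. (13) for one batch; T_pan / T_Rms are modeled here as a
    band average and bilinear decimation (assumptions of this sketch)."""
    fused = rms + res_pred                                # I_hrms^theta
    t_pan = fused.mean(dim=1, keepdim=True)               # T_pan(.)
    t_rms = F.interpolate(fused, scale_factor=1.0 / ratio,
                          mode="bilinear", align_corners=False)  # T_Rms(.)
    rms_ref = F.interpolate(rms, scale_factor=1.0 / ratio,
                            mode="bilinear", align_corners=False)
    return f_pan(t_pan, pan) + f_rms(t_rms, rms_ref)      # dE_pan + dE_Rms
```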

2.2.2. MSDI

Many studies have shown that incorporating conditional information in DMs can enhance generation quality [20,31,37,38]. In pansharpening, we naturally have two prior conditions, HRPan and LRMS, which are leveraged to provide conditional information. HRPan contains rich spatial details but lacks spectral information, while LRMS provides abundant spectral information but has fewer distinct details. Therefore, to balance the spatial details and spectral information, we design the MSDI.
As shown in Figure 4, the MSDI is a multi-layer-structure module, where each layer consists of two cross-attention blocks and two downsampling blocks. Cross-attention blocks are employed to enhance the spectral information of HRPan features and improve the spatial details of LRMS features. Downsampling blocks are used to obtain feature maps at multiple scales.
For the $i$-th layer of the MSDI (except the first layer), the inputs $F_{\mathrm{pan}}^{i-1}$ and $F_{\mathrm{ms}}^{i-1}$ are first downsampled to reduce the size of the feature maps. Convolution layers then control the depth of the feature maps:
$$F_{\mathrm{pan}}^{i} = \mathrm{Conv}\big(\mathrm{Downsample}(F_{\mathrm{pan}}^{i-1})\big), \quad F_{\mathrm{ms}}^{i} = \mathrm{Conv}\big(\mathrm{Downsample}(F_{\mathrm{ms}}^{i-1})\big).$$
Sequentially, $F_{\mathrm{ms}}^{i}$ and $F_{\mathrm{pan}}^{i}$ are passed through BatchNormalization and LeakyReLU. Two cross-attention blocks are then employed to enhance the two feature maps, formulated as
$$\tilde{F}_{\mathrm{pan}}^{i} = E(F_{\mathrm{pan}}^{i}, F_{\mathrm{ms}}^{i}), \quad \tilde{F}_{\mathrm{ms}}^{i} = E(F_{\mathrm{ms}}^{i}, F_{\mathrm{pan}}^{i}),$$
where $E(\mathrm{input}, \mathrm{context})$ is the linear-complexity cross-attention [39]. The output of the $i$-th layer is
$$F_{\mathrm{pan}}^{i} \leftarrow F_{\mathrm{pan}}^{i} + \tilde{F}_{\mathrm{pan}}^{i}, \quad F_{\mathrm{ms}}^{i} \leftarrow F_{\mathrm{ms}}^{i} + \tilde{F}_{\mathrm{ms}}^{i}.$$
As shown in Figure 5, the input feature $F_{\mathrm{ms}}^{1}$ is blurred, and $F_{\mathrm{pan}}^{1}$ has low contrast. After the cross-attention blocks, the sharpness of $F_{\mathrm{ms}}^{1}$ is enhanced and the contrast of $F_{\mathrm{pan}}^{1}$ is improved. Finally, $F_{\mathrm{ms}}^{i}$ and $F_{\mathrm{pan}}^{i}$ are injected into the encoder to guide image generation.
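A compact sketch of one MSDI layer follows. The channel width, the average-pooling downsampler, the LeakyReLU slope, and the single-head formulation of the efficient (linear-complexity) attention [39] are illustrative assumptions; only the overall structure (downsample, conv, BN/LeakyReLU, bidirectional cross-attention with residual connections) follows the description above.

```python
import torch
import torch.nn as nn

class MSDILayer(nn.Module):
    """One MSDI layer: downsample + conv + BN/LeakyReLU, then bidirectional
    linear-complexity cross-attention with residual connections (a sketch)."""

    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.AvgPool2d(2)
        self.conv_pan = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv_ms = nn.Conv2d(ch, ch, 3, padding=1)
        self.act_pan = nn.Sequential(nn.BatchNorm2d(ch), nn.LeakyReLU(0.2))
        self.act_ms = nn.Sequential(nn.BatchNorm2d(ch), nn.LeakyReLU(0.2))
        self.q, self.k, self.v = (nn.Conv2d(ch, ch, 1) for _ in range(3))

    def attend(self, x, ctx):
        # Efficient attention [39]: softmax over query channels and key
        # positions gives O(N) complexity instead of O(N^2).
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).softmax(dim=1)       # (B, C, N)
        k = self.k(ctx).flatten(2).softmax(dim=2)     # (B, C, N)
        v = self.v(ctx).flatten(2)                    # (B, C, N)
        context = k @ v.transpose(1, 2)               # (B, C, C) global summary
        return (context.transpose(1, 2) @ q).view(b, c, h, w)

    def forward(self, f_pan, f_ms):
        f_pan = self.act_pan(self.conv_pan(self.down(f_pan)))
        f_ms = self.act_ms(self.conv_ms(self.down(f_ms)))
        return f_pan + self.attend(f_pan, f_ms), f_ms + self.attend(f_ms, f_pan)
```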

2.3. Translation Error Estimation Module

The translation errors between $I_{\mathrm{pan}}$ and $I_{\mathrm{ms}}$ can be described as
$$[m_{\mathrm{pan}}\; n_{\mathrm{pan}}]^{T} = T_R\big([m_{\mathrm{ms}}\; n_{\mathrm{ms}}]^{T}\big) = A\, [m_{\mathrm{ms}}\; n_{\mathrm{ms}}\; 1]^{T}, \tag{14}$$
where $[m_{\mathrm{pan}}\; n_{\mathrm{pan}}]$ and $[m_{\mathrm{ms}}\; n_{\mathrm{ms}}]$ are the coordinates in $I_{\mathrm{pan}}$ and $I_{\mathrm{ms}}$, respectively, and $A$ is an affine transformation matrix, which can be expressed as
$$A = \begin{bmatrix} 1 & 0 & \Delta m \\ 0 & 1 & \Delta n \end{bmatrix}. \tag{15}$$
However, $\Delta m$ and $\Delta n$ are unknown. Thus, the TEM is designed to estimate them.
As shown in Figure 6, the TEM comprises two structurally identical Feature Extractors and one Feature Integration Block. In each Feature Extractor, the inputs $I_{\mathrm{pan}}$ and $I_{\mathrm{ms}}$ are mapped into the feature domain via a convolutional layer. Considering that our goal is to shift $I_{\mathrm{ms}}$ to match $I_{\mathrm{pan}}$, and that they are heterogeneous images (with significant style differences), we concatenate $I_{\mathrm{ms}}$ with $I_{\mathrm{pan}}$ at the input of the MS Feature Extractor, which provides $I_{\mathrm{ms}}$ with style information from $I_{\mathrm{pan}}$. The Feature Integration Block is composed of a MidBlock, three DownBlocks, a convolutional layer, an activation function, and a global average pooling. The MidBlock, containing two convolutional layers and two activation functions, integrates the two extracted features. Inspired by [40], three cascaded DownBlocks are employed to extract the difference between $I_{\mathrm{pan}}$ and $I_{\mathrm{ms}}$ in the multiscale feature domain. Each DownBlock applies average pooling with a $3 \times 3$ kernel and stride 2 to downsample the feature maps. After a global average pooling, the estimated displacement $[\Delta m_{\theta}\; \Delta n_{\theta}]$ is obtained.
Our objective is to obtain $I_{\mathrm{Rms}}$, which has the same coordinates as $I_{\mathrm{pan}}$. Then, according to (14), we can obtain the estimated coordinates of $I_{\mathrm{Rms}}$ as
$$[m_{\mathrm{Rms}}^{\theta}\; n_{\mathrm{Rms}}^{\theta}]^{T} = \begin{bmatrix} 1 & 0 & \Delta m_{\theta} \\ 0 & 1 & \Delta n_{\theta} \end{bmatrix} [m_{\mathrm{ms}}\; n_{\mathrm{ms}}\; 1]^{T}. \tag{16}$$
Then, the estimated $I_{\mathrm{Rms}}$ (denoted as $I_{\mathrm{Rms}}^{\theta}$) can be resampled by
$$I_{\mathrm{Rms}}^{\theta}(m_{\mathrm{Rms}}^{\theta}, n_{\mathrm{Rms}}^{\theta}, c) = \sum_{i}^{H} \sum_{j}^{W} I_{\mathrm{ms}}(i, j, c)\, k(m_{\mathrm{ms}} - i)\, k(n_{\mathrm{ms}} - j), \tag{17}$$
where $k(\cdot)$ is the generic sampling kernel and $c$ is the band index. Here, the bilinear sampling kernel is used. Moreover, a translation error estimation loss is designed to achieve more accurate estimation of the spatial offsets. Since the true displacements $[\Delta m, \Delta n]$ are unknown, it is infeasible to directly minimize the difference between $[\Delta m, \Delta n]$ and $[\Delta m_{\theta}, \Delta n_{\theta}]$. Instead, we construct the loss function based on $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$, as detailed in Section 2.4.1.
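The resampling step in (17) can be realized with a differentiable warp so that gradients of the fusion and alignment losses flow back into the TEM. The sketch below uses `F.affine_grid`/`F.grid_sample`; the sign convention of the offsets is an assumption.

```python
import torch
import torch.nn.functional as F

def shift_resample(ms, dm, dn):
    """Bilinearly resample I_ms by estimated subpixel offsets (cf. Eq. (17)).

    ms: (B, C, H, W);  dm, dn: (B,) offsets in pixels (rows, columns).
    Differentiable w.r.t. dm and dn, so the TEM can be trained end to end.
    """
    b, _, h, w = ms.shape
    theta = torch.zeros(b, 2, 3, device=ms.device, dtype=ms.dtype)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    theta[:, 0, 2] = 2.0 * dn / (w - 1)    # column shift, normalized to [-1, 1]
    theta[:, 1, 2] = 2.0 * dm / (h - 1)    # row shift
    grid = F.affine_grid(theta, ms.shape, align_corners=True)
    return F.grid_sample(ms, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```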

2.4. Loss Function

In SIUPan, the loss consists of two terms, the translation error estimation loss and the fusion loss, which is formulated as
$$L = w_1 L_R + L_F, \tag{18}$$
where $L_R$ is the translation error estimation loss, $L_F$ is the fusion loss, and $w_1$ is a trade-off parameter.

2.4.1. Translation Error Estimation Loss

The objective of the TEM is to make $[\Delta m_{\theta}\; \Delta n_{\theta}]$ as close as possible to $[\Delta m\; \Delta n]$. However, since the GT offsets are unavailable, we redefine the objective as evaluating whether $I_{\mathrm{Rms}}^{\theta}$ is properly aligned with $I_{\mathrm{pan}}$. According to [41,42,43,44], if two images are perfectly matched, their phase difference in the frequency domain is zero. Thus, we formulate the translation error estimation loss function as the phase difference between $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$:
$$L_R = f_R(I_{\mathrm{pan}}, I_{\mathrm{Rms}}^{\theta}), \tag{19}$$
where $f_R$ is the translation error estimation loss function. Assuming that two images $X \in \mathbb{R}^{H \times W}$ and $Y \in \mathbb{R}^{H \times W}$ have displacements $[\Delta m\; \Delta n]$, their phase difference in the frequency domain can be estimated using the normalized cross-power spectrum (NCPS), denoted as $Q(u,v)$:
$$Q(u,v) = \frac{\tilde{X}(u,v)\, \tilde{Y}^{*}(u,v)}{\big|\tilde{X}(u,v)\, \tilde{Y}^{*}(u,v)\big|}, \tag{20}$$
where $\tilde{X}(u,v)$ and $\tilde{Y}(u,v)$ are the Fourier transforms of $X(m,n)$ and $Y(m,n)$, respectively, and $^{*}$ denotes complex conjugation. However, the NCPS is sensitive to noise [43]. To improve its robustness, Zhu et al. [43] proposed an autocorrelated version, termed ANCPS, which can be formulated as
$$R(\mu,\nu) = Q(u,v)\, Q^{*}(u-\mu, v-\nu) = e^{-j 2\pi (\mu \Delta m / H + \nu \Delta n / W)}, \tag{21}$$
where $R(\mu,\nu)$ is the ANCPS and $Q^{*}$ is the complex conjugate of $Q$. From (21), we find that the displacements are embedded in the phase of the ANCPS. Thus, we can use the ANCPS of $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$ to estimate their displacements. For simplicity, we also denote the displacements and ANCPS of $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$ as $[\Delta m\; \Delta n]$ and $R(\mu,\nu)$, respectively. To ensure proper alignment in both directions, we must compute $\Delta m$ and $\Delta n$ separately. Thus, we have
$$p_x(\mu,\nu) = R(\mu,\nu)\, R^{*}(\mu-1,\nu) = e^{-j 2\pi \Delta m / H}, \quad p_y(\mu,\nu) = R(\mu,\nu)\, R^{*}(\mu,\nu-1) = e^{-j 2\pi \Delta n / W}, \tag{22}$$
where $R^{*}$ is the complex conjugate of $R$. From (22), we can obtain $\Delta m$ and $\Delta n$ from $p_x$ and $p_y$, respectively. To enhance robustness, we average the phase over $r$ pixels. Thus, the translation error estimation loss is formulated as
$$L_R = \frac{1}{r} \sum_{i}^{r} \Big( \big|\mathrm{angle}\big(p_x(\mu_i, \nu_i)\big)\big| + \big|\mathrm{angle}\big(p_y(\mu_i, \nu_i)\big)\big| \Big), \tag{23}$$
where $\mathrm{angle}(\cdot)$ extracts the phase.
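For reference, a simplified differentiable version of this loss is sketched below using `torch.fft`. For clarity it reads the phase steps directly from the NCPS $Q$ rather than from the autocorrelated ANCPS of [43], and it averages the MS bands into a single channel; both simplifications are assumptions of the sketch.

```python
import torch

def translation_loss(pan, rms, r=8):
    """Phase-based alignment loss in the spirit of Eq. (23).

    pan: (B, 1, H, W); rms: (B, C, H, W). For a pure shift,
    Q(u, v) ~ exp(-j*2*pi*(u*dm/H + v*dn/W)), so neighbouring-bin phase
    differences isolate dm and dn (cf. Eq. (22)); r low-frequency bins
    are averaged for robustness.
    """
    x = torch.fft.fft2(pan[:, 0])
    y = torch.fft.fft2(rms.mean(dim=1))
    q = x * y.conj()
    q = q / (q.abs() + 1e-8)                          # NCPS, Eq. (20)
    px = q * torch.roll(q, shifts=1, dims=1).conj()   # phase step along u
    py = q * torch.roll(q, shifts=1, dims=2).conj()   # phase step along v
    return (px.angle()[:, :r, :r].abs() + py.angle()[:, :r, :r].abs()).mean()
```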

2.4.2. Fusion Loss

SIUPan incorporates two categories of fusion loss functions: the spatial loss function (i.e., $f_{\mathrm{pan}}$ in Section 2.2) and the spectral loss function (i.e., $f_{\mathrm{Rms}}$ in Section 2.2). Specifically, the spatial loss consists of one term ($L_{\mathrm{spa}}$), while the spectral loss includes two terms ($L_{\mathrm{spe1}}$ and $L_{\mathrm{spe2}}$). Therefore, the fusion loss is expressed as
$$L_F = w_2 L_{\mathrm{spa}} + w_3 L_{\mathrm{spe1}} + w_4 L_{\mathrm{spe2}}, \tag{24}$$
where $w_2$, $w_3$, and $w_4$ are trade-off parameters.
$L_{\mathrm{spa}}$ is designed to preserve the spatial details. Following refs. [28,45], the correlation coefficient between $I_{\mathrm{pan}}$ and $I_{\mathrm{hrms}}^{\theta}$ is used to measure the spatial distortion. The correlation coefficient between two images $X$ and $Y$ is defined as
$$\mathrm{CORR}(X,Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \tag{25}$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, and $\sigma_{XY}$ is their covariance. It is obviously unreasonable to assume that each band of $I_{\mathrm{hrms}}^{\theta}$ has a high correlation with $I_{\mathrm{pan}}$. Therefore, the correlation coefficient between the degraded $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$ is taken as a threshold mask. Moreover, an $s \times s$ window is used to calculate the local correlation coefficient. The spatial loss is formulated as
$$L_{\mathrm{spa}} = \frac{1}{CS} \sum_{c}^{C} \sum_{s}^{S} \big(1 - \rho_s(c)\big)\, u\big(\rho_{ms}(c) - \rho_s(c)\big), \tag{26}$$
where $\rho_s(c) = \mathrm{CORR}\big(I_{\mathrm{pan}}^{s}, I_{\mathrm{hrms}}^{\theta,s}(c)\big)$ and $\rho_{ms}(c) = \mathrm{CORR}\big(G(I_{\mathrm{pan}}^{s}), I_{\mathrm{Rms}}^{\theta,s}(c)\big)$. $C$ is the number of bands, $S$ is the number of windows, $G(\cdot)$ represents the Gaussian degrading operation, and $u(\cdot)$ is the unit step function.
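A sketch of this windowed-correlation loss is given below; the non-overlapping tiling via `F.unfold` and the hard (non-differentiable) step mask are assumptions of this illustration, and `gauss_blur` stands in for $G(\cdot)$.

```python
import torch
import torch.nn.functional as F

def local_corr(a, b, s=8):
    """Correlation coefficient (Eq. (25)) over non-overlapping s x s tiles."""
    ta = F.unfold(a, kernel_size=s, stride=s)   # (B, s*s, S)
    tb = F.unfold(b, kernel_size=s, stride=s)
    ta = ta - ta.mean(dim=1, keepdim=True)
    tb = tb - tb.mean(dim=1, keepdim=True)
    cov = (ta * tb).mean(dim=1)
    sa = ta.pow(2).mean(dim=1).sqrt()
    sb = tb.pow(2).mean(dim=1).sqrt()
    return cov / (sa * sb + 1e-8)               # (B, S)

def spatial_loss(pan, fused, rms, gauss_blur, s=8):
    """Sketch of L_spa (Eq. (26)): penalize windows whose PAN-fused
    correlation falls below the degraded-PAN / RMS threshold."""
    loss, C = 0.0, fused.shape[1]
    for c in range(C):
        rho = local_corr(pan, fused[:, c:c + 1], s)               # rho_s(c)
        rho_m = local_corr(gauss_blur(pan), rms[:, c:c + 1], s)   # rho_ms(c)
        mask = (rho_m - rho > 0).float()                          # step u(.)
        loss = loss + ((1.0 - rho) * mask).mean()
    return loss / C
```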
We design two spectral loss terms, $L_{\mathrm{spe1}}$ and $L_{\mathrm{spe2}}$, to preserve the spectral information. Since there is no GT to evaluate the spectral distortion, following Wald's protocol [46], we first degrade $I_{\mathrm{hrms}}^{\theta}$ and then compute the spectral losses between the degraded $I_{\mathrm{hrms}}^{\theta}$ and $I_{\mathrm{Rms}}^{\theta}$.
Due to fluctuations in the spectrum, a small change in low-value bands can lead to significant spectral distortion, whereas the situation is reversed in high-value bands. To address this issue, we use the relative difference, which is defined as
$$L_{\mathrm{spe1}} = \frac{1}{C} \sum_{c=1}^{C} \frac{\big| D\big(G(I_{\mathrm{hrms}}^{\theta}(c))\big) - D\big(I_{\mathrm{Rms}}^{\theta}(c)\big) \big|}{\mathrm{MEAN}\big(D(I_{\mathrm{Rms}}^{\theta}(c))\big)}, \tag{27}$$
where $D(\cdot)$ denotes downsampling.
$L_{\mathrm{spe2}}$ is designed to fine-tune the shape of the spectral curve, ensuring pixelwise and local spectral similarity; it contains two components. The first is the Spectral Angle Mapper (SAM) [47,48], which calculates the angle between two spectral vectors $x$ and $y$:
$$\mathrm{SAM}(x,y) = \arccos \frac{x \cdot y}{\|x\| \cdot \|y\|}. \tag{28}$$
The second is the Structural Similarity Index Measure (SSIM) [49], an effective measure of the similarity between two images, defined as
$$\mathrm{SSIM}(X,Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}, \tag{29}$$
where $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$, respectively, and $C_1$ and $C_2$ are constants to avoid instability. SSIM not only accelerates convergence but also ensures the local smoothness of the fused results.
Therefore, $L_{\mathrm{spe2}}$ is defined as
$$L_{\mathrm{spe2}} = \mathrm{SAM}\big(D(I_{\mathrm{Rms}}^{\theta}), D(G(I_{\mathrm{hrms}}^{\theta}))\big) + 1 - \mathrm{SSIM}\big(I_{\mathrm{Rms}}^{\theta}, G(I_{\mathrm{hrms}}^{\theta})\big). \tag{30}$$
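Both spectral terms admit a direct implementation; the sketch below assumes bilinear decimation for $D(\cdot)$, a caller-supplied Gaussian blur for $G(\cdot)$, and an external SSIM implementation.

```python
import torch
import torch.nn.functional as F

def sam(x, y, eps=1e-8):
    """Spectral Angle Mapper (Eq. (28)), averaged over pixels; x, y: (B, C, H, W)."""
    cos = (x * y).sum(dim=1) / (x.norm(dim=1) * y.norm(dim=1) + eps)
    return torch.acos(cos.clamp(-1.0, 1.0)).mean()

def spectral_losses(fused, rms, gauss_blur, ssim_fn, ratio=4):
    """Sketch of L_spe1 (Eq. (27)) and L_spe2 (Eq. (30))."""
    deg = gauss_blur(fused)                                    # G(I_hrms^theta)
    down = lambda z: F.interpolate(z, scale_factor=1.0 / ratio,
                                   mode="bilinear", align_corners=False)
    d_deg, d_rms = down(deg), down(rms)                        # D(.)
    l_spe1 = ((d_deg - d_rms).abs() /
              (d_rms.mean(dim=(2, 3), keepdim=True).abs() + 1e-8)).mean()
    l_spe2 = sam(d_rms, d_deg) + 1.0 - ssim_fn(rms, deg)
    return l_spe1, l_spe2
```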

3. Experiments

3.1. Data and Training Details

We conduct experiments on three satellite datasets with varying degrees of translation errors. GF2 and part of the WV2 images are sourced from [50]. Although our method is unsupervised, additional experiments are performed on reduced-resolution datasets, degraded according to Wald’s protocol [46] for evaluation purposes. Training samples are generated by cropping 16 × 16 patches from LRMS and 64 × 64 patches from HRPan. The details of the datasets are listed in Table 2. It should be noted that the displacements listed in the table represent preliminary estimates obtained using the method described in [43].
The proposed method is implemented in PyTorch and trained on a single NVIDIA GeForce RTX 2080Ti GPU. We use the Adam optimizer with an initial learning rate of 0.0002 and a decay rate of 0.97. The number of training iterations is 50,000, and the batch size is 64. The trade-off parameters in the loss function are set as $w_1 = 100$, $w_2 = 4.5$, $w_3 = 4$, and $w_4 = 1$ for the WV2 dataset. The experiments for all methods are performed on the same datasets.
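For completeness, the optimizer setup reads as follows in PyTorch; applying the 0.97 decay as an exponential schedule (and its stepping cadence) is an assumption, as the text does not specify it.

```python
import torch

def build_optimizer(model):
    """Optimization setup described above (decay cadence is assumed)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)
    return optimizer, scheduler
```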

3.2. Experiment Design

The proposed method is compared with eleven pansharpening methods, including the following:
  • Traditional methods: adaptive Gram-Schmidt (GSA) [6], spatial fidelity with nonlocal regression (SFNLR) [51], and the nonconvex framelet sparse reconstruction method (NC-FSRM) [52]; these methods are conducted without training.
  • Supervised methods: diffusion model with disentangled modulations for sharpening multispectral and hyperspectral images (DDIF) [20], pansharpening GAN (PSGAN) [50], spatial–spectral dual back-projection network (S2DBPN) [53], and dual-conditionally guided registration and fusion diffusion network (RFDifNet) [54]; these methods are trained on reduced resolution datasets and tested on both reduced and full resolution datasets. Among them, RFDifNet addresses the misalignment between HRPan and LRMS by explicitly estimating a deformation field to correct the spatial discrepancies between them.
  • Unsupervised methods: zero-shot semi-supervised method for pansharpening (ZSPan) [55], pansharpening via deep image prior (PSDip) [56], λ-CNN-based pansharpening (λ-PNN) [16], and the unsupervised cycle-consistent generative adversarial network (UCGAN) [18]; these methods are trained and tested on reduced and full resolution datasets, respectively. Among them, λ-PNN handles the translation misalignment between HRPan and LRMS by applying multiple shifted versions of the inputs to approximate the spatial offsets between the two images.
To evaluate the performance of different methods, seven widely used metrics are adopted for quantitative evaluation. Three of these are no-reference metrics used for full resolution assessment, including the spectral distortion index ($D_{\lambda}$), the spatial distortion index ($D_s$), and quality with no reference (QNR) [4,57]. The other four are reference metrics used for reduced resolution assessment, including the erreur relative globale adimensionnelle de synthèse (ERGAS) [46], SSIM, SAM, and the spatial correlation coefficient (SCC) [58]. Since translation errors exist between HRPan and LRMS, it is essential to account for their impact when computing evaluation metrics. Therefore, we compute the metrics using both the original (unaligned) and translation-corrected LRMS–HRPan pairs as reference images to provide a more comprehensive assessment.

3.3. Comparison on the GF2 Dataset

3.3.1. Visual Results

We test the performance of our method and the baseline methods on the GF2 dataset. Figure 7 and Figure 8 display the fusion results at full and reduced resolutions. As shown in Figure 7, noticeable displacements in the GF2 dataset adversely affect all methods that lack translation matching, particularly traditional and unsupervised ones, leading to duplicated edges in the fusion results. While λ-PNN mitigates some artifacts through iterative shift-based alignment, it fails to fully eliminate them. RFDifNet, which estimates a deformation field to register HRPan and LRMS, also struggles with the large misalignment and introduces spatial distortions. In contrast, our method effectively suppresses artifacts by employing the TEM, which accurately compensates for the spatial translation between the HRPan and MS inputs. Notably, the impact of translation errors becomes less significant on the reduced-resolution dataset (Figure 8), as displacements diminish during image downsampling. Nevertheless, among traditional and unsupervised approaches, our method remains highly competitive, preserving the fine structural details of buildings with higher fidelity.

3.3.2. Quantitative Results

We provide quantitative comparisons on the GF2 dataset. Table 3 and Table 4 show the quantitative results based on aligned and original image pairs, respectively. We emphasize that Table 3 more accurately reflects the fusion quality, as it minimizes the influence of translation misalignments. As shown in Table 3, most methods yield satisfactory results in terms of $D_{\lambda}$, with our method outperforming all other unsupervised methods. Benefiting from the MSDI, our approach achieves a better balance between spectral preservation and spatial fidelity. Consequently, ours attains the best performance not only on $D_s$ but also on QNR. For reference metrics, supervised methods naturally perform better due to the availability of GT, while our approach remains highly competitive among unsupervised and traditional methods.
When comparing Table 3 and Table 4, we observe that some methods (e.g., DDIF, ZSPan) achieve better performance when evaluated on the original unregistered pairs because they are trained and inferred on unregistered data, causing their outputs to preserve the same geometric misalignment. In contrast, our method performs worse under this setting because it explicitly aligns HRPan and LRMS, producing fused results that are more consistent with the registered references. This also indicates that evaluation metrics become unreliable when misalignment exists between HRPan and LRMS.

3.4. Comparison on the GF1 Dataset

3.4.1. Visual Results

We further evaluate the proposed method on the GF1 dataset. Figure 9 and Figure 10 present the fusion results at full and reduced resolutions, respectively. Although the GF1 dataset contains only sub-pixel translation errors (typically <1 pixel), these small misalignments still induce noticeable artifacts in the fusion results of methods that do not explicitly handle translation errors (Figure 9). Our method, along with RFDifNet and λ-PNN, effectively compensates for these errors and better preserves spectral integrity at land-cover boundaries. In contrast, supervised methods exhibit limited capability in preserving fine spatial details at full resolution, leading to noticeable boundary blurring (Figure 9). However, our method achieves an effective balance between spatial and spectral fidelity across both resolutions, outperforming other methods in preserving fine structures.

3.4.2. Quantitative Results

Table 5 and Table 6 present the quantitative results based on the translation-corrected and original images, respectively. As shown in Table 5, supervised methods perform strongly on $D_{\lambda}$, with PSGAN achieving the best overall results. This is because the GF1 dataset is dominated by large homogeneous land-cover regions (e.g., farmland), where the spectral difference between full- and reduced-resolution images is relatively small [59]. Among unsupervised methods, UCGAN achieves the best $D_{\lambda}$, while our method ranks second. However, UCGAN introduces obvious spectral distortions along linear structures such as roads (Figure 9m). Our method achieves the best $D_s$ and QNR, demonstrating its strong spatial reconstruction capability and practical potential.
When comparing Table 5 and Table 6, we observe that the no-reference metrics change only slightly, whereas the reference-based metrics exhibit more noticeable variations. For the no-reference metrics, the differences between the results computed using translation-corrected and original pairs are much smaller on GF1 than on GF2. This further supports our explanation that such improvements on unregistered data mainly stem from training and generating results on unregistered inputs. By comparing the behaviors of both the no-reference and reference metrics across the two tables, we can also find that the no-reference metrics are less sensitive to sub-pixel translation errors.

3.5. Comparison on the WV2 Dataset

3.5.1. Visual Results

Figure 11 and Figure 12 illustrate the fusion results on the WV2 dataset at full and reduced resolutions, respectively. As the displacements between HRPan and MS are minimal (<0.5 pixels on average), no noticeable artifacts appear in the fusion results across all methods. Among unsupervised methods, UCGAN preserves spectral fidelity well but exhibits evident spatial distortion, owing to the difficulty of balancing spectral and spatial fidelity. On the reduced resolution dataset, supervised methods still show a significant advantage. Compared with unsupervised and traditional methods, our method achieves better results at both full and reduced resolution.

3.5.2. Quantitative Results

Table 7 and Table 8 present the quantitative results based on the translation-corrected and original images, respectively. As shown in Table 7, on the full-resolution dataset, our method achieves the best performance in both $D_s$ and QNR and ranks second in $D_{\lambda}$. Although RFDifNet attains the best $D_{\lambda}$, its $D_s$ is considerably worse, indicating spatial distortions. On the reduced-resolution dataset, supervised methods maintain superiority. Among unsupervised methods, our method achieves the best performance across all metrics, effectively preserving both spatial and spectral information, which also demonstrates the powerful generation capability of diffusion models.
Comparing Table 7 and Table 8, we observe that differences still exist even under small displacements (<0.5 pixel), indicating that even slight misalignment can influence the evaluation of the fusion quality.

3.6. Ablation

3.6.1. TEM

In the proposed method, the TEM is an essential module for matching HRPan and MS; thus, we conduct ablation experiments on the GF2 and WV2 datasets. To demonstrate the capability of the TEM, we perform GSA using $I_{\mathrm{pan}}$ and $I_{\mathrm{Rms}}^{\theta}$, as well as $I_{\mathrm{pan}}$ and $I_{\mathrm{ms}}$, respectively, and the quantitative results are tabulated in Table 9. For the GF2 dataset, the quality of the GSA fusion results improves significantly after translation correction. In contrast, for the WV2 dataset, only slight improvements are observed. These results indicate that the TEM successfully maps HRPan and MS into a shared feature space, effectively reducing translation errors.
To verify the validity of the TEM in SIUPan, we remove it from the proposed network and then conduct comparative experiments. As the TEM is excluded, $I_{\mathrm{Rms}}^{\theta}$ is unavailable, and thus $L_R$ cannot be computed, requiring its omission. Figure 13 shows the fused images before and after removing the TEM, while Table 10 provides the corresponding quantitative evaluation.
As shown in Figure 13a,b, for the GF2 dataset, removing the TEM leads to pronounced spatial distortions and noticeable artifacts, severely degrading the fusion quality. For the WV2 dataset, where translation errors are minimal, no apparent visual differences are observed in Figure 13c,d; however, the quantitative metrics still show a consistent decline (Table 10). These findings demonstrate that even sub-pixel translation discrepancies can negatively affect the pansharpening performance, highlighting the necessity of the TEM for achieving robust and high-fidelity fusion in SIUPan.

3.6.2. $\hat{x}_0$

In SIUPan, we replace $x_0$ with $\hat{x}_0$. We assume that the difference between $x_0$ and $\hat{x}_0$ follows a zero-mean Gaussian distribution. This assumption is relatively mild, implying that our method should be insensitive to the choice of $\hat{x}_0$. To validate this, we conduct experiments using four different estimates of $\hat{x}_0$: (1) $I_{\mathrm{ones}}$, (2) $I_{\mathrm{pan}}^{C}$, (3) $I_{\mathrm{GSA}}$, and (4) GT. Here, $I_{\mathrm{ones}}$ denotes an all-ones matrix, while $I_{\mathrm{GSA}}$ represents the fusion result obtained by the GSA method. Additionally, since GT is only available on reduced resolution datasets, we conducted experiments for the fourth case only at reduced resolution. The results are shown in Table 11.
As tabulated in Table 11, $I_{\mathrm{ones}}$ results in relatively poor performance, because $I_{\mathrm{ones}} - I_{\mathrm{hrms}}$ deviates from the zero-mean assumption in (4). In contrast, the other three estimates produce comparable results, demonstrating that the proposed method is largely insensitive to the choice of $\hat{x}_0$. Therefore, considering both accuracy and efficiency, we use $I_{\mathrm{pan}}^{C}$ as $\hat{x}_0$ in our method.

3.6.3. MSDI

The MSDI is designed to inject spatial and spectral information into the encoder when generating HRMS, enabling strong performance even on small-scale datasets. To evaluate the effectiveness of the MSDI, we conduct ablation experiments on the WV2 dataset. The MSDI employs cross-attention at multiple scales to fully exploit the information contained in HRPan and MS, balancing the spatial and spectral details in the fusion results. As shown in Table 12, removing the MSDI leads to a noticeable performance drop on the WV2 dataset. In particular, $D_s$ degrades substantially, indicating that without the MSDI, the fused images suffer from spatial distortions. These results verify that the MSDI plays a crucial role in enhancing the spatial consistency and overall fusion quality of SIUPan.

3.6.4. Loss Terms

There are four loss terms in the proposed method: three fusion losses and one registration loss. To validate the necessity of each component, we conduct ablation experiments, and the quantitative results are summarized in Table 13. As shown in Table 13, removing $L_{\mathrm{spa}}$ leads to severely degraded fusion results, as it is the only loss directly constraining the spatial fidelity. The absence of $L_{\mathrm{spe1}}$ causes the model to fail in achieving the best results at full resolution. The loss term $L_{\mathrm{spe2}}$ further refines the spectral curve and helps balance the fusion quality across multiple scales; its removal results in moderate degradation in all metrics. With $L_R$, the TEM provides more accurate translation error estimation between HRPan and MS, leading to improved performance on $D_s$. According to the experimental results, each term of the loss function ensures that the model can achieve good results at both full and reduced resolutions.

4. Conclusions

In this article, we propose a shift-invariant unsupervised pansharpening network. In the proposed method, a translation error estimation module, i.e., TEM, is employed to estimate the displacements between HRPan and MS. Moreover, a multiscale detail injection module is designed to balance the spatial and spectral fidelity. Extensive experiments on the GF2, GF1, and WV2 datasets demonstrate that the proposed method outperforms baseline pansharpening approaches. The ablation studies further confirm that each designed module and loss function plays an essential role in the overall performance of the model.
Compared with supervised methods, unsupervised methods can exploit the original information of HRPan and MS. Nevertheless, because GT is absent, their spectral and spatial losses are computed independently, highlighting the importance of accurately handling translation errors between HRPan and MS. Furthermore, our work primarily focuses on translation errors between HRPan and MS, while potential misalignments among individual MS bands also deserve further study.

Author Contributions

Conceptualization, J.X. and L.J.; methodology, J.X., L.J. and J.Y.; validation, J.X. and J.Y.; formal analysis, Q.F., J.X. and J.L.; investigation, J.X. and J.L.; resources, Y.Z., Q.F. and K.L.; data curation, J.L. and K.L.; writing—original draft preparation, J.X.; writing—review and editing, L.J. and J.Y.; visualization, J.X.; funding acquisition, K.L. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The log-likelihood in (9) can be expressed as follows:
$$
\begin{aligned}
\ln p(x_0) &= \ln \int p(x_0, \hat{x}_{0:T})\, d\hat{x}_{0:T} = \ln \int \frac{p(x_0, \hat{x}_{0:T})\, q(\hat{x}_{0:T} \mid x_0)}{q(\hat{x}_{0:T} \mid x_0)}\, d\hat{x}_{0:T} \\
&\geq \mathbb{E}_q\!\left[ \ln \frac{p_{\theta}(x_0, \hat{x}_{0:T})}{q(\hat{x}_{0:T} \mid x_0)} \right] = \mathbb{E}_q\!\left[ \ln \frac{p(\hat{x}_T)\, p(x_0 \mid \hat{x}_0) \prod_{t=1}^{T} p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)}{q(\hat{x}_0 \mid x_0) \prod_{t=1}^{T} q(\hat{x}_t \mid \hat{x}_{t-1})} \right] \\
&= \mathbb{E}_q\!\left[ \ln \frac{p(\hat{x}_T)\, p_{\theta}(\hat{x}_0 \mid \hat{x}_1)\, p(x_0 \mid \hat{x}_0)}{q(\hat{x}_1 \mid \hat{x}_0)\, q(\hat{x}_0 \mid x_0)} \right] + \mathbb{E}_q\!\left[ \ln \frac{\prod_{t=2}^{T} p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)}{\prod_{t=2}^{T} q(\hat{x}_t \mid \hat{x}_{t-1}, x_0)} \right] \\
&= \mathbb{E}_q\!\left[ \ln \frac{p(\hat{x}_T)\, p_{\theta}(\hat{x}_0 \mid \hat{x}_1)\, p(x_0 \mid \hat{x}_0)}{q(\hat{x}_1 \mid \hat{x}_0)\, q(\hat{x}_0 \mid x_0)} \right] + \mathbb{E}_q\!\left[ \ln \frac{\prod_{t=2}^{T} p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)}{\prod_{t=2}^{T} \frac{q(\hat{x}_{t-1} \mid \hat{x}_t, x_0)\, q(\hat{x}_t \mid x_0)}{q(\hat{x}_{t-1} \mid x_0)}} \right] \\
&= \mathbb{E}_q\!\left[ \ln \frac{p(\hat{x}_T)\, p_{\theta}(\hat{x}_0 \mid \hat{x}_1)\, p(x_0 \mid \hat{x}_0)\, q(\hat{x}_1 \mid x_0)}{q(\hat{x}_1 \mid \hat{x}_0)\, q(\hat{x}_0 \mid x_0)\, q(\hat{x}_T \mid x_0)} \right] + \mathbb{E}_q\!\left[ \ln \prod_{t=2}^{T} \frac{p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)}{q(\hat{x}_{t-1} \mid \hat{x}_t, x_0)} \right] \\
&= \mathbb{E}_q\!\left[ \ln \frac{p_{\theta}(\hat{x}_0 \mid \hat{x}_1)\, p(x_0 \mid \hat{x}_0)}{q(\hat{x}_0 \mid x_0)} \right] + \mathbb{E}_q\!\left[ \ln \frac{p(\hat{x}_T)}{q(\hat{x}_T \mid x_0)} \right] + \mathbb{E}_q\!\left[ \sum_{t=2}^{T} \ln \frac{p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)}{q(\hat{x}_{t-1} \mid \hat{x}_t, x_0)} \right] + \mathbb{E}_q\!\left[ \ln \frac{q(\hat{x}_1 \mid x_0)}{q(\hat{x}_1 \mid \hat{x}_0)} \right] \\
&:= \mathcal{L}_{\mathrm{ELBO}}. \tag{A1}
\end{aligned}
$$
We note that, once $\hat{x}_0$ is fixed, $\delta$ is independent of $\hat{x}_t$; thus, $\mathcal{L}_{\mathrm{ELBO}}$ can be rewritten as
$$
\mathcal{L}_{\mathrm{ELBO}} = \underbrace{\mathbb{E}_q\big[\ln p_{\theta}(\hat{x}_0 \mid \hat{x}_1)\big] - C_1(\delta)}_{\text{reconstruction term}} - \underbrace{D_{KL}\big(q(\hat{x}_T \mid x_0)\,\|\, p(\hat{x}_T)\big)}_{\text{prior matching term}} - \underbrace{\sum_{t=2}^{T} \mathbb{E}_q\Big[D_{KL}\big(q(\hat{x}_{t-1} \mid \hat{x}_t, x_0)\,\|\, p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)\big)\Big]}_{\text{denoising matching term}} + \underbrace{C_2(\delta)}_{\text{constant term}}, \tag{A2}
$$
where $C_1(\delta)$ and $C_2(\delta)$ are terms that depend only on $\delta$. Therefore, similar to DDPM, to maximize $\mathcal{L}_{\mathrm{ELBO}}$, we only need to maximize the reconstruction term and the denoising matching term.
For the reconstruction term, we need to maximize $\ln p_{\theta}(\hat{x}_0 \mid \hat{x}_1)$. Expanding it, we get
$$
\begin{aligned}
\ln p_{\theta}(\hat{x}_0 \mid \hat{x}_1) &= \ln \frac{1}{\sqrt{(2\pi)^n |\hat{\Sigma}|}} - \frac{1}{2} (\hat{x}_0 - \mu_{t-1})^{T} \hat{\Sigma}^{-1} (\hat{x}_0 - \mu_{t-1}) \\
&\propto -\frac{1}{2} (\hat{x}_0 - \mu_{t-1})^{T} \hat{\Sigma}^{-1} (\hat{x}_0 - \mu_{t-1}) \propto -\frac{1}{2} \|\hat{x}_0 - \mu_{t-1}\|_2^2 \\
&= -\frac{1}{2} \| x_0 - (z + \mu_{t-1}) \|_2^2 = -\frac{1}{2} \| x_0 - x_0^{\theta} \|_2^2, \tag{A3}
\end{aligned}
$$
where $x_0^{\theta}$ is the prediction of the network, and $\mu_{t-1}$ and $\hat{\Sigma}$ are the mean and the variance, respectively. In (A3), $z$ is absorbed into the prediction of the network. Thus, to maximize $\ln p_{\theta}(\hat{x}_0 \mid \hat{x}_1)$, we just need to minimize $\| x_0 - x_0^{\theta} \|_2^2$.
For the denoising matching term, we use $p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)$ to approximate the real distribution $q(\hat{x}_{t-1} \mid \hat{x}_t, x_0)$, which is a conditional Gaussian distribution and can be expressed as
$$
\begin{aligned}
q(\hat{x}_{t-1} \mid \hat{x}_t, x_0) &= \frac{q(\hat{x}_t \mid \hat{x}_{t-1}, x_0)\, q(\hat{x}_{t-1} \mid x_0)}{q(\hat{x}_t \mid x_0)} \\
&= \frac{\mathcal{N}\big(\hat{x}_t; \sqrt{\alpha_t}\,\hat{x}_{t-1},\, (1-\alpha_t)\mathbf{I}\big)\, \mathcal{N}\big(\hat{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}}\, x_0,\, (1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1})\mathbf{I}\big)}{\mathcal{N}\big(\hat{x}_t; \sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t)\mathbf{I}\big)} \\
&\propto \exp\!\left\{ -\left[ \frac{(\hat{x}_t - \sqrt{\alpha_t}\,\hat{x}_{t-1})^2}{2(1-\alpha_t)} + \frac{(\hat{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0)^2}{2(1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1})} - \frac{(\hat{x}_t - \sqrt{\bar{\alpha}_t}\, x_0)^2}{2(1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t)} \right] \right\} \\
&\propto \exp\!\left\{ -\frac{1}{2} \left( \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1}} \right) \left[ \hat{x}_{t-1}^2 - 2\, \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1})\,\hat{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\, x_0}{1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t}\, \hat{x}_{t-1} \right] \right\} \\
&\sim \mathcal{N}\big(\hat{x}_{t-1}; \mu_q(t), \Sigma_q(t)\big), \tag{A4}
\end{aligned}
$$
where $\mu_q(t)$ and $\Sigma_q(t)$ are the mean and the variance, respectively. Their expressions are
$$\mu_q(t) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1})\,\hat{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\, x_0}{1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t}, \tag{A5}$$
$$\Sigma_q(t) = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t}\, \mathbf{I} = \sigma_q^2(t)\, \mathbf{I}. \tag{A6}$$
We assume that $p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t) = \mathcal{N}(\hat{x}_{t-1}; \mu_p, \Sigma_p)$. Similar to DDPM, we treat the variance as a constant that does not need to be learned by the network, setting $\Sigma_p = \Sigma_q(t)$, and define $\mu_p$ as
$$\mu_p = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}+\delta^2\bar{\alpha}_{t-1})\,\hat{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\, x_0^{\theta}}{1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t}, \tag{A7}$$
where $x_0^{\theta}$ is the prediction of the network. Thus, all the KL divergences can be expressed as
$$
\begin{aligned}
D_{KL}\big(q(\hat{x}_{t-1} \mid \hat{x}_t, x_0)\,\|\, p_{\theta}(\hat{x}_{t-1} \mid \hat{x}_t)\big) &= D_{KL}\big(\mathcal{N}(\hat{x}_{t-1}; \mu_q(t), \Sigma_q(t))\,\|\, \mathcal{N}(\hat{x}_{t-1}; \mu_p, \Sigma_q(t))\big) \\
&= \frac{1}{2}\left[ \ln \frac{|\Sigma_q(t)|}{|\Sigma_q(t)|} - d + \mathrm{tr}\big(\Sigma_q(t)^{-1}\Sigma_q(t)\big) + (\mu_p - \mu_q(t))^{T}\, \Sigma_q(t)^{-1}\, (\mu_p - \mu_q(t)) \right] \\
&= \frac{1}{2\sigma_q^2(t)}\, \| \mu_p - \mu_q(t) \|_2^2 = \frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{2\sigma_q^2(t)\,(1-\bar{\alpha}_t+\delta^2\bar{\alpha}_t)^2}\, \| x_0^{\theta} - x_0 \|_2^2 \\
&\propto \| x_0^{\theta} - x_0 \|_2^2. \tag{A8}
\end{aligned}
$$
According to (A3) and (A8), we can obtain
$$\mathcal{L}_{\mathrm{ELBO}} \propto -\| x_0^{\theta} - x_0 \|_2^2. \tag{A9}$$
Thus, to maximize $\mathcal{L}_{\mathrm{ELBO}}$, we only need to minimize $\| x_0^{\theta} - x_0 \|_2^2$.

References

  1. Gong, P.; Wang, J.; Huang, H. Stable Classification with Limited Samples in Global Land Cover Mapping: Theory and Experiments. Sci. Bull. 2024, 69, 1862–1865. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, Y.; Sun, Y.; Cao, X.; Wang, Y.; Zhang, W.; Cheng, X. A Review of Regional and Global Scale Land Use/Land Cover (LULC) Mapping Products Generated from Satellite Remote Sensing. ISPRS J. Photogramm. Remote Sens. 2023, 206, 311–334. [Google Scholar] [CrossRef]
  3. Zhao, Q.; Ji, L.; Su, Y.; Yu, K.; Zhao, Y. Monitoring changes to small-sized lakes using high spatial and temporal satellite imagery in the Badain Jaran Desert from 2015 to 2020. Int. J. Crowd Sci. 2025, in press. [Google Scholar] [CrossRef]
  4. Vivone, G.; Alparone, L.; Chanussot, J.; Dalla Mura, M.; Garzelli, A.; Licciardi, G.A.; Restaino, R.; Wald, L. A Critical Comparison Among Pansharpening Algorithms. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2565–2586. [Google Scholar] [CrossRef]
  5. Tu, T.M.; Huang, P.; Hung, C.L.; Chang, C.P. A Fast Intensity-Hue-Saturation Fusion Technique with Spectral Adjustment for IKONOS Imagery. IEEE Geosci. Remote Sens. Lett. 2004, 1, 309–312. [Google Scholar] [CrossRef]
  6. Aiazzi, B.; Baronti, S.; Selva, M. Improving Component Substitution Pansharpening Through Multivariate Regression of MS +Pan Data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239. [Google Scholar] [CrossRef]
  7. Meng, X.; Xiong, Y.; Shao, F.; Shen, H.; Sun, W.; Yang, G.; Yuan, Q.; Fu, R.; Zhang, H. A Large-Scale Benchmark Data Set for Evaluating Pansharpening Performance: Overview and Implementation. IEEE Geosci. Remote Sens. Mag. 2021, 9, 18–52. [Google Scholar] [CrossRef]
  8. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. MTF-tailored Multiscale Fusion of High-resolution MS and Pan Imagery. Photogramm. Eng. Remote Sens. 2006, 72, 591–596. [Google Scholar] [CrossRef]
  9. Otazu, X.; Gonzalez-Audicana, M.; Fors, O.; Nunez, J. Introduction of Sensor Spectral Response into Image Fusion Methods. Application to Wavelet-Based Methods. IEEE Trans. Geosci. Remote Sens. 2005, 43, 2376–2385. [Google Scholar] [CrossRef]
  10. Ballester, C.; Caselles, V.; Igual, L.; Verdera, J.; Rougé, B. A Variational Model for P+XS Image Fusion. Int. J. Comput. Vis. 2006, 69, 43–58. [Google Scholar] [CrossRef]
  11. Vicinanza, M.R.; Restaino, R.; Vivone, G.; Dalla Mura, M.; Chanussot, J. A Pansharpening Method Based on the Sparse Representation of Injected Details. IEEE Geosci. Remote Sens. Lett. 2015, 12, 180–184. [Google Scholar] [CrossRef]
  12. Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O. A New Pansharpening Algorithm Based on Total Variation. IEEE Geosci. Remote Sens. Lett. 2014, 11, 318–322. [Google Scholar] [CrossRef]
  13. Chen, Y.; Wan, Z.; Chen, Z.; Wei, M. CSLP: A Novel Pansharpening Method Based on Compressed Sensing and L-PNN. Inf. Fusion 2025, 118, 103002. [Google Scholar] [CrossRef]
  14. Wang, H.; Zhang, H.; Tian, X.; Ma, J. Zero-Sharpen: A Universal Pansharpening Method across Satellites for Reducing Scale-Variance Gap via Zero-Shot Variation. Inf. Fusion 2024, 101, 102003. [Google Scholar] [CrossRef]
  15. Deng, L.J.; Vivone, G.; Paoletti, M.E.; Scarpa, G.; He, J.; Zhang, Y.; Chanussot, J.; Plaza, A. Machine Learning in Pansharpening: A Benchmark, from Shallow to Deep Networks. IEEE Geosci. Remote Sens. Mag. 2022, 10, 279–315. [Google Scholar] [CrossRef]
  16. Ciotola, M.; Poggi, G.; Scarpa, G. Unsupervised Deep Learning-Based Pansharpening with Jointly Enhanced Spectral and Spatial Fidelity. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5405417. [Google Scholar] [CrossRef]
  17. Scarpa, G.; Ciotola, M. Full-Resolution Quality Assessment for Pansharpening. Remote Sens. 2022, 14, 1808. [Google Scholar] [CrossRef]
  18. Zhou, H.; Liu, Q.; Weng, D.; Wang, Y. Unsupervised Cycle-Consistent Generative Adversarial Networks for Pan Sharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5408814. [Google Scholar] [CrossRef]
  19. Lin, H.; Dong, Y.; Ding, X.; Liu, T.; Liu, Y. Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration. Proc. AAAI Conf. Artif. Intell. 2024, 38, 3386–3394. [Google Scholar] [CrossRef]
  20. Cao, Z.; Cao, S.; Deng, L.J.; Wu, X.; Hou, J.; Vivone, G. Diffusion Model with Disentangled Modulations for Sharpening Multispectral and Hyperspectral Images. Inf. Fusion 2024, 104, 102158. [Google Scholar] [CrossRef]
  21. Rui, X.; Cao, X.; Pang, L.; Zhu, Z.; Yue, Z.; Meng, D. Unsupervised Hyperspectral Pansharpening via Low-Rank Diffusion Model. Inf. Fusion 2024, 107, 102325. [Google Scholar] [CrossRef]
  22. Jiang, H.; Chen, Z. Transformer-Based Diffusion and Spectral Priors Model for Hyperspectral Pansharpening. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 18962–18977. [Google Scholar] [CrossRef]
  23. Xiao, J.L.; Huang, T.Z.; Deng, L.J.; Lin, G.; Cao, Z.; Li, C.; Zhao, Q. Hyperspectral Pansharpening via Diffusion Models with Iteratively Zero-Shot Guidance. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 12669–12678. [Google Scholar] [CrossRef]
  24. Xiong, Z.; Li, W.; Zhao, X.; Zhang, B.; Tao, R.; Du, Q. PRF-Net: A Progressive Remote Sensing Image Registration and Fusion Network. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 9437–9450. [Google Scholar] [CrossRef]
  25. Lee, J.; Seo, S.; Kim, M. SIPSA-Net: Shift-Invariant Pan Sharpening with Moving Object Alignment for Satellite Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10166–10174. [Google Scholar]
  26. Seo, S.; Choi, J.S.; Lee, J.; Kim, H.H.; Seo, D.; Jeong, J.; Kim, M. UPSNet: Unsupervised Pan-Sharpening Network with Registration Learning Between Panchromatic and Multi-Spectral Images. IEEE Access 2020, 8, 201199–201217. [Google Scholar] [CrossRef]
  27. Dai, H.; Liu, X.; Qiao, Y.; Zheng, K.; Xiao, X.; Cai, Z. UFN-GAN: An Unsupervised Generative Adversarial Network for Remote Sensing Image Fusion. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; pp. 1803–1808. [Google Scholar] [CrossRef]
  28. Ciotola, M.; Vitale, S.; Mazza, A.; Poggi, G.; Scarpa, G. Pansharpening by Convolutional Neural Networks in the Full Resolution Framework. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5408717. [Google Scholar] [CrossRef]
  29. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A. Context-Driven Fusion of High Spatial and Spectral Resolution Images Based on Oversampled Multiresolution Analysis. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2300–2312. [Google Scholar] [CrossRef]
  30. Aimé, P.; Drumetz, L.; Mura, M.D.; Bajjouk, T.; Garello, R. Consistency and Ambiguities of Quality No Reference Metric for Pansharpening. In Proceedings of the IGARSS 2023—IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 5583–5586. [Google Scholar] [CrossRef]
Figure 1. Fused results on the GF2 dataset. As marked by the rectangles, duplicated contours and spectral distortion appear in GSA and DDIF; SIUPan (ours) eliminates these artifacts. (a) HRPan; (b) LRMS; (c) GSA; (d) DDIF; (e) λ-PNN; (f) SIUPan.
Figure 2. Overall flowchart of the proposed method. $I_{\text{pan}}$ and $I_{\text{ms}}$ are first fed into the TEM to estimate the displacements between them. The displacements are used to resample $I_{\text{ms}}$ into $I_{\text{Rms}}^{\theta}$, which is employed for fusion and loss calculation. $\hat{I}_{\text{res}}^{\theta}$ is generated by the diffusion model with the MSDI. The fusion result, $\hat{I}_{\text{hrms}}^{\theta}$, is obtained by adding $\hat{I}_{\text{res}}^{\theta}$ and $I_{\text{Rms}}^{\theta}$ together.
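For concreteness, the data flow in Figure 2 can be summarized in a few lines of pseudocode. This is a minimal sketch under our own naming assumptions; the module interfaces `tem`, `resample`, and `diffusion` are illustrative and are not the authors' released code:

```python
import torch

def siupan_fuse(i_pan: torch.Tensor, i_ms: torch.Tensor, tem, resample, diffusion):
    """Sketch of one SIUPan fusion pass following Figure 2.

    i_pan: (B, 1, H, W) panchromatic input I_pan
    i_ms:  (B, C, H, W) upsampled multispectral input I_ms
    """
    # TEM estimates the subpixel displacement between the two inputs.
    dm, dn = tem(i_pan, i_ms)
    # Resampling I_ms with the estimated shift gives I_Rms^theta,
    # which is used both for fusion and for the loss terms.
    i_rms = resample(i_ms, dm, dn)
    # The diffusion model, guided by MSDI features, generates the
    # residual detail image (hat I_res^theta).
    i_res = diffusion(i_pan, i_rms)
    # Fusion result: the residual is added back onto the corrected MS image.
    return i_rms + i_res
```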
Figure 3. The forward and reverse processes of the diffusion model in our work. In the forward process (black solid arrows), $\hat{x}_0$ is used to sample $\hat{x}_t$. In the reverse process (pink solid arrows), our network predicts $x_0$ directly, and the generated $x_0$ is used to sample $\hat{x}_{t-1}$.
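For reference, the forward sampling step and the $x_0$-prediction reverse step in Figure 3 follow the standard denoising diffusion formulation. The relations below use the usual notation ($\beta_t$ noise schedule, $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$) and are a textbook restatement rather than the paper's exact parameterization:

$$q(\hat{x}_t \mid \hat{x}_0) = \mathcal{N}\big(\hat{x}_t;\ \sqrt{\bar{\alpha}_t}\,\hat{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),$$

and, once the network has predicted $x_0$, $\hat{x}_{t-1}$ is drawn from the Gaussian posterior

$$q(\hat{x}_{t-1} \mid \hat{x}_t, x_0) = \mathcal{N}\big(\hat{x}_{t-1};\ \tilde{\mu}_t,\ \tilde{\beta}_t\mathbf{I}\big),\qquad
\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\hat{x}_t,\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t.$$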
Figure 4. The architecture of the MSDI module. Each layer of the MSDI consists of four blocks: two cross-attention blocks, which exploit the latent information in the HRPan and MS features, and two downsampling blocks, which reduce the size of the feature maps.
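A cross-attention block of the kind the MSDI caption describes can be sketched as follows. The channel layout, normalization placement, and head count are assumptions made for illustration; the paper's efficient-attention variant may differ:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Minimal cross-attention between two feature streams: queries
    come from one modality, keys/values from the other."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_query: torch.Tensor, f_context: torch.Tensor):
        # f_query, f_context: (B, C, H, W) feature maps of equal size.
        b, c, h, w = f_query.shape
        q = f_query.flatten(2).transpose(1, 2)     # (B, HW, C)
        kv = f_context.flatten(2).transpose(1, 2)  # (B, HW, C)
        out, _ = self.attn(self.norm(q), kv, kv)   # cross-attention
        # Residual connection keeps the query stream dominant.
        return (q + out).transpose(1, 2).reshape(b, c, h, w)
```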
Figure 5. False-color images of feature maps. (a) $F_{\text{ms}}^{1}$. (b) $F_{\text{ms}}^{1}$. (c) $F_{\text{pan}}^{1}$. (d) $F_{\text{pan}}^{1}$.
Figure 6. The architecture of the TEM. Two feature extractors map $I_{\text{pan}}$ and $I_{\text{ms}}$ into the feature domain. A feature integration block, consisting of a MidBlock, three DownBlocks, a convolutional layer, an activation function, and a global average pooling layer, extracts their differences and outputs the displacement vector $[\Delta m^{\theta}, \Delta n^{\theta}]$.
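The TEM topology in Figure 6 maps two images to a single two-element displacement vector. Below is a minimal sketch, assuming simple convolutional stand-ins for the DownBlock/MidBlock designs and a 4-band MS input (both are assumptions, not the paper's exact layer specification):

```python
import torch
import torch.nn as nn

class TEM(nn.Module):
    """Sketch of the TEM topology in Figure 6; widths, depths, and
    the internal block designs are illustrative assumptions."""

    def __init__(self, ch: int = 32, ms_bands: int = 4):
        super().__init__()
        self.enc_pan = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.enc_ms = nn.Sequential(nn.Conv2d(ms_bands, ch, 3, padding=1), nn.ReLU())
        self.integrate = nn.Sequential(
            # Three strided convolutions stand in for the DownBlocks.
            nn.Conv2d(2 * ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2, 3, padding=1),  # final conv to 2 channels
        )

    def forward(self, i_pan: torch.Tensor, i_ms: torch.Tensor):
        f = torch.cat([self.enc_pan(i_pan), self.enc_ms(i_ms)], dim=1)
        d = self.integrate(f)
        # Global average pooling yields the displacement vector [dm, dn].
        return d.mean(dim=(2, 3))  # (B, 2)
```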
Figure 7. Fused results on the GF2 full-resolution dataset, visualized in RGB. (a) HRPan; (b) MS. Traditional: (c) GSA, (d) SFNLR, (e) NC-FSRM. Supervised: (f) DDIF, (g) S2DBPN, (h) PSGAN, (i) RFDifNet. Unsupervised: (j) ZSPan, (k) PSDip, (l) λ-PNN, (m) UCGAN, (n) SIUPan.
Figure 8. Fused results on the GF2 reduced-resolution dataset, visualized in RGB. (a) GT; (b) HRPan. Traditional: (c) GSA, (d) SFNLR, (e) NC-FSRM. Supervised: (f) DDIF, (g) S2DBPN, (h) PSGAN, (i) RFDifNet. Unsupervised: (j) ZSPan, (k) PSDip, (l) λ-PNN, (m) UCGAN, (n) SIUPan.
Figure 9. Fused results on the GF1 full-resolution dataset, visualized in RGB. (a) HRPan; (b) MS. Traditional: (c) GSA, (d) SFNLR, (e) NC-FSRM. Supervised: (f) DDIF, (g) S2DBPN, (h) PSGAN, (i) RFDifNet. Unsupervised: (j) ZSPan, (k) PSDip, (l) λ-PNN, (m) UCGAN, (n) SIUPan.
Figure 10. Fused results on the GF1 reduced-resolution dataset, visualized in RGB. (a) GT; (b) HRPan. Traditional: (c) GSA, (d) SFNLR, (e) NC-FSRM. Supervised: (f) DDIF, (g) S2DBPN, (h) PSGAN, (i) RFDifNet. Unsupervised: (j) ZSPan, (k) PSDip, (l) λ-PNN, (m) UCGAN, (n) SIUPan.
Figure 11. Fused results on the WV2 full-resolution dataset, visualized in RGB. (a) HRPan; (b) MS. Traditional: (c) GSA, (d) SFNLR, (e) NC-FSRM. Supervised: (f) DDIF, (g) S2DBPN, (h) PSGAN, (i) RFDifNet. Unsupervised: (j) ZSPan, (k) PSDip, (l) λ-PNN, (m) UCGAN, (n) SIUPan.
Figure 12. Fused results on the WV2 reduced-resolution dataset, visualized in RGB. (a) GT; (b) HRPan. Traditional: (c) GSA, (d) SFNLR, (e) NC-FSRM. Supervised: (f) DDIF, (g) S2DBPN, (h) PSGAN, (i) RFDifNet. Unsupervised: (j) ZSPan, (k) PSDip, (l) λ-PNN, (m) UCGAN, (n) SIUPan.
Figure 13. Results of TEM ablation study. Visualized in RGB. (a) With TEM on the GF2 dataset. (b) Without TEM on the GF2 dataset. (c) With TEM on the WV2 dataset. (d) Without TEM on the WV2 dataset.
Table 1. Frequently used notations.

| Notation | Size | Description |
|---|---|---|
| $I_{\text{lrms}}$ | $w \times h \times C$ | Multispectral image |
| $I_{\text{pan}}$ | $W \times H \times 1$ | Panchromatic image |
| $I_{\text{ms}}$ | $W \times H \times C$ | Upsampled multispectral image |
| $I_{\text{Rms}}$ | $W \times H \times C$ | Perfectly registered $I_{\text{ms}}$ |
| $I_{\text{Rms}}^{\theta}$ | $W \times H \times C$ | Translation-corrected $I_{\text{ms}}$ |
| $I_{\text{hrms}}$ | $W \times H \times C$ | The desired fused multispectral image |
| $\hat{I}_{\text{hrms}}$ | $W \times H \times C$ | Estimate of $I_{\text{hrms}}$ |
| $\hat{I}_{\text{hrms}}^{\theta}$ | $W \times H \times C$ | The output image of SIUPan |
| $I_{\text{res}}$ | $W \times H \times C$ | $I_{\text{hrms}} - I_{\text{Rms}}$ |
| $\hat{I}_{\text{res}}$ | $W \times H \times C$ | $\hat{I}_{\text{hrms}} - I_{\text{Rms}}^{\theta}$ |
| $\hat{I}_{\text{res}}^{\theta}$ | $W \times H \times C$ | $\hat{I}_{\text{hrms}}^{\theta} - I_{\text{Rms}}^{\theta}$ |
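The last three rows of Table 1 encode the residual decomposition the framework relies on; written out (a restatement of the table, not an addition to the method):

$$\hat{I}_{\text{res}}^{\theta} = \hat{I}_{\text{hrms}}^{\theta} - I_{\text{Rms}}^{\theta}
\quad\Longleftrightarrow\quad
\hat{I}_{\text{hrms}}^{\theta} = I_{\text{Rms}}^{\theta} + \hat{I}_{\text{res}}^{\theta}.$$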
Table 2. Details of the datasets. We build full-resolution and reduced-resolution datasets from GF2, WV2, and GF1 images.

| Sensor | Bands | Bit Depth | Displacement | Dataset | GSD (PAN / MS) | Train Crop (PAN / MS) | Test Crop (PAN / MS) | Images (Train / Test) |
|---|---|---|---|---|---|---|---|---|
| GF2 | 4 | 10 | [1.8, 1.9] | Reduced | 3.2 m / 12.8 m | 64×64×1 / 16×16×4 | 512×512×1 / 128×128×4 | 16,000 / 30 |
| GF2 | 4 | 10 | [1.8, 1.9] | Full | 0.8 m / 3.2 m | 64×64×1 / 16×16×4 | 512×512×1 / 128×128×4 | 16,000 / 30 |
| GF1 | 4 | 10 | [0.6, 0.8] | Reduced | 8.0 m / 32.0 m | 64×64×1 / 16×16×4 | 512×512×1 / 128×128×4 | 15,616 / 27 |
| GF1 | 4 | 10 | [0.6, 0.8] | Full | 2.0 m / 8.0 m | 64×64×1 / 16×16×4 | 512×512×1 / 128×128×4 | 15,616 / 27 |
| WV2 | 4 | 11 | [0.4, 0.3] | Reduced | 2.0 m / 8.0 m | 64×64×1 / 16×16×4 | 512×512×1 / 128×128×4 | 16,000 / 30 |
| WV2 | 4 | 11 | [0.4, 0.3] | Full | 0.5 m / 2.0 m | 64×64×1 / 16×16×4 | 512×512×1 / 128×128×4 | 16,000 / 30 |
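The reduced-resolution datasets in Table 2 follow the usual Wald-protocol construction: both inputs are degraded by the sensor resolution ratio (here 4, e.g., 64×64 PAN patches paired with 16×16 MS patches) so that the original MS can serve as the reference. A minimal sketch, assuming bicubic decimation; the paper's exact degradation filter (e.g., an MTF-matched kernel) may differ:

```python
import torch
import torch.nn.functional as F

def make_reduced_resolution_pair(pan: torch.Tensor, lrms: torch.Tensor, ratio: int = 4):
    """Build a Wald-protocol pair by degrading both inputs.

    pan:  (1, H, W) full-resolution panchromatic patch
    lrms: (C, H // ratio, W // ratio) multispectral patch
    Returns the degraded (pan, lrms) pair; the original lrms then
    acts as the ground truth at reduced resolution.
    """
    pan_lr = F.interpolate(pan.unsqueeze(0), scale_factor=1 / ratio,
                           mode="bicubic", align_corners=False).squeeze(0)
    ms_lr = F.interpolate(lrms.unsqueeze(0), scale_factor=1 / ratio,
                          mode="bicubic", align_corners=False).squeeze(0)
    return pan_lr, ms_lr
```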
Table 3. Quantitative results on the GF2 dataset based on translation-corrected LRMS and HRPan. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset. The best result in each group is in bold font. ↓: lower is better; ↑: higher is better.

| Type | Model | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| Traditional | GSA | 0.0897 | **0.0884** | **0.8348** | **1.5926** | **0.9206** | **0.9666** | **0.0288** |
| Traditional | SFNLR | **0.0577** | 0.1961 | 0.7577 | 2.6620 | 0.8631 | 0.9064 | 0.0393 |
| Traditional | NC-FSRM | 0.0629 | 0.2051 | 0.7454 | 2.4518 | 0.8838 | 0.9171 | 0.0372 |
| Supervised | DDIF | 0.0631 | 0.1194 | 0.8258 | **1.2568** | **0.9569** | **0.9770** | **0.0184** |
| Supervised | S2DBPN | 0.0573 | **0.1174** | **0.8327** | 1.2840 | 0.9550 | 0.9765 | 0.0205 |
| Supervised | PSGAN | 0.0598 | 0.1306 | 0.8183 | 1.3197 | 0.9552 | 0.9756 | 0.0187 |
| Supervised | RFDifNet | **0.0477** | 0.2580 | 0.7066 | 1.9688 | 0.8760 | 0.9392 | 0.0264 |
| Unsupervised | ZSPan | 0.1309 | 0.0992 | 0.7826 | 2.8301 | 0.8516 | 0.9095 | 0.0500 |
| Unsupervised | PSDip | 0.0737 | 0.2525 | 0.6923 | 3.7595 | 0.7951 | 0.8641 | 0.0442 |
| Unsupervised | λ-PNN | 0.0687 | 0.1081 | 0.8296 | 2.7382 | 0.8546 | 0.9029 | 0.0369 |
| Unsupervised | UCGAN | 0.0600 | 0.1267 | 0.8213 | 2.5537 | 0.8323 | 0.9073 | 0.0344 |
| Unsupervised | SIUPan | **0.0557** | **0.0521** | **0.8954** | **1.4554** | **0.9238** | **0.9673** | **0.0238** |
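For orientation, the full-resolution indices in these tables combine as QNR $= (1-D_\lambda)^{\alpha}(1-D_s)^{\beta}$, usually with $\alpha=\beta=1$; because scores are averaged over test images, the tabulated QNR need not equal the product of the tabulated mean $D_\lambda$ and $D_s$. Below is a small sketch of QNR and SAM, using our own simplified implementations rather than the paper's evaluation code:

```python
import numpy as np

def qnr(d_lambda: float, d_s: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """QNR combines the spectral (D_lambda) and spatial (D_s)
    distortion indices; alpha = beta = 1 is the common choice."""
    return (1 - d_lambda) ** alpha * (1 - d_s) ** beta

def sam(x: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Mean spectral angle (radians) between images of shape (H, W, C)."""
    dot = (x * y).sum(-1)
    denom = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + eps
    return float(np.mean(np.arccos(np.clip(dot / denom, -1.0, 1.0))))
```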
Table 4. Quantitative results on the GF2 dataset based on original LRMS and HRPan. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset. The best result in each group is in bold font.

| Type | Model | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| Traditional | GSA | 0.0872 | **0.0641** | **0.8589** | **1.9471** | **0.8954** | **0.9504** | **0.0324** |
| Traditional | SFNLR | **0.0532** | 0.2210 | 0.7376 | 2.3560 | 0.8748 | 0.9274 | 0.0364 |
| Traditional | NC-FSRM | 0.0590 | 0.2300 | 0.7249 | 2.1972 | 0.8885 | 0.9347 | 0.0348 |
| Supervised | DDIF | 0.0585 | **0.1293** | 0.8204 | **0.6769** | **0.9817** | **0.9928** | **0.0140** |
| Supervised | S2DBPN | 0.0530 | 0.1298 | **0.8245** | 0.8675 | 0.9726 | 0.9892 | 0.0180 |
| Supervised | PSGAN | 0.0556 | 0.1442 | 0.8089 | 0.7950 | 0.9780 | 0.9914 | 0.0152 |
| Supervised | RFDifNet | **0.0435** | 0.2840 | 0.6848 | 2.0089 | 0.8889 | 0.9416 | 0.0293 |
| Unsupervised | ZSPan | 0.1269 | 0.1058 | 0.7801 | 2.7796 | 0.8525 | 0.9132 | 0.0496 |
| Unsupervised | PSDip | 0.0697 | 0.2779 | 0.6714 | 3.3020 | 0.8325 | 0.9026 | 0.0409 |
| Unsupervised | λ-PNN | 0.0658 | 0.1247 | 0.8163 | 2.4448 | 0.8636 | 0.9211 | 0.0342 |
| Unsupervised | UCGAN | 0.0566 | 0.1493 | 0.8027 | 2.7057 | 0.8077 | 0.9006 | 0.0369 |
| Unsupervised | SIUPan | **0.0520** | **0.0642** | **0.8868** | **1.7824** | **0.9015** | **0.9541** | **0.0282** |
Table 5. Quantitative results on the GF1 dataset based on translation-corrected LRMS and HRPan. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset. The best result in each group is in bold font.

| Type | Model | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| Traditional | GSA | 0.0675 | 0.0672 | 0.8717 | **1.3197** | **0.9208** | **0.9627** | **0.0291** |
| Traditional | SFNLR | **0.0425** | **0.0668** | **0.8936** | 1.7499 | 0.9028 | 0.9401 | 0.0344 |
| Traditional | NC-FSRM | 0.0445 | 0.0698 | 0.8891 | 1.6488 | 0.9125 | 0.9435 | 0.0331 |
| Supervised | DDIF | 0.0442 | 0.0880 | 0.8720 | **0.7454** | **0.9729** | **0.9871** | **0.0160** |
| Supervised | S2DBPN | 0.0421 | **0.0629** | 0.8979 | 0.8622 | 0.9642 | 0.9831 | 0.0190 |
| Supervised | PSGAN | **0.0393** | 0.0841 | 0.8800 | 0.8508 | 0.9667 | 0.9843 | 0.0184 |
| Supervised | RFDifNet | 0.0416 | 0.0631 | **0.8980** | 1.0630 | 0.9547 | 0.9760 | 0.0216 |
| Unsupervised | ZSPan | 0.0551 | 0.0723 | 0.8777 | 1.6870 | 0.8811 | 0.9295 | 0.0394 |
| Unsupervised | PSDip | 0.0498 | 0.0837 | 0.8707 | 2.0766 | 0.8707 | 0.9148 | 0.0397 |
| Unsupervised | λ-PNN | 0.0616 | 0.0669 | 0.8766 | 1.4342 | **0.9162** | 0.9542 | **0.0283** |
| Unsupervised | UCGAN | **0.0402** | 0.0687 | 0.8945 | 1.8216 | 0.8817 | 0.9357 | 0.0395 |
| Unsupervised | SIUPan | 0.0419 | **0.0520** | **0.9086** | **1.3738** | 0.9125 | **0.9609** | **0.0283** |
Table 6. Quantitative results on the GF1 dataset based on original LRMS and HRPan. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset. The best result in each group is in bold font.

| Type | Model | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| Traditional | GSA | 0.0667 | 0.0692 | 0.8706 | **1.4269** | **0.9133** | **0.9558** | **0.0312** |
| Traditional | SFNLR | **0.0423** | **0.0661** | **0.8945** | 1.9427 | 0.8891 | 0.9267 | 0.0376 |
| Traditional | NC-FSRM | 0.0442 | 0.0691 | 0.8900 | 1.8574 | 0.8967 | 0.9295 | 0.0364 |
| Supervised | DDIF | 0.0440 | 0.0873 | 0.8728 | **0.6684** | **0.9743** | **0.9882** | **0.0154** |
| Supervised | S2DBPN | 0.0418 | 0.0624 | 0.8985 | 0.8545 | 0.9615 | 0.9824 | 0.0197 |
| Supervised | PSGAN | **0.0392** | 0.0832 | 0.8811 | 0.7770 | 0.9680 | 0.9852 | 0.0180 |
| Supervised | RFDifNet | 0.0413 | **0.0619** | **0.8994** | 1.0244 | 0.9540 | 0.9769 | 0.0219 |
| Unsupervised | ZSPan | 0.0761 | 0.0549 | 0.8736 | 1.7844 | 0.8737 | 0.9221 | 0.0411 |
| Unsupervised | PSDip | 0.0495 | 0.0824 | 0.8721 | 2.3010 | 0.8511 | 0.8970 | 0.0431 |
| Unsupervised | λ-PNN | 0.0613 | 0.0681 | 0.8757 | 1.6296 | 0.9048 | 0.9435 | 0.0318 |
| Unsupervised | UCGAN | **0.0401** | 0.0681 | 0.8952 | 1.9599 | 0.8668 | 0.9274 | 0.0418 |
| Unsupervised | SIUPan | 0.0415 | **0.0526** | **0.9083** | **1.4971** | **0.9063** | **0.9542** | **0.0312** |
Table 7. Quantitative results on the WV2 dataset based on translation-corrected LRMS and HRPan. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset. The best result in each group is in bold font.

| Type | Model | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| Traditional | GSA | **0.0492** | 0.0603 | 0.8946 | **1.5177** | **0.9532** | **0.9723** | **0.0375** |
| Traditional | SFNLR | 0.0525 | 0.0553 | 0.8955 | 2.1754 | 0.9306 | 0.9469 | 0.0401 |
| Traditional | NC-FSRM | 0.0514 | **0.0531** | **0.8987** | 2.5857 | 0.9075 | 0.9253 | 0.0460 |
| Supervised | DDIF | 0.0472 | 0.0601 | 0.8967 | **1.0864** | **0.9772** | **0.9863** | **0.0212** |
| Supervised | S2DBPN | 0.0485 | **0.0540** | **0.9029** | 1.1842 | 0.9729 | 0.9831 | 0.0254 |
| Supervised | PSGAN | 0.0497 | 0.0591 | 0.8965 | 1.0987 | 0.9762 | 0.9858 | 0.0224 |
| Supervised | RFDifNet | **0.0440** | 0.0582 | 0.9005 | 1.1228 | 0.9749 | 0.9840 | 0.0227 |
| Unsupervised | ZSPan | 0.0886 | 0.0808 | 0.8400 | 2.8627 | 0.9274 | 0.9295 | 0.0448 |
| Unsupervised | PSDip | 0.0515 | 0.0782 | 0.8748 | 2.7615 | 0.9004 | 0.9149 | 0.0482 |
| Unsupervised | λ-PNN | 0.0618 | 0.0601 | 0.8833 | 2.1212 | 0.9413 | 0.9531 | 0.0391 |
| Unsupervised | UCGAN | 0.0509 | 0.0945 | 0.8610 | 2.8051 | 0.8678 | 0.9109 | 0.0471 |
| Unsupervised | SIUPan | **0.0472** | **0.0511** | **0.9050** | **1.6833** | **0.9533** | **0.9684** | **0.0358** |
Table 8. Quantitative results on the WV2 dataset based on original LRMS and HRPan. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset. The best result in each group is in bold font.

| Type | Model | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| Traditional | GSA | **0.0487** | 0.0688 | 0.8870 | **1.8513** | **0.9362** | **0.9606** | **0.0423** |
| Traditional | SFNLR | 0.0510 | 0.0522 | 0.9000 | 2.5727 | 0.9123 | 0.9277 | 0.0454 |
| Traditional | NC-FSRM | 0.0498 | **0.0515** | **0.9017** | 2.5857 | 0.9075 | 0.9253 | 0.0460 |
| Supervised | DDIF | 0.0471 | 0.0590 | 0.8983 | **0.9485** | **0.9802** | **0.9883** | **0.0221** |
| Supervised | S2DBPN | 0.0486 | **0.0562** | 0.9015 | 1.1909 | 0.9709 | 0.9827 | 0.0273 |
| Supervised | PSGAN | 0.0497 | 0.0588 | 0.8976 | 1.0521 | 0.9766 | 0.9859 | 0.0239 |
| Supervised | RFDifNet | **0.0439** | **0.0562** | **0.9028** | 1.0787 | 0.9750 | 0.9842 | 0.0240 |
| Unsupervised | ZSPan | 0.0872 | 0.0786 | 0.8430 | 3.1150 | 0.9094 | 0.9137 | 0.0487 |
| Unsupervised | PSDip | 0.0504 | 0.0737 | 0.8801 | 3.2033 | 0.8725 | 0.8869 | 0.0539 |
| Unsupervised | λ-PNN | 0.0624 | 0.0618 | 0.8820 | 2.5124 | 0.9240 | 0.9346 | 0.0447 |
| Unsupervised | UCGAN | 0.0489 | 0.0867 | 0.8699 | 3.1174 | 0.8437 | 0.8931 | 0.0516 |
| Unsupervised | SIUPan | **0.0473** | **0.0526** | **0.9040** | **2.0620** | **0.9366** | **0.9535** | **0.0412** |
Table 9. Quantitative assessment of translation correction on the GF2 and WV2 datasets. GSA is used to fuse HRPan and MS.

| Dataset | MS Input | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ |
|---|---|---|---|---|
| GF2 | $I_{\text{ms}}$ | 0.0897 | 0.0884 | 0.8348 |
| GF2 | $I_{\text{Rms}}^{\theta}$ | 0.0800 | 0.0748 | 0.8553 |
| WV2 | $I_{\text{ms}}$ | 0.0492 | 0.0603 | 0.8946 |
| WV2 | $I_{\text{Rms}}^{\theta}$ | 0.0485 | 0.0595 | 0.8960 |
Table 10. Quantitative assessment of the TEM ablation study on the GF2 and WV2 datasets. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset.

| Dataset | TEM | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|
| GF2 | w/o | 0.0607 | 0.1041 | 0.8409 | 2.3472 | 0.8749 | 0.9298 | 0.0337 |
| GF2 | w/ | 0.0557 | 0.0521 | 0.8954 | 1.4554 | 0.9238 | 0.9673 | 0.0238 |
| WV2 | w/o | 0.0481 | 0.0524 | 0.9027 | 1.8962 | 0.9482 | 0.9625 | 0.0386 |
| WV2 | w/ | 0.0472 | 0.0511 | 0.9050 | 1.6833 | 0.9533 | 0.9684 | 0.0358 |
Table 11. Quantitative assessment of the $\hat{x}_0$ ablation study on the WV2 dataset. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset.

| $\hat{x}_0$ | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|
| $I_{\text{ones}}$ | 0.0491 | 0.0545 | 0.9006 | 1.7288 | 0.9516 | 0.9669 | 0.0364 |
| $I_{\text{pan}}^{C}$ | 0.0472 | 0.0511 | 0.9050 | 1.6833 | 0.9533 | 0.9684 | 0.0358 |
| $I_{\text{GSA}}$ | 0.0475 | 0.0519 | 0.9041 | 1.6856 | 0.9531 | 0.9687 | 0.0356 |
| GT | – | – | – | 1.6889 | 0.9530 | 0.9682 | 0.0357 |
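Table 11 compares choices for initializing $\hat{x}_0$: an all-ones image, the PAN band replicated across the $C$ spectral channels, a classical GSA fusion result, and the ground truth. A hypothetical helper illustrating our reading of these options (the paper's exact construction may differ):

```python
import torch

def make_x0(mode: str, i_pan: torch.Tensor, c: int = 4, gsa_result=None):
    """Candidate x0 initializations compared in Table 11.

    i_pan: (B, 1, H, W) panchromatic input.
    """
    if mode == "ones":          # I_ones: constant all-ones image
        return torch.ones_like(i_pan).repeat(1, c, 1, 1)
    if mode == "pan":           # I_pan^C: PAN copied to C bands
        return i_pan.repeat(1, c, 1, 1)
    if mode == "gsa":           # I_GSA: a classical fusion result
        return gsa_result
    raise ValueError(f"unknown mode: {mode}")
```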
Table 12. Quantitative assessment of the MSDI ablation study on the WV2 dataset. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset.

| MSDI | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|
| w/o | 0.0470 | 0.1154 | 0.8440 | 2.7202 | 0.8667 | 0.9152 | 0.0396 |
| w/ | 0.0472 | 0.0511 | 0.9050 | 1.6833 | 0.9533 | 0.9684 | 0.0358 |
Table 13. Quantitative assessment of different loss functions on the WV2 dataset. Each of the first four rows removes one loss term (×); the last row uses all terms. $D_\lambda$, $D_s$, and QNR are evaluated on the full-resolution dataset; ERGAS, SSIM, SCC, and SAM on the reduced-resolution dataset.

| $\mathcal{L}_{R}$ | $\mathcal{L}_{\text{spa}}$ | $\mathcal{L}_{\text{spe1}}$ | $\mathcal{L}_{\text{spe2}}$ | $D_\lambda$ ↓ | $D_s$ ↓ | QNR ↑ | ERGAS ↓ | SSIM ↑ | SCC ↑ | SAM ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| × | ✓ | ✓ | ✓ | 0.0459 | 0.0536 | 0.9044 | 1.7838 | 0.9455 | 0.9654 | 0.0355 |
| ✓ | × | ✓ | ✓ | 0.1367 | 0.6200 | 0.3307 | 51.8171 | 0.0429 | 0.1461 | 0.0845 |
| ✓ | ✓ | × | ✓ | 0.0542 | 0.0629 | 0.8871 | 1.5635 | 0.9548 | 0.9701 | 0.0362 |
| ✓ | ✓ | ✓ | × | 0.0508 | 0.0546 | 0.8984 | 1.7283 | 0.9520 | 0.9670 | 0.0360 |
| ✓ | ✓ | ✓ | ✓ | 0.0472 | 0.0511 | 0.9050 | 1.6833 | 0.9533 | 0.9684 | 0.0358 |