A Pyramid-Enhanced Swin Transformer for Robust Hyperspectral–Multispectral Image Fusion and Super-Resolution

Lu, Yu; Hu, Lin; Hu, Jiankai; Gan, Shu; Yuan, Xiping; Li, Wang; Zhao, Hailong

doi:10.3390/rs18081255

by Yu Lu ¹, Lin Hu ¹,*, Jiankai Hu ¹, Shu Gan ¹, Xiping Yuan ¹, Wang Li ¹ and Hailong Zhao ²

¹ Faculty of Land and Resources Engineering, Kunming University of Science and Technology, Kunming 650093, China
² School of Geography and Planning, Sun Yat-sen University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(8), 1255; https://doi.org/10.3390/rs18081255

Submission received: 10 March 2026 / Revised: 11 April 2026 / Accepted: 20 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue Innovations in Hyperspectral Image Processing: Advancing Image Generation, Denoising, Fusion Techniques and Beyond)

Highlights

What are the main findings?

A novel hyperspectral–multispectral fusion framework (PMSwinNet) is proposed, integrating pyramid multi-scale feature enhancement with a Swin Transformer to jointly model spatial–spectral information.
Experiments on multiple public datasets show that the proposed method outperforms existing state-of-the-art approaches in spatial detail preservation, spectral fidelity, and overall reconstruction quality.

What are the implications of the main findings?

The pyramid-enhanced Transformer architecture provides an effective strategy for addressing spatial–spectral coupling and improving hyperspectral image super-resolution.
The proposed framework offers a robust and scalable solution for remote sensing applications requiring high-resolution hyperspectral data, such as environmental monitoring and land-cover analysis.

Abstract

Due to the inherent limitations of both hyperspectral and multispectral imagery, balancing high spatial resolution with high spectral fidelity has become one of the fundamental challenges in remote sensing image processing. A prevailing strategy is to fuse these two types of data to reconstruct images that jointly preserve their respective advantages. However, existing reconstruction approaches still suffer from complex coupling between spatial and spectral information and from limited feature extraction capabilities. To address these issues, this study proposes PMSwinNet (Pyramid Multi-scale Swin Transformer Network), a novel architecture that integrates pyramid-based feature enhancement with Transformer mechanisms. The PMSwinNet incorporates multi-scale pyramid feature fusion and window-based self-attention within a progressive multi-stage design built from feature extraction and reconstruction modules. The Transformer branch leverages window partitioning and shifting operations to capture long-range spatial dependencies and local contextual cues, while the pyramid features extract both global and local information across multiple spatial scales. In addition, a high-frequency branch is introduced, which employs lightweight convolutions to enhance edges, textures, and other high-frequency details, effectively suppressing blurring and artifacts during reconstruction. Experimental evaluations on multiple public hyperspectral datasets demonstrate that the PMSwinNet outperforms state-of-the-art methods, particularly in terms of detail preservation, spectral distortion suppression, and robustness.

Keywords:

hyperspectral image super-resolution; Swin Transformer; pyramid feature; high-frequency enhancement

1. Introduction

Hyperspectral images (HSIs), characterized by hundreds of contiguous spectral bands, provide richer physical and chemical information about ground objects than conventional multispectral or RGB images [1]. Nevertheless, obtaining a sufficient signal-to-noise ratio (SNR) often requires hyperspectral sensors to employ longer exposure times and larger instantaneous fields of view (IFOVs), which inevitably lead to low spatial resolution (LRHSI) [2]. In contrast, multispectral images (MSI) captured by optical sensors offer higher spatial resolution but limited spectral information (HRMSI). The problem of reconstructing a high-resolution HSI (HRHSI) from an LRHSI and an HRMSI, known as hyperspectral–multispectral image fusion or hyperspectral image super-resolution (HSISR), has thus become a focal point in remote sensing research [3].

Existing methods for HSISR fall into two main categories: classical methods and deep learning-based models [4,5,6,7,8]. Classical techniques, which primarily involve pansharpening, matrix factorization, and tensor decomposition, have achieved encouraging results but rely heavily on linear assumptions and handcrafted priors. This reliance limits their ability to model complex nonlinear spatial–spectral dependencies, and balancing spatial detail recovery with minimal spectral distortion remains difficult under such frameworks. As a result, research attention has increasingly shifted toward deep learning approaches, particularly Transformer-based architectures.

Recent advances in deep learning have demonstrated remarkable potential to overcome these limitations by learning nonlinear mappings without the need for domain-specific priors. Convolutional neural networks (CNNs) are widely adopted in super-resolution tasks for their ability to model local structures and fine details [9,10,11,12], whereas Transformers excel in long-range dependency modeling and global context recovery through self-attention. However, most existing methods primarily focus on spatial–spectral modeling while largely overlooking frequency-domain information, especially high-frequency components that are crucial for recovering fine textures and structural details. Driven by the combined capabilities of CNNs and Transformers, we present a new HSISR framework that improves quantitative results on public benchmarks and also proves effective on real satellite imagery, as validated on the GF5-S2A dataset. The framework can be extended to other missions such as WorldView and Landsat-Sentinel.

To fully exploit the complementary characteristics of hyperspectral data in both the spatial and frequency domains, and to enhance high-frequency details while maintaining spectral consistency, we further design a pyramid-based multi-stage architecture that integrates a Transformer backbone with a multi-branch feature enhancement mechanism [13]. Unlike existing approaches that implicitly learn such information, our design explicitly models the interaction between spatial, spectral, and frequency components. This component is implemented as the Pyramid-Enhanced Swin Transformer Block (PESTB), whose structure directly corresponds to its code-level implementation.

In this design, the main branch employs stacked Swin Transformer Blocks to extract spatial–spectral representations through window-based self-attention, while patch embedding and unembedding link token features to the image domain. To enhance the backbone’s sensitivity to multi-scale structures, a lightweight pyramid features module is inserted at each stage, performing multi-scale pooling, convolution, and feature fusion to capture both global and local contextual patterns. To alleviate texture smoothing commonly observed in Transformer-based reconstructions, a parallel high-frequency enhancement branch is introduced. This design is motivated by the inherent limitation of self-attention in preserving high-frequency details. Instead of explicit high-pass filtering, this branch uses two shallow convolutions that directly process intermediate Transformer outputs, enabling effective extraction of gradients, textures, and edges. These high-frequency features are fused with the pyramid-enhanced representations to jointly refine spatial and frequency information.

Based on these designs, we propose an end-to-end HSI reconstruction network built upon a multi-stage unfolding strategy, as shown in Figure 1. The inputs to all iterations are the observations Y and Z, while the variable for the current iteration is constructed from the output X of the previous iteration. The network consists of n stages. In each stage, two core modules operate collaboratively: the Residual Computation (RC) module and the PESTB module, the latter of which is described in detail in Section 3.3. In stage 1, the low-resolution hyperspectral image Z is resampled via bicubic interpolation to generate an initial estimate of X, which serves as the starting point for the iterative optimization. Within the RC module, learnable hyperparameters are introduced to ensure that X, Y, and Z remain consistent in data dimensions, strictly adhering to the Wald protocol [14]. The computed physical residual is then concatenated with the feature map from the previous stage and fed into the PESTB module. Finally, at stage n, the network outputs a high-fidelity reconstructed hyperspectral image X⁽ⁿ⁾.

At the network level, the PMSwinNet adopts a multi-stage unfolding architecture in which each stage uses PESTBs to progressively recover fine- and coarse-scale representations. The Transformer backbone provides global–local contextual modeling, the pyramid module reinforces cross-scale structure, and the high-frequency branch compensates for detail loss. As processing proceeds across stages, the network incrementally reduces spatial blurring while maintaining spectral fidelity. A spatial–frequency compound loss with multi-scale residual supervision further improves robustness to diverse degradation models.

As a result, the PMSwinNet provides a more lightweight yet detail-preserving solution for hyperspectral image super-resolution. The key contributions of the present study are detailed as follows:

(1): We propose a novel Swin Transformer-based HSISR framework that explicitly incorporates spatial–spectral–frequency collaborative modeling, enabling improved reconstruction of both global context and fine-grained details.
(2): Leveraging the hierarchical structure of Swin Transformers and pyramid enhancement, the encoder captures rich multi-scale representations with improved efficiency, while facilitating cross-scale information interaction beyond conventional designs.
(3): This is the first work to introduce a multi-branch spatial prior fusion and adaptive reconstruction mechanism into a Transformer-based HSISR framework. The proposed model is trained end-to-end and demonstrates strong robustness to degradation and noise while effectively preserving spatial–spectral details, achieving superior performance and strong generalization ability compared to recent state-of-the-art (SOTA) methods.

2. Related Works

In this section, we briefly review existing hyperspectral image super-resolution (HSISR) methods, which can be broadly categorized into traditional approaches and deep learning-based methods.

2.1. Traditional Methods

Traditional methods are generally divided into three categories: pansharpening-based methods, matrix factorization-based methods, and tensor decomposition-based methods.

(1) Pansharpening-Based Methods: Pansharpening techniques, originally developed for fusing multispectral and panchromatic images, aim to enhance spatial resolution by injecting spatial details from high-resolution images into lower-resolution spectral bands [12]. In the context of HSISR, hyperspectral and multispectral images can be analogously treated as multispectral and panchromatic images, respectively, making HSI–MSI fusion a natural extension of pansharpening [13]. These methods can be further divided into four categories: multi-resolution analysis [15], component substitution [16], variational optimization-based [17], and machine learning-based approaches [18,19].

For instance, Principal Component Analysis (PCA) [20] enhances resolution by replacing the first principal component of the LRHSI with the HRMSI. The Gram–Schmidt (GS) method orthogonalizes the data, substitutes the mean intensity vector with the MSI, and then applies an inverse transform. Its variant, GSA [5], improves the GS method by considering inter-band correlations, thereby yielding higher-quality fusion.

(2) Matrix Factorization-Based Methods: These methods unfold the 3D hyperspectral cube (height × width × bands) into a 2D matrix (pixels × bands) and decompose it into spectral bases and coefficients. The fusion task is reframed as estimating these two matrices via optimization, often with constraints such as sparsity or low rank.

For example, ref. [21] introduced Coupled Nonnegative Matrix Factorization (CNMF), which alternately updates endmember and abundance matrices under a physical observation model. Ref. [22] proposed HySure, which applies total variation regularization and formulates fusion as a convex optimization problem, achieving strong results.

(3) Tensor Decomposition-Based Methods: Unlike matrix approaches that flatten data, tensor decomposition preserves the 3D structure of HSI, better capturing spatial–spectral correlations. Typically, an HSI is divided into local patches (cubes), and similar patches are grouped for joint processing. Exploiting self-similarity and local redundancy, sparse priors are often imposed on the core tensor.

Ref. [23] proposed Coupled Sparse Tensor Factorization (CSTF) using Tucker decomposition, where the fusion task is cast as estimating dictionaries and a sparse core tensor with spatial–spectral regularization. Ref. [24] designed a unified low-rank tensor recovery framework leveraging non-local similarity, later extended to a weighted model by adjusting singular values.

2.2. Deep Learning-Based Methods

Deep learning-based methods have demonstrated remarkable performance in hyperspectral–multispectral image fusion. Unlike traditional methods, these methods leverage neural networks to learn optimal fusion mappings in a data-driven manner. Let $X \in \mathbb{R}^{H \times W \times S}$ denote the target high-resolution hyperspectral image (HRHSI), $Y \in \mathbb{R}^{H \times W \times s}$ denote the high-resolution multispectral image (HRMSI), and $Z \in \mathbb{R}^{h \times w \times S}$ denote the low-resolution hyperspectral image (LRHSI). The LRHSI has high spectral resolution with S bands but low spatial resolution of size h × w, whereas the HRMSI has high spatial resolution H × W but fewer spectral bands (s < S). The objective of hyperspectral image super-resolution (HSISR) is to reconstruct X from Y and Z, typically by learning a nonlinear mapping function $X = f_{\theta}(Y, Z)$, where θ denotes the parameters of the neural network [3]. This function is often learned under a reduced resolution setting, where both the HSI and MSI are downsampled according to the Wald protocol [14], and the original HRHSI is treated as the ground truth.

In practice, deep learning-based HSISR methods are typically divided into supervised and unsupervised approaches, depending on whether ground-truth data is used during training. Supervised learning methods have achieved significant success with deep architectures, such as convolutional neural networks (CNNs), residual networks (ResNets), and Transformers. These models are trained end-to-end using labeled data and have shown strong performance in applications such as remote sensing image classification and high-resolution image reconstruction [25].

In contrast, unsupervised learning methods do not require paired ground-truth images. Instead, they learn to model the underlying structure of the data through self-reconstruction or adversarial objectives. Representative frameworks include autoencoders (AEs), variational autoencoders (VAEs), and generative adversarial networks (GANs), which have been widely adopted for dimensionality reduction, anomaly detection, and unsupervised fusion. While these approaches avoid manual labeling and are well-suited for cost-sensitive scenarios, their ability to capture semantic structure and class boundaries remains a challenge.

Several deep learning-based HSISR methods have been proposed in recent years. For example, ref. [26] proposed the first 3D CNN-based method for fusing LRHSIs and HRMSIs, utilizing PCA-based dimensionality reduction as a prior. Ref. [27] introduced DHSIS, which incorporates residual learning to embed prior knowledge and regularize the fusion task. Building on this work, ref. [28] further proposed a general framework that adapts to various datasets without retraining. Ref. [29] developed a dual-branch architecture that progressively reconstructs HRHSIs by fusing features at multiple scales, and designed a RAP loss to jointly constrain spatial and spectral distortions. SSDT, proposed in [30], ingeniously introduces a dilated-window mechanism, which effectively captures broader spatial context through multi-scale branches. In contrast, SCIAU-Net, designed in [31], combines the traditional Alternating Direction Method of Multipliers (ADMM) algorithm with neural network architectures, enabling more interpretable spatial–spectral cross-modal interactions.

In the field of unsupervised learning, ref. [32] proposed UDALN, a three-stage model that learns both the point spread function (PSF) and spectral response function (SRF) adaptively to perform super-resolution without ground truth. Ref. [33] introduced a Multi-level Cross-feature Attention (MCA) mechanism, employing Transformers to encode multi-level features and fuse local and global information for cross-modal interactions. Other recent deep learning-based methods, such as DCTransformer [34], PSRT [35], 3DT-Net [36], MoGDCN [37], DRT [38] and SSDT [30], are compared in detail in Section 4.

3. Methods

3.1. Objective Function

We take the hyperspectral image $X \in \mathbb{R}^{H \times W \times S}$ as the reference image, where H and W represent the spatial dimensions and S represents the number of spectral channels. According to the Wald protocol [14], we generate an HRMSI, $Y \in \mathbb{R}^{H \times W \times s}$, by spectral degradation, and an LRHSI, $Z \in \mathbb{R}^{h \times w \times S}$, by applying spatial Gaussian blur and downsampling. With the spatial dimensions of each image unfolded into rows (pixels × bands), the degradation processes can be expressed as follows:

$$Y = X R, \tag{1}$$

$$Z = C X, \tag{2}$$

where $C \in \mathbb{R}^{wh \times WH}$ represents the spatial blurring and downsampling operator, and $R \in \mathbb{R}^{S \times s}$ is the SRF simulating the MSI sensor.

Given Y and Z, the goal is to reconstruct X by minimizing the following function [39]:

$$\hat{X} = \arg\min_{X} \; \underbrace{\frac{1}{2} \| C X - Z \|_{F}^{2}}_{\text{Spatial consistency}} + \underbrace{\frac{1}{2} \| X R - Y \|_{F}^{2}}_{\text{Spectral consistency}} + \lambda f(X), \tag{3}$$

where $f(X)$ denotes the image prior term, and $\lambda$ is the regularization parameter.

To avoid handcrafted priors, we adopt a data-driven prior formulation and solve (3) using the proximal gradient descent algorithm. Since the first two terms are differentiable and the third term is non-smooth, the optimization is iteratively solved as follows [36]:

$$X^{(k+1)} = \mathrm{prox}_{\lambda \eta f}\left( X^{(k)} - \eta \left( C^{T} \left( Z'^{(k)} - Z \right) + \left( Y'^{(k)} - Y \right) R^{T} \right) \right), \tag{4}$$

where $Z'^{(k)} = C X^{(k)}$ and $Y'^{(k)} = X^{(k)} R$ are the observations simulated from the current estimate, $\eta$ is the step size, and prox(·) is the proximal operator that enforces the regularization imposed by the prior term.
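As a rough illustration of the update in Eq. (4), the sketch below runs the gradient part of the iteration on a toy problem. It follows the definitions in the text: Z is the spatially degraded LRHSI, Y is the spectrally degraded HRMSI, and X is unfolded to a pixels × bands matrix. The toy operators `C` and `R`, the step size, and the identity function standing in for the proximal operator are all assumptions made for illustration; the paper uses a learned, data-driven prior instead.

```python
def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def axpy(A, B, alpha=1.0):
    """Return A + alpha * B element-wise."""
    return [[a + alpha * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def data_fidelity(X, Y, Z, C, R):
    """0.5 * ||C X - Z||_F^2 + 0.5 * ||X R - Y||_F^2."""
    sq = lambda M: sum(v * v for row in M for v in row)
    return 0.5 * sq(axpy(matmul(C, X), Z, -1.0)) + 0.5 * sq(axpy(matmul(X, R), Y, -1.0))

def pgd_step(X, Y, Z, C, R, eta, prox=lambda V: V):
    """One proximal gradient step; `prox` defaults to identity (an assumption)."""
    res_z = axpy(matmul(C, X), Z, -1.0)   # simulated LRHSI minus observed LRHSI
    res_y = axpy(matmul(X, R), Y, -1.0)   # simulated HRMSI minus observed HRMSI
    grad = axpy(matmul(transpose(C), res_z), matmul(res_y, transpose(R)))
    return prox(axpy(X, grad, -eta))

# Toy scene: 4 pixels x 3 bands (hypothetical values).
X_true = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [0.0, 1.0, 2.0], [3.0, 3.0, 1.0]]
C = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]   # averages pixel pairs (4 -> 2 pixels)
R = [[0.5, 0.0], [0.5, 0.5], [0.0, 0.5]]           # mixes 3 bands into 2
Z = matmul(C, X_true)                              # LRHSI, 2 x 3
Y = matmul(X_true, R)                              # HRMSI, 4 x 2

X = [[0.0] * 3 for _ in range(4)]
losses = [data_fidelity(X, Y, Z, C, R)]
for _ in range(20):
    X = pgd_step(X, Y, Z, C, R, eta=0.5)
    losses.append(data_fidelity(X, Y, Z, C, R))
```

With a sufficiently small step size the data-fidelity value decreases at every iteration, which is the behavior the unfolded stages are meant to mimic before the learned prior refines each estimate.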

3.2. Pyramid Feature Extraction

To capture multi-scale spatial context, we introduce a Pyramid Feature Enhancement module. It applies adaptive average pooling with different scales (1 × 1, 2 × 2), followed by 1 × 1 convolutions for channel compression. The features are then upsampled and concatenated:

$$F_{s} = \mathrm{Up}\left( \mathrm{Conv}_{1 \times 1}\left( \mathrm{Pool}_{s \times s}(X) \right) \right), \quad s \in \{1, 2\}. \tag{5}$$

The concatenated feature maps are fused by a convolution layer and activation function:

$$F_{\mathrm{fusion}} = \sigma\left( \mathrm{Conv}_{1 \times 1}\left( [F_{1}, F_{2}] \right) \right). \tag{6}$$

Finally, we use a residual connection to generate the output:

$$\mathrm{Out} = X + F_{\mathrm{fusion}}. \tag{7}$$
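The data flow of Eqs. (5)–(7) can be sketched on a single-channel feature map. This is only an illustration under several assumptions that the text does not specify: scalar weights stand in for the learned 1 × 1 convolutions, nearest-neighbor interpolation is used for upsampling, ReLU is used for the activation σ, and the map size is divisible by each pooling scale.

```python
def adaptive_avg_pool(x, s):
    """Average-pool an H x W nested list to s x s (H and W divisible by s)."""
    H, W = len(x), len(x[0])
    bh, bw = H // s, W // s
    return [[sum(x[i][j]
                 for i in range(a * bh, (a + 1) * bh)
                 for j in range(b * bw, (b + 1) * bw)) / (bh * bw)
             for b in range(s)] for a in range(s)]

def upsample_nearest(p, H, W):
    """Nearest-neighbor upsampling of an s x s map back to H x W."""
    s = len(p)
    return [[p[i * s // H][j * s // W] for j in range(W)] for i in range(H)]

def pyramid_enhance(x, scale_weights=(0.5, 0.5), fuse_weights=(0.5, 0.5)):
    """Pool at scales 1 and 2, project, upsample, fuse, and add the residual."""
    H, W = len(x), len(x[0])
    feats = []
    for s, w in zip((1, 2), scale_weights):
        pooled = adaptive_avg_pool(x, s)
        projected = [[w * v for v in row] for row in pooled]   # stand-in for Conv1x1
        feats.append(upsample_nearest(projected, H, W))        # Eq. (5)
    fused = [[max(0.0, fuse_weights[0] * feats[0][i][j] + fuse_weights[1] * feats[1][i][j])
              for j in range(W)] for i in range(H)]            # Eq. (6), ReLU assumed
    return [[x[i][j] + fused[i][j] for j in range(W)] for i in range(H)]  # Eq. (7)
```

On a constant 4 × 4 map of ones, both pooled branches equal 0.5 after projection, the fused map is 0.5 everywhere, and the residual output is 1.5 everywhere.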

3.3. PESTB Module

Reconstructing fine details such as edges and textures is critical in super-resolution tasks. We introduce the Pyramid-Enhanced Swin Transformer Block (PESTB), which integrates the Swin Transformer with high-frequency branches and pyramid feature modules. The PESTB enhances local detail and semantic understanding, while residual connections improve stability and convergence during multi-stage processing. The detailed architecture is shown in Figure 2, and is described as follows:

(1): Transformer Backbone Branch

The backbone branch is constructed from multiple Swin Transformer Blocks, which extract both local and long-range dependencies. The shifted window self-attention mechanism enlarges the effective receptive field while maintaining computational efficiency, enabling the model to capture richer contextual information. This branch produces stable mid- and low-frequency semantic features that serve as the foundation for subsequent enhancement.

(2): High-Frequency Enhancement Branch

To address the limited detail recovery ability of the backbone branch, an independent high-frequency enhancement branch is incorporated. This branch first employs shallow convolutions to extract high-frequency components—such as local gradients and edges—followed by nonlinear activation to further strengthen these features. The extracted high-frequency representations are then fused with the backbone features through element-wise addition, allowing the network to more accurately recover fine textures and structural details.

(3): Pyramid Feature Extraction Module

To improve the network’s capability in perceiving multi-scale structures, the PESTB includes a pyramid features module. This module performs downsampling, convolutional encoding, and upsampling reconstruction at multiple scales, emulating a multi-resolution image pyramid. The resulting multi-scale outputs are fused through concatenation and linear projection, enabling the network to robustly handle texture regions with significant scale variations.

(4): Residual Fusion Mechanism

The PESTB adopts branch-specific residual connections as well as a final global residual connection to enhance training stability and gradient flow. The final output is formed by fusing three feature components: the Transformer backbone features, the pyramid multi-scale features, and the high-frequency enhancement features. This fusion strategy significantly improves detail reconstruction while avoiding substantial increases in computational cost.
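The window partitioning and shifting used by the backbone branch can be illustrated with a small index example. The sketch below is a generic rendering of the shifted-window idea from Swin Transformers, not the authors' code; the 4 × 4 map and window size 2 are hypothetical, and the attention computation and the mask that Swin applies to wrapped-around windows are omitted.

```python
def cyclic_shift(x, shift):
    """Roll a 2D nested list up and to the left by `shift`, wrapping around."""
    H, W = len(x), len(x[0])
    return [[x[(i + shift) % H][(j + shift) % W] for j in range(W)] for i in range(H)]

def window_partition(x, m):
    """Split an H x W map into non-overlapping m x m windows in row-major order."""
    H, W = len(x), len(x[0])
    return [[row[c:c + m] for row in x[r:r + m]]
            for r in range(0, H, m) for c in range(0, W, m)]

# A 4 x 4 map whose entries are their own flat indices.
x = [[4 * i + j for j in range(4)] for i in range(4)]
m = 2
regular = window_partition(x, m)                         # windows of a regular block
shifted = window_partition(cyclic_shift(x, m // 2), m)   # windows of a shifted block
```

In the regular partition the first window holds pixels 0, 1, 4, 5; after the shift it holds 5, 6, 9, 10, which straddle four of the original windows. Alternating the two partitions is what lets information cross window borders while attention stays local.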

3.4. Network Training

We train the PMSwinNet using a combined loss function consisting of the L1 loss and Spectral Angle Mapper (SAM) loss:

$$L_{L1} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{X}_{i} - X_{i} \right\|_{1}, \tag{8}$$

$$L_{SAM} = \frac{1}{N} \sum_{i=1}^{N} \cos^{-1}\left( \frac{\langle \hat{X}_{i}, X_{i} \rangle}{\| \hat{X}_{i} \|_{2} \cdot \| X_{i} \|_{2}} \right), \tag{9}$$

$$L_{Pixel} = L_{L1} + \alpha \cdot L_{SAM}, \tag{10}$$

where α is the weighting parameter, $\hat{X}_{i}$ is the final reconstructed HRHSI, and $X_{i}$ is the ground-truth HRHSI. We use the Adam optimizer with Xavier initialization [40] for training. Our experiments show that the combined loss outperforms the individual losses in preserving both spatial and spectral fidelity. The algorithm of the PESTB is described as follows (Algorithm 1):

Algorithm 1: Structure of PESTB

Note: The feature map at the current stage could be the features after patch embedding or the output of the previous PESTB.

Input: Feature map $F_{in} \in \mathbb{R}^{B \times C \times H \times W}$

Output: Enhanced feature map $F_{out} \in \mathbb{R}^{B \times C \times H \times W}$

Procedure:

Step 1: Apply patch embedding to convert the input feature map into patch tokens.

Step 2: For each Transformer layer $l = 1, \dots, L$ in the PESTB, apply a Swin Transformer Block to the tokens.

Step 3: Extract high-frequency details with a lightweight convolutional branch:

(1) Apply a $3 \times 3$ convolution to capture local details.

(2) Project channels to match the main branch dimension.

Step 4: Enhance multi-scale spatial context by applying the pyramid feature module.

Step 5: Fuse the main Transformer features with high-frequency features and residual connection.

End procedure
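Returning to the training objective, the combined loss of Eqs. (8)–(10) can be sketched as follows. The sketch assumes that the index i runs over pixel spectra, clamps the cosine to guard against rounding errors, and uses a hypothetical default of α = 0.1; the text does not state the value of α used in the experiments.

```python
import math

def l1_loss(pred, target):
    """Eq. (8): mean over spectra of the L1 norm of the difference."""
    return sum(sum(abs(p - t) for p, t in zip(ps, ts))
               for ps, ts in zip(pred, target)) / len(pred)

def sam_loss(pred, target, eps=1e-12):
    """Eq. (9): mean spectral angle in radians between paired spectra."""
    total = 0.0
    for ps, ts in zip(pred, target):
        dot = sum(p * t for p, t in zip(ps, ts))
        norm = math.sqrt(sum(p * p for p in ps)) * math.sqrt(sum(t * t for t in ts))
        cos = max(-1.0, min(1.0, dot / (norm + eps)))   # clamp for numerical safety
        total += math.acos(cos)
    return total / len(pred)

def pixel_loss(pred, target, alpha=0.1):
    """Eq. (10): L1 loss plus alpha times the SAM loss (alpha is hypothetical)."""
    return l1_loss(pred, target) + alpha * sam_loss(pred, target)
```

The L1 term penalizes per-band intensity errors, while the SAM term is insensitive to a uniform scaling of a spectrum and penalizes only changes in its shape, which is why the two are complementary.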

4. Experiments and Discussions

4.1. Datasets

To evaluate the effectiveness and generalization of the proposed model, we conduct experiments on three commonly used simulated hyperspectral datasets and one real-world dataset. The simulated datasets include PaviaU [41], Chikusei [42], and Xiongan [43], while the real-world dataset is GF5-S2A, composed of paired images from the GaoFen-5 Advanced HyperSpectral Imager (GF5 AHSI) and Sentinel-2A satellites. The simulated datasets are used to assess performance under controlled degradation, whereas the real dataset is used to demonstrate the model’s applicability in practical scenarios.

(1): PaviaU: The Pavia University dataset was acquired over the University of Pavia, Italy, using the ROSIS sensor. The dataset includes 103 spectral bands ranging from 430 to 860 nm. Its spatial resolution is 1.3 m, and the image size is 610 × 340 pixels.
(2): Chikusei: This dataset covers a mix of agricultural and urban areas in Chikusei, Japan. A total of 128 spectral bands is provided, spanning the wavelength range of 363–1018 nm. The data feature a spatial resolution of 2.5 m and an image size of 2517 × 2335 pixels.
(3): Xiongan: The Xiongan dataset contains hyperspectral images covering 250 bands in the 400–1000 nm range. It has a high spatial resolution of 0.5 m and an image size of 3750 × 1580 pixels.
(4): GF5-S2A: The dataset is collected from real-world observations of the GF5 AHSI and Sentinel-2A satellites. The GF5 AHSI sensor provides imagery across 0.4–2.5 μm with a spatial resolution of 30 m. Sentinel-2A images have spectral coverage from 0.4 to 2.0 μm, with spatial resolutions of 10 m (visible), 20 m (near-infrared), and 60 m (shortwave infrared).

4.2. Evaluation Metrics

We adopt four widely used quantitative metrics to evaluate the performance of super-resolution reconstruction: Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), Spectral Angle Mapper (SAM), and Structural Similarity Index Measure (SSIM).

PSNR: The PSNR measures the pixel-wise fidelity and is derived from the mean squared error (MSE) [44]. An increase in the PSNR reflects enhanced quality of the reconstructed images. It is defined as follows:

$$\mathrm{PSNR}(X, Y) = \frac{1}{B} \sum_{i=1}^{B} 10 \cdot \log_{10}\left( \frac{\max(Y_{i})^{2}}{\mathrm{MSE}(X_{i}, Y_{i})} \right), \tag{11}$$

where $X$ is the fused image, $Y$ is the reference image, and B is the number of spectral bands.
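A minimal sketch of the band-averaged PSNR in Eq. (11) is given below. It assumes each image is passed as a list of bands, each band flattened to a list of pixel values, and it does not handle the degenerate case of a zero MSE.

```python
import math

def psnr(fused, ref):
    """Mean per-band PSNR in dB; `fused` and `ref` are lists of flattened bands."""
    total = 0.0
    for xb, yb in zip(fused, ref):
        mse = sum((x - y) ** 2 for x, y in zip(xb, yb)) / len(yb)
        total += 10.0 * math.log10(max(yb) ** 2 / mse)   # peak taken from the reference band
    return total / len(ref)
```

For a single reference band of ones and a fused band that deviates by ±0.1 at every pixel, the MSE is 0.01 and the PSNR is 20 dB.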

ERGAS: ERGAS reflects both spatial and spectral quality [45]. A lower ERGAS score indicates better fusion performance. It is calculated as follows:

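$$\mathrm{ERGAS}(X, Y) = \frac{100}{r} \sqrt{ \frac{1}{B} \sum_{i=1}^{B} \left( \frac{\mathrm{RMSE}(X_{i}, Y_{i})}{\bar{Y}_{i}} \right)^{2} }, \tag{12}$$

where r is the spatial scaling ratio and $\bar{Y}_{i}$ is the mean value of the i-th band of the reference image.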
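Eq. (12) can be sketched in the same style. Here `ratio` is taken to be the spatial scaling factor between the LRHSI and the HRHSI, which is the usual reading of r, and bands are again assumed to be flattened lists with a nonzero mean.

```python
import math

def ergas(fused, ref, ratio):
    """ERGAS over lists of flattened bands; lower values indicate better fusion."""
    acc = 0.0
    for xb, yb in zip(fused, ref):
        rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(xb, yb)) / len(yb))
        mean = sum(yb) / len(yb)          # mean of the reference band
        acc += (rmse / mean) ** 2
    return 100.0 / ratio * math.sqrt(acc / len(ref))
```

With one reference band of constant value 2, a fused band deviating by ±0.2, and a ratio of 4, the relative RMSE is 0.1 and the score is 2.5.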

SAM: The SAM evaluates spectral similarity by treating each pixel’s spectral signature as a vector and computing the angle between the estimated and reference vectors [46]. A smaller angle indicates better spectral fidelity:

$$\mathrm{SAM}(X, Y) = \frac{1}{HW} \sum_{i,j} \arccos\left( \frac{Y_{ij}^{T} \cdot X_{ij}}{\| Y_{ij} \|_{2} \, \| X_{ij} \|_{2}} \right), \tag{13}$$

where H and W are the height and width of the image.

SSIM: The SSIM measures the structural similarity between the reconstructed and reference images [47]. A value closer to one implies higher similarity:

$$\mathrm{SSIM}(X, Y) = \frac{(2 \mu_{x} \mu_{y} + C_{1})(2 \sigma_{xy} + C_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + C_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2})}, \tag{14}$$

where $\mu_{x}$ and $\mu_{y}$ are the means and $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are the variances of $X$ and $Y$, respectively, $\sigma_{xy}$ is the covariance, and $C_{1}$ and $C_{2}$ are constants to stabilize the division.
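The expression in Eq. (14) can be evaluated directly from global statistics, as in the sketch below. This is a simplification: common SSIM implementations compute the same expression over local windows and average the result. The constants follow the widely used convention C₁ = (0.01·L)² and C₂ = (0.03·L)², with L the data range, which is an assumption since the text does not give their values.

```python
def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Eq. (14) on two flattened bands, using whole-band statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical inputs give a value of one, while inputs with the same mean but opposite variation have negative covariance and score below zero.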

4.3. Experimental Setup

For the simulated experiments, we adopt the widely used protocol in [24], employing an IKONOS-like SRF to mimic realistic spectral degradation. Given a HRHSI, the corresponding LRHSI is generated by applying Gaussian blurring followed by spatial downsampling, while a HRMSI is obtained by projecting the HSI through the spectral response matrix. Taking PaviaU as an example, the original dataset is regarded as a HRHSI. Spatial degradation is applied using a Gaussian kernel, and spectral degradation is performed via the response matrix to obtain LRHSIs and HRMSIs, respectively. In our experiments, the kernel size and the standard deviation σ are determined according to the scaling factor, as formulated in (15) and (16).

$$\mathrm{k\_size} = 2 \times \mathrm{ratio} + 1. \tag{15}$$

$$\sigma = (2 \times \mathrm{ratio}) / 3. \tag{16}$$
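As a worked example of Eqs. (15) and (16), the sketch below derives the blur parameters from the scaling factor and builds a normalized 2D Gaussian kernel from them. The function names are hypothetical, and the kernel construction (separable, normalized to unit sum) is a common choice rather than something the text specifies.

```python
import math

def blur_params(ratio):
    """Kernel size and standard deviation from Eqs. (15) and (16)."""
    k_size = 2 * ratio + 1
    sigma = (2 * ratio) / 3
    return k_size, sigma

def gaussian_kernel(ratio):
    """Normalized k_size x k_size Gaussian kernel for a given scaling factor."""
    k_size, sigma = blur_params(ratio)
    half = k_size // 2
    g = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(g) ** 2                       # sum of the outer product g g^T
    return [[a * b / total for b in g] for a in g]
```

For a scaling factor of 4 this gives a 9 × 9 kernel with σ = 8/3, and for the 16× setting a 33 × 33 kernel with σ = 32/3.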

For PaviaU, the lower-left 320 × 320 patch is selected for testing purposes, and the rest of the image is utilized for training. For Chikusei, a 512 × 512 region located between rows and columns 300–812 is selected for testing; the rest is for training. For Xiongan, a 512 × 512 region defined by rows 1750–2262 and columns 650–1162 was designated as the test set, with all other pixels employed for training. After removing water absorption bands, PaviaU and Xiongan retain 93 bands using the same spectral response.

In the real-world experiment, GF5 AHSI images are used as HRHSIs and Sentinel-2A images as HRMSIs. The images are preprocessed using ENVI 5.6 to perform spectral resampling, dimensionality reduction, and cropping. The resulting HRHSI and HRMSI are of size 300 × 300 × 106 and 900 × 900 × 10, respectively. The LRHSI is generated using a 9 × 9 Gaussian blur and 3× downsampling, resulting in 900 × 900 × 106.

We select two 60 × 60 patches in the top-left diagonal of the image as validation data (the corresponding HRHSI patches are 20 × 20), and use the remainder for training. The entire image is used for testing.

During the training phase, the training samples are generated by partitioning the entire image into patches using a predefined regular grid. Specifically, for the hyperspectral–multispectral fusion task, after constructing the low-resolution hyperspectral images (LRHSIs) via Gaussian blurring followed by spatial downsampling, and generating the high-resolution multispectral images (HRMSIs) using a spectral response matrix, deterministic sliding-window cropping is applied. The HRHSI and HRMSI are divided on the high-resolution grid, while the LRHSI is partitioned on the corresponding low-resolution grid, using fixed patch sizes and strides. Notably, this process is deterministic and does not involve random cropping during training.
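The deterministic cropping described above can be sketched by enumerating paired patch boxes on the two grids. The image size, patch size, stride, and ratio in the example are hypothetical, since the text does not list the values used; the sketch only assumes that the high-resolution patch size and stride are multiples of the scaling ratio so that both grids stay aligned.

```python
def grid_origins(size, patch, stride):
    """Top-left offsets of patches along one axis, in scan order."""
    return list(range(0, size - patch + 1, stride))

def paired_patch_boxes(hr_h, hr_w, hr_patch, hr_stride, ratio):
    """Pair each (row, col, size) box on the HR grid with its box on the LR grid."""
    assert hr_patch % ratio == 0 and hr_stride % ratio == 0
    pairs = []
    for r in grid_origins(hr_h, hr_patch, hr_stride):
        for c in grid_origins(hr_w, hr_patch, hr_stride):
            hr_box = (r, c, hr_patch)                            # crops HRHSI and HRMSI
            lr_box = (r // ratio, c // ratio, hr_patch // ratio)  # crops LRHSI
            pairs.append((hr_box, lr_box))
    return pairs

# Hypothetical setting: 128 x 128 HR image, 64-pixel patches, stride 32, ratio 4.
pairs = paired_patch_boxes(hr_h=128, hr_w=128, hr_patch=64, hr_stride=32, ratio=4)
```

Because the boxes depend only on the grid parameters, the same training patches are produced on every run, in contrast to random cropping.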

The network is trained using the Adam optimizer, with an initial learning rate of 1 × 10⁻⁴, for a maximum of 1000 epochs, and default optimizer parameters. A step decay learning rate schedule is adopted, where the learning rate is multiplied by 0.5 every 100 epochs. All experiments are conducted on an NVIDIA RTX 3090 GPU. For clarity, the best results are indicated in bold, and the second best are underlined.
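The step decay schedule described above reduces to a one-line formula, sketched below with a hypothetical function name. It assumes the first halving takes effect at epoch 100, counting epochs from zero.

```python
def step_decay_lr(epoch, base_lr=1e-4, gamma=0.5, step=100):
    """Learning rate at a given epoch: base_lr is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

Under this schedule the learning rate is 1 × 10⁻⁴ for epochs 0–99, 5 × 10⁻⁵ for epochs 100–199, and about 1.95 × 10⁻⁷ during the last hundred of the 1000 epochs. In PyTorch the same behavior is available through `torch.optim.lr_scheduler.StepLR` with `step_size=100` and `gamma=0.5`, although the text does not state which framework or scheduler class was used.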

4.4. Ablation Study

In this section, we conduct a series of ablation experiments to validate the effectiveness of the proposed method. All experiments are performed on the PaviaU dataset with a 16× scaling factor. The evaluation is carried out under the following configurations:

(1): We first adopt the Swin Transformer as the baseline model (BS).
(2): We introduce only the pyramid feature module (PF) into the baseline.
(3): We introduce only the high-frequency (HF) enhancement branch into the baseline.
(4): We evaluate the complete version of our proposed method.

As shown in Table 1, the quantitative performance degrades whenever any individual component is removed. Overall, the combined contribution of all modules within the PMSwinNet provides complementary benefits, yielding positive synergy for the image reconstruction task.

4.5. Performance on Public Dataset

In experiments on public datasets, we compare the PMSwinNet with several recent and widely adopted SOTA HSISR methods under identical Gaussian kernel and scaling factor settings. The comparison involves one matrix factorization-based method, HySure [22], two unsupervised models, UDALN [32] and FeafusFormer [33], and six supervised models: 3DT-Net [36], PSRT [35], DCTransformer [34], MoGDCN [37], DRT [38] and SSDT [30].

HySure is a physically interpretable method based on subspace regularization that aims to reduce spectral distortion when fusing LRHSIs and HRMSIs. UDALN is an unsupervised method whose three-stage model adaptively learns both the point spread function (PSF) and the spectral response function (SRF), performing super-resolution without ground truth. FeafusFormer contains a Multi-level Cross-feature Attention mechanism, employing Transformers to encode multi-level features and fuse local and global information for cross-modal interaction. Among the supervised deep learning methods, 3DT-Net and DCTransformer are Transformer-based networks. The 3DT-Net combines a Swin Transformer and a CNN for spatial–spectral feature extraction, while DCTransformer uses a bidirectional cross-attention Transformer block to enable mutual reconstruction between MSIs and HSIs. PSRT replaces traditional self-attention with pyramid structure attention and utilizes a Shuffle–Reshuffle mechanism to model both local and global information. MoGDCN introduces a denoising module based on deformable convolution networks (DCNs) and employs a sampled U-Net structure for reconstruction. DRT enhances the joint restoration of spatial and spectral information by leveraging deep feature representations, thereby alleviating detail loss and spectral distortion during the fusion process. SSDT is a deep network that incorporates a spatial–spectral joint modeling strategy.

(1) Results on Chikusei: Table 2 presents the quantitative results of different models on the Chikusei dataset. The PMSwinNet achieves the highest PSNR at both scales and remains competitive on the other metrics. At the 8× scale, SSDT yields the lowest ERGAS and the highest SSIM, and DCTransformer achieves the best SAM, whereas the PMSwinNet records the highest PSNR (45.5642) and the second-highest SSIM (0.9893), indicating strong spatial fidelity. At the 16× scale, the PMSwinNet again obtains the highest PSNR (45.4918). This can be attributed to the intrinsic characteristics of the Chikusei dataset, which contains abundant agricultural areas and complex urban textures with large variations in object scales. Convolutional networks such as MoGDCN, constrained by fixed receptive fields, struggle to simultaneously capture fine-grained textures and large homogeneous regions. In contrast, the Pyramid-Enhanced Swin Transformer Block (PESTB) introduced in the PMSwinNet effectively extracts multi-scale features, enabling the model to capture spatial structures at different levels of granularity. Regarding ERGAS, the PMSwinNet trails SSDT at both scales (1.1182 versus 0.8754 at 8×, and 1.2049 versus 1.0557 at 16×). This may be due to the increased computational complexity introduced by the PESTB module when pursuing high spatial resolution reconstruction, which can lead to minor quantization errors. For SSIM, the PMSwinNet ranks second at both scales, behind SSDT at 8× and marginally behind FeafusFormer at 16×. The small SSIM differences across the methods are likely attributable to training stochasticity.

Although the PMSwinNet does not lead in every single metric, it demonstrates the most consistent and competitive overall performance. The visual comparison on the Chikusei dataset (Figure 3) further supports this conclusion: the PMSwinNet shows the lightest regions in the SAM and DIF heatmaps. In the MRAE map, while slightly stronger color intensity is observed in the central region compared with UDALN, MoGDCN, DRT, SSDT and FeafusFormer, the PMSwinNet displays significantly lighter colors along the upper-right river, indicating superior fusion quality overall. These results further demonstrate the advantage of the Swin Transformer’s shifted window mechanism in modeling long-range spatial dependencies, which effectively leverages spectral correlations among neighboring materials to compensate for information loss caused by downsampling.

(2) Results on PaviaU: The quantitative performance of various models on the PaviaU dataset is summarized in Table 3. At the 8× scale, the PMSwinNet ranks first on all metrics except SAM, where it trails DCTransformer (2.5540 versus 2.2084). At the 16× scale, the PMSwinNet surpasses all competing methods on every metric except ERGAS, where DCTransformer is lower (0.2472 versus 0.2805). The 16× result can be explained by the underlying physical mechanism: under high downsampling ratios, spatial details are severely degraded, and simple linear mappings or local feature extraction methods become insufficient. The success of the PMSwinNet lies in its dual modeling capability. Specifically, the Swin Transformer module captures global dependencies across spectral bands, ensuring spectral fidelity, while the pyramid structure enhances the representation of complex spatial structures such as campus buildings and shadows. The visual comparisons in Figure 4 further corroborate these findings. The PMSwinNet produces the most visually faithful results, as evidenced by the lightest regions in the SAM, MRAE, and DIF error maps. Moreover, the reconstructed objects exhibit sharper boundaries and are free from noticeable artifacts. This can be attributed to the model’s ability to balance spatial texture recovery and spectral alignment through multi-scale feature learning, resulting in superior spectral consistency and reduced reconstruction errors.

(3) Results on Xiongan: A quantitative comparison of different models on the Xiongan dataset is presented in Table 4. At both fusion scales, the PMSwinNet exhibits consistently strong results. At the 8× scale, it achieves a PSNR of 50.6368, the highest among all methods and narrowly ahead of SSDT (50.5442). For SAM, it ranks second behind DCTransformer, and for ERGAS it ranks third behind SSDT and DCTransformer, while for SSIM, the PMSwinNet and DCTransformer share the leading value (0.9980), demonstrating strong structural fidelity. At the 16× scale, the PMSwinNet attains the best PSNR, SAM, and SSIM and the second-lowest ERGAS behind SSDT, highlighting its robustness and reconstruction capability under more challenging downsampling conditions. Although its SSIM is close to that of the Transformer-based DCTransformer, the PMSwinNet demonstrates greater stability in preserving SAM, suggesting that its high-frequency compensation mechanism plays a critical role. The visual results in Figure 5 further confirm these findings. The PMSwinNet yields the lightest regions in both the MRAE and SAM maps. In the DIF visualization, its performance is close to HySure; however, the fused images of HySure exhibit larger deviations from the ground truth, whereas the PMSwinNet better preserves spatial–spectral fidelity.

Across all three datasets, the PMSwinNet demonstrates outstanding robustness in spatial–spectral reconstruction. Its superiority is particularly evident under high upscaling factors (16×) and in scenarios involving complex textures, where it achieves the highest PSNR on all three datasets and the highest or second-highest SSIM. This advantage stems from the multi-scale feature aggregation capability of the PESTB module, which effectively compensates for spatial structural information lost during downsampling. Meanwhile, the long-range dependency modeling enabled by the Swin Transformer overcomes the limitations of local receptive fields by leveraging global contextual relationships, ensuring high spectral consistency. As a result, the PMSwinNet achieves an effective balance between reconstruction accuracy and spectral fidelity, especially when handling large scaling factors and heterogeneous scenes.

4.6. Computational Cost

To further assess model efficiency, we compare the number of parameters, model sizes, and FLOPs of all supervised deep learning approaches on the PaviaU dataset (UDALN and FeafusFormer are unsupervised models), as shown in Table 5. With 5.87 M parameters, the PMSwinNet has a lower parameter count and smaller model size than 3DT-Net, DCTransformer, MoGDCN, and SSDT, while PSRT and DRT are more compact. Its FLOPs (10.2799 G) are comparable to those of 3DT-Net and DCTransformer. While PSRT achieves the lowest computational complexity overall, its quantitative and qualitative performance remains inferior to the PMSwinNet, as previously discussed.

4.7. Comparison of Spectral Curves

To further evaluate spectral fidelity, we compare the reconstructed spectral curves of each model on all three public datasets at the fixed pixel location (100, 100) and at a randomly selected pixel location. As illustrated in Figure 6, the PMSwinNet’s reconstructed spectra align most closely with the ground truth, exhibiting the lowest spectral distortion across all datasets.
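A per-pixel comparison of this kind can be sketched as below: extract the spectrum at one location from the ground-truth and reconstructed cubes, then measure their spectral angle. The helper names, the `(H, W, bands)` layout, and the reading of (100, 100) as 0-based (row, column) are assumptions made for illustration; the source does not state its indexing convention.

```python
import numpy as np

def spectrum_at(cube, row, col):
    """Return the spectral curve (one value per band) at a pixel."""
    return np.asarray(cube[row, col, :], dtype=np.float64)

def spectral_angle_deg(ref, est, eps=1e-12):
    """Spectral angle in degrees between two spectra; 0 means identical shape."""
    ref = np.asarray(ref, dtype=np.float64)
    est = np.asarray(est, dtype=np.float64)
    cos = np.dot(ref, est) / (np.linalg.norm(ref) * np.linalg.norm(est) + eps)
    # Clip guards against values just outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def pixel_spectral_error(gt_cube, rec_cube, row=100, col=100):
    """Spectral angle between ground truth and reconstruction at one pixel."""
    return spectral_angle_deg(spectrum_at(gt_cube, row, col),
                              spectrum_at(rec_cube, row, col))
```

Because the spectral angle is invariant to a uniform scaling of the spectrum, it isolates differences in curve shape from differences in overall brightness. Figure 6 compares curve shape visually.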

4.8. Blind HSI Super-Resolution

While previous experiments presupposed knowledge of the downsampling blur kernel, real-world degradation processes are generally unknown and exhibit considerable complexity. To address this, we conduct a blind super-resolution experiment on the Chikusei dataset with an 8× scaling factor, following the protocol of MoGDCN.

During training, we adopt the same Gaussian kernel type as before but randomly sample the standard deviation within [1.0, 3.0]. For testing, the standard deviation is varied from 1.0 to 3.0 in increments of 0.2, ensuring that each LRHSI input undergoes a distinct degradation at every iteration, thereby simulating real-world uncertainty.
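The degradation protocol above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the 9 × 9 kernel size, reflect padding, uniform sampling of the standard deviation, and downsampling by pixel skipping are all choices made here, none of which the source specifies for this experiment.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def degrade(hrhsi, sigma, ratio, size=9):
    """Blur each band of an (H, W, C) cube, then keep every `ratio`-th pixel."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(hrhsi, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    h, w, _ = hrhsi.shape
    blurred = np.zeros(hrhsi.shape, dtype=np.float64)
    # Accumulate shifted copies weighted by the (symmetric) kernel.
    for i in range(size):
        for j in range(size):
            blurred += k[i, j] * padded[i:i + h, j:j + w]
    return blurred[::ratio, ::ratio]

rng = np.random.default_rng(0)
train_sigma = rng.uniform(1.0, 3.0)  # one random draw per training sample
test_sigmas = np.round(np.arange(1.0, 3.0 + 1e-9, 0.2), 1)  # 1.0, 1.2, ..., 3.0
```

The test grid contains eleven standard deviations, matching the eleven columns reported in Table 6.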

It is worth noting that the blind super-resolution setting inherently involves unknown degradation processes, including implicit noise perturbations and model mismatch. Therefore, evaluating model performance under this setting can indirectly reflect its robustness to various noise conditions. A model capable of maintaining stable reconstruction quality in such scenarios can be considered to possess an effective noise suppression ability.

We compare the PMSwinNet with six representative deep learning methods: PSRT, 3DT-Net, DCTransformer, MoGDCN, DRT and SSDT. The quantitative results, summarized in Table 6, report the PSNR, SAM, ERGAS, and SSIM across eleven Gaussian kernels. Compared with the non-blind setting, all models maintain stable performance, with only 3DT-Net showing notable variation. Importantly, the PMSwinNet consistently outperforms all competitors under every degradation condition.

To further evaluate robustness, we extend testing to more challenging cases using Gaussian kernels outside the training range [1.0, 3.0] and a standard Bicubic kernel. As shown in Table 7, although performance slightly decreases under these settings, the PMSwinNet still achieves the best results across all metrics, confirming its strong adaptability to diverse and unknown degradations.

These results indicate that PMSwinNet can effectively suppress noise-induced artifacts while preserving fine spatial details, even under unknown degradation conditions.

4.9. Performance on Real Data

To assess real-world applicability, we further evaluate the PMSwinNet on the GF5-S2A dataset. The fusion outcomes are visually presented in Figure 7. For quantitative evaluation, we adopt the no-reference metric D_λ (Spectral Distortion Index) where lower values indicate better spectral fidelity [48]. In addition, we report the model complexity in terms of the number of parameters (Params) and floating-point operations (FLOPs), aiming to provide a comprehensive analysis of both reconstruction performance and computational efficiency. (Note that QNR (Quality with No Reference) is not employed here, as it requires a panchromatic reference image, which is unavailable in this scenario.)
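For readers unfamiliar with D_λ, the sketch below follows its commonly used definition: the average absolute change in the inter-band Universal Image Quality Index (UIQI) between the fused image and the low-resolution input. Several points are assumptions rather than details from the source. The UIQI is computed globally instead of over sliding windows, the exponent is p = 1, and the LRHSI serves as the low-resolution reference. The function names are hypothetical.

```python
import numpy as np

def uiqi(a, b):
    """Global Universal Image Quality Index between two single-band images."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    ma, mb = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - ma) * (b - mb)).mean()
    denom = (va + vb) * (ma ** 2 + mb ** 2)
    # Degenerate case (e.g., two constant zero bands): treated as identical.
    return 4.0 * cov * ma * mb / denom if denom > 0 else 1.0

def d_lambda(fused, lrhsi, p=1):
    """Spectral distortion index; lower values mean better spectral fidelity."""
    bands = fused.shape[2]
    total = 0.0
    for l in range(bands):
        for r in range(bands):
            if l != r:
                q_fused = uiqi(fused[..., l], fused[..., r])
                q_low = uiqi(lrhsi[..., l], lrhsi[..., r])
                total += abs(q_fused - q_low) ** p
    return (total / (bands * (bands - 1))) ** (1.0 / p)
```

The index needs no high-resolution ground truth, because it only checks whether the relationships between bands in the low-resolution input are preserved after fusion. For a 106-band cube the double loop evaluates 11,130 band pairs, so a practical implementation would likely vectorize or subsample it.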

Since HySure is an optimization-based, model-driven method, its core relies on an iterative solving process rather than a feed-forward architecture with a fixed number of learnable parameters. Therefore, it cannot be fairly evaluated using the Params metric. Similarly, its computational complexity is dominated by data-dependent iterative operations, making it difficult to define a standardized FLOP value for fair comparison. For UDALN, although it is an unsupervised approach, its training process depends on a data-specific adaptive network structure. As a result, its parameter scale may vary under different data settings and is not directly comparable to that of standard supervised models. In addition, due to its adaptive architecture and dynamic training scheme, the FLOPs of UDALN are also data-dependent and cannot be consistently quantified under a unified setting. To avoid introducing potentially misleading comparisons, these methods are not included in the table under the Params and FLOPs metrics.

As shown in Table 8, the PMSwinNet achieves the lowest D_λ value (0.0038) among all compared methods on the GF5-S2A real dataset, indicating its superior capability in preserving spectral consistency in real-world remote sensing scenarios. Compared with representative approaches, such as 3DT-Net, DRT, SSDT, and DCTransformer, the PMSwinNet demonstrates a more pronounced advantage in suppressing spectral distortion. Meanwhile, the PMSwinNet contains 6.7869 M parameters, representing a moderate model scale comparable to SSDT and MoGDCN. Despite not being the smallest model, it achieves optimal D_λ performance. Although PSRT exhibits a lower parameter count and FLOPs, its D_λ value reaches 0.0194, which is significantly inferior to that of the PMSwinNet, suggesting that lightweight design alone is insufficient to ensure spectral fidelity in real-world scenarios.

Overall, the PMSwinNet attains superior spectral preservation performance under acceptable model complexity, highlighting its strong practical potential for real hyperspectral image fusion tasks.

5. Conclusions

In this study, we introduce the PMSwinNet, a progressive framework for hyperspectral image super-resolution (HSISR) that effectively integrates spatial–spectral information from the LRHSI and HRMSI. The model centers on a Pyramid-Enhanced Swin Transformer Block (PESTB), which couples multi-scale spatial fusion with frequency-domain refinement to strengthen high-dimensional representations. The PMSwinNet consists of three complementary components:

(1): A Swin Transformer branch that captures long-range spatial–spectral dependencies;
(2): A lightweight high-frequency branch that restores edges and textures often smoothed by Transformers;
(3): Pyramid feature fusion modules that enhance hierarchical spatial context.

The encoder extracts representations using patch embedding and stacked PESTBs, while the decoder reconstructs high-resolution images via patch unembedding and convolutional fusion. A hybrid loss combining spatial and frequency-domain constraints, together with multi-scale residual learning, improves robustness.

Ablation results confirm the effectiveness of the PESTB and high-frequency enhancement. Experiments on three benchmarks show consistent improvements over SOTA methods across the PSNR, SAM, ERGAS, and SSIM, especially at large scaling factors. Additional blind SR and real GF5-S2A experiments demonstrate strong robustness and spectral fidelity.

Author Contributions

Conceptualization, L.H., S.G., X.Y. and W.L.; methodology, L.H., S.G., X.Y. and W.L.; software, Y.L. and H.Z.; validation, Y.L. and H.Z.; formal analysis, L.H.; investigation, Y.L., J.H. and H.Z.; resources, L.H., S.G., X.Y. and W.L.; data curation, Y.L. and J.H.; writing—original draft preparation, Y.L.; writing—review and editing, L.H.; visualization, Y.L. and J.H.; supervision, L.H., S.G. and X.Y.; project administration, L.H., S.G., X.Y. and W.L.; funding acquisition, L.H., S.G. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant number 62266026, Deep Earth Probe and Mineral Resources Exploration—National Science and Technology Major Project 2024ZD1001400 and the Yunnan Fundamental Research Projects [No. 202501AU070133].

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank all the reviewers for their valuable comments and suggestions on this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PMSwinNet	Pyramid Multi-Scale Swin Transformer Network
HSI	Hyperspectral Image
MSI	Multispectral Image
SNR	Signal-To-Noise Ratio
IFOV	Instantaneous Field Of View
LRHSI	Low-Resolution HSI
HRHSI	High-Resolution HSI
HSISR	Hyperspectral Image Super-Resolution
CNNs	Convolutional Neural Networks
PESTB	Pyramid-Enhanced Swin Transformer Block
SOTA	State-Of-The-Art
GF5 AHSI	GaoFen-5 Advanced HyperSpectral Imager
S2A	Sentinel-2A
PSNR	Peak Signal-To-Noise Ratio
AEs	Autoencoders
VAEs	Variational Autoencoders
GANs	Generative Adversarial Networks
ERGAS	Erreur Relative Globale Adimensionnelle de Synthèse
SAM	Spectral Angle Mapper
SSIM	Structural Similarity Index Measure
MSE	Mean Squared Error
BS	Baseline Model
PF	Pyramid Feature Module
HF	High-Frequency
DCN	Deformable Convolution Networks
D_λ	Spectral Distortion Index
ADMM	Alternating Direction Method of Multipliers

References

1. Yu, H.; Ling, Z.; Zheng, K.; Gao, L.; Li, J.; Chanussot, J. Unsupervised Hyperspectral and Multispectral Image Fusion with Deep Spectral-Spatial Collaborative Constraint. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5534114.
2. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58.
3. Vivone, G. Multispectral and hyperspectral image fusion in remote sensing: A survey. Inf. Fusion 2023, 89, 405–417.
4. Zhu, C.; Deng, S.; Zhou, Y.; Deng, L.; Wu, Q. QIS-GAN: A Lightweight Adversarial Network with Quadtree Implicit Sampling for Multispectral and Hyperspectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531115.
5. Zhu, C.; Dai, R.; Gong, L.; Gao, L.; Ta, N.; Wu, Q. An adaptive multi-perceptual implicit sampling for hyperspectral and multispectral remote sensing image fusion. Int. J. Appl. Earth Obs. 2023, 125, 103560.
6. Yu, C.; Zhou, S.; Song, M.; Gong, B.; Zhao, E.; Chang, C. Unsupervised Hyperspectral Band Selection via Hybrid Graph Convolutional Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5530515.
7. Yan, J.; Zhang, K.; Zhang, F.; Ge, C.; Wan, W.; Sun, J. Multispectral and hyperspectral image fusion based on low-rank unfolding network. Signal Process. 2023, 213, 109223.
8. Xing, C.; Wang, M.; Cong, Y.; Wang, Z.; Duan, C.; Liu, Y. Sparse coding with morphology segmentation and multi-label fusion for hyperspectral image super-resolution. Comput. Vis. Image Underst. 2023, 227, 103603.
9. Yang, H.; Yu, H.; Zheng, K.; Hu, J.; Tao, T.; Zhang, Q. Hyperspectral Image Classification Based on Interactive Transformer and CNN with Multilevel Feature Fusion Network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5507905.
10. Tian, C.; Yuan, Y.; Zhang, S.; Lin, C.; Zuo, W.; Zhang, D. Image super-resolution with an enhanced group convolutional neural network. Neural Netw. 2022, 153, 373–385.
11. Ye, L.; Zhou, C.; Peng, H.; Wang, J.; Liu, Z.; Yang, Q. Multi-level feature interaction image super-resolution network based on convolutional nonlinear spiking neural model. Neural Netw. 2024, 177, 106366.
12. Kang, L.; Tang, B.; Huang, J.; Li, J. 3D-MRI super-resolution reconstruction using multi-modality based on multi-resolution CNN. Comput. Methods Programs Biomed. 2024, 248, 108110.
13. Vivone, G.; Restaino, R.; Chanussot, J. Full Scale Regression-Based Injection Coefficients for Panchromatic Sharpening. IEEE Trans. Image Process. 2018, 27, 3418–3431.
14. Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699.
15. Vivone, G.; Marano, S.; Chanussot, J. Pansharpening: Context-Based Generalized Laplacian Pyramids by Robust Regression. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6152–6167.
16. Dong, W.; Yang, Y.; Qu, J.; Xiao, S.; Du, Q. Hyperspectral Pansharpening via Local Intensity Component and Local Injection Gain Estimation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5507405.
17. Deng, L.; Vivone, G.; Guo, W.; Dalla Mura, M.; Chanussot, J. A Variational Pansharpening Approach Based on Reproducible Kernel Hilbert Space and Heaviside Function. IEEE Trans. Image Process. 2018, 27, 4330–4344.
18. Deng, L.; Vivone, G.; Jin, C.; Chanussot, J. Detail Injection-Based Deep Convolutional Neural Networks for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6995–7010.
19. Ciotola, M.; Vitale, S.; Mazza, A.; Poggi, G.; Scarpa, G. Pansharpening by Convolutional Neural Networks in the Full Resolution Framework. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5408717.
20. Zhang, X.; Jiang, X.; Jiang, J.; Zhang, Y.; Liu, X.; Cai, Z. Spectral–Spatial and Superpixelwise PCA for Unsupervised Feature Extraction of Hyperspectral Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5502210.
21. Yokoya, N.; Yairi, T.; Iwasaki, A. Coupled Nonnegative Matrix Factorization Unmixing for Hyperspectral and Multispectral Data Fusion. IEEE Trans. Geosci. Remote Sens. 2012, 50, 528–537.
22. Simoes, M.; Bioucas-Dias, J.; Almeida, L.B.; Chanussot, J. A Convex Formulation for Hyperspectral Image Superresolution via Subspace-Based Regularization. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3373–3388.
23. Li, S.; Dian, R.; Fang, L.; Bioucas-Dias, J.M. Fusing Hyperspectral and Multispectral Images via Coupled Sparse Tensor Factorization. IEEE Trans. Image Process. 2018, 27, 4118–4130.
24. Chang, Y.; Yan, L.; Zhao, X.; Fang, H.; Zhang, Z.; Zhong, S. Weighted Low-Rank Tensor Recovery for Hyperspectral Image Restoration. IEEE Trans. Cybern. 2020, 50, 4558–4572.
25. Jia, S.; Min, Z.; Fu, X. Multiscale spatial–spectral transformer network for hyperspectral and multispectral image fusion. Inf. Fusion 2023, 96, 117–129.
26. Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O. Multispectral and Hyperspectral Image Fusion Using a 3-D-Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 639–643.
27. Dian, R.; Li, S.; Guo, A.; Fang, L. Deep Hyperspectral Image Sharpening. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5345–5355.
28. Dian, R.; Li, S.; Kang, X. Regularizing Hyperspectral and Multispectral Image Fusion by CNN Denoiser. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1124–1135.
29. Xu, S.; Amira, O.; Liu, J.; Zhang, C.; Zhang, J.; Li, G. HAM-MFN: Hyperspectral and Multispectral Image Multiscale Fusion Network with RAP Loss. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4618–4628.
30. Fu, X.; Liu, P.; Han, P.; Jia, S. SSDT: Multiscale Spatial–Spectral Dilated Transformer for Hyperspectral and Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5525615.
31. Zhang, R.; Lei, B.; Feng, W.; Chai, X. SCIAU-Net: A Spatial-Spectral Cross-Modal Interaction ADMM Unfolding Network for Hyperspectral and Multispectral Image Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 8175–8192.
32. Li, J.; Zheng, K.; Yao, J.; Gao, L.; Hong, D. Deep Unsupervised Blind Hyperspectral and Multispectral Data Fusion. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6007305.
33. Cao, X.; Lian, Y.; Wang, K.; Ma, C.; Xu, X. Unsupervised Hybrid Network of Transformer and CNN for Blind Hyperspectral and Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507615.
34. Ma, Q.; Jiang, J.; Liu, X.; Ma, J. Reciprocal transformer for hyperspectral and multispectral image fusion. Inf. Fusion 2024, 104, 102148.
35. Deng, S.; Deng, L.; Wu, X.; Ran, R.; Hong, D.; Vivone, G. PSRT: Pyramid Shuffle-and-Reshuffle Transformer for Multispectral and Hyperspectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503715.
36. Ma, Q.; Jiang, J.; Liu, X.; Ma, J. Learning a 3D-CNN and Transformer prior for hyperspectral image super-resolution. Inf. Fusion 2023, 100, 101907.
37. Dong, W.; Zhou, C.; Wu, F.; Wu, J.; Shi, G.; Li, X. Model-Guided Deep Hyperspectral Image Super-Resolution. IEEE Trans. Image Process. 2021, 30, 5754–5768.
38. Xu, K.; Chen, Y.; Zhao, W.; Wang, Y. Dual-Branch Rectangle Transformer for Hierarchical Hyperspectral Super Resolution via Spectral Reversion Contrastive Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 15804–15828.
39. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11 October 2021.
40. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13 December 2015.
41. Dell’Acqua, F.; Gamba, P.; Ferrari, A.; Palmason, J.A.; Benediktsson, J.A.; Arnason, K. Exploiting Spectral and Spatial Information in Hyperspectral Urban Data with High Resolution. IEEE Geosci. Remote Sens. Lett. 2004, 1, 322–326.
42. Yokoya, N.; Iwasaki, A. Airborne Hyperspectral Data over Chikusei; SAL-2016-05-27; Space Application Laboratory, The University of Tokyo: Tokyo, Japan, 2016; Volume 5, p. 5.
43. Cen, Y.; Zhang, L.; Zhang, X.; Wang, Y.; Qi, W.; Tang, S.; Zhang, P. Aerial hyperspectral remote sensing classification dataset of Xiongan New Area (Matiwan Village). J. Remote Sens. 2020, 24, 1299–1306.
44. Gonzalez, R.C.; Woods, R.E. Digital Image Processing, 4th ed.; Pearson: New York, NY, USA, 2018; pp. 1–1024.
45. Wald, L. Data Fusion: Definitions and Architectures: Fusion of Images of Different Spatial Resolutions, 1st ed.; Presses des MINES: Paris, France, 2002; pp. 1–198.
46. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the Summaries of the Third Annual JPL Airborne Geoscience Workshop, Pasadena, CA, USA, 1–5 June 1992; Volume 1, pp. 147–159.
47. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
48. Zhu, C.; Deng, S.; Song, X.; Li, Y.; Wang, Q. Mamba Collaborative Implicit Neural Representation for Hyperspectral and Multispectral Remote Sensing Image Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504915.

Figure 1. Architecture of the proposed PMSwinNet. X, Y, and Z denote the HRHSI, HRMSI, and LRHSI, respectively. C and R represent the spatial and spectral downsampling operators, while C^T and R^T are the corresponding upsampling operators. V(n) represents the intermediate feature representation at stage n of the unfolding network, “−” denotes element-wise subtraction, and η is a learnable relaxation factor that adaptively adjusts the weight of the prior information at each stage.

Figure 2. The framework of PESTB network.

Figure 3. The Chikusei dataset results are arranged by rows—RGB images (61-41-21) in the first, MRAE maps in the second, SAM maps in the third, and DIF maps illustrating the 31st band variation in the fourth.

Figure 4. The PaviaU dataset results are arranged by rows—RGB images (61-41-21) in the first, MRAE maps in the second, SAM maps in the third, and DIF maps illustrating the 31st band variation in the fourth.

Figure 5. The Xiongan dataset results are arranged by rows—RGB images (61-41-21) in the first, MRAE maps in the second, SAM maps in the third, and DIF maps illustrating the 31st band variation in the fourth.

Figure 6. Spectral reconstruction curves on three datasets: (a–c) show comparisons of the curves for Chikusei, PaviaU, and Xiongan at the fixed point (100, 100). (d–f) show comparisons of the curves for Chikusei, PaviaU, and Xiongan at random points. Except for the ground-truth and PMSwinNet, all other comparison methods are displayed with 50% transparency for better visual distinction.

Figure 7. Reconstruction results for GF5-S2A Dataset (the HRMSI is composited with 3-2-1 bands and the rest are composited with 31-21-4 bands, size of 900 × 900 × 106).

Table 1. Ablation study on the contribution of different components in PMSwinNet.

Methods	PSNR	ERGAS	SAM	SSIM	Params (M)
BS	42.2594	0.4824	2.6056	0.9924	5.67
PF	42.9981	0.4277	2.2339	0.9936	5.70
HF	43.5289	0.4345	1.8235	0.9943	5.83
PMSwinNet	44.2209	0.2805	1.5592	0.9973	5.87

Table 2. Experimental results on Chikusei.

Method	Ratio	PSNR	ERGAS	SAM	SSIM
HySure	8	41.2548	3.2657	8.006	0.9585
UDALN	8	44.2214	1.0823	3.4774	0.9854
PSRT	8	37.4333	1.5701	3.8688	0.9878
3DT-Net	8	44.1628	1.2144	4.1401	0.9875
DCTransformer	8	41.7410	1.4418	2.1375	0.975
MoGDCN	8	41.9898	1.2057	3.6373	0.9822
FeafusFormer	8	44.7868	0.9489	3.5632	0.9892
DRT	8	40.1122	1.2919	3.5047	0.9784
SSDT	8	45.4880	0.8754	2.8863	0.9911
PMSwinNet	8	45.5642	1.1182	3.4651	0.9893
HySure	16	43.8666	1.2565	6.5299	0.9695
UDALN	16	42.2944	1.1907	5.0798	0.9698
PSRT	16	41.5613	1.2638	4.6089	0.9799
3DT-Net	16	42.6964	1.2639	4.125	0.9805
DCTransformer	16	44.0066	1.3468	5.4464	0.9799
MoGDCN	16	44.0091	1.1536	3.8654	0.9840
FeafusFormer	16	44.7071	1.0427	4.3332	0.9862
DRT	16	37.6617	1.6826	4.3929	0.9580
SSDT	16	44.6348	1.0557	3.5099	0.9856
PMSwinNet	16	45.4918	1.2049	3.7658	0.9859

Table 3. Experimental results on PaviaU.

Method	Ratio	PSNR	ERGAS	SAM	SSIM
HySure	8	29.3383	3.343	12.6989	0.8982
UDALN	8	37.0812	0.6824	3.5369	0.9832
PSRT	8	23.7338	3.7044	2.684	0.7900
3DT-Net	8	37.9992	0.6112	2.6429	0.9882
DCTransformer	8	37.3994	0.5926	2.2084	0.9870
MoGDCN	8	34.0730	0.8554	3.6787	0.9713
FeafusFormer	8	37.6207	0.6047	3.1234	0.9856
DRT	8	28.9599	1.402	5.8875	0.9131
SSDT	8	37.2119	0.5821	2.7601	0.9871
PMSwinNet	8	38.0353	0.5667	2.5540	0.9887
HySure	16	31.7191	1.2979	9.5819	0.9221
UDALN	16	32.431	1.0981	6.0629	0.9369
PSRT	16	42.9545	0.3105	1.6349	0.9966
3DT-Net	16	41.7069	0.3707	1.8858	0.9953
DCTransformer	16	44.0066	0.2472	1.5908	0.9968
MoGDCN	16	34.3116	0.7548	4.4352	0.9775
FeafusFormer	16	37.707	0.5630	3.5364	0.9870
DRT	16	32.1358	0.9328	3.6621	0.9585
SSDT	16	40.5705	0.3829	2.1638	0.9945
PMSwinNet	16	44.2209	0.2805	1.5592	0.9973

Table 4. Experimental results on Xiongan.

Method	Ratio	PSNR	ERGAS	SAM	SSIM
HySure	8	37.9401	1.269	3.9656	0.9467
UDALN	8	42.1008	0.3343	2.1869	0.9819
PSRT	8	44.3682	0.2261	0.8096	0.9919
3DT-Net	8	49.6004	0.1277	0.8832	0.9976
DCTransformer	8	50.1760	0.1098	0.7187	0.9980
MoGDCN	8	48.4459	0.1465	1.0029	0.9967
FeafusFormer	8	47.1774	0.1814	1.5886	0.9948
DRT	8	41.1232	0.3052	0.9815	0.9853
SSDT	8	50.5442	0.1047	0.7456	0.9979
PMSwinNet	8	50.6368	0.1137	0.7449	0.9980
HySure	16	38.191	0.6055	4.038	0.9497
UDALN	16	38.834	0.4649	3.0868	0.9663
PSRT	16	39.3975	0.4182	0.9002	0.9719
3DT-Net	16	45.8099	0.1809	1.0547	0.9948
DCTransformer	16	46.6166	0.1698	0.9779	0.9958
MoGDCN	16	41.807	0.2927	1.6588	0.9868
FeafusFormer	16	43.9285	0.245	1.5624	0.9906
DRT	16	35.72	0.5961	1.6874	0.9417
SSDT	16	47.0521	0.1202	0.9160	0.9958
PMSwinNet	16	47.4814	0.1590	0.8974	0.9960
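
For readers reproducing Tables 2–4, the following is a minimal NumPy sketch of PSNR, ERGAS, and SAM under their commonly used definitions. The conventions chosen here (data scaled to a peak of 1.0, PSNR averaged over bands, SAM reported in degrees, ERGAS scaled by 100 divided by the resolution ratio) are assumptions and may differ from the exact settings used in this paper; the function names are illustrative and are not taken from the authors' code. SSIM is omitted because it depends on window settings that the tables do not restate.

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Mean per-band PSNR in dB; ref and est have shape (H, W, B)."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))        # one MSE per band
    return float(np.mean(10.0 * np.log10(peak ** 2 / mse)))

def sam_degrees(ref, est, eps=1e-12):
    """Mean spectral angle between per-pixel spectra, in degrees."""
    dot = np.sum(ref * est, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)       # guard arccos domain
    return float(np.degrees(np.mean(np.arccos(cos))))

def ergas(ref, est, ratio):
    """ERGAS with the usual 100 / ratio scaling (ratio = 8 or 16 here)."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))        # one MSE per band
    mean_sq = np.mean(ref, axis=(0, 1)) ** 2            # squared band means
    return float(100.0 / ratio * np.sqrt(np.mean(mse / mean_sq)))
```

Under these definitions, higher PSNR is better while lower ERGAS and SAM are better, which is how the rows above should be read. Because of the 100 / ratio factor, ERGAS values at ratio 8 and ratio 16 are not directly comparable with each other.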

Table 5. Computational analysis of supervised deep learning methods on the PaviaU dataset.

	PSRT	3DT-Net	DCTransformer	MoGDCN	DRT	SSDT	PMSwinNet
Params (M)	0.28	6.29	8.34	6.85	3.377728	6.61956	5.87
Model size (MB)	3.565	72.423	96.983	78.599	38.363	76.006	68
FLOPs (G)	0.1684	11.4593	10.8884	9.879	1.3272	0.9225	10.2799

Table 6. Average blind results of different approaches on the Chikusei dataset using Gaussian blur kernels with varying standard deviations (σ) at a scaling factor of 8.

σ	1.0	1.2	1.4	1.6	1.8	2.0	2.2	2.4	2.6	2.8	3.0
PMSwinNet
PSNR	42.77	42.95	43.08	43.17	43.23	43.27	43.29	43.29	43.30	43.30	43.30
ERGAS	1.08	1.06	1.05	1.04	1.03	1.03	1.02	1.02	1.02	1.02	1.02
SAM	3.06	3.00	2.96	2.92	2.90	2.88	2.87	2.87	2.86	2.86	2.86
SSIM	0.985	0.986	0.986	0.986	0.987	0.987	0.987	0.987	0.987	0.987	0.987
3DT-Net
PSNR	39.54	40.21	40.81	41.28	41.60	41.79	41.91	41.99	42.04	42.08	42.10
ERGAS	1.68	1.57	1.48	1.41	1.37	1.34	1.32	1.31	1.31	1.30	1.30
SAM	4.52	4.46	4.41	4.37	4.34	4.31	4.30	4.29	4.28	4.27	4.27
SSIM	0.967	0.972	0.975	0.977	0.979	0.979	0.980	0.980	0.980	0.981	0.981
PSRT
PSNR	37.28	37.45	37.57	37.64	37.68	37.70	37.70	37.70	37.69	37.68	37.67
ERGAS	1.94	1.88	1.84	1.82	1.80	1.79	1.78	1.78	1.77	1.77	1.77
SAM	4.85	4.74	4.65	4.59	4.54	4.51	4.49	4.47	4.46	4.46	4.45
SSIM	0.955	0.957	0.958	0.959	0.959	0.959	0.960	0.960	0.959	0.959	0.959
DCTransformer
PSNR	41.33	41.55	41.73	41.87	41.98	42.05	42.11	42.14	42.16	42.18	42.19
ERGAS	1.30	1.29	1.23	1.27	1.26	1.26	1.26	1.26	1.25	1.25	1.25
SAM	4.79	4.73	4.69	4.66	4.63	4.62	4.61	4.60	4.60	4.59	4.59
SSIM	0.987	0.979	0.979	0.980	0.980	0.980	0.980	0.980	0.980	0.980	0.980
MoGDCN
PSNR	42.42	42.55	42.63	42.69	42.73	42.75	42.76	42.77	42.77	42.77	42.77
ERGAS	1.21	1.20	1.19	1.18	1.18	1.17	1.17	1.17	1.17	1.17	1.17
SAM	4.04	3.99	3.95	3.92	3.90	3.89	3.88	3.88	3.88	3.87	3.87
SSIM	0.982	0.982	0.982	0.983	0.983	0.983	0.983	0.983	0.983	0.983	0.983
DRT
PSNR	38.66	38.94	39.192	39.37	39.51	39.65	39.79	39.93	40.03	40.10	40.13
ERGAS	1.52	1.48	1.46	1.44	1.42	1.40	1.39	1.37	1.37	1.36	1.35
SAM	3.81	3.76	3.72	3.69	3.66	3.65	3.64	3.63	3.63	3.62	3.62
SSIM	0.966	0.968	0.970	0.971	0.972	0.972	0.973	0.974	0.974	0.975	0.975
SSDT
PSNR	42.27	42.34	42.55	42.66	42.85	42.94	43.06	43.12	43.13	43.20	43.24
ERGAS	1.08	1.05	1.05	1.05	1.04	1.03	1.03	1.02	1.02	1.02	1.00
SAM	3.29	3.22	3.16	3.14	3.08	3.05	3.01	2.96	2.88	2.87	2.87
SSIM	0.982	0.983	0.983	0.984	0.985	0.986	0.986	0.987	0.987	0.987	0.987

Table 7. Average blind results of different approaches on the Chikusei dataset at a scaling factor of 8, using Gaussian blur kernels with σ outside the range [1.0, 3.0] and a bicubic kernel. Gaussian kernels with σ ∈ [1.0, 3.0] were used during training.

σ	0.4	0.6	0.8	3.2	3.4	3.6	Bicubic
PMSwinNet
PSNR	41.88	42.28	42.55	43.29	43.29	43.29	41.77
ERGAS	1.20	1.15	1.11	1.02	1.02	1.02	1.21
SAM	3.38	3.23	3.14	2.86	2.86	2.86	3.46
SSIM	0.982	0.983	0.984	0.987	0.987	0.987	0.981
3DT-Net
PSNR	37.37	38.18	38.86	42.12	42.13	42.14	38.17
ERGAS	2.11	1.94	1.80	1.30	1.30	1.30	1.98
SAM	4.84	4.69	4.60	4.27	4.26	4.26	4.92
SSIM	0.949	0.957	0.963	0.981	0.981	0.981	0.955
PSRT
PSNR	36.33	36.77	37.05	37.66	37.65	37.64	36.24
ERGAS	2.27	2.11	2.01	1.77	1.77	1.77	2.16
SAM	5.39	5.13	4.98	4.45	4.45	4.45	5.53
SSIM	0.943	0.949	0.952	0.959	0.959	0.959	0.944
DCTransformer
PSNR	40.44	40.81	41.09	42.20	42.20	42.21	40.30
ERGAS	1.37	1.34	1.32	1.25	1.25	1.25	1.38
SAM	5.04	4.92	4.85	4.59	4.59	4.59	5.13
SSIM	0.975	0.976	0.977	0.980	0.980	0.980	0.974
MoGDCN
PSNR	41.73	42.06	42.26	42.77	42.77	42.76	41.75
ERGAS	1.30	1.26	1.23	1.17	1.17	1.18	1.30
SAM	4.33	4.19	4.11	3.87	3.87	3.87	4.38
SSIM	0.979	0.980	0.981	0.983	0.983	0.983	0.979
DRT
PSNR	37.39	37.93	38.32	40.17	40.22	40.27	38.66
ERGAS	1.70	1.61	1.56	1.35	1.34	1.34	1.52
SAM	4.08	3.95	3.87	3.62	3.62	3.62	3.81
SSIM	0.957	0.961	0.964	0.975	0.975	0.976	0.966
SSDT
PSNR	40.72	41.69	42.49	43.26	43.26	43.27	41.27
ERGAS	1.23	1.13	1.06	1.00	0.98	0.97	1.22
SAM	3.99	3.68	3.47	2.87	2.86	2.86	3.49
SSIM	0.978	0.982	0.985	0.987	0.987	0.987	0.979
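
The blind setting in Tables 6 and 7 varies the blur applied before downsampling. As a minimal sketch of how such low-resolution inputs could be simulated, the code below blurs each band with an isotropic Gaussian kernel of standard deviation σ and then decimates by the scaling factor. The kernel radius (three standard deviations), reflect padding, and decimation starting at the first pixel are assumptions made for illustration, as are the function names; the authors' actual degradation pipeline may differ.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian; radius of ceil(3 * sigma) is an assumption."""
    radius = int(np.ceil(3.0 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def blur_bandwise(hsi, sigma):
    """Separable Gaussian blur over the two spatial axes of an (H, W, B) cube."""
    g = gaussian_kernel_1d(sigma)
    pad = len(g) // 2
    padded = np.pad(hsi, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    conv = lambda v: np.convolve(v, g, mode="valid")
    out = np.apply_along_axis(conv, 0, padded)   # blur along rows
    out = np.apply_along_axis(conv, 1, out)      # blur along columns
    return out

def degrade(hsi, sigma, ratio=8):
    """Blur, then keep every `ratio`-th pixel along both spatial axes."""
    return blur_bandwise(hsi, sigma)[::ratio, ::ratio, :]

# Sigma grids matching the table headers: in-range (Table 6) and out-of-range (Table 7).
IN_RANGE_SIGMAS = np.round(np.arange(1.0, 3.0 + 1e-9, 0.2), 1)
OUT_OF_RANGE_SIGMAS = np.array([0.4, 0.6, 0.8, 3.2, 3.4, 3.6])
```

A test-time sweep would call `degrade(hsi, sigma)` for each value in the two grids and score the reconstruction against the original cube. The bicubic column of Table 7 uses a different kernel and is not covered by this sketch.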

Table 8. Quantitative evaluation on real data.

Methods	HySure	UDALN	PSRT	3DT-Net	DCTransformer	MoGDCN	DRT	SSDT	PMSwinNet
D_λ	0.0185	0.0131	0.0194	0.0072	0.012	0.1327	0.0097	0.0112	0.0038
Params (M)	—	—	0.2927	7.1126	8.3735	6.8931	3.4048	6.832	6.7869
FLOPs (G)	—	—	0.695	49.402	43.658	40.115	4.789	4.161	44.683

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

