4.2. Evaluation Metrics
We adopt four widely used quantitative metrics to evaluate the performance of super-resolution reconstruction: Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), Spectral Angle Mapper (SAM), and Structural Similarity Index Measure (SSIM).
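PSNR: The PSNR measures pixel-wise fidelity and is derived from the mean squared error (MSE) [44]. A higher PSNR indicates better quality of the reconstructed image. In its standard band-averaged form, it is defined as follows:
$$\mathrm{PSNR} = \frac{1}{B}\sum_{b=1}^{B} 10\log_{10}\left(\frac{\max(X_b)^2}{\mathrm{MSE}(\hat{X}_b, X_b)}\right)$$
where $\hat{X}$ is the fused image, $X$ is the reference image, the subscript $b$ denotes the $b$-th band, and $B$ is the number of spectral bands.
ERGAS: ERGAS reflects both spatial and spectral quality [45]. A lower ERGAS score indicates better fusion performance. In its standard form, it is calculated as follows:
$$\mathrm{ERGAS} = \frac{100}{r}\sqrt{\frac{1}{B}\sum_{b=1}^{B}\frac{\mathrm{MSE}(\hat{X}_b, X_b)}{\mu_b^2}}$$
where $r$ is the spatial scaling factor and $\mu_b$ is the mean of the $b$-th band of the reference image.
SAM: The SAM evaluates spectral similarity by treating each pixel’s spectral signature as a vector and computing the angle between the estimated and reference vectors [46]. A smaller angle indicates better spectral fidelity:
$$\mathrm{SAM} = \frac{1}{HW}\sum_{i=1}^{HW}\arccos\left(\frac{\hat{\mathbf{x}}_i^{\top}\mathbf{x}_i}{\|\hat{\mathbf{x}}_i\|_2\,\|\mathbf{x}_i\|_2}\right)$$
where $\hat{\mathbf{x}}_i$ and $\mathbf{x}_i$ are the spectral vectors of the $i$-th pixel in the fused and reference images, and H and W are the height and width of the image.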
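SSIM: The SSIM measures the structural similarity between the reconstructed and reference images [47]. A value closer to one implies higher similarity:
$$\mathrm{SSIM} = \frac{(2\mu_{\hat{X}}\mu_{X} + C_1)(2\sigma_{\hat{X}X} + C_2)}{(\mu_{\hat{X}}^2 + \mu_{X}^2 + C_1)(\sigma_{\hat{X}}^2 + \sigma_{X}^2 + C_2)}$$
where $\mu_{\hat{X}}$ and $\mu_{X}$ are the means, $\sigma_{\hat{X}}^2$ and $\sigma_{X}^2$ are the variances of the fused image $\hat{X}$ and the reference image $X$, respectively, $\sigma_{\hat{X}X}$ is the covariance, and $C_1$ and $C_2$ are constants to stabilize the division.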
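As a rough illustration of how these four metrics can be computed, the following NumPy sketch uses their standard textbook forms on image cubes of shape (H, W, B). It is not the evaluation code used in the experiments: the function names, the per-band peak used in the PSNR, the degree-valued SAM, and the single-window (non-sliding) SSIM with constants k1 = 0.01 and k2 = 0.03 are all assumptions made for this example.

```python
import numpy as np

def psnr(ref, est):
    """Band-averaged PSNR in dB; ref and est are (H, W, B) cubes."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))   # per-band MSE
    peak = np.max(ref, axis=(0, 1))                # per-band peak of the reference (assumption)
    return float(np.mean(10.0 * np.log10(peak ** 2 / mse)))

def ergas(ref, est, ratio):
    """ERGAS for a spatial scaling factor `ratio`."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))
    mean_sq = np.mean(ref, axis=(0, 1)) ** 2       # squared per-band mean of the reference
    return float(100.0 / ratio * np.sqrt(np.mean(mse / mean_sq)))

def sam(ref, est, eps=1e-12):
    """Mean spectral angle over all pixels, in degrees."""
    dot = np.sum(ref * est, axis=2)
    norm = np.linalg.norm(ref, axis=2) * np.linalg.norm(est, axis=2)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)   # clip guards against rounding outside [-1, 1]
    return float(np.degrees(np.mean(np.arccos(cos))))

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM of two 2-D bands computed over one global window (simplification)."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)
```

Published SSIM values are usually obtained with a sliding Gaussian window and averaged over bands, so absolute numbers from this global variant will not match table entries; the sketch only shows how each quantity is assembled.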
4.3. Experimental Setup
For the simulated experiments, we adopt the widely used protocol in [24], employing an IKONOS-like SRF to mimic realistic spectral degradation. Given an HRHSI, the corresponding LRHSI is generated by applying Gaussian blurring followed by spatial downsampling, while the HRMSI is obtained by projecting the HRHSI through the spectral response matrix. Taking PaviaU as an example, the original dataset is regarded as the HRHSI, from which the LRHSI and HRMSI are derived through these spatial and spectral degradations, respectively. In our experiments, the kernel size and the standard deviation σ are determined according to the scaling factor, as formulated in (15) and (16).
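The degradation pipeline described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, edge-replicated padding and plain decimation are choices made for the example, the kernel size and σ in the usage note are placeholders rather than the values given by (15) and (16), and the spectral response matrix is a random stand-in for the IKONOS-like SRF.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def spatial_degrade(hrhsi, kernel, scale):
    """Blur every band with `kernel`, then keep every `scale`-th pixel."""
    h, w, _ = hrhsi.shape
    k = kernel.shape[0]
    pad = k // 2
    # Edge replication avoids darkening at the borders (an assumption of this sketch).
    padded = np.pad(hrhsi, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros(hrhsi.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            blurred += kernel[i, j] * padded[i:i + h, j:j + w, :]
    return blurred[::scale, ::scale, :]

def spectral_degrade(hrhsi, srf):
    """Project B hyperspectral bands onto b multispectral bands; srf has shape (B, b)."""
    return hrhsi @ srf
```

For example, with a placeholder 7 × 7 kernel and σ = 2.0, `spatial_degrade(hrhsi, gaussian_kernel(7, 2.0), 8)` maps a 320 × 320 × B cube to 40 × 40 × B, while `spectral_degrade(hrhsi, srf)` keeps the 320 × 320 grid and reduces the band count to the number of columns in `srf`.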
For PaviaU, the lower-left 320 × 320 patch is selected for testing, and the rest of the image is used for training. For Chikusei, a 512 × 512 region spanning rows and columns 300–812 is selected for testing, and the rest is used for training. For Xiongan, a 512 × 512 region defined by rows 1750–2262 and columns 650–1162 is designated as the test set, with all other pixels used for training. After removing the water absorption bands, PaviaU and Xiongan each retain 93 bands and are processed with the same spectral response.
In the real-world experiment, GF5 AHSI images are used as HRHSIs and Sentinel-2A images as HRMSIs. The images are preprocessed using ENVI 5.6 to perform spectral resampling, dimensionality reduction, and cropping. The resulting HRHSI and HRMSI are of size 300 × 300 × 106 and 900 × 900 × 10, respectively. The LRHSI is generated using a 9 × 9 Gaussian blur and 3× downsampling, resulting in 900 × 900 × 106.
We select two 60 × 60 patches in the top-left diagonal of the image as validation data (the corresponding HRHSI patches are 20 × 20), and use the remainder for training. The entire image is used for testing.
During the training phase, the training samples are generated by partitioning the entire image into patches using a predefined regular grid. Specifically, for the hyperspectral–multispectral fusion task, after constructing the low-resolution hyperspectral images (LRHSIs) via Gaussian blurring followed by spatial downsampling, and generating the high-resolution multispectral images (HRMSIs) using a spectral response matrix, deterministic sliding-window cropping is applied. The HRHSI and HRMSI are divided on the high-resolution grid, while the LRHSI is partitioned on the corresponding low-resolution grid, using fixed patch sizes and strides. Notably, this process is deterministic and does not involve random cropping during training.
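The deterministic sliding-window cropping can be illustrated with the short sketch below. The function names, patch size, and stride are hypothetical, and the sketch assumes that both the patch size and the stride are multiples of the scaling factor so that every high-resolution window maps onto whole pixels of the low-resolution grid.

```python
import numpy as np

def grid_origins(length, patch, stride):
    """Top-left offsets of all full patches along one axis."""
    return list(range(0, length - patch + 1, stride))

def paired_patches(hrhsi, hrmsi, lrhsi, scale, hr_patch, hr_stride):
    """Return aligned (HRHSI, HRMSI, LRHSI) patch triplets on a fixed grid."""
    if hr_patch % scale or hr_stride % scale:
        raise ValueError("patch size and stride must be multiples of the scale")
    lr_patch = hr_patch // scale
    samples = []
    for r in grid_origins(hrhsi.shape[0], hr_patch, hr_stride):
        for c in grid_origins(hrhsi.shape[1], hr_patch, hr_stride):
            rl, cl = r // scale, c // scale   # same window on the low-resolution grid
            samples.append((
                hrhsi[r:r + hr_patch, c:c + hr_patch],
                hrmsi[r:r + hr_patch, c:c + hr_patch],
                lrhsi[rl:rl + lr_patch, cl:cl + lr_patch],
            ))
    return samples
```

Because the grid depends only on the image size, patch size, and stride, repeated calls return the same patches in the same order, which is what distinguishes this scheme from random cropping.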
The network is trained using the Adam optimizer, with an initial learning rate of 1 × 10⁻⁴, for a maximum of 1000 epochs, and default optimizer parameters. A step decay learning rate schedule is adopted, where the learning rate is multiplied by 0.5 every 100 epochs. All experiments are conducted on an NVIDIA RTX 3090 GPU. For clarity, the best results are indicated in bold, and the second best are underlined.
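The step decay schedule can be written as a one-line function. The sketch below assumes zero-based epoch indexing, so the first halving takes effect at epoch 100; the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-4, gamma=0.5, step=100):
    """Learning rate at a zero-based epoch: base_lr is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at the start of training, after the first decay, and in the final epoch.
schedule = [step_decay_lr(e) for e in (0, 100, 999)]
```

Over 1000 epochs the rate is halved nine times, ending at roughly 1.95 × 10⁻⁷. In PyTorch, the same behaviour would normally be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)`.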
4.5. Performance on Public Dataset
In experiments on public datasets, we compare the PMSwinNet with several recent and widely adopted SOTA HSISR methods under identical Gaussian kernel and scaling factor settings. The comparison involves one matrix factorization-based method, HySure [21], two unsupervised models, UDALN [32] and FeafusFormer [33], and six supervised models: 3DT-Net [36], PSRT [35], DCTransformer [34], MoGDCN [37], DRT [38] and SSDT [30].
HySure is a physically interpretable method based on subspace regularization that aims to reduce spectral distortion when fusing LRHSIs and HRMSIs. UDALN is an unsupervised three-stage model that adaptively learns both the PSF and the SRF to perform super-resolution without ground truth. FeafusFormer contains a Multi-level Cross-feature Attention mechanism, employing Transformers to encode multi-level features and fuse local and global information for cross-modal interaction. Among the supervised deep learning methods, 3DT-Net and DCTransformer are Transformer-based networks. The 3DT-Net combines a Swin Transformer and a CNN for spatial–spectral feature extraction, while DCTransformer uses a bidirectional cross-attention Transformer block to enable mutual reconstruction between MSIs and HSIs. PSRT replaces traditional self-attention with pyramid structure attention and utilizes a Shuffle–Reshuffle mechanism to model both local and global information. MoGDCN introduces a denoising module based on deformable convolution networks (DCNs) and employs a sampled U-Net structure for reconstruction. DRT enhances the joint restoration of spatial and spectral information by leveraging deep feature representations, thereby alleviating detail loss and spectral distortion during the fusion process. SSDT is a deep network that incorporates a spatial–spectral joint modeling strategy.
(1) Results on Chikusei: Table 2 presents the quantitative results of different models on the Chikusei dataset. Overall, the PMSwinNet achieves the best performance across most evaluation metrics. At the 8× scale, SSDT yields the lowest ERGAS and DCTransformer achieves the best SAM, whereas the PMSwinNet records the highest PSNR and SSIM, showing clear advantages in spatial fidelity. At the 16× scale, the PMSwinNet still obtains the highest PSNR. This can be attributed to the intrinsic characteristics of the Chikusei dataset, which contains abundant agricultural areas and complex urban textures with large variations in object scales. Conventional convolutional neural networks such as MoGDCN, constrained by fixed receptive fields, struggle to simultaneously capture fine-grained textures and large homogeneous regions. In contrast, the Pyramid Enhancement Swin Transformer Block (PESTB) introduced in the PMSwinNet effectively extracts multi-scale features, enabling the model to capture spatial structures at different levels of granularity. Regarding ERGAS, the PMSwinNet performs slightly worse than SSDT. This may be due to the increased computational complexity introduced by the PESTB module when pursuing high spatial resolution reconstruction, which can lead to minor quantization errors. For SSIM, the PMSwinNet ranks first at 8× and second at 16×, where FeafusFormer is marginally ahead. The small SSIM differences across the methods are likely attributable to training stochasticity.
Although the PMSwinNet does not lead in every single metric, it demonstrates the most consistent and competitive overall performance. The visual comparison on the Chikusei dataset (Figure 3) further supports this conclusion: the PMSwinNet shows the lightest regions in the SAM and DIF heatmaps. In the MRAE map, while slightly stronger color intensity is observed in the central region compared with UDALN, MoGDCN, DRT, SSDT and FeafusFormer, the PMSwinNet displays significantly lighter colors along the upper-right river, indicating superior fusion quality overall. These results further demonstrate the advantage of the Swin Transformer’s shifted window mechanism in modeling long-range spatial dependencies, which effectively leverages spectral correlations among neighboring materials to compensate for information loss caused by downsampling.
(2) Results on PaviaU: The quantitative performance of various models on the PaviaU dataset is summarized in Table 3. At the 8× scale, the PMSwinNet ranks first across all metrics except SAM, where it slightly trails 3DT-Net and DCTransformer. At the 16× scale, the PMSwinNet surpasses all competing methods on every metric except ERGAS. This phenomenon can be explained by the underlying physical mechanism: under high downsampling ratios, spatial details are severely degraded, and simple linear mappings or local feature extraction methods become insufficient. The success of the PMSwinNet lies in its dual modeling capability. Specifically, the Swin Transformer module captures global dependencies across spectral bands, ensuring spectral fidelity, while the pyramid structure enhances the representation of complex spatial structures such as campus buildings and shadows. The visual comparisons in Figure 4 further corroborate these findings. The PMSwinNet produces the most visually faithful results, as evidenced by the lightest regions in the SAM, MRAE, and DIF error maps. Moreover, the reconstructed objects exhibit sharper boundaries and are free from noticeable artifacts. This can be attributed to the model’s ability to effectively balance spatial texture recovery and spectral alignment through multi-scale feature learning during feature evolution, resulting in superior spectral consistency and reduced reconstruction errors.
(3) Results on Xiongan: A quantitative comparison of different models on the Xiongan dataset is presented in Table 4. At both fusion scales, the PMSwinNet exhibits consistently strong results. At the 8× scale, it achieves a PSNR of 50.6368, markedly surpassing all other methods. For ERGAS and SAM, it ranks second, slightly behind SSDT and DCTransformer, while for SSIM, PMSwinNet and DCTransformer both achieve leading results, demonstrating strong structural fidelity. At the 16× scale, the PMSwinNet attains the best scores in the PSNR, SAM, and ERGAS, highlighting its robustness and reconstruction capability under more challenging downsampling conditions. Although it is competitive with the Transformer-based DCTransformer in terms of SSIM, the PMSwinNet demonstrates greater stability in preserving SAM, suggesting that its high-frequency compensation mechanism plays a critical role. The visual results in Figure 5 further confirm these findings. The PMSwinNet yields the lightest regions in both the MRAE and SAM maps. In the DIF visualization, its performance is close to HySure; however, the fused images of HySure exhibit larger deviations from the ground-truth, whereas PMSwinNet better preserves spatial–spectral fidelity.
Across all three datasets, the PMSwinNet demonstrates outstanding robustness in spatial–spectral reconstruction. Its superiority is particularly evident under high upscaling factors (16×) and in scenarios involving complex textures, where it consistently achieves leading PSNR and SSIM values. This advantage stems from the multi-scale feature aggregation capability of the PESTB module, which effectively compensates for spatial structural information lost during downsampling. Meanwhile, the long-range dependency modeling enabled by the Swin Transformer overcomes the limitations of local receptive fields by leveraging global contextual relationships, ensuring high spectral consistency. As a result, PMSwinNet achieves an effective balance between reconstruction accuracy and spectral fidelity, especially when handling high compression ratios and heterogeneous scenes.
4.8. Blind HSI Super-Resolution
While previous experiments presupposed knowledge of the downsampling blur kernel, real-world degradation processes are generally unknown and exhibit considerable complexity. To address this, we conduct a blind super-resolution experiment on the Chikusei dataset with an 8× scaling factor, following the protocol of MoGDCN.
During training, we adopt the same Gaussian kernel type as before but randomly sample the standard deviation within [1.0, 3.0]. For testing, the standard deviation is varied from 1.0 to 3.0 in increments of 0.2, ensuring that each LRHSI input undergoes a distinct degradation at every iteration, thereby simulating real-world uncertainty.
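The kernel settings for the blind experiment can be sketched as follows. The text does not state how σ is drawn during training, so uniform sampling over [1.0, 3.0] is an assumption here, and both function names are hypothetical.

```python
import random

def eval_sigma_grid(lo=1.0, hi=3.0, step=0.2):
    """Evenly spaced test-time standard deviations, rounded to remove float drift."""
    n = round((hi - lo) / step) + 1
    return [round(lo + i * step, 10) for i in range(n)]

def sample_train_sigma(rng, lo=1.0, hi=3.0):
    """Draw one training-time standard deviation (uniform sampling is assumed)."""
    return rng.uniform(lo, hi)

rng = random.Random(0)                                   # fixed seed only for reproducibility
train_sigmas = [sample_train_sigma(rng) for _ in range(5)]
test_sigmas = eval_sigma_grid()                          # 1.0, 1.2, ..., 3.0
```

The test grid contains eleven values, one for each Gaussian kernel evaluated in the blind setting.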
It is worth noting that the blind super-resolution setting inherently involves unknown degradation processes, including implicit noise perturbations and model mismatch. Therefore, evaluating model performance under this setting can indirectly reflect its robustness to various noise conditions. A model capable of maintaining stable reconstruction quality in such scenarios can be considered to possess an effective noise suppression ability.
We compare the PMSwinNet with six representative deep learning methods: PSRT, 3DT-Net, DCTransformer, MoGDCN, DRT and SSDT. The quantitative results, summarized in Table 6, report the PSNR, SAM, ERGAS, and SSIM across eleven Gaussian kernels. Compared with the non-blind setting, all models maintain stable performance, with only 3DT-Net showing notable variation. Importantly, the PMSwinNet consistently outperforms all competitors under every degradation condition.
To further evaluate robustness, we extend testing to more challenging cases using Gaussian kernels outside the training range [1.0, 3.0] and a standard Bicubic kernel. As shown in Table 7, although performance slightly decreases under these settings, the PMSwinNet still achieves the best results across all metrics, confirming its strong adaptability to diverse and unknown degradations.
These results indicate that PMSwinNet can effectively suppress noise-induced artifacts while preserving fine spatial details, even under unknown degradation conditions.
4.9. Performance on Real Data
To assess real-world applicability, we further evaluate the PMSwinNet on the GF5-S2A dataset. The fusion outcomes are visually presented in Figure 7. For quantitative evaluation, we adopt the no-reference metric D_λ (Spectral Distortion Index), where lower values indicate better spectral fidelity [48]. In addition, we report the model complexity in terms of the number of parameters (Params) and floating-point operations (FLOPs), aiming to provide a comprehensive analysis of both reconstruction performance and computational efficiency. (Note that QNR (Quality with No Reference) is not employed here, as it requires a panchromatic reference image, which is unavailable in this scenario.)
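For readers unfamiliar with the spectral distortion index, the sketch below follows its commonly used definition from the QNR framework: the inter-band universal image quality index (UIQI) is computed for every band pair in the fused image and in the low-resolution input, and the absolute differences are averaged. Whether the evaluation here uses exactly this form is an assumption; the function names, the exponent p = 1, and the use of a single global window instead of sliding windows for the UIQI are simplifications made for the example.

```python
import numpy as np

def uiqi(x, y):
    """Universal image quality index of two 2-D bands over one global window."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return 4.0 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))

def d_lambda(fused, lrhsi, p=1):
    """Spectral distortion between a fused cube and the low-resolution HSI, both (H, W, B)."""
    b = fused.shape[2]
    total = 0.0
    for l in range(b):
        for r in range(b):
            if l == r:
                continue
            q_fused = uiqi(fused[:, :, l], fused[:, :, r])   # inter-band similarity after fusion
            q_input = uiqi(lrhsi[:, :, l], lrhsi[:, :, r])   # inter-band similarity before fusion
            total += abs(q_fused - q_input) ** p
    return (total / (b * (b - 1))) ** (1.0 / p)
```

A value of zero means that the inter-band relationships of the low-resolution input are fully preserved in the fused result, which is why no high-resolution reference is needed.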
Since HySure is an optimization-based, model-driven method, its core relies on an iterative solving process rather than a feed-forward architecture with a fixed number of learnable parameters. Therefore, it cannot be fairly evaluated using the Params metric. Similarly, its computational complexity is dominated by data-dependent iterative operations, making it difficult to define a standardized FLOP value for fair comparison. For UDALN, although it is an unsupervised approach, its training process depends on a data-specific adaptive network structure. As a result, its parameter scale may vary under different data settings and is not directly comparable to that of standard supervised models. In addition, due to its adaptive architecture and dynamic training scheme, the FLOPs of UDALN are also data-dependent and cannot be consistently quantified under a unified setting. To avoid introducing potentially misleading comparisons, these methods are not included in the table under the Params and FLOPs metrics.
As shown in Table 8, the PMSwinNet achieves the lowest D_λ value (0.0038) among all compared methods on the GF5-S2A real dataset, indicating its superior capability in preserving spectral consistency in real-world remote sensing scenarios. Compared with representative approaches, such as 3DT-Net, DRT, SSDT, and DCTransformer, the PMSwinNet demonstrates a more pronounced advantage in suppressing spectral distortion. Meanwhile, the PMSwinNet contains 6.7869 M parameters, representing a moderate model scale comparable to SSDT and MoGDCN. Despite not being the smallest model, it achieves optimal D_λ performance. Although PSRT exhibits a lower parameter count and FLOPs, its D_λ value reaches 0.0194, which is significantly inferior to that of the PMSwinNet, suggesting that lightweight design alone is insufficient to ensure spectral fidelity in real-world scenarios.
Overall, the PMSwinNet attains superior spectral preservation performance under acceptable model complexity, highlighting its strong practical potential for real hyperspectral image fusion tasks.