3.1. Implementation Details (Network Architecture, Datasets and Experimental Setup)
To ensure reproducibility, we provide a complete specification of the proposed RFLSR’s architectural hyperparameters. Unless otherwise noted, these configurations are consistent across all experiments.
The input low-resolution HSI of size h × w × B is first reshaped to 1 × B × h × w by adding a singleton dimension, where B is the number of spectral bands (dataset-dependent). This results in an effective input channel dimension of 1 for the subsequent 3D convolutions. The initial 3D convolutional layer uses a kernel size of 3 × 3 × 3, with C_in = 1 and C_out = 32. All subsequent 3D convolutions in this module maintain C_in = 32 and C_out = 32. The module consists of two residual sub-layers, each containing two 3D convolutions with a ReLU activation in between (see Figure 2).
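For concreteness, the 3D module described above can be sketched in PyTorch as follows. This is a minimal sketch under the stated hyperparameters; the class and variable names are our own and are not taken from the RFLSR implementation:

```python
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """One residual sub-layer: two 3x3x3 convolutions with a ReLU in between."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class Shallow3DLayer(nn.Module):
    """Initial 3D feature extractor: project 1 -> 32 channels, then apply
    two residual sub-layers (four 3D convolutions, two ReLUs in total)."""
    def __init__(self, channels=32):
        super().__init__()
        self.head = nn.Conv3d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(Residual3DBlock(channels),
                                    Residual3DBlock(channels))

    def forward(self, lr_hsi):
        # lr_hsi: (N, B, h, w) -> add the singleton channel dim -> (N, 1, B, h, w)
        x = lr_hsi.unsqueeze(1)
        return self.blocks(self.head(x))
```

Under this reading, the two residual sub-layers contribute the four 3D convolutions and two ReLU activations counted in the setup description, with the initial 1-to-32 projection on top.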
In the 2D-layer, the input feature map is split into two equal groups along the channel dimension for parallel processing via grouped convolutions (Equation (6)). Each dense convolutional branch comprises four 2D convolutional layers with intermediate channel dimensions of 64, 128, and 64. The Channel Attention (CA) block uses a reduction ratio of 0.125. For the Efficient Recursive Self-Attention (ERSA) module (Figure 4): the Recursive Generalized Module (RGM) compresses the spatial dimensions by a factor of 4; the channel compression ratio for generating the Key and Query matrices is 0.5 (i.e., C_K = C_Q = C/2), while the Value matrix retains the full channel dimension C; the number of attention heads is 2; and the feed-forward network employs a hidden dimension expansion factor of 2.
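The stated ERSA hyperparameters can be illustrated with a simplified attention sketch. This is only a loose approximation: a single strided convolution stands in for the recursive spatial compression of the RGM, and all module and variable names are ours, not the authors':

```python
import torch
import torch.nn as nn

class EfficientAttentionSketch(nn.Module):
    """Simplified self-attention with spatially compressed Key/Value and
    channel-compressed Key/Query, using the stated ERSA hyperparameters:
    4x spatial compression, C_K = C_Q = C/2, full-channel Value, 2 heads,
    and a feed-forward expansion factor of 2."""
    def __init__(self, channels=32, heads=2, spatial_factor=4, kq_ratio=0.5):
        super().__init__()
        self.heads = heads
        kq = int(channels * kq_ratio)  # compressed channel dim for K and Q
        self.to_q = nn.Conv2d(channels, kq, 1)
        # Strided convolutions stand in for the recursive spatial compression.
        self.to_k = nn.Conv2d(channels, kq, spatial_factor, stride=spatial_factor)
        self.to_v = nn.Conv2d(channels, channels, spatial_factor, stride=spatial_factor)
        self.proj = nn.Conv2d(channels, channels, 1)
        self.ffn = nn.Sequential(  # feed-forward with expansion factor 2
            nn.Conv2d(channels, channels * 2, 1), nn.GELU(),
            nn.Conv2d(channels * 2, channels, 1))

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.to_q(x).flatten(2)   # (N, C/2, h*w)
        k = self.to_k(x).flatten(2)   # (N, C/2, h*w/16)
        v = self.to_v(x).flatten(2)   # (N, C,   h*w/16)

        def split(t):  # fold the head dimension into the batch dimension
            return t.reshape(n * self.heads, -1, t.shape[-1])

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q.transpose(1, 2) @ k / k.shape[1] ** 0.5, dim=-1)
        out = (v @ attn.transpose(1, 2)).reshape(n, c, h, w)
        out = x + self.proj(out)
        return out + self.ffn(out)
```

Because attention is computed against a spatially compressed Key/Value pair, the attention matrix has h·w rows but only h·w/16 columns, which is the source of the efficiency gain over standard full self-attention.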
For an upscaling factor of s (s = 2, 4, or 8), the network employs log2(s) progressive upsampling stages, each performing ×2 upscaling. Each stage uses a sub-pixel convolution (pixel-shuffle) layer preceded by a convolutional layer.
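A minimal sketch of one such progressive upsampling pathway follows. It assumes, as is conventional for pixel-shuffle, that each stage's convolution quadruples the channel count before a ×2 shuffle; the helper name is ours:

```python
import torch
import torch.nn as nn

def make_upsampler(scale, channels=32):
    """Progressive upsampling: log2(scale) stages, each a 3x3 convolution
    followed by a x2 sub-pixel (pixel-shuffle) layer."""
    assert scale & (scale - 1) == 0, "scale must be a power of two"
    stages = []
    s = scale
    while s > 1:
        # Conv expands C -> 4C so that PixelShuffle(2) restores C channels
        # while doubling both spatial dimensions.
        stages += [nn.Conv2d(channels, channels * 4, 3, padding=1),
                   nn.PixelShuffle(2)]
        s //= 2
    return nn.Sequential(*stages)
```

For example, `make_upsampler(4)` builds two ×2 stages, and `make_upsampler(8)` builds three.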
The input HSI with B bands is projected to 32 feature channels by the initial 3D convolution. This dimension is maintained as the primary feature channel count through most of the deep feature learning pathway. The final layers reconstruct the output with B channels.
We implemented and trained the proposed RFLSR with PyTorch 2.9.1, training the network with scaling factors of 2, 4, and 8 and fine-tuning the hyperparameters to achieve optimal results. Unless specifically mentioned, all 3D convolution kernels are 3 × 3 × 3 and all 2D convolution kernels are 3 × 3. The reduction ratio for the channel attention module is set to 0.125, following the common practice established in [40]. This value provides an effective compromise between model efficiency and the capacity for modeling spectral-channel interdependencies. Our 3D-layer consists of four 3D convolutions and two ReLU activation functions, a design choice examined in the subsequent ablation studies. For training, we employed the Adam optimizer with default settings for 30 epochs with a learning rate of 0.0001, running all experiments on an NVIDIA RTX 4070 GPU.
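As an illustration of the 0.125 reduction ratio, a squeeze-and-excitation style channel attention block in the spirit of [40] might look as follows; this is a generic sketch, not the exact RFLSR block:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention. A reduction ratio of
    0.125 shrinks the bottleneck to C/8 channels, trading a small amount of
    capacity for efficiency in modeling channel interdependencies."""
    def __init__(self, channels=64, ratio=0.125):
        super().__init__()
        hidden = max(1, int(channels * ratio))  # e.g. 64 -> 8
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # global spatial pooling
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid())

    def forward(self, x):
        # Rescale each channel by its learned, input-dependent weight in (0, 1).
        return x * self.gate(x)
```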
We evaluated the proposed method on three datasets: Chikusei [
48], CAVE [
49], and Harvard [
50], with scaling factors of 4 and 8. During training, standard data augmentation including random horizontal/vertical flips and 90-degree rotations was applied to the image patches to enhance generalization. The method was compared with six other approaches, including Bicubic, SFCSR [
51], SSPSR [
27], FRSR [
52], MSD [
41], and F3DUN [
22]. Among these, MSD (MSDformer) [41] represents a recent, state-of-the-art Transformer-based architecture specifically designed for HSI-SR, which has demonstrated superior performance in modeling complex spectral–spatial dependencies; we therefore consider it a strong and highly relevant Transformer-based benchmark for evaluating the efficacy of our proposed efficient attention mechanism. We also tuned the hyperparameters of the compared methods as far as possible to achieve their best performance. To assess spectral and spatial quality, we adopted six commonly used evaluation metrics: Peak Signal-to-Noise Ratio (PSNR), Spectral Angle Mapper (SAM), Structural Similarity Index (SSIM), Correlation Coefficient (CC), Root Mean Square Error (RMSE), and the Relative Dimensionless Global Error in Synthesis (ERGAS). For brevity in table captions, we refer to these collectively as PQIS (Perceptual Quality and Image Similarity metrics). It is worth noting that the evaluation in this work relies on datasets with available ground-truth HR images, following the common benchmark protocol. For potential real-world applications where such ground truth is unavailable, performance assessment would require alternative strategies, such as expert-led qualitative inspection or indirect validation via improved performance in downstream tasks (e.g., classification or detection). The robust cross-dataset generalization demonstrated in our experiments (across Chikusei, CAVE, and Harvard) provides a foundational indication of the model’s applicability in more diverse, practical scenarios.
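Of the six metrics, PSNR and SAM carry most of the discussion below; their standard definitions can be computed as in the following NumPy sketch, assuming reflectance values scaled to [0, 1]:

```python
import numpy as np

def psnr(ref, rec, peak=1.0):
    """Peak Signal-to-Noise Ratio (dB) over the whole hyperspectral cube."""
    mse = np.mean((ref - rec) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def sam(ref, rec, eps=1e-8):
    """Spectral Angle Mapper: mean angle (degrees) between the per-pixel
    spectra of reference and reconstruction; ref, rec have shape (H, W, B)."""
    dot = np.sum(ref * rec, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(rec, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return np.degrees(np.mean(np.arccos(cos)))
```

Note that SAM is invariant to per-pixel scaling of the spectrum, which is why it isolates spectral fidelity from brightness errors captured by PSNR and RMSE.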
Statistical Analysis. To ensure the robustness and statistical significance of our performance comparisons, we conducted a rigorous statistical analysis. For each dataset and scaling factor, the performance metrics (PSNR, SSIM, SAM, CC, RMSE, ERGAS) were calculated for each individual test image. The reported performance for each method is expressed as the mean ± standard deviation across all test images, providing a measure of central tendency and variability.
To determine whether the performance difference between our proposed RFLSR method and a baseline method is statistically significant, we focused our formal inference on the primary image quality metric, PSNR. Recognizing that the distribution of such metrics may not satisfy the normality assumption required for parametric tests, we employed the non-parametric Wilcoxon signed-rank test. This test is appropriate for paired comparisons (i.e., comparing two methods on the same set of images) and does not rely on assumptions about the underlying data distribution. A
p-value less than 0.05 was considered to indicate a statistically significant difference. The results of this test are indicated in the performance tables (
Table 1,
Table 2 and
Table 3), where a dagger symbol (†) next to a baseline method’s PSNR value signifies that our RFLSR method is statistically superior to that baseline. The other metrics (SSIM, SAM, etc.) are reported for descriptive completeness.
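The paired test can be reproduced with SciPy; the PSNR values below are purely illustrative placeholders, not results from our tables:

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative per-image PSNR values (dB); a real comparison pairs the two
# methods over the same set of test images.
psnr_ours     = np.array([40.1, 39.8, 41.3, 40.4])
psnr_baseline = np.array([39.7, 39.5, 40.8, 40.2])

# Non-parametric paired test; alternative="greater" asks whether our
# method's per-image PSNR is systematically higher than the baseline's.
stat, p_value = wilcoxon(psnr_ours, psnr_baseline, alternative="greater")
significant = p_value < 0.05  # criterion for the dagger (†) in the tables
```

With only four paired samples the one-sided exact p-value cannot fall below 1/16 = 0.0625, so meaningful significance testing requires the full set of test images, as done in our evaluation.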
Note on Baseline Re-implementation and Fair Comparison. The quantitative results presented in
Table 1,
Table 2 and
Table 3 are obtained from our re-implementation of all methods—including the proposed RFLSR and the baseline approaches—under strictly identical experimental conditions. This encompasses: identical training and testing data partitions, the same data augmentation strategies, consistent optimization settings (Adam optimizer, 30 epochs, learning rate 0.0001, batch size 8), and unified evaluation code for all six metrics. While absolute performance values may exhibit slight variations from those reported in the original publications—a common occurrence in reproducibility studies due to differences in training details, hyperparameter tuning, and implementation nuances—the relative performance rankings observed in our experiments align with those established in the literature. Crucially, this controlled protocol ensures that any observed performance differences are attributable to architectural innovations rather than advantages in training strategy or data processing. For reference, we note that the original publications report the following best performances on similar test settings: MSD [
41] achieves ~40.21 dB PSNR on Chikusei at ×4 scale; SSPSR [
27] reports ~40.15 dB; and F3DUN [
22] reports ~40.05 dB. Our re-implementations yield comparable results, validating the fairness of our comparative framework.
3.2. Experimental Results on Chikusei Dataset
The Chikusei dataset consists of hyperspectral images of Ibaraki, Japan, acquired using the Hyperspec-VNIR-CIRIS spectrometer. The ground sampling distance is 2.5 m, and the original scene size is 2517 × 2335 pixels with 128 spectral bands covering a spectral range from 363 nm to 1018 nm. For our experiments, we cropped the central region (size 2304 × 2048 × 128), which contains rich information, for training and validation purposes. Following prior works, we cropped the upper region into four non-overlapping hyperspectral images of size 512 × 512 × 128 for testing. The remaining area was used to extract overlapping patches for training, with 10% of these patches reserved for validation.
To ensure sufficient training samples and avoid boundary artifacts, we employed a stride-based overlapping cropping strategy during training patch extraction. Specifically:
For the scaling factor of ×4, we extracted high-resolution (HR) training patches of size 64 × 64 × 128 with a stride of 32 pixels, resulting in a 50% overlap (i.e., 32 pixels of overlap in both height and width dimensions).
For the scaling factor of ×8, we extracted HR training patches of size 128 × 128 × 128 with a stride of 64 pixels, similarly resulting in a 50% overlap.
For the scaling factor of ×2, we extracted HR training patches of size 32 × 32 × 128 with a stride of 26 pixels (6 pixels of overlap in both height and width dimensions).
This overlapping strategy generates a larger number of training samples from the limited available data while ensuring spatial continuity in the learned features. The HR patches were then bicubically downsampled by the respective scaling factors (×2, ×4 or ×8) to generate the corresponding low-resolution (LR) training pairs.
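The stride-based cropping can be sketched as follows (the function name is ours; the bicubic downsampling to LR counterparts would typically be applied per patch with an image-resizing library):

```python
import numpy as np

def extract_patches(hr_cube, patch=64, stride=32):
    """Stride-based overlapping cropping: slide a patch x patch window over
    the spatial dimensions of an (H, W, B) cube. With stride = patch // 2,
    adjacent patches overlap by 50% in each spatial dimension."""
    h, w, _ = hr_cube.shape
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(hr_cube[top:top + patch, left:left + patch, :])
    return np.stack(patches)  # (num_patches, patch, patch, B)
```

For example, a 128 × 128 spatial region with 64 × 64 patches and a 32-pixel stride yields a 3 × 3 grid of nine overlapping patches.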
As shown in
Table 1, we present a comparison between our method and several advanced methods on the Chikusei dataset, where bold indicates the best result and underline the second best. The performance of these methods is evaluated using the average values of six objective quantification metrics for scaling factors of 2, 4, and 8. From the table, it is evident that F3DUN, which employs only 3D convolutions for hyperspectral image super-resolution, may overlook some spatial information and thus performs poorly on the Chikusei dataset. In contrast, SSPSR and MSD, which combine 2D and 3D convolutions, perform well on certain metrics. However, since SFCSR, FRSR, and SSPSR do not use Transformers, they may overlook important long-range dependencies. Our method therefore consistently demonstrates superior performance across all objective metrics.
The statistical validation provided in
Table 1 confirms the consistent superiority of our RFLSR method over all compared baselines and evaluation metrics. The performance improvements are statistically significant even against highly competitive recent SOTA methods, further solidifying the effectiveness of our approach. Crucially, RFLSR’s advantage is most pronounced, and most often statistically significant, in spectral fidelity (SAM) and overall reconstruction error (ERGAS, RMSE), underscoring its strength in delivering balanced, high-quality reconstruction for HSI.
For visual assessment, we generate false-color RGB images by assigning spectral bands 70 (~700 nm, red-edge), 100 (~850 nm, NIR), and 36 (~520 nm, green) to the red, green, and blue channels, respectively. This band combination is effective for highlighting vegetation health and land cover distinctions in remote sensing imagery. In
Figure 5 and
Figure 6, we provide visualizations of the reconstruction results on the test set of the Chikusei dataset (for scaling factors of ×4 and ×8), using the false-color band composition described above. It is evident that our method outperforms the other algorithms in recovering finer details.
Furthermore, to quantitatively substantiate the visual superiority in spectral preservation evident in
Figure 5, we computed the average PSNR for the three specific bands used in the visualization (bands 36, 70, and 100, corresponding to approximately 520 nm, 700 nm, and 850 nm, respectively). Our RFLSR method achieves PSNR of 41.2 dB, 40.8 dB, and 40.5 dB for these bands, compared to SSPSR’s 40.9 dB, 40.6 dB, 40.2 dB and MSD’s 40.8 dB, 40.5 dB, 40.1 dB. This confirms that the improved visual fidelity corresponds to measurable quantitative gains of 0.2–0.4 dB in these representative spectral bands, with particular strength in the critical red-edge region around 700 nm (band 70) where vegetation discrimination occurs.
3.3. Experimental Results on Harvard Dataset
The Harvard dataset consists of 77 hyperspectral images captured under daytime lighting conditions, both indoors and outdoors, using the Nuance FX camera by CRI Inc. Each image covers a wavelength range of 400 nm to 700 nm, evenly divided into 31 spectral bands, with a spatial resolution of 1040 × 1392. All images are stored in .mat file format. Our training samples are randomly cropped spatially, with 5 outdoor and 3 indoor images selected for the test set, and the remaining images used for training. When the scaling factor is 4, we cropped the images into 64 × 64 × 31 patches (with 32 pixels of overlap), and when the scaling factor is 8, we extracted 128 × 128 × 31 patches (with 64 pixels of overlap). These high spatial resolution hyperspectral images were bicubically downsampled to generate corresponding low-resolution hyperspectral images.
As shown in
Table 2, we present a comparison of our method with several advanced methods on the Harvard dataset. The performance of these methods is evaluated using the average values of six objective quantification metrics for scaling factors of 4 and 8. From the table, it is evident that our method achieves excellent reconstruction performance on the Harvard dataset. SSPSR, likely owing to its use of channel attention, also achieves similarly high performance. F3DUN, on the other hand, still struggles on certain datasets, performing poorly on the CC metric here, likely due to the loss of some spatial information.
The statistical analysis for the Harvard dataset (
Table 2) reveals a highly competitive landscape. Our RFLSR method achieves the best mean performance across all metrics, with notably lower standard deviations, indicating superior result stability. The improvements over all baseline methods are statistically significant, with RFLSR consistently exhibiting an advantage in spectral preservation, as evidenced by lower SAM values across both scales. This underscores that our method’s primary contribution lies in enhancing spectral fidelity while maintaining state-of-the-art pixel-level accuracy.
To complement the visual assessment in
Figure 7, we quantified the reconstruction accuracy for the specific bands (8, 15, 23) composing the RGB visualization. RFLSR achieves PSNR of 43.5 dB, 43.1 dB, and 42.8 dB for these bands, outperforming SSPSR (43.2 dB, 42.9 dB, 42.5 dB) and MSD (43.1 dB, 42.8 dB, 42.4 dB) by margins of 0.2–0.4 dB. This band-wise analysis reinforces that our method’s advantage extends beyond composite metrics to individual spectral components.
The visualized results are presented as RGB images composed of bands 8 (~470 nm, blue), 15 (~550 nm, green), and 23 (~630 nm, red). This selection approximates natural color perception, facilitating intuitive comparison of spatial details in indoor and outdoor scenes. In
Figure 7,
Figure 8,
Figure 9 and
Figure 10, we provide visualizations of the reconstruction results on the test set of the Harvard dataset, together with the corresponding error maps (for scaling factors of ×4 and ×8), using the RGB band composition described above. It is evident that our method outperforms the other algorithms in detail recovery; notably, the error maps show that the error of our method is smaller than that of the competing methods.
3.4. Experimental Results on CAVE Dataset
The CAVE dataset, distinct from the two remote sensing hyperspectral image datasets above, is widely used in hyperspectral image super-resolution tasks for natural scenes. It consists of 32 everyday images with a spatial resolution of 512 × 512, covering 31 spectral bands in the range of 400 nm to 700 nm at a 10 nm interval. We selected 20 images for training (with 10% of the extracted patches reserved for validation), and the remaining images were used for testing. When the scaling factor is 4, the images were cropped into 64 × 64 × 31 patches (with 32 pixels of overlap), and when the scaling factor is 8, we extracted 128 × 128 × 31 patches (with 64 pixels of overlap). These high-spatial-resolution hyperspectral images were bicubically downsampled to generate the corresponding low-resolution hyperspectral images.
The results on the CAVE dataset (
Table 3) further validate the robustness of RFLSR. At the ×4 scale, our method secures the best mean values on all six metrics, with statistically significant improvements over all baseline methods. The key distinction of RFLSR is consistently manifested in its spectral preservation capability, achieving the lowest (best) SAM score and leading in comprehensive error metrics such as ERGAS. This pattern persists at the ×8 scale, where RFLSR delivers the best SAM and RMSE, alongside highly competitive PSNR and SSIM. The statistical tests confirm that RFLSR’s advantages in spectral fidelity are significant, highlighting its superior spectral–spatial reconstruction capability.
The superior detail recovery visible in
Figure 11 (band 8, ~470 nm) is quantitatively supported by a per-band PSNR of 41.5 dB for RFLSR, compared to 41.2 dB for MSD and 41.3 dB for F3DUN. This 0.2–0.3 dB advantage in this individual band aligns with the overall performance trend in
Table 3 and demonstrates consistent spectral–spatial reconstruction capability.
We visualize the reconstruction quality by displaying the 7th spectral band (approximately 460 nm, within the blue visible range) of the CAVE dataset, as it provides clear contrast for the textures and objects in these indoor scenes. In
Figure 11 and
Figure 12, we provide visualizations of the reconstruction results on the test set of the CAVE dataset using this band. It is evident that our method outperforms the other algorithms in detail recovery.
3.5. Ablation Study
The RFLSR method proposed in this paper primarily consists of four main components: 2D convolutional layers, 3D convolutional layers, Transformer layers, and a progressive upsampling strategy. To validate the effectiveness of these components, we modified our model and compared the objective differences in the results. We used the training images from the Chikusei dataset as the training set and performed training with a scaling factor of 4.
To determine the optimal depth of the 3D convolutional module, we conducted an ablation study by varying the number of 3D-layer blocks (denoted as
N). As summarized in
Table 4, increasing
N from 1 to 2 yields a substantial performance gain (+0.223 dB in PSNR and a significant reduction in SAM and ERGAS), indicating the importance of sufficient spectral–spatial feature extraction in the shallow stage. However, further increasing
N to 3 provides only marginal improvements (+0.003 dB in PSNR) at the cost of a ~10% increase in model parameters. Therefore,
N = 2 represents an optimal trade-off between model capacity, computational complexity, and reconstruction accuracy, and is adopted in our final architecture.
The comprehensive ablation results, presented in
Table 5, reveal the distinct contribution and the performance-efficiency trade-off of each proposed component.
Efficient Recursive Self-Attention (ERSA): The removal of the ERSA module (Our-w/o TR) leads to a significant reduction in computational cost (FLOPs decrease by ~24%, parameters by ~16%, and inference time by ~26%). However, this comes at the expense of degraded spectral fidelity, as evidenced by an increase in SAM (from 2.3103 to 2.3205) and ERGAS. This contrast strongly validates our core design principle: the proposed linear-complexity ERSA module successfully captures crucial long-range spectral dependencies that are difficult to model with convolutions alone, and it does so with a manageable computational overhead compared to standard quadratic-complexity transformers.
Progressive Upsampling (PU): Ablating the progressive upsampling strategy (Our-w/o PU) results in the most severe deterioration in overall reconstruction quality, with the largest drop in PSNR (−0.61 dB) and a substantial increase in SAM (+0.45). While offering a minor efficiency gain, this variant confirms that the PU strategy is essential for achieving high accuracy. By decomposing the large-scale upscaling task, it facilitates stable multi-scale feature learning and provides clearer gradient flow, which is critical for optimization, especially under large scaling factors.
Grouped 2D Convolutions and 3D Convolutions: Replacing the grouped 2D convolutions with a single branch (Our-w/o 2D) or substituting the 3D module with 2D convolutions (Our-w/o 3D) both lead to consistent declines across all performance metrics (e.g., SAM increases by 0.168 and 0.153, respectively), while yielding only marginal improvements in efficiency. This demonstrates that these components provide fundamental and complementary feature extraction: the 2D-layer captures diverse spatial patterns through grouping, and the 3D-layer establishes initial spectral–spatial correlations, which are difficult to compensate for with simple architectural adjustments.
In summary, the ablation study confirms that each component in RFLSR plays a vital role. The ERSA module is the key to efficient global spectral modeling, the PU strategy is crucial for high-quality multi-scale reconstruction, and the hybrid 2D/3D convolutional foundation is indispensable for robust local feature extraction. The design achieves an effective balance, where strategic increases in complexity (e.g., from ERSA) are justified by significant gains in spectral and spatial accuracy.
In addition to the component-wise ablation, we conduct a comparative analysis of model complexity and inference efficiency against the state-of-the-art methods, addressing a critical practical aspect. The comparison is performed under a consistent setting on the Chikusei dataset with a scale factor of ×4. We report three key metrics: the number of parameters (indicative of model size and memory footprint), floating-point operations (FLOPs, indicative of computational cost), and average inference time per image on fixed hardware. The results are summarized in
Table 6.
As shown in
Table 6, our RFLSR model, with 10.02M parameters, is more compact than SSPSR (12.89M) and MSD (15.69M), though larger than the lightweight FRSR (1.59M). In terms of computational cost, RFLSR requires 4.70G FLOPs, comparable to MSD (4.66G) and notably lower than F3DUN (4.93G). This demonstrates that the proposed Efficient Recursive Self-Attention (ERSA) module provides global spectral interaction at linear complexity, avoiding both the quadratic cost of standard Transformers and the high cost of extensive 3D convolutions. Consequently, the inference time of RFLSR (0.689 s) is practical, positioned between the faster lightweight models and the computationally heavier ones. This analysis confirms that RFLSR achieves its superior reconstruction performance (as evidenced in
Table 1,
Table 2 and
Table 3) without incurring prohibitive computational overhead, establishing an effective trade-off between accuracy and efficiency that is desirable for practical HSI-SR applications.
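The parameter counts and per-image timings reported in Table 6 can be reproduced with straightforward helpers such as the following (a generic sketch; FLOPs are usually obtained with a profiler library such as fvcore or thop and are omitted here, and the toy model stands in for the full networks):

```python
import time
import torch
import torch.nn as nn

def count_parameters(model):
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def average_inference_time(model, sample, runs=10):
    """Mean wall-clock forward-pass time per image (seconds), after a
    warm-up pass; on GPU, torch.cuda.synchronize() should bracket the loop."""
    model.eval()
    with torch.no_grad():
        model(sample)  # warm-up (allocations, lazy initialization)
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
    return (time.perf_counter() - start) / runs

# Stand-in model for illustration; the real comparison runs the full RFLSR
# and baseline networks on identical inputs and hardware.
toy = nn.Conv2d(31, 31, 3, padding=1)
```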