Article

Cascaded Multi-Attention Feature Recurrent Enhancement Network for Spectral Super-Resolution Reconstruction

Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2026, 18(2), 202; https://doi.org/10.3390/rs18020202
Submission received: 4 October 2025 / Revised: 14 December 2025 / Accepted: 4 January 2026 / Published: 8 January 2026

Highlights

What are the main findings?
  • A cascaded multi-attention-based method is proposed for spectral super-resolution reconstruction of hyperspectral images from RGB images.
  • The method achieves state-of-the-art performance on multiple public datasets with improved spectral fidelity and spatial detail preservation.
What are the implications of the main findings?
  • The proposed approach effectively reduces spectral distortion under complex illumination conditions.
  • It offers a practical solution for accurate hyperspectral reconstruction in remote sensing and environmental monitoring applications.

Abstract

Hyperspectral imaging (HSI) captures the same scene across multiple spectral bands, providing richer spectral characteristics of materials than conventional RGB images. The spectral reconstruction task seeks to map RGB images into hyperspectral images, enabling high-quality HSI data acquisition without additional hardware investment. Traditional methods based on linear models or sparse representations struggle to effectively model the nonlinear characteristics of hyperspectral data. Although deep learning approaches have made significant progress, issues such as detail loss and insufficient modeling of spatial–spectral relationships persist. To address these challenges, this paper proposes the Cascaded Multi-Attention Feature Recurrent Enhancement Network (CMFREN). This method achieves targeted breakthroughs over existing approaches through a cascaded architecture of feature purification, spectral balancing and progressive enhancement. This network comprises two core modules: (1) the Hierarchical Residual Attention (HRA) module, which suppresses artifacts in illumination transition regions through residual connections and multi-scale contextual feature fusion, and (2) the Cascaded Multi-Attention (CMA) module, which incorporates a Spatial–Spectral Balanced Feature Extraction (SSBFE) module and a Spectral Enhancement Module (SEM). The SSBFE combines Multi-Scale Residual Feature Enhancement (MSRFE) with Spectral-wise Multi-head Self-Attention (S-MSA) to achieve dynamic optimization of spatial–spectral features, while the SEM synergistically utilizes attention and convolution to progressively enhance spectral details and mitigate spectral aliasing in low-resolution scenes. Experiments across multiple public datasets demonstrate that CMFREN achieves state-of-the-art (SOTA) performance on metrics including RMSE, PSNR, SAM, and MRAE, validating its superiority under complex illumination conditions and detail-degraded scenarios.

1. Introduction

Hyperspectral imaging (HSI) captures reflectance information of the same scene across multiple narrow spectral bands, providing richer and more continuous spectral features compared to RGB images [1,2]. This detailed spectral description is crucial for distinguishing different materials, leading to widespread applications in medical imaging [3,4,5], object classification [6,7,8], target detection [9,10,11,12], spectral unmixing [13,14,15], and environmental monitoring [16,17,18]. However, acquiring high-quality HSI is constrained by expensive and complex imaging equipment, making large-scale deployment challenging. In contrast, RGB cameras and multispectral systems are more accessible and cost-effective. Consequently, reconstructing hyperspectral images using these low-cost devices has become a significant research focus in recent years.
Spectral Super-Resolution (SSR) technology enables the reconstruction of hyperspectral images from low-resolution spectral data (e.g., RGB), thereby alleviating hardware constraints and substantially enhancing the application value of existing datasets [19,20]. Traditional methods based on linear models or sparse representations achieved some early success but struggle to capture the nonlinear characteristics of hyperspectral data and often fail to preserve complete spatial structural details. With the advancement of deep learning, numerous deep learning-based spectral reconstruction methods for RGB images have been proposed. The fundamental workflow of deep learning-based reconstruction algorithms is illustrated in Figure 1, where the input is a three-channel RGB image, and the output is a hyperspectral image. The primary workflow structure consists of three components: feature extraction, nonlinear mapping, and spectral reconstruction. Deep learning-based spectral reconstruction methods typically employ an end-to-end network trained with a predefined loss function. Through backpropagation, the loss function is minimized during training on paired RGB and hyperspectral images. Finally, the trained network is directly utilized for spectral reconstruction.
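To make this workflow concrete, the following is a minimal PyTorch sketch of such an end-to-end training loop, assuming a hypothetical model that maps 3-channel RGB tensors to 31-band HSI tensors and a loader that yields paired (RGB, HSI) batches; all names are illustrative rather than taken from the paper's code.

```python
import torch

def mrae_loss(pred, target, eps=1e-6):
    # Mean Relative Absolute Error, a common SSR training loss
    return torch.mean(torch.abs(pred - target) / (torch.abs(target) + eps))

def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for rgb, hsi in loader:                 # rgb: (B,3,H,W), hsi: (B,31,H,W)
        rgb, hsi = rgb.to(device), hsi.to(device)
        pred = model(rgb)                   # end-to-end RGB -> HSI mapping
        loss = mrae_loss(pred, hsi)
        optimizer.zero_grad()
        loss.backward()                     # minimize the loss via backprop
        optimizer.step()
```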
Convolutional neural networks (CNNs) combined with attention mechanisms have been applied to the SSR task, significantly improving reconstruction performance. However, existing methods still face several challenges: they tend to produce noise and spectral distortion when reconstructing images in areas with rapid illumination changes; there exists an imbalance in spatial and spectral feature extraction, which may lead to excessive emphasis on spectral information and loss of spatial details, or overemphasis on spatial information and reduced spectral fidelity; in low-resolution scenes, severe spectral overlap among objects makes material differentiation challenging.
To overcome these limitations, this paper proposes the Cascaded Multi-Attention Feature Recurrent Enhancement Network (CMFREN). This method achieves hyperspectral reconstruction through two core modules: the Hierarchical Residual Attention (HRA) module and the Cascaded Multi-Attention (CMA) module. The HRA module addresses noise and spectral distortion in regions with rapid illumination changes, while the Cascaded Multi-Attention (CMA) module incorporates Spatial–Spectral Balanced Feature Extraction (SSBFE) and a Spectral Enhancement Module (SEM) to address spatial imbalance and spectral variation blurring, respectively.
Extensive experiments on multiple public datasets demonstrate that CMFREN significantly advances spectral reconstruction accuracy and preserves fine image details, outperforming state-of-the-art methods in both quantitative and qualitative evaluations. Furthermore, CMFREN exhibits strong robustness and generalization across diverse scenarios. The main contributions of this work are as follows:
  • We propose an HRA module integrating residual connections and multi-scale context to effectively suppress artifacts in regions with abrupt illumination changes;
  • We develop a spatial–spectral balanced feature extraction module to achieve adaptive balancing of spatial and spectral features;
  • We introduce a recurrent spectral enhancement module that progressively strengthens spectral details to improve reconstruction quality;
  • We achieve state-of-the-art performance across three benchmark datasets, validating the proposed method's effectiveness and broad applicability.

2. Related Work

Spectral Super-Resolution (SSR) aims to reconstruct hyperspectral images (HSI) from low-spectral-resolution images (such as RGB images). Its core challenge lies in solving the precise mapping problem from ‘finite spectral bands to continuous fine spectral details.’ Hyperspectral images contain dozens or even hundreds of continuous spectral bands, enabling precise characterization of the spectral properties of objects. They hold significant applications in fields such as remote sensing, agricultural monitoring, and environmental assessment. However, the high cost of hyperspectral imaging equipment limits its widespread adoption. SSR technology overcomes these hardware constraints through computational methods, making it a prominent research focus in recent years.

2.1. Traditional Spectral Super-Resolution Reconstruction

Early SSR research primarily depended on manually crafted prior knowledge, focusing on mathematical models that exploit the sparsity or low-rank characteristics of spectral signals to guide reconstruction. In 2015, Robles-Kelly [21] made the initial attempt to incorporate color and appearance cues into sparse coding. However, the lack of a comprehensive dictionary learning framework restricted this method’s applicability to specialized scenarios, such as multimedia images. In 2016, Arad and Ben-Shahar [22] introduced the first end-to-end sparse recovery framework at ECCV, learning a universal hyperspectral dictionary via K-SVD and employing the OMP algorithm for stable RGB-to-HSI mapping. This work established a foundational paradigm for subsequent research. Nevertheless, dictionary learning approaches are inherently constrained by limited training samples, making it difficult to capture spectral variability in complex environments. Additionally, the process of estimating sparse coefficients is computationally intensive and exhibits weak generalization. To address these challenges, subsequent studies explored shallow models to simplify the mapping process. Aeschbacher et al. [23] proposed the A+ method, which utilizes shallow linear mapping for spectral reconstruction, validating the effectiveness of ‘lightweight models + data-driven approaches.’ They also introduced sparse regularization loss to enhance noise robustness. While shallow learning alleviates computational bottlenecks associated with traditional sparse representation, it still fails to model complex nonlinear spectral relationships.

2.2. Spectral Super-Resolution Reconstruction via Deep Learning

CNNs have become the mainstream framework for SSR due to their local feature extraction capabilities, with early work focusing on network depth and feature reuse. Galliani et al. [24] adapted the Tiramisu network for semantic segmentation into DenseUnet, enhancing feature reuse through dense connections and demonstrating for the first time the potential of deep CNNs in SSR. Xiong et al. [25] proposed HSCNN+, introducing residual dense blocks to mitigate gradient vanishing issues. By fusing spatial and spectral features through multi-scale convolutions, their approach secured first place in the NTIRE 2018 challenge. Can et al. [26] designed the lightweight residual network CanNet, achieving efficient spectral reconstruction with only 6 convolutional layers and validating the effectiveness of 'moderate depth + residual connections'. However, CNNs suffer from limited receptive fields, making it difficult to capture long-range spectral dependencies, and are sensitive to image degradation (e.g., noise, compression).
To enhance feature discriminability and robustness against interference, researchers introduced attention mechanisms and multi-scale feature interaction strategies. Li et al. [27] proposed a hybrid 2D–3D residual attention network that dynamically adjusts spectral weights via channel attention to enhance feature expression in key bands. Li et al. [28] designed the DRCR Net with a Non-Local Purification Module (NPM) that employs a hierarchical pyramid structure to remove noise and compression artifacts from input images. Combined with a dual-channel recalibration module (CRM) to optimize feature response, this approach achieved third place in the NTIRE 2022 challenge. Li et al. [29] introduced the multi-scale feature fusion method HASIC-Net, employing a dual-path architecture. It extracts spatial features via 2D residual groups and captures spectral correlations through 1D residual groups. Cross-path connections enable spatial–spectral feature interaction, while the Structural Information Consistency (SIC) module preserves edge details. HASIC-Net achieved state-of-the-art performance on the CAVE and NTIRE datasets. Attention mechanisms enable networks to adaptively focus on important features, while multi-scale fusion enhances adaptability to complex scenes.
He et al. [30] proposed DsTer, combining a Transformer with ResNet. It employs multi-head self-attention to capture long-range spectral dependencies and replaces the traditional Transformer's MLP module with ResNet for processing 3D remote sensing data, significantly reducing the Spectral Angle Mapper (SAM) on remote sensing datasets such as Chikusei and Xiong'an. Cai et al. [31] proposed a Transformer-based spectral reconstruction method, MST++ (Multi-stage Spectral-wise Transformer). It employs Spectral-wise Multi-head Self-Attention (S-MSA), motivated by HSI spatial sparsity and spectral self-similarity (as shown in Figure 2), within a U-shaped structure. This extracts multi-resolution contextual information, progressively improving reconstruction quality from coarse to fine. Wang et al. [32] proposed DHRNet, which employs intrinsic image decomposition to separate 'reflectance (material intrinsic)' and 'shadow (environmental interference)' from RGB images. They then designed a dual-path network to optimize spectral reconstruction for each feature category. Finally, a feature enhancement module enforces spatial–spectral consistency, fundamentally reducing environmental interference and spectral distortion. Wang et al. [33] proposed WHANet, which decomposes RGB images into high-frequency and low-frequency features processed by CNNs and Transformers, respectively. Combined with a fast Fourier transform loss, this approach addresses spatial degradation caused by high-frequency information loss. Inspired by these studies, the proposed method integrates the strengths of deep learning approaches such as CNNs and Transformers for contextual feature fusion. To address inefficient feature propagation in traditional recurrent models, it introduces cross-iteration residual fusion, leveraging historically enhanced features to guide current optimization and strengthen feature continuity.
An improved multi-scale fusion strategy combines PixelUnshuffle downsampling with Residual Dense Blocks (ResDB) to enhance feature capture across spatial scales, compensating for CNN’s limitations in modeling global dependencies; Spectral-wise Multi-head Self-Attention (S-MSA) replaces traditional channel attention, dynamically adjusting spectral–spatial weights to balance spatial detail and spectral fidelity.

3. Proposed Methods

To address artifacts resulting from abrupt illumination changes, the imbalance in spatial and spectral feature extraction, and spectral aliasing in low-resolution scenes, this paper proposes a Cascaded Multi-Attention Feature Recurrent Enhancement Network (CMFREN) for spectral super-resolution reconstruction. The overall framework structurally integrates feature extraction and enhancement, ensuring robustness and stability across diverse complex scenes. The overall architecture of CMFREN is illustrated in Figure 3. Taking RGB images as input, CMFREN comprises a Hierarchical Residual Attention (HRA) module and ten Cascaded Multi-Attention (CMA) modules. Each CMA module integrates a Spatial–Spectral Balanced Feature Extraction (SSBFE) submodule and a Spectral Enhancement Module (SEM). The Hierarchical Residual Attention (HRA) module suppresses artifacts in transitional illumination regions through multi-scale contextual fusion. The Spatial–Spectral Balanced Feature Extraction (SSBFE) module utilizes multi-scale convolutions and spectral attention to adaptively balance spatial and spectral features. The Spectral Enhancement Module (SEM) deeply reinforces spectral differences via recurrent feedback and attention mechanisms, mitigating spectral aliasing issues at low resolutions.

3.1. Hierarchical Residual Attention (HRA) Module

Within the overall architecture, the HRA module functions as the central component of the initial stage. By integrating multi-scale contextual information across multiple feature branches, it performs feature purification and stabilization on the input features before subsequent deep feature extraction, thereby providing a more robust feature foundation for downstream modules. The structure of the HRA module is depicted in Figure 4. Each branch of the HRA module utilizes MSAA for feature refinement, enabling accurate learning of contextual features and effectively suppressing spectral artifacts. The MSAA structure is illustrated in Figure 5.
The MSAA module refines features through a dual-branch approach: ‘spatial path + channel path.’
Spatial path:
$F_{sp1} = \mathrm{Conv}_{1\times1}(F_{in}; W_{sp1})$
$F_{sp2} = \mathrm{Conv}_{3\times3}(F_{sp1}; W_{sp3}) + \mathrm{Conv}_{5\times5}(F_{sp1}; W_{sp5}) + \mathrm{Conv}_{7\times7}(F_{sp1}; W_{sp7})$
$F_{sp3} = \mathrm{Conv}_{7\times7}(F_c[\mathrm{AvgPool}(F_{sp2}), \mathrm{MaxPool}(F_{sp2})]; W'_{sp7}) \odot F_{sp2}$
For the input feature map $F_{in}$, multi-scale spatial contextual information is captured via the spatial path through the following steps: first, a 1 × 1 convolution is performed for dimensionality reduction; then, multi-scale convolutions are employed to cover illumination variations across different scale ranges; finally, enhancement is implemented using a 7 × 7 convolution combined with spatial attention.
$F_{sp1}$ is the feature after dimensionality reduction via the $1\times1$ convolution; $F_{sp2}$ is the multi-scale convolutional feature; $F_{sp3}$ is the spatially attention-enhanced feature; $W_{sp1}$, $W_{sp3}$, $W_{sp5}$, $W_{sp7}$, and $W'_{sp7}$ denote learnable weights; $\odot$ denotes the Hadamard product; $F_c[\cdot]$ denotes concatenation along the channel dimension.
Channel path: illumination exerts varying modulation intensities on different spectral channels, and channel attention globally models these interactions. First, global average pooling is applied to the input features to obtain channel-wise descriptors $F_{ch1}$. These descriptors are then passed through two stacked $1\times1$ convolutional layers to generate the channel attention feature $F_{ch2}$, which is subsequently fused with the spatial path to produce multi-scale information.
$F_{ch1} = \mathrm{AvgPool}(F_{in})$
$F_{ch2} = \mathrm{Conv}_{1\times1}(\mathrm{Conv}_{1\times1}(F_{ch1}; W_{ch1}); W_{ch2})$
The final output of MSAA is further stabilized via residual connection for feature enhancement:
$f_{MSAA} = F_{MSAA}(F_{sp3}, F_{ch2}) + F_{in}$
$f_{MSAA}$ is the output of the MSAA module, which fuses the spatial and channel paths and is stabilized by the residual connection.
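The following is a minimal PyTorch sketch of this dual-branch MSAA design. The text does not fully specify how $F_{sp3}$ and $F_{ch2}$ are combined inside $F_{MSAA}(\cdot)$, so the sketch assumes sigmoid-gated channel re-weighting of the spatial-path output before the residual connection; channel widths are likewise illustrative.

```python
import torch
import torch.nn as nn

class MSAA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        reduced = channels // 2                                 # assumed width
        self.reduce = nn.Conv2d(channels, reduced, 1)           # F_sp1
        self.conv3 = nn.Conv2d(reduced, reduced, 3, padding=1)
        self.conv5 = nn.Conv2d(reduced, reduced, 5, padding=2)
        self.conv7 = nn.Conv2d(reduced, reduced, 7, padding=3)
        self.spatial_att = nn.Conv2d(2, 1, 7, padding=3)        # on [avg, max]
        self.expand = nn.Conv2d(reduced, channels, 1)
        self.ch_fc = nn.Sequential(                             # channel path
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1))

    def forward(self, x):
        s = self.reduce(x)
        s = self.conv3(s) + self.conv5(s) + self.conv7(s)       # F_sp2
        pooled = torch.cat([s.mean(1, keepdim=True),
                            s.max(1, keepdim=True).values], dim=1)
        s = torch.sigmoid(self.spatial_att(pooled)) * s         # F_sp3 (Hadamard)
        s = self.expand(s)
        c = self.ch_fc(x.mean(dim=(2, 3), keepdim=True))        # F_ch2
        return torch.sigmoid(c) * s + x                         # assumed fusion + residual
```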
Illumination variations include global light intensity changes, local shadows, and other variations across spatial scales. The input feature of the HRA module is $U \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of feature channels and $H \times W$ is the spatial size. The original feature branch and the downsampled feature branch are concatenated along the channel dimension:
$P = \left[\mathrm{Conv}_{3\times3}(U_{\downarrow2}; W_{\downarrow2})\right]_{\uparrow2}$
$B = \mathrm{Concat}[I, P]$
Among the symbols: $\downarrow2$ denotes downsampling by average pooling; $\uparrow2$ denotes upsampling by bilinear interpolation; $W_{\downarrow2}$ denotes learnable weights; $P$ represents the feature of the downsampled branch passed to the original feature branch after convolution and upsampling; $B$ represents the concatenated feature of the original feature branch and the downsampled feature branch; $I$ denotes the original feature branch without concatenation.
The original feature branch of the HRA module extracts deep features through two successive 3 × 3 convolutional layers, followed by the MSAA module to enhance multi-scale feature interaction:
$F_{ori} = f_{MSAA}(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(B; W_{ori1}); W_{ori2}))$
where $F_{ori}$ denotes the output of the original feature branch and $f_{MSAA}$ denotes feature refinement via the MSAA module. The receptive field of the $3\times3$ convolutional kernel matches the spatial range of local illumination, enabling effective extraction of the spatial patterns of illumination variations while preserving spatial–spectral details.
A 1 × 1 convolutional residual connection of the original input is added to avoid feature drift during the feature extraction process and ensure the stable learning of illumination features:
$F_r = F_{ori} + \mathrm{Conv}_{1\times1}(I; W_{ori})$
The input of the HRA module is downsampled to compress the spatial dimension:
$I_{ds} = U_{\downarrow2} = \frac{1}{4}\sum_{i=0}^{1}\sum_{j=0}^{1} U(x+i,\, y+j)$
Average pooling performs summation averaging over a 2 × 2 neighborhood, suppressing local pixel jumps caused by sudden illumination changes while preserving global lighting trends. The downsampling branch compresses spatial dimensions through average pooling, implementing low-pass filtering of global lighting features. This operation filters out local high-frequency noise while retaining the global distribution of light intensity.
Deep features of the downsampled input are extracted through two successive 3 × 3 convolutional layers, after which the MSAA module refines the global context:
$F_{ds} = f_{MSAA}(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(I_{ds}; W_{ds1}); W_{ds2}))$
$F_d = \left(F_{ds} + \mathrm{Conv}_{1\times1}(I_{ds}; W_{ds})\right)_{\uparrow2}$
where $F_{ds}$ is the output of the downsampled feature and $F_d$ represents the final output of the downsampled feature branch. HRA introduces residual connections in both branches to avoid feature drift during feature extraction and to ensure stable learning of illumination features. The residual connection in the original branch directly preserves the illumination features of the original input, preventing their loss in deep networks. The residual connection in the downsampled branch ensures that the global illumination distribution is not distorted by the downsampling and upsampling operations, providing a stable global benchmark for learning local illumination features.
$F_r$ and $F_d$ are concatenated along the channel dimension, and feature fusion is performed through a 3 × 3 convolutional layer:
$F_{HRA} = \mathrm{Conv}_{3\times3}(\mathrm{Concat}(F_r, F_d); W_{HRA})$
The HRA module thus ensures gradient stability via residual connections, enhances discriminability through multi-scale and downsampled feature fusion, and strengthens features via attention. The resulting fused contextual feature provides a reliable foundation for subsequent processing.
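A condensed PyTorch sketch of the HRA computation described by the equations above is given below; it reuses the MSAA class from the previous sketch, assumes even spatial dimensions, and treats channel widths as illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HRA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pre_ds = nn.Conv2d(channels, channels, 3, padding=1)   # conv before upsampling P
        self.ori_convs = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),        # B has 2C channels
            nn.Conv2d(channels, channels, 3, padding=1))
        self.ori_msaa = MSAA(channels)
        self.ori_skip = nn.Conv2d(channels, channels, 1)            # 1x1 residual of I
        self.ds_convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.ds_msaa = MSAA(channels)
        self.ds_skip = nn.Conv2d(channels, channels, 1)             # 1x1 residual of I_ds
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, u):
        i_ds = F.avg_pool2d(u, 2)                        # low-pass downsampling
        p = F.interpolate(self.pre_ds(i_ds), scale_factor=2,
                          mode="bilinear", align_corners=False)
        b = torch.cat([u, p], dim=1)                     # B = Concat[I, P]
        f_r = self.ori_msaa(self.ori_convs(b)) + self.ori_skip(u)
        f_ds = self.ds_msaa(self.ds_convs(i_ds)) + self.ds_skip(i_ds)
        f_d = F.interpolate(f_ds, scale_factor=2, mode="bilinear",
                            align_corners=False)         # upsample back
        return self.fuse(torch.cat([f_r, f_d], dim=1))   # F_HRA
```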

3.2. Cascaded Multi-Attention Feature Recurrent Enhancement Module

As depicted in Figure 6a, the CMA module comprises two Spatial–Spectral Balanced Feature Extraction (SSBFE) modules and one Spectral Enhancement Module (SEM). The cascaded structure of CMFREN is constructed by stacking ten CMA modules, as illustrated in Figure 3. First, the input features are passed into the first SSBFE module of the CMA. Through the synergistic operation of components such as Multi-Scale Residual Feature Enhancement (MSRFE) and Spectral-wise Multi-head Self-Attention (S-MSA), dynamic and balanced optimization of spatial and spectral features is achieved. Subsequently, in combination with the SEM, fine-grained spectral differences are progressively captured through the complementary interaction of attention mechanisms and convolutional operations. Finally, the second SSBFE module is employed to further deepen spatial–spectral feature interaction and extraction, thereby enhancing the completeness of the feature representation. The features from the preceding CMA module are transmitted to the next SSBFE module, while the features from the previous SEM are directly concatenated with those processed by the subsequent SSBFE module before being passed to the SEM. The SEM features, however, serve only as internal features; the final output is taken from the second SSBFE module of the last CMA module.

3.2.1. Spatial–Spectral Balanced Feature Extraction Module

As shown in Figure 6b, the MSRFE module first performs multi-scale downsampling on the input feature: PixelUnshuffle is used to downsample the input tensor, generating feature maps at different scales to realize multi-scale representation of data:
$X_{1/2} = \mathrm{PixelUnshuffle}(F_{MSRFE}^{in}, s=2)$
$X_{1/4} = \mathrm{PixelUnshuffle}(F_{MSRFE}^{in}, s=4)$
where $F_{MSRFE}^{in}$ is the input feature of the MSRFE module (size $H \times W \times C$); $\mathrm{PixelUnshuffle}(\cdot)$ denotes the downsampling operation; the downsampling factor $s$ is set to 2 and 4, respectively; $X_{1/2}$ denotes the 2× downsampled feature; and $X_{1/4}$ denotes the 4× downsampled feature. This multi-scale feature foundation provides data support for the subsequent capture of spatial textures and spectral information at different resolutions, avoiding the insufficient adaptability of a single scale to complex scenarios.
Local feature extraction in MSRFE is performed by applying a 3 × 3 convolutional layer:
$F_{conv}^{1/2} = \mathrm{Conv}_{3\times3}(X_{1/2}; W_{1/2})$
$F_{conv}^{1/4} = \mathrm{Conv}_{3\times3}(X_{1/4}; W_{1/4})$
where $F_{conv}^{1/2}$ represents the local feature of the 1/2 downsampled branch and $F_{conv}^{1/4}$ represents the local feature of the 1/4 downsampled branch. These features refine local spatial textures and spectral details at multiple scales, providing a fine-grained feature foundation for subsequent residual fusion and avoiding the loss of local spatial–spectral information during high-level feature extraction.
Residual dense fusion is performed on the local feature of the 1/4 downsampled branch (the residual dense block is shown in Figure 7a):
$F_{ResDB}^{1/4} = F_{conv}^{1/4} + F_{DB}(F_{conv}^{1/4})$
where $F_{DB}(\cdot)$ denotes the concatenation and fusion of multi-convolutional-layer features in the residual dense block, and $F_{ResDB}^{1/4}$ represents the feature of the 1/4 downsampled branch after ResDB fusion.
Classic residual connections (shown in Figure 7b) are used to strengthen feature dependencies: two successive 3 × 3 convolutional layers combined with residual learning are used, and skip connections are added to mitigate performance degradation in deep networks:
$F_{ResB}^{1/4} = F_{ResDB}^{1/4} + \mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(F_{ResDB}^{1/4}; W_1); W_2)$
Cross-scale feature integration is performed on the 1/4 downsampled channel and 1/2 downsampled channel via PixelShuffle:
$F_{1/2,1/4}^{fusion} = \mathrm{Concat}(F_{conv}^{1/2}, \mathrm{PixelShuffle}(F_{ResB}^{1/4}, r=2))$
The MSRFE module performs cross-scale feature integration on the 1/2 downsampled feature and original channel feature via PixelShuffle:
$F_{1,1/2}^{fusion} = \mathrm{Concat}(F_{conv}^{1}, \mathrm{PixelShuffle}(F_{ResB}^{1/2}, r=2))$
Integrating feature maps from different scales improves the expressive ability of the model. The output of the MSRFE module is:
$F_{MSRFE} = f_{msrfe}(R(F_0), R(F_{1/2}), R(F_{1/4}))$
where $R(F_0)$ denotes the original-size branch; $R(F_{1/2})$ denotes the 1/2 downsampled branch; and $R(F_{1/4})$ denotes the 1/4 downsampled branch.
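The following is a minimal sketch of the MSRFE pathway, with PixelUnshuffle building the 1/2 and 1/4 scale representations and PixelShuffle merging them back up the pyramid. The residual dense block is simplified to a plain residual block, channel widths are assumptions, and spatial dimensions must be divisible by 4.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)                           # classic residual

class MSRFE(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.down2, self.down4 = nn.PixelUnshuffle(2), nn.PixelUnshuffle(4)
        self.conv0 = nn.Conv2d(c, c, 3, padding=1)        # original scale
        self.conv2 = nn.Conv2d(4 * c, c, 3, padding=1)    # 1/2 scale (4c in)
        self.conv4 = nn.Conv2d(16 * c, 4 * c, 3, padding=1)  # 1/4 scale (16c in)
        self.res4 = ResBlock(4 * c)                       # stands in for ResDB + ResB
        self.up4 = nn.PixelShuffle(2)                     # 4c @ 1/4 -> c @ 1/2
        self.res2 = ResBlock(2 * c)
        self.fuse2 = nn.Conv2d(2 * c, 4 * c, 3, padding=1)
        self.up2 = nn.PixelShuffle(2)                     # 4c @ 1/2 -> c @ full
        self.out = nn.Conv2d(2 * c, c, 3, padding=1)      # final fusion

    def forward(self, x):
        f0 = self.conv0(x)
        f2 = self.conv2(self.down2(x))
        f4 = self.res4(self.conv4(self.down4(x)))
        f2 = torch.cat([f2, self.up4(f4)], dim=1)         # cross-scale fusion 1/4 -> 1/2
        f2 = self.fuse2(self.res2(f2))
        return self.out(torch.cat([f0, self.up2(f2)], dim=1))  # 1/2 -> full
```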
The S-MSA module dynamically learns the weight of spectral information to adaptively balance spatial textures and spectral features. The input feature is mapped to Query (Q), Key (K), and Value (V):
$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$
where $W_Q$, $W_K$, $W_V \in \mathbb{R}^{C \times C}$ are learnable weight matrices.
Multi-Head Spectral Attention: Q, K, and V are split into N heads along the spectral dimension. The attention calculation for the j-th head is:
$A_j = \mathrm{softmax}(\sigma_j \cdot K_j^{T} Q_j), \quad \mathrm{head}_j = V_j A_j$
where $\sigma_j \in \mathbb{R}$ is a learnable scale parameter used to adaptively adjust the intensity of spectral attention.
The multi-head outputs are concatenated, and position embedding (PE) is added:
$F_{smsa} = \left(F_c[\mathrm{head}_1, \ldots, \mathrm{head}_N]\right) W_O + \mathrm{PE}(V)$
where $W_O \in \mathbb{R}^{C \times C}$ is the output projection matrix and $\mathrm{PE}(\cdot)$ is the function that generates the position embedding.
The output of the spectral–spatial feature balanced extraction module is:
$F_{ss} = \mathrm{Conv}_{3\times3}(F_{smsa}(F_{MSRFE}); W_{smsa}) + \mathrm{Conv}_{1\times1}(F_{MSRFE}^{in}; W_{MSRFE}^{in})$
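A minimal PyTorch sketch of S-MSA appears below: each spectral channel acts as a token, so the attention map is $C \times C$ rather than $(HW) \times (HW)$, keeping the cost linear in spatial size. The position embedding $\mathrm{PE}(V)$ is approximated by a depthwise convolution on $V$, an assumption borrowed from common S-MSA implementations, and the channel count is assumed divisible by the number of heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMSA(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)
        self.sigma = nn.Parameter(torch.ones(heads, 1, 1))   # learnable scale sigma_j
        self.proj = nn.Linear(channels, channels)            # output projection W_O
        self.pos_emb = nn.Conv2d(channels, channels, 3, padding=1,
                                 groups=channels)            # PE(V) stand-in (assumption)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                # (B, HW, C): channels as tokens
        q, k, v = self.to_q(tokens), self.to_k(tokens), self.to_v(tokens)
        pe = self.pos_emb(v.transpose(1, 2).reshape(b, c, h, w))
        # split the spectral dimension into heads: (B, heads, C/heads, HW)
        q, k, v = (t.transpose(1, 2).reshape(b, self.heads, c // self.heads, -1)
                   for t in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        att = torch.softmax((q @ k.transpose(-2, -1)) * self.sigma, dim=-1)  # C x C map
        out = (att @ v).reshape(b, c, -1).transpose(1, 2)    # back to (B, HW, C)
        return self.proj(out).transpose(1, 2).reshape(b, c, h, w) + pe
```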

3.2.2. Spectral Enhancement Module

The SEM achieves collaborative optimization through the integration of attention mechanisms and convolution, employing a progressive spectral enhancement strategy to deeply capture spectral differences. The recurrent spectral enhancement module adopts a recurrent feedback mechanism: through an iterative process of cross-round feature fusion, convolutional detail extraction, and dynamic attention weighting, it leverages the interaction between the historically enhanced feature $F_{t-1}$ and the current spatial–spectral feature to progressively optimize spectral details. Specifically, the output $F_t$ of the $t$-th round serves as the historical input of the $(t+1)$-th round, forming a recurrent enhancement chain.
Cross-round residual feature fusion: taking the output $F_{ss}$ of the spatial–spectral balanced module (the current spatial–spectral feature) and the spectral enhancement output $F_{t-1}$ of the previous round (the historically enhanced feature) as inputs, initial cross-round feature fusion is realized via residual connections, providing a composite feature of 'current spatial–spectral information + historical enhancement prior' for subsequent enhancement:
$F_{fusion} = \mathrm{Conv}_{3\times3}(F_{ss}; W_{ss}) + F_{t-1}$
A 3 × 3 convolutional layer is applied to the fused feature $F_{fusion}$ to capture fine-grained details of the local spectrum and space:
$F_t = \mathrm{Conv}_{3\times3}(F_{fusion}; W_{fusion})$
$F_t$ is used as a feedback signal and is directly input into the recurrent spectral enhancement module of the next round (i.e., it becomes $F_{t-1}$ for that round), forming a closed loop of iterative enhancement. The recurrent spectral enhancement module introduces prior knowledge of historical enhancement via cross-round residual fusion, extracts current details using convolution and activation, and dynamically strengthens key features via S-MSA. More importantly, the recurrent feedback mechanism enables spectral features to gradually approach the detailed distribution of the high-resolution ground truth through multiple iterations, realizing progressive and refined spectral enhancement.
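The following sketch illustrates the SEM recurrent feedback chain under these equations. The number of rounds and the zero initialization of the historical feature in the first round are assumptions not specified in the text; the S-MSA re-weighting reuses the SMSA sketch above.

```python
import torch
import torch.nn as nn

class SEM(nn.Module):
    def __init__(self, channels, rounds=3, heads=4):      # round count is an assumption
        super().__init__()
        self.rounds = rounds
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)    # W_ss
        self.detail = nn.Conv2d(channels, channels, 3, padding=1)  # W_fusion
        self.smsa = SMSA(channels, heads)                 # dynamic attention weighting

    def forward(self, f_ss):
        f_prev = torch.zeros_like(f_ss)                   # assumed F_0: no history yet
        for _ in range(self.rounds):
            f_fusion = self.fuse(f_ss) + f_prev           # cross-round residual fusion
            f_t = self.detail(f_fusion)                   # local spectral/spatial details
            f_prev = self.smsa(f_t)                       # feedback for the next round
        return f_prev
```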

4. Results

4.1. Experimental Results

4.1.1. Datasets

NTIRE 2022: Selected from the NTIRE 2022 Spectral Reconstruction Challenge Clean track dataset. NTIRE 2022 comprises 950 public RGB-HSI pairs covering diverse scenes, including vegetation, architecture, and people. The spectral resolution is 10 nm, featuring 31 bands spanning 400 nm to 700 nm, with a spatial resolution of 482 × 512 pixels.
CAVE: The CAVE hyperspectral dataset is a collection of hyperspectral image data provided by Columbia University’s Computer Vision Laboratory (CAVE Lab) (New York, NY, USA). The CAVE dataset was captured using a cooled charge-coupled device (CCD) camera equipped with variable-spectral liquid crystal tunable filters. The dataset comprises 32 hyperspectral images (HSIs) categorized into five groups: Materials, Skin and Hair, Pigments, Food and Beverages, and Authenticity. Spectral resolution is 10 nm, featuring 31 spectral bands spanning 400 nm to 700 nm, with spatial resolution of 512 × 512 pixels.
grss_dfc_2018: grss_dfc_2018 was collected by the National Center for Airborne Laser Mapping (NCALM) (Houston, TX, USA) on 16 February 2017, from the University of Houston. The hyperspectral data were acquired using the ITRES CASI 1500 (a VNIR sensor from ITRES Research Limited (Calgary, AB, Canada) with a 1500-pixel field of view), covering the spectral range of 380–1050 nm across 48 bands. The spatial resolution is 4172 × 1202 pixels. We selected 23, 12, and 5 bands from the hyperspectral image to construct RGB images. The original hyperspectral images and synthetic RGB images are cropped into 27 pairs of spatially non-overlapping image patches with a size of 512 × 512 pixels.

4.1.2. Evaluation Indicators

To evaluate the quality of reconstructing hyperspectral images from RGB images, objective image quality metrics are required. Commonly used metrics in hyperspectral image reconstruction include Root Mean Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), Spectral Angle Mapper (SAM), and Mean Relative Absolute Error (MRAE).
RMSE quantifies the quality of spectral reconstruction by calculating the absolute pixel-wise error across all spectral bands between the reconstructed and reference hyperspectral images. It also serves as the official evaluation metric in the NTIRE 2022 Hyperspectral Image Reconstruction Challenge. A lower RMSE value indicates smaller spectral reconstruction error and higher accuracy of the reconstructed hyperspectral image.
$\mathrm{RMSE} = \sqrt{\frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(I_{GT}(i,j) - I_{R}(i,j)\right)^{2}}$
PSNR (Peak Signal-to-Noise Ratio) is a traditional metric that measures the ratio between the maximum possible signal and the noise present in the reconstructed spectral image, calculated based on the differences between corresponding pixel values of the reconstructed and reference spectral images. A higher PSNR value indicates better reconstruction quality. PSNR is typically defined in terms of the Mean Square Error (MSE) as follows:
$\mathrm{MSE} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(I_{GT}(i,j) - I_{R}(i,j)\right)^{2}$
$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{MAX_I^{2}}{\mathrm{MSE}}\right) = 20\log_{10}\!\left(\frac{MAX_I}{\sqrt{\mathrm{MSE}}}\right)$
SAM (Spectral Angle Mapper) quantifies the spectral similarity between two spectra by measuring the angle between their spectral vectors, which are derived from the reconstructed and reference hyperspectral images at the same spatial location. A smaller angle indicates a higher degree of spectral similarity between the two spectra. Therefore, a smaller SAM value signifies higher spectral similarity between the reference and reconstructed hyperspectral pixel values, leading to higher quality in the reconstructed spectral image.
$\mathrm{SAM} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\cos^{-1}\!\left(\frac{\left(I_{GT}(i,j)\right)^{T} \cdot I_{R}(i,j)}{\left\|I_{GT}(i,j)\right\|_{2} \cdot \left\|I_{R}(i,j)\right\|_{2}}\right)$
In hyperspectral image reconstruction tasks, the Mean Relative Absolute Error (MRAE) is a widely used metric for evaluating the quality of spectral reconstruction.
$\mathrm{MRAE} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\frac{\left|I_{R}(i,j) - I_{GT}(i,j)\right|}{\left|I_{GT}(i,j)\right|}$
In these formulas, $i, j$ are pixel indices (row and column); $I_{GT}$ is the ground-truth hyperspectral image; $I_R$ is the reconstructed hyperspectral image; $H$ is the image height; $W$ is the image width; $MAX_I$ is the maximum possible pixel value; and $\|\cdot\|_2$ is the $\ell_2$ norm of a vector.
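For reference, the following are straightforward NumPy implementations of the four metrics for a single image pair, assuming arrays of shape (H, W, C) scaled to [0, 1]; the SAM value is reported in degrees, an assumption consistent with the magnitudes reported in the result tables.

```python
import numpy as np

def rmse(gt, rec):
    return np.sqrt(np.mean((gt - rec) ** 2))

def psnr(gt, rec, max_i=1.0):
    return 10.0 * np.log10(max_i ** 2 / np.mean((gt - rec) ** 2))

def sam(gt, rec, eps=1e-8):
    # per-pixel angle between spectral vectors, averaged over the image
    dot = np.sum(gt * rec, axis=-1)
    norms = np.linalg.norm(gt, axis=-1) * np.linalg.norm(rec, axis=-1)
    return np.mean(np.degrees(np.arccos(np.clip(dot / (norms + eps), -1, 1))))

def mrae(gt, rec, eps=1e-8):
    return np.mean(np.abs(rec - gt) / (np.abs(gt) + eps))
```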

4.1.3. Parameter Setting

The training framework for this paper is PyTorch (v2.0.1). Images are cropped to a spatial resolution of 128 × 128, and all RGB data pairs with their corresponding hyperspectral data are normalized to the range [0, 1]. The NTIRE 2022 dataset is split into training and test subsets at an 18:1 ratio. The first 20 hyperspectral images from the CAVE dataset are used for training, while the remaining 12 hyperspectral images serve as test data. For the grss_dfc_2018 dataset, bands 23, 12, and 5 are selected as conditional inputs for RGB images. The dataset is cropped into 27 paired 512 × 512 blocks. Additionally, 3 non-overlapping image patches are cropped into 12 sub-patches of size 256 × 256, which are then used for the testing phase. The batch size is set to 10. The Adam optimizer is employed with parameters β1 = 0.9 and β2 = 0.999, together with a cosine annealing learning rate scheduler. The initial learning rate is 1 × 10−4, and training is conducted for 300 epochs. The Mean Relative Absolute Error (MRAE) is adopted as the loss function throughout training to quantify the discrepancy between the predicted hyperspectral images (HSIs) and the ground-truth HSIs. Meanwhile, an early stopping criterion is introduced: if the MRAE on the validation set fails to decrease for 20 consecutive epochs, training is terminated immediately to avoid overfitting.
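A sketch of this training configuration is shown below, reusing the train_one_epoch helper from the earlier sketch; the validate callable returning the validation MRAE is a placeholder.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

best_mrae, patience = float("inf"), 0
for epoch in range(300):
    train_one_epoch(model, train_loader, optimizer)   # MRAE loss inside
    scheduler.step()                                  # cosine annealing step
    val_mrae = validate(model, val_loader)            # placeholder: returns val MRAE
    if val_mrae < best_mrae:
        best_mrae, patience = val_mrae, 0
    else:
        patience += 1
        if patience >= 20:                            # early stopping criterion
            break
```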
All experiments were conducted in a PyTorch environment on a system equipped with an Intel i5-13490F CPU, an NVIDIA GeForce RTX 4090 24 GB GPU, and 128 GB of RAM.
Table 1 shows the detailed architecture of the CMFREN model, along with the specific dimensions of the input and output for each module. The number of CMA modules is set to 10, and the channel dimension is configured as 64. The kernel count of the final convolutional layer is determined by the number of channels of the hyperspectral image to be reconstructed.

4.2. Comparison of the Experimental Results

We compared our proposed CMFREN with several spectral super-resolution methods, including traditional approaches such as Arad and A+, classical deep learning methods like HSCNN and HRNet, the Transformer-based model MST++, and the hybrid model DRCR. All methods were trained on the NTIRE 2022, CAVE, and grss_dfc_2018 datasets. For each model, we strictly adhered to the original authors' parameter settings and assigned consistent training and validation sets. With an input size of 256 × 256 × 3, we calculated the number of parameters and FLOPs for each model while also measuring the average inference time across the test datasets. Image quality was quantitatively compared using the four selected evaluation metrics. Furthermore, to visually demonstrate the spectral super-resolution performance of different methods, we visualized reconstruction errors and conducted qualitative visual comparisons across multiple dimensions.

4.2.1. Results on the NTIRE 2022 Dataset

The quantitative results on the NTIRE 2022 dataset are summarized in Table 2. Our proposed method contains 6.78 M parameters and requires 111.49 G FLOPs, with an inference time of 1.2572 s, placing it at a moderate level of computational efficiency. Compared to the best-performing baseline method, DRCR, our approach achieves an RMSE of 0.0178, a 1.6% reduction relative to DRCR's 0.0181. This improvement, achieved within an already low-error range, demonstrates that our method further reduces reconstruction errors and enables a more refined gain in reconstruction accuracy. For the PSNR metric, our method achieves 35.5105 dB, which is 1.48 dB higher than DRCR's 34.0328 dB and the highest among all compared methods. In terms of spectral similarity, our method achieves a SAM value of 4.9925, comparable to DRCR's 4.9295. Combined with the MRAE, where our method obtains 0.1863 versus DRCR's 0.1925, these two metrics together highlight the superiority of our approach in preserving overall spectral fidelity.
Figure 8 illustrates the absolute differences between the hyperspectral images (HSI) generated by the proposed method and other comparison methods on the NTIRE 2022 dataset relative to ground truth values across each spectral band. The color map below represents the absolute error between the reconstructed spectral image and the true spectral image for each pixel pair. Smaller errors are indicated by colors shifting toward blue, signifying that the reconstructed spectral image approaches the true spectral image more closely—i.e., higher spectral reconstruction accuracy. The visualization clearly demonstrates that the absolute difference for the proposed method is smaller than that of other comparison methods across all bands, validating the accuracy of the reconstruction results from a band-by-band perspective.
To further verify the restoration effect of fine-grained features, we selected ARAD_1K_0924 from the NTIRE 2022 dataset for reconstruction. Spatial local region magnification analysis was performed on the reconstructed images of its 1st, 11th, 21st, and 31st bands, as shown in Figure 9. The results demonstrate that the images reconstructed by the CMFREN method exhibit clearer tower outlines and natural and distortion-free transitions in the sky area, and their reconstruction quality is more consistent with the fine details of the real images.
To comprehensively validate the model’s performance in light-sensitive scenes, we selected two images—ARAD_1K_0905 and ARAD_1K_0947—from the NTIRE 2022 dataset. We annotated typical regions with rapid illumination changes using bounding boxes, compared the normalized spectral density curves of images generated by each method within the annotated regions, and calculated the spectral curve correlation coefficients, as shown in Figure 10, Figure 11, Figure 12 and Figure 13. The correlation coefficient quantifies the consistency between the reconstructed spectrum and the true spectrum, with values closer to 1 indicating better consistency. The results show that the CMFREN method achieves higher correlation coefficients than other methods. Compared to alternative approaches, the spectral images generated by the proposed reconstruction method exhibit significantly smaller fluctuations in spectral density curves during rapid illumination changes. Their curve shapes more closely resemble the true spectral distribution, demonstrating the method’s ability to maintain spectral feature stability in complex illumination scenarios. This stability stems from the HRA module’s multi-scale fusion of contextual information and residual attention mechanism, enabling the network to adaptively suppress artifacts during rapid transitions between strong light and shadow. Concurrently, the SSBFE within the CMA module dynamically adjusts spatial and spectral feature weights, preventing over-reliance on any single feature channel. This ensures the model maintains smooth and consistent spectral curves even under extreme illumination variations.

4.2.2. Results on the CAVE Dataset

The CAVE dataset comprises high-resolution images of everyday objects characterized by rich visual details and intricate spectral features, imposing stricter demands on models to capture subtle spatial–spectral interactions. As shown in Table 3, CMFREN ranks moderately in terms of parameter complexity and computational efficiency while achieving significant improvements across all image quality metrics. Regarding reconstruction error, the Root Mean Square Error (RMSE) decreased from 0.0097 in MST++ to 0.0078, representing a 20% reduction. This substantial improvement demonstrates CMFREN’s enhanced error control capability, effectively mitigating bias during the restoration of fine features. The Peak Signal-to-Noise Ratio (PSNR) increased from 36.0853 dB to 36.3796 dB. Although the increase was relatively modest at 0.29 dB, it reflects improved retention of object texture edges and surface gloss variations in high-resolution scenes. The Spectral Angle Mapper (SAM) value decreased from 6.2841 to 5.6667, a reduction of 0.62, indicating significantly enhanced alignment between reconstructed spectral curves and actual data. Additionally, the Mean Relative Absolute Error (MRAE) decreased from 0.1786 to 0.1723, further validating CMFREN’s stability in restoring spectral and spatial features.
Figure 14 visually demonstrates the absolute differences between CMFREN and other reconstructed hyperspectral images (HSI) relative to ground truth values across each band. The results indicate that CMFREN exhibits smaller absolute differences than other comparison methods in every band.
To further validate the restoration performance of fine features, we selected four images with typical fine textures from the CAVE dataset: real_and_fake_peppers, stuffed_toys, superballs, and watercolors. We performed a local magnification analysis on their 16th band, as shown in Figure 15. The results demonstrate that CMFREN achieves clearer surface textures and edge details on objects, with reconstruction outcomes that more closely match the fine granularity of real images. This is because the MSRFE in the CMA module utilizes PixelUnshuffle and multi-branch convolutions to construct a fine-grained multi-scale representation, effectively capturing high-frequency textures and edge features. Meanwhile, the SEM continuously enhances local details through a recurrent feedback mechanism, endowing the model with greater expressiveness when restoring complex textures and subtle boundaries. This enables clearer texture reconstruction compared to other methods.

4.2.3. Results on the grss_dfc_2018 Dataset

Table 4 reports the reconstruction performance of different methods on the grss_dfc_2018 dataset. The parameter count of MST++ increases significantly compared to its configurations for the NTIRE 2022 and CAVE datasets, while the parameters of the other models remain largely unchanged. This is because the number of channels in the reconstructed images varies: for the other models, only the parameters of the output layer change, whereas the structural design of MST++ causes its intermediate-layer parameters to change as well, resulting in a substantial increase in the total parameter count. Compared to the 482 × 512 × 3 resolution of NTIRE 2022 and the 512 × 512 × 3 resolution of CAVE, the grss_dfc_2018 dataset employs a 256 × 256 × 3 test set, resulting in significantly reduced inference time; however, the relative inference speeds among the methods remain unchanged. Our method achieves an RMSE of 0.1706, reducing MST++'s 0.1804 by 0.0098. This indicates effective minimization of pixel-level reconstruction error, significantly improving the match between reconstructed results and true values. Regarding spatial detail representation, the PSNR reaches 36.9293 dB, an improvement of approximately 1.5 dB over DRCR's 35.4363 dB. In terms of spectral similarity and relative error, the method achieves a SAM value of 5.7446, a reduction of 0.23 compared to DRCR's 5.9755, indicating higher consistency between the reconstructed spectrum and the true spectrum. Concurrently, the MRAE decreases to 0.0889, an approximately 10% improvement over DRCR. These results further highlight the advantages of this approach in controlling overall relative error and ensuring spectral fidelity.
Figure 16 visually presents the absolute differences between the hyperspectral images (HSI) reconstructed by the proposed method and other comparison methods relative to ground truth values across each band. The results indicate that all methods exhibit smaller errors in bands 12 and 24, while the proposed method demonstrates smaller absolute differences than other methods across all bands.
We selected representative images from the grss_dfc_2018 dataset and conducted spectral curve comparison analysis for two distinctly different land cover types—parking lots and grasslands—calculating the spectral curve correlation coefficients. Figure 17a and Figure 18a show the locations of the selected parking lot and grassland areas, respectively. Figure 17b and Figure 18b present the true spectral curves of corresponding points alongside spectral curves from different reconstruction methods and their spectral curve correlation coefficients. Results indicate that hyperspectral images generated by this method exhibit higher correlation coefficients, meaning spectral curves more closely approximate true values, yielding superior reconstruction quality. Figure 19a displays selected locations for both land cover types, while Figure 19b clearly demonstrates spectral differences between the two object types through their true spectral curves and those reconstructed by this method. This advantage stems from the iterative spectral enhancement mechanism within the SEM. By progressively strengthening spectral details through cross-round feature fusion and dynamically weighted attention mechanisms, it achieves deep separation of spectral features between different objects. Concurrently, the SSBFE module implements adaptive balancing at the spectral dimension. This ensures the model maintains the overall authenticity of spectral curves while accentuating differences in land cover categories within complex scenes, thereby enhancing spectral resolution capability and classification performance.

4.3. Ablation Experiment

4.3.1. Ablation Study on Cascade Times of CMA Module

As outlined in the introduction, we designed a cascaded architecture for spectral reconstruction to enable the network to learn comprehensive spatial–spectral information. Consequently, we investigated the impact of cascade depth on reconstruction results. On the NTIRE2022 dataset, we evaluated the performance of cascades ranging from 6 to 12 CMAs. The final results are presented in Table 5. We observed a steady improvement in performance as the number of cascades increased. However, increasing the number of cascades also leads to greater model depth. When n = 12, no significant improvement in reconstruction quality was observed, and performance even declined. Therefore, n = 10 was selected for the experiments. In summary, an appropriate number of cascades can significantly enhance spectral reconstruction performance.

4.3.2. Ablation Experiment Investigating the Effects Among Different Modules

To validate the role of each module in the cascaded multi-attention feature recurrent enhancement spectral super-resolution reconstruction network, ablation experiments were conducted on the NTIRE 2022 dataset.
The ablation experiments validated different combinations of the HRA, SSBFE, and SEMs within the CMFREN model, revealing the influence patterns of their synergistic effects on spectral super-resolution reconstruction performance. Table 6 presents a comparison of evaluation metric values obtained from various combination schemes.
The HRA + SSBFE combination achieves significant optimization in both core accuracy metrics—RMSE and PSNR. Experimental data indicate that this combination achieves an RMSE as low as 0.0189 and a PSNR of 35.3517 dB, slightly inferior only to the full model while substantially outperforming other incomplete combinations. This stems from HRA’s ‘purification’ effect, which reduces systematic errors caused by illumination variations, providing reliable input for SSBFE’s spectral–spatial balance. Meanwhile, SSBFE’s multi-scale feature fusion and adaptive weight adjustment effectively resolve the imbalance in traditional methods—either overemphasizing spectral information at the expense of spatial information or vice versa—simultaneously improving pixel-level error and signal fidelity in the reconstruction results.
The SSBFE + SEM combination achieves a SAM of 5.3241 and an MRAE of 0.1987, ranking highest among incomplete combinations. This demonstrates its precise control over spectral similarity and relative error. SSBFE’s spatial–spectral balance prevents spectral features from being disrupted by spatial noise, providing SEM with a ‘pure and balanced’ spectral foundation. SEM’s cyclic enhancement mechanism progressively optimizes spectral curves through multiple iterations, reducing angular deviations caused by spectral aliasing while minimizing relative error. The core advantage of this combination lies in its ‘foundation-to-detail’ spectral refinement, ensuring reconstructed spectra not only align with true values in overall trends but also maintain consistency in fine-grained variations, supporting applications in complex scenarios.
The HRA + SEM combination achieves an RMSE of 0.0206, PSNR of 34.7831 dB, SAM of 5.9625, and MRAE of 0.2309, the worst among all combinations. This validates the limitations of insufficient functional complementarity: without SSBFE’s spectral balance adjustment, features purified by HRA still exhibit mismatched spectral information, and SEM’s spectral enhancement can only operate on an unbalanced foundation, failing to fundamentally resolve spectral aliasing and detail loss. This combination demonstrates that spectral enhancement must be grounded in spectral–spatial balance. Otherwise, isolated feature purification and detail enhancement cannot achieve synergistic effects, making comprehensive performance improvement difficult.
The complete combination of HRA + SSBFE + SEM achieves optimal performance across all four metrics: RMSE, PSNR, SAM, and MRAE. HRA and SSBFE ensure foundational reconstruction accuracy, while SEM enhances spectral detail recognition and stability. This enables the model to maintain high performance across diverse scenarios—complex lighting, high-detail requirements, and low resolution. It also fully validates the module priority sequence (SSBFE > HRA > SEM) and demonstrates the necessity of multi-module collaborative design for hyperspectral tasks.

5. Discussion

The CMFREN model proposed in this paper focuses on the spectral super-resolution task for hyperspectral images. CMFREN employs a cascaded architecture of 'feature purification–spectral balancing–progressive enhancement' to achieve targeted improvements over existing methods. In terms of effectiveness, the model demonstrates significant performance improvements across three major benchmark datasets. On the NTIRE 2022 dataset, the PSNR reaches 35.5105 dB, surpassing DRCR by 1.48 dB, highlighting strong robustness under varying lighting conditions. On the CAVE dataset, the RMSE decreased by 20% relative to the state-of-the-art MST++, and the SAM decreased by 0.62, highlighting the model's advantages in spectral restoration for fine-textured scenes. On the grss_dfc_2018 dataset, the proposed method achieved an RMSE of 0.1706, a reduction of 0.0098 compared to MST++'s 0.1804, and improved the PSNR by approximately 1.5 dB relative to DRCR, thereby validating its effectiveness for large-scale remote sensing data.
From a practical application perspective, CMFREN’s performance advantages enable its potential implementation across multiple domains. In remote sensing monitoring, it can accurately reconstruct hyperspectral data from RGB images, reducing reliance on expensive hyperspectral imaging equipment and supporting large-scale environmental monitoring and land cover classification. In medical imaging, its capability for fine-texture restoration supports spectral feature analysis of lesions, aiding in early disease diagnosis. In industrial inspection, it distinguishes material compositions through spectral details, thereby enhancing the precision and efficiency of product quality control. Furthermore, the model’s ability to mitigate spectral aliasing in low-resolution scenarios extends its applicability to resource-constrained environments, such as drone-based remote sensing and mobile imaging, significantly broadening its operational scope. Nevertheless, CMFREN still has certain limitations:
1. Spatial leakage risk: Although the grss_dfc_2018 dataset uses non-overlapping slice segmentation, both the training and test sets originate from the same original scene, introducing potential spatial correlations that may slightly overestimate the model’s performance. This risk is intensified when spectral feature distributions within scenes are relatively uniform.
2. High computational cost: CMFREN incorporates 10 cascaded CMA modules, and the SSBFE module integrates multi-scale convolutions with S-MSA attention mechanisms. This results in a relatively large model size and computational load. Compared to lightweight models, its number of parameters increases significantly, leading to slower inference speeds. Consequently, this limits deployment and application on edge devices, such as mobile terminals and drone-embedded systems.
3. Dataset distribution bias: Experimental datasets such as CAVE and NTIRE 2022 were collected under controlled laboratory or standardized conditions. These conditions differ from real-world scenarios involving complex backgrounds, such as atmospheric scattering and noise interference. This discrepancy may degrade the model’s generalization performance in real-world environments, making it difficult to fully replicate the superior results achieved under laboratory settings.
Future research may explore the following directions:
1. Reducing model complexity: To address the high computational cost, future lightweight designs will focus on two approaches. The first is replacing standard convolutions with depthwise-separable convolutions to cut parameters and computation while preserving feature extraction capability (see the first sketch after this list). The second is introducing dynamic network architectures with adaptive channel pruning, which automatically removes redundant channels during training while retaining the core feature extractors. The target is to reduce the number of parameters and the computational load by 50% while keeping the performance loss below 5%, meeting edge-device deployment requirements and broadening application scenarios.
2. Self-supervised pre-training: Existing models rely heavily on large amounts of paired RGB–hyperspectral data, which limits real-world applicability. Future work will explore self-supervised pre-training that combines abundant unlabeled RGB images with a small amount of labeled hyperspectral data: a self-supervised spectral reconstruction pretext task (see the second sketch after this list) lets the model learn general spectral feature representations from unlabeled data, and subsequent fine-tuning with labeled pairs improves generalization and data-utilization efficiency. This mitigates labeled-data scarcity and lowers training cost barriers.
3. Real-world data application: Current experiments rely on standard datasets that differ from field conditions. Future efforts will focus on three areas: collecting real-world hyperspectral data that include complex factors such as atmospheric scattering, noise interference, and extreme illumination; integrating physical degradation models into training so the model learns coping mechanisms for such interference (a toy degradation model appears in the third sketch after this list); and customizing the network for specific scenarios such as UAV remote sensing, industrial inspection, and precision agriculture to improve practical effectiveness.
4. Cross-sensor generalization: Models trained on single-sensor data struggle to adapt to the spectral response functions (SRFs) of different RGB cameras, limiting their versatility. Future research will explore cross-sensor adaptation along two lines: domain adaptation modules that learn the spectral mapping between sensors, reducing the impact of sensor variations, and universal spectral feature extractors that lessen dependence on any specific SRF (the final sketch after this list illustrates how different SRFs change the RGB observation of the same scene). The goal is to apply a model trained on one sensor directly to RGB images from other sensors without retraining, improving practicality and reducing adaptation costs.
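A minimal sketch of the separable-convolution substitution from direction 1: a depthwise convolution followed by a pointwise convolution replaces a standard convolution. This is an illustrative module under our own naming, not part of CMFREN.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise KxK convolution followed by a pointwise 1x1 convolution."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise: one spatial filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=pad, groups=in_ch)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# For 64 -> 64 channels with a 3x3 kernel (ignoring biases): a standard
# convolution needs 64*64*9 = 36,864 weights, while the separable version
# needs 64*9 + 64*64 = 4,672, roughly an 8x reduction.
x = torch.rand(1, 64, 128, 128)
print(SeparableConv2d(64, 64)(x).shape)  # torch.Size([1, 64, 128, 128])
```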
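For direction 2, one plausible pretext task is masked-patch reconstruction on unlabeled RGB images, after which the encoder is fine-tuned on scarce RGB–HSI pairs. All module names below are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_patch_mask(x: torch.Tensor, patch: int = 16,
                      ratio: float = 0.5) -> torch.Tensor:
    """Zero out a random fraction of non-overlapping patches."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch,
                       device=x.device) > ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * keep

# Hypothetical encoder/decoder pair; a real design would reuse CMFREN blocks.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(64, 3, 3, padding=1)
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(decoder.parameters()), lr=1e-4)

rgb = torch.rand(4, 3, 128, 128)  # stand-in for an unlabeled RGB batch
recon = decoder(encoder(random_patch_mask(rgb)))
loss = F.l1_loss(recon, rgb)      # reconstruct the unmasked image
opt.zero_grad()
loss.backward()
opt.step()
```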
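For direction 3, a toy degradation model illustrating the idea of simulating real-world distortion during training; actual atmospheric effects are far more complex than this blur-plus-noise assumption.

```python
import torch
import torch.nn.functional as F

def degrade(hsi: torch.Tensor, sigma_noise: float = 0.01,
            kernel_size: int = 5, sigma_blur: float = 1.0) -> torch.Tensor:
    """Band-wise Gaussian blur plus additive noise on an (N, C, H, W) cube."""
    c = hsi.shape[1]
    ax = torch.arange(kernel_size).float() - kernel_size // 2
    g = torch.exp(-(ax ** 2) / (2 * sigma_blur ** 2))
    k2d = torch.outer(g, g)
    # One identical blur kernel per band, applied with grouped convolution.
    k2d = (k2d / k2d.sum()).view(1, 1, kernel_size,
                                 kernel_size).repeat(c, 1, 1, 1)
    blurred = F.conv2d(hsi, k2d, padding=kernel_size // 2, groups=c)
    return (blurred + sigma_noise * torch.randn_like(blurred)).clamp(0.0, 1.0)

clean = torch.rand(1, 31, 128, 128)  # stand-in hyperspectral batch
noisy = degrade(clean)               # degraded input for robust training
```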
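For direction 4, the sketch below illustrates why cross-sensor generalization is hard: the same hyperspectral cube projects to different RGB images under different spectral response functions. The SRF matrices here are random placeholders; real SRFs come from camera calibration data.

```python
import numpy as np

def hsi_to_rgb(hsi: np.ndarray, srf: np.ndarray) -> np.ndarray:
    """Project an (H, W, B) cube to (H, W, 3) with a (B, 3) response matrix."""
    return hsi @ srf

bands = 31
hsi = np.random.rand(128, 128, bands)  # stand-in hyperspectral cube
srf_a = np.random.rand(bands, 3)       # placeholder SRF, sensor A
srf_b = np.random.rand(bands, 3)       # placeholder SRF, sensor B
srf_a /= srf_a.sum(axis=0)             # normalize each channel response
srf_b /= srf_b.sum(axis=0)

gap = np.abs(hsi_to_rgb(hsi, srf_a) - hsi_to_rgb(hsi, srf_b)).mean()
print(f"mean RGB gap between sensors: {gap:.4f}")  # nonzero: domain shift
```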

6. Conclusions

We evaluated the proposed CMFREN model against classical methods and existing state-of-the-art approaches on three public datasets (CAVE, NTIRE 2022, and grss_dfc_2018) using four metrics: RMSE, PSNR, SAM, and MRAE. CMFREN achieves SOTA performance on all three datasets, with the lowest RMSE, the highest PSNR, and consistently low SAM and MRAE. This performance stems from targeted solutions to three core challenges: illumination transition artifacts, spectral feature imbalance, and spectral difference blurring. The model not only surpasses existing SOTA methods in quantitative and qualitative evaluations but also maintains stable performance across diverse scenarios and complex datasets, validating its robustness and superiority. In high-detail scenes in particular, the HRA module’s global attention feature refinement, the CMA module’s balanced spatial–spectral feature extraction, the SEM’s collaborative use of attention and convolution, and the cascaded multi-attention recurrent enhancement mechanism together suppress artifacts while preserving subtle spatial textures and spectral variations. CMFREN also captures spectral features reliably under complex illumination and alleviates spectral aliasing in low-resolution data, supporting its adaptability to real-world scenarios.

Author Contributions

Conceptualization, H.J. and J.L.; investigation, H.J. and J.L.; methodology, H.J. and Z.Z.; validation, Y.Z.; writing, H.J. and Z.Z.; supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the 14th Five-Year Plan Funding of China, grant number 50916040401.

Data Availability Statement

The CAVE dataset mentioned in this paper is openly and freely available at http://www.cs.columbia.edu/CAVE/databases/multispectral/ (accessed on 17 June 2025). The NTIRE 2022 dataset used in this study is freely available at https://codalab.lisn.upsaclay.fr/competitions/721 (accessed on 18 June 2025). The GRSS_DFC_2018 dataset used in this study is freely available at https://machinelearning.ee.uh.edu/2018-ieee-grss-data-fusion-challenge-fusion-of-multispectral-lidar-and-hyperspectral-data/ (accessed on 20 June 2025).

Acknowledgments

We would like to thank the editor and reviewers for their reviews, which improved the content of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lu, B.; Dao, P.D.; Liu, J.; He, Y.; Shang, J. Recent Advances of Hyperspectral Imaging Technology and Applications in Agriculture. Remote Sens. 2020, 12, 2659.
  2. Tejasree, G.; Agilandeeswari, L. An Extensive Review of Hyperspectral Image Classification and Prediction: Techniques and Challenges. Multimed. Tools Appl. 2024, 83, 80941–81038.
  3. Ragusa, D.; Gazzoni, M.; Torti, E.; Marenzi, E.; Leporati, F. Vision Transformer for Brain Tumor Detection Using Hyperspectral Images With Reduced Spectral Bands. IEEE Access 2025, 13, 121704–121719.
  4. Tran Ba, V.; Hübner, M.; Bin Qasim, A.; Rees, M.; Sellner, J.; Seidlitz, S.; Christodoulou, E.; Özdemir, B.; Studier-Fischer, A.; Nickel, F.; et al. Semantic Hyperspectral Image Synthesis for Cross-Modality Knowledge Transfer in Surgical Data Science. Int. J. CARS 2025, 20, 1205–1213.
  5. Wang, C.; Xu, H.; Tang, H.; Xin, L.; Huang, X.; Wang, N.; Zhao, X.; Wei, X.; Zhang, R. Fluorescence and Reflectance-Based Dual-Modal Hyperspectral Image Fusion for Caries Diagnosis. Measurement 2025, 246, 116701.
  6. Amoako, P.Y.O.; Cao, G.; Shi, B.; Yang, D.; Acka, B.B. Orthogonal Capsule Network with Meta-Reinforcement Learning for Small Sample Hyperspectral Image Classification. Remote Sens. 2025, 17, 215.
  7. Fu, C.; Zhou, T.; Guo, T.; Zhu, Q.; Luo, F.; Du, B. CNN-Transformer and Channel-Spatial Attention Based Network for Hyperspectral Image Classification with Few Samples. Neural Netw. 2025, 186, 107283.
  8. Wang, N.; Pan, X.; Luo, X.; Gao, X. Hyperspectral Image Classification Based on Attentional Residual Networks. IEEE Access 2025, 13, 10678–10688.
  9. Imani, M.; Cerra, D. Phase Space Deep Neural Network with Saliency-Based Attention for Hyperspectral Target Detection. Adv. Space Res. 2025, 75, 3565–3588.
  10. Li, T.; Cai, Y.; Zhang, Y.; Cai, Z.; Jiang, G.; Liu, X. Superpixel Prior Cluster-Level Contrastive Clustering Network for Large-Scale Urban Hyperspectral Images and Vehicle Detection. IEEE Trans. Veh. Technol. 2025, 74, 2019–2031.
  11. Li, T.; Jin, H.; Li, Z. Hyperspectral Target Detection Based on Graph Sampling and Aggregation Network. PLoS ONE 2025, 20, e0320043.
  12. Zhao, X.; Liu, K.; Wang, X.; Zhao, S.; Gao, K.; Lin, H.; Zong, Y.; Li, W. Tensor Adaptive Reconstruction Cascaded With Global and Local Feature Fusion for Hyperspectral Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 607–620.
  13. Alshahrani, A.A.; Bchir, O.; Ben Ismail, M.M. Autoencoder-Based Hyperspectral Unmixing with Simultaneous Number-of-Endmembers Estimation. Sensors 2025, 25, 2592.
  14. Yang, H.; Zhang, C. Dual Embedding Transformer Network for Hyperspectral Unmixing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 3514–3529.
  15. Zhang, S.; Zheng, J.; Lai, P.; Li, F.; Liang, L.; Plaza, A.; Deng, C.; Wang, S. Sparse Unmixing of Hyperspectral Images With Noise Reduction Using Spatial Filtering. IEEE Trans. Instrum. Meas. 2025, 74, 1–18.
  16. Dutta, A.; Lall, B.; Sharma, S. Potential of Satellite Hyperspectral Imaging Technology in Soil Health Analysis: A Step towards Environmental Sustainability. Environ. Monit. Assess. 2025, 197, 314.
  17. Faqeerzada, M.A.; Kim, H.; Kim, M.S.; Baek, I.; Chan, D.E.; Cho, B.-K. Hyperspectral Imaging VIS-NIR and SWIR Fusion for Improved Drought-Stress Identification of Strawberry Plants. Comput. Electron. Agric. 2025, 237, 110702.
  18. Liu, Y.; Feng, H.; Fan, Y.; Yue, J.; Yang, F.; Fan, J.; Ma, Y.; Chen, R.; Bian, M.; Yang, G. Utilizing UAV-Based Hyperspectral Remote Sensing Combined with Various Agronomic Traits to Monitor Potato Growth and Estimate Yield. Comput. Electron. Agric. 2025, 231, 109984.
  19. Zhang, X.; Peng, Z.; Wang, Y.; Ye, F.; Fu, T.; Zhang, H. A Robust Multispectral Reconstruction Network from RGB Images Trained by Diverse Satellite Data and Application in Classification and Detection Tasks. Remote Sens. 2025, 17, 1901.
  20. Pei, Z.; Wu, X.; Wu, X.; Xiao, Y.; Yu, P.; Gao, Z.; Wang, Q.; Guo, W. Segmenting Vegetation from UAV Images via Spectral Reconstruction in Complex Field Environments. Plant Phenomics 2025, 7, 100021.
  21. Robles-Kelly, A. Single Image Spectral Reconstruction for Multimedia Applications. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; ACM: New York, NY, USA, 2015; pp. 251–260.
  22. Arad, B.; Ben-Shahar, O. Sparse Recovery of Hyperspectral Signal from Natural RGB Images. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 19–34.
  23. Wu, J.; Aeschbacher, J.; Timofte, R. In Defense of Shallow Learned Spectral Reconstruction from RGB Images. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 471–479.
  24. Galliani, S.; Lanaras, C.; Marmanis, D.; Baltsavias, E.; Schindler, K. Learned Spectral Super-Resolution. arXiv 2017.
  25. Xiong, Z.; Shi, Z.; Li, H.; Wang, L.; Liu, D.; Wu, F. HSCNN: CNN-Based Hyperspectral Image Recovery from Spectrally Undersampled Projections. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 518–525.
  26. Can, Y.B.; Timofte, R. An Efficient CNN for Spectral Reconstruction from RGB Images. arXiv 2018.
  27. Li, Q.; Wang, Q.; Li, X. Mixed 2D/3D Convolutional Network for Hyperspectral Image Super-Resolution. Remote Sens. 2020, 12, 1660.
  28. Li, J.; Du, S.; Wu, C.; Leng, Y.; Song, R.; Li, Y. DRCR Net: Dense Residual Channel Re-Calibration Network with Non-Local Purification for Spectral Super Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1258–1267.
  29. Li, J.; Du, S.; Song, R.; Wu, C.; Li, Y.; Du, Q. HASIC-Net: Hybrid Attentional Convolutional Neural Network With Structure Information Consistency for Spectral Super-Resolution of RGB Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
  30. He, J.; Yuan, Q.; Li, J.; Xiao, Y.; Liu, X.; Zou, Y. DsTer: A Dense Spectral Transformer for Remote Sensing Spectral Super-Resolution. Int. J. Appl. Earth Obs. Geoinf. 2022, 109, 102773.
  31. Cai, Y.; Lin, J.; Lin, Z.; Wang, H.; Zhang, Y.; Pfister, H.; Timofte, R.; Van Gool, L. MST++: Multi-Stage Spectral-Wise Transformer for Efficient Spectral Reconstruction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 744–754.
  32. Wang, N.; Mei, S.; Zhang, Y.; Ma, M.; Zhang, X. Hyperspectral Image Reconstruction From RGB Input Through Highlighting Intrinsic Properties. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
  33. Wang, N.; Mei, S.; Wang, Y.; Zhang, Y.; Zhan, D. WHANet: Wavelet-Based Hybrid Asymmetric Network for Spectral Super-Resolution From RGB Inputs. IEEE Trans. Multimed. 2025, 27, 414–428.
Figure 1. Flowchart of deep learning-based reconstruction algorithm.
Figure 2. Schematic of Spectral-wise Multi-Head Self-Attention (S-MSA).
Figure 3. Overall Architecture of the Proposed CMFREN.
Figure 4. Hierarchical Residual Attention Module.
Figure 5. Multi-Scale Attention Aggregation (MSAA) Module.
Figure 6. (a) Cascaded Multi-Attention module; (b) Multi-Stage Residual Feature Enhancement.
Figure 7. (a) Residual Dense Block; (b) Residual Block.
Figure 8. RMSE Heatmaps between Reconstructed Images (Bands 1, 11, 21, 31) and Ground Truth for ARAD_1K_0919: (a) Arad; (b) A+; (c) HSCNN; (d) HRNet; (e) MST++; (f) DRCR; (g) Ours.
Figure 9. Magnified Details of ARAD_1K_0924 Reconstructed by Different Methods (Bands 1, 11, 21, 31): (a) Arad; (b) A+; (c) HSCNN; (d) HRNet; (e) MST++; (f) DRCR; (g) Ours.
Figure 10. (a) Schematic Diagram of Illumination Region Selection for ARAD_1K_0905; (b) Normalized Spectral Density Curve.
Figure 11. (a) Schematic Diagram of Shadow Region Selection for ARAD_1K_0905; (b) Normalized Spectral Density Curve.
Figure 12. (a) Schematic Diagram of Illumination Region Selection for ARAD_1K_0947; (b) Normalized Spectral Density Curve.
Figure 13. (a) Schematic Diagram of Shadow Region Selection for ARAD_1K_0947; (b) Normalized Spectral Density Curve.
Figure 14. RMSE Heatmaps between Reconstructed Images (Bands 1, 11, 21, 31) and Ground Truth for real_and_fake_peppers: (a) Arad; (b) A+; (c) HSCNN; (d) HRNet; (e) MST++; (f) DRCR; (g) Ours.
Figure 15. Magnified Details of the 16th Band for Images (real_and_fake_peppers, stuffed_toys, superballs, watercolors) Reconstructed by Different Methods: (a) Arad; (b) A+; (c) HSCNN; (d) HRNet; (e) MST++; (f) DRCR; (g) Ours.
Figure 16. RMSE Heatmaps between Reconstructed Images (Bands 1, 12, 24, 36, 48) and Ground Truth for Typical Images in grss_dfc_2018: (a) Arad; (b) A+; (c) HSCNN; (d) HRNet; (e) MST++; (f) DRCR; (g) Ours.
Figure 17. (a) Selected Location of the Parking Lot; (b) Spectral Reflectance of Different Reconstruction Methods.
Figure 18. (a) Selected Location of the Grass Plot; (b) Spectral Reflectance of Different Reconstruction Methods.
Figure 19. (a) Selected Locations of the Parking Lot and Grassland; (b) Comparison of Reflectance between the Proposed Method and Ground Truth for the Parking Lot and Grassland.
Table 1. Details of the CMFREN.

| Module Name | Core Layer Type | Kernel/Pad/Stride | Input | Output | Kernel Count |
|---|---|---|---|---|---|
| Head | Conv2d | 3×3/1/1 | (3,128,128) | (64,128,128) | 64 |
| HRA | AvgPool2d; Conv2d×4 | Pooling: 2×2/0/2; Conv: 3×3/1/1, 1×1/0/1 | (64,128,128) | (64,128,128) | 64 |
| | Interpolate; Concatenate | Upsampling ×2; channel concatenation | (64,64,64), (64,128,128) | (128,128,128) | – |
| | MSAA×2; Conv2d (fuse) | 3×3/1/1 | (128,128,128) | (64,128,128) | 64 |
| MSAA | Conv2d×7 | 1×1/0/1; 3×3/1/1; 5×5/2/1; 7×7/3/1 | (64,128,128) | (64,128,128) | 32/16/64 |
| MSRFE | PixelUnshuffle; Conv2d; Residual Block | 2×2/4×4 shuffling; 3×3/1/1; 1×1/0/1 | (64,128,128) | (64,128,128) | 64 |
| | Interpolate; Concatenate | Upsampling ×2; 3-path concatenation | (64,64/32,64/32) | (64,128,128) | – |
| Residual Dense Block | Conv2d×5 | 3×3/1/1 ×4; 1×1/0/1 ×1 | (64,128,128) | (64,128,128) | 32×4; 64 |
| Residual Block | Conv2d×2 | 3×3/1/1 | (64,128,128) | (64,128,128) | 64 |
| S-MSA | Linear×4; Conv2d×2 | Linear output: 64; Conv: 3×3/1/1 | (64,128,128) | (64,128,128) | 64 |
| SSBFE | MSRFE; S-MSA; Conv2d | 3×3/1/1; 1×1/0/1 | (64,128,128) | (64,128,128) | 64 |
| SEM | Conv2d×2; S-MSA | 3×3/1/1 | (64,128,128) | (64,128,128) | 64 |
| CMA (×10) | SSBFE×2; SEM | – | (64,128,128) | (64,128,128) | 64 |
| Tail | Conv2d | 3×3/1/1 | (64,128,128) | (31/48,128,128) | 31/48 |
Table 2. Quantitative Results on the NTIRE 2022 Dataset.

| Method | Params | FLOPs | Time | RMSE | PSNR (dB) | SAM | MRAE |
|---|---|---|---|---|---|---|---|
| Arad | – | – | – | 0.0869 | 24.3736 | 11.3816 | 0.5476 |
| A+ | – | – | – | 0.0649 | 26.8940 | 9.3801 | 0.5476 |
| HSCNN | 146.68 K | 9.59 G | 0.3478 s | 0.0548 | 27.3359 | 6.7007 | 0.3542 |
| HRNet | 31.70 M | 163.80 G | 0.6815 s | 0.0331 | 31.7148 | 6.1103 | 0.2208 |
| MST++ | 1.62 M | 23.31 G | 1.3139 s | 0.0277 | 33.0712 | 5.1077 | 0.1964 |
| DRCR | 9.38 M | 586.57 G | 0.4227 s | 0.0181 | 34.0328 | 4.9295 | 0.1925 |
| Ours | 6.78 M | 111.49 G | 1.2572 s | 0.0178 | 35.5105 | 4.9925 | 0.1863 |
Table 3. Quantitative Results on the CAVE Dataset.

| Method | Params | FLOPs | Time | RMSE | PSNR (dB) | SAM | MRAE |
|---|---|---|---|---|---|---|---|
| Arad | – | – | – | 0.0615 | 30.4258 | 17.7565 | 0.4912 |
| A+ | – | – | – | 0.0357 | 32.8174 | 13.2659 | 0.3429 |
| HSCNN | 146.68 K | 9.59 G | 0.4120 s | 0.0218 | 34.7632 | 9.2175 | 0.2568 |
| HRNet | 31.70 M | 163.80 G | 0.8437 s | 0.0143 | 35.4921 | 7.1635 | 0.1973 |
| MST++ | 1.62 M | 23.31 G | 1.5949 s | 0.0097 | 36.0853 | 6.2841 | 0.1786 |
| DRCR | 9.38 M | 586.57 G | 0.5213 s | 0.0108 | 35.8726 | 6.5319 | 0.1845 |
| Ours | 6.78 M | 111.49 G | 1.3177 s | 0.0078 | 36.3796 | 5.6667 | 0.1723 |
Table 4. Quantitative Results on the grss_dfc_2018 Dataset.

| Method | Params | FLOPs | Time | RMSE | PSNR (dB) | SAM | MRAE |
|---|---|---|---|---|---|---|---|
| Arad | – | – | – | 0.3416 | 25.7811 | 13.5369 | 0.2151 |
| A+ | – | – | – | 0.3073 | 27.7014 | 9.8617 | 0.1973 |
| HSCNN | 166.34 K | 10.88 G | 0.1367 s | 0.2288 | 31.2174 | 7.9316 | 0.1326 |
| HRNet | 31.71 M | 164.45 G | 0.2695 s | 0.1971 | 33.1272 | 7.0186 | 0.1165 |
| MST++ | 3.84 M | 54.86 G | 0.4128 s | 0.1804 | 35.1484 | 6.2358 | 0.1017 |
| DRCR | 9.39 M | 587.57 G | 0.1713 s | 0.1853 | 35.4363 | 5.9755 | 0.0984 |
| Ours | 6.79 M | 111.81 G | 0.4093 s | 0.1706 | 36.9293 | 5.7446 | 0.0889 |
Table 5. Quantitative Results of Ablation Experiments.

| Num_CMA | RMSE | PSNR (dB) | SAM | MRAE |
|---|---|---|---|---|
| 6 | 0.0195 | 34.5628 | 5.5137 | 0.1986 |
| 7 | 0.0189 | 34.8973 | 5.3241 | 0.1932 |
| 8 | 0.0183 | 35.1459 | 5.1768 | 0.1895 |
| 9 | 0.0180 | 35.3724 | 5.0512 | 0.1871 |
| 10 | 0.0178 | 35.5105 | 4.9925 | 0.1863 |
| 11 | 0.0179 | 35.4816 | 4.9734 | 0.1870 |
| 12 | 0.0181 | 35.3987 | 5.0869 | 0.1889 |
Table 6. Quantitative Results of Ablation Experiments.

| HRA | SSBFE | SEM | RMSE | PSNR (dB) | SAM | MRAE |
|---|---|---|---|---|---|---|
| | | | 0.0206 | 34.7831 | 5.9625 | 0.2309 |
| | | | 0.0189 | 35.3517 | 5.7816 | 0.1912 |
| | | | 0.0217 | 35.1715 | 5.3241 | 0.1987 |
| ✓ | ✓ | ✓ | 0.0178 | 35.5105 | 4.9925 | 0.1863 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
