1. Introduction
Hyperspectral imaging captures spatial scenes while sampling densely along the spectral dimension, yielding a rich, near-continuous spectrum for each pixel that finely characterizes the composition and properties of materials. This imaging mechanism offers high spectral resolution and precise material identification capabilities, which has led to widespread application in fields such as remote sensing [1], environmental monitoring [2], and target recognition [3,4]. Currently, the acquisition of hyperspectral images relies primarily on specialized imaging equipment capable of recording continuous spectral features with high precision. However, these devices are generally prohibitively expensive and are constrained by complex optical structures, slow scanning speeds, and stringent mechanical stability requirements. Consequently, capturing dynamic scenes in real time remains difficult, keeping both acquisition costs and the barriers to widespread adoption high.
To overcome these obstacles, spectral reconstruction (SR), which recovers hyperspectral images from RGB or multispectral data, has received widespread attention. As Bian [5] demonstrated in Nature, the field is transitioning from expensive optics to computational reconstruction, and high-quality SR algorithms are crucial for achieving real-time, high-resolution spectral imaging. In remote sensing image enhancement, pan-sharpening, which fuses low-resolution multispectral data with high-resolution panchromatic images, is a commonly used technique; however, its objective differs fundamentally from that of spectral reconstruction. Although pan-sharpening methods based on component substitution or multiresolution analysis can effectively enhance spatial details by leveraging structural priors, they often introduce spectral distortions. Their primary focus is spatial enhancement rather than expansion of the spectral dimension, so they cannot reliably recover continuous spectral information from limited band inputs. This study therefore focuses on spectral reconstruction methods to ensure a high level of spectral fidelity. Early SR methods relied primarily on sparse coding and dictionary learning, but these approaches struggle to capture the nonlinear spectral distortions that occur under complex environmental conditions, which limits their reconstruction accuracy and generalization capability.
Driven by the rapid advancement of deep learning, Convolutional Neural Networks (CNNs) have emerged as a prominent approach in the spectral reconstruction domain. These methods [6,7] extract local features with convolutional kernels and use residual connections to enable deeper networks, allowing effective learning of hierarchical representations. However, the performance of CNNs is intrinsically constrained by their inductive bias, specifically the restricted local receptive field. In spectral reconstruction tasks, the spectral signature of a single pixel is frequently modulated by long-range spatial context, such as the distribution of neighboring crop canopies, as well as by complex cross-band correlations. CNNs often struggle to capture these global dependencies efficiently. Furthermore, their static parameterization lacks the flexibility to adapt dynamically to input content, which frequently results in spectral distortion or over-smoothed details in regions with intricate textures or abrupt spectral transitions.
To mitigate the locality constraints of CNNs, Transformer-based and hybrid architectures [8] have been increasingly adopted. By utilizing self-attention mechanisms, these methods establish global receptive fields and dynamic modeling capabilities, enabling the capture of long-range spatial-spectral dependencies and achieving state-of-the-art (SOTA) precision. Nevertheless, despite this performance advantage, the practical deployment of Transformers is impeded by prohibitive computational costs: self-attention incurs quadratic complexity, O(N²), in the number of image tokens, resulting in excessive memory footprint and inference latency. This computational burden creates a significant gap between theoretical performance and practical application on resource-constrained edge devices.
To address this accuracy-efficiency dilemma and the specific requirements of spectral sequence modeling, the Mamba architecture [9], based on Selective State Space Models (SSMs), has recently emerged as a compelling solution. Although originally designed for long-sequence modeling in natural language processing, Mamba distinguishes itself from Transformers by discretizing continuous state-space equations and employing parallel scan algorithms, which enable efficient training while maintaining linear inference complexity, O(N). This architecture thus combines the capture of global long-range dependencies with high inference efficiency. Building on it, we propose FGA-Mamba, an efficient spectral reconstruction model featuring a global receptive field and linear computational complexity. We introduce the Frequency-Domain Visual State Space (F-VSS) module, which explicitly enhances global structural coherence and effectively suppresses artifacts by combining frequency-domain priors with Mamba’s long-range modeling capabilities. Additionally, we propose the Enhanced Gradient Attention Module (EGAM), which leverages a gradient-aware mechanism to enhance high-frequency spatial information and edge textures, thereby effectively mitigating over-smoothing in spectral reconstruction. FGA-Mamba achieves significant improvements in reconstruction accuracy while maintaining low computational cost.
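To make the complexity argument concrete, the following NumPy sketch runs a discretized diagonal state-space model over a sequence in a single pass. The shapes, the zero-order-hold-style discretization, and the scalar readout are illustrative simplifications, not FGA-Mamba’s actual selective-scan implementation.

```python
import numpy as np

def ssm_scan(u, A, B, C, dt):
    """Sequential scan of a discretized diagonal SSM.

    x_t = exp(dt * A) * x_{t-1} + dt * B * u_t   (state update)
    y_t = C . x_t                                (readout)

    One pass over the sequence gives O(N) cost in the length N,
    in contrast to the O(N^2) cost of full self-attention.
    """
    x = np.zeros(A.shape[0])
    A_bar = np.exp(dt * A)            # element-wise: A holds the diagonal of the state matrix
    ys = []
    for u_t in u:                     # linear-time recurrence
        x = A_bar * x + dt * B * u_t
        ys.append(float(C @ x))
    return np.array(ys)

# Constant input drives the stable state toward a fixed point, so the output rises smoothly.
u = np.ones(8)
A = -np.ones(4)                       # stable (decaying) modes
B = np.ones(4)
C = np.ones(4)
y = ssm_scan(u, A, B, C, dt=0.1)
print(y.shape)  # (8,)
```

In Mamba, B, C, and dt are additionally made functions of the input (the "selective" part), and the recurrence is evaluated with a parallel scan during training.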
The main contributions of this study are summarized as follows:
(1) We propose a novel network for reconstructing hyperspectral images (HSI) from Multispectral Images (MSI). By incorporating the Mamba architecture, the proposed method achieves a strong balance between reconstruction fidelity and computational efficiency, and we apply it to spectral reconstruction in agricultural remote sensing.
(2) We introduce a mechanism that synergizes frequency-domain self-calibrated attention with state space modeling. This design is tailored to mitigate the limitations of state modeling in capturing high-frequency information, thereby significantly enhancing the model’s ability to preserve intricate spatial details and ensuring global frequency consistency.
(3) We design an efficient Enhanced Gradient Attention Module (EGAM). By introducing a gradient-aware mechanism based on central difference convolution, this module effectively captures local high-frequency features such as edges and textures. This design mitigates the over-smoothing phenomenon often associated with pure state-based models, thereby refining local feature representation.
(4) We employ Vegetation Indices (VIs) to rigorously validate the reconstruction efficacy. The results demonstrate the practical utility and reliability of the proposed model for downstream agricultural applications.
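To illustrate contribution (3), the sketch below implements a plain central difference convolution in NumPy. The kernel, the balance factor theta, and the demo values are hypothetical placeholders; EGAM’s actual design wraps this operator in spatial and spectral attention branches.

```python
import numpy as np

def central_difference_conv(img, w, theta=0.7):
    """Central difference convolution (CDC), a sketch of the gradient-aware
    operator described for EGAM (the paper's exact design may differ).

    y(p0) = sum_p w(p) * x(p0 + p) - theta * x(p0) * sum_p w(p)

    theta = 0 recovers a vanilla convolution; theta = 1 responds only to
    differences from the centre pixel, i.e. to local gradients.
    """
    H, W = img.shape
    k = w.shape[0]
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    w_sum = w.sum()
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k]
            out[i, j] = (w * patch).sum() - theta * img[i, j] * w_sum
    return out

flat = np.full((5, 5), 3.0)           # constant region: no edges, no texture
w = np.ones((3, 3)) / 9.0             # illustrative mean kernel
resp = central_difference_conv(flat, w, theta=1.0)
print(np.allclose(resp, 0.0))  # True: constant input yields zero response
```

Because constant regions map to zero while intensity changes do not, the operator emphasizes exactly the edge and texture cues that pure state-based models tend to smooth away.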
4. Results
In this section, we compare FGA-Mamba against state-of-the-art spectral reconstruction methods, including MPRNet [16], HINet [15], EDSR [27], HSCNN+ [12], HRNet [13], HDNet [28], AWAN [14], and MST++ [17]. All methods were evaluated under identical experimental conditions to ensure a fair comparison and allow each to achieve its best performance.
4.1. Comparison with SOTA Methods
4.1.1. Quantitative Results
To provide a more intuitive view of our model’s competitiveness for spectral reconstruction of heterologous images, we present a PSNR-Params-FLOPs comparison in Figure 5; Table 2 lists the specific parameters of the different models. The horizontal axis is FLOPs (computational cost), the vertical axis is PSNR (performance), and the circle radius encodes Params (memory cost). Our approach occupies the upper-left region, striking the best balance between performance and efficiency.
Table 3 and Table 4 present the evaluation results of spectral reconstruction metrics on the ideal and real-world datasets, respectively, with the best performance for each metric highlighted in bold. To validate the spectral reconstruction performance of the model, we conducted experiments on the NTIRE 2022 dataset. Our method achieved the best results on most metrics and the second-best result on SAM, differing by only 0.001 from the best value. Owing to our proposed local-to-global reconstruction framework, the model performs excellently in spectral reconstruction under ideal conditions. To establish that its reconstruction capability extends beyond the computer vision domain to heterogeneous spectral reconstruction in remote sensing, we conducted experiments on the UAV paddy field dataset. The results were comparable to those on the ideal dataset, demonstrating the model’s generalization ability and its potential for practical remote sensing applications.
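For reference, the two headline metrics can be computed as below. These are the standard formulations; the exact evaluation code (averaging order, data-range handling) may differ in detail.

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two cubes on a [0, data_range] scale."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(x, y, eps=1e-8):
    """Mean Spectral Angle Mapper (radians) between (H, W, C) cubes:
    the angle between the C-dimensional spectra at each pixel, averaged."""
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

rng = np.random.default_rng(0)
gt = rng.random((4, 4, 31))              # toy 31-band ground-truth cube
print(round(psnr(gt, gt + 0.01), 1))     # 40.0: uniform 0.01 error -> MSE = 1e-4
```

SAM is scale-invariant per pixel, so it isolates spectral-shape agreement and complements the intensity-sensitive PSNR.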
4.1.2. Visualization Results
To provide a more intuitive assessment of image reconstruction quality, we randomly selected three spectral bands and generated their corresponding reconstruction error maps. The error map is calculated as the pixel-wise absolute difference:

E_k(i, j) = | Î_k(i, j) − I_k(i, j) |,

where Î_k(i, j) and I_k(i, j) denote the pixel values of the reconstructed image and the reference image (ground truth) at position (i, j) in the k-th spectral band, respectively. This calculation effectively highlights the spatial distribution of reconstruction errors across different regions. For visualization, the error map is rendered with a pseudo-color scale, where blue areas indicate errors close to zero, and red and yellow areas indicate larger reconstruction differences.
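In code, the error map is a one-liner over band-last cubes (this layout is an assumption):

```python
import numpy as np

def band_error_map(rec, gt, k):
    """E_k(i, j) = |rec_k(i, j) - gt_k(i, j)| for spectral band k.
    rec, gt: (H, W, C) cubes with identical shape and radiometric scale."""
    return np.abs(rec[..., k] - gt[..., k])

gt = np.zeros((2, 3, 4))
rec = np.full((2, 3, 4), 0.25)   # uniform offset of 0.25 in every band
e = band_error_map(rec, gt, k=1)
print(e.shape, float(e.max()))  # (2, 3) 0.25
```

Rendering e with any blue-to-red pseudo-color scale then gives maps like those described above.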
Experimental results show that previous reconstruction methods struggle to maintain consistent granularity and eliminate distortion, especially in high-frequency components. In contrast, our method demonstrates a stronger capability for precise texture recovery. Notably, in target rice planting areas, which are crucial for agricultural production, other methods produce noise artifacts that appear as spots of varying sizes and densities, whereas our method shows superior spatial smoothness and spectral fidelity. This performance gain is largely attributable to the synergy of the introduced modules. First, the Frequency-Domain Visual State Space (F-VSS) block integrates state modeling with a frequency-domain self-calibration mechanism, effectively enhancing the capacity for modeling long-range dependencies and frequency consistency. Second, the Spatial Gradient Attention Module and the Spectral Gradient Attention Module focus on directional variations in key structures and on high-frequency spectral transitions, respectively; by exploiting inter-band differences, they significantly improve texture restoration and spectral sensitivity. Furthermore, the enhanced gradient mechanism reinforces the modeling of local edge information, giving the model stronger structural perception and noise suppression, particularly in the rice field area at the top of Figure 6.
As illustrated in Figure 7, we evaluate the reconstruction performance of the competing methods in the spectral dimension. Spectral response curves characterize pixel reflectance across wavelengths, serving as a critical basis for identifying material composition and surface status. High-fidelity spectral reconstruction demands not only global trend alignment with the Ground Truth (GT) but also precise preservation of local details, including absorption bands, reflectance peaks, and spectral transition regions.
To visually demonstrate the results, we randomly selected three spatial points from the reconstructed images. For these points, we plotted the GT spectral response curves along with the curves generated by other comparison methods. The observations show that FGA-Mamba aligns more closely with the GT curves. Although some deviations remain, FGA-Mamba exhibits better consistency in curve shape, slope changes, and peak positions compared to other methods.
4.2. Application Validation
Vegetation Indices (VIs) are widely applied in agricultural remote sensing to evaluate the coverage extent and growth health of surface vegetation, as well as to predict crop yields. Typical VIs (such as NDVI and EVI) quantify the photosynthetic intensity of green vegetation by exploiting the differences in spectral reflectance between multispectral or hyperspectral bands, thereby reflecting its physiological state. Therefore, VI distribution maps calculated based on reconstructed hyperspectral images not only visualize the biophysical significance of the reconstruction results but also serve as a critical basis for assessing their downstream application value.
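The indices named above follow standard definitions; the sketch below uses the common formulations (including the MODIS coefficient set for EVI), which may differ slightly from the exact bands and constants used in this study.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def evi(nir, red, blue):
    """EVI = 2.5 * (NIR - Red) / (NIR + 6*Red - 7.5*Blue + 1),
    the standard MODIS coefficients, for surface reflectances in [0, 1]."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def ngrdi(green, red, eps=1e-8):
    """NGRDI = (Green - Red) / (Green + Red)."""
    return (green - red) / (green + red + eps)

# Healthy vegetation: strong NIR reflectance, weak red reflectance.
print(round(float(ndvi(np.float64(0.5), np.float64(0.05))), 3))  # 0.818
```

Applied pixel-wise to a reconstructed cube, these functions produce the VI distribution maps evaluated in the next subsections.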
4.2.1. Validation of the Application of VI
Since hyperspectral reconstruction is typically performed at the pixel level, test images are often segmented into small patches for processing to reduce computational resource consumption. This approach, however, results in the reconstructed images initially lacking georegistration information. To address this, before generating the full VI distribution map, we first register the reconstructed hyperspectral images with the geolocation information of the original multispectral data. Subsequently, the image patches are mosaicked to reconstruct a spatially consistent orthomosaic.
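The mosaicking step can be sketched as follows for non-overlapping patches listed in row-major order; the georeferencing transfer itself (copying the geotransform/CRS of the source multispectral orthomosaic) depends on the GIS toolchain and is only indicated in a comment.

```python
import numpy as np

def mosaic_patches(patches, n_rows, n_cols):
    """Reassemble reconstructed (p, p, C) patches, listed row-major, into a
    full (n_rows*p, n_cols*p, C) scene. After mosaicking, the geotransform
    and CRS of the original multispectral orthomosaic are attached so the
    HSI product is spatially registered (GIS step not shown).
    """
    rows = [np.concatenate(patches[r * n_cols:(r + 1) * n_cols], axis=1)
            for r in range(n_rows)]
    return np.concatenate(rows, axis=0)

# Four 2x2 single-band patches -> one 4x4 scene.
patches = [np.full((2, 2, 1), v, dtype=float) for v in range(4)]
scene = mosaic_patches(patches, n_rows=2, n_cols=2)
print(scene.shape)  # (4, 4, 1)
```

Overlapping patch schemes would instead blend the shared borders, but the non-overlapping case above matches the simple tiling described in the text.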
Figure 8 presents the VI distribution maps generated from the original hyperspectral image and the various reconstruction methods. A comparison of overall visual consistency and detail preservation reveals that, with the exception of HINet, which failed to generate a complete VI map, the methods yielded usable vegetation index maps. The VI distribution map generated by our proposed FGA-Mamba is the most consistent with the original. Whether in the main paddy field area or the transition zones between roads, it achieves a more realistic and coherent representation of vegetation coverage. Notably, in the low-lying shrub areas along the road edges, only our method completely preserved the green vegetation features, whereas other methods exhibited varying degrees of blurring or feature loss. Furthermore, in a narrow horizontal boundary region at the top of the paddy field, the FGA-Mamba reconstruction clearly restored the boundary texture and vegetation distribution, demonstrating excellent structural fidelity. For a more rigorous quantification of the VI distribution, we employed the VI-IoU, RMSE, and MAE metrics to assess the spatial similarity between the maps, as detailed in Table 5.
Figure 9 illustrates the Normalized Green–Red Difference Index (NGRDI) distribution maps derived from the original hyperspectral images and the various reconstruction methods. A comparison of overall visual consistency shows that the spatial distribution of NGRDI levels reconstructed by our method is the most consistent with the ground truth. This is particularly evident in the rice field area above the central road, where the predicted levels closely match the actual scene. Compared with MPRNet, our method produces clearer boundary structures within the paddy fields. While some competing methods appear to generate sharper details, they often introduce over-sharpening artifacts and spurious textures that degrade overall consistency. In contrast, our method achieves more reliable delineation of these fine-grained agricultural regions. To quantitatively evaluate the NGRDI distribution, we employ IoU, RMSE, and MAE to measure the spatial similarity between the maps, as summarized in Table 6.
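The three map-level metrics can be sketched as below; the threshold used to binarize the VI maps for the IoU term is an illustrative value, not necessarily the one used to produce the tables.

```python
import numpy as np

def vi_map_metrics(vi_pred, vi_ref, veg_threshold=0.3):
    """Spatial similarity between two vegetation-index maps.

    IoU compares binary vegetation masks (vi > veg_threshold, an
    illustrative cutoff); RMSE and MAE compare the continuous values.
    """
    rmse = float(np.sqrt(np.mean((vi_pred - vi_ref) ** 2)))
    mae = float(np.mean(np.abs(vi_pred - vi_ref)))
    p, r = vi_pred > veg_threshold, vi_ref > veg_threshold
    union = np.logical_or(p, r).sum()
    iou = float(np.logical_and(p, r).sum() / union) if union else 1.0
    return {"IoU": iou, "RMSE": rmse, "MAE": mae}

ref = np.array([[0.6, 0.1], [0.4, 0.0]])
m = vi_map_metrics(ref, ref)       # a map compared against itself
print(m["IoU"], m["RMSE"], m["MAE"])  # 1.0 0.0 0.0
```

IoU captures whether vegetated regions land in the right places, while RMSE and MAE capture how faithfully the index magnitudes are reproduced.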
4.2.2. Verification of Generalizability
Although theoretical feasibility is crucial, models limited to specific datasets lack practical value in diverse agricultural environments. Therefore, generalization validation becomes a key criterion for evaluating models. The results in
Section 4.1.1 have already shown that FGA-Mamba possesses a certain degree of generalization capability. We further deployed FGA-Mamba directly in Research Area 2 to assess its generalization performance. As shown in
Table 5, although environmental differences between the two areas naturally lead to fluctuations in absolute performance metrics, our method still outperforms the other competing approaches. Notably, the stability of the MRAE metric highlights the model’s adaptability to site-specific variations. This strong generalization capability provides more robust support for agricultural remote sensing tasks.
4.3. Ablation Study
To systematically dissect the efficacy of FGA-Mamba’s internal mechanisms, we conducted an ablation study on the rice field dataset. The investigation proceeded in two phases: first, assessing the impact of the FG block cascading depth, and second, isolating the contributions of individual model components.
4.3.1. Impact of Mamba Block Depth
FGA-Mamba is constructed by cascading multiple FG blocks. To determine the optimal stacking depth, we analyzed the reconstruction performance relative to the number of blocks, denoted as
n. The quantitative metrics detailed in
Table 7 reveal a distinct trend: while initial increments in n yield tangible improvements in reconstruction fidelity, performance saturation occurs as the network deepens further. Crucially, the configuration of
n = 3 strikes the most favorable balance between reconstruction accuracy and computational cost. Consequently, a depth of 3 was adopted as the standard configuration.
4.3.2. Impact of Model Components
To assess the efficacy of individual components within FGA-Mamba, we conducted a component-wise ablation study, as detailed in
Table 8. The results indicate that the exclusion of any specific module leads to performance degradation. Most notably, removing the F-VSS module causes a sharp decline in reconstruction fidelity, reducing PSNR by 2.11 dB and increasing RMSE by 17.42%.
Furthermore, the independent removal of Spatial or Spectral Attention also impairs performance. The simultaneous absence of both mechanisms results in a substantial regression, with PSNR dropping by 2.18 dB. This confirms their complementary nature in preserving spatial textures and spectral features. Finally, the integration of EGAM yields further gains, increasing PSNR by 0.22 dB and decreasing RMSE by 4.49%. In summary, the synergistic operation of these modules ensures optimal performance in both spectral fidelity and spatial structural restoration.
4.4. Limitations and Future Work
While the cross-scene deployment demonstrates the potential of FGA-Mamba for agricultural remote sensing, the current validation is still limited to specific environmental settings and crop conditions.
In future work, we will further refine the network architecture and extend comparisons with more recent state-of-the-art (SOTA) models to evaluate its performance under more complex and diverse real-world scenarios. In particular, we aim to address variations in land surface conditions caused by seasonal changes, precipitation, and diurnal illumination. This requires the support of more large-scale, multi-temporal UAV remote sensing datasets.
5. Conclusions
In this study, we present an efficient deep learning model based on state-space models (SSMs) for high-fidelity hyperspectral data reconstruction from multispectral inputs. To this end, we developed a standardized preprocessing and registration pipeline for heterogeneous remote sensing data. This pipeline minimizes the spatial geometric differences between drone-mounted multispectral and hyperspectral images, thereby creating a high-quality paired rice field dataset for algorithm training. Our approach is designed for the MSI-to-HSI reconstruction task, with the core of the network combining the F-VSS module with three distinct gradient attention mechanisms. The F-VSS module leverages the linear complexity advantage of the Mamba architecture to capture global long-range dependencies through frequency-domain calibration while maintaining structural consistency. Meanwhile, the Enhanced Gradient Attention Module (EGAM) explicitly strengthens the extraction of high-frequency textures and edge information through central difference convolution. These synergistic enhancement mechanisms enable the system to efficiently recover reliable high-dimensional spatial and spectral information from low-dimensional inputs.
Furthermore, we propose an evaluation strategy that uses vegetation indices (VIs) as key auxiliary indicators, combined with traditional image metrics. These indices are employed to assess the practical effectiveness of the generated images in reflecting paddy field coverage and growth vigor. Comprehensive tests indicate that the proposed method outperforms existing approaches in balancing reconstruction accuracy and computational efficiency. Future research will focus on two aspects: first, improving the model’s robustness in complex scenarios, such as different seasonal lighting conditions and crop growth cycles; second, exploring lightweight deployment on edge computing devices to enable real-time field monitoring. In summary, the proposed method provides an effective solution for hyperspectral image reconstruction, offering substantial technical support for agricultural phenotyping research.