1. Introduction
Hyperspectral imaging captures spatial scenes while sampling densely along the spectral dimension, yielding a rich, near-continuous spectrum for each pixel that finely characterizes the composition and properties of materials. This imaging mechanism offers high spectral resolution and precise material identification capabilities, which has led to widespread application in fields such as remote sensing [1], environmental monitoring [2], and target recognition [3,4]. Currently, the acquisition of hyperspectral images relies primarily on specialized imaging equipment capable of recording continuous spectral features with high precision. However, these devices are generally prohibitively expensive and are constrained by complex optical structures, slow scanning speeds, and stringent mechanical stability requirements. Consequently, capturing dynamic scenes in real time remains difficult, keeping both acquisition costs and the barriers to widespread adoption high.
To overcome these obstacles, spectral reconstruction (SR), which recovers hyperspectral images from RGB or multispectral data, has received widespread attention. As Bian [5] demonstrated in Nature, the field is transitioning from expensive optics to computational reconstruction, and high-quality SR algorithms are crucial for achieving real-time, high-resolution spectral imaging. In remote sensing image enhancement, pan-sharpening, which fuses low-resolution multispectral data with high-resolution panchromatic images, is a commonly used technique; however, its objective differs fundamentally from that of spectral reconstruction. Although pan-sharpening methods based on component substitution or multiresolution analysis can effectively enhance spatial details by leveraging structural priors, they often introduce spectral distortions. Their primary focus is spatial enhancement rather than expansion of the spectral dimension, so they cannot reliably recover continuous spectral information from limited band inputs. This study therefore focuses on spectral reconstruction methods to ensure a high level of spectral fidelity. Early SR methods relied primarily on sparse coding and dictionary learning, but these approaches struggle to capture the nonlinear spectral distortions that occur under complex environmental conditions, which limits their reconstruction accuracy and generalization capability.
Driven by the rapid advancement of deep learning, Convolutional Neural Networks (CNNs) have emerged as a prominent approach in the spectral reconstruction domain. These methods [6,7] extract local features with convolutional kernels and use residual connections to enable deeper networks, allowing effective learning of hierarchical representations. However, the performance of CNNs is intrinsically constrained by their inductive bias, specifically the restricted local receptive field. In spectral reconstruction tasks, the spectral signature of a single pixel is frequently modulated by long-range spatial context, such as the distribution of neighboring crop canopies, as well as by complex cross-band correlations. CNNs often struggle to capture these global dependencies efficiently. Furthermore, their static parameterization lacks the flexibility to adapt dynamically to input content, which frequently results in spectral distortion or over-smoothed details in regions with intricate textures or abrupt spectral transitions.
To mitigate the locality constraints of CNNs, Transformer-based and hybrid architectures [8] have been increasingly adopted. By utilizing self-attention mechanisms, these methods establish global receptive fields and dynamic modeling capabilities, enabling the capture of long-range spatial-spectral dependencies and achieving state-of-the-art (SOTA) precision. Nevertheless, despite this performance advantage, the practical deployment of Transformers is impeded by prohibitive computational costs: self-attention incurs quadratic complexity, O(N²), in the number of image tokens, resulting in excessive memory footprint and inference latency. This computational burden creates a significant gap between theoretical performance and practical application on resource-constrained edge devices.
To address this accuracy-efficiency dilemma and the specific requirements of spectral sequence modeling, the Mamba architecture [9], based on Selective State Space Models (SSMs), has recently emerged as a compelling solution. Although originally designed for long-sequence modeling in natural language processing, Mamba distinguishes itself from Transformers by discretizing continuous state-space equations and employing parallel scan algorithms, which enable efficient training while maintaining linear inference complexity, O(N). This architecture thus combines the capture of global long-range dependencies with high inference efficiency. Building on it, we propose FGA-Mamba, an efficient spectral reconstruction model featuring a global receptive field and linear computational complexity. We introduce the Frequency-Domain Visual State Space (F-VSS) module, which explicitly enhances global structural coherence and effectively suppresses artifacts by combining frequency-domain priors with Mamba’s long-range modeling capabilities. Additionally, we propose the Enhanced Gradient Attention Module (EGAM), which leverages a gradient-aware mechanism to enhance high-frequency spatial information and edge textures, thereby effectively mitigating over-smoothing in spectral reconstruction. FGA-Mamba achieves significant improvements in reconstruction accuracy while maintaining low computational cost.
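To make the complexity argument concrete, the following NumPy sketch runs a discretized diagonal state-space model over a sequence in a single pass. The shapes, the zero-order-hold-style discretization, and the scalar readout are illustrative simplifications, not FGA-Mamba’s actual selective-scan implementation.

```python
import numpy as np

def ssm_scan(u, A, B, C, dt):
    """Sequential scan of a discretized diagonal SSM.

    x_t = exp(dt * A) * x_{t-1} + dt * B * u_t   (state update)
    y_t = C . x_t                                (readout)

    One pass over the sequence gives O(N) cost in the length N,
    in contrast to the O(N^2) cost of full self-attention.
    """
    x = np.zeros(A.shape[0])
    A_bar = np.exp(dt * A)            # element-wise: A holds the diagonal of the state matrix
    ys = []
    for u_t in u:                     # linear-time recurrence
        x = A_bar * x + dt * B * u_t
        ys.append(float(C @ x))
    return np.array(ys)

# Constant input drives the stable state toward a fixed point, so the output rises smoothly.
u = np.ones(8)
A = -np.ones(4)                       # stable (decaying) modes
B = np.ones(4)
C = np.ones(4)
y = ssm_scan(u, A, B, C, dt=0.1)
print(y.shape)  # (8,)
```

In Mamba, B, C, and dt are additionally made functions of the input (the "selective" part), and the recurrence is evaluated with a parallel scan during training.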
The main contributions of this study are summarized as follows:
(1) We propose a novel network for reconstructing hyperspectral images (HSI) from Multispectral Images (MSI). By incorporating the Mamba architecture, the proposed method achieves a strong balance between reconstruction fidelity and computational efficiency, and we apply it to spectral reconstruction in agricultural remote sensing.
(2) We introduce a mechanism that synergizes frequency-domain self-calibrated attention with state space modeling. This design is tailored to mitigate the limitations of state modeling in capturing high-frequency information, thereby significantly enhancing the model’s ability to preserve intricate spatial details and ensuring global frequency consistency.
(3) We design an efficient Enhanced Gradient Attention Module (EGAM). By introducing a gradient-aware mechanism based on central difference convolution, this module effectively captures local high-frequency features such as edges and textures. This design mitigates the over-smoothing phenomenon often associated with pure state-based models, thereby refining local feature representation.
(4) We employ Vegetation Indices (VIs) to rigorously validate the reconstruction efficacy. The results demonstrate the practical utility and reliability of the proposed model for downstream agricultural applications.
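To illustrate contribution (3), the sketch below implements a plain central difference convolution in NumPy. The kernel, the balance factor theta, and the demo values are hypothetical placeholders; EGAM’s actual design wraps this operator in spatial and spectral attention branches.

```python
import numpy as np

def central_difference_conv(img, w, theta=0.7):
    """Central difference convolution (CDC), a sketch of the gradient-aware
    operator described for EGAM (the paper's exact design may differ).

    y(p0) = sum_p w(p) * x(p0 + p) - theta * x(p0) * sum_p w(p)

    theta = 0 recovers a vanilla convolution; theta = 1 responds only to
    differences from the centre pixel, i.e. to local gradients.
    """
    H, W = img.shape
    k = w.shape[0]
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    w_sum = w.sum()
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k]
            out[i, j] = (w * patch).sum() - theta * img[i, j] * w_sum
    return out

flat = np.full((5, 5), 3.0)           # constant region: no edges, no texture
w = np.ones((3, 3)) / 9.0             # illustrative mean kernel
resp = central_difference_conv(flat, w, theta=1.0)
print(np.allclose(resp, 0.0))  # True: constant input yields zero response
```

Because constant regions map to zero while intensity changes do not, the operator emphasizes exactly the edge and texture cues that pure state-based models tend to smooth away.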
4. Results
In this section, we compare FGA-Mamba against state-of-the-art spectral reconstruction methods, including MPRNet [16], HINet [15], EDSR [27], HSCNN+ [12], HRNet [13], HDNet [28], AWAN [14], and MST++ [17]. All methods were evaluated under identical experimental conditions to ensure a fair comparison and allow each to achieve its best performance.
4.1. Comparison with SOTA Methods
4.1.1. Quantitative Results
To provide a more intuitive view of our model’s competitiveness for spectral reconstruction of heterologous images, we present a PSNR-Params-FLOPs comparison in Figure 5; Table 2 lists the specific parameters of the different models. The horizontal axis is FLOPs (computational cost), the vertical axis is PSNR (performance), and the circle radius encodes Params (memory cost). Our approach occupies the upper-left region, striking the best balance between performance and efficiency.
Table 3 and Table 4 present the evaluation results of spectral reconstruction metrics on the ideal and real-world datasets, respectively, with the best performance for each metric highlighted in bold. To validate the spectral reconstruction performance of the model, we conducted experiments on the NTIRE 2022 dataset. Our method achieved the best results on most metrics and the second-best result on SAM, differing by only 0.001 from the best value. Owing to our proposed local-to-global reconstruction framework, the model performs excellently in spectral reconstruction under ideal conditions. To establish that its reconstruction capability extends beyond the computer vision domain to heterogeneous spectral reconstruction in remote sensing, we conducted experiments on the UAV paddy field dataset. The results were comparable to those on the ideal dataset, demonstrating the model’s generalization ability and its potential for practical remote sensing applications.
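For reference, the two headline metrics can be computed as below. These are the standard formulations; the exact evaluation code (averaging order, data-range handling) may differ in detail.

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two cubes on a [0, data_range] scale."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(x, y, eps=1e-8):
    """Mean Spectral Angle Mapper (radians) between (H, W, C) cubes:
    the angle between the C-dimensional spectra at each pixel, averaged."""
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

rng = np.random.default_rng(0)
gt = rng.random((4, 4, 31))              # toy 31-band ground-truth cube
print(round(psnr(gt, gt + 0.01), 1))     # 40.0: uniform 0.01 error -> MSE = 1e-4
```

SAM is scale-invariant per pixel, so it isolates spectral-shape agreement and complements the intensity-sensitive PSNR.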
4.1.2. Visualization Results
To provide a more intuitive assessment of image reconstruction quality, we randomly selected three spectral bands and generated their corresponding reconstruction error maps. The error map is calculated as the pixel-wise absolute difference:

E_k(i, j) = | Î_k(i, j) − I_k(i, j) |,

where Î_k(i, j) and I_k(i, j) denote the pixel values of the reconstructed image and the reference image (ground truth) at position (i, j) in the k-th spectral band, respectively. This calculation effectively highlights the spatial distribution of reconstruction errors across different regions. For visualization, the error map is rendered with a pseudo-color scale, where blue areas indicate errors close to zero, and red and yellow areas indicate larger reconstruction differences.
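In code, the error map is a one-liner over band-last cubes (this layout is an assumption):

```python
import numpy as np

def band_error_map(rec, gt, k):
    """E_k(i, j) = |rec_k(i, j) - gt_k(i, j)| for spectral band k.
    rec, gt: (H, W, C) cubes with identical shape and radiometric scale."""
    return np.abs(rec[..., k] - gt[..., k])

gt = np.zeros((2, 3, 4))
rec = np.full((2, 3, 4), 0.25)   # uniform offset of 0.25 in every band
e = band_error_map(rec, gt, k=1)
print(e.shape, float(e.max()))  # (2, 3) 0.25
```

Rendering e with any blue-to-red pseudo-color scale then gives maps like those described above.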
Experimental results show that previous reconstruction methods struggle to maintain consistent granularity and eliminate distortion, especially in high-frequency components. In contrast, our method demonstrates a stronger capability for precise texture recovery. Notably, in target rice planting areas, which are crucial for agricultural production, other methods produce noise artifacts that appear as spots of varying sizes and densities, whereas our method shows superior spatial smoothness and spectral fidelity. This performance gain is largely attributable to the synergy of the introduced modules. First, the Frequency-Domain Visual State Space (F-VSS) block integrates state modeling with a frequency-domain self-calibration mechanism, effectively enhancing the capacity for modeling long-range dependencies and frequency consistency. Second, the Spatial Gradient Attention Module and the Spectral Gradient Attention Module focus on directional variations in key structures and on high-frequency spectral transitions, respectively; by exploiting inter-band differences, they significantly improve texture restoration and spectral sensitivity. Furthermore, the enhanced gradient mechanism reinforces the modeling of local edge information, giving the model stronger structural perception and noise suppression, particularly in the rice field area at the top of Figure 6.
As illustrated in Figure 7, we evaluate the reconstruction performance of the competing methods in the spectral dimension. Spectral response curves characterize pixel reflectance across wavelengths, serving as a critical basis for identifying material composition and surface status. High-fidelity spectral reconstruction demands not only global trend alignment with the Ground Truth (GT) but also precise preservation of local details, including absorption bands, reflectance peaks, and spectral transition regions.
To visually demonstrate the results, we randomly selected three spatial points from the reconstructed images. For these points, we plotted the GT spectral response curves along with the curves generated by other comparison methods. The observations show that FGA-Mamba aligns more closely with the GT curves. Although some deviations remain, FGA-Mamba exhibits better consistency in curve shape, slope changes, and peak positions compared to other methods.
4.2. Application Validation
Vegetation Indices (VIs) are widely applied in agricultural remote sensing to evaluate the coverage extent and growth health of surface vegetation, as well as to predict crop yields. Typical VIs (such as NDVI and EVI) quantify the photosynthetic intensity of green vegetation by exploiting the differences in spectral reflectance between multispectral or hyperspectral bands, thereby reflecting its physiological state. Therefore, VI distribution maps calculated based on reconstructed hyperspectral images not only visualize the biophysical significance of the reconstruction results but also serve as a critical basis for assessing their downstream application value.
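The indices named above follow standard definitions; the sketch below uses the common formulations (including the MODIS coefficient set for EVI), which may differ slightly from the exact bands and constants used in this study.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def evi(nir, red, blue):
    """EVI = 2.5 * (NIR - Red) / (NIR + 6*Red - 7.5*Blue + 1),
    the standard MODIS coefficients, for surface reflectances in [0, 1]."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def ngrdi(green, red, eps=1e-8):
    """NGRDI = (Green - Red) / (Green + Red)."""
    return (green - red) / (green + red + eps)

# Healthy vegetation: strong NIR reflectance, weak red reflectance.
print(round(float(ndvi(np.float64(0.5), np.float64(0.05))), 3))  # 0.818
```

Applied pixel-wise to a reconstructed cube, these functions produce the VI distribution maps evaluated in the next subsections.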
4.2.1. Validation of the Application of VI
Since hyperspectral reconstruction is typically performed at the pixel level, test images are often segmented into small patches for processing to reduce computational resource consumption. This approach, however, results in the reconstructed images initially lacking georegistration information. To address this, before generating the full VI distribution map, we first register the reconstructed hyperspectral images with the geolocation information of the original multispectral data. Subsequently, the image patches are mosaicked to reconstruct a spatially consistent orthomosaic.
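The mosaicking step can be sketched as follows for non-overlapping patches listed in row-major order; the georeferencing transfer itself (copying the geotransform/CRS of the source multispectral orthomosaic) depends on the GIS toolchain and is only indicated in a comment.

```python
import numpy as np

def mosaic_patches(patches, n_rows, n_cols):
    """Reassemble reconstructed (p, p, C) patches, listed row-major, into a
    full (n_rows*p, n_cols*p, C) scene. After mosaicking, the geotransform
    and CRS of the original multispectral orthomosaic are attached so the
    HSI product is spatially registered (GIS step not shown).
    """
    rows = [np.concatenate(patches[r * n_cols:(r + 1) * n_cols], axis=1)
            for r in range(n_rows)]
    return np.concatenate(rows, axis=0)

# Four 2x2 single-band patches -> one 4x4 scene.
patches = [np.full((2, 2, 1), v, dtype=float) for v in range(4)]
scene = mosaic_patches(patches, n_rows=2, n_cols=2)
print(scene.shape)  # (4, 4, 1)
```

Overlapping patch schemes would instead blend the shared borders, but the non-overlapping case above matches the simple tiling described in the text.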
Figure 8 presents the VI distribution maps generated from the original hyperspectral image and the various reconstruction methods. A comparison of overall visual consistency and detail preservation reveals that, with the exception of HINet, which failed to generate a complete VI map, the methods yielded usable vegetation index maps. The VI distribution map generated by our proposed FGA-Mamba is the most consistent with the original. Whether in the main paddy field area or the transition zones between roads, it achieves a more realistic and coherent representation of vegetation coverage. Notably, in the low-lying shrub areas along the road edges, only our method completely preserved the green vegetation features, whereas other methods exhibited varying degrees of blurring or feature loss. Furthermore, in a narrow horizontal boundary region at the top of the paddy field, the FGA-Mamba reconstruction clearly restored the boundary texture and vegetation distribution, demonstrating excellent structural fidelity. For a more rigorous quantification of the VI distribution, we employed the VI-IoU, RMSE, and MAE metrics to assess the spatial similarity between the maps, as detailed in Table 5.
Figure 9 illustrates the Normalized Green–Red Difference Index (NGRDI) distribution maps derived from the original hyperspectral images and the various reconstruction methods. A comparison of overall visual consistency shows that the spatial distribution of NGRDI levels reconstructed by our method is the most consistent with the ground truth. This is particularly evident in the rice field area above the central road, where the predicted levels closely match the actual scene. Compared with MPRNet, our method produces clearer boundary structures within the paddy fields. While some competing methods appear to generate sharper details, they often introduce over-sharpening artifacts and spurious textures that degrade overall consistency. In contrast, our method achieves more reliable delineation of these fine-grained agricultural regions. To quantitatively evaluate the NGRDI distribution, we employ IoU, RMSE, and MAE to measure the spatial similarity between the maps, as summarized in Table 6.
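The three map-level metrics can be sketched as below; the threshold used to binarize the VI maps for the IoU term is an illustrative value, not necessarily the one used to produce the tables.

```python
import numpy as np

def vi_map_metrics(vi_pred, vi_ref, veg_threshold=0.3):
    """Spatial similarity between two vegetation-index maps.

    IoU compares binary vegetation masks (vi > veg_threshold, an
    illustrative cutoff); RMSE and MAE compare the continuous values.
    """
    rmse = float(np.sqrt(np.mean((vi_pred - vi_ref) ** 2)))
    mae = float(np.mean(np.abs(vi_pred - vi_ref)))
    p, r = vi_pred > veg_threshold, vi_ref > veg_threshold
    union = np.logical_or(p, r).sum()
    iou = float(np.logical_and(p, r).sum() / union) if union else 1.0
    return {"IoU": iou, "RMSE": rmse, "MAE": mae}

ref = np.array([[0.6, 0.1], [0.4, 0.0]])
m = vi_map_metrics(ref, ref)       # a map compared against itself
print(m["IoU"], m["RMSE"], m["MAE"])  # 1.0 0.0 0.0
```

IoU captures whether vegetated regions land in the right places, while RMSE and MAE capture how faithfully the index magnitudes are reproduced.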
4.2.2. Verification of Generalizability
Although theoretical feasibility is crucial, models limited to specific datasets lack practical value in diverse agricultural environments. Therefore, generalization validation becomes a key criterion for evaluating models. The results in
Section 4.1.1 have already shown that FGA-Mamba possesses a certain degree of generalization capability. We further deployed FGA-Mamba directly in Research Area 2 to assess its generalization performance. As shown in
Table 5, although environmental differences between the two areas naturally lead to fluctuations in absolute performance metrics, our method still outperforms the other competing approaches. Notably, the stability of the MRAE metric highlights the model’s adaptability to site-specific variations. This strong generalization capability provides more robust support for agricultural remote sensing tasks.
4.3. Ablation Study
To systematically dissect the efficacy of FGA-Mamba’s internal mechanisms, we conducted an ablation study on the rice field dataset. The investigation proceeded in two phases: first, assessing the impact of the FG block cascading depth, and second, isolating the contributions of individual model components.
4.3.1. Impact of Mamba Block Depth
FGA-Mamba is constructed by cascading multiple FG blocks. To determine the optimal stacking depth, we analyzed the reconstruction performance relative to the number of blocks, denoted as
n. The quantitative metrics detailed in
Table 7 reveal a distinct trend: while initial increments in n yield tangible improvements in reconstruction fidelity, performance saturation occurs as the network deepens further. Crucially, the configuration of
n = 3 strikes the most favorable balance between reconstruction accuracy and computational cost. Consequently, a depth of 3 was adopted as the standard configuration.
4.3.2. Impact of Model Components
To assess the efficacy of individual components within FGA-Mamba, we conducted a component-wise ablation study, as detailed in
Table 8. The results indicate that the exclusion of any specific module leads to performance degradation. Most notably, removing the F-VSS module causes a sharp decline in reconstruction fidelity, reducing PSNR by 2.11 dB and increasing RMSE by 17.42%.
Furthermore, the independent removal of Spatial or Spectral Attention also impairs performance. The simultaneous absence of both mechanisms results in a substantial regression, with PSNR dropping by 2.18 dB. This confirms their complementary nature in preserving spatial textures and spectral features. Finally, the integration of EGAM yields further gains, increasing PSNR by 0.22 dB and decreasing RMSE by 4.49%. In summary, the synergistic operation of these modules ensures optimal performance in both spectral fidelity and spatial structural restoration.
4.4. Limitations and Future Work
While the cross-scene deployment demonstrates the potential of FGA-Mamba for agricultural remote sensing, the current validation is still limited to specific environmental settings and crop conditions.
In future work, we will further refine the network architecture and extend comparisons with more recent state-of-the-art (SOTA) models to evaluate its performance under more complex and diverse real-world scenarios. In particular, we aim to address variations in land surface conditions caused by seasonal changes, precipitation, and diurnal illumination. This requires the support of more large-scale, multi-temporal UAV remote sensing datasets.
5. Conclusions
In this study, we present an efficient deep learning model based on state-space models (SSMs) for high-fidelity hyperspectral data reconstruction from multispectral inputs. To this end, we developed a standardized preprocessing and registration pipeline for heterogeneous remote sensing data. This pipeline minimizes the spatial geometric differences between drone-mounted multispectral and hyperspectral images, thereby creating a high-quality paired rice field dataset for algorithm training. Our approach is designed for the MSI-to-HSI reconstruction task, with the core of the network combining the F-VSS module with three distinct gradient attention mechanisms. The F-VSS module leverages the linear complexity advantage of the Mamba architecture to capture global long-range dependencies through frequency-domain calibration while maintaining structural consistency. Meanwhile, the Enhanced Gradient Attention Module (EGAM) explicitly strengthens the extraction of high-frequency textures and edge information through central difference convolution. These synergistic enhancement mechanisms enable the system to efficiently recover reliable high-dimensional spatial and spectral information from low-dimensional inputs.
Furthermore, we propose an evaluation strategy that uses vegetation indices (VIs) as key auxiliary indicators, combined with traditional image metrics. These indices are employed to assess the practical effectiveness of the generated images in reflecting paddy field coverage and growth vigor. Comprehensive tests indicate that the proposed method outperforms existing approaches in balancing reconstruction accuracy and computational efficiency. Future research will focus on two aspects: first, improving the model’s robustness in complex scenarios, such as different seasonal lighting conditions and crop growth cycles; second, exploring lightweight deployment on edge computing devices to enable real-time field monitoring. In summary, the proposed method provides an effective solution for hyperspectral image reconstruction, offering substantial technical support for agricultural phenotyping research.