1. Introduction
The spectral resolution of hyperspectral images (HSIs) is on the order of nanometers. The amount and resolution of spectral information they contain are far greater than those of multispectral and RGB images. HSIs have been widely used in image information processing fields, particularly in remote sensing and Earth observation [
1], agriculture and crop monitoring [
2], mineral exploration and identification [
3], environmental pollution detection [
4], and cultural heritage preservation and archeology [
5]. However, the acquisition of hyperspectral images requires expensive sensor equipment and technology, large-capacity data storage and transmission needs, as well as specialized knowledge and resources for the complex data collection and processing procedures, which severely limits their widespread application.
To overcome these limitations and considering the advantages of RGB images, such as low acquisition cost, high spatial resolution, and rich texture details, an increasing number of researchers have focused on recovering hyperspectral images (HSIs) from RGB images by learning the dependencies and correlations between RGB images and their corresponding HSIs, a process known as spectral reconstruction. Methods for hyperspectral image reconstruction based on RGB images can supplement spectral dimensional information, improve image processing performance, and reduce hardware costs and complexity, making them a research hotspot in the fields of image processing and computer vision. Early works [
6,
7,
8] utilized prior knowledge such as low-rankness and sparsity to crudely model shallow feature representations. However, due to the simplicity of model design and over-reliance on prior information, the accuracy and generative capability of these methods were limited.
With the advantages of automatic feature learning, high performance, and scalability, deep learning models have been widely applied in fields such as computer vision [
9], natural language processing [
10], and medical image analysis [
11], achieving remarkable results. Given the excellent performance of deep learning, researchers have started to apply deep learning techniques to the field of hyperspectral image reconstruction from a single RGB image. Based on the different neural network architectures in deep learning, the reconstruction methods can be categorized into CNN-based methods [
12,
13], GAN-based methods [
14,
15], Attention-based methods [
16,
17], and Transformer-based methods [
18,
19]. Most CNN-based methods, while optimizing pixel-level distances between the generated and real images, tend to over-smooth the images, affecting the diversity and authenticity of the spectra. GAN-based spectral imaging methods, to avoid instability during training, typically require the exclusion of training data that do not match the test set, leading to poor model robustness. Attention-based reconstruction methods enhance feature extraction by deepening the network, expanding the network scale, and integrating multiple networks, which improves reconstruction performance, but this significantly reduces the speed of spectral image reconstruction. Although Transformer-based methods can account for the spatial sparsity of spectral information, they fail to effectively model spectral similarity and long-range dependencies, and maintaining a balance between spatial and spectral resolution remains an unresolved issue.
To address these problems, this paper proposes a Dual-Gated Mamba Multi-Scale Adaptive Feature Learning (DMMAF) network, which can reconstruct high-precision hyperspectral images from a single RGB image. Specifically, our network adopts a reflection dot-product adaptive dual-noise-aware feature extraction method, which effectively supplements the edge detail information of the spectral images and enhances the model’s robustness. Furthermore, DMMAF introduces a deformable attention-based global feature extraction and dual-gated Mamba local feature extraction method, which strengthens the interaction between local and global information, thereby improving the accuracy of image reconstruction. Meanwhile, DMMAF proposes a structure-aware smooth loss function, which successfully addresses the challenge of balancing spatial and spectral resolution by integrating smooth loss, curvature loss, and attention supervision loss.
The main contributions of our work are summarized as follows:
- (1)
We propose the Dual-Gated Mamba Multi-Scale Adaptive Feature learning (DMMAF) network for high-precision reconstruction from RGB images to HSIs. The network primarily consists of three components: Reflection Dot-product Adaptive Dual-noise-aware Feature Extraction, Deformable Attention Dual-gated Mamba Multi-Scale Feature Learning, and a Structure-aware Smooth Constraint Loss Function. Extensive experimental results demonstrate that this algorithm can reconstruct a high-precision HSI from a single RGB image, outperforming other advanced reconstruction algorithms under unsupervised training.
- (2)
DMMAF designs a Reflection Dot-product Adaptive Dual-noise-aware Feature Extraction method. This method is primarily composed of two reflection depth dot-product feature processing modules and an adaptive dual-noise mask module. The reflection depth dot-product module primarily implements channel transformation and supplements edge spatial information details, while the adaptive dual-noise mask enhances feature bandwidth correlation and contextual relationships, thereby improving the model’s robustness.
- (3)
DMMAF constructs a Deformable Attention Dual-gated Mamba Multi-Scale Feature Learning method. The deformable attention modeling addresses the issue of insufficient attention to important information in global features during image reconstruction, thereby mitigating the problem of detail loss. The Dual-gated Mamba local feature extraction resolves the global–local feature conflict and reduces channel redundancy, resulting in a significant reduction in parameters. The combination of these two methods enhances the interaction efficiency between local and global information during image reconstruction, thus improving the accuracy of the reconstruction.
- (4)
DMMAF introduces a structure-aware smooth loss function. This module comprises three loss functions: smooth loss, curvature loss, and attention supervision loss, which address issues such as the neglect of image structural information, a lack of spatial structural constraints, and insufficient supervision of the attention mechanism. It effectively balances spatial and spectral resolution.
2. Related Work
Hyperspectral image reconstruction from a single RGB image based on deep learning can generally be categorized into three types according to the supervision strategy used for training: supervised reconstruction, semi-supervised reconstruction, and unsupervised reconstruction.
In supervised RGB-to-hyperspectral image reconstruction, the model is trained using a large dataset of paired RGB and hyperspectral images [
20], where the RGB image is used to predict and reconstruct the hyperspectral image. The Attention-based Scale Feature-Adversarial Network (SAPUNet) [
16] combines attention mechanisms (scale attention pyramid UNet) and feature pyramids (scale attention pyramid W-Net) to reconstruct hyperspectral images from RGB images. This network generates results with spatial consistency and less blurring by combining content loss and adversarial loss; however, it fails to reconstruct high-frequency details. The RGB-to-Hyperspectral Generative Adversarial Network (R2HGAN) [
15] utilizes a conditional discriminator and spectral discriminator joint discriminative method, showing great realism in spectral reconstruction. However, this model requires reassembling the RGB image, which leads to inefficiencies in memory usage. The hyperspectral image reconstruction model based on deep convolutional neural networks (HSCNN-R) [
12] learns the mapping relationship between RGB and hyperspectral images through a training dataset and optimizes the model parameters using a loss function. However, this model suffers from a significant amount of feature overlap, increasing computation time. The HSGAN network introduces a two-stage adversarial training strategy and a spatial–spectral attention mechanism for RGB-to-hyperspectral image reconstruction [
14], improving denoising capabilities and enhancing feature representation. However, its generalization ability across different RGB images and scenes is poor, resulting in unstable performance. The Adaptive Weighted Attention Network (AWAN) [
21], consisting of 12 dual-residual attention blocks (DRABs) and a PSNL module, exhibits excellent adaptability to scene variation and can reconstruct high accuracy and visual effects when constrained by input illumination data. However, its reconstruction accuracy deteriorates under illumination changes. The Dual Hierarchical Regression Network (DHRNet) [
22] designs a shadow feature extraction module based on a dense structure and a reflection feature extraction module with an attention mechanism to reconstruct spectral information from reflection and shadow features in the presence of illumination variation. This network suppresses spectral distortion and enhances the clarity of hyperspectral image reconstruction. However, the attention mechanism overly focuses on local regions, impacting the accuracy of spectral reconstruction. The Correlation and Continuity Network (CCNet) [
23] solves the challenge of multi-scale fusion in RGB-to-hyperspectral reconstruction by designing spectral correlation modeling and neighborhood spectral continuity modules that balance local and global feature similarity and continuity. It also incorporates an adaptive fusion module to improve the complementarity between modules. However, this method overlooks the global context and non-local correlations of features, leading to a decline in reconstruction quality. The Wavelet-based Dual Transformer Model (WDTM) [
24] combines dual attention mechanisms with wavelet signal decomposition, capturing non-local spatial similarity and global spectral correlation, while improving computational efficiency. However, its robustness under different imaging conditions is limited. Supervised learning for RGB-to-hyperspectral reconstruction can better learn the mapping relationship between paired RGB and hyperspectral data, achieving high spectral precision and spatial resolution in reconstruction, making it suitable for complex scenes. However, this approach requires a large amount of accurately labeled RGB-HSI paired data, which is costly, and the reconstruction of hyperspectral images requires complex model structures and high computational resources, which can lead to model overfitting.
In hyperspectral image reconstruction, semi-supervised methods use a small amount of paired RGB and hyperspectral images, along with a large amount of unlabeled RGB images, reducing the computational resource and cost issues associated with supervised learning. Semi-supervised deep learning techniques [
25] employ RGB space reverse mapping to form a total variation regularized model for training a limited number of hyperspectral images, improving reconstruction accuracy. However, this method does not incorporate adversarial learning techniques, and the limited labeled data results in low robustness. Such methods can adapt to scene changes and improve the accuracy of hyperspectral image reconstruction by using a small amount of labeled data and a large amount of unlabeled data. However, they suffer from limitations due to assumptions about the data, and unlabeled data may contain noise. Additionally, insufficient labeled data can hinder the effective learning of complex features, impacting reconstruction performance.
Unsupervised RGB-to-hyperspectral image reconstruction does not rely on large amounts of manually labeled hyperspectral data, enabling training even when labeled data are scarce. The class-based backpropagation neural network (BPNN) [
26] uses an unsupervised clustering method to partition RGB and hyperspectral images into several pairs for training, and the trained BPNN reconstructs the final hyperspectral image from the classified RGB images. However, the BPNN algorithm has slow learning speeds and is prone to training failures. The Deep Residual Convolutional Neural Network (DRCNN) model [
13] similarly uses an unsupervised clustering method to establish a nonlinear spectral mapping between RGB and hyperspectral image pairs, demonstrating significant applicability and effectiveness. However, the proposed method requires training multiple DRCNNs, leading to inefficient networks and reduced reconstruction accuracy. A perceptual imaging degradation model using the camera spectral response function effectively addresses the low efficiency of single RGB-to-hyperspectral reconstruction [
27] while achieving high reconstruction accuracy. However, the model is limited by unknown RGB image degradation conditions in real-world settings, reducing the diversity of imaging. To mitigate the dependency of all HSI-SR models on RGB image degradation conditions, the camera response function (CRF) is used for RGB-to-hyperspectral image reconstruction [
28]. However, this model is constrained by the camera sensor, which can cause spectral distortion in RGB images. The Random Resonance Iterative Model based on Two Types of Illumination [
29] accurately approximates the gradient algorithm to solve the resonance iterative model, significantly improving both visual and quantitative evaluations. However, the algorithm’s solution process is complex, reducing the model’s robustness. The Deep Low-Rank Hyperspectral Image model (DLRHyIn) [
30] uses ℓ2 norm squared as the data fitting function, applying unsupervised high-order algorithms (DIP) and smooth low-rank regularization to improve network convergence speed. However, the method’s effectiveness is not guaranteed in the presence of outliers or sparse noise. The SkyGAN model based on the GAN framework [
31] combines adversarial distribution difference alignment and cycle consistency constraints in an unsupervised, unpaired manner to reconstruct domain-aware hyperspectral images from natural scene RGB, reducing training cycles. However, this model is susceptible to collapse during the learning process, and the generator may degrade, losing image details. The Unsupervised Spectral Super-Resolution Decomposition Guided Unsupervised Network (UnGUN) [
32] enables single RGB-to-hyperspectral reconstruction without paired images. This network includes two decomposition branches for RGB and hyperspectral images, along with a comprehensive reconstruction branch, ensuring the reconstructed image follows the features of real hyperspectral images. However, the decomposition and reconstruction processes require module adjustments and discriminator support, making the algorithm cumbersome. The Masked Transformer (MFormer) network [
18] uses a dual-frequency spectral self-attention (DSSA) module and a Multi-Head Attention Block (MAB) module for hyperspectral image reconstruction, capturing fine spectral features while enhancing network generalization and effectiveness. However, the masking reconstruction operation is computationally intensive and suffers from challenges in balancing spatial–spectral resolution.
As seen, compared to supervised and semi-supervised methods, unsupervised single RGB-to-hyperspectral image reconstruction offers advantages such as lower data labeling and collection costs, stronger generalization ability, and higher data utilization efficiency. However, it still faces challenges such as loss of detail, insufficient robustness, low reconstruction accuracy, and difficulty in balancing spatial and spectral resolution.
3. Methodology
A single RGB image is used as the input to the DMMAF model to obtain the hyperspectral image reconstructed under the unsupervised setting. The DMMAF network framework is illustrated in Figure 1. It consists of three main parts: (a) Reflection Dot-Product Adaptive Dual-Noise-Aware Feature Extraction (RDPADN), (b) Deformable Attention Dual-Gated Mamba Multi-Scale Feature Learning (DADGM), and (c) the Structure-Aware Smooth Constraint Loss Function. RDPADN is primarily used for fine edge feature extraction and adaptive noise processing of a single RGB image (Section 3.1). This module is composed of two Reflection Depth Point Feature Extraction (RDPFE) modules and an Adaptive Dual-Noise Masking (ADNM) module. It uses a reflection dot-product mechanism to highlight structural boundaries and suppress noise, allowing the model to retain fine spatial details during the subsequent reconstruction process. DADGM mainly consists of the dual encoding structures, the Attention-Mamba Layer (AMLayer) and the Double-Mamba Layer (DMLayer), and the decoding structure, the Attention-Feedforward Layer (AFLayer). DADGM accepts the refined features of the RGB image, which, based on the even-odd indexing of the encoder layers, enter the AMLayer and DMLayer to perform multi-scale feature downsampling and upsampling. Finally, the features are fused in the AFLayer to achieve local and global association modeling (Section 3.2). The Structure-Aware Smooth Constraint Loss Function module calculates the loss according to the number of iterations (Section 3.3). This loss function encourages spatial smoothness, maintains structural boundaries, and enhances spectral consistency without relying on paired RGB-HSI labels. If the preset number of epochs has not been reached, the loss calculation returns to the encoder index layer for further processing through the dual-branch encoder, creating a cyclic pattern that optimizes model performance and improves reconstruction accuracy. Once the epoch number is met, the final reconstructed hyperspectral image is output.
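To make this processing flow concrete, the following minimal PyTorch-style sketch mirrors our reading of Figure 1: an RDPADN-like front end, a DADGM-like backbone, and an unsupervised structure-aware loss evaluated at every epoch. The module bodies are simple stand-ins (plain convolutions and a first-order smoothness term), and all class and function names are illustrative rather than the authors' implementation.

```python
# Minimal sketch of the DMMAF training loop (illustrative stand-ins, not the authors' code).
import torch
import torch.nn as nn

class DMMAFSketch(nn.Module):
    def __init__(self, in_ch=3, out_ch=31):
        super().__init__()
        self.rdpadn = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # stand-in for RDPADN (Section 3.1)
        self.dadgm = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # stand-in for DADGM (Section 3.2)

    def forward(self, rgb):
        feat = self.rdpadn(rgb)   # edge-aware, noise-masked features
        return self.dadgm(feat)   # multi-scale local/global modeling -> 31-band HSI

def unsupervised_loss(hsi):
    # placeholder for the structure-aware smooth loss of Section 3.3
    dx = (hsi[..., :, 1:] - hsi[..., :, :-1]).abs().mean()
    dy = (hsi[..., 1:, :] - hsi[..., :-1, :]).abs().mean()
    return dx + dy

model = DMMAFSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
rgb = torch.rand(1, 3, 64, 64)            # a single RGB input image
for epoch in range(100):                  # iterate until the preset epoch number is reached
    loss = unsupervised_loss(model(rgb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```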
3.1. Reflection Dot-Product Adaptive Dual-Noise-Aware Feature Extraction
In the hyperspectral image reconstruction task, feature extraction from the RGB image is a crucial step that enables the model to recover key spectral channels ranging from dozens to hundreds using limited visible light information. This feature extraction forms the foundation for the effectiveness of the entire reconstruction task. However, existing feature extraction methods suffer from issues such as high convolution overhead, loss of edge information, coupling of spatial and channel information, and weak representation of key regional features [
33,
34]. To address these challenges, this paper proposes a Reflection Dot-product Adaptive Dual-noise-aware Feature Extraction method.
The Reflection Dot-Product Adaptive Dual-Noise-Aware Feature Extraction method consists of two Reflection Depth Point Feature Extraction (RDPFE) modules and an Adaptive Dual-Noise Masking (ADNM) module. The main implementation process is shown in Figure 2. The features of the RGB image are first input into an RDPFE submodule to obtain shallow features; at this stage, the feature-map channels are expanded from 3 to 31. The shallow features are then processed by the Adaptive Dual-Noise Masking (ADNM) module to generate adaptive mask features, with the number of channels remaining unchanged. These masked features are subsequently passed through the second RDPFE module to extract multi-channel refined features. Clearly, the method primarily involves the reflection depth dot-product and the adaptive dual-noise masking components.
3.1.1. Reflection Depth Point Feature Extraction
The foundation of single RGB image reconstruction of hyperspectral images lies in the extraction of features from the RGB image. This is primarily achieved using Convolutional Neural Networks (CNNs) to extract local spatial features or by introducing Transformer models to model long-range dependencies and capture global contextual information. However, CNN models [
35] struggle to capture spectral dependencies across channels, while Transformer-based models incur high computational costs [
36], leading to overfitting or training instability. To address these challenges, this paper proposes Reflection Depth Point Feature Extraction (RDPFE), which enhances the incorporation of edge spatial information using reflection padding and depth-wise separable convolution techniques. This method effectively resolves spectral dependencies across channels while reducing computational overhead.
Reflection Depth Point Feature Extraction (RDPFE) is primarily utilized for feature extraction and channel dimension transformation in images. It mainly consists of reflection padding, depth-wise convolution, and point-wise convolution processes, as illustrated in
Figure 3.
- (1)
Reflection Padding
In image processing tasks, edge pixels often contain important information such as object contours. Reflection padding extends the boundaries by symmetrically copying the edge pixels of the input tensor before convolution, while dynamically adjusting based on the dilation rate of the convolution to reduce boundary artifacts introduced by the convolution. Specifically, after preprocessing a single RGB image, the resulting feature map is subjected to reflection padding, and the padded features can be represented by Equation (1). In Equation (1), the padding width is determined by the size of the padding convolution kernel and the dilation coefficient. The padding operation does not introduce additional invalid values (such as 0); instead, it utilizes the reflected information of the edge pixels, effectively reducing the generation of edge artifacts in the image.
- (2)
Depth Convolution
The depth-wise convolution independently convolves the reflection-padded features across the input channels with a fixed kernel size, and the processed three-dimensional features can be represented by Equation (2). Compared to standard convolutions that mix all input channels, depth-wise convolution significantly improves computational efficiency while retaining spatial features, making it suitable for constructing lightweight models.
- (3)
Point-Wise Convolution
Point-wise convolution uses 31 1 × 1 convolution kernels to simultaneously perform convolution operations across the channels of the depth-wise output, producing the 31-channel output features, as shown in Equation (3). The point-wise convolution linearly combines all input channels at each pixel position of the depth-wise convolution's output.
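As a concrete illustration of this three-step pipeline, the sketch below composes reflection padding, a depth-wise convolution, and a 31-kernel point-wise convolution in PyTorch. The kernel size and dilation values are illustrative assumptions, and the class is a reading of Equations (1)-(3) rather than the published implementation.

```python
import torch
import torch.nn as nn

class RDPFE(nn.Module):
    """Sketch of Reflection Depth Point Feature Extraction: reflection padding ->
    depth-wise convolution -> point-wise convolution. Kernel size and dilation are
    illustrative assumptions."""
    def __init__(self, in_ch=3, out_ch=31, k=3, dilation=1):
        super().__init__()
        pad = dilation * (k - 1) // 2            # padding width from kernel size and dilation
        self.reflect = nn.ReflectionPad2d(pad)   # Eq. (1): mirrored edge pixels, no zero values
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, dilation=dilation,
                                   groups=in_ch, bias=False)      # Eq. (2): one kernel per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # Eq. (3): 31 1x1 kernels

    def forward(self, x):
        x = self.reflect(x)        # extend borders with reflected edge pixels
        x = self.depthwise(x)      # spatial filtering, channels kept separate
        return self.pointwise(x)   # linear mix of channels at every pixel -> 31 channels

feat = RDPFE()(torch.rand(1, 3, 64, 64))   # -> torch.Size([1, 31, 64, 64])
```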
3.1.2. Adaptive Dual-Noise Masking
The masking operation enhances the bandwidth correlation of features and the contextual relationships between features, thereby extracting deeper features from the image. Traditional masking modules, such as MAE (Masked Autoencoders) [
34] work by masking a portion of the image patches and utilizing an encoder–decoder structure to learn and reconstruct the masked parts. However, traditional MAE is often static, lacking noise-awareness and context adaptation capabilities, resulting in suboptimal efficiency when processing complex hyperspectral images. To address this issue, ADNM is proposed. It primarily includes three core steps: spatial masking, channel perturbation, and adaptive masking, as shown in
Figure 4.
- (1)
Spatial Masking
➀ The spatial masking module receives the features initially processed by RDPFE. The size of the feature map is B × C × H × W, where B is the batch size, C is the number of channels, and H and W are the height and width of the feature map. The features are flattened into vectors and then reshaped to derive the spatial feature, as expressed in Equation (4).
➁ A random matrix is generated that assigns a random value to each pixel position of the feature, as shown in Equation (5). Each random value is compared with the masking threshold to classify the pixel and obtain a new binary mask matrix, as shown in Equation (6), where the indices denote the column and row coordinates of the pixel. When the random value is greater than the threshold, the pixel is masked; otherwise, it is retained.
➂ At the pixel positions where the binary mask matrix takes the value 1, noise is introduced and applied to the feature to obtain the spatially masked feature, as shown in Equation (7), where 0.1 represents the intensity of the noise.
- (2)
Channel Perturbation
Building upon the random spatial masking, the ADNM module further introduces a channel-level masking strategy to perform channel perturbation. A certain proportion of channels is selected from the spatially masked feature, and noise is added to all pixel values of these channels. The channel masking ratio determines the proportion of randomly selected channels, σ denotes the noise intensity, and the noise is drawn from a standard normal distribution. Therefore, after the noise is added, the channel-perturbed feature can be expressed as shown in Equation (8).
- (3)
Adaptive Masking
The model dynamically updates the spatial and channel masking ratios based on the number of iterations, repeating steps ➀ and ➁ below until the set number of epochs is reached, at which point the masked feature is output. Within the iterative loop, the adaptive masking module therefore controls the proportion of features to be masked in the spatial masking module and the number of channels to be masked in the channel perturbation. The overall masking ratio of the ADNM module increases linearly, preserving more features in the early stages of training and gradually increasing the masking intensity in the later stages. This approach helps improve the model's stability and convergence speed under high masking ratios.
➀ The spatial masking ratio is updated iteratively so that its value increases with the number of training iterations, as represented by Equation (9). Here, the initial masking ratio is set to 0.3, the adjustment coefficient has a default value of 0.5, and the update depends on the current training epoch and the total number of training epochs.
➁ The channel masking ratio is updated in the same way, with its value increasing with the number of training iterations, as represented by Equation (10). In this case, the initial masking ratio is set to 0.2 and the adjustment coefficient has a default value of 0.2.
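A compact sketch of how these pieces can fit together is given below. The linear ratio schedules and the masking-threshold convention are our assumptions about Equations (9), (10), and (6) (initial ratio plus adjustment coefficient scaled by training progress); only the 0.1 spatial noise intensity and the initial ratios and coefficients are taken from the text, and all function names are illustrative.

```python
import torch

def adaptive_ratios(epoch, total_epochs, r_s0=0.3, a_s=0.5, r_c0=0.2, a_c=0.2):
    # Assumed linear schedules for Eqs. (9)-(10): initial ratio plus adjustment
    # coefficient scaled by training progress.
    t = epoch / max(total_epochs, 1)
    return r_s0 + a_s * t, r_c0 + a_c * t

def adnm(x, spatial_ratio, channel_ratio, sigma=0.1):
    """Sketch of Adaptive Dual-Noise Masking on a (B, C, H, W) feature map."""
    B, C, H, W = x.shape
    # (1) spatial masking (Eqs. 4-7): random values per pixel, binary mask, scaled noise.
    # Assumed convention: roughly a `spatial_ratio` fraction of pixels receives mask value 1.
    rand = torch.rand(B, 1, H, W)
    mask = (rand > (1.0 - spatial_ratio)).float()
    x = x + 0.1 * mask * torch.randn_like(x)     # 0.1 is the stated spatial noise intensity
    # (2) channel perturbation (Eq. 8): Gaussian noise on a random subset of channels
    n_pert = max(1, int(channel_ratio * C))
    idx = torch.randperm(C)[:n_pert]
    x[:, idx] = x[:, idx] + sigma * torch.randn_like(x[:, idx])
    return x

rho_s, rho_c = adaptive_ratios(epoch=50, total_epochs=100)
out = adnm(torch.rand(2, 31, 64, 64), rho_s, rho_c)
```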
3.2. Deformable Attention Dual-Gated Mamba Multi-Scale Feature Learning
To address the issues of insufficient long-range dependency modeling, inefficient multi-scale feature fusion, and low interaction efficiency between local and global information in traditional CNNs and single Transformer architectures, a Deformable Attention Dual-Gated Mamba Multi-Scale Feature Learning method has been designed. The process of this method is illustrated in Figure 5. The multi-channel refined feature is divided into two feature streams based on the odd–even indexing of the stacked encoder layers, and the two streams are processed separately by the AMLayer and the DMLayer.
The distinction between these two processing structures lies in the use of different encoders: Encoder A in the AMLayer employs a combination of the attention mechanism and Mamba processing, while Encoder B in the DMLayer utilizes a dual Mamba processing scheme. In the AMLayer, the feature stream is constrained by the Coder Depth indexing layer to form the even-layer feature, which then enters Encoder A and undergoes the first downsampling operation via the SElayer. The downsampled result subsequently passes through Encoder A a second time, followed by the second downsampling and the first upsampling operations. The upsampled result is then processed by Encoder A a third time, followed by the second upsampling, yielding the output of this branch. The processing procedure in the DMLayer is identical to that of the AMLayer. Both the AMLayer and DMLayer processing paths therefore follow a down-down-up-up sampling pattern; the pseudocode for this sampling path algorithm is provided in Table A1, and a simplified sketch of the path is given below. The feature processed by this layer is fused with the feature from the parallel branch, and the fused feature then enters the RDPFE module for fine-tuning of the image features. Subsequently, it passes through the two decoders that form the AFLayer (Attention-Feedforward Layer) and then re-enters the RDPFE (as described in Section 3.1.1) to output the final feature.
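The following few lines sketch that shared down-down-up-up path under our reading of the description above; the encoder block and the SE-based down/upsampling operators are replaced by generic stand-ins, so the snippet illustrates only the ordering of operations, not the actual AMLayer/DMLayer implementation.

```python
import torch
import torch.nn as nn

def multiscale_path(x, encoder, down, up):
    """Down-down-up-up sampling path shared by the AMLayer and DMLayer branches."""
    x = down(encoder(x))        # first encoder pass + first downsampling
    x = up(down(encoder(x)))    # second encoder pass, second downsampling, first upsampling
    return up(encoder(x))       # third encoder pass + second upsampling

encoder = nn.Conv2d(31, 31, 3, padding=1)       # stand-in for Encoder A / Encoder B
down = nn.AvgPool2d(2)                          # stand-in for the SElayer downsampling
up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
out = multiscale_path(torch.rand(1, 31, 64, 64), encoder, down, up)   # -> (1, 31, 64, 64)
```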
The Encoders and Decoder of the AMLayer, DMLayer, and AFLayer are distinguished by their internal structural compositions and are denoted Encoder A, Encoder B, and Decoder, as shown in Figure 6. In Encoder A, the input image feature undergoes PreNorm normalization and is then processed by the Deformable Attention (DA) submodule and the Dual-Gated Mamba (DGM) submodule for dynamic global feature modeling, local feature modeling, and fusion. The fused feature is processed by the Feedforward Network (FFN) and then added to the input feature through a residual connection before being output. In Encoder B, the input feature undergoes PreNorm normalization, passes through two DGM submodules whose outputs are concatenated, and is then processed by the FFN; the output is combined with the original feature through a residual connection. In the Decoder, the feature likewise undergoes PreNorm normalization, passes through two DA submodules whose outputs are concatenated, and is then processed by the FFN; the output is combined with the original feature through a residual connection. It can be seen that DA and DGM are the core modules of both the encoders and the decoder; therefore, these two modules are introduced in detail in the following sections.
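The sketch below illustrates the Encoder A pattern (PreNorm, fused DA and DGM branches, FFN, residual connection) on a token sequence. The DA and DGM submodules are replaced by simple linear layers here, since their actual structures are given in Sections 3.2.1 and 3.2.2, and the hidden-dimension choices are assumptions.

```python
import torch
import torch.nn as nn

class EncoderASketch(nn.Module):
    """Encoder A pattern: PreNorm -> DA and DGM branches fused -> FFN -> residual.
    The `da` and `dgm` members are placeholders for the real submodules."""
    def __init__(self, dim=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.da = nn.Linear(dim, dim)     # placeholder for Deformable Attention (global)
        self.dgm = nn.Linear(dim, dim)    # placeholder for Dual-Gated Mamba (local)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                 # x: (B, N, C) token sequence
        h = self.norm(x)
        fused = self.da(h) + self.dgm(h)  # fuse global and local feature modeling
        return x + self.ffn(fused)        # residual connection back to the input

tokens = torch.rand(2, 64 * 64, 31)
out = EncoderASketch()(tokens)
# Encoder B replaces (da, dgm) with two DGM branches whose outputs are concatenated;
# the Decoder replaces them with two DA branches, following the same PreNorm/FFN/residual pattern.
```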
3.2.1. Deformable Attention Global Feature Extraction
The Deformable Attention (DA) submodule dynamically allocates weights to different parts of the input, enabling focused attention on important information and the effective integration of contextual details. The structure of the DA module is illustrated in Figure 7, and its specific technical details can be divided into the following five steps:
➀ Construct multi-head vectors. The input feature undergoes a linear transformation through a linear layer to construct the multi-head query vectors Q, key vectors K, and value vectors V, as shown in Equation (11), whose dimensions are determined by the batch size, the total number of spatial positions, the number of attention heads, and the dimensionality of each head.
➁ Learn spatial offsets and generate sampling grids. The feature undergoes a Conv2d convolution operation to predict the two-dimensional offset for each spatial position of its multi-head attention; the offsets can be expressed as shown in Equation (12). The base sampling grid coordinates are updated by adding the spatial offsets, resulting in the deformed sampling grid, as shown in Equation (13).
➂ Key-value sampling on the sampling grid. Based on the constructed sampling grid, K and V are resampled using bilinear interpolation (BI) to obtain the sampled keys and values, as shown in Equations (14) and (15).
➃ Attention weight calculation. The product of Q and the transpose of the sampled keys is processed with Softmax to obtain the attention weights, as shown in Equation (16).
➄ Attention-weighted output. The product of the attention weights and the sampled values yields the attended feature, as shown in Equation (17). After Dropout processing, this feature is passed through Layer Scale and a projection layer to obtain the final output feature, as shown in Equation (18).
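The following self-contained sketch walks through the same five steps in PyTorch. It is a simplified reading of Equations (11)-(18): a single attention head and a single offset field are used for brevity, the 1/√C scaling inside the Softmax is an added numerical-stability assumption, and none of the layer sizes are taken from the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Single-head sketch of the DA submodule: project Q/K/V, predict 2-D offsets with a
    Conv2d, bilinearly resample K/V on the deformed grid, then apply softmax attention."""
    def __init__(self, dim=31, drop=0.1):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3)                # Eq. (11): query/key/value vectors
        self.offset_conv = nn.Conv2d(dim, 2, 3, padding=1)   # Eq. (12): (dx, dy) per position
        self.proj = nn.Linear(dim, dim)                      # projection layer, Eq. (18)
        self.drop = nn.Dropout(drop)
        self.layer_scale = nn.Parameter(1e-2 * torch.ones(dim))

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.to_qkv(x.flatten(2).transpose(1, 2)).chunk(3, dim=-1)   # each (B, HW, C)

        # Eq. (13): base grid in [-1, 1] plus learned offsets -> deformed sampling grid
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).to(x).expand(B, H, W, 2)
        grid = (base + self.offset_conv(x).permute(0, 2, 3, 1)).clamp(-1, 1)

        # Eqs. (14)-(15): bilinear resampling of K and V on the deformed grid
        k_s = F.grid_sample(k.transpose(1, 2).reshape(B, C, H, W), grid,
                            mode="bilinear", align_corners=True).flatten(2).transpose(1, 2)
        v_s = F.grid_sample(v.transpose(1, 2).reshape(B, C, H, W), grid,
                            mode="bilinear", align_corners=True).flatten(2).transpose(1, 2)

        attn = torch.softmax(q @ k_s.transpose(1, 2) / C ** 0.5, dim=-1)   # Eq. (16)
        out = attn @ v_s                                                   # Eq. (17)
        return self.proj(self.drop(out) * self.layer_scale)                # Eq. (18)

y = DeformableAttentionSketch()(torch.rand(1, 31, 16, 16))   # -> (1, 256, 31)
```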
3.2.2. Dual-Gated Mamba Local Feature Extraction
Due to the high computational and memory overhead, as well as the lack of precise control over long-range dependency modeling in single attention feature extraction mechanisms, a local precise feature modeling method based on a gated state-space model is proposed, as shown in Figure 8.
The specific details of the Dual-Gated Mamba module consist of the following three key steps.
➀ Input-level gated adjustment. This operation applies dynamic channel-level weighting to the input features, suppressing noise or irrelevant features and highlighting important information. The input is flattened into a sequential format and mapped through a linear layer to obtain two projection matrices. The projected feature values are obtained by multiplying the sequence with the first projection matrix, as shown in Equation (19). The sequence length is equal to the height × width of the image, with each pixel becoming one time step of the feature sequence; in this way, the two-dimensional spatial information is linearized into a one-dimensional feature sequence for temporal modeling. After the sequence is multiplied by the second projection matrix, the spatial gate value is generated through the SiLU function, as shown in Equation (20). The projected feature values are element-wise multiplied by the spatial gate value to produce the input-level gated conditional feature, as shown in Equation (21). This operation applies a weighting to the projected features: as shown in Equation (21), the larger the gate value, the more of the projected feature is preserved during the dot-product operation and the larger the gated feature becomes; conversely, as the gate value tends to 0, the gated feature becomes smaller.
➁ Mamba core discretized state convolution. The input-level gated feature is transformed into a convolutional format and padded to obtain the convolution-ready feature, as shown in Equation (22). After padding, the feature undergoes a 1D convolution on each channel, resulting in the convolved feature, as shown in Equation (23). The state size refers to the feature dimension that the model processes at each time step (or position); in DGM, the feature vector of each pixel has a dimension equal to the number of channels, and after the flatten operation the state at each pixel position is a feature vector of that size. In the state convolution (SSM), multiple convolution kernels are element-wise multiplied with the convolved feature and summed, and the final state feature is given by Equation (24).
➂ Output-level gated adjustment. The input feature undergoes LayerNorm normalization and is then passed through a linear layer to generate the gate projection matrix. The product of these two is activated by the SiLU function to produce the independent-branch channel gate value, as shown in Equation (25). The channel gate value is element-wise multiplied with the state feature, and after a dropout operation the final output is obtained, as represented by Equation (26).
The channel gate value and the state-space convolution feature thus collaboratively adjust to generate the output. As shown in Equations (19)–(25), the spatial gate and the channel gate modulate the input features along the spatial and channel dimensions, respectively.
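A runnable, simplified sketch of these three steps is given below. It replaces the Mamba selective-scan recurrence with a padded depth-wise 1-D convolution so the snippet stays short, and all layer sizes and names are assumptions rather than the published DGM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualGatedMambaSketch(nn.Module):
    """Sketch of the DGM submodule: input-level spatial gating (Eqs. 19-21), a depth-wise
    1-D convolution standing in for the Mamba state convolution (Eqs. 22-24), and
    output-level channel gating (Eqs. 25-26)."""
    def __init__(self, dim=31, k=4, drop=0.1):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)      # two projection branches from one linear layer
        self.conv1d = nn.Conv1d(dim, dim, k, padding=k - 1, groups=dim)  # per-channel 1-D conv
        self.norm = nn.LayerNorm(dim)
        self.gate_proj = nn.Linear(dim, dim)        # channel-gate projection
        self.drop = nn.Dropout(drop)

    def forward(self, x):                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)          # (B, L, C): each pixel is one time step
        feat, gate = self.in_proj(seq).chunk(2, dim=-1)
        z = feat * F.silu(gate)                     # input-level gated conditional feature

        # state convolution: pad, convolve each channel, trim back to sequence length L
        h = self.conv1d(z.transpose(1, 2))[..., : H * W].transpose(1, 2)

        g = F.silu(self.gate_proj(self.norm(seq)))  # output-level channel gate
        return self.drop(g * h)                     # gated output sequence, (B, L, C)

out = DualGatedMambaSketch()(torch.rand(1, 31, 16, 16))   # -> (1, 256, 31)
```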
3.3. Structure-Aware Smooth Loss Function
Given that current methods' loss functions often overlook structural information in images, lack spatial structure constraints, and suffer from insufficient supervisory guidance of the attention mechanism, this paper designs a structure-aware smooth loss function, which consists of a smoothness loss, a curvature loss, and an attention supervision loss, as shown in Equation (27), where the three terms are weighted by corresponding weight parameters.
- (1)
Smoothness Loss
Smoothness loss [37] aims to encourage the model's output to maintain continuous variation in the local space, preventing the occurrence of high-frequency noise or unnatural discontinuities, as shown in Equation (28). In Equation (28), the loss is normalized by the total number of pixels and penalizes abrupt changes by measuring the gradient differences in the horizontal and vertical directions of the output image, indexed by the row and column coordinates of each pixel, thus promoting texture consistency in the result. The smoothness loss function encourages local continuity and mitigates high-frequency noise by imposing a constraint on the first-order gradient (neighboring-pixel differences) of the reconstructed image. Its physical interpretation is that, in natural scenes, the spectra of a given object surface or of adjacent regions generally exhibit a smooth spatial transition. Penalizing the first-order difference is conceptually equivalent to imposing a prior on the low-order statistics of the data, thereby inhibiting the decoder from producing physically implausible oscillations or noise, even in the absence of external labels.
- (2)
Curvature Loss
Curvature loss [38] further constrains the smoothness of the image at the level of second-order derivatives. It is primarily used to suppress local “sharp deformations” or “spike-like structures” in the reconstructed image, as shown in Equation (29). This loss function helps create a more natural and smooth transition at structural boundaries in the output image by measuring the differences between each pixel and its four neighboring pixels. Curvature loss penalizes the second-order difference, or local curvature anomalies, suppressing abrupt local variations and ensuring natural transitions at boundaries rather than discontinuous jumps. Compared to the smoothness term, the curvature term provides stronger geometric regularization with respect to edge shapes and geometric consistency, thereby aiding the network in maintaining structural fidelity at boundaries without relying on pixel-level labels.
- (3)
Attention Supervision Loss
To enhance the model’s ability to focus on structural information, an unsupervised attention supervision loss mechanism is introduced. This mechanism guides the model to learn from key regions, thereby assigning higher attention weights to them. The loss function is represented by Equation (30), in which the attention matrix is obtained by weighting the attention weights generated by the DA submodule, and the mask matrix is generated by the spatial masking of the ADNM submodule (for detailed structures, see Section 3.1.2 and Section 3.2.1). The attention mask and the attention map are both derived from the network module’s own computations. The attention loss is based on a consistency constraint between the attention map generated internally by the network and the spatial mask output by the ADNM module. It guides the network to focus its learning resources (parameter updates) on regions with rich structural information or significant spectral differences.
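To make the three terms concrete, the sketch below implements one plausible form of each under stated assumptions: absolute first-order differences for Equation (28), a discrete 4-neighbor Laplacian penalty for Equation (29), a mean-squared consistency term between the DA attention map and the ADNM spatial mask for Equation (30), and equal unit weights for the combination in Equation (27). The exact functional forms and weights used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(y):
    # Eq. (28): mean absolute first-order differences in horizontal and vertical directions
    dh = (y[..., :, 1:] - y[..., :, :-1]).abs().mean()
    dv = (y[..., 1:, :] - y[..., :-1, :]).abs().mean()
    return dh + dv

def curvature_loss(y):
    # Eq. (29): penalize the discrete Laplacian (difference between a pixel and its 4 neighbors)
    lap = (y[..., 1:-1, 2:] + y[..., 1:-1, :-2] + y[..., 2:, 1:-1] + y[..., :-2, 1:-1]
           - 4.0 * y[..., 1:-1, 1:-1])
    return lap.abs().mean()

def attention_supervision_loss(attn_map, adnm_mask):
    # Eq. (30): consistency between the DA attention map and the ADNM spatial mask
    return F.mse_loss(attn_map, adnm_mask)

def structure_aware_loss(y, attn_map, adnm_mask, w=(1.0, 1.0, 1.0)):
    # Eq. (27): weighted sum of the three terms; the weights here are illustrative placeholders
    return (w[0] * smoothness_loss(y)
            + w[1] * curvature_loss(y)
            + w[2] * attention_supervision_loss(attn_map, adnm_mask))

hsi = torch.rand(1, 31, 64, 64, requires_grad=True)
attn = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
structure_aware_loss(hsi, attn, mask).backward()
```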
5. Discussion
This study presents the Dual-Gated Mamba Multi-Scale Adaptive Feature Learning (DMMAF) network, which demonstrates exceptional performance in unsupervised hyperspectral image reconstruction from a single RGB image, particularly in terms of image naturalness, detail accuracy, and model robustness, showing significant improvements over existing methods.
Our proposed DMMAF utilizes an unsupervised learning approach, achieving hyperspectral image reconstruction without relying on labeled information. The loss function in this study consists of three parts: smoothness loss (Equation (28)), curvature loss (Equation (29)), and attention supervision loss (Equation (30)). These losses do not provide training signals through pixel-level comparisons between the reconstructed HSI and the true HSI or RGB images, but rather they provide self-constraints to the network based on low-order statistics and structural consistency of the reconstructed results. The smoothness/curvature terms penalize low-order statistical biases and second-order geometric anomalies, acting as “weak-label” priors (without using any external labels). The attention term further reduces the uncertainty in self-supervised training by enhancing the internal consistency of the representations. These unsupervised constraints collectively replace the pixel-level guidance provided by explicit labels, enabling the model to learn stable and physically meaningful spectral mappings without relying on paired HSI labels.
In terms of visual image reconstruction quality, we performed a cross-method comparative analysis. The DMMAF method demonstrates outstanding performance in both image naturalness and detail accuracy, outperforming four advanced state-of-the-art (SOTA) methods: HRNet [42], AWAN [43], MFormer [18], and GMSR [44]. DMMAF designed a feature extraction method combining RDPFE and ADNM, which improves edge detail richness and robustness in the reconstructed image, as shown in
Figure 12. This is due to the RDPFE module, which combines depth convolution for spatial and channel adjustment with pixel-level convolution, effectively reducing edge artifacts and retaining edge details closer to the true image. The ADNM module’s adaptive masking strategy allows the model to autonomously learn relevant features, leading to smoother color transitions in these regions and better matching the true image. DMMAF also constructs the DA and DGM modules for local–global feature association modeling, which improves the attention to key details and accuracy in the reconstructed image, as shown in
Figure 13. This is mainly due to the DA module’s flexibility in capturing attention areas and adapting to object shapes, enhancing the global detail integrity of the reconstructed image. The DGM module’s precise control of long-range dependency modeling improves the accuracy of local feature representations. Furthermore, DMMAF uses a structure-aware smoothness loss function to balance spectral and spatial resolution, as shown in
Figure 14. This is because the structure-aware smoothness loss function effectively addresses multiple issues, including information neglect, insufficient spatial structure constraints, and lack of attention supervision. As a result, the reconstructed image shows smoother transitions in structure and color, along with more detailed features. Compared to the true hyperspectral image, the reconstructed images from DMMAF show better performance in terms of certain details and boundaries, although there is still a slight gap. This is mainly related to the dataset quality, input image resolution, and prior assumptions during model training. Nevertheless, DMMAF’s reconstructed images demonstrate high spectral consistency and structural fidelity in most scenarios, making it an effective approximation of hyperspectral images. The overall visualization comparison results indicate that DMMAF excels in recovering object boundaries, details, and textures, demonstrating good practicality. In applications such as remote sensing imaging, precision irrigation in agriculture, and ecological environment monitoring, hyperspectral images provide rich spectral information, helping researchers obtain more accurate surface reflectance data. The outputs from DMMAF can provide useful hyperspectral data for these application scenarios without the need for expensive hyperspectral sensors, increasing the practicality of this method.
DMMAF also demonstrates the highest reconstruction accuracy through quantitative comparisons with HRNet, AWAN, MFormer, and GMSR, as shown in
Table 1. DMMAF outperforms state-of-the-art unsupervised hyperspectral reconstruction algorithms across all three datasets, effectively recovering details and maintaining high structural consistency. While the AWAN method shows good performance in hyperspectral image reconstruction, it has certain flaws in detail recovery and noise suppression. In contrast, DMMAF compensates for information loss and suppresses noise through the ADNM module’s masking mechanism, improving model robustness. HRNet, based on a high-resolution network, excels in processing high-resolution images, but its performance in balancing spectral information and detail recovery is limited by its model structure. DMMAF addresses this issue with the structure-aware smoothness loss function, achieving better results in spectral texture and structural fidelity. The GMSR method, based on image restoration technology, performs well in image quality but lacks global information modeling capability. DMMAF, under the influence of the DA module, outperforms GMSR in this regard, significantly improving image reconstruction accuracy. MFormer, a transformer-based model, has strong global information modeling capability, but may suffer from information loss when handling local details and complex scenes. DMMAF overcomes this limitation through deep dot-product feature extraction and double Mamba feature extraction, showing particular advantages in complex boundaries. However, in this study, only the NTIRE2020, Harvard, and CAVE datasets with 31 similar bands were used. In the future, we plan to use datasets with different bands, especially for applications in remote sensing and medical imaging, to enhance the model’s adaptability and portability. Additionally, the current evaluation metrics are primarily based on PSNR, RMSE, and MRAE, without involving additional metrics such as SAM and SSIM, which could offer a more objective and comprehensive evaluation of the algorithm’s effectiveness.
This study conducted two ablation experiments: (1) eliminating the ADNM, DA, and DGM modules from the complete model, as shown in
Table 2, and (2) sequentially adding the ADNM, DA, and DGM modules to the baseline model, as shown in
Table 3. The data from
Table 2 shows that when the ADNM module’s real-time masking function is removed, the model’s ability to handle noise is reduced. When the DA and DGM submodules are removed, the network’s ability to capture local features and build global features is compromised. These effects are visually demonstrated in the results shown in
Figure 16, indicating the indispensable role of the DGM, DA, and ADNM modules in capturing local detail information, handling noise, and utilizing information interactions. The experimental quantification results in
Table 3 clearly show that the sequential addition of each module significantly enhances the overall model performance, further proving the importance of each module in improving hyperspectral image reconstruction quality, particularly in detail preservation, denoising, and global feature modeling. However, ablation experiments across datasets have not yet been conducted, so it is uncertain whether the approach is effective across different data distributions.
DMMAF conducted two experiments: one with epoch-by-epoch loss training and the other with the masking ratio varying epoch by epoch, as detailed in
Table 4 and
Table 5. Due to the structure-aware smoothness loss function, DMMAF exhibits relatively stable performance during most epochs, as shown in
Figure 17. When the training epochs are too few (less than 100), the model cannot fully understand the mapping between images, leading to poor reconstruction performance. When the training epochs are too many (greater than 100), overfitting occurs, resulting in poor generalization. In the visual results shown in
Figure 18, the model reaches its highest performance at epoch = 100. This is because, under the influence of the ADNM module, the model’s masking ratio increases as training progresses, forcing the model to repeatedly relearn spatial and channel structure information adaptively.
6. Conclusions
This study presents the DMMAF network framework for unsupervised hyperspectral image (HSI) reconstruction from a single RGB image. The main innovation of DMMAF lies in its dual-gated Mamba multi-scale adaptive feature learning paradigm, which tightly integrates a noise-aware edge detail extractor based on RDPADN, a deformable attention dual-gated Mamba module for joint local–global modeling, and a structure-aware smoothness loss function that provides unsupervised spatial and spectral guidance. DMMAF enhances the preservation of spatial structure and high-frequency spectral details and improves robustness to noise and illumination changes through adaptive masking and dual-gated feature modulation while maintaining reasonable computational complexity. This results in superior reconstruction accuracy compared to existing state-of-the-art supervised and unsupervised methods. Compared to the best-performing unsupervised algorithms, DMMAF improves PSNR, RMSE, and MRAE by 0.15%, 5.0%, and 2.2%, respectively, on the NTIRE2020 dataset. On the Harvard and CAVE datasets, DMMAF improves PSNR and RMSE by 0.7%, 3.8% and 1.7%, 2.4%, respectively. Extensive experimental results, both numerical and visual, demonstrate that our algorithm effectively addresses issues such as detail loss, insufficient robustness, low reconstruction accuracy, and the difficulty of achieving a balance between spatial and spectral resolution, highlighting the superiority and practical application potential of DMMAF. In the future, we will expand the evaluation metrics, use datasets with different bands, and conduct cross-dataset experiments to further demonstrate the effectiveness and generalization of the DMMAF algorithm.