1. Introduction
Hyperspectral imaging is a frontier technology in optical sensing. By integrating imaging and spectroscopy, it performs dense narrowband sampling of spectral signatures to simultaneously capture 2D spatial information and high-dimensional spectral information of a target. Compared with conventional three-band RGB images, hyperspectral images (HSIs) contain substantially richer material and fine-grained details, and have been widely used in many critical fields, such as remote sensing [
1,
2,
3], medical image processing [
4,
5], food safety [
6,
7], environmental monitoring [
8,
9], and geological exploration [
10,
11]. HSI is also increasingly leveraged in downstream perception tasks, e.g., hyperspectral video tracking [
12,
13,
14,
15] and hyperspectral anomaly detection [
16,
17,
18], as well as adverse-weather sensing with near-infrared multi-/tri-spectral imaging [
19,
20,
21]. However, traditional hyperspectral imaging systems often rely on bulky spectrometers and perform scanning along the spatial and/or spectral dimension, which makes acquisition time-consuming and the hardware large in size. These limitations severely hinder their adoption in dynamic scenes and in portable or low-cost applications. To address these challenges, researchers have developed various lightweight hyperspectral acquisition solutions, including on-chip spectral imaging [
22], spectral encoders based on nanophotonic structures or metasurfaces [
23], and snapshot compressive imaging (SCI) systems [
24,
25]. Despite progress in hardware miniaturization, many of these techniques remain at the laboratory-prototype stage. They are often constrained by fabrication processes, optical efficiency, or system stability, and thus are difficult to deploy broadly in real-world scenarios. Moreover, even relatively low-cost SCI systems typically cost on the order of tens of thousands to one hundred thousand dollars, which further limits their adoption in consumer-grade and cost-sensitive applications.
To alleviate the above bottlenecks, spectral reconstruction (SR) has emerged as a promising alternative. SR aims to recover high-dimensional, continuous, and physically consistent spectral information from low-dimensional observations (e.g., RGB or multispectral images). Since the mapping from three bands to hundreds of bands is a severely under-determined inverse problem [
26], the SR task is intrinsically ill-posed. This ill-posedness is shared by a broad class of underdetermined inverse problems in signal processing (e.g., localization/suppression in underdetermined SAR systems [
27]), which highlights the necessity of incorporating effective priors or learned constraints. Accordingly, existing SR research has broadly evolved into two major paradigms [
28]. Model-based optimization methods [
26,
29,
30,
31] explicitly formulate an imaging degradation model and incorporate priors such as sparsity and low-rankness as regularization terms, then solve the problem via iterative optimization. These methods are interpretable, but they are sensitive to hand-crafted priors, and they often struggle to capture non-local and higher-order spatial–spectral dependencies in complex scenes. In contrast, data-driven deep learning approaches [
32,
33,
34,
35,
36,
37] have achieved substantial performance gains on public benchmarks due to their strong representation capacity. However, most state-of-the-art reconstruction algorithms still follow a supervised learning paradigm and heavily rely on large-scale, pixel-wise, precisely registered spectral annotations. Because hyperspectral data acquisition requires expensive hardware, complicated procedures, and strict calibration, annotation is highly costly, which imposes clear scalability limitations on supervised methods in real-world applications.
Semi-supervised/unsupervised SR methods [
38,
39,
40,
41] can substantially reduce the reliance on densely annotated spectral labels; however, in complex scenes, the stability of proxy supervision signals such as consistency constraints or pseudo-labels still needs to be improved. Moreover, the lack of explicit physical constraints may also compromise the physical plausibility of the reconstructed spectra. Therefore, it is necessary to explore reconstruction frameworks that achieve both reliable supervision and physical consistency under limited annotations.
We propose a semi-supervised learning framework guided by spectrally aware mini-patches (SA-MP), as illustrated in
Figure 1, which delivers reliable reconstruction performance under limited spectral annotations. The proposed method introduces patch-wise averaged spectra as region-level statistical supervision on only a small number of local patches. This design reformulates the conventional pixel-wise spectral constraint into a local, region-based statistical constraint, enabling hyperspectral reconstruction without dense pixel-level spectral labels over the entire image and substantially reducing the dependence on high-density spectral annotations. Given the inherently under-determined nature of hyperspectral reconstruction, we explicitly integrate a Tikhonov-based physical prior into the network optimization process to balance physical plausibility and data-driven representation. In particular, the physical layer is formulated in an optimizable manner and its regularization is adaptively updated during end-to-end training, which improves reconstruction accuracy while preserving the physical prior. In addition, we develop a deep reconstruction network that fuses spectral and spatial information. A hybrid attention mechanism jointly models inter-spectral correlations and spatial structural details, strengthens feature representation, and leads to higher-quality reconstruction results.
The main contributions of this paper are summarized as follows:
A deployment-friendly new paradigm: We introduce SSHSR, a semi-supervised hyperspectral reconstruction framework that reduces reliance on high-precision spectral annotations while maintaining a lightweight design. With just 1.59 M parameters, fewer than the 1.62 M of MST++ [42], it is well suited to practical deployment. SSHSR runs efficiently on general-purpose hardware, such as portable laptops, without requiring dedicated workstations.
Low data requirements with SA-MP guidance: The core of our SSHSR is the SA-MP guidance module, which extracts patch-level averaged spectra from local regions to provide supervision for the reconstruction model. This design allows the network to stably learn the RGB–HSI mapping even when spectral annotations are scarce.
Performance gains via physics–data synergistic fusion: We integrate an optimizable Tikhonov prior for adaptive regularization and enforce spectral–spatial attention with a frequency-domain consistency loss, improving physical fidelity and reconstruction accuracy. On the GDFC remote-sensing dataset [
43], our method achieves a 6.8% improvement in PSNR and a 22.1% reduction in SAM.
The remainder of this paper is organized as follows.
Section 2 reviews related work on RGB-to-HSI reconstruction and semi-supervised learning.
Section 3 introduces the proposed SSHSR framework, including the learnable Tikhonov prior, the SA-MP supervision mechanism, as well as the network architecture and loss functions.
Section 4 presents the experimental settings and results on multiple benchmarks, together with ablation studies and further analyses.
Section 5 reports experimental results and analysis on the real-world collected dataset.
Section 6 discusses the challenges of the method and future work. Finally,
Section 7 concludes the paper.
3. Materials and Methods
This section presents the proposed SSHSR. We first formulate the problem model. We then describe the initial spectral estimation based on an optimizable Tikhonov regularization. Next, the reconstruction network architecture that integrates spectral and spatial information is introduced. After that, we present the semi-supervised setting under limited spectral annotations and the construction of SA-MP. Finally, we detail the corresponding composite loss function. The complete training and inference procedure is outlined in Algorithm 1.
Algorithm 1 Training and inference of SSHSR.
Require: RGB observation, SRF matrix, MaxEpoch
Ensure: reconstructed HSI
1: Parameters: learnable Tikhonov matrix and coefficient, network weights, loss weights
2: for epoch = 1 to MaxEpoch do
3:  Compute the initial spectral estimate with Equation (4)
4:  Run the reconstruction network to obtain the multi-scale outputs
5:  // SA-MP from GT, then aligned sampling on SR
6:  Compute the GT patch mean spectra with Equations (10) and (11)
7:  Compute the reconstructed patch mean spectra with Equations (10) and (11)
8:  Compute the mean-spectrum loss with Equation (12)
9:  Compute the frequency-consistency loss with Equation (13)
10:  Compute the semi-supervised loss with Equation (14)
11:  Compute the degradation-consistency loss with Equation (15)
12:  Aggregate the losses across scales with Equations (16) and (17)
13:  Compute the total loss with Equation (18)
14:  Update all learnable parameters by Adam
15: end for
16: Inference: compute the Tikhonov initial estimate, run the network, and return the reconstructed HSI
3.1. Problem Formulation
In the imaging model, an RGB image can be regarded as the result of a weighted spectral integration of an HSI along the spectral dimension under the camera spectral response function (SRF). Specifically, the pixel value of an RGB observation $Y$ at channel $c$ can be modeled as an integration process over the spectral dimension:
$$Y_c(i, j) = \int_{\Lambda} S_c(\lambda)\, H(i, j, \lambda)\, \mathrm{d}\lambda,$$
where $S_c(\lambda)$ denotes the spectral response function of channel $c$ and the spectral range $\Lambda$ is typically set to 400–700 nm. For a more compact representation, we rewrite the above process in matrix form:
$$Y = \Phi H.$$
Let $Y \in \mathbb{R}^{3 \times hw}$ denote the RGB vectors, $H \in \mathbb{R}^{L \times hw}$ denote the corresponding hyperspectral vectors, and $\Phi \in \mathbb{R}^{3 \times L}$ be the spectral response matrix. Here, $h$ and $w$ denote the image height and width, respectively, and $L$ denotes the number of spectral bands.
3.2. Initial Spectral Estimation Based on Tikhonov Regularization
Based on the imaging model described above, we construct a physics-inspired layer using Tikhonov regularization. This layer performs a linear inverse mapping from the input RGB image to produce an initial hyperspectral estimate, and it serves as the first module of our network. Specifically, we define the initial spectrum $H_0$ as the solution to the following Tikhonov-type optimization problem:
$$H_0 = \arg\min_{H} \; \|\Phi H - Y\|_2^2 + \lambda \|\Gamma H\|_2^2.$$
The first term is a data-fidelity term, which constrains the error between the reconstructed HSI and the observation in the RGB space so that it adheres to the imaging model described above. The second term is a regularization term that introduces a spectral smoothness prior. Here, $\Gamma$ is the Tikhonov matrix (typically implemented as a second-order finite-difference Laplacian operator), and $\lambda$ is the regularization coefficient, which is generally set to 0.01.
Since Equation (3) is a convex quadratic function of $H$ and all terms are continuously differentiable, an analytical solution can be obtained by taking the derivative with respect to $H$ and setting it to zero:
$$H_0 = \left(\Phi^{\top} \Phi + \lambda\, \Gamma^{\top} \Gamma\right)^{-1} \Phi^{\top} Y.$$
The resulting closed-form solution is treated as a physics-constrained initial spectral estimate and is fed into the subsequent deep network for refinement. This design provides a more reasonable initialization for the network and, at the architectural level, explicitly embeds a spectral continuity prior that carries through the entire reconstruction process. Furthermore, we extend this fixed formulation to an optimizable version by treating $\Gamma$ and $\lambda$ as learnable parameters and updating them adaptively during training, which allows the model to better match the spectral statistics of different datasets.
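For illustration, the closed-form Tikhonov estimate can be sketched in NumPy as below; this is a per-pixel sketch under synthetic placeholder SRF values, not the paper's trained layer, and the function names are our own.

```python
import numpy as np

def second_order_laplacian(n_bands):
    """Second-order finite-difference operator along the spectral axis."""
    G = np.zeros((n_bands - 2, n_bands))
    for i in range(n_bands - 2):
        G[i, i], G[i, i + 1], G[i, i + 2] = 1.0, -2.0, 1.0
    return G

def tikhonov_initial_estimate(rgb, srf, lam=0.01):
    """Closed-form estimate (S^T S + lam G^T G)^{-1} S^T y, applied per pixel.

    rgb: (h, w, 3) observation; srf: (3, L) spectral response matrix.
    Returns an (h, w, L) physics-constrained initial HSI.
    """
    h, w, _ = rgb.shape
    L = srf.shape[1]
    G = second_order_laplacian(L)
    A = srf.T @ srf + lam * (G.T @ G)      # (L, L); the prior makes A invertible
    Y = rgb.reshape(-1, 3).T               # (3, h*w)
    H0 = np.linalg.solve(A, srf.T @ Y)     # (L, h*w)
    return H0.T.reshape(h, w, L)
```

With a small regularization coefficient, re-projecting the estimate through the SRF reproduces the RGB observation almost exactly, while the Laplacian term keeps the recovered spectra smooth.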
3.3. Overall Network Architecture
We construct a multi-scale hybrid attention network based on U-Net (
Figure 2) to jointly model the complex mapping relationship between spectral and spatial information, and to achieve multi-scale feature fusion in the encoder–decoder structure. The network is built upon three key designs. First, the network embeds hybrid attention modules in the encoder/decoder to model the dependencies between local spatial textures and spectral information through the combination of spectral–spatial attention. Then, we adopt the MIMO (multi-input, multi-output) paradigm [
65,
66] together with PixelShuffle [
67] for information-preserving rearrangement, which promotes progressive reconstruction of multi-scale features from coarse to fine. Following this, we introduce a physics-guided global residual learning mechanism; the initial smooth estimate produced by the optimizable Tikhonov regularization is used as the network input, and image-level skip connections add this estimate to the network predictions at each scale. With these designs, the reconstruction task is explicitly reformulated as learning the residual between the “physics-based initial estimate” and the target spectra. This encourages the network to focus on recovering high-frequency details suppressed by regularization, reduces the optimization difficulty to some extent, and builds synergy between physics-based constraints and data-driven learning, thereby improving the stability and interpretability of the reconstruction process.
Hybrid Attention Module (HAM)
The hybrid attention module includes spectral attention and spatial attention blocks. The structure of the spectral attention block is shown in
Figure 3a. Its main function is similar to that of channel attention [
68]; it explicitly models inter-band correlations to adaptively recalibrate the responses of spectral channels. Let the input feature be $F \in \mathbb{R}^{C \times H \times W}$. First, the global average pooling (GAP) branch computes the mean response of each spectral channel over the spatial dimensions, while the global max pooling (GMP) branch extracts the corresponding maximum response, resulting in two spectral descriptor vectors.
Next, the two descriptor vectors are separately fed into two
convolution layers (with non-shared weights) followed by nonlinear activations, producing two spectral weight vectors.
The two weight vectors are then multiplied with the input feature
F in a spectral-wise manner, yielding two weighted feature maps.
Finally, the two weighted features are concatenated along the spectral dimension and fused through a convolution to obtain the output of the spectral attention block. Here, ReLU and Sigmoid serve as the nonlinear activations, $\odot$ denotes the spectral-wise multiplication, concatenation is performed along the spectral dimension, and the fusion convolution carries the learnable weights.
Spatial attention [
69] aims to generate a spatial saliency map that guides the network to focus on key regions with complex textures or rich edges (
Figure 3b). First, we apply GAP and GMP to
$F$ along the channel dimension, producing two single-channel feature maps. We then concatenate them and feed the result into a convolution followed by a Sigmoid activation to obtain a spatial attention map. Finally, this attention map is multiplied with the input feature in an element-wise manner, yielding the output of the spatial attention module.
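A minimal PyTorch sketch of the two blocks described above; the two-layer 1×1-conv branch layout and the 7×7 spatial kernel are assumptions (the exact kernel sizes appear only in Figure 3), so this is an illustrative sketch rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """GAP/GMP descriptors -> non-shared 1x1 convs -> spectral reweighting."""
    def __init__(self, channels):
        super().__init__()
        self.fc_avg = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(),
                                    nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fc_max = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(),
                                    nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # fuse the two branches

    def forward(self, f):
        w_avg = self.fc_avg(f.mean(dim=(2, 3), keepdim=True))  # GAP descriptor
        w_max = self.fc_max(f.amax(dim=(2, 3), keepdim=True))  # GMP descriptor
        return self.fuse(torch.cat([f * w_avg, f * w_max], dim=1))

class SpatialAttention(nn.Module):
    """Channel-wise GAP/GMP maps -> conv -> sigmoid saliency map."""
    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        m = torch.cat([f.mean(dim=1, keepdim=True),
                       f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.conv(m))
```

Both blocks preserve the feature shape, so they can be dropped into the encoder/decoder stages without changing the surrounding tensor dimensions.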
3.4. SSHSR Scheme
Based on SA-MP, we first construct spectral supervision on a small number of local patches and then extend this mechanism to the multi-scale outputs of the network, as shown in
Figure 2. Specifically, to avoid spatial bias caused by sparse annotations, we perform regular grid sampling on the ground-truth (GT) HSI
$H$ at the highest resolution. We partition the image into a $3 \times 3$ grid and crop a small patch centered in each grid cell as a supervised region, denoted as $P_k$. The spectral vectors of all pixels within the $k$-th patch are averaged to obtain its local mean spectrum
$$\bar{h}_k = \frac{1}{|P_k|} \sum_{p \in P_k} H(p),$$
with $k = 1, \dots, K$ and $K = 9$. To extend the supervised patches to a multi-scale setting, we progressively resize these nine supervised patches extracted at the highest resolution along the same downsampling pathway as the network. This produces the corresponding supervised regions $P_k^{(s)}$ at each scale $s$. Let $D_s$ denote the downsampling operator at scale $s$. The hyperspectral patch at scale $s$ for the $k$-th region is given by
$$P_k^{(s)} = D_s(P_k),$$
where $P_k$ denotes the patch cropped from the original high-resolution image and $P_k^{(s)}$ denotes the pixel set of the downsampled patch. The mean spectrum of the $k$-th patch at scale $s$ is defined as
$$\bar{h}_k^{(s)} = \frac{1}{|P_k^{(s)}|} \sum_{p \in P_k^{(s)}} H^{(s)}(p).$$
By first selecting nine patches in a $3 \times 3$ grid at the highest resolution and then scaling each patch along the network’s downsampling pathway to compute the mean spectrum at every scale, we obtain a set of cross-scale local mean spectra for each SA-MP. These spectra are then used to define the multi-scale mean-spectrum loss.
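As a concrete sketch, the grid-center mini-patch sampling and mean-spectrum computation might look as follows in NumPy; `samp_mean_spectra` is a hypothetical helper name, and the default `patch` size is an assumption, not a value taken from the paper.

```python
import numpy as np

def samp_mean_spectra(hsi, grid=3, patch=8):
    """Mean spectrum of a small patch centred in each cell of a grid x grid layout.

    hsi: (h, w, L) hyperspectral cube. grid=3 yields the nine supervised
    mini-patches; the patch size is an assumption. Returns (grid*grid, L).
    """
    h, w, L = hsi.shape
    spectra = []
    for gy in range(grid):
        for gx in range(grid):
            cy = int((gy + 0.5) * h / grid)        # grid-cell centre row
            cx = int((gx + 0.5) * w / grid)        # grid-cell centre column
            y0 = max(cy - patch // 2, 0)
            x0 = max(cx - patch // 2, 0)
            p = hsi[y0:y0 + patch, x0:x0 + patch]  # (patch, patch, L) region
            spectra.append(p.reshape(-1, L).mean(axis=0))
    return np.stack(spectra)
```

Applying the same routine to the GT cube and to the network output at each scale (after the corresponding downsampling) yields the aligned mean-spectrum pairs used for supervision.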
3.5. Composite Loss Function
Under the setting of limited spectral annotations, we design a composite loss function composed of multiple constraints. First, we define a semi-supervised loss $\mathcal{L}_{\mathrm{semi}}$ only on the few regions with spectral annotations. It consists of two terms: a mean-spectrum loss and a frequency-domain consistency loss. Specifically, following the SA-MP construction in
Section 3.4, we compute the patch-wise mean spectra at each scale for both the ground truth and the reconstructed HSI, denoted as $\bar{h}_k^{(s)}$ and $\hat{\bar{h}}_k^{(s)}$, respectively. At scale $s$, the mean-spectrum loss $\mathcal{L}_{\mathrm{ms}}^{(s)}$ is defined as the average distance between the mean spectra over the $K$ patches. Correspondingly, after applying a 1D FFT to the mean spectrum of each patch, we compute the average difference between the GT and reconstructed frequency-domain representations over the $K$ patches, which yields the frequency-consistency loss $\mathcal{L}_{\mathrm{freq}}^{(s)}$ at scale $s$.
Finally, we obtain the overall semi-supervised loss $\mathcal{L}_{\mathrm{semi}}$ by a weighted summation of $\mathcal{L}_{\mathrm{ms}}^{(s)}$ and $\mathcal{L}_{\mathrm{freq}}^{(s)}$ across all scales:
$$\mathcal{L}_{\mathrm{semi}} = \sum_{s} \left( \alpha\, \mathcal{L}_{\mathrm{ms}}^{(s)} + \beta\, \mathcal{L}_{\mathrm{freq}}^{(s)} \right),$$
in which $\alpha$ and $\beta$ are the weighting factors for the two terms.
Degradation-consistency loss: Since spectral labels are available only at a few locations, while RGB observations are known over the entire image, we explicitly model the HSI-to-RGB imaging degradation using the camera SRF, as shown in
Figure 1. For the reconstructed HSI at each scale, $\hat{H}^{(s)}$, we project it to the RGB space through the degradation model to obtain a synthesized RGB image $\hat{Y}^{(s)} = \Phi \hat{H}^{(s)}$. We then compare $\hat{Y}^{(s)}$ with the input RGB at the same scale (obtained by applying the same downsampling operator to the original RGB) in a pixel-wise manner, and define the degradation-consistency loss at scale $s$ as $\mathcal{L}_{\mathrm{deg}}^{(s)}$. The total degradation-consistency loss is the weighted sum across scales:
$$\mathcal{L}_{\mathrm{deg}} = \sum_{s} w_s\, \mathcal{L}_{\mathrm{deg}}^{(s)}.$$
By combining the two sets of constraints above, the final overall training objective is written as
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{semi}} + \gamma\, \mathcal{L}_{\mathrm{deg}}.$$
In our experiments, the three balancing factors $\alpha$, $\beta$, and $\gamma$ are set to 0.1, 0.1, and 100, respectively, to weight the spectral statistical terms and the physics-based degradation consistency term.
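The per-scale degradation-consistency term can be sketched as follows in PyTorch; the L1 penalty is an assumption, since the text specifies only a pixel-wise comparison, and the function name is our own.

```python
import torch
import torch.nn.functional as F

def degradation_consistency_loss(hsi_pred, rgb, srf):
    """Project the reconstructed HSI back to RGB via the SRF and compare.

    hsi_pred: (B, L, H, W) reconstruction, rgb: (B, 3, H, W) observation,
    srf: (3, L) spectral response matrix. Pixel-wise L1 penalty assumed.
    """
    # Spectral integration: contract the band axis against the SRF rows.
    rgb_syn = torch.einsum('cl,blhw->bchw', srf, hsi_pred)
    return F.l1_loss(rgb_syn, rgb)
```

Because the RGB observation is known everywhere, this term supervises every pixel of the reconstruction, complementing the sparse SA-MP spectral supervision.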
4. Experiments and Results
4.1. Dataset
We conduct independent experiments on three public hyperspectral benchmark datasets and further validate the proposed SSHSR on a self-collected real-world scene dataset, in order to assess its reconstruction performance under different imaging scenarios and spectral distributions. The public benchmarks cover natural scenes (ARAD-1K [
70] and CAVE [
71]), as well as complex remote-sensing earth-observation scenarios (IEEE GRSS DFC 2018 [
43,
72] (GDFC)). The real-world dataset is captured using a commercial Specim IQ hyperspectral camera (Specim, Oulu, Finland) and includes typical indoor targets such as office supplies, potted plants, cups, and a standard color chart. In total, 31 hyperspectral images are collected, with a spectral range of 400–700 nm. All RGB observations used in our experiments are synthesized via the spectral integration projection model described in
Section 3.1, using the SRFs of standard cameras (e.g., Canon 60D and Basler ace 2 [
70]), which ensures that the constructed RGB–HSI pairs are consistent with the physical imaging mechanism. Detailed dataset specifications, including spatial resolution, spectral band range, and train/val splits, etc., are summarized in
Table 1.
Dataset-specific preprocessing is applied to ensure physical validity and to reduce the influence of abnormal radiometric values. For ARAD-1K, we perform data cleaning and remove abnormal samples containing all-zero values or invalid pixels. For CAVE, the original images contain invalid black borders and boundary artifacts; we therefore crop the images to remove invalid edge regions and retain only the effective field of view. For GDFC, RGB observations are generated using the SRF of the Basler ace 2 camera. The mismatch between the camera response range (360–750 nm) and the HSI spectral coverage (380–1050 nm) is mitigated by linearly interpolating and extending the SRF, which establishes a more physically consistent RGB–HSI mapping over the full spectral range.
4.2. Evaluation Metrics
A rigorous quantitative evaluation and fair comparison are conducted using four widely adopted metrics: mean relative absolute error (MRAE), root mean squared error (RMSE), and peak signal-to-noise ratio (PSNR) to measure pixel-wise numerical reconstruction fidelity, and spectral angle mapper (SAM) to assess the geometric similarity of spectral signatures. A higher PSNR indicates better performance, whereas lower values are preferred for the other metrics:
$$\mathrm{MRAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{y_i}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2},$$
$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}, \qquad \mathrm{SAM} = \frac{1}{M} \sum_{j=1}^{M} \arccos \frac{\langle s_j, \hat{s}_j \rangle}{\| s_j \|_2\, \| \hat{s}_j \|_2}.$$
In the above, $y_i$ and $\hat{y}_i$ denote the $i$-th pixel values of the GT and reconstructed HSI, respectively, $N$ is the total number of pixels, and $\mathrm{MAX}$ is the peak signal value. $s_j$ and $\hat{s}_j$ denote the 1D spectral signatures of the $j$-th hyperspectral pixel in the GT and reconstructed images, respectively, and $M$ is the number of pixels in an image slice.
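For reference, the four metrics can be computed as in the following sketch; the `evaluate` helper name and the small `eps` guard against division by zero are our additions.

```python
import numpy as np

def evaluate(gt, rec, data_max=1.0, eps=1e-8):
    """Standard spectral-reconstruction metrics; gt/rec are (H, W, L) arrays."""
    diff = gt - rec
    mrae = np.mean(np.abs(diff) / (np.abs(gt) + eps))
    mse = np.mean(diff ** 2)
    rmse = np.sqrt(mse)
    psnr = 10 * np.log10(data_max ** 2 / (mse + eps))
    # SAM: mean spectral angle (radians) over all spatial positions.
    g = gt.reshape(-1, gt.shape[-1])
    r = rec.reshape(-1, rec.shape[-1])
    cos = np.sum(g * r, axis=1) / (
        np.linalg.norm(g, axis=1) * np.linalg.norm(r, axis=1) + eps)
    sam = np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))
    return mrae, rmse, psnr, sam
```

Note that SAM depends only on the direction of each spectral vector, so a uniformly scaled reconstruction can score a near-zero SAM while still showing a large MRAE.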
4.3. Implementation Details
During training, we randomly crop 128 × 128 patches from the original RGB images as input. End-to-end optimization is performed using the Adam optimizer. During initialization, we set $\Gamma$ as a second-order finite-difference Laplacian operator and initialize $\lambda$ to a small positive value of 0.01. To ensure that the regularization strength remains positive during optimization, we adopt a non-negative parameterization of $\lambda$, ensuring that $\lambda > 0$ throughout training. To accommodate the different convergence behaviors of the physics layer and the deep network, we adopt a hierarchical learning-rate scheme; the initial learning rate is set separately for the learnable regularization matrix and coefficient in the Tikhonov layer and for the remaining network parameters. The learning rate is halved every 500 epochs, and the model is trained for a total of 3000 epochs. The batch size is set to 16 on ARAD-1K and 4 on the other datasets. The learnable Tikhonov parameters are optimized alongside the network weights. All compared methods are implemented under the same semi-supervised training framework and likewise use only about 2% labeled data. It is worth noting that most competing methods follow a single-input, single-output architecture; for consistency and simplicity, we therefore train these methods with a unified single-scale version of the semi-supervised loss. The proposed network is implemented in PyTorch 2.0 and trained on a single NVIDIA RTX 4090 GPU.
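The hierarchical learning-rate scheme can be expressed with Adam parameter groups; the stand-in modules and the concrete rates below are placeholders, as the paper's exact values are not reproduced here.

```python
import torch
import torch.nn as nn

# Stand-in modules; in SSHSR these would be the Tikhonov layer and the backbone.
tikhonov_layer = nn.Linear(3, 31)
backbone = nn.Sequential(nn.Conv2d(31, 31, 3, padding=1))

# One group for the learnable regularization matrix/coefficient, another for
# the remaining network parameters. The rates here are placeholders.
optimizer = torch.optim.Adam([
    {"params": tikhonov_layer.parameters(), "lr": 1e-3},
    {"params": backbone.parameters(), "lr": 4e-4},
])
# Halve the learning rate every 500 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)
```

Each group keeps its own rate under the shared halving schedule, which lets the physics layer and the deep network converge at their own pace.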
4.4. Results
4.4.1. Quantitative Results
We conducted comparative experiments on the ARAD-1K [
70], CAVE [
71], and GDFC datasets [
43] against several representative spectral reconstruction methods, including HSCNND [
51], HRNet [
55], AWAN [
54], MST++ [
42], RepCPSI [
62], FSDFF [
60], and LTRN [
61]. The results are summarized in
Table 2. On ARAD-1K, our method achieves improvements of 11.8%, 7.3%, 10.2%, and 0.029% in MRAE, RMSE, SAM, and PSNR, respectively, over the second-best result, indicating a clear gain in reconstruction accuracy. On CAVE, our approach further improves MRAE, RMSE, and PSNR by 15.4%, 7.1%, and 2.2%, respectively, and remains the top performer overall. Notably, the improvement in SAM is relatively limited, suggesting that the spectral shape agreement still leaves room for improvement. This behavior may be attributed to severe metamerism in CAVE, under which all methods exhibit degraded SAM performance. Several recent studies [
73] have examined this phenomenon in depth, yet an effective remedy has not been established. On the GDFC dataset, the proposed method continues to show a pronounced advantage; compared with the second-best method, it improves MRAE, RMSE, SAM, and PSNR by 23.1%, 20.9%, 22.1%, and 6.8%, respectively. Overall, consistent gains are observed across these representative datasets. In addition, although MST++ performs strongly under full supervision, its performance drops markedly with only 2% labeled data, indicating limited effectiveness when spectral supervision is scarce.
4.4.2. Qualitative and Visual Results
To assess the perceptual quality of the reconstructed HSI, we visualize band-wise MRAE maps for randomly selected validation samples from the ARAD-1K, CAVE, and GDFC datasets, comparing our method with other representative reconstruction approaches, as shown in
Figure 4,
Figure 5 and
Figure 6. In these error maps, darker colors indicate higher reconstruction accuracy. The heatmaps show that our method produces darker regions over a larger area, suggesting lower reconstruction errors and richer texture details in the reconstructed HSI. Compared with other methods, our results exhibit a more uniform error distribution and better recovery of local structures and fine details. In addition, spectral fidelity is evaluated by comparing the mean spectral signatures within selected regions, as shown in
Figure 7. The selected regions are highlighted by red boxes in
Figure 4 and
Figure 6. As can be seen from
Figure 7, the spectra recovered by our method are closer to the GT.
4.4.3. Model Complexity and Efficiency
Beyond reconstruction performance, we further report the number of model parameters (Params), floating-point operations (FLOPs), and inference time to comprehensively assess storage overhead, computational complexity, and practical runtime efficiency. Since FLOPs depend on the input resolution, we compute FLOPs on the ARAD-1K dataset using a unified input size of
, and measure inference time under the same hardware conditions. As shown in
Table 3, our method achieves lower Params and FLOPs, resulting in a smaller model size and reduced computational cost. Compared with HRNet and AWAN, which obtain the second-best performance on some datasets, our approach significantly reduces the parameter scale and computational burden while delivering higher reconstruction accuracy. Moreover, relative to the recently proposed FSDFF and LTRN, our method also demonstrates superior efficiency in terms of Params and FLOPs. In addition, our method requires only 2.26 s for inference, further validating its high efficiency in practical deployment scenarios. Overall, the proposed method achieves a better trade-off between high-fidelity reconstruction and lightweight, efficient inference.
4.5. Ablation Analysis
This section conducts a systematic ablation study on the CAVE dataset to validate the effectiveness of the spectrally aware mini-patches (SA-MP), the contributions of different module designs, the role of the frequency-domain consistency loss, the impact of single-scale versus multi-scale loss design, and the comparison of mini-patch selection strategies (random sampling vs. grid-center sampling). The corresponding ablation results are summarized in
Table 4,
Table 5,
Table 6,
Table 7 and
Table 8, providing a systematic analysis of how each component contributes to improving semi-supervised spectral reconstruction performance.
Effect of SA-MP. We investigate the role of SA-MP in semi-supervised learning. As reported in
Table 4, compared with the purely unsupervised setting without SA-MP, introducing SA-MP reduces MRAE, RMSE, and SAM by 10.6%, 36.8%, and 27.1%, respectively, and increases PSNR by 15.1%. These results indicate that relying solely on the unsupervised degradation-consistency loss tends to cause noticeable spectral distortion, whereas the regional mean-spectrum guidance provided by SA-MP imposes more effective constraints on the learned spectral distribution, leading to a substantial improvement in reconstruction performance.
Effect of different modules: To evaluate the contribution of each component, we adopt a baseline network composed only of residual blocks and then progressively incorporate the hybrid attention module (HAM) and the physics-prior layer (Tikhonov layer) for comparison. As reported in
Table 5, incorporating HAM improves MRAE, RMSE, SAM, and PSNR by 4.4%, 17.1%, 6.5%, and 6.3%, respectively, compared to the baseline. This demonstrates that the hybrid attention mechanism can more effectively couple spatial textures with spectral correlations, thereby enhancing the feature representation capability. Further introducing the Tikhonov layer yields additional gains over both the baseline and the HAM-only setting, demonstrating the benefit of the physics prior for reconstruction. When the fixed-form Tikhonov layer is upgraded to an optimizable version, the best performance is achieved; relative to the baseline, the four metrics improve by 19.2%, 31.7%, 19.8%, and 12.2%, respectively, and compared with the fixed (non-optimizable) Tikhonov layer, the improvements are 3.3%, 9.6%, 6.8%, and 2.2%, respectively. Meanwhile, the parameter count remains low, indicating a favorable trade-off between reconstruction accuracy and model complexity.
Effect of the Frequency Consistency Loss: In SSHSR, we introduce Frequency Consistency Loss (FCL) into the spectral supervision to constrain the discrepancy between the reconstructed and GT spectra in the frequency domain, thereby encouraging better modeling of inter-band correlations. As shown in
Table 6, compared with the setting without FCL, adding FCL reduces MRAE, RMSE, and SAM by 5.1%, 8.0%, and 14.7%, respectively, and improves PSNR by 3.1%. These results indicate that the frequency-domain consistency constraint provides stable performance gains within our framework and enhances the model’s ability to capture spectral structural information.
Effect of single-scale/multi-scale loss functions: We further conduct an ablation study on single-scale and multi-scale losses. The single-scale loss is calculated based on the mean spectrum of SA-MP at the highest spatial resolution output. In contrast, the multi-scale loss applies joint constraints across the network’s multi-scale outputs; using the GT SA-MPs at the highest resolution as a reference, these patches are downsampled at each scale to obtain corresponding spectral references. The mean spectrum of each patch at each scale is then compared with the mean spectrum of the corresponding patch in the network’s output at that scale, strengthening the local spectral consistency across different spatial resolutions. As shown in
Table 7, compared to the single-scale loss, introducing the multi-scale loss reduces MRAE, RMSE, and SAM by 3.6%, 6.1%, and 7.0%, respectively, and improves PSNR by 2.0%. This indicates that the multi-scale constraint provides more comprehensive spectral supervision across different spatial resolutions, alleviating the scale bias caused by supervision at a single scale, and thus improving overall reconstruction accuracy.
Comparison of mini-patch selection strategies: To examine the effect of different mini-patch selection schemes on SA-MP mean-spectrum supervision, we keep all other settings unchanged and vary only the sampling strategy of the mini-patches, comparing random sampling with grid-center sampling (i.e., selecting the centers of a $3 \times 3$ grid within each 128 × 128 crop). Both strategies use the same number of mini-patches for supervision. As shown in
Table 8, compared with random sampling, grid-center sampling achieves reductions of 2.7%, 4.7%, and 5.8% in MRAE, RMSE, and SAM, respectively, along with a 2.0% improvement in PSNR. These results indicate that the spatial placement of mini-patches affects the effectiveness of mean-spectrum supervision and, consequently, the final reconstruction performance.
5. Analysis of Results in Real-World Scenarios
Following the validation on standard benchmark datasets, this section further explores the performance of the proposed method in real-world environments. We conducted an extended experimental analysis based on hyperspectral data collected from real-world scenes to further validate the effectiveness and advantages of the semi-supervised hyperspectral reconstruction paradigm in practical application scenarios. The real-world dataset was captured using the commercial Specim IQ hyperspectral camera (Specim, Oulu, Finland), with scenes including typical objects such as indoor office supplies, potted plants, cups, and standard color targets. The spectral range of the data is 400–700 nm, with 31 spectral bands. A total of 31 hyperspectral images were collected, with 26 randomly selected for the training set and 5 for the validation set. During data collection, the hyperspectral camera was fixed above the target object at an approximately 45° viewing angle. A halogen lamp was used as the light source, positioned directly above the camera to achieve uniform illumination of the target. The corresponding RGB observations were generated by spectral integration of the collected hyperspectral data using the SRF of the Canon 60D camera.
Quantitative and Visual Results
Our method achieves strong performance on three representative datasets, namely ARAD-1K, CAVE, and GDFC. To further evaluate its reconstruction capability under different data distributions and imaging conditions, we apply the proposed approach to real captured data for comparative validation. As shown in
Table 9, the proposed method also delivers consistent performance gains on the real-scene dataset, outperforming the second-best result by 13.7%, 16.2%, 11.9%, and 6.0% in terms of MRAE, RMSE, SAM, and PSNR, respectively. This indicates that our method can maintain stable and competitive reconstruction performance in real-world acquisition scenarios.
Figure 8 presents the MRAE maps on the real-scene validation set, and
Figure 9 shows the corresponding spectral curves. From the visualization of the MRAE maps, our method exhibits darker colors over larger regions, indicating higher-quality reconstructed hyperspectral images and more accurate recovery of both spatial details and spectral information. As can be clearly observed in
Figure 9, the spectra reconstructed by our method are closer to the ground-truth spectra, demonstrating its notable advantages in spectral consistency and spectral fidelity.
7. Conclusions
This paper focuses on the RGB-to-hyperspectral reconstruction problem under the condition of limited spectral annotations, and proposes a semi-supervised reconstruction framework guided by spectrally aware mini-patches (SA-MP). The framework samples a small number of local spectral mini-patches through SA-MP and constructs supervision terms based on their region-averaged spectra, allowing the limited local spectral information to effectively constrain the full-image prediction, thereby enabling semi-supervised full-image hyperspectral recovery. Furthermore, we embed a learnable Tikhonov regularization physical layer into the reconstruction process, jointly optimizing the regularization matrix and coefficients to provide more stable physical constraints and reliable initialization. At the network level, a hybrid attention reconstruction structure is designed, which promotes the full interaction of spectral–spatial information and detailed recovery by jointly modeling spectral and spatial features. Extensive experimental results demonstrate that the proposed method achieves competitive reconstruction performance on multiple public benchmark and extended datasets.
Finally, we summarize the main limitations of SSHSR (e.g., metamerism-related ambiguity, potential SRF mismatch in practical remote sensing, and limited generalization under extreme illumination variations) and outline several possible future directions in the Discussion section, including more robust semi-supervised supervision designs, improved tolerance to SRF uncertainty, illumination-robust training/augmentation, and the potential use of complementary self-supervised or adversarial constraints.