Article

Spatial/Spectral-Frequency Adaptive Network for Hyperspectral Image Reconstruction in CASSI

Key Laboratory of Precision Opto-Mechatronics Technology, Ministry of Education, Beihang University, Beijing 100191, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(19), 3382; https://doi.org/10.3390/rs17193382
Submission received: 28 August 2025 / Revised: 27 September 2025 / Accepted: 3 October 2025 / Published: 8 October 2025


Highlights

What are the main findings?
  • We propose SSFAN, a novel dual-branch transformer network that integrates spatial and channel attention with frequency-domain learning (via a dedicated FDT module and loss) for high-fidelity HSI reconstruction in CASSI.
  • The introduced channel compression and expansion modules effectively optimize the computational efficiency of the network, enabling superior detail preservation with moderate resource consumption.
What is the implication of the main findings?
  • The integrated spatial/spectral-frequency learning framework improves reconstruction accuracy over existing end-to-end methods by simultaneously modeling multi-domain features, particularly in recovering fine textures and suppressing noise.
  • This work provides a practical and efficient solution for high-quality hyperspectral imaging, balancing performance and complexity for real-world applications.

Abstract

Coded-Aperture Snapshot Spectral Imaging (CASSI) systems acquire 3D spatial–spectral information on dynamic targets by converting 3D hyperspectral images (HSIs) into 2D compressed measurements. Various end-to-end networks have been proposed for HSI reconstruction from these measurements. However, these methods have not explored the frequency-domain information of HSIs. This research presents the spatial/spectral-frequency adaptive network (SSFAN) for CASSI image reconstruction. A frequency-division transformation (FDT) decomposes HSIs into distinct Fourier frequency components, enabling multiscale feature extraction in the frequency domain. The proposed dual-branch architecture consists of a spatial–spectral module (SSM) to preserve spatial–spectral consistency and a frequency division module (FDM) to model inter-frequency dependencies. Channel compression/expansion modules are integrated into the FDM to balance computational efficiency and reconstruction quality. Frequency-division loss supervises feature learning across divided frequency channels. Ablation experiments validate the contributions of each network module. Furthermore, comparison experiments on synthetic and real CASSI datasets demonstrate that SSFAN outperforms state-of-the-art end-to-end methods in reconstruction performance.

1. Introduction

Hyperspectral images (HSIs) have been widely used in various fields such as remote sensing [1,2,3], medical diagnosis [4,5,6], military security [7,8,9], and tracking [10,11] because they have more spectral bands and richer spectral information than monochromatic or RGB images. Recently, snapshot hyperspectral imaging systems [12,13,14,15,16,17,18] have been developed to capture the 3D spatial–spectral data cubes of dynamic targets. Inspired by compressive sensing theory, coded-aperture snapshot spectral imaging (CASSI) systems [16,19,20,21,22] have been developed to acquire hyperspectral data cubes with high spatial resolution. However, reconstructing HSIs from 2D compressed data remains a critical challenge for CASSI systems.
It is evident that reconstructing 3D HSI data from 2D compressed measurements is a highly ill-posed inverse problem [12]. Reconstruction algorithms for CASSI systems are primarily divided into two categories: optimization algorithms and deep learning methods. Traditional optimization algorithms [20,23,24,25,26] leverage hand-crafted priors and rely on iterative schemes to solve the reconstruction problem. For example, Liu et al. proposed the decompress snapshot compressive imaging (DeSCI) algorithm [24], which constructs a joint model that integrates the non-local self-similarity of video/hyperspectral frames with a rank minimization approach. Yuan proposed generalized alternating projection (GAP) to address the inverse problem in CASSI imaging by introducing a prior that minimizes the total variation [25]. Zhang et al. proposed dimension-discriminative low-rank tensor recovery (DLTR) to reconstruct HSIs [26]. Yin et al. used a non-local low-rank tensor (NLRT) prior to design an optimization algorithm [20]. These traditional optimization methods have notable drawbacks, such as slow reconstruction speed and limited generalization capabilities. Algorithm parameters often require manual adjustment. Moreover, the algorithms often suffer from low accuracy when applied to noisy data.
The deep-learning-based methods for CASSI systems are categorized into three main types: plug-and-play (PnP), deep unfolding networks, and end-to-end (E2E) networks. The PnP algorithms [27,28,29,30,31] substitute the prior module in the iterative process of an optimization algorithm with a denoising network. Compared with traditional optimization algorithms, PnP algorithms significantly improve reconstruction accuracy [32].
Deep unfolding networks [33,34,35,36] are designed on the basis of the iterative frameworks of optimization algorithms. They feature high interpretability and good reconstruction accuracy. Ma et al. proposed a deep tensor alternating direction method of multipliers network (ADMM-Net) [33] for video SCI systems by unfolding the inference iterations into a layerwise structure and designing a deep neural network based on tensor operations for the prior model. Huang et al. proposed a deep unfolding method for the CASSI system based on the maximum a posteriori (MAP) estimation framework, which uses a learned Gaussian scale mixture (GSM) prior [34]. Cai et al. proposed a principled degradation-aware unfolding half-shuffle transformer (DAUHST) by combining the degradation-aware unfolding framework (DAUF) and the half-shuffle transformer (HST) [35]. Li et al. presented a pixel-adaptive deep unfolding transformer (PADUT) for HSI reconstruction in CASSI by incorporating a pixel-adaptive data module and a non-local spectral transformer prior module [36].
E2E networks [37,38,39,40,41,42,43] aim to establish a direct mapping from the compressed measurements of the CASSI system to reconstructed HSIs. These types of methods have exhibited good reconstruction performance. However, existing E2E networks predominantly exploit the spatial–spectral information of HSIs for reconstruction, without considering the spatial-frequency-domain information of the targets.
CASSI systems can be categorized into single-disperser CASSI (SD-CASSI) [16] and dual-disperser CASSI (DD-CASSI) [21,22]. In this study, taking SD-CASSI as an example, we propose a spatial/spectral-frequency adaptive network (SSFAN) for HSI reconstruction from compressive CASSI measurements. The frequency-division transformation (FDT) is proposed to process the frequency information of the feature maps, and a loss function is developed to help the network learn information from different frequency bands of HSIs. Additionally, both channel-dimension and spatial-dimension transformers are introduced to improve image reconstruction.
Our main contributions are as follows:
1. A novel E2E neural network framework, SSFAN, is proposed for the reconstruction of HSIs in a CASSI system. The framework features two branches—a transformer based on spatial attention and a transformer based on channel attention—designed to reconstruct the spatial–spectral and Fourier frequency components of HSIs across different bands.
2. The FDT is designed to obtain distinct Fourier frequency components and is accompanied by FDT loss to supervise the learning of information from different frequency channels.
3. To address the high computational load caused by the FDT, a channel compression module and a channel expansion module are introduced to balance the reconstruction performance and computer resource consumption of the neural network.
The article is organized as follows. Section 2 briefly reviews the existing E2E networks for CASSI and some works on frequency-domain processing. Section 3 describes the image formation model of SD-CASSI. Section 4 introduces a mathematical formulation of the proposed reconstruction method. Section 5 details the proposed two-branch E2E framework, including the network architecture (a spatial–spectral module and a frequency-division module) and the loss function. Section 6 shows the results of the ablation and comparison experiments. Finally, Section 7 presents the conclusion of this study.

2. Related Works

E2E networks are constructed based on three key aspects: the specific network structure, the utilization of system information, and the selection of loss functions. Regarding network structure, the basic building blocks of an E2E network include linear layers, convolution layers [44], and attention layers [45]. These fundamental components are combined to form the residual structure, the transformer structure [46], the UNet structure [47], etc. These diverse architectures are employed in various E2E networks available today [48,49,50,51,52]. Regarding the use of system information to extract features from inputs, many neural networks utilize various transforms, such as the Fourier transform [53], wavelet transform [54], and S-transform [55]. Regarding loss functions, different types are employed, including, among others, mean square error (MSE) [56] and learned perceptual image patch similarity (LPIPS) [57].

2.1. E2E Neural Networks for CASSI Reconstruction

Many E2E networks have been developed for the reconstruction of CASSI data. The λ-net [37] is a dual-stage generative model specifically designed to reconstruct the desired 3D data from SCI measurements. Meng et al. proposed spatial–spectral self-attention (TSA) [38] to process each dimension sequentially but in an order-independent manner. Hu et al. proposed a high-resolution dual-domain learning network (HDNet) integrating a residual network and spatial–spectral attention as its core architecture [39]. Cai et al. proposed the mask-guided spectral-wise transformer (MST) [40] by introducing spectral-wise multi-head self-attention (MSA) and a mask-guided mechanism (MM) to boost performance in HSI reconstruction. Cai et al. also designed the coarse-to-fine sparse transformer (CST) [42]. This model employs a spectra-aware screening mechanism (SASM) and spectra-aggregation hashing multi-head self-attention (SAH-MSA) for coarse patch selection, fine pixel clustering, and self-similarity capturing. Luo et al. designed a dual-window multiscale transformer (DWMT) to capture both long-range dependencies and local details for HSI reconstruction in CASSI systems [43].

2.2. Frequency-Domain Processing

Frequency-domain processing has long been an important research area in signal processing. In recent years, with the extensive application of deep learning across various fields, many methods [58,59,60,61,62,63,64,65,66] have emerged to integrate deep learning with frequency-domain analysis. Xu et al. [58] found that deep neural networks often fit target functions from low to high frequencies during the training process, a behavior referred to as the F-Principle. Based on these findings, frequency-domain processing techniques can help networks achieve superior performance. Xu et al. found that learning in the frequency domain with selected static channels can achieve higher accuracy than conventional spatial downsampling approaches [59]. Wei et al. introduced an invariant wavelet transform with multiple parameters in the fractional domain for a watermarking algorithm [60]. Zhang et al. utilized Fourier frequency features for aerial image super-resolution [61]. Jiang et al. proposed focal frequency loss [62], which enables neural networks to adaptively focus on frequency components during frequency-domain learning. Karami et al. proposed a system that recognizes static gestures of alphabetic characters in Persian Sign Language (PSL) by combining a wavelet transform and neural networks [63]. Jamali et al. introduced the Haar wavelet transform in deep CNNs [64] to facilitate effective feature extraction, thereby enhancing the classification accuracy of polarimetric synthetic-aperture radar imagery. Meanwhile, HDNet employed frequency-domain learning to improve reconstruction quality. Zhang et al. employed the fast Fourier transform (FFT) during subaperture decomposition to enhance feature extraction for SAR target recognition [65]. Wan et al. proposed a frequency-domain spectral feature module that performs FFT to extract multiscale and multi-frequency features, enhancing the discrimination of local features and global patterns in HSI classification [66]. Inspired by the above works, we propose a network module to learn information from the Fourier frequency components of HSIs.

3. Image Formation Model of SD-CASSI

As shown in Figure 1, the SD-CASSI system consists of an objective lens, a coding mask, a disperser, two relay lenses, and a monochromatic sensor. The input HSI cube is imaged by the objective lens and then spatially modulated by the encoding mask. Next, the coded HSI is spectrally sheared by the disperser. Finally, the spatially sheared and coded HSI is captured compressively by the sensor, creating a 2D compressed image.
For an SD-CASSI system, it is assumed that the input 3D HSI cube is X_0 ∈ ℝ^{C×H×W}, where H and W are the spatial dimensions and C is the spectral dimension. Furthermore, let the coded aperture be M_0 ∈ ℝ^{H×W}. The modulated HSI cube X_1 after passing through the coded aperture can be expressed as
X_1(λ, h, w) = X_0(λ, h, w) · M_0(h, w)
where h and w are the spatial indices of the object plane corresponding to the H and W dimensions, respectively, while λ is the spectral index for the C dimension. After passing through the relay lenses and the disperser element, the dispersed HSI cube X_2(λ, m, n) is given by
X_2(λ, m, n) = X_1(λ, h, w + d(λ − λ_c))
where m and n are the spatial indices of pixels on the sensor plane, d denotes the linear dispersion step of the dispersive element, and λ_c denotes the center wavelength (note that this model assumes linear dispersion). The sensor captures X_2 and generates a 2D measurement Y, which is given by
Y = Σ_{λ=1}^{C} X_2(λ, :, :) + G
where G ∈ ℝ^{H×(W + d(λ − λ_c))} is random noise generated during the imaging process.
The core task in reconstructing an HSI is to retrieve the original HSI cube from the measurement Y obtained by the SD-CASSI system.
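Below is a minimal sketch of the SD-CASSI forward model in Equations (1)–(3), assuming a linear dispersion of `step` pixels per band along the W axis and additive Gaussian noise for G; the function and variable names (`sd_cassi_forward`, `hsi`, `mask`) are illustrative, not taken from any released code.

```python
import numpy as np

def sd_cassi_forward(hsi: np.ndarray, mask: np.ndarray, step: int = 2,
                     noise_std: float = 0.0) -> np.ndarray:
    """hsi: (C, H, W) input cube; mask: (H, W) coded aperture."""
    C, H, W = hsi.shape
    coded = hsi * mask[None, :, :]                  # Eq. (1): spatial modulation
    sheared = np.zeros((C, H, W + step * (C - 1)))  # room for the dispersion shift
    for c in range(C):                              # Eq. (2): band-dependent shift
        sheared[c, :, step * c: step * c + W] = coded[c]
    y = sheared.sum(axis=0)                         # Eq. (3): sensor integration
    if noise_std > 0:
        y = y + np.random.normal(0.0, noise_std, y.shape)
    return y
```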

4. Problem Formulation

As shown in Figure 2, the E2E model for spectral image reconstruction in the CASSI system is a mapping from the HSIs of targets to the system measurements via a neural network. This mapping can be expressed as
X_pred = F(Y)
where X_pred is the reconstructed HSI, Y is the measurement captured by the CASSI system, and F denotes the mapping between them. The E2E networks commonly employed for CASSI reconstruction in recent years are mainly single-branch networks that learn spatial and spectral features.
Currently, these network methods rarely include the learning of spatial-frequency features for data reconstruction in the SD-CASSI system. To incorporate frequency-domain features, we propose a two-branch E2E network that extracts and fuses the holistic HSI information and frequency-band-specific details. Before presenting our reconstruction model, we first introduce our frequency-domain processing approach.

4.1. Frequency-Division Transformation

For spectral images, although the image of each wavelength captures the same scene, the spatial profiles across different wavelengths exhibit both similarities and disparities. The similarities correspond to the low-frequency components of the spatial-frequency information, while the differences correspond to the high-frequency components, which represent band-specific detail variations. Notably, existing neural network methods for HSI reconstruction predominantly utilize spatial information, with limited utilization of information in the spatial-frequency domain.
In the present study, to address this gap, we propose frequency-division transformation (FDT) to decompose spectral images into distinct spatial-frequency components, enabling the network to explicitly learn frequency-specific features. The procedure for FDT is shown in Figure 3.
The input, an arbitrary 3D data cube, is represented as X ∈ ℝ^{C×H×W}. First, an FFT is performed on the input X to calculate the amplitude diagrams A ∈ ℝ^{C×U×V} and phase diagrams P ∈ ℝ^{C×U×V}:
A, P = FFT(X)
where U and V are dimensions in the frequency domain, corresponding, respectively, to the spatial dimensions H and W.
Since the Fourier transform of a real-valued image exhibits conjugate symmetry, a symmetric division of its spectrogram is principled and necessary. This approach inherently respects the physical structure of the frequency domain, ensuring that redundant symmetric component pairs are processed cohesively. Consequently, it enhances parameter efficiency and prevents the model from learning inconsistent representations from spectrally identical regions.
Taking the center of the spectrogram as the origin, the entire image domain is defined over u ∈ [−U/2, U/2], v ∈ [−V/2, V/2]. Suppose that the spectrogram needs to be divided into P segments. By symmetrically dividing both axes, the frequency bands of the positive half-axis of U can be derived as (0, U/(2P)), (U/(2P), U/P), …, ((P−1)U/(2P), U/2). Similarly, the frequency bands of the V-axis can be obtained. Combining these bands from both axes yields a segmented magnitude spectrum [a_1, a_2, …, a_P]. This process is defined by the following function:
[a_1, a_2, …, a_P] = Divide(A)
Subsequently, the image components at different frequencies are retrieved by applying the inverse fast Fourier transform (IFFT) to each amplitude spectrum component together with the phase spectrum:
x_p = IFFT(a_p, P)
A total of P component images are generated; they are denoted as {x_p | p = 1, 2, …, P}. Finally, these P images are concatenated along the channel dimension, resulting in the output data X_FDT ∈ ℝ^{CP×H×W}. The pseudocode for the FDT operation described above is summarized in Algorithm 1.
Algorithm 1 Frequency-Division Transformation
Input: Data cube X ∈ ℝ^{C×H×W}
1: Calculate the amplitude map A ∈ ℝ^{C×U×V} and phase map P ∈ ℝ^{C×U×V}: A, P = FFT2(X, dim = (H, W))
2: Divide the amplitude map A into P sub-maps [a_1, a_2, …, a_P]
3: for p = 1 to P do
4:     x_p = IFFT(a_p, P)
5: end for
Output: Data cube after division X_FDT ∈ ℝ^{CP×H×W}
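A minimal PyTorch sketch of Algorithm 1 follows. The exact rule for "combining bands from both axes" is not fully specified above, so this sketch assumes each band keeps the square annulus of frequencies whose larger normalized |u|/|v| offset falls in that band; the function name `fdt` and the centering via fftshift are illustrative choices.

```python
import torch

def fdt(x: torch.Tensor, P: int = 5) -> torch.Tensor:
    """x: (B, C, H, W) real feature map -> (B, C*P, H, W) band components."""
    H, W = x.shape[-2:]
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    amp, phase = spec.abs(), spec.angle()
    # distance of each frequency bin from the centre, normalized to [0, 1]
    u = torch.linspace(-1, 1, H, device=x.device).abs()
    v = torch.linspace(-1, 1, W, device=x.device).abs()
    r = torch.maximum(u[:, None], v[None, :])        # square "rings"
    outs = []
    for p in range(P):
        band = (r >= p / P) & (r < (p + 1) / P + (1e-6 if p == P - 1 else 0))
        a_p = amp * band                              # masked amplitude, full phase
        comp = torch.fft.ifft2(torch.fft.ifftshift(a_p * torch.exp(1j * phase),
                                                   dim=(-2, -1)))
        outs.append(comp.real)
    return torch.cat(outs, dim=1)                     # (B, C*P, H, W)
```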

4.2. Reconstruction Model

In the algorithms addressing the restoration problem of the SD-CASSI system, the measurement Y serves as input, while the prediction value X pred constitutes the output.
The framework of the reconstruction model is shown in Figure 2. With the feature extraction operation F_input, a feature map Z ∈ ℝ^{C×H×W} can be calculated as
Z = F_input(Y)
Here, Z encapsulates all the spatial- and spectral-domain information.
The FDT operation is applied to decompose Z into spatial-frequency band components as follows:
F = FDT(Z) = [f_1, f_2, …, f_P]
The spatial–spectral module (SSM), denoted as F_SSM, is designed to map Z onto the base spatial and spectral information of the input HSI cube X_0:
S_pred = F_SSM(Z)
Assume that the true values of the different spatial-frequency bands of the targets' HSIs are represented by X_FDT. In order to achieve results close to the spatial-frequency components of X_FDT, the frequency-division module (FDM), denoted as F_FDM, is designed to map f_p to the predicted value f_pred^p:
f_pred^p = F_FDM(f_p)
As a result, the F_FDM process generates the predicted F_pred = [f_pred^1, f_pred^2, …, f_pred^P]. Finally, the two pieces of information, S_pred and F_pred, are fused to obtain the predicted HSI value X_pred:
X_pred = F_fusion(S_pred, F_pred)
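The composition of Equations (8)–(12) can be summarized with the following hypothetical sketch; `f_input`, `ssm`, `fdm`, and `fusion` stand in for the corresponding modules of Figure 2 and are assumed to be callables (e.g., nn.Modules) defined elsewhere.

```python
def reconstruct(y, f_input, ssm, fdm, fusion):
    z = f_input(y)                 # Eq. (8): shared feature map Z
    s_pred = ssm(z)                # Eq. (10): spatial-spectral prediction S_pred
    f_pred = fdm(z)                # Eqs. (9) and (11): per-band predictions F_pred
    return fusion(s_pred, f_pred)  # Eq. (12): fused HSI prediction X_pred
```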

5. Proposed Method

In this paper, we propose an E2E network for the reconstruction of CASSI. This method restores HSIs from compressed measurements by extracting spatial–spectral and frequency features from different spatial frequency bands. It is named the spatial/spectral-frequency adaptive network (SSFAN).

5.1. Network Framework

The framework of SSFAN is illustrated in Figure 2. The initial part of the network consists of a convolution layer followed by several residual blocks (RBs), which correspond to the function F_input in the reconstruction model. Through this initial processing, the feature map Z is obtained. The subsequent section splits into two branches. The right branch is designed to learn the spatial–spectral features of the HSI, corresponding to the role of the function F_SSM in the reconstruction model. The left branch, the FDM, aims to learn the features of different spatial-frequency components of HSIs and corresponds to the function F_FDM in the reconstruction model. When Z is fed into these two branches, they learn the features of HSIs, producing the outputs S_pred and [f_pred^1, f_pred^2, …, f_pred^P], respectively. These outputs are fed into the final part of the network to generate the final reconstructed HSIs. This part consists of several residual blocks that correspond to the F_fusion function.

5.2. Spatial–Spectral Module (SSM)

As shown in Figure 4a, the SSM is designed to implement the F_SSM function. It takes the feature map Z as input and generates S_pred as output. The number of channels of Z is first increased through a convolution layer. Then, the cube is fed into a U-shaped network constructed with non-local spatial attention blocks (NSABs). As illustrated in Figure 5a, the NSAB comprises two layer normalization (LN) layers, a non-local spatial attention (NSA) module, and a feedforward network (FFN). The FFN, shown in Figure 5c, consists of two convolution (CONV) layers, a Gaussian error linear unit (GELU), and a depthwise convolution (DWConv).
The architecture of the NSA module is shown in Figure 5a. First, the input X_in ∈ ℝ^{C×H×W} is shifted by L/2. Then, the shifted data are divided into several data cubes [x_1, x_2, …, x_n], where the size of each cube x_i is C × L × L (i = 1, 2, …, n). For each cube x_i, three group networks are utilized to transform it into query vectors Q_i, key vectors K_i, and value vectors V_i. These vectors have a size of h × L² × C, where h represents the number of heads in the multi-head attention mechanism:
Q_i = x_i W_Q
K_i = x_i W_K
V_i = x_i W_V
Then, the attention output is given by
Attention(Q_i, K_i, V_i) = V_i · Softmax(K_i^T Q_i / β)
The attention map has a size of h × L² × L². This map guides the network to learn features at specific spatial positions.
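A minimal single-head sketch of the windowed attention inside the NSA is given below, written as standard scaled dot-product attention over L × L windows; the multi-head split and learnable scale of the full NSAB are reduced to plain matrices here, and all names (`nsa_window_attention`, `w_q`, `w_k`, `w_v`) are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def nsa_window_attention(x: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor,
                         w_v: torch.Tensor, L: int = 8) -> torch.Tensor:
    """x: (B, C, H, W) with H and W divisible by L; w_*: (C, C) projections."""
    B, C, H, W = x.shape
    x = torch.roll(x, shifts=(L // 2, L // 2), dims=(-2, -1))        # window shift
    # partition into non-overlapping L x L windows -> (B*nWin, L*L, C) tokens
    win = x.unfold(2, L, L).unfold(3, L, L)                          # B,C,nH,nW,L,L
    win = win.permute(0, 2, 3, 4, 5, 1).reshape(-1, L * L, C)
    q, k, v = win @ w_q, win @ w_k, win @ w_v                        # Eqs. (13)-(15)
    attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)     # L*L x L*L map
    out = attn @ v
    nH, nW = H // L, W // L
    out = out.reshape(B, nH, nW, L, L, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
    return torch.roll(out, shifts=(-(L // 2), -(L // 2)), dims=(-2, -1))
```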

5.3. Frequency-Division Module (FDM)

As shown in Figure 4b, the FDM is designed to implement the function F_FDM. As shown in Figure 2, the feature map Z is taken as input, and F_pred is generated as output. The FDT-P block transforms the input into the spatial-frequency map F = [f_1, f_2, …, f_P]. The FDM includes multiple non-local channel attention blocks (NCABs). The structure of the NCAB is detailed in Figure 5b.
The data volume of this feature map is extremely large. The input feature map size is C × H × W, and the number of parts produced by the spatial-frequency division is P. Accordingly, the size of F_pred is CP × H × W. A comparative analysis of the chosen values of P is presented in Section 6.1. Assuming C = 28 and P = 5, there are 140 channels to be processed. It is clearly difficult to train a network that processes 140-channel data with limited memory. Moreover, the features extracted from different spatial-frequency divisions do not have an equally significant impact on the final reconstruction results.
In order to reduce the computational cost, we compress the channel dimension using the channel compression module (CCM), obtaining a smaller feature map X_FDC ∈ ℝ^{C′×H×W} (C′ < CP). This feature map is processed by the network to retrieve frequency information, and its channel dimension is then expanded back to CP by the channel expansion module (CEM).
The CCM consists of three convolution layers and three ReLU activation layers. The first convolution layer, with both input and output channels set to CP, extracts features from the output of the FDT. The second convolution layer, with CP input channels and C′ output channels, compresses the channel information. The third convolution layer, with C′ input and output channels, organizes the compressed features for subsequent modules. The architecture is designed to approximate a sufficiently complex function that retains the most useful information from the frequency division while minimizing the reconstruction loss. The process can be expressed as the following equation:
CCM(x) = ReLU(Conv_{C′→C′}(ReLU(Conv_{CP→C′}(ReLU(Conv_{CP→CP}(x))))))
The CEM also consists of three convolution layers and three ReLU activation layers, with the input and output channel configurations of the convolution layers being exactly the opposite of those in the CCM. Therefore, the process can be expressed by the following equation:
CEM(x) = ReLU(Conv_{CP→CP}(ReLU(Conv_{C′→CP}(ReLU(Conv_{C′→C′}(x))))))
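A hedged sketch of the CCM/CEM expressions above is given below, assuming 1 × 1 convolutions (the kernel size is not stated here) and C′ < CP; the factory functions `make_ccm`/`make_cem` are illustrative names.

```python
import torch.nn as nn

def make_ccm(cp: int, c_prime: int) -> nn.Sequential:
    """Compress C*P channels to C' channels (three conv + ReLU stages)."""
    return nn.Sequential(
        nn.Conv2d(cp, cp, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(cp, c_prime, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_prime, c_prime, kernel_size=1), nn.ReLU(inplace=True),
    )

def make_cem(cp: int, c_prime: int) -> nn.Sequential:
    """Expand C' channels back to C*P channels (mirror of the CCM)."""
    return nn.Sequential(
        nn.Conv2d(c_prime, c_prime, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_prime, cp, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(cp, cp, kernel_size=1), nn.ReLU(inplace=True),
    )

# e.g. C = 28, P = 5, C' = 56 (the best setting in Section 6.2): 140 -> 56 -> 140
ccm, cem = make_ccm(28 * 5, 56), make_cem(28 * 5, 56)
```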
The NCAB serves as the core learning structure in the FDM. Analogous to the NSAB, it is composed of two LN layers, an FFN, and a non-local channel attention (NCA) module. There are two reasons for adopting NCA over NSA. First, the FDT concatenates the HSI data of different frequencies along the channel dimension, so the information of different frequency components is stored in the channel dimension. Second, the spectral data of the same pixel in a hyperspectral image are also encoded within the channel dimension.
As with the NSA module, cubes are generated by shifting and splitting the input data. For each cube, Q_i, K_i, and V_i are derived via Equations (13)–(15), and the attention output is computed using Equation (16). In NCA, unlike NSA, these vectors have a size of h × C × L², and the attention map has a size of h × C × C. The feature map is finally restored to the input image size by fusion operations and a series of residual blocks.
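The channel-wise counterpart of the window attention sketched above can be illustrated as follows: the tokens are the C channel vectors of each window, so the attention map is C × C rather than L² × L². This is again a single-head, unshifted illustration with assumed names, reusing the (N, L·L, C) window tokens produced in the NSA sketch.

```python
import torch
import torch.nn.functional as F

def nca_window_attention(win_tokens: torch.Tensor, w_q: torch.Tensor,
                         w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """win_tokens: (N, L*L, C); w_*: (C, C) projection weights."""
    q = (win_tokens @ w_q).transpose(-2, -1)           # (N, C, L*L)
    k = (win_tokens @ w_k).transpose(-2, -1)
    v = (win_tokens @ w_v).transpose(-2, -1)
    scale = q.shape[-1] ** 0.5                          # simple 1/sqrt(L*L) scaling
    attn = F.softmax(q @ k.transpose(-2, -1) / scale, dim=-1)   # (N, C, C) map
    return (attn @ v).transpose(-2, -1)                 # back to (N, L*L, C)
```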

5.4. Loss Function

To ensure that the neural network output aligns closely with the ground truth (GT), we adjust parameters not only by comparing the final output to the GT in the spatial domain but also by comparing the local outputs of a specific module to the GT in the frequency domain. Therefore, the loss function of the network consists of two main components: the base loss and the frequency-division loss.
The mean squared error (MSE) loss is a standard choice for regression tasks, effectively ensuring pixelwise spectral fidelity and promoting the smoothness of reconstructed features. We therefore adopt the ℓ2 (MSE) loss to calculate each loss term, as expressed in the following equation:
Loss_MSE = (1/N) Σ_{j=1}^{N} (x_j − x̂_j)²
The base loss is given by the following equation:
Loss_BASE = MSE(X_gt, X_pred)
where X_gt is the GT image and X_pred is the final output of the network.
As shown in Figure 6, the frequency-division loss is calculated from the frequency components of the predicted output (FD_pred = F_FDM(X_pred)) and of the GT (FD_gt = F_FDM(X_gt)), which is given by the following equation:
Loss_FDT = MSE(FD_pred, FD_gt)
The final loss function is expressed as follows:
Loss_ALL = Loss_FDT + η · Loss_BASE
where η is a weighting coefficient that balances the two loss terms.
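A sketch of the combined objective is given below. It assumes the frequency components of both prediction and GT are obtained with the fixed `fdt()` transformation from the sketch in Section 4.1 (the text above writes F_FDM; applying the same fixed transformation to prediction and GT is simply the reading this sketch takes), and uses η = 1 as in Section 6.1.

```python
import torch.nn.functional as F

def ssfan_loss(x_pred, x_gt, eta: float = 1.0, P: int = 5):
    loss_base = F.mse_loss(x_pred, x_gt)                  # spatial-domain term
    loss_fdt = F.mse_loss(fdt(x_pred, P), fdt(x_gt, P))   # frequency-division term
    return loss_fdt + eta * loss_base
```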

6. Experimental Results

In this section, ablation experiments are conducted on the proposed network, and comparison experiments are conducted with multiple algorithms. The proposed method is compared with several E2E networks: λ-net [37], TSA-Net [38], HDNet [39], MST-L [40], MST++ [41], CST-L [42], and DWMT [43]. The experiments are performed and evaluated on both synthetic and real datasets.
The peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and spectral angle mapper (SAM) are employed to evaluate the reconstruction performance of different methods. For the spatial evaluation indicators PSNR and SSIM, we calculate them for each 2D spectral image and then compute the average values across all spectral bands as the final results. For the spectral indicator SAM, the values of each spatial point are calculated and then averaged to generate the final value.
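The band-averaged PSNR and pixel-averaged SAM described above can be computed as in the following illustrative snippet (SSIM averaging follows the same per-band pattern); scikit-image is assumed only for the PSNR call, and all names are illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio as psnr

def mean_psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (C, H, W); PSNR per band, then averaged over bands."""
    return float(np.mean([psnr(gt[c], pred[c], data_range=gt[c].max())
                          for c in range(gt.shape[0])]))

def mean_sam(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """SAM per spatial pixel (angle between spectral vectors), then averaged."""
    num = (pred * gt).sum(axis=0)
    den = np.linalg.norm(pred, axis=0) * np.linalg.norm(gt, axis=0) + eps
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))
```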

6.1. Experiment Setup

In the simulation experiments, two datasets, namely, CAVE [67] and KAIST [68], are adopted to generate compressed 2D measurements of the SD-CASSI system based on the model presented in Section 3. Ten scenes from the KAIST dataset are used as the test set. In the real experiment, HSIs of five real scenes captured by an SD-CASSI system [38] are used for comparison. The real HSIs consist of 28 spectral channels ranging from 450 nm to 650 nm, with a dispersion of two pixels per channel.
The proposed SSFAN is implemented using PyTorch 2.7.1. The network models for quantitative comparisons and ablation studies are trained with the Adam optimizer [69] and cosine annealing scheme [70] for 500 epochs on an NVIDIA GeForce RTX 4090 (NVIDIA Corporation; Santa Clara, CA, USA). The training uses a batch size of five and input patches of size 256 × 256. The loss weight η is set to one. The initial learning rate during training is 2 × 10 4 .
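The training configuration described above (Adam, cosine annealing, 500 epochs, initial learning rate 2 × 10⁻⁴) could be set up roughly as follows; `model` and `train_loader` are placeholders for the SSFAN network and the CAVE/KAIST patch loader, and `ssfan_loss` refers to the loss sketch in Section 5.4.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)

for epoch in range(500):
    for y, x_gt in train_loader:            # 2D measurement, ground-truth HSI
        optimizer.zero_grad()
        x_pred = model(y)
        loss = ssfan_loss(x_pred, x_gt)     # combined loss from Section 5.4
        loss.backward()
        optimizer.step()
    scheduler.step()
```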

6.2. Ablation Study

In this section, we conduct several ablation experiments to investigate the hyperparameters of the models and the effectiveness of their different components. The baseline model is a smaller version of SSFAN in which only the number of network layers is reduced.
First, the optimal number of frequency bands in the FDT for the reconstruction results is investigated. Table 1 shows the average PSNR, SSIM and SAM values of ten test scenes in the test dataset. Five different numbers of divided bands are tested in the FDT. The best performance is achieved when the frequency band number is set to five.
The optimal channel numbers in the CCM and CEM are then explored. In this experiment, the channel numbers in the CCM and CEM are set to 28, 40, 56, and 64. According to Table 2, the model performance does not improve monotonically with increasing channel number. The best results are obtained when the channel number is set to 56. The ablation study in Table 3 clearly highlights the compression capability of the CCM and CEM. The proposed model requires only 117.28 GFLOPs and 6.34 M parameters, which is a decrease of about 61.4% and 69.5%, respectively, from the baseline model that operates without them.
To verify the effectiveness of the SSM and FDM, module ablation experiments are conducted. The experiments include three scenarios: one using only the SSM, one using only the FDM, and one using both modules. Table 4 summarizes the results. When only the SSM is used, higher PSNR and SAM values but lower SSIM are obtained than when only the FDM is used. When both modules are used, the best overall performance is achieved. These results indicate that the SSM can provide some spatial information and reduce the noise in HSIs, while the FDM can contribute more structural information and spectral information from the image.
The impact of the FDT is also investigated. In order to isolate the effect of the FDT from that of simply increasing the number of channels, a comparison is made between the FDT and a convolution layer that achieves the same channel-number expansion. The results are presented in Table 5. The network employing the FDT achieves a 0.33 dB improvement over the network using the convolution layer. This finding confirms that the FDT can capture more information about HSIs than the conventional option of increasing the number of channels.

6.3. Performance Evaluation on a Simulation Dataset

Table 6 summarizes the PSNR, SSIM, and SAM values of eight different methods across ten scenes on the simulation dataset. Compared with λ-Net, TSA-Net, HDNet, MST-L, MST++, CST-L, and DWMT, the proposed SSFAN demonstrates superior reconstruction performance. Specifically, SSFAN achieves an average PSNR improvement of 0.96 dB over CST-L. In terms of spectral accuracy, SSFAN reduces the average SAM by 0.016 rad compared to CST-L.
Visualizations of reconstruction results for eight methods in scenes 5 and 7 are illustrated in Figure 7 and Figure 8. For the 28 spectral bands, the 481.5 nm, 544.0 nm, 575.5 nm, and 636.5 nm bands of HSIs are selected, and these wavelength data are converted into images. Additionally, each image includes zoomed-in patches of the regions of interest (ROIs) within white rectangles. Notably, SSFAN reconstructs HSIs with more accurate image structural information and finer details.
The spectral curves and the correlation coefficients of the spectral curves for selected points of scenes 2 and 3 are illustrated in Figure 9. The correlation coefficient corr is given by
corr(X, Y) = cov(X, Y) / (σ_X σ_Y)
Here, cov denotes the covariance, and σ_X and σ_Y denote the standard deviations of X and Y. By comparing the reconstructed spectral signatures with those of other methods, we demonstrate that the proposed SSFAN achieves higher spectral accuracy than the other methods.
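The corr formula above can be evaluated directly for two spectral curves as in the short snippet below, which is equivalent to `np.corrcoef(x, y)[0, 1]`; the function name is illustrative.

```python
import numpy as np

def spectral_corr(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation between two 1D spectral curves."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float(cov / (x.std() * y.std()))
```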
In terms of the visual comparison shown in Figure 8, the textures in the wings of the bird are clearly reconstructed by SSFAN, whereas the structural information of the same region remains unclear in the results of the other methods. This is attributed to the proposed FDT module, which provides information on different spatial-frequency bands to the SSFAN network, thereby enhancing the reconstruction of fine structural details in images.

6.4. Performance Evaluation with Real CASSI Data

The effectiveness of SSFAN in reconstructing HSIs from real CASSI measurements is also evaluated. Following the same experimental setup as previous studies on HSI reconstruction of real CASSI images [34,38,40,42], SSFAN is trained using both the CAVE and KAIST datasets with the real mask. To investigate the anti-noise ability of the proposed network, 11-bit shot noise is added during the training process.
Figure 10 provides the visualization of four spectral bands of the HSIs reconstructed by different methods. Compared with other E2E methods, the spectral images reconstructed by SSFAN exhibit more spatial details. In the top two rows, corresponding to the short-wave bands, the MST and CST methods produce noisy artifacts on the left of the foreground target, whereas the results of SSFAN show no apparent noise. This indicates that our SSFAN is more effective in CASSI reconstruction. Furthermore, the spectral curves of the green branch (point A) and the red strawberry (point B) in the reconstructed HSIs are illustrated in Figure 11b,d for comparison. The spectral curves obtained by SSFAN match the colors of the selected points better than those generated by other methods.

7. Conclusions

In this paper, we propose a novel E2E hyperspectral image reconstruction framework for CASSI. The network, named SSFAN, integrates frequency-domain processing and cross-dimensional feature learning. The FDT is proposed to decompose HSIs into distinct Fourier frequency components, enabling the network to capture multiscale spatial-frequency dependencies. Leveraging the FDT, the network reconstructs HSIs by fusing spatial–spectral information with band-specific frequency features. Specifically, the SSM employs spatial transformers to extract local spatial structures and preserve spectral consistency, while the FDM utilizes channelwise attention to model the inter-correlations of specific Fourier frequency components. To balance computational efficiency and reconstruction quality, channel compression and expansion modules are incorporated to optimize feature dimensionality during frequency processing. Ablation experiments validate the contributions of specific modules and parameters to reconstruction performance. Extensive experiments on both simulated and real SD-CASSI datasets demonstrate that SSFAN outperforms state-of-the-art E2E methods. Visual comparisons reveal that SSFAN reconstructs fine textures (e.g., bird wings) and suppresses noise more effectively than existing methods, confirming its superiority in preserving spatial detail. Future research will explore perceptual and structural loss functions to further improve the visual quality of the reconstructed HSIs.

Author Contributions

Methodology, H.L.; software, H.L.; validation, X.Y.; formal analysis, H.L. and X.Y.; data curation, X.Y.; writing—original draft preparation, H.L.; writing—review and editing, X.Y., L.S. and Y.Y.; visualization, X.Y.; funding acquisition, L.S. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 61635002 and by the Fundamental Research Funds for the Central Universities.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Borengasser, M.; Hungate, W.S.; Watkins, R. Hyperspectral Remote Sensing: Principles and Applications; CRC Press: Boca Raton, FL, USA, 2007. [Google Scholar]
  2. Aburaed, N.; Alkhatib, M.Q.; Marshall, S.; Zabalza, J.; Al Ahmad, H. A review of spatial enhancement of hyperspectral remote sensing imaging techniques. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2275–2300. [Google Scholar] [CrossRef]
  3. Benediktsson, J.A.; Ghamisi, P. Spectral-Spatial Classification of Hyperspectral Remote Sensing Images; Artech House: Boston, MA, USA, 2015. [Google Scholar]
  4. Keshava, N. Distance metrics and band selection in hyperspectral processing with applications to material identification and spectral libraries. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1552–1565. [Google Scholar] [CrossRef]
  5. Fei, B. Hyperspectral imaging in medical applications. In Data Handling in Science and Technology; Elsevier: Amsterdam, The Netherlands, 2019; Volume 32, pp. 523–565. [Google Scholar]
  6. Lu, G.; Fei, B. Medical hyperspectral imaging: A review. J. Biomed. Opt. 2014, 19, 010901. [Google Scholar] [CrossRef] [PubMed]
  7. Shimoni, M.; Haelterman, R.; Perneel, C. Hypersectral imaging for military and security applications: Combining myriad processing and sensing techniques. IEEE Geosci. Remote Sens. Mag. 2019, 7, 101–117. [Google Scholar] [CrossRef]
  8. Yuen, P.W.; Richardson, M. An introduction to hyperspectral imaging and its application for security, surveillance and target acquisition. Imaging Sci. J. 2010, 58, 241–253. [Google Scholar] [CrossRef]
  9. Briottet, X.; Boucher, Y.; Dimmeler, A.; Malaplate, A.; Cini, A.; Diani, M.; Bekman, H.; Schwering, P.; Skauli, T.; Kasen, I.; et al. Military applications of hyperspectral imagery. In Proceedings of the Targets and Backgrounds XII: Characterization and Representation, Orlando (Kissimmee), FL, USA, 17–18 April 2006; SPIE: Bellingham, WA, USA, 2006; Volume 6239, pp. 82–89. [Google Scholar]
  10. Xiong, F.; Zhou, J.; Qian, Y. Material based object tracking in hyperspectral videos. IEEE Trans. Image Process. 2020, 29, 3719–3733. [Google Scholar] [CrossRef]
  11. Van Nguyen, H.; Banerjee, A.; Chellappa, R. Tracking via object reflectance using a hyperspectral video camera. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 44–51. [Google Scholar]
  12. Yuan, X.; Brady, D.J.; Katsaggelos, A.K. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Process. Mag. 2021, 38, 65–88. [Google Scholar] [CrossRef]
  13. Gao, L.; Wang, L.V. A review of snapshot multidimensional optical imaging: Measuring photon tags in parallel. Phys. Rep. 2016, 616, 1–37. [Google Scholar] [CrossRef]
  14. Hu, H.; Zhou, H.; Xu, Z.; Li, Q.; Feng, H.; Chen, Y.; Jiang, T.; Xu, W. Practical snapshot hyperspectral imaging with DOE. Opt. Lasers Eng. 2022, 156, 107098. [Google Scholar] [CrossRef]
  15. Xu, N.; Xu, H.; Chen, S.; Hu, H.; Xu, Z.; Feng, H.; Li, Q.; Jiang, T.; Chen, Y. Snapshot hyperspectral imaging based on equalization designed doe. Opt. Express 2023, 31, 20489–20504. [Google Scholar] [CrossRef]
  16. Wagadarikar, A.; John, R.; Willett, R.; Brady, D. Single disperser design for coded aperture snapshot spectral imaging. Appl. Opt. 2008, 47, B44–B51. [Google Scholar] [CrossRef] [PubMed]
  17. Llull, P.; Liao, X.; Yuan, X.; Yang, J.; Kittle, D.; Carin, L.; Sapiro, G.; Brady, D.J. Coded aperture compressive temporal imaging. Opt. Express 2013, 21, 10526–10545. [Google Scholar] [CrossRef]
  18. He, K.; Wang, X.; Wang, Z.W.; Yi, H.; Scherer, N.F.; Katsaggelos, A.K.; Cossairt, O. Snapshot multifocal light field microscopy. Opt. Express 2020, 28, 12108–12120. [Google Scholar] [CrossRef]
  19. Arce, G.R.; Brady, D.J.; Carin, L.; Arguello, H.; Kittle, D.S. Compressive coded aperture spectral imaging: An introduction. IEEE Signal Process. Mag. 2013, 31, 105–115. [Google Scholar] [CrossRef]
  20. Yin, X.; Su, L.; Chen, X.; Liu, H.; Yan, Q.; Yuan, Y. Hyperspectral Image Reconstruction of SD-CASSI Based on Nonlocal Low-Rank Tensor Prior. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  21. Gehm, M.E.; John, R.; Brady, D.J.; Willett, R.M.; Schulz, T.J. Single-shot compressive spectral imaging with a dual-disperser architecture. Opt. Express 2007, 15, 14013–14027. [Google Scholar] [CrossRef] [PubMed]
  22. Xu, P.; Liu, L.; Jia, Y.; Zheng, H.; Xu, C.; Xue, L. A refinement boosted and attention guided deep FISTA reconstruction framework for compressive spectral imaging. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  23. Bioucas-Dias, J.M.; Figueiredo, M.A. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Trans. Image Process. 2007, 16, 2992–3004. [Google Scholar] [CrossRef]
  24. Liu, Y.; Yuan, X.; Suo, J.; Brady, D.J.; Dai, Q. Rank minimization for snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2990–3006. [Google Scholar] [CrossRef]
  25. Yuan, X. Generalized alternating projection based total variation minimization for compressive sensing. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2539–2543. [Google Scholar]
  26. Zhang, S.; Wang, L.; Fu, Y.; Zhong, X.; Huang, H. Computational hyperspectral imaging based on dimension-discriminative low-rank tensor recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10183–10192. [Google Scholar]
  27. Chan, S.H.; Wang, X.; Elgendy, O.A. Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Trans. Comput. Imaging 2016, 3, 84–98. [Google Scholar] [CrossRef]
  28. Qiao, M.; Liu, X.; Yuan, X. Snapshot spatial–temporal compressive imaging. Opt. Lett. 2020, 45, 1659–1662. [Google Scholar] [CrossRef]
  29. Meng, Z.; Yu, Z.; Xu, K.; Yuan, X. Self-supervised neural networks for spectral snapshot compressive imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2622–2631. [Google Scholar]
  30. Zheng, S.; Liu, Y.; Meng, Z.; Qiao, M.; Tong, Z.; Yang, X.; Han, S.; Yuan, X. Deep plug-and-play priors for spectral snapshot compressive imaging. Photonics Res. 2021, 9, B18–B29. [Google Scholar] [CrossRef]
  31. Yuan, X.; Liu, Y.; Suo, J.; Durand, F.; Dai, Q. Plug-and-play algorithms for video snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7093–7111. [Google Scholar] [CrossRef]
  32. Yuan, X.; Liu, Y.; Suo, J.; Dai, Q. Plug-and-play algorithms for large-scale snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1447–1457. [Google Scholar]
  33. Ma, J.; Liu, X.Y.; Shou, Z.; Yuan, X. Deep tensor admm-net for snapshot compressive imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10223–10232. [Google Scholar]
  34. Huang, T.; Dong, W.; Yuan, X.; Wu, J.; Shi, G. Deep gaussian scale mixture prior for spectral compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16216–16225. [Google Scholar]
  35. Cai, Y.; Lin, J.; Wang, H.; Yuan, X.; Ding, H.; Zhang, Y.; Timofte, R.; Gool, L.V. Degradation-aware unfolding half-shuffle transformer for spectral compressive imaging. Adv. Neural Inf. Process. Syst. 2022, 35, 37749–37761. [Google Scholar]
  36. Li, M.; Fu, Y.; Liu, J.; Zhang, Y. Pixel adaptive deep unfolding transformer for hyperspectral image reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12959–12968. [Google Scholar]
  37. Miao, X.; Yuan, X.; Pu, Y.; Athitsos, V. λ-net: Reconstruct hyperspectral images from a snapshot measurement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4059–4069. [Google Scholar]
  38. Meng, Z.; Ma, J.; Yuan, X. End-to-end low cost compressive spectral imaging with spatial-spectral self-attention. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 187–204. [Google Scholar]
  39. Hu, X.; Cai, Y.; Lin, J.; Wang, H.; Yuan, X.; Zhang, Y.; Timofte, R.; Van Gool, L. Hdnet: High-resolution dual-domain learning for spectral compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17542–17551. [Google Scholar]
  40. Cai, Y.; Lin, J.; Hu, X.; Wang, H.; Yuan, X.; Zhang, Y.; Timofte, R.; Van Gool, L. Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17502–17511. [Google Scholar]
  41. Cai, Y.; Lin, J.; Lin, Z.; Wang, H.; Zhang, Y.; Pfister, H.; Timofte, R.; Van Gool, L. Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 745–755. [Google Scholar]
  42. Cai, Y.; Lin, J.; Hu, X.; Wang, H.; Yuan, X.; Zhang, Y.; Timofte, R.; Van Gool, L. Coarse-to-fine sparse transformer for hyperspectral image reconstruction. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 686–704. [Google Scholar]
  43. Luo, F.; Chen, X.; Gong, X.; Wu, W.; Guo, T. Dual-window multiscale transformer for hyperspectral snapshot compressive imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 3972–3980. [Google Scholar]
  44. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  45. Bahdanau, D. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  47. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; proceedings, part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  48. Mehraban, S.; Adeli, V.; Taati, B. Motionagformer: Enhancing 3d human pose estimation with a transformer-gcnformer network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6920–6930. [Google Scholar]
  49. Zhou, S.; Chen, D.; Pan, J.; Shi, J.; Yang, J. Adapt or perish: Adaptive sparse transformer with attentive feature refinement for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2952–2963. [Google Scholar]
  50. Gao, J.; Zhang, Y.; Geng, X.; Tang, H.; Bhatti, U.A. PE-Transformer: Path enhanced transformer for improving underwater object detection. Expert Syst. Appl. 2024, 246, 123253. [Google Scholar] [CrossRef]
  51. Xu, L.; Bennamoun, M.; Boussaid, F.; Laga, H.; Ouyang, W.; Xu, D. Mctformer+: Multi-class token transformer for weakly supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8380–8395. [Google Scholar] [CrossRef]
  52. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  53. Chicchi, L.; Buffoni, L.; Febbe, D.; Giambagli, L.; Marino, R.; Fanelli, D. Automatic Input Feature Relevance via Spectral Neural Networks. arXiv 2024, arXiv:2406.01183. [Google Scholar] [CrossRef]
  54. Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet convolutional neural networks. arXiv 2018, arXiv:1805.08620. [Google Scholar] [CrossRef]
  55. Liu, G.; Zhou, W.; Geng, M. Automatic Seizure Detection Based on S-Transform and Deep Convolutional Neural Network. Int. J. Neural Syst. 2020, 30, 1950024. [Google Scholar] [CrossRef]
  56. Bauer, E.; Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Mach. Learn. 1999, 36, 105–139. [Google Scholar] [CrossRef]
  57. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  58. Xu, Z.Q.J.; Zhang, Y.; Luo, T.; Xiao, Y.; Ma, Z. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv 2019, arXiv:1901.06523. [Google Scholar] [CrossRef]
  59. Xu, K.; Qin, M.; Sun, F.; Wang, Y.; Chen, Y.K.; Ren, F. Learning in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1740–1749. [Google Scholar]
  60. Wei, D.; Deng, Y. Redistributed invariant redundant fractional wavelet transform and its application in watermarking algorithm. Expert Syst. Appl. 2025, 262, 125707. [Google Scholar] [CrossRef]
  61. Zhang, Y.; Zheng, P.; Jiang, J.; Xiao, P.; Gao, X. FCIR: Rethink aerial image super resolution with Fourier analysis. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  62. Jiang, L.; Dai, B.; Wu, W.; Loy, C.C. Focal frequency loss for image reconstruction and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13919–13929. [Google Scholar]
  63. Karami, A.; Zanj, B.; Sarkaleh, A.K. Persian sign language (PSL) recognition using wavelet transform and neural networks. Expert Syst. Appl. 2011, 38, 2661–2667. [Google Scholar] [CrossRef]
  64. Jamali, A.; Mahdianpari, M.; Mohammadimanesh, F.; Bhattacharya, A.; Homayouni, S. PolSAR image classification based on deep convolutional neural networks using wavelet transformation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  65. Zhang, H.; Wang, W.; Deng, J.; Guo, Y.; Liu, S.; Zhang, J. MASFF-Net: Multi-azimuth scattering feature fusion network for SAR target recognition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 19425–19440. [Google Scholar] [CrossRef]
  66. Wan, X.; Chen, F.; Mo, D.; Liu, H.; Li, Z.; Hu, K. FS-CGNet: Frequency Spectral-Channel Fusion and Cross-Scale Global Aggregation Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–22. [Google Scholar] [CrossRef]
  67. Yasuma, F.; Mitsunaga, T.; Iso, D.; Nayar, S.K. Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum. IEEE Trans. Image Process. 2010, 19, 2241–2253. [Google Scholar] [CrossRef]
  68. Choi, I.; Jeon, D.S.; Nam, G.; Gutierrez, D.; Kim, M.H. High-quality hyperspectral reconstruction using a spectral prior. ACM Trans. Graph. (TOG) 2017, 36, 1–13. [Google Scholar] [CrossRef]
  69. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  70. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
Figure 1. The principle of CASSI. The input HSI is encoded by an optical imaging system, and then an E2E network is used to reconstruct the HSI from the compressed measurement.
Figure 2. The proposed framework of the reconstruction model for CASSI.
Figure 3. The process of frequency-division transformation.
Figure 4. Schematic frameworks of proposed (a) spatial–spectral module and (b) frequency-division module.
Figure 5. Architectural schematics of principal blocks. (a) Structure of the non-local spatial attention block (NSAB), which integrates a non-local spatial attention (NSA) module to extract spatial attention features. (b) Structure of the non-local channel attention block (NCAB), which incorporates a non-local channel attention (NCA) module to extract channel attention features. (c) Configuration of the feedforward network (FFN).
Figure 6. The loss function of SSFAN.
Figure 7. The reconstructed spectral images of scene 5. (Left) RGB image and simulated SD-CASSI measurement. (Right) Visualizations of GT and reconstruction results in four spectral bands, which are illustrated in corresponding colors. The respective PSNR, SSIM, and SAM values of eight methods are as follows: TSA (29.39 dB/0.884/0.117), λ -Net (26.19 dB/0.817/0.285), HDNet (32.69 dB/0.946/0.089), MST-L (33.38 dB/0.947/0.101), MST++ (33.28 dB/0.952/0.081), CST-L (33.25 dB/0.955/0.078), DWMT (34.05 dB/0.962/0.071), and proposed SSFAN (34.85 dB/0.969/0.064).
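The PSNR and SAM values quoted in these captions follow their standard definitions; a minimal NumPy version is given below. The data range and the convention of averaging the spectral angle over all pixels are assumptions about the evaluation protocol, not details confirmed by the paper.

```python
import numpy as np

def psnr(gt: np.ndarray, rec: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB, assuming both cubes are scaled to [0, data_range]."""
    mse = np.mean((gt - rec) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))

def sam(gt: np.ndarray, rec: np.ndarray, eps: float = 1e-8) -> float:
    """Mean spectral angle (in radians) between per-pixel spectra of two (H, W, L) cubes."""
    gt_flat = gt.reshape(-1, gt.shape[-1])
    rec_flat = rec.reshape(-1, rec.shape[-1])
    cos = np.sum(gt_flat * rec_flat, axis=1) / (
        np.linalg.norm(gt_flat, axis=1) * np.linalg.norm(rec_flat, axis=1) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
```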
Figure 8. The reconstructed spectral images of scene 7. (Left) RGB image and simulated SD-CASSI measurement. (Right) Visualizations of GT and reconstruction results in four spectral bands, which are illustrated in corresponding colors. The respective PSNR, SSIM, and SAM values of the eight methods are as follows: TSA (20.32 dB/0.878/0.133), λ-Net (26.47 dB/0.806/0.246), HDNet (33.67 dB/0.926/0.114), MST-L (35.87 dB/0.925/0.110), MST++ (34.35 dB/0.934/0.103), CST-L (36.58 dB/0.944/0.099), DWMT (35.29 dB/0.948/0.092), and the proposed SSFAN (36.14 dB/0.955/0.086).
Figure 9. The reconstructed spectral curves of the following methods: TSA, λ-Net, HDNet, MST-L, MST++, CST-L, DWMT, and the proposed SSFAN. (a) RGB image and corresponding simulated SD-CASSI measurement of scene 2. (b) Reconstructed spectral curves of point A by eight E2E networks. The correlation coefficients between the reconstructed curves and the GT are listed. (c) RGB image and corresponding simulated SD-CASSI measurement of scene 3. (d) Reconstructed spectral curves of point B by eight E2E networks. The correlation coefficients between the reconstructed curves and the GT are listed.
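The correlation coefficients in Figure 9 compare a reconstructed point spectrum with its ground-truth counterpart; computing them amounts to a Pearson correlation over the spectral bands, as in the small example below (the exact normalization convention used in the paper is assumed, not confirmed).

```python
import numpy as np

def spectral_correlation(gt_curve: np.ndarray, rec_curve: np.ndarray) -> float:
    """Pearson correlation between a ground-truth and a reconstructed spectral curve."""
    return float(np.corrcoef(gt_curve, rec_curve)[0, 1])
```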
Figure 10. (Left) Reference RGB image and corresponding real SD-CASSI measurement. (Right) Visualization of four spectral bands reconstructed by eight E2E networks. Different spectral bands are illustrated in corresponding colors.
Figure 11. The reconstructed spectral curves of the following methods: TSA, λ-Net, HDNet, MST-L, MST++, CST-L, DWMT, and the proposed SSFAN. (a) Reference RGB image and corresponding real SD-CASSI measurement. (b) The spectral curves of point A reconstructed by eight E2E networks. (c) Reference RGB image and corresponding real SD-CASSI measurement. (d) The spectral curves of point B reconstructed by eight E2E networks.
Table 1. Ablation study of the number of frequency bands in the FDT on the simulation dataset; the average PSNR, SSIM and SAM are listed. The optimal results are highlighted in bold within the table.
Number of Frequency Bands | PSNR | SSIM | SAM
3 | 36.70 | 0.964 | 0.097
4 | 36.82 | 0.964 | 0.093
5 | 37.02 | 0.966 | 0.091
6 | 36.93 | 0.965 | 0.092
7 | 36.98 | 0.965 | 0.094
Table 2. Ablation study of the number of compressed channels in the CCM and CEM on the simulation dataset; the average PSNR, SSIM, and SAM, together with the computational cost (GFLOPs) and parameter count (Params), are listed. The optimal results are highlighted in bold within the table.
Number of Compressed Channels | PSNR | SSIM | SAM | GFLOPs | Params
28 | 36.73 | 0.964 | 0.097 | 89.46 | 4.24 M
40 | 36.91 | 0.964 | 0.095 | 99.29 | 4.97 M
56 | 37.02 | 0.965 | 0.091 | 117.28 | 6.34 M
64 | 36.39 | 0.963 | 0.100 | 128.44 | 7.19 M
Table 3. Ablation study of the CCM and CEM.
Model | GFLOPs | Params
w/o CCM and CEM | 304.08 | 20.82 M
w/ CCM and CEM | 117.28 | 6.34 M
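Table 3 shows the cost saving from channel compression and expansion. A minimal sketch of what such modules could look like, assuming plain 1 × 1 convolutions for the projections, is given below; the class names and the default of 56 compressed channels (taken from the best row of Table 2) are illustrative assumptions, not the paper's implementation.

```python
import torch.nn as nn

class ChannelCompression(nn.Module):
    """Illustrative CCM-style module: a 1x1 convolution that reduces the channel count
    before the heavy frequency-division branch to cut FLOPs and parameters."""
    def __init__(self, in_channels: int, compressed_channels: int = 56):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, compressed_channels, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

class ChannelExpansion(nn.Module):
    """Illustrative CEM-style module: a 1x1 convolution that restores the original
    channel count after frequency-domain processing."""
    def __init__(self, compressed_channels: int, out_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(compressed_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.proj(x)
```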
Table 4. Ablation study for sub-modules of the SFDN on the simulation dataset; the average PSNR, SSIM and SAM are listed. The optimal results are highlighted in bold within the table.
SSM | FDM | PSNR | SSIM | SAM
✓ |   | 36.91 | 0.963 | 0.100
  | ✓ | 36.77 | 0.965 | 0.976
✓ | ✓ | 37.02 | 0.965 | 0.091
Table 5. Ablation study of the FDT and convolution layer on the simulation dataset; the average PSNR, SSIM, and SAM are listed. The optimal results are highlighted in bold within the table.
Operation | PSNR | SSIM | SAM
FDT | 37.02 | 0.965 | 0.091
Convolution Layer | 36.69 | 0.965 | 0.095
Table 6. Comparison between the proposed network and open-source E2E methods on ten test scenes (S1–S10). The per-scene and average PSNR, SSIM, and SAM, together with the computational cost (GFLOPs) and parameter count (Params), are listed. The optimal results are highlighted in bold within the table.
Methods | Params | GFLOPs | Metric | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | Avg
λ-net | 62.64 M | 117.98 | PSNR | 30.10 | 28.46 | 27.73 | 37.01 | 26.19 | 28.64 | 26.47 | 26.09 | 27.50 | 27.13 | 28.53
 | | | SSIM | 0.849 | 0.805 | 0.870 | 0.934 | 0.817 | 0.853 | 0.806 | 0.831 | 0.826 | 0.816 | 0.841
 | | | SAM | 0.247 | 0.304 | 0.272 | 0.420 | 0.285 | 0.454 | 0.246 | 0.481 | 0.277 | 0.454 | 0.344
TSA-Net | 44.25 M | 110.06 | PSNR | 32.03 | 31.00 | 32.25 | 39.19 | 29.39 | 31.44 | 20.32 | 29.35 | 30.01 | 29.59 | 31.46
 | | | SSIM | 0.892 | 0.858 | 0.915 | 0.953 | 0.884 | 0.908 | 0.878 | 0.888 | 0.890 | 0.874 | 0.894
 | | | SAM | 0.152 | 0.181 | 0.129 | 0.146 | 0.117 | 0.169 | 0.133 | 0.199 | 0.134 | 0.167 | 0.153
HDNet | 2.37 M | 154.76 | PSNR | 35.14 | 35.67 | 36.03 | 42.30 | 32.69 | 34.46 | 33.67 | 32.48 | 34.89 | 32.38 | 34.97
 | | | SSIM | 0.935 | 0.940 | 0.943 | 0.969 | 0.946 | 0.952 | 0.926 | 0.941 | 0.942 | 0.937 | 0.943
 | | | SAM | 0.129 | 0.142 | 0.097 | 0.102 | 0.089 | 0.120 | 0.114 | 0.142 | 0.111 | 0.119 | 0.117
MST-L | 2.03 M | 28.15 | PSNR | 35.40 | 35.87 | 36.51 | 42.27 | 32.77 | 34.80 | 35.87 | 32.67 | 35.39 | 32.50 | 35.18
 | | | SSIM | 0.941 | 0.944 | 0.953 | 0.973 | 0.947 | 0.955 | 0.925 | 0.948 | 0.949 | 0.941 | 0.957
 | | | SAM | 0.123 | 0.142 | 0.106 | 0.130 | 0.101 | 0.137 | 0.110 | 0.180 | 0.130 | 0.146 | 0.131
MST++ | 1.33 M | 19.42 | PSNR | 35.80 | 36.23 | 37.34 | 42.63 | 33.38 | 35.38 | 34.35 | 33.71 | 36.67 | 33.38 | 35.99
 | | | SSIM | 0.943 | 0.947 | 0.957 | 0.973 | 0.952 | 0.957 | 0.934 | 0.953 | 0.953 | 0.945 | 0.951
 | | | SAM | 0.118 | 0.130 | 0.080 | 0.121 | 0.081 | 0.115 | 0.103 | 0.145 | 0.100 | 0.116 | 0.111
CST-L | 3.00 M | 27.81 | PSNR | 35.96 | 36.84 | 38.16 | 42.44 | 33.25 | 35.72 | 36.58 | 34.34 | 36.51 | 33.09 | 36.28
 | | | SSIM | 0.949 | 0.955 | 0.962 | 0.975 | 0.955 | 0.963 | 0.944 | 0.961 | 0.957 | 0.945 | 0.956
 | | | SAM | 0.116 | 0.118 | 0.082 | 0.099 | 0.078 | 0.104 | 0.099 | 0.119 | 0.101 | 0.102 | 0.102
DWMT | 14.48 M | 46.71 | PSNR | 36.54 | 37.85 | 38.59 | 44.66 | 34.05 | 36.24 | 35.29 | 34.62 | 37.52 | 34.05 | 36.94
 | | | SSIM | 0.957 | 0.963 | 0.964 | 0.984 | 0.962 | 0.970 | 0.948 | 0.967 | 0.965 | 0.958 | 0.964
 | | | SAM | 0.109 | 0.114 | 0.073 | 0.097 | 0.071 | 0.103 | 0.092 | 0.119 | 0.088 | 0.098 | 0.096
SSFAN | 6.34 M | 117.28 | PSNR | 37.41 | 38.85 | 38.19 | 43.84 | 34.85 | 37.07 | 36.14 | 34.80 | 37.42 | 33.82 | 37.24
 | | | SSIM | 0.965 | 0.971 | 0.965 | 0.985 | 0.969 | 0.973 | 0.955 | 0.969 | 0.965 | 0.959 | 0.968
 | | | SAM | 0.101 | 0.103 | 0.073 | 0.077 | 0.064 | 0.083 | 0.086 | 0.097 | 0.086 | 0.089 | 0.086