Article

ARM-Net: A Tri-Phase Integrated Network for Hyperspectral Image Compression

1
Liaoning General Aviation Academy, Shenyang 110136, China
2
School of Electronical and Information Engineering, Shenyang Aerospace University, Shenyang 110136, China
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(6), 1843; https://doi.org/10.3390/s25061843
Submission received: 13 February 2025 / Revised: 7 March 2025 / Accepted: 15 March 2025 / Published: 16 March 2025

Abstract

Most current hyperspectral image compression methods rely on well-designed modules to capture image structural information and long-range dependencies. However, these modules tend to increase computational complexity exponentially with the number of bands, which limits their performance under constrained resources. To address these challenges, this paper proposes a novel triple-phase hybrid framework for hyperspectral image compression. The first stage utilizes an adaptive band selection technique to sample the raw hyperspectral image, which mitigates the computational burden. The second stage concentrates on high-fidelity compression, efficiently encoding both spatial and spectral information within the sampled band clusters. In the final stage, a reconstruction network compensates for sampling-induced losses to precisely restore the original spectral details. The proposed framework, known as ARM-Net, is evaluated on seven mixed hyperspectral datasets. Compared to state-of-the-art methods, ARM-Net achieves an overall improvement of approximately 1–2 dB in both the peak signal-to-noise ratio and multiscale structural similarity index measure, as well as a reduction in the average spectral angle mapper of approximately 0.1.

1. Introduction

Hyperspectral imaging is a broadly adopted data acquisition technique in which hyperspectral sensors mounted on spaceborne or airborne platforms capture narrowband continuous spectral images of the Earth’s surface [1,2]. These sensors can capture a wide spectral range, from visible light to shortwave infrared, with each pixel containing spectral information that is reflected or radiated from multiple bands. Typically, a hyperspectral image (HSI) is represented as a three-dimensional data cube that captures spatial and spectral information about the physical world and characterizes the intrinsic optical properties of each location. Hyperspectral imaging has become increasingly popular in real-world agricultural and industrial monitoring due to its ability to detect subtle variations across numerous spectral bands. In precision agriculture, for example, HSI data enable detailed assessment of crop health and early detection of diseases or nutrient deficiencies. Such insights support better resource allocation and yield optimization while mitigating environmental impacts. In industrial settings, hyperspectral imaging can be employed for quality inspection and material identification, which facilitates faster and more reliable defect detection in manufacturing processes. Moreover, the growing availability of spaceborne or UAV-based hyperspectral sensors has paved the way for large-scale applications in monitoring, surveying, and real-time decision-making across diverse domains. Despite these clear advantages, the fine spectral resolution of HSIs generates large data volumes and exhibits significant spectral redundancy, which not only poses substantial challenges for storage and transmission but also limits its efficient application in downstream tasks [3,4,5,6,7]. As hyperspectral sensors proliferate in agricultural and industrial operations, the exponential growth of data volume compounds these bottlenecks. 
Hence, the efficient storage and transmission of HSI data have become critical issues that require urgent attention, prompting extensive research into advanced compression techniques and algorithmic innovations.
Transform coding has demonstrated significant effectiveness in HSI compression. A common approach involves applying Principal Component Analysis (PCA) to decorrelate the spectra and reduce the spectral dimensionality of HSIs. Subsequently, the dimensionally reduced data can be compressed in blocks or at multiple resolutions by the Joint Photographic Experts Group (JPEG) [8] or JPEG2000 [9] to further optimize data storage and transmission. However, these methods [8,9] inevitably lose critical spectral information and image details (e.g., textures and edges) at high compression ratios, resulting in significant degradation of image quality and making the compression techniques unsuitable for applications that require high fidelity. Autoencoders or variational autoencoders [10,11] based on rate distortion (RD) [12] have paved the way for new developments in lossy compression methods. In contrast to the aforementioned traditional mathematical methods, convolutional neural network-based codecs have effectively addressed the limitations of traditional algorithms by simultaneously optimizing reconstruction distortion and compression ratio. This architecture has been widely applied to compress RGB images [13,14,15,16,17,18,19], multispectral images [20], and HSIs [21,22,23,24,25,26], with optimizations tailored for each component of the framework.
Deep learning techniques have found extensive application in RGB image compression. These methods aim to optimize the compression rate or reduce reconstructive distortion by tailored modules. Specifically, Minnen [13] jointly combined an autoregressive mask and a hyperprior for accurate probabilistic modeling, but the introduction of serial operations resulted in longer decoding times. To tackle the decoding efficiency issue, He [14] introduced an optimized checkerboard context model, building upon the work of [13], to enable parallel decoding. This approach significantly enhances decoding efficiency by improving parallelization and computational effectiveness. In addition, the combination of a hyperprior and local context results in a limited global perspective. To overcome this limitation, Qian [15] introduced an approach integrating local and global contextual information with a hyperprior, resulting in an enhanced compression rate. However, the global context entropy model exhibits quadratic complexity, which makes it computationally intensive. To tackle this issue, Jiang [16] introduced a multi-reference entropy model with linear complexity, which effectively reduces the computational load. On the other hand, Cheng [17] utilized discretized Gaussian mixture likelihood to model pixel distribution and employed a non-local attention module for global feature extraction. Han [18] achieved efficient remote sensing image compression through a combination of an edge-guiding mechanism and adversarial learning, preserving image edges and texture details. Tang [19] addressed the limitations of traditional deep image compression methods through a combination of graph attention mechanisms and asymmetric convolutional neural networks based on autoregressive modeling.
HSIs contain more spectral information than RGB images, which introduces new challenges for compression tasks. In HSIs, the limited spatial resolution and substantial spectral overlap degrade compression performance. Kong [20] designed a neuroscience-based non-local attention module that captures both fine features of nearby pixels and large-scale features in the spatial domain. Additionally, a multi-scale spectral attention block was introduced to extract non-smooth spectral correlations at different scales. Guo [21] introduced an edge component in RD loss combined with an interactive dual attention module to enhance the integrated structure of the latent representations. Additionally, Guo [22] explored HSI channel-level relationships and used contrastive informative feature encoding to address the problem of collapsing and losing information attributes at high compression ratios. Rezasoltani [23] and Zhang [24] used implicit neural representations and neural radiance fields, respectively, to map HSIs to low-dimensional representations, thus effectively alleviating the compression difficulty. Byju [25] used 3D-Conv and squeeze-and-excitation attention modules jointly to eliminate HSI spatial and spectral redundancies. Mijares i Verdú [26] employed a channel clustering strategy to address the computational complexity and scalability issues.
Although the aforementioned methods have made significant progress in HSI compression, there is still potential for improvement in achieving more efficient compression. First, these well-designed modules focus only on specific dimensions (spatial or spectral) and fail to jointly eliminate their redundancies. Second, increasing the number of parameters in these networks is necessary to compress images with larger numbers of bands; that is, their performance cannot be maintained as the number of bands increases [26]. Moreover, well-designed feature extraction modules can lead to an explosion in floating-point operations (FLOPs) when handling high-dimensional inputs, which ultimately degrades network performance. To overcome these challenges, we propose a three-stage framework for hyperspectral compression. The first stage utilizes an adaptive band selector for initial spectral feature extraction on high-dimensional HSIs, followed by the extraction of key spectral features most relevant to subsequent processing. The selected band clusters aim to represent the complete information of HSIs while reducing redundant data, which optimizes the efficiency of the entire compression framework and enables a more efficient representation of HSI data. The second stage utilizes a compression network integrated with a multi-head recurrent spectral attention mechanism to perform secondary compression, which achieves the highest compression ratio. This architecture leverages multiple attention heads to selectively focus on different spectral features across various bands, which enhances the model’s ability to capture intricate relationships within the spectral data. Through a cyclical focus on different portions of the spectrum, the compression network ensures that the most relevant features are identified and prioritized, leading to a maximized compression ratio and minimized information loss. 
The third stage uses a multi-scale spatial-spectral attention-based reconstruction network for spatial and spectral reconstruction to decompress the original high-dimensional input. The network integrates local and global spatial and spectral information to construct a high-precision reconstruction network, which is refined from coarse to fine and then integrated to obtain a high-fidelity reconstruction of the original HSI.
The main contributions of this work are outlined as follows:
  • This research proposes an innovative three-stage hyperspectral compression framework, known as ARM-Net. ARM-Net consists of an adaptive band selector (ABS), a Recurrent Spectral Attention Compression Network (RSACN), and a Multi-Scale Spatial-Spectral Attention Reconstruction Network (MSSARN).
  • To alleviate the burden on the compression network, this paper introduces the ABS, which builds upon a common band selection mechanism used in hyperspectral lossless compression. By adaptively selecting band clusters with the highest information content, the ABS reduces the overall computational load of the framework.
  • To enhance hyperspectral compression, ARM-Net incorporates a multi-head recurrent spectral attention (MHRSA) module within its codec. MHRSA dynamically assigns attention weights to spectral bands, allowing the network to focus on the most relevant spectral features for compression. By leveraging multiple attention heads, the module captures diverse spectral interactions to preserve spectral consistency across bands, resulting in reduced redundancy and improved compression efficiency. This targeted weight adjustment approach is essential to address varying spectral pixel values, mitigating information loss that simple averaging methods may overlook.
  • To optimize hyperspectral reconstruction, we propose a Spatial-Spectral Attention Block (SSAB) within the reconstruction backbone of ARM-Net. The SSAB jointly models spatial and spectral dependencies to enhance reconstruction accuracy, which compensates for spatial detail loss during compression. Spectral-Wise Multi-Head Self-Attention (Spec-MSA) and Spatial Multi-Head Self-Attention (Spa-MSA) in the SSAB are linked by residuals to effectively compensate for the lack of spatial details in HSI reconstruction through spectral reconstruction (SR) networks. This versatile and efficient plug-and-play spatial-spectral attention mechanism captures fine-grained features across both spatial and spectral dimensions while preserving a linear relationship between spatial dimensions and computational complexity.
  • We comprehensively evaluate the network on our mixed hyperspectral dataset. Experimental results demonstrate that ARM-Net surpasses state-of-the-art (SOTA) approaches in terms of the peak signal-to-noise ratio (PSNR), multi-scale structural similarity index measure (MS-SSIM), and spectral angle mapper (SAM).

2. Methods

2.1. The Proposed Three-Stage Compression Framework

Figure 1 illustrates the overall structure of the proposed three-stage compression framework and highlights the key components of ARM-Net: the ABS, RSACN, and MSSARN. Let $X_{in} \in \mathbb{R}^{H \times W \times C}$ represent the input HSI, where $H$ denotes the height, $W$ the width, and $C$ the number of bands. The compression process of our framework for HSIs is as follows: $X_{in}$ is first passed through a band selection network to identify band clusters that effectively represent the entire HSI. Within ARM-Net, the upper triangular matrix of spectral correlation coefficients is computed, and the $\beta$ pairs of bands with the lowest scores are selected as the input band clusters $X_{\beta} \in \mathbb{R}^{H \times W \times 2\beta}$ for the compression network. The compression network performs secondary compression on $X_{\beta}$ to obtain the bitstream with the maximum compression ratio. Within ARM-Net, the RSACN further enhances the ability to eliminate spectral redundancy, resulting in a higher compression ratio. After the maximally compressed bitstream is obtained, the decoder reconstructs it into the sampled band clusters $\hat{X}_{\beta}$, which are then passed to the reconstruction network. Finally, the reconstruction network reconstructs $\hat{X}_{\beta}$ into the original-dimensional data $\hat{X}_{in}$. The whole compression process of ARM-Net can be expressed by Equations (1)–(3):
$$X_{\beta} = \mathrm{ABS}(X_{in}) \tag{1}$$
$$\hat{X}_{\beta} = \mathrm{RSACN}(X_{\beta}) \tag{2}$$
$$\hat{X}_{in} = \mathrm{MSSARN}(\hat{X}_{\beta}) \tag{3}$$

2.2. Adaptive Band Selector (ABS)

In response to the high-dimensional nature of HSIs, the band selection technique effectively reduces computational complexity and optimizes data storage and transmission efficiency. A few representative bands are chosen to reduce equipment costs and computational load while maintaining data expressiveness. Therefore, band selection and extraction are typically regarded as fundamental steps in lossless compression algorithms [27,28] and compressed sensing [29,30].
Band selection employs metrics such as entropy, interquartile range, standard deviation, and image gradient to flexibly evaluate the relevance of each band and selects bands with high information content, strong variability, or prominent texture features based on different criteria. However, according to the fundamental principles of information theory [31], higher information content in the selected bands does not necessarily result in better performance of the band combination. Instead, inter-band redundancy must be considered to achieve optimal information utilization and overall performance optimization. Thus, in hyperspectral data analysis, maximizing the information capacity of the selected bands while minimizing inter-band redundancy is crucial for optimizing hyperspectral data processing frameworks. In our framework, reconstruction accuracy depends on the amount of information in the sampled band clusters. In other words, if a band cluster fails to fully represent the corresponding HSI patch, the compression process becomes ineffective. To more comprehensively represent the full-band spectral information, this study introduces spectral correlation coefficients based on the maximum correlation principle [32] to quantify the spectral correlation between bands and enhance data independence through spectral decorrelation techniques, as follows:
$$c_{i,j} = \frac{\sum_{x=1}^{H} \sum_{y=1}^{W} \left( p_i(x,y) - \bar{p}_i \right) \left( p_j(x,y) - \bar{p}_j \right)}{\sqrt{\sum_{x=1}^{H} \sum_{y=1}^{W} \left( p_i(x,y) - \bar{p}_i \right)^2 \sum_{x=1}^{H} \sum_{y=1}^{W} \left( p_j(x,y) - \bar{p}_j \right)^2}} \tag{4}$$
where $i$ and $j$ represent the $i$-th and $j$-th bands in the HSI patch, respectively; $p_i(x,y)$ represents the pixel value at position $(x,y)$ of the $i$-th band; and $\bar{p}_i$ denotes the mean pixel value of the $i$-th band. Figure 2 presents a sample correlation matrix of the Botswana dataset. It can be observed that adjacent bands generally exhibit a high degree of correlation, with many local bands having correlation coefficients greater than 0.95, which indicates significant information redundancy. The pseudo-code in Algorithm 1 provides a detailed description of the adaptive band selection strategy, enabling the adaptive computation of optimal band clusters for new HSIs. In this study, the ABS is applied to cropped HSI patches, and the $\beta$ pairs of bands that appear most frequently across all results are selected as the final adaptive selection outcome. Based on experimental results, the hyperparameter $\beta$ is set to 3, and the band cluster [1, 2, 20, 25, 29, 30] is selected for training. Detailed experimental procedures and analyses are provided in the ablation experiments section, where the effectiveness of the selected band clusters is validated.
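As an illustration of Equation (4), the following minimal NumPy sketch computes the full band-to-band correlation matrix of one patch. The function name `band_correlation_matrix` and the toy patch are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def band_correlation_matrix(patch: np.ndarray) -> np.ndarray:
    """Correlation coefficient c[i, j] between every pair of bands of an (H, W, C) patch."""
    h, w, c = patch.shape
    # Flatten the spatial axes so that each column holds the pixels of one band.
    bands = patch.reshape(h * w, c).astype(np.float64)
    centered = bands - bands.mean(axis=0, keepdims=True)
    # Numerator of Eq. (4): sum over pixels of the product of centered values.
    cov = centered.T @ centered
    # Denominator of Eq. (4): square root of the product of per-band sums of squares.
    norms = np.sqrt(np.diag(cov))
    return cov / np.outer(norms, norms)

# Hypothetical toy patch: bands 0 and 1 are nearly identical, band 2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(16, 16, 1))
patch = np.concatenate(
    [base, base + 0.01 * rng.normal(size=base.shape), rng.normal(size=base.shape)],
    axis=2,
)
corr = band_correlation_matrix(patch)
```

A constant band has zero variance and would produce a division by zero here, so such bands would need to be masked out beforehand.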
Algorithm 1 Adaptive band selection algorithm workflow
  • Input: HSI $X_{in} \in \mathbb{R}^{H \times W \times C}$, number of band pairs $\beta$
  • Output: Selected band pairs L
  • Initialize an empty list L to store selected band pairs
  • for i = 1 to C − 1 do
  •    for j = i + 1 to C do
  •       Calculate the inter-spectral correlation coefficient $c_{i,j}$
  •    end for
  • end for
  • Sort all band pairs (i, j) in ascending order of $c_{i,j}$
  • for each pair (i, j) in the sorted list do
  •    if band i or j has already been selected then
  •       Skip the current pair and move to the next pair in the sorted list
  •    else
  •       Append (i, j) to the list L
  •    end if
  •    if L contains β pairs then
  •       Stop searching
  •    end if
  • end for
  • return the list of selected band pairs L
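The selection loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `select_band_pairs` is hypothetical, band indices are 0-based (the paper counts from 1), and pairs are ranked by the signed correlation value, as the text describes "lower scores"; ranking by absolute value would be an alternative.

```python
import numpy as np

def select_band_pairs(corr: np.ndarray, beta: int) -> list[tuple[int, int]]:
    """Greedily pick up to `beta` disjoint band pairs with the lowest correlation."""
    num_bands = corr.shape[0]
    # Only the strict upper triangle is needed because corr is symmetric.
    i_idx, j_idx = np.triu_indices(num_bands, k=1)
    order = np.argsort(corr[i_idx, j_idx], kind="stable")
    selected: list[tuple[int, int]] = []
    used: set[int] = set()
    for k in order:
        i, j = int(i_idx[k]), int(j_idx[k])
        if i in used or j in used:
            continue  # a band may appear in at most one selected pair
        selected.append((i, j))
        used.update((i, j))
        if len(selected) == beta:
            break
    return selected

# Hypothetical 4-band correlation matrix used only for illustration.
toy_corr = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.8],
    [0.5, 0.1, 0.8, 1.0],
])
pairs = select_band_pairs(toy_corr, beta=2)
```

If fewer than `beta` disjoint pairs exist, the function returns the pairs it found, which is one possible reading of the fallback step in the original pseudo-code.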

2.3. Recurrent Spectral Attention Compression Network (RSACN)

The compression network is crucial for enhancing the compression ratio and plays a pivotal role in the three-stage framework. This section first presents the overall architecture of the RSACN. Next, a detailed explanation of the multi-head recurrent spectral attention (MHRSA) module is provided, highlighting its design and improvements.
As shown in Figure 3, the RSACN consists of an encoder $G_a$, a decoder $G_s$, a hyperencoder $H_a$, a hyperdecoder $H_s$, a quantizer $Q$, and an entropy model. First, $G_a$ encodes $X_{\beta}$ into a latent representation $y$. Then, $H_a$ captures the spatial dependencies within $y$ and represents them as $z$. $Q$ quantizes $y$ and $z$ into discrete values $\hat{y}$ and $\hat{z}$ while keeping the optimization compatible with stochastic gradient descent. Here, $\hat{z}$ serves as prior information for $y$ and stores the statistical information $\mu, \hat{\sigma}$ required for arithmetic coding. Then, $\mu, \hat{\sigma}$ are decoded through $H_s$ to assist the entropy model in probability modeling during encoding. Meanwhile, $\hat{y}$ is input into $G_s$ to reconstruct the band clusters $\hat{X}_{\beta}$. The entire compression process is formulated in Equations (5)–(7):
$$y = G_a(X_{\beta}, \varphi_g), \quad z = H_a(y, \varphi_h) \tag{5}$$
$$\hat{y} = Q(y), \quad \hat{z} = Q(z) \tag{6}$$
$$\mu, \hat{\sigma} = H_s(\hat{z}, \theta_h), \quad \hat{X}_{\beta} = G_s(\hat{y}, \theta_g) \tag{7}$$
where $\varphi_g$ and $\varphi_h$ are the optimized parameters of $G_a$ and $H_a$, and $\theta_g$ and $\theta_h$ represent the parameters learned by $G_s$ and $H_s$. During training, we replace quantization with additive uniform noise to optimize ARM-Net using stochastic gradient descent since rounding to the nearest integer produces zero gradients almost everywhere [11]. $\hat{y}$ can be modeled using a mean-scale Gaussian distribution. Non-parametric, fully factorized density models are trained on the auxiliary information $\hat{z}$.
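The two behaviors of the quantizer $Q$ described above (rounding at inference, additive uniform noise during training) can be illustrated with a minimal NumPy sketch. The function name `quantize` and the example values are hypothetical; a real training setup would apply the same idea inside an automatic differentiation framework.

```python
import numpy as np

def quantize(latent, training, rng=None):
    """Round at inference; add U(-1/2, 1/2) noise as the training-time surrogate."""
    if training:
        rng = np.random.default_rng() if rng is None else rng
        # The noise keeps the mapping differentiable with respect to `latent`.
        return latent + rng.uniform(-0.5, 0.5, size=latent.shape)
    # Hard rounding: its gradient is zero almost everywhere.
    return np.round(latent)

y = np.array([-1.7, 0.2, 0.49, 2.6])
y_hard = quantize(y, training=False)
y_soft = quantize(y, training=True, rng=np.random.default_rng(0))
```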
Although each band in an HSI may appear visually independent, each reflects different spectral information from the same scene. Pixels at the same spatial location across different bands often share similar spectral features, which are inherently linked to the material properties of the objects. This phenomenon indicates that, despite the rich spectral data in HSIs, the bands are not isolated from one another but are intrinsically correlated. This further underscores the significant spectral correlation inherent in HSIs. Similar to non-local spatial correlations, many schemes use inter-spectral correlation for long-range dependency extraction via methods such as non-local mean (NLM) [33]. However, unlike spatially similar pixels, pixels in the spectral domain have different value ranges, which makes NLM unsuitable, as its averaging operation may disrupt the spectral relationships of each pixel across the spectrum. To address this problem, the MHRSA block [34] is inserted into the codec to dynamically compute the weights of average pixels across the spectrum for each band. Each band is assigned a distinct weight to aggregate information from other bands so that spectral dependencies are preserved. A diagram of the MHRSA block is shown in Figure 4.
It first employs two Multi-Layer Perceptrons (MLPs), each followed by a distinct activation function, to transform the input features $F$ into the candidate features $Z$ and the merging weights $W$. Specifically, we adopt tanh for generating candidate features due to its symmetric range (−1, 1), which can better capture both positive and negative correlations. We use sigmoid for more interpretable and stable re-scaling of feature amplitudes.
$$Z = \tanh(\mathrm{MLP}_1(F)) \tag{8}$$
$$W = \mathrm{sigmoid}(\mathrm{MLP}_2(F)) \tag{9}$$
$$\mathrm{MLP}(X) = W_1(\tanh(W_2 X)) \tag{10}$$
where $W_1, W_2 \in \mathbb{R}^{C \times C}$.
These processes can be equivalently viewed as the query, key, and value projections in self-attention [35]. The key difference is that the attention weights are computed directly, rather than deriving the attention map through the covariance of the key and query. Another distinction is that we perform the attention operation through a recurrent fusion step, which requires linear memory and time complexity. Specifically, the recurrent merging step for spectral mixing is performed through the accumulation of the candidate feature $Z$ for each band based on the merging weight $W$ as follows:
$$O_i = (1 - W_i) \odot Z_i + W_i \odot O_{i-1} \tag{11}$$
where $O_i$, $Z_i$, and $W_i$ are the output features, candidate features, and merging weights of the $i$-th band, respectively. It can be observed that this merging step fuses features from all previous bands $Z_i$, where $i < j$ for the $j$-th band. Therefore, it links inter-spectral features and can leverage information from cleaner bands to reduce spectral redundancy.
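A minimal NumPy sketch of Equations (8)–(11) follows. It is a single-head, single-direction illustration under several assumptions that are not stated in the text: each band is represented by one $C$-dimensional feature vector, the initial state $O_0$ is zero, and the weight matrices are random placeholders. All function and variable names are hypothetical.

```python
import numpy as np

def mlp(x, w1, w2):
    # Eq. (10): MLP(X) = W1 tanh(W2 X), applied to each band's feature vector (a row of x).
    return np.tanh(x @ w2.T) @ w1.T

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_spectral_merge(features, params_z, params_w):
    """features: (num_bands, C). Returns the merged outputs O with the same shape."""
    z = np.tanh(mlp(features, *params_z))   # Eq. (8): candidate features in (-1, 1)
    w = sigmoid(mlp(features, *params_w))   # Eq. (9): merging weights in (0, 1)
    out = np.zeros_like(z)
    prev = np.zeros(z.shape[1])             # assumed initial state O_0 = 0
    for i in range(z.shape[0]):
        # Eq. (11): blend the current candidate with the running state.
        prev = (1.0 - w[i]) * z[i] + w[i] * prev
        out[i] = prev
    return out

rng = np.random.default_rng(0)
num_bands, channels = 6, 4
feats = rng.normal(size=(num_bands, channels))
params_z = (rng.normal(size=(channels, channels)), rng.normal(size=(channels, channels)))
params_w = (rng.normal(size=(channels, channels)), rng.normal(size=(channels, channels)))
merged = recurrent_spectral_merge(feats, params_z, params_w)
```

The loop visits each band once, which is where the linear time and memory cost of the merging step comes from.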

2.4. Multi-Scale Spatial-Spectral Attention Reconstruction Network (MSSARN)

HSIs contain abundant spatial and spectral features. Existing transformer-based models often focus only on spatial or spectral information, which leads to information loss. To address this issue, this paper proposes a transformer-based multi-scale feature extraction module called the Multi-Scale Spatial-Spectral Attention (MSSA) module, which is embedded into the reconstruction network. MSSA is designed to extract both spatial and spectral features simultaneously, which improves reconstruction performance.
As illustrated in Figure 1, the MSSARN consists of three cascaded MSSA modules. Direct use of transformer-based approaches may result in computational error accumulation due to the attention mechanism’s emphasis on long-range similarity [36]. To mitigate this, we adopt a U-shaped transformer architecture for fine-grained reconstruction. Additionally, a two-dimensional attention mechanism is employed to capture long-range dependencies between spectral bands. As depicted in Figure 5, MSSA employs U-Net as the backbone for the top-down extraction of effective features. The convolutional layers before and after MSSA, the embedding block, and the mapping block are single 3 × 3 convolutional layers. During the encoder stage, $N_1 + N_2$ SSABs are used to extract hierarchical abstract features. Simultaneously, 4 × 4 convolutional downsampling operations reduce spatial resolution and increase channel depth. The decoder follows a symmetric structure that utilizes 2 × 2 deconvolutions for upsampling and $N_1 + N_2$ SSABs to progressively integrate features. The bottleneck layer comprises $N_3$ SSABs. To preserve information lost during downsampling, skip connections are introduced between the encoder and decoder. A 1 × 1 convolution is applied to fuse spatial and spectral features. The right side of Figure 5 illustrates that each SSAB consists of a Feed-Forward Network (FFN), Spec-MSA [37], Spa-MSA [38], and LayerNorm. The convolutional layer is a single 3 × 3 convolutional layer. The FFN follows the parameter settings outlined in [37]. Additionally, window-based transformers often encounter the 'grid issue' when processing high-resolution images. To alleviate this, we introduce residual connections between window-based MSA and shuffle-window MSA to enhance feature interaction. The convolutional kernel size is set to match the window size to ensure alignment in feature extraction.
As illustrated in Figure 6, the spatial-spectral attention mechanism comprises parallel Spa-MSA and Spec-MSA modules that compute spatial and spectral multi-head self-attention, respectively. The two attention modules operate in parallel to provide dual features that enhance cross-dimensional interactions.
As illustrated on the left side of Figure 6 and Figure 7c, Spec-MSA treats each spectrum as a token and thus focuses on more non-local spectral self-similarities.
Spec-MSA computes self-attention for $\mathrm{head}_j$:
$$A_j = \mathrm{softmax}(\sigma_j K_j^{T} Q_j), \quad \mathrm{head}_j = V_j A_j \tag{12}$$
where $K_j^{T}$ is the transpose of $K_j$.
Due to the significant variation in spectral density across wavelengths, a learnable parameter $\sigma_j$ is used to adapt the self-attention $A_j$ by reweighting the matrix multiplication $K_j^{T} Q_j$ within $\mathrm{head}_j$. The softmax function is applied to normalize the attention weights across all spectral tokens, ensuring that the computed attention scores form a probability distribution and highlight key spectral correlations. Subsequently, the outputs of $N$ heads are concatenated and passed through a linear projection, followed by the incorporation of positional embeddings:
$$\mathrm{Spec\text{-}MSA}(X) = \left( \mathrm{Concat}(\mathrm{head}_j) \right) W + f_p(V) \tag{13}$$
where $W \in \mathbb{R}^{C \times C}$ represents the learnable parameters and $f_p(\cdot)$ is the function to generate positional embeddings.
The function $f_p(\cdot)$ consists of two depth-wise 3 × 3 convolutional layers, a Gaussian error linear unit activation function, and reshaping operations. The HSI is ordered along the spectral dimension by wavelength. Therefore, these embeddings are utilized to encode the positional information of different spectral channels. Finally, the result of Equation (13) is reshaped to obtain the output feature map $X_{spec} \in \mathbb{R}^{H \times W \times C}$.
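To make the shapes in Equation (12) concrete, a single Spec-MSA head can be sketched in NumPy as below. This is an illustration under assumptions, not the authors' code: the projection matrices are random placeholders, the softmax is taken over the key-channel axis so that each output channel is a convex combination of value channels, and the concatenation, output projection, and positional embedding of Equation (13) are omitted.

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def spec_msa_head(x, wq, wk, wv, sigma):
    """x: (H*W, C) flattened features. Returns the head output (H*W, d) and the map A (d, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv               # each of shape (H*W, d)
    # Each spectral channel is one token, so the attention map is d x d
    # regardless of the number of pixels H*W.
    attn = softmax(sigma * (k.T @ q), axis=0)
    return v @ attn, attn

rng = np.random.default_rng(0)
num_pixels, channels, head_dim = 64, 8, 4
x = rng.normal(size=(num_pixels, channels))
wq, wk, wv = (rng.normal(size=(channels, head_dim)) for _ in range(3))
head, attn = spec_msa_head(x, wq, wk, wv, sigma=0.1)
```

Because the attention map has a fixed size of `head_dim × head_dim`, the cost of this head grows linearly with the number of pixels.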
Spa-MSA consists of a window-based MSA (W-MSA) and a shuffle-window MSA (SW-MSA), which are designed to facilitate long-range cross-window interactions. The right side of Figure 6 and Figure 7a,b depicts W-MSA and SW-MSA, with their primary difference being the spatial shuffle mechanism. In brief, for a W-MSA with window size M and N tokens as input, the output is reshaped to (M, N/M), transposed, and flattened to serve as input for the next layer. This combines tokens from different windows to establish long-range connections. Subsequently, the spatial dimensions are reshaped to (N/M, M) through spatial alignment operations with relative positional offsets, followed by transposition and flattening to restore their original configurations.
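The reshape, transpose, and flatten sequence described for the spatial shuffle, together with its inverse, can be sketched on a one-dimensional token sequence as follows. The function names are hypothetical, and the sketch assumes that the number of tokens $N$ is divisible by the window size $M$.

```python
import numpy as np

def shuffle_tokens(tokens: np.ndarray, window: int) -> np.ndarray:
    """Reshape to (M, N/M), transpose, and flatten along the token axis."""
    n = tokens.shape[0]
    grouped = tokens.reshape(window, n // window, *tokens.shape[1:])
    return grouped.swapaxes(0, 1).reshape(tokens.shape)

def unshuffle_tokens(tokens: np.ndarray, window: int) -> np.ndarray:
    """Inverse operation: reshape to (N/M, M), transpose, and flatten."""
    n = tokens.shape[0]
    grouped = tokens.reshape(n // window, window, *tokens.shape[1:])
    return grouped.swapaxes(0, 1).reshape(tokens.shape)

tokens = np.arange(6)               # windows of size 2: [0, 1], [2, 3], [4, 5]
shuffled = shuffle_tokens(tokens, window=2)
restored = unshuffle_tokens(shuffled, window=2)
```

After the shuffle, consecutive windows of size 2 hold the tokens (0, 3), (1, 4), and (2, 5), so each new window mixes tokens that originally belonged to different windows.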

3. Results

3.1. Experimental Configurations

To thoroughly assess the proposed HSI compression architecture and the ARM-Net designed within this framework, this paper trained ARM-Net on a large, high-quality mixed HSI dataset, allowing us to test our model on new HSIs without the need for retraining. The dataset integrates HSI data from several well-known HSI datasets, including Botswana, KSC, Pavia Center, Pavia University, Salinas, and Houston, as well as hyperspectral data collected by the AVIRIS sensor [39]. The spatial resolution of the dataset ranges from 512 × 217 to 2384 × 601. For efficient processing, the dataset was first divided into non-overlapping patches of 128 × 128, and then 30-channel random overlapping sub-patches were generated along the spectral dimension. Data augmentation techniques, such as horizontal flipping, vertical flipping, and rotation, were applied to enhance the diversity and robustness of the data. In the end, 26,380 patches were randomly divided into training, validation, and test sets in a ratio of 8:1:1. Figure 8 presents a selection of images from the dataset to highlight the diversity and richness of the hyperspectral data. Various scenes within the integrated hyperspectral dataset are illustrated in these images, showcasing different spectral characteristics and visual details. To effectively visualize different regions, the input image was transformed into a pseudo-color image, allowing for a clearer differentiation of areas to meet human perceptual requirements.

3.2. Training Details

To ensure efficient compression and reconstruction, ARM-Net was trained in two stages. First, the training set was normalized using divergence normalization, and then RD loss was applied to train the compression network, as shown in Equations (14)–(16):
$$\mathrm{Loss}_{com} = R(\hat{y}) + R(\hat{z}) + \lambda D(X_{\beta}, \hat{X}_{\beta}) \tag{14}$$
$$R(\hat{y}) = \mathbb{E}\left[ -\log_2 \left( \left( \mathcal{N}(\mu, \sigma^2) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{y}) \right) \right] \tag{15}$$
$$R(\hat{z}) = \mathbb{E}\left[ -\log_2 \left( \left( P_{\hat{z}|\psi}(\hat{z}|\psi) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{z}) \right) \right] \tag{16}$$
where $\psi$ represents the parameters of the Gaussian distribution, $R$ represents the bit rate, and $D$ is the distortion between the original and reconstructed images. The Mean Squared Error (MSE) was used as the metric to measure $D$. $\lambda$ is the Lagrange multiplier, a hyperparameter used to balance the bit rate and reconstruction distortion. After the RSACN converged, its parameters were frozen, and the MSSARN was trained using reconstruction distortion, as shown in Equation (17):
$$\mathrm{Loss}_{rec} = \mathrm{MSE}(X_{in}, \hat{X}_{in}) \tag{17}$$
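The rate term of Equation (15) and the combined loss of Equation (14) can be illustrated with the following minimal sketch. Convolving a Gaussian density with a unit-width uniform density and evaluating it at $\hat{y}$ equals the Gaussian probability mass on the interval $[\hat{y} - 1/2, \hat{y} + 1/2]$, which is what `rate_bits` computes. The function names are hypothetical, the rate of $\hat{z}$ is passed in as a precomputed number rather than modeled, and any per-pixel normalization of the rate is left out.

```python
import math
import numpy as np

def gaussian_cdf(x):
    # Standard normal CDF, vectorized over arrays with math.erf.
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def rate_bits(y_hat, mu, sigma, eps=1e-9):
    """Total bits for y_hat under N(mu, sigma^2) convolved with U(-1/2, 1/2)."""
    upper = gaussian_cdf((y_hat + 0.5 - mu) / sigma)
    lower = gaussian_cdf((y_hat - 0.5 - mu) / sigma)
    likelihood = np.maximum(upper - lower, eps)   # clamp to avoid log(0)
    return float(-np.log2(likelihood).sum())

def rd_loss(x, x_hat, rate_y, rate_z, lam):
    """Eq. (14): rate of both latents plus lambda times the MSE distortion."""
    distortion = float(np.mean((x - x_hat) ** 2))
    return rate_y + rate_z + lam * distortion

bits_at_mean = rate_bits(np.array([0.0]), mu=0.0, sigma=1.0)
bits_far = rate_bits(np.array([3.0]), mu=0.0, sigma=1.0)
```

Symbols close to the predicted mean are cheap to encode, while symbols far from it cost more bits, which is why accurate $\mu$ and $\sigma$ from the hyperdecoder lower the bit rate.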
The entire model was trained for 1800 epochs with a batch size of 20. The network was optimized using the Adam optimizer with an initial learning rate of 0.0001. When there were no significant changes in the RD loss, the learning rate was reduced to 0.00005. The entire network was accelerated and optimized using an NVIDIA 3090 GPU. All codecs were run on the same CPU (i7-12700H @ 2.3 GHz) during inference.

3.3. Evaluation Strategies

In this paper, all deep learning-based methods were evaluated using the PSNR metric. The compression ratio of HSIs was measured by the number of bits per pixel per band (bpppb), and three commonly used metrics were employed to assess the distortion introduced during the compression process from different perspectives: PSNR [24], MS-SSIM [24], and SAM [22], with the following calculation formulas:
$$\mathrm{PSNR}(X, \hat{X}) = \frac{1}{C}\sum_{i=1}^{C} 10\log_{10}\frac{\max^2(X_i)}{\mathrm{MSE}_i}$$
where $\mathrm{MSE}(X, \hat{X}) = \frac{1}{H \times W \times C}\lVert X - \hat{X}\rVert_F^2$ and $\max^2(\cdot)$ denotes the square of the maximum pixel in the i-th band.
$$\mathrm{SAM}(X, \hat{X}) = \frac{1}{H \times W}\sum_{j=1}^{H \times W}\cos^{-1}\left(\frac{X_j \cdot \hat{X}_j}{\lVert X_j\rVert\,\lVert \hat{X}_j\rVert}\right)$$
where $\cdot$ represents the inner product, $\lVert\cdot\rVert$ denotes the $L_2$ norm, and $X_j$ and $\hat{X}_j$ denote the j-th pixel of the original and reconstructed HSIs, respectively.
MS-SSIM [11] is expressed in decibels as $-10\log_{10}(1 - \text{MS-SSIM})$. Higher PSNR and MS-SSIM values indicate better spatial fidelity, while a lower SAM value signifies better spectral fidelity [37].
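The metrics above can be sketched in NumPy as follows. This is an illustrative reading of the formulas, not the evaluation code used in the experiments: the function names are hypothetical, the cube layout is assumed to be (H, W, C), MSE is computed per band for the PSNR average, SAM is returned in radians, and a small `eps` is added to avoid division by zero.

```python
import numpy as np

def psnr(x, x_hat):
    """Band-wise PSNR in dB, averaged over the C bands of an (H, W, C) cube."""
    mse = np.mean((x - x_hat) ** 2, axis=(0, 1))      # per-band MSE
    peak2 = np.max(x, axis=(0, 1)) ** 2               # squared per-band maximum
    return float(np.mean(10.0 * np.log10(peak2 / mse)))

def sam(x, x_hat, eps=1e-12):
    """Mean spectral angle (radians, assumed) over all H*W pixel spectra."""
    a = x.reshape(-1, x.shape[-1])
    b = x_hat.reshape(-1, x_hat.shape[-1])
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

def ms_ssim_db(ms_ssim):
    """Convert an MS-SSIM value in [0, 1) to decibels."""
    return float(-10.0 * np.log10(1.0 - ms_ssim))

def bpppb(num_bits, h, w, c):
    """Bits per pixel per band for a compressed (H, W, C) cube."""
    return num_bits / (h * w * c)
```

For example, an MS-SSIM of 0.99 corresponds to about 20 dB, and a 128 × 128 × 30 patch stored in 245,760 bits has a rate of 0.5 bpppb.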

3.4. Comparative Results

3.4.1. Rate-Distortion Performance

This study compared nine compression methods, including three hyperspectral image compression methods: ARM-Net (ours), FHNeRF (2024) [24], and Verdú (2023) [26]; three RGB image compression methods: CHENG (2020) [17], Pan (2023) [40], and Hyperprior (2018) [11]; one traditional hyperspectral image compression method: PCA [9]; and two traditional algorithms: BPG and JPEG2000. The traditional compression methods use publicly available software or code. Specifically, JPEG2000 was implemented using the OpenJPEG library, BPG is based on an open-source C language library, and PCA was performed by combining PCA with JPEG2000. The other methods followed the same setup as described in the original studies. Table 1 presents all the methods used in the comparative experiments along with their key characteristics.
Figure 9 presents the RD curves and compression ratios of the compared methods on the mixed dataset. ARM-Net was highly competitive and outperformed most existing methods. At high compression ratios, the reconstruction quality of deep learning approaches [11,17,24,26,40] is primarily contingent upon the network’s feature inference capabilities. ARM-Net and the methods in [24,26] benefited from sophisticated feature extraction modules and feature-domain information flow channels, which together enabled them to achieve reconstruction quality superior to that of traditional algorithms. ARM-Net employs a cyclic spectral information fusion methodology that integrates spatial texture and spectral correlation, and it outperformed the other methods at both low and high compression ratios. Specifically, ARM-Net achieved a 1.09 dB improvement in PSNR over the SOTA methods. At a bpppb of 0.58, ARM-Net’s PSNR was approximately 1.5 dB higher than that of FHNeRF [24]. Furthermore, ARM-Net exhibited minimal fluctuations across different bit rates, indicating its ability to maintain stable reconstruction quality. Additionally, as the compression ratio decreased, the amount of information in the observed values increased. Existing methods typically rely on complex feature processing and optimization steps that iteratively refine the reconstructed image. In contrast, ARM-Net performs feature extraction and optimization directly on the sampled bands, thereby avoiding iterative refinement and enabling more efficient image reconstruction.

3.4.2. Comparison of Visualization Results

Figure 10 shows a visual comparison of a test patch from Pavia University containing numerous small buildings. The quantitative results for PSNR, MS-SSIM, and SAM are displayed below the corresponding visualizations. Notably, the bpppb values in the visual comparison differ slightly between methods. This variation arose because the reported values were based on actual measurements rather than data points extracted from the curves in the figure. The reconstructions of both the classical codecs (JPEG2000, PCA, and BPG) and the deep learning-based methods suffered from varying degrees of blur. The reconstructed image of ARM-Net (b) displays excellent texture details and edge definition. Although FHNeRF (c) performed well at high bit rates, it exhibited a slightly lower PSNR, resulting in some loss of detail. Verdú (d) demonstrated a substantial degradation in reconstruction quality, accompanied by noticeable texture blurring. Other methods, such as CHENG (e), Pan (f), and Hyperprior (g), fell short in terms of texture details and contrast. In general, ARM-Net demonstrated superior visual quality compared to the other compression methods.
This study also conducted experiments on HSI data collected by AVIRIS, which cover objects of different scales and types, to further validate the effectiveness of ARM-Net. Figure 11 presents a visual comparison of a sample patch from the AVIRIS dataset at low bit rates. The dataset predominantly contains complex mountain textures, characterized by highly irregular structures and rich spatial details, which present a significant challenge for compression algorithms. Existing methods exhibited noticeable shortcomings when handling such complex textures at low bit rates. Severe detail loss was observed in CHENG (e) during reconstruction: the mountain textures appeared blurred, and the contrast was reduced, causing the structure to look flattened. Similarly, Hyperprior (g) failed to restore the intricate mountain details, as obvious noise and distorted textures appeared in the reconstructed image. Furthermore, PCA (h) did not adequately preserve the mountain details during compression, resulting in blurred image edges. In contrast, ARM-Net effectively mitigated structural distortions during compression by combining spatial and spectral information via the SSAB, which operates bidirectionally. ARM-Net preserved the detailed texture of the mountain ranges and avoided the structural distortions that occurred with the other methods. This resulted in better reconstruction quality, further confirming its advantages in complex natural scenes.

3.4.3. Model Complexity Analysis

Table 2 lists the parameters, FLOPs, and inference times for the six learning-based methods compared. Relative to the other methods, ARM-Net kept a competitive parameter count while achieving a significant reduction in computational complexity. Its number of parameters was slightly higher than that of Hyperprior and Verdú due to its three-segment structure. Because the adaptive band selector reduces the number of input bands, ARM-Net had significantly fewer FLOPs than the other end-to-end compression networks. FHNeRF achieved a lower number of parameters and lower computational complexity by representing the image through pixel coordinates. However, although FHNeRF excelled across these complexity metrics, it required a significant amount of training and performed slightly worse than our method in terms of PSNR and MS-SSIM. ARM-Net had a moderate inference time: since its reconstruction network needs to capture spatial and spectral features, its decoding phase took longer than that of the end-to-end compression networks.

3.5. Ablation Experiments

3.5.1. Ablation Experiments on Band Selection

Notably, the upper limit of the reconstruction network’s performance is determined by the sampled band clusters. Specifically, if the image information provided to the reconstruction network is incomplete, it becomes challenging for the network to recover the full band information. Our band correlation algorithm evaluates and samples groups of bands with large differences to ensure the information richness of the input image, which leads to an improvement in reconstruction quality. By contrast, because information is not uniformly distributed across bands, equidistant sampling might miss bands that are crucial for the reconstruction task, causing a loss of important information. This deficiency is particularly evident in the MSE loss curve and the PSNR metric. Table 3 shows the PSNR performance, under identical conditions and at the same compression rate, of the equidistantly sampled band clusters [1, 7, 13, 19, 25, 30] and the band clusters [1, 2, 20, 25, 29, 30] selected by the adaptive band selection algorithm. The PSNR of the band clusters sampled by the adaptive band selection algorithm was about 2–3 dB higher than that of the equidistant sampling method. This indicates that the adaptive band selection algorithm contributed to the improvement of the reconstruction network’s performance.
We also tested the hyperparameter β with different values to observe PSNR metrics under the same bpppb. Table 4 indicates that when β = 2 , the band clusters did not adequately represent the patch, as evidenced by the lower PSNR. On the other hand, β = 4 did not result in a higher PSNR gain but increased the network’s parameter count.
To further validate the performance of the ABS, this experiment first applied the ABS to select bands from 500 randomly selected HSI patches. These 500 patches were then input into the trained compression model for compression and reconstruction. The ABS was applied again to the reconstructed HSI to observe differences between the band selection results of the reconstructed HSI and the raw HSI before compression. A high degree of overlap between the reconstructed data and the band selection results of the original inputs would indicate a strong ability of the model to identify and preserve key spectral information. Conversely, significant deviations may highlight potential limitations in the model’s ability to extract meaningful features during the reconstruction process. Figure 12 shows that the reconstructed band correlations are nearly identical to the original input data and consistent with the results of adaptive band selection. This confirms that the subsequent compression and reconstruction processes are meaningful.
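The consistency check described above, comparing which bands are selected before and after compression, can be sketched as below. The ABS itself is not reimplemented here; `orig_sel` and `recon_sel` are hypothetical inputs that stand for the band indices (1–30) the ABS would return for each original and reconstructed patch, and the overlap ratio is an illustrative summary statistic that the paper does not define.

```python
from collections import Counter

def selection_frequency(selections, num_bands=30):
    """Count how often each band (1..num_bands) appears across all patches."""
    counts = Counter(b for sel in selections for b in sel)
    return [counts.get(b, 0) for b in range(1, num_bands + 1)]

def mean_overlap(orig_sel, recon_sel):
    """Average fraction of originally selected bands that are selected
    again on the reconstructed patch (hypothetical summary measure)."""
    ratios = [len(set(o) & set(r)) / len(set(o))
              for o, r in zip(orig_sel, recon_sel)]
    return sum(ratios) / len(ratios)

# Hypothetical selections for two patches, before and after compression.
orig_sel = [[1, 2, 20, 25, 29, 30], [1, 2, 20, 25, 29, 30]]
recon_sel = [[1, 2, 20, 25, 29, 30], [1, 3, 20, 25, 29, 30]]
freq_orig = selection_frequency(orig_sel)
freq_recon = selection_frequency(recon_sel)
overlap = mean_overlap(orig_sel, recon_sel)
```

Plotting `freq_orig` against `freq_recon` gives a histogram comparison of the kind shown in Figure 12; an overlap close to 1 indicates that the same bands are selected on the reconstructed patches as on the originals.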

3.5.2. Ablation Experiments on the Attention Module

To test the impact of Spa-MSA and MRSA on overall compression performance, the following configurations were evaluated. First, a baseline model without the spatial attention module was used to assess its performance in terms of compression effectiveness and distortion metrics. Next, a model without MRSA was evaluated to assess performance based solely on the spatial attention module. Finally, both modules were applied simultaneously to assess their combined effect. The results in Figure 13 demonstrate that removing either MRSA or Spa-MSA led to a decrease in PSNR. Specifically, when MRSA was removed, PSNR decreased across various bit rates, although the overall decline remained relatively small. In contrast, the removal of Spa-MSA resulted in a more significant drop in PSNR, particularly at lower bit rates. Compared to the complete ARM-Net model, this performance degradation confirms that both MRSA and Spa-MSA are essential for refining compression quality.

3.5.3. Ablation Experiments on the Framework

This section presents a series of ablation experiments on the proposed three-stage framework to evaluate the impact of different combinations of compression and reconstruction networks on overall compression performance. As shown in Figure 14, three learned compression methods (Hyperprior [11], MSSSA [20], and GMM [17]) and two SOTA SR networks (MST [37] and AWAN [41]) were embedded into the proposed framework, and their combined PSNR performance was measured. Table 5 provides detailed information on the network parameters, FLOPs, and average inference time for each combination. The average inference time is the mean duration for the network to complete compression during testing, excluding the arithmetic encoding and decoding processes. As can be seen in Figure 14 and Table 5, MSSSA+MST achieved the highest PSNR gain but also significantly increased the computational load, to approximately five times that of Hyperprior+MST. GMM was outperformed by the other two compression networks within the framework. Furthermore, MST surpassed AWAN in reconstruction capability. Considering both performance and computational complexity, Hyperprior+MST is by far the best choice as a baseline for the framework.

4. Discussion

ARM-Net employs the proposed three-stage composite framework, with the compression architecture incorporating the MRSA module and the reconstruction architecture integrating a multi-scale spatial-spectral attention mechanism. The objective is to adaptively learn robust spectral and spatial representations through attention mechanisms, guided by the inherent structure of hyperspectral data. As shown in Figure 9, Figure 10 and Figure 11, ARM-Net significantly outperforms existing compression methods, with notable improvements in PSNR and MS-SSIM, as well as a reduction in SAM. The proposed ARM-Net effectively distinguishes useful spectral and spatial features, thereby enhancing compression performance. Furthermore, ARM-Net’s spatial-spectral attention mechanism allows for detailed feature extraction across both spatial and spectral dimensions, which improves reconstruction accuracy and ensures high-quality output. Figure 14 further demonstrates that, even with a relatively simple compression network, the three-stage compression framework leads to significant performance gains. In general, ARM-Net has two significant advantages. On the one hand, the multi-scale attention mechanism selectively provides structured representations and allows for more precise pixel connections during reconstruction without introducing excessive bit rates. On the other hand, the spatial-spectral attention mechanism optimizes the compression model, which ensures that the model learns in a more structured direction and effectively balances compression efficiency with reconstruction quality.

5. Conclusions

This study introduces ARM-Net, a novel three-stage compression network designed to address two major HSI compression challenges: high computational complexity and the limited efficiency of spectral redundancy elimination. By leveraging the three-stage hybrid architecture, ARM-Net integrates an ABS, RSACN, and MSSARN to significantly enhance compression performance. In the reconstruction stage, the integration of the SSAB enables detailed feature extraction across both spatial and spectral dimensions, significantly enhancing reconstruction accuracy and ensuring that the final output retains essential visual and spectral characteristics. Additionally, the MRSA module dynamically adjusts the weighting of each spectral band to highlight key features and reduce redundancy. In extensive evaluations on benchmark datasets, ARM-Net demonstrates superior compression performance and achieves high detail retention and minimal distortion compared to existing methods. The qualitative results demonstrate that ARM-Net achieves high-quality compression while effectively preserving both the visual clarity and spectral consistency of reconstructed images. However, the performance of ARM-Net is still constrained by the band selection algorithm and the reconstruction network. In particular, datasets with extremely wide spectral ranges, substantial intra-band variability, or large spatial dimensions can exceed the current capacity of the model to accurately capture and reconstruct all critical features, especially when the training data do not comprehensively cover all variations. At very aggressive compression ratios, important spectral or spatial information may be lost, further exacerbating reconstruction errors. Despite these limitations, ARM-Net holds great promise for real-world HSI applications, such as in agriculture and industry.
In agricultural scenarios, precise yet efficient hyperspectral data analysis can enable early detection of crop stress or disease, soil property assessment, and optimized resource management. In industrial contexts, accurate spectral reconstruction can facilitate quality inspection, material identification, and contamination detection while reducing data storage and transmission overhead. By providing both high compression ratios and fine spectral preservation, ARM-Net can support faster data-driven decisions. Future work should therefore concentrate on exploring optimized combinations of band sampling strategies and reconstruction networks. This includes refining the band selection approach to better accommodate large or highly variable datasets, as well as enhancing reconstruction models to manage more extreme compression levels without sacrificing essential spectral or spatial details. By addressing these limitations, ARM-Net has the potential to become even more robust and versatile in a broader range of hyperspectral compression scenarios, ultimately benefiting practical HSI applications in agriculture, industry, and beyond.

Author Contributions

Conceptualization, Q.F.; methodology, Q.F.; software, Z.W.; validation, J.W. and Z.W.; formal analysis, Q.F. and Z.W.; investigation, Q.F.; resources, Q.F. and Z.W.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Q.F., Z.W., J.W. and L.Z.; visualization, Q.F.; supervision, L.Z.; project administration, L.Z.; funding acquisition, Q.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Liaoning Province Education Administration under grant number JYTMS20230243.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We gratefully acknowledge the publishers of the Botswana dataset, KSC dataset, Pavia dataset, Salinas dataset, and Houston dataset. We would also like to thank the editors and reviewers for their efforts and contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Plaza, A.; Benediktsson, J.A.; Boardman, J.W.; Brazile, J.; Bruzzone, L.; Camps-Valls, G.; Chanussot, J.; Fauvel, M.; Gamba, P.; Gualtieri, A.; et al. Recent advances in techniques for hyperspectral image processing. Remote Sens. Environ. 2009, 113, S110–S122.
  2. Bian, L.; Wang, Z.; Zhang, Y.; Li, L.; Zhang, Y.; Yang, C.; Fang, W.; Zhao, J.; Zhu, C.; Meng, Q.; et al. A broadband hyperspectral image sensor with high spatio-temporal resolution. Nature 2024, 635, 73–81.
  3. Ullah, F.; Ullah, I.; Khan, R.U.; Khan, S.; Khan, K.; Pau, G. Conventional to deep ensemble methods for hyperspectral image classification: A comprehensive survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3878–3916.
  4. Tian, Q.; He, C.; Xu, Y.; Wu, Z.; Wei, Z. Hyperspectral Target Detection: Learning Faithful Background Representations via Orthogonal Subspace-Guided Variational Autoencoder. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516714.
  5. Rajabi, R.; Zehtabian, A.; Singh, K.D.; Tabatabaeenejad, A.; Ghamisi, P.; Homayouni, S. Hyperspectral imaging in environmental monitoring and analysis. Front. Environ. Sci. 2024, 11, 1353447.
  6. Li, Y.; Luo, Y.; Zhang, L.; Wang, Z.; Du, B. MambaHSI: Spatial-spectral mamba for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5524216.
  7. García-Vera, Y.E.; Polochè-Arango, A.; Mendivelso-Fajardo, C.A.; Gutiérrez-Bernal, F.J. Hyperspectral image analysis and machine learning techniques for crop disease detection and identification: A review. Sustainability 2024, 16, 6064.
  8. Omar, H.M.; Morsli, M.; Yaichi, S. Image compression using principal component analysis. In Proceedings of the 2020 2nd International Conference on Mathematics and Information Technology (ICMIT), Adrar, Algeria, 18–19 February 2020; pp. 226–231.
  9. Du, Q.; Fowler, J.E. Hyperspectral image compression using JPEG2000 and principal component analysis. IEEE Geosci. Remote Sens. Lett. 2007, 4, 201–205.
  10. Ballé, J.; Laparra, V.; Simoncelli, E.P. End-to-end optimized image compression. arXiv 2016, arXiv:1611.01704.
  11. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. arXiv 2018, arXiv:1802.01436.
  12. Berger, T. Rate-distortion theory. In Wiley Encyclopedia of Telecommunications; Wiley Online Library: Hoboken, NJ, USA, 2003.
  13. Minnen, D.; Ballé, J.; Toderici, G.D. Joint autoregressive and hierarchical priors for learned image compression. Adv. Neural Inf. Process. Syst. 2018, 31, 10771–10780.
  14. He, D.; Zheng, Y.; Sun, B.; Wang, Y.; Qin, H. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14771–14780.
  15. Qian, Y.; Tan, Z.; Sun, X.; Lin, M.; Li, D.; Sun, Z.; Li, H.; Jin, R. Learning accurate entropy model with global reference for image compression. arXiv 2020, arXiv:2010.08321.
  16. Jiang, W.; Yang, J.; Zhai, Y.; Ning, P.; Gao, F.; Wang, R. Mlic: Multi-reference entropy model for learned image compression. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7618–7627.
  17. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7939–7948.
  18. Han, P.; Zhao, B.; Li, X. Edge-Guided Remote Sensing Image Compression. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5524515.
  19. Dardouri, T.; Kaaniche, M.; Benazza-Benyahia, A.; Dauphin, G.; Pesquet, J.C. Joint Learning of Fully Connected Network Models in Lifting Based Image Coders. IEEE Trans. Image Process. 2023, 33, 134–148.
  20. Kong, F.; Cao, T.; Li, Y.; Li, D.; Hu, K. Multi-scale spatial-spectral attention network for multispectral image compression based on variational autoencoder. Signal Process. 2022, 198, 108589.
  21. Guo, Y.; Tao, Y.; Chong, Y.; Pan, S.; Liu, M. Edge-guided hyperspectral image compression with interactive dual attention. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5500817.
  22. Guo, Y.; Chong, Y.; Pan, S. Hyperspectral image compression via cross-channel contrastive learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5513918.
  23. Rezasoltani, S.; Qureshi, F.Z. Hyperspectral Image Compression Using Implicit Neural Representations. In Proceedings of the 2023 20th Conference on Robots and Vision (CRV), Montreal, QC, Canada, 6–8 June 2023; pp. 248–255.
  24. Zhang, L.; Pan, T.; Liu, J.; Han, L. Compressing Hyperspectral Images into Multilayer Perceptrons Using Fast-Time Hyperspectral Neural Radiance Fields. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5503105.
  25. Byju, A.P.; Fuchs, M.H.P.; Walda, A.; Demir, B. Generative Adversarial Networks for Spatio-Spectral Compression of Hyperspectral Images. arXiv 2023, arXiv:2305.08514.
  26. Mijares i Verdú, S.; Ballé, J.; Laparra, V.; Bartrina-Rapesta, J.; Hernández-Cabronero, M.; Serra-Sagristà, J. A Scalable Reduced-Complexity Compression of Hyperspectral Remote Sensing Images Using Deep Learning. Remote Sens. 2023, 15, 4422.
  27. Llaveria, D.; Park, H.; Camps, A.; Narayan, R. Efficient Onboard Band Selection Algorithm for Hyperspectral Imagery in SmallSat Missions with Limited Downlink Capabilities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8646–8661.
  28. Xiang, X.; Jiang, Y.; Shi, B. Hyper-spectral image compression based on band selection and slant Haar type orthogonal transform. Int. J. Remote Sens. 2024, 45, 1658–1677.
  29. Zhang, J.; Zhang, Y.; Cai, X.; Xie, L. Three-Stages Hyperspectral Image Compression Sensing with Band Selection. CMES-Comput. Model. Eng. Sci. 2023, 134, 293–316.
  30. Zhou, X.; Zou, X.; Shen, X.; Wei, W.; Zhu, X.; Liu, H. BTC-Net: Efficient bit-level tensor data compression network for hyperspectral image. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5500717.
  31. Sun, W.; Du, Q. Hyperspectral band selection: A review. IEEE Geosci. Remote Sens. Mag. 2019, 7, 118–139.
  32. Zhu, F.; Wang, H.; Yang, L.; Li, C.; Wang, S. Lossless compression for hyperspectral images based on adaptive band selection and adaptive predictor selection. KSII Trans. Internet Inf. Syst. (TIIS) 2020, 14, 3295–3311.
  33. Chen, T.; Liu, H.; Ma, Z.; Shen, Q.; Cao, X.; Wang, Y. End-to-end learnt image compression via non-local attention optimization and improved context modeling. IEEE Trans. Image Process. 2021, 30, 3179–3191.
  34. Lai, Z.; Fu, Y. Mixed attention network for hyperspectral image denoising. arXiv 2023, arXiv:2301.11525.
  35. Gao, Z.; Yi, W. Prediction of Projectile Interception Point and Interception Time Based on Harris Hawk Optimization–Convolutional Neural Network–Support Vector Regression Algorithm. Mathematics 2025, 13, 338.
  36. Zimerman, I.; Wolf, L. On the long range abilities of transformers. arXiv 2023, arXiv:2311.16620.
  37. Cai, Y.; Lin, J.; Lin, Z.; Wang, H.; Zhang, Y.; Pfister, H.; Timofte, R.; Van Gool, L. Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 745–755.
  38. Yang, X.; Chen, J.; Yang, Z. Hyperspectral Image Reconstruction via Combinatorial Embedding of Cross-Channel Spatio-Spectral Clues. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6567–6575.
  39. AVIRIS-Airborne Visible/Infrared Imaging Spectrometer. 2025. Available online: https://aviris.jpl.nasa.gov/ (accessed on 4 March 2025).
  40. Pan, T.; Zhang, L.; Song, Y.; Liu, Y. Hybrid attention compression network with light graph attention module for remote sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6005605.
  41. Li, J.; Wu, C.; Song, R.; Li, Y.; Liu, F. Adaptive weighted attention network with camera spectral sensitivity prior for spectral reconstruction from RGB images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 462–463.
Figure 1. The overall structure of the proposed framework, with ARM-Net’s structure illustrated as the final network.
Figure 2. Example of cross-correlation between spectral bands for the Botswana dataset.
Figure 3. The backbone of the Recurrent Spectral Attention Compression Network (RSACN) based on a variational autoencoder. The red MRSA is our multi-head recurrent spectral attention module. This paper denotes the convolution as ’Conv Kernel-size Stride’, where ↑ / ↓ represent upsampling and downsampling, respectively.
Figure 4. Illustration of the multi-head recurrent spectral attention block. MLP stands for two-layer linear projections with Tanh activation. C and HW denote the spectral and spatial dimensions, respectively.
Figure 5. MSSA module and SSAB module ( N 1 , N 2 , and N 3 are all set to 3).
Figure 6. Structures of Spec-MSA and Spa-MSA.
Figure 7. Diagram of different MSAs. The green-colored box represents the query element, and the dashed box denotes the key element. (a) W-MSA calculates self-attention within position-specific windows. (b) SW-MSA processes data from different windows after W-MSA through spatial shuffling and alignment, introducing global cross-window interaction. (c) Spec-MSA treats each spectral channel as a token and calculates the self-attention along the spectral dimension.
Figure 8. (a) Botswana dataset. (b) Pavia Center dataset. (c) Pavia University dataset. (d) Examples of other HSIs collected by the AVIRIS sensor.
Figure 9. RD curves in terms of PSNR, MS-SSIM, and SAM aggregated over the mixed dataset. The plots on the left, from top to bottom, illustrate bpppb vs. PSNR, bpppb vs. MS-SSIM, and bpppb vs. SAM, respectively. The plots on the right present the compression ratios. Our proposed ARM-Net (bright pink line) achieves better RD performance than other approaches.
Figure 10. Visual comparison of example blocks from the Pavia University dataset at high bit rates. From (a–j): Ground Truth (GT), ARM-Net (bpppb: 0.74; PSNR: 33.32 dB; MS-SSIM: 24.74 dB; SAM: 0.17), FHNeRF (bpppb: 0.70; PSNR: 31.99 dB; MS-SSIM: 24.5 dB; SAM: 0.25), Verdú (bpppb: 0.78; PSNR: 27.62 dB; MS-SSIM: 24.51 dB; SAM: 0.33), CHENG (bpppb: 0.85; PSNR: 22.29 dB; MS-SSIM: 21.74 dB; SAM: 0.23), Pan (bpppb: 0.75; PSNR: 22.51 dB; MS-SSIM: 20.90 dB; SAM: 0.22), Hyperprior (bpppb: 0.74; PSNR: 21.36 dB; MS-SSIM: 18.58 dB; SAM: 0.212), PCA (bpppb: 0.86; PSNR: 24.07 dB; MS-SSIM: 22.37 dB; SAM: 0.35), BPG (bpppb: 0.79; PSNR: 27.20 dB; MS-SSIM: 19.86 dB; SAM: 0.34), and JPEG2000 (bpppb: 0.7918; PSNR: 24.4 dB; MS-SSIM: 17.58 dB; SAM: 0.06).
Figure 11. Visual comparison of example data blocks from the AVIRIS sensor acquisition dataset at low bit rates. (a) Original. (b) ARM-Net (bpppb: 0.22; PSNR: 26.76 dB; MS-SSIM: 23.16 dB; SAM: 0.26). (c) FHNeRF (bpppb: 0.18; PSNR: 25.20 dB; MS-SSIM: 21.40 dB; SAM: 0.33). (d) Verdú (bpppb: 0.28; PSNR: 23.48 dB; MS-SSIM: 18.47 dB; SAM: 0.34). (e) CHENG (bpppb: 0.19; PSNR: 18.85 dB; MS-SSIM: 17.56 dB; SAM: 0.38). (f) Pan (bpppb: 0.13; PSNR: 18.10 dB; MS-SSIM: 16.90 dB; SAM: 0.32). (g) Hyperprior (bpppb: 0.23; PSNR: 17.81 dB; MS-SSIM: 14.41 dB; SAM: 0.39). (h) PCA (bpppb: 0.17; PSNR: 19.88 dB; MS-SSIM: 17.47 dB; SAM: 0.42). (i) BPG (bpppb: 0.20; PSNR: 19.07 dB; MS-SSIM: 11.55 dB; SAM: 0.40). (j) JPEG2000 (bpppb: 0.16; PSNR: 14.08 dB; MS-SSIM: 10.62 dB; SAM: 0.12).
Figure 12. Band-selective frequency distribution comparison of original and reconstructed hyperspectral data. The horizontal axis (1–30) represents the 30 bands of the patch, while the vertical axis shows the frequency with which each band was selected by the ABS.
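The tally behind such a selection-frequency plot can be sketched as follows. This is only an illustration of the counting step: the function name and the input format (one list of selected band numbers per patch, numbered 1 to 30 as on the horizontal axis) are assumptions, and the ABS procedure that produces the selections is not reproduced here.

```python
from collections import Counter

def selection_frequency(selections, num_bands=30):
    """Count how often each band (numbered 1..num_bands) was selected.

    `selections` is a hypothetical input: one list of selected band
    numbers per patch. Returns a list whose entry i-1 is the count
    for band i.
    """
    counts = Counter(band for patch in selections for band in patch)
    return [counts.get(band, 0) for band in range(1, num_bands + 1)]
```

For example, two patches that select bands `[1, 5, 30]` and `[1, 2, 30]` give a count of 2 for bands 1 and 30, and 1 for bands 2 and 5.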
Figure 13. Comparison of MRSA and Spa-MSA.
Figure 14. PSNR values achieved by different network combinations in the ablation experiments.
Table 1. Summary of the nine compression methods compared in this study.
| Method | Advantages | Disadvantages | Applicability | Limitations |
|---|---|---|---|---|
| ARM-Net (ours) | Spatial and spectral feature fusion | Framework dependency issues | General hyperspectral images | Slow decoding speed |
| FHNeRF (2024) | Implicit transform coding | Limited generalizability | General hyperspectral images | Training relies on specific images |
| Verdú (2024) | Channel clustering reduces complexity | Spectral channel dependence | General hyperspectral images | Limited by embedded architecture |
| CHENG (2020) | Accurate modeling of discrete Gaussian mixture models | High computational complexity | General still images | Weak spectral information representation |
| Pan (2023) | Focuses on content and texture branches | High computational complexity | General still images | May introduce artifacts |
| Hyperprior (2017) | Accurate modeling of hyperprior entropy model | Insufficient adaptability | General still images | Weak spectral information representation |
| PCA | Reduces feature dimensionality | Sensitive to data accuracy | Pre-compression of small-sized and high-relevance images | Complex decompression |
| BPG | High dynamic range | Low codec performance | High-quality, low-bandwidth transmission | Poor compatibility |
| JPEG2000 | Transparent progressive | Low bit-rate blur | Medical/Satellite images | Limited adaptability to complex scenarios |
Table 2. Model parameters, FLOPs, and inference times (including enc-time and dec-time) for the proposed method and comparison methods.
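| Method | Parameters (M) | FLOPs (G) | Enc-Times (s) | Dec-Times (s) |
|---|---|---|---|---|
| ARM-Net | 9.3 | 7.9 | 0.16 | 0.29 |
| FHNeRF | 0.0047851.7 | | 0.11 | 0.14 |
| Pan | 21.0 | 55.6 | 0.42 | 0.40 |
| Cheng | 18.0 | 61.1 | 0.40 | 1.50 |
| Hyperprior | 7.1 | 28.7 | 0.12 | 0.15 |
| Verdú | 7.1 | 28.7 | 0.13 | 0.16 |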
Table 3. Effect of different band selection methods on the PSNR of ARM-Net.
| Compression Ratio | 1.0/16 | 0.8/16 |
|---|---|---|
| PSNR with adaptive band selection algorithm | 34.14 dB | 33.11 dB |
| PSNR for equally spaced samples | 32.07 dB | 31.66 dB |
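For reference, the equally spaced sampling baseline can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function name, the zero-based indexing, the choice of the centre of each interval, and the example counts are assumptions.

```python
def equally_spaced_bands(num_bands, num_selected):
    """Return `num_selected` zero-based band indices spread evenly
    over the range [0, num_bands)."""
    if not 0 < num_selected <= num_bands:
        raise ValueError("num_selected must be in 1..num_bands")
    step = num_bands / num_selected
    # Take the band at the centre of each of the equal-width intervals.
    return [int(i * step + step / 2) for i in range(num_selected)]
```

With a 30-band patch and a hypothetical budget of 10 bands, this picks indices 1, 4, 7, and so on up to 28, regardless of image content. An adaptive selector instead chooses bands per patch, which is the difference the PSNR gap in Table 3 measures.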
Table 4. Impact of hyperparameter β on PSNR performance of ARM-Net at different bpppb values.
| β | 2 | 3 | 4 |
|---|---|---|---|
| PSNR (bpppb = 1.0) | 28.14 | 34.14 | 34.21 |
| PSNR (bpppb = 0.8) | 26.11 | 33.11 | 33.06 |
Table 5. Parameters, FLOPs, and inference times of the ablation methods.
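| Methods | Parameters | FLOPs | Times |
|---|---|---|---|
| Hyperprior+MST | 8.6 M | 8.0 G | 16.8 ms |
| MSSSA+MST | 38.7 M | 45.5 G | 30.8 ms |
| Cheng+MST | 11.2 M | 11.5 G | 21.6 ms |
| Hyperprior+AWAN | 7.5 M | 9.9 G | 17.9 ms |
| MSSSA+AWAN | 37.5 M | 47.2 G | 34.1 ms |
| Cheng+AWAN | 10.1 M | 13.6 G | 22.6 ms |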
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Fang, Q.; Wang, Z.; Wang, J.; Zhang, L. ARM-Net: A Tri-Phase Integrated Network for Hyperspectral Image Compression. Sensors 2025, 25, 1843. https://doi.org/10.3390/s25061843
