1. Introduction
Hyperspectral images (HSIs) are captured through continuous narrow-band imaging, enabling the acquisition of fine spectral characteristics of target objects. Compared to grayscale and RGB images, HSIs can more comprehensively reflect the spectral information and spatial distribution of target areas, making them widely applicable in environmental monitoring, agriculture, military reconnaissance, and other fields [
1,
2,
3]. HSIs usually have high resolution, wide swath coverage, and rich geospatial information, and they are suitable for tasks such as image classification [
4], target segmentation [
5], target detection [
6], and super-resolution [
7]. Recent years have witnessed significant advances in deep learning and artificial intelligence within the field of hyperspectral imaging. Techniques such as convolutional neural networks (CNNs), self-supervised learning, and Transformer-based models have substantially improved performance in tasks including classification, anomaly detection, and target detection. These AI-driven methodologies are increasingly exploring their generalization capabilities across different tasks, domains, and sensor types, thereby paving new technical pathways toward scalable, automated, and robust hyperspectral analysis workflows [
8]. However, their enormous data volume poses significant challenges for transmission and storage in on-orbit remote sensing satellites and end-user applications. To achieve high fidelity and better preserve the spectral information of HSIs, efficient HSI compression techniques have become a key research focus.
Traditional 2D image coding employs a hybrid coding framework composed of modules such as prediction, transform, quantization, and entropy coding. Discrete Cosine Transform (DCT) and Discrete Wavelet Transform (DWT) are widely used in conventional transform coding techniques to decompose the original image into energy-concentrated coefficients, which are then quantized to achieve efficient compression. For example, the classic image coding standard JPEG (Joint Photographic Experts Group) [
9] applies DCT to 8 × 8 image blocks, but this approach introduces severe blocking artifacts under low-bitrate conditions. JPEG2000 [
10] adopts DWT, supporting input blocks of arbitrary sizes and significantly mitigating blocking artifacts.
Recently, deep learning-based lossy image compression has achieved better performance than traditional approaches [
11,
12]. For instance, Toderici et al. [
13,
14] utilized Recurrent Neural Networks (RNNs) to progressively generate entropy-coded bitstreams and intermediate reconstructed images of varying quality, adjusting the compression ratio by controlling the number of iterations. Jiao et al. [
15] employed Convolutional Neural Networks (CNNs) to reduce artifacts generated by the JPEG standard when compressing holographic images, addressing quality degradation caused by the loss of high-frequency features during compression. Kingma et al. [
16] introduced Variational Autoencoders (VAEs), while Ballé et al. [
17] enhanced the original VAE framework by incorporating an entropy model and a hyperprior structure. Minnen et al. [
18] further advanced the field by introducing an autoregressive model. Cheng et al. [
19] significantly improved compression efficiency by integrating residual blocks, a Gaussian Mixture Model (GMM)-based entropy model, and attention mechanisms. Minnen et al. [
20] proposed a channel-wise autoregressive entropy model, further enhancing the performance of learned image compression algorithms. Currently, end-to-end learned image coding techniques fully demonstrate their potential in leveraging spatial correlations and statistical distributions of images, achieving optimized rate-distortion trade-offs while allowing rapid adjustment of various distortion metrics.
With the rapid growth of HSI data, compression methods tailored for HSIs have been extensively studied [
21]. The focus lies in designing effective feature extractors and entropy coders. In recent years, to address the urgent demand for HSI compression, the Consultative Committee for Space Data Systems (CCSDS) established and released the low-complexity lossless and near-lossless multispectral/hyperspectral image compression standard CCSDS123.0-B-2 [
21] in 2019. This new standard extends the capabilities of its predecessor, CCSDS123.0-B-1 [21], which only supported lossless compression. Here, ‘near-lossless’ means that the maximum error in the reconstructed image can be constrained to a user-specified bound. CCSDS123.0-B-2 improves the predictor by introducing error constraints, enabling near-lossless compression. Additionally, it incorporates a novel entropy coding method, Hybrid Entropy Coding, to enhance compression efficiency for low-entropy data.
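As a small illustration of the near-lossless guarantee, the following check (NumPy; the function and array names are hypothetical, not part of the standard) verifies that every reconstructed pixel lies within a user-specified absolute error bound:

```python
import numpy as np

def within_error_bound(original: np.ndarray, reconstructed: np.ndarray, max_error: int) -> bool:
    """True if the peak absolute reconstruction error does not exceed the user-specified bound."""
    diff = original.astype(np.int64) - reconstructed.astype(np.int64)
    return int(np.abs(diff).max()) <= max_error
```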
Meanwhile, learning-based HSI compression methods have also advanced rapidly. Guo [
22] proposed HCCNet, a hyperspectral compression network leveraging cross-channel contrastive learning. Zhang [
23] adopted a compressed sensing approach, introducing a CNN–Transformer hybrid architecture for HSI compression. Rezasoltani [
24] developed a compression method based on Implicit Neural Representation (INR). Zhang [
25] proposed a low-overhead, compressed sensing-based channel filtering method, where the encoder derives channel filters via least-squares regression between the compressed and original images, transmitting a bitstream containing both the compressed image and filters to the decoder.
Although existing methods have achieved certain progress, HSI compression still faces numerous challenges. Traditional approaches such as JPEG2000 [
10] and CCSDS123.0-B-2 exhibit a significant degradation in reconstructed image quality at low bit rates, while direct application of visible-light image compression models leads to the loss of spectral characteristics. Most current deep learning-based HSI compression methods adopt compressed sensing frameworks, which generate code streams without entropy encoding modules and are thus unsuitable for transmission over satellite communication channels. Therefore, investigating HSI compression models incorporating entropy encoding is of great importance for reducing the data transmission burden in satellite systems.
The data characteristics of HSIs dictate the design rationale for their compression frameworks. The spatial dimension of an HSI contains rich textural and structural information, while the spectral dimension records the unique spectral response curves of ground objects. These two types of information are closely interrelated yet distinct: abrupt changes in spatial details often correspond to rapid variations in spectral features, whereas smooth regions exhibit high inter-spectral correlation. Existing methods, such as those directly adapted from visible-light image models, typically employ uniform 2D convolution operations. This approach struggles to adaptively balance the utilization efficiency of spatial and spectral information, often resulting in either blurred spatial details or distorted spectral characteristics under constrained bitrates. Therefore, the primary motivation of this work is to design a mechanism capable of explicitly and separately extracting spatial and spectral features, aiming to maximize the preservation of critical information in each dimension through dedicated feature extraction pathways.
However, processing spatial and spectral features in isolation is insufficient for achieving optimal compression. The essence of an HSI lies in its representation of a spatial scene projected across a continuous spectral range, with both dimensions unified at a semantic level. If spatial and spectral features are separated during encoding, the decoder will face challenges in reconstructing this intrinsic coherence. Thus, feature fusion becomes a critical step. This paper introduces a spatial–spectral joint encoding–decoding network, where independently extracted features are fused in the latent space. This enables the model to learn the mapping relationships between spatial structures and spectral curves, thereby leveraging the fused features during decoding to reconstruct the image with higher accuracy.
During the feature fusion and reconstruction process, the contribution of different spatial locations and spectral bands to the overall reconstruction quality is not uniform. For instance, spatial details in edge regions typically carry higher information content and warrant a greater allocation of bits. Existing compression frameworks often lack the capability to model such non-uniform importance, leading to inefficient bitrate allocation. To address this key limitation of undifferentiated bit distribution, this work incorporates an attention mechanism. The attention module dynamically computes weights for spatial–spectral features, guiding the model to prioritize the allocation of limited bits to information-intensive regions and bands. Consequently, under the same bitrate, the proposed approach achieves higher reconstruction fidelity, particularly by suppressing overall distortion under low-bitrate conditions.
In light of the correlated yet distinct spatial and spectral characteristics of hyperspectral imagery, this paper proposes an end-to-end lossy compression model based on joint spatial–spectral feature extraction and attention mechanisms. The model employs a variational autoencoder (VAE) architecture to separately extract and then fuse spatial and spectral features for encoding and decoding, thereby achieving end-to-end compression of HSIs. Building upon visible-light image compression models, spectral features are extracted using 3D convolutions and fused with spatial features obtained through 2D convolutions. The latent representation of the image is derived via the analysis transform, quantized, and entropy-encoded using a channel-wise autoregressive entropy model to generate a transmission-friendly bitstream. At the decoder, the bitstream is entropy-decoded, passed through the synthesis transform, and reconstructed into the original HSI via a joint spatial–spectral feature recovery module. The joint feature extraction and recovery modules are structurally symmetric. The effectiveness of the proposed method is validated through comparisons with state-of-the-art HSI compression standards and algorithms.
The structure of this paper is organized as follows:
Section 1, the Introduction, outlines the characteristics of HSIs, traditional image compression methods, and deep learning-based hyperspectral compression techniques. It also discusses the limitations of existing approaches and provides a brief overview of the proposed model architecture.
Section 2 reviews the main development of lossy compression methods for HSIs based on deep learning.
Section 3 elaborates on the structure of the end-to-end HSI compression model based on joint spatial–spectral feature extraction and attention mechanisms, detailing the spatial information extraction module, spectral information extraction module, spatial–spectral feature fusion-based encoding–decoding network, and the local attention mechanism.
Section 4 describes the experimental setup and evaluation methodology.
Section 5 presents an analysis of the rate-distortion performance of the proposed method, comparing it with traditional approaches and two deep learning-based visible-light image compression models. Reconstructed image details are visually demonstrated, and the results are thoroughly discussed.
2. Related Work
Early HSI compression methods typically applied RGB image compression techniques to encode HSI data in a band-by-band manner. This approach fails to exploit inter-spectral correlations, resulting in suboptimal rate-distortion performance. To address this limitation, several studies [
26,
27] have extended two-dimensional image compression algorithms to handle the three-dimensional nature of HSIs. These methods often employ transform coding, which converts HSIs from the pixel domain into a latent space (e.g., frequency domain). Prominent transforms, such as the Discrete Cosine Transform (DCT) and wavelet transforms, utilize linear operations typically defined by a set of linear and orthogonal basis functions. However, such linear transforms with fixed bases may not fully capture the underlying redundancies, as HSIs are not generated by a linear combination of independent components [
28]. Consequently, nonlinear transforms may be more suitable for HSI compression.
In recent years, learning-based nonlinear transforms [
29,
30] have demonstrated considerable potential in HSI compression. Dua et al. [
29] and La et al. [
30] were among the first to introduce autoencoders (AEs) for lossy HSI compression. Subsequently, the variational autoencoder (VAE), which incorporates variational Bayesian theory, has been adopted to represent the latent features of input HSIs from a probabilistic perspective. Owing to its ability to effectively model image statistics, the VAE framework has emerged as a dominant architecture in image compression. Building upon the VAE, Guo et al. [
31] utilized a hyperprior model [
17] to compress HSIs. Guo [
22] further proposed HCCNet, a hyperspectral compression network incorporating cross-channel contrastive learning, to better preserve spectral characteristics.
To fully leverage inter-spectral correlations and achieve superior rate-distortion performance, the proposed HSI compression method adopts a variational autoencoder architecture. It integrates both 2D and 3D convolutional layers to jointly model spatial and spectral features, and incorporates an attention mechanism to enhance reconstruction detail and rate-distortion optimization.
3. Method
The overall framework of the proposed end-to-end HSI compression network, which integrates spatial–spectral joint feature extraction with attention mechanisms, is illustrated in
Figure 1. The encoding process first feeds the input HSI into a spatial–spectral joint feature extraction network, which separately extracts spectral and spatial features; these are then concatenated to form the complete joint spatial–spectral feature representation. The joint features are subsequently input into a feature fusion encoding network to generate a low-dimensional feature representation, which is then quantized. The probability distribution of the quantized features is predicted by a single Gaussian distribution-based channel-wise autoregressive entropy model (SGM-based channel-wise autoregressive entropy model) [20], and arithmetic coding is used to generate the binary bitstream.
The decoding reconstruction process performs inverse operations of the encoding stage. The decoder first reconstructs the low-dimensional feature representations from the compressed binary bitstream through arithmetic decoding and inverse quantization. These recovered low-dimensional feature representations are then fed into a feature fusion decoding network to restore the joint spatial–spectral features. Ultimately, the HSI is reconstructed by the spatial–spectral joint image reconstruction network.
Tailored to the characteristics of HSIs, the proposed compression approach extracts spatial and spectral information separately, fuses them, and feeds the fused representation into the fusion coding network to eliminate redundancy more effectively. During decoding, the spatial and spectral components are reconstructed to achieve high-quality image restoration.
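To make the overall data flow concrete, the sketch below outlines the encode/decode pipeline in PyTorch-style pseudocode. Names such as `spatial_branch`, `spectral_branch`, `fusion_encoder`, and `entropy_model` are illustrative placeholders for the modules described in the following subsections, not the authors' actual implementation.

```python
import torch

def compress(x, spatial_branch, spectral_branch, fusion_encoder, entropy_model):
    # x: (B, C, H, W) hyperspectral cube with C spectral bands
    f_spa = spatial_branch(x)                 # spatial features (grouped 2D convolutions)
    f_spe = spectral_branch(x)                # spectral features (spectral-only 3D convolutions)
    f = torch.cat([f_spa, f_spe], dim=1)      # joint spatial-spectral features
    y = fusion_encoder(f)                     # low-dimensional latent representation
    y_hat = torch.round(y)                    # quantization
    return entropy_model.encode(y_hat)        # channel-wise autoregressive entropy coding

def decompress(bitstream, entropy_model, fusion_decoder, reconstruction_net):
    y_hat = entropy_model.decode(bitstream)   # arithmetic decoding of the latents
    f = fusion_decoder(y_hat)                 # restore joint spatial-spectral features
    return reconstruction_net(f)              # spatial-spectral joint image reconstruction
```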
3.1. 3D-2D Hybrid Convolution-Based Spatial–Spectral Joint Feature Extraction Module
3.1.1. Spatial Feature Extraction Module
Spatial information constitutes the primary input received by the human visual system and holds particular significance in the field of image processing. Similar to visible light images, HSIs also contain substantial spatial information. With continuous improvements in spectral imaging resolution, the spatial information in acquired images has become increasingly abundant. During HSI compression, it is crucial to eliminate spatial redundancies while preserving critical features.
This section proposes a spatial feature extraction module based on a residual CNN architecture. Extracting spatial features directly with a standard CNN would cause unintended fusion of information across spectral bands, thereby interfering with spectral feature extraction. To address this limitation, we employ grouped convolution for feature extraction. The detailed architecture is illustrated in
Figure 2, where Conv() represents grouped convolution. All parameters remain consistent with conventional convolution except for the final parameter, which specifies the number of groups.
To accelerate feature extraction and mitigate issues such as gradient vanishing or explosion, the fundamental unit of the spatial feature extraction module adopts a two-layer residual block structure. The processing pipeline operates as follows: the HSI data is first fed into an initial grouped convolutional layer. While this layer maintains the same number of convolutional kernels as conventional convolution, it significantly reduces computational overhead by restricting each kernel's operation to its assigned feature group during extraction.
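As a minimal illustration, a two-layer residual block built from grouped 2D convolutions could be written as below (the channel count and group number are assumptions; the paper's exact layer configuration is given in Figure 2):

```python
import torch.nn as nn

class GroupedResBlock(nn.Module):
    """Two-layer residual block; groups > 1 keeps each kernel within its own band group."""
    def __init__(self, channels: int, groups: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups),
        )

    def forward(self, x):
        return x + self.body(x)   # residual connection mitigates vanishing/exploding gradients
```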
3.1.2. Spectral Feature Extraction Module
In addition to spatial information, HSIs contain substantial spectral information with significant inter-band redundancies that must be effectively removed during compression. To better eliminate these spectral redundancies, this section proposes a dedicated Spectral Feature Extraction Module specifically designed to extract inter-band spectral characteristics.
Similar to the spatial feature extraction module, the spectral feature extraction module is designed to exclusively process inter-band information to maintain spectral integrity while maximizing redundancy elimination, deliberately avoiding fusion with spatial features. Three-dimensional (3D) convolution extends traditional two-dimensional (2D) convolution by incorporating an additional data dimension, making it particularly suitable for processing multi-frame image data in applications such as video analysis and medical imaging.
For HSIs containing multiple spectral bands, the data can be conceptually represented in two ways: as a single-frame image with multiple channels (where channels correspond to spectral bands) or as a multi-frame sequence with a single channel (where frames represent spectral bands). To facilitate effective spectral feature extraction using 3D convolution, we adopt the latter representation, treating each spectral band as an individual frame in a single-channel sequence.
In a standard implementation, a 3D convolution kernel slides across all three dimensions of the input data, introducing an additional depth-wise convolution compared to 2D convolution. The computational cost can be quantified as follows: for 2D convolution, the computation scales as $k^{2} \cdot W \cdot H \cdot C$, while for 3D convolution it increases to $k^{3} \cdot W \cdot H \cdot D$, where $C$ represents the number of spectral bands (treated as input channels in the 2D case), $k$ denotes the kernel size, $W$ and $H$ indicate the spatial width and height, respectively, and $D$ corresponds to the spectral dimension (analogous to the temporal dimension in video processing). This formulation demonstrates that 3D convolution incurs significantly higher computational overhead than its 2D counterpart.
Direct application of 3D convolution for HSI feature extraction would not only incur substantial computational overhead but would also lead to unintended fusion of spatial and spectral information during feature extraction. To mitigate computational complexity while exclusively extracting inter-band spectral features, we strategically configure the 3D convolution kernel dimensions: employing a 1 × 1 kernel in the spatial dimensions (effectively disabling spatial feature extraction) while maintaining a kernel size of 3 along the spectral dimension. This modified implementation restricts feature extraction solely to inter-band correlations and significantly reduces parameter computations, with the complexity now expressed as $k \cdot W \cdot H \cdot D$. Compared to conventional 3D convolution, this approach achieves a $k^{2}$-fold reduction in parameter count while preserving dedicated spectral information extraction.
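The spectral-only configuration can be expressed directly through the 3D kernel shape. The short check below (PyTorch; the channel width of 16 is an arbitrary assumption) confirms the $k^{2}$-fold weight reduction for $k = 3$:

```python
import torch.nn as nn

full_3d     = nn.Conv3d(1, 16, kernel_size=(3, 3, 3), padding=(1, 1, 1))  # spatial + spectral
spectral_3d = nn.Conv3d(1, 16, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # spectral only

n_full = sum(p.numel() for p in full_3d.parameters() if p.dim() > 1)      # weight tensors only
n_spec = sum(p.numel() for p in spectral_3d.parameters() if p.dim() > 1)
print(n_full // n_spec)   # 9, i.e. a k^2-fold reduction for k = 3
```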
As illustrated in
Figure 3, the spectral feature extraction module architecture processes hyperspectral data through the following stages: First, the input HSI is transformed into a single-channel multi-frame representation, where the frame count corresponds to the number of spectral bands. This restructured data then passes through a series of two-layer residual blocks composed of our modified 3D convolutions. Multiple residual blocks progressively extract hierarchical spectral features while maintaining spectral purity. Finally, the processed feature maps are reshaped to their original dimensions for subsequent fusion with parallel-extracted spatial features. Notably, all convolution kernels maintain three-dimensional parameters (one spectral dimension and two spatial dimensions), though spatial feature extraction is deliberately suppressed through the 1 × 1 spatial kernel configuration. This design ensures exclusive spectral information flow while maintaining compatibility with spatial feature fusion at later network stages.
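Putting these pieces together, the spectral branch can be sketched as follows; this is a simplified illustration under assumed layer counts, not the authors' exact configuration. The band dimension is moved into the 3D depth axis, passed through residual blocks of spectral-only 3D convolutions, and reshaped back for fusion with the spatial features.

```python
import torch
import torch.nn as nn

class SpectralResBlock3D(nn.Module):
    """Two spectral-only 3D convolutions (kernel 3 along bands, 1x1 spatially) with a skip."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
        )

    def forward(self, x):
        return x + self.body(x)

def spectral_branch(x: torch.Tensor, blocks: nn.ModuleList) -> torch.Tensor:
    # x: (B, C, H, W) -> single-channel multi-frame volume (B, 1, C, H, W)
    v = x.unsqueeze(1)
    for block in blocks:
        v = block(v)
    return v.squeeze(1)          # back to (B, C, H, W) for fusion with spatial features
```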
3.2. Spatial–Spectral Joint Feature Fusion Encoding–Decoding Network Model
Inspired by end-to-end compression models for visible light images, our framework employs a hybrid 2D-3D convolutional architecture to extract joint spatial–spectral features, which are then processed through a dedicated spatial–spectral feature fusion encoding–decoding network. To enhance the model’s capability in learning complex information relationships, we incorporate a local attention mechanism. As illustrated in
Figure 4, the proposed spatial–spectral fusion encoding/decoding network comprises three core modules: spatial–spectral feature fusion encoding network, spatial–spectral feature fusion decoding network and arithmetic encoding/decoding components.
The spatial–spectral feature fusion encoding network processes the input hyperspectral image’s joint spatial–spectral features through a series of convolutional downsampling modules, Generalized Divisive Normalization (GDN) layers [
32], and local attention mechanisms to achieve effective feature fusion. The resulting low-dimensional compressed representation preserves the majority of the original hyperspectral image’s critical information while significantly reducing redundancy.
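A representative layout for such an analysis transform is sketched below. The strides, kernel sizes, channel widths, and number of stages are assumptions in the spirit of learned image compression encoders; `GDN` and `LocalAttention` stand for the normalization layer of [32] and the module of Section 3.3, passed in as constructors.

```python
import torch.nn as nn

def build_fusion_encoder(in_ch: int, mid_ch: int, latent_ch: int,
                         GDN: type, LocalAttention: type) -> nn.Sequential:
    """Analysis transform: strided-conv downsampling + GDN + local attention (illustrative)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=5, stride=2, padding=2), GDN(mid_ch),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=5, stride=2, padding=2), GDN(mid_ch),
        LocalAttention(mid_ch),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=5, stride=2, padding=2), GDN(mid_ch),
        nn.Conv2d(mid_ch, latent_ch, kernel_size=5, stride=2, padding=2),
        LocalAttention(latent_ch),
    )
```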
Our proposed method builds upon the hyperprior architecture [
17,
18] and channel-wise autoregressive entropy model [
The encoder $g_a$ transforms a given input image $x$ into a latent representation $y$. After quantization $\hat{y} = Q(y)$, $\hat{y}$ denotes the discrete-valued latent representation of $y$. The decoder $g_s$ subsequently reconstructs the output image $\hat{x}$ from $\hat{y}$. This primary workflow can be formally expressed as:

$$y = g_a(x; \phi), \qquad \hat{y} = Q(y), \qquad \hat{x} = g_s(\hat{y}; \theta),$$

where $\phi$ and $\theta$ are the trainable parameters of the encoder $g_a$ and decoder $g_s$. Quantization $Q(\cdot)$ inevitably introduces quantization errors in the latent ($\hat{y} - y$), which lead to distortion of the reconstructed image. Following previous work [20], in the training phase we also handle the quantization error by rounding and adding the predicted quantization error.
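In training, the rounding operation is commonly made differentiable with a straight-through estimator applied to the mean-removed latent; the sketch below follows this common practice (the exact scheme used here is an assumption consistent with the spirit of [20]):

```python
import torch

def quantize_ste(y: torch.Tensor, mean: torch.Tensor) -> torch.Tensor:
    """Round the mean-removed latent; the rounding residual is detached so gradients
    pass straight through to y (straight-through estimator)."""
    centered = y - mean
    return (torch.round(centered) - centered).detach() + centered + mean
```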
We model each element $\hat{y}_i$ as a single Gaussian distribution with standard deviation $\sigma_i$ and mean $\mu_i$ by introducing side information $\hat{z}$. The distribution $p_{\hat{y} \mid \hat{z}}(\hat{y} \mid \hat{z})$ of $\hat{y}$ is modeled by a single Gaussian distribution-based entropy model (SGM):

$$p_{\hat{y} \mid \hat{z}}(\hat{y} \mid \hat{z}) = \prod_i \left( \mathcal{N}(\mu_i, \sigma_i^{2}) * \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{y}_i).$$
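Concretely, the probability mass of each quantized element is obtained by integrating the Gaussian over the unit quantization bin, and the rate is estimated from its negative log-likelihood, as in this minimal sketch (function names are illustrative):

```python
import torch

def sgm_likelihood(y_hat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Probability mass of each quantized latent under N(mu, sigma^2), integrated over its bin."""
    gaussian = torch.distributions.Normal(mu, sigma.clamp_min(1e-9))
    return gaussian.cdf(y_hat + 0.5) - gaussian.cdf(y_hat - 0.5)

def estimated_bits(likelihood: torch.Tensor) -> torch.Tensor:
    """Rate estimate: sum of -log2(likelihood) over all latent elements."""
    return torch.sum(-torch.log2(likelihood.clamp_min(1e-9)))
```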
The loss function of the image compression model is:

$$\mathcal{L} = R(\hat{y}) + R(\hat{z}) + \lambda \cdot D(x, \hat{x}),$$

where $\lambda$ controls the trade-off between rate and distortion, $R(\hat{y})$ and $R(\hat{z})$ are the bit rates of the latents $\hat{y}$ and $\hat{z}$, and $D(x, \hat{x})$ is the distortion between the raw image $x$ and the reconstructed image $\hat{x}$.
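Assembled in code, the objective combines the estimated rates of both latents with an MSE distortion term; the sketch below assumes the rate is normalized to bits per pixel, which is an implementation convention rather than a detail stated above:

```python
import torch
import torch.nn.functional as F

def rd_loss(x, x_hat, bits_y, bits_z, lam: float, num_pixels: int) -> torch.Tensor:
    """L = R(y_hat) + R(z_hat) + lambda * D(x, x_hat); rate expressed in bits per pixel."""
    rate = (bits_y + bits_z) / num_pixels
    distortion = F.mse_loss(x_hat, x)
    return rate + lam * distortion
```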
The encoding network incorporates Generalized Divisive Normalization (GDN), which serves as a divisive normalization module. GDN represents one of the most widely used normalization and nonlinear activation functions in image compression. In the decoding network, Inverse GDN (IGDN) performs the inverse operation of GDN in the encoding pathway. Within both the spatial–spectral feature fusion encoding and decoding networks, we introduce a local attention mechanism that assigns higher weighting coefficients to image feature regions with high contrast during the fusion encoding/decoding process, followed by fusion encoding/decoding of the re-weighted features.
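For reference, a minimal (non-production) GDN layer can be written as below; practical implementations such as the one proposed in [32] additionally reparameterize the β and γ parameters to guarantee positivity and stable training, which this sketch only approximates with absolute values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Minimal GDN: y_c = x_c / sqrt(beta_c + sum_k gamma_ck * x_k^2), computed with a 1x1 conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):                                      # x: (B, C, H, W)
        weight = self.gamma.abs().unsqueeze(-1).unsqueeze(-1)  # (C, C, 1, 1) 1x1 conv weight
        norm = F.conv2d(x * x, weight, bias=self.beta.abs())
        return x / torch.sqrt(norm + 1e-9)
```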
The decoder network performs inverse operations corresponding to the encoder network, utilizing convolutional upsampling modules, IGDN, and local attention modules to reconstruct the spatial–spectral features of the HSI.
HSI compression aims to reconstruct high-quality images at given bitrates, where establishing an accurate entropy model is crucial for bitrate estimation. Therefore, the proposed network architecture introduces an SGM-based channel-wise autoregressive entropy model [
20] to predict the latent probability distribution of the low-dimensional features output by the encoding network. The learned parameters are then transmitted to both the arithmetic encoder and decoder for entropy encoding and decoding operations.
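The channel-wise autoregressive scheme of [20] partitions the latent channels into slices and predicts the Gaussian parameters of each slice from the hyperprior features together with previously processed slices; the following is a schematic sketch (the slice count and per-slice parameter networks are assumptions):

```python
import torch
import torch.nn.functional as F

def channelwise_ar_params(y_hat_slices, hyper_features, param_nets):
    """Predict (mu, sigma) for each latent channel slice, conditioning each slice on the
    hyperprior features plus all previously processed slices (schematic sketch)."""
    decoded, params = [], []
    for y_slice, net in zip(y_hat_slices, param_nets):      # one small conv network per slice
        context = torch.cat([hyper_features] + decoded, dim=1)
        mu, sigma = net(context).chunk(2, dim=1)            # split prediction into mean / scale
        params.append((mu, F.softplus(sigma)))              # softplus keeps sigma positive
        decoded.append(y_slice)                             # this slice conditions later slices
    return params
```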
3.3. Local Attention Mechanism
The attention mechanism computationally emulates biological perceptual processes by selectively allocating cognitive resources to salient regions, thereby enhancing detail acquisition while suppressing irrelevant information. This paradigm mirrors neurobiological attention mechanisms wherein distinct spatial regions receive differential processing priority based on their behavioral relevance. Specifically, the system dynamically modulates its information processing capacity to preferentially encode and retain high-value visual features in regions of interest (ROIs), while attenuating neural responses to less significant areas.
The network training process involves learning hierarchical representations of image data through the optimization of feature-specific weighting parameters via backpropagation. These learned weights are applied through element-wise multiplication with the original input features to generate enhanced feature maps.
In the network training process, the generation of attention masks based on spatially adjacent elements can effectively enhance the rate-distortion performance of the network at a relatively low computational cost. To improve the quality of reconstructed images, inspired by the Non-Local Attention Module (NLAM) [
33], this section introduces a local attention module into the network architecture. This module specifically focuses on regions with complex texture features by partitioning the feature map into non-overlapping blocks and independently computing attention maps for each block.
$$y_i^{k} = \frac{1}{\mathcal{C}(x^{k})} \sum_{\forall j} f\!\left(x_i^{k}, x_j^{k}\right) W^{k} x_j^{k}$$

Here, $x_i^{k}$ and $x_j^{k}$ represent the i-th and j-th elements within the k-th partitioned block, respectively. The function $f(\cdot,\cdot)$ denotes an embedded Gaussian function, which computes the similarity between elements, while $\mathcal{C}(x^{k})$ serves as a normalization factor to ensure the attention weights sum to unity. Additionally, $W^{k}$ corresponds to the embedding matrix associated with the spatial position of the k-th block, enabling location-aware feature transformation.
The attention map for each block is obtained by incorporating residual connections into the intra-block attention computation, as formalized in Equation (6):

$$\hat{x}_i^{k} = x_i^{k} + y_i^{k}$$
Based on the computational results, regions with significant information divergence in the image are identified, and additional bit allocation is assigned to these complex information areas. The structure of the local attention module is illustrated in
Figure 5a, where the residual block consists of convolutional layers composed of 1 × 1 and 3 × 3 convolutions, as shown in
Figure 5b.
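A blockwise attention layer in the embedded-Gaussian style described above might be implemented as in the sketch below (the window size and 1 × 1 embedding convolutions are assumptions; the residual-block branches of Figure 5 are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention(nn.Module):
    """Non-overlapping window attention: softmax-normalized similarities within each block."""
    def __init__(self, channels: int, block: int = 8):
        super().__init__()
        self.block = block
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)   # query embedding
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)     # key embedding
        self.g = nn.Conv2d(channels, channels, kernel_size=1)       # value embedding

    def forward(self, x):                       # x: (B, C, H, W), H and W divisible by block
        b, c, h, w = x.shape
        s = self.block

        def to_blocks(t):                        # (B, C, H, W) -> (B*nBlocks, s*s, C)
            t = t.unfold(2, s, s).unfold(3, s, s)            # (B, C, H/s, W/s, s, s)
            return t.permute(0, 2, 3, 4, 5, 1).reshape(-1, s * s, c)

        q, k, v = to_blocks(self.theta(x)), to_blocks(self.phi(x)), to_blocks(self.g(x))
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)      # intra-block similarity weights
        out = attn @ v                                        # (B*nBlocks, s*s, C)
        out = out.reshape(b, h // s, w // s, s, s, c).permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x + out                                        # residual connection (Equation (6))
```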
As illustrated in
Figure 6, the feature maps generated by the attention module reveal its operational mechanism: it effectively guides the bit allocation strategy. Significantly more bits are assigned to detail-rich regions, such as the textured patterns on the wall, thereby preserving fine spatial information, while fewer bits are allocated to the relatively smooth wall areas, underscoring the attention module's role in shaping the feature maps.