1. Introduction
Remote sensing images are extensively utilized in diverse areas such as resource exploration, urban development, agricultural monitoring, defense, and homeland security [1,2,3]. With the rapid advancement of remote sensing and aerospace technologies, data acquisition capabilities have expanded significantly, resulting in an exponential increase in the volume of images. While this surge offers valuable informational support, it also presents considerable challenges in data management, transmission, and storage.
To address these challenges, image compression has emerged as an essential tool in remote sensing workflows. It minimizes storage and bandwidth requirements while enhancing the efficiency of subsequent processing and analysis. Traditional methods, such as JPEG [
4] and JPEG2000 [
5], depend on manually designed feature extraction and block-based transform coding. However, these methods often suffer from blocking artifacts, blurred edges, and ringing effects, which limit their suitability for high-precision remote sensing tasks. Recently, modern standards such as JPEG XL [
6] and WebP [
7] have demonstrated enhanced performance in static image compression, effectively balancing high compression ratios with detail preservation. However, these formats still encounter limitations when addressing the spectral, spatial, and temporal diversity of high-resolution remote sensing imagery, highlighting bottlenecks in adaptability and compression efficiency.
Learning-based approaches have become a prominent research focus. Compared to traditional techniques, these methods demonstrate superior performance in both compression efficiency and reconstruction quality. Ballé et al. [
8] proposed an end-to-end image compression framework based on convolutional neural networks, as illustrated in
Figure 1a. The encoder employs downsampling modules to convert the input image into high-dimensional latent representations, which reduces spatial redundancy among adjacent features.
To enhance the modeling capacity of the latent space, Ballé et al. [
9] introduced a hyperprior network, as depicted in
Figure 1b, which leverages side information derived from latent features to estimate their probability distribution and improves the precision of entropy coding. Building upon this foundation, Minnen et al. [
10] proposed an enhanced framework that integrates an autoregressive context model within the hyperprior architecture, as shown in
Figure 1c; by assuming a Gaussian distribution, the model jointly estimates the mean
and variance
of the latent representation, which further refines the entropy modeling process. The incorporation of this context-aware mechanism enables more accurate probability estimation by effectively capturing local spatial dependencies.
In this context, Cheng et al. [
11] introduced a discretized Gaussian mixture likelihood model aimed at enhancing the representation of latent distributions. This methodology demonstrates an improved balance between compression rate and reconstruction distortion, as depicted in
Figure 1d.
To further enhance performance, Minnen et al. [
12] introduced a channel-wise autoregressive entropy model that incorporates channel modulation and residual prediction within the latent space. This innovative approach not only enhances rate-distortion performance but also mitigates the sequential limitations associated with previous context-adaptive models, as illustrated in
Figure 1e.
The Mamba architecture, which is grounded in a state space model (SSM) [13], has recently demonstrated considerable benefits on long-sequence tasks. By employing a state transition mechanism, it effectively captures long-range dependencies while preserving linear computational complexity and excellent scalability. This positions the Mamba architecture as a compelling alternative to Transformer-based models. Furthermore, the introduction of selective scanning enhances the expressive capability of structured state space sequence models: the mechanism dynamically filters and concentrates on informative regions based on input features, significantly improving computational efficiency without compromising modeling accuracy.
Building on this foundation, VMamba [
14] and Vim [
15] extend the Mamba framework to two-dimensional vision tasks. By employing directionally selective scanning, these models efficiently capture and integrate global context. Consequently, they achieve broader receptive fields and enhanced performance in object detection and image classification, while also reducing inference latency and resource consumption. These features underscore their significant potential for visual understanding and practical application.
In this paper, we introduce the Mamba architecture into the field of remote sensing image compression and propose a multi-scale channel global Mamba compression network (MGMNet). The objective is to achieve a better balance between compression performance and reconstruction quality. To this end, MGMNet incorporates two core modules: the wavelet transform-guided local structure decoupling module (WTLS) and the channel–global information collaborative modeling module (CGIM).
In contrast to the intricate architecture and redundant parameters of the multi-branch high-low frequency compression model introduced by Xiang et al. [
16], the WTLS module utilized in MGMNet adopts a wavelet decomposition strategy. This methodology segments the feature maps into multi-scale low-frequency and high-frequency sub-bands, facilitating the simultaneous modeling of global contours and local details within images. Building upon this framework, WTLS incorporates a local attention mechanism that prioritizes critical feature regions, such as edges, textures, and geometric structures present in the high-frequency sub-bands. This enhancement significantly improves the model’s ability to delineate essential features within visual data. In comparison to traditional multi-scale feature extraction methods that depend on deep convolutional stacking or extensive self-attention mechanisms, WTLS substantially reduces both the parameter count of the model and its computational requirements. This reduction allows for the effective capture of highly discriminative fine-grained features with diminished complexity, demonstrating notable structural compactness and computational efficiency.
Compared to the approach taken by Wang et al. [
17], which relies solely on visual state space (VSS) scanning to obtain global semantics in the VMIC model, we find that during the compression of remote sensing images, pure bidirectional state scanning can establish long-term dependencies between pixels but has several limitations. First, VSS scanning utilizes uniform sampling and equal-weight processing, which poses difficulties in effectively capturing high-frequency details within the image. This results in decreased edge sharpness and a loss of detail after compression. Second, a single-state scan cannot differentiate the importance of various channels, which may lead to noise channels interfering with signal channels, compromising the quality of reconstruction. Lastly, remote sensing images exhibit significant spatial non-stationarity, with substantial differences in texture and statistical properties between adjacent areas. However, VSS scanning cannot adaptively prioritize the importance of spatial regions, preventing it from allocating more modeling resources to critical target areas (such as road edges and building structures) compared to background areas (such as vegetation and water bodies). Furthermore, a singular scanning mechanism encounters challenges in effectively managing the fusion of multi-scale information, which complicates the equilibrium between global semantics and local high-frequency responses. This limitation subsequently constrains the rate-distortion trade-off.
To address this challenge, we propose the CGIM module, which employs parallel execution of VSS scanning alongside a spatial–channel reconstruction weighting strategy. This module dynamically adjusts the importance of features and spatial positions for each channel, allowing the network to prioritize regions and frequency bands that are vital for enhancing compression efficiency and image fidelity. This methodology effectively mitigates the shortcomings associated with single scanning in capturing essential components. Furthermore, MGMNet capitalizes on the synergistic effects of the WTLS and CGIM modules to proficiently integrate multi-scale structural information with both global and local semantics in remote sensing images. It adeptly leverages critical regional features while maintaining a lightweight network architecture, ensuring ease of deployment and facilitating improved feature recovery and rate-distortion optimization.
The principal contributions of this article are outlined as follows:
A novel lightweight remote sensing image compression network, designated as MGMNet, is introduced. This network combines the capabilities of state–space modeling with a collaborative modeling approach in channel space. Through the application of visual state–space modeling, a dynamic weighting mechanism for spatial channels has been developed. This mechanism allows the network to effectively capture the global semantic information present in remote sensing images, while simultaneously maintaining a precise focus on critical regions and significant edges. This approach achieves a superior compression-reconstruction trade-off under limited computational resources.
A local structure modeling method based on wavelet transform has been developed, enabling the parallel modeling of low-frequency contours and high-frequency details in remote sensing images through multi-scale decomposition. When combined with a local attention strategy, this approach significantly enhances the representation of important details.
To address the limitations of the visual state–space scanning mechanism in compression, we propose a dual-path modeling approach that integrates spatial–channel reconstruction strategies. This design not only preserves the capacity to model long-range dependencies but also effectively enhances the network’s responsiveness to high-frequency regions by dynamically weighting various channel features and spatial positions. This improvement leads to enhanced identification and reconstruction accuracy of target areas, significantly alleviating the shortcomings of single scanning in structural capture and multi-scale fusion.
The experimental findings, encompassing rate-distortion performance and cross-validation, reveal that MGMNet exhibits considerable enhancements in performance relative to traditional image compression techniques as well as current learning-based methods across datasets. This robust evidence substantiates the efficacy of the proposed approach.
The organization of this article is delineated as follows:
Section 2 provides a review of pertinent literature, whereas
Section 3 presents the proposed MGMNet, elaborating on the theoretical underpinnings and operational functions of the WTLS and CGIM components.
Section 4 outlines the experimental methodology and evaluation techniques employed, which will illustrate the efficacy and advantages of the proposed approach across various remote sensing datasets. Lastly,
Section 5 encapsulates the article’s contributions and contemplates prospective avenues for future research.
3. Proposed Method
This section offers a comprehensive overview of the proposed MGMNet and its associated components, which encompass the wavelet transform-guided local structure decoupling module (WTLS) and the channel–global information collaborative modeling module (CGIM).
3.1. Preliminaries
State Space Models (SSMs) [31] are a class of sequence models that can be viewed as the linear time-invariant systems commonly encountered in control theory, signal processing, and linear systems. An SSM maps a one-dimensional continuous input signal $x(t)$ to an output response $y(t)$ through a learnable hidden state $h(t)$. This process can be represented using linear ordinary differential equations (ODEs), i.e.,
$$\dot{h}(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad (1)$$
$$y(t) = \mathbf{C}h(t), \qquad (2)$$
where Equation (1) is the state equation and Equation (2) is the observation equation; $\dot{h}(t)$ is the derivative of $h(t)$ at moment $t$, $\mathbf{A}$ is the state transition matrix, $\mathbf{B}$ is the input matrix, and $\mathbf{C}$ is the output matrix.
Because continuous SSMs are difficult to integrate into deep learning models that operate on discrete sequences, the S4 [31] model serves as the discretized counterpart of the continuous SSM, adapting it to deep learning frameworks by discretizing the ODEs. The Selective Scan State Space Sequence Model (S6) [32] is likewise a discretized version of the continuous SSM. To make the continuous SSM suitable for the discrete signals encountered in image processing, a Zero-Order Hold (ZOH) [33] is used to convert the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into discrete parameters $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$, defined as follows:
$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\big(\exp(\Delta\mathbf{A}) - \mathbf{I}\big)\,\Delta\mathbf{B},$$
where $\Delta$ represents the time scale parameter. The discretized SSM can then be expressed as follows:
$$h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t, \qquad y_t = \mathbf{C}h_t.$$
The output $y$ can be computed through either the linear recurrence above or a global convolution, defined as follows:
$$\bar{\mathbf{K}} = \big(\mathbf{C}\bar{\mathbf{B}},\, \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\, \ldots,\, \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\big), \qquad y = x * \bar{\mathbf{K}},$$
where $L$ is the length of the input sequence, $\bar{\mathbf{K}}$ represents the structured convolutional kernel, and $*$ denotes the convolution operation.
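For concreteness, the discretization and linear recurrence above can be sketched in a few lines of PyTorch. This is an illustrative toy implementation that assumes a diagonal state matrix (as in practical S4/S6 realizations); the function names and shapes are ours and are not part of the MGMNet implementation.

```python
import torch

def zoh_discretize(A, B, delta):
    # Zero-Order Hold for a diagonal A: A_bar = exp(delta*A), B_bar = (exp(delta*A)-1)/A * B
    A_bar = torch.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta):
    """Linear recurrence h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = C*h_t.
    x: (L,) input sequence; A, B, C: (N,) diagonal state parameters."""
    A_bar, B_bar = zoh_discretize(A, B, delta)
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:                     # sequential scan over the sequence
        h = A_bar * h + B_bar * x_t
        ys.append((C * h).sum())
    return torch.stack(ys)

# toy usage: a length-16 signal with a 4-dimensional hidden state
y = ssm_scan(torch.randn(16), -(torch.rand(4) + 0.5), torch.rand(4), torch.rand(4), torch.tensor(0.1))
```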
3.2. Overall Framework
The overall architecture of MGMNet is illustrated in
Figure 2. The network adopts a modular and scalable design tailored for remote sensing image compression, integrating spatial–frequency decoupling and global semantic modeling to enhance both compression efficiency and reconstruction fidelity.
The main encoder consists of down-sampling convolutions, followed by the WTLS and CGIM modules. WTLS is designed to decouple structural and textural components through wavelet-guided multi-scale analysis, while CGIM enhances global context understanding and channel-wise feature interaction. These two modules work collaboratively to provide a compact yet expressive latent representation of the input.
The decoder mirrors the encoder’s structure, incorporating the same functional modules to progressively reconstruct the image. This symmetrical design ensures consistency in spatial–frequency representation and contributes to high-quality restoration of fine details and large-scale structures.
The main encoding–decoding process adopts a three-stage mechanism: the original input $x$ is first transformed into a high-dimensional latent representation $y$ by the main encoder $g_a$. Subsequently, the latent features are discretized by the quantization module $Q$ to produce the quantized latent features $\hat{y}$. Finally, the quantized features $\hat{y}$ are inversely mapped by the main decoder $g_s$ to generate the reconstructed image $\hat{x}$. This procedure can be mathematically formalized as follows:
$$y = g_a(x; \phi), \qquad \hat{y} = Q(y), \qquad \hat{x} = g_s(\hat{y}; \theta),$$
where $x$ and $\hat{x}$ represent the original and reconstructed images, respectively, $\phi$ and $\theta$ denote the trainable parameters of the main encoder $g_a$ and the main decoder $g_s$, and $Q(\cdot)$ represents the quantization operation.
This formulation enables MGMNet to perform compression and reconstruction of remote sensing images while achieving an effective balance among fidelity, compactness, and computational efficiency. Its modular architecture further facilitates seamless integration with components such as hyperprior entropy models and side information encoders, enhancing adaptability across diverse compression scenarios.
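To make the three-stage structure concrete, the PyTorch sketch below illustrates the analysis–quantization–synthesis pipeline described above. It is a minimal stand-in with hypothetical module names (SimpleCodec, g_a, g_s) rather than the actual MGMNet implementation; in MGMNet each stage additionally contains the WTLS and CGIM modules shown in Figure 2.

```python
import torch
import torch.nn as nn

class SimpleCodec(nn.Module):
    """Minimal analysis -> quantization -> synthesis sketch (not the full MGMNet)."""
    def __init__(self, channels=192):
        super().__init__()
        # g_a: stride-2 convolutions progressively downsample the image into latents y
        self.g_a = nn.Sequential(
            nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(channels, channels, 5, stride=2, padding=2),
        )
        # g_s mirrors g_a with transposed convolutions to reconstruct x_hat
        self.g_s = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 5, stride=2, padding=2, output_padding=1), nn.GELU(),
            nn.ConvTranspose2d(channels, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def quantize(self, y):
        # Training: additive uniform noise as a differentiable proxy; inference: hard rounding
        return y + torch.empty_like(y).uniform_(-0.5, 0.5) if self.training else torch.round(y)

    def forward(self, x):
        y = self.g_a(x)           # latent representation
        y_hat = self.quantize(y)  # discretized latents fed to the entropy model
        x_hat = self.g_s(y_hat)   # reconstruction
        return x_hat, y_hat

x = torch.randn(1, 3, 256, 256)
x_hat, y_hat = SimpleCodec()(x)
```

During training the quantizer is replaced by additive uniform noise so that gradients can flow, while hard rounding is used at inference time, matching common practice in learned compression.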
In the hyperprior framework, the main latent feature $y$ is first transformed into higher-order statistical side information $z$ by the hyper-prior encoder $h_a$. After being discretized by the quantizer $Q$ to yield $\hat{z}$, the hyper-prior decoder $h_s$ decodes $\hat{z}$ into the parameters required by the entropy model.
Following previous studies [10,34,35], we round each latent and encode it into a bitstream rather than directly encoding $y$; the encoded symbols are then reconstructed as $\hat{y}$. Subsequently, $\hat{z}$ is losslessly encoded using a range encoder, and its bitstream is represented by a single parameter. Utilizing the foundational concepts of channel-wise autoregressive entropy models, we partition the tensor $\hat{y}$ into $S$ separate segments along the channel dimension, each with dimensions of $\frac{C}{S} \times H \times W$, and set $S$ consistent with [35]. The hyper-prior decoder decodes the side information in this framework to obtain two critical parameters, $\mu$ and $\sigma$. The entropy model functions in a sequential manner, determining the conditional probability distribution of each segment given the preceding segments. Each segment is subjected to entropy encoding and decoding, which facilitates the reconstruction of the complete quantized latent representation $\hat{y}$.

Subsequently, each slice $\hat{y}_i$ is processed through the slicing network, and a slice $\hat{y}_i$ can only be decoded after all preceding slices $\hat{y}_{<i}$ have been successfully decoded. Quantization operations inevitably introduce errors, which contribute to distortions in the reconstructed image. To mitigate these errors, we employ the rounding function and latent residual prediction (LRP).
Figure 3 illustrates the comprehensive methodology utilized in this entropy model.
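The slice-conditioned decoding logic can be illustrated with the simplified PyTorch sketch below. The slice count, network widths, and module names are hypothetical, and the round-the-residual plus latent-residual-prediction (LRP) pattern follows the channel-wise autoregressive scheme of [12,35] rather than reproducing the exact MGMNet entropy model.

```python
import torch
import torch.nn as nn

class SliceEntropyModel(nn.Module):
    """Toy channel-wise autoregressive entropy model: each slice of y is predicted
    from the hyperprior features and previously coded slices, and an LRP term
    refines the dequantized slice to reduce quantization error."""
    def __init__(self, latent_ch=320, hyper_ch=64, num_slices=10):
        super().__init__()
        self.S, self.cs = num_slices, latent_ch // num_slices
        def head(in_ch, out_ch):
            return nn.Sequential(nn.Conv2d(in_ch, 128, 3, padding=1), nn.GELU(),
                                 nn.Conv2d(128, out_ch, 3, padding=1))
        self.param_nets = nn.ModuleList(
            [head(hyper_ch + i * self.cs, 2 * self.cs) for i in range(self.S)])
        self.lrp_nets = nn.ModuleList(
            [head(hyper_ch + (i + 1) * self.cs, self.cs) for i in range(self.S)])

    def forward(self, y, hyper):
        slices, decoded = y.chunk(self.S, dim=1), []
        for i, y_i in enumerate(slices):
            ctx = torch.cat([hyper] + decoded, dim=1)              # condition on earlier slices
            mu, _sigma = self.param_nets[i](ctx).chunk(2, dim=1)   # Gaussian parameters for slice i
            y_hat_i = torch.round(y_i - mu) + mu                   # quantize the residual, add the mean back
            y_hat_i = y_hat_i + 0.5 * torch.tanh(self.lrp_nets[i](torch.cat([ctx, y_hat_i], dim=1)))
            decoded.append(y_hat_i)
        return torch.cat(decoded, dim=1)

y_hat = SliceEntropyModel()(torch.randn(1, 320, 16, 16), torch.randn(1, 64, 16, 16))
```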
3.3. Channel-Global Visual Model
Remote sensing images often encompass extensive geospatial features, such as urban areas, forests, and farmlands. Global features are crucial for compression models to understand the scene’s overall structure and help allocate bit resources more efficiently. In addition, remote sensing images exhibit strong correlations between different channels. Modeling channel features can effectively reduce redundant information and improve compression efficiency. However, efficiently extracting and integrating global and channel features to enhance compression performance remains a significant challenge.
The Visual State Space (VSS) model demonstrates proficiency in modeling long-range dependencies and capturing global features. Nonetheless, its fundamental component, the 2D-Selective Scanning (SS2D) module, exhibits certain limitations. SS2D begins by partitioning the image into patches of a fixed size, which are then analyzed along four diagonal orientations. Each patch matrix is subsequently flattened into a sequence for each orientation and processed for feature extraction. The results from all orientations are then integrated to amalgamate various global perspectives. Although this approach facilitates the incorporation of multi-directional context, the inflexible scanning order and patch-slicing methodology may constrain the model’s capacity to represent specific structured elements and local details effectively. While this mechanism is adept at global feature extraction, the absence of interaction among patches may lead to incomplete or distorted representations of global features that span multiple regions, such as river flows or mountain range extensions. Consequently, the SS2D module may struggle to establish effective connections between dispersed regions when identifying long-range features across various geographic units, increasing the likelihood of global information loss.
As depicted in
Figure 4, we propose a channel–global information collaborative modeling module (CGIM) to tackle the aforementioned difficulties. The CGIM is a parallel architecture that includes the VSS model and the Spatial–Channel Reconstruction Strategy (SCRS) module. Among them, SCRS aims to extract more accurate global and channel information by optimizing spatial and channel features of remote sensing images. SCRS utilizes the synergy of spatial and channel dimensions to fuse the extracted global information with the channel information to complement the global feature extraction of VSS. This design can effectively mitigate the problem of information incompleteness that the SS2D module may cause. It ensures information integrity during compression and improves compression performance and reconstruction quality.
First, the input tensor $X$ is processed by a convolutional layer configured to produce an output with $2C$ channels. The output is then split along the channel dimension into $X_1$ and $X_2$, which allows the VSS and SCRS modules to operate independently and in parallel. This parallel processing facilitates more targeted and efficient feature extraction by each module, enhancing the overall representational capacity of the network.

Subsequently, the tensor $X_1$ is fed into the VSS module, resulting in the output tensor $F_{vss}$, while the tensor $X_2$ is concurrently processed through the SCRS module to yield the output tensor $F_{scrs}$. The tensor $F_{vss}$ is then passed through a global average pooling (GAP) layer; following the application of a sigmoid function, the output of the GAP layer is multiplied with the tensor $F_{scrs}$, resulting in the tensor $F_1$. Similarly, $F_{scrs}$ undergoes the same process and is used to reweight $F_{vss}$, resulting in $F_2$. The tensors $F_1$ and $F_2$ are concatenated, and the features extracted by the VSS and SCRS modules are integrated through a second convolutional layer whose output channel size is $C$. Ultimately, the input features $X$ are added to the integrated features to produce the final output $Y$. This process can be mathematically represented as follows:
$$X_1, X_2 = \mathrm{Split}\big(\mathrm{Conv}(X)\big),$$
$$F_1 = \sigma\big(\mathrm{GAP}(\mathrm{VSS}(X_1))\big) \otimes \mathrm{SCRS}(X_2), \qquad F_2 = \sigma\big(\mathrm{GAP}(\mathrm{SCRS}(X_2))\big) \otimes \mathrm{VSS}(X_1),$$
$$Y = \mathrm{Conv}\big(\mathrm{Concat}(F_1, F_2)\big) + X,$$
where $\mathrm{Split}(\cdot)$ denotes the split operation and $\sigma(\cdot)$ signifies the Sigmoid function.
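The parallel structure and the cross-gating described above can be sketched as follows in PyTorch. The VSS and SCRS branches are passed in as stand-ins (here identity modules), and the 1×1 projections and the reading that each branch's pooled statistics gate the other branch are our assumptions; none of the names correspond to the released MGMNet code.

```python
import torch
import torch.nn as nn

class CGIM(nn.Module):
    """Sketch of the channel-global collaborative module: split the input into two halves,
    process them with a VSS branch and an SCRS branch in parallel, cross-gate the two
    outputs with GAP + sigmoid channel weights, then fuse and add a residual."""
    def __init__(self, c, vss_block, scrs_block):
        super().__init__()
        self.proj_in = nn.Conv2d(c, 2 * c, 1)   # expand to 2C channels before the split
        self.vss, self.scrs = vss_block, scrs_block
        self.proj_out = nn.Conv2d(2 * c, c, 1)  # fuse the concatenated branches back to C
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x1, x2 = self.proj_in(x).chunk(2, dim=1)
        f_vss, f_scrs = self.vss(x1), self.scrs(x2)
        f1 = torch.sigmoid(self.gap(f_vss)) * f_scrs   # VSS statistics gate the SCRS features
        f2 = torch.sigmoid(self.gap(f_scrs)) * f_vss   # SCRS statistics gate the VSS features
        return self.proj_out(torch.cat([f1, f2], dim=1)) + x

# usage with identity stand-ins for the two branches
out = CGIM(64, nn.Identity(), nn.Identity())(torch.randn(1, 64, 32, 32))
```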
3.3.1. Visual State Space Model
The VSS model, derived from VMamba [14], is a core component of CGIM. Given an input feature $F$, it is first processed through Layer Normalization (LN) before being routed to two gated branches. In the first branch, the input undergoes a linear layer, a depthwise separable convolution, and a SiLU activation function [36] for feature extraction. The extracted features are fed into the SS2D module to capture global features and establish long-range dependencies. Finally, the output passes through a second LN layer to produce the first branch output $F_1$. This process can be represented as follows:
$$F_1 = \mathrm{LN}\Big(\mathrm{SS2D}\big(\mathrm{SiLU}(\mathrm{DWConv}(\mathrm{Linear}(\mathrm{LN}(F))))\big)\Big).$$
In the second branch, the input goes through a linear layer and a SiLU activation function to obtain the second branch output $F_2$. It can be expressed as follows:
$$F_2 = \mathrm{SiLU}\big(\mathrm{Linear}(\mathrm{LN}(F))\big).$$
The outputs $F_1$ and $F_2$ are fused through element-wise multiplication to integrate features from both branches, and a linear layer is applied to further enhance feature integration. Finally, a skip connection is established between the linear layer’s output and the input $F$, resulting in the final output $F_{out}$ of the VSS module. This process can be represented as follows:
$$F_{out} = \mathrm{Linear}(F_1 \odot F_2) + F,$$
where ⊙ denotes element-wise multiplication.
As illustrated in
Figure 5, the SS2D module processes data in three steps: cross-scanning, selective scanning using S6 blocks, and cross-merging. Given an input feature map
X, it is first divided into patches of equal size. The SS2D module then performs cross-scanning in four directions (top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right) to unfold the input patches into four independent sequences. Each sequence is processed in parallel using the S6 block selective scanning mechanism, which ensures comprehensive feature extraction in all directions.
Finally, a cross-merging operation combines the sequences from all directions, reconstructing the output feature map $Y$ with the same dimensions as the input. This process is expressed as follows:
$$X_k = \mathrm{CrossScan}_k(X), \qquad \bar{X}_k = \mathrm{S6}(X_k), \qquad k = 1, 2, 3, 4,$$
$$Y = \mathrm{CrossMerge}\big(\bar{X}_1, \bar{X}_2, \bar{X}_3, \bar{X}_4\big),$$
where $k$ denotes the four different scanning directions, and $\mathrm{CrossScan}(\cdot)$ and $\mathrm{CrossMerge}(\cdot)$ denote the cross-scan and cross-merge operations.
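A minimal sketch of the cross-scan and cross-merge steps is given below. Following the public VMamba implementation, the four directions are realized here as row-major and column-major traversals plus their reverses; this is our assumption about how the scanning directions are implemented, and the S6 processing of each sequence is omitted.

```python
import torch

def cross_scan(x):
    """Unfold a feature map (B, C, H, W) into four 1-D sequences: row-major,
    reversed row-major, column-major, and reversed column-major traversals."""
    B, C, H, W = x.shape
    row = x.flatten(2)                                  # row-major scan, (B, C, H*W)
    col = x.transpose(2, 3).flatten(2)                  # column-major scan
    return torch.stack([row, row.flip(-1), col, col.flip(-1)], dim=1)  # (B, 4, C, L)

def cross_merge(seqs, H, W):
    """Invert the four scans and sum them back into a (B, C, H, W) map."""
    B, _, C, L = seqs.shape
    row = seqs[:, 0] + seqs[:, 1].flip(-1)              # undo the reversed row scan
    col = (seqs[:, 2] + seqs[:, 3].flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
    return (row + col).view(B, C, H, W)

x = torch.randn(2, 8, 4, 4)
seqs = cross_scan(x)          # each sequence would be processed by an S6 block
y = cross_merge(seqs, 4, 4)   # here the merge simply sums the (unprocessed) scans
assert y.shape == x.shape
```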
The S6 model [32] is an enhanced version built upon the S4 model [31]; its core innovation is the introduction of a selective mechanism that enables the model to dynamically adjust the parameter configurations of the State Space Model (SSM) based on the input content. This mechanism endows the model with improved adaptability and selective processing capability, allowing it to retain critical semantic information while effectively filtering out redundant or irrelevant details during feature extraction. As a result, the model achieves better representational efficiency and reconstruction quality.
In terms of architectural design, S6 also operates within the 2D selective scanning (SS2D) framework, which encodes images through complementary one-dimensional sequential traversal paths (e.g., row-wise and column-wise), enabling each pixel to aggregate spatial context from multiple directions. This design effectively constructs a global receptive field in the 2D space, significantly enhancing the model’s ability to capture long-range dependencies and cross-region structural patterns.
The selective mechanism introduced in the S6 model makes it particularly well-suited for handling the spatial non-stationarity commonly found in remote sensing images. Such images often contain rich and diverse land-cover types, with significant variations in texture density, structural scale, and edge sharpness across regions, leading to high spatial heterogeneity. Traditional static modeling approaches often lack the flexibility to adapt to this heterogeneous distribution, resulting in the loss of structural information or the inclusion of excessive redundant features.
In S6, the selective state modulation mechanism performs input-driven dynamic parameter adjustment, enabling differentiated state update behaviors across spatial regions. In areas with complex textures or abrupt structural changes, the model enhances state retention and feature fusion to better capture fine details and boundaries. Conversely, in flat, highly repetitive, or information-sparse regions, it suppresses state updates and actively “forgets” redundant content, thereby improving overall modeling efficiency. This mechanism effectively constructs a form of spatially selective memory, allowing the model to adaptively shift its “attention focus” in response to diverse spatial structures, thus improving its capacity to handle the complex and dynamic spatial patterns in remote sensing imagery.
Through the synergistic integration of the selective mechanism and the SS2D framework, the S6 model exhibits not only strong global modeling capabilities and efficient sequential processing but also the flexibility to adapt its modeling strategies based on spatial characteristics of the input, demonstrating excellent performance in non-stationary environments.
3.3.2. Spatial–Channel Reconstruction Strategy
The Spatial–Channel Reconstruction Strategy (SCRS) is an enhancement of the SCConv architecture initially proposed by Li et al. [37]. This strategy enhances visual representation through a dual-path mechanism that decouples and simultaneously models spatial and channel features, as illustrated in Figure 6.

The Spatial Reconstruction Unit (SRU) employs a separation–reconstruction methodology to mitigate spatial redundancy and improve spatial feature representation. Concurrently, the Channel Reconstruction Unit (CRU) utilizes a separation–transformation–fusion approach to diminish channel redundancy and augment inter-channel correlations. The outputs generated by the SRU and CRU are combined through element-wise multiplication to merge spatial and channel information; attention weights are then computed through a Softmax function and applied to $V$, and a linear transformation follows to strengthen feature interactions.

By separating spatial and channel dependencies, this architecture facilitates enhanced modeling of global–local attention and augments the discriminative power of the extracted features. A streamlined mathematical formulation is as follows:
$$\mathrm{SCRS}(Q, K, V) = \mathrm{Linear}\!\left(\mathrm{Softmax}\!\left(\frac{\mathrm{SRU}(Q) \odot \mathrm{CRU}(K)}{\sqrt{d}}\right) V\right),$$
where $d$ denotes the dimension of the input.
3.3.3. Spatial Reconstruction Unit
As illustrated in
Figure 7, the Spatial Reconstruction Unit (SRU) refines feature maps by reducing spatial redundancy through a “separate-and-reconstruct” approach.
The input tensor $Q$ is normalized via Group Normalization (GN); the resulting scaling coefficients serve as indicators for evaluating each channel’s information content. This process is mathematically represented as follows:
$$Q_{\mathrm{GN}} = \mathrm{GN}(Q) = \gamma\,\frac{Q - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta,$$
where $\mu$ and $\sigma$ are the mean and standard deviation computed within each group, $\varepsilon$ is a small constant for numerical stability, and the parameters $\gamma$ and $\beta$ are adjustable affine transformation coefficients within the GN layer. Feature maps possessing abundant spatial information generally exhibit larger spatial pixel fluctuations, resulting in larger corresponding $\gamma$ values. This relationship can be articulated mathematically as follows:
$$W_{\gamma} = \{w_i\}, \qquad w_i = \frac{\gamma_i}{\sum_{j=1}^{C}\gamma_j}, \qquad i = 1, 2, \ldots, C.$$
In this context, $W_{\gamma}$ represents the normalized weights employed to indicate the significance of the various feature maps; an increase in the value of $w_i$ suggests a greater diversity and richness of the spatial information present within the corresponding feature map.

First, the Sigmoid function is employed to normalize the weights $W_{\gamma}$. Subsequently, these weights are classified using a gating threshold of 0.5: weights exceeding the threshold are set to 1, yielding the informative weights $W_1$, while the remaining positions are set to 0, and the complementary mask forms the non-informative weights $W_2$. This process can be expressed as follows:
$$W_1 = \mathrm{Gate}_{>0.5}\big(\mathrm{Sigmoid}(W_{\gamma}\cdot\mathrm{GN}(Q))\big), \qquad W_2 = 1 - W_1.$$
Subsequently, the input feature $Q$ is multiplied by the informative weights $W_1$ and the non-informative weights $W_2$, yielding two distinct weighted features: $Q_1^{w}$ and $Q_2^{w}$. The feature $Q_1^{w}$ encapsulates critical attributes characterized by a higher information content, whereas $Q_2^{w}$ encompasses redundant attributes with comparatively lower information content. These two weighted features are then amalgamated through a cross-reconstruction operation, which culminates in the generation of enhanced features $Q_1$ and $Q_2$ that exhibit diminished spatial redundancy. The cross-reconstruction process effectively synthesizes the two weighted features, augmenting their information flow. Ultimately, $Q_1$ and $Q_2$ are concatenated to produce the final spatially refined feature $Q^{out}$. The reconstruction process can be expressed as follows:
$$Q_1^{w} = W_1 \otimes Q, \qquad Q_2^{w} = W_2 \otimes Q,$$
$$Q_1 = Q_{11}^{w} \oplus Q_{22}^{w}, \qquad Q_2 = Q_{21}^{w} \oplus Q_{12}^{w},$$
$$Q^{out} = \mathrm{Concat}(Q_1, Q_2),$$
where $Q_{11}^{w}$ and $Q_{12}^{w}$ (respectively $Q_{21}^{w}$ and $Q_{22}^{w}$) denote the two channel-wise halves of $Q_1^{w}$ (respectively $Q_2^{w}$).
The symbol ⊗ signifies element-wise multiplication. The symbol ⊕ denotes element-wise addition.
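A compact PyTorch sketch of this separate-and-reconstruct procedure is given below, following the SCConv-style SRU; the group count, the handling of the 0.5 threshold, and the half-split cross-reconstruction are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Separate-and-reconstruct sketch: GroupNorm gammas score channel informativeness,
    a sigmoid + 0.5 gate yields informative/non-informative masks, and the two weighted
    maps are cross-reconstructed and concatenated."""
    def __init__(self, channels, groups=4, threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, q):
        gn_q = self.gn(q)
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        weights = torch.sigmoid(w_gamma * gn_q)           # reweight the normalized map
        info_mask = (weights > self.threshold).float()    # W1: informative positions
        q1w, q2w = info_mask * q, (1.0 - info_mask) * q   # informative / redundant parts
        # cross-reconstruction: swap channel halves, add, then concatenate
        q11, q12 = q1w.chunk(2, dim=1)
        q21, q22 = q2w.chunk(2, dim=1)
        return torch.cat([q11 + q22, q21 + q12], dim=1)

out = SRU(64)(torch.randn(1, 64, 32, 32))
```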
3.3.4. Channel Reconstruction Unit
To better leverage redundant information across channels, we introduce the Channel Reconstruction Unit (CRU), which refines feature channels to extract more discriminative channel features. As illustrated in
Figure 8, the CRU adopts a “separation–transformation–fusion” strategy to effectively reduce channel redundancy and enhance feature representation capacity.
Separation. The input feature $K$ is first split along the channel dimension into two parts: one with $\alpha C$ channels and the other with $(1-\alpha)C$ channels, where $\alpha$ denotes the channel split ratio. Each sub-feature is then processed by a 1×1 convolution, which compresses the channels and reduces the computational cost. As a result, the original feature $K$ is separated into an upper part $K_{up}$ and a lower part $K_{low}$. These two parts serve as the basis for the subsequent transformation and fusion operations.
Transformation. During the transformation phase, the input feature $K_{up}$ undergoes processing through a specialized rich feature extraction module. This module integrates Group Convolution (GWC) and Pointwise Convolution (PWC) to derive more representative high-level features while ensuring computational efficiency. Nonetheless, the segmentation of channels into distinct groups may restrict inter-channel communication; PWC introduces fully connected operations across channels to address this limitation and enhance inter-feature interactions. The outputs generated by the GWC and PWC are subsequently combined through element-wise addition to yield the aggregated feature map $Y_1$. This transformation can be articulated as follows:
$$Y_1 = W_G K_{up} + W_P K_{up},$$
where $W_G$ denotes the learnable weight matrix for the GWC and $W_P$ denotes the learnable weight matrix for the PWC.

During the lower transformation stage, the variable $K_{low}$ is processed through a PWC module, which is responsible for extracting shallow detail features that complement the high-level features derived from the upper transformation stage. The output generated by the PWC module is subsequently combined with the original $K_{low}$ through a channel-wise concatenation operation, which yields the final output feature of the lower transformation stage, referred to as $Y_2$. The entire process can be mathematically represented as follows:
$$Y_2 = \mathrm{Concat}\big(W_{P'} K_{low},\; K_{low}\big),$$
where $W_{P'}$ denotes the learnable weight matrix for the PWC in the lower stage.
Fusion. The transformed output features $Y_1$ and $Y_2$ are adaptively integrated utilizing a streamlined SKNet [38] approach. At the outset, GAP is employed to consolidate the input feature information into the channel statistics $S_1$ and $S_2$.

The channel descriptors corresponding to the upper and lower transformation features, denoted as $S_1$ and $S_2$, are subsequently combined, and the vectors $\beta_1$ and $\beta_2$ are produced through the application of a channel-wise soft attention mechanism, executed as follows:
$$\beta_1 = \frac{e^{S_1}}{e^{S_1} + e^{S_2}}, \qquad \beta_2 = \frac{e^{S_2}}{e^{S_1} + e^{S_2}}, \qquad \beta_1 + \beta_2 = 1.$$
In the concluding phase, the feature importance vectors $\beta_1$ and $\beta_2$ facilitate the integration of $Y_1$ and $Y_2$ within the channel dimension, which results in the final channel-refined feature $K^{out}$, as follows:
$$K^{out} = \beta_1 Y_1 + \beta_2 Y_2.$$
The CRU leverages lightweight convolutional operations to capture informative channel-level features. At the same time, it applies a cost-efficient feature reuse mechanism to mitigate redundancy. This design improves both computational efficiency and feature representation quality.
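The following PyTorch sketch illustrates the split–transform–fuse pattern; the split ratio, group count, and layer widths are illustrative assumptions rather than MGMNet's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    """Split-transform-fuse sketch: the channels are split by a ratio alpha, the upper
    part goes through group-wise + point-wise convolutions, the lower part through a
    point-wise convolution with feature reuse, and the two branches are fused by
    SKNet-style soft channel attention over pooled statistics."""
    def __init__(self, channels, alpha=0.5, groups=2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        self.gwc = nn.Conv2d(self.c_up, channels, 3, padding=1, groups=groups)
        self.pwc_up = nn.Conv2d(self.c_up, channels, 1)
        self.pwc_low = nn.Conv2d(self.c_low, channels - self.c_low, 1)

    def forward(self, k):
        k_up, k_low = torch.split(k, [self.c_up, self.c_low], dim=1)
        y1 = self.gwc(k_up) + self.pwc_up(k_up)              # rich high-level features
        y2 = torch.cat([self.pwc_low(k_low), k_low], dim=1)  # shallow details + reused channels
        s1, s2 = F.adaptive_avg_pool2d(y1, 1), F.adaptive_avg_pool2d(y2, 1)
        beta = torch.softmax(torch.stack([s1, s2]), dim=0)   # soft attention across branches
        return beta[0] * y1 + beta[1] * y2

out = CRU(64)(torch.randn(1, 64, 32, 32))
```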
3.4. Wavelet Transform-Guided Local Structure Decoupling Module
Remote sensing images typically encompass intricate details, such as the edges of roads and buildings, making the efficient extraction of local features essential for maintaining critical information and enhancing the quality of image reconstruction. Traditional convolutional neural networks (CNNs) often encounter challenges in capturing long-range dependencies due to their limited receptive fields. This limitation can lead to inadequate contextual awareness and suboptimal feature representation. Furthermore, the constrained receptive field hinders conventional CNNs from effectively capturing large-scale features and long-range dependencies among distant pixels. Consequently, pertinent information that is spatially distant may not be adequately integrated into local features, adversely impacting both compression performance and reconstruction quality.
We propose a wavelet transform-guided local structure decoupling module (WTLS) built on wavelet transform convolution [39] to efficiently extract multi-scale local features from feature maps and address the aforementioned issues. Specifically, wavelet transform convolution leverages the multi-scale decomposition properties of the wavelet transform to significantly enlarge the receptive field while avoiding excessive parameterization. In contrast to conventional convolutions, wavelet transform convolution offers a significant advantage in the extraction of low-frequency information, as wavelet transforms emphasize low-frequency components during the decomposition process. The integration of wavelet transform convolution within the WTLS module facilitates a more precise extraction of multi-scale local features while preserving large-scale contextual information. This capability substantially contributes to the improvement of remote sensing image compression performance and the quality of reconstruction.
As shown in Figure 9, the input feature is first processed by the initial branch, which consists of a wavelet convolution (WTConv) followed by a second WTConv; both layers contribute to multi-scale feature extraction in the wavelet domain. The output is then mapped to yield the first-branch output $X_1$.

Subsequently, the input feature is directed through the second branch of the model, which applies its own sequence of convolutional operations (see Figure 9); the resulting feature is denoted $X_2$. An element-wise multiplication is then executed between $X_1$ and $X_2$ to enhance the interaction of the features. The resulting product undergoes further refinement through a convolution layer to augment the fusion process. Ultimately, the output of this convolution is combined with the original input $X$, yielding the final output denoted as $X_{out}$. The entire process can be mathematically expressed as follows:
$$X_1 = \mathcal{F}_1(X), \qquad X_2 = \mathcal{F}_2(X), \qquad X_{out} = \mathrm{Conv}(X_1 \otimes X_2) + X,$$
where $\mathcal{F}_1(\cdot)$ and $\mathcal{F}_2(\cdot)$ denote the operations of the first and second branches, respectively, and $\otimes$ denotes element-wise multiplication.
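Since WTConv builds on wavelet decomposition, the snippet below sketches a single-level Haar transform implemented as a depthwise strided convolution, which is the frequency-decoupling step the module relies on. It is a minimal illustration, not the WTConv implementation of [39]; the filter coefficients are the standard orthonormal Haar choice.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level Haar decomposition of (B, C, H, W) into LL, LH, HL, HH sub-bands."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    filters = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    c = x.shape[1]
    filters = filters.repeat(c, 1, 1, 1).to(x)             # depthwise over every channel
    out = F.conv2d(x, filters, stride=2, groups=c)         # (B, 4C, H/2, W/2)
    return out.view(x.shape[0], c, 4, *out.shape[-2:])

x = torch.randn(1, 8, 32, 32)
subbands = haar_dwt(x)   # subbands[:, :, 0] is LL; the remaining sub-bands carry edge/texture detail
```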
4. Experiments
To ensure a comprehensive evaluation of MGMNet, we employed DOTA, UC-Merced, and NWPU-RESISC45 datasets, which encompass diverse and abundant ground object features, making them well-suited for validating the model’s effectiveness.
MGMNet is compared with several classical image compression standards, including JPEG2000 [
5], WebP [
7], BPG, and AVIF. In addition, we included recent learning-based approaches, such as those proposed by Cheng et al. [
11] (Cheng2020), Jiang et al. [
40] (Jiang2022), Liu et al. [
34] (Liu2023), Liu et al. [
41] (Liu2024), and Qin et al. [
42] (Qin2024). Quantitative analyses utilizing four evaluation indicators indicate that MGMNet consistently outperforms its competitors across all metrics.
4.1. Loss Function
The loss function of our model consists of two components. The first component constrains the bitrate of the bitstream generated during the compression process, including the latent representation and side information. The second component measures the mean squared error (MSE) between the input image and the final output of the model. The formulation is as follows:
$$\mathcal{L} = R(\hat{y}) + R(\hat{z}) + \lambda \cdot D(x, \hat{x}) = \mathbb{E}\big[-\log_2 p_{\hat{y}}(\hat{y})\big] + \mathbb{E}\big[-\log_2 p_{\hat{z}}(\hat{z})\big] + \lambda \cdot \mathrm{MSE}(x, \hat{x}),$$
where $\lambda$ is the trade-off parameter that balances rate and distortion.
The rate-distortion optimization aims to simultaneously reduce the size of the compressed bitstream and the distortion in the decompressed image. This process can be summarized as minimizing the number of bits required for compression while maintaining the quality of the decoded image.
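As a concrete illustration, the rate-distortion objective can be written as a small PyTorch function. The likelihood tensors are assumed to come from the entropy models, the variable names are ours, and the 255² scaling of the MSE term (with images normalized to [0, 1]) follows common practice for MSE-optimized codecs rather than a detail stated in this paper.

```python
import torch

def rd_loss(x, x_hat, y_likelihoods, z_likelihoods, lam):
    """R + lambda*D: bits-per-pixel of latents and side information plus lambda-weighted MSE.
    Assumes inputs in [0, 1]; the 255**2 factor is a common convention for MSE-optimized codecs."""
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    bpp = sum((-torch.log2(l)).sum() for l in (y_likelihoods, z_likelihoods)) / num_pixels
    mse = torch.mean((x - x_hat) ** 2)
    return bpp + lam * (255 ** 2) * mse
```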
To illustrate this, we take the UC-Merced dataset as an example and present the training process under the setting of $\lambda$ = 0.0035 (corresponding to bpp = 0.46).
Figure 10 shows the trend of the loss function concerning the number of training epochs.
As observed from the figure, the model exhibits a fast convergence behavior at the early stages of training under this setting. The overall training process remains stable, with a smooth and steady decrease in the loss. Therefore, under this configuration, the model achieves performance saturation before reaching the preset 150 epochs. Early stopping can thus effectively reduce computational cost while mitigating the risk of overfitting.
Similarly, under the current setting of $\lambda$ = 0.0035, we further illustrate the evolution of two image quality evaluation metrics, PSNR and MS-SSIM, during training. As shown in
Figure 11, both metrics exhibit a steady upward trend as the number of epochs increases and gradually converge in the later stages of training. This indicates that the model continuously improves its reconstruction quality while maintaining strong compression performance.
In particular, around epoch 100, both PSNR and MS-SSIM curves begin to stabilize, suggesting that the model has reached a performance saturation point, where further training yields marginal improvements. Therefore, under this configuration, early termination of training does not compromise the final model performance and helps improve training efficiency while reducing computational resource consumption.
4.2. Experimental Settings
All learning-based models are developed using the PyTorch 2.1.0 framework. Training is conducted on an NVIDIA RTX 3090 GPU. The Adam optimizer [43] is employed to perform gradient-based optimization. During training, image patches of 256 × 256 pixels are uniformly and randomly sampled from the training datasets. The training process begins with a learning rate of $1 \times 10^{-4}$. A mini-batch size of 8 is used during optimization. The model is trained for a total of 200 epochs. The learning rate is gradually reduced during training: it is lowered after 100 epochs, reduced further at epoch 150, and decreased again at epoch 180, after which it is kept constant until training completes.

The model is trained using a rate-distortion loss function. Distortion is quantified by the MSE between the original and reconstructed images. The balance between rate and distortion is controlled by a trade-off parameter $\lambda$. We evaluate the model under different compression levels by varying $\lambda$ within the set [0.0018, 0.0035, 0.0067, 0.013, 0.025, 0.05].
4.3. Evaluation Indicators
Bit rate is quantified in terms of bits per pixel (Bpp). The assessment of image quality is conducted through four complementary metrics, which are encapsulated in rate-distortion (R-D) curves.
PSNR (Peak Signal-to-Noise Ratio) quantifies the pixel-level reconstruction error, with elevated values signifying enhanced fidelity.
MS-SSIM (Multi-Scale Structural Similarity Index) examines structural similarity across various scales, where increased values imply superior perceptual quality.
LPIPS (Learned Perceptual Image Patch Similarity) assesses perceptual similarity through deep feature representations, with reduced values reflecting greater visual similarity.
VIFp (Visual Information Fidelity in the pixel domain) quantifies the amount of visual information preserved, with higher values denoting better quality.
4.4. Analysis of Wavelet Basis and Convolution Kernel Size in WTLS Module
The WTLS (wavelet transform-guided local structure decoupling module) plays a critical role in balancing spatial–frequency representation and computational efficiency in the proposed compression framework. To systematically evaluate the influence of key architectural components, we investigate two primary factors: the choice of wavelet basis and the convolution kernel size, which, respectively, control the frequency decomposition and the spatial receptive field of the model.
4.4.1. Impact of Wavelet Basis Selection
The wavelet basis determines how input features are decomposed into different frequency bands and directly affects the model’s ability to preserve structural details and compress redundant information. In this study, the Haar wavelet (db1) is adopted as the default wavelet basis due to its unique theoretical and empirical advantages.
As the shortest orthogonal wavelet (support length = 2), Haar offers high sensitivity to sharp transitions and structural boundaries, making it particularly effective in capturing edges of human-made objects such as roads and buildings [44,45,46]. Its strict orthogonality ensures lossless information representation during forward and inverse transforms, contributing to stable and accurate reconstruction. Moreover, Haar’s high computational efficiency, attributable to its simple filter structure, facilitates seamless integration into CNN-based architectures and reduces overall computation overhead.
From an empirical perspective, remote sensing images often involve diverse land-cover types and complex textures with high spatial variability across regions. The Haar wavelet effectively separates low-frequency structures (e.g., terrain contours) from high-frequency details (e.g., rooftops, vegetation boundaries), offering a favorable balance between structural fidelity and perceptual quality [47].
To further validate the suitability of the Haar wavelet, a comparative ablation study is conducted on the UC Merced dataset. Several representative wavelet bases are evaluated, including Daubechies-4 (db4), Symlets-4 (sym4), and Coiflets-1 (coif1), with all other architectural parameters held constant. As shown in
Figure 12, the Haar wavelet consistently achieves competitive rate-distortion (RD) performance and demonstrates stronger robustness in preserving geometric features.
4.4.2. Role of Convolution Kernel Size and Performance–Efficiency Trade-Off
While wavelet basis selection influences frequency domain modeling, the convolution kernel size serves as the primary factor in controlling the model’s receptive field, which is crucial for capturing complex spatial dependencies in remote sensing images.
To investigate this, we perform controlled experiments using kernel sizes of 3 × 3, 5 × 5, and 7 × 7, under both Haar and Daubechies-4 configurations. The results are shown in
Figure 13 and
Figure 14.
The findings reveal that increasing the kernel size substantially expands the receptive field, enhancing the model’s ability to capture long-range spatial dependencies and cross-region structural correlations. This is particularly critical for remote sensing scenes involving non-local continuity, such as road networks, rivers, and mountain ranges.
Larger kernels also strengthen the synergy between convolutional modeling and wavelet-guided frequency decoupling, allowing for a clearer separation of low- and high-frequency features. This improves the model’s capacity to balance structure preservation and texture reconstruction, which directly translates to enhanced reconstruction quality.
However, as shown in
Table 1, increasing the kernel size results in a noticeable rise in parameter count, FLOPs, and inference time. For instance, switching from a 3 × 3 to a 7 × 7 kernel increases decoding time by over 20%. This trade-off becomes especially significant in real-world applications involving high-resolution data, limited hardware, or edge deployment scenarios.
4.4.3. Final Configuration and Design Considerations
In summary, the convolution kernel size in the WTLS module plays a dominant role in shaping the receptive field and enhancing spatial modeling capability, while the wavelet basis acts more as a tuning factor in frequency-domain decoupling.
Although both influence the model’s performance, kernel size exerts a more direct impact, especially in large-scale, structurally complex scenes.
Considering the trade-off between modeling accuracy, computational cost, and remote sensing adaptability, the Haar wavelet combined with a 5 × 5 convolution kernel is adopted as the default configuration for the WTLS module. This setup achieves a well-balanced solution in terms of structure fidelity, compression quality, and deployment efficiency.
4.5. Rate-Distortion Performance
In this study, we assess the rate-distortion characteristics of all models using four evaluation metrics: PSNR, MS-SSIM, LPIPS, and VIFp.
Figure 15,
Figure 16 and
Figure 17 illustrate the rate-distortion curves for various compression approaches. Additionally, cross-dataset generalization is examined in
Figure 18 and
Figure 19.
Figure 15 illustrates the rate-distortion curves for several compression techniques assessed using the DOTA dataset. It is noteworthy that MGMNet demonstrates superior performance compared to all other methods in terms of PSNR and MS-SSIM. This finding suggests that MGMNet exhibits enhanced rate-distortion efficiency while effectively maintaining both radiometric fidelity and structural integrity. Qin 2024 [
42] ranks closely behind MGMNet in PSNR and MS-SSIM and shows comparable results to Liu 2024 [
41] in LPIPS and VIFp. MGMNet also achieves the lowest LPIPS scores, suggesting better alignment with human perceptual quality, and attains the highest VIFp scores, reflecting superior retention of fine-grained details and high-frequency information.
Figure 16 presents the rate-distortion curves for the UC-Merced dataset. MGMNet demonstrates performance in terms of PSNR that is comparable to Cheng 2020 [
11] at lower bitrates, while exceeding it at higher bitrates. Furthermore, MGMNet consistently outperforms all other methodologies, with the exception of Cheng 2020 [
11], across the entire range of bitrates. In terms of MS-SSIM, MGMNet closely aligns with Cheng 2020 [
11] but significantly outperforms all other methods. Additionally, MGMNet exhibits the lowest distortion across bitrates for the LPIPS metric and shows superior performance on the VIFp metric.
Figure 17 shows the results on NWPU-RESISC45, where MGMNet outperforms all comparison methods in PSNR, MS-SSIM, and VIFp, especially at high bitrates. In terms of LPIPS, MGMNet surpasses Qin 2024 [
42] at low bitrates and performs comparably at high bitrates, while consistently outperforming the remaining methods.
To validate its robustness, we evaluated MGMNet on the validation sets of UC-Merced and NWPU-RESISC45 using a model trained on DOTA. As shown in
Figure 18 and
Figure 19, under cross-dataset validation, MGMNet demonstrates excellent performance in all four metrics. These results confirm MGMNet’s strong generalization capability and its effectiveness for remote sensing image compression.
We compute average BD-rate and BD-PSNR for MGMNet and other comparative methods relative to JPEG2000 as the baseline anchor, as shown in
Table 2. BD-rate measures the difference in bit rate between two compression algorithms at the same image quality and is typically expressed as a percentage; a negative value indicates a lower bit rate and therefore better compression efficiency. BD-PSNR reflects the change in image quality (measured by PSNR) at the same bit rate across different compression algorithms, usually expressed in dB; a positive value means the new method provides a higher PSNR at the same bit rate and thus better image quality, whereas a negative value indicates a lower PSNR and poorer quality.
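For reference, BD figures of this kind are computed with the Bjøntegaard metric; the sketch below shows the standard cubic-fit formulation in NumPy with made-up rate-distortion points, and it is an illustration rather than the authors' evaluation script.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-delta rate: average bitrate difference (%) at equal PSNR,
    computed by fitting cubic polynomials to the RD points in log-rate space."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)        # log-rate as a function of PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a, int_t = np.polyint(p_a), np.polyint(p_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1) * 100      # percent bitrate change vs. the anchor

# toy example: four RD points each for an anchor (e.g., JPEG2000) and a test codec
print(bd_rate([0.2, 0.4, 0.8, 1.6], [28, 31, 34, 37], [0.15, 0.3, 0.6, 1.2], [28.5, 31.5, 34.5, 37.5]))
```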
Table 2 shows that MGMNet achieves the most excellent BD-rate savings across all three datasets while also providing the highest BD-PSNR improvement. This demonstrates its ability to deliver superior image quality at lower bit rates.
- (1)
In the context of the DOTA dataset, MGMNet demonstrates a significant reduction in the bit rate of 78.198% when compared to JPEG2000. When compared to other methods, MGMNet achieves a BD-rate reduction of 35.119%, 14.222%, 19.977%, 2.992%, 6.565%, 2.981%, 8.396%, and 1.027% compared to WebP, BPG, AVIF, Cheng 2020 [
11], Jiang 2022 [
40], Liu 2023 [
34], Liu 2024 [
41], and Qin 2024 [
42], respectively. At the same bit rate, MGMNet also improves BD-PSNR by 3.519 dB, 1.754 dB, 2.240 dB, 0.403 dB, 0.772 dB, 0.388 dB, 0.968 dB, and 0.242 dB compared to these methods.
- (2)
On the UC-Merced dataset, MGMNet reduces the bit rate by 72.021% when maintaining the same image quality as JPEG2000. Under identical bit rate conditions, it also delivers a 5.473 dB gain in PSNR. When compared to other methods, MGMNet reduces BD-rate by 31.657%, 19.596%, 24.062%, 1.048%, 7.133%, 4.176%, 4.722%, and 1.720% relative to WebP, BPG, AVIF, Cheng 2020 [
11], Jiang 2022 [
40], Liu 2023 [
34], Liu 2024 [
41], and Qin 2024 [
42], respectively. Additionally, at the same bit rate, MGMNet improves BD-PSNR by 3.394 dB, 2.600 dB, 2.949 dB, 0.194 dB, 1.684 dB, 1.278 dB, 0.910 dB, and 0.198 dB compared to these methods.
- (3)
For the NWPU-RESISC45 dataset, MGMNet reduces the bit rate by 43.808% compared to JPEG2000 at the same image quality. Compared to other methods, MGMNet achieves BD-rate reductions of 36.557%, 18.251%, 24.043%, 2.570%, 14.584%, 12.124%, 6.562%, and 1.665% relative to WebP, BPG, AVIF, Cheng 2020 [
11], Jiang 2022 [
40], Liu 2023 [
34], Liu 2024 [
41], and Qin 2024 [
42], respectively. At the same bit rate, MGMNet also improves BD-PSNR by 2.477 dB, 1.431 dB, 1.883 dB, 0.585 dB, 1.408 dB, 1.229 dB, 0.847 dB, and 0.208 dB compared to these methods.
In summary, MGMNet demonstrates significant bitrate savings and image quality improvement on three datasets, fully demonstrating its excellent compression efficiency and robustness.
An additional performance evaluation is conducted on the DOTA dataset under extremely low bitrate conditions (BPP < 0.1). Two representative generative compression methods are introduced as baseline comparisons to assess the relative performance of the proposed approach.
- (1)
Pan et al. [
27] leverages a coupled generative network and compression module to enhance perceptual quality at ultra-low bitrates, particularly achieving favorable results on perceptual metrics such as LPIPS.
- (2)
Ye et al. [
48] introduces map-assisted semantic priors to guide content-aware reconstruction of remote sensing images, aiming to preserve land-cover structures even at extremely low bitrates.
The experimental results are shown in
Figure 20. Although these generative approaches demonstrate advantages in perceptual quality, they typically rely on learned generative priors to hallucinate or fill in missing information caused by aggressive compression. Such generation-based mechanisms may introduce hallucinated details in high-texture regions of remote sensing imagery; for example, by producing non-existent road extensions, fabricated building outlines, or artificial textures.
In contrast, our proposed MGMNet does not rely on explicit generative priors. Instead, it explicitly models multi-scale spatial context and incorporates a wavelet-guided texture enhancement module to restore structural and textural details based on physically consistent and semantically faithful representations.
Under the BPP < 0.1 settings, MGMNet outperforms the above generative methods in objective metrics such as PSNR and MS-SSIM, demonstrating superior ability in recovering pixel-level accuracy and structural integrity.
On the LPIPS metric, MGMNet performs slightly below generative models. This is primarily because LPIPS measures similarity in deep feature space, favoring perceptual closeness rather than true fidelity to the original data source. Our approach places more emphasis on preserving spatial geometry and reconstructing authentic textures, avoiding the risk of introducing perceptually “plausible but incorrect” structures commonly found in generation-based reconstructions.
Overall, under ultra-low bitrate conditions, MGMNet achieves a better balance between structural fidelity and compression quality. The generated images are more geometrically consistent and semantically reliable, making the method well-suited for high-precision remote sensing applications where authenticity and accuracy are critical.
4.6. Visualization of Reconstructed Images
To verify the visual effect of MGMNet, this experiment visualizes and analyzes the reconstructed images from different methods.
Figure 21,
Figure 22 and
Figure 23 show the reconstructed images and their zoomed-in regions for different datasets.
In
Figure 21, among the four traditional compression methods, BPG produces visually superior results compared to AVIF, WebP, and JPEG2000. For example, the BPG-reconstructed image clearly shows the tennis court’s grid lines and the outlines of the buildings, while AVIF, WebP, and JPEG2000 exhibit noticeable distortions and blurring. Compared to MGMNet, although BPG achieves a slightly higher bitrate, MGMNet obtains a significant 1.286 dB gain in PSNR. MGMNet also delivers better visual quality, with clearer grid lines and sharper object boundaries.
Moreover, as demonstrated in
Figure 22 and
Figure 23, MGMNet consistently exhibits enhanced visual performance across the other two datasets. These findings further corroborate its robust generalization ability and resilience in various remote sensing contexts. In summary, MGMNet not only excels in preserving visual details but also in compression efficiency, highlighting its efficacy and potential applicability in remote sensing image compression endeavors.
To complement the quantitative metrics in
Table 2—where performance differences across methods remain relatively small—pixel-level error heatmaps are employed for qualitative comparison. A detailed pixel-level difference analysis is performed based on the reconstructed results shown in
Figure 21, and the corresponding error heatmaps are presented in
Figure 24. These heatmaps visualize the absolute per-pixel reconstruction error between the original and compressed images, offering clearer insight into performance differences in structurally critical areas.
The error values are computed as absolute differences at the pixel level and visualized using the Jet colormap, producing heatmaps where cool colors (e.g., blue) represent low error and warm colors (e.g., red) indicate high error. To enhance the visibility of fine residuals in low-error regions, a contrast enhancement factor is applied during heatmap generation, making subtle differences more discernible—particularly in high-resolution remote sensing images that contain rich structural textures.
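The heatmap generation step can be reproduced with a few lines of NumPy/Matplotlib, as sketched below; the contrast-enhancement factor (gain) and the normalization are illustrative assumptions, since the exact values used for Figure 24 are not specified.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

def error_heatmap(original, reconstructed, gain=4.0):
    """Per-pixel absolute error map rendered with the Jet colormap.
    `gain` is a contrast-enhancement factor that makes small residuals visible."""
    err = np.abs(original.astype(np.float32) - reconstructed.astype(np.float32)).mean(axis=-1)
    err = np.clip(gain * err / 255.0, 0.0, 1.0)   # normalize and amplify
    return cm.jet(err)[..., :3]                   # RGB heatmap in [0, 1]

# usage with two uint8 H x W x 3 images
orig = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
recon = np.clip(orig + np.random.randint(-8, 9, orig.shape), 0, 255).astype(np.uint8)
plt.imsave("error_heatmap.png", error_heatmap(orig, recon))
```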
As illustrated in
Figure 24, the proposed method yields uniformly low error across the image. Most regions appear in cool tones, demonstrating strong performance in both texture recovery and structure preservation. Compared to other approaches, fewer high-error regions are observed, especially in edge areas and fine textures, highlighting the method’s superiority in spatial fidelity and detail reconstruction.
Compared to existing deep learning-based approaches, MGMNet demonstrates enhanced capability in restoring fine-grained image details at similar bits per pixel (bpp) levels. It also achieves higher performance in terms of both PSNR and MS-SSIM metrics. In the red-marked regions, Jiang 2022 [
40] and Liu 2024 [
41] fail to fully reconstruct the gridlines and exhibit artifacts and noise. Liu 2023 [
34] and Cheng 2020 [
11] exhibit an over-smoothing effect on the vegetation situated behind the residence, which consequently leads to a diminution of high-frequency information. In contrast, MGMNet preserves more texture and sharper edges. Compared with Qin 2024 [
42], MGMNet also reconstructs richer textures and more defined contours.
4.7. Ablation Experiments
To evaluate the impact of each module, we conducted ablation experiments utilizing the DOTA dataset. The findings are illustrated in
Figure 25. Here, the baseline refers to the original network Qin 2024 [
42]. Baseline + CGIM denotes the baseline integrated with the Channel–Global Information Module (CGIM). Baseline + WTLS represents the baseline combined with the Wavelet-Transform Local Structure module (WTLS). Baseline + CGIM + WTLS includes both modules integrated simultaneously.
As shown in
Figure 25, adding CGIM to the baseline significantly improves rate-distortion performance at similar bitrates. This demonstrates the importance of incorporating channel-wise and global contextual information for accurate remote-sensing image reconstruction.
The baseline + WTLS configuration shows a clear PSNR improvement over the baseline. However, the gains in MS-SSIM and LPIPS are relatively limited. In terms of VIFp, baseline + WTLS performs on par with the baseline at low bitrates and surpasses it at higher bitrates. These results highlight WTLS’s effectiveness in capturing multi-scale local features and enhancing structural quality.
The integrated model, which includes the baseline, CGIM, and WTLS components, consistently exhibits enhanced performance across all assessed metrics. This suggests that the combined integration of CGIM and WTLS facilitates a more thorough representation of features. The proposed architecture enables the model to proficiently learn and assimilate local, channel, and global features, ensuring the preservation of essential information and promoting accurate, high-fidelity image reconstruction.
To further demonstrate the effectiveness of each component, we compute the average BD-rate and BD-PSNR for baseline + CGIM, baseline + WTLS, and baseline + CGIM + WTLS, using the baseline as the reference point. The results are presented in
Table 3.
- (1)
For the same quality, baseline + CGIM achieves a 3.262% reduction in bit rate compared to the baseline, baseline + WTLS reduces the bit rate by 1.435%, and baseline + CGIM + WTLS leads to a 5.918% decrease.
- (2)
For the same bit rate, baseline + CGIM shows a 0.154 dB increase in PSNR over baseline, baseline + WTLS achieves a 0.067 dB improvement, and baseline + CGIM + WTLS results in a 0.289 dB increase in PSNR.
The above results show that both CGIM and WTLS significantly improve image compression performance, while the combination of the two further enhances it.
4.8. Complexity Analysis
To ensure a fair assessment of computational complexity and resource consumption, all compression methods were evaluated on the DOTA validation set using identical hardware and environmental settings. The comparison considers FLOPs, parameter count, and average encoding/decoding times.
All evaluations were conducted with input images of size 3 × 256 × 256, and timing results represent averages to mitigate variations caused by GPU memory usage. The findings indicate that MGMNet attains an advantageous equilibrium between compression efficacy and computational efficiency. Two Transformer-based image compression algorithms [22,49] were further incorporated for comparative analysis, highlighting the computational efficiency of the proposed architecture.
As shown in
Table 4, in terms of FLOPs, MGMNet incurs an increase of 53.35 G and 15.20 G compared to Cheng 2020 [
11] and Qin 2024 [
42], respectively. However, it achieves reductions of 1.94%, 30.84%, and 57.61% compared to Jiang 2022 [
40], Liu 2023 [
34], and Liu 2024 [
41], respectively. Regarding model parameters, MGMNet reduces the parameter count by 51.24%, 25.18%, and 37.71% compared to Jiang 2022 [
40], Liu 2023 [
34], and Liu 2024 [
41], respectively, while exhibiting an increase of 48.01 M and 8.91 M compared to Cheng 2020 [
11] and Qin 2024 [
42].
4.9. Task-Oriented Performance
To assess the practical effectiveness of the proposed compression method in downstream remote sensing tasks, salient object detection is selected as a representative evaluation scenario [
50]. This task provides a high-level vision benchmark that is sensitive to both structural integrity and semantic preservation, making it suitable for evaluating compressed image quality.
The widely used SggNet model [
50] is adopted for saliency prediction, with the ORSSD dataset [
51] serving as the evaluation benchmark. ORSSD contains a diverse range of land-cover categories, including urban and natural scenes, and is designed for tasks involving fine-grained structural understanding. To ensure reproducibility and demonstrate representative performance, a sample image (ID: 0021) is randomly selected from the ORSSD test set for qualitative and quantitative analysis.
During the experiment, the trained compression model is applied to compress and reconstruct the test images under varying bitrate settings, measured in bits per pixel (BPP). Each reconstructed image is then fed into the SggNet to perform salient object detection. To evaluate how well the compressed images preserve task-relevant information, we adopt three commonly used metrics: F-measure [
52], S-measure [
53], and Mean Absolute Error (MAE) [
54]. These metrics, respectively, measure the precision-recall balance, structural similarity, and pixel-level accuracy between the predicted saliency maps and the ground truth.
The results, summarized in
Table 5, demonstrate that the proposed method maintains high detection performance across all compression levels. Even under low-bitrate conditions (e.g., BPP = 0.113), the reconstructed images retain sufficient structure and semantics to support accurate saliency detection, as reflected by only slight degradation in all three evaluation metrics.
In addition to the quantitative analysis,
Figure 26 provides a visual comparison of the detection results on ORSSD-0021. The original and compressed reconstructed images are processed by SggNet, and the resulting saliency maps are shown side by side for comparison.