1. Introduction
Remote sensing (RS) images have been widely applied in various fields, including urban planning [1], resource management [2], and disaster monitoring [3]. However, RS images are typically characterized by rich content, high resolution, and large size [4]. Furthermore, with the advancement of sensor technology and the growing image acquisition capability of satellite and airborne equipment, the need to store and transmit RS images continues to grow. To cope with these challenges, high-resolution RS images are usually compressed before they are stored and transmitted to the ground. In this case, it is critical to achieve low-bitrate compression while obtaining decoded images with high perceptual quality.
Traditional image compression algorithms, including JPEG2000 [5], BPG [6], and VVC [7], have been instrumental in facilitating the storage and transmission of image data. However, images compressed with these algorithms inevitably suffer from undesired blocking and ringing artifacts and blurring [8,9], which can strongly degrade perceptual quality.
With the development of deep learning techniques, learning-based image compression methods have made significant progress toward high rate-distortion (RD) performance [10,11,12,13,14,15]. Notably, in natural image compression, Ballé et al. [16] proposed to view additional side information as a hyper-prior entropy model to estimate a zero-mean Gaussian distribution, laying the foundation for subsequent improvements in entropy modeling. Cheng et al. [9] further introduced discretized Gaussian mixture likelihoods for the distribution estimation of the entropy model, resulting in impressive decoding performance. Meanwhile, He et al. [17] developed a spatial-channel contextual adaptive model, enhancing compression performance without sacrificing computational speed. Liu et al. [13] leveraged the local modeling ability of convolutional neural networks (CNNs) and the non-local modeling strengths of Transformers to develop the encoder and decoder networks, and they further proposed a channel-squeezing-based entropy model to enhance RD performance. Jiang et al. [15] proposed capturing the channel-wise, local spatial, and global spatial correlations present in latent representations to build a comprehensive entropy model, which was then employed in the competitive image compression method MLIC++. Despite the success of learning-based compression algorithms in natural scenes, RS images present unique challenges due to their rich texture, weak correlation, and low redundancy compared to natural images [18,19]. Traditional metrics such as PSNR and SSIM, though widely used to evaluate the fidelity of compressed images, often fail to align with human visual perception, particularly in low-bitrate scenarios. These metrics emphasize pixel-wise similarity but overlook the global structural consistency and fine details that are crucial for visually appealing images. This limitation has been extensively discussed in recent studies on deep learning-based image compression, where perceptual quality, rather than purely objective fidelity, has proven to better match human visual preferences. Enhanced perceptual quality allows for more reliable real-time monitoring and decision making in fields such as environmental monitoring and disaster management.
To achieve high perceptual quality, Zhang et al. [10] introduced a multi-scale attention module to enhance the network's feature extraction capability and developed an improved entropy model using global priors and anchored-stripe attention. Pan et al. [20] resorted to generative adversarial networks (GANs) to independently reconstruct the image content and detailed textures, subsequently fusing these features to achieve low-bitrate RS image compression. Additionally, Xiang et al. [21] utilized the discrete wavelet transform to separate image features into high- and low-frequency components and designed compression networks to enhance the model's representation of both types of features.
In summary, the abovementioned deep learning-based image compression methods typically achieve image compression through CNN blocks [9,10,11,16,17,21] or Transformer blocks [22,23,24,25,26], followed by optimization using distance measurements such as the $\ell_1$ or $\ell_2$ loss, or an adversarial loss [27]. Despite their impressive performance, these methods usually struggle to generate satisfactory perceptual quality while maintaining high image fidelity. Recently, invertible neural networks (INNs) have emerged as a new paradigm for image generation, demonstrating remarkable performance across various applications [28,29,30,31,32]. For instance, Zhao et al. [28] leveraged an INN to keep the color information lost during grayscale image generation independent of the input image. Similarly, Xiao et al. [32] employed an INN-based framework to address the information loss during downscaling by modeling the bidirectional degradation and restoration from a new perspective. Building on this success, several approaches have integrated INNs into image compression to capture richer texture details [25,29].
For instance, in [25], an enhanced INN-based encoding network was devised to improve the decoding performance of natural image compression. In [29], the authors utilized an INN to develop an invertible image generation module, aiming to prevent information loss and obtain a competitive low-bitrate compression algorithm. Unlike these methods, the proposed approach leverages the invertibility of INNs to recover more texture information from images decoded by an existing compression algorithm, without requiring additional bitstreams or model retraining. Specifically, it transforms the complex compression distortion distribution of the compression algorithm into a simpler, well-defined Gaussian distribution through forward processing. During inverse processing, it samples additional texture details from the Gaussian-distributed variables conditioned on the decoded image, thereby enriching the reconstructed output with finer details.
Figure 1 illustrates a perceptual comparison between the decoded images and the enhanced images. Here, bits per pixel (bpp) is used as a key metric to quantify compression efficiency, representing the average number of bits required to encode each pixel of the image. A lower bpp value indicates higher compression efficiency, while a higher bpp value typically corresponds to better image quality. One can see that the proposed INN-RSIC exhibits a high capability in recovering texture-rich details without additional bitrate, demonstrating its effectiveness in balancing compression efficiency and perceptual quality.
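For reference, the following minimal Python sketch (the function name and sample numbers are illustrative, not from the paper) shows how bpp is computed from an encoded bitstream:

```python
def bits_per_pixel(bitstream: bytes, height: int, width: int) -> float:
    """bpp = total encoded bits / number of pixels."""
    return len(bitstream) * 8 / (height * width)

# e.g., a 12 kB bitstream for a 512x512 image:
# bits_per_pixel(b"\x00" * 12288, 512, 512) -> 0.375 bpp
```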
In this paper, we propose the invertible neural network-based remote sensing image compression (INN-RSIC) method. Specifically, we utilize an invertible forward network with a conditional generation module (CGM) to encode the compression loss information of an existing image compression algorithm into Gaussian-distributed latent variables. As a result, part of the lost prior information can be sampled from the Gaussian distribution, facilitating the recovery of visually enhanced images during inverse mapping. Additionally, to effectively learn the compression distortion, we adopt channel expansion and the Haar transformation [33] to separate the high and low frequencies, and we introduce a quantization module (QM) to reduce the impact on reconstruction quality of the data type conversion in the inference stage.
The primary contributions of this paper can be summarized as follows.
The proposed INN-RSIC is the first attempt to model the image compression distortion of an image compression method using invertible transforms. It serves as a plug-and-play (PnP) method to obtain highly perceptible decoded images while preserving the performance of the baseline algorithm.
We develop a novel, effective yet simple architecture with channel expansion, Haar transformation, and invertible blocks. This architecture enables projecting compression distortion into a case-agnostic distribution so that the compression distortion information can be obtained based on samples from a Gaussian distribution.
CGM is introduced to split and encode the compression distortion information from the ground truth into the latent variables conditioned on the synthetic image. This process generates pattern-free synthetic images and enhances the texture details of the reconstructed images during inverse mapping.
Extensive experiments indicate that the proposed INN-RSIC achieves a superior balance of perceptual quality and image fidelity compared to existing state-of-the-art image compression algorithms. Our method offers a novel perspective for improving the perceptual quality of image compression algorithms.
3. Materials and Methods
Compared to natural images, RS images usually contain richer textures, which makes acquiring RS images at low bitrates more challenging [19]. Therefore, the proposed INN-RSIC aims to utilize an INN to encode the residual compression distortion of an existing image compression algorithm into a set of latent variables $\mathbf{z}$ following a pre-defined distribution, such as a Gaussian distribution. As a result, the distribution of the latent variables becomes independent of the distribution of the input image. In the enhanced reconstruction stage of our framework (i.e., the inverse mapping of the proposed INN-RSIC), a new set of randomly sampled latent variables $\hat{\mathbf{z}}$ can effectively represent the residual compression distortion to some extent. Thus, we can simply feed the re-sampled $\hat{\mathbf{z}}$, along with the decoded images, into the inverse network of INN-RSIC to obtain enhanced images.
Generally, the distribution of compression distortion varies among learning-based methods, as each exhibits distinct preferences in learning the data distribution. For instance, some algorithms excel on texture-rich images, while others perform best on the opposite type of content. Therefore, we focus here on the compression distortion distribution of the impressive image compression algorithm ELIC at different bitrates. Unfortunately, pinpointing the distortion distribution directly is challenging. Therefore, we chose to capture the distortion by investigating the relationship between the input image and the decoded image of ELIC. In this way, we can indirectly explore the distortion distribution inherent in the compression algorithm.
The architecture of the proposed INN-RSIC, consisting of a compressor and an enhancer, is depicted in Figure 2. For the compressor, given an input image $\mathbf{x}$, the encoder $g_a$ first extracts the latent representation $\mathbf{y}$, which is then quantized through the operation $Q(\cdot)$ to obtain $\hat{\mathbf{y}}$. Subsequently, $\hat{\mathbf{y}}$ is encoded into a bitstream using the estimated probability distribution. On the decoding side, the decoded image $\mathbf{x}_{\mathrm{dec}}$ is reconstructed by feeding $\hat{\mathbf{y}}$ into the decoder $g_s$. This process can be formally described as follows:

$$\mathbf{y} = g_a(\mathbf{x}), \qquad \hat{\mathbf{y}} = Q(\mathbf{y}), \qquad \mathbf{x}_{\mathrm{dec}} = g_s(\hat{\mathbf{y}}). \quad (1)$$

To enable efficient arithmetic coding of $\hat{\mathbf{y}}$, the hyper networks $h_a$ and $h_s$ are typically employed to estimate the probability model. The enhancer mainly comprises two streams: the forward and inverse networks, denoted as $f_\theta$ and $f_\theta^{-1}$, respectively, where $\theta$ represents the network parameters.
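As a hedged illustration of the compressor interface in Equation (1): ELIC itself is not bundled with the CompressAI library, but any hyperprior-style codec from its model zoo exposes the same $g_a$/$Q$/$g_s$ pipeline; here cheng2020_anchor stands in for ELIC.

```python
import torch
from compressai.zoo import cheng2020_anchor

# Stand-in for ELIC: a hyperprior-style codec from the CompressAI zoo.
net = cheng2020_anchor(quality=3, pretrained=True).eval()
net.update()  # build entropy-coder CDF tables

x = torch.rand(1, 3, 256, 256)  # dummy input image in [0, 1]
with torch.no_grad():
    enc = net.compress(x)                               # y = g_a(x); y_hat = Q(y); -> bitstream
    dec = net.decompress(enc["strings"], enc["shape"])  # x_dec = g_s(y_hat)
x_dec = dec["x_hat"].clamp(0, 1)
```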
3.1. INN Architecture
3.1.1. Invertible Block
The invertible block is a crucial component in invertible architectures, acting as a bridge between two different distributions via its trainable parameters. Figure 2b depicts the structure of an invertible block. An invertible block operates on an input feature map $h^{l} \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. In each invertible block, $h^{l}$ is split along the channel dimension into two parts $[h_1^{l}, h_2^{l}]$, where $h_1^{l} \in \mathbb{R}^{H \times W \times C_1}$, $h_2^{l} \in \mathbb{R}^{H \times W \times C_2}$, and $C = C_1 + C_2$. The input features of the $(l+1)$-th invertible block are obtained by applying learnable scale and shift transformations to both segments:

$$h_1^{l+1} = h_1^{l} + \phi(h_2^{l}), \qquad h_2^{l+1} = h_2^{l} \odot \exp\big(\rho(h_1^{l+1})\big) + \eta(h_1^{l+1}), \quad (2)$$

where ⊙ refers to the Hadamard product. The transformations $\phi(\cdot)$, $\rho(\cdot)$, and $\eta(\cdot)$ are realized by dense blocks [43]. Therefore, the inverse transformation can be derived from Equation (2) by

$$h_2^{l} = \big(h_2^{l+1} - \eta(h_1^{l+1})\big) \oslash \exp\big(\rho(h_1^{l+1})\big), \qquad h_1^{l} = h_1^{l+1} - \phi(h_2^{l}), \quad (3)$$

where ⊘ denotes element-wise division.
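A minimal PyTorch sketch of this coupling scheme is given below; the subnets are plain convolutional stacks standing in for the dense blocks [43], and the channel split sizes are illustrative.

```python
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    """Coupling block matching Equations (2) and (3). The subnets phi,
    rho, eta are simple conv stacks here; the paper uses dense blocks."""
    def __init__(self, c1: int, c2: int, hidden: int = 32):
        super().__init__()
        def subnet(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, hidden, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, cout, 3, padding=1))
        self.phi = subnet(c2, c1)   # shift for h1
        self.rho = subnet(c1, c2)   # log-scale for h2
        self.eta = subnet(c1, c2)   # shift for h2
        self.c1 = c1

    def forward(self, h):
        h1, h2 = h[:, :self.c1], h[:, self.c1:]
        h1 = h1 + self.phi(h2)                             # Eq. (2), first half
        h2 = h2 * torch.exp(self.rho(h1)) + self.eta(h1)   # Eq. (2), second half
        return torch.cat([h1, h2], dim=1)

    def inverse(self, h):
        h1, h2 = h[:, :self.c1], h[:, self.c1:]
        h2 = (h2 - self.eta(h1)) / torch.exp(self.rho(h1))  # Eq. (3)
        h1 = h1 - self.phi(h2)
        return torch.cat([h1, h2], dim=1)
```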
In addition, as illustrated in Figure 2b, the forward network outputs the synthetic image $\mathbf{x}_{\mathrm{syn}}$ and the latent variable $\mathbf{z}$. Following the studies [25,29], we assume that $\mathbf{z}$ follows a Gaussian distribution, so that variables can be randomly sampled from the same distribution for the inverse mapping. Consequently, in the training stage, the synthetic image $\mathbf{x}_{\mathrm{syn}}$ and $\mathbf{z}$ can be merged and fed into the inverse network for model optimization. In the inference stage, the decoded image $\mathbf{x}_{\mathrm{dec}}$ and the resampled $\hat{\mathbf{z}}$ can be combined and fed into the inverse network for image enhancement.
3.1.2. Channel Expansion and Haar Transformation
Within each invertible block, the input and output signals have the same number of channels. However, in addition to generating the synthetic image, we also need to produce an expanded output to derive the Gaussian latent variable $\mathbf{z}$. To achieve this, as shown in Figure 2b, we double the number of channels of the input image using $1 \times 1$ convolutional layers. In addition, the Haar transformation [33] is used here to separate the compression distortion information by splitting the high and low frequencies, which are then fed into the invertible blocks.
Specifically, in the forward network, given the ground truth $\mathbf{x}$, the feature $h^{1}$ fed into the first invertible block can be expressed as

$$h^{1} = \mathcal{H}\big(\mathcal{C}(\mathrm{Conv}_1(\mathbf{x}), \mathrm{Conv}_2(\mathbf{x}))\big), \quad (4)$$

where $\mathcal{H}(\cdot)$ denotes the Haar function, $\mathcal{C}(\cdot)$ refers to a concatenation operation, and $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ refer to two $1 \times 1$ convolution layers. After the last invertible block, the inverse Haar function $\mathcal{H}^{-1}(\cdot)$ is adopted to transform the features from the frequency domain back to the spatial domain.
As a result, at the end of the forward network, we obtain six-channel signals with the same size as the input image, and, with CGM, we derive both the synthetic image $\mathbf{x}_{\mathrm{syn}}$ and the Gaussian latent variables $\mathbf{z}$, which are then combined and fed into the inverse network for image recovery.
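The following is a minimal sketch of a single-level Haar transform pair of the kind used here (orthonormal 2×2 filters; the paper's exact normalization may differ):

```python
import torch

def haar_forward(x):
    """Single-level 2D Haar transform on an NCHW tensor. Returns a tensor
    with 4x the channels and half the spatial size: [LL, LH, HL, HH]."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

def haar_inverse(y):
    """Exact inverse of haar_forward."""
    n = y.shape[1] // 4
    ll, lh, hl, hh = y[:, :n], y[:, n:2*n], y[:, 2*n:3*n], y[:, 3*n:]
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    x = y.new_zeros(y.shape[0], n, y.shape[2] * 2, y.shape[3] * 2)
    x[:, :, 0::2, 0::2] = a
    x[:, :, 0::2, 1::2] = b
    x[:, :, 1::2, 0::2] = c
    x[:, :, 1::2, 1::2] = d
    return x
```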
In the inverse network, the inverse processing of Equation (4) applies the operations in reverse order and can be formulated as

$$\tilde{\mathbf{x}} = \big[\mathrm{Conv}_1^{-1}, \mathrm{Conv}_2^{-1}\big]\Big(\mathcal{C}^{-1}\big(\mathcal{H}^{-1}(h^{1})\big)\Big), \quad (5)$$

where $\mathrm{Conv}_1^{-1}(\cdot)$, $\mathrm{Conv}_2^{-1}(\cdot)$, and $\mathcal{H}^{-1}(\cdot)$ refer to the inverse functions of $\mathrm{Conv}_1(\cdot)$, $\mathrm{Conv}_2(\cdot)$, and $\mathcal{H}(\cdot)$, respectively. Next, we will illustrate CGM in detail.
3.1.3. Conditional Generation Module (CGM)
As illustrated above, we aim to encode the compression distortion information of decoded RS images into a set of Gaussian-distributed latent variables, where the mean and variance are conditioned on the synthetic image. This conditioning enables the reconstruction process to be image-adaptive. To achieve this, as illustrated in Figure 2b, we introduce CGM at the end of the forward network. Specifically, to establish the dependency between $\mathbf{z}$ and $\mathbf{x}_{\mathrm{syn}}$, as depicted in Figure 3, the output six-channel tensor from the last invertible block is further divided by CGM into two parts: a three-channel synthetic image $\mathbf{x}_{\mathrm{syn}}$ and a three-channel latent variable $\mathbf{z}'$ (representing the compression distortion information).
In the forward mapping, inspired by [28], we normalize $\mathbf{z}'$ into standard Gaussian-distributed variables $\mathbf{z}$ by $\mathbf{z} = (\mathbf{z}' - \boldsymbol{\mu}) \oslash \boldsymbol{\sigma}$, where the mean $\boldsymbol{\mu}$ and scale $\boldsymbol{\sigma}$ of $\mathbf{z}'$ are calculated by

$$(\boldsymbol{\mu}, \boldsymbol{\sigma}) = \psi(\mathbf{x}_{\mathrm{syn}}), \quad (6)$$

where $\psi(\cdot)$ is realized by the dense block [43]. Hence, the reverse mapping can be formulated as $\mathbf{z}' = \mathbf{z} \odot \boldsymbol{\sigma} + \boldsymbol{\mu}$. In this way, we encode the compression distortion information into the latent variables, whose distribution is conditioned on the synthetic image. The inverse network is similar to [28,32], in that we sample a set of random variables from the Gaussian distribution conditioned on the synthetic images to reconstruct the enhanced images.
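A minimal PyTorch sketch of CGM under these definitions is shown below; the subnet psi standing in for the dense block, and the log-scale parameterization, are implementation assumptions.

```python
import torch
import torch.nn as nn

class CGM(nn.Module):
    """Conditional generation module: splits the six-channel output of the
    last invertible block into a synthetic image and latent variables, then
    normalizes the latents conditioned on the synthetic image (Eq. (6))."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        # stand-in for the dense block predicting mean and log-scale
        self.psi = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 6, 3, padding=1))

    def _stats(self, x_syn):
        mu, log_sigma = self.psi(x_syn).chunk(2, dim=1)
        return mu, torch.exp(log_sigma)  # positive scale

    def forward(self, h):                # forward mapping
        x_syn, z_prime = h[:, :3], h[:, 3:]
        mu, sigma = self._stats(x_syn)
        z = (z_prime - mu) / sigma       # standard-Gaussian latents
        return x_syn, z

    def inverse(self, x_dec, z):         # reverse mapping
        mu, sigma = self._stats(x_dec)
        z_prime = z * sigma + mu
        return torch.cat([x_dec, z_prime], dim=1)
```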
3.1.4. Quantization Module (QM)
To ensure compatibility with common decoded image storage formats, such as RGB (8 bits for each of the R, G, and B color channels), as shown in Figure 2b, we integrate QM after CGM. This module converts the floating-point values of the produced synthetic images into 8-bit unsigned integers by a rounding operation for quantization. However, it is essential to acknowledge a significant obstacle: the quantization module is inherently non-differentiable. To cope with this challenge, we employ the straight-through estimator technique used in [17] to ensure that INN-RSIC can be efficiently optimized during the training process. Subsequently, in the inference stage, the decoded image can reasonably be fed into the inverse network for image enhancement.
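A minimal sketch of QM with a straight-through gradient, assuming images normalized to [0, 1], is as follows:

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through gradient: the forward pass
    quantizes, while the backward pass treats rounding as identity."""
    @staticmethod
    def forward(ctx, x):
        # map [0, 1] floats to 8-bit levels and back
        return torch.round(x * 255.0) / 255.0

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity gradient

def quantization_module(x_syn):
    """QM: clamp to the valid range, then round with STE."""
    return RoundSTE.apply(x_syn.clamp(0.0, 1.0))
```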
3.2. Optimization Strategy
3.2.1. Compression Optimization Loss
As we aim to capture the compression distortion of ELIC [17], we provide here a brief overview of its loss function. Specifically, to balance the compression ratio and the quality of the decoded image, the loss function of ELIC can be formulated as

$$\mathcal{L}_{\mathrm{ELIC}} = R + \lambda D, \quad (7)$$

where the rate $R$ refers to the entropy of the quantized latent variables of ELIC, and $D$ denotes the similarity between the input image and the decoded image, which is typically measured using the mean squared error (MSE). Different compression rates can be achieved by adjusting the hyperparameter $\lambda$: the higher the value of $\lambda$, the higher the bpp and the better the image quality.
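For illustration, a minimal sketch of this rate-distortion objective is given below, assuming the rate is obtained from the entropy model's likelihoods (a common formulation; the names are illustrative):

```python
import torch

def rd_loss(likelihoods, x, x_dec, lam):
    """Eq. (7): L = R + lambda * D. The rate R is the average number of
    bits per pixel implied by the entropy model's likelihoods, and the
    distortion D is the MSE between input and decoded images."""
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    rate = sum((-torch.log2(l)).sum() for l in likelihoods) / num_pixels
    distortion = torch.mean((x - x_dec) ** 2)
    return rate + lam * distortion
```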
3.2.2. Forward Presentation Loss
The forward network is primarily focused on enabling the proposed model to capture the compression distortion distribution of the decoded image. This is achieved by establishing a correspondence between the input image $\mathbf{x}$ and the decoded image $\mathbf{x}_{\mathrm{dec}}$, as well as a case-agnostic distribution of $\mathbf{z}$. It is realized by developing two loss functions: the decoded image loss and the case-agnostic distribution loss.
Decoded Image Loss
To generate the guidance images for training the forward network, ELIC is adopted to generate the labeled image $\mathbf{x}_{\mathrm{dec}}$, which can be formulated as

$$\mathbf{x}_{\mathrm{dec}} = g_s\big(Q(g_a(\mathbf{x}))\big). \quad (8)$$

Thereafter, to make our model follow the guidance, we drive the synthetic image $\mathbf{x}_{\mathrm{syn}}$ to resemble $\mathbf{x}_{\mathrm{dec}}$, which can be derived by

$$\mathcal{L}_{\mathrm{dec}} = \big\| \mathbf{x}_{\mathrm{syn}} - \mathbf{x}_{\mathrm{dec}} \big\|. \quad (9)$$
Case-Agnostic Distribution Loss
To regularize the distribution of $\mathbf{z}$, we maximize, inspired by [28], the log-likelihood of $\mathbf{z}$. Thus, the loss function constraining the latent variables $\mathbf{z}$ can be formulated as

$$\mathcal{L}_{\mathrm{dist}} = \frac{1}{2} \sum_{i=1}^{M} \mathbf{z}_i^{2}, \quad (10)$$

where $M$ is the dimensionality of $\mathbf{z}$. This loss function penalizes the normalized latent variables $\mathbf{z}$ so that they follow a standard Gaussian distribution. Consequently, in the inverse processing, we can randomly sample a set of Gaussian-distributed variables along with the synthetic image to derive the enhanced image.
3.2.3. Reverse Reconstruction Loss
The inverse network aims to guide the model in recovering visually appealing images using randomly sampled latent variables and synthetic images. This is achieved by developing two loss functions: the enhancing reconstruction loss and the quality perception loss.
Enhancing Reconstruction Loss
In theory, the synthetic image can be perfectly restored to its ground truth version through the inverse network of the INN because no information is omitted. In practice, however, the decoded image is not generated by the forward network of the proposed INN-RSIC but by an image compression method; that is, the synthetic image must be stored in 8-bit unsigned integer format so that the decoded image can be used directly in its place for image detail enhancement during the inference stage. To address this, we adopt QM at the end of the forward network during training. Consequently, by penalizing the discrepancy between the reconstructed image and the ground truth, we derive the enhancing reconstruction loss given as

$$\mathcal{L}_{\mathrm{rec}} = \big\| f_\theta^{-1}(\mathbf{x}_{\mathrm{syn}}, \hat{\mathbf{z}}) - \mathbf{x} \big\|, \quad (11)$$

where $\hat{\mathbf{z}}$ denotes the latent variables re-sampled from a standard Gaussian distribution, and the synthetic image is derived by $\mathbf{x}_{\mathrm{syn}} = \mathrm{QM}\big(f_\theta(\mathbf{x})\big)$.
Quality Perception Loss
To improve the perceptual performance of the network by estimating distances in a predefined feature space rather than in image space, a perceptual loss function $\mathcal{L}_{\mathrm{per}}$ is used here. In other words, the feature-space distance serves as an optimization objective that drives the network to reconstruct images while retaining a feature representation similar to the ground truth. Concretely, the learned perceptual image patch similarity (LPIPS) [44] is utilized as the perceptual loss, which is defined as

$$\mathcal{L}_{\mathrm{per}} = \big\| \phi(\mathbf{x}_{\mathrm{rec}}) - \phi(\mathbf{x}) \big\|_2^2, \quad (12)$$

where $\phi(\cdot)$ denotes a feature function that leverages the features extracted at the "Conv4-4" layer of VGG19 to penalize the contrast similarity between $\mathbf{x}_{\mathrm{rec}}$ and $\mathbf{x}$, as was adopted in [45].
Therefore, the total loss function can be given by

$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{dec}} + \lambda_2 \mathcal{L}_{\mathrm{dist}} + \lambda_3 \mathcal{L}_{\mathrm{rec}} + \lambda_4 \mathcal{L}_{\mathrm{per}}, \quad (13)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyperparameters used to balance the different loss terms.
Thus, after obtaining the trained model, in the inference stage, given a decoded image $\mathbf{x}_{\mathrm{dec}}$ obtained through Equation (8), we can derive its enhanced image $\mathbf{x}_{\mathrm{enh}}$ by $\mathbf{x}_{\mathrm{enh}} = f_\theta^{-1}(\mathbf{x}_{\mathrm{dec}}, \hat{\mathbf{z}})$, where $\hat{\mathbf{z}}$ refers to the latent variables sampled from the Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. It can be observed that no additional bitrate is required for the enhancement of the decoded image. The training and inference stages of the proposed INN-RSIC are summarized in Algorithm 1.
Algorithm 1: Processing of INN-RSIC

Training Stage:
Input: training images x
Output: trained parameters θ of INN-RSIC
1: Procedure: Compressor
2: Compress x with ELIC to obtain the decoded image x_dec (Equation (8))
3: Return: x_dec
4: Procedure: Enhancer
5: Initialize the parameters θ of INN-RSIC with Xavier.
6: for epoch ← 1 to num_epochs do
7:   Compute the forward losses L_dec and L_dist
8:   Compute the backward losses L_rec and L_per
9:   Compute the total loss L_total (Equation (13))
10:  Update θ using gradient descent
11: end for
12: Return: θ

Inference Stage:
Input: decoded image x_dec
Output: enhanced image x_enh
1: Procedure: Compressor
2: Decode the bitstream with ELIC to obtain x_dec
3: Return: x_dec
4: Procedure: Enhancer
5: Load the trained θ for INN-RSIC.
6: Randomly sample ẑ ~ N(0, I) and conduct the inverse mapping x_enh = f_θ^{-1}(x_dec, ẑ)
7: Return: x_enh
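A minimal sketch of the inference stage, assuming the trained INN exposes an inverse method (the name is illustrative):

```python
import torch

@torch.no_grad()
def enhance(inv_net, x_dec):
    """Inference: sample standard-Gaussian latents and run the inverse
    mapping on the decoded image (no extra bitstream is needed)."""
    z_hat = torch.randn(x_dec.shape[0], 3, *x_dec.shape[2:],
                        device=x_dec.device)
    return inv_net.inverse(x_dec, z_hat)
```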
4. Results
4.1. Experimental Settings
4.1.1. Datasets
In this experiment, two datasets, DOTA [46] and UC-Merced (UC-M) [47], were used for performance evaluation. Concretely, we used the training dataset of DOTA and 80% of the UC-M training set for model training. Each image was randomly cropped to a fixed training resolution, and random horizontal flip, random vertical flip, and random crop were applied for data augmentation. For testing, we randomly chose 100 images from the DOTA testing dataset and 10% of the images from UC-M, and each image was then centrally cropped to form the testing set.
4.1.2. Implementation Details
In our experiment, we utilized ELIC [17] to compress the images of the training dataset. Concretely, ELIC models trained with different rate parameters $\lambda$ were used to separately derive the decoded images, which were then utilized as labeled images to constrain the forward latent representation. As different compression rates result in different compression distortion distributions of ELIC, the labeled images under each $\lambda$ were used to train a separate INN-RSIC model. Additionally, the widely used AdamW optimizer [48], with $\beta_1$ = 0.9 and $\beta_2$ = 0.999, was employed for parameter optimization. The hyperparameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ were set experimentally. We trained for a total of 300 epochs, halving the learning rate from its initial value every 60 epochs. PSNR, MS-SSIM, and LPIPS [44] were adopted as the evaluation metrics. PSNR and MS-SSIM primarily concentrate on numerical comparison and structural similarity, whereas LPIPS emphasizes perceptual evaluation by leveraging deep learning-based models to extract high-level perceptual features, thus aligning more closely with human visual perception. To maintain experimental rigor and fairness, all of the models were thoroughly trained for comparison and validation.
4.2. Performance Evaluation
In this section, we conduct a comprehensive comparison of the proposed INN-RSIC and several competitive image compression algorithms on the testing sets of DOTA and UC-M. The benchmark includes state-of-the-art traditional standards such as JPEG2000 [5], BPG [6], and VVC (YUV 444) [7], as well as competitive learning-based image compression algorithms, namely HiFiC [27], InvComp [25], STF [22], Entroformer [23], ELIC [17], TCM [13], and MLIC++ [15]. The performance comparisons of the tested algorithms on the DOTA and UC-M datasets are presented in Figure 4. The results demonstrate that the proposed INN-RSIC achieves superior performance in terms of LPIPS, except when compared to HiFiC. While HiFiC achieves comparable perceptual quality, it suffers from poor image fidelity, as reflected by its low PSNR values.
Notably, at low bitrates, the gains in LPIPS are minimal for our method. This can be attributed to the high distortion in the decoded image, which limits the ability of the proposed INN-RSIC to effectively recover finer details from the Gaussian distribution conditioned on the decoded image. However, as the bitrate increases, the proposed INN-RSIC demonstrates significant improvements in LPIPS, highlighting its ability to balance perceptual quality and image fidelity more effectively at higher bitrates. Additionally, compared to GAN-based methods like HiFiC, which rely on adversarial training to generate fine-grained textures and global structural consistency, our lightweight INN-based method prioritizes simplicity and efficiency. As a result, while both methods achieve comparable LPIPS scores, HiFiC benefits from the generative capabilities of GANs, whereas our method's advantage lies in its reduced computational complexity and its ability to seamlessly integrate with existing image compression frameworks as a lightweight PnP module.
Additionally, Figure 5 shows the decoding results of these algorithms on the testing images of DOTA and UC-M at low and high bitrates. As shown in Figure 5a, the image "P0121" decoded by the state-of-the-art traditional compression algorithm VVC exhibits obvious distortion, even with an additional 23.66% increase in bpp compared to the proposed INN-RSIC. Moreover, the decoded image from the competitive learning-based compression algorithm STF still suffers from significant distortion, despite a 70.73% increase in bpp. At high bitrates, the image "P0216" decoded by VVC presents blurry texture details despite the higher bpp. In particular, the perceptual quality of the images decoded by STF remains inferior to that of the proposed INN-RSIC, even with a 65.63% increase in bpp. Similarly, as depicted in Figure 5b, the results further confirm that the proposed INN-RSIC achieves impressive perceptual quality compared to traditional and deep learning-based image compression algorithms at both low and high bitrates. Further, Figure 6 visualizes the decoded images from HiFiC and the proposed INN-RSIC. The results clearly demonstrate that, at comparable bpp values, the proposed method produces images with finer details and better image fidelity, achieving a closer resemblance to the ground truth.
In summary, these results demonstrate that the proposed INN-RSIC effectively contributes to the detailed recovery of decoded images without requiring additional bitrate.
4.3. Ablation Evaluation
4.3.1. Effectiveness of QM
QM accounts for the impact of the format mismatch on the reconstruction performance of the inverse mapping. Since the ground truth is not available at the decoding end, i.e., the synthetic image is not available, we use the decoded image as the inverse input to the proposed INN-RSIC in the inference stage. Figure 7 shows the quality metrics of the proposed model with and without QM for enhancing the decoded images of ELIC on the testing sets of DOTA and UC-M. From the results, it can be seen that the proposed model with QM presents better perceptual quality on both datasets. The reason lies in the fact that the input images fed into the proposed INN-RSIC are 3-channel 8-bit RGB images during the inference stage, and QM helps the proposed model match the data type of these input images.
4.3.2. Effectiveness of CGM
CGM aims to incorporate guidance from the synthetic images into the latent variables during the reconstruction of texture-rich images. To demonstrate the effectiveness of this approach, Figure 7 presents the performance of the proposed INN-RSIC with and without CGM on the testing sets of DOTA and UC-M. It is evident that INN-RSIC with CGM consistently outperforms the model without CGM in terms of LPIPS. This implies that encoding the latent variables as Gaussian-distributed variables conditioned on the synthetic images enhances reconstruction performance during inverse mapping. This observation is also consistent with earlier research [28].
To visualize the effectiveness of QM and CGM, Figure 8 shows the enhanced images produced by the proposed INN-RSIC with and without QM or CGM on the images "P0088" and "P0097" decoded by ELIC. The results indicate that, although the enhanced images have higher perceptual quality than the decoded images even when either QM or CGM is absent, the proposed INN-RSIC provides the best perceptual quality when both QM and CGM are used.
4.3.3. Effectiveness of Prior Distribution and Wavelet Transform
To evaluate the impact of the latent prior distribution and the wavelet transform on the proposed INN-RSIC framework, we conducted comparative experiments using different prior types and wavelet bases. Specifically, we replaced the default Gaussian prior with a Laplacian distribution and compared the commonly used Haar wavelet with the smoother Daubechies-4 (db4) wavelet. For consistency, the state-of-the-art compression model DCAE [49], with a fixed rate parameter, was used as the baseline. All comparative evaluations were conducted on the DOTA testing set.
As shown in Table 1, replacing the Gaussian prior with a Laplacian one led to a drop in PSNR and MS-SSIM and introduced more visible artifacts in the reconstructed images, as illustrated in Figure 9. This indicates that the Gaussian prior better models the residual distribution in the latent space, yielding more stable training and improved fidelity.
From a theoretical standpoint, the choice of a Gaussian prior is supported by the maximum entropy principle, which states that the Gaussian distribution maximizes entropy among all distributions with a given mean and variance. This makes it the least biased and most information-preserving assumption under limited prior knowledge. Furthermore, the smoothness and differentiability of the Gaussian log-likelihood facilitate more stable and efficient optimization in INN-based models. This is consistent with the fact that the default optimization objective in most INN-based architectures [28,29,32,45] assumes a standard Gaussian prior in the latent space.
Regarding the choice of wavelet, as also reported in Table 1, the comparison between Haar and db4 revealed only marginal differences in performance. This indicates that both wavelet bases are comparably effective at capturing structural information during invertible mapping. Nonetheless, Haar remains slightly preferable in our setting due to its computational simplicity and marginally better perceptual metrics.
In summary, the experimental results support the rationale behind our design: the Gaussian prior, grounded in the maximum entropy principle, outperforms the Laplacian prior in both stability and reconstruction quality. While db4 yields comparable results, the Haar wavelet offers a better trade-off between perceptual performance and computational simplicity. Therefore, we adopted the Gaussian prior and Haar wavelet as the default configuration in INN-RSIC.
4.4. Robustness Evaluation
4.4.1. Adaptability to Different Compression Rates
The robustness of a model is crucial for its practical deployment. We investigated the performance of INN-RSIC when trained at a specific compression rate but evaluated at different rates, aiming to examine its adaptability to varying compression conditions. Here, we employed two INN-RSIC models, trained at a low and a high value of $\lambda$, to enhance the decoded images of ELIC across various compression ratios. First, the INN-RSIC model trained at the low $\lambda$ was used to enhance images from the testing set of DOTA that were compressed and decoded using ELIC at a range of $\lambda$ values.
Figure 10a depicts the LPIPS comparison results, from which one can see that the proposed INN-RSIC trained with the corresponding $\lambda$ demonstrates the most significant improvement in perceptual quality in terms of LPIPS, except at the lowest bitrate. The poorer performance of the proposed INN-RSIC at very low bpp may be because the quality of the corresponding decoded image is poor, whereas the Gaussian-distributed variables are conditioned on the decoded image; that is, the decoded image provides only limited guidance for INN-RSIC to recover stable results.
Furthermore, we investigated the robustness of the proposed INN-RSIC model trained under high-bitrate conditions in improving decoded images across different bitrate levels. The INN-RSIC model trained at the high $\lambda$ was used to enhance images from the testing set of DOTA that were compressed and decoded using ELIC at a range of $\lambda$ values. The LPIPS comparison results are illustrated in Figure 10b. Comparing the LPIPS values between the baseline and robust cases, it can be observed that the model trained at the high $\lambda$ still improves perceptual quality across different bitrate levels, except at the lowest one.
To visualize the robust performance of the model, the robust cases in Figure 11a,b show the enhanced images obtained by augmenting the decoded images of ELIC at high and low bitrates using the INN-RSIC models trained at low and high bitrates, respectively. The baseline denotes the image decoded by ELIC. From the results, it can be observed that, although the enhanced image of the robust case presents more artifacts and burrs compared to the ground truth and the proposed INN-RSIC, it exhibits better global visual characteristics and recovers more texture information than the baseline.
In summary, the results indicate that the proposed models trained at both low and high bitrates effectively enhance the perceptual quality of decoded images across a wide range of bpp levels, highlighting the impressive robustness of the proposed INN-RSIC in enhancing decoded images.
4.4.2. Generalizability with Other Image Compression Algorithms
We further evaluate the generalizability of INN-RSIC by replacing the baseline with a classical traditional and a state-of-the-art learning-based image compression algorithm, namely JPEG2000 and DCAE [49]. Figure 12 illustrates the improved results on two testing images, "P0017" and "P0080", from the DOTA dataset.
When combined with JPEG2000, it is evident that, while color distortion is significantly corrected and edge details are improved, the enhanced images still suffer from severe artifacts. This issue may arise because JPEG2000 tends to introduce "ringing" artifacts near edges and high-contrast areas due to the truncation effect of the wavelet transform; the complex and variable nature of these artifacts makes it challenging for the INN to suppress them effectively. In contrast, when combined with DCAE, the enhanced images exhibit improved perceptual quality. It is worth noting that, in these experiments, the quantization parameter (QP) for JPEG2000 was set to 0.2, and a fixed rate setting was adopted in DCAE to generate the baseline images.
To sum up, these results demonstrate the effectiveness of the proposed method in integrating with both traditional and learning-based algorithms, further highlighting the generalizability and versatility of the developed framework.
4.4.3. Robustness Across Various Resolutions
We assessed the effectiveness of INN-RSIC on input images of varying resolutions, validating its robustness across different scales. Specifically, we evaluated the performance of the developed INN-RSIC on input images at two resolutions: one set was obtained by center-cropping two images from the DOTA testing set, while the other was randomly cropped from the DOTA testing dataset. As shown in Figure 13, INN-RSIC consistently improved performance compared to the baseline, demonstrating its effectiveness across different resolutions. These results highlight the robustness of the proposed method when handling images of varying scales.
4.5. Model Complexity
To assess the computational complexity of the proposed method, we conducted experiments on the DOTA testing set with an Intel Silver 4214R CPU running at 2.40 GHz and one NVIDIA GeForce RTX 3090 Ti GPU. The average results are given in Table 2.
From the results, it is clear that MLIC++ has 82.36 M parameters, whereas the proposed INN-RSIC comprises only 67.64% of that count. Notably, the INN-based enhancer we introduced adds only 1.25 M parameters, representing a mere 2.30% increase over the baseline ELIC, which underscores its remarkable memory efficiency and lightweight design. In terms of floating-point operations (FLOPs), HiFiC and MLIC++ bear a heavy computational load, at 148.24 G and 116.48 G FLOPs, respectively. In contrast, the proposed INN-RSIC requires only 42.88% and 54.58% of the FLOPs of HiFiC and MLIC++, respectively. This significant reduction in computational complexity is achieved while maintaining competitive performance, thanks to the efficient design of the INN-based enhancer. As for decoding time, the proposed INN-RSIC adds only 0.0155 s to the GPU decoding time compared to its baseline, ELIC. This minimal increase highlights the efficiency of the INN-based enhancement module, ensuring that the proposed method maintains fast decoding speeds while improving perceptual quality.
To sum up, the proposed INN-RSIC requires less memory space, has lower computational complexity, and enjoys outstanding decoding speed on both CPU and GPU. The lightweight and efficient design of the INN-based enhancer ensures that the overall system remains highly practical for real-world applications.
5. Discussion
During training, we utilized decoded images as labels to guide the forward training of INN-RSIC in generating synthetic images, which were then fed into the inverse network along with Gaussian-distributed latent variables for image reconstruction. However, in the inference stage, the ground truth cannot be fed into the forward network at the decoding end, i.e., the synthetic image cannot be derived, so we resorted to using the decoded images instead of the synthetic ones: the decoded images were fed into the inverse network along with Gaussian-distributed latent variables to enhance their texture. While the texture information of the enhanced images is enriched with the assistance of the Gaussian-distributed latent variables, the strictly invertible nature of the INN inevitably affects the reconstruction quality of the enhanced image when decoded images are used in place of synthetic ones.
This investigation primarily focused on the input sensitivity of the proposed INN-RSIC model. Figure 14 showcases the images restored by INN-RSIC with the synthetic images and with the decoded images as inputs during the inference stage. The results reveal that, when synthetic images are used as input, the reconstructed image effectively recovers texture-rich details. However, when decoded images are employed as input, although some details are still recovered in the enhanced image, noticeable detail distortions emerge, particularly in texture-rich regions. To explore this issue further, we analyzed the residual image between the synthetic image and the decoded image. The differences between the two inputs conflict with the strictly invertible nature of the INN, inevitably limiting the performance of the final texture recovery.
In summary, the similarity between the decoded and synthetic images significantly influences the recovery of texture-rich image details. Optimizing the algorithm design so that the resulting synthetic and decoded images are as similar as possible would greatly enhance the decoding of detailed, texture-rich RS images without requiring additional bitrates. This optimization holds the potential to improve the overall performance and fidelity of the INN-RSIC model, allowing for more accurate reconstruction of RS images.