
Abstract: Learned image compression has achieved a series of breakthroughs for natural images, but there is little literature focusing on high-resolution remote sensing image (HRRSI) datasets. This paper focuses on designing a learned lossy image compression framework for compressing HRRSIs. Considering the local and non-local redundancy contained in HRRSIs, a mixed hyperprior network is designed to explore both types of redundancy in order to improve the accuracy of entropy estimation. In detail, a transformer-based hyperprior and a CNN-based hyperprior are fused for entropy estimation. Furthermore, to reduce the mismatch between training and testing, a three-stage training strategy is introduced to refine the network. In this training strategy, the entire network is first trained, and then some sub-networks are fixed while the others are trained. To evaluate the effectiveness of the proposed compression algorithm, experiments are conducted on an HRRSI dataset. The results show that the proposed algorithm achieves comparable or better compression performance than some traditional and learned image compression algorithms, such as Joint Photographic Experts Group (JPEG) and JPEG2000. At a similar or lower bitrate, the PSNR of the proposed algorithm is about 2 dB higher than that of JPEG2000.


Introduction
Remote sensing optical cameras are one of the most important satellite platforms in many applications of Earth observation and related works [1][2][3]. With the development of imaging technology, the spatial and spectral resolution of these cameras has become higher and higher. Many satellite cameras now aim to obtain a high spatial resolution, high temporal resolution, and large area-wide coverage in remote sensing images. These platforms generate a significant amount of image data every second, resulting in a burden in terms of transmission and storage [4]. Therefore, designing an efficient remote sensing image compression algorithm is crucial in remote sensing image processing.
Remote sensing image compression is an important topic, as most remote sensing images need to be compressed for storage and transmission purposes. Remote sensing image compression can be classified into two categories: lossy remote sensing image compression and lossless remote sensing image compression. Lossless compression algorithms can reconstruct all the information from the original remote sensing images. However, due to information theory, the compression ratio of lossless image compression is limited for each remote sensing image. Generally, the compression ratio of lossless image compression can only reach 3:1 to 4:1 for most remote sensing images [5,6]. Thus, only a few applications prefer to adopt lossless image compression, such as small target detection and fine classification of hyperspectral images [7][8][9]. To achieve higher compression ratios and alleviate the challenges of storage and transmission, many image data are stored in a lossy manner, and a series of lossy image compression algorithms have been proposed. Unlike lossless remote image compression algorithms, lossy remote sensing image compression algorithms aim to neglect or drop some unimportant information to achieve higher compression ratios. Typically, lossy remote image compression can easily achieve compression ratios of 15:1 or even 100:1 when more information is dropped [10,11]. Rate distortion is commonly used to measure the compression performance of lossy image compression algorithms. The rate refers to the storage space or transmission bandwidth occupied by a remote sensing image, while distortion refers to the deviation or distortion between the original remote sensing image and the reconstructed remote sensing image. With a similar bit rate, lower distortion indicates better compression algorithm performance.
In the early years, many researchers attempted to design lossy remote sensing image compression algorithms based on standard image compression techniques, such as JPEG [12] and JPEG2000 [13]. In [14], the authors proposed a more efficient variant of the JPEG coding scheme for compressing remote sensing images obtained by optical satellite sensors. This compression scheme involves broadening cloud features to include their cloud-land transitions, which simplifies coding and subsequent compression. The authors of [15] developed a compression ratio prediction algorithm for Discrete Cosine Transform (DCT)-based coders using remote sensing images. This algorithm can also be used in other DCT- or JPEG-based image compression algorithms for remote sensing applications.
Discrete Wavelet Transform (DWT), being a transform-based technique, has been shown to achieve higher rate-distortion performance compared to DCT-based compression algorithms. As a result, many researchers have focused on creating remote sensing image compression algorithms based on DWT or JPEG2000. For instance, the authors of [16] developed a remote sensing image compression algorithm using the JPEG2000 compression standard. In their approach, they considered the insignificance of unimportant areas such as non-data areas during the compression process to improve the compression performance of multi-spectral remote sensing images.
In [17], researchers presented the criterion satisfied by an optimal transform of a JPEG2000-compatible compression scheme under the high-resolution quantization hypothesis and without the Gaussianity assumption. They also introduced two compression scheme variants and the associated criteria minimized by optimal transforms. Then, they presented two algorithms, derived from the Independent Component Analysis algorithm ICA-inf, that compute the optimal transform: one with the orthogonality constraint and one without any constraints other than invertibility. Considering the high dimension of hyperspectral remote sensing images, Ref. [18] combines JPEG2000 and Principal Components Analysis (PCA) to compress these images. PCA is used in JPEG2000 for spectral decorrelation as well as spectral dimensionality reduction. In addition, considering that the vector quantization algorithm is also efficient in some situations, Ref. [19] introduces a novel compression algorithm using vector quantization, PCA, and JPEG2000. This scheme first spectrally decorrelates by vector quantization and PCA and then applies JPEG2000 to the principal components for compression. This algorithm achieves better rate-distortion performance than the well-known PCA + JPEG2000 compression algorithm. Based on DWT, the Consultative Committee for Space Data Systems (CCSDS) also designed a series of international remote sensing image compression standards [20][21][22]. Furthermore, other researchers have also designed efficient compression algorithms based on HEVC and other compression theories, including dictionary learning, compressed sensing, and more [4,23-27].
In recent years, deep learning has achieved significant successes in various image processing [3,[28][29][30][31][32][33][34][35][36] and remote sensing applications [2,3,[37][38][39]. Consequently, many researchers have attempted to design learning-based remote sensing image compression algorithms. Compared to handcrafted transforms used in traditional image compression algorithms, learning-based algorithms can adapt to different characteristics of images [10]. In [40], researchers proposed a low-dimensional visual representation convolutional neural Remote Sens. 2023, 15, 2211 3 of 16 network for efficient remote sensing image compression. The network is used to transform coefficients in the wavelet domain from a large-scale representation to a smaller scale, obtaining an optimized wavelet representation by minimizing the distortion between the original and reconstructed wavelet representations. This algorithm applies a multibasis dictionary post-transform to the optimized wavelet representation to achieve high compression performance and computational efficiency. In [41], inspired by the symmetric structure of some traditional image compression methods, the researchers propose a new symmetrical lattice-generating adversarial network (SLGAN) for remote sensing image compression. This paper designs several pairs of symmetrical encoder-decoders to build the generator for generating deep latent representation codes and then decoding them for reconstruction. For each pair of encoded and decoded lattices, they adopt a discriminator for adversarial learning with the generator. Additionally, an enhanced Laplacian of Gaussian loss is designed as a regularizer to train the SLGAN, aiming to enhance remote sensing image edges, contours, and textures in the decoder sides. Experimental results on GF-2 data demonstrate that this algorithm achieves good compression performance. 
In [42], the researchers design a novel learned hyperspectral remote sensing image compression algorithm that incorporates a spectral attention mechanism and a novel entropy model based on Student's t-distribution. This algorithm achieves better compression performance on several hyperspectral image datasets. Considering the importance of edge information in many remote sensing applications and its potential as prior information for compression schemes, Ref. [11] introduces an edge-guided hyperspectral compression network that enables high-quality reconstruction. To extract useful edge features from learned edge features, the authors propose an interactive dual attention module, which avoids additional redundancy in edge information. In this compression scheme, the edge-guided loss and interactive dual attention module are combined to enhance the comprehensive structure of the latent representation. Moreover, the interactive dual attention makes the edge extraction network focus only on relevant boundaries, rather than all edges, resulting in savings in bit rate cost and obtaining a strong structural representation. In [43], the authors design a high-order Markov Random Field as an attention network to achieve good compression performance for high-resolution remote sensing image compression. Additionally, some researchers have also designed learned image compression methods based on attention strategies [44,45].
Many learned image compression algorithms have achieved better compression performance compared to classical image compression methods, such as JPEG and JPEG2000 [10,46,47]. However, there is limited literature that has considered both global and local redundancy in a compression scheme. For most high-resolution remote sensing images (HRRSIs), both global and local redundancies exist, which can be seen in Figure 1. Exploring both redundancies simultaneously can lead to a more accurate entropy model, resulting in improved rate-distortion compression performance. This paper proposes a new entropy model that captures redundancy in both global and local contexts simultaneously. Additionally, to reduce the gap between the training and testing processes, a refinement stage is introduced to help improve the compression performance. The main contributions of this paper are listed as follows:
1. To capture the local redundancy as well as the global redundancy, a new entropy model based on a transformer-based prior and a CNN-based prior is designed. The transformer-based prior mainly captures the global redundancy, and the CNN-based prior mainly captures the local redundancy. When fused, these two priors achieve better compression performance than a single prior.
2. Based on the transformer-based and CNN-based priors, a new compression algorithm for HRRSIs is designed. To reduce the gap between training and testing, the proposed algorithm adopts a three-stage refinement process. The refinement stages help refine the entropy network as well as the decoder network, which leads to a more accurate entropy model and better reconstructed images.
3. Experiments are conducted on an HRRSI dataset, and the results show that the proposed algorithm obtains better compression performance than JPEG, JPEG2000, and other learned image compression algorithms.
The rest of the paper is organized as follows: In Section 2, the formulation of lossy image compression is introduced. The proposed algorithm is presented in Section 3. Section 4 is the experiment and analysis. Finally, a conclusion and a discussion of this algorithm are presented in Section 5.

Formulation of Lossy Image Compression
In this section, the formulation of learned lossy image compression is introduced. Most learned lossy image compression algorithms consist of several blocks, including an encoder transform, a decoder transform, a quantizer, and an entropy model [10,46-50]. The encoder transforms the original image into a latent representation, and a quantization function is then applied to the floating-point latent representation. After obtaining the integer latent representation, an entropy model is used to encode these coefficients into a binary bit stream. On the decoder side, after entropy decoding and dequantization, the decoder transform maps the dequantized coefficients to the reconstructed image. The compression process can be written as follows:

$$
y = g_a(x), \qquad \hat{y} = Q|U(y), \qquad R = En(\hat{y}), \qquad \hat{x} = g_s(\hat{y}), \tag{1}
$$

where $g_a$ and $g_s$ represent the encoder and decoder transforms, respectively. $Q|U$ refers to the quantizer, while $En$ refers to the entropy model, which estimates the entropy of the image (computed based on the probability density model) during training; $R$ denotes the resulting rate estimate. $x$ and $\hat{x}$ represent the original and reconstructed images, respectively, while $y$ and $\hat{y}$ represent the latent and quantized latent variables, respectively.
Since uniform quantization can interrupt the gradient back-propagation during training, some researchers have introduced strategies to avoid this problem, such as adding uniform noise [46,48]. Other researchers have designed more efficient compression encoders and decoders to improve compression performance [49,50]. Entropy models are one of the most crucial parts of learned lossy image compression algorithms. More accurate entropy estimation can help reduce the bit cost in entropy coding. Thus, a series of strategies have been proposed to improve the entropy model's compression performance [10,46,47,50]. In [46], the authors designed an additional network (hyperprior network) to transmit some extra information abstracted from the latent representation. This extra information only occupies a small bit rate but can help construct a more accurate entropy model. This structure has significantly improved compression performance, and many later studies have adopted similar structures [10,42,44,47,50]. With the hyperprior information, the whole compression scheme can be written as

$$
y = g_a(x), \quad \hat{y} = Q|U(y), \quad z = h_a(y), \quad \hat{z} = Q|U(z), \quad R = En1(\hat{z}) + En2(\hat{y} \mid h_s(\hat{z})), \quad \hat{x} = g_s(\hat{y}), \tag{2}
$$

where $h_a$ and $h_s$ denote the hyperprior analysis and synthesis transforms, and $z$ and $\hat{z}$ denote the hyper-latent and the quantized hyper-latent, respectively. In Equation (2), $En1$ and $En2$ are two different models. Usually, $En1$ refers to the factorized parameter model [48], while $En2$ can be seen as a density model conditioned on some prior information, such as a single Gaussian model [46,47] or a Gaussian mixture model [50]. The parameters of these entropy models are estimated based on some prior information, such as the hyperprior information [46], local context information [47], or global reference information [51].
After obtaining the prior information, the entropy model's parameters can be learned using parameter estimation networks. If the entropy model is a Gaussian mixture model, the pixel density can be written as follows:

$$
p(\hat{y}) = \sum_{i=1}^{K} w_i \, \mathcal{N}(\hat{y}; \mu_i, \delta_i). \tag{3}
$$

In Equation (3), $\mathcal{N}(\cdot)$ denotes a single Gaussian model, $\mu_i$ and $\delta_i$ represent the mean and variance of the $i$-th Gaussian model, $K$ is the number of Gaussian components, and $w_i$ is the weight parameter, with $\sum_{i=1}^{K} w_i = 1$. The estimated density can be used to compute the entropy during training.
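As an illustration of how a Gaussian mixture density of this kind can be turned into a rate estimate, the following Python sketch evaluates the probability of one quantized latent value and the corresponding number of bits. It rests on an assumption that is not stated in the text: the probability mass of an integer symbol is taken as the mixture density integrated over the unit-width bin around it, which is a common practice in learned compression. The function names are hypothetical.

```python
import math


def gaussian_cdf(x, mean, variance):
    """Cumulative distribution function of a Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def gmm_likelihood(y_hat, weights, means, variances):
    """Probability mass of the integer symbol y_hat under a Gaussian mixture.

    Assumption (not from the paper): the mass is the mixture density
    integrated over the bin [y_hat - 0.5, y_hat + 0.5].
    """
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("mixture weights must sum to 1")
    mass = 0.0
    for w, mu, var in zip(weights, means, variances):
        upper = gaussian_cdf(y_hat + 0.5, mu, var)
        lower = gaussian_cdf(y_hat - 0.5, mu, var)
        mass += w * (upper - lower)
    return mass


def estimated_bits(y_hat, weights, means, variances):
    """Estimated code length in bits; the floor avoids log2(0) for very unlikely symbols."""
    return -math.log2(max(gmm_likelihood(y_hat, weights, means, variances), 1e-9))
```

A symbol close to a component mean receives a large probability mass and therefore a short code, which is why a more accurate mixture lowers the estimated rate.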

Motivation
Compared to normal natural images, remote sensing images typically cover a larger spatial region, which often includes similar land covers resulting in complex textures and structures. As shown in Figure 1, the left figure shows white roofs and a road with similar structures and textures, while the right figure shows similar structures and textures in the grassland, in addition to the road and edge information. These similarities contain a significant amount of non-local redundancy, which can be leveraged to construct a more accurate entropy model and achieve better rate-distortion performance.
In recent years, transformers have achieved numerous breakthroughs in various computer vision tasks due to their powerful non-local representation ability [52][53][54][55]. To capture the global redundancy contained in remote sensing images, transformer-based networks have been adopted. However, remote sensing images also contain local redundancy, and the hyperprior network introduced in [46] has been shown to be effective in capturing local redundancy. To improve entropy model estimation, new prior information is designed by fusing the hyperprior and the prior information abstracted from a transformer-based network.
The most relevant compression algorithm is [46]. The main differences between the proposed algorithm and [46] are shown in Figure 2. In the left figure, $g_a$, $g_s$, $h_a$, and $h_s$ represent the CNN-based analysis transform, synthesis transform, hyperprior analysis transform, and hyperprior synthesis transform, respectively. $Q|U$ denotes the quantizer, where $U$ adds uniform noise and $Q$ performs round quantization. Gaussian Single Model (GSM) refers to the single Gaussian model used in [46]. $x$, $\hat{x}$, $y$, $\hat{y}$, $z$, and $\hat{z}$ represent the original image, the reconstructed image, the latent, the quantized latent, the hyper-latent, and the quantized hyper-latent, respectively. The main differences between the proposed algorithm and [46] are listed below:
1. Ref. [46] only adopts a CNN network to explore the hyperprior information, but the proposed compression scheme adopts two branches to explore local and global context information. The two branches include a transformer-based network and a CNN-based network.
2. In the entropy model construction, Ref. [46] uses the GSM, while the proposed algorithm uses the Gaussian mixture model (GMM) instead.
3. Additionally, the GDN layers are replaced by transformer-based poolformer layers in the proposed algorithm.
In addition to [46], several other related works are also relevant to the proposed algorithm, including [47,48,57]. Ref. [57] introduced a novel normalization layer, GDN, to improve the compression performance. In the proposed algorithm, the GDN layers are replaced with transformer-based blocks. Ref. [48] is an early end-to-end image compression method that introduced the factorized entropy model, which has been used in many subsequent works. The coding of hyper-latents in our algorithm is also based on this model. Ref. [47] is a variation of [46] that introduces local context to improve the compression performance, but it requires an autoregressive model which can be time-consuming. In contrast, our proposed algorithm does not use any autoregressive models. Instead, a transformer-based prior is adopted to explore the non-local context information. Our experimental results demonstrate that the proposed algorithm achieves better compression performance than these related models.
The overall compression framework is shown in Figure 3. In the framework, "Conv2d, K3s2N" refers to a 2D down-sampling convolution with N filters, a kernel size of 3, and a stride of 2. Transformer-based layers are implemented using poolformer blocks [54]. The entropy model is composed of two distinct models, a CNN-based hyperprior and a transformer-based hyperprior. The context model fuses these two pieces of prior information and uses the result to estimate the parameters of the GMM.

Entropy Model
The CNN-based and transformer-based entropy models are shown in Figure 4. The left subfigure shows the details of the CNN-based hyperprior network, and the right subfigure shows the transformer-based hyperprior network. In the coding process, the latent representation is sent into both the CNN-based hyperprior network and the transformer-based hyperprior network, and the hyper-latent is coded using the factorized model based on [48]. This hyper-latent representation is then decoded into the hyperprior, and the hyperpriors are fused to estimate the entropy model parameters. In this paper, a Gaussian mixture model is adopted, and the number of components K is set to 3. With the Gaussian mixture model fixed, the same density model can be constructed in both the encoding and decoding processes, and a separate GMM is constructed for each pixel. These density models are used to compute the probability of the latent for entropy encoding and decoding.

As shown in the figure, this entropy model consists of several blocks, including the CNN-based hyperprior network, the transformer-based hyperprior network, two factorized entropy models, and a Gaussian mixture model. The two hyperprior networks abstract the hyperprior information, which is fused and then used to estimate the parameters of the GMM. Details of the factorized entropy model can be found in [48]. The use of a GMM has been shown to achieve better compression performance in much of the literature [10,50]. The proposed algorithm focuses more on the fusion of the transformer- and CNN-based hyperprior networks.
The details of the transformer block used in the transformer-based hyperprior network are shown in Figure 5. As the figure shows, this block contains poolformer [54] blocks and CBAM [58] blocks. In the proposed algorithm, the CBAM blocks are removed in the first transformer block, and two layers of CNN blocks with LeakyReLU are added instead. After the CNN-based hyperprior and the transformer-based hyperprior are obtained, they are fused and sent to the parameter estimator network to estimate the parameters of the GMM. The parameter estimator network is shown in Figure 6. In the figure, w is the weight of a single Gaussian model with mean µ and variance δ. For the proposed algorithm, K is fixed to 3 for the GMM; thus, there are three sets of w, µ, and δ.
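To make the role of the three parameter sets concrete, the sketch below maps 3K raw estimator outputs for a single latent element to valid mixture parameters. The channel ordering, the softmax used to normalize the weights, and the softplus used to keep the variances positive are all assumptions made for illustration; the text does not specify how the estimator outputs are constrained. The function name is hypothetical.

```python
import math

K = 3  # number of mixture components, as fixed in the text


def split_gmm_parameters(raw):
    """Split 3*K raw outputs for one latent element into (w, mu, delta).

    Assumed layout (hypothetical): first K values are weight logits,
    next K are means, last K are unconstrained variance values.
    """
    if len(raw) != 3 * K:
        raise ValueError("expected 3*K raw values")
    logits, means, raw_vars = raw[:K], raw[K:2 * K], raw[2 * K:]

    # Softmax (assumption) so that the weights are positive and sum to 1.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Numerically stable softplus (assumption) so that variances stay positive.
    variances = [max(v, 0.0) + math.log1p(math.exp(-abs(v))) + 1e-6 for v in raw_vars]
    return weights, list(means), variances
```

Any other mapping that yields weights summing to one and positive variances would serve the same purpose.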

Training Strategy
Entropy estimation is a crucial component of learned lossy image compression algorithms. During the compression process, an entropy model is constructed to generate a density model for each latent representation. During the training process, this density model is utilized to estimate the information entropy of the latent representation. According to Shannon's information theory, the lower bound of the bit rate is given by the information entropy; hence, lower entropy indicates a lower bit rate. After the training process, an entropy model is obtained, which generates a set of density models for encoding and decoding. During the encoding and decoding process, these density models generate the probabilities for each latent representation. These probabilities are used by the entropy coding algorithms to transform the latent representation into binary streams.
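The relation between the estimated probabilities and the bit rate can be sketched in a few lines of Python. The example assumes that each coded symbol contributes the ideal code length of minus the base-2 logarithm of its probability and that the rate is reported in bits per pixel; the function name is hypothetical.

```python
import math


def estimated_bpp(probabilities, num_pixels):
    """Estimated bits per pixel from the probabilities of all coded symbols.

    Each symbol is assumed to cost -log2(p) bits, the ideal code length
    that entropy coding approaches.
    """
    if num_pixels <= 0:
        raise ValueError("num_pixels must be positive")
    if any(p <= 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in (0, 1]")
    total_bits = -sum(math.log2(p) for p in probabilities)
    return total_bits / num_pixels
```

For instance, eight symbols that each have probability 0.5 cost eight bits in total, which corresponds to 2 bpp for an image region of four pixels.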
However, learned lossy image compression algorithms can only approximate the true density model and the true probability. Therefore, the gap between the estimated density model and the true density model can hurt the compression performance. Additionally, there is a significant gap between training and testing due to the difference in quantization. In the training process, hard quantization, such as the round function, can impede the back-propagation of gradients. To ensure gradient back-propagation, soft quantization methods, such as adding uniform noise or using stochastic rounding, are often used. In the proposed compression scheme, uniform noise is added to approximate round quantization during the training process, which enhances the compression performance. However, in the true coding process (testing), the compression scheme must ensure that the latent representation coefficients are integers, so adding uniform noise is unsuitable in this situation. Therefore, the difference in quantization can cause a gap in the entropy model, leading to sub-optimal compression performance. Furthermore, this difference can also cause a mismatch between the analysis transform and the synthesis transform, resulting in more distortion in the reconstructed images.
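The two quantization modes discussed above can be contrasted with a minimal Python sketch. It assumes that the training-time noise is drawn uniformly from [-0.5, 0.5], the usual choice for approximating rounding; the function names are hypothetical.

```python
import random


def quantize_train(latent, rng):
    """Training-time surrogate: add uniform noise (assumed range [-0.5, 0.5])."""
    return [v + rng.uniform(-0.5, 0.5) for v in latent]


def quantize_test(latent):
    """Test-time quantization: round each coefficient to an integer value.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return [float(round(v)) for v in latent]
```

The noisy values stay within half a unit of the input but are generally not integers, whereas the rounded values are. A decoder that only ever saw noisy latents during training therefore receives inputs with different statistics at test time.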
To address this issue, various strategies have been proposed by researchers, such as those introduced in [56,59,60]. In [56], the authors presented a forward vector quantization approach, in which a soft histogram was used for entropy estimation instead of a simple Gaussian model [46] or Gaussian mixture model [50]. This method has been shown to improve rate-distortion performance by reducing the gap between training and testing. The proposed algorithm adopts a strategy similar to that of [59].
This strategy involves a three-stage training process to reduce the gap between training and testing. In the first stage, all compression networks are trained to obtain a network with relatively good compression performance. In the second stage, the analysis transform network is fixed and the other networks are trained. In the third stage, the analysis transform, synthesis transform, and hyper-analysis transform networks are fixed, and the hyper-synthesis transform and parameter estimation networks are trained. During the first stage, all quantization functions are approximated by adding uniform noise. In the second stage, the latent representation coefficients are quantized using a round function. In the last stage, both the hyper-latent and the latent representation are quantized by rounding. By following this process, the proposed image compression scheme can improve the compression performance.
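The three-stage schedule can be summarized as a small configuration table. The Python sketch below is only an illustration: the module names are hypothetical labels for the analysis, synthesis, hyper-analysis, and hyper-synthesis transforms and the parameter estimation network, and the hyper-latent quantizer in the second stage is inferred to remain noise-based, since the text only states that the latent is rounded in that stage.

```python
# Hypothetical module labels for the sub-networks named in the text.
MODULES = ("g_a", "g_s", "h_a", "h_s", "param_estimator")

STAGES = [
    # Stage 1: everything is trained, all quantizers use additive uniform noise.
    {"name": "stage1", "frozen": set(), "latent_q": "noise", "hyper_q": "noise"},
    # Stage 2: analysis transform fixed, latent rounded
    # (hyper-latent quantizer is an inference, not stated in the text).
    {"name": "stage2", "frozen": {"g_a"}, "latent_q": "round", "hyper_q": "noise"},
    # Stage 3: only h_s and the parameter estimator are trained, both rounded.
    {"name": "stage3", "frozen": {"g_a", "g_s", "h_a"}, "latent_q": "round", "hyper_q": "round"},
]


def trainable_modules(stage):
    """Return the modules whose weights are updated in the given stage."""
    return [m for m in MODULES if m not in stage["frozen"]]
```

Freezing the analysis transform first keeps the latents fixed, so the later stages only adapt the decoder side and the entropy model to the rounded values that are actually coded at test time.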

Dataset and Training Setting
To evaluate the effectiveness of the proposed algorithm, experiments were conducted on the HRRSI dataset Aerial Imagery for Roof Segmentation (AIRS). The dataset comprises 857 images for training, 94 images for validation, and 96 images for testing, with a pixel resolution of 0.072 m and each image having a size of 10,000 × 10,000 × 3. During training, the training dataset was first cropped into 1024 × 1024 non-overlapping patches, resulting in approximately 65,000 images. These were then randomly cropped into 256 × 256 image patches during the training process. The validation dataset was split into non-overlapping patches of 2048 × 2048 in size, resulting in 1438 images for testing. The adaptive moment estimation (Adam) optimizer [61] was used with a learning rate of $2 \times 10^{-4}$ and a batch size of 20. The network was trained for 100 epochs, with the learning rate decayed by a factor of 0.75 every 20 epochs. In the last two refinement stages, the analysis transform was fixed, and the other networks were trained with a learning rate of $10^{-5}$ and 8 images per batch, with each stage lasting 20 epochs. The rate-distortion trade-off was controlled by the loss rate + λ × distortion, where the rate was estimated using the entropy model and MSE was used to measure distortion. The values of λ in this compression scheme were set to [0.0018, 0.0035, 0.0067, 0.013, 0.025, 0.0483, 0.0932]. The number of channels of all models was set according to [46].
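The training objective and the learning-rate schedule described above can be written down directly. The sketch below assumes that the decay is a step schedule, that is, the learning rate is multiplied by 0.75 once every 20 epochs, and that the rate is expressed in bits per pixel; both function names are hypothetical.

```python
def rd_loss(rate_bpp, mse, lam):
    """Rate-distortion objective: rate + lambda * distortion (MSE)."""
    return rate_bpp + lam * mse


def learning_rate(epoch, base_lr=2e-4, decay=0.75, step=20):
    """Step schedule (assumed interpretation): multiply by `decay` every `step` epochs."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * decay ** (epoch // step)
```

A larger λ puts more weight on distortion, so the seven λ values listed in the text correspond to seven operating points from low to high bitrate.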

Evaluation Metrics
As mentioned above, the rate-distortion performance is commonly used to evaluate the compression performance of lossy image compression algorithms. In this paper, due to the large number of test images (1438), the estimated information entropy is used to estimate the bitrate of compressed images for all learned image compression algorithms. For distortion, the Peak Signal-to-Noise Ratio (PSNR) [62] and Multi-Scale Structural Similarity (MSSSIM) [63] indexes are adopted as metrics. PSNR represents the ratio of the peak signal energy to the noise energy and measures how close the decompressed (reconstructed) image is to the original image. The PSNR is formulated as follows:

$$
\mathrm{PSNR} = 10 \log_{10}\left(\frac{x_{\max}^{2}}{\mathrm{MSE}}\right),
$$

where $x_{\max}$ represents the maximum pixel value of the image bands, and MSE refers to the Mean Square Error between the original and reconstructed images. For HRRSIs with an 8-bit depth, the value of $x_{\max}$ is 255.
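A minimal Python sketch of the PSNR computation follows. It uses the standard definition, ten times the base-10 logarithm of the squared peak value divided by the MSE, and treats images as flat sequences of pixel values for simplicity; the function names are hypothetical.

```python
import math


def mean_square_error(original, reconstructed):
    """Mean square error between two equally sized sequences of pixel values."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, x_max=255.0):
    """PSNR in dB; returns infinity when the two images are identical."""
    mse = mean_square_error(original, reconstructed)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(x_max ** 2 / mse)
```

With 8-bit data, an MSE of 1 corresponds to roughly 48.13 dB, and halving the MSE raises the PSNR by about 3 dB.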
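MSSSIM is based on Structural Similarity (SSIM) but is formulated in a multi-scale manner as follows:

$$
\mathrm{MSSSIM}(x, y) = [l_M(x, y)]^{\alpha_M} \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j} [s_j(x, y)]^{\eta_j},
$$

where $c_j(x, y)$ and $s_j(x, y)$ denote the contrast comparison and the structure comparison for the $j$-th scale image. The luminance comparison $[l_M(x, y)]^{\alpha_M}$ is computed only at the final scale $M$. The parameters $\alpha_j$, $\beta_j$, and $\eta_j$ define the relative importance of the three components, and, for simplicity, we set $\alpha_j = \beta_j = \eta_j$ and $\sum_{j=1}^{M} \eta_j = 1$. These three comparison values are calculated as follows:

$$
l(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \qquad
c(x, y) = \frac{2\delta_x \delta_y + c_2}{\delta_x^2 + \delta_y^2 + c_2}, \qquad
s(x, y) = \frac{\delta_{xy} + c_3}{\delta_x \delta_y + c_3},
$$

where $\mu_x$ and $\delta_x^2$ are the mean and variance of the original image $x$, $\mu_y$ and $\delta_y^2$ are the mean and variance of the reconstructed image $y$, and $\delta_{xy}$ is the covariance between $x$ and $y$. $c_1$, $c_2$, and $c_3$ are three small constants, and the details can be found in [63].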
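The multi-scale combination can be illustrated with a short Python sketch that takes already computed per-scale comparison values and combines them. It assumes the standard MS-SSIM form, in which the luminance term of the final scale is multiplied by the contrast and structure terms of every scale, each raised to a shared per-scale exponent as described in the text; computing the per-scale values themselves (filtering and down-sampling) is omitted, and the function name is hypothetical.

```python
def ms_ssim_from_components(luminance_final, contrast, structure, exponents):
    """Combine per-scale comparison values into a single MS-SSIM score.

    contrast[j], structure[j], and exponents[j] refer to scale j + 1;
    luminance_final is the luminance comparison at the last scale M.
    All comparison values are assumed to be positive.
    """
    if not (len(contrast) == len(structure) == len(exponents)) or not exponents:
        raise ValueError("per-scale lists must be non-empty and of equal length")
    score = luminance_final ** exponents[-1]
    for c, s, e in zip(contrast, structure, exponents):
        score *= (c ** e) * (s ** e)
    return score
```

When every comparison value equals 1, the score is 1, which corresponds to a reconstruction identical to the original.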

Comparison Algorithms
The proposed algorithm, along with several traditional lossy image compression algorithms and learned image compression algorithms, is evaluated using rate-distortion performance. The traditional lossy image compression algorithms considered in this study include JPEG [12] and JPEG2000 [13]. JPEG and JPEG2000 are two widely used international compression standard algorithms based on DCT and DWT, respectively.
The learned lossy image compression algorithms include three closely related techniques: factorized [48], hyperprior [46], and joint [47]. The factorized technique [48] was the first learned lossy image compression algorithm to achieve compression performance comparable to JPEG. The hyperprior technique [46] was the first work to adopt a hyperprior, which has since been adopted by many learned lossy image compression schemes, including the proposed framework, which uses a structure similar to that of hyperprior [46]. The joint technique [47] adopts local context to significantly improve the compression performance, but the drawback is that decoding must be performed pixel by pixel. For JPEG and JPEG2000, the "imwrite" function of Matlab is used with default settings, except for the different quality factors used to control the rate-distortion performance.

Experimental Result and Analysis
The compression performance of the proposed algorithm and several compared algorithms is shown in Figures 7 and 8. Figure 7 presents the rate-distortion performance based on bpp (bits per pixel) and PSNR. As shown in the figure, JPEG [12] achieves worse compression results than the other algorithms, while JPEG2000 [13] achieves better PSNR values than factorized [48] at most bitrates, especially at larger bitrates. Factorized [48] is one of the earliest learned compression algorithms that achieved better compression performance than JPEG [12]. Compared with factorized, the hyperprior algorithm [46] adopts prior information to help construct a more accurate entropy model, resulting in better PSNR values. The performance of hyperprior [46] is greatly improved compared to the factorized and JPEG2000 algorithms. Joint [47] is based on hyperprior, and it introduces autoregressive-based local context for entropy estimation. As shown in the figures, at lower bitrates, the compression performance of joint [47] is much better than that of hyperprior [46]; however, at higher bitrates, the compression performance of hyperprior [46] is slightly better than that of joint [47]. Compared with the other algorithms, the proposed algorithm achieves better PSNR values, which means better compression performance. Figure 8 shows the compression performance based on bpp and MSSSIM. The results are similar to those in Figure 7. At all bitrates, the proposed algorithm obtains the best compression performance. In addition, joint [47] achieves better MSSSIM values than hyperprior [46]. Interestingly, JPEG2000 performs better than factorized [48] at a bitrate of about 0.42 bpp, but factorized is better than JPEG2000 at other bitrates. This also means that better PSNR does not necessarily mean better MSSSIM. Moreover, the proposed algorithm does not adopt autoregressive-based local context, which is meaningful in applications that require fast compression and decompression.
To further demonstrate the effectiveness of our algorithm, we have compared the visual quality of decompressed images with other popular image compression techniques, such as JPEG [12] and JPEG2000 [13]. The results are presented in Figures 9 and 10. In these figures, the upper right corner depicts the original image, and the lower left corner shows the decompressed image obtained by our algorithm. The other two panels show the results of JPEG and JPEG2000. As shown in Figure 9, the JPEG algorithm produces a clear checkerboard effect on the edge of the roof. In contrast, the JPEG2000 algorithm results in some texture distortion in the edge area. The reconstruction of our algorithm is visually almost indistinguishable from the original image. Objective evaluation indicators of the three algorithms also reveal that our algorithm produces higher PSNR and MSSSIM values at smaller bit rates. The results presented in Figure 10 are similar to those shown in Figure 9. On the white roof, the reconstructed image of JPEG has noticeable artifacts, and the JPEG2000 reconstruction result shows visible noise differences. In contrast, the reconstructed image of our algorithm exhibits better quality after decompression. Objective evaluation indicators also confirm that the performance of our algorithm surpasses that of the other algorithms.
Figure 7. The rate-distortion compression performance of the proposed algorithm and comparison algorithms. The rate is measured by bits per pixel (bpp) and distortion is measured by PSNR.

Conclusions
This paper presents a transformer-based learned lossy image compression algorithm for HRRSIs. The proposed algorithm adopts a transformer-based hyperprior to explore non-local redundancy and a CNN-based hyperprior to explore local redundancy of HRRSIs. By fusing these two types of prior information, the algorithm obtains a more accurate entropy model, resulting in lower information entropy and better compression performance. The results show that the proposed algorithm outperforms the traditional algorithms and some other learned lossy image compression methods. Moreover, the proposed algorithm does not use any autoregressive networks to explore local context, making it suitable for applications that require parallel codecs. Although the proposed algorithm has achieved better compression performance, there is still room for improvement in the backbone network design and the use of fused prior information. In [10], the authors adopt a coarse-to-fine network to obtain good compression performance and use prior information to refine the quality of reconstructed images, achieving even better compression performance. In the future, we may focus on designing better analysis and synthesis transforms as well as using prior information to further improve the performance of HRRSI compression.

Conflicts of Interest:
The authors declare no conflict of interest.