ESTUGAN: Enhanced Swin Transformer with U-Net Discriminator for Remote Sensing Image Super-Resolution

Abstract: Remote sensing image super-resolution (SR) is a practical research topic with broad applications. However, the mainstream algorithms for this task suffer from limitations. CNN-based algorithms face difficulties in modeling long-term dependencies, while generative adversarial networks (GANs) are prone to producing artifacts, making it difficult to reconstruct high-quality, detailed images. To address these challenges, we propose ESTUGAN for remote sensing image SR. On the one hand, ESTUGAN adopts the Swin Transformer as the network backbone and upgrades it to fully mobilize input information for global interaction, achieving impressive performance with fewer parameters. On the other hand, we employ a U-Net discriminator with the region-aware learning strategy for assisted supervision. The U-shaped design enables us to obtain structural information at each hierarchy and provides dense pixel-by-pixel feedback on the predicted images. Combined with the region-aware learning strategy, our U-Net discriminator can perform adversarial learning only for texture-rich regions, effectively suppressing artifacts. To achieve flexible supervision for the estimation, we employ the Best-buddy loss. We also add the Back-projection loss as a constraint for the faithful reconstruction of the high-resolution image distribution. Extensive experiments demonstrate the superior perceptual quality and reliability of our proposed ESTUGAN in reconstructing remote sensing images.


Introduction
The rapid development of modern aerospace technology has brought remote sensing imagery into wider use. Remote sensing images are essential for applications such as target detection and tracking. However, obtaining high-resolution (HR) remote sensing images can be challenging due to technical limitations and cost constraints. Image super-resolution (SR) is a promising alternative and has become an active research topic of significant practical importance. In recent years, deep learning-based methods for single image super-resolution (SISR) have made remarkable achievements. Since the proposal of SRCNN [1] by Dong et al. in 2014, CNN-based methods have significantly advanced the field of SR. Scholars have continuously improved network architectures and proposed elaborate structures [2][3][4], such as residual learning, dense connectivity, and the Laplacian pyramid. RCAN [5] reached another pinnacle of peak signal-to-noise ratio (PSNR) by adding a channel attention module to the CNN-based architecture. However, CNN-based methods face an unavoidable obstacle when it comes to SR.
Due to the design of the convolutional layer, convolution kernels interact with the image in a content-independent manner. It is illogical to use the same convolutional kernel to reconstruct different areas of the image. The transformer architecture [6][7][8][9] stands out in this case, employing the self-attention mechanism for global interaction and achieving significant performance in several visual tasks. However, due to the quadratic complexity of processing images, transformer-based models tend to generate a large number of parameters and are computationally intensive. The Swin Transformer [10] was created to combine the advantages of transformer- and CNN-based models, not only establishing long-term dependencies within images, but also processing large images through a local attention mechanism. SwinIR [11] was the first to apply the Swin Transformer to SISR; it achieves optimal PSNR with fewer parameters and shows enormous promise. HAT [12] activates more input signals by adding the channel attention mechanism to the Swin Transformer layer and proposes an overlapping cross-window attention mechanism to optimize cross-window information interaction.
While the methods mentioned above achieve high PSNR scores, they can produce ambiguous results. This is because they often use MSE or MAE for the one-to-one supervision of a single low-resolution (LR) image by a single high-resolution (HR) image, which can lead to pixel averaging and overly smooth, blurred outcomes. Remote sensing images are mainly used in fields such as object detection and geologic analysis, and we believe that the over-smoothed and blurred results generated by these networks negatively affect some of these categories. To obtain more realistic images, researchers have employed Generative Adversarial Networks (GANs) to recover images with rich texture details [13][14][15][16]. Although these methods have made considerable progress, further research is necessary due to their difficulty in training and tendency to produce artifacts. An alternative approach, proposed by [17], is the Best-buddy loss, which breaks the strict mapping between LR and HR imposed by MSE or MAE. It allows multiple patches close to the ground truth to supervise the SR output, reducing the difficulty of network training while improving the perceptual quality of reconstructed images.
The learning-based approaches mentioned above offer a new development direction for the remote sensing image SR task.
LGCNet [18] is the first CNN-based SR model for remote sensing images that outperforms traditional methods and verifies the effectiveness of deep learning methods. Jiang et al. [19] propose an edge enhancement network based on a GAN that enhances edges by learning noise masks. Some algorithms [20][21][22][23][24][25][26] have achieved considerable success by adding elaborate structural designs or various attention mechanisms to CNNs. Currently, learning-based methods in remote sensing image SR are developing rapidly and have achieved remarkable progress, but the challenges are still significant.
The selection of a reconstruction network better suited to the characteristics of remote sensing images is a challenging problem, because remote sensing images are characterized by a large spatial span, complex texture structure, and few pixels per object, which undoubtedly pose further difficulties for reconstruction tasks [27]. To faithfully restore high-resolution images, we adopt the Swin Transformer as the backbone, which can model long-term dependencies with shifted windows and exploit the internal self-similarity within remote sensing images. Specifically, we adopt the Residual Hybrid Attention Group (RHAG) proposed by HAT [12] and refine its network design to obtain significant performance with fewer parameters; the result is named the Enhanced Swin Transformer Network (ESTN).
However, simply utilizing a more powerful reconstruction network will not fully achieve satisfactory results in the remote sensing image SR task. This is because objects in remote sensing images cover few pixels; a ship may be represented by only a handful of pixels. PSNR-oriented methods are prone to producing blurred results, while GANs offer a decent solution. In addition, remote sensing images contain more diverse texture features, and different regions exhibit distinct texture differences [27]. We find that regions with different texture complexity in remote sensing images should not adopt the same supervision strategy. Adversarial learning should be performed for texture-rich regions to facilitate the reconstruction of fine details. For smooth regions, however, a PSNR-oriented method is sufficient to recover satisfactory results; feeding such regions into the discriminator may instead lead to unpleasant artifacts. Existing methods do not take this concern into account. To resolve the above problems, we propose the U-Net discriminator with the region-aware learning strategy. On the one hand, the U-shaped network design allows the discriminator to fully integrate structural information at each hierarchy level and finally obtain pixel-by-pixel feedback. On the other hand, it can divide areas according to texture complexity, so that only the detailed regions are fed into the discriminator, forcing the discriminator to focus on distinguishing complex areas and greatly suppressing artifacts. Accordingly, our discriminator can effectively assist the ESTN in predicting realistic and highly detailed images.
To further improve the perceptual quality, we also introduce the Best-buddy (BB) loss [17] and Back-projection (BP) loss to break the rigid mapping from the LR space to the HR space.This reduces the training difficulty and contributes to the recovery of realistic texture details.
Overall, the main contributions of our work are as follows:

Related Works
The following subsections review prior work related to our proposed ESTUGAN:

Swin Transformer
The Swin Transformer [10] is a universal backbone for vision tasks and one of the first hierarchical vision transformers. Due to its excellent performance and amenability to parallelization, it has become the state-of-the-art technology for various vision tasks such as target detection and image segmentation. The core idea of the Swin Transformer is to compute self-attention within non-overlapping shifted windows, which makes the model's computation linear in the feature map resolution and greatly compresses the cost of self-attention. SwinIR introduced the Swin Transformer to image SR for the first time, further refreshing the state of the art in SR tasks. However, there is still substantial room for improvement: the window attention mechanism [28][29][30] has limitations, and both the exchange of information across windows and the mobilization of shallow information require further optimization.

Generative Adversarial Network
Nowadays, GANs have been widely explored and have achieved remarkable results in various image processing domains, such as style transfer, super-resolution, image inpainting, and denoising [31][32][33]. This approach is mainly inspired by the idea of competition in game theory, applied to deep learning by constructing two models: a generative network G (generator) and a discriminator network D (discriminator). The two models are then continuously trained against each other so that G generates realistic images while D acquires a powerful ability to judge image authenticity. To reconstruct images with high perceptual quality, SRGAN introduces a discriminator that guides the generator to recover fine texture information via an adversarial loss. ESRGAN [15] proposes Residual-in-Residual Dense Blocks (RRDB) to build the network and invokes the relativistic GAN to make the discriminator predict relative realness, winning first prize in the PIRM 2018-SR Challenge. These approaches have been widely adopted as the mainstream of perception-oriented image SR algorithms.

Loss Function on Deep Learning
SISR is inherently an ill-posed problem, where an LR image often corresponds to multiple HR images. Properly guiding the model to find the region of the latent space closest to the real HR image is the key to the SR problem, so a suitable loss function becomes particularly relevant. In existing studies, most algorithms adopt an MAE/MSE loss to make the SR image approximate the ground truth pixel by pixel. This pixel-level loss benefits the PSNR but is detrimental to the reconstruction of texture details [34]. To solve this problem, the perceptual loss [35] was proposed to compute the similarity of deep features and enhance perceptual quality. Fuoli et al. [36] propose a Fourier spatial loss to facilitate the recovery of lost high-frequency information. Benefiting from perceptual and adversarial losses, SRGAN [13,14] and ESRGAN [15] recover photo-realistic results, but they risk producing annoying artifacts. Liang et al. introduce the Local Discriminant Learning (LDL) strategy [37], which explicitly penalizes artifacts without sacrificing real details, partly alleviating the artifact problem. Li et al. suggest the Best-buddy loss [17] to address the above problems: the estimated patches dynamically seek optimal supervision during training, contributing to the production of more reasonable details.

Deep Learning Based SISR for Remote Sensing Images
In recent years, deep learning-based SISR has become mainstream due to the powerful feature extraction capabilities of deep neural networks. These approaches have also driven the development and advancement of remote sensing image SR algorithms. CNN-based SISR was widely adopted by scholars in the early days; they retrained the networks on remote sensing images and designed elaborate architectures for feature extraction.
LGCNet [18] learns hierarchical representations of remote sensing images by constructing a "multifork" structure. DDRN [38] proposes ultra-dense residual blocks to construct a simple but effective recursive network. Similarly, many refined structural designs have been applied with impressive results. However, the convolutional kernel interacts with the image in a content-independent manner, which limits the reconstruction of texture details. Some works enhance the expressive power of the model by adding various attention mechanisms, such as MHAN [39] and SMSR [40], but these approaches tend to be computationally intensive and still struggle with long-term dependency modeling. In addition, the above methods adopt a learning strategy that maximizes the PSNR and encourages the model to find the pixel mean, leading to blurred results. On this topic, several related works have made promising progress. On the one hand, adversarial learning strategies have been employed, as in SRGAN and ESRGAN, to reconstruct photo-realistic images; MA-GAN [27] and SRAGAN [41] combine a GAN with attention mechanisms to upgrade the visual quality of remote sensing images. On the other hand, some loss functions [35][36][37] have been proposed to motivate the generation of high-frequency content. However, these solutions are still imperfect: problems remain, such as the difficulty of GAN training and the potential for artifacts. Our work is based on a GAN, employing the Swin Transformer as the generator for long-term dependency modeling, and a U-Net discriminator with the region-aware strategy to facilitate high-frequency detail generation while suppressing artifacts to a certain extent.

Image Super Resolution Quality Assessment
SR image quality assessment is an effective way to evaluate and compare SR methods, and it is an important guide for model optimization and parameter selection. Subjective human assessment is highly reliable, but it tends to be time-consuming and laborious. The PSNR [42] is the most popular metric for assessing reconstruction performance, calculating only the purely mathematical difference between pixels. Wang et al. [43] simulate the human visual system and propose an evaluation scheme based on structural similarity. However, these two options sometimes deviate from the human eye's perceived quality, leading to ambiguous predictions. To maintain better consistency with subjective quality evaluations, Zhang et al. [44] compare the feature similarity between images to estimate the distance from the prediction to the ground truth. The SFSN model [45] aims to find a balance between structural fidelity and statistical naturalness, and SRIF [46] merges deterministic fidelity and statistical fidelity into a single prediction. Thanks to the development of deep learning, Ref. [47] extracts deep features to compute the Learned Perceptual Image Patch Similarity (LPIPS) between two images, which aligns more closely with human perception. DeepSRQ [48], with deep two-stream convolutional networks, provides a satisfactory solution to the no-reference evaluation problem.

Methods
In this section, we first present a brief overview of the workflow of our algorithm, and then give detailed descriptions of the generator, the U-Net discriminator with the region-aware learning strategy, and the loss functions employed by ESTUGAN, respectively.

Overview of ESTUGAN
To recover images with superior perceptual quality, we designed ESTUGAN based on a GAN, consisting of the ESTN as the generator and the U-Net discriminator. The principal framework is shown in Figure 1. Given an LR image I_LR ∈ R^(H×W×C), an SR image I_SR ∈ R^(rH×rW×C) (r is the scale factor) is obtained by the generator as I_SR = G(I_LR), where G(·) denotes the generator. Subsequently, unlike the approach of [14], which feeds I_SR directly to the discriminator, in our approach I_SR is sent to the region-aware adversarial learning stage, where, after regional division processing, only regions with rich texture details are fed to the U-Net discriminator for authenticity judgments. Finally, the discriminator outputs the realness probability map and feeds it back to the generator, prompting the generation of rich, realistic details. In a GAN, the generator is urged to deceive the discriminator by creating realistic fake HR images, while the discriminator is trained to discriminate authenticity; the two compete against each other so that the SR image distribution gradually approximates the real image distribution.

The Architecture of the Generator
As shown in Figure 2, we keep the high-performance architecture design of SwinIR [11], and the whole generator is composed of three modules: shallow feature extraction, deep feature extraction, and image reconstruction.
In the shallow feature extraction module, we employ a single convolutional layer to map the input image to a high-dimensional space. This helps the visual representation to be learned better and optimized stably. The extracted shallow features can be expressed as F_SF = H_SF(I_LR), where H_SF(·) denotes the shallow feature extraction.

In the deep feature extraction module, we adopt a new basic block inspired by HAT [12], the Residual Hybrid Attention Group, which we rename the Enhanced Swin Transformer Block (ESTB) for convenience of description; its architecture is shown in Figure 2a. It integrates the channel attention mechanism and the overlapping cross-attention block (OCAB), which achieves effective aggregation of cross-window information. In addition, we insert a second residual connection after the convolution kernel behind the fourth ESTB. Although residual blocks [49] can increase the receptive field, we find that in low-level reconstruction tasks such as image SR, excessively long residual connections can instead weaken the quality of the reconstructed images, because overly abstract high-dimensional features make network learning more difficult and degrade the performance of the generation network [50].
To further demonstrate the effect of the number of residual blocks and connection dimensions on network performance, we set up three different networks in the ablation study section to demonstrate the superior performance of our network. The process can be formulated as follows:

F_1 = H_DF^4(F_SF),  (3)

F_DF = H_DF^2(F_1),  (4)

where H_DF^i(·) denotes a deep feature extraction module containing i ESTB blocks and a 3 × 3 convolutional layer; i is set to 4 in Equation (3) and to 2 in Equation (4).
In the image reconstruction module, we use skip connections to aggregate deep and shallow features and reconstruct high-resolution images with the pixel-shuffle method [51]. It can be expressed as I_SR = H_Rec(F_SF + F_DF), where H_Rec(·) denotes the reconstruction module.
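The pixel-shuffle step [51] rearranges channels into spatial resolution. Below is a minimal numpy sketch of that rearrangement (matching the channel-to-space ordering used by common deep learning frameworks); it is illustrative only and not the paper's implementation:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into a (C, H*r, W*r) image,
    as done by the pixel-shuffle upsampling step."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)
```

Each group of r² channels at one spatial location becomes an r × r block of output pixels, so a 60-channel feature map can be turned into a 4×-larger image without any transposed convolution.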

U-Net Discriminator with Region-Aware Learning Strategy
As for the discriminator, inspired by [52,53], we adopt the U-Net discriminator, which essentially consists of a connected encoder and decoder, as shown in Figure 3. The encoder continuously downsamples I_SR to capture global information and ultimately judges the overall realism of the image, while the decoder, dedicated to judging the authenticity of local information, performs progressive upsampling to output per-pixel realism at the same resolution as I_SR. In addition, skip connections facilitate information exchange between the two sub-networks, further promoting detail recovery. Such a structural design forces the discriminator to focus on the structural and semantic differences between fake and genuine samples, pursuing accuracy in both the global context and the local details of the reconstruction. To address the artifacts of GAN-based methods [17], the region-aware strategy is incorporated into the U-Net discriminator, as shown in Figure 1. The smooth regions and texture-rich regions of I_SR are separated according to the statistical local pixel distribution of I_HR, and only texture-rich regions are fed into the U-Net discriminator for adversarial learning. This not only avoids generating artifacts in smooth areas, but also lets the discriminator focus on regions where fine, realistic details need to be recovered, assisting the reconstruction of perceptually realistic images. Regarding the specific region-aware learning strategy, we first perform the unfold operation with kernel size k on I_HR to obtain rH × rW patches Q_(i,j) of size k². The standard deviation std(Q_(i,j)) is then calculated for each patch, and the final binary feature map M_(i,j) is obtained by comparison with the pre-set threshold θ:

M_(i,j) = 1 if std(Q_(i,j)) > θ, and M_(i,j) = 0 otherwise,

where i and j denote the specific location of a patch; pixel values are set to 0 for flat regions and 1 for texture-rich regions in the map. Finally, I_SR_mask is obtained by multiplying M_(i,j) with I_SR.
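The unfold-std-threshold procedure above can be sketched in a few lines of numpy. This is a simplified single-channel version; the reflect padding at the borders is an assumption, since the text does not specify boundary handling:

```python
import numpy as np

def region_aware_mask(hr, k=11, theta=0.025):
    """Binary map M: 1 where the local k x k standard deviation of a
    single-channel, [0, 1]-valued HR image exceeds theta (texture-rich),
    0 in flat regions. Border handling via reflect padding is assumed."""
    pad = k // 2
    padded = np.pad(hr, pad, mode='reflect')
    h, w = hr.shape
    mask = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            if padded[i:i + k, j:j + k].std() > theta:
                mask[i, j] = 1.0
    return mask
```

The masked image fed to the discriminator is then simply `mask * sr`, so smooth regions contribute nothing to the adversarial signal.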
In addition, we also introduce the spectral normalization regularization [54] to further secure the stability of training and suppress artifacts.
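Spectral normalization [54] constrains each weight matrix to have a spectral norm of about one, which bounds the Lipschitz constant of the discriminator. The numpy sketch below illustrates the core idea with power iteration; the real method keeps a running singular-vector estimate across training steps rather than iterating to convergence each time:

```python
import numpy as np

def spectral_normalize(W, n_iters=50):
    """Rescale a weight matrix by its largest singular value (estimated
    via power iteration) so that its spectral norm is ~1. Illustrative
    sketch only; n_iters and the random start vector are assumptions."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma
```

Keeping the discriminator's layers 1-Lipschitz in this way is what stabilizes the adversarial training and helps suppress artifacts.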

Best-Buddy Loss
Since a single LR image can correspond to multiple HR images, SISR is intrinsically an ill-posed problem. For a given HR-LR pair, the commonly adopted MSE/MAE loss performs a one-to-one rigid mapping, as shown in the blue diagram of Figure 4. This overlooks the intrinsic uncertainty of SISR, resulting in reconstructed images lacking high-frequency information. To overcome the limitation of supervising I_SR with a single I_HR, we refer to [55][56][57][58][59] and adopt the BB loss. It allows diverse supervised patches p_hr*^i to positively steer the predicted patches p_sr, achieving multiplicity of supervision, as shown in the yellow diagram of Figure 4. The patch p_hr*^i should be as close as possible to both the predicted patch p_sr^i and the patch p_hr of I_HR, which can be expressed as

p_hr*^i = argmin_(p ∈ B) (‖p − p_hr‖_2 + ‖p − p_sr^i‖_2),

where ‖·‖_2 denotes the L2 loss, B denotes the supervised candidate database [17] of this image, obtained from a three-level image pyramid built by bicubic downsampling, and i denotes the iteration number. The BB loss of this patch can then be expressed as

L_BB = ‖p_hr*^i − p_sr^i‖_1,

where ‖·‖_1 denotes the L1 loss.
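The best-buddy selection for one patch can be sketched as a small numpy routine. Equal weighting of the two L2 terms is an assumption here; the candidate database `candidates` stands in for the pyramid-derived set B of [17]:

```python
import numpy as np

def best_buddy_loss(p_sr, p_hr, candidates):
    """For one predicted patch, pick the candidate patch closest (in L2)
    to both the ground-truth patch and the prediction, then return the
    L1 loss against that 'best buddy'. Equal term weights are assumed."""
    scores = [np.sum((c - p_hr) ** 2) + np.sum((c - p_sr) ** 2)
              for c in candidates]
    buddy = candidates[int(np.argmin(scores))]
    return float(np.mean(np.abs(buddy - p_sr)))
```

Because the buddy is re-selected every iteration, the supervision target can move as the prediction improves, which is what relaxes the rigid one-to-one mapping.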

Adversarial Loss
Adversarial loss is employed to facilitate perceptually realistic image generation. The adversarial losses of the discriminator and the generator are, respectively,

L_adv_D = L_BCE(D(I_HR), U_real) + L_BCE(D(I_SR), U_fake),  (10)

L_adv_G = L_BCE(D(I_SR), U_real),

where L_BCE(·) denotes the binary cross-entropy loss, D(·) denotes the output of the discriminator, which is a tensor of shape rH × rW × 1, and U_real and U_fake are tensors of the same shape as D(·), with all values of U_real set to 1 (real labels) and all values of U_fake set to 0 (fake labels).
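Since the U-Net discriminator outputs an rH × rW × 1 realness map rather than a single score, the BCE losses above are averaged over every pixel. A minimal numpy sketch, assuming `d_real` and `d_fake` are already sigmoid-activated probability maps:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # binary cross-entropy averaged over the rH x rW x 1 probability map
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def adversarial_losses(d_real, d_fake):
    """d_real = D(I_HR), d_fake = D(I_SR): per-pixel realness maps.
    Discriminator: real -> U_real (ones), fake -> U_fake (zeros);
    generator: fake -> ones (it tries to fool the discriminator)."""
    u_real, u_fake = np.ones_like(d_real), np.zeros_like(d_fake)
    loss_d = bce(d_real, u_real) + bce(d_fake, u_fake)
    loss_g = bce(d_fake, u_real)
    return loss_d, loss_g
```

When the discriminator confidently separates real from fake, its own loss is small while the generator's loss is large, which is exactly the gradient signal that pushes the generator toward more realistic detail.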


Perceptual Loss
The perceptual loss is calculated using three layers, conv3-4, conv4-4, and conv5-4, of the feature maps of the pre-trained VGG19 network, which can be expressed as

L_percep = Σ_(l ∈ {3,4,5}) ω_l ‖φ_(l-4)(I_SR) − φ_(l-4)(I_HR)‖_1,  (11)

where ω_l denotes the weight of each layer, with ω_3 = 1/8, ω_4 = 1/4, and ω_5 = 1/2, respectively.

Figure 4. Our Best-buddy (BB) loss combined with Back-projection (BP) loss for supervision, compared with MSE/MAE loss. The blue plot represents the MAE/MSE loss, and the yellow plot represents the BB and BP losses we adopted. p_lr^i, p_hr^i, p_sr^i, and p_hr*^i denote the LR patch, HR patch (ground truth), predicted HR patch, and Best-buddy HR patch in the current iteration, respectively.
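Given precomputed VGG19 activations, Equation (11) reduces to a weighted sum of per-layer L1 distances. The sketch below assumes the feature maps are already extracted (the VGG forward pass is not shown) and uses a per-element mean as the L1 normalization, which is an assumption:

```python
import numpy as np

def perceptual_loss(feats_sr, feats_hr, weights=(1 / 8, 1 / 4, 1 / 2)):
    """Weighted L1 distance between corresponding VGG19 feature maps
    (conv3-4, conv4-4, conv5-4). Inputs are precomputed activation
    tensors; mean-based normalization is an assumption."""
    return float(sum(w * np.mean(np.abs(fs - fh))
                     for w, fs, fh in zip(weights, feats_sr, feats_hr)))
```

The increasing weights (1/8, 1/4, 1/2) emphasize the deeper, more semantic layers, which correlate better with perceived quality than raw pixels do.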

Back-Projection Loss
The adoption of the BP loss forces the LR image obtained by downsampling I_SR by a factor of r to match I_LR, providing further supervision for I_SR in the low-resolution image space, which can be denoted as

L_BP = ‖bi(I_SR, r) − I_LR‖_1,  (12)

where bi(·, r) denotes the bicubic downsampling operation with scale factor r. Thus, the overall generator loss can be expressed as

L_G = μ_1 L_BB + μ_2 L_adv_G + μ_3 L_percep + μ_4 L_BP,

where μ_1 to μ_4 are the weights of the respective loss terms.
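The back-projection constraint of Equation (12) is easy to sketch. Here a box-average downsample stands in for the bicubic operator bi(·, r), an assumption made purely to keep the example dependency-free:

```python
import numpy as np

def back_projection_loss(sr, lr, r):
    """L1 distance between the downsampled SR image and the LR input.
    A box-average downsample approximates the bicubic bi(., r) of the
    paper (an assumption for simplicity); sr is (H*r, W*r), lr (H, W)."""
    h, w = lr.shape
    down = sr.reshape(h, r, w, r).mean(axis=(1, 3))
    return float(np.mean(np.abs(down - lr)))
```

Any SR estimate that is consistent with the LR observation incurs zero BP loss, so this term anchors the generator to the observed low-resolution evidence while the BB and adversarial terms shape the high-frequency content.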


Datasets in Experiments
To validate the effectiveness of our proposed method, we selected four public remote sensing datasets, including the NWPU-RESISC45 dataset [60], the UCMerced dataset [61], the RSCNN7 dataset [62], and the DOTA dataset [63].These datasets all consist of numerous RGB images and are extensively adopted in the remote sensing image SR field.

UCMerced Dataset
The UCMerced dataset is widely adopted for remote sensing image visual processing tasks, consisting of 21 categories, with 100 images per category. The images were captured by the remote sensing satellites of the University of California, Merced, and have a resolution of 256 × 256 pixels, covering various scenes such as urban areas, forests, and farmlands. We randomly selected 10 images from each category as a testing set, which allows us to evaluate the effectiveness and robustness of our approach after training on the NWPU-RESISC45 dataset.

RSCNN7 Dataset
The RSCNN7 dataset consists of seven categories covering 2800 images, each with 400 × 400 pixels. This dataset is sampled at different scales and takes into account weather variability and seasonal changes.

DOTA Dataset
The DOTA dataset consists of 2806 aerial images, each with pixel sizes ranging from 800 × 800 to 4000 × 4000, containing objects of various scales, shapes, and orientations. These images are annotated for 15 common target categories, including airplanes, ships, storage tanks, baseball fields, tennis courts, basketball courts, surface runways, harbors, bridges, large vehicles, small vehicles, helicopters, roundabouts, soccer fields, and swimming pools.

Quantitative Evaluation Metrics
In this paper, we judge the various methods using three typical image quality evaluation metrics: the peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM), and the learned perceptual image patch similarity (LPIPS).

PSNR
The PSNR [42] is a common measure of signal reconstruction quality and is defined via the mean squared error (MSE). For two monochrome images I and K of size m × n, the mean squared difference is defined as

MSE = (1 / (m · n)) Σ_(i=0)^(m−1) Σ_(j=0)^(n−1) [I(i, j) − K(i, j)]².

Thus, the PSNR can be expressed as

PSNR = 10 · log10(MAX_I² / MSE),

where MAX_I denotes the maximum pixel value of image I; a higher PSNR value means less distortion.
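The PSNR definition above translates directly into code:

```python
import numpy as np

def psnr(img_i, img_k, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two same-sized images."""
    mse = np.mean((img_i.astype(np.float64)
                   - img_k.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images: zero error
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

For example, two 8-bit images differing everywhere by 16 gray levels give MSE = 256 and a PSNR of about 24.05 dB.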

SSIM
The SSIM [43] is also a full-reference image quality evaluation criterion, which measures image similarity in terms of luminance, contrast, and structure. It can be expressed as

SSIM(x, y) = [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)],

where μ_x and μ_y denote the mean pixel values of the two images, σ_x and σ_y denote their standard deviations, σ_xy denotes their covariance, and C_1 and C_2 are constants. The SSIM value ranges from 0 to 1, and the higher the value, the less the image distortion.
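A direct transcription of the SSIM formula is shown below. Note this is the global-statistics version for illustration; the standard metric averages the same formula over local (e.g., 11 × 11 Gaussian) windows:

```python
import numpy as np

def ssim_global(x, y, max_val=1.0):
    """SSIM computed from global image statistics (simplified; the
    standard SSIM averages this formula over local windows)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return float(((2 * mu_x * mu_y + c1) * (2 * cov + c2))
                 / ((mu_x ** 2 + mu_y ** 2 + c1)
                    * (var_x + var_y + c2)))
```

An image compared with itself scores exactly 1, and any mismatch in mean, variance, or structure pulls the score below 1.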

LPIPS
The LPIPS [47] evaluates the perceptual similarity between images with a deep learning model, which corresponds more closely to human perception than the PSNR and SSIM do [34]. The LPIPS can be expressed as

LPIPS(x, y) = Σ_l (1 / n_l) Σ_(h,w) ‖ω_l ⊙ (φ_l(x)_(h,w) − φ_l(y)_(h,w))‖²_2,

where φ_l(·) indicates the feature map of the l-th convolutional layer, n_l denotes the number of spatial positions in φ_l(·), ⊙ denotes the product operation in the channel dimension, and ω_l represents a learned weight vector. A lower LPIPS value means the two images are more similar in human perception.
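Given precomputed per-layer features and weights, the LPIPS formula itself is a weighted squared distance averaged over spatial positions. The sketch below is only the aggregation step; the trained network that produces the features φ_l and the learned weights ω_l (which the real metric provides) are assumed given:

```python
import numpy as np

def lpips_like(feats_x, feats_y, weights):
    """Evaluate the LPIPS aggregation on per-layer feature maps of
    shape (H_l, W_l, C_l). The deep features and learned channel
    weights are assumed precomputed; this is not the lpips package."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        n_l = fx.shape[0] * fx.shape[1]  # spatial positions in layer l
        diff = w * (fx - fy)             # channel-wise weighting
        total += np.sum(diff ** 2) / n_l
    return float(total)
```

In practice one uses the authors' released `lpips` package, which bundles the calibrated network and weights.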

Experimental Details
Our experiments were conducted on an NVIDIA Tesla V100 GPU. The input image size was set to 48 × 48 and the batch size to eight. We employed the bicubic operation to downsample the original high-resolution images to obtain HR-LR training pairs. The channel number of our ESTN was set to sixty, and the number of attention heads and the window size were set to six and sixteen, respectively.
Adam was used as our optimizer with β_1 = 0.9 and β_2 = 0.999; the learning rate was 1 × 10^−4, with warmup in the initial stage followed by cosine decay. The parameters k and θ introduced in the method were set to 11 and 0.025, respectively. As for the loss function, μ_1, μ_3, and μ_4 were set to 1, while μ_2 was set to 0.005 (following [17]).
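The warmup-then-cosine-decay schedule can be sketched as a pure function of the training step. The warmup length and the decay-to-zero floor are assumptions, since the text only states that warmup and cosine decay were used with a base rate of 1 × 10^−4:

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr=1e-4):
    """Learning rate at a given step: linear warmup to base_lr, then
    cosine decay toward zero. Warmup length and zero floor are
    assumptions; only the base rate and schedule shape are stated."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

The warmup phase avoids unstable early updates (important for adversarial training), while the cosine tail lets the generator settle into a minimum.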

Quantitative Comparison
In our experiments, we validated the performance of ESTUGAN by comparing it with six deep learning SR methods: RCAN [5], RRDB, SwinIR [11], SRGAN [14], ESRGAN [15], and BebyGAN [17]. We selected 31,050 images from the NWPU-RESISC45 dataset as the training set and 450 images as the testing set. In addition, to verify the generalizability of these models, we included 210 randomly selected images from the UCMerced dataset, 800 randomly selected images from the DOTA dataset, and all images in the RSCNN7 dataset as additional test sets. Under the same conditions, we tested all methods at 4× upscaling and evaluated them with the PSNR, SSIM, and LPIPS metrics.
Table 1 shows the quantitative results. The proposed approach achieves the most satisfactory results. In the comparison with the GAN-based methods (SRGAN, ESRGAN, and BebyGAN), ESTUGAN achieves the highest PSNR and SSIM and the lowest LPIPS, demonstrating that it reconstructs images with optimal accuracy and perceptual quality. It is worth mentioning that ESTUGAN also maintains the best performance on the three additional test sets, validating the scalability of our proposed model. In contrast, the performance of SRGAN on the DOTA dataset shows a distinct decline, reflecting that model's shortcomings in generalizability. In the comparison with CNN-based methods (RCAN, RRDB, SwinIR), the proposed method also achieves impressive results, only slightly below SwinIR and above the other compared methods. Although it is slightly behind SwinIR in performance, our method has only one-fourth the parameters and FLOPs of SwinIR (illustrated in Section 4.6), greatly reducing the computational cost of the SR task for remote sensing images. The ESTN also achieves quite robust results with minimal parameters when evaluated on the three additional test sets.
We also compared these methods on the 45 category scenarios from the NWPU-RESISC45 dataset; as shown in Table 2, ESTUGAN outperforms the comparison methods in each scenario. Among them, the PSNR of ESTUGAN in several scenes, such as aircraft, desert, circular farmland, and industrial area, is higher than that of BebyGAN by over 0.3 dB, and ESTUGAN achieves the lowest LPIPS in all scenes, which means the predicted images generated by our method have the best visual quality. It also shows that our method can be fine-tuned for different scenes to faithfully reconstruct the actual image distribution.

Qualitative Comparison
We also performed a qualitative comparison to verify the effectiveness of ESTUGAN, as shown in Figure 6. Compared to SRGAN, ESRGAN, and BebyGAN, our proposed method generates more accurate structural information with minimal artifacts, especially in flat areas. We also reconstruct sharper and more detailed results compared to the PSNR-oriented methods. The effectiveness of our method is thus well proven.

Ablation Study
We conducted ablation experiments on the test set to verify the performance of the proposed components. In order to verify the performance of the U-Net discriminator, we adopted BebyGAN and ESTUGAN as the baselines to test their performance with the U-Net discriminator and a regular discriminator [14,15], respectively. As shown in Figure 7, after adopting the U-Net discriminator, the PSNR of BebyGAN improves by 0.22 dB, the LPIPS decreases by 0.012, and the SSIM increases by 0.001. When replacing the U-Net discriminator with a regular discriminator in the proposed method, the PSNR drops by 0.146 dB, the LPIPS rises by 0.008, and the SSIM decreases by nearly 0.01, which significantly affects the reconstruction performance. This shows that the U-Net discriminator provides a more robust ability to identify authenticity. Meanwhile, we visualized the results of the discriminator's determination, shown in Figure 8c, where black pixels denote a negative (fake) judgment and white pixels a positive (real) judgment. Such accurate pixel-by-pixel judgment helps the generator produce better results in terms of the LPIPS.
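The dense, pixel-by-pixel feedback described above amounts to one real/fake term per output pixel rather than a single score per image. The sketch below illustrates this with a generic per-pixel binary cross-entropy; it is not the paper's exact adversarial loss formulation, only an illustration of the dense-feedback idea.

```python
import numpy as np

def pixelwise_bce(pred_map, real):
    """Dense real/fake feedback: one BCE term per output pixel.

    pred_map: discriminator probabilities in (0, 1), one per pixel
              (white ~ judged real, black ~ judged fake, as in Figure 8c).
    real:     1.0 if the input was a ground-truth image, else 0.0.
    """
    eps = 1e-7
    p = np.clip(pred_map, eps, 1.0 - eps)          # numerical safety
    label = np.full_like(p, real)                  # same target at every pixel
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)).mean())
```
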
In addition, we also verified the effectiveness of the BB loss and the region-aware learning strategy in our approach, as shown in Table 3. After eliminating the BB loss, the performance decreases on both test sets. Similarly, the PSNR, SSIM, and LPIPS deteriorate after the removal of the region-aware strategy. It is noteworthy that the performance of the model without the BB loss and the region-aware learning strategy deteriorates more significantly on the UCMerced test set than on the NWPU-RESISC45 dataset. This observation underscores the benefits of incorporating the BB loss and the region-aware learning strategy to enhance the model's generalizability.
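The region-aware strategy feeds only texture-rich regions to the discriminator. A minimal sketch of such a mask is shown below, using the k = 11 window and θ = 0.025 threshold stated in the experimental settings; using local standard deviation as the texture statistic is an assumption of this sketch, not a detail given in the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def texture_mask(img, k=11, theta=0.025):
    """Mark texture-rich pixels: local std over a k x k window above theta.

    k = 11 and theta = 0.025 follow the paper's settings; the choice of
    local standard deviation as the texture measure is an assumption.
    img is expected in [0, 1].
    """
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    windows = sliding_window_view(padded, (k, k))   # shape (H, W, k, k)
    local_std = windows.std(axis=(-2, -1))
    # 1 = texture-rich pixel, kept for adversarial learning; 0 = flat, masked out
    return (local_std > theta).astype(np.float32)
```
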
Finally, to demonstrate the performance of our improved deep feature extraction module in the generator, we compared it with two baselines that have the same deep feature extraction module as HAT [12]. We set the number of channels to sixty and the number of ESTBs to four (denoted as baseline1) and six (denoted as baseline2), respectively. Table 4 records the comparison results of our ESTN against the two baselines on the UCMerced dataset. As can be seen from the experimental results, neither of the two baselines performs as well as our network. Although baseline2 has a deeper network structure, its effect is not better than that of baseline1. This suggests that the residual structure suffers performance degradation in the long-term feature extraction phase, and that cascading between residual structures improves the performance of the remote sensing image SR task.

Model Complexity Analysis
Figure 9 visualizes the trade-off between the number of parameters and the PSNR of EDSR [64], RCAN [5], RRDB [15], SwinIR [11], HSENet [24], SWCG [65], Resnet [2], and our ESTN. It can be seen that the ESTN is comparable to SwinIR in performance while holding an absolute advantage in parameter count, saving over nine million parameters compared to SwinIR. Our ESTN thus performs impressively in terms of both PSNR and the number of parameters. Table 5 comprehensively lists the parameters, FLOPs, and inference time of the different methods.
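Parameter and FLOP counts like those in Table 5 can be estimated layer by layer. The helpers below give the standard back-of-envelope formulas for a single convolution (counting 2 FLOPs per multiply-accumulate, a common but not universal convention); they are illustrative, not the exact accounting used in the paper.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of one k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def conv2d_flops(c_in, c_out, k, h, w):
    """FLOPs for one stride-1 conv over an h x w output feature map,
    counted as 2 FLOPs per multiply-accumulate."""
    return 2 * c_in * c_out * k * k * h * w
```

For example, a 3→64 channel 3 × 3 convolution over the 125 × 125 input used in Table 5 costs 1792 parameters and 54 MFLOPs under this convention; summing such terms over every layer yields totals on the scale reported there.
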

Conclusions
In this paper, ESTUGAN was proposed for the characteristics of remote sensing images. The generator is the ESTN, built on a Swin Transformer backbone, which combines the advantages of CNN- and transformer-based models and possesses a more powerful expressive ability. Meanwhile, a U-Net discriminator with the region-aware learning strategy and a loss strategy that enables flexible supervision were proposed; they effectively suppress artifacts and guide the generator to recover authentic high-frequency information. Extensive experiments proved that ESTUGAN outperforms existing methods with fewer parameters for remote sensing image SR. Specifically, we tested the performance of our model on four widely used remote sensing datasets. For the proposed method, sufficient ablation tests were conducted to verify the validity of its components. At the same time, we also explored, to some extent, the relationship between network depth and performance on the image SR task; we found that merely adding more functional blocks and increasing the number of parameters does not improve the overall performance, and can even decrease it in some specific scenarios.
In the future, we will continue to explore the effectiveness of lightweight models for SR tasks in remote sensing images.

Figure 1 .
Figure 1. The overview of our proposed ESTUGAN. The proposed U-Net discriminator with the region-aware learning strategy focuses on adversarial learning in texture-rich regions and outputs a map of the true situation of each pixel. We use Best-buddy loss, Back-projection loss, perceptual loss, and adversarial loss to supervise the generator, and adversarial loss to guide the optimization of the discriminator.

Figure 2 .
Figure 2. The framework of our proposed ESTN generator, which consists of three modules in total, including the deep feature extraction module, shallow feature extraction module, and image reconstruction module.

Figure 3 .
Figure 3. The framework of our proposed U-Net discriminator.

Figure 4 .
Figure 4. Our Best-buddy (BB) loss combined with Back-projection (BP) loss for supervision, compared to MSE/MAE loss. Specifically, the blue plot represents the MAE/MSE loss, and the yellow plot represents the BB loss and BP loss we adopted. p_lr^i, p_hr^i, p_sr^i, and p_hr*^i denote the LR patch, HR patch (ground truth), predicted HR patch, and Best-buddy HR patch in the current iteration, respectively.
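The patch-matching idea in this caption, supervising each predicted patch with its nearest ("best-buddy") HR candidate rather than a fixed target, can be sketched as below. This is a simplified illustration; the full BB loss [17] additionally constrains each buddy to stay close to the ground-truth patch.

```python
import numpy as np

def best_buddy_l1(sr_patches, hr_candidates):
    """For each predicted patch, supervise with its nearest HR candidate.

    sr_patches:    (N, d) flattened predicted HR patches
    hr_candidates: (M, d) flattened candidate HR patches
    Simplified sketch of the best-buddy idea, not the full BB loss.
    """
    # pairwise mean absolute distance between every SR patch and candidate
    d = np.abs(sr_patches[:, None, :] - hr_candidates[None, :, :]).mean(axis=-1)
    buddies = hr_candidates[d.argmin(axis=1)]       # closest candidate per patch
    return float(np.abs(sr_patches - buddies).mean())
```
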

4.1.1. NWPU-RESISC45 Dataset
This dataset encompasses 45 classes of remote sensing images with high inter-class similarity and intra-class diversity. It contains a total of 31,500 images with a resolution of 256 × 256 pixels. We randomly selected 10 images from each category as the testing set for our experiments and used the rest as the training set. Some of the training set images are shown in Figure 5.

Figure 5 .
Figure 5. Partial images of the training set in the NWPU-RESISC45 dataset.

Figure 6 .
Figure 6. Visualization comparison of various algorithms on the NWPU-RESISC45 dataset with a scale factor of ×4.

Figure 7 .
Figure 7. The performance of our proposed ESTUGAN and BebyGAN on the NWPU-RESISC45 dataset when different discriminators are employed. Performance was measured using the PSNR, SSIM, and LPIPS metrics.

Figure 8 .
Figure 8. The visualization of the U-Net discriminator. (a) The original images in the selected dataset. (b) The generated images of the proposed generator. (c) The discrimination results on the generated images.

Electronics 2023, 21

Figure 9 .
Figure 9. Visualization comparison of parametric quantities and PSNR of diverse approaches.

Table 5 .
Parameters, FLOPs, and GPU runtime for various super-resolution models. GPU runtime is tested on a Tesla V100 GPU with an input size of 125 × 125.

Model        Parameters   FLOPs     GPU Runtime
RCAN         16 M         233.8 G   0.189 s
RRDB         16.7 M       257.5 G   0.101 s
HSENet       5.4 M        73.3 G    0.155 s
SwinIR       11.9 M       202.2 G   0.288 s
ESTN (ours)  2.2 M        53.5 G    0.165 s

(1) We propose a promising framework, ESTUGAN, which adopts the Enhanced Swin Transformer as the generator backbone and a U-Net discriminator. The Enhanced Swin Transformer is capable of mobilizing more input information to model local content, benefiting from united channel attention and self-attention. In addition, it employs an overlapping cross-attention mechanism to further aggregate cross-window information with stronger representational capabilities. Extensive experiments demonstrate that our proposed network outperforms other methods when targeting remote sensing image SR.
(2) We propose a U-Net discriminator with the region-aware learning strategy to reconstruct highly detailed remote sensing images. The region-aware learning strategy can effectively suppress artifacts by masking flat regions and feeding only texture-rich regions to the discriminator for adversarial training. Moreover, the U-shaped network is designed with skip connections that link shallow detailed content with deep semantic information, providing dense feedback on each pixel's authenticity.
(3) The BB loss and BP loss are employed to further enhance the visual quality of the image. Multiple supervised signals that are similar to the ground truth are utilized to flexibly guide the image reconstruction; this reduces the training difficulty and helps to generate high-frequency information.

Table 1 .
Quantitative comparison of PSNR, SSIM, and LPIPS on the NWPU-RESISC45, UCMerced, RSCNN7, and DOTA datasets at a four-times scale factor.

Table 2 .
SR results for each class in the NWPU-RESISC45 dataset at a four-time scale factor.

Table 3 .
The comparison of ablation studies on the BB loss and the region-aware strategy on the NWPU-RESISC45 dataset. "Ours" means our proposed ESTUGAN; "w/o BBL" and "w/o RA" indicate the model with the BB loss removed and the model with the region-aware strategy removed, respectively.

Table 4 .
Comparison of using different generator frameworks on the UCMerced dataset.