Saliency-Guided Remote Sensing Image Super-Resolution

Abstract: Deep learning has recently attracted extensive attention and developed significantly in remote sensing image super-resolution. Although remote sensing images are composed of various scenes, most existing methods treat each part equally. These methods ignore the salient objects (e.g., buildings, airplanes, and vehicles) that have more complex structures and require more attention during recovery. This paper proposes a saliency-guided remote sensing image super-resolution (SG-GAN) method to alleviate the above issue while maintaining the merits of GAN-based methods for generating perceptually pleasant details. More specifically, we exploit the saliency maps of images to guide the recovery in two aspects: On the one hand, the saliency detection network in SG-GAN learns high-resolution saliency maps to provide additional structure priors. On the other hand, the well-designed saliency loss imposes a second-order restriction on the super-resolution process, which helps SG-GAN concentrate more on the salient objects of remote sensing images. Experimental results show that SG-GAN achieves competitive PSNR and SSIM compared with advanced super-resolution methods. Visual results demonstrate our superiority in restoring structures while generating remote sensing super-resolution images.


Introduction
High spatial quality (HQ) optical remote sensing images have high spatial resolution and low noise, and can be widely used in agricultural and forestry monitoring, urban planning, military surveillance, and other fields. However, the time and cost of sensor development, together with the vulnerability of imaging to changes in atmosphere and light, lead to the acquisition of a large number of low spatial quality (LQ) remote sensing images. Recently, more researchers have turned their attention to recovering HQ remote sensing images from LQ ones using image processing technology. Low spatial resolution is a crucial factor in the low quality of remote sensing images [1]. Therefore, enhancing spatial resolution has become the most common approach to acquiring high-quality images.
Remote sensing image super-resolution (SR) aims to recover the high-resolution (HR) remote sensing image from the corresponding low-resolution (LR) remote sensing image. Currently, remote sensing image SR has become an intensely active research topic in remote sensing image analysis [2]. However, it is an ill-posed problem, as each LR input corresponds to many plausible HR outputs. Traditional methods of detecting salient objects depend on handcrafted features, such as contrast [37] and boundary background [38]. With the rapid development of convolutional neural networks, an increasing number of researchers propose to estimate a saliency map from an image with end-to-end networks [39][40][41]. Given the advantages of salient object detection, a proper saliency map helps boost super-resolution performance by allocating more resources to important regions accordingly. Therefore, this paper proposes a generative adversarial network for saliency-guided remote sensing image super-resolution (SG-GAN) to alleviate the above-mentioned issue. As the saliency map reveals the salient regions in an image, we apply this powerful tool to guide image recovery. This paper designs a saliency-guided network that aligns the saliency map estimated from the generated SR image with that of the HR image as an auxiliary SR problem. Then, the saliency map is integrated into the remote sensing SR network to guide high-resolution image reconstruction. The main contributions of our paper can be summarized as follows.
• We propose a saliency-guided remote sensing image super-resolution network (SG-GAN) that maintains the merits of GAN-based methods to generate perceptually pleasant details. Additionally, the salient object detection module with an encoder-decoder structure in SG-GAN helps the generative network focus training on the salient regions of the image.
• We provide an additional constraint to supervise the saliency map of the remote sensing images by designing a saliency loss. It imposes a second-order restriction on the SR process to retain the structural configuration and encourages the obtained SR images to have higher perceptual quality and fewer geometric distortions.
• Compared with existing methods, the SG-GAN model reconstructs high-quality details and edges in the reconstructed images, both quantitatively and qualitatively.
In the rest of the paper, we briefly review related works in Section 2. In Section 3, the main idea of the proposed method, the network structure, and the loss functions are introduced in detail. The specific experimental design and comparison of experimental results are shown in Section 4. Section 5 provides further analysis of the experimental results. Finally, Section 6 summarizes the paper.

Related Works
This section mainly introduces related deep learning-based image super-resolution and salient object detection.

Deep Learning-Based Image Super-Resolution
The deep learning-based image super-resolution methods are usually divided into two frameworks: One is image super-resolution with convolutional neural networks, and the other is image super-resolution with generative adversarial networks.

Image Super-Resolution with Convolutional Neural Networks
The groundbreaking work of image super-resolution came from SRCNN [8] in 2014. SRCNN implemented an end-to-end mapping by using three convolutional layers, which completed the tasks of feature extraction and high-resolution image reconstruction. FSRCNN [42] was an improved version of SRCNN: instead of taking upscaled high-dimensional images as input, FSRCNN directly extracted features from LR images and utilized deconvolution to upscale them. Many models [43][44][45] rely on residual connections and have achieved good results. RCAN [46] introduced the attention mechanism to image super-resolution to improve feature utilization. Chen et al. designed a self-supervised spectral-spatial residual network [47]. The authors of [48] proposed a multi-scale dilation residual block to deal with the super-resolution problem of remote sensing images.

Image Super-Resolution with Generative Adversarial Networks
The generative adversarial network [11] was proposed by Goodfellow et al. in 2014 and has been widely applied in numerous visual tasks. In the field of image super-resolution, Ledig et al. proposed SRGAN [13], which adopted a GAN to implement image super-resolution. SRGAN exploited the residual network [49] structure as the main structure of the generator, with a skip connection adding low-resolution information directly to the learned high-dimensional features. The VGG [50] network was utilized in the main body of the discriminator, which guided the generator to pay attention to in-depth features. Under the GAN framework, the adversarial loss was also effective for improving image resolution by minimizing the differences between generated images and target images. Ref. [51] proposed an Enlighten-GAN model, which utilized a Self-Supervised Hierarchical Perceptual Loss to optimize network results. The works in [52,53] both achieved outstanding results based on GANs. Although these existing perceptual-driven methods indeed improve the overall visual quality of super-resolved images, they treat saliency and non-saliency regions of remote sensing images equally.

Region-Aware Image Restoration
Gu et al. [54] suggest that the performance difference of various methods mainly lies in the areas with complex structure and rich texture. Recently, researchers have proposed various strategies for different regions of images to achieve better image processing results. RAISE [55] divides the low-resolution images into different image clusters and designs specific filters for each image cluster. SFTGAN [56] introduces a spatial feature transformation layer, combining advanced semantic information with the image's features. ClassSR [57] proposes a classification-based method that classifies image blocks according to restoration difficulty and adopts different super-resolution networks to improve super-resolution efficiency and effect. In contrast, this paper proposes a super-resolution network based on saliency guidance, which enhances the super-resolution effect by providing additional constraints on the saliency regions of the image and assigning them more weight.

Salient Object Detection
Before the rapid development of convolutional neural networks, many researchers classified images into saliency and non-saliency areas by utilizing super-pixels or local patches. MACL [58] adopted two paths to extract local and global contexts from different windows of super-pixels and then built a united model in the same hybrid deep learning framework. The work in [59] combined abstraction, element distribution, and uniqueness to obtain a saliency map, constructing a saliency loss to deal with retinal vasculature image super-resolution. Qin et al. proposed BASNet [60] to segment the salient object regions, predicting structures with clear boundaries. Saliency information may provide additional supervision that helps deep adversarial networks achieve better image reconstruction. In this work, we aim to leverage saliency guidance to further improve GAN-based remote sensing SR methods.

Proposed Methods
In this section, we first present the proposed SG-GAN framework in detail. Then the salient object detection model is described. Finally, the optimization functions are introduced briefly.
Based on the above analysis, recovering the salient regions of an image plays an essential role in improving the performance of remote sensing image super-resolution. This paper focuses on detecting the salient regions and paying more attention to them. Figure 3 shows our main idea. We adopt the salient object detection network to estimate the saliency maps of both the generated image and the target image, motivating SG-GAN to recover more structural details. Figure 3. The main idea of SG-GAN: low-resolution images are input into the generator, which outputs super-resolution images. The generated and authentic images are input into the salient object detection network separately. The discrepancy between the two saliency maps estimated from the generated and HR images is encoded as the saliency loss. This loss is fed back to the generator, making the saliency area of the generated image more realistic.

Structure of SG-GAN
SG-GAN consists of two parts: the generator and the discriminator. The generator reconstructs the input low-resolution image into a super-resolution image. The discriminator acts as a leader, guiding the generator's training toward outputs closer to the high-resolution image.
As shown in Figure 4, the generator $G_{\theta_G}$ mainly contains five parts: (a) a shallow feature extraction unit, (b) a deep feature extraction unit, (c) a parameter reduction unit, (d) an upscaling unit, and (e) an image reconstruction unit. Precisely, for (a), the generator extracts shallow features using a convolution layer with a kernel size of 9 and a ParametricReLU layer [61] as the activation function, which converts the three-channel image into a feature map with 64 channels. (b) Then, 16 residual units, as proposed in [62], perform deep feature extraction, capturing low-, medium-, and high-frequency features for more abundant information. (c) The residual blocks (RB) remove the batch normalization layer to save memory and reduce parameters. Moreover, the generator employs short and long skip connections to transfer shallow features directly to the in-depth features, thereby fusing low-level and high-level features. All the extracted features are then converted by a convolution layer with a 1 × 1 kernel size, which integrates and reduces the number of features in preparation for the subsequent upsampling operation. (d) Next, two sub-pixel convolutions, each containing a convolution layer, a pixel shuffle, and a PReLU, form the upscaling unit. (e) The final remote sensing super-resolution image is obtained through a convolution layer with a 9 × 9 kernel size.
The discriminator network $D_{\theta_D}$ is trained to discriminate authentic HR images from generated SR samples. It adopts the exact discriminator structure of SRGAN to pay more attention to the deep parts and to solve the maximization problem in Equation (1). The discriminator structure is shown in Figure 5. Specifically, the input image passes through 8 convolutional layers with an increasing number of 3 × 3 filter kernels, growing by a factor of 2 from 64 to 512 kernels, to extract in-depth feature information.
The discriminator judges the authenticity of the deep feature information. At the end of the discriminator, a number between 0 and 1 is obtained through the Sigmoid function, and this value indicates how authentic the input image is judged to be: an output below 0.5 marks the input as fake, while an output above 0.5 marks it as real. As deep features help improve the sharpness of the image, the generator can generate a sharper image under the guidance of the discriminator.

Details of Salient Object Detection Network
As shown in Figure 3, the remote sensing SR network incorporates saliency regions representations from the salient object detection network. The motivation of this scheme is that the more accurate the grasp of saliency regions, the more helpful it is to reconstruct the resolution of complex areas. A well-designed saliency object detection branch can carry rich saliency information, which is pivotal to recovering remote sensing image SR.
Salient object detection methods based on Fully Convolutional Neural Networks (FCN) are able to capture richer spatial and multiscale information to retain location information and obtain structural features. In this paper, BASNet [60] is selected as the salient object detection network baseline. Inspired by U-Net, we design our salient object prediction module as an Encoder-Decoder network in Figure 6, because this kind of architecture is able to capture high-level global contexts and low-level details at the same time. Meanwhile, spatial location information is also captured. A bridge structure helps the whole network capture global features. The last layer of the network outputs a salient object map with distinct edges. Moreover, in order to accurately predict the saliency areas and their contours, the paper applies a hybrid loss in Equation (2), which contains the Binary Cross-Entropy (BCE) loss $\ell_{bce}$ in Equation (3), the Structural Similarity (SSIM) loss $\ell_{ssim}$ in Equation (4), and the Intersection-over-Union (IoU) loss $\ell_{iou}$ in Equation (5), to achieve superior performance at the pixel, patch, and map levels.
$\ell_{bce}$, $\ell_{ssim}$, and $\ell_{iou}$ denote BCE loss [63], SSIM loss [64], and IoU loss [65], respectively. BCE loss is the most widely used loss in binary classification and segmentation. It is defined as

$$\ell_{bce} = -\sum_{(r,c)} \big[ G(r,c)\log S(r,c) + (1 - G(r,c))\log(1 - S(r,c)) \big],$$

where $G(r,c) \in \{0,1\}$ denotes the ground-truth label of pixel $(r,c)$ and $S(r,c)$ denotes the predicted probability of being the salient object; this term measures quality at the pixel level. SSIM was originally proposed for image quality assessment and captures the structural information in an image. Therefore, we integrate it into our saliency detection training loss to learn the structural information of the salient object ground truth. Let $x = \{x_j : j = 1, \dots, N^2\}$ and $y = \{y_j : j = 1, \dots, N^2\}$ be the pixel values of two corresponding $N \times N$ patches cropped from the predicted probability map $S$ and the binary ground-truth mask $G$, respectively. The SSIM loss of $x$ and $y$ is defined as

$$\ell_{ssim} = 1 - \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ denote the means and standard deviations of $x$ and $y$, respectively, $\sigma_{xy}$ denotes the covariance of $x$ and $y$, and $C_1 = 0.01^2$ and $C_2 = 0.03^2$ are used to avoid division by zero; this term measures quality at the patch level. IoU measures the similarity of two sets [66] and serves as a standard evaluation measure for object detection and segmentation. To ensure its differentiability, we adopt the IoU loss used in [64]:

$$\ell_{iou} = 1 - \frac{\sum_{(r,c)} G(r,c)\, S(r,c)}{\sum_{(r,c)} \big[ G(r,c) + S(r,c) - G(r,c)\, S(r,c) \big]},$$

where $G(r,c) \in \{0,1\}$ denotes the ground-truth label of pixel $(r,c)$ and $S(r,c)$ is the predicted probability of being the salient object; this term measures quality at the map level. Based on the above analysis, this paper cascades the three loss functions in the saliency detection network: BCE loss maintains a smooth gradient for all pixels, IoU loss pays more attention to the foreground, and SSIM loss encourages the prediction to respect the structure of the original image.
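As a concrete illustration, the three terms of the hybrid loss can be sketched in plain Python over flattened probability maps. This is a minimal sketch, not the paper's implementation: the SSIM term here uses a single global window rather than the paper's N × N patches, and the helper names (`bce_loss`, `iou_loss`, `hybrid_loss`) are ours.

```python
import math

def bce_loss(gt, pred, eps=1e-7):
    """Pixel-level binary cross-entropy, as in Equation (3)."""
    return -sum(
        g * math.log(p + eps) + (1 - g) * math.log(1 - p + eps)
        for g, p in zip(gt, pred)
    ) / len(gt)

def iou_loss(gt, pred):
    """Map-level soft IoU loss, as in Equation (5): 1 - intersection/union."""
    inter = sum(g * p for g, p in zip(gt, pred))
    union = sum(g + p - g * p for g, p in zip(gt, pred))
    return 1.0 - inter / union

def hybrid_loss(gt, pred):
    """Cascade of BCE + SSIM + IoU terms (single-window SSIM stand-in)."""
    n = len(gt)
    mu_g, mu_p = sum(gt) / n, sum(pred) / n
    var_g = sum((v - mu_g) ** 2 for v in gt) / n
    var_p = sum((v - mu_p) ** 2 for v in pred) / n
    cov = sum((a - mu_g) * (b - mu_p) for a, b in zip(gt, pred)) / n
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_g * mu_p + c1) * (2 * cov + c2)) / (
        (mu_g ** 2 + mu_p ** 2 + c1) * (var_g + var_p + c2)
    )
    # ssim loss = 1 - SSIM, so a structurally faithful prediction scores low
    return bce_loss(gt, pred) + (1.0 - ssim) + iou_loss(gt, pred)
```

A prediction close to the binary ground truth drives all three terms toward zero, while a structurally wrong map is penalized by every term.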

Design of Saliency Loss
This paper aims to design a loss to compare each pixel value to reduce the salient objects' differences between the generated remote sensing image and the actual remote sensing image.
The saliency feature map reflects the saliency regions of the image and whether the image's structure, texture, and other information are rich. We expect to capture more apparent differences among the saliency maps in our task. In Figure 7, we list a low-resolution image and a high-resolution image with their saliency maps before and after the Sigmoid layer, respectively. Map2 and map4 are the saliency maps of the low-resolution and high-resolution images without the Sigmoid layer; map1 and map3 are the saliency maps after the Sigmoid layer. Compared with map1 and map3, map2 and map4 show more apparent differences in target contour and in the black-and-white degree. The saliency maps output by the Sigmoid layer (map1 and map3) are harder to distinguish in most areas. Therefore, to better utilize the difference between saliency maps and improve the super-resolution performance of remote sensing images, we choose the saliency feature maps before the Sigmoid layer. The saliency loss function is expressed in Equation (6).
$$L_{Saliency} = \frac{1}{CWH} \sum_{i=1}^{C} \big\| h_i(I^{HR}) - h_i(I^{SR}) \big\|_1,$$

where $W$ and $H$ represent the width and height of the feature map, respectively, and $C$ indicates the number of feature channels. $h_i(I^{HR})$ and $h_i(I^{SR})$ are the saliency maps extracted from the target image and the generated image, respectively.
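A minimal sketch of this comparison in plain Python follows, assuming an L1 (mean absolute difference) form over the flattened pre-Sigmoid maps; the exact norm used in the paper's Equation (6) may differ, and `saliency_loss` is our own helper name.

```python
def saliency_loss(map_hr, map_sr):
    """Mean absolute difference between two flattened pre-Sigmoid
    saliency maps (hedged L1 reading of Equation (6))."""
    assert len(map_hr) == len(map_sr), "maps must have the same size"
    return sum(abs(a - b) for a, b in zip(map_hr, map_sr)) / len(map_hr)
```

The loss is zero only when the saliency map of the generated SR image matches that of the HR target, which is exactly the second-order restriction the paper describes.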

Design of Basic Loss Functions
The saliency loss has been described in Section 3.3. This subsection mainly introduces the rest loss functions, including the L 1 loss, the perceptual loss, and the adversarial loss.
The L 1 loss is adopted to reduce the pixel-level differences, and the function is shown in Equation (7).
$$L_1 = \frac{1}{CWH} \sum_{i} \big| I^{HR}_i - I^{SR}_i \big|,$$

where C, W, and H represent the number of feature channels, the width, and the height of the feature map, respectively. $I^{HR}_i$ represents the target remote sensing image, and $I^{SR}_i$ is the generated remote sensing image.
The perceptual loss is the loss at the feature level, which focuses on considering image features. Besides, the perceptual loss utilizes the VGG network [50], which takes advantage of the high-level feature map. As the perceptual loss focuses on in-depth features, it can help improve the image's clarity and achieve better visual effects. Equation (8) shows the formula of the loss:

$$L_{perceptual} = \frac{1}{CWH} \sum_{i} \big( \phi_i(I^{HR}) - \phi_i(I^{SR}) \big)^2,$$

where C, W, and H represent the number of feature channels, the width, and the height of the feature map, respectively. $\phi_i(I^{HR})$ indicates the feature extracted from the target image and $\phi_i(I^{SR})$ is the feature extracted from the generated image. Adversarial loss is the basic loss of the generative adversarial network and contains the generator loss and the discriminator loss. The details are shown in Equations (9) and (10).
$$L^{g}_{adversarial} = -\log D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big), \qquad L^{d}_{adversarial} = -\log D_{\theta_D}(I^{HR}) - \log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big).$$

Here, $L^{g}_{adversarial}$ and $L^{d}_{adversarial}$ represent the adversarial loss of the generator and the adversarial loss of the discriminator, respectively. $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ is the probability that the generated image $G_{\theta_G}(I^{LR})$ is judged to be a real remote sensing HR image, $I^{HR}$ is the real remote sensing image, and $I^{LR}$ is the remote sensing LR image. Through the adjustment of these two loss functions, alternating training between the generator and the discriminator is realized.
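These two adversarial terms can be sketched numerically in plain Python (a minimal sketch of the standard non-saturating GAN losses; `d_loss` and `g_loss` are our own helper names and take the discriminator's scalar outputs directly):

```python
import math

def d_loss(d_real, d_fake, eps=1e-7):
    """Discriminator loss: push D(HR) toward 1 and D(G(LR)) toward 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))

def g_loss(d_fake, eps=1e-7):
    """Non-saturating generator loss: push D(G(LR)) toward 1."""
    return -math.log(d_fake + eps)
```

A confident, correct discriminator (high `d_real`, low `d_fake`) yields a small `d_loss`, while the generator's loss shrinks as its outputs fool the discriminator, which is the alternating-training dynamic described above.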

Experiments
In this section, we first introduce the training and testing dataset. Then we conduct experiments on test datasets. We use PSNR and SSIM as metrics to compare the SG-GAN algorithm quantitatively with existing advanced algorithms and show the visualization results.

Datasets and Metrics
SG-GAN is trained on the RAISE [67] dataset, which contains a total of 8156 high-resolution images. We randomly flip and crop the images to augment the training data. The images are cropped into 64 × 64 blocks as the high-resolution images. The corresponding low-resolution images are generated with the bicubic method and clipped into 16 × 16 blocks.
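The HR/LR pairing above can be sketched in plain Python. This is a simplified illustration: a ×4 block-average pool stands in for the paper's bicubic kernel, and `downsample_x4` is our own helper name.

```python
def downsample_x4(img, w, h):
    """×4 downsample of a flat, row-major grayscale image by averaging
    each 4 × 4 block (a simple stand-in for bicubic resampling)."""
    out = []
    for y in range(0, h, 4):
        for x in range(0, w, 4):
            block = [img[(y + dy) * w + (x + dx)]
                     for dy in range(4) for dx in range(4)]
            out.append(sum(block) / 16.0)
    return out
```

Applied to a 64 × 64 HR crop, this yields the matching 16 × 16 LR block used as network input.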
In the testing stage, we test on SR standard benchmarks, including Set5 [68], Set14 [69], BSD100 [70], and Urban100 [71]. In addition, to verify the advantages of salient object detection for SR, we also report performance on MSRA10K [72], which was originally provided for salient object annotation.
Moreover, to better evaluate the remote sensing image SR performance and the generalization of the proposed SG-GAN, we randomly select images by category from five remote sensing datasets: the NWPU VHR-10 dataset [73], UCAS-AOD dataset [74], AID dataset [75], UC-Merced dataset [76], and NWPU45 dataset [77]. LR images are then obtained by bicubic downsampling to form remote sensing super-resolution test datasets.
NWPU VHR-10 dataset [73]: This dataset consists of 800 images in 10 classes of geospatial objects, including airplanes, ships, storage tanks, baseball diamonds, tennis courts, basketball courts, ground track fields, harbors, bridges, and vehicles. We randomly select 10 images from each category, obtain LR images through bicubic downsampling, and use these 100 images as the test dataset.

UCAS-AOD dataset [74]: This dataset is designed for airplane and vehicle detection, consisting of 600 images with 3210 airplanes and 310 images with 2819 vehicles. We randomly select 200 images, obtain LR images through bicubic downsampling, and use these images as the test dataset.
AID dataset [75]: This dataset consists of 10,000 images in 30 aerial scene categories, including airports, bare ground, baseball fields, beaches, bridges, centers, churches, commercials, dense residences, deserts, farmlands, forests, etc. The size of each image is 600 × 600 pixels. We randomly select 10 images from each category, obtain LR images through bicubic downsampling and use these 300 images as the test dataset.

UC-Merced dataset [76]:
This dataset consists of 2100 images in 21 land use categories, including agriculture, airplanes, baseball fields, beaches, buildings, small churches, etc. The size of each image is 256 × 256 pixels. We randomly select 10 images from each category, obtain LR images through bicubic downsampling and use these 210 images as the test dataset.
NWPU45 dataset [77]: This dataset consists of 31,500 images of 45 scene categories, including airport, baseball diamond, basketball court, beach, bridge, chaparral, church, etc. The size of each image is 256 × 256 pixels. We randomly select 10 images from each category, obtain LR images through bicubic downsampling and use these 450 images as the test dataset.
Following most remote sensing SR methods [1,78,79], the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are used to evaluate our model quantitatively. PSNR is defined in Equation (11) and measures the difference between corresponding pixels of the super-resolved image and the ground truth. SSIM is defined in Equation (12) and evaluates structural similarity. Moreover, the performance of the GAN-based methods (SRGAN, SRGAN+L Saliency , and SG-GAN) is measured by additional indexes, including the inception score (IS) [80], sliced Wasserstein distance (SWD) [81], and Fréchet inception distance (FID) [82]. IS is given by Equation (13) and measures the diversity of generated images. FID compares the distributions of Inception embeddings of real and generated images as in Equation (14). SWD approximates the Wasserstein-1 distance between the real and the generated images and calculates the statistical similarity between local image patches extracted from Laplacian pyramid representations.
$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),$$

where MSE refers to the mean square error, computed over the C feature channels and the W × H pixels, between the SR image $x$ and the remote sensing HR image $y$, and MAX represents the maximum signal value that exists in the HR image.
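A minimal plain-Python sketch of this metric (operating on flattened pixel lists; `psnr` is our own helper name):

```python
import math

def psnr(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-sized flat images;
    max_val is the maximum possible signal value (255 for 8-bit)."""
    mse = sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(hr)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```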
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ denote the means and standard deviations of the SR image ($x$) and the remote sensing HR image ($y$), $\sigma_{xy}$ denotes the covariance of $x$ and $y$, and $C_1 = 0.01^2$ and $C_2 = 0.03^2$ are used to avoid division by zero.
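The same statistics translate directly into plain Python. This sketch computes a single global-window SSIM on flattened images normalized to [0, 1] (practical SSIM uses sliding windows and averages the local scores; `ssim` is our own helper name):

```python
def ssim(x, y):
    """Global single-window SSIM of two flat images with values in [0, 1]."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # stabilizers from Equation (12)
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

Identical images score 1, and the score drops as structure diverges.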
$$\mathrm{IS} = \exp\big( \mathbb{E}_{x \sim p_g}\, D_{KL}\big( p(y \mid x) \,\|\, p(y) \big) \big),$$

where $x$ is a generated image sampled from the learned generator distribution $p_g$, $\mathbb{E}$ is the expectation over the set of generated images, and $D_{KL}$ is the KL divergence between the conditional class distribution $p(y \mid x)$ and the marginal class distribution $p(y) = \mathbb{E}_{x \sim p_g}[p(y \mid x)]$.

$$\mathrm{FID} = \| m_r - m_g \|_2^2 + \mathrm{Tr}\big( C_r + C_g - 2 (C_r C_g)^{1/2} \big),$$

where $(m_r, C_r)$ and $(m_g, C_g)$ denote the mean and covariance of the real and generated image feature distributions, respectively.
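Both metrics can be sketched in plain Python. This is a minimal sketch: `inception_score` takes precomputed per-image class distributions p(y|x) rather than running an Inception network, and `fid_1d` is the scalar (1-D Gaussian) special case of Equation (14), since the full FID requires a matrix square root; both helper names are ours.

```python
import math

def inception_score(cond_probs):
    """IS from a list of per-image class distributions p(y|x)."""
    n, k = len(cond_probs), len(cond_probs[0])
    # marginal p(y) is the average of the conditionals
    marginal = [sum(p[j] for p in cond_probs) / n for j in range(k)]
    kl = sum(
        sum(pj * math.log(pj / mj)
            for pj, mj in zip(p, marginal) if pj > 0)
        for p in cond_probs
    )
    return math.exp(kl / n)

def fid_1d(mu_r, var_r, mu_g, var_g):
    """FID between two 1-D Gaussians (scalar case of Equation (14))."""
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
```

When every conditional equals the marginal, the KL term vanishes and IS hits its minimum of 1; perfectly diverse, confident predictions push it toward the number of classes. FID is zero only when the two Gaussians coincide.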

Implementation Details
To compare the improvement of our method over SRGAN, the parameters and computations of SG-GAN are set to be almost equal to or less than those of SRGAN. The batch size is 16, and the number of training epochs is 100. The optimization method is ADAM, with β1 = 0.9 and β2 = 0.99. We adopt the pre-trained saliency detection model BASNet, which is trained on the DUTS-TR [83] dataset. All experiments were carried out using the PyTorch framework.

Comparison with the Advanced Methods
In this section, numerous experiments are described on the five SR datasets mentioned above. We compare our proposed method with various advanced SR methods, including Bicubic, FSRCNN [42], SRResnet [13], RCAN [46] and SRGAN [13].
Quantitative Results by PSNR/SSIM. Table 1 presents quantitative comparisons for ×4 SR. Compared with all the above approaches, our SG-GAN performs the best in almost all cases. Qualitative Results. Figure 8 shows the qualitative results on the SR standard benchmarks, and Figure 9 presents the visual results on the MSRA10K salient object database. We observe that most of the compared methods suffer from blurred edges and noticeable artifacts, especially in the saliency regions of the image. Compared with the interpolation method (Bicubic), the details of the images generated by FSRCNN are improved; however, the results are still blurred and lack edge information. As networks deepen and the extracted deep features increase, SRResnet recovers more high-frequency information. RCAN adds an attention mechanism and obtains a better restoration effect. Images generated by SRGAN have strong visual impact, but many unreal details are also generated. In contrast, our SG-GAN produces much better results in recovering sharper saliency areas that are more faithful to the ground truth. These comparisons indicate that SG-GAN can better recover salient and informative components in HR images and shows competitive image SR results compared with other methods.

Application of Remote Sensing Image
Remote sensing images contain various objects and rich texture information. Therefore, super-resolution restoration of remote sensing images has always been a research hotspot. To verify the performance of SG-GAN, we conducted experiments on the five remote sensing datasets mentioned in Section 4.1. Table 2 shows the quantitative results for scale factor ×4. Among them, Bicubic, FSRCNN [42], SRResnet [13], RCAN [46], and SRGAN [13] are advanced SR methods, and LGCNet [78], DMCN [1], DRSEN [79], DCM [84], and AMFFN [85] are remote sensing SR methods. The results of the advanced SR methods are tested with pre-trained models on the DIV2K [86] dataset. For the remote sensing SR methods, we directly use the results given in the original papers. Compared with SRGAN, the PSNR of SG-GAN is improved by +0.89 dB on the NWPU VHR-10 dataset, +1.53 dB on the UCAS-AOD dataset, +0.91 dB on the AID dataset, +1.85 dB on the UC-Merced dataset, and +0.98 dB on the NWPU45 dataset. These quantitative results demonstrate that SG-GAN utilizes the additional constraint to supervise the saliency map of the remote sensing images, which yields SR images with higher quantitative indexes. Table 3 reports the generative quality evaluations of the GAN-based SR methods on AID and UC-Merced; red and blue indicate the best and second-best performance.

In order to fully demonstrate the effectiveness of SG-GAN, we present the ×4 SR visual results on the UCAS-AOD dataset [74] in Figure 10, the NWPU VHR-10 dataset [73] in Figure 11, the AID dataset [75] in Figure 12, the UC-Merced dataset [76] in Figure 13, and the NWPU45 dataset [77] in Figure 14. As the qualitative results demonstrate, the restoration difficulty of various parts differs. The non-saliency regions (e.g., ground, land, and lawn) are comparatively easier to super-resolve than the saliency regions, as the latter have complex structures. The main difference among super-resolution methods lies in the saliency regions. Bicubic is a commonly used image interpolation method in which surrounding pixel values are linearly combined to fill in the missing pixels; images obtained this way therefore appear blurry. The other methods are based on deep learning, and their effects are better than those of interpolation-based methods.
FSRCNN is a relatively shallow network composed of various feature processing layers connected in series. Limited by the depth of the network, FSRCNN only learns shallow feature information, although the reconstructed image can still obtain reasonable PSNR and SSIM results. SRResnet deepens the network through residual connections. It can capture richer mid-frequency and high-frequency information, precisely what low-resolution images lack. Therefore, the reconstructed image quality improves significantly as the depth of the network increases.
SRGAN and SRResnet were proposed simultaneously, and the biggest difference between them is that SRGAN contains a discriminator. Although the discriminator slows down network training, adversarial training between the two networks enables the generator to capture more feature information. However, the discriminator focuses on mid-frequency and high-frequency image features, reducing attention to edges and color. Therefore, the generated image shows a particular deviation between its shallow feature information and that of the ground-truth image. The proposed SG-GAN motivates the network to focus on the salient features of the image, which helps the network attend to the complex areas of the image and obtain a better reconstructed image, with reduced aliasing and blur artifacts and better reconstruction of high-fidelity details. From Figure 12, SG-GAN infers sharp edges in the saliency area, indicating that our method is capable of capturing the structural characteristics of objects in images. Our approach also recovers better textures than the compared SR methods in the qualitative comparisons on the other datasets. In other words, the structure and edges of the saliency region benefit from the designed saliency loss. Although SG-GAN also has a generative adversarial structure, its efficient mining and utilization of features makes up for the loss of shallow features. Experimental and visual results reflect the superiority of the SG-GAN algorithm.

Ablation Study
To validate our method further, we design an SRGAN+L Saliency model, which adds the saliency loss of Equation (6) to the SRGAN model. Detailed quantitative results of the ablation experiment are in Table 2. After adding the L Saliency loss to SRGAN, the PSNR on NWPU VHR-10, UCAS-AOD, AID, UC-Merced, and NWPU45 increases by 0.85 dB, 0.56 dB, 0.55 dB, 1.36 dB, and 0.17 dB, respectively. Figures 10-14 visualize the qualitative results on the UCAS-AOD, NWPU VHR-10, AID, UC-Merced, and NWPU45 datasets, respectively. It is easier to distinguish the pros and cons of each method in regions with complex structure. The results show that the model with the additional saliency loss outperforms SRGAN. The salient discrepancies of each part of the image implicitly report the diversity of the image's structure and features. Moreover, the loss gives extra care to more complex regions and plays a positive role in repairing the complete image. In other words, the saliency loss is helpful when dealing with the super-resolution problem of remote sensing images: it pays more attention to the details of the edges and structure of the image and accurately depicts the characteristics of the saliency area of the remote sensing image.

Discussion
Although the other advanced SR methods can generate detailed and explicit SR images, edge distortions and incorrect feature structures still exist compared with SG-GAN. This paper demonstrates that a proper loss function is essential for improving the results of remote sensing image SR. Several experiments have proved that increasing the attention paid to the salient parts of remote sensing images helps restore the graph structures of the saliency region and enhances image SR performance.
However, our proposed method shows weaknesses in reconstructing at arbitrary scales: it only performs favorably on remote sensing image super-resolution tasks with specific scale magnifications (e.g., ×2 and ×4). Therefore, applying SR with more extensive and even arbitrary magnifications is a future direction for our work.

Conclusions
This paper proposes a saliency-guided remote sensing image super-resolution method (SG-GAN), which considers saliency differences in various regions of remote sensing images. As the structure of the saliency area of the image is more complex and the information it contains is relatively rich, increasing the attention paid to the saliency area benefits the quality of the super-resolution image. Therefore, this paper utilizes the salient object detection network to construct a saliency loss function that helps the generator capture more useful information. At the same time, the strategy of adversarial learning plays a crucial role in improving the ability of SG-GAN to grasp characteristic details. This paper has conducted experiments on the saliency dataset, standard datasets, and remote sensing datasets, confirming that focusing more on the saliency area of the image is beneficial for improving the image's resolution. Quantitative and qualitative experimental results have shown the effectiveness of the proposed method.

Acknowledgments:
The authors would like to thank all colleagues in the laboratory for their generous help. The authors would like to thank the anonymous reviewers for their constructive and valuable suggestions.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: