Improved SRGAN for Remote Sensing Image Super-Resolution Across Locations and Sensors

Abstract: Detailed and accurate information on the spatial variation of land cover and land use is a critical component of local ecology and environmental research. For these tasks, high spatial resolution images are required. Considering the trade-off between high spatial and high temporal resolution in remote sensing images, many learning-based models (e.g., convolutional neural networks, sparse coding, Bayesian networks) have been established to improve the spatial resolution of coarse images in both the computer vision and remote sensing fields. However, the training and testing data in these learning-based methods are usually limited to a certain location and a specific sensor, which limits the ability of the model to generalize across locations and sensors. Recently, generative adversarial nets (GANs), a new learning model from the deep learning field, have shown many advantages for capturing high-dimensional nonlinear features over large samples. In this study, we test whether the GAN method, with some modification, can improve the generalization ability across locations and sensors and thereby accomplish the idea of “train once, apply everywhere and to different sensors” for remote sensing images. This work is based on super-resolution generative adversarial nets (SRGANs): we modify the loss function and the network structure of the SRGAN and propose the improved SRGAN (ISRGAN), which makes model training more stable and enhances the generalization ability across locations and sensors. In the experiment, the training and testing data were collected from two sensors (Landsat 8 OLI and Chinese GF 1) at different locations (Guangdong and Xinjiang in China). For the cross-location test, the model was trained in Guangdong with the Chinese GF 1 (8 m) data and tested with the GF 1 data in Xinjiang. For the cross-sensor test, the same model trained in Guangdong with GF 1 was tested on Landsat 8 OLI images in Xinjiang. The proposed method was compared with the neighbor-embedding (NE) method, the sparse representation method (SCSR), and the SRGAN. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were chosen for the quantitative assessment. In the cross-location test, the ISRGAN (PSNR: 35.816, SSIM: 0.988) was superior to the NE (PSNR: 30.999, SSIM: 0.944) and SCSR (PSNR: 29.423, SSIM: 0.876) methods and to the SRGAN (PSNR: 31.378, SSIM: 0.952). A similar result was seen in the cross-sensor test, where the ISRGAN had the best result (PSNR: 38.092, SSIM: 0.988) compared to the NE (PSNR: 35.000, SSIM: 0.982) and SCSR (PSNR: 33.639, SSIM: 0.965) methods and the SRGAN (PSNR: 32.820, SSIM: 0.949). We also tested the improvement in land cover classification accuracy before and after super-resolution by the ISRGAN. The results show that the accuracy of land cover classification was significantly improved after super-resolution; in particular, the accuracy for the impervious surface class (roads and buildings with high-resolution texture) improved by 15%.


Introduction
Detailed and accurate information on the spatial variation of land cover and land use is a critical component of local ecology and environmental research. For these tasks, high spatial resolution images are required to capture the temporal and spatial dynamics of the earth's surface processes [1]. Because remote sensing images trade off spatial against temporal resolution, there are at present two main methods for obtaining images with high spatio-temporal resolution: (1) multi-source image fusion models [2,3] and (2) image super-resolution models [4,5].
Compared to the image fusion models, super-resolution methods do not require auxiliary high spatial resolution data from the same location and a similar observation date to predict the detailed spatial texture, which makes them applicable to more scenarios in both the computer vision and remote sensing fields.
The basic assumption of the image super-resolution model is that missing details in a low spatial resolution image can be either reconstructed or learned from other high spatial resolution images if these images follow the same resampling process as was used to create the low spatial resolution image. Based on this assumption, efforts in the last decade have focused on accurately predicting the point spread function (PSF), which represents the mixture process that forms the low-resolution pixels. There are mainly three groups of methods: (1) interpolation-based methods, (2) refactoring-based methods, and (3) learning-based methods (Table 1).
Firstly, the interpolation-based methods [6][7][8] calculate the pixel value of the target point to be restored from the surrounding points according to a fixed mathematical strategy, which gives them low complexity and high efficiency. However, the edge effect in the resulting image is obvious, and the image details cannot be recovered since no new information is produced in the interpolation process.
Secondly, the refactoring-based methods [9][10][11] model the imaging process and integrate different information from the same scene to obtain high-quality reconstruction results. These methods usually trade time differences for spatial resolution improvement, which requires pre-registration and a large amount of calculation.
Thirdly, the learning-based methods [12][13][14][15][16][17][18][19][20] overcome the difficulty that the refactoring-based methods have in determining the resolution improvement factor and can be applied to a single image; they are the main development direction of current super-resolution reconstruction. In this category, commonly used methods include the neighbor-embedding method (NE) [21], the sparse representation method (SCSR) [22], and deep learning methods.
Table 1. Comparison of super-resolution methods.

Interpolation-based methods
Assumption: The value of the current pixel can be represented by the nearby pixels.
Methods: Nearest neighbor interpolation [6]; bilinear interpolation [7]; bicubic interpolation [8].
Advantages: Low complexity and high efficiency.
Disadvantages: No image texture detail can be predicted, and the image usually looks smoother.

Refactoring-based methods
Assumption: The physical properties and features can be recovered by image reconstruction (RE) technology, and the rules of the point spread function (PSF) can be further applied to recover detail.
Methods: Joint MAP registration [9]; sparse regression and natural image prior [11]; kernel regression PSF deconvolution [23].
Advantages: Different information on the same scene is fused to obtain high-quality reconstruction results.
Disadvantages: Requires pre-registration and a large amount of calculation.

Learning-based methods
Assumption: The point spread function can be created by learning from a large number of image samples [24].
Methods: Neighbor-embedding (NE) [21]; convolutional neural network (SRCNN) [25]; Bayesian networks [26]; kernel-based methods [27]; SVM-based methods [28]; sparse representation (SCSR) [22]; super-resolution GAN (SRGAN) [29].
Advantages: Better performance when the training samples are more like the target image; can achieve a higher PSNR when a large number of samples are involved.
Disadvantages: Highly time-consuming, requires big training datasets, and usually has limited generalization ability across datasets.

Remote Sens. 2020, 12, 1263

The learning-based methods usually require highly representative training samples that cover the data variation of the whole population. In practice, a large number of training samples from different sources are usually collected to achieve this goal. However, in the remote sensing field, it is almost impossible to prepare such a training sample set because the variation of remote sensing data depends not only on the variation of the object but also on the location and the satellite sensor. Due to this limitation, many learning-based methods are restricted to a certain location and a specific sensor, resulting in the limited generalization ability of the model across locations and sensors. Producing one super-resolution model for different locations and different satellite sensors therefore remains a challenge.
In recent years, with the rapid development of artificial intelligence, especially neural network-based deep learning methods, deep learning has been widely applied in the field of computer vision because of its advantages in fitting nonlinear processes over large samples. One benefit of these models is their ability to handle large sample sets while retaining a good generalization ability. In the field of image super-resolution, the super-resolution CNN model (SRCNN) was first presented by Dong [25] in 2014. Compared with traditional image super-resolution methods, it achieves a higher peak signal-to-noise ratio, but when the upsampling ratio is high, the reconstructed image becomes too smooth and details are lost. To overcome this shortcoming, a super-resolution generative adversarial network model (SRGAN) was presented, which replaces the original CNN structure with a generative adversarial network (GAN) [29]. As the newest learning model from the deep learning field, the SRGAN shows many advantages for capturing high-dimensional nonlinear features over large samples. However, the generalization ability of the SRGAN model in remote sensing images across different locations and sensors remains unknown.
In this study, we test whether the GAN-based method, with some modifications, can improve the generalization ability across locations and sensors, so that a model can be trained once and applied everywhere and to different sensors. Our work is based on the SRGAN model, of which we modify the loss function and the network structure. The major contributions of this study are as follows:

• We propose the improved SRGAN (ISRGAN), which stabilizes the model training and enhances the generalization ability across locations and sensors.

The structure of this paper is as follows: the overall workflow (Section 2.1), the review of the original SRGAN (Section 2.2), and the problems of the original SRGAN and the corresponding improvements (Section 2.3) are described in the Methods section (Section 2). The study area, dataset, and assessment method are explained in the Experiments section (Section 3). Section 4 describes the main results and findings of our study. In Section 5, we discuss the advantages and possible further work, followed by the Conclusion section (Section 6).

Workflow
In this paper, the experiment on whether the super-resolution model generalizes across locations and sensors is divided into three parts, as presented in Figure 1. The blue part represents model training, which used the GF 1 data in Guangdong. The green part indicates the test of generalization ability across locations, using the GF 1 data in Xinjiang. The orange part indicates the test of generalization ability across sensors, mainly using the Landsat 8 data in Xinjiang. First, we trained the ISRGAN on the Guangdong GF 1 data and obtained the super-resolution model. At the same time, we divided the test set into three parts, namely test dataset 1 (GF 1 data in Guangdong province), test dataset 2 (GF 1 data in Xinjiang province), and test dataset 3 (Landsat 8 data in Xinjiang province).
For the cross-location test (the green part), we applied the model to test datasets 1 and 2, computed the PSNR and SSIM for each, and conducted a t-test on these values to determine whether the model had generalization ability across locations. For the cross-sensor test (the orange part), we applied the model to test datasets 2 and 3, computed the PSNR and SSIM for each, and conducted a t-test in the same way to determine whether the model had generalization ability across sensors.

SRGAN Review
The GAN is a deep learning model proposed by Goodfellow et al. [30] in 2014. Its structure is inspired by the two-person zero-sum game in game theory. The framework consists of a generator (G) and a discriminator (D): the generator learns the distribution of the real sample data and generates new sample data, and the discriminator is a binary classifier that distinguishes whether the data come from real samples or generated samples. The structure diagram of the GAN is shown in Figure 2. The optimization process of the GAN is a minimax problem, which is completed when the generator and discriminator reach the Nash equilibrium. GANs have become a hot research direction in machine learning, and computer vision is at present the field in which they are most widely studied and applied.
The SRGAN is a super-resolution network structure proposed by Ledig et al. [29] in a paper published at the 2017 CVPR conference, which brought the effect of super-resolution to a new height. The SRGAN is trained as a GAN and consists of a generator and a discriminator. The generator uses a ResNet structure [31]: the former part of the network consists of several residual blocks, each containing two 3 × 3 convolution layers followed by batch normalization layers and activated with the ReLU function. In the latter part, two subpixel network modules are added to increase the image size, so that the generator learns high-resolution image details in the front layers during training and increases the image resolution only later, which reduces the computing resources required. The discriminator adopts the VGG-19 network structure [32], including eight convolution layers, where the LeakyReLU function is used as the activation function of the hidden layers; finally, a fully connected layer and a sigmoid activation function give the probability that the input image is a real high-resolution image rather than a generated one.
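The subpixel upsampling step described above can be illustrated with a minimal sketch. It assumes one common convention in which r × r low-resolution feature maps are interleaved into a single map that is r times larger in each direction; the function name `pixel_shuffle` and the channel ordering are illustrative assumptions and are not taken from the SRGAN code.

```python
def pixel_shuffle(channels, r):
    """Interleave r*r feature maps of size H x W into one map of size rH x rW.

    `channels` is a list of r*r 2-D maps (lists of rows). Channel c is placed
    at sub-pixel offset (c // r, c % r); this ordering is an assumption.
    """
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for c, fmap in enumerate(channels):
        dy, dx = divmod(c, r)
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = fmap[y][x]
    return out


# Four 1 x 1 feature maps become one 2 x 2 map (r = 2).
upsampled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
print(upsampled)
```

Applying two such ×2 steps in sequence would give the overall ×4 enlargement, while all convolutions before them operate on the smaller low-resolution grid.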
The main innovation of the SRGAN is a perceptual loss based on a neural network, which replaces the mean square error (MSE) content loss with a loss calculated on the feature maps of the VGG-19 network. The original pixel-level MSE loss ($L^{SR}_{mse}$) and the feature loss based on the VGG-19 network ($L^{SR}_{VGG/i,j}$) are shown in Equations (1) and (2), respectively:

$$L^{SR}_{mse} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^2 \tag{1}$$

$$L^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\left(G_{\theta_G}(I^{LR})\right)_{x,y} \right)^2 \tag{2}$$

where $r$ represents the ratio between the spatial resolution of the high- and low-resolution images, $W$ and $H$ respectively represent the pixel numbers of the width and height of the low-resolution image, $W_{i,j}$ and $H_{i,j}$ represent the dimensions of the corresponding feature map in the VGG-19 network, $\phi_{i,j}$ represents the feature map obtained by the $j$th convolution before the $i$th max-pooling layer within the VGG-19 network, $I_{x,y}$ represents the gray value of the image at the point $(x, y)$ (with $I^{HR}$ and $I^{LR}$ the high- and low-resolution images), and $G_{\theta_G}(I^{LR})$ represents the reconstructed image.
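To make the difference between the two content losses concrete, the following sketch compares a pixel-level MSE with an MSE computed in a feature space. The function `toy_features` is a hypothetical stand-in for a VGG-19 feature map (it only takes horizontal differences), and the image values are made up for illustration.

```python
def mse_loss(a, b):
    """Mean squared error between two equally sized 2-D arrays (lists of rows)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += (va - vb) ** 2
            count += 1
    return total / count


def toy_features(img):
    """Hypothetical stand-in for a VGG-19 feature map: horizontal differences."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in img]


def feature_loss(hr, sr, features=toy_features):
    """MSE between feature maps instead of between raw pixels."""
    return mse_loss(features(hr), features(sr))


hr = [[0, 10, 0, 10], [10, 0, 10, 0]]       # textured reference image
blurred = [[5, 5, 5, 5], [5, 5, 5, 5]]      # smooth output, texture lost
textured = [[6, 16, 6, 16], [16, 6, 16, 6]]  # texture kept, brightness offset

print(mse_loss(hr, blurred), mse_loss(hr, textured))
print(feature_loss(hr, blurred), feature_loss(hr, textured))
```

In this toy case the pixel-level loss prefers the smooth output, whereas the feature-space loss prefers the output that keeps the texture, which is the behaviour the perceptual loss is meant to encourage.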

Problems in SRGAN
Because of gradient disappearance and mode collapse in the SRGAN training process, the model does not generalize well for remote sensing image super-resolution across locations and sensors. Next, we analyze the nature of these phenomena.
The first phenomenon, gradient disappearance, can be explained as follows. In the training process of the SRGAN, we want the discriminator to be strong enough to distinguish the samples well, that is, to output 1 for a real high-resolution sample image and 0 for a generated high-resolution sample image. Therefore, the discriminator loss function of the original SRGAN is shown in Equation (3):

$$\max_D V(G, D) = E_{HR \sim P_r}[\log D(HR)] + E_{SR \sim P_g}[\log(1 - D(SR))], \quad SR = G(LR) \tag{3}$$

where $V(G, D)$ represents the difference between the ground-truth high-resolution (HR) image and the model-generated image, HR and LR represent high-resolution images and low-resolution images, respectively, $G$ and $D$ represent the generator and discriminator, respectively, $P_r$ and $P_g$ represent the distributions of the real HR image and the generated super-resolution (SR) image, and $E$ represents the expectation.
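The generator wants the images it generates to be marked as 1 by the discriminator, so the adversarial loss function of the generator is:

$$\min_G V(G, D) = E_{SR \sim P_g}[\log(1 - D(SR))] \tag{4}$$

where $SR = G(LR)$ is the generated super-resolution image. Therefore, the loss function of the SRGAN is shown in Equation (5):

$$\min_G \max_D V(G, D) = E_{HR \sim P_r}[\log D(HR)] + E_{SR \sim P_g}[\log(1 - D(SR))] \tag{5}$$

The training process of the SRGAN is divided into two steps. The first step is to fix the generator and train the discriminator. Writing the expectations in Equation (3) as integrals over an image $x$ gives:

$$V(G, D) = \int_{HR} P_r \log(D(HR))\, d(HR) + \int_{SR} P_g \log(1 - D(SR))\, d(SR) = \int_x \left[ P_r(x) \log D(x) + P_g(x) \log(1 - D(x)) \right] dx \tag{6}$$

In order to optimize the discriminator's ability to distinguish data sources, Equation (6) is maximized with respect to $D(x)$ at every $x$, which gives:

$$D^*_G = \frac{P_r}{P_r + P_g} \tag{7}$$

where $D^*_G$ is the optimal discriminator. The second step is to fix the discriminator trained in the first step and optimize the generator. Substituting Equation (7) into Equation (5) gives the new form of $V(G, D)$ shown in Equation (8):

$$V(G, D^*_G) = E_{x \sim P_r}\left[\log \frac{P_r}{P_r + P_g}\right] + E_{x \sim P_g}\left[\log \frac{P_g}{P_r + P_g}\right] \tag{8}$$

In addition, the Kullback-Leibler (KL) divergence and the Jensen-Shannon divergence (JSD) can be expressed as Equations (9) and (10):

$$KL(P_1 \| P_2) = E_{x \sim P_1}\left[\log \frac{P_1}{P_2}\right] \tag{9}$$

$$JSD(P_1 \| P_2) = \frac{1}{2} KL\left(P_1 \,\Big\|\, \frac{P_1 + P_2}{2}\right) + \frac{1}{2} KL\left(P_2 \,\Big\|\, \frac{P_1 + P_2}{2}\right) \tag{10}$$

where $P_1$ and $P_2$ represent the two probability distributions. By substituting Equations (9) and (10) into Equation (8), we get the final form of $V(G, D)$:

$$V(G, D^*_G) = 2\, JSD(P_r \| P_g) - 2 \log 2 \tag{11}$$

where $JSD(P_r \| P_g)$ represents the JS divergence between $P_r$ and $P_g$. As can be seen from Equation (11), $V(G, D)$ reaches its minimum value, and the generation effect is best, only if $P_r = P_g$. In practice, however, the generated distribution can come arbitrarily close to the real distribution, but the two never overlap completely, and for distributions that do not overlap the JS divergence is the constant $\log 2$ no matter how far apart they are. According to Equation (11), $V(G, D)$ equals $-2\log 2$ only when the real distribution and the generated distribution overlap completely; otherwise, it always equals 0. Therefore, when using the gradient descent method, the generator cannot get any gradient information, which means it faces the problem of gradient disappearance.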
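The constant value of the objective for non-overlapping distributions can be checked numerically. The sketch below evaluates the relation V = 2·JSD − 2·log 2 on small discrete histograms; the histograms and the function names are illustrative assumptions, chosen only so that the supports are disjoint.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def v_at_optimal_d(p_r, p_g):
    """Value of the GAN objective at the optimal discriminator."""
    return 2 * jsd(p_r, p_g) - 2 * math.log(2)


p_r = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]   # "real" distribution
near = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]  # disjoint, but close to p_r
far = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]   # disjoint and further away

print(v_at_optimal_d(p_r, near), v_at_optimal_d(p_r, far))
print(v_at_optimal_d(p_r, p_r))
```

Both disjoint cases give the same value, so moving the generated distribution closer to the real one does not change the objective until the supports overlap, which is why no useful gradient reaches the generator.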
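The second phenomenon, mode collapse, can be explained as follows. The generator loss function can also be written as:

$$E_{x \sim P_g}[-\log D(x)] \tag{12}$$

With the optimal discriminator $D^* = P_r / (P_r + P_g)$, the KL divergence between $P_g$ and $P_r$ can be transformed into a form containing $D^*$:

$$KL(P_g \| P_r) = E_{x \sim P_g}\left[\log \frac{P_g}{P_r}\right] = E_{x \sim P_g}\left[\log \frac{1 - D^*(x)}{D^*(x)}\right] = E_{x \sim P_g}[\log(1 - D^*(x))] - E_{x \sim P_g}[\log D^*(x)] \tag{13}$$

From Equations (12) and (13), together with the relation $E_{x \sim P_r}[\log D^*(x)] + E_{x \sim P_g}[\log(1 - D^*(x))] = 2\, JSD(P_r \| P_g) - 2\log 2$, it can be concluded that the loss function is equivalent to:

$$E_{x \sim P_g}[-\log D^*(x)] = KL(P_g \| P_r) - 2\, JSD(P_r \| P_g) + 2 \log 2 + E_{x \sim P_r}[\log D^*(x)] \tag{14}$$

Since only the first two terms depend on the generator (G), the final loss function of the generator is equivalent to minimizing the following function:

$$KL(P_g \| P_r) - 2\, JSD(P_r \| P_g) \tag{15}$$

According to Equation (15), training the generator requires the KL divergence to be reduced while the JS divergence is increased; these two goals contradict each other, resulting in unstable training. At the same time, the KL divergence is asymmetric: when the generator produces an unrealistic image ($P_g$ large where $P_r$ is close to 0), the penalty is very high, whereas when it fails to produce some kind of real image ($P_g$ close to 0 where $P_r$ is large), the penalty is close to 0. The generator therefore prefers to produce a few safe images that resemble real images, with a lower penalty, rather than diverse ones. The result of this phenomenon is that the generator tends to produce similar images, which is called the mode collapse problem.
The two problems stated above lead to the unstable performance of the SRGAN in the training process and hence to a poor generalization ability for remote sensing image super-resolution across locations and sensors.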
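The asymmetry of the KL divergence behind the mode collapse argument can be seen from the pointwise contribution P_g(x)·log(P_g(x)/P_r(x)). The following sketch evaluates that term for two made-up situations; the probability values are arbitrary assumptions chosen only to show the order of magnitude.

```python
import math


def kl_term(p_g, p_r):
    """Pointwise contribution p_g * log(p_g / p_r) to KL(P_g || P_r)."""
    return p_g * math.log(p_g / p_r)


# The generator almost never produces a kind of image that is common in the
# real data (a missed mode): the penalty is close to zero.
missed_mode = kl_term(1e-6, 0.5)

# The generator often produces a kind of image that almost never occurs in the
# real data (an unrealistic sample): the penalty is large.
unrealistic = kl_term(0.5, 1e-6)

print(missed_mode, unrealistic)
```

Because missing a mode costs almost nothing while an unrealistic sample is penalized heavily, a generator trained on this term is pushed towards a small set of safe outputs.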

Improved SRGAN
Inspired by the idea of WGANs [33], we modified the loss function and part of the network structure to address the weak generalization ability caused by the instability of the SRGAN in the training process, and we propose the ISRGAN. In this paper, the Wasserstein distance is used to replace the KL divergence and JS divergence. The calculation of the Wasserstein distance is shown in Equation (16):

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} E_{(x, y) \sim \gamma}\left[\, \|x - y\| \,\right] \tag{16}$$

where $W(P_r, P_g)$ represents the Wasserstein distance between $P_r$ and $P_g$, $\gamma$ represents a joint distribution of the real HR image and the generated image, $\Pi(P_r, P_g)$ represents the set of all possible joint probability distributions whose marginal distributions are $P_r$ and $P_g$, $\|x - y\|$ represents the distance between a pair of samples $x$ and $y$ drawn from $\gamma$, $\gamma(x, y)$ represents how much "mass" must be transported from $x$ to $y$ in order to transform the distribution $P_r$ into the distribution $P_g$, and inf represents the greatest lower bound of the expectation over all possible joint distributions. However, when learning the image distribution, the random variable has thousands of dimensions, and it is difficult to solve the resulting linear programming problem directly. Therefore, we convert it into the dual form:

$$W(P_r, P_g) = \frac{1}{K} \sup_{\|f\|_L \le K} \left( E_{x \sim P_r}[f(x)] - E_{x \sim P_g}[f(x)] \right) \tag{17}$$

where $f$ is a function satisfying the Lipschitz condition, $K$ is the Lipschitz constant of $f$, and sup represents the least upper bound of the difference of expectations. Therefore, based on the SRGAN structure and the idea of the WGAN, this paper modifies the network structure and the loss function used during training, as summarized below:
(1) The sigmoid was removed from the last layer of the discriminator, transforming the classification problem into a regression problem;
(2) The loss functions of the generator and discriminator no longer take the logarithm;
(3) During training, after each update of the discriminator, the absolute values of its parameters were truncated to no more than a fixed constant;
(4) The convolution kernel of the last layer of the generator was changed from 1 × 1 to 9 × 9.
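Modifications (1) to (3) can be sketched in a framework-free form as follows. This is a minimal illustration of WGAN-style losses and parameter truncation, not the authors' TensorFlow code: the function names are hypothetical, and the clipping constant 0.01 is an assumed value (the text only states that a fixed constant is used). A one-dimensional Wasserstein distance on histograms is included to show that, unlike the JS divergence, it keeps changing as two non-overlapping distributions move apart.

```python
def critic_loss(scores_real, scores_fake):
    """Discriminator (critic) loss without logarithm: mean(fake) - mean(real)."""
    return sum(scores_fake) / len(scores_fake) - sum(scores_real) / len(scores_real)


def generator_adv_loss(scores_fake):
    """Generator adversarial loss without logarithm: -mean(fake)."""
    return -sum(scores_fake) / len(scores_fake)


def clip_weights(weights, c=0.01):
    """Truncate every parameter to the interval [-c, c] (c is an assumed value)."""
    return [max(-c, min(c, w)) for w in weights]


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on the same unit-spaced bins."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


p_r = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]
far = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]

print(wasserstein_1d(p_r, near), wasserstein_1d(p_r, far))
print(clip_weights([0.5, -0.2, 0.005]))
```

The raw scores passed to `critic_loss` correspond to the discriminator output with the sigmoid removed, so they are unbounded real numbers rather than probabilities.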

Study Area
The study area included parts of the Guangdong-Hong Kong-Macao Greater Bay Area and Yuli County, Xinjiang. The Guangdong-Hong Kong-Macao Greater Bay Area is located in the Pearl River Delta of Guangdong and is the fourth largest bay area in the world after the New York Bay Area, the San Francisco Bay Area, and the Tokyo Bay Area. The economic foundation of this region is solid, its development potential is huge, and it holds a pivotal position. Yuli County, located in the middle of Xinjiang, is an important transportation hub in Xinjiang. The area is particularly rich in mineral and tourist resources and is known as the "back garden" of Korla. The two areas are representative for the study of the super-resolution of remote sensing images and subsequent ground-object extraction. Figure 3 shows the image coverage of GF 1 in the study area.

Data and Datasets
The experimental data included the GF 1 satellite data (Land observation satellite data service platform: http://218.247.138.119:7777/DSSPlatform/index.html) and the Landsat 8 satellite data (the USGS Earth Explorer: https://earthexplorer.usgs.gov/). The GF 1 satellite was successfully launched from the Jiuquan Satellite Launch Center on April 26, 2013, and is the first satellite of China's major high-resolution earth observation system project. The GF 1 satellite is equipped with two 2-m resolution panchromatic cameras, two 8-m resolution multispectral cameras, and four 16-m resolution multispectral wide-field cameras; the multispectral cameras of the PMS sensor contain four bands: blue, green, red, and near-infrared. Landsat 8 was launched by NASA on February 11, 2013, and is equipped with two sensors: the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). The OLI sensor contains nine bands, including coastal, blue (B), green (G), red (R), near-infrared (NIR), short wave infrared 1 (SWIR1), and short wave infrared 2 (SWIR2). The details are shown in Tables 2 and 3. Table 4 shows the data details used in this paper. Considering the difference in the spectral range between the two kinds of satellite data in each channel, the experimental data selected in this paper include only the three RGB bands. The experiment requires the pixel-size ratio between the high-resolution and low-resolution images to be strictly 1:4, but the GF 1 PMS sensor data and the Landsat 8 OLI sensor data are not in a strict 1:4 relation. To guarantee the feasibility of the experiment, we therefore used the GF 1 data, cut them into images of 256 × 256 pixels, and obtained the corresponding images of 64 × 64 pixels through cubic subsampling. Finally, we selected 3200 images of 256 × 256 pixels and their subsampled counterparts as the high-resolution and low-resolution training images, respectively.
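The preparation of training pairs described above can be sketched as follows. This is a simplified, single-band illustration under stated assumptions: the helper names are hypothetical, the input is a synthetic array rather than GF 1 data, and block averaging is used as a simple stand-in for the cubic subsampling mentioned in the text.

```python
def tile(image, size):
    """Cut a 2-D image (list of rows) into non-overlapping size x size patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches


def downsample(patch, factor=4):
    """Block-average downsampling (a stand-in for cubic resampling)."""
    out = []
    for by in range(len(patch) // factor):
        row = []
        for bx in range(len(patch[0]) // factor):
            block = [patch[by * factor + dy][bx * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Synthetic 300 x 520 single-band scene; real data would be read from file.
scene = [[(x + y) % 256 for x in range(520)] for y in range(300)]

hr_patches = tile(scene, 256)
pairs = [(downsample(hr, 4), hr) for hr in hr_patches]  # (LR 64x64, HR 256x256)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

Incomplete tiles at the right and bottom edges are discarded in this sketch, so every pair has exactly the 64 × 64 and 256 × 256 sizes required by the ×4 setting.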

Network Parameter Setting and Training
The scale factor between the high-resolution (HR) image and the low-resolution (LR) image used in the experiment was ×4; the LR images were obtained by downsampling the HR images by a factor of four, using the nearest-neighbor method in Python. During training, the batch size was set to 16, and the training process was divided into two steps. In the first step, the ResNet generator [31] was trained with the mean square error (MSE) between the generated high-resolution image and the real high-resolution image, namely the traditional pixel-based loss; the learning rate was initialized to 10^-4, and training ran for 100 epochs in total. In the second step, we used the model trained in the first step as the initialization of the generator. Pre-training with the pixel-based loss helps the GAN-based method achieve a better effect, for two reasons: (1) the high-resolution images generated by the pre-trained generator are already relatively good images for the discriminator, so the subsequent training pays more attention to texture details; (2) it helps the generator avoid being trapped in a local optimum. The initial learning rate for the generator training was 10^-4 and was halved every 250 iterations, with training running for 500 epochs in total. We used the RMSProp optimization algorithm to update the generator and discriminator alternately until the model converged. The model was implemented using the TensorFlow framework (Google Inc., https://www.tensorflow.org) and was trained on four NVIDIA GeForce GTX TITAN X GPUs. The code is modified from the SRGAN code, which can be freely downloaded from the GitHub website (https://github.com/tensorlayer/srgan).
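The step-wise learning rate decay used in the second training stage can be written as a small function. This is only a sketch: the function name is hypothetical, and whether the counter is an iteration or an epoch index is left to the caller, since the text mentions both units.

```python
def learning_rate(step, initial=1e-4, decay_every=250, factor=0.5):
    """Learning rate after `step` updates, multiplied by `factor` every `decay_every` steps."""
    return initial * factor ** (step // decay_every)


for step in (0, 249, 250, 499):
    print(step, learning_rate(step))
```

With these defaults the rate stays at the initial value for the first 250 steps and is halved once for the remaining steps up to 500.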

Assessment
In this paper, we adopted the peak signal-to-noise ratio (PSNR) [34] and the structural similarity index measurement (SSIM) [35] as the evaluation indexes for the experimental results. Usually, after an image is processed (for example, compressed), its spectrum changes, so the output image differs somewhat from the original image. The PSNR is commonly used to measure the quality of the processed image and whether a processing procedure meets the expected requirements. It is calculated as follows:

$$PSNR = 10 \log_{10}\left(\frac{L^2}{MSE}\right) \tag{18}$$

where $L$ is the maximum image value and the MSE is calculated as shown in Equation (19):

$$MSE = \frac{1}{N} \sum_{n=1}^{N} (I_n - P_n)^2 \tag{19}$$

where $N$ is the number of pixels, $I_n$ refers to the gray value of the $n$th pixel of the original image, and $P_n$ refers to the gray value of the $n$th pixel after processing. The unit of the PSNR is dB, and the higher the value, the better the image quality. SSIM is a method used to measure the subjective experience quality of television, film, or other digital images and video. This method was first proposed by the image and video engineering laboratory of the University of Texas at Austin and then developed in cooperation with New York University. The SSIM algorithm is used to test the similarity of two images, and its measurement or prediction of image quality is based on uncompressed or undistorted images as a reference. The model measures image similarity in brightness, contrast, and structure. Its calculation is shown in Equation (20):

$$SSIM(X, Y) = \frac{(2 u_X u_Y + C_1)(2 \sigma_{XY} + C_2)}{(u_X^2 + u_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)} \tag{20}$$

where $u_X$ and $u_Y$ represent the means of the gray values of image X and image Y, respectively, $\sigma_X^2$ and $\sigma_Y^2$ represent the variances of the gray values of image X and image Y, respectively, and $\sigma_{XY}$ represents their covariance. Generally, $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, $K_1 = 0.01$, $K_2 = 0.03$, and $L$ is the maximum image value. The range of SSIM is (0,1); the larger its value, the less image distortion there is.
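The two indexes can be computed with a few lines of code. The sketch below works on flattened lists of gray values and, as a simplifying assumption, evaluates SSIM once over the whole image; common implementations instead average the index over local windows. The function names, the 8-bit peak value of 255, and the sample pixel values are illustrative.

```python
import math


def mse(ref, out):
    """Mean squared error between two equally long lists of gray values."""
    return sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)


def psnr(ref, out, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(ref, out)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def ssim_global(x, y, peak=255.0, k1=0.01, k2=0.03):
    """SSIM computed from global statistics (single window, an assumption)."""
    n = len(x)
    ux, uy = sum(x) / n, sum(y) / n
    var_x = sum((a - ux) ** 2 for a in x) / n
    var_y = sum((b - uy) ** 2 for b in y) / n
    cov = sum((a - ux) * (b - uy) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    return ((2 * ux * uy + c1) * (2 * cov + c2)) / (
        (ux ** 2 + uy ** 2 + c1) * (var_x + var_y + c2))


reference = [0, 255, 0, 255]
processed = [5, 250, 5, 250]
print(psnr(reference, processed), ssim_global(reference, processed))
```

For multi-band images, one option is to compute the indexes per band and average them; the text does not state which convention was used.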

High Spatial Resolution Image of ISRGAN across Locations and Sensors
Based on the trained ISRGAN super-resolution model, we tested the ISRGAN on a GF 1 dataset from Guangdong, a GF 1 dataset from Xinjiang, and a Landsat 8 dataset from Xinjiang. Some test results are shown in Figures 4-6. As can be seen from the examples, the slope of the relationship between the gray values of the image after super-resolution and those of the real image is close to 1 in all three bands, so the model maintains the spectral information of the original image.
In addition, in order to verify the generalization ability of the super-resolution model across locations and sensors, we adopted the following criterion: if the evaluation indexes on two datasets follow a normal distribution and their mean values are not significantly different, the model is considered to perform approximately the same on the two datasets. Therefore, when verifying the generalization ability of the model across locations, we conducted t-tests on the evaluation indexes of the Guangdong (GF 1) dataset and the Xinjiang (GF 1) dataset. When verifying the generalization ability of the model across sensors, we also conducted a t-test on the evaluation indexes of the Xinjiang (GF 1) dataset and the Xinjiang (Landsat) dataset by using the R software with the "car" package (https://www.rdocumentation.org/packages/car/versions/3.0-3), for which the confidence level was 95%. The test results are shown in Tables 5 and 6, respectively. In Tables 5 and 6, we first judge whether the variances of the two groups of data differ significantly, using Levene's test for equality of variances: if the Sig value is greater than 0.05, there is no significant difference in variance. After judging the variance, a t-test was performed on the mean values. Similarly, if the Sig (2-tailed) value is greater than 0.05, there is no significant difference in the mean values of the two groups of data. As can be seen from Table 5, there is no significant difference in the mean values of the two groups of data in the PSNR and SSIM. Therefore, the model has generalization ability across locations. As can be seen from Table 6, there is no significant difference between the two groups of data in the SSIM, while there is a significant difference between the two groups of data in the PSNR. Since the PSNR only considers the difference in pixel gray values between the two images, while the SSIM comprehensively considers the brightness, contrast, structure, and other information, we conclude on the basis of the SSIM that the model has generalization ability across sensors.
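The comparison of mean evaluation indexes between two test sets can be illustrated with a standard-library sketch. Note the substitutions: it uses Welch's t statistic, which does not assume equal variances, instead of the Levene test followed by a t-test as run in R, and it approximates the two-sided p-value with the normal distribution, which is only reasonable for large samples. The PSNR lists are synthetic values made up for the example.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (assumes large df)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Synthetic PSNR values for three hypothetical test sets.
set_a = [30.0 + 0.1 * (i % 10) for i in range(100)]
set_b = list(reversed(set_a))          # same values, so same mean
set_c = [v + 2.0 for v in set_a]       # mean shifted by 2 dB

t_same, df_same = welch_t(set_a, set_b)
t_diff, df_diff = welch_t(set_a, set_c)
p_same = approx_two_sided_p(t_same)
p_diff = approx_two_sided_p(t_diff)
print(p_same > 0.05, p_diff > 0.05)
```

Under the criterion stated above, a p-value above 0.05 would be read as no significant difference in the mean index between the two test sets.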

Compare ISRGAN with NE, SCSR, and SRGAN
In this paper, we compared the ISRGAN with the NE and SCSR methods and with the SRGAN. From the super-resolution results, we calculated the evaluation indexes between the results and the original images. It was difficult to obtain Landsat 8 and GF 1 satellite data of the same scene at the same time, and the corresponding numbers of pixels were not consistent; moreover, the computational criteria needed to stay consistent when validating whether the model has generalization ability across locations and sensors. Therefore, all the reference images for the quantitative calculation in this paper were the original images before super-resolution. The calculation method was to downsample the image after super-resolution and then calculate the quantitative indexes between the downsampled image and the original image. Figure 7 shows part of the comparison results for the NE and SCSR methods, the SRGAN, and the ISRGAN on the three test sets.
We computed the evaluation indexes of the test data on the three test sets for the four super-resolution algorithms and calculated their means. The results are shown in Table 7. As can be seen from Table 7, in the comparison of the four super-resolution algorithms, the ISRGAN is superior to the other three methods in both the PSNR and SSIM.
The purpose of image super-resolution is to make better use of the advantages of high spatial resolution images and improve the accuracy of target recognition [36], classification [37], and change detection [38,39]. Taking land cover classification and ground feature extraction as examples, we compared and analyzed the changes in the land use classification and ground feature extraction in the Landsat images before and after super-resolution.
The classification area selected was at the border of Guangzhou and Shenzhen, where the feature categories are rich and densely distributed. Such an area is well suited to exploiting the advantages of high spatial resolution images in land use classification and feature extraction. As shown in Figure 8, the original Landsat 8 image of the sample area was acquired on February 7, 2016. To avoid the interference of human factors in the classification results, we used the K-means algorithm to classify the images before and after super-resolution; the classification results are shown in Figure 9. A comparison of the two results shows that the extraction from the image after super-resolution is better in areas where the features are densely distributed and texture details are not obvious, such as roads and impervious surfaces. Overall, the classification of the image after super-resolution shows better boundaries and separability than that of the image before super-resolution. Therefore, the image after super-resolution is broadly applicable to the classification of remote sensing images.
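The unsupervised classification step can be illustrated with a small K-means sketch that clusters pixels by their band values. This is only an assumed, simplified implementation for illustration: the number of classes, the iteration count, the random initialization, and the function name `kmeans_classify` are hypothetical, and the software and settings actually used for Figure 9 are not stated in the text.

```python
import numpy as np


def kmeans_classify(image: np.ndarray, n_classes: int,
                    n_iter: int = 20, seed: int = 0) -> np.ndarray:
    """Cluster the pixels of an (H, W, C) image into n_classes with plain K-means.

    Returns an (H, W) array of integer class labels.
    """
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialize the cluster centers from randomly chosen pixels.
    centers = pixels[rng.choice(len(pixels), size=n_classes, replace=False)]

    def assign(current_centers: np.ndarray) -> np.ndarray:
        # Euclidean distance from every pixel to every center.
        distances = np.linalg.norm(
            pixels[:, None, :] - current_centers[None, :, :], axis=2)
        return distances.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for k in range(n_classes):
            members = pixels[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)

    return assign(centers).reshape(image.shape[:2])


# Hypothetical two-class scene: a dark left half and a bright right half.
scene = np.zeros((6, 8, 3))
scene[:, :4, :] = 10.0
scene[:, 4:, :] = 200.0
label_map = kmeans_classify(scene, n_classes=2)
print(label_map)
```

Because the same routine is applied to the images before and after super-resolution without any manually selected training samples, differences between the two label maps reflect the input images rather than operator choices.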
In addition, to better demonstrate the advantage of the image after super-resolution for the extraction of impervious surfaces, we used the Support Vector Machine (SVM) algorithm to extract impervious surfaces from the images before and after super-resolution. Similarly, to minimize the influence of human factors on the extraction results, we randomly selected the same set of training samples and test samples on Google Earth, converted them into regions of interest at the corresponding resolution, and then performed the classification and extraction. The extraction results of the impervious surfaces before and after super-resolution are shown in Figure 10. To quantitatively verify the improvement in extraction accuracy, we evaluated both results on the same set of test samples selected above, which included 122 impervious surface sample points and 71 non-impervious surface sample points. The confusion matrices of the impervious surface extraction before and after super-resolution are shown in Tables 8 and 9. Before super-resolution, the overall accuracy of the impervious surface extraction was 70.1% and the Kappa coefficient was 0.419. After super-resolution, the overall accuracy was 86.1% and the Kappa coefficient was 0.720. The quantitative results show that the overall accuracy after super-resolution was 16 percentage points higher than that before super-resolution.
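For reference, the overall accuracy and the Kappa coefficient reported above are both derived from a confusion matrix, as in the following sketch. The matrix used here is hypothetical: only its row totals (122 impervious and 71 non-impervious test points) come from the text, while the individual cell counts are invented for illustration and are not the values of Tables 8 and 9.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's Kappa) for a square confusion matrix.

    matrix[i][j] is the number of samples of reference class i
    that were assigned to class j.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: computed from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical counts (NOT the values of Tables 8 and 9).
# Rows: reference impervious / non-impervious; columns: predicted classes.
example_matrix = [
    [100, 22],  # 122 impervious test points
    [10, 61],   # 71 non-impervious test points
]
accuracy, kappa = overall_accuracy_and_kappa(example_matrix)
print(f"overall accuracy = {accuracy:.3f}, kappa = {kappa:.3f}")
```

The Kappa coefficient corrects the overall accuracy for the agreement expected by chance, which is why it is useful here, where the two classes have unequal numbers of test points.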

Discussion
In comparison with the NE method, the SCSR method, and the SRGAN, the ISRGAN shows better generalization performance in the cross-location and cross-sensor tests. However, as with many other super-resolution algorithms, pseudo-textures can still be seen in the output after super-resolution, and the ringing effect near the edges still needs to be reduced. Edge enhancement algorithms could be further applied to recover the edge details, especially in areas of high spatial variation.
Meanwhile, the scale ratio can lead to a dramatic change in the visual quality of the model output. For most super-resolution algorithms, the most suitable scale factor is about 2 to 4. This means that a satisfying prediction can be obtained when an image with 30-m resolution is downscaled to 8 m rather than to 1 m. As this ratio increases, the output image could show more random pseudo-textures and a more serious ringing effect. The fundamental reason for this is that the learning-based super-resolution algorithms try to recover the nonlinear point spread function (PSF) from a large number of available samples. However, when the ratio increases, the image details lost across the different scales become more and more complex and harder to capture with one universal PSF based on limited samples. The process of recovering the image details by super-resolution algorithms is ill-posed, since the number of pixels to be predicted is always larger than the number of known low spatial resolution pixels. For large scale ratios, an alternative way to recover the image details is to use image fusion technology, such as the spatial and temporal adaptive reflectance fusion model (STARFM) [40], the enhanced STARFM (ESTARFM) [41], and the U-STFM model [42], which basically "borrows" the detailed image texture from a high spatial resolution reference image rather than predicting it. However, when the discrepancy between the reference image and the input image becomes large, or the land cover changes rapidly, the fusion-based methods can fail to predict these changes.
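The growth of the ill-posedness with the scale ratio can be made concrete with a short calculation. The sketch below assumes square pixels on aligned grids, so that the number of output pixels per known input pixel is the square of the linear resolution ratio; the function name is hypothetical.

```python
def pixels_to_predict(coarse_resolution_m: float, fine_resolution_m: float):
    """Return (linear scale factor, output pixels per known input pixel).

    Assumes square pixels on aligned grids, so the pixel count grows with
    the square of the linear resolution ratio.
    """
    scale = coarse_resolution_m / fine_resolution_m
    return scale, scale ** 2


# The two cases mentioned in the text: 30 m to 8 m and 30 m to 1 m.
for target in (8.0, 1.0):
    scale, ratio = pixels_to_predict(30.0, target)
    print(f"30 m -> {target:g} m: scale factor {scale:g}, "
          f"{ratio:g} output pixels per input pixel")
```

Under these assumptions, going from 30 m to 8 m requires about 14 predicted pixels per known pixel, whereas going from 30 m to 1 m requires 900, which illustrates why the prediction becomes much less constrained at large scale ratios.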
In addition, because GANs consist of two networks, the generator and the discriminator, more parameters than in a convolutional neural network need to be optimized during training, so a long training time is needed. Recently, Muyang Li et al. [43] proposed the GAN compression method, which greatly reduces the computational cost of the generator. In a large number of experiments on pix2pix [44], CycleGAN [45], and GauGAN [46], this method reduced the computation to between 1/9 and 1/21 of the original without losing the fidelity of the generated images. Therefore, the GAN compression method could be applied to the super-resolution network to reduce the computational cost of the model.

Conclusions
Building on the super-resolution algorithm based on generative adversarial networks from the computer vision field, this paper aimed to solve the problems of vanishing gradients and mode collapse that exist in the training of generative adversarial networks. Using the Wasserstein distance minimization proposed in the WGAN, we modified the original super-resolution network (SRGAN) and proposed the ISRGAN. We then applied it to the super-resolution of remote sensing images and drew the following conclusions:

Figure 1. The workflow of the experiment.

Figure 3. Study area and the image coverage of GF 1.

Figure 4. The predicted super-resolution image for our ISRGAN model (b) on a GF 1 dataset in Guangdong, compared to the input image (a) and the ground truth (c). Panels (d), (e), and (f) are 1:1 plots of the Digital Number (DN) value in the red, green, and blue bands compared to the ground truth (c), with slopes of 1.0063, 1.0032, and 0.9955, respectively.

Figure 5. Cross-location comparison: the predicted super-resolution image for our ISRGAN model (b) on a GF 1 dataset in Xinjiang, compared to the input image (a) and the ground truth (c). Panels (d), (e), and (f) are 1:1 plots of the DN value in the red, green, and blue bands compared to the ground truth (c), with slopes of 0.9658, 0.9378, and 0.9485, respectively.

Figure 6. Cross-sensor and cross-location comparison: the predicted super-resolution image for our ISRGAN model (b) on a Landsat 8 dataset in Xinjiang, compared to the input image (a) and the ground truth (c). Panels (d), (e), and (f) are 1:1 plots of the DN value in the red, green, and blue bands compared to the ground truth (c), with slopes of 0.9527, 0.9564, and 0.9760, respectively.

Figure 7. Comparison of the output of our ISRGAN model (the fifth column) to other models (NE, SCSR, SRGAN). Panels (a) and (b) are the results of the GF 1 dataset in Guangdong, (c) is the result of the cross-location testing in the GF 1 dataset in Xinjiang, and (d) is the result of Landsat 8 in Xinjiang. The ground truth is shown as HR. The low-resolution inputs are marked as LR.

Figure 8. Landsat 8 image of the demonstration area.

Figure 9. Comparison of the classification results between the images before and after super-resolution. Panel (a) is the classification result on the image before super-resolution and panel (b) is the classification result on the image after super-resolution.

Figure 10. Comparison of the extraction results of impervious surfaces between the images before and after super-resolution. Panel (a) is the extraction result on the image before super-resolution and panel (b) is the extraction result on the image after super-resolution.

Table 2. The parameters of GF 1.

Table 3. The parameters of Landsat 8.

Table 4. The information of used data.

Table 5. T-test of the model across locations.

Table 6. T-test of the model across sensors.

Table 7. Image quality comparison of different super-resolution algorithms.

Table 8. Confusion matrix of construction on the image before super-resolution.

Table 9. Confusion matrix of construction on the image after super-resolution.