Super-Resolution of Remote Sensing Images via a Dense Residual Generative Adversarial Network

Abstract: Single image super-resolution (SISR) has been widely studied in recent years as a crucial technique for remote sensing applications. In this paper, a dense residual generative adversarial network (DRGAN)-based SISR method is proposed to promote the resolution of remote sensing images. Different from previous super-resolution (SR) approaches based on generative adversarial networks (GANs), the novelty of our method mainly lies in the following factors. First, we made a breakthrough in terms of network architecture to improve performance. We designed a dense residual network as the generative network in the GAN, which can make full use of the hierarchical features from low-resolution (LR) images. We also introduced a contiguous memory mechanism into the network to take advantage of the dense residual block. Second, we modified the loss function and altered the model of the discriminative network according to the Wasserstein GAN with a gradient penalty (WGAN-GP) for stable training. Extensive experiments were performed using the NWPU-RESISC45 dataset, and the results demonstrated that the proposed method outperforms state-of-the-art methods in terms of both objective evaluation and subjective perspective.


Introduction
High-resolution (HR) images, which contain abundant, detailed information, are crucial for various remote sensing applications, such as target detection, surveillance [1], satellite imaging [2] and others. Increasingly, researchers prefer to reconstruct HR images from low-resolution (LR) images via an image processing technique called super-resolution (SR), which is popularly used to solve the LR problems caused by the sensor, compensate for the deficiencies of the hardware and overcome the influence of blur, noise and other factors in the imaging process [3][4][5].
Single image super-resolution (SISR) is an inherently ill-posed problem, since many HR pixel intensities must be predicted from each LR pixel. Such a problem is typically mitigated by constraining the solution space with strong prior information. To learn this prior information, recent state-of-the-art methods mostly adopt example-based strategies [6]. These methods either explore the self-similarities of examples [7,8] or map LR to HR patches with the help of external samples [9,10]. Yang et al. implemented an SR method utilizing sparse coding to represent LR and HR images [11].


Residual Learning-Based SR
Originally, residual learning was proposed to address problems such as image classification and detection, and it exhibits excellent performance on computer vision problems from low-level to high-level tasks. Ledig et al. introduced the idea of residual learning into the problem of SR and employed a deep residual network (ResNet) with skip-connections as the GN, as shown in Figure 1. In their SRGAN, the content loss is computed between feature maps of the generated and ground-truth images extracted from VGG19 [41], and the adversarial loss is provided by the DN, as shown in Figure 2. However, because the loss function of SRGAN is not based on the mean square error (MSE) of the pixel space, the reconstructed images exhibit relatively low peak signal-to-noise ratios (PSNRs); moreover, SRGAN often suffers from training difficulties and vanishing gradients. The ResNet utilizes local residual learning (LRL) to ease the training of networks, and comprehensive empirical evidence showed that residual networks are easier to optimize and gain accuracy from considerably increased depth. Nevertheless, LRL simply extracts local features by preserving information, and it cannot save the hierarchical features in a global manner.
Kim et al. [42] were inspired by the residual network and introduced a very deep network for super-resolution (VDSR), as shown in Figure 3a. It should be noted that layers with the same color in Figure 3 belong to the same class. VDSR increases the network depth by cascading many convolutional layers. Since the reconstructed HR image is very similar to the input, global residual learning (GRL) is effective at reducing the difficulty of training deep networks. Kim et al. [43] enlarged the receptive field of the network by introducing a deeply recursive convolutional network (DRCN), as shown in Figure 3b, which is beneficial for parameter sharing and reduces memory consumption. Moreover, they utilized recursive supervision and skip-connections to overcome the difficulty of training. Tai et al. [44] proposed a very deep convolutional neural network model named the deep recursive residual network (DRRN, illustrated in Figure 3c) that strives for deep yet concise networks. DRRN adopts both GRL and LRL; the two mainly differ in that LRL is performed every few stacked layers, while GRL is performed between the input and output images. Both are employed to ease the training of deep networks. In contrast to residual learning-based SR methods, GAN-based approaches can recover more convincing and realistic HR images.


Proposed Method
In this section, we first describe the designed GN in the proposed dense residual generative adversarial network (DRGAN) in detail. Then, we demonstrate the DN part. Finally, we explicitly introduce the modified loss function of DRGAN according to WGAN-GP.
In this paper, let I G denote the ground-truth image with size m × n. I L denotes the down-sampled result of I G with size (m/s) × (n/s), where s is the corresponding scale factor. I SR represents the corresponding reconstructed SR image with size m × n.

Structure of the GN
The whole architecture of the GN is drawn in Figure 4. According to their functions, we can divide the GN into four parts: feature extraction, dense residual units (DRUs), residual learning and image reconstruction. I L and I SR are the input and output of the GN, respectively. Nah et al. removed the batch normalization (BN) layers in their image deblurring work because BN layers normalize the features and thereby remove range flexibility [45]; that is, BN layers are better suited to classification tasks than to SR. Therefore, we did not employ BN layers anywhere in the GN, as shown in Figure 4.


Feature Extraction
We employed two convolutional layers to extract features first, because convolutional layers play two significant roles: mitigating the effect of noise and strengthening the characteristics of the original signal. The operation of the feature extraction part can be expressed as follows:

FE = g(W_FE,2 * g(W_FE,1 * I_L + B_FE,1) + B_FE,2),

where W FE,1 and W FE,2 represent n FE,1 convolution kernels of size c × k FE,1 × k FE,1 and n FE,2 convolution kernels of size n FE,1 × k FE,2 × k FE,2 , respectively; c denotes the channel number of the input image I L ; k FE,1 and k FE,2 are the spatial sizes of the convolution filters; B FE,1 and B FE,2 represent the biases; * represents the convolution operation; g(·) represents the activation function; and FE is the output of the feature extraction part and the input of the DRUs.
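As a concrete illustration, the following PyTorch sketch mirrors this two-layer feature extraction stage; the channel and kernel sizes (1 → 64 → 64, 3 × 3) are taken from the training details reported later, and the module is an illustrative reimplementation, not the authors' TensorFlow code.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Two convolutional layers with PReLU, as in the GN's first stage.

    Channel/kernel sizes (1 -> 64 -> 64, 3x3) follow the training details
    given later in the paper; they are assumptions here, not exact specs.
    """
    def __init__(self, in_channels=1, n_feats=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, n_feats, 3, padding=1)  # W_FE,1, B_FE,1
        self.conv2 = nn.Conv2d(n_feats, n_feats, 3, padding=1)      # W_FE,2, B_FE,2
        self.act1 = nn.PReLU(init=0.25)  # g(.), alpha initialized to 0.25
        self.act2 = nn.PReLU(init=0.25)

    def forward(self, i_l):
        # FE = g(W_FE,2 * g(W_FE,1 * I_L + B_FE,1) + B_FE,2)
        return self.act2(self.conv2(self.act1(self.conv1(i_l))))
```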
In the case of SR, we only need to process the luminance channel of images, since human eyes are more sensitive to the brightness information of the images. Thus, we extract the Y-channel after transforming the images from RGB to YCbCr color space. The remaining two channels are upscaled to the required size via bicubic interpolation, and the final SR image can be obtained by fusing these three channels of the image. Therefore, the channel number of the input image I L is always c = 1.
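A minimal NumPy sketch of this preprocessing, assuming the full-range ITU-R BT.601 conversion (the paper does not state which YCbCr variant was used):

```python
import numpy as np

def rgb_to_y(rgb):
    """Extract the luminance (Y) channel from an RGB image in [0, 255].

    Uses full-range BT.601 coefficients; the paper does not specify the
    exact conversion, so this choice is an assumption.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```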
This paper adopts the parametric rectified linear unit (PReLU) [46] as the activation function g(·), which achieves a regularizing effect to a certain extent. Compared to ReLU [47], PReLU improves the convergence rate of the network at the cost of only a few extra parameters. The formula of g(·) can be expressed as

g(x) = max(0, x) + α_t · min(0, x),

where α_t is a learnable parameter, α is initialized to 0.25 and t denotes the iteration. When the network updates the parameters in the backward pass, the update of α_t can be formulated as

Δα_t = μ · Δα_{t−1} + ε · ∂L/∂α_t,    α_{t+1} = α_t − Δα_t,

where μ denotes the momentum, ε refers to the learning rate and L represents the loss function.
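To make the update rule concrete, here is a scalar NumPy sketch of PReLU and a momentum step for α; the sign convention follows ordinary gradient descent with momentum and is an assumption, as is the value of μ.

```python
import numpy as np

def prelu(x, alpha):
    # g(x) = max(0, x) + alpha * min(0, x)
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

def update_alpha(alpha, delta_prev, dL_dalpha, mu=0.9, eps=1e-4):
    """One momentum step for the learnable slope alpha.

    mu (momentum) is a placeholder value; eps matches the network-wide
    learning rate of 1e-4 reported in the training details.
    """
    delta = mu * delta_prev + eps * dL_dalpha
    return alpha - delta, delta
```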

DRUs
Assume that there are d DRUs; the specific architecture of each DRU is shown in Figure 5. Each DRU includes three convolutional layers, three activation layers, three weighted-sum layers and one element-wise sum layer. The convolutional layers in each DRU are densely connected in the manner shown in Figure 5, and GRL and LRL are utilized simultaneously.

The whole operation of the p-th DRU, DRU_p, can be formulated as follows: where W p,1 to W p,3 and B p,1 to B p,3 represent the kernels and biases, respectively, of the three successive convolutional layers; S p,1 to S p,3 denote the weighted-sum layers in sequence; D p,1 to D p,3 represent the outputs of the successive convolutional layers (the activation layers are omitted for clarity); and D p denotes the output of DRU_p. The blue lines in Figure 5 indicate that the preceding outputs of convolutional layers within a DRU are fed into the posterior convolutional layers, forming the short-term memory. Similarly, the red and purple lines in Figure 5 indicate that the preceding outputs of DRUs are fed into the latter layers, corresponding to the long-term memory. The outputs of the previous DRUs and convolutional layers connect to the latter layers directly, which not only preserves the feed-forward features but also extracts local dense features. Together, these connections form a memory mechanism.
Because the outputs of the preceding DRUs and all preceding convolutional layers are fed into the subsequent layers, we need to reduce the number of features to lighten the burden on the network. Thus, we employ weighted-sum layers S p,1 to S p,3 that adaptively learn specific weights for each memory, which determine how much of the long-term and short-term memory should be saved. We refer to the operation of S p,1 to S p,3 in DRU_p as the local decision function.
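The following PyTorch sketch is one plausible reading of Figure 5 (which is not reproduced here): three densely connected convolution stages, each fed by a weighted-sum layer with learnable scalar weights, closed by a local residual sum. The scalar form of the weights and the exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    """Learnable scalar weights over several same-shaped feature maps.

    Models the 'local decision function' S_p,i; the exact form used in
    Figure 5 is unavailable, so per-input scalar weights are an assumption.
    """
    def __init__(self, n_inputs):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs) / n_inputs)

    def forward(self, feats):
        return sum(w * f for w, f in zip(self.w, feats))

class DRU(nn.Module):
    """One dense residual unit: three conv + PReLU stages, densely
    connected, closed by an element-wise (local residual) sum."""
    def __init__(self, n_feats=64):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(n_feats, n_feats, 3, padding=1) for _ in range(3)])
        self.acts = nn.ModuleList([nn.PReLU() for _ in range(3)])
        # S_p,1..S_p,3 merge the unit input and all preceding conv outputs.
        self.sums = nn.ModuleList([WeightedSum(i + 1) for i in range(3)])

    def forward(self, x):
        feats = [x]
        for conv, act, wsum in zip(self.convs, self.acts, self.sums):
            feats.append(act(conv(wsum(feats))))
        return x + feats[-1]  # local residual learning
```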

Residual Learning
In recent studies, residual networks have achieved great performance on computer vision tasks from low-level to high-level. In this paper, we adopt both LRL and GRL in order to make full use of them. As shown in Figure 5, the blue lines represent LRL in the GN, and the red lines denote GRL. The whole function of the residual learning part can be formulated as follows:

R_ws = S_RL(D_1, ..., D_d),    R_ws,1 = W_RL,1 * R_ws + B_RL,1,    R = R_ws,1 + I_L,

where S RL denotes the weighted-sum layer; W RL,1 and B RL,1 represent the kernel and bias, respectively, of the convolutional layer; D 1 to D d represent the outputs of the d DRUs in sequence; and R ws , R ws,1 and R denote the outputs of the weighted-sum layer, the convolutional layer and the element-wise sum layer in the residual learning part, respectively. LRL and GRL differ here in that LRL is acquired between the DRUs and the weighted-sum layer, while GRL is implemented between the input image I L and the element-wise sum layer, as shown in Figure 5. The weighted-sum layer S RL extracts the hierarchical features obtained from the previous DRUs through LRL and decides their proportions in the ensuing features. We define the operation of S RL in the residual learning part as the global decision function, as opposed to S p,1 to S p,3 in DRU_p. The convolutional layer W RL,1 is employed to further exploit features, and the element-wise sum layer realizes the GRL. The combination of LRL and GRL improves the performance of the GN and makes it less prone to over-fitting.
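A matching PyTorch sketch of the residual learning part under the same assumptions; to keep channel counts consistent, the global residual sum adds the feature map passed in as skip rather than the raw image I L, which is an interpretation of the text rather than a confirmed detail.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Global decision function S_RL + conv W_RL,1 + element-wise sum.

    The text says GRL adds the input image I_L; for channel counts to
    match, this sketch adds the input feature map instead (an assumption).
    """
    def __init__(self, d, n_feats=64):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d) / d)                # S_RL weights
        self.conv = nn.Conv2d(n_feats, n_feats, 3, padding=1)   # W_RL,1

    def forward(self, dru_outputs, skip):
        r_ws = sum(w * f for w, f in zip(self.w, dru_outputs))  # R_ws
        return self.conv(r_ws) + skip                           # R (global sum)
```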

Image Reconstruction
Inspired by ESPCN, we adopted a sub-pixel convolutional layer for image upscaling and reconstruction, in addition to a convolutional layer. The whole function of the image reconstruction part can be formulated as follows:

I_1 = W_IR,1 * R + B_IR,1,    I = W_IR,sc ∗ I_1,

where W IR,1 and B IR,1 represent the kernel and bias, respectively, of the convolutional layer; W IR,sc denotes the sub-pixel convolutional layer; ∗ denotes the operation of sub-pixel convolution; I 1 and I denote the outputs of the convolutional layer and the sub-pixel convolutional layer, respectively; and I is the final SR image obtained, I SR.
The sub-pixel convolution layer W IR,sc can be conceptually separated into two steps, and the conceptual graph is shown in Figure 6:

1) Convolution. Similar to the previous convolutional layers in the GN, this step extracts features; the difference is that it produces s² feature maps according to the upscaling factor s.
2) Periodic shuffling. The s² feature maps are rearranged so that the s² values at each LR spatial position form an s × s patch in the HR output, upscaling the feature maps to the target size (see the sketch below).
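The shuffling step can be written in a few lines of NumPy; this matches the rearrangement used by the sub-pixel layer in ESPCN.

```python
import numpy as np

def pixel_shuffle(x, s):
    """Rearrange (c*s*s, h, w) feature maps into a (c, h*s, w*s) image.

    Step 2 of the sub-pixel layer: every group of s*s channel values at a
    spatial position becomes an s x s patch in the upscaled output.
    """
    c2, h, w = x.shape
    c = c2 // (s * s)
    x = x.reshape(c, s, s, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (c, h, s, w, s)
    return x.reshape(c, h * s, w * s)
```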

Structure of the DN
According to the theory of GANs, there is a DN in addition to the GN, and together they form the adversarial networks: the GN produces the reconstructed image I SR, while the DN is used to distinguish between the ground-truth image I G and I SR. That is to say, we should optimize the parameters θ DN of the DN along with the parameters θ GN of the GN in an alternating manner to solve the adversarial min-max problem:

min_{θ_GN} max_{θ_DN} E_{I_G ∼ P_D}[log θ_DN(I_G)] + E_{I_SR ∼ P_G}[log(1 − θ_DN(I_SR))],

where P D is the distribution of the ground-truth images and P G is the distribution of the reconstructed images.
With the advantages of the GAN, we can recover an I SR that is highly similar to the ground-truth image I G and difficult for the DN to distinguish.
However, unlike the DN in SRGAN shown in Figure 2, we made modifications in two aspects. First, referring to WGAN-GP, we replaced the last sigmoid layer with a Leaky ReLU layer. The discriminator in SRGAN aims at a binary true-or-false classification, while the purpose of the DN in DRGAN is to approximate the Wasserstein distance. Second, we removed the BN layers in the DN. We apply a gradient penalty to each sample individually, and BN layers would have undesirable effects on the gradient penalty because they introduce interdependent relationships among different samples in the same batch. The final architecture of the DN is shown in Figure 7.
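Since Figure 7 is not reproduced here, the following PyTorch sketch shows only the general shape of such a critic: strided convolutions with Leaky ReLU, no BN and no bounded output activation. Layer counts and widths are placeholders, not the paper's specification.

```python
import torch
import torch.nn as nn

def make_critic(in_channels=512):
    """A generic WGAN-GP-style critic: no BN, no final sigmoid.

    The real DN operates on VGG feature maps; the channel widths and the
    number of layers here are illustrative assumptions.
    """
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, 3, stride=2, padding=1),
        nn.LeakyReLU(0.2),
        nn.Conv2d(256, 128, 3, stride=2, padding=1),
        nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, 1),  # unbounded score for the Wasserstein estimate
    )
```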

Loss Function
In SRGAN, the perceptual loss function l SRGAN was proposed as the weighted sum of a content loss l con and an adversarial loss l adv. The conceptual process of training SRGAN is shown in Figure 8. l SRGAN is formulated as follows:

l_SRGAN = l_con + η · l_adv,

where η is the weighting coefficient. Specifically, l con is defined as the Euclidean distance between the feature maps of the recovered image θ GN (I L) and the corresponding ground-truth image I G in VGG:

l_con = (1 / (w_{j,k} h_{j,k})) Σ_{x=1}^{w_{j,k}} Σ_{y=1}^{h_{j,k}} (f_{j,k}(I_G)_{x,y} − f_{j,k}(θ_GN(I_L))_{x,y})²,

where f j,k is the feature map acquired from the k-th convolutional layer before the j-th pooling layer in the VGG, and w j,k and h j,k denote the dimensions of the respective feature maps.
VGG is taken as a universal feature extractor to extract high-level features. l con is equal to the MSE between the high-level features extracted by VGG. With the advantage of l con , the reconstructed images become more realistic and full of abundant details.
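A hedged sketch of such a VGG-based content loss using torchvision (requires torchvision ≥ 0.13 for the weights API); the truncation depth, i.e., the choice of j and k, is an assumption, as the text does not pin it down.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Feature extractor f_{j,k}: a truncated VGG19. Cutting after 35 modules
# (the last conv block before the final pooling stage) is an assumption.
_vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:35].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def content_loss(sr, hr):
    """MSE between VGG feature maps of the SR and ground-truth images.

    Expects 3-channel inputs normalized the way VGG was trained.
    """
    return F.mse_loss(_vgg(sr), _vgg(hr))
```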
Besides the content loss, SRGAN also introduced the adversarial loss in order to encourage the network to favor solutions that reside on the manifold of ground-truth images by aiming to fool the DN. The adversarial loss is obtained from the result of θ DN (θ GN (I L)) over all training samples as

l_adv = Σ_{n=1}^{N} log(1 − θ_DN(θ_GN(I_L))),    (10)

where θ DN (θ GN (I L)) denotes the probability of judging the recovered image θ GN (I L) as the corresponding I G. Furthermore, Equation (10) is transformed into

l_adv = Σ_{n=1}^{N} −log θ_DN(θ_GN(I_L))    (11)

for better gradient behavior.
However, in this paper, according to WGAN-GP, we modify the loss function l DRGAN of the proposed DRGAN to solve the problems of unstable training, vanishing or exploding gradients and mode collapse. WGAN-GP enforces the Lipschitz constraint through a gradient penalty, which stabilizes training and prevents gradient explosion; for this reason, we omit the BN layers in the DN, as mentioned above, since BN layers may introduce interdependent relationships among different samples in the same batch. Moreover, a loss term based on the MSE of the pixel space is supplemented, and the DN is used to discriminate the feature maps of I SR and I G extracted via VGG. In this manner, we can not only achieve convincing reconstructed images with abundant details but also acquire results with high PSNRs. The corresponding process of training the proposed DRGAN is shown in Figure 9.
Let l GN represent the loss function of the GN and l DN denote the loss function of the DN. Different from l con in SRGAN, l GN is formulated as

l_GN = (1 / (m n)) Σ_{x=1}^{m} Σ_{y=1}^{n} (I_G(x, y) − I_SR(x, y))²,

i.e., the MSE between the reconstructed image I SR and the corresponding ground-truth image I G in the pixel space. As an MSE loss, l GN provides solutions with the highest PSNR values, which are, however, perceptually rather smooth and less convincing than results achieved with a loss component that is more sensitive to visual perception.
Figure 8. The conceptual process of training the adversarial networks. The I G and the I SR obtained from the GN are fed into the DN and VGG simultaneously, yielding the adversarial loss and the content loss, respectively. Then, we update the parameters of the adversarial networks accordingly and repeat the process until the optimization is finished.
Figure 9. The conceptual process of training DRGAN. The loss function based on the mean square error (MSE) is computed between the ground-truth image I G and the I SR obtained from the GN. Then, the modified DN is used to distinguish the feature maps extracted by VGG, and the adversarial loss is obtained.
How l DN differs from l adv in SRGAN, as shown in Equation (10), is reflected in three aspects. First, the DN is no longer used to distinguish the reconstructed image I SR from the corresponding ground-truth image directly; VGG extracts the high-level feature maps of I SR and I G, and these are what the DN in our DRGAN must distinguish. Second, the result of θ DN (θ GN (·)) is acquired without logarithm operations, because the probability of distinguishing fake from real data is replaced by the Wasserstein distance between the distributions of ground-truth and reconstructed images; accordingly, the DN in DRGAN removes the last sigmoid layer. Third, the gradient penalty is added to keep the gradient steady during back-propagation. The loss function l DN of the DN can be formulated as

l_DN = E[θ_DN(f(θ_GN(I_L)))] − E[θ_DN(f(I_G))] + λ · E[(‖∇_z θ_DN(z)‖₂ − 1)²],

where f(θ_GN(I_L)) and f(I_G) represent the feature maps of I SR and I G extracted by VGG; (‖∇_z θ_DN(z)‖₂ − 1)² is the gradient penalty according to WGAN-GP; λ is the coefficient, set to 10 based on several comparative experiments; and ∇_z indicates partial differentiation with respect to z, which is sampled along straight lines between the two feature distributions:

z = ε · f(I_G) + (1 − ε) · f(θ_GN(I_L)),    ε ∼ U[0, 1].

The whole process of training the proposed DRGAN can be divided into five steps (a sketch of the critic update follows the step list):

1) Feed the LR image I L into the GN, obtain the corresponding reconstructed image I SR and compute the content loss l GN based on the MSE.
2) Import the reconstructed image I SR and the corresponding ground-truth image I G into VGG, and extract their high-level features.
3) Feed the extracted feature maps into the DN and obtain the adversarial loss. The final loss is computed as the weighted sum of the content loss l GN and the adversarial loss l DN.
4) Implement the backward pass of the network, compute the gradients of each layer and optimize the network iteratively by updating the parameters of the DN and GN according to the training policy.
5) Repeat the above steps until the loss of the network reaches its minimum, at which point the training of the network is finished.
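Below is a condensed PyTorch sketch of one critic update implementing l DN with the gradient penalty; λ = 10 follows the paper, while the module names and the interpolation of z are the standard WGAN-GP construction rather than the authors' exact code.

```python
import torch

def critic_loss(critic, feat_real, feat_fake, lam=10.0):
    """WGAN-GP critic loss on VGG feature maps, with gradient penalty.

    feat_real = f(I_G), feat_fake = f(GN(I_L)); lam = 10 as in the paper.
    """
    loss_w = critic(feat_fake).mean() - critic(feat_real).mean()

    # z: random interpolation between real and fake features (WGAN-GP).
    eps = torch.rand(feat_real.size(0), 1, 1, 1, device=feat_real.device)
    z = (eps * feat_real + (1 - eps) * feat_fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(z).sum(), z, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return loss_w + lam * gp
```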
The proposed loss function reflects the training situation better than that of an ordinary GAN. Moreover, the gradient penalty can be back-propagated to the GN and the DN so as to minimize the loss of the generative network l GN and maximize the loss of the discriminative network l DN.

Experiments
In this section, we first describe the preparation for the experiments. Then, we illustrate the details of the implementation and introduce the image quality evaluation indices commonly used in the related literature.

Dataset
NWPU-RESISC45 [48] is a classical scene classification dataset consisting of remote sensing images of 256 × 256 pixels. It contains 45 types of ground features in total, with 700 images per type. In this study, we chose the series of airplane images as targets: 500 airplane images were selected as the training set, 100 images were reserved for validation and the rest were used as test images.

Training Details
Referring to WGAN-GP, we adopt RMSprop [49] rather than Adam [50] to optimize our model; the weight matrices W are updated as follows: where δ is initialized to 0.02, W denotes the weights in the network, q indexes the elements of W, v represents an adaptive moment estimate, t denotes the iteration and the learning rate ε is initialized to 0.0001. Before training, we augmented the remote sensing images by horizontal flipping and rotation. Then, we down-sampled the ground-truth training images I G by the required upscale factor s to obtain the LR images I L. For each mini-batch, we cropped 16 random sub-images of size 64 × 64 from the LR training samples and the corresponding sub-images of size 256 × 256 from the ground-truth training samples. Taking into consideration both the training time and the complexity of the network, we employed eight DRUs in the GN, as described in Section 3.1. Each convolutional layer in the GN has a 3 × 3 kernel and 64 feature maps. Moreover, we adopted zero padding in each convolutional layer to ensure that the outputs have the same sizes as the inputs.
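For reference, a NumPy sketch of a standard RMSprop step; the paper's exact variant (including the role of its δ = 0.02) is not fully recoverable from the text, so the decay constant below is the common default, an assumption.

```python
import numpy as np

def rmsprop_step(w, grad, v, eps=1e-4, rho=0.9, delta=1e-8):
    """One standard RMSprop update (illustrative, not the paper's exact
    formulation). v accumulates a running average of squared gradients,
    element-wise over the entries q of W; eps is the learning rate."""
    v = rho * v + (1 - rho) * grad ** 2
    w = w - eps * grad / (np.sqrt(v) + delta)
    return w, v
```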
We implemented the experiments in TensorFlow [51] and accelerated them using a single NVIDIA GTX1080TI GPU with 11 GB of memory. Specifically, we first trained the GN with only the loss function based on l GN , as formulated in Equation (12), and then we initialized the entire DRGAN network with it to avoid undesirable local optima. The whole process of training required approximately four days.

Peak Signal-To-Noise Ratio (PSNR)
The PSNR [52] was adopted in this paper as a quality evaluation index for the reconstructed HR images. It depends on the MSE between the ground-truth images X = {X i} and the reconstructed HR images H = {H i}. The formulas for MSE and PSNR can be expressed as follows:

MSE = (1 / (m n)) Σ_{a=1}^{m} Σ_{b=1}^{n} (X_i(a, b) − H_i(a, b))²,    PSNR = 10 · log₁₀(255² / MSE),

where m and n denote the height and width of the images X i and H i; a and b index the horizontal and vertical axes; and 255 is the peak value of an 8-bit image.
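A direct NumPy implementation of these formulas:

```python
import numpy as np

def psnr(x, h, peak=255.0):
    """PSNR in dB between ground truth x and reconstruction h.

    peak = 255 assumes 8-bit images; use 1.0 for [0, 1]-scaled data.
    """
    mse = np.mean((x.astype(np.float64) - h.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```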

Structural Similarity Index (SSIM)
The SSIM [52] is commonly used to evaluate the quality of reconstructed HR images, and it is calculated as follows:

SSIM(X_i, H_i) = c(X_i, H_i) · d(X_i, H_i) · e(X_i, H_i),

where c(X_i, H_i) denotes the brightness comparison, d(X_i, H_i) denotes the contrast comparison and e(X_i, H_i) represents the comparison of pixel structure:

c(X_i, H_i) = (2 μ_{X_i} μ_{H_i} + C_1) / (μ_{X_i}² + μ_{H_i}² + C_1),
d(X_i, H_i) = (2 σ_{X_i} σ_{H_i} + C_2) / (σ_{X_i}² + σ_{H_i}² + C_2),
e(X_i, H_i) = (σ_{X_i H_i} + C_3) / (σ_{X_i} σ_{H_i} + C_3),

where σ_{X_i}² and σ_{H_i}² denote the variances of the images X i and H i; σ_{X_i H_i} refers to the covariance between X i and H i; μ_{X_i} and μ_{H_i} indicate the average values of X i and H i; and C_1, C_2 and C_3 are constants.
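A single-window NumPy sketch with the usual constants (C1 = (0.01·255)², C2 = (0.03·255)², C3 = C2/2); published SSIM values are normally computed over local windows and averaged, so treat this as illustrative.

```python
import numpy as np

def ssim_global(x, h, peak=255.0):
    """Single-window SSIM: product of luminance, contrast and structure
    terms, with C3 = C2 / 2 so the formula collapses to the common form."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mu_x, mu_h = x.mean(), h.mean()
    var_x, var_h = x.var(), h.var()
    cov = ((x - mu_x) * (h - mu_h)).mean()
    return ((2 * mu_x * mu_h + c1) * (2 * cov + c2) /
            ((mu_x ** 2 + mu_h ** 2 + c1) * (var_x + var_h + c2)))
```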

Normalized Root Mean Square Error (NRMSE)
The normalized root mean square error (NRMSE) used in [53] measures the distance between the data predicted by the mapping model and the original data observed from the environment. It is the root of the MSE normalized by the reference data (e.g., by its range or mean); the smaller the NRMSE, the better the quality of the reconstructed HR image.
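A NumPy sketch under the range-normalization convention; the exact normalization used in [53] is not reproduced above, so this choice is an assumption.

```python
import numpy as np

def nrmse(x, h):
    """Root-mean-square error normalized by the ground-truth data range.

    Normalizing by (max - min) is one common convention; the exact
    normalization used in [53] is not reproduced in the text.
    """
    rmse = np.sqrt(np.mean((x - h) ** 2))
    return rmse / (x.max() - x.min())
```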

Erreur Relative Globale Adimensionnelle De Synthese (ERGAS)
The erreur relative globale adimensionnelle de synthese (ERGAS) [54] was put forward to measure the quality of reconstructed HR images while taking the scale factor into consideration, and it can be formulated as

ERGAS = (100 / s) · sqrt( (1 / c) Σ_{k=1}^{c} MSE(X_k, H_k) / μ_{X_k}² ),

where s represents the scale factor, c denotes the channel number of the image and μ_{X_k} is the mean value of the k-th channel of X. The smaller the value of ERGAS, the better the quality of the reconstructed HR image.
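A NumPy sketch of this formula for multi-channel images:

```python
import numpy as np

def ergas(x, h, s):
    """ERGAS for images of shape (height, width, c) and scale factor s.

    100/s times the RMS of per-channel RMSE divided by the channel mean;
    smaller is better.
    """
    c = x.shape[2]
    acc = 0.0
    for k in range(c):
        rmse_k = np.sqrt(np.mean((x[..., k] - h[..., k]) ** 2))
        acc += (rmse_k / x[..., k].mean()) ** 2
    return 100.0 / s * np.sqrt(acc / c)
```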

Results
To test the performance of the proposed SR method via DRGAN, we implemented tests on a public dataset and compared the results of DRGAN with those of several state-of-the-art methods. In addition, we selected the results of bicubic interpolation as the baseline reference. For SISR methods based on DL, DRGAN was compared with SRCNN [20], FSRCNN [22], ESPCN [21], VDSR [42], DRRN [44] and SRGAN [34]. The publicly available testing codes from the corresponding authors were employed. For fair comparison, we cropped the boundary pixels before evaluation, as done in SRCNN [20]. It can be observed from Table 5 that for relatively deep networks, such as VDSR, DRRN, SRGAN and the proposed DRGAN, the reconstruction time of the test image under our method is far less than that of the other approaches.
In addition to the quantitative comparisons, we also performed visual comparisons between our method and the above-listed methods. We show the reconstructed HR results with different scale factors in Figures 10-12, and the ground-truth images are also provided for reference. For clearer contrast, we selected an area marked with a green rectangle to zoom in and placed the close-up below the corresponding whole image.

Discussion

The Effect of Adding MSE into the Loss Function
SRGAN's perceptual loss, which consists of an adversarial loss and a content loss, can help the model generate convincing reconstructed results, but the objective indicators of SRGAN do not perform well against other methods because its loss function is dependent on the feature space, not the pixel space. To address this drawback, MSE loss was introduced in our proposed method to ensure the similarity between the output image and the target image.
To assess the effect of adding MSE loss, we compared the PSNR and SSIM values of reconstructed HR images obtained through the networks with and without MSE in the loss function with a scale factor of ×3 for the test set. The results indicate that the network with MSE loss added has superior performance relative to that without MSE constraint, and an improvement of approximately 0.36 dB in PSNR and 0.0085 in SSIM can be achieved using our new loss function.

The Impact of Using L gan or L wgan on Our SR Model
As is known, it is usually difficult to decide when to suspend the training of the generator or discriminator for traditional GAN-based approaches. GAN-based methods often suffer from the situation of gradient vanishing. As mentioned in Section 3, we referred to the key idea of WGAN-GP instead of using an ordinary GAN in our model.
For comparison, we drew the loss convergence curves of the generator of our model under the conditions of using L gan or L wgan. We selected hyperparameter 'epoch' values of 100 and 200 and display the experimental results. As shown in Figure 14, the red curves represent the trend of loss convergence under L wgan, while the blue curves represent the results of using L gan. It can be clearly observed from Figure 14a,b that the loss is difficult to converge (the blue curves) when using the ordinary L gan to train the model regardless of the hyperparameter 'epoch'; after training for a period of time, the loss instead increases, which is called mode collapse and often occurs in GANs. Obviously, L wgan can overcome this drawback very well. The curves of loss convergence of our model under L wgan (the red curves) show that the loss always decreases until convergence is accomplished.


Future Work
SR of remote sensing images based on DL faces more problems than SR of natural images. Training through DL is premised on sufficient, qualified training samples. However, it is not easy to collect a large number of high-quality remote sensing images that satisfy the requirements. Therefore, transferring knowledge from an external dataset has attracted much attention with the continuous development of DL. Generally, it is easy to collect a natural image dataset that has higher resolution than remote sensing images and contains more detailed information. The performance of the proposed DRGAN method could probably be improved by pretraining the model with abundant natural images as the training data and then fine-tuning it with remote sensing images. Transfer learning is a potential solution to this issue and will be studied in future work.


Conclusions
In this paper, we proposed a novel SISR method named DRGAN to promote the resolution of remote sensing images. We sought to improve the performance of the GAN by enhancing the ability of the GN to reconstruct images. In particular, we introduced the design of a dense residual network into the GN and utilized the memory mechanism to extract hierarchical features for better reconstruction. Furthermore, we added MSE to the loss function and modified the model of the DN and the loss function following WGAN-GP, which improved the accuracy of reconstruction and avoided vanishing gradients. In addition to the aircraft images, we also used other types of remote sensing images and several natural image datasets to verify the robustness of our model. The experimental results on a publicly available dataset demonstrate that our proposed method achieves the best performance in terms of both accuracy and visual quality. In future work, other techniques will be applied, such as transfer learning, which can borrow high-frequency information from natural image datasets containing very-high-resolution images, to further improve the performance of the new method.