
Remote Sens. 2019, 11(21), 2578; https://doi.org/10.3390/rs11212578

Article
Super-Resolution of Remote Sensing Images via a Dense Residual Generative Adversarial Network
by Wen Ma 1,2,3, Zongxu Pan 1,3,*, Feng Yuan 4 and Bin Lei 1,3
1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Huairou District, Beijing 101408, China
3 Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Beijing 100190, China
4 School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Received: 25 September 2019 / Accepted: 1 November 2019 / Published: 3 November 2019

Abstract

Single image super-resolution (SISR) has been widely studied in recent years as a crucial technique for remote sensing applications. In this paper, a dense residual generative adversarial network (DRGAN)-based SISR method is proposed to enhance the resolution of remote sensing images. Compared with previous super-resolution (SR) approaches based on generative adversarial networks (GANs), the novelty of our method lies mainly in two aspects. First, we improved the network architecture: we designed a dense residual network as the generative network in the GAN, which makes full use of the hierarchical features of low-resolution (LR) images, and we introduced a contiguous memory mechanism into the network to take full advantage of the dense residual blocks. Second, we modified the loss function and the discriminative network following the Wasserstein GAN with a gradient penalty (WGAN-GP) to stabilize training. Extensive experiments on the NWPU-RESISC45 dataset demonstrate that the proposed method outperforms state-of-the-art methods in terms of both objective evaluation and subjective visual quality.
Keywords:
single image super-resolution (SISR); remote sensing images; generative adversarial network (GAN); dense residual network (DRN); Wasserstein GAN with gradient penalty (WGAN-GP)

1. Introduction

High-resolution (HR) images, which contain abundant detailed information, are crucial for various remote sensing applications, such as target detection, surveillance [1] and satellite imaging [2]. Increasingly, researchers reconstruct HR images from low-resolution (LR) images via an image processing technique called super-resolution (SR), which is widely used to overcome the low resolution caused by sensor limitations, to compensate for hardware deficiencies and to mitigate the effects of blur, noise and other factors in the imaging process [3,4,5].
Single image super-resolution (SISR) is an inherently ill-posed problem, since many HR pixel intensities must be predicted from each LR pixel. This problem is typically mitigated by constraining the solution space with strong prior information. To learn such priors, most recent state-of-the-art methods adopt example-based strategies [6]. These methods either exploit the self-similarity of examples [7,8] or map LR patches to HR patches with the help of external samples [9,10]. Yang et al. implemented an SR method that uses sparse coding to represent LR and HR images [11]. Li et al. used the sparsity prior of image statistics to recover images [12]. Pan et al. proposed an SISR method based on compressive sensing and structural self-similarity [13]. Timofte et al. proposed anchored neighborhood regression (ANR) for fast example-based SR [14] and later an improved version called A+ [15].
In recent years, owing to their powerful learning ability, deep learning (DL) models, especially convolutional neural networks (CNNs), have been widely used to address the ill-posed inverse problem of SR and have demonstrated superiority over reconstruction-based methods [16,17] and other learning paradigms [18,19]. As the pioneering CNN model for SR, the super-resolution convolutional neural network (SRCNN) proposed by Dong et al. [20] predicts the nonlinear mapping between LR and HR patches and significantly outperforms classical non-DL methods. Shi et al. [21] presented an efficient sub-pixel convolutional neural network (ESPCN) that rearranges the final feature maps instead of up-sampling the images, which reduces the running time. Meanwhile, Dong et al. [22] proposed a compact, hourglass-shaped CNN structure (FSRCNN) that accelerates SRCNN to real-time speed. As more effective building modules emerged, SISR networks became deeper and wider to obtain better performance. Zhao et al. [23] proposed an SISR approach for magnetic resonance (MR) images that applies a channel splitting network to ease the burden of the network. Abdul et al. [24] presented a hybrid residual attention network (HRAN), which greatly reduces the complexity of the CNN while achieving better performance. In [25], Zhao et al. proposed a two-stage example-based method for SISR that achieves better reconstruction accuracy. Li et al. [26] presented a spatial modulated residual unit (SMRU) and a recursively dilated residual network (RDRN), which effectively exploit contextual information over larger regions. In [27], He et al. designed a deep–shallow cascade-based CNN that effectively recovers the high-frequency information of remote sensing images.
SISR is also of great practical value for remote sensing and hyperspectral images, as it can assist the visual interpretation of images in many fields of application, such as meteorology, agriculture and military. Ma et al. [28] presented a method for remote sensing images that combines the wavelet transform with a recursive residual network (WTCRR), which exploits the representation of remote sensing images at different frequency bands. Zhang et al. [29] applied multiple-point statistics (MPS) and isometric mapping (ISOMAP) to the SR of remote sensing images, effectively combining the advantages of both. Gu et al. [30] proposed a deep residual squeeze and excitation network (DRSEN) to reduce the computational complexity and improve the accuracy of remote sensing image reconstruction. Based on the Laplacian pyramid network, He et al. [31] proposed an SR method that enhances the resolution of hyperspectral images while preserving their spectral information. In [32], Kwan et al. integrated a hybrid color mapping (HCM) algorithm with a plug-and-play algorithm for hyperspectral image SR.
Moreover, generative adversarial networks (GANs) [33] have developed rapidly and have attracted considerable attention in recent years. Ledig et al. designed a GAN for image super-resolution (SRGAN) [34]. They employed the deep residual network of He et al. [35] with skip connections as the generative network (GN) and designed a classification network as the discriminative network (DN). They also proposed a perceptual loss function consisting of an adversarial loss and a content loss. Ma et al. [36] proposed a transferred generative adversarial network (TGAN) for SR, which enhances the feature representation ability of the model and addresses the poor quality and insufficient quantity of remote sensing images. Radford et al. [37] proposed a network architecture called deep convolutional generative adversarial networks (DCGANs), which improved both the stability of training and the quality of the results. Arjovsky et al. [38] defined a new form of GAN, the Wasserstein GAN (WGAN), which minimizes a reasonable and efficient approximation of the earth-mover (EM) distance. Subsequently, Gulrajani et al. [39] improved WGAN by penalizing the norm of the gradient of the critic with respect to its input (WGAN-GP), which outperformed the standard WGAN.
However, GAN-based SR approaches mainly focus on the design of the loss function and overlook the influence of the network architecture on the final performance. Moreover, for traditional GAN-based approaches it is difficult to decide when to stop training the generator or the discriminator. GAN-based methods also often suffer from vanishing gradients.
To address these drawbacks, we propose a dense residual generative adversarial network (DRGAN) for remote sensing image SR. More specifically, we use a network with residual learning and dense connections as the GN, which can fully exploit all the hierarchical features of the original LR images. We incorporate a memory mechanism (MM) into the GN through dense residual units (DRUs), which further enhances the performance of the GN and hence of DRGAN. Moreover, we adopt the key idea of WGAN-GP, which improves the training speed and alleviates the gradient vanishing problem of GAN-based SR approaches, and we modify the DN and the loss function accordingly. Extensive experiments on the NWPU-RESISC45 dataset compare the proposed DRGAN with classical methods, and the results demonstrate that the new method improves both the quantitative accuracy and the visual quality of the results.
This paper is organized as follows: In Section 2, we introduce GAN-based and residual learning-based methods and briefly discuss their pros and cons. Then, we describe the proposed DRGAN method in detail in Section 3. Section 4 and Section 5 are dedicated to the experimental details and a comparison of the results with those of other state-of-the-art methods, respectively. Next, we present a discussion of the proposed method in Section 6. Finally, the conclusions are drawn in Section 7.

2. Related Work

The success of AlexNet [40] on ImageNet opened a new era of DL for vision. In recent years, DL-based methods, especially GAN-based and residual learning-based approaches, have dramatically outperformed conventional methods in SISR. This section briefly reviews the related work on these two approaches.

2.1. GAN-Based SR

The GAN, presented by Goodfellow et al. [33], was inspired by the idea of a zero-sum game in game theory. The core idea of GAN-based SR is to train a GN, as shown in Figure 1, with the goal of fooling a differentiable DN that is trained to distinguish reconstructed images from real images.
SRGAN further defines a perceptual loss consisting of an adversarial loss and a content loss. The content loss is the Euclidean distance between the VGG19 [41] feature maps of the generated image and those of the ground-truth image. The adversarial loss is obtained from the DN, as shown in Figure 2.
Because this loss function is not based on the mean square error (MSE) in pixel space, the reconstructed images exhibit relatively low peak signal-to-noise ratios (PSNRs). Moreover, SRGAN often suffers from unstable training and vanishing gradients.

2.2. Residual Learning-Based SR

Residual learning was originally proposed for problems such as image classification and detection, and it performs well on computer vision tasks ranging from low-level to high-level. Ledig et al. introduced residual learning into SR and employed a deep residual network (Res-Net) with skip connections as the GN, as shown in Figure 1. Res-Net uses local residual learning (LRL) to ease the training of networks, and comprehensive empirical evidence shows that residual networks are easier to optimize and can gain accuracy from considerably increased depth. Nevertheless, LRL only extracts local features while preserving the input information, and it cannot retain hierarchical features globally.
Inspired by the residual network, Kim et al. [42] introduced a very deep network for super-resolution (VDSR), as shown in Figure 3a. Layers with the same color in Figure 3 belong to the same type. VDSR increases the network depth by cascading many convolutional layers. Since the reconstructed HR image is very similar to the input, global residual learning (GRL) effectively reduces the difficulty of training deep networks. Kim et al. [43] enlarged the receptive field of the network by introducing a deeply recursive convolutional network (DRCN), as shown in Figure 3b, which shares parameters across recursions and thus reduces memory consumption. They also used recursive supervision and skip connections to overcome the difficulty of training. Tai et al. [44] proposed a very deep convolutional neural network named the deep recursive residual network (DRRN, illustrated in Figure 3c), which strives for deep yet concise networks. DRRN adopts both GRL and LRL. The main difference between them is that LRL is performed every few stacked layers, whereas GRL is performed between the input and output images; both are employed to ease the training of deep networks. In contrast to residual learning-based SR methods, GAN-based approaches can recover more convincing and realistic HR images.
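The distinction between LRL and GRL can be made concrete with a toy sketch. It is an illustration under simplifying assumptions, not the actual VDSR or DRRN code: feature maps are flattened to lists of floats, and `layer` is a hypothetical stand-in for a few stacked convolutional layers.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum implemented by a skip connection."""
    return [x + y for x, y in zip(a, b)]


def residual_block(x: Vector, layer: Layer) -> Vector:
    """Local residual learning (LRL): skip connection around a few stacked layers."""
    return add(x, layer(x))


def global_residual_forward(x: Vector, body: Layer) -> Vector:
    """Global residual learning (GRL): the body predicts only the residual
    between output and input, and the input is added back at the end."""
    return add(x, body(x))


def drrn_like_forward(x: Vector, layer: Layer, num_blocks: int) -> Vector:
    """Toy model using LRL inside the body and GRL around it."""
    def body(v: Vector) -> Vector:
        for _ in range(num_blocks):
            v = residual_block(v, layer)  # LRL every few layers
        return layer(v)                   # final layer maps features to a residual
    return global_residual_forward(x, body)  # GRL between input and output
```

For example, with a layer that halves its input, `drrn_like_forward([2.0], layer, 2)` passes 2.0 through two residual blocks (3.0, then 4.5), maps the result to the residual 2.25 and adds the input back, giving `[4.25]`. When the body outputs zeros, `global_residual_forward` returns its input unchanged, which is why GRL makes the near-identity LR-to-HR mapping easy to learn.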

3. Proposed Method

In this section, we first describe in detail the GN of the proposed dense residual generative adversarial network (DRGAN). Then, we present the DN. Finally, we introduce the loss function of DRGAN, modified according to WGAN-GP.
In this paper, let $I_G$ denote the ground-truth image of size $m \times n$, and let $I_L$ denote the down-sampled version of $I_G$, of size $(m/s) \times (n/s)$, where $s$ is the scale factor. $I_{SR}$ represents the corresponding reconstructed SR image of size $m \times n$.

3.1. Structure of the GN

The overall architecture of the GN is shown in Figure 4. According to their functions, the components of the GN can be divided into four parts: feature extraction, dense residual units (DRUs), residual learning and image reconstruction. $I_L$ and $I_{SR}$ are the input and output of the GN, respectively. Nah et al. [45] removed the batch normalization layers in their image deblurring network because these layers normalize the features and thereby remove range flexibility. In other words, batch normalization layers are better suited to classification than to SR. Therefore, we do not use batch normalization (BN) layers anywhere in the GN, as shown in Figure 4.

3.1.1. Feature Extraction

We first employ two convolutional layers to extract features, because convolutional layers play two significant roles here: mitigating the effect of noise and strengthening the characteristics of the original signal. The feature extraction part can be expressed as follows:
$$\begin{cases} F_1 = g(W_{FE,1} * I_L + B_{FE,1}) \\ F_E = g(W_{FE,2} * F_1 + B_{FE,2}) \end{cases} \tag{1}$$
where $W_{FE,1}$ and $W_{FE,2}$ represent $n_{FE,1}$ convolution kernels of size $c \times k_{FE,1} \times k_{FE,1}$ and $n_{FE,2}$ convolution kernels of size $n_{FE,1} \times k_{FE,2} \times k_{FE,2}$, respectively; $c$ denotes the number of channels of the input image $I_L$; $k_{FE,1}$ and $k_{FE,2}$ are the spatial sizes of the convolution filters; $B_{FE,1}$ and $B_{FE,2}$ represent the biases; $*$ represents the convolution operation; $g(\cdot)$ represents the activation function; and $F_E$ is the output of the feature extraction part and the input of the DRUs.
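The feature extraction step can be sketched in plain Python for the simplest case. This is a minimal sketch, assuming a single input channel and a single kernel per layer ($c = n_{FE,1} = n_{FE,2} = 1$); real layers use many kernels and a framework such as TensorFlow. Zero padding keeps the spatial size unchanged, matching the padding described in Section 4.2.

```python
from typing import List

Image = List[List[float]]


def conv2d_same(image: Image, kernel: Image, bias: float = 0.0) -> Image:
    """Single-channel 2-D convolution as computed by deep-learning layers
    (i.e., cross-correlation), with zero padding so the output keeps the input size."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # odd, square kernel such as 3 x 3
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for u in range(k):
                for v in range(k):
                    y, x = i + u - r, j + v - r
                    if 0 <= y < h and 0 <= x < w:  # pixels outside the image count as zero
                        acc += kernel[u][v] * image[y][x]
            out[i][j] = acc
    return out


def prelu_map(fmap: Image, alpha: float = 0.25) -> Image:
    """Apply g(x) = x for x > 0 and alpha * x otherwise to every element."""
    return [[v if v > 0 else alpha * v for v in row] for row in fmap]


def feature_extraction(i_l: Image, w1: Image, b1: float, w2: Image, b2: float,
                       alpha: float = 0.25):
    """F_1 = g(W_FE,1 * I_L + B_FE,1) and F_E = g(W_FE,2 * F_1 + B_FE,2),
    restricted to one channel and one kernel per layer."""
    f_1 = prelu_map(conv2d_same(i_l, w1, b1), alpha)
    f_e = prelu_map(conv2d_same(f_1, w2, b2), alpha)
    return f_1, f_e
```

With an identity kernel (a single 1 at the center of a 3 × 3 grid of zeros), each layer only applies PReLU, so a negative input pixel of −2.0 becomes −0.5 in $F_1$ and −0.125 in $F_E$.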
For SR, we process only the luminance channel of the images, since human eyes are more sensitive to brightness information. Thus, we extract the Y channel after converting the images from the RGB to the YCbCr color space. The remaining two channels are upscaled to the required size via bicubic interpolation, and the final SR image is obtained by fusing the three channels. Therefore, the number of channels of the input image $I_L$ is always $c = 1$.
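The color handling can be sketched as follows, assuming the full-range ITU-R BT.601 (JPEG) YCbCr variant for 8-bit images; the paper does not state which variant it uses, and the bicubic upscaling of Cb and Cr is left out.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]


def rgb_to_ycbcr(r: float, g: float, b: float) -> Pixel:
    """Full-range BT.601 (JPEG/JFIF) conversion for 8-bit values (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> Pixel:
    """Approximate inverse of rgb_to_ycbcr (the coefficients are rounded)."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b


def fuse_channels(y_sr: List[List[float]], cb_up: List[List[float]],
                  cr_up: List[List[float]]) -> List[List[Pixel]]:
    """Recombine the super-resolved Y channel with the upscaled Cb/Cr planes into RGB pixels."""
    return [[ycbcr_to_rgb(y, cb, cr) for y, cb, cr in zip(ry, rcb, rcr)]
            for ry, rcb, rcr in zip(y_sr, cb_up, cr_up)]
```

Only the Y plane would be passed through the GN; the chroma planes carry little high-frequency detail, so cheap interpolation is usually considered adequate for them.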
This paper adopts the parametric rectified linear unit (PReLU) [46] as the activation function $g(\cdot)$, which also has a regularizing effect to a certain extent. Compared with ReLU [47], PReLU improves the convergence rate of the network at the cost of only a few additional parameters. $g(\cdot)$ is defined as follows:
$$g(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha_t x, & \text{if } x < 0 \end{cases} \tag{2}$$
where $\alpha_t$ is a learnable parameter, $\alpha$ is initialized to 0.25 and $t$ denotes the iteration. During back-propagation, $\alpha_t$ is updated with momentum as
$$\Delta\alpha_{t+1} = \mu \Delta\alpha_t + \varepsilon \frac{\partial L}{\partial \alpha_t} \tag{3}$$
where μ denotes the momentum, ε refers to the learning rate and L represents the loss function.
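A minimal sketch of the PReLU activation and the momentum update of $\alpha$ for a single scalar parameter. The default momentum value and the descent sign convention (subtracting the update from $\alpha$) are assumptions for illustration.

```python
from typing import Tuple


def prelu(x: float, alpha: float) -> float:
    """g(x): identity for x > 0, slope alpha otherwise."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x: float) -> float:
    """Partial derivative dg/dalpha: zero on the positive side, x on the negative side."""
    return 0.0 if x > 0 else x


def momentum_step(alpha: float, delta: float, grad: float,
                  mu: float = 0.9, eps: float = 1e-4) -> Tuple[float, float]:
    """delta_{t+1} = mu * delta_t + eps * dL/dalpha_t.
    Applying it as alpha_{t+1} = alpha_t - delta_{t+1} (gradient descent) is assumed."""
    delta = mu * delta + eps * grad
    return alpha - delta, delta
```

Because $\partial g / \partial \alpha$ is zero for positive inputs, only negative pre-activations contribute to the update; by the chain rule, $\partial L / \partial \alpha$ is the upstream gradient times `prelu_grad_alpha(x)`, summed over all positions that share the same $\alpha$.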

3.1.2. DRUs

Assume that there are d DRUs; the specific architecture of each DRU is shown in Figure 5. Each DRU includes three convolutional layers, three activation layers, one weighted-sum layer and one element-wise sum layer. The convolutional layers in each DRU are densely connected in the manner shown in Figure 5. GRL and LRL are utilized simultaneously.
The operation of the $p$-th DRU, $\mathrm{DRU}_p$, can be formulated as follows:
$$\begin{cases} D_{p,1} = S_{p,1}\big(g(W_{p,1} * D_{p-1} + B_{p,1}),\ D_{p-1}\big) \\ D_{p,2} = S_{p,2}\big(g(W_{p,2} * D_{p,1} + B_{p,2}),\ D_{p,1},\ D_{p-1}\big) \\ D_{p,3} = S_{p,3}\big(g(W_{p,3} * D_{p,2} + B_{p,3}),\ D_{p,2},\ D_{p,1},\ D_{p-1}\big) \\ D_p = D_{p,3} + D_{p-1} \end{cases} \tag{4}$$
where $W_{p,1}$ to $W_{p,3}$ and $B_{p,1}$ to $B_{p,3}$ represent the kernels and biases, respectively, of the three successive convolutional layers; $S_{p,1}$ to $S_{p,3}$ denote the successive weighted-sum layers; $D_{p,1}$ to $D_{p,3}$ represent the successive outputs following the three convolutional layers (the activation layers are omitted for clarity); and $D_p$ denotes the output of $\mathrm{DRU}_p$.
The blue lines in Figure 5 indicate that the outputs of the preceding convolutional layers in a DRU are fed into the subsequent convolutional layers, forming the short-term memory. Similarly, the red and purple lines in Figure 5 indicate that the outputs of the preceding DRUs are fed into later layers, forming the long-term memory. Because the outputs of previous DRUs and convolutional layers connect directly to later layers, the network both preserves the feed-forward features and extracts local dense features. Together, these connections constitute a memory mechanism.
Because the output of the previous DRU and those of all preceding convolutional layers are fed into each subsequent layer, the number of features must be reduced to lighten the burden on the network. Thus, we employ the weighted-sum layers $S_{p,1}$ to $S_{p,3}$, which adaptively learn a specific weight for each memory and thereby determine how much of the long-term and short-term memory is retained. We refer to the operation of $S_{p,1}$ to $S_{p,3}$ in $\mathrm{DRU}_p$ as the local decision function.
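The data flow of a DRU can be sketched as follows. This is a simplified illustration, not the authors' implementation: feature maps are flattened lists, `layers` are hypothetical callables standing in for convolution plus PReLU, and each weighted-sum layer is assumed to learn one scalar weight per incoming memory (the exact form of $S_{p,i}$ is not specified here).

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def weighted_sum(inputs: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted-sum layer S: one (learnable) scalar weight per incoming memory (assumed form)."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    return [sum(w * feat[k] for w, feat in zip(weights, inputs))
            for k in range(len(inputs[0]))]


def dense_residual_unit(d_prev: Vector, layers: Sequence[Layer],
                        weights: Sequence[Sequence[float]]) -> Vector:
    """One DRU.

    d_prev  : D_{p-1}, the output of the previous DRU
    layers  : three hypothetical callables for g(W_{p,i} * . + B_{p,i})
    weights : weights of S_{p,1} (2 inputs), S_{p,2} (3 inputs), S_{p,3} (4 inputs)
    """
    conv1, conv2, conv3 = layers
    s1, s2, s3 = weights
    d1 = weighted_sum([conv1(d_prev), d_prev], s1)          # local decision 1
    d2 = weighted_sum([conv2(d1), d1, d_prev], s2)          # dense links to earlier outputs
    d3 = weighted_sum([conv3(d2), d2, d1, d_prev], s3)      # short- and long-term memory
    return [a + b for a, b in zip(d3, d_prev)]              # LRL: D_p = D_{p,3} + D_{p-1}
```

The weights of the last weighted-sum layer decide how much each memory contributes: with every layer doubling its input, `[0, 0, 0, 1]` for $S_{p,3}$ keeps only $D_{p-1}$ (so `[1.0, 2.0]` maps to `[2.0, 4.0]`), whereas `[1, 0, 0, 0]` keeps only the newest convolution output (giving `[7.0, 14.0]`).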

3.1.3. Residual Learning

In recent studies, residual networks have performed well on computer vision tasks ranging from low-level to high-level. In this paper, we adopt both LRL and GRL to make full use of their advantages. As shown in Figure 5, the blue lines represent LRL for the GN, and the red lines denote GRL. The residual learning part can be formulated as follows:
$$\begin{cases} R_{ws} = S_{RL}(D_1, D_2, \ldots, D_d, F_E) \\ R_{ws,1} = g(W_{RL,1} * R_{ws} + B_{RL,1}) \\ R = R_{ws,1} + F_1 \end{cases} \tag{5}$$
where $S_{RL}$ denotes the weighted-sum layer; $W_{RL,1}$ and $B_{RL,1}$ represent the kernel and bias, respectively, of the convolutional layer; $D_1$ to $D_d$ represent the outputs of the $d$ DRUs in order; and $R_{ws}$, $R_{ws,1}$ and $R$ denote the outputs of the weighted-sum layer, the convolutional layer and the element-wise sum layer of the residual learning part, respectively.
In the residual learning part, LRL acts between the DRUs and the weighted-sum layer, whereas GRL acts between the input image $I_L$ and the element-wise sum layer, as shown in Figure 5. The weighted-sum layer $S_{RL}$ gathers the hierarchical features passed from the previous DRUs through LRL and decides their proportions in the subsequent features. In contrast to the local decision functions $S_{p,1}$ to $S_{p,3}$ in $\mathrm{DRU}_p$, we call the operation of $S_{RL}$ the global decision function. The convolutional layer $W_{RL,1}$ further exploits the features, and the element-wise sum layer implements GRL. Combining LRL and GRL improves the performance of the GN and makes it less prone to over-fitting.
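Under the same simplifying assumptions as for the DRU (flattened features, one scalar weight per input of the weighted-sum layer, and a hypothetical callable `conv_rl` in place of the convolutional layer $W_{RL,1}$ with its activation), the residual learning part reduces to the following sketch:

```python
from typing import Callable, List, Sequence

Vector = List[float]


def weighted_sum(inputs: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted-sum layer with one scalar weight per input (assumed form)."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    return [sum(w * feat[k] for w, feat in zip(weights, inputs))
            for k in range(len(inputs[0]))]


def residual_learning(dru_outputs: Sequence[Vector], f_e: Vector, f_1: Vector,
                      weights: Sequence[float],
                      conv_rl: Callable[[Vector], Vector]) -> Vector:
    """R_ws = S_RL(D_1, ..., D_d, F_E); R_ws,1 = g(W_RL,1 * R_ws + B_RL,1); R = R_ws,1 + F_1."""
    r_ws = weighted_sum(list(dru_outputs) + [f_e], weights)  # global decision function
    r_ws1 = conv_rl(r_ws)                                    # further feature exploitation
    return [a + b for a, b in zip(r_ws1, f_1)]               # element-wise sum for GRL
```

For instance, two DRU outputs `[1.0]` and `[3.0]`, $F_E$ = `[2.0]` and weights `[0.25, 0.25, 0.5]` give $R_{ws}$ = `[2.0]`; with an identity `conv_rl` and $F_1$ = `[10.0]`, the output is `[12.0]`.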

3.1.4. Image Reconstruction

Inspired by ESPCN, we adopted a sub-pixel convolutional layer for image upscaling and reconstruction in addition to a convolutional layer. The whole function of the part of image reconstruction can be formulated as follows:
$$\begin{cases} I_1 = g(W_{IR,1} * R + B_{IR,1}) \\ I_{SR} = I = W_{IR,sc} \circledast I_1 \end{cases} \tag{6}$$
where $W_{IR,1}$ and $B_{IR,1}$ represent the kernel and bias, respectively, of the convolutional layer; $W_{IR,sc}$ denotes the sub-pixel convolutional layer; $\circledast$ denotes the sub-pixel convolution operation; $I_1$ and $I$ denote the outputs of the convolutional layer and the sub-pixel convolutional layer of the image reconstruction part, respectively; and $I$ is the final SR image $I_{SR}$.
The sub-pixel convolutional layer $W_{IR,sc}$ can be conceptually separated into two steps, as illustrated in Figure 6:
1) Convolution. Similar to the previous convolutional layers in the GN, this step extracts features. The difference is that it produces $s^2$ feature maps, where $s$ is the upscaling factor.
2) Arrangement. The pixels at the same position in the $s^2$ feature maps are arranged in a predetermined order to form an area of size $s \times s$, and each such area corresponds to a mini-patch of the final SR image $I_{SR}$. In this manner, the final feature maps of size $s^2 \times (m/s) \times (n/s)$ are rearranged into $I_{SR}$ of size $1 \times m \times n$. This step is a pure rearrangement of pixels without any convolution and therefore takes very little time.
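The arrangement step is a pure index permutation, which the following sketch makes explicit. It assumes that feature map number $a \cdot s + b$ supplies the pixel at row offset $a$ and column offset $b$ of every $s \times s$ area; this is the ordering commonly used by depth-to-space operations for one output channel, but the paper's predetermined order is not specified.

```python
from typing import List

Image = List[List[float]]


def pixel_shuffle(maps: List[Image], s: int) -> Image:
    """Rearrange s*s feature maps of size h x w into one image of size (h*s) x (w*s).

    Map a*s + b provides the pixel at offset (a, b) inside every s x s area,
    so each LR position (i, j) becomes the mini-patch at rows i*s..i*s+s-1 and
    columns j*s..j*s+s-1. No arithmetic is performed, only copying.
    """
    if len(maps) != s * s:
        raise ValueError("expected s * s feature maps")
    h, w = len(maps[0]), len(maps[0][0])
    out = [[0.0] * (w * s) for _ in range(h * s)]
    for a in range(s):
        for b in range(s):
            fmap = maps[a * s + b]
            for i in range(h):
                for j in range(w):
                    out[i * s + a][j * s + b] = fmap[i][j]
    return out
```

For $s = 2$, four 1 × 2 maps `[[1, 5]]`, `[[2, 6]]`, `[[3, 7]]` and `[[4, 8]]` become the 2 × 4 image `[[1, 2, 5, 6], [3, 4, 7, 8]]`: each LR position turns into one 2 × 2 mini-patch.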

3.2. Structure of the DN

According to GAN theory, the adversarial networks consist of a DN in addition to the GN: the GN produces the reconstructed image $I_{SR}$, while the DN distinguishes between the ground-truth image $I_G$ and $I_{SR}$. We therefore optimize the DN parameters $\theta_{DN}$ and the GN parameters $\theta_{GN}$ alternately to solve the adversarial min-max problem:
$$\min_{\theta_{GN}} \max_{\theta_{DN}} \; \mathbb{E}_{I_G \sim P_D}\big[\log \theta_{DN}(I_G)\big] + \mathbb{E}_{I_{SR} \sim P_G}\big[\log\big(1 - \theta_{DN}(I_{SR})\big)\big] \tag{7}$$
where $P_D$ is the distribution of ground-truth images and $P_G$ is the distribution of reconstructed images.
Through this adversarial training, we can recover an $I_{SR}$ that is highly similar to the ground-truth image $I_G$ and difficult for the DN to distinguish from it.
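For intuition, the min-max objective can be estimated on a mini-batch from the DN's output probabilities. The sketch below is illustrative only; `d_real` and `d_fake` are hypothetical lists of DN outputs in (0, 1) for ground-truth and reconstructed images.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Mini-batch estimate of E[log D(I_G)] + E[log(1 - D(I_SR))]."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

The DN tries to increase this value and the GN tries to decrease it: a confident, correct DN such as `gan_value([0.9], [0.1])` scores about −0.21, whereas a DN that outputs 0.5 everywhere scores about −1.39.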
Unlike the DN in SRGAN, shown in Figure 2, our DN is modified in two respects. First, following WGAN-GP, we replace the last sigmoid layer with a Leaky ReLU layer. The discriminator in SRGAN performs binary real/fake classification, whereas the DN in DRGAN approximates the Wasserstein distance. Second, we remove the BN layers from the DN. The gradient penalty is applied to each sample individually, but BN layers in the DN would undermine it, because BN introduces interdependencies among different samples in the same batch. The final architecture of the DN is shown in Figure 7.

3.3. Loss Function

SRGAN proposed the perceptual loss function $l_{SRGAN}$, a weighted sum of a content loss $l_{con}$ and an adversarial loss $l_{adv}$. The conceptual training process of SRGAN is shown in Figure 8. $l_{SRGAN}$ is formulated as follows:
$$l_{SRGAN} = l_{con} + 10^{-3}\, l_{adv} \tag{8}$$
Specifically, $l_{con}$ is defined as the Euclidean distance between the VGG feature maps of the recovered image $\theta_{GN}(I_L)$ and those of the corresponding ground-truth image $I_G$:
$$l_{con} = \frac{1}{w_{j,k} h_{j,k}} \sum_{x=1}^{w_{j,k}} \sum_{y=1}^{h_{j,k}} \big[f_{j,k}(I_G)_{x,y} - f_{j,k}(\theta_{GN}(I_L))_{x,y}\big]^2 \tag{9}$$
where $f_{j,k}$ is the feature map obtained from the $k$-th convolutional layer before the $j$-th pooling layer in VGG, and $w_{j,k}$ and $h_{j,k}$ denote the dimensions of that feature map.
VGG serves as a universal extractor of high-level features, so $l_{con}$ equals the MSE between the high-level VGG features of the two images. With $l_{con}$, the reconstructed images become more realistic and rich in detail.
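As a small sketch, the content loss $l_{con}$ and the weighted sum $l_{SRGAN}$ can be written as below. The VGG feature extractor itself is not implemented; the functions assume the feature maps $f_{j,k}(\cdot)$ have already been computed and are passed in as nested lists.

```python
from typing import List

FeatureMap = List[List[float]]


def feature_mse(feat_gt: FeatureMap, feat_sr: FeatureMap) -> float:
    """l_con: mean squared difference between two w x h feature maps."""
    w, h = len(feat_gt), len(feat_gt[0])
    total = 0.0
    for x in range(w):
        for y in range(h):
            diff = feat_gt[x][y] - feat_sr[x][y]
            total += diff * diff
    return total / (w * h)


def srgan_perceptual_loss(feat_gt: FeatureMap, feat_sr: FeatureMap, l_adv: float) -> float:
    """l_SRGAN = l_con + 1e-3 * l_adv."""
    return feature_mse(feat_gt, feat_sr) + 1e-3 * l_adv
```

The factor $10^{-3}$ keeps the adversarial term from dominating: a content loss of 1.0 combined with an adversarial loss of 100.0 gives a perceptual loss of 1.1.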
Besides the content loss, SRGAN introduces an adversarial loss that encourages the network to favor solutions residing on the manifold of ground-truth images by trying to fool the DN. The adversarial loss is computed from $\theta_{DN}(\theta_{GN}(I_L))$ over all training samples as
$$l_{adv} = -\log \theta_{DN}(\theta_{GN}(I_L)) \tag{10}$$
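where $\theta_{DN}(\theta_{GN}(I_L))$ denotes the probability that the DN judges the recovered image $\theta_{GN}(I_L)$ to be the ground-truth image $I_G$. SRGAN minimizes Equation (10) rather than the form in Equation (11), which follows directly from the min-max problem, because Equation (10) provides better gradient behavior: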
$$l_{adv} = \log\big[1 - \theta_{DN}(\theta_{GN}(I_L))\big] \tag{11}$$
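The gradient argument behind the two adversarial losses can be checked with a short computation. Writing $p = \theta_{DN}(\theta_{GN}(I_L))$, the derivative of $\log(1 - p)$ is $-1/(1-p)$ and that of $-\log p$ is $-1/p$. Early in training, the DN confidently rejects reconstructions, so $p$ is close to 0: the first loss then saturates (its derivative stays near −1), while the second still provides a large gradient. The helper names are illustrative.

```python
def grad_saturating(p: float) -> float:
    """d/dp of log(1 - p), the form that follows directly from the min-max problem."""
    return -1.0 / (1.0 - p)


def grad_non_saturating(p: float) -> float:
    """d/dp of -log(p), the form SRGAN minimizes for the GN."""
    return -1.0 / p
```

At $p = 0.01$, `grad_saturating` returns about −1.01 whereas `grad_non_saturating` returns −100, so the GN receives a much stronger learning signal exactly when its reconstructions are poorest.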
In this paper, however, we follow WGAN-GP and modify the loss function $l_{DRGAN}$ of the proposed DRGAN to alleviate unstable training, vanishing or exploding gradients and mode collapse. We train our model with the WGAN-GP method, which enforces the Lipschitz constraint through a gradient penalty instead of weight clipping and thereby avoids exploding gradients during training. For this reason, as mentioned above, we omit the BN layers in the DN, because BN layers may introduce interdependencies among different samples in the same batch. Moreover, a loss based on the MSE in pixel space is added, and the DN discriminates between the VGG feature maps of $I_{SR}$ and $I_G$. In this manner, we can obtain convincing reconstructed images with abundant details as well as high PSNRs. The corresponding training process of the proposed DRGAN is shown in Figure 9.
Let $l_{GN}$ denote the loss function of the GN and $l_{DN}$ the loss function of the DN. Unlike $l_{con}$ in SRGAN, $l_{GN}$ is formulated as
$$l_{GN} = \frac{1}{mn} \sum_{x=1}^{m} \sum_{y=1}^{n} \big[(I_G)_{x,y} - (\theta_{GN}(I_L))_{x,y}\big]^2 \tag{12}$$
where $l_{GN}$ is the MSE between the reconstructed image $I_{SR}$ and the corresponding ground-truth image $I_G$ in pixel space. On its own, the MSE loss yields solutions with the highest PSNR values, which are, however, perceptually rather smooth and less convincing than results obtained with a loss component that is more sensitive to visual perception; this is why it is combined with an adversarial loss.
$l_{DN}$ differs from $l_{adv}$ in SRGAN (Equation (10)) in three respects. First, the DN no longer distinguishes the reconstructed image $I_{SR}$ from the corresponding ground-truth image directly; instead, VGG extracts the high-level feature maps of $I_{SR}$ and $I_G$, and the DN in DRGAN distinguishes these feature maps. Second, the output $\theta_{DN}(\theta_{GN}(\cdot))$ is used without a logarithm, because the probability of distinguishing fake from real data is replaced by the Wasserstein distance between the distributions of ground-truth and reconstructed images; accordingly, the DN in DRGAN has no final sigmoid layer. Third, a gradient penalty is added to keep the gradients stable during back-propagation. The loss function $l_{DN}$ of the DN can be formulated as
$$l_{DN} = \theta_{DN}\big(f(\theta_{GN}(I_L))\big) - \theta_{DN}\big(f(I_G)\big) + \lambda \big[\lVert \nabla_z \theta_{DN}(z) \rVert_2 - 1\big]^2 \tag{13}$$
where $f(\theta_{GN}(I_L))$ and $f(I_G)$ represent the VGG feature maps of $I_{SR}$ and $I_G$; $[\lVert \nabla_z \theta_{DN}(z) \rVert_2 - 1]^2$ is the gradient penalty of WGAN-GP; $\lambda$ is its coefficient, set to 10 based on several comparative experiments; $\nabla_z$ denotes the gradient with respect to $z$; and $z$ is a random interpolation between the two feature maps:
$$z = \beta f(I_G) + (1 - \beta) f(\theta_{GN}(I_L)), \quad \beta \sim U[0, 1] \tag{14}$$
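A minimal sketch of the DN loss $l_{DN}$ with the interpolated sample $z$, averaged over a mini-batch. It assumes the VGG feature maps are flattened into lists of floats and that `critic` is any callable standing in for $\theta_{DN}$. The gradient $\nabla_z \theta_{DN}(z)$ is estimated by central differences purely for self-containment; a TensorFlow implementation would obtain it by automatic differentiation.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Critic = Callable[[Vector], float]


def numerical_gradient(critic: Critic, z: Vector, h: float = 1e-5) -> Vector:
    """Central-difference estimate of grad_z critic(z)."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grad.append((critic(z_plus) - critic(z_minus)) / (2.0 * h))
    return grad


def critic_loss(critic: Critic, real_feats: Sequence[Vector], fake_feats: Sequence[Vector],
                lam: float = 10.0, rng: Optional[random.Random] = None) -> float:
    """Mini-batch average of critic(fake) - critic(real) + lam * (||grad_z critic(z)|| - 1)^2."""
    rng = rng or random.Random(0)
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        beta = rng.uniform(0.0, 1.0)  # beta ~ U[0, 1], drawn per sample
        z = [beta * r + (1.0 - beta) * f for r, f in zip(real, fake)]
        grad_norm = math.sqrt(sum(g * g for g in numerical_gradient(critic, z)))
        penalty = lam * (grad_norm - 1.0) ** 2          # gradient penalty term
        total += critic(fake) - critic(real) + penalty  # Wasserstein estimate + penalty
    return total / len(real_feats)
```

For a linear critic $3z_0 + 4z_1$, the gradient norm is 5 everywhere, so the penalty is $10 \cdot (5 - 1)^2 = 160$; with real features `[1.0, 1.0]` and fake features `[0.0, 0.0]`, the loss is $0 - 7 + 160 = 153$. A critic with unit gradient norm, such as $0.6 z_0 + 0.8 z_1$, incurs no penalty, which is the 1-Lipschitz behavior that WGAN-GP encourages.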
The whole process of training the proposed DRGAN can be divided into five steps:
1) Feed the LR image $I_L$ into the GN, obtain the corresponding reconstructed image $I_{SR}$ and compute the MSE-based content loss $l_{GN}$.
2) Feed the reconstructed image $I_{SR}$ and the corresponding ground-truth image $I_G$ into VGG and extract their high-level features.
3) Feed the extracted feature maps into the DN and obtain the adversarial loss. The final loss is the weighted sum of the content loss $l_{GN}$ and the adversarial loss $l_{DN}$.
4) Perform the backward pass and compute the gradients of each layer. Optimize the network iteratively by updating the parameters of the DN and GN according to the training policy.
5) Repeat the above steps until the loss of the network converges, at which point training is finished.
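Step 3 can be sketched from the GN's point of view as below. The adversarial term for the GN is assumed to be the WGAN-style $-\theta_{DN}(f(I_{SR}))$, i.e., the part of $l_{DN}$ through which the GN is usually trained; the weight `adv_weight` is hypothetical, since no value is given here.

```python
from typing import List

Image = List[List[float]]


def pixel_mse(i_g: Image, i_sr: Image) -> float:
    """l_GN: pixel-space MSE between the ground truth and the reconstruction."""
    m, n = len(i_g), len(i_g[0])
    return sum((i_g[x][y] - i_sr[x][y]) ** 2 for x in range(m) for y in range(n)) / (m * n)


def generator_objective(i_g: Image, i_sr: Image, critic_score_sr: float,
                        adv_weight: float = 1e-3) -> float:
    """Weighted sum of the content loss and a WGAN-style adversarial term.

    critic_score_sr stands for theta_DN(f(I_SR)); the GN wants it to be large,
    hence the minus sign. adv_weight is a hypothetical value.
    """
    return pixel_mse(i_g, i_sr) - adv_weight * critic_score_sr
```

Minimizing this objective pulls $I_{SR}$ toward $I_G$ pixel by pixel (high PSNR) while pushing its VGG features toward the region that the DN scores as real (perceptual quality).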
The proposed loss function reflects the training progress better than that of an ordinary GAN. Moreover, the gradients, including those of the gradient penalty, are back-propagated to both the GN and the DN, so that training jointly minimizes the generator loss $l_{GN}$ and optimizes the adversarial loss $l_{DN}$.

4. Experiments

In this section, we first describe the preparation of the experiments. Then, we present the implementation details and introduce three image quality evaluation indexes that are commonly used in the related literature.

4.1. Dataset

NWPU-RESISC45 [48] is a widely used scene classification dataset consisting of remote sensing images of 256 × 256 pixels. It contains 45 scene classes in total, with 700 images per class. In this study, we chose the airplane class and used 500 airplane images for training, 100 for validation and the remaining 100 for testing.
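The split can be reproduced with a sketch such as the following. How the 500 training images were selected is not stated, so the seeded random shuffle is an assumption; `image_ids` is a hypothetical list of the 700 airplane image identifiers.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(image_ids: Sequence[str], n_train: int = 500, n_val: int = 100,
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle reproducibly, then take training, validation and the remaining test images."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

With 700 images, the three subsets have 500, 100 and 700 − 500 − 100 = 100 images, and fixing the seed keeps the split identical across runs.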

4.2. Training Details

Following WGAN [38], we adopt RMSprop [49] rather than Adam [50] to optimize our model; the weight matrices $W$ are updated as
$$(v_t)_q = \begin{cases} (v_{t-1})_q + \delta, & \text{if } (\nabla L(W_t))_q \, (\nabla L(W_{t-1}))_q > 0 \\ (v_{t-1})_q \cdot (1 - \delta), & \text{otherwise} \end{cases} \tag{15}$$
$$(W_{t+1})_q = (W_t)_q - \varepsilon (v_t)_q \tag{16}$$
where $\delta$ is initialized to 0.02, $W$ denotes the weights of the network, $q$ indexes the elements of $W$, $v$ represents an adaptive moment estimate, $t$ denotes the iteration and the learning rate $\varepsilon$ is initialized to 0.0001.
Before training, we augment the remote sensing images by horizontal flipping and rotation. Then, we down-sample the ground-truth training images $I_G$ by the required upscaling factor $s$ to obtain the LR images $I_L$. For each mini-batch, we cropped 16 random sub-images of size 64 × 64 from the LR training samples, together with sub-images of size 256 × 256 from the ground-truth training samples. Considering both training time and network complexity, we employed eight DRUs ($d = 8$) in the GN described in Section 3.1. Each convolutional layer in the GN uses 3 × 3 kernels and 64 feature maps. Moreover, we adopted zero padding in each convolutional layer so that its outputs have the same spatial size as its inputs.
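The augmentation step can be sketched as follows, assuming rotations by multiples of 90° combined with horizontal flipping (the rotation angles are not stated), which yields up to eight variants of each image.

```python
from typing import List

Image = List[List[float]]


def hflip(img: Image) -> Image:
    """Mirror each row (horizontal flip)."""
    return [list(reversed(row)) for row in img]


def rot90(img: Image) -> Image:
    """Rotate by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def augment(img: Image) -> List[Image]:
    """Return the four rotations of the image and of its horizontal flip."""
    variants = []
    for base in (img, hflip(img)):
        current = base
        for _ in range(4):
            variants.append(current)
            current = rot90(current)
    return variants
```

These eight transforms form the symmetry group of the square, so for square patches such as 256 × 256 images no variant changes the image size. The LR images would then be produced from each augmented $I_G$ by down-sampling.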
We implemented the experiments in TensorFlow [51] and accelerated them with a single NVIDIA GTX 1080 Ti GPU with 11 GB of memory. Specifically, we first pretrained the GN alone using only the MSE loss $l_{MSE}$ formulated in Equation (12), and then used it to initialize the entire DRGAN network to avoid undesirable local optima. The whole training process took approximately four days.
$$
l_{MSE} = \frac{1}{mn}\sum_{x=1}^{m}\sum_{y=1}^{n}\left[(I_G)_{x,y} - \left(\theta_{GN}(I_L)\right)_{x,y}\right]^2 \tag{12}
$$

4.3. Quantitative Evaluation Factors

4.3.1. Peak Signal-To-Noise Ratio (PSNR)

The PSNR [52] is adopted in this paper as a quality evaluation index for the reconstructed HR images. It depends on the MSE between the ground-truth images $X = \{X_i\}$ and the reconstructed HR images $H = \{H_i\}$. MSE and PSNR are computed as follows:
$$
MSE = \frac{1}{mn}\sum_{a=1}^{m}\sum_{b=1}^{n}\left(X_i(a,b) - H_i(a,b)\right)^2,
$$
$$
PSNR = 10\log_{10}\frac{255^2}{MSE},
$$
where m and n denote the height and width of images $X_i$ and $H_i$, and a and b are the row and column indices of a pixel.
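A minimal NumPy sketch of the two formulas for a single image pair is given below. It assumes 8-bit data (peak value 255, as in the PSNR formula) and averages the squared error over all pixels and channels. The text does not say whether color images are evaluated on all channels or on luminance only, so this choice is an assumption. Per-image scores would presumably then be averaged over the test set.

```python
import numpy as np

def mse(x, h):
    """Mean squared error over all pixels (and channels, if present)."""
    x = np.asarray(x, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    return float(np.mean((x - h) ** 2))

def psnr(x, h, peak=255.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE)."""
    err = mse(x, h)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / err))

x = np.zeros((4, 4))
h = np.ones((4, 4))
print(round(psnr(x, h), 2))  # MSE = 1, so PSNR = 20 * log10(255) = 48.13
```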

4.3.2. Structural Similarity Index (SSIM)

The SSIM [52] is also commonly used to evaluate the quality of reconstructed HR images; it is calculated as follows:
$$
SSIM(X_i, H_i) = c(X_i, H_i)\, d(X_i, H_i)\, e(X_i, H_i),
$$
where $c(X_i, H_i)$ is the luminance comparison, $d(X_i, H_i)$ is the contrast comparison and $e(X_i, H_i)$ is the structure comparison, given by
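$$
\begin{cases}
c(X_i, H_i) = \dfrac{2\mu_{X_i}\mu_{H_i} + C_1}{\mu_{X_i}^2 + \mu_{H_i}^2 + C_1} \\[2ex]
d(X_i, H_i) = \dfrac{2\sigma_{X_i}\sigma_{H_i} + C_2}{\sigma_{X_i}^2 + \sigma_{H_i}^2 + C_2} \\[2ex]
e(X_i, H_i) = \dfrac{\sigma_{X_i H_i} + C_3}{\sigma_{X_i}\sigma_{H_i} + C_3}
\end{cases},
$$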
where $\sigma_{X_i}^2$ and $\sigma_{H_i}^2$ denote the variances of images $X_i$ and $H_i$; $\sigma_{X_i H_i}$ is the covariance between $X_i$ and $H_i$; $\mu_{X_i}$ and $\mu_{H_i}$ are the mean values of $X_i$ and $H_i$; and $C_1$, $C_2$ and $C_3$ are constants.
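The three comparison terms can be evaluated directly. The sketch below computes them once over the whole image, exactly as the formulas are written. The widely used SSIM of Wang et al. instead evaluates them in local windows (typically 11 × 11 Gaussian) and averages the resulting map, so values from library implementations will differ. The constants C1 = (0.01·255)², C2 = (0.03·255)² and C3 = C2/2 are the customary choices; they are assumptions here, since the text only calls them constants.

```python
import numpy as np

def ssim_global(x, h, peak=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM: luminance * contrast * structure over the whole image."""
    x = np.asarray(x, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    c3 = c2 / 2.0                                   # customary choice (assumed)
    mu_x, mu_h = x.mean(), h.mean()
    sd_x, sd_h = x.std(), h.std()
    cov = np.mean((x - mu_x) * (h - mu_h))
    lum = (2 * mu_x * mu_h + c1) / (mu_x ** 2 + mu_h ** 2 + c1)  # c(X, H)
    con = (2 * sd_x * sd_h + c2) / (sd_x ** 2 + sd_h ** 2 + c2)  # d(X, H)
    struct = (cov + c3) / (sd_x * sd_h + c3)                     # e(X, H)
    return float(lum * con * struct)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
same = ssim_global(img, img)                # approximately 1.0
inverted = ssim_global(img, 255.0 - img)    # negative: the structure term is close to -1
```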

4.3.3. Normalized Root Mean Square Error (NRMSE)

The normalized root mean square error (NRMSE) used in [53] measures the distance between the data predicted by the mapping model and the originally observed data. It is computed as follows; a smaller NRMSE indicates a better reconstructed HR image.
$$
\mathrm{NRMSE}(X, H) = \frac{\sqrt{\mathrm{MSE}(X, H)}}{255}
$$
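In code, NRMSE is simply the root of the MSE divided by the 8-bit peak value. The sketch assumes 8-bit images, matching the constant 255 in the formula.

```python
import numpy as np

def nrmse(x, h, peak=255.0):
    """sqrt(MSE(X, H)) / 255 for 8-bit images; lower is better."""
    x = np.asarray(x, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    return float(np.sqrt(np.mean((x - h) ** 2)) / peak)

# A uniform error of 25.5 grey levels gives RMSE = 25.5 and NRMSE of about 0.1.
score = nrmse(np.zeros((4, 4)), np.full((4, 4), 25.5))
```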

4.3.4. Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS)

The erreur relative globale adimensionnelle de synthèse (ERGAS) [54] measures the quality of reconstructed HR images while taking the scale factor into account, and it is formulated as:
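$$
\mathrm{ERGAS}(X, H) = \frac{100}{s}\sqrt{\frac{1}{c}\sum_{k=1}^{c}\left[\frac{\sqrt{\mathrm{MSE}(X^{(k)}, H^{(k)})}}{\mu_{X^{(k)}}}\right]^2}
$$
where s represents the scale factor, c denotes the number of image channels, $X^{(k)}$ and $H^{(k)}$ are the k-th channels of X and H, and $\mu_{X^{(k)}}$ is the mean value of $X^{(k)}$. The smaller the value of ERGAS, the better the quality of the reconstructed HR image.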
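The ERGAS formula can be sketched in NumPy by computing, for each channel, the RMSE relative to the channel mean of the ground truth and then scaling by 100/s. The sketch assumes X is the ground-truth image with a nonzero mean in every channel. A grey-level image is treated as a single channel.

```python
import numpy as np

def ergas(x, h, s):
    """ERGAS = (100 / s) * sqrt(mean_k (RMSE_k / mean(X_k))^2), k over channels."""
    x = np.asarray(x, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    if x.ndim == 2:                       # treat a grey image as one channel
        x, h = x[..., None], h[..., None]
    ratios = []
    for k in range(x.shape[-1]):
        rmse = np.sqrt(np.mean((x[..., k] - h[..., k]) ** 2))
        ratios.append((rmse / x[..., k].mean()) ** 2)
    return float(100.0 / s * np.sqrt(np.mean(ratios)))

# Every channel off by 10 grey levels around a mean of 100, at scale x4:
x = np.full((8, 8, 3), 100.0)
score = ergas(x, x - 10.0, s=4)  # (100 / 4) * 0.1, i.e. about 2.5
```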

5. Results

To test the performance of the proposed DRGAN-based SR method, we ran tests on public datasets and compared the results of DRGAN with those of several state-of-the-art methods, using bicubic interpolation as the baseline reference. Among DL-based SISR methods, DRGAN was compared with SRCNN [20], FSRCNN [22], ESPCN [21], VDSR [42], DRRN [44] and SRGAN [34], using the publicly available test code released by the corresponding authors. For a fair comparison, we cropped the boundary pixels before evaluation, following SRCNN [20].
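Cropping boundary pixels before evaluation is a simple slice. The sketch below assumes the border width equals the scale factor s, a common convention in SR benchmarks. The text only says that boundary pixels were cropped as in SRCNN [20], so the exact width is an assumption. The same crop would be applied to both the ground-truth and the reconstructed image before computing any metric.

```python
import numpy as np

def shave(img, border):
    """Drop `border` pixels from every side of an H x W (x C) image before scoring."""
    if border <= 0:
        return img
    return img[border:-border, border:-border, ...]

hr = np.arange(256 * 256).reshape(256, 256)
print(shave(hr, 4).shape)  # (248, 248)
```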
Table 1, Table 2, Table 3, Table 4 and Table 5 summarize the PSNR, SSIM, NRMSE, ERGAS and test time, respectively, for three selected images and the whole test set with three upscaling factors (×2, ×3 and ×4). The proposed DRGAN outperforms all of the listed methods at all scales, regardless of which metric is considered. At scale factors of ×2, ×3 and ×4, DRGAN exceeds the second-best method by 0.23, 0.22 and 0.21 dB in PSNR and by 0.0134, 0.0198 and 0.0175 in SSIM, and lowers NRMSE by 0.0004, 0.0006 and 0.0008 and ERGAS by 0.0451, 0.0660 and 0.0234, respectively. Moreover, although SRGAN can generate convincing results, its objective indicators compare poorly with those of the other methods because its loss function is computed in feature space rather than pixel space.
Table 5 shows that, among the relatively deep networks (VDSR, DRRN, SRGAN and the proposed DRGAN), our method reconstructs a test image in far less time than the other approaches.
In addition to the quantitative comparisons, we performed visual comparisons between our method and the methods listed above. The reconstructed HR results for different scale factors are shown in Figure 10, Figure 11 and Figure 12, together with the ground-truth images for reference. For a clearer comparison, an area marked with a green rectangle is enlarged, and the close-up is placed below the corresponding whole image.
Figure 10 shows the SR results of ‘airplane_001.jpg’ with an upscaling factor of ×2. DRGAN reconstructs straight lines accurately, and its edges are the clearest and sharpest among all the approaches.
Figure 11 shows the reconstructed HR results of ‘airplane_095.jpg’ with a scale factor of ×3, with close-ups of the airplane wings. In the images reconstructed by the other deep-learning-based methods, the edges of the wings are relatively blurry or distorted, whereas DRGAN produces more convincing results with fewer artifacts, sharper edges and clearer contrast than the other state-of-the-art methods.
Figure 12 shows the reconstructed HR results of ‘airplane_035.jpg’ with a scale factor of ×4, with the area around the aircraft tail enlarged; here the result of SRGAN is also displayed (Figure 12g). The result obtained with DRGAN (Figure 12h) is the closest to the ground-truth HR image (Figure 12a). At this large scale factor, the aircraft tail is reconstructed cleanly and vividly by SRGAN and DRGAN, with DRGAN better than SRGAN, whereas the other methods produce blurred or distorted results.

6. Discussion

6.1. The Effect of Adding MSE into the Loss Function

SRGAN’s perceptual loss, which consists of an adversarial loss and a content loss, helps the model generate convincing reconstructions, but SRGAN scores poorly on objective indicators compared with other methods because its loss is computed in feature space rather than pixel space. To address this drawback, our method adds an MSE loss to enforce pixel-level similarity between the output image and the target image.
To assess the effect of the MSE loss, we compared the PSNR and SSIM values of HR images reconstructed by networks trained with and without MSE in the loss function, using a scale factor of ×3 on the test set. The network trained with the MSE loss outperforms the one without it, improving PSNR by approximately 0.36 dB and SSIM by approximately 0.0085.
Figure 13 compares the reconstructed HR images of ‘airplane_633.jpg’ obtained with networks trained without and with MSE in the loss function. The close-ups of the airplane’s nose show that the edges reconstructed without the MSE constraint (Figure 13a) are much blurrier than those obtained with the MSE loss (Figure 13b). Across the whole test set, we found that results obtained without the MSE constraint are more prone to artifacts, indicating that adding the MSE loss also yields more realistic visual results.
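The ablation compares a generator objective with and without a pixel-wise MSE term. A minimal NumPy sketch of such a composite objective follows. The weights `w_mse`, `w_content` and `w_adv` are hypothetical placeholders, since the values used in the paper are not stated in this section. `features` stands in for the VGG feature extractor, and the adversarial term uses the WGAN form −mean(D(·)). Setting `w_mse = 0` gives the "without MSE" variant of the ablation.

```python
import numpy as np

def generator_loss(sr, hr, critic_scores_sr, features,
                   w_mse=1.0, w_content=1.0, w_adv=1e-3):
    """Composite SR generator objective (hypothetical weights).

    sr, hr           : reconstructed and ground-truth images, same shape
    critic_scores_sr : critic outputs D(.) for the reconstructed images, shape (N,)
    features         : callable mapping images to feature maps (VGG stand-in)
    """
    l_mse = np.mean((hr - sr) ** 2)                          # pixel-space fidelity
    l_content = np.mean((features(hr) - features(sr)) ** 2)  # feature-space content loss
    l_adv = -np.mean(critic_scores_sr)                       # WGAN generator term
    total = w_mse * l_mse + w_content * l_content + w_adv * l_adv
    return total, {"mse": l_mse, "content": l_content, "adv": l_adv}

# Toy usage with an identity "feature extractor".
rng = np.random.default_rng(0)
hr = rng.random((2, 8, 8, 3))
sr = hr + 0.1                      # uniform 0.1 error, so the MSE term is about 0.01
scores = np.array([2.0, 4.0])
with_mse, parts = generator_loss(sr, hr, scores, features=lambda z: z)
without_mse, _ = generator_loss(sr, hr, scores, features=lambda z: z, w_mse=0.0)
```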

6.2. The Impact of Using $L_{gan}$ or $L_{wgan}$ on Our SR Model

For traditional GAN-based approaches, it is usually difficult to decide when to stop training the generator or the discriminator, and these methods often suffer from vanishing gradients. As mentioned in Section 3, we therefore adopted the key idea of WGAN-GP instead of an ordinary GAN in our model.
For comparison, we plotted the loss convergence curves of the generator of our model when trained with $L_{gan}$ or with $L_{wgan}$, for 100 and 200 training epochs. In Figure 14, the red curves show the loss under $L_{wgan}$ and the blue curves the loss under $L_{gan}$. Figure 14a,b show that, with the ordinary $L_{gan}$, the loss is difficult to converge regardless of the number of epochs (blue curves): after a period of training, the loss increases instead, a symptom of the training instability (such as mode collapse) that often occurs in GANs. $L_{wgan}$ overcomes this drawback: under $L_{wgan}$ (red curves), the loss decreases steadily until convergence.

6.3. Robustness of the Model

To further test the performance of the proposed SR reconstruction model, DRGAN was evaluated on several natural image datasets (Set5 [55], Set14 [56], BSD100 [57] and Urban100 [58]) and on other types of remote sensing images besides airplanes. Table 6 summarizes the PSNR and SSIM results for three upscaling factors (×2, ×3 and ×4). DRGAN outperforms Bicubic, SRCNN and SRGAN at all scales in both PSNR and SSIM, even though the data distribution of the natural-image test sets differs from that of the remote sensing training set. We also compared the subjective quality of the test images. Figure 15 and Figure 16 show the HR images reconstructed by several methods for ‘rectangular_farmland_008.jpg’ from the NWPU-RESISC45 dataset and ‘img_001.png’ from the BSD100 dataset. The close-ups show that DRGAN still produces reasonable reconstructions on these images, which suggests that our model is relatively robust.

6.4. Future Work

DL-based SR of remote sensing images faces more problems than SR of natural images. DL training presupposes sufficient high-quality training samples, but it is not easy to collect a large number of high-quality remote sensing images that satisfy the requirements. Therefore, with the continuous development of DL, transferring knowledge from external datasets has attracted much attention. Natural image datasets are generally easy to collect, and they have higher resolution and contain more detailed information than remote sensing images. The performance of DRGAN could probably be improved by pretraining the model on abundant natural images and then fine-tuning it on remote sensing images. Transfer learning is thus a potential solution to this issue and will be studied in future work.

7. Conclusions

In this paper, we propose a novel SISR method named DRGAN to improve the resolution of remote sensing images. We improve the performance of the GAN by enhancing the ability of the GN to reconstruct images. In particular, we introduce a dense residual network design into the GN and use a memory mechanism to extract hierarchical features for better reconstruction. Furthermore, we add MSE to the loss function and modify the DN and the loss function following WGAN-GP, which improves reconstruction accuracy and avoids vanishing gradients. In addition to airplane images, we also used other types of remote sensing images and several natural image datasets to verify the robustness of our model. Experimental results on a publicly available dataset demonstrate that the proposed method achieves the best accuracy and visual quality among the compared methods. In future work, other techniques such as transfer learning, which can borrow high-frequency information from natural image datasets containing very high-resolution images, will be applied to further improve the method.

Author Contributions

W.M. and Z.P. conceived and designed the experiments; W.M. performed the experiments; F.Y. analyzed the data; B.L. contributed materials and computing resources; W.M. wrote the paper.

Funding

This work was supported by the National Natural Science Foundation of China under grants 61701478 and 61331017.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zou, W.W.W.; Yuen, P.C. Very low resolution face recognition problem. IEEE Trans. Image Process. 2012, 21, 327–340.
  2. Thornton, M.W.; Atkinson, P.M.; Holland, D.A. Sub-pixel mapping of rural land cover objects from fine spatial resolution satellite sensor imagery using super-resolution pixel-swapping. Int. J. Remote Sens. 2006, 27, 473–491.
  3. Timofte, R.; Agustsson, E.; Van Gool, L.; Yang, M.-H.; Zhang, L.; Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M.; et al. NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1110–1121.
  4. Irani, M.; Peleg, S. Improving resolution by image registration. CVGIP Graph. Model. Image Process. 1991, 53, 231–239.
  5. Su, H.; Zhou, J.; Zhang, Z. Survey of super-resolution image reconstruction methods. Acta Autom. Sin. 2013, 39, 1202–1213.
  6. Yang, C.-Y.; Ma, C.; Yang, M.-H. Single-Image Super-Resolution: A Benchmark. Model Data Eng. 2014, 8692, 372–386.
  7. Freedman, G.; Fattal, R. Image and video up-scaling from local self-examples. ACM Trans. Graph. 2011, 2, 12.
  8. Yang, J.; Lin, Z.; Cohen, S. Fast Image Super-Resolution Based on In-Place Example Regression. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1059–1066.
  9. Kim, K.I.; Kwon, Y. Single-Image Super-Resolution Using Sparse Regression and Natural Image Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1127–1133.
  10. Chang, H.; Yeung, D.Y.; Xiong, Y. Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004; pp. 275–282.
  11. Yang, J.; Wright, J.; Huang, T.; Ma, Y. Image super-resolution as sparse representation of raw image patches. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
  12. Li, F.; Jia, X.; Fraser, D.; Lambert, A. Super resolution for remote sensing images based on a universal hidden Markov tree model. IEEE Trans. Geosci. Remote Sens. 2010, 48, 1270–1278.
  13. Pan, Z.; Yu, J.; Huang, H.; Hu, S.; Zhang, A.; Ma, H.; Sun, W. Super-Resolution Based on Compressive Sensing and Structural Self-Similarity for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2013, 51, 4864–4876.
  14. Timofte, R.; Smet, V.; Gool, L.V. Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 1920–1927.
  15. Timofte, R.; Smet, D.; Gool, L.V. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Proceedings of the Asian Conference on Computer Vision (ACCV), Singapore, 1–5 November 2014; pp. 111–126.
  16. Glasner, D.; Bagon, S.; Irani, M. Super-resolution from a single image. In Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 349–356.
  17. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 2010, 19, 2861–2873.
  18. Perez-Pellitero, E.; Salvador, J.; Ruiz-Hidalgo, J.; Rosenhahn, B. PSyCo: Manifold Span Reduction for Super Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1837–1845.
  19. Salvador, J.; Perezpellitero, E. Naive Bayes Super-Resolution Forest. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2380–7504.
  20. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the Computer Vision–ECCV, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin, Germany, 2014; pp. 184–199.
  21. Shi, W.; Caballero, J.; Huszar, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
  22. Dong, C.; Loy, C.C.; Tang, X. Accelerating the Super-Resolution Convolutional Neural Network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; Volume 9906, pp. 391–407.
  23. Zhao, X.; Zhang, Y.; Zhang, T.; Zou, X. Channel Splitting Network for Single MR Image Super-Resolution. IEEE Trans. Image Process. 2019, 28, 5649–5662.
  24. Muqeet, A.; Bin Iqbal, M.T.; Bae, S.-H. HRAN: Hybrid Residual Attention Network for Single Image Super-Resolution. IEEE Access 2019, 7, 137020–137029.
  25. Zhao, F.; Si, W.; Dou, Z. Image super-resolution via two stage coupled dictionary learning. Multimedia Tools Appl. 2017, 78, 28453–28460.
  26. Li, F.; Bai, H.; Zhao, Y. Detail-preserving image super-resolution via recursively dilated residual network. Neurocomputing 2019, 358, 285–293.
  27. He, H.; Chen, T.; Chen, M.; Li, D.; Cheng, P. Remote sensing image super-resolution using deep–shallow cascaded convolutional neural networks. Sens. Rev. 2019, 39, 629–635.
  28. Ma, W.; Pan, Z.; Guo, J.; Lei, B. Achieving Super-Resolution Remote Sensing Images via the Wavelet Transform Combined With the Recursive Res-Net. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3512–3527.
  29. Zhang, T.; Du, Y.; Lu, F. Super-Resolution Reconstruction of Remote Sensing Images Using Multiple-Point Statistics and Isometric Mapping. Remote Sens. 2017, 9, 724.
  30. Gu, J.; Sun, X.; Zhang, Y.; Fu, K.; Wang, L. Deep Residual Squeeze and Excitation Network for Remote Sensing Image Super-Resolution. Remote Sens. 2019, 11, 1817.
  31. He, Z.; Liu, L. Hyperspectral Image Super-Resolution Inspired by Deep Laplacian Pyramid Network. Remote Sens. 2018, 10, 1939.
  32. Kwan, C.; Choi, J.H.; Chan, S.H.; Zhou, J.; Budavari, B. A Super-Resolution and Fusion Approach to Enhancing Hyperspectral Images. Remote Sens. 2018, 10, 1416.
  33. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
  34. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  36. Ma, W.; Pan, Z.; Guo, J.; Lei, B. Super-Resolution of Remote Sensing Images Based on Transferred Generative Adversarial Network. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1148–1151.
  37. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Available online: https://arxiv.org/abs/1511.06434 (accessed on 1 August 2019).
  38. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. Available online: https://arxiv.org/abs/1701.07875 (accessed on 1 August 2019).
  39. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved Training of Wasserstein GANs. Available online: https://arxiv.org/abs/1704.00028 (accessed on 1 August 2019).
  40. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the NIPS, Lake Tahoe, NV, USA, 3–6 December 2012.
  41. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 2–5.
  42. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
  43. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645.
  44. Tai, Y.; Yang, J.; Liu, X. Image Super-Resolution via Deep Recursive Residual Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2790–2798.
  45. Nah, S.; Kim, T.; Lee, K. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 257–265.
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
  47. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814.
  48. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883.
  49. Tieleman, T.; Hinton, G. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 2012, 4, 26–31.
  50. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
  51. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015, arXiv:1603.04467.
  52. Hore, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
  53. Haut, J.M.; Fernandez-Beltran, R.; Paoletti, M.E.; Plaza, J.; Plaza, A.; Pla, F. A new deep generative network for unsupervised remote sensing single-image super-resolution. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6792–6810.
  54. Veganzones, M.A.; Simoes, M.; Licciardi, G.; Yokoya, N.; Bioucas-Dias, J.M.; Chanussot, J. Hyperspectral super-resolution of locally low rank images from complementary multisource data. IEEE Trans. Image Process. 2016, 25, 274–288.
  55. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.-L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the BMVC, Surrey, UK, 3–7 September 2012.
  56. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the International Conference on Curves and Surfaces, Avignon, France, 24–30 June 2010; pp. 711–730.
  57. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423.
  58. Huang, J.-B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
Figure 1. The architecture of the generative network (GN) in a GAN for image super-resolution (SRGAN). Layers of the same color are of the same type. $I_L$ is fed into the network and passed through the GN, and finally $I_{HR}$ is obtained.
Figure 2. The architecture of the discriminative network (DN) in SRGAN. The DN is trained to distinguish the reconstructed images from the ground-truth images, and a final sigmoid activation function outputs the probability used for this distinction.
Figure 3. Architectures of convolutional neural network (CNN)-based networks. (a) VDSR. The red line represents global residual learning (GRL). There are 20 convolutional layers in total, each consisting of 64 filters of size 3 × 3. The tawny layers represent the element-wise sum operation. (b) DRCN. The yellow layers are recursive layers that share the same weights and biases. The final output is obtained by computing a weighted mean. (c) DRRN. The green blocks represent recursive units, each containing two convolutional layers and the corresponding activation functions. DRRN adopts both GRL and local residual learning (LRL).
Figure 4. The architecture of the GN in the proposed DRGAN. In addition to serving as the input of the network, $I_L$ is also used for GRL afterwards. Layers of the same color represent layers of the same type.
Figure 5. The architecture of the DRU in the GN. The blue and purple lines represent LRL for DRU, and the red lines denote GRL for DRU.
Figure 6. Conceptual graph of how the sub-pixel convolutional layer works. $I_1$ denotes the output of the convolutional layer in the image reconstruction part. $I_{SR}$ is obtained from $I_1$ through convolution and rearrangement operations.
Figure 7. The architecture of the DN in DRGAN. The differences from the DN in SRGAN are that the last sigmoid layer is replaced with a Leaky ReLU layer and the batch normalization (BN) layers are removed.
Figure 8. The conceptual process of training the adversarial networks. $I_G$ and the $I_{SR}$ produced by the GN are fed into both the DN and VGG, yielding the adversarial loss and the content loss, respectively. The parameters of the adversarial networks are then updated according to these losses, and the process is repeated until the optimization is finished.
Figure 9. The conceptual process of training DRGAN. The loss function based on mean square error (MSE) is computed between the ground-truth image $I_G$ and the $I_{SR}$ obtained from the GN. Then, the modified DN is used to distinguish the feature maps extracted by VGG, yielding the adversarial loss.
Figure 10. Comparisons of the reconstructed results of ‘airplane_001.jpg’ with a scale factor of ×2 for different methods; the values of PSNR and SSIM are given: (a) original, (b) bicubic (29.99 dB/0.9160), (c) SRCNN (32.85 dB/0.9489), (d) FSRCNN (33.72 dB/0.9539), (e) ESPCN (33.23 dB/0.9515), (f) VDSR (34.14 dB/0.9563), (g) DRRN (34.34 dB/0.9583) and (h) DRGAN (34.62 dB/0.9661).
Figure 11. Comparison of the results reconstructed by different methods for ‘airplane_095.jpg’ with a scale factor of ×3; PSNR/SSIM values are given in parentheses: (a) original, (b) bicubic (25.34 dB/0.7708), (c) SRCNN (26.54 dB/0.8151), (d) FSRCNN (26.95 dB/0.8307), (e) ESPCN (26.80 dB/0.8243), (f) VDSR (27.31 dB/0.8444), (g) DRRN (27.54 dB/0.8522) and (h) DRGAN (27.80 dB/0.8628).
Figure 12. Comparison of the results reconstructed by different methods for ‘airplane_035.jpg’ with a scale factor of ×4; PSNR/SSIM values are given in parentheses: (a) original, (b) bicubic (25.78 dB/0.8053), (c) SRCNN (27.20 dB/0.8494), (d) FSRCNN (27.69 dB/0.8676), (e) ESPCN (27.32 dB/0.8477), (f) VDSR (28.02 dB/0.8950), (g) SRGAN (27.03 dB/0.8554) and (h) DRGAN (28.48 dB/0.9143).
Figure 13. Comparison of the images reconstructed by the proposed network for ‘airplane_633.jpg’ with a scale factor of ×3, without and with the MSE loss: (a) MSE omitted from the loss function (28.66 dB/0.7969) and (b) MSE included in the loss function (29.04 dB/0.8044).
Figure 14. Comparison of the loss convergence of the generator of our model with a scale factor of ×4 when using $L_{gan}$ or $L_{wgan}$.
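Figure 14 compares the generator loss under $L_{gan}$ and $L_{wgan}$. Their definitions appear elsewhere in the paper and are not repeated in this section. The sketch below assumes that $L_{gan}$ is the non-saturating loss −log σ(D(I_SR)) used in SRGAN and that $L_{wgan}$ is the WGAN loss −D(I_SR). It only shows how each loss responds to a DN score z: the GAN term saturates toward 0 as z grows, while the WGAN term stays linear in z.

```python
import math

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def generator_loss_gan(scores_fake):
    """Non-saturating GAN generator loss: mean of -log(sigmoid(z)) over DN scores z.
    Uses the identity -log(sigmoid(z)) = softplus(-z)."""
    return sum(softplus(-z) for z in scores_fake) / len(scores_fake)

def generator_loss_wgan(scores_fake):
    """WGAN generator loss: -mean(z); the critic score enters linearly."""
    return -sum(scores_fake) / len(scores_fake)

for z in (-4.0, 0.0, 4.0, 8.0):
    print(z, round(generator_loss_gan([z]), 4), generator_loss_wgan([z]))
```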
Figure 15. Comparison of the results reconstructed by different methods for ‘rectangular_farmland_008.jpg’ of NWPU-RESISC45 with a scale factor of ×4; PSNR/SSIM values are given in parentheses: (a) original, (b) bicubic (27.95 dB/0.7015), (c) SRCNN (28.51 dB/0.7070) and (d) DRGAN (30.60 dB/0.7524).
Figure 16. Comparison of the results reconstructed by different methods for ‘img_001.png’ of BSD100 with a scale factor of ×3; PSNR/SSIM values are given in parentheses: (a) original, (b) bicubic (24.74 dB/0.7861), (c) SRCNN (26.65 dB/0.8495) and (d) DRGAN (27.33 dB/0.8606).
Table 1. Peak signal-to-noise ratio (PSNR, dB) results for the NWPU dataset using different methods.
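| Image | Scale | Bicubic | SRCNN | FSRCNN | ESPCN | VDSR | DRRN | SRGAN | DRGAN (ours) |
|---|---|---|---|---|---|---|---|---|---|
| airplane_001 | ×2 | 29.99 | 32.85 | 33.72 | 33.23 | 34.14 | 34.34 | -/- | 34.62 |
| airplane_001 | ×3 | 26.95 | 28.84 | 29.59 | 29.21 | 30.29 | 30.45 | -/- | 30.69 |
| airplane_001 | ×4 | 25.21 | 26.45 | 26.94 | 26.68 | 27.66 | 27.88 | 26.25 | 28.11 |
| airplane_035 | ×2 | 30.36 | 32.75 | 33.22 | 32.92 | 33.45 | 33.63 | -/- | 33.91 |
| airplane_035 | ×3 | 27.36 | 29.02 | 29.21 | 29.15 | 29.78 | 29.94 | -/- | 30.16 |
| airplane_035 | ×4 | 25.78 | 27.20 | 27.69 | 27.32 | 28.02 | 28.19 | 27.03 | 28.48 |
| airplane_095 | ×2 | 27.98 | 29.86 | 30.36 | 30.08 | 30.69 | 30.86 | -/- | 30.15 |
| airplane_095 | ×3 | 25.34 | 26.54 | 26.95 | 26.80 | 27.31 | 27.54 | -/- | 27.80 |
| airplane_095 | ×4 | 24.02 | 24.87 | 25.11 | 25.00 | 25.36 | 25.53 | 24.69 | 25.83 |
| Test dataset | ×2 | 32.20 | 34.37 | 34.96 | 34.63 | 35.12 | 35.33 | -/- | 35.56 |
| Test dataset | ×3 | 29.09 | 30.59 | 31.15 | 30.87 | 31.47 | 31.70 | -/- | 31.92 |
| Test dataset | ×4 | 27.42 | 28.43 | 28.92 | 28.68 | 29.31 | 29.55 | 27.99 | 29.76 |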
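Table 1 reports PSNR = 10·log10(peak²/MSE). The following is a minimal sketch for flattened 8-bit images. The peak value of 255 is an assumption, and this section does not state whether PSNR was computed on the RGB channels or on luminance only.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flattened images."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

print(round(psnr([52, 55, 61, 59], [54, 55, 60, 62]), 2))
```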
Table 2. Structural similarity index (SSIM) metric results for the NWPU dataset using different methods.
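| Image | Scale | Bicubic | SRCNN | FSRCNN | ESPCN | VDSR | DRRN | SRGAN | DRGAN (ours) |
|---|---|---|---|---|---|---|---|---|---|
| airplane_001 | ×2 | 0.9160 | 0.9489 | 0.9539 | 0.9515 | 0.9563 | 0.9583 | -/- | 0.9661 |
| airplane_001 | ×3 | 0.8350 | 0.8768 | 0.8893 | 0.8826 | 0.9013 | 0.9089 | -/- | 0.9196 |
| airplane_001 | ×4 | 0.7681 | 0.8035 | 0.8187 | 0.8111 | 0.8422 | 0.8512 | 0.8063 | 0.8622 |
| airplane_035 | ×2 | 0.9401 | 0.9645 | 0.9697 | 0.9655 | 0.9709 | 0.9745 | -/- | 0.9811 |
| airplane_035 | ×3 | 0.8693 | 0.9074 | 0.9210 | 0.9101 | 0.9381 | 0.9396 | -/- | 0.9460 |
| airplane_035 | ×4 | 0.8053 | 0.8494 | 0.8676 | 0.8477 | 0.8950 | 0.9012 | 0.8554 | 0.9143 |
| airplane_095 | ×2 | 0.8750 | 0.9152 | 0.9217 | 0.9190 | 0.9273 | 0.9338 | -/- | 0.9432 |
| airplane_095 | ×3 | 0.7708 | 0.8151 | 0.8307 | 0.8243 | 0.8444 | 0.8522 | -/- | 0.8628 |
| airplane_095 | ×4 | 0.7005 | 0.7369 | 0.7519 | 0.7460 | 0.7711 | 0.7802 | 0.7378 | 0.7908 |
| Test dataset | ×2 | 0.9042 | 0.9346 | 0.9397 | 0.9372 | 0.9435 | 0.9497 | -/- | 0.9631 |
| Test dataset | ×3 | 0.8232 | 0.8582 | 0.8692 | 0.8644 | 0.8810 | 0.8904 | -/- | 0.9102 |
| Test dataset | ×4 | 0.7623 | 0.7918 | 0.8045 | 0.7995 | 0.8240 | 0.8369 | 0.7933 | 0.8544 |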
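SSIM compares the luminance, contrast and structure of two images. The sketch below evaluates the SSIM formula of Wang et al. once over the whole image, with the usual constants C1 = (0.01L)² and C2 = (0.03L)², where L is the data range. This is a simplification: standard SSIM averages the same expression over local (typically 11×11 Gaussian) windows, so this single-window version will not reproduce the values in Table 2 exactly.

```python
def ssim_global(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over two equally sized flattened images (single global window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2  # stabilizing constants
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

reference = [52, 55, 61, 59, 79, 61, 76, 61]
print(round(ssim_global(reference, reference), 4))
print(round(ssim_global(reference, [v + 5 for v in reference]), 4))
```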
Table 3. Normalized root mean square error (NRMSE) metric results for the NWPU dataset using different methods.
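| Image | Scale | Bicubic | SRCNN | FSRCNN | ESPCN | VDSR | DRRN | SRGAN | DRGAN (ours) |
|---|---|---|---|---|---|---|---|---|---|
| airplane_001 | ×2 | 0.0317 | 0.0211 | 0.0206 | 0.0218 | 0.0196 | 0.0192 | -/- | 0.0186 |
| airplane_001 | ×3 | 0.0450 | 0.0362 | 0.0355 | 0.0346 | 0.0306 | 0.0300 | -/- | 0.0292 |
| airplane_001 | ×4 | 0.0549 | 0.0476 | 0.0449 | 0.0464 | 0.0414 | 0.0404 | 0.0487 | 0.0393 |
| airplane_035 | ×2 | 0.0304 | 0.0230 | 0.0218 | 0.0226 | 0.0213 | 0.0208 | -/- | 0.0201 |
| airplane_035 | ×3 | 0.0429 | 0.0354 | 0.0334 | 0.0349 | 0.0324 | 0.0318 | -/- | 0.0310 |
| airplane_035 | ×4 | 0.0514 | 0.0437 | 0.0413 | 0.0431 | 0.0397 | 0.0389 | 0.0445 | 0.0377 |
| airplane_095 | ×2 | 0.0399 | 0.0321 | 0.0303 | 0.0313 | 0.0292 | 0.0286 | -/- | 0.0311 |
| airplane_095 | ×3 | 0.0541 | 0.0471 | 0.0449 | 0.0457 | 0.0431 | 0.0420 | -/- | 0.0407 |
| airplane_095 | ×4 | 0.0629 | 0.0571 | 0.0555 | 0.0563 | 0.0539 | 0.0529 | 0.0583 | 0.0511 |
| Test dataset | ×2 | 0.0273 | 0.0215 | 0.0201 | 0.0209 | 0.0195 | 0.0171 | -/- | 0.0167 |
| Test dataset | ×3 | 0.0382 | 0.0323 | 0.0304 | 0.0314 | 0.0293 | 0.0260 | -/- | 0.0254 |
| Test dataset | ×4 | 0.0459 | 0.0408 | 0.0387 | 0.0398 | 0.0371 | 0.0333 | 0.0399 | 0.0325 |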
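Table 3 does not state how the RMSE is normalized. Dividing by the peak value of 255 is one common choice. It appears consistent with most entries of Tables 1 and 3, because under that normalization PSNR = −20·log10(NRMSE). For example, 29.99 dB corresponds to an NRMSE of about 0.0317. The minimal sketch below works under that assumption.

```python
import math

def nrmse(reference, test, peak=255.0):
    """Root mean square error divided by the peak value (assumed normalization)."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.sqrt(mse) / peak

def nrmse_from_psnr(psnr_db):
    """With peak normalization, PSNR = -20 * log10(NRMSE), hence NRMSE = 10 ** (-PSNR / 20)."""
    return 10.0 ** (-psnr_db / 20.0)

# Cross-check one pair of entries from Tables 1 and 3 (bicubic, airplane_001, x2).
print(round(nrmse_from_psnr(29.99), 4))
```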
Table 4. Erreur relative globale adimensionnelle de synthèse (ERGAS; relative dimensionless global error in synthesis) results for the NWPU dataset using different methods.
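| Image | Scale | Bicubic | SRCNN | FSRCNN | ESPCN | VDSR | DRRN | SRGAN | DRGAN (ours) |
|---|---|---|---|---|---|---|---|---|---|
| airplane_001 | ×2 | 3.6583 | 2.6447 | 2.3910 | 2.5311 | 2.2697 | 2.1977 | -/- | 2.0831 |
| airplane_001 | ×3 | 3.4587 | 2.7844 | 2.5551 | 2.6733 | 2.3535 | 2.2882 | -/- | 2.1940 |
| airplane_001 | ×4 | 3.1679 | 2.7466 | 2.5948 | 2.6816 | 2.3907 | 2.3011 | 2.6216 | 2.2354 |
| airplane_035 | ×2 | 4.5529 | 3.4659 | 3.2829 | 3.3995 | 3.1908 | 3.1116 | -/- | 2.9958 |
| airplane_035 | ×3 | 4.3017 | 3.5533 | 3.3545 | 3.5144 | 3.2527 | 3.1998 | -/- | 3.0587 |
| airplane_035 | ×4 | 3.8641 | 3.2866 | 3.1039 | 3.2488 | 2.9900 | 2.8045 | 3.0587 | 2.7582 |
| airplane_095 | ×2 | 4.0629 | 3.2816 | 3.0968 | 3.2021 | 2.9791 | 2.9877 | -/- | 2.8653 |
| airplane_095 | ×3 | 3.6805 | 3.2110 | 3.0617 | 3.1190 | 2.9353 | 2.9122 | -/- | 2.8800 |
| airplane_095 | ×4 | 3.2179 | 2.9187 | 2.8388 | 2.8827 | 2.7590 | 2.6654 | 2.8252 | 2.6029 |
| Test dataset | ×2 | 3.1608 | 2.5015 | 2.3451 | 2.4345 | 2.2666 | 2.2081 | -/- | 2.1630 |
| Test dataset | ×3 | 2.9462 | 2.4996 | 2.3551 | 2.4379 | 2.2613 | 2.2475 | -/- | 2.1815 |
| Test dataset | ×4 | 2.6522 | 2.3634 | 2.2415 | 2.3107 | 2.1469 | 2.0998 | 2.2973 | 2.0764 |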
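ERGAS aggregates the per-band relative RMSE, and lower values are better. The usual definition is ERGAS = 100·(h/l)·sqrt((1/N)·Σ_k (RMSE_k/μ_k)²), where N is the number of bands and μ_k is the mean of band k in the reference image. This section does not say how h/l was set. The sketch assumes h/l = 1/scale. Under that convention, ERGAS values at different scale factors are not directly comparable. It would also explain why the bicubic ERGAS in Table 4 decreases from ×2 to ×4 while its NRMSE in Table 3 increases.

```python
import math

def ergas(reference_bands, test_bands, scale):
    """ERGAS = 100 * (h / l) * sqrt(mean_k (RMSE_k / mu_k) ** 2).
    reference_bands, test_bands: lists of flattened bands (e.g. R, G, B).
    h / l is taken as 1 / scale, an assumed convention for super-resolution."""
    terms = []
    for ref, tst in zip(reference_bands, test_bands):
        rmse = math.sqrt(sum((r - t) ** 2 for r, t in zip(ref, tst)) / len(ref))
        mu = sum(ref) / len(ref)  # mean of the reference band
        terms.append((rmse / mu) ** 2)
    return 100.0 / scale * math.sqrt(sum(terms) / len(terms))

# Hypothetical 3-band example: the same reconstruction error scored at three scales.
ref = [[100, 110, 120], [90, 95, 100], [80, 85, 90]]
out = [[102, 108, 121], [91, 96, 98], [79, 86, 92]]
for s in (2, 3, 4):
    print(s, round(ergas(ref, out, s), 4))
```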
Table 5. Test time (s) results for the NWPU dataset using different methods.
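| Image | Scale | Bicubic | SRCNN | FSRCNN | ESPCN | VDSR | DRRN | SRGAN | DRGAN (ours) |
|---|---|---|---|---|---|---|---|---|---|
| airplane_001 | ×2 | 0.0000 | 0.1297 | 0.0369 | 0.0319 | 1.7643 | 0.2157 | -/- | 0.1638 |
| airplane_001 | ×3 | 0.0000 | 0.1277 | 0.0189 | 0.0170 | 1.7264 | 0.2153 | -/- | 0.1619 |
| airplane_001 | ×4 | 0.0000 | 0.1287 | 0.0109 | 0.0120 | 1.7762 | 0.2154 | 0.7515 | 0.1610 |
| airplane_035 | ×2 | 0.0000 | 0.1267 | 0.0339 | 0.0309 | 1.7513 | 0.2389 | -/- | 0.1820 |
| airplane_035 | ×3 | 0.0000 | 0.1316 | 0.0180 | 0.0170 | 1.7234 | 0.2374 | -/- | 0.1816 |
| airplane_035 | ×4 | 0.0000 | 0.1297 | 0.0100 | 0.0120 | 1.7563 | 0.2365 | 0.7550 | 0.1814 |
| airplane_095 | ×2 | 0.0000 | 0.1316 | 0.0329 | 0.0299 | 1.7982 | 0.2268 | -/- | 0.0410 |
| airplane_095 | ×3 | 0.0000 | 0.1267 | 0.0160 | 0.0159 | 1.7254 | 0.2070 | -/- | 0.0382 |
| airplane_095 | ×4 | 0.0000 | 0.1466 | 0.0100 | 0.0100 | 1.7663 | 0.2099 | 0.7587 | 0.0394 |
| Test dataset | ×2 | 0.0000 | 0.1303 | 0.0337 | 0.0300 | 1.7961 | 0.2386 | -/- | 0.1621 |
| Test dataset | ×3 | 0.0000 | 0.1278 | 0.0158 | 0.0156 | 1.7442 | 0.2173 | -/- | 0.1657 |
| Test dataset | ×4 | 0.0000 | 0.1371 | 0.0096 | 0.0102 | 1.7689 | 0.2102 | 0.7647 | 0.1539 |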
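This section does not describe the hardware or timing protocol behind Table 5. One plausible way to obtain per-image test times is to average wall-clock time over repeated calls after a warm-up run. The sketch below does this with Python's `time.perf_counter`. `nearest_upscale` is a hypothetical stand-in for a super-resolution model. At the table's four-decimal resolution, any method faster than 0.05 ms per image would appear as 0.0000, as the bicubic column does.

```python
import time

def mean_test_time(upscale, image, repeats=10):
    """Average wall-clock seconds per call of upscale(image), measured with perf_counter."""
    upscale(image)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        upscale(image)
    return (time.perf_counter() - start) / repeats

def nearest_upscale(image, s=2):
    """Hypothetical stand-in model: nearest-neighbour enlargement of a 2-D list by factor s."""
    return [[row[j // s] for j in range(len(row) * s)] for row in image for _ in range(s)]

image = [[(i * 7 + j) % 256 for j in range(64)] for i in range(64)]
print(f"{mean_test_time(nearest_upscale, image):.4f} s per image")
```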
Table 6. Objective metric results (PSNR in dB/SSIM) of different methods on several natural-image datasets.
| Dataset | Scale | Bicubic PSNR/SSIM | SRCNN PSNR/SSIM | SRGAN PSNR/SSIM | DRGAN PSNR/SSIM |
|---|---|---|---|---|---|
| Set5 | ×2 | 33.66/0.9299 | 36.66/0.9542 | -/- | 36.98/0.9602 |
| Set5 | ×3 | 30.39/0.8682 | 32.75/0.9090 | -/- | 33.11/0.9130 |
| Set5 | ×4 | 28.42/0.8104 | 30.49/0.8628 | 29.40/0.8472 | 30.86/0.8712 |
| Set14 | ×2 | 30.23/0.8687 | 32.45/0.9067 | -/- | 32.81/0.9118 |
| Set14 | ×3 | 27.54/0.7736 | 29.30/0.8215 | -/- | 29.65/0.8286 |
| Set14 | ×4 | 26.00/0.7019 | 27.50/0.7513 | 26.02/0.7397 | 27.89/0.7655 |
| BSD100 | ×2 | 29.56/0.8431 | 31.36/0.8879 | -/- | 31.91/0.8936 |
| BSD100 | ×3 | 27.21/0.7385 | 28.41/0.7863 | -/- | 28.77/0.7951 |
| BSD100 | ×4 | 25.96/0.6675 | 26.90/0.7101 | 25.16/0.6688 | 27.22/0.7268 |
| Urban100 | ×2 | 26.88/0.8403 | 29.50/0.8946 | -/- | 30.02/0.9024 |
| Urban100 | ×3 | 24.46/0.7349 | 26.24/0.7989 | -/- | 26.56/0.8031 |
| Urban100 | ×4 | 23.14/0.6577 | 24.52/0.7221 | 23.98/0.6935 | 24.90/0.7356 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).