Optical Remote Sensing Image Denoising and Super-Resolution Reconstructing Using Optimized Generative Network in Wavelet Transform Domain

Abstract: High spatial quality (HQ) optical remote sensing images are very useful for target detection, target recognition and image classification. Due to the limited accuracy of imaging equipment and the influence of the atmospheric environment, HQ images are difficult to acquire, while low spatial quality (LQ) remote sensing images are easy to acquire. Hence, denoising and super-resolution (SR) reconstruction are important techniques for improving the quality of remote sensing images at low cost. Most existing methods employ only denoising or only SR to obtain HQ images. However, due to the complex structure and heavy noise of remote sensing images, the quality of an image obtained by denoising alone or SR alone cannot meet practical needs. To address these problems, a method for reconstructing HQ remote sensing images based on a Generative Adversarial Network (GAN), named “Restoration Generative Adversarial Network with ResNet and DenseNet” (RRDGAN), is proposed, which acquires better quality images by incorporating denoising and SR into a unified framework. The generative network is implemented by fusing Residual Neural Network (ResNet) and Dense Convolutional Network (DenseNet) in order to handle denoising and SR at the same time. Then, total variation (TV) regularization is used to further enhance the edge details, and the idea of Relativistic GAN is explored to make the whole network converge better. RRDGAN is implemented in the wavelet transform (WT) domain, since different frequency parts can be handled separately there. The experimental results on three different remote sensing datasets show the feasibility of the proposed method for acquiring HQ remote sensing images.


Introduction
High spatial quality (HQ) optical remote sensing images are characterized by high spatial resolution (HR) and low noise, and they are widely used in agricultural and forestry monitoring, urban planning, military reconnaissance and other fields. However, because of the time and cost of equipment development and the sensitivity of images to changes in atmosphere and illumination, a large number of the remote sensing images acquired are of low spatial quality (LQ). So, how to obtain HQ images economically and conveniently has been a major challenge in the field of remote sensing.
Recently, more researchers have paid attention to recovering HQ remote sensing images from LQ ones using image processing technology.
Low spatial resolution (LR) and noise are two common factors causing low quality of remote sensing images [1]. So, enhancing spatial resolution and denoising are two of the most common approaches to acquire high quality images.
Generally, image SR reconstruction adds useful information (HR details) to LQ images, while denoising removes useless information (noise) from them. Because multiple HR solutions exist for any pixel in an LR image, SR is an ill-posed problem [2]. SR methods are divided into Single Image Super-Resolution (SISR) and Multi-Image Super-Resolution (MISR) according to the number of LR input images. MISR obtains an HR image by processing a set of LR images that have only slightly different views. Because image data are not abundant in remote sensing research, SISR, which acquires an HR image from a single LR image, is more commonly used in this field. SISR methods are commonly classified as interpolation-based [3,4], reconstruction-based [5,6] and example-based [7]. This article does not discuss interpolation-based and reconstruction-based methods, since these two types are usually treated as traditional methods; for example, moments-based methods are very popular in image reconstruction [8][9][10][11]. Example-based methods establish the relationship between LR and HR images in order to reconstruct the high-frequency part of the LR images. In recent years, with the development of big data technology, machine learning methods have become increasingly popular and practical. Deep learning methods, which usually mean deep convolutional neural networks (CNN) [12][13][14], are currently a research hot-spot and have achieved impressive results in many fields such as image processing. In particular, SR algorithms based on deep learning have achieved significantly better results than traditional SR reconstruction algorithms [15][16][17][18][19], including in the SR reconstruction of remote sensing images [7,20,21].
SRCNN, the first CNN-based SR method, was proposed by Dong et al. [2,22]. SRCNN borrows the idea of sparse-coding-based SR methods. However, SRCNN becomes very difficult to train as the number of layers increases. With the emergence of residual learning, deeper networks could be designed to achieve better results. VDSR, published by Kim et al. [23], is the earliest and most typical method that uses residual learning. Recently, Generative Adversarial Network (GAN) based methods have become popular because they can generate visually more convincing results. SRGAN [24] is the first GAN-based SR method and can reconstruct more details than non-GAN-based methods. ESRGAN [25], an enhancement of SRGAN, was proposed later and achieved state-of-the-art results at that time.
Image denoising and SR are similar problems because both mainly process the high-frequency parts of an image while preserving the remaining information. Image denoising algorithms fall into two categories: traditional model-driven maximum a posteriori (MAP) methods and modern data-driven deep learning methods. The model-driven approach denoises by constructing a reasonable MAP model. Its biggest disadvantage is that it relies heavily on assumptions about the image prior and the noise distribution made in advance; when these assumptions deviate from the actual data, the model is no longer applicable. Recently, data-driven methods have achieved good results in image denoising. They use pairs of noisy images and corresponding clean images as inputs and outputs to train a pre-designed deep network end to end. The trained network can then be regarded directly as a function that takes a noisy image as input and returns the corresponding restored image.
The above methods have achieved some results, but there is still room for improvement. First, most of the existing methods mentioned above employ only denoising or only SR to obtain HQ optical images. However, due to the complex structure [1] and heavy noise of optical remote sensing images, the quality of an image obtained by denoising alone or SR alone cannot meet practical needs. So, handling the two problems quickly and accurately is important for acquiring HQ optical remote sensing images. Second, non-GAN-based methods achieve better Peak Signal-to-Noise Ratio (PSNR) results, but the details in their results are blurrier than those of GAN-based methods. However, GAN-based methods are difficult to train: when the discriminator is trained too well, the generator gradient vanishes and the loss cannot be lowered; when the discriminator is trained poorly, the generator gradient is inaccurate. Training works only when the discriminator is neither too strong nor too weak, and this balance is hard to strike. The right balance may even differ between stages of the same round of training. Third, most of the methods mentioned above handle image denoising or SR directly in the spatial domain, which we consider unsuitable: different high-frequency components correspond to different detail information and should be processed differently, but they cannot be distinguished well in the spatial domain. To address these problems, we propose an end-to-end CNN-based method named "Restoration Generative Adversarial Network with ResNet and DenseNet" (RRDGAN), which handles the two problems simultaneously using one network. Considering that residual learning reuses features implicitly while dense connection keeps exploring new features, we combine the benefits of these two network structures in our generator to achieve a better effect.
We further use total variation (TV) loss [26] to obtain better edge details, since Rudin et al. [27] observed that the TV of noise-polluted images is significantly greater than that of noise-free images. We then use the idea of Relativistic GAN [28] to optimize our discriminator so that the whole network converges better. In addition, most remote sensing image denoising and SR methods are carried out directly in the spatial domain, although processing the different frequency parts of a remote sensing image is the key step in both denoising and SR reconstruction. To further improve performance, RRDGAN is therefore implemented in the WT domain instead of directly in the spatial domain. Figure 1 compares the results of improving the quality of remote sensing images (4× SR with removal of Gaussian noise of level 25) in the spatial domain and in the WT domain. The result in the WT domain is clearly better than that in the spatial domain.
In summary, the contributions of this article are the following three aspects:

1. A method named RRDGAN is proposed. RRDGAN combines denoising and super-resolution reconstruction into a unified framework to obtain better quality optical remote sensing images.

2. The generator of RRDGAN combines residual learning and dense connection to obtain better PSNR results, and the discriminator uses relativistic loss to make the entire network converge better. The generator also uses TV loss to reconstruct better details.

3. RRDGAN is implemented in the WT domain, which allows the different frequency parts of an LR image to be handled separately.
The rest of this article is organized as follows. Section 2 reviews related work on single image denoising and SR reconstruction, as well as related work on processing these problems in the WT domain. Section 3 gives the implementation details of RRDGAN. Section 4 describes the experimental results. Section 5 discusses the method, and conclusions are drawn in Section 6.

Optical Image Super-Resolution Reconstruction Method
As mentioned above, CNN-based methods are the recent research hotspot in image processing because a CNN extracts increasingly exact image features as the network gets deeper, which achieves better results than traditional methods. SRCNN, the first SR method using a CNN, was proposed in 2016 [2]. Since then, new CNN-based SR methods have appeared every year. Kim et al. [23] proposed the first SR method based on residual learning, which deepens the network to 20 layers. After that, Laplacian Pyramid SR CNN (LapSRN) was proposed by Lai et al. [29]; it gradually reconstructs the high-frequency details in different sub-bands of the latent HR image. Then, an SR method based on dense connection was proposed by Tong et al. [30], which made CNNs deeper still and became the state-of-the-art (SOTA) SR method at that time.
Nowadays, GAN-based methods are becoming more popular in image processing research. SRGAN, proposed by Ledig et al. [24], is the first GAN-based SR method. Unlike earlier SR methods, which use only PSNR to evaluate the effect, SRGAN also uses the Mean Opinion Score (MOS) [24], which reflects the visual quality of an image as perceived by humans. The generative part of this method, named Super-Resolution Residual Network (SRResnet), is a CNN structure that combines local and global residual learning. The adversarial part is an ordinary CNN structure that discriminates whether an image is ground truth or the result of SR reconstruction. In [24], the authors compare the SR reconstruction results of several typical methods, including SRResnet. The comparison shows that the PSNR of SRGAN is not the highest among these SOTA methods. However, SRGAN achieves the highest MOS and reconstructs more details than the non-GAN methods, including SRResnet. This is because PSNR-oriented methods use MSE to compute the loss between the reconstructed image and the ground truth, which makes reconstructed images smoother. So, in general, PSNR is not the only criterion for judging whether a reconstructed image is good.

Single Image Denoising Method
Similarly, CNN-based denoising methods have also become popular recently. Jain et al. [31] proposed the first CNN-based denoising method, which achieves similar or even better results than traditional methods. DnCNN, proposed by Zhang et al. [32], combines batch normalization with residual learning and obtained state-of-the-art results at the time. Then, an auto-encoder network with symmetric skip connections was proposed by Mao et al. [33]. The method uses 10 pairs of symmetric convolution and deconvolution layers; the first 5 layers are encoding layers, while the last 5 layers are decoding layers. Thus, CNN-based image denoising networks have become deeper and deeper.

Single Image Restoration in Wavelet Transform Domain
Remote sensing image restoration methods in the spatial domain usually handle the high-frequency and low-frequency parts together, which is not appropriate. More attention should be paid to processing the different frequency parts, especially the high-frequency part, because the SR problem and some typical noises (e.g., salt and pepper noise) are related to it. So, a good way to deal with image restoration problems is to treat different frequency parts separately. WT has been proven to be very effective for image restoration [20,34]. A WT operation transforms an image into a set of coefficient sub-bands of the same size. Because the detail sub-bands are very sparse, it is suitable to predict the wavelet coefficients with a sparse coding algorithm and then reconstruct the HQ optical image from them.
A typical Haar WT operation is shown in Figure 2. As illustrated there, LL is the low-frequency sub-band of the original image, which represents the global topology. The other three sub-bands (HL, LH and HH in the figure) represent the high-frequency part in the vertical, horizontal and diagonal orientations, respectively. Applying the inverse wavelet transform to these four sub-band coefficients recovers the final image. An SR method using 3D-CNN in the WT domain was proposed by Yang et al. [35]. This method first uses 3D-CNN to extract features and then decomposes these features into wavelet coefficients. After that, the wavelet coefficients are processed with 3D-CNN to obtain the reconstructed coefficients, and finally the reconstructed image is obtained by the inverse wavelet transform. This method requires multi-frame images as input to the 3D-CNN, which is not convenient in remote sensing research.
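As a minimal sketch of the single-level Haar decomposition and its inverse described in this section, the following NumPy code splits an image into four sub-bands and reassembles it. The function names `haar_dwt2` and `haar_idwt2`, the 1/2 scaling, and the assignment of the labels HL and LH to column-wise and row-wise differences are assumptions made for illustration; naming conventions differ between libraries, and this is not the authors' implementation.

```python
import numpy as np


def haar_dwt2(img):
    """Single-level 2D Haar transform of an array with even height and width."""
    x = np.asarray(img, dtype=np.float64)
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right pixel
    c = x[1::2, 0::2]  # bottom-left pixel
    d = x[1::2, 1::2]  # bottom-right pixel
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    hl = (a - b + c - d) / 2.0  # differences across columns (assumed label)
    lh = (a + b - c - d) / 2.0  # differences across rows (assumed label)
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, hl, lh, hh


def haar_idwt2(ll, hl, lh, hh):
    """Inverse of haar_dwt2: rebuild the image from its four sub-bands."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (ll + hl + lh + hh) / 2.0
    out[0::2, 1::2] = (ll - hl + lh - hh) / 2.0
    out[1::2, 0::2] = (ll + hl - lh - hh) / 2.0
    out[1::2, 1::2] = (ll - hl - lh + hh) / 2.0
    return out


image = np.arange(16, dtype=np.float64).reshape(4, 4)
ll, hl, lh, hh = haar_dwt2(image)
restored = haar_idwt2(ll, hl, lh, hh)
```

Each sub-band has half the height and half the width of the input, and the round trip through both functions returns the original array.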

Proposed Method
In this section, we first give the problem definition and then the details of our proposed method.

Problem Definition
Our ultimate goal is to reconstruct an HQ optical remote sensing image from an LQ one, so the relationship between the LQ remote sensing image and its HQ counterpart must be established. The purpose of our method is therefore to process denoising and SR simultaneously by learning a mapping F from an LQ image to its HQ reconstruction through a CNN-based network. In this article, the LQ image is denoted as Y and the HQ ground truth as X; our goal is to recover from Y an image F(Y) that is as close as possible to X. RRDGAN is the mapping F.
As illustrated in Figure 3, the generator of our method works in the following steps. First, the Haar wavelet transform decomposes the input LQ remote sensing image into four sub-band coefficients. Second, these coefficients are sent into our generator, a deep network, to obtain the reconstructed sub-band coefficients. Finally, the output image is obtained by the inverse Haar wavelet transform. For the adversarial part, both the reconstructed image and the spatial-domain ground truth are sent into a typical CNN-based network similar to the discriminator of SRGAN. After training, we obtain a network that can improve the spatial quality of remote sensing images.

Proposed Method
Our network architecture is introduced in this section. The proposed method, RRDGAN, is illustrated in Figure 4. The LQ images were obtained by downsampling the HQ optical images with a bicubic kernel by a factor of r = 4 and then adding typical noise. Compared with SRGAN, we made two modifications to achieve better performance on remote sensing images. First, we combined residual learning and dense connection to replace the original SRResnet, which is the generative part. Second, we used the Relativistic GAN loss function to train the discriminator.

Network Architecture
The Discriminator aims to distinguish whether the estimated HQ optical images are plausible or not. Unlike the original SRGAN, our discriminator is based on Relativistic GAN [28], an improvement on the original GAN that distinguishes real samples from fake ones better. The discriminator takes both the real image (ground truth) and the fake image (reconstruction result) as inputs. Applying Relativistic GAN instead of the original GAN in our discriminator further improves performance.
The Generator processes remote sensing input images in the wavelet transform domain. It first enlarges the input image by a factor of two using simple bicubic interpolation, because applying the Haar wavelet transform to an image yields four sets of coefficients, each one-fourth the size of the image. It then sends the four sets of wavelet coefficients into a deep network that combines the advantages of residual learning and dense connection to restore the coefficients, from which the final high quality image is obtained.
In our generator, residual learning and dense connection are the most important components. Residual learning learns the residual between the input and the output, so in this article we define the residual as r = y − x, most values of which are likely to be zero or small [14]. In this equation, r is the pixel-wise residual between x and y. Dense connection is defined as $x_{l+1} = H([x_0, x_1, \ldots, x_l])$, where H denotes the transformation applied by the layer and $[x_0, x_1, \ldots, x_l]$ refers to the concatenation of the feature maps $x_0, x_1, \ldots, x_l$ produced in the preceding layers [13].
Residual learning and dense connection are complementary. Residual learning reuses features implicitly, but it is not good at exploring new features. In contrast, dense connection keeps exploring new features, but suffers from higher redundancy [36]. So, we fuse these two structures to obtain the advantages of both. Each Dense-Residual-Block (DRB) contains a dense connection branch and a residual learning branch. The residual learning branch consists of one 1 × 1 filter, one 3 × 3 filter and one 1 × 1 filter. The dense connection branch concatenates the input and the residual learning output. There are 20 DRBs in our generator. The final upsampling step is inspired by ESPCN [37], which upscales the last feature maps into the HR output using an efficient sub-pixel convolution layer. Note that we remove all batch normalization (BN) operations from our generator, because keeping them did not achieve better performance than removing them.
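The sub-pixel upsampling step mentioned above can be illustrated with a short NumPy sketch of the periodic rearrangement that follows the convolution in an ESPCN-style layer. The function name `pixel_shuffle`, the channel-first layout and the channel ordering are assumptions chosen for illustration; the convolution that produces the input feature maps is omitted, and the layout used in the authors' TensorFlow code may differ.

```python
import numpy as np


def pixel_shuffle(features, r):
    """Rearrange a (C * r * r, H, W) array into a (C, H * r, W * r) array.

    Output pixel (c, h * r + i, w * r + j) is taken from input channel
    c * r * r + i * r + j at position (h, w).
    """
    channels, height, width = features.shape
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = channels // (r * r)
    x = features.reshape(c, r, r, height, width)  # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)                # reorder to (c, h, i, w, j)
    return x.reshape(c, height * r, width * r)    # merge (h, i) and (w, j)


feature_maps = np.arange(4 * 2 * 2).reshape(4, 2, 2)
upscaled = pixel_shuffle(feature_maps, 2)
```

Because the rearrangement only moves values, all learnable computation stays at low resolution, which is the efficiency argument behind sub-pixel convolution.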

The Loss Function
The Discriminator uses the idea of relativistic GAN. We define the discriminator loss as:

$$L_D = -E_g[\log(D(g, f))] - E_f[\log(1 - D(f, g))]$$

In this equation, $E_g[\cdot]$ and $E_f[\cdot]$ represent the average over all the ground truth images or all the reconstructed images in one mini-batch, respectively. D is defined as:

$$D(g, f) = \sigma(O(g) - E_f[O(f)])$$

In this equation, σ is the sigmoid function, and O, g and f represent the non-transformed discriminator output, the ground truth and the image reconstructed by the generator, respectively.
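A minimal NumPy sketch of a relativistic average loss of this kind is given below. It assumes the commonly used formulation in which each raw discriminator output is compared with the mini-batch mean of the opposite class before a sigmoid is applied; the function names, the `eps` stabilizer and the mirrored generator term are assumptions for illustration rather than the authors' code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relativistic_losses(o_real, o_fake, eps=1e-12):
    """Relativistic average losses from raw (non-transformed) discriminator outputs.

    o_real: outputs O(g) for the ground truth images of one mini-batch.
    o_fake: outputs O(f) for the reconstructed images of the same mini-batch.
    Returns (discriminator loss, generator adversarial loss).
    """
    o_real = np.asarray(o_real, dtype=np.float64)
    o_fake = np.asarray(o_fake, dtype=np.float64)
    d_real = sigmoid(o_real - o_fake.mean())  # "real looks more real than the average fake"
    d_fake = sigmoid(o_fake - o_real.mean())  # "fake looks more real than the average real"
    loss_d = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    loss_g = -np.mean(np.log(1.0 - d_real + eps)) - np.mean(np.log(d_fake + eps))
    return loss_d, loss_g


# A discriminator that separates the two classes well has a low loss,
# while the generator loss is high in the same situation.
loss_d, loss_g = relativistic_losses([5.0, 5.0], [-5.0, -5.0])
```

Unlike the standard GAN loss, the generator term here also depends on the real samples, so the generator receives gradient from both halves of the mini-batch.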
The Generator loss function includes three parts: content loss, adversarial loss and TV loss. Instead of MSE loss, VGG loss is chosen as the content loss [24] of our generator, which is defined as:

$$L_{content} = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left( \phi_{i,j}(X)_{m,n} - \phi_{i,j}(F(Y))_{m,n} \right)^2$$

In this content loss equation, $\phi_{i,j}$ is the feature map obtained by the j-th convolution of the pre-trained VGG19 network before the i-th max-pooling layer. X is the HQ image, and F(Y) is the reconstructed image. M is the number of rows of image X, and N is the number of columns of image F(Y).
Furthermore, the adversarial loss is similar to the discriminator loss:

$$L_{adv} = -E_g[\log(1 - D(g, f))] - E_f[\log(D(f, g))]$$

Finally, the TV loss defined in (6) is used in order to enhance the edge information of the reconstructed image. Rudin et al. [27] observed that the TV of noise-polluted images is significantly greater than that of noise-free images, so in image super-resolution and denoising, TV regularization is a structured restoration method aimed at preserving image details.
In (6), y represents the reconstructed image, and i and j represent the horizontal and vertical pixel positions in the image, respectively.
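To make the TV term concrete, the following NumPy sketch computes the total variation of a single-channel image from differences between neighboring pixels. Which discrete form the authors use is not stated here, so the sketch offers two common variants as assumptions: an anisotropic form (sum of absolute differences) and an isotropic form (sum of gradient magnitudes). The function name `tv_loss` and the `isotropic` flag are hypothetical.

```python
import numpy as np


def tv_loss(y, isotropic=False):
    """Total variation of a 2D image y, using forward differences."""
    y = np.asarray(y, dtype=np.float64)
    d_rows = y[1:, :-1] - y[:-1, :-1]  # difference to the next row
    d_cols = y[:-1, 1:] - y[:-1, :-1]  # difference to the next column
    if isotropic:
        return float(np.sum(np.sqrt(d_rows ** 2 + d_cols ** 2)))
    return float(np.sum(np.abs(d_rows)) + np.sum(np.abs(d_cols)))


# A smooth ramp has low TV; adding noise raises it sharply.
rng = np.random.default_rng(0)
clean = np.tile(np.arange(8.0), (8, 1))
noisy = clean + rng.normal(0.0, 5.0, clean.shape)
tv_clean = tv_loss(clean)
tv_noisy = tv_loss(noisy)
```

The comparison between `tv_clean` and `tv_noisy` mirrors the observation attributed to Rudin et al. [27] that noisy images have a much larger TV than noise-free ones, which is why penalizing TV discourages residual noise.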
In summary, the loss function for the generator is similar to that of SRGAN:

$$L_G = L_{content} + \lambda L_{adv} + \beta L_{TV} \quad (7)$$

In this equation, λ and β are the coefficients that balance the different loss terms.

Experimental Results
In this experimental results section, the datasets this article used are described first, then experimental results are illustrated.

DataSets
In our experiment, three datasets were used to verify the performance of RRDGAN: UCMERCED [38], NWPU-RESISC45 [39] and GaoFen-1. UCMERCED includes 21 land-use scene classes, all with high spatial resolution (0.3 m/pixel). NWPU-RESISC45 was created by Northwestern Polytechnical University (NWPU). It is a public benchmark with 31,500 images in total, divided into 45 scene classes. The size of each image in UCMERCED and NWPU-RESISC45 is 256 × 256 pixels. GaoFen-1 is a dataset of multispectral images obtained by the GaoFen-1 satellite. We mix these three datasets instead of experimenting on each of them separately. Then, 135 images are randomly chosen for training, while another 40 images are randomly chosen for testing.

Implementation Details
The LQ images (96 × 96) for training were acquired by downsampling the ground truth (256 × 256) by a factor of 4 and then adding noise. As Figures 5 and 6 show, two types of noise are added separately in our experiments: white Gaussian noise with level 25, and salt and pepper noise with level 0.005. The total number of LQ/HQ image pairs is 135, drawn from the three datasets. These images belong to various types, including airplane, church, forest, wetland and so on. Finally, we used a total of 20 DRBs. The learning rate is 0.0001 at the beginning and decays by a factor of 0.1 every $10^5$ iterations, and λ in Equation (7) is $10^{-3}$. The experiments were run on an Nvidia GTX 1080Ti GPU and a Genuine Intel CPU at 1.4 GHz with 64 GB RAM, using the tensorflow-1.14.0 package.
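The noise step of this degradation pipeline can be sketched in NumPy as follows. The sketch assumes that "level 25" means a standard deviation of 25 on a 0 to 255 intensity scale and that "level 0.005" means the fraction of pixels replaced by pure black or white; both readings, as well as the function names, are assumptions. Bicubic downsampling is omitted because it needs an image library.

```python
import numpy as np


def add_gaussian_noise(img, sigma=25.0, rng=None):
    """Add zero-mean white Gaussian noise to a uint8 image."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def add_salt_pepper_noise(img, density=0.005, rng=None):
    """Set a fraction `density` of pixels to 0 (pepper) or 255 (salt)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.copy()
    u = rng.random(img.shape)
    noisy[u < density / 2.0] = 0          # pepper
    noisy[u > 1.0 - density / 2.0] = 255  # salt
    return noisy


# Hypothetical stand-in for a downsampled LR patch.
lr_patch = np.full((64, 64), 128, dtype=np.uint8)
gaussian_lq = add_gaussian_noise(lr_patch, sigma=25.0, rng=np.random.default_rng(0))
salt_pepper_lq = add_salt_pepper_noise(lr_patch, density=0.005, rng=np.random.default_rng(0))
```

Passing an explicit random generator keeps the synthetic LQ/HQ pairs reproducible between training runs.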
For the sake of fairness, VDSR, SRGAN and ESRGAN are well retrained using the datasets mentioned above.
We choose the Haar wavelet because it is one of the simplest and fastest wavelet transforms. Processing the four frequency sub-bands obtained by the Haar wavelet achieves the expected denoising and super-resolution reconstruction effect in this paper, and the processing speed is also satisfactory.

Results and Analysis
Table 1 lists the PSNR results and MOS of our method, SRGAN, SRResnet and VDSR on low resolution images with noise. PSNR, short for "Peak Signal to Noise Ratio", is an objective standard for image evaluation. It is generally used in engineering to measure the ratio between the maximum signal and the background noise:

$$PSNR = 10 \log_{10} \left( \frac{(2^n - 1)^2}{MSE} \right) \quad (8)$$

In Equation (8), n is the image bit width. The MSE is the mean square error, which is defined in Equation (9):

$$MSE = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( X(i,j) - F(Y)(i,j) \right)^2 \quad (9)$$

X is the ground truth, and F(Y) is the reconstructed image. M is the number of rows in X, and N is the number of columns in F(Y).
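As a small worked example, the sketch below computes MSE and PSNR in NumPy, assuming the standard definition in which the peak value is 2^n − 1 for an n-bit image and PSNR is ten times the base-10 logarithm of the squared peak divided by the MSE. The function names and the default of 8 bits are assumptions for illustration.

```python
import numpy as np


def mse(x, fy):
    """Mean square error between ground truth x and reconstruction fy."""
    x = np.asarray(x, dtype=np.float64)
    fy = np.asarray(fy, dtype=np.float64)
    return float(np.mean((x - fy) ** 2))


def psnr(x, fy, n_bits=8):
    """Peak signal-to-noise ratio in dB for images with n_bits per sample."""
    err = mse(x, fy)
    if err == 0.0:
        return float("inf")  # identical images
    peak = 2 ** n_bits - 1
    return float(10.0 * np.log10(peak ** 2 / err))


ground_truth = np.zeros((4, 4))
off_by_one = ground_truth + 1.0          # every pixel wrong by one gray level
psnr_small_error = psnr(ground_truth, off_by_one)
```

For 8-bit images, an error of one gray level at every pixel gives an MSE of 1 and therefore a PSNR of 20 log10(255), about 48.13 dB.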
The MOS test has been used for decades. It was first used to evaluate the quality of voice communication systems and was later widely used to identify key components in such systems. In a MOS test, a group of listeners sits in a quiet room, listens to a call and scores the call quality. Inspired by SRGAN, we asked 26 raters to give an integer score from 1 (low) to 5 (high) to evaluate the images reconstructed by different methods.
Another important metric for evaluating reconstructed images is the perceptual index, which is used to judge the perceptual quality of images. The perceptual index is defined in terms of Ma's score [40] and NIQE [41], as shown in Equation (10):

$$PI = \frac{1}{2} \left( (10 - Ma) + NIQE \right) \quad (10)$$

A lower perceptual index means a better reconstructed image.
The results of RRDGAN with VGG loss as its content loss (RRDGAN-VGG) and RRDGAN with MSE loss as its content loss (RRDGAN-MSE) are also compared in this table. To further verify the performance of RRDGAN, we classify the results by image content, using 10 classes of ground features in total. Table 2 shows the results for these ground features. These results are reconstructions of low resolution inputs with white Gaussian noise, and they show that RRDGAN has the best performance.
The influence of the number of DRBs, of BN and of the wavelet transform was also investigated in our experiments. We discuss the number of DRBs first. Figures 12 and 13 show the effect of five different numbers of DRBs on PSNR and training time, respectively. PSNR improves as the number of DRBs increases. However, when the number of DRBs reaches 25, PSNR is not significantly enhanced, while the training time clearly increases.
Then, we investigate the influence of BN by comparing the results with and without BN in the generative part. Not using BN is better for our method. BN normalizes features to zero mean and unit variance, so it keeps only relative differences and ignores absolute differences between pixels (or features) of the image. This benefits discriminative tasks (such as classification and recognition), but BN does not perform well for image super-resolution reconstruction, which requires absolute differences. Figure 14 illustrates the influence of batch normalization.
Finally, we discuss the influence of using the wavelet transform. As illustrated in Figure 1 (LR with Gaussian noise), image restoration in the wavelet transform domain achieves better performance.
For the image SR reconstruction task, the wavelet transform has been proven to be effective [42]. As Figure 15 shows, different frequency sub-bands represent different information of the image after wavelet decomposition: the low-frequency sub-band represents the global topological information of the image, while the high-frequency sub-bands represent its structure and texture. Therefore, as long as the corresponding wavelet coefficients are accurately predicted, high-quality and high-resolution images with rich texture details and global topological information can be reconstructed from low-resolution images.
For image denoising, Gaussian noise and salt and pepper noise are two common noises in remote sensing images. Gaussian noise exists in every frequency sub-band after the Haar wavelet transform. Therefore, removing Gaussian noise is equivalent to learning the threshold information that traditional wavelet-based denoising uses. After the Haar wavelet transform, salt and pepper noise also exists in every frequency sub-band, and its shape and distribution are similar to those in the spatial domain. Therefore, removing salt and pepper noise is equivalent to learning the end-to-end relationship between noisy and noise-free images. The effect of removing Gaussian noise in the wavelet transform domain is better than that in the spatial domain (as Figure 1 shows), but the effect of removing salt and pepper noise in the wavelet transform domain is not significantly better than that in the spatial domain. This is because salt and pepper noise is no easier to separate in the wavelet transform domain than in the spatial domain, so end-to-end learning there does not yield a better denoising effect. Figure 16 compares the results of applying RRDGAN (denoising only) in the WT domain and in the spatial domain. We removed the upsampling part so that RRDGAN performs only the denoising task.
We also ran experiments to verify whether RRDGAN performs better than applying a denoising method first (the BM3D algorithm [43] or the Non-Local Means (NLM) method [44], two of the best denoising methods) followed by an SR method. Figure 17 shows that RRDGAN is better than the combination of BM3D (or NLM) and RRDGAN. The result of the combination of BM3D (or NLM) and RRDGAN is smoother than the result of using RRDGAN directly. Finally, we ran experiments to verify the effect of using relativistic loss and TV loss.
Figures 18 and 19 compare the results with and without relativistic loss and with and without TV loss, respectively. Using relativistic loss and TV loss helps the network to learn more details.

Different from ESRGAN
Our method resembles ESRGAN, but there are three differences between the two methods. First, RRDGAN uses TV loss in the generator to further improve the quality of the reconstructed image. Second, dense connection, which has higher capacity, is the backbone of RRDGAN, while ESRGAN uses residual learning as its backbone. Finally, RRDGAN is implemented in the WT domain, while ESRGAN is implemented in the spatial domain. The experimental results show that RRDGAN performs better than ESRGAN. In terms of computational complexity, we used a bottleneck structure while ESRGAN did not. Each RRDGAN block has about 290 M parameters, while each ESRGAN block has about 4500 M. So per block, ESRGAN has about 15 times as many parameters as our method.

Deal with White Gaussian Noise
Our method is implemented in the WT domain, where SR and denoising can be handled in different frequency parts. In our experiment, Gaussian noise and salt and pepper noise are added separately to the low resolution optical remote sensing image to obtain the final low quality image. Based on our analysis, SR and salt and pepper noise are both related to the high-frequency part of an optical remote sensing image, whereas white Gaussian noise exists in all frequency parts. So, handling white Gaussian noise in the WT domain with a deep learning method means learning the relationship between the low quality image and the ground truth in each frequency part. The relationship learned by our method is similar to the sparse decomposition result of traditional WT-based denoising methods. So, after training in the WT domain on images with white Gaussian noise, RRDGAN removes white Gaussian noise well.
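For comparison, the traditional WT-based denoising that this section refers to is often implemented by soft-thresholding the detail sub-bands. The NumPy sketch below shows that classical, hand-tuned operation only; it is not what RRDGAN does, since the network learns the mapping from data. The function names and the threshold value `t` are hypothetical.

```python
import numpy as np


def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; magnitudes below t become zero."""
    coeffs = np.asarray(coeffs, dtype=np.float64)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)


def shrink_detail_subbands(ll, hl, lh, hh, t):
    """Keep the low-frequency sub-band and soft-threshold the three detail sub-bands."""
    return ll, soft_threshold(hl, t), soft_threshold(lh, t), soft_threshold(hh, t)


# Small coefficients (mostly noise) vanish, large ones (edges) survive, shrunk by t.
detail = np.array([-3.0, -0.5, 0.5, 3.0])
shrunk = soft_threshold(detail, 1.0)
```

A fixed threshold treats every image and every sub-band the same way, whereas a network trained in the WT domain can in principle adapt this shrinkage to each frequency part.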

Others
We add experiments to compare the super-resolution part of our method with the Fractional Charlier moments method [10] and the Hahn moments method [11] on the Set14 and AVLetters [45] databases. The experimental results show that the super-resolution part of our method performs better in both visual effect and quantitative results.

Conclusions
In this article, a GAN-based method implemented in the WT domain, named RRDGAN, is proposed, which solves remote sensing image denoising and SR simultaneously with a unified network structure. RRDGAN handles spatial denoising and super-resolution reconstruction of optical remote sensing images in the wavelet transform domain. It combines the advantages of both non-GAN-based and GAN-based methods: the generative part combines residual learning (both local and global) and dense connection to obtain a high PSNR. The generator uses TV loss to further enhance the reconstruction, and relativistic loss is applied in the discriminator to make the whole network converge better. Finally, RRDGAN is implemented in the WT domain instead of directly in the spatial domain, because different high-frequency components correspond to different detail information and should be processed differently, but they cannot be distinguished well in the spatial domain. The experimental results on the UCMERCED, NWPU-RESISC45 and GaoFen-1 datasets show that our method not only removes typical noise (salt and pepper noise and white Gaussian noise) from remote sensing images but also enhances their spatial resolution.
In the future, we will study how to handle the problem that the quality of the so-called ground truth in remote sensing is itself low. We will try to use high quality natural images to help accomplish this.
Author Contributions: X.F. conceived the concept and methodology. W.Z. provided the funding support. Z.X. wrote the program, did the experiments and analyzed the results. X.S. and W.Z. checked and proofread the whole article. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

HQ    High spatial Quality
LQ    Low spatial Quality
HR    High spatial Resolution
LR    Low spatial Resolution
CNN   Convolutional Neural Network
GAN   Generative Adversarial Network
TV    Total Variation
WT    Wavelet Transform