Unpaired Remote Sensing Image Super-Resolution with Multi-Stage Aggregation Networks

Mainstream image super-resolution (SR) methods are generally based on paired training samples. As the high-resolution (HR) remote sensing images are difficult to collect with a limited imaging device, most of the existing remote sensing super-resolution methods try to down-sample the collected original images to generate an auxiliary low-resolution (LR) image and form a paired pseudo HR-LR dataset for training. However, the distribution of the generated LR images is generally inconsistent with the real images due to the limitation of remote sensing imaging devices. In this paper, we propose a perceptually unpaired super-resolution method by constructing a multi-stage aggregation network (MSAN). The optimization of the network depends on consistency losses. In particular, the first phase is to preserve the contents of the super-resolved results, by constraining the content consistency between the down-scaled SR results and the low-quality low-resolution inputs. The second stage minimizes perceptual feature loss between the current result and LR input to constrain perceptual-content consistency. The final phase employs the generative adversarial network (GAN) to adding photo-realistic textures by constraining perceptual-distribution consistency. Numerous experiments on synthetic remote sensing datasets and real remote sensing images show that our method obtains more plausible results than other SR methods quantitatively and qualitatively. The PSNR of our network is 0.06dB higher than the SOTA method—HAN on the UC Merced test set with complex degradation.


Introduction
Image super-resolution, which aims to reconstruct the high-resolution (HR) image from its low-resolution (LR) observation, is an active research topic and has been demonstrated to be an effective method to increase the spatial resolution. In the field of remote sensing, the development of antenna arrays [1,2] plays an essential role in image superresolution. SR methods compensate for the information lost in the process of image transmission and compression, and improve the spatial resolution of remote sensing data for environmental monitoring [3] and object detection. The remote sensing image degradation process is usually defined as where I HR means high-resolution images, k denotes blur kernels, ↓ s means degradation model with scale factor s, n is additive white Gaussian noise(AWGN) in general. Comparing with updating the hardware devices, using the SR technique has advantages of low cost, easy implementation, and high efficiency. Remote sensing image super-resolution has become one of the most important applications of SR technology.
Previous super-resolution methods are mainly based on interpolation [4] and reconstruction [5]. Images with different spatial resolutions have different carrying capacities for information. The problem of image super-resolution is from reconstruction of LR images with severe lack of information to HR images with rich information is an ill-posed problem that a single LR image can reconstruct multiple super-resolved results. To tackle this problem, the example-learning based SR methods have been proposed to learn the mapping function between LR and HR image pair, such as sparse coding [6,7] and probabilistic graphical model [8]. Furthermore, with the development of deep convolutional neural network(CNN) in the field of computer vision, the SR methods based on deep learning came into being because they have a higher capability of representation for LR-to-HR nonlinear mapping modeling than the previous example-learning based SR methods. As an initial attempt, Dong et al. [9] introduced CNN into SR that employed three convolutional layers to model the mapping function. In order to increase the depth of the network for improving the capability of the model, Kim et al. [10] firstly introduced the global residual learning and gradient clipping strategy into the SR task and succeeds in training a deep SR network by using global residual learning and gradient clipping strategy. Inspired by VDSR, Tai et al. [11] used local residual learning and global residual learning to build a deep recursive residual network to improve the effect of super-resolved results. Skip connection based networks [12,13] can fuse information in different network depths to increase the representation ability of the model so that it has a stronger information-carrying capacity. Tong et al. [14] initially employed densely connected block and proposed SR method based on dense skip connections. EDSR [15] which introduced local residual learning and removed unnecessary batch normalization layers made a significant breakthrough in SR performance. Ledig et al. [16] applied Generative Adversarial Networks [17] into the model and proposed the perceptual loss function. It is crucial to generate more realistic textures and images. ESRGAN [18] employed residual in residual dense blocks and optimized loss function on SRGAN. Zhang et al. [19] integrated the channel-wise attention into the residual blocks by recalibrating the feature channel to push the state-of-the-art(SOTA) performance of SR forward. Huang et al. [20] initially proposed an interpretable attention mechanism into SR task and achieved SOTA performance. Supervised methods usually use a particular degradation model from HR as LR to train the network. Some researches [21,22] have shown that critical to the success of an accurate model of the degradation of SISR. However, in contrast to the natural images dataset, the original image in remote sensing dataset [23] is not clear enough to meet the need as a ground-truth for training. In addition, paired HR remote sensing images are hard to collect, especially since the cost of updating hardware devices is extremely high.
However, the current mainstream SR methods focus on model innovation, on the one hand, assuming that the image degradation process is bicubic interpolation. On the other hand, focus on the degradation process and build models to fit the data under various blur kernels. When LR images with unknown degradation that real remote sensing images contain the sensor noise and aliasing effects, the existing SR methods struggle to achieve satisfactory performance since we don't know the downsampling operation of remote sensing images. Nevertheless, unpaired super-resolution method has a good performance in constructing real natural images. Freedman et al. [24] proposed self-similarity based image SR, which reconstructed the HR images by using patches in previous natural images. Yuan et al. [25] proposed cycle consistency and designed Cycle-in-Cycle GAN, which did not require the paired LR-HR images and solved the unsupervised image translation problems. First, a GAN with two discriminators was employed for learning degradation process; then another GAN was used for image SR. Shocher et al. [26] exploited the internal self-similarity of a single image internal information into self-supervised iterative optimization of the model without any reference images or pretrained. Ulyanov et al. [27] proposed a new strategy for dealing with the regularization tasks in inverse problems and has been proved to be very effective and has been successful in many imaging inverse problems such as denoising, super-resolution and so on. Since the imaging environment of remote sensing images is generally unknown and complex, unsupervised learning has been developed in remote sensing image SR. Zhang et al. [28] proposed an unsupervised method that employs GAN to obtain super-resolved and achieved satisfactory performance. Inspired by CycleGAN, Wang et al. [29] proposed two generative CNNs for down-sampling named Cycle-CNN modeling the degradation process and the corresponding reconstruction, named CycleCNN. UDGN [3] models the internal recurrence of information inside the single image to generate Lake Area images with higher spatial resolution without requiring pretraining. Zhang et al. [30] proposed a multi-degradation aided method by adopting the multiple Gaussian blur kernels and AWGN to get LR images for adapting to the mixed degraded model.
Since there is not any prior degradation, they generate bad content and details that do not match the corresponding LR image, many unpaired super-resolution methods can only use image features that rely on LR patches of a small fixed size. The spatial distribution of remote sensing images is highly complex, so reconstructing a high-quality HR image with accurate content and refining photo-realistic details are necessary. When we get wrong content from low-quality remote sensing images, we observe that super-resolved results produced by GAN-based methods are often affected by structural distortions. To tackle this problem and inspired by DCSR [31], we explicitly add a multi-stage aggregation architecture which is a step-by-step training mode to realize content consistency(CC), perceptual-content consistency(PCC) and perceptual-distribution consistency(PDC). We also propose consistency losses for our network. The contributions of our proposed unpaired SR method can be summarized in three points:

1.
We introduce a multi-stage aggregation network for gradually optimizing the model with the degraded self-exemplars and unpaired references, which allows it to achieve effective optimization from content to perception. Specifically, the first stage can be adapted to better pixel-wise PSNR, and the subsequent stages can be adapted to more realistic texture and details reconstruction.

2.
Aiming at retaining the content on remote sensing images and excavating its underlying perceptual similarities from the low-quality images, we propose consistency loss functions for contents retainment and details reconstruction in different phases.

3.
We conducted experimental validation of multiple datasets on our method, and the results indicate the superiority of our method in remote sensing SR and have more intuitive visual effects.

Methods
The SR task aims to estimate the SR image I SR from its LR counterpart I LR . Generally, the dataset of paired super-resolution tasks has LR-HR paired images and LR image input model to receive SR image by non-linear mapping. While there are typically no adequate paired high-quality remote sensing images as training reference so that unpaired SR method has received more and more attention. In most cases, we get LR images from original images assuming that the image degradation process is the bicubic interpolation. But in fact, we only have low-quality low-resolution images as input so that reconstructing photo-realistic details is challenging. Therefore, our proposed method aims at recovering more desired details of LR images with unpaired training datasets.
The framework of MSAN which is shown in Figure 1 consists of a generator and a discriminator and shows the process of calculating consistency loss functions. In the different stages, the generator can be updated by calculating corresponding consistency losses. Given an input LR image I L without paired HR image. The super-resolved image I SR can be generated by where, H Generator−C (·), H Generator−PC (·) and H Generator−PC (·) denote the generator in different stages respectively, and r is the scale factor that we want to LR enlarge. And the structure of Generator-C, Generator-PC and Generator-PD are introduced in detail in Figure 2. For extracting and aggregating the multi-level features, we design an efficient framework which contains dense connections and skip connections on multi-level fusion module. Furthermore, in order to achieve the photo-realistic performance, we utilize multi-stage consistency for retaining content and structure from low-resolution inputs and recovering more photo-realistic details. In this section, we detail our proposed method MSAN.

Network Architecture
For training unpaired super-resolution network, we don't have a training set I LR i , that has N LR-HR pairs. Moreover, remote sensing images generally incorporate various degradation models, including poor details and noises that are undesired for the image SR task. However, noise and structure degradation models are important for recovering principle contents and un-occurred details; therefore, there is an urgent need to design a module that effectively utilizes low-quality input to extract low-frequency image feature information. As shown in Figure 2, following our previous work DCSR [31], the generator mainly consists of three parts: initial feature representation module (IFRM), multi-level fusion module (MLFM) and reconstruction module (RM). The IFRM is applied to represent a threechannel LR input as a set of feature-maps, similar to the most existing super-resolution methods. The MLFM is designed to capture more informative features for SR by a sequence of stacked enhanced feature extraction modules (EFEM). The extracted features are fed into the RM to reconstruct the SR images. We employ a convolutional layer as the IFRM to extract the initial features from LR input, which can be empressed by where, f 0 is the initial extracted feature which is then fed into the MLFM and also used for global residual-feature learning, H IFRM (·) and H DFE (·) respectively denote the function of initial feature representation and multi-level fusion module. Inspired by the dense connections [32] and residual blocks [12,13], they have been expounded to be an effective strategy for benefiting the information flow of transferred features at different layers. Different from RDN [13], we add an adaptive learning filter after concatenation. It is worth mentioning that the main contribution of our proposed work is a multi-stage aggregation strategy so that the generator of our network follows the existing networks and can be replaced with any better backbone. The basic dense network is illustrated in Figure 2. We introduce the strategies in our multi-level fusion module. For super-resolution task, shallow-layer features generally represent the coarse image information, however, deep-layer feature maps represent the fine details and textures. Therefore, in order to improve the reuse of features in the middle layer, we employ the local skip-connections to ensure that the shallow-layer information still participates in the image reconstruction. The MLFM is constructed by stacking n enhanced feature extraction modules (EFEM) for inferring the informative features for detail recovering as shown in Figure 2. Let f dense represent the feature generated by the dense-sampling structure. Thus, we introduce the MLFM as where H n EFEM (. . .) indicates the composite function of n-th EFEM, Conv 3 (·) indicates a 3 × 3 convolution operation.
As illustrated in Figure 3, different from Feature Extraction Module (FEM) in DCSR, we add a transition block (TB) which can combine information coming from different blocks through an adaptive learning process. We also improve the previous FEM that employs dense connections between m residual blocks (DRB) to form a residual blockchain. To improve the continuous memory of information flow, TB is attached to the fusing of feature information of each layer better. In the transition block, The features extracted by the previous EFEMs and the current residual blockchain are concatenated and then input a convolutional layer to process features more efficiently. Specifically, the inputs f 0 , f 1 , . . ., and f i are subjected to the concatenation operation. Let f n represent the feature generated by the middle layer. The process of DRB and TB are formulated as where x m denotes the feature extracted by the m-th residual block, [x 0 , x 1 , x 2 , . . . , x m−1 ] indicate the features are concatenated together and r 0 denotes where ϕ n (·) denotes the function of a convolutional layer that kernel size is one which is applied for reducing the channel dimension into a fixed size, H n DRB (·) denotes the dense residual blockchain contained in the n-th EFEM.
After acquiring the high informative features, we utilize an representation module (RM) for feature manipulation and outputting a high-resolution image. As shown in Figure 2 the final representation module only consists of one learnable layer which is a convolution operation and a non-parametric operation (sub-pixel convolution [33]). When we need to enlarge the image on scale factor s, RM rearrange the tensor with dimensions H × W × C · r 2 as rH × rW × C for converting groups of high-resolution features into RGB images.

Multi-Stage Architecture
Traditionally, we can optimize the model parameters to achieve good performance by minimizing the loss function between the super-resolved results and ground-truth, as shown in Equation (10): For the unpaired remote sensing super-resolution task, the low-resolution remote sensing images still contain enough similar principal information by transmitting hardware devices. Therefore, multi-stage aggregation can solve the problem that unpaired methods generate undesirable content and details, and consistency losses are introduced to restrict model convergence, which is shown in Figure 1.
(1) First stage: Content Consistency Aiming at recovering the principal information of the super-resolved results from the LR inputs, the content-consistency stage is optimized via minimizing the difference between the super-resolved image I SR C and the corresponding low-quality image I L . We proposed a content-consistency loss function (L CC ) in the training process of the first stage. Only using L CC to optimize the model during training can reconstruct the correct content and avoid recovering the wrong content due to the high spatial resolution of the remote sensing images. In this stage, the content-consistency loss L CC is formulated as where D(·) indicates the down-scaling operators using the bicubic algorithm. We downsample super-resolved images I SR C by the same scale factor and the processed image as input performs enhanced feature extraction and reconstruction. This stage mainly pursues a high PSNR value. However, due to the influence of uncontrollable artifacts under the unknown degradation model, noise and distortion will still affect the reconstruction of the principal information at this stage. The super-resolved of the current stage is slightly smoother and cannot produce rich details.
(2) Second stage: Perceptual-Content Consistency In the absence of paired groud-truth images, while ensuring that the content remains consistent, excavating deep semantic features is urgently necessary. In order to enhance the perceptual content of super-resolved images and explore the implicit perceptual similarity, we perform the second stage of training on the results of the previous stage. Particularly, we design two consistency loss functions (L PCC 1 and L PCC 2 ) as illustrated in Figure 4. The generator used at this stage is the same as in the previous stage. The pre-trained VGG-19 [34] is applied here for inferring the perceptual similarity of I L and I SR PC . As shown in Figure 4, in calculating L PCC 1 , we perform a blurring operation on I SR PC and upscaled I L . The objective functions for training the second phase could be formulated as where U p(·) indicates the up-scaling operators via bicubic resampling meanwhile the size of U p(I L ) is the same as I SR PC . Aiming at removing the distortion effects, we apply the blurring operators G(·) on images using 5 × 5 Gaussian blur kernels with σ = 1.2. Furthermore, we use the deep layer features ϕ deep of VGG-19 for representing the perceptual-content (PC) representations. We extract the final perceptual-content features on the "Conv 5,4 " layer of VGG19. Specifically, "Conv m,n " indicates the feature maps obtained by the n-th convolution (before activation) before the m-th max-pooling layer within the VGG-19 network. In addition, even with noise and distortion in low-quality remote sensing images, low-level PC features should also be helpful for reconstructing desired structure details. So we proposed the perceptual down-scaling which is still calculated by using the L 1 norm.
where ϕ shallow denotes shallow-level features "Conv 2,2 " of VGG-19, which is utilized for perceptual representations of down-scaled I SR PC and I L . However, in calculating L PCC 2 for perceptual down-scaling similarity, different from L PCC 1 , the input of VGG-19 is the down-scaling I SR PC and the original low-quality image I L .
(3) Third stage: Perceptual-Distribution Consistency Aiming at pursuing better visual effect, GAN [17] could reconstruct the photo-realistic details of remote sensing images with better perceptual performance due to the statistic gaming of the generator and discriminator. So the unpaired remote sensing ground-truths as real images I r and the reconstructing results of the first two stages as fake ones I f , similar to ESRGAN [18]. As shown in Figure 1, we employ the relativistic GAN to train the photo-realistic model to implement the perceptual-distribution consistency constraint. The relativistic discriminator intends to estimate the probability that I r is more realistic than I f . So L PDC is composed of L Dis and L Gen . The corresponding adversarial loss is formulated as The generator loss is calculated by where Dis Ra denotes the relativistic discriminator in ESRGAN [18] and loss functions are calculated by the cross-entropy. The framework of the discriminator is the the same as RankSRGAN [35]. In addition, we also employ the total variation loss L TV as a regularization term to remove noises and checkerboard artifacts from images and keep the edges.

Dataset
We perform experiments to evaluate our method on three widely used benchmark datasets in remote sensing: WHU-RS19 [23], UC Merced dataset [36] and DOTA [37]. For WHU-RS19, UC Merced dataset, we randomly extract 70% original images for training as low-quality LR input, 15% images for validation, and the rest for testing. Especially, for the unpaired super-resolution task, since the HR-sized inputs are needed in the third phase, 500 images from the DOTA dataset are selected as the unpaired HR references. To be fair, the test images are not used when training network as shown in Figure 5.

Implementation Details and Metrics
As depicted in Section 3, in the MLFM structure, we stack EFEMs number as n = 8 to build a deep network with having 64 initial channels. In each EFEM, we stack sixteen residual blocks(m = 16) which convolutional layers have 64 filters with the growth rate of 64. In the training process, we use the 74 × 74 RGB image patches which are randomly cropped from the low-quality training set, and then augment them with random horizontal flipping and rotation by 90 • . Besides, the high-resolution high-quality patches with a size of 296 × 296 are randomly selected from the ground-truth. Furthermore, all the LR and HR images are pre-processed by subtracting the mean RGB value of the dataset for accelerating the ing phase. We train all of our models by using the Adam optimizer [38] with setting the momentum parameter β 1 = 0.9, β 1 = 0.99. The mini-batch size is set to 16 and the initial learning rate is 1 × 10 −4 and halved for every 1 × 10 5 mini-batch updates. Each of the final models will get convergence after 2 × 10 5 mini-batch updates on PyTorch framework and the NVIDIA Titan RTX with 24 GB memory.
For evaluating the SR performance, we apply two common image quality assessment criteria for evaluating pixel-wise discrepancies: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [39]. In addition, we also use a no-reference image quality assessment Natural Image Quality Evaluator (NIQE) [40] and a full-reference metric the Learned Perceptual Image Patch Similarity (LPIPS) [41] for scoring the degree of perceptual performance. Following the convention of super-resolution, only the luminance channel is selected for full-reference image quality assessment because the intensity of image is more sensitive to human-vision than the chroma.

Results
In this section, for assessing our method performance, we will give a detailed description of our experimental results. The proposed method is compared with the supervised methods and unsupervised methods on four common image quality assessments that we apply in Section 2.2.2.

Model Analysis
From Section 2.1.2, we train the final generator of our model is trained under the optimization of different stages with Content Consistency, Perceptual-Content Consistency and Perceptual-Distribution Consistency. In this part, we design the ablation study and analyze for investigating the effectiveness of each stage by training the ablated models.
In order to verify whether the feature maps extracted by the CC, PCC, and PDC have a dependency relationship, we show the visual results of training intermediate feature maps. As illustrated in Figure 6, we can detect that the feature maps extracted by the generator of different stages are similar. Therefore, our multi-stage aggregation network is reasonable because it can gradually reconstruct high-efficiency remote sensing images. To verify the necessity of using (muti-stage) architecture on unpaired dataset training, we remove CC, PCC and PDC from MSAN. As shown in Table 1 and Figure 7, specifically, those ablation study results demonstrate progressive consistency plays significant roles in qualitative performance:

The Effect of Content Consistency
As depicted in Section 2.1.2(1), in the CC stage, the content-consistency loss L CC is introduced for reconstructing the principal information from the low-resolution remote sensing inputs by down-scaling the super-resolved images. As Equation (11), this term aims at driving the degraded super-resolved results to be similar to the blurred low-quality inputs with the assumption of noiselessness. The principal content information of superresolved results I SR c show high consistency to the low-resolution inputs as illustrated in Figure 7. That is why our model in the first stage has higher PSNR scores and lower perceptual metrics than other ablated model in Table 1.

The Effect of Perceptual-Content Consistency
As depicted in Section 2.1.2(2), by utilizing the pre-trained VGG-19 network to excavate the deep layer and shallow layer features, the effects of PCC that employs L PCC 1 and L PCC 2 should focus more on perceptual content information similarity between the feature level criteria of super-resolved results and low-resolution inputs. And as shown in Figure 4, L PCC 1 discriminates the deep layer features similarity with the assumptions of noiselessness using Gaussian blur G(·). L PCC 2 is applied on optimizing the distance of the shallow layer between the low-resolution inputs and down-sampling super-resolved results. Since the deep features extracted from the VGG-19 [34] depends on the local receptive fields strongly, the undesired noises would introduce some negative effects on the clear pixels to retain the perceptual-content consistency of I SR C and I L . L PCC 2 mainly controls the similarity of principal information and explores perceptual to generate novel details as the perceptual features of noises have positive effects on information supplement of details. As shown in Figure 8, to evaluate the effect of different Gaussian kernels in computing the perceptual consistency, we conduct the experimental comparisons. The results suggest that the size of kernel (5 × 5) and the standard deviation value (σ = 1.2) are proper. Furthermore, as depicted in Figures 6 and 7, adding the PCC stage to the model focuses more on the perceptual content consistency of super-resolved results from the corresponding inputs. Due to remote sensing images containing the complex distribution of objects, the PCC stage mainly avoids lack of informative details, over-fitting on edges and artifacts.

The Effect of Perceptual-Distribution Consistency
As depicted in Section 2.1.2(3), we exploit the adversarial learning to train the perceptual model for photo-realistic super-resolution. Especially for experiments, as shown in Table 1, comparing the model MSAN w/o PC and MSAN (NIQE and LPIPS), it fully demonstrated that perceptual-distribution consistency is beneficial to improve the visual effect. From Figure 7, we can observe that the super-resolved results of the MSAN without PDC cannot generate photo-realistic textures and details. It means PDC is necessary for our architecture.

Comparisons with State-of-the-Art Methods
For a comprehensive comparison, we compare our method with several paired methods including SRCNN [9], EDSR [15], ESRGAN [18], RCAN [19], IMDN [42]and HAN [43], and unpaired methods including SelfExSR [24], ZSSR [26], CinCGAN [25] and DRN [44]. These methods are all evaluated on the UC Merced test set and the WHU-RS19 test set with bicubic degradation and complex degradations (bicubic + Gaussian blur) at the scaling factor of ×4. Bicubic degradation is the most widely used assumption in the paired SR task, although it cannot effectively fit the original remote sensing image degradation model. Therefore, we conducted experiments on bicubic degradation as a baseline and add blur degradation to analyze the robustness of MSAN against complex degradation.

Result on Bicubic Degradation
In this part, two remote sensing test sets are down-sampled with bicubic degradation. Tables 2 and 3 show quantitative comparisons for ×4 SR and present the average assessment results. Table 2 presents the quantitative results of evaluation between our method and other unpaired methods. Consider the UC Merced test set as an example. It can be seen that our method achieves the best performance in all cases. Specifically, the PSNR gains which MSAN exceeds the second-best model reach 0.48 dB. The LPIPS of our method is 0.0062 lower than that obtained with the suboptimal method. The proposed MSAN shows significant advantages of NIQE. The NIQE and LPIPS of the MSAN are both optimal compared with those of other unpaired methods on the WHU-RS19 test set. The NIQE and LPIPS of our method are 0.047 and 0.0036 lower than that obtained with the suboptimal method respectively. According to Table 2, in the remote sensing dataset, the unpaired method we proposed is superior to all the other unpaired methods on most of the metrics. It is verified that MSAN can reconstruct more subjective perception remote sensing images. Table 3 presents the quantitative results of evaluation between our method and paired methods. As for the ×4 SR task on UC Merced test set, the cases of our method all achieve the best performance. The proposed MSAN performs the best in terms of LPIPS and NIQE and keeps the presentable PSNR values. Taking the WHU-RS19 test set as an example. For bicubic degradation, experimental results demonstrate that the proposed method outperforms part of paired SR methods. Our method achieves the best LPIPS performance with those of other methods. The LPIPS is 0.009 lower than that obtained with the suboptimal IMDN method. HAN keeps the presentable PSNR and SSIM values. Although ESRGAN achieves the lowest NIQE, the content of its super-resolution results is often inconsistent by undesirable artifacts so that the PSNR of the ESRGAN is the lowest as shown in Figure 9.
As shown in Figure 9, we present visual comparisons on bicubic degradation datasets with the Bicubic technique and other state-of-the-art methods, which are among the methods listed in Table 2 and 3. Here, we use SSIM and non-reference image quality metrics NIQE to assess these super-resolved results. From Figure 9a, we observe that most of the compared unpaired methods suffer from blurred artifacts so that it is difficult to reconstruct the correct content and structure edge. The SelfExSR and DRN reconstruct SR image contains wrong details that can't match the information in the groud-truth images. In contrast, our MSAN can slightly alleviate this phenomenon and produce more accurate results with more precise details. In Figure 9b, we present visual comparisons of some paired methods reconstruction results. For image "Desert_43" from the WHU-RS19 dataset, our MSAN can synthesize realistic textures while retaining a delicate content compared to other methods.

Result on Complex Degradation
We further apply our MSAN to solve the complex degradation image super-resolution to analyze the robustness. In order to construct complex degradation, the isotropic Gaussian blur is added to low-resolution images. Quantitative comparisons for ×4 SR are shown in Tables 4 and 5 and visual comparisons are shown in Figure 10.
In Table 4, compared with other unpaired SR methods, our method achieves the best quantitative results of the evaluation. Making the UC Merced test set as an example, It can be seen that our method achieves the best performance on PSNR, SSIM and NIQE. CinCGAN and DRN also render competitive results. Specifically, The LPIPS of CinCGAN is 0.4487 lower than other unpaired methods. The PSNR and SSIM which MSAN exceeds the second-best model-DRN reach 0.02 dB and 0.0101 respectively. The proposed MSAN still shows significant advantages of NIQE. In the WHU-RS19 test set, the SSIM and NIQE of the MSAN are both optimal compared with those of other unpaired methods. The PSNR of our method is 0.04 dB lower than that obtained with the best method.  As shown in Table 5, Several evaluation indexes render competitive results in contrast with paired SR methods. Compared to ESRGAN, the proposed MSAN achieves the best in terms of PSNR and keeps the presentable LPIPS and NIQE values. Although SRCNN achieves the lowest LPIPS, NIQE of it is the highest compared to other methods. To show that the MSAN has an excellent reconstruction effect on blur low-resolution images, we present the visual effect in Figure 10, where the images are from WHU-RS19. From Figure 10, ESRGAN generates many distorted textures on complex degradation test sets; in addition, ZSSR and SelfExSR suffer from the jagged effect on the edges of details. Super-resolved results of our method reconstruct more content details than other methods. This firmly reflects the strong SR capacity of our models in dealing with complex degradation.

Method of Application
In Section 3, all experiments are based on the down-sampling of the original images with a known degradation model to obtain low-quality low-resolution images of the test set. However, in the real world, the information transportation and compression procedure are generally unknown. Therefore, reconstructing original low-quality images of different datasets to further discuss the proposed method's SR reconstruction effect is necessary.
To this end, the sample without degradation operation is chosen from the UC Merced dataset with a spatial resolution of 0.3 m per pixel. The methods based on the Bicubic, ZSSR, and MSAN are used to perform SR reconstruction. As shown in Figure 11a, it presents the results of SR reconstruction with an up-sampling factor ×4. Due to the lack of corresponding ground-truth images, the reconstruction performance can be evaluated by comparing the visual effect reconstructed by different methods. In addition, we chose three original images from the DOTA and GaoFen-1 with a high spatial resolution as shown in Figure 11b-d. Because of the quality of different original images, the visual effect of the image reconstructed by our method on the first two images is not as good as that on the last two images. Nevertheless, the proposed method obtains better performance with more details and textures and more precise contour stripes than the two other methods. These experimental results further verify that our MSAN produces promising results when tackling real-world remote sensing data with complicated patterns.

Limitation
By quantitatively analyzing the experimental results on various datasets, we find that our MSAN achieves the best performance on the WHU-RS19 dataset but only reaches the second-best on the UC Merced dataset, whose high-resolution counterparts have lower quality and resolution than the WHU-RS19 dataset. In a sense, it indicates that our MSAN shows less effectiveness and generalization on reconstructing small-sized images. For a reason, in the content reconstruction stage, the content consistency constraint is applied to optimize the model by calculating the pixel-wise discrepancy. And in the UC Merced dataset, the images with the size of 256 × 256 contain fewer contents for reconstruction. Due to the progressive reconstruction, only if the super-resolved results of the first stage preserve the principal information of the image, the subsequent stages would refine the perceptual textures well. Therefore, in the future, it is expected to improve the capability of the model and optimization for content consistency in the first stage.

Conclusions
In this paper, we propose an efficient unpaired super-resolution method with multistage aggregation network(MSAN) to super-resolved real remote sensing images. Specifically, we utilize content consistency, perceptual-content consistency and perceptualdistribution consistency to exert a stage-by-stage training mode, which helps reconstruct accurate content, avoid structure details loss and generate desired photo-realistic details. We introduce different consistency losses for different phases to optimize content and distribution similarity between the super-resolved results and low-quality inputs. The generator of MSAN mainly includes EFEMs to improve the use of original low-quality image feature information. The experiments of method analysis suggest the effectiveness of progressive reconstruction strategy and every phase is dependent. Extensive experimental results demonstrate that our method achieves satisfactory performance in four metrics compared with current unpaired SR methods and renders a competitive visual effect in contrast with paired SR methods. Besides, we conduct our experiments on original remote sensing images and obtains excellent performance, proving the robustness and practicality in real-world applications.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
All authors have reviewed the manuscript and approved submission to this journal. The authors declare that there is no conflict of interest regarding the publication of this article and no self-citations included in the manuscript.