Super-resolution (SR), which aims to restore the missing high-frequency information of lower-resolution images and thereby increase the apparent spatial resolution [1], is a crucial field of research in the remote sensing community. Unlike common imaging devices (e.g., cameras), the image resolution of a space-borne imaging system is always limited by factors such as orbit altitude, revisit cycle, instantaneous field of view, and the optical sensor [2]. Consequently, once a remote sensing satellite is launched, super-resolving reconstruction is needed to overcome these limitations and improve image resolution from a post-processing perspective.
SR, as a key image processing technique, has gained increasing attention for decades. Its core idea is to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart. Many traditional algorithms have been proposed to handle this issue [4]. Recently, deep learning-based methods have flourished, and the convincing results they achieve have allowed them to outperform traditional algorithms. Deep learning-based super-resolving networks can be categorized into two groups according to their structure: linear networks and skip connection-based networks.
A linear network denotes a simple single-path structure consisting only of convolutional layers, without any skip connections or multiple branches. Dong et al. [7] first demonstrated that a convolutional neural network (CNN) can learn the mapping from LR space to HR space in an end-to-end manner. Their model, SRCNN, successfully brought deep learning techniques into the SR community and showed outstanding performance. However, it must first interpolate the input to the desired size. This early up-sampling design is memory-intensive, since the number of network parameters grows in proportion to the high-dimensional input. In contrast to SRCNN, Shi et al. [8] proposed performing feature extraction in LR space and increasing the resolution from the low-dimensional space to the high-dimensional space only at the very end of the network. Their network, the efficient sub-pixel convolutional neural network (ESPCN) [8], introduces an efficient sub-pixel convolutional layer at the end that predicts the HR output directly from LR feature maps by rearranging them. This late up-sampling design significantly reduces the memory and computational requirements, but the network still employs a shallow linear structure.
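The channel-to-space rearrangement at the heart of the sub-pixel layer can be sketched in a few lines. The following is a minimal NumPy illustration of the rearrangement itself; the learned convolutions that produce the r² feature maps are omitted:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r), as in
    ESPCN's sub-pixel convolutional layer: each group of r^2 channels
    is interleaved into an r-by-r block of output pixels."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)  # merge into the HR grid
```

For example, a single-pixel input with four channels `[0, 1, 2, 3]` and r = 2 becomes the 2×2 patch `[[0, 1], [2, 3]]`, so the spatial resolution doubles while the channel count drops by r².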
Considering the limited representational ability of a simple linear structure, skip connection-based networks use residual connections to promote gradient propagation, which makes it feasible to build very deep networks. He et al. [9] first demonstrated the advantages of the residual design. Kim et al. [10] then introduced residual learning into SR reconstruction. They pointed out that SRCNN [7] relies on the contextual information of small regions and converges slowly during training. Moreover, SRCNN works only for a single scale at a time. Therefore, they proposed a model named the very deep convolutional network (VDSR) [10]. Unlike the shallow architecture of SRCNN, VDSR exploits contextual feature priors over large image regions by cascading small filters many times in the network. To speed up training, it learns only the residuals and uses extremely high learning rates, enabled by a strategy named adjustable gradient clipping. The authors also extended VDSR to handle the multi-scale super-resolving problem jointly in a single network. The deeply recursive convolutional network (DRCN) [11] is another model proposed by Kim et al., which, as the name indicates, applies the same convolutional layers multiple times. Based on the similar idea of recursive units, Tai et al. later introduced the recursive block in the deep recursive residual network (DRRN) [12] and the memory block in the persistent memory network (MemNet) [13]. Note that References [10] still require bicubic-interpolated images as input. As for post-up-sampling networks, Ledig et al. [14] introduced the residual network (ResNet) [9], originally proposed for high-level vision problems such as image classification and target detection, into their model, SRResNet. Lai et al. employed a novel pyramidal framework in their Laplacian pyramid super-resolution network (LapSRN) [15], which consists of three sub-networks that predict the residual features under large SR factors in a progressive manner. Other recent work, such as the information distillation network (IDN) [16], adopts an information distillation block made up of enhancement units and compression units. The super-resolution network for multiple degradations (SRMD) [17] takes multiple degradations into account simultaneously, which offers a unique capability. The cascading residual network (CARN) [18] uses multiple cascading connections to incorporate local-level and global-level representations. This strategy allows information and gradients to propagate efficiently, but it neglects the information differences between different levels.
As for the remote sensing community, the authors of Reference [1] explored enhancing high-frequency content and image-to-image translation based on Reference [14]. Huang et al. [2] combined SRCNN [7] and VDSR [10] and achieved superior SR performance on Sentinel-2A data. Luo et al. [19] then improved the work of Reference [10] with a mirroring reflection method that exploits image self-similarity. Lei et al. [20] explored a multi-fork design, named the local-global combined network, to learn multi-level feature information of remote sensing images, including local details and global environmental information. Xu et al. [21] argued that Reference [20] ignores the local information produced by lower layers and further proposed the deep memory connected network [21], which employs local and global memory connections to better leverage the local details and global priors learned in different convolutional layers. In fact, the image information these methods utilize is still limited. Furthermore, considering the insufficiency of well-qualified HR remote sensing training samples, Haut et al. [22] studied a deep generative network that learns the mapping between LR space and HR space without external HR training data, super-resolving remote sensing images from an unsupervised perspective.
In summary, deep learning-based methods have achieved impressive performance on the SR problem, and the skip-connection design has further optimized the learning process and improved the hierarchical representation ability of the networks. Nonetheless, these networks still have some deficiencies when super-resolving remote sensing data.
First, the aforementioned methods overlook the fact that all of the prior knowledge learned by their networks is useful for reconstruction. Even though References [18] took pattern information at the local level and global level into account, what they utilized is still limited. Also, none of them [7] attempt to build a model with multiple perceptual scales, which could learn information at diverse context scales adaptively. Remote sensing images have highly complex spatial distributions, and the ground objects they exhibit usually span diverse ranges of scales. Therefore, extracting as much prior knowledge as possible at different levels is critical to coping with the complexity and variability of remote sensing data and to reconstructing images with high fidelity.
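To make the idea of multiple perceptual scales concrete, the sketch below extracts features at several context scales in parallel, using fixed mean filters of increasing size. An actual network would learn these kernels, so this is a NumPy illustration of the concept rather than MPSR's architecture:

```python
import numpy as np

def conv2d_same(img, k):
    """Naive 'same' 2-D convolution with zero padding (for illustration;
    assumes an odd-sized kernel)."""
    kh, kw = k.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(p[i:i + kh, j:j + kw] * k)
    return out

def multi_scale_features(img, kernel_sizes=(3, 5, 7)):
    """Apply filters with different receptive fields in parallel and
    stack the responses, so each output channel 'perceives' the image
    at a different context scale."""
    return np.stack([conv2d_same(img, np.full((s, s), 1.0 / (s * s)))
                     for s in kernel_sizes])
```

Larger kernels summarize wider neighborhoods (global environment), while smaller ones preserve local detail; concatenating both gives downstream layers access to multiple context scales at once.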
Second, all of the methods previously discussed treat the learned features equally in the SR process, which lacks scalability in processing information at different levels. To be specific, some studies tried to learn local and global information [18] or multi-scale features [23], but they neglected the channel-wise constituent differences across those feature maps and failed to use them reasonably. In fact, the information obtained from different levels usually consists of components (e.g., edges, textures, and smooth regions) in different proportions, which are unequally important for reconstructing an image.
To solve these problems, based on the idea that “the more complementary prior information we capture, the better reconstructions we get”, a multi-perception attention network (MPSR) is developed for remote sensing image super-resolution. The main contributions of this study are:
Present MPSR, a parallel two-branch structure, which simultaneously achieves multi-perception learning of image patterns and adaptive weighted fusion of multi-level information.
Propose the residual channel attention group (RCAG), in which the enhanced residual block (ERB) serves as the main building block to fully capture prior information from diverse perception levels, while the attention mechanism allows the group to focus adaptively on more informative feature maps.
Train the proposed model with a supervised transfer learning strategy to cope with the lack of real HR remote sensing training samples and to further boost the reconstruction ability of the proposed network on remote sensing images.
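Since the exact RCAG formulation is detailed later in the paper, the attention mechanism it builds on can be illustrated generically. The following is a minimal NumPy sketch of squeeze-and-excitation-style channel attention; the weight matrices `w1`, `w2` and the reduction ratio are illustrative assumptions, not values from this work:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation-style channel attention.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r), with reduction ratio r.
    Each channel is rescaled by a learned weight in (0, 1), so more
    informative feature maps can be emphasized adaptively."""
    squeeze = feat.mean(axis=(1, 2))                     # global average pooling -> (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0)) # FC -> ReLU -> FC -> sigmoid
    return feat * excite[:, None, None]                  # channel-wise rescaling
```

Because the excitation weights lie in (0, 1), channels carrying edges and textures can be passed through nearly unchanged while smooth, less informative channels are suppressed.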
In this article, we first analyze the proposed methods in Section 2. In Section 3, we clarify the experimental settings, demonstrate the effectiveness of the proposed methods, study the relations between SR performance and factors such as the number of enhanced residual blocks and the number of residual channel attention groups, and compare the proposed MPSR with recent works from both objective and subjective perspectives. Further discussion is given in Section 4, and the conclusion is provided in Section 5.

The proposed methods are shown to perform convincingly through extensive experimental results. In Section 3.2, the gains obtained after adding ERBs and RCAGs clearly demonstrate the effectiveness of the multiple perceptual scales within the design and the rationality of treating information from different levels with unequal attention. A reasonable network structure was then derived by progressively modifying the number of ERBs and RCAGs, and it was further improved with a transfer training strategy. To explore the SR capacity of the models, tests were conducted on public remote sensing data and benchmark natural image sets in Section 3.3. The results show that the models achieve competitive performance in comparison with state-of-the-art SR methods and obtain satisfactory super-resolved results even when dealing with the complex and varied remote sensing images from the GaoFen-1 and GaoFen-2 satellites. From the faint lines on the playground to the indistinct but dense buildings (Figure 11), all of the SR results demonstrate once again the excellent image processing capability of the multi-perception learning-based network.
However, some problems were found through this research. In general, a CNN-based method can benefit from increasing the network depth, yet worse test results were obtained when going deeper by adding ERBs (e.g., B = 9 and B = 10; see Table 2), and something similar happened when G = 4 (Table 3). This phenomenon could be related to the input images. Compared with natural images, the input images from the UC Merced dataset [35] lack high-frequency components, even though they have a spatial resolution of 0.3 m per pixel. Moreover, after the degradation operation performed before testing, the image quality becomes worse. Such low-quality initial inputs may lead to vanishing gradients during the SR process and are unsuitable for a deep network to learn from or extract information from. Therefore, making a good trade-off between super-resolving performance and the network settings according to the practical situation is of great importance.
In addition, an objective evaluation of the super-resolved GaoFen-1 and GaoFen-2 data could not be performed, since the real HR images are unknown. How a more reasonable and relatively objective evaluation can be performed in such a case, without a standard reference, remains an open issue. Besides, existing CNN-based SR works mostly use a bicubic down-sampler to generate LR images. In fact, learning multiple degradations [17] or exploring real-world degradations [43] helps in training super-resolving models, since the true degradation does not always follow the bicubic interpolation-based assumption. Furthermore, a high-quality dataset dedicated to remote sensing SR research is also a core issue to be addressed.
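As a concrete illustration of the degradation pipeline discussed above, LR training inputs are typically synthesized by down-sampling HR patches. The sketch below uses simple average pooling as a stand-in for the bicubic down-sampler common in SR work; real pipelines usually call a bicubic resize instead (e.g., MATLAB's `imresize` or PIL's `Image.resize`):

```python
import numpy as np

def degrade(hr, scale):
    """Synthesize an LR sample from an HR patch by average-pooling
    over scale-by-scale blocks (a simplified stand-in for bicubic
    down-sampling). hr: (H, W) with H and W divisible by `scale`."""
    h, w = hr.shape
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
```

An (HR, LR) pair for training is then simply `(hr, degrade(hr, scale))`; as the surrounding text notes, models trained only under such a fixed synthetic degradation may not transfer well to real sensor degradations.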