Transferred Multi-Perception Attention Networks for Remote Sensing Image Super-Resolution

Image super-resolution (SR) reconstruction plays a key role in coping with the increasing demand for remote sensing imaging applications with high spatial resolution requirements. Though many SR methods have been proposed over the last few years, further research is needed to improve SR processes with regard to the complex spatial distribution of remote sensing images and the diverse spatial scales of ground objects. In this paper, a novel multi-perception attention network (MPSR) is developed whose performance exceeds that of many existing state-of-the-art models. By incorporating the proposed enhanced residual block (ERB) and residual channel attention group (RCAG), MPSR can super-resolve low-resolution remote sensing images via multi-perception learning and multi-level information adaptive weighted fusion. Moreover, a pre-train and transfer learning strategy is introduced, which improves SR performance and stabilizes the training procedure. Experimental comparisons are conducted against 13 state-of-the-art methods on a remote sensing dataset and benchmark natural image sets. The proposed model proves its excellence in terms of both objective criteria and subjective perception.


Introduction
Super-resolution (SR), which aims at restoring the missing high-frequency information from lower-resolution images in order to increase the apparent spatial resolution [1], is a crucial field of research in the remote sensing community. Different from common imaging devices (e.g., cameras), the imagery resolution of a space-borne imaging system is always limited by factors such as orbit altitude, revisit cycle, instantaneous field of view, optical sensor, and the like [2][3][4]. Once a remote sensing satellite is launched, super-resolving reconstruction is needed to exceed these limitations and improve the image resolution from a post-processing perspective.
SR, as a key image processing technique, has gained increasing attention for decades. Its core idea is to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart. Many traditional algorithms have been proposed to handle this issue [4][5][6]. Recently, with the boom of deep learning-based methods and the satisfying results they achieve, traditional algorithms have been outperformed. Deep learning-based super-resolving networks can be categorized into two groups according to their structures: linear networks and skip connection-based networks.
A linear network indicates a simple single-path structure consisting of only convolutional layers without any skip connections or multiple branches. Dong et al. [7] first demonstrated that a convolutional neural network (CNN) can be used to learn the mapping from LR space to HR space in an end-to-end manner. Their model, SRCNN, successfully introduced deep learning techniques into the SR community. In a word, deep learning-based methods achieve significantly satisfying performance on the SR problem, and the skip-connection design further optimizes the learning process and improves the hierarchical representation ability of the networks. Nonetheless, these networks still have some deficiencies when super-resolving remote sensing data.
First, the aforementioned methods overlook the fact that all the prior knowledge learned by their networks is useful for reconstruction. Even though References [18,20,21] took pattern information at the local level and global level into account, what they utilize is still limited. Also, none of them [7,8,10-22] attempt to build a model with multiple perceptual scales, which could learn information at diverse context scales adaptively. Remote sensing images have a highly complex spatial distribution, and the ground objects they exhibit usually span diverse spatial scales. Therefore, extracting as much prior knowledge as possible at different levels is critical to coping with the complexity and variability of remote sensing data and to reconstructing images with high fidelity.
Second, all the methods previously discussed treat the learned features equally in the SR process, which lacks scalability in processing information at different levels. Specifically, some studies tried to learn local and global information [18,20,21] or multi-scale features [23], but they neglected the channel-wise constituent differences across those feature maps and failed to use them reasonably. In fact, information obtained from different levels is usually full of components (e.g., edges, textures, and smooth regions) in different proportions, which are unequally important for reconstructing an image.
To solve these problems, based on the idea of "the more complementary prior information we capture, the better reconstructions we get", a multi-perception attention network (MPSR) is developed for remote sensing image super-resolution. The main contributions of this study are:

1.
Present MPSR, a parallel two-branch structure, which achieves multi-perception learning in image patterns and multi-level information adaptive weighted fusion simultaneously.

2.
Propose residual channel attention group (RCAG), where the enhanced residual block (ERB) serves as the main building block to fully capture the prior information from diverse perception levels and the attention mechanism allows the group to focus on more informative feature maps adaptively.

3.
Train the proposed model with a supervised transfer learning strategy to cope with the lack of real HR remote sensing training samples and further boost the reconstruction ability of the proposed network toward remote sensing images.
In this article, we first analyze the proposed methods in Section 2. In Section 3, we clarify the experimental settings, demonstrate the effectiveness of the proposed methods, study the relations between SR performance and the factors such as the number of the enhanced residual blocks and the number of residual channel attention groups, and compare the proposed MPSR with recent works in objective criterion and subjective perspective. Further discussion is given in Section 4, and the conclusion is provided in Section 5.

Network Architecture
As shown in Figure 1, MPSR employs a well-designed two-branch structure, which is capable of learning a diverse set of priors at multiple context scales. Since the multi-level information obtained has varying importance for reconstruction due to channel-wise constituent differences, the attention mechanism [24][25][26][27] is introduced to rescale it. The whole network mainly consists of three parts: shallow feature learning, multi-perception deep feature extraction, and reconstruction.

Here, the LR input image is denoted as I_LR. One convolutional (Conv) layer is used to extract the shallow feature F_0 from I_LR. Formally, the first layer is expressed as a function f_SF(·):

F_0 = f_SF(I_LR).

F_0 is used for multi-level information extraction. Then:

F_ML = f_MP(F_0),

where f_MP(·) denotes the parallel two-branch multi-perception structure, which further contains G RCAGs in each branch. Since the proposed structure achieves learning prior information at multiple levels, its output is treated as F_ML. More details about the multi-perception part are provided in Section 2.3. F_ML is then sent to the reconstruction part, which is composed of an upscale module and a Conv layer:

F_UP = f_UP(F_ML),

where f_UP(·) and F_UP denote the upscale module and the upscaled feature map, respectively. The sub-pixel Conv layer [8] is chosen as the upscaler, which can aggregate LR images and project them to high-dimensional space. The upscaled feature is reconstructed via the last Conv layer:

I_SR = f_REC(F_UP),

where f_REC(·) denotes the last Conv layer and I_SR indicates the reconstruction result of MPSR. Finally, the whole SR process is defined as:

I_SR = f_MPSR(I_LR),

where f_MPSR(·) represents the function of MPSR.
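The data flow above can be summarized as a simple function composition. The sketch below is illustrative only; the stage functions passed in (and the name of the final reconstruction stage) are placeholders, not the paper's implementation:

```python
def mpsr_forward(i_lr, f_sf, f_mp, f_up, f_rec):
    """MPSR pipeline: shallow features -> multi-perception -> upscale -> reconstruct."""
    f0 = f_sf(i_lr)       # F_0 = f_SF(I_LR): shallow feature extraction
    f_ml = f_mp(f0)       # F_ML = f_MP(F_0): two-branch multi-perception part
    f_up_ = f_up(f_ml)    # F_UP = f_UP(F_ML): sub-pixel upscaling
    return f_rec(f_up_)   # I_SR: last Conv layer reconstructs the image
```

Each argument would be a learned module in practice; composing them yields the overall mapping f_MPSR.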

Loss Function
MPSR is optimized with a loss function. There are several choices for the loss function, such as L_2 loss, L_1 loss, and perceptual and adversarial losses. L_1 loss is chosen to be minimized, for it has been demonstrated to be more suitable for SR tasks [28]. Considering a given training dataset {I_i^LR, I_i^HR}_{i=1}^n, which contains n HR training samples and their degenerated LR versions, the goal of training MPSR is to optimize the L_1 loss to recover from I_LR an image I_SR = f_MPSR(I_LR) which is as similar as possible to the ground-truth image I_HR:

L(Θ) = (1/n) Σ_{i=1}^{n} || f_MPSR(I_i^LR) − I_i^HR ||_1,

where Θ indicates the weight set of MPSR. More details about training are given in Section 3.1.
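As an illustration, the L_1 objective is simply the mean absolute error between a reconstruction and its ground truth. A minimal NumPy sketch (function names are ours, not the authors'):

```python
import numpy as np

def l1_loss(sr: np.ndarray, hr: np.ndarray) -> float:
    """Mean absolute error between a reconstruction and its ground truth."""
    return float(np.mean(np.abs(sr - hr)))

def batch_l1_loss(sr_batch, hr_batch) -> float:
    """Average the per-image L1 loss over n training pairs, as in the objective."""
    return float(np.mean([l1_loss(s, h) for s, h in zip(sr_batch, hr_batch)]))
```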

Multi-Perception Learning
Multi-perception learning and multi-level information adaptive weighted fusion are achieved by combining ERBs and RCAGs. Hence, details about these two basic modules are given first in the following subsections.

Enhanced Residual Block
Residual designs exhibit excellent performance from low-level tasks (e.g., SR [10][11][12][13][14]16,18,19,[21][22][23][25][26][27]) to high-level tasks (e.g., image classification [9]). Ledig et al. [14] successfully applied the residual block architecture (Figure 2a) [9] to resolve the SR problem without much modification. Some researchers [21,29] further removed the batch normalization (BN) layers from the residual blocks in their network (Figure 2b) and experimentally showed that this simple modification can improve the super-resolving performance. Tong et al. [30] then pointed out that skip connections between Conv layers provide an effective way to jointly employ the low-level information and high-level information to enhance the super-resolving performance.

To fully capture the feature information at different levels, we further optimized the common residual block architecture (Figure 2b) by introducing a short residual connection (Figure 2c). This block structure, named the enhanced residual block (ERB), is the basic constituent unit of the proposed RCAG introduced in Section 2.3.2. As shown in Figure 2c, the later Conv layer in an ERB takes the output of the former Conv layer as input, assuming that filters of the same size (i.e., 3 × 3) are used for these two Conv layers. For the first layer, the receptive field is of size 3 × 3. For the next layer, since it is stacked on the first, the effective receptive field grows to 5 × 5. That is, Conv layers of the same spatial size form relatively different receptive fields. Thus, two perceptual scales can be achieved in each ERB.
In general, a large receptive field means that the Conv layer can collect and analyze more neighbor pixels to predict feature maps which would contain more contextual information. In other words, the output feature maps of the later Conv layer contain more contextual feature priors, which can be exploited to predict high-frequency components, than those of the former Conv layer. Moreover, the two short residual connections within the ERB carry the input and the output of the former Conv layer to the end; that is, information from three different levels serves as the total output of an ERB (e.g., ERB_g,b, the b-th ERB in the g-th RCAG):

F_g,b = f_ERB_g,b(F_g,b−1) = F_g,b−1 + f_1(F_g,b−1) + f_2(f_1(F_g,b−1)),

where f_1(·), f_2(·), and f_ERB_g,b(·) denote the combination of the former Conv layer and ReLU [31], the later Conv layer, and the function of ERB_g,b, respectively. F_g,b−1 and F_g,b are the input and output of ERB_g,b. It should be noted that if the added short residual connection is removed, as in the block structure shown in Figure 2b, the feature information generated by the former Conv layer would be discarded. In brief, the ERB not only achieves two perceptual scales but also fully utilizes the prior information at three different levels by itself. The effectiveness of the ERB over the common residual block (Figure 2b) is shown quantitatively in Section 3.2.
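The receptive-field growth of stacked stride-1 Conv layers follows the standard rule rf ← rf + (k − 1) per layer, which can be checked with a small sketch (an illustrative helper, not part of MPSR):

```python
def receptive_field(kernel_sizes) -> int:
    """Receptive field of stacked stride-1 Conv layers: rf grows by (k - 1) per layer."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Two stacked 3x3 Conv layers, as in one ERB of the upper branch:
# the first layer sees a 3x3 neighborhood, the second effectively sees 5x5.
```

The same rule gives 9 × 9 for two stacked 5 × 5 layers, which is why the lower branch of MPSR perceives larger context.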

Residual Channel Attention Group
Reference [14] demonstrated that stacked residual blocks and one global residual connection can be used to construct a deep network. However, simply stacking residual blocks to build a very deep network would suffer training difficulties (e.g., vanishing gradients) and can hardly achieve performance improvements. Therefore, a residual channel attention group (RCAG) structure is proposed here.
As shown in Figure 3, one RCAG (e.g., RCAG_g, the g-th RCAG in a branch) contains B ERBs. As discussed in Section 2.3.1, the b-th ERB in RCAG_g can be formulated as:

F_g,b = f_ERB_g,b(F_g,b−1),

where F_g,b−1, the input of ERB_g,b, is composed of the responses generated by ERB_g,b−1.

Specifically, each ERB in RCAG_g receives three different levels of image information output by the former ERB and generates information at three other levels as the input of the later ERB, except ERB_g,1.
The multi-level feature information obtained by all stacked ERBs in RCAG_g can be described as:

F_g,B = f_ERB_g,B(f_ERB_g,B−1(· · · f_ERB_g,1(F_g−1) · · ·)),

where F_g−1 represents the input of RCAG_g, i.e., the output of RCAG_g−1.
The channel attention (CA) mechanism [27] generates different attention for each channel-wise feature map it receives. As shown in Figure 4, the input X contains C feature maps of size H × W. A vector z ∈ R^C, a channel-wise statistic of size 1 × 1 × C, can be obtained by performing global average pooling on X. The c-th element of z is determined by:

z_c = f_GP(x_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j),

where x_c(i, j) denotes the pixel value at position (i, j) of the c-th feature map x_c and f_GP(·) represents the global average pooling function. Such a channel-wise statistic, z = [z_1, . . . , z_c, . . . , z_C], can be viewed as a collection of local descriptors whose statistics contribute to the expression of the whole image [24,27].
As studied in References [26,27], a gating mechanism with a sigmoid function is adopted to extract channel-wise dependencies from the information aggregated by the global average pooling function:

s = f_S(W_U · f_R(W_D · z)),

where f_S(·) and f_R(·) indicate the sigmoid function and ReLU, respectively. W_D denotes the weights of a Conv layer, which serves as a channel-downscaler with a reduction ratio r [27]. After channel-downscaling and activation by f_R(·), the low-dimension vector z_1 = f_R(W_D · z) of size 1 × 1 × C/r is later upscaled with factor r by a channel-upscaling Conv layer, whose parameter set is W_U. Finally, the statistic s = [s_1, . . . , s_c, . . . , s_C] output by the sigmoid gating f_S(·) is employed to perform channel-wise rescaling of the input X:

x̂_c = s_c · x_c,

where x̂_c is the rescaled c-th feature map. In this case, the multi-level information obtained by all stacked ERBs in an RCAG can be adaptively rescaled with the CA mechanism by considering constituent differences among channels. The function of CA is denoted as f_CA(·), and we further have:

X̂ = f_CA(F_g,B).

Then, the total output of RCAG_g is formulated as:

F_g = F_g−1 + f_CA(F_g,B).

As discussed above, an RCAG achieves multi-level information extraction and adaptive weighted fusion by combining B stacked ERBs and a CA module. Moreover, with this modular design, the network depth can be easily controlled by modifying the number of blocks or groups. The quantitative comparison between the performance of the RCAG and a simple residual group (a residual group composed of stacked residual blocks, without CA) is provided in Section 3.2.
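A minimal NumPy sketch of this squeeze-and-excitation style channel attention, with plain matrices standing in for the 1 × 1 Conv layers W_D and W_U (an illustration of the mechanism, not the paper's code):

```python
import numpy as np

def channel_attention(x: np.ndarray, w_d: np.ndarray, w_u: np.ndarray) -> np.ndarray:
    """Channel attention on x of shape (C, H, W).

    w_d has shape (C//r, C) and w_u has shape (C, C//r); they stand in for the
    channel-downscaling and channel-upscaling 1x1 Conv layers W_D and W_U.
    """
    c = x.shape[0]
    z = x.mean(axis=(1, 2))                # global average pooling -> z of shape (C,)
    z1 = np.maximum(w_d @ z, 0.0)          # channel-downscaling + ReLU -> (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w_u @ z1)))  # channel-upscaling + sigmoid gating -> (C,)
    return x * s.reshape(c, 1, 1)          # channel-wise rescaling of the input
```

With zero weights the gate outputs sigmoid(0) = 0.5 for every channel, i.e., each feature map is uniformly halved; learned weights instead emphasize the more informative channels.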

Multi-Perception Learning Overview
Reviewing Figure 1, the proposed multi-perception deep feature extraction part has two branches. Each branch has G RCAGs and one RCAG further contains B ERBs.
As analyzed in the previous two subsections, the B stacked ERBs in one RCAG have B × 2 different perceptual scales in all. RCAGs in one branch have different receptive fields from each other due to the depth at which they are located. Furthermore, the kernel size of all filters located in the upper branch is set to 3 × 3, while all filters located in the lower branch are of size 5 × 5. Thus, every Conv layer in the two branches has its own scale-specific receptive field. Further enhancement of the perception capacity could be achieved by adding a branch with a larger kernel size or by increasing the network depth. However, with a 7 × 7 kernel, the parameter number of one ERB is 5.44 times and 1.96 times larger than that of an ERB with a 3 × 3 and a 5 × 5 kernel, respectively [32]. Adding a branch with a larger convolution kernel size would introduce a great number of additional parameters, and overfitting could then arise [11]. Hence, the network ability is instead improved by adding modules, as shown in Section 3.2. As a result, the whole two-branch multi-perception part achieves a diverse set of perceptual scales that sums to 2 × G × (B × 2). The final multi-level prior information F_ML learned by MPSR is expressed as:

F_ML = f_branch1(F_0) + f_branch2(F_0) = F_G^1 + F_G^2,

where f_branch1(·) and f_branch2(·) are functions of the upper branch and the lower branch, respectively. F_G^1 and F_G^2 represent the output information of the G-th RCAG in the upper branch and the lower branch, correspondingly. With this multi-perception design, the proposed network can weight feature representations from diverse receptive fields with different attention when reconstructing an image.
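The quoted parameter ratios can be verified by counting the weights of the two Conv layers in an ERB; for a fixed channel width, the ratio reduces to the ratio of squared kernel sizes (a back-of-the-envelope sketch, bias terms omitted):

```python
def erb_conv_params(kernel: int, channels: int = 64) -> int:
    """Weight count of the two Conv layers in one ERB (bias terms omitted)."""
    return 2 * kernel * kernel * channels * channels

# Ratios quoted above: an ERB with 7x7 kernels vs. 3x3 and 5x5 kernels.
ratio_vs_3 = erb_conv_params(7) / erb_conv_params(3)  # 49/9  ~ 5.44
ratio_vs_5 = erb_conv_params(7) / erb_conv_params(5)  # 49/25 = 1.96
```

Because the channel terms cancel, the ratios 49/9 ≈ 5.44 and 49/25 = 1.96 hold for any channel width.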

Transfer Training Strategy
Currently, there is no standard training set for image SR reconstruction in the remote sensing community. As a matter of fact, it is difficult to collect a large number of remote sensing images with clear edges and textures suitable for training a SR model. However, the performance of deep learning-based SR methods always benefits from a sufficient volume of good-quality HR and LR training sample pairs. Thus, a transfer training strategy to deal with the insufficiency of training samples is introduced here. The core of transfer learning is assuming that individual models for related tasks share parameters or prior distributions of hyperparameters [33], which means solving tasks in one domain based on the shared knowledge obtained from other related domains.
Hence, the proposed MPSR is pre-trained with the natural image set DIV2K [34] as an external knowledge set when conducting experiments. Generally speaking, the low-level feature information learned from DIV2K (e.g., point-like components, local texture and color, and point-line distribution) can be shared. In order to learn high-level feature information specific to remote sensing data, the pre-trained network is re-trained by using images randomly selected from UC MERCED [35] (a remote sensing scenes classification dataset). This training strategy further boosts the model performance on super-resolving remote sensing images. Relevant experimental results are provided in Section 3.2.

Experiment Settings
In this section, the experiment settings on datasets, degradation model, training, and evaluation metrics are clarified.
Datasets: 800 training samples from the DIV2K dataset [34] are used as the pre-training set, and 800 images are selected randomly from the UC MERCED [35] for transfer training. For testing, 120 images from the UC MERCED are chosen at random, which are different from transfer training samples, to form a test set named UCtest. To further demonstrate the effectiveness of the proposed model, it is compared with the state-of-the-art algorithms on publicly available benchmark natural datasets, including Set5 [36], Set14 [37], BSD100 [38], and Urban100 [39]. The representative images from these datasets are shown in Figure 5.

- DIV2K [34] contains 800 natural images for training. The image resolution is around 2K.
- UC MERCED [35] contains 2100 images of size 256 × 256 pixels. The pixel resolution is 0.3 m.
- Set5 [36] is a classical dataset which consists of only 5 test images.
- Set14 [37] has 14 test images covering more categories than Set5.
- BSD100 [38] has 100 rich and delicate images ranging from natural to object-specific.
- Urban100 [39] is a relatively more recent dataset composed of 100 images, focusing on urban scenes.

Degradation model: Experiments are conducted with the bicubic interpolation degradation model and three down-sampling scales (×2, ×3, ×4) [17]. Specifically, a LR version is generated from its corresponding HR counterpart by bicubic interpolation with a specific downscaling factor. For example, a three-fold down-sampled LR image can be generated from its corresponding HR counterpart by bicubic interpolation with a factor of 1/3.

Training: Data augmentation is performed on both the 800 pre-training images and the 800 transfer training images, including rotations of 90°, 180°, and 270°, and horizontal flipping [27]. In each training batch, 16 LR input patches of size 48 × 48 and the corresponding HR patches are used. The proposed model is trained with the Adam optimizer [40] by setting β_1 = 0.9, β_2 = 0.999, and ε = 10^−8 [27]. The learning rate is initially set to 10^−4 and is halved every 2 × 10^5 batches [26].
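The stated learning-rate schedule can be written as a small helper (an illustration of the schedule described above, not the authors' training script):

```python
def learning_rate(batch_idx: int, base_lr: float = 1e-4,
                  halve_every: int = 200_000) -> float:
    """Step-decay schedule: initial lr 1e-4, halved every 2x10^5 training batches."""
    return base_lr * (0.5 ** (batch_idx // halve_every))
```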
All models go through 500 epochs of pre-training, and 100 epochs of transfer training. The ×2 network is trained from scratch. After it is converged, it is used as a pre-trained model for factors ×3 and ×4 [29]. As shown in Figure 6, this pre-training strategy stabilizes the training process and further improves the network performance.
The proposed models are implemented with PyTorch [41] and 4 NVIDIA GTX 1080Ti GPUs. Evaluation metrics: Experimental results are quantitatively evaluated with peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [42] on the Y channel (i.e., luminance) in transformed YCbCr space. This is due to human vision being more sensitive to details in intensity space than in color [10]. Higher PSNR and SSIM values represent better reconstruction quality.
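For reference, PSNR on a single (e.g., Y) channel can be computed as follows (the standard definition, not code from the paper):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)
```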

Model Design and Performance
The effectiveness of using ERB, RCAG, and the transfer training strategy, as well as the relations between SR performance and factors, such as the number of ERBs and RCAGs, are studied in this section. Additionally, the configuration details of the final model are specified.
ERB and RCAG: To demonstrate the effects of these two proposed structures, the networks are set with B = 6 (the number of ERBs) and G = 3 (the number of RCAGs). In Table 1, the first row represents MPSR composed of a common RB (residual block with only one short residual connection, as shown in Figure 2b) and RG (residual group composed of stacked residual blocks, without CA), the PSNR value it gained is relatively low (39.510 dB). After adding ERB, the performance reached 39.540 dB, as shown in the second row. After adding RCAG, a similar trend is observed-the performance improved from 39.540 dB to 39.604 dB. These findings firmly demonstrate the effectiveness of widely extracting and reasonably leveraging multi-level prior information by introducing the proposed ERB and RCAG.

Transfer training: MPSR with 6 ERBs and 3 RCAGs is used to verify the significance of adopting the transfer learning strategy mentioned in Section 2.3. As shown in Table 1, after transfer training, the gain of MPSR-transferred (row 4) over MPSR-notransfer (row 3) reaches 0.124 dB. This improvement shows that the transferred MPSR achieves better reconstruction performance.
ERB number and RCAG number: B (the ERB number) and G (the RCAG number) of MPSR-notransfer are modified progressively to obtain the most suitable values of B and G. First, with G fixed at three, B is varied; based on the results shown in Table 2, B is set to eight to get a reasonable trade-off between reconstruction performance and speed. The results of varying G are shown in Table 3. Generally, the reconstruction performance would further improve if the network depth kept increasing, i.e., by adding more ERBs and RCAGs, at the cost of training time. In fact, not only running time but also GPU memory usage is sacrificed, due to the huge number of calculations and parameters. In the end, a trade-off is made between performance and speed for the model: B = 8 and G = 5.
Final model configuration: With regard to the final model, G is set to five in each branch and B is set to eight in each RCAG. The kernel sizes of the channel-downscaling Conv layer and the channel-upscaling Conv layer in the CA module are 1 × 1. The kernel size of the Conv layers in the lower branch is 5 × 5 (as described in Section 2.3.3), and the kernel sizes of all the rest of the Conv layers in the network are 3 × 3. For Conv layers with filters of 3 × 3 and 5 × 5, the zero-padding strategy [10] is used to keep the sizes of all feature maps the same. Furthermore, all Conv layers in the shallow feature extraction part and the multi-perception deep feature extraction part have 64 filters (C = 64), except for the channel-downscaling layers. The filter number of the channel-downscaling Conv layers, C/r, is set to four, which indicates that the reduction ratio r mentioned in Section 2.3.2 is 16. The setting of this value is similar to that in References [24,27]. As for the reconstruction part, the sub-pixel Conv layer [8] is used as the upscaler, and the last Conv layer in the network has 3 filters in order to output color images. In the following experiments, the final model without transfer training is named MPSR, and the transferred one MPSR-T.
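The channel bookkeeping of the CA module follows directly from these settings (a trivial sketch of the stated configuration):

```python
# Channel configuration of the final model, as stated above.
C = 64               # filters in the feature-extraction Conv layers
r = 16               # reduction ratio of the CA module
downscaled = C // r  # filters in each channel-downscaling 1x1 Conv layer
```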

Comparisons to State-of-the-Art Methods
In this section, the quantitative and qualitative results of the final model in comparison to recent state-of-the-art models, on the remote sensing dataset [35], benchmark natural image sets [36][37][38][39], and data from GaoFen-1 satellite and GaoFen-2 satellite, are provided.
Evaluation on UCtest: MPSR and MPSR-T are adopted to super-resolve images from UCtest. As described in Section 3.1, UCtest is composed of 120 images randomly selected from the UC Merced land use dataset [35], which do not overlap with the transfer training samples. The reconstruction results are compared with four recent state-of-the-art methods: IDN [16], SRMD for noise-free degradation (SRMDNF) [17], CARN [18], and MSRN [23], all published at top computer vision conferences (CVPR 2018 and ECCV 2018). Note that MPSR is only pre-trained on the DIV2K dataset [34], a widely used SR dataset in the computer vision community, and transfer training is not performed on the state-of-the-art models either, so the comparison is fair.
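For reference, the LR test inputs in such comparisons are conventionally produced by bicubic down-sampling of the HR images. A minimal sketch using Pillow follows; the function name and the crop-to-multiple convention are illustrative assumptions, not the paper's exact pipeline.

```python
from PIL import Image

def make_lr_pair(hr_source, scale):
    """Produce an (HR, LR) test pair via bicubic degradation.

    hr_source: path or file object of the HR image;
    scale: integer up-sampling factor (2, 3, or 4 in the experiments).
    """
    hr = Image.open(hr_source).convert("RGB")
    # Crop so the HR size is an exact multiple of the scaling factor.
    w, h = hr.size
    w, h = w - w % scale, h - h % scale
    hr = hr.crop((0, 0, w, h))
    # Bicubic down-sampling yields the LR network input.
    lr = hr.resize((w // scale, h // scale), Image.BICUBIC)
    return hr, lr
```

The super-resolved output of the network is then compared against the cropped HR reference.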
As shown in Table 4, MPSR-T and MPSR yield the best and the second-best scores in all experiments, respectively. The PSNR gains obtained by MPSR-T over the third-best approach are 0.23 dB, 0.46 dB, and 0.38 dB on the three up-sampling factors. Figure 7, Figure 8, and Figure 9 show the SR results of different approaches with upscaling factors ×2, ×3, and ×4. For better comparison, some regions of the original HR images and the corresponding SR reconstruction results are displayed at an enlarged scale. For example, in Figure 7, the area within the red box of the original image dense residential 13 is zoomed in by factor ×2 and named 'HR'. The patch named 'Bicubic' in the first row, which is part of the image reconstructed by bicubic interpolation, covers the same area as patch 'HR'. It can be seen that the left edge of the greenbelt in patch 'MPSR-T' is sharper than in the other reconstructed results (e.g., patch 'MSRN'). Figure 8 shows a similar trend. As for Figure 9, the lines in some patches produced by the compared methods are blurred, while the proposed models yield superior results.

Benchmark results: To further validate the effectiveness of the proposed network, MPSR (without transfer training) is compared with 13 state-of-the-art algorithms, including SRCNN [7], VDSR [10], DRCN [11], DRRN [12], MemNet [13], LapSRN [15], IDN [16], SRMDNF [17], CARN [18], MSRN [23], SelNet [25], SRRAM [26], and SRDenseNet [30], on publicly available benchmark datasets [36][37][38][39].
In Figure 10, only two SR results with reconstruction factors ×3 and ×4 on the Urban100 dataset are provided. Different from Figure 7, Figure 8, and Figure 9, the exhibited patches are parts of the original super-resolved images without zooming in. As can be seen in img_062 and img_004, the five state-of-the-art methods compared [10,15,16,18,23] cannot clearly reconstruct the lattices and generate blurring artifacts [27]. In contrast, MPSR overcomes the blurring artifacts better, recovers image details with high fidelity, and shows a significant improvement. In a more comprehensive comparison, quantitative evaluations for reconstruction factors ×2, ×3, and ×4 on the Set5 [36], Set14 [37], BSD100 [38], and Urban100 [39] datasets are provided in Table 5. The results of the state-of-the-art methods involved are cited from their papers. It is worth pointing out that MPSR performs the best on all the benchmark natural image sets at all scaling factors. In other words, the proposed model is also a competitive candidate for super-resolving other kinds of images, not just remote sensing images.

Validation using GaoFen-1 and GaoFen-2 data: To verify the robustness of MPSR-T, experiments are performed on multispectral remote sensing data from the GaoFen-1 satellite (medium resolution, 8 m per pixel) and the GaoFen-2 satellite (relatively high resolution, 4 m per pixel).
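The PSNR criterion used in Tables 4 and 5 can be computed as in the generic sketch below. The optional border-shaving step is an assumption: conventions for how many boundary pixels to trim (and whether to evaluate on the Y channel) vary between SR papers.

```python
import numpy as np

def psnr(sr, hr, shave=0):
    """Peak signal-to-noise ratio between a super-resolved image and
    its HR reference, both uint8 arrays of identical shape.

    shave: number of border pixels to trim before comparison, a common
    convention when evaluating at scale factor s.
    """
    sr = sr.astype(np.float64)
    hr = hr.astype(np.float64)
    if shave:
        sr = sr[shave:-shave, shave:-shave]
        hr = hr[shave:-shave, shave:-shave]
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    # 255 is the peak value for 8-bit images.
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Higher values indicate a reconstruction closer to the reference; the dB gains quoted above are differences of this quantity averaged over a test set.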
Band 1, band 2, and band 3 are stacked into true color images before conducting the experiments, and the original GaoFen-1 and GaoFen-2 data are taken directly as LR input. Some of the test results are provided in Figures 11 and 12. In Figure 11, patch 'mountain ×2' represents an enlarged version of the original mountain area, and 'mountain SR ×2' is the corresponding result of MPSR-T. An objective evaluation with PSNR and SSIM is impossible for these reconstructed images because the real HR image is unknown. However, MPSR-T shows impressive performance when coping with remote sensing images with highly complex spatial distribution and ground objects of varied scales. For both small-scale ground features (e.g., the slight textures or edges of the playground, highway, airstrip, and mountain) and large-scale ground objects (e.g., terminal, factory, dense building, small town), satisfactory super-resolved results are obtained and the apparent spatial resolution is significantly improved, which proves that the proposed multi-perception network achieves promising SR capacity. Figure 11. ×2, ×3, and ×4 SR results of the GaoFen-1 data.
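The band-stacking pre-processing step can be sketched as follows. The band-to-RGB assignment and the per-band linear stretch to 8 bits are illustrative assumptions; the paper does not specify the exact radiometric conversion.

```python
import numpy as np

def stack_true_color(band1, band2, band3):
    """Stack three multispectral bands into an 8-bit RGB image.

    Each band is a 2-D array of raw digital numbers. Here band3/band2/
    band1 are mapped to R/G/B (an assumed ordering), and each band is
    linearly stretched to [0, 255] independently.
    """
    def stretch(b):
        b = b.astype(np.float64)
        lo, hi = b.min(), b.max()
        if hi == lo:                      # constant band: avoid div by zero
            return np.zeros_like(b)
        return (b - lo) / (hi - lo) * 255.0

    rgb = np.stack([stretch(band3), stretch(band2), stretch(band1)], axis=-1)
    return rgb.round().astype(np.uint8)
```

The resulting (H, W, 3) uint8 image is then fed to the network as the LR input.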

Discussion
The methods proposed in this paper are proven to have convincing performance through extensive experimental results. In Section 3.2, the gains after adding ERBs and RCAGs clearly demonstrate the effectiveness of the multiple perceptual scales within the design and the rationality of treating information from different levels with unequal attention. A reasonable network structure was then obtained by progressively modifying the number of ERBs and RCAGs, and further improved with a transfer training strategy. To explore the SR capacity of the models, tests were conducted over public remote sensing data and benchmark natural image sets in Section 3.3. The results show that the models achieved highly competitive performance in comparison to state-of-the-art SR methods and obtained satisfactory super-resolved results even when dealing with the complex and varied remote sensing images from the GaoFen-1 and GaoFen-2 satellites. From the slight lines on the playground to the indistinct but dense buildings (Figure 11), all the SR results once again demonstrate the excellent image processing capability of the multi-perception learning-based network.
However, some problems were found through this research. In general, a CNN-based method can benefit from increasing the network depth, yet worse test results were obtained when going deeper by adding ERBs (e.g., B = 9 and B = 10, see Table 2), and something similar happened when G = 4 (Table 3). This phenomenon could be related to the input images. Compared with natural images, the input images from the UC Merced dataset [35] lack high-frequency components, even though they have a spatial resolution of 0.3 m per pixel. Moreover, after the degradation operation performed before testing, the image quality becomes even worse. Such low-quality inputs may lead to vanishing gradients during the SR process and are unsuitable for a very deep network to learn from or extract information. Therefore, making a good trade-off between super-resolving performance and the network setting according to the practical situation is of great importance.
In addition, an objective evaluation of the super-resolved GaoFen-1 and GaoFen-2 data could not be performed, since the real HR image is unknown. How can a more reasonable and relatively objective evaluation be performed in such cases without a standard reference? This remains an open issue. Besides, existing CNN-based SR works mostly use a bicubic down-sampler to generate LR images. In fact, learning multiple degradations [17] or exploring real-world degradation [43] helps to train super-resolving models, since the true degradation does not always follow the bicubic interpolation-based assumption. Furthermore, building a high-quality dataset dedicated to remote sensing SR research is also a core issue to be solved.

Conclusions
In this paper, a novel multi-perception attention network (MPSR) was presented that fully considers the complex spatial distribution of remote sensing data and the diverse spatial scales of ground objects. By incorporating the enhanced residual blocks (ERBs) and residual channel attention groups (RCAGs), MPSR achieves multi-perceptual scale learning and multi-level information adaptive weighted fusion. In addition, a pre-train and transfer strategy was adopted to further improve the SR ability of the network on remote sensing images. Extensive experimental results over remote sensing data and benchmark natural image sets demonstrated that the proposed MPSR achieves superior performance compared to state-of-the-art methods. It is worth mentioning that MPSR is also a competitive candidate for super-resolving other kinds of images.

Patents
The patent 201911140450.X results from the work reported in this manuscript.