End-to-End Super-Resolution for Remote-Sensing Images Using an Improved Multi-Scale Residual Network

Remote-sensing images constitute an important means of obtaining geographic information. Image super-resolution reconstruction techniques are effective methods of improving the spatial resolution of remote-sensing images. Super-resolution reconstruction networks mainly improve the model performance by increasing the network depth. However, blindly increasing the network depth can easily lead to gradient disappearance or gradient explosion, increasing the difficulty of training. This report proposes a new pyramidal multi-scale residual network (PMSRN) that uses hierarchical residual-like connections and dilation convolution to form a multi-scale dilation residual block (MSDRB). The MSDRB enhances the ability to detect context information and fuses hierarchical features through the hierarchical feature fusion structure. Finally, a complementary block of global and local features is added to the reconstruction structure to alleviate the problem that useful original information is ignored. The experimental results showed that, compared with a basic multi-scale residual network, the PMSRN increased the peak signal-to-noise ratio by up to 0.44 dB and the structural similarity to 0.9776.


Introduction
Image resolution indicates the amount of information contained in an image [1]. A high-resolution (HR) image has a higher pixel density, higher definition characteristics, and more detailed texture information than a low-resolution (LR) image. Whereas image resolution is the number of pixels in an image, spatial resolution indicates the minimum size of ground targets whose details can be distinguished and is used in the field of remote sensing. In remote-sensing images, high spatial resolution enables changes in surface details to be observed more clearly at a smaller spatial scale [2]. In actual scenes, some remote-sensing satellites only provide low-spatial-resolution remote-sensing images that do not meet actual usage requirements. Single-image super-resolution (SISR) reconstruction techniques use software methods to improve the spatial resolution of remote-sensing images without changing the imaging system, which makes the use of these images advantageous [3].
The popular SISR reconstruction techniques are mainly based on conventional algorithms and learning-based algorithms. Conventional algorithms are divided into interpolation-based and sparse-representation-based methods (e.g., the bicubic method). Existing SR reconstruction networks nonetheless suffer from two main shortcomings:

1. Difficulty reproducing network models: most SR reconstruction models require operators to master sophisticated training methods; meanwhile, some SR reconstruction models have many network layers, which require high-end hardware. These characteristics make such network models difficult to reproduce.

2. Inadequate feature utilisation: blindly increasing the number of network layers aggravates image feature forgetting; moreover, using only a single up-sampling operation to increase the number of pixels in the final reconstruction stage causes some of the LR image information to be lost.
In view of the above shortcomings, this report presents a novel multi-scale dilation residual block (MSDRB) and a new complementary block (CB) for reconstruction and proposes a new pyramidal multi-scale residual network (PMSRN). Firstly, a combination of dilated convolutions with multiple dilation rates is used to enlarge the receptive field and reduce the difficulty of training. Simultaneously, to integrate image features of different scales more effectively, hierarchical residual-like connections (namely, Res2Net [18]) are introduced into the MSDRBs to achieve a more granular multi-scale feature representation. On this basis, to mitigate the forgetting and underutilisation of network features as much as possible, the output of each MSDRB layer is used as the input of the hierarchical feature fusion structure (HFFS). Finally, the CB module designed into the reconstruction process can fully utilise the useful information in the original LR image.
The contributions of this study are as follows:

1. A new MSDRB is proposed. This module expresses multi-scale features with finer granularity, increases the receptive field of each network layer, and enhances the ability to detect image features adaptively.

2. To fuse the shallow and deep features, a new reconstruction CB is proposed. This module can fully utilise the useful information in the original LR image, prevent network instability, and improve the network robustness and image reconstruction effect.

3. The proposed PMSRN is easier to train than other networks, since its number of parameters is only 43.33% of that of EDSR, and its modules are independent and easy to migrate to other networks for learning.
The remainder of this paper is organised as follows. Section 2 introduces the MSDRB and reconstruction part of the CB module and describes the relevant theoretical analysis. Section 3 presents the experimental results and analyses the effectiveness of the algorithm. Section 4 discusses the practical application effects of PMSRN in different scenarios. Finally, Section 5 outlines the conclusions.

Materials and Methods
The proposed method is suitable for the SR reconstruction of single LR images. To ensure that the proposed method is universally applicable to images from different sensors, a red-green-blue (RGB) colour model is used to convert all the bands in the image. Because these bands have the same spatial resolution, each low-spatial-resolution image can obtain the corresponding three-channel image to be reconstructed and use it as the network model input.

Network Architecture
The objective of SR reconstruction is to reconstruct a high-definition SR image I_SR ∈ R^(Wr×Hr×C) from an LR image I_LR ∈ R^(W×H×C) by learning the mapping between LR and HR. The number of RGB colour channels C is 3; the LR version of the HR image I_HR ∈ R^(Wr×Hr×C) is I_LR; W and H, respectively, represent the width and height of the LR image; and r represents the up-sampling factor during SR reconstruction. After the network is trained, the weight coefficient set θ̂ is obtained as:

θ̂ = argmin_θ (1/N) Σ_{i=1}^{N} L_SR(F_θ(I_LR^i), I_HR^i)    (1)

Here, according to the MSRN [15], i represents the ith image in a training set of N images, and L_SR represents the loss function used in an SR reconstruction network. The gradient-descent method is used to minimise L_SR to obtain the mapping function F_θ of the optimal model. Several researchers have begun to study the loss function L_SR to improve network performance through innovative designs of L_SR; however, the resulting performance improvement is not evident [15]. In order to avoid introducing unnecessary training methods and to reduce computation, we finally chose the L1 function. Therefore, the loss function L_SR can be defined as:

L_SR = (1/N) Σ_{i=1}^{N} ||F_θ(I_LR^i) − I_HR^i||_1    (2)

The PMSRN is an improved version of the MSRN. This architecture reconstructs higher-resolution images from single LR inputs, mainly through feature extraction and image reconstruction. Figure 1 shows the overall architecture. In the training process, firstly, the RGB colour model is used to convert all the bands contained in the public image to obtain the HR image. Secondly, the LR image obtained from the HR image by bicubic down-sampling is used as the PMSRN input. Thirdly, the PMSRN uses multiple MSDRBs to learn the feature mapping relationship between the LR and HR. The global and local feature information are subsequently combined through the HFFS. Finally, the CB reconstructs the SR image.
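As a concrete illustration, the L1 training objective and gradient-descent step described above can be sketched in PyTorch. The tiny up-sampling model below is a placeholder assumption standing in for the full PMSRN, and the random tensors stand in for LR/HR training pairs:

```python
import torch
import torch.nn as nn

# Placeholder model (an assumption for brevity): a minimal r = 2
# up-sampler standing in for the full PMSRN F_theta.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3 * 2 * 2, 3, padding=1),
    nn.PixelShuffle(2))                  # rearranges channels into 2x spatial size

l1 = nn.L1Loss()                         # the chosen loss L_SR
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

lr_batch = torch.randn(4, 3, 32, 32)     # stand-in I_LR patches
hr_batch = torch.randn(4, 3, 64, 64)     # stand-in I_HR patches

# One gradient-descent step minimising L_SR(F_theta(I_LR), I_HR)
opt.zero_grad()
loss = l1(model(lr_batch), hr_batch)
loss.backward()
opt.step()
```

In practice the LR batch would come from bicubic down-sampling of HR patches, as described in the training pipeline above.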
Our approach differs from the original MSRN [15] in two main aspects:

• In the feature extraction part, the MSDRB replaces the multi-scale residual block.

• In the reconstruction part, a CB module is added.

Multi-Scale Dilation Residual Block (MSDRB)
Firstly, in order to provide the network with stronger multi-scale feature extraction capabilities, an MSDRB (Figure 2) was designed in the PMSRN. The MSDRB consists of three parts: multi-scale feature fusion, multilevel residual learning, and a multi-dilation-rate dilated convolution group.

Figure 2. MSDRB structure, with four layers divided into three branches. Branch 1 contains the hierarchical residual-like connection (Res2Net) residual block, the activation function (rectified linear unit, ReLU), and a concatenation operation. Branches 2 and 3 contain dilated convolutions with dilation rates d of 2 and 3, respectively, followed by ReLU and a concatenation operation. The branch outputs S_3, P_3, and Q_3 are concatenated and passed through a 1 × 1 convolution to output S. Finally, S is added element-wise to B_{n−1}.
Multi-Scale Feature Fusion: the multi-scale nature of an image is similar to that of the human eye observing an object. When the distance from the object differs, the perceived characteristics differ; that is, for the same object in the field of view, the image size and scale are different, so the features are also different [19]. Multi-scale information is therefore crucial to computer vision algorithms.
In the first-layer network structure, the input B_{n−1} passes through the Res2Net residual block of branch 1, is activated by the ReLU activation function (represented by σ), and outputs S_1; B_{n−1} passes through the dilated convolution of branch 2 (dilation rate d = 2), is activated by ReLU, and outputs P_1; and B_{n−1} passes through the dilated convolution of branch 3 (dilation rate d = 3), is activated by ReLU, and outputs Q_1. The first-layer outputs S_1, P_1, and Q_1 can be expressed as follows:

S_1 = σ(w^1_Res2Net B_{n−1} + b^1)    (3)
P_1 = σ(w^1_{3×3,d=2} B_{n−1} + b^1)    (4)
Q_1 = σ(w^1_{3×3,d=3} B_{n−1} + b^1)    (5)

In the second-layer network structure, the inputs S_1, P_1, and Q_1 are concatenated into [S_1, P_1, Q_1]. [S_1, P_1, Q_1] passes through the Res2Net residual block of branch 1, is activated by ReLU, and outputs S_2; it passes through the dilated convolution of branch 2 (d = 2), is activated by ReLU, and outputs P_2; and it passes through the dilated convolution of branch 3 (d = 3), is activated by ReLU, and outputs Q_2. The second-layer outputs S_2, P_2, and Q_2 can be expressed as follows:

S_2 = σ(w^2_Res2Net [S_1, P_1, Q_1] + b^2)    (6)
P_2 = σ(w^2_{3×3,d=2} [S_1, P_1, Q_1] + b^2)    (7)
Q_2 = σ(w^2_{3×3,d=3} [S_1, P_1, Q_1] + b^2)    (8)

In the third-layer network structure, the inputs S_2 and P_2 of branch 1 are concatenated into [S_2, P_2], which passes through the Res2Net residual block, is activated by ReLU, and outputs S_3. The inputs S_2 and Q_2 of branch 2 are concatenated into [S_2, Q_2], which passes through the dilated convolution (d = 2), is activated by ReLU, and outputs P_3. The inputs Q_2 and P_2 of branch 3 are concatenated into [Q_2, P_2], which passes through the dilated convolution (d = 3), is activated by ReLU, and outputs Q_3. The third-layer outputs S_3, P_3, and Q_3 can be expressed as follows:

S_3 = σ(w^3_Res2Net [S_2, P_2] + b^3)    (9)
P_3 = σ(w^3_{3×3,d=2} [S_2, Q_2] + b^3)    (10)
Q_3 = σ(w^3_{3×3,d=3} [Q_2, P_2] + b^3)    (11)

In the fourth-layer network structure, the inputs S_3, P_3, and Q_3 are concatenated into [S_3, P_3, Q_3], which is filtered by a 1 × 1 standard convolution kernel to output S. The fourth-layer output S can be expressed as:

S = w^4_{1×1} [S_3, P_3, Q_3] + b^4    (12)

In (3)-(12), w and b represent the weight and bias, respectively; the superscripts represent the layer numbers; the subscripts 1 × 1 and 3 × 3 represent the sizes of the convolution kernels; the subscript Res2Net denotes the hierarchical residual-like connections; and the subscript d denotes dilated convolution with dilation rate d.
Let us assume that the number of channels of the MSDRB input B_{n−1} is n_feats; then, the numbers of output channels of the internal first, second, and third layers are n_feats, 3 × n_feats, and 6 × n_feats, respectively. In the fourth layer, the number of channels after feature map concatenation is 18 × n_feats, and the number of feature map channels is reduced to n_feats again with a 1 × 1 convolution kernel.
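To make the branch and channel bookkeeping above concrete, the following PyTorch sketch wires up the four layers of one MSDRB. It is illustrative only: branch 1's Res2Net block is replaced here by a plain 3 × 3 convolution for brevity (an assumption), and spatial padding is chosen so that all branches preserve the feature-map size:

```python
import torch
import torch.nn as nn

class MSDRBSketch(nn.Module):
    """Illustrative sketch of the MSDRB channel flow (Figure 2)."""
    def __init__(self, n_feats=64):
        super().__init__()
        def branch(cin, cout, d):
            # padding = d keeps H x W unchanged for a 3x3 kernel with dilation d
            return nn.Conv2d(cin, cout, 3, padding=d, dilation=d)
        # layer 1: n_feats -> n_feats per branch (branch 1 stands in for Res2Net)
        self.s1 = branch(n_feats, n_feats, 1)
        self.p1 = branch(n_feats, n_feats, 2)
        self.q1 = branch(n_feats, n_feats, 3)
        # layer 2: 3*n_feats -> 3*n_feats per branch
        self.s2 = branch(3 * n_feats, 3 * n_feats, 1)
        self.p2 = branch(3 * n_feats, 3 * n_feats, 2)
        self.q2 = branch(3 * n_feats, 3 * n_feats, 3)
        # layer 3: 6*n_feats -> 6*n_feats per branch (pairwise concatenations)
        self.s3 = branch(6 * n_feats, 6 * n_feats, 1)
        self.p3 = branch(6 * n_feats, 6 * n_feats, 2)
        self.q3 = branch(6 * n_feats, 6 * n_feats, 3)
        # layer 4: 18*n_feats -> n_feats via 1x1 convolution
        self.fuse = nn.Conv2d(18 * n_feats, n_feats, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, b):
        s1, p1, q1 = self.act(self.s1(b)), self.act(self.p1(b)), self.act(self.q1(b))
        cat1 = torch.cat([s1, p1, q1], dim=1)
        s2, p2, q2 = self.act(self.s2(cat1)), self.act(self.p2(cat1)), self.act(self.q2(cat1))
        s3 = self.act(self.s3(torch.cat([s2, p2], dim=1)))
        p3 = self.act(self.p3(torch.cat([s2, q2], dim=1)))
        q3 = self.act(self.q3(torch.cat([q2, p2], dim=1)))
        s = self.fuse(torch.cat([s3, p3, q3], dim=1))
        return s + b  # outer residual connection: B_n = S + B_{n-1}

x = torch.randn(1, 64, 32, 32)
y = MSDRBSketch(64)(x)
assert y.shape == x.shape  # input and output widths match, so blocks can be stacked
```

Because the output width equals the input width, MSDRBs can be chained and each intermediate output fed to the HFFS, as the architecture describes.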
Multilevel Residual Learning: inside a single MSDRB, Res2Net represents multi-scale features at a more granular level and increases the receptive field range of each network layer. Specifically, the module replaces a single 3 × 3 convolution kernel with a group of convolution kernels. Simultaneously, the convolution kernels are connected in the form of hierarchical residuals (Figure 3b). The internal operation of Res2Net can be defined as:

Y_1 = X_1    (13)
Y_2 = K_2(X_2)    (14)
Y_3 = K_3(X_3 + Y_2)    (15)
Y_4 = K_4(X_4 + Y_3)    (16)

The input image features are filtered by a 1 × 1 standard convolution operation and split into four pieces of feature information, namely, X_1, X_2, X_3, and X_4; K_i denotes a 3 × 3 convolution. In Res2Net, outputs with different receptive fields are obtained: Y_2, Y_3, and Y_4 obtain the receptive fields of standard 3 × 3, 5 × 5, and 7 × 7 convolutions, respectively. Finally, the four outputs are fused, and the number of output channels is reduced to the number of input channels by a 1 × 1 convolution operation. This split-and-fuse strategy enables the convolutions to process features more efficiently. Note that the MSRN already has stronger feature extraction capabilities [15] than the ResNet, dense residual network [20], and inception [21]. To further illustrate the necessity of introducing the Res2Net module, we conducted an experimental comparison with the MSRN (discussed in Section 3.1.2).
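The hierarchical residual-like connections can be sketched as follows. The 1 × 1 reduce/expand convolutions and the four-way split follow the description above; the exact channel widths and the scale parameter are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Res2NetSketch(nn.Module):
    """Hedged sketch of hierarchical residual-like connections.

    The input is passed through a 1x1 convolution and split into four
    groups X1..X4. Y1 = X1 passes through unchanged, Y2 = K2(X2), and
    Yi = Ki(Xi + Y_{i-1}) for i = 3, 4, so Y2, Y3, Y4 see effective
    3x3, 5x5, and 7x7 receptive fields, respectively.
    """
    def __init__(self, channels=64, scale=4):
        super().__init__()
        assert channels % scale == 0
        w = channels // scale
        self.reduce = nn.Conv2d(channels, channels, 1)
        # one 3x3 convolution K_i per group except the identity group
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, 3, padding=1) for _ in range(scale - 1))
        self.expand = nn.Conv2d(channels, channels, 1)
        self.scale = scale

    def forward(self, x):
        xs = torch.chunk(self.reduce(x), self.scale, dim=1)
        ys = [xs[0]]                      # Y1 = X1 (identity)
        y = self.convs[0](xs[1])          # Y2 = K2(X2): 3x3 field
        ys.append(y)
        for i in range(2, self.scale):    # Y3, Y4: 5x5 and 7x7 fields
            y = self.convs[i - 1](xs[i] + y)
            ys.append(y)
        # fuse the four outputs and restore the input channel count
        return self.expand(torch.cat(ys, dim=1))

x = torch.randn(1, 64, 8, 8)
y = Res2NetSketch(channels=64, scale=4)(x)
assert y.shape == x.shape
```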
Outside the MSDRB, the corresponding elements of S and B_{n−1} are added to output B_n, which is expressed as:

B_n = S + B_{n−1}    (17)

Multi-Dilation-Rate Dilated Convolution Group: this group enlarges the receptive field without increasing the amount of calculation, while minimising resolution loss. Dilated convolution sets different dilation rates d to obtain different receptive fields; inserting d − 1 zeros between adjacent elements of the convolution kernel does not increase the amount of calculation. Figure 4 shows dilated convolution kernels at different d values. For a constant amount of calculation, different dilation rates d give a standard k × k convolution different receptive fields R. The receptive field of dilated convolution [22] can be calculated as follows:

R = (k − 1) × d + 1    (18)

The calculation based on (18) shows that when d = 1, 2, and 3, the dilated convolution (Figure 4a-c) is equivalent to a standard convolution with a 3 × 3, 5 × 5, and 7 × 7 receptive field, as used in branches 1, 2, and 3 in Figure 2, respectively.
Different branches constructed in this manner have different receptive fields. Combined with the above multi-scale feature fusion and multilevel residual learning, the PMSRN can increase the amount of calculation by a small margin and enhance the ability to detect image feature information (see Section 3.2.3 for a discussion of the corresponding experiments).
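A short numerical check of the receptive-field relation R = (k − 1) × d + 1 for a k × k kernel with dilation rate d: the script below also verifies empirically, via the impulse response of a dilated 3 × 3 convolution, that the non-zero footprint spans exactly R pixels per side:

```python
import torch
import torch.nn.functional as F

def receptive_field(k, d):
    # A k x k kernel with dilation rate d covers the same extent as a
    # standard ((k-1)*d + 1) x ((k-1)*d + 1) kernel
    return (k - 1) * d + 1

# The three branches of Figure 2 (k = 3): d = 1, 2, 3 -> R = 3, 5, 7
assert [receptive_field(3, d) for d in (1, 2, 3)] == [3, 5, 7]

# Empirical check: convolve a centred unit impulse with an all-ones
# dilated 3x3 kernel; the non-zero response spans (k-1)*d + 1 rows.
for d in (1, 2, 3):
    impulse = torch.zeros(1, 1, 15, 15)
    impulse[0, 0, 7, 7] = 1.0
    kernel = torch.ones(1, 1, 3, 3)
    out = F.conv2d(impulse, kernel, padding=7, dilation=d)
    rows = (out[0, 0].sum(dim=1) > 0).nonzero().flatten()
    assert rows.max() - rows.min() + 1 == receptive_field(3, d)
```

Since dilation only spreads the same nine taps over a larger area, the calculation cost is unchanged while the receptive field grows, which is exactly the trade-off the MSDRB branches exploit.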

Complementary Block (CB) in Image Reconstruction Structure
Secondly, in order to fuse the shallow and deep features, a new CB is constructed in the reconstruction structure of the PMSRN; its input consists of two parts, namely, the original image feature B_0 and the output of the HFFS (Figure 5). The HFFS is a global and local feature fusion structure: the inputs B_0, B_1, . . . , and B_n are subjected to the concatenation operation, and the output is filtered through two convolution layers. B_0 and the HFFS output each undergo a sub-pixel convolution operation (Figure 6), which rearranges a tensor with dimensions H × W × C·r^2 as rH × rW × C. Then, the corresponding elements are added, and the features are reconstructed as the SR image after 3 × 3 standard convolution filtering. The CB module integrates the global and local features, effectively utilises the original feature information, and prevents information loss. Section 3.1.1 analyses the necessity of the CB module.
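A hedged sketch of the CB data flow: both B_0 and the HFFS output are up-sampled by sub-pixel convolution (PixelShuffle) and added element-wise before a final 3 × 3 convolution. The channel widths and the exact tail configuration are assumptions:

```python
import torch
import torch.nn as nn

class CBSketch(nn.Module):
    """Sketch of the complementary block (CB): the shallow feature B0
    and the fused deep feature from the HFFS are each up-sampled by
    sub-pixel convolution, added element-wise, and filtered by a
    final 3x3 convolution to produce the SR image."""
    def __init__(self, n_feats=64, r=2, out_ch=3):
        super().__init__()
        def upsample():
            return nn.Sequential(
                # expand to C * r^2 channels, then rearrange the
                # H x W x C*r^2 tensor as rH x rW x C
                nn.Conv2d(n_feats, n_feats * r * r, 3, padding=1),
                nn.PixelShuffle(r))
        self.up_shallow = upsample()   # path for the original feature B0
        self.up_deep = upsample()      # path for the HFFS output
        self.tail = nn.Conv2d(n_feats, out_ch, 3, padding=1)

    def forward(self, b0, hffs):
        # element-wise addition fuses the global and local features
        return self.tail(self.up_shallow(b0) + self.up_deep(hffs))

b0 = torch.randn(1, 64, 16, 16)
deep = torch.randn(1, 64, 16, 16)
sr = CBSketch(n_feats=64, r=2)(b0, deep)
assert sr.shape == (1, 3, 32, 32)
```

Keeping a separate up-sampling path for B_0 is what lets the block re-inject original LR information that deeper layers may have forgotten.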

Datasets
To facilitate experimental comparisons with other SR reconstruction networks, the publicly available, high-quality image dataset diverse 2K (DIV2K) [23] was selected, which contains 800 training and 100 verification images. The remote-sensing images used in the test stage contain numerous regular roads, buildings, fields, and other features with high requirements for detailed texture, while the DIV2K training images show clear details. By learning the mapping relationship between the LR and HR images of the DIV2K training set, the ability of the network model to distinguish detailed textures can be enhanced to improve the spatial resolution of remote-sensing images. Therefore, the DIV2K dataset can be applied to the SR reconstruction of remote-sensing images. The aerial image dataset (AID) is a remote-sensing image dataset that includes 30 categories of scene images, each with approximately 220-420 images (a total of 10,000 images), each of size 600 × 600. A total of 7614 high-definition images were selected as the new dataset, of which 80% (6768) were used as the training set, 10% (846) as the verification set, and 10% (846) as the AID-test set.
The DIV2K dataset is used in the comparative experiments described in Sections 3.1 and 3.2. The AID dataset is used to compare the effects of remote-sensing image reconstruction, as described in Sections 3.3 and 4. Both training sets were prepared with four different up-sampling factors: 2× (r = 2), 3× (r = 3), 4× (r = 4), and 8× (r = 8). To improve the training efficiency, the LR input obtained by bicubic down-sampling was divided into multiple training patches with a size of 64 × 64 and sent to the network model for training. Before training, each training patch was randomly scaled, rotated, and flipped to augment the training data.
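The augmentation step can be sketched as follows. Applying the same random rotation and flip to an LR patch and its HR counterpart is a common SR training recipe; the exact transform set here is an assumption:

```python
import random
import torch

def augment(lr_patch, hr_patch):
    """Paired geometric augmentation: a random 90-degree rotation and
    a random horizontal flip applied identically to the LR patch and
    its HR counterpart, so the LR-to-HR mapping is preserved."""
    k = random.randint(0, 3)                       # number of 90-degree turns
    lr_patch = torch.rot90(lr_patch, k, dims=(-2, -1))
    hr_patch = torch.rot90(hr_patch, k, dims=(-2, -1))
    if random.random() < 0.5:                      # horizontal flip
        lr_patch = torch.flip(lr_patch, dims=(-1,))
        hr_patch = torch.flip(hr_patch, dims=(-1,))
    return lr_patch, hr_patch

lr = torch.randn(3, 64, 64)     # 64 x 64 LR training patch
hr = torch.randn(3, 128, 128)   # matching HR patch for r = 2
a, b = augment(lr, hr)
assert a.shape == lr.shape and b.shape == hr.shape
```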
In the test phase of Section 3.2, five public datasets were used: Set5 [24], Set14 [25], BSDS100 [26] (B100), Urban100 [27], and Manga109 [28]. The test sets Set5 and Set14 are low-complexity, single-image SR reconstruction datasets based on non-negative neighbourhood embedding; BSDS100 and Urban100 comprise 100 images each; and the Manga109 dataset contains 109 high-quality Japanese cartoon images, which can fully verify the performance of the model. Both the PMSRN and MSRN models were trained in RGB space to evaluate their performance. In Section 3.3, the AID-test set is used as the test set to compare the reconstruction effects of the PMSRN on different training sets.

Experimental Environment
The initial learning rate lr of the PMSRN network is 0.0001, and the learning rate decreases by 50% every 200 epochs. The optimiser was the Adam optimiser [29]. Eight MSDRBs (n = 8) were used in the model. The number of input channels of each MSDRB was equal to the number of its output channels, and the number of output channels of the HFFS module was consistent with that of a single MSDRB. Two NVIDIA GeForce RTX 2080Ti GPUs were used to train the PMSRN model on the PyTorch framework. When a corresponding HR image is available, two evaluation standards, peak signal-to-noise ratio (PSNR) [30] and structural similarity (SSIM) [30], are used for evaluation. The higher the PSNR/SSIM value, the better the SR image reconstruction effect.
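Two details of this setup can be sketched directly in PyTorch: the step learning-rate schedule (initial lr 1e-4, halved every 200 epochs) and the PSNR metric. The one-parameter model is a placeholder assumption:

```python
import torch

# Learning-rate schedule: lr starts at 1e-4 and is halved every 200 epochs
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(params, lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.5)
for epoch in range(400):
    opt.step()        # training step would go here
    sched.step()
# after 400 epochs the lr has been halved twice: 1e-4 -> 5e-5 -> 2.5e-5
assert abs(opt.param_groups[0]["lr"] - 2.5e-5) < 1e-12

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Sanity check: a lightly perturbed image scores higher than a heavily
# perturbed one, matching "higher PSNR = better reconstruction".
hr = torch.rand(1, 3, 64, 64)
noise = torch.randn_like(hr)
close = (hr + 0.01 * noise).clamp(0, 1)
far = (hr + 0.10 * noise).clamp(0, 1)
assert psnr(close, hr) > psnr(far, hr)
```

SSIM is omitted here because its windowed luminance/contrast/structure computation is considerably longer; library implementations are typically used in practice.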

Benefits of CB
To utilise the global and local feature information fully and enhance the reconstruction effect of SR images, a CB is added to the reconstruction structure of the model. An experiment was performed to analyse the relationship between the PSNR on the DIV2K verification set and the training epoch to confirm the network improvement effected by the CB. Figure 7 compares the PSNRs of the MSRN-CB with those of the FSRCNN, MSRN, and IMDN networks in the epoch range of 0 to 100; curves of different colours indicate different networks. Note that to verify the effectiveness of the CB module, no pre-training parameters (training methods) were used to initialise any of the networks.
The PSNR training curves of the deep network (more than 20 layers) MSRN (blue line), MSRN-CB (green line), and IMDN (purple line) are much higher than that of the shallow network (fewer than 5 layers) FSRCNN (orange line). The MSRN-CB (green line) has the highest PSNR curve and the highest rising speed. With increasing r, the PSNR curve of the MSRN-CB becomes increasingly flat, indicating that the CB module can effectively improve the stability and robustness of the network model. To verify further the necessity of introducing the CB module, the Manga109 test set was considered as an example, and Table 1 compares the PSNR/SSIM of the four models.  The MSRN-CB has the highest value on the test set Manga109. When r = 2, its PSNR is 2.4 dB higher than that of the shallow FSRCNN, 0.03 dB higher than that of the deep IMDN, and the same as that of the deep MSRN, and its SSIM reaches 0.9769. When r = 3, its PSNR is 2.87 dB higher than that of the shallow FSRCNN and 0.42 dB and 0.12 dB higher than those of the deep IMDN and MSRN, respectively, and its SSIM reaches 0.9450. When r = 4, its PSNR is 2.78 dB higher than that of the shallow FSRCNN and 0.3 dB and 0.08 dB higher than those of the deep IMDN and MSRN, and its SSIM reaches 0.9088. When r = 8, its PSNR is 1.61 dB higher than that of the shallow FSRCNN and 0.21 dB and 0.03 dB higher than those of the deep IMDN and MSRN, respectively, and its SSIM reaches 0.7744. It can be seen that as r increases, the numerical gap between the MSRN-CB and the other three network models increases, indicating that the MSRN-CB has a better reconstruction effect for larger r.

Benefits of Res2Net
To increase the receptive field of each network layer and enhance the multi-scale feature extraction capability of the network, Res2Net was introduced into the MSRN to form the MSRN-Res2Net network. The experimental method, data, and comparison networks were the same as those in the CB module experiment described in Section 3.1.1, and Figure 8 presents the experimental results. The PSNR training curves of the deep network MSRN (blue line), MSRN-Res2Net (green line), and IMDN (purple line) are much higher than that of the shallow network FSRCNN (orange line). The MSRN-Res2Net has the highest PSNR curve and the highest rising speed. As r increases, the distinction between the MSRN-Res2Net and the other three curves becomes more obvious. To explain more effectively the necessity of adding Res2Net, Table 2 compares the four models according to the evaluation standard PSNR/SSIM. The MSRN-Res2Net has the largest value on the test set Manga109. When r = 2, its PSNR is 2.58 dB higher than that of the shallow FSRCNN and 0.21 dB and 0.18 dB higher than those of the deep IMDN and MSRN, respectively, and its SSIM reaches 0.9772. When r = 3, its PSNR is 2.84 dB higher than that of the shallow FSRCNN and 0.39 dB and 0.09 dB higher than those of the deep IMDN and MSRN, respectively, and its SSIM reaches 0.9454. When r = 4, its PSNR is 2.94 dB higher than that of the shallow FSRCNN and 0.46 dB and 0.23 dB higher than those of the deep IMDN and MSRN, respectively, and its SSIM reaches 0.9108. When r = 8, its PSNR is 1.73 dB higher than that of the shallow FSRCNN and 0.33 dB and 0.15 dB higher than those of the deep IMDN and MSRN, respectively, and its SSIM reaches 0.7790.
The numerical changes conclusively indicate that as r increases, the numerical gap between the MSRN-Res2Net and the other three network models increases. Thus, the network with Res2Net can also yield better reconstruction results when r is larger.
Consider the Set5 test set as an example. Compared with other SISR methods, the PMSRN produces superior PSNR and SSIM values. In the SR case with 2× (r = 2), the PSNR of the PMSRN is 0.03 dB higher than that obtained with the suboptimal EDSR method and 0.08 dB higher than that of the basic MSRN. In the 3× (r = 3) SR case, the PSNR of the PMSRN is 0.01 dB higher than that of the suboptimal EDSR method and 0.21 dB higher than that of the basic MSRN. In the SR case with 4× (r = 4), the PSNR of the PMSRN is the same as that of the suboptimal EDSR method and 0.28 dB higher than that of the basic MSRN. In the SR case with 8× (r = 8), the PSNR of the PMSRN is 0.14 dB higher than that of the basic (suboptimal) MSRN. Next, consider Set14 as an example. In the SR case with 2× (r = 2), the SSIM of PMSRN is the best, and the PSNR value of the PMSRN is suboptimal. The PSNR is 0.07 dB lower than that of the optimal EDSR method, and the PSNR of the MSRN is 0.26 dB lower than that of the PMSRN. In the case of SR with 3× (r = 3), the SSIM and PSNR of the PMSRN are both suboptimal. The PSNR is 0.04 dB lower than that of the optimal EDSR method and 0.08 dB higher than that of the basic MSRN. In the case of SR with 4× (r = 4), the SSIM and PSNR of the PMSRN are both suboptimal, and the PSNR is 0.04 dB lower than that of the optimal EDSR method and 0.1 dB higher than that of the basic MSRN. In the SR case with 8× (r = 8), the PSNR of the PMSRN is 0.13 dB higher than that of the basic (suboptimal) MSRN, and the SSIM of the PMSRN is the best.
Using B100 as an example, in the SR cases with 2× (r = 2), 3× (r = 3), and 4× (r = 4), the PSNRs and SSIMs of the PMSRN are both suboptimal, although the PSNR of the PMSRN is 0.08 dB higher than that of the basic MSRN. In the cases of SR with 2× (r = 2) and 3× (r = 3), the PSNRs of the PMSRN are 0.05 dB lower than those of the optimal EDSR method. Moreover, in the case of SR with 4× (r = 4), the PSNR of the PMSRN is 0.02 dB lower than that of the optimal EDSR method. In the case of SR with 8× (r = 8), the PSNR and SSIM of the PMSRN are both optimal, and its PSNR is 0.08 dB higher than that of the basic MSRN (suboptimal).
Taking Urban100 as an example, the PSNR and SSIM of the PMSRN are both optimal compared with those of other SISR methods. In the cases of SR with 2× (r = 2), 3× (r = 3) and 4× (r = 4), and 8× (r = 8), the PSNR of the PMSRN is 0.41 dB, 0.3 dB, and 0.2 dB higher than that of the basic (suboptimal) MSRN, respectively. Considering Manga109 as an example, the PSNR and SSIM of the PMSRN are both optimal compared with those of other SISR methods. In the cases of SR with 2× (r = 2), 3× (r = 3), 4× (r = 4), and 8× (r = 8), the PSNR of the PMSRN is 0.44 dB, 0.3 dB, 0.43 dB, and 0.35 dB higher than that of the basic (suboptimal) MSRN, respectively.
For all r values, compared with the conventional Bicubic technique, the reconstruction networks using deep learning yielded a maximum increase in PSNR of 8.04 dB, and the maximum SSIM was 0.9776. Compared with the shallow networks SRCNN and FSRCNN, the PMSRN exhibited a significant improvement in SSIM, with an average improvement in PSNR of ~1.3 dB. Compared with the deep networks, the PSNR/SSIM of the PMSRN were only slightly lower than those of the EDSR on the test sets Set14 and B100, but training the EDSR model required more memory and storage space. In contrast, the number of parameters in our model was much smaller than in the EDSR model. For more details, please refer to Section 3.2.3.
To show that the PMSRN has an excellent reconstruction effect on images, the visual effect was further analysed qualitatively. Figure 9 visually compares the PMSRN with the basic MSRN and the conventional Bicubic technique, which are among the methods listed in Table 3. It can be clearly observed that, compared with the Bicubic technique, the PMSRN and MSRN yield higher SR image definition.

Visual Effect Comparison
In the enlarged region, it is apparent that the PMSRN produces finer details, such as the scarf stripes in the 2× case (r = 2, Figure 9a), and that the reconstructed SR image contains details that closely match those in the HR image. In the images with larger r, the performance of the PMSRN is prominent. For example, in the upper-left red frame of the enlarged areas in the SR images with r = 3 and r = 4 (Figure 9c,d), the images reconstructed by the PMSRN are similar to the HR images. However, when r = 4, the MSRN reconstructs stripes with incorrect detail. In the red frame area with r = 8 (Figure 9d), the reconstruction effect of the PMSRN is substantially better than that of the Bicubic method. Compared with the MSRN, the PMSRN shows more detailed edge information. The excellent performance of the PMSRN in terms of visual effects is consistent with the quantitative analysis results in Section 3.2.1. It can also be observed that the PMSRN has only about half the number of parameters of the EDSR model, yet performs the best in terms of the PSNR. This finding demonstrates that our model has a more effective structure and achieves a better balance between performance and model size.

Comparison of Reconstruction Effects of Different Training Sets
To study the effects of the DIV2K and AID training sets on the PMSRN model, comparative experiments were conducted on the evaluation results and subjective visual effects. Using the 846 remote-sensing images of the AID-test set as the test set, the model reconstruction effects of the same algorithm when trained on the DIV2K and AID training sets were compared. Table 4 summarises the evaluation results in terms of the PSNR/SSIM; the names of the training sets are given in parentheses. Although the resolution of each HR image in the DIV2K training set is higher than that in the AID training set, the PSNRs and SSIMs of the PMSRN (DIV2K) are lower than those of the PMSRN (AID) and even lower than those of the MSRN (AID). Therefore, when reconstructing remote-sensing images, the PMSRN (AID) has a better reconstruction effect.
Several groups of images were randomly selected to compare the visual effects of the SR images reconstructed by the PMSRN (DIV2K) and PMSRN (AID) (Figure 11). For 2× (r = 2), the quality of the reconstructed images obtained using the PMSRN (DIV2K) and PMSRN (AID) is much higher than that of the LR images. In particular, the PMSRN (AID) clearly shows more detailed information about the bow deck in the red frame. For 3× (r = 3), 4× (r = 4), and 8× (r = 8), the reconstruction effects of the PMSRN (DIV2K) and PMSRN (AID) in the red box are significantly different. It can be clearly observed that the SR images acquired using the PMSRN (AID) are closer to the HR images. Therefore, the PMSRN (AID) was used as the final SR reconstruction network.

Discussion
All the above experimental results are based on the down-sampling of HR images to obtain LR images. However, in practical applications, LR remote-sensing images are directly collected by sensors. Therefore, it is necessary to discuss further the SR reconstruction effect of the proposed method on actual LR images.
For this reason, in the absence of corresponding high-spatial-resolution images, we first selected a remote-sensing image with a spatial resolution of 16 m and randomly selected four areas for experimentation. The methods based on the Bicubic model, MSRN, and PMSRN were used to perform SR reconstruction. Figure 12 presents the results of SR reconstruction with different r values. Owing to the lack of corresponding high-spatial-resolution images, the reconstruction effects can only be evaluated by comparing the SR image definition of the three methods. It was observed that the SR image quality of the deep-learning-based methods (MSRN and PMSRN) is better than that of the Bicubic method. Compared with the MSRN SR image, the PMSRN SR image exhibits more details, fuller colour, and clearer contour stripes.
In addition, setting aside the fact that different wavelengths have different reflectivities in the same area [31], we selected different wavebands belonging to the same remote-sensing image and combined them to obtain HR images with a spatial resolution of 1 m and LR images with spatial resolutions of 2 m and 4 m. We randomly selected two areas for experimentation, as shown in Figure 13. The images on the left of Figure 13a,b have a spatial resolution of 1 m and are used only as HR images for visual comparison. The images in the middle of Figure 13a,b present SR images reconstructed using the Bicubic model, and the images on the right of Figure 13a,b depict SR images reconstructed by the PMSRN. It can be observed that the outlines of the SR images reconstructed by the PMSRN are clearer than those of the Bicubic results and closer to the images on the left of Figure 13. Because the ground features are too fuzzy and complex, the reconstruction on the right of Figure 13b is not as good as that in Figure 13a. However, the definition of the image on the right of Figure 13b is significantly higher than that of the image in the middle of Figure 13b. These results further verify that the PMSRN is of significant value in remote-sensing image research.

Conclusions
In this report, we proposed an efficient SR reconstruction network called the PMSRN, which is an improved version of the MSRN. The main modules of the PMSRN are the MSDRBs and a CB. The MSDRBs use numerous residual connections both internally and externally, which enhances the ability to detect image features at multiple scales and fully utilises image feature information. Furthermore, the network introduces a global and local CB when reconstructing SR images, which helps improve the network stability, prevent information loss, and improve the use of original LR image feature information. In addition, a comparison of the reconstruction effects of the PMSRN when trained using the DIV2K and AID data indicated that the AID training set is more suitable for remote-sensing image reconstruction. A comparison of the remote-sensing SR images reconstructed by the PMSRN (AID) with the high-spatial-resolution remote-sensing images acquired by a satellite indicated that the PMSRN (AID) yielded satisfactory results.
Compared with other networks, the PMSRN achieved impressive reconstruction performance on RGB data sets that ignore noise. In future work, it will be necessary to investigate the effects of noise on SR reconstruction as well as the SR reconstruction of non-RGB band images.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.