Sentinel-2 Sharpening via Parallel Residual Network

Abstract: Sentinel-2 data is of great utility for a wide range of remote sensing applications due to its free access and fine spatial-temporal coverage. However, restricted by the hardware, only four bands of Sentinel-2 images are provided at 10 m resolution, while the others are recorded at reduced resolution (i.e., 20 m or 60 m). In this paper, we propose a parallel residual network for Sentinel-2 sharpening, termed SPRNet, to obtain the complete data at 10 m resolution. The proposed network learns the mapping between the low-resolution (LR) bands and the ideal high-resolution (HR) bands in three steps: parallel spatial residual learning, spatial feature fusing and spectral feature mapping. First, rather than using a single-branch network, a parallel residual learning structure is proposed to extract the spatial features from the different resolution bands separately. Second, the spatial feature fusing aims to fully fuse the extracted features from each branch and produce the residual image with spatial information. Third, to keep spectral fidelity, the spectral feature mapping is utilized to directly propagate the spectral characteristics of the LR bands to the target HR bands. Without using extra training data, the proposed network is trained on lower scale data synthesized from the observed Sentinel-2 data and applied to the original data. The data at 10 m spatial resolution can finally be obtained by feeding the original 10 m, 20 m and 60 m bands to the trained SPRNet. Extensive experiments conducted on two datasets indicate that the proposed SPRNet obtains good results in spatial fidelity and spectral preservation. Compared with the competing approaches, the SPRNet increases the SRE by at least 1.538 dB on the 20 m bands and 3.188 dB on the 60 m bands while reducing the SAM by at least 0.282 on the 20 m bands and 0.162 on the 60 m bands.


Introduction
Sentinel-2 is a wide-swath, fine resolution optical satellite imaging mission developed by the European Space Agency (ESA) [1]. Owing to its frequent revisit rate, global access and free availability, Sentinel-2 products have been widely used to monitor dynamically changing geophysical variables such as vegetation, soil, water cover and coasts [2][3][4][5]. However, due to storage and transmission bandwidth restrictions, the thirteen spectral bands of a Sentinel-2 image are acquired at three different spatial resolutions: four 10 m bands, six 20 m bands and three 60 m bands. With the same spatial coverage, the low-resolution (LR) bands have the potential to be enhanced by image sharpening, an economically effective technique that merges the LR bands with the high-resolution (HR) bands to produce a complete HR image (ideally without loss of spectral information) [6]. With desirable spatial and spectral resolution, the sharpened image can yield better interpretation capabilities in remote sensing applications [7][8][9]. However, most existing learning-based methods adopt a single branch to extract features from these bands together, which may sacrifice some useful information.
To address the aforementioned problems, a parallel residual network for Sentinel-2 sharpening, termed SPRNet, is proposed in this paper. The proposed method can be divided into three steps. First, to exploit sufficient spatial information and learn the mapping between the LR and corresponding HR bands, we propose a parallel structure based on residual learning, where several branches with the same network composition are utilized to extract features from the different resolution bands independently. Second, we develop the spatial feature fusing unit to concatenate and fuse the spatial features extracted from each branch; these feature maps are then restored to a spatial residual image with the same number of channels as the sharpened bands. Third, a skip-connection is constructed to add the spectral information to the spatial residual image. Based on the above steps, we can obtain the Sentinel-2 image with all bands at 10 m resolution, using the 10 m, 20 m and 60 m bands. Compared with the existing methods, the contributions of this paper are twofold: 1. We propose a Sentinel-2 sharpening method to raise the spatial resolution of both the 20 m and 60 m bands with the help of the 10 m bands, producing an HR image with all bands at 10 m resolution. 2. We develop a parallel network structure for extracting features from the different resolution bands by separate branches. This structure makes it possible to improve the spatial resolution of the LR bands while keeping spectral fidelity.
The remainder of the paper is organized as follows. Section 2 introduces the proposed SPRNet framework for Sentinel-2 sharpening in detail. In Section 3, the experimental validation and analysis on the degraded and real Sentinel-2 data are presented. Discussions on the experiments are shown in Section 4. Finally, we provide some concluding remarks in Section 5.

Network Architecture
In this paper, we propose a parallel residual network to learn the sharpening of Sentinel-2 images. Before presenting our method, we briefly introduce the bands of Sentinel-2. The bands of Sentinel-2 images are divided into three sets by resolution: the 10 m, 20 m and 60 m sets. Each set, together with its corresponding band indices and spectral characteristics, is displayed in Table 1. It is noteworthy that B10 is excluded from our spatial enhancement due to its poor radiometric quality and across-track striping artifacts [23]. Given these sets, the goal of our sharpening method is to estimate the 10 m resolution HR version of the 20 m and 60 m bands. Since the spatial ratio between 20 m and 10 m differs from that between 60 m and 10 m, we adopt two separate networks (i.e., SPRNet 2× for the 20 m bands and SPRNet 6× for the 60 m bands) to implement Sentinel-2 sharpening. The structures of the SPRNet 2× and SPRNet 6× are shown in Figure 1; each consists of three parts: the parallel residual learning, the spatial feature fusing and the spectral feature mapping. First, the spatial features of the HR and LR bands are extracted by separate branches, each composed of an initial spatial feature extraction (ISFE) unit and a series of residual blocks (ResBlocks). Second, the spatial feature fusing is constructed from feature concatenation and several fully connected (FC) layers to merge and propagate the spatial information. Third, the spectral features of the LR bands are directly added to the fused spatial features through a skip-connection layer in order to transmit the spectral information. The target HR image can finally be predicted from the trained models using the LR and auxiliary HR bands.

Parallel Spatial Residual Learning
To learn the mapping for independent spatial information extraction, we construct a parallel structure in which inputs with different spatial resolutions are fed into different branches separately. Since the 60 m bands cannot contribute to sharpening the 20 m bands, the SPRNet 2× consists of two branches while the SPRNet 6× consists of three branches. In each branch, we adopt a residual structure comprising an ISFE unit and a series of ResBlocks to ensure that sufficient information can be extracted from the inputs.
Within the SPRNet 2× and SPRNet 6×, we can obtain numerous feature maps which contribute to the model performance. However, increasing the number of feature maps can make the training procedure unstable and degrade the sharpening results in return. To address this problem, we propose the ISFE unit with the structure in Figure 2a, which places a constant scaling layer after the convolution and activation function layers to multiply the features by a constant. With the input x, the ISFE unit can be defined as

x_1 = μ · ϕ(w ∗ x + b),

where x_1 denotes the output of the ISFE, {w, b} are the weight matrix and bias of the convolution, ϕ is the rectified linear unit (ReLU), ϕ(x) = max(x, 0), μ is the constant scaling with factor 0.05, and ∗ denotes the convolution operation.
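As a concrete illustration, the ISFE forward pass can be sketched in NumPy as follows. The single-channel setting and the naive convolution helper are simplifications for illustration only; the actual network uses 128 filters of size 3 × 3.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Naive 'same' 2D convolution: x is (H, W), w is (kh, kw), b is a
    scalar. Zero padding keeps the output the same size as the input."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w) + b
    return out

def isfe(x, w, b, mu=0.05):
    """ISFE unit: convolution -> ReLU -> constant scaling by mu."""
    return mu * np.maximum(conv2d_same(x, w, b), 0.0)
```

With an identity kernel (center weight 1) and zero bias, the unit simply scales positive inputs by 0.05, which makes the effect of the constant scaling layer easy to see.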
To explore deeper spatial features and learn the spatial mapping between the LR and HR bands, the output of the ISFE is fed to a series of ResBlocks with the structure in Figure 2b. Each ResBlock consists of convolution, activation function, and residual scaling layers [44]. To propagate the input information and alleviate the vanishing gradient problem, a skip-connection is added. The m-th ResBlock can thus be computed as

y_1^m = ϕ(w_1^m ∗ x^m + b_1^m),
y_2^m = w_2^m ∗ y_1^m + b_2^m,
x^{m+1} = x^m + λ · y_2^m,

where y_1^m and y_2^m denote the intermediate results, {w^m, b^m} are the weight matrices and biases of the convolutions in the ResBlock, x^{m+1} denotes the output of the ResBlock, and λ is the residual scaling with factor 0.1.
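The ResBlock computation above can be sketched as follows. For brevity the two convolutions are modelled as 1×1 (per-pixel) linear maps, which preserves the conv → ReLU → conv → residual scaling → skip-add structure while staying easy to verify.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def res_block(x, w1, b1, w2, b2, lam=0.1):
    """One ResBlock: conv -> ReLU -> conv -> residual scaling (lam) ->
    skip-connection add. Convolutions are simplified to 1x1 maps here."""
    y1 = relu(w1 * x + b1)   # first convolution + ReLU
    y2 = w2 * y1 + b2        # second convolution
    return x + lam * y2      # residual scaling and skip-connection
```

Because the residual branch is scaled by λ = 0.1 before the skip-add, each block is a small perturbation of its input, which is what stabilizes the training of a deep stack of such blocks.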

Spatial Feature Fusing
In order to combine the information of the different resolution bands, we propose the spatial feature fusing component. After the parallel residual learning component, the feature maps extracted by the separate branches are concatenated so that they can be fed into the next layer simultaneously. To fully fuse the information in these maps, two FC layers are adopted, each followed by a ReLU activation. Subsequently, a convolution layer transforms the feature maps into the spatial residual image, which has the same number of channels as the sharpened bands. With the concatenated maps z, these layers can be formulated as

z_1 = ϕ(w_{f1} z + b_{f1}),
z_f = ϕ(w_{f2} z_1 + b_{f2}),
z_2 = w_c ∗ z_f + b_c,

where z_f denotes the output of the FC layers, {w_f, b_f} and {w_c, b_c} are the weight matrices and biases of the FC and convolution layers of this component, and z_2 is the output. Moreover, we apply zero padding in each convolution so that the output keeps the same spatial size as the input.
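A minimal NumPy sketch of this component follows; the shapes and weights are illustrative, and the FC layers are applied per pixel along the channel axis (equivalent to 1×1 convolutions).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse(features, w_f1, b_f1, w_f2, b_f2, w_c, b_c):
    """Spatial feature fusing (shapes are illustrative):
    features: list of (H, W, C_i) maps from the parallel branches.
    The maps are concatenated along the channel axis, passed through two
    FC layers (applied per pixel) each followed by ReLU, then projected
    to the residual image with one channel per sharpened band."""
    z = np.concatenate(features, axis=-1)  # (H, W, C1 + C2 + ...)
    z1 = relu(z @ w_f1 + b_f1)             # first FC + ReLU
    zf = relu(z1 @ w_f2 + b_f2)            # second FC + ReLU
    return zf @ w_c + b_c                  # residual image
```

The output channel count is fixed by the last weight matrix, matching the number of bands being sharpened (6 for SPRNet 2×, 2 for SPRNet 6×).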

Spectral Feature Mapping
The parallel spatial residual learning component and the spatial feature fusing component mainly contribute to learning the spatial mapping between the LR bands and the target HR bands. Considering that the target HR and input LR bands share the same spectral content, we construct the spectral feature mapping by adding a skip-connection to the network to keep spectral consistency. This operation adds the up-scaled LR bands to the spatial residual image obtained in the previous step, propagating the spectral information directly. As such, the approximated HR image is produced by combining the spatial features and the spectral characteristics.
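The skip-connection can be sketched as below. Nearest-neighbour upsampling is used here purely for simplicity; the paper only states that the LR bands are up-scaled, not which interpolation is used at this step.

```python
import numpy as np

def upsample_nearest(lr, ratio):
    """Nearest-neighbour upsampling of a (H, W) band by an integer ratio
    (illustrative stand-in for the up-scaling used in the network)."""
    return lr.repeat(ratio, axis=0).repeat(ratio, axis=1)

def spectral_feature_mapping(lr_band, residual, ratio):
    """Skip-connection: add the up-scaled LR band to the spatial residual
    image, so the spectral content of the LR input propagates directly
    to the HR estimate."""
    return upsample_nearest(lr_band, ratio) + residual
```

With a zero residual the output is exactly the up-scaled LR band, which makes the spectral-preservation role of this connection explicit: the network only has to learn the spatial residual.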

Training and Applying
Following the above steps, the designed network can learn an end-to-end mapping between the LR and corresponding HR bands. However, due to the lack of an HR reference, the mapping cannot be learned from the data at the original scale directly. A common solution is to train and test sharpening methods following Wald's protocol [45], which takes degraded data as inputs and the original data as the corresponding reference. This operation relies on the assumption that the mapping between the LR and HR data is scale-invariant (i.e., 40 m→20 m for inferring 20 m→10 m, and 360 m→60 m for inferring 60 m→10 m). In this way, image sharpening can be implemented with a model trained at the degraded scale. For convenience, the 10 m, 20 m and 60 m bands of the Sentinel-2 data are denoted as X_10, X_20 and X_60, respectively, and their degraded versions, obtained by convolving with the predetermined point spread function (PSF) [23,27] and downsampling with bilinear interpolation, are denoted as X^D_10, X^D_20 and X^D_60, respectively. As mentioned before, it is sufficient to train two networks, SPRNet 2× and SPRNet 6×. With the synthetic data pairs, these models are trained as follows.
For SPRNet 2×, X^D_10 and X^D_20 are created by downsampling X_10 and X_20 by a factor of 2 and are used to train the 40 m→20 m network. Since the sizes of X^D_10 and X^D_20 differ, we up-sample X^D_20 to the spatial size of X^D_10. Then, we concatenate X^D_10 and the up-scaled X^D_20 as the input of SPRNet 2×. The mapping F_2×(·) can be learned by minimizing the loss between the HR reference X_20 and the sharpening result F_2×([X^D_10, X^D_20]; Θ_1), where Θ_1 denotes the model parameters. The loss function can be formulated as

L(Θ_1) = |F_2×([X^D_10, X^D_20]; Θ_1) − X_20|,

where |·| denotes the L1-norm, which computes the mean absolute error between the generated and the reference data. Compared with SPRNet 2×, the input and output of SPRNet 6× are different. We downsample all bands by a factor of 6. Then, we adopt X^D_10, X^D_20 and X^D_60 as input and the original X_60 as the HR reference to train the 360 m→60 m network. Like SPRNet 2×, this model is estimated by minimizing the following loss function:

L(Θ_2) = |F_6×([X^D_10, X^D_20, X^D_60]; Θ_2) − X_60|,

where Θ_2 denotes the parameters of SPRNet 6× and F_6×(·) denotes the mapping between X^D_60 and X_60. On the basis of the above steps, the proposed method can learn the mapping between the LR and HR bands. In the applying stage, we feed the original bands X_10, X_20 and X_60 to the trained SPRNet 2× and SPRNet 6× models to produce the estimated HR bands Y_20 and Y_60:

Y_20 = F_2×([X_10, X_20]; Θ_1),
Y_60 = F_6×([X_10, X_20, X_60]; Θ_2).

The predicted Y_20 and Y_60 are the corresponding sharpening results at 10 m resolution of the 20 m and 60 m bands, respectively. Thus, the image with all bands at 10 m resolution is obtained.
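The data synthesis and loss of the training stage can be sketched as follows. Simple block averaging stands in here for the PSF blur plus bilinear downsampling described above, and the function names are illustrative.

```python
import numpy as np

def degrade(x, ratio):
    """Synthetic degradation for Wald's protocol: block averaging as a
    stand-in for PSF blur + downsampling by an integer ratio."""
    H, W = x.shape
    return x.reshape(H // ratio, ratio, W // ratio, ratio).mean(axis=(1, 3))

def l1_loss(pred, ref):
    """Mean absolute error (L1) between generated and reference data."""
    return np.mean(np.abs(pred - ref))
```

For SPRNet 2×, `degrade(X_10, 2)` and `degrade(X_20, 2)` would form the network input while the original `X_20` serves as the reference against which `l1_loss` is minimized.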

Data
Our experimental data come from the Sentinel-2 Level-1C products, which have been converted from radiance into geo-coded top-of-atmosphere (TOA) reflectance with sub-pixel multi-spectral registration [46]. The training data used in this paper cover a scene of Guangdong Province in China with a spatial extent of 72 km by 72 km and were collected on 31 December 2017. Figure

Experimental Details
In our experiments, the important parameters of the proposed method are configured as follows. To train the SPRNet 2×, the training data are degraded by a factor of 2 and sliced into patches of 60 × 60 pixels. Similarly, to train the SPRNet 6×, the training data are degraded by a factor of 6 and sliced into patches of 20 × 20 pixels. For each network, 3600 sample pairs are used for training and 10% of them are used for validation. The number of ResBlocks M is set to 6 in each branch, and we use 128 filters of size 3 × 3 for all convolution layers except the last one; this choice of parameters is inspired by [23]. Since the last convolution reduces the feature dimension to the number of sharpened bands, its number of filters is set to 6 and 2 in SPRNet 2× and SPRNet 6×, respectively. These networks are implemented in the Keras framework with an NVIDIA Tesla K80 GPU. We use Nadam [47,48] with β_1 = 0.9, β_2 = 0.999 and ε = 10^−8 as the optimizer to train the networks. The learning rate is initialized as 10^−4 and is reduced by a factor of 2 whenever the validation loss does not decrease for 5 epochs; the reduction procedure is terminated once the learning rate falls below 10^−5. The mini-batch size and the number of training epochs are set to 128 and 200, respectively.
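The learning-rate rule can be sketched in plain Python. This mirrors Keras' reduce-on-plateau behaviour only approximately; bookkeeping details such as improvement thresholds may differ from the actual callback.

```python
def lr_schedule(val_losses, lr0=1e-4, factor=0.5, patience=5, min_lr=1e-5):
    """Halve the learning rate whenever the validation loss has not
    improved for `patience` epochs; stop reducing once the next rate
    would fall below `min_lr`. Returns the rate used at each epoch."""
    lr, best, wait, rates = lr0, float("inf"), 0, []
    for loss in val_losses:
        rates.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience and lr * factor >= min_lr:
                lr *= factor
                wait = 0
    return rates
```

Starting from 10^−4, at most three reductions are possible (5·10^−5, 2.5·10^−5, 1.25·10^−5) before the floor of 10^−5 stops further halving.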

Baselines and Quantitative Evaluation Metrics
To assess the effectiveness of our proposed method, we take SupReME [27], ResNet [43] and DSen2Net [23] as benchmark methods. Besides, the bicubic interpolation (Bicubic) is used to illustrate the performance of the naive upsampling without considering spectral correlations. The parameters of SupReME and DSen2Net are set as suggested in the original publications, while the number of ResBlocks in ResNet is set as 6.
We adopt six evaluation metrics for quantitative evaluation: root mean squared error (RMSE), signal-to-reconstruction error (SRE), correlation coefficient (CC), universal image quality index (UIQI), erreur relative globale adimensionnelle de synthèse (ERGAS) and spectral angle mapper (SAM) [45,49]. The RMSE and SRE evaluate the quantitative similarity between the target images and the reference images based on mean square error (MSE). The CC indicates the correlation, and the UIQI is a mathematically defined universal image quality index applicable to various image processing tasks. The ERGAS reflects the fidelity of the target images based on the weighted sum of the MSE in each band, and the SAM describes the spectral fidelity of the sharpening results. For these metrics, the closer the sharpening results are to the reference, the smaller the values of RMSE, ERGAS, and SAM; conversely, the larger the values of SRE, CC, and UIQI.
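For reference, common formulations of SRE and SAM can be sketched as follows; exact definitions vary slightly across papers, and these follow the usual forms rather than necessarily the precise variants used in [45,49].

```python
import numpy as np

def sre(ref, est):
    """Signal-to-reconstruction error in dB: ratio of the mean squared
    signal power to the mean squared reconstruction error."""
    return 10.0 * np.log10(np.mean(ref ** 2) / np.mean((ref - est) ** 2))

def sam(ref, est):
    """Spectral angle mapper: mean angle (degrees) between reference and
    estimated spectra. ref/est are (N, B) arrays of N pixels, B bands."""
    dot = np.sum(ref * est, axis=1)
    denom = np.linalg.norm(ref, axis=1) * np.linalg.norm(est, axis=1)
    ang = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return float(np.degrees(ang).mean())
```

A perfect reconstruction drives SRE to infinity and SAM to zero, consistent with the "larger is better" / "smaller is better" split described above.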

Evaluation at Lower Scale
Since the 10 m version of the LR bands is not available in the testing datasets, we follow Wald's protocol and give the quantitative evaluation at a lower scale: the SPRNet 2× is evaluated on the task of sharpening 40 m to 20 m and, in the same way, the SPRNet 6× is evaluated on the task of sharpening 360 m to 60 m. The lower scale data are generated by synthetically degrading the original data by the upscaling ratio (i.e., 2 for SPRNet 2× and 6 for SPRNet 6×). In the following, we discuss the effectiveness of the SPRNet 2× and SPRNet 6× separately.
SPRNet 2× -20 m bands. For 20 m band sharpening, the network SPRNet 2× is trained on simulated data degraded from the observed data by a factor of 2 to learn the mapping between 40 m and 20 m. Several state-of-the-art methods are compared with the proposed method. Tables 2 and 3 list the quantitative assessment results of these methods on the two testing datasets. We calculate RMSE, SRE, CC and UIQI on each band and then compute the mean values over the bands. The ideal value of each index is provided for convenient inter-comparison, and the best results are highlighted in bold. According to the reported results, a few observations are noteworthy. (1) All the methods are significantly better than the Bicubic, especially the CNN-based methods, which outperform the Bicubic by a large margin. For instance, our SPRNet reduces the RMSE by a factor of more than 2 and reaches more than 10 dB higher SRE. This illustrates the effectiveness of the sharpening procedure. (2) The proposed SPRNet obtains the best evaluation results on all indexes. For site 1, the mean RMSE of the SPRNet is 59.910, a decrease of 104.514, 26.437 and 12.573 compared to SupReME, ResNet and DSen2Net, respectively. Accordingly, the mean SRE of the SPRNet is 29.721 dB, which is 8.723 dB, 3.078 dB and 1.538 dB higher than that of the aforesaid methods, respectively. Also, the mean CC and UIQI of the SPRNet are 0.994 and 0.980, with gains of 0.002 and 0.008 over those of the best comparison method, DSen2Net. For site 2, the mean RMSE of the SPRNet is 55.155, 35.27 and 21.991 smaller than that of SupReME, ResNet and DSen2Net, respectively, and the mean SRE is 8.123 dB, 6.072 dB and 4.098 dB higher than that of the corresponding methods. Compared with the DSen2Net, the mean CC and UIQI of the SPRNet increase by 0.004 and 0.03. The above results demonstrate the high spatial fidelity of the proposed SPRNet.
Moreover, the proposed method also obtains the best ERGAS and SAM. The ERGAS values of the SPRNet for the two sites are 0.273 and 0.237 lower than those of the ResNet, and 0.15 and 0.149 lower than those of the DSen2Net. The SAM of the SPRNet for site 1 is 1.384 while that of the compared methods is larger than 1.6, and the SAM of the SPRNet for site 2 is 0.586 while that of the competitors is higher than 0.9. These analyses indicate the effectiveness of our SPRNet in both the spatial and spectral domains. Furthermore, we depict visual comparisons with the different methods on the two testing datasets in Figures 5 and 6. The figures provide the RGB composites (B12, B8a and B5 as RGB) and the results of each band. To clearly observe the difference between the sharpening results and the ground truth, the absolute differences between them are presented. In these figures, if the sharpening results blur edges or exaggerate the contrast, the residual errors are high; conversely, when the results are similar to the ground truth, the residual errors tend to zero. It can be seen that the results of the SPRNet are closer to the reference, while the compared methods exhibit errors along high-contrast edges in almost all bands. In Figure 5, the difference images of the Bicubic and SupReME are brighter, meaning these methods yield degraded results for the spatial reconstruction. In contrast, the CNN-based methods show smoother dark regions with fewer residual edges, and the best results are found with the SPRNet. As for Figure 6, the boundaries of the land plots are still obvious in the Bicubic and SupReME results. Among the CNN-based methods, the SPRNet performs satisfactorily, especially on B5, B6, B7 and B8a.
SPRNet 6× -60 m bands. To sharpen the 60 m bands, we train another network, SPRNet 6×, using degraded data at 60 m, 120 m and 360 m resolution to learn the mapping from 360 m to 60 m. The quantitative results of site 1 and site 2 are shown in Tables 4 and 5. The mean SRE of the SPRNet is 13.817 dB, 7.314 dB, 6.059 dB and 3.754 dB higher than that of Bicubic, SupReME, ResNet and DSen2Net, respectively. The mean CC and UIQI of the SPRNet are 0.994 and 0.972, with gains of 0.008 and 0.031 over those of the DSen2Net. In addition, the ERGAS and SAM of the SPRNet are 0.114 and 0.162 smaller than those of the DSen2Net. These results reveal the effectiveness of the SPRNet in sharpening the 60 m bands, which further shows the feasibility and suitability of the proposed method. We also perform a qualitative comparison against the ground truth. The RGB composites (B9, B9 and B1 as RGB) and absolute residuals of the two sites are plotted in Figures 7 and 8. The visual impression of the 60 m bands confirms that the SPRNet clearly dominates the competition with much less structured residuals. We observe that the competing methods leave more residuals for both sites; in contrast, the results of our method show smoother, darker regions. This indicates that our method obtains the best overall performance.

Performance on Different Bands
To verify the generalization ability of the sharpening methods across spectral wavelengths, the performance curves of the different bands for the different indices are shown in Figure 9. Almost all methods show a similar trend, and the performance of the CNN-based methods is substantially better. Among the 20 m bands (i.e., B5, B6, B7, B8a, B11 and B12), we find that all methods exhibit a marked drop in accuracy on B11 and B12. The numeric comparisons can be found in Tables 2 and 3. For instance, compared to the average level, the SRE values of the SPRNet drop by 0.246 dB (site 1) and 0.993 dB (site 2) on B11, and by 4.509 dB (site 1) and 3.102 dB (site 2) on B12. The reason is that these two bands lie in the SWIR spectrum (>1600 nm), which is beyond the spectral range (400∼900 nm) of the 10 m resolution bands, so the details of B11 and B12 cannot be inferred exactly by borrowing 10 m information. As for the 60 m bands (i.e., B1 and B9), the accuracy of the Bicubic is obviously lower than that of the other methods. This is because the Bicubic cannot use any information from the auxiliary HR bands, which aggravates the difficulty of recovering the details. Furthermore, the performance on B9 is slightly worse than that on B1. Since the center wavelength of B1 (443 nm) is covered by the 400∼900 nm range, while B9 (center wavelength 945 nm) is out of this range, the useful information borrowed from the 10 m bands is limited for B9. These observations indicate that bands spectrally closer to the auxiliary HR bands obtain more precise sharpening results.

Evaluation at the Original Scale
To verify the generalization of our method to true-scale Sentinel-2 data, we directly feed the original LR and 10 m bands into the trained networks (i.e., the band set [20 m, 10 m] is fed into SPRNet 2× and the band set [60 m, 20 m, 10 m] is fed into SPRNet 6×) to produce the 10 m resolution version of the LR bands. As no ground truth is available, the higher resolution spectral bands are used as reference data to assess the sharpening method. In our experiments, the four spectral bands with 10 m resolution serve as the reference data for visual evaluation. The up-scaled results of a sub-area obtained by the Bicubic and the SPRNet are shown in Figures 10 and 11. From these figures, we can clearly observe that the sharpening results of the SPRNet exhibit good visual quality. Although bicubic interpolation smooths the original images, it is unable to recover the spatial details, whereas the sharpening results of the SPRNet are sharper and bring out additional details in all cases. Moreover, the sharpening results of the LR bands improve the spatial resolution without noticeable artifacts. Specifically, as can be observed from the marked region (red rectangle), the SPRNet produces much sharper edges and more abundant ground-object details. In Figure 10, compared with the 10 m bands, the original 20 m bands cannot show the outlines of the building clearly and the original 60 m bands can hardly depict the subject at all. Nevertheless, our method commendably enhances the spatial resolution of the 20 m and 60 m bands and recovers the details of the building in these bands. In Figure 11, the contours are clear and vivid in the sharpening results of the SPRNet, whereas they are blurred or distorted in the original LR data. Moreover, the sharpening results of the LR bands match the 10 m resolution bands. These observations further imply that our SPRNet can effectively sharpen Sentinel-2 images and obtain complete data at 10 m resolution.

Effect of Combining Various-Resolution Bands
To investigate the impact of fusing various-resolution bands, we test different combinations of the 10 m, 20 m, and 60 m band sets as input to the SPRNet 2× and SPRNet 6×. The experimental results on the two testing datasets are displayed in Table 6. As for the SPRNet 2×, we take the model trained with the 20 m set only as the baseline (SPRNet 2× -1). We then add the 10 m set to the SPRNet 2× -1, resulting in SPRNet 2× -2. From the SPRNet 2× -1 to the SPRNet 2× -2, the SRE values increase by 7.782 dB for site 1 and 7.947 dB for site 2, which demonstrates the effectiveness of utilizing the information from the 10 m bands to enhance the 20 m bands. We further add the 60 m set to the SPRNet 2× -2, resulting in SPRNet 2× -3. Compared with the SPRNet 2× -2, the SRE values of the SPRNet 2× -3 decrease by 0.911 dB and 0.768 dB for site 1 and site 2, respectively. This is because the lower resolution bands cannot contribute to sharpening the higher resolution bands. As for the SPRNet 6×, the baseline (SPRNet 6× -1) is trained with the 60 m set only. Due to the large amplification factor, the SPRNet 6× -1 cannot learn the LR-to-HR mapping accurately. The SPRNet 6× -2 is obtained by adding the 10 m set to the SPRNet 6× -1. The SRE values of the SPRNet 6× -2 are 13.844 dB and 10.935 dB higher than those of the SPRNet 6× -1 for site 1 and site 2, respectively. Moreover, the SPRNet 6× -3, which combines the 10 m, 20 m and 60 m sets, outperforms the other models, which implies that both the 10 m and 20 m bands provide useful information for reproducing the details of the 60 m bands. Based on the above analysis, we conclude that auxiliary bands with finer resolution can efficiently improve the sharpening results. Therefore, it is reasonable to sharpen the Sentinel-2 image using two separate networks with different inputs.

Effect of Constant Scaling In ISFE
To investigate the effect of the constant scaling in the ISFE unit, we display the training curves of our proposed method with and without constant scaling in Figure 12, from which two observations can be drawn. First, the networks with constant scaling converge faster. For the SPRNet 2×, the network with constant scaling converges rapidly to fine performance within 80 epochs, while the network without constant scaling takes about 100 epochs to reach its maximum performance. For the SPRNet 6×, the training loss of the network with constant scaling stabilizes before 60 epochs, while the fluctuation of the other curve does not diminish until about 70 epochs. Second, the final accuracy is higher for the networks with constant scaling. Compared with the networks without constant scaling, the SRE values of the networks with constant scaling are higher by more than 5 dB at the first epoch. Even when training reaches 200 epochs, the SRE values of the networks without constant scaling are still lower than those of the proposed networks. Therefore, the addition of constant scaling is a simple but powerful strategy in our SPRNet.

Conclusions
In this paper, we propose a parallel residual network (SPRNet) for Sentinel-2 image sharpening to obtain complete data at the highest sensor resolution. The proposed method is designed to sharpen both the 20 m and 60 m bands. Compared with existing deep learning-based methods, the main advantage of our SPRNet is that sufficient spatial information of the different resolution bands is extracted by separate branches in a parallel structure. In addition, spatial information fusing and spectral characteristics propagation are realized by the designed spatial feature fusing and spectral feature mapping components. As such, the proposed method obtains good sharpening results in both spatial fidelity and spectral preservation. By learning the LR-to-HR mapping at a lower scale, the trained SPRNet can produce the image at 10 m resolution from the original Sentinel-2 data. Extensive experiments on the degraded and original data show that the proposed method is competitive with state-of-the-art approaches. In quantitative evaluations on the degraded data, for the 20 m bands, the SRE of the SPRNet is 1.538 dB (site 1) and 4.098 dB (site 2) higher than that of the best competing approach; for the 60 m bands, the SPRNet increases the SRE by 3.188 dB (site 1) and 3.754 dB (site 2) compared to the best competing approach. The proposed method also shows visually convincing results on the original data. In the future, we will study the effects of the network parameters and try to decide them adaptively. How to apply the sharpening results in other application areas (e.g., target detection and classification) is another future research topic.