PlanetScope Radiometric Normalization and Sentinel-2 Super-Resolution (2.5 m): A Straightforward Spectral-Spatial Fusion of Multi-Satellite Multi-Sensor Images Using Residual Convolutional Neural Networks

Sentinel-2 (S2) imagery is used in many research areas and for diverse applications. Its spectral resolution and quality are high, but its spatial resolution, of at most 10 m, is not sufficient for fine-scale analysis. A novel method was thus proposed to super-resolve S2 imagery to 2.5 m. For a given S2 tile, the 10 S2 bands (four at 10 m and six at 20 m) were fused with additional images acquired at higher spatial resolution by the PlanetScope (PS) constellation. The radiometric inconsistencies between PS microsatellites were normalized. Radiometric normalization and super-resolution were achieved simultaneously using state-of-the-art super-resolution residual convolutional neural networks adapted to the particularities of S2 and PS imageries (including masks of clouds and shadows). The method is described in detail, from image selection and downloading to neural network architecture, training, and prediction. The quality was thoroughly assessed visually (photointerpretation) and quantitatively, confirming that the proposed method is highly spatially and spectrally accurate. The method is also robust and can be applied to S2 images acquired worldwide at any date.


Introduction
Satellite imagery provides a unique and detailed perspective on the state and changes of land, coastal, and oceanic ecosystems [1]. However, the extractible information is limited by the spectral, spatial, and temporal resolutions of remote sensing images. Due to trade-offs in satellite instruments, images generally have either a high spatial resolution and a low spectral resolution or vice versa. One of the most used solutions is pansharpening: the fusion of a multispectral (MS) image with a panchromatic (PAN) image, both acquired simultaneously by the same satellite and capturing the same area [2]. MS images are typically composed of several bands partitioning the solar radiation into different spectral ranges (e.g., red, blue, green, and near-infrared). PAN images are composed of a single band capturing the whole solar radiation at a higher spatial resolution than MS images. The resulting pansharpened image combines the high spatial resolution of the PAN image with the high spectral resolution of the MS image. The numerous available pansharpening methods can be labeled as spectral, spatial, spectral-spatial, or spatiotemporal [3].
Pansharpening is a special case of image fusion: a combination of several images into a single composite image that has a higher information content than any of the original images [4,5]. Image fusion

Software and Hardware
Preparation and processing of images and patches, as well as the neural networks (NNs), were operated in R [34] with mainly three R packages: raster [35], sf [36], and keras (TensorFlow backend) [37], in connection (command lines) with the GDAL/OGR library [38] and the Orfeo ToolBox [39,40]. Only one computer was used, with satisfactory performance: Windows 10 64 bits, 64 GB of RAM, 20 cores at 3.31 GHz, a 2 TB SSD, and an NVIDIA GeForce RTX 2080 graphics card (CUDA).

Image Selection
For both constellations, Sentinel-2 (S2) and PlanetScope (PS), images with the highest processing level, i.e., bottom-of-atmosphere (BOA) surface reflectance, were selected: Level 2A for S2 and Level 3B (Analytic Ortho Scene) for PS. For S2, the 4 bands at 10 m and the 6 bands at 20 m were used. For PS, all the available bands (R, G, B, and NIR) were used. The spectral overlap between the S2 and PS bands varied along the spectrum (Figure 1). The S2 bands B5, B6, B11, and B12 presented no overlap with any of the PS bands. Less precise results were thus expected for these four bands. As the revisit time of S2 is the longest (five days against one for PS), S2 tiles of 100 km by 100 km were first selected. PS scenes were then selected to fully cover each S2 tile following three criteria: (1) maximum radiometric quality based on the "clear percent" and "clear confidence percent" variables of the PS "unusable data mask 2" (udm2); (2) minimum absolute time interval with the S2 tile (<7 days); and (3) maximum common area with the S2 tile (>10%). The number of selected PS scenes varied for each S2 tile. The percentage of cloud cover was close to 0 for S2 and the lowest possible for PS, but clouds and shadows were present in several PS scenes (<5%).
In total, 6 S2 tiles were selected worldwide to illustrate the robustness of the proposed method (Figure 2). These tiles were acquired at various dates by one of the two S2 satellites (A or B) over 5 countries (Table 1). Two of these tiles (BeS and BeN) covered a large part of Belgium, characterized by a great diversity of landscapes and land covers (Figure 3A,B). The Corine Land Cover (CLC) map of 2018 (Europe) was used to assess the quality of the proposed method by main land cover. S2 tiles were downloaded from the Theia Hub (https://theia.cnes.fr) using the R package "theiaR" [41]. Theia produces S2 Level 2A products (from Level 1C) with the MAJA software, estimating atmospheric optical properties and detecting clouds and shadows from multi-temporal information [42]. MAJA cloud masking performed 7% better than Sen2Cor and was recommended [43,44]. PS images were downloaded using the Planet API (https://developers.planet.com).

Radiometric Inconsistency
Radiometric inconsistency between PS scenes (Figure 3E,F) was mainly due to varying acquisition dates/times (<7 days) but also to differences in microsatellite spectral response [31,32]. At the scale of an S2 tile, the radiometric variability was largely expressed by orbital strips. PS strips are unique for a given satellite and a given absolute orbit. To deal with this heterogeneity, two solutions could be applied: either at the data preparation level or at the modelling level. The first solution was used by Galar et al. [29]: only the most appropriate patches of image pairs (S2 and RapidEye) were kept for neural network training. This solution required arbitrary thresholding and could potentially lead to unbalanced or incomplete training data. The second solution was to include the source of the radiometric variability, i.e., the PS strips, in the network architecture.
For PS, each "unusable data mask 2" (udm2) was converted into one mono-band binary mask (1: clear with confidence ≥ 75; 0: not clear). All scene binary masks and multi-band rasters (R, G, B, and NIR) were registered using a global linear transformation (R package RStoolbox [45]). Sub-pixel shifts in X and Y were estimated from the red bands of the two data sources (Figure 1): the PS R band (shifted) and the S2 B4 band (master). Owing to the rather small footprint of PS scenes, this method gave good results, and photointerpretation confirmed that the use of a more complex method (e.g., [46]) was not necessary. After co-registration, PS scene masks and multi-band rasters were mosaicked to cover the whole S2 tile. In these mosaics, PS scenes with the best selection criteria overlapped the others (Figure 3E,F). PS strips, being a categorical variable, were considered with a dummy approach: a multi-band mosaic composed of binary values was generated at the scale of the S2 tile, the number of bands corresponding to the number of unique PS strips. The mask and strips mosaics were resampled from 3 to 2.5 m (a resolution of which the S2 band resolutions are whole multiples) using the "nearest" method. The four-band mosaic was resampled to the same resolution using the "bicubic" method.
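The mask binarization and the dummy (one-hot) encoding of PS strips described above can be sketched as follows. The paper implements these steps in R (raster package); this Python/numpy version, with made-up function and variable names, is purely illustrative.

```python
import numpy as np

def binarize_udm2(clear_confidence, threshold=75):
    """Convert a udm2 'clear confidence percent' layer into a mono-band
    binary mask: 1 = clear with confidence >= threshold, 0 = not clear."""
    return (clear_confidence >= threshold).astype(np.uint8)

def one_hot_strips(strip_ids):
    """Dummy-encode the categorical PS strip raster into a multi-band
    binary mosaic: one band per unique strip identifier."""
    uniques = np.unique(strip_ids)
    return np.stack([(strip_ids == u).astype(np.uint8) for u in uniques], axis=0)

# Toy 2 x 2 rasters (hypothetical values)
conf = np.array([[80, 10], [75, 60]])
mask = binarize_udm2(conf)            # [[1, 0], [1, 0]]

strips = np.array([[1, 1], [2, 3]])   # three unique strips
dummies = one_hot_strips(strips)      # shape (3, 2, 2)
```

The number of dummy bands thus varies with the number of unique PS strips per S2 tile, which is why the networks are trained from scratch for each tile.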
For each S2 tile, the 1st and 99th percentiles (P1st and P99th) of each of the 10 S2 bands and 4 PS bands were computed. These percentiles were used to normalize data during NN training and to compute a custom loss.
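As an illustration of how these percentiles can support normalization (and its reverse, applied to the NN outputs), here is a Python/numpy sketch; the exact form of the paper's Equation (1) is not reproduced in this extract, so a min-max stretch over the P1st-P99th range is assumed.

```python
import numpy as np

def band_percentiles(band):
    """Return the 1st and 99th percentiles (P1st, P99th) of a band."""
    return np.percentile(band, 1), np.percentile(band, 99)

def normalize(band, p1, p99):
    """Assumed normalization: stretch values over the P1st-P99th range."""
    return (band - p1) / (p99 - p1)

def denormalize(norm, p1, p99):
    """Reverse of the normalization (used on the NN output)."""
    return norm * (p99 - p1) + p1

band = np.linspace(0.0, 1000.0, 101)   # toy reflectance values
p1, p99 = band_percentiles(band)       # 10.0 and 990.0 here
restored = denormalize(normalize(band, p1, p99), p1, p99)
```

Percentiles (rather than the min/max) make the stretch robust to the few extreme values typically present in reflectance data.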

Network Architecture
As the relationships between variables in the spectral and spatial dimensions within and between S2 and PS images were a priori unknown, convolutional neural networks (CNNs) were used to directly learn from data. The proposed CNNs aimed at super-resolving the 10 bands of S2 to 2.5 m using intrinsic spectral-spatial information of PS. More particularly, CNNs with residual learning were implemented. For networks with numerous convolutions and layers, residual CNNs are known to be safer and easier to train [24,47], notably because they limit the problem of vanishing/exploding gradients [48,49].
For NN training, S2 ground truth at 2.5 m should ideally be used, but it obviously did not exist. The Wald's protocol was thus used [50]. It consists of using, as input, properly downsampled bands and, as output, the same bands at the highest available resolution (Table 2). This protocol is based on the assumption that the relationships between variables are scale invariant. Another solution could have been the method proposed by Galar et al. [29]. However, although avoiding data downsampling, the bands of input and output would have to be equal in number and close in spectral ranges [24,29]. Two scale ratios (SRs) were applied: 4× (from 10 to 2.5 m) and 8× (from 20 to 2.5 m). Data at the desired SRs were generated by blurring with a Gaussian filter (standard deviation of 0.75 and 1.50 pixels, respectively) and then averaging over SR × SR windows [23]. The S2 10- and 20-m masks, and the PS mask and strips, were downsampled using the "mode" resampling method (no blur).
Table 2. Spatial resolution, patch size, and depth (number of channels) of input and output data for training and prediction of the two super-resolution neural networks (NNs): from 10 to 2.5 m (×4) and from 20 to 2.5 m (×8). PS, PlanetScope data; S210, Sentinel-2 10-m data; S220, Sentinel-2 20-m data. For resolutions, values not in brackets correspond to NN prediction. For patch sizes, values in brackets correspond to the overlapping pixels (both sides).
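The Wald's-protocol downsampling described above (Gaussian blur, then SR × SR averaging) can be sketched as follows; the paper performs this step in R, so this Python/scipy version is illustrative only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def wald_downsample(band, sr, sigma):
    """Downsample a 2-D band by integer scale ratio `sr`:
    Gaussian blur (std `sigma` pixels), then average over sr x sr windows."""
    blurred = gaussian_filter(band, sigma=sigma)
    h, w = blurred.shape
    # Trim to a multiple of sr, then block-average
    trimmed = blurred[:h - h % sr, :w - w % sr]
    return trimmed.reshape(h // sr, sr, w // sr, sr).mean(axis=(1, 3))

# Toy constant band: downsampling must preserve the constant value
band = np.ones((96, 96))
low4 = wald_downsample(band, sr=4, sigma=0.75)   # x4 ratio -> (24, 24)
low8 = wald_downsample(band, sr=8, sigma=1.50)   # x8 ratio -> (12, 12)
```

The blur before averaging limits aliasing, so the simulated low-resolution bands mimic a coarser sensor rather than a simple decimation.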

NN
The general network architecture was designed as the best compromise between super-resolution quality and computational efficiency. Several state-of-the-art residual-learning convolutional neural networks (RCNNs) for image super-resolution [47-49,51,52] were tested, with some adaptations to include data of both constellations, Sentinel-2 (S2) and PlanetScope (PS), with different spatial resolutions and numbers of spectral bands (6 S2 bands at 20 m, 4 S2 bands at 10 m, and 4 PS bands at 2.5 m), as well as their corresponding masks of clouds and shadows. Although the percentages of clouds and shadows were low in PS scenes (<5%), these additional data improved the local rendering of the super-resolved S2 images. PS strips were also included to normalize the pronounced radiometric inconsistencies between PS strips (Figure 3E,F).
Several configurations of super-resolution RCNNs were tested (patch size, number of filters, number of residual blocks, different loss functions, etc.) with encouraging results from the beginning. The main issue was to eliminate undesirable border effects in the S2 output patches. These effects were smaller when using sub-pixel convolutions instead of transpose convolutions [53], before or after residual learning, but still undesirable. The border effects were further reduced using sub-pixel convolution followed by "mirror" padding [54] to increase the patch size from 64 to 96 pixels (16 + 64 + 16). Small but visible differences were still present for a few pixels at the border of the output patches. The overall quality of super-resolution was not much affected, but the border effects were still disturbing for human eyes (when zooming). The proposed architecture resulted in the best radiometric normalization and super-resolution quality with no border effects.
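The "mirror" padding step can be illustrated with numpy's reflect padding: a 64-pixel patch is grown to 96 pixels (16 + 64 + 16), so that border effects fall in an area later cropped away. A minimal sketch, not the authors' code:

```python
import numpy as np

# Toy 64 x 64 patch
patch = np.arange(64 * 64, dtype=float).reshape(64, 64)

# "Mirror" padding: reflect 16 pixels on each side -> 96 x 96
padded = np.pad(patch, pad_width=16, mode="reflect")

# After prediction, cropping 16 pixels per side recovers the 64 x 64 core,
# discarding the area where border effects may occur.
cropped = padded[16:-16, 16:-16]
```

With `mode="reflect"`, the padded values mirror the patch content (e.g., the row just above the top edge duplicates the second row), giving the convolutions plausible context instead of zeros.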
Two RCNNs were implemented: one per scale ratio (4× and 8×). As the PS strips (and their number) varied between S2 tiles, the RCNNs were trained from scratch for each S2 tile (no use of pre-trained NNs or weights). The network architecture was the same for both RCNNs and was composed of three steps: data transformation (Figure 4), residual learning, and data reconstruction (Figure 5). The S2 20-m (S220), S2 10-m (S210), and PS bands were normalized separately using Equation (1). Data normalization strongly enhances the performance and learning capacity of NNs and minimizes the error. The PS mask, strips, and normalized bands were concatenated; the S210 mask and normalized bands were concatenated; and the S220 mask and normalized bands were concatenated. The S210 and S220 concatenations were upscaled using one sub-pixel convolution [51] of scale 4 (patch size from 24 to 96) and of scale 8 (patch size from 12 to 96), respectively (Table 2). The upscaled S210 and S220 data and the PS data were concatenated and then used for residual learning. The residual learning was composed of four residual dense blocks (RDBs), each composed of three convolutions and two concatenations. RDBs are described in detail in [52]. The padding was set to "same" (zero padding). The reconstruction consisted in one cropping to resize patches from 96 to 64, eliminating the data affected by the zero padding (Table 2). A last convolution was applied to get the appropriate number of channels: 4 for the ×4 NN and 6 for the ×8 NN. Finally, the output data were denormalized (i.e., reverse of Equation (1)).
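The sub-pixel convolution [51] used to upscale the S210 and S220 concatenations ends with a depth-to-space ("pixel shuffle") rearrangement of feature maps. A minimal numpy version of that rearrangement, illustrative only (the authors work in R/keras):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r),
    as in the depth-to-space step of a sub-pixel convolution."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

# Scale 4: 16 channels and a 24-pixel patch yield 1 channel at 96 pixels
feats = np.random.rand(16, 24, 24)
up = pixel_shuffle(feats, r=4)        # -> (1, 96, 96)

# Tiny 2x traceable example: channel k, position (h, w) holds k*4 + h*2 + w
small = pixel_shuffle(np.arange(16, dtype=float).reshape(4, 2, 2), 2)
```

In the full network, a convolution first produces the C·r² feature maps that this rearrangement interleaves into the upscaled grid.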
Norm(x_b) = (x_b − P1st_b) / (P99th_b − P1st_b), (1)
where x_b denotes the values of band b, and P1st_b and P99th_b denote the 1st and 99th percentiles of band b. See Table 2 for patch sizes and resolutions of input and output data.
For all convolutions in RDBs, the kernel size was 3 × 3 and the activation function was rectified linear unit (ReLU). The number of filters was 128 between RDBs and 64 inside RDBs [52]. For data transformation and reconstruction, kernel size was 1 × 1 and the activation function was linear. The total number of parameters, varying between S2 tiles and NNs, was about 2,550,000.

Network Training
The configuration was the same for both NNs. For each NN, 3000 batches were used as training data. The batch size was 32 samples, one sample corresponding to a whole set of PS, S210, and S220 patches (Table 2). For each batch, the 32 samples were randomly extracted. The Adam optimizer [55] was used with a learning rate of 1e−4 and a decay of 5e−5. The number of epochs was 100 and the number of steps per epoch was 300. For each NN, the parameters were thus updated 960,000 times (=100 × 300 × 32). When the training stopped, the loss value was at the beginning of the learning plateau [56]. A higher number of epochs, such as 150 or 200, could have slightly increased the quality, but at the cost of a longer processing time and with an increasing risk of overfitting.
Concerning the loss function, the performance and results using the mean absolute error (MAE or "l1") or the mean squared error (MSE or "l2") were not satisfactory. MSE is easier to optimize but MAE is more robust to outliers. While MSE is the dominant NN measure, it was shown that MAE performed better than MSE for single image super-resolution [57]. Considering the lower quality of radiometry between and within PS scenes, using MSE with a non-negligible proportion of outliers was problematic. On the other hand, the training time was longer using MAE and resulted in a lower overall quality. As an intermediate solution, the mean pseudo-Huber (MPH) [58] was used (Equation (2), with a delta value of 2). MPH behaves as MSE near the origin (low residuals) and as MAE elsewhere (Figure 6). Furthermore, as the value range varied significantly between the 10 S2 bands, the residuals were weighted using Equation (3) before the MPH computation.
MPH = mean(δ² (√(1 + (RS/δ)²) − 1)), (2)
where δ denotes the delta value and RS denotes the residuals (i.e., ground truth minus prediction).
WeightRS_b = RS_b × MeanRng / (P99th_b − P1st_b), (3)
where RS_b denotes the residuals for the considered band b, P1st_b and P99th_b denote the 1st and 99th percentiles of the values of band b, MeanRng denotes the mean of the value ranges of all bands (6 for S220 and 4 for S210), and WeightRS_b denotes the weighted residuals of band b injected in Equation (2) for computing MPH.
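Consistent with the definitions above, the band weighting and the mean pseudo-Huber (delta = 2) could be sketched in Python/numpy as follows; the paper implements the loss in R/keras, so this is an illustration only.

```python
import numpy as np

def weight_residuals(rs, p1, p99):
    """Weight residuals per band: rs has shape (bands, ...), p1/p99 are the
    per-band 1st/99th percentiles. Weight = MeanRng / (P99th_b - P1st_b)."""
    rng = p99 - p1                               # per-band value range
    w = rng.mean() / rng                         # MeanRng over each band range
    return rs * w.reshape(-1, *([1] * (rs.ndim - 1)))

def mean_pseudo_huber(rs, delta=2.0):
    """Mean pseudo-Huber loss: behaves like MSE for small residuals
    and like MAE (slope ~ delta) for large ones."""
    return np.mean(delta**2 * (np.sqrt(1.0 + (rs / delta)**2) - 1.0))

# Toy residuals for two bands with very different value ranges
rs = np.array([[0.1, -0.2], [3.0, -4.0]])
p1 = np.array([0.0, 0.0])
p99 = np.array([1.0, 3.0])
loss = mean_pseudo_huber(weight_residuals(rs, p1, p99))
```

The weighting rescales each band's residuals relative to the mean range, so no single wide-range band dominates the loss.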

Quality Assessment
The radiometric normalization and super-resolution quality were assessed at ground truth resolution for the two NNs (10 m for the ×4 NN and 20 m for the ×8 NN), considering the whole S2 tile (not only the training data) and using seven measures [59,60]: mean error (ME), mean absolute error (MAE), mean weighted absolute error (MWAE), root mean square error (RMSE), peak signal noise ratio (PSNR), structure similarity (SSIM), and cross correlation (CC). MWAE is similar to MAE but computed from absolute residuals divided by the band value range (P99th_b − P1st_b) and expressed in percentage. SSIM and PSNR are the most commonly considered metrics for image super-resolution [61]. SSIM measures the similarity between the signals of the ground truth and predicted images. PSNR is the ratio between the maximum possible power of the signal and the power of the corrupting noise. SSIM and PSNR were computed with a max value of 2^16. ME values close to zero; lower values of MAE, MWAE, and RMSE; and higher values of PSNR, SSIM, and CC indicate better quality. In addition, the quality was also assessed visually (photointerpretation), with attention given to the borders of output patches and to areas with variation in PS radiometry (different strips) and presence of clouds and shadows.
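Some of these measures (ME, MAE, MWAE, RMSE, and PSNR with a max value of 2^16) can be sketched in numpy as follows. This is an illustrative implementation, not the authors' code; for simplicity the band range is passed as scalars, and SSIM and CC are omitted.

```python
import numpy as np

def quality_metrics(truth, pred, p1, p99, max_val=2**16):
    """Return (ME, MAE, MWAE%, RMSE, PSNR) for one band."""
    rs = truth - pred                                   # residuals
    me = rs.mean()                                      # mean error (bias)
    mae = np.abs(rs).mean()                             # mean absolute error
    mwae = 100.0 * np.abs(rs / (p99 - p1)).mean()       # % of band value range
    mse = (rs**2).mean()
    rmse = np.sqrt(mse)                                 # root mean square error
    psnr = 10.0 * np.log10(max_val**2 / mse)            # peak signal noise ratio
    return me, mae, mwae, rmse, psnr

truth = np.array([100.0, 200.0, 300.0])
pred = np.array([110.0, 190.0, 300.0])
me, mae, mwae, rmse, psnr = quality_metrics(truth, pred, p1=0.0, p99=1000.0)
```

ME close to zero with a non-zero MAE, as in this toy case, is the signature of unbiased but imperfect predictions.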

Measured Quality
The radiometric normalization and super-resolution quality was assessed by NN (at 10 m for ×4 and 20 m for ×8) for the six S2 tiles: whole tile and all bands together (Table 3), and with a focus on PS strips (Table 4). The quality was also assessed by band (Table 5) and by main land cover (Table 6) for the two Belgian tiles ("BeS" and "BeN").
Table 3. Measured quality of the two super-resolution neural networks (NNs): from 10 to 2.5 m (×4) and from 20 to 2.5 m (×8). Overall quality (i.e., whole tile and all bands together) for the six selected S2 tiles (Figure 2) (Table 1). ME, mean error; MAE, mean absolute error; MWAE (%), mean weighted absolute error; RMSE, root mean square error; PSNR, peak signal noise ratio; SSIM, structure similarity; CC, cross correlation.
The overall quality of the six tiles was very good (Table 3), with low values of MAE, MWAE, RMSE, and MPH; high values of PSNR, SSIM, and CC; and ME values close to zero (no bias). SSIM values computed from the four predicted bands for the ×4 NN and the six predicted bands for the ×8 NN reached the maximum value of 1 (by rounding), indicating that the two images are (almost) perfectly structurally similar [57]. PSNR and SSIM values were systematically higher than those of Galar et al. [29], the only comparable study, although it used a different approach. Their highest PSNR and SSIM values were 35.5 and 0.957, respectively. The MWAE, expressing the error as a percentage of the band value ranges, varied from 1.5% to 2.9%. The quality variations between PS strips were low (Table 4), confirming that the method accurately captured and corrected the PS radiometric inconsistencies at the scale of the S2 tile. The quality of each of the 10 predicted bands was high, with no major differences (Table 5). The MWAE varied from 1% to 5.3%. Although the bands B5, B6, B11, and B12 presented no overlap with the PS bands (Figure 1), their error was close to the average.
Surprisingly, the band B8 was the least well predicted of all bands. Concerning the quality by land cover (Table 6), no major differences were highlighted, but artificial surfaces were logically less well predicted than less spectrally and spatially contrasted areas such as agricultural and forest lands.
Table 6. Measured quality of the two super-resolution neural networks (NNs), from 10 to 2.5 m (×4) and from 20 to 2.5 m (×8), by main land cover (Corine Land Cover 2018) of the two Belgian S2 tiles (BeS and BeN together) (Figure 3). ME, mean error; MAE, mean absolute error; RMSE, root mean square error; CC, cross correlation.

Visual Quality
The proposed method was also validated by photointerpretation (Figure 7). There was no border effect, even at maximum zoom. The output patches were thus not distinguishable from each other at coarser or finer resolutions. The normalization of the radiometric inconsistencies between PS strips was impressive (Figure 8): in the super-resolved images (10 S2 bands at 2.5 m), no differences were observed. A 1-pixel between-scenes boundary was sometimes visible (but with very similar spectral values). This boundary was probably indirectly induced by the training data downsampling (resampling of the PS strips mosaic using the "mode" method). The effect of clouds and shadows in PS scenes on the super-resolved images was limited, as they simply did not appear (Figure 9, Top). However, these areas were slightly affected by speckle and/or blur, locally decreasing the visual quality. Clouds and shadows in S2 tiles were fully predicted (Figure 9, Bottom).

Processing Times
The processing times using a unique computer and for a given S2 tile were on average: about 1 h 30 min for the selection, downloading, and preparation of images; about 2 × 4 h for the preparation of the 2 × 3000 training batches; about 2 × 8 h for the NN trainings; about 1 h 15 min for the prediction at the coarser resolution (four bands at 10 m and six bands at 20 m); and about 21 h for the prediction at finer resolution (10 bands at 2.5 m). The total time for the complete procedure was thus about 48 h.


Discussion
In this study, we presented and validated a novel method for super-resolving Sentinel-2 (S2) imagery (10 bands from 10 or 20 to 2.5 m). Super-resolution was achieved by fusion with additional images acquired at finer resolution by the PlanetScope (PS) constellation. The super-resolution quality was thoroughly analyzed for six S2 tiles acquired in contrasted conditions over five countries around the world, confirming that the proposed method is highly accurate and robust. The method also remarkably normalized the radiometric inconsistencies between PS micro-satellites. Super-resolution and radiometric normalization were achieved simultaneously using state-of-the-art residual convolutional neural networks (RCNNs), adapted to the particularities of S2 and PS imageries, and including the corresponding masks of clouds and shadows.
To our knowledge, only one study [29] considered a similar approach, combining S2 and RapidEye images for super-resolving three of the S2 bands (B2, B3, and B4) to 5 m, also using RCNNs, but with an architecture originally developed for conventional "RGB" images. With a spatial resolution of 2.5 m, the generalization of the procedure to 10 S2 bands (B2, B3, B4, B8, B5, B6, B7, B8a, B11, and B12), and a much higher accuracy, this study further explored the high potential of deep learning for multi-satellite multi-sensor image fusion. The proposed method is highly spatially and spectrally accurate at the scale of the considered S2 tile, but also locally, and separately for each band, PS strip (i.e., unique PS satellite and orbit), and main land cover.
Radiometric inconsistencies between PS strips are mainly related to differences in sensor spectral response, orbital configuration [31,32], and acquisition date/time. The proposed method accurately captures and corrects these radiometric variations. As the PS strips varied for each S2 tile, the RCNNs had to be trained from scratch every time (no use of pre-trained NNs or weights). Although this strongly increases robustness, the processing time could be limiting for routine use. With a single computer, 48 h per S2 tile were necessary. This processing time could be significantly decreased with code optimization and implementation on powerful cloud computing platforms, such as Google Earth Engine.
The high temporal resolution of PS imagery (revisit time of one day) is an important element. The small acquisition time differences between the S2 tile and PS scenes strongly limit land cover changes. Contrary to Shao et al. [15], fusing Landsat-8 and S2 imageries, the use of temporal series was not necessary.
We deliberately selected S2 tiles and PS scenes with low percentages of clouds and shadows (<5%). The inclusion of the S2 and PS masks in the network architecture improved the local rendering of the super-resolved images. Clouds and shadows in S2 were fully predicted. Clouds and shadows in PS scenes were not reproduced, but the corresponding areas showed speckle and/or blur effects. The proposed method should be tested with higher percentages of clouds and shadows; RCNNs should be able to learn and adapt the prediction. However, for percentages > 20%, it would probably be more appropriate to use only data free of clouds and shadows. The super-resolution quality would be identical but with a higher proportion of missing data.
The proposed method could also be applied to images of other satellite constellations. For instance, S2 could be replaced by Landsat and PS by Pleiades. However, it is important to keep in mind that the proposed method is based on the scale-invariance hypothesis [50]. We demonstrated that the 8× scale ratio (from 20 to 2.5 m) resulted in high-quality super-resolution, but the quality for the 4× scale ratio (from 10 to 2.5 m) was, as expected, higher. The maximum scale ratio is still to be determined.
The proposed RCNN architecture could probably still be improved in several ways. Data augmentation is a well-known way to improve the performance of deep networks. For image super-resolution, some of the augmentation methods highlighted by Ghaffar et al. [62] could be added. The "CutBlur" approach could also be tested [63], as well as the use of the FReLU activation function [64], instead of ReLU, for residual learning. Residual learning could first be performed separately for the S220, S210, and PS data (three branches, similarly to Wu et al. [28]) and then together (single branch). Concerning loss functions, as the pseudo-Huber resulted in better predictions and performance than the usual MAE and MSE, the robust adaptive loss function [58] looks promising.

Conclusions
The proposed method allows super-resolving 10 bands of the Sentinel-2 imagery, from 20 or 10 to 2.5 m, at a very high spatial and spectral accuracy. The method is robust and can be applied to S2 tiles acquired worldwide at any date.