Fast Super-Resolution of 20 m Sentinel-2 Bands Using Convolutional Neural Networks

: Images provided by the ESA Sentinel-2 mission are rapidly becoming the main source of information for the entire remote sensing community, thanks to their unprecedented combination of spatial, spectral and temporal resolution, as well as their associated open access policy. Due to a sensor design trade-off, images are acquired (and delivered) at different spatial resolutions (10, 20 and 60 m) according to speciﬁc sets of wavelengths, with only the four visible and near infrared bands provided at the highest resolution (10 m). Although this is not a limiting factor in general, many applications seem to emerge in which the resolution enhancement of 20 m bands may be beneﬁcial, motivating the development of speciﬁc super-resolution methods. In this work, we propose to leverage Convolutional Neural Networks (CNNs) to provide a fast, upscalable method for the single-sensor fusion of Sentinel-2 (S2) data, whose aim is to provide a 10 m super-resolution of the original 20 m bands. Experimental results demonstrate that the proposed solution can achieve better performance with respect to most of the state-of-the-art methods, including other deep learning based ones with a considerable saving of computational burden.

importantly, a sufficiently rich training dataset is provided, as the knowledge (model parameters) will come from experience (data). To the best of our knowledge, the first notable example of DL applied to the super-resolution of remote sensing images is the pansharpening convolutional neural network (PNN) proposed by Masi et al. [35], which has been recently upgraded [36] with the introduction of a residual learning block and a fine-tuning stage for target adaptivity and cross-sensor usage. Another residual network for pansharpening (PanNet) is proposed in [37]. However, none of these methods can be applied to S2 images without some architectural network adaptation and retraining. Examples of convolutional networks conceived for Sentinel-2 are instead proposed in [2,11]. In [11], the super-resolution was limited to the SWIR 20 m band, as the actual goal was water mapping by means of the modified normalized difference water index (MNDWI), for which green and SWIR bands were requested. Lanaras et al. [2], instead, collected a very large training dataset which has been used to train two much deeper super-resolution networks, one for the 20 m subset of bands and the other for the remaining 60 m bands, achieving state-of-the-art results. In related problems, for example the single-image super-resolution of natural images or other more complex vision tasks such as object recognition or instance segmentation, thanks to the knowledge hidden in huge and shared training databases, deep learning has shown really impressive results compared to model-based approaches. Data sharing has represented a key enabling factor in these cases allowing researchers to compete with each other or reproduce others' models. In light of this consideration, we believe that Sentinel-2 is a very interesting case because of the free access to data that can serve as playground for a larger scale research activity on remote sensing super-resolution or other tasks. In the same spirit, Lanaras et al. [2] pushed on the power of the data by collecting a relatively large dataset to get good generalization properties. On the other hand, complexity is also an issue that end users care about. In this regard, the challenge of our contribute is to design and train a relatively small and flexible network capable of achieving competitive results at a reduced cost on the super-resolution of the 20 m S2 bands, exploiting spatial information from the higher-resolution 10 m S2/VNIR bands. Indeed, the proposed network being lightweight, apart from enabling the use of the method on cheaper hardware, allows quickly fine-tuning it when the target data are misaligned from the training data for some reason. The proposed method for Fast Upscaling of SEntinel-2 (FUSE) images is an evolution of the proof-of-concept work presented in [38]. In particular, the major improvements with respect to the method in [38] reside in the following changes: a. Architectural improvements with the introduction of an additional convolutional layer. b. The definition of a new loss function which accounts for both spectral and structural consistency. c. An extensive experimental evaluation using diverse datasets for testing that confirms the generalization capabilities of the proposed approach.
The rest of the paper is organized as follows. In Section 2, we describe datasets and proposed method. Evaluation metrics, comparative solutions and experimental results are then gathered in Section 3. Insights about the performance of the proposed solution and related future perspectives are given in Section 4. Finally, Section 5 provides concluding remarks.

Materials and Methods
The development of a deep learning super-resolution method suited for a given remote sensing imagery involves at least three key steps, with some iterations among them: a. Selection/generation of a suitable dataset for training, validation and test; b. Design and implementation of one or more DL models; c. Training and validation of the models (b) using the selected dataset (a).
By following this rationale, for ease of presentation, in this section, we first present the datasets and their preprocessing (a), then we describe design (b) and training (c) of the proposed model.

Datasets and Labels Generation
Regardless of its complexity and capacity, a target deep learning model remains a data-driven machinery whose ultimate behavior heavily depends on the training dataset, notably on its representativeness of real-world cases. Hence, we provide here detailed information about our datasets and their preprocessing.
For the sake of clarity, let us first recall the main characteristics of the 13 spectral bands of Sentinel-2, gathered in Table 1, and clarify symbols and notations that are used in the following with the help of Table 2. Full-resolution reference (also referred to as ground truth or label), usually unavailable. x,x, r generic band of x,x, r, respectively. x, x, x hp Upsampled (via bicubic) versions of x, x, x hp , respectively. z Single (average) band of z. x ↓ , z ↓ , r ↓ ,... Reduced-resolution domain variables associated with x, z, r,..., respectively. Whenever unambiguous subscript ↓ will be dropped.
Except for some cases where unsupervised learning strategies can be applied, a sufficiently large dataset containing input-output examples is usually necessary to to train a deep learning model. This is also the case for super-resolution or pansharpening. In our case, as we decided to fuse 10 m bands (z) with 20 m (x) to enhance the resolution of x by a factor of 2 (resolution ratio), which means that we should have examples of the kind ((x, z); r), being r the desired (super-resolved) output corresponding to the composite input instance (x, z). In rare cases, one can rely on referenced data, for example thanks to ad hoc missions to collect full-resolution data to be used as reference, whereas in most cases referenced samples are unavailable.
Under the latter assumption, many deep learning solutions for super-resolution or pansharpening have been developed (e.g., [2,11,35,36,[39][40][41]) by means of a proper schedule for generating referenced training samples from the same no-reference input dataset. It consists of a resolution downgrade process that each input band undergoes which involves two steps: (i) band-wise low-pass filtering; and (ii) uniform R × R spatial subsampling, being R the target super-resolution factor. This is aimed to shift the problem from the original full-resolution domain to a reduced-resolution domain. In our case, R = 2 while the two original input components, x and z, will be transformed in corresponding variables x ↓ and z ↓ , respectively, lying in the reduced-resolution space, with associated reference r ↓ trivially given by r ↓ = x. How to filter the several bands before subsampling is an open question. Lanaras et al. [2] pointed out that with deep learning one does not need to specify sensor characteristics, for instance, spectral response functions, since sensor properties are implicit in the training data. Contrarily, Masi et al. [35] asserted that the resolution scaling should be done accounting for the sensor Modulation Transfer Function (MTF), in order to generalize properly when applied at full resolution. Such a position follows the same rationale of the so-called Wald's protocol, a procedure commonly used for generating referenced data for objective comparison of pansharpening methods [26]. Actually, this controversial point cannot be resolved by looking at the performances in the reduced-resolution space, since a network learns from training data the due relationship whatever preprocessing has been performed on the input data. On the other hand, in full-resolution domain, no objective measures can be used because of the lacking referenced test data. In this work we follow the approach proposed in [35] making use of sensor MTF. The process for the generation of a training sample is summarized in Figure 1. Each band undergoes a different low-pass filtering, prior to being downsampled, whose cut-off frequency is related to the sensor MTF characteristics. Additional details can be found in [42].
Full-resolution domain Reduced-resolution domain Figure 1. Generation of a training sample ((x ↓ , z ↓ ); r ↓ ) using Wald's protocol. All images are shown in false-color RGB using subsets of bands for ease of presentation. Each band is low-lass filtered with a different cut-off frequency according with the sensor MTF characteristics.
Another rather critical issue is the training dataset selection as it impacts the capability of the trained models to generalize well on unseen data. In the computer vision domain, a huge effort has been devoted to the collection of very large datasets in order to support the development of deep learning solutions for such diverse problems as classification, detection, semantic segmentation, tracking video and so forth (notable examples are ImageNet and Kitty datasets). Instead, within the remote sensing domain, there are no examples of datasets which are as large as ImageNet or Kitty. This is due to several obstacles, among which the cost of the data and the related labeling which requires domain experts, as well as the data sharing policy usually adopted in the past years by the remote sensing community. Luckily, for super-resolution, one can at least rely on the above-described fully-automated resolution downgrading strategy to avoid labeling costs. Due to the scarcity of data, most deep-learning models for resolution enhancement applied to remote sensing have been trained on a relatively small dataset, possibly taken from a few large images, from which non-overlapping sets for training, validation and testing are singled out [35,37,41]. The generalization limits of a pansharpening model trained on too few data have been stressed in [36], for both cross-image and cross-sensor scenarios, where a fine-tuning stage has been proposed to cope with the scarcity of data. In particular, it was shown that, for a relatively small CNN that integrates a residual learning module, a few training iterations (fine-tuning) on the reduced-resolution version of the target image allow quickly recovering the performance loss due to the misalignment between training and test sets. For Sentinel-2 imagery, thanks to the free access guaranteed by the Copernicus program, larger and more representative datasets can be collected, as done by Lanaras et al. [2], aiming for a roughly even distribution on the globe and for variety in terms of climate zone, land-cover and biome type. In this study, we opted for a lighter and flexible solution with a relatively small number of parameters to learn and a (pre-)training dataset of relatively limited size. This choice is motivated by the experimental observation that in actual application the tuning of the parameters is still recommendable even if larger datasets have been used in training, making appealing lighter solutions that can be quickly tuned if needed.
To be aligned with the work of Lanaras et al. [2], we decided to keep their setting by using Sentinel-2 data without atmospheric correction (L1C product) for our experiments. For training and validation, we referred to three scenes (see Figure 2), corresponding to different environmental contexts: Venice, Rome, and Geba River. In particular, we randomly cropped 18,996 square tiles of size 33 × 33 (at 20 m resolution) from the three selected scenes to be used for training (15,198) and validation (3898). Besides, we have choosen four more scenes for the purpose of testing, namely Athens, Tokyo, Addis Abeba, and Sydney, which present different characteristics, hence allowing for a more robust validation of the proposed model. From such sites, we singled out three 512 × 512 crops at 10 m resolution, for a total of twelve test samples.

Rome
Venice Geba River

Proposed Method
The proposed solution takes inspiration from two state-of-the-art CNN models for pansharpening, namely PanNet [37] and the target-adaptive version [36] of PNN [35], both conceived for very high resolution sensors such as Ikonos or WorldView-2/3. Both methods rely on a residual learning scheme, while main differences concern loss function, input preprocessing, and overall network backbone shape and size. Figure 3 shows the top-level flowchart of the proposed method. As we deal with Sentinel-2 images, differently from Yang et al. [37] and Scarpa et al. [36], we have 10 input bands, six lower-resolution ones (x), to be super-resolved, plus four higher-resolution bands (z). Let us preliminarily point out that we train a single (relatively small) network for each band x to be super-resolved, as represented at the output in Figure 3. However, the deterministic preprocessing bounded by the dashed box is a shared part, while the core CNN, with fixed hyper-parameters, changes from one band to another to be super-resolved. This choice presents two main advantages. The first is that whenever users need to super-resolve only a specific band, they can make use of a lighter solution with computational advantages. The second reason is related to the experimental observation that training separately the six networks allows reaching the desired loss levels more quickly than using a single wider network. This feature is particularly desirable if users need to fine-tune the network on their own dataset. Turning back to the workflow, observe that both input subsets, x and z, are high-pass filtered (HPF) as also done by PanNet. This operation relies on the intuition that the missing details that the network is asked to recover lie in the high frequency range of the input image. Next, the HPF component x hp is upsampled (R × R) using a standard bicubic interpolation, yielding x hp , in order to match the size of z hp with which to be concatenated prior to feed the actual CNN. The single-band CNN output f Φ (x, z) is therefore combined with the upsampled target band x to provide its super-resolved versionx = x + f Φ (x, z). This last combination, obtained through a skip connection that retrieves the low-resolution content ofx directly from the input, is known as residual learning strategy [43], and has soon became a standard option for deep learning based super-resolution and pansharpening [2,36,37], as it is proven to speed-up the learning process. The CNN architecture is more similar to the pansharpening models [35,36] than to PanNet [37], making use of just four convolutional layers, whereas PanNet uses ten layers, each singling out 32 features (except for the output layer). Moreover, a batch normalization layer operating on the input stack precedes the convolutional ones. This has proven to make the learning process robust with respect to the statistical fluctuations of the training dataset [44]. In Table 3, the network hyper-parameters of the convolutional layers are summarized. Once the training dataset and model are fixed, a suitable loss function to be minimized needs to be defined in order for the learning process to take place. L 2 or L 1 norms are typical choices [2,35,36,38,39] due to their simplicity and robustness, with the latter being probably more effective to speed-up the training, as observed in [2,36]. However, these measures do not account for structural consistency as they are computed on a pixel-wise basis and, therefore, assess only spectral dissimilarity. To cope with this limitation, an option is to resort to a so-called perceptual loss [45], which is an indirect error measurement performed in a suitable feature space generated with a dedicated CNN. In [37], structural consistency is enforced by working directly on detail (HPF) bands. In the proposed solution, in addition to the use HPF components, we also define a combined loss that explicitly accounts for spectral and structural consistency. In particular, inspired by the variational approach [46], we make use of the following loss function: where three terms, corresponding to fidelity, or spectral consistency (L Spec ), structural consistency (L Struct ) and regularity (L Reg ), are linearly combined. The weights were tuned experimentally using the validation set as λ 1 = 1, λ 2 = 0.1, and λ 3 = 0.01. By following the intuition proposed in [2,36], we decided to base the fidelity term on the L 1 norm, that is where the expectation E{·} is estimated on the reduced-resolution training minibatches during the gradient descent procedure. f Φ (·) stands for the CNN function (including preprocessing) whose parameters to learn are collectively indicated with Φ. This loss term, as well as the other two, refers to a single band (x ↓ ) super-resolution whose ground-truth is r ↓ = x. As the training is performed in the reduced-resolution domain, in the reminder on this section, we drop the subscript ↓ for the sake of simplicity.
The structural consistency term is given by where the operator G = (G 1 , . . . , G 4 ) generalizes the gradient operator including derivatives in the diagonal directions that help to improve quality, as shown in [46]. It has been shown that the gradient distribution for real-world images is better fit with a heavy-tailed distribution such as a hyper-Laplacian (p(x) ∝ e −k|x| p , 0 < p < 1). Accordingly, we make use of a L p -norm with p = 1/2, which we believe can be more effective [46]. This term penalizes discontinuities in the super-resolved bandx if they do not occur, with the same orientation, in the panchromatic band. As the dynamics of these discontinuities are different, an additional prior regularization term that penalizes the total variation ofx helps to avoid unstable behaviors: Eventually, the network parameters were (pre-)trained by means of the Adaptive Moment Estimation (ADAM) optimization algorithm [47] applied to the above-defined overall loss (Equation (1)). In particular, we have set the ADAM default hyper-parameters, which are learning rate, η = 0.002, and decay rate of the first and second moments, β 1 = 0.9 and β 2 = 0.999, respectively [48]. The training was run for 200 epochs, being an epoch a single pass over all minibatches (118) in which the training set has been split, with each minibatch composed of 128 33 × 33 input-output samples.

Experimental Results
In this section, after a brief recall of the accuracy evaluation metrics (Section 3.1) and of the comparative methods (Section 3.2), we provide and discuss numerical and visual results (Section 3.3).

Accuracy Metrics
The quality assessment of pansharpening algorithms can be carried out in two frameworks, with or without ground-truth. Since normally the ground-truth is unavailable, the former context refers to the application of Wald's protocol [42], which is the same process used for the generation of training samples, as described in Section 2.1. Therefore, this evaluation frame, hereinafter referred to as reference-based, applies in the reduced-resolution domain and allows one to provide objective quality measurements. Because of the resolution shift (downgrade), the reference-based evaluation approach has a limited extent and it is therefore custom to complement it with a full-resolution assessment, referred to as the no-reference one, aimed to give qualitative measurements at full resolution.
In particular, we use the following reference-based metrics: • Universal Image Quality Index (Q-Index) takes into account three different components: correlation coefficient, mean luminance distance and contrasts [49]. • Erreur Relative Globale Adimensionnelle de Synthése (ERGAS) measures the overall radiometric distortion between two images [50]. • Spectral Angle Mapper (SAM) measures the spectral divergence between images by averaging the pixel-wise angle between spectral signatures [51].

•
High-pass Correlation Coefficient (HCC) is the correlation coefficient between the high-pass filtered components of two compared images [52].
On the other hand, as no-reference metrics, we use the following [26,53]: • Spectral Distortion (D λ ) measures the spectral distance between the bicubic upscaling of the image component to be super-resolved,x, and its super-resolution,x. • Spatial Distortion (D S ) is a measurement of the spatial consistency between the super-resolved imagex and the high-resolution component z.

•
Quality No-Reference (QNR) index is a combination of the two above indexes that accounts for both spatial and spectral distortions.
For further details about the definition of the above metrics, the reader is referred to the associated articles.

Compared Methods
We compared our FUSE method with several state-of-the-art solutions. On the one side are classical approaches for pansharpening, properly generalized to the case of Sentinel-2, such as the following: • Generalized Intensity Hue Saturation (GIHS) method [20]. • Brovey transform-based method [54]. Gram-Schmidt algorithm with Generalized Laplacian Pyramid decomposition (GS2-GLP) [57].
Detailed information about these approaches can be found in the survey work of Vivone et al. [26]. Besides, we also compared with the following deep learning approaches native for Sentinel-2 images, including two ablations of our proposal:

•
Our previous CNN-based method (M5) proposed in [11], extended (training from scratch) to all six 20 m bands. • The CNN model (DSen2) proposed in [2], which is much deeper than ours and has been trained on a very large dataset.
• An enhancement of M5 where High-Pass filtering on the input and other minor changes have been introduced (HP-M5) [38], which represents a first insight on the improvements proposed in this work. • FUSE with only three layers instead of four. • FUSE trained using the L 1 norm without regularization and structural loss terms.

Numerical and Visual Results
To assess the performance of the proposed method, we collected twelve 512 × 512 images (at 10 m resolution) from four larger images taken over Athens, Adis Abeba, Sydney and Tokyo, respectively, from which no training or validation samples were extracted.
Numerical figures were computed for all compared methods on each test image. The average measures over the dataset are gathered in Table 4. Reference-based accuracy indicators shown on the left-hand side of the table are computed in the reduced-resolution space and provide objective measurements of the reconstruction error. Overall, we can see that the proposed FUSE method performs slightly better than DSen2 and outperforms all compared solution on three out of four indicators. On the other hand, M5 and ATWT-M3 show a slightly better spectral preservation compared to FUSE according to the Spectral Angle Mapper indicator.
As reduced-resolution data do not fully reproduce statistical fluctuations that may occur in the full resolution context, a common choice is to complement the low-resolution evaluation with a full-resolution assessment that, however, does not rely on objective error measurements. In particular, we resort to three well-established indicators that are usually employed in the pansharpening context: the spectral and spatial distortions, D λ and D S , respectively, and their combination, the QNR. According to these indicators, shown on the right-hand side of Table 4, the proposed method, again, outperforms the competitors. A slightly better spectral preservation is given by HP-M5, M5 and ATWT-M3. Table 4. Accuracy assessment of several super-resolution methods. On top are model-based approaches and DL methods are on the bottom, including the proposed FUSE method. Let us now look at some sample results starting from the full-resolution context. Figures 4 and  5 show some of the 512 × 512 images used for test, associated with urban and extra-urban contexts, respectively. For the sake of visualization, we use RGB false-color subsets of z and x. In particular, we use three out of four bands of z (B2, B3 and B4), and three out of six bands of x (B5, B8a and B11-see Table 1). The input components z and x are shown on the left and middle columns, while the super-resolutionx obtained with the proposed method is shown on the right.

Reference-Based
Although at a first glance these results look pretty nice, a different observation scale would help to gain insight the behavior of the compared solutions. Therefore, in Figure 6, we show some zoomed details with the corresponding super-resolutions using different selected methods. In particular, for the sake of simplicity, we have restricted the visual inspection to the most representative DL and not DL approaches according to both reference-based and no-reference indicators reported in Table 4. A careful inspection reveals that some model-based approaches provide higher detail enhancement compared to DL methods. However, it remains difficult to appreciate the spectral preservation capability of the different methods due to the lack of objective references.   Actual errors can be visualized in the reduced-resolution domain instead. In Figure 7, we show in particular a few meaningful details processed in such a domain. For each sample, the composite input ( x ↓ , z ↓ ) is shown in the leftmost column, followed by the reference ground-truth r ↓ . Then, Columns 3-7 show a few selected solutions (odd rows) with the corresponding error maps (even rows) obtained as difference between the super-resolved image and the reference,x ↓ − r ↓ . As it can be seen, the DL methods perform pretty well in comparison with model based approaches as the error map is nearly constant gray, whereas for PRACS and ATWT-M3 visible piece-wise color shifts are introduced. This observation does not contrast with the good values of SAM obtained by PRACS, since this indicator accounts for the relative color/band proportions but not for their absolute intensity (some "colorful" error maps in Figure 7 are partially due to the band-wise histogram stretching used for the sake of visualization). Overall, by looking at both numerical accuracy indicators and visual results, in both reduced-and full-resolution contexts, the proposed method provides state-of-the-art results on our datasets, as does DSen2. z x PRACS ATWT-M3 HP-M5 DSen2 FUSE Figure 6. Full-resolution results for selected details. For each detail (row) from left to right are shown the two input components to be fused, followed by the corresponding fusions obtained by compared methods.

Discussion
To assess the impact of the proposed changes with respect to the baseline HP-M5, an additional convolutional layer and a composite loss that adds a regularization term and a structural term to the basic spectral loss (L 1 -norm), we also carried out an ablation study. In particular, we have the three-layer scaled version of FUSE and the four-layer version trained without regularization and structural loss terms. These two solutions are also reported in Table 4. As can be seen, except for the SAM index, the full version of FUSE outperforms consistently both scaled versions, with remarkable gains on ERGAS, in the reference-based framework, and on the spatial distortion D S , in the no-reference context. Focusing on the two ablations, it seems that the use of the composite loss has a relatively better impact compared to the network depth increase. This is particular evident looking at the SAM indicator.
The experimental evaluation presented above confirms the great potential of the DL approach in the context of the data fusion problem at hand, as already seen for pansharpening [35] and single-image super-resolution of natural images [39] a few years ago. The numerical gap between DL methods and the others is consistent and confirmed by visual inspection. In particular, we observe that the use of the additional structural loss term, the most relevant change with respect to our previous models M5 and HP-M5, allowed us to reach and slightly overcome the accuracy level of DSen2. Beside accuracy assessment, it is worth focusing on the related computational burden. DL methods, in fact, are known to be computationally demanding, hence potentially limited for large-scale applicability. Thus, we focused from the beginning on relatively small CNN models. Indeed, the proposed model involves about 28K parameters in contrast to DSen2 which has 2M parameters. In Table 5, we gather a few numbers obtained experimentally on a single GPU Quadro P6000 with 24 GB of memory. For both the proposed and DSen2, we show the GPU memory load and the computational time for the inference with respect to the image size. As the proposed model is replicated, with different parameters, for each of the six bands to be super-resolved, we assume either a sequential GPU usage (as done in the table) or a parallel implementation, therefore with 6× memory usage but also 6× faster processing. In any case, to have a rough idea of the different burden, it is sufficient to observe that, by using about one third of the memory necessary for DSen2 to super-resolve a 512 × 512 image, FUSE can super-resolve a 16× larger image (2048 × 2048) in the same time slot. In addition, it also has to be considered that, in many applications, the user may be interested in super-resolving a single band, hence saving additional computational and/or memory load. Finally, this picture does not consider the less critical training phase or an eventual fine-tuning stage, which would further highlight the advantage of using a smaller network. To have a rough idea of this, we recall that, according to Lanaras et al. [2], DSen2 was trained in about three days on a NVIDIA Titan Xp 12 GB GPU, whereas the training of our model took about 3 h using a Titan X 12 GB.

Conclusions
We presented and validated experimentally a new CNN-based super-resolution method for the 20 m bands of Sentinel-2 images, which blends high-resolution spatial information from the 10 m bands of the same sensor. The proposed network is relatively small compared to other state-of-the-art CNN-based models, such as DSen2, achieving comparable accuracy levels in both numerical and subjective visual terms. Overall, it is worth noticing that DL methods overcome model-based approaches especially in terms of spectral distortion (see Figure 7), which is rather interesting considering that the two band sets to be fused are only partially overlapped/correlated, as can be seen in Table 1. In light of this, it will be interesting to explore in our future work the extension to 60 m bands of the proposed approach.