Single-Image Super-Resolution of Sentinel-2 Low Resolution Bands with Residual Dense Convolutional Neural Networks

Abstract: Sentinel-2 satellites have become one of the main resources for Earth observation images because they are free of charge, offer great spatial coverage, and have a high temporal revisit. Sentinel-2 senses the same location at different spatial resolutions, generating a multi-spectral image with 13 bands at 10, 20, and 60 m/pixel. In this work, we propose a single-image super-resolution model based on convolutional neural networks that enhances the low-resolution bands (20 m and 60 m) to reach the maximal resolution sensed (10 m) simultaneously, whereas other approaches train two independent models, one for each group of LR bands. Our proposed model, named Sen2-RDSR, is made up of Residual in Residual blocks that produce two final outputs at maximal resolution, one for the 20 m/pixel bands and the other for the 60 m/pixel bands. Training is done in two stages, first focusing on the 20 m bands and then on the 60 m bands. Experimental results using six quality metrics (RMSE, SRE, SAM, PSNR, SSIM, ERGAS) show that our model outperforms other state-of-the-art approaches, and that it is effective and suitable as a preliminary step for land and coastal applications, such as studies involving pixel-based classification for Land-Use-Land-Cover or the generation of vegetation indices.


Introduction
Managed by the European Space Agency (ESA), the Sentinel-2 satellites play an important role in today's remote sensing as they provide multispectral optical imagery that can be used for several applications such as land cover-land use monitoring, change detection, vegetation and soil analysis, and mapping of physical variables, among others. Among its characteristics are a considerable surface coverage, a high revisit frequency [1], and the free availability of the data (at [2]), which democratizes imagery for research and enables both free and commercial products, making the mission increasingly useful for Earth observation.
Each satellite provides 13 bands: 4 high-resolution (HR) bands with 10 m/pixel, 6 low-resolution (LR) bands with 20 m/pixel, and 3 very low-resolution (VLR) bands with 60 m/pixel. Due to the spectral variety and the wide swath of 290 km, the satellites produce nearly 23 TB/day of data, resulting in a vast amount of data that needs to be stored [3]. Besides storage issues, other reasons for designing sensors at different scales were transmission bandwidth constraints or the selection of bands that do not require HR imagery, among others [4].
Spatial resolution is a fundamental parameter for the analysis of remote sensing imagery. It is defined as the minimum distance in which two separated objects are distinguishable, and depends on several factors, for instance altitude, distance, and the quality of the instruments [5]. Another relevant feature, in line with spatial resolution, is the Ground Sampling Distance (GSD), which is the surface of the Earth represented by a pixel [6].
One of the first options for spatial enhancement is interpolation, such as linear and bicubic. Interpolation is simple and fast, but the resulting image is often blurry and of low quality [4].
Some platforms often carry two instruments on-board, a high-resolution sensor called panchromatic (PAN), as well as a low-resolution multi-spectral (MS) sensor. The panchromatic band has higher spatial resolution than the MS bands but a wider spectral bandwidth usually overlapping the spectral range of the MS. A common practice is to use this HR band to enhance the LR MS channels using pansharpening techniques [18]. This is not the case with Sentinel-2 satellites, as they do not carry a panchromatic sensor on-board. Instead, the multispectral instrument on-board Sentinel-2 provides data at three different spatial resolutions in the VIS-NIR and SWIR, whose spectral ranges do not overlap. In any case, some pansharpening techniques were proposed for super-resolving the LR bands using individual 10 m bands or a synthetic panchromatic band [19][20][21][22][23][24].
Some analytical methods were also proposed to improve the Sentinel-2 LR bands. For instance, Wang et al. [25] presented a fusion algorithm that extends two common approaches: component substitution and multi-regression analysis. They fuse HR and LR bands to produce SR bands using a method called area-to-point regression kriging (ATPRK), which is computationally efficient and better preserves the spectral information. This method was previously applied to the MODIS satellite and adapted for Sentinel-2. In [26], Brodu proposed to super-resolve the LR bands combining band-inherent information, to maintain spectral coherency, and independent geometric information common to all bands.
However, lately, machine learning approaches have shown to outperform analytical methods. Lanaras et al. [4] proposed a CNN with skip connections between feature maps (resblocks), mainly focusing on detail differences between the LR and HR bands rather than learning a direct mapping. In this manner, the model learns to sharpen the spatial resolution of LR bands by combining the information of HR and LR bands. The LR bands are first upsampled with bicubic interpolation and concatenated with HR bands before entering the network.
Zhang et al. [27] proposed to use a Generative Adversarial Network (GAN) [28] for super-resolving the LR bands. The generator uses a similar approach as in [4] but is enhanced with more residual blocks and trained adversarially. Zhu et al. [16] used a similar methodology but, instead of resblocks, proposed a channel-attention mechanism [29] to better exploit the interdependence among channels of the feature maps and to let the network focus on more informative details. On the other hand, Zhang et al. [3] proposed a model combining resblocks with self-attention mechanisms. In addition, they proposed a distributed training procedure suitable for high-performance environments, such as supercomputers, that achieves state-of-the-art results, speeding up the training process while keeping the performance loss minimal for both models. All these models improved the baseline performance established by [4] using a dataset proposed for that purpose.
Inspired by Single Image Super-Resolution (SISR) models, Liebel et al. [7] adapted a Super-Resolution CNN (SRCNN) [30] to work with the HR and LR bands of Sentinel-2. Wagner et al. [8] followed a similar approach, adopting a Very Deep Super-Resolution model (VDSR) [31] and Palsson et al. [32] proposed a modified version of the generator of Super-Resolution Generative Adversarial Network [33]. They were more interested in assessing the effects of hyper-parameters rather than obtaining optimal results. Gargiulo et al. [34] proposed a CNN modified from a model, originally designed for pansharpening, which was fast in inference mode. Wu et al. [35] proposed a Parallel Residual Network that processes each band independently, in parallel with resblocks, before fusing the feature maps and adding the output to the bicubic upsampling of the LR band that is super-resolved. These models used different datasets, hindering the comparison with previous models.
To the best of our knowledge, all works that tackle the sharpening of both sets of LR Sentinel-2 bands use two independent CNN models, one to enhance the 20 m bands and the other for the 60 m bands. In this work, we propose a model that can produce the SR of all the LR bands (20 m and 60 m) using only one single network architecture. Our proposal splits the training into two stages, first for the 20 m bands and then for the 60 m bands, using Residual in Residual Dense Blocks (RRDB) [36] as core blocks, which enhance the resblocks proposed in [4] and adopted in other works by reusing feature maps from all previous layers. We start training using a group of RRDBs that mainly focus on super-resolving the 20 m bands. Then, we expand our model by adding more RRDB blocks that focus on the 60 m bands. More details about the proposal are discussed in Section 3.2.

Dataset
The Copernicus Sentinel-2 mission comprises a constellation of two polar-orbiting satellites, Sentinel-2A launched in 2015 and Sentinel-2B launched in 2017. Both satellites fly in the same orbit with a phase of 180 degrees, yielding a high revisit frequency (around 5 days), and are equipped with a Multi-Spectral Instrument that records images at the nadir in 13 multi-spectral bands at 3 different spatial resolutions (10 m, 20 m, and 60 m), covering spectral ranges from the visible to the shortwave infrared. Technical details of each band can be seen in Table 1.

Table 1. Spatial and spectral characteristics of Sentinel-2 bands. Source: [37].

The main applications of the LR and VLR bands are devoted to environmental studies, vegetation and land cover mapping, discrimination of snow, ice, and clouds, as well as the retrieval of water-vapor, cirrus, and aerosol information, thus enabling a wide range of Earth observation applications [38].

As indicated, Sentinel-2 images are freely available at [2]. A dataset of Sentinel-2 Level-1C images for sharpening the LR and VLR bands to 10 m was proposed in [4], covering diverse regions of the world and spanning different climate zones, land covers, and biome types. In this work, for comparison purposes, we use a dataset composed of 60 images (45 for training and 15 for testing), performing experiments using the same train-test split. Due to the unavailability of ground-truth images (i.e., images at a higher resolution than the original), we assume that spatial details are self-similar and scale-invariant across all bands, as considered in previous works [4,16,35]. Thus, to generate pairs of input-target images for training and testing, we applied Wald's protocol [39], where images are down-sampled by first applying a Gaussian filter and then decimating according to the desired scaling factor.

The architecture is formed by two branches: one that produces the SR of the 20 m bands (SR20), using as input the original 20 m and 10 m images, and a second branch that generates the SR of the 60 m bands (SR60) from the original 60 m bands, the 10 m bands, and the super-resolved 20 m bands obtained by the first branch.
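Wald's protocol mentioned above can be sketched in a few lines. The Gaussian filter width below (sigma tied to the scaling factor) is an assumption for illustration; the exact kernel used in the paper is not specified here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def walds_protocol(bands, scale):
    """Degrade an image following Wald's protocol: Gaussian low-pass
    filtering followed by decimation by the scaling factor.

    bands: float array of shape (C, H, W); scale: integer factor (2 or 6).
    sigma = scale / 2 is an assumed heuristic, not the paper's exact filter.
    """
    sigma = scale / 2.0
    blurred = np.stack([gaussian_filter(b, sigma=sigma) for b in bands])
    return blurred[:, ::scale, ::scale]  # decimate by the scaling factor
```

The degraded output then serves as the network input, while the original band acts as the target.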

Proposed Model
The two branches are composed of a Residual Dense block (RD) that includes a series of Residual in Residual Dense Blocks (RRDB) with shared weights. Residual in Residual Dense Blocks were first proposed in [36] for SISR tasks.
The RRDB extracts feature information of high complexity, aiming to recover details in LR images. The RRDB is a combination of three Dense Blocks (Figure 2), where features are reused, allowing a synergistic effect and boosting the recovery of residual details. In addition, a long skip-connection within the block maintains coherence with the input image. The scalar x_b, usually a value between 0 and 1, acts as residual scaling: feature maps are multiplied by this value, essentially scaling them down to maintain stability during training [40]. The value of x_b is fixed for all dense blocks. Several remote sensing applications have adopted this block structure due to its great performance in enhancing low-resolution images [17,18,41,42]. The Dense Block [43] within an RRDB (Figure 3) is a combination of five 2D convolutional layers connected in a feed-forward manner, where each layer's input comes from all its preceding layers. In this way, features are learned more effectively and performance is improved by using the hierarchical information from all previous layers. Each convolutional layer has 32 filters with 3 × 3 kernels, stride 1, and Leaky-ReLU as activation function.

Figure 3. Dense Block (Conv2D + Leaky-ReLU layers with dense connections). An RRDB is composed of three dense blocks scaled by x_b and combined, as depicted in Figure 2.
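A minimal PyTorch sketch of the Dense Block and RRDB described above. The five-layer dense connectivity, 32 filters of 3 × 3, Leaky-ReLU, and the x_b scaling follow the text; the exact wiring of the residual scaling and the 128-channel width of the trunk are assumptions based on the architecture description.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five 3x3 conv layers (32 filters, stride 1, Leaky-ReLU) with dense
    feed-forward connections: each layer sees all preceding feature maps."""
    def __init__(self, channels=128, growth=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth, growth if i < 4 else channels,
                      kernel_size=3, padding=1)
            for i in range(5)
        ])
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                feats.append(self.act(out))
        return out  # last conv maps back to `channels` outputs

class RRDB(nn.Module):
    """Residual in Residual Dense Block: three dense blocks, each scaled by
    x_b and added back, plus a long skip-connection over the whole block."""
    def __init__(self, channels=128, x_b=0.2):
        super().__init__()
        self.blocks = nn.ModuleList([DenseBlock(channels) for _ in range(3)])
        self.x_b = x_b

    def forward(self, x):
        out = x
        for block in self.blocks:
            out = out + self.x_b * block(out)  # dense block with residual scaling
        return x + self.x_b * (out - x)        # long skip over the whole block
```

The dense connectivity is visible in the growing input width of each conv layer (channels + i × 32), which is what lets later layers reuse all earlier feature maps.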
In the first branch, the LR (20 m) image is first upsampled using bicubic interpolation to match the size of the HR (10 m) bands. Then, all bands are concatenated (C) and processed by a 2D convolutional layer that acts as a shallow feature extractor. This layer has 128 filters with 3 × 3 kernels and stride 1. The feature maps then pass through three RRDB blocks with shared weights and, finally, are reconstructed with a 2D convolution that produces the output image with the corresponding number of channels, matching, in this case, the number of 20 m bands. Next, the output features are added to the bicubic-interpolated LR bands, thus reducing the spectral distortion in the SR bands.
The architecture of the second branch is similar to the first one but also introduces the super-resolved 20 m bands, which are concatenated (C) with the HR (10 m) bands and the bicubic-interpolated VLR (60 m) bands. The feature maps are processed by the sequence of three RRDB blocks and a convolutional layer, and the result is added to the bicubic-interpolated VLR image to obtain the final SR image with minimal spectral distortion.
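The branch structure described in the two paragraphs above can be sketched as follows. A plain conv + Leaky-ReLU stack stands in for the RRDB of the paper so the sketch is self-contained; band counts in the usage below (6 LR bands guided by 4 HR bands) match the Sentinel-2 groupings, but this is an illustrative outline, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRBranch(nn.Module):
    """One branch: bicubic-upsampled LR bands are concatenated with the guide
    bands, passed through a shallow feature extractor (128 filters, 3x3), a
    block applied three times with shared weights, and a reconstruction conv
    whose output is added back to the bicubic LR image."""
    def __init__(self, lr_bands, guide_bands, feats=128):
        super().__init__()
        self.shallow = nn.Conv2d(lr_bands + guide_bands, feats, 3, padding=1)
        # stand-in for the RRDB described in the text; shared weights
        self.block = nn.Sequential(
            nn.Conv2d(feats, feats, 3, padding=1), nn.LeakyReLU(0.2))
        self.reconstruct = nn.Conv2d(feats, lr_bands, 3, padding=1)

    def forward(self, lr, guide, scale):
        up = F.interpolate(lr, scale_factor=scale, mode='bicubic',
                           align_corners=False)
        x = self.shallow(torch.cat([up, guide], dim=1))
        for _ in range(3):                 # three passes, shared weights
            x = self.block(x)
        return up + self.reconstruct(x)    # residual add limits spectral distortion
```

For the second branch, the guide tensor would concatenate the 10 m bands with the SR20 output of the first branch, and `scale` would be 6 instead of 2.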

Training Details
The training is done in two stages. First, we train the model to generate the super-resolution of the 20 m bands; then, in a second stage, we train the model for super-resolving the 60 m bands. In the second stage, the weights of all layers learned in the first stage are frozen to avoid changes during training. Note that although the training is done in two stages, inference produces the 20 m and 60 m bands at once.
In the first training stage, after applying Wald's protocol, bands with 20 m and 40 m (originally at 10 m and 20 m, respectively) are used as inputs, while in the second stage, bands with 60 m, 180 m, and 360 m (initially at 10 m, 20 m, and 60 m, respectively) are used. In both stages, the original bands are used as target images. After the generation of the input-target image pairs, from each image we select 8000 random patches of 32 × 32 pixels to train the SR20 branch and 500 random patches to train the SR60 branch. We use a 90-10% split for creating the training-validation subsets. For testing, each image of the test set is cropped into non-overlapping patches of 32 × 32 pixels for the SR20 model and 192 × 192 for the SR60 model. Each image is input to the model and, finally, reconstructed to compute the quantitative metrics.
Models are trained for 500 epochs with early stopping, using the Adam optimizer and the L1-norm as loss function. In each training stage, only the corresponding SR output is considered when calculating the loss. The learning rate is 2 × 10^-4 for SR20 and 5 × 10^-5 for SR60, with cosine annealing as the learning-rate scheduler. Gradient clipping [44] is used, and the scaling factor in the RRDB block is set to x_b = 0.2 to maintain training stability. A batch size of 64 is set due to memory restrictions, training on two Nvidia RTX-2080 GPUs with 11 GB of VRAM each, using PyTorch Lightning.
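The optimizer, scheduler, loss, and clipping described above can be sketched as a training step. The stage-dependent learning rates, L1 loss, cosine annealing, and gradient clipping follow the text; the scheduler horizon (T_max=500) and the clipping max-norm of 1.0 are assumptions.

```python
import torch

def configure_training(model, stage):
    """Adam + cosine annealing + L1 loss as described in the text.
    lr is 2e-4 for the SR20 stage and 5e-5 for SR60; T_max=500 (the epoch
    budget) is an assumption."""
    lr = 2e-4 if stage == 'SR20' else 5e-5
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
    criterion = torch.nn.L1Loss()
    return optimizer, scheduler, criterion

def training_step(model, optimizer, criterion, inputs, target, clip=1.0):
    # Only the SR output of the current stage contributes to the loss;
    # gradient clipping (max-norm value assumed) keeps training stable.
    optimizer.zero_grad()
    loss = criterion(model(inputs), target)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    return loss.item()
```

In the second stage, the parameters of the first-stage layers would additionally be frozen (e.g., `requires_grad_(False)`) before building the optimizer.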

Quantitative Metrics
Several metrics are considered for the quantitative evaluation of the SR bands: RMSE, SRE, SAM, PSNR, SSIM, and ERGAS. In the following, we denote Y as the target image, X as the SR image, both with B channels and (H, W) as spatial dimensions, µ and σ are the mean and standard deviation, respectively, and E is the expected value.
• Root Mean Square Error (RMSE): measures the mean error in the pixel-value space.
• Signal to Reconstruction Ratio Error (SRE) [4]: measures the relative error in reference to the power of the signal, in dB, where the higher, the better (n is the number of pixels).
• Spectral Angle Mapper (SAM) [45]: measures the spectral fidelity between two images. It is expressed in radians, where smaller angles represent higher similarities.
• Peak Signal to Noise Ratio (PSNR): it is one of the standard metrics used to evaluate the quality of a reconstructed image. Here, MaxVal takes the maximum value of Y. Higher PSNR, generally, indicates higher quality.
• Structural Similarity (SSIM) [46]: measures the similarity of two images by considering three aspects: luminance, contrast, and structure. SSIM takes into consideration the mean (µ) and variance (σ) of the images, where a value of 1 corresponds to identical images. The constants C_1 = (k_1 L)^2 and C_2 = (k_2 L)^2 depend on the dynamic range (L) of the pixel values (k_1 = 0.01 and k_2 = 0.03 are used by default).
• Erreur relative globale adimensionnelle de synthèse (ERGAS) [47]: measures the quality of the reconstructed image considering the scaling factor (S) and the normalized error per channel (B). Lower values imply higher quality.
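The metrics above can be computed directly from the definitions. These NumPy sketches follow common conventions (e.g., SRE uses the squared mean of the target as the signal power, one usual reading of the definition in [4]); exact normalizations may differ from the authors' evaluation code.

```python
import numpy as np

def rmse(x, y):
    """Root mean square error in the pixel-value space."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y):
    """Peak signal-to-noise ratio in dB; MaxVal is the maximum of the target."""
    return float(20 * np.log10(y.max()) - 10 * np.log10(np.mean((x - y) ** 2)))

def sre(x, y):
    """Signal-to-reconstruction-error ratio in dB: signal power (squared mean
    of the target, one common convention) over the mean squared error."""
    return float(10 * np.log10(np.mean(y) ** 2 / np.mean((x - y) ** 2)))

def sam(x, y):
    """Spectral angle mapper in radians, averaged over pixels.
    x, y: arrays of shape (B, H, W); the angle is taken per pixel spectrum."""
    num = np.sum(x * y, axis=0)
    den = np.linalg.norm(x, axis=0) * np.linalg.norm(y, axis=0)
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

def ergas(x, y, scale):
    """Relative dimensionless global error in synthesis; lower is better.
    Per-band MSE is normalized by the squared band mean, averaged over the
    B bands, and weighted by the scaling factor S."""
    band_err = np.mean((x - y) ** 2, axis=(1, 2)) / np.mean(y, axis=(1, 2)) ** 2
    return float(100.0 / scale * np.sqrt(np.mean(band_err)))
```

SSIM is omitted here since windowed implementations (e.g., with an 11 × 11 Gaussian window) are readily available in image libraries.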

Super-Resolution Results
Tables 2 and 3 show the average results for both SR tasks on the test set. Each table was obtained in the corresponding training stage, where Wald's protocol [39] was also applied to the test set with the proper scaling factor. We compare our results with bicubic upsampling and with two state-of-the-art models that use the same dataset: DSen2 [4] and Zhang et al. [3]. As can be noticed in Table 2, we outperformed both models in four of the metrics considered and tied in the other two, with an improvement over the second-best model of 0.61 in RMSE, 0.19 in SRE, and 0.2 in PSNR. In the case of the 60 m bands, the results in Table 3 show that we outperformed in three metrics, tied in two, and lost in one (PSNR). We achieved a 1.11 decrease in RMSE and an increase of 0.16 in SRE, but the difference in PSNR was 0.84. Visual comparisons are also presented in Figures 4 and 5, using test-set patches with different spatial and spectral content; the DSen2 images were obtained using its public repository (https://github.com/lanha/DSen2/tree/master/models). It is worth noting that, visually, the SR images are very similar to the target, reflecting the low errors reported in Tables 2 and 3. To help understand the differences between the models, we plotted the absolute errors between the results and the targets in Figures 6 and 7.
If we compare the absolute error between the bicubic upsampling and the DL models, there is a significant difference, which supports the use of DL algorithms for super-resolution. In general, most of the error maps present dark-blue areas that correspond to a small range of values between 0 and 50 (whereas the original reflectance values range from 0 to more than 10,000). To provide a more detailed analysis, Tables 4 and 5 show the quantitative data at band level for RMSE and SRE, since these were the only metrics reported for the other models.
Table 4 shows that, in terms of RMSE, our model outperforms the others in three of the six bands, with a decrease in error for B5, B8A, and B11. The improvement is noticeable for B8A and B11, where the error drop was more than 25% and 30%, respectively. However, the largest gap in favor of the second-best model was in B12, where the difference was around 24%. Regarding the SRE metric, our model consistently outperforms the others in all bands. Analyzing the results in Table 5, we obtained improvements in both bands for RMSE, while for SRE we outperformed in B1 but lost in B9.

Applications
Having shown the excellent performance of our approach for super-resolving Sentinel-2 bands, we present some applications that can benefit from the proposed SR model. It is important to point out that the analysis presented in this section is simply intended to visually demonstrate the benefits of working with the enhanced bands and not to include a detailed quantitative study.
In general, using more spectral channels allows a better determination of an object's spectral signature [48]; therefore, in the context of semantic segmentation, this could lead to more accurate segmentation results [49][50][51]. Usually, fully annotated ground-truth images are necessary for training deep learning models. However, dense annotations are time-consuming and expensive to generate, and, for some remote sensing applications, labeling requires expert knowledge or on-the-ground surveys. An alternative solution, at least for proof-of-concept studies such as the ones presented in this work, is to rely on Support Vector Machines (SVM). This machine learning technique performs well even when trained with few and sparse labeled data [52][53][54][55]. In this section, SVM is applied to show the benefits of using the sharpened Sentinel-2 20 m and 60 m bands in different scenarios.
To illustrate the convenience of using the sharpened 20 m bands, we trained an SVM classifier to perform Land-Cover Land-Use (LULC) classification on the image shown in Figure 8a. The Sentinel-2 image belongs to the dataset mentioned in Section 3.1 and was acquired on 30 December 2016. To generate the SVM maps, an SVM with a radial basis kernel was trained using 200 pixels per class.
Four classes can be appreciated in the selected scene: vegetation (green), sand (orange), clouds (white), and shadows (black). Segmentation maps obtained using only the 10 m bands and using the 10 m plus the SR 20 m bands are presented in Figure 8c,d. Reddish pixels in Figure 8b correspond to vegetation areas. As can be seen, adding the information provided by the SR 20 m bands yields more discriminative results, especially in vegetation zones, as the 20 m bands were specially designed to monitor vegetation covers.
Another application that highlights the benefits of improving the GSD of the Sentinel-2 bands is the computation of index maps, which are useful for environmental studies. In Wang and Qu [56], the Normalized Multi-band Drought Index (NMDI), which takes into account the soil moisture background, was proposed to monitor potential drought conditions using three specific 20 m bands of Sentinel-2 {B8A, B11, B12} (Equation (7)). This index uses the difference between two liquid-water absorption bands in the shortwave-infrared region as a measure of water sensitivity in vegetation and soil.
NMDI is commonly used in agriculture [57], fire monitoring [58], forest analysis [59], etc. We have chosen a small agricultural zone to better show the performance of super-resolution (Figure 9c). We can observe from the NMDI maps that vegetation zones contrast better with the soil when the maps are obtained with the super-resolved bands. Other interesting indices for vegetation studies are those that make use of the Red Edge bands of Sentinel-2 [60][61][62]. The NDVI-RE, see Equation (8), is a modification of the traditional NDVI (Normalized Difference Vegetation Index) that uses the Red-Edge bands, so it is more sensitive to the chlorophyll content and leads to a more accurate map, especially in drier zones [62].
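The two indices can be computed per pixel from the (super-resolved) bands. The NMDI formula follows the Wang and Qu definition with the Sentinel-2 bands named in the text; the exact band roles of Equation (8) for NDVI-RE are an assumption (a common Red-Edge formulation using B7 and B5), and the epsilon guard is added only to avoid division by zero.

```python
import numpy as np

def nmdi(b8a, b11, b12, eps=1e-6):
    """Normalized Multi-band Drought Index (Wang and Qu): the difference of
    the two SWIR liquid-water absorption bands (B11, B12) replaces the single
    SWIR band of NDWI-type indices."""
    diff = b11 - b12
    return (b8a - diff) / (b8a + diff + eps)

def ndvi_re(b7, b5, eps=1e-6):
    """Red-Edge NDVI using bands B7 and B5 (a common formulation; the exact
    band arrangement of Equation (8) is an assumption)."""
    return (b7 - b5) / (b7 + b5 + eps)
```

Running these on the original 20 m bands versus the SR 10 m bands produces the index-map comparison of Figure 9.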
An example is shown in Figure 9d, where the 20 m Red Edge bands B5 and B7 are used. Index maps are colored using a rainbow palette where red areas represent less vigorous vegetation. Comparing the maps with the true-color image in Figure 9a, we can see that red areas correspond to sandy (dry) regions and are better outlined than when using the original 20 m bands.

On the other hand, remote sensing plays an important role in the effective management and monitoring of coastal areas [63]. To show the benefits of using the SR 60 m bands, we selected a coastal area and obtained the seafloor map using the SVM algorithm as well. Specifically, the 60 m coastal blue band {B1} can be very helpful to derive bathymetry and benthic mapping data in shallow waters due to its excellent penetration capability in clear waters. Figure 10 illustrates the results achieved in this scenario using a Sentinel-2 image from 29 June 2019 of the area of Cabrera Island, Spain. The remarkable spatial improvement over the original coastal channel can be appreciated in Figure 10a. A color composite using the SR coastal band is also included in the right column of Figure 10b to better appreciate the sea bottom (in all the images, the same enhancements have been applied: a Gaussian histogram equalization with a 10% brightness increase).
A reference benthic map for the visual analysis, provided by the Spanish Institute of Oceanography [64], is included in Figure 10c. The green areas correspond to different densities of Posidonia oceanica seagrass, gray areas relate to photophilic algae on rocky substrates, while the remaining colors identify different types of soils. Bathymetric information is also provided in the right column to demonstrate the complexity of the scene, with water depths up to 30 m on the right side of the image and much higher depths on the other side.
In particular, we chose the radial basis kernel, but the SVM parameters were not fine-tuned and only an average of 500 pixels per class (0.05% of the image area) was considered during training. The derived SVM seafloor maps are displayed in Figure 10d. The original Sentinel-2 10 m bands {B2, B3, B4, B8} were used to generate the map on the left side, while the result of adding the SR coastal band is provided on the right side. The same training regions were used in both cases. As expected, the coastal band does not affect the land classification (vegetation in dark green and soil in brown); however, some improvements are visible in water areas (vegetation in lime green, sand in yellow, and deep-water areas in blue) thanks to the inclusion of the coastal channel at 10 m resolution.

Discussion
A comparative analysis with pansharpening techniques was not included, mainly due to the lack of a panchromatic band in Sentinel-2. In any case, a preliminary study was done with a few specific images of the dataset using classical pansharpening algorithms, in which a 10 m band acted as the panchromatic channel. The NIR band had the best performance among the available HR bands but, as expected, the spectral distortion was quite obvious in the LR bands, due to the lack of spectral overlapping between the HR and the LR bands. That said, depending on the application, we could apply pansharpening algorithms, but this would require a further comprehensive analysis to study the potential of newer and advanced pansharpening approaches.
Compared with the pansharpening paradigm, we can identify two main differences in the proposed approach. First, the idea of using all the HR bands available to super-resolve the LR bands as opposed to one band and, second, that these HR bands do not necessarily need to overlap the LR bands. We make use of this first key-difference for super-resolving the 60 m bands, where we use the 10 m bands combined with the super-resolved 20 m bands as HR bands for super-resolving at such high scaling factor. The SR 20 m bands were obtained with the RD-Blocks adjusted in the first stage of the training, in contrast with other approaches that used the bicubic upsampling of these bands.
It is important to note that our comparative analysis includes the two approaches that used the same dataset for training and testing. We acknowledge the existence of other works, as stated in Section 2, that tackled the same problem but only provided results for the 20 m bands or used other images for their benchmarks. The visual comparison presented in Figures 4 and 5 shows that our model can super-resolve the LR and VLR bands with performance similar to a state-of-the-art model such as DSen2 [4]; indeed, it can be quite difficult to spot the difference even for a well-trained eye. Analyzing the performance on the 60 m bands, in Figure 5 we can see that the reconstruction using bicubic interpolation presents an obvious lack of detail, in contrast with the similarity between the SR images and the target. Even at this scaling factor, the models can properly transfer the high-frequency information, keeping spectral distortion at a minimum.
We also provide a visual comparison of the errors per band in Figures 6 and 7 where the details are easier to identify. Our model obtained more dark blue areas than those presented by the DSen2 model [4].
If we analyze the per-band errors of the bicubic images, we can detect a decrease in performance for bands 6, 7, and 9. These errors are attenuated by the DL algorithms, and the results support the idea that high-frequency details obtained from the HR bands can be effectively transferred.
Inspecting Figures 6 and 7 in further detail, regarding the 20 m bands, we observe more red-spot areas (absolute errors of more than 250) in the DSen2 bands than in our model. We also observe, in general, better performance in the Red-Edge bands, e.g., bands 5 and 8A, although our model performs well with the SWIR bands too, despite their being spectrally distant from the influence of the HR bands. The good performance with band B8A can help generate better NMDI maps, see Equation (7), which are often used for vegetation analysis.
Furthering the analysis of the 60 m bands, the absolute errors presented in Figures 6 and 7 show fewer red-spot areas than DSen2 (errors larger than 250), especially in the B9 band, where there are fewer light-blue areas as well (errors between 100 and 150).
Tables 4 and 5 present the quantitative metrics per band, where our model prevails in general. Looking at the RMSE results for the 20 m bands, it is noticeable how well our model performs with bands B5, B8A, and B11, where the influence of the HR bands, especially around the red (B4) and NIR (B8) bands, contributes to a good margin for B8A and B11. On the other hand, as expected, B12 shows a significant drop in performance compared to the other two benchmark models. Regarding the RMSE results for the 60 m bands, our model improves both bands consistently, again thanks to the spectral proximity of the Sentinel-2 HR bands.
Considering the applications, Land-Use-Land-Cover (LULC) mapping is one of the main areas of interest in remote sensing, providing fundamental information to support regional and local policies for sustainable development, risk prevention and assessment, environmental protection, etc. The interest in applying automatic semantic segmentation (pixel-based classification) to remote sensing imagery has recently increased [65], driven by the development of deep convolutional neural networks for accurate semantic segmentation, and especially with Sentinel-2 imagery, which is free of charge and offers a high revisit time and spectral variety. Furthermore, other relevant land applications are environmental studies (vegetation, desertification, fires, etc.) and agriculture [57]. In these scenarios, spectral indices provide a powerful tool to monitor and map such areas. In this context, Sentinel-2 offers the opportunity to address all the previous applications thanks to the multispectral channels available, especially the 20 m bands located in the Red Edge and SWIR.
On the other hand, Sentinel-2 provides three 60 m channels that are mainly useful for atmospheric correction purposes. In fact, the cirrus band (Band 10) is not supplied in the Sentinel-2 product as it does not contain surface information. However, the coastal aerosol (Band 1), located in the lower blue part of the spectrum, can be useful as well in coastal applications, thanks to its water penetration capability. As shown, the application of the SR model proposed can help create better land and coastal maps, compared to using the original bands, opening the possibility for local studies that require a high spatial resolution.
It is important to highlight that the goal of the analysis in Section 4.2 was just to visually demonstrate the practicality of our model in different scenarios, not to provide a detailed quantitative analysis or to achieve optimal mapping. For land areas, the enhanced 20 m bands proved their benefits. In coastal scenarios, enhancing the 60 m coastal channel can also be a good alternative to reach water depths over 30 m, particularly for producing more accurate bathymetric and seabed maps at greater water depths and higher spatial resolution.

Conclusions
Sentinel-2 satellites provide multi-spectral images with different ground sampling distances that enable a variety of studies and analyses of the Earth's surface. Due to the physical limitation of sensors and to minimize storage of data, only a small set of bands are provided with maximal spatial resolution. In this work, we have proposed a fully convolutional neural network that sharpens Sentinel-2 20 m and 60 m bands to 10 m GSD to get all bands at the maximal sensor resolution. Our SR model uses residual learning and dense connections that have proven their ability for enhancing low-resolution images in SISR tasks. The CNN model is formed by two branches, one for the SR of 20 m bands and the other for the 60 m bands, which are trained separately in two stages. At inference time, however, all bands are super-resolved in the same forward pass.
Quantitative and qualitative results have shown that our method performs better than state-of-the-art models that tackle the same problem. We have also shown that land applications such as LULC mapping and vegetation analysis could benefit from using the sharpened 20 m bands, and coastal studies could improve the quality of bathymetric and seafloor mapping using the SR 60 m bands.