Super-Resolution of Sentinel-2 Imagery Using Generative Adversarial Networks

Sentinel-2 satellites provide multi-spectral optical remote sensing images with four bands at 10 m spatial resolution. Due to their open data distribution policy, these images are becoming an important resource for many applications. However, for small-scale studies, their spatial detail might not be sufficient. On the other hand, WorldView commercial satellites offer multi-spectral images with a very high spatial resolution, typically below 2 m, but their use can be impractical for large areas or multi-temporal analysis due to their high cost. To exploit the free availability of Sentinel imagery, it is worth considering deep learning techniques for single-image super-resolution, which spatially enhance low-resolution (LR) images by recovering high-frequency details to produce high-resolution (HR) super-resolved images. In this work, we implement and train a model based on the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) with pairs of WorldView-Sentinel images to generate a super-resolved multispectral Sentinel-2 output with a scaling factor of 5. Our model, named RS-ESRGAN, removes the upsampling layers of the network to make it feasible to train with co-registered remote sensing images. The results obtained outperform state-of-the-art models according to standard metrics such as PSNR, SSIM, ERGAS, SAM and CC. Moreover, qualitative visual analysis shows spatial improvements as well as the preservation of the spectral information, allowing the super-resolved Sentinel-2 imagery to be used in studies requiring very high spatial resolution.


Introduction
Satellite remote sensing is used in various fields of application such as cartography, agriculture, environmental conservation, land use, urban planning, geology, natural hazards, hydrology, oceanography, atmosphere, climate, etc. In this context, a fundamental parameter to take into account when addressing a specific application is the spatial resolution.
With the recent launch of advanced instruments, the spatial resolution of satellite imagery has been enhanced. For instance, nowadays, WorldView-3/4 satellites [1] can collect 8-band multispectral data with 1.2 m of ground sample distance (GSD). Unfortunately, the cost of such very high spatial resolution (VHSR) imagery can make it impractical when large areas have to be covered or when multi-temporal analyses have to be undertaken. Consequently, in these scenarios, it would be desirable to consider using open access data with acceptable spatial quality, like that provided by satellites such as Landsat or Sentinel. In particular, the Copernicus Sentinel-2 mission [2] comprises a constellation of two polar-orbiting satellites providing a high revisit time, and its Multi Spectral Instrument (MSI) records data in 13 spectral bands ranging from the visible to the shortwave infrared. In the deep learning super-resolution literature, the ESRGAN model [37] improved on previous GAN-based approaches by introducing Residual-in-Residual Dense Blocks as the basic building layers in the generator and by removing batch normalization layers, as they are prone to introduce artifacts. In addition, a relativistic GAN [38] was used to improve the performance of the discriminator.
Super-resolution models based on CNNs face some challenges when applied to remote sensing, mainly related to the lack of training datasets. Ma et al. [39] proposed TGAN (Transferred Generative Adversarial Network) to address the poor quantity and quality of remote sensing data. Other authors like Haut et al. [40,41] and Lei et al. [42] downgraded public remote sensing images to form the LR-HR pairs and tested different network architectures. However, models trained only with synthetic LR-HR pairs tend to fail in generalization, since real LR images are not merely the result of Gaussian noise, blurring or compression artifacts [28,43]. More recently, Ma et al. [44] introduced DRGAN (Dense Residual Generative Adversarial Network) for the super-resolution of remote sensing imagery. Pouliot et al. [45] used a combination of Landsat (30 m GSD) and Sentinel-2 (10 m GSD) imagery to train CNN architectures for SR of Landsat imagery, and Beaulieu et al. [46] tested different CNN and GAN networks with only one pair of Sentinel-2 (10 m GSD) and WorldView (2.5 m GSD) images with a scaling factor of 4. They trained the models with 3 channels (Red, Green, and NIR) and the results were promising, especially using GAN-based networks. They believed that the radiometric and geometric differences between both satellite images could explain the low-value metrics obtained. Some authors have also applied ESRGAN architectures to solve different remote sensing SR applications [47,48].

This work focuses on the analysis of single-image supervised SR techniques using the DL paradigm. Specifically, we present a GAN-based super-resolution model to enhance the 10 m channels of the Sentinel-2 sensor to 2 m with a quality similar to that produced by the WorldView satellite. The spatial enhancement process is achieved by exploiting the synergy existing in the spectral domain between the Sentinel-2 and WorldView-2/3 missions.
To train our model, image pairs coming from the Sentinel-2 and WorldView satellites were used. These LR and HR pairs are hard to obtain because, even though Sentinel imagery is freely available, the commercial nature of the WorldView data and its non-systematic sensing of the Earth's surface make it difficult to get enough image pairs to train the model. To overcome this problem, we decided to conduct the training in two stages. First, we used an artificially generated set of LR-HR image pairs with WorldView data to pre-train the model and, then, we fine-tuned the model using the WorldView-Sentinel pairs. To make it feasible to work with different satellite sources, we have removed the upsampling module of the ESRGAN architecture, as the images needed to be co-registered (see Section 3.2). In addition, the input and output layers were adapted to work with four bands (see Section 3.1). Another important contribution of our model is that, before training, we standardize the data by channels, i.e., subtracting the mean and dividing by the standard deviation of each channel, in order to obtain an output image where the spectral characteristics of the LR Sentinel-2 image are preserved (see Section 4.1). Finally, once trained, our proposed RS-ESRGAN can be applied to obtain a super-resolved multispectral Sentinel-2 output with a scaling factor of 5.
The performances achieved have been compared to other super-resolution algorithms available in the literature in order to study the effectiveness of the proposed SR method, as well as to analyze the spectral/spatial quality trade-off obtained.

Satellite Images
Sentinel-2 is a satellite mission developed within the Copernicus program, a joint initiative of the European Commission, the European Space Agency and the European Environment Agency, to provide operational information about our planet for safety and environmental applications. It is composed of two identical satellites: Sentinel-2A (launched in June 2015) and Sentinel-2B (launched in March 2017), flying in the same orbit but phased at 180° to give a high revisit frequency of 5 days at the Equator. Each satellite carries a multispectral payload (MSI) that provides a set of 13 spectral bands ranging from the visible and near infrared (VISNIR) to the shortwave infrared (SWIR). In particular, four of these bands (B2, B3, B4 and B8) record data at a spatial resolution of 10 m, six bands at 20 m and three bands at 60 m. Basically, the bands at lower resolution are devoted to the analysis of the vegetation status, the discrimination of snow, ice and clouds, and to retrieving information about aerosols, water-vapor and cirrus. Data is available at [49].

To improve the spatial resolution of the Sentinel-2 10 m bands using super-resolution techniques, multispectral WorldView data was used to produce the LR-HR training dataset. WorldView-2, launched in October 2009, was the first high-resolution 8-band multispectral commercial satellite [50]. Operating in a nearly circular sun-synchronous orbit at an altitude of 770 km, it provides a 1.84 m resolution for the VISNIR multispectral bands and a panchromatic band of 46 cm. WorldView-3 was launched in August 2014 at an altitude of approximately 617 km, providing 8 multispectral bands at 1.24 m in the VISNIR, plus additional shortwave infrared bands at 3.7 m and a panchromatic band of 31 cm. Table 1 lists the technical characteristics of the Sentinel-2 and WorldView-2/3 spectral bands in the VISNIR spectrum.

Datasets
We used two datasets to train the models, the WorldView European Cities dataset [51] and a collection of pairs of WorldView-Sentinel images. A third dataset of WorldView-Sentinel image pairs was used as an independent test set to evaluate the performance of the model. In the following, we will refer to the two WorldView-Sentinel datasets as W-S Set1 and W-S Set2, respectively.
The European Cities dataset is a large collection of publicly available WorldView-2 images provided by the European Space Agency (ESA). These images were sensed by the satellite from July 2010 to July 2015, covering several areas of Europe. Some examples are shown in Figure 1. Care was taken to select scenes with little or no cloud cover. We used 32 large images covering parts of Ireland, France, the United Kingdom, Malta, Italy, Spain and other areas, aiming to provide a variety of covers like agricultural landscapes, cities, shores, forest areas, etc.

The W-S Set1 used to train the model consisted of pairs of WorldView-Sentinel images that were sensed on the Canary and Balearic Islands in Spain. The images were provided by the ARTEMISAT-2 project [52] and correspond to regions of the Maspalomas Natural Reserve, Teide National Park, and the Cabrera archipelago. The sensing dates and the original resolution of the WorldView images can be seen in Table 2.

The W-S Set2 dataset was composed of pairs of WorldView and Sentinel-2 images that were used to test the robustness of the model to images unseen during training. Details are provided in Table 3.

Image Pre-Processing
The WorldView ortho-ready standard level 2 product was used in the study as the target data. These standard imagery products are radiometrically corrected (relative radiometric response between detectors, non-responsive detector fill, and a conversion for absolute radiometry), sensor corrected (internal detector geometry, optical distortion, scan distortion and any line-rate variations), geometrically corrected (spacecraft orbit position and attitude uncertainty, Earth rotation and curvature and panoramic distortion) and projected [50].
To obtain true radiance values from the digital values of the image, it was necessary to perform a radiometric calibration applying the correct gains and offsets of the sensor for each band. After this calibration, Top Of Atmosphere (TOA) radiances were obtained. TOA radiance includes distortions produced by the absorption and scattering of the atmospheric gases and, in consequence, correction algorithms are necessary to minimize these effects for the different wavelength of the spectrum.
In a thorough assessment of different advanced atmospheric correction models, comparing WorldView-2 estimated reflectances with real ones measured by a field spectroradiometer [53], most of the radiative models tested achieved good performance, but the Fast Line-of-sight Atmospheric Analysis of Hypercubes (FLAASH) algorithm [54] provided the highest accuracy, with Root Mean Square Error (RMSE) values around 3%.
The FLAASH model was applied and the corresponding parameters (atmosphere type, aerosol model and thickness, height, flight time and date, sensor geometry, adjacency, etc.) were properly set using climatic and field data besides the information included in the metadata file of each image. Next, to transform the perspective of a satellite-derived image to an orthogonal view of the ground, which removes the effects of sensor tilt and terrain relief, orthorectification was applied using rational polynomial coefficients and a high-resolution digital elevation model.
On the other hand, the available Sentinel-2 Level-2A products, that already provide Bottom Of Atmosphere (BOA) reflectance images, were considered for the analysis. Specifically, the Level-2A processing protocol mainly includes an atmospheric correction applied to the TOA Level-1C and the orthorectification [2]. The atmospheric correction model used is Sen2Cor [55], which is a combination of state-of-the-art techniques tailored to the Sentinel-2 environment.
Once the WorldView and Sentinel images were properly pre-processed, both sets were resampled to the same spatial resolution of 2 m and properly co-registered. The process included locating and matching a number of ground control points (GCP) in both images and then performing a geometric transformation. Automatic and manual GCPs were extracted to derive a representative and well-distributed set of points. Next, a polynomial warping algorithm was applied to the Sentinel image to match the reference WorldView image [56]. We only considered those channels that overlap each other, specifically bands B2, B3, B4 and B8 of Sentinel-2 and bands B2, B3, B5 and B7 of WorldView, as detailed in Table 1.
WorldView images in the European Cities dataset have a GSD of 1.6 m. Corrections were made to obtain the BOA reflectance images as well. To form LR-HR pairs, we used the BOA images as the HR and down-sampled them to obtain the LR image, preserving the same scaling factor as the WorldView-Sentinel pairs, as shown in Figure 2. Then, the LR European Cities images were interpolated using bicubic interpolation. Finally, we selected the same set of bands as for the WorldView-Sentinel dataset.
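The synthetic pair generation described above can be sketched as follows; this is a minimal illustration, where the function name and the use of SciPy's cubic-spline `zoom` as a stand-in for the bicubic resampling are our own assumptions:

```python
import numpy as np
from scipy.ndimage import zoom

SCALE = 5  # same scaling factor as the WorldView-Sentinel pairs

def make_lr_hr_pair(hr, scale=SCALE):
    """Build a synthetic LR-HR training pair from a HR reflectance tile.

    hr: float array of shape (bands, H, W), with H and W divisible by `scale`.
    Returns (lr_up, hr), where lr_up is the downsampled image re-upsampled
    to the HR grid, so both tiles share the same size when fed to the network.
    """
    # Downsample each band by the scaling factor (order=3: cubic spline).
    lr = zoom(hr, (1, 1 / scale, 1 / scale), order=3)
    # Upsample back to the HR grid, mimicking the bicubic pre-upsampling step.
    lr_up = zoom(lr, (1, scale, scale), order=3)
    return lr_up, hr
```

A 140 × 140 tile is reduced to 28 × 28 and then restored to 140 × 140, preserving the factor-5 relationship of the WorldView-Sentinel pairs.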

Network Architecture
Since the work of [35], GANs have been used in super-resolution tasks due to their ability to generate realistic outputs with rich textures and quality. The idea of using two models trained simultaneously was proposed in [34], where a Generator network (G) is trained to produce new data samples by capturing the data distribution of the training data. At the same time, a Discriminator network (D) is trained to verify the authenticity of the generated samples by evaluating the probability that these data correspond to the training distribution rather than being generated by G. This training process involves minimizing the error between the generated and real samples on one hand, and maximizing the chances of distinguishing between the real and the generated (fake) samples on the other, making the cost function depend on both networks, G and D. Figure 3 shows the relationship between the networks, where the Generator has only access to the input data while the Discriminator has access to both generated and real data [57]. When GANs are used for super-resolution, a generator is trained to produce a realistic version of a HR image from a LR version. In this work we based our model on the Enhanced Super-Resolution Generative Adversarial Network [37] due to its high performance in the super-resolution of natural images.
The architecture of the Generator network can be seen in Figure 4, with the corresponding configuration of kernel size (k), feature maps (n) and stride (s) for each convolutional layer. A single convolutional layer is used as a feature extractor to generate 64 feature maps with 3 × 3 kernels and stride 1. These maps are the input of a deeper feature extractor composed of 23 Residual-in-Residual Dense Blocks (RRDB) and, finally, three convolutional layers are used to reconstruct the output. A long-range skip-connection transfers low-level features that are combined with the high-level features learned by the RRDB.
The residual learning block RRDB has a denser and more complex structure of convolutional layers with 3 × 3 kernels and 32 feature maps. All convolutional layers use stride 1 to maintain the resolution of the images throughout the generator. The nonlinear activation is the leaky-ReLU with a negative slope of 0.2. Each Dense Block allows the propagation of the information learned by preceding layers, increasing the network capacity by effectively using this residual and local information. The concept of Residual Scaling [58] was used to scale down the residual feature maps by a factor given by a hyperparameter β (see Figure 4b) to prevent instability when training very deep networks.
Since we use an interpolated version of the LR image as the input image (see Section 3.2), the upsampling modules of the original ESRGAN implementation were removed from the Generator. In addition, the input and output layers were adapted to work with the four bands.

Figure 5 shows the architecture of the Discriminator network (D). It contains 8 blocks formed by convolution, batch-normalization and leaky-ReLU layers. The blocks contain an increasing number of feature maps, reaching 512 filters, with alternating kernel sizes: a 3 × 3 kernel for a convolutional layer with stride 1, and a 4 × 4 kernel with stride 2 to reduce the size of the images when the number of feature maps is increased. The last layers are two dense fully-connected layers together with a final sigmoid output to produce the final score.

Methodology and Loss Functions
We train the model in three stages. First, we train the generator using the European Cities dataset; then, in a second stage, we fine-tune the generator using image pairs from the W-S Set1. For these two stages, we use the L1 loss (Equation (1)), which measures the mean absolute error between the generated output G(Xi) and the target Yi for each ith image.
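From this description, the content loss of Equation (1) can be reconstructed (assuming N images per batch and the usual ℓ1 norm) as:

```latex
L_1 = \frac{1}{N} \sum_{i=1}^{N} \left\| G(X_i) - Y_i \right\|_1
```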
After training the Generator with the European Cities dataset and the WorldView-Sentinel image pairs, the third and final stage consists in training the Generator along with the Discriminator with an adversarial approach, using the weights learned from previous stages, and only training the models with WorldView-Sentinel pairs. In this stage, a combination of losses is used for updating the weights of the Generator (Equation (2)) and the Discriminator (Equation (6)). The Generator loss L G is a sum of three terms: the adversarial loss L Ra G (Equation (5)), the content loss L 1 (Equation (1)), and the perceptual loss L percep (Equation (3)).
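Combining the three terms with the weights η, γ and λ reported in Section 3.5, Equation (2) plausibly takes the form below; the assignment of each weight to each term follows the ESRGAN convention and is our assumption:

```latex
L_G = \gamma\, L_{percep} + \lambda\, L_G^{Ra} + \eta\, L_1
```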
The perceptual loss Lpercep, introduced by [36], uses the VGG-19 network [59] as a feature extractor and measures the difference between feature maps obtained at two different points in the network. It computes the mean absolute error between the feature maps obtained after the 4th convolutional layer and before the 5th max-pooling layer (φ5,4(·)), for each target image Yi and output G(Xi).
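In this notation, Equation (3) can be reconstructed as the mean absolute error between VGG-19 feature maps:

```latex
L_{percep} = \frac{1}{N} \sum_{i=1}^{N} \left\| \varphi_{5,4}\left(Y_i\right) - \varphi_{5,4}\left(G(X_i)\right) \right\|_1
```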
The adversarial loss L Ra G is calculated taking into account the concept of the Relativistic average GAN (Ra) scheme introduced by Jolicoeur-Martineau in [38]. Relativistic average GAN improves the performance of a standard GAN by encouraging the discriminator to not only distinguish between input real images (x r ) and fake images (x f ), but instead to distinguish that a real input image is relatively more realistic than a fake one.
A standard GAN computes the probability that the generic input data x is real as D(x) = σ(C(x)), where σ(·) is the sigmoid function and C(x) is the non-transformed discriminator output. By using the relativistic scheme, the probability of real data being classified as real decreases, while the probability of fake data being classified as real increases. The relativistic discriminator (DRa) takes into account the a priori knowledge that half of the mini-batch samples are fake and half are real. Thus, DRa is formulated as shown in Equation (4), where Exf[·] represents the operation of computing the mean over all fake data in the mini-batch [38].
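Following the description above, Equation (4) can be reconstructed as:

```latex
D_{Ra}(x_r, x_f) = \sigma\left( C(x_r) - \mathbb{E}_{x_f}\left[ C(x_f) \right] \right)
```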
Cross-entropy is used to calculate the adversarial loss LRaG as shown in Equation (5). Due to the minimax behavior of GANs, where one network wants to minimize the probability of fake data being detected while the other wants to maximize its ability to discriminate between fake and real data, cross-entropy is used in a symmetrical form to update the weights of the Generator and the Discriminator (Equation (6)), since both networks benefit from the information of real and generated data in adversarial training, making results more realistic and more detailed [37].
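Under the relativistic average scheme of [38], the symmetrical Generator and Discriminator losses of Equations (5) and (6) can be reconstructed as:

```latex
\begin{aligned}
L_G^{Ra} &= -\mathbb{E}_{x_r}\left[\log\left(1 - D_{Ra}(x_r, x_f)\right)\right]
            -\mathbb{E}_{x_f}\left[\log\left(D_{Ra}(x_f, x_r)\right)\right] \\
L_D^{Ra} &= -\mathbb{E}_{x_r}\left[\log\left(D_{Ra}(x_r, x_f)\right)\right]
            -\mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right]
\end{aligned}
```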

Training Details and Network Interpolation
Due to the large size of satellite images, around 10 K × 10 K pixels in the European Cities dataset and over 1 K × 1 K in the WorldView-Sentinel dataset, the images were divided into tiles of 140 × 140 pixels to form the train-validation-test subsets. The European Cities dataset contains 32 large images of different parts of Europe, which were tiled and used for pre-training the model. The WorldView-Sentinel dataset W-S Set1 contains 5293 tiles for training, 278 for validation and 296 for testing. We performed horizontal and vertical flips for data augmentation, as well as random crops of 128 × 128 pixels. Adam optimization [60] was used, with an initial learning rate of 10⁻⁴, halved every 20 K iterations. The hyperparameters that weight the different losses in Equation (2) are η = 10⁻², γ = 1 and λ = 5 × 10⁻³, as in [37]. The batch size is 2 and the residual scaling parameter is β = 0.2.
As indicated, the networks were trained in three stages: first with images from the European Cities dataset, then using the WorldView-Sentinel (W-S) Set1 to train the generator network alone, and finally fine-tuning the model with adversarial training using the same W-S Set1. We used the PyTorch framework, training the models on an Nvidia GTX Titan X with 12 GB of memory and 12 GB of RAM, with each stage taking 21 h to complete.
The final SR image from a LR image X is obtained as a linear combination between a generator trained in adversarial mode G adv (X) and a generator trained alone G PSNR−oriented (X) (Equation (7)). This linear combination is done to remove unpleasant noise produced by purely GAN-based methods while maintaining a good perceptual quality [37]. The interpolation parameter α is a real number within the [0, 1] range.
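The linear combination of Equation (7) can be sketched as follows; the function name is our own, and the two inputs stand for the outputs of the PSNR-oriented and adversarially trained generators described in the text:

```python
import numpy as np

def interpolate_sr(sr_psnr, sr_adv, alpha):
    """Blend the outputs of the two generators (Equation (7)).

    alpha = 0 keeps the PSNR-oriented result (higher metrics, smoother);
    alpha = 1 keeps the fully adversarial result (more texture, possible
    artifacts). alpha must lie in the [0, 1] range.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * sr_psnr + alpha * sr_adv
```

Because the blend is purely element-wise, different α values can be explored cheaply after training, without re-running either generator.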

Quality Assessment
To quantitatively compare the output of the Generator (G(X)), obtained from the input LR image (X) of the test set, against its target HR image (Y), several metrics were considered.
• Peak Signal to Noise Ratio (PSNR): it is one of the standard metrics used to evaluate the quality of a reconstructed image. Here, MaxVal is the maximum value of the HR image (Y). Higher PSNR generally indicates higher image quality.
• Structural Similarity (SSIM) [61]: it measures the similarity between two images taking into account three aspects: luminance, contrast and structure. It lies in the range [−1, 1], where a SSIM equal to 1 corresponds to identical images. The constants C1 = (k1·L)² and C2 = (k2·L)² depend on the dynamic range (L) of the pixel values, with k1 = 0.01 and k2 = 0.03 by default.
• Erreur relative globale adimensionnelle de synthèse (ERGAS) [62]: it measures the quality of the output image G(X) by taking into consideration the scaling factor (S) as well as the normalized error per channel, considering the mean Ȳj of each band. Contrary to PSNR and SSIM, for this index a lower value implies higher quality.
• Spectral Angle Mapper (SAM) [63]: it calculates the angle between two images as the arccosine of their dot product divided by the product of their 2-norms. This index indicates higher similarity between images as it approaches zero.
• Correlation Coefficient (CC): it computes the linear correlation between the images, where n is the number of pixels in each channel. It lies in the range [−1, 1], where 1 denotes total positive linear correlation.
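A minimal numpy sketch of four of these metrics is given below. Function names and the 100/S convention used for ERGAS are our own assumptions (ERGAS conventions vary in the literature); SSIM is omitted since it requires windowed statistics:

```python
import numpy as np

def psnr(y, g, max_val=None):
    """Peak Signal to Noise Ratio in dB; higher indicates higher quality.
    max_val defaults to the maximum value of the HR image y."""
    max_val = np.max(y) if max_val is None else max_val
    mse = np.mean((y - g) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def sam(y, g, eps=1e-12):
    """Mean Spectral Angle Mapper in radians; closer to zero is better.
    y, g: arrays of shape (bands, H, W); the angle is computed per pixel
    across the spectral dimension."""
    dot = np.sum(y * g, axis=0)
    norms = np.linalg.norm(y, axis=0) * np.linalg.norm(g, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def cc(y, g):
    """Mean per-band correlation coefficient; 1 means total positive
    linear correlation."""
    vals = [np.corrcoef(yb.ravel(), gb.ravel())[0, 1] for yb, gb in zip(y, g)]
    return float(np.mean(vals))

def ergas(y, g, scale=5):
    """ERGAS; lower implies higher quality. Uses the per-band MSE
    normalized by the squared band mean of the reference."""
    band_terms = [np.mean((yb - gb) ** 2) / np.mean(yb) ** 2
                  for yb, gb in zip(y, g)]
    return 100.0 / scale * np.sqrt(np.mean(band_terms))
```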
For a qualitative comparison, true RGB and false-color composites were used to evaluate the overall spatial and spectral performance of the methods. The false-color image is composed of the NIR, Red and Green channels, in that order.

Experiments and Results
In this section we present numerical and visual results of our model applied to the test subset of the W-S Set1, for several values of the interpolation parameter α (Equation (7)). Next, we evaluate the robustness of the model on an independent set of images using W-S Set2. Finally, a comparison with other state-of-the-art super-resolution algorithms is presented. In all cases quantitative results are given in terms of mean and standard deviation of the quality metrics on the test sets. In addition, statistical tests were performed to analyse the statistical significance of the difference in performance of our method and other state-of-the-art models.

Data Standardization
Data normalization is typically performed prior to training a neural network to speed up learning and lead to faster convergence. The usual approach in SR models consists in dividing all image pixels by the largest pixel value, adjusting pixel values to the [0,1] range. This is performed across all channels, regardless of the actual range of pixel values that are present in the image. The inverse process is done at the output, re-scaling pixels to the original range.
However, we have observed that this type of normalization does not perform well in our scheme that is trained with images from two different sensors, leading to results where the distribution of the SR Sentinel image is shifted towards the spectral distribution of the original WorldView HR image.
To address this issue, we standardize the input and target data. Standardization is done per channel, subtracting the mean and dividing by the standard deviation of each channel j. To obtain the final SR image G(X), the inverse process is applied, using the mean and standard deviation of the input image X, as shown in Equations (13) and (14). Figure 6 illustrates the difference between the two methods. We trained both models, one normalizing and the other standardizing the data, and generated SR results for two samples from the W-S Set1 corresponding to Cabrera and Maspalomas. We present the normalized histograms per band. Each plot shows the original distribution of the LR Sentinel (orange), the HR WorldView (blue), and the SR output using normalization (green) and standardization (red). We clearly observe that the standardization approach helps to preserve the spectral content of the low resolution image.

Table 4 shows the quantitative assessment results of our SR model for several values of the interpolation parameter α (Equation (7)). It can be observed that, as the influence of the network with adversarial training increases (α > 0.5), more texture appears in the SR image. This behaviour is helpful in regions with small details like forest and urban areas, though lower metrics are obtained as a trade-off. However, very large values of α may introduce some artifacts. This is consistent with the typical hallucination effect that has been reported in GAN models [35,37]. It is important to emphasize that only the Sentinel LR image is applied to the trained model; the WorldView HR image is just used as a reference to assess the quality.
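The per-channel standardization described in this section and its inverse (Equations (13) and (14)) can be sketched as follows; the function names are our own:

```python
import numpy as np

def standardize(img):
    """Per-channel standardization: subtract the mean and divide by the
    standard deviation of each channel. img: array of shape (bands, H, W).
    Returns the standardized image and the per-channel statistics."""
    mean = img.mean(axis=(1, 2), keepdims=True)
    std = img.std(axis=(1, 2), keepdims=True)
    return (img - mean) / std, mean, std

def destandardize(out, mean, std):
    """Inverse transform applied to the network output, using the statistics
    of the LR input so the SR image keeps its spectral distribution."""
    return out * std + mean
```

Applying the inverse with the LR input's statistics, rather than a global normalization constant, is what keeps the SR output's histogram aligned with the Sentinel-2 distribution instead of drifting towards WorldView's.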

Performance on the W-S Set2
We also tested our model on pairs of WorldView-Sentinel images that were not used for the training, the W-S Set2. Quantitative results are summarized in Table 5.
The proposed model outperforms nearest-neighbor and bicubic interpolation in all the metrics considered. In line with the results achieved with the W-S Set1, low values of α provide, in general, better quality metrics. In this experiment, α = 0.1 is the optimal value with improvements of 1.612 dB on PSNR and 0.087 on SSIM over bicubic interpolation.
Visual results are presented in Figures 11 and 12 for two images of Maspalomas and Cabrera. These results show a good recovery of the textures in dense forest areas. For images with a high density of households, objects are clearly defined, recovering the delimitation of roads and roundabouts. Moreover, there is good spectral agreement between the LR images and the SR results, despite using test images not belonging to the training set.

Comparison with Other SR Models
In this subsection we compare the performance of our proposed RS-ESRGAN model (using α = 0) with the baseline bicubic interpolation and some state-of-the-art SR methods: SRCNN [64], EDSR [65], RCAN [66] and SRGAN [35].
For a fair comparison, all the SR models considered were trained in the same way in two stages, first with European Cities and then with the WorldView-Sentinel Set1, with training details as described in Section 3.5. The upsampling modules of EDSR, RCAN and SRGAN were removed from the original architectures and the input images were pre-upsampled by a factor of 5 using bicubic interpolation. In addition, the architectures were adapted to work with the four input channels considered. Quality metric results on the W-S Set1 are summarized in Table 6. The best results are highlighted in bold.
Our model outperforms the others in all the metrics considered. The statistical analysis of the results on the W-S Set1 shows that the differences between the proposed model and the other SR models are significant (all p-values smaller than 0.01). In addition to the quantitative comparisons, we also performed visual comparisons between our method and the above-listed methods, presented in Figures 13 and 14. The proposed model with α = 0 and α = 0.1 produces a clean image, with fine details, fewer artifacts and consistent with the target.

Discussion
Single-Image SR with GANs is a complex task requiring a great quantity of diverse image pairs for training. In addition, in the remote sensing context, the pre-processing steps applied to the data play a key role during the network training.
Aiming to exploit the open data policy of Sentinel-2 and the very high spatial resolution that WorldView imagery offers, we trained our model with real data from both satellite sensors. However, the generation of LR-HR image pairs is not easy, mainly due to the high cost and limited availability of archived WorldView data, as well as the difficulty of obtaining pairs with similar dates and lighting conditions. To alleviate these inconveniences, the network was pre-trained using LR-HR image pairs artificially generated from a collection of publicly available WorldView images.
One crucial pre-processing step is co-registration, necessary to compensate for the misalignment of pixels between both datasets. The ability of WorldView to record images with different off-nadir angles may add spatial distortions in comparison with Sentinel imagery, which only senses in the nadir viewing direction, making the generation of LR-HR pairs more complex and requiring orthorectification with a precise digital elevation model. To perform an accurate co-registration, images must have the same GSD to define the ground control points and to apply the subsequent geometric transformation. We performed a bicubic interpolation on the LR images followed by the co-registration. Thus, the upsampled versions of the LR images were input to the network and we removed the upsampling module that is typically used in super-resolution networks like ESRGAN.
Another important aspect to consider during the generation of LR-HR image pairs is the spectral information, particularly when dealing with sensors having bands with different spectral range. We used the four bands (RGB-NIR) that presented spectral overlap between both satellites. We observed that the standardization of the input and target images before the training was critical to produce SR images preserving the spectral information from the LR counterpart.
The proposed RS-ESRGAN model achieves a significant improvement in all the quality metrics compared to recent models like RCAN and SRGAN. Also, the subjective evaluation shows that the proposed RS-ESRGAN produces finer details and textures. The false color scheme, using the NIR band, shows that improvements are consistent in the four channels considered.
The final SR image is obtained as a linear combination of the outputs produced by the model trained with and without the adversarial scheme. The interpolation parameter α controls the trade-off: higher values yield images with more high-frequency detail, texture and generally better visual quality (but possible artifacts), whereas lower values yield noise-free images with less spatial detail and better quantitative metrics.
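This linear combination can be sketched as follows (names are illustrative; the same convex combination can equivalently be applied to the generator weights, i.e. ESRGAN-style network interpolation):

```python
import numpy as np

def blend_outputs(sr_psnr, sr_gan, alpha):
    """Convex combination of the two model outputs: alpha=0 keeps the
    artifact-free PSNR-oriented result, alpha=1 keeps the sharp,
    texture-rich adversarially trained result."""
    return (1.0 - alpha) * sr_psnr + alpha * sr_gan

sr_a = np.zeros((4, 10, 10))  # stand-in for the PSNR-model output
sr_b = np.ones((4, 10, 10))   # stand-in for the GAN-model output
sr = blend_outputs(sr_a, sr_b, alpha=0.8)
```

Because the combination is linear, α can be tuned per application without retraining either model.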
The performance on the independent test set W-S Set2 was also excellent. The model was trained with samples containing several types of structures, such as buildings, roads, shrubbery, arid zones and dense vegetation areas, and can therefore generalize to images from geographical areas not included in the training sets. RS-ESRGAN was also tested with Sentinel-2 images from other parts of the world with satisfactory visual results; to keep this article concise, such additional results are not included.
The proposed single-image SR model produces HR images that can enable the use of Sentinel-2 imagery in studies where fine spatial detail is needed. Moreover, the user can choose between exploiting the spatial detail and preserving the spectral agreement simply by tuning the interpolation parameter α.
The improvement in spatial quality at the expense of introducing some noise in the SR image, as well as the computational burden caused by the combination of different loss functions and the dense combination of feature maps in the generator, might be limitations of the model. These issues are currently unsolved and are good topics to explore in the near future.
Another issue that deserves attention is the presence of artifacts in SR images, which tend to appear in the fully adversarial mode, i.e., with higher values of α. Artifacts can make roads appear sinuous when they should be straight, or produce the hallucination of connected dense vegetation areas when the HR images show the opposite. These are important problems to tackle in future work, although network interpolation seems to be a valid workaround for obtaining a less blurry SR image with recovered texture.

Conclusions
In this paper we propose a single-image super-resolution method based on a deep generative adversarial network to enhance the spatial resolution of the Sentinel-2 10 m bands to a resolution of 2 m. The proposed model, RS-ESRGAN, was adapted to work with co-registered remote sensing images and was trained using the four channels (RGB and NIR) that overlap on both satellites, with Sentinel-2 images as input and WorldView images as target. Several pre-processing steps were needed to obtain the dataset of image pairs ready for training. The model was first trained with synthetic LR-HR pairs from the WorldView European Cities dataset and then with real LR-HR image pairs from the Sentinel-2 and WorldView satellites. An important aspect of the training was the standardization of the images by channel, which proved to be a crucial step to preserve the spectral information between the low-resolution and the super-resolved images. The behaviour of the α parameter was analyzed, showing the trade-off between visual quality and quantitative metrics. Finally, the results show significant improvements, both quantitatively and qualitatively, of RS-ESRGAN compared to other state-of-the-art deep learning SR models.

Funding:
This research has been supported by the ARTEMISAT-2 (CTM2016-77733-R) and MALEGRA (TEC2016-75976-R) projects, funded by the Spanish Agencia Estatal de Investigación (AEI), by the Fondo Europeo de Desarrollo Regional (FEDER) and the Spanish Ministerio de Economía y Competitividad, respectively. L.S. would like to acknowledge the BECAL (Becas Carlos Antonio López) scholarship for the financial support.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: