1. Introduction
Satellite imagery provides a unique and detailed perspective on the state and changes of land, coastal, and oceanic ecosystems [1]. However, the extractable information is limited by the spectral, spatial, and temporal resolutions of remote sensing images. Due to trade-offs in satellite instruments, images generally have either a high spatial resolution and a low spectral resolution or vice versa. One of the most widely used solutions is pansharpening: the fusion of a multispectral (MS) image with a panchromatic (PAN) image, both acquired simultaneously by the same satellite and capturing the same area [2]. MS images are typically composed of several bands partitioning the solar radiation into different spectral ranges (e.g., red, green, blue, and near-infrared). PAN images are composed of a single band capturing the whole solar radiation at a higher spatial resolution than MS images. The resulting pansharpened image combines the high spatial resolution of the PAN image with the high spectral resolution of the MS image. The numerous available pansharpening methods can be labeled as spectral, spatial, spectral-spatial, or spatiotemporal [3].
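For illustration, one classical spectral method, the Brovey transform, can be sketched in a few lines. This minimal example (array shapes and function name are illustrative assumptions, not tied to any specific satellite product) injects PAN spatial detail into each MS band via an intensity ratio:

```python
import numpy as np

def brovey_pansharpen(ms, pan, eps=1e-6):
    """Minimal Brovey-transform pansharpening sketch.

    ms  : (bands, H, W) multispectral image, already resampled to the PAN grid
    pan : (H, W) panchromatic image on the same grid
    Each MS band is rescaled by the ratio of the PAN image to a synthetic
    intensity (here, the mean of the MS bands), injecting PAN spatial detail.
    """
    intensity = ms.mean(axis=0)          # synthetic low-resolution intensity
    ratio = pan / (intensity + eps)      # detail-injection ratio
    return ms * ratio[None, :, :]        # pansharpened (bands, H, W) image
```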
Pansharpening is a special case of image fusion: the combination of several images into a single composite image with a higher information content than any of the original images [4,5]. Image fusion can thus also be performed with images acquired at different dates/times by multiple sensors (optical, radar, hyperspectral, thermal, etc.) embedded in different platforms (multiple satellites, possibly in combination with other aerial vehicles). In that case, most of the traditional methods used for pansharpening (with MS and PAN images) cannot be applied [2,6]. The numerous studies focusing on the fusion of remote sensing images have proposed various methods, each one adapted to the image characteristics and aiming at predefined objectives [4,5,7].
In recent years, deep learning, and in particular neural networks (NNs), has been extensively used in the remote sensing community, mainly for classification and object detection but also, to a lesser extent, for image fusion [8]. NNs provide a flexible and powerful way to approximate complex nonlinear relationships without a priori assumptions on the relationships between variables. The network architecture can be multi-dimensional, thus potentially including spectral, spatial, and temporal variabilities within and between images. Deep convolutional neural networks (CNNs) are the most popular for image analysis due to their excellent performance and proven efficiency [9]. CNNs are robust thanks to their particular architecture characterized by local receptive fields, shared weights, and subsampling. Many studies have implemented pansharpening and single-sensor image fusion using CNNs with very conclusive results [6,9,10,11,12]. This study focused on image fusion from multiple sensors [4,13] with the goal of achieving super-resolution, i.e., further increasing the highest native spatial resolution [14]. For multi-sensor image fusion, Shao et al. [15] showed that previous methods, such as STARFM [16], ESTARFM [17], or ATPRK [18], were outperformed by CNNs.
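To make the fusion principle concrete, a toy sketch (layer widths and names are illustrative assumptions, not the architecture of any cited study) simply stacks the upsampled coarse bands with the fine-resolution bands along the channel axis and maps them to the super-resolved output with a few convolutions:

```python
import torch
import torch.nn as nn

class TinyFusionCNN(nn.Module):
    """Toy CNN fusing upsampled coarse MS bands with fine-resolution bands."""
    def __init__(self, n_coarse=4, n_fine=4, n_out=4, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_coarse + n_fine, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, n_out, 3, padding=1),
        )

    def forward(self, coarse_up, fine):
        # coarse_up: (B, n_coarse, H, W) coarse bands upsampled to the fine grid
        # fine:      (B, n_fine, H, W) fine-resolution bands on the same grid
        return self.net(torch.cat([coarse_up, fine], dim=1))
```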
Sentinel-2 (S2) imagery (European Space Agency) is composed of 13 bands at different spatial resolutions: four bands at 10 m, six bands at 20 m, and three bands at 60 m. The spectral resolution is high, but the spatial resolutions are not sufficient for fine-scale analysis. Higher spatial resolutions (such as 2.5 m) allow accurate geometrical analysis of small objects and finer descriptions and change detections in many areas [19,20]. To increase the S2 spatial resolutions, several studies [21,22,23,24,25,26,27,28] have proposed to fuse the S2 bands together to super-resolve the coarser bands (20 or 60 m) to 10 m. However, these fusion methods cannot be used to further increase the resolution (e.g., to 2.5 m), as S2 images at such a resolution simply do not exist. A solution consists of using an additional source of images from a different satellite constellation. The red (B4), green (B3), and blue (B2) bands of S2 at 10 m were super-resolved to 5 m using the corresponding bands of RapidEye (Planet Labs) images [29]. Contrary to fusion methods using images from a single satellite, this solution is more complex to implement because of differences in image footprint (swath) and acquisition date/time [29,30].
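For reference, the native S2 band resolutions, and a sketch of how coarser bands could be brought onto a common target grid before fusion (the bicubic interpolation here is an illustrative choice, not a step prescribed by the cited methods):

```python
import torch
import torch.nn.functional as F

# Native Sentinel-2 band resolutions (metres)
S2_RESOLUTION = {
    "B2": 10, "B3": 10, "B4": 10, "B8": 10,                         # 10 m bands
    "B5": 20, "B6": 20, "B7": 20, "B8a": 20, "B11": 20, "B12": 20,  # 20 m bands
    "B1": 60, "B9": 60, "B10": 60,                                  # 60 m bands
}

def to_target_grid(band, native_res, target_res=2.5):
    """Bicubically resample a (B, 1, H, W) band tensor to the target grid."""
    scale = native_res / target_res   # e.g., 20 m -> 2.5 m gives a x8 factor
    return F.interpolate(band, scale_factor=scale, mode="bicubic",
                         align_corners=False)
```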
One of the additional sources of images could be the PlanetScope (PS) constellation (Planet Labs; planet.com), composed of more than 150 “Dove” microsatellites. These satellites cover the whole Earth at 3 m every day (versus about five days for S2). While superior in terms of spatiotemporal resolution, the radiometric quality is not equivalent to that of larger conventional satellites. Radiometric inconsistencies between different microsatellites have been highlighted repeatedly [31], notably due to sensor-specific spectral response functions but also to variations in orbital configuration [32]. Several methods for the radiometric normalization of PS imagery were thus recently developed using images of other satellite constellations, such as MODIS and Landsat [31,32,33], but not yet using Sentinel-2 imagery.
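A common baseline for such radiometric normalization is a per-band linear fit of the microsatellite image against a co-registered reference image; the following is a minimal sketch (illustrative only, not the procedure of the cited studies):

```python
import numpy as np

def linear_radiometric_normalization(ps_band, ref_band, valid):
    """Fit a gain/offset mapping a PS band to a co-registered reference band.

    ps_band, ref_band : 2D arrays on the same grid (reference resampled to PS)
    valid             : boolean mask of cloud- and shadow-free pixels
    Returns the normalized PS band and the fitted (gain, offset).
    """
    x, y = ps_band[valid].ravel(), ref_band[valid].ravel()
    gain, offset = np.polyfit(x, y, deg=1)   # least squares: y = gain*x + offset
    return gain * ps_band + offset, (gain, offset)
```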
In this paper, we present an innovative method aiming at simultaneously normalizing the PlanetScope radiometry (all bands: R, G, B, and NIR) and super-resolving Sentinel-2 imagery (10 bands from 10 or 20 m to 2.5 m) using deep residual convolutional neural networks. After a complete and detailed description of the method, the super-resolution quality was thoroughly assessed visually (photointerpretation) and quantitatively. The proposed method is highly spatially and spectrally accurate. Its robustness was illustrated for six locations around the world with contrasting acquisition conditions.
4. Discussion
In this study, we presented and validated a novel method for super-resolving Sentinel-2 (S2) imagery (10 bands from 10 or 20 m to 2.5 m). Super-resolution was achieved by fusion with additional images acquired at a finer resolution by the PlanetScope (PS) constellation. The super-resolution quality was thoroughly analyzed for six S2 tiles acquired in contrasting conditions over five countries around the world, confirming that the proposed method is highly accurate and robust. The method also remarkably normalized the radiometric inconsistencies between PS microsatellites. Super-resolution and radiometric normalization were achieved simultaneously using state-of-the-art residual convolutional neural networks (RCNNs), adapted to the particularities of S2 and PS imagery, and including the corresponding masks of clouds and shadows.
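As an illustration of the two ingredients named above, a residual block (identity skip plus a convolutional path) and the stacking of cloud/shadow masks as additional input channels can be sketched as follows; this is a simplified outline under assumed tensor shapes, not the exact architecture of the proposed method:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual block: output = input + convolutional path."""
    def __init__(self, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip connection

def stack_inputs(s2_up, ps, s2_mask, ps_mask):
    # Masks enter the network as extra channels alongside the image bands;
    # all tensors are assumed (B, C, H, W) on a common grid.
    return torch.cat([s2_up, ps, s2_mask, ps_mask], dim=1)
```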
To our knowledge, only one study [29] considered a similar approach, combining S2 and RapidEye images to super-resolve three of the S2 bands (B2, B3, and B4) to 5 m, also using RCNNs, but with an architecture originally developed for conventional “RGB” images. With a finer spatial resolution (2.5 m), the generalization of the procedure to 10 S2 bands (B2, B3, B4, B8, B5, B6, B7, B8a, B11, and B12), and a much higher accuracy, this study further explored the high potential of deep learning for multi-satellite, multi-sensor image fusion. The proposed method is highly spatially and spectrally accurate at the scale of the considered S2 tile, but also locally, and separately for each band, PS strip (i.e., unique PS satellite and orbit), and main land cover.
Radiometric inconsistencies between PS strips are mainly related to differences in sensor spectral response, orbital configuration [31,32], and acquisition date/time. The proposed method accurately captures and corrects these radiometric variations. As the PS strips varied for each S2 tile, the RCNNs had to be trained from scratch every time (no use of pre-trained networks or weights). Although this strongly increases robustness, the processing time could be limiting for routine use. With a single computer, 48 h per S2 tile were necessary. This processing time could be significantly decreased with code optimization and implementation on powerful cloud computing platforms, such as Google Earth Engine.
The high temporal resolution of PS imagery (revisit time of one day) is an important element. The small acquisition time differences between the S2 tile and PS scenes strongly limit land cover changes. Contrary to Shao et al. [15], who fused Landsat-8 and S2 imagery, the use of time series was thus not necessary.
We deliberately selected S2 tiles and PS scenes with low percentages of clouds and shadows (<5%). The inclusion of the S2 and PS masks in the network architecture improved the local rendering of the super-resolved images. Clouds and shadows present in the S2 tiles were fully predicted. Clouds and shadows present only in the PS scenes were not predicted, but induced local speckle and/or blur effects. The proposed method should be tested with higher percentages of clouds and shadows, as RCNNs should be able to learn and adapt the prediction. However, for percentages above 20%, it would probably be more appropriate to use only data free of clouds and shadows. The super-resolution quality would be identical but with a higher proportion of missing data.
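Using only data free of clouds and shadows can be implemented by masking the loss so that contaminated pixels contribute nothing to training; a minimal sketch, assuming a binary (0/1) float validity mask:

```python
import torch

def masked_l1(pred, target, valid):
    """Mean absolute error over valid (cloud- and shadow-free) pixels only.

    pred, target : (B, C, H, W) tensors
    valid        : (B, 1, H, W) float mask, 1 for valid pixels, 0 otherwise
    """
    diff = (pred - target).abs() * valid       # zero out contaminated pixels
    return diff.sum() / valid.sum().clamp(min=1)  # average over valid pixels
```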
The proposed method could also be applied to images of other satellite constellations. For instance, S2 could be replaced by Landsat, and PS by Pleiades. However, it is important to keep in mind that the proposed method is based on the scale-invariance hypothesis [50]. We demonstrated that the ×8 scale ratio (from 20 to 2.5 m) resulted in high-quality super-resolution, although the quality for the ×4 scale ratio (from 10 to 2.5 m) was, as expected, higher. The maximum usable scale ratio remains to be determined.
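Under the scale-invariance hypothesis, training pairs are typically built by degrading the available images by the target scale ratio and using the native images as references, the trained network then being applied at native scale. A sketch of such a degradation step, with average pooling as an illustrative low-pass filter (not necessarily the filter used in this study):

```python
import torch.nn.functional as F

def degrade(img, ratio):
    """Downsample a (B, C, H, W) image by an integer scale ratio to build
    reduced-scale training pairs (average pooling as a simple low-pass filter)."""
    return F.avg_pool2d(img, kernel_size=ratio)  # e.g., ratio=8 for a x8 ratio
```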
The proposed RCNN architecture could probably still be improved in several ways. Data augmentation is a well-known way to improve the performance of deep networks; for image super-resolution, some of the augmentation methods highlighted by Ghaffar et al. [62] could be added. The “CutBlur” approach could also be tested [63], as well as the use of the FReLU activation function [64], instead of ReLU, for residual learning. Residual learning could be done first separately for the S220, S210, and PS data (three branches, similarly to Wu et al. [28]) and then together (a single branch). Concerning loss functions, as the pseudo-Huber loss resulted in better predictions and performance than the usual MAE and MSE losses, the robust adaptive loss function [58] looks promising.
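For reference, the pseudo-Huber loss behaves quadratically for small residuals (like MSE) and linearly for large ones (like MAE), with a tunable transition scale delta; a minimal implementation:

```python
import torch

def pseudo_huber(pred, target, delta=1.0):
    """Pseudo-Huber loss: delta^2 * (sqrt(1 + (r/delta)^2) - 1), averaged.

    Quadratic near zero residual, linear for large residuals, hence more
    robust to outliers than MSE while smoother than MAE.
    """
    r = (pred - target) / delta
    return (delta ** 2 * (torch.sqrt(1.0 + r ** 2) - 1.0)).mean()
```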