SEG-ESRGAN: A Multi-Task Network for Super-Resolution and Semantic Segmentation of Remote Sensing Images

The production of highly accurate land cover maps is one of the primary challenges in remote sensing, which depends on the spatial resolution of the input images. Sometimes, high-resolution imagery is not available or is too expensive to cover large areas or to perform multitemporal analysis. In this context, we propose a multi-task network to take advantage of the freely available Sentinel-2 imagery to produce a super-resolution image, with a scaling factor of 5, and the corresponding high-resolution land cover map. Our proposal, named SEG-ESRGAN


Introduction
The application of Deep Learning (DL) in Remote Sensing (RS) for Earth Observation has brought significant advances in many fields, with Land-Use/Land-Cover (LULC) classification and data fusion being the most relevant [1]. Moreover, the learning capacity of DL models has attracted the RS community towards automated workflows that extract high-level representations from raw data and transform them to achieve excellent performance in the production of valuable assets [2].
One fundamental feature of RS images is the spatial resolution, which is defined as the minimum distance at which two separate objects can be distinguished [3]. Nowadays, many platforms, managed by public or private agencies, provide data with different spatial resolutions, where a higher-resolution image depicts objects in more detail, which can result in a more accurate segmentation map. A major problem is that high-resolution (HR) images are not always available with the specific characteristics requested, or obtaining them may represent a considerable economic barrier.
One of the main factors that has driven the growth of research and applications in RS is the availability of open-access satellite data, providing free-of-charge imagery, such as that produced by the Copernicus Program (https://scihub.copernicus.eu/dhus/#/home, accessed on 7 October 2022), with Sentinel-2 being the foremost exponent of the constellation of multispectral medium- and high-resolution satellites available. The Copernicus Sentinel-2 satellites are two identical platforms, orbiting the Earth with a short revisit time and capable of providing multispectral (MS) images with considerable surface coverage and different spatial resolutions (10, 20 and 60 m), covering the Visible and Near-Infrared Spectrum (VIS-NIR).
In computer vision, Semantic Segmentation (SS) assigns semantic labels to the pixels of an image [4]. In the context of LULC, each label corresponds to a semantic class, and having an HR image for this purpose is essential for achieving good segmentation accuracy [5]. Therefore, to take advantage of the free availability of Copernicus data, we propose the use of super-resolution techniques to assist with the segmentation task by enhancing the details of the 10 m bands provided by the Sentinel-2 satellites.
When a panchromatic channel with higher spatial detail is not available, super-resolution (SR) methods can provide a suitable alternative for obtaining enhanced bands. For this reason, in the last decade, the improvement of the spatial resolution of RS images has been a very active research area, aiming to reduce costs in studies requiring imagery with a very high spatial resolution, which usually accounts for a considerable share of a project's budget. The most common approaches are based on Convolutional Neural Networks (CNN) and Generative Adversarial Networks (GAN), achieving outstanding results on different RS data.
Many authors have also tackled the idea of applying DL semantic segmentation models to different RS datasets, mainly using hyperspectral or aerial images due to their spectral and spatial characteristics. However, in this work, we extend our analysis to works that have applied SR as a preprocessing step, or even combined it with segmentation, to improve performance in generating land cover maps or in other RS applications.
The key point of our work is a multi-task network approach, which is inspired by [6]. This approach consists of predicting an SR image with its corresponding HR land-cover map from a low-resolution (LR) Sentinel-2 image. Predictions are made simultaneously in a multi-task fashion by employing two dedicated branches in the network architecture that collaborate in training to maximize their performance. Our model, named SEG-ESRGAN (Segmentation Enhanced Super-resolution Generative Adversarial Network), takes our previous proposal RS-ESRGAN [7] for super-resolution and extends this model by coupling an encoder-decoder architecture to perform the segmentation task, i.e., to produce a Semantic Segmentation Super-Resolution (SSSR) map. The training was completed in a supervised manner, using very-high-resolution imagery from the WorldView-2 satellite and its LULC map. In this manner, we bring the 10 m bands from Sentinel-2 to a challenging 2 m of spatial resolution (scaling factor of 5) with a substantial improvement on the land cover classification task, which, to the best of our knowledge, was not tackled before.
The organization of this paper is as follows: Section 2 presents the relevant works related to the application of Deep Learning to SR, SS and multi-tasking methods in RS. In Section 3, we present the materials and methods, describing the dataset, the model, training details and quantitative metrics used for evaluation. In Section 4, we present the experimental results with the discussion in Section 5. Finally, we present our conclusions in Section 6.

Super-Resolution
Considered an ill-posed problem by many authors [8,9], SR seeks to recover an HR version of an LR image by learning, from a set of LR-HR samples, to add high-frequency content to the input image. The generation of the training dataset is also a challenge, mainly because of the lack of real-world LR-HR pairs of samples. Therefore, researchers often circumvent this problem by degrading the HR image to form the LR counterpart of each pair [10].
Single-image SR (SISR) uses a single LR image to produce an HR version. In practice, this approach attracts much greater interest than multi-image SR because it simplifies the task, especially in RS applications, where several samples of the same scene must be gathered and sub-pixel misalignments are not well handled [7].
The seminal work of Dong et al. [11], with their SRCNN model, introduced the use of fully convolutional approaches to produce an SR image. Nowadays, several works with distinct architectures and training strategies can be found in the literature [9,10], where two trends are identified: those that minimize the error between the SR output and the ground truth (GT), and those that generate predictions based on a perceptual similarity.
Another important work presented by Kim et al. [12] (VDSR) increased the depth of the convolutional layers by processing the LR image interpolated to the target scaling factor and then refining this coarse HR image with high-frequency details. This pre-upsampling approach was limited by artifacts due to the upsampling operation and by the lower computational efficiency of working in a high-dimensional space.
Other models [8,13,14] proposed to work in the lower-dimensional space, making the model fully learnable, including the upsampling modules, by introducing the transposed convolution [15] or the pixel-shuffle module [16], which scales the feature maps by the desired scaling factor to produce the final SR image. These post-upsampling approaches outperform pre-upsampling networks, although they limit the network to a fixed scaling factor.
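To make the pixel-shuffle rearrangement mentioned above concrete, the following is a minimal NumPy sketch rather than the implementation of any cited work. The function name `pixel_shuffle` and the channel-first layout are assumptions made for illustration; the index convention follows the one commonly used in deep learning libraries.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r).

    Channel-first layout and this index convention are assumptions.
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, dy, dx)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, dy, w, dx)
    return x.reshape(c, h * r, w * r)

# Four low-resolution 2x3 maps become one 4x6 map for r = 2.
features = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
upscaled = pixel_shuffle(features, r=2)
print(upscaled.shape)  # (1, 4, 6)
```

Because the upscaling factor `r` is baked into the channel count, a network built around this module is tied to a single scaling factor, which matches the limitation noted above.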
As mentioned by Anwar et al. [9], all previous models assumed uniform importance in spatial and channel information. However, improvements related to exploiting the channel interdependence and mutual knowledge between intermediate feature maps have also been proven suitable in SR. Zhang et al. [17] introduced a novel channel attention mechanism inspired by the work of [18], focusing on boosting the representation capacity by scaling the channel-wise features adaptively.
Lately, Generative Adversarial Networks (GANs) [19] have attracted increasing attention from the research community, such as the work of Ledig et al. [8], where authors proposed SRGAN, a GAN-based network that produced a more photo-realistic output but with lower quantitative metrics. This seminal work opened a new research path for looking for more realistic SR images than just reducing the error associated with the LR-HR pair. Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [20] is another improved architecture used for image SR that used a more complex and dense combination of residual layers.
In the context of Remote Sensing, SR is becoming popular among researchers, especially since the outstanding results reached in the natural image domain [21]. Pansharpening techniques are often the first choice for enhancing the LR MS bands when an additional panchromatic instrument with better spatial resolution is available on the platform. However, some platforms, such as Sentinel-2, do not carry this extra sensor and, therefore, the application of SR techniques becomes the alternative to improve the details of the bands [22].
The lack of real-world datasets with LR-HR image pairs in remote sensing is also a challenge, which some authors address with transfer learning [23] or by downsampling RS images using Wald's protocol [24] to form the LR-HR dataset [25][26][27][28]. Other authors resort to other satellite sources, with better spatial resolution, to form the dataset, taking care to minimize the time gap between the acquisitions of both images of each pair. Pouliot et al. [29] proposed a CNN to work with Landsat (30 m) and Sentinel-2 (10 m), and Teo and Fu [30] proposed a VDSR for the fusion of Landsat with Formosat (8 m) bands.
As mentioned before, Sentinel-2 satellites are of great importance for the RS community, with many authors addressing the SR of the 20 and 60 m bands [31][32][33] but few authors tackling the SR of the 10 m bands. For instance, Galar et al. [34] proposed the use of PlanetScope (2.5 m) in combination with Sentinel-2 bands, Panagiotopoulou et al. [35] introduced the use of SPOT-7 imagery (2.5 m), while other authors used WorldView imagery to work with Sentinel-2 as well [7,36].

Semantic Segmentation
Semantic Segmentation (SS), also known as classification in RS, is the task that seeks to assign the most probable class to each pixel from a set of probability scores predicted for each class [2,37,38]. It is a challenging task where DL models have become the state-of-the-art for different applications, including RS [1,4].
Long et al. [39] presented a Fully Convolutional Neural Network (FCN) to produce a segmentation map. Similar to some image classification networks, this model reduces the feature map sizes after several convolution blocks but replaces the fully connected layers with a convolution layer and upsampling to recover the spatial size for the final output map. In addition, FCN incorporates different skip connections to combine low-level features, essential for determining homogeneous regions, with high-level features [40], which is helpful for determining fine-grained objects, although there are some limitations in incorporating the global context and the delimitation of small objects.
The context provides useful information for building semantics of objects, and it is an important concept that can boost performance in semantic segmentation tasks [41,42]. Local context is important to achieve fine-grained segmentation, whereas the global context is essential for resolving ambiguities [43].
Chen et al. [44] proposed DeepLab, a combination of CNN and conditional random fields (CRF) to capture finer details [45]. Later, the authors presented DeepLabV2 [46], incorporating Atrous Spatial Pyramid Pooling (ASPP) [47], which captures more context by working with different resolutions. DeepLabV3 [48] improved the ASPP module and removed the CRF for faster inference, and DeepLabV3+ [49] added a decoder module to improve object boundaries.
Network architectures based on an encoder-decoder scheme are frequently used in SS, where the encoder gradually reduces the spatial dimensions of the input image to encode rich semantic information, whilst the decoder progressively recovers the spatial content, to reconstruct HR feature maps with sharp object boundaries.
Badrinarayanan et al. [50] proposed SegNet, which uses the pooling indices of the max-pool operations on the encoder blocks to perform the upsampling in the decoder counterpart, reducing the computation overhead with high performance in the segmentation. Ronneberger et al. [51] proposed U-Net, where the low-level features from the different levels of the encoder are concatenated with high-level features from the decoder at the same level, using skip connections and achieving excellent performance. Nowadays, many variants replace the encoder part with backbones from other networks, such as ResNet blocks [52] or VGG [53], among others.
One of the major drawbacks of encoder-decoder networks is the loss of spatial details because of the encoding process [4]. Therefore, Wang et al. [54] proposed HRNet, a novel architecture that combines the features of four parallel branches, each working at a reduced scale. At every stage, transition blocks adapt the features to the scale of each parallel branch before concatenation, enabling multi-scale fusion on each branch.
Many other models rely on attention mechanisms, which are now widely applied to computer vision tasks such as semantic segmentation [55,56] or object detection [57]. Chen et al. [58] introduced the learning of weights for multi-scale features trained with images of different sizes. In this way, the attention module learns to appropriately weight the final score of each pixel, considering the different training scales. In addition, the attention module helps to visually diagnose the network's focus on objects at different scales and positions, and this multi-scale approach improves segmentation performance.
Regarding RS applications, many authors use traditional machine learning models, such as Random Forest (RF) and Support Vector Machines (SVM) [59][60][61]; however, by exploiting the contextual pixel neighborhood, DL models are leading the performance in segmentation tasks [1,62]. The lack of dense annotation often limits the application of DL models in RS. In the context of aerial ortho-photo images, the 2014 IEEE GRSS Data Fusion Contest dataset and the ISPRS 2D Semantic Labeling Contest [63] are often used by researchers to benchmark their models. For instance, Liu et al. [55] proposed an improved version of DeepLabV3+ that embeds an attention mechanism in the ASPP, and Zhang et al. [56] combined a more sophisticated attention-based CNN architecture with Digital Elevation Models, which were also released with the data, to improve their results.
Regarding the use of satellite images, researchers have recently combined high-performance computing with machine learning models to produce LULC maps of great extent, after a tremendous effort to gather annotated data for training and to promote the use of Sentinel-2 imagery. For instance, Malinowski et al. [64] produced a European Land Cover map with 10 m of spatial resolution using Random Forest. Then, Karra et al. [65] trained a U-Net to generate a global land cover map with the same spatial resolution. Recently, Brown et al. [66] released a Near-Real-Time global land-cover map with improved accuracy by training an FCN network on Sentinel-2 images.

Multi-Task Methods: Super-Resolution and Semantic Segmentation
Combining SR with other tasks has been explored in many works [6,67,68]. Among them, Dai et al. [69] was one of the seminal ones, as it showed that SISR improves the performance of other tasks, such as edge detection and object detection, compared to using LR imagery.
Several works can be found regarding remote sensing applications as well. A usual strategy is to use SR as a pre-processing step, first enhancing the images and, then, training a second network for another task. This strategy was applied by Shermeyer and Van Etten [70], where the authors trained VDSR [12] and SRRF [71] models to obtain the SR images and then trained object detection models (SSD [72] and YOLO [73]) with this enhanced dataset. Pereira and dos Santos [74] trained an SR model, called D-DBPN [75], which was followed by a SegNet model [50], using the super-resolved images to improve the segmented map compared to the native spatial resolution.
Another strategy is to train the models in an end-to-end manner. In [76], the authors extended their previous work by simultaneously training the D-DBPN and SegNet models with images from the 2014 IEEE GRSS Data Fusion Contest dataset and the ISPRS 2D Semantic Labeling Contest [63]. Another work [77] proposed a network architecture composed of convolutional layers with residual connections that first super-resolves the input image and, then, performs a binary segmentation for different targets (planes, boats, etc.).
In a recent work, Wang et al. [6] proposed a dual-path network. This architecture consisted of three branches, a super-resolution branch, a semantic segmentation branch, and a feature affinity module that helped in training, combining HR features from the super-resolution branch rich in fine-grained structural information to guide the learning for the segmentation branch. The model was trained and tested on two public well-known datasets for urban visual understanding (CityScapes [78] and CamVid [79]).
Following a similar approach as [6], Xie et al. [80] proposed the use of improved networks, such as HRNet [54] for segmentation and EDSR [13] for SR, with generated LR images to form the LR-HR pair, which was obtained after training a GAN network for that purpose. They also used a similar feature affinity module in training and only kept the segmentation network in inference mode.
Regarding the use of satellite data, Ayala et al. [81] proposed to use multi-modal data, combining Sentinel-2 and Sentinel-1 imagery to train a U-Net [51] to produce a super-resolved segmentation map of buildings and roads. Khalel et al. [82] proposed an encoder-decoder architecture for pansharpening and segmentation, using WorldView-3 images to train the network in a multi-task fashion.
In [2], the authors propose a dual-path network for super-resolution and semantic segmentation extending a DeepLabV3+ model. This model has a shared encoder and two dedicated decoders, one per task, to produce an SR image with a scaling factor of 2 and the corresponding land-cover map. To deal with the lack of fully annotated maps for training, the authors proposed the use of land-cover maps from [64], paired with the 10 m Sentinel-2 bands, taking care to match the acquisition dates as closely as possible. The performance in both tasks was improved despite the noise introduced by using land-cover maps as GT.
Another relevant work presents the use of multi-task learning, training a shared feature extraction module that produces shared information for task-specific branches, to produce multitask classification, improving the generalization ability from small-scale datasets [83].
In summary, few authors have combined different remote sensing imagery to produce SR and segmentation maps. Aware of this gap, we propose a model that tackles key aspects of SR and semantic segmentation, dealing with satellite data from different sources, to produce a super-resolution image for the 10 m bands of Sentinel-2 with its corresponding improved land-cover map at 2 m/pixel.

Maspalomas Dataset
In a past project, we trained an SR model [7,27] using WorldView imagery as GT. However, creating a segmentation GT map is time consuming. Therefore, we narrowed our study to the region of Maspalomas (Gran Canaria, Spain). This touristic area poses a significant challenge, as it has distinct types of ground covers of varying sizes and colors. We used a large WorldView-2 image of 10 June 2017 (same date for the Sentinel-2 image) to serve as the GT for the SR task and, in addition, we manually annotated labels based on the content of this image to create the GT for the segmentation task. Table 1 shows the spectral characteristics of the 10 m Sentinel-2 bands along with the corresponding WorldView-2 bands. WorldView-2 multispectral bands were resampled to 2.0 m of spatial resolution after applying the preprocessing steps described in [7]. This step is necessary to, first, obtain the Bottom-of-Atmosphere reflectance of the WorldView image and, then, to perform the co-registration with the Sentinel-2 image, whose 10 m bands were interpolated to 2 m for the same purpose.
After co-registration, we cropped both images into pairs of tiles of different sizes, with the smallest tile limiting the patch size for training the models. Figure 1 specifies the location of these regions, where the tiling strategy was mainly defined to facilitate the labeling process.
We generated the fully labeled dataset in two steps. First, we manually annotated some small portions of the WorldView image. Then, we trained an SVM classifier to obtain a preliminary segmentation map. We selected SVM because it performs well even with few annotations [84]. We considered six land cover classes as the most representative of the region: water, vegetation, built soil, bare soil, road, and swimming pool. The preliminary map was noisy, even using a 95% threshold and the appropriate parameters, and some classes were not properly classified on the resulting map. Thus, we manually corrected the unclassified and mislabeled pixels in the SVM-generated map. Figure 2 shows a segmentation map after the correction process. In this manner, we reduced the uncertainties in the formation of the Ground-Truth map. Recall that the labeled map was generated from the WorldView image at 2 m resolution. Some samples of the dataset can be seen in Figure 3, along with the corresponding Sentinel-2 pair. In summary, we created a dataset composed of real-world multisensor imagery with a scaling factor of 5, where the labels were obtained from the WorldView image by manually correcting the SVM map. Some labels cannot be accurately discerned in the corresponding Sentinel-2 image, which represents an extra challenge: the model needs to gather spatial information to improve both the output image and the corresponding segmentation map.
We selected tiles 5 and 7 to form the test subset, because they contain representative scenes with roads, urban zones, ports, vegetation, swimming pools, etc. The remaining tiles formed the training dataset, yielding 308 non-overlapping patches of 160 × 160 pixels, which were split 90-10% into training and validation subsets.
RS-ESRGAN [7] has proven to be a good network for recovering rich semantic features that produce a realistic super-resolution result. Figure 4 shows the architecture of the generator, with a convolutional layer that produces an initial set of 64 feature maps from the bicubic interpolated input, followed by a sequence of Residual in Residual Dense Blocks (RRDBs) [7,20] that performs the dense feature extraction. A long skip connection combines low-level with high-level feature maps learned by the RRDBs. Two final convolutional layers take these combined features to perform the final reconstruction of the SR image.
Therefore, we propose to reuse the rich semantic features produced at the different levels of the feature extraction stages of the RS-ESRGAN to also produce segmentation maps with fine-grained details. Specifically, we propose a network, named SEG-ESRGAN, whose architecture is shown in Figure 5, which includes a main branch for SISR, implemented by the RS-ESRGAN, and an encoder-decoder architecture implementing the semantic segmentation branch. A Feature Affinity (FA) module combines the learning from both branches in a cooperative mode and is used only for training the network. While RS-ESRGAN produces the SR image, several skip connections are retrieved from the feature extraction block of the RS-ESRGAN to reuse some of the features. From the 23 RRDBs in the feature extraction part of RS-ESRGAN, we retrieve output features from the RRDB-1, RRDB-6, RRDB-11 and RRDB-21 blocks, which were determined after hyperparameter tuning.
These features are concatenated with outputs of the various sub-blocks of the encoder part, combining knowledge and reinforcing the synergy from both tasks.

The detailed architecture of the SEG-ESRGAN is shown in Figure 6. The encoder is composed of four sub-blocks (Figure 7a), where each encoder sub-block (Enc_i in the figure) is a sequence of RRDB, Batch-Normalization (BN) and Spatial and Channel Squeeze and Excitation (scSE) block [85], which is also known as a dual attention block [55] or simply as a Squeeze and Excitation block [56]. The architecture of the scSE block is shown in Figure 7b. To increase the receptive field and to promote holistic feature information, the output of each encoder sub-block is downsampled. Thus, to match spatial sizes, the features retrieved by the skip connections are processed with a dilated 2D convolution, whose output is later concatenated with the respective encoder output, to mix information from the SR and segmentation tasks.

The RRDB block in each encoder sub-block performs dense feature extraction, concatenating and combining different low-level information to obtain richer high-level content. Then, the output features from the BN are refined by an scSE block, a dual-path attention block that focuses on retrieving meaningful characteristics in the spatial and spectral domains. We use the same 2D convolution configuration for the RRDB and scSE blocks (a 3 × 3 kernel and stride of 1) in the encoder's sub-blocks.
Regarding the architecture of the Squeeze and Excitation blocks (see Figure 7b), in the spatial domain, the input features are multiplied by a weight map w_s that recalibrates the focus on relevant spatial content. This weight map is obtained after a 2D convolution operation with a 1 × 1 kernel that squeezes the input channels to 1 and, then, passes through a sigmoid, gathering relevant spatial information from the input features.
Several authors [32,86] have already shown that not all channels contribute equally to attaining the best performance. Therefore, in the channel squeeze and excitation block, a Global Average Pooling (GAP) operates over the input features to produce a single vector with one representative value per channel. Then, the information given by the vector is squeezed and expanded by a pair of 1 × 1 2D convolutions [18] with a ratio r equal to four, to reduce the inter-channel correlation and promote the flow of relevant feature content to the next block. Finally, a sigmoid activation produces the weight vector w_c that operates over the input features to find the most relevant spectral content in the feature maps.
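The two attention paths described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, biases are omitted, a ReLU is assumed between the two 1 × 1 convolutions of the channel path, and the two recalibrated outputs are assumed to be merged by element-wise addition, since the text does not specify these details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_se(x, w_conv):
    """Spatial path. x: (C, H, W); w_conv: (C,) weights of a 1x1
    convolution that squeezes the C channels into a single map."""
    w_s = sigmoid(np.tensordot(w_conv, x, axes=([0], [0])))  # (H, W)
    return x * w_s[None, :, :]

def channel_se(x, w_reduce, w_expand):
    """Channel path. w_reduce: (C // r, C); w_expand: (C, C // r)."""
    z = x.mean(axis=(1, 2))                    # global average pooling, (C,)
    hidden = np.maximum(w_reduce @ z, 0.0)     # squeeze + ReLU (assumed)
    w_c = sigmoid(w_expand @ hidden)           # expand + sigmoid, (C,)
    return x * w_c[:, None, None]

def scse(x, w_conv, w_reduce, w_expand):
    """Merge both paths by addition (an assumption, see lead-in)."""
    return spatial_se(x, w_conv) + channel_se(x, w_reduce, w_expand)

rng = np.random.default_rng(0)
C, r = 64, 4                                   # 64 channels, ratio r = 4
x = rng.standard_normal((C, 8, 8))
out = scse(x,
           rng.standard_normal(C) * 0.1,
           rng.standard_normal((C // r, C)) * 0.1,
           rng.standard_normal((C, C // r)) * 0.1)
print(out.shape)  # (64, 8, 8)
```

With all weights set to zero, both sigmoids return 0.5, so the block reduces to the identity under the additive merge; this gives a convenient sanity check for the sketch.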
The first encoder sub-block of Figure 6 processes 64 feature maps from the first convolution of the generator of RS-ESRGAN. The subsequent encoder sub-blocks accept an additional 64 feature maps from the skip connection that are concatenated with the previous encoder block output. Thus, each encoder processes more feature maps and encodes more context in its features.
After passing through the four encoder sub-blocks, the features are downsampled and concatenated with the final skip connection from the RRDBs of the RS-ESRGAN block. The final output stride (the spatial ratio of the downsampled features with respect to their original size) of these high-level features is 16.
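The output stride of 16 can be checked with a short calculation. The sketch below assumes that each of the four encoder sub-blocks halves the spatial size, which is consistent with the stated stride but is not spelled out in the text; the helper name `encoder_feature_sizes` is hypothetical, and the 160-pixel input matches the training patch size.

```python
def encoder_feature_sizes(input_size, num_blocks=4, factor=2):
    """Spatial size after each encoder sub-block, assuming each one
    downsamples by `factor` (assumption consistent with stride 16)."""
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size //= factor
        sizes.append(size)
    return sizes

patch_size = 160                        # training patch side, in pixels
sizes = encoder_feature_sizes(patch_size)
output_stride = patch_size // sizes[-1]
print(sizes, output_stride)             # [80, 40, 20, 10] 16
```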
The architecture of the central and decoder sub-blocks was inspired by the work of [87]. The central sub-block is a variant of a Feature Pyramid Attention block [88]. Figure 8 illustrates the architecture that merges information from three different scales, using different kernel sizes, promoting context information retrieval from high-level feature maps. As in a U-Net, each decoder sub-block combines low-level with high-level feature content. Figure 9 exhibits its architecture, where an upsampling layer rescales the high-level features that are concatenated with low-level features from the encoder part. Then, a sequence of BN and scSE reinforces highly relevant content for the next decoder block.
All decoder outputs are concatenated, matching sizes with interpolation, to produce a hyper-column [89] of feature maps that enriches the descriptors with information from all levels of the decoder and encourages fine-grained segmentation of objects. Finally, according to Figure 6, we add a spatial dropout layer [90] to prevent over-fitting and a logit layer with a 2D convolution to obtain the output channels, which produce one segmentation map per class. Thus, we present a novel architecture that reuses high-level feature maps from different feature extraction stages of the generator of RS-ESRGAN, richer in high-frequency content, to help the segmentation capacity of the encoder-decoder network. The segmentation branch is composed of RRDB blocks, which enhance the network capacity, and scSE blocks, which focus on meaningful content, in this manner producing an SR image and an accurate corresponding segmentation map.
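The hyper-column construction and the final logit layer can be illustrated with the following NumPy sketch. Several details are assumptions made for illustration only: nearest-neighbour interpolation, a 1 × 1 kernel for the logit convolution, eight channels per decoder output, and the helper names. Spatial dropout is omitted because it only acts during training.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def hypercolumn(decoder_outputs):
    """Resize all decoder outputs to the largest map and stack them channel-wise."""
    target = max(f.shape[1] for f in decoder_outputs)
    resized = [upsample_nearest(f, target // f.shape[1]) for f in decoder_outputs]
    return np.concatenate(resized, axis=0)

def logit_layer(x, weights):
    """1x1 convolution (assumed kernel size): (C, H, W) -> (K, H, W) score maps."""
    return np.tensordot(weights, x, axes=([1], [0]))

rng = np.random.default_rng(0)
# Hypothetical decoder outputs: 8 channels each, at four spatial scales.
decoder_outputs = [rng.standard_normal((8, s, s)) for s in (20, 40, 80, 160)]
columns = hypercolumn(decoder_outputs)                        # (32, 160, 160)
scores = logit_layer(columns, rng.standard_normal((6, 32)))   # 6 land-cover classes
label_map = scores.argmax(axis=0)                             # (160, 160)
print(columns.shape, scores.shape, label_map.shape)
```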

Loss Functions
Similarly to the works of [6] and [2], we use the same multi-loss approach for training the multi-task network. We use two dedicated losses for each branch and a feature affinity loss for combining the learning of both branches.
Regarding the super-resolution branch, the L1 norm is computed between the pixel intensity values of the SR output $\hat{Y}$ and the target $Y$, for images with shape H × W and C channels:

$$\mathcal{L}_{SISR} = \frac{1}{HWC} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \left| \hat{Y}_{ijc} - Y_{ijc} \right| \tag{1}$$

For the semantic segmentation branch, we use the weighted Cross-Entropy (CE) loss. For each pixel of the output map, a class vector $\hat{y}_{ij}$ is predicted with the corresponding scores for each class. Each pixel in the target map $y_{ij}$ is one-hot encoded, containing a 1 for the corresponding class and zero for the rest, K being the number of classes. The loss is averaged over the entire map, with H × W shape, as shown below:

$$\mathcal{L}_{CE} = -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{K} w_k \, y_{ijk} \log \hat{y}_{ijk} \tag{2}$$

The weights for each class $w_k$ are computed from the training subset of the dataset, using Equation (3), as in [2] and [91], where $\beta_k$ corresponds to the frequency of occurrence of class k, and the term 1.02 is added for stability, in case of $\beta_k = 0$:

$$w_k = \frac{1}{\ln(1.02 + \beta_k)} \tag{3}$$
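The two branch losses and the class weighting described above can be sketched in NumPy as follows. This is an illustrative sketch, not the training code: the function names are hypothetical, the L1 loss is assumed to be averaged over all pixels and channels, the predicted scores are assumed to be class probabilities, and the weighting is assumed to take the form 1 / ln(1.02 + β_k), which is the usual reading of a stabilising constant of 1.02.

```python
import numpy as np

def l1_loss(sr, hr):
    """Mean absolute error over all pixels and channels of (H, W, C) images."""
    return np.abs(sr - hr).mean()

def class_weights(freq, c=1.02):
    """Per-class weights from class frequencies (assumed form, see lead-in)."""
    return 1.0 / np.log(c + np.asarray(freq, dtype=np.float64))

def weighted_cross_entropy(probs, onehot, weights, eps=1e-12):
    """probs, onehot: (H, W, K); weights: (K,). Averaged over the H x W map."""
    per_pixel = -(weights * onehot * np.log(probs + eps)).sum(axis=-1)
    return per_pixel.mean()

K = 6                                             # number of land-cover classes
rng = np.random.default_rng(0)
labels = rng.integers(0, K, size=(4, 4))
onehot = np.eye(K)[labels]                        # (4, 4, K) one-hot targets
freq = onehot.reshape(-1, K).mean(axis=0)         # class frequencies
weights = class_weights(freq)
uniform = np.full((4, 4, K), 1.0 / K)             # uninformative prediction
print(weighted_cross_entropy(uniform, onehot, np.ones(K)))  # close to log(6)
```

With this weighting, a class that never appears receives the largest weight, 1 / ln(1.02), and more frequent classes receive progressively smaller weights.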
Spatial details are essential for making an accurate segmentation; thus, structural information in the SISR branch can be contrasted, albeit indirectly, with semantic information from the segmentation branch. Therefore, to share the learning between both branches, we use the same implementation of the feature affinity loss as explained in [2,6], in which similarity matrices S are calculated from HR feature maps from both branches, looking for strong connections between pixels in the feature domain. The feature affinity loss (Equation (4)) computes the L1 distance between these similarity matrices:

$$\mathcal{L}_{FA} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left| S_{ij}^{(SSSR)} - S_{ij}^{(SISR)} \right| \tag{4}$$

where $S^{(SSSR)}$ and $S^{(SISR)}$ refer to the SSSR and SISR similarity matrices, respectively, and N is the number of pixels of the feature maps. The final loss we use for training the model consists of a linear combination of the above-mentioned losses, as shown in Equation (5):

$$\mathcal{L} = \mathcal{L}_{CE} + w_1 \mathcal{L}_{SISR} + w_2 \mathcal{L}_{FA} \tag{5}$$

where $w_1$ and $w_2$ are hyper-parameters set to make the loss ranges comparable. In our case, we obtained the best results with the weighting $w_1 = 1.0$ and $w_2 = 0.1$.
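A minimal NumPy sketch of the feature affinity term and the combined loss is given below. It rests on assumptions that the text does not confirm: the similarity between two pixels is taken to be the cosine similarity of their feature vectors, as in the dual-path approach of [6]; the L1 distance is averaged over all matrix entries; and the unweighted term of the total loss is assumed to be the cross-entropy, which does not change the result for w1 = 1.0. All function names are hypothetical.

```python
import numpy as np

def similarity_matrix(feat, eps=1e-12):
    """feat: (C, H, W) -> (H*W, H*W) cosine similarity between pixel features."""
    c = feat.shape[0]
    f = feat.reshape(c, -1)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + eps)
    return f.T @ f

def feature_affinity_loss(feat_sssr, feat_sisr):
    """Mean L1 distance between the similarity matrices of both branches."""
    return np.abs(similarity_matrix(feat_sssr) - similarity_matrix(feat_sisr)).mean()

def total_loss(l_ce, l_sisr, l_fa, w1=1.0, w2=0.1):
    """Linear combination of the three losses (term ordering is an assumption)."""
    return l_ce + w1 * l_sisr + w2 * l_fa

rng = np.random.default_rng(0)
# Small maps on purpose: the similarity matrix has (H*W)^2 entries.
feat_sssr = rng.standard_normal((16, 8, 8))
feat_sisr = rng.standard_normal((16, 8, 8))
l_fa = feature_affinity_loss(feat_sssr, feat_sisr)
print(total_loss(l_ce=0.7, l_sisr=0.05, l_fa=l_fa))
```

Because the similarity matrix grows quadratically with the number of pixels, this kind of loss is usually evaluated on spatially subsampled feature maps; the subsampling itself is left out of the sketch.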

Quantitative Metrics
To evaluate the super-resolution performance, we use the traditional PSNR and SSIM, as well as two additional metrics (ERGAS and SAM) that measure the spectral quality of the results [7].
• Peak Signal to Noise Ratio (PSNR) assesses the reconstruction quality of the image, where a higher value implies better quality.
• Structural Similarity (SSIM) [92] compares three features of the image (luminance, contrast and structure). Values close to 1 indicate a close match between the compared images.
• Erreur relative globale adimensionnelle de synthèse (ERGAS) [93] measures the per-channel error between the images, also taking into account the scaling factor M. In this case, a lower value indicates a better reconstruction.
• Spectral Angle Mapper (SAM) [94] provides an indication of the spectral similarity of both images, where lower values mean lower spectral distortion.
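For reference, three of these metrics can be written compactly using their commonly cited definitions; the exact normalization conventions used in the paper are not given in the text, so the forms below are assumptions. Images are represented as a list of C channels, each a flat list of pixel values, and SSIM is omitted because it requires windowed statistics.

```python
import math

def psnr(pred, target, max_val=1.0):
    """10 * log10(MAX^2 / MSE), computed over all channels and pixels."""
    flat_p = [v for ch in pred for v in ch]
    flat_t = [v for ch in target for v in ch]
    mse = sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def ergas(pred, target, scale):
    """(100 / scale) * sqrt(mean over bands of RMSE_k^2 / mean_k^2)."""
    acc = 0.0
    for ch_p, ch_t in zip(pred, target):
        mse = sum((p - t) ** 2 for p, t in zip(ch_p, ch_t)) / len(ch_t)
        mean_t = sum(ch_t) / len(ch_t)  # band mean of the reference image
        acc += mse / mean_t ** 2
    return 100.0 / scale * math.sqrt(acc / len(target))

def sam(pred, target):
    """Mean spectral angle (radians) between per-pixel spectra."""
    n_pixels = len(pred[0])
    total = 0.0
    for i in range(n_pixels):
        a = [ch[i] for ch in pred]
        b = [ch[i] for ch in target]
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp rounding errors
        total += math.acos(cos)
    return total / n_pixels
```

Note that SAM is invariant to a per-pixel scaling of the spectrum, which is why it is read as a measure of spectral distortion rather than of intensity error.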
For the segmentation performance, we use standard metrics such as IoU, the confusion matrix, Precision, Recall and F1-score.
• Intersection-over-Union (IoU) is computed as the ratio between the overlap of the predicted segmentation area and the GT, and the union of these areas. This metric ranges between 0 (no overlap) and 1 (full overlap).
• The confusion matrix is helpful to assess a multi-class classification or segmentation task. The rows of the confusion matrix indicate the true instances of each class, whilst the columns correspond to the instances predicted for each particular class. The diagonal entries are the True Positive (TP) values for each class, corresponding to the number of samples of the class that are correctly classified. There are two different indicators of misclassification. In a False Positive (FP), the sample predicted for a class actually belongs to another class. In a False Negative (FN), the sample of a particular class was predicted as belonging to another class. The Intersection over Union for a particular class i (IoU_i) is:
$$ \mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i} $$
• The Precision of class i (P_i) is the ratio of TP_i over all predictions for that class, and the Recall (R_i) is the ratio of TP_i over the GT of that class. Considering the confusion matrix presented above, the metrics for a particular class (C_i) can be computed as follows:
$$ P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i} $$
• The F1-score is the harmonic mean of the Precision and Recall of a particular class, which gives an overall measure considering both metrics:
$$ F1_i = \frac{2\, P_i\, R_i}{P_i + R_i} $$
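The per-class metrics above follow directly from the confusion matrix. The sketch below assumes the convention stated in the text (rows are true classes, columns are predicted classes); the function names are illustrative only.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.

    Returns one dict per class with precision, recall, F1 and IoU.
    """
    k = len(cm)
    results = []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # rest of row i
        fp = sum(cm[r][i] for r in range(k)) - tp  # rest of column i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "iou": iou})
    return results

def mean_iou(cm):
    """Unweighted mean of the per-class IoU values (mIoU)."""
    metrics = per_class_metrics(cm)
    return sum(m["iou"] for m in metrics) / len(metrics)
```

For example, with `cm = [[8, 2], [1, 9]]`, class 0 has TP = 8, FN = 2 and FP = 1, giving a precision of 8/9, a recall of 0.8 and an IoU of 8/11.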

Training Details
We trained our model using 308 non-overlapping patches of 160 × 160 pixels. For data augmentation, we applied horizontal and vertical flips, as well as random crops of 160 × 160 pixels, and standardized each channel using the corresponding mean and standard deviation. The model was trained for 200 epochs with early stopping, saving the best weights according to the mIoU metric. After hyperparameter tuning, we used a batch size of 4 and a learning rate of 5 × 10⁻⁴ with the AdamW optimizer [95], an improved version of Adam, and a weight decay of 5 × 10⁻⁵. Different learning rate schedulers were tested, and the best performance was obtained using CosineAnnealing.
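To make the schedule and the augmentation concrete, the following sketch shows the usual closed form of cosine annealing together with flip and standardization helpers. The annealing period (taken here as the 200 training epochs) and the minimum learning rate (taken as 0) are assumptions, since the text does not report them; the helper names are hypothetical.

```python
import math

def cosine_annealing_lr(epoch, base_lr=5e-4, total_epochs=200, min_lr=0.0):
    """Closed-form cosine annealing: decays base_lr to min_lr over total_epochs.

    total_epochs and min_lr are assumed values, not taken from the paper.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

def hflip(patch):
    """Horizontal flip of a 2D patch given as a list of rows."""
    return [row[::-1] for row in patch]

def vflip(patch):
    """Vertical flip of a 2D patch given as a list of rows."""
    return patch[::-1]

def standardize(channel, mean, std):
    """Per-channel standardization: (value - mean) / std."""
    return [[(v - mean) / std for v in row] for row in channel]
```

Under these assumptions, the learning rate starts at 5 × 10⁻⁴, passes through half that value at epoch 100, and reaches 0 at epoch 200. When a segmentation target is used, the same flip must be applied to the input patch and to its label map so that they stay aligned.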

Results
This section presents the results achieved with our model and a comparison with other segmentation and multi-task models. Section 4.1 presents the results obtained by performing inference on the Maspalomas dataset in order to select the RS-ESRGAN weights used to initialize the SR branch of the multi-task model. Section 4.2 presents the results of our proposal, and Section 4.3 provides a comparison with other models trained on the Maspalomas dataset. Finally, Section 4.4 presents inference results on different Sentinel-2/WorldView images that do not belong to the Maspalomas dataset, to show the generalization ability of our proposal.

RS-ESRGAN
RS-ESRGAN [7] is a super-resolution network that maximizes its performance by combining network weights obtained at different training stages. First, only the generator is trained, yielding the so-called PSNR-oriented model; then, this generator is fine-tuned with adversarial training. The best weights for the generator are obtained by interpolation, using Equation (9). In this way, the noise-blur trade-off is mitigated, producing results with more texture and finer details.
$$ G_{\text{interp}} = (1 - \alpha)\, G_{\text{PSNR-oriented}} + \alpha\, G_{\text{adv}} \qquad (9) $$
where G_PSNR-oriented are the best weights achieved after training the generator alone and G_adv are the best weights of the generator after adversarial training.
Using the different pre-trained weights of RS-ESRGAN [7], we perform inference on the test set of the dataset to choose the weights that initialize the SR branch of the multi-task model. We used different values of α in Equation (9) to balance the contribution of the PSNR-oriented model with α = 0 (SR_0), which tends to produce enhanced but still slightly blurry images, and the fully adversarial model with α = 1 (SR_1.0), which tends to refine texture but also introduces noise. Table 2 shows the mean results according to the PSNR, SSIM, ERGAS and SAM metrics. The best result was achieved using α = 0.1 for the PSNR and SSIM metrics, which focus on the reconstruction of highly detailed images. On the other hand, the ERGAS and SAM metrics are indicative of the spectral information with respect to the target image.
Figure 10 shows the SR results of this inference on the test set. We can notice a small spectral difference, but a considerable improvement in the delineation of the buildings and the swimming pools in the image. However, if we look carefully at the images with α > 0.5, we notice some distortion around the edges of objects that hinders the final result.
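The weight interpolation can be illustrated with a few lines of code. This is a simplified sketch in which each set of generator weights is a plain dictionary mapping a parameter name to a flat list of floats; in practice the same arithmetic would be applied to each tensor of the two checkpoints. The function and parameter names are hypothetical.

```python
def interpolate_weights(w_psnr, w_adv, alpha):
    """Blend two weight sets: (1 - alpha) * PSNR-oriented + alpha * adversarial.

    alpha = 0 returns the PSNR-oriented weights, alpha = 1 the adversarial ones.
    """
    if w_psnr.keys() != w_adv.keys():
        raise ValueError("both weight sets must contain the same parameters")
    return {
        name: [(1.0 - alpha) * p + alpha * a
               for p, a in zip(w_psnr[name], w_adv[name])]
        for name in w_psnr
    }

# Toy example with a single two-element parameter.
psnr_weights = {"conv.weight": [0.0, 2.0]}
adv_weights = {"conv.weight": [1.0, 4.0]}
blended = interpolate_weights(psnr_weights, adv_weights, alpha=0.1)
```

With α = 0.1, the value reported as best for PSNR and SSIM, the blended generator stays close to the PSNR-oriented weights and takes only a small contribution from the adversarial ones.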

SEG-ESRGAN Results
The final architecture of SEG-ESRGAN was reached after several experiments, as described in Appendix A. We obtained our best results after loading the pre-trained weights for the SR branch (using α = 0.1) and fine-tuning the branch. The rest of the blocks are initialized using the Kaiming method [96]. Figure 11 shows the results of our model, with a zoom of some regions in Figure 12. In the predicted map of Figure 11d, areas of bare ground are discernible among the vegetation located in the central part of the image near the water. In the same figure, we can see that the port area could not be segmented continuously, and that the neighborhood in the upper right was recognized to some extent, although it is a challenging area, as can be seen in the input image in Figure 11a.
Looking at Figure 11i, we can notice that the model tends to confuse some areas of vegetation and bare soil with asphalt. Note the complexity of classifying the original Sentinel-2 image due to the heterogeneity and size of the existing land covers, with narrow roads, small constructions, pools, small dark shrubs and wet sandy areas over the dunes.
However, we highlight the excellent performance in detecting most of the land covers, especially swimming pools in the residential area. Inspecting the details in the zoomed plots of Figure 12, the performance in detecting small pools is even more noticeable in the second and third rows. It is important to highlight the delineation and clear edges in the SISR results in comparison with the Sentinel-2 input image in the first column of the same figure.
Regarding the quantitative SS performance, Table 3 shows the confusion matrix with extra columns for the precision, F1-score and IoU metrics, as well as the number of pixels per class in the test set (Total Pixels). The confusion matrix is normalized by rows (recall on the diagonal), showing outstanding performance for water and bare soil, which are the majority classes, and a recall of around 75% for the other classes, except for the asphalt class. We can see the confusion between the asphalt class and the bare soil and vegetation classes. The precision for that class only reaches 33%, with a recall of 64%. This confusion can be explained by the high spectral similarity between these classes, especially in the dunes zone and in other areas where the vegetation can easily be confused with dark bare soil. In fact, this made it challenging to manually correct the labeled pixels in the dataset.
Table 4 shows the SR metrics in comparison with bicubic interpolation. Our model improves on all the considered metrics; however, compared with the results obtained by inference using RS-ESRGAN alone (Section 4.1), our model has lost a little SR performance in exchange for better segmentation results.
Nevertheless, if we visually inspect the zoom results in Figure 12, we can notice the improvements regarding the details in the delineation of roads and buildings as well as in the small vegetation in the dunes zone.

Comparison with Other Models
To measure the performance achieved by our SEG-ESRGAN, we compared our super-resolution and segmentation results with other state-of-the-art models, such as:
• U-Net [51] trained with bicubic Sentinel-2 images and SR images from RS-ESRGAN as input;
• DeepLabV3+ [49] trained with bicubic Sentinel-2 images as input;
• HRNet [54] trained with bicubic Sentinel-2 images as input;
• Dual_DeepLab [2] trained with bicubic Sentinel-2 images as input, where the SR images were obtained by using the RS-ESRGAN in inference mode.
We use the F1-score to measure the performance per class, as it combines precision and recall, along with the mean IoU as a global segmentation metric. Tables 5 and 6 show the segmentation and super-resolution metrics, respectively, while Figures 13 and 14 show several samples of the segmentation and super-resolution results.
Our proposed model outperforms modern pure segmentation methods (U-Net, DeepLabV3+ and HRNet) that do not produce an SR image. We also trained a U-Net with ResNet-101 as the encoder, using super-resolved images that were previously inferred with RS-ESRGAN (see Section 4.1). We named this model U-Net+SR. In this case, the results improved in comparison with the U-Net trained using bicubic interpolated Sentinel-2 images, although our proposal still performs better in almost all the classes.
For the comparison with the Dual_DeepLab model [2], we trained the model using the same training strategy, but adjusted the architecture to suit the dataset, i.e., we removed the extra-upsampling module from both decoder blocks, as the input is already interpolated to the target spatial resolution. We also proposed a modification to the Dual_DeepLab model that adds RRDB blocks with separated convolutions to the decoder sub-blocks, calling this model Dual_DeepLab_RRDB. This modification increases the generation capacity of the decoder by producing more feature maps, boosting the segmentation performance. However, our SEG-ESRGAN proposal still produces better segmentation results in all the classes except the asphalt class.
We also show the results obtained by training a U-Net with ResNet-101 as the encoder, using only WorldView images with the same training strategy. In this way, we provide an upper bound on what can be achieved when training a pure segmentation model with very-high-resolution images as input to the network.
Analyzing Table 6, our model performs slightly worse than RS-ESRGAN with α = 0.1 (U-Net+SR) in terms of some super-resolution metrics (PSNR, SSIM, ERGAS). However, a closer inspection of the samples corresponding to the RS-ESRGAN inference and to our SEG-ESRGAN model in Figure 14 reveals barely any difference between both results.
It is also worth analyzing the number of parameters and the memory consumption of our proposed model. Table 7 shows the number of trainable parameters of each model and the estimated memory consumption. ESRGAN has 16.6 million parameters and needs 33 MB of memory. Our model, based on RS-ESRGAN, only adds 14.2 million parameters and nearly 28 MB of extra memory to perform the segmentation task. On the other hand, our best competitor (U-Net+SR) needs to use the ESRGAN model to perform SR first and then uses an extra 103 MB to train the 51.5 million parameters of a U-Net with a ResNet-101 encoder.
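The figures quoted above can be cross-checked with simple arithmetic. The reported sizes are consistent with roughly 2 bytes per parameter; this ratio is inferred from the numbers in the text and is an assumption, not something the paper states. The helper name is hypothetical.

```python
def estimated_memory_mb(params_millions, bytes_per_param=2):
    """Millions of parameters times bytes per parameter gives decimal megabytes.

    bytes_per_param=2 is inferred from the reported figures (an assumption).
    """
    return params_millions * bytes_per_param

esrgan_params = 16.6      # millions, as reported
seg_extra_params = 14.2   # millions added by the segmentation branch
unet_params = 51.5        # millions, U-Net with a ResNet-101 encoder

# Totals for the two pipelines that produce both an SR image and a label map.
seg_esrgan_total = esrgan_params + seg_extra_params  # about 30.8 M parameters
unet_sr_total = esrgan_params + unet_params          # about 68.1 M parameters
```

Under this assumption, the SEG-ESRGAN pipeline uses fewer than half the parameters of the U-Net+SR pipeline (about 30.8 M versus 68.1 M).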

Inference on Other Sentinel-2/WorldView Imagery
To explore the generalization performance on different WorldView/Sentinel-2 image pairs, we used the two pairs described in Table 8. Both datasets were preprocessed as described in [7] before any inference and analysis. Figure 15 shows different results in crops extracted from those pairs. We can notice that our SISR predictions are consistent with the GT image as well as with the predicted segmentation.

Discussion
One of the main challenges when working with deep learning models is having a suitable dataset for training. In this work, to tackle the SR of Sentinel-2 bands, we paired the 10 m bands with the corresponding WorldView-2 bands at 2 m spatial resolution, forming an LR-HR dataset with a scaling factor of 5. We are not aware of other datasets that also include the corresponding land-cover annotations at such a high spatial resolution and, as shown in [2], those authors also rely on released land-cover maps, which may often present a class mismatch between the input and the corresponding GT. Therefore, to work with a high scaling factor, we opted for manually correcting an SVM map generated with few annotations, even though this was time-consuming.
One of the main limitations when producing a multisensor dataset is the difficulty of guaranteeing the same information across sensors. Specifically, although we use images from different sources taken on the same day, there are still spectral differences between them. Even after applying proper radiometric calibration, advanced atmospheric correction models and co-registration to ensure the geographic matching of pixels, these differences remain, mainly due to different shadows caused by differences in acquisition time and off-nadir viewing angles, or to radiometric variations that are difficult to get rid of.
Concerning our final SEG-ESRGAN model, it was obtained after loading pre-trained weights on the super-resolution branch and letting the network adjust them. This model achieved the best performance on the segmentation task without losing details in the SISR image. The proposed architecture includes an encoder with sub-blocks of RRDB-BN-scSE, achieving excellent segmentation results even in a class with few annotations (swimming pools), as can be seen in Table 3. This can be attributed to the great capacity and higher performance shown by RRDB blocks in generating richer features, which also benefit from the dense connections. After the features are processed by the RRDB and the BN in the encoder sub-block, scSE takes these features and dynamically determines the best spatial and spectral characteristics to be transferred to the next encoder sub-block.
We demonstrated that super-resolution and segmentation networks can work together using skip connections to retrieve high-level features with high resolution that can yield better segmentation performance. Even in a challenging scenario, having considerable similarity between classes, our model can produce consistent segmentation as well as detailed edges in the super-resolved image, as shown in the zoom images in Figure 12.
Regarding the SR performance, our model was compared with bicubic interpolation and with the Dual_DeepLab model [2], obtaining better quantitative metrics and qualitative enhancements, as also shown in Figure 12 and Table 6. We also proposed a variant of the Dual_DeepLab architecture that increases the dense connections by adding RRDBs in both decoders.
Regarding the segmentation performance, we compared our model with other pure segmentation networks such as U-Net, DeepLabV3+ and HRNet, as well as with Dual_DeepLab and its variant, Dual_DeepLab_RRDB. All these experiments were carried out using the bicubic interpolated Sentinel-2 images as input. Table 5 shows that our model achieves 0.74 in mean F1 and 0.6278 in mean IoU, which is close to the upper bound performance achieved with a U-Net (ResNet-101 as encoder) trained with only HR WorldView-2 images (mF1 = 0.7883 and mIoU = 0.6723). In addition, we used the pre-trained weights of RS-ESRGAN [7] in inference mode to produce an SR version of the Maspalomas dataset. With these enhanced images, we trained a U-Net model with ResNet-101 as the encoder to produce an SSSR label map. Our model reduces its performance on the SR task to achieve a better result in segmentation. Note that we did not make a comparison at the native resolution of the Sentinel-2 bands, mainly because we lack the corresponding GT labels at that resolution and because, as has already been shown in [2] and [76], a multi-task network tends to perform better with spatially enhanced images.
It is important to highlight that the spatial resolution has been increased by a challenging factor of 5. That is, the original 100 m² pixel area has been enhanced to 4 m². Note that small objects or covers do not appear in the original Sentinel-2 image and are intended to be shown in the final segmentation map. Nevertheless, the land cover maps achieved are quite similar to the ground truth obtained with WorldView-2 data. Specifically, the model performs well in discriminating small swimming pools in residential areas. This can be of interest for city councils, as the possibility of using free medium-resolution images can reduce the budget and effort of monitoring large areas, but we are aware that this topic needs further investigation.
Finally, we indicate that our model significantly reduces memory consumption and the number of parameters to achieve excellent performance regarding segmentation and super-resolution tasks.

Conclusions
The main objective of this work was to use the Sentinel-2 bands in applications that require high resolution, thereby avoiding the high cost of acquiring very-high-resolution imagery, especially in studies involving large surface coverage or multitemporal analysis.
In this context, we propose an encoder-decoder network architecture that obtains high-resolution segmentation maps along with a super-resolved image, with a factor of 5, from low-resolution multispectral Sentinel-2 imagery. To produce the SR image, we based our model on RS-ESRGAN and retrieve skip connections that are used to produce the final segmentation map.
We developed a novel dataset consisting of registered WorldView/Sentinel-2 pairs for the region of Maspalomas, Canarias, Spain, and the corresponding segmentation map obtained from the WorldView-2 image. We manually labeled and corrected the land-cover maps produced using an SVM classifier to reduce the noise and mislabeling errors.
Our model, named SEG-ESRGAN, achieved a global mean F1 = 0.74 and a weighted F1 = 0.8783, as well as an mIoU = 0.6278 on the segmentation task. Compared with a baseline U-Net using bicubic 2 m Sentinel-2 images as input (mF1 = 0.7133, wF1 = 0.8677 and mIoU = 0.5904), our model performs better by a good margin. Even against the same U-Net model trained with improved SR images (mF1 = 0.7233, wF1 = 0.8651 and mIoU = 0.6003), our model still performs better and comes close to the performance achieved with a U-Net trained with only WorldView-2 images (mF1 = 0.7883, wF1 = 0.8944 and mIoU = 0.6723).
Considering the SR performance, our model provides enhanced images with better spatial detail and minimum spectral distortion. It achieved a PSNR = 30.786 and SSIM = 0.816 that outperforms the baseline bicubic interpolation (PSNR = 29.452 and SSIM = 0.792). In addition, we tested SEG-ESRGAN on a different set of Sentinel-2 and WorldView imagery not belonging to the train-test subsets, obtaining excellent results as well.
Furthermore, to generate the extra SSSR map, our model only adds 14.2 M parameters and 28 MB to the already proposed RS-ESRGAN, and with this small addition it reaches better segmentation results than other state-of-the-art models.

Appendix A. SEG-ESRGAN Model Architecture
We performed several experiments to define our best model architecture. Table A1 summarizes the different versions tested and Table A2 shows the corresponding performance using the F1-score metric. Although minor changes were made between each version, we present the main variations between each architecture, besides the training details, as follows:
• v1: We based our model on RS-ESRGAN as the trunk of the dual network. From the feature extraction module of RS-ESRGAN, composed of sequential RRDB blocks, we retrieved four skip connections at different levels. These features are downsampled to different scales to emulate the U-Net architecture and to extract context. Then, the features are connected to the decoder to produce the final segmentation map. These blocks are maintained in almost all the versions, as depicted for our best proposal in Figure 6.
• v2: We used the blocks of ResNet-101 as the encoder. The first feature map is retrieved with a skip connection from the shallow feature extraction block of RS-ESRGAN. We noticed that using the ResNet blocks increased the memory consumption of the dual network.
• v3: We used scSE blocks as encoders. These blocks do not consume much memory and perform well, obtaining useful features that are concatenated with the skip connections from the ESRGAN.
• v4: We added RRDB modules and BN along with scSE to form the encoder blocks. We trained the entire network from scratch without loading any pre-trained weights into the RS-ESRGAN trunk.
We achieved our best results by initializing the SR branch with the pre-trained weights of the RS-ESRGAN network and conducting a hyper-parameter sweep over model v4 using WandB [97]. We searched over the batch size, the learning rate, the loss weights in Equation (5), and different levels for the skip connections from the RS-ESRGAN. Analyzing Table A2, we notice that using the combination of RRDB-BN-scSE in each encoder block improved the performance in the minority class (the swimming pool), which motivated us to continue with this architecture.