A Dual Network for Super-Resolution and Semantic Segmentation of Sentinel-2 Imagery

: There is a growing interest in the development of automated data processing workﬂows that provide reliable, high spatial resolution land cover maps. However, high-resolution remote sensing images are not always affordable. Taking into account the free availability of Sentinel-2 satellite data, in this work we propose a deep learning model to generate high-resolution segmentation maps from low-resolution inputs in a multi-task approach. Our proposal is a dual-network model with two branches: the Single Image Super-Resolution branch, that reconstructs a high-resolution version of the input image, and the Semantic Segmentation Super-Resolution branch, that predicts a high-resolution segmentation map with a scaling factor of 2. We performed several experiments to ﬁnd the best architecture, training and testing on a subset of the S2GLC 2017 dataset. We based our model on the DeepLabV3+ architecture, enhancing the model and achieving an improvement of 5% on IoU and almost 10% on the recall score. Furthermore, our qualitative results demonstrate the effectiveness and usefulness of the proposed approach.


Introduction
Land Use and Land Cover (LULC) maps are essential tools for documenting the changes in the environment and for quantifying the human footprint on the Earth's surface [1]. Due to the increasing availability of high resolution, frequently collected remote sensing data, there is a clear need for a highly automated data processing workflow to update the land cover and land use changes [2].
In remote sensing (RS), two important concepts regarding satellite imagery are spatial resolution and spectral resolution. Spatial resolution is the ground area imaged for the instantaneous field of view of the sensor. The higher the spatial resolution, the more detail it will contain. Fine details like small buildings, cars and street lines can be seen in very high-resolution platforms (50 cm-1 m), on high-resolution (1-4 m) a tree or a bus can be distinguished, whilst medium/moderate-resolution images (4-50 m) will only show coarse features [3]. A sensor's spectral resolution specifies the number of spectral channels, and their bandwidth, in which the sensor can collect reflected radiance. The spectral reflectance signatures can be used to identify the mineral content of rocks, the moisture of soil, the health of vegetation, etc. In order to achieve high resolution in the spectral domain, images are captured using multispectral or hyperspectral sensors. Moreover, another concept that plays an important role is the revisit time of the satellite, which indicates the time needed for the space platform to collect consecutive data of a specific location of the Earth.
When analyzing among the possible sources of images, we encounter some tradeoffs. On the first hand, there exist commercial satellites that provide imagery with spatial resolution of less than a meter, but these data can become expensive when needed for a

Related Work
The introduction of Deep Learning techniques in the computer vision field has led to major advances in all its different sub-domains (object detection and semantic segmentation among others). For this reason, the RS community has been recently attracted to use it in tasks like semantic segmentation or super-resolution. Particularly, CNNs [8] have been widely applied with outstanding results on different RS imaging problems [9][10][11][12]. The work related to this paper is presented in the next three sub-sections.

Semantic Segmentation
Semantic segmentation aims to assign a finite set of semantic labels, such as land cover classes, to every pixel in an image [13][14][15]. The network predicts a probability distribution of all classes for each pixel, and assigns the most probable class to it. Architectures based on an encoder-decoder scheme are commonly used [16][17][18]. In those architectures, the encoder gradually reduces the spatial dimensions of the input image in order to encode rich semantic information, whilst the decoder tries to gradually recover the spatial information so as to recover high resolution feature maps with sharp object boundaries. A very popular architecture is U-Net [17], which is broadly used due to its symmetry, achieved by maintaining skip-connections in all the levels of the encoder-decoder structure.
On the other hand, networks based on Spatial Pyramid Pooling modules [19] are able to encode rich contextual information by pooling features at different resolutions. However, detailed information related to object boundaries is missing due to the pooling or convolutions with striding operations within the network backbone. DeepLabV3 [18] employs various parallel atrous convolutions at different rates in its Atrous Spatial Pyramid Pooling (ASPP) module to capture contextual information at different rates, but lacks of a powerful decoder to recover high resolution feature maps. Atrous or dilated convolutions allow the expansion of the receptive field without loss of resolution and avoid the max-pooling operations, so feature maps at an arbitrary resolution can be obtained. DeepLabv3+ [20] extends DeepLabv3 by adding a simple, yet effective, decoder module in order to improve the object boundaries, such as in an encoder-decoder based structure, while maintaining the rich semantic information provided by a more powerful encoder based on a Spatial Pyramid Pooling module.
In the remote sensing field, the problem of semantic segmentation has been addressed from many perspectives, ranging from statistical approaches to methods based on machine learning [21,22]. Within this group, Random Forests (RF) and Support Vector Machines (SVM) are the most widely used, as they achieve good performances and are resistant to overfitting even with small training sets. However, deep learning models are becoming the state-of-the-art technology in LULC applications and have been shown to outperform classical approaches [23][24][25]. As opposed to methods that perform pixel-wise classification taking into account only single pixel features, like RF or SVM, deep learning models based on CNNs use contextual information of each pixel neighborhood, which leads to the improvement of performance and the reduction of noise in the resulting segmentation maps.
Many DL based models have been recently proposed for LULC classification. In early studies, labels have been predicted pixel by pixel using patch-based CNNs, relying on a small patch around the target pixel [26][27][28]. This approach has been applied in problems with limited annotated data, but it is time consuming and does not guarantee the spatial continuity and integrity of labels.
Fully convolutional approaches overcome the limitations of patch-based CNNs. They use an encoder-decoder structure, where typically the encoder is one of the popular CNN architectures (like VGGNet or ResNet) pretrained on the natural-image dataset ImageNet [29], and fine-tuned on RS data. For example, this approach has been applied in [30] on Landsat 5/7 multispectral images and in [31] on WorldView-2/-3 images. Other approaches follow an object-based strategy, combining CNNs with unsupervised image segmentation (e.g., superpixels) [32,33].

Single Image Super-Resolution
Single Image Super-Resolution (SISR) aims to recover a high-resolution (HR) image from a low-resolution (LR) image [34]. These techniques seek to learn implicit redundancy that is present in the the data to recover missing HR information from a single LR instance [35], which usually implies learning local spatial correlations.
As stated in [36], there are four kinds of supervised Deep Learning-based SISR methods. One is pre-upsampling SR, which applies a conventional upsampling operation, such as a bicubic interpolation, and then refines the HR image by using a deep convolutional neural network. This approach is very computationally expensive since most of the operations are done in the high dimensional space. The second one is post-upsampling SR, which integrates learnable upsampling layers at the end of the model instead of the traditional upsampling layers, reducing the computational cost. The third one is progressive-upsampling SR; it is based on post-upsampling, but aims at gradually reconstructing high-resolution images and allows multiscale SISR. Finally, iterative up-and-down SR is based on generating intermediate images, by iteratively employing upsampling and downsampling layers, and combining them to reconstruct the final SISR image.
An alternative to the pre-upsampling method is proposed in [35], with a CNN architecture where feature maps are extracted in the low-resolution space. Moreover, an efficient sub-pixel convolution layer (known as Pixel Shuffle) is introduced, which learns an array of upsampling filters instead of using a handcrafted interpolation.
On the other hand, architectures based on Generative Adversarial Networks (GANs) [37], like SRGAN [38] or ESRGAN [39], have been proposed as they produce high resolution images with photo-realistic details. Models based on GANs have also been applied for the super-resolution of remote sensing imagery [10,[40][41][42].
In particular, some works tackle the problem of super-resolving Sentinel-2 bands using DL approaches. Specifically, Lanaras et al. [43] propose to super-resolve the LR bands to 10 m using a CNN with skip connections (named resblocks) between feature maps, while [44] includes more resblocks and adversarial training. Other approaches, like [45], combine resblocks with self-attention mechanism and a procedure for training these models in high-performance environments. Other solutions have also been proposed in [46][47][48][49], focusing in learning difference details between the LR and HR bands.
On the other hand, to improve the spatial resolution of the Sentinel-2 10 m channels [50] uses an ESRGAN as baseline to produce SR of RGB Sentinel-2 bands with scaling factors 2 and 4, previously downsampling the dataset to form the LR-HR pairs for training. Li and Li [51] produce Sentinel-2 RGB images at 2.5 m, using GANs with the ESRGANstyle, introducing kernel estimation and noise injection to construct the pair of LR-HR from LR images. A comparison between several Sentinel-2 SR models using Wald's protocol [52] to generate the LR-HR pairs has been recently presented in [53].

Super-Resolution for Improving Semantic Segmentation
SISR can help to improve the results of semantic segmentation approaches. This idea has been explored in various works such as [54][55][56]. In particular, Dai et al. [54] show that applying SISR to input images of other computer vision tasks, like semantic segmentation, edge detection and object detection, improve their performance in LR imagery.
In the remote sensing field, some works apply super-resolution as a pre-processing step, using a first network for super-resolution and a second one for semantic segmentation of the super-resolved image [56,57], where both networks are separately trained.
A unified framework is proposed in [58], with a super-resolution network based on convolutional layers with residual connections and an encoder-decoder architecture for semantic segmentation, trained end-to-end. The model is trained and evaluated for the binary segmentation (object and background) of small patches with airplanes, ships and oiltanks. Another end-to-end framework is proposed in [59], using a D-DBPN for super-resolution followed by a Segnet model for semantic segmentation, and training them with a multi-task loss using images from the 2014 IEEE GRSS Data Fusion Contest dataset and the ISPRS 2D Semantic Labeling Contest [60].
Besides, a super resolution domain adaptation network was proposed in [61] to address the domain shift problem in the task of semantic segmentation of images with different resolutions (source and target domains, with low and high-resolution images, respectively). The model is trained with adversarial learning on datasets of very high resolution true orthophotos from the ISPRS 2D Semantic Labeling Contest [60].
In a recent work, Wang et al. [36] propose a two-stream model. Their model consists of three parts, a super-resolution stream, a semantic segmentation stream and a feature affinity module that helps to enhance the high-resolution features of the super-resolution stream with fine grained structural information from the super-resolution branch. The model is trained and evaluated in CityScapes and CamVid, two datasets for urban visual scene understanding. Our model adopts this dual-network approach, introducing modifications on the DeepLabV3+ architecture. Specifically, we employ more skip-connections between the encoder and both decoders, adding extra upsampling modules with a pixel-shuffle mechanism. We train our model on a subset of the Sentinel-2 Global Land Cover dataset, outperforming the baseline DeepLabV3+ trained with the same LR images, producing smooth and accurate segmentation maps and an improved version of LR input images.

Dataset
The S2GLC (Sentinel-2 Global Land Cover) project [2] was led by the Space Research Centre of the Polish Academy (CBK-PAN) with the support of the European Space Agency (ESA). The main goal of the project was the development of a methodology for producing high resolution global land cover maps based on Sentinel-2 imagery.
Specifically, we used the S2GLC 2017 or Land Cover Map of Europe 2017, available at [62], which is a product resulting from the Phase 2 of the S2GLC project, that restricted the methodology employed on S2GLC just to the European continent. The map was obtained by means of classifying, with a high level of automation, more than 15,000 Sentinel-2 images collected during the year 2017. The methodology for the classification of multi-temporal Sentinel-2 imagery relied on the random forest algorithm and achieved a high thematic overall accuracy, over 86% at country level. The resulting dataset legend consists of 14 land cover classes (see Figure 1). The map pixel size equals 10 m, which corresponds to the highest spatial resolution of Sentinel-2 imagery. We restricted our study area to Catalonia (Spain) and we used the S2GLC 2017 land cover map corresponding to that region as ground truth for the segmentation task. We searched for 2017 Sentinel-2 satellite images corresponding to this region (see Table 1), so as to match the date when the land cover dataset was created. We used the 10 m multispectral channels as ground truth for the super-resolution branch, composed by Bands 2, 3, 4 and 8 of Sentinel-2 images (Blue, Green, Red and Near Infrarred (NIR) channels, respectively). Then, we created our dataset (S2GLC-Cat) composed by geo-referenced pairs of the Sentinel-2 images and their corresponding land cover map from the S2GLC 2017 dataset. We cropped the S2GLC to match each Sentinel-2 image, reprojected the Sentinel-2 imagery using the coordinate system of S2GLC data and co-registered each pair. The process included locating and matching a number of ground control points in both images and then performing a geometric transformation. Automatic and manual control points were extracted to obtain a representative and well distributed set of points. Finally, a polynomial warping algorithm was applied to Sentinel-2 images.
Since both, land cover maps and Sentinel-2 multispectral images corresponding to the region of Catalunya, were too large, we formed our train and test sets by taking random patches of 512 × 512 from those images. It resulted in a total of 2700 images for the train set and 300 images for the test set. In order to implement the dual path approach, the input image was formed by downsampling the Sentinel-2 patches by a scale factor of 2, and we kept the full-resolution patches and labels as ground truth data. Moreover, we computed the histogram of both, the resulting patches and the full images, to check for any class imbalance. We concluded that the patches were representative of the full images. However, due to the non-stationarity behaviour of land cover classes, such as clouds and permanent snow surfaces, some images did not match their corresponding label. As discussed in Section 5.2, we relabeled more than 270 images in order to improve the segmentation results.

Network Architecture
Encoder-decoder networks (see Figure 2) have been successfully applied to many computer vision tasks, such as semantic segmentation, object detection and pose estimation. They are typically composed by an encoder module, that gradually reduces the spatial dimensions whilst extending the number of channels of the input image in order to encode rich semantic information; and a decoder module, which tries to gradually recover the spatial information so as to retrieve high resolution feature maps. In those architectures, it is referred as Output Stride (OS) the ratio of the input image spatial resolution to the encoder output resolution. For semantic segmentation tasks OS = 16 (or 8) is usually adopted by the feature extractor [20], meaning that the encoder output spatial resolution is 16 times smaller than the input image. From this point, the decoder gradually upsamples the feature maps generally making use of skip connections from the encoder at different levels.  The key point of our proposed architecture is a dual path network approach (DPN), which is inspired by [36]. This approach mainly consists in predicting one segmentation map and one super-resolved image, where both are twice the size of the input image. This is done simultaneously in a multitask fashion by employing two dedicated branches in the network architecture. It is worth mentioning that the model can be adjusted to work with any scaling factor by making minor changes in the decoder part.
The segmentation accuracy of the network can be related with the size of the input image (and its corresponding ground truth map): the higher the input spatial resolution, the better the performance [36]. This happens because larger input images contain finer spatial information labeled in the corresponding ground truth, so the edges of the different classes become more clear.
The motivation behind the dual-network approach (see Figure 3) is to use a lowresolution (LR) input image to predict a high resolution (HR) segmentation map, guiding the process with a HR version of the original image that is generated by a second branch. Thus, the learning paradigm consists of integrating the idea of super-resolution into an existing semantic segmentation pipeline to keep HR representations. The network, as proposed in [36], consists of a Semantic Segmentation Super-Resolution (SSSR) branch that predicts the HR segmentation map, and a Single Image Super-Resolution (SISR) branch that reconstructs a HR version of the input image, where both outputs sizes are twice the input size. Apart from those branches, there is also a Feature Affinity (FA) module that tries to enhance the HR features of the SSSR with the fine-grained structural information from the SISR by computing a loss between both outputs. More details about this FA module and the FA loss will be explained in Section 3.3.

Shared
FA Loss The idea is that the two branches share the same encoder (feature extractor) but have their own decoder. The SSSR branch is optimized with a typical semantic segmentation loss, such as the Cross Entropy Loss, and the SISR branch is optimized with a pixel-wise loss, such as Mean Square Error. Furthermore, as commented, there is also a FA loss that tries to guide the learning of both branches. All these losses will be explained individually in Section 3.3.

Semantic Segmentation Super-Resolution (SSSR)
In our case, we treat segmentation as the main task but we maintain the SISR output at inference time since we are also interested in predicting a HR version of the input image. Nevertheless, notice that at inference time this branch can be removed, notably reducing the computation cost, if only the segmentation map is of interest.
As stated before, the dual path network approach consists of integrating the idea of super-resolution into existing semantic segmentation architectures. We implemented this idea by appending an extra upsampling module at the end of the decoder of the DeepLabv3+ [20] network. Apart from that, we redesigned the original decoder module mainly to improve the super-resolution results, to cope with the peculiar spatial granularities of satellite imagery. We opted for considering the same design of the decoder and the extra upsampling module for both SSSR and SISR branches in order to maintain some kind of symmetry.
The DeepLabV3+ architecture (see Figure 4) extends DeepLabV3 [18] by adding a simple but effective decoder module to refine the segmentation results especially along object boundaries. The architecture is based on a powerful backbone encoder (we use ResNet101 [64]), an atrous spatial pyramid pooling module that allows encoding multiscale contextual information, and a decoder that receives a skip-connection from the encoder low-level features to facilitate the upsampling path. DeepLabV3 is characterized by employing atrous (dilated) convolutions in the last group of layers in order to maintain the resolution of the feature maps at an arbitrary resolution. Using ResNet101 as backbone encoder, the spatial resolution of the output feature maps is 32 times smaller than the input image resolution (OS = 32).  We conducted several experiments with different versions of the decoder and the extra upsampling modules. Here we explain the models, and the results of the experiments where all of them will be presented in Section 4. We started by just adding the extra upsampling modules, consisting in a stack of 3 × 3 2D Convolutions, followed by a nearest neighbor upsample and another 3 × 3 2D Convolution, on top of each decoder.
We then proceeded to add more skip-connections to the decoders since the SISR results were not satisfactory. In a first step, we added another concatenation with a lower-level feature map (referred as model v1 in Section 4). Then, we added the concatenation with the bicubic interpolation of the input image before the final upsampling just in the SISR branch (model v2). And finally, we explored adding also the concatenation with the bicubic interpolated image in the SSSR branch (model v3).
We further studied the configuration of the extra upsampling module for the SSSR branch. In Figure 5 we present the final design for the model which yields the best results (model v4). Notice that the decoder path receives skip-connections consisting in the ResNet101 feature maps F1, F2 and the bicubic interpolation of F0 at the extra upsampling module (both for the SSSR and SISR branches). Here F0 refers to the input image, and F1, F2 refer to the ResNet101 feature maps whose spatial dimensions are respectively two and four times smaller than the input image (see Figure 5 for clarification). Moreover, for the SSSR branch, the Segmentation Head module does not convert to the desired number of classes and preserves the channel dimensions; so the extra upsampling is done directly in the feature maps and a 3 × 3 2D Convolution is then used to convert to the number of classes. For the SISR branch, the channel dimensionality is reduced progressively. Figure 6 shows implementation details of both decoders. Note that long skip connections from the encoder provide low-level features that help in the reconstruction of high-resolution details.
During the experiments we also explored changing the type of upsampling done in each upsample module from the DeepLabv3+ architecture. We tried setting all the upsampling modules in the architecture to (1) nearest neighbor, (2) transpose convolution, and (3) Pixel Shuffle sub-network, and we obtained the best results for case (3). For this reason, in the illustration of the architecture presented in Figure 5, the upsampling modules use pixel shuffle but are just referred as "Upsampling ×2".    Figure 6. Implementation details of both decoders. Each convolutional layer is characterized by [kernel-size × inputchannels × output-channels].

Loss Functions
In this subsection we present the different losses used for training the neural network. Since the approach consists in a multi-task model, specific losses for each task are considered. We employed Cross Entropy (CE) loss and CE with weights for semantic segmentation; two pixel-wise losses, Mean Square Error (MSE) and Mean Absolute Error (MAE), for super-resolution, and the Feature Affinity (FA) loss to guide the SSSR learning from the SISR branch.

Semantic Segmentation Loss (SSL)
The Cross Entropy Loss is a common loss used in multi-class segmentation problems. The expression for each training example is the following: whereŷ is the vector containing the predicted probabilities for each class, y is the target one-hot vector containing a "1" only at the correct class position, and K is the number of classes. A common approach is to use the CE with weights. This variant of the CE loss is very useful when the dataset is unbalanced, i.e., there are classes that appear much less than others. CE with weights employs a rescaling weight given to each class, weighting more less frequent classes to improve the results for those classes.

Super-Resolution Loss (SRL)
MSE or MAE are commonly used as reconstruction loss for Single Image Super-Resolution since they compare the reconstructed image with the target one in a pixelwise manner.
The MSE is used to minimize the error defined as the sum of all the squared differences between the true and the predicted values, where N is the number of pixel in the images: Alternatively, the MAE loss aims to minimize the error which is the sum of the absolute differences between the true and the predicted values:

Feature Affinity Loss (FAL)
The Feature Affinity module aims to guide the learning of the SSSR branch from the SISR branch, since the segmentation pipeline can benefit from the fact that SISR can reconstruct high-resolution fine-grained information from a low-resolution input. The idea is that the feature maps from the SSSR are enhanced by the SISR, which contains more detailed fine-grained structural information, thus obtaining a finer segmentation. Even though the structures from SISR do not directly imply semantic categories, they can be grouped by the relationship between pixel and pixel, or region and region. As proposed in [36], we modeled these details by the correlation between internal pixels.
The FA Loss aims to learn the distance between the similarity matrix of the HR features of SSSR and the HR features of SISR, where the similarity matrix describes the pairwise relationship between every pair of pixels of a given feature map (for a feature map F with dimensions C × W × H, the similarity matrix would contain (W × H) 2 entries consisting in the relationship between every two pixels in the spatial dimension).
where S (SSSR) and S (SISR) refer to the SSSR and SISR similarity matrix respectively, and q is the norm, set to 1 for stability. So, the loss computes the pixel-wise distance (absolute value) for all the entries in the matrices, sums them up and normalizes by the total number of entries. Given a feature map F, the entry (i,j) of the similarity matrix of that feature map is computed by projecting the vector (dot product) taken in the channel dimension from the spatial dimension pixel number i, i.e., F i to the vector taken from the pixel j, i.e., F j , where the pixels are numbered in a row-wise manner from 1 to W × H. This models the correlation between internal pixels. See Figure 7 for a visualization example.
in this case p is the norm, set to 2 for stability. Note that for the implementation of the computation of the whole similarity matrix, feature maps can be flattened on their spatial dimensions and the pairwise relationship between every row vector (first normalized by the 2-norm) can be computed just by multiplying the resulting matrix by its transpose (see Figure 7).
Although, as considered in [36], it is better to compute the correlation of every pair of pixels, in our implementation we subsampled the feature maps to 1/8 before computing the similarity matrix to avoid high memory overheads. Moreover, since the high-resolution feature maps of both SISR and SSSR branches have different channel distributions, the FA module (See Figure 5) also incorporates a 1 × 1 2D Convolution that ensures that the number of channels of SSSR matches the ones of SISR in order to reduce instabilities during the training.

Multi-Task Loss
Since our approach consists in a multi-task network, the whole objective function, shown in Equation (6), is composed by a linear combination of a loss for semantic segmentation (CE or weighted CE), a loss for super-resolution (MSE or MAE) and the feature affinity loss.
where w 1 and w 2 are hyper-parameters set to make the loss ranges comparable. In our case we obtained the best results weighting both w 1 and w 2 to 1.0.

Training Details
The training consists in the minimization of the multi-task loss function (Equation (6)) in an end-to-end manner.

Data Standardization
A common approach for speeding up the convergence of a neural network is to normalize or standardize the data prior to the training. Then, at the output of the model, the original dynamic range of the image is recovered by doing the inverse process. We obtained better results by standardizing the input image. This is done per-channel, by subtracting the mean and dividing by the standard deviation of the input image: where µ and σ are the per channel computed mean and standard deviation, respectively, and MAX I is the maximum possible pixel value of the image.

Weights for Unbalanced Classes
Weights can be employed in the Cross Entropy loss to try to mitigate the negative impact of class imbalance in the dataset. Those weights are inversely proportional to the frequency of occurrence of each class in the dataset, so classes that have less appearance are weighted more. We used the following expression to compute the weights, as suggested in [65]: w n = 1 ln(1.02 + β n ) (8) where β n corresponds to the frequency of occurrence of the class, and the term 1.02 is added for stability.

Optimizer
We tried different optimizers and the best results were obtained using Adam. The learning rate was initialized to 2 × 10 −4 . We also explored the use of different learning rate schedulers (step decay, cosine annealing) but we did not obtain a clear improvement by using them.

Quality Assessment
In this section we will explain the metrics used to quantitatively assess the performance of the results obtained from the test set. We will differentiate between semantic segmentation metrics, used to evaluate the SSSR performance, and super-resolution metrics for the reconstruction of the SISR image.
In addition, we will present qualitative results, since the metrics are quite limited by the noisiness of the semantic segmentation ground truth.

Semantic Segmentation Metrics
• Intersection-Over-Union (IoU) or Jaccard Index: it is a very popular metric used in semantic segmentation. The IoU is computed as the ratio of the area of overlap between the predicted segmentation and the ground truth (intersection), and the area of union between the predicted segmentation and the ground truth. The metric ranges from 0 to 1, with 0 indicating no overlap and 1 indicating ideally overlapping segmentation. For a multi-class segmentation, the mean IoU (mIoU) is computed by averaging the per class IoU. • Confusion matrix: it is a matrix indicating on its rows the instances of the true classes whilst in its columns indicates the instances of the predicted classes. From the confusion matrix, the per class IoU can be obtained as: where TP stands for True Positive pixels, that can be computed taking the diagonal of the confusion matrix, GT stands for Ground Truth pixels and are obtained by taking the sum over columns (total number of true pixels for each class), and Pred stands for Predictions and are obtained by taking the sum over rows (total number of predicted pixels for each class).

Super-Resolution Metrics
• Peak Signal-to-Noise Ratio (PSNR): it is a widely used metric to quantify the quality of reconstructed images. It is defined as follows: where MAX I is the maximum possible pixel value of the image. • Structural Similarity Index Measure (SSIM) [66]: it is a metric used for measuring the similarity between two images. SSIM is a perception-based model that considers image degradation as perceived change in structural information. The SSIM extracts three key features from an image: luminance, contrast and structure from both the reference image (x) and the reconstructed one (y).The resulting metric ranges from −1 to 1, or is re-adjusted to be in the range [0,1]. The larger the value, the better results.
where µ x is the mean of x, µ y the mean of y, σ 2 x the variance of x, σ 2 y the variance of y, σ xy the covariance of x and y, c 1 = (k 1 L) 2 , c 2 = (k 2 L) 2 are two variables used to stabilize the division with weak denominator, L is the dynamic range of the pixelvalues, k 1 = 0.01 and k 2 = 0.03 by default.

Experiments and Results
We conducted several experiments with modified versions of the dual network architecture. We present the results obtained for the following four architectures: The main differences between the four models are summarized in Table 2. For each architecture, we performed several experiments varying the loss function used for semantic segmentation (CE with or without weights) and for super-resolution (MSE or MAE), the contribution of the losses in the multi-task loss, the upsampling method and the initial learning rate, and computed the super-resolution and semantic segmentation metrics on the test set. The most relevant results are presented in Table 3. Table 2. Different versions of the dual-network architecture.
x x x x -Bicubic interpolation of input image on SISR branch.
x x x -Bicubic interpolation of input image on SSSR branch.
x -Bicubic interpolation spectrally diffused with a 1 × 1 Conv2d x From Table 3 we can conclude that the best results are achieved by model v4, i.e., by using the bicubic interpolated image in both branches, as it obtained better segmentation results (given by mIoU) and achieved an equal value of SSIM as well. Regarding the SR Loss, the best super-resolution metrics were obtained when using MSE and by weighting its contribution in the total loss by 1.0. Choosing either Nearest Neighbor or Pixel Shuffle in the upsampling modules lead to the best segmentation results. Even though Nearest Neighbor achieved slightly higher segmentation metrics, we opted for Pixel Shuffle since the qualitative super-resolution results were much better.
Moreover, to asses the super-resolution model branch, results obtained with nearest neighbor, bilinear and bicubic interpolation techniques (×2) on the test set are provided in Table 4.  Figures 8 and 9 show some qualitative results obtained with our best model configuration (v4 using Pixel-Shuffle). Images were downsampled to form the LR-HR pair. Therefore, the input LR images are at 20 m and both GT images (for SISR and SSSR branches) are at 10 m. Examples of super-resolution results using Nearest Neighbor, bicubic interpolation and our model are presented in Figure 8. Semantic segmentation examples are shown in Figure 9. It can be observed that segmentation maps are smooth and remove some of the noise that is present in the ground truth annotations (see Section 5.3).

Dual Network Architecture
We implemented a dual network approach for semantic segmentation and superresolution based on an encoder-decoder structure, so as both tasks in the multi-task network share the same encoder as a feature extractor, but implement their own branch in the decoder path. We used as baseline the DeepLabv3+ architecture and modified it due to the particular fine grained structure of satellite images.
We showed the benefits of using a skip-connection consisting in the bicubic interpolation of the input image, and explored its concatenation either to just the SISR branch or to both SSSR and SISR branches. The best results were obtained when concatenating to both branches, but diffusing the spectral information of the interpolated image before concatenating into the SSSR branch. Regarding the type of upsampling modules employed in the whole architecture, we conclude that the best results are obtained by using the Pixel Shuffle sub-network.

Class Re-Labeling
Due to the non-stationary behaviour of land cover classes, such as clouds and permanent snow surfaces, some Sentinel-2 input images did not match their corresponding ground truth labels from the S2GLC 2017 dataset. Therefore, we decided to relabel those images in order to obtain more accurate segmentation results. We inspected the whole dataset and relabeled 270 images. After that, we trained the v4 Pixel Shuffle architecture on the relabeled dataset. Table 5 shows the confusion matrix as well as the IoU per class and mean IoU obtained with the relabeled dataset. We observe that the segmentation results on clouds and permanent snow covered surfaces increased significantly (+62.55% and +15.23%, respectively). Moreover, the global mIoU increased +5.28%, from 0.482 to 0.535. Table 5. Confusion matrix after relabeling the dataset, normalized by rows. The IoU column shows the segmentation metric with the relabeled dataset, and the IoU* column presents the results for the original dataset. Best values are in bold.  (Table A1).

Noisy Annotations
The reported semantic segmentation metrics are based on comparing the model predictions with the ground truth labels provided by the S2GLC land cover maps. However, this ground truth has been generated automatically using a Random Forest classifier. A global accuracy of 86% has been reported on this dataset [2], which means that there is an intrinsic and unavoidable inaccuracy in the ground truth that we use to train our models, which has an effect in the results obtained with them. In addition, the automatic procedure used to generate the ground truth land cover map is a pixel-based approach. The decision on a pixel does not take into account the pixel context, as opposed to predictions obtained by semantic segmentation models based on CNNs like our model. Therefore, ground truth annotations are noisy. Our model mitigates this noise and provides smoother segmentation maps, as can be appreciated in Figure 11. The IoU metric used for evaluating the segmentation result is not completely indicative of the performance of our model due to this high level of noise in the ground truth.

Comparison with Low-Resolution Predictions
In order to assess the usefulness of our approach, we trained a plain DeepLabV3+ architecture just for segmentation, using again as input the downsampled version of the original images and, as ground-truth segmentation maps, the donwsampled version of the relabeled ground truth maps. That is, no super-resolution is applied in this model. The goal was to compare the segmentation results obtained with this LR model and with the SSSR model (the high resolution semantic segmentation branch).
In this experiment, training DeepLabV3+ using the same procedure explained in Section 3.4, we reached a mIoU = 0.485, while our dual network method achieved mIoU = 0.535. Table 6 presents the precision, recall and IoU scores per class for both models (best result in bold). There is an increase in IoU in most of the classes, specially for classes with low IoU scores, such as marshes, peatbogs, natural material surfaces and permanent snow covered surfaces, as well as in the mean IoU. Additionally, the precision and recall were improved in the majority of classes, achieving a mean recall increase of almost 10%. Compared with DeepLabV3+, ourl model reduces the number of false negatives, especially in less frequent classes like marshes, vineyards and peatbogs, which agrees with the gain in the IoU scores.
Some qualitative results are shown in Figure 12, demonstrating the effectiveness of our method. The high resolution segmentation maps provide more details and a better definition of contours than the LR maps.

Conclusions
The main objective of the work was to apply Deep Learning techniques to obtain high resolution segmentation maps from lower resolution multispectral Sentinel-2 imagery. We implemented a dual network approach based on an encoder-decoder structure where both tasks in the multi-task network share the same encoder as a feature extractor, but implement their own branch in the decoder path. The SISR branch produces a super-resolved version of the input image with a scale factor 2, and the SSSR branch that generates the semantic segmentation map also at double resolution. The model is based on the DeepLabv3+ architecture. We trained and tested the model on the S2GLC-Cat 2017 dataset.
Regarding the super-resolution metrics, we obtained a PSNR = 35.4239 and SSIM = 0.7756, which are higher than baseline interpolation methods (bicubic interpolation: PSNR= 34.1574, SSIM = 0.6732). As for the semantic segmentation metrics, we showed the increase in the mIoU due to the re-labeling task, and achieved mIoU = 0.535 on the relabeled dataset. This metric is not highly indicative due to noise produced by the method used to generate the ground truth land cover maps. Our model outperforms a DeepLabV3+ trained with the same LR images and predicts smooth, as well as accurate, segmentation maps. Quantitative and qualitative results demonstrate the effectiveness of the proposed approach.  Data Availability Statement: Data are available at [62,63].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: