Land Use Land Cover Classification with U-Net: Advantages of Combining Sentinel-1 and Sentinel-2 Imagery

Abstract: The U-net is nowadays among the most popular deep learning algorithms for land use/land cover (LULC) mapping; nevertheless, it has rarely been used with synthetic aperture radar (SAR) and multispectral (MS) imagery. On the other hand, the discrimination between plantations and forests in LULC maps has been emphasized, especially for tropical areas, due to their differences in biodiversity and ecosystem services provision. In this study, we trained a U-net using different imagery inputs from the Sentinel-1 and Sentinel-2 satellites: MS, SAR and a combination of both (MS + SAR); a random forests (RF) algorithm with the MS + SAR input was also trained to evaluate the effect of algorithm selection. The classification system comprised ten classes, including old-growth and secondary forests, as well as old-growth and young plantations. The most accurate results were obtained with the MS + SAR U-net, which achieved the highest overall accuracy (0.76) and average F1-score (0.58). Although the MS + SAR and MS U-nets gave similar results for almost all of the classes, for old-growth plantations and secondary forest the addition of the SAR bands produced an F1-score increment of 0.08–0.11 (0.62 vs. 0.54 and 0.45 vs. 0.34, respectively). Furthermore, in comparison with the MS + SAR RF, the MS + SAR U-net obtained higher F1-scores for almost all the classes. Our results show that using the U-net with a combined input of SAR and MS images enabled a higher F1-score and accuracy for a detailed LULC map, in comparison with the other evaluated methods.


Introduction
Land use/land cover (LULC) classification has long been a topic of interest in Earth observation research [1][2][3]. These studies characterize large extents of the Earth's surface by classifying the continuous variation of its attributes into discrete classes, and they contribute to the establishment of baselines for LULC change studies, which are essential for the management and monitoring of the land surface [4][5][6][7].
A variety of LULC classification systems have been developed for different purposes. Particularly in tropical regions, several studies have emphasized the importance of discriminating between old-growth forests and plantations, as well as secondary forests, due to their differences in environmental management and biodiversity conservation [8][9][10][11][12]. Although these three classes may have similar canopy cover, secondary forests and plantations usually hold less above ground biomass, host less biodiversity and provide different ecosystem services than old-growth forests [13][14][15][16]. Furthermore, the clearing of old-growth forests to establish plantations can lead to an increase in the carbon emissions from the dead biomass or soil [17,18].
Previous studies have relied on remote sensors that record reflectance or backscattering signals of the land surface to predict different LULC classes. Accordingly, with the development of new sensors and other technological advances, novel methods for obtaining LULC classifications have been proposed [19][20][21][22][23]. Recently, deep learning approaches have gained considerable popularity due to their increase in accuracy in comparison with previous machine learning approaches [24][25][26]. Additionally, deep learning algorithms are capable of learning complex and nonlinear patterns in the spatial and temporal dimensions, and they do not require a previous transformation of the inputs, e.g., calculating spectral transformations such as vegetation indices [27][28][29][30]. Thus, deep learning algorithms are among the most explored types of algorithms for LULC classification and other Earth observation applications [31][32][33][34].
Fully Convolutional Neural Networks are a particular type of Convolutional Neural Networks (CNN) that allow obtaining a class prediction for each pixel in an image [25,26,[35][36][37][38][39]. These algorithms are capable of identifying patterns at different scales to produce classifications [32,37]. Thus, CNN architectures are nowadays among the most widely applied algorithms for classification tasks [20], and particularly, the U-net is one of the most popular algorithms in LULC studies [31,32,40] due to its capability of summarizing patterns in both the spectral and spatial domain (for additional details see Section 2.3).
Several studies have successfully used the U-net to evaluate forest disturbance and degradation and to identify plantations or buildings, among other applications [41][42][43][44][45][46][47][48]. Furthermore, this architecture is designed to work with small sample sizes, a common problem in LULC classification [25,40]. Nevertheless, the U-net has rarely been trained using multispectral (MS) bands beyond RGB or in combination with synthetic aperture radar (SAR) images [31,32,37,49], even though the combination of MS and SAR imagery has produced more accurate LULC maps [7,[50][51][52][53]. For example, the information in MS images can be very useful to differentiate among certain LULC classes (e.g., water, bare soil, vegetation), whereas SAR signals interact with the structure of vegetation (i.e., branches, leaves, stems) and can therefore potentially discriminate between forests and plantations [54].
The U-net algorithm has mostly been used with very high spatial resolution (VHR) imagery because it was initially designed for biomedical image segmentation, which requires delineating fine-grained class boundaries [31,40]. Nonetheless, for Earth observation applications, medium-resolution images from satellite sensors such as Landsat or Sentinel are preferable due to their wider spatial coverage, sufficient resolution for land cover mapping and, most importantly, their free-of-cost availability [44,49,51,55]. All of these traits make them an excellent option for environmental monitoring.
In this context, the objective of this study is to evaluate the potential of the U-net in combination with Sentinel-1 and Sentinel-2 images to develop a detailed LULC classification in a tropical area of Southern Mexico, with particular interest in differentiating young and old-growth plantations, as well as secondary and old-growth forests. In addition, this evaluation includes comparing the results obtained with the U-net and the random forests (RF) algorithm (a machine learning algorithm), as well as assessing the effect of the image input on the classification accuracy. Because the U-net summarizes both spectral and spatial features to perform the LULC classification, we expect it to obtain a higher accuracy than the RF algorithm, which only uses spectral features. In addition, we assume that the combination of MS and SAR inputs with the U-net will help differentiate natural forests from plantations because of their difference in spatial configuration (i.e., random vs. uniform).

Study Site
The study site is located in southeastern Mexico, in the municipalities of Marqués de Comillas, Benemérito de las Américas and Ocosingo, Chiapas (Figure 1). This area is part of the Selva Lacandona region, which holds one of the largest massifs of conserved tropical rainforest in North America [56,57]. Additionally, this region shows some of the highest deforestation rates in the country, which have been related to livestock ranching and, to a lesser extent, to agriculture or rubber/oil palm plantations [58][59][60]. This region exhibits a complex mosaic of LULC classes that includes tropical rainforest (old-growth and secondary), oil palm and rubber plantations, grasslands, agricultural fields, water, roads, areas with no or scarce vegetation, human settlements and aquatic vegetation.

The complete method is divided into three sections: (1) satellite imagery and LULC classes acquisition and preprocessing, (2) algorithm training and validation (with two subsections: U-net and RF) and (3) most accurate architecture selection, complete study area LULC classification and accuracy assessment (Figure 2). The following paragraphs describe each one of these sections.

Remote Sensing Input Imagery
Sentinel-1 (SAR) and Sentinel-2 (MS) images were the imagery inputs used to train the U-net and RF algorithms. For Sentinel-1, the Ground Range Detected (GRD) collection was selected, and only images acquired in the interferometric wide swath mode and ascending orbit were used. The two available SAR bands, vertical transmission and reception (VV) and vertical transmission and horizontal reception (VH), were converted from σ⁰ to γ⁰ by correcting the backscatter coefficient for the acquisition angle of the image [61,62]. For the Sentinel-2 images, only the bands with the finest resolution were used, i.e., bands B, G, R and NIR with a pixel size of 10 m. These images corresponded to the Sentinel-2 Level-2A collection, i.e., bottom-of-atmosphere reflectance.
We first selected the Sentinel-2 image that had the lowest cloud cover percentage and was closest to the acquisition date of the field data (recorded in March 2019). This corresponded to an image acquired on 4 July 2019. Afterwards, the mean backscattering coefficient of nine Sentinel-1 images that were acquired up to one month prior to this date, i.e., from 4 June 2019 to 4 July 2019, was calculated to reduce the speckle noise typical of SAR images. In addition, a circular filter with radius = 3 pixels was applied to the mean backscattering image to further reduce the speckle noise.
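The speckle-reduction steps above (temporal averaging of the nine acquisitions followed by a circular focal mean) can be sketched as follows. This is an illustrative Python/NumPy version, not the authors' implementation; the function names and array shapes are assumptions.

```python
import numpy as np

def temporal_mean(stack):
    """Per-pixel mean over a stack of co-registered SAR images
    (shape: n_images x rows x cols) to reduce speckle noise."""
    return stack.mean(axis=0)

def circular_mean_filter(img, radius=3):
    """Focal mean over a circular window of the given pixel radius,
    further smoothing the temporally averaged backscatter image."""
    size = 2 * radius + 1
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = ((yy**2 + xx**2) <= radius**2).astype(float)
    kernel /= kernel.sum()  # normalize so the filter is a mean
    pad = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (pad[i:i + size, j:j + size] * kernel).sum()
    return out
```

In practice, the same operations are available as built-in focal/temporal reducers in most remote sensing toolboxes; the loop above only makes the arithmetic explicit.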


Acquisition of Training and Validation Data
Training data were generated by visual interpretation, performed by the interpreter who carried out the field data acquisition, over 33 systematically selected areas of 256 × 256 pixels where LULC information was gathered in the field (Figure 1). The training data were created using these in-field observations (300 points) and remote sensing imagery, which included Sentinel-2 images from 2016-2020, Planet images from January and March 2019 (pixel size = 3.12 m) [64] and VHR images provided by Google Earth [65], Yandex [66] and Bing [67] (pixel size < 1 m) as XYZ tiles in QGIS 3.16 [68]. The 33 areas were visually classified into 10 classes: old-growth forest, secondary forest, old-growth plantations, young plantations, grasslands/agricultural fields, roads, soil or areas with no vegetation, water, human settlements and aquatic vegetation. Additionally, the training dataset for MS and MS + SAR included two additional classes, clouds and shadows, which represented areas for which a LULC class could not be obtained from the MS data; these areas were masked out in the final LULC map. These two classes were not included in the calculation of the final performance metrics (see Sections 2.3.1 and 2.3.2).
The resulting classifications were rasterized with a pixel size equal to the Sentinel-1 and Sentinel-2 resolution (i.e., 10 m). Thus, the LULC manual classification maps consisted of 12 bands, one for each class, with binary values (1: presence, 0: absence). Finally, to estimate the error associated with the visual interpretations, a stratified random sampling procedure was implemented, which resulted in 119 points. These points were interpreted without the field-acquired information, and the interpretations were contrasted with the field information to quantify the error present in the training data.
In addition, several preprocessing operations were performed on the U-net's input information. First, the imagery data were standardized by subtracting each band's mean and dividing by its standard deviation. Moreover, both the imagery and the LULC manual classification data were augmented by subsampling each 256 × 256-pixel area into 128 × 128-pixel tiles using a 64-pixel offset and mirroring these tiles in both the vertical and horizontal directions. After this procedure, the total augmented dataset consisted of 891 sample units. From this dataset, 621 observations, from 23 images (256 × 256-pixel tiles), were used as the training set, and 270 observations, from 10 images, were used as the verification set. This procedure was performed in R 4.0.3 [69] using the raster [70], rray [71] and reticulate [72] packages.
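The standardization and augmentation procedure can be sketched as follows. The authors implemented it in R with the packages cited above, so this Python/NumPy version is only an illustrative equivalent with hypothetical function names.

```python
import numpy as np

def standardize(img):
    """Standardize each band: subtract its mean and divide by its
    standard deviation. img shape: rows x cols x bands."""
    mean = img.mean(axis=(0, 1))
    std = img.std(axis=(0, 1))
    return (img - mean) / std

def tile_and_mirror(area, tile=128, offset=64):
    """Subsample a 256 x 256 area into overlapping 128 x 128 tiles
    (64-pixel offset) and add vertically and horizontally mirrored
    copies of each tile."""
    tiles = []
    for i in range(0, area.shape[0] - tile + 1, offset):
        for j in range(0, area.shape[1] - tile + 1, offset):
            t = area[i:i + tile, j:j + tile]
            tiles.extend([t, t[::-1, :], t[:, ::-1]])  # original + mirrors
    return tiles
```

With these parameters, each 256 × 256 area yields 9 tile positions × 3 orientations = 27 samples; 33 areas × 27 = 891, matching the total number of sample units reported above.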

Algorithm Training and Validation
First, the U-net architecture with different inputs was trained. Afterwards, the imagery input that obtained the highest F1-score with the U-net was used to train a RF algorithm. The comparison among the three types of trained U-nets had the purpose of evaluating the effect of the imagery inputs on its capabilities to generate a LULC map, while the comparison of the most accurate U-net with the RF algorithm had the objective of providing a point of comparison for algorithm selection (i.e., U-net vs. RF).

U-Net
The U-net is a CNN-based algorithm; therefore, like any CNN, it "learns" to recognize the classes of interest in a supervised scheme by adjusting the weights of convolutional filters in an iterative process [27,28]. These convolutional filters are usually organized in a network-like structure that enables the CNN to recognize spectral and spatial patterns at different scales [32,37]. The U-net has two parts, an encoder and a decoder (Figure 3). In the encoder section, the input image is passed through several hidden layers; in each pass, the spatial resolution is reduced by the downsampling filters, while the "spectral" resolution is increased. On the contrary, in the decoder part, the image is passed through other hidden layers that perform the opposite process: in each pass, the image loses spectral resolution while it gains spatial resolution, to obtain the final LULC classification. Two outputs are obtained from the U-net: the LULC classification map and the probability map. The typical U-net architecture comprises five hidden layers [40]; however, in this study, due to memory restrictions, a simpler version of the U-net was used, with two to four hidden layers (Figure 3), which is also expected to reduce the chances of overfitting [73]. For a detailed description of the U-net architecture, please consult [40].
The whole training process was implemented in R 4.0.3, using the U-net and keras packages [74,75] to construct the U-net architecture and tensorflow [76] as the backend for keras. To search for the optimal hyperparameters, several architectures were fitted using an early stopping procedure to avoid overfitting: training was stopped if the loss metric over the validation set did not decrease by 0.01 within 10 iterations.
The tested combinations of hyperparameters included batch size (8, 16, 32), number of filters in the first layer (32, 64), number of hidden layers (2-4) and dropout probability (0, 0.1, 0.2, 0.3, 0.4, 0.5) [77]. In addition, cross-entropy was used as the loss function, while Adam was the selected optimizer [78]. Additionally, batch normalization was applied, and He initialization was used to set the initial weights [79].
In order to monitor the algorithm's training, two additional metrics, overall accuracy and average F1-score (avgF1-score; Equation (1)), were calculated in each iteration on both the training and validation sets. The overall accuracy was calculated as the number of correctly classified pixels over the total number of pixels, while the F1-score of each class was calculated as the harmonic mean of its precision and recall. Afterwards, all the F1-scores were averaged to obtain the avgF1-metric [80]:

avgF1 = (1/C) Σ_{c=1}^{C} [2 (p_c × r_c) / (p_c + r_c)] (1)

p_c = TP_c / (TP_c + FP_c) (2)

r_c = TP_c / (TP_c + FN_c) (3)

where avgF1 stands for the overall average F1-score, C for the number of classes (i.e., 10 classes), c for each class, p for precision, r for recall, TP for true positives, FP for false positives and FN for false negatives (Equations (2) and (3)). During the training phase, the F1-scores are calculated as an average of the F1-scores of all the batches of the epoch; thus, the U-net comparisons were made using this batch-averaged F1-score. However, because the RF training is not performed with batches, an avgF1-score was also calculated for each U-net architecture, in order to enable F1-score comparisons with the RF algorithm. This avgF1-score was obtained directly from the confusion matrix observations (with the validation data), instead of as an average of the per-batch F1-scores. Finally, the most accurate U-net architecture was selected as the one with the highest avgF1-score.
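Equations (1)–(3) can be computed directly from a confusion matrix, as in the following Python/NumPy sketch (an illustrative re-implementation, not the authors' R code; the row/column convention is an assumption):

```python
import numpy as np

def avg_f1(cm):
    """avgF1 (Equations (1)-(3)) from a confusion matrix where rows
    are predicted classes and columns are reference classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)               # correctly classified pixels per class
    fp = cm.sum(axis=1) - tp       # predicted as class c but not c
    fn = cm.sum(axis=0) - tp       # truly class c but missed
    p = tp / (tp + fp)             # precision, Equation (2)
    r = tp / (tp + fn)             # recall, Equation (3)
    f1 = 2 * p * r / (p + r)       # per-class harmonic mean
    return f1.mean()               # Equation (1)
```

For a perfect classification, every per-class F1-score and hence avgF1 equals 1.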
The hyperparameter exploration was done using the tfruns package [81]. In total, 90 architectures were generated for each type of input (i.e., MS, SAR and MS + SAR), which resulted in a total of 270 trained U-nets. The entire training and validation procedure was run on an NVIDIA RTX 2060 with 6 GB of memory. The code used to augment the training data, as well as to perform the U-net hyperparameter exploration and training, is available at https://github.com/JonathanVSV/U-netR (accessed on 26 April 2021).

Random Forests
The RF algorithm is one of the most popular machine learning algorithms in LULC classification [21,52,82]. This algorithm is an ensemble learning method based on decision trees that trains several trees on random subsets of the sample observations and predictive variables. Afterwards, this algorithm assigns the final class for each pixel as the one that had the most votes from the individual trees [83].
The RF classification was performed using the same input information as the most accurate U-net (i.e., MS + SAR; see the Results section). The same training and validation data as for the U-net were used, but without the augmentation procedure; thus, the training data consisted of 23 areas of 256 × 256 pixels, while the validation data consisted of 10 areas of 256 × 256 pixels. Due to computing memory restrictions, the RF algorithm could not be trained using the complete training set; therefore, we randomly sampled the training set to obtain a balanced dataset with the same number of observations per class [84]. The training data for the RF algorithm consisted of 6057 points per class, because this was the number of observations (i.e., pixels) available for the rarest class (i.e., aquatic vegetation). The RF algorithm was trained using the randomForest R package [85] with 500 trees and two randomly selected variables at each split. Once trained, the RF was used to predict the LULC classification of the validation data, and the LULC classes were evaluated using a confusion matrix and the same metrics as for the U-net results, i.e., overall accuracy and avgF1-score.
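The class-balanced subsampling described above can be sketched as follows (illustrative Python; the authors worked in R, and `balanced_sample` is a hypothetical helper name):

```python
import numpy as np

def balanced_sample(labels, rng=None):
    """Randomly subsample pixel indices so that every class contributes
    the same number of observations as the rarest class (e.g., 6057
    pixels of aquatic vegetation in the study)."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()  # size of the rarest class
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        idx.append(rng.choice(pool, size=n, replace=False))
    return np.concatenate(idx)
```

The balanced index set can then be fed to any classifier; with randomForest in R, the equivalent effect can also be obtained via its sampling arguments.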

Complete Study Area LULC Classification and Accuracy Assessment
The most accurate U-net architecture (i.e., MS + SAR) was afterwards used to predict the LULC classification of the complete study area. A known problem in this step is that the edges of the predicted tiles frequently show lower-quality results than areas closer to the center of the tile [40,43]. This is because pixels at the edges of a tile have fewer neighboring pixels than their counterparts farther from the edges; thus, the spatial information extracted by the convolutions is limited near the edges. To reduce this effect, we divided the complete study area into two grids of 128 × 128-pixel tiles, so that the edges of the predicted tiles in one grid overlapped with the centers of the predicted tiles of the other grid. Afterwards, the predicted tiles of both grids were mosaicked, assigning to each pixel the LULC class with the highest probability.
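The overlapping-grid mosaicking can be sketched as follows, assuming each grid's prediction is a classes × rows × cols probability array covering the same extent; this is an illustrative reconstruction, not the published code.

```python
import numpy as np

def mosaic_by_probability(prob_a, prob_b):
    """Merge class-probability maps predicted on two offset grids:
    each pixel keeps the class whose probability is highest across
    both predictions (arrays of shape classes x rows x cols)."""
    both = np.concatenate([prob_a, prob_b], axis=0)
    n = prob_a.shape[0]
    flat = both.argmax(axis=0)  # winning index across both stacks
    return flat % n             # map indices back to class ids
```

Because tile centers of one grid cover tile edges of the other, the higher-confidence (center) prediction tends to win at every pixel.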
The LULC classification accuracy for the complete study area was evaluated following a stratified random sampling design [86][87][88], in which the random points are distributed according to the proportion of the area occupied by each class in the LULC map, except for the rarest classes, which are assigned a larger number of random points than their area proportion would dictate. The total number of sample units for the verification process was calculated as (Equation (4); [86]):

n = ( Σ_{i=1}^{q} W_i S_i / S(Ô) )² (4)

where S(Ô) is the desired standard error of the estimated overall accuracy; here we adopted a slightly higher value (0.015) than the one recommended by [87]. W_i is the proportion of area occupied by class i in the classification, q stands for the number of classes and S_i = sqrt(U_i (1 − U_i)) is the standard deviation of class i, where U_i stands for the a priori expected user's accuracy. In this case, we used the user's accuracy values obtained for the validation dataset (Tables A1-A3). For the rarest classes (classes covering less than 5% of the map, i.e., all classes except old-growth forest, grassland/agriculture and soil), 25 points were assigned per class, while, for the most abundant classes (old-growth forest, grassland/agriculture and soil), the number of verified points was assigned in proportion to the area occupied by each class, giving a total of 448 points. The verification process was performed by visual interpretation using the same information as for the manual classification of the 256 × 256-pixel tiles, i.e., Sentinel-2 2016-2019 images, Planet 2019 images, Google Earth, Bing and Yandex. Finally, the Openforis accuracy assessment tool [89] was used to calculate unbiased area estimates and their 95% confidence intervals. These estimates were based on the confusion matrix resulting from the accuracy assessment process and the proportion of area occupied by each class, according to Equations (5) and (6). Additionally, the F1-scores of each class, the overall accuracy and the avgF1-score were calculated.
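Equation (4) is straightforward to evaluate; the following Python sketch uses made-up area proportions and expected user's accuracies, not the study's actual values.

```python
import numpy as np

def sample_size(W, U, se_target=0.015):
    """Number of verification points (Equation (4)):
    n = (sum_i W_i * S_i / S(O))^2, with S_i = sqrt(U_i * (1 - U_i)).
    W: mapped area proportions; U: a priori expected user's accuracies."""
    W, U = np.asarray(W, float), np.asarray(U, float)
    S = np.sqrt(U * (1 - U))          # per-class standard deviation
    return (np.dot(W, S) / se_target) ** 2
```

For instance, two classes each covering half the map with an expected user's accuracy of 0.8 would require roughly 711 points at S(Ô) = 0.015.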
S(p̂_.k) = sqrt( Σ_{i=1}^{q} W_i² (n_ik / n_i.) (1 − n_ik / n_i.) / (n_i. − 1) ) (5)

95% CI = Â_k ± 1.96 × A × S(p̂_.k) (6)

where S(p̂_.k) is the standard error of the estimated area proportion for class k, W_i is the area proportion of map class i, n_ik stands for the sample count at cell (i, k) in the error matrix and n_i. is the row sum for class i. In addition, 95% CI stands for the 95% confidence interval of Â_k, the estimated area of class k, A is the total map area and q stands for the number of classes.
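Equations (5) and (6) can be applied to the error matrix as in this illustrative NumPy sketch (the study used the Openforis tool; the matrix below is a toy example, and the stratified estimator Â_k = A × p̂_.k is the standard one implied by the text):

```python
import numpy as np

def area_estimates(cm, W, A):
    """Stratified area estimators from an error matrix (rows = map
    classes, columns = reference classes), following Equations (5)
    and (6). W: mapped area proportions; A: total map area."""
    cm = np.asarray(cm, float)
    W = np.asarray(W, float)
    n_i = cm.sum(axis=1)                        # row sums n_i.
    p = W[:, None] * cm / n_i[:, None]          # cell area proportions
    p_k = p.sum(axis=0)                         # estimated proportion per class
    ratio = cm / n_i[:, None]                   # n_ik / n_i.
    var = ((W[:, None] ** 2) * ratio * (1 - ratio) /
           (n_i[:, None] - 1)).sum(axis=0)
    se = np.sqrt(var)                           # S(p_.k), Equation (5)
    area = A * p_k                              # estimated area per class
    ci95 = 1.96 * A * se                        # CI half-width, Equation (6)
    return area, ci95
```

The returned pair gives, for each class, the unbiased area estimate and the half-width of its 95% confidence interval.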

Input Imagery and Hyperparameter Exploration
The results of the hyperparameter exploration showed that MS + SAR had the highest overall accuracy (0.76) and avgF1-score (0.58) on the verification dataset, followed very closely by MS (overall accuracy = 0.75, avgF1-score = 0.55) and finally SAR (overall accuracy = 0.65, avgF1-score = 0.39; Tables A1-A3). A similar pattern was observed for the same metrics evaluated on the training dataset (MS + SAR: overall accuracy = 0.91, avgF1-score = 0.86; MS: overall accuracy = 0.91, avgF1-score = 0.86; SAR: overall accuracy = 0.72, avgF1-score = 0.60). Thus, the MS + SAR U-net was selected as the most accurate architecture. The two most accurate architectures, MS + SAR and MS, had three hidden layers, a batch size of 16 and 64 filters in the first hidden layer (Table S1); however, the MS + SAR architecture had a higher dropout (0.5) than the MS one (0.1). The most accurate SAR U-net had two hidden layers, a batch size of 8, 64 filters in the first hidden layer and an intermediate dropout (0.3; Table S1). Finally, an error of 0.15 was detected for the visual interpretations made for the training data without access to the field information.

Image Input Comparison
The MS + SAR U-net was the architecture with the highest avgF1-score and overall accuracy, while the MS U-net showed a slightly lower avgF1-score and overall accuracy. However, comparing the F1-scores per class, MS + SAR had higher scores in three classes (∆F1-score ≥ 0.05; old-growth plantations, secondary forest and young plantations) and a lower score in a single class (∆F1-score = 0.08; roads). Additionally, the other five classes had a similar F1-score in either the MS + SAR or MS U-net (∆F1-score ≤ 0.01; Table 1). In comparison with the SAR U-net, MS + SAR and MS had higher F1-scores for all the classes (Table 1), although the F1-score for water was similar in the three U-net architectures (∆F1-score = 0.03). The differences in class identification among the U-net architectures were also visually evident in comparison with the original manual classification (Figure 4). Although there were particular differences among the three U-net architectures, in general, the classes with the highest F1-scores were old-growth forest, water and human settlements, while the most poorly classified classes were young plantations, roads and aquatic vegetation (Tables 1 and A1-A3).

Algorithm Comparison
When comparing the results obtained by the MS + SAR U-net and the MS + SAR RF classification, it was obvious that the U-net had both higher avgF1-score and overall accuracy than its RF equivalent (∆overall accuracy = 0.23 and ∆F1-score = 0.15; Table A4). Although the same pattern was observable by comparing the class F1-scores (0.08 ≤ ∆F1-score ≤ 0.58), water showed a slightly lower F1-score using the RF algorithm (∆F1-score ≤ 0.03), while young plantations showed the same F1-score in MS + SAR RF and MS + SAR U-net (Table 1).

Complete Study Area LULC Classification
The accuracy assessment of the complete study area classification showed an overall accuracy of 0.77 and an avgF1-score of 0.68 (Table A5). In this verification, the classes that obtained the highest F1-scores were water, human settlements, old-growth forest and grasslands/agriculture (Table 2). On the contrary, the classes with the lowest F1-scores were young plantations, secondary forest, soil and roads (Table 2). The complete LULC classification can be downloaded from https://github.com/JonathanVSV/U-netR (accessed on 26 April 2021). Visually analyzing the probability map of the final classification, most of the old-growth forest areas had high probability values for the assigned class (Figure 5). On the contrary, areas located at the boundaries between different LULC classes, as well as clouded and recently burned areas, had low probabilities of corresponding to the assigned class (Figure 5).


Discussion
In this study, we showed that deep learning algorithms with combined MS and SAR images can obtain a detailed LULC classification and discriminate among the forested classes of interest with promising results. Although the F1-scores obtained for secondary forests and old-growth plantations were not as high as that of old-growth forests, the MS + SAR U-net provided an increase of between 0.08 and 0.25 in comparison with the other tested methods (i.e., MS U-net, SAR U-net and MS + SAR RF). Thus, these results show that new classification methods can improve the capabilities of producing LULC maps from remote sensing information; however, they might still not meet the desired accuracies. It is worth mentioning that our results are relative to the type of images used (Sentinel-1 and Sentinel-2), the algorithms compared (U-net and RF) and the LULC classification system used. Future studies could help determine the accuracy of LULC classifications in other study areas, with different LULC systems, and in comparison with other supervised classification algorithms.

Algorithm Selection
Our results clearly showed the advantages of using deep learning methods, such as the U-net, over traditional machine learning approaches, such as RF, in LULC classification. The avgF1-score and accuracy obtained with the MS + SAR U-net were 0.15 and 0.23 higher, respectively, than those of its RF equivalent (Tables A1 and A4). Additionally, 9 out of 10 classes obtained higher F1-scores with the MS + SAR U-net than with its RF counterpart, due to the inclusion of features in the spatial domain. Previous studies comparing RF with deep learning algorithms for mapping applications have reported similar outcomes [42,90-95].
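The distinction drawn here, with RF seeing each pixel in isolation while the U-net sees whole patches, comes down to how the image stack is arranged before classification. A schematic illustration (the array shapes are arbitrary, not those of the study):

```python
import numpy as np

# A 6-band MS + SAR stack (bands, height, width) with illustrative values.
image = np.arange(6 * 8 * 8, dtype=float).reshape(6, 8, 8)

# Pixel-based RF input: one row per pixel, one column per band;
# all spatial context is discarded before classification.
rf_features = image.reshape(6, -1).T   # shape (64, 6)

# U-net input: the full patch is kept, so the convolutions can use
# the spatial neighbourhood of every pixel.
unet_input = image[np.newaxis, ...]    # shape (1, 6, 8, 8)
```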
Comparing the results obtained for each class by the MS + SAR U-net and its RF counterpart, three notable cases were detected. The first one consisted of classes that showed a higher F1-score using the U-net, with ∆F1-scores between 0.08 and 0.19 with respect to the RF counterpart (Table 2). Most classes are included in this category (i.e., all classes except human settlements, water and young plantations); thus, the identification of these classes benefited from including information from the spatial domain. However, most of the potential to correctly identify these classes seems to derive from the spectral domain (Figure 6). The second case refers to human settlements, which shows that incorporating the spatial domain clearly boosted the performance of its identification (∆F1-score = 0.58; Table 2). Because individual pixels corresponding to human settlements can have very different reflectance responses (e.g., buildings, trees, grassland; Figure 6), its correct detection strongly benefits from the information available in the spatial domain. Thus, the U-net clearly outperforms the results obtained by RF, which essentially relies only on spectral information.
The third case consisted of two classes, water and young plantations, where the ∆F1-score between the U-net and RF was minimal (Table 2). For the water class, because it is easily differentiated using spectral or backscattering information (Figure 6), including spatial domain features only caused a minimal increment in its correct detection. In the case of young plantations, the MS + SAR U-net and RF obtained the same F1-score; thus, this class does not benefit from the spatial features of the image. We further discuss this topic in the Error Analysis section.


Class Patterns
Among the different U-net architectures, the one that included MS + SAR as the image input gave the highest avgF1-score; however, MS gave almost equally accurate results. This means that the classification capabilities of the MS + SAR U-net derive mainly from the MS imagery. This result is in line with previous studies, in which a combined MS + SAR input outperformed MS or SAR used separately to perform LULC classifications or detect LULC changes [52,96-99].
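Combining the two sensors as a single U-net input amounts to stacking the co-registered bands along the channel axis. A sketch of the combined input, assuming (hypothetically) four 10 m Sentinel-2 bands and two Sentinel-1 backscatter bands resampled to the same grid:

```python
import numpy as np

# Hypothetical co-registered rasters on the same 10 m grid;
# the band choice here is illustrative, not the study's exact set.
h, w = 256, 256
ms = np.zeros((4, h, w), dtype=np.float32)   # e.g., blue, green, red, NIR
sar = np.zeros((2, h, w), dtype=np.float32)  # e.g., VV, VH backscatter

# The three inputs compared in the study: MS alone, SAR alone,
# and both stacked along the channel axis (MS + SAR).
ms_sar = np.concatenate([ms, sar], axis=0)   # shape (6, 256, 256)
```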
When comparing the F1-scores obtained for each class using the MS, SAR and MS + SAR U-nets, we identified five different groups of responses. The first one was characterized by a lower F1-score in the MS + SAR U-net in comparison with the MS one (Table 1). Roads was the only class with this type of response. Because this class consists of narrow lines of pixels, its classification is more easily achieved using only the MS data, which has the finest spatial resolution.
The second group comprised classes that obtained intermediate to high F1-scores, and higher scores with MS + SAR than with MS or SAR alone. The two classes in this category were old-growth plantations and secondary forest (Table 1). This was probably due to the ability of SAR data, in this case the Sentinel-1 C-band, to slightly penetrate the canopy cover and acquire information about the geometric arrangement and texture of old-growth plantations and secondary forests. As previous studies have shown, the use of SAR bands in addition to MS helps discriminate old-growth plantations from forests, particularly when the temporal dimension is included [10,11]. Although the penetration of the Sentinel-1 C-band is not as deep as that of the L- or P-bands [100], this information aids in the correct detection of plantations and secondary forest, particularly when used together with the MS set.
The third group consisted of classes where the F1-score difference between MS and MS + SAR was negligible (∆F1-score ≤ 0.01; Table 1). Most LULC classes showed this pattern: grassland/agriculture, human settlements, old-growth forest, soil and water. For these classes, it was evident that the MS bands gave the U-net most of its ability to correctly classify them; thus, adding the SAR bands did not substantially improve their F1-scores.
The fourth group corresponded to the water class, whose F1-score was very similar among the three U-nets with different imagery inputs (Table 1). This seems to be related to the particular spectral signal of water, which makes it easily distinguishable with either MS or SAR signals (Figure 6). A similar conclusion was reached in the comparison between the U-net and RF.
The last group comprised two classes, aquatic vegetation and young plantations. Although their F1-scores were higher for the MS + SAR U-net than for the MS U-net, they remained low (0.11-0.15). Therefore, we concluded that the U-net had a limited ability to identify these two classes. We further discuss this topic in the Error Analysis section and in the complete study area accuracy assessment.

Hyperparameters Exploration
The exploration of different hyperparameters of the U-net showed that, although the different combinations of hyperparameters affected the overall accuracy and avgF1-scores, the magnitude of this effect was limited to a difference of 0.05-0.11 between the highest and lowest scores (Table S1). The three most accurate U-nets trained with the different imagery inputs tended to consist of relatively simple architectures (two to three hidden layers). This result may be related to the relatively small dataset with which the U-net was trained. Previous studies have reported that CNNs with a larger number of filters or hidden layers are capable of identifying more complex patterns, and thus of resolving more complex tasks with higher accuracy [27,30,101]. Nevertheless, when only a limited training set is available, such as the one used in this study, the main problem with these architectures is that they tend to overfit. Thus, choosing simpler architectures reduces the chances of overfitting, although it might limit the abstraction capabilities of the CNN [31,47,102].
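A hyperparameter exploration of this kind can be organized as a simple grid search over architecture choices. The values below are placeholders for illustration, not the combinations actually tested (those are reported in Table S1):

```python
from itertools import product

# Illustrative search space; each axis is an assumption, not the
# study's exact hyperparameter set.
depths = [2, 3, 4]            # number of hidden (down-sampling) levels
base_filters = [16, 32, 64]   # filters in the first convolution block
dropouts = [0.0, 0.2]

grid = list(product(depths, base_filters, dropouts))
# Each combination would be trained and ranked by validation
# avgF1-score and overall accuracy; when scores are close, simpler
# architectures are preferred to limit overfitting on small datasets.
```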

Error Analysis
The incorrect discrimination of certain classes by the U-net could be attributed to five causes: (1) similarity in spectral/spatial information, (2) a similar conceptual class definition, (3) spatial errors probably caused by the U-net architecture, (4) a small number of observations and (5) possible errors in the training data caused by the date difference of the VHR images. The first three were related to a limitation in the spatial resolution of the Sentinel-1 and Sentinel-2 images.
The first case was the most dominant source of confusion. Although the field data and VHR images helped determine the LULC of each polygon, the resolution of the input images (10 m) was too coarse to distinguish certain classes [49]. For example, regardless of the imagery used to train the U-net, at 10 m resolution, very young plantations (mainly rubber and oil palm) cannot be distinguished from herbaceous cover (grassland/agriculture), and plantations at an intermediate growth stage, with relatively larger individuals, are practically indistinguishable from old-growth plantations (Table A5). On the other hand, most errors among the tree-dominated classes can be related to this type of confusion, where old-growth forests and plantations, as well as secondary forests, have very similar spectral/spatial signals (Table A5). This source of confusion was also evident in the error assessment performed on the training dataset without access to the field-registered data, which obtained an accuracy of 0.85. This procedure showed that certain errors could be associated with the visual interpretation and that, for certain classes such as young plantations, the field data were essential to correctly classify these areas. Considering the insights provided by previous studies [103], the error associated with each LULC class would likely increase in comparison with the ones reported here. Thus, the use of multiple interpreters (at least three) could help reduce visual interpretation errors [103].
The second circumstance is closely related to the arbitrary decisions that had to be made to manually classify the training and validation datasets, particularly in defining the limits between LULC classes. Although the same guides and criteria were followed to manually classify the training and validation data, in some cases both the conceptual and physical limits between certain classes were not completely clear, for example, in the delimitation of water bodies and sand banks (i.e., soil), or of any vegetated class and roads, where the limits were established mainly on mixed pixels corresponding to neither one class nor the other. Thus, a small amount of error can be attributed to these decisions.
A third case was related to a border effect in the LULC predictions caused by the U-net design. When analyzing the probability of each pixel corresponding to each class, low probability values tended to concentrate at the borders of the class polygons (Figure 5). Admittedly, this aspect could be related to the arbitrary decisions on the limits of the classes; however, it is also associated with the effect of the down- and up-sampling filters in the U-net, which degrade the spatial resolution of the image in each pass. Although the U-net anticipates this issue by using skip connections to add spatial detail to the final result, this might not be enough to provide results with the same resolution as the original input, as previous studies have reported [40,44,95,104]. Thus, as confirmed by the verification of the LULC classification over the complete study area, certain errors were associated with the limits of the polygons, either through an increased polygon size in the classification in comparison with the input images (e.g., larger clouds) or through a small spatial offset of the polygon borders (e.g., limits between roads and plantations).
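One common mitigation for such border effects, not necessarily the one used in this study, is to predict on overlapping tiles and keep only each tile's centre, where the output is most reliable. A sketch, with the tile size, margin and `predict` callable all hypothetical:

```python
import numpy as np

def predict_with_margin(image, predict, tile=64, margin=8):
    """Classify an image with overlapping tiles, keeping only each
    tile's centre; border predictions, where the output is least
    reliable, are discarded. Assumes h, w >= tile."""
    h, w = image.shape[-2:]
    pad = [(0, 0)] * (image.ndim - 2) + [(margin, margin)] * 2
    padded = np.pad(image, pad, mode="reflect")  # extend edges by `margin`
    out = np.zeros((h, w), dtype=int)
    step = tile - 2 * margin                     # stride between tile centres
    for y in range(0, h, step):
        for x in range(0, w, step):
            y0, x0 = min(y, h - step), min(x, w - step)
            pred = predict(padded[..., y0:y0 + tile, x0:x0 + tile])
            out[y0:y0 + step, x0:x0 + step] = pred[margin:-margin,
                                                   margin:-margin]
    return out
```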
The fourth condition could help explain the low accuracy observed in identifying rare classes such as aquatic vegetation or young plantations. Although CNNs are capable of identifying rare classes, the number of observations in the training data might have been too small for the CNN to extract general patterns to correctly identify these classes [49]. Nevertheless, the case of aquatic vegetation is further discussed in the final part of the Discussion.
Finally, another minor source of error might reside in the difference between the acquisition dates of the VHR imagery and the Sentinel-1 and Sentinel-2 images used to obtain the LULC classification. In most cases, this date difference was negligible; however, for three points in the accuracy assessment procedure, the U-net predicted young plantations as the LULC class, but no VHR information was available to help verify whether these areas were young plantations or not. In these rare cases, the areas were manually classified as grassland/agriculture or soil, relying exclusively on the Planet and Sentinel-2 images. Although this might be a very small source of error, a sparse annotations approach could help reduce it, because not all the areas included in the training data need to be labeled [95].
Regarding the first case, other studies have used simpler classification systems for which discrimination is easier to achieve (e.g., forest/non-forest systems) [43,45,47]. Although these studies had different research interests, the spectral and spatial discrimination of the LULC classes will also affect the performance of the deep learning algorithm in solving certain tasks (Figure 6). Other studies have tackled this problem by using hierarchical classification systems and calculating the performance of the algorithm at each level [49]. A similar approach could be adopted in our classification system to obtain broader classes with higher F1-scores, e.g., a single plantations or forest class, instead of separating old-growth and young ones.
Regarding the second situation, other studies have mainly relied on VHR imagery (e.g., Worldview or aerial images) to obtain higher accuracies [41,42,45-47,73,91,105-107]. This type of imagery allows a better discrimination of LULC classes, in comparison with high- or intermediate-resolution imagery, by providing more detailed information; thus, the potential of the U-net is limited by the spatial resolution of the input imagery [19,31,41]. For example, in this study, the use of VHR imagery would have helped distinguish poorly identified classes such as young plantations and roads. In addition, other studies have enhanced the potential of the U-net by incorporating the temporal dimension into the convolutional filters [44,95], fusing multiresolution inputs [94] or using customized U-net architectures [51]. These alternatives might be interesting for future studies aiming to perform LULC classification with the U-net architecture.
Regarding the third circumstance, it is frequently mentioned that the full potential of deep learning algorithms becomes especially evident when there is a large volume of training data [28,30]. Nevertheless, in Earth observation applications, the data available for training are frequently limited, due to the large amount of time and resources required to obtain them [20,37]. Therefore, it is not surprising that augmentation techniques are frequently used. For example, other studies have used more aggressive augmentation techniques, such as rotation at different angles (90°, 180°, 270°), as well as brightness, saturation and hue alterations [41,42,46]. Although augmentation techniques might enhance the generalization capabilities of the algorithm, they are usually computationally expensive and therefore need to be adjusted to the available resources. In this case, we opted for an augmentation scheme that enabled fast training with the available computational resources, mainly to shorten the hyperparameter exploration procedure.
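Geometric augmentation of the kind mentioned here (rotations and mirroring, without radiometric alterations) can be written compactly with array operations. This sketch is illustrative and not the exact scheme used in the study:

```python
import numpy as np

def augment(patch, label):
    """Yield 8 geometric variants of a (bands, h, w) patch and its
    (h, w) label raster: four rotations, each with a mirrored copy."""
    for k in range(4):                       # 0°, 90°, 180°, 270°
        p = np.rot90(patch, k, axes=(1, 2))  # rotate the spatial axes
        l = np.rot90(label, k, axes=(0, 1))
        yield p, l
        yield p[:, :, ::-1], l[:, ::-1]      # horizontal mirror
```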
Previous studies have relied on transfer learning to compensate for the small size of the available training set and obtain higher accuracies with deep learning algorithms [35,108,109]. In this case, we did not use transfer learning because we included NIR and two SAR bands, while most pre-trained CNNs use exclusively RGB imagery. Additionally, many of the pre-trained CNNs use images with very different viewing geometries, resolutions and classification systems than remote sensing ones [20,101]. Thus, we opted to train the U-net with only our own training data.
Although it is clear that these three factors play a role in determining the performance of the U-net, the interactions among them are unclear. Future studies should address this topic in order to understand the effects of each factor over the classification performance.

Methodological Highlights
We consider the accuracy assessment protocol developed by [86,87] to be a relatively common approach for assessing the accuracy of a map; however, it is rarely used in studies employing deep learning algorithms (but see similar approaches in [41,42]). We are aware that, if the verification set is large enough, has a random spatial distribution in the study area and has not been exposed to the CNN during the training phase (i.e., such as in the early stopping procedure), it should give estimates similar to the ones obtained by the abovementioned protocol. Nevertheless, a comparison of the F1-scores obtained for each class in the validation and accuracy assessment procedures showed that, for three classes (aquatic vegetation, roads and young plantations), the ∆F1-score was higher than 0.2 (Table 2). This large difference in the F1-score can be related to two situations that artificially inflated the error for these three classes: (1) their small sample size in the validation dataset, and (2) their spatial location in the validation dataset.
In the first case, the effect of inflating the error in identifying a class seems clearer for rare classes, probably due to their limited number of observations. Because the validation data came from an augmentation procedure, the observations are not completely independent; many are in fact mirrored versions or subsamples of others. A direct consequence of this condition is that, if a pixel was wrongly classified in one sample, it will probably be wrongly classified in its augmented version; thus, the F1-score obtained in the validation procedure can be artificially decreased in comparison with the one obtained in the accuracy assessment procedure (0.20 ≤ ∆F1-score ≤ 0.25).
In the second situation, after comparing the F1-scores obtained in the validation and accuracy assessment procedures for aquatic vegetation (∆F1-score = 0.55), it was evident that another factor was also responsible for the large difference. We noticed that the aquatic vegetation class was systematically found at the edges of the validation tiles, an aspect that was overlooked during the training and validation data acquisition. Therefore, due to its location at the edges of the validation dataset, the U-net produced low-quality predictions for this class, which inflated the error calculated for it in the validation dataset in comparison with the accuracy assessment procedure.
We consider that the accuracy assessment results give a more precise estimate of the ability of the U-net to obtain LULC maps, because they evaluate the classification over the complete study area and reduce the error caused by the low-quality predictions at the edges of the tiles. In addition, using this protocol allows an estimate of each class's surface area to be obtained with confidence limits. Thus, we recommend that future studies use similar protocols to evaluate LULC maps.
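The protocol cited here estimates class area proportions and their uncertainty from the error matrix and the mapped class weights. A condensed sketch of a stratified estimator in that spirit (simplified; the cited protocol defines the full set of estimators):

```python
import numpy as np

def area_estimates(cm, w):
    """Stratified estimates of class area proportions with standard
    errors. cm: sample counts, rows = map class, columns = reference
    class; w: mapped area proportion of each class (the strata weights)."""
    cm = np.asarray(cm, dtype=float)
    w = np.asarray(w, dtype=float)
    n_i = cm.sum(axis=1, keepdims=True)      # samples per map stratum
    p = w[:, None] * cm / n_i                 # estimated cell proportions
    area = p.sum(axis=0)                      # reference-class proportions
    frac = cm / n_i
    var = (w[:, None] ** 2 * frac * (1 - frac) / (n_i - 1)).sum(axis=0)
    return area, np.sqrt(var)                 # 1.96 * SE gives a ~95% CI
```

Multiplying `area` by the total mapped surface converts the proportions into hectares with the corresponding confidence limits.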
Finally, it is worth mentioning that we opted to produce the LULC map from a single-date image to maximize the agreement between the field information and the MS image (as SAR images are not affected by clouds) in a very dynamic landscape, instead of maximizing LULC coverage. Although other studies have addressed this issue by using multitemporal composites, few studies have analyzed the effect of the image composition on the ability of an algorithm to perform the LULC classification task (but see [110,111]), and even fewer have done so using CNN-based algorithms. These studies report that better classification accuracies are usually obtained with composites constructed from images acquired inside a small temporal window (e.g., a season within a year). Nevertheless, in areas with high cloud coverage throughout the year, such as the one studied here, even multitemporal composites might not ensure a high-quality composite or complete study area coverage. Future studies should address this trade-off between temporal agreement and study area coverage, especially using CNN-based algorithms that consider the spatial context of a pixel to determine its class.

Conclusions
The use of CNNs for Earth observation applications has further improved the capabilities to generate detailed LULC classifications. Nevertheless, it is essential to evaluate the role of different imagery inputs and algorithms in LULC mapping, with a special focus on discriminating among forested classes. In this study, we found that, although the U-net was trained with a small dataset, it outperformed the random forests algorithm. Additionally, the LULC map with the highest accuracy was achieved using the U-net with the MS + SAR bands as inputs, followed very closely by the MS U-net and lastly by the SAR U-net. Furthermore, the MS + SAR U-net obtained higher F1-scores for similar LULC classes such as old-growth forests and plantations. We conclude that the better performance of the U-net, in comparison with RF, is mainly due to the incorporation of both spatial and spectral features in the LULC classification. In addition, the combined use of MS + SAR imagery helps in obtaining a detailed LULC map and especially in discriminating among forested classes. This study demonstrates the capabilities of CNNs for obtaining detailed LULC classifications with medium spatial resolution satellite images.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Acknowledgments: The first author wishes to thank the Consejo Nacional de Ciencia y Tecnología (CONACyT) for providing a PhD scholarship. The authors are also grateful to the project "Herramientas para la Enseñanza de la Geomática con programas de Código Abierto", Programa de Apoyo a Proyectos para la Innovación y Mejoramiento de la Enseñanza (PAPIME # PE117519, http://lae.ciga.unam.mx/proyectos/geomatica/index.php accessed on 20 April 2021) for its support.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Table A1. Confusion matrix for the validation set using the MS + SAR U-net. Rows show predicted classes, while columns show ground truth classes. LULC abbreviations: AV, aquatic vegetation; G/A, grasslands/agriculture; HS, human settlements; OF, old-growth forests; OP, old-growth plantations; R, roads; SF, secondary forest; So, soil; W, water; YP, young plantations. User acc stands for user's accuracy, while Prod acc stands for producer's accuracy.
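The user's and producer's accuracies in the appendix tables follow directly from the matrix margins. A small sketch under the same convention (rows = predicted, columns = ground truth), with toy numbers:

```python
import numpy as np

def user_producer_acc(cm):
    """User's and producer's accuracy per class from a confusion
    matrix with rows = predicted and columns = ground truth."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    users = tp / cm.sum(axis=1)      # row margins: commission errors
    producers = tp / cm.sum(axis=0)  # column margins: omission errors
    return users, producers
```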