Development of Semantic Maps of Vegetation Cover from UAV Images to Support Planning and Management in Fine-Grained Fire-Prone Landscapes

In Mediterranean landscapes, the encroachment of pyrophytic shrubs is a driver of more frequent and larger wildfires. The high-resolution mapping of vegetation cover is essential for sustainable land planning and the management for wildfire prevention. Here, we propose methods to simplify and automate the segmentation of shrub cover in high-resolution RGB images acquired by UAVs. The main contribution is a systematic exploration of the best practices to train a convolutional neural network (CNN) with a segmentation network architecture (U-Net) to detect shrubs in heterogeneous landscapes. Several semantic segmentation models were trained and tested in partitions of the provided data with alternative methods of data augmentation, patch cropping, rescaling and hyperparameter tuning (the number of filters, dropout rate and batch size). The most effective practices were data augmentation, patch cropping and rescaling. The developed classification model achieved an average F1 score of 0.72 on three separate test datasets even though it was trained on a relatively small training dataset. This study demonstrates the ability of state-of-the-art CNNs to map fine-grained land cover patterns from RGB remote sensing data. Because model performance is affected by the quality of data and labeling, an optimal selection of pre-processing practices is a requisite to improve the results.


Introduction
Remote sensing is a primary source of data for vegetation mapping, and due to continual developments in geo-information technologies, this field is gradually becoming more universal. The use of remotely sensed satellite imagery is an effective way of acquiring data for various land cover mapping applications [1][2][3][4]. Satellites can map large areas in single acquisitions, but their data suffer from insufficient spatial, spectral and temporal resolutions, which are typically too coarse for some applications. They also suffer from cloud cover contamination and are limited by fixed timing and costly data acquisition [5]. These issues can be mitigated by the low-altitude flight of unmanned aerial vehicles (UAVs). Although originally developed for military purposes, UAVs have become an important commercial tool for monitoring the Earth's surface, revolutionizing the acquisition of fine-grained data due to their high spatial resolution, low cost and application versatility. Their other advantages are their flexibility in obtaining data from target areas that are often difficult to reach, the minimization of disturbances of inspected areas, and the provision of real-time data [6]. Therefore, UAVs have found their place in various fields, including ecology and the conservation of wildlife [7,8], agriculture and forestry [9], firefighting [10], and disaster zone mapping [11].
[...] with bare soil. After combining GoogLeNet with data augmentation, transfer learning (fine tuning) and pre-processing, an F1 score of 97% was achieved. The pre-processing techniques that improved the detection performance the most were background elimination and long-edge detection. Random flipping, scaling, cropping, and brightness were used for data augmentation. However, in most landscapes, shrubs are a very general and heterogeneous group of vegetation types with individuals of variable shapes, sizes, and distribution patterns, forming irregular and complex clusters of individuals [19].
High intra-class and low inter-class variance is a challenge when mapping shrub cover, causing difficulties in distinguishing them from their surroundings [35] or other vegetation classes. M. Mahdianpari et al. [34] used multispectral data, containing more complementary information, as a way to alleviate the problem of classification of spectrally similar vegetation types. They also found InceptionResNetV2 as the most efficient state-of-the-art CNN (compared to DenseNet121, InceptionV3, VGG16, VGG19, Xception and ResNet50) for classifying complex multispectral remote sensing wetlands scenes (F1 score of 93%). In their pursuit of maximizing the distinction between the target vegetation type (weeds) and the surroundings, C. Hung et al. [35] argued for the use of imagery from different seasons to take advantage of phenological dynamics and seasonal changes in the vegetation appearance, as well as performing the survey at lower flight altitudes (below 100 m [36]) or using higher resolution sensors to obtain more detail.
In [37], the authors used U-Net to differentiate between various forested classes using satellite imagery. They showed that the classification system could be improved by using a combination of multispectral and synthetic aperture radar imagery, rather than using only one type of data. They achieved an F1 score of 86% on old-growth forest and 62% on old-growth plantations, even though F1 scores for the secondary forest and young plantations were only 45% and 11%, respectively. This indicates that classification results can differ significantly, even between classes with similar species and patterns but with individuals at different stages of their life cycle. This can be a challenge in mixed landscapes such as the one discussed in our paper. The authors of the study also suggested that the classification results could be improved by using better spatial resolution imagery. Another study [38] employed U-Net to recognize poplar and coniferous trees from RGB satellite images. Classifying species to either one of the groups was successful, with a mean accuracy up to 96%; however, the network was not able to separate species within the same group. U-Net was also applied in [39] to detect the presence or absence of trees and large shrubs in Australia. Using multispectral band satellite imagery with a pixel size of 3.2 m and a panchromatic band with a pixel size of 80 cm, the overall classification accuracy, as well as precision and recall, were reported to be around 90%. The trees and shrubs were classified jointly, in sparse shrubland areas, open savanna woodlands and rangelands, compared to which our studied environment showed a higher level of complexity.

Case Study: Fire Prone Mediterranean Landscapes
In our study, we intended to develop a set of processing techniques for classifying shrubs, a key structural component of interest in a fire-prone Mediterranean region. Fire and herbivory are two important sources of ecological disturbance that shape landscapes and species communities in the Mediterranean basin [39]. Today, natural disturbance regimes have been replaced by modified regimes driven by human land use and anthropogenic disturbances [40]. Notably, the decline in farming and grazing in marginal farmland areas is a major trend affecting Mediterranean landscapes, which has been followed either by afforestation or land abandonment, both resulting in a cessation of moderate disturbance regimes, an accumulation of fuel loads and increased fire risk [41,42]. In the case of land abandonment, natural regeneration is often associated with the establishment and expansion of shrub species in abandoned fields [39,41]. In the Mediterranean basin, many shrub species are fire-tolerant, with traits that, under the dry and warm weather conditions typical of Mediterranean areas, favor recurrent fires, and ultimately, the dominance and long-term persistence of shrubs. Consequently, ecosystem recovery through secondary succession is halted, as well as the natural reestablishment of native forests [43]. This paper reports results on a case study farm, Quinta da França (Figure 1), which integrates agricultural and forest land uses. The farm's management is guided by sustainability principles and focuses on promoting environmental services provided by agroforestry activities and sustainable forest management. Notably, the farm's forest area experienced its last major fire event in 1996 and is now managed for carbon storage and sequestration, with an estimated amount of 7000 tons of CO2/year. Its management is focused on the reduction in fire risk, increase in carbon sequestration, and biodiversity conservation. Vegetation cover and its level of development are heterogeneous.
Tree cover is dominated by a deciduous native species, Pyrenean oak (Quercus pyrenaica), with different-aged trees, including mature trees and patches of regenerating trees. This species is able to regenerate vegetatively after fire; mature stands are characterized by a low level of flammability, while regenerating stands are associated with a high level of flammability [44]. The understory is dominated by perennial broom shrubs (Cytisus striatus, C. multiflorus), which are characterized by an ability to resprout after fire and by fire-stimulated seed germination [45,46]. The structural complexity of the landscape, composed of regenerating tree patches and a pyrophytic shrub layer, increases fire proneness and the vulnerability to fire spread, requiring regular shrub control. The use of livestock for biomass regulation is now being implemented through targeted grazing.

Objectives
The general goal of our work was to develop a method for high-resolution land cover mapping applicable to the case study's forest area, with a focus on fire-prone shrub vegetation. Unlike other studies, e.g., [17], where the target vegetation contrasts with bare soil or the ground cover, here shrub patches have irregular shapes and are interspersed with other types of cover, including herbaceous patches and rock outcrops. A dedicated automatic classifier, able to differentiate shrub cover from multiple cover types at small spatial scales, is needed for this type of environment. Maps of vegetation cover will ultimately serve as a foundation for better-informed landscape planning and grazing management and for research into innovative ways to integrate livestock production with biodiversity conservation and fire prevention in the fire-prone landscapes of Mediterranean regions.
The main objectives and contributions of this paper are: (i) the creation of pixel-based labeled datasets for the training, validation and testing of machine learning models for the classification of a fire-prone vegetation type (shrubs) from natural color UAV images; (ii) a systematic analysis of the best training practices for increasing the accuracy of a state-of-the-art CNN (U-Net) to automatically segment the key vegetation type in images; (iii) the evaluation of the feasibility and performance of semantic segmentation of shrub cover in a complex heterogeneous landscape.

Study Area
Quinta da França (Figure 1) is located in Covilhã, Portugal. The climate is Mediterranean with warm and dry summers, and most precipitation occurs from October to May. Summer is a critical season regarding the risk of forest fires, with the average temperature reaching 22.2 °C and only 10 mm of rainfall in August (https://en.climate-data.org/europe/portugal/covilha/covilha-6944/, accessed on 2 March 2022).
The area of interest is an oak forest (Figure 1b) of about 200 ha. This area includes, as of June 2018, a grazing parcel of about 100 ha, where the use of cattle is being tested as a nature-based solution for biomass regulation through grazing and trampling. The reintroduction of herbivores into fire-prone regions is being promoted as an environmentally sustainable and cost-effective tool for wildfire prevention [47]. However, such interventions also imply trade-offs and require thorough land planning and regular monitoring for which a detailed land cover mapping is essential.

Data Description
This study uses a set of images acquired by a hexacopter with two cameras: a VIS GITUP2 camera (Shenzhen, China) with RGB filter (370-680 nm) and a 170° lens (fisheye), and a NIR Mapir Survey2 NDVI camera (San Diego, California) (Red: 660 nm, NIR: 850 nm) with a 16 MP (4608 × 3456 px) sensor Sony Exmor IMX206 (Bayer RGB) (Tokyo, Japan) and 90° lens. The experiments presented in this paper use the RGB images acquired by the VIS GITUP2 camera. Some of the drawbacks of these images were the use of fisheye lenses and motion blur, which caused distortion and made the annotation more challenging, especially in peripheral areas of the images. In this paper, we exploit only the information contained in the RGB images, since this makes the presented method more convenient for use in combination with most aerial imaging systems, including off-the-shelf UAVs.
The flight altitude relative to the take-off point was 120 m, the velocity was 5 m/s, and photos were taken every 5 s. The resolution is about 6.25 cm/px. The images were provided to us by landowners and were originally acquired for purposes other than this study. Their primary objective was to cover a full forest site (about 200 ha), which came at the expense of a lower level of detail in the images. The UAV was assembled by the company Terraprima-Serviços Ambientais, Sociedade Unipessoal, Lda (Samora Correia, Portugal, https://www.terraprima.pt/pt, accessed on 2 March 2022).
We had access to three image sets, one acquired in August 2019, the other in December 2019, and another in August 2020. Training data were selected from the August 2019 image set. Test data were selected from all image sets, which allowed us to verify the generalization ability of the developed models for data from the different years and different seasons.
The images from all image sets were converted into PNG format and sliced into smaller square-shaped tiles with dimensions (800 × 800) px, corresponding to approximately (50 × 50) m patches of land. The tile size was chosen based on the size of the objects of interest and the amount of context. Selected tiles were then labeled at pixel level for the objects of interest. Four land cover classes were identified: shrubs, trees, shadows, and rocks.
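The relation between tile size and ground footprint can be checked with a short sketch. It assumes a simple non-overlapping grid that discards partial tiles at the image border; the text does not state how the slicing was actually done, and `tile_origins` is a hypothetical helper name.

```python
def tile_origins(width, height, tile=800):
    """Top-left corners of all full, non-overlapping tiles in an image.

    Assumption: partial tiles at the right and bottom borders are dropped.
    """
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


GSD_CM_PER_PX = 6.25  # ground resolution reported in the text

# Sensor size quoted in the Data Description (4608 x 3456 px).
origins = tile_origins(4608, 3456)

# Side length of one tile on the ground, in metres.
tile_side_m = 800 * GSD_CM_PER_PX / 100

print(len(origins), "full tiles per image;", tile_side_m, "m per tile side")
```

Under these assumptions one image yields a 5 × 4 grid of full tiles, and each 800 px tile spans 50 m on the ground, which agrees with the approximate (50 × 50) m footprint given above.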
For training, we selected a partition of 13 tiles from one of the original RGB TIFF images of 4608 × 3456 px taken in August 2019. We chose the tiles we considered as the most representative in terms of all different land cover configurations present in the image. We denote this set the Training Partition. For testing, we defined the following Test Partitions:

1. One (800 × 800) px tile derived from the same image from which training tiles were taken, which was not previously used for training;
2. Two (800 × 800) px tiles derived from two other images that were taken during the same flight;
3. Two (800 × 800) px tiles derived from one image that was taken during the same season but in a following year (August 2020, summer season);
4. Two (800 × 800) px tiles derived from one image that was taken during a different season (December 2019, winter season).
The reasoning for this approach was to test the performance of the trained models on highly similar data (1 and 2), on seasonally similar data (3) and on highly distinct data, taken during a different phenological stage (4).
Land cover classification requires a fine-grained understanding of an image and its context, meaning that dense pixel-level annotations, such as semantic or instance segmentation, were needed. While the former labels each pixel with a corresponding class, the latter also classifies each instance of a class separately. For the purposes of this paper, semantic segmentation was sufficient. Table 1 shows the pixel share of the four classes in the training partition. It can be observed that the data are unbalanced, i.e., the classes are uneven, which is representative of this type of landscape. The Labelbox application (https://labelbox.com/, accessed on 2 March 2022) was used for labeling. Segmentations approximated by superpixels were used to facilitate the annotation process, rather than selecting individual pixels. The Superpixel tool of Labelbox calculates segment clusters of pixels with similar color, which leads to more efficient annotation than using manual labeling, especially when it comes to objects with complex boundaries. The only parameter that was adjusted was segment cluster size. The smallest setting ('XS') was chosen not only due to the level of complexity of the objects' boundaries, but also due to the low inter-class variance of the target vegetation class, where more conservative calculations of the pixel color had to be applied. The drawback, in comparison to manual labeling, is that the algorithm can find patterns that are not relevant to the specific task or setting, leading to under- or over-segmentation. For such cases, there is an option of additional manual editing using the Eraser and Pen tools, which amount to manual labeling. These tools were also used during the labeling phase. We assigned pixels to classes through a visual inspection, based on our knowledge and experience with the Mediterranean vegetation, as well as our in situ acquaintance with the farm's vegetation and its distribution.
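The pixel share of each class, as reported in Table 1, can be computed from a label mask as sketched below. The toy mask and the "background" label for unlabeled pixels are hypothetical and only illustrate the calculation; they are not taken from the dataset.

```python
from collections import Counter


def pixel_share(mask):
    """Percentage of pixels per class label in a 2D label mask (list of rows)."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical 4 x 4 label mask; "background" stands for unlabeled pixels.
toy_mask = [
    ["tree", "tree", "shrub", "background"],
    ["tree", "tree", "shrub", "background"],
    ["tree", "shadow", "rock", "background"],
    ["tree", "shadow", "background", "background"],
]

share = pixel_share(toy_mask)
for label, pct in sorted(share.items(), key=lambda kv: -kv[1]):
    print(f"{label:<10} {pct:5.2f}%")
```

A strongly uneven share, such as a class covering only around 1% of pixels, is a signal that overall accuracy will be a poor summary for that class.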
The final product of the process was a set of hand-crafted dense pixel-level semantic segmentation maps, where each pixel was assigned a label of a corresponding class (Figure 2). Pixel-based classification maps accurately capture the geometry of an image, such as corners and fine elements, but can face issues such as noise or an incorrect characterization of context-dependent classes [33]. The labeling process was challenging because the boundaries between shrubs and other vegetation types were often indistinguishable. Additionally, there may be some incoherent labeling of shadows that coexisted with other classes.

Model
A state-of-the-art U-Net model (https://github.com/hlamba28/UNET-TGS, accessed on 2 March 2022), named TGS U-Net, was used as a basis for the work. The model builds on the original U-Net architecture (Figure 3), that extracts features with convolutional layers in the encoding part and restores the original size of the image in the decoding part. The TGS U-Net uses the input image size (128 × 128 × 3) and gradually reduces its dimensions, while increasing the number of channels (from 128 × 128 × 3 to 8 × 8 × 256), and then gradually increases its dimensions and decreases the depth (from 8 × 8 × 256 to 128 × 128 × 1).
The main building block of the TGS U-Net consists of two consecutive 2D convolutional layers with batch normalization and ReLU. Batch normalization was used to improve the training. The number of filters starts at 16 and is doubled at every convolution step. There are four such blocks in the encoder side, each followed by a max pooling layer, which halves the image dimensions, and a dropout layer. The fifth convolutional block forms a bottleneck with the maximum depth and minimum spatial dimensions, after which comes the decoder side, with four symmetrical deconvolution layers concatenated with the feature maps from the encoder side. Afterwards, comes a dropout layer and the convolutional block, which helps the model to assemble a more precise output. The number of filters is halved at each step, while the resolution is doubled. Ultimately, the output of a binary classification is sigmoid, which assigns each pixel a probability of belonging to the target class.
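The encoder and decoder dimensions described above can be traced with a small sketch. It does not build the network; it only follows the stated rules (filters start at 16 and double per encoder block, max pooling halves the spatial size, and the decoder mirrors this). The function name `unet_shapes` is hypothetical.

```python
def unet_shapes(input_size=128, channels=3, start_filters=16, depth=4):
    """Trace (height, width, channels) through a U-Net-like encoder/decoder."""
    shapes = [("input", (input_size, input_size, channels))]
    size, filters = input_size, start_filters

    # Encoder: each block is followed by max pooling and dropout.
    for level in range(1, depth + 1):
        shapes.append((f"encoder block {level}", (size, size, filters)))
        size //= 2      # max pooling halves height and width
        filters *= 2    # the next block doubles the number of filters

    shapes.append(("bottleneck", (size, size, filters)))

    # Decoder: upsampling doubles the resolution, filters are halved.
    for level in range(depth, 0, -1):
        size *= 2
        filters //= 2
        shapes.append((f"decoder block {level}", (size, size, filters)))

    # Binary segmentation: one sigmoid probability per pixel.
    shapes.append(("sigmoid output", (size, size, 1)))
    return shapes


shapes = unet_shapes()
for name, shape in shapes:
    print(f"{name:<16} {shape}")
```

For the default arguments the trace runs from 128 × 128 × 3 to an 8 × 8 × 256 bottleneck and back to 128 × 128 × 1, matching the dimensions quoted for the TGS U-Net.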
In this work, we kept most of the architecture of the original TGS U-Net but experimented with the effect of different input sizes and the number of filters in the quality of the segmentation in our scenario.

Model Training
We trained different models and systematically evaluated the effect of training parameters in performance. One of the most important parameters was the network input size. We tried the following choices: 128 × 128 px, 144 × 144 px, 192 × 192 px, 240 × 240 px, 288 × 288 px, 400 × 400 px, and 496 × 496 px. This required the adaptation of the input tiles (of size 800 × 800) to be processed by the network. First, we cropped the tiles in patches of smaller sizes, and then we resized these patches to the network input size. This may induce a scaling of the resolution that impacts the segmentation quality. In the experiments we evaluated this impact by creating several sets with different patch sizes: S1 with 832 (100 × 100) px patches, S2 with 208 (200 × 200) px patches, S3 with 117 (300 × 300) px patches, S4 with 52 (400 × 400) px patches, and S5 with 52 (500 × 500) px patches.
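The sample counts of the patch sets S1 to S5 can be reproduced with the sketch below. It assumes that patches are placed on an evenly spaced grid that covers the whole 800 px tile, overlapping where the patch size does not divide the tile size. This placement rule is an assumption that happens to match the reported counts; the cropping actually used may differ.

```python
import math


def patch_starts(tile=800, patch=300):
    """Evenly spaced start offsets so that patches cover one tile side.

    Assumption: neighbouring patches overlap when `patch` does not divide `tile`.
    """
    n = math.ceil(tile / patch)
    if n == 1:
        return [0]
    step = (tile - patch) / (n - 1)
    return [round(i * step) for i in range(n)]


def patches_per_tile(tile=800, patch=300):
    """Number of patches in the square grid of one tile."""
    return len(patch_starts(tile, patch)) ** 2


N_TILES = 13  # size of the Training Partition
PATCH_SETS = {"S1": 100, "S2": 200, "S3": 300, "S4": 400, "S5": 500}

counts = {name: N_TILES * patches_per_tile(patch=size)
          for name, size in PATCH_SETS.items()}
print(counts)
```

With 13 training tiles this gives 832, 208, 117, 52 and 52 patches for S1 to S5. For S3 the start offsets are 0, 250 and 500 px, so adjacent 300 px patches share a 50 px strip.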
Data augmentation was then applied to each of these patch sets, generating three variants with sizes of around 800, 1600 and 3800 samples. All of these combinations were called training datasets, as these were the actual data used in the network training. The same data augmentation techniques were used to generate all training datasets; these were random rotations, skews, flips, random brightness, elastic distortions, and shears from the Augmentor library (Marcus D Bloice, Peter M Roth, Andreas Holzinger, Biomedical image augmentation using Augmentor, Bioinformatics, https://github.com/mdbloice/Augmentor, accessed on 2 March 2022). Afterwards, patches were fed into the model with different input sizes, corresponding to different scale factors, depending on the patch dimensions (this is further explained in Section 3.1.2). Figure 4 summarizes all training datasets used. The model was trained with an Adam optimizer with a learning rate of 1 × 10^−5.
Predictions were compared to labels with the binary cross entropy loss function. Early stopping was implemented if the validation loss did not improve for 10 consecutive epochs to prevent overfitting. Learning rate was reduced when the validation loss did not improve for five consecutive epochs. Each pixel was assigned to the classes with probabilities above the threshold of 0.5. Each Training Dataset was split into a training and validation set with the ratio 9:1. The validation set was never used in the training process and was only used to evaluate the model's generalization ability during training and to take decisions on the training process. Each model was trained for 50 epochs.
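The interplay of the two patience rules can be illustrated with a simplified replay of a validation-loss curve. This is only a sketch of the logic, not the training code: in Keras such behaviour is commonly configured with the `EarlyStopping` and `ReduceLROnPlateau` callbacks, but whether those were used here is an assumption, and the reduction factor of 0.1 is a hypothetical value that the text does not report.

```python
def simulate_schedule(val_losses, lr=1e-5, stop_patience=10,
                      lr_patience=5, lr_factor=0.1):
    """Replay validation losses and return (epochs_run, final_lr).

    lr_factor is a hypothetical value; the text does not report it.
    """
    best = float("inf")
    since_best = 0     # epochs since the best validation loss
    since_lr_cut = 0   # epochs without improvement since the last LR cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_lr_cut = 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= lr_factor        # reduce the learning rate on a plateau
            since_lr_cut = 0
        if since_best >= stop_patience:
            return epoch, lr       # early stopping
    return len(val_losses), lr


# Invented loss curve: three improving epochs, then a plateau.
plateau = [1.0, 0.8, 0.7] + [0.75] * 20
epochs_run, final_lr = simulate_schedule(plateau)
print(epochs_run, final_lr)
```

In this invented example the learning rate is cut after epochs 8 and 13, and training stops at epoch 13, well before the 50-epoch limit. A run whose validation loss improves every epoch would instead use all 50 epochs at the initial rate.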
The cloud service Google Colab (https://colab.research.google.com/, accessed on 2 March 2022) was used for training and evaluating the model. Deep learning methods were implemented using Keras (https://keras.io/, accessed on 2 March 2022) with a TensorFlow (https://www.tensorflow.org, accessed on 2 March 2022) backend. With a memory limit of 12 GB and time limit of 12 h, which comes with the free version of the service, this paper also aimed to explore the setups with a reasonable trade-off between working within these limits and yielding good results. This increases the usability and practicality for future exploration with off-the-shelf computational systems. The developed code and used data sets are publicly available (https://github.com/firefrontproject/Shrubdetection-with-U-Net, accessed on 2 March 2022).

Evaluation Method
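The main evaluation metric is the F1 score, a class-specific measure of segmentation accuracy suitable for unbalanced datasets, such as the ones used in this paper:

$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

and:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP means true positives, FP is false positives and FN is false negatives.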
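As a minimal sketch, the F1 score of a binary prediction can be computed from per-pixel probabilities by thresholding at 0.5 and counting true positives, false positives and false negatives. The probability map and the ground-truth mask below are invented toy values, and the function names are hypothetical.

```python
def binarize(probs, threshold=0.5):
    """Assign each pixel to the target class if its probability exceeds the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]


def precision_recall_f1(pred, truth):
    """Pixel-wise precision, recall and F1 for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy 2 x 3 example (invented values).
probs = [[0.9, 0.6, 0.2],
         [0.4, 0.7, 0.1]]
truth = [[1, 1, 0],
         [1, 0, 0]]

precision, recall, f1 = precision_recall_f1(binarize(probs), truth)
print(precision, recall, f1)
```

True negatives do not enter the computation, which is why the F1 score is less affected by a dominant background class than overall accuracy. A model that predicts no target pixels at all has zero recall and therefore an F1 score of 0.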

Results
In this section, we present the results of several experiments designed to study the best model configurations in the training/validation sets, and to evaluate the generalization ability of the proposed models in test datasets. We performed several variations in model training according to the following conditions: (i) amount of training data, including augmentations; (ii) network input size; (iii) patch size; and (iv) hyperparameter tuning (number of filters, dropout rate and batch size). We started with a preliminary analysis of the effect of the amount of training data, patch size, and rescaling (network input size) by looking at performance metrics in the validation sets. This preliminary study allowed us to pre-select a set of promising model configurations for our problem. Then, we tested the trained models on the independent test datasets described in Section 2.2 to assess generalization to different conditions (same vs. different areas, days, seasons).

Preliminary Analysis in the Training Sets
In the first experiment (Section 3.1.1), we evaluated the performance of the system in a multi-class scenario. We considered not only shrubs but also trees, rocks, and shadows, as they are the most prominent patterns in the aerial images. Having noticed that the results for shrubs were not satisfactory, we decided to perform binary classifications in the remaining experiments (shrub cover vs. non-shrub cover), with more favorable results. In Section 3.1.2, we evaluate the influence of patch size, data augmentation and network input size in the performance of the model. In Section 3.1.3, we assess the influence of other hyperparameters.

Multi-Class Segmentation
The TGS U-Net was trained on the 832-sample dataset (without data augmentation) of all four classes separately. The model input dimensions were (128 × 128) px. The main characteristic of this dataset is its small patch size (100 × 100) px, and thus the highest number of original (i.e., non-augmented) samples. The results (for the target class, i.e., shrubs) obtained by using this dataset became the baseline for our study, because no other treatment of data was used in the first part of this experiment. The difference in performance among the used classes for the smallest 832-sample dataset can be found in Table 2. Confusion matrices of all classes from the initial experiment can be found in Table 3. Shrubs were often confused with trees, which was caused by the low inter-class variance of shrubs when compared to this class, and with shadows, because they overlapped with many of the shrubs in the images used. Differences in the performance of the different classes were also assessed visually in the form of continuous maps (heatmaps) and binary predictions (Figure 5). Using predictions in the form of heatmaps, instead of discrete classes, can be particularly useful in the case of landscapes with a lot of transitions among vegetation species or types, where pixels can contain more than one vegetation type. Because a lot of shrubs occur around and under the trees, the heatmaps could in fact be more useful than binary maps for aiding an expert in the decision-making process regarding landscape management. In this first experiment, we noticed that the classification of shrubs (the class of most interest for our applications) was underperforming with respect to the other classes. Thus, we decided to train the remaining models with just two classes: shrubs vs. non-shrubs. Therefore, in the remainder of the experiments we consider only models for binary classification.
In the second part of the initial experiment, for the target class (shrubs), the dataset was augmented to contain 1664 and 3832 samples. We note that no post-processing of the images was conducted. For the shrub class, the performance rose with the growing size of the training dataset, with F1 = 0.31 for the smallest (832), F1 = 0.63 for the intermediate (1664) and F1 = 0.68 for the largest dataset (3832).

Binary Segmentation
In the previous experiment, we verified that small patch sizes and multi-class segmentation did not perform well. In this experiment, we evaluated the simpler case of binary segmentation (shrub vs. non-shrub) and used the training datasets characterized by larger patch sizes, which increased memory and, most of all, time requirements. As a result, some of the experiments were left out because they were unfeasible to conduct; specifically, the rescaling experiments with the largest training datasets (3808 instances) and the larger network input sizes were left out of the study, even though they exhibited the best performance. Rescaling experiments with the worst-performing (the smallest, 808 instances) training sets and larger network input sizes were also left out. In the end, a set of 21 experiments was performed, exploring the impact of the patch size and rescaling of the model input on the performance. Data augmentation was also assessed simultaneously. No image post-processing was applied in this case either. The summary of conducted experiments can be found in Table 4. The impact of patch size and rescaling were then studied. With the increasing patch size, the accuracy was expected to improve, because a larger patch captures more spatial context, as is illustrated in Figure 6.
Tested scales were 1:1 (input size : patch size) as in [48], and 1:2, as in [49]. Because the input size must be compatible with the four max-pooling layers contained in the architecture of the TGS U-Net, and therefore must be divisible by 2^4 = 16, the scales are only approximate, as can be seen from Figure 4. Table 5 shows how changing the model input size changes the amount of space represented in one pixel. The results of a data augmentation and patch size variation for model input 128 × 128 are shown in Figure 7. The three best-performing models (F1 = 0.90), which can be seen in Figure 8, took 50, 46 and 60 h to train and qualitatively did not add much value in comparison to a model that took only four hours to train, which can be seen in Figure 9. The best tradeoff between training time and performance was achieved by a model using patch size S3 with 1664 (300 × 300) px samples, with a reduction of 50% in spatial dimensions (S3-1664_144 × 144). It achieved a validation F1 score of 0.82 in about four hours of training.
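Four 2 × 2 max-pooling layers halve the spatial size four times, so the input side should be a multiple of 2^4 = 16. The sketch below picks the largest such size that does not exceed the target and computes the ground distance covered by one input pixel. Rounding down is an assumption; it happens to reproduce several of the input sizes listed in the Model Training section. The helper names are hypothetical.

```python
GSD_CM_PER_PX = 6.25  # ground resolution of the original images, from the text
LISTED_INPUT_SIZES = [128, 144, 192, 240, 288, 400, 496]


def largest_valid_input(target_px, pooling_layers=4):
    """Largest size <= target_px that is divisible by 2**pooling_layers."""
    multiple = 2 ** pooling_layers
    return (target_px // multiple) * multiple


def ground_cm_per_input_px(patch_px, input_px, gsd=GSD_CM_PER_PX):
    """Ground distance covered by one pixel after resizing a patch to the input size."""
    return gsd * patch_px / input_px


# Scale 1:2 for S3: a 300 px patch targets 150 px, which rounds down to 144 px.
s3_half = largest_valid_input(300 // 2)
print(s3_half, round(ground_cm_per_input_px(300, s3_half), 2), "cm per input pixel")

# Scale 1:1 for S3 and S5.
print(largest_valid_input(300), largest_valid_input(500))
```

Under these assumptions, the S3 configuration at roughly half scale uses a 144 × 144 input, where one pixel covers about 13 cm on the ground instead of 6.25 cm.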
In Figure 8, we present the results of different combinations of patch size and input size, resulting in different patch scaling to fit the network input size.

Hyperparameter Tuning
This section addresses the impact of different initial number of filters, dropout rate and batch size on the performance. The search was manual, using the following values: 1.
The results of the hyperparameter tuning are summarized in Figure 10.

Test Data
In this final test phase, all trained models (see Table 4) were evaluated on independent test datasets derived for the test partitions described in Section 2.2, which represented highly similar, seasonally similar, and highly distinct data. The summer and winter images used for the test partitions are depicted in Figure 11. The highest performance on test data (0.76 to 0.77) was achieved with patch size S3, as can be seen in Table 6.

Preliminary Analysis in the Training Sets
In this section, we elaborate on the results obtained in Section 3.1, regarding multiclass and binary segmentation, as well as using hyperparameter tuning to improve the classification results.

Multi-Class Segmentation
Here, we analyze the results obtained in the validation partition of this set (Section 3.1).
In the first part of the initial experiment, we demonstrated that shrubs had the second lowest F1 score of all the classes. We also noted that the tree class significantly outperformed shrubs (even when compared to the augmented shrub dataset from the second part of this initial experiment). One reason for this could be that trees were a much more balanced class without any artificial adjustments to the data (accounting for 48.58% pixel representation across the dataset, unlike shrubs that only accounted for 20.99%). More importantly, however, trees seem to have clearer boundaries and a more regular shape, which makes them easier to distinguish in comparison to other classes. They also suffer less from high intra- and low inter-class variance. Shrubs, however, are more challenging due to their diversity of shapes, cover patterns, and, even, reflectance values (i.e., color). Because of this high variety of features, which could lead to misclassification, the neural network showed some limitations when compared to trees, which explains the lower precision obtained for this class. The very high accuracy of the class of rocks is misleading in this case because it was an underrepresented cover type that generated a small sample with only a 1.09% share of pixels in the dataset; the large number of true negatives masked the significant number of false positives and false negatives evident in the values of precision and recall. The recall scores were low, which means that the algorithm still underperformed when identifying rocks within their own class. The main reason pertains to the limited number of examples in the training samples for the neural network to learn the various patterns and reflectance values that can be exhibited by rocks. Furthermore, rocks can be covered with vegetation (grasses, mosses, lichens, and even small shrubs) and bordered by vegetation, which affects their shape and reflectance, making it even more challenging to correctly classify them.
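The effect of a rare class on accuracy can be illustrated numerically. The pixel counts below are invented for illustration only; they mimic a class with a 1.09% pixel share and are not results from this study.

```python
def summarize(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 10,000 pixels, 109 of which (1.09%) are rock.
rock = summarize(tp=30, fp=40, fn=79, tn=9851)
for name, value in rock.items():
    print(f"{name:<9} {value:.3f}")
```

With these invented counts the accuracy is close to 0.99 although fewer than a third of the rock pixels are found, because the 9851 true negatives dominate the numerator. The F1 score, which ignores true negatives, stays near 0.34.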
In the second part of the initial experiment, we showed that the largest dataset achieved the highest F1 score. This was an expected result since the model had more learning examples, and data augmentation aided in encoding more invariance, making the learning process more robust. However, the performance began to converge with the increasing size from the intermediate to the largest dataset. Thus, although there might still be a potential for further improvement by data extension through augmentation, the performance gains would likely be marginal.
Overall, patch sizes of 100 × 100 px performed poorly in comparison to other patch sizes. Small patches likely failed to capture enough of the spatial detail and fine-grained boundaries between the class and the background. There were presumably too many patches consisting of only a part of one object, not capturing enough of the context. Moreover, contrary to the remaining datasets used in our study, the patches here were upscaled (from 100 × 100 px to 128 × 128 px), which could increase blur and break down relevant patterns. Additionally, scaling factors above 1 provide little improvement in performance because no additional information is gained; instead, the upscaled patches occupy more GPU memory [55].

Binary Segmentation
Here, we explored the impact of different experiments with binary segmentation on the results. First, we investigated the impact of data augmentation, which notably improved performance. The greatest differences among F1 scores of models trained with smaller patch sizes were between the 808- and 1658-instance datasets, while these differences began to plateau at 3808 samples. Apparently, the 808-sample datasets did not contain a sufficient amount of information. (The F1 score of 0 in patch size S5 for the smallest dataset reflects a degeneracy of the network to 0 recall, i.e., no detections.) Doubling the dataset size to 1658 already seemed to be satisfactory, and expanding it even further may not compensate for the added computational cost of training. For the larger patch size S4, the F1 score equalized among different-sized datasets.
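A minimal sketch of how a labeled patch can be augmented is given below. The specific transforms (random 90° rotations and a horizontal flip) and the function name `augment_pair` are assumptions made for illustration and do not necessarily match the augmentation variants used in this study. The key point is that the image and its mask receive exactly the same geometric transform so that the labels stay aligned.

```python
import numpy as np


def augment_pair(image, mask, rng):
    """Apply the same random rotation and flip to an RGB patch and its mask."""
    k = int(rng.integers(0, 4))               # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:                    # horizontal flip half of the time
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return image.copy(), mask.copy()


rng = np.random.default_rng(0)
# Hypothetical square patch: the mask marks a synthetic "shrub" pattern and
# the image simply copies it into three channels.
mask = (np.arange(64 * 64).reshape(64, 64) % 7 == 0).astype(np.uint8)
image = np.repeat(mask[..., None], 3, axis=2) * 255

aug_image, aug_mask = augment_pair(image, mask, rng)
```

Because only rotations by multiples of 90° and flips are used, square patches keep their shape and no pixel values are interpolated.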
Next, we explored the impact of the patch size. This was motivated by the studies of [48,56], suggesting that the accuracy should improve with the increasing patch size because a larger patch captures more spatial context. Increasing the patch size improved the performance for the smallest training set (808), while increasing it beyond 300 × 300 px for the larger sets (1658, 3808) proved to be unjustified, since it did not improve the classification results, similar to [35] (Figure 7). Instead, it increased time and computational requirements.
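Patch cropping can be sketched as follows. This is an illustrative implementation under assumptions: the function name `tile_image`, the non-overlapping stride and the decision to drop edge remainders are not taken from the study, and the image dimensions are hypothetical.

```python
import numpy as np


def tile_image(image, patch_size, stride=None):
    """Crop square patches on a regular grid; incomplete edge patches are dropped."""
    stride = stride or patch_size             # default: non-overlapping tiles
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)


# Hypothetical RGB image of 900 x 1200 px (assumption).
image = np.zeros((900, 1200, 3), dtype=np.uint8)
patches = tile_image(image, 300)
print(patches.shape)  # (12, 300, 300, 3)
```

For the same image, smaller patches produce more training samples, each covering less spatial context, which is the trade-off examined in this experiment.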
Finally, resizing was studied. Resizing images to smaller resolutions may lead to a loss of information [57]. Reina et al. [40] achieved a better performance with minimal downscaling, whereas other studies report that downscaling the input patch can contribute to a better filtering of the relevant spatial patterns [49,58]. The outcome can therefore depend on the content of the images and the target group. The goal was to find out which approach would work for the data used in this paper. Scaling down images too much could significantly hamper the ability to detect structures and textures. Higher F1 scores were achieved when the size of the rescaled patch was closer to the input size. This is especially important in cases where the size of the objects of interest is already small [59], or where downscaling would lead to a loss of relevant context information [48,57]. However, downscaling is an interesting technique for shortening the training time [59], and the scale of 1:2 is a good trade-off between the small decrease in performance and a shorter training time [49].
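The effect of a 1:2 scale on the input size can be illustrated with a short sketch. Block averaging for the image and strided sampling for the mask are assumptions made here for illustration; the interpolation method actually used for rescaling is not specified in this section, and the function names are hypothetical.

```python
import numpy as np


def downscale_half(image):
    """Downscale an (H, W, C) image by 1:2 by averaging each 2 x 2 pixel block."""
    height = image.shape[0] // 2 * 2          # drop an odd last row or column
    width = image.shape[1] // 2 * 2
    img = image[:height, :width].astype(np.float32)
    blocks = img.reshape(height // 2, 2, width // 2, 2, -1)
    return blocks.mean(axis=(1, 3))


def downscale_mask_half(mask):
    """Downscale a label mask by 1:2 by keeping every second pixel (no blending)."""
    return mask[::2, ::2]


# Hypothetical 300 x 300 px RGB patch filled with increasing values.
patch = np.arange(300 * 300 * 3).reshape(300, 300, 3)
small = downscale_half(patch)
print(small.shape)  # (150, 150, 3)
```

Halving each side reduces the number of pixels per patch by a factor of four, which is consistent with the shorter training time reported for the 1:2 scale, while averaging removes some of the fine texture.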

Hyperparameter Tuning
Similar to [50], adding more filters improved the performance only until a certain point (32 filters), after which it started to decrease (64 filters), disagreeing with the general notion that deeper networks achieve better accuracies [29]. Using more filters made the network deeper and more complicated, which was probably not necessary for the kind of data used in this study, or it introduced too many learnable parameters for the available data, which caused overfitting. The best performing model, with 32 filters, achieved an F1 score of 0.84 but took 10 h to train, while the model with 16 filters achieved an F1 score of 0.82 in half the time.
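How quickly the number of learnable parameters grows with the number of filters can be estimated with a rough sketch. The architecture assumed below (four encoder levels, two 3 × 3 convolutions per level, 2 × 2 transposed convolutions for upsampling, no batch normalization, one output channel) is a common U-Net layout chosen for illustration and is not necessarily the exact configuration used in this study.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out + c_out


def unet_params(base_filters, depth=4, in_channels=3, n_classes=1):
    """Count learnable parameters of an assumed plain U-Net layout."""
    total = 0
    c_in = in_channels
    for level in range(depth):                      # encoder
        c_out = base_filters * 2 ** level
        total += conv_params(c_in, c_out) + conv_params(c_out, c_out)
        c_in = c_out
    c_out = base_filters * 2 ** depth               # bottleneck
    total += conv_params(c_in, c_out) + conv_params(c_out, c_out)
    c_in = c_out
    for level in reversed(range(depth)):            # decoder
        c_out = base_filters * 2 ** level
        total += conv_params(c_in, c_out, k=2)      # transposed convolution
        # the skip connection doubles the input channels of the first conv
        total += conv_params(2 * c_out, c_out) + conv_params(c_out, c_out)
        c_in = c_out
    total += conv_params(c_in, n_classes, k=1)      # 1 x 1 output layer
    return total


for filters in (16, 32, 64):
    print(filters, unet_params(filters))
```

Under these assumptions, each doubling of the base number of filters roughly quadruples the parameter count, so a model with 64 filters has about sixteen times as many parameters to fit from the same small dataset as a model with 16 filters.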
The metrics generally worsened with the increasing dropout rate. The only exception was recall, which increased to 1 with the highest dropout rate. The model simply labeled most of the pixels as shrubs, producing many false positives. The best performance was achieved with the smallest dropout rate, which was part of the original setting. There was less deterioration in metrics between the dropout rates 0.05 and 0.2, but this became more apparent with larger dropout rates. The change was especially pronounced between dropouts 0.2 and 0.5, where the decline, especially in accuracy but also in precision and F1 score, was significant.
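The behavior at the highest dropout rate can be checked against the limiting case in which the model labels every pixel as shrub. The short derivation below is a sketch under that assumption; the symbol $\pi$ (the fraction of the $N$ pixels that truly belong to the shrub class) is introduced here for illustration. In this case $FN = TN = 0$, $TP = \pi N$ and $FP = (1 - \pi) N$, so that:

```latex
\begin{aligned}
\text{recall}    &= \frac{TP}{TP + FN} = 1, \\
\text{precision} &= \frac{TP}{TP + FP} = \pi, \\
\text{accuracy}  &= \frac{TP + TN}{N} = \pi, \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2\pi}{1 + \pi}.
\end{aligned}
```

Recall therefore saturates at 1 while precision and accuracy fall toward the share of shrub pixels, which agrees with the observed pattern of a high recall accompanied by many false positives.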
Batch size is a hyperparameter that, like many other hyperparameters, depends on many factors, such as the type of problem or data. Some authors [54] reported the best results when using a batch size as small as 2 or 4, while others [60] favored batch sizes as large as 128. The batch size did not have much of an impact on the results in this study. Considering that further exploration of batch size tuning would depend on the computational resources available, and that a batch size of 32 is generally recommended as a suitable value in many cases, we did not experiment further with this hyperparameter.
There are many other hyperparameters that could be further explored to improve the classification results, but the optimal model generally depends more on the data used (https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html, accessed on 2 March 2022) than on hyperparameters. Nevertheless, it is important not only to tune the hyperparameters, but also to choose them diligently, since some of them may have a significant impact on the results, while others can have almost none.

Test Data
As expected, models performed worse with the new data. Some examples of the test results are shown in Table 6. The drop arises because the test data did not come from the same dataset as the training and validation data (excluding test dataset 1). The best average performance of all the experiments was achieved on test dataset 2 (F1 = 0.70), while test datasets 1 and 3 performed equally. The greatest culprit behind the gap between validation and test results was most likely that the spatial distribution of vegetation in the testing patches differed from that in the training and validation sets. Because the dataset was quite small, the models could not learn enough different spatial distributions of the target class. Patches in test set 3 came from different images taken in a different year, so this was most likely the dominant factor in the performance decrease. The best evaluations were on test set 2 because these images were taken on the same day as the training images. The image from which training and validation patches were derived did not cover a sufficiently representative sample of the shrub patterns in the area. The winter images were too different from the summer data for the model to extrapolate to them; a separate model would be necessary.
Furthermore, higher testing performances were generally achieved by models using larger patch sizes, larger dataset sizes and larger model input dimensions, in accordance with the validation results from Section 3.1.2. Data augmentation, patch size and model input dimensions (i.e., downscaling in our experiments) proved to be beneficial for the training and classification performance. The hyperparameter tuning did not bring any significant improvements in the performance for either the validation or the test sets. Generally, the gaps between validation and test scores are relative to the data, selected metrics and models (https://machinelearningmastery.com/the-model-performance-mismatch-problem/, accessed on 2 March 2022).

Conclusions
This paper explored the potential of detecting irregular shrub cover in a complex heterogeneous landscape with U-Net. We presented a systematic analysis of the most important training parameters of a U-Net neural network when creating models for the segmentation of shrubs in RGB images acquired from a UAV. Due to their fire tolerance and high flammability, shrubs are of priority interest in terms of fire risk assessment and preventive management in Mediterranean regions, and their mapping is fundamental for better-informed land management and the reduction of forest fire hazards. This work consisted of two main parts: creating and manually labeling datasets and developing methods to increase detection accuracy using a U-Net neural network. We evaluated the impact of data augmentation, tiling, rescaling and hyperparameter tuning (number of filters, dropout rate and batch size) on the accuracy of the system. With respect to data augmentation, we observed that the largest datasets containing 3808 samples yielded the highest F1 scores. Regarding patch size, patches of 300 × 300 px, in combination with the largest datasets, provided the best results. For the larger datasets, larger patches did not improve performance, but increased the training time and computational demands. As for downscaling, degrading the image resolution typically leads to a loss of information, but a scale of 1:2 significantly decreased the training time while maintaining good performance levels. The configuration of pre-processing techniques yielding the best results depends on the problem and on the object of interest [19]. Hence, finding an optimal set of methods requires exhaustive research but could reap large benefits.
The major identified limitations were the amount of labeled data and the difficulty in ensuring a high precision when labeling. Using larger datasets with patches derived from several images taken during multiple flights could have a significant positive effect on the results. High-quality labels remain one of the central elements of image classification success. Due to the detailed boundaries of shrubs and some ambiguities with background elements, labeling by multiple annotators could help improve the quality of data. Using only RGB data was also a limitation, but this paper shows that it is possible to achieve reasonable results for some applications with such an inexpensive sensor.
Thus, based on the results achieved in this paper, we believe that further improvements in performance could be achieved by:

• Further enlargement of the datasets, either from more labeled data from spatially and temporally distinct samples, or by employing more data augmentation variants;
• Decreasing labeling incoherency, especially in case of frequently overlapping classes, by using more annotators and stricter rules on how to label mixed classes;
• Alternatively, or in addition to the previous point, reducing the demands on precise segmentation and allowing a less precise approach to labeling, e.g., selecting random regions of interest (ROI) within the class area, without identifying the exact borders of the class object. This could yield higher volumes of samples and concurrently dramatically reduce labeling time;
• The systematic search of hyperparameters for augmentation and pre-processing techniques suitable for these particular data and tasks. Due to limited computational resources, we could not perform an exhaustive search of hyperparameters, but we noticed their importance in optimizing performance.
Finally, considering the nature and the objective of this task, using heatmaps in combination with expert opinions could be a better option than using binary predictions. This work has the potential to serve as an information tool for land planning and grazing management and could also be modified and repurposed to map other vegetation types, such as trees, or be used as a forest inventory tool.