Mask R-CNN and OBIA Fusion Improves the Segmentation of Scattered Vegetation in Very High-Resolution Optical Sensors

Vegetation generally appears scattered in drylands. Its structure, composition and spatial patterns are key controls of biotic interactions, water, and nutrient cycles. Applying segmentation methods to very high-resolution images for monitoring changes in vegetation cover can provide relevant information for dryland conservation ecology. For this reason, improving segmentation methods and understanding the effect of spatial resolution on segmentation results is key to improving dryland vegetation monitoring. We explored and analyzed the accuracy of Object-Based Image Analysis (OBIA) and Mask Region-based Convolutional Neural Networks (Mask R-CNN), and of the fusion of both methods, in the segmentation of scattered vegetation in a dryland ecosystem. As a case study, we mapped Ziziphus lotus, the dominant shrub of a habitat of conservation priority in one of the driest areas of Europe. Our results show for the first time that fusing the results from OBIA and Mask R-CNN increases the accuracy of the segmentation of scattered shrubs by up to 25% compared to either method separately. Hence, by fusing OBIA and Mask R-CNN on very high-resolution images, the improved segmentation accuracy of vegetation mapping would lead to more precise and sensitive monitoring of changes in biodiversity and ecosystem services in drylands.


Introduction
Dryland biomes cover ~47% of the Earth's surface [1]. In these environments, vegetation appears scattered [2] and its structure, composition and spatial patterns are key indicators of biotic interactions [3], regulation of water, and nutrient cycles at landscape level [4]. Changes in the cover and spatial patterns of dryland vegetation occur in response to land degradation processes [5]. Hence, methods to identify and characterize vegetation patches and their structural characteristics can improve our ability to understand dryland functioning and to assess desertification risk [5][6][7][8]. Progress has been made using remote sensing tools in this regard (e.g., quantification of dryland vegetation structure at landscape scale [9], monitoring vegetation trends [10], spatial patterns identifying ecosystem multifunctionality [11], characterizing flood dynamics [12], among many others). However, improving the accuracy of vegetation cover measurements remains an active area of study aimed at obtaining maximum performance from data and technology. Estimating and monitoring changes in vegetation cover through remote sensing is key for dryland ecology and conservation [6]. Both historical temporal and spatial data are the basis for remote sensing studies that identify the functioning and structure of vegetation [13,14].
The analysis of very high-resolution images to detect and measure vegetation cover and its spatial arrangement across the landscape typically starts by segmenting the objects to be identified in the images [7]. Object-Based Image Analysis (OBIA) [15] and Mask Region-based Convolutional Neural Networks (Mask R-CNN) [16] are among the most widely used state-of-the-art segmentation methods. Though they provide a similar product, the two methods rely on very different approaches. OBIA combines spectral information from each pixel with its spatial context [17,18]. Similar pixels are then grouped into homogeneous objects that are used as the basis for further classification. Mask R-CNN, on the other hand, is a type of artificial intelligence whose functioning is inspired by the human brain, and it provides models that are transferable between zones and semantic segmentation with unprecedented accuracy [19,20]. Besides, fusion has recently been used to improve the spectral, spatial, and temporal resolution of remote sensing images [21][22][23]. However, the fusion of methods for vegetation mapping has not been evaluated.
Remote sensing studies based on very high-resolution images have increased in recent years (e.g., [24][25][26][27]), partly because of the availability of Google Earth images worldwide [28][29][30] and the popularization of unmanned aerial vehicles (UAV). Although these images have shown a high potential for vegetation mapping and monitoring [31][32][33], two main problems arise when they are used. First, higher spatial resolution increases the spectral heterogeneity among and within vegetation types, resulting in a salt-and-pepper effect in their identification that does not correctly characterize the actual surface [34]. Second, the processing time and computational power required for very high-resolution images are larger than for low-resolution images [35]. Under these conditions, traditional pixel-based analysis has proved to be less accurate than OBIA or Mask R-CNN for scattered vegetation mapping [15,36]. There are many applications for OBIA [37][38][39] and deep learning segmentation methods [40,41], for example, mapping greenhouses [42], monitoring disturbances affecting vegetation cover [5], or counting scattered trees in the Sahel and Sahara [43]. These methods have been compared with excellent results in both segmenting and detecting tree cover and scattered vegetation [7,44,45]. However, greater precision is always advisable in problems of very high sensitivity [46]. Despite methodological advances, selecting the appropriate image source is key to producing accurate segmentations of objects, as in vegetation maps [47,48], and there is no general answer to the question of which image or method to choose for segmenting objects. Understanding how the spatial resolution of the imagery affects these segmentation methods, or the fusion of both, is key for applying them correctly and obtaining better accuracy in object segmentation for vegetation mapping in drylands.
To evaluate which of OBIA and Mask R-CNN is the more accurate method to segment scattered vegetation in drylands, and to understand the effect of the spatial resolution of the images used in this process, we assessed the accuracy of these two methods in the segmentation of scattered dryland shrubs and compared how the final accuracy varies with spatial resolution. We also assessed the accuracy of the fusion of both methods.
This work is organized as follows. Section 2 describes the study area, the dataset used, and the methodologies tested. Section 3 describes the experiments addressed to assess the accuracies of the methods used. The experimental results and discussion are presented in Section 4, and conclusions are given in Section 5.

Study Area
We focused on the community of Ziziphus lotus shrubs, an ecosystem of priority conservation interest at European level (habitat 5220* of Directive 92/43/EEC), located in Cabo de Gata-Níjar Natural Park (36°49′43″ N, 2°17′30″ W, SE Spain), one of the driest areas of continental Europe. This type of vegetation is scarce and patchy, and appears surrounded by a matrix of bare soil and small shrubs (e.g., Launaea arborescens, Lygeum spartum and Thymus hyemalis). Z. lotus is a facultative phreatophyte [49] and forms large hemispherical canopies (1-3 m tall) that constitute fertility islands where many other species of plants and animals live [50]. These shrubs are long-lived species that contribute to the formation of geomorphological structures called nebkhas [51], which protect against the intense wind erosion that characterizes the area, thereby retaining soil, nutrients, and moisture.

Dataset
The data set consisted of two plots (Plot 1 and Plot 2) with 3 images of different spatial resolution in each one. The plots had an area of 250 × 250 m with scattered Z. lotus shrubs. The images were obtained from optical remote sensors in the visible spectral range, Red, Green and Blue bands (RGB) and spatial resolutions of < 1 m/pixel:

OBIA
OBIA-based segmentation is a method of image analysis that divides the image into homogeneous objects of interest (i.e., groups of pixels also called segments) based on similarities of shape, spectral information, and contextual information [17]. It identifies homogeneous and discrete image objects by setting an optimal combination of values for three parameters (i.e., Scale, Shape, and Compactness) related to their spectral and spatial variability. There are no unique values for any of these parameters, and their final combination always depends on the object of interest, so finding the optimal combination is a challenge due to the vast number of possible combinations. First, it is necessary to establish an appropriate Scale level depending on the size of the object studied in the image [43]; for example, low Scale values for small shrubs and high Scale values for large shrubs [44,45]. Recent advances have been oriented toward developing techniques (e.g., [53][54][55][56][57][58][59]) and algorithms (e.g., [60][61][62][63]) to automatically find the optimal value of the Scale parameter [64], which is the most important parameter for determining the size of the segmented objects [65,66]. The Shape and the Compactness parameters must be configured too. While high values of the Shape parameter prioritize shape over colour, high values of the Compactness parameter prioritize the compactness of the objects over the smoothness of their edges [67].

Mask R-CNN
To locate and delimit the edges of dispersed shrubs, we used a computer vision technique named instance segmentation [68]. This technique infers a label for each pixel considering other nearby objects, thus including the boundaries of the object. We used the Mask R-CNN segmentation model [16], which extends the Faster R-CNN detection model [16] and provides three outputs for each object: (i) a class label, (ii) a bounding box that delimits the object and (iii) a mask that delimits the pixels that constitute each object.
In the binary problem addressed in this work, Mask R-CNN generates for each predicted object instance a binary mask (values of 0 and 1), where values of 1 indicate a Z. lotus pixel and 0 indicates a bare soil pixel.
Mask R-CNN relies on a classification model for the task of feature extraction. In this work, we used ResNet 101 [69] to extract increasingly higher-level characteristics from the lowest to the deepest layer levels.
The learning process of Mask R-CNN is influenced by the number of epochs, which is the number of times the network goes through the training data, and by other optimizations such as transfer-learning or data-augmentation (see Section 3.2). Finally, the 1024 × 1024 × 3 band input image is converted into a 32 × 32 × 2048 feature map, and objects are represented at different scales via the feature pyramid network.
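The reduction from 1024 to 32 pixels per side quoted above corresponds to an overall stride of 32. As a rough illustration, the sketch below lists the feature-map shape after each ResNet stage, assuming the standard ResNet bottleneck strides (4, 8, 16, 32) and channel depths (256, 512, 1024, 2048). The stage names C2 to C5 and the helper `stage_shapes` are illustrative and are not taken from this work.

```python
# Feature-map shapes of the ResNet stage outputs for a square input image.
# Strides and channel depths are the standard ResNet bottleneck values
# (an assumption of this sketch, not figures reported in the paper).
RESNET_STAGES = {
    "C2": (4, 256),
    "C3": (8, 512),
    "C4": (16, 1024),
    "C5": (32, 2048),
}


def stage_shapes(input_size):
    """Return {stage: (height, width, channels)} for a square input."""
    return {
        name: (input_size // stride, input_size // stride, channels)
        for name, (stride, channels) in RESNET_STAGES.items()
    }


shapes = stage_shapes(1024)
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions the deepest stage (C5) yields the 32 × 32 × 2048 map mentioned in the text, while the shallower stages keep finer spatial detail for smaller objects.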

Segmentation Accuracy Assessment
The accuracy of the segmentation task in this work was assessed with respect to ground truth by using the Euclidean Distance v.2 (ED2; [70]), which evaluates the geometric and arithmetic discrepancy between reference polygons and the segments obtained during the segmentation process. Both types of discrepancy need to be assessed. As reference polygons, we used the perimeter of 60 Z. lotus shrubs measured with photo-interpretation in all images by a technical expert. We estimated the geometric discrepancy by the "Potential Segmentation Error" (PSE; Equation (1)), defined as the ratio of the total area of each segment obtained in the segmentation that falls outside the reference segment and the total area of reference polygons as:

$$\mathrm{PSE} = \frac{\sum \left| s_i - r_k \right|}{\sum \left| r_k \right|} \qquad (1)$$

where PSE is the "Potential Segmentation Error", $|r_k|$ is the area of the reference polygon and $|s_i - r_k|$ is the overestimated area of the segment obtained during the segmentation. A value of 0 indicates that segments obtained from the segmentation fit well into the reference polygons. Conversely, larger values indicate a discrepancy between reference polygons and the segments. Although the geometric relation is necessary, it is not enough to describe the discrepancies between the segments obtained during the segmentation process and the corresponding reference polygons. To solve this problem, the ED2 index includes an additional factor, the "Number-of-Segmentation Ratio" (NSR), that evaluates the arithmetic discrepancy between the reference polygons and the generated segments (Equation (2)):

$$\mathrm{NSR} = \frac{\mathrm{abs}(m - v)}{m} \qquad (2)$$

where NSR is the arithmetic discrepancy between the polygons of the resulting segmentation and the reference polygons and abs is the absolute value of the difference between the number of reference polygons, $m$, and the number of segments obtained, $v$.
Thus, the ED2 can be defined as the joint effect of geometric and arithmetic differences (Equation (3)), estimated from PSE and NSR, respectively, as:

$$\mathrm{ED2} = \sqrt{\mathrm{PSE}^2 + \mathrm{NSR}^2} \qquad (3)$$

where ED2 is Euclidean Distance v.2, PSE is Potential Segmentation Error, and NSR is Number-of-Segmentation Ratio. According to Liu et al. [70], values of ED2 close to 0 indicate good arithmetic and geometric coincidence, while high values indicate a mismatch between them.
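The three quantities described above can be sketched in a few lines of code. This is a minimal illustration, assuming that the overestimated area of each segment (the part falling outside its reference polygon) and the area of each reference polygon have already been measured. The function names and the numeric values are hypothetical and are not data from this study.

```python
import math


def potential_segmentation_error(overestimated_areas, reference_areas):
    """PSE: total segment area outside the references / total reference area."""
    return sum(overestimated_areas) / sum(reference_areas)


def number_of_segments_ratio(n_reference, n_segments):
    """NSR: |m - v| / m, with m reference polygons and v obtained segments."""
    return abs(n_reference - n_segments) / n_reference


def euclidean_distance_2(pse, nsr):
    """ED2: joint geometric (PSE) and arithmetic (NSR) discrepancy."""
    return math.hypot(pse, nsr)


# Hypothetical example: areas in square metres.
reference_areas = [100.0, 50.0, 50.0]
overestimated_areas = [3.0, 2.0, 1.0]

pse = potential_segmentation_error(overestimated_areas, reference_areas)
nsr = number_of_segments_ratio(n_reference=25, n_segments=26)
ed2 = euclidean_distance_2(pse, nsr)
print(f"PSE={pse:.3f} NSR={nsr:.3f} ED2={ed2:.3f}")
```

Because ED2 is a Euclidean norm, a segmentation with a perfect geometric fit can still score poorly if the number of segments differs strongly from the number of reference polygons, and vice versa.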

Experiments
We set up several experiments to assess the accuracy of OBIA and Mask R-CNN in segmenting scattered vegetation in drylands. We used the images of Plot 1 to test the OBIA and Mask R-CNN segmentation methods. The images of Plot 2 were used exclusively for the training phase in the Mask R-CNN experiments (Figure 1). In Section 3.1, we describe the OBIA experiments, focused on detecting the best parameters (i.e., Scale, Shape and Compactness) of the widely used "multi-resolution" segmentation algorithm [71]. In Section 3.2, we describe the Mask R-CNN experiments, in which we first evaluated the precision in the detection of shrubs (i.e., capturing the presence of shrubs) and then the accuracy of the segmentation of those shrubs. Finally, in Section 3.3, we describe the fusion of both methods; all the accuracies are compared in Section 4.3.

OBIA Experiments
To obtain the optimal value of each parameter of the OBIA segmentation, we used two approaches: (i) A ruleset called Segmentation Parameters Range (SPR) in eCognition v8.9 (Definiens, Munich, Germany) with the "multi-resolution" algorithm, which segmented the images of Plot 1 by systematically increasing the Scale parameter in steps of 5 and the Shape and Compactness parameters in steps of 0.1. The Scale parameter ranged from 80 to 430, and the Shape and the Compactness from 0.1 to 0.9. We generated a total of 9234 results with possible segmentations of Z. lotus shrubs. The Scale parameter ranges were evaluated considering the minimum cover size (12 m²) and maximum cover size (311 m²) of the shrubs measured in the plot and the pixel size. (ii) We also applied the semi-automatic method Estimation of Scale Parameter v.2 (ESP2; [70]) to select the best Scale parameter. This tool performs semi-automatic segmentation of multiband images within a range of increasing Scale values (Levels), while the user previously defines the values of the Compactness and Shape parameters. Three options available in the ESP2 tool were tested: (a) the hierarchical analysis Top-down (HT), which starts from the highest level and segments these objects for lower levels; (b) the hierarchical analysis Bottom-up (HB), which starts from the lowest level and combines objects to obtain higher levels; and (c) the analysis without hierarchy (NH), where each Scale parameter is generated independently, based only on the level of the pixel [64].

Mask R-CNN Experiments
Mask R-CNN segmentation is divided into two phases: (i) training and (ii) testing. In the training phase, we selected 100 training polygons representing 100 shrub individuals of different sizes. The sampling was done using VGG Image Annotator [72] to generate a JSON file that includes the coordinates of all the vertices of each segment, equivalent to the perimeter of each shrub. To increase the number of samples and reduce overfitting of the model, we applied data-augmentation and transfer-learning. Transfer-learning consists of using knowledge learnt from one problem in another related one [73], and we used it to improve the neural network. Since the first layers of a neural network extract low-level characteristics, such as colour and edges, they do not change significantly and can be reused for other visual recognition tasks. As our new dataset was small, we fine-tuned the last part of the network by updating the penultimate weights, so that the model did not overfit, as mainly occurs between the first layers of the network. We specifically used transfer-learning on ResNet 101 [69] and used Region-based CNN with the pre-trained weights of the same architecture on the COCO dataset (around 330,000 images and 80 object categories) [74].
We tested three different learning periods (100 steps per epoch) per model: (A) 40 epochs with transfer-learning in heads, (B) 80 epochs with 4 first layers transfer-learning, (C) 160 epochs with all layers transfer-learning.
We trained the algorithm based on the ResNet architecture with a depth of 101 layers with each of the three proposed spatial resolutions. We then evaluated the trained models in all possible combinations between the resolutions. We evaluated the use of data-augmentation and transfer-learning from more superficial layers to the whole architecture with different stages in the training process. Particularly:
We did the test phase using Plot 1. To identify the most accurate experiments, we evaluated the detection of the CNN-based models, and determined their Precision, Recall, and F1-measure [75] as:

$$\mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad (4)$$

$$\mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (5)$$

$$\text{F1-measure} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$

Fusion of OBIA and Mask R-CNN
We combined the most accurate segmentations obtained using OBIA and Mask R-CNN, according to ED2 values (Figure 1). We let $o_i$ denote the i-th OBIA polygon within the OBIA segmentation, $O$, and $c_j$ denote the j-th Mask R-CNN polygon within the Mask R-CNN segmentation, $C$. Then we have $O = \{o_i : i = 1, 2, ..., m\}$ and $C = \{c_j : j = 1, 2, ..., n\}$. Here, the subscripts $i$ and $j$ are sequential numbers for the polygons of the OBIA and Mask R-CNN segmentations, respectively, and $m$ and $n$ indicate the total numbers of objects segmented with OBIA and Mask R-CNN, respectively; $m$ and $n$ must be equal. Finally, the corresponding segment data sets extracted by the fusion (Equation (7)) are considered a consensus among the initially segmented objects:

$$OC_{ij} = \mathrm{area}(o_i \cap c_j) \qquad (7)$$

where $OC_{ij}$ is the intersected area between the segments of the OBIA segmentation ($o_i$) and the segments of the Mask R-CNN segmentation ($c_j$). Finally, we estimated the ED2 values of the final segmentation using validation shrubs from Plot 1, and we compared them with the segmentation accuracy obtained by the different methods.
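The consensus described above can be illustrated on rasterized segments. The following sketch assumes that one OBIA segment and one Mask R-CNN segment of the same shrub have been rasterized onto the same grid as binary masks (1 = shrub, 0 = background); the fusion in this work operates on polygons, so the raster form, the helper names and the toy masks are illustrative assumptions.

```python
def intersect_masks(obia_mask, cnn_mask):
    """Pixel-wise intersection (logical AND) of two equally sized binary masks."""
    if len(obia_mask) != len(cnn_mask) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(obia_mask, cnn_mask)
    ):
        raise ValueError("masks must have the same shape")
    return [
        [a & b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(obia_mask, cnn_mask)
    ]


def mask_area(mask, pixel_size_m):
    """Area in square metres covered by the pixels set to 1."""
    return sum(sum(row) for row in mask) * pixel_size_m ** 2


# Toy 4 x 4 masks for a single shrub (hypothetical values).
obia = [
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
cnn = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

fused = intersect_masks(obia, cnn)
fused_area = mask_area(fused, pixel_size_m=0.5)
print(fused, fused_area)
```

Only pixels labelled as shrub by both methods survive, which removes the area that either method overestimates on its own. With vector data the same idea would be expressed as a polygon intersection.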

OBIA Segmentation
In total, 9234 segmentations were performed by SPR, 3078 for each image type (i.e., Google Earth, airborne and UAV). OBIA segmentation accuracy using the SPR presented large variability (Table 1), with values of ED2 ranging between 0.05 and 0.28. Segmentation accuracy increased with image spatial resolution: the higher the spatial resolution, the higher the Scale values and the more accurate the segmentation. This was reflected in the best ED2 values, which decreased to 0.14, 0.10 and 0.05 for Google Earth, airborne and UAV images, respectively. The best combinations of segmentation parameters along the different images were (
When we applied the semi-automatic method ESP2 to estimate the optimum value of the Scale parameter, we observed a pattern similar to that described for the SPR, with an increase in accuracy when increasing spatial resolution. The highest value of ED2 was for the Google Earth image segmentation results (ED2 = 0.25), decreasing for the orthoimage from the airborne sensor (ED2 = 0.15) and reaching the minimum (best) value in the UAV image (ED2 = 0.12). However, the results obtained by ESP2 were worse than the results obtained by the SPR method in all the images analysed (Table 1), with the largest differences in the image with the lowest spatial resolution (Google Earth). In the Google Earth images, the best of the three options presented by the ESP2 tool was the hierarchical bottom-up analysis, with acceptable ED2 values, lower than 0.14 (Table 1). For the airborne images, the best option was the same as for the Google Earth images (hierarchical bottom-up). Conversely, the segmentation of the UAV image produced the best ED2 values when applying ESP2 without hierarchy. The computational time for the segmentation of the images was higher with ESP2 than with the SPR approach.
In addition, the computation time of the analysis was also influenced by the number of pixels to analyse: it increased for images of higher spatial resolution (Table 1). All analyses were run on a computer with a Core i7-4790K processor at 4 GHz (Intel, Santa Clara, CA, USA) and 32 GB of RAM.

Detection of Scattered Shrubs
We obtained the best detection results for the models trained and evaluated with UAV images (F1-measure = 0.91) and the models trained with the highest number of epochs and data-augmentation activated (Table 2). The best transfer from a UAV-trained model to a test with another resolution was to the image from the airborne sensor. Nevertheless, the Google Earth test image produced a similar result of F1-measure = 0.90. We consider that a model trained with data-augmentation and very high spatial resolution images (0.03 m/pixel) can generalize well to coarser images such as those from Google Earth (0.5 m/pixel). Furthermore, when we trained the models with Google Earth images, we observed that they also generalised well to finer resolutions (F1-measure = 0.90). For this reason, the detection of Z. lotus shrubs might be generalizable across any resolution finer than 1 m/pixel.

Segmentation Accuracy for Detected Shrubs
The best segmentation accuracy was obtained with the models trained and tested with the same source of images, reaching values of ED2 = 0.07 with Google Earth images. However, when the model trained with Google Earth images was tested on a UAV image, the ED2 was 0.08. Moreover, data-augmentation was counterproductive in models trained with airborne images and only lowered ED2 (i.e., improved the results) in models trained with the UAV image. In general, data-augmentation helped to generalise between images but did not yield a considerable increase in precision in models trained and tested with the same image resolution (Table 3 and Figure 3).

Fusion of OBIA and Mask R-CNN
Our results showed that the fusion between OBIA and Mask R-CNN methods in very high-resolution RGB images is a powerful tool for mapping scattered shrubs in drylands.
We found that the individual segmentations obtained with OBIA and Mask R-CNN independently were worse than the fusion of both. The accuracy of the fusion of OBIA and Mask R-CNN was higher than the accuracies of the separate segmentations (Table 4), being the most accurate segmentation of all the experiments tested in this work, with an ED2 = 0.038. However, the fusion between results on Google Earth images only improved the ED2 by 0.02. Therefore, the fusion of both segmentation methods provided the best segmentation over the previous methods (OBIA (ED2 = 0.05) and Mask R-CNN (ED2 = 0.07)) for segmenting scattered vegetation in drylands in very high-resolution images. Moreover, by merging the results of both methodologies (OBIA ∩ Mask R-CNN), the accuracy increases, with an ED2 = 0.03. To our knowledge, the effect of combining these two methodologies has not been studied to date, and it might be vital for improving future segmentation methods. As can be seen in the conceptual framework (Figure 1), it is reasonable to think that the higher the resolution and, therefore, the greater the detail at the edges of vegetation represented in the images, the more the fusion will improve the final precision of the segmentation. Nevertheless, in images with lower resolution, the fusion improved the accuracy but to a lesser degree.
The spatial resolution of the images affected the accuracy of the segmentation, although all segmentation methods provided outstanding results at all spatial resolutions. In agreement with [57], we observed that the spatial resolution and the Scale parameter played a key role during the segmentation process and controlled the accuracy of the final segmentations. Among the non-fusion segmentation methods (OBIA or Mask R-CNN), the highest segmentation accuracy was obtained with OBIA on the UAV image (ED2 = 0.05). However, when the object to be segmented is larger than the pixel size of the image, the spatial resolution of the image is of secondary importance [37,57,76,77]. For this reason, as the scattered vegetation in this area has a mean size of 100 m² [5], corresponding to 400 pixels of the Google Earth image, only slight increases in segmentation accuracy were observed as the spatial resolution increased. Moreover, the overestimation of the area of each shrub was not significant as the image spatial resolution increased. Therefore, Google Earth images could be used to map scattered vegetation in drylands if the plants to be mapped are larger than the pixel size. This result opens a wide range of new opportunities for vegetation mapping in remote areas where UAV or airborne image acquisition is difficult or where acquiring commercial very high-resolution imagery is very expensive. These results are promising and highlight the usefulness of freely available Google Earth images for mapping large shrubs, with only a negligible decrease in segmentation accuracy when compared with commercial UAV or airborne images. However, the segmentation of vegetation could be improved by using the near-infrared (NIR) band, since vegetation stands out in this range of the spectrum (e.g., 750 to 2500 nm), or by using vegetation indices such as the normalized difference vegetation index (NDVI) or the enhanced vegetation index (EVI).
Finally, very high spatial resolution UAV images need much more computational time and are expensive and not always possible to obtain at larger scales in remote areas, hampering their use.
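The relation between shrub size, pixel size and pixel count used in this discussion is simple arithmetic. The sketch below reproduces it for the mean shrub size of 100 m² and the two pixel sizes quoted in the text (0.5 m/pixel for Google Earth and 0.03 m/pixel for the UAV image); the helper name `pixels_per_object` is illustrative.

```python
def pixels_per_object(object_area_m2, pixel_size_m):
    """Number of pixels needed to cover an object of the given area."""
    return object_area_m2 / pixel_size_m ** 2


mean_shrub_area_m2 = 100.0

google_earth_pixels = pixels_per_object(mean_shrub_area_m2, 0.5)
uav_pixels = pixels_per_object(mean_shrub_area_m2, 0.03)

print(f"Google Earth: {google_earth_pixels:.0f} pixels per shrub")
print(f"UAV: {uav_pixels:.0f} pixels per shrub")
```

A mean-sized shrub already spans several hundred Google Earth pixels, which is consistent with the small accuracy gain at finer resolutions, while the UAV image represents the same shrub with roughly two to three orders of magnitude more pixels, in line with its higher computational cost.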

Conclusions
Our results showed that both OBIA and Mask R-CNN methods are powerful tools for mapping scattered vegetation in drylands. However, both methods were affected by the spatial resolution of the orthoimages utilized. We have shown for the first time that the fusion of the results from these methods further increases the precision of the segmentation. This methodology should be tested on other types of vegetation or objects to prove that it is fully effective. We propose an approach that offers a new way of fusing these methodologies to increase accuracy in the segmentation of scattered shrubs, which should be evaluated on other objects in very high-resolution and hyperspectral images.
Using images with very high spatial resolution could provide the required precision to further develop methodologies to evaluate the spatial distribution of shrubs and the dynamics of plant populations in global drylands, especially when utilizing free-to-use images, like the ones obtained from Google Earth. Such evaluation is of particular importance in drylands of developing countries, which are particularly sensitive to anthropogenic and climatic disturbances and may not have enough resources to acquire airborne or UAV imagery. For these reasons, future methodologies such as the one presented in this work should focus on using freely available datasets.
In this context, the fusion of OBIA and Mask R-CNN could be extended to a larger number of classes of shrub and tree species or improved with the inclusion of more spectral and temporal information. Furthermore, this approach could improve the segmentation and monitoring of the crowns of trees and arborescent shrubs in general, which are of particular importance for biodiversity conservation and for reducing uncertainties in carbon storage worldwide [78]. Recently, scattered trees have been identified as key structures for maintaining ecosystem services provision and high levels of biodiversity [43]. Global initiatives could benefit largely from CNNs, including those recently developed by FAO [79] to provide the forest extent in drylands. The uncertainties in this initiative [80,81] might be reduced by implementing our CNN-based approach to segment trees. Tree and shrub segmentation methods could provide a global characterization of forest ecosystem structures and population abundances as part of the critical biodiversity variables initiative [82,83]. This is particularly relevant for long-lived shrubs, where the precision of the segmentation is key for detecting disturbances (e.g., pests, soil loss or seawater intrusion) [5]. Finally, the monitoring of persistent vegetation with minimal cover changes over decades could benefit from fusion approaches in the segmentation methods proposed.

Data Availability Statement: All drone and airborne orthomosaic data, shapefile and code will be made available on request to the corresponding author's email with appropriate justification.