Automatic Detection of Olive Tree Canopies for Groves with Thick Plant Cover on the Ground

Marking tree canopies is an unavoidable step in any study working with high-resolution aerial images taken by a UAV over fruit tree crops, such as olive trees, since the extraction of pixel features from these canopies is the first step in building the models whose predictions are compared with the ground truth obtained by measurements made with other types of sensors. Marking these canopies manually is an arduous and tedious process that is usually replaced by automatic methods, which rarely work well for groves with a thick plant cover on the ground. This paper develops a standard method for the detection of olive tree canopies from high-resolution aerial images taken by a multispectral camera, regardless of the plant cover density between canopies. The method is based on the relative spatial information between canopies. The planting pattern used by the grower is computed and extrapolated using Delaunay triangulation in order to fuse this knowledge with that previously obtained from spectral information. It is shown that the minimisation of a certain function provides an optimal fit of the parameters that define the marking of the trees, yielding promising results of 77.5% recall and 70.9% precision.


Introduction
According to the Food and Agriculture Organization of the United Nations (FAO), in the year 2020 there were 12.8 million hectares dedicated to olive trees in the world [1]. Although far from the figures of wheat, the most cultivated crop in the world with 219 million hectares of dedicated land, olive trees stand in the 22nd position out of 161 in the ranking of primary crops. The European Union (EU) represents 40% of the dedicated land with 5.1 million hectares, with Spain accounting for 2.6 million hectares and 51.4% of the European crop area, followed by Italy with 1.1 million hectares and 22.4%. These figures translated into a total world production of 3.3 million tons of olive oil [2], of which 57.9% (1.9 million tons) were produced by the EU. Within the EU, Spain accounted for 1.1 million tons and 58.6% of the European production, with Italy producing 19.1% (0.37 million tons). In turn, the table olive world production was 2.96 million tons, of which 21.9% (0.65 million tons) were produced by Egypt and 26% (0.77 million tons) by the EU. Within the EU, Spain was the largest producer with 0.46 million tons and 58.6% of the European production.
These figures explain the interest in olive oil and table olive production from the grove to the final production stages in the olive oil factories. Plenty of research has been devoted to the improvement of the quality of olive oil by means of controlling the key variables of the production process [3][4][5]. However, since no chemical or biochemical coadjuvants can be employed to improve the olive oil features, as the production process is carried out by mechanical means exclusively, focusing only on the operations in the olive oil factory means that the best that can be achieved is to preserve the quality properties of the olives arriving at the factory. Therefore, current research trends are starting to focus also on the improvement of quality and productivity in the stages where the olives are still on the tree.
Regarding this, there are works that study different watering strategies [6,7], their effect on the organoleptic properties of the produced olive oil [8] or how hydric stress affects olive trees [9,10]. Other works focus on the early detection of pests and diseases, such as Xylella fastidiosa [11][12][13] and Verticillium [14], or the assessment of the nutritional state of the trees [15,16], while others study ways to evaluate parameters such as the volume, height and width of the trees [17,18].
Many of these research works share a common factor, namely the acquisition of information on the Earth's surface using different types of sensors, such as visible-spectrum, multispectral and thermal cameras, LiDAR laser sensors or synthetic-aperture radars, mounted on satellites or unmanned aerial vehicles (UAV). Later, this information is used, in conjunction with novel image processing methods [19], in several tasks such as monitoring the temporal evolution of surface deformations [20] or mapping potential landslide runout zones [21]. This is what is known as remote sensing. When these research approaches are focused on the study of fruit tree groves, such as olive trees, a common necessity is to extract features from the pixels of the images belonging to the tree canopy, since, in order to verify the proposed hypothesis, the experimental results need to be compared with some ground truth obtained from the use of other sensors on the trees.
In order to relate these measurements with the features obtained from the pixels of the tree canopies in the images, it is required to delimit each of these canopies, which is an arduous and tedious process were it to be performed manually [22]. Therefore, most of the time, authors tend to use more or less standard software tools when the simplicity of the images allows it [11,12,23]. Many works employ spectral values of the images to perform this segmentation, which could bias the features extracted afterwards [24,25]. Only a reduced number of articles explicitly address the problem of identifying the tree canopies as their main topic.
Among them, it is worth highlighting [26], in which the authors use the Gram-Schmidt spectral sharpening fusion method to integrate panchromatic and multispectral images and train a convolutional neural network (CNN) that manages to detect oil palm trees with up to 98.65% precision and 98.88% recall using images obtained from the QuickBird satellite. A different approach is applied in [27], where the authors use the red and infrared spectral bands to compute the normalised difference vegetation index (NDVI) and apply classic computer vision methods such as thresholding and blob detection, obtaining a relative error of between 0.2% and 20.7%. The next step to reduce the ground sample distance (GSD) would be to perform the data acquisition using airborne sensors mounted on aircraft. In [28], the authors use principal component analysis to derive a single-band image from the 8 bands the sensor captures. They then apply a two-stage approach with edge detection followed by marker-controlled watershed segmentation on an image of a forest of spruce trees, Douglas firs and subalpine firs. According to the number of labelled trees, a recall of 85.3% is achieved, while the recall computed using the number of labelled pixels yields 75.6%. Ref. [29] employs colour-infrared images of forested regions and applies a fuzzy thresholding algorithm to find the seeds that are used by a region growing algorithm. As a result, 73% of the trees were correctly found, with a variation from 62% to 81%. More examples of the use of CNNs for individual tree canopy detection can be found in [30], in which the data acquired by an RGB sensor are merged with the information collected by a LiDAR. A previous segmentation of the trees of an open woodland of live oak, blue oak and foothill pine is performed using the LiDAR information, so that the CNN can later be trained with each of these regions of interest (ROIs). This model has an average tree canopy recall of 0.69 with a precision of 0.61.
In [31], citrus and other crop trees are detected from UAV images using a simple CNN algorithm, followed by a classification refinement using superpixels derived from a simple linear iterative clustering algorithm, achieving 94.59% precision and 97.94% recall.
When high-resolution images are captured with UAVs, most works, contrary to the methodology followed in this article, extract structured 3D information using LiDAR sensors or photogrammetric analysis. These methods can considerably improve the results obtained, especially when it is necessary to differentiate pixels that are spectrally similar but at different heights from the ground (tree canopies and weeds). Even so, they have some disadvantages that are discussed in Section 4. For example, [32] uses UAV LiDAR data for individual tree detection in subtropical mixed broadleaf forests in urban scenes. This method improves on the popular local maximum filter by removing those local maxima caused by surface irregularities contained in the canopy height model, and obtains an F-score between 73.7% and 93.2%, depending on how irregular the distribution of trees or crown sizes is. In [33], RGB images of a maize plantation at the seedling stage are combined with LiDAR data captured by a UAV. The maize seedlings extracted from the images serve as seeds for the fuzzy C-means clustering algorithm used to segment individual maize plants. The results revealed an accuracy with R² greater than 0.95, a root mean square error (RMSE) of 3.04-4.55 cm and a mean absolute percentage error of 0.91-3.75%. Other examples of articles using photogrammetry to detect and extract trees from high-resolution UAV RGB images are [34] for citrus trees, [35] for peach trees, [36] for chestnut trees and [37] for papaya trees. The first uses sequential thresholding, Canny edge detection and circular Hough transform algorithms on the Digital Surface Model (DSM), obtaining accuracies that exceed 80%. The second uses an adaptive threshold and marker-controlled watershed segmentation on the DSM to measure the canopy width and canopy projection area, achieving an RMSE of 0.08-0.47 m and 3.87-4.96 m², respectively.
The third applies a segmentation stage based on the computation of vegetation indices combined with the canopy height model to extract candidate tree canopies. The next stage receives these tree canopies and divides those that are greater than a threshold. The last stage extracts the desired features. The segmentation accuracy of this method is above 95%. Finally, in the fourth article, the authors improve on the existing scale-space filtering by applying a Lab colour transformation to reduce the overdetection problems associated with the original luminance image. The achieved F-score is larger than 0.94.
As concluded in [38] and evidenced in the previous paragraphs, no algorithm is optimal for all types of images and plant types. For this reason, and for clarity, articles related specifically to olive trees and how to detect their canopies are compared in Section 3. Although these articles are closely related to the issue at hand, none of them explicitly deals with images from groves with thick plant cover on the ground without using three-dimensional data. Finally, to complement this literature review, mention should also be made of articles dealing with detecting weeds in crops of herbaceous plants such as maize [39], sugar beet [40], sunflower and cotton [41,42] or bean and spinach [43].
The objective of this paper is to develop a method to segment olive tree canopies from high-resolution aerial images containing visible-spectrum information (specifically, the red, green and blue bands) that can cope with high levels of plant cover on the ground between canopies.
The key idea of the approach is to employ not just the spectral information contained in the images but to also consider the relative distance between canopies, computing and extrapolating the planting pattern used by the grower and fusing this information with the spectral data. This paper shows that the minimisation of a certain function provides an optimal fit of the parameters that define the marking of the trees, yielding promising results, without the need to resort to deep learning methods that are difficult to interpret.
The structure of the paper is as follows: Section 2 presents a general diagram divided into blocks that explains the workflow followed to achieve the results of this research. This section is made up of subsections that correspond to each of the blocks in the diagram. Although the methodology is explained in detail in all of them, the first block, Section 2.1, also covers the materials used during data capture and the way in which it was carried out. In Section 3, first, the quality metrics of the results obtained after the application of the developed model are objectively presented, and these numbers are analysed together with an explanation of the possible causes of the model's failures. Second, the execution times of the building blocks of the model are computed, both for training and for the predictions of the model. Third, a comparison is made with other articles related specifically to olive trees and how to detect their canopies. Finally, in Section 4, the objective of the article is discussed in greater depth and its usefulness and advantages compared with other methodologies are explained. Possible improvements are also discussed and future work is anticipated.

Materials and Methods
The workflow of the method proposed in this paper is presented in Figure 1, which shows the different materials and methods used, as well as the information transferred between the different parts that compose it. This workflow can be divided into two blocks: the Data Preprocessing (DP) of the images taken in the groves and the Olive Tree Canopy Detection Model (OTCD), which is composed of three submodels, namely, the Vegetation Classification (VC), the Olive Tree Canopy Estimation (OTCE) and the Olive Tree Canopy Marking (OTCM). Figure 1. Workflow of the method proposed in this paper. The arrows represent data transfers, and the nodes represent functions that are applied to those data. As exceptions, the initial node, Multispectral Captures, represents the data acquired by the UAV, and the final node, Olive Tree Canopy Coordinates, represents the best coordinates that the model is capable of predicting. Blocks with dashed outlines represent tasks that are performed beforehand.

The first block DP deals with the transformation and separation of the original raw data captured by the sensors to the format required for the input of the model, specifically: multispectral images, metadata associated with these multispectral images and pixels that are labelled to their corresponding class.
This preprocessing task, together with the training of the first two submodels of the OTCD block, is performed beforehand; in Figure 1 these tasks are shown with dashed outlines. This way, the time required to perform them does not influence the time required to detect the tree canopies of a new image.
The second block OTCD deals with the prediction of the coordinates of each of the tree canopies included in the new images provided as inputs.

Multispectral Captures
The data employed in this work were gathered by a five-band multispectral camera, together with its downwelling light sensor (DLS) and a GPS, mounted on a DJI Matrice 600 UAV. These sensors are complemented with a calibrated reflectance panel (CRP), Micasense RP04-1826404-SC, of which images are taken before and after each flight of the UAV in order to be used in the radiometric correction process. The calibrated reflectance values for the specific panel used in this work are 49.1%, 49.3%, 49.4%, 49.3% and 49.4% for the blue, green, red, near-infrared and red-edge bands, respectively.
Each multispectral capture stores the images of each of the five bands in a disk in .tif format, together with the metadata provided by the DLS and the GPS.
As commented in the Introduction, the main objective of this paper is to detect olive tree canopies when there is a large amount of plant cover on the ground, where traditional segmentation methods fail to provide good results. Accordingly, a set of 18 captures (Table 1) with a large amount of plant cover on the ground was taken from an olive grove in the town of Diezma, province of Granada, in Andalucia, Southern Spain. This is an olive grove of the Picual cultivar covering 29.14 ha, with a traditional arrangement of the trees. Six flight campaigns were carried out from October 2018 to November 2019. Figure 2 depicts an example of a capture taken by the multispectral camera, with an image for each of the five bands previously mentioned. A thick plant cover can be seen between the olive trees. All these captures, together with their metadata, were used as inputs for the Radiometric Correction and Band Alignment (RCBA), the manual labelling and the first two submodels of the olive tree canopy detection model.

Radiometric Correction and Band Alignment (RCBA)
Once the multispectral captures are available as inputs to the RCBA, the first step is to transform the digital numbers (DN) of the images obtained by the camera sensor first to radiance values and then to reflectance values. This correction, which is carried out for each band, allows the values subsequently employed to be independent of the flight, date, time of day and climatic conditions, so that different captures can be compared in the same reference frame. Subsequently, before finishing the RCBA, the reflectance images of each band need to be aligned, since the five sensors of the camera have a slight offset from each other. The computation of the metadata obtained from the RCBA is computationally expensive, which is why it is performed beforehand. The results are not applied directly to the multispectral captures but during the training and prediction phases of the models, since applying these values is almost instantaneous. This advocates for storing the multispectral captures together with these metadata, since it eliminates the need to store all the corrected and aligned images.
For clarity, in the following discussions the dependency of the equations with the wavelength is omitted, but it must be noted that each computation needs to be carried out for each of the five spectral bands captured by the camera.

Raw Images to Radiance Images Conversion
The conversion of the DN into radiance values, L_r, is carried out by applying Equation (1), as recommended by the camera manufacturer, to each pixel (x, y), for each of the 5 wavelengths, λ.
In this equation, DN are the raw digital numbers obtained by the camera sensor and DN_BL is the average of the raw values of all the covered pixels of the sensor, whose objective is to measure the small amount of signal captured independently of the incident light. The difference between these two values is normalised by dividing by DN_MAX, the maximum achievable digital number, equal to 2 raised to the bit depth of the stored images, minus 1. Although the camera captures 12-bit images, they are stored in 16-bit .tif format, so the value of DN_MAX is 2¹⁶ − 1. The terms a_1, g and t_e refer to the first radiometric calibration coefficient, the gain and the exposure time of the camera, respectively.
The correction of the decrease in the light captured by the sensor from its top to its bottom, R(y), is given by Equation (2), where a_2 and a_3 are the last two radiometric calibration coefficients. The vignetting correction function, V(x, y), is obtained according to Equations (3)-(5), where r is the distance of each pixel to the centre of the vignette (c_x, c_y) and k_0 to k_5 are the polynomial correction factors. The methods used to derive Equations (1)-(5) have not been shared by the manufacturer Micasense, but the required values can be extracted from the metadata tags embedded in the images. They are shown in Table 2.
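The bodies of Equations (1)-(5) were lost in this version of the text. A sketch consistent with the symbols defined above, following the calibration model published in Micasense's image processing documentation (not necessarily the exact expressions used in the paper), is:

```latex
\begin{align}
L_r(x, y) &= V(x, y)\, R(y)\, \frac{a_1}{g}\, \frac{DN - DN_{BL}}{DN_{MAX}\, t_e} && (1)\\
R(y) &= \frac{1}{1 + a_2\, y / t_e - a_3\, y} && (2)\\
V(x, y) &= \frac{1}{k(r)} && (3)\\
k(r) &= 1 + k_0\, r + k_1\, r^2 + k_2\, r^3 + k_3\, r^4 + k_4\, r^5 + k_5\, r^6 && (4)\\
r &= \sqrt{(x - c_x)^2 + (y - c_y)^2} && (5)
\end{align}
```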

Radiance Images to Reflectance Images Conversion
The reflectance is defined as the ratio between the reflected and incoming radiant fluxes. Due to energy conservation considerations, this ratio should always lie between 0 and 1. The most general formulation is given using the bidirectional reflectance distribution function (BRDF), denoted by f_r [44], as shown in Equation (6), where the subindex i refers to incoming magnitudes and r to reflected magnitudes. The geometric parameters ω, θ, φ and Ω are the solid angles, zenith angles, azimuth angles and projected solid angles, respectively.
The BRDF describes the dispersion of a ray of light incident on a surface from an incoming direction towards an outgoing direction, and it is considered an intrinsic property of the surface. Its definition is given in Equation (7), where dL_r is the infinitesimal reflected radiance and dE_i is the infinitesimal incoming irradiance. Given its infinitesimal nature, the BRDF is only useful for conceptually understanding other related magnitudes, but it cannot be measured directly, since individual light rays do not carry any radiant flux. What is measured by the sensors in the multispectral camera and the DLS is the hemispherical-conical reflectance, which is used to obtain the hemispherical-conical reflectance factor (HCRF). The incoming irradiance is hemispherical since, besides the direct solar component, it also considers diffuse components coming from every direction. The reflected radiance is conical due to the fact that each pixel of the camera fills a certain solid angle equal to the instantaneous field of view (IFOV). This IFOV is so small (0.03°) that the measurement of the reflected radiance can be considered directional rather than conical, so the HCRF can be approximated by the hemispherical-directional reflectance factor (HDRF). The HDRF is obtained as the ratio between the radiant flux reflected by the surface and the radiant flux that an ideal lossless and perfectly diffuse (Lambertian) surface would reflect under the same geometric and illumination conditions, since the latter would be equal to the flux incoming to this type of surface. This parameter can take values between 0 and +∞, although it usually lies below 1 if the reflections are not specular or close to being so.
Since the radiance reflected by an ideal diffuse surface equals its incoming irradiance multiplied by 1/π, and the measurements made by the camera are already integrated over the whole solid angle corresponding to each pixel, as are the measurements of the DLS over the whole upper hemisphere, they can be used to compute the HDRF as shown in Equation (10). Here, L_r is the radiance reflected by the ground and vegetation surfaces, which is equal to the radiance measured by each pixel of the multispectral camera after it is corrected using Equation (1), and E_i is the irradiance incoming to these same surfaces. This irradiance is divided into the direct irradiance perpendicular to the surface, E_dir, and the diffuse irradiance coming from every direction, E_dif. The former is computed using Equation (11), where E_s is the same direct irradiance but measured in the direction of the sun, θ_i is the angle between the sun and the horizon, α is the angle between the sun and the perpendicular to the irradiance sensor, E_dls is the raw measurement of the irradiance sensor, c_f is the coefficient applied to account for the reflected radiance that the sensor cannot measure due to the Fresnel effect (the manufacturer provides a value of 0.9057) and w_dif is the fraction of diffuse radiation with respect to the total radiation (it has a value of 0.167 when the sky is clear and the sun is at its zenith). The computation of the latter is carried out using Equation (12).
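The equation bodies are missing from this version of the text. Equations (7) and (10) follow directly from the definitions above; the forms shown for Equations (11) and (12) are a plausible reconstruction from the variable descriptions in the text, not the paper's exact formulation:

```latex
\begin{align}
f_r &= \frac{dL_r}{dE_i} && (7)\\
\mathrm{HDRF} &= \frac{L_r}{E_i/\pi} = \frac{\pi\, L_r}{E_i} && (10)\\
E_i &= E_{dir} + E_{dif}, \qquad E_{dir} = E_s \sin\theta_i && (11)\\
E_{dif} &= \frac{w_{dif}}{1 - w_{dif}}\, E_{dir} && (12)
\end{align}
```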
The captures taken of the CRP before and after each flight also need to be taken into account. These captures, made on the ground, are used to compute a correction factor for each aerial capture, f_CRP, which is applied to its corresponding incoming irradiance, E_i. This factor is computed by assigning to each capture of each flight interpolated irradiance and radiance values (using the timestamps) between the values measured in the preceding and subsequent CRP captures, and then applying Equation (13).
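Equation (13) is not reproduced in this version of the text. Given that the panel's calibrated reflectance ρ_CRP is known, a plausible form is the ratio between the HDRF measured over the panel and its calibrated value:

```latex
f_{CRP} = \frac{\pi\, L_{CRP}}{\rho_{CRP}\, E_i} \qquad (13)
```

so that correcting E_i by this factor anchors the aerial reflectances to the panel's certified values; this is a reconstruction from context, not the paper's stated expression.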
Therefore, in this paper, the terms reflectance and reflectance image refer to the computed values of the corrected HDRF.

Band Alignment
Since the Micasense camera is composed of five image sensors with their corresponding optics, independently mounted at some distance from each other, it is necessary to align the images captured by each of them. This procedure links the reflectance value of each pixel in the green band with the corresponding values of the pixels that occupy the same spatial coordinate in the rest of the bands.
To perform this alignment, the camera library provides a method called align_capture, based on the Enhanced Correlation Coefficient (ECC) maximisation algorithm [45]. The results of this alignment are part of the generated metadata.
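The ECC algorithm itself is provided by the camera library, but the idea of intensity-based band alignment can be illustrated with a minimal NumPy sketch that recovers a pure translation between two bands by maximising the normalised cross-correlation (a simplification: the real method also estimates rotation and scale):

```python
import numpy as np

def estimate_translation(reference, target, max_shift=5):
    """Estimate the integer (dy, dx) shift that best aligns `target` to
    `reference` by maximising the normalised cross-correlation over a
    small search window of +/- max_shift pixels."""
    a = reference - reference.mean()
    best, best_score = (0, 0), -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(target, (dy, dx), axis=(0, 1))
            b = shifted - shifted.mean()
            score = float((a * b).sum() /
                          (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
            if score > best_score:
                best_score, best = score, (dy, dx)
    return best
```

Applying the estimated shift to each band would then place all five reflectance images in the green band's pixel grid.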

Manual Labelling
There is a large variety of software with different types of licenses that can be used for the manual labelling of the images. These are typically web or desktop applications that allow a label to be assigned to each image or to certain zones within it, usually delimited using rectangles, polygons or more complex methods that require computer vision or machine learning algorithms. In any case, the labels always need to be supervised by a human if they are to be used as references for the evaluation of other algorithms.
For the manual labelling of the images in this work, the software needs to be able to work with multispectral images and, since there are no applications on the market that handle this feature [46], a specific application written in Python using the library Matplotlib was developed (Figure 3). This application receives as inputs the aerial multispectral captures of the olive groves, together with the metadata computed by the RCBA, in order to present already corrected and aligned images to the user. Each delimited region of the image can be assigned one of four possible labels (olive tree, ground, weed or shadow).
Although each labelled pixel is a vector of six components (five band values and the corresponding class), only the red, green and blue bands have been taken into account for the rest of the processes. The reason is that the results obtained using just these three bands are very promising, and this enables the method to be used with simple visible-spectrum cameras as well. Additionally, these three reflectance values are transformed to the CIELAB colour space, so that the colour values are more perceptually linear, meaning that if the human eye perceives two colours as similar, their coordinates in the CIELAB colour space are also close together. Conversely, if two colours are perceived by the human eye as different, their coordinates are far apart. This conversion was carried out using 32-bit floats to take into account that the reflectance takes values between 0 and +∞.
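The conversion above can be sketched in NumPy as follows. The paper does not state which RGB primaries were assumed, so this sketch uses the linear sRGB primaries and the D65 white point, which is the usual choice:

```python
import numpy as np

# Linear RGB -> XYZ matrix (sRGB primaries, D65 white point) -- an
# assumption, since the paper does not specify the primaries used.
M = np.array([[0.4124564, 0.3575761, 0.1804375],
              [0.2126729, 0.7151522, 0.0721750],
              [0.0193339, 0.1191920, 0.9503041]], dtype=np.float32)
WHITE = np.array([0.95047, 1.0, 1.08883], dtype=np.float32)  # D65 white

def rgb_reflectance_to_lab(rgb):
    """Convert an (..., 3) array of linear RGB reflectances to CIELAB.
    float32 is used, as in the paper, because reflectance may exceed 1."""
    rgb = np.asarray(rgb, dtype=np.float32)
    xyz = rgb @ M.T / WHITE  # normalise by the reference white
    eps = (6 / 29) ** 3
    kappa = (29 / 6) ** 2 / 3  # CIE constants for the piecewise f(t)
    f = np.where(xyz > eps, np.cbrt(xyz), kappa * xyz + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)
```

A perfectly white diffuse reflectance (1, 1, 1) maps to L ≈ 100 with a ≈ b ≈ 0, as expected.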

Vegetation Classification (VC)
To be able to detect the olive canopies, the first step is to classify each of the pixels of the images to determine which of those correspond to the olive class or, at least, which of those have a high probability of belonging to that class.
During the manual labelling process, only positive cases of pixels are labelled for each class; that is, there is no specific label for pixels not belonging to any of the classes and, furthermore, not every pixel in the image is labelled. Consequently, any classification algorithm could only discern between these four classes. During the prediction stage, if the algorithm were given all the pixels in an image, it would be forced to assign one of the four possible classes even to pixels that belong to none of them, generating a large number of false positives.
A possible solution is to include all the nonlabelled pixels in a fifth class called other, but this approach has two problems that prevented its use. The first is that, due to the tedious nature of the manual labelling process, a label is assigned only to pixels that, according to the criteria of the person performing this manual operation, almost certainly belong to a class. This reduces fatigue during the process but causes the class other to be filled with many pixels that actually belong to one of the other classes, potentially confusing the algorithm enough to impair proper performance. The second problem is that the pixels classified as other are very heterogeneous due to their multiple origins (roads, cars, rocks, buildings, other types of vegetation, etc.), and thus have very disperse values in the feature space, complicating the computation of classification boundaries.
Because of the above reasons, the workflow includes the VC block, which represents an initial filtering of the pixels to select only those marked as vegetation (olive tree or weed). The task of discerning between these two classes, which is much more difficult, is left for a later stage.
This vegetation classification is implemented using a one-class classification (OCC) algorithm known as the local outlier factor [47], an unsupervised method that detects atypical values by computing the deviation of the local density of each pixel with respect to its neighbours. In this case, the algorithm is used to predict whether each new pixel belongs to the vegetation class or not, since none of the pixels received during training are atypical.
In addition to the labelled pixels, the algorithm requires the definition of two parameters whose exact values are not too relevant for the posterior stages. The contamination value has been set to its maximum possible value (0.5) in order to ensure that each pixel labelled as vegetation is effectively vegetation, although this means that some pixels that could be classified as such are lost. In turn, the number of neighbours evaluated for each pixel is fixed at a value (100) high enough that the subsequent steps perform adequately without excessively increasing the prediction time.
The output of this process is a vegetation mask (Figure 4) for each capture that represents the zone that, with high probability, is selected as olives or weed in the subsequent stages.
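The classification described above can be sketched with scikit-learn's LocalOutlierFactor; the pixel values below are toy stand-ins for the labelled vegetation pixels, while the two parameter values are those reported in the text:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Toy stand-in for the CIELAB values of pixels manually labelled as
# vegetation (the real model is trained on the labelled pixels).
vegetation_pixels = rng.normal(loc=[45.0, -30.0, 25.0], scale=2.0,
                               size=(300, 3))

# novelty=True enables prediction on unseen pixels; contamination=0.5 and
# n_neighbors=100 are the values given in the text.
lof = LocalOutlierFactor(n_neighbors=100, contamination=0.5, novelty=True)
lof.fit(vegetation_pixels)

# predict() returns +1 for pixels accepted as vegetation and -1 otherwise;
# thresholding these predictions over a whole image yields the vegetation mask.
predictions = lof.predict([[45.0, -30.0, 25.0], [90.0, 5.0, -40.0]])
```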

Olive Tree Canopy Estimation (OTCE)
The OTCE is based on a decision tree trained with the pixels labelled as olive tree and weed. At prediction time, this tree receives as inputs only the pixels selected by the vegetation mask, in order to classify them as olive tree or weed. However, the results obtained for the olive tree class were not acceptable, nor did they improve when different classification algorithms were evaluated. This led to the conclusion that the spectral information of each pixel alone is not enough to obtain good results, so techniques that take into account the spatial distribution of the pixels in the image were explored.
Nonetheless, this decision tree provides useful information: the probability image (Figure 5), an image where each pixel of the vegetation mask receives a value between 0 and 1 according to the probability of it belonging to the olive tree class. This probability is computed by dividing the number of pixels classified as olive tree by the total number of pixels inside each leaf of the classification tree.
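This per-leaf probability is exactly what scikit-learn's predict_proba returns for an unweighted decision tree, as the following sketch with toy olive/weed pixel values illustrates:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy CIELAB-like features for pixels labelled olive tree (1) and weed (0).
olive = rng.normal([40.0, -35.0, 20.0], 3.0, size=(200, 3))
weed = rng.normal([55.0, -25.0, 35.0], 3.0, size=(200, 3))
X = np.vstack([olive, weed])
y = np.array([1] * 200 + [0] * 200)

# For each pixel, predict_proba returns the fraction of olive-tree training
# samples in the leaf it falls into -- the per-leaf probability described
# in the text.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
vegetation_pixels = np.vstack([olive[:5], weed[:5]])
olive_probability = tree.predict_proba(vegetation_pixels)[:, 1]
```

Evaluating this over every pixel of the vegetation mask produces the probability image.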

Olive Tree Canopy Marking (OTCM)
This block is the most important of the whole canopy detection algorithm and is composed of several connected steps, such that the output of one step is the input to the next. The clipping and normalisation step receives as input the probability image which, basically, is a single-channel image where the pixel values represent the probability of belonging to the olive tree class. In this step, all pixels with values below a threshold parameter are assigned 0, and the rest of the values are rescaled to use the full bit depth of the range. This step, therefore, removes those pixels whose probability of belonging to the olive tree class is below the specified threshold, which is a parameter that needs to be optimised automatically for each image.
The noise-filtering step receives an image where all nonzero pixels have a very high probability of belonging to the olive tree class but still keep relative probability information between them (Figure 6a), and applies a simple median filter with the smallest possible kernel (3 × 3) in order to remove isolated pixels that are most likely noise (Figure 6b). The next step computes, for each pixel, the density of the probability values; that is, each pixel is assigned the mean of the probability values of the neighbouring pixels and itself, using a mean filter with a circular kernel. This step homogenises towards high values the probabilities of zones with clusters of high-probability pixels, and towards low values the areas containing isolated pixels with low probability. The result is an image where the canopies are homogeneously highlighted and isolated areas have been removed (Figure 6c). The parameter optimisation phase carefully selects the kernel size for this step.
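The three steps above can be sketched with NumPy and SciPy as follows; the parameter values p and k shown are arbitrary examples, since in the real method they are chosen by the optimisation described next:

```python
import numpy as np
from scipy import ndimage

def highlight_canopies(prob, p=0.4, k=15):
    """Sketch of the clipping/normalisation, noise-filtering and density
    steps (p is the probability threshold, k the circular kernel size)."""
    # 1. Clipping and normalisation: zero out low-probability pixels and
    #    rescale the rest to the full [0, 255] range.
    clipped = np.where(prob >= p, prob, 0.0)
    if clipped.max() > 0:
        clipped = clipped / clipped.max() * 255.0
    # 2. Noise filtering: a 3x3 median filter removes isolated pixels.
    filtered = ndimage.median_filter(clipped, size=3)
    # 3. Density: mean filter with a circular (disk-shaped) kernel of
    #    diameter k, which homogenises clustered high-probability zones.
    yy, xx = np.mgrid[:k, :k]
    c = (k - 1) / 2
    disk = ((yy - c) ** 2 + (xx - c) ** 2 <= c ** 2).astype(float)
    return ndimage.convolve(filtered, disk / disk.sum(), mode="constant")
```

A single isolated high-probability pixel is removed entirely by the median filter, while a compact cluster survives and is smoothed into a homogeneous blob.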
The last parameter that has to be optimised in the OTCM block is the segmentation threshold, that is, the threshold used to binarise the image during the segmentation step (Figure 6d). All that is left is to search for the contours of this binary image and extract their centroids (Figure 6e). This sequence of steps achieves the objective of obtaining the correct coordinates of the olive canopies in each image but, in order to obtain good results, the three parameters mentioned above need to be chosen appropriately, as they strongly influence the final result and their optimum values vary notably between captures. This is the reason for adding the parameter optimisation block, whose task is to search for the most adequate values of the probability threshold p, the kernel size k and the segmentation threshold s for each probability image it receives as input. To do this, a brute force method is applied, in which the steps presented above are performed repeatedly for each parameter combination taken from a regular discretisation of the respective domains. For this work, the set P, considered for the probability threshold, contains 19 values from 0.05 to 0.95 in increments of 0.05. The set K, related to the kernel size, contains 6 values from 5 to 30 in increments of 5; finally, the set S contains 50 values for the segmentation threshold, from 5 to 250 in increments of 5. These sets yield a total of 19 × 6 × 50 = 5700 iterations. In each of these iterations, the Delaunay triangulation (Figure 7) is computed using the coordinates (x, y) of the olive canopies found, together with the variation coefficient (C_v) of the set of all the triangle side lengths {l_1, l_2, . . .
, l_n}. From this set, the sides on the edge of the triangulation (those belonging to just one triangle) are removed whenever, for their adjacent triangle, the ratio between the radius of the inscribed circumference and that of the circumscribed circumference is below 0.1. So as not to leave holes in the triangulation, this procedure is performed recursively until no remaining side fails the condition. The procedure can be defined in each capture by means of the discrete function F whose value, for each combination of p, k and s, is the variation coefficient of the side lengths,

F(p, k, s) = C_v = σ/μ, with l_n = √((x_i − x_j)² + (y_i − y_j)²),

where σ and μ are the standard deviation and the mean of the set {l_1, l_2, . . . , l_n}; here, i and j refer to the indexes of the two ends of each side of the Delaunay triangulation and n to the side itself. A graphical representation in R³ of this discrete function can be visualised if any of the three parameters p, k or s is fixed; Figure 8a shows one such representation. In it, as in all the images considered in this work, the global minimum can occur at iterations whose number of detected canopies is implausible and, in order to solve this issue, we consider only the values of C_v whose number of detected canopies lies within ±40% of the median number of detected trees over all iterations. That is, a new function G that returns the number of detected canopies is defined in Equation (18),

G(p, k, s) = number of detected canopies, (18)

along with the median M of the sorted set of values {t_1, t_2, . . . , t_c} computed by G in Equation (19),

M = t_((c+1)/2) if c is odd, M = (t_(c/2) + t_(c/2+1))/2 if c is even, (19)

and both are used to redefine the function F in Equation (20), keeping C_v only where the count constraint holds:

F(p, k, s) = C_v if |G(p, k, s) − M| ≤ 0.4 M, undefined otherwise. (20)

The last step reduces to an optimisation problem on the redefined function F: finding the values of p, k and s that minimise it in Equation (21).
(p_min, k_min, s_min) = arg min_{p,k,s} (C_v) (21)

The graphical representation of the redefined function F is depicted in Figure 9, with the minimum value of the variation coefficient marked. Many of the values present in Figure 8 have been removed as a result of this redefinition. This methodology, besides using the spectral information of the pixels to detect the tree canopies, incorporates spatial variables related to pattern detection in images: the optimum iteration is the one whose distribution of canopies shows the least variability, that is, the one where the detected canopies are distributed as evenly as possible throughout the image. Additionally, the variation coefficient is used instead of the standard deviation in order to be able to compare captures with different ground sample distances or different plantation distances.
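The core of this optimisation can be sketched as follows with scipy.spatial. The sketch computes C_v over the unique Delaunay edge lengths (the sliver-edge pruning described above is omitted for brevity) and applies the ±40% median-count filter; function names and test data are illustrative.

```python
import numpy as np
from scipy.spatial import Delaunay

def side_length_cv(points):
    # C_v = sigma/mu of the unique Delaunay edge lengths of the detected
    # canopy coordinates (edge-triangle pruning is omitted in this sketch).
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:
        for a, b in ((0, 1), (1, 2), (0, 2)):
            edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))
    lengths = np.array([np.linalg.norm(points[i] - points[j])
                        for i, j in edges])
    return lengths.std() / lengths.mean()

def filter_by_count(cvs, counts):
    # Keep only iterations whose detected-canopy count lies within +/-40%
    # of the median count, mirroring the redefinition of F in Eq. (20).
    med = np.median(counts)
    return [cv for cv, t in zip(cvs, counts) if abs(t - med) <= 0.4 * med]

# A regular grid of "canopies" yields a low C_v ...
grid = np.array([(x, y) for x in range(5) for y in range(5)], dtype=float)
cv_regular = side_length_cv(grid)

# ... while jittered points yield a noticeably higher one.
rng = np.random.default_rng(0)
cv_noisy = side_length_cv(grid + rng.normal(0.0, 0.3, grid.shape))
```

In the full brute-force search, side_length_cv would be evaluated for each of the 19 × 6 × 50 = 5700 combinations of (p, k, s), and the minimum over the count-filtered values would give (p_min, k_min, s_min).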
This idea is key to the success of the algorithm, especially for aerial images of regular plantations with thick plant cover on the ground. The principle employed is similar to the approach a human uses to manually discern whether a certain area of the image contains a tree canopy or just plant cover: the pattern of the plantation is recognised and extrapolated to contiguous areas. If any extrapolated point falls in the area of interest, then the region is probably a canopy. If, on the other hand, there is no extrapolated point in that zone, then the area is most likely just some other type of vegetation. From the grower's point of view, it makes sense to keep the distances between the olive trees as invariant as possible: once the minimum distance is fixed, there is no reason to increase it, as doing so would reduce plantation density and, consequently, profit.

Results and Discussion
Generally, during the evaluation of the results of a model, the prediction provided for each input datum is compared with the ground truth previously labelled by a person. If the model has a training phase, the input data used during this phase cannot be reused later during the evaluation step. To this end, the input data are separated into a training dataset and an evaluation dataset. In addition, to guarantee that the results obtained are independent of the way in which the two sets are separated, the evaluation is usually carried out through some type of cross-validation.
For the model proposed in this work, it can be seen (Figure 1) that the type of input data differs between the training phase and the prediction phase. In the first case, pixels labelled as olive tree, ground, weed or shadow are used, while the second case only accepts multispectral captures and their metadata as input. As a consequence, the evaluation phase is carried out on complete multispectral captures and not on the pixels that make them up, although the first two blocks (VC and OTCE) extract the pixels from the captures using the metadata. For this reason, the labelled pixels used in training have to be introduced grouped by multispectral capture, in order to guarantee that the groups selected for training are not used later during the evaluation.
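This grouped split can be sketched as follows; the round-robin assignment of captures to folds and the capture IDs are illustrative, not the paper's actual procedure, but they show the key constraint: all labelled pixels of a capture land on the same side of the train/test split.

```python
import numpy as np

def grouped_folds(capture_ids, n_folds=5):
    # Assign each capture (not each pixel) to a fold, so that no capture
    # contributes pixels to both the training and the evaluation sets.
    # The round-robin assignment is an illustrative choice.
    unique = sorted(set(capture_ids))
    fold_of_capture = {c: i % n_folds for i, c in enumerate(unique)}
    return np.array([fold_of_capture[c] for c in capture_ids])

# Hypothetical capture IDs, one entry per labelled pixel group
ids = ["cap1", "cap1", "cap2", "cap3", "cap3", "cap3"]
folds = grouped_folds(ids, n_folds=2)
# all entries of the same capture share a fold index
```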
The first step is to carry out the manual labelling process to obtain the ground truth for the training and prediction phases of blocks VC and OTCE. The labelled regions contain pixels with absolute certainty of belonging to the chosen class, at the cost of possibly leaving unlabelled pixels that still belong to that class (Figure 3). Then, the pixels labelled as olive tree are used as the ground truth in the rest of the model. This ground truth is a binary mask for each multispectral capture, in which the regions belonging to the olive tree canopies are marked (Figure 10). It is obtained by applying the morphological operation of dilation, with a circular kernel of 11 pixels in diameter, to the pixels labelled as olive tree, so as to cover the entire region of the olive tree canopies. The resulting masks are then inspected to ensure that this value is correct for all captures. A false negative is defined as any ground-truth positive for which there is no predicted coordinate within its bounds; conversely, a ground-truth positive matched by a predicted coordinate counts as a true positive, and any predicted coordinate that matches no ground-truth positive counts as a false positive.

Table 3 shows the results of performing a 5-fold cross-validation on the model. It shows, from left to right, the ID of the capture (Table 1), the fold in which each capture is part of the test set, the count of positives in the ground truth, the predicted positives, the true positives, the false negatives and the false positives obtained by the model. Finally, three metrics were evaluated: recall, precision and F1 score. In the last row, the sum of each column is calculated, except for the three metrics, which are computed in the usual way from the previous sums. For a better graphical interpretation of the results, the contours of the olive tree canopies in the ground truth and the marks of the coordinates predicted by the model are combined on single images. Contours and marks that are true positives are coloured green, while contours and marks that are false negatives and false positives, respectively, are coloured red.
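As a minimal sketch of this evaluation, the matching of predicted coordinates against ground-truth regions can be implemented as follows; circular regions stand in for the actual canopy contours, and all names and test data are illustrative.

```python
def evaluate(gt_regions, predictions):
    # Count true positives, false negatives and false positives.
    # gt_regions: list of ((cx, cy), radius) circles standing in for the
    # ground-truth canopy contours (the paper matches against contour
    # bounds); predictions: list of (x, y) predicted centroids.
    def inside(p, region):
        (cx, cy), r = region
        return (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r ** 2

    tp = sum(any(inside(p, g) for p in predictions) for g in gt_regions)
    fn = len(gt_regions) - tp
    fp = sum(not any(inside(p, g) for g in gt_regions) for p in predictions)
    return tp, fn, fp

gt = [((0, 0), 1.0), ((5, 5), 1.0), ((10, 0), 1.0)]
preds = [(0.2, 0.1), (5.1, 4.9), (20, 20)]
tp, fn, fp = evaluate(gt, preds)
recall = tp / (tp + fn)        # fraction of ground-truth canopies found
precision = tp / (tp + fp)     # fraction of predictions that are correct
```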
Figure 11 shows an example of these images for 5 of the 18 captures. Analysing the results table (Table 3), it can be seen that 89% of the captures obtain an F1 score above 0.5, 61% above 0.75, 56% above 0.8 and 28% above 0.9. In 5 of the 18 captures analysed, practically perfect detection is achieved (Figure 11a,d); in six captures the detection is moderately good (Figure 11b); in five captures the canopies are detected acceptably (Figure 11e); and only in two captures do serious detection failures occur (Figure 11c). If the causes of these two failures are analysed in detail, it can be seen that they are not related to the main hypothesis proposed in this article: that the automatic detection of olive tree canopies for groves with thick plant cover is optimised by minimising the coefficient of variation of the lengths of the sides of the Delaunay triangles formed from the coordinates of the canopies predicted by the model. Rather, the causes lie in the first part of the model: the VC, in the first instance, classifying shadow pixels as vegetation, and the OTCE, in the second instance, failing to exclude them, at least as weeds. In fact, Figure 11c shows how each and every one of the olive trees is detected, but their shadows are marked instead of the canopies. This is because, in the probability image, the VC and the OTCE assigned a greater canopy probability to the shadows than to the canopies. From this point on, the rest of the model works well even for images that, due to the time of year they were taken, do not have as much grass on the ground as one would expect.

Table 4 shows the execution times of each model block in the workflow for each capture during the cross-validation, including the VC and OTCE training time, which is the same in each fold. The last row shows the mean of each column. As expected, one of the two most unfavourable times is the training time, with an average of 57 s per fold.
This time must not be taken into account when assessing the performance of the model, since it corresponds to a task that is performed only once. However, the other critical time occurs during the prediction of the OTCM block, with a mean of 57.2 s. Although it may seem long, this time is divided among the 5700 iterations of the brute force method applied in this block; that is, each iteration executes in approximately one hundredth of a second. Furthermore, this time is less of a concern considering that, for validation and representation purposes, the number of points computed for the function F was higher than really needed, and it can be drastically reduced. After these, the next highest time is the prediction time of the VC block, at 32.3 s. This time could only be reduced by finding alternative vegetation classification methods.
Finally, although no articles were found that explicitly deal with the detection or segmentation of tree canopies with thick plant cover on the ground without using three-dimensional data, those that are closely related are shown in Tables 5 and 6. They summarise the article reference, the method used, the publication date, where the data come from (dataset), what type of data are used (channels), how much data are used (no. of images) and what quality metrics are achieved (accuracy, precision, recall, omission error, commission error and estimation error) in each of them. They are sorted by publication date, and the first row corresponds to the method presented in this article.

Conclusions
This article presents quite promising results for the automatic detection of olive tree canopies, even more so taking into account the large amount of plant cover present in some of the captures, which sometimes makes detection complicated even for the human eye. Considering that olive growing is increasingly oriented towards the quality of the resulting oil, and that organic farming takes advantage of the benefits of maintaining a thick plant cover on the ground, the development of this type of model is an imperative need.
The vast majority of studies carried out with aerial images of olive groves and other fruit tree crops are based on the search for correlations between the samples collected at the foot of the tree (ground truth) and the features extracted from the images collected by the different types of sensors on board the UAV. For this purpose, it is always necessary to extract from the complete images the parts that correspond to the ground truth, usually the canopies. This task receives little attention in scientific works and can be arduous and tedious if performed manually. Even when algorithms are developed to automate it, they are often used only as an aid to the human labeller and, in most cases, work only for groves in which the canopies are clearly delimited by soil with very different spectral features.
The model developed in this article is very useful for performing labelling tasks fully automatically but, due to prediction times of nearly 90 s (32.3 + 0.14 + 57.2) per capture on average, it cannot yet be applied in real time; that is, the prediction time is greater than the time that elapses between one capture and the next in the multispectral camera. This time, the capture period (the inverse of the frame rate in frames per second, FPS), was set to 1 s during data collection. Despite this, the proposed method has advantages over alternatives that compute 3D information, whether based on structure from motion and multi-view stereo (photogrammetry) or on other sensors such as LiDAR. The first and most obvious is that it makes predictions from a single capture, while photogrammetry requires a multitude of captures of the same area. This allows cheaper data collection by reducing the overlap of the captures from 80% to almost 0% without compromising accuracy, which in turn also reduces the flight time of the UAV. Another advantage is that more complex and costly sensors, such as LiDAR devices or spectrometers requiring advanced knowledge and set-up, are not used. In fact, only the red, green and blue bands of the multispectral camera were used.
In this line, future work will explore the viability of the model when the input data come exclusively from a camera whose sensor works in the visible spectrum without radiometric correction. Another possible line of research is to assess the performance of the proposed method on other types of groves, such as citrus, peach or chestnut trees, or even on plants with other typologies, such as vines.
The success rate obtained in this study could be improved by evaluating the influence of tree shadows on the VC block, increasing the number of images labelled and used for training, or exploring other one-class classification algorithms that may exclude certain types of data. On the other hand, the number of discrete intervals into which each of the three parameters to be optimised is divided could be reduced, with the aim of decreasing the prediction time of the OTCM block. Furthermore, a more efficient optimisation approach could be applied, such as gradient descent, widely used in the back-propagation algorithm of neural networks.
Finally, the proposed model was developed to perform a single task as well as possible: detecting olive tree canopies, that is, predicting with very high probability the image coordinates at which the pixels belonging to those canopies are found, but without applying segmentation. That last task would have to be optimised in the context of a different model. Furthermore, as future work, the region classification task could be addressed, namely, distinguishing the plantation, as a structured arrangement of trees, from regions that form part of the environment, such as roads, forests, buildings, uninhabited land, etc.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available for confidentiality reasons agreed upon by the project members.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: