This section presents the experimental evaluation of the proposed olive tree segmentation methodologies. The performance comparison was conducted using various metrics on a benchmark dataset. Details regarding the dataset characteristics, model hyperparameter tuning, and the chosen evaluation criteria are also provided.
4.1. Datasets
The primary challenge encountered in the olive tree segmentation task stems from the scarcity of publicly available datasets. Despite some works focusing on olive tree segmentation, there is a notable absence of dedicated public datasets. Even for the broader task of tree segmentation, the existing datasets are limited and exhibit three key characteristics that render them unsuitable for our specific objectives: multiple tree classes, street-level views, and diverse environments. First, existing datasets often include multiple tree species classes, none of which correspond to the olive tree, making them unusable for fine-tuning machine learning models to segment olive trees; our goal is not to differentiate among tree species but rather to distinguish the olive tree from its background [47,48]. Second, many available datasets capture trees from a street-level perspective, resulting in a shape representation vastly different from that observed at the drone level. Given our emphasis on drone image analysis, we specifically require top-down views of the olive tree crowns (i.e., canopies) [48]. Third, several datasets showcase trees in forested or urban environments, which differ significantly from the open-field setting of an olive tree plantation; such environmental disparities can profoundly impact segmentation performance [49].
In addressing these challenges, the project utilized three publicly available datasets obtained from Roboflow [50,51,52], all related to olive tree analysis. All datasets feature images with a resolution of 640 × 640 pixels and are singularly focused on a unified class, "olive trees". As shown in Table 1, the first dataset, "DATASET1" [50], an indicative portion of which is depicted in Figure 4, is the largest among the three and comprises 622 training images, 137 validation images, and 127 test images. On average, each image contains 78.9 boxes (b/image), totaling 52,837 instances of olive trees. The second dataset, "DATASET2" [51], sampled in Figure 5, consists of 155 training images, 50 validation images, and 38 test images. The average number of boxes per image is 82.3, contributing to a total of 14,202 instances of olive trees. The third dataset, "DATASET3" [52], illustrated in Figure 6 and focused on segmentation, is comparatively smaller. It includes 67 training images, 19 validation images, and 9 test images, with an average of 8.2 masks per image. In total, this dataset comprises 681 instances of olive trees. This dataset selection strategy ensures relevance to the objectives, providing diverse and sufficiently challenging data for the development and evaluation of the olive tree segmentation model.
Ground sampling distance (GSD) is a crucial concept in photogrammetry, where it determines the clarity and detail of aerial images. It refers to the distance between the centers of two consecutive pixels as measured on the ground. This metric helps determine an image's spatial resolution; a smaller GSD means higher resolution, allowing for more detailed images, whereas a larger GSD results in lower resolution and less detail visible in the image. The GSD is calculated based on the following equation:

$$\mathrm{GSD} = \frac{H \times S_w}{f \times W_{img}}$$

where $H$ is the flight height, $S_w$ is the width of the camera sensor, $f$ is the focal length of the camera, and $W_{img}$ is the width of the captured image in pixels.
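As a minimal sketch of this relation (assuming the flight height is given in metres, the sensor width and focal length in millimetres, and the result expressed in cm/pixel), the GSD can be computed as follows:

```python
def gsd_cm_per_px(flight_height_m: float, sensor_width_mm: float,
                  focal_length_mm: float, image_width_px: int) -> float:
    """Ground sampling distance in cm/pixel.

    GSD = (H * Sw) / (f * W_img); the factor 100 converts metres to centimetres.
    """
    return (flight_height_m * sensor_width_mm * 100.0) / (focal_length_mm * image_width_px)
```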
Of the openly available datasets examined in this paper, only DATASET3 was accompanied by the appropriate metadata containing the information needed to determine the flight height and calculate the GSD. Moreover, the images in DATASET1 and DATASET2 do not adhere to central projection, and the wide capturing angles increase the perspective distortion; therefore, calculating the GSD accurately for them is infeasible.
Images in DATASET3 were captured using a DJI FC6310R drone camera, and each image includes raw metadata, such as latitude, longitude, GPS altitude, focal length, and camera sensor width. This drone captures images with an image width of 5472 px and an image height of 3648 px. It is noted that this dataset comprises images from two different areas, both located in Morocco. Moreover, a subset of the images retains the original resolution at which the drone camera captured them, while another subset is cropped, likely due to preprocessing performed by the creators of the dataset. However, for both subsets, the original image resolution is used to calculate the GSD values, which are listed in Table 2.
4.5. Experiments on YOLOv8-Seg for Segmentation
Similarly to the first approach, YOLOv8n-seg requires fine-tuning, since the pretrained model does not recognize the class "olive trees" or "trees". To overcome the absence of a dedicated segmentation dataset, a custom segmentation dataset was created using DATASET1 and RepViT-SAM. An extensive evaluation was conducted by an expert, who visually inspected the results and accepted only the correct masks.
In this procedure, RepViT-SAM was applied to all images in the detection dataset, utilizing the corresponding bounding boxes as prompts for segmentation. The resulting segmentation masks, initially in binary form, were transformed into polygons compatible with YOLO. A custom script was employed to execute these transformations and store the outcomes as .txt files for each image. To enhance performance and increase generalization capabilities, we integrated DATASET3 into the custom annotated segmentation dataset.
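The mask-to-polygon conversion can be sketched roughly as follows; the authors' exact script is not published, so the OpenCV-based contour extraction and the single class index are assumptions:

```python
import cv2
import numpy as np

def mask_to_yolo_polygons(binary_mask: np.ndarray, class_id: int = 0) -> list[str]:
    """Convert one binary instance mask (H x W, values 0/255) into YOLO-seg label lines.

    Each line has the form "<class_id> x1 y1 x2 y2 ..." with coordinates normalised to [0, 1].
    """
    h, w = binary_mask.shape[:2]
    # OpenCV 4.x returns (contours, hierarchy).
    contours, _ = cv2.findContours(binary_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    lines = []
    for contour in contours:
        if len(contour) < 3:          # degenerate contour, skip
            continue
        polygon = contour.reshape(-1, 2).astype(float)
        polygon[:, 0] /= w            # normalise x coordinates
        polygon[:, 1] /= h            # normalise y coordinates
        coords = " ".join(f"{v:.6f}" for v in polygon.flatten())
        lines.append(f"{class_id} {coords}")
    return lines

# One .txt file per image, accumulating the lines of all accepted masks:
# with open("image_0001.txt", "w") as f:
#     f.write("\n".join(lines))
```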
Before merging the two datasets, a straightforward data augmentation technique was employed, involving horizontal flips and random rotations, as depicted in Figure 9. This augmentation was applied only to the third dataset, generating augmented images for each original image. The considerable imbalance between the two datasets influenced the decision to apply data augmentation solely to one of them. To achieve a more balanced outcome while maintaining enough images, we up-sampled the smaller dataset to approximately 150 images and retained only 150 images from the larger dataset. The final merged dataset (Merged_seg) comprised 323 training images, with an average of 39.7 masks per image. This strategy aimed to address the inherent dataset imbalance and enhance the model's ability to generalize across diverse scenarios, contributing to the overall robustness of the segmentation model.
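A minimal sketch of the flip-and-rotate augmentation, assuming it is applied to image/mask pairs before the polygon conversion; the rotation range is an assumption, since only "horizontal flips and random rotations" are stated:

```python
import random
import cv2
import numpy as np

def augment_pair(image: np.ndarray, mask: np.ndarray,
                 max_angle: float = 30.0) -> tuple[np.ndarray, np.ndarray]:
    """Horizontal flip and random rotation applied consistently to an image/mask pair."""
    if random.random() < 0.5:
        image, mask = np.fliplr(image).copy(), np.fliplr(mask).copy()

    angle = random.uniform(-max_angle, max_angle)   # assumed rotation range
    h, w = image.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, rot, (w, h), flags=cv2.INTER_LINEAR)
    mask = cv2.warpAffine(mask, rot, (w, h), flags=cv2.INTER_NEAREST)
    return image, mask
```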
YOLOv8n-seg was fine-tuned on this merged custom dataset of 323 training images (39.7 masks per image on average) for up to 100 epochs, with early stopping after 10 epochs without improvement and default hyperparameters. The training stopped after 64 epochs, achieving 0.761 mAP50, as illustrated in Figure 10. The fine-tuned YOLOv8n-seg model was used directly for predicting segmentations on input images, with an inference time of 4.3 ms. Consequently, during drone-based inference, only the YOLOv8n-seg model was utilized, ensuring faster inference times; the results of drone-based inference using YOLOv8-seg are shown in Figure 11.
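For reference, a fine-tuning run like the one described above can be launched with the Ultralytics API roughly as follows; the dataset YAML name is a placeholder, and the patience value follows the early-stopping setting mentioned above:

```python
from ultralytics import YOLO

# Fine-tune the pretrained nano segmentation model on the merged custom dataset.
# "merged_seg.yaml" is a placeholder for the dataset configuration file.
model = YOLO("yolov8n-seg.pt")
model.train(
    data="merged_seg.yaml",  # points to the Merged_seg train/val splits
    epochs=100,              # maximum number of epochs
    patience=10,             # early stopping, as used in the experiments
    imgsz=640,               # the datasets use 640 x 640 images
)
```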
Finally, an optimal confidence threshold of 0.35 was established through experimentation. While the default confidence threshold in YOLOv8n is 0.25, a systematic evaluation was conducted using five distinct thresholds, and the predictions were inspected visually: a threshold of 0.35 offered the best balance between duplicate masks and missed masks and therefore yielded superior performance for the olive tree segmentation task.
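A sketch of such a threshold sweep is shown below; the candidate grid around the 0.25 default and the weights path are assumptions, since the five exact values are not listed here:

```python
from ultralytics import YOLO

model = YOLO("runs/segment/train/weights/best.pt")  # placeholder weights path

# Assumed candidate confidence thresholds around the 0.25 default; the saved
# predictions at each value are inspected visually for duplicate vs. missed masks.
for conf in (0.25, 0.30, 0.35, 0.40, 0.45):
    results = model.predict(source="val_images/", conf=conf, save=True)
```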
4.6. Methods Comparison and Discussion
For each lightweight model of the two proposed pipelines for olive tree segmentation on drone images, an evaluation was conducted both on the merged dataset, consisting of DATASET1 and DATASET2, and on each individual dataset, to better understand how the datasets affect the models' performance and to achieve more robust evaluation results. Overall, the YOLOv8-seg model achieved slightly better results (mAP50 = 0.825, Table 3) compared to the two variants of the SAM-based pipeline (mAP50 = 0.822 for RepViT-SAM, Table 4; mAP50 = 0.796 for EdgeSAM, Table 5). Since YOLOv8-seg relies on only one model, there is no risk of compounded failures, as there is in the SAM-based pipeline, where two models are required and any failure in the automated labeling process may affect the final result. In terms of inference time, YOLOv8-seg again achieved a noticeably better time (3.5 ms, Table 3) compared to the other two models (39.92 ms for RepViT-SAM, Table 4; 40.61 ms for EdgeSAM, Table 5). The simplicity of the YOLOv8-seg pipeline, where only this model is used for inference, accounts for this difference compared to the SAM-based pipeline, which requires two models for inference and whose segmentation model is more complex than YOLOv8-seg.
The results from the evaluation of the three models on each individual dataset, as shown in Table 3, Table 4 and Table 5, indicate that no dataset biased the results to a degree worth mentioning, since the evaluation metrics were well balanced across the different datasets for all models. As for the inference time, a slight increase was observed for DATASET2 compared to DATASET1, which is expected since the average number of bounding boxes per image is larger for DATASET2, as shown in Table 1. The inference time for DATASET3 was, as expected, lower than those of the other datasets, since it has a much smaller average number of bounding boxes per image. These observations imply that the number of instances per image plays a pivotal role in the inference time of the segmentation process. So, assuming a smart agriculture task is to be executed directly on an edge device, such as a UAV, the inference time is a crucial factor for the delivery of such a service. Moreover, as the computing capabilities in this case are limited, the inference time should be kept at most in the order of tens of milliseconds. In addition to the hardware equipment, and based on the results in Table 3, Table 4 and Table 5, capturing images at lower UAV flight heights, so that fewer trees appear per image, may lead to better inference times.
The flight height here can be considered the distance of the camera sensor above the canopy, since the camera sensor resides at the bottom of the UAV. The appropriate height range for having a single instance per image then depends on multiple parameters, which relate to the olive grove design, the age or size of the trees, and the camera sensor itself. Several techniques can be considered for the olive grove design and tree spacing. According to production techniques suggested by the International Olive Council [49], orchard designs for olive trees include squares, offset squares, rectangles, and quincunxes. For precision farming, the square and offset-square designs would be the most prominent choices, since they both provide good access to sunlight, as well as good coverage by precision farming tools due to fewer shadows and obstructions. Other designs, or super-intensive orchards, would be less appropriate for precision farming, as in these cases the trees develop in a hedgerow and complicate detection processes. So, for square or offset-square designs, planting distances of 6 m × 6 m and 7 m × 7 m are a sound yardstick for many Mediterranean olive-growing conditions [49].
In addition, the age of the olive tree, its variety, and the general management practices affect the area covered by the canopy and the tree height. Indicatively, mature trees of the Koroneiki variety, which is widely cultivated for olive oil production, can reach 6–8 m when traditionally cultivated. However, their height is usually controlled through pruning to stay lower, between 4 m and 6 m, so that the trees are easier to manage and mechanical harvesting is facilitated.
Furthermore, the camera characteristics affect the ground resolution achieved at a given flight height. We considered the GSD for investigating appropriate flight heights. Indicatively, the Parrot Sequoia is a camera widely mounted on UAVs for precision agriculture tasks. Its RGB sensor has an image width of 4608 px, a focal length of 4.88 mm, and a sensor width of 6.09 mm. So, for a UAV equipped with this sensor, the recommended flight height would be between 6.5 m and 7.4 m in order to capture, at most, one complete tree instance per image and, thus, optimize the inference time.
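A short check of this flight-height range, using only the Parrot Sequoia figures quoted above, can be sketched as follows; it illustrates the reasoning rather than serving as a flight-planning tool:

```python
# Ground footprint (swath width) covered by one image at a given flight height,
# using the Parrot Sequoia RGB parameters quoted above.
SENSOR_WIDTH_MM = 6.09
FOCAL_LENGTH_MM = 4.88

def footprint_width_m(flight_height_m: float) -> float:
    # footprint = GSD * image_width = H * sensor_width / focal_length
    return flight_height_m * SENSOR_WIDTH_MM / FOCAL_LENGTH_MM

for h in (6.5, 7.0, 7.4):
    print(f"H = {h:.1f} m -> swath ~ {footprint_width_m(h):.1f} m")
# Roughly 8.1 m, 8.7 m, and 9.2 m: wide enough to contain one complete tree
# planted at 6-7 m spacing, i.e., at most one instance per image as discussed above.
```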
The SAM-based approach simplifies the fine-tuning process by directly fine-tuning the YOLOv8-det on the available limited detection data. However, in this approach, the inference process relies on two resource-intensive models, YOLOv8-det for detection and RepViT-SAM/EdgeSAM for segmentation, impacting the inference time and resource requirements. In essence, the YOLOv8-seg approach offers a simpler inference pipeline but requires creating a custom segmentation dataset through a more complex process that leverages RepViT-SAM/EdgeSAM and limited open-source olive tree datasets, since in order to fine-tune the YOLOv8-seg model, a segmentation dataset is required. Consequently, during the inference process of the second approach, only the YOLOv8n-seg model was utilized, ensuring potentially faster inference times and fewer resource requirements.
The confusion matrices in Table 6 present the performance of the three models, YOLOv8-seg, RepViT-SAM, and EdgeSAM, across the datasets, highlighting the nuanced differences in their segmentation capabilities. YOLOv8-seg consistently demonstrates superior performance, reflected by higher true positive (TP) counts and lower false negative (FN) counts across all datasets. For example, on DATASET1, YOLOv8-seg achieved 6825 TPs, 775 false positives (FPs), and 1038 FNs, indicating a high detection rate with relatively few errors. This trend suggests that YOLOv8-seg is particularly adept at accurately identifying olive trees, thereby reducing missed detections.
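For reference, these counts translate directly into per-dataset precision and recall; for YOLOv8-seg on DATASET1:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{6825}{6825 + 775} \approx 0.898, \qquad \text{Recall} = \frac{TP}{TP + FN} = \frac{6825}{6825 + 1038} \approx 0.868.$$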
In contrast, the SAM-based models, RepViT-SAM and EdgeSAM, exhibited higher error rates overall. RepViT-SAM, while achieving 6660 TPs on DATASET1, also incurred 540 FPs and 1203 FNs. Similarly, EdgeSAM recorded 6353 TPs, with 575 FPs and a significant 1510 FNs. These results indicate that the SAM-based models, despite their potential for fine-tuning ease, miss noticeably more olive trees than YOLOv8-seg. This discrepancy can be attributed to the more complex inference pipelines of the SAM-based models, which involve separate detection and segmentation stages, potentially compounding errors at each step.
The variability in performance across different datasets also highlights the importance of dataset characteristics in the model evaluation. The consistent results of YOLOv8-seg across datasets suggest robust generalization capabilities, whereas the higher variance in the SAM-based models’ performances points to a sensitivity to specific dataset features. This sensitivity may necessitate more extensive fine-tuning and optimization for different agricultural contexts. Overall, the comparative analysis underscores the practical advantages of YOLOv8-seg in terms of both accuracy and efficiency, despite its more demanding initial setup. The SAM-based models, while promising, require further refinement to achieve comparable robustness and reliability in diverse real-world scenarios.
Since the YOLOv8-seg model outperformed the other two models, it was chosen for k-fold cross-validation to examine the stability of the model and the uniformity of the dataset. The k-fold cross-validation, conducted over five folds, assessed the performance of the model using precision, recall, mAP50, and mAP50-95. This method ensured a comprehensive evaluation, examining how consistently the model performed across different subsets of the data. As shown in Table 7, precision varied slightly across the folds, with an average of 0.903, suggesting a high level of accuracy in the model's predictions, with few false positives. Recall was generally lower than precision, with an average of 0.858, indicating that the model's ability to find all relevant instances is slightly less consistent than its precision. The mAP50 scores were strong, with an average of 0.8942, which highlights the model's effectiveness at detecting objects at a 50% IoU threshold. mAP50-95, which averages the mAP over IoU thresholds from 50% to 95%, was lower than mAP50, averaging 0.6468. This expected decline at higher IoU thresholds indicates the model's challenges in achieving precise localization under stricter conditions.
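One possible way to run this five-fold protocol is sketched below; the scikit-learn-based fold construction, the directory layout, and the generated dataset YAML are assumptions about tooling, not the authors' exact procedure:

```python
from pathlib import Path
from sklearn.model_selection import KFold
from ultralytics import YOLO

# Placeholder layout: merged_seg/images/*.jpg with YOLO labels in merged_seg/labels/.
images = sorted(Path("merged_seg/images").glob("*.jpg"))
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kfold.split(images)):
    fold_dir = Path(f"folds/fold_{fold}")
    fold_dir.mkdir(parents=True, exist_ok=True)

    # Ultralytics accepts .txt files listing image paths as train/val entries.
    train_list = fold_dir / "train.txt"
    val_list = fold_dir / "val.txt"
    train_list.write_text("\n".join(str(images[i].resolve()) for i in train_idx))
    val_list.write_text("\n".join(str(images[i].resolve()) for i in val_idx))
    (fold_dir / "data.yaml").write_text(
        f"train: {train_list.resolve()}\nval: {val_list.resolve()}\n"
        "names:\n  0: olive-tree\n"
    )

    model = YOLO("yolov8n-seg.pt")
    model.train(data=str(fold_dir / "data.yaml"), epochs=100, patience=10, imgsz=640)
    metrics = model.val()  # per-fold precision, recall, mAP50, mAP50-95
```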
The bar chart in Figure 12 reflects these dynamics, illustrating the variability across folds and metrics. Precision remained relatively stable, whereas recall showed more fluctuation, suggesting that some folds may contain particularly challenging examples. The mAP metrics demonstrate that while the model performs robustly at a basic IoU threshold (50%), its performance becomes more variable as the criteria tighten (up to 95% IoU). The calculated standard deviations, shown in Table 7, also offer insights into the model's consistency across different data scenarios. While precision was notably stable, the variability in recall and the mAP metrics highlights potential challenges in model generalization, especially under more stringent evaluation criteria. Overall, the model demonstrated robust performance with high precision and good overall mAP scores, indicating reliable detection capabilities. However, the variability in recall and mAP50-95 suggests potential areas for improvement, especially in enhancing the model's ability to consistently localize objects precisely across all data subsets.
Overall, this work tackled the data scarcity problem for the task of olive tree segmentation on drone images by proposing two distinct methodologies that do not rely on segmentation datasets but instead leverage limited existing detection datasets in combination with lightweight segmentation models. In the SAM-based approach, YOLOv8-det was fine-tuned on the publicly available detection datasets and then used for inference along with RepViT-SAM/EdgeSAM for the final segmentation. In the YOLO-based approach, YOLOv8-seg was fine-tuned on a custom segmentation dataset created with simple augmentation techniques and with RepViT-SAM/EdgeSAM segmentations of the existing detection datasets. The YOLO-based approach achieved better results both in the metric scores and in inference times.