Essay

An Automated Image Segmentation, Annotation, and Training Framework of Plant Leaves by Joining the SAM and the YOLOv8 Models

1 College of Information Engineering, Tarim University, Alar 843300, China
2 College of Big Data and Information Engineering, Xinjiang University of Technology, Hotan 848011, China
3 Key Laboratory of Tarim Oasis Agriculture of Education Ministry of China, Alar 843300, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(5), 1081; https://doi.org/10.3390/agronomy15051081
Submission received: 17 March 2025 / Revised: 23 April 2025 / Accepted: 27 April 2025 / Published: 29 April 2025
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Recognizing plant leaves in complex agricultural scenes is challenging due to high manual annotation costs and real-time detection demands. Current deep learning methods, such as YOLOv8 and SAM, face trade-offs between annotation efficiency and inference speed. This paper proposes an automated framework integrating SAM for offline semantic segmentation and YOLOv8 for real-time detection. SAM generates pixel-level leaf masks, which are converted to YOLOv8-compatible bounding boxes, eliminating manual labeling. Experiments on three plant species show the framework achieves 87% detection accuracy and 0.03 s per image inference time, reducing annotation labor by 100% compared to traditional methods. The proposed pipeline balances high-quality annotation and lightweight detection, enabling scalable smart agriculture applications.

1. Introduction

Recently, a range of visual object detection frameworks, such as YOLOv8 (You Only Look Once version 8) [1] and Mask R-CNN (v1.0) [2], have been proposed and widely applied in agriculture. However, complex field scenes, influenced by factors such as lighting variation, dust, weeds, and diverse plant phenotypes, pose significant challenges for tasks like leaf segmentation and detection. Moreover, the data annotation required for model training remains labor-intensive and resource-demanding, a limitation that inevitably hinders the development of models for smart agriculture applications.
To address the challenges of high manual annotation costs and the real-time detection requirements in complex agricultural scenes, this study proposes an automated pipeline learning framework that integrates the Segment Anything Model (SAM) and YOLOv8. This framework refers to an end-to-end workflow that systematically links automated image segmentation, data annotation, model training, and visual detection into a cohesive process, enabling self-sufficient data processing and learning without extensive human intervention. SAM is employed for offline, automated semantic segmentation to generate precise pixel-level masks of plant leaves from raw images. The masks are then processed to filter out noise (e.g., small regions with pixel areas < 3000, detailed in Section 4.3) and converted into YOLOv8-compatible bounding box annotations, eliminating the need for labor-intensive manual labeling. YOLOv8, trained on the SAM-generated dataset, serves as the final detection model. By leveraging its efficient architecture, it achieves real-time leaf detection in field conditions. Unlike approaches that use SAM for real-time segmentation, our framework decouples the roles: SAM focuses on high-quality annotation generation, while YOLOv8 handles lightweight, independent inference at 0.03 s per image. This pipeline creates a self-sufficient workflow from data preparation to end-to-end detection, significantly reducing costs and improving scalability for agricultural applications.
This paper is organized as follows: Section 1 is the Introduction, and Section 2 reviews related work, focusing on the limitations and development of current techniques for image segmentation and plant leaf detection. Section 3 elaborates the methodology of the proposed automated learning framework based on the SAM and YOLOv8 models, including data collection, dataset production, and the model training process. Section 4 presents the experimental results and analysis, evaluating the performance of the proposed framework in plant leaf detection. Finally, Section 5 summarizes the findings and discusses future application prospects and research directions.

2. Related Work

2.1. Applications of Visual Object Detection and Image Segmentation

Visual object detection and image segmentation are widely applied across many domains, particularly agriculture, as critical technologies for tasks such as disease detection, crop monitoring, and yield estimation. These technologies enable the precise localization, identification, and classification of objects within complex visual environments.
In agriculture, object detection techniques are used to identify pest infestations, monitor plant growth, and track changes in crop health. For instance, leaf detection plays a key role in plant pest management and health assessment. The ability to accurately segment and detect leaves in images enables early detection of diseases and improves decision-making in crop management. Furthermore, automated segmentation helps reduce labor costs by minimizing the need for manual annotation of plant images.
Beyond agriculture, visual object detection and image segmentation have many applications in fields such as autonomous driving, where they help vehicles detect obstacles and navigate safely. In medical imaging, segmentation techniques are used to identify and isolate areas of interest, such as tumors, in MRI or CT scans, aiding treatment planning and diagnosis.
Deep learning models, especially Convolutional Neural Networks (CNNs), have significantly advanced the efficiency and accuracy of object detection and segmentation tasks. The combination of models such as YOLOv8 and SAM offers a strong framework for real-time, automated segmentation and annotation, making them highly adaptable to the challenging and variable conditions found in agricultural fields. By leveraging these techniques, visual object detection and segmentation have transformed from manual processes into highly efficient, automated systems that can process large-scale data with improved accuracy, supporting advances in precision agriculture and other industries [3].

2.2. Visual Object Detection Models

Plant leaf detection techniques are essential in agriculture, mainly for crop management, disease detection, and growth monitoring. Advances in computer vision and deep learning have led to a growing focus on developing efficient and accurate leaf detection methods. Although image processing and classical machine learning methods have shown efficacy in particular scenarios, they typically struggle with background variation, sensitivity to lighting conditions, and noise interference.
Faster R-CNN [4] is a notable member of the R-CNN family; it generates candidate regions with a region proposal network (RPN) and performs object detection and classification on these regions. The algorithm has demonstrated strong performance in agricultural applications such as pest and disease identification and fruit detection. Similarly, the Single Shot MultiBox Detector (SSD) [5] is an efficient object detection algorithm that classifies and localizes objects in a single forward pass, offering rapid detection with high accuracy and making it well suited to real-time, large-scale agricultural tasks.
Furthermore, the YOLO (You Only Look Once) algorithm directly predicts object locations and categories through a single neural network. The YOLOv8 model further improves detection accuracy while preserving high detection speed, making it appropriate for real-time field monitoring [1]. Together, these methods make leaf detection and segmentation considerably more reliable and accurate by automatically learning and extracting key features from images, providing substantial technical support for intelligent agricultural management.
While YOLOv8 offers real-time detection, its reliance on manually annotated bounding boxes remains a bottleneck for large-scale agricultural datasets. In contrast, Mask R-CNN [2] provides pixel-level segmentation but suffers from slower inference speeds (≈200 ms per image), making it unsuitable for mobile robotic applications. Our framework addresses these limitations by leveraging SAM’s zero-shot segmentation to automate annotation, reducing labor costs, and by integrating YOLOv8 for lightweight inference, achieving a 3223× speedup compared to standalone SAM (Table 1).
To demonstrate the preprocessing and training process of YOLOv8, Figure 1 is provided as follows:

2.3. Image Segmentation Methods

Modern advancements in image segmentation techniques have been remarkable, with methods generally falling into two categories: traditional approaches and those based on deep learning.
Traditional methods involve techniques such as edge detection, region growing, and graph cuts, exemplified by the Canny edge detector [6], the seeded region growing algorithm [7], and the GrabCut algorithm [8]. For example, Wen Jiang et al. [9] implemented color image segmentation using a PCNN-based Canny edge detection method. Sun Shashuang et al. [10] obtained an F-score of 94.12% with a detection error of 7.37% by combining the GrabCut model and the Ncut algorithm to recognize green apples in an orchard environment. Although traditional segmentation techniques can produce good results, they rely on manually designed rules and features, which makes them difficult to apply to complex and changing image scenes and limits their generalization ability.
Deep learning-based methods rely on techniques such as Convolutional Neural Networks (CNNs), including U-Net [11], U-Net++ [12], U-Net3+ [13], Mask R-CNN, YOLO [14], and SAM [15]. For example, Yangyang Cao et al. [16] proposed a Mask R-CNN instance segmentation method based on a Dual Attention Mechanism Feature Pyramid Network (DAFPN) to delineate small farmland plots. Yang Hao et al. [17] presented AlgaeSeg-YOLO, which recognizes the species, number, and contour of algae in complex backgrounds with a mAP of 95.61%. These examples show that deep learning-based image segmentation techniques have strong generalization ability and can adapt to different domains and types of segmentation tasks without extensive task-specific parameter tuning. Aichen Wang et al. [18] proposed a foreground-foreground category balancing method and an enhanced YOLOv8s network, named NVW-YOLOv8s, to simultaneously detect and segment tomato fruits, achieving 91.4% and 90.7% mean Average Precision (mAP) for detection and segmentation, respectively. Geng Qing et al. [19] proposed a segmentation model, Multi-attention TransUNet (MATransUNet), together with a spike counting model (SGCountM) for wheat spike counting. Similarly, Shuo Wang et al. [20] fused vegetation indices with ridge segmentation to separate plants, yielding a method that runs at 10 frames per second (FPS) and enables real-time robot navigation in vegetable farms.
Traditional methods like GrabCut [8] require user interaction, making them impractical for large-scale agricultural data. Deep learning models such as U-Net [11] excel in semantic segmentation but lack object detection capabilities. SAM [15] revolutionizes segmentation with zero-shot adaptability, yet its direct deployment for real-time detection is hindered by high computational cost (96.7 s per 1500 × 1125 pixel image). Our pipeline bridges this gap by using SAM exclusively for offline annotation, enabling YOLOv8 to operate independently at 0.03 s per image; SAM’s working process and output information are shown in Figure 2.

2.4. Visual Object Annotation and Its Efficiency and Cost

Image annotation refers to the process of locating, identifying, and classifying target objects in an image, typically achieved by adding labels or descriptive information to each object manually or semi-automatically. It is an important step in computer vision tasks and plays a major role in object detection and segmentation. Commonly used manual annotation tools include LabelImg [21], LabelMe [22], and others.
High-quality image annotation data are important for training deep learning models and essential for improving model accuracy, performance, and robustness. However, manual annotation is time-consuming and labor-intensive, which limits both the scale and quality of annotations.
The literature indicates that the cost of data labeling is substantial, particularly because high-quality labeling requires significant human involvement. For instance, Koch et al. [23] examined the cost of labeling data for machine learning models and found that data labeling consumes a large share of the resources in AI projects; the staffing requirements and time commitment are particularly demanding for complex data types such as speech and images. The quality of data annotation directly influences the performance of machine learning models. Northcutt et al. [24] emphasized that even a small number of annotation errors can considerably degrade model performance in real-world applications. Maintaining high-quality data annotation therefore requires continuous oversight by professionals, further escalating costs. Stewart et al. [25] discussed the challenges of data labeling in agricultural AI applications, noting that agricultural data often comprise complex, multidimensional information such as pest and disease status, plant growth conditions, and soil properties. In addition, frequently changing environmental conditions increase the difficulty and complexity of annotation.

3. Materials and Methods

3.1. Framework Overview

Given that the Segment Anything Model (SAM) can automatically segment arbitrary images and that YOLOv8 can be trained into a visual detection model from annotation labels, an automated learning framework is constructed by joining the SAM and YOLOv8 models for automated segmentation, annotation, training, and detection. The framework is shown in Figure 3.

3.2. Model Integration

Given the high performance of the emerging Segment Anything Model (SAM), its application to agricultural robot field operations and image segmentation scenarios holds significant potential. However, the current inference speed of SAM is insufficient to meet the performance requirements of robotic operations: the average segmentation time for a single agricultural field image of 1500 × 1125 pixels is about 100 s, indicating an urgent need for improvement. The proposed framework integrates the SAM and YOLOv8 models, yielding both higher annotation efficiency than manual annotation and much shorter inference time than standalone SAM. This improvement is crucial for meeting the real-time operational requirements of agricultural field scenarios.

Automatic Conversion from SAM’s Output Labels to the VOC Format

The segmentation results produced by the SAM model are used to automatically extract labeling information for the leaves in each image. We developed a script to convert the SAM segmentation output into a format compatible with YOLO’s VOC-style labeling. The script also processes the segmentation masks from the SAM model to extract key information, such as the bounding box coordinates of each segmented object. These data are then converted into the required annotation format and saved as a .txt file. The script ensures seamless conversion from SAM’s output to YOLO-compatible labels and handles all intermediate processing.
After the above steps, the annotation information is automatically extracted and converted based on the segmentation results of the SAM model, facilitating tasks such as image annotation and object detection. The entire intermediate processing and the workflow of the automatic conversion from the SAM output to YOLOv8-compatible labels are shown in Figure 4.
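A minimal sketch of this conversion is given below. It assumes each SAM mask is a binary NumPy array stored in the "segmentation" field of SAM’s output dictionaries and that the target is a normalized YOLO-style .txt label line (class, x-center, y-center, width, height); the function names and the single-class index are illustrative and not the exact script used in this study.

```python
import numpy as np

def mask_to_yolo_line(mask, class_id=0):
    """Convert one binary SAM mask (H x W array, the 'segmentation' field of
    SAM's output) into a normalized YOLO label line: 'class cx cy w h'.
    Returns None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    h, w = mask.shape
    x_min, x_max, y_min, y_max = xs.min(), xs.max(), ys.min(), ys.max()
    x_center = (x_min + x_max) / 2.0 / w   # normalized box center
    y_center = (y_min + y_max) / 2.0 / h
    box_w = (x_max - x_min + 1) / w        # normalized box width
    box_h = (y_max - y_min + 1) / h        # normalized box height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"

def save_yolo_labels(sam_masks, label_path, class_id=0):
    """Write one .txt label file for an image from its list of SAM mask dicts."""
    lines = [mask_to_yolo_line(m["segmentation"], class_id) for m in sam_masks]
    with open(label_path, "w") as f:
        f.write("\n".join(l for l in lines if l is not None))
```

In practice, one such .txt file is written per image, with one line per retained leaf mask.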

3.3. Automatic Learning by Using the YOLOv8 Framework and Labels Converted

YOLOv8 (You Only Look Once version 8) is designed for efficient real-time object detection and segmentation. It retains the established advantages of the YOLO series, including high speed, accuracy, and ease of deployment. Through appropriate parameter settings and optimized training strategies, the model achieves significant improvements across several metrics, providing reliable technical support for agricultural robots operating in complex field environments.

3.4. Performance Evaluation Indicators and Methods

To assess the segmentation performance of the SAM-YOLOv8 model, precision, recall, mean Average Precision (mAP), and Intersection over Union (IoU) were used. The segmentation results of the SAM-YOLOv8 model were compared against the original images to quantitatively measure its accuracy and effectiveness in the segmentation task.
(1) Precision
Precision is a key performance metric that measures the accuracy of a classification model. Specifically, it reflects the proportion of samples predicted as positive that are truly positive, i.e., the ratio of true positives to the total number of predicted positives (true positives plus false positives).
Precision = TP / (TP + FP)
TP denotes the number of true positive cases, and FP denotes the number of false positive cases.
A higher precision means that most of the samples predicted as positive are actually positive, i.e., the model rarely misclassifies negative cases as positive; this is crucial in applications that require high reliability. A lower precision means the model makes more errors when predicting positive cases, i.e., many samples predicted as positive are actually negative, which can have serious consequences in practice.
(2) Recall
Recall is another key indicator for evaluating classification performance. It assesses a model’s capacity to identify all positive examples, that is, the proportion of genuine positive examples it retrieves.
Recall = TP / (TP + FN)
The number of false negative cases is represented by FN.
A high recall means the model correctly identifies the majority of true positive cases, reducing the number of positives incorrectly predicted as negative. This is crucial in scenarios where missing a positive instance can lead to serious consequences. A low recall indicates that the model misses positive cases, i.e., many true positives are incorrectly predicted as negative, which may cause crucial information or significant events to be overlooked.
(3) Intersection over Union (IoU)
In object detection tasks, IoU (Intersection over Union) is typically used to measure the degree of overlap between two rectangular boxes. It is a common evaluation criterion in object detection, image segmentation, and tracking.
IoU = Area of Overlap / Area of Union
where “Area of Overlap” refers to the area of the region where the predicted bounding box and the ground truth bounding box overlap, and “Area of Union” refers to the area of the union of the two boxes. The overlap area is the number of pixels where the predicted bounding box and ground truth box intersect, while the union area includes all unique pixels from both boxes. IoU quantifies the spatial consistency between the predicted and actual object locations, with values ranging from 0 (no overlap) to 1 (perfect overlap).
A high IoU value usually indicates that the detection algorithm performs well and can locate the target accurately. In many applications, an IoU threshold is set, and a detection is considered successful only if the IoU between the predicted box and the ground truth box exceeds this threshold. A low IoU value indicates poor detection performance: the algorithm may fail to locate the target accurately, which can lead to missed targets or background elements being wrongly identified as targets in practical applications.
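As a concrete illustration of the three metrics above, the sketch below computes precision, recall, and the IoU of two axis-aligned boxes; the function names and the example boxes are purely illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); returns 0 when no positives are predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); returns 0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def box_iou(box_a, box_b) -> float:
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (0 if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# A detection is typically counted as a true positive when its IoU with the
# ground truth box exceeds a chosen threshold (e.g., 0.5).
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```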

4. Results

4.1. Experimental Datasets

We constructed the dataset by capturing top-view images of three plant species with smartphones at a resolution of 1440 × 1920 pixels. To reduce the impact of shadows and environmental variables under natural lighting, images were taken in the early morning under uniform lighting or during overcast conditions, ensuring consistency. This approach allowed us to evaluate the segmentation quality, efficiency, and variability of different methods under controlled lighting scenarios. Leaf images of three plant species (lilac, field cotton, and mulberry-leaf peony) were systematically collected to assess the performance and accuracy of the SAM and YOLO models across different conditions, including variations in leaf size and structure. The dataset comprises 1200 top-view images (400 per species): 840 for training (70%), 240 for validation (20%), and 120 for testing (10%). This balanced division ensures robust cross-class evaluation.
Traditional data labeling methods are labor-intensive and time-consuming. Therefore, this study relies entirely on the SAM model for automatic segmentation to generate high-quality segmentation masks without any manual intervention. A custom script was developed to convert these masks into YOLOv8-compatible labels in the VOC (Visual Object Classes) format, ensuring a fully automated annotation process. To ensure a balanced distribution of samples across plant species and avoid bias in model evaluation, the dataset was divided using stratified random sampling based on the class labels (lilac, field cotton, mulberry-leaf peony). Specifically, 70% of the data was allocated to the training set for model learning, 20% to the validation set for hyperparameter tuning and intermediate performance assessment, and 10% to the test set for final unbiased evaluation. This 7:2:1 ratio was maintained for each plant class to preserve the original class distribution, and a fixed random seed was used to ensure consistent partitioning across experiments, minimizing variability due to sampling randomness. This structured division facilitates a comprehensive evaluation of the models’ generalization ability and robustness across the different dataset subsets.
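A minimal sketch of the stratified 7:2:1 split described above is shown below, assuming the per-image class labels are available as a list and that scikit-learn is installed; the specific seed value is an assumption, since the paper states only that a random seed was used.

```python
from sklearn.model_selection import train_test_split

def stratified_split(image_paths, labels, seed=42):
    """Split images into train/val/test with a 7:2:1 ratio, stratified by class.

    image_paths: list of image file paths; labels: parallel list of class names
    ('lilac', 'field cotton', 'mulberry-leaf peony').
    """
    # First carve off the 30% that is not used for training.
    train_x, rest_x, train_y, rest_y = train_test_split(
        image_paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    # Then split that 30% into validation (20% overall) and test (10% overall).
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=1/3, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```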

4.2. Computing Environments

Upon completion of the data collection and labeling process, the labeled data were divided into training, validation, and test sets, and the YOLOv8 model was trained for plant leaf detection. Images were rescaled to 640 × 480 as the model input. All performance metrics of the model on the validation set, including precision, recall, and mean Average Precision, met the expected objectives.
Computing device: Windows 11, Intel Core i5-13500HX, NVIDIA GeForce RTX 4060 (NVIDIA, Santa Clara, CA, USA)
Environment: Python 3.9, PyTorch 2.0.1, CUDA 12.3
Framework: YOLOv8 [26], Batch = 8, Learning rate = 0.001, Epochs = 100
Segmentation: SAM [27]
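Under these settings, a YOLOv8 training run can be reproduced roughly as in the sketch below, using the Ultralytics Python API. The model variant (nano weights) and the dataset YAML path are assumptions not specified in the paper; the epochs, batch size, and initial learning rate follow the values listed above.

```python
from ultralytics import YOLO

# Hypothetical weight file; the paper does not state which YOLOv8 variant was used.
model = YOLO("yolov8n.pt")

results = model.train(
    data="leaf_dataset.yaml",  # hypothetical YAML listing the train/val/test image dirs
    epochs=100,                # as listed above
    batch=8,                   # as listed above
    lr0=0.001,                 # initial learning rate, as listed above
    imgsz=640,                 # images are rescaled to a 640-pixel long side before training
)

metrics = model.val()          # precision, recall, and mAP on the validation split
```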

4.3. Segmentation Effect of Plant Leaf Scene Image Based on SAM Segmentation Model

The SAM model was first used to automatically segment and label the plant leaf images, and the resulting annotations were then used to train the YOLOv8 model. For the dataset comprising top-view images of the three plant species, an example of the partial segmentation results obtained with the SAM model is shown in Figure 5.
“Remove leaves with pixels < 3000” indicates the removal of regions with a pixel area smaller than 3000. This threshold was determined from the minimum physical size of the leaves (approximately 1 cm²), converted to a pixel area according to the image resolution (1440 × 1920). This preprocessing step filters out noise (e.g., dust, small weeds) and incomplete leaf fragments so that only valid leaf regions of sufficient size are retained for subsequent annotation and model training, improving dataset quality by excluding insignificant or irrelevant segments.
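The mask generation and area filtering can be sketched as follows, using the automatic mask generator from the official segment-anything package. The SAM backbone (ViT-H), the checkpoint file, and the image path are assumptions; the 3000-pixel area threshold is the value described above.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Hypothetical checkpoint; the paper does not state which SAM backbone was used.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Load one top-view leaf image (hypothetical path) and convert BGR -> RGB for SAM.
image = cv2.cvtColor(cv2.imread("leaf_top_view.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with 'segmentation' and 'area'

# Discard regions smaller than 3000 pixels (dust, weeds, leaf fragments),
# mirroring the "remove leaves with pixels < 3000" step described above.
leaf_masks = [m for m in masks if m["area"] >= 3000]
```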

4.4. Performances of Automatic Conversion from Segmentation Results to Annotated Data

To enable the automated transition from the segmentation results generated by the SAM model to the annotation format required by YOLOv8, a custom script was developed. This script efficiently extracts labeling information from the segmentation outputs and streamlines the annotation process.
Extraction of Segmentation Masks: The SAM model outputs segmentation masks that describe the boundaries of the detected leaves.
Bounding Box Calculation: The script processes these masks to determine the coordinates of bounding boxes for each segmented object.
Format Conversion: The bounding box coordinates are then converted into the VOC format, which is compatible with the YOLOv8 training framework (average conversion time = 1.89).
Saving Results: The final annotation data is saved as .txt files, with each file containing the relevant bounding box information and corresponding to the relevant image.
This automation ensures that segmentation results can be fed directly into the YOLOv8 training pipeline, making image annotation fast and straightforward. The ability to automatically convert segmentation outputs into an annotated dataset is crucial for the real-time processing demands of agricultural applications.

4.5. Performance of the Training of the YOLOv8 Model

To evaluate the proposed SAM-YOLOv8 automatic segmentation and detection framework for plant leaves, we first compared it with segmentation using the SAM model alone, which reaches 92% accuracy but is time-consuming and therefore unsuitable for real-time segmentation of agricultural field scenes. To validate the effectiveness of SAM-generated annotations, we also compared our framework with a baseline YOLOv8 model trained on manually labeled data (Table 2). The SAM-YOLOv8 framework achieved 87% precision, comparable to the baseline’s 88%, but with 90% fewer annotation hours, demonstrating that automated SAM annotations can replace manual labeling without significant performance loss. The baseline YOLOv8 used manually annotated bounding boxes, while SAM-YOLOv8 used fully automated SAM-generated labels. The experimental results are shown in Table 2.
Table 2 shows that, by combining the annotation capability of the SAM model with the segmentation capability of the YOLOv8 model, the SAM-YOLOv8 framework attains an accuracy of 87%, a crucial metric for evaluating segmentation quality. This level of segmentation accuracy is well suited to field-scene operations. The segmentation of the three types of leaves by SAM-YOLOv8 is shown in Figure 6.
The detected areas in Figure 6 show lower correspondence with the leaf regions in Figure 5 because YOLOv8 outputs bounding boxes (axis-aligned rectangles that tightly encapsulate objects), while SAM generates pixel-level segmentation masks (detailed, pixel-wise object contours). Therefore, bounding boxes only define the minimal rectangular boundaries of targets, whereas segmentation masks capture intricate object edges and shapes. This fundamental difference in output formats naturally leads to the observed visual discrepancy between the two models’ results.
Upon validating the 120 images in the test set, the accuracy of the YOLOv8 model trained on SAM-generated labels was found to be slightly lower than that of SAM alone. However, the average time required for SAM to segment an image is 96.7 s, whereas the average time for SAM-YOLOv8 to predict a single image is 0.03 s, a speedup of 3223.33 times and a significant improvement in computational performance. This trade-off between precision and speed demonstrates the framework’s effectiveness in balancing accuracy with real-time applicability in agricultural field scenarios.
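For reference, the reported acceleration ratios follow directly from the per-image timings in Table 1:

Speedup (GPU) = 96.7 s / 0.03 s ≈ 3223.33
Speedup (CPU) = 98.4 s / 0.3 s = 328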

5. Conclusions

Based on the experimental data analysis, the proposed automated framework integrating SAM and YOLOv8 demonstrates excellent performance in plant leaf detection. In terms of detection accuracy, the framework achieves an average precision of 87%, which is comparable to the baseline models in some aspects, while significantly reducing the annotation effort. For example, compared with traditional manual annotation methods, our framework completely eliminates the need for manual labeling, saving a large amount of time and human resources.
Regarding computational efficiency, the average time for SAM-YOLOv8 to predict a single image is 0.03 s, achieving a speedup ratio of 3223.33 times compared to standalone SAM. This indicates that the framework can meet the real-time requirements of agricultural field applications. The addition of MSE, RMSE, and PSNR metrics further validates the quality of the segmentation results. The relatively low MSE and RMSE values, along with the relatively high PSNR values, suggest that the segmented images are of high quality and closely match the actual leaf regions.
However, it should be noted that although the proposed framework shows good performance, there are still some limitations. For instance, in extremely complex backgrounds with a large number of overlapping leaves or under severe lighting variations, the detection accuracy may slightly decrease. Future research could focus on improving the robustness of the framework in such challenging conditions. Additionally, exploring more effective ways to optimize the combination of SAM and YOLOv8, as well as integrating other advanced techniques, may further enhance the performance of the framework for leaf detection tasks in agricultural fields.

Author Contributions

Conceptualization, L.Z. and L.C.; methodology, L.Z.; software, L.Z.; validation, L.Z.; formal analysis, L.Z.; investigation, L.Z.; resources, L.Z.; data curation, L.Z. and L.C.; writing—original draft preparation, L.Z.; writing—review and editing, K.O.; visualization, L.Z.; supervision, L.C.; project administration, L.C.; funding acquisition, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61961034), the Regional Innovation Guidance Plan of Science and Technology Bureau of Xinjiang Production and Construction Corps (2021BB012 and 2023AB040), the Modern Agricultural Engineering Key Laboratory at Universities of Education Department of Xinjiang Uygur Autonomous Region (TDNG2022106), the Innovative Research Team Project of Tarim University President (TDZKCX202308), and the Graduate Student Research Innovation Project of Tarim University (TDGRI202257).

Acknowledgments

The authors would like to thank the research team members for their contributions to this work.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  2. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  3. Allabadi, G.; Lucic, A.; Wang, Y.X.; Adve, V. Learning to Detect Novel Species with SAM in the Wild. Int. J. Comput. Vis. 2024, 133, 2247–2258. [Google Scholar] [CrossRef]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  6. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698. [Google Scholar] [CrossRef] [PubMed]
  7. Adams, R.; Bischof, L. Seeded Region Growing. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 641–647. [Google Scholar] [CrossRef]
  8. Rother, C.; Kolmogorov, V.; Blake, A. GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts. ACM Trans. Graph. (TOG) 2004, 23, 309–314. [Google Scholar] [CrossRef]
  9. Jiang, W.; Zhou, H.; Shen, Y.; Liu, B.; Fu, Z. Image segmentation with pulse-coupled neural networks and Canny operators. Comput. Electr. Eng. 2015, 46, 528–538. [Google Scholar] [CrossRef]
  10. Sun, S.; Jiang, M.; He, D.; Long, Y.; Song, H. Recognition of green apples in an orchard environment by combining the GrabCut model and Ncut algorithm. Biosyst. Eng. 2019, 187, 201–213. [Google Scholar] [CrossRef]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany; pp. 234–241. [Google Scholar]
  12. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  13. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 1055–1059. [Google Scholar]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  16. Cao, Y.; Zhao, Z.; Huang, Y.; Lin, X.; Luo, S.; Xiang, B.; Yang, H. Case instance segmentation of small farmland based on Mask R-CNN of feature pyramid network with double attention mechanism in high-resolution satellite images. Comput. Electron. Agric. 2023, 212, 108073. [Google Scholar] [CrossRef]
  17. Yang, H.; Lang, K.; Wang, X. Identify and segment microalgae in complex backgrounds with improved YOLO. Algal Res. 2024, 82, 103651. [Google Scholar] [CrossRef]
  18. Wang, A.; Qian, W.; Li, A.; Xu, Y.; Hu, J.; Xie, Y.; Zhang, L. NVW-YOLOv8s: An improved YOLOv8s network for real-time detection and segmentation of tomato fruits at different ripeness stages. Comput. Electron. Agric. 2024, 219, 108833. [Google Scholar] [CrossRef]
  19. Geng, Q.; Zhang, H.; Gao, M.; Qiao, H.; Xu, X.; Ma, X. A rapid, low-cost wheat spike grain segmentation and counting system based on deep learning and image processing. Eur. J. Agron. 2024, 156, 127158. [Google Scholar] [CrossRef]
  20. Wang, S.; Su, D.; Jiang, Y.; Tan, Y.; Qiao, Y.; Yang, S.; Feng, Y.; Hu, N. Fusing vegetation index and ridge segmentation for robust vision-based autonomous navigation of agricultural robots in vegetable farms. Comput. Electron. Agric. 2023, 213, 108235. [Google Scholar] [CrossRef]
  21. Ankita, K.; Kale, P.D.; Tanvi, M.; Samrudhi, S.; Shweta, J. Comparative Analysis of Image Annotation Tools: LabelImg, VGG Annotator, Label Studio, and Roboflow. Int. J. Emerg. Technol. Innov. Res. 2024, 11, n398–n403. [Google Scholar]
  22. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A Database and Web-based Tool for Image Annotation. Int. J. Comput. Vis. 2008, 77, 157–173. [Google Scholar] [CrossRef]
  23. Koch, A.; Coppert, R.; Kanas, J.T. The Cost of Data Labeling in Machine Learning Models. In Advances in AI Data Management; Springer: Berlin/Heidelberg, Germany, 2022; pp. 75–88. [Google Scholar]
  24. Northcutt, A.; Schutze, J.B.; Davis, J.A. Evaluating the Impact of Label Noise on Image Classification. In Proceedings of the Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021. [Google Scholar]
  25. Stewart, A.J.; O’Hare, G.P.A.; O’Hare, K.P. Challenges in Data Labeling for Agricultural Applications. In Artificial Intelligence in Agriculture; Lopes, D.M.A.C., Melo, J.C.A., Eds.; Springer: Cham, Switzerland, 2020; pp. 25–45. [Google Scholar]
  26. Ultralytics. Ultralytics. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 5 January 2024).
  27. Facebook Research. Segment Anything. 2023. Available online: https://github.com/facebookresearch/segment-anything (accessed on 6 January 2024).
Figure 1. Pre-processing and training process of the YOLOv8 object detection framework.
Figure 2. SAM working process and output information.
Figure 3. Technical framework diagram.
Figure 4. Workflow of automatic conversion from SAM output to the YOLOv8-compatible VOC format.
Figure 5. SAM segmentation effect for top-view leaves of three plants.
Figure 6. Comparison of the effectiveness of YOLOv8 in automatically learning SAM-segmented datasets.
Table 1. Segmentation-time performance comparison analysis.

Partition Model | Average GPU Single-Image Segmentation Time (s) | Average CPU Single-Image Segmentation Time (s) | GPU Acceleration Ratio | CPU Acceleration Ratio
SAM | 96.7 | 98.4 | – | –
SAM + YOLOv8 | 0.03 | 0.3 | 3223.33 | 328
Table 2. Leaf segmentation results for the YOLOv8 model trained on the SAM-labeled dataset.

Plant Leaf Type | Precision (%) | Recall (%) | F1 Score (%) | MSE | RMSE | PSNR (dB)
Field cotton | 87 | 89 | 87.9 | 0.003 | 0.055 | 38.2
Mulberry-leaf peony | 86 | 84 | 84.9 | 0.005 | 0.071 | 35.8
Lilac | 87 | 90 | 88.5 | 0.002 | 0.045 | 40.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhao, L.; Olivier, K.; Chen, L. An Automated Image Segmentation, Annotation, and Training Framework of Plant Leaves by Joining the SAM and the YOLOv8 Models. Agronomy 2025, 15, 1081. https://doi.org/10.3390/agronomy15051081