Evaluation of YOLO Object Detectors for Weed Detection in Different Turfgrass Scenarios

Abstract: The advancement of computer vision technology has allowed for the easy detection of weeds and other stressors in turfgrasses and agriculture. This study aimed to evaluate the feasibility of single-shot object detectors for weed detection in lawns, which represents a difficult task. Four versions of the YOLO (You Only Look Once) object detector, along with all their model scales, were trained on a public ‘Weeds’ dataset of 4203 digital images of weeds growing in lawns, with a total of 11,385 annotations, and tested for weed detection in turfgrasses. Different weed species were considered as one class (‘Weeds’). Trained models were tested on the test subset of the ‘Weeds’ dataset and on three additional test datasets. Precision (P), recall (R), and mean average precision (mAP_0.5 and mAP_0.5:0.95) were used to evaluate the different model scales. YOLOv8l obtained the overall highest performance on the ‘Weeds’ test subset, with a P of 0.9476, mAP_0.5 of 0.9795, and mAP_0.5:0.95 of 0.8123, while the best R was obtained by YOLOv5m (0.9663). Despite YOLOv8l's high performance, the outcomes obtained on the additional test datasets underscore the need for further improvements to address the challenges impeding accurate weed detection.


Introduction
Weed encroachment within turfgrass swards strictly depends on the turfgrass management regime and may lead to a loss of functional quality and aesthetic perception. To date, the best weed control on turfgrasses is achieved by broadcast application of synthetic herbicides [1]. Synthetic herbicides in the European Union have been subjected to strict bans due to the health and environmental risks of herbicide exposure [2,3]. According to the European Commission [4], approximately 100 different synthetic herbicides are allowed for turfgrass and landscape management. However, there are slight discrepancies between what is allowed in various European countries. Many endeavors are underway to replace synthetic herbicides and find appropriate products, tools, or management techniques that effectively control weeds in turfgrasses and urban environments. Currently, the most effective weed removal methods in turfgrasses or on urban hard surfaces involve localized applications of nonselective biological products (e.g., acetic acid) [5] or thermal treatments [6]; however, adequate efficacy has yet to be achieved. Robotic machines that can autonomously detect and remove weeds show great promise for more sustainable weed control in turfgrasses [7][8][9]. Weed detection is performed using various methods such as image processing, machine learning, and computer vision techniques, and it is an area of active research and development. Indeed, various works have been published investigating the feasibility of using machine vision technology for weed detection in turfgrass and grassland systems using Bayes classifier, morphology operator [10], weed shapes and texture features [11][12][13], color [12,13] and a range of YOLO model scales, specifically YOLOv5, YOLOv6, YOLOv7, and YOLOv8, for their efficacy in this task. The study aimed not only to evaluate these models' capacity for weed detection in turfgrass but also to compare their performance to identify the most effective approach.

Materials and Methods
2.1. YOLO and YOLOv5, YOLOv6, YOLOv7, YOLOv8 Detectors
YOLO is a single-shot detector (SSD); its first version was released in 2015 [22]. YOLO performs object detection by dividing the input image into an m × m grid of equally sized cells. Each grid cell is responsible for detecting an object if the object's center falls inside the cell. Each cell can predict a fixed number of bounding boxes, each with an accompanying confidence score. Each prediction comprises five values (x, y, w, h, and a confidence score). Here, x and y are the coordinates of the bounding box center, and w and h are its width and height, respectively. After predicting the bounding boxes, YOLO uses intersection over union (IoU) to choose the most representative bounding box of an object in the grid cell, and non-maximum suppression is used to remove the excess bounding boxes. After the first YOLO release, YOLOv2 and YOLOv3 were published in 2016 [30] and 2017 [31], respectively. Then, Alexey Bochkovskiy released YOLOv4 in 2020 [32]. In this experiment, YOLOv5 [33], YOLOv6 [34], YOLOv7 [35], and YOLOv8 [36] models were used and evaluated for weed detection in multiple turfgrass contexts. YOLOv5 was introduced by Glenn Jocher shortly after the release of YOLOv4 and is entirely based on the PyTorch framework. The YOLOv6 and YOLOv7 detection models were released in June and July 2022, respectively. Finally, YOLOv8 was published by Ultralytics in January 2023.
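The IoU comparison and non-maximum suppression step described above can be illustrated with a minimal sketch. It is not the implementation used by any of the YOLO releases: the greedy suppression rule, the 0.5 overlap threshold, the function names, and the example boxes are all assumptions made for illustration. Boxes follow the (x, y, w, h) center format given in the text.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_center, y_center, width, height)


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert a center-format box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS (assumed variant): visit boxes by descending confidence and
    keep a box only if it does not overlap an already kept box too strongly."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical predictions: two near-duplicate boxes and one distant box.
boxes = [(50, 50, 40, 40), (52, 52, 40, 40), (150, 150, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In this toy case the second box overlaps the first with an IoU of about 0.82, so only the higher-confidence box of the pair survives alongside the distant box.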
YOLOv5 combines a cross-stage partial network (CSPNet) [37] and Darknet as a backbone. It uses a path aggregation network (PANet) [38] as a neck and adaptive feature pooling to enhance object location accuracy. The YOLOv5 head generates feature maps of three different sizes to achieve multi-scale prediction [31]. YOLOv5 outperforms the previous YOLO version in detection accuracy, at the cost of a slightly slower inference speed [39]. Real-time weed detection requires high detection speed, high accuracy, and a compact model size, and YOLOv5 provides high inference efficiency on resource-poor edge devices [40]. A YOLOv5 object detection application programming interface (API) was used. YOLOv5 offers five different model scales: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra-large), which are compound-scaled variants of the same architecture. Table 1 shows more detailed information about the YOLOv5 models.
YOLOv6 (and the newer versions YOLOv7 and YOLOv8) performs anchor-free detection to obtain a higher inference speed. YOLOv6 utilizes an EfficientRep backbone based on RepVGG [41] to increase parallelism. PAN [42] is boosted with RepBlocks or CSPStackRep [37]. The task alignment learning approach from TOOD [43] is employed for label assignment, and VariFocal [44] and SIoU or GIoU [45,46] losses are used for classification and regression loss computation, respectively. RepOptimizer [47] quantization and channel-wise distillation [48] contribute to a higher detection speed. YOLOv6 achieved an AP of 52.5% and an AP50 of 70% at around 50 FPS on the MS COCO test 2017 dataset and an mAP of 43.1% on the COCO val 2017 dataset. YOLOv6 provides different model scales for various applications: YOLOv6n (nano), YOLOv6s (small), YOLOv6m (medium), and YOLOv6l (large) [49].
YOLOv7 improves accuracy without affecting the inference speed. It introduces the extended efficient layer aggregation network (E-ELAN) [50] as an improved version of the ELAN computational block. E-ELAN enables efficient learning without losing the gradient path. YOLOv7 is a concatenation-based architecture that scales network depth and width according to concatenating layer ratios, reducing hardware usage while ensuring efficiency at different scales. YOLOv7 relies on re-parameterized convolutions (RepConv) [41] and employs coarse label assignment for the auxiliary head and fine label assignment for the lead head. Additional innovations include batch normalization in conv-bn-activation, implicit knowledge inspired by YOLOR [51], and an exponential moving average for the final inference model. To date, the YOLOv7 algorithm has resulted in lower inference speed and higher accuracy than YOLOR, PP-YOLOE, YOLOX, Scaled-YOLOv4, and YOLOv5 [35]. Furthermore, the YOLOv7 network provides two model sizes: YOLOv7 and YOLOv7x (extra-large).
YOLOv8 represents the state of the art among YOLO object detectors. To date, no paper about YOLOv8 has been published; however, some information is available online (Table 1). YOLOv8 is an anchor-free detector designed to reduce the number of box predictions and speed up non-maximum suppression. YOLOv8 uses mosaic augmentation to boost the training process; this augmentation is disabled for the last ten epochs. YOLOv8 provides several innovations to support a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, classification, labeling, training, and deploying. YOLOv8 is provided in five scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra-large). YOLOv8x obtained an AP of 53.9% on the MS COCO test-dev 2017 dataset, with an image size of 640 pixels and a speed of 280 FPS on an NVIDIA A100 with TensorRT [49].
a GFlops is a unit of computational power equal to one billion floating-point operations per second.
b AP (%) represents the average precision of the YOLO detectors on the COCO 2017 dataset [49].
In general, larger model scales provide higher accuracy but a lower inference speed. Therefore, for this trial, all five of the YOLOv5 model scales, four of the YOLOv6 models, the two YOLOv7 models, and all five YOLOv8 model scales were used to train the weed detection algorithm. The hyperparameters used for the training process are listed in Table 2.
EfficientDet was also trained and compared with the abovementioned YOLO models. EfficientDet is an SSD that employs a bi-directional feature pyramid network (BiFPN) to enhance multi-scale feature fusion performance [52] and EfficientNet [53] backbones to boost image classification performance. The model was trained for 1000 iterations, with an image size of 640 × 640, a learning rate of 0.0001, and a batch size of 8.
YOLOv5 anchors: [10,13], [16,30], [33,23], [30,61], [62,45], [59,119]
A public 'Weeds' dataset [54] was used for this trial to train the different YOLO models. The 'Weeds' dataset is a collection of images of weeds growing in lawns and in typical urban backgrounds that can easily confuse object detection models due to the similarity of the weeds to their surroundings. This dataset contains 4203 images with weeds labeled in 11,385 annotations. Based on an image analysis of a 500-image sample, approximately 62% of the images in the 'Weeds' dataset show weeds on turf backgrounds, 13% show weeds on third surfaces (different floor patterns) backgrounds, and approximately 24% show both backgrounds. The weeds identified in this sample were Erigeron canadensis L. (43%), Sonchus spp. (23%), Taraxacum officinale L. (Weber) (18%), Oxalis spp. (4%), Cerastium spp. (3%), and a small percentage of unknown species. The dataset did not provide enough images of each species to train multi-class detectors; thus, only one class (Weeds) was assumed for this trial. The labels correlogram shows the relationship between the position, width, and height of the annotations of the dataset's objects (weeds). Generally, the dataset contains mostly small and stretched objects positioned at the center of the digital image. Before training the models, the images were cropped to obtain a resolution of 640 × 640 pixels without applying any resizing and were subjected to the auto-orient function. The auto-orient function strips images of their EXIF data so that they are displayed in the same way as they are stored on disk.
To assess the models' performance on this dataset, k-fold cross-validation was performed. K-fold cross-validation is a simple and popular method for model evaluation. This procedure generally consists of dividing the dataset into k subsets after a random shuffle, training the model on k-1 subsets, and using the remaining subset to test the model. The evaluation scores produced are considered a more reliable summary of model performance. This dataset was divided into five subsets (two of 804 and three of 805 images); therefore, a five-fold cross-validation was performed. The best-performing models of each YOLO version were then trained on the dataset (train: 3664 images, validation: 359 images, and test: 180 images) and evaluated. The online platform Google Colaboratory (Colab), offered by Google, was used to implement and train the models. Colab, a cloud service based on Jupyter Notebooks, provides a free single 12 GB NVIDIA Tesla K80 GPU.
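The shuffle-and-split procedure described above can be sketched as follows. This is only an illustration, not the split used in the study (which was produced with an external annotation platform): the function names and the random seed are hypothetical, and the total of 4023 images is inferred from the stated fold sizes (two of 804 and three of 805).

```python
import random
from typing import List, Tuple


def k_fold_indices(n_items: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle item indices and divide them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    base, extra = divmod(n_items, k)
    folds: List[List[int]] = []
    start = 0
    for fold_id in range(k):
        # The first `extra` folds receive one additional item.
        size = base + (1 if fold_id < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(folds: List[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """For each fold, train on the other k-1 folds and test on the held-out one."""
    splits = []
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return splits


folds = k_fold_indices(4023, 5)
fold_sizes = sorted(len(fold) for fold in folds)
splits = train_test_splits(folds)
print(fold_sizes)
```

Each image index appears in exactly one test fold, so every image contributes to the evaluation once across the five runs.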

Additional Test Datasets
Furthermore, three additional test datasets were used to evaluate the detection performance of the different models and to assess the potential use of the trained models for weed detection in other contexts.
The Home Lawn dataset comprises 180 images featuring 473 annotations of weeds proliferating in a mature stand of bermudagrass (Cynodon dactylon (L.) Pers.) during its early green-up phase (indicative of low-quality turf) as well as weeds emerging on hard surfaces such as streets, curbs, and brick floors. These intricate background settings, commonly found in residential and urban areas, could potentially influence the efficacy of weed detection. The images were captured in various locations, including residential lawns and parks, in Seville, Spain (37°389 N, 5°985 W; Datum: WGS84).
The Baseball Field dataset comprises 180 images and 285 annotations of weeds developing in an actively growing bermudagrass turf overseeded with ryegrass (Lolium perenne L.). In this dataset, image backgrounds are considered uniform (high-quality turf), and most weeds are small in size and either partially growing within the turf or altered in shape due to the intense management. Images of this dataset were collected at the Opelika High School baseball field (Opelika, AL, USA; 32°645 N, 85°378 W; Datum: WGS84).
The Manila grass dataset consists of 180 digital images and 242 annotations of weeds developing in an actively growing, mature stand of manila grass (Zoysia matrella (L.) Merr. cv 'Diamond'). In this dataset, most weeds are large in size and show their full growth shape. Images were taken at the experimental farm of the Department of Agriculture, Food and Environment of the University of Pisa (San Piero a Grado, Pisa, Italy; 43°400 N, 10°190 E; Datum: WGS84).
In all datasets, T. officinale, Plantago lanceolata L., and Sonchus spp. were identified as significant weed species; examples of images are depicted in Figure 1.

The Roboflow API (https://app.roboflow.com; accessed on 2 July 2023) was used to annotate the additional test dataset images, convert them into the respective YOLO version format, and split the dataset for the five-fold cross-validation.

Metrics
The trained models for weed detection were tested on the test subset and the three additional small datasets mentioned in Section 2.2.2. The number of weeds per image was manually counted and reported as a ground truth value. With these data, precision (P) and recall (R) were used as the evaluation metrics for weed detection. These model evaluation metrics are defined as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where TP denotes the true positives (when the algorithm correctly detects weeds with a bounding box); FP denotes the false positives (when the algorithm computes a bounding box in a location without weeds); and FN denotes the false negatives (when a target weed is not detected). The IoU between the bounding box produced by the detection and the ground truth is calculated. For each image, if the IoU is over a predetermined threshold (0.5 in this study), a TP is produced; otherwise, the result is an FP. As mentioned in Section 2.1, the trained model outputs each detection as bounding box coordinates and a confidence score (the model's confidence regarding each detection performed). The area under the precision-recall curve represents the average precision (AP).
$$AP = \int_{0}^{1} P(R)\, dR$$
AP is a number between 0 and 1 that summarizes the precision values obtained as a function of recall. Furthermore, the mean average precision (mAP) is used to evaluate a model and is obtained by averaging the AP over all classes:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where N is the number of classes and AP_i is the AP of class i (N = 1 in this study, since only the 'Weeds' class was considered).
Generally, two mAP values are reported, based on different IoU thresholds: mAP_0.5, which is the mAP computed at an IoU threshold of 0.5, and mAP_0.5:0.95, which is the mean of the mAP values computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05. Precision (P), recall (R), mean average precision at 0.5 (mAP_0.5), and mean average precision at 0.5-0.95 (mAP_0.5:0.95) are considered the most common metrics when evaluating object detectors [25].
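To make the metric definitions in this section concrete, the sketch below counts TP, FP, and FN for one image by matching detections to ground-truth boxes, derives P and R, and integrates a precision-recall curve into an AP value. It is a simplified illustration, not the evaluation code of the YOLO frameworks: the greedy one-to-one matching, the all-point interpolation of the curve, the ten IoU thresholds from 0.5 to 0.95 (the common COCO-style convention, assumed here), and all boxes and function names are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(detections: List[Tuple[Box, float]], ground_truth: List[Box],
                   iou_threshold: float = 0.5) -> Tuple[int, int, int]:
    """Match detections (box, confidence) to ground truth, best confidence first."""
    matched = set()
    tp = fp = 0
    for box, _conf in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_gt = 0.0, None
        for gt_idx, gt_box in enumerate(ground_truth):
            if gt_idx in matched:
                continue  # each ground-truth weed can be matched only once
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_gt = overlap, gt_idx
        if best_gt is not None and best_iou > iou_threshold:
            matched.add(best_gt)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(ground_truth) - len(matched)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area under the precision-recall curve; recalls must be increasing."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):  # enforce a non-increasing envelope
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical image: three annotated weeds, three detections (one spurious).
ground_truth = [(10, 10, 50, 50), (100, 100, 140, 140), (200, 200, 240, 240)]
detections = [((12, 12, 52, 52), 0.95), ((98, 102, 138, 142), 0.80),
              ((300, 300, 340, 340), 0.60)]
tp, fp, fn = count_tp_fp_fn(detections, ground_truth)
precision, recall = precision_recall(tp, fp, fn)
ap = average_precision([0.5, 1.0], [1.0, 0.5])  # toy curve, not study data
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(tp, fp, fn, precision, recall, ap, iou_thresholds)
```

Repeating the matching at each threshold in `iou_thresholds` and averaging the resulting AP values would give an mAP_0.5:0.95-style score; stricter thresholds turn loosely localized detections into false positives, which is why that score is always at most the mAP_0.5 value.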
The model performance metrics (P, R, mAP_0.5, and mAP_0.5:0.95) obtained from the five-fold cross-validation were subjected to a one-way ANOVA using the statistical software R [55]. Normality was assessed using the Shapiro-Wilk test and homoscedasticity using Levene's test, with the 'car' package [56]. Pairwise comparisons and mean separation were performed with a Tukey HSD post hoc test (FDR-adjusted p-value) using the 'scmamp' package [57]. Figure 2 summarizes the framework of the current study.
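As a rough illustration of the one-way ANOVA applied to the cross-validation scores, the sketch below computes the F statistic from per-model groups of fold scores. The study performed this analysis in R; the Python function, its name, and the toy values are assumptions for illustration only, and the normality, homoscedasticity, and post hoc steps are not reproduced.

```python
from typing import Dict, List, Tuple


def one_way_anova(groups: Dict[str, List[float]]) -> Tuple[float, int, int]:
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their group mean
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy scores for three hypothetical models (not results from this study).
toy_scores = {
    "model_a": [1.0, 2.0, 3.0],
    "model_b": [2.0, 3.0, 4.0],
    "model_c": [5.0, 6.0, 7.0],
}
f_stat, df_between, df_within = one_way_anova(toy_scores)
print(f_stat, df_between, df_within)
```

A large F value relative to the F distribution with these degrees of freedom indicates that at least one model's mean score differs from the others, which is what motivates the subsequent pairwise comparisons.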

Results and Discussion
The analysis of variance revealed significant differences in the models' performance metrics (Table 3). Among these models, YOLOv5s, YOLOv6n, and YOLOv6l achieved the highest precision (P) scores, with values of 0.9445 ± 0.0281, 0.9456 ± 0.0146, and 0.9414 ± 0.023, respectively. Notably, these three models exhibited significantly higher precision scores than EfficientDet, which yielded the lowest precision score of 0.9033 ± 0.0244 (p < 0.05). Conversely, no significant differences in precision scores were observed among the other models. Regarding the recall (R) metric, YOLOv7 yielded the best results with a score of 0.9552 ± 0.0136. No significant differences were found between YOLOv7 and YOLOv7x, as well as all scales of the YOLOv8 models. However, YOLOv7 differed significantly from EfficientDet (p < 0.01) and from all scales of the YOLOv5 and YOLOv6 models (p < 0.001, except for YOLOv5n with p < 0.05). For the mean average precision at an intersection over union (IoU) threshold of 0.5 (mAP_0.5), YOLOv8 and YOLOv7 demonstrated the highest performance with scores of 0.9594 ± 0.0214 and 0.955 ± 0.0263, respectively. No significant differences were observed between these models and all scales of YOLOv6, YOLOv7x, and YOLOv8 (except for YOLOv8s with p < 0.05), as well as YOLOv5s and YOLOv5l. Conversely, all other models exhibited significantly lower mAP_0.5 values, with EfficientDet performing the worst (p < 0.01, with a score of 0.8931 ± 0.0312). In terms of the mAP_0.5:0.95 metric, all YOLOv5 models (except YOLOv5n with p < 0.001) achieved the best results, ranging from 0.8841 ± 0.0795 for YOLOv5m to 0.8606 ± 0.0442 for YOLOv5l. YOLOv8l exhibited a significantly lower mAP_0.5:0.95 score (p < 0.01) compared to other YOLOv8 models, but it was the best-performing model within the YOLOv8 series, achieving a score of 0.8043 ± 0.015. No significant differences were found between YOLOv8l and other YOLOv8 models, as well as YOLOv7. Notably, all other models displayed significantly lower mAP_0.5:0.95 scores, with YOLOv5n performing the worst (p < 0.001) at a value of 0.7002 ± 0.012.
According to the results of the five-fold cross-validation experiment, ten models (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv6n, YOLOv6l, YOLOv7, YOLOv7x, YOLOv8s, YOLOv8l, and YOLOv8x) were trained on the public 'Weeds' dataset for 100 epochs (Table 2), and the best-performing model of each YOLO version was selected to be tested on the four different test datasets. The performance of the ten models is summarized in Figure 3. In general, the YOLOv6 model scales showed a different trend during the training process, since this model provides an evaluation at the first epoch, the 20th, the 40th, and every three epochs starting from the 50th. Among the YOLOv5 model scales, YOLOv5s obtained the highest P (0.9485), while YOLOv5m obtained the best R (0.9761), mAP_0.5 (0.9783), and mAP_0.5:0.95 (0.7811). Among the YOLOv6 models, YOLOv6n obtained the best P (0.9646) and the best mAP_0.5 (0.9651), while YOLOv6l obtained the best R (0.8893) and the best mAP_0.5:0.95 (0.7421). YOLOv7 surpassed YOLOv7x for all the metrics, with a P of 0.9466, R of 0.9663, mAP_0.5 of 0.9758, and mAP_0.5:0.95 of 0.7672. Among the YOLOv8 model scales, YOLOv8x obtained the best P and R values (0.95463 and 0.9761, respectively), while YOLOv8l obtained the best mAP_0.5 and mAP_0.5:0.95 (0.9775 and 0.8129, respectively). Based on these results, YOLOv5m, YOLOv6l, YOLOv7, and YOLOv8l were tested on the four different test datasets. The results of this trial are reported in Table 4.
a Inference time refers to the average time needed for the model to detect weeds on a single digital image.
Table 4 shows the results of the YOLO models' detection on the four test datasets (the confusion matrix is reported in Table A1 of Appendix A). All the tested models performed better on the public 'Weeds' dataset than on the additional test datasets. For this test, YOLOv8l obtained the highest P (0.9476), the best mAP_0.5 (0.9795), and the best mAP_0.5:0.95 (0.8123), while YOLOv5m obtained the highest R (0.9663). The inference time on this test dataset was approximately 34 ms per image for YOLOv8l and 16.2 ms for YOLOv5m. When performing inference on the Home Lawn dataset, YOLOv6l obtained the best P (0.7836), while YOLOv7 obtained the best R (0.6454), mAP_0.5 (0.7108), and mAP_0.5:0.95 (0.5209). The inference time required for this dataset was approximately 32.5 ms for YOLOv6l and 29.1 ms for YOLOv7. The best model on the Baseball Field dataset was YOLOv5m, with the best P (0.6856), R (0.8126), mAP_0.5 (0.7135), and mAP_0.5:0.95 (0.4716). The inference time required was approximately 24.1 ms for YOLOv5m. For the Manila grass dataset, the best P was obtained by YOLOv8l (0.7635), and the best R was obtained by YOLOv7 and YOLOv6l (both models had an R of 0.7571). YOLOv8l obtained the highest mAP_0.5 and mAP_0.5:0.95 (0.7589 and 0.5296, respectively). Yu et al. [18] trained and tested multiple types of two-stage detectors to detect different weeds among actively growing perennial ryegrass, obtaining higher values of R (>0.98). Low R values suggest that the model misclassifies target weeds as turfgrass, thus producing an FN. This is unacceptable for field applications since weeds would be missed, leading to unsatisfactory weed control in turfgrass. Figure 4 shows an example of weed detection by YOLOv5m, YOLOv6l, YOLOv7, and YOLOv8l on the test datasets.
As shown in Figure 4, the model effectively detected weeds at a mature stage (>5 true leaves) growing out of the turfgrass canopy. The training dataset includes multiple weed species, most of which are rosette-forming weeds growing in turfgrasses. For this reason, the model showed high performance in situations similar to those occurring in the public 'Weeds' dataset. These findings are in accordance with [20]. In that research, the authors argued that detection is highly affected by the morphology and leaf pattern of broad-leaved weeds and by color variations among species and within the same species. The authors proposed that multiple-species neural network training and images gathered from different geographical regions (to include various turf sites and weed biotypes) may be beneficial for the overall accuracy of weed detection models. However, the results obtained from experiments conducted on different datasets suggest that efforts are still required to improve the overall accuracy. Benjumea et al. [58] proposed an improved YOLO architecture that allows a more efficient detection of smaller objects. The detection of smaller objects was improved by modifying the architecture structure and fine-tuning parameters, which increased mAP_0.5 by approximately 7% without significantly affecting the inference time. Moreover, detection models can fail to detect weeds close to the image edge. Yu et al. [19] claim that the edge effect may be reduced by continuous frame inputs (since in field applications, weed detection is based on videos), thus boosting detection accuracy. In this trial, all the models were able to detect weeds at the edge of the image frames. Furthermore, highly complex backgrounds such as low-quality turf may increase the computational complexity of feature extraction and reduce the R of the model [19]. In this trial, only the YOLOv5m model agrees with this finding. Indeed, YOLOv5m obtained its lowest R when detecting weeds in the Home Lawn dataset, which had the most complex background. This limitation may be overcome by increasing the number of training images. Zhuang et al. [28] obtained similarly low P and R values when using YOLOv3 for R. scabra detection in drought-stressed and unstressed turfgrasses. In that research, the authors argued that high background variability in the training dataset causes less efficient feature extraction and consequently decreases the P and R metrics. For this reason, the authors suggest further research on training object detectors on images with the simplest backgrounds. Additionally, the annotation method used in this trial consisted of drawing bounding boxes around the weeds within the image, which is not the method with the highest resolution. Indeed, Sharpe et al. [59] demonstrated that higher-resolution annotation methods improve the overall neural network accuracy. Moreover, artificial neural networks recognize plants using color, texture, and shape features [60], and Hahn et al. [61] found that multispectral components are highly effective in broadleaf weed detection in turfgrass. Thus, further research is needed to assess and clarify how these techniques and methods can improve detector performance. According to Yang et al. [62], high image processing speed is imperative for real-time weed detection and treatment. Any actuators for weed control would have only a few seconds to detect weeds by processing images and to deliver the treatments.
The obtained results revealed disparities in the required inference time among the different models. Specifically, YOLOv5m exhibited an efficient inference time, taking less than 20 ms for all the datasets except for the Baseball Field dataset, where it required approximately 24 ms. YOLOv6l and YOLOv7 showed a similar trend, achieving detection in less than 30 ms for the public 'Weeds', Home Lawn, and Manila grass datasets, while requiring a longer inference time for the Baseball Field dataset. For YOLOv5m, YOLOv6l, and YOLOv7, the average mAP_0.5 values were 0.72, 0.73, and 0.74, respectively, with corresponding average mAP_0.5:0.95 values of 0.55, 0.55, and 0.54. Notably, YOLOv5m and YOLOv7 consistently achieved detections in less than 30 ms, while YOLOv6l had an average inference time of 33 ms (Figure 5). Conversely, YOLOv8l exhibited a contrasting trend, requiring more than 30 ms for inference on all test datasets except for the Baseball Field dataset, where it required less than 20 ms. YOLOv8l had an average mAP_0.5 of 0.76 and an average mAP_0.5:0.95 of 0.56, with an average inference time of 32 ms. On the other hand, EfficientDet exhibited the lowest average mAP_0.5 and mAP_0.5:0.95 (0.67 and 0.49, respectively) and the highest average inference time (50 ms). The relationship among inference time, models, and test datasets is not straightforward; additional studies should be conducted to explore and clarify the dynamics among these factors.
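To relate the reported per-image inference times to a real-time processing rate, the average times can be converted into frames per second. The sketch below uses the average inference times stated above for YOLOv6l, YOLOv8l, and EfficientDet; treating throughput as the simple reciprocal of inference time is an assumption that ignores image loading, pre- and post-processing, and actuator latency.

```python
def frames_per_second(inference_ms: float) -> float:
    """Upper-bound frame rate if inference were the only per-image cost."""
    return 1000.0 / inference_ms


# Average inference times (ms per image) as reported in the text.
average_inference_ms = {"YOLOv6l": 33.0, "YOLOv8l": 32.0, "EfficientDet": 50.0}
fps = {name: frames_per_second(ms) for name, ms in average_inference_ms.items()}
for name, rate in fps.items():
    print(f"{name}: {rate:.1f} FPS")
```

Under this simplification, the roughly 30 ms models correspond to about 30 frames per second and EfficientDet to 20 frames per second.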

Conclusions
Achieving satisfactory weed detection in turfgrass with models trained on digital images poses significant challenges. In this study, different YOLO object detectors were trained and tested for weed detection in turfgrasses, considering different weed species as a single class ('Weeds'). Among the tested models, YOLOv8l demonstrated the highest overall performance on the test dataset, achieving a precision of 0.9476, mAP_0.5 of 0.9795, and mAP_0.5:0.95 of 0.8123. Despite YOLOv8l's high performance, results on the additional test datasets were not acceptable for professional use. Several obstacles therefore still hinder accurate weed detection, emphasizing the need for more in-depth research. To enhance performance, future investigations should explore weed detection algorithms that incorporate multiple vegetative indices and features. Additionally, alternative annotation techniques, such as instance segmentation, should be compared with the more conventional bounding box-based object detection to determine whether they improve weed identification. Moreover, a broad spectrum of weed species and ecotypes should be included in the training and testing of weed detection algorithms to ensure accurate performance in turfgrass scenarios. In conclusion, the findings of this study underscore the challenges of weed detection in turfgrass using models trained on digital images. Further research is needed to address the identified limitations and advance weed detection in turfgrass through enhanced algorithms, alternative annotation techniques, and broader inclusion of weed species and ecotypes.


Figure 1. Weed images and annotations randomly retrieved from the 'Weeds' public dataset and the three additional test datasets (Home Lawn, Baseball Field, and Manila grass).

Figure 2. Trial framework for model training and evaluation in different turfgrass scenarios.

Figure 5. YOLO models and EfficientDet mean average precision (mAP) and inference time trade-off. Data were averaged among the four test datasets.

Table 3. Results of the five-fold cross-validation test of the different models studied.
* Values refer to the mean and standard deviation of model performance in the five-fold cross-validation. ** Different letters in the same column represent different values at p < 0.05.

Table 4. Performance and inference time of the best YOLO model scales for the four studied YOLO versions (YOLOv5m, YOLOv6l, YOLOv7, and YOLOv8l) and EfficientDet.