Article

Determination of Optimal Dataset Characteristics for Improving YOLO Performance in Agricultural Object Detection

Department of Bio-Industrial Machinery Engineering, Pusan National University, Miryang 50463, Republic of Korea
*
Author to whom correspondence should be addressed.
Agriculture 2025, 15(7), 731; https://doi.org/10.3390/agriculture15070731
Submission received: 17 January 2025 / Revised: 21 March 2025 / Accepted: 25 March 2025 / Published: 28 March 2025

Abstract

Recent advances in artificial intelligence and computer vision have led to significant progress in the use of agricultural technologies for yield prediction, pest detection, and real-time monitoring of plant conditions. However, collecting large-scale, high-quality image datasets in the agriculture sector remains challenging, particularly for specialized datasets such as plant disease images. This study analyzed the effects of the image size (320–1280) and the number of labels on the performance of a YOLO-based object detection model using diverse agricultural datasets for strawberries, tomatoes, chilies, and peppers. Model performance was evaluated using the intersection over union (IoU) and average precision (AP), where the AP curve was smoothed using the Savitzky–Golay filter and an empirical exponential model (EEM). The results revealed that increasing the number of labels improved the model performance up to a certain point, after which the gains gradually diminished. Furthermore, while increasing the image size from 320 to 640 substantially enhanced the model performance, additional increases beyond 640 yielded only marginal improvements. However, the training time and graphics processing unit usage scaled linearly with the image size, as larger images require greater computational resources. These findings underscore the importance of an optimal strategy for selecting the image size and label quantity under resource constraints in real-world model development.

1. Introduction

In recent years, advancements in artificial intelligence (AI) and computer vision have rapidly progressed, significantly impacting the agricultural sector. For instance, artificial neural networks have been used to predict crop yields based on soil and weather parameters [1]. Additionally, pH and humidity sensor data have been used to forecast crop growth statuses and determine optimal harvest times [2]. Convolutional neural networks (CNNs) are primarily used to process two-dimensional image data and can be extended to analyze multispectral or hyperspectral images. Technologies are being developed to analyze hyperspectral images of plant leaves to classify diseases in specific crops [3], thereby enhancing agricultural monitoring systems and enabling real-time monitoring of plant health.
By integrating these methods with the Internet of Things for data collection and processing, crop growth condition optimization can be made feasible [4]. Before the introduction of such technologies in traditional agricultural settings [5], experts visited the fields in person. They visually inspected the crop conditions and collected the data necessary to predict yields or assess crop health [6]. These on-site inspections are essential for detecting pests and diseases or evaluating crop growth. However, inspections depend on the subjective judgment of local specialists [7]. Consequently, effectively managing large agricultural areas or maintaining consistent outcomes under diverse environmental conditions has proven to be inefficient.
AI and big data technologies are being used in agricultural settings to evaluate crop conditions and predict yields quickly and accurately. Models that leverage visual information (such as image data) facilitate faster and easier crop health assessment and disease detection, and AI models based on image data have the potential to accurately analyze and predict crop conditions. To leverage these technologies effectively, large quantities of high-quality image data are required during the training phase [8]. In specialized fields such as agriculture, however, gathering a diverse range of data is often difficult and expensive, making it challenging to secure sufficient data. Plant disease data, in particular, are not readily available from ordinary farms because specific diseases often must be induced artificially in crops [9], restricting data collection to research laboratories or specialized institutions. Therefore, data reflecting the wide range of environments and conditions found in actual farming settings are scarce.
To address data scarcity, researchers have studied data augmentation methods that can enhance AI model performance, even with small datasets. Classical augmentation techniques include image rotation, resizing, flipping, brightness adjustments, and noise addition. These techniques transform image data in multiple ways to increase the diversity of the training data, prevent overfitting, and ultimately improve the generalization performance of the model. For example, Nesteruk et al. employed horizontal flipping and 90 degree rotation to enhance the data diversity in small datasets, thereby improving the accuracy of pest detection models [10]. Kodors et al. [11] emphasized the importance of mosaic data augmentation in agricultural environments. They utilized the YOLOv5m model on images of apples, pears, and early-stage pear fruits. Experimental results indicated that mosaic augmentation improved the accuracy by an average of 4.38% in terms of mAP@0.5:0.95, outperforming the image shuffle method, particularly when applied to different fruit growth stages or species. Additionally, Zou et al. [12] proposed a synthetic image-based augmentation method to address the challenges of image labeling and data scarcity in complex agricultural fields. Their approach involved independently segmenting crops, weeds, and soil and subsequently combining them to create synthesized field images. This method achieved an IoU of 0.98 in object detection tasks using YOLOv5.
Recently, generative adversarial networks (GANs) have been used to generate artificial training data for AI models. As GANs can generate new data that closely resemble existing data, they are particularly useful for supplementing data that are difficult to gather in real-world settings. Abbas et al. [13] proposed a method using a conditional GAN to generate tomato leaf disease images for dataset augmentation. Synthetic images were combined with real images during model training to prevent performance degradation caused by data scarcity and significantly improve the accuracy. Zhou et al. [14] employed a fine-grained GAN to identify lesions on grape leaves and augment localized lesion data with limited early-stage grape leaf lesion images. The generated local lesion region images were used as inputs for a deep-learning model, and the generalization capability, prediction accuracy, and robustness of the classification model were enhanced. Fawakherji et al. introduced a GAN-based multispectral data augmentation method aimed at improving crop and weed segmentation performance. They employed a deep convolutional GAN (DCGAN) to generate diverse crop shapes and a conditional GAN (cGAN) to synthesize textures for crops and backgrounds. By selectively replacing portions of original images with synthetic patches, they effectively mitigated data imbalance issues while maintaining realism. This method enhanced segmentation performance, improving the mean intersection over union (mIoU) by up to 19% compared with training with original data alone, with further improvements noted when multispectral (RGB + NIR) data were used [15]. Numerous studies have sought to boost AI model performance in agriculture through image augmentation; this underscores the critical role that image augmentation plays in optimizing model performance with data scarcity.
However, GAN-based augmentation methods still fall short of matching the quality and diversity of the original images. GAN-generated images often fail to capture fine textures or complex structures and are therefore less effective in model training than authentic data. Liu et al. [16] reported that existing GAN detectors show high accuracy for images generated by specific GAN models, but their performance drops when dealing with new GAN models or images with added noise. Tan et al. [17] pointed out that, unlike real images, GAN-generated images can contain artifacts, owing to constraints in reproducing delicate textures and intricate structures. These artifacts can reduce the ability of a model to generalize to new data. Therefore, GAN-based image augmentation can be less beneficial than using the original images, yet obtaining enough high-quality original images in agricultural settings is often difficult.
Therefore, this study compares the performance of the You Only Look Once (YOLO) object detection model on agricultural image datasets with varying numbers of labels. A dataset optimization strategy that maintains adequate performance even with fewer labels is proposed. Some datasets used in this study were publicly available, and others were newly collected. The public datasets include those from AI-Hub (aihub.or.kr), Kaggle (kaggle.com), Roboflow (roboflow.com), and StrawDI (“StrawDI Dataset”, 2020) [18], which provide both bounding box (bbox) and instance segmentation annotations along with the images. The National Information Society Agency operates AI-Hub in Korea and offers a range of datasets to promote AI technology development and big data-driven research. Kaggle is a global platform on which datasets can be uploaded and downloaded. Newly collected images were obtained from the Seongju Oriental Melon and Vegetable Research Institute of the Gyeongsangbuk-do Agricultural Research and Extension Services. These were annotated using the automated annotation features of the Computer Vision Annotation Tool (CVAT; CVAT.ai). This study aimed to enhance data labeling efficiency in the agriculture sector to facilitate AI utilization while proposing an optimal number of labels that balances data collection and labeling resource requirements with satisfactory model performance. Thus, this study aims to contribute significantly to the broader adoption of AI-based technologies in agricultural settings.

2. Materials and Methods

2.1. Background

2.1.1. Optimal Dataset

A wide range of object detection models for various crops has been developed in the agriculture domain, and the field evolves continuously, owing to advances in diverse model architectures. These have been developed for monitoring crop conditions, detecting pests, and determining harvest times, among other applications, by leveraging AI and deep learning technologies. However, most research has focused on using all of the available data, and where data are scarce, researchers make the most of whatever is available. Although the quality of data is often curated to a certain extent, relatively few studies have analyzed or optimized the size and composition of datasets for model performance.

2.1.2. YOLO Object Detection Model

In this study, the YOLO model was selected as an appropriate deep learning-based object detection model from the perspective of data optimization. With the advancement of deep learning, various models utilizing convolutional neural networks (CNNs), such as R-CNN and Faster R-CNN, have been proposed for object detection, achieving high accuracy. Additionally, models such as the Single-Shot MultiBox Detector (SSD) were developed, aiming for rapid detection speeds. Against this background, the YOLO series has drawn considerable attention due to its balanced approach between accuracy and speed, effectively reducing computational times. Initially, YOLO versions (v1~v4) were developed based on the Darknet framework. Subsequently, Ultralytics introduced the PyTorch-based YOLOv5 (requiring PyTorch ≥ v. 1.7), significantly improving accessibility and usability. Later models introduced by Ultralytics, including YOLOv8, v11, and v12, have been provided as libraries, enabling researchers to train and build models through simple import operations. In this context, this subsection analyzes interest trends across different YOLO versions.
Google Trends data from January 2016 to January 2025 was analyzed to examine the popularity of early YOLO versions (v1~v4) compared to YOLOv5 (Figure 1a). After YOLOv5 emerged, interest in earlier versions declined sharply or stagnated. The combined interest of these versions dropped below approximately 30% of the total. Furthermore, analysis from January 2020 to January 2025 showed that models released after YOLOv5, such as YOLOv8 and v11, quickly gained attention upon release (Figure 1b). These newer models rapidly overtook the interest levels of previous versions. This demonstrates a clear generational shift. This phenomenon indicates that the YOLO series, particularly the latest models by Ultralytics, continues to achieve continuous performance improvements and rapid adoption in research environments. Consequently, this study selected the YOLO model due to its high accessibility and ongoing enhancements in performance. Our goal was to investigate data optimization and improvements in model performance.
Next, trends in existing research utilizing YOLO were examined in greater detail. Table 1 summarizes previous studies applying YOLO models specifically within the agricultural sector, focusing on the types and sizes of datasets, models employed, and key findings. This summary aims to elucidate the influence of the dataset size and composition on model performance in agriculture through practical research examples, thereby clarifying the necessity of further research on data optimization.

2.2. Computing Environment

This research was conducted across two computing environments. Table 2 presents the specifications of the systems used for training and testing the object detection models, while Table 3 outlines the specifications of the Jetson Orin Nano environment used for additional testing.

2.3. Data Collection and Preprocessing

Newly captured images and publicly available datasets pertaining to strawberries, tomatoes, and chilies were used in this study. Public datasets were collected from the AIHub [37,38], Kaggle [39], Roboflow [40], and the Strawberry Digital Images Dataset (StrawDI) from the Science and Technology Research Centre at the University of Huelva. New strawberry images were captured by a data curator at the “Gyeongsangbuk-do Agricultural Research and Extension Services, Seongju Oriental Melon and Vegetable Research Institute” using a Galaxy S22 (Samsung Electronics, Republic of Korea). Figure 2 shows examples of the collected strawberry, tomato, chili and pepper datasets.
Some portions of the collected data were provided in JSON format, containing details on the image capture, status, environmental conditions, and annotations. In this study, the bbox and segmentation information were extracted and analyzed. First, the bbox coordinates in the original annotation files were normalized based on the dimensions of each image. Specifically, the raw coordinate values were divided by the width and height of the image and transformed into values between 0 and 1. These were then saved in YOLO format, which includes the class label, center coordinates (x_center and y_center), width, and height. The segmentation data were converted into bbox format as necessary to satisfy the YOLO requirements. Newly captured images were labeled using CVAT. After collection and preprocessing, these datasets were used to train the YOLO-based object detection models (Figure 3 and Table 4).
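To make the conversion concrete, the following is a minimal sketch (not the authors' code) of how a per-image annotation file could be normalized into a YOLO text label. The JSON keys (width, height, annotations, bbox) are hypothetical placeholders, since the exact schema of each public dataset differs.

```python
import json
from pathlib import Path

def convert_to_yolo(json_path: Path, out_dir: Path, class_id: int = 0) -> None:
    """Write a YOLO label file (class x_center y_center width height, all 0-1)
    from a hypothetical per-image annotation JSON with pixel-coordinate bboxes."""
    ann = json.loads(json_path.read_text(encoding="utf-8"))
    img_w, img_h = ann["width"], ann["height"]            # assumed keys
    lines = []
    for box in ann["annotations"]:                        # assumed key
        x_min, y_min, w, h = box["bbox"]                  # assumed [x, y, w, h] in pixels
        x_c = (x_min + w / 2) / img_w                     # normalized center x
        y_c = (y_min + h / 2) / img_h                     # normalized center y
        lines.append(f"{class_id} {x_c:.6f} {y_c:.6f} {w / img_w:.6f} {h / img_h:.6f}")
    (out_dir / f"{json_path.stem}.txt").write_text("\n".join(lines), encoding="utf-8")
```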

2.4. Data Analysis and Preprocessing

In this study, the data for each crop were organized and preprocessed to facilitate efficient model training. All image annotations were initially provided in JSON format and then converted into bbox coordinates following the YOLO format (class, x_center, y_center, width, and height). The converted coordinates were saved as text files with the same filenames as their corresponding image files. To ensure consistency throughout the dataset, all bbox coordinates were normalized to values between zero and one. Additionally, label statistics were compiled for each dataset to examine the number of objects per class and the size distribution.
The strawberry dataset consisted of 3386 images and 23,359 labels, each with an average pixel count of roughly 1.39 million. On average, the bbox area occupied 1.31% of the image, ranging from a minimum of 0.0001% to a maximum of 19.19% (Figure 4a and Figure 5a). The mean hue of the entire image was 0.27, suggesting a predominance of green tones, whereas the mean hue within the bboxes was 0.19, indicating stronger red tones. This discrepancy implies that green elements probably dominated the background, whereas red elements (such as strawberries) were more pronounced within the bboxes. A histogram (Figure 6a) further illustrates this point. While green hues dominated the entire image, a concentrated red hue appeared within the bboxes.
The tomato dataset included 804 images and 9777 labels, each with an average pixel count of approximately 12.68 million. The average bbox area ratio was 1.40%, with values ranging from 0.0043% to 22.15% (Figure 4b and Figure 5b). The mean hue of the entire image was 0.22, being dominated by red tones, whereas the bbox region exhibited a mean hue of 0.19, indicating an even stronger red tone. This suggests that the overall hue was slightly elevated by the background. However, the natural red color of the tomatoes was emphasized within the bboxes. As shown in the histogram (Figure 6b), the entire image and bboxes were dominated by red, although the hue distribution within the bboxes was more narrowly concentrated on red tones.
The chili dataset contained 682 images and 2258 labels, each with an average pixel count of approximately 11.4 million. The average bbox area ratio was approximately 6.35%, ranging from 0.0390% to 49.43% (Figure 4c and Figure 5c). Hue, saturation and value (HSV) color distribution analysis indicated that the average hue of the entire image and bbox region was approximately 0.25, suggesting that the chilies shared a similar color range with the background. The histogram (Figure 6c) reinforces this observation, showing that the hue distributions for the entire image and those within the bboxes nearly overlapped.
Finally, the pepper dataset contained 619 images and 5324 labels, each with an average pixel count of approximately 0.92 million. The average bbox area ratio was approximately 0.6%, ranging from 0.0099% to 18.41% (Figure 4d and Figure 5d). Analysis of the hue, saturation, and value (HSV) color distribution showed that the average hue of the entire image area was about 0.33, while the average hue of the bbox area was about 0.35. However, the histogram (Figure 6d) shows that the hue distribution within the bboxes was spread out, indicating that the boxes contained a wide range of colors beyond the green and red tones noted above.
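As an illustration of how the whole-image and in-bbox hue statistics above can be computed, the sketch below compares mean hue values using OpenCV, with hue rescaled from OpenCV's 0–179 range to 0–1. It assumes YOLO-format label files and is not the authors' analysis script.

```python
import cv2
import numpy as np

def mean_hue(image_path: str, label_path: str):
    """Return (mean hue of the whole image, mean hue inside its YOLO bboxes),
    both rescaled to the 0-1 range used in the analysis above."""
    img = cv2.imread(image_path)
    h_img, w_img = img.shape[:2]
    hue = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[:, :, 0] / 179.0  # OpenCV hue is 0-179
    box_means = []
    for line in open(label_path):
        _, xc, yc, w, h = map(float, line.split())
        x0, x1 = max(0, int((xc - w / 2) * w_img)), min(w_img, int((xc + w / 2) * w_img))
        y0, y1 = max(0, int((yc - h / 2) * h_img)), min(h_img, int((yc + h / 2) * h_img))
        box_means.append(hue[y0:y1, x0:x1].mean())
    return float(hue.mean()), float(np.mean(box_means)) if box_means else None
```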

2.5. Dataset Splitting Strategy

The collected data were divided into training, validation, and test datasets for the model training and evaluation. The dataset splitting criteria and sizes were systematically determined by comprehensively considering various factors influencing object detection performance.
Recent studies have indicated that the object size, measured as the bbox area, significantly affects the recognition performance of detection models. In particular, small objects exhibit limited pixel coverage, resulting in insufficient feature extraction and consequently worse detection accuracy [41]. Microsoft’s Common Objects in Context (COCO) dataset categorizes objects as small (area < 32²), medium (32² ≤ area < 96²), and large (area ≥ 96²) and separately evaluates performance based on these categories [42]. Furthermore, performance evaluations using YOLOv8 revealed notable differences in the average precision scores between small, medium, and large objects. These differences were attributed to imbalances in the bbox sizes within the datasets, where large objects were significantly more prevalent compared with small- and medium-sized objects [43]. Considering the standards of the COCO dataset and methodologies from the recent literature, this study designed a systematic approach to data splitting reflecting these characteristics.
Specifically, data were categorized into three groups (small, medium, and large) based on the total bbox area obtained by summing the areas of all bboxes in each image. The dataset was evenly divided into thirds according to the distribution of bbox areas; the lower third was classified as small, the middle third was medium, and the upper third was large. For example, in the chili dataset, images with a total bbox area of 0.1387 or less were categorized as small, those exceeding 0.1387 but not surpassing 0.2347 were medium, and those exceeding 0.2347 and up to 0.9446 were large. Similar criteria were applied to the strawberry and tomato datasets (Table 5).
Further adjustments were made to ensure even distribution of labels (objects) in each image. Along with bbox–based grouping, the number of labels per image was calculated, and the label counts were balanced across the training, validation, and test sets. Specifically, for each group (small, medium, and large), the label count distribution was considered to minimize variations when splitting the data [42,44,45,46]. This step prevented any single set from being overpopulated by images containing numerous labels. Ultimately, the training dataset included at least 1400 labels, the validation dataset included at least 600 labels, and the remaining data were assigned to the test set (Table 6 and Figure 7).
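The grouping criterion can be summarized with a short sketch: the total normalized bbox area per image is computed from its YOLO label file, and images are then split into small/medium/large tertiles. The subsequent label-count balancing across training, validation, and test sets is omitted here; this illustrates the criterion described above rather than the authors' implementation.

```python
import numpy as np

def group_by_total_bbox_area(label_files):
    """Assign each YOLO label file to a small/medium/large tertile based on
    the summed normalized bbox area of its image."""
    totals = {}
    for path in label_files:
        area = 0.0
        for line in open(path):
            _, _, _, w, h = map(float, line.split())
            area += w * h                          # normalized area of one bbox
        totals[path] = area
    q1, q2 = np.quantile(list(totals.values()), [1 / 3, 2 / 3])
    groups = {"small": [], "medium": [], "large": []}
    for path, area in totals.items():
        key = "small" if area <= q1 else ("medium" if area <= q2 else "large")
        groups[key].append(path)
    return groups
```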
This dataset splitting strategy, which systematically considers object sizes and object count balance, was expected to enhance the model’s generalization performance and reliability, particularly improving detection accuracy for small objects.

2.6. Model Training Method

The Ultralytics YOLO model was used to conduct model training. To prevent data imbalances and optimize model performance, a data splitting method based on the number of labels was used. The number of objects in each label file was calculated, and the data were dynamically selected to match the target label count. They were then split into training (70%) and validation (30%) sets. This process was repeated for various label counts, yielding 300 datasets. The label count started at 5 and increased in steps of 5 up to 1000 and in steps of 10 from 1000 to 2000 (e.g., 1010, 1020, and 1030). This approach facilitated an analysis of how the training performance of the model varied with the number of labels. Every training session used the same settings, running 200 epochs with a batch size of 16. In addition, four image sizes (320, 640, 960, and 1280) were tested to compare model performance. The data required for training were generated dynamically by calculating the number of objects in each label file. The training and validation sets were selected based on a specified label count before being copied to the YOLO training path. Once training was completed, the data that had been used were cleared, and memory was managed by calling gc.collect().
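A condensed sketch of this training loop using the Ultralytics API is given below. The helper build_subset() is hypothetical; it stands in for the step that copies images and labels until the target label count is reached and writes a YOLO data.yaml.

```python
import gc
from ultralytics import YOLO

def build_subset(n_labels: int, train_ratio: float = 0.7) -> str:
    """Hypothetical helper: select images/labels until n_labels objects are
    included, split 70/30 into train/val, and return the data.yaml path."""
    raise NotImplementedError("dataset-specific; see Section 2.6")

# Label counts: 5-1000 in steps of 5, then 1000-2000 in steps of 10 (300 subsets).
label_counts = list(range(5, 1001, 5)) + list(range(1010, 2001, 10))
image_sizes = [320, 640, 960, 1280]

for imgsz in image_sizes:
    for n_labels in label_counts:
        data_yaml = build_subset(n_labels)
        model = YOLO("yolov8n.pt")
        model.train(data=data_yaml, epochs=200, batch=16, imgsz=imgsz)
        del model
        gc.collect()  # release memory before the next run, as described above
```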

2.7. Model Performance Evaluation Based on AP Metrics

Model performance was evaluated using a test dataset; the accuracy was measured by comparing the predicted bboxes against the ground-truth bboxes based on the intersection over union (IoU) criterion. Object detection was performed using YOLOv8 and v11, and the precision and recall of each class were calculated to assess the overall performance of the model.

2.7.1. Calculation of IoU and Average Precision (AP)

During the evaluation process, the area of overlap between the predicted bbox and the ground-truth bbox was calculated to obtain the IoU. If the IoU exceeded a predefined threshold (typically 0.5), then the predicted bounding box was considered to correctly detect the actual object. True positives (TPs), false positives (FPs), and false negatives (FNs) were thus determined, allowing for calculation of both the precision and recall (Table 7). Subsequently, the average precision (AP) was derived using a precision–recall curve, and the final model performance was assessed based on these calculations. After computing the AP for each class, the mean AP across all classes was used to evaluate the overall performance of the models.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
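For reference, the IoU, precision, and recall computations can be written compactly as follows; this is a generic illustration rather than the evaluation code used in the study.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from TP/FP/FN counts at a fixed IoU threshold (e.g., 0.5)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)
```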

2.7.2. AP Curve Smoothing with Savitzky–Golay Filter and EEM

Variability was observed during the generation of AP curves by the model. This variability reflects the aggregated outcomes of numerous small-scale datasets trained with different numbers of labels and cannot be dismissed as noise. Although the AP curve generally trends upward, it also exhibits significant fluctuations across certain intervals. These fluctuations indicate how the model performance changes depending on the dataset, thereby complicating its consistent evaluation.
To address this issue, a two-step data smoothing method was implemented. First, a Savitzky–Golay (SG) filter included in the Python SciPy library was utilized. The SG filter is a widely adopted smoothing technique in various real-world data domains, such as sensor signals, medical imaging, and geophysical data, due to its effectiveness in reducing noise while preserving critical signal characteristics and structural features (e.g., peaks and edges). Simpler methods such as a moving average and exponential smoothing are easy to implement. However, these methods may distort data by excessively smoothing important features. This is particularly true in regions containing sharp changes or significant peaks. In contrast, the SG filter fits a polynomial to the data, removing noise within a selected window. This window is defined around a specific point x and includes several neighboring data points. Thus, the SG filter flexibly captures the characteristics of the original data [47]. The window size is denoted as W = 2k + 1, with k points on each side of the central point. If the relative position of a point within this window is denoted by m, where the center is m = 0, then m ∈ {−k, −k+1, …, 0, …, k−1, k}. The polynomial p(m) used to approximate the data within this window is expressed as follows:
p(m) = a_0 + a_1 m + a_2 m^2 + \cdots + a_d m^d
In this context, the polyorder parameter d represents the degree of the polynomial. While it is theoretically possible to fit all n points exactly with a polynomial of degree n − 1, in practice, a lower-degree polynomial is generally preferred to account for noise and simplify the changing data pattern, thereby reducing the impact of noise and producing a smoother curve. A higher polyorder allows the model to capture more detailed fluctuations but makes it more sensitive to noise. (The polyorder must always be less than the window size W.)
Subsequently, the coefficients of the polynomial a_0, a_1, …, a_d are determined by constructing a Vandermonde matrix H. Given W and the values of m, one can establish the following equations for each point:
a_0 + a_1(-k) + a_2(-k)^2 + \cdots + a_d(-k)^d = x_0
a_0 + a_1(-k+1) + a_2(-k+1)^2 + \cdots + a_d(-k+1)^d = x_1
\vdots
a_0 + a_1(0) + a_2(0)^2 + \cdots + a_d(0)^d = x_{(W-1)/2}
\vdots
a_0 + a_1(k-1) + a_2(k-1)^2 + \cdots + a_d(k-1)^d = x_{W-2}
a_0 + a_1(k) + a_2(k)^2 + \cdots + a_d(k)^d = x_{W-1}
Expressing this in matrix form yields the following representation:
\begin{pmatrix} 1 & -k & (-k)^2 & \cdots & (-k)^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & -1 & (-1)^2 & \cdots & (-1)^d \\ 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1^2 & \cdots & 1^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & k & k^2 & \cdots & k^d \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_d \end{pmatrix} = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{W-1} \end{pmatrix}
If this equation is denoted as Ha = x, then the coefficient vector a can be determined using the least-squares method. In this context, (H^T H)^{-1} H^T serves as the pseudoinverse of H, and the matrix B = H(H^T H)^{-1} H^T projects the data onto the fitted polynomial:
\hat{x} = H(H^T H)^{-1} H^T x = Bx
The coefficients a obtained in this manner serve as the SG filter coefficients. By convolving these coefficients with the original data x, a smoothed result y is obtained.
y[n] = \sum_{m=-k}^{k} a_m x[n-m]
At the edges of the data (where windows may extend beyond the available points), the “interp” mode is used. In this mode, polynomial fitting is performed on the portion of data within the actual data range, and the values for the out-of-range window points are estimated based on the fitting result. In this study, the window length was set to W = 11, the polyorder to 1, and the mode to “interp”. All other parameters were maintained at their default values. An example of the smoothed data produced by this procedure is shown in Figure 8.
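With SciPy, the filtering step reduces to a single call. The AP values below are synthetic stand-in data used only to demonstrate the parameters reported above (window length 11, polyorder 1, "interp" mode), not the study's measurements.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic AP-vs-label-count curve standing in for the study's measurements.
rng = np.random.default_rng(0)
labels = np.arange(5, 1001, 5, dtype=float)
ap_raw = 0.8 * (1 - np.exp(-labels / 150)) + 0.02 * rng.standard_normal(labels.size)

# Parameters reported above: window length 11, polynomial order 1, "interp" edges.
ap_smooth = savgol_filter(ap_raw, window_length=11, polyorder=1, mode="interp")
```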
Subsequently, the empirical exponential model (EEM) was applied to fit the SG-filtered curve [48]. The SG filter was applied first instead of directly fitting the EEM to the original data. This approach was chosen because abrupt fluctuations in the initial segments of raw data can overly influence EEM curve-fitting. Such fluctuations might lead to sensitivity toward early outliers or transient variations rather than capturing the underlying general trend. In particular, direct application of the EEM to raw data exhibiting sharp initial increases followed by deceleration often produced excessively steep slopes. As a result, the model tended to fit local peaks or anomalies and consequently failed to represent the overall data pattern accurately. Therefore, applying the SG filter first mitigated noise and local fluctuations in the data and produced a smooth dataset more representative of the general trend. Subsequently, applying the EEM yielded more stable and accurate curve-fitting results. The empirical exponential model is defined as follows:
y = a e^{bx} + c e^{dx}
where x and y represent the number of labels and the model’s performance metric (average precision (AP)), respectively. The fitting coefficients a, b, c, and d were calculated based on the given dataset (x, y) using nonlinear least-squares optimization via the curve_fit function available in the Python SciPy library (Figure 9).
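Continuing the previous sketch, the EEM can be fitted to the smoothed curve with curve_fit. The initial guess and bounds are illustrative choices for a rising, saturating AP curve, not values taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def eem(x, a, b, c, d):
    """Empirical exponential model y = a*exp(b*x) + c*exp(d*x)."""
    return a * np.exp(b * x) + c * np.exp(d * x)

# Fit to (labels, ap_smooth) from the Savitzky-Golay sketch above.
# Non-positive bounds on b and d keep both exponential terms from diverging.
params, _ = curve_fit(
    eem, labels, ap_smooth,
    p0=(-0.8, -0.01, 0.8, -1e-6),
    bounds=([-np.inf, -1.0, -np.inf, -1.0], [np.inf, 0.0, np.inf, 0.0]),
)
ap_fit = eem(labels, *params)
```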

3. Results and Discussion

3.1. Determination of Optimal Model Performance Using AP Metrics and Kneedle Algorithm

In this study, 300 datasets were created with a wide range of label counts (5, 10, …, up to 2000 labels) for each image size and utilized to train the model. The average precision (AP) performance metric was employed for evaluation and comparison under various conditions. The Kneedle algorithm was utilized to systematically identify the point at which improvements in learning performance began to significantly diminish. This algorithm operates on the principle that the most informative point on a data performance curve is located at its point of maximum curvature, known as the “knee”. Specifically, the Kneedle algorithm maintains the overall data trend by rotating the performance curve around a reference line that connects its initial and terminal points. The “knee” is detected precisely at the position where the deviation between the rotated curve and the reference line reaches its maximum, effectively approximating the optimal trade-off between performance gain and resource expenditure [49].
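In practice, the knee of a fitted AP-versus-label-count curve can be located with the open-source kneed package, which implements the Kneedle algorithm. The snippet below reuses labels and ap_fit from the earlier sketches and is illustrative rather than the authors' exact pipeline.

```python
from kneed import KneeLocator  # third-party Kneedle implementation (pip install kneed)

# An AP curve that rises and then flattens is concave and increasing,
# so the knee marks where added labels start giving diminishing returns.
kl = KneeLocator(labels, ap_fit, curve="concave", direction="increasing")
if kl.knee is not None:
    print(f"knee at {kl.knee} labels, AP ≈ {float(kl.knee_y):.4f}")
```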

3.2. Comparison of YOLOv8 Models Trained by Label Count and Image Size

In this study, 16 model groups were constructed and analyzed by varying the image sizes (320, 640, 960, and 1280) for four crops—strawberries, tomatoes, chilies, and peppers—and incrementally increasing the number of labels. A systematic analysis was conducted using these experimental groups, specifically employing a total of 4800 YOLOv8 model groups (1200 per crop) to comprehensively evaluate performance. Figure 10 shows that the performance improved as the number of labels increased. However, the rate of improvement tended to decrease beyond a certain threshold. Furthermore, the performance gains were most prominent when the image size was increased from 320 to 640, whereas further increases in the size to 960 and 1280 yielded only marginal improvements. The incremental benefits of adding labels gradually diminished when weighed against additional computational costs. Although a larger size contributed additional information and could thus improve performance up to a certain level, any gains beyond that had limited value, owing to increased resource demands.

3.2.1. Analysis: Strawberry Dataset

In the strawberry dataset, increasing the number of labels generally enhanced the average precision (AP) of the model, although the extent of this improvement varied significantly between the pre-knee point and post-knee point groups. Henceforth, these groups will be referred to as the pre-group and post-group.
In the pre-group, adding extra labels led to a noticeable rise in the average AP, ranging from approximately 0.11% to 4.84%. In contrast, in the post-group, this improvement became minimal, with increases restricted to roughly 0.04–0.1%. Specifically, knee points based on image size were identified at 175 labels for 320, 210 labels for 640, 205 labels for 960, and 220 labels for 1280 (Figure 10a and Table 8). The pre-group demonstrated average AP growth rates of about 1.11% at 320, 1.03% at 640, 1.03% at 960, and 0.90% at 1280. In contrast, the post-group showed negligible average increments (0.03–0.04%) with extremely narrow confidence intervals around these values (Figure 11a). Furthermore, while the overall AP increased with larger sizes, the extent of improvement gradually decreased as the image size rose. Notably, the AP improvement was highest between 320 and 640 (approximately 6.27%) but dropped sharply to about 1.50% from 640 to 960 and further declined to 0.79% from 960 to 1280. This trend continued similarly in the post-group (Figure 12a). This result can be attributed to the clear color contrast within the bboxes, facilitating object recognition and significantly enhancing model performance with increased labeling.

3.2.2. Analysis: Tomato Dataset

The tomato dataset displayed trends similar to those of the strawberry dataset, with comparable patterns noted. In the pre-group, introducing additional labels resulted in average AP improvements ranging from approximately 0.15% to 11.24%. In contrast, the post-group exhibited more modest increases from 0.05% to 0.19%. Specifically, knee points were observed at 215 labels for 320, 235 labels for 640, 220 labels for 960, and 185 labels for 1280 (Figure 10b and Table 8).
In the pre-group, the mean AP growth rates were statistically significant, measuring 2.05% for 320, 1.40% for 640, 1.31% for 960, and 2.12% for 1280. Conversely, the post-group showed minimal increases—0.07% at 320 and about 0.05% at larger sizes—with extremely narrow confidence intervals (Figure 11b). Additionally, the overall AP increased with larger sizes, though improvements diminished progressively at greater sizes. Specifically, the largest AP enhancement occurred between 320 and 640 (11.81%), decreasing to 1.69% from 640 to 960 and further to 1.10% between 960 and 1280. This pattern was consistently observed in the post-group as well (Figure 12b). This significant effect in the pre-group can be explained by the concentrated color distribution within the bbox, highlighting the importance of label quantity in improving AP scores.

3.2.3. Analysis: Chili Dataset

In the case of the chili dataset, the performance gains due to larger sizes were less pronounced than for strawberries or tomatoes.
In the pre-group, additional labels improved the average AP significantly by between 0.31% and 235.87% (mean 7.09%), whereas the post-group exhibited marginal improvements from 0.09% to 0.32% (mean 0.1%). Specifically, knee points appeared at 230 labels for 320, 290 labels for 640, 290 labels for 960, and 325 labels for 1280 (Figure 10c and Table 8). The pre-group growth rates were statistically significant, with average increases of 8.68% at 320, 5.60% at 640, 5.66% at 960, and 8.43% at 1280. In contrast, the post-group exhibited minimal average growth (about 0.10–0.11%) with extremely narrow confidence intervals (Figure 11c). Furthermore, the overall AP improved with size increases but showed diminishing returns at larger sizes. Specifically, the largest gain occurred between 320 and 640 (6.18%), with negligible improvement (0.02%) between 640 and 960 and even a decrease (−1.50%) between 960 and 1280, indicating limited benefits at larger sizes (Figure 12c). This variation likely arose from the similarity in hue between the objects and background, complicating object recognition. Thus, the chili dataset required more labels to reach the knee points, reflecting difficulties in differentiating objects based on color.

3.2.4. Analysis: Pepper Dataset

For the pepper dataset, introducing more labels in the pre-group significantly enhanced performance (0.24–14.51%, mean of 2.11%), whereas in the post-group, improvements were modest (0.08–0.23%, mean of 0.09%). Knee points were identified at 430 labels for 320, 325 labels for 640, 280 labels for 960, and 300 labels for 1280 (Figure 10d and Table 8).
The pre-group AP increments were statistically significant and tended to increase slightly with the size: 1.13% at 320, 1.87% at 640, 2.37% at 960, and 2.28% at 1280. The post-group growth rates were minimal (about 0.08–0.11%), with narrow confidence intervals. Notably, the t values were clearly differentiated between the two groups (Figure 11d). Moreover, the overall AP increased with the size, but improvements became smaller at larger sizes. Specifically, the AP rose most substantially from 320 to 640 (19.56%), with a lesser increase from 640 to 960 (5.50%) and a minimal increase from 960 to 1280 (1.11%), a pattern consistent in both groups (Figure 12d). The pepper dataset showed better object differentiation compared with the chili dataset due to greater color heterogeneity within the bounding boxes. However, its overall performance remained lower than for datasets with clearly distinguishable color contrasts such as strawberries or tomatoes, necessitating more labels to achieve comparable performance.

3.3. Comparison of Training Times and GPU Usage on YOLOv8

3.3.1. Comparison of Training Times

Analysis of the training times was conducted primarily through comparisons between the image size groups within each dataset. As summarized in Table 9 and visualized in Figure 13, across all datasets, the learning time increased as the number of labels increased, regardless of the knee point.
In the case of strawberries, the training time consistently increased as the image size increased. At the knee points, the training times recorded were 42.72 s for 320, 87.67 s for 640, 151.96 s for 960, and 209.50 s for 1280. This clear upward trend highlights the increased computational load associated with larger image inputs (Figure 13a and Table 9).
The tomato data showed a similar trend, although a notable increase in training time occurred when transitioning from 960 to 1280. At the knee points, the times were 68.41 s for 320, 77.43 s for 640, 113.11 s for 960, and a substantially longer 212.66 s for 1280, indicating significant computational costs at the largest size (Figure 13b and Table 9).
The chili dataset demonstrated a rapid increase in training time as the image size increased. The training times at the knee points were 130.90 s for 320, significantly increasing to 188.44 s for 640, further rising to 326.20 s for 960, and sharply escalating to 579.42 s for 1280. Consequently, the chili dataset exhibited the heaviest computational load among all evaluated datasets (Figure 13c and Table 9).
Peppers displayed less predictable behavior. The training times at the knee points started at 84.19 s for 320, slightly increased to 89.12 s at 640, and then surprisingly decreased to 75.88 s at 960 before slightly rising again to 79.99 s at 1280 (Figure 13d and Table 9). Unlike the previously analyzed crops, the peppers did not exhibit a general correlation between an increased image size and training time, indicating that the image size had a less consistent influence in this case.
In summary, the training times across datasets generally increased with the image size and number of labels, reflecting increased computational demands irrespective of model performance. However, the pepper dataset had a notably smaller average pixel count compared with other crops, which might partially explain the irregularities observed in its training time trends. Therefore, it is believed that additional factors such as the pixel count, object size, and color variability should be considered in future analyses to comprehensively understand training time behaviors.

3.3.2. Comparison of GPU Usage

The analysis of GPU usage was conducted primarily through comparisons between the crop groups at each image size. GPU usage significantly increased with larger sizes across all datasets, as summarized in Table 10 and visualized in Figure 14.
For strawberries, GPU memory usage sharply increased with larger sizes. At 320, GPU usage was moderate at 1.14 GB but sharply increased to 3.26 GB at 640, further rising to 6.80 GB at 960 and peaking at 12.18 GB at 1280 (Figure 14a).
Tomatoes showed similar but slightly lower memory demands, beginning with 0.88 GB at 320, significantly increasing to 2.75 GB at 640, rising further to 5.86 GB at 960, and ultimately reaching 12.31 GB at 1280 (Figure 14b).
The chili data also followed a similar trajectory, starting at a relatively low GPU usage of 1.10 GB for 320, sharply increasing to 3.24 GB at 640, rising quickly to 6.98 GB at 960, and reaching a significantly high 12.24 GB at 1280 (Figure 14c).
Peppers exhibited a consistent pattern, starting with 1.38 GB at 320, significantly rising to 3.75 GB at 640, further increasing to 8.16 GB at 960, and reaching the highest usage of all datasets with 14.65 GB at 1280 (Figure 14d).
In conclusion, GPU memory usage significantly increased with the image size across all evaluated datasets. Notably, across datasets, GPU usage rapidly increased for a small number of labels and quickly reached a plateau. This initial sharp rise was followed by stability in GPU memory usage, indicating that resource allocation efficiency remained effective despite handling additional labels beyond certain thresholds.

3.4. Comparison of YOLOv8 and YOLOv11 Models Trained by Label Count and Image Size

This section provides a comparative analysis of the YOLOv8 and YOLOv11 models. Comparisons were performed to optimally balance performance and resource consumption.
For strawberries, the YOLOv8 model achieved optimal balance at 210 labels with an AP of 0.7548. Conversely, YOLOv11 attained similar performance (AP of 0.7511) at a slightly lower knee point of 185 labels. This indicates that YOLOv11 could achieve comparable performance to YOLOv8 with fewer labels, potentially reducing data annotation costs and efforts (Figure 15a and Figure 16a).
In the tomato dataset, YOLOv8 reached its knee point at 235 labels with an AP of 0.7978, whereas YOLOv11 achieved slightly worse performance (AP of 0.7824) at a reduced knee point of 210 labels. This suggests that YOLOv8 provides higher accuracy but requires more labeled data, highlighting a trade-off between labeling effort and accuracy that must be considered in practical scenarios (Figure 15b and Figure 16b).
For chilies, an interesting observation emerged. YOLOv8 reached its knee point at 290 labels with an AP of 0.4820, while YOLOv11 outperformed YOLOv8 slightly, achieving an AP of 0.4910 at a lower knee point of 280 labels. This demonstrates YOLOv11’s superior efficiency in terms of both labeling requirements and detection accuracy for challenging datasets such as chilies, which generally present lower AP values (Figure 15c and Figure 16c).
The pepper dataset presented similar insights. YOLOv8 reached its knee point at 325 labels with an AP of 0.5542. In contrast, YOLOv11 achieved higher accuracy (AP of 0.5784) with a reduced number of labels (315). This improvement underscores YOLOv11’s capability in managing resource efficiency and model performance simultaneously.
Overall, YOLOv11 consistently showed the potential for either similar or improved detection accuracy compared with YOLOv8, often requiring fewer labeled data points to reach its knee point. This is particularly beneficial in real-world agricultural settings, where data annotation can be costly and time-consuming. Furthermore, YOLOv11’s performance improvement at reduced labeling points suggests an optimization in data utilization and training efficiency (Figure 15d and Figure 16d).
However, the GPU resource consumption of YOLOv11 generally exceeded that of YOLOv8 across all datasets. While YOLOv11 provides clear benefits in terms of labeling efficiency and potential accuracy, these advantages come at the cost of increased computational resources, particularly GPU memory.
In conclusion, the decision to adopt YOLOv8 or YOLOv11 should involve balancing between labeling efficiency, accuracy, and computational resources, with YOLOv11 offering substantial improvements in label efficiency and accuracy at the expense of higher GPU utilization.

3.5. Inference Time Comparison on Jetson Orin Nano and a Desktop Training System

The inference time comparison between Jetson Orin Nano and a standard desktop system provided insights regarding the effects of the image size on computational efficiency in different hardware environments. The desktop system maintained consistent inference times across various image sizes, indicating negligible computational overhead regardless of the image size (Figure 17a). Conversely, significant variations in inference times were observed for the Jetson Orin Nano, highlighting its sensitivity to changes in image size (Figure 17b).
Specifically, the Jetson Orin Nano demonstrated clear gradations in inference times correlated with the image sizes. The inference time was approximately 20 ms for images with a size of 320, increased moderately to about 29 ms for 640, and then rose sharply to 57 ms for images at a size of 960, culminating at a notably high 87 ms for the largest image size of 1280. These results emphasize the computational constraints and scalability issues associated with embedded systems, especially when higher image sizes are considered.
From these observations, it becomes evident that while larger image sizes do not significantly impact inference performance on desktop systems—thus encouraging the use of larger images to achieve potentially superior accuracy—the situation differs substantially for embedded platforms like the Jetson Orin Nano. In embedded contexts, where real-time responsiveness and limited computational resources are critical factors, careful consideration must be given to selecting the appropriate image size. If the primary objective is real-time processing with minimal latency, smaller image sizes, such as 320 or 640, are more suitable. Conversely, for applications where accuracy is paramount, and computational resources can accommodate the increased demands, larger sizes may still be preferable despite their associated inference latency.
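A simple way to reproduce this kind of comparison on either platform is to time Ultralytics inference at each image size, as sketched below; "best.pt" and "sample.jpg" are placeholders for a trained model and a test image.

```python
from ultralytics import YOLO

model = YOLO("best.pt")                      # placeholder path to a trained model
for imgsz in (320, 640, 960, 1280):
    result = model.predict("sample.jpg", imgsz=imgsz, verbose=False)[0]
    # Results.speed reports per-image preprocess/inference/postprocess times in ms.
    print(f"imgsz={imgsz}: inference {result.speed['inference']:.1f} ms")
```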

4. Conclusions

In this study, the impact of the label count and image size on the performance, training times, and GPU utilization of YOLO models was comprehensively analyzed using four distinct agricultural datasets: strawberries, tomatoes, chilies, and peppers. The results clearly demonstrated that increasing the number of labels improves model accuracy up to a defined threshold (the “knee point”), beyond which performance gains diminish significantly. This indicates that additional labeling efforts past this threshold do not yield proportional improvements, making efficient data annotation critical.
Analyzing individual datasets revealed distinct optimal conditions; strawberries and tomatoes benefitted significantly from modest label increases and moderate resolutions due to clear color contrasts. In contrast, chilies and peppers required higher label counts and resolutions due to more challenging visual differentiation. This underscores the importance of dataset-specific considerations in determining optimal training configurations.
Additionally, comparative analysis between the YOLOv8 and YOLOv11 models indicated that YOLOv11 generally offered comparable or improved detection accuracy with fewer labels, suggesting potential efficiency gains in labeling processes. However, YOLOv11 required significantly more GPU resources, highlighting an essential trade-off between labeling efficiency, model accuracy, and computational resource allocation.
Finally, the inference times between Jetson Orin Nano and desktop systems were compared. The results emphasized the sensitivity of embedded platforms to increases in image sizes. While the desktop performance remained relatively unaffected, the embedded systems experienced substantial latency increases at larger sizes. Therefore, in real-world deployments, especially on resource-constrained platforms, careful selection of the image size, balancing accuracy and computational efficiency, is critical.
Practical recommendations for optimizing agricultural datasets include identifying and adhering to the “knee points” of labels to balance annotation effort and accuracy, selecting moderate image resolutions (about 640) to achieve optimal performance without excessive computational costs, and considering dataset-specific characteristics such as object-background contrast and color complexity when planning annotation strategies and resolutions. Furthermore, evaluating the trade-offs between accuracy, labeling costs, and computational resources when choosing between the YOLOv8 and YOLOv11 models is essential. Finally, selecting appropriate image sizes based on the deployment environment, particularly in resource-constrained or real-time applications, is crucial.
Overall, our findings highlight that balancing the label count, image size, computational resources, and platform constraints is crucial for effective deployment of YOLO-based detection models in agricultural applications. Future studies will focus on validating these findings across broader datasets and diverse environmental conditions to further optimize performance and resource efficiency.

Author Contributions

Conceptualization, J.S. and J.P.; methodology, J.S. and J.P.; software, J.S., D.K. and E.J.; validation, J.S.; formal analysis, J.S.; data curation, J.S. and J.P.; writing—original draft preparation, J.S., D.K. and E.J.; writing—review and editing, J.S. and J.P.; visualization, J.S.; supervision, J.P.; project administration, J.P.; funding acquisition, J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was carried out with the support of the “New Agricultural Climate Change Response System Construction Project (Project No. RS-2023-00219113)” of the Rural Development Administration of the Republic of Korea. This research (paper) used datasets from “The Open AI Dataset Project (AI-Hub, S. Korea)”. All data information can be accessed through “AI-Hub (www.aihub.or.kr)”.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dahikar, S.S.; Rode, S.V. Agricultural Crop Yield Prediction Using Artificial Neural Network Approach. Int. J. Innov. Res. Electr. Electron. Instrum. Control. Eng. 2014, 2, 683–686. [Google Scholar]
  2. Deivakani, M.; Singh, C.; Bhadane, J.R.; Ramachandran, G.; Sanjeev Kumar, N. ANN Algorithm based Smart Agriculture Cultivation for Helping the Farmers. In Proceedings of the 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 7–9 October 2021; pp. 1–6. [Google Scholar]
  3. Hsieh, T.H.; Kiang, J.F. Comparison of CNN Algorithms on Hyperspectral Image Classification in Agricultural Lands. Sensors 2020, 20, 1734. [Google Scholar] [CrossRef]
  4. Sarma, K.K.; Das, K.K.; Mishra, V.; Bhuiya, S.; Kaplun, D. Learning Aided System for Agriculture Monitoring Designed Using Image Processing and IoT-CNN. IEEE Access 2022, 10, 41525–41536. [Google Scholar] [CrossRef]
  5. Hamadani, H.; Rashid, S.M.; Parrah, J.D.; Khan, A.A.; Dar, K.A.; Ganie, A.A.; Gazal, A.; Dar, R.A.; Ali, A. Traditional Farming Practices and Its Consequences. In Microbiota and Biofertilizers; Springer: Cham, Switzerland, 2021; Volume 2, pp. 119–128. [Google Scholar]
  6. Wu, B.; Zhang, M.; Zeng, H.; Tian, F.; Potgieter, A.B.; Qin, X.; Yan, N.; Chang, S.; Zhao, Y.; Dong, Q.; et al. Challenges and opportunities in remote sensing-based crop monitoring: A review. Natl. Sci. Rev. 2023, 10, nwac290. [Google Scholar] [CrossRef]
  7. Hanuschak Sr, G.A. Timely and accurate crop yield forecasting and estimation: History and initial gap analysis. In Proceedings of the First Scientific Advisory Committee Meeting, Global Strategy; Food and Agriculture Organization of the United Nations: Rome, Italy, 2013; Volume 198. [Google Scholar]
  8. Budach, L.; Feuerpfeil, M.; Ihde, N.; Nathansen, A.; Noack, N.; Patzlaff, H.; Harmouch, H. The Effects of Data Quality on Machine Learning Performance. arXiv 2022, arXiv:2207.14529. [Google Scholar]
  9. Bailly, A.; Blanc, C.; Francis, E.; Guillotin, T.; Jamal, F.; Wakim, B.; Roy, P. Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models. Comput. Methods Programs Biomed. 2022, 213, 106504. [Google Scholar] [CrossRef]
  10. Nesteruk, S.; Shadrin, D.; Pukalchik, M. Image augmentation for multitask few-shot learning: Agricultural domain use-case. arXiv 2021, arXiv:2102.12295. [Google Scholar]
  11. Kodors, S.; Sondors, M.; Apeinans, I.; Zarembo, I.; Lacis, G.; Rubauskis, E.; Karklina, K. Importance of mosaic augmentation for agricultural image dataset. In Agronomy Research; Estonian University of Life Sciences: Tartu, Estonia, 2024; Volume 22. [Google Scholar]
  12. Zou, K.; Shan, Y.; Zhao, X.; Che, X. A deep learning image augmentation method for field agriculture. IEEE Access 2024, 12, 37432–37442. [Google Scholar]
  13. Abbas, A.; Jain, S.; Gour, M.; Vankudothu, S. Tomato plant disease detection using transfer learning with C-GAN synthetic images. Comput. Electron. Agric. 2021, 187, 106279. [Google Scholar] [CrossRef]
  14. Zhou, C.; Zhang, Z.; Zhou, S.; Xing, J.; Wu, Q.; Song, J. Grape Leaf Spot Identification Under Limited Samples by Fine Grained-GAN. IEEE Access 2021, 9, 100480–100489. [Google Scholar] [CrossRef]
  15. Fawakherji, M.; Suriani, V.; Nardi, D.; Bloisi, D.D. Shape and style GAN-based multispectral data augmentation for crop/weed segmentation in precision farming. Crop Prot. 2024, 184, 106848. [Google Scholar]
  16. Liu, C.; et al. Towards Robust GAN-Generated Image Detection: A Multi-View Completion Representation. arXiv 2023, arXiv:2306.01364. [Google Scholar]
  17. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Wei, Y. Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12105–12114. [Google Scholar]
  18. Pérez-Borrero, I.; Marín-Santos, D.; Gegúndez-Arias, M.E.; Cortés-Ancos, E. A fast and accurate deep learning method for strawberry instance segmentation. Comput. Electron. Agric. 2020, 178, 105736. [Google Scholar] [CrossRef]
  19. Yuan, W. AriAplBud: An Aerial Multi-Growth Stage Apple Flower Bud Dataset for Agricultural Object Detection Benchmarking. Data 2024, 9, 36. [Google Scholar] [CrossRef]
  20. Hu, T.; Wang, W.; Gu, J.; Xia, Z.; Zhang, J.; Wang, B. Research on apple object detection and localization method based on improved yolox and rgb-d images. Agronomy 2023, 13, 1816. [Google Scholar] [CrossRef]
  21. Karthikeyan, M.; Subashini, T.; Srinivasan, R.; Santhanakrishnan, C.; Ahilan, A. YOLOAPPLE: Augment Yolov3 deep learning algorithm for apple fruit quality detection. Signal Image Video Process. 2024, 18, 119–128. [Google Scholar]
  22. Gong, X.; Zhang, S. A high-precision detection method of apple leaf diseases using improved faster R-CNN. Agriculture 2023, 13, 240. [Google Scholar] [CrossRef]
  23. Jia, W.; Wang, Z.; Zhang, Z.; Yang, X.; Hou, S.; Zheng, Y. A fast and efficient green apple object detection model based on Foveabox. J. King Saud. Univ.-Comput. Inf. Sci. 2022, 34, 5156–5169. [Google Scholar]
  24. Ji, W.; Pan, Y.; Xu, B.; Wang, J. A real-time apple targets detection method for picking robot based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  25. Yuan, S.-Q.; Cao, Y.; Cheng, X. Research on Strawberry Quality Grading Based on Object Detection and Stacking Fusion Model. IEEE Access 2023, 11, 137475–137484. [Google Scholar]
  26. Nergiz, M. Enhancing Strawberry Harvesting Efficiency through Yolo-v7 Object Detection Assessment. Turk. J. Sci. Technol. 2023, 18, 519–533. [Google Scholar]
  27. Wang, C.; Wang, H.; Han, Q.; Zhang, Z.; Kong, D.; Zou, X. Strawberry Detection and Ripeness Classification Using YOLOv8+ Model and Image Processing Method. Agriculture 2024, 14, 751. [Google Scholar] [CrossRef]
  28. Luo, Q.; Wu, C.; Wu, G.; Li, W. A Small Target Strawberry Recognition Method Based on Improved YOLOv8n Model. IEEE Access 2024, 12, 14987–14995. [Google Scholar]
  29. Chai, J.J.; Xu, J.-L.; O’Sullivan, C. Real-Time Detection of Strawberry Ripeness Using Augmented Reality and Deep Learning. Sensors 2023, 23, 7639. [Google Scholar] [CrossRef]
  30. Li, Y.; Xue, J.; Zhang, M.; Yin, J.; Liu, Y.; Qiao, X.; Zheng, D.; Li, Z. YOLOv5-ASFF: A Multistage Strawberry Detection Algorithm Based on Improved YOLOv5. Agronomy 2023, 13, 1901. [Google Scholar] [CrossRef]
  31. Salim, R.; Fajar, A.N. Object Detection of Chili Using Convolutional Neural Network YOLOv7. J. Theor. Appl. Inf. Technol. 2024, 102, 2419–2427. [Google Scholar]
  32. Chen, H.; Zhang, R.; Peng, J.; Peng, H.; Hu, W.; Wang, Y.; Jiang, P. YOLO-chili: An efficient lightweight network model for localization of pepper picking in complex environments. Appl. Sci. 2024, 14, 5524. [Google Scholar] [CrossRef]
  33. Abubeker, K.; Akhil, S.; Kumar, V.A.; Jose, B.K. Computer Vision-Assisted Real-Time Bird Eye Chili Classification Using YOLO V5 Framework. J. Artif. Intell. Technol. 2024, 4, 265–271. [Google Scholar]
  34. Appe, S.N.; Arulselvi, G.; Balaji, G. CAM-YOLO: Tomato detection and classification based on improved YOLOv5 using combining attention mechanism. PeerJ Comput. Sci. 2023, 9, e1463. [Google Scholar]
  35. Liu, G.; Hou, Z.; Liu, H.; Liu, J.; Zhao, W.; Li, K. TomatoDet: Anchor-free detector for tomato detection. Front. Plant Sci. 2022, 13, 942875. [Google Scholar]
  36. Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato maturity detection and counting model based on MHSA-YOLOv8. Sensors 2023, 23, 6701. [Google Scholar] [CrossRef]
  37. Facility Crop Disease Diagnostics. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=153 (accessed on 25 November 2024).
  38. Integrated Plant Disease Outbreak Data. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=525 (accessed on 25 November 2024).
  39. Laboro Tomato: Instance Segmentation Dataset. 2020. Available online: https://github.com/laboroai/LaboroTomato (accessed on 15 December 2024).
  40. Roboflow. Maturity Peppers in Greenhouses by Object Detection Image Dataset. Available online: https://universe.roboflow.com/viktor-vanchov/pepper-detector-cfpbq/dataset/2 (accessed on 20 January 2025).
  41. Bi, J.; Li, K.; Zheng, X.; Zhang, G.; Lei, T. SPDC-YOLO: An Efficient Small Target Detection Network Based on Improved YOLOv8 for Drone Aerial Image. Remote Sens. 2025, 17, 685. [Google Scholar] [CrossRef]
  42. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  43. Khan, A.T.; Jensen, S.M.; Khan, A.R. Advancing precision agriculture: A comparative analysis of YOLOv8 for multi-class weed detection in cotton cultivation. Artif. Intell. Agric. 2025, 15, 182–191. [Google Scholar] [CrossRef]
  44. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision And pattern Recognition, Honolulu, HI, USA, 21–27 July 2017; pp. 7310–7311. [Google Scholar]
  45. Li, J.; Wang, Q.; Ma, J.; Guo, J. Multi-defect segmentation from façade images using balanced copy–paste method. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1434–1449. [Google Scholar] [CrossRef]
  46. Yang, S.; Luo, P.; Loy, C.-C.; Tang, X. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533. [Google Scholar]
  47. Savitzky, A.; Golay, M.J. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 1964, 36, 1627–1639. [Google Scholar]
  48. Che, Y.; Zheng, Y.; Forest, F.E.; Sui, X.; Hu, X.; Teodorescu, R. Predictive health assessment for lithium-ion batteries with probabilistic degradation prediction and accelerating aging detection. Reliab. Eng. Syst. Saf. 2024, 241, 109603. [Google Scholar]
  49. Satopaa, V.; Albrecht, J.; Irwin, D.; Raghavan, B. Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops, Minneapolis, MN, USA, 20–24 June 2011; pp. 166–171. [Google Scholar]
Figure 1. Global interest analysis of Google Trends: (a) YOLOv1–v5 and (b) YOLOv5 onward.
Figure 2. Original image sample: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 3. Bounding boxes for each image: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 4. Comparison of bounding box sizes through center point aggregation: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 5. Comparison of bounding box dimensions for comparative size: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 6. Comparison of mean hue distributions for full image and bounding box: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
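Figure 6 compares hue statistics computed over the full image and over the labeled regions. As a point of reference only, the minimal sketch below shows one way to compute a mean hue for a whole image and for a single YOLO-normalized bounding box with OpenCV; the file name and box values are hypothetical, and this is not necessarily the processing pipeline used in this study.

```python
import cv2

def mean_hue(image_bgr):
    """Mean hue (OpenCV scale 0-179) over a BGR image or crop."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    return float(hsv[:, :, 0].mean())

def bbox_mean_hue(image_bgr, yolo_box):
    """Mean hue inside a YOLO-normalized box (x_center, y_center, width, height)."""
    img_h, img_w = image_bgr.shape[:2]
    xc, yc, w, h = yolo_box
    x1, y1 = int((xc - w / 2) * img_w), int((yc - h / 2) * img_h)
    x2, y2 = int((xc + w / 2) * img_w), int((yc + h / 2) * img_h)
    crop = image_bgr[max(y1, 0):y2, max(x1, 0):x2]
    return mean_hue(crop)

img = cv2.imread("strawberry_0001.jpg")                   # hypothetical file name
print(mean_hue(img), bbox_mean_hue(img, (0.5, 0.5, 0.2, 0.3)))
```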
Figure 7. Comparison of label counts across datasets (dot = outlier data points that fall outside the standard box-and-whisker plot range): (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
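Figure 7 summarizes how many labels each image contains. A per-image label count distribution like this can be derived directly from YOLO-format annotation files, as in the short sketch below; the directory path is hypothetical, and the study's own tooling may differ.

```python
from pathlib import Path

# Count YOLO-format labels (one object per non-empty line) for each annotation file.
label_dir = Path("datasets/strawberry/labels/train")      # hypothetical path
counts = {txt.stem: sum(1 for line in txt.read_text().splitlines() if line.strip())
          for txt in label_dir.glob("*.txt")}

total = sum(counts.values())
print(f"{len(counts)} images, {total} labels, "
      f"{total / max(len(counts), 1):.1f} labels per image on average")
```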
Figure 8. Comparison of AP curves (a) before applying the Savitzky–Golay (SG) filter and (b) after applying the SG filter.
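Figure 8 shows the effect of Savitzky–Golay smoothing [47] on the AP-versus-label-count curve. The sketch below illustrates how such smoothing can be applied with SciPy; the synthetic curve, window length, and polynomial order are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic AP-versus-label-count curve standing in for the measured values.
label_counts = np.arange(25, 2001, 25)
rng = np.random.default_rng(0)
ap_raw = 0.75 * (1 - np.exp(-label_counts / 300)) + rng.normal(0, 0.01, label_counts.size)

# Window length (odd) and polynomial order are illustrative choices, not the paper's settings.
ap_smooth = savgol_filter(ap_raw, window_length=11, polyorder=2)
```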
Figure 9. Comparison of AP curves (a) before applying the EEM and (b) after applying the EEM.
Figure 10. YOLOv8 AP performance and knee point of models based on image size and label count (solid line = pre-group; dotted line = post-group): (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
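Figure 10 (and Table 8) report knee points of the smoothed AP curves, i.e., the label counts beyond which additional labels yield diminishing returns. One practical way to locate such a point is the Kneedle algorithm [49]; the sketch below uses the third-party kneed package on a synthetic, concave AP curve and is not necessarily the implementation used by the authors.

```python
import numpy as np
from kneed import KneeLocator   # third-party implementation of the Kneedle algorithm [49]

label_counts = np.arange(25, 2001, 25)
ap_smooth = 0.75 * (1 - np.exp(-label_counts / 300))      # placeholder smoothed AP curve

# The AP curve rises and then flattens, i.e., it is concave and increasing.
knee = KneeLocator(label_counts, ap_smooth, curve="concave", direction="increasing")
print(f"knee at {knee.knee} labels, AP = {float(knee.knee_y):.4f}")
```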
Figure 11. AP increase rate of models based on image size and label count (solid line = pre-group; dotted line = post-group): (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 12. AP increase rate of models based on training image size: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 13. Training time of the model as a function of the image size and number of labels: (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 14. YOLOv8 GPU usage of the model as a function of the image size and number of labels: (a) 320, (b) 640, (c) 960, and (d) 1280.
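Figure 14 and Table 10 report GPU memory usage during training at each input size. If the Ultralytics training API is used (an assumption; the weight file and dataset YAML names below are placeholders), peak memory can be approximated from PyTorch's CUDA allocator statistics, for example:

```python
import torch
from ultralytics import YOLO   # Ultralytics package assumed for YOLOv8/YOLOv11

torch.cuda.reset_peak_memory_stats()

model = YOLO("yolov8n.pt")                                # placeholder weights
model.train(data="strawberry.yaml", imgsz=640, epochs=1)  # hypothetical dataset YAML

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory allocated by PyTorch: {peak_gb:.2f} GB")
```

Depending on how usage is measured (PyTorch allocator statistics versus a system tool such as nvidia-smi), absolute values can differ, since the latter also counts the CUDA context and cached memory.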
Figure 15. YOLOv8 and v11 AP performance and knee points of models based on image size and label count (image size = 640; solid line = pre-group; dotted line = post-group): (a) strawberries, (b) tomatoes, (c) chilies, and (d) peppers.
Figure 16. YOLOv8 and v11 GPU usage of the model as a function of image size and number of labels (image size = 640): (a) 320, (b) 640, (c) 960, and (d) 1280.
Figure 17. Inference times of the models as a function of image size: (a) Jetson Orin Nano and (b) desktop system.
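Figure 17 compares inference times on the Jetson Orin Nano and the desktop system. A simple wall-clock benchmark such as the sketch below (hypothetical weights and image file; Ultralytics API assumed) yields comparable per-image latency figures; a warm-up run is included so that model loading and CUDA initialization are not counted.

```python
import time
from ultralytics import YOLO   # Ultralytics package assumed for YOLOv8/YOLOv11

model = YOLO("yolov8n.pt")                                # placeholder weights
model.predict("sample.jpg", imgsz=640, verbose=False)     # warm-up (hypothetical image)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    model.predict("sample.jpg", imgsz=640, verbose=False)
elapsed_ms = (time.perf_counter() - t0) / runs * 1000
print(f"mean per-image inference time: {elapsed_ms:.1f} ms")
```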
Table 1. Summary of techniques and comparison between studies on object detection in agriculture.

Author(s) and Year | Dataset | Number of Classes | Dataset Size | Model or Architecture | Result
Wenan Yuan (2024) [19] | RGB Apple Buds Based on UAV | 6 | 3600 | YOLOv8 | 72.7% (mAP@50)
Tiantian Hu et al. (2023) [20] | RGB-D Apple | 1 | 4785 | Improved YOLOX | 94.1% (mAP@50)
M. Karthikeyan et al. (2024) [21] | RGB Apple | 3 | 1800 | Augment YOLOv3 | 99.1% (mAP@50)
Xulu Gong et al. (2023) [22] | Apple Leaf Disease | 5 | 4182 | Improved Faster R-CNN | 63.1% (AP)
Weikuan Jia et al. (2022) [23] | Green Apple | 1 | 1386 | Fast-FDM | 62.3% (mAP@50-95)
Wei Ji et al. (2022) [24] | Apple | 1 | 17,930 | ShufflenetV2-YOLOX | 96.7% (AP)
Shi-Qi Yuan et al. (2023) [25] | DanDong Strawberries | 4 | 420 | YOLOv5 | 85.4% (Accuracy)
Mehmet Nergiz (2023) [26] | Strawberry-DS | 6 | 247 | YOLOv7 | 46.0% (mAP@50-95)
Chenlin Wang et al. (2024) [27] | Strawberry | 2 | 1187 | YOLOv8+ | 97.8% (Accuracy)
Qiang Luo et al. (2024) [28] | Strawberry | 2 | 3264 | YOLOv8 | 91.2% (mAP@50)
Jackey J. K. Chai et al. (2023) [29] | StrawDI Team | 3 | 3100 | YOLOv7 | 89.0% (mAP@50)
Yaodi Li et al. (2023) [30] | Strawberry | 4 | 1217 | YOLOv5-ASFF | 91.9% (mAP@50)
Richard Salim et al. (2024) [31] | Curly Red Chili | 5 | 700 | YOLOv7 | 97.7% (mAP@50)
HaiLin Chen et al. (2024) [32] | Chili | 1 | 1456 | YOLOv5 | 93.1% (AP@50)
Abubeker K. M. et al. (2024) [33] | Bird Eye Chili | 2 | 1558 | YOLOv5 | 94.0% (mAP@50)
Seetharam Negesh Appe et al. (2023) [34] | Laboro Tomato | 2 | 2034 | CAM-YOLO | 88.1% (mAP@50)
Guoxu Liu et al. (2022) [35] | Tomato | 1 | 966 | TomatoDet | 98.2% (AP@50)
Ping Li et al. (2023) [36] | Tomato | 3 | 2208 | MHSA-YOLOv8 | 91.6% (mAP@50)
Table 2. Desktop system specifications for training and testing object detection models.

Component | Specification
CPU | AMD Ryzen 5 7600 5.1 GHz
GPU | NVIDIA GeForce RTX 4090 24 GB
Memory | 64 GB
Programming Language | Python 3.9.19
Operating System | Windows 11
CUDA | 11.8
Torch | 2.0.1 + cu118
Torchvision | 0.15.2 + cu118
Table 3. Specifications of Jetson Orin Nano environment.

Component | Specification
CPU | 6-core Arm Cortex-A78AE v8.2 64-bit CPU, 1.5 MB L2 + 4 MB L3
GPU | 1024 CUDA-core NVIDIA Ampere architecture GPU with 32 Tensor Cores
Memory | 8 GB 128-bit LPDDR5, 68 GB/s
Module Power | 15 W
Programming Language | Python 3.8
Operating System | JetPack 5.1.1
Torch | 2.0.0 + nv23.05
Torchvision | 0.15.1a0 + 42759b1
Table 4. Configuration of datasets.

Dataset | Number of Images | Number of Labels
Strawberry | 3386 | 23,359
Tomato | 804 | 9,777
Chili | 682 | 2,258
Pepper | 619 | 5,324
Table 5. Bbox size thresholds for each dataset.

Dataset | Small | Medium | Max_BBoxArea
Strawberry | 0.061891 | 0.103104 | 0.384039
Tomato | 0.126025 | 0.195229 | 0.536674
Chili | 0.138742 | 0.234728 | 0.944629
Pepper | 0.036184 | 0.061265 | 0.241488
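The thresholds in Table 5 are expressed as normalized bounding-box areas (width × height in YOLO coordinates). The sketch below shows one way such statistics could be gathered from YOLO label files; the percentile-based small/medium cut-offs and the directory path are illustrative assumptions, not the study's exact thresholding rule.

```python
import numpy as np
from pathlib import Path

# Collect normalized bbox areas (width * height) from YOLO label files.
areas = []
for txt in Path("datasets/strawberry/labels/train").glob("*.txt"):   # hypothetical path
    for line in txt.read_text().splitlines():
        parts = line.split()
        if len(parts) >= 5:
            _cls, _xc, _yc, w, h = map(float, parts[:5])
            areas.append(w * h)

areas = np.array(areas)
# Illustrative small/medium cut-offs at the 33rd and 66th percentiles of the area
# distribution; the study's actual rule may differ.
small, medium = np.percentile(areas, [33, 66])
print(f"small <= {small:.6f}, medium <= {medium:.6f}, max bbox area = {areas.max():.6f}")
```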
Table 6. Label distribution statistics.

Dataset | Split | Total Labels | Mean/Standard Deviation
Strawberry | Train | 18,418 | 6.8/4.7
Strawberry | Validation | 2,493 | 4.7/4.6
Strawberry | Test | 2,448 | 7.1/4.8
Tomato | Train | 7,716 | 12.0/11.7
Tomato | Validation | 1,012 | 13.0/11.2
Tomato | Test | 1,049 | 12.5/10.9
Chili | Train | 1,457 | 3.1/2.1
Chili | Validation | 631 | 3.7/2.5
Chili | Test | 170 | 4.5/2.4
Pepper | Train | 3,957 | 8.8/3.7
Pepper | Validation | 702 | 6.9/4.5
Pepper | Test | 665 | 10.1/4.7
Table 7. Confusion matrix.

Confusion Matrix | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
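The confusion-matrix counts in Table 7 feed directly into precision (TP / (TP + FP)) and recall (TP / (TP + FN)), which in turn define the precision–recall curve that AP summarizes. A minimal illustration with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up counts: 90 correct detections, 10 false alarms, 20 missed objects.
print(precision_recall(tp=90, fp=10, fn=20))   # -> (0.9, 0.8181...)
```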
Table 8. AP improvement and knee points by crop and image size.

Crop | Image Size | Knee Point (Number of Labels) | Knee AP | Pre-Group AP Increase Rate (%) | Post-Group AP Increase Rate (%)
Strawberry | 320 | 175 | 0.6940 | 1.11 (0.67–1.54) * | 0.04 (0.03–0.04)
Strawberry | 640 | 210 | 0.7548 | 1.03 (0.68–1.37) | 0.04 (0.03–0.04)
Strawberry | 960 | 205 | 0.7605 | 1.03 (0.67–1.39) | 0.04 (0.03–0.04)
Strawberry | 1280 | 220 | 0.7693 | 0.90 (0.61–1.20) | 0.03 (0.03–0.04)
Tomato | 320 | 215 | 0.6986 | 2.05 (1.28–2.83) | 0.07 (0.07–0.08)
Tomato | 640 | 235 | 0.7979 | 1.40 (0.94–1.87) | 0.05 (0.04–0.05)
Tomato | 960 | 220 | 0.8002 | 1.31 (0.86–1.76) | 0.05 (0.04–0.05)
Tomato | 1280 | 185 | 0.8096 | 2.12 (1.21–3.04) | 0.05 (0.04–0.05)
Chili | 320 | 230 | 0.4392 | 8.68 (1.50–15.87) | 0.10 (0.09–0.11)
Chili | 640 | 290 | 0.4820 | 2.60 (2.17–9.03) | 0.11 (0.10–0.11)
Chili | 960 | 290 | 0.4818 | 5.66 (2.16–9.16) | 0.10 (0.10–0.11)
Chili | 1280 | 325 | 0.4878 | 8.43 (0.78–16.09) | 0.10 (0.10–0.11)
Pepper | 320 | 430 | 0.4703 | 1.13 (0.89–1.37) | 0.09 (0.08–0.09)
Pepper | 640 | 325 | 0.5542 | 1.87 (1.31–2.43) | 0.10 (0.09–0.11)
Pepper | 960 | 280 | 0.5793 | 2.37 (1.53–3.21) | 0.11 (0.10–0.11)
Pepper | 1280 | 300 | 0.5920 | 2.28 (1.51–3.06) | 0.11 (0.10–0.11)
* Statistically significant (p < 0.001). Values represent mean AP growth with 95% confidence intervals.
Table 9. Training time (sec, 200 epochs) by crop and image size (95% CI in parentheses).

Crop | 320 | 640 | 960 | 1280
Strawberry | 42.72 (38.80–46.64) ¹ | 87.67 (83.75–91.59) | 151.96 (148.04–155.88) | 209.50 (205.58–213.42)
Tomato | 68.41 (64.49–72.33) | 77.43 (73.51–81.35) | 113.11 (109.19–117.03) | 212.66 (208.74–216.58)
Chili | 130.90 (126.98–134.82) | 188.44 (184.52–192.36) | 326.20 (322.28–330.12) | 579.42 (575.50–583.34)
Pepper | 84.19 (80.27–88.11) | 89.12 (85.20–93.04) | 75.88 (71.96–79.80) | 79.99 (76.07–83.91)
¹ Values indicate mean training times with 95% confidence intervals.
Table 10. GPU memory usage (GB) by crop and image size (95% CI in parentheses).

Crop | 320 | 640 | 960 | 1280
Strawberry | 1.14 (0.94–1.34) ¹ | 3.26 (3.06–3.46) | 6.80 (6.60–7.00) | 12.18 (11.98–12.38)
Tomato | 0.88 (0.68–1.08) | 2.75 (2.55–2.95) | 5.86 (5.66–6.06) | 12.31 (12.11–12.51)
Chili | 1.10 (0.90–1.30) | 3.24 (3.04–3.44) | 6.98 (6.78–7.18) | 12.24 (12.04–12.44)
Pepper | 1.38 (1.18–1.57) | 3.75 (3.56–3.95) | 8.16 (7.96–8.35) | 14.65 (14.46–14.85)
¹ Values indicate mean GPU memory usage with 95% confidence intervals.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
