Article

Performance Comparison of Cherry Tomato Ripeness Detection Using Multiple YOLO Models

Dayeon Yang and Chanyoung Ju *
1 Department of Convergence Biosystems Engineering, Chonnam National University, Gwangju 61186, Republic of Korea
2 Purpose Built Mobility Group, Korea Institute of Industrial Technology, Gwangju 61012, Republic of Korea
* Author to whom correspondence should be addressed.
AgriEngineering 2025, 7(1), 8; https://doi.org/10.3390/agriengineering7010008
Submission received: 11 November 2024 / Revised: 16 December 2024 / Accepted: 26 December 2024 / Published: 30 December 2024
(This article belongs to the Special Issue Implementation of Artificial Intelligence in Agriculture)

Abstract

Millions of tons of cherry tomatoes are produced annually, and the harvesting process is crucial. This paper presents a deep learning-based approach to distinguishing the ripeness of cherry tomatoes in real time. It specifically evaluates the performance of YOLO (You Only Look Once) v5 and YOLOv8 (with a ResNet50 backbone) models. A new dataset was created by augmenting the original 300 images to 742 images using techniques such as rotation, flipping, and brightness adjustment. Experimental results show that YOLOv8 achieved a mean average precision (mAP) of 0.757, outperforming YOLOv5, which achieved an mAP of 0.701, by 5.6 percentage points. The proposed system is expected to address labor shortages caused by population decline in rural areas and to enhance productivity in cherry tomato harvesting environments. Future research will focus on integrating segmentation techniques to precisely locate cherry tomatoes and on developing a robotic manipulator capable of automating the harvesting process based on ripeness. This study provides a foundation for intelligent harvesting robots applicable in real-world settings.

1. Introduction

Cherry tomatoes are highly nutritious, widely used in various dishes, and popular in daily life. They are grown not only in households but also hold significant importance in commercial agriculture. In particular, harvesting accounts for over 40% of the total labor demand in cherry tomato production [1]. However, manual harvesting of cherry tomatoes faces challenges due to high labor costs and the need to secure additional workers during peak seasons. Agriculture is influenced not only by weather and seasonal factors but also by structural constraints within the labor market. In actual agricultural environments, it is often difficult to secure adequate labor, and during peak seasons an unstable labor supply drives up labor costs and reduces productivity. Manual harvesting methods are also considered less sustainable because they require more time than mechanical harvesting [2,3]. To address these issues, recent advances in automation technology that combine computer vision and deep learning are playing an increasingly vital role in agriculture [4,5]. In particular, research is actively progressing on harvesting robots capable of automatically evaluating crop size, color, and ripeness, with the aim of significantly improving the efficiency of harvesting tasks. One of the most crucial functions of a cherry tomato harvesting robot is the ability to reliably and accurately identify crops through a visual recognition system [6]. This is essential for efficient harvesting: the ability of robots to analyze images in real time, assess crop ripeness, and autonomously classify crops is critical in agricultural automation [7].
In recognizing cherry tomatoes from video or images, a harvesting robot encounters various challenges, such as fluctuations in natural light over time, occlusion by leaves or stems, and distinguishing unripe tomatoes from the green of the foliage. These factors can lower the reliability of recognition. In this paper, we propose a harvesting system that uses the YOLOv5 [8] and YOLOv8 [9] models, along with a YOLOv8 model with a ResNet50 backbone, to classify the ripeness of cherry tomatoes in real time. We labeled 300 images of cherry tomatoes and applied various data augmentation techniques to expand the dataset, enriching the training data for the models. Using this dataset, we evaluated the accuracy of ripeness classification and experimentally verified the performance of each deep learning model. Section 2 introduces related studies that use computer vision and machine learning techniques, with a focus on deep learning, to automate ripeness detection in the agricultural sector. Section 3 describes the structure of the CNN-based algorithm for ripeness detection, and Section 4 details the experimental process of training the deep learning models on the dataset. Section 5 analyzes the results to evaluate model performance, and Section 6 presents the conclusions of this study and suggests future research directions.

2. Related Research

In the agricultural field, extensive research is being conducted to monitor crop conditions and automatically detect disturbances or diseases during crop growth using computer vision and machine learning. For cherry tomato harvesting robots, studies have used CNN (Convolutional Neural Network)-based YOLOv5 models to detect and classify crops by ripeness. This section reviews successful cases of crop detection using YOLOv5 models in previous studies and explores their enhanced utility through comparative analysis with the more recent YOLOv8 model. Object detection, which refers to recognizing and localizing targets in real time using algorithmic models, has been widely applied in agriculture using deep learning methods. Li et al. [10] proposed an algorithm to detect vegetable diseases by improving CSPDarknet, a backbone network in the YOLO series, into a CSPTR (Cross-Stage Partial Transformer) structure; compared with the original YOLOv5s model, it achieved an object detection accuracy of 93.1%. Khan et al. [11] developed a method to classify the ripeness of tomatoes without the RNN (Recurrent Neural Network) or CNN-based architectures commonly used in deep learning. Instead, the study used only the Transformer attention mechanism to accurately classify ripeness, improving efficiency and precision in tomato harvesting, grading, and quality management. Vo et al. [12] applied the latest real-time object detection algorithm, YOLOv9, to classify tomato ripeness and count tomatoes. This algorithm improves on previous YOLO models by incorporating PGI (Programmable Gradient Information), which prevents information loss, and GELAN (Generalized Efficient Layer Aggregation Network), a lightweight network structure, achieving a performance of 0.882. Li et al. [13] developed a new object detection method that improves the backbone of YOLOv8 with the MHSA (Multi-Head Self-Attention) mechanism, taking into account the complexity of real production environments; the resulting grading and counting model for tomato maturity classification achieved an mAP of 0.864. Yang et al. [14] replaced conventional Conv layers with DSConv layers and added two modules, DPAG (Dynamic Pyramid Attention Gate) and FAM (Feature Attention Module), to enhance object feature emphasis, increase precision, and reduce computational complexity; experimental results show an mAP of 93.4%, meeting the requirements for real-time tomato detection in agricultural environments.
Xiao et al. [15] used the YOLOv8 and CenterNet deep learning models for tomato maturity classification. CenterNet treats the center of each object as a keypoint and calculates size and position from that center; in their comparison, YOLOv8 achieved the higher accuracy of 99.5%. Afonso et al. [16] implemented Mask R-CNN deep learning with the ResNet network architecture for object detection and segmentation tasks. By post-processing the segmentation results from Mask R-CNN, fruits detected in the background were removed: the mid-depth pixel value of each object's mask was calculated using MATLAB, and objects exceeding a specific threshold were considered foreground. This approach effectively detected tomatoes using a simple, cost-effective camera setup in greenhouse environments.
Bresilla et al. [17] developed a modified M3 model to improve YOLOv2's performance in detecting fruits within branches. Their M1 model divided input images into a 26 × 26 grid, rather than a 13 × 13 grid, to better detect smaller objects, although this slowed down training and detection. To address this, M2 removed certain convolutional and pooling layers to make the model shallower and faster, at a slight cost in accuracy. Based on M2, the M3 model added Splitter and Joiner blocks, allowing higher-resolution image processing to better detect small objects while enhancing accuracy and maintaining an optimal speed.
Color plays a crucial role in determining crop ripeness, as commonly seen in fruits such as apples, oranges, and bananas, which have distinctive colors. However, some fruits retain colors similar to green leaves even when fully ripe, making it challenging to distinguish crops from foliage based solely on color [18]. To address this issue for crops such as cucumbers, which share similar colors with leaves, Mao et al. [19] employed the IRELIEF algorithm and an MPCNN (Multi-Path Convolutional Neural Network). The model selects the three color components that best separate cucumbers from the background, calculates weights, and reduces the background contrast; these color components are then fed into separate convolutional layers along independent paths to extract features, achieving a recognition accuracy exceeding 90%. While the Faster R-CNN algorithm provides high accuracy, it is too slow for the real-time detection required to determine cherry tomato ripeness. In this study, we therefore use the YOLO algorithm, modifying the backbone of YOLOv8 with ResNet50 and conducting comparative experiments across different model sizes. This approach aims to achieve both high accuracy and efficiency in real-time harvesting robot environments.

3. Ripeness Classification Algorithm

In this section, we compare and evaluate the performance of YOLO series models and a modified YOLOv8 model, which are deep learning-based object detection algorithms for determining cherry tomato ripeness. The YOLO algorithm features a simple structure and outputs bounding boxes and class labels directly through the neural network, making the model compact with fast computation speed. Since YOLO requires only a single processing step to produce the final detection results, it enables real-time detection and demonstrates strong generalization capabilities, allowing for transfer learning across various domains [20].

3.1. YOLO

The accurate classification of crop ripeness is a challenging task due to the significant variations in visual characteristics across ripeness stages, as well as the considerable variability within each class. Small differences in color, size, and shape can make classification complex, especially as these traits vary with the environment. Factors such as lighting direction and intensity, background complexity, and occlusions caused by leaves and stems greatly affect recognition accuracy. Moreover, multiple instances of a crop often appear in a single scene, making it essential for the robot to recognize and track each object independently. This study aims to meet these demands by comparing CNN-based YOLOv5 and YOLOv8 models, analyzing their performance, and empirically verifying the effectiveness of each model in implementing a highly efficient recognition system.

The YOLOv5 model improves baseline recognition accuracy by applying various data augmentation techniques, including image rotation, saturation adjustment, and exposure control, thereby increasing the diversity of the dataset and enhancing the model's generalization capability. This enables the model to identify objects accurately across a wide range of backgrounds, making it particularly useful in complex agricultural environments with irregular lighting conditions. During training, YOLOv5 dynamically calculates anchor boxes suited to the training dataset, automatically optimizing anchor selection to maximize performance. However, YOLOv5 faces limitations with complex backgrounds: its detection capability decreases, which can lead to errors in delineating object boundaries and inaccurate recognition of crop health, diseases, and pests. YOLOv8 was developed to address these limitations and further enhance recognition performance.

YOLOv5 is implemented in PyTorch and is available in various model sizes: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, with "n" denoting the lightest model. This lightweight structure allows deployment in constrained hardware environments, making it a flexible option for diverse applications. The algorithm is composed of three main sections: Backbone, Neck, and Head, each of which plays a critical role in a different stage of object detection. The Backbone extracts meaningful features from the input image and converts it into feature maps; the Neck combines multi-scale feature maps to effectively detect objects' locations and sizes; and the Head predicts the location of each object and categorizes it by class, generating the final detection results.
The more recent YOLOv8 replaces the CSP (Cross Stage Partial) layer used in YOLOv5 with the more efficient and streamlined C2f module, reducing structural complexity and increasing computational efficiency. The C2f module allows the model to maintain the same level of accuracy with lower computational costs, which is a key advantage for real-time processing in practical environments. Additionally, the inclusion of the SPPF (Spatial Pyramid Pooling Fast) layer enables pooling of image features of varying sizes into fixed-size feature maps, further accelerating processing speed. The kernel size of the first convolution layer was reduced from 6 × 6 to 3 × 3, cutting the total number of parameters by more than half and significantly reducing model size and memory usage. These enhancements substantially improve inference speed and performance, ensuring stable recognition capabilities across diverse agricultural conditions and making the model highly efficient for real-time object detection tasks. Consequently, this study conducts a detailed experimental analysis comparing YOLOv5 and YOLOv8 in the context of real-time cherry tomato ripeness classification to explore the optimal model structure and application methods for accurate ripeness classification and harvesting automation in agricultural environments. This investigation provides insights into how each model can be practically utilized in agricultural automation, offering foundational data for developing effective solutions for real-time tasks, such as crop classification, in the field of agriculture.
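As a concrete illustration of the SPPF layer described above, the following is a minimal PyTorch sketch assuming the stacked-pooling design used in Ultralytics models; the plain Conv2d layers (in place of the library's conv–BN–SiLU blocks) and the channel sizes are simplifications, not the exact YOLOv8 implementation.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Sketch of Spatial Pyramid Pooling Fast: three stacked 5x5 max-pools
    approximate the parallel 5/9/13 pooling of the original SPP but run
    faster; a 1x1 convolution fuses the concatenated multi-scale features
    into a fixed-size feature map."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2  # reduce channels before pooling
        self.cv1 = nn.Conv2d(c_in, c_hidden, kernel_size=1, stride=1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, kernel_size=1, stride=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)    # effective receptive field ~5x5
        y2 = self.pool(y1)   # ~9x9
        y3 = self.pool(y2)   # ~13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# Example: a 640x640 input downsampled 32x yields a 20x20 feature map.
feat = torch.randn(1, 512, 20, 20)
print(SPPF(512, 512)(feat).shape)  # torch.Size([1, 512, 20, 20])
```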

3.2. Improved YOLO

In image feature extraction or recognition systems, the depth of the neural network plays a crucial role in improving model performance. As the network depth increases, it can learn more complex patterns and high-dimensional features, which enhances the model’s representational power. For example, a deep neural network is capable of learning not only simple boundaries and color differences but also complex textures, shapes, and even correlations between objects across multiple layers. However, as networks deepen, issues such as vanishing gradients and exploding gradients are more likely to occur, which can hinder the model’s ability to converge and reduce its accuracy during the training process. To address these challenges, ResNet [21] was introduced. ResNet uses residual blocks that leverage skip connections to alleviate the vanishing gradient problem, ensuring that deep networks remain stable and learn effectively. This innovative structure provides stability even as network depth increases, making it highly valuable in deep learning applications where complex feature extraction is required.
ResNet is designed to maintain high performance even as network depth increases by preserving key information from the input while reducing dimensions and, when necessary, learning residual functions. Through the residual blocks of ResNet, the network does not directly learn additional features on top of the input but rather learns the difference between the input data and the output, or the residual, enabling the model to selectively learn only the necessary information. By focusing on residuals, the network minimizes the risk of overfitting while ensuring that critical patterns in the data are captured effectively. This approach helps the network avoid learning unnecessary details, thereby improving computational efficiency and enhancing the stability of the training process. These properties are particularly crucial in object detection tasks, where capturing fine-grained details in challenging conditions, such as occlusions or overlapping objects, is essential.
Dimensionality reduction in ResNet occurs in the first 1 × 1 convolution layer. In this layer, the input dimensions are reduced to 64, minimizing unnecessary computations and maximizing computational efficiency. The reduced dimensions are preserved in the following 3 × 3 convolution layer, which allows the model to extract key features without losing important information. Finally, the dimensions are expanded again in the second 1 × 1 convolution layer to match the original input dimension, resulting in an output that is equivalent in dimension to the input. Each layer thus learns the residual F(x), resulting in the final output y = F(x) + x. This structure not only ensures stable gradient flow but also enables the network to learn efficiently, even with significant depth. Such residual learning is particularly advantageous for applications where large datasets and optimized training time are critical.
In this study, the original backbone of the YOLOv8 model was replaced with ResNet50 to enhance object detection and image classification accuracy. ResNet50 is a deep neural network with 50 layers that includes a bottleneck structure to efficiently manage the complexity of deep networks and the time required for training. The bottleneck structure adjusts the channel dimensions through a 1 × 1 convolution layer, then extracts key features in a 3 × 3 convolution layer, thereby minimizing computational costs while retaining important information. In the first 1 × 1 convolution layer, the input data is compressed to reduce computational load, and then expanded in subsequent layers to retain key features and maximize learning performance.
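As a sketch of the bottleneck block just described, the following PyTorch module implements the 1 × 1 reduce → 3 × 3 extract → 1 × 1 expand pattern with a skip connection (y = F(x) + x); the batch-norm placement and channel counts follow the standard ResNet recipe, and the projection shortcut for mismatched dimensions is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet50-style bottleneck residual block (sketch)."""
    expansion = 4  # output channels = planes * 4, as in ResNet50

    def __init__(self, c_in: int, planes: int):
        super().__init__()
        c_out = planes * self.expansion
        self.residual = nn.Sequential(
            nn.Conv2d(c_in, planes, kernel_size=1, bias=False),   # 1x1 reduce
            nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False),  # 3x3 extract
            nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, c_out, kernel_size=1, bias=False),  # 1x1 expand
            nn.BatchNorm2d(c_out),
        )
        # 1x1 projection on the skip path when input/output dims differ
        self.shortcut = (nn.Identity() if c_in == c_out else
                         nn.Conv2d(c_in, c_out, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = F(x) + x: the block learns only the residual F(x)
        return self.relu(self.residual(x) + self.shortcut(x))

x = torch.randn(1, 64, 160, 160)
print(Bottleneck(64, 64)(x).shape)  # torch.Size([1, 256, 160, 160])
```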
Additionally, the Complete Intersection over Union (CIoU) loss function was employed for bounding box regression to further enhance the model's localization accuracy. CIoU improves upon traditional IoU-based loss functions by incorporating center distance, aspect ratio mismatches, and overlap area into its calculations. The CIoU loss function is defined in Equation (1):

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{1}$$
Here, $b$ and $b^{gt}$ represent the predicted and ground-truth bounding boxes, respectively, $\rho^2(b, b^{gt})$ is the squared Euclidean distance between the centers of the predicted and actual boxes, and $c$ is the diagonal length of the smallest enclosing box that contains both the predicted and actual boxes. $\alpha$ is a positive equilibrium parameter, and $v$ measures the aspect-ratio difference between the predicted and ground-truth boxes. The formulas for $\alpha$ and $v$ are defined in Equations (2) and (3):
$$\alpha = \frac{v}{(1 - IoU) + v} \tag{2}$$

$$v = \frac{4}{\pi^2} \left( \arctan\frac{w}{h} - \arctan\frac{w^{gt}}{h^{gt}} \right)^2 \tag{3}$$
where $w$ and $h$ denote the width and height of the predicted bounding box, respectively, while $w^{gt}$ and $h^{gt}$ represent the width and height of the ground-truth bounding box. The parameter $\alpha$ dynamically adjusts the aspect-ratio penalty, ensuring that the loss remains balanced even when the IoU is low. The parameter $v$ quantifies the difference in aspect ratios between the predicted and actual bounding boxes, penalizing significant mismatches.
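To clarify how the three terms of Equations (1)–(3) combine, the following is a hedged PyTorch re-implementation of the CIoU loss; it assumes boxes in (x1, y1, x2, y2) corner format and is an illustrative sketch rather than the exact Ultralytics code.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss per Equations (1)-(3); pred and target are (N, 4) tensors
    of boxes in (x1, y1, x2, y2) format. Returns one loss value per box."""
    # Overlap area and IoU
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between the two box centers
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4

    # c^2: squared diagonal of the smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency term, Equation (3)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_p / (h_p + eps)) - torch.atan(w_t / (h_t + eps))) ** 2

    # alpha: equilibrium parameter, Equation (2); gradients are not
    # propagated through alpha, as in common implementations
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v  # Equation (1)
```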
By combining ResNet50’s structural improvements with the CIoU loss function, the YOLOv8 model achieves enhanced feature extraction and more accurate bounding box regression. This combination ensures that the model can robustly handle challenging scenarios, such as occlusions and overlapping objects.
ResNet’s skip connections and residual learning framework enhance YOLOv8’s feature extraction capabilities by focusing on critical differences rather than redundant patterns. These enhancements significantly improve the detection of subtle boundaries and reduce false positives and false negatives, particularly in complex agricultural environments. The application of ResNet50 enables YOLOv8 to maintain high detection accuracy even on images with complex backgrounds, and these structural improvements make the variant superior to the original CSP-based YOLO architecture by reducing information loss and improving computational efficiency. Quantitative evaluations show that ResNet50-based YOLOv8 achieves higher precision and recall across various scenarios, particularly under the challenging lighting conditions, complex backgrounds, and occlusions common in real-world agricultural environments. Figure 1 illustrates the structure of YOLOv8 with its backbone modified to ResNet50, showing how this modification allows for more sophisticated feature extraction and results in higher detection accuracy.
The YOLOv8 model with ResNet50 as its backbone provides high detection accuracy while maintaining computational efficiency, making it particularly advantageous in agricultural environments where real-time processing is required. This capability makes it suitable for various applications, including agricultural automation, real-time crop harvesting, quality control, and pest detection. The combination of high accuracy and low latency facilitates the successful implementation of real-time crop management and harvesting robots, and the optimized model structure ensures seamless operation even with limited computational resources, offering a practical foundation for developing effective solutions.

4. Ripeness Classification Experiment

4.1. Dataset

The cherry tomato images used in this study were sourced from the Tomato Fruits Dataset, available on the machine learning platform Kaggle. To prepare the images for training a deep learning model, it was essential to conduct an accurate and thorough image annotation process. Cherry tomatoes that appear distinctly red were labeled as “ripe”, as these are the ones intended for automated harvesting. In contrast, tomatoes appearing in shades of orange or green were labeled as “unripe”, indicating they are not yet ready for picking. These two primary classes of ripeness, “ripe” and “unripe”, are illustrated in Figure 2 to showcase the visual distinctions used in labeling criteria.
Annotation was performed using Roboflow, a widely used platform for image labeling and dataset preparation in machine learning workflows. The dataset was carefully split into subsets of 70% for training, 20% for validation, and 10% for testing, ensuring that the model could learn, validate, and be tested on distinct portions of data. Each cherry tomato instance was annotated with bounding boxes to indicate its location and boundaries precisely, enabling the model to understand where each object begins and ends within an image. Since the primary objective of this system is to detect ripe cherry tomatoes for harvesting purposes, any non-red tomatoes were categorized as unripe to align with this goal.
Data augmentation was applied extensively to increase the diversity and robustness of the training dataset, which enhances the model’s ability to generalize to new, unseen conditions. Given the natural variability in brightness and color due to fluctuating weather and diverse environmental conditions in agricultural settings, data augmentation techniques—such as rotation, saturation adjustments, brightness changes, exposure modifications, and image flipping—were applied. These transformations help the model adapt to various lighting and background conditions that may be encountered during real-world harvesting. Figure 3 shows examples of these augmentations, which play a vital role in preventing overfitting by creating a more diverse dataset. This augmented data enables the model to perform consistently and maintain high accuracy when deployed on a harvesting robot operating in dynamic outdoor environments.
Figure 4 illustrates the similarity between cherry tomato images collected with the camera mounted on the actual harvesting robot and those in the dataset used in this study. Additionally, the AgRob v16 robot, which was used to gather data in a real cherry tomato greenhouse environment, is presented to emphasize the practical alignment of the dataset [2,6]. The dataset used in this study consists of a total of 742 images, including 300 original images and 442 augmented images. Approximately 60% of the dataset is composed of augmented images, while the remaining 40% consists of original images. Data augmentation techniques such as rotation, flipping, and brightness adjustment were applied to expand the dataset and ensure diversity.
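For illustration, the torchvision sketch below reproduces offline the kinds of augmentations listed above (rotation, flipping, saturation and brightness changes); the parameter ranges and file names are assumptions, since the augmented images in this study were generated through Roboflow.

```python
from torchvision import transforms
from PIL import Image

# Randomized augmentation recipe mirroring the techniques named above;
# the degree and jitter ranges are illustrative assumptions.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                    # small rotations
    transforms.RandomHorizontalFlip(p=0.5),                   # flipping
    transforms.ColorJitter(brightness=0.3, saturation=0.3),   # lighting variation
])

img = Image.open("cherry_tomato.jpg")  # hypothetical file name
for i in range(2):  # two randomized augmented copies per original image
    augment(img).save(f"cherry_tomato_aug{i}.jpg")
```

Note that geometric transforms such as rotation and flipping must be applied consistently to the bounding-box annotations as well; annotation platforms such as Roboflow handle this automatically.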

4.2. Cherry Tomato Ripeness Classification Experiment

In this study, various YOLO models were trained using a custom dataset to classify the ripeness of cherry tomatoes. The dataset was labeled based on the color and ripeness level of each cherry tomato, categorizing them as either ripe or unripe. During the training process, loss functions were applied to optimize each model’s performance. The loss function was designed to minimize errors related to the bounding box location, class probability, and object presence, enabling the model to effectively learn the accurate position and ripeness status of cherry tomatoes. These functions play a crucial role in helping the model recognize subtle changes in shape and color for ripeness prediction.
Model performance was evaluated using precision, recall, and mean average precision (mAP), as defined in Equations (4)–(6).
Precision measures the probability that tomatoes predicted as ripe by the model are actually ripe, while recall measures the rate at which the model correctly detects actual ripe tomatoes. mAP is a comprehensive metric for evaluating prediction performance across multiple classes, indicating the model’s consistency across various situations.
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{4}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{5}$$

$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{6}$$
In Equation (6), $N$ represents the total number of object classes in the image dataset, and $AP_i$ denotes the average precision for class $i$. Here, True Positive refers to instances where the model correctly predicts ripe tomatoes as ripe. False Positive represents cases where the model incorrectly predicts unripe tomatoes as ripe, increasing the model's error rate. False Negative indicates cases where the model incorrectly predicts ripe tomatoes as unripe, which lowers recall.
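The following minimal Python sketch restates Equations (4)–(6) at a single operating point; in practice, the per-class AP is obtained by integrating the precision–recall curve over confidence thresholds, which detection frameworks compute internally, so the mean_ap helper below simply averages precomputed AP values.

```python
def precision(tp: int, fp: int) -> float:
    """Equation (4): fraction of predicted ripe tomatoes that are actually ripe."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Equation (5): fraction of actual ripe tomatoes the model detects."""
    return tp / (tp + fn) if tp + fn else 0.0

def mean_ap(ap_per_class: list[float]) -> float:
    """Equation (6): mean of per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

# Example with the two ripeness classes used in this study (values assumed):
print(precision(tp=75, fp=25))   # 0.75
print(recall(tp=75, fn=15))      # 0.833...
print(mean_ap([0.78, 0.72]))     # 0.75
```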
By employing multiple evaluation metrics to assess model performance in this experiment, we were able to clearly identify the strengths and limitations of the YOLO models in classifying cherry tomato ripeness. This provides valuable information for setting directions for future research aimed at model improvement and optimization.

4.3. Experimental Setup

The experimental environment for this study was built on the Ubuntu 22.04 operating system with CUDA 12.2, using an NVIDIA GeForce RTX 4070 GPU for computation. The models were trained for 100 epochs with an initial learning rate of 0.01, selected to ensure sufficient convergence while maintaining computational efficiency. A batch size of 16 was chosen to balance memory constraints and processing speed, and the input image size was fixed at 640 × 640 pixels to achieve an optimal trade-off between model accuracy and inference speed. The SGD optimizer was used for its stability and robustness in high-dimensional parameter spaces. To compare the performance of the YOLO family, several versions from the YOLO series were trained, including different variants of YOLOv8 and YOLOv8 models with a ResNet50 backbone. This setup enabled a comprehensive comparison and analysis of each model's performance.
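As a sketch of this training configuration using the Ultralytics API, the snippet below shows the hyperparameters described above; the dataset YAML name is a hypothetical placeholder, and the ResNet50-backbone variant would additionally require a custom model definition that is not shown here.

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv8 checkpoint; the ResNet50-backbone variant
# would instead load a custom model YAML (assumption, not shown).
model = YOLO("yolov8l.pt")

model.train(
    data="tomato.yaml",   # hypothetical dataset config: paths + 2 classes (ripe/unripe)
    epochs=100,           # training epochs, as in the setup above
    imgsz=640,            # 640x640 input resolution
    batch=16,             # batch size
    lr0=0.01,             # initial learning rate
    optimizer="SGD",      # SGD optimizer
    device=0,             # NVIDIA RTX 4070
)
```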

5. Experimental Results

In this study, the performance of YOLOv5 and YOLOv8 models was evaluated. Figure 5 illustrates the models’ ability to classify ripe and unripe tomatoes based on ripeness, while Figure 6 visually presents the prediction results on the validation dataset.
The test results of the object detection algorithms, shown in Table 1, include detection performance, model size, and speed. For the YOLOv5s model, precision was measured at 0.657, recall at 0.751, and mAP at 0.701. The modified YOLOv8 models with a ResNet50 backbone were assessed on precision, recall, mean average precision (mAP), layer count, parameter count, and GFLOPs (giga floating-point operations). The YOLOv8 models with a ResNet50 backbone generally demonstrated improved precision compared to the original models. Specifically, the precision of the YOLOv8s model improved from 0.639 to 0.668 with the ResNet50 backbone, and its mAP increased from 0.654 to 0.721. During the experiments, the impact of model architecture and parameter settings on performance was examined. The YOLOv8l (ResNet50) model achieved the highest mAP of 0.757, indicating that it is well suited to real-time operation with high precision for cherry tomato harvesting tasks. Applying this model to real-time cherry tomato detection and harvesting in agricultural environments can be expected to yield high harvesting efficiency and fast processing.

6. Conclusions

This paper proposes a real-time cherry tomato detection algorithm for harvesting robots. The study aims to improve the speed and accuracy of object detection by using YOLO deep learning models. A new dataset was created by augmenting the original 300-image dataset to a total of 742 images, which improved performance compared to training on the original dataset alone. Future research will focus on refining the cherry tomato detection algorithm with segmentation targeted specifically at cherry tomatoes. Segmentation is essential for isolating individual tomatoes from overlapping clusters and complex backgrounds, enabling precise detection and classification. This will allow the robot to accurately locate each tomato and automate harvesting according to ripeness using a manipulator. Although the limited dataset size led to moderate experimental results, further experiments will be conducted with an expanded dataset. Additionally, the model will be tested in real-world conditions to validate the efficiency and practicality of the algorithm. The applicability of Faster R-CNN, which may provide higher accuracy, will also be evaluated for agricultural environments that require such tasks, comparing its real-time detection performance against the YOLO algorithm.

Author Contributions

Conceptualization, C.J.; Methodology, D.Y. and C.J.; Validation, D.Y.; Formal analysis, D.Y.; Investigation, D.Y.; Writing—original draft, D.Y.; Writing—review & editing, C.J.; Supervision, C.J.; Project administration, C.J.; Funding acquisition, C.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Korea Institute of Industrial Technology (KITECH) under the project “Development of a Holonic Manufacturing System for the Future Industrial Environment” (EO240002), funded by the Clean Production System Core Technology Research Project in 2024. Additionally, this research was supported by KITECH under the project “Development of an Aerial Robot for Tracking Small Insects” (UR240052), funded by the Self-Research Project in 2024.

Data Availability Statement

The data utilized in this study is currently retained in a confidential manner to ensure its security and integrity, as it will be used for future research and additional publications. Access to the data is strictly controlled to maintain its confidentiality, and data sharing is possible only for legitimate requests aligned with research purposes. Requesters must clearly articulate the purpose and intended use of the data, and access will be granted upon review, provided that the requester adheres to research ethics and data protection regulations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Miao, Z.; Yu, X.; Li, N.; Zhang, Z.; He, C.; Li, Z.; Deng, C.; Sun, T. Efficient tomato harvesting robot based on image processing and deep learning. Precis. Agric. 2023, 24, 254–287.
  2. Magalhães, S.A.; Castro, L.; Moreira, G.; Dos Santos, F.N.; Cunha, M.; Dias, J.; Moreira, A.P. Evaluating the single-shot multibox detector and YOLO deep learning models for the detection of tomatoes in a greenhouse. Sensors 2021, 21, 3569.
  3. Mitaritonna, C.; Ragot, L. After Covid-19, will seasonal migrant agricultural workers in Europe be replaced by robots? CEPII Policy Brief 2020, 33, 1–10.
  4. Kim, M.K.; Lim, S.M.; Won, M.C. Seedling area recognition algorithm for autonomous rice planter based on deep learning and image processing. J. Inst. Control Robot. Syst. 2023, 29, 245–251. (In Korean)
  5. Lim, S.M.; Kim, M.K.; Lee, D.H.; Won, M.C. Vision-based farmland boundary detection algorithm for automation of an agricultural tractor. J. Inst. Control Robot. Syst. 2023, 29, 208–216. (In Korean)
  6. Lawal, O.M. Development of tomato detection model for robotic platform using deep learning. Multimed. Tools Appl. 2021, 80, 26751–26772.
  7. Kim, J.E.; Seol, J.H.; Son, H.I. Preliminary experimental results of a deep learning-based intelligent spraying system for pear orchard. J. Inst. Control Robot. Syst. 2020, 26, 23–28. (In Korean)
  8. Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 November 2024).
  9. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 November 2024).
  10. Li, J.; Qiao, Y.; Liu, S.; Zhang, J.; Yang, Z.; Wang, M. An improved YOLOv5-based vegetable disease detection method. Comput. Electron. Agric. 2022, 202, 107345.
  11. Khan, A.; Hassan, T.; Shafay, M.; Fahmy, I.; Werghi, N.; Mudigansalage, S.; Hussain, I. Tomato maturity recognition with convolutional transformers. Sci. Rep. 2023, 13, 22885.
  12. Vo, H.-T.; Mui, K.C.; Thien, N.N.; Tien, P.P. Automating tomato ripeness classification and counting with YOLOv9. Int. J. Adv. Comput. Sci. Appl. 2024, 15.
  13. Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato maturity detection and counting model based on MHSA-YOLOv8. Sensors 2023, 23, 6701.
  14. Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A lightweight YOLOv8 tomato detection algorithm combining feature enhancement and attention. Agronomy 2023, 13, 1824.
  15. Xiao, B.; Nguyen, M.; Yan, W.Q. Fruit ripeness identification using YOLOv8 model. Multimed. Tools Appl. 2024, 83, 28039–28056.
  16. Afonso, M.; Fonteijn, H.; Fiorentin, F.S.; Lensink, D.; Mooij, M.; Faber, N.; Polder, G.; Wehrens, R. Tomato fruit detection and counting in greenhouses using deep learning. Front. Plant Sci. 2020, 11, 571299.
  17. Bresilla, K.; Perulli, G.D.; Boini, A.; Morandi, B.; Corelli Grappadelli, L.; Manfrini, L. Single-shot convolution neural networks for real-time fruit detection within the tree. Front. Plant Sci. 2019, 10, 421226.
  18. Edan, Y.; Han, S.; Kondo, N. Automation in agriculture. In Springer Handbook of Automation; Springer: New York, NY, USA, 2009; pp. 1095–1128.
  19. Mao, S.; Li, Y.; Ma, Y.; Zhang, B.; Zhou, J.; Wang, K. Automatic cucumber recognition algorithm for harvesting robots in the natural environment using deep learning and multifeature fusion. Comput. Electron. Agric. 2020, 170, 105254.
  20. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A review of YOLO algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Figure 1. YOLOv8 architecture using ResNet50’s bottleneck residual block for cherry tomato detection.
Figure 2. Class classification of cherry tomatoes. (a) Ripe cherry tomato. (b) Unripe cherry tomato.
Figure 3. Various data augmentation techniques applied to cherry tomato images: (a) original image; (b) 90° rotation; (c) 15° rotation; (d) saturation adjustment; (e) zoom; (f) brightness adjustment.
Figure 4. (a) Tomato images captured by cameras mounted on the actual harvesting robot [6]. (b) Tomato images from the Tomato Fruits Dataset. (c) AgRob v16 robot conducting data collection in a real tomato greenhouse environment for dataset creation [2].
Figure 5. Detection and classification of cherry tomato ripeness.
Figure 6. Prediction results on the validation dataset.
Table 1. Experimental performance comparison of YOLO models.

| Model | Precision | Recall | mAP | Layers | Parameters | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv5s | 0.657 | 0.751 | 0.701 | 157 | 7,015,519 | 15.8 |
| YOLOv9s | 0.679 | 0.755 | 0.738 | 917 | 7,318,368 | 27.6 |
| YOLOv8s | 0.639 | 0.739 | 0.654 | 168 | 11,126,358 | 28.4 |
| YOLOv8m | 0.604 | 0.774 | 0.672 | 182 | 25,840,918 | 78.7 |
| YOLOv8l | 0.626 | 0.729 | 0.671 | 268 | 43,608,150 | 164.8 |
| YOLOv8x | 0.616 | 0.717 | 0.657 | 268 | 68,125,484 | 257.4 |
| YOLOv8s (ResNet50) | 0.668 | 0.784 | 0.721 | 350 | 30,990,102 | 86.7 |
| YOLOv8m (ResNet50) | 0.677 | 0.760 | 0.733 | 378 | 39,229,974 | 111.8 |
| YOLOv8l (ResNet50) | 0.707 | 0.775 | 0.757 | 406 | 49,231,382 | 151.7 |
| YOLOv8x (ResNet50) | 0.718 | 0.768 | 0.745 | 406 | 62,828,886 | 196.6 |
