Classification of Tomato Fruit Using Yolov5 and Convolutional Neural Network Models

Four deep learning frameworks consisting of Yolov5m and Yolov5m combined with ResNet-50, ResNet-101, and EfficientNet-B0, respectively, are proposed for classifying tomato fruit on the vine into three categories: ripe, immature, and damaged. For a training dataset consisting of 4500 images and a training process with 200 epochs, a batch size of 128, and an image size of 224 × 224 pixels, the prediction accuracy for ripe and immature tomatoes is found to be 100% when combining Yolov5m with ResNet-101. Meanwhile, the prediction accuracy for damaged tomatoes is 94% when using Yolov5m with the EfficientNet-B0 model. The ResNet-50, EfficientNet-B0, Yolov5m, and ResNet-101 networks have testing accuracies of 98%, 98%, 97%, and 97%, respectively. Thus, all four frameworks have the potential for tomato fruit classification in automated tomato fruit harvesting applications in agriculture.


Introduction
Fruit harvesting is labor-intensive and time-consuming work. However, with the development of artificial intelligence (AI), much of this work can now be performed by robots. Robotic harvesting comprises two main steps: fruit detection using a computer vision system and fruit picking using a robot arm. Of the two steps, fruit detection is the more crucial, since only the fruit that are ripe and ready for consumption should be harvested, while the remainder are left on the branch or vine to mature. Many techniques have been developed for fruit detection over the last decade. Conventional techniques rely mainly on color, texture, shape, and other shallow features of the image for detection [1][2][3][4]. However, the detection accuracy of such methods is heavily dependent on the illumination conditions. Moreover, as the detection algorithms are complicated and have many fixed thresholds, it is difficult to adapt them to other fruits and/or environments. Consequently, the development of AI technology has prompted significant interest in the potential for applying machine learning to computer vision tasks, such as harvesting, in agriculture. Many machine learning methods [5][6][7] have been proposed for fruit classification based on color detection, edge detection, etc. Zhao et al. [8] combined the AdaBoost algorithm with the average pixel value (APV) for tomato fruit detection. Based on the shape, texture, color properties, and Haar-like features of ordinary color images, the proposed algorithm has an accuracy of 96.5% when detecting ripe tomatoes. Luo et al. [9] used the AdaBoost framework and multiple color components obtained from the vision sensor to automatically identify clusters of ripe grapes on a farm. The proposed method has an accuracy of 96.56%. Liu et al. [10] employed an automatic tomato detection method for ordinary color images.
The histograms of oriented gradients (HOG) method was used to train the support vector [...] maximum detection accuracy was shown to be 99.5%. Liu et al. [32] performed tomato detection using the YOLO-tomato model with a new circular bounding box (C-Bbox) method. The maximum detection accuracy was shown to be 94.58%. The techniques described in [29][30][31][32] provide useful and reliable solutions for tomato detection. However, more work is required to improve their detection performance in complex real-world greenhouse environments. In a previous study, the present group proposed a CNN-based technique for strawberry disease classification with an accuracy of 98-100% [33].
Building on the results obtained in [33], the present study combines the Yolov5 medium model (Yolov5m) with four different CNN classification models (Yolov5m, ResNet-50, ResNet-101, and EfficientNet-B0) to classify tomatoes on the vine into three categories: ripe, immature, and damaged.
Figure 1 shows the training loss and accuracy of the four classification models (Yolov5m, ResNet-50, ResNet-101, and EfficientNet-B0). The structural parameters and the Top 1 and Top 2 accuracies of the four models are listed in Table 1. Overall, the Yolov5m model provides the best tradeoff between accuracy (0.997) and training time (52 min). Notably, none of the four models shows signs of overfitting or underfitting. Hence, the effectiveness of the data augmentation process is confirmed.
Figure 2 shows the confusion matrices of the training results for the four models. As shown in Figure 2a, the Yolov5m model achieved a classification accuracy of 100% for ripe tomatoes, 100% for immature tomatoes, and 92% for damaged tomatoes (with 6% of the damaged tomatoes misclassified as immature and 2% misclassified as ripe). The relatively poor accuracy for the damaged tomatoes can be explained by the fact that their color characteristics are similar to those of the ripe and immature tomatoes. For the Yolov5m with ResNet-50 model (Figure 2b), the classification accuracy was 100% for both ripe and immature tomatoes and 94% for damaged tomatoes. In other words, the classification accuracy for damaged tomatoes improved by 2 percentage points compared to that of the Yolov5m model alone. As shown in Figure 2c, the Yolov5m with ResNet-101 model achieved a classification accuracy of 100% for immature and ripe tomatoes and 92% for damaged tomatoes. In other words, the accuracy of the combined model was the same as that of the standalone Yolov5m model. Thus, combining Yolov5m with ResNet-101 not only failed to improve the accuracy but also increased the training time. Finally, the Yolov5m with EfficientNet-B0 model achieved a classification accuracy of 100% for ripe tomatoes, 96% for immature tomatoes, and 94% for damaged tomatoes (Figure 2d). Thus, the classification performance is generally inferior to that of the standalone Yolov5m model. However, its training time is the shortest of the four models, since it contains just 4.0 M parameters (Table 1).
Figure 3 shows the accuracy, recall, precision, and F1 score metrics of the four models in the testing stage. As shown in Figure 3a, ResNet-50 and EfficientNet-B0 have the highest accuracy (98%), while Yolov5m and ResNet-101 have the lowest (97%). All four models have a recall of 1.00 for ripe tomatoes, as shown in Figure 3b. Yolov5m, ResNet-50, and ResNet-101 also have a recall of 1.00 for immature tomatoes, whereas the recall of EfficientNet-B0 for this category falls to 0.96. Meanwhile, the recall values for damaged tomatoes vary in the range of 0.92 to 0.94 across the four models. As shown in Figure 3c, all four models have a precision of 0.98 for ripe tomatoes. Yolov5m, ResNet-50, and ResNet-101 have a precision of 1.00 for damaged tomatoes, whereas EfficientNet-B0 has a lower precision of 0.95. The precision values of the four models for immature tomatoes vary from 0.94 to 0.96. All four models have an F1 score of 0.99 for ripe tomatoes (see Figure 3d). In other words, the models tend to predict the ripe tomato state more accurately than the other states. The F1 scores for the immature and damaged tomatoes vary in the ranges of 0.96-0.98 and 0.93-0.97, respectively.
Figure 4 shows the TPR, TNR, FPR, and FNR values of the four models in the testing stage. As shown, the TPR values of immature and ripe tomatoes are high, with a range of 96-100%. However, damaged tomatoes have lower TPR values of 92-94%. The TNR values of the three categories are similarly high, at 98-99%. The FPR values of the Yolov5m, ResNet-50, ResNet-101, and EfficientNet-B0 models are low, at 0-3%, 0-2%, 0-3%, and 1-2%, respectively. Finally, the FNR values are also low (0-8%), with the ResNet-101 model having the lowest range (0-6%). The four models are thus neither underfitted nor overfitted, which reduces the risk of confusion and error.

Tomato State Dataset
Tomato images were collected from tomato farms in Miaoli County, Taiwan, and the Asian Vegetable Research and Development Center (AVRDC) in Tainan, Taiwan, using an iPhone 11. The images were collected from many different angles and distances, at different times of day, and in a variety of weather conditions in order to increase the efficiency and accuracy of the deep learning model. A total of 1508 images were obtained with a size of 3024 × 4032 pixels, a bit depth of 24, and a resolution of 72 dpi in both the horizontal and the vertical directions. Figure 5 presents typical images acquired for the three tomato states: ripe, immature, and damaged. As shown, the ripe tomatoes have an orange to red color, the immature tomatoes are green, and the damaged tomatoes have an irregular shape and obvious physical damage. To improve the accuracy of the training model, the tomato images were acquired at three different times of the day (9:00 a.m., 12:00 p.m., and 5:00 p.m.) in order to capture the effects of different illumination conditions. As shown in Figure 6, the images captured at midday showed intense brightness and deep shadows, while the images taken in the afternoon were darker and more uniform in color and intensity. The captured images were cropped and normalized to a size of 224 × 224 pixels to optimize the model training process (see Figure 7).

Data Augmentation
Following the cropping and normalization process, the tomato image database contained 2176 images of immature tomatoes, 1753 images of ripe tomatoes, and 557 images of damaged tomatoes. For each category, 50 images were used for testing, while the remaining images (2127 images for immature tomatoes, 1703 images for ripe tomatoes, and 507 images for damaged tomatoes) were retained for training and validation purposes. The training and validation dataset thus contained very different numbers of images for each category. Accordingly, a data augmentation process was performed to balance the dataset and increase the classification accuracy. As shown in Figure 8, the augmentation process included image rotation from 0 to 90°, brightness adjustment from 1.0 to 2.0, vertical and horizontal flipping, filling using the "nearest" mode, and shearing with a range of 0.2. Figure 9 shows typical augmentation results obtained for the ripe, immature, and damaged tomato categories. After balancing, a dataset was constructed consisting of 1500 images for each category.
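The balancing step can be illustrated with a minimal sketch. The text gives the augmentation ranges and the target of 1500 images per category, but it does not state how each class was brought to that size. The plan below (subsample classes above the target, top up classes below it with augmented copies) is therefore an assumption, as are the names `balancing_plan` and `sample_augmentation` and the choice of a shear range that is symmetric about zero.

```python
import random

# Training/validation image counts per category, as reported in the text.
COUNTS = {"immature": 2127, "ripe": 1703, "damaged": 507}
TARGET = 1500  # images per category after balancing


def balancing_plan(counts, target):
    """Return, per class, how many originals to keep and how many augmented copies to add.

    Assumption: classes above the target are subsampled and classes below it
    are topped up with augmented images; the paper does not spell this out.
    """
    plan = {}
    for name, n in counts.items():
        kept = min(n, target)
        plan[name] = {"keep": kept, "augment": target - kept}
    return plan


def sample_augmentation(rng):
    """Draw one random set of transform parameters within the reported ranges."""
    return {
        "rotation_deg": rng.uniform(0.0, 90.0),
        "brightness": rng.uniform(1.0, 2.0),
        "horizontal_flip": rng.random() < 0.5,
        "vertical_flip": rng.random() < 0.5,
        "shear": rng.uniform(-0.2, 0.2),  # assumed symmetric about zero
        "fill_mode": "nearest",
    }


plan = balancing_plan(COUNTS, TARGET)
params = sample_augmentation(random.Random(0))  # fixed seed for repeatability
for name, entry in plan.items():
    print(name, entry)
print(params)
```

Under this assumed plan, only the damaged category needs augmented images (1500 - 507 = 993), and the three balanced categories together give the 4500 images used for training.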

Yolov5 Network Model
Yolov5 is an object detection network model that belongs to the Yolo family of models. The first three versions of Yolo were developed by Joseph Redmon between 2015 and 2018, while Yolov4 was released by Alexey Bochkovskiy in 2020 with an improved speed and accuracy [34]. Yolov5 was published by Glenn Jocher in 2020, with initial comparisons showing the same accuracy as Yolov4 but a faster prediction speed [35]. Yolov5 has five network model versions: Yolov5n, Yolov5s, Yolov5m, Yolov5l, and Yolov5x. While Yolov5n has the fastest calculation speed, its average precision is the lowest. Conversely, Yolov5x has the slowest calculation speed but the highest average precision [36]. In the present study, the tomato state classification system was implemented using the Yolov5m model. As shown in Figure 10, the backbone structure consisted of Conv (convolutional) layers, C3 (Cross Stage Partial Network bottleneck with 3 convolutions) layers, and a classification layer. In total, the model consisted of 212 layers with 11.7 million parameters and 30.9 GFLOPs (giga floating-point operations).

Residual Network (ResNet-50 and ResNet-101)
Deep CNNs are affected by several limitations, including a time-consuming optimization process, the vanishing gradient problem, and the degradation problem [37]. The Residual Network (ResNet) mitigates many of these limitations and provides the ability to solve complicated tasks with an increased accuracy [38]. Although ResNet can be implemented with as many as 152 layers, the 50-layer variant has only around 26 million parameters [39]. ResNet has a convolution block that uses the same 3 × 3 filter as Inception-Net. The convolution block consists of two convolution branches, where one branch applies a 1 × 1 convolution before its output is added directly to that of the other branch. The identity block does not apply the 1 × 1 convolution but directly adds the value of the former branch to the other branch. Figure 11 shows the basic structure of the ResNet-50 model. To minimize the training time, the model is implemented using bottlenecks as the basic building block, where each building block consists of three convolutional layers (two 1 × 1 layers with one 3 × 3 layer in the middle) and keeps the original features of the images. In other words, the ResNet-50 model uses a stack of three layers rather than the two layers employed in the ResNet-34 model. The 1 × 1 layers are responsible for reducing and then increasing the dimensions, while the 3 × 3 layer serves as a bottleneck [40]. In this study, the ResNet-50 model is run in the Yolov5 environment. It is noted that the proposed ResNet-50 model includes a total of 151 layers, 23.5 million parameters, and 67.5 GFLOPs. ResNet-101 has a similar structure to ResNet-50 but has 101 layers and more parameters (45 million). As shown in Figure 12

EfficientNet-B0
In recent studies, several groups have used EfficientNet to perform plant leaf disease classification and plant recognition [41,42]. There are eight versions of EfficientNet, ranging from B0 to B7. EfficientNet-B0 achieved an accuracy of 77.1% on ImageNet, with 5.3 million parameters and 0.39B FLOPs, while ResNet-50 achieved an accuracy of 76%, with 26 million parameters and 4.1B FLOPs [43].
As shown in Figure 13, the structure of EfficientNet-B0 consists of MBConv blocks, which are similar to the inverted residual blocks used in MobileNetv2 [44]. The blocks feature shortcut connections between the first and last sections of the block, and the input block is expanded by a 1 × 1 Conv layer to increase the number of channels or depth of the feature map. Conversely, the Depthwise Conv 3 × 3 and Pointwise Conv 1 × 1 layers are used to reduce the number of channels of the output block. The shortcut connections connect narrow layers, which have a small number of channels, while the wider layers are arranged between the shortcut connections. Notably, this structure reduces both the number of parameters and the number of operations. The EfficientNet-B0 model also uses AdaptiveAvgPool2d layers to find important features of the data and reduce the training parameters. The dropout layer is used to reduce interdependent learning between neurons, and the output layer is a linear layer. Thus, even though EfficientNet-B0 has a large number of layers (337 layers), it has just 4 million parameters and 7.3 GFLOPs. In this study, the EfficientNet-B0 model is run in the Yolov5 environment for the classification of tomatoes on the vine.
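The parameter saving of the expand, depthwise, and pointwise arrangement described above can be checked with simple arithmetic. The following is a minimal sketch that counts convolution weights only; biases, normalization layers, and squeeze-and-excitation modules are ignored. The channel sizes (16 in, 24 out) and the expansion factor of 6 are hypothetical values chosen for illustration, not figures taken from the paper.

```python
def conv_weights(c_in, c_out, kernel):
    """Weights in a standard convolution (bias and normalization ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_weights(channels, kernel):
    """Weights in a depthwise convolution: one kernel per channel."""
    return kernel * kernel * channels


def mbconv_weights(c_in, c_out, expand=6, kernel=3):
    """Weights in a simplified MBConv block: 1x1 expand, kxk depthwise, 1x1 project."""
    c_mid = c_in * expand
    return (
        conv_weights(c_in, c_mid, 1)
        + depthwise_weights(c_mid, kernel)
        + conv_weights(c_mid, c_out, 1)
    )


# Hypothetical channel sizes, chosen only for illustration.
C_IN, C_OUT, EXPAND = 16, 24, 6
C_MID = C_IN * EXPAND

mbconv = mbconv_weights(C_IN, C_OUT, EXPAND)
# Depthwise 3x3 followed by pointwise 1x1 versus a single standard 3x3
# convolution mapping the same wide layer to the same narrow output.
separable = depthwise_weights(C_MID, 3) + conv_weights(C_MID, C_OUT, 1)
standard = conv_weights(C_MID, C_OUT, 3)

print("MBConv block weights:", mbconv)
print("depthwise + pointwise:", separable)
print("standard 3x3 conv:", standard)
```

For these assumed sizes, the depthwise and pointwise pair needs several times fewer weights than a standard 3 × 3 convolution over the same channels, which is consistent with the small parameter count reported for EfficientNet-B0.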

Confusion Matrix, Recall, Precision, Accuracy, F1 Score, and Rate
The testing performance of the various classification models was evaluated using the confusion matrix shown in Table 2. In the following, TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative predictions, respectively. In the testing stage, the recall performance of the models was evaluated as the ratio of the number of samples correctly predicted as positive to the total number of actual positive samples, i.e.,
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The precision was evaluated as the ratio of the number of samples correctly predicted as positive to the total number of positive predictions, i.e.,
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The accuracy was defined as the ratio of the total number of correctly predicted samples to the total number of samples in the dataset, i.e.,
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The F1 score was defined as the harmonic mean of the precision and recall, i.e.,
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Four rates were also derived from the confusion matrix: the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR), i.e.,
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{TNR} = \frac{TN}{TN + FP}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{FNR} = \frac{FN}{FN + TP}$$
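The metrics defined in this section can be computed per class from a three-class confusion matrix by treating each class in turn as the positive class. The following is a minimal sketch rather than the authors' evaluation code. The counts in `CONFUSION` are an assumption: they are reconstructed from the percentages reported for Yolov5m (100% ripe, 100% immature, 92% damaged with 6% predicted as immature and 2% as ripe) and the 50 test images per category. For the multi-class case, the overall accuracy is taken as the sum of the diagonal divided by the total number of samples.

```python
CLASSES = ["ripe", "immature", "damaged"]

# Rows are true classes, columns are predicted classes (assumed counts).
CONFUSION = [
    [50, 0, 0],   # ripe
    [0, 50, 0],   # immature
    [1, 3, 46],   # damaged: 2% predicted ripe, 6% predicted immature
]


def per_class_counts(cm, k):
    """Return (TP, TN, FP, FN) when class k is treated as the positive class."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def safe_div(a, b):
    """Divide a by b, returning 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def class_metrics(cm, k):
    """Recall, precision, F1 score, and the four rates for class k."""
    tp, tn, fp, fn = per_class_counts(cm, k)
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "tpr": recall,
        "tnr": safe_div(tn, tn + fp),
        "fpr": safe_div(fp, fp + tn),
        "fnr": safe_div(fn, fn + tp),
    }


def overall_accuracy(cm):
    """Correct predictions (diagonal) divided by all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return safe_div(correct, sum(sum(row) for row in cm))


metrics = {name: class_metrics(CONFUSION, i) for i, name in enumerate(CLASSES)}
accuracy = overall_accuracy(CONFUSION)

for name, m in metrics.items():
    print(name, {key: round(value, 2) for key, value in m.items()})
print("accuracy", round(accuracy, 2))
```

Under these assumed counts, the overall accuracy is 146/150 (about 0.97), the precision for ripe tomatoes is 50/51 (about 0.98), and the recall for damaged tomatoes is 0.92, in line with the values reported for Yolov5m.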

Top 1 and Top 2 Accuracies
The effectiveness of the classification models in the training stage was further evaluated by means of the Top 1 and Top 2 accuracies. The Top 1 accuracy indicates the proportion of samples for which the category predicted by the model matches the true category. By contrast, the Top 2 accuracy considers the prediction result to be correct if either of the two most probable categories predicted by the model matches the true category.
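The difference between the two measures can be made concrete with a short sketch. The function name `top_k_accuracy`, the class indices, and the score values below are hypothetical and serve only to illustrate the definition; they are not data from the study.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        return 0.0
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical class scores for four samples (0 = ripe, 1 = immature, 2 = damaged).
SCORES = [
    [0.90, 0.05, 0.05],  # true class 0: correct at Top 1
    [0.10, 0.80, 0.10],  # true class 1: correct at Top 1
    [0.50, 0.10, 0.40],  # true class 2: wrong at Top 1, correct at Top 2
    [0.20, 0.70, 0.10],  # true class 2: wrong at both Top 1 and Top 2
]
LABELS = [0, 1, 2, 2]

top1 = top_k_accuracy(SCORES, LABELS, 1)
top2 = top_k_accuracy(SCORES, LABELS, 2)
print("Top 1:", top1, "Top 2:", top2)
```

Since the task has only three categories, a Top 2 prediction excludes just one category, so the Top 2 accuracy is expected to be close to 1 for any reasonably trained model.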

Data Training
The augmented dataset was split in the ratio of 80:10:10 for training, testing, and validation purposes, as shown in Figure 14. The computational properties of the training system and the training parameters are listed in Tables 3 and 4,

Conclusions
This study proposed four deep learning frameworks (Yolov5m and Yolov5m with ResNet-50, ResNet-101, and EfficientNet-B0, respectively) for the classification of ripe, immature, and damaged tomatoes on the vine. The testing results showed that the ResNet-50, EfficientNet-B0, Yolov5m, and ResNet-101 models have overall accuracies of 98%, 98%, 97%, and 97%, respectively. Furthermore, all four models have a recall of 1.00 for ripe tomatoes. The Yolov5m, ResNet-50, and ResNet-101 models also have a recall of 1.00 for immature tomatoes, whereas the recall falls to 0.96 for the EfficientNet-B0 model. The recall values for damaged tomatoes vary in the range of 0.92 to 0.94. All four models achieve a precision of 0.98 for ripe tomatoes. The Yolov5m, ResNet-50, and ResNet-101 models also have a precision of 1.00 for damaged tomatoes, whereas the precision of the EfficientNet-B0 model falls to 0.95. The precision values of the four models for immature tomatoes vary in the range of 0.94 to 0.96. Finally, the four models all have F1 scores of 0.99 for ripe tomatoes. The F1 scores for immature and damaged tomatoes vary in the ranges of 0.96 to 0.98 and 0.93 to 0.97, respectively. The TPR and TNR have high values of 92-100% and 97-100%, respectively, while the FPR and FNR have low values of 0-3% and 0-8%, respectively. The models thus operate effectively. In general, the results confirm that all of the proposed models provide an accurate means of performing tomato fruit classification in automated fruit-harvesting applications.
