Article

Tomato Fruit Detection Using Modified Yolov5m Model with Convolutional Neural Networks

Fa-Ta Tsai, Van-Tung Nguyen, The-Phong Duong, Quoc-Hung Phan and Chi-Hsiang Lien
1 Department of Mechanical Engineering, National United University, Miaoli 36002, Taiwan
2 Department of Mechanical Engineering, HCMC University of Technology and Education, Ho Chi Minh City 700000, Vietnam
* Authors to whom correspondence should be addressed.
Plants 2023, 12(17), 3067; https://doi.org/10.3390/plants12173067
Submission received: 26 July 2023 / Revised: 23 August 2023 / Accepted: 23 August 2023 / Published: 26 August 2023
(This article belongs to the Section Plant Modeling)

Abstract
The farming industry faces the major challenge of labor-intensive and inefficient harvesting. Thus, an efficient and automated fruit harvesting system is required. In this study, three object classification models based on Yolov5m integrated with BoTNet, ShuffleNet, and GhostNet convolutional neural networks (CNNs), respectively, are proposed for the automatic detection of tomato fruit. The models were trained using 1508 normalized images containing three classes of cherry tomatoes, namely ripe, immature, and damaged. The detection accuracy for the three classes was found to be 94%, 95%, and 96%, respectively, for the modified Yolov5m + BoTNet model. The model thus appears to provide a promising basis for the further development of automated harvesting systems for tomato fruit.

1. Introduction

Tomatoes not only have succulent flesh and a sweet taste, but they also contain many vitamins (C, E, and B), beneficial compounds (organic acids, phenolics, and chlorophyll), and many essential minerals that keep the human body in good health [1,2]. Besides the role tomatoes play in meeting daily nutrition needs, they also have a high economic value and are an important contributor to many local and national economies [3]. Nonetheless, the tomato farming industry faces a major challenge in that manual harvesting is labor intensive and inefficient. Consequently, the harvesting process is expensive, which increases the final cost to the consumer, and slow, which delays the delivery of the product to the market and reduces its aesthetic and nutritional value.
Many methods have been proposed for automating the tomato harvesting process [4,5,6,7]. Zhang et al. [8] developed a deep-learning model for tomato classification with an accuracy of 91.9% and a recognition time of just 0.01 s per 100 images. Taqi et al. [9] developed a robot system for cherry tomato harvesting in both private households and agricultural greenhouses. Zu et al. [10] used a Mask R-CNN [11] network model to detect and segment immature tomatoes on the vine based on the color of the tomato leaves. It was shown that the model achieved an F1 score of 92% and thus provided a reliable approach for automating the harvesting process. In general, the methods proposed in [4,5,6,7,8,9,10,11] have the potential to improve the tomato yield and go some way to alleviating the labor shortage problem caused by the COVID-19 pandemic [12,13]. However, these existing automated fruit harvesting methods still encounter challenges related to harsh environments and environmental interference, leading to compromised detection accuracy.
In recent years, artificial intelligence (AI) has been widely applied in many fields, including automobiles [14], the economy [15], and agriculture [16,17]. AI provides many important benefits, such as digital assistance, objective decision-making, improved speed and efficiency, reduced potential for human error, and 24/7 availability. As computer technology continues to advance, the detection and classification performance of AI models has improved accordingly, with the typical error rate in image recognition falling from 29% in 2010 to less than 3% in 2017 [18]. Many AI methods have been developed for rapid object detection in real time with high accuracy and minimal error. Some of the most well-known models include the Region-based Convolutional Neural Network (R-CNN) [19], Faster R-CNN [20], the Region-based Fully Convolutional Network (R-FCN) [21], the Single-Shot Multi-Box Detector (SSD) [22], and You Only Look Once (YOLO) [23]. Compared to other methods, Yolo has a particularly fast recognition speed and high accuracy due to its end-to-end training. Hence, it is one of the most commonly used methods and has been successfully employed in many applications, including tomato detection [24,25].
Many different versions of the Yolo model have been developed over the years, including Yolov1, Yolov2, and Yolov3 [26]. The goal of Yolov1 (i.e., the original Yolo model) is to both detect and classify target objects in the image. This is achieved by treating the object detection problem as a regression problem. However, while Yolov1 performs well for large objects, it has problems recognizing small objects. Yolov2 (or Yolo9000) overcomes this problem by adopting a new training method and anchor boxes, while achieving a speed similar to that of Faster R-CNN. Yolov3 retains the overall architecture of Yolov2 but replaces the softmax operation with logistic classifiers and uses the Darknet-53 structure as the backbone [27]. Bochkovskiy et al. [28] proposed the Yolov4 model, which combines state-of-the-art techniques in object detection in order to improve both the speed and the accuracy of the recognition process. Jocher et al. [29] presented a Yolov5 model with both a faster computational speed and a more straightforward implementation. Many versions of the Yolov5 model have subsequently been developed, including Yolov5n, Yolov5s, Yolov5m, Yolov5l, and Yolov5x, where the models differ mainly in their configurations and accuracy.
The literature contains many Yolo-based models for the detection of fruit products, such as oil palm fruit [30] and mango [31]. Liu et al. [32] presented an enhanced Yolov3 model for the identification of tomatoes in complex environments characterized by illuminance variations and object occlusion. The proposed model was shown to have an accuracy of up to 94.48% under ideal conditions. Guan et al. [33] designed a robot system based on Yolov5 for picking tomatoes on the vine. It was shown that the average recognition time of a single frame image was just 104 ms, which rendered the system suitable for real-time automatic picking applications. Egi et al. [34] presented a method for counting three classes of tomato fruit (ripe, green, and flowers) using Yolov5 and Deep Sort deep-learning algorithms. The method achieved detection accuracies of 85%, 99%, and 50% for the three classes, respectively.
In general, the results presented in [34] confirm the feasibility of improving the detection performance of the Yolo model through its integration with a further CNN. However, the existing Yolo-based methods may still face limitations in terms of accuracy and robustness, particularly in challenging conditions. The reliance on standard Yolo models may not fully address the nuanced complexities posed by varying lighting, occlusion, and fruit characteristics, which can lead to suboptimal performance. These drawbacks underscore the need for further improvements and modifications to enhance the automated tomato fruit detection process. Accordingly, the present study integrated the Yolov5m model with three other CNNs, namely BoTNet, ShuffleNet v2, and GhostNet. The classification performance of the three models was then compared with that of the original Yolov5m model for three classes of tomato fruit: ripe, immature, and damaged. Overall, the results showed that, among the various models, the modified-Yolov5m-BoTNet model provided the best detection performance, with a mean average precision (mAP) of 0.94 over the three classes of tomatoes.

2. Results and Discussion

Figure 1 shows the confusion matrices for the detection results of the four models. As shown in Figure 1a, the Yolov5m model had high detection accuracies of 94% and 96% for immature and damaged tomatoes, respectively. However, the detection accuracy for ripe tomatoes was only 87%. For the modified-Yolov5m-BoTNet model, the detection accuracy for ripe tomatoes increased to 94% (Figure 1b), representing an improvement of 7% over the original Yolov5m model. Moreover, the detection accuracies for immature and damaged tomatoes were 95% and 96%, respectively. Thus, the modified-Yolov5m-BoTNet model provided good detection accuracy for all three classes of tomatoes. The modified-Yolov5m-ShuffleNet model had a high detection accuracy of 96% for both ripe and immature tomatoes (Figure 1c). However, the detection accuracy for damaged tomatoes was just 87%. The poorer detection performance can be attributed to the compact backbone structure of the ShuffleNet v2 network, which contains only 2.2 M parameters. Finally, the modified-Yolov5m-GhostNet model had detection accuracies of 90%, 94%, and 94% for ripe, immature, and damaged tomatoes, respectively (Figure 1d). Thus, the detection accuracy for ripe tomatoes increased by 3% compared to the original Yolov5m model. Overall, the results presented in Figure 1 confirm that the modified-Yolov5m-BoTNet model provided the best overall detection performance of the four models, closely followed by the modified-Yolov5m-GhostNet model.
Figure 2 shows the TPR, TNR, FPR, and FNR values of the four models. As shown, the modified-Yolov5m-BotNet model had the best TPR performance over the three classes of tomatoes, with a value in the range of 94–96%. All four models had a low FPR, with values in the range of 2–19%. The FNR rates of the modified-Yolov5m-BotNet model were the lowest among all the models, with values in the range of 4–6%. All four models had high TNR values ranging from 80 to 82% for both ripe and immature tomatoes. Moreover, all four models achieved a TNR value close to 98% for damaged tomatoes.
Figure 3 illustrates the detection results obtained using the modified-Yolov5m-BoTNet model in typical real-world scenarios (e.g., those shown in Figure 4). The detection confidence varied widely, from approximately 70% to 95%, depending on the size of the target within the image, the color contrast of the target relative to the background, and the occlusion (or otherwise) of the fruit. When the fruit was located far from the camera (Figure 3e), the detection accuracy had a relatively low value of 70–80%. However, when the fruit was clear and non-occluded, the detection accuracy increased to as much as 95% (Figure 3a,b,d). Furthermore, even for occluded fruit, the detection accuracy had a high value of 90–93% (see Figure 3c,f).

3. Materials and Methods

3.1. Dataset for Training

A total of 1508 images of tomatoes were acquired from tomato farms in Miaoli County in western Taiwan and the Asian Vegetable Research and Development Center (AVRDC) in Tainan, southern Taiwan. The images were obtained manually using a mobile phone (iPhone 11) and had a size of 3024 × 4032 pixels, a bit depth of 24, and a resolution of 72 dpi. To maintain consistency and reliability, a meticulous camera protocol was followed. The images were acquired from a standardized distance of approximately 0.5 m, ensuring minimal distortion and maintaining consistent object proportions, and the camera was positioned perpendicular to the plane of the fruit to mitigate any angular discrepancies. To increase the robustness of the training model, the images were acquired at different times of the day to capture various illumination conditions and degrees of contrast between the targets (the tomatoes) and the background (the vine). The images were normalized to a size of 640 × 640 pixels to fit the Yolov5m model and improve the consistency of the training samples. It is noted that, in a practical application, a commercial camera mounted on a vehicle might be used to capture the images rather than a handheld iPhone; thus, the imaging distance and orientation may vary. CNN-based Yolo models, like many other object detection algorithms, are sensitive to such variations in distance and rotation. Yolo models detect objects by dividing the image into a grid and predicting bounding boxes and class probabilities for each grid cell, which allows them to identify objects in different positions within an image. To address the remaining sensitivity issues, the data augmentation techniques described in Section 3.7 were applied during training to help the models become more invariant to these variations (a simple illustrative preprocessing sketch is given below).
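As a hedged illustration (not the authors' exact pipeline), the following Python sketch shows how a 3024 × 4032 photograph could be letterbox-resized to the 640 × 640 input expected by Yolov5m and lightly augmented (flip and brightness jitter) to mimic the illumination and orientation variations discussed above; the file names are hypothetical.

```python
# Illustrative preprocessing sketch, not the exact pipeline used in the study.
import cv2
import numpy as np

def letterbox(image: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize while preserving aspect ratio, then pad to a square canvas."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # grey padding
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random horizontal flip and mild brightness jitter (illumination changes)."""
    if rng.random() < 0.5:
        image = cv2.flip(image, 1)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] *= rng.uniform(0.7, 1.3)  # scale the value (brightness) channel
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = cv2.imread("tomato_sample.jpg")  # hypothetical file name
    if img is not None:
        sample = augment(letterbox(img), rng)
        cv2.imwrite("tomato_sample_640.jpg", sample)
```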
Figure 4 presents typical normalized images of the three tomato classes (ripe, immature, and damaged) at various times of the day from 6:00 am to 6:00 pm. The tomatoes in the 1508 images were labeled with the appropriate classes using LabelImg open-source software version 1.8. In total, the images contained 2868 ripe, 3712 immature, and 885 damaged tomatoes. The data (200 MB) are available upon request to interested researchers for further analysis and validation.

3.2. Yolov5 Model

Figure 5 shows the basic structure of the Yolov5m model, which consists of three blocks, namely the backbone, neck, and head. The data are first input to the backbone, which is implemented with CSPDarknet. The backbone consists mainly of a C3 module (cross-stage partial network bottleneck with 3 convolutions), which improves on the CSPResBlock module in Yolov4 through the use of a single convolution layer and thus increases the training speed and reduces the number of training parameters. The backbone additionally incorporates a Spatial Pyramid Pooling Fast (SPPF) module. The SPPF module is similar to the Spatial Pyramid Pooling (SPP) module in Yolov4; however, SPPF is approximately two times faster than SPP and pools over multiple effective kernel sizes in order to increase the receptive field and enable the processing of input images with greater size variation. The neck block (PANet) uses multiple convolution and concatenation (Concat) blocks to extract feature pyramids from the input images. The neck also contains an Upsample module to enhance the subsample recognition accuracy. Finally, the head block, also known as the object detection module, predicts the coordinates of the bounding box (position, dimensions) for each target object of interest in the input image.
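For illustration, the following sketch outlines an SPPF-style block in the spirit of the Ultralytics Yolov5 design: a single 5 × 5 max-pool applied three times in sequence approximates the 5/9/13 receptive fields of the original SPP block at a lower computational cost. The layer widths are placeholders rather than the exact Yolov5m channel counts.

```python
# Minimal SPPF-style block sketch; channel widths are illustrative placeholders.
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: one 5x5 max-pool applied three times in
    sequence reproduces the effect of SPP's 5/9/13 kernels at a lower cost."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# e.g. SPPF(768, 768)(torch.randn(1, 768, 20, 20)).shape -> (1, 768, 20, 20)
```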

3.3. BoTNet Transform Model

Srinivas et al. [35] proposed a BoTNet transform model with the ability to detect and classify not only whole objects of different classes, but also segments of objects of the same class. The model was based on the bottleneck block structure used in ResNet [36]. As shown in Figure 6a, the ResNet bottleneck block comprises three consecutive convolution operations, namely a 1 × 1 convolution to reduce the feature depth, a 3 × 3 convolution to extract features from the images, and a 1 × 1 convolution to restore the feature depth to that of the original input. In the bottleneck architecture, the number of channels thus changes progressively from 2048 to 512 and then back to 2048; in other words, the output has the same number of channels as the original input. As shown in Figure 6b, the BoTNet model [35] was implemented using a transformer bottleneck block, in which the 3 × 3 convolution operation in ResNet50 was replaced with a Multi-Head Self-Attention (MHSA) module. It was shown in [35] that the combined use of convolutions and the MHSA improved the average precision (AP) of the mask by around 1.2% on the Common Objects in Context (COCO) instance segmentation benchmark.
Figure 7 shows the structure of the MHSA module. As shown, the module extracts long-range structural information from the input images [37,38,39,40]. The MHSA is used to connect components in the highest-level feature map. As a result, it provides access to a receptive field that covers the entire input image, and hence the prediction at a particular pixel is influenced by every input pixel. The MHSA module takes three inputs, namely the query (Q), key (K), and value (V) matrices, which are obtained through the learnable projection matrices Wq, Wk, and Wv, respectively. Self-attention is computed separately in multiple heads, and the independent self-attention outputs are then concatenated and linearly transformed into the expected dimension through a further embedding. The multiple self-attention heads allow attention to be paid to different parts of the sequence. The MHSA is calculated as:
$$\mathrm{MultiHead}(Q, K, V) = \left[\mathrm{head}_1, \ldots, \mathrm{head}_h\right] W^{0}$$
where $\mathrm{head}_i = \mathrm{SelfAttention}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)$, and $W^{0}$, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learnable parameter matrices.
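A compact Python sketch of a multi-head self-attention block following the equation above is given below; the 2D relative position encoding used in the full BoTNet MHSA block is omitted for brevity, and the dimensions are illustrative.

```python
# Illustrative multi-head self-attention (MHSA) sketch matching the equation above.
import torch
import torch.nn as nn

class MHSA(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        # Learnable projections W_q, W_k, W_v and the output projection W_0
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_0 = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); for a feature map, tokens = H * W
        b, n, d = x.shape

        def split(t):  # (b, n, d) -> (b, heads, n, d_head)
            return t.view(b, n, self.heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(b, n, d)  # concatenate heads
        return self.w_0(heads)

# e.g. MHSA(dim=512)(torch.randn(2, 400, 512)).shape -> (2, 400, 512)
```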
In the present study, the BoTNet transform model was integrated within the backbone of the original Yolov5m network, as shown in Figure 8. The BoTNet module was added behind the SPPF block of the original structure to enhance the feature map recognition of the input image through the MHSA module. Thus, the accuracy of the object detection process was improved, while the training time was reduced.

3.4. ShuffleNet Model

Deep learning network models are becoming increasingly accurate. However, this performance improvement is often obtained at the expense of a very large number of parameters, which increases the training cost and prevents deployment on resource-limited hardware, such as mobile devices. Several lightweight deep learning models have been developed to address this problem, including MobileNet, NASNetMobile, and ShuffleNet v2 [41,42,43]. ShuffleNet v2 is an extension of the original ShuffleNet model proposed in [44], which introduced pointwise group convolutions, bottleneck structures, and channel shuffle operations, and it was designed to further optimize the efficiency of the original structure. ShuffleNet v2 contains multiple shuffle units, which are stacked with different strides and repetition counts. Figure 9 illustrates the structure of the shuffle unit. As shown in Figure 9a, the shuffle unit includes channel split, concatenation, and channel shuffle blocks. After the initial channel split operation, one branch passes directly to the concatenation block, while the other branch contains two 1 × 1 ungrouped convolutions and a depth-wise convolution (DWConv) block. The outputs of the two branches are merged by the Concat block to ensure the same number of inputs and outputs. The channel shuffle block then allows for information exchange between the two branches. Figure 9b shows the structure of the spatial down-sampling module in the shuffle unit network. As shown, the channel split block is omitted, thus minimizing the number of input parameters and improving the speed as a result.
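The following sketch illustrates the stride-1 shuffle unit of Figure 9a under stated assumptions (channel counts are illustrative): channel split, a lightweight branch with two 1 × 1 convolutions and a 3 × 3 depth-wise convolution, concatenation, and a channel shuffle.

```python
# Illustrative ShuffleNet v2 basic unit (stride 1), following Figure 9a.
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels so information mixes across the two branches."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2).reshape(b, c, h, w))

class ShuffleUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # DWConv
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)               # channel split
        out = torch.cat([x1, self.branch(x2)], dim=1)
        return channel_shuffle(out)              # exchange information

# e.g. ShuffleUnit(128)(torch.randn(1, 128, 40, 40)).shape -> (1, 128, 40, 40)
```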
In the present study, ShuffleNet v2 was integrated with Yolov5m in order to improve the efficiency of the training process. In particular, the backbone structure of the original Yolov5m model was implemented using stacked Shuffle blocks in order to reduce the number of training parameters and giga floating point operations per second (GFLOPs) (see Figure 10).

3.5. GhostNet Model

GhostNet [45] is a lightweight CNN based on a stack of Ghost bottleneck layers. As shown in Figure 11, the first layer of the network is a convolutional layer with 16 filters. The following layers consist of multiple stacked Ghost bottleneck layers (Φ), while the last layer is a convolutional layer with a feature vector dimension of 1280. Notably, the GhostNet model uses cheap linear transformations to generate additional feature maps from the output of ordinary convolutions [46]. Because the model relies mainly on these low-cost linear operations to enrich the feature maps and channels, it has a relatively low number of GFLOPs despite containing many convolutional layers. Consequently, the GhostNet network provides a high detection accuracy but can still be implemented on weak hardware devices.
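As an illustrative sketch (not the exact GhostConv/GhostC3 implementation used in the modified backbone), a Ghost module can be expressed as a small primary convolution followed by cheap depth-wise "linear" operations whose outputs are concatenated with the primary features:

```python
# Illustrative Ghost module (after Han et al. [45]); widths are placeholders.
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, c_in: int, c_out: int, ratio: int = 2, dw_kernel: int = 3):
        super().__init__()
        c_primary = c_out // ratio               # intrinsic feature maps
        c_cheap = c_out - c_primary              # "ghost" feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, 1, bias=False),
            nn.BatchNorm2d(c_primary), nn.ReLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_cheap, dw_kernel, padding=dw_kernel // 2,
                      groups=c_primary, bias=False),  # cheap depth-wise operation
            nn.BatchNorm2d(c_cheap), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# e.g. GhostModule(64, 128)(torch.randn(1, 64, 80, 80)).shape -> (1, 128, 80, 80)
```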
In this study, the GhostNet model was incorporated into the backbone of Yolov5m in order to increase the number of layers while simultaneously reducing the GFLOP score. As shown in Figure 12, the modified component of the backbone structure consisted of GhostC3 cross-stage partial network bottlenecks with three convolutions and GhostConv convolution blocks.

3.6. Evaluation Metrics

The performance of object classification models is generally evaluated using a confusion matrix, as shown in Table 1.
For object detection problems, such as that considered in the present study, the performance can be evaluated using the Intersection over Union (IoU) metric, defined as:
$$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
where the area of overlap is the area of intersection between the predicted bounding box and the ground truth, while the area of union is the combined area of the predicted bounding box and the ground truth.
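A minimal, purely illustrative IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format is:

```python
# Illustrative IoU for two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# e.g. iou((0, 0, 10, 10), (5, 5, 15, 15)) -> 25 / 175 ≈ 0.143
```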
The performance of a classification model can be further evaluated using the precision and recall metrics, where the precision provides a measure of the prediction accuracy, and the recall measures the model's ability to find all relevant objects. The precision and recall are defined, respectively, as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1-score is the harmonic mean of the precision and recall and provides a more representative assessment of the model performance than either the precision or the recall metric alone. The F1-score is defined as:
$$F1 = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$
The average precision (AP) is the weighted sum of the precision at each threshold, where the weight is defined as the increase in recall between thresholds. In other words, the AP is calculated as,
$$AP = \sum_{k=0}^{n-1} \left[\mathrm{Recall}(k) - \mathrm{Recall}(k+1)\right] \mathrm{Precision}(k)$$
where n is the number of thresholds.
The mAP (mean average precision) is the average AP value computed over the different classes, i.e.,
$$mAP = \frac{1}{n} \sum_{k=1}^{n} AP_k$$
where APk is the AP of class k, and n is the number of classes.
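The following minimal sketch computes these metrics from per-class TP/FP/FN counts and a list of per-class AP values (assumed to be given); it is illustrative only and not the evaluation code used in this study.

```python
# Illustrative precision / recall / F1 / mAP helpers; inputs are assumed counts.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_average_precision(ap_per_class) -> float:
    # Average of the per-class AP values, as in the mAP equation above
    return sum(ap_per_class) / len(ap_per_class)

# e.g. for three tomato classes: mean_average_precision([0.94, 0.95, 0.93])
```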
Finally, from the confusion matrix, the following performance metrics can be derived:
$$TPR = \frac{TP}{\mathrm{Positive}} = \frac{TP}{TP + FN}$$
$$TNR = \frac{TN}{\mathrm{Negative}} = \frac{TN}{TN + FP}$$
$$FPR = \frac{FP}{\mathrm{Negative}} = \frac{FP}{FP + TN}$$
$$FNR = \frac{FN}{\mathrm{Positive}} = \frac{FN}{FN + TP}$$
where TPR is the true positive rate, TNR is the true negative rate, FPR is the false positive rate, and FNR is the false negative rate.
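These four rates can be computed from the raw confusion-matrix counts as in the small illustrative helper below (again, not the evaluation code used in the study; the example counts are hypothetical).

```python
# Illustrative confusion-matrix rate helper; counts are hypothetical inputs.
def rates(tp: int, tn: int, fp: int, fn: int) -> dict:
    positives, negatives = tp + fn, tn + fp
    return {
        "TPR": tp / positives,   # true positive rate (sensitivity)
        "TNR": tn / negatives,   # true negative rate (specificity)
        "FPR": fp / negatives,   # false positive rate
        "FNR": fn / positives,   # false negative rate
    }

# e.g. rates(tp=94, tn=80, fp=19, fn=6) -> TPR = 0.94, TNR ≈ 0.81, FPR ≈ 0.19, FNR = 0.06
```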

3.7. Training Data

As shown in Figure 13, the 1508 annotated images were randomly separated into a training dataset (80%), a testing dataset (10%), and a validation dataset (10%). For each model, the training process was performed using the parameters shown in Table 2. The hardware implementation details of the training process are shown in Table 3.
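An illustrative 80/10/10 random split of the annotated image list, mirroring Figure 13, might look as follows (the seed and variable names are hypothetical):

```python
# Illustrative 80/10/10 split of the annotated image paths (Figure 13).
import random

def split_dataset(image_paths, seed: int = 42):
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)       # reproducible shuffle
    n = len(paths)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    train = paths[:n_train]
    test = paths[n_train:n_train + n_test]
    val = paths[n_train + n_test:]
    return train, test, val

# e.g. with 1508 images: len(train) = 1206, len(test) = 150, len(val) = 152
```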
Figure 14 shows the evolution of the mAP values of the original Yolov5m model and the three modified Yolov5m models over the training process. The training results for each model are summarized in Table 4. As shown, the original Yolov5m model achieved an mAP value of 0.92. The modified-Yolov5m-BoTNet model achieved an mAP value of 0.94, corresponding to a 2.17% improvement over the original model. For the modified-Yolov5m-ShuffleNet model, the mAP remained at 0.92 (i.e., the same as that of the original model). Finally, for the modified-Yolov5m-GhostNet model, the mAP increased slightly to 0.93, an increase of 1.09% compared to the original Yolov5m model. Of all the models, the modified-Yolov5m-ShuffleNet v2 model had the smallest number of parameters (2.2 M) and the fastest training time (1.97 h). The ShuffleNet v2 model also had the smallest GFLOP score of 4.8, which is significantly smaller than that of the original Yolov5m model (47.9). The modified-Yolov5m-BoTNet and modified-Yolov5m-GhostNet models also had relatively low GFLOP scores of 15.5 and 29.3, respectively. The F1 scores of the four models varied from 0.84 to 0.87, indicating that all of the models achieved a high detection accuracy despite the complexity of the detection task, which was characterized by significant variances in color, size, and illumination.

4. Conclusions

Three CNN models based on modified Yolov5m networks with BoTNet, ShuffleNet v2, and GhostNet incorporated into the backbone structure, respectively, were proposed for the automatic detection of three classes of tomato fruit on the vine, namely ripe, immature, and damaged. The detection results showed that, among the various models, the modified-Yolov5m-BoTNet model achieved the highest mAP value of 0.94 across the three tomato classes. Moreover, the detection accuracy of the modified-Yolov5m-BoTNet model was high, in the range of 94–96%. The modified-Yolov5m-GhostNet model also had a high accuracy of 90–94%. However, the original Yolov5m model and the modified-Yolov5m-ShuffleNet v2 model had generally lower accuracies of 87–96%. The TNR and TPR values of all the models varied in the range of 82–98%, while the FPR and FNR values varied in the range of 2–19%. Compared with the original Yolov5m model, the proposed modified models (in particular, the modified-Yolov5m-BoTNet and modified-Yolov5m-GhostNet models) had significantly improved detection accuracies for tomato fruit. Furthermore, the modified models have fewer parameters than the original model and thus not only have a faster training time, but can also be implemented on weak hardware devices such as smartphones and other mobile devices. Consequently, they have significant potential for the realization of reliable and low-complexity automated fruit harvesting systems. In the future, the development of a fully automated harvesting system is anticipated, in which the detection algorithm will be refined to accommodate diverse environmental conditions and fruit characteristics and to ensure seamless integration with real-world agricultural practices. Future work will also involve an in-depth exploration of the models' adaptability to dynamic field conditions, encompassing factors such as lighting variations, weather influences, and diverse vine layouts.

Author Contributions

Conceptualization and methodology, Q.-H.P. and C.-H.L.; software, validation, formal analysis, V.-T.N.; investigation, resources, data curation, T.-P.D. and F.-T.T.; writing—original draft preparation, Q.-H.P. and V.-T.N.; writing—review and editing, Q.-H.P. and V.-T.N.; visualization, supervision, project administration, funding acquisition, Q.-H.P., C.-H.L. and F.-T.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Technology of Taiwan (grant number 111-2221-E-239-007), National United University (Grant No. 111-NUUPRJ-03), and the TCVGH-NUU Joint Research Program (Grant No. TCVGH-NUU1118904).

Data Availability Statement

Data are available on request.

Acknowledgments

The authors gratefully acknowledge the financial support provided to this study by the Ministry of Science and Technology of Taiwan (MOST) under Grant No. 111-2221-E-239-007; National United University under Grant No. 111-NUUPRJ-03; Taichung Veterans General Hospital/National United University Joint Research Program under Grant No. TCVGH-NUU1118904.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pattnaik, G.; Shrivastava, V.K.; Parvathi, K. Transfer Learning-Based Framework for Classification of Pest in Tomato Plants. Appl. Artif. Intell. 2020, 34, 981–993. [Google Scholar] [CrossRef]
  2. Lin, H.T. Cherry Tomato ‘TSS ASVEG No.22’. Taiwan Seed Improvement and Propagation Station; COA: Taichung, Taiwan, 2017. [Google Scholar]
  3. Elbadrawy, E.; Sello, A. Evaluation of Nutritional Value and Antioxidant Activity of Tomato Peel Extracts. Arab. J. Chem. 2016, 9, S1010–S1018. [Google Scholar] [CrossRef]
  4. Gongal, A.; Amatya, S.; Karkee, M.; Zhang, Q.; Lewis, K. Sensors and Systems for Fruit Detection and Localization: A Review. Comput. Electron. Agric. 2015, 116, 8–19. [Google Scholar] [CrossRef]
  5. Kuznetsova, A.V.; Maleva, T.; Soloviev, V.N. Using YOLOv3 Algorithm with Pre- and Post-Processing for Apple Detection in Fruit-Harvesting Robot. Agronomy 2020, 10, 1016. [Google Scholar] [CrossRef]
  6. Bulanon, D.M.; Burr, C.; DeVlieg, M.; Braddock, T.; Allen, B. Development of a Visual Servo System for Robotic Fruit Harvesting. AgriEngineering 2021, 3, 840–852. [Google Scholar] [CrossRef]
  7. Mangaonkar, S.R.; Khandelwal, R.S.; Shaikh, S.A.; Chandaliya, S.; Ganguli, S. Fruit Harvesting Robot Using Computer Vision. In Proceedings of the International Conference for Advancement in Technology (2022 ICONAT), Goa, India, 21–23 January 2022; pp. 1–6. [Google Scholar]
  8. Zhang, L.; Jia, J.; Gui, G.; Hao, X.; Gao, W.; Wang, M. Deep Learning Based Improved Classification System for Designing Tomato Harvesting Robot. IEEE Access 2018, 6, 67940–67950. [Google Scholar] [CrossRef]
  9. Taqi, F.; Al-Langawi, F.; Abdulraheem, H.K.; El-Abd, M. A cherry-tomato harvesting robot. In Proceedings of the 18th International Conference on Advanced Robotics (ICAR), Hong Kong, China, 10–12 July 2017; pp. 463–468. [Google Scholar]
  10. Zu, L.; Zhao, Y.; Jiuqin, L.; Su, F.; Zhang, Y.; Liu, P. Detection and Segmentation of Mature Green Tomatoes Based on Mask R-CNN with Automatic Image Acquisition Approach. Sensors 2021, 21, 7842. [Google Scholar] [CrossRef]
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  12. Balwinder-Singh; Shirsath, P.B.; Jat, M.L.; McDonald, A.; Srivastava, A.K.; Craufurd, P.; Rana, D.S.; Singh, A.K.; Chaudhari, S.K.; Sharma, P.; et al. Agricultural Labor, COVID-19, and Potential Implications for Food Security and Air Quality in the Breadbasket of India. Agric. Syst. 2020, 185, 102954. [Google Scholar] [CrossRef]
  13. Rahimi, P.; Islam, S.; Duarte, P.M.; Tazerji, S.S.; Sobur, M.A.; Zowalaty, M.E.E.; Ashour, H.M.; Rahman, M.T. Impact of the COVID-19 Pandemic on Food Production and Animal Health. Trends Food Sci. Technol. 2022, 121, 105–113. [Google Scholar] [CrossRef]
  14. Ramesh, K.; Desai, S.; Jariwala, D.; Shukla, V. AI Modelled Clutch Operation for Automobiles. In Proceedings of the IEEE World Conference on Applied Intelligence and Computing (AIC), Sonbhadra, India, 17–19 June 2022; pp. 487–491. [Google Scholar]
  15. Kumar, A.; Finley, B.; Braud, T.; Tarkoma, S.; Hui, P. Sketching an AI Marketplace: Tech, Economic, and Regulatory Aspects. IEEE Access 2021, 9, 13761–13774. [Google Scholar] [CrossRef]
  16. Qazi, S.; Khawaja, B.A.; Farooq, Q.U. IoT-Equipped and AI-Enabled Next Generation Smart Agriculture: A Critical Review, Current Challenges and Future Trends. IEEE Access 2022, 10, 21219–21235. [Google Scholar] [CrossRef]
  17. Bhat, S.A.; Huang, N.-F. Big Data and AI Revolution in Precision Agriculture: Survey and Challenges. IEEE Access 2021, 9, 110209–110222. [Google Scholar] [CrossRef]
  18. Furman, J.; Seamans, R. AI and the Economy. Innov. Policy Econ. 2019, 19, 161–191. [Google Scholar] [CrossRef]
  19. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  21. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29, pp. 379–387. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.M.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Lecture Notes in Computer Science; Springer: Berlin, Germany, 2016; pp. 21–37. [Google Scholar]
  23. Redmon, J.; Divvala, S.K.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Mirhaji, H.R.; Soleymani, M.; Asakereh, A.; Mehdizadeh, S.A. Fruit Detection and Load Estimation of an Orange Orchard Using the YOLO Models through Simple Approaches in Different Imaging and Illumination Conditions. Comput. Electron. Agric. 2021, 191, 106533. [Google Scholar] [CrossRef]
  25. Padilha, T.C.; Moreira, G.É.G.; Magalhães, S.A.; Santos, F.N.D.; Cunha, M.; Oliveira, M. Tomato Detection Using Deep Learning for Robotics Application. In Lecture Notes in Computer Science; Springer Science+Business Media: Berlin, Germany, 2021; pp. 27–38. [Google Scholar]
  26. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  27. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  28. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934v1. [Google Scholar]
  29. Jocher, G.; Stoken, A.; Borovec, J.; Christopher, S.T.; Laughing, L.C. Ultralytics/yolov5: V4.0-nn.SILU Activations, Weights & Biases Logging, Pytorch Hub Integration. 2021. Available online: https://zenodo.org/record/4418161 (accessed on 26 June 2023).
  30. Junos, M.H.; Khairuddin, A.S.M.; Thannirmalai, S.; Dahari, M. Automatic Detection of Oil Palm Fruits from UAV Images Using an Improved YOLO Model. Vis. Comput. 2021, 38, 2341–2355. [Google Scholar] [CrossRef]
  31. Shi, R.; Li, T.; Yamaguchi, Y. An Attribution-Based Pruning Method for Real-Time Mango Detection with YOLO Network. Comput. Electron. Agric. 2020, 169, 105214. [Google Scholar] [CrossRef]
  32. Liu, G.; Nouaze, J.C.; Mbouembe, P.L.T.; Kim, J.N. YOLO-Tomato: A Robust Algorithm for Tomato Detection Based on YOLOV3. Sensors 2020, 20, 2145. [Google Scholar] [CrossRef] [PubMed]
  33. Zhaoxin, G.; Han, L.; Zhijiang, Z.; Libo, P. Design a Robot System for Tomato Picking Based on YOLO V5. IFAC-Pap. 2022, 55, 166–171. [Google Scholar] [CrossRef]
  34. Egi, Y.; Hajyzadeh, M.; Eyceyurt, E. Drone-Computer Communication Based Tomato Generative Organ Counting Model Using YOLO V5 and Deep-Sort. Agriculture 2022, 12, 1290. [Google Scholar] [CrossRef]
  35. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
  38. Bello, I.; Zoph, B.; Le, Q.V.; Vaswani, A.; Shlens, J. Attention Augmented Convolutional Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3286–3295. [Google Scholar]
  39. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Standalone self-attention in vision models. arXiv 2019, arXiv:1906.05909. [Google Scholar]
  40. Petit, O.; Thome, N.; Rambour, C.; Themyr, L.; Collins, T.; Soler, L. U-Net Transformer: Self and Cross Attention for Medical Image Segmentation. In Springer eBooks; Springer: Berlin, Germany, 2021; pp. 267–276. [Google Scholar]
  41. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  42. Zoph, B.; Vasudevan, V.K.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8697–8710. [Google Scholar]
  43. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  44. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  45. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  46. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
Figure 1. Confusion matrices for the (a) Yolov5m model, (b) modified-Yolov5m-BoTNet model, (c) modified-Yolov5m-ShuffleNet v2 model, and (d) modified-Yolov5m-GhostNet model.
Figure 2. TPR, TNR, FPR, and FNR performance of (a) Yolov5m, (b) modified-Yolov5m-BotNet, (c) modified-Yolov5m-ShuffleNet v2, and (d) modified-Yolov5m-GhostNet.
Figure 3. Real-world detection results obtained using the modified-Yolov5m-BoTNet model for: (a) ripe tomatoes, (b) immature tomatoes, (c) immature and damaged tomatoes, (d) ripe tomatoes, (e) immature tomatoes, and (f) damaged and immature tomatoes.
Figure 4. Typical normalized tomato images: (a) ripe tomatoes at 6:00 am, (b) immature tomatoes at 11:00 am, (c) damaged tomatoes at 12:00 pm, (d) immature tomatoes at 3:00 pm, (e) ripe tomatoes at 5:00 pm, and (f) immature and damaged tomatoes at 6:00 pm.
Figure 5. Basic structure of the Yolov5m model.
Figure 6. (a) ResNet bottleneck and (b) BoTNet transformer bottleneck.
Figure 7. Multi-head self-attention module.
Figure 8. Modified-Yolov5m-BoTNet transform model.
Figure 9. Shuffle units of ShuffleNet v2: (a) basic unit of ShuffleNet v2, and (b) shuffle unit used for spatial down-sampling.
Figure 10. Modified-Yolov5m-ShuffleNet model structure.
Figure 11. Structure of the Ghost module.
Figure 12. Modified-Yolov5m-GhostNet model backbone structure.
Figure 13. Ratio of the training, testing, and validation datasets.
Figure 14. mAP values of: (a) Yolov5m, (b) modified-Yolov5m-BoTNet, (c) modified-Yolov5m-ShuffleNet, and (d) modified-Yolov5m-GhostNet.
Table 1. Confusion matrix structure.
                              Predicted Condition
True Condition                PP (Predict Positive)      PN (Predict Negative)
P + N (Total Population)
P (Positive)                  TP (True Positive)         FN (False Negative)
N (Negative)                  FP (False Positive)        TN (True Negative)
Note that P denotes the actual total number of positive cases, and N is the total number of negative cases. Furthermore, true positive (TP) indicates that the model correctly predicts a positive case. Similarly, true negative (TN) indicates that the model correctly predicts a negative case. Meanwhile, false positive (FP) indicates that the model erroneously predicts a negative sample as positive, while false negative (FN) indicates that the model wrongly predicts a positive sample as negative.
Table 2. Training parameters.
Parameter                       Value
Optimization                    Adam
Batch size                      32
Learning rate                   0.0001
Decay                           5 × 10−5
Dropout                         0.1
Epochs                          200
Image size                      640 × 640 pixels
Augmentation hyperparameters    hyp.scratch-high.yaml
Table 3. Training system hardware.
CPU     2 × Xeon processors @ 2.3 GHz, 46 MB cache
GPU     2 × Tesla T4, 16 GB
RAM     16 GB
Disk    80 GB
Table 4. Summary of training results.
Model                             Layers    Parameters    GFLOPs    mAP     F1 Score    Time (h)
Yolov5m                           212       21.0 M        47.9      0.92    0.86        2.32
Modified-Yolov5m-BoTNet           162       6.7 M         15.5      0.94    0.87        1.98
Modified-Yolov5m-ShuffleNet v2    221       2.2 M         4.8       0.92    0.84        1.97
Modified-Yolov5m-GhostNet         378       14.2 M        29.3      0.93    0.87        2.44