Article

A Tomato Recognition and Rapid Sorting System Based on Improved YOLOv10

1 School of Mechanical Engineering, Liaoning Petrochemical University, Fushun 113001, China
2 School of Mechatronic Engineering, Guangdong Polytechnic Normal University, Guangzhou 510665, China
3 School of Mechanical Engineering and Automation, Beihang University, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Machines 2024, 12(10), 689; https://doi.org/10.3390/machines12100689
Submission received: 17 August 2024 / Revised: 27 September 2024 / Accepted: 29 September 2024 / Published: 30 September 2024
(This article belongs to the Section Machine Design and Theory)

Abstract

To address the time-consuming, labor-intensive nature of traditional industrial tomato sorting, this paper proposes a high-precision tomato recognition strategy and a fast automatic grasping system. First, the Swin Transformer module is integrated into YOLOv10, halving the resolution of each layer while doubling the number of channels to improve recognition accuracy. Then, the Simple Attention Module (SimAM) and Efficient Multi-Scale Attention (EMA) mechanisms are added to achieve complete integration of features, and the BiFormer module with Bi-level Routing Attention is introduced for dynamic sparse attention and resource allocation. Finally, a lightweight detection head is added to YOLOv10 to improve the accuracy of tiny-target detection. To complement the recognition system, a single-vertex and multi-crease (SVMC) origami soft gripper performs rapid adaptive grasping of identified objects through bistable deformation. This system enables quick and accurate tomato grasping after identification, showing significant potential for application in fruit and vegetable sorting operations.

1. Introduction

Due to the ongoing progress of agricultural technology, agricultural robots have become vital tools for enhancing production efficiency and reducing labor costs. Tomatoes are a widely cultivated crop whose harvest involves a substantial number of picking operations and high labor requirements, so applying robotics to tomato grasping is crucial for improving efficiency in agricultural processes. Compared with traditional manual methods, machine vision technology has become a superior and practical solution for tomato harvesting, mainly because of its high efficiency, high precision, cost-effectiveness, stable performance, and support for data analysis [1,2,3]. This technology effectively addresses key challenges of manual harvesting, including inefficiency, high costs, inconsistent quality, and unfavorable working conditions [4]. The integration of new soft grippers, known for their superior flexibility and adaptability, further enhances the application potential in tomato picking and sorting, offering a wide range of opportunities in the field [5,6,7,8,9,10,11].
In recent years, research on recognition has focused on improving the YOLO algorithm, with significant advances made by modifying the convolution process and enhancing the attention mechanism. The Swin Transformer module can be incorporated into a convolutional neural network to improve the feature extraction capability of the model, especially when dealing with complex backgrounds and objects of different sizes [12]. To advance automatic medical image segmentation, Lin proposed a dual-scale encoder called DS-TransUNet based on the Swin Transformer, and experiments showed that this model improves the semantic segmentation quality of different medical images [13]. Gong proposed an SPH-YOLOv5 model for satellite image target detection and evaluation, which also integrates the Swin Transformer module to retain feature information to the greatest extent and improve detection accuracy [14]. The attention mechanism enables convolutional neural networks to adaptively attend to more important information, making it an important route to network adaptivity. Including the EMA module in the backbone of YOLO-series detection models can significantly improve the detection of smaller targets. Hao proposed a YOLOv5-EMA model for the accurate detection of cattle bodies, which helps to efficiently and accurately detect single cattle and important body parts in complex scenes and has greatly promoted the development of precision animal husbandry [15]. You proposed the SimAM-EfficientNet model and achieved outstanding results in plant disease detection, with precision significantly higher than that of other models [16].
To address the difficulty of identifying small insulator faults in transmission lines, Zhang combined fault image features with an improved YOLOv8 and proposed a small-target insulator fault detection algorithm, adding the BiFormer attention module at the bottom of the backbone network. By improving the feature representation capability of the network, this algorithm significantly improves the detection accuracy of small targets in complex environments [17].
Various origami-structured soft grippers are widely used to achieve intelligent tomato sorting. Soft grippers are preferred over traditional rigid grippers because they provide soft contact and grip tomatoes non-destructively. The origami structure can be folded to reduce external volume while enlarging the motion space, thereby increasing grasping flexibility and success rate. Hu introduced a soft, retractable crawling robot that can adapt to surfaces and pipes without manual adjustment, showcasing the versatility and adaptability of soft crawling robots when combined with a lightweight soft origami gripper for object manipulation during movement [18]. Similarly, Chen developed a flexible origami actuator with variable effective length, enabled by the origami structure and powered by a mix of tendon and pneumatic actuation [19]. This soft gripper can accommodate objects of varied shapes, weights, sizes, and textures, providing a novel soft-fixture design that uses origami structures for enhanced environmental adaptability [20]. The limited presence of robotics and automation in the food industry highlights the demand for adaptable robotic end-effectors capable of handling diverse food materials. A dual-mode soft gripper can both grip and suction objects, featuring four soft fingers with PneuNet bending actuators combined with suction pads for optimal performance [21]. Furthermore, the challenge of providing strong grasping force for soft grippers is addressed by a biomimetic design integrating flat dry adhesive and soft actuators for lifting objects with varied shapes and surface textures. Integrating a flexible pressure film sensor allows measurement of an object's surface characteristics, helping refine grasping strategies in unfamiliar environments [22].
This paper introduces an intelligent tomato sorting system that combines the SES-YOLOv10 model, an enhanced version of the YOLOv10 algorithm, with the SVMC origami soft gripper. The schematic diagram is illustrated in Figure 1. Identifying the quality and ripeness of tomatoes during sorting is a challenge in industrial production [23], and the sorting system proposed in this paper addresses it. First, the Swin Transformer module is integrated into YOLOv10n's backbone network. Following its hierarchical structure, the resolution of each layer is halved and the number of channels is doubled, enabling efficient sorting of tomatoes. The EMA, SimAM, and BiFormer attention mechanisms are added to the neck network to enlarge the target receptive field and better adapt to tomato images of different sizes. Finally, a lightweight detection head is added to improve tomato recognition precision. As the experimental findings demonstrate, the model improves precision, recall, average precision, and F1 score in tomato sorting by 9.3%, 10.1%, 6.7%, and 8.9%, respectively. The improved algorithm can process an image in 2.0 ms and can therefore handle a large amount of data in a short time, enhancing the efficiency and accuracy of tomato identification. Meanwhile, a bistable origami soft gripper is introduced, known for its rapid response speed, robust grasping capability, and wide adaptability. The origami structure's unique folding attributes significantly reduce external volume and motion space, enabling a seamless transition between two-dimensional and three-dimensional states and particularly enhancing grip flexibility and success rate. Integrating origami technology into soft fixtures expands their versatility and programmability, facilitating the operation of complex configurations.
Upon contact with the tomato surface, the gripper quickly envelops and safely grips the target using the tendon drive system, completing a grasp in 62.0 ms. By integrating YOLO visual recognition with mechanical grasping, the entire sorting process can be automated. This reduces the need for manual intervention, enhances overall production efficiency, and provides the tomato sorting industry with the tools needed for quality control and competitive advantage in the market.

2. Origami Soft Gripper

The bistable origami soft gripper based on the SVMC is primarily composed of the single-vertex and multi-crease origami structure, a support frame, a trigger, and a drive system, as shown in Figure 2. The SVMC is established through mathematical analysis of spatial angular relationships and potential energy curves. The support frame secures the stability of the origami structure, and the trigger enables rapid deformation of the soft gripper during grasping. Furthermore, tendons are added to improve grasping force and stability, serving two functions: one set of tendons tightens the fingers during grasping, providing a stronger driving force (green tendons in Figure 2), while the other returns the gripper to its outward unfolded form after completing the grasping task (red tendons in Figure 2).
In a grasping task, as long as the tomato can activate the trigger mechanism, the working mode of bistable deformation enveloping plus tendon tensioning can be used to grip the tomato. When grasping other fragile objects, however, the trigger may damage the fragile surface; in that case the target can be enveloped and gripped by using the tensioning tendons to drive the fingers to contract inward. Figure 3 shows the results of the performance tests of the soft gripper. The soft gripper takes 62.0 ms from contacting the target to deforming and completing the envelope, and it can grip an object with a diameter of 150 mm and a weight of 383.26 g. With a self-weight of only 24 g, it can grasp targets roughly 16 times heavier than itself, which covers the identification and sorting of the vast majority of tomatoes on the market. It also has a fast response time and high adaptability and stability, as described in detail in [24].

3. Recognizing Tomato Datasets Using SES-YOLOv10n

3.1. SES-YOLOv10n Modeling

In this paper, an improved model based on the YOLOv10 algorithm, SES-YOLOv10, is used for this experiment. It modifies the convolution process, enhances the attention mechanism, and adds a lightweight detection head.

3.1.1. Swin Transformer Module

Patch Partition divides the input image into small patches, as in ViT. The network then proceeds through four stages, each consisting of two parts: Patch Merging (a linear layer in the first stage) and the Swin Transformer block. Figure 4 shows the structure of the Swin Transformer. Patch Merging is an operation similar to pooling, except that pooling loses information while Patch Merging does not. The right side of the figure shows the structure of the Swin Transformer block, which is broadly similar to a standard Transformer block; the difference is that multi-head self-attention (MSA) is replaced by window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), marked by the purple boxes on the right [25]. In the industrial sorting of tomatoes, a significant number of images need to be identified. High-resolution images, while beneficial for detail, result in substantial storage requirements, increased demands on processing equipment, and elevated processing costs. To address these challenges, the Swin Transformer replaces the backbone network in the identification component of the system. By halving the image resolution while doubling the number of channels, the system maintains feature recognition without compromising accuracy. This approach enhances the efficiency of the recognition process, providing distinct advantages for the tomato sorting system.
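The halve-resolution, double-channels step can be made concrete with a minimal NumPy sketch of Swin-style patch merging. This is an illustrative re-implementation, not the paper's code: each 2×2 neighborhood is concatenated into a 4C vector and a linear projection (here a random matrix standing in for learned weights) maps it to 2C channels.

```python
import numpy as np

def patch_merging(x, w):
    """Swin-style patch merging on an (H, W, C) feature map.

    Gathers each 2x2 neighborhood into a 4C vector, then applies a
    linear projection `w` of shape (4C, 2C), halving resolution and
    doubling channels.
    """
    H, W, C = x.shape
    # Stack the four sub-grids: even/odd rows crossed with even/odd columns.
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )  # shape (H/2, W/2, 4C)
    return merged @ w  # shape (H/2, W/2, 2C)

# Toy example: an 8x8 map with 16 channels becomes 4x4 with 32 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w = rng.standard_normal((64, 32))
y = patch_merging(x, w)
print(y.shape)  # (4, 4, 32)
```

Unlike pooling, all four neighbors feed the projection, so no spatial information is discarded before the linear layer.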

3.1.2. Attention Mechanism

The attention mechanism in the neck network aims to expand the receptive field of tomato sorting images, adapt to various image scales, enhance the model’s ability to extract multi-scale features, and improve model convergence and precision.
(1)
EMA
In the coordinate attention (CA) mechanism, position information is embedded to establish a connection between channel and space: horizontal and vertical position information is aggregated by parallel global average pooling of the input features. The EMA module aims to preserve information on each channel while reducing computational overhead. It reshapes some channels into the batch dimension and groups the channel dimension into multiple sub-features, ensuring that spatial semantic features are evenly distributed within each feature group. Specifically, the module encodes global information to recalibrate the channel weights in each parallel branch. Furthermore, the output features of the two parallel branches are aggregated through cross-dimensional interaction to capture pixel-level pairing relationships. EMA is an efficient multi-scale attention module that enables feature fusion at different scales, thereby capturing more comprehensive contextual information [26]. The structure of the EMA attention module is shown in Figure 5.
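The grouping and parallel coordinate pooling described above can be sketched in NumPy. This is a heavily simplified illustration, not the full EMA module: it keeps the channel grouping and the parallel H/W pooling with sigmoid recalibration, and omits EMA's 3×3 branch and cross-spatial softmax interaction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ema_pooling_sketch(x, groups=4):
    """Simplified sketch of EMA-style grouped coordinate pooling.

    Splits channels of an (H, W, C) map into groups, pools each group
    along H and W in parallel (as in coordinate attention), and
    recalibrates the group with sigmoid weights.
    """
    H, W, C = x.shape
    out = np.empty_like(x)
    for g in range(groups):
        sl = slice(g * C // groups, (g + 1) * C // groups)
        xg = x[:, :, sl]
        pool_h = xg.mean(axis=1, keepdims=True)   # (H, 1, Cg): row descriptor
        pool_w = xg.mean(axis=0, keepdims=True)   # (1, W, Cg): column descriptor
        out[:, :, sl] = xg * sigmoid(pool_h) * sigmoid(pool_w)
    return out

y = ema_pooling_sketch(np.ones((8, 8, 16)))
print(y.shape)  # (8, 8, 16)
```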
(2)
SimAM
Most existing attention modules generate one-dimensional (1D) or two-dimensional (2D) weights from a feature map X, which are then broadcast as channel (a) or spatial (b) attention. In contrast, SimAM directly estimates three-dimensional (3D) weights without adding parameters to the original network. Figure 6 illustrates a comparison of the different attention schemes, showing that SimAM outperforms the representative SE and CBAM attention modules. In each subgraph, the same color represents the same scalar for every point on each channel, spatial location, or feature. The core concept of SimAM is based on neuroscience theory: the importance of each neuron is inferred by optimizing an energy function, and a closed-form solution of this energy function is then used to weight the feature map, enhancing the expressiveness of feature maps without adding parameters to the model. Another advantage of SimAM is that it avoids extensive structural adjustments: most of its operations follow directly from the solution of the defined energy function, reducing the complexity of structural design. Quantitative evaluations on a variety of visual tasks demonstrate that the module is flexible and efficient, improving the representational ability of many ConvNets [16,27]. The structure of the SimAM attention module is shown in Figure 6.
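Because SimAM's weights have a closed form, the whole module fits in a few lines. The sketch below follows the published SimAM formulation (squared deviation over channel variance, plus a regularizer λ), applied to a single (H, W, C) feature map in NumPy; λ = 1e-4 is the paper's default, and the example input is arbitrary.

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM weighting for one (H, W, C) feature map.

    Computes each activation's inverse energy per channel from its
    squared deviation and the channel variance, then scales the map
    by the sigmoid of that inverse energy.
    """
    H, W, C = x.shape
    n = H * W - 1
    d = (x - x.mean(axis=(0, 1), keepdims=True)) ** 2   # squared deviation
    v = d.sum(axis=(0, 1), keepdims=True) / n           # channel variance
    e_inv = d / (4.0 * (v + lam)) + 0.5                 # inverse energy (>= 0.5)
    weight = 1.0 / (1.0 + np.exp(-e_inv))               # sigmoid attention weight
    return x * weight

x = np.random.default_rng(0).standard_normal((8, 8, 3))
y = simam(x)
print(y.shape)  # (8, 8, 3)
```

Since every weight lies in (0, 1), the module can only attenuate activations, never amplify them, which is why it adds no parameters and little compute.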
(3)
BiFormer
The self-attention mechanism in Transformers effectively captures long-range dependencies, but it significantly increases computation and memory usage. To address this issue, BiFormer introduces dynamic sparse attention that filters out irrelevant information, achieving dynamic computation allocation and content awareness through Bi-level Routing Attention (BRA) [28]. Its structure diagram is shown in Figure 7.
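The two-level routing idea can be illustrated with a minimal NumPy sketch, under simplifying assumptions (1D token sequence, mean pooling as the region descriptor, no projections or multi-head logic): a coarse region-to-region affinity selects the top-k regions per query region, and dense attention runs only over tokens gathered from those regions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bi_level_routing_attention(q, k, v, regions, topk):
    """Sketch of Bi-level Routing Attention on (N, D) token arrays.

    Tokens are split into `regions` equal windows. Region-mean queries
    and keys form a coarse affinity matrix; each query region keeps its
    `topk` most relevant regions and attends only over their tokens.
    """
    N, D = q.shape
    r = N // regions
    qr = q.reshape(regions, r, D)
    kr = k.reshape(regions, r, D)
    vr = v.reshape(regions, r, D)
    # Coarse, region-level affinity from pooled descriptors.
    affinity = qr.mean(axis=1) @ kr.mean(axis=1).T      # (regions, regions)
    routed = np.argsort(-affinity, axis=1)[:, :topk]    # top-k regions per region
    out = np.empty_like(qr)
    for i in range(regions):
        kg = kr[routed[i]].reshape(-1, D)               # gathered keys
        vg = vr[routed[i]].reshape(-1, D)               # gathered values
        attn = softmax(qr[i] @ kg.T / np.sqrt(D), axis=-1)
        out[i] = attn @ vg
    return out.reshape(N, D)

rng = np.random.default_rng(1)
q = rng.standard_normal((16, 8))
out = bi_level_routing_attention(q, q, q, regions=4, topk=2)
print(out.shape)  # (16, 8)
```

The fine-grained attention cost scales with `topk * r` rather than `N` keys per query, which is the source of BRA's savings.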

3.1.3. Head

The structure of YOLOv10 comprises three main components: the backbone, the neck, and the head. To enhance the model’s ability to identify and detect targets in complex scenes, a lightweight detection head has been added to the existing three detection heads. This new detection head provides a more comprehensive and in-depth understanding of features, thereby improving the model’s performance in various challenging environments, such as those with poor lighting or overexposure, which are common in tomato sorting scenarios. Additionally, the detection capabilities for small and occluded targets are significantly enhanced, leading to improved recognition precision during tomato picking. Consequently, this not only increases the model’s accuracy but also enhances its practical applicability in tomato sorting operations [29]. The structure diagram of the improved YOLOv10n is illustrated in Figure 8.

3.2. Performance Test

The self-made dataset is classified into four categories: ripe and high-quality, ripe and low-quality, unripe and high-quality, and unripe and low-quality. Due to the limited number of tomatoes, 180 images were taken as the dataset for this article, split 8:1 between the training and validation sets. The primary training parameters are the SGD optimizer with a momentum of 0.937 and a weight decay of 0.0005. The initial learning rate is 0.01, the maximum number of epochs is 300, and the batch size is 32. The input image size is 640 × 640 pixels. The hardware and software configuration is as follows: Windows 10 64-bit operating system, Nvidia driver 528.02 with CUDA 12.6, and an NVIDIA GeForce RTX 4070 Ti GPU.
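An 8:1 split of 180 images yields 160 training and 20 validation images. A deterministic split can be sketched as follows; the file names are hypothetical placeholders, not the paper's actual dataset.

```python
import random

def split_dataset(paths, ratio=(8, 1), seed=42):
    """Deterministic train/validation split at the given ratio.

    Sorting before shuffling makes the split reproducible regardless
    of filesystem ordering.
    """
    paths = sorted(paths)
    random.Random(seed).shuffle(paths)
    n_train = len(paths) * ratio[0] // sum(ratio)
    return paths[:n_train], paths[n_train:]

# 180 images at 8:1 -> 160 train / 20 validation.
train, val = split_dataset([f"tomato_{i:03d}.jpg" for i in range(180)])
print(len(train), len(val))  # 160 20
```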
When using deep learning network algorithms to detect tomatoes, a large number of images must be collected as the training set. Target detection evaluation indicators are used to evaluate the performance of the model, chiefly four key indicators: precision, recall, average precision, and F1 score. The evaluation criteria are shown in Table 1 [30]:
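Three of the four indicators follow directly from the detection counts (average precision additionally requires integrating precision over recall across confidence thresholds). A minimal sketch, with illustrative counts that are not the paper's results:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with hypothetical counts: 90 correct detections,
# 10 false alarms, 5 missed tomatoes.
p, r, f1 = detection_metrics(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.947 0.923
```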

3.2.1. Classification Accuracy Evaluation

To better analyze the classification performance of the model, confusion matrices for YOLOv10 and SES-YOLOv10 were generated on this dataset. The confusion matrices before and after the improvement are shown in Figure 9. The classification accuracies for ripe and high-quality, ripe and low-quality, unripe and high-quality, and unripe and low-quality tomatoes are 89%, 87%, 92%, and 98%, respectively, which is 10%, 9%, 6%, and 9% higher than the original model for each class. In summary, the improved model based on YOLOv10 raises classification precision, thereby significantly improving model performance.
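Per-class accuracy is read off a confusion matrix as the row-normalized diagonal. The sketch below uses illustrative counts chosen to reproduce the reported percentages; they are not the paper's actual matrix entries.

```python
import numpy as np

def per_class_accuracy(cm):
    """Fraction of each true class (row) that was predicted correctly
    (diagonal), i.e. the row-normalized diagonal of a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

# Hypothetical 4-class matrix (rows: true class, columns: predicted);
# rows sum to 100 so the diagonal reads directly as a percentage.
cm = [[89, 6, 3, 2],
      [7, 87, 4, 2],
      [3, 3, 92, 2],
      [1, 0, 1, 98]]
acc = per_class_accuracy(cm)
print([round(a, 2) for a in acc])  # [0.89, 0.87, 0.92, 0.98]
```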

3.2.2. Experiments

The color bands defining ripe and unripe tomatoes are shown in Figure 10a: the 0–50% range corresponds to unripe tomatoes and the 50–100% range to ripe tomatoes. Tomatoes with surface spots, cracks, rot, or uneven color are of poor quality, whereas tomatoes with smooth surfaces and bright green rootstocks are of high quality. The surface defect patterns of the tomatoes are shown in Figure 10b.
The detection results of YOLOv10 and SES-YOLOv10 on the dataset are shown in Figure 10c,d. By comparison, the prediction performance of YOLOv10 is worse than that of SES-YOLOv10, with generally lower prediction accuracy. SES-YOLOv10 improves this situation: its network detects previously missed small targets, positions bounding boxes more accurately, and correctly classifies and recognizes overlapping and low-contrast targets. The model has a stronger detection ability overall.
To verify whether the improvements bring performance gains over the original model, comparative experiments with five algorithms were designed. The experimental results are shown in Table 2. According to the data in the table, SES-YOLOv10 has higher detection accuracy than the other models in the series and holds an advantage in detecting small targets. Figure 11 compares the four types of curves for the five algorithms; SES-YOLOv10 is nearly optimal on every metric and dramatically improves the overall performance of YOLOv10.

3.2.3. Extended Evaluation

The application of the YOLO algorithm in crop picking and sorting systems has demonstrated high precision and efficiency. In recent years, numerous researchers have developed high-quality models for industrial sorting and picking to detect various crops, as illustrated in Table 3. These models include applications for detecting tomatoes (Chen et al., 2024) [30], apples (Fan et al., 2022) [31], bananas (Fu et al., 2022) [32], guavas (Liu et al., 2024) [33], peaches (Jing et al., 2024) [34], and strawberries (Mi et al., 2024) [35]. These examples highlight the advancements and improvements in various versions of the YOLO series algorithms, indicating that the YOLO recognition system is evolving in a more sophisticated and rigorous direction. Furthermore, the table reveals that the SES-YOLOv10 proposed in this paper achieves higher average accuracy, suggesting that the sorting system possesses significant commercial value and application potential.

4. Pairs of Tomatoes for Quick Grasping

4.1. Experimental Principle and Equipment

The tomato sorting and grasping experimental setup consists of an experimental platform, an origami gripper, a 3D camera, a computer, a mobile platform, several tomatoes of different qualities, and storage containers. The computer running the SES-YOLOv10 model recognizes the picture captured by the 3D camera, outputs the position and category information of each tomato, and converts the output 2D coordinates into 3D coordinates that the robot arm can use. After the computer sends the position signal, the robot arm plans a suitable grasping path for the target tomato according to the presence of obstacles, path-distance requirements, and time requirements. This experiment simulates a tomato sorting system that uses the quick reaction of the origami gripper to identify and grab ripe, high-quality tomatoes for sale. A mobile platform simulates a factory conveyor, on which several tomatoes of different qualities are placed, and the visual recognition and positioning system combined with the fast-response origami gripper selects the ripe, high-quality tomatoes.
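The 2D-to-3D conversion step is typically a pinhole-model back-projection of the detected pixel using the depth value from the 3D camera. A minimal sketch, where the intrinsics (`fx`, `fy`, `cx`, `cy`) are placeholder values, not the paper's calibration:

```python
import numpy as np

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-projects pixel (u, v) with measured depth (meters) into 3D
    camera-frame coordinates using the pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# A detection at the image center, 0.5 m from the camera, lies on the
# optical axis.
p = pixel_to_camera(u=320, v=240, depth=0.5, fx=600, fy=600, cx=320, cy=240)
print(p)  # [0.  0.  0.5]
```

A hand-eye calibration transform would then map these camera-frame coordinates into the robot arm's base frame.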

4.2. Static Grasping

Static grasping is important for verifying the performance of the gripper. To gain a more comprehensive understanding of the improved algorithm and the gripper's performance, a classification grasping experiment and a palletizing experiment are carried out.

4.2.1. Classification Grasping Experiment

The classification grasping experiment focuses on the performance of SES-YOLOv10. By photographing tomatoes, identifying specific types, and then gripping them, the experiment verifies the accuracy of the algorithm, tests the grasping performance of the soft gripper, and establishes its practical grasping range.
The experimental process is shown in Figure 12(a1,a2). First, 12 tomatoes of four different kinds are evenly placed on the experimental bench. The 3D camera scans the bench and takes photos. After SES-YOLOv10 recognizes the photos, the recognition accuracy is found to be as high as 95%, accurately differentiating the tomato types, and the robotic arm receives its commands. The gripper then picks the tomatoes in the order of unripe and low-quality, unripe and high-quality, ripe and low-quality, and ripe and high-quality, placing all tomatoes into paper bowls according to their classification.
Based on the above, the grasping experiments of different kinds of tomatoes show that the gripper can successfully and stably grip all kinds of tomatoes, and can accurately classify them through the SES-YOLOv10 system.

4.2.2. Palletizing Experiment

Palletizing experiments are valuable for verifying the performance of the robotic arm. With the rapid development of the logistics industry, palletizing and handling have become important links in the logistics chain, so a palletizing experiment was designed in which the manipulator moves tomatoes from a disordered pile into a neatly arranged box, serving as a long-run stability test of the automated handling system.
The experimental process is illustrated in Figure 12(b1,b2). Nine tomatoes were placed in the left box. The SES-YOLOv10 recognition algorithm was employed to identify the tomatoes and then sent the motion command by a computer to control the robot arm to grip and arrange the tomatoes in an orderly manner, as shown by the yellow trajectory in Figure 12(b1). The tomatoes were neatly arranged in a 3 × 3 grid on the right plate. Depending on the flexibility of the soft gripper, none of the tomatoes fell or were damaged during the whole grasping process.
In conclusion, the palletizing experiment verifies the significant advantages of automated palletizing and handling technology in improving efficiency, reducing cost, and improving the working environment. With continued technological progress, this handling technology will bring more convenience and benefit to industrial production and logistics.

4.3. Dynamic Grasping

Owing to its bistable characteristics, the gripper also grasps well when facing a moving object. In this section, the dynamic grasping performance of the gripper is tested through grasping experiments on a moving ball and on the mobile platform.
The schematic diagram of the experimental process is shown in Figure 12(c1–c6). First, tomatoes of different ripeness are placed on the assembly line, whose speed is set to 1 m/s so that the tomatoes move uniformly in a straight line on the moving platform, simulating an industrial conveyor belt. When a tomato moves within the recognition range of the 3D camera, the camera scans the moving platform and recognizes the unripe tomato. The position of the target tomato is fed back to the drive system of the robot arm, and the intercept is judged from the speeds of the moving platform and the robot arm. When the target tomato reaches the position of the gripper, the gripper is quickly triggered to grab it, and the robot arm places the tomato into the specified container, completing the first simulated grab. Because the binocular camera keeps shooting and identifying the situation on the assembly line, a second unripe tomato was found during continued identification, and the first grasping operation was repeated to remove all unripe tomatoes in the batch, ensuring the sorting of high-quality ripe tomatoes.
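The intercept judgment reduces to simple kinematics: given the tomato's detected position and the belt speed, the controller computes how long until the tomato reaches the gripper and must trigger before that, allowing for the gripper's 62 ms closing time. A minimal sketch under those assumptions (positions and speeds are illustrative, not measured values from the paper):

```python
def intercept_delay(target_x, gripper_x, belt_speed):
    """Time (s) until a tomato detected at `target_x` (m) reaches the
    gripper at `gripper_x` (m) on a belt moving at `belt_speed` (m/s).
    The trigger must fire no later than this delay minus the gripper's
    closing time (62 ms for the SVMC gripper)."""
    return (gripper_x - target_x) / belt_speed

# Tomato detected 0.5 m upstream of the gripper on a 1 m/s belt.
t = intercept_delay(target_x=0.2, gripper_x=0.7, belt_speed=1.0)
print(round(t, 3))  # 0.5
```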
The experiment successfully identified and gripped unripe tomatoes on the moving platform and located them in the designated position. The difficulty of the sorting experiment lies in accurately identifying and locating the target, and accurately gripping the target according to the moving speed of the target and the robot arm. Accurate grasping tests the coordination of the entire system and demands high speed from the soft gripper. Experiments show that the proposed visual recognition and positioning system is closely coordinated with the fast response of the origami hand, and can complete the sorting task.

4.4. Nondestructive Experiment

Handling fresh fruit presents significant challenges, primarily due to the potential for damage caused by grasping. This section aims to evaluate whether the gripper in the proposed sorting system can handle tomatoes and other softer objects without causing surface damage. To achieve this, a series of grasping experiments were conducted on various types of tomatoes and similarly delicate items.
The main influencing factors are grasping force, grasping mode, grasping speed, and the tomato surface. In this paper, the biomechanical characteristics of the tomato were studied, and an optimal grasping force of 3.75 N was determined to minimize damage. To further reduce damage during grasping, the bistable origami soft gripper based on the SVMC realizes rapid deformation during the grasp, reducing impact on the tomato. The grasping mechanism of the picking robot was optimized, and the shape, size, and opening and closing mode of the gripper were adjusted to control the force distribution on the tomato more precisely. According to the actual conditions of the picking process, variable-speed picking was realized: the arm slows as it approaches the target to ensure an accurate grasp and speeds up when leaving it, improving efficiency. The experimental process and results are shown in Figure 13a–f, covering six different types of tomatoes. The results show that the gripper causes no damage to the tomato surface. To make the results more convincing, objects with surfaces softer than tomatoes (Figure 13g, a sponge; Figure 13h, a soft tennis ball) were also gripped. The images show that the gripper causes neither deformation nor surface damage during clamping, so we conclude that the system can handle tomatoes without damaging the product.

5. Conclusions

This study introduces an automated tomato sorting system based on the YOLOv10 algorithm and a bistable origami soft gripper. The system classifies tomatoes effectively, distinguishing ripe from unripe and high-quality from low-quality specimens, while ensuring precise grasping with the soft gripper. The experimental results indicate that SES-YOLOv10 achieves a precision of 95.3% and a recall of 98.5% on the dataset employed in this system, and the average precision of the improved and original models is 97.3% and 87.9%, respectively. Utilizing the origami gripper mounted on the AAB robot arm, the system successfully grips tomatoes in both static and dynamic scenarios. Moreover, the integration of the visual identification and positioning system with the designed gripper effectively accomplishes simulated tomato sorting, presenting a novel approach for tomato classification applicable in similar environments. The proposed system demonstrates high precision in tomato type detection and rapid post-identification grasping, thereby facilitating industrial sorting automation. Future work aims to enhance the system's detection sensitivity by collecting a more diverse array of tomato images for SES-YOLOv10 training, which is expected to improve the detection success rate; improvements to both the robot arm and the origami bracket are also planned. This research represents a significant advancement in tomato sorting and intelligent agriculture.

Author Contributions

W.L.: Writing—original draft, Project administration, Supervision, Funding acquisition. S.W.: Investigation, Formal analysis, Validation. X.G.: Investigation, Validation, and Formal analysis. H.Y.: Investigation and Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Basic Scientific Research Projects of Liaoning Provincial Department of Education (Grant No. JYTMS20231454), the National Natural Science Foundation of China (Grant No. 62003014), and the College Students Innovative Entrepreneurial Training Plan Program (Grant No. 202210148027).

Data Availability Statement

All the data presented in this study are contained in the article’s main text.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, T.; Sun, M.; He, Q.; Zhang, G.; Shi, G.; Ding, X.; Lin, S. Tomato recognition and location algorithm based on improved YOLOv5. Comput. Electron. Agric. 2023, 208, 107759.
  2. Cardellicchio, A.; Solimani, F.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Detection of tomato plant phenotyping traits using YOLOv5-based single-stage detectors. Comput. Electron. Agric. 2023, 207, 107757.
  3. Zheng, S.; Liu, Y.; Weng, W.; Jia, X.; Yu, S.; Wu, Z. Tomato recognition and localization method based on improved YOLOv5n-seg model and binocular stereo vision. Agronomy 2023, 13, 2339.
  4. Zhang, J.; Xie, J.; Zhang, F.; Gao, J.; Yang, C.; Song, C.; Rao, W.; Zhang, Y. Greenhouse tomato detection and pose classification algorithm based on improved YOLOv5. Comput. Electron. Agric. 2024, 216, 108519.
  5. Zhou, Z.; Zahid, U.; Majeed, Y.; Nisha Mustafa, S.; Sajjad, M.M.; Butt, H.D.; Fu, L. Advancement in artificial intelligence for on-farm fruit sorting and transportation. Front. Plant Sci. 2023, 14, 1082860.
  6. Zhang, P.; Tang, B. A two-finger soft gripper based on a bistable mechanism. IEEE Robot. Autom. Lett. 2022, 7, 11330–11337.
  7. Zhang, Z.; Ni, X.; Gao, W.; Shen, H.; Sun, M.; Guo, G.; Wu, H.; Jiang, S. Pneumatically controlled reconfigurable bistable bionic flower for robotic gripper. Soft Robot. 2022, 9, 657–668.
  8. Zaidi, S.; Maselli, M.; Laschi, C.; Cianchetti, M. Actuation technologies for soft robot grippers and manipulators: A review. Curr. Robot. Rep. 2021, 2, 355–369.
  9. Zaghloul, A.; Bone, G.M. 3D shrinking for rapid fabrication of origami-inspired semi-soft pneumatic actuators. IEEE Access 2020, 8, 191330–191340.
  10. Zou, X.; Liang, T.; Yang, M.; LoPresti, C.; Shukla, S.; Akin, M.; Weil, B.T.; Hoque, S.; Gruber, E.; Mazzeo, A.D. Paper-based robotics with stackable pneumatic actuators. Soft Robot. 2022, 9, 542–551.
  11. Wang, C.; Guo, H.; Liu, R.; Yang, H.; Deng, Z. A programmable origami-inspired webbed gripper. Smart Mater. Struct. 2021, 30, 055010.
  12. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  13. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15.
  14. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-transformer-enabled YOLOv5 with attention mechanism for small object detection on satellite images. Remote Sens. 2022, 14, 2861.
  15. Hao, W.; Ren, C.; Han, M.; Zhang, L.; Li, F.; Liu, Z. Cattle body detection based on YOLOv5-EMA for precision livestock farming. Animals 2023, 13, 3535.
  16. You, H.; Lu, Y.; Tang, H. Plant disease classification and adversarial attack using SimAM-EfficientNet and GP-MI-FGSM. Sustainability 2023, 15, 1233.
  17. Zhang, Y.; Wu, Z.; Wang, X.; Fu, W.; Ma, J.; Wang, G. Improved YOLOv8 insulator fault detection algorithm based on BiFormer. In Proceedings of the 2023 IEEE 5th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 14–16 July 2023; pp. 962–965.
  18. Terrile, S.; Argüelles, M.; Barrientos, A. Comparison of different technologies for soft robotics grippers. Sensors 2021, 21, 3253.
  19. Hu, Q.; Li, J.; Dong, E.; Sun, D. Soft scalable crawling robots enabled by programmable origami and electrostatic adhesion. IEEE Robot. Autom. Lett. 2023, 8, 2365–2372.
  20. Chen, B.; Shao, Z.; Xie, Z.; Liu, J.; Pan, F.; He, L.; Zhang, L.; Zhang, Y.; Ling, X.; Peng, F.; et al. Soft origami gripper with variable effective length. Adv. Intell. Syst. 2021, 3, 2000251.
  21. Wang, Z.; Or, K.; Hirai, S. A dual-mode soft gripper for food packaging. Robot. Auton. Syst. 2020, 125, 103427.
  22. Hu, Q.; Dong, E.; Sun, D. Soft gripper design based on the integration of flat dry adhesive, soft actuator, and microspine. IEEE Trans. Robot. 2021, 37, 1065–1080.
  23. Hussain, M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines 2023, 11, 677.
  24. Liu, W.; Bai, X.; Yang, H.; Bao, R.; Liu, J. Tendon driven bistable origami flexible gripper for high-speed adaptive grasping. IEEE Robot. Autom. Lett. 2024, 9, 5417–5424.
  25. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An improved Swin Transformer-based model for remote sensing object detection and instance segmentation. Remote Sens. 2021, 13, 4779.
  26. Zhang, X.; Cui, B.; Wang, Z.; Zeng, W. Loader bucket working angle identification method based on YOLOv5s and EMA attention mechanism. IEEE Access 2024, 12, 105488–105496.
  27. Mahaadevan, V.C.; Narayanamoorthi, R.; Gono, R.; Moldrik, P. Automatic identifier of socket for electrical vehicles using SWIN-transformer and SimAM attention mechanism-based EVS YOLO. IEEE Access 2023, 11, 111238–111254.
  28. Zheng, X.; Lu, X. BPH-YOLOv5: Improved YOLOv5 based on BiFormer prediction head for small target cigarette detection. In Proceedings of the Jiangsu Annual Conference on Automation (JACA 2023), Changzhou, China, 10–12 November 2023.
  29. Tan, L.; Liu, S.; Gao, J.; Liu, X.; Chu, L.; Jiang, H. Enhanced self-checkout system for retail based on improved YOLOv10. arXiv 2024, arXiv:2407.21308.
  30. Chen, W.; Liu, M.; Zhao, C.; Li, X.; Wang, Y. MTD-YOLO: Multi-task deep convolutional neural network for cherry tomato fruit bunch maturity detection. Comput. Electron. Agric. 2024, 216, 108533.
  31. Fan, S.; Liang, X.; Huang, W.; Zhang, V.J.; Pang, Q.; He, X.; Li, L.; Zhang, C. Real-time defects detection for apple sorting using NIR cameras with pruning-based YOLOv4 network. Comput. Electron. Agric. 2022, 193, 106715.
  32. Fu, L.; Yang, Z.; Wu, F.; Zou, X.; Lin, J.; Cao, Y.; Duan, J. YOLO-Banana: A lightweight neural network for rapid detection of banana bunches and stalks in the natural environment. Agronomy 2022, 12, 391.
  33. Liu, Z.; Xiong, J.; Cai, M.; Li, X.; Tan, X. V-YOLO: A lightweight and efficient detection model for guava in complex orchard environments. Agronomy 2024, 14, 1988.
  34. Jing, J.; Zhang, S.; Sun, H.; Ren, R.; Cui, T. YOLO-PEM: A lightweight detection method for young "Okubo" peaches in complex orchard environments. Agronomy 2024, 14, 1757.
  35. Mi, Z.; Yan, W.Q. Strawberry ripeness detection using deep learning models. Big Data Cogn. Comput. 2024, 8, 92.
Figure 1. Schematic diagram of tomato recognition and sorting.
Figure 2. Schematic of the bistable origami flexible gripper.
Figure 3. Soft gripper performance test experiment ((a) trigger time; (b) maximum grasping weight).
Figure 4. Structure diagram of the Swin Transformer module.
Figure 5. Structure diagram of the EMA attention module.
Figure 6. Structure diagram of the SimAM attention module.
Figure 7. Structure diagram of BiFormer.
Figure 8. Structure diagram of the improved YOLOv10n (SES-YOLOv10).
Figure 9. The confusion matrices before and after improvement ((a) YOLOv10; (b) SES-YOLOv10).
Figure 10. Criteria for tomato characteristics and comparison of recognition accuracy on a partial dataset before and after improvement ((a) criteria for ripeness; (b) criteria for quality; (c) partial-dataset accuracy identified by YOLOv10; (d) partial-dataset accuracy identified by SES-YOLOv10; RH: ripe and high-quality; RL: ripe and low-quality; UH: unripe and high-quality; UL: unripe and low-quality).
Figure 11. Four curves under different algorithms.
Figure 12. (a) Static grasping experiment ((a1) use SES-YOLOv10 to identify tomatoes on the bench; (a2) the mobile robotic arm clamps tomatoes of four types into paper bowls, completing the experiment); (b) pendulum experiment ((b1) use the robotic arm to grasp the tomatoes from top to bottom and back to front; (b2) experiment complete); (c) dynamic grasping experiment ((c1) use SES-YOLOv10 to identify tomatoes on the assembly line; (c2) move the robot arm to grasp the unripe tomato; (c3) complete the first grasp; (c4) continue to identify the tomatoes on the assembly line; (c5) find a second unripe tomato and grasp it; (c6) complete the identification and grasping of all unripe tomatoes on the assembly line, completing the experiment; ➀ first grasp; ➁ second grasp).
Figure 13. The state of different objects during and after grasping ((a–f) different kinds of tomatoes; (g) sponge; (h) soft tennis ball; the red dashed arrow shows the state of the tomato being grasped).
Table 1. Evaluation criteria.

Class | Formula
Precision | P = TP / (TP + FP)
Recall | R = TP / (TP + FN)
Average Precision | AP = ∫₀¹ P(R) dR
Mean Average Precision | mAP = (1/n) Σᵢ₌₁ⁿ APᵢ
F1 score | F1 = 2 × P × R / (P + R)
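The criteria in Table 1 can be computed directly from true-positive (TP), false-positive (FP), and false-negative (FN) counts. The sketch below is a minimal illustration of those formulas, not the evaluation code used in the paper; mAP is shown only as the mean over per-class AP values that a detector's evaluator would supply, without reproducing the integral over the precision–recall curve.

```python
# Minimal sketch of the Table 1 metrics (assumed helper names, not the
# authors' evaluation code).

def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP): fraction of detections that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN): fraction of ground-truth objects detected."""
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    """F1 = 2PR / (P + R): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def mean_average_precision(ap_per_class: list) -> float:
    """mAP = (1/n) * sum of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 953 true positives with 47 false positives gives a precision of 0.953, matching the scale of the values reported for SES-YOLOv10.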
Table 2. Performance comparison of different algorithms.

Model | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/%
YOLOv5 | 79.9 | 84.3 | 87.0 | 68.8
Faster-RCNN | 78.1 | 88.9 | 82.3 | 64.2
Cascade-RCNN | 79.6 | 83.2 | 79.2 | 61.6
YOLOv8 | 80.5 | 84.1 | 87.7 | 71.9
YOLOv10 | 86.3 | 88.6 | 90.6 | 79.0
SES-YOLOv10 | 95.3 | 98.7 | 97.3 | 87.9
Table 3. Comparison of research results in the same field.

Ref. | Model | Plant | mAP (%) | Focus
Chen et al. (2024) [30] | MTD-YOLOv7 | Tomato | 86.6 | Ripe/unripe
Fan et al. (2022) [31] | YOLOv4P | Apple | 93.74 | NIR images
Fu et al. (2022) [32] | YOLO-Banana (v5) | Banana | 92.19 | Shrubs/stems
Liu et al. (2024) [33] | V-YOLO (v10) | Guava | 93.3 | Detection
Jing et al. (2024) [34] | YOLO-PEM (v8) | Peach | 93.15 | Detection
Mi et al. (2024) [35] | YOLOv9-ST | Strawberry | 87.3 | Ripe/unripe
Proposed method | SES-YOLOv10 | Tomato | 95.3 | Tomato types
