RT-DETR-Tomato: Tomato Target Detection Algorithm Based on Improved RT-DETR for Agricultural Safety Production
Abstract
1. Introduction
- By replacing the original ResNet50 backbone of the RT-DETR model with the Swin Transformer, the model’s detection accuracy was improved.
- By integrating the BiFormer Vision Transformer with bi-level routing attention, the small object feature extraction capability was enhanced, improving model performance while maintaining high computational efficiency.
2. Related Work
3. Methods
3.1. RT-DETR-Tomato Model
Algorithm: RT-DETR-Tomato-BS
Input: original image $I$.
Output: final detection and tracking results $M$.
1. Swin Transformer feature extraction: $F_s = \mathrm{Swin}(I)$.
2. BiFormer block processing: $F_b = \mathrm{BiFormer}(F_s)$.
3. The Transformer encoder generates the embedding vector: $E_t = \mathrm{Encoder}(F_b)$.
4. The target matching result $M$ is obtained by the Hungarian algorithm: $M = \mathrm{Hungarian}(E_t, E_{t-1})$, where $E_t$ is the embedding vector of the current frame and $E_{t-1}$ is the embedding vector of the previous frame.
5. Return the final detection and tracking results $M$.
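As a concrete illustration of step 4, frame-to-frame matching can be implemented with the Hungarian algorithm via SciPy’s `linear_sum_assignment`. This is a minimal sketch, not the authors’ implementation: the embeddings are assumed to be NumPy arrays, and Euclidean distance is an assumed choice of matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(curr_emb: np.ndarray, prev_emb: np.ndarray):
    """Match current-frame embeddings E_t to previous-frame embeddings E_{t-1}.

    curr_emb: (N, D) embeddings of the N objects detected in the current frame.
    prev_emb: (M, D) embeddings of the M objects from the previous frame.
    Returns the matching as a list of (current_index, previous_index) pairs.
    """
    # Cost matrix: pairwise Euclidean distance between embeddings
    # (an assumed cost; the algorithm box above does not specify one).
    cost = np.linalg.norm(curr_emb[:, None, :] - prev_emb[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # minimises the total matching cost
    return list(zip(rows.tolist(), cols.tolist()))

# Usage: match 3 current detections against 2 previous ones.
rng = np.random.default_rng(0)
E_t, E_prev = rng.normal(size=(3, 256)), rng.normal(size=(2, 256))
print(hungarian_match(E_t, E_prev))
```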
3.2. Swin Transformer
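The Swin Transformer computes self-attention inside non-overlapping local windows and shifts the window grid between consecutive blocks so that information flows across window boundaries, yielding hierarchical features at multiple resolutions. Below is a minimal PyTorch sketch of the window partitioning and the cyclic shift; the feature-map and window sizes are illustrative assumptions, not the configuration used in this model.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a feature map (B, H, W, C) into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # -> (num_windows * B, ws*ws, C): each window is one attention "batch"
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 8, 8, 96)          # small feature map (assumed size)
windows = window_partition(x, ws=4)   # (4, 16, 96): 4 windows of 16 tokens
# Shifted windows (used in the next block): cyclically shift the map before
# partitioning so that information crosses the previous window boundaries.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
print(windows.shape, window_partition(shifted, 4).shape)
```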
3.3. Integrating the BiFormer Module to Enhance Small Object Feature Extraction Capability
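BiFormer’s bi-level routing attention first routes coarsely between regions — each query region selects its top-k most relevant key regions from a region-level affinity matrix — and then computes fine-grained token attention only over tokens gathered from those routed regions, allocating computation in a content-aware way. The sketch below is a simplified, single-head version under assumed shapes (mean-pooled region descriptors, no projections or multi-head split), not the reference implementation.

```python
import torch
import torch.nn.functional as F

def bi_level_routing_attention(q, k, v, num_regions: int, topk: int):
    """Simplified single-head bi-level routing attention.

    q, k, v: (B, N, C) token sequences; N must divide evenly into num_regions.
    """
    B, N, C = q.shape
    r = num_regions
    n = N // r  # tokens per region
    qr, kr, vr = (t.view(B, r, n, C) for t in (q, k, v))

    # Level 1: region-to-region routing on pooled (region-mean) descriptors.
    q_region = qr.mean(dim=2)                        # (B, r, C)
    k_region = kr.mean(dim=2)                        # (B, r, C)
    affinity = q_region @ k_region.transpose(1, 2)   # (B, r, r)
    idx = affinity.topk(topk, dim=-1).indices        # (B, r, topk)

    # Level 2: gather key/value tokens from the routed regions only.
    idx_exp = idx[..., None, None].expand(-1, -1, -1, n, C)     # (B, r, topk, n, C)
    k_sel = torch.gather(kr[:, None].expand(-1, r, -1, -1, -1), 2, idx_exp)
    v_sel = torch.gather(vr[:, None].expand(-1, r, -1, -1, -1), 2, idx_exp)
    k_sel = k_sel.reshape(B, r, topk * n, C)
    v_sel = v_sel.reshape(B, r, topk * n, C)

    # Fine-grained token attention within each query region.
    attn = F.softmax(qr @ k_sel.transpose(-1, -2) / C ** 0.5, dim=-1)
    return (attn @ v_sel).reshape(B, N, C)

x = torch.randn(2, 64, 32)
y = bi_level_routing_attention(x, x, x, num_regions=4, topk=2)
print(y.shape)  # torch.Size([2, 64, 32])
```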
4. Results and Discussion
4.1. Dataset Construction
4.2. Experimental Platform and Evaluation
4.3. Model Performance
4.4. Performance Visualization
4.5. Different Algorithms Comparison
4.6. Comparison of Computational Costs for Different Models
- Model Training Time: The training times for the RT-DETR, RT-DETR-Tomato-B, RT-DETR-Tomato-S, and RT-DETR-Tomato-BS models are 2.057 h, 1.091 h, 2.633 h, and 1.305 h, respectively. The RT-DETR-Tomato-S model replaces the original ResNet50 backbone of the RT-DETR model with the Swin Transformer; the RT-DETR-Tomato-B model introduces the BiFormer Vision Transformer with bi-level routing attention to enhance small object feature extraction; the RT-DETR-Tomato-BS model applies both modifications. Shorter training times typically indicate lower computational resource demands and, thus, lower costs, so the RT-DETR-Tomato-B model has the lowest training-time cost and the RT-DETR-Tomato-S model the highest.
- Detection Performance: The maximum mAP values for the RT-DETR, RT-DETR-Tomato-B, RT-DETR-Tomato-S, and RT-DETR-Tomato-BS models are 85.4%, 87.4%, 88.0%, and 88.7%, respectively. The RT-DETR-Tomato-BS model, which combines the Swin Transformer backbone with the BiFormer module, achieves the best detection performance, while the baseline RT-DETR model performs the worst.
- Resource Consumption and Complexity: The RT-DETR-Tomato-S model replaces the ResNet50 backbone with the Swin Transformer, whose hierarchical structure captures richer global dependencies and contextual information, enables stronger feature representation in complex images and scenes, and processes features at different resolutions, thereby improving detection accuracy. However, the Swin Transformer blocks in the backbone carry redundant parameters, increasing resource consumption and computational complexity relative to RT-DETR. The RT-DETR-Tomato-B model introduces the BiFormer Vision Transformer, whose lightweight blocks with bi-level routing attention allocate computation flexibly in a content-aware manner; this reduces computational complexity relative to RT-DETR, at the cost of somewhat higher resource consumption. The RT-DETR-Tomato-BS model combines both modifications, improving feature extraction capability and reducing computational complexity, but increasing resource consumption.
- Overall Cost Assessment: The baseline RT-DETR model has a moderate training time, the lowest performance, and a moderate cost. The RT-DETR-Tomato-B model has the shortest training time, good performance, and the lowest cost. The RT-DETR-Tomato-S model has the longest training time, good performance, and the highest cost. The RT-DETR-Tomato-BS model has a relatively short training time, the best performance, and the best overall cost-effectiveness.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Table: Detection performance of the baseline RT-DETR and the improved RT-DETR-Tomato models.

Models | IoU (%) | P (%) | R (%) | F1 (%) | mAP50 (%) | mAP50-95 (%)
---|---|---|---|---|---|---
RT-DETR | 78.3 | 82.5 | 80.0 | 81.2 | 85.4 | 44.8
RT-DETR-Tomato-S | 78.8 | 84.4 | 82.3 | 83.3 | 88.0 | 49.0
RT-DETR-Tomato-B | 79.0 | 84.1 | 82.7 | 83.2 | 87.4 | 48.3
RT-DETR-Tomato-BS | 78.8 | 85.3 | 82.9 | 84.1 | 88.7 | 50.3
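The fourth column, unlabeled in the source, is evidently the F1 score, the harmonic mean of precision P and recall R: recomputing it from the reported P and R reproduces the column exactly for three rows (the 83.4 computed for RT-DETR-Tomato-B vs. the reported 83.2 is presumably a rounding artifact in the underlying values).

```python
# F1 = 2PR / (P + R): recompute the F1 column from the reported P and R.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

rows = [("RT-DETR", 82.5, 80.0), ("RT-DETR-Tomato-S", 84.4, 82.3),
        ("RT-DETR-Tomato-B", 84.1, 82.7), ("RT-DETR-Tomato-BS", 85.3, 82.9)]
for name, p, r in rows:
    print(f"{name}: F1 = {f1(p, r):.1f}")  # 81.2, 83.3, 83.4, 84.1
```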