Tomato Visual Object Detection Method Based on the Mamba State Space Model
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
2.1.1. Data Acquisition
2.1.2. Data Labeling and Augmentation
2.2. YOLO-VCW
2.2.1. Architecture
2.2.2. State Space Models and Mamba Core Operators
2.2.3. Visual State Space Module and Cross-Scan Mechanism
2.2.4. Feature Enhancement Strategy Based on Coordinate Attention
2.2.5. Optimization of the Bounding Box Regression Loss Function
2.3. Experimental Settings
2.3.1. Experimental Environment and Training Settings
2.3.2. Evaluation Metrics
3. Experimental Results and Analysis
3.1. Ablation Study
3.2. Comparison of Different Detection Models
3.2.1. Confusion Matrix Analysis
3.2.2. Detection Result Comparison of Different Algorithms
3.3. Public Dataset Validation
3.4. Random Seed Experiment
4. Discussion
4.1. Model Advantages and Practical Significance
4.2. Model Limitations and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References










| Configuration Category | Configuration Item | Detailed Information |
|---|---|---|
| Hardware Environment | CPU | Intel Core i5-14450 |
| Hardware Environment | RAM | 8 GB |
| Hardware Environment | GPU | RTX 5060 |
| Software Environment | Operating System | Ubuntu 22.04 LTS (64-bit) |
| Software Environment | Python Version | Python 3.10 |
| Software Environment | Deep Learning Framework | PyTorch 2.1.0 |
| Software Environment | CUDA Version | CUDA 12.1 |

| Parameter | Value |
|---|---|
| Image Size | 640 × 640 |
| Number of Epochs | 300 |
| Batch Size | 16 |
| Learning Rate | 0.01 |
| Minimum Learning Rate | 0.0001 |
| Momentum | 0.937 |
| Weight Decay | 0.0005 |
| Optimizer | Stochastic Gradient Descent (SGD) |
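
The training settings above can be collected into a small configuration object. The following is a minimal sketch, not the authors' training script: the class name `TrainConfig`, its field names, and the `linear_lr` helper are hypothetical, and the linear decay is only an assumption, since the table gives the initial and minimum learning rates but does not name the schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters copied from the training-settings table."""

    image_size: int = 640          # square input, 640 x 640
    epochs: int = 300
    batch_size: int = 16
    lr_initial: float = 0.01
    lr_min: float = 0.0001
    momentum: float = 0.937
    weight_decay: float = 0.0005
    optimizer: str = "SGD"

    @property
    def final_lr_fraction(self) -> float:
        """Minimum learning rate expressed as a fraction of the initial one."""
        return self.lr_min / self.lr_initial

    def linear_lr(self, epoch: int) -> float:
        """ASSUMPTION: linear decay from lr_initial to lr_min over all epochs.

        The source does not state which schedule was used.
        """
        progress = min(max(epoch, 0), self.epochs) / self.epochs
        return self.lr_initial + (self.lr_min - self.lr_initial) * progress


cfg = TrainConfig()
print(f"final/initial LR ratio: {cfg.final_lr_fraction:.4f}")
print(f"LR at epoch 150 (assumed linear decay): {cfg.linear_lr(150):.6f}")
```

Some training frameworks take the minimum learning rate as a fraction of the initial rate rather than as an absolute value, which is why the ratio is exposed as a property here.
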

| YOLOv8n | C2f-VSS | CA | WIoUv3 | P/% | R/% | F1/% | mAP50/% | mAP50–95/% | Parameters/M | GFLOPs/G |
|---|---|---|---|---|---|---|---|---|---|---|
| √ | | | | 89.43 | 82.36 | 85.75 | 86.27 | 53.18 | 3.2 | 8.7 |
| √ | √ | | | 90.55 | 84.20 | 87.26 | 88.12 | 55.02 | 3.8 | 9.6 |
| √ | | √ | | 89.90 | 83.17 | 86.40 | 87.50 | 54.76 | 3.3 | 8.9 |
| √ | | | √ | 90.62 | 84.05 | 87.21 | 86.64 | 54.13 | 3.2 | 8.7 |
| √ | √ | √ | | 91.06 | 85.27 | 88.07 | 90.48 | 56.68 | 3.9 | 9.8 |
| √ | √ | √ | √ | 91.33 | 86.79 | 89.00 | 90.71 | 57.05 | 3.9 | 9.8 |

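
The F1 column in the ablation table can be reproduced from the P and R columns with the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below recomputes it for each row; the function name `f1_score` and the list `ablation_rows` are illustrative names, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P/%, R/%, reported F1/%) for the six ablation rows, top to bottom.
ablation_rows = [
    (89.43, 82.36, 85.75),
    (90.55, 84.20, 87.26),
    (89.90, 83.17, 86.40),
    (90.62, 84.05, 87.21),
    (91.06, 85.27, 88.07),
    (91.33, 86.79, 89.00),
]

for precision, recall, reported in ablation_rows:
    computed = f1_score(precision, recall)
    # Agreement to two decimals means a difference of at most 0.005.
    status = "ok" if abs(computed - reported) < 0.006 else "mismatch"
    print(f"P={precision:.2f} R={recall:.2f} F1={computed:.2f} "
          f"(reported {reported:.2f}) {status}")
```
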
| Models | P/% | R/% | F1/% | mAP50/% | mAP50–95/% | Parameters/M | GFLOPs/G |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | 87.20 | 80.15 | 83.53 | 84.30 | 50.06 | 41.2 | 79.4 |
| SSD | 85.67 | 78.52 | 81.94 | 82.71 | 48.28 | 24.4 | 45.6 |
| YOLOv7-Tiny | 88.19 | 81.03 | 84.46 | 85.60 | 51.34 | 6.7 | 12.1 |
| YOLOv8n | 89.43 | 82.36 | 85.75 | 86.27 | 53.18 | 3.2 | 8.7 |
| YOLO-VCW | 91.33 | 86.79 | 89.00 | 90.71 | 57.05 | 3.9 | 9.8 |
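
To read the comparison table as a trade-off between accuracy and cost, the YOLO-VCW row can be subtracted from the YOLOv8n baseline row. This is a small arithmetic sketch with hypothetical helper names (`absolute_gain`, `relative_gain_percent`); the values are copied from the table above.

```python
# Rows copied from the model comparison table.
baseline = {"P": 89.43, "R": 82.36, "F1": 85.75, "mAP50": 86.27,
            "mAP50-95": 53.18, "Params_M": 3.2, "GFLOPs": 8.7}
proposed = {"P": 91.33, "R": 86.79, "F1": 89.00, "mAP50": 90.71,
            "mAP50-95": 57.05, "Params_M": 3.9, "GFLOPs": 9.8}


def absolute_gain(new: dict, old: dict) -> dict:
    """Difference new - old per metric (percentage points, M, or G)."""
    return {key: round(new[key] - old[key], 2) for key in old}


def relative_gain_percent(new: dict, old: dict) -> dict:
    """Change relative to the baseline value, in percent."""
    return {key: round(100 * (new[key] - old[key]) / old[key], 2) for key in old}


gains = absolute_gain(proposed, baseline)
relative = relative_gain_percent(proposed, baseline)
for name in baseline:
    print(f"{name:>9}: {gains[name]:+.2f} ({relative[name]:+.2f} % relative)")
```

On these numbers, mAP50 rises by 4.44 percentage points and recall by 4.43, while the model grows by 0.7 M parameters and 1.1 GFLOPs.
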

| Models | P/% | R/% | F1/% | mAP50/% | mAP50–95/% |
|---|---|---|---|---|---|
| RT-DETR-S | 85.42 | 82.01 | 83.68 | 83.17 | 49.29 |
| YOLOv8n | 86.61 | 81.53 | 83.99 | 83.31 | 49.04 |
| YOLO-VCW | 87.76 | 83.58 | 85.62 | 85.65 | 52.40 |

| Model | Evaluation Metric | Seed 2 | Seed 46 | Seed 97 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|
| YOLOv8n | P/% | 89.27 | 89.52 | 89.17 | 89.32 | 0.18 |
| YOLOv8n | R/% | 82.41 | 82.05 | 82.28 | 82.25 | 0.18 |
| YOLOv8n | F1/% | 85.70 | 85.62 | 85.59 | 85.64 | 0.06 |
| YOLOv8n | mAP50/% | 86.11 | 86.34 | 86.06 | 86.17 | 0.15 |
| YOLOv8n | mAP50–95/% | 53.05 | 53.23 | 53.24 | 53.17 | 0.11 |
| YOLO-VCW | P/% | 91.05 | 90.71 | 91.30 | 91.02 | 0.30 |
| YOLO-VCW | R/% | 86.57 | 86.82 | 86.63 | 86.67 | 0.13 |
| YOLO-VCW | F1/% | 88.75 | 88.72 | 88.90 | 88.79 | 0.10 |
| YOLO-VCW | mAP50/% | 90.29 | 90.64 | 90.82 | 90.58 | 0.27 |
| YOLO-VCW | mAP50–95/% | 57.28 | 57.16 | 56.89 | 57.11 | 0.20 |

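
The Mean and Standard Deviation columns of the random seed table can be recomputed from the three per-seed values. The sketch below uses Python's `statistics` module; the dictionary name `seed_runs` is illustrative. The reported figures are consistent with the sample standard deviation (divisor n − 1), which is what `statistics.stdev` computes, rather than the population form.

```python
from statistics import mean, stdev

# (model, metric) -> ((seed 2, seed 46, seed 97), reported mean, reported std)
seed_runs = {
    ("YOLOv8n", "P/%"): ((89.27, 89.52, 89.17), 89.32, 0.18),
    ("YOLOv8n", "R/%"): ((82.41, 82.05, 82.28), 82.25, 0.18),
    ("YOLOv8n", "F1/%"): ((85.70, 85.62, 85.59), 85.64, 0.06),
    ("YOLOv8n", "mAP50/%"): ((86.11, 86.34, 86.06), 86.17, 0.15),
    ("YOLOv8n", "mAP50-95/%"): ((53.05, 53.23, 53.24), 53.17, 0.11),
    ("YOLO-VCW", "P/%"): ((91.05, 90.71, 91.30), 91.02, 0.30),
    ("YOLO-VCW", "R/%"): ((86.57, 86.82, 86.63), 86.67, 0.13),
    ("YOLO-VCW", "F1/%"): ((88.75, 88.72, 88.90), 88.79, 0.10),
    ("YOLO-VCW", "mAP50/%"): ((90.29, 90.64, 90.82), 90.58, 0.27),
    ("YOLO-VCW", "mAP50-95/%"): ((57.28, 57.16, 56.89), 57.11, 0.20),
}

# Recompute both statistics for every row.
summary = {
    key: (mean(values), stdev(values))  # stdev uses the n - 1 divisor
    for key, (values, _, _) in seed_runs.items()
}

for (model, metric), (avg, sd) in summary.items():
    _, reported_mean, reported_sd = seed_runs[(model, metric)]
    print(f"{model:<9} {metric:<11} mean={avg:.2f} (reported {reported_mean:.2f}) "
          f"sd={sd:.2f} (reported {reported_sd:.2f})")
```
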
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Li, W.; Zheng, H.; Zhao, C.; Liu, W.; Li, S.; Qian, M. Tomato Visual Object Detection Method Based on the Mamba State Space Model. Horticulturae 2026, 12, 770. https://doi.org/10.3390/horticulturae12070770

