Tomato Visual Object Detection Method Based on the Mamba State Space Model

Li, Wenhao; Zheng, Hengyi; Zhao, Chengheng; Liu, Wei; Li, Shunjie; Qian, Mengbo

doi:10.3390/horticulturae12070770

by Wenhao Li ¹,², Hengyi Zheng ¹,², Chengheng Zhao ¹,², Wei Liu ¹,², Shunjie Li ¹,² and Mengbo Qian ¹,²,*

¹ College of Optical, Mechanical and Electrical Engineering, Zhejiang A&F University, Hangzhou 311300, China

² Zhejiang Key Laboratory of Intelligent Sensing and Robotics for Agriculture, Hangzhou 310058, China

* Author to whom correspondence should be addressed.

Horticulturae 2026, 12(7), 770; https://doi.org/10.3390/horticulturae12070770 (registering DOI)

Submission received: 24 May 2026 / Revised: 16 June 2026 / Accepted: 22 June 2026 / Published: 24 June 2026

(This article belongs to the Special Issue Intelligent Agricultural Equipment Monitoring Technology for Vegetable Production)

Abstract

Tomato harvesting still relies heavily on manual labor. Clustered fruit growth, inconsistent ripening stages, occlusion, and complex cultivation environments pose significant challenges to automated harvesting systems and place high demands on target detection accuracy. To address these issues, a tomato detection method based on the Mamba state space model was proposed, and an improved model termed YOLO-VCW was developed based on YOLOv8n. Specifically, the original C2f module in the backbone network was replaced with the C2f-VSS module to enhance global contextual feature extraction. A Coordinate Attention mechanism was introduced into the feature fusion stage to improve the model’s ability to focus on tomato target regions under complex background and occlusion conditions. In addition, the WIoUv3 loss function was adopted in the detection head to improve localization accuracy and training stability in overlapping fruit scenarios. Experimental results showed that YOLO-VCW achieved a precision of 91.33%, a recall of 86.79%, and an F1-score of 89.00% on the tomato dataset. Compared with YOLOv8n, the proposed model improved precision, recall, F1-score, and mAP₅₀ by 1.90, 4.43, 3.25, and 4.44 percentage points, respectively, while the parameter count increased only slightly, to 3.9 M. These results demonstrate that YOLO-VCW provides effective and robust performance for tomato target detection in complex environments.

Keywords: target detection; YOLOv8n; Mamba state space model; coordinate attention mechanism; computer vision

1. Introduction

With the advancement of protected agriculture and smart equipment technologies, the application of tomato harvesting robots in fruit and vegetable production is gradually increasing. As a crucial component of the robot’s perception system, tomato object detection and maturity recognition directly determine fruit localization, grasping decisions, and operational efficiency. Consequently, their detection accuracy and real-time performance significantly impact the overall efficacy of the system [1]. However, affected by factors such as natural lighting variations, occlusion by branches and leaves, and differences in fruit scale, tomato object detection in complex agricultural environments remains highly challenging [2].

Early research primarily employed traditional image processing and machine learning methods for tomato recognition. El-Bendary et al. [3] evaluated tomato maturity by extracting color and texture features combined with classification algorithms. Chen and Meng [4] achieved tomato target recognition in greenhouses based on image segmentation and morphological features. With the evolution of deep learning technologies, object detection methods based on convolutional neural networks (CNNs) have gradually become a research hotspot. As a typical two-stage detection algorithm, Faster R-CNN exhibits excellent detection accuracy through the separated modeling of region proposal generation and classification regression [5]. Wang et al. [6] structurally improved Faster R-CNN to achieve multi-target tomato maturity detection in complex scenes.

In contrast, single-stage object detection methods, represented by YOLO and SSD, have been widely applied in the field of agricultural vision due to their end-to-end architectures and high detection efficiency. SSD performs object prediction via multi-scale feature maps, providing an essential reference for lightweight real-time detection models [7]. Liu et al. [8] introduced YOLOv3 into tomato detection tasks, verifying the applicability of single-stage detection methods in natural environments.

In recent years, YOLO-based tomato object detection has made substantial progress. Dong et al. [9] enhanced tomato detection performance in natural environments by improving the YOLOv8 network structure. Hao et al. [10] proposed the lightweight GSBF-YOLO model, reducing computational complexity while ensuring detection accuracy. To address small-scale object detection, Yue et al. [11] achieved favorable real-time detection outcomes through a feature enhancement strategy. Liu et al. [12] combined the Swin Transformer with a multi-branch feature pyramid structure, effectively boosting detection accuracy in complex scenes. Wang et al. [13] and Wu et al. [14] improved the YOLO model from the perspectives of feature fusion and network structure optimization, enhancing both tomato maturity recognition and object detection performance. Fan et al. [15] fused an improved YOLOv8s with RGB-D information to achieve recognition and localization for tomato harvesting robots. Ni et al. [16] implemented tomato instance segmentation based on SwinS-YOLACT, providing technical support for precise grasping. Liang and Wei [17] utilized CycleGAN to improve nighttime imaging quality, thereby enhancing detection efficacy under complex lighting conditions. Furthermore, models such as YOLOX [18] and YOLOv5 [19] have also been employed in tomato vision tasks, further enriching the related research framework. Vivi and Erniwati [20] provided a systematic review of the structural evolution and application progress of YOLOv8.

Existing CNN- and Transformer-based detection models still face limitations in lightweight agricultural vision tasks. CNN-based methods mainly rely on local receptive fields and therefore have limited capability in modeling long-range contextual dependencies, which may lead to missed detections and blurred object boundaries under severe occlusion. Although vision Transformers can effectively capture global information, their quadratic computational complexity introduces considerable computational overhead, limiting their deployment on resource-constrained agricultural edge devices.

Recently, Mamba-based Structured State Space Models (SSMs) have emerged as an efficient alternative for visual representation learning. Benefiting from linear computational complexity and selective information propagation, Mamba can model long-range dependencies while maintaining low computational cost [21]. Building upon this framework, vision-oriented variants such as Vision Mamba and VMamba extend state-space modeling to visual tasks and achieve competitive performance in image understanding and object perception [22,23]. In particular, VMamba introduces the 2D Selective Scan (SS2D) mechanism, which enables feature interaction across different spatial regions through multidirectional scanning and global context aggregation. By capturing long-range dependencies in two-dimensional feature maps, SS2D enhances the representation of partially occluded targets and improves boundary discrimination in complex scenes. These characteristics are particularly beneficial for tomato detection tasks involving dense fruit distribution, branch-and-leaf occlusion, and background interference.

Furthermore, attention mechanisms have been widely adopted to enhance feature representation in complex visual environments. Among them, Coordinate Attention (CA) embeds positional information into channel attention, enabling the network to capture both channel dependencies and precise spatial location information simultaneously. Compared with conventional channel attention mechanisms, CA can more effectively highlight target regions while suppressing background interference, making it particularly suitable for tomato detection tasks under conditions of leaf occlusion, illumination variation, and complex backgrounds [24].

In addition, the WIoUv3 loss function dynamically adjusts the gradient contribution of samples with different qualities, allowing the model to focus more effectively on medium-quality samples during training and improving the utilization efficiency of training data. Consequently, it enhances bounding-box localization accuracy and detection robustness in complex agricultural environments [25].

Although existing studies have achieved notable results in tomato object detection and maturity recognition, the application of Mamba-based visual architectures to lightweight tomato detection in complex agricultural environments remains relatively limited. Furthermore, challenges persist in accurately representing target features under occlusion and in balancing lightweight model design against detection accuracy [2,26]. Therefore, this study further explores the integration of state-space modeling with a lightweight object detection framework and proposes the YOLO-VCW model based on YOLOv8n. By introducing the C2f-VSS module, Coordinate Attention mechanism, and WIoUv3 loss function, the proposed model aims to improve tomato detection accuracy while preserving computational efficiency, thereby providing a more practical visual perception solution for large-scale tomato harvesting robots.

2. Materials and Methods

2.1. Dataset

2.1.1. Data Acquisition

To construct a dataset representative of the actual tomato harvesting environment, image acquisition was conducted in Hangzhou, Zhejiang Province, China. An iPhone 15 smartphone was utilized as the acquisition device to effectively capture the color, texture, and edge information of the tomatoes, thereby providing high-quality raw data to support the subsequent training of the object detection model.

During the data collection process, the primary targets were mature tomatoes, alongside a selection of fruits at varying stages of maturity. Complex scenarios frequently encountered in natural environments—such as dense fruit distribution, branch and leaf occlusion, and fruit overlapping—were deliberately included. Images were captured from multiple perspectives, including frontal, top, and side views, to accurately simulate the field of view of visual sensors during practical harvesting operations. All images were saved at the device’s original resolution without any compression. This approach maximized the preservation of image details, laying a solid foundation for subsequent image annotation and model training.

2.1.2. Data Labeling and Augmentation

Following image acquisition, the open-source tool LabelImg (version 1.8.6) was employed for the manual annotation of tomato targets. In accordance with the data format requirements of the YOLO series object detection models, bounding boxes were applied to the tomato targets within the images. In this study, tomatoes were categorized into three ripeness stages: mature, in which red coloration covered 90% to 100% of the fruit surface; semi-mature, in which coloration began around the fruit umbilicus and developed red halos, with red coloration covering 31% to 89% of the surface; and immature, in which the pericarp was green or whitish-green, with red coloration covering 0% to 30% of the surface. To ensure high annotation quality, a principle of consistency was strictly followed. For fruits with partial occlusion or blurred boundaries, the visible areas were annotated based on expert experience to prevent the omission of valid targets or the introduction of significant noise.

Upon completion of the initial annotation, the data were converted into the YOLO format to construct the raw dataset alongside the corresponding images. Considering the complexity of sample distribution in practical harvesting scenarios and the high cost of data acquisition, as well as the heavy reliance of deep learning models on data scale and diversity, various data augmentation techniques were applied to expand the raw dataset as shown in Figure 1. Specifically, operations such as mirroring, cropping, brightness adjustment, rotation, and blurring were implemented to simulate variations under different shooting conditions. These methods effectively enhance the model’s adaptability to lighting fluctuations, scale variances, and background complexity.

A total of 1164 tomato images were collected and annotated in this study. The original dataset was first divided into training, validation, and test sets at a ratio of 8:1:1, resulting in 931, 116, and 117 images, respectively. To improve sample diversity and enhance the generalization capability of the model, various data augmentation techniques shown in Figure 1 were applied exclusively to the training set after dataset partitioning. Through these augmentation operations, the number of training images increased from 931 to 3967. The validation and test sets remained unchanged throughout the experiments, containing 116 and 117 images, respectively. Consequently, the final dataset used in this study consisted of 3967 training images, 116 validation images, and 117 test images, resulting in a total of 4200 images.

This dataset configuration provided a reliable foundation for model training, hyperparameter optimization, and performance evaluation of the proposed YOLO-VCW model while preventing potential information leakage between different subsets.
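The split-then-augment order described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `split_dataset`, the placeholder file names, the seed, and the rounding rule (floor for training and validation, remainder to test) are all assumptions, chosen so that the counts match those reported.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/val/test lists.

    Training and validation sizes are floored and the test set takes the
    remainder. This rounding rule is an assumption, not taken from the paper.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 1164 annotated images.
image_names = [f"tomato_{i:04d}.jpg" for i in range(1164)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))  # 931 116 117
```

Augmentation would then be applied to `train` only, so that no augmented variant of a photograph can end up in the validation or test data.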

2.2. YOLO-VCW

2.2.1. Architecture

YOLOv8 is a representative single-stage object detection algorithm that employs a tripartite structure consisting of a Backbone, Neck, and Head. It achieves an excellent balance between detection accuracy and real-time efficiency, leading to its widespread adoption in real-time visual perception tasks. However, in agricultural applications such as tomato harvesting, fruits are frequently subject to branch and leaf occlusion, mutual overlapping, and significant fluctuations in illumination. Traditional convolutional neural networks, constrained by their local receptive fields, exhibit certain limitations in modeling long-range dependencies and complex spatial structures. As a result, feature information of partially occluded tomatoes may not be effectively propagated across distant regions, leading to missed detections and inaccurate boundary localization in complex agricultural scenes.

To address these challenges, this study adopts YOLOv8n as the baseline model and introduces targeted structural optimizations to propose an improved object detection model: YOLO-VCW. The overall architecture maintains the original three-stage structure of YOLOv8 to keep the model lightweight and preserve real-time performance, while integrating specialized mechanisms into key modules to better handle complex scene modeling.

In the Backbone section, a vision modeling module based on the State Space Model (SSM) is introduced, where the original C2f modules are replaced with C2f-VSS modules to enhance the network’s capability for global contextual modeling. In the Neck (feature fusion network), the Coordinate Attention (CA) mechanism is integrated to improve the model’s focus on tomato target regions under complex backgrounds and occluded conditions. For the Head (detection head), the WIoUv3 loss function is adopted for bounding box regression instead of the original loss, aiming to enhance localization precision and training stability in scenarios with overlapping fruits. The three improvements complement each other in functionality: the C2f-VSS module strengthens global contextual modeling, the CA mechanism enhances discriminative spatial feature representation, and WIoUv3 improves localization quality during training. Through their synergistic integration, the YOLO-VCW network effectively elevates the accuracy and robustness of tomato detection while preserving the real-time advantages of the YOLOv8 framework. The overall architecture of the proposed model is illustrated in Figure 2.

2.2.2. State Space Models and Mamba Core Operators

The C2f module in the YOLOv8 backbone enhances feature representation through a multi-branch convolutional structure. However, its convolution-based design inherently suffers from limited receptive fields and weak long-range dependency modeling, which restricts detection performance under complex greenhouse conditions such as occlusion and blurred fruit boundaries. To address this deficiency, this study integrates the concept of State Space Models (SSM) based on Mamba to refine the feature extraction modules in the backbone, constructing the C2f-VSS module.

SSMs were originally designed to describe dynamic systems that evolve over time. Their core principle involves recursive modeling of historical information through hidden states, enabling long-sequence modeling with low computational complexity. For a one-dimensional sequence input, the recurrent (discretized) form of an SSM can be expressed as:

$$h(t) = A\,h(t-1) + B\,x(t) \tag{1}$$

$$y(t) = C\,h(t) \tag{2}$$

where $x(t)$ represents the input feature at time $t$, $h(t)$ denotes the hidden state of the system, and $y(t)$ is the output feature. The terms $A$, $B$, and $C$ are the parameters of the state space model. In a Selective State Space Model, matrices $B$ and $C$ are dynamically modulated according to the input features, whereas matrix $A$ remains a shared static state transition matrix [21].

Because of their self-attention mechanism, conventional Transformer models have $O(L^{2})$ time and space complexity when processing a sequence of length $L$. In contrast, the Selective State Space Model (S6) employed by Mamba keeps the overall computational complexity at $O(L)$ through its recurrent state update, while input-dependent dynamic parameters provide selectivity [21]. Compared with mainstream vision SSM backbones such as VMamba, Mamba’s selective mechanism has stronger information screening ability: it can filter out invalid background noise and retain effective target features, making it more suitable for cluttered agricultural scenes. The linear scaling also makes it considerably better suited to modeling high-resolution visual features, as illustrated in Figure 3.

The Selective State Space Model (S6) initially processes the input features $x(t)$ through a branching structure, generating input-dependent dynamic parameters via linear mapping. These include the step size parameter $Δ$, the input mapping parameter $B$, and the output mapping parameter $C$. The distinct branches model the time scales and feature mapping relationships of the state update, so that the model adapts to the input content. In the discretization step, Mamba employs the Zero-Order Hold (ZOH) method to discretize the continuous state space model, so the step size $Δ$ determines how strongly each input updates the hidden state. This selectivity allows the model to suppress complex background noise and redundant foliage information in tomato detection tasks. Meanwhile, state propagation establishes long-range contextual dependencies between spatially separated fruit regions, which yields a more complete feature representation for partially occluded tomatoes and keeps the focus on the discriminative features of the fruit regions.
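To make the role of the step size concrete, the following is a minimal scalar sketch of ZOH discretization followed by the recurrence of Equations (1) and (2). It is an illustration under stated assumptions, not the Mamba implementation: the state is a single scalar, the function names `zoh_discretize` and `selective_scan` are hypothetical, and the per-step values of Δ, B and C are passed in directly instead of being produced by learned linear projections.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t), scalar a != 0."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def selective_scan(xs, a, deltas, bs, cs):
    """Run h_t = a_bar_t*h_{t-1} + b_bar_t*x_t and y_t = c_t*h_t over a sequence.

    deltas, bs and cs vary per step, standing in for the input-dependent
    parameters; a is shared across all steps.
    """
    h = 0.0
    ys = []
    for x, delta, b, c in zip(xs, deltas, bs, cs):
        a_bar, b_bar = zoh_discretize(a, b, delta)
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


# Constant input with fixed parameters: the state settles at -b/a = 2.
n = 200
ys = selective_scan([1.0] * n, a=-1.0, deltas=[0.1] * n, bs=[2.0] * n, cs=[1.0] * n)
print(round(ys[-1], 6))

# A near-zero step size leaves the state almost untouched (the input is ignored).
print(zoh_discretize(-1.0, 2.0, 1e-9))
```

A large Δ lets the current input overwrite the state, while a small Δ preserves the previous state, which is the gating behavior the paragraph above attributes to the selective mechanism.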

2.2.3. Visual State Space Module and Cross-Scan Mechanism

Tomato images exhibit typical two-dimensional spatial structural characteristics. If feature representations are serialized and processed along only a single direction, it becomes difficult to adequately capture the global dependencies across spatial dimensions [22]. To address this issue, a two-dimensional selective scanning mechanism [23], namely Selective Scan in 2D (SS2D), was introduced into the C2f-VSS module to enhance the model’s capability for two-dimensional spatial information modeling. As illustrated in Figure 4, the C2f-VSS module retains the cross-layer feature fusion characteristics of the original C2f module, where feature reuse and gradient propagation are achieved through a multi-branch architecture. However, the original residual convolution units in the branches are replaced with processing units based on the Visual State Space (VSS) model [22,23], enabling the module to achieve global modeling capability while maintaining a relatively low parameter count.

SS2D maps two-dimensional spatial features into multiple one-dimensional sequences through multi-directional serialized scanning of the feature map, and these sequences are subsequently modeled by the state space model [23]. Specifically, given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, SS2D performs scanning along four directions: from the upper-left to the lower-right, from the lower-right to the upper-left, from the upper-right to the lower-left, and from the lower-left to the upper-right [23], as illustrated in Figure 5. This multi-directional cross-scan design is central to addressing the blurred boundary problem in tomato detection. Traditional convolutions extract only local edge information, which makes it difficult to distinguish fruit boundaries from surrounding foliage when tomatoes are occluded, resemble stems and leaves in color, or are unevenly illuminated. In contrast, SS2D lets each pixel in the feature map receive contextual information from different spatial directions, forming an approximately global receptive field under complex tomato plant backgrounds while maintaining linear computational complexity. Pixels located near ambiguous boundaries therefore receive complementary semantic cues from surrounding regions, which enlarges the feature difference between tomato fruits and adjacent vegetation and alleviates the boundary ambiguity caused by occlusion, illumination variation, and background interference.
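The serialize-then-merge idea can be sketched on a single-channel grid. This is a hedged illustration only: it assumes the four routes are row-major, column-major, and the reverse of each, which is how VMamba's cross-scan is commonly described, and it omits the state space model that would process each sequence between the two steps. The function names `cross_scan` and `cross_merge` are hypothetical.

```python
def cross_scan(feature):
    """Flatten an H x W grid into four 1D sequences.

    Routes (an assumption): row-major, column-major, and the reverse of each.
    """
    h, w = len(feature), len(feature[0])
    row_major = [feature[i][j] for i in range(h) for j in range(w)]
    col_major = [feature[i][j] for j in range(w) for i in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(sequences, h, w):
    """Map the four sequences back onto the grid and sum them per position."""
    row_major, col_major, row_rev, col_rev = sequences
    row_back = row_rev[::-1]  # undo the reversal
    col_back = col_rev[::-1]
    merged = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r = i * w + j  # position in a row-major sequence
            c = j * h + i  # position in a column-major sequence
            merged[i][j] = row_major[r] + row_back[r] + col_major[c] + col_back[c]
    return merged


grid = [[0, 1, 2],
        [3, 4, 5]]
sequences = cross_scan(grid)
print(sequences[1])                  # [0, 3, 1, 4, 2, 5]
print(cross_merge(sequences, 2, 3))  # [[0, 4, 8], [12, 16, 20]]
```

With no sequence model in between, merging simply returns four times each input value, which confirms that every route is mapped back to the correct pixel. In an actual SS2D block each sequence would carry different context, because each route reaches a given pixel after visiting a different part of the image.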

2.2.4. Feature Enhancement Strategy Based on Coordinate Attention

In practical harvesting scenarios, tomato targets are often densely clustered and are easily affected by leaf occlusion and background interference. Traditional channel attention mechanisms mainly focus on channel importance while neglecting the precise spatial location information of targets, making them insufficient for detection tasks in complex environments. To address this issue, a Coordinate Attention (CA) mechanism [24] was introduced during the feature fusion stage, as illustrated in Figure 6. Compared with conventional channel attention mechanisms such as SE, CA retains precise positional information during feature encoding, making it more suitable for agricultural object detection tasks where target localization and background suppression are equally important. CA decomposes the global average pooling operation into two one-dimensional feature encoding processes along the horizontal and vertical directions, thereby explicitly preserving positional information while modeling inter-channel relationships. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the feature aggregation processes along the height and width directions can be expressed as follows:

$$z_{h}(c, h) = \frac{1}{W} \sum_{w = 1}^{W} X(c, h, w) \tag{3}$$

$$z_{w}(c, w) = \frac{1}{H} \sum_{h = 1}^{H} X(c, h, w) \tag{4}$$

where $c$ denotes the channel index, while $H$ and $W$ represent the height and width of the feature map, respectively. Subsequently, the aggregated features are compressed and fused through a shared $1 \times 1$ convolution to generate attention weight maps with high spatial sensitivity, which are then applied to the original feature map. Combined with the C2f-VSS module, CA forms a two-level feature optimization: C2f-VSS performs global feature extraction and boundary enhancement, and CA then strengthens the spatial positioning of tomato targets. With CA, the model attends not only to the semantic information of tomato targets but also to their precise spatial locations, which improves detection of targets with blurred boundaries and partial occlusion. This characteristic is particularly beneficial for tomato detection scenarios involving dense fruit clusters and severe leaf occlusion.
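The two directional pooling steps can be checked on a tiny tensor. The sketch below covers only the aggregation step, with one pooled vector indexed by row and one indexed by column; the shared convolution and the final reweighting are omitted. The function name `directional_pool` and the nested-list layout (channel, then row, then column) are assumptions made for illustration.

```python
def directional_pool(x):
    """Return (z_h, z_w) for a C x H x W nested list x.

    z_h[c][h] averages over the width; z_w[c][w] averages over the height.
    """
    z_h = [[sum(row) / len(row) for row in channel] for channel in x]
    z_w = [
        [sum(channel[h][w] for h in range(len(channel))) / len(channel)
         for w in range(len(channel[0]))]
        for channel in x
    ]
    return z_h, z_w


x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]  # one channel, H = 2, W = 3
z_h, z_w = directional_pool(x)
print(z_h)  # [[2.0, 5.0]]
print(z_w)  # [[2.5, 3.5, 4.5]]
```

Unlike a single global average, which would reduce this channel to one number, the two pooled vectors keep one value per row and one per column, which is what lets the attention weights stay position-aware.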

2.2.5. Optimization of the Bounding Box Regression Loss Function

In scenarios where tomato targets are densely distributed and severely overlapped, significant quality differences often exist between predicted bounding boxes and ground-truth boxes. Traditional bounding box regression loss functions, such as CIoU, mainly focus on geometric overlap while lacking a dynamic evaluation mechanism for sample quality. As a result, the training process is easily affected by low-quality samples, such as annotation errors or heavily occluded targets, thereby reducing convergence stability. To address this issue, the Wise-IoUv3 (WIoUv3) loss function [25], which incorporates a dynamic non-monotonic focusing mechanism, was introduced in this study. Based on geometric measurements, WIoUv3 constructs a two-layer attention-based computational framework to achieve adaptive gradient adjustment for samples of different quality levels.

First, to reduce the penalty imposed by center-point distance on high-quality anchor boxes while enhancing the attention paid to low-quality anchor boxes, a distance attention mechanism $R_{WIoU}$ was introduced to construct the baseline loss $L_{WIoUv1}$, as expressed in Equations (5) and (6):

$$L_{WIoUv1} = R_{WIoU}\, L_{IoU} \tag{5}$$

$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_{g}^{2} + H_{g}^{2}\right)^{*}}\right) \tag{6}$$

where $L_{IoU} = 1 - IoU$; $(x, y)$ and $(x_{gt}, y_{gt})$ represent the center coordinates of the predicted box and the ground-truth box, respectively; $W_{g}$ and $H_{g}$ denote the width and height of the minimum enclosing box; and the superscript $*$ indicates that the denominator is detached from the computational graph to prevent gradients that may hinder convergence.
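A plain-Python sketch of Equations (5) and (6) for a single pair of boxes is given below. It is illustrative only: the box format (x1, y1, x2, y2), the function names, and the assumption that both boxes have positive area are not taken from the paper, and because the code works on plain floats the detachment of the denominator has no counterpart here.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union


def wiou_v1(pred, gt):
    """L_WIoUv1 = R_WIoU * (1 - IoU), following Equations (5) and (6)."""
    # Box centers
    px, py = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    gx, gy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    # Width and height of the smallest box enclosing both boxes
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    # In an autograd framework the denominator would be detached here.
    r_wiou = math.exp(((px - gx) ** 2 + (py - gy) ** 2) / (wg ** 2 + hg ** 2))
    return r_wiou * (1.0 - iou(pred, gt))


pred = (0.0, 0.0, 2.0, 2.0)
gt = (1.0, 1.0, 3.0, 3.0)
print(round(iou(pred, gt), 4))      # 0.1429, i.e. 1/7
print(round(wiou_v1(pred, gt), 4))  # 0.9579, i.e. exp(1/9) * 6/7
print(wiou_v1(gt, gt))              # 0.0
```

Since the exponent is never negative, the distance factor is at least 1, so the v1 loss never falls below the plain IoU loss and grows with the offset between the two centers.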

On this basis, WIoUv3 introduces the Outlier Degree ($β$) to evaluate the quality of the current sample. The outlier degree is defined as the ratio between the current sample loss and its moving average value:

$$β = \frac{L_{WIoUv1}}{L_{WIoUv1}^{ema}} \tag{7}$$

where $L_{WIoUv1}^{ema}$ represents the exponential moving average of the loss during training, which dynamically reflects the overall sample quality level at the current training stage. Furthermore, a non-monotonic gradient gain coefficient $γ$ is constructed to adaptively regulate the gradient contribution of samples with different quality levels, as expressed in the following equation:

$$γ = \frac{β}{δ\, α^{β - δ}} \tag{8}$$

where $γ$ denotes the gradient gain coefficient, $β$ represents the outlier degree, $α > 1$ is the attenuation control coefficient that regulates the suppression intensity of outlier samples, and $δ$ is the balancing factor that sets the outlier degree at which the gain equals one ($γ = 1$ when $β = δ$) and, together with $α$, scales the height of the gain peak.

As illustrated in Figure 7, the gradient gain $γ$ varies non-monotonically with the outlier degree $β$. When the sample is of high quality ($β$ is small), the model considers it to have been sufficiently learned and assigns a relatively low gradient weight, reducing its influence on parameter updates. When the sample is of moderate difficulty, the gradient gain is at its highest: setting the derivative of Equation (8) with respect to $β$ to zero places the peak at $β = 1/\ln α$, and $γ = 1$ at $β = δ$. Such samples receive the highest attention weight and play a critical role in improving tomato detection accuracy. In contrast, when the sample is of low quality or considered an outlier ($β$ is extremely large), such as in cases of severe occlusion or annotation errors, the gradient gain decreases significantly, thereby suppressing its negative impact on model training.

Finally, the overall loss function of WIoUv3 is defined as follows:

$$L_{WIoUv3} = γ\, L_{WIoUv1} \tag{9}$$
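The shape of the gain in Equation (8) and its use in Equation (9) can be explored numerically. In this sketch the values alpha = 1.9 and delta = 3.0 are illustrative assumptions, not settings reported in this section. The moving average is passed in as a plain number, so its update rule is not shown, and the function names are hypothetical.

```python
import math


def gradient_gain(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain of Equation (8): beta / (delta * alpha**(beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(l_wiou_v1, l_ema, alpha=1.9, delta=3.0):
    """Scale the v1 loss by the gain of its outlier degree (Equations 7 and 9).

    In an autograd framework beta would be treated as a constant (detached).
    """
    beta = l_wiou_v1 / l_ema
    return gradient_gain(beta, alpha, delta) * l_wiou_v1


alpha, delta = 1.9, 3.0
beta_peak = 1.0 / math.log(alpha)  # zero of the derivative of Equation (8)
for beta in (0.2, beta_peak, delta, 10.0):
    print(f"beta={beta:.2f}  gain={gradient_gain(beta, alpha, delta):.3f}")
```

Under these assumed values the gain is small for very low outlier degrees, rises to its maximum near 1/ln(alpha), equals exactly 1 at beta = delta, and decays toward zero for large outliers, which is the non-monotonic behavior the loss relies on.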

In summary, WIoUv3 achieves adaptive regulation of samples with different quality levels by introducing a distance attention mechanism and a dynamic non-monotonic focusing strategy. Different from the original WIoUv3, which adopts a fixed exponential-form non-monotonic function, the proposed method introduces adjustable parameters $α$ and $δ$, providing the gradient gain function with greater flexibility and controllability. This design enables the model to better adapt to variations in sample quality in densely distributed and heavily occluded tomato scenes, thereby improving model robustness and convergence stability.

2.3. Experimental Settings

2.3.1. Experimental Environment and Training Settings

To ensure the fairness, reproducibility, and practical relevance of the experimental results to real-world tomato harvesting robot applications, all experiments in this study were conducted under the same software and hardware environment. The detailed software and hardware configurations of the experimental platform are presented in Table 1.

Model training and testing were performed under the aforementioned environment settings. To further ensure the fairness and repeatability of the experimental results, all models were trained using identical parameter configurations, as listed in Table 2.

2.3.2. Evaluation Metrics

To comprehensively evaluate the performance of the proposed model in tomato detection tasks, Precision (P), Recall (R), F1-score (F1), and mean Average Precision (mAP) were selected as the primary evaluation metrics for detection accuracy. In addition, the number of parameters (Parameters) and computational complexity (GFLOPs) were employed to assess the overall efficiency and lightweight characteristics of the model.

The calculation formulas for Precision, Recall, and F1-score are as follows:

$$P = \frac{TP}{TP + FP} \tag{10}$$

$$R = \frac{TP}{TP + FN} \tag{11}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{12}$$

Here, $TP$ denotes the number of correctly detected tomato targets, $FP$ represents the number of falsely detected targets, and $FN$ indicates the number of missed targets.
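As a quick consistency check on Equations (10) to (12), the sketch below computes the three metrics from counts and then recomputes the F1-score from the precision and recall reported in the abstract. The detection counts are hypothetical, as are the function names.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from detection counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)


def f1_from_pr(p, r):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 80 correct detections, 20 false alarms, 20 misses.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))

# Reported precision 91.33% and recall 86.79% give an F1-score of about 89.00%.
reported_f1 = f1_from_pr(0.9133, 0.8679)
print(round(reported_f1 * 100, 2))
```

Because F1 is a harmonic mean, it always lies between precision and recall and sits closer to the smaller of the two.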

Average Precision (AP) reflects the overall detection performance of a single category under different recall levels and is defined as the area under the Precision–Recall (P–R) curve. The mean Average Precision (mAP) is obtained by averaging the AP values across all categories, and its calculation formula is expressed as follows:

$$AP = \int_{0}^{1} P(R)\, dR \tag{13}$$

$$mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i} \tag{14}$$

where $N$ denotes the number of target categories, and $AP_{i}$ represents the $AP$ value of the $i$-th category. In this study, mAP₅₀ was adopted as the primary evaluation metric, which refers to the mean Average Precision when the Intersection over Union (IoU) threshold between the predicted bounding box and the ground-truth bounding box is set to 0.5.
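In practice the integral in Equation (13) is approximated from a finite set of precision and recall points. The sketch below uses all-point interpolation with a monotone precision envelope, which is one common convention and an assumption here; the evaluation code used in the study may interpolate differently. The function names and the toy values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the P-R curve using all-point interpolation.

    recalls must be sorted in increasing order and paired with precisions.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values, as in Equation (14)."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
print(ap)  # 0.75

# Toy AP values for three classes (for example mature, semi-mature, immature).
print(mean_average_precision([0.75, 0.25, 0.5]))  # 0.5
```

For mAP₅₀, the precision and recall points themselves would be produced by counting a prediction as correct only when its IoU with a ground-truth box is at least 0.5.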

3. Experimental Results and Analysis

3.1. Ablation Study

To verify the contribution of each improvement module introduced in Section 2 to the performance enhancement of the YOLO-VCW model, ablation experiments were conducted by progressively incorporating different improvement strategies, as presented in Table 3. The original YOLOv8n model was adopted as the baseline, and the C2f-VSS module, Coordinate Attention (CA) mechanism, and WIoUv3 loss function were introduced sequentially.

As shown in Table 3, every improvement module raised performance over the baseline YOLOv8n model, although the modules contributed differently. Among the single-module experiments, the C2f-VSS module achieved the largest gain, increasing mAP₅₀ and F1-score by 1.85 and 1.51 percentage points, respectively. The CA mechanism also produced stable gains, improving mAP₅₀ and F1-score by 1.23 and 0.65 percentage points, respectively. In contrast, the WIoUv3 loss function mainly improved recall, raising it to 84.05% and improving the F1-score by 1.46 percentage points; its contribution to mAP₅₀ was limited (0.37 percentage points).

In the progressive combination experiments, adding the CA mechanism to the C2f-VSS-enhanced model further increased mAP₅₀ and F1-score by 2.36 and 0.81 percentage points, respectively, compared with the previous stage. This suggests that the two modules are complementary in feature representation and spatial attention modeling. After the WIoUv3 loss function was incorporated, mAP₅₀ and F1-score increased by a further 0.23 and 0.93 percentage points, respectively, and recall rose to 86.79%. This indicates that the adopted loss function improved bounding box regression quality and the utilization of positive samples.

With all improvement strategies integrated, the proposed model outperformed the original YOLOv8n model on every accuracy metric: mAP₅₀ increased by 4.44 percentage points, precision, recall, and F1-score improved by 1.90, 4.43, and 3.25 percentage points, respectively, and mAP_50–95 increased by 3.87 percentage points. Model complexity increased only moderately, from 3.2 M to 3.9 M Parameters and from 8.7 to 9.8 GFLOPs. The proposed method therefore achieves an effective balance between computational complexity and detection accuracy. Moreover, the results indicate that the improved modules work well together in complex scenarios, enhancing the model’s capability for detecting densely distributed and occluded targets.
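The gains quoted above are absolute differences between rows of Table 3. The short sketch below recomputes them from the tabulated values; the dictionary keys and the helper name `gain` are hypothetical labels introduced here for illustration.

```python
# Rows copied from Table 3: (P, R, F1, mAP50, mAP50-95), all in percent.
TABLE_3 = {
    "YOLOv8n": (89.43, 82.36, 85.75, 86.27, 53.18),
    "+C2f-VSS": (90.55, 84.20, 87.26, 88.12, 55.02),
    "+CA": (89.90, 83.17, 86.40, 87.50, 54.76),
    "+WIoUv3": (90.62, 84.05, 87.21, 86.64, 54.13),
    "+C2f-VSS+CA": (91.06, 85.27, 88.07, 90.48, 56.68),
    "YOLO-VCW": (91.33, 86.79, 89.00, 90.71, 57.05),
}
METRICS = ("P", "R", "F1", "mAP50", "mAP50-95")


def gain(model, reference):
    """Absolute difference (model minus reference) for each metric."""
    pairs = zip(METRICS, TABLE_3[model], TABLE_3[reference])
    return {name: round(a - b, 2) for name, a, b in pairs}


single_vss = gain("+C2f-VSS", "YOLOv8n")
add_ca = gain("+C2f-VSS+CA", "+C2f-VSS")
add_wiou = gain("YOLO-VCW", "+C2f-VSS+CA")
overall = gain("YOLO-VCW", "YOLOv8n")

for label, result in [("C2f-VSS alone", single_vss), ("adding CA", add_ca),
                      ("adding WIoUv3", add_wiou), ("overall", overall)]:
    print(label, result)
```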

3.2. Comparison of Different Detection Models

To further evaluate the comprehensive performance of the YOLO-VCW model, comparative experiments were conducted using several mainstream object detection models, including Faster R-CNN, SSD, YOLOv7-Tiny, YOLOv8n, and the proposed YOLO-VCW model, as shown in Table 4. All models were trained and tested under the same dataset and experimental settings.

As presented in Table 4, the proposed YOLO-VCW model outperformed all comparative models on every accuracy metric. Compared with the two-stage detection model Faster R-CNN, YOLO-VCW improved precision, recall, and F1-score by 4.13, 6.64, and 5.47 percentage points, respectively, while mAP₅₀ and mAP_50–95 increased by 6.41 and 6.99 percentage points, respectively. Compared with the single-stage detector SSD, the improvements were even more pronounced, with mAP₅₀ and mAP_50–95 increasing by 8.00 and 8.77 percentage points, respectively. Compared with the lightweight YOLOv7-Tiny model, YOLO-VCW improved precision, recall, and F1-score by 3.14, 5.76, and 4.54 percentage points, respectively, while mAP₅₀ and mAP_50–95 increased by 5.11 and 5.71 percentage points, respectively. Furthermore, compared with the baseline YOLOv8n model, YOLO-VCW achieved improvements of 1.90, 4.43, 3.25, 4.44, and 3.87 percentage points in precision, recall, F1-score, mAP₅₀, and mAP_50–95, respectively. These results demonstrate that the proposed improvement strategies enhance detection accuracy while keeping the model lightweight, particularly under complex background and occlusion conditions.

In terms of model complexity, Faster R-CNN and SSD contained 41.2 M and 24.4 M Parameters and required 79.4 and 45.6 GFLOPs, respectively, both substantially higher than the lightweight models. Although these models exhibited acceptable detection capability, their computational cost makes real-time detection difficult. In contrast, YOLOv7-Tiny, YOLOv8n, and the proposed model maintained relatively low model complexity. Among them, YOLOv8n required only 8.7 GFLOPs, but its detection accuracy still had room for improvement. The proposed YOLO-VCW model achieved its performance gains with only a modest increase over YOLOv8n, to 3.9 M Parameters and 9.8 GFLOPs.
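To put the complexity figures from Table 4 in relative terms, the overhead of YOLO-VCW with respect to YOLOv8n can be computed as below. This is only an arithmetic illustration on the tabulated values; the helper name `relative_increase` is hypothetical.

```python
def relative_increase(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Values copied from Table 4.
param_increase = relative_increase(3.9, 3.2)    # Parameters, in M
gflops_increase = relative_increase(9.8, 8.7)   # GFLOPs
share_of_faster_rcnn = 3.9 / 41.2 * 100.0       # YOLO-VCW size relative to Faster R-CNN

print(f"Parameters: +{param_increase:.1f}%")
print(f"GFLOPs: +{gflops_increase:.1f}%")
print(f"Share of Faster R-CNN parameters: {share_of_faster_rcnn:.1f}%")
```

Under these figures, YOLO-VCW uses roughly a fifth more parameters and an eighth more GFLOPs than YOLOv8n, while remaining about one tenth the size of Faster R-CNN.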

Overall, although existing lightweight models are computationally efficient, their detection accuracy in complex cultivation environments remains limited, with mAP₅₀ values below 87% on this dataset. By integrating multiple improvement modules, the proposed YOLO-VCW model achieved a clear improvement in detection performance with only a modest increase in computational complexity. These results indicate that YOLO-VCW achieves an effective balance between accuracy and efficiency and can provide a more reliable solution for tomato detection tasks in complex environments.

3.2.1. Confusion Matrix Analysis

To further evaluate the classification discrimination capability of the proposed model across different target categories, row-normalized confusion matrices of Faster R-CNN, SSD, YOLOv8n, and YOLO-VCW were analyzed. All elements in the matrices are presented in percentage form. Based on the statistical results of the validation set, 3 × 3 confusion matrices were constructed, where the diagonal elements represent the proportion of correctly predicted samples for each category, while the non-diagonal elements indicate the percentage distribution of confusion among different categories. The horizontal axis corresponds to the predicted categories generated by the model, whereas the vertical axis represents the ground-truth categories. In addition, the color bar on the right side indicates the magnitude of the values in each cell, with darker colors representing higher percentages.
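Row normalization, as described above, divides each row of raw counts by the number of ground-truth samples in that category. The sketch below shows the operation on made-up counts; the counts and the function name are hypothetical and are not the values behind Figure 8.

```python
def row_normalize(counts):
    """Convert a count matrix (rows: ground truth, columns: prediction)
    into row-wise percentages, so that every non-empty row sums to 100."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([100.0 * c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical raw counts for (mature, semi-mature, immature).
counts = [
    [190, 8, 2],
    [6, 180, 14],
    [1, 9, 190],
]
percentages = row_normalize(counts)
for label, row in zip(("mature", "semi-mature", "immature"), percentages):
    print(label, [f"{v:.1f}" for v in row])
```

With this normalization, each diagonal entry is the per-category recall, which is why the diagonal can be read directly as a recognition accuracy for that maturity stage.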

As shown in Figure 8, YOLO-VCW achieved recognition accuracies of 95.2% and 93.7% for the mature and semi-mature categories, respectively, demonstrating strong discriminative capability for these visually similar maturity stages. In contrast, the other models exhibited more obvious category confusion. Specifically, Faster R-CNN misclassified 8.1% of mature samples as semi-mature, while 10.1% of semi-mature samples were incorrectly recognized as immature. SSD achieved a recognition accuracy of 77.4% for mature tomatoes, and 17.2% of semi-mature samples were incorrectly classified as mature. Although the confusion problem was alleviated in YOLOv8n, cross-category misclassification rates of 5.0% and 4.6% still existed between the semi-mature and immature categories. Compared with the other models, the classification results of YOLO-VCW were more concentrated along the diagonal of the confusion matrix. The correct recognition rates for mature, semi-mature, and immature tomatoes reached 95.2%, 93.7%, and 94.6%, respectively, reducing the misclassification rate among categories. In particular, for the semi-mature category, which is highly prone to confusion, YOLO-VCW effectively suppressed confusion with adjacent maturity categories, highlighting the advantage of the proposed model in capturing critical discriminative features.

3.2.2. Detection Result Comparison of Different Algorithms

To qualitatively evaluate the robustness of different algorithms in real agricultural environments, three representative challenging scenarios, including occlusion scenes, dense-distribution scenes, and long-distance scenes, were selected to compare the practical detection performance of Faster R-CNN, SSD, YOLOv8n, and the proposed YOLO-VCW model, as illustrated in Figure 9. In the figure, red bounding boxes indicate correctly detected targets, blue bounding boxes represent false detections, and purple bounding boxes denote missed detections.

The detection results demonstrate that Faster R-CNN and SSD suffered from varying degrees of false detections and missed detections under occluded, dense, and long-distance conditions, particularly showing limited capability in recognizing occluded targets and small-scale fruits. YOLOv8n exhibited improved overall detection stability; however, missed detections still occurred in densely distributed and long-distance scenarios. In contrast, the proposed YOLO-VCW model achieved superior detection performance across all complex scenarios. The proportion of correctly detected targets was significantly increased, while the numbers of false detections and missed detections were substantially reduced. Moreover, the model was able to perform more accurate target localization and maturity-stage classification, demonstrating its strong robustness and practical applicability in complex agricultural environments.

3.3. Public Dataset Validation

To further verify the generalization ability of the proposed YOLO-VCW model and avoid relying solely on the self-built dataset for performance evaluation, supplementary experiments were conducted on the publicly available Laboro Tomato Dataset [27]. The Laboro Tomato Dataset contains 804 high-resolution images, including 643 training images and 161 test images, with a total of 9777 annotated tomato instances distributed across six categories. The categories represent three maturity stages (fully ripened, half-ripened, and green) for both big tomatoes and little tomatoes. Since this study focuses on maturity detection of standard-sized tomatoes, only the three big-tomato categories, namely b_fully_ripened, b_half_ripened, and b_green, were retained. After category filtering, the resulting subset contained 2335 annotated instances in the training set and 575 annotated instances in the test set, yielding a total of 2910 annotated tomato targets for evaluation. This subset still includes challenging conditions such as illumination variation, target occlusion, scale variation, and dense fruit distribution, making it suitable for evaluating the generalization capability of the proposed model.

During the experiments, the original train/test split of the dataset was kept unchanged, and the COCO annotation format was converted to the YOLO object detection format for model training. All experiments adopted the same hardware environment, training parameters, and evaluation metrics as those in Section 2.3 to ensure a fair comparison. To validate the effectiveness of the proposed method, comparative experiments were carried out among YOLOv8n, RT-DETR-S [28], and the proposed YOLO-VCW model, and the experimental results are shown in Table 5.
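The category filtering and COCO-to-YOLO conversion described above can be sketched as follows. A COCO bounding box is stored as [x_min, y_min, width, height] in pixels, whereas a YOLO label line stores the class index followed by the normalized box center and size. The class-index order, the function names, and the sample annotation are assumptions made for illustration; the article does not specify them.

```python
# Category names come from the dataset; the index order is an assumption.
KEPT = {"b_fully_ripened": 0, "b_half_ripened": 1, "b_green": 2}


def coco_to_yolo(bbox, img_w, img_h):
    """[x_min, y_min, w, h] in pixels -> normalized (cx, cy, w, h)."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def convert_annotations(coco):
    """Map each image file name to its YOLO label lines, keeping only KEPT categories."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    images = {im["id"]: im for im in coco["images"]}
    labels = {im["file_name"]: [] for im in coco["images"]}
    for ann in coco["annotations"]:
        name = id_to_name[ann["category_id"]]
        if name not in KEPT:
            continue  # drop the little-tomato categories
        im = images[ann["image_id"]]
        cx, cy, w, h = coco_to_yolo(ann["bbox"], im["width"], im["height"])
        labels[im["file_name"]].append(
            f"{KEPT[name]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
        )
    return labels


# Hypothetical COCO-style record, for illustration only.
example = {
    "categories": [{"id": 1, "name": "b_green"}, {"id": 2, "name": "l_green"}],
    "images": [{"id": 7, "file_name": "img_001.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"image_id": 7, "category_id": 1, "bbox": [100, 120, 200, 240]},
        {"image_id": 7, "category_id": 2, "bbox": [10, 10, 30, 30]},
    ],
}
labels = convert_annotations(example)
print(labels)
```

In practice each list of lines would be written to a text file with the same stem as the image, which is the layout YOLO-style trainers commonly expect.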

As shown in Table 5, the proposed YOLO-VCW model achieved the best performance across all evaluation metrics. Specifically, YOLO-VCW obtained Precision, Recall, F1-score, mAP₅₀, and mAP_50–95 values of 87.76%, 83.58%, 85.62%, 85.65%, and 52.40%, respectively. Compared with YOLOv8n, YOLO-VCW improved mAP₅₀ by 2.34 percentage points and mAP_50–95 by 3.36 percentage points. Compared with RT-DETR-S, the improvements reached 2.48 and 3.11 percentage points, respectively.

To intuitively verify the generalization ability and robustness of different models on the public dataset, three representative challenging scenarios were selected for qualitative comparative analysis: long-distance, occlusion, and dense target scenarios. The results are shown in Figure 10, where red, orange, and green bounding boxes represent mature, semi-mature, and immature tomatoes, respectively; purple bounding boxes indicate missed detections, and blue bounding boxes indicate false positives.

As shown in Figure 10, the RT-DETR-S model exhibited detection errors in all three scenarios. In the long-distance scenario, it missed two immature tomatoes; in the occlusion scenario, it failed to extract the features of the occluded semi-mature tomato, resulting in a missed detection; and in the dense scenario, it missed one semi-mature tomato and produced one false positive by identifying a background region as a tomato. These findings indicate its insufficient capability in detecting small, occluded, and dense targets. The YOLOv8n model outperformed RT-DETR-S in all scenarios, but still exhibited classification errors between mature and semi-mature tomatoes. The proposed YOLO-VCW model achieved the best performance, detecting all targets in the occlusion and dense scenarios, which reflects the improved algorithm’s feature extraction capability and target localization accuracy under complex backgrounds. Its only error was one semi-mature tomato misclassified as mature in the occlusion scenario, which is mainly attributable to the very small difference in appearance between semi-mature and mature tomatoes.

The qualitative results are consistent with the quantitative analysis presented earlier, indicating that the YOLO-VCW model not only maintains high detection accuracy but also possesses stronger scene adaptability and generalization ability, and can therefore better meet the requirements of agricultural harvesting robots for tomato maturity detection. The main limitation of the model lies in the classification of samples with blurred boundaries between adjacent maturity stages, which will be the key optimization direction in future research.

These results indicate that the proposed YOLO-VCW framework remains effective on an independent public dataset. The consistent performance improvement demonstrates that the proposed model possesses good generalization capability and is not overly dependent on the characteristics of the self-constructed dataset. While the public dataset experiments verify the generalization capability of the proposed model across different data distributions, the randomness involved in deep learning training may still affect the reproducibility of the experimental results. Therefore, stability analysis under different random seeds was further conducted.

3.4. Random Seed Experiment

The training process of deep learning models exhibits inherent randomness, which primarily stems from weight initialization, the sequence of data augmentation operations, and the stochastic gradient descent process of the optimizer. To verify the stability of the proposed YOLO-VCW model and the reproducibility of the experimental results on the target task of greenhouse tomato detection, random seed experiments were conducted on the self-built tomato dataset. Three different random seeds (2, 46, and 97) were used to train both the YOLOv8n baseline model and the YOLO-VCW model. All experiments used exactly the same training parameters, data augmentation strategies, and testing environment as the previous comparison experiments, with only the random seed changed. The experimental results are shown in Table 6, where the mean and sample standard deviation of each metric over the repeated runs were calculated as follows:

\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

(15)

s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}

(16)

where \bar{x} denotes the mean value of a given evaluation metric, n represents the number of repeated experiments, x_{i} denotes the metric value obtained in the i-th run, and s represents the sample standard deviation.
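Equations (15) and (16) correspond to the sample mean and the sample standard deviation with an n − 1 denominator, which is what Python's standard `statistics.mean` and `statistics.stdev` compute. As a brief check, the sketch below applies them to the three mAP₅₀ values reported per seed in Table 6; the variable names are illustrative.

```python
from statistics import mean, stdev

# mAP50 values (in percent) for seeds 2, 46, and 97, copied from Table 6.
map50_runs = {
    "YOLOv8n": [86.11, 86.34, 86.06],
    "YOLO-VCW": [90.29, 90.64, 90.82],
}

summary = {
    model: (round(mean(values), 2), round(stdev(values), 2))
    for model, values in map50_runs.items()
}
for model, (avg, sd) in summary.items():
    print(f"{model}: mean={avg:.2f}, s={sd:.2f}")

# Difference between the two means, in percentage points.
mean_gap = round(summary["YOLO-VCW"][0] - summary["YOLOv8n"][0], 2)
print(mean_gap)
```

With only three runs per model, the standard deviation is a coarse estimate of run-to-run variability, so it is best read as an indication of stability rather than as a formal significance test.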

From Table 6, it can be observed that both YOLOv8n and YOLO-VCW exhibit only minor performance fluctuations under different random seeds. For the YOLOv8n model, the standard deviations of Precision, Recall, F1-score, mAP₅₀, and mAP_50–95 are all below 0.20. This low variance indicates that the performance is stable rather than arising from random fluctuations during training. Similarly, the YOLO-VCW model maintains low standard deviations across all evaluation metrics, with values of 0.30, 0.13, 0.10, 0.27, and 0.20 for Precision, Recall, F1-score, mAP₅₀, and mAP_50–95, respectively. Compared with YOLOv8n, YOLO-VCW consistently achieves higher mean values across all evaluation metrics under different random seeds. In particular, the average mAP₅₀ reaches 90.58%, an improvement of 4.41 percentage points over YOLOv8n, and the average mAP_50–95 increases from 53.17% to 57.11%. These results indicate that the proposed improvements not only enhance detection accuracy but also maintain stable performance under different initialization conditions.

Overall, the low standard deviations and consistent performance gains demonstrate that the proposed YOLO-VCW model possesses good training stability and reproducibility, and that the observed performance improvements are not caused by a particular random initialization.

4. Discussion

4.1. Model Advantages and Practical Significance

Previous studies have alleviated the problems of fruit occlusion, clustered growth, and complex backgrounds in tomato detection to some extent; however, certain limitations remain. For example, many existing methods can achieve relatively high detection accuracy under single-maturity or simple-background conditions, but their performance decreases significantly in heavily occluded or densely distributed scenarios, which are precisely the most common conditions in commercial greenhouses. In contrast, the proposed YOLO-VCW model demonstrated stronger environmental adaptability and robustness, achieving an mAP₅₀ exceeding 90% while maintaining stable detection performance under complex conditions including partial occlusion, dense distribution, and green-on-green targets.

From the perspective of model architecture, the comprehensive advantages of YOLO-VCW are mainly reflected in the following three synergistic aspects. First, the newly introduced C2f-VSS module effectively enhanced the model’s capability to capture long-range dependency information, enabling more accurate extraction of subtle target features in complex backgrounds and thereby significantly improving the recognition ability for “green-on-green” immature tomatoes. Second, the embedded Coordinate Attention (CA) mechanism strengthened the model’s perception of spatial positional information and inter-target boundaries, which helped distinguish subtle differences among densely distributed and partially occluded targets. Finally, the adoption of the WIoUv3 loss function optimized the bounding box regression process and improved the utilization efficiency of high-quality samples, thereby enhancing target localization accuracy under occlusion conditions. The synergistic effect of these three modules comprehensively strengthened both the feature representation and localization capabilities of the proposed model. These observations are also consistent with the ablation results presented in Section 3.1, where each individual module contributed positively to the overall detection performance, while their combined integration achieved the highest accuracy.

In practical applications, the YOLO-VCW model can serve as the visual perception core of agricultural harvesting robots by providing precise 2D target localization and maturity classification information for robotic manipulators. Compared with traditional manual harvesting, which is characterized by high labor costs, limited operational efficiency, and strong dependence on subjective factors, automatic harvesting systems based on visual detection can achieve long-term continuous operation and improve operational consistency. The high detection accuracy and low missed detection rate of the proposed model help reduce fruit loss during harvesting, thereby improving harvesting completeness and overall economic benefits. Furthermore, the proposed method can also be extended to other agricultural applications such as pre-harvest yield estimation and post-harvest fruit grading, demonstrating strong cross-scenario application potential.

4.2. Model Limitations and Future Work

Despite these advantages, the present study has two main limitations. First, the model’s performance degrades significantly under extreme environmental conditions that were not adequately represented in the training dataset. Specifically, the model shows reduced detection accuracy under extreme lighting conditions, including direct strong sunlight (causing overexposure and glare), low-light nighttime conditions, and severe backlighting. Additionally, the model was trained and tested only on common tomato varieties grown in local greenhouses; its generalization ability to tomato varieties with different shapes, colors, and growth habits remains unvalidated. Furthermore, all tests were conducted at a detection distance of 0.5–1.5 m, which matches the typical working distance of our harvesting robot. The model’s performance at distances greater than 1.5 m or less than 0.5 m has not been systematically evaluated, although preliminary observations indicate a sharp decline in detection accuracy for small-scale targets beyond 2 m.

Second, the current study only evaluated the model’s detection accuracy on a desktop GPU platform, and no comprehensive deployment-related performance tests have been conducted. Specifically, we have not measured the model’s inference speed, memory footprint, or operational latency on edge computing devices commonly used in agricultural robots (such as NVIDIA Jetson series, Raspberry Pi and embedded ARM platforms). Moreover, the model has not been integrated with the robotic arm control system for end-to-end field testing, so its actual performance in dynamic harvesting scenarios (including real-time response to moving targets and coordination with manipulator motion) remains unknown.

Corresponding to the above limitations, future research will be carried out in the following aspects. First, we will expand the dataset to include more samples collected under extreme lighting conditions, different tomato varieties, and various detection distances. We will also introduce advanced data augmentation techniques such as style transfer and generative adversarial networks (GANs) to synthesize samples of rare scenarios, thereby improving the model’s robustness and generalization ability. Second, we will focus on optimizing the model for edge deployment and conducting comprehensive performance evaluations. Specifically, we will perform model quantization, pruning and knowledge distillation on the YOLO-VCW model to reduce its computational complexity and memory footprint without significant loss of accuracy. We will then test the optimized model on various edge computing platforms to measure its inference speed, latency and power consumption. Finally, we will integrate the optimized visual perception system with our dual-arm tomato harvesting robot platform to conduct end-to-end field tests, and further optimize the model based on actual harvesting performance.

In conclusion, this study proposes an improved YOLOv8n-based tomato detection model, YOLO-VCW, that achieves high detection accuracy in complex greenhouse environments. Although limitations remain regarding real-world deployment readiness, the results demonstrate the feasibility and potential of the proposed method. The future work outlined above will systematically address these limitations and promote the practical application of intelligent tomato harvesting technology.

Author Contributions

Conceptualization, W.L. (Wenhao Li) and M.Q.; methodology, W.L. (Wenhao Li), H.Z. and C.Z.; software, W.L. (Wenhao Li); validation, W.L. (Wenhao Li); formal analysis, W.L. (Wenhao Li); investigation, W.L. (Wenhao Li), W.L. (Wei Liu) and S.L.; resources, W.L. (Wenhao Li) and M.Q.; data curation, W.L. (Wei Liu) and S.L.; writing—original draft preparation, W.L. (Wenhao Li); writing—review and editing, W.L. (Wenhao Li), H.Z., C.Z., W.L. (Wei Liu), S.L. and M.Q.; visualization, W.L. (Wenhao Li); supervision, M.Q.; project administration, W.L. (Wenhao Li) and M.Q.; funding acquisition, M.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Department of Zhejiang Province (LD24E050001).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhou, H.; Tang, Y.C.; Zou, X.J.; Wang, H.J.; Chen, Z.X.; Long, Y.N.; Ai, P.Y. Research on key technologies of visual perception for agricultural picking robots. J. Agric. Mech. Res. 2023, 45, 68–75. (In Chinese)
2. Haggag, S.; Veres, M.; Tarry, C.; Moussa, M. Object Detection in Tomato Greenhouses: A Study on Model Generalization. Agriculture 2024, 14, 173.
3. El-Bendary, N.; Hariri, E.E.; Hassanien, A.E.; Badr, A. Using machine learning techniques for evaluating tomato ripeness. Expert Syst. Appl. 2015, 42, 1892–1905.
4. Chen, C.Q.; Meng, Q. Recognition of greenhouse tomato fruits based on image processing technology. J. Agric. Mech. Res. 2025, 47, 189–193. (In Chinese)
5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
6. Wang, Z.; Ling, Y.; Wang, X.; Meng, D.; Nie, L.; An, G.; Wang, X. An improved Faster R-CNN model for multi-object tomato maturity detection in complex scenarios. Ecol. Inform. 2022, 72, 101886.
7. Juhartini; Dwinita, A.; Desmiwati. Single Shot Multibox Detector (SSD) in Object Detection: A Review. IJACI Int. J. Adv. Comput. Inform. 2025, 1, 118–127.
8. Liu, G.; Nouaze, J.C.; Touko Mbouembe, P.L.; Kim, J.H. YOLO-Tomato: A Robust Algorithm for Tomato Detection Based on YOLOv3. Sensors 2020, 20, 2145.
9. Dong, W.; Zhao, Y.; Pei, J.; Feng, Z.; Ma, Z.; Wang, L.; Wang, S.S. Tomato detection in natural environment based on improved YOLOv8 network. J. Agric. Eng. 2025, 56, 1732.
10. Hao, F.; Zhang, Z.; Ma, D.; Kong, H. GSBF-YOLO: A lightweight model for tomato ripeness detection in natural environments. J. Real-Time Image Process. 2025, 22, 47.
11. Yue, X.; Qi, K.; Yang, F.; Na, X.; Liu, Y.; Liu, C. RSR-YOLO: A real-time method for small target tomato detection based on improved YOLOv8 network. Discov. Appl. Sci. 2024, 6, 268.
12. Liu, G.; Zhang, Y.; Liu, J.; Liu, D.; Chen, C.; Li, Y.; Zhang, X.; Touko Mbouembe, P.L. An improved YOLOv7 model based on Swin Transformer and Trident Pyramid Networks for accurate tomato detection. Front. Plant Sci. 2024, 15, 1452821.
13. Wang, S.; Xiang, J.; Chen, D.; Zhang, C. A Method for Detecting Tomato Maturity Based on Deep Learning. Appl. Sci. 2024, 14, 11111.
14. Wu, D.; Ma, X.J.; Liu, D.S.; Song, W.; Su, W.X. Research on tomato target detection algorithm based on improved YOLOv8. J. Agric. Big Data 2025, 7, 281–293. (In Chinese)
15. Fan, X.P.; Zhang, Y.Q.; Zhou, S.; Ren, M.F.; Wang, Y.W.; Chai, X.J. Recognition and localization method for tomato picking robots based on improved YOLOv8s and RGB-D information fusion. Trans. Chin. Soc. Agric. Eng. 2025, 41, 106–116. (In Chinese)
16. Ni, J.P.; Zhu, L.C.; Dong, L.Z.; Cui, X.Z.; Han, Z.H.; Zhao, B. Real-time instance segmentation algorithm for tomato picking robots based on SwinS-YOLACT. Trans. Chin. Soc. Agric. Mach. 2024, 55, 18–30. (In Chinese)
17. Liang, X.F.; Wei, Z.W. Nighttime tomato stem and branch segmentation method based on improved CycleGAN and YOLOv8. Trans. Chin. Soc. Agric. Eng. 2025, 41, 147–155. (In Chinese)
18. Li, M.B.; Liu, Y.L.; Mu, Z.M.; Guo, J.W.; Wei, Y.; Ren, D.Y.; Jia, J.S.; Wei, Z.Z.; Li, Y.H. Tomato fruit recognition based on YOLOX-L-TN model. J. Agric. Sci. Technol. 2024, 26, 97–105. (In Chinese)
19. Wu, X.J.; Ding, Q. Tomato quality recognition technology based on improved YOLOv5. Agric. Eng. 2025, 15, 34–40. (In Chinese)
20. Vivi, A.; Erniwati, S. YOLOv8 for Object Detection: A Comprehensive Review of Advances, Techniques, and Applications. IJACI Int. J. Adv. Comput. Inform. 2025, 2, 53–61.
21. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
22. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417.
23. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
24. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Denver, CO, USA, 3–7 June 2021; pp. 13713–13722.
25. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051.
26. Ge, G.; Yang, J.; Liu, Y.; Hu, Y.X.; Liu, H.H. Detection of tomatoes in complex agricultural scenes based on improved YOLOv8n model. Trans. Chin. Soc. Agric. Eng. 2025, 41, 143–153. (In Chinese)
27. Laboro Tomato. Kaggle. Available online: https://www.kaggle.com/datasets/nexuswho/laboro-tomato (accessed on 12 June 2026).
28. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.

Figure 1. Data augmentation methods: (a) Original image; (b) Mirroring; (c) Cropping; (d) Brightness adjustment; (e) Rotation; (f) Blurring.

Figure 2. The framework of the YOLO-VCW network.

Figure 3. Schematic diagram of the S6 core operator structure within the Mamba State Space Model.

Figure 4. Schematic diagram of the C2f-VSS module architecture.

Figure 5. Schematic diagram of the SS2D cross-directional scanning module.

Figure 6. Schematic diagram of the Coordinate Attention mechanism architecture.

Figure 7. Comparison of Gradient Gain under Different Parameters.

Figure 8. Comparison of Confusion Matrices among Different Models.

Figure 9. Visualization Comparison of Detection Results among Different Algorithms.

Figure 10. Qualitative detection results on the Laboro Tomato Dataset.

Table 1. Software and hardware configurations of the experimental platform.

Configuration Category	Configuration Item	Detailed Information
Hardware Environment	CPU	Intel Core i5-14450
	RAM	8 GB
	GPU	RTX 5060
Software Environment	Operating System	Ubuntu 22.04 LTS (64-bit)
	Python Version	Python 3.10
	Deep Learning Framework	PyTorch 2.1.0
	CUDA Version	CUDA 12.1

Table 2. Training parameters.

Parameter	Value
Image Size	640 × 640
Number of Epochs	300
Batch Size	16
Learning Rate	0.01
Minimum Learning Rate	0.0001
Momentum	0.937
Weight Decay	0.0005
Optimizer	Stochastic Gradient Descent (SGD)
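
As a minimal sketch, the settings in Table 2 can be collected into a training configuration. The argument names below (imgsz, lr0, lrf, and so on) follow the Ultralytics YOLOv8 convention, which is an assumption here because the table does not name the training framework. Mapping the minimum learning rate onto lrf (the final rate expressed as a fraction of the initial rate) is likewise an assumed interpretation, and to_train_kwargs is a hypothetical helper.

```python
# Values copied from Table 2.
TABLE_2 = {
    "image_size": 640,
    "epochs": 300,
    "batch_size": 16,
    "learning_rate": 0.01,
    "min_learning_rate": 0.0001,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "optimizer": "SGD",
}


def to_train_kwargs(cfg):
    """Map Table 2 onto Ultralytics-style training arguments (assumed mapping)."""
    return {
        "imgsz": cfg["image_size"],
        "epochs": cfg["epochs"],
        "batch": cfg["batch_size"],
        "lr0": cfg["learning_rate"],
        # Assumption: the final learning rate is lr0 * lrf, so lrf is the
        # ratio of the minimum learning rate to the initial learning rate.
        "lrf": cfg["min_learning_rate"] / cfg["learning_rate"],
        "momentum": cfg["momentum"],
        "weight_decay": cfg["weight_decay"],
        "optimizer": cfg["optimizer"],
    }


train_kwargs = to_train_kwargs(TABLE_2)
for name, value in train_kwargs.items():
    print(f"{name}: {value}")
```

With the Ultralytics package installed, a dictionary of this form could be unpacked into a training call together with a dataset description file, which is not specified here.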

Table 3. Ablation Experiment Results. (√ indicates that the corresponding method is adopted).

YOLOv8n	C2f-VSS	CA	WIoUv3	P/%	R/%	F1/%	mAP₅₀/%	mAP₅₀₋₉₅/%	Parameters/M	FLOPs/G
√				89.43	82.36	85.75	86.27	53.18	3.2	8.7
√	√			90.55	84.20	87.26	88.12	55.02	3.8	9.6
√		√		89.90	83.17	86.40	87.50	54.76	3.3	8.9
√			√	90.62	84.05	87.21	86.64	54.13	3.2	8.7
√	√	√		91.06	85.27	88.07	90.48	56.68	3.9	9.8
√	√	√	√	91.33	86.79	89.00	90.71	57.05	3.9	9.8
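
The gain of each configuration over the YOLOv8n baseline can be read off Table 3 by subtraction. The sketch below does this in percentage points. The row labels (for example "+C2f-VSS") and the helper delta_vs_baseline are illustrative names chosen here, not identifiers from the paper.

```python
# Rows copied from Table 3.
METRICS = ("P", "R", "F1", "mAP50", "mAP50-95", "Params_M", "FLOPs_G")
ABLATION = {
    "YOLOv8n": (89.43, 82.36, 85.75, 86.27, 53.18, 3.2, 8.7),
    "+C2f-VSS": (90.55, 84.20, 87.26, 88.12, 55.02, 3.8, 9.6),
    "+CA": (89.90, 83.17, 86.40, 87.50, 54.76, 3.3, 8.9),
    "+WIoUv3": (90.62, 84.05, 87.21, 86.64, 54.13, 3.2, 8.7),
    "+C2f-VSS+CA": (91.06, 85.27, 88.07, 90.48, 56.68, 3.9, 9.8),
    "YOLO-VCW": (91.33, 86.79, 89.00, 90.71, 57.05, 3.9, 9.8),
}


def delta_vs_baseline(name, baseline="YOLOv8n"):
    """Return the absolute difference of each metric against the baseline row."""
    return {
        metric: round(value - base, 2)
        for metric, value, base in zip(METRICS, ABLATION[name], ABLATION[baseline])
    }


for name in ABLATION:
    print(name, delta_vs_baseline(name))
```

For the full YOLO-VCW row this gives differences of 1.90, 4.43, 3.25, and 4.44 percentage points for P, R, F1, and mAP₅₀, matching the improvements quoted in the abstract.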

Table 4. Comparison of Detection Performance among Different Models.

Models	P/%	R/%	F1/%	mAP₅₀/%	mAP₅₀₋₉₅/%	Parameters/M	FLOPs/G
Faster R-CNN	87.20	80.15	83.53	84.30	50.06	41.2	79.4
SSD	85.67	78.52	81.94	82.71	48.28	24.4	45.6
YOLOv7-Tiny	88.19	81.03	84.46	85.60	51.34	6.7	12.1
YOLOv8n	89.43	82.36	85.75	86.27	53.18	3.2	8.7
YOLO-VCW	91.33	86.79	89.00	90.71	57.05	3.9	9.8
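
The F1 column in Table 4 is consistent with the usual definition of the F1-score as the harmonic mean of precision and recall, F1 = 2PR/(P + R). The sketch below recomputes it from the P and R columns. It assumes that standard definition and that reported values are rounded to two decimals. The function name f1_score is illustrative.

```python
# (P/%, R/%, reported F1/%) copied from Table 4.
TABLE_4 = {
    "Faster R-CNN": (87.20, 80.15, 83.53),
    "SSD": (85.67, 78.52, 81.94),
    "YOLOv7-Tiny": (88.19, 81.03, 84.46),
    "YOLOv8n": (89.43, 82.36, 85.75),
    "YOLO-VCW": (91.33, 86.79, 89.00),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


for model, (p, r, reported) in TABLE_4.items():
    computed = round(f1_score(p, r), 2)
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Each recomputed value agrees with the reported one to two decimal places.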

Table 5. Performance comparison on the Laboro Tomato Dataset.

Models	P/%	R/%	F1/%	mAP₅₀/%	mAP₅₀₋₉₅/%
RT-DETR-S	85.42	82.01	83.68	83.17	49.29
YOLOv8n	86.61	81.53	83.99	83.31	49.04
YOLO-VCW	87.76	83.58	85.62	85.65	52.40

Table 6. Stability analysis under different random seeds.

Model	Evaluation Metric	Seed 2	Seed 46	Seed 97	Mean	Standard Deviation
YOLOv8n	P/%	89.27	89.52	89.17	89.32	0.18
	R/%	82.41	82.05	82.28	82.25	0.18
	F1/%	85.70	85.62	85.59	85.64	0.06
	mAP₅₀/%	86.11	86.34	86.06	86.17	0.15
	mAP₅₀₋₉₅/%	53.05	53.23	53.24	53.17	0.11
YOLO-VCW	P/%	91.05	90.71	91.30	91.02	0.30
	R/%	86.57	86.82	86.63	86.67	0.13
	F1/%	88.75	88.72	88.90	88.79	0.10
	mAP₅₀/%	90.29	90.64	90.82	90.58	0.27
	mAP₅₀₋₉₅/%	57.28	57.16	56.89	57.11	0.20
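
The Mean and Standard Deviation columns in Table 6 can be reproduced from the three per-seed values. The sketch below assumes the sample standard deviation (n − 1 in the denominator), which is the variant that matches the tabulated numbers. The table does not state which variant was used, and summarize is an illustrative helper name.

```python
import statistics

# Per-seed results (seeds 2, 46, 97) copied from Table 6.
SEED_RUNS = {
    ("YOLOv8n", "P"): [89.27, 89.52, 89.17],
    ("YOLOv8n", "R"): [82.41, 82.05, 82.28],
    ("YOLOv8n", "mAP50"): [86.11, 86.34, 86.06],
    ("YOLO-VCW", "P"): [91.05, 90.71, 91.30],
    ("YOLO-VCW", "R"): [86.57, 86.82, 86.63],
    ("YOLO-VCW", "mAP50"): [90.29, 90.64, 90.82],
}


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to two decimals."""
    return round(statistics.mean(values), 2), round(statistics.stdev(values), 2)


for (model, metric), values in SEED_RUNS.items():
    mean, std = summarize(values)
    print(f"{model} {metric}: mean = {mean:.2f}, std = {std:.2f}")
```

With only three seeds per model, these standard deviations are rough indicators of run-to-run variation rather than precise estimates.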


