Article

A Precise Detection Method for Tomato Fruit Ripeness and Picking Points in Complex Environments

1 College of Computer Science and Technology, Henan Institute of Science and Technology, Xinxiang 453003, China
2 Xinxiang City Key Laboratory of Intelligent Plant Factory, Xinxiang 453003, China
3 College of Information Engineering, Sichuan Agricultural University, Ya’an 625000, China
4 College of Horticulture and Landscape Architecture, Henan Institute of Science and Technology, Xinxiang 453003, China
* Author to whom correspondence should be addressed.
Horticulturae 2025, 11(6), 585; https://doi.org/10.3390/horticulturae11060585
Submission received: 1 April 2025 / Revised: 19 May 2025 / Accepted: 23 May 2025 / Published: 25 May 2025

Abstract

Accurate identification of tomato ripeness and precise detection of picking points are the keys to realizing automated picking. To address problems encountered in practical applications, such as the low accuracy of tomato ripeness and picking point detection in complex greenhouse environments, which leads to wrong picking, missed picking, and fruit damage by robots, this study proposes the YOLO-TMPPD (Tomato Maturity and Picking Point Detection) model. YOLO-TMPPD is structurally improved and algorithmically optimized based on the YOLOv8 baseline architecture. Firstly, the Depthwise Convolution (DWConv) module is used to substitute the C2f module within the backbone network. This substitution not only cuts down the model’s computational load but also enhances detection precision. Secondly, the Content-Aware ReAssembly of FEatures (CARAFE) operator is used to enhance the up-sampling operation, enabling precise content-aware processing of tomatoes and picking keypoints to improve accuracy and recall. Finally, the Convolutional Block Attention Module (CBAM) is incorporated to enhance the model’s ability to detect tomato-picking key regions in a large field of view in both channel and spatial dimensions. Ablation experiments were conducted to validate the effectiveness of each proposed module (DWConv, CARAFE, CBAM), and the architecture was compared with YOLOv3, v5, v6, v8, v9, and v10. The experimental results reveal that, when juxtaposed with the original network model, the YOLO-TMPPD model brings about remarkable improvements. Specifically, it improves the object detection F1 score by 4.48% and enhances the keypoint detection accuracy by 4.43%. Furthermore, the model’s size is reduced by 8.6%. This study holds substantial theoretical and practical value. In the complex environment of a greenhouse, it contributes significantly to computer-vision-enabled detection of tomato ripening. It can also help robots accurately locate picking points and estimate posture, which is crucial for efficient, precise, and damage-free tomato-picking operations.


1. Introduction

Tomatoes are popular among consumers because of their rich vitamin and mineral content, and more and more people are growing tomatoes in greenhouses [1]. Tomatoes rank among the world’s most significant vegetable crops, holding the second position globally in both production and sales volume [2]. Harvesting, as an important part of tomato production, is still dominated by manual picking. With the widespread cultivation of tomatoes and the annual rise in labor costs, the automation of tomato harvesting to enhance picking efficiency and cut labor expenses has emerged as a major research focus [3]. In greenhouses, the difficulty of detection is greatly increased by complex environmental conditions, including variable light, varying ripeness, similar foreground and background, and overlapping and shading situations. Hence, detecting ripe tomatoes in greenhouse cultivation systems is essential for mechanized harvesting, and the exact determination of optimal picking locations is a key assurance for realizing efficient tomato harvesting.
Traditional image detection methods have long been the main approach for fruit detection on plants. Moreira et al. [4] compared deep-learning-based models with an HSV color space model for classifying each tomato and determining its ripening stage, achieving a balanced accuracy of 68.1% and demonstrating that the HSV color space model outperforms the SSD MobileNetV2 model. Liu et al. [5] used histogram of oriented gradients (HOG) descriptors to train a support vector machine (SVM) classifier for ripe tomato detection with 94.4% detection accuracy. Zhao et al. [6] achieved a low leakage rate of 3.5% by extracting Haar-like features from grayscale images to detect tomatoes. Traditional image detection methods usually rely on manually designed features and rules, which limits their adaptability in complex and dynamically changing environments [7]. Conventional detection methods are mainly based on RGB, texture, and geometric features of the target; they impose strict requirements on the background and are not well suited to tomato recognition in real scenes.
In comparison to the aforementioned detection approaches, deep-learning-driven object detection methods demonstrate superior accuracy alongside notable advantages in processing speed and robustness [8]. With the escalating volume of agricultural data and the advanced computational power of servers, deep-learning technology has found broader applications in agriculture, especially in providing innovative solutions to detection challenges in intricate agricultural environments [9]. In the context of tomato detection in precision agriculture, deep-learning-based methods have emerged as a dominant solution, primarily categorized into single-stage and two-stage frameworks based on their reliance on region proposal mechanisms. Two-stage detectors achieve high accuracy through iterative candidate refinement but suffer from significant computational overhead, which limits their deployment in real-time agricultural scenarios, where dynamic lighting, leaf occlusions, and clustered fruit arrangements require low-latency inference. By contrast, single-stage algorithms bypass explicit proposal generation, offering substantial speed advantages that align with the real-time monitoring needs of robotic harvesting or drone-based surveillance systems [10]. Classical examples include SSD [11], YOLO [12], and anchor-free detectors [13]. Xu et al. [14] proposed a fast tomato detection method based on an improved YOLOv3-tiny with a 12% improvement over YOLOv3-tiny, reaching 25 frames per second on CPU with an inference time of 40.35 ms. Xu et al. [15] used an improved Mask R-CNN to recognize cherry tomatoes in whole clusters with an accuracy of 93%; the fruit recognition rate was 76%, and the stem recognition rate was 94.47%. Yang et al. [16] improved fruit target recognition accuracy by 6.9% by improving YOLOv7. Song et al. [17] proposed the SEYOLOX-tiny model by improving YOLOX, which increased accuracy by 1.5% in corn cob detection. Gao et al. [18] proposed LACTA, a lightweight and exact detection algorithm for cherry tomatoes specifically designed for harvesting robots operating in complex environments, with recall and average accuracy values of 92.5% and 97.3%, respectively, for automated harvesting of cherry tomatoes in greenhouses. Zhang et al. [19] proposed a visual detection and pose classification algorithm based on YOLOv5, which detects unobstructed tomatoes and classifies them by ripeness and 3D pose at 20 fps using an RGB image as input. The convolutional module is key to enhancing feature extraction and improving detection accuracy. Liu et al. [20] developed an object detection algorithm that introduces DWConv into YOLOv5, significantly reducing computational complexity and lowering FLOPs by 54%; Sun et al. [21] proposed the DFLM-YOLO algorithm based on DSConv, which improved mean average precision by 12.4% over the original YOLOv8s while reducing the number of model parameters by 67.2%; Wang et al. [22] used AKConv to replace parts of the YOLOv8 network structure to achieve fast and accurate recognition of insect pests in tea gardens, enhancing feature extraction and reducing the number of model parameters; Zhu et al. [23] used Ghost Convolution (GhostConv) to reduce the number of parameters, improving the recognition accuracy and efficiency for Camellia oleifera fruit in complex natural scenes; and Wang et al. 
[24] enhanced YOLOv5 by integrating GSConv convolution, which significantly improves detection accuracy and inference efficiency for small tea buds in complex backgrounds. Compared with the baseline network, this modification achieves improvements of 3.26% in precision, 11.43% in recall, and 7.68% in mean average precision. Compared to multi-stage approaches, single-stage algorithms show significant benefits in achieving real-time object detection under complex conditions.
In addition to object detection in picking tasks, the detection of picking points is even more important. The keypoint-based object detection method was initially applied to human posture detection [25]. In the field of agriculture, this method is widely applied to crop phenotyping [26,27]. Common picking algorithms fall into two main categories. The first uses machine-learning algorithms for pedicel estimation, contour fitting, and picking point localization through the shape, texture, and color features of tomatoes, using spatially symmetric spline interpolation and geometric analysis [28]. Although direct detection of pedicels and stems is avoided, the morphological requirements on tomatoes and stems are high, and the relative positions of stems and tomatoes cannot be predicted in the relatively complex growing environment of actual greenhouses, resulting in poor detection and localization performance. The second category uses convolutional neural networks for stem and fruit detection. Song et al. [29] used DeepLabV3+ to segment the calyx, branches, and wires of kiwifruit, with Intersection over Union (IoU) values of 0.686, 0.709, and 0.424, respectively. Xiang et al. [30] proposed a stem-segment recognition algorithm for tomato plants based on a hybrid joint neural network that combines a dyadic edge method with a deep-learning model. Additionally, previous studies have introduced multi-task end-to-end CNN frameworks to tackle the problem of robotic picking. Du et al. [31] proposed a multi-task model for tomato object localization, pose detection, and semantic segmentation, which achieves better performance, but its keypoint localization is chosen at the center of the calyx. Li et al. [32] proposed a multi-task-based perceptual network with high accuracy for both detection and segmentation. The above studies successfully achieved fruit detection and stem segmentation in real scenarios, but the process of picking point localization remains relatively complex. Existing approaches generally perform multi-step calculations on classification, detection, and segmentation outputs without considering the specific locations of picking and grasping points during real-world robotic tasks. Detection precision is closely linked to harvesting accuracy and efficiency, while model complexity impacts computational efficiency and resource demands. Hence, reducing model size while maintaining detection performance remains a significant hurdle for current harvesting algorithms.
Tomato picking methods exhibit marked variations in key performance metrics, including picking precision, fruit damage percentage, efficiency, and cost. End-effector design constitutes a vital element of harvesting systems, with its structural makeup and working principles directly impacting the accuracy of fruit detachment [33]. There are four main types of fruit-picking end-effectors, as shown in Figure 1; their components are listed in Figure 1a. The first type relies on the end-effector’s grasping motion to harvest tomatoes directly, as shown in the first panel of Figure 1b. Gripping picking, on the other hand, picks tomatoes in a human-like manner with components such as the three fingers of the end-effector, as shown in the second panel of Figure 1b. These two methods act directly on the tomato fruits and remove them by external physical force, which damages the fruit and seriously affects its quality; it may also disturb the normal growth of the plant and reduce yield. Air-suction picking adsorbs the tomato using negative pressure or suction generated at the end-effector while scissors cut the stem, as shown in the third panel of Figure 1b. This harvesting approach is well suited to small, smooth-skinned fruits but ill-suited to tomato picking in complex greenhouse environments. The cutting-and-grasping type combines two actions: the grasping parts first hold the bottom and two sides of the tomato to fix the fruit, the scissors of the end-effector then cut the stem, and finally the fruit is removed, as shown in the fourth panel of Figure 1b. This type provides precise cutting at the keypoint of the stem and can pick fruits in various postures while keeping the tomatoes intact. Therefore, this paper focuses on exact detection methods for tomato fruit ripeness and picking points in complex environments to provide more scientific and feasible modeling support for tomato-picking operations.
This paper focuses on addressing four key challenges in agricultural greenhouse tomato harvesting: balancing model efficiency and accuracy, jointly detecting tomato fruits and picking points, minimizing picking damage by computing grasping points, and addressing data diversity and label complexity.

2. Materials and Methods

2.1. Data Acquisition

Tomato images were collected on 21 April 2024 and 19 May 2024 from a greenhouse in Donggang Village, Hansi Town, Zhongmu County, Zhengzhou City, China (34°39′41.48″ N, 114°07′10.45″ E). A total of 2300 images of tomatoes on planting frames, covering six complex scenarios, were captured under sunny and cloudy skies from morning to afternoon using a Casio EX-ZR5000 camera (Casio Computer Co., Ltd., Tokyo, Japan) at a distance of about 200–1000 mm from the planting racks, with a single-frame resolution of 2048 × 1536 pixels, and stored in JPEG format. The tomato variety used in this study was Hard Pink 8. Representative images are shown in Figure 2.
The dataset covers the following conditions: (a) background complexity increases the difficulty of detection and requires the model to distinguish the background from the fruit; (b) shooting angle affects the morphology and location of tomatoes in the image, so images were taken from multiple angles for training; (c) lighting conditions can produce unclear images and degrade target features; (d) shooting range affects target complexity and pixel dimensions, and training with diverse ranges improves model generalization; (e) fruits of different ripeness are key to real greenhouse picking scenes, where ripe tomatoes are used for real-time picking and immature tomatoes can be used for yield estimation; (f) the diversity of poses increases the difficulty of detection and grasping, making accurate localization of the fruits and picking points even more important.

2.2. Annotation of Images and Construction of Datasets

In the overall detection workflow of the YOLO-TMPPD model, the data annotation step serves as a fundamental and crucial part. It provides the essential labeled information that the model relies on for training and accurate prediction. This research used LabelMe (v4.5.10), an open-source data annotation tool, to annotate unripe tomatoes, target regions of ripe tomatoes, and four critical harvesting points. A representative example of the annotated data is shown in Figure 3a. Because machine removal of the stem tends to damage the fruit, the stem is retained during picking, and the tomato together with its stem was treated as a combined detection target. Tomatoes and stems were also labeled individually, with keypoints selected according to their growth patterns: one stem-based point near the pedicel and three tomato-specific landmarks at the calyx–tomato connection, the fruit center, and the tomato bottom. In this study, tomato posture can be determined from the three keypoints on the fruit, and the keypoint on the stem is taken as the picking point. It is worth noting that the pedicel is labeled as the picking point because it presents a clearly visible node feature on the stem, which facilitates model training. Moreover, in real tomato picking, the picking point must be adjusted according to the actual environment. The method can be adapted to different picking standards simply by ensuring that the picking point falls near the pedicel, and localizing the pedicel also prevents the picking point from overstepping the stem boundary (falling on or beyond the edge of the stem).
To prevent inconsistencies arising from multiple annotators, the image annotation process was divided into stages and completed by one researcher over ten weeks, given the substantial data volume. Targets for annotation included immature and ripe tomato fruits (green and color-change stages marked as unpickable, firm and finished stages marked as pickable) and tomato pedicels that were visible with no more than 50% blurring or occlusion. A systematic annotation protocol ensured simultaneous labeling of multiple objects and keypoints in each image. The process involved first defining the combined tomato–stem region, then individually annotating tomatoes and stems, followed by sequential keypoint marking in the order of pedicel (p1), calyx–fruit union (p2), geometric center (p3), and basal point (p4). To maintain positional accuracy, all keypoints were constrained within their corresponding object-bounding boxes. Immature tomatoes received only detection annotations without keypoints. Keypoints that are not visible due to occlusion are labeled as “pi_NO”, and keypoints that do not exist in the image are not labeled and are handled separately in the subsequent format-conversion program. Labeled data are saved as JSON files and then converted into TXT files with the same names as the corresponding images. Each TXT file stores the keypoint information, as shown in Figure 3b: the first column is the category number, covering the four target classes of this study (the combined tomato–stem box is 0, the stem is 1, the tomato is 2, and the immature tomato is 3); the second to fifth columns give the detection box information of the target object; Xpi and Ypi denote the normalized coordinates of keypoint pi, and Vpi denotes whether the keypoint is visible (0: not present, 1: present but occluded, 2: present and visible); p1 is the keypoint on the stem, and p2, p3, and p4 are the keypoints on the tomato. Finally, the data were programmatically converted into a multi-type dataset with a total of 11,661 tomato instances (4301 ripe tomatoes, 7360 unripe tomatoes) and 16,891 labeled keypoints. This annotation method ensures that the model can accurately learn the features, keypoints, and growth stages of tomatoes, enabling accurate detection and classification in real-world tomato-picking scenarios.
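The label layout described above can be illustrated with a short conversion sketch. The snippet below is a hypothetical example, not the authors’ actual conversion program, of how one annotated instance could be written as a YOLO-pose TXT line with a class id, a normalized bounding box, and the four keypoints with visibility flags; the function and variable names are illustrative only.

```python
# Hypothetical sketch: write one annotated instance as a YOLO-pose label line
# (class cx cy w h  x_p1 y_p1 v_p1 ... x_p4 y_p4 v_p4), all coordinates normalized.
KEYPOINT_ORDER = ["p1", "p2", "p3", "p4"]  # pedicel, calyx-fruit union, center, bottom

def to_pose_line(class_id, box, keypoints, img_w, img_h):
    """box = (x_min, y_min, x_max, y_max) in pixels;
    keypoints = {"p1": (x, y, v), ...} with v in {0, 1, 2}."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    fields = [class_id, round(cx, 6), round(cy, 6), round(w, 6), round(h, 6)]
    for name in KEYPOINT_ORDER:
        x, y, v = keypoints.get(name, (0.0, 0.0, 0))  # missing keypoint -> v = 0
        fields += [round(x / img_w, 6), round(y / img_h, 6), v]
    return " ".join(str(f) for f in fields)

# Example: one combined tomato-stem instance (class 0) with all keypoints visible.
print(to_pose_line(0, (100, 150, 400, 500),
                   {"p1": (250, 140, 2), "p2": (250, 180, 2),
                    "p3": (250, 320, 2), "p4": (250, 490, 2)}, 2048, 1536))
```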

2.3. Data Augmentation

Data augmentation is performed to expand the dataset and make the model more robust to images from different environments. Although various complex greenhouse environments were considered as much as possible during data acquisition, it is difficult to cover tomato images under all conditions by image acquisition alone, owing to significant changes in light, the varied growth patterns of tomatoes, and the complex morphology of stems. At the same time, building a large-scale dataset is challenging because labeling targets and keypoints is a time-consuming and labor-intensive task. To enrich the experimental dataset, tomato images were expanded using image processing techniques, and their labeled data were transformed accordingly.
Considering the need for distance and proximity estimation in the smart picking process, affine transformations were used to simulate different camera positions and shooting angles. To handle the inevitable noise and motion blur introduced during image acquisition, a combination of Gaussian filtering and random Gaussian noise addition was implemented as part of the preprocessing pipeline. Given that tomatoes in greenhouses are inevitably occluded by other objects, a random mask was also added to the images to simulate occlusion. Owing to substantial changes in greenhouse lighting conditions, image brightness exhibits significant fluctuations; these dynamic lighting variations are emulated via systematic brightness adjustments during preprocessing. Finally, the complexity of the greenhouse environment is reflected by randomly combining this series of image processing operations, as shown in Figure 4. Figure 4a is the original image; Figure 4b shows the four enhancement methods applied individually, which produces four times more data; and Figure 4c shows combinations of the four methods, which produces two times more data, for a total of six times more data. The training images can thus be expanded to seven times the original dataset, totaling 16,100 images.
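As an illustration of these operations, the following sketch applies an affine transform, Gaussian blur with additive noise, a random occlusion mask, and a brightness shift to one image using OpenCV and NumPy. The parameter ranges are assumptions, not the exact values used in this study, and in practice the bounding boxes and keypoints must be transformed with the same affine matrix.

```python
# Illustrative augmentation sketch (assumed parameters, not the study's exact pipeline).
import cv2
import numpy as np

def augment(img, rng):
    h, w = img.shape[:2]
    # Affine transform: simulate a different camera position / viewing angle.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-15, 15), rng.uniform(0.8, 1.2))
    out = cv2.warpAffine(img, M, (w, h))
    # Gaussian blur plus additive Gaussian noise: motion blur and sensor noise.
    out = cv2.GaussianBlur(out, (5, 5), 0)
    out = np.clip(out.astype(np.float32) + rng.normal(0, 8, out.shape), 0, 255).astype(np.uint8)
    # Random rectangular mask: simulate occlusion by leaves or neighboring fruits.
    x, y = rng.integers(0, w // 2), rng.integers(0, h // 2)
    out[y:y + h // 6, x:x + w // 6] = 0
    # Brightness shift: simulate greenhouse lighting fluctuations.
    out = cv2.convertScaleAbs(out, alpha=1.0, beta=int(rng.uniform(-40, 40)))
    return out

augmented = augment(cv2.imread("tomato.jpg"), np.random.default_rng(0))
```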

2.4. YOLOv8

The YOLO algorithm is an efficient method for detection, making it a hot research topic for object detection [34], originally proposed by Redmon et al. [35]. YOLOv8 achieves substantial improvements in detection performance and robustness through innovative architectural modifications [36]. Integrating attention mechanisms and dynamic convolution addresses small object detection challenges while overcoming YOLOv7's scaling coefficient limitations. The network is structured into five distinct scaling variants (n, s, m, l, x) to balance accuracy and computational efficiency. YOLOv8 utilizes cutting-edge backbone and neck architectures, which enhance feature extraction capabilities and object detection performance. The network improves multi-scale feature fusion by integrating detailed spatial information and semantic context across different feature map scales. The original coupling head is replaced by a decoupling head, and the regression branch and prediction branch are separated for better recognition.

2.5. Model Architecture of YOLO-TMPPD

In the YOLO-TMPPD architecture, four keypoint detections are added to the network, drawing on the idea of pose detection and the natural growth pattern of tomatoes. Since tomato stems resemble the background, the attention module CBAM [37] is added to the YOLOv8 network so that the network learns more foreground features. Secondly, the detector must generalize across tomatoes of multiple maturity stages, so the network needs to attend not only to the spatial feature relationships of tomatoes but also to their potential semantic information, which many network models lose during up-sampling. Therefore, the up-sampling operator in the YOLO-TMPPD model uses the lightweight Content-Aware ReAssembly of FEatures (CARAFE) operator [38], as it can aggregate contextual information over a large receptive field with little computational overhead. Finally, the purpose of the model is to provide detection services for smart picking terminals. To make the detection and localization model easier to deploy on embedded or other mobile devices, a channel-by-channel (depthwise) convolution mechanism (DWConv) is introduced. This convolution applies one filter per input channel, decreasing model size and enabling faster execution than conventional convolution layers, and it achieves a favorable balance between parameter reduction and computational efficiency.
The YOLO-TMPPD model uses CSPDarknet53 with a two-stage FPN as the backbone and neck of the network, with DWConv as the convolutional layer; CARAFE up-sampling operations are connected to layers 6 and 4, respectively, and the CBAM attention mechanism is added after layers 16, 20, and 24. The YOLO-TMPPD model is shown in Figure 5.

2.5.1. Depthwise Convolution

The introduction of channel-by-channel convolution DWConv can help the YOLOv8 model process image data with complex morphology, color, size, and position more efficiently while maintaining high performance [39]. DWConv only performs convolution on each input channel independently and does not need to deal with cross-channel information like standard convolution, thus reducing the computational complexity. As a key component in network architecture, DWConv significantly minimizes model parameter count while enhancing detection accuracy and computational efficiency. The DWConv is shown in Figure 6.
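A minimal PyTorch sketch of a depthwise-separable block illustrates this idea: setting groups equal to the number of input channels makes each filter act on a single channel, which is what reduces parameters and FLOPs relative to standard convolution. The block composition below (depthwise plus 1 × 1 pointwise convolution with BatchNorm and SiLU) is an assumption and may differ from the exact module used in YOLO-TMPPD.

```python
import torch
import torch.nn as nn

class DWConvBlock(nn.Module):
    """Sketch of a depthwise-separable convolution block (assumed composition)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)  # per-channel conv
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)                   # 1x1 channel mixing
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

x = torch.randn(1, 64, 80, 80)
print(DWConvBlock(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```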

2.5.2. CARAFE

CARAFE (Content-Aware ReAssembly of FEatures) is an innovative feature-map up-sampling method that guides the up-sampling process based on the content of the input features, resulting in more accurate and efficient feature reconstruction. Designed to improve on classical up-sampling approaches, CARAFE addresses limitations in methods such as nearest-neighbor interpolation, bilinear interpolation, and deconvolution. Traditional methods determine the up-sampling kernel only from the spatial location of pixels, which gives a small receptive field and cannot fully exploit the semantic information of the feature map. CARAFE, on the other hand, significantly expands the receptive field by introducing a content-aware mechanism and effectively incorporates the semantic information of the feature map without introducing excessive parameters or computation. The structure of CARAFE is shown in Figure 7.
CARAFE consists of two main modules: the Up-sampling Kernel Prediction Module, shown in Figure 7a, and the Feature Recombination Module, shown in Figure 7b. The former first compresses the number of channels of the feature map using a 1 × 1 convolution to reduce the computation of the subsequent steps. Next, the compressed feature map is encoded using a coding kernel to generate a reorganization kernel based on the content of the input features. Lastly, the predicted up-sampling kernel is normalized via the SoftMax function, which guarantees that the weights of the convolution kernel sum to 1. Subsequently, for each position in the output feature map and its corresponding position in the input feature map, the feature recombination module selects a kup × kup region centered at that position. The selected region is then subjected to a dot product with the up-sampling kernel predicted for that position by the kernel prediction module to obtain the output value. Notably, different channels at the same location share the same up-sampling kernel. CARAFE achieves instance-specific content-aware processing by predicting the up-sampling kernel based on the content of the input features. This capability enables CARAFE to apply adaptive and optimized reconfiguration kernels at specific positions. When contrasted with traditional techniques, the perceptual scope of CARAFE is significantly broadened, making it easier to capture richer contextual information and boosting the accuracy and robustness of the up-sampling outputs.
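The two modules can be summarized in a compact PyTorch sketch: a channel compressor and content encoder predict a normalized kup × kup kernel for every output position, and the reassembly step forms a weighted sum over the corresponding input neighborhood, with all channels at a location sharing the same kernel. Hyper-parameters such as the compressed channel width and kernel sizes below are assumptions, and this is a simplified reading of CARAFE rather than its reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Minimal content-aware up-sampling sketch (assumed hyper-parameters)."""
    def __init__(self, c, scale=2, c_mid=64, k_enc=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                      # channel compressor
        self.encode = nn.Conv2d(c_mid, (scale * scale) * k_up * k_up,
                                k_enc, padding=k_enc // 2)          # kernel prediction

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) Predict one k_up*k_up reassembly kernel per output pixel and normalize it.
        kernels = self.encode(self.compress(x))                     # (b, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, self.scale)              # (b, k^2, sh, sw)
        kernels = F.softmax(kernels, dim=1)
        # 2) Gather k_up*k_up input neighborhoods and expand them to the output grid.
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)    # (b, c*k^2, h*w)
        patches = patches.view(b, c * self.k_up * self.k_up, h, w)
        patches = F.interpolate(patches, scale_factor=self.scale, mode="nearest")
        patches = patches.view(b, c, self.k_up * self.k_up, h * self.scale, w * self.scale)
        # 3) Weighted sum: all channels at a location share the same kernel.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)

print(CARAFE(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 80, 80])
```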

2.5.3. CBAM

CBAM (Convolutional Block Attention Module) is a lightweight and practical attention module designed to enhance the ability to represent feature maps in channel and spatial dimensions. By dynamically adjusting the weights of the feature maps, CBAM can significantly improve the accuracy of object detection while maintaining a small number of parameters and computational costs. More specifically, CBAM initiates the process by generating an attention map in the channel dimension. Subsequently, this attention map is multiplied with the input feature map, enabling adaptive optimization of the features. Subsequently, it generates the self-attention map along the spatial dimension, which is again multiplied with the feature map to finally obtain the optimized output feature map. The CBAM structure is shown in Figure 8.
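A minimal PyTorch sketch of CBAM follows the sequence just described: a channel attention map computed from pooled descriptors is applied first, then a spatial attention map computed from channel-wise average and max features. The reduction ratio and the 7 × 7 spatial kernel are common defaults and are assumptions here, not values confirmed by this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1, bias=False), nn.ReLU(),
                                 nn.Conv2d(c // r, c, 1, bias=False))
    def forward(self, x):
        avg = self.mlp(x.mean((2, 3), keepdim=True))   # global average-pooling branch
        mx = self.mlp(x.amax((2, 3), keepdim=True))    # global max-pooling branch
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)
    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s))

class CBAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c), SpatialAttention()
    def forward(self, x):
        x = x * self.ca(x)      # channel attention first, as described above
        return x * self.sa(x)   # then spatial attention

print(CBAM(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```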
In the complex scene of tomato picking point detection, CBAM shows its unique advantages. Because the densely distributed regions in tomato and stem images often contain confusing information, CBAM can extract attention regions from both channel and spatial dimensions at the same time, which helps the YOLO-TMPPD detection model to effectively resist confusing details and focus more on beneficial target regions. This mechanism brings about a remarkable enhancement in the model’s detection performance when dealing with complex backgrounds. Moreover, it offers robust backing for applications in agricultural automation and other related sectors.

2.6. Evaluation Indicators

Object keypoint similarity (OKS) is used as the evaluation metric for keypoint detection [40]. The average precision (AP) is computed from all OKS values in an image, and the mean average precision for keypoints (mAP-kp) is obtained by averaging over all classes.
The OKSp formula is given in Equation (1).
$$\mathrm{OKS}_p = \frac{\sum_i \exp\!\left(-\dfrac{d_{pi}^{2}}{2\,s_p^{2}\,\sigma_i^{2}}\right)\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where “dpi” represents the Euclidean distance between the detected i-th keypoint and the corresponding ground-truth keypoint of the target; “vi” represents keypoint visibility (0: unlabeled, 1: labeled but occluded, 2: labeled and visible); “sp” is the scale factor, computed as the square root of the object detection box’s area; “σi” is the normalization factor for keypoint type i; and “δ” is the indicator function, equal to 1 when vi is greater than 0 and 0 otherwise, which filters out the valid keypoints (those with vi greater than 0).
The AP formula is given in Equation (2).
$$AP@s = \frac{\sum_p \delta(\mathrm{OKS}_p > s)}{\sum_p 1}$$
where the formula calculates the proportion of predictions with OKSp greater than the threshold s as a measure of the average accuracy of the model in the keypoint detection task. In this study, s = 0.5, so the proportion of predictions with OKSp greater than 0.5 among all predictions is calculated. The higher this proportion, the better the model performs at that threshold and the better the predicted keypoints match the true keypoints.
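The two metrics can be computed directly from Equations (1) and (2), as in the short NumPy sketch below. The per-keypoint normalization factors σ are dataset-specific and are passed in as assumptions rather than values reported in this paper.

```python
# Sketch of the OKS and AP@s computations in Equations (1) and (2).
import numpy as np

def oks(pred_kpts, gt_kpts, vis, box_area, sigmas):
    """pred_kpts, gt_kpts: (K, 2) pixel coordinates; vis: (K,) visibility flags;
    box_area: area of the ground-truth box; sigmas: (K,) normalization factors."""
    d2 = np.sum((pred_kpts - gt_kpts) ** 2, axis=1)
    s2 = box_area                           # s_p is sqrt(area), so s_p^2 = area
    valid = vis > 0
    e = np.exp(-d2 / (2 * s2 * sigmas ** 2))
    return e[valid].sum() / max(valid.sum(), 1)

def ap_at(oks_values, s=0.5):
    """Fraction of predictions whose OKS exceeds the threshold s (Equation (2))."""
    oks_values = np.asarray(oks_values, dtype=float)
    return float((oks_values > s).mean()) if oks_values.size else 0.0

sigmas = np.full(4, 0.05)                   # assumed per-keypoint factors for 4 keypoints
```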
Evaluation criteria for target recognition models involve model complexity (GFLOPs), parameter count, memory usage (Weights), and accuracy metrics, including Precision (P), Recall (R), F1 score, and FPS. Table 1 summarizes these evaluation metrics.

2.7. Grad-CAM

To validate the benefits of the models for ripe tomato and picking point detection, the models are visualized with unripe-tomato detections removed. Deep-learning models, especially those based on convolutional neural networks, usually achieve high accuracy in recognition and detection tasks, but their decision-making process often lacks interpretability. Improving the interpretability of the models is necessary to enhance user trust. Hence, a technique for visualizing the CNN decision-making process, Grad-CAM [41], was used to visualize the detection results of YOLO-TMPPD. This approach seeks to uncover the decision-making rationale of the CNN by creating class activation heatmaps. Its fundamental principle is to use gradient information from the target class to compute a weighted aggregation of the feature maps in the last convolutional layer. After ReLU activation, this process generates a coarse-grained localization map that identifies the image regions driving the prediction.
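A generic Grad-CAM sketch in PyTorch, using forward and backward hooks on a chosen convolutional layer, illustrates the procedure. The choice of target layer and the scoring function are assumptions, and this is not the authors’ exact visualization code.

```python
# Minimal Grad-CAM sketch (assumed target layer and score function).
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, score_fn):
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        out = model(x)
        score = score_fn(out)                  # scalar score for the target class/keypoint
        model.zero_grad()
        score.backward()
        w = grads[0].mean(dim=(2, 3), keepdim=True)            # channel-wise gradient weights
        cam = F.relu((w * feats[0]).sum(dim=1, keepdim=True))  # weighted sum + ReLU
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove(); h2.remove()
    return cam                                 # (N, 1, H, W) heat map in [0, 1]
```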

3. Results

3.1. Experimental Process

3.1.1. Experimental Environment

The computational environment and hyperparameters used for model training are listed in Table 2.

3.1.2. Experimental Details

The program was developed by modifying the Ultralytics YOLOv8 [42] code base. Training used the SGD optimizer with a learning rate of 0.01 and a weight decay of 0.005. A three-stage warmup strategy was applied with a momentum of 0.8, followed by a momentum of 0.937. The training schedule consisted of 300 epochs with a batch size of 16. To mitigate bounding-box overlap, the IoU prediction threshold was set to 0.3. To reduce memory consumption during training, the input image size was set to 640 × 640, and the corresponding labeled data were adjusted accordingly. The data were divided into a training set of 12,880 images, a validation set of 1610 images, and a test set of 1610 images (8:1:1 ratio), each with tomato bounding-box and keypoint labels. The model was trained from scratch without any pre-trained weights.
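For reference, these settings map roughly onto the Ultralytics YOLOv8 training API as sketched below. The model and dataset YAML file names are hypothetical, the custom YOLO-TMPPD modules would additionally have to be registered in the code base, and the argument values simply mirror the ones stated above.

```python
# Hedged sketch of the training configuration via the Ultralytics API (hypothetical file names).
from ultralytics import YOLO

model = YOLO("yolo-tmppd-pose.yaml")        # custom pose model config (hypothetical)
model.train(
    data="tomato-pose.yaml",                # dataset config with the 8:1:1 splits (hypothetical)
    epochs=300, batch=16, imgsz=640,
    optimizer="SGD", lr0=0.01, weight_decay=0.005,
    warmup_momentum=0.8, momentum=0.937,    # momentum during and after warmup
    iou=0.3,                                # IoU threshold used to mitigate box overlap
    pretrained=False,                       # trained from scratch, no pre-trained weights
)
```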

3.2. Model Ablation Studies

To select a more suitable feature-extraction convolution operation and verify its effectiveness, this study compares models trained on the tomato-picking dataset with each convolution introduced individually, as shown in Table 3. The improved YOLO-DWConv has lower computation and fewer parameters than the original model, and the model size is reduced by 3 MB. The YOLO-DWConv model is more accurate and consumes fewer resources than the other four improved models, YOLO-DSConv, YOLO-AKConv, YOLO-GhostConv, and YOLO-GSConv. Without introducing other modifications or additional learnable parameters, DWConv was chosen because its parameter count and computational cost are significantly lower than those of the other convolution operations while maintaining better accuracy.
Despite considerable reductions in parameters and computation, the replaced backbone network exhibited lower accuracy than the original architecture. To overcome this shortfall, this research introduces backbone enhancements to raise detection performance. Ablation experiments for each improvement, adding the CBAM attention module and the CARAFE up-sampling module to the backbone respectively, are reported in Table 4. “✓” indicates that the improvement was added, and “-” indicates that it was not used. As can be seen from the table, the detection accuracy of the model improves significantly over YOLO-TMPPD(v1) when either the CARAFE or the CBAM module is added individually. The accuracy reaches 97.55% when both are added, yielding the best model.

3.3. Comparative Experiments with Different Models

Vertical comparison experiments are performed on the models to validate the superiority of the YOLO-TMPPD network for keypoint detection. The target keypoint detection module was first added to the YOLO family of models (YOLOv3, YOLOv5, YOLOv6, YOLOv8, YOLOv9, YOLOv10) and trained and validated using the same dataset, and the performance of each model was explored.
In Table 5, YOLO-TMPPD achieves 97.55% precision, high recall (93.89%), and a high F1 score (94.02%), and its mAP-kp@0.5 is 4.43% better than that of YOLOv8-pose, while the model size is reduced by 1.9 MB. In terms of inference speed, YOLO-TMPPD reaches 336.2 FPS, about 12.7% higher than YOLOv8-pose (298.2 FPS), showing strong real-time performance. Compared to YOLOv3-pose, YOLO-TMPPD has significant advantages in all metrics: a 4.66% improvement in mAP-kp@0.5, a 179.9 MB reduction in model size, and a 206.4 FPS improvement in inference speed. Although YOLOv5-pose has a lighter model (18.2 MB) and lower GFLOPs (25), its accuracy is 8.33% lower and less stable in complex environments. YOLOv6-pose and YOLOv9-pose lag behind in accuracy by 4.51% and 6.49%, respectively, despite being larger and more computationally expensive. YOLOv10-pose, while similar to YOLOv6-pose in GFLOPs and memory usage, lags behind in accuracy by 5.08%, indicating that the newer architecture does not outperform YOLOv6-pose on this task. This highlights YOLO-TMPPD’s excellent balance of accuracy (97.55% mAP-kp), lightweight design, and real-time performance (336.2 FPS) for agricultural robotics applications.
Visualization and comparison of mAP-kp performance during model training intuitively demonstrate the advantages of the YOLO-TMPPD architecture, as depicted in Figure 9. Analysis of metric evolution on the custom dataset reveals that all evaluation indicators stabilize after 200 training epochs when comparing YOLO-TMPPD with other YOLO variants. For all models described in this study, training concluded at 300 epochs.

3.4. Model Interpretability Analysis

As shown in Figure 10, the Grad-CAM visualization results clearly demonstrate the differences in discriminative attention patterns between YOLOv8 and YOLO-TMPPD in tomato detection tasks. The attention heatmap of YOLOv8 shows that it mainly focuses on the fruit area, with lower attention to the stem area. In contrast, YOLO-TMPPD’s attention heatmap exhibits enhanced attention to both the keypoints of the flower stems and the fruit targets. Specifically, in complex greenhouse scenes, YOLO-TMPPD’s attention area can accurately locate key structures, such as the flower-stem connection, while effectively suppressing background interference in densely distributed tomato stem areas.

3.5. Inference and Evaluation of Critical Point Detection

When picking tomatoes, the size and number of tomatoes in the field of view captured by the vision system change as the picking terminal moves. As the vision system approaches the tomatoes, the tomato information is complete and recognition is more direct. When the vision system moves away from the tomatoes, the field of view expands and the tomatoes become small and dense, making them prone to mutual occlusion, overlap, and blurring. Additionally, the spatial arrangement and shape disparities between tomato fruits and stems, along with the stems’ similarity in hue to the surrounding leaves, create non-trivial detection difficulties. The inherent variability of greenhouse environments, characterized by weed proliferation and illumination fluctuations, introduces confounding factors that structurally resemble stems and compromise model robustness.
Therefore, to verify the effectiveness of the model in detecting targets and keypoints under various conditions, the trained YOLO-TMPPD model was evaluated on the test set. The inference results are shown in Figure 11, which illustrates YOLO-TMPPD and YOLOv8 localizing tomato fruits and picking points across different scenes. Error regions highlighted in yellow represent detection failures, including omissions, false detections, and positional inaccuracies. In general, the YOLOv8 model is more prone to missed detections and insufficient detection accuracy than the YOLO-TMPPD model, and its keypoints are more often undetected or positionally deviated. The inference of the YOLO-TMPPD model is comparatively accurate and better able to complete the task of detecting tomatoes and their picking points in the greenhouse environment. Its target detection is more reliable, and its keypoints, especially the first and third, are more accurate, which verifies the detection ability of the improved model in a variety of greenhouse scenes and is more conducive to tomato fruit picking.
Based on this experimental dataset, YOLOv8 is used as the benchmark model, and the keypoint detection performance of YOLO-TMPPD is compared with it, as shown in Figure 12. Batch test results are visualized as a bar chart, with the x-axis representing the Euclidean distance between predicted and ground-truth keypoints. The “—” symbol denotes cases where the prediction error exceeds 100 pixels. Note that only picking points that are present in the image and appear in the labels are evaluated. Eighty images were randomly selected from the 230-frame test set, containing 261 pickable ripe tomato targets and 261 sets of tomato keypoints. YOLO-TMPPD correctly predicted 256 picking points and missed 5, whereas the YOLOv8 model missed 14. More than 80% of the YOLO-TMPPD predictions had an error of less than 40 pixels, while only 64% of the YOLOv8 predictions were within 40 pixels. The results show that YOLO-TMPPD produces fewer missed detections and locates picking points more accurately.
To integrate the YOLO-TMPPD model more efficiently with tomato fruit and keypoint detection and harvesting machinery, the positions of the gripping points are calculated in a post-processing stage applied to the model output. Figure 13 shows the effect of this post-processing stage on labeled tomatoes. First, the skeleton information (top of fruit, center of fruit, bottom of fruit) is converted into normalized unit vectors; then, the angle-bisector vector is obtained by summing and normalizing these unit vectors. Combined with the constraints of the tomato detection box, the length of the angle bisector is determined, and keypoints 5 and 6 are plotted accordingly. Based on these coordinates, a second angle-bisector operation is executed to locate keypoints 7 and 8. Finally, keypoints 4, 7, and 8 are selected as the grasping points, and keypoint 1 is selected as the cutting point, ensuring the stability of the grasping operation and robustness to changes in tomato posture. At the same time, the results are visualized, and the relevant coordinates are recorded and output directly to the terminal, providing more refined information support for the picking operation. For immature tomatoes, only counting is performed and the grasp-point computation is omitted, providing data support for yield estimation while reducing computational complexity.
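The first bisector step can be sketched as follows: unit vectors from the fruit center toward the top and bottom keypoints are summed and normalized to obtain a bisector direction, and a candidate point is placed along that direction at a length bounded by the detection box. The specific length rule and the coordinates below are illustrative assumptions, not the exact post-processing rules of this study.

```python
# Illustrative sketch of one angle-bisector step used to derive candidate grasping points.
import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def bisector_point(center, a, b, box_wh):
    """Point along the bisector of the directions center->a and center->b,
    with its length bounded by roughly half the detection-box extent (assumption)."""
    d = unit(unit(a - center) + unit(b - center))   # angle-bisector direction
    length = 0.5 * min(box_wh)                      # bound derived from the detection box
    return center + d * length

top = np.array([320.0, 200.0])      # fruit top keypoint (p2), pixel coordinates
center = np.array([330.0, 260.0])   # fruit center keypoint (p3)
bottom = np.array([335.0, 320.0])   # fruit bottom keypoint (p4)
p5 = bisector_point(center, top, bottom, box_wh=(120.0, 140.0))
print(p5)  # one candidate grasping point on the side of the fruit
```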

4. Discussion

The YOLO-TMPPD network structure introduces two key architectural innovations. The first is the CARAFE content-aware up-sampling mechanism, which significantly enhances the model’s ability to understand complex spatial relationships by incorporating multi-scale contextual semantic information in the feature reconstruction process. The mechanism enables the model to accurately localize fine structures such as flower-stem junctions by expanding the receptive field to aggregate multi-scale contextual cues. The second innovation is the CBAM bi-dimensional attention mechanism, which effectively suppresses background noise while preserving discriminative features by dynamically adjusting the feature responses in the channel and spatial dimensions. In a scenario where tomato stems are densely distributed, this dual-attention mechanism improves target localization accuracy while maintaining feature integrity.
Although tomato-picking tasks have received significant attention, achieving fruit-picking point detection in greenhouse environments using machine vision remains challenging. To address this, we constructed a six-dimensional dataset designed to maximize data realism, closely matching manual picking scenarios. While existing studies have achieved notable advancements in tomato identification and picking task performance through model improvements and algorithmic innovations, most approaches focus on segmenting and locating tomato fruit centers. This ignores the potential effects of picking methods on fruit quality and results in complex modeling processes for determining picking points.
To enhance the practical application value of the model in the agricultural field, this study developed a tomato-picking detection system based on PySide6 technology. The actual scenario test data shown in Figure 14 validates the engineering practicality of the system. It is worth emphasizing that the innovative method proposed in this study achieves a dual compatibility breakthrough: firstly, it supports seamless integration with end-effectors that also have cutting and grasping functions, and secondly, it generates coordinate data formats compatible with the four mainstream end-effector types. This technological advancement significantly expands the applicability of the model, laying a solid foundation for improving the adaptability and production efficiency of greenhouse harvesting operations.

5. Conclusions

In this research, we constructed an object detection and keypoint detection dataset for tomatoes in an actual greenhouse environment and proposed a model dedicated to tomato maturity assessment and picking keypoint identification. The model incorporates the detection of four crucial tomato keypoints to enhance the efficacy of the tomato detection mechanism. The test results show that the object detection F1 score of the proposed model reached 94.02%, the detection accuracy of mature tomato picking points improved by 4.43%, and the model size was reduced by 1.9 MB. Given the sparsity and imbalance inherent in the dataset, data augmentation techniques were employed to expand the sample size. Finally, an end-to-end system for fruit and picking point detection was developed to enable precise tomato picking, and the inference results underwent post-processing to offer intuitive information on the skeletal posture of tomatoes and stems as well as their picking points.
Despite substantial advancements in tomato detection and picking point detection, the proposed approach has certain drawbacks. Firstly, this research relies solely on data from April and May; it lacks data from other seasons, regions, and tomato varieties across the whole year. Constructing a more comprehensive dataset is therefore essential to enhance the model’s generalization capacity. Secondly, the data annotation procedure is laborious and may involve subjective discrepancies. In the future, we can explore sample optimization combined with transfer learning or apply the model to more fruit and vegetable picking tasks for experimental verification. Finally, the current model mainly relies on 2D image information, and future research will integrate 3D point cloud [43] and multi-task image segmentation technology to accurately acquire tomato position information in 3D space, providing more comprehensive and accurate decision support for intelligent picking machinery.

Author Contributions

Conceptualization, X.W. (Xinfa Wang), X.W. (Xuan Wen) and B.C.; methodology, X.W. (Xuan Wen); software, C.D., Y.L. and D.Z.; validation, X.W. (Xuan Wen), Y.L. and C.S.; formal analysis, X.W. (Xuan Wen) and X.W. (Xinfa Wang); investigation, X.W. (Xinfa Wang) and B.C.; resources, X.W. (Xuan Wen) and Y.L.; data curation, X.W. (Xuan Wen) and C.S.; writing—original draft preparation, X.W. (Xuan Wen); writing—review and editing, X.W. (Xinfa Wang), X.W. (Xuan Wen); visualization, Y.L., C.D. and D.Z.; supervision, X.W. (Xinfa Wang) and B.C.; project administration, X.W. (Xinfa Wang) and X.W. (Xuan Wen); funding acquisition, X.W. (Xinfa Wang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Science and Technology Special Project of Henan Province (No. 241100110200) and the Henan Provincial Department of Science and Technology through the Henan Science and Technology Key Project (No. 252102111172 and No. 242102110331). The APC was funded by the Major Science and Technology Special Project of Henan Province (No. 241100110200).

Data Availability Statement

The dataset was constructed and published by the author’s team, with a subset of 100 images publicly available under an Apache 2.0 license at https://www.kaggle.com/datasets/xuanwen0725/tomatoandstem-picking-point (accessed on 14 May 2025). The full dataset has not yet been released due to ongoing field validation and institutional data policies; the full dataset is available through the corresponding author for legitimate reasons.

Acknowledgments

The authors thank Ultralytics for providing the YOLOv8 architecture and open-source implementation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AP	Average Precision
CARAFE	Content-Aware ReAssembly of Features
C2f	CSP Bottleneck with Two Convolutions
CBAM	Convolutional Block Attention Module
DWConv	Depthwise Convolution
FN	False Negative
FP	False Positive
FPS	Frames Per Second
GFLOPs	Giga Floating-Point Operations Per Second
Grad-CAM	Gradient-Weighted Class Activation Mapping
IoU	Intersection over Union
mAP-kp	Mean Average Precision-Keypoint
OKS	Object Keypoint Similarity
TP	True Positive
YOLO	You Only Look Once

References

  1. Hu, C.; Liu, X.; Pan, Z.; Li, P. Automatic Detection of Single Ripe Tomato on Plant Combining Faster R-CNN and Intuitionistic Fuzzy Set. IEEE Access 2019, 7, 154683–154696. [Google Scholar] [CrossRef]
  2. Kasimatis, C.-N.; Psomakelis, E.; Katsenios, N.; Katsenios, G.; Papatheodorou, M.; Vlachakis, D.; Apostolou, D.; Efthimiadou, A. Implementation of a decision support system for prediction of the total soluble solids of industrial tomato using machine learning models. Comput. Electron. Agric. 2022, 193, 106688. [Google Scholar] [CrossRef]
  3. Montoya-Cavero, L.-E.; Díaz de León Torres, R.; Gómez-Espinosa, A.; Escobedo Cabello, J.A. Vision systems for harvesting robots: Produce detection and localization. Comput. Electron. Agric. 2022, 192, 106562. [Google Scholar] [CrossRef]
  4. Moreira, G.; Magalhães, S.A.; Pinho, T.; dos Santos, F.N.; Cunha, M. Benchmark of Deep Learning and a Proposed HSV Colour Space Models for the Detection and Classification of Greenhouse Tomato. Agronomy 2022, 12, 356. [Google Scholar] [CrossRef]
  5. Liu, G.; Mao, S.; Kim, J.H. A Mature-Tomato Detection Algorithm Using Machine Learning and Color Analysis. Sensors 2019, 19, 2023. [Google Scholar] [CrossRef] [PubMed]
  6. Zhao, Y.; Gong, L.; Zhou, B.; Huang, Y.; Liu, C. Detecting tomatoes in greenhouse scenes by combining AdaBoost classifier and colour analysis. Biosyst. Eng. 2016, 148, 127–137. [Google Scholar] [CrossRef]
  7. Qi, D.; Tan, W.; Yao, Q.; Liu, J. YOLO5Face: Why Reinventing a Face Detector. arXiv 2022, arXiv:2105.12931. [Google Scholar]
  8. Wu, Z.; Wang, X.; Jia, M.; Liu, M.; Sun, C.; Wu, C.; Wang, J. Dense object detection methods in RAW UAV imagery based on YOLOv8. Sci. Rep. 2024, 14, 18019. [Google Scholar] [CrossRef]
  9. Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and Localization Methods for Vision-Based Fruit Picking Robots: A Review. Front. Plant Sci. 2020, 11, 510. [Google Scholar] [CrossRef]
  10. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep learning–Method overview and review of use for fruit detection and yield estimation. Comput. Electron. Agric. 2019, 162, 219–234. [Google Scholar] [CrossRef]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  12. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  13. Liu, Q.; Liu, M.; Jonathan, Q.M.; Shen, W. A real-time anchor-free defect detector with global and local feature enhancement for surface defect detection. Expert Syst. Appl. 2024, 246, 123199. [Google Scholar] [CrossRef]
  14. Xu, Z.-F.; Jia, R.-S.; Liu, Y.-B.; Zhao, C.-Y.; Sun, H.-M. Fast Method of Detecting Tomatoes in a Complex Scene for Picking Robots. IEEE Access 2020, 8, 55289–55299. [Google Scholar] [CrossRef]
  15. Xu, P.; Fang, N.; Liu, N.; Lin, F.; Yang, S.; Ning, J. Visual recognition of cherry tomatoes in plant factory based on improved deep instance segmentation. Comput. Electron. Agric. 2022, 197, 106991. [Google Scholar] [CrossRef]
  16. Yang, H.; Liu, Y.; Wang, S.; Qu, H.; Li, N.; Wu, J.; Yan, Y.; Zhang, H.; Wang, J.; Qiu, J. Improved Apple Fruit Target Recognition Method Based on YOLOv7 Model. Agriculture 2023, 13, 1278. [Google Scholar] [CrossRef]
  17. Song, C.-Y.; Zhang, F.; Li, J.-S.; Xie, J.-Y.; Yang, C.; Zhou, H.; Zhang, J.-X. Detection of maize tassels for UAV remote sensing image with an improved YOLOX Model. J. Integr. Agric. 2023, 22, 1671–1683. [Google Scholar] [CrossRef]
  18. Gao, J.; Zhang, J.; Zhang, F.; Gao, J. LACTA: A lightweight and accurate algorithm for cherry tomato detection in unstructured environments. Expert Syst. Appl. 2024, 238, 122073. [Google Scholar] [CrossRef]
  19. Zhang, J.; Xie, J.; Zhang, F.; Gao, J.; Yang, C.; Song, C.; Rao, W.; Zhang, Y. Greenhouse tomato detection and pose classification algorithm based on improved YOLOv5. Comput. Electron. Agric. 2024, 216, 108519. [Google Scholar] [CrossRef]
  20. Liu, G.; Hu, Y.; Chen, Z.; Guo, J.; Ni, P. Lightweight object detection algorithm for robots with improved YOLOv5. Eng. Appl. Artif. Intell. 2023, 123, 106217. [Google Scholar] [CrossRef]
  21. Sun, C.; Zhang, Y.; Ma, S. DFLM-YOLO: A Lightweight YOLO Model with Multiscale Feature Fusion Capabilities for Open Water Aerial Imagery. Drones 2024, 8, 400. [Google Scholar] [CrossRef]
  22. Wang, Z.; Zhang, S.; Chen, L.; Wu, W.; Wang, H.; Liu, X.; Fan, Z.; Wang, B. Microscopic Insect Pest Detection in Tea Plantations: Improved YOLOv8 Model Based on Deep Learning. Agriculture 2024, 14, 1739. [Google Scholar] [CrossRef]
  23. Zhu, A.; Zhang, R.; Zhang, L.; Yi, T.; Wang, L.; Zhang, D.; Chen, L. YOLOv5s-CEDB: A robust and efficiency Camellia oleifera fruit detection algorithm in complex natural scenes. Comput. Electron. Agric. 2024, 221, 108984. [Google Scholar] [CrossRef]
  24. Wang, M.; Li, Y.; Meng, H.; Chen, Z.; Gui, Z.; Li, Y.; Dong, C. Small target tea bud detection based on improved YOLOv5 in complex background. Front. Plant Sci. 2024, 15, 1393138. [Google Scholar] [CrossRef] [PubMed]
  25. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  26. Vit, A.; Shani, G.; Bar-Hillel, A. Length phenotyping with interest point detection. Comput. Electron. Agric. 2020, 176, 105629. [Google Scholar] [CrossRef]
  27. Weyler, J.; Milioto, A.; Falck, T.; Behley, J.; Stachniss, C. Joint Plant Instance Detection and Leaf Count Estimation for In-Field Plant Phenotyping. IEEE Robot. Autom. Lett. 2021, 6, 3599–3606. [Google Scholar] [CrossRef]
  28. Bai, Y.; Mao, S.; Zhou, J.; Zhang, B. Clustered tomato detection and picking point location using machine learning-aided image analysis for automatic robotic harvesting. Precis. Agric. 2022, 24, 727–743. [Google Scholar] [CrossRef]
  29. Song, Z.; Zhou, Z.; Wang, W.; Gao, F.; Fu, L.; Li, R.; Cui, Y. Canopy segmentation and wire reconstruction for kiwifruit robotic harvesting. Comput. Electron. Agric. 2021, 181, 105933. [Google Scholar] [CrossRef]
  30. Xiang, R.; Zhang, M.; Zhang, J. Recognition for Stems of Tomato Plants at Night Based on a Hybrid Joint Neural Network. Agriculture 2022, 12, 743. [Google Scholar] [CrossRef]
  31. Du, X.; Meng, Z.; Ma, Z.; Zhao, L.; Lu, W.; Cheng, H.; Wang, Y. Comprehensive visual information acquisition for tomato picking robot based on multitask convolutional neural network. Biosyst. Eng. 2024, 238, 51–61. [Google Scholar] [CrossRef]
  32. Li, Y.; Feng, Q.; Liu, C.; Xiong, Z.; Sun, Y.; Xie, F.; Li, T.; Zhao, C. MTA-YOLACT: Multitask-aware network on fruit bunch identification for cherry tomato robotic harvesting. Eur. J. Agron. 2023, 146, 126812. [Google Scholar] [CrossRef]
  33. Han, C.; Lv, J.; Dong, C.; Li, J.; Luo, Y.; Wu, W.; Abdeen, M.A. Classification, Advanced Technologies, and Typical Applications of End-Effector for Fruit and Vegetable Picking Robots. Agriculture 2024, 14, 1310. [Google Scholar] [CrossRef]
  34. Yan, F.; Xu, Y. Improved Target Detection Algorithm Based on YOLO. In Proceedings of the 2021 4th International Conference on Robotics, Control and Automation Engineering (RCAE), Wuhan, China, 4–6 November 2021; pp. 21–25. [Google Scholar]
  35. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640. [Google Scholar]
  36. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 17–18 April 2024; pp. 1–6. [Google Scholar]
  37. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19. [Google Scholar]
  38. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  39. Wang, G.-Q.; Zhang, C.-Z.; Chen, M.-S.; Lin, Y.C.; Tan, X.-H.; Kang, Y.-X.; Wang, Q.; Zeng, W.-D.; Zhao, W.-W. A high-accuracy and lightweight detector based on a graph convolution network for strip surface defect detection. Adv. Eng. Inform. 2024, 59, 102280. [Google Scholar] [CrossRef]
  40. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 21–24 June 2022; pp. 2636–2645. [Google Scholar]
  41. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Why did you say that? arXiv 2017, arXiv:1611.07450. [Google Scholar]
  42. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 14 November 2023).
  43. Du, X.; Meng, Z.; Ma, Z.; Lu, W.; Cheng, H. Tomato 3D pose detection algorithm based on keypoint detection and point cloud processing. Comput. Electron. Agric. 2023, 212, 108056. [Google Scholar] [CrossRef]
Figure 1. Four common tomato-picking robot end-effectors: (a) component; (b) four types of end-effectors.
Figure 2. Images of tomatoes under various conditions. (a) Background complexity: simple and complex. (b) Shooting angle: top and side. (c) Illumination: balanced and unbalanced brightness. (d) Shooting range: close-up and wide-angle. (e) Ripeness complexity: fully ripe and partially ripe. (f) Pose complexity: upright and bent.
Figure 3. Image annotation. (a) Annotated detection bounding boxes and keypoint information; p1: pedicel node; p2: center of the calyx; p3: center of the tomato fruit; p4: bottom of the tomato fruit (tomato navel). (b) Format of the annotation data saved in TXT files.
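For readers reproducing the annotation pipeline, the sketch below shows how one such TXT label line could be parsed, assuming the standard Ultralytics YOLO-pose convention (class id, normalized box, then x/y/visibility triplets for the four keypoints). The example values are illustrative only and are not taken from the dataset.

```python
# Hypothetical sketch: parsing one YOLO-pose-style label line with four keypoints
# (pedicel node, calyx center, fruit center, tomato navel), assuming the standard
# Ultralytics convention "cls cx cy w h kx ky v ..." with values normalized to [0, 1].
def parse_pose_label(line: str, img_w: int, img_h: int):
    values = [float(v) for v in line.split()]
    cls_id = int(values[0])
    # Bounding box: normalized center/size -> pixel coordinates.
    cx, cy, bw, bh = values[1:5]
    box = (cx * img_w, cy * img_h, bw * img_w, bh * img_h)
    # Four keypoints, each stored as (x, y, visibility).
    keypoints = []
    for i in range(5, len(values), 3):
        kx, ky, vis = values[i:i + 3]
        keypoints.append((kx * img_w, ky * img_h, int(vis)))
    return cls_id, box, keypoints

# Example: one fully visible fruit with four keypoints (values are illustrative only).
line = "1 0.52 0.48 0.20 0.25 0.50 0.36 2 0.51 0.39 2 0.52 0.48 2 0.53 0.60 2"
print(parse_pose_label(line, img_w=640, img_h=640))
```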
Figure 4. Data augmentation: (a) original image; (b) four types of data augmentation; (c) combination of the four augmentation methods.
Figure 5. YOLO-TMPPD overall model architecture.
Figure 6. Depthwise Convolution.
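As an illustration of the depthwise convolution shown in Figure 6, the following minimal PyTorch sketch builds a depthwise-separable block; the channel widths, kernel size, and activation are placeholders rather than the exact YOLO-TMPPD configuration.

```python
# Minimal sketch of a depthwise-separable convolution block in PyTorch, for
# illustration only; not the authors' exact backbone module.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch), so the spatial
        # filtering cost no longer scales with the number of output channels.
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, s, k // 2, groups=in_ch, bias=False)
        # Pointwise 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 64, 80, 80)                   # a typical backbone feature map
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```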
Figure 7. Structure of the CARAFE upsampling operator: (a) Kernel Prediction Module; (b) Content-aware Reassembly Module.
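The sketch below illustrates the two CARAFE stages of Figure 7 in simplified form: a kernel prediction branch that generates a per-pixel reassembly kernel, followed by a content-aware weighted sum over each source neighborhood. It is an illustrative re-implementation under assumed hyperparameters, not the authors' or the original CARAFE code.

```python
# A simplified CARAFE-style upsampler: (a) kernel prediction, (b) content-aware
# reassembly. Channel widths and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCARAFE(nn.Module):
    def __init__(self, channels: int, scale: int = 2, k_up: int = 5,
                 k_enc: int = 3, c_mid: int = 64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)                  # channel compressor
        self.encode = nn.Conv2d(c_mid, scale * scale * k_up * k_up,    # content encoder
                                k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # (a) Kernel Prediction Module: one k_up*k_up kernel per upsampled position.
        kernels = F.pixel_shuffle(self.encode(self.compress(x)), self.scale)
        kernels = F.softmax(kernels, dim=1)                            # (b, k_up^2, sH, sW)
        # (b) Content-aware Reassembly Module: gather each source neighborhood...
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)       # (b, c*k_up^2, h*w)
        patches = patches.view(b, c, self.k_up * self.k_up, h, w)
        patches = F.interpolate(  # align each output pixel with its source neighborhood
            patches.view(b, c * self.k_up * self.k_up, h, w),
            scale_factor=self.scale, mode="nearest",
        ).view(b, c, self.k_up * self.k_up, h * self.scale, w * self.scale)
        # ...and combine it with the predicted, content-aware kernel weights.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)

x = torch.randn(1, 128, 40, 40)
print(SimpleCARAFE(128)(x).shape)  # torch.Size([1, 128, 80, 80])
```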
Figure 8. CBAM.
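A minimal CBAM block corresponding to Figure 8 could be written as follows; it follows the channel-then-spatial attention design of Woo et al. [37], with illustrative dimensions rather than the exact layers used in YOLO-TMPPD.

```python
# Minimal CBAM sketch: channel attention followed by spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP over global average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: a 7x7 conv over the channel-wise mean and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)                        # channel attention
        spatial = torch.cat([x.mean(dim=1, keepdim=True),
                             x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial))        # spatial attention

x = torch.randn(1, 256, 20, 20)
print(CBAM(256)(x).shape)  # torch.Size([1, 256, 20, 20])
```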
Figure 9. Comparison of model training process.
Figure 10. Grad-CAM output for YOLOv8 and YOLO-TMPPD.
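Figure 10 was produced with Grad-CAM [41]. The generic sketch below shows how such class-activation maps can be computed with forward/backward hooks; a torchvision backbone stands in for the detector purely for illustration and is not the authors' visualization code.

```python
# Illustrative Grad-CAM sketch using hooks on an arbitrary CNN layer.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
target_layer = model.layer4[-1]
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["value"] = output.detach()

def bwd_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()          # score of the top class
model.zero_grad()
score.backward()

# Weight each activation channel by its average gradient, then ReLU and normalize.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)  # torch.Size([1, 1, 224, 224]) heatmap in [0, 1]
```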
Figure 11. Tomato and picking-point detection results of different models under different conditions. The first column shows the close-range, small-field-of-view image; the second column shows the medium-range, top-view image; and the third column shows the long-range, large-field-of-view image.
Figure 12. A histogram of the Euclidean distance between the keypoint coordinates predicted by different models and their true values: (a) YOLOv8; (b) YOLO-TMPPD.
Figure 12. A histogram of the Euclidean distance between the keypoint coordinates predicted by different models and their true values: (a) YOLOv8; (b) YOLO-TMPPD.
Horticulturae 11 00585 g012
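The histograms in Figure 12 summarize per-keypoint localization error. A sketch of how such an error distribution could be computed and plotted is given below; the arrays are synthetic stand-ins for the model predictions and ground-truth annotations, and the pixel unit is an assumption.

```python
# Sketch: Euclidean distance between predicted and ground-truth keypoints,
# summarized as a histogram (synthetic data for illustration).
import numpy as np
import matplotlib.pyplot as plt

pred_kpts = np.random.rand(500, 4, 2) * 640              # (N fruits, 4 keypoints, x/y)
true_kpts = pred_kpts + np.random.randn(500, 4, 2) * 3   # synthetic ground truth

distances = np.linalg.norm(pred_kpts - true_kpts, axis=-1).ravel()  # per-keypoint error (px)
plt.hist(distances, bins=30, edgecolor="black")
plt.xlabel("Euclidean distance between predicted and true keypoints (px)")
plt.ylabel("Count")
plt.savefig("keypoint_error_hist.png", dpi=200)
```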
Figure 13. Tomato picking details. (a) Original image; (b) harvesting information obtained by multi-ripeness inference post-processing.
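A hypothetical post-processing step in the spirit of Figure 13 is sketched below: detections are filtered by ripeness class and confidence, and the pedicel-node keypoint (p1) is reported as the picking point. The class id, threshold, and result structure are assumptions made for illustration, not the authors' exact pipeline.

```python
# Hypothetical harvest post-processing: keep fruits predicted as ripe and report
# the pedicel-node keypoint (p1) as the picking point.
RIPE_CLASS_ID = 1        # assumed id of the "fully ripe" class
CONF_THRESHOLD = 0.5

def select_picking_targets(detections):
    """detections: list of dicts with 'cls', 'conf', 'box', and 'keypoints' (4 x (x, y))."""
    targets = []
    for det in detections:
        if det["cls"] == RIPE_CLASS_ID and det["conf"] >= CONF_THRESHOLD:
            pedicel_x, pedicel_y = det["keypoints"][0]     # p1: pedicel node
            targets.append({"box": det["box"], "picking_point": (pedicel_x, pedicel_y)})
    return targets

example = [{"cls": 1, "conf": 0.91, "box": (120, 80, 260, 230),
            "keypoints": [(188, 74), (190, 92), (190, 155), (191, 222)]}]
print(select_picking_targets(example))
```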
Figure 14. Detection results of the YOLO-TMPPD model deployed on a Raspberry Pi within the tomato picking system developed with PySide6.
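For deployment as in Figure 14, a trained pose model can be loaded through the Ultralytics API and run on single frames inside the GUI loop; the sketch below uses placeholder file names rather than the authors' released weights, and omits the PySide6 interface code.

```python
# Minimal deployment sketch: load a trained pose model with the Ultralytics API
# and run inference on one frame (file names are placeholders).
from ultralytics import YOLO

model = YOLO("yolo_tmppd_best.pt")                         # placeholder weight path
results = model.predict("greenhouse_frame.jpg", conf=0.5)

for r in results:
    print(r.boxes.xyxy)        # detected tomato boxes
    print(r.keypoints.xy)      # four keypoints per fruit, incl. the picking point
```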
Table 1. Evaluation indicators.
| Items | Formulas |
| --- | --- |
| Parameters | K_h × K_w × C_in × C_out |
| GFLOPs | (K_h × K_w × C_in × C_out × H × W) / 10^9 |
| Weights | Model size |
| P | TP / (TP + FP) |
| R | TP / (TP + FN) |
| F1 | (2 × P × R) / (P + R) |
| FPS | 1 / (processing time per frame) |
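A short worked example of the Table 1 indicators, assuming TP/FP/FN counts have already been obtained by matching detections to ground truth (the counts below are illustrative, not results from the paper):

```python
# Worked example of the precision, recall, F1, and FPS definitions from Table 1.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

tp, fp, fn = 940, 27, 61                       # illustrative counts
p, r = precision(tp, fp), recall(tp, fn)
print(f"P={p:.4f}  R={r:.4f}  F1={f1_score(p, r):.4f}")

# FPS is the reciprocal of the average per-frame processing time.
avg_time_s = 0.003
print(f"FPS={1 / avg_time_s:.1f}")
```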
Table 2. Experiment of hardware environment parameters.
| Accessories | Model |
| --- | --- |
| Operating system | Ubuntu 20.04.6 LTS |
| CPU | Intel(R) Core(TM) i7-13700K |
| RAM | 128 GB |
| GRAM | 24 GB |
| GPU | NVIDIA GeForce RTX 4090 |
| Development environments | Python 3.9.18, torch 2.0.1 + cu117 |
Table 3. Computational performance of ablation experiments with different convolutional improvement models.
| Model | Parameters (M) | GFLOPs | Weights (MB) | mAP-kp@0.5 (%) |
| --- | --- | --- | --- | --- |
| YOLOv8 | 11.42 | 29.6 | 22 | 93.12 |
| YOLO-DWConv | 9.87 | 25.9 | 19 | 92.57 |
| YOLO-DSConv | 11.38 | 107.7 | 22.1 | 81.73 |
| YOLO-AKConv | 10.95 | 28.7 | 21.88 | 90.67 |
| YOLO-GhostConv | 10.65 | 27.8 | 20.5 | 92.44 |
| YOLO-GSConv | 11.28 | 29.2 | 21.7 | 90.68 |
Table 4. Model ablation results.
| Model | +DWConv | +CBAM | +CARAFE | Parameters (M) | GFLOPs | Weights (MB) | mAP-kp@0.5 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv8 | - | - | - | 11.42 | 29.6 | 22 | 93.12 |
| YOLO-TMPPD(v1) | ✓ | - | - | 9.86 | 25.9 | 19 | 92.57 |
| YOLO-TMPPD(v2) | ✓ | ✓ | - | 10.27 | 26.3 | 19.8 | 95.84 |
| YOLO-TMPPD(v3) | ✓ | - | ✓ | 10.03 | 26.2 | 19.3 | 94.73 |
| YOLO-TMPPD(v4) | ✓ | ✓ | ✓ | 10.44 | 26.6 | 20.1 | 97.55 |
Table 5. Detection results of different models.
| YOLO Model | Parameters (M) | GFLOPs | Weights (MB) | P (%) | R (%) | F1 (%) | mAP-kp@0.5 (%) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv3-pose | 104.8 | 286.9 | 200 | 92.63 | 87.11 | 89.79 | 92.89 | 129.8 |
| YOLOv5-pose(s) | 9.41 | 25 | 18.2 | 88.93 | 93.36 | 91.09 | 93.45 | 355.4 |
| YOLOv6-pose(s) | 16.37 | 44.5 | 31.4 | 93.21 | 87.48 | 90.25 | 93.04 | 229.1 |
| YOLOv8-pose(s) | 11.42 | 29.6 | 22 | 88.36 | 90.74 | 89.54 | 93.12 | 298.2 |
| YOLOv9-pose(s) | 18.51 | 69.4 | 37.6 | 88.78 | 89.19 | 88.98 | 91.06 | 244.5 |
| YOLOv10-pose(s) | 9.2 | 26.1 | 21.9 | 87.73 | 89.19 | 88.45 | 92.47 | 371.1 |
| YOLO-TMPPD | 10.44 | 26.6 | 20.1 | 97.26 | 93.89 | 94.02 | 97.55 | 336.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

