Article

Research on High-Precision Target Detection Technology for Tomato-Picking Robots in Sustainable Agriculture

1 College of Biological and Agricultural Engineering, Jilin University, Changchun 130022, China
2 School of Foreign Language and Cultures, Jilin University, Changchun 130012, China
3 Jilin Provincial Key Laboratory of Smart Agricultural Equipment and Technology, Jilin University, Changchun 130022, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(7), 2885; https://doi.org/10.3390/su17072885
Submission received: 7 February 2025 / Revised: 16 March 2025 / Accepted: 20 March 2025 / Published: 24 March 2025

Abstract

Robotic tomato picking is a crucial step toward mechanized and precision farming. Effective tomato recognition and localization algorithms for these robots require high accuracy and real-time performance in complex field environments. This study modifies the SSD model to develop a fast and high-precision tomato detection method. The classical SSD model is optimized by discarding certain feature maps intended for other object scales and incorporating a self-attention mechanism. Experiments utilized images from an organic tomato farm. The model was trained and evaluated in terms of detection accuracy, recall rate, time consumption, and model size. Results indicate that the modified SSD model achieves 95% detection accuracy and a 96.1% recall rate, outperforming the classical SSD model and the classical SSD model with self-attention in accuracy, time consumption, and model size. Field experiments also demonstrate its robustness under different illumination conditions. In conclusion, this study advances the development of tomato-picking robots by presenting an optimized detection method that balances accuracy and efficiency: it markedly improves detection accuracy while reducing model complexity, making it well suited to real-world deployment and facilitating the adoption of robotic harvesting systems in modern agriculture. By boosting picking efficiency, lessening reliance on human labor, and cutting fruit losses through precise picking, the method improves resource utilization and provides a practical solution for the development of sustainable agriculture.

1. Introduction

According to statistics from the Food and Agriculture Organization, tomato, the world's second-largest vegetable crop, occupies a pivotal position in the vegetable industry, shows high economic value, and accounts for 40% of global vegetable production. Tomato harvesting, an arduous and time-consuming procedure, has long been predominantly reliant on manual labor. Nevertheless, this approach has become unsustainable and inefficient within the modern agricultural context. The manual process not only incurs high labor costs but also exposes workers to physical strain and inconsistencies in harvest quality [1]. In the wake of technological advancements and the escalating demand for mechanized and precision agriculture, the dependence on manual tomato harvesting runs counter to the future course of agricultural development [2].
The transition to robotic tomato picking signifies a substantial advancement in resolving the inefficiencies associated with manual harvesting. Tomato-picking robots promise to revolutionize the agricultural industry by automating the harvesting process, thereby drastically reducing labor costs and alleviating the physical burden on workers [3]. Moreover, these robots possess the potential to augment the consistency and efficiency of tomato harvesting, thus guaranteeing a higher-quality and more dependable yield [4]. The integration of robotic systems into agricultural operations coincides with the broader trend of mechanization and precision farming, aiming to optimize resource utilization, maximize crop production, and foster sustainable agricultural practices [5].
A crucial factor in the development of efficient tomato-picking robots is the sophistication of the tomato recognition and localization algorithms they employ [6,7]. These algorithms must possess the capability to accurately detect tomatoes amidst complex field environments, distinguishing them from other objects such as leaves, branches, and soil. Additionally, they must precisely determine the spatial positions of the tomatoes to enable the robot to efficiently and accurately pick them without causing damage [8]. The performance of these algorithms is of critical importance since it directly impacts the overall efficacy and reliability of the robot in tomato harvesting. In this regard, advanced machine learning and computer vision techniques play a vital role in enhancing the performance of these algorithms.
Despite the remarkable progress within the realm of computer vision and object detection, existing target detection algorithms encounter numerous challenges when applied to tomato harvesting. One major issue is occlusion, where tomatoes may be partially or fully obscured by leaves, branches, or other tomatoes [9,10]. Dong et al. proposed Ellipse R-CNN, a CNN-based elliptical detector, to address the difficulty of segmenting severely occluded objects, such as fruit clusters on trees in cluttered scenes, where bounding boxes obtained from multiple views are unreliable and the 3D sizes and 6D poses of individual objects are therefore hard to recover [11]. Another challenge stems from varying illumination conditions that can influence the visibility and appearance of tomatoes, thereby increasing the difficulty of detection [12]. Tang et al. focused on object detection under changing environmental conditions; aiming to identify olives with mobile cameras under natural light, they studied the impact of image preprocessing methods on the YOLOv7 model [13]. Furthermore, the need for real-time processing capability poses an additional constraint, as tomatoes must be detected and localized swiftly for the robot to function efficiently [14,15]. Tomato harvesting demands high throughput; although the above models can cope with occlusion, changing environmental conditions, and other challenges to a certain extent, they have inherent limitations and struggle to fully meet the demands of fast, efficient tomato harvesting.
These challenges underline the necessity for the development of enhanced detection methods specifically designed for tomato-picking robots. Moreover, the robustness of the algorithm against different tomato varieties and growth stages should also be considered, as these factors can introduce additional variability in the appearance and characteristics of the tomatoes. The optimal solution should integrate high accuracy with real-time performance, guaranteeing that tomatoes can be dependably detected and localized even under unfavorable field conditions. Moreover, the algorithm ought to be computationally efficient to reduce the robot’s energy consumption and prolong its operational lifespan [16].
In the field of tomato-harvesting applications, real-time and accurate object detection is of utmost importance. When detecting images with a resolution of 1024 × 720, the detection speeds of common models differ markedly: the SSD model reaches 4.53 fps, the YOLO model 2.23 fps, and Faster R-CNN only 0.58 fps, so SSD is significantly faster than the other two models [17]. Its architecture is simple and its inference is fast, making it well suited to application scenarios with strict real-time requirements. In addition, SSD has anti-interference capability and can learn from the dataset itself, which is conducive to detecting tomatoes of different sizes in the field [18]. In response to these requirements, this research puts forward a rapid and high-precision tomato detection methodology for tomato-picking robots. Through the modification and optimization of existing deep-learning models, particularly the Single-Shot Multi-Box Detector (SSD), this research endeavors to overcome the limitations of current algorithms and augment their performance within the context of tomato-harvesting applications [19]. The pivotal modifications entail discarding specific feature maps within the SSD model to concentrate on the detection of larger objects, such as ripe tomatoes, and integrating a self-attention mechanism to enhance the model's capacity for focusing on relevant features [20]. This integration empowers the model to discern and focus on the most relevant features, thereby enhancing its discriminative capabilities in complex scenarios.
The proposed method not only aims to improve detection accuracy but also seeks to reduce computational complexity and ensure robust performance under various field conditions. By focusing on larger objects and utilizing the self-attention mechanism, the modified SSD model is expected to achieve faster inference and a smaller model size than traditional SSD models, making it more suitable for deployment on tomato-picking robots [21]. Some researchers [22,23] have attempted alternative approaches, yet a comprehensive solution that handles these multiple challenges simultaneously is still lacking.
Furthermore, this study conducts comprehensive experiments to evaluate the performance of the proposed method. By comparing the modified SSD model with both the classical SSD model and the SSD model with a self-attention mechanism, this study aims to demonstrate the advantages of the proposed approach in terms of accuracy, recall rate, time consumption, and model size. Additionally, field experiments are conducted to test the real-world performance of the tomato detection model under different illumination conditions and at various times of the day.
In summary, this study contributes to the advancement of tomato-picking robots by proposing a fast and high-precision tomato detection method. By addressing the limitations of current target detection algorithms and optimizing the SSD model for tomato harvesting applications, this study aims to pave the way for the widespread adoption of robotic harvesting systems in modern agriculture. Agricultural automation aims to optimize resource use, reduce environmental impact, and increase overall productivity. By identifying and picking tomatoes more precisely, the method in this study could reduce waste and increase the yield of harvested tomatoes, ultimately improving the economic viability of tomato cultivation. At the same time, reducing resource waste can improve the overall efficiency of resource use and promote the transformation of agricultural production into a sustainable model.

2. Materials and Methods

2.1. Modified Deep Learning Model of Single-Shot Multi-Box Detector

The deep learning framework proposed in this study was based on the Single-Shot Multi-Box Detector (SSD). The classical SSD model has six feature maps, namely conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2, as shown in Figure 1. The earlier feature maps are relatively large and are better at detecting relatively small objects, whereas the later feature maps are relatively small and were designed to detect relatively large objects [24]. Since this study is application-oriented and the camera deliberately shoots the tomato plants, the tomatoes are the largest objects in the image; the other objects are irrelevant to this task. Therefore, as shown in Figure 1, this study modified the classical SSD model by discarding the middle four feature maps, namely conv7, conv8_2, conv9_2, and conv10_2. Discarding these four feature maps significantly reduces the computational load and thereby improves the detection speed of the model. Partially discarding the middle layers, for example dropping only conv7 and conv8_2 while retaining conv9_2 and conv10_2, can reduce the computational load to a certain extent but cannot fully exploit the advantages of model simplification and the focus on large-object detection. The retained conv9_2 and conv10_2 layers are not suited to the large-object detection task in this study, so such a configuration is suboptimal and merely increases model complexity. In addition, this partial-discard scheme would break the consistency and integrity of feature extraction, interfere with feature fusion, and thus degrade model performance. The conv4_3 layer carries early features extracted by the model, which contain basic information about the image, and is therefore essential. Discarding the four middle feature maps affects the receptive field and feature resolution of the model: the remaining conv4_3 and conv11_2 maps take over the task of capturing the global features of larger objects such as tomatoes. All other configurations follow Liu et al. [25].
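To make the truncated architecture concrete, the following is a minimal, hypothetical sketch in tf.keras. The backbone layer name, the stand-in layers for conv11_2, the number of classes, and the number of default boxes per cell are illustrative assumptions, not the authors' exact configuration; the point is only that the middle SSD maps are never built and detection heads are attached to the two retained maps.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 2      # tomato + background (assumption)
BOXES_PER_CELL = 4   # default boxes per feature-map cell (assumption)

backbone = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))
conv4_3 = backbone.get_layer("block4_conv3").output   # shallow map, retained

# Extra layers standing in for conv11_2, the deep coarse map that is retained;
# the middle SSD maps (conv7-conv10_2) are simply never built.
x = layers.Conv2D(256, 1, activation="relu")(backbone.output)
conv11_2 = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)

def detection_head(fmap, name):
    # Per-location class scores and box offsets, as in SSD.
    cls = layers.Conv2D(BOXES_PER_CELL * NUM_CLASSES, 3, padding="same",
                        name=f"{name}_cls")(fmap)
    loc = layers.Conv2D(BOXES_PER_CELL * 4, 3, padding="same",
                        name=f"{name}_loc")(fmap)
    return cls, loc

outputs = [t for fmap, nm in [(conv4_3, "conv4_3"), (conv11_2, "conv11_2")]
           for t in detection_head(fmap, nm)]
model = Model(backbone.input, outputs)
model.summary()
```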

2.2. Self-Attention Mechanism

Normally, an attention mechanism computes attention between query vectors and input vectors [26]. Since the input vectors here are feature maps and object detection is not a sequence task, this study generated the query vectors directly from the input vectors. As shown in Figure 2, the input feature map is composed of several data vectors, and transformation matrices map the input data vectors into the corresponding spaces: the query space, the key space, and the value space. In this step, the projection of each input feature-map vector into the query space is obtained and, by the same procedure, its projection into the key space is obtained as well; Equations (1)–(3) formalize this description. Subsequently, a scaled dot product was computed between the query and key projections, and normalization was performed using the Softmax function, from which the attention distribution of any input vector can be calculated. In the last step, the weighted average of all the value-space vectors was computed according to this attention distribution, yielding the attention vector for any vector in the input feature map; in the same way, the final attention vectors for the whole input feature map are obtained.
$$q_i = h_i W_q, \qquad q_{i+1} = h_{i+1} W_q, \qquad Q = H W_q \tag{1}$$
$$k_i = h_i W_k, \qquad k_{i+1} = h_{i+1} W_k, \qquad K = H W_k \tag{2}$$
$$v_i = h_i W_v, \qquad v_{i+1} = h_{i+1} W_v, \qquad V = H W_v \tag{3}$$
where i denotes the index of a vector within a space; Wq, Wk, and Wv are the transformation matrices mapping inputs into the query space, key space, and value space; q, k, and v represent the vectors in the query space, key space, and value space, respectively; and H and h denote the input feature map and an individual vector of it, respectively.
The goal of the self-attention mechanism can be summarized in three steps: first, mapping the raw input vectors into the query space, key space, and value space accordingly; second, obtaining the degree of attention that the model pays to the input information at the current location; and third, weighting and averaging the value-space vectors according to the attention distribution at that location to obtain the attention vector of the input vector at that location.
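As a worked illustration of Equations (1)–(3) and the subsequent scaling, Softmax normalization, and weighted averaging, the following NumPy sketch computes self-attention over a toy feature map; all matrix shapes and values are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_q, W_k, W_v):
    # Project the input feature-map vectors into the three spaces, Eqs. (1)-(3).
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    # Scaled compatibility scores between every pair of positions.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = softmax(scores, axis=-1)        # attention distribution per position
    return A @ V                        # weighted average of the value vectors

rng = np.random.default_rng(0)
H = rng.standard_normal((16, 64))       # 16 positions, 64-dim features (toy sizes)
W_q, W_k, W_v = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
print(self_attention(H, W_q, W_k, W_v).shape)   # (16, 64)
```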
Finally, the self-attention mechanism was added to the modified SSD model, as shown in Figure 3. The input vectors of the self-attention mechanism are the feature maps of conv4_3 and conv11_2. In the SSD model, the conv4_3 layer carries low-level image information, and after the introduction of the self-attention mechanism, the model can focus more precisely at this stage on the local regions closely related to the target object (the tomato). By computing the correlation between the feature vectors at each position of the feature map, the feature representation of each position is re-weighted, which strengthens the capture of subtle features and greatly improves the model's attention to key information. The conv11_2 layer is mainly responsible for detecting large objects; with the self-attention mechanism, the model can integrate global information more efficiently over a large receptive field and attend fully to the overall features of large objects in the image and their association with the surrounding environment. Using this mechanism, the model can accurately locate target objects in complex scenes, significantly enhance the discriminative power of target features, and improve the accuracy and reliability of large-object detection.

2.3. Generating Detecting Model Based on Deep Learning

On 22 May 2023, 1000 images were shot at random in an organic tomato farm according to the following protocol: the camera (Canon EOS R6, Canon, Tokyo, Japan) was triggered by a remote electronic shutter (Canon G7X3, Canon, Japan) and carried on a handheld stabilizer (model OM SE, Shenzhen Dajiang (DJI), Shenzhen, China). The main optical axis was parallel to the horizontal direction and, in order to simulate the picking robot's scanning path, perpendicular to the tomato rows. After image acquisition, the images were transferred to a PC and labeled for the deep learning process. Using the annotation tool LabelImg (version 1.7), each target object was enclosed in a rectangular box and a corresponding tag file with the suffix '.xml' was generated. In this step, we attended only to the tomatoes in the planting rows nearest the camera; in other words, tomatoes farther along the depth direction were discarded. Since the images contained immature tomatoes (green), ripe tomatoes (red), and tomatoes between immature and ripe (shallow red), we labeled all of them provided they belonged to the planting rows near the camera and exceeded 3.5 cm in diameter, as shown in Figure 4. The dataset used in this study therefore exhibits a certain degree of diversity. The images were then randomly split into training, evaluation, and testing sets in the proportion 8:1:1 and preprocessed as follows. First, all images were resized to a consistent 224 × 224 pixels using bilinear interpolation to meet the input requirements of the model; then, the pixel values of each image were divided by 255 to normalize them to the range [0, 1]. For the training set only, data augmentation techniques were employed, including randomly rotating the images by ±15 degrees, performing horizontal flips, and adding a small amount of Gaussian noise, to increase the diversity of the dataset and improve the generalization ability of the model.
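The preprocessing and augmentation pipeline described above can be sketched with OpenCV and NumPy as follows. The resize target, [0, 1] normalization, ±15° rotation range, and horizontal flip follow the text; the noise standard deviation, flip probability, and the synthetic stand-in image are assumptions for illustration.

```python
import cv2
import numpy as np

def preprocess(img):
    # Bilinear resize to 224x224 and normalization of pixel values to [0, 1].
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
    return img.astype(np.float32) / 255.0

def augment(img, rng):
    # Training-set-only augmentation: random ±15° rotation, horizontal flip, noise.
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-15, 15), 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)                       # horizontal flip
    noise = rng.normal(0.0, 0.02, img.shape)         # small Gaussian noise (assumed sigma)
    return np.clip(img + noise, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(42)
frame = rng.integers(0, 256, (480, 640, 3), dtype=np.uint8)  # stand-in for a field image
x = augment(preprocess(frame), rng)
print(x.shape, x.min(), x.max())                     # (224, 224, 3), values within [0, 1]
```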
The training configuration was as follows. The deep learning framework TensorFlow (version 1.11.0) was adopted, with the modified SSD model used as the pre-trained model. The operating system was Ubuntu 20.04, the memory was 16 GB, the processor was an Intel® Core™ i7-7700K CPU @ 4.00 GHz × 8, and the graphics processing unit (GPU) was an NVIDIA RTX 2080 Ti. Python (version 3.6.5) together with OpenCV (version 3.4.2) was used for programming, and the training batch size was set to 12.

2.4. Experiments and Statistical Analysis

On 25 May 2023, three days after the acquisition of the training images, a field experiment was conducted to test the tomato detection accuracy. The experimental location remained the same organic farm, but a completely different greenhouse was used. The same hardware was applied, and image acquisition followed the same protocol as for the training images. The experiment lasted three days, and the experimental images were obtained in the morning, at noon, and in the afternoon, respectively, with 100 images obtained in each period. This study used detection accuracy and recall rate to evaluate detection performance; they are calculated as follows.
$$\text{Accuracy rate} = \frac{TP}{TP + FP} \times 100\% \tag{4}$$
$$\text{Recall rate} = \frac{TP}{TP + FN} \times 100\% \tag{5}$$
TP (True Positive) is the number of tomatoes correctly detected by the model. For example, accurately identified ripe tomatoes fall into this category. FP (False Positive) refers to non-tomatoes misclassified as tomatoes, like misidentifying leaves as tomatoes. FN (False Negative) means actual tomatoes that the model fails to detect, such as those hidden by occlusion.
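For instance, the two rates can be computed directly from these counts; the counts below are hypothetical and chosen only to illustrate Equations (4) and (5).

```python
def accuracy_rate(tp, fp):
    # Equation (4): fraction of detections that are real tomatoes.
    return tp / (tp + fp) * 100.0

def recall_rate(tp, fn):
    # Equation (5): fraction of real tomatoes that were detected.
    return tp / (tp + fn) * 100.0

tp, fp, fn = 95, 5, 4                               # hypothetical detection counts
print(f"accuracy = {accuracy_rate(tp, fp):.1f}%")   # 95.0%
print(f"recall   = {recall_rate(tp, fn):.1f}%")     # 96.0%
```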
To isolate the effect of each modification in this study, an ablation experiment was conducted with the following configuration: the experimental model was the model designed in this study, and the control models were the classical SSD model and the classical SSD model with the self-attention mechanism.

3. Results

The modified SSD model yielded a tomato detection model with an accuracy of ~95%; the TensorBoard utility in TensorFlow recorded the accuracy curve during training. Although the training ran for 1000 epochs, the model began to converge after 300 epochs. When training finished, the tomato detection model was obtained; deployed in the experiment, it detected the tomatoes successfully, as shown in Figure 5.
The mean accuracy and mean recall rate were 94.77% and 96.1%, respectively. To assess the stability of the model, the training process was run five times independently; the standard deviations of the accuracy and recall rate across these runs were 0.85% and 0.62%, respectively, indicating relatively stable performance, with 95% confidence intervals of [93.92%, 95.62%] for accuracy and [95.48%, 96.72%] for recall rate. The experimental accuracy is slightly lower than the training accuracy. Ablation experiments are crucial for understanding the individual contributions of the modifications to overall performance. To investigate the distinct effects of discarding the middle four feature maps and adding the self-attention mechanism, this study conducted an ablation experiment; the results are shown in Table 1. In terms of accuracy rate and recall rate, the modified SSD model holds a significant advantage (p < 0.05) over the classical SSD model, increasing them by 3.3% and 0.42%, respectively. Compared with the classical SSD model with the self-attention mechanism, the modified SSD model holds a significant advantage (p < 0.05) in time consumption and model size, reducing them by 21.62% and 21.32%, respectively.
The detection results of the three models are presented in Figure 6. The modified SSD model with the self-attention mechanism detected the target tomatoes. The classical SSD model with a self-attention mechanism detected not only the target tomatoes but also tomatoes smaller than those we had labeled. The classical SSD model exhibited the poorest detection performance: it detected only a portion of the target tomatoes and, moreover, yielded a false positive, misidentifying a branch as a tomato.

4. Discussion

4.1. Overall Judgement on the Modified SSD Model

Table 1 shows that the tomato detection model reached an accuracy of ~95%, which is sufficient for the intended application. Three years ago, only one study focused on immature tomato detection; its model was a Faster Region-based Convolutional Neural Network (R-CNN) with ResNet-101, and its average precision reached 87.83% [28]. Our study introduces an attention mechanism, which is a major contributor to the improved detection accuracy. A study by Lawal reported extremely high accuracies, all above 98%, for three modified YOLOv3 models [29]. This study, by contrast, shifts from precision-oriented to application-oriented detection: a tomato-picking robot does not need to detect every tomato, and only tomatoes that are big enough and reachable by the mechanical manipulator are worth detecting. We therefore sacrificed some accuracy to save recognition time. Although Lawal's models achieved more than 98% accuracy, they consumed more than 48 ms per image [29], whereas the time consumption in our study is only 29 ms per image (Table 1). In fact, if accuracy were the only concern, a high accuracy rate would not be hard to pursue: more than ten years ago, detection accuracy already reached approximately 94% in a greenhouse environment [30]. Because the tomato is round and clearly distinct from the background, close-up shots can readily exceed 94% detection accuracy. In summary, this study is application-oriented: the tomatoes were photographed at random distances rather than in close-up, so our setting is closer to practical application, and although numerous studies have surpassed 94% accuracy, this study simultaneously demonstrates an efficiency advantage.

4.2. The Unique Effects of Modification

Figure 6 demonstrates that the classical SSD model with a self-attention mechanism can detect objects smaller than those we labeled, but such smaller objects are not the intended targets of our study, since in real applications the picking robot should pick only ripe tomatoes or tomatoes big enough to undergo artificially accelerated ripening after harvesting [31]. Typically, ripe tomatoes are characterized by specific color, texture, and firmness criteria, and these characteristics can be further incorporated into the recognition model to enhance its selectivity for appropriate picking targets. Although more refined recognition has its own advantages, this capability is unnecessary in real application scenarios; on the contrary, if tomatoes of all kinds were detected, it would be difficult to decide which one should be picked. As stated by Bechar, relevant research should focus on field use, which should be cost-effective and highly efficient [32].
The modified SSD model emphasizes relatively large objects, which has the further advantage of ignoring non-target tomatoes in other planting rows, as shown in Figure 7. Owing to perspective, whereby near objects appear large and far objects small, tomatoes growing in far-away rows appear relatively small in the image, so the modified SSD model discards them, saving computation time. In fact, the tomato-picking robot's manipulator cannot reach the far-away rows, so detecting those tomatoes is useless in real application scenarios. Another application-oriented study adopted a similar design strategy [33].
The ablation results show that discarding the four middle feature maps and adding the self-attention mechanism benefited the accuracy rate, recall rate, time consumption, and model size compared with the other two models. The classical SSD model has six feature layers, Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2, whose sizes decrease progressively [34]. The mechanism-level explanation is as follows: all six feature maps are linked to the final output layer to realize target detection at different scales, with the shallow, high-resolution feature maps responsible for detecting relatively small objects and the deep, low-resolution feature maps responsible for detecting relatively large objects [35]. The classical SSD model is universally applicable, but in a specific target-oriented scenario there is no need to care about irrelevant objects. Here, the detection target was clear, namely the tomatoes, and the tomatoes were the larger objects in the images compared with other objects. Thus, this study abandoned the intermediate feature maps used for detecting relatively small objects and linked only Conv4_3 and Conv11_2 to the self-attention mechanism. Although Conv4_3 is a shallow feature map, it is linked to the normalization layer, so it was retained; and since the explicit purpose of this study is to detect big-enough tomatoes, the deep feature map used for detecting relatively large objects was retained on purpose. Figures 6 and 7 show that this design avoids interference from small tomatoes, including tomatoes growing in other planting rows. In addition, the modified SSD model reduced the time consumption and model size significantly (p < 0.05) (Table 1); these gains should be ascribed to the abandonment of the middle four feature maps.
As down-sampling decreases the feature-map size, the convolutional kernels place more emphasis on global image features [36], while introducing the attention mechanism enhances the focus on local features [37]. In other words, the self-attention mechanism puts more emphasis on what is currently of concern, and this research designed the data-flow logic to be target-feature-oriented. Figure 3 shows that the input vectors of the self-attention mechanism are the feature maps derived from the feedforward backbone, Visual Geometry Group 16 (VGG16), which have already extracted the relevant tomato features. Combined with the self-attention mechanism, the data flow from these input vectors concentrates more on the tomatoes than on the other features in the images. Therefore, in the class and location stage (Figure 3), the model gives more accurate prediction results, as verified by the ablation experiments (Table 1).

4.3. Limitations

Occlusion is the major limitation influencing the accuracy rate and recall rate. Figure 8 presents an example in which a leaf covers nearly one-half of a tomato; although this tomato is riper than the others and its features are macroscopically more distinctive, it could not be recognized successfully. It should be mentioned that we did label such tomatoes during the labeling process, but it is very difficult to detect tomatoes whose surfaces are largely covered by leaves, branches, or other tomatoes. Since machine vision relies on analyzing digital images to extract distinctive features for object detection [38], there is currently no appropriate solution when too few features are exposed to the camera. Multi-view perception offers a possible way to deal with occlusion effectively [39]: the target scene is imaged from many different angles to obtain image information of the target from each of them, and because the occlusion conditions differ across viewing angles, fusing the information from multiple views yields more comprehensive target features and reduces the influence of occlusion on target detection. In practice, this requires adding multiple cameras in order to capture images from different angles.
Another obvious limitation is unbalanced illumination, as shown in Figure 9. Although the tomato is big enough and has no canopy above it, the method proposed in this study still could not recognize it successfully. Color is a distinctive feature used in image processing, but when the illumination is unbalanced, or extremely strong or extremely weak, the color presented in the image does not match the actual color; this is a major cause of recognition failure in this study. Besides occlusion and unbalanced illumination, other factors also affect the model's performance. Tomato surface diseases can change the fruit's appearance, making accurate identification difficult: when tomatoes are infected with diseases such as early blight or powdery mildew, the lesions disrupt the normal color and texture features on which the model relies for detection. Additionally, severe fruit overlapping can deform apparent shapes and confuse the model, leading it to misclassify or miss some tomatoes.
Nevertheless, this issue could potentially be addressed through multi-task learning in future research. Multi-task learning can enhance model performance by sharing features among related tasks [40]. Moreover, by jointly learning these tasks, the model may generalize better to unseen scenarios, as it learns to disentangle the various factors that affect tomato identification; this is particularly crucial in real-world applications where lighting and maturity conditions vary widely.
For a tomato-picking robot's model, the identification of tomatoes under diverse lighting conditions and at varying degrees of maturity can be regarded as highly correlated tasks and thus learned jointly. For instance, a model founded on a convolutional neural network (CNN) can be constructed [41] in which the initial layers share parameters to extract common features from the image, while subsequent branches dedicated to different lighting conditions and maturity levels learn their respective unique features. This architecture enables joint learning, enhancing the model's proficiency in identifying tomatoes within complex scenarios; a minimal sketch of such a shared-trunk model is given below.
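The following tf.keras sketch illustrates the shared-trunk, two-branch design described above. The layer sizes, the three illumination classes, and the loss choices are illustrative assumptions; the three maturity classes follow the labeling scheme in Section 2.3 (immature, intermediate, ripe).

```python
from tensorflow.keras import layers, Model

inputs = layers.Input((224, 224, 3))
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
shared = layers.GlobalAveragePooling2D()(x)   # shared trunk: common features

# Branch 1: illumination condition (e.g., morning/noon/afternoon - assumption).
illum = layers.Dense(64, activation="relu")(shared)
illum_out = layers.Dense(3, activation="softmax", name="illumination")(illum)

# Branch 2: maturity level (immature / intermediate / ripe, per Section 2.3).
matur = layers.Dense(64, activation="relu")(shared)
matur_out = layers.Dense(3, activation="softmax", name="maturity")(matur)

model = Model(inputs, [illum_out, matur_out])
model.compile(optimizer="adam",
              loss={"illumination": "categorical_crossentropy",
                    "maturity": "categorical_crossentropy"})
model.summary()
```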
In the context of robotic tomato harvesting, various environmental factors influence the efficiency and accuracy of the process; occlusion and unbalanced illumination are two crucial ones. Comparing the two, unbalanced illumination causes real production trouble, for example missed picking. Occlusion, by contrast, causes little real production trouble: when a picking robot is working, the manipulator cannot bypass branches or leaves with current technology, since grasping tomatoes and leaves at the same time remains genuinely hard in the field of mechanical manipulation; however, if the occlusion is caused solely by other tomatoes, then once the outer tomato is picked, the inner tomato is exposed, so such occlusion has minimal influence on real robot harvesting.

5. Conclusions

The primary objective of this study was to develop a fast and high-precision tomato detection method tailored for tomato-picking robots. By modifying the Single-Shot Multi-Box Detector (SSD) model, we eliminated unnecessary feature maps and integrated a self-attention mechanism to augment the model's performance. Quantifiable results demonstrate that the modified SSD model achieved an accuracy of approximately 95% and a recall rate of 96.1% in tomato detection tasks, outperforming both the classical SSD model and the SSD model with a self-attention mechanism in terms of accuracy, recall rate, and model efficiency. Furthermore, the ablation experiments revealed significant improvements in time consumption and model size, with reductions of 21.62% and 21.32%, respectively. These results confirm that our modifications effectively prioritized the detection of larger, ripe tomatoes while minimizing computational overhead. Despite the challenges presented by occlusion and unbalanced illumination, our study underlines the potential of the modified SSD model for real-world applications in robotic tomato harvesting.

In conclusion, our research not only facilitates the progress of tomato-picking robots but also offers valuable insights into the optimization of deep-learning models for agricultural automation. Future research could focus on further enhancing the model's adaptability to different tomato cultivars and more complex environmental conditions, thereby expanding the application scope of robotic tomato harvesting systems. For the occlusion problem, future work could explore more sophisticated occlusion-aware feature-extraction techniques; for example, in video-based tomato detection for robotic harvesting, temporal information can be incorporated: if the robot carries a camera that captures consecutive frames, algorithms can track the movement of tomatoes and predict the appearance of occluded parts from previous and current frames. Regarding illumination variability, future work could develop methods to extract illumination-invariant features from tomato images, for instance by training the model to learn features that are independent of the absolute light intensity and instead focus on relative color and texture information.

For actual deployment in a commercial agricultural environment, the robotic tomato harvesting system should be designed in a modular fashion, so that each component, such as the detection module, the robotic arm, and the navigation module, can be easily replicated and integrated into a larger system. For example, if a farm wishes to expand its robotic harvesting operations, new robots with the same or enhanced detection capabilities (based on the improved SSD model) can be added to the existing fleet without a major redesign.

To sum up, this research is of great significance for the sustainable development of agriculture. The precise operation of robots reduces damage to fruits; the accurate identification of fruits optimizes resource allocation, mitigates the negative environmental impact of agricultural production, and promotes the transformation of agriculture toward green and sustainable development.

Author Contributions

Conceptualization, K.S., S.C., G.W., J.Q., X.G., M.X. and Z.Z.; methodology, K.S., G.W., J.Q., X.G., M.X. and Z.Z.; validation, K.S.; formal analysis, K.S.; investigation, S.C.; data curation, K.S.; writing—original draft, K.S.; writing—review & editing, S.C.; visualization, K.S.; supervision, G.W., J.Q., X.G., M.X. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (grant number 2022YFD1500701) and the Science and Technology Development Plan Project of Jilin Province (grant number 20240304171SF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because they were created specifically for this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rajendran, V.; Debnath, B.; Mghames, S.; Mandil, W.; Parsa, S.; Parsons, S.; Ghalamzan, E.A. Towards Autonomous Selective Harvesting: A Review of Robot Perception, Robot Design, Motion Planning and Control. J. Field Robot. 2023, 41, 2247–2279.
2. Groher, T.; Heitkaemper, K.; Walter, A.; Liebisch, F.; Umstaetter, C. Status quo of adoption of precision agriculture enabling technologies in Swiss plant production. Precis. Agric. 2020, 21, 1327–1350.
3. Wang, Z.; Xun, Y.; Wang, Y.; Yang, Q. Review of smart robots for fruit and vegetable picking in agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54.
4. de Almeida Machado, T.; Fernandes, H.C.; Megguer, C.A.; Santos, N.T.; Santos, F.L. Quantitative and qualitative loss of tomato fruits during mechanized harvest. Rev. Bras. Eng. Agric. Ambient. 2018, 22, 799–803.
5. Wang, L.; Zhao, B.; Fan, J.; Hu, X.; Wei, S.; Li, Y.; Zhou, Q.; Wei, C. Development of a tomato harvesting robot used in greenhouse. Int. J. Agric. Biol. Eng. 2017, 10, 140–149.
6. Liu, J.; Liu, Z. The Vision-Based Target Recognition, Localization, and Control for Harvesting Robots: A Review. Int. J. Precis. Eng. Manuf. 2024, 25, 409–428.
7. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. A review of key techniques of vision-based control for harvesting robot. Comput. Electron. Agric. 2016, 127, 311–323.
8. Feng, Q.; Cheng, W.; Zhou, J.; Wang, X. Design of structured-light vision system for tomato harvesting robot. Int. J. Agric. Biol. Eng. 2014, 7, 19–26.
9. Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A Lightweight YOLOv8 Tomato Detection Algorithm Combining Feature Enhancement and Attention. Agronomy 2023, 13, 1824.
10. Liu, G.; Nouaze, J.C.; Mbouembe, P.L.T.; Kim, J.H. YOLO-Tomato: A Robust Algorithm for Tomato Detection Based on YOLOv3. Sensors 2020, 20, 2145.
11. Dong, W.; Roy, P.; Peng, C.; Isler, V. Ellipse R-CNN: Learning to Infer Elliptical Object from Clustering and Occlusion. IEEE Trans. Image Process. 2021, 30, 2193–2206.
12. Mojaravscki, D.; Magalhaes, P.S.G. Comparative Evaluation of Color Correction as Image Preprocessing for Olive Identification under Natural Light Using Cell Phones. AgriEngineering 2024, 6, 155–170.
13. Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and Localization Methods for Vision-Based Fruit Picking Robots: A Review. Front. Plant Sci. 2020, 11, 510.
14. Miao, Z.; Yu, X.; Li, N.; Zhang, Z.; He, C.; Li, Z.; Deng, C.; Sun, T. Efficient tomato harvesting robot based on image processing and deep learning. Precis. Agric. 2023, 24, 254–287.
15. Liu, M.; Chen, W.; Cheng, J.; Wang, Y.; Zhao, C. Y-HRNet: Research on multi-category cherry tomato instance segmentation model based on improved YOLOv7 and HRNet fusion. Comput. Electron. Agric. 2024, 227, 109531.
16. Oikonomou, K.M.; Kansizoglou, I.; Gasteratos, A. A Framework for Active Vision-Based Robot Planning Using Spiking Neural Networks. In Proceedings of the 2022 30th Mediterranean Conference on Control and Automation (MED), Vouliagmeni, Greece, 28 June–1 July 2022; IEEE: New York, NY, USA, 2022; pp. 867–871.
17. Nugroho, D.P.; Widiyanto, S.; Wardani, D.T. Comparison of Deep Learning-Based Object Classification Methods for Detecting Tomato Ripeness. Int. J. Fuzzy Log. Intell. Syst. 2022, 22, 223–232.
18. Yuan, T.; Lv, L.; Zhang, F.; Fu, J.; Gao, J.; Zhang, J.; Li, W.; Zhang, C.; Zhang, W. Robust Cherry Tomatoes Detection Algorithm in Greenhouse Scene Based on SSD. Agriculture 2020, 10, 160.
19. Magalhaes, S.A.; Castro, L.; Moreira, G.; dos Santos, F.N.; Cunha, M.; Dias, J.; Moreira, A.P. Evaluating the Single-Shot MultiBox Detector and YOLO Deep Learning Models for the Detection of Tomatoes in a Greenhouse. Sensors 2021, 21, 3569.
20. Huo, B.; Li, C.; Zhang, J.; Xue, Y.; Lin, Z. SAFF-SSD: Self-Attention Combined Feature Fusion-Based SSD for Small Object Detection in Remote Sensing. Remote Sens. 2023, 15, 3027.
21. Suh, H.-S.; Meng, J.; Nguyen, T.; Kumar, V.; Cao, Y.; Seo, J.-S. Algorithm-hardware Co-optimization for Energy-efficient Drone Detection on Resource-constrained FPGA. ACM Trans. Reconfig. Technol. Syst. 2023, 16, 33.
22. Chen, Z.G.; Wu, K.H.; Li, Y.B.; Wang, M.J.; Li, W. SSD-MSN: An Improved Multi-Scale Object Detection Network Based on SSD. IEEE Access 2019, 7, 80622–80632.
23. Zhang, X.; Zhang, Y.A.; Gao, T.; Fang, Y.; Chen, T. A Novel SSD-Based Detection Algorithm Suitable for Small Object. IEICE Trans. Inf. Syst. 2023, E106D, 625–634.
24. Xie, J.; Pang, Y.W.; Nie, J.; Cao, J.; Han, J.G. Latent Feature Pyramid Network for Object Detection. IEEE Trans. Multimed. 2023, 25, 2153–2163.
25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
26. Brauwers, G.; Frasincar, F. A General Survey on Attention Mechanisms in Deep Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 3279–3298.
27. Ahmed, N. Data-Free/Data-Sparse Softmax Parameter Estimation With Structured Class Geometries. IEEE Signal Process. Lett. 2018, 25, 1408–1412.
28. Mu, Y.; Chen, T.-S.; Ninomiya, S.; Guo, W. Intact Detection of Highly Occluded Immature Tomatoes on Plants Using Deep Learning Techniques. Sensors 2020, 20, 2984.
29. Lawal, M.O. Tomato detection based on modified YOLOv3 framework. Sci. Rep. 2021, 11, 1447.
30. Yin, H.; Chai, Y.; Yang, S.X.; Mittal, G.S. Ripe Tomato Detection for Robotic Vision Harvesting Systems in Greenhouses. Trans. ASABE 2011, 54, 1539–1546.
31. de Bruijn, J.; Fuentes, N.; Solar, V.; Valdebenito, A.; Vidal, L.; Melin, P.; Fagundes, F.; Valdes, H. The Effect of Visible Light on the Postharvest Life of Tomatoes (Solanum lycopersicum L.). Horticulturae 2023, 9, 94.
32. Bechar, A.; Vigneault, C. Agricultural robots for field operations. Part 2: Operations and systems. Biosyst. Eng. 2017, 153, 110–128.
33. Wang, G.; Huang, D.; Zhou, D.; Liu, H.; Qu, M.; Ma, Z. Maize (Zea mays L.) seedling detection based on the fusion of a modified deep learning model and a novel Lidar points projecting strategy. Int. J. Agric. Biol. Eng. 2022, 15, 172–180.
34. Zhao, H.; Li, Z.; Zhang, T. Attention Based Single Shot Multibox Detector. J. Electron. Inf. Technol. 2021, 43, 2096–2104.
35. Liu, X.; Pan, H.; Li, X. Object detection for rotated and densely arranged objects in aerial images using path aggregated feature pyramid networks. In Proceedings of the 11th International Symposium on Multispectral Image Processing and Pattern Recognition (MIPPR), Pattern Recognition and Computer Vision, Wuhan, China, 10–12 November 2020.
36. Hassanzadeh, A. On the Use of Imaging Spectroscopy from Unmanned Aerial Systems (UAS) to Model Yield and Assess Growth Stages of a Broadacre Crop; Rochester Institute of Technology: Rochester, NY, USA, 2022.
37. Rapado-Rincón, D.; van Henten, E.J.; Kootstra, G. Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking. Biosyst. Eng. 2023, 231, 78–91.
38. Yao, M.; Min, Z. Summary of Fine-Grained Image Recognition Based on Attention Mechanism. In Proceedings of the 13th International Conference on Graphics and Image Processing (ICGIP), Kunming, China, 18–20 August 2022.
39. Ramik, D.M.; Sabourin, C.; Moreno, R.; Madani, K. A machine learning based intelligent vision system for autonomous object detection and recognition. Appl. Intell. 2014, 40, 358–375.
40. Liu, B.; Wei, S.S.; Zhang, F.; Guo, N.W.; Fan, H.Y.; Yao, W. Tomato leaf disease recognition based on multi-task distillation learning. Front. Plant Sci. 2024, 14, 1330527.
41. Fang, Y.C.; Ma, Z.Y.; Zhang, Z.X.; Zhang, X.Y.; Bai, X. Dynamic Multi-Task Learning with Convolutional Neural Network. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 19–25 August 2017; pp. 1668–1674.
Figure 1. Modified SSD model.
Figure 2. Diagram of self-attention mechanism.
Figure 3. The final modified SSD model.
Figure 4. Labeling the training images using the annotation tool LabelImg.
Figure 5. Detection results of the tomato detection model derived from the modified SSD model.
Figure 6. Detection results of three deep learning models derived from identical images.
Figure 7. Detection results of two deep learning models derived from identical images. The classical SSD model with a self-attention mechanism recognized the smaller tomatoes grown in planting rows far away from the camera; however, such over-refined recognition has no practical significance in real application scenarios, complicates the decision of which tomato should be picked, and wastes computational time.
Figure 8. Occlusion influences successful recognition.
Figure 9. Unbalanced illumination influences successful recognition.
Table 1. Ablation experiment results.

| Model | Accuracy Rate/% | Recall Rate/% | Time Consumption per Image/ms | Model Size/MB |
| --- | --- | --- | --- | --- |
| Modified SSD in this study | 94.77 a | 96.1 a | 29 a | 155 a |
| Classical SSD with self-attention mechanism | 94.2 a | 96.7 a | 37 b | 197 b |
| Classical SSD | 91.77 b | 95.7 b | 28 a | 153 a |

Notes: Within a column, different lowercase letters denote a significant difference at the 0.05 level under a paired test.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
