Autonomous Tomato Harvesting System Integrating AI-Controlled Robotics in Greenhouses

Matache, Mihai Gabriel; Marin, Florin Bogdan; Persu, Catalin Ioan; Cristea, Robert Dorin; Nenciu, Florin; Atanasov, Atanas Z.

doi:10.3390/agriculture16080847


by Mihai Gabriel Matache ¹, Florin Bogdan Marin ²,*, Catalin Ioan Persu ¹, Robert Dorin Cristea ¹, Florin Nenciu ¹ and Atanas Z. Atanasov ³,*

¹ Testing Department, National Institute of Research-Development for Machines and Installations Designed to Agriculture and Food Industry-INMA, 6 Ion Ionescu de la Brad Avenue, 013813 Bucharest, Romania

² Interdisciplinary Research Centre in the Field of Eco-Nano Technology and Advance Materials CC-ITI, Faculty of Engineering, “Dunărea de Jos” University of Galaţi, 47 Domnească Street, 800008 Galati, Romania

³ Department of Agricultural Machinery, Agrarian and Industrial Faculty, University of Ruse “Angel Kanchev”, 7017 Ruse, Bulgaria

* Authors to whom correspondence should be addressed.

Agriculture 2026, 16(8), 847; https://doi.org/10.3390/agriculture16080847

Submission received: 3 March 2026 / Revised: 3 April 2026 / Accepted: 9 April 2026 / Published: 11 April 2026

(This article belongs to the Special Issue AI-Powered Agricultural Robots: From Field Sensing to Autonomous Operation)


Abstract

Labor shortages and the need for increased productivity have accelerated the development of robotic harvesting systems for greenhouse crops; however, reliable operation under fruit occlusion and clustered arrangements remains a major challenge, particularly due to the limited integration between perception and motion planning modules. The paper presents the design and experimental validation of an autonomous robotic system for greenhouse tomato harvesting. The proposed platform integrates a rail-guided mobile base, a six-degrees-of-freedom robotic manipulator, and an adaptive end effector with a hybrid vision framework that combines convolutional neural networks and watershed-based segmentation to enable robust fruit detection and localization under occluded conditions. The proposed approach enables improved separation of overlapping fruits and provides accurate spatial localization through stereo vision combined with IMU-assisted camera-to-robot coordinate transformation. An occlusion-aware trajectory planning strategy was developed to generate collision-free manipulation paths in the presence of leaves and stems, enhancing harvesting safety and reliability. The system was trained and evaluated using a dataset of real greenhouse images supplemented with synthetic data augmentation. Experimental trials conducted under practical greenhouse conditions demonstrated a fruit detection precision of 96.9%, recall of 93.5%, and mean Intersection-over-Union of 79.2%. The robotic platform achieved an overall harvesting success rate of 78.5%, reaching 85% for unobstructed fruits, with an average cycle time of 15 s per fruit in direct harvesting scenarios. The rail-guided mobility significantly improved positioning stability and repeatability during manipulation compared with fully mobile platforms. 
The results confirm that integrating hybrid perception with occlusion-aware motion planning can substantially improve the functionality of robotic harvesting systems in protected cultivation environments. The proposed solution contributes to the advancement of automation technologies for greenhouse vegetable production and supports the transition toward more sustainable and labor-efficient agricultural practices.

Keywords: AI-controlled robotics; automated harvesting; deep learning; sustainable farming

1. Introduction

The continuous advancement of artificial intelligence (AI), together with the rapid development of robotics, has created new opportunities for automation in agriculture. In the context of increasing pressure to improve productivity while reducing labor dependency and operational costs, robotic systems capable of performing complex agricultural tasks with limited human intervention have become a central research objective. Labor availability represents a major economic and operational constraint in protected horticulture. Published studies show that labor commonly accounts for about one-quarter to one-third of greenhouse production costs, with reported values of 29% in the Netherlands, 25% in Israel, and 35% in Japan for greenhouse horticulture [1], while a tomato greenhouse cost analysis reported a labor share of 24.7% of total production cost [2]. In Mediterranean greenhouse case studies, hired labor has been reported to reach up to 40% of total production costs [3].

Automated harvesting systems have received considerable attention as a response to labor shortages and the need for more efficient and sustainable farming practices. Over the past decade, significant progress has been reported in the development of robotic harvesters for a wide range of crops, including apples, tomatoes, strawberries, kiwi, citrus fruits, bell peppers, and cucumbers [4]. These systems typically integrate machine vision, learning-based algorithms for fruit detection and localization, and robotic manipulators designed to handle delicate biological products. A comprehensive review by Mail et al. (2023) analyzed the main design principles of modern harvesting robots and classified harvesting mechanisms into four primary categories: grasping and cutting, vacuum suction and plucking, twisting and pulling, and shaking and collecting [5]. Their study emphasized the importance of coordinated integration between mobile platforms, manipulators, sensing systems, and motion planning algorithms to ensure reliable operation across different crop types.

Mobility is a fundamental requirement for harvesting robots, as crops are cultivated in diverse spatial configurations depending on the growing environment. Various locomotion solutions have been explored, including four-wheel differential drive platforms, four-wheel independent drive systems, tracked vehicles, and rail-based carts that exploit existing greenhouse infrastructure such as heating pipes or guiding rails. In orchard environments, mobile harvesting systems may also rely on human-operated carriers such as tractors or forklifts. While many of these platforms are commercially available, others are custom-designed to meet crop-specific constraints. However, during harvesting operations, the interaction between the manipulator and the plant introduces variable dynamic loads that can compromise platform stability. For this reason, several studies have reported the use of stabilization mechanisms as an integral part of mobile harvesting platforms [6,7]. Representative examples of such platforms include Thorvald II [8] and the Qii-Drive AGV [9]. Alternative architectural solutions have also been proposed. Arima et al. (2004) developed a sliding system that moves a suspended manipulator underneath tabletop strawberry beds, eliminating the need for a fully mobile platform [8]. For indoor-grown crops such as mushrooms, Cartesian robotic configurations are often employed, allowing coverage of the entire cultivation surface without platform relocation [10]. In contrast, Yamamoto et al. (2014) investigated a stationary harvesting unit in which the crop bed itself moves on rails in front of the manipulator during the harvesting process [11]. These examples illustrate the diversity of mechanical solutions adopted to address the mobility and accessibility challenges associated with agricultural harvesting.

Most mobile harvesting platforms are equipped with mechanical arms and crop-specific end effectors. Sepúlveda et al. (2020) introduced a dual-arm robot for eggplant harvesting that combined a support vector machine (SVM) classifier with a planning algorithm, achieving a success rate of 91.67% and an average cycle time of 26 s per fruit [12]. Arad et al. (2020) presented a sweet pepper harvesting robot integrating an RGB-D vision system and deep learning-based segmentation, reporting a harvesting success rate of 85% with an average cycle time of 18 s per fruit [9]. In the case of tomatoes, Wang et al. developed a harvesting robot equipped with a five-degrees-of-freedom manipulator, laser navigation, and binocular stereo vision, achieving an average picking time of 15 s per fruit and a success rate of 86% in greenhouse conditions [13]. A further study [14] explored a dual-arm cooperative approach, in which one arm stabilized the tomato cluster while the second arm performed the harvesting operation. That system reached a success rate of 87.5%, with a harvesting cycle time of less than 30 s, excluding platform movement.

The sensory architecture of selective harvesting robots typically combines external and internal sensors. External sensors are primarily responsible for fruit detection, localization, and environmental perception, enabling navigation and target identification. Internal sensors monitor the state and performance of subsystems such as manipulators, end effectors, and mobile platforms. Among these, the vision system plays a critical role. To successfully position and orient the end effector, the robot must first obtain an accurate spatial description of ripe fruits on the plant. This task becomes particularly challenging in greenhouse environments, where fruits are frequently occluded by leaves, stems, or neighboring fruits. Additional variability in fruits’ color, size, and shape further complicates the detection process, while peduncle identification and orientation mapping are often required to ensure correct detachment.

A wide range of approaches have been proposed for fruit detection. Early methods relied on hand-crafted features, such as Histogram of Oriented Gradients (HOG) combined with SVM classifiers, followed by post-processing steps including false color removal and non-maximal suppression [15]. More recent studies have adopted deep learning techniques. Zhang et al. [16] introduced a convolutional neural network (CNN) architecture designed for rotational robustness, while Li et al. [17] modified the YOLOv3 model to create YOLO-Tomato, incorporating a DenseNet backbone and circular bounding boxes to better match fruit geometry. Through data augmentation and architectural refinements, their model achieved a detection accuracy of 94.58% under varying illumination conditions. Despite these advances, neural network-based detection models differ substantially in architecture, training strategies, loss functions, and optimization methods. Object detection frameworks are commonly classified into one-stage and two-stage detectors. One-stage detectors prioritize inference speed, whereas two-stage detectors generally provide higher localization accuracy at the expense of increased computational complexity. The choice between these approaches is strongly influenced by application constraints, including real-time requirements and hardware limitations.

Camera placement further affects both algorithm selection and overall system performance. Harvesting robots may employ binocular vision, laser vision, RGB-D sensors, multispectral cameras, or combinations thereof [18]. These sensors are typically mounted on the end effector, along the manipulator links, or at elevated positions to provide a global view of the crop. Some systems combine multiple placements to improve robustness and perception flexibility. In certain configurations, cameras are mounted on independent motion mechanisms, allowing extended scanning capabilities [19,20]. While end effector-mounted cameras provide a natural perspective for harvesting, they are particularly susceptible to occlusion by leaves and stems, increasing the need for robust image processing and accurate depth estimation.

In robotic harvesting, perception and motion planning cannot be treated as fully independent functions. Accurate fruit detection is necessary, but it does not by itself guarantee harvestability, because the manipulator must also approach the target safely within a cluttered canopy containing leaves, stems, and neighboring fruits. For this reason, the perception system must provide not only target localization, but also information on occlusion and local obstacle configuration that can support motion generation [20].

Geographically, the most significant progress in greenhouse and harvesting robotics has been reported in East Asia, particularly in Japan and China, and in Europe, mostly in the Netherlands and Spain [21]. These regions combine strong research activity with intensive protected horticulture systems, which has accelerated both experimental development and early deployment of robotic platforms for crop monitoring and harvesting. In particular, the Netherlands and Spain provide important greenhouse application contexts, while Japan and China have contributed substantially to the advancement of fruit detection, manipulation, and robotic harvesting strategies in structured cultivation environments [21].

Recent review studies published in 2024–2026 show that research in greenhouse and fruit-harvesting robotics is rapidly evolving toward tighter integration of perception, motion planning, and crop-specific end effectors [22,23,24], while commercial development is progressing through specialized harvesting platforms already tested in horticultural production systems [25]. In this context, the present work is positioned within the current generation of greenhouse harvesting systems that aim to balance detection accuracy, motion safety, and operational feasibility in practical environments, rather than focusing exclusively on algorithmic performance under controlled conditions.

Current Limitations and Proposed Research Contributions

Despite the progress achieved in robotic harvesting, several practical limitations still restrict reliable operation in greenhouse tomato crops. First, fruits are frequently partially occluded by leaves, stems, or neighboring fruits, which reduces detection reliability and complicates instance separation in clustered scenes. Second, even when fruits are correctly detected, harvesting performance depends on accurate transformation of image-based detections into the robot reference frame, so that the manipulator can reach the target safely and precisely. Third, many existing systems treat perception and motion generation as largely independent stages, which reduces robustness when occlusions directly affect the feasibility of the harvesting approach. Finally, several studies report detection performance under controlled conditions, but fewer works validate the complete perception-to-harvesting loop in real greenhouse environments.

In this context, the present study focuses on the design and experimental validation of an integrated robotic harvesting system for greenhouse tomatoes, a crop chosen because tomato is one of the most widely cultivated vegetables worldwide. The proposed system combines a rail-guided mobile platform (easily adaptable to commercial greenhouse systems), a UR5e manipulator, and an adaptive gripper with a hybrid CNN–Watershed perception pipeline for robust fruit detection under partial occlusion. The detected fruit positions are transformed into the robot coordinate system through an IMU-assisted camera-to-robot calibration procedure, and an occlusion-aware geometric waypoint strategy is used to generate collision-free harvesting trajectories. The contribution of this work is therefore not active manipulation of foliage or high-level probabilistic decision making, but the practical integration and greenhouse validation of a complete perception, calibration, motion planning, and harvesting pipeline under realistic operating conditions.

The novelty of the proposed system lies in the integration of three key elements within a real-time robotic harvesting framework:

1. Lightweight and deployable segmentation pipeline

The combination of a moderately deep CNN with Watershed post-processing provides a computationally efficient alternative to fully learned instance segmentation networks. This makes the approach more suitable for real-time operation in closed-loop robotic systems, where inference latency directly affects control performance.

2. Tight coupling between perception and motion planning

In contrast to many existing works where perception and manipulation are treated as separate modules, the proposed system integrates segmentation output directly into an occlusion-aware trajectory planning strategy.

3. Validation in a real greenhouse environment under occlusion

The system is experimentally validated in a realistic greenhouse setting, focusing specifically on partial occlusion scenarios, which remain a key challenge in agricultural robotics.

These aspects differentiate the proposed approach from existing methods by emphasizing practical operability and real-time performance, rather than relying solely on increasingly complex deep learning architectures.

2. Materials and Methods

Experimental data were collected from 20 May to 30 June 2025 in one of the experimental greenhouses of INMA Bucharest. To ensure high experimental replicability, the study was conducted in a climate-controlled greenhouse facility. Environmental parameters were managed via an automated control system, maintaining a diurnal temperature regime of 24/18 °C (day/night) and a relative humidity range of 60–70%. Plants were grown in natural soil (sandy loam) and supplied with a balanced nutrient solution via a drip irrigation system. Fertigation events were triggered based on a cumulative solar radiation threshold so that water availability was synchronized with transpiration demand. The nutrient solution was maintained at an electrical conductivity (EC) of 2.2 dS/m and a pH of 5.8, with macro- and micronutrient concentrations adjusted according to the phenological stage of the crop. Standardized cultural practices appropriate to the tomato variety were strictly followed to minimize experimental bias. An integrated pest management strategy was implemented to maintain crop health without interfering with physiological data collection. No synthetic chemical pesticides were applied during the experimental period to avoid potential phytotoxic effects on the recorded data. While the physical experiments were conducted within a specific greenhouse configuration, this environment was chosen as a representative baseline for intensive protected cropping. To ensure the results were not limited to a single static path, we introduced stochastic row-level variations across multiple trial runs. This allowed us to validate the motion planner’s ability to maintain trajectory accuracy despite the slight geometric inconsistencies typical of real-world agricultural structures.

2.1. Robot Architecture and Operating Principle

The working principle of the harvesting robot is presented in Figure 1. Powered by a 48 V Li-ion battery, the robot includes a Robotiq 2F-85 adaptive gripper (Robotiq Inc., Levis, QC, Canada) that can gently grasp and detach ripe tomatoes. The system relies on a ZED X RGB-3D stereo camera (Stereolabs, San Francisco, CA, USA), which continuously scans the plant to generate a 3D point cloud for accurate spatial positioning and obstacle detection. The harvesting robot employs a CNN combined with watershed-based segmentation, trained to detect tomatoes at varying stages of ripeness. This vision system enables precise localization of the fruit, even in complex scenes with partial occlusion by leaves or stems. The control and perception software of the proposed harvesting system is deployed on a ZED Box (Stereolabs, San Francisco, CA, USA), an embedded edge AI computer associated with the stereo vision system (Figure 1). Once a ripe tomato is detected, the six-degrees-of-freedom robotic arm positions the gripper for a clean and damage-free harvest. The picked tomatoes are then placed in a collection container mounted on the platform. The robot navigates autonomously along the crop rows, repeating the scanning and harvesting cycle, providing an efficient, non-invasive, and fully automated solution for tomato harvesting in protected environments such as greenhouses or high tunnels.

The system was designed to operate autonomously and significantly enhance harvesting productivity and quality, while simultaneously reducing dependence on manual labor. The robot (Figure 2) consists of the following core components:

2.2. Specifications of the Main Components of the Proposed Tomato Harvesting Robot

Robotic Arm: The UR5e robotic arm (Universal Robots, Odense, Denmark) is a collaborative robot (cobot) with the following specifications: maximum payload: 5 kg; maximum reach: 850 mm (total arm length); arm weight: 20.6 kg; number of axes: 6 (providing high flexibility and precise multidirectional movement); repeatability: ±0.03 mm (ensuring high precision in repeated movements); maximum speed: 1 m/s (on combined axes); range of motion: ±360° on all axes. The arm is controlled to interact with both fruits and plants during harvesting, adjusting its position and angle based on the input from the detection system. At the end of the arm is the gripper, which performs both gripping and cutting functions, integrating seamlessly into the harvesting process.

Gripping and Handling System: Once the fruits are identified, the robot uses a Robotiq 2F-85 adaptive two-finger gripper (Robotiq Inc., Levis, QC, Canada) to grasp them. The Robotiq 2F-85 is a commercially available underactuated rigid gripper. In the present setup, elastomer pads were added to the gripper fingers in order to provide a more compliant contact surface and reduce the risk of fruit damage during grasping. The adaptive finger mechanism allows the gripper to accommodate moderate variations in fruit size, while the gripping force can be adjusted within the available operating range of the device. In this study, the gripping force was selected within a safe range suitable for tomato harvesting. No dedicated fruit firmness estimation module or external real-time force feedback control algorithm was implemented in the present work. The main gripper characteristics were: maximum payload: 5 kg; finger opening: 0–85 mm (85 mm with fingers fully open); gripping force: adjustable between 20 N and 235 N; gripper weight: 900 g; number of fingers: 2 (adaptive fingers for more precise gripping); maximum finger closing speed: 200 mm/s; power supply: 24 V DC, 1.5 A (at maximum load); communication interface: Modbus RTU (RS485), compatible with most industrial robot control systems; repeatability: ±0.05 mm (for finger positioning). Integrated force sensors allow for precise pressure adjustment during gripping.

Tomato Detection and Identification System: The visual recognition system includes a 3D RGB stereo camera (Stereolabs ZED X 3D Camera, Stereolabs, San Francisco, CA, USA) combined with the CNN and watershed-based segmentation pipeline, trained to detect tomatoes at varying stages of ripeness. The camera specifications are: 3D capture distance: from 0.2 m to 20 m; field of view: horizontal: 100°, vertical: 60°, diagonal: 120°; distance accuracy: measurement errors of approximately 1% at 1 m and less than 5% at 20 m; SDK and software integration: supports Stereolabs SDK, Python 3.14.4 (Python Software Foundation, Amsterdam, Netherlands) and ROS 2 (Open Robotics, Mountain View, CA, USA). The algorithm enables the accurate detection and classification of ripe fruits, with a high degree of precision in distinguishing between mature and immature tomatoes. Additionally, the 3D camera generates a point cloud of the environment, allowing the robot to calculate distances to fruits and avoid obstacles such as branches or leaves.

ZED Box Controller: The ZED Box is a high-performance, industrial-grade embedded computing solution engineered by Stereolabs (San Francisco, CA, USA). It is designed to function as the primary processing unit (neural center) for autonomous robotic systems and edge-computing applications. The computational core integrates a 1024-core NVIDIA Ampere architecture GPU with 32 Tensor Cores @ 918 MHz, an 8-core Arm® Cortex®-A78AE CPU @ 2 GHz, 16 GB of 128-bit LPDDR5 memory (102.4 GB/s), and a deep learning accelerator (2x NVDLA v2 @ 614 MHz).

Tomato Storage System: After harvesting, the fruits are transferred to a modular collection compartment and classic plastic crates. This compartment is specifically designed for easy replacement when full and is internally lined with soft materials and cushioning systems to prevent fruit damage during handling. The modular system ensures flexibility and efficiency, allowing quick unloading and resumption of the harvesting operation.

Modular rail platform: The mobile platform was designed and built at the INMA Institute (Bucharest, Romania), with width: 456 mm; length: 1200 mm (30 m length per module); drive motor: 1.8 kW, 48 V DC; and battery: 48 V DC, 5 kW. It was designed to move along tracks mounted between crop rows in greenhouses. These modular tracks can be adjusted to fit various field dimensions and structures. The platform travels along predefined paths, enabling precise and automated harvesting while avoiding collisions through synchronized movement control.

Control software: The control software of the robotic tomato harvesting system is structured into several distinct stages and modules, all integrated within a main program (denoted as MAIN) that coordinates the overall functioning of the system.

2.3. Procedural Workflow for Vision–Actuation Synchronization

Frame acquisition and point cloud generation

In the first stage, the ZED X 3D camera captures frames from the surrounding environment. These are processed to generate a 3D point cloud, which provides a detailed representation of the position of objects around the robot. The resulting data are transmitted to the processing unit.

Tomato identification and coordinate generation

In the second stage, a trained model for tomato recognition (the neural network developed during this research) is applied to the images provided by the 3D camera. The system identifies ripe fruits and determines the coordinates of the center of mass of each tomato. These coordinates are then passed on to the motion planning module of the robotic arm.
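The coordinate-generation step can be illustrated with a minimal sketch. It assumes the point cloud is available as a row-by-row grid of (x, y, z) tuples in metres, aligned pixel-for-pixel with an integer label image in which each fruit has its own id; the function name `fruit_centroid` and both data layouts are hypothetical and are not taken from the described software. Pixels without a valid depth value (NaN or infinite) are skipped.

```python
import math

def fruit_centroid(points, labels, fruit_id):
    """Mean (x, y, z) of the valid 3D points whose pixel label equals fruit_id.

    points: grid of (x, y, z) tuples in metres (assumed layout).
    labels: grid of integer fruit ids, same shape as points (assumed layout).
    Returns None when the fruit has no valid 3D point.
    """
    sx = sy = sz = 0.0
    count = 0
    for point_row, label_row in zip(points, labels):
        for point, label in zip(point_row, label_row):
            # Skip other fruits and pixels where stereo matching failed.
            if label != fruit_id or not all(math.isfinite(v) for v in point):
                continue
            sx += point[0]
            sy += point[1]
            sz += point[2]
            count += 1
    if count == 0:
        return None
    return (sx / count, sy / count, sz / count)

if __name__ == "__main__":
    nan = float("nan")
    cloud = [[(0.0, 0.0, 1.0), (0.2, 0.0, 1.0)],
             [(nan, nan, nan), (0.1, 0.3, 1.2)]]
    ids = [[1, 1],
           [1, 2]]
    print(fruit_centroid(cloud, ids, 1))
```

Because only the surface facing the camera is visible, a mean computed this way is biased toward the camera relative to the true center of the fruit; how the described system compensates for this, if at all, is not stated in the text.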

Robotic arm motion control

In the third stage, the identified tomato coordinates are used to command the UR5e robotic arm. The software computes the optimal trajectory for the arm to precisely reach the location of each tomato, employing the adaptive gripper to grasp and harvest the fruit. Once harvested, the tomato is transferred to the storage compartment.

Convolutional Neural Network development

Precise localization of tomato plants and fruits is crucial for efficient and damage-free robotic harvesting in complex greenhouse environments. Overlapping objects with similar color and shape challenge conventional detection methods, often causing merged detections and counting errors. This study investigates CNN-based detection combined with image segmentation to address these issues. A CNN identifies red, quasi-round objects, while background elements are also detected to optimize harvesting movements. Color homogeneity and geometric similarity reduce feature discrimination and boundary clarity. To improve separation, the watershed algorithm is applied as a post-processing step to CNN segmentation masks. The method enhances boundary delineation and object counting accuracy, especially for overlapping fruits. Experiments were conducted on both synthetic and real-world tomato datasets. Synthetic images were generated with controlled overlap, noise, and variability using OpenCV 4.13.0 (Palo Alto, CA, US). The results demonstrate improved detection and counting, with potential applications in automated harvesting and quality control.
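To make the post-processing idea concrete, the following is a simplified, dependency-free sketch of a marker-based watershed applied to a binary mask such as a thresholded CNN output. It is not the authors' implementation: the city-block distance transform, the global core threshold `core_ratio = 0.7`, and all function names are assumptions chosen for illustration. Core regions deep inside each fruit become markers, and the markers are then grown outward from the deepest pixels first, so that two touching fruits meet at the narrow neck between them.

```python
import heapq
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def distance_to_background(mask):
    """City-block distance of every pixel to the nearest background (0) pixel.
    Assumes the mask contains at least one background pixel."""
    rows, cols = len(mask), len(mask[0])
    dist = [[0 if mask[r][c] == 0 else None for c in range(cols)] for r in range(rows)]
    queue = deque((r, c) for r in range(rows) for c in range(cols) if mask[r][c] == 0)
    while queue:  # multi-source breadth-first search from the background
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

def label_markers(dist, threshold):
    """Give each 4-connected group of pixels with dist >= threshold its own id."""
    rows, cols = len(dist), len(dist[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if dist[r][c] >= threshold and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for dy, dx in NEIGHBOURS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and dist[ny][nx] >= threshold and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count

def split_touching_fruits(mask, core_ratio=0.7):
    """Return (label image, number of fruits) for a binary mask."""
    dist = distance_to_background(mask)
    peak = max(max(row) for row in dist)
    if peak == 0:  # no foreground at all
        return [[0] * len(mask[0]) for _ in mask], 0
    labels, count = label_markers(dist, core_ratio * peak)
    rows, cols = len(mask), len(mask[0])
    heap = [(-dist[r][c], r, c) for r in range(rows) for c in range(cols) if labels[r][c]]
    heapq.heapify(heap)
    while heap:  # grow markers, deepest pixels first
        _, r, c = heapq.heappop(heap)
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and labels[nr][nc] == 0:
                labels[nr][nc] = labels[r][c]
                heapq.heappush(heap, (-dist[nr][nc], nr, nc))
    return labels, count

def disc_mask(rows, cols, centres, radius):
    """Synthetic test mask: union of filled discs (1 = fruit, 0 = background)."""
    return [[int(any((r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2 for cr, cc in centres))
             for c in range(cols)] for r in range(rows)]

if __name__ == "__main__":
    mask = disc_mask(17, 27, [(8, 8), (8, 18)], 6)  # two overlapping discs
    labels, count = split_touching_fruits(mask)
    print("fruits found:", count)
```

A single global threshold is the simplest choice, but it can miss a fruit that is much smaller than its neighbor; per-region thresholds or local maxima of the distance map are common alternatives.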

Camera Calibration and Coordinate Transformation

Accurate calibration of the vision system is essential to ensure that the tomato positions detected by the CNN–Watershed pipeline can be correctly mapped into the robot’s kinematic workspace. For this purpose, we implemented a two-step procedure: orientation compensation and coordinate transformation. The ZED X stereo camera is equipped with an integrated IMU sensor, which continuously provides orientation information in quaternion form q = [x,y,z,w]. Compared with Euler angles or direct rotation matrices, quaternions avoid singularities and allow efficient computation of sequential rotations. This makes them particularly suitable for real-time robotic applications, where the orientation must be updated continuously during the harvesting process. The IMU data stream is processed using the update_imu_orientation function, which retrieves the orientation quaternion and stores it for subsequent transformations. This step ensures that any tilt of the camera, either due to robot base inclination or irregularities of the rail system, is immediately compensated. The camera calibration procedure was executed at system start-up, when the stereo vision subsystem and camera-to-robot transformation parameters were initialized. During operation, the integrated IMU continuously provided orientation updates in quaternion form, allowing real-time compensation of small camera tilts caused by rail irregularities or minor platform inclination. No major drift affecting harvesting performance was observed during the reported trials. Although minor drift may appear during extended operation, its practical impact was limited because the ZED SDK performs runtime sensor fusion and pose stabilization. Therefore, the system relied on start-up calibration together with continuous IMU-based orientation updating during the harvesting process.

The second step was to transform the coordinates of detected fruits from the camera reference frame into the robot reference frame.

This is achieved by combining two operations:

1. Rotational alignment: the quaternion data are converted into a 3 × 3 rotation matrix R, which aligns the camera’s local axes with the robot’s global coordinate system.

2. Translational correction: a fixed offset vector T is introduced to account for the physical displacement between the camera mount and the robot’s kinematic base.

The complete transformation can be expressed as

$$P_{\mathrm{robot}} = R \, P_{\mathrm{camera}} + T \tag{1}$$

where $P_{\mathrm{robot}}$ represents the target point in the robot’s coordinate system [m], $R$ is the rotation matrix derived from the IMU quaternion data, $P_{\mathrm{camera}}$ represents the detected fruit coordinates in the camera reference frame [m], and $T$ is the translational offset vector [m].

The camera-to-robot pose function performs this computation in real time. For every detected fruit, the raw 3D point is rotated according to the IMU-derived orientation and translated with the fixed offset. This ensures that the final coordinates are expressed in the robot’s kinematic reference frame, which can then be used as input for the inverse kinematics solver.
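A minimal sketch of Equation (1) is given here for illustration. It assumes a unit quaternion in the order [x, y, z, w] used in the text, and it assumes that this quaternion directly expresses the camera orientation relative to the robot base frame. The function names `quaternion_to_matrix` and `camera_to_robot`, as well as the offset values in the example, are hypothetical and do not come from the described software.

```python
import math

def quaternion_to_matrix(q):
    """Convert a quaternion [x, y, z, w] into a 3 x 3 rotation matrix R."""
    x, y, z, w = q
    n = math.sqrt(x * x + y * y + z * z + w * w)
    if n == 0.0:
        raise ValueError("zero-length quaternion")
    x, y, z, w = x / n, y / n, z / n, w / n  # guard against small IMU scaling errors
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def camera_to_robot(p_camera, q_imu, t_offset):
    """Apply P_robot = R * P_camera + T; all vectors are (x, y, z) in metres."""
    rot = quaternion_to_matrix(q_imu)
    return tuple(
        sum(rot[i][j] * p_camera[j] for j in range(3)) + t_offset[i]
        for i in range(3)
    )

if __name__ == "__main__":
    # Example values only: camera rotated 90 degrees about the z axis,
    # mounted 0.1 m forward and 0.5 m above the robot base.
    q = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
    print(camera_to_robot((1.0, 0.0, 0.0), q, (0.1, 0.0, 0.5)))
```

In practice an IMU reports orientation relative to a gravity-aligned world frame, so an additional fixed rotation between that frame and the robot base may be needed; the text does not specify how this is handled.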

The UR5e robotic arm used in this study is a six-degrees-of-freedom serial manipulator, whose motion is governed by its internal kinematic model. The transformation between joint angles and end effector position is handled through the manufacturer’s embedded inverse kinematics (IK) solver. For each detected fruit, the 3D coordinates obtained from the vision system are converted into a target pose in the robot reference frame, and the IK solver computes the corresponding joint configuration required to reach that position.

The proposed approach for motion planning does not use a sampling-based planner such as RRT*, a graph-search planner such as A*, or a potential field method, but a deterministic geometric waypoint planner operating in Cartesian space and coupled with the UR5e embedded inverse kinematics solver. The rationale for this choice is the structured greenhouse environment, where plant rows impose relatively predictable geometric constraints and allow a lower-complexity planner with bounded computational cost.

To ensure safe operation inside the plant canopy, a simplified collision space was defined using the 3D point cloud generated by the stereo camera. Leaves and stems were identified as obstacle regions and approximated as bounded volumes in Cartesian space.

A 10 mm safety margin was introduced as a conservative buffer to account for stereo reconstruction uncertainty, coordinate transformation error, and manipulator positioning tolerance. This value was selected as a compromise between safety and efficiency: while it may slightly increase detour length in occluded cases, it reduces the risk of undesired contact with leaves and stems. The robot was therefore allowed to operate only within the free workspace obtained after excluding these obstacle volumes.

Trajectory generation followed two different strategies depending on the presence of occlusion. When the fruit was fully visible and unobstructed, the robot executed a direct linear motion from the current end effector position to the target fruit location. When leaves or stems partially blocked access to the fruit, intermediate waypoints were generated to guide the end effector around the obstacle before reaching the final harvesting pose.

Waypoint generation was based on a deterministic geometric displacement rule in Cartesian space. The planner first checked whether the direct segment between the current end effector position and the fruit target intersected any obstacle region derived from the segmented 3D point cloud. If no intersection was detected, the gripper followed a direct approach. If an intersection was detected, an intermediate waypoint for the end effector was generated by applying a lateral offset relative to the nearest obstacle boundary, so that the gripper approached the fruit from a collision-free side direction. The final fruit target remained unchanged, while only the intermediate end effector waypoint was displaced in order to bypass the obstacle.

The magnitude of the lateral offset was determined from the local extent of the obstacle together with a 10 mm safety margin. The number of waypoints was not fixed in advance, but determined adaptively according to the local obstacle configuration. In simple occlusion cases, a single intermediate waypoint was sufficient to bypass the obstacle. In denser canopy regions, additional waypoints were inserted sequentially until the resulting piecewise-linear trajectory of the end effector was free of obstacle intersections. Each generated waypoint was then verified through the inverse kinematics solver to ensure reachability and compliance with joint limits before execution.
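The waypoint rule described above can be sketched as follows. This is an illustration under stated assumptions, not the planner used in the study: obstacles are taken as axis-aligned boxes, the lateral direction is fixed to the x axis, each new waypoint is placed at the midpoint of the blocked segment, and the inverse kinematics reachability check is omitted. The names `hits_box`, `lateral_waypoint`, and `plan_path`, as well as the example geometry, are hypothetical.

```python
import numpy as np

MARGIN = 0.010   # 10 mm safety margin from the text [m]
EPS = 1e-9       # tolerance: touching the inflated boundary counts as free
LATERAL = 0      # assumed lateral axis (x); hypothetical choice

def hits_box(a, b, box):
    """Slab test: does segment a->b enter the box inflated by MARGIN?"""
    lo, hi = box[0] - MARGIN + EPS, box[1] + MARGIN - EPS
    t_in, t_out = 0.0, 1.0
    for i in range(3):
        d = b[i] - a[i]
        if abs(d) < 1e-12:                       # segment parallel to this slab
            if a[i] < lo[i] or a[i] > hi[i]:
                return False
        else:
            t1, t2 = (lo[i] - a[i]) / d, (hi[i] - a[i]) / d
            t_in, t_out = max(t_in, min(t1, t2)), min(t_out, max(t1, t2))
            if t_in > t_out:
                return False
    return True

def lateral_waypoint(a, b, box):
    """Midpoint of a->b, displaced sideways onto the nearer inflated box face."""
    wp = (a + b) / 2.0
    left, right = box[0][LATERAL] - MARGIN, box[1][LATERAL] + MARGIN
    wp[LATERAL] = left if wp[LATERAL] - left <= right - wp[LATERAL] else right
    return wp

def plan_path(start, target, boxes, max_waypoints=8):
    """Insert waypoints until every segment of the path is obstacle-free."""
    path = [np.asarray(start, float), np.asarray(target, float)]
    while True:
        blocked = next(((k, box) for k in range(len(path) - 1) for box in boxes
                        if hits_box(path[k], path[k + 1], box)), None)
        if blocked is None:
            return path
        if len(path) - 2 >= max_waypoints:
            raise RuntimeError("no collision-free path within the waypoint budget")
        k, box = blocked
        path.insert(k + 1, lateral_waypoint(path[k], path[k + 1], box))

# Hypothetical scene: one leaf volume between the gripper and the fruit [m].
boxes = [(np.array([-0.03, 0.25, 0.45]), np.array([0.05, 0.35, 0.55]))]
path = plan_path((0.0, 0.0, 0.5), (0.0, 0.6, 0.5), boxes)
```

In this example the direct approach is blocked, and the simple midpoint rule needs three intermediate waypoints to slide along the inflated obstacle face; a rule that places the waypoint ahead of the obstacle could bypass it with fewer. The final target is left unchanged, as in the text.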

A lateral avoidance strategy was adopted because, in row-based greenhouse tomato cultivation, leaves and stems usually block the direct approach locally rather than surrounding the fruit uniformly in three-dimensional space. Therefore, a side displacement of the end effector is often sufficient to create a collision-free approach while preserving computational simplicity and predictable real-time behavior. Compared with more complex free-space sampling-based planners, this method is better suited to the structured geometry of greenhouse rows and allows efficient obstacle avoidance during harvesting.

The concept is based on the following principle:

-: When no occlusions are detected, the fruit pose is passed directly to the UR5e inverse kinematics solver, which computes a valid joint configuration for a straight-line approach.
-: When occlusions are present, the system generates a sequence of intermediate waypoints that redirect the trajectory around obstacles, by applying a lateral offset relative to the nearest obstacle boundary, such that the modified segment is displaced outside the exclusion zone. Each waypoint is expressed in the robot’s coordinate system and sent to the IK solver in succession. This approach preserves the accuracy and safety of the robot’s internal solver while providing external control over the global motion path.

\{P_{start}, P_{wp1}, P_{wp2}, \dots, P_{target}\} \Rightarrow IK(\theta_{1}, \theta_{2}, \dots, \theta_{6})    (2)

where P_{wp1}, P_{wp2}, \dots are dynamically generated waypoints ensuring a collision-free path.

The number of waypoints is determined adaptively according to the local obstacle configuration. In simple occlusion cases, a single waypoint is sufficient to bypass the obstacle. In denser canopy regions, additional waypoints are inserted sequentially until the resulting piecewise-linear trajectory is free of obstacle intersections. Each generated waypoint is then checked by the inverse kinematics solver for reachability and joint limit compliance before execution.

This method maintains the computational efficiency of the robot’s built-in IK solver, while allowing higher-level autonomy through adaptive trajectory shaping.

Convolutional Neural Network Architecture and Training Protocol

In the present study, a convolutional neural network (CNN) was developed that is capable of accurately distinguishing and counting overlapping objects of identical color, with a focus on detecting tomato fruits and structural components of the plant. The proposed architecture consists of ten convolutional layers, selected to accommodate the high spatial resolution of the input imagery required for downstream robotic control tasks. The initial convolutional layers are dedicated to extracting low-level visual features such as edges, textures, and localized contrast patterns, whereas the deeper layers capture higher-order representations, including the geometric form of tomatoes and the structural contours of the plant body. Following the convolutional feature extraction stage, the architecture incorporates five fully connected layers responsible for final classification (Figure 3).

The proposed convolutional neural network (CNN) was developed for multi-class semantic segmentation of tomato fruits, leaves, and background regions in greenhouse images. Input images were resized to 512 × 512 pixels and normalized to the [0, 1] range. The architecture consists of ten convolutional layers followed by five fully connected layers. All convolutional layers use 3 × 3 kernels with stride 1 and padding 1, and are initialized using He normal initialization. Each convolutional block is followed by batch normalization and 2 × 2 max pooling (stride 2). The number of feature maps increases progressively from 32 to 512 across the network depth (32, 64, 128, 256, 512). Rectified Linear Unit (ReLU) activation functions are applied throughout the network, and dropout with a probability of 0.5 is introduced before the first fully connected layer to mitigate overfitting. The final output layer uses Softmax activation to classify five categories: Tomato, Leaf, Failed Tomato, Failed Leaf, and No Object.

The final output is a dense multi-class probability map of size 512 × 512, i.e., the same spatial resolution as the input image, in which each pixel carries a probability distribution over the predefined classes and is assigned to one of them (e.g., fruit, leaf). This output is then used as input to the Watershed algorithm to improve boundary delineation in occluded regions.

The convolutional neural network used for fruit segmentation was designed to balance feature representation capacity and real-time computational constraints, rather than to maximize architectural complexity. The selected architecture, consisting of 10 convolutional layers, is motivated by theoretical considerations related to hierarchical feature extraction and effective receptive field size. Convolutional neural networks progressively build feature representations across layers: early layers capture low-level features (edges, textures), while deeper layers encode higher-level semantic structures such as object boundaries and shape cues. In greenhouse environments, fruit detection under occlusion requires the network to integrate both local texture information and global contextual cues, particularly when fruits are partially hidden by leaves.

A key factor in this process is the effective receptive field (ERF) of the network. As the number of convolutional layers increases, the receptive field expands, allowing each output neuron to incorporate information from a larger spatial region of the input image. This is essential for distinguishing fruits from visually similar background elements and for resolving partial occlusions. However, the effective receptive field is known to grow more slowly than the theoretical receptive field and to concentrate around the center of the input region. Therefore, sufficient network depth is required to ensure adequate spatial context coverage. The chosen depth of 10 convolutional layers provides a receptive field large enough to capture the typical spatial extent of fruits and their surrounding foliage while avoiding unnecessary computational overhead. Increasing the network depth beyond this point would further enlarge the receptive field, but with diminishing returns in segmentation performance and a significant increase in inference latency.
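As a rough illustration of how depth translates into spatial context, the theoretical receptive field of a layer stack can be computed with the standard recurrence below. The layout is an assumption: the text gives ten 3 × 3 convolutions, five feature-map widths, and 2 × 2 max pooling, which is read here as five blocks of two convolutions followed by one pooling layer. The effective receptive field discussed above would be smaller than this theoretical value.

```python
def receptive_field(layers):
    """Theoretical receptive field and cumulative stride of a layer stack.

    Each layer is a (kernel_size, stride) pair, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth is scaled by the stride accumulated so far
        jump *= stride
    return rf, jump

# Assumed layout: five blocks of two 3x3 convolutions followed by 2x2 max pooling.
block = [(3, 1), (3, 1), (2, 2)]
rf, total_stride = receptive_field(block * 5)
feature_map_size = 512 // total_stride   # spatial size after pooling a 512 x 512 input
```

Under this assumed layout, each output location sees a 156 × 156 pixel window of the 512 × 512 input, and the final feature map is 16 × 16.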

Moreover, deeper architectures introduce additional challenges such as vanishing gradients, higher memory consumption, and increased sensitivity to overfitting, particularly when training data is limited. In contrast, a moderately deep network maintains stable training dynamics and allows efficient deployment on standard GPU hardware.

From a systems perspective, the perception module must operate within a closed-loop control framework. The segmentation output directly influences trajectory planning, requiring consistent and predictable inference times.

Model training was performed using multi-class Cross-Entropy Loss. Three optimizers were evaluated: SGD (momentum 0.9), Adam, and AdamW with weight decay regularization (10⁻²). The learning rate was set to 1 × 10⁻⁴ with step decay applied after epoch 20. The batch size was 64, and the network was trained for 30 epochs (50 epochs in the extended dataset experiment). All models were trained under identical hyperparameter settings to ensure fair comparison.
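For reference, the multi-class cross-entropy loss can be written as below, assuming it is averaged over the N pixels of an image. The notation is ours: y_{i,c} is the one-hot ground-truth label of pixel i for class c, and p_{i,c} is the corresponding Softmax output.

```latex
\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_{i,c}, \qquad C = 5
```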

Post-processing of the CNN segmentation masks was performed using a Watershed algorithm to improve boundary delineation between overlapping fruits. This procedure included distance transform computation, marker extraction, and morphological filtering, adding an average computational overhead of approximately 11 ms per frame.
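The distance-transform and marker-extraction steps can be sketched as follows. This is a simplified illustration built on SciPy rather than the authors' implementation: the 0.7 marker threshold, the function name `split_touching_fruits`, and the synthetic two-disc mask are assumptions, and the morphological filtering step is omitted.

```python
import numpy as np
from scipy import ndimage as ndi

def split_touching_fruits(mask, peak_fraction=0.7):
    """Label touching blobs in a binary mask with a distance-transform watershed."""
    dist = ndi.distance_transform_edt(mask)
    # Marker extraction: keep only the core of each blob (assumed threshold).
    markers, _ = ndi.label(dist > peak_fraction * dist.max())
    # Flood from the markers over the inverted distance map (8-bit cost image).
    cost = np.round(255 * (dist.max() - dist) / dist.max()).astype(np.uint8)
    labels = ndi.watershed_ift(cost, markers)
    labels[~mask] = 0            # keep labels only inside the fruit mask
    return labels

# Synthetic example: two overlapping discs stand in for a CNN fruit mask.
rows, cols = np.mgrid[0:64, 0:96]
mask = (((rows - 32) ** 2 + (cols - 30) ** 2 <= 400)
        | ((rows - 32) ** 2 + (cols - 62) ** 2 <= 400))
labels = split_touching_fruits(mask)
n_fruits = len(np.unique(labels)) - 1   # label 0 is background
```

The two discs form a single connected region in the mask, yet receive separate labels because the distance map has one core per disc.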

Training and inference were conducted using PyTorch 2.11 (PyTorch Foundation, San Francisco, CA, USA) on a workstation equipped with an NVIDIA RTX 3080 GPU (10 GB VRAM), Intel i7 processor, and 32 GB RAM. The total training time per configuration was approximately 2.5 h. The combined CNN and Watershed pipeline achieved a mean inference time of 85.3 ms per image, corresponding to approximately 11.7 frames per second, satisfying real-time operational requirements for robotic harvesting.

To facilitate robust detection, the CNN was trained on the previously constructed dataset of 1000 images, including 900 real greenhouse images and 100 synthetically generated images. The synthetic images were generated with variations in rotation, Gaussian noise, and controlled overlaps to simulate real working scenarios. The model was trained for 30 epochs using batches of 64 images, with the Cross-Entropy loss function applied.

Three main configurations were tested: CNN using the Adam optimizer; CNN using the SGD optimizer; CNN using AdamW with Watershed post-processing for segmentation refinement (Table 1). These were comparatively evaluated in terms of overall accuracy and computational performance.

2.4. Experimental Setup

The experimental validation of the robotic harvesting system was carried out in a research greenhouse (7 m wide, 50 m long, and 7 m high) with a structure optimized for a light transmission of over 90%. Monitoring is fully automated through networks of precision sensors (temperature, humidity, CO₂) that adjust ventilation in real time through motorized ridge windows and air recirculation fans. Irrigation is carried out through fertigation systems. For the tests, 60 tomato plants were selected and arranged in parallel rows, with a row spacing of 100 cm. Along the crop rows, a 30 m guiding rail was installed to allow the mobile platform to move over the entire length of the experimental section.

The dataset used for training and evaluation of the proposed CNN model consisted of 1000 images, including 900 real greenhouse images and 100 synthetically generated images. A stratified split was applied to ensure balanced class distribution across subsets. Specifically, 70% of the dataset (700 images: 630 real and 70 synthetic) was allocated for training, 15% (150 images: 135 real and 15 synthetic) for validation, and 15% (150 images: 135 real and 15 synthetic) for testing. The test set was strictly isolated and not used during training or hyperparameter tuning to prevent data leakage and ensure unbiased performance evaluation. All real images were collected from the same experimental greenhouse facility under operational harvesting conditions. Image acquisition was performed during a single production cycle, within the same growing season, to maintain environmental consistency in terms of illumination, humidity, and plant development stage. The dataset includes several tomato types and varieties, including cherry tomatoes, table tomatoes, Florina de Buzău, and Citrina, thereby introducing variability in fruit morphology, canopy appearance, and ripening characteristics. This controlled acquisition protocol ensured dataset homogeneity while allowing variability in fruit occlusion, clustering, and natural lighting conditions representative of practical greenhouse environments. Additional diversity was introduced through variation in camera-to-scene distance (approximately 0.3–1.0 m), camera viewing angle (approximately −10° to 40°), and image acquisition hardware. The images were acquired using ZED X and Logitech HD Pro C920 cameras (Logitech International S.A., Lausanne, Switzerland) mounted on a mobile platform.

The annotation process was carried out in Label Studio Community Edition. Built-in segmentation tools were used to create the regions of interest directly on the images, following a predefined labeling schema that specified the target classes. Label Studio Community Edition does not include inter-annotator agreement metrics; as a result, inter-annotator agreement scores were not computed. Annotation consistency was checked using a review-and-correct workflow, in which an annotator reviewed all segmentation masks before they were used in the final dataset. The annotated dataset contains approximately 28,000 labeled tomato instances, with around 19,000 instances corresponding to the green fruit class, which improved the effectiveness of the model in distinguishing unripe tomatoes from leaves and stems. Figure 4 shows details from the tomato harvesting robot experimentation.
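The reported 70/15/15 partition can be reproduced with a simple stratified split, sketched below. The sketch stratifies by image origin (real or synthetic), which is an assumption chosen because it matches the reported counts; the file names, the function name `stratified_split`, and the random seed are hypothetical.

```python
import random

def stratified_split(items, key, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split items into train/val/test while preserving the share of each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# Hypothetical file names; only the real/synthetic counts come from the text.
images = [("real", f"real_{i:04d}.png") for i in range(900)]
images += [("synthetic", f"synth_{i:04d}.png") for i in range(100)]
train, val, test = stratified_split(images, key=lambda item: item[0])
```

Because each stratum is split separately, the three subsets keep the 9:1 ratio of real to synthetic images, and no image appears in more than one subset.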

The plants were grown under natural light conditions, while the relative humidity inside the greenhouse was maintained at 60–70% through passive ventilation. This setup reflects the common environmental parameters of tomato cultivation in protected spaces, ensuring that the robotic tests were performed under realistic operational conditions.

To assess the performance of the robotic system, the following evaluation metrics were defined:

-: Fruit detection rate—percentage of fruits correctly identified by the CNN–Watershed algorithm relative to the ground truth count.
-: Positioning accuracy—mean absolute error (mm) between the computed fruit coordinates and their manually measured positions.
-: Harvesting success rate—percentage of fruits successfully harvested.
-: Average harvesting time—mean time (s) required to detect, plan, and harvest a single fruit.
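The four metrics listed above can be computed as sketched below. The positioning error is taken here as the mean Euclidean distance between estimated and measured 3D positions, which is one possible reading of the mean absolute error; all function names and trial values are hypothetical and are not data from the study.

```python
import numpy as np

def percentage(successes, total):
    """Share of successes in percent (detection rate, harvesting success rate)."""
    return 100.0 * successes / total

def positioning_error_mm(estimated_m, measured_m):
    """Mean positioning error in mm, taken as the mean Euclidean distance."""
    diff = np.asarray(estimated_m, float) - np.asarray(measured_m, float)
    return 1000.0 * np.linalg.norm(diff, axis=1).mean()

# Hypothetical trial log; none of these numbers come from the study.
detection_rate = percentage(successes=45, total=50)
harvest_success_rate = percentage(successes=8, total=10)
error_mm = positioning_error_mm(
    [[0.500, 0.200, 0.300], [0.400, 0.100, 0.250]],   # estimated positions [m]
    [[0.503, 0.204, 0.300], [0.400, 0.100, 0.252]],   # measured positions [m]
)
mean_cycle_time_s = float(np.mean([14.0, 16.0, 21.0]))
```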

During the experimental trials, several measures were adopted to improve operational safety for plants, fruits, and operators. The harvesting unit used a UR5e collaborative robotic arm, while obstacle-aware trajectory generation treated leaves and stems as protected regions surrounded by a 10 mm safety buffer. In occluded cases, intermediate waypoints were introduced to avoid direct contact with canopy elements. The rail-guided mobile platform improved motion stability during manipulation, and the gripper configuration with adaptive closure and compliant elastomer finger pads reduced the risk of fruit damage during grasping. All experiments were conducted under direct supervision, and no human operator was allowed to remain within the immediate manipulation workspace during active harvesting motions.

3. Results

3.1. Fruit Detection Performance

The performance of the perception module was analyzed from two complementary perspectives. First, segmentation/classification-oriented metrics were used to evaluate the quality of the CNN–Watershed output in separating tomato, leaf, and background regions. These metrics include precision, recall, F1-score, mIoU, and confusion matrix analysis. Second, fruit-level practical performance was assessed in terms of successful fruit identification under occlusion and clustered arrangements, as this is directly relevant for robotic harvesting.

The overall accuracy of the CNN was higher with the Adam optimizer, which achieved an average of 91.5%, compared to 89.7% obtained with SGD.

Furthermore, the integration of the Watershed algorithm as an additional segmentation step increased accuracy to 96.9%, with a mean Intersection-over-Union (mIoU) score of 79.2%, notably higher than the other configurations.

This improvement indicates a clearer boundary delineation in the images, especially for overlapping or touching fruits.

In terms of inference time, the CNN optimized with Adam achieved an average of 74 ms per image. Adding the Watershed algorithm increased this to 85 ms, which still meets the real-time requirements of the application. Temporal variability between frames was also reduced, indicating high operational stability.

Training convergence was achieved earlier when using the Adam optimizer (around 17 epochs) compared to SGD (around 25 epochs), highlighting the adaptive optimizer’s efficiency in reducing network loss over a shorter training period.

The most notable improvement was observed in the ability to separate overlapping objects: in tests with 3–5 contacting fruits, the standard CNN showed a counting error rate of 18.3%, whereas applying the Watershed algorithm reduced this to 5.1%, confirming the method’s efficiency in segmenting similar and closely spaced objects.

The selection of Watershed segmentation was motivated by both computational and application-specific considerations. In dense greenhouse environments, fruits often appear in clusters with partial occlusions and weak boundary contrast, making instance separation particularly challenging.

The proposed hybrid approach combines the following:

(a) CNN-based dense prediction to identify fruit regions;

(b) Watershed algorithm to separate overlapping instances based on local gradient and distance information.

This combination offers several advantages in dense fruit cluster scenarios:

(1) Improved instance separation: Watershed effectively splits adjacent fruits that are detected as a single region by the CNN, particularly when boundaries are partially occluded.

(2) Reduced data requirements: Unlike instance segmentation networks, the proposed method does not require pixel-level instance annotations, making it more suitable for moderate-sized datasets.

(3) Computational efficiency: Watershed is lightweight and can be applied in real time, preserving the overall system frequency (~10–12 Hz), which is critical for closed-loop robotic control.

(4) Robustness to occlusion: The method leverages geometric cues from the CNN output and distance transforms, allowing better handling of partially visible fruits.

Table 2 presents the performance results for fruit identification for the three methods used in the research.

Figure 5 provides a visual description of the training losses for the three evaluated methods.

As illustrated in the figure above, the SGD optimizer (represented by the orange–yellow line) shows a relatively slow decrease in the loss function, with notable fluctuations starting around epoch 15, indicating a more unstable and less efficient learning dynamic compared to the other methods.

In contrast, Adam (blue line) leads to a faster reduction in loss during the early stages of training, but its progression tends to stabilize prematurely around epoch 20, suggesting a possible stagnation in parameter refinement.

The best-performing configuration in this scenario is AdamW + Watershed (green line), which demonstrates a steady and consistent decrease in loss throughout all 30 epochs, reaching the lowest values at the end of training and thus highlighting its superior efficiency in network optimization.

The graph supports the common observations in the literature: AdamW is generally a superior choice for training modern CNNs, particularly when stability and convergence speed are key priorities. For complex tasks such as image segmentation or object detection, selecting an appropriate optimizer can make the difference between a robust model and an inefficient one.

Figure 6 compares the performance of three optimizers (SGD, Adam, and AdamW) during the training of the convolutional neural network (CNN) over 30 epochs.

The vertical axis represents the loss function value, which reflects the model’s error rate during learning. The objective of any training process is to minimize this value as efficiently as possible.

It can be observed that the SGD optimizer exhibits a slow and fluctuating descending trend, especially after epoch 15, indicating unstable learning and limited adaptability.

Adam significantly accelerates the learning process during the initial epochs; however, after epoch 20, a plateauing tendency appears, which may indicate premature convergence.

In contrast, AdamW provides a steady and consistent decrease in the loss function throughout the entire analyzed interval, achieving the lowest final loss values and demonstrating superior efficiency in network parameter optimization.

In conclusion, the graph confirms that among the three tested methods, AdamW is the most efficient optimizer, both in terms of convergence speed and training stability.

These characteristics make it the preferred choice for applications where precision and reliability are essential.

The final reported performance throughout this paper corresponds to the CNN + AdamW + Watershed configuration, which represents the best-performing model.

Influence of Training Data Composition on CNN Performance

To evaluate the impact of training data composition on object detection performance, a comparative study was conducted using two distinct subsets of tomato fruit images:

-: Partially occluded tomatoes: fruits partially covered by leaves, stems, or adjacent fruits.
-: Fully visible tomatoes: unobstructed fruits, fully visible within the image frame.

For each subset, the number of training images was varied while maintaining the same CNN architecture (10 convolutional layers) and identical training parameters (batch size = 8, learning rate = 1 × 10⁻⁴, optimizer = AdamW).

The total number of epochs was fixed at 50 for all experiments, and the model’s performance was evaluated on a separate validation set, containing a balanced distribution of both classes. The obtained results clearly show that the number and visibility condition of the training images have a significant impact on the overall accuracy of the CNN model.

In the case of fully visible tomatoes, performance increases rapidly with the number of training images up to approximately 200 images, after which the improvements become negligible. The network quickly learns to recognize distinct visual features such as color, texture, and shape, leading to very high precision (~97%) and a stable detection rate (recall) above 96%. In contrast, occluded tomatoes required roughly twice as many images to achieve comparable accuracy. This difference is caused by the high visual variability resulting from partial overlaps, lighting differences, and background interference.

Even with 200 occluded images, where the F1-score reached around 90%, isolated errors were still observed in cases of contiguous contours or occlusions caused by stem structures. The comparative analysis demonstrates that the CNN’s ability to extract visual features is strongly influenced by the complexity of the visual context represented in the training data. Therefore, obtaining a balanced dataset that includes both fully visible and partially occluded images is essential for achieving good generalization in real-world robotic harvesting applications. The diversity and volume of training data play a crucial role in achieving reliable CNN-based detection performance. These results suggest that at least 50–60% of the training images should include cases with partial occlusion in order to enhance the system’s robustness under variable field conditions.

The confusion matrix in Table 3 summarizes the classification performance of the CNN + AdamW + Watershed model in distinguishing between tomato fruits, leaves, and background areas.

The model achieved high consistency across all categories, correctly identifying 94% of tomatoes, 91.5% of leaves, and 94% of background regions (no object).

Misclassifications occurred mainly between “failed tomato” and “failed leaf” cases, typically caused by partial occlusions or color-texture similarities in shadowed regions.

To improve robustness in challenging scenarios, two additional classes, Failed Tomato and Failed Leaf, were used. They are defined as follows:

-: Failed Tomato: partially visible or heavily occluded fruit instances with ambiguous boundaries.
-: Failed Leaf: leaf regions that overlap with fruit boundaries or are frequently misclassified.

These classes were annotated manually and included during training to explicitly model uncertain or failure-prone regions, improving the robustness of the segmentation pipeline.

The improved recall value of 95.8% for the Tomato class demonstrates the model’s enhanced ability to detect partially occluded fruits while maintaining high boundary precision.

The refined confusion matrix confirms a balanced trade-off between false positives and false negatives across all categories, resulting in an overall classification accuracy of approximately 95%.

This improvement validates the efficiency of the CNN + AdamW + Watershed architecture in realistic agricultural environments.

The mean inference time was measured over 500 test images captured in varying illumination and occlusion conditions.

As reported in Table 4, the model maintained a mean processing time well below the real-time threshold, confirming its feasibility for field-deployable robotic systems.

To quantitatively evaluate the effectiveness of the proposed CNN + AdamW + Watershed model, several performance metrics were computed, including Precision, Recall, F1-score, and Mean Intersection-over-Union (mIoU).

These indicators reflect the model’s ability to correctly identify fruit and leaf regions, minimize false detections, and maintain stable segmentation boundaries.
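These indicators can all be derived from a confusion matrix, as sketched below. The three-class matrix is a hypothetical toy example (not data from the study), with rows as true classes and columns as predicted classes; the function name `per_class_metrics` is also illustrative.

```python
import numpy as np

def per_class_metrics(confusion):
    """Per-class precision, recall, F1-score and IoU from a confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp      # belonging to the class, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

# Hypothetical pixel counts for tomato, leaf, and background.
toy = [[90, 5, 5],
       [4, 80, 16],
       [6, 15, 179]]
precision, recall, f1, iou = per_class_metrics(toy)
miou = iou.mean()
```

Note that IoU penalizes both false positives and false negatives in the denominator, so it is always at most as high as the F1-score for the same class.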

The fruit detection module, based on the hybrid CNN–Watershed pipeline, was evaluated both qualitatively and quantitatively on a dataset of tomato fruit images collected under varying conditions of occlusion and overlap (Figure 7 and Figure 8). Table 2 reports the overall recall computed across all five output classes of the segmentation model (Tomato, Leaf, Failed Tomato, Failed Leaf, and No Object), whereas Table 5 presents recall values only for the three main operational classes shown there.

In practical clustered fruit scenarios, the CNN–Watershed pipeline improved the consistency of fruit separation and reduced counting errors relative to CNN-only predictions, especially in cases of moderate overlap.

Figure 7 illustrates a challenging case of fruit detection under occlusion, where a tomato is partially hidden by leaves and overlaps with an adjacent fruit. In the first image, the quasi-shape estimation is represented, highlighting the approximate contour of the occluded fruit. This intermediate step demonstrates the ability of the CNN–Watershed pipeline to infer the presence of a fruit even when its geometry is not fully visible. In the second image, the final detection is shown, where bounding boxes are generated around both fruits, confirming that the system successfully separated the overlapping objects.

Figure 8 presents the detection of a dense cluster of tomato fruits, a scenario that typically poses significant challenges for segmentation algorithms. In the first image, the quasi-shape estimation step groups the entire fruit cluster into a single approximate contour, which provides an initial localization of the target area. In the second image, the CNN–Watershed pipeline successfully separates the individual fruits within the cluster and assigns distinct bounding boxes to each of them.

This result highlights the strength of the hybrid method in managing cases where fruits are tightly packed and partially overlapping. While a conventional CNN might struggle to distinguish boundaries between adjacent fruits, the integration of Watershed segmentation enables a more refined separation, ensuring accurate fruit counting and localization.

To evaluate the impact of training data composition on object detection performance, a study was conducted using two distinct subsets of tomato images:

-: Partially occluded tomatoes: partially covered by stems, leaves, or other adjacent fruits.
-: Fully visible tomatoes: unobstructed and entirely visible within the image frame.

Regarding validation, manual verification was performed for all evaluation metrics. Specifically:

(a) For detection performance, all test images were manually annotated and reviewed to confirm correct detections and identify false positives/negatives.

(b) For positioning accuracy, selected trials were manually inspected to verify correspondence between estimated and actual fruit locations.

(c) For harvesting success, each trial was visually monitored and recorded, and the outcome (success/failure) was manually logged.

For each subset, the number of images used for training was varied while maintaining a consistent CNN architecture (10 convolutional layers) and identical training parameters (batch size = 8, learning rate = 1 × 10⁻⁴, optimizer = AdamW). The total number of epochs was fixed at 50 for all experiments, and model performance was evaluated on a validation set containing a balanced distribution of both classes. Table 6 presents the tested subset results.

3.2. Camera Calibration and Coordinate Transformation Accuracy

The calibration procedure ensured the alignment between the camera reference frame and the robot base frame, allowing the transformation of detected fruit coordinates into actionable robot poses. To evaluate the accuracy of this process, a set of calibration targets (checkerboard points and manually measured fruit positions) was used as ground truth references.

The mean absolute error between the transformed camera coordinates and the actual robot frame coordinates was 2 mm, with a standard deviation of 3 mm. Errors were smallest in the central region of the camera’s field of view, while slightly larger deviations were observed near the image borders, as expected from lens distortion effects. The data regarding the accuracy of camera-to-robot coordinate transformation is presented in Table 7.

These results demonstrate that the calibration procedure provides sufficient accuracy for robotic harvesting tasks, where sub-centimeter precision is acceptable given the adaptive gripper design and the tolerance to small positioning deviations.
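The evaluation described above can be sketched as a rigid camera-to-robot transformation followed by per-axis error statistics. This is a minimal illustration under stated assumptions, not the authors' calibration code: the extrinsics (a rotation about Z and a translation), the sample points, and the function names are all hypothetical, the IMU-assisted compensation is not modelled, and the population standard deviation is an arbitrary choice because the text does not say which estimator was used.

```python
import math
from statistics import mean, pstdev


def rotation_z(yaw_rad):
    """3x3 rotation matrix about the Z axis (hypothetical extrinsic rotation)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def camera_to_robot(point_cam, rotation, translation):
    """Apply p_robot = R * p_cam + t to one 3D point (units: mm)."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def per_axis_abs_error(estimated, reference):
    """Return [(mean, SD)] of the absolute error along X, Y and Z."""
    stats = []
    for axis in range(3):
        errors = [abs(e[axis] - r[axis]) for e, r in zip(estimated, reference)]
        stats.append((mean(errors), pstdev(errors)))  # population SD (assumption)
    return stats


if __name__ == "__main__":
    # Hypothetical extrinsics: camera yawed 90 degrees, offset 100 mm in X, 50 mm in Z.
    R = rotation_z(math.pi / 2)
    t = (100.0, 0.0, 50.0)
    camera_points = [(10.0, 0.0, 0.0), (0.0, 20.0, 5.0)]
    # Hypothetical manually measured positions in the robot frame.
    measured = [(101.0, 12.0, 49.0), (79.0, 1.0, 56.0)]
    transformed = [camera_to_robot(p, R, t) for p in camera_points]
    for name, (m, sd) in zip("XYZ", per_axis_abs_error(transformed, measured)):
        print(f"{name}: mean abs error = {m:.2f} mm, SD = {sd:.2f} mm")
```

Reporting the error per axis, as in Table 7, makes it easier to see whether a deviation comes from depth estimation or from lateral misalignment.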

3.3. Inverse Kinematics and Trajectory Planning Performance

The inverse kinematics (IK) solver of the UR5e robotic arm was evaluated during harvesting trials to assess its ability to reach target fruit positions derived from the camera-to-robot transformation. For unobstructed fruits, the solver consistently generated valid joint configurations with a mean computation time below 16 ms, enabling smooth point-to-point trajectories. In scenarios with occlusions, a waypoint-based trajectory strategy was employed to guide the gripper around leaves or stems. This approach increased the mean trajectory execution time per fruit from 12 s in the direct case to 21 s in occlusion scenarios. However, the detour ensured collision-free harvesting and prevented damage to both fruits and plants.
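The waypoint idea can be illustrated with a simple geometric sketch. The text does not give the authors' exact construction, so everything below is an assumption: the occluding leaf or stem is approximated by a sphere, the direct approach is a straight segment, and a single via-point is inserted when the segment passes closer to the obstacle than a chosen clearance. All names and values are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _scale(a, k):
    return (a[0] * k, a[1] * k, a[2] * k)


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def closest_point_on_segment(start, end, point):
    """Point of the segment start-end that is nearest to `point`."""
    seg = _sub(end, start)
    seg_len_sq = _dot(seg, seg)
    if seg_len_sq == 0.0:
        return start
    t = max(0.0, min(1.0, _dot(_sub(point, start), seg) / seg_len_sq))
    return _add(start, _scale(seg, t))


def plan_waypoints(start, target, obstacle_center, obstacle_radius, clearance):
    """Return [start, target], or [start, via, target] if the direct path is too close."""
    safe = obstacle_radius + clearance
    closest = closest_point_on_segment(start, target, obstacle_center)
    offset = _sub(closest, obstacle_center)
    dist = _norm(offset)
    if dist >= safe:
        return [start, target]  # direct approach already keeps the clearance
    if dist < 1e-9:
        # Obstacle centre lies on the path: pick a direction perpendicular to it.
        seg = _sub(target, start)
        helper = (0.0, 0.0, 1.0) if abs(seg[2]) < 0.9 * _norm(seg) else (1.0, 0.0, 0.0)
        offset = _cross(seg, helper)
        dist = _norm(offset)
        if dist < 1e-9:  # degenerate case (start == target): fall back to +Z
            offset, dist = (0.0, 0.0, 1.0), 1.0
    # Push the via-point radially away from the obstacle to the safe distance.
    via = _add(obstacle_center, _scale(offset, safe / dist))
    return [start, via, target]


if __name__ == "__main__":
    path = plan_waypoints((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.05, 0.0), 0.10, 0.05)
    print(path)
```

A single via-point does not by itself guarantee clearance along both legs of the detour; a practical planner would re-check each leg and add further waypoints where needed. The appeal of this kind of construction is that it is deterministic and cheap to evaluate.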

The results show that, even with the additional processing overhead of waypoint generation, the system maintained real-time performance. The trajectory execution time reported in Table 8 refers exclusively to the robotic arm motion between waypoints and does not include perception, detection, or fruit handling operations, which are accounted for in the average harvesting cycle time reported in Table 9.

3.4. Harvesting Performance in Greenhouse Trials

The integrated robotic system was evaluated in real greenhouse conditions using the experimental setup described in Section 2.4. The mobile platform moved along a 30 m guiding rail installed between two tomato rows, covering approximately 60 plants cultivated under natural lighting and ventilation conditions.

A total of 264 fruits were selected for harvesting trials under mixed scenarios, including 102 unobstructed fruits and 162 occluded fruits. This distribution was not imposed as a balanced statistical design but reflected the actual greenhouse conditions encountered during testing, where partial occlusion by leaves, stems, or neighboring fruits was frequent. Since one of the main objectives of the present experimental model study was to evaluate harvesting performance under practically relevant occlusion conditions, the proportion of obstructed fruits was higher than that of fully visible fruits. The harvesting-level values reported in this section are therefore interpreted as observed performance indicators for this test set, rather than as averages over multiple independent replicated trials, and standard deviations and confidence intervals were not calculated for them. The experimental robot testing activity is presented in Figure 9.

The system achieved an overall success rate of 78.5%, defined as the proportion of fruits successfully detected, approached, and detached without damage. For unobstructed fruits, the success rate was higher (85%), while for occluded or clustered fruits it decreased to 72%, primarily due to incomplete detections or grasping inaccuracies.
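When reading aggregate figures of this kind, it helps to distinguish two common conventions: an unweighted mean of the per-scenario rates and a pooled rate weighted by the number of fruits in each scenario. The text does not state which convention was applied, so the sketch below simply evaluates both from the per-scenario entries of Table 9; the dictionary layout and function names are hypothetical.

```python
# Per-scenario entries transcribed from Table 9.
SCENARIOS = {
    "unobstructed": {"fruits": 102, "success_rate": 85.0},
    "occluded_or_clustered": {"fruits": 162, "success_rate": 72.0},
}


def unweighted_mean_rate(scenarios):
    """Simple average of the per-scenario success rates (percent)."""
    rates = [s["success_rate"] for s in scenarios.values()]
    return sum(rates) / len(rates)


def fruit_weighted_rate(scenarios):
    """Pooled success rate, weighting each scenario by its fruit count (percent)."""
    total_fruits = sum(s["fruits"] for s in scenarios.values())
    weighted_sum = sum(s["fruits"] * s["success_rate"] for s in scenarios.values())
    return weighted_sum / total_fruits


if __name__ == "__main__":
    print(f"Unweighted mean of scenario rates: {unweighted_mean_rate(SCENARIOS):.1f}%")
    print(f"Fruit-count-weighted rate:         {fruit_weighted_rate(SCENARIOS):.1f}%")
```

With the Table 9 entries, the unweighted mean is 78.5%, which coincides with the reported overall value, whereas pooling by fruit count gives about 77.0% because the occluded scenario contains more fruits.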

The average cycle time per fruit (from detection to storage) was 19.5 s. Direct harvesting of unobstructed fruits required on average 15 s, whereas harvesting in occluded scenarios increased to 27 s due to the waypoint-based trajectory optimization. Importantly, no collisions with stems or leaves were recorded during the trials, confirming the robustness of the motion planning strategy.

During operation, the ZED X stereo camera maintained reliable detection despite variations in natural illumination, while the CNN–Watershed pipeline proved capable of handling both isolated and clustered fruits. The UR5e robotic arm and adaptive gripper consistently achieved stable grasping and detachment, demonstrating tolerance to small positioning errors.

In the present dataset, occluded and clustered fruits were grouped within the obstructed category, because both conditions represented reduced visual and physical accessibility relative to unobstructed harvesting cases.

In addition to perception and manipulation performance, the influence of platform motion on harvesting outcomes was experimentally evaluated. The rail-based mobile platform was tested at three translational speeds along the greenhouse rows, while maintaining identical perception and manipulation parameters. The results indicate a clear trade-off between platform speed and harvesting performance. At lower travel speeds, the system achieved higher harvesting success rates and more stable manipulation, whereas increased platform speed led to a gradual decrease in success rate due to reduced perception stability and tighter timing constraints during arm motion execution. These results highlight the role of constrained, rail-guided mobility in improving repeatability under greenhouse conditions.

4. Discussion

The greenhouse trials indicate that the proposed robotic system can operate reliably under realistic harvesting conditions. The hybrid CNN–Watershed pipeline achieved consistent fruit detection in the presence of partial occlusions and clustered arrangements, with a mean IoU of 79.2%, a precision of 96.9%, and F1-scores of 96.5% for fully visible tomatoes and 89.2% for occluded tomatoes. These values fall within, and in some cases exceed, the range reported for tomato and strawberry harvesting robots in comparable studies, where precision typically varies between 85% and 97%. The main strength of the proposed vision pipeline lies in its ability to separate adjacent fruits under occlusion, a situation in which CNN-only approaches often produce merged detections. This capability is particularly relevant in greenhouse environments, where leaf interference and fruit clustering are common.

The camera calibration and coordinate transformation process resulted in sub-centimeter positioning accuracy, which proved sufficient for harvesting when combined with the compliant behavior of the adaptive gripper. Minor errors observed near the edges of the camera field of view, mainly attributable to lens distortion, did not have a measurable impact on harvesting success. The inverse kinematics solver of the robotic arm operated reliably in real time, while the use of intermediate waypoints during occluded approaches ensured collision-free motion. Although this strategy led to a moderate increase in cycle time, it effectively prevented contact with stems and reduced the risk of plant damage.

The harvesting success rate reached 85% for unobstructed fruits (78.5% overall), with an average cycle time of about 15 s per unobstructed fruit. These results are in line with those reported in related harvesting systems, although further improvements in speed are required to approach commercial throughput levels. The rail-based mobility platform contributed to system stability and repeatability, simplifying navigation and reducing the effects of dynamic loads during manipulation compared to fully mobile platforms. A key practical observation from the greenhouse trials is that successful harvesting depended not only on fruit detection accuracy, but also on the ability to adapt the manipulator approach according to occlusion and local canopy geometry.

Previous investigations have yielded comparable outcomes, which supports the findings reported here. For example, a study on identifying and precisely locating peduncle cutting points [26] achieved an mIoU of 82.83% and an mPA of 91.37% with an inference time of 11.42 ms. Beyond segmentation, that study introduced a specialized observation method that reached an accuracy of 77.84% in 68.25 ms, and showed that optimizing the observation angle is key to accurately identifying tomato peduncle picking points.

Alaaudeen et al. [27] reported identification success rates above 95% using a post-prediction process, with reattempt rates below 12%, confirming the importance of robust perception for robotic fruit harvesting. In the present study, the proposed hybrid CNN–Watershed pipeline reached 96.9% precision and 93.5% recall, and these perception results translated into an overall harvesting success rate of 78.5% under practical greenhouse conditions. Although the reported metrics are not directly comparable, both studies support the view that perception accuracy is a key determinant of harvesting performance.

Gong et al. [28] demonstrated that robotic harvesting of occluded fruits can be achieved with relatively short picking times, reporting 11 s per fruit, a harvesting success rate of 73.04%, and an average gripping accuracy of 8.21 mm. In the present work, the overall harvesting success rate was slightly higher, reaching 78.5%, but the average cycle time increased in occluded cases, due to the waypoint-based collision-avoidance strategy. This suggests that the proposed system prioritizes safe manipulation and robustness in cluttered greenhouse environments, even at the expense of longer execution times.

Zhao et al. [29] proposed a YOLOv5-based tomato detection and localization system for robotic harvesting and reported a detection accuracy of 94.1%, with distance measurement errors of approximately 3–5 mm on a dataset of 640 greenhouse images. In comparison, the perception module developed in the present study achieved 96.9% precision and 93.5% recall, while the camera-to-robot transformation provided mean positioning errors of 2–3 mm, indicating comparable localization accuracy together with slightly stronger detection performance under greenhouse conditions.

Tan et al. [30] reviewed deep learning applications in fruit and vegetable picking robots and identified visual perception, multi-sensor fusion, adaptive control, and human–computer interaction as key directions for further development. Their analysis also showed that environmental complexity remains one of the main barriers to robust deployment in agricultural settings. These observations are consistent with the motivation of the present work, which focuses on reliable fruit detection and harvesting under greenhouse occlusion conditions.

While recent harvesting systems increasingly rely on advanced object detection frameworks such as YOLO-based architectures, Mask R-CNN, or multi-view 3D reconstruction methods, most existing approaches treat perception, calibration, and motion planning as largely independent modules. In contrast, the proposed system introduces a tightly integrated perception–calibration–trajectory generation pipeline in which occlusion information extracted from the CNN–Watershed segmentation stage directly influences motion planning decisions in real time. Rather than focusing solely on maximizing detection accuracy under controlled conditions, the present work emphasizes operational robustness within structured greenhouse environments, where clustered fruits and foliage interference represent the primary bottlenecks to reliable harvesting. The adoption of a geometric occlusion-aware waypoint strategy, coupled with IMU-compensated camera-to-robot transformation, provides a deterministic and computationally efficient alternative to sampling-based planners, which often introduce stochastic variability and higher latency. Furthermore, unlike studies limited to algorithmic benchmarking, the current research validates the complete cyber–physical loop—from perception to physical detachment—under real greenhouse conditions using a rail-stabilized robotic platform. This systemic integration and field-level validation distinguish the proposed architecture from detection-centric solutions and contribute practical evidence toward scalable robotic harvesting in protected horticulture systems.

While the literature provides comparable results, the present paper’s novelty lies in the tight integration of occlusion-aware perception into the robotic control loop with clear real-world validation and effective execution in greenhouse tomato systems.

Several limitations were identified during the trials. Performance decreased in cases of severe occlusion or highly dense fruit clusters, suggesting that additional perception cues or multimodal sensing may be necessary to further improve robustness. To maintain high performance despite severe occlusions, alternative planting configurations should be implemented to optimize the robot’s line of sight.

While the achieved cycle time is competitive, it remains a limiting factor for large-scale deployment. Future work will focus on reducing inference latency through lighter detection models and optimizing motion planning strategies. The integration of complementary functionalities, such as crop monitoring or disease detection, is also being considered to extend the system’s role within precision agriculture workflows.

Overall, the results demonstrate that a robotic harvesting system combining hybrid vision algorithms with occlusion-aware motion planning can function effectively in greenhouse conditions. The presented approach represents a step toward practical deployment by addressing key challenges related to perception, manipulation, and system integration, rather than focusing solely on isolated algorithmic performance. Broader topics such as probabilistic planning, active foliage interaction, and digital twins are acknowledged as important directions and are considered part of future work beyond the scope of this study.

Research Limitations

Despite the encouraging results, several limitations of the current study should be acknowledged.

First, the dataset used for training and evaluation consists of images collected from a single greenhouse environment, during one production cycle, including several tomato types and varieties (cherry tomatoes, table tomatoes, Florina de Buzău, and Citrina). As a result, the variability in lighting conditions, plant morphology, and fruit appearance is limited. This restricts the ability to assess the generalization capability of the perception model across different agricultural contexts.

Second, the perception module was trained on a moderately sized dataset (900 real images and 100 synthetic images), which may not fully capture the diversity of occlusion patterns and environmental variations encountered in large-scale deployments.

Future improvements for severe occlusion scenarios may involve both technical and agronomic measures. On the technical side, multi-view perception and stronger instance segmentation could improve fruit separation in dense clusters. On the agronomic side, local pre-harvest leaf removal, already used in greenhouse tomato cultivation, could reduce occlusion and improve robotic accessibility to fruits.

In addition to occlusion and clustered fruits, greenhouse deployment may also be affected by humidity, contamination of optical components, plant debris, canopy irregularities, and maintenance requirements. These aspects were not the main focus of the present study, but they represent relevant practical constraints for future large-scale implementation.

Regarding deployment scale, the current experiments were performed in a representative greenhouse section and support experimental model demonstration, but they do not yet capture the full complexity of large commercial greenhouse operation, where greater row-to-row variability, prolonged operation, maintenance constraints, and broader environmental heterogeneity may affect robustness. These aspects therefore represent important directions for future system refinement and validation.

These limitations highlight the need for further research to evaluate the system under more diverse conditions and to improve its robustness and adaptability for real-world industrial deployment.

5. Conclusions

This study presented the design, implementation, and experimental validation of an integrated robotic system for autonomous tomato harvesting in greenhouse environments. The proposed architecture combines CNN-based semantic segmentation enhanced by Watershed post-processing, IMU-assisted camera-to-robot coordinate transformation, and an occlusion-aware waypoint-based trajectory strategy embedded within a rail-guided robotic platform.

From a perception standpoint, the hybrid CNN–Watershed pipeline demonstrated robust performance in detecting and separating clustered or partially occluded fruits, achieving high precision and recall under natural illumination conditions. The integration of segmentation refinement significantly reduced counting errors in overlapping scenarios, confirming the importance of boundary-aware post-processing in structured horticultural environments. By automating fruit detection and harvesting under real greenhouse conditions, the system has the potential to reduce dependency on manual pickers and stabilize production efficiency during peak harvesting periods. Furthermore, the modular architecture and rail-based mobility concept support scalability across commercial greenhouse rows, enabling progressive deployment without requiring major infrastructure redesign.

From a manipulation perspective, the IMU-compensated coordinate transformation ensured sub-centimeter localization accuracy, enabling reliable inverse kinematics solutions for fruit approach. The proposed geometric waypoint strategy provided a deterministic and computationally efficient alternative to complex sampling-based planners, maintaining collision-free motion inside dense plant canopies while preserving real-time operation.

Greenhouse trials confirmed the feasibility of the complete perception–planning–manipulation loop under realistic agronomic conditions. The system achieved competitive harvesting success rates and cycle times comparable to recently reported robotic harvesting platforms, while demonstrating stable operation facilitated by the rail-guided mobility architecture. These results indicate that system-level integration, rather than isolated optimization of detection accuracy, is critical for improving operational robustness in protected cultivation.

Despite these promising results, several limitations remain. The results reported in this study were validated within a specific experimental setting, involving a single greenhouse environment and one production cycle, which may restrict generalization across different greenhouse configurations or seasonal conditions. Furthermore, while the waypoint-based planner proved effective in structured environments, highly irregular canopy geometries may require more adaptive motion planning strategies. Harvesting throughput also remains below the level required for fully commercial deployment.

Future research will focus on expanding dataset diversity across cultivars and seasons, incorporating lighter real-time segmentation architectures to reduce inference latency, and exploring adaptive motion sequencing to increase harvesting efficiency. The integration of additional sensing modalities and decision-support modules may further enhance system autonomy and scalability.

Overall, this work demonstrates that the tight coupling of hybrid perception algorithms with occlusion-aware robotic control represents a viable pathway toward reliable and scalable robotic harvesting in greenhouse tomato production systems.

No visible negative effects on plant health were observed during the experimental period, as no relevant mechanical damage or apparent increase in disease incidence was associated with robot operation on the tested plants. Nevertheless, long-term trials will be required to evaluate possible cumulative effects of repeated robotic harvesting on crop health under extended greenhouse use.

Given current advances in the field, robots are expected to work collaboratively with human workers. However, changes in the supporting robotic infrastructure will be needed, including integration with other robotic systems for logistics, conditioning, and storage (fruit management).

Author Contributions

Conceptualization, M.G.M. and F.N.; methodology, M.G.M., A.Z.A., F.B.M. and R.D.C.; software, R.D.C.; validation, C.I.P., M.G.M. and R.D.C.; investigation, M.G.M., F.B.M. and R.D.C.; resources, A.Z.A. and M.G.M.; writing—original draft preparation, M.G.M. and F.N.; writing—review and editing, M.G.M., F.B.M. and F.N.; project administration, M.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed by the European Union—NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, project No. BG-RRP-2.013-0001. This work was supported by a grant of the Ministry of Agriculture and Rural Development, contract of sectorial financing, ADER 2023-2026, Project ADER 25.1.1/2023, “Technology for robotized harvesting of Solanaceae family vegetables in greenhouses and solariums, using artificial intelligence”.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Naito, H.; Ota, T.; Shimomoto, K.; Hosoi, F.; Fukatsu, T. Accuracy Assessment of Tomato Harvest Working Time Predictions from Panoramic Cultivation Images. Agriculture 2024, 14, 2257.
2. Peña, A.; Rovira-Val, M.R.; Mendoza, J.M.F. Life cycle cost analysis of tomato production in innovative urban agriculture systems. J. Clean. Prod. 2022, 367, 133037.
3. Sturiale, S.; Gava, O.; Gallardo, M.; Buendía Guerrero, D.; Buyuktas, D.; Aslan, G.E.; Laarif, A.; Bouslama, T.; Navarro, A.; Incrocci, L.; et al. Environmental and Economic Performance of Greenhouse Cropping in the Mediterranean Basin: Lessons Learnt from a Cross-Country Comparison. Sustainability 2024, 16, 4491.
4. Wang, Z.H.; Xun, Y.; Wang, Y.K.; Yang, Q.H. Review of smart robots for fruit and vegetable picking in agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54.
5. Mail, M.F.; Maja, J.M.; Marshall, M.; Cutulle, M.; Miller, G.; Barnes, E. Agricultural Harvesting Robot Concept Design and System Components: A Review. AgriEngineering 2023, 5, 777–800.
6. Baeten, J.; Donne, K.; Boedrij, S.; Beckers, W.; Claesen, E. Autonomous Fruit Picking Machine: A Robotic Apple Harvester. In Field and Service Robotics: Results of the 6th International Conference; Springer: Berlin/Heidelberg, Germany, 2008; pp. 531–539.
7. Van Henten, E.J.; Hemming, J.; Van Tuijl, B.A.J.; Kornet, J.G.; Meuleman, J.; Bontsema, J.; Van Os, E.A. An autonomous robot for harvesting cucumbers in greenhouses. Auton. Robot. 2002, 13, 241–258.
8. Arima, S.; Kondo, N.; Monta, M. Strawberry harvesting robot on table-top culture. In 2004 ASAE Annual Meeting; American Society of Agricultural and Biological Engineers: St. Joseph, MI, USA, 2004; p. 1.
9. Arad, B.; Balendonck, J.; Barth, R.; Ben-Shahar, O.; Edan, Y.; Hellström, T.; Hemming, J.; Kurtser, P.; Ringdahl, O.; Tielen, T.; et al. Development of a sweet pepper harvesting robot. J. Field Robot. 2020, 37, 1027–1039.
10. Reed, J.N.; Miles, S.J.; Butler, J.; Baldwin, M.; Noble, R.A.E. Automation and emerging technologies: Automatic mushroom harvester development. J. Agric. Eng. Res. 2001, 78, 15–23.
11. Yamamoto, S.; Hayashi, S.; Yoshida, H.; Kobayashi, K. Development of a stationary robotic strawberry harvester with a picking mechanism that approaches the target fruit from below. Jpn. Agric. Res. Q. JARQ 2014, 48, 261–269.
12. Sepúlveda, D.; Fernández, R.; Navas, E.; Armada, M.; González-De-Santos, P. Robotic Aubergine Harvesting Using Dual-Arm Manipulation. IEEE Access 2020, 8, 121889–121904.
13. Wang, L.; Zhao, B.; Fan, J.; Hu, X.; Wei, S.; Li, Y.; Zhou, Q.; Wei, C. Development of a tomato harvesting robot used in greenhouse. Int. J. Agric. Biol. Eng. 2017, 10, 140–149.
14. Ling, X.; Zhao, Y.; Gong, L.; Liu, C.; Wang, T. Dual-Arm Cooperation and Implementing for Robotic Harvesting Tomato Using Binocular Vision. Robot. Auton. Syst. 2019, 114, 134–143.
15. Anandhakrishnan, T.; Jaisakthi, S.M. Deep Convolutional Neural Networks for image based tomato leaf disease detection. Sustain. Chem. Pharm. 2022, 30, 100793.
16. Zhang, Q.; Liu, F.; Li, B. A heuristic tomato-bunch harvest manipulator path planning method based on a 3D-CNN-based position posture map and rapidly-exploring random tree. Comput. Electron. Agric. 2023, 213, 108183.
17. Li, T.; Sun, M.; He, Q.; Zhang, G.; Shi, G.; Ding, X.; Lin, S. Tomato recognition and location algorithm based on improved YOLOv5. Comput. Electron. Agric. 2023, 208, 107759.
18. Lin, G.; Tang, Y.; Zou, X.; Xiong, J.; Fang, Y. Color-, depth-, and shape-based 3D fruit detection. Precis. Agric. 2020, 21, 1–17.
19. Wada, K.; Kitamura, N.; Miyajima, R. Development of lightgun type input device for manipulator operation. In 2013 IEEE International Symposium on Industrial Electronics; IEEE: New York, NY, USA, 2013; pp. 1–5.
20. Rapado-Rincon, D.; van Henten, E.J.; Kootstra, G. Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking. Biosyst. Eng. 2023, 231, 78–91.
21. Sánchez-Molina, J.A.; Rodríguez, F.; Moreno, J.C.; Sánchez-Hermosilla, J.; Giménez, A. Robotics in greenhouses. Scoping review. Comput. Electron. Agric. 2024, 219, 108750.
22. Lin, T.; Sun, F.; Li, X.; Guo, X.; Ying, J.; Wu, H.; Li, H. A Review of Key Technologies and Recent Advances in Intelligent Fruit-Picking Robots. Horticulturae 2026, 12, 158.
23. Zhang, J.; Kang, N.; Qu, Q.; Zhou, L.; Zhang, H. Automatic fruit picking technology: A comprehensive review of research advances. Artif. Intell. Rev. 2024, 57, 54.
24. Ge, Y.; Xiong, Y.; From, P.J. Three-dimensional location methods for the vision system of strawberry-harvesting robots: Development and comparison. Precis. Agric. 2023, 24, 764–782.
25. Suchopár, A.; Kuře, J.; Kuřetová, B.; Hromasová, M. A Review of Integrated Approaches in Robotic Raspberry Harvesting. Agronomy 2025, 15, 2677.
26. Yuan, X.; Fan, X.; Jiang, Z.; Sun, X.; Dong, Z.; Du, Y.; He, J.; Ali, S.; Sun, K. CGR-YOLO: A grape leaf disease detection model based on coordinate attention and ghost convolution with receptive field expansion. Comput. Electron. Agric. 2025, 229, 109673.
27. Alaaudeen, K.M.; Selvarajan, S.; Manoharan, H.; Jhaveri, R.H. Intelligent robotics harvesting system process for fruits grasping prediction. Sci. Rep. 2024, 14, 2820.
28. Gong, L.; Wang, W.; Wang, T.; Liu, C. Robotic harvesting of occluded fruits with a precise shape and position reconstruction approach. J. Field Robot. 2021, 39, 69–84.
29. Zhao, J.; Bao, W.; Mo, L.; Li, Z.; Liu, Y.; Du, J. Design of tomato picking robot detection and localization system based on deep learning neural networks algorithm of YOLOv5. Sci. Rep. 2025, 15, 6180.
30. Tan, Y.; Liu, X.; Zhang, J.; Wang, Y.; Hu, Y. A review of research on fruit and vegetable picking robots based on deep learning. Sensors 2025, 25, 3677.

Figure 1. Robot’s working principle.

Figure 2. Tomato harvesting robot’s main components. 1—UR5e Robotic Arm; 2—Robotiq 2F Adaptive Gripper; 3—ZED X 3D Camera; 4—ZED Box Controller; 5—Robotic Arm Controller; 6—Mobile Platform; 7—48 V DC Battery; 8—Vegetable Storage Container; 9—Rail Track; 10—Platform Drive Motor.

Figure 3. Architecture of the convolutional neural network used for tomato detection.

Figure 4. Tomato harvesting robot experimentation.

Figure 5. Training Loss vs. Epochs.

Figure 6. Validation Accuracy vs. Epochs.

Figure 7. Quasi-shape estimation and final detection of the fruit’s position—the case when it is covered by a leaf and occlusion occurs between fruits.

Figure 8. Fruits detection process.

Figure 9. Testing the robot for harvesting tomatoes in greenhouse. (a) Direct frame captures from the ZED X camera system; (b) tomato fruit identification and grasping testing.

Table 1. Tested configurations.

Configuration	Optimizer	Post-Processing	Observations
CNN + Adam	Adam	None	Standard adaptive optimizer
CNN + SGD	SGD	None	Classical reference
CNN + Watershed + AdamW	AdamW	Watershed	Additional object separation refinement

Table 2. Performance results for fruit identification.

Method	Accuracy %	mIoU %	Recall %	False Positives %
CNN + SGD	89.7 ± 0.6	72.3	88.1	9.2
CNN + Adam	91.5 ± 0.4	74.6	90.4	8.1
CNN + AdamW + Watershed	96.9 ± 0.3	79.2	93.5	5.7

Table 3. Confusion matrix for CNN + Watershed tomato and leaf classification.

Predicted/Actual	Tomato	Leaf	Failed Tomato	Failed Leaf	No Object
Tomato	95.8%	1.8%	0.8%	0.5%	1.1%
Leaf	1.9%	92.4%	1.2%	0.9%	1.1%
Failed Tomato	0.8%	1.5%	91.8%	1.7%	1.3%
Failed Leaf	0.7%	1.1%	1.8%	90.5%	1.2%
No Object	0.8%	0.7%	1.0%	0.9%	96.4%

Table 4. Mean inference time and performance stability for CNN + Watershed.

Metric	Value, ms	Observation
Mean inference time	85.3	Compatible with real-time operation
Standard deviation	6.4	Low temporal variability
Minimum/Maximum	73.2/96.8	Stable under variable lighting
Frame rate equivalent	~11.7 FPS	Suitable for continuous video-based detection
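The frame-rate equivalent in Table 4 follows directly from the mean inference time, since one frame is processed per inference. The short sketch below reproduces that conversion; the function name is hypothetical, and the calculation assumes inference is the only per-frame cost, which ignores image capture and data transfer.

```python
def frames_per_second(inference_ms: float) -> float:
    """Convert a per-frame inference time in milliseconds to frames per second."""
    if inference_ms <= 0:
        raise ValueError("inference time must be positive")
    return 1000.0 / inference_ms


if __name__ == "__main__":
    # Mean, minimum and maximum inference times transcribed from Table 4 (ms).
    for label, ms in (("mean", 85.3), ("fastest", 73.2), ("slowest", 96.8)):
        print(f"{label}: {frames_per_second(ms):.1f} FPS")
```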

Table 5. Performance Metrics.

Class	Precision %	Recall %	F1-Score %	mIoU %
Tomato	96.8	95.8	96.3	80.1
Leaf	94.6	92.4	93.5	77.1
No Object	92.3	96.4	94.3	82.0
Mean value	94.6	94.9	94.7	79.7
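The "Mean value" row of Table 5 is consistent with an unweighted (macro) average over the three classes. The sketch below recomputes it from the per-class rows; the data layout and function name are hypothetical, and macro-averaging is an assumption inferred from the numbers rather than something stated in the text.

```python
# Per-class metrics transcribed from Table 5: (precision, recall, F1, mIoU), in percent.
TABLE_5 = {
    "Tomato": (96.8, 95.8, 96.3, 80.1),
    "Leaf": (94.6, 92.4, 93.5, 77.1),
    "No Object": (92.3, 96.4, 94.3, 82.0),
}


def macro_average(per_class):
    """Unweighted mean of each metric across classes, rounded to one decimal."""
    rows = list(per_class.values())
    n_metrics = len(rows[0])
    return [round(sum(row[i] for row in rows) / len(rows), 1) for i in range(n_metrics)]


if __name__ == "__main__":
    for name, value in zip(("Precision", "Recall", "F1-Score", "mIoU"), macro_average(TABLE_5)):
        print(f"{name}: {value}%")
```

A macro average gives each class equal weight regardless of how many samples it contains, so it can differ from an average computed over all individual detections.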

Table 6. Tested subset results.

Subset Type	Number of Training Images	Precision %	Recall %	F1-Score %	Observations
Occluded Tomatoes	100	85.4	82.7	84	Underfitting; insufficient examples for complex occlusion patterns
Occluded Tomatoes	200	90.1	88.3	89.2	Improved contour learning, but errors persist in dense clusters
Fully Visible Tomatoes	100	91.2	89.6	90.4	Solid baseline result, but limited generalization
Fully Visible Tomatoes	200	95.3	94.7	95	High stability; consistent precision and sensitivity
Fully Visible Tomatoes	300	96.9	96.2	96.5	Saturation zone; marginal performance gains with additional data

Table 7. Accuracy of camera-to-robot coordinate transformation.

Evaluation Metric	Value (Mean ± SD) (mm)	Notes
Positioning error (X)	2	Along X axis
Positioning error (Y)	3	Along Y axis
Positioning error (Z)	2	Along Z axis

Table 8. Performance of inverse kinematics and trajectory planning.

Scenario	Mean Time per Trajectory (s)	Collision Incidents (%)
Direct (no occlusion)	12	2
With occlusion + waypoints	21	15

Table 9. Harvesting success rate.

Scenario	Number of Fruits Tested	Success Rate (%)	Average Cycle Time (s)
Unobstructed fruits	102	85	15
Occluded/clustered fruits	162	72	27
Overall average	264	78.5	19.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

