Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots

Yan, Junyao; Yin, Jianjun; Hu, Jintang; Lai, Kefan

doi:10.3390/app16115610

Open AccessArticle

Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots

College of Future Information Technology, Fudan University, Shanghai 200433, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5610; https://doi.org/10.3390/app16115610

Submission received: 13 May 2026 / Revised: 30 May 2026 / Accepted: 1 June 2026 / Published: 3 June 2026

(This article belongs to the Section Agricultural Science and Technology)

Download

Browse Figures

Versions Notes

Abstract

Agriculture is currently transitioning toward higher intelligence and facility-based production, where harvesting robots play a crucial role in enhancing efficiency and ensuring standardized output. Addressing the challenges of inaccurate picking pose estimation and limited reachability in greenhouse environments, this paper proposes a reachable grasping pose estimation method based on Particle Swarm Optimization (PSO). First, initial poses are calculated via instance segmentation and keypoint extraction. Subsequently, a fitness function is constructed based on inverse kinematics, and the PSO algorithm is employed to iteratively search for optimal reachable poses. To further tackle planning difficulties in confined spaces, a two-stage path planning method based on cost maps is introduced. A series of performance metrics were designed to validate the proposed pose estimation and path planning methods through simulation experiments. In real-world field tests, the system achieved a harvesting success rate of 85%, significantly outperforming existing methods. The results demonstrate that the proposed approach substantially enhances the operational feasibility and success rate of tomato harvesting robots.

Keywords:

tomato harvesting robot; pose estimation; robotic manipulator; path planning; manipulator reachability

1. Introduction

Traditional agricultural production has faced significant impediments in advancing large-scale, standardized, and intelligent operations due to its inherent complexity. Currently, the agricultural sector is undergoing a transition towards higher levels of intelligence and facility-based production. Agricultural robotics technology is progressively replacing manual labor and playing an increasingly vital role in modern farming activities. Harvesting robots significantly enhance picking efficiency and precision while reducing losses during the harvesting process, thereby ensuring high quality and standardized production of agricultural products [1]. Consequently, harvesting robots have become a prominent trend in the global smart agriculture sector, with extensive applications in both open-field farms and greenhouse environments.

Tomatoes are economically significant crops worldwide and are extensively cultivated in greenhouse environments. A typical tomato cluster comprises the fruit, peduncle, and main stem. Robotic harvesting typically employs a manipulator equipped with end-effectors, such as grippers or shears, to sever the peduncle. However, both the fruit and the main stem are highly fragile; even minor collisions can lead to physical damage, resulting in reduced harvesting efficiency and substantial economic losses. Therefore, the primary challenge lies in achieving precise and efficient peduncle cutting while maintaining a collision-free trajectory. This task is further complicated by the inherent biological variability of tomato clusters, which exhibit inconsistent geometries and spatial poses. Furthermore, the dense planting and random distribution within greenhouses create a constrained workspace and a complex operational environment for robotic manipulators. To address these multi-faceted challenges, extensive research has been conducted. Existing research primarily focuses on two key methodologies: the estimation of optimal grasping poses for robotic manipulators and the development of collision-free path planning strategies.

Grasp pose estimation in robotic fruit harvesting typically employs RGB-D cameras to acquire scene images and point clouds, utilizing computer vision techniques to achieve crop localization and pose estimation [2]. Lehnert et al. obtained sweet pepper point clouds using an RGB-D camera and subsequently estimated their poses by fitting superellipsoids through nonlinear least squares [3]. Guo et al. investigated pre-grasp planning for various fruits based on point cloud data, employing the SAC-IA ICP algorithm to match online target point clouds with offline templates for 6D pose estimation, while utilizing the RANSAC algorithm for fruit morphology classification [4]. Zhong et al. used the YOLACT model to segment the main stem regions of lychees, then determined the angles of the stems through skeleton extraction and least squares methods [5]. Based on sparse semantic segmentation results, Barth et al. inferred the positions of fruits and stems within the image plane and estimated the angles between tomato main stems and fruits to determine the grasping poses for the harvesting robot, which improved the success rate from 14% to 52% [6]. Kang et al. employed the Hough transform to calculate the grasping poses of individual apple fruits and planned paths based on 3D crop models, increasing the grasping success rate for occluded fruits from 61% to 81% [7]. Zhang et al. proposed a 3D pose detection method for tomato clusters, which integrates prior geometric models with spatial position information of key points to achieve 3D reconstruction of the clusters [8]. Luo et al. utilized Mask R-CNN to segment 2D images of grape clusters and extract the corresponding point cloud mappings; they then fitted the point clouds of grape stems using the Locally Weighted Scatterplot Smoothing (LOWESS) algorithm and estimated the stem poses through geometric methods [9]. Li et al. introduced a spatial pose description model for tomato bundles, establishing an optimal pose plane for the end-effector to cut tomato stems, and further proposed a grasping pose decision model for tomato harvesting manipulators based on Deep Reinforcement Learning (DRL) [10].

Most current methods for robotic fruit harvesting pose estimation require prior knowledge of the target’s geometry to establish models that accurately describe the pose of fruits or clusters. However, since the spatial distribution of tomato clusters and the geometric shapes of both stems and fruits can be irregular, methods based on geometric analysis may encounter significant limitations. Furthermore, most existing studies evaluate harvesting success solely based on the geometric relationship between the end-effector and the target cluster, without considering the constraints imposed by the manipulator’s joint configurations and workspace on the reachability of the target pose. Consequently, it is often necessary to restrict the robot’s workspace within a predefined and limited range.

Numerous scholars have investigated the dexterous workspace and the reachability of grasping poses for robotic manipulators. Cao et al. explored the dexterity of robotic arms by simulating human grasping postures, defining a dexterity metric that correlates higher grasping success rates with high-dexterity configurations [11]. Lou et al. proposed a voxel-based deep 3D Convolutional Neural Network (3D-CNN) capable of generating reachable 6-DOF grasping poses within unrestricted workspaces, further developing a reachability predictor to evaluate the feasibility of specific grasp candidates [12]. Akinola et al. presented a workspace-oriented online grasp planning framework, creating a comprehensive database containing hundreds of thousands of reachable end-effector poses to guide the manipulator toward feasible configurations [13]. Another study utilized Signed Distance Fields (SDF) to model the robot’s reachability space, enabling the rapid filtering of unreachable grasps, and trained a neural network to predict grasping quality based on the target object’s current motion [14]. Makhal et al. developed Reuleaux, an open-source library for robotic reachability and base placement analysis, providing an efficient tool for determining pose reachability or optimal base positioning [15]. Although extensive research on reachability exists within the field of planar robotic grasping, the effectiveness of these methods in robotic fruit harvesting requires further validation due to the complexity and unstructured nature of agricultural environments.

Regarding robotic manipulator path planning, various planning algorithms based on joint space have been widely adopted to ensure path feasibility, given the inherent difficulty and complexity of solving inverse kinematics. The Rapidly-exploring Random Tree (RRT) is a classic algorithm in the field of manipulator path planning [16]. The fundamental concept of RRT is to construct a tree-like structure through random sampling within the joint space, enabling the tree nodes to explore the entire free space rapidly. While RRT is well-suited for high-dimensional spaces and dynamic environments, it suffers from significant randomness, poor path quality, and non-uniform spatial distribution. Consequently, numerous studies have proposed improvements to the RRT algorithm. RRT-Connect constructs two separate trees starting from the initial and target poses, respectively, advancing them toward each other until a connection is established [17]. RRT* imposes two constraints—path length and manipulability metrics—during the sampling process within the search space [18]. GAO-RRT* replaces random sampling with a dual-weight sampling strategy and adopts a variable step-size expansion strategy to enhance node generation efficiency; it also introduces a reverse-growth strategy to guide the exploration tree out of local optima [19]. Other methodologies, such as Probabilistic Roadmap (PRM) [20,21] and Artificial Potential Field (APF) [22,23], are also frequently applied in joint space planning.

Although joint space-based path planning algorithms effectively ensure path smoothness and motion feasibility, they often struggle to avoid obstacles efficiently in complex or dynamic environments. Conversely, Cartesian space planning allows for the direct and precise representation of obstacles but poses challenges regarding the smoothness of joint movements, typically involving higher computational costs and more complex constraints. Interpolation methods are frequently employed in Cartesian space planning to smooth trajectories and prevent abrupt turns [24,25]. While recent constraint- and optimization-based approaches have improved feasibility, they still struggle in narrow, obstacle-dense environments [26,27]. Shang et al. proposed a cost-map generation method based on voxelized scene point clouds, which achieved promising results in agricultural scenarios; however, its path smoothness comes at the expense of precision loss, potentially leading to damage in collision-sensitive tasks [28]. In recent years, path planning methodologies grounded in deep reinforcement learning and adaptive AI have also been extensively deployed in harvesting robotics [29,30], shifting the paradigm from static geometric collision avoidance toward data-driven operational flexibility.

To address the challenges of difficult grasp pose estimation and unreachable poses in robotic tomato harvesting, this paper introduces a reachability-adaptive framework for optimal grasp pose determination. This approach performs geometric modeling of tomato clusters based on intrinsic structural keypoints to calculate initial poses, and formulates a constrained kinematic optimization problem solved via a custom-designed Particle Swarm Optimization (PSO) variant to guarantee strict kinematic reachability. Furthermore, to tackle the difficulties of path planning in narrow and complex spaces, a semantic-enhanced, risk-aware two-stage Cartesian path planning strategy is proposed, which effectively enhances the planning success rate and minimizes collision damage. The main contributions of this paper are summarized as follows:

1.: Prior-guided Initial Pose Formulation: A morphology-aware tomato cluster pose model is established by fusing deep-learning semantic features with cluster-specific geometric priors. By mapping multi-class instance segmentation to keypoint topological constraints, the optimal harvesting entry vector is analytically initialized. This analytical pose serves as a highly informed sampling center for population initialization, effectively shifting the optimization from a blind global search to an accelerated, targeted local refinement.
2.: Kinematics-Constrained Reachability Optimization: A novel reachability-driven optimization methodology is proposed, where joint-space physical constraints and analytical inverse kinematics (IK) feasibility are mathematically embedded into the algorithm’s fitness function. By transforming the reachability problem into a bounded non-linear optimization task, the customized PSO variant iteratively steers the end-effector away from singularity zones and joint limits, significantly expanding the robot’s effective operational workspace volume in non-dexterous environments.
3.: Semantic-Weighted and Decoupled Path Planning: A risk-aware path planning mechanism is developed by integrating high-level object semantics into a voxelized 3D cost-map, allowing for differentiated obstacle avoidance weights and distinct collision tolerances based on obstacle vulnerability (e.g., main stems vs. leaves). Furthermore, the trajectory generation is methodologically decoupled into two asymmetric search phases—approaching and executing—each governed by localized heuristic search strategies, which drastically improves the algorithm’s convergence rate and collision-free success rate in dense canopies.

2. Materials and Methods

2.1. Robotic System

As shown in Figure 1a, the tomato harvesting robot in this study consists of a mobile chassis equipped with a lifting platform, a 6-DOF robotic manipulator (RM65-B, Realman, Beijing, China), an Orbbec Femto Mega RGB-D camera (Orbbec 3D Technology Co., Ltd., Shenzhen, China), an end-effector, and an embedded computing platform (NVIDIA Jetson AGX Orin, NVIDIA Corporation, Santa Clara, CA, USA).

The 6-DOF robotic manipulator is mounted on the lifting platform of the mobile chassis. The base of the manipulator is moved to the target workspace through the horizontal movement of the chassis along tracks and the vertical adjustment of the lifting platform. The manipulator features a payload of 5 kg and a workspace radius of 610 mm. The end-effector is an electrically controlled gripping and cutting device with a length of 200 mm, designed to cut and hold the stems of tomato clusters. Figure 1b illustrates the kinematic model of the RM65-B manipulator, established using the modified Denavit–Hartenberg (D–H) method, with the corresponding D-H parameters presented in Table 1. An Orbbec Femto Mega RGB-D camera is mounted on the end-effector in an eye-in-hand configuration to capture color images and aligned depth images. The resolution of the RGB images is 1920 × 1080 pixels. The captured images are subsequently received by the embedded computing platform and processed into 3D point clouds for tomato cluster recognition, localization, and modeling. Figure 1c presents the tomato cluster images captured by the RGB-D camera and the processed 3D point clouds.

2.2. Estimation of Optimal Grasping Poses for Tomato Clusters

Obtaining the pose information of tomato clusters is essential for both pose estimation and path planning. This study introduces a 3D pose detection method that integrates instance segmentation, keypoint extraction, and pose calculation.

The harvesting robot runs on an embedded computing platform. Therefore, the detection model must be lightweight and capable of high real-time performance. In this study, we employ YOLOv8 [31] for cluster detection and instance segmentation. Specifically, the RGB image of a tomato cluster is segmented into three distinct regions: the main stem, the peduncle, and the fruit. Subsequently, we extract the pixel masks for each region and map them to their corresponding point clouds. To ensure scientific reproducibility, the perception module was validated on a custom dataset of 3000 greenhouse tomato cluster images (manually annotated via Labelme) split into training, validation, and testing sets with an 80:10:10 ratio (comprising 2400, 300, and 300 images, respectively). Standard online data augmentation policies including Mosaic, random horizontal flipping, and HSV color space jittering were applied during training to enhance robustness against greenhouse lighting variations. The YOLOv8-m instance segmentation network was trained for 300 epochs with a batch size of 16, a learning rate of 0.01, and an image resolution of 640×640 pixels on an NVIDIA RTX 4090 GPU.

The comprehensive performance evaluation, model physical scale, and onboard runtime efficiency are summarized in Table 2. The high precision and mask mAP across all classes provide a robust foundation for downstream keypoint extraction. Crucially, with a compact model footprint of 54.1 MB, the average end-to-end inference runtime per frame is 28.4 ms when executed directly on the NVIDIA Jetson AGX Orin onboard computing platform within the PyTorch (v2.1.0, Open Source Community) Ultralytics environment, which fully satisfies the real-time operational constraints of the tomato harvesting robot. Figure 2a,b present the original tomato cluster images and the corresponding instance segmentation masks, respectively.

We further group the main stem, peduncles, and fruits belonging to the same cluster to perform keypoint extraction. First, the connection points between the peduncle and the main stem (

P_{1}

), as well as between the peduncle and the fruit (

P_{2}

), are identified. These two points define the starting and ending points of the peduncle. Their coordinates are obtained by calculating the average coordinates of the pixel intersections between the respective masks. If no intersection exists, morphological dilation is applied to the peduncle mask until it intersects with the main stem or fruit masks. Subsequently, the optimal cutting point (

P_{0}

) is determined based on these two points.

P_{0}

is generally located at the geometric center of the peduncle, positioned away from both the main stem and the fruit. Given the irregular geometry of the peduncle, a traversal-based approach is used for calculation as follows:

(u_{P 0}, v_{P 0}) = arg min_{(u_{i}, v_{i})} |\sqrt{{(u_{P 1} - u_{i})}^{2} + {(v_{P 1} - v_{i})}^{2}} - \sqrt{{(u_{P 2} - u_{i})}^{2} + {(v_{P 2} - v_{i})}^{2}}|, (u_{i}, v_{i}) \in S_{P}

(1)

where

S_{p}

represents the set of pixel coordinates corresponding to the peduncle mask. To describe the growth direction of the main stem, additional keypoints must be extracted. A square bounding box with a side length of 100 pixels is established, centered at the connection point

P_{1}

. The final two keypoints,

M_{1}

and

M_{2}

, are then extracted from the intersections between the boundaries of this bounding box and the main stem mask. Figure 2c,d illustrate the representation of keypoints on the tomato cluster images and the 3D point clouds, respectively.

With the known camera intrinsic parameters and the completed eye-in-hand calibration, the pixel coordinates of the keypoints can be transformed into 3D coordinates within the manipulator’s base coordinate system. Based on the spatial positions of these keypoints, the optimal grasping pose of the end-effector—matching the orientation of the tomato cluster—is calculated. The end-effector coordinate system is defined as shown in Figure 3, where the x-axis represents the approach direction, the y-axis indicates the opening and closing direction of the gripper, and the z-axis is perpendicular to the gripper opening. To minimize potential damage to the main stem during grasping, the gripper’s opening and closing direction is aligned parallel to the growth direction of the main stem. Specifically, the unit vector for the y-axis direction of the end-effector is defined as:

e_{y} = \frac{v_{M 1 - M 2}}{∥ v_{M 1 - M 2} ∥}

(2)

where

e_{y}

is the unit vector along the y-axis, and

v_{M 1 - M 2}

is the vector pointing from

M_{1}

to

M_{2}

. To ensure a high success rate for cutting, the blades should be positioned as close to the peduncle as possible. Therefore, the z-axis of the end-effector should be aligned with the tangential direction near the cutting point

P_{0}

. This tangential direction is approximated as follows:

v_{P 0} = \frac{v_{P 0 - P 1} + v_{P 2 - P 0}}{∥ v_{P 0 - P 1} + v_{P 2 - P 0} ∥}

(3)

The z-axis direction of the end-effector is obtained by orthogonalizing

v_{P 0}

with respect to the y-axis direction vector:

{v_{P 0}}^{'} = v_{P 0} - \frac{e_{y} \cdot v_{P 0}}{e_{y} \cdot e_{y}} \cdot e_{y}

(4)

e_{z} = \frac{{v_{P 0}}^{'}}{∥ {v_{P 0}}^{'} ∥}

(5)

The x-axis is calculated by the cross product of the y-axis and z-axis unit vectors:

e_{x} = e_{y} \times e_{z}

(6)

With the unit vectors of each axis and the coordinates of the cutting point, the optimal grasping pose of the end-effector matching the tomato cluster’s orientation is determined. Its rotation and translation matrix

T_{g}^{0}

is expressed as:

T_{g}^{0} = [\begin{matrix} e_{x_{1}} & e_{y_{1}} & e_{z_{1}} & x_{P 0} \\ e_{x_{2}} & e_{y_{2}} & e_{z_{2}} & y_{P 0} \\ e_{x_{3}} & e_{y_{3}} & e_{z_{3}} & z_{P 0} \\ 0 & 0 & 0 & 1 \end{matrix}]

(7)

The URDF model of the RM65-B manipulator and the coordinate-transformed tomato cluster point clouds are loaded into the PyBullet (v3.2.6) [32] simulation environment. Figure 3 illustrates the visualization of the end-effector coordinate system and the optimal grasping pose for the tomato cluster within PyBullet. Although the optimal grasping pose matching the cluster’s orientation has been mathematically determined, it cannot be executed directly. The reachability of the calculated pose within the manipulator’s workspace must also be verified.

2.3. Calculation of Manipulator Reachable Poses Based on Analytical Methods and Particle Swarm Optimization

Having determined the optimal grasping pose for the end-effector matching the tomato cluster’s orientation, we proceed to perform inverse kinematics (IK) solving based on this pose. If the corresponding joint configurations can be successfully computed, the grasping pose is considered feasible. However, in many cases, the target point is located within the non-dexterous workspace of the manipulator, where the robot can often only reach the target with specific orientations. Consequently, the manipulator may be unable to reach the target point with the initially calculated pose, necessitating the determination of an alternative reachable pose that remains close to the original one. In this study, a combined approach utilizing an analytical solution and Particle Swarm Optimization (PSO) is employed to determine the reachable poses of the manipulator.

2.3.1. Analytical Method-Based Inverse Kinematics Analysis

The control of the manipulator’s end-effector requires a mapping from the Cartesian space position and orientation of the end-effector to the joint space. The process of determining each joint angle when the end-effector’s pose is known is referred to as IK. Based on the modified D-H parameters of the RM65-B manipulator in Table 1 and the modeling rules, the transformation matrix

{}_{i}^{i - 1}T

from the adjacent coordinate system

{i - 1}

to

{i}

can be expressed as:

{}_{i}^{i - 1}T = [\begin{matrix} c θ_{i} & - s θ_{i} & 0 & a_{i - 1} \\ s θ_{i} c α_{i - 1} & c θ_{i} c α_{i - 1} & - s α_{i - 1} & - s α_{i - 1} d_{i} \\ s θ_{i} s α_{i - 1} & c θ_{i} s α_{i - 1} & c α_{i - 1} & c α_{i - 1} d_{i} \\ 0 & 0 & 0 & 1 \end{matrix}]

(8)

where

c θ_{i}

and

s θ_{i}

denote

cos θ_{i}

and

sin θ_{i}

, respectively.

From Equation (8), the transformation matrices of

{}_{1}^{0}T

through

{}_{6}^{5}T

can be derived. Since an end-effector tool is attached to the manipulator in this study, it can be regarded as Link 7. Given that the orientation of the tool coordinate system is defined to be consistent with that of coordinate system 6, and the relative displacement can be obtained through direct measurement, the transformation matrix

{}_{7}^{6}T

between the adjacent coordinate systems is expressed as:

{}_{7}^{6}T = [\begin{matrix} I_{3 \times 3} & {}^{6}t_{7} \\ 0_{1 \times 3} & 1 \end{matrix}]

(9)

where

I_{3 \times 3}

is a

3 \times 3

identity matrix, representing that the orientation of the tool frame

{7}

is identical to that of frame

{6}

;

{}^{6}t_{7} = {[200, 0, 37]}^{T} mm

denotes the measured translation vector of the tool relative to frame

{6}

.

The transformation matrix from the manipulator base to the tool coordinate system, denoted as

T_{t o o l}

, can be obtained by the sequential multiplication of the transformation matrices from

{}_{1}^{0}T

to

{}_{7}^{6}T

:

T_{t o o l} = {}_{1}^{0}T \cdot {}_{2}^{1}T \cdot {}_{3}^{2}T \cdot {}_{4}^{3}T \cdot {}_{5}^{4}T \cdot {}_{6}^{5}T \cdot {}_{7}^{6}T = {}_{7}^{0}T

(10)

According to the D-H parameters, the first three joints (J1, J2, J3) of the manipulator primarily determine the end-effector position, while the axes of the last three joints (J4, J5, J6) intersect at a single point, forming a spherical wrist that is responsible for orientation adjustment. Owing to this spherical wrist configuration, the position and orientation can be kinematically decoupled. Let the target pose of the end-effector (Tool) be

{}_{7}^{0}T

, with its position vector denoted as

P = {[x, y, z]}^{T}

and its orientation matrix as

R

. The coordinates of the wrist center W in the base coordinate system,

P_{w} = {[P_{w x}, P_{w y}, P_{w z}]}^{T}

, can be obtained by the following equation:

P_{w} = P - R \cdot {}^{6}t_{7} - d_{6} \cdot R [\begin{matrix} 0 \\ 0 \\ 1 \end{matrix}]

(11)

During the IK solving process, the value of Joint 3 directly determines the extension state of the manipulator in the radial plane. Based on the geometric relationship and the Law of Cosines, consider the triangle formed by the center of Joint 2, the center of Joint 3, and the wrist center W. Let the length of the upper arm be

L_{1} = a_{2}

and the effective length of the forearm be

L_{2} = \sqrt{a_{3}^{2} + d_{4}^{2}}

. The cosine equation for Joint 3 can be derived from the coordinate transformation relationships as follows:

cos ϕ = \frac{P_{w x}^{2} + P_{w y}^{2} + {(P_{w z} - d_{1})}^{2} - a_{2}^{2} - (a_{3}^{2} + d_{4}^{2})}{2 a_{2} \sqrt{a_{3}^{2} + d_{4}^{2}}}

(12)

To ensure that the IK has real solutions, the condition

| cos ϕ | \leq 1

must be satisfied. Accordingly, the reachability discriminant operator

Δ

is defined as follows:

Δ = 1 - {(\frac{P_{w x}^{2} + P_{w y}^{2} + {(P_{w z} - d_{1})}^{2} - a_{2}^{2} - (a_{3}^{2} + d_{4}^{2})}{2 a_{2} \sqrt{a_{3}^{2} + d_{4}^{2}}})}^{2}

(13)

The physical significance of the discriminant

Δ

lies in characterizing the mapping relationship between the target pose and the boundaries of the robot’s workspace. When

Δ \geq 0

, the target pose possesses inverse solutions within the physical space, indicating that it belongs to the reachable region. Conversely, when

Δ < 0

, the target pose lies beyond the range of the manipulator’s stroke.

2.3.2. Reachable Pose Solving via Particle Swarm Optimization

According to Section 2.3.1, if

Δ < 0

, the joint angles cannot be derived through inverse kinematics, indicating that the manipulator is unable to reach the harvesting pose. Consequently, PSO is employed to search for reachable grasping poses for the manipulator. By identifying the pose matrices that satisfy the condition

Δ \geq 0

, the reachable grasping poses for the manipulator can be determined. The specific steps of the algorithm are as follows:

(a): Initialization of Particle Swarm

The position of a particle is determined by the three Euler angle components of the manipulator’s end-effector pose. Specifically, a set of Euler angles is represented as a single particle, where different Euler angles correspond to different position components. The particle is expressed as:

P_{n} = {(ψ_{n}, θ_{n}, φ_{n})}^{T}

(14)

We take the harvesting pose calculated in Section 2.2 as the initial pose, denoted as

P_{i n i t}

. To ensure a high success rate for harvesting and to accelerate the convergence speed of the algorithm, certain constraints are imposed on the search space of the end-effector orientation. Specifically, when the manipulator is unable to execute harvesting at the initial pose, the orientation search space is restricted to a partial spherical surface centered at the initial pose. With

P_{i n i t}

as the center, the search radius for each position component is set to

δ

. Accordingly, the upper and lower bounds for each component are constrained as follows:

\{\begin{matrix} ψ_{i n i t} - δ \leq ψ_{i} \leq ψ_{i n i t} + δ \\ θ_{i n i t} - δ \leq θ_{i} \leq θ_{i n i t} + δ \\ φ_{i n i t} - δ \leq φ_{i} \leq φ_{i n i t} + δ \end{matrix}

(15)

Next, centered at the initial pose

P_{i n i t}

, uniform sampling is performed on the search space spherical surface with a sampling interval of

δ / 2

. By sampling three discrete values in each dimension, a total of 27 pose particles are obtained through combination, e.g.,

(ψ_{i n i t} - δ / 2, θ_{i n i t}, ϕ_{i n i t})

,

(ψ_{i n i t}, θ_{i n i t}, ϕ_{i n i t})

,

(ψ_{i n i t} + δ / 2, θ_{i n i t}, ϕ_{i n i t})

, …,

(ψ_{i n i t} + δ / 2, θ_{i n i t} + δ / 2, ϕ_{i n i t} + δ / 2)

. Subsequently, another 23 pose particles are randomly generated within the workspace range to form an initial population with a size of

N = 50

. For each particle i, the velocity vector

V_{i} = {[v_{ψ i}, v_{θ i}, v_{ϕ i}]}^{T}

is initialized, with its values randomly generated within the range

[- V_{m a x}, V_{m a x}]

.

(b): Fitness Evaluation

Since the manipulator possesses inverse solutions when

Δ \geq 0

, it can be inferred that a larger value of

Δ

indicates a higher probability of reachability. Furthermore, as the initial pose is the optimal harvesting pose derived from the geometric analysis of the tomato cluster, the final end-effector pose should be as close to the initial pose as possible while ensuring reachability. Accordingly, the objective function and fitness function are designed as follows:

K = Δ

(16)

f (P_{i}) = ω_{1} \frac{K_{i} - K_{m i n}}{K_{m a x} - K_{m i n}} - ω_{2} \frac{Δ θ_{i}}{δ}

(17)

where K is the objective function,

f (P_{i})

is the fitness function, and

ω_{1}

and

ω_{2}

are the weight coefficients.

K_{m a x}

and

K_{m i n}

denote the maximum and minimum values of the objective function calculated for all particles in the current swarm, respectively, while

Δ θ_{i}

represents the angular difference between the pose of particle i and the initial pose. This fitness function reflects that a larger objective function value corresponds to higher fitness. Furthermore, since the target pose should be as close as possible to the initial pose, an angular difference penalty term is incorporated; thus, a smaller angular difference leads to higher fitness. The initial position of each particle is set as its individual historical best solution

P_{{b e s t}_{i}}

, and the position of the particle with the highest fitness in the initial population is selected as the global best solution

G_{b e s t}

.

(c): Velocity and Position Update

In each iteration, the Euler angle components of each particle are updated according to the following formulas:

V_{i}^{t + 1} = ω V_{i}^{t} + c_{1} r_{1} (P_{b e s t, i}^{t} - P_{i}^{t}) + c_{2} r_{2} (G_{b e s t}^{t} - P_{i}^{t})

(18)

P_{i}^{t + 1} = P_{i}^{t} + V_{i}^{t + 1}

(19)

where

ω

is the inertia weight, which controls the influence of the previous velocity on the current state.

c_{1}

and

c_{2}

are acceleration constants that regulate the weights for adjusting the particle toward the individual best and global best solutions, respectively.

r_{1}

and

r_{2}

are random numbers within the range

[0, 1]

, which enhance the randomness of the search.

To ensure that the search space is strictly confined within a radius

δ

centered at the initial pose, the following constraints are imposed on position and velocity: if the updated Euler angle components exceed the range defined by Equation (15), they are truncated to the boundaries of the search space. Meanwhile, to prevent particles from moving too fast and “overshooting,” an upper limit

V_{m a x}

is set for the velocity components. If

| v_{i, d} | > V_{m a x}

, then

v_{i, d} = sign (v_{i, d}) \cdot V_{m a x}

.

(d): Individual and Global Best Updates

In each iteration, the following evaluations are performed for each particle: If the fitness of the current position

f (P_{i})

is higher than the fitness of its historical best position

f (P_{{b e s t}_{i}})

, then

P_{{b e s t}_{i}}

is updated to

P_{i}

. Similarly, if the fitness of the current position

f (P_{i})

is higher than that of the global best position

f (G_{b e s t})

, then

G_{b e s t}

is updated to

P_{i}

.

(e): Iteration and Termination

Repeat steps (b) through (d) until the algorithm satisfies either of the following termination criteria: (1) The maximum number of iterations

G_{m a x}

is reached, at which point the loop terminates. (2) At least one particle in the swarm satisfies the objective function

K \geq 0

; its position is then returned as the final pose solution. If multiple particles meet this condition, the one with the higher fitness value is selected as the output.

The flowchart of the PSO-based reachable pose solving algorithm is shown in Figure 4. Through the optimization iterations of the PSO algorithm, the evaluated reachable grasping pose is finally obtained, and its transformation matrix is expressed as:

T_{g}^{*} = [\begin{matrix} n_{g}^{*} & o_{g}^{*} & a_{g}^{*} & p_{g}^{*} \\ 0 & 0 & 0 & 1 \end{matrix}]

(20)

where

n_{g}^{*}, o_{g}^{*}

, and

a_{g}^{*}

represent the unit vectors of the gripper’s coordinate system (normal, orientation, and approach vectors, respectively), and

p_{g}^{*}

denotes the position vector of the grasp center.

Compared with conventional analytical or map-based reachability-aware grasp planners, the distinct novelty of the proposed method lies in its dynamic optimization paradigm. Standard methods primarily rely on pre-computed offline reachability maps or rigid inverse kinematics tables, which exhibit high offline database storage overhead and struggle to adapt to the highly unstructured and dynamically occluded environments of tomato greenhouses. In contrast, our approach treats reachability optimization as a continuous, real-world iterative search space. By feeding the semantic environment and cluster characteristics directly into the unified PSO fitness function, the optimal transformation matrix

T_{g}^{*}

is dynamically resolved in real time. This flexible architecture empowers the manipulator to successfully discover dexterous harvesting poses even near the challenging boundaries of non-dexterous workspaces, establishing a significant departure from rigid, pre-mapped geometric grasp planning.

2.4. Two-Stage Cartesian Path Planning Based on Cost Map

In [28], a cost map-based Cartesian path planning method, OPHF-Planner, was proposed, which achieved robust planning performance in factory farming scenarios characterized by narrow workspaces and dense obstacles. However, this method suffers from insufficient control precision and necessitates a tolerance for minor collisions, which could lead to damage to tomato clusters during harvesting. To address these issues, this paper adopts the cost map strategy and divides the path planning into two distinct stages based on the characteristics of the tomato harvesting task: the start-to-pre-grasp stage and the pre-grasp-to-grasp stage, employing different search strategies for each. Figure 5 illustrates the overall architecture of the path planning algorithm.

2.4.1. Cost Map Generation

To facilitate path searching in Cartesian space, a voxel grid representation is employed to model both targets and obstacles, resulting in the generation of a target map and an obstacle map, respectively. The cost map is then defined as the weighted sum of these two maps. Given the two-stage search process, two distinct cost maps must be generated according to the following steps:

(a): Pre-grasp Point Computation

The reachable pose of the grasp point is obtained in Equation (20). Given that the x-axis represents the approach direction of the end-effector coordinate system, and letting d denote the distance between the pre-grasp point and the grasp point, the coordinates of the pre-grasp point are calculated as:

p_{p r e} = p_{g}^{*} - d \cdot n_{g}^{*}

(21)

The coordinates of the pre-grasp point

p_{p r e} = (x_{p r e}, y_{p r e}, z_{p r e})

are determined. By controlling the value of the distance d, the pre-grasp point can be positioned within a safe range that is close to the target tomato cluster while avoiding potential collisions.

(b): Point Cloud Voxelization

To simplify the computation, the original scene point cloud is downsampled using voxelization. After processing, the irregular point cloud data is transformed into a 3D grid format, while the magnitude of the points is reduced from

10^{5}

to

10^{3}

. The dimensions of the voxel point cloud are calculated as follows:

\{\begin{matrix} M & = ⌈\frac{x_{m a x} - x_{m i n}}{d_{v o x e l}}⌉ \\ N & = ⌈\frac{y_{m a x} - y_{m i n}}{d_{v o x e l}}⌉ \\ K & = ⌈\frac{z_{m a x} - z_{m i n}}{d_{v o x e l}}⌉ \end{matrix}

(22)

where M, N, and K denote the dimensions of the voxel grid along the

x, y

, and z axes, respectively.

(x_{m a x}, y_{m a x}, z_{m a x})

and

(x_{m i n}, y_{m i n}, z_{m i n})

represent the maximum and minimum boundaries of the original point cloud along these respective axes, and

d_{v o x e l}

is the voxel size. The position of each voxel is represented by its center coordinates. Assuming the center coordinates of a certain voxel are

(x_{v}, y_{v}, z_{v})

, these coordinates can be converted into indices

(i, j, k)

as follows:

\{\begin{matrix} i & = ⌊\frac{x_{v} - x_{m i n}}{d_{v o x e l}}⌋ \\ j & = ⌊\frac{y_{v} - y_{m i n}}{d_{v o x e l}}⌋ \\ k & = ⌊\frac{z_{v} - z_{m i n}}{d_{v o x e l}}⌋ \end{matrix}

(23)

Next, a three-dimensional array

M_{o r i g i n} \in {0}^{M \times N \times K}

is initialized. For each occupied voxel, we set

M_{o r i g i n} [i] [j] [k] = 1

. Consequently, the array

M_{o r i g i n} \in {0, 1}^{M \times N \times K}

is obtained, where a value of 1 indicates that the corresponding voxel is occupied, and a value of 0 indicates that the voxel is empty.

(c): Obstacle Map Generation

The obstacle map is used to characterize the presence of obstacles within specific voxels. Two obstacle maps are initialized. Obstacle Map 1 is directly initialized using the original voxel grid:

M_{o b s t a c l e, 1} = M_{o r i g i n}

(24)

For Obstacle Map 2, semantic information is integrated. As described in Section 2.2, segmentation masks for the main stems, peduncles, and fruits of tomato clusters were obtained via YOLOv8 instance segmentation and mapped onto the 3D point cloud. Let

(x_{o b j}, y_{o b j}, z_{o b j})

be the 3D coordinates of a point in the set corresponding to the main stems and fruits. Its corresponding voxel index

(i_{o b j}, j_{o b j}, k_{o b j})

can be calculated using Equation (23). Assuming

S_{o b j}

represents the set of all voxel indices corresponding to the main stems and fruits, Obstacle Map 2 is initialized as follows:

M_{o b s t a c l e, 2} [i] [j] [k] = \{\begin{matrix} λ, & (i, j, k) \in S_{o b j} \\ M_{o r i g i n} [i] [j] [k], & otherwise \end{matrix}

(25)

Subsequently, a Gaussian filter is applied to both

M_{o b s t a c l e, 1}

and

M_{o b s t a c l e, 2}

to allow the occupied grids to “radiate” their cost to the surrounding neighborhood. Taking

M_{o b s t a c l e, 1}

as an example:

G (i, j, k) = \frac{1}{{(2 π σ)}^{3 / 2}} e^{- \frac{i^{2} + j^{2} + k^{2}}{2 σ^{2}}}

(26)

where

i, j, k \in [- \frac{w i d t h_{k e r n e l}}{2}, \frac{w i d t h_{k e r n e l}}{2}]

. The filtered map is computed as:

M_{o b s t a c l e, 1} (x, y, z) = \sum_{i} \sum_{j} \sum_{k} M_{o b s t a c l e, 1} (x + i, y + j, z + k) \cdot G (i, j, k)

(27)

where

G (i, j, k)

is the value of the Gaussian kernel at position

(i, j, k)

,

w i d t h_{k e r n e l}

denotes the kernel size, and

σ

represents the standard deviation. Finally, the obstacle maps are normalized.

M_{o b s t a c l e, 1} (x, y, z) = \frac{M_{o b s t a c l e, 1} (x, y, z) - min (M_{o b s t a c l e, 1})}{max (M_{o b s t a c l e, 1}) - min (M_{o b s t a c l e, 1})}

(28)

The same procedure is applied to

M_{o b s t a c l e, 2}

.

(d): Target Map Generation

The purpose of the obstacle map is to guide the path away from obstacles, while the target map is designed to direct the path toward the goal. First, a 3D array of zeros with the same dimensions as the original voxel grid is initialized:

M_{t a r g e t} \in R^{M \times N \times K}, M_{t a r g e t} [i] [j] [k] = 0

(29)

The Euclidean distance from each voxel position to the target position is then calculated:

D_{t a r g e t} (x, y, z) = \sqrt{{(x - x_{t a r g e t})}^{2} + {(y - y_{t a r g e t})}^{2} + {(z - z_{t a r g e t})}^{2}}

(30)

where

(x_{t a r g e t}, y_{t a r g e t}, z_{t a r g e t})

represents the voxel index of the target location. The target map is obtained by normalizing these Euclidean distances:

M_{t a r g e t} (x, y, z) = \frac{D_{t a r g e t} (x, y, z)}{max (D_{t a r g e t})}

(31)

By substituting the index coordinates of the pre-grasp point and the grasp point into the above equations, the target map for the pre-grasp point,

M_{t a r g e t, 1}

, and the target map for the grasp point,

M_{t a r g e t, 2}

, are generated respectively.

(e): Cost Map Fusion

Since the path planning objective is to minimize the distance to the target while maximizing the distance from obstacles, the cost map is defined as the weighted sum of the target map and the obstacle map. Specifically, a smaller Euclidean distance to the target results in a lower cost, whereas a closer proximity to obstacles—characterized by higher values in the Gaussian-filtered voxel grid—leads to a higher cost. The two cost maps are generated as follows:

M_{c o s t, 1} (x, y, z) = ω_{t a r g e t} \cdot M_{t a r g e t, 1} (x, y, z) + ω_{o b s t a c l e} \cdot M_{o b s t a c l e, 1} (x, y, z)

(32)

M_{c o s t, 2} (x, y, z) = ω_{t a r g e t} \cdot M_{t a r g e t, 2} (x, y, z) + λ \cdot ω_{o b s t a c l e} \cdot M_{o b s t a c l e, 2} (x, y, z)

(33)

M_{c o s t, 1}

corresponds to the cost map used in the original OPHF-Planner method and is employed during the first stage of path planning to compute the trajectory from the start point to the pre-grasp point. In contrast,

M_{c o s t, 2}

integrates semantic information regarding the point clouds of the main stems and fruits. Since the obstacle maps are normalized, the coefficient

λ

is introduced in the weighted summation to assign higher collision-avoidance priority to these semantic regions. This cost map is utilized in the second stage of planning to determine the path from the pre-grasp point to the final grasp point.

2.4.2. Start-to-Pre-Grasp Path Planning

During this stage, path searching is performed using the cost map

M_{c o s t, 1}

, which does not include semantic obstacle information. The search originates from the voxel index corresponding to the initial pose of the robotic scissors within the camera coordinate system. A greedy search strategy is employed, starting from the index position

S_{s t a r t}

and yielding an index path

P_{i n d e x}

. Given a maximum iteration limit

G_{m a x}

, the search process is as follows:

(a): Set the current index position $S_{c u r r e n t} = S_{s t a r t}$ to initiate the search.
(b): Within the cost map $M_{c o s t, 1}$ , identify the position with the minimum cost among the 26 neighboring voxels (representing an index offset of $\pm 1$ across all three dimensions) of $S_{c u r r e n t}$ and designate it as the next position $S_{n e x t}$ .
(c): Increment the cost at the next position, $M_{c o s t, 1} (S_{n e x t}) = M_{c o s t, 1} (S_{n e x t}) + 1$ , to prevent path backtracking by increasing the local cost.
(d): Append $S_{c u r r e n t}$ to the path $P_{i n d e x}$ and update the current position: $S_{c u r r e n t} = S_{n e x t}$ .
(e): Repeatedly execute steps (2) through (4) until the cost values of all neighboring voxels are greater than $M_{c o s t, 1} (S_{c u r r e n t})$ or the maximum iterations $G_{m a x}$ are reached.

Once the index sequence

P_{i n d e x}

is obtained, the final index is compared with the pre-grasp point index to verify arrival. The index coordinates are then restored to Cartesian coordinates to obtain the first-stage Cartesian path

P_{C a r t e s i a n}^{(1)}

:

P_{C a r t e s i a n}^{(1)} = P_{i n d e x} \cdot d_{v o x e l} + B_{m i n}

(34)

where

B_{m i n} = (x_{m i n}, y_{m i n}, z_{m i n})

represents the lower boundary of the point cloud. By conducting a greedy search on

M_{c o s t, 1}

, a feasible path can be determined rapidly and effectively, allowing the robotic arm to move efficiently from the start point to the pre-grasp point. It should be noted that the search strategy in this stage tolerates minor collisions with obstacles. Furthermore, the voxelization and restoration process introduces precision loss, resulting in a jagged trajectory; therefore, the path must be smoothed using a Savitzky–Golay (S-G) filter [33]. Since the primary objective of the first stage is rapid approach to the target area, these factors are acceptable. However, as the second stage is collision-sensitive and requires finer manipulation, a different path searching strategy is required.

2.4.3. Pre-Grasp-to-Grasp Path Planning

During this stage, path searching is conducted using the cost map

M_{c o s t, 2}

, which integrates semantic obstacle information. In

M_{c o s t, 2}

, voxels representing tomato peduncles and main stems are assigned significantly higher costs than other obstacles, providing enhanced collision-avoidance capabilities. The search originates from the pre-grasp point

p_{p r e} = (x_{p r e}, y_{p r e}, z_{p r e})

. This phase employs a gradient descent-based search that directly updates spatial coordinates rather than voxel indices, thereby achieving higher precision for the path points. The search process is detailed as follows:

(a): Trilinear Interpolation

The cost map is defined on a discrete grid. To implement gradient search in continuous space, the continuous cost value for any query point

P_{q} \in R^{3}

is obtained via trilinear interpolation:

f (P_{q}) = \sum_{l = 0}^{1} \sum_{m = 0}^{1} \sum_{n = 0}^{1} ω_{l m n} \cdot M_{c o s t, 2} (i + l, j + m, k + n)

(35)

where

ω l m n

represents the weight coefficients based on relative distances, ensuring the spatial continuity of the cost function.

(b): Numerical Gradient Approximation

Since the gradient

\nabla f (P_{q})

cannot be directly obtained in discrete grid space, it is approximated using the central difference method:

\nabla f (P_{q}) = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}]

(36)

The components are calculated as follows (taking the x-direction as an example):

\frac{\partial f}{\partial x} = \frac{f (x + d_{v o x e l}, y, z) - f (x - d_{v o x e l}, y, z)}{2 \cdot d_{v o x e l}}

(37)

(c): Position Update

Starting from the current position

P_{c u r r e n t}

, the next path point is reached by moving a fixed step size along the negative gradient direction:

P_{n e x t} = P_{c u r r e n t} - η \cdot \frac{\nabla f (P_{q})}{| \nabla f (P_{q}) | + ϵ}

(38)

where

η

is the search step size and

ϵ

is an infinitesimal constant used to prevent numerical oscillation when the gradient is zero.

(d): Iteration Control

Steps (a) through (c) are executed iteratively until the maximum iteration limit

G_{m a x}

is reached or the distance to the grasp point becomes less than

d_{v o x e l}

.

This path searching method yields a smooth Cartesian trajectory

P_{C a r t e s i a n}^{(2)}

.

2.4.4. Path Orientation Assignment and Inverse Kinematics

After obtaining the path waypoints, it is necessary to assign an orientation to each point so that they can be converted into joint angles for execution via IK. During the first stage, a linear interpolation method is employed to ensure a smooth transition from the initial orientation

R_{s t a r t}

to the predefined grasping orientation

R_{g}

, thereby avoiding sudden posture changes during the initial movement. In the second stage, as the end-effector moves from the pre-grasp point

P_{p r e}

to the final grasp point

P_{g}

, the orientation is kept constant at

R_{g}

. This ensures that the robotic arm approaches the tomato cluster using the estimated reachable pose, minimizing the risk of collisions caused by orientation adjustments.

Once orientations are assigned to all waypoints, the Cartesian trajectory is transformed into joint space. In the first stage, the Relaxed-IK solver [34] is utilized. By employing optimization methods, this algorithm satisfies Cartesian pose constraints while simultaneously considering joint limits, obstacle avoidance, and motion smoothness, resulting in continuous joint trajectories that stay away from singularities. In the second stage, since obstacle avoidance and smoothness have been guaranteed in the preceding planning phase, the PyKDL solver (v1.5.1) [35]—based on numerical iteration—is used for rigorous inverse kinematics. Compared to Relaxed-IK, PyKDL provides a smaller pose convergence tolerance, ensuring that the robotic scissors reach the calculated grasp point with high precision.

3. Experiments and Results

3.1. Experiment Design

To evaluate the performance of the proposed algorithm, simulations were conducted in the PyBullet environment. A comprehensive dataset consisting of 238 sets of tomato cluster images and point cloud data was collected from real-world scenarios. As shown in Figure 6, the URDF models of the robotic manipulator (integrated with an RGB-D camera and an end-effector) and the voxelized scene point clouds were loaded into PyBullet for simulation. Distinct experimental metrics were designed for the two separate tasks: pose estimation and path planning.

To ensure strict evaluation rigor and avoid parameter over-fitting, a two-stage unidirectional experimental protocol was implemented across the entire dataset. First, in the parameter calibration phase, the variations in performance indicators under different parameter settings were systematically observed across the 238 cases to analyze algorithmic sensitivity and determine a single set of optimal, globally fixed hyperparameters. Subsequently, in the performance evaluation phase, these optimal hyperparameters were strictly locked. Our finalized approach was then executed across the identical 238 unseen testing cases under a rigid “blind-test” condition—with zero retrospective parameter adjustments—and compared with existing pose estimation and path planning methods to verify its unbiased effectiveness. Finally, the proposed method was applied to real tomato harvesting scenarios, validating its feasibility in practical environments.

3.2. Pose Estimation Experiments

3.2.1. Pose Estimation Evaluation Metrics

To evaluate the feasibility and efficiency of the proposed pose estimation method, the following metrics are defined:

Reachability Rate: A target picking pose is considered kinematically reachable if valid analytical solutions for all six joint angles can be successfully resolved through the inverse kinematics (IK) solver. Let $N_{r}$ denote the number of reachable targets; the reachability rate ( $R_{rate}$ ) is defined as the ratio of reachable targets to the total number of tested cases, expressed as:

$R_{rate} = \frac{N_{r}}{N} \times 100 %$

(39)
Collision Rate: Even when a pose is kinematically reachable, the robotic manipulator or end-effector may still collide with the main stem, fruits, or surrounding obstacles during execution, rendering the pose non-viable for a safe harvest. Let $N_{c}$ denote the number of kinematically reachable cases that encounter such collision issues; the collision rate ( $C_{rate}$ ) is defined as the proportion of these collision-compromised cases relative to the total number of tested cases N, expressed as:

$C_{rate} = \frac{N_{c}}{N} \times 100 %$

(40)
Success Rate: A pose estimation trial is deemed fully successful only if the generated picking pose is both kinematically reachable and strictly collision-free. Let $N_{s}$ denote the number of successful targets. Since a successful case must be reachable ( $N_{r}$ ) but free from collisions ( $N_{c}$ ), the number of successful cases is strictly bounded by $N_{s} = N_{r} - N_{c}$ . The success rate ( $S_{rate}$ ) is defined as the ratio of successful targets to the total number of tested cases N, formulated as:

$S_{rate} = \frac{N_{s}}{N} \times 100 % = \frac{N_{r} - N_{c}}{N} \times 100 % = R_{rate} - C_{rate}$

(41)
Average Estimation Time: The average duration from the input of images and point clouds to the output of the final target pose.

3.2.2. Pose Estimation Experimental Settings

To ensure the performance of the proposed PSO-based reachable pose calculation method, appropriate parameters must be configured. The partial parameter settings are summarized in Table 3.

The remaining parameters to be determined are the search radius

δ

for each positional component, and the weight ratio

ω_{1} : ω_{2}

between the target term and the angular difference penalty term in the composite fitness function. To determine their optimal values, parameter calibration experiments were conducted on the dataset, as illustrated in Figure 7.

As the search radius increases, both the reachability and collision rates of the picking poses exhibit a continuous upward trend. This occurs because a larger search radius expands the exploration space for the end-effector poses, thereby increasing the probability of finding reachable configurations. However, as the search space expands, the deviation in the particle population relative to the initial center pose also increases. Excessive deviation may yield reachable poses that fail to satisfy the geometric constraints between the tomato cluster and the end-effector scissors, leading to a concurrent rise in the collision rate. Consequently, the success rate follows a trend of increasing first and then decreasing. The maximum success rate is achieved at a search radius of

π / 4

with relatively low time consumption; therefore,

δ = π / 4

is selected as the optimal search radius.

When the weight ratio between the target term and the penalty term is less than or equal to 4, both the reachability and collision rates of the picking poses exhibit an upward trend as the weight ratio increases. Once the weight ratio exceeds 4, the reachability rate gradually converges, while the collision rate continues to rise slightly. At lower weight ratios, the angular difference penalty term in the fitness function exerts a dominant influence, steering the particle population to remain close to the initial center pose. This ensures a low collision rate but may also cause the algorithm to fall into local convergence, resulting in a lower reachability rate. As the weight ratio increases beyond a certain threshold, the influence of the angular difference on the fitness function diminishes, allowing the algorithm to explore a broader space and enhancing the probability of identifying reachable poses. However, the concurrently weakened constraint on the end-effector orientation leads to larger angular deviations that fail to satisfy the geometric constraints for picking, thus increasing the collision rate. The maximum success rate is achieved at a weight ratio of 4 with an acceptable time cost; therefore, the weight ratio is set as

ω_{1} : ω_{2} = 4

.

Previous studies on the picking poses of tomato clusters have been relatively limited. Most researchers either adopt a fixed orientation or simply fit the peduncle direction after identifying the target picking point, often neglecting obstacle avoidance and robotic arm reachability. To evaluate the effectiveness of the proposed high-reachability pose estimation method, this study conducts experiments on five different picking pose estimation strategies as follows:

1.: Fixed Horizontal Orientation (FO): A traditional approach in agricultural robotic harvesting [36]. Once the picking point is identified, the robotic arm maintains the end-effector’s scissors in a horizontal orientation.
2.: Peduncle Orientation Fitting via LOWESS (LOWESS): Adapted from [9]. This method first extracts the tomato peduncle point cloud and selects a local segment near the cutting point. A quadratic surface polynomial is then fitted using weighted local data to construct a local surface model. Subsequently, the midpoint along the X-axis of the surface is selected to reconstruct the local peduncle curve. The tangent vector at the cutting point is utilized as the peduncle orientation to estimate its 6D pose. During harvesting, the scissors’ opening remains perpendicular to the peduncle, with the Z-axis of the end-effector aligned with the peduncle direction.
3.: Spatial Pose Modeling of Tomato Clusters (SPM): Adapted from [10]. This approach establishes a spatial pose description model where the main stem is modeled as a cylinder, peduncles as interconnected cylinders, and the cluster as a rectangular parallelepiped. The picking pose is positioned on the peduncle cutting plane parallel to the main stem.
4.: Keypoint Extraction-based Pose Calculation (KE): The pose estimation method proposed in Section 2.2 of this paper. Masks are obtained via YOLOv8 instance segmentation, and keypoints are extracted to calculate picking poses that match the geometric morphology of the tomato cluster.
5.: Our Method: This refers to the proposed PSO-based high-reachability pose solving method. The primary parameters influencing its performance are the search radius and the weight ratio between the target and penalty terms in the fitness function. For the experiments, optimal parameters were selected: a search radius of $π / 4$ and a weight ratio of $ω_{1} : ω_{2} = 4$ .

To ensure strict benchmarking fairness and eliminate evaluation bias among these five pose estimation strategies, all methods were implemented within an identical computational sandbox environment. Specifically, the exact same raw 3D point clouds and multi-class semantic segmentation masks captured from real-world greenhouse scenarios were directly imported into the PyBullet environment as the uniform perception input for all evaluated strategies. Furthermore, the geometric criteria for collision detection utilized an identical collision model generated from the voxelized scene and the manipulator’s URDF model. For reachability and execution verification, the same kinematic and operational constraints were rigidly enforced across all runs: reachability was uniformly determined by whether the PyKDL inverse kinematics solver yielded a valid analytical solution, whereas the execution constraint required that the end-effector successfully contacted the peduncle point cloud without triggering any collision with the main stem point cloud. Finally, all comparative experiments were executed on the same NVIDIA Jetson AGX Orin embedded computing platform to maintain complete hardware environment consistency.

3.2.3. Pose Estimation Results and Analysis

Table 4 presents the results of reachability rate, collision rate, success rate, and average estimation time for different methods based on 238 groups of data.

Among the five methods evaluated, only the proposed PSO-based pose optimization method achieved a significantly high success rate. The Fixed FO approach does not account for the geometric relationship between the end-effector and the tomato cluster, often leading to suboptimal cutting angles. Furthermore, it neglects both obstacle avoidance and kinematic reachability, resulting in a low success rate and high collision rate. While LOWESS fitting considers the growth direction of the peduncle, it remains limited by its lack of collision constraints regarding the main stem and fruits.

Both SPM and KE methods exhibit improved performance by incorporating peduncle orientation and collision constraints. However, as they do not address the reachability of the robotic arm, the calculated poses are frequently unreachable when the target cluster is located far from the base in a non-dexterous workspace. In contrast, our method achieves the best performance by comprehensively integrating the cluster’s geometric orientation with kinematic reachability. Compared to the KE method, the collision rate of our approach rises slightly due to the angular deviations between the optimized reachable poses and the initial poses. Nevertheless, by strictly constraining the search radius during parameter configuration, the collision rate is maintained within an acceptable range.

Regarding the average estimation time, the FO method is the fastest as it only requires the coordinates of the picking point from instance segmentation. The KE method takes longer as it involves extracting keypoints from RGB images and 3D point clouds. LOWESS and SPM are more time-consuming due to the geometric fitting of point clouds. Although our method increases the average estimation time to 0.6689 s due to the iterative nature of the PSO algorithm, this duration is substantially lower than the typical harvesting time for a single tomato cluster (approximately 15 s). Therefore, the proposed method fully meets the real-time requirements of automated harvesting operations.

To verify the enhancement of robotic arm reachability in non-dexterous workspaces provided by the PSO-based pose optimization method, the success rates and average processing times of both the KE method and the proposed PSO method were measured and compared across varying distances between the target picking points and the robotic base. The experimental results are summarized in Table 5.

In the 200–300 mm range, the success rate of the pose directly calculated via KE is relatively low due to the proximity to the base, where many poses fall within the robot’s singularities. The PSO-based method effectively avoids these singular points, thereby improving performance. In the 300–500 mm range, as most poses are located within the dexterous workspace, the success rates and average computation times for both KE and the PSO-based method are comparable.

However, as the distance exceeds 500 mm, the manipulator gradually enters the non-dexterous workspace. In this region, many poses calculated directly via KE become kinematically unreachable, necessitating iterative searches through the PSO algorithm to find valid solutions. Consequently, the PSO-based method demonstrates a significant improvement in success rate over KE, albeit with a noticeable increase in average processing time due to the higher frequency of iterative computations required for unreachable poses. Beyond 800 mm, both methods fail to produce reachable poses as the targets exceed the manipulator’s maximum reach.

In summary, the proposed PSO-based reachable pose solving method leverages an iterative search mechanism within the 3D workspace, significantly enhancing pose estimation success rates and reachability at a reasonable computational cost. Experimental results demonstrate that this method effectively overcomes the bottleneck of non-dexterous workspaces, extending the effective operational radius from 600 mm to 800 mm. This extension corresponds to a

138 %

expansion in the overall volumetric working workspace, significantly strengthening the robot’s adaptability to the complex environments of agricultural greenhouses.

3.3. Path Planning Experiments

3.3.1. Path Planning Evaluation Metrics

To evaluate the performance of the path planning algorithm, the following performance metrics were defined:

Success Rate: The ratio of the number of targets for which an effective joint trajectory was successfully planned and executed to the target picking pose—within allowable collision limits—relative to the total number of attempts.
Average Planning Time: The duration from the initiation of the path planning process to the output of a valid joint trajectory, excluding the time required for data preprocessing and motion execution.
Average Path Length: The average distance traveled by the robotic arm’s end-effector during the execution of the planned trajectory.

3.3.2. Path Planning Experimental Settings

In OPHF-Planner [28], it was experimentally determined that the algorithm achieves optimal performance with a voxel size of 0.02 m and a weight ratio of the obstacle map to the target map set at 0.2. In this study, the path planning process is divided into two stages. The planning method for the first stage—from the start point to the pre-grasp point—is consistent with that of OPHF-Planner. Consequently, these two parameters remain unchanged in our implementation. The specific parameter settings for the path planning experiments are summarized in Table 6.

For the second stage, pre-grasp-to-grasp planning, the parameters that primarily influence algorithmic performance are the weight coefficients of the main stem and fruits in the obstacle map, which were determined through experiments. Given that tomato harvesting scenarios are highly sensitive to collisions, any contact with the point clouds of the main stem or fruit clusters during path planning is categorized as a failure. For other obstacles, a slight collision tolerance of 0.05 m is permitted.

Using the experimental dataset, path planning was executed on tomato clusters for which feasible picking poses could be estimated. The performance metrics across various parameter values were measured, and the results are summarized in Figure 8.

As the weight coefficient increases, the success rate of path planning initially rises and subsequently declines, reaching its peak at

λ = 2.0

. This trend occurs because, at lower weight coefficients, the obstacle avoidance capability regarding the main stem and fruit point clouds is insufficient, leading to collisions between the end-effector’s scissors and the obstacles as they approach the target peduncle. Conversely, as the weight coefficient increases, the repulsive force exerted by the main stem and fruits becomes stronger. This can make it difficult for the path to navigate through obstacles to reach the target, thereby reducing the success rate. The average path length exhibits a slight upward trend with minimal fluctuations. This is attributed to the increased tendency of the path to deviate from obstacles as the weight increases. However, since the weight primarily influences only the final segment of the trajectory, its impact on the overall path length remains limited. Similarly, the average planning time also shows an increasing trend. As the path’s obstacle-avoidance tendency grows, the trajectory becomes more tortuous, necessitating more steps to generate the path and thus increasing the number of iterations. In summary, the path planning algorithm achieves optimal overall performance when the weight coefficient of the main stem and fruits in the obstacle map is set to

λ = 2.0

.

To evaluate the effectiveness of the proposed path planning approach, experiments were conducted comparing the following four methods:

1.: Bi-RRT: Bi-RRT is a classic robotic path planning algorithm that grows two random trees from the start and goal configurations. These trees alternately explore the joint space until they meet, forming a complete path. Collision detection is performed before adding each new node, using forward kinematics to ensure a collision-free trajectory.
2.: OPHF-Planner: This method generates an obstacle map and a target map from voxel point clouds, which are then weighted and synthesized into a cost map. A greedy search strategy is employed to find a feasible path within this cost map.
3.: S-OPHF: This approach utilizes a cost map integrated with semantic information within the OPHF-Planner framework. Specifically, the main stems and fruits of tomato clusters are distinguished from other obstacles using different weights during obstacle map generation, while the search strategy remains unchanged.
4.: Our Method: The tomato harvesting path planning is divided into two stages. The first stage rapidly approaches the target tomato cluster and adjusts to the goal pose, while the second stage incorporates semantic obstacle awareness for precise obstacle avoidance and planning. The weight coefficient $λ$ for the main stem and fruits in the obstacle map is set to $2.0$ .

To guarantee strict benchmarking fairness and maintain strictly controlled variables across all four path planning strategies, all evaluations were executed within the identical PyBullet simulation setup. Specifically, the exact same geometric collision model—constructed from the voxelized scene point clouds and the manipulator’s URDF model—was uniformly deployed for all methods. Furthermore, the start configurations, target picking poses, and physical kinematic boundary constraints (including locked joint position, velocity, and acceleration limits of the RM65-B manipulator) were rigidly synchronized across all simulation runs. Crucially, the same algorithmic execution criteria were enforced: a planning attempt was deemed successful only if the path planner could guide the end-effector to successfully reach the target peduncle cutting point without exceeding the designated collision tolerance between the robotic arm (or gripper) and the obstacles. Finally, all comparative planners were executed on the same NVIDIA Jetson AGX Orin (running on NVIDIA JetPack 5.1.1 SDK and Ubuntu 20.04 LTS) embedded computing platform to eliminate any hardware-induced performance discrepancies.

3.3.3. Path Planning Results and Analysis

Table 7 presents the results of the path planning success rate, average path length, and average planning time for the different methods based on 238 groups of test data.

It can be observed that our method achieves the highest success rate among the evaluated algorithms while maintaining a shorter average path length and planning time. Since Bi-RRT is a planning method based on joint space, its obstacle avoidance capability in Cartesian space is limited, resulting in a lower success rate. Furthermore, the obstacle-avoiding trajectories generated in joint space often become highly complex when mapped to Cartesian space, frequently leading to excessively long paths.

The OPHF-Planner performs optimally in terms of average path length and planning time due to its greedy-based search strategy, which allows for faster iterations; however, its success rate is compromised because it does not perform strict obstacle avoidance. While our method adopts a similar greedy strategy in its first stage, the gradient descent-based search in the second stage typically requires higher computational effort and more iterations. The S-OPHF method achieves a higher success rate than the OPHF-Planner by utilizing a cost map integrated with semantic information, which provides better obstacle avoidance. However, since it initiates planning based on the semantic cost map from the start, it may be affected by obstacle occlusions when far from the target, leading to increased path lengths and planning times. These comparisons demonstrate the superiority of the our two-stage path planning method in tomato harvesting scenarios.

3.4. Ablation Study

To quantitatively evaluate the individual contributions of the semantic cost mapping and the PSO-based pose optimization modules within the proposed framework, a systematic ablation study was conducted based on 100 randomized harvesting simulation scenarios. Two crucial metrics were monitored to assess the holistic perception-planning loop: the end-to-end harvesting success rate (%) and the average total computation time (s), which encompasses both picking pose formulation and trajectory generation.

As presented in Table 8, the baseline system (Config 1, utilizing traditional inverse kinematics and a standard geometric RRT planner) yields a low success rate of 41.0% due to severe workspace singularities and blind collisions in dense tomato clusters. Incorporating only the PSO optimizer (Config 2) isolates the pose optimization module; it significantly resolves kinematic reachability at non-dexterous workspace boundaries, shifting the success rate to 57.0%, though it introduces additional iterative calculation overhead (1.1432 s). Conversely, implementing only the semantic cost mapping while discarding PSO (Config 3) focuses entirely on path-level collision avoidance. This configuration improves the success rate to 53.0% by safely circumscribing obstacles, but it remains heavily constrained by standard pose failures.

Crucially, the complete framework (Config 4, Ours) synergizes both modules into a cohesive pipeline, achieving the highest harvesting success rate of 87.0% in the simulated trials. This performance is highly consistent with the subsequent real-world greenhouse experiment result (85.0%), with the minor 2.0% discrepancy directly attributed to unmodeled physical disturbances in the field, such as wind-induced crop swinging and extreme lighting fluctuations. This evaluation mathematically confirms that semantic environment representation and continuous heuristic pose optimization function as an indispensable, interdependent closed loop essential for robust automated harvesting.

3.5. Field Experiments

To verify the feasibility of the proposed pose estimation and path planning methods, autonomous harvesting experiments were conducted using a tomato harvesting robot in a real greenhouse environment. During the experiments, the robot first adjusted its chassis and lifting platform to reach an appropriate workspace. Subsequently, the vision system identified the target tomato clusters, and the reachable grasping poses were calculated using the proposed pose estimation algorithm. The robotic arm then moved to the target pose along the trajectory generated by the path planning algorithm, followed by the end-effector closing to grip the peduncle. Finally, the robotic arm retracted along the planned path and placed the harvested tomato cluster into the collection bin. Figure 9 illustrates the complete process—from identification to crating—for two separate tomato cluster harvesting tasks.

In this study, 60 tomato clusters were randomly selected for harvesting experiments in a greenhouse, out of which 51 tasks were successfully completed. The success rate reached 85.0%. To guarantee the statistical reliability of this evaluation in unstructured environments, the 95% confidence interval (CI) for the success rate was rigorously calculated using the Wilson score interval method, yielding a baseline range of [73.9%, 91.9%]. This statistical bound demonstrates high robustness, as even the conservative lower bound (73.9%) outperforms the 72.1% macro-reference reported in [37].

To comprehensively delineate the operational boundaries of the proposed framework, an in-depth retrospective error analysis was performed on the remaining 9 unsuccessful harvesting trials, which can be systematically categorized into three distinct physical and perceptual bottlenecks under unstructured greenhouse settings:

1.: Occlusion-Driven Keypoint Drift (3 cases): For the 3 failures linked to the vision algorithm, severe occlusions from dense foreground foliage or overlapping companion fruits fragmented the instance segmentation masks of the peduncle. This structural incompleteness induced a sub-centimeter mathematical drift during the 3D keypoint extraction step ( $P_{0}$ ), ultimately causing the custom end-effector to misalign with the optimal cutting zone.
2.: Kinematic Singularity at Workspace Boundaries (4 cases): Among the remaining 6 mechanical failures, 4 cases occurred when tomato clusters grew in extreme spatial configurations (e.g., deeply recessed near support frames or excessively far from the robotic base). Although the semantic cost mapping successfully identified collision-free corridors, the required cutting orientations forced the RM65-B manipulator into kinematic singularity regions or hard joint limits, leading to analytic inverse kinematics (IK) solver timeouts.
3.: Dynamic Botanical Disturbances (2 cases): The final 2 failures were caused by the dense nature of the tomato canopy, where the physical approach of the manipulator triggered secondary contact with peripheral leaves. This minor mechanical force propagated through the vine, causing transient pendulum-like swinging of the target fruit cluster. Because the current pipeline operates on an open-loop execution profile, this dynamic displacement caused the blade to brush against the peduncle without successfully severing it.

The average harvesting time per cluster was 15.8 s, a slight increase compared to the 14.6 s achieved in [37]. This minor efficiency discrepancy highlights an inherent algorithmic trade-off between execution speed and kinematic safety. Because the proposed method incorporates an iterative PSO-based search and a two-stage semantic-aware path planner to actively circumvent obstacles and guarantee pose reachability, it naturally introduces additional computational overhead during the trajectory formulation phase. However, from an agricultural engineering perspective, this extra 1.2 s per cluster represents a highly justifiable compromise: by sacrificing a marginal amount of real-time speed, the harvesting success rate was substantially elevated from

72.1 %

to

85 %

. In practical greenhouse deployments, preventing structural damage to tomato clusters and eliminating catastrophic collision failures are of paramount importance, rendering this trade-off highly advantageous for long-term automated operations. Overall, the results demonstrate that the proposed pose estimation and path planning algorithms significantly enhance the feasibility of the harvesting robot, substantially improving the success rate in real-world scenarios while maintaining a highly acceptable operational efficiency.

3.6. General Comparison and Future Learning-Based Perspectives

In recent years, learning-based robotic manipulation and motion planning approaches, particularly Transformer-based behaviors (e.g., embodied AI pipelines) and Deep Reinforcement Learning (DRL) driven planners, have demonstrated remarkable generalization in unstructured environments. Transformer-based models excel at processing multimodal sensory tokens and generating long-horizon manipulation trajectories, while DRL networks can learn end-to-end control policies directly from continuous state-action spaces.

However, deploying these data-driven paradigms in commercial greenhouse harvesting still faces critical engineering bottlenecks: (1) Sample Efficiency and Hard Constraints: DRL requires millions of trial-and-error interactions and struggles to strictly guarantee zero-collision kinematics boundaries, which poses a substantial risk of damaging fragile crop tissues or greenhouse support structures. (2) Computational Overhead: Running heavy Transformer backbones requires powerful desktop-grade GPUs, which contradicts the low-power, cost-sensitive constraints of edge-computing chips on mobile agricultural robots.

In contrast, our proposed modular pipeline—combining lightweight instance segmentation (YOLOv8) with continuous kinematic optimization (PSO)—strikes a practical balance for agricultural automation. It embeds explicit physical geometry and robot boundaries directly into the fitness function, ensuring absolute deterministic collision avoidance and high reachability without requiring vast training datasets or specialized onboard hardware. Nevertheless, incorporating hybrid learning-heuristic frameworks represents a promising frontier for our future work.

Furthermore, the generalization and operational robustness of the proposed framework under varying commercial greenhouse conditions merit comprehensive elucidation. To counter non-uniform natural illumination changes (e.g., direct sunlight glare and overcast shadows) and distinct greenhouse structures, the perception layer leverages extensive online HSV color space jittering and Mosaic augmentations during the offline training phase, rendering the underlying YOLOv8-m backbone highly invariant to luminance fluctuations. In scenarios featuring severe botanical occlusions where fruit clusters are heavily dynamic or wrapped by dense foliage, our two-stage semantic-aware path planner plays a deterministic role; it continuously updates the local obstacle boundaries, transforming unstructured geometric constraints into smooth heuristic potential fields to actively guide the arm away from collision risks. Regarding adaptation across different tomato cultivars (e.g., standard round tomatoes versus elongated truss varieties), although variations in macro fruit size and cluster density exist, their basic anatomical topologies—specifically the geometric characteristics of the main stem and the pickable peduncle sections—remain highly consistent. Because our keypoint extraction algorithm explicitly targets these structural attachment invariants rather than volatile fruit color boundaries, the framework exhibits inherent cross-cultivar portability, ensuring reliable automated harvesting capabilities across diverse agricultural operational profiles.

4. Conclusions

In this paper, we focus on the autonomous harvesting of greenhouse tomatoes using agricultural robots. To address the challenges of difficult picking pose estimation and limited pose reachability for tomato clusters, a novel high-reachability picking pose estimation method is proposed. First, an instance segmentation-based model utilizing multiple keypoints of the main stem, peduncle, and fruits is established to describe the spatial posture of the tomato cluster. Subsequently, grasping poses are sampled based on the cluster’s posture, and a PSO algorithm is employed to iteratively solve for reachable poses within the manipulator’s workspace, achieving high success rates and efficiency in pose estimation.

Furthermore, to tackle the path planning difficulties in narrow and complex environments, a two-stage path planning method integrated with semantic obstacle information is proposed, which effectively enhances the obstacle avoidance capability of the robotic arm in harvesting scenarios. Experiments were conducted in both simulation and real-world field environments. Compared with existing methods, our method significantly improves picking reachability and success rates, reduces damage during the harvesting process, and achieves considerable operational efficiency.

Despite the promising results, we openly acknowledge that the current real-world field evaluation remains relatively localized in scale due to strict greenhouse seasonal harvesting windows and operational safety constraints. Furthermore, the baseline comparison with previous literature serves primarily as a functional macro-reference rather than a strictly controlled on-platform benchmark, owing to inherent variations in gripper mechanisms and environmental conditions. In future work, we plan to expand the deployment scale across multiple breeding seasons and establish an open-source hardware testing platform to facilitate direct, unified benchmarking of different harvesting algorithms under identical real-world conditions.

Author Contributions

Conceptualization, J.Y. (Junyao Yan); methodology, J.Y. (Junyao Yan); software, J.Y. (Junyao Yan); validation, J.Y. (Junyao Yan); formal analysis, J.Y. (Junyao Yan); investigation, J.Y. (Junyao Yan), J.H. and K.L.; resources, J.Y. (Junyao Yan); data curation, J.Y. (Junyao Yan); writing—original draft preparation, J.Y. (Junyao Yan); writing—review and editing, J.Y. (Jianjun Yin), J.H. and K.L.; visualization, J.Y. (Junyao Yan); supervision, J.Y. (Jianjun Yin); project administration, J.Y. (Jianjun Yin). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data, annotation masks, camera calibration parameters, and robot configuration settings generated and analyzed during the current study are not publicly available due to intellectual property restrictions regarding ongoing robotic harvesting research, but they are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Liu, Y.; Ma, X.; Shu, L.; Hancke, G.P.; Abu-Mahfouz, A.M. From industry 4.0 to agriculture 4.0: Current status, enabling technologies, and research challenges. IEEE Trans. Ind. Inform. 2020, 17, 4322–4334. [Google Scholar] [CrossRef]
Li, Y.; Feng, Q.; Li, T.; Xie, F.; Liu, C.; Xiong, Z. Advance of target visual information acquisition technology for fresh fruit robotic harvesting: A review. Agronomy 2022, 12, 1336. [Google Scholar] [CrossRef]
Lehnert, C.; Sa, I.; McCool, C.; Upcroft, B.; Perez, T. Sweet pepper pose detection and grasping for automated crop harvesting. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2016; pp. 2428–2434. [Google Scholar]
Guo, N.; Zhang, B.; Zhou, J.; Zhan, K.; Lai, S. Pose estimation and adaptable grasp configuration with point cloud registration and geometry understanding for fruit grasp planning. Comput. Electron. Agric. 2020, 179, 105818. [Google Scholar] [CrossRef]
Zhong, Z.; Xiong, J.; Zheng, Z.; Liu, B.; Liao, S.; Huo, Z.; Yang, Z. A method for litchi picking points calculation in natural environment based on main fruit bearing branch detection. Comput. Electron. Agric. 2021, 189, 106398. [Google Scholar] [CrossRef]
Barth, R.; Hemming, J.; Van Henten, E.J. Angle estimation between plant parts for grasp optimisation in harvest robots. Biosyst. Eng. 2019, 183, 26–46. [Google Scholar] [CrossRef]
Kang, H.; Zhou, H.; Chen, C. Visual perception and modeling for autonomous apple harvesting. IEEE Access 2020, 8, 62151–62163. [Google Scholar] [CrossRef]
Zhang, F.; Gao, J.; Zhou, H.; Zhang, J.; Zou, K.; Yuan, T. Three-dimensional pose detection method based on keypoints detection network for tomato bunch. Comput. Electron. Agric. 2022, 195, 106824. [Google Scholar] [CrossRef]
Luo, L.; Yin, W.; Ning, Z.; Wang, J.; Wei, H.; Chen, W.; Lu, Q. In-field pose estimation of grape clusters with combined point cloud segmentation and geometric analysis. Comput. Electron. Agric. 2022, 200, 107197. [Google Scholar] [CrossRef]
Li, Y.; Feng, Q.; Zhang, Y.; Peng, C.; Ma, Y.; Liu, C.; Ru, M.; Sun, J.; Zhao, C. Peduncle collision-free grasping based on deep reinforcement learning for tomato harvesting robot. Comput. Electron. Agric. 2024, 216, 108488. [Google Scholar] [CrossRef]
Cao, B.; Sun, K.; Gu, Y.; Jin, M.; Liu, H. Workspace analysis based on manipulator pose dexterity map. In Proceedings of the 2018 3rd International Conference on Robotics and Automation Engineering (ICRAE); IEEE: New York, NY, USA, 2018; pp. 166–170. [Google Scholar]
Lou, X.; Yang, Y.; Choi, C. Learning to generate 6-dof grasp poses with reachability awareness. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2020; pp. 1532–1538. [Google Scholar]
Akinola, I.; Varley, J.; Chen, B.; Allen, P.K. Workspace aware online grasp planning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2018; pp. 2917–2924. [Google Scholar]
Akinola, I.; Xu, J.; Song, S.; Allen, P.K. Dynamic grasping with reachability and motion awareness. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2021; pp. 9422–9429. [Google Scholar]
Makhal, A.; Goins, A.K. Reuleaux: Robot base placement by reachability analysis. In Proceedings of the 2018 Second IEEE International Conference on Robotic Computing (IRC); IEEE: New York, NY, USA, 2018; pp. 137–142. [Google Scholar]
LaValle, S. Rapidly-Exploring Random Trees: A New Tool for Path Planning; Research Report 9811; Department of Computer Science, Iowa State University: Ames, IA, USA, 1998. [Google Scholar]
Kuffner, J.J.; LaValle, S.M. RRT-connect: An efficient approach to single-query path planning. In Proceedings of the 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065); IEEE: New York, NY, USA, 2000; Volume 2, pp. 995–1001. [Google Scholar]
Shen, H.; Xie, W.F.; Tang, J.; Zhou, T. Adaptive manipulability-based path planning strategy for industrial robot manipulators. IEEE/ASME Trans. Mechatron. 2023, 28, 1742–1753. [Google Scholar] [CrossRef]
Zhu, L.; Duan, P.; Meng, L.; Yang, X. GAO-RRT*: A path planning algorithm for mobile robot with low path cost and fast convergence. AIMS Math. 2024, 9, 12011–12042. [Google Scholar] [CrossRef]
Kavraki, L.E.; Svestka, P.; Latombe, J.C.; Overmars, M.H. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 2002, 12, 566–580. [Google Scholar] [CrossRef]
Chen, G.; Luo, N.; Liu, D.; Zhao, Z.; Liang, C. Path planning for manipulators based on an improved probabilistic roadmap method. Robot. Comput.-Integr. Manuf. 2021, 72, 102196. [Google Scholar] [CrossRef]
Chen, Y.; Chen, L.; Ding, J.; Liu, Y. Research on real-time obstacle avoidance motion planning of industrial robotic arm based on artificial potential field method in joint space. Appl. Sci. 2023, 13, 6973. [Google Scholar] [CrossRef]
Jiang, Q.; Cai, K.; Xu, F. Obstacle-avoidance path planning based on the improved artificial potential field for a 5 degrees of freedom bending robot. Mech. Sci. 2023, 14, 87–97. [Google Scholar] [CrossRef]
Noreen, I. Collision free smooth path for mobile robots in cluttered environment using an economical clamped cubic B-Spline. Symmetry 2020, 12, 1567. [Google Scholar] [CrossRef]
Tagliavini, A.; Bianco, C.G.L. η3D-splines for the generation of 3D Cartesian paths with third order geometric continuity. Robot. Comput.-Integr. Manuf. 2021, 72, 102203. [Google Scholar] [CrossRef]
Dobiš, M.; Dekan, M.; Sojka, A.; Beňo, P.; Duchoň, F. Cartesian constrained stochastic trajectory optimization for motion planning. Appl. Sci. 2021, 11, 11712. [Google Scholar] [CrossRef]
Malhan, R.K.; Thakar, S.; Kabir, A.M.; Rajendran, P.; Bhatt, P.M.; Gupta, S.K. Generation of configuration space trajectories over semi-constrained cartesian paths for robotic manipulators. IEEE Trans. Autom. Sci. Eng. 2022, 20, 193–205. [Google Scholar] [CrossRef]
Shang, H.; Chi, X.; Li, R.; Zhao, X.; Hu, H. Obstacle-Aware and High-Reach Path Planning for Robotic Manipulators in Complex Factory Farming Environments. IEEE Trans. Ind. Inform. 2025, 21, 7733–7741. [Google Scholar] [CrossRef]
Xie, F.; Guo, Z.; Li, T.; Feng, Q.; Zhao, C. Dynamic task planning for multi-arm harvesting robots under multiple constraints using deep reinforcement learning. Horticulturae 2025, 11, 88. [Google Scholar] [CrossRef]
Choi, D.W.; Park, J.H.; Yoo, J.H.; Ko, K. AI-driven adaptive grasping and precise detaching robot for efficient citrus harvesting. Comput. Electron. Agric. 2025, 232, 110131. [Google Scholar] [CrossRef]
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 5 March 2026).
Coumans, E.; Bai, Y. Pybullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning. 2016. Available online: https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA/edit?tab=t.0 (accessed on 5 March 2026).
Schafer, R.W. What is a savitzky-golay filter? IEEE Signal Process. Mag. 2011, 28, 111–117. [Google Scholar] [CrossRef]
Rakita, D.; Mutlu, B.; Gleicher, M. RelaxedIK: Real-time Synthesis of Accurate and Feasible Robot Arm Motion. In Proceedings of the Robotics: Science and Systems (RSS), Pittsburgh, PA, USA, 26–30 June 2018. art. 43. [Google Scholar]
Beeson, P.; Ames, B. TRAC-IK: An open-source library for improved solving of generic inverse kinematics. In Proceedings of the 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids); IEEE: New York, NY, USA, 2015; pp. 928–935. [Google Scholar]
Gao, J.; Zhang, F.; Zhang, J.; Yuan, T.; Yin, J.; Guo, H.; Yang, C. Development and evaluation of a pneumatic finger-like end-effector for cherry tomato harvesting robot in greenhouse. Comput. Electron. Agric. 2022, 197, 106879. [Google Scholar] [CrossRef]
Rong, J.; Wang, P.; Wang, T.; Hu, L.; Yuan, T. Fruit pose recognition and directional orderly grasping strategies for tomato harvesting robots. Comput. Electron. Agric. 2022, 202, 107430. [Google Scholar] [CrossRef]

Figure 1. (a) Hardware system of the tomato harvesting robot; (b) Modified D-H model of the RM65-B manipulator; (c) Tomato cluster images and 3D point clouds.

Figure 2. Visualization of tomato cluster processing: (a) original RGB image of a tomato cluster; (b) instance segmentation masks of the main stem (pink), peduncle (green), and fruit (blue) extracted by YOLOv8; (c) visualization of keypoints on the RGB image; (d) extracted 3D point cloud of the tomato cluster with corresponding keypoint representations.

Figure 3. Visualization of the end-effector coordinate system and the optimal grasping pose for the tomato cluster in the PyBullet simulation environment.

Figure 4. Flowchart of the PSO-based reachable pose solving algorithm.

Figure 5. Overall architecture of the hierarchical path planning algorithm. The framework illustrates the two-stage process: Stage I utilizes Cost Map 1 and greedy search for global approach, while Stage II employs Cost Map 1 and gradient descent for precision grasping.

Figure 6. The robotic manipulator URDF model and voxelized point cloud loaded in PyBullet.

Figure 7. (a–d) (Blue lines) illustrate the influence of search radius on pose estimation performance, while (e–h) (red lines) demonstrate the impact of the weight ratio between the target term and the angular difference penalty term on the performance.

Figure 8. (a–c) illustrate the impact of different values of the weight coefficient

λ

on path planning performance metrics, respectively corresponding to: (a) success rate, (b) average planning time, and (c) average path length.

Figure 8. (a–c) illustrate the impact of different values of the weight coefficient

λ

on path planning performance metrics, respectively corresponding to: (a) success rate, (b) average planning time, and (c) average path length.

Figure 9. The tomato harvesting process in the greenhouse environment. Panels (a–c) illustrate the identification, grasping, and crating stages for the first tomato cluster, while panels (d–f) display the corresponding stages for the second tomato cluster.

Table 1. D–H parameters of the RM65-B manipulator.

Link i	$α_{i - 1}$ (°)	$a_{i - 1}$ (mm)	$d_{i}$ (mm)	$θ_{i}$ (°)
1	0	0	240.5	0
2	90	0	0	90
3	0	256	0	90
4	90	0	210	0
5	−90	0	0	0
6	90	0	144	0

Table 2. Performance, scale, and runtime metrics of the custom YOLOv8-m instance segmentation model.

Target Class	Precision (%)	Recall (%)	Box ${mAP}_{50}$ (%)	Mask ${mAP}_{50}$ (%)
Main Stem	86.8	83.5	85.4	81.2
Peduncle	84.2	81.0	82.9	78.5
Tomato Fruit	93.1	90.4	92.8	89.6

Table 3. Partial Parameter Configurations for the PSO-based Reachable Pose Calculation.

Parameter	Value
Swarm size (N)	50
Maximum velocity ( $V_{m a x}$ )	$δ / 10$
Inertia weight ( $ω$ )	0.8
Cognitive acceleration coefficient ( $c_{1}$ )	2.0
Social acceleration coefficient ( $c_{2}$ )	2.0
Maximum iterations ( $G_{m a x}$ )	200

Table 4. Experimental comparison of different picking pose estimation methods.

Method	Reachability Rate	Collision Rate	Success Rate	Avg. Time (s)
FO	52.94%	24.37%	28.57%	0.1382
LOWESS	57.15%	15.97%	41.18%	0.5434
SPM	64.28%	4.20%	60.08%	0.3485
KE	64.70%	2.52%	62.18%	0.1764
Our Method	88.66%	3.78%	84.88%	0.6689

Note: The bold values within the data columns indicate the optimal performance/best results achieved for each respective evaluation metric.

Table 5. Comparison of performance between KE and the proposed PSO method at different distances.

Distance to Base (mm)	Keypoint Extraction (KE)		PSO-Based Optimization
Distance to Base (mm)	Success Rate	Avg. Time (s)	Success Rate	Avg. Time (s)
200–300	59.44%	0.1835	80.56%	0.5682
300–400	87.80%	0.1748	90.24%	0.1923
400–500	81.13%	0.1808	88.68%	0.3058
500–600	62.26%	0.1782	82.36%	0.6098
600–700	39.62%	0.1947	73.58%	1.1289
700–800	16.98%	0.1852	56.60%	1.5892
800–900	0%	0.1859	9.43%	1.7829

Table 6. Parameter settings for the path planning experiments.

Parameter	Value
Voxel size ( $v o x e l_s i z e$ )	0.02 m
Weight ratio ( $ω_{o b s t a c l e} : ω_{t a r g e t}$ )	0.2
Distance between pre-grasp and grasp points (d)	0.1 m
Gradient descent step size ( $η$ )	0.01 m
Maximum iterations ( $G_{m a x}$ )	100

Table 7. Performance comparison of different path planning methods.

Method	Success Rate	Avg. Path Length (m)	Avg. Planning Time (s)
Bi-RRT	62.60%	1.7574	1.5643
OPHF-Planner	73.95%	0.4326	0.5869
S-OPHF	83.19%	0.5984	0.7683
Our Method	89.64%	0.4526	0.7093

Note: The bold value within the data columns indicates the optimal performance/best result achieved for the respective evaluation metric.

Table 8. Quantitative ablation study results of the proposed framework modules.

Configuration	Semantic Cost Mapping	PSO	Success Rate (%)	Average Time (s)
Config 1 (Baseline)	×	×	41.0%	0.6285
Config 2	×	✔	57.0%	1.1432
Config 3	✔	×	53.0%	0.7516
Config 4 (Ours)	✔	✔	87.0%	1.2654

Note: "✔" indicates that the respective module is enabled, while "×" indicates it is disabled. The bold value within the data columns represents the optimal success rate achieved by our integrated framework.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yan, J.; Yin, J.; Hu, J.; Lai, K. Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots. Appl. Sci. 2026, 16, 5610. https://doi.org/10.3390/app16115610

AMA Style

Yan J, Yin J, Hu J, Lai K. Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots. Applied Sciences. 2026; 16(11):5610. https://doi.org/10.3390/app16115610

Chicago/Turabian Style

Yan, Junyao, Jianjun Yin, Jintang Hu, and Kefan Lai. 2026. "Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots" Applied Sciences 16, no. 11: 5610. https://doi.org/10.3390/app16115610

APA Style

Yan, J., Yin, J., Hu, J., & Lai, K. (2026). Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots. Applied Sciences, 16(11), 5610. https://doi.org/10.3390/app16115610

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Reachability-Oriented Pose Estimation and Efficient Path Planning for Tomato Harvesting Robots

Abstract

1. Introduction

2. Materials and Methods

2.1. Robotic System

2.2. Estimation of Optimal Grasping Poses for Tomato Clusters

2.3. Calculation of Manipulator Reachable Poses Based on Analytical Methods and Particle Swarm Optimization

2.3.1. Analytical Method-Based Inverse Kinematics Analysis

2.3.2. Reachable Pose Solving via Particle Swarm Optimization

2.4. Two-Stage Cartesian Path Planning Based on Cost Map

2.4.1. Cost Map Generation

2.4.2. Start-to-Pre-Grasp Path Planning

2.4.3. Pre-Grasp-to-Grasp Path Planning

2.4.4. Path Orientation Assignment and Inverse Kinematics

3. Experiments and Results

3.1. Experiment Design

3.2. Pose Estimation Experiments

3.2.1. Pose Estimation Evaluation Metrics

3.2.2. Pose Estimation Experimental Settings

3.2.3. Pose Estimation Results and Analysis

3.3. Path Planning Experiments

3.3.1. Path Planning Evaluation Metrics

3.3.2. Path Planning Experimental Settings

3.3.3. Path Planning Results and Analysis

3.4. Ablation Study

3.5. Field Experiments

3.6. General Comparison and Future Learning-Based Perspectives

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI