Retrieval Augment: Robust Path Planning for Fruit-Picking Robot Based on Real-Time Policy Reconstruction

Chen, Binhao; Zhang, Shuo; He, Zichuan; Gong, Liang

doi:10.3390/su18020829

Open Access Article

School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(2), 829; https://doi.org/10.3390/su18020829

Submission received: 20 December 2025 / Revised: 5 January 2026 / Accepted: 8 January 2026 / Published: 14 January 2026

(This article belongs to the Section Sustainable Agriculture)

Abstract

The working environment of fruit-picking robots is highly complex, involving numerous obstacles such as branches. Sampling-based algorithms like Rapidly Exploring Random Trees (RRTs) are faster but suffer from low success rates and poor path quality. Deep reinforcement learning (DRL) has excelled in high-degree-of-freedom (DOF) robot path planning, but typically requires substantial computational resources and long training cycles, which limits its applicability in resource-constrained and large-scale agricultural deployments. Moreover, picking robot agents trained by DRL underperform because of the complexity and dynamics of picking scenes. We propose a real-time policy reconstruction method based on experience retrieval to augment an agent trained by DRL. The key idea is to optimize the agent’s policy during inference rather than by retraining, thereby reducing training cost, energy consumption, and data requirements, which are critical factors for sustainable agricultural robotics. We first use Soft Actor–Critic (SAC) to train the agent on simple picking tasks with fewer episodes. When faced with complex picking tasks, instead of retraining the agent, we reconstruct its policy by retrieving experience from similar tasks and revising actions in real time, which is implemented through real-time action evaluation and rejection sampling. Overall, the agent evolves into an augment agent through policy reconstruction, enabling it to perform much better than the original agent in complex tasks with narrow passages and dense obstacles. We test our method both in simulation and in the real world. Results show that the augment agent outperforms the original agent and sampling-based algorithms such as BIT* and AIT* in terms of success rate (+133.3%) and path quality (+60.4%), demonstrating its potential to support reliable, scalable, and sustainable fruit-picking automation.

Keywords:

fruit-picking robot; reinforcement learning; real-time policy reconstruction; experience retrieval; action revision

1. Introduction

In agriculture, picking tasks require robots to recognize target fruit and obstacles, and to plan a collision-free path connecting the start posture and the target posture. A robust picking strategy must address both grasping posture decision and path planning, which challenges algorithmic efficiency and accuracy because fruits vary in spatial posture and shape. With increasing labor shortages, rising production costs, and the demand for reduced environmental impact, efficient and intelligent harvesting robots have become a key enabler for sustainable agricultural production. Researchers have proposed solutions such as RRT, A*, and APF [1,2,3]. Sampling-based algorithms, including RRT and its improved versions such as RRT-Connect [4] and RRT* [5], are widely used for their computational efficiency. Hemming et al. [6] applied RRT for autonomous pepper picking, but their approach was limited to static environments with minimal interactions. Wang et al. [7] utilized RRT with polynomial trajectory optimization for tomato picking, achieving an 88% success rate, albeit with a high time cost (20 s per task). However, these methods are difficult to apply to high-DOF robots, as they require precise environment descriptions, and their computational complexity increases exponentially with the robot’s DOF [8].

Recent advancements in DRL have shown considerable potential for path planning in high-DOF robots [9,10,11]. DRL algorithms such as SAC [12], which update policy through experience replay and reward mechanisms, train the action policy network to solve the Markov Decision Process (MDP) for picking tasks. Chiang et al. [13] integrated DRL with classical RRT to improve path planning efficiency. Wang et al. [14] formulated agricultural picking sequence planning as a three-dimensional TSP and solved it using a pointer network-based actor–critic DRL framework. However, because the robot does not fully learn the real-world environment, it often encounters unforeseen obstacle zones, resulting in movement failures. Li et al. [15] developed a DRL-based harvesting strategy for clustered kiwifruits, in which fruit recognition and localization as well as picking order planning are first performed, followed by DRL-based optimization to generate an efficient harvesting strategy. Yi et al. [16] proposed a self-supervised DRL-based view planning method that employs a Self-Supervised Convolutional Network to evaluate the effectiveness of actions during training and dynamically adjust rewards to guide policy learning. Liu et al. [17] proposed an expert experience-based guided reinforcement learning strategy for high-DOF robotic arms in apple picking, outperforming RRT in planning time and path quality. Feng et al. [18] introduced an enhanced HER-SAC algorithm that incorporates a heuristic action fusion strategy during training to optimize cherry tomato picking. Their work addresses grasping posture decisions more thoroughly, and is referenced in the training of the original agent (Section 3.2) in this paper. Despite the promising potential of DRL, its application in agricultural picking tasks remains constrained by the substantial computational cost required for training and the limited adaptability of offline policies to dynamically changing environments. 
Existing optimization approaches, such as retrieval-based reinforcement learning methods, typically incorporate human experience directly into the training loop, which further increases training complexity and computational burden. Moreover, updating the experience repository generally necessitates retraining the policy network, resulting in additional computational overhead and high overall training costs. Such resource-intensive learning paradigms are difficult to sustain in long-term, large-scale agricultural systems, where computational efficiency, energy consumption, and deployment cost are critical considerations.

In this paper, we present a decision framework based on real-time policy reconstruction to augment a DRL-trained agent, enabling robust picking path planning in complex picking scenes (Figure 1). By shifting policy optimization from the training phase to real-time inference, the proposed framework reduces the need for repeated retraining and supports more energy- and data-efficient robot operation, aligning well with the requirements of sustainable agricultural automation. Retrieval augment, originating from the use of external memory in natural language processing (NLP) to improve large language models [19], has also been employed by Peter C. et al. [20] in large-scale verification to enhance DRL agent performance in Go scenarios. In contrast to policy correction or residual reinforcement learning methods, which typically require learning an additional corrective policy or residual model, our approach performs explicit policy reconstruction at the action level. Specifically, the final action decision is generated through real-time evaluation, weighted fusion, and rejection sampling between the original policy actions and experience-generated candidate actions. This process does not rely on any learnable correction model and does not introduce new optimization objectives during deployment. Furthermore, compared with planning-guided reinforcement learning methods, the proposed framework neither requires an explicit system dynamics model nor performs online trajectory optimization. Instead, it leverages an offline-constructed experience base and its probabilistic modeling results to provide high-quality candidate actions for the current state under strict real-time constraints. Our framework first regresses pre-collected, high-quality empirical data to summarize the robot’s performance in typical complex scenes, thereby constructing an experience base.
When faced with difficult picking tasks, the original agent can reconstruct its action policy by retrieving from the experience base, evolving into an augment agent capable of adapting to dynamic and complex environments. The main contributions of this paper are as follows:

1. Propose a decision framework for real-time policy reconstruction based on experience retrieval.
2. Propose a new method for collecting and characterizing high-quality robot empirical data.
3. Instantiate a specific policy reconstruction scheme: an action fusion method based on real-time evaluation and rejection sampling is proposed for policy reconstruction.

2. Problem Formulation

We consider path planning for a picking task, which requires planning a collision-free path from the starting position to the optimal cutting position in the workspace and maximizing the path quality (path length and smoothness). Specifically, we define this problem as follows.

As shown in Figure 2, we consider the robot $R$ and the goal $G = \{f, k, b_{1}, b_{2}\}$ within the same bounded workspace $χ \subset \mathbb{R}^{3}$. The goal $G$ consists of the fruit $f$ ($kg$ and cube $F$), the operation point $k$, the target branch $b_{1}$ ($kk^{'}$), and the main branch $b_{2}$ ($mn$). The workspace is divided into free space $χ_{free}$ and obstacle space $χ_{obstacle}$, where obstacles are generalized as 3D entities in $χ$, excluding the target branch. We define the robot’s state $s(t)$ at time $t$ as its joint angles $θ = (θ_{1}, \dots, θ_{n})$ in configuration space $Q \subset \mathbb{R}^{n}$, and the end-effector pose $T = (p, q)$ in Cartesian space, with $p = (x, y, z)$ as the 3D position and $q = (q_{w}, q_{x}, q_{y}, q_{z})$ as the quaternion representing the rotation. The set of valid states of the robot is defined as $S = \{s(θ, T) \mid collision(s) = 0, \forall θ_{i} \in θ : θ_{i}^{min} < θ_{i} < θ_{i}^{max}\}$. The starting pose of the robot is $s_{0} = (θ_{0}, T_{0})$, and the target pose is the optimal cutting pose $s^{*} = (θ^{*}, T^{*})$. The optimal cutting pose satisfies the conditions $s^{*} \in S$, $d(p^{*}, k) \leq δ$, and $v_{H}^{Z} \cdot v_{b_{1}} \leq ϵ$, where $k$ is the unit vector pointing from the world coordinate origin to the operation point $k$, $v_{H}^{Z}$ is the unit vector in the direction of the Z-axis of the end-effector’s reference frame in the world coordinate system, and $v_{b_{1}}$ is the unit vector in the direction of branch $b_{1}$. Here, $δ$ and $ϵ$ are small constants that ensure the end effector is sufficiently close to $k$ and in an orthogonal cutting posture. According to this definition, the optimal cutting pose is not unique; i.e., there exists a set $S^{*} = \{s^{*} \mid s^{*} \in S, d(p^{*}, k) \leq δ, v_{H}^{Z} \cdot v_{b_{1}} \leq ϵ\}$. We define a feasible path $c_{i}(s_{start}, s_{goal})$ connecting states $s_{start}$ and $s_{goal}$ as a sequence of states, i.e., $c_{i}(s_{start}, s_{goal}) = (s_{1}^{i}, s_{2}^{i}, s_{3}^{i}, \dots, s_{m}^{i})$, where $s_{1}^{i} = s_{start}$, $s_{m}^{i} = s_{goal}$, and $s_{j}^{i} \in S$.

We observe that there exists a set of paths that connect the starting position and the optimal cutting position, i.e., $C = \{c(s_{0}, s^{*}) \mid s^{*} \in S^{*}\}$. It is important to note that we do not intend to traverse the set $C$ to find the optimal path, which would be expensive in both time and computation. Instead, our goal is for the robot to autonomously plan a path $c_{i}(s_{0}, s^{*})$ on the first attempt, satisfying $c_{i} \in C$, while minimizing the path cost function $J(c)$. In this paper, we define the cost function as $J(c) = \sum_{i = 1}^{m - 1} f(i, i + 1)$, where $m$ is the length of the path sequence, and $f(i, i + 1) = w_{1} \cdot d(p_{i}, p_{i + 1}) + w_{2} \cdot \sqrt{\sum_{k = 1}^{n_{joints}} {(θ_{i}^{k} - θ_{i + 1}^{k})}^{2}}$ represents the cost of the transition from $s_{i}$ to $s_{i + 1}$.
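The cost function above can be illustrated with a minimal Python sketch. It assumes each state is stored as a pair (end-effector position, joint angles); the weights `w1` and `w2` default to 1.0 purely for illustration, since the source does not state their values.

```python
import math

def transition_cost(s_a, s_b, w1=1.0, w2=1.0):
    """Cost f(i, i+1) between two consecutive states.

    Each state is (position, joint_angles); w1 and w2 are assumed values.
    """
    (p_a, theta_a), (p_b, theta_b) = s_a, s_b
    cartesian = math.dist(p_a, p_b)        # d(p_i, p_{i+1})
    joint = math.dist(theta_a, theta_b)    # Euclidean distance in joint space
    return w1 * cartesian + w2 * joint

def path_cost(path, w1=1.0, w2=1.0):
    """J(c): sum of transition costs over all consecutive state pairs."""
    return sum(transition_cost(a, b, w1, w2) for a, b in zip(path, path[1:]))

# Toy path with three states and a 2-joint arm (hypothetical numbers).
path = [
    ((0.0, 0.0, 0.0), (0.0, 0.0)),
    ((3.0, 4.0, 0.0), (0.0, 0.0)),   # pure Cartesian move of length 5
    ((3.0, 4.0, 0.0), (0.3, 0.4)),   # pure joint move of length 0.5
]
total = path_cost(path)
```

With equal weights, the toy path costs the Cartesian length of the first move plus the joint-space length of the second.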

We classify picking tasks into simple $G_{1}$ and complex $G_{2}$ categories (as shown in Figure 2). Let the operation point $k$ serve as the center of a square region $Z$ with side length $r$, which defines the area of interest. The obstacle percentage within this region is calculated as $p_{obs} = \frac{V_{obs}}{V_{sphere}}$. We designate the half-square of $Z$ oriented toward the robot as the operational space $Z_{0}$. From the outer boundary of $Z_{0}$, $m$ parallel rays $\{r_{i}\}_{i = 1}^{m}$ are emitted uniformly in the vertical direction. Rays that encounter obstacles within a threshold distance $r_{0}$ are labeled $r_{obs}$. These rays are then projected onto a grid roadmap $M_{grid}$, where grids corresponding to $r_{obs}$ are marked as obstacle grids. The Manhattan distances $\{d_{i}\}_{i = 1}^{n}$ from $k$ to each obstacle grid are recorded, and their average is computed as $d_{ave} = \frac{\sum_{i = 1}^{n} d_{i}}{n}$, with $n$ being the number of obstacle grids. A scene is defined as complex if $p_{obs} > p$ and $d_{ave} < γ$, where $p$ and $γ$ are thresholds that ensure complex scenes are characterized by dense obstacles and narrow passages. Consequently, the set of goals $G$ is partitioned into simple scenes $G_{1}$ and complex scenes $G_{2}$, i.e., $G = G_{1} \cup G_{2}$.
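As a rough illustration of the classification rule, the sketch below computes $d_{ave}$ on a small binary grid and applies the two thresholds. The grid, the cell holding $k$, and the threshold values are all hypothetical; the source does not report the values of $p$ and $γ$ in this section.

```python
def average_manhattan_distance(grid, k_cell):
    """Mean Manhattan distance from the cell containing k to every obstacle cell.

    grid is a list of rows with 1 for obstacle cells and 0 otherwise.
    Returns infinity when there are no obstacle cells (an assumed convention).
    """
    k_row, k_col = k_cell
    distances = [
        abs(r - k_row) + abs(c - k_col)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value == 1
    ]
    return sum(distances) / len(distances) if distances else float("inf")

def is_complex(p_obs, d_ave, p_thresh, gamma):
    """A scene is complex if obstacles are dense AND close to k."""
    return p_obs > p_thresh and d_ave < gamma

# Hypothetical 3x3 roadmap with k in the centre cell.
grid = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
d_ave = average_manhattan_distance(grid, (1, 1))
complex_scene = is_complex(p_obs=0.4, d_ave=d_ave, p_thresh=0.3, gamma=2.0)
```

Both conditions must hold, so a densely cluttered scene whose obstacles lie far from $k$ would still count as simple under this rule.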

Given the extensive use of mathematical notation in this paper, all symbols appearing in this section and subsequent sections are summarized and explained in Appendix A.

3. Retrieval Augment: Real-Time Policy Reconstruction

This section presents our method, starting with an overview in Section 3.1, followed by a detailed description of each component.

3.1. Overview

The entire decision framework is realized as shown in Figure 3.

3.1.1. Original Agent Training

The policy reconstruction proposed in this paper begins with the pre-training of a picking robot agent. At this stage, the focus is not on achieving optimal performance across all picking tasks, because training a strong agent places heavy demands on computational resources and on the DRL algorithm itself. Instead, the approach emphasizes enhancing the agent’s performance through retrieval augment, without further training. To achieve this, we extract simple scenes from $G_{1}$ as training tasks and train the agent with fewer episodes (see Section 3.2 and the experiments for details).

3.1.2. Experience Base

Constructing an experience base is essential for effective policy reconstruction. In this paper, we adopt a scene-experience bundle model, where each experience corresponds to a typical challenging scene. For a typical scene $G \in G_{2}$, we propose a raw data collection method based on hierarchical collaborative path exploration. First, the workspace $χ$ is decomposed and initially explored to identify risk regions $R$ via a scoring mechanism. Subsequently, a deeper exploration is conducted in the Cartesian space within $R$, where a reward function weights the data to yield the raw dataset $D = ((T_{1}, w_{1}), (T_{2}, w_{2}), \dots, (T_{n}, w_{n}))$, with each $T_{i}$ representing the end-effector pose (position $p_{i}$ and orientation $q_{i}$) and $w_{i}$ denoting its weight. $D$ is divided into $D_{p} = ((p_{1}, w_{1}), (p_{2}, w_{2}), \dots, (p_{n}, w_{n}))$ and $D_{q} = ((q_{1}, w_{1}), (q_{2}, w_{2}), \dots, (q_{n}, w_{n}))$. A weighted Expectation Maximization (EM) algorithm is then applied to $D_{p}$ and $D_{q}$ to derive the Gaussian Mixture Models (GMMs) $G_{p}$ and $G_{q}$. Finally, the scene, risk region, and sampling models are integrated into an experience: $E_{i} = (G_{i}, R_{i}, G_{p}^{i}, G_{q}^{i})$. Note that $q$ in $D_{q}$ represents a quaternion. Directly applying a standard GMM in Euclidean space would ignore the intrinsic manifold structure of unit quaternions on $S^{3}$, thereby destroying their inherent geometric constraints and leading to invalid samples. To avoid data distortion and improve sampling efficiency, we first map $q$ to the tangent space to preserve its geometric constraints. We then perform the EM iterations in this tangent space. Finally, a valid unit quaternion is recovered by applying the inverse mapping to the sampling result (see Section 3.3 for details).

3.1.3. Experience Retrieval

When the original agent encounters a task scene $G \in G_{2}$, it extracts the scene features and retrieves the most similar scene from the experience base. Following the method described in Section 2, we compute $p_{obs}$, the unit vectors $v_{b_{1}}$ and $v_{b_{2}}$, and the grid roadmap $M_{grid}$, which is then mapped to a matrix $M$ with elements 0 or 1. The similarity between $G_{i}$ and $G_{k}$ is computed using the following equation:

SIM (G_{i}, G_{k}) = w_{H_{u}} \frac{H_{i} \cdot H_{k}}{∥ H_{i} ∥ \cdot ∥ H_{k} ∥} + w_{b_{1}} v_{b_{1}}^{i} \cdot v_{b_{1}}^{k} + w_{b_{2}} v_{b_{2}}^{i} \cdot v_{b_{2}}^{k} + w_{p} (1 - |p_{obs}^{i} - p_{obs}^{k}|)

(1)

where $H_{i}, H_{k}$ are the Hu moments of the matrices $M_{i}, M_{k}$, and the weights satisfy $w_{H_{u}} + w_{b_{1}} + w_{b_{2}} + w_{p} = 1$. For a task scene $G$, the retrieved scene is $G^{*} = \arg\max_{G^{'} \in G_{2}} SIM(G, G^{'})$, and its associated experience $E^{*}$ is extracted for policy reconstruction.

Notably, the design objective of the similarity metric is to capture task-level geometric relevance rather than to serve as a learnable module. Specifically, Hu moments are employed to characterize the global distribution of obstacles, the branch orientation term encodes constraints directly related to the cutting posture in picking tasks, and the obstacle distance reflects the narrowness of local passageways. The weighting scheme is intended to balance these complementary geometric features, while normalization ensures their comparability in the similarity computation. In our experiments, the weights are set to $w_{H_{u}} = 0.3$, $w_{b_{1}} = 0.2$, $w_{b_{2}} = 0.2$, and $w_{p} = 0.3$. The experimental results in Section 4.2 indirectly validate this design: they show that the similarity of retrieved experiences tends to stabilize as the size of the experience base increases. Moreover, subsequent experimental results indicate that the performance of policy reconstruction is more sensitive to the quality of experience similarity itself than to the specific numerical values of the weights.
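To make the retrieval step concrete, here is a small sketch of Equation (1) and the arg-max lookup. It assumes the Hu-moment vectors have already been computed elsewhere and are passed in as plain lists; the dictionary layout, the scene values, and the names `scene_similarity` and `retrieve` are illustrative choices, not part of the original method description. The default weights are the ones reported above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def scene_similarity(a, b, w_hu=0.3, w_b1=0.2, w_b2=0.2, w_p=0.3):
    """Equation (1): weighted sum of four geometric similarity terms."""
    return (
        w_hu * cosine(a["hu"], b["hu"])                  # obstacle layout
        + w_b1 * dot(a["v_b1"], b["v_b1"])               # target branch direction
        + w_b2 * dot(a["v_b2"], b["v_b2"])               # main branch direction
        + w_p * (1.0 - abs(a["p_obs"] - b["p_obs"]))     # obstacle percentage
    )

def retrieve(query, experience_base):
    """Return the stored experience whose scene is most similar to the query."""
    return max(experience_base, key=lambda e: scene_similarity(query, e["scene"]))

# Hypothetical scenes; branch directions are unit vectors.
scene_a = {"hu": [0.2, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0],
           "v_b1": (1.0, 0.0, 0.0), "v_b2": (0.0, 0.0, 1.0), "p_obs": 0.4}
scene_b = {"hu": [0.1, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0],
           "v_b1": (0.0, 1.0, 0.0), "v_b2": (0.0, 0.0, 1.0), "p_obs": 0.7}
experience_base = [{"name": "E1", "scene": scene_a},
                   {"name": "E2", "scene": scene_b}]

query = dict(scene_a, p_obs=0.45)   # same geometry, slightly denser obstacles
best = retrieve(query, experience_base)
```

Because the weights sum to 1 and each term is at most 1 for unit vectors, a scene compared with itself scores exactly the maximum similarity of 1.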

3.1.4. Real-Time Policy Reconstruction

The two key elements of policy reconstruction are the action policy $π_{θ}(a_{t} ∣ s_{t}) = N(a_{t} ∣ μ_{θ}(s_{t}), σ_{θ}(s_{t}))$ of the original agent and the experience $E^{*} = (G^{*}, R^{*}, G_{p}^{*}, G_{q}^{*})$. We define the policy reconstruction process as $RC(π_{θ}, E^{*}, t) = π^{*}(a_{t} ∣ s_{t})$, which evolves the original agent into an augment agent. Notably, $RC$ behaves differently at different time steps $t$. The simplest case occurs when reconstruction is not required in the current state, causing $π^{*}$ to revert to the original policy $π_{θ}$ (see Section 3.4 for details).

3.2. Original Agent Training

3.2.1. Observation Space

The observation consists of the following: (a) the joint angles $θ = [θ_{1}, θ_{2}, \dots, θ_{n}]$; the robot in this paper is a 7-DOF robotic arm, so $θ \in \mathbb{R}^{7}$; (b) the pose of the end effector $T = [p, q] \in \mathbb{R}^{7}$; (c) the Euclidean distance $d$ from the origin of the end-effector reference frame to the target operating point; (d) the angle $α = \arccos (\frac{v_{H}^{z} \cdot v_{b_{1}}}{∥ v_{H}^{z} ∥ ∥ v_{b_{1}} ∥})$ between the unit vector along the z-axis of the end-effector reference frame and the unit vector along the target branch $b_{1}$; (e) the unit vector $v_{b_{2}}$ in the direction of the main branch $b_{2}$; (f) a binary flag representing the collision detection result. The observation at time step $t$ is therefore represented as a 20-dimensional vector:

s (t) = [θ, T, d, α, v_{b_{2}}, collision] \in R^{20}

(2)

This 20-dimensional observation space vector needs to be normalized.

3.2.2. Action Space

The action is described by the desired change in position and orientation, represented by $\dot{p}$ and $\dot{q}$, respectively:

a (t) = [\dot{p}, \dot{q}] \in R^{7}

(3)

3.2.3. Reward Function

The goal is to guide the robot to autonomously plan a path from the starting pose to the optimal cutting pose and to maximize the quality of the path. Based on this goal, we design five reward terms to reward or penalize the action at each step. $r_{distance}$ guides the robot to approach the operation point. $r_{pos}$ guides the robot to adjust its posture to the optimal grasping posture. $r_{traj}$ guides the robot to plan a path that is as smooth as possible. $r_{collision}$ penalizes paths that are not collision-free. Given the design of the action space, the state reached after performing an action may be a singular configuration of the robot, so $r_{valid}$ is designed to guide the policy network to output actions for which an inverse kinematics solution exists.

r_{d i s t a n c e} = - \frac{1}{2} d^{2}

(4)

r_{pos} = - {(\arccos (\frac{v_{H}^{z} \cdot v_{b_{1}}}{∥ v_{H}^{z} ∥ ∥ v_{b_{1}} ∥}) - \frac{π}{2})}^{2} = - {(α - \frac{π}{2})}^{2}

(5)

r_{traj} = - \sqrt{\sum_{i = 1}^{n_{joints}} {(θ_{t}^{i} - θ_{t - 1}^{i})}^{2}}

(6)

r_{collision} = \begin{cases} - 100, & \text{if collision}, \\ 0, & \text{otherwise}, \end{cases}

(7)

r_{valid} = \begin{cases} - 100, & \text{if invalid}, \\ 0, & \text{otherwise}. \end{cases}

(8)

Finally, the reward at each step is computed by

r_{t} (s_{t}, a_{t}) = λ_{1} r_{distance} + λ_{2} r_{pos} + λ_{3} r_{traj} + r_{collision} + r_{valid} .

(9)

In this paper, the values of $λ_{1}$, $λ_{2}$, and $λ_{3}$ are all set to $\frac{1}{3}$.
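The per-step reward of Equations (4)–(9) can be sketched as follows. This is a minimal illustration that assumes $α$ is the angle in radians between $v_{H}^{z}$ and $v_{b_{1}}$, so that $r_{pos}$ is largest when the two are orthogonal; the function name and argument layout are hypothetical.

```python
import math

def step_reward(d, alpha, theta_t, theta_prev, collision, invalid,
                lambdas=(1 / 3, 1 / 3, 1 / 3)):
    """Per-step reward combining Equations (4)-(9).

    d          : distance from the end effector to the operation point
    alpha      : angle (rad) between the tool z-axis and the target branch (assumed)
    theta_t    : joint angles at the current step
    theta_prev : joint angles at the previous step
    collision  : True if the new state collides
    invalid    : True if no inverse kinematics solution exists
    """
    r_distance = -0.5 * d ** 2                      # Eq. (4)
    r_pos = -(alpha - math.pi / 2) ** 2             # Eq. (5)
    r_traj = -math.dist(theta_t, theta_prev)        # Eq. (6)
    r_collision = -100.0 if collision else 0.0      # Eq. (7)
    r_valid = -100.0 if invalid else 0.0            # Eq. (8)
    l1, l2, l3 = lambdas
    return l1 * r_distance + l2 * r_pos + l3 * r_traj + r_collision + r_valid

# At the goal, orthogonal to the branch, without moving: no penalty at all.
ideal = step_reward(0.0, math.pi / 2, (0.0, 0.0), (0.0, 0.0), False, False)
# The same state with a collision incurs the fixed penalty.
crashed = step_reward(0.0, math.pi / 2, (0.0, 0.0), (0.0, 0.0), True, False)
```

The two fixed penalties dominate the shaped terms by design, so any colliding or kinematically invalid step is strongly discouraged regardless of how close it brings the end effector to the goal.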

3.3. Experience Base

We propose a data collection method based on hierarchical collaborative exploration. To ensure the integrity of the quaternion data representing the orientation of the end effector, we employ a tangent space mapping during regression and an inverse mapping during sampling. This approach prevents the distortions that arise from directly fitting a Euclidean-space GMM to data that lie on the quaternion manifold.

In our previous work, we proposed a hierarchical collaborative path planning algorithm based on workspace decomposition [21], demonstrating its advantages in harvesting tasks. The core concept is that the discrete workspace guides the search direction in the continuous configuration space, while progress in the configuration space is fed back to the discrete layer to update the growth direction. In this paper, we introduce an empirical data collection method based on hierarchical collaborative path exploration, incorporating the reward function from Section 3.2 into the collection process.

First, we decompose the workspace of task scene $G_{i}$ into cubic local regions (called cells) of uniform size $(R_{1}, R_{2}, \dots, R_{n})$, which serve as nodes, with the spatial proportion of obstacles in each cell used as weights to initialize the roadmap $M$ (Algorithm 1, Lines 1–2). The cell containing the end effector’s starting position and the cell containing the operation point $k$ are labeled $R_{start}$ and $R_{goal}$, respectively. We then apply Dijkstra’s algorithm [22], which is globally optimal, to find the desired exploration path in $M$ connecting $R_{start}$ and $R_{goal}$, resulting in a sequence of cells $\{R_{i}^{(j)}\}_{j = 1}^{J}$, where $i$ denotes the index of the cell in $M$ and $j$ denotes its position in the desired exploration path (Algorithm 1, Lines 3–4). To avoid data redundancy, after the primary exploration, we score all cells in $\{R_{i}^{(j)}\}_{j = 1}^{J}$. Only those with scores below a certain threshold (risk cells) are explored further. For the primary exploration, we uniformly sample the end-effector states $T = (p, q)$ in Cartesian space. After sampling, all points within two adjacent cells are connected pairwise, and each path is scored for feasibility (Algorithm 1, Lines 5–16). Scoring is based on the following equation:

Score (R_{k}) = \frac{exp (α \cdot VAL (R_{k})) \cdot exp (β \cdot CONN (R_{k}))}{γ \cdot {OBS}^{2} (R_{k})}

(10)

where $OBS(R_{k})$ represents the volume ratio of obstacles within the cell space, $VAL(R_{k})$ is the fraction of sampled points that have an IK solution and do not collide with obstacles, and $CONN(R_{k})$ is the fraction of connecting lines between sampled points in adjacent cells that reach the next cell along a collision-free path. The constants $α$, $β$, and $γ$ are tunable parameters. Cells with scores below a threshold $η$ are labeled as risk cells $R_{risk} = \{R_{k} \mid Score(R_{k}) \leq η\}$ (Algorithm 1, Line 17).
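A minimal sketch of the scoring rule in Equation (10) and the risk-cell selection is given below. The parameter values, the threshold, and the two example cells are hypothetical, and returning infinity for an obstacle-free cell is an added convention to avoid division by zero; the source does not say how that case is handled.

```python
import math

def cell_score(val, conn, obs, alpha=1.0, beta=1.0, gamma=1.0):
    """Equation (10): high when many samples are valid and connectable,
    low when the cell is crowded with obstacles."""
    if obs == 0:
        return float("inf")   # assumed convention: empty cells are never risky
    return math.exp(alpha * val) * math.exp(beta * conn) / (gamma * obs ** 2)

def find_risk_cells(cells, eta, **params):
    """Return the names of cells whose score does not exceed the threshold eta."""
    return [
        name
        for name, (val, conn, obs) in cells.items()
        if cell_score(val, conn, obs, **params) <= eta
    ]

# Hypothetical cells: (VAL, CONN, OBS) ratios in [0, 1].
cells = {
    "open": (1.0, 1.0, 0.1),        # almost free, every sample usable
    "cluttered": (0.2, 0.1, 0.8),   # dense obstacles, few usable samples
    "empty": (1.0, 1.0, 0.0),
}
risk = find_risk_cells(cells, eta=10.0)
```

Because the obstacle ratio enters squared in the denominator, crowded cells are pushed toward low scores much faster than the exponential terms can compensate.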

We explore the cells in $R_{risk}$ sequentially as follows: for each cell $R_{k} \in R_{risk}$, we uniformly sample the end-effector states $T = (p, q)$ to obtain the sequence $\{T_{1}^{k}, T_{2}^{k}, \dots, T_{n}^{k}\}$. To prevent invalid state data from influencing the regression results of the EM algorithm, we discard invalid states rather than assigning them weights. Consequently, the terms $r_{collision}$, $r_{valid}$, and $r_{traj}$ of the reward function introduced in Section 3.2 are omitted. Each state $T_{i}^{k}$ is assigned a weight $w_{i}^{k}$ based on $r_{pos}$ and $r_{distance}$, using the following formula:

w_{i}^{k} = exp (- η_{1} \cdot \frac{1}{r_{distance}} - η_{2} \cdot \frac{1}{r_{pos}})

(11)

where $η_{1}$ and $η_{2}$ are constant coefficients. The weights are then normalized as $w_{i}^{k} \leftarrow \frac{w_{i}^{k} - w_{min}^{k}}{w_{max}^{k} - w_{min}^{k}}$, resulting in the weighted data $D^{k} = ((T_{1}^{k}, w_{1}^{k}), (T_{2}^{k}, w_{2}^{k}), \dots, (T_{n}^{k}, w_{n}^{k}))$. $D^{k}$ is further decomposed into $D_{p}^{k} = ((p_{1}^{k}, w_{1}^{k}), (p_{2}^{k}, w_{2}^{k}), \dots, (p_{n}^{k}, w_{n}^{k}))$ and $D_{q}^{k} = ((q_{1}^{k}, w_{1}^{k}), (q_{2}^{k}, w_{2}^{k}), \dots, (q_{n}^{k}, w_{n}^{k}))$ (Algorithm 1, Lines 18–28). Using these data as input, we apply the EM algorithm for regression to obtain $G_{p}^{k}$ and $G_{q}^{k}$. By completing the exploration of all risk cells, we obtain the sets $\{G_{p}^{k}\}_{k = 1}^{r}$ and $\{G_{q}^{k}\}_{k = 1}^{r}$, where $r$ is the size of $R_{risk}$.
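The weighting of Equation (11) followed by min-max normalization might look like the sketch below. It assumes $r_{distance}$ and $r_{pos}$ take the forms of Equations (4) and (5), with $α$ as an angle in radians. The small floor `eps` and the exponent clamp `max_exp` are additions made here only to keep the example numerically safe near $d = 0$ or $α = π/2$; they are not described in the source, and the coefficient values are hypothetical.

```python
import math

def raw_weight(d, alpha, eta1=1.0, eta2=1.0, eps=1e-6, max_exp=700.0):
    """Equation (11) for one sampled state.

    Both rewards are negative, so -eta / r is positive and grows as the
    state approaches the operation point and the orthogonal posture.
    """
    r_distance = min(-0.5 * d ** 2, -eps)               # floor avoids 1/0 (assumed guard)
    r_pos = min(-(alpha - math.pi / 2) ** 2, -eps)      # floor avoids 1/0 (assumed guard)
    exponent = -eta1 / r_distance - eta2 / r_pos
    return math.exp(min(exponent, max_exp))             # clamp avoids overflow (assumed guard)

def min_max_normalize(weights):
    """Rescale weights to [0, 1]; a constant list maps to all ones (assumed convention)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [1.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]

# Two hypothetical valid samples given as (distance, angle).
samples = [(1.0, 1.0), (2.0, 0.5)]
weights = min_max_normalize([raw_weight(d, a) for d, a in samples])
```

Note that min-max normalization assigns weight 0 to the worst sample in each cell, so that sample no longer influences the subsequent regression.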

Algorithm 1 Data Collecting Based on Hierarchical Collaborative Path Exploring

Require: Task scene $G_{i}$, start position $s_{0}$, operating point $k$
Ensure: $D_{p}$, $D_{q}$

1: $R \leftarrow WorkSpaceDecomposition (G_{i}, χ)$
2: Initialize roadmap $M_{i}$ using $R$
3: $R_{start}, R_{goal} \leftarrow LocalRegionMapping (s_{0}, k)$
4: $\{R_{i}^{(j)}\}_{j = 1}^{J} \leftarrow Dijkstra (M_{i}, R_{start}, R_{goal})$
▹ Preliminary exploration
5: for $R_{i}^{(j)} \in \{R_{i}^{(j)}\}_{j = 1}^{J}$ do
6:     $\{T_{1}, \dots, T_{n}\}^{(j)} \leftarrow Sampling (R_{i}^{(j)})$
7:     for $T \in \{T_{1}, \dots, T_{n}\}^{(j)}$ do
8:         $IK (T)$
9:         $CollisionDetection (T)$
10:     end for
11: end for
12: for $(R_{i}^{(j)}, R_{i}^{(j + 1)}) \in \{R_{i}^{(j)}\}_{j = 1}^{J}$ do
13:     for $T \in \{T_{1}, \dots, T_{n}\}^{(j)}, T^{'} \in \{T_{1}, \dots, T_{n}\}^{(j + 1)}$ do
14:         $isPathFeasible (T, T^{'})$
15:     end for
16: end for
17: $R_{risk} \leftarrow FindRiskRegion (\{R_{i}^{(j)}\}_{j = 1}^{J})$
▹ Explore risk regions
18: Initialize $D \leftarrow \emptyset$
19: for $R \in R_{risk}$ do
20:     $\{T_{1}, \dots, T_{n}\} \leftarrow Sampling (R)$
21:     for $T \in \{T_{1}, \dots, T_{n}\}$ do
22:         if $T$ is valid then
23:             $w \leftarrow ComputeWeight (T)$
24:             $D \leftarrow D \cup \{(T, w)\}$
25:         end if
26:     end for
27: end for
28: $(D_{p}, D_{q}) \leftarrow DataSegmentation (D)$
29: return $D_{p}, D_{q}$
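The first four lines of Algorithm 1 amount to a shortest-path search over the cell roadmap. The sketch below shows one way this could look, using a binary heap. The edge cost (one unit step plus the obstacle ratio of the cell being entered) is an assumption, since the source states only that obstacle proportions serve as weights; the four-cell roadmap is a toy example.

```python
import heapq

def dijkstra(neighbors, obstacle_ratio, start, goal):
    """Cheapest sequence of cells from start to goal.

    neighbors      : dict mapping each cell to its adjacent cells
    obstacle_ratio : dict mapping each cell to its obstacle proportion
    Entering a cell costs 1 + obstacle_ratio[cell] (assumed cost model).
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v in neighbors[u]:
            nd = d + 1.0 + obstacle_ratio[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None                       # goal unreachable
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy roadmap: two routes from A to D, one through a cluttered cell B.
neighbors = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
obstacle_ratio = {"A": 0.0, "B": 0.9, "C": 0.1, "D": 0.0}
cell_path = dijkstra(neighbors, obstacle_ratio, "A", "D")
```

The returned cell sequence plays the role of the desired exploration path that the later sampling and scoring steps traverse.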

The EM algorithm is widely used for parameter estimation in probabilistic models with hidden variables and alternates between an E-step and an M-step. When data have associated weights, these weights must be incorporated into both steps to reflect the importance of each data point. Since each weight is calculated individually, in general $\sum_{i = 1}^{n} w_{i}^{k} \neq 1$, so the EM algorithm must explicitly introduce the sum of weights $S = \sum_{i = 1}^{n} w_{i}^{k}$ and normalize the weights as ${w_{i}^{k}}^{'} = \frac{w_{i}^{k}}{S}$ to ensure unbiased parameter estimation.

In the E-step, the EM algorithm computes the posterior probabilities of the hidden variables using the following equation:

γ_{i k} = \frac{w_{i} π_{k} N (x_{i}; μ_{k}, Σ_{k})}{\sum_{j = 1}^{K} π_{j} N (x_{i}; μ_{j}, Σ_{j})}

(12)

In the M-step, the algorithm maximizes the expected likelihood function to update the parameters:

π_{k} = \frac{\sum_{i = 1}^{N} γ_{i k}}{\sum_{i = 1}^{N} w_{i}} = \sum_{i = 1}^{N} γ_{i k}

(13)

μ_{k} = \frac{\sum_{i = 1}^{N} γ_{i k} w_{i} x_{i}}{\sum_{i = 1}^{N} γ_{i k} w_{i}}

(14)

Σ_{k} = \frac{\sum_{i = 1}^{N} γ_{i k} w_{i} (x_{i} - μ_{k}) {(x_{i} - μ_{k})}^{T}}{\sum_{i = 1}^{N} γ_{i k} w_{i}} + λ I

(15)

where $λ I$ is a small regularization term that prevents covariance singularity.
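For intuition, one iteration of the weighted EM update can be written out for one-dimensional data. The sketch follows Equations (12)–(15) as they are written, including the sample weight appearing both in the responsibilities and in the mean and variance updates; the 1-D restriction, the data, and the initial parameters are illustrative assumptions only.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def weighted_em_step(xs, ws, pis, mus, variances, lam=1e-6):
    """One E-step and M-step of weighted EM for a 1-D Gaussian mixture."""
    total = sum(ws)
    ws = [w / total for w in ws]                 # normalize so the weights sum to 1
    K = len(pis)

    # E-step, Eq. (12): weighted responsibilities.
    gammas = []
    for x, w in zip(xs, ws):
        dens = [pis[k] * gauss(x, mus[k], variances[k]) for k in range(K)]
        z = sum(dens)
        gammas.append([w * dk / z for dk in dens])

    # M-step, Eqs. (13)-(15).
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        new_pis.append(sum(g[k] for g in gammas))                      # Eq. (13)
        denom = sum(g[k] * w for g, w in zip(gammas, ws))
        mu = sum(g[k] * w * x for g, w, x in zip(gammas, ws, xs)) / denom   # Eq. (14)
        var = sum(g[k] * w * (x - mu) ** 2
                  for g, w, x in zip(gammas, ws, xs)) / denom + lam    # Eq. (15)
        new_mus.append(mu)
        new_vars.append(var)
    return new_pis, new_mus, new_vars

# Two well-separated clusters with equal (hypothetical) weights.
xs = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
ws = [1.0] * 6
pis, mus, variances = weighted_em_step(xs, ws, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

In the multivariate case used in the paper, the scalar variance becomes the covariance matrix $Σ_{k}$ and the regularization becomes $λ I$.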

$D_{p}$ can be directly regressed to obtain a 3D Gaussian Mixture Model (GMM) $G_{p}(x) = \sum_{k = 1}^{K} π_{k} N(x | μ_{k}, Σ_{k})$, where $x \in \mathbb{R}^{3}$. For $D_{q}$, since unit quaternions lie on the manifold $S^{3}$, traditional GMMs, which assume data in Euclidean space, cannot be applied directly. Ignoring the manifold structure of quaternions can lead to invalid sampling results. To address this, we use tangent space mapping, which locally linearizes the manifold and allows us to apply the traditional GMM while preserving the geometric properties. We first iteratively compute the mean quaternion as a base point using the logarithmic and exponential maps, as in (16). We then map all quaternions in $D_{q}$ to the tangent space as in (17), where $θ = \arccos(q_{ref} \cdot q_{i})$ and $v_{i} \in \mathbb{R}^{3}$, to obtain $D_{v} = ((v_{1}, w_{1}), (v_{2}, w_{2}), \dots, (v_{n}, w_{n}))$. We fit the 3D GMM $G_{q}(v) = \sum_{k = 1}^{K} π_{k} N(v | μ_{k}, Σ_{k})$ to the data $D_{v}$. After obtaining the samples $v_{i}$ from $G_{q}(v)$, we map them back to quaternions $q_{i}$ using (18):

q_{ref}^{(t + 1)} = {exp}_{q_{ref}^{(t)}} (\frac{\sum w_{i} {log}_{q_{ref}^{(t)}} (q_{i})}{\sum w_{i}})

(16)

v_{i} = {log}_{q_{ref}} (q_{i}) = \frac{θ}{sin θ} (q_{i} - q_{ref} cos θ)

(17)

q = {exp}_{q_{ref}} (v) = q_{ref} cos | v | + \frac{v}{| v |} sin | v |

(18)
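The three mappings in (16)–(18) can be checked numerically with a short sketch. For simplicity it keeps tangent vectors in their four ambient coordinates (orthogonal to $q_{ref}$) rather than reducing them to three coordinates in a tangent basis, which is a simplification relative to the text; the iteration count and the test quaternions are arbitrary choices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_map(q_ref, q):
    """Eq. (17): map unit quaternion q to the tangent space at q_ref."""
    c = max(-1.0, min(1.0, dot(q_ref, q)))
    theta = math.acos(c)
    if theta < 1e-12:
        return (0.0, 0.0, 0.0, 0.0)          # q coincides with the base point
    scale = theta / math.sin(theta)
    return tuple(scale * (qi - c * ri) for qi, ri in zip(q, q_ref))

def exp_map(q_ref, v):
    """Eq. (18): map a tangent vector at q_ref back to a unit quaternion."""
    norm = math.sqrt(dot(v, v))
    if norm < 1e-12:
        return tuple(q_ref)
    return tuple(math.cos(norm) * r + math.sin(norm) * vi / norm
                 for r, vi in zip(q_ref, v))

def weighted_mean(qs, ws, iters=20):
    """Eq. (16): iterate log-average-exp starting from the first sample."""
    q_ref = tuple(qs[0])
    total = sum(ws)
    for _ in range(iters):
        logs = [log_map(q_ref, q) for q in qs]
        v = tuple(sum(w * l[i] for w, l in zip(ws, logs)) / total for i in range(4))
        q_ref = exp_map(q_ref, v)
    return q_ref

def q_on_arc(t):
    """Unit quaternion (w, x, y, z) at arc length t from the identity."""
    return (math.cos(t), math.sin(t), 0.0, 0.0)

identity = (1.0, 0.0, 0.0, 0.0)
v = log_map(identity, q_on_arc(0.4))
round_trip = exp_map(identity, v)
mean = weighted_mean([q_on_arc(0.2), q_on_arc(0.6)], [1.0, 1.0])
```

Samples drawn in the tangent space and passed through `exp_map` always land on the unit sphere, which is the property that motivates the mapping in the first place.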

At this stage, we can obtain the risk cells $R_{risk}$ for task $G$ and the empirical sampling models $\{G_{p}^{k}\}_{k = 1}^{r}$ and $\{G_{q}^{k}\}_{k = 1}^{r}$. These are then associated as an experience $E = (G, R_{risk}, \{G_{p}^{k}\}_{k = 1}^{r}, \{G_{q}^{k}\}_{k = 1}^{r})$. By selecting a set of typical complex tasks for exploration as described above, we can construct an experience base $EB = (E_{1}, E_{2}, \dots)$.

3.4. Policy Reconstruction

The policy reconstruction process $RC(π_{θ}, E^{*}, t) = π^{*}(a_{t} ∣ s_{t})$ indicates that the reconstructed policy $π^{*}$ is derived by merging the empirical model $E^{*}$ with the original agent’s policy $π_{θ}$ at time $t$. This process consists of three steps (Algorithm 2).

Algorithm 2 Real-Time Policy Reconstruction

Require: Original agent’s policy $π_{θ}$, experience $E^{*} = (G^{*}, R_{risk}^{*}, G_{p}^{*}, G_{q}^{*})$, observation $s_{t}$, threshold $ϵ$
Ensure: Action $a_{t}$

1: $count \leftarrow 0$
2: $a_{θ} \leftarrow SampleFrom π_{θ}$
3: $a_{E^{*}} \leftarrow GetActionFrom E^{*} (s_{t})$
4: $w_{a_{θ}}, w_{a_{E^{*}}} \leftarrow ActionEvaluation (a_{θ}, a_{E^{*}})$
▹ Action fusion
5: $p_{θ}, q_{θ}, p_{E^{*}}, q_{E^{*}} \leftarrow ActionDisassembly (a_{θ}, a_{E^{*}})$
6: $p_{mix} \leftarrow \frac{w_{a_{θ}} p_{θ} + w_{a_{E^{*}}} p_{E^{*}}}{w_{a_{θ}} + w_{a_{E^{*}}}}$
7: $q_{mix} \leftarrow SLERP (q_{θ}, q_{E^{*}}, γ = \frac{w_{a_{E^{*}}}}{w_{a_{θ}} + w_{a_{E^{*}}}})$
8: $a_{mix} \leftarrow (p_{mix}, q_{mix})$
9: if $Rejection = TRUE$ then
10:     if $count > ϵ$ then
11:         if $w_{a_{θ}} > w_{a_{E^{*}}}$ then
12:             $a_{t} \leftarrow a_{θ}$
13:         else
14:             $a_{t} \leftarrow a_{E^{*}}$
15:         end if
16:     end if
17:     $w_{a_{mix}} \leftarrow ActionEvaluation (a_{mix})$
18:     if $IK (a_{mix})$ and $notCollision (a_{mix})$ and $w_{a_{mix}} > max (w_{a_{θ}}, w_{a_{E^{*}}})$ then
19:         $a_{t} \leftarrow a_{mix}$
20:     else
21:         Go back to line 2 and $count \leftarrow count + 1$
22:     end if
23: end if
24: return $a_{t}$

Step 1: The actions

a_{θ} = (p_{θ}, q_{θ})

and

a_{E^{*}} = (p_{E^{*}}, q_{E^{*}})

are generated by

π_{θ}

and

E^{*}

, respectively, based on the current observation (Algorithm 2, Lines 2–4). A reward function is then introduced to evaluate these actions. Specifically,

p_{E^{*}} = p_{G} - p_{t}, q_{E^{*}} = q_{G}

where

p_{G}, q_{G}

are sampled from

G_{p}^{*}

and

G_{q}^{*}

, respectively. It is important to note that the GMM used for

a_{E^{*}}

depends on which risk region the end effector is in, as determined by the current state. If the robot is not in a risk region, no policy reconstruction is needed, and

π^{*} (a_{t} | s_{t})

reduces to

π_{θ}

. The robot must virtually execute the action in the simulation for evaluation, a process that requires the path to be smoothed as much as possible. For example, after executing the action

a_{θ}

, the end-effector position is updated as

p_{t + 1} = p_{t} + λ \cdot p_{θ}

, where

λ

is the step size. The rotational maneuver is performed using Spherical Linear Interpolation (SLERP) to ensure path smoothness. The updated orientation is given by

q_{t + 1} = SLERP (q_{t}, q_{θ}, γ) = \frac{\sin ((1 - γ) α)}{\sin α} q_{t} + \frac{\sin (γ α)}{\sin α} q_{θ}

, where

α = \arccos (q_{t} \cdot q_{θ})

and

γ

is the interpolation parameter. Introducing the reward function described in Section 3.2 to evaluate the actions, we obtain

w_{a_{θ}} = reward (a_{θ}, s_{t})

and

w_{a_{E^{*}}} = reward (a_{E^{*}}, s_{t})

.
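The virtual execution in Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `slerp` and `virtual_step` are hypothetical names, quaternions are unit 4-tuples in `(w, x, y, z)` order, and the sketch applies the SLERP formula exactly as written above, without the sign flip that many libraries add to force the shorter arc.

```python
import math


def slerp(q0, q1, gamma):
    """Spherical linear interpolation from q0 (gamma=0) to q1 (gamma=1)."""
    d = max(-1.0, min(1.0, sum(a * b for a, b in zip(q0, q1))))
    alpha = math.acos(d)
    if alpha < 1e-9:  # orientations coincide; avoid dividing by sin(0)
        return tuple(q0)
    s = math.sin(alpha)
    w0 = math.sin((1.0 - gamma) * alpha) / s
    w1 = math.sin(gamma * alpha) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def virtual_step(p_t, q_t, action, step_size, gamma):
    """Apply one action (p_a, q_a) to the end-effector pose (p_t, q_t).

    Position: p_{t+1} = p_t + step_size * p_a
    Orientation: q_{t+1} = SLERP(q_t, q_a, gamma)
    """
    p_a, q_a = action
    p_next = tuple(p + step_size * dp for p, dp in zip(p_t, p_a))
    q_next = slerp(q_t, q_a, gamma)
    return p_next, q_next
```

The resulting pose would then be passed to the reward function from Section 3.2 to obtain the action weight; that reward is not reproduced here.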

Step 2: Based on the evaluation results,

a_{θ}

and

a_{E^{*}}

are weighted and fused by using (19), (20) and (21) to obtain the mixed action

a_{mix}

(Algorithm 2, Lines 5–8).

a_{mix} = (p_{mix}, q_{mix})

(19)

p_{mix} = \frac{w_{a_{θ}}}{w_{a_{θ}} + w_{a_{E^{*}}}} p_{θ} + \frac{w_{a_{E^{*}}}}{w_{a_{θ}} + w_{a_{E^{*}}}} p_{E^{*}}

(20)

q_{mix} = SLERP (q_{θ}, q_{E^{*}}, γ = \frac{w_{a_{E^{*}}}}{w_{a_{θ}} + w_{a_{E^{*}}}})

(21)
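A minimal sketch of the fusion step, assuming both weights are strictly positive so that the normalized weights lie in (0, 1). The names `slerp` and `fuse_actions` are hypothetical, quaternions are unit 4-tuples in `(w, x, y, z)` order, and the interpolation parameter follows Algorithm 2, line 7, so the experience action receives the share w_{a_{E^{*}}} / (w_{a_{θ}} + w_{a_{E^{*}}}) in both the position and the orientation. How non-positive rewards are handled is not specified in the text, so the sketch simply rejects them.

```python
import math


def slerp(q0, q1, gamma):
    """Spherical linear interpolation from q0 (gamma=0) to q1 (gamma=1)."""
    d = max(-1.0, min(1.0, sum(a * b for a, b in zip(q0, q1))))
    alpha = math.acos(d)
    if alpha < 1e-9:  # identical orientations
        return tuple(q0)
    s = math.sin(alpha)
    w0 = math.sin((1.0 - gamma) * alpha) / s
    w1 = math.sin(gamma * alpha) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def fuse_actions(a_theta, a_exp, w_theta, w_exp):
    """Fuse the policy action and the experience action by their weights."""
    if w_theta <= 0 or w_exp <= 0:
        # Assumption of this sketch only: weights must be positive.
        raise ValueError("fusion weights must be positive in this sketch")
    p_theta, q_theta = a_theta
    p_exp, q_exp = a_exp
    share_exp = w_exp / (w_theta + w_exp)  # weight given to the experience action
    p_mix = tuple((1.0 - share_exp) * a + share_exp * b
                  for a, b in zip(p_theta, p_exp))
    q_mix = slerp(q_theta, q_exp, share_exp)
    return p_mix, q_mix
```

With `w_theta = 1` and `w_exp = 3`, for example, the fused position lies three quarters of the way from the policy's displacement to the experience's displacement.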

Step 3: Action

a_{mix}

undergoes rejection sampling, and the result is fed back to Step 1 in real time (Algorithm 2, Lines 9–23). We define three constraints; violating any of them causes rejection. (1) An inverse kinematics (IK) solution exists for the end-effector state after executing

a_{mix}

. (2) The robot can execute

a_{mix}

without collision. (3)

a_{mix}

is better than both

a_{θ}

and

a_{E^{*}}

, i.e.,

reward (a_{mix}, s_{t}) > \max (reward (a_{θ}, s_{t}), reward (a_{E^{*}}, s_{t}))

. If all three conditions are satisfied, we accept

a_{mix}

; otherwise, we resample to obtain new

a_{θ}

and

a_{E^{*}}

, repeating the fusion process. If the number of rejections exceeds the threshold

ϵ

, this indicates that policy reconstruction does not improve performance in the current state, and the action with the higher reward from

a_{θ}

and

a_{E^{*}}

is returned.
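The three steps can be combined into one loop, sketched below under stated assumptions. The function `reconstruct_action` and all of its callable parameters are hypothetical stand-ins: `evaluate` plays the role of the reward-based action evaluation, `is_feasible` bundles the IK and collision checks, and `fuse` is whatever fusion rule is used in Step 2. Unlike the printed pseudocode, the sketch returns immediately once the rejection budget is exhausted.

```python
from typing import Callable, TypeVar

A = TypeVar("A")  # action type; left generic in this sketch


def reconstruct_action(
    sample_policy: Callable[[], A],
    sample_experience: Callable[[], A],
    evaluate: Callable[[A], float],
    fuse: Callable[[A, A, float, float], A],
    is_feasible: Callable[[A], bool],
    max_rejections: int,
) -> A:
    """Return a fused action if one passes rejection sampling, else a fallback."""
    if max_rejections < 0:
        raise ValueError("max_rejections must be non-negative")
    # One initial attempt plus up to max_rejections retries.
    for _ in range(max_rejections + 1):
        a_theta = sample_policy()        # Step 1: candidate from the original policy
        a_exp = sample_experience()      # Step 1: candidate from the experience
        w_theta = evaluate(a_theta)
        w_exp = evaluate(a_exp)
        a_mix = fuse(a_theta, a_exp, w_theta, w_exp)  # Step 2: action fusion
        # Step 3: accept only if feasible and strictly better than both inputs.
        if is_feasible(a_mix) and evaluate(a_mix) > max(w_theta, w_exp):
            return a_mix
    # Budget exhausted: fall back to the better of the last two candidates.
    return a_theta if w_theta > w_exp else a_exp
```

Passing the checks in as callables keeps the sketch independent of any particular simulator or IK solver; in a real system they would wrap the virtual execution described in Step 1.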

4. Experiments

In this section, we address the following three questions through both simulation and real-world experiments:

1. Effectiveness of Augment Agent: Does our policy reconstruction method confer performance advantages? We compare the performance of augment agents (derived from original agents trained with varying numbers of episodes) on complex tasks, and benchmark them against traditional path planning algorithms commonly used in fruit-picking robots.
2. Efficacy of Experience Retrieving: Given that our method relies on retrieval from an experience base, is the proposed similarity-based retrieval method effective, and to what extent do augment agents depend on the experience data? To answer this, we construct experience bases of varying sizes and examine the impact of base size on the performance of the retrieval method described in Section 3.1. We further compare the performance of the augment agent when it is given experiences of different similarity levels.
3. Sim-to-Real Transferability: Can the proposed method retain its effectiveness and advantages when transferred from simulation to real-world settings? We transfer the top-performing algorithms from our simulation experiments, along with our proposed method, to real-world tasks and compare their performance.

4.1. Effectiveness of Augment Agent

We conducted agent training and policy reconstruction experiments in a simulation environment built on the Robot Operating System (ROS) and PyBullet [23]. The training environment is built on the Gym [24] framework, and we employ a customized, lightweight SAC implementation rather than using Stable-Baselines3 directly. The robot used in the experiments is a Realman RM75, a 7-DOF robotic arm equipped with a depth sensor at the end effector. To construct the dataset, we digitally modeled 600 real-world fruit-picking task scenes, comprising 400 simple scenes and 200 complex ones. Of these, 300 simple scenes were used for training, and 100 complex scenes contributed to the experience base. The remaining 200 scenes (100 simple and 100 complex) were reserved for testing. All experiments were conducted on a system with an AMD Ryzen 7 9700X 3.80 GHz processor and an NVIDIA 4070 Ti Super GPU with 16 GB of VRAM.

During training, we saved the model every 200 episodes from 0 to 6000 and evaluated it on three metrics: success rate in simple tasks, success rate in complex tasks, and success rate in complex tasks after retrieval augment. The results, presented in Figure 4a, indicate that by 6000 training episodes, the success rate in simple tasks stabilizes near 100%, while the success rate in complex tasks reaches 56%, suggesting convergence. Under identical training conditions, the augment agent consistently outperforms the original agent in complex tasks, maintaining a success rate at least 25% higher at convergence. Notably, at around 3200 training episodes, we observed significant performance fluctuations, with success rates dropping to approximately 30% in both simple and complex tasks. Even then, the augment agent maintained a success rate exceeding 50% in complex tasks, demonstrating its robustness against training instability.

Furthermore, we selected the agent at 5600 episodes as a representative for comparative testing against RRT-Connect, RRT*, AIT* [25], and BIT* [26] from OMPL [27]. The results, presented in Figure 4b,c, demonstrate that the retrieval augment agent outperforms all other algorithms, achieving a success rate exceeding 70%. The original agent at 5600 episodes exhibits a success rate comparable to those of AIT* and BIT*, all of which surpass RRT-Connect and RRT*. In terms of path quality, DRL-trained agents significantly outperform sampling-based algorithms, achieving path lengths of approximately 0.2 m, less than half of those generated by the other algorithms. Because sampling-based algorithms rely on random sampling, they struggle to sample feasible states in highly cluttered environments within the given time limit (2 s), and the paths they do find are excessively long. In contrast, the trained agent benefits from reward-driven guidance, which effectively reduces path length. Nonetheless, AIT* and BIT* leverage their ellipsoid-inspired sampling mechanisms to maintain shorter path lengths than RRT-Connect.

4.2. Efficacy of Experience Retrieving

We extended the task datasets by constructing experience bases of varying sizes, ranging from 0 to 500 experiences. Ten test tasks were selected to extract scene features. For each test task, similarity was computed to retrieve similar experiences from experience bases of varying sizes, and the retrieved results were averaged for subsequent analysis. We classify experiences with similarity greater than 80% as strong similarity and those with similarity between 50% and 80% as middle similarity. As shown in Figure 5a, when the experience base contains 150 or more experiences, the ratio of strong similarity experiences converges to approximately 0.1, while that of middle similarity experiences converges to about 0.7. Additionally, the similarity of the most similar experience stabilizes at up to 0.96, indicating that roughly 150 experiences are sufficient for effective policy reconstruction.
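The similarity categories used in this analysis can be computed with a small helper, sketched here for illustration. The function name `similarity_profile`, the treatment of scores exactly at 0.5 or 0.8, and the "weak" label for everything below 0.5 are assumptions; the input scores in the usage note are made up and are not data from the experiments.

```python
def similarity_profile(similarities, strong=0.8, middle=0.5):
    """Summarize retrieved similarity scores (each in [0, 1]).

    strong: score > strong
    middle: middle <= score <= strong   (boundary handling is an assumption)
    weak:   score < middle
    """
    n = len(similarities)
    if n == 0:
        return {"strong": 0.0, "middle": 0.0, "weak": 0.0, "best": None}
    n_strong = sum(1 for s in similarities if s > strong)
    n_middle = sum(1 for s in similarities if middle <= s <= strong)
    n_weak = n - n_strong - n_middle
    return {
        "strong": n_strong / n,   # ratio of strongly similar experiences
        "middle": n_middle / n,   # ratio of moderately similar experiences
        "weak": n_weak / n,       # remaining experiences
        "best": max(similarities),  # similarity of the closest experience
    }
```

For instance, calling it on the made-up list `[0.96, 0.7, 0.6, 0.55, 0.75, 0.65, 0.52, 0.3, 0.2, 0.78]` yields one strong, seven middle, and two weak scores, with a best similarity of 0.96.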

Subsequently, we selected experiences with varying similarity levels to augment the agent trained for 5600 episodes and evaluated the resulting performance in complex tasks. As illustrated in Figure 5b, low-similarity experiences had little effect or even slightly reduced the success rate, whereas a significant performance improvement was observed when the similarity reached approximately 0.5, with stabilization occurring above 0.7. This demonstrates that the proposed policy reconstruction method is sensitive to experience quality, and that low-quality experiences may adversely affect decision-making.

4.3. Real World

We conducted experiments in a real fruit-picking environment that we built ourselves, selecting AIT* and BIT*, which demonstrated superior performance in simulation, for comparison with the retrieval augment agent at 5600 episodes. Figure 6 illustrates an example of a successful grasp. The results, summarized in Table 1, show that the retrieval augment agent achieved the best performance, successfully completing seven out of ten tasks with an average path length of 0.211 m. In contrast, AIT* and BIT* exhibited similar path lengths, both significantly longer than those achieved by the agent. These results highlight the limitations of sampling-based planners compared to a DRL-trained agent, which benefits from direct reward-guided optimization. In terms of planning time, the proposed agent significantly outperforms both AIT* and BIT*, even with the additional computational steps introduced by action fusion and rejection sampling. In practice, the average per-step decision time of the augmented agent is only marginally higher than that of the original agent, while the average execution time for successful trajectories generated by the original agent is approximately 9.8 s. Additionally, real-world uncertainties introduce some variability in algorithm performance, leading to a slightly lower success rate than in simulation.

5. Discussion

In this paper, we propose a robot path planning decision framework based on retrieval augment, which optimizes the policy in real time through policy reconstruction during inference. This framework offers a novel solution for robust robot path planning. We also provide an instantiation of the decision framework. For experience base construction, we introduce a high-quality data collection approach based on hierarchical collaborative exploration, and use the weighted EM algorithm and tangent space mapping to extract essential features from the empirical data. For policy reconstruction, we propose a scheme leveraging action evaluation and rejection sampling. Finally, we demonstrate the effectiveness of the method in simulation and real-world fruit-picking tasks.

Several open questions remain. First, a large-scale experience base may reduce retrieval efficiency and strain the computational and storage capacities of the robot. Developing a cloud-based brain to establish a unified experience base will be a crucial direction for future work. Such a shared experience infrastructure could facilitate knowledge reuse across different robots, crops, and farming environments, significantly reducing duplicated training effort and supporting sustainable, large-scale agricultural deployment. Second, the action fusion process is highly sensitive to the quality of experience and the accuracy of observations. Inaccurate observations may even lead to policy degradation. More robust action fusion methods need to be explored in future research. Finally, although Hu moments are adopted in this work on intuitive grounds to characterize the global distribution of obstacles in the similarity metric, comparisons with alternative feature representations have not been explored; this is an important direction for future research.

Author Contributions

Conceptualization, B.C., S.Z., Z.H. and L.G.; methodology, B.C., S.Z., Z.H. and L.G.; software, B.C., S.Z. and Z.H.; validation, B.C., S.Z. and Z.H.; formal analysis, B.C., S.Z. and Z.H.; investigation, B.C., S.Z. and Z.H.; resources, B.C. and S.Z.; data curation, B.C. and S.Z.; writing—original draft preparation, B.C. and S.Z.; writing—review and editing, B.C., S.Z., Z.H. and L.G.; visualization, B.C., S.Z. and Z.H.; supervision, L.G.; project administration, L.G.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the financial support provided by the National Natural Science Foundation of China (NSFC) under Grant No. 52175024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RRT	Rapidly Exploring Random Trees
DRL	Deep Reinforcement Learning
DOF	Degrees of Freedom
SAC	Soft Actor–Critic
NLP	Natural language processing
EM	Expectation Maximization Algorithm
GMM	Gaussian Mixture Model
ROS	Robot Operating System

Appendix A. Notation Summary

This appendix provides a detailed description of the mathematical symbols used throughout this paper.

Appendix A.1. Basic Problem Formulation Symbols

$χ$	Robot workspace
$χ_{free}, χ_{obstacle}$	Free space and obstacle space
R	Fruit-picking robot
$G$	Task goal set
f	Target fruit
k	Operation (cutting) point
$b_{1}$	Target branch
$b_{2}$	Main branch
$θ$	Joint angle vector
$T = (p, q)$	End-effector pose
p	End-effector position
q	End-effector orientation (quaternion)
s	Robot state
$s_{0}$	Initial state
$s^{*}$	Optimal cutting state

Appendix A.2. Experience Retrieval Symbols

$G_{1}, G_{2}$	Simple and complex task sets
$M_{grid}$	Grid-based environment representation
H	Hu moment feature
$SIM (\cdot)$	Scene similarity function
$w_{Hu}, w_{b_{1}}, w_{b_{2}}, w_{p}$	Similarity weights
$E^{*}$	Retrieved experience

Appendix A.3. Policy Reconstruction Symbols

$π_{θ}$	Original DRL policy
$π^{*}$	Reconstructed policy
a	Action
$a_{θ}$	Action from original policy
$a_{E^{*}}$	Action from experience
$a_{mix}$	Fused action
$ϵ$	Rejection sampling threshold

Appendix A.4. Reward and Observation Symbols

$s (t) \in R^{20}$	Observation vector
$a (t) \in R^{7}$	Action vector
$r_{distance}$	Distance reward
$r_{pos}$	Pose reward
$r_{traj}$	Trajectory smoothness reward
$r_{collision}$	Collision penalty
$r_{valid}$	Invalid action penalty
$λ_{1}, λ_{2}, λ_{3}$	Reward weights

References

1. LaValle, S.; Kuffner, J. Randomized kinodynamic planning. In Proceedings of the 1999 IEEE International Conference on Robotics and Automation (Cat. No.99CH36288C), Detroit, MI, USA, 10–15 May 1999; Volume 1, pp. 473–479.
2. Van Henten, E.; Hemming, J.; Van Tuijl, B.; Kornet, J.; Bontsema, J. Collision-free Motion Planning for a Cucumber Picking Robot. Biosyst. Eng. 2003, 86, 135–144.
3. Chiang, H.T.; Malone, N.; Lesser, K.; Oishi, M.; Tapia, L. Path-guided artificial potential fields with stochastic reachable sets for motion planning in highly dynamic environments. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 2347–2354.
4. La Valle, A.J.; Sakcak, B.; LaValle, S.M. Bang-Bang Boosting of RRTs. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 2869–2876.
5. Linard, A.; Torre, I.; Bartoli, E.; Sleat, A.; Leite, I.; Tumova, J. Real-Time RRT* with Signal Temporal Logic Preferences. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 8621–8627.
6. Hemming, J.; Bac, C.W.; Tuijl, B.; Barth, R.; Bontsema, J.; Pekkeriet, E.; Van Henten, E. A robot for harvesting sweet-pepper in greenhouses. In Proceedings of the International Conference of Agricultural Engineering—AgEng 2014, Zurich, Switzerland, 6–10 July 2014.
7. Wang, D.; Dong, Y.; Lian, J.; Gu, D. Adaptive end-effector pose control for tomato harvesting robots. J. Field Robot. 2023, 40, 535–551.
8. Malik, A.; Lischuk, Y.; Henderson, T.; Prazenica, R. A Deep Reinforcement-Learning Approach for Inverse Kinematics Solution of a High Degree of Freedom Robotic Manipulator. Robotics 2022, 11, 44.
9. Orsula, A.; Bøgh, S.; Olivares-Mendez, M.; Martinez, C. Learning to Grasp on the Moon from 3D Octree Observations with Deep Reinforcement Learning. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 4112–4119.
10. Yandun, F.; Parhar, T.; Silwal, A.; Clifford, D.; Yuan, Z.; Levine, G.; Yaroshenko, S.; Kantor, G. Reaching Pruning Locations in a Vine Using a Deep Reinforcement Learning Policy. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 2400–2406.
11. Lin, G.; Zhu, L.; Li, J.; Zou, X.; Tang, Y. Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning. Comput. Electron. Agric. 2021, 188, 106350.
12. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2019, arXiv:1812.05905.
13. Chiang, H.T.L.; Hsu, J.; Fiser, M.; Tapia, L.; Faust, A. RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies. arXiv 2019, arXiv:1907.04799.
14. Wang, X.; Zhou, J.; Xu, Y.; Liu, Z. Research on low-loss and high-efficiency picking sequence planning of safflower-filaments based on improved deep reinforcement learning. Comput. Electron. Agric. 2025, 237, 110692.
15. Li, H.; He, Z.; Wang, Y.; Ding, X.; Cui, Y. Research on the mechanized harvesting strategy for clustered kiwi fruits based on deep reinforcement learning. Comput. Electron. Agric. 2025, 237, 110686.
16. Yi, T.; Zhang, D.; Luo, L.; Wang, Y.; Liu, B. View planning for grape harvesting based on self-supervised deep reinforcement learning under occlusion. Comput. Electron. Agric. 2025, 239, 110913.
17. Liu, Y.; Gao, P.; Zheng, C.; Tian, L.; Tian, Y. A Deep Reinforcement Learning Strategy Combining Expert Experience Guidance for a Fruit-Picking Manipulator. Electronics 2022, 11, 311.
18. Li, Y.; Feng, Q.; Zhang, Y.; Peng, C.; Ma, Y.; Liu, C.; Ru, M.; Sun, J.; Zhao, C. Peduncle collision-free grasping based on deep reinforcement learning for tomato harvesting robot. Comput. Electron. Agric. 2024, 216, 108488.
19. Xie, C.W.; Sun, S.; Xiong, X.; Zheng, Y.; Zhao, D.; Zhou, J. RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19265–19274.
20. Humphreys, P.C.; Guez, A.; Tieleman, O.; Sifre, L.; Weber, T.; Lillicrap, T. Large-Scale Retrieval for Reinforcement Learning. arXiv 2022, arXiv:2206.05314.
21. Chen, B.; Gong, L.; Yu, C.; Du, X.; Chen, J.; Xie, S.; Le, X.; Li, Y.; Liu, C. Workspace decomposition based path planning for fruit-picking robot in complex greenhouse environment. Comput. Electron. Agric. 2023, 215, 108353.
22. Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271.
23. Coumans, E.; Bai, Y. PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning, 2016–2021. Available online: http://pybullet.org (accessed on 1 September 2022).
24. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540.
25. Strub, M.P.; Gammell, J.D. Adaptively Informed Trees (AIT*) and Effort Informed Trees (EIT*): Asymmetric bidirectional sampling-based path planning. Int. J. Robot. Res. 2021, 41, 390–417.
26. Gammell, J.D.; Srinivasa, S.S.; Barfoot, T.D. Batch Informed Trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 3067–3074.
27. Sucan, I.A.; Moll, M.; Kavraki, L.E. The Open Motion Planning Library. IEEE Robot. Autom. Mag. 2012, 19, 72–82.

Figure 1. Augment agent’s performance in real-world tasks. The augment agent controls the robot to successfully reach the target operating point with optimal posture (Section 4.3).

Figure 2. A visual representation of the problem we want to address in this paper. The end effector of the robot has to reach the operation point k in an optimal cutting attitude (red arrow) while maximizing the path quality, i.e., decreasing the value of the path cost function (Section 2). The distinction between simple and complex tasks is also visualized in the figure.

Figure 3. Overview of the proposed decision framework, as described in Section 3.1. The original agent is pre-trained using DRL algorithms on simple tasks with a limited number of training episodes (Section 3.2). Typical complex task scenes are selected for hierarchical collaborative exploration to collect empirical data, which is integrated into the original dataset to form the experience base (Section 3.3). When faced with a complex task, experience is extracted by retrieving a similar task. The experience and the original agent policy are then combined during policy reconstruction, yielding an augment agent capable of completing complex tasks (Section 3.4).

Figure 4. The results in simulation. Augment agent performs much better than original agent on complex tasks, while original agent that excels in simple tasks performs poorly on complex tasks (a). In the comparison experiments, (b) shows that the DRL-trained agents (both the original and the augment agents) achieve substantially higher path quality than sampling-based algorithms. (c) indicates that the augment agent outperforms the other algorithms in terms of success rate, exceeding 70%.

Figure 5. Results on experience base construction. (a) The horizontal axis denotes the number of experiences, while the vertical axis represents the proportion of different experience categories (shown as a bar chart, where gray indicates weakly similar experiences, light green denotes moderately similar experiences, and dark green represents highly similar experiences) as well as the experience similarity (shown as a line plot). The experimental results demonstrate that the quality of the retrieved experiences stabilizes once the experience base reaches a critical size. (b) The horizontal axis represents the experience similarity, and the vertical axis denotes the task success rate. The results indicate that the performance of the augment agent is sensitive to the quality of the experiences used during policy reconstruction; low-quality (i.e., low-similarity) experiences can degrade overall performance.

Figure 6. This figure presents a successful example of cherry tomato harvesting in a real-world scenario. It shows 16 key frames of the grasping process, where each frame includes an image captured by a third-person camera (left) and an image acquired by a camera mounted on the robot end effector (right).

Table 1. The augment agent achieved the highest performance with seven out of ten successes and considerably shorter path lengths than the other algorithms. Owing to real-world uncertainties, the test results were slightly lower than those observed in simulation.

	Trials	Successful Trials	Success Rate	Average Path Length (m)	Average Path Time (s)
BIT*	10	4	0.4	0.524	12.6
AIT*	10	3	0.3	0.533	11.3
Augment agent	10	7	0.7	0.211	10.5
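As a worked check on Table 1, the relative improvements of the agent over each baseline can be computed directly from the tabulated values. The helper names below are hypothetical, and pairing the abstract's +133.3% (success rate) and +60.4% (path quality) with the AIT* row is an inference from the arithmetic, not something the text states.

```python
# Values copied from Table 1 (success rate, average path length in metres).
TABLE_1 = {
    "BIT*": {"success_rate": 0.4, "path_length_m": 0.524},
    "AIT*": {"success_rate": 0.3, "path_length_m": 0.533},
    "agent": {"success_rate": 0.7, "path_length_m": 0.211},
}


def relative_gain(new, base):
    """Percentage increase of `new` over `base` (higher is better)."""
    return (new - base) / base * 100.0


def relative_reduction(new, base):
    """Percentage decrease of `new` relative to `base` (lower is better)."""
    return (base - new) / base * 100.0


def compare(baseline):
    """Return (success-rate gain %, path-length reduction %) of the agent."""
    agent, ref = TABLE_1["agent"], TABLE_1[baseline]
    gain = relative_gain(agent["success_rate"], ref["success_rate"])
    cut = relative_reduction(agent["path_length_m"], ref["path_length_m"])
    return round(gain, 1), round(cut, 1)
```

Against AIT*, this gives a success-rate gain of 133.3% and a path-length reduction of 60.4%; against BIT*, the corresponding figures are 75.0% and 59.7%.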

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

