Article

Push-or-Avoid: Deep Reinforcement Learning of Obstacle-Aware Harvesting for Orchard Robots

1 College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling 712100, China
2 Intelligent Equipment Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
* Author to whom correspondence should be addressed.
Agriculture 2026, 16(6), 670; https://doi.org/10.3390/agriculture16060670
Submission received: 29 January 2026 / Revised: 4 March 2026 / Accepted: 9 March 2026 / Published: 16 March 2026
(This article belongs to the Section Agricultural Technology)

Abstract

In structured orchard environments, harvesting robots operate where rigid bodies (e.g., trunks, poles, and wires) coexist with flexible foliage, and strict avoidance of all obstacles significantly compromises operational efficiency. To address this, this study proposes an end-to-end autonomous harvesting framework built around an “avoid-rigid, push-through-soft” strategy. The framework explicitly propagates uncertainties from sensor data and reconstruction processes into the planning and policy phases. First, a multi-task perception network acquires 2D semantic masks of fruits and branches. Class probabilities and instance IDs are back-projected onto a 3D Gaussian Splatting (3DGS) representation to construct a decision-oriented, semantically enhanced 3D scene model. The policy network accepts multi-channel 3DGS rendered observations and proprioceptive states as inputs and outputs a continuous preference vector over eight predefined motion primitives, unifying path planning and action decision-making within a single closed loop. Additionally, a dynamic action shielding module performs look-ahead collision risk assessment on candidate discrete actions; by masking actions that may collide with rigid obstacles, it suppresses high-risk behaviors during both training and execution, thereby enhancing the robustness and reliability of robotic manipulation. The proposed method was validated in both simulation and real-world scenarios. In complex orchard scenarios, the proposed AE-TD3 algorithm achieves a harvesting success rate of 77.1%, outperforming existing RRT (53.3%), DQN (60.9%), and TD3 (63.8%) methods, while also demonstrating superior safety and real-time performance, with a collision rate reduced to 16.2% and an average operation time of only 12.4 s. These results indicate that the framework effectively supports efficient harvesting operations while ensuring safety.

1. Introduction

Apples are among the most widely cultivated fruits globally, and their efficient harvesting directly impacts fruit quality and production yield [1]. However, with the accelerating aging of the population and the continuous rise in labor costs, traditional manual harvesting increasingly fails to meet modern demands for scalability, consistency, and operational duration. Consequently, developing apple-harvesting robots capable of autonomously recognizing, locating, and picking fruits in complex orchard environments has become a critical research direction in smart agriculture [2].
Distinct from cluster-growing crops such as tomatoes and grapes, apples naturally grow in environments characterized by a coexistence of rigid and flexible components [3]. Rigid trunks, flexible branches, and fruits constitute a highly unstructured workspace where significant branch deformation occurs, and fruits are distributed randomly, often with mutual occlusions. This complex spatial configuration makes the robotic arm highly susceptible to collisions during harvesting. Such collisions can not only damage the robot’s end-effector or sensors but also cause fruit bruising or branch breakage, thereby affecting yield and future growth. Therefore, in real-world orchards with complex structures and significant spatial constraints, planning stable, safe, and collision-free harvesting paths under conditions of limited perception and overlapping interference remains a core challenge restricting the automation level of apple harvesting [4,5].
To address these challenges, researchers have extensively explored traditional planning algorithms based on sampling or optimization, as well as emerging Deep Reinforcement Learning (DRL) methods [6]. Specifically, the objective of collision-free path planning is to generate a sequence of feasible waypoints based on the systematic perception of the orchard scene, enabling the robot to smoothly approach the target fruit without colliding with branches. Extensive research has been conducted in this area, utilizing methods such as swarm optimization algorithms [7], Artificial Potential Fields (APF) [8], graph search algorithms [9], and sampling-based algorithms (e.g., PRM, RRT) [10,11]. For instance, Wang et al. [12] combined the Adaptive Bidirectional A* (ABA*) algorithm with a Transformer model to handle dynamic obstacle avoidance by predicting potential collisions and adjusting paths accordingly, thereby reducing computational load. Ye et al. [13] proposed an improved Bi-RRT to plan collision-free paths for a lychee-harvesting robot, achieving a 100% success rate with an average path determination time of 4.24 s. Luo et al. [14] developed a path planning strategy for a grape-harvesting robot based on minimum energy and APF, utilizing APF combined with sampling search to generate collision-free waypoints. This method achieved a success rate of up to 90%, with an average planning time ranging from 98 ms to 273 ms. Zhang et al. [15] proposed a Heuristic Dynamic Rapidly exploring Random Tree Connect (HDRRT-Connect) algorithm for rapid path planning in mango harvesting, reporting an average path cost of 95.7739, an average planning time of 0.448 s, and a success rate of 90%. Wang et al. [16] introduced an adaptive end-effector pose control method based on a Genetic Algorithm (GA) to seek reachable poses for target tomatoes, achieving an 88% harvesting success rate with a cycle time of 20 s.
Despite the effectiveness of the aforementioned studies in obstacle avoidance tasks, significant limitations persist. For example, APF suffers from local minima issues [17]; ant colony and genetic algorithms are computationally intensive and time-consuming [18]; while A* and Dijkstra algorithms require substantial memory to store environmental information [19]. Furthermore, sampling-based algorithms rely on random sampling [20], meaning generated paths are not always optimal, and there is a risk of failing to find a solution in certain scenarios. Crucially, these algorithms often lack semantic understanding of the environment and require precise descriptions of all obstacles within the workspace. However, accurately characterizing all obstacles in unstructured orchard environments is difficult and consumes excessive computational resources. Consequently, the applicability of these methods in complex operational scenarios remains limited. Recently, with advancements in Deep Reinforcement Learning (DRL), an increasing number of studies have introduced DRL into robotic harvesting scenarios [21]. DRL is a self-optimizing algorithm that improves decision accuracy through continuous trial-and-error feedback, demonstrating a strong ability to fit optimal decision-making requirements that are difficult to explicitly describe. Luo et al. [22] employed the Deep Deterministic Policy Gradient (DDPG) algorithm to enhance the detection of grape stems under complex occlusion. By learning optimal viewpoint strategies directly through interaction between the robotic arm and the environment, the method effectively handles feature extraction and environmental modeling without manually designing information acquisition metrics. Li et al. [23] developed an improved HER-SAC algorithm to guide the end-effector for collision-free tomato grasping. 
Field trials in a tomato greenhouse showed a harvesting success rate of 85.5%, representing improvements of 57.3% and 43.0% over traditional fixed-horizontal and parallel-to-stem approaches, respectively. The average operation time from recognition to successful harvest was 11.42 s. Tao et al. [24] utilized a trained SS-DQN network for view planning to address occlusion issues in grape harvesting. By continuously outputting actions to guide camera viewpoint changes, the method determined the optimal perspective for observing stems, achieving an average success rate of 71.58% and an average view planning time of 27.86 s. Traditional RL methods, such as Q-Learning and Deep Q-Networks (DQN), typically focus on discrete action spaces [25]. However, fruit harvesting tasks benefit from continuous actions, which enhance operational efficiency.
Alongside path planning, in scene understanding and decision-making for robotic operations, high-quality 3D reconstruction serves as a critical intermediary linking perception to planning and control. It provides a unified geometric and appearance representation for object recognition, pose estimation, grasp planning, and safety obstacle avoidance. Traditional 3D reconstruction methods mostly follow a unified multi-view geometry pipeline: first, Structure from Motion (SfM) is employed to estimate camera poses and sparse point clouds; subsequently, Multi-View Stereo (MVS) is used to acquire dense depth maps; finally, continuous surfaces are generated via Poisson reconstruction or TSDF fusion, followed by texture mapping. While these geometry-driven methods are mature for modeling static scenes and regular objects, they are prone to holes, artifacts, and reduced accuracy in scenarios with weak textures, strong reflections, or large-scale complexity. Furthermore, geometry and appearance are often modeled separately, making it difficult to represent lighting and materials in a unified manner [26,27]. Neural Radiance Fields (NeRF) introduce an implicit neural representation paradigm for 3D reconstruction. NeRF utilizes a Multi-Layer Perceptron (MLP) to jointly model volume density and view-dependent radiance in a continuous 5D space. By fitting multi-view observations directly in the image domain through differentiable volume rendering, NeRF achieves high-quality novel view synthesis and 3D reconstruction [28]. Compared to traditional SfM or MVS-based surface reconstruction, NeRF unifies the encoding of geometry and appearance, offering stronger representation capabilities for complex lighting and detailed textures. However, it relies on extensive ray sampling and integration, resulting in significant computational overhead for both training and inference.
To mitigate the high computational costs associated with NeRF-based reconstruction, Instant Neural Graphics Primitives (Instant-NGP) proposed multi-resolution hash encoding. This approach stores spatial features in a multi-scale hash table and retains only a lightweight MLP for querying, drastically reducing the computational load of forward and backward propagation. Consequently, Instant-NGP can fit a single-scene neural radiance field within seconds while maintaining reconstruction quality close to the original NeRF, enabling rapid modeling and interaction in high-resolution scenes [29]. In the context of further balancing high quality with real-time performance, 3D Gaussian Splatting (3DGS), proposed by Kerbl et al. [30], offers an explicit and continuous 3D representation for multi-view scene reconstruction. 3DGS represents the scene as a set of anisotropic 3D Gaussian primitives (ellipsoids), each parameterized by a mean μ (spatial position), covariance Σ (shape/scale/orientation), color c, and opacity α. Through efficient Gaussian rasterization, it achieves real-time visualization and optimization, yielding a continuous, smooth surface approximation that significantly alleviates the sparsity and fragmentation issues inherent in point clouds. Simultaneously, the explicit geometric envelopes and projectability of Gaussian primitives facilitate not only rapid rendering and multi-view fusion but also the construction of computable collision boundaries and interpretable object entities. Therefore, 3DGS holds potential advantages in robotic tasks such as precise target localization and collision detection.
However, these research advances have not yet been effectively translated into practical orchard harvesting systems. In practical orchard operations, the standard workflow typically involves capturing scene data via cameras or other sensors, followed by using deep learning models to detect and localize fruits, branches, leaves, and other obstacles in 3D [31,32,33]. An environment map is then constructed and fed into a planner to compute collision-free, smooth approach trajectories for the robotic arm. Despite their utility, existing methods predominantly employ a modular pipeline: visual perception and reconstruction first, followed by path planning based on a static map, and finally trajectory execution. This sequential, modular approach faces multiple challenges in structurally complex orchards. First, although 3D scene representation methods have advanced considerably in recent years, existing orchard harvesting systems still commonly rely on sparse, discrete, or static environment representations, making it difficult to accurately characterize the geometry of flexible foliage and thereby leading to inaccurate collision boundary estimation [34]. Second, existing methods mostly train agents using omniscient views, directly providing the positions of target fruits and obstacles. This ignores the local field-of-view limitations and sensor noise accumulation inherent in real-world perception systems. This over-reliance on omniscient information prevents the effective propagation of geometric errors and occlusion uncertainties from the perception phase to the planning phase, leaving the planner without an accurate assessment of actual risks. Furthermore, existing methods typically treat all obstacles as objects to be avoided [35], failing to exploit the deformability of flexible foliage, which results in excessively conservative obstacle avoidance strategies.
The root cause of these issues lies in the lack of a continuous, semantically rich scene representation and the organic integration of processing stages. Therefore, it is essential to construct an end-to-end framework integrating advanced scene modeling techniques to achieve synergistic decision-making from perception to execution via continuous geometric representation and joint optimization.
In this context, this paper proposes an end-to-end harvesting framework where perception and planning are tightly coupled. Based on efficient 3D scene reconstruction, the framework utilizes Gaussian primitives to perform semantic-level continuous representation of trunks, branches, and fruits, transforming the unstructured environment into a continuous, semantically enhanced state space. Leveraging this representation, the policy network can accurately perceive the boundaries of both rigid and flexible obstacles. It guides the robotic arm to generate safe, smooth, and efficient collision-free trajectories in real-time while pushing through soft obstacles, enabling the robot to actively interact with the environment to harvest occluded fruits under reduced collision risk. The main contributions of this paper are as follows:
(1)
An “avoid-rigid, push-through-soft” end-to-end autonomous harvesting framework is proposed. This framework explicitly models the controllable contact with flexible branches as a risk cost within the reward function and propagates uncertainties from the reconstruction process to the planning phase, enabling the policy to choose to push aside flexible obstacles for harvesting under safety constraints.
(2)
An AE-TD3 algorithm integrating expert priors is proposed. By encoding expert experience into executable constraints and guiding signals, the algorithm significantly narrows the policy search space and improves sample efficiency. Additionally, a dynamic Action Mask module is designed to proactively shield against dangerous actions that may collide with rigid trunks, working in conjunction with risk-sensitive rewards to reduce ineffective exploration and collisions during training.
(3)
An online 3DGS reconstruction module is constructed. Combining detection results from a multi-task network, it continuously represents fruits and flexible branches using anisotropic Gaussian primitives. This forms a continuous and stable scene representation, enhancing the reliability of collision detection and feasibility judgment for the agent.

2. Materials and Methods

2.1. Overall Framework Design

The autonomous obstacle avoidance harvesting framework proposed in this study employs an end-to-end perception-reconstruction-decision architecture, as illustrated in Figure 1. The overall workflow comprises four core modules. First, multi-modal observation data of the orchard scene are acquired via an RGB-D camera. Second, a multi-task perception network (MT-WavYOLO) processes RGB images in real-time to generate 2D semantic masks for fruits and flexible branches. Subsequently, a semantic-mask-driven 3D Gaussian Splatting module constructs a continuous 3D scene representation encapsulating both geometric and semantic information. Finally, the AE-TD3 policy network takes multi-channel 3DGS rendered observations and proprioceptive states as inputs. Combined with a dynamic action shielding mechanism, it outputs safe harvesting trajectories in an end-to-end manner. By explicitly propagating uncertainties from the perception process to the decision-making layer and incorporating expert priors with risk-aware constraints, this architecture achieves robust harvesting operations in rigid-flexible coupled environments.

2.2. Acquisition of Apple Fruit and Obstacle Information

Accurate acquisition of information regarding target fruits and both rigid and flexible obstacles is a prerequisite for motion planning and safe grasping in harvesting robots. In our previous work [36], a multi-task network, MT-WavYOLO, was proposed. This network is capable of simultaneously estimating the complete size of occluded fruits and performing accurate segmentation. In this study, while maintaining robust recognition capabilities for occluded fruits, we further utilize the network’s segmentation branch to obtain fine-grained pixel-level masks of branches, thereby providing higher-quality semantic input for subsequent modules. The image processing workflow is illustrated in Figure 2.

2.3. 3D Scene Modeling and Representation

To facilitate online 3D modeling during the robotic observation phase, a 2D semantic-driven incremental 3DGS construction module was designed. The system first utilizes the multi-task network (MT-WavYOLO) to perform real-time inference on the current RGB frame, generating 2D semantic masks containing target fruits and flexible branches. Based on this, the module adopts a strategy of frame-level direct mapping and incremental updating to construct the global scene. Since the sensor is mounted in an eye-in-hand configuration, the camera pose for each frame is obtained from forward kinematics combined with hand–eye calibration.
Focusing exclusively on valid pixel regions covered by masks (Region of Interest), the system utilizes aligned depth information to back-project these regions into the camera coordinate system. Initial Gaussian primitives are generated at the projected positions, and 2D semantic labels are directly mapped as semantic vectors s to these primitives, endowing them with distinct physical attributes (e.g., branches or fruits) from the initial moment. To circumvent the high computational overhead of iterative optimization, the covariance matrix of the Gaussians (describing shape and scale) undergoes One-shot Adaptive Estimation based on local depth gradients and a sensor noise model. Mathematically, for a pixel with measured depth d and semantic probability vector P_sem, the initialization employs a deterministic mapping to encode uncertainties. To propagate classification uncertainty, the semantic vector s_i is directly assigned the soft probability distribution from the 2D network: s_i = P_sem. Further mathematical details are provided in Appendix A. To model geometric uncertainty, the covariance scale along the optical axis (σ_z) follows a quadratic sensor noise model:
\sigma_z = \eta \cdot d^2 + \beta
where η and β are sensor-specific noise coefficients. This formulation ensures that distant or noisy observations naturally result in spatially larger (more uncertain) primitives. This allows for the rapid acquisition of anisotropic geometric envelopes that conform to object surfaces. In real-world experiments, we use a Lanxing/MRDVS M4 Pro ToF RGB-D camera (Hangzhou Lanxin Robot Technology Co., Ltd., Hangzhou, China). With depth d measured in meters, we set η = 4.48 × 10⁻⁴ m⁻¹ and β = 6.65 × 10⁻³ m in Equation (1). These values are obtained by least-squares fitting of Equation (1) to the manufacturer-reported depth accuracy curve (±4 mm + 0.25% × depth) over the specified working range of the sensor, and are used as a conservative uncertainty bound.
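As an illustrative sketch of the initialization step above (function and parameter names are our own, not taken from the paper's code), the one-shot axial covariance estimate and the pixel back-projection can be written as:

```python
# Fitted noise coefficients for the M4 Pro sensor, as given in Equation (1).
ETA = 4.48e-4   # eta, in m^-1 (quadratic term)
BETA = 6.65e-3  # beta, in m (constant noise floor)

def axial_sigma(depth_m):
    """Covariance scale along the optical axis: sigma_z = eta * d^2 + beta."""
    return ETA * depth_m ** 2 + BETA

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a masked pixel (u, v) with aligned depth d into camera
    coordinates, assuming a standard pinhole model with intrinsics
    (fx, fy, cx, cy)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

Farther pixels receive a larger σ_z, so their Gaussian primitives spread out more along the viewing ray, matching the conservative-uncertainty behavior described above.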
Subsequently, for consecutive observation frames, the system performs weighted fusion of newly generated candidate primitives with the existing Gaussian set in the global map, based on visibility and depth re-projection error under the current view. Semantic consistency constraints are introduced to maintain local consistency in appearance and geometry for Gaussians of the same class.
Through these mechanisms, the module can output a high-fidelity Gaussian set in real-time that simultaneously encodes fruit shape, obstacle geometry, and semantic attributes. Compared to using point clouds or voxels directly as DRL observations, the semantically enhanced 3DGS offers a continuous, differentiable spatial structure with lower interface overhead. It provides a more stable environmental representation for the robotic arm to understand fruit positions, assess obstacle contactability, and learn harvesting strategies that differentiate between rigid and flexible contacts. The reconstruction of real-world and virtual orchard scenes captured by the camera is shown in Figure 3.

2.4. Deep Reinforcement Learning-Based Obstacle Avoidance Path Planning

To achieve efficient and safe autonomous harvesting in unstructured orchard environments characterized by rigid-flexible coupling, this paper proposes an end-to-end decision-making framework where perception and planning are deeply coupled. The harvesting task is modeled as a Markov Decision Process (MDP). The objective is to learn an optimal policy that balances safety constraints (avoiding rigid bodies) with operational efficiency (interacting with soft bodies) through continuous interaction between the Deep Reinforcement Learning (DRL) agent and the environment. The overall workflow forms a complete closed loop from semantic perception to safe decision-making, as illustrated in Figure 1.
First, regarding perception and state construction, the system input derives from the aforementioned semantic 3DGS online reconstruction module. We directly map the extracted anisotropic Gaussian primitives and their semantic labels into a compact, vectorized state space. This not only provides the agent with a continuous, explicit geometric representation but also clarifies the physical attribute differences of various environmental elements via semantic information, laying the foundation for subsequent differentiated interactions. Second, regarding policy optimization and decision-making, given that robotic harvesting is a high-dimensional continuous control problem, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm [37], based on the Actor-Critic architecture, is employed as the core decision-making algorithm (as shown in Figure 4).
Twin Delayed Deep Deterministic Policy Gradient (TD3) is an improved DRL method based on DDPG [38]. It comprises Actor and Critic networks but introduces mechanisms such as twin critics, target policy smoothing, and delayed policy updates to the target value construction and update processes, thereby enhancing estimation unbiasedness and training stability.
TD3 employs Twin Critics, introducing two independent Critic networks. To suppress overestimation bias, a “minimum value” conservative estimation is used when calculating the temporal difference target for the next moment:
y_i = r_i + \gamma \min_{j \in \{1,2\}} Q_j(s_{i+1}, \tilde{a}_{i+1})
where ã_{i+1} is the smoothed action generated by the target policy for the next state.
To improve the smoothness of the target estimation and reduce the overfitting of the value function to peak actions, TD3 adds Target Policy Smoothing (truncated Gaussian noise) to the output of the target policy when calculating y_i:
\tilde{a}_{i+1} = \mu'(s_{i+1} \mid \theta'^{\mu}) + \varepsilon, \quad \varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}^2), -c, c)
The Critic networks are updated by minimizing the mean squared error in mini-batches sampled from the experience replay buffer:
L(\theta^{Q}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2
The Actor network update is based on the deterministic policy gradient:
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q(s, a \mid \theta^{Q_1}) \big|_{a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s_i \mid \theta^{\mu})
TD3 also employs Delayed Policy Updates, meaning the Actor network and target networks are updated less frequently than the Critic networks. This low-frequency update strategy prevents the Actor from converging too rapidly based on unstable value function estimates, thereby improving overall training stability.
\theta'^{Q} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta'^{Q}, \quad \theta'^{\mu} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta'^{\mu}
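The three TD3 mechanisms above (clipped double-Q targets, target policy smoothing, and Polyak soft updates) can be sketched scalar-wise as follows; this is a minimal illustration with our own names, not the authors' implementation:

```python
import random

def smoothing_noise(sigma, c):
    """Target policy smoothing: Gaussian noise clipped to [-c, c]."""
    return max(-c, min(c, random.gauss(0.0, sigma)))

def td3_target(r, gamma, q1_next, q2_next):
    """Clipped double-Q target: y = r + gamma * min(Q1, Q2),
    taking the minimum of the twin critics to suppress overestimation."""
    return r + gamma * min(q1_next, q2_next)

def polyak(target_params, online_params, tau):
    """Delayed soft update: theta' <- tau * theta + (1 - tau) * theta',
    applied element-wise to the (flattened) parameter lists."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```

In the full algorithm, `polyak` and the actor update run less frequently than the critic updates, which is the "delayed policy update" described above.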
Specifically, to effectively process the high-dimensional state space input derived from the semantic 3DGS representation and the robot’s proprioception, both the Actor and Critic networks are instantiated as Multi-Layer Perceptrons (MLPs). Each network architecture comprises two hidden layers with 512 neurons per layer. ReLU activation functions are uniformly applied across the hidden layers to ensure robust non-linear feature extraction and representation capabilities. Additionally, the Actor network employs a tanh activation function at its output layer to strictly bound the continuous preference scores within a normalized range, whereas the Critic network utilizes a linear output for unbounded Q-value estimation.
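The stated architecture (two 512-unit ReLU hidden layers, tanh actor head, linear critic head) can be sketched as below; the NumPy forward pass and random initialization are our own illustrative choices, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, out_dim, hidden=512):
    """Two hidden layers of 512 units, as specified for both Actor and Critic."""
    dims = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(0.0, 0.05, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(x, params, head):
    """ReLU on hidden layers; tanh head for the Actor (bounded preference
    scores), linear head for the Critic (unbounded Q-value)."""
    h = np.asarray(x, float)
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    out = h @ W + b
    return np.tanh(out) if head == "tanh" else out

actor = init_mlp(425, 8)         # state in R^425 -> 8 preference scores
critic = init_mlp(425 + 8, 1)    # (state, action) -> scalar Q-value
scores = forward(rng.normal(size=425), actor, "tanh")
```

The tanh head guarantees every preference score lies in (−1, 1), which is the normalized range referred to above.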
To accelerate policy convergence and provide robust initialization, this study integrates expert demonstrations into the TD3 architecture for behavior cloning (BC) pre-training. Inspired by manual harvesting behaviors, we propose a heuristic trajectory generation method based on an Anthropomorphic “Push-and-Grasp” Strategy. When facing fruits occluded by branches and leaves, manual harvesting typically follows a hierarchical decision logic: first, identifying gaps and actively pushing aside obstacles to construct an operational channel; subsequently, reaching in and grasping the fruit from a safe angle. We abstract this human cognition and manipulation skill into a set of spatial waypoints with explicit geometric constraints to guide the robotic arm in generating human-like operational trajectories.
Specifically, for the “Active Pushing” phase, the system first extracts the 3D topological skeletons of the two flexible branches closest to the target fruit in the simulation environment and connects their key points to form a baseline vector representing the main orientation of the obstacle. To simulate the human intuition of “applying force vertically to maximize the gap,” a normal vector to this baseline is constructed within the XOZ plane. The interaction guide point is defined above this normal at a distance slightly larger than the gripper radius. This geometric constraint forces the robotic arm to cut in and push away branches at an optimal mechanical angle, reproducing the human behavior of “creating space.” For the “Safe Grasping” phase, to avoid collision risks caused by blind approaches, we mimic the visual confirmation and pre-aiming behavior of humans by designing a Pre-grasping Waypoint. By introducing explicit spatial bias constraints (set to z − 0.45, y + 0.4 relative to the fruit coordinate frame), the end-effector is forced to approach the target from a predefined collision-free zone. Finally, an inverse kinematics solver generates a complete action sequence that smoothly connects these guide points, providing high-value supervisory signals for the policy network to learn dexterous, human-like skills.
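The "apply force perpendicular to the branch baseline" geometry of the active-pushing phase can be sketched as follows (a simplified illustration with hypothetical names; the paper's full pipeline additionally involves skeleton extraction and an inverse kinematics solver):

```python
import numpy as np

def push_guide_point(k1, k2, gripper_radius, margin=0.01):
    """Interaction guide point for the active-pushing phase.

    k1, k2: key points of the two flexible-branch skeletons, whose connection
    forms the baseline vector. The normal to the baseline is taken within the
    XOZ plane, and the guide point lies along it at a distance slightly larger
    than the gripper radius, as described in the text."""
    k1, k2 = np.asarray(k1, float), np.asarray(k2, float)
    b = k2 - k1                           # baseline vector of the obstacle
    n = np.array([b[2], 0.0, -b[0]])      # perpendicular to b, within XOZ
    n /= np.linalg.norm(n)
    mid = 0.5 * (k1 + k2)
    return mid + (gripper_radius + margin) * n
```

Because n has no y-component and n·b = b_z·b_x − b_x·b_z = 0, the push direction is always perpendicular to the baseline, reproducing the "maximize the gap" intuition.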
Based on this method, an expert dataset containing 600 trajectories (approximately 150,000 steps) was collected. We first acquire reasonable action generation capabilities through offline BC pre-training. Subsequently, during the online reinforcement learning process, the BC regularization term is used as an auxiliary supervisory signal and introduced into the optimization objective, forming a joint loss:
\min_{\theta} \; L_{RL}(\theta) + \lambda_t L_{BC}(\theta), \qquad L_{BC}(\theta) = \mathbb{E}_{(o,a) \sim D} \left\| \pi_{\theta}(o) - a \right\|_2^2
where L_RL is the standard reinforcement learning optimization objective, and λ_t is a weighting coefficient dynamically adjusted during training. To maintain sufficient expert constraints in the early stages while gradually releasing exploration freedom in the later stages, an exponential decay form is adopted:
\lambda_t = \lambda_0 \exp(-kt)
This allows the agent to achieve a better balance between exploration and exploitation. Furthermore, a ReduceLROnPlateau adaptive learning rate scheduler is introduced to automatically lower the learning rate when performance metrics fail to improve over an extended period, effectively mitigating oscillation and stagnation issues during training.
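The decaying BC weight and joint loss above reduce to a few lines; the default λ₀ and k values here are placeholders for illustration, not the paper's settings:

```python
import math

def bc_weight(t, lam0=1.0, k=1e-4):
    """Exponentially decaying BC coefficient: lambda_t = lambda_0 * exp(-k*t)."""
    return lam0 * math.exp(-k * t)

def joint_loss(rl_loss, bc_loss, t):
    """Joint objective L_RL + lambda_t * L_BC: expert supervision dominates
    early in training and fades as the step count t grows."""
    return rl_loss + bc_weight(t) * bc_loss
```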

2.5. Dynamic Safety Constraint Mechanism

The legitimacy of an action in a harvesting environment depends not only on the current state but also on its feasibility regarding future trajectories. A dynamic shielding method was designed (as shown in Figure 5): at each time step, look-ahead prediction is performed for all candidate discrete actions to simulate their execution results. During this process, the system utilizes tensor-based parallel computation to rapidly evaluate the geometric intersections between the robot’s bounding capsules and the local Gaussian primitives, thereby detecting potential collisions with rigid obstacles. If an action is predicted to be dangerous, it is marked as unselectable, thereby generating a corresponding Action Mask. This mask is returned as a Boolean vector, where feasible actions are marked as 1 and infeasible ones as 0. Unlike traditional static rule-based shielding, this predictive shielding adapts to complex 3D geometric relationships and varying task phases, ensuring the dynamic consistency of constraints.
Building on this, we integrate the shielding mechanism directly into the hybrid control loop to balance safety and exploration. Instead of relying on heuristic substitution rules, the generated Boolean mask is applied element-wise to the Actor’s continuous preference vector a_t (as defined in Equation (11)). Specifically, the scores of primitives marked as unsafe are set to −∞, forcing the subsequent argmax operation to automatically select the highest-scoring safe primitive. This ensures that the physically executed action is strictly collision-free (Hard Constraint). Simultaneously, to guide the policy update, if the primitive with the highest raw preference score is masked, a mild negative reward is applied. This feedback mechanism discourages the agent from preferring high-risk actions, promoting effective safety-aware learning.
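The masked argmax over the preference scores can be sketched in a few lines of pure Python (an illustrative helper, not the authors' code):

```python
def masked_argmax(scores, mask):
    """Select the highest-scoring *safe* motion primitive.

    scores: continuous preference vector a_t (one score per primitive).
    mask:   Boolean action mask (1 = safe, 0 = predicted rigid collision).
    Unsafe entries are forced to -inf so argmax can never select them."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    return max(range(len(masked)), key=masked.__getitem__)
```

If the raw-best primitive is masked (index 0 in the first test below), the next-best safe primitive is executed instead, and per the scheme above a mild negative reward can then be applied to the policy.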
To reduce the computational overhead of action masking, a caching and invalidation mechanism was introduced into the mask calculation process. Specifically, the cached mask M_cache is reused only when the following stability criteria are met: (1) Position: the displacement of the end-effector satisfies ‖Δx_ee‖ ≤ 0.02 m; (2) Rotation: the orientation change satisfies Δθ_ee ≤ 5°; (3) Staleness: the number of steps elapsed since the last computation satisfies t − t_cache ≤ 10; and (4) Phase: the semantic task phase remains unchanged. If any condition is violated, the mask is recomputed using the current environment state. This mechanism reduces redundant calls to physical simulations and improves the real-time efficiency of the action shielding module.
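The four reuse criteria can be expressed as a single predicate; the sketch below uses the thresholds from the text (function and argument names are ours):

```python
import math

def mask_cache_valid(x_ee, x_cache, theta_ee, theta_cache, t, t_cache,
                     phase, phase_cache,
                     pos_tol=0.02, rot_tol_deg=5.0, max_age=10):
    """Return True if the cached action mask may be reused.  The four
    conditions mirror the stability criteria in the text."""
    moved = math.dist(x_ee, x_cache) > pos_tol           # (1) position
    rotated = abs(theta_ee - theta_cache) > rot_tol_deg  # (2) rotation
    stale = (t - t_cache) > max_age                      # (3) staleness
    phase_changed = phase != phase_cache                 # (4) task phase
    return not (moved or rotated or stale or phase_changed)
```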

2.6. Training Hyperparameters and Strategies

The state space consists of integrated features from the robot and the environment, specifically including the end-effector pose, the relative position of the fruit, and the minimum signed distances to both rigid bodies and flexible foliage. The resulting state vector is defined as:
s_t = [vec(O_t), p_t] ∈ ℝ^425
where vec(O_t) ∈ ℝ^400 represents the flattened environmental perception derived from the 3DGS model. It consists of four semantic channels (depth, fruit, flexible, rigid), each downsampled to a 10 × 10 grid. The term p_t ∈ ℝ^25 represents the integrated proprioceptive state, defined as:
p_t = [x_ee, q_ee, q̇, x_fruit, g, d_rigid, d_soft]
where x_ee and q_ee denote the end-effector position and orientation, q̇ represents the joint velocities, x_fruit is the target fruit position, g is the gripper state, and d_rigid and d_soft denote the minimum signed distances to rigid bodies and flexible foliage, respectively.
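Assembling s_t from these components is a simple concatenation; the sketch below (plain Python, with a hypothetical helper name) flattens the four 10 × 10 channels and appends the 25-d proprioceptive vector:

```python
def build_state(obs_channels, proprio):
    """Assemble the 425-d state s_t = [vec(O_t), p_t]: four 10x10 semantic
    channels flattened to 400 values, followed by the 25-d proprioceptive
    vector.  Channel order follows the text: depth, fruit, flexible, rigid."""
    assert len(obs_channels) == 4
    assert all(len(c) == 10 and all(len(row) == 10 for row in c) for c in obs_channels)
    assert len(proprio) == 25
    flat = [v for channel in obs_channels for row in channel for v in row]
    return flat + list(proprio)
```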
To leverage the optimization stability of continuous reinforcement learning while ensuring safe execution, we employ a hybrid control strategy. The action space is modeled as a continuous preference vector a_t ∈ ℝ^8:
a_t = [s_+x, s_−x, s_+y, s_−y, s_+z, s_−z, s_open, s_close]
Each element of a_t represents a preference score for one of eight predefined motion primitives: Cartesian translations along the axes with a fixed step size (δ = 0.01 m) and gripper actuation. This design allows the Action Shielding Module to apply a binary mask m_t directly to these scores. The final executed action is selected via a masked argmax operation, enabling the agent to learn in a continuous latent space while strictly executing safe, collision-free discrete steps.
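The masked-argmax selection can be sketched as follows (illustrative; the returned flag marks the case where the raw top choice was masked, which triggers the mild negative reward described in Section 2.5):

```python
def select_primitive(scores, mask):
    """Masked argmax over the 8 primitive preference scores a_t.
    mask[i] == 1 marks a safe primitive; unsafe scores are forced to -inf so
    the executed action is always collision-free (hard constraint)."""
    assert len(scores) == len(mask) == 8
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    idx = max(range(8), key=lambda i: masked[i])
    # If the highest raw preference was shielded, the caller applies a
    # mild negative reward to discourage preferring high-risk actions.
    shield_triggered = mask[max(range(8), key=lambda i: scores[i])] == 0
    return idx, shield_triggered
```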
To address the optimization objective of collision-aware harvesting, the reward function is designed to (i) complete the task efficiently and (ii) reduce collision-related events, including collisions with rigid obstacles and adverse contacts such as gripper entrapment.
(1)
Target distance reward—encouraging the agent to approach the target position of the current phase. The reward is proportional to the per-step reduction in the distance to the phase target, driving the agent toward the target location.
r_g = k_g [d_g(t−1) − d_g(t)]
where d_g(t) denotes the distance to the target position of the current phase at time step t, and k_g is the corresponding reward coefficient. This reward encourages movement in the target direction by continuously reducing the distance.
(2)
Target grasping reward, used to guide the agent to successfully grasp the target fruit. The reward calculation for the grasping action considers the distance between the robot and the target fruit, encouraging the agent to grasp the target.
r_p = k_p [d_p(t−1) − d_p(t)]
where d_p(t) is the distance between the end-effector and the target fruit, k_p is the corresponding reward coefficient, and r_p represents the grasping reward.
(3)
Placing reward—guiding the robotic arm to successfully grasp the target fruit and complete the placing action at the designated location (fruit bin), constructed as:
r_place = k_t [d_t(t−1) − d_t(t)]
where d_t(t) is the distance to the target placement position, and k_t is the corresponding reward coefficient.
(4)
Task completion reward—when the target of a specific phase is completed, a stage-based reward is granted to encourage the agent to complete the task stepwise.
r_sc = k_sc · I
where k_sc is the task completion reward coefficient, and I is an indicator function that returns 1 when the phase target is completed and 0 otherwise.
(5)
Time penalty—to encourage the agent to complete the task as soon as possible and reduce unnecessary steps.
r_time = −γ_t · t
where γ_t is the time penalty coefficient and t is the current time step. This reward term helps avoid ineffective long waits or repetitive actions.
(6)
Gripper collision penalty—to prevent excessive contact between the robotic arm gripper and flexible branches, which leads to harvesting failure or damage to the end-effector. When entrapment of a branch is detected, it is considered an adverse collision, and the system applies a negative reward.
r_danger = −σ_danger
where σ_danger is the penalty coefficient for dangerous behavior. When the detection system identifies that the gripper is touching branches, this negative reward is applied automatically so that the agent learns not to execute dangerous actions. During simulation training, collision detector geometries are added inside the gripper to detect whether branches are trapped between the fingers.
The determination of these reward coefficients follows a hierarchical priority principle to balance conflicting objectives. Specifically, the task completion reward (k_sc) is assigned the highest weight to ensure convergence, preventing the agent from getting stuck in local optima. This is followed by the safety penalty (σ_danger) to strictly enforce obstacle avoidance constraints. Finally, the distance guidance terms (k_g, k_p) and the time penalty (γ_t) are assigned smaller weights and scaled to provide smooth gradient signals without overshadowing the primary task objectives. The specific values were determined empirically using a coarse-to-fine tuning strategy to ensure training stability. In summary, the total reward function is:
r_t = r_g + r_p + r_place + r_sc + r_time + r_danger
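A sketch of the combined reward follows. The coefficient values are illustrative placeholders (the paper reports only their priority ordering, with k_sc weighted highest), and the distance arguments are the d_g, d_p, d_t terms defined above:

```python
def total_reward(d_g_prev, d_g, d_p_prev, d_p, d_t_prev, d_t,
                 phase_done, t, danger,
                 k_g=1.0, k_p=1.0, k_t=1.0, k_sc=10.0,
                 gamma_t=0.01, sigma_danger=5.0):
    """r_t = r_g + r_p + r_place + r_sc + r_time + r_danger.
    Coefficients are placeholders, not the paper's tuned values."""
    r_g = k_g * (d_g_prev - d_g)                 # (1) approach phase target
    r_p = k_p * (d_p_prev - d_p)                 # (2) approach the fruit
    r_place = k_t * (d_t_prev - d_t)             # (3) approach the fruit bin
    r_sc = k_sc * (1.0 if phase_done else 0.0)   # (4) stage completion bonus
    r_time = -gamma_t * t                        # (5) per-step time penalty
    r_danger = -sigma_danger if danger else 0.0  # (6) entrapment penalty
    return r_g + r_p + r_place + r_sc + r_time + r_danger
```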
The model training process is divided into two phases: offline pre-training and online learning. First, behavior cloning pre-training is performed on the Actor network using collected expert demonstration data to obtain an initial safe policy distribution; subsequently, the online reinforcement learning phase is entered. The total number of training episodes is set to 1500, with the maximum time steps (Max Steps) per episode limited to 600. After each interaction step, 1024 samples are sampled from the experience replay buffer for one network parameter update. To suppress Q-value overestimation during training and enhance policy smoothness, the target policy noise of TD3 is set to 0.2, the noise clip is set to 0.5, and the Actor network employs a delayed policy update mechanism (one Actor update is performed for every two Critic updates). Specific hyperparameter settings are detailed in Table 1.
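The target-policy smoothing configuration (noise std 0.2, clip 0.5) corresponds to the standard TD3 update; a minimal sketch of how the target action is perturbed before the Critic target is computed:

```python
import random

def smoothed_target_action(target_policy_action, noise_std=0.2, noise_clip=0.5,
                           low=-1.0, high=1.0):
    """TD3 target-policy smoothing as configured in the text: Gaussian noise
    with std 0.2, clipped to [-0.5, 0.5], is added to the target actor's
    output, which is then clipped to the valid action range (sketch)."""
    out = []
    for a in target_policy_action:
        eps = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
        out.append(max(low, min(high, a + eps)))
    return out
```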

3. Results and Analysis

3.1. Experimental Platform and Test Conditions

To validate the proposed framework and method, a high-fidelity robotic fruit harvesting simulation environment was first built based on the MuJoCo (v3.2.0) physics engine to realize the dynamic interaction simulation between the robotic system and the orchard environment (as shown in Figure 6 and Figure 7). The core motivation for adopting simulation as the primary experimental phase lies in addressing the high heterogeneity of orchard environments: apple varieties (e.g., Gala, Qincui, and Fuji) under different regional and agronomic management conditions exhibit significant differences in tree structure and fruit distribution. Relying solely on field experiments in a single orchard makes it difficult to fully verify the adaptability of the strategy to diverse scenarios. By systematically simulating fruit growth patterns likely to occur in different orchards, the generalization performance of the algorithm regarding structural variations can be quantitatively evaluated under controlled conditions. At the parameter level, the simulation system strictly matches the kinematics and dynamics of the physical robot. The system components include a wheeled mobile platform, a multi-degree-of-freedom (DOF) robotic arm, fruit tree models with hybrid rigid/flexible characteristics, and fruit targets. The robot platform is modeled based on the orchard multi-arm harvesting robot independently developed by our research group; the orchard environment is constructed using a generic high-density dwarf apple tree model (plant spacing 1 m, row spacing 3.5 m, average tree height 2.5 m). To fully simulate the randomness of branch postures and fruit distribution in real-world orchard operations, Motion Capture (MoCap) technology was introduced to achieve parameterized random generation of branch poses and fruit positions. This effectively enhances the diversity and scene coverage of the training data, providing a richer training sample space for the reinforcement learning algorithm.
Model training was conducted on a high-performance server equipped with an Intel i9-13900K CPU and an NVIDIA GeForce RTX 4090 GPU. Upon completion of model training, the model was deployed to an industrial control computer equipped with a 12th Gen Intel i5 processor and an NVIDIA GeForce RTX 3060 GPU (12GB VRAM) for algorithm validation and robotic arm control. Following simulation validation, field experiments were conducted under on-site conditions. The experimental platform is the multi-arm apple harvesting robot independently developed by our research group (as shown in Figure 6). This platform adopts a modular architecture, configured with upper and lower dual arms on both the left and right sides, carrying a total of four isomorphic operational units. Each unit remains consistent in configuration design and control systems and employs an “eye-in-hand” perception structure. Each operational unit consists of a non-standard RPPR configuration 4-DOF robotic arm, a ToF-based RGB-D visual sensor, and a pneumatic 3-DOF flexible end-effector. The visual sensor has a field of view (FOV) of 81° × 56° and a depth accuracy of ±3 mm at 1 m. The end-effector has a maximum gripping diameter of 120 mm and utilizes silicone-wrapped materials to enhance gripping compliance and fruit protection capabilities. The system utilizes a CAN bus to achieve distributed collaborative control. After the mobile platform aligns precisely with the row, the four units can execute reinforcement learning strategies in parallel; the upper and lower dual arms perform spatial division based on the vertical distribution of fruits, while the left and right sides advance synchronously to complete the harvesting task. The platform is equipped with a lithium battery pack providing approximately 8 h of endurance, has a maximum travel speed of 1.5 m/s, and is adaptable to narrow-row operational environments. 
This paper selects one of the minimum control units as the subject of study to conduct experimental research and performance evaluation in both simulation and field environments.

3.2. Virtual Environment Simulation Experiment

A standardized virtual orchard environment was constructed based on the MuJoCo simulation platform, as shown in Figure 7. In this environment, the deformable characteristics of flexible branches under force were precisely simulated by configuring physical parameters such as elastic coefficients, damping coefficients, joint limits, and friction coefficients (specific parameters are detailed in Table 2). Simultaneously, trunks and fruits were modeled as high-stiffness rigid bodies to ensure their undeformability. The robot’s geometric parameters, mass distribution, moments of inertia, and joint kinematic constraints were imported via URDF files. The execution of the harvesting process mirrors that of the physical robot to ensure the authenticity and consistency of the simulation environment regarding dynamics and collision detection. The single harvesting task was divided into four consecutive phases: (1) Observation Phase, acquiring environmental information and localizing the target fruit; (2) Approach Phase, where the robotic arm moves to a pre-grasping pose (pre-grasping point) near the fruit; (3) Grasping Phase, aligning the end-effector with the fruit and closing the gripper to achieve stable clamping; and (4) Placing Phase, transporting the grasped fruit to a designated target position and completing the placement. Phase transitions are event-driven and follow standard harvesting logic: the system enters Observation once the target fruit is localized (valid 3DGS state), switches to Approach while moving toward the pre-grasp pose, transitions to Grasping when the end-effector is in the graspable region and the gripper-close primitive is executed, and enters Placing after a successful grasp, ending when the fruit is released at the designated placement location by executing the gripper-open primitive.
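The event-driven phase logic can be summarized as a small transition table (the event names below are our paraphrase of the triggers listed above):

```python
# Four-stage harvesting cycle: Observation -> Approach -> Grasping -> Placing.
TRANSITIONS = {
    ("observation", "target_localized"): "approach",   # valid 3DGS state
    ("approach", "in_graspable_region"): "grasping",   # pre-grasp pose reached
    ("grasping", "grasp_succeeded"): "placing",        # gripper closed on fruit
    ("placing", "fruit_released"): "observation",      # cycle complete
}

def next_phase(phase, event):
    """Advance the task phase on a matching event; ignore irrelevant events."""
    return TRANSITIONS.get((phase, event), phase)
```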
Furthermore, to enhance the model’s generalization capability and align with the physical characteristics of standardized orchards, a topology-constrained fruit position randomization strategy was introduced during scene construction. In modern manual horticulture, fruit trees are typically pruned into a spindle shape, presenting a relatively flat canopy where hard-to-harvest fruits are predominantly occluded by flexible branches. Based on this horticultural prior, constraints were applied to the simulated environment: branches always remain connected to the rigid trunk, and target fruits are parameter-sampled within the local coordinate system of the branches to be randomly distributed within a specified spatial range strictly behind the flexible branches. This ensures the rationality and diversity of the generated environments.
At the initial stage of each episode, the robot captures RGB-D images and 3D Gaussian Splatting data at the observation position. These data are processed by MT-WavYOLO to extract 2D masks of fruits and occluding branches. Subsequently, the corresponding 3D GS representations for apples and branches are obtained based on these masks. The TD3 model utilizes this processed data to plan the robot’s harvesting path. The planning results are sent to the MoveIt (v1.1.16) motion planner, which then generates joint trajectories to control the robotic arm. This process demonstrates an integrated system combining 3D modeling, data processing, and motion planning for the effective training of the robot system (as shown in Figure 8).
In the simulation training phase, to strictly validate the algorithmic performance, we compared AE-TD3 against its direct continuous control peers: DDPG, SAC, and Standard TD3. These state-of-the-art baselines were selected to comprehensively evaluate the policy’s convergence efficiency, exploration stability, and optimization capability in a continuous action space. To ensure experimental reproducibility and a strictly fair comparison, all reinforcement learning agents operated under the identical Markov Decision Process (MDP) setup and reward function structure. Furthermore, a unified hyperparameter configuration was adopted for common training parameters. Specifically, fundamental settings such as a batch size of 1024, a learning rate of 0.001, a discount factor of 0.99, and the maximum training episodes were kept strictly consistent across all algorithms, while algorithm-specific parameters (e.g., exploration noise) followed standard literature recommendations. Figure 9 illustrates the training results of four reinforcement learning algorithms (DDPG, SAC, AE-TD3, and the baseline TD3) in the harvesting task. Within approximately the first 200 steps, the reward value of the AE-TD3 algorithm rose rapidly, entering the positive range and quickly converging to a stable interval, with a final average reward approaching 100. This performance improvement is primarily attributed to the proposed dynamic action shielding module. By performing look-ahead collision risk assessment on candidate discrete actions, the module utilizes the Action Mask to proactively shield against high-risk actions that may collide with rigid obstacles. This ensures that only valid actions are executed during training, significantly reducing fluctuation amplitudes. 
Additionally, benefitting from the expert experience guidance mechanism, the algorithm was able to reproduce expert trajectories in the policy space during the early training stages, providing reasonable prior references for the agent. As the algorithm gradually learned stable harvesting actions, the weight coefficients were dynamically adjusted during training to reduce reliance on expert experience, thereby enabling autonomous exploration and optimization in the later stages. This progressive learning approach, evolving from imitation learning to autonomous exploration, effectively improved the policy’s convergence speed and generalization ability.
Although DDPG, SAC, and the baseline TD3 showed certain convergence trends in the later stages of training, their overall convergence speeds were slower, and they exhibited larger fluctuations. Specifically, DDPG displayed significant fluctuations throughout the training process; its reward curve hovered in the low-value range for an extended period and only gradually stabilized in the later stages. Furthermore, its reward level during exploration remained consistently lower than that of other algorithms, indicating limitations in convergence speed and stability within discrete action spaces. The training curve of SAC was relatively smooth with smaller fluctuations, but its final convergence value and average reward were the lowest, reflecting insufficient exploration efficiency and limited policy optimization capability. In comparison, while the baseline TD3 achieved a final convergence value second only to AE-TD3, it exhibited substantial fluctuations in the early training stages due to the lack of prior guidance from expert experience and a mechanism for shielding dangerous actions. It did not gradually converge to a stable interval until approximately 1000 episodes.
In addition to quantitative results, qualitative observations of the robotic arm’s behavior during obstacle avoidance were conducted. The results indicate that the proposed policy demonstrates a higher harvesting success rate in complex scenarios involving occlusions. As shown in the success cases in Figure 10, the robotic arm was able to adapt flexibly to the environment by employing end-effector pushing. Whether blocked by multiple overlapping branches (Figure 10a) or a single branch directly in front (Figure 10b), AE-TD3 planned reasonable paths to push away the branches closest to the fruit to execute harvesting. This method achieved a harvesting success rate of approximately 89% under occluded conditions, with an average step count of approximately 220 steps, significantly outperforming traditional methods. Notably, this operational mode featuring compliant obstacle avoidance effectively enhances the robotic arm’s task adaptability in rigid-flexible coupled environments—agile actions that are typically difficult to achieve with traditional algorithms.

3.3. Field Experiments

Given the strong randomness inherent in real-world scenarios and the unrepeatability of fruit harvesting tasks, the proposed method was validated in both a laboratory environment and a real orchard environment. First, a setup capable of simulating the natural growth state of apples in standardized orchards was constructed under laboratory conditions. This environment features high controllability and repeatability, making it suitable for comparative experiments between different algorithms. Subsequently, experiments were conducted in a real orchard to further evaluate the harvesting efficiency and practicality of the proposed method in a natural environment.
We established a physical experimental platform in the laboratory, consisting of the independently developed orchard harvesting robot and fruit tree models, to simulate the common problem of branch occlusion in orchard environments. This platform allows for the construction of representative occlusion scenarios through the flexible combination of tree models, fruit models, and flexible branches. We designed four experimental scenarios: single flexible branch occlusion, rigid branch occlusion, multiple flexible branch occlusion, and hybrid flexible-rigid branch occlusion (as shown in Figure 11). These scenarios basically cover the typical occlusion situations found in real-world orchards.
For the comparative analysis in field scenarios, we expanded the baseline set to evaluate the framework against different planning philosophies. RRT-Connect (Geometric Baseline) was selected to represent traditional planning methods; it treats environmental elements as rigid obstacles, lacking semantic awareness of flexibility. DQN (Discrete RL Baseline) was included to benchmark the performance limits of discrete action spaces compared to continuous control, while Standard TD3 served as the ablation baseline. Note that for all learning-based methods (DQN, TD3, AE-TD3) deployed in the field, the agents utilized the identical 3DGS-based state representation (425 dimensions) and interacted with the same MoveIt execution interface to ensure fairness. Each method (RRT, DQN, TD3, and AE-TD3) was executed 20 times per scenario. To fairly validate the effectiveness of the proposed method, we employed metrics combining visual localization and grasping results to evaluate robot performance:
(1)
Grasping Success Rate (%): The ratio of the number of fruits successfully grasped by the end-effector to the number of fruits localized by the vision system.
(2)
Harvesting Success Rate (%): The ratio of the number of fruits successfully released into the basket to the number of fruits localized by the vision system.
(3)
Collision Rate (%): The incidence where a fruit was successfully grasped, but the robotic arm collided with a rigid branch, or a flexible branch got stuck in the gripper.
(4)
Average Time (s): The time consumed to execute the entire harvesting process for a target fruit.
Note: The “Collision Rate” specifically refers to harmful collisions with rigid obstacles (e.g., trunks) or severe entanglements that impede operation. Benign contacts with flexible foliage (pushing actions) authorized by the soft-shielding mechanism are not counted as collisions.
For proportion-type metrics (grasp success, harvest success, and collision rate), we report the observed counts and denominators (k/n) together with two-sided 95% confidence intervals (CIs) computed using the Wilson score interval (z = 1.96). For time, we report the mean and the two-sided 95% CI of the mean using the t distribution. For pairwise comparisons of harvest success rates, we use a two-sided two-proportion z-test (equivalently, a chi-square test without continuity correction). When AE-TD3 is compared against multiple baselines, we apply the Holm–Bonferroni procedure to control the family-wise error rate at α = 0.05, and report Holm-adjusted p-values. This statistical approach is intended to better account for the inherent stochasticity of field environments. By moving beyond simple point estimates to incorporate interval estimation and multi-comparison corrections, we aim to enhance the credibility of the performance gains observed in the AE-TD3 framework.
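For reference, the Wilson score interval used for the proportion metrics can be computed directly from the observed counts:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Two-sided 95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Applied to the per-scenario laboratory counts (n = 20 trials), this reproduces the CI bounds quoted in the next paragraph, e.g., 15/20 gives a lower bound of 53.1% and 18/20 an upper bound of 97.2%.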
Table 3 compares the grasping performance of different algorithms in laboratory scenarios. Per-scenario variability (n = 20 trials per scenario) was also quantified. Across the four occlusion scenarios, harvest success ranged from 40.0% to 70.0% for RRT (Wilson 95% CI across scenarios: 21.9–85.5%), 50.0% to 75.0% for DQN (29.9–88.8%), 55.0% to 75.0% for TD3 (34.2–88.8%), and 75.0% to 90.0% for AE-TD3 (53.1–97.2%). The pooled CIs over 80 trials per method are shown as error bars in Figure 12. Combining the analysis of the four scenarios (Figure 12a–d), AE-TD3 improved performance in a coherent manner across the full manipulation pipeline rather than on a single metric. First, Figure 12a shows that AE-TD3 achieved the highest grasp success rate under all occlusion types, and the advantage became more evident in harder conditions (rigid and hybrid occlusions), indicating stronger robustness in reaching and aligning the end-effector under constrained free space. Second, Figure 12b further demonstrates that this grasp advantage was effectively translated into final harvest outcomes: AE-TD3 increased the aggregated harvest success rate from 53.8% (RRT) to 81.2%, corresponding to an absolute increase of 27.5 percentage points. Notably, the grasp-to-harvest performance drop for AE-TD3 was smaller (87.5% to 81.2%) than that of RRT (66.2% to 53.8%) and TD3 (75.0% to 65.0%), suggesting fewer failures during the critical grasp-to-detach transition.
In terms of safety, Figure 12c indicates that the proposed method reduced the collision rate among successful harvests from 27.9% to 12.3%, corresponding to a relative reduction of 55.9%. This reduction was particularly meaningful under rigid and hybrid occlusions, where dynamic action shielding is expected to play a dominant role by suppressing high-risk operations near rigid bodies while still allowing controlled contact with compliant foliage when necessary. Finally, Figure 12d shows that AE-TD3 consistently shortened the average time per fruit from 14.83 s to 10.98 s, corresponding to a relative decrease of 26.0%, implying that the policy does not merely “try more” to succeed but converges to more direct and stable interaction patterns with reduced path redundancy. Comparisons with TD3 and DQN showed improvements in the same direction, albeit with smaller magnitudes, which supports that the gain is attributable to the combined effect of risk-aware interaction and action shielding rather than random variance.
Two-sided two-proportion tests were performed on the pooled laboratory harvest success (80 trials per method across the four scenarios) to compare AE-TD3 with each baseline method (RRT, DQN, and TD3). To control the family-wise error rate for the three pairwise comparisons (m = 3), Holm–Bonferroni correction was applied at α = 0.05. Using the pooled counts (AE-TD3: 65/80; RRT: 43/80; DQN: 50/80; TD3: 52/80), the Holm-adjusted p-values were 0.00061 (AE-TD3 compared with RRT), 0.01670 (AE-TD3 compared with DQN), and 0.02043 (AE-TD3 compared with TD3). It should be emphasized that the purpose of the laboratory evaluation was to isolate and verify the decision-making and interaction capabilities of the harvesting framework itself; therefore, targets in all scenarios were identifiable. On this basis, the complete framework was further validated in subsequent orchard experiments.
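The reported Holm-adjusted p-values can be reproduced from the pooled counts with standard-library Python (a sketch of the two-proportion z-test without continuity correction and the Holm step-down adjustment):

```python
from math import sqrt
from statistics import NormalDist

def two_prop_p(k1, n1, k2, n2):
    """Two-sided two-proportion z-test p-value (no continuity correction)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))

def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjustment of a family of p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])  # enforce monotonicity
        adj[i] = min(1.0, running)
    return adj
```

With the pooled counts above (AE-TD3 65/80 versus RRT 43/80, DQN 50/80, and TD3 52/80), this yields adjusted p-values of approximately 0.00061, 0.0167, and 0.02043, matching the reported results.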
Finally, the proposed method was validated in a standardized orchard located in Huangling County, Yan’an City, Shaanxi Province, as shown in Figure 13. The fruit trees were arranged in rows with a tree spacing of 1.0 m and a row spacing of 3.5 m. This spatial layout provided ample space for the smooth operation of the Cartesian robotic arm and the mobile platform.
In this experiment, using the minimum execution unit, harvesting experiments were conducted for each method on 50 apples subjected to varying degrees of flexible branch occlusion (sample images of some scenarios are shown in Figure 11) to validate the proposed method. Compared with the original methods, the proposed method achieved higher grasp success rates and harvest success rates, along with shorter average times.
As seen in Table 4, in real-world orchard operational scenarios, AE-TD3 outperformed the control methods in harvest success rate, safety, and efficiency. Specifically, the harvest success rate of AE-TD3 reached 77.1%, compared with the 53.3%, 60.9%, and 63.8% of RRT, DQN, and TD3, respectively. Meanwhile, among successfully harvested samples, the collision rate of AE-TD3 was only 16.2%, whereas the control methods ranged between 29% and 33%. Furthermore, the average operation time per fruit was only 12.4 s, lower than other methods. Overall, the proposed action shielding method directly suppressed potential rigid collision actions. While reducing collision risk, it incorporated a soft shielding mechanism to maintain necessary exploration space, thereby guiding the robotic arm to gradually learn robust obstacle avoidance harvesting strategies in a reward-sparse environment. This demonstrates good adaptability to high-dimensional, complex robotic manipulation scenarios. The robotic arm did not follow the principle of the shortest collision-free path in geometric space but tended to actively push the flexible branches closest to the fruit. This behavioral pattern strongly validates the necessity of explicitly incorporating contact risk into the decision loop, trading controlled flexible contact for target accessibility under complex occlusion. In summary, the proposed AE-TD3 harvesting strategy significantly improved the end-to-end success rate from visual localization to final harvest while maintaining low collision risk, yielding substantial efficiency gains.
Notably, the framework still exhibited a certain proportion of failure samples during the grasping phase: under the current test scale, AE-TD3 failed to grasp 9 targets. When targets were densely surrounded by multiple branches, due to the limited degrees of freedom of the robotic arm and workspace constraints, the policy struggled to plan human-like detour and insertion paths, often falling into local optima or repeatedly probing within narrow feasible regions, leading to the forced termination of the grasping action. On the other hand, strong lighting variations, complex background textures, and partial occlusions reduced detection accuracy, resulting in 2 fruits not being successfully detected. Additionally, RGB-D sensors are prone to depth noise and voids in long-distance or high-contrast scenarios, which caused an accumulation of 3D localization errors, weakening the policy’s ability to assess end-effector safety and contact risk. To further address vision failures and depth noise in complex field environments, future iterations will integrate an adaptive LED fill-light system to mitigate illumination variability. Additionally, we plan to employ multi-sensor fusion by combining RGB-D cameras with solid-state LiDAR to correct depth artifacts and optimize Next-Best-View (NBV) planning algorithms, ensuring more robust perception and reconstruction under extreme outdoor conditions. Currently, the localization module integrates the MT-WavYOLO perception network with a frustum-based localization method. The premise for accurate localization is the assumption that target fruits are approximately spherical. This assumption facilitates rapid spatial initialization and assists the 3DGS module in generating precise geometric envelopes for near-spherical objects common in standardized orchards, such as apples, citrus, or pears. 
However, the framework currently lacks sufficient capability for extension to non-spherical or complex-shaped crops (e.g., cucumbers or grape clusters). Nevertheless, the “avoid-rigid, push-through-soft” strategy serves as a fundamental behavioral primitive. By recalibrating compliance parameters to match the physical properties of various branch structures, this strategy can be migrated to diverse rigid-flexible coupled scenarios.

4. Conclusions

This study proposes and validates an end-to-end autonomous obstacle-avoidance harvesting framework for unstructured, rigid-flexible coupled orchard scenarios. Centered on a multi-task perception network (MT-WavYOLO), 3D Gaussian Splatting (3DGS) modeling, and Deep Reinforcement Learning (AE-TD3), the framework establishes a closed-loop perception–reconstruction–decision architecture. The multi-task network is utilized for target prior extraction, while 3DGS represents the scene using anisotropic Gaussian primitives, providing the policy with continuous and physically consistent geometric and semantic information. The policy network was trained in a simulation environment based on ROS communication and MuJoCo. Through risk-sensitive reward functions, expert demonstration warm-starts, and the shielding of dangerous actions via action masks, synergistic optimization of grasping and obstacle avoidance under flexible foliage occlusion was achieved.
The integration of 3DGS significantly improves the precision of scene geometric characterization and the granularity of accessibility assessments, giving the policy more accurate target recognition and geometric decoupling capabilities under dense fruit and complex foliage. At the same time, by constructing a smooth risk field in the state space and projecting it onto the discrete action set, the system obtains stable learning signals and higher sample efficiency during training. The full closed loop runs efficiently in simulation and was further validated on a real platform in an outdoor standardized orchard: a single harvesting operation is completed in an average of 12.4 s, with a grasping success rate of 81.3% and a collision rate of 16.2%. These results demonstrate strong task performance and improved collision safety in cluttered orchard environments.
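As a hedged illustration of the dangerous-action shielding described above, the sketch below masks a continuous preference vector over eight motion primitives with a look-ahead collision-risk estimate; the shapes, threshold value, and fallback rule are our assumptions, not the paper's exact design.

```python
import numpy as np

def masked_action(preferences, collision_risk, risk_threshold=0.5):
    """Select a motion primitive from the policy's preference vector while
    shielding primitives whose look-ahead collision risk with rigid obstacles
    exceeds a threshold. Threshold and fallback are illustrative assumptions."""
    prefs = np.asarray(preferences, dtype=float)
    mask = np.asarray(collision_risk, dtype=float) < risk_threshold
    if not mask.any():                          # every primitive is unsafe:
        return int(np.argmin(collision_risk))   # fall back to the least risky one
    shielded = np.where(mask, prefs, -np.inf)   # block unsafe primitives
    return int(np.argmax(shielded))
```

During both training and execution, such a mask prevents high-risk primitives from ever being selected, which is consistent with the shielding behavior reported for AE-TD3.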

Author Contributions

Conceptualization, H.F.; methodology, H.F.; software, H.F.; validation, H.F.; formal analysis, H.F. and T.L.; investigation, T.L. and H.F.; resources, T.L.; data curation, T.L.; writing—original draft preparation, H.F.; writing—review and editing, T.L. and Q.F.; visualization, T.L.; supervision, L.C. and T.L.; project administration, Q.F. and T.L.; funding acquisition, L.C. and T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (32572207) and the Science and Technology Program of Tianjin, China (23YFZCSN00290).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to ongoing research restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithmic Details of Semantic 3DGS Construction

Algorithm A1: Incremental Semantic 3DGS Construction
Input: RGB image I_t, aligned depth map D_t, semantic probabilities P_sem, camera pose T_t, map M_{t-1}
Output: updated Gaussian map M_t
1:  M_new ← ∅
2:  for each valid pixel u in D_t do
3:      d ← D_t(u)
4:      Σ ← OneShotCovariance(d, d, η, β)
5:      μ ← BackProject(u, d, T_t)
6:      s ← P_sem(u)
7:      g_new ← Gaussian(μ, Σ, s)
8:      g_match ← FindCorrespondence(g_new, M_{t-1})
9:      if g_match ≠ Null and ReprojectionError(g_new, g_match) < ε then
10:         Update(g_match, g_new)
11:     else
12:         M_new.append(g_new)
13:     end if
14: end for
15: M_t ← M_{t-1} ∪ M_new
16: return M_t
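A minimal Python sketch of the per-pixel loop in Algorithm A1 follows, assuming hypothetical stand-ins for FindCorrespondence and ReprojectionError and a simple depth-proportional covariance heuristic in place of OneShotCovariance.

```python
import numpy as np

def update_semantic_map(depth, sem_prob, K_inv, T_wc, prev_map,
                        match_fn, reproj_err_fn, eps=1e-2):
    """Sketch of Algorithm A1: back-project each valid depth pixel to a
    semantic Gaussian and fuse it with the previous map. `match_fn`,
    `reproj_err_fn`, the covariance heuristic, and `eps` are placeholders
    for the paper's operators, not their actual implementations."""
    new_gaussians = []
    H, W = depth.shape
    for v in range(H):
        for u in range(W):
            d = depth[v, u]
            if d <= 0:                                   # skip invalid depth
                continue
            ray = K_inv @ np.array([u, v, 1.0])
            mu = (T_wc @ np.append(ray * d, 1.0))[:3]    # BackProject
            sigma = (0.01 * d) ** 2 * np.eye(3)          # covariance heuristic
            g_new = {"mu": mu, "sigma": sigma, "sem": sem_prob[v, u]}
            g_match = match_fn(g_new, prev_map)
            if g_match is not None and reproj_err_fn(g_new, g_match) < eps:
                g_match.update(g_new)                    # fuse into existing Gaussian
            else:
                new_gaussians.append(g_new)              # spawn a new Gaussian
    return prev_map + new_gaussians
```

The map thus grows incrementally: matched observations refine existing Gaussians, while unmatched ones are appended as new primitives.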

Figure 1. Overall workflow of obstacle avoidance picking framework.
Figure 2. Image data processing workflow.
Figure 3. 3D Gaussian splatting scene reconstruction. (a) Real-world orchard scene; (b) Orchard scene 3DGS reconstruction; (c) Virtual scene 3DGS reconstruction.
Figure 4. Obstacle-avoidance harvesting path planning based on deep reinforcement learning.
Figure 5. Execution pipeline of the dynamic action shielding module (mask generation, caching, and masked action selection).
Figure 6. Multi-arm harvesting robot and the manipulator.
Figure 7. MuJoCo simulation environment and obstacle-avoiding harvesting process.
Figure 8. The training environment of the TD3 model.
Figure 9. Reward of AE-TD3, TD3, SAC and DDPG algorithms in the training process.
Figure 10. AE-TD3 attempts to plan branch-separating harvesting trajectories in the simulation environment. (a) The target fruit is occluded by multiple overlapping branches; (b) The target fruit is occluded by a single branch directly in front.
Figure 11. Four representative occlusion scenarios constructed in the laboratory environment. (a) Single flexible branch; (b) Multiple flexible branches; (c) Rigid branch; (d) Hybrid flexible–rigid branch.
Figure 12. Multi-metric comparison over four combined laboratory scenarios: (a) grasp success rate, (b) harvest success rate, (c) collision rate among successfully harvested samples, and (d) average time per fruit for RRT, DQN, TD3, and AE-TD3. Error bars indicate two-sided 95% confidence intervals (Wilson score intervals for the proportion metrics in (a)–(c), and t-based 95% CIs for the mean time in (d)).
Figure 13. Standardized orchard operation scenario for harvesting robots.
Table 1. Unified Hyperparameter Settings for the AE-TD3 Algorithm.

Parameter | Value
Action Dimension | 8
State Dimension | 425
Network Size | (512, 512)
Batch Size | 1024
Replay Buffer Size | 10^6
Learning Rate | 0.001
Max Episodes | 1500
Max Steps Per Episode | 600
Policy Noise | 0.2
Table 2. MuJoCo Simulation Environment Parameter Settings.

Parameter | Description | Value
k_b | Equivalent stiffness of branch compliance | 1.2 × 10^4
c_b | Equivalent damping of branch compliance | 1.2 × 10^2
μ_fg | Sliding friction coefficient between apple and silicone gripper | 0.80
μ_fb | Sliding friction coefficient between apple and branch surface | 0.35
μ_gb | Sliding friction coefficient between silicone gripper and branch | 0.60
b_R | Joint damping of revolute joints (R) | 0.8
b_P | Joint damping of prismatic joints (P) | 50
c_θ | Angular damping of branch hinge | 0.6
k_θ | Angular stiffness of branch hinge | 12
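As a rough illustration, the branch-compliance parameters in Table 2 could map onto a MuJoCo model fragment like the one below; the body, joint, and geom names and the capsule geometry are hypothetical, and only the stiffness, damping, and sliding-friction values come from the table.

```xml
<!-- Illustrative mapping of Table 2 compliance parameters onto a MuJoCo
     branch model; names and geometry are hypothetical placeholders. -->
<body name="branch">
  <joint name="branch_hinge" type="hinge" axis="0 1 0"
         stiffness="12" damping="0.6"/>   <!-- k_theta, c_theta -->
  <geom name="branch_geom" type="capsule" size="0.01 0.3"
        friction="0.35 0.005 0.0001"/>    <!-- first entry: mu_fb sliding friction -->
</body>
```

In MuJoCo, the first entry of a geom's `friction` attribute is the sliding coefficient, so the apple-branch value μ_fb = 0.35 appears there; torsional and rolling terms are left at assumed small defaults.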
Table 3. Comparison of Grasping Performance Among Different Algorithms in Laboratory Scenarios.

Scenario | Method | Grasping Success Rate (%) | Harvesting Success Rate (%) | Collision Rate (%) | Average Time (s)
Scenario 1 | RRT | 80.0 | 70.0 | 21.4 | 13.2
Scenario 1 | DQN | 80.0 | 75.0 | 26.7 | 12.8
Scenario 1 | TD3 | 85.0 | 75.0 | 20.0 | 12.1
Scenario 1 | AE-TD3 | 95.0 | 90.0 | 5.6 | 9.5
Scenario 2 | RRT | 70.0 | 60.0 | 25.0 | 14.1
Scenario 2 | DQN | 75.0 | 70.0 | 21.4 | 13.7
Scenario 2 | TD3 | 80.0 | 70.0 | 21.4 | 13.3
Scenario 2 | AE-TD3 | 90.0 | 85.0 | 11.8 | 10.7
Scenario 3 | RRT | 60.0 | 45.0 | 33.3 | 15.8
Scenario 3 | DQN | 65.0 | 55.0 | 27.3 | 14.5
Scenario 3 | TD3 | 70.0 | 60.0 | 25.0 | 14.2
Scenario 3 | AE-TD3 | 85.0 | 75.0 | 20.0 | 11.6
Scenario 4 | RRT | 55.0 | 40.0 | 37.5 | 16.2
Scenario 4 | DQN | 60.0 | 50.0 | 30.0 | 15.1
Scenario 4 | TD3 | 65.0 | 55.0 | 27.3 | 14.8
Scenario 4 | AE-TD3 | 80.0 | 75.0 | 13.3 | 12.1
Table 4. Field Harvesting Experiments.

Method | Grasping Success Rate (%) | Harvesting Success Rate (%) | Collision Rate (%) | Average Time (s)
RRT | 62.2 | 53.3 | 29.2 | 16.3
DQN | 67.4 | 60.9 | 32.1 | 15.3
TD3 | 70.2 | 63.8 | 30.0 | 14.5
AE-TD3 | 81.3 | 77.1 | 16.2 | 12.4