Article

DRFW-TQC: Reinforcement Learning for Robotic Strawberry Picking with Dynamic Regularization and Feature Weighting

1 School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China
2 Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei 230036, China
3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
AgriEngineering 2025, 7(7), 208; https://doi.org/10.3390/agriengineering7070208
Submission received: 19 May 2025 / Revised: 26 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025

Abstract

Strawberry harvesting represents a labor-intensive agricultural operation where existing end-effector pose control algorithms frequently exhibit insufficient precision in fruit grasping, often resulting in unintended damage to target fruits. Concurrently, deep learning-based pose control algorithms suffer from inherent training instability, slow convergence rates, and inefficient learning processes in complex environments characterized by high-density fruit clusters and occluded picking scenarios. To address these challenges, this paper proposes an enhanced reinforcement learning framework, DRFW-TQC, that integrates Dynamic L2 Regularization for adaptive model stabilization and a Group-Wise Feature Weighting Network for discriminative feature representation. The methodology further incorporates a picking posture traction mechanism to optimize end-effector orientation control. The experimental results demonstrate the superior performance of DRFW-TQC compared to the baseline. The proposed approach achieves a 16.0% higher picking success rate and a 20.3% reduction in angular error with four target strawberries. Most notably, the framework’s transfer strategy effectively addresses the efficiency challenge in complex environments, maintaining an 89.1% success rate in the eight-strawberry environment while reducing the timeout count by 60.2% compared to non-adaptive methods. These results confirm that DRFW-TQC successfully resolves the tripartite challenge of operational precision, training stability, and environmental adaptability in robotic fruit harvesting systems.

1. Introduction

Precision robotic harvesting in ridge-cultivated strawberry fields presents significant technical challenges due to the demanding requirements of end-effector pose control in dense, unstructured environments [1,2,3]. The development of reliable automated picking systems necessitates addressing three fundamental issues: precise end-effector trajectory planning for delicate fruit handling, stable training of deep reinforcement learning models under sparse reward conditions, and efficient adaptation to complex field environments. Traditional model-based control approaches, while effective in structured settings, often fail to meet these requirements due to their limited adaptability to the dynamic spatial configurations characteristic of ridge cultivation systems [4,5,6]. To address these requirements, this paper presents an integrated solution combining enhanced reinforcement learning with specialized end-effector control for the Franka Emika Panda robotic arm.
Recent advances in deep reinforcement learning have demonstrated promising results for robotic manipulation tasks [7], yet existing algorithms exhibit notable limitations when applied to strawberry harvesting scenarios. The high variability in fruit positioning [8], combined with frequent occlusions and delicate handling requirements, creates a challenging environment for conventional Deep Deterministic Policy Gradient and Soft Actor-Critic methods. These approaches typically suffer from training instability, slow convergence rates, and suboptimal performance in high-density picking scenarios, particularly when dealing with the seven-degree-of-freedom pose control required for precise end-effector alignment.
This paper introduces several key innovations to address the aforementioned challenges. The main contributions of this study are as follows:
  • A novel posture optimization reward function is developed to guide precise end-effector alignment with target strawberries, incorporating both spatial positioning and orientation constraints to ensure optimal grasping postures in cluttered environments.
  • The DRFW-TQC framework implements group-wise feature weighting and Dynamic L2 Regularization techniques within the Truncated Quantile Critics [9] algorithm to accelerate convergence and enhance training stability, effectively addressing the sparse reward problem inherent in complex picking scenarios.
  • The proposed framework further incorporates a transfer strategy [10] that systematically migrates both the experience replay buffer and the learned networks to complex operational conditions, significantly improving training efficiency in challenging complex conditions.

2. Materials and Methods

This section delineates the comprehensive methodology employed in developing the robotic strawberry picking system. We establish the simulation environment, detailing its mechanical validation and spatial distribution. Subsequently, we present the posture optimization reward function that integrates spatial positioning and orientation constraints. The characterization of the state and action spaces follows, specifying the joint angle parameters and motion control mechanisms. Finally, we describe the DRFW-TQC framework’s innovative contributions, which collectively address the challenges of training stability, convergence, and policy optimization in complex agricultural environments.

2.1. Strawberry Picking Simulation Environment Modeling

To meet strawberry harvesting precision requirements while overcoming the cost and practical constraints of real-world training, we present a robotic strawberry harvesting system integrating DRFW-TQC with precision end-effector control for the Franka Emika Panda manipulator, utilizing a high-fidelity simulation environment constructed with the Blender modeling software. The simulation environment has been rigorously validated to ensure its mechanical and spatial parameters fall within a 5% deviation from actual field measurements [11], with texture optimization balancing visual verisimilitude against computational efficiency. As shown in Figure 1a–d, the environment simulates a typical strawberry picking layout consisting of a Franka Emika Panda manipulator, strawberry stalks, strawberry leaves, a ridge field, and multiple mature and immature strawberries in red and green hues, respectively. The simulation environment incorporates stochastic spatial distribution patterns, whereby individual strawberries and leaves are assigned randomized positional coordinates and quaternion-based orientations; together, these elements generate random, naturalistic occlusion scenarios. Each part of the strawberry plant is equipped with a corresponding collision detection model that accurately reflects its real-world mechanical properties. The environment is assembled modularly: independent environmental elements are combined within a common coordinate system and positioned according to the spatial layout of actual plantings.
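To make the scene assembly concrete, the following minimal sketch shows how such a modular environment could be built in PyBullet. Only the Franka Panda URDF ships with pybullet_data; the strawberry URDF path, ridge dimensions, and the spawning helper are illustrative assumptions rather than the authors’ actual asset pipeline.

```python
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

# The Franka Emika Panda manipulator used throughout the paper.
panda = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

def spawn_strawberry(urdf_path, ridge_center, ridge_extent):
    """Place one strawberry at a random position and quaternion orientation
    on the ridge (urdf_path is a placeholder asset)."""
    pos = ridge_center + np.random.uniform(-ridge_extent / 2, ridge_extent / 2, size=3)
    orn = p.getQuaternionFromEuler(np.random.uniform(-np.pi, np.pi, size=3).tolist())
    return p.loadURDF(urdf_path, basePosition=pos.tolist(), baseOrientation=orn)

ridge_center = np.array([0.5, 0.0, 0.15])
ridge_extent = np.array([0.3, 0.6, 0.05])
strawberries = [spawn_strawberry("strawberry.urdf", ridge_center, ridge_extent)
                for _ in range(4)]
```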

2.2. Posture Optimization Reward Function

In order to facilitate the precise alignment of the end-effector with the target strawberries, whilst incorporating both spatial positioning and orientation constraints, the posture optimization reward function is meticulously engineered within the DRFW-TQC framework. This sophisticated reward formulation directly supports the stringent success criteria established for the strawberry picking task, where completion requires the successful harvesting of all target fruits in the scene. The picking success evaluation incorporates multiple precision metrics, including the following:
  • The attitude alignment condition is satisfied.
  • The end-effector contacts the target strawberry.
  • The gripper jaws successfully close.
As shown in Figure 2c, the task is divided into two phases: approach and grasping. The robotic arm calculates the distance d_t = ||P_i − T_i|| between the end-effector position P_i and the target strawberry position T_i and moves toward the target until d_t falls below a preset threshold d_ϕ while adjusting its attitude [12]. As shown in Figure 1e–h, the robotic arm approaches each strawberry in sequence. The attitude alignment condition is met when the angular error for grasping, θ_error, satisfies
\theta_{\mathrm{error}} = \pi - \min\left(\theta_{P},\, \theta_{-P}\right),
where θ_P = 2 arccos|q_{P_i} · q_{T_i}| and θ_{−P} = 2 arccos|q_{−P_i} · q_{T_i}| represent the quaternion angular errors between the current orientation q_{P_i}, or its opposite q_{−P_i}, and the target orientation q_{T_i}. When θ_error is less than the preset value, the attitude alignment condition is satisfied [13]. Once the gripper jaws of the end-effector enter the preset grasping judgment area, the gripper jaws close. As shown in the orange boxes in Figure 1i–l, the robot arm gripper jaws close and pick each strawberry in order. Subsequently, if sensor verification confirms the formation of an effective grip on the strawberry stalk, the system determines that the grasping phase is complete.
By separating the operation into distinct approach and grasping phases, the system can first achieve precise positioning while avoiding collisions with adjacent fruits and foliage and then execute grasping within a proper judgment area to prevent bruising. The thresholds d_ϕ and θ_error ensure the end-effector maintains optimal positioning throughout both phases, theoretically minimizing damage risk while enabling efficient picking operations in dense strawberry canopies.
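The quaternion angular error used in the alignment check can be computed directly from the current and target orientations. The sketch below evaluates the angles for q_P and its opposite −q_P and returns their minimum, the double-cover-corrected angle entering the alignment condition; unit quaternions in PyBullet’s (x, y, z, w) convention are assumed.

```python
import numpy as np

def quaternion_angular_error(q_p, q_t):
    """Return min(theta_P, theta_-P): the angle between the current end-effector
    orientation q_p (considering both q_p and its opposite -q_p, which encode the
    same rotation) and the target orientation q_t."""
    dot = float(np.clip(np.dot(q_p, q_t), -1.0, 1.0))  # guard against rounding
    theta_p = 2.0 * np.arccos(dot)       # angle implied by  q_p
    theta_neg_p = 2.0 * np.arccos(-dot)  # angle implied by -q_p
    return min(theta_p, theta_neg_p)
```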
The reward function considers several factors comprehensively, including the distance from the target strawberry r dis , the success or failure of the grasping process r suc , the collision with the environment r col , and the timeout penalty r tru [14]. The reward function is designed to guide the robot to approach the target strawberry with the correct attitude while avoiding unnecessary collisions and timeout behaviors. The distance and attitude alignment reward function between the robot end-effector and the target strawberry for each step is modeled as follows:
r_{\mathrm{dis}} = -\delta_1 \left(d_t + d_\phi\right)^2 - \delta_2 \max\left(z_{T_i} - z_{P_i},\, 0\right) - \delta_3\, \theta_{\mathrm{error}}^2,
where d t denotes the distance between the end-effector position P i and the target strawberry position T i . d ϕ is a distance threshold that prevents the agent from wandering near the edge of the strawberry. The terms z T i and z P i represent the positions on the Z-axis of the strawberry and the end-effector, respectively [15]. This methodology introduces an additional penalty for height differences, with the objective of encouraging the robot to complete the picking task at a higher position. This adaptation is intended to facilitate the robot’s performance in ridge cultivation environments. The parameter θ error is the angular error between the current pose and the target pose. The coefficients δ 1 , δ 2 , and δ 3 are weighting factors that balance the learning process [16]. In the experiment, the values were set to 100, 150, and 20, respectively.
To prevent the robotic arm from excessively delaying the execution of the task, a timeout penalty r tru is introduced. The task is considered truncated if the number of steps per episode exceeds 500. Each step performed before it is truncated or the task is completed is considered a timeout penalty. This design prevents the agent from wandering on invalid paths for extended periods. The timeout penalty for each step is defined as
r_{\mathrm{tru}} = -2 \quad \text{if not truncated or completed}.
Collision detection is implemented using the pybullet.getContactPoints function [17], which penalizes the robot arm for collisions with non-target strawberries. The collision penalty r col for each step is defined as
r_{\mathrm{col}} = \begin{cases} -25 & \text{if contact with a non-target strawberry} \\ -\mathrm{risk} & \text{if contact with obstacles}, \end{cases}
where risk represents the penalty severity for critical collisions that may damage the robotic system or crops. The hierarchical penalty structure ensures stronger constraints for high-risk interactions while allowing limited tolerance for non-target strawberry proximity during manipulation.
When the target strawberry meets the success criteria for picking, the reward function provides a significant reward to guide the agent toward learning an effective picking strategy. The success reward r suc for each step is calculated as
r_{\mathrm{suc}} = \sum_{T_i \in T^{\mathrm{sort}}} R_i \quad \text{if success},
where T^sort = [T_1, T_2, …, T_N] represents the list of target strawberries sorted by their distance d_t from the end-effector, as shown in Figure 2b. The final reward function r_t for each step is the sum of all components:
r_t = r_{\mathrm{suc}} + r_{\mathrm{col}} + r_{\mathrm{tru}} + r_{\mathrm{dis}}.
By incorporating multiple factors such as distance, collision, and grasping success, the reward function enables the robot to learn an effective strategy for strawberry picking and gradually optimize the picking process.
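A minimal sketch of the per-step reward composition follows, using the coefficient values reported for Equation (2) (δ_1 = 100, δ_2 = 150, δ_3 = 20). The sign conventions of the penalty terms, the magnitude of the critical-collision penalty risk, and the per-fruit success rewards R_i are illustrative assumptions.

```python
DELTA1, DELTA2, DELTA3 = 100.0, 150.0, 20.0

def distance_attitude_reward(d_t, d_phi, z_target, z_effector, theta_error):
    # Penalize distance to the target, being below the fruit, and misalignment.
    return (-DELTA1 * (d_t + d_phi) ** 2
            - DELTA2 * max(z_target - z_effector, 0.0)
            - DELTA3 * theta_error ** 2)

def timeout_penalty(truncated, completed):
    # -2 for every step taken before the episode is truncated or completed.
    return 0.0 if (truncated or completed) else -2.0

def collision_penalty(hit_nontarget, hit_obstacle, risk=100.0):
    if hit_nontarget:
        return -25.0
    if hit_obstacle:
        return -risk  # assumed severity for critical collisions
    return 0.0

def success_reward(success, per_fruit_rewards):
    return sum(per_fruit_rewards) if success else 0.0

def step_reward(success, per_fruit_rewards, hit_nontarget, hit_obstacle,
                truncated, completed, d_t, d_phi, z_target, z_effector, theta_error):
    return (success_reward(success, per_fruit_rewards)
            + collision_penalty(hit_nontarget, hit_obstacle)
            + timeout_penalty(truncated, completed)
            + distance_attitude_reward(d_t, d_phi, z_target, z_effector, theta_error))
```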

2.3. State and Action Space Characterization

The observation space consists of the joint angles of the robot arm and the positions of the target strawberries, s_t = [P_θ, T_i] ∈ S. As shown in Figure 2a, the joint angles P_θ = [θ_1, θ_2, θ_3, …, θ_7] ∈ ℝ^7 correspond to the current angular positions of the joints of the robot arm and constitute the internal state information of the robot. The end-effector position P_i and the strawberry positions T_i = [x_i, y_i, z_i] ∈ ℝ^3 are both acquired through the pybullet.getLinkState function, providing the 3D spatial information needed to determine the picking action [18]. In practical applications, when the robot completes the picking task for the current target strawberry, the system re-selects the next target according to distance so that picking proceeds sequentially, which avoids the state-space explosion that would result from considering all strawberries simultaneously [19,20].
As shown in Figure 2a, the action space is represented by the incremental movements of each joint, a_t = [a_1, a_2, …, a_7] ∈ A. For strawberry picking tasks, the action space encompasses the motion control of the robot arm, enabling flexible movement in three-dimensional space. The motion space consists of the control angles of the seven joints, adjusted using the pybullet.setJointMotorControl2 function to reach and grasp the target strawberry [21].
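The observation and action interfaces map directly onto PyBullet calls, as sketched below. The strawberry link identifiers and the per-step joint increment are assumptions; only the pybullet functions named in the text are used.

```python
import numpy as np
import pybullet as p

NUM_JOINTS = 7  # the seven arm joints of the Franka Panda (gripper excluded)

def get_observation(robot_id, target_links):
    """Build s_t = [P_theta, T_i]: seven joint angles plus each target
    strawberry's 3D position. target_links is a list of (body_id, link_index)."""
    joint_angles = [p.getJointState(robot_id, j)[0] for j in range(NUM_JOINTS)]
    positions = [p.getLinkState(body, link)[0] for body, link in target_links]
    return np.concatenate([joint_angles, np.ravel(positions)]).astype(np.float32)

def apply_action(robot_id, action, max_delta=0.05):
    """Apply the incremental joint commands a_t = [a_1, ..., a_7]."""
    for j in range(NUM_JOINTS):
        current = p.getJointState(robot_id, j)[0]
        p.setJointMotorControl2(robot_id, j, controlMode=p.POSITION_CONTROL,
                                targetPosition=current + max_delta * float(action[j]))
```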

2.4. DRFW-TQC Framework for Training Optimization

The DRFW-TQC framework addresses the critical challenges of slow convergence and training instability in complex strawberry harvesting scenarios through two key technical innovations. By incorporating the Group-Wise Feature Weighting Network in the Truncated Quantile Critics algorithm, abbreviated as TQC, the system prioritizes relevant state features during policy optimization, significantly improving exploration efficiency at the early stage of training, while Dynamic L2 Regularization techniques prevent overfitting to sparse reward signals. The effectiveness of group-wise feature weighting is demonstrated by the cumulative reward at the beginning of the training. Similarly, stable growth in the cumulative reward with low variance indicates effective regularization against sparse reward overfitting.

2.4.1. Truncated Quantile Critics Algorithm

The selection of TQC as the base algorithm is motivated by its distinctive advantages over alternative distributional RL approaches in continuous control domains. TQC’s quantile truncation mechanism offers superior computational efficiency compared to distributional RL approaches built on fixed quantile optimization [9]. In contrast to Quantile Regression DQN and the Implicit Quantile Network, which require discretized action spaces, TQC inherently accommodates continuous action spaces, thereby avoiding the loss of control precision that action discretization introduces. This property is critical for robotic manipulation tasks that require precise motor control. The DRFW-TQC framework’s design intrinsically leverages TQC’s quantile truncation mechanism to address the critical challenges of slow convergence and training instability in complex strawberry harvesting scenarios.
For balancing exploration and exploitation, the TQC algorithm enhances value function estimation through quantile truncation while maintaining the maximum entropy framework. The foundation lies in its reformulated objective function:
J(\alpha) = \mathbb{E}_{\mathcal{D}, \pi_\phi}\left[ -\log \alpha \cdot \left( \log \pi_\phi(a_t \mid s_t) + \mathcal{H}_T \right) \right],
where D is the experience replay buffer, and π_ϕ denotes a parameterized stochastic policy that maps states to probability distributions over actions. This objective automatically adjusts the temperature parameter α to maintain policy stochasticity near the target entropy H_T, ensuring adequate exploration during policy optimization.
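In practice the temperature is usually optimized through its logarithm, as in common SAC/TQC implementations; the sketch below follows that convention, which is an implementation assumption rather than a detail stated in the paper.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)        # learnable log-temperature
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_prob_batch: torch.Tensor, target_entropy: float) -> float:
    """One gradient step that pushes the policy entropy toward H_T."""
    alpha_loss = -(log_alpha * (log_prob_batch + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()  # current temperature alpha
```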
Initially, as shown in the upper right part of Figure 3, each of the N target critic networks generates M quantile-based approximations (atoms) of the return distribution for the next state–action pair (s_{t+1}, a_{t+1}) [22]. These atoms are aggregated into a unified set:
Z(s_{t+1}, a_{t+1}) = \left\{ \theta_{\bar{\psi}_n}^{m}(s_{t+1}, a_{t+1}) \right\},
where n ∈ [1, …, N] and m ∈ [1, …, M], and θ_{ψ̄_n}^m(s_{t+1}, a_{t+1}) represents the mth atom output by the nth target critic network. The approximations in this set are then sorted in ascending order, yielding an ordered sequence. To mitigate overestimation bias, the right tail of this combined distribution is truncated by discarding the largest approximations so that only G remain. The retained G approximations are used to construct the target distribution. For each retained atom z_{(i)}(s_{t+1}, a_{t+1}), which serves as the next-state quantile, the corresponding target quantile is computed as [23]
y_i(s_t, a_t) = r(s_t, a_t) + \zeta \left[ z_{(i)}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right],
where ζ ∈ [0, 1) is the discount factor, and r(s_t, a_t) is the immediate reward obtained after taking action a_t in state s_t. As shown in the lower right part of Figure 3, the target quantile is used later in the computation of the critic loss.
The TQC algorithm integrates the maximum entropy reinforcement learning framework with the output of the quantile critic. The actor network’s objective function is given by
J_\pi(\phi) = \mathbb{E}_{\mathcal{D}, \pi_\phi}\left[ \alpha \log \pi_\phi(a_t \mid s_t) - \frac{1}{NM} \sum_{n,m} \theta_{\psi_n}^{m}(s_t, a_t) \right],
where θ_{ψ_n}^m(s_t, a_t) represents the mth atom output by the nth current critic network.
While the critics are trained on truncated targets, the actor network is optimized using the full, untruncated distribution to avoid compounding the truncation effect. This decoupling of value learning and policy updates is a critical design choice that contributes to the method’s robustness and performance.
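The pooling, sorting, and truncation of next-state atoms can be summarized in a few tensor operations. The sketch below assumes a fixed number of atoms dropped per network, mirroring common TQC implementations; shapes and argument names are illustrative.

```python
import torch

def truncated_targets(rewards, next_atoms, next_log_prob, alpha, gamma, drop_per_net):
    """Compute the truncated targets of Equation (9).
    rewards: [batch], next_atoms: [batch, n_nets, m_atoms], next_log_prob: [batch]."""
    batch, n_nets, _ = next_atoms.shape
    pooled = next_atoms.reshape(batch, -1)            # pool all N*M atoms
    sorted_atoms, _ = torch.sort(pooled, dim=1)       # ascending order
    keep = pooled.shape[1] - drop_per_net * n_nets    # G retained atoms
    z_kept = sorted_atoms[:, :keep]                   # truncate the right tail
    # y_i = r + gamma * (z_(i) - alpha * log pi(a'|s'))
    return rewards.unsqueeze(1) + gamma * (z_kept - alpha * next_log_prob.unsqueeze(1))
```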

2.4.2. Group-Wise Feature Weighting Network

To resolve the sluggish convergence and inadequate key-feature extraction that truncation induces in the TQC algorithm during early strawberry picking training, this paper introduces a Group-Wise Feature Weighting Network, abbreviated as GFWN, within the actor network [24]. Specifically, the input features are divided into K groups along the channel dimension. As shown in Figure 4, for each group x_k, the group normalization module computes
\hat{X}_k = \frac{x_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}},
where k ∈ [1, …, K], x_k denotes the features of the kth group, and μ_k and σ_k² denote the group-specific mean and variance, respectively.
A fully connected layer is applied to transform the normalized features, followed by a softmax activation function that generates a set of normalized attention weights w k .
The final output combines the weighted features with a learnable scaling parameter γ :
w_{\mathrm{out}} = \gamma \cdot \sum_{k=1}^{K} \omega_k \odot \hat{X}_k,
where ⊙ denotes element-wise multiplication, and the fused features w out are subsequently fed into the actor network to generate the action distribution π ϕ .
The Group-Wise Feature Weighting Network is an effective means of reducing interference from irrelevant features, enabling the model to focus on task-relevant information during optimization. By adjusting the feature weights during training, DRFW-TQC counteracts TQC’s loss of early critical information due to truncation, thereby ensuring good exploration efficiency at the early stage of training.
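One plausible realization of the GFWN block is sketched below in PyTorch: the input is split into K groups, each group is normalized, a small fully connected layer scores the groups, a softmax turns the scores into weights, and the weighted combination is scaled by a learnable γ. The per-group scalar weighting and the layer sizes are assumptions, since the paper does not fix them explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupWiseFeatureWeighting(nn.Module):
    def __init__(self, in_dim: int, num_groups: int = 4, eps: float = 1e-5):
        super().__init__()
        assert in_dim % num_groups == 0, "in_dim must be divisible by K"
        self.k = num_groups
        self.group_dim = in_dim // num_groups
        self.eps = eps
        self.score = nn.Linear(self.group_dim, 1)   # one attention score per group
        self.gamma = nn.Parameter(torch.ones(1))    # learnable scaling gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, in_dim] -> groups: [batch, K, in_dim / K]
        groups = x.view(x.shape[0], self.k, self.group_dim)
        mu = groups.mean(dim=-1, keepdim=True)
        var = groups.var(dim=-1, unbiased=False, keepdim=True)
        x_hat = (groups - mu) / torch.sqrt(var + self.eps)          # group normalization
        weights = F.softmax(self.score(x_hat).squeeze(-1), dim=-1)  # w_k, shape [batch, K]
        weighted = weights.unsqueeze(-1) * x_hat                    # w_k applied per group
        return self.gamma * weighted.sum(dim=1)                     # fused features
```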

2.4.3. Dynamic L2 Regularization

In order to address the issue of critic networks overfitting to sparse reward signals in strawberry picking training, which leads to poor generalization on the unseen state–action pairs and training instability, this paper introduces quantile Dynamic L2 Regularization, abbreviated as DL2 [25]. This regularization mechanism dynamically balances exploration and convergence by automatically adjusting its strength based on the training phase [26].
As demonstrated in the lower right quadrant of Figure 3, the critic network in TQC updates its parameters by minimizing the quantile Huber loss between the current quantiles and the target quantiles [27]. Specifically, given the current quantiles and the target quantiles from Equation (9), the quantile Huber loss is computed as
J_Q(\psi_n) = \frac{1}{GM} \sum_{i=1}^{G} \sum_{m=1}^{M} \rho_{\tau_m}^{H}\left( y_i(s_t, a_t) - \theta_{\psi_n}^{m}(s_t, a_t) \right),
where y_i(s_t, a_t) is the target quantile, θ_{ψ_n}^m(s_t, a_t) is the current quantile output by the nth critic network, τ_m = (2m − 1)/(2M) is the quantile position, and ρ_{τ_m}^H(μ) is the quantile Huber loss defined as
\rho_{\tau_m}^{H}(\mu) = \left| \tau_m - \mathbb{I}(\mu < 0) \right| \cdot L_H(\mu),
with L H ( μ ) given by
L_H(\mu) = \begin{cases} \frac{1}{2}\mu^2 & \text{if } |\mu| \le \delta \\ \delta \left( |\mu| - \frac{1}{2}\delta \right) & \text{otherwise}, \end{cases}
where δ is a threshold parameter.
In addressing the problem of overfitting to sparse reward signals, a Dynamic L2 Regularization term is incorporated into the critic loss function. This regularization term penalizes large differences between different quantile estimates, encouraging smoother value distribution estimates while preventing overfitting to individual quantiles. The Dynamic L2 Regularization term is defined as
L_{L2} = \lambda \cdot \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left\| \theta_{\psi_i}(s_t, a_t) - \theta_{\psi_j}(s_t, a_t) \right\|_2^2,
where ‖·‖₂² denotes the squared L2 norm, implemented as an MSE loss, and θ_{ψ_i}(s_t, a_t) denotes the quantile estimates output by the ith critic network. λ is a dynamic regularization coefficient that controls the penalty strength and is adjusted during training as
\lambda = \max\left( 0,\; 10^{-3} \cdot \left( 1 - \frac{t}{T} \right) \right),
with t being the current training step and T being a predefined step threshold. The total critic loss function, incorporating both the quantile Huber loss and the Dynamic L2 Regularization term, is given by
J_Q^{\mathrm{total}}(\psi_n) = J_Q(\psi_n) + L_{L2}.
By introducing the DL2, the model’s tendency to overfit to training noise is suppressed in the early stages of training, while numerical instability caused by excessive constraints on the critic loss function is prevented in the later stages [28]. This ensures robust and efficient training throughout the learning process.
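A sketch of the regularizer and its decay schedule is given below: pairwise squared differences between the critics’ quantile estimates are averaged and scaled by a coefficient that decays linearly to zero. The base coefficient of 10^−3 mirrors the schedule above and should be treated as an assumption.

```python
import torch

def dynamic_l2_penalty(quantiles: torch.Tensor, step: int, step_threshold: int,
                       base_coef: float = 1e-3) -> torch.Tensor:
    """quantiles: [batch, n_critics, m_atoms]; returns the DL2 penalty term."""
    lam = max(0.0, base_coef * (1.0 - step / step_threshold))  # decaying coefficient
    n = quantiles.shape[1]
    penalty = quantiles.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            # squared L2 (MSE) difference between critic i and critic j
            penalty = penalty + torch.mean((quantiles[:, i] - quantiles[:, j]) ** 2)
    return lam * 2.0 / (n * (n - 1)) * penalty

# total critic loss = quantile Huber loss + dynamic_l2_penalty(...)
```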

2.4.4. Multi-Objective Strawberry Picking Transfer Strategy

The large state spaces of complex environments often hinder efficient model learning; this paper addresses the issue through a transfer strategy that enables rapid adaptation to such environments [29]. As shown in Figure 5, a diverse set of offline data and network parameters is generated in the simple simulation environment, covering a wide range of task scenarios. Upon transfer to the complex environment, the model leverages the knowledge acquired in the simple environment to adapt quickly, thereby improving the success rate of strawberry picking.
The offline dataset is constructed from high-quality trajectory data generated by the reinforcement learning model, with each data point consisting of the state s t , action a t , reward r t , and next state s t + 1 . The experimental results demonstrate that this approach significantly enhances the training efficiency of the model in complex environments.
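Using the stable-baselines3 library referenced in the experimental setup, the transfer can be expressed with the replay-buffer and parameter save/load utilities of the sb3-contrib TQC implementation, as sketched below. The environment constructor make_strawberry_env and the file names are hypothetical placeholders.

```python
from sb3_contrib import TQC  # TQC implementation from the sb3-contrib package

# Hypothetical constructors for the simple (4 fruits) and complex (8 fruits) scenes.
simple_env = make_strawberry_env(num_targets=4)
complex_env = make_strawberry_env(num_targets=8)

# Stage 1: train in the simple environment and save both transfer artifacts.
model = TQC("MlpPolicy", simple_env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("drfw_tqc_simple")
model.save_replay_buffer("simple_replay_buffer")

# Stage 2: reload the networks and replay buffer on the complex environment.
model = TQC.load("drfw_tqc_simple", env=complex_env)
model.load_replay_buffer("simple_replay_buffer")
model.learn(total_timesteps=500_000, reset_num_timesteps=False)
```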

3. Experimental Design and Evaluation Metrics

The task involves the sequential picking of multiple target strawberries, with the maximum number of training steps determined by the number of target strawberries. Each experiment runs for 1980 steps per epoch, with the results recorded and evaluated to ensure the stability and reproducibility of the experimental outcomes.
A series of ablation experiments was conducted to evaluate the advantages of DRFW-TQC on this picking task, specifically isolating the contributions of DL2 and GFWN. For comparison, two widely used algorithms for continuous action space tasks were selected: Soft Actor-Critic (SAC) and Deep Deterministic Policy Gradient (DDPG), both of which have demonstrated strong performance in similar domains.
All strategy networks and value networks in this experiment consist of four hidden layers, each with 256 neurons [30]. The remaining parameters follow the default configurations of the stable-baselines3 library [31], which are based on extensive experimentation and tuning. These configurations provide a standardized setup applicable to a wide range of tasks, ensuring the fairness and consistency of the experiments.
An episode is defined as the robotic arm’s progression from its initial state to termination, truncation, or successful completion of the task. To assess the performance of the algorithms rigorously, five quantitative metrics are adopted [32,33,34,35]. These are as follows:
  • Average cumulative reward (AR), which measures the average of the cumulative rewards in the final ten epochs;
  • Picking success rate (PS), which is the ratio of the number of successful episodes to the total episodes in the last ten epochs;
  • Angular error (AE), or θ error , is used to evaluate the accuracy of the end-effector’s grasping in the last ten epochs;
  • First Success Step (FS), which is the total number of steps required for the first successful pick, indicates exploration efficiency;
  • Timeout count (TO), the total number of truncated episodes, reflects the efficiency of policy convergence.
To ensure rigorous validation of the proposed DRFW-TQC framework’s performance, a comprehensive statistical analysis methodology was employed. Normality assumptions were first verified using the Shapiro–Wilk test statistic. For datasets satisfying normality assumptions, variance homogeneity was subsequently assessed through Levene’s test. When both normality and homoscedasticity conditions were met, ANOVA was employed for omnibus testing. For non-normal distributions or heterogeneous variances, the non-parametric Kruskal–Wallis test was utilized instead:
H_{KW} = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1),
where N represents the total sample size across all k groups, n i denotes the sample size of the ith group, and R i corresponds to the sum of ranks for observations in the ith group when all N observations are jointly ranked.
Significant omnibus test results at the 0.05 threshold level initiated post hoc analyses, employing Tukey’s HSD test for parametric comparisons or Mann–Whitney U tests with Bonferroni correction for non-parametric cases. The effect magnitude quantification adopted Cohen’s d effect size measure:
d = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}},
where X̄_1 and X̄_2 represent the sample means of the two independent groups, n_1 and n_2 are the sample sizes, and s_1² and s_2² are the sample variances of the respective groups. Results were interpreted in terms of both statistical significance, with p-values below 0.01 indicating significant differences, and effect size, with Cohen's d values exceeding 0.2 representing small effects and those above 0.5 indicating medium effects. This analytical approach ensured robust statistical inference while accommodating the varied distributional properties of the different performance metrics.
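One way to implement this testing pipeline is with SciPy's statistical functions, as sketched below; the per-metric samples passed to the functions are placeholders, and the Cohen's d helper uses the pooled standard deviation defined above.

```python
import numpy as np
from scipy import stats

def omnibus_test(groups, alpha=0.05):
    """Shapiro-Wilk normality check, Levene's homoscedasticity check, then
    ANOVA if both hold and Kruskal-Wallis otherwise. groups: list of 1-D arrays."""
    normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    homoscedastic = normal and stats.levene(*groups).pvalue > alpha
    if normal and homoscedastic:
        return stats.f_oneway(*groups)   # parametric omnibus test
    return stats.kruskal(*groups)        # non-parametric fallback

def cohens_d(x1, x2):
    """Effect size with pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1))
                        / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / pooled_sd
```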

4. Results and Discussion

This section presents a comprehensive analysis with comparative simulation studies that quantify improvements in five metrics relative to baseline methods. Furthermore, this section examines the framework’s transfer learning capabilities in complex environments and conducts hyperparameter sensitivity analyses to elucidate the contribution of key algorithmic components.

4.1. Results of Simulation Experiments

As illustrated in Figure 6a,b and Table 1, the proposed DRFW-TQC framework, which integrates the DL2 and the GFWN, achieves superior performance across multiple metrics compared to the baseline methods. The experimental results reveal that DRFW-TQC attains the highest average reward and the smallest angular error in the final training phase, demonstrating its robust optimization capability. Notably, while the baseline TQC achieves competitive cumulative rewards in later epochs, its picking success rate is only 73.39%, indicating suboptimal learning efficiency. In contrast, TQC+DL2 significantly enhances early exploration and stabilizes the training process, reducing the steps required for the first success to 16,420 while simultaneously improving the average cumulative reward to 242,501, suggesting reduced overfitting to sparse reward signals. Further refinement with GFWN yields additional improvements in long-term performance, elevating the picking success rate to 85.11% while maintaining high average cumulative rewards. As shown in the reward curves in Figure 6b, TQC+GFWN exhibits superior early average cumulative reward compared to competing approaches, underscoring its effectiveness in improving exploration and key-feature extraction during early strawberry picking training.
Comparative analysis with SAC and DDPG highlights DRFW-TQC’s superior convergence speed and stability, as evidenced by its consistently higher success rate progression. DDPG exhibits diminished performance growth, while SAC shows substandard cumulative reward performance; DRFW-TQC, in contrast, combines accelerated initial learning with stable asymptotic performance. This validates the efficacy of DRFW-TQC in balancing exploration and exploitation as well as in maintaining robust feature representations.
The computational analysis demonstrates that the proposed enhancements introduce measurable yet justifiable processing requirements, with TQC+GFWN exhibiting minimal overhead at 8339 s, merely 0.7% longer than the baseline TQC’s 8281 s due to lightweight feature weighting operations. The TQC+DL2 configuration shows a more significant impact at 12,919 s, representing a 56% increase primarily attributed to dynamic regularization calculations during critic updates. The DRFW-TQC framework combining both components requires 13,041 s, exhibiting a 57% longer processing time compared to the baseline TQC but delivering greater performance improvements.
In the course of model training for the strawberry picking task, the final grasping point location of the end-effector for each episode was visualized for each algorithm. Figure 7 illustrates the 3D distribution of the grasping points, where red points denote correct locations, black points denote incorrect locations, and orange, green, and blue points represent projections on the XY, XZ, and YZ planes, respectively. As illustrated in Figure 7f, DRFW-TQC exhibits a higher concentration of red correct grasping points near the target area compared to TQC+GFWN and TQC+DL2 in Figure 7d,e, accompanied by fewer black error points. In contrast, DDPG and SAC in Figure 7a,b exhibit dispersed black error points, indicating lower localization accuracy. As illustrated in Figure 7c, the orange and blue points in TQC’s XY and YZ projections are more dispersed, suggesting that its final grasping point performance lags behind that of DRFW-TQC.
The proposed DRFW-TQC framework demonstrates statistically significant improvements across all evaluation metrics compared to the baseline methods. As shown in Table 2, the experimental results demonstrate consistent and statistically significant improvements of the DRFW-TQC framework over the SAC, DDPG, and baseline TQC approaches across all evaluation metrics. In terms of task completion quality, DRFW-TQC achieves a superior PS with an effect size of Cohen’s d equal to 0.38, where the p-value of 0.005 confirms statistical significance at the 1% level. The AR optimization performance of DRFW-TQC substantially outperforms TQC, achieving a 29.1% higher cumulative reward with extreme significance at p = 0.0007 and an effect size of 0.26. The AE results further substantiate DRFW-TQC’s enhanced grasping accuracy, showing a 20.3% reduction in angular error with an effect size of 0.46, extremely significant at p < 0.001. The training efficiency analysis reveals particularly compelling advantages for DRFW-TQC, with a 42.7% reduction in FS (effect size 0.71, p < 0.001).
This optimization in learning is accompanied by improved stability, as evidenced by a 19.0% reduction in TO, showing a medium effect size of 0.57 with p < 0.001. These comprehensive results establish DRFW-TQC as a superior approach to robotic strawberry harvesting, offering extremely significant improvements in both operational precision and learning efficiency compared to the baseline TQC method.

4.2. Experimental Results on Transfer Effects

The experimental results in Table 3 demonstrate that DRFW-TQC performance is significantly enhanced by the transfer strategy when moving from a simple environment with 4 strawberries to a complex environment with 8 strawberries. DRFW-TQC with the transfer strategy achieves a substantial reduction in FS and TO and reduces the AE to 0.091 rad. Moreover, the AR is improved by 51.2%, and the PS rises to 0.891. These results validate the effectiveness of the transfer strategy in complex environments, significantly improving DRFW-TQC’s efficiency and stability and providing strong support for the development of automated strawberry picking systems.
Decomposing the successes and failures for each strawberry contacted by the robot arm shows that the accuracy characteristics remained consistent across the experimental conditions. As shown in Figure 8a, the transfer strategy configuration achieves angular errors below 0.05 radians in 76.42% of attempts, with only 0.55% exceeding this threshold. Non-target picking maintains comparable precision, with 22.70% of attempts in the low-error category. The non-transfer condition exhibits similar error distribution patterns, with 76.08% of target picks and 22.22% of non-target picks maintaining sub-0.05 radian errors in Figure 8b. The key distinction lies in the operational scale enabled by the transfer strategy, which facilitates a 2.6-fold increase in total picking attempts while preserving the angular precision characteristics. This scaling effect is particularly evident in the absolute numbers of high-precision attempts, which increase from 14,632 to 38,425 for target strawberries. The proportional increase in non-target picks from 4274 to 11,411 similarly reflects this expanded operational scope rather than any degradation in discrimination capability. These findings suggest that the transfer strategy preserves the robotic arm’s precision when in contact with strawberries while enhancing its capacity to operate in complex environments.

4.3. Experimental Results for Hyperparameters

4.3.1. Hyperparameter Analysis of Distance and Attitude Alignment Reward Function

Due to the stochastic nature of reinforcement learning, no single parameter set consistently outperforms all others in every experimental run; however, the selected weighting coefficients δ_1 = 100, δ_2 = 150, and δ_3 = 20 in Equation (2) demonstrate robust performance across multiple evaluation metrics while exhibiting relative insensitivity to small parameter variations. The choice of δ_1 = 100 effectively balances proximity incentives: higher values in the range 500 to 1000 reduce the AR by approximately 30%, while lower values in the range 50 to 95 diminish the PS to between 0.7819 and 0.8495. The optimal value of δ_2, which weights the height-difference term and thereby constrains the end-effector's approach attitude, is determined to be 150, because values below 10 increase the AE by 12.6%, while values above 500 unnecessarily increase the TO by 12.2%. Setting δ_3 = 20 reduces the AE by 8.8% to 27.5% compared to configurations with no weighting or with excessive weighting above 1000.

4.3.2. Hyperparameter Analysis of GFWN

As shown in Figure 9a,b, our architectural analysis reveals critical insights into parameter selection for the GFWN framework. The group number K demonstrates a non-monotonic relationship with model performance, where K = 4 emerges as the optimal configuration. This setting captures sufficient environmental complexity without introducing the detrimental effects observed at K = 16 , where excessive subdivision leads to catastrophic overfitting. The superiority of K = 4 persists throughout the training, showing both higher initial success rates and more stable convergence behavior.

5. Conclusions

Our study presents an efficient and accurate solution for strawberry picking automation. We enhance the TQC method by introducing DL2 and GFWN, enabling efficient migration from simple to complex environments through a transfer strategy. The experimental results demonstrate that DRFW-TQC achieves a significantly higher average picking success rate in simulated environments compared to existing methods, with improved stability through the transfer strategy.
The effectiveness of the proposed method is demonstrated through comprehensive experiments in simulated ridge cultivation environments. Comparative evaluations against baseline approaches show substantial improvements in picking success rates and operational efficiency, particularly in high-density scenarios with multiple target strawberries. These results validate the framework’s potential for practical deployment in agricultural robotics applications, while the methodological innovations contribute to the broader field of reinforcement learning for robotic manipulation tasks.
While the simulation results validate the framework’s technical feasibility, we acknowledge the current limitation regarding real-world validation. Future research will focus on two critical steps for practical deployment: field testing under varying environmental conditions and the systematic evaluation of sim-to-real transfer performance. The simulation environment has been carefully designed with physically accurate parameters, including strawberry morphology and sensor noise models, to facilitate this transition. Subsequent studies will further explore the framework’s applicability to diverse agricultural harvesting tasks beyond strawberry picking.

Author Contributions

Conceptualization, A.Z., Z.F., H.D. and K.L.; methodology, A.Z. and Z.F.; data curation, A.Z., Z.F. and Z.L.; formal analysis, A.Z. and Z.L.; investigation, A.Z., Z.F., H.D. and K.L.; validation, A.Z., H.D. and Z.L.; visualization, H.D. and Z.L.; writing—original draft, A.Z. and Z.F.; writing—review and editing, A.Z. and H.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 32272498), the National Key Research and Development Program of China (No. 2022YFD1400100), the University Synergy Innovation Program of Anhui Province (No. GXXT-2022-041), the Anhui Provincial Quality Engineering Project of Higher Education Institutions (2022jyxm464), and the Anhui Agricultural University Introduction and Stabilization of Talents Research Funding (No. yj2020-74). These funding projects provided a good study environment and experimental equipment for this work.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available at https://github.com/pinanzh/DRFW-TQC, accessed on 19 May 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TQC: Truncated Quantile Critics
GFWN: Group-Wise Feature Weighting Network
DL2: Dynamic L2 Regularization
DDPG: Deep Deterministic Policy Gradient
SAC: Soft Actor-Critic
AR: average cumulative reward
PS: picking success rate
AE: angular error
FS: First Success Step
TO: timeout count

References

  1. Gunderman, A.; Collins, J.; Myers, A.; Threlfall, R.; Chen, Y. Tendon-Driven Soft Robotic Gripper for Blackberry Harvesting. IEEE Robot. Autom. Lett. 2022, 7, 2652–2659. [Google Scholar] [CrossRef]
  2. Li, Y.; Feng, Q.; Zhang, Y.; Peng, C.; Ma, Y.; Liu, C.; Ru, M.; Sun, J.; Zhao, C. Peduncle collision-free grasping based on deep reinforcement learning for tomato harvesting robot. Comput. Electron. Agric. 2024, 216, 108488. [Google Scholar] [CrossRef]
  3. Miao, Z.; Chen, Y.; Yang, L.; Hu, S.; Xiong, Y. A Fast Path-Planning Method for Continuous Harvesting of Table-Top Grown Strawberries. IEEE Trans. AgriFood Electron. 2025, 3, 233–245. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Zhang, K.; Yang, L.; Zhang, D.; Cui, T.; Yu, Y.; Liu, H. Design and simulation experiment of ridge planting strawberry picking manipulator. Comput. Electron. Agric. 2023, 208, 107690. [Google Scholar] [CrossRef]
  5. Magistri, F.; Pan, Y.; Bartels, J.; Behley, J.; Stachniss, C.; Lehnert, C. Improving Robotic Fruit Harvesting Within Cluttered Environments Through 3D Shape Completion. IEEE Robot. Autom. Lett. 2024, 9, 7357–7364. [Google Scholar] [CrossRef]
  6. Rizwan, A.; Khan, A.N.; Ibrahim, M.; Ahmad, R.; Iqbal, N.; Kim, D.H. Optimal environment control and fruits delivery tracking system using blockchain for greenhouse. Comput. Electron. Agric. 2024, 220, 108889. [Google Scholar] [CrossRef]
  7. Liu, Y.; Ping, Y.; Zhang, L.; Wang, L.; Xu, X. Scheduling of decentralized robot services in cloud manufacturing with deep reinforcement learning. Robot. Comput.-Integr. Manuf. 2023, 80, 102454. [Google Scholar] [CrossRef]
  8. Li, T.; Xie, F.; Zhao, Z.; Zhao, H.; Guo, X.; Feng, Q. A multi-arm robot system for efficient apple harvesting: Perception, task plan and control. Comput. Electron. Agric. 2023, 211, 107979. [Google Scholar] [CrossRef]
  9. Kuznetsov, A.; Shvechikov, P.; Grishin, A.; Vetrov, D. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In Proceedings of the International Conference on Machine Learning, Online, 13–18 July 2020; pp. 5556–5566. [Google Scholar]
  10. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer learning in deep reinforcement learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef]
  11. Gronauer, S.; Kissel, M.; Sacchetto, L.; Korte, M.; Diepold, K. Using simulation optimization to improve zero-shot policy transfer of quadrotors. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10170–10176. [Google Scholar]
  12. Liu, Y.; Xu, H.; Liu, D.; Wang, L. A digital twin-based sim-to-real transfer for deep reinforcement learning-enabled industrial robot grasping. Robot. Comput.-Integr. Manuf. 2022, 78, 102365. [Google Scholar] [CrossRef]
  13. Al Ali, A.; Shi, J.-F.; Zhu, Z.H. Path planning of 6-DOF free-floating space robotic manipulators using reinforcement learning. Acta Astronaut. 2024, 224, 367–378. [Google Scholar] [CrossRef]
  14. Gao, Y.; Wu, J.; Yang, X.; Ji, Z. Efficient hierarchical reinforcement learning for mapless navigation with predictive neighbouring space scoring. IEEE Trans. Autom. Sci. Eng. 2023, 165, 677–688. [Google Scholar] [CrossRef]
  15. Gan, Y.; Li, P.; Jiang, H.; Wang, G.; Jin, Y.; Chen, X.; Ji, J. A reinforcement learning method for motion control with constraints on an HPN Arm. IEEE Robot. Autom. Lett. 2022, 7, 12006–12013. [Google Scholar] [CrossRef]
  16. Goldenits, G.; Mallinger, K.; Raubitzek, S.; Neubauer, T. Current applications and potential future directions of reinforcement learning-based Digital Twins in agriculture. Smart Agric. Technol. 2024, 8, 100512. [Google Scholar] [CrossRef]
  17. Panerati, J.; Zheng, H.; Zhou, S.; Xu, J.; Prorok, A.; Schoellig, A.P. Learning to Fly—A Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 7512–7519. [Google Scholar]
  18. Ishmatuka, C.; Soesanti, I.; Ataka, A. Autonomous Pick-and-Place Using Excavator Based on Deep Reinforcement Learning. In Proceedings of the 2023 15th International Conference on Information Technology and Electrical Engineering (ICITEE), Chiang Mai, Thailand, 26–27 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 19–24. [Google Scholar]
  19. Yang, S.; Zhang, W.; Song, R.; Cheng, J.; Li, Y. Learning multi-object dense descriptor for autonomous goal-conditioned grasping. IEEE Robot. Autom. Lett. 2021, 6, 4109–4116. [Google Scholar] [CrossRef]
  20. Xie, F.; Guo, Z.; Li, T.; Feng, Q.; Zhao, C. Dynamic Task Planning for Multi-Arm Harvesting Robots Under Multiple Constraints Using Deep Reinforcement Learning. Horticulturae 2025, 11, 88. [Google Scholar] [CrossRef]
  21. He, Z.; Li, J.; Wu, F.; Shi, H.; Hwang, K.-S. Derl: Coupling decomposition in action space for reinforcement learning task. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 8, 1030–1043. [Google Scholar] [CrossRef]
  22. Shao, Y.; Zhou, H.; Zhao, S.; Fan, X.; Jiang, J. A Control Method of Robotic Arm Based on Improved Deep Deterministic Policy Gradient. In Proceedings of the 2023 IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 473–478. [Google Scholar]
  23. Zhang, Y.; Li, G.; Al-Ani, M. Robust learning-based model predictive control for wave energy converters. IEEE Trans. Sustain. Energy 2024, 15, 1957–1967. [Google Scholar] [CrossRef]
  24. Xiao, M.; Wang, D.; Wu, M.; Liu, K.; Xiong, H.; Zhou, Y.; Fu, Y. Traceable group-wise self-optimizing feature transformation learning: A dual optimization perspective. ACM Trans. Knowl. Discov. Data 2024, 18, 1–22. [Google Scholar] [CrossRef]
  25. Han, Z.; Yang, Y.; Zhang, C.; Zhang, L.; Zhou, J.T.; Hu, Q. Selective learning: Towards robust calibration with dynamic regularization. arXiv 2024, arXiv:2402.08384. [Google Scholar]
  26. Mysore, S.; Mabsout, B.; Mancuso, R.; Saenko, K. Regularizing action policies for smooth control with reinforcement learning. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1810–1816. [Google Scholar]
  27. Gupta, D.; Hazarika, B.B.; Berlin, M. Robust regularized extreme learning machine with asymmetric Huber loss function. Neural Comput. Appl. 2020, 32, 12971–12998. [Google Scholar] [CrossRef]
  28. Yan, H.; Shao, D. Enhancing Transformer Training Efficiency with Dynamic Dropout. arXiv 2024, arXiv:2411.03236. [Google Scholar]
  29. Lyle, C.; Rowland, M.; Dabney, W.; Kwiatkowska, M.; Gal, Y. Learning dynamics and generalization in deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 14560–14581. [Google Scholar]
  30. Jawaddi, S.N.A.; Ismail, A. Integrating OpenAI Gym and CloudSim Plus: A simulation environment for DRL Agent training in energy-driven cloud scaling. Simul. Model. Pract. Theory 2024, 130, 102858. [Google Scholar] [CrossRef]
  31. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  32. Gharakhani, H.; Thomasson, J.A.; Lu, Y. An end-effector for robotic cotton harvesting. Smart Agric. Technol. 2022, 2, 100043. [Google Scholar] [CrossRef]
  33. Huang, A.; Yu, C.; Feng, J.; Tong, X.; Yorozu, A.; Ohya, A.; Hu, Y. A motion planning method for winter jujube harvesting robotic arm based on optimized Informed-RRT* algorithm. Smart Agric. Technol. 2025, 10, 100732. [Google Scholar] [CrossRef]
  34. Cao, H.G.; Zeng, W.; Wu, I.C. Reinforcement learning for picking cluttered general objects with dense object descriptors. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6358–6364. [Google Scholar]
  35. Bi, A.; Zhang, C. Robot Arm Grasping based on Multi-threaded PPO Reinforcement Learning Algorithm. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence, Human-Computer Interaction and Robotics (AIHCIR), Hong Kong, China, 15–17 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 369–373. [Google Scholar]
Figure 1. The simulation environment for strawberry picking. (a–d) Side views of the simulated strawberry cultivation environment layout. (e–h) The robotic arm approaches each strawberry in sequence. (i–l) The robot arm gripper jaws close and pick each strawberry in order.
Figure 2. Schematic representation of the robotic strawberry picking process. (a) The action and state space of the robotic arm. (b) The selection of the order in which target strawberries will be picked. (c) The trajectory of the end-effector during the approach and grasping process.
Figure 3. Architecture of the DRFW-TQC framework integrating GFWN and DL2.
Figure 4. The internal composition of the actor network enhanced by integrating the GFWN.
Figure 5. Transfer strategy from simple to complex environment.
Figure 6. Picking success rate (a) and average cumulative reward (b) curves for each method in experiments with 4 target strawberries. The shaded area represents the standard error over 30 random seeds.
Figure 7. 3D coordinates and projections of the end-effector for each episode. (a–f) represent the spatial distribution of the final grasping point positions of the end-effector in each episode under the DDPG, SAC, TQC, TQC+GFWN, TQC+DL2, and DRFW-TQC algorithms, respectively.
Figure 8. Confusion matrix for each strawberry picking situation in contact with the robot arm under DRFW-TQC. (a) DRFW-TQC with the transfer strategy and 8 target strawberries. (b) DRFW-TQC without the transfer strategy and 8 target strawberries.
Figure 9. Picking success rate (a) and average cumulative reward (b) curves of the TQC+GFWN method for different numbers of groups K in the hyperparameter experiment with 4 target strawberries. The shaded area represents the standard error over 30 random seeds.
Table 1. Simulation results of various methods with 4 target strawberries. All values represent the mean ± standard error derived from 30 independent experimental trials.

Method | FS (×10^3) | TO | AE (rad) | AR (×10^4) | PS
TQC | 26.4 ± 3.7 | 294 ± 19 | 0.192 ± 0.015 | 20.87 ± 4.15 | 0.734 ± 0.060
TQC+DL2 | 16.4 ± 2.4 | 256 ± 16 | 0.175 ± 0.018 | 24.25 ± 3.93 | 0.807 ± 0.050
TQC+GFWN | 25.9 ± 5.1 | 271 ± 17 | 0.179 ± 0.012 | 22.69 ± 4.01 | 0.851 ± 0.031
DRFW-TQC | 15.1 ± 1.7 | 238 ± 17 | 0.153 ± 0.015 | 26.94 ± 3.62 | 0.851 ± 0.044
SAC | 35.8 ± 5.5 | 301 ± 9 | 0.171 ± 0.016 | 9.52 ± 1.93 | 0.827 ± 0.039
DDPG | 37.1 ± 4.7 | 419 ± 3 | 0.447 ± 0.031 | −16.54 ± 1.25 | 0.074 ± 0.023
Table 2. Statistical significance test results compared with SAC, DDPG, and baseline TQC.

Comparison | Metric | Relative Improvement | p-Value | Effect Size
DRFW-TQC vs. TQC | PS | +16.0% | 5.19 × 10^−3 | 0.38
DRFW-TQC vs. TQC | AR | +29.1% | 7.23 × 10^−4 | 0.26
DRFW-TQC vs. TQC | AE | −20.3% | 9.50 × 10^−12 | −0.46
DRFW-TQC vs. TQC | TO | −19.0% | 1.22 × 10^−9 | −0.57
DRFW-TQC vs. TQC | FS | −42.7% | 5.84 × 10^−11 | −0.71
DRFW-TQC vs. SAC | PS | +2.9% | 6.45 × 10^−8 | 0.09
DRFW-TQC vs. SAC | AR | +183.0% | 6.54 × 10^−21 | 0.93
DRFW-TQC vs. SAC | AE | −10.5% | 8.79 × 10^−4 | −0.20
DRFW-TQC vs. SAC | TO | −20.9% | 2.28 × 10^−22 | −0.84
DRFW-TQC vs. SAC | FS | −57.7% | 9.76 × 10^−20 | −0.93
DRFW-TQC vs. DDPG | PS | +1050.0% | 2.94 × 10^−94 | 3.56
DRFW-TQC vs. DDPG | AR | +262.9% | 1.41 × 10^−97 | 2.62
DRFW-TQC vs. DDPG | AE | −65.8% | 6.38 × 10^−84 | −2.19
DRFW-TQC vs. DDPG | TO | −43.2% | 4.86 × 10^−90 | −2.71
DRFW-TQC vs. DDPG | FS | −59.2% | 1.57 × 10^−31 | −1.15
Table 3. Transfer performance of DRFW-TQC with 8 target strawberries. All values represent the average of 30 independent experimental trials, with PS and AR reported as average ± standard error.

Condition | FS | TO | AE (rad) | AR (×10^4) | PS
With transfer strategy | 44 | 111 | 0.091 | 62.21 ± 4.22 | 0.891 ± 0.032
Without transfer strategy | 21,992 | 279 | 0.188 | 41.04 ± 5.35 | 0.774 ± 0.056
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
