A Deep Reinforcement Learning Strategy Combining Expert Experience Guidance for a Fruit-Picking Manipulator

Abstract: When using deep reinforcement learning algorithms for the path planning of a multi-DOF fruit-picking manipulator in unstructured environments, it is very difficult for the multi-DOF manipulator to obtain high-value samples at the beginning of training, resulting in low learning and convergence efficiency. Aiming to reduce inefficient exploration in unstructured environments, a reinforcement learning strategy combining expert experience guidance was first proposed in this paper. The ratio of expert experience to newly generated samples and the frequency of return visits to expert experience were studied in simulation experiments. One conclusion was that a ratio of expert experience that declined from 0.45 to 0.35 was more effective in improving the learning efficiency of the model than a constant ratio: compared to a constant expert experience ratio of 0.35, the success rate increased by 1.26%, and compared to a constant ratio of 0.45, the success rate increased by 20.37%. The highest success rate was achieved when the frequency of return visits was 15 in 50 episodes, an improvement of 31.77%. The results showed that the proposed method can effectively improve model performance and enhance learning efficiency at the beginning of training in unstructured environments. This training method has implications for the training process of reinforcement learning in other domains.


Introduction
Automatic fruit-picking systems based on a multi-DOF (degree of freedom) manipulator have become a major direction in fruit harvesting in order to increase efficiency and reduce production costs [1]. However, picking fruits with an automatic multi-DOF manipulator in complex natural environments is more difficult: cluttered branches seriously hinder the fruit-picking process, and the different ripening sequences of fruits add further difficulty. Therefore, picking-path planning in unstructured natural environments is one of the main research topics of automatic fruit-picking systems [2][3][4].
A variety of path-planning algorithms have been proposed, such as the A* algorithm [5,6], the ant colony algorithm [7][8][9], the raster algorithm [10], the artificial potential field algorithm [11,12], etc. However, these algorithms rely on real-time modelling of the multi-DOF manipulator and the environment; it is difficult to accurately model natural picking environments because of their variability, and the computational complexity of modelling increases exponentially with the number of DOFs. Deep reinforcement learning (DRL) [13][14][15] is a self-learning approach that enables an agent to interact with the environment to obtain an optimal policy for solving a problem. In recent years, it has become a new solution to the path-planning problem of multi-DOF manipulators in unstructured environments; DeepMind, UC Berkeley, and many others have applied DRL to the trajectory-planning problem of robotic arms [13][14][15]. Evan proposed a path-planning method for a multiarm manipulator based on the SAC (soft actor-critic) algorithm with hindsight experience replay (HER), which is suitable for multiarm manipulators with static and periodically mobile obstacles. Introducing expert experience can improve the generalization of the policy during the training process. In addition, the ratio of expert experience to self-generated samples in the training samples and the frequency of return visits to expert experience were critical to the performance of the algorithm. Therefore, these two main factors of a deep reinforcement learning strategy combined with expert experience were analyzed in this paper.
For the convenience of readers, the main acronyms used in this paper are listed in Table A1 of Appendix A.

DDPG
Fruit picking by a multi-DOF manipulator can be described as a continuous state-action model in a high-dimensional space, and the deep deterministic policy gradient (DDPG) algorithm can effectively solve continuous action space tasks. The DDPG algorithm is based on the actor-critic architecture, which consists of a policy network (actor) and a value network (critic) [29]. The network architecture of DDPG is shown in Figure 1; it has four neural networks: the actor network, the actor target network, the critic network, and the critic target network. The actor network was used to select the action corresponding to the current state, and the critic network assessed the strengths and weaknesses of the selected action by calculating the value function. The actor target network and the critic target network were used to update the value network system and the actor network system, but did not carry out online training and updating of network parameters.
Figure 1. The network architecture of the DDPG algorithm. s_i denotes the current state, including the joint angles, angular velocities, and the target position of the end-effector; a_i denotes the angular velocity of the multi-degree-of-freedom robot arm; r_i denotes the timely reward of the current environment; and s_i+1 denotes the state of the next moment [15].
The input of the actor network is the current state of the multi-DOF manipulator, s_t, including information such as the angle and angular velocity of each joint and the target position of the end-effector; the output of the actor network is the angular velocity of the multi-DOF manipulator, a_t. Then, according to the distance between the current position and the target position of the end-effector, the environment feeds back an immediate reward, r_t. By constantly interacting with the environment and performing the appropriate actions, the multi-DOF manipulator can solve the path-planning task. There are two termination states: (1) the end-effector of the multi-DOF manipulator reaches the target point or encounters an obstacle; and (2) the number of steps interacting with the environment reaches the upper limit. The path-planning algorithm for the multi-DOF manipulator is given in Algorithm 1.
Algorithm 1. Path planning for the multi-DOF manipulator based on DDPG.
a. Select the action of the multi-DOF manipulator, a_t = µ(s_t|θ^µ).
b. The multi-DOF manipulator executes action a_t and observes the reward r_t and the new state s_t+1.
c. Store the transition (s_t, a_t, r_t, s_t+1) in R.
d. Sample a random minibatch of N transitions (s_t, a_t, r_t, s_t+1) from R, and update the actor and critic networks, θ^µ and θ^Q.
e. Soft-update the target network parameters with factor τ.
f. If s_t+1 is a terminal state, the current episode ends; otherwise, return to step a.
Training terminated.
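The loop of Algorithm 1 can be sketched in Python. A linear map stands in for the actor network µ(s|θ^µ), a random walk stands in for the simulator, and the actor/critic gradient updates of step d are omitted, so this is only a minimal sketch of the replay-buffer mechanics (steps c-d) and the soft target update (step e), not a full DDPG implementation:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer R holding (s_t, a_t, r_t, s_t1, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, n):
        return random.sample(list(self.buffer), min(n, len(self.buffer)))

def soft_update(target_params, online_params, tau=0.005):
    """Step e: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return {k: tau * online_params[k] + (1.0 - tau) * target_params[k]
            for k in target_params}

# --- toy usage: linear actor "network" a = W @ s, a stand-in for mu(s|theta_mu)
rng = np.random.default_rng(0)
online = {"W": rng.normal(size=(2, 4))}   # 4-dim state -> 2-dim action
target = {"W": np.zeros((2, 4))}

buf = ReplayBuffer()
s = rng.normal(size=4)
for t in range(200):                       # steps a-c: act, observe, store
    a = online["W"] @ s                    # deterministic policy action
    s_next = rng.normal(size=4)            # placeholder environment step
    r = -np.linalg.norm(s_next[:3])        # e.g. negative distance to goal
    buf.store(s, a, r, s_next, done=False)
    s = s_next
    target = soft_update(target, online, tau=0.005)  # step e

batch = buf.sample(64)                     # step d: minibatch (updates omitted)
```

With τ = 0.005 the target parameters trail the online parameters slowly, which is what keeps the bootstrapped critic targets stable during training.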

Deep Reinforcement Learning Strategies with Expert Experience
In the early stages of training, the complexity and disorder of the target locations and the fact that the network parameters are randomly initialized make the model inefficient and the network difficult to converge during training. Therefore, in this paper, expert experience was used as part of the initial training samples to train the initial policy of the DRL algorithm, which could reduce the exploration time and the learning difficulty at the beginning of the training process. The rapidly-exploring random tree (RRT) algorithm was adopted to obtain sufficient expert experience.
The RRT algorithm is a probabilistically complete global path-planning algorithm that obtains path points by random sampling in the search space and then builds a feasible path from the start point to the goal point. The random tree expansion diagram for the RRT algorithm is shown in Figure 2. Due to the large randomness of the RRT algorithm, there are large differences between the generated paths, which increases the diversity of the expert samples and helps to enhance the generalization ability of the model. The workspace set in the simulation environment was a 0.5 m × 0.8 m × 0.5 m three-dimensional space centered on the point (0.25, 0, 1.002), and the initial coordinate of the end-effector of the picking manipulator was set to (−0.094, −0.025, 1.345).
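A minimal task-space RRT can be sketched as follows. Goal biasing is added (a common RRT variant, an assumption here), collision checking is done on sampled points only, and the obstacle position, bounds, and step length are illustrative values rather than those of the simulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def rrt_plan(start, goal, is_free, step=0.05, goal_tol=0.05, max_iter=5000,
             bounds=((0.0, 0.5), (-0.4, 0.4), (0.75, 1.25))):
    """Minimal RRT: grow a tree from `start` by random sampling until a node
    lands within `goal_tol` of `goal`, then backtrack the path."""
    g = np.asarray(goal, float)
    nodes = [np.asarray(start, float)]
    parent = {0: None}
    for _ in range(max_iter):
        # 10% goal bias speeds up convergence toward the goal region
        q_rand = g if rng.random() < 0.1 else \
            np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        i_near = min(range(len(nodes)),
                     key=lambda i: np.linalg.norm(nodes[i] - q_rand))
        direction = q_rand - nodes[i_near]
        dist = np.linalg.norm(direction)
        q_new = q_rand if dist <= step else nodes[i_near] + step * direction / dist
        if not is_free(q_new):          # point-wise collision check only
            continue
        nodes.append(q_new)
        parent[len(nodes) - 1] = i_near
        if np.linalg.norm(q_new - g) < goal_tol:
            path, i = [], len(nodes) - 1
            while i is not None:        # backtrack from goal node to root
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None  # no path found within the iteration budget

# usage: one spherical obstacle with a 5 cm radius at a hypothetical position
obstacle, radius = np.array([0.25, 0.0, 1.0]), 0.05
free = lambda p: np.linalg.norm(p - obstacle) > radius
path = rrt_plan(start=(0.0, -0.3, 0.8), goal=(0.4, 0.3, 1.2), is_free=free)
```

The large spread between paths returned by repeated calls is exactly the sample diversity the paper exploits when building the expert buffer.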
The target position, P(x, y, z), and the initial position of the end-effector of the multi-DOF manipulator, P_0(x_0, y_0, z_0), were randomly set, and the steps for obtaining expert experience were given in Algorithm 2:
e. Calculate the reward, r_i, from the state sequence and reassemble D into (s_i, a_i, r_i, s_i+1).
f. Repeat the above steps until sufficient expert experience is obtained.
The above steps were used to obtain a large amount of expert experience; then, a random combination of expert experience and newly generated experience from the sample pools was used for training.
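Steps e-f of Algorithm 2 amount to replaying an expert path and labelling each step with an action and a reward. A sketch, in which the action is taken as the finite-difference velocity between consecutive states and the reward is a hypothetical negative distance-to-goal term (the paper's exact reward function is not reproduced here):

```python
import numpy as np

def path_to_transitions(waypoints, goal, dt=0.1):
    """Turn an expert path (sequence of states) into DDPG-style transitions
    (s_i, a_i, r_i, s_i1). The action is approximated as the finite-difference
    velocity, and the reward is a negative distance-to-goal shaping term."""
    waypoints = [np.asarray(w, float) for w in waypoints]
    goal = np.asarray(goal, float)
    transitions = []
    for s_i, s_i1 in zip(waypoints[:-1], waypoints[1:]):
        a_i = (s_i1 - s_i) / dt              # velocity command between states
        r_i = -np.linalg.norm(s_i1 - goal)   # closer to the goal -> larger reward
        transitions.append((s_i, a_i, r_i, s_i1))
    return transitions

expert_buffer = []  # separate replay buffer reserved for expert experience
demo_path = [(0.0, 0.0, 0.8), (0.1, 0.1, 0.9), (0.2, 0.2, 1.0)]
expert_buffer.extend(path_to_transitions(demo_path, goal=(0.2, 0.2, 1.0)))
```

Keeping these transitions in a buffer of their own (rather than mixing them into the online buffer) is what later allows the ratio k to be controlled exactly.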

The Simulation Platform
In this paper, C4D and CoppeliaSim were used to build an apple-picking simulation environment for a multi-DOF manipulator mounted on a mobile platform, as shown in Figure 3. During apple picking, the mobile platform was braked to prevent any movement. The computer configuration used in the experiments is shown in Table 1.

The Multi-DOF Manipulator
To verify the validity of the method, the 7-degree-of-freedom manipulator, Franka, was adopted as the picking robot in this paper. Figure 4 shows the structure of the multi-DOF manipulator.

Figure 3. Picking scene. On the left side is an apple tree model, in which red spheres indicate ripe apples and green spheres indicate unripe apples; on the right side is a multi-DOF picking manipulator fixed on top of a mobile platform.
All joints of a Franka are rotating joints, and joint 7 is the end actuating claw. Figure 5 shows the simplified model of the multi-DOF manipulator, and Table 2 shows the D-H parameters of the multi-DOF manipulator.
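Given the D-H parameters of Table 2, the pose of the end-effector follows by chaining one homogeneous transform per joint. A sketch using the standard D-H convention; the link values below are placeholders for illustration, not the values of Table 2:

```python
import numpy as np

def dh_transform(a, alpha, d, theta):
    """Standard Denavit-Hartenberg homogeneous transform for one joint."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(dh_rows, joint_angles):
    """Chain the per-joint transforms; returns the base-to-end-effector pose."""
    T = np.eye(4)
    for (a, alpha, d), theta in zip(dh_rows, joint_angles):
        T = T @ dh_transform(a, alpha, d, theta)
    return T

# 7 rows of (a, alpha, d) -- placeholder values, structure only; the actual
# parameters of the Franka arm are those listed in Table 2
dh_rows = [
    (0.0,   0.0,        0.333),
    (0.0,  -np.pi / 2,  0.0),
    (0.0,   np.pi / 2,  0.316),
    (0.08,  np.pi / 2,  0.0),
    (-0.08, -np.pi / 2, 0.384),
    (0.0,   np.pi / 2,  0.0),
    (0.09,  np.pi / 2,  0.107),
]
ee_pose = forward_kinematics(dh_rows, [0.0] * 7)  # pose at the zero configuration
```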

Model of Untargeted Fruit
During apple picking, the target objects are the mature fruits in the outer canopy of the tree, where the obstacles are mostly a small number of untargeted fruits and the impact of branch obstacles is minimal. Therefore, the obstacles in this paper were mainly untargeted fruits, which were modelled by shape simplification [30][31][32]. As shown in Figure 6, a sphere model was adopted for an apple obstacle.
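With apples modelled as 5 cm sphere envelopes, collision checking reduces to a point-to-center distance test. A minimal sketch; the clearance parameter and obstacle positions are illustrative assumptions:

```python
import numpy as np

def collides_with_fruit(point, centers, radius=0.05, clearance=0.0):
    """Check a manipulator point against untargeted fruits modelled as
    5 cm-radius sphere envelopes (optionally inflated by a safety clearance)."""
    point = np.asarray(point, float)
    return any(np.linalg.norm(point - np.asarray(c, float)) <= radius + clearance
               for c in centers)

fruits = [(0.25, 0.0, 1.0), (0.30, 0.10, 1.05)]  # hypothetical obstacle centers
collides_with_fruit((0.27, 0.0, 1.0), fruits)    # inside the first envelope
collides_with_fruit((0.45, -0.2, 1.2), fruits)   # well clear of both
```

The sphere envelope slightly over-approximates the fruit, which trades a little workspace for a conservative, very cheap collision test.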

Figure 6. Untargeted fruit obstacle model. A sphere envelope with a radius of 5 cm was used to simplify the model of an apple obstacle.

The Impact of Different Amounts of Expert Experience
Compared to the traditional training process of DDPG, expert experience can implicitly give prior information about the unstructured environment to the multi-DOF manipulator, enabling it to adapt to the environment faster in the early stage of training. However, if only expert experience is used for training, the model tends to be less effective when encountering unknown states. To improve sample diversity, two experience replay buffers were set up, one for expert experience and the other for newly generated experience. When the network model was trained, m1 and m2 samples were respectively taken from the two experience replay buffers and sent to the network as the training samples. As shown in Formula (1), k is the ratio of the number of expert samples to the total number of training samples:

k = m1/(m1 + m2), (1)

where m1 is the number of expert samples, m2 is the number of newly generated samples, and (m1 + m2) is the total number of training samples. As the feedback after performing an action, the reward value is the basis for optimizing the reinforcement learning strategy [33]. Figure 7 shows how the reward values changed with the number of episodes. The curve with a k value of 0 represented the training curve without expert experience and was set as the baseline. The other colored curves corresponded to the reward value curves with different k values. As seen in Figure 7, the reward gradually rose and eventually reached a state of convergence.
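Sampling according to Formula (1) can be sketched as follows; the batch size of 64 and the rounding of m1 are assumptions for illustration:

```python
import random

def sample_mixed_batch(expert_buf, new_buf, batch_size, k):
    """Draw a training minibatch from two replay buffers so that expert
    samples make up a fraction k of it: k = m1 / (m1 + m2), Formula (1)."""
    m1 = round(k * batch_size)            # number of expert samples
    m2 = batch_size - m1                  # number of newly generated samples
    batch = (random.sample(expert_buf, min(m1, len(expert_buf))) +
             random.sample(new_buf, min(m2, len(new_buf))))
    random.shuffle(batch)                 # avoid ordering bias in the minibatch
    return batch

# usage with tagged dummy transitions
expert_buf = [("expert", i) for i in range(1000)]
new_buf = [("new", i) for i in range(1000)]
batch = sample_mixed_batch(expert_buf, new_buf, batch_size=64, k=0.35)
n_expert = sum(1 for tag, _ in batch if tag == "expert")
```

Keeping k as an explicit argument is what makes the constant-ratio and dynamic-ratio experiments below differ only in the schedule that supplies k.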
As can be seen in the red circle in Figure 7, the reward values increased faster at the beginning of training due to the involvement of expert experience and reached a high level within 500 episodes, while the reward value of the baseline increased at a relatively slow rate.
The total reward values in the first 500 episodes were calculated at different k values, as shown in Figure 8. In the figure, the lowest total reward value was −151.6 when k was 0, and the highest total reward value was −44 when k was 0.45, an improvement of 70.98% compared to the baseline, indicating that the models combined with expert experience gave the multi-DOF manipulator a larger reward and improved the learning efficiency of the algorithm at the beginning of training.
Figure 9 shows the average success times and success rate for different k values during the training process; the point with a k value of 0, named the baseline, was not involved with expert experience. Both the average success times and success rate of the baseline were the lowest.
Compared with the baseline, the average success times and success rate of the experiments with other k values were both improved. This indicated that in the unstructured environment, expert experience could play an important role in avoiding unnecessary exploration by the multi-DOF manipulator during the training process. As the k value gradually increased, both curves tended to rise and then fall, which demonstrated that too much expert experience made the generalization performance of the model worse. In Figure 9, when the k value was 0.35, compared to the baseline, the average success times increased from 1450 to 3018, an increase of 51.95%, and the average success rate increased from 0.447 to 0.712, an increase of 37.22%. This suggested that expert experience could contribute to the training of the multi-DOF manipulator in unstructured environments, improving the overall learning efficiency.
Figure 10 shows the path-planning results of the multi-DOF manipulator for a k value of 0.35. The red apple was the target fruit, and the green apple was the obstacle. The red line was the planned path of the multi-DOF manipulator. In the above-mentioned training process, the value of k was constant, and the learning performance of the multi-DOF manipulator was best at a k value of 0.35 in terms of the average success times and success rate. However, as can be seen in Figure 8, the reward value with a k value of 0.35 was not the highest. This indicated that different values of k could have different effects on the training process.
Therefore, instead of setting k to a constant value at the initial stage of training, a dynamic decline from an initial value to 0.35 was used to evaluate the effect of a dynamic k value on the training process. Figure 11 shows the reward value curves for dynamic k values declining from different initial values to 0.35. The decline in the k value took place within the first 500 episodes, after which a k value of 0.35 was maintained. The red line indicates the training curve with a k value of 0 and no expert experience involved, namely the baseline. The other colored curves correspond to different initial k values. As can be seen in Figure 11, the reward value gradually increased and eventually reached a state of convergence.

Figure 10. Path-planning result for a constant k value. The red curve indicates the trajectory of the end-effector of the multi-DOF manipulator; the red spheres and green spheres indicate the ripe and unripe apples, respectively; and the background is the land model.
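The decline of k from an initial value to 0.35 over the first 500 episodes can be sketched as a simple schedule; the linear shape of the decay is an assumption here, since the text specifies only the endpoints and the 500-episode window:

```python
def dynamic_k(episode, k_init=0.45, k_final=0.35, decay_episodes=500):
    """Decay the expert-sample ratio from k_init to k_final over the first
    `decay_episodes` episodes, then hold it at k_final. The linear shape of
    the decay is an assumption for illustration."""
    if episode >= decay_episodes:
        return k_final
    return k_init + (k_final - k_init) * episode / decay_episodes

dynamic_k(0)     # 0.45 at the start of training
dynamic_k(250)   # 0.40 halfway through the decay window
dynamic_k(900)   # 0.35 after the decay window
```

Feeding this schedule's output into the mixed-batch sampler in place of a constant k reproduces the dynamic-ratio experiments.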
The total reward values for the first 500 episodes corresponding to different initial k values are shown in Figure 12, in which the point with a k value of 0 is the reward value of the baseline experiment and the remaining points are the reward values with expert experience involved. When the k value was dynamic, the training process with expert experience had a higher reward value than the baseline experiment in the first 500 episodes, and the highest reward value was −40.15 when the initial k value was 0.45, an improvement of 73.52% compared to the baseline, indicating that the initial performance of the model was significantly improved by expert experience at the early stage of training, which accelerated the learning speed.
Figure 13 shows the average success times and success rate curves corresponding to different initial k values, in which the points with a k value of 0 are the average success times and success rate of the baseline experiment. As the initial k value gradually increased, all the curves in Figure 13 showed a trend of first increasing and then decreasing. Compared to the baseline experiment, the average success times and success rate of each experiment increased when the k value was dynamic; when the initial k value was 0.45, the average success times increased from 1450 to 3088, an increase of 53.04%, and the average success rate increased from 0.447 to 0.721, an increase of 38%.
Figure 14 shows the path-planning results of the multi-DOF manipulator with an initial k value of 0.45. The red apple was the target fruit, and the green apple was the obstacle. The red line was the planned path of the multi-DOF manipulator.
Figure 14. Path-planning result for dynamic k values from 0.45 to 0.35. The red curve indicates the trajectory of the end-effector of the multi-DOF manipulator; the red spheres and green spheres indicate the ripe and unripe apples, respectively; and the background is the land model.
Table 3 compares the performance of the constant k value, the dynamic k value, and the baseline. It shows that the dynamic k value had the highest reward value in the early training period, 8.75% higher than the constant k value of 0.45 and 17.64% higher than the constant k value of 0.35; the dynamic k value also outperformed the constant k value in terms of the average success times and success rate. It was concluded that a dynamic k value in the early training period was more effective in improving the learning efficiency of the model: involving more expert experience in the early stage of training enabled the model to adapt to the environment faster, after which the amount of expert experience needed to be appropriately reduced and newly generated samples added in the following stage of training to overcome the insufficient diversity of expert experience and reduce its influence on policy updating.
To test the performance of the model, the average time and average path length of 100 successful path plannings were chosen as the evaluation metrics, and comparative experiments were made between the proposed method and the RRT algorithm, as shown in Table 4 [1]. The path-planning result of the RRT algorithm is shown in Figure 15.
Figure 15. Path-planning result for the RRT algorithm. The red curve indicates the trajectory of the end-effector of the multi-DOF manipulator; the red spheres and green spheres indicate the ripe and unripe apples, respectively; and the background is the land model.
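The two evaluation metrics can be computed directly from the recorded trajectories; the waypoint lists and timings below are illustrative stand-ins for the logged planning runs:

```python
import numpy as np

def path_length(waypoints):
    """Total Euclidean length of an end-effector trajectory."""
    pts = np.asarray(waypoints, float)
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def average_metrics(paths, times):
    """Average path length and planning time over successful plannings,
    the two evaluation metrics used in the comparison with RRT."""
    return (sum(path_length(p) for p in paths) / len(paths),
            sum(times) / len(times))

# two toy "successful plannings" with recorded wall-clock times in seconds
demo_paths = [[(0, 0, 0), (0.3, 0, 0), (0.3, 0.4, 0)],   # length 0.7
              [(0, 0, 0), (0, 0, 0.5)]]                  # length 0.5
avg_len, avg_time = average_metrics(demo_paths, times=[0.12, 0.08])
```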

Frequency of Return Visits to Expert Experience
In this section, expert experience and the policy are compared to a teacher and a student, respectively. From the perspective of reinforcement learning, a student's policy learning may not require the full guidance of the teacher; instead, the teacher usually provides regular guidance to correct the errors the student encounters while learning. Likewise, the model can be given guidance by revisiting expert experience to correct its blind exploration during training. When the frequency of return visits is too low, the guidance given to the model is ineffective, while when the frequency of return visits is too high, the generalization of the model is reduced. Therefore, this section explores the effect of regular return visits to expert experience on the learning process through experiments, in which expert experience was reintroduced within every 50 episodes. Figure 16 shows the reward value curves corresponding to different numbers of return visits. Expert experience was engaged within the first 500 episodes, after which it was accessed at the set frequency of visits. The red line is the training curve without visits to expert experience, named the baseline, and the other colored curves represent the training curves for different frequencies of visiting expert experience. As can be seen from the curves, the reward gradually increased as training proceeded and eventually reached a state of convergence.
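One way to realize the return-visit schedule is a per-episode gate. The 50-episode window and, e.g., 15 visits per window follow the experiment description, but which episodes inside a window revisit the expert buffer is an assumption here (the first 15 of each window):

```python
def use_expert_experience(episode, visits_per_window=15, window=50, warmup=500):
    """Decide whether a training episode should revisit the expert buffer.
    Expert experience is always used during the first `warmup` episodes;
    afterwards it is revisited in `visits_per_window` episodes out of every
    `window` episodes (here: the first `visits_per_window` of each window,
    an illustrative choice)."""
    if episode < warmup:
        return True
    return (episode - warmup) % window < visits_per_window

# across episodes 500-549, exactly 15 episodes revisit expert experience
revisits = sum(use_expert_experience(e) for e in range(500, 550))
```

In a training loop, this gate would decide per episode whether minibatches are drawn from the mixed buffers or from newly generated experience only.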
As can be seen in Figure 16, the reward with expert experience at the early stages of training increased rapidly and reached a high level within 500 episodes, while the reward of the baseline experiment was relatively small and was still increasing within 500 episodes. As a result of regular visits to expert experience, the curves of reward increased faster than the baseline experiment, and the reward remained high, but the reward values Figure 15. Path-planning result for RRT algorithm. The red curve indicates the trajectory of the end-effector of the multi-DOF manipulator; the red spheres and green spheres indicate the ripe and unripe apples, respectively; and the background is the land model.
As can be seen in Table 4, the path-planning time of the baseline decreased by 9.86% and the path length declined by 7.52% compared with those of the RRT algorithm, which indicated that the deep reinforcement learning method had an advantage over the RRT algorithm. In addition, the dynamic k value improved on the baseline by 30.54% in path-planning time and 11.84% in path length, which verified the effectiveness of the proposed method. Comparing the paths in Figures 10 and 14, the path produced by the RRT algorithm contained more inflection points and was longer than that of the proposed algorithm, and the time to complete a single picking was also longer. Moreover, for a traditional algorithm such as RRT, the path-planning result is affected by many factors, such as the step length and the number of iterations, which also makes the traditional algorithm more complicated to tune than the proposed algorithm.
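For reference, the RRT baseline can be sketched in a few lines. The following is a minimal 2-D illustration, not the paper's implementation: the step length, iteration budget, and goal tolerance are placeholder values, and the paper's planner operates in the manipulator's joint space rather than the plane.

```python
import math
import random

# Hypothetical parameters for a minimal 2-D RRT sketch.
STEP = 0.5          # extension step length
MAX_ITERS = 5000    # iteration budget
GOAL_TOL = 0.5      # distance at which the goal counts as reached

def rrt(start, goal, collides, bounds, seed=0):
    """Return a path from start to goal as a list of (x, y) points, or None."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {start: None}
    for _ in range(MAX_ITERS):
        # Sample a random point, find its nearest tree node, and extend.
        sample = (rng.uniform(*bounds[0]), rng.uniform(*bounds[1]))
        near = min(nodes, key=lambda n: math.dist(n, sample))
        d = math.dist(near, sample)
        if d == 0:
            continue
        new = (near[0] + STEP * (sample[0] - near[0]) / d,
               near[1] + STEP * (sample[1] - near[1]) / d)
        if collides(near, new):
            continue
        nodes.append(new)
        parent[new] = near
        if math.dist(new, goal) < GOAL_TOL:
            # Walk back through the parent links to recover the path.
            path = [goal, new]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
    return None
```

The sketch makes the tuning sensitivity discussed above concrete: the planner's output changes with STEP, MAX_ITERS, and the random seed, whereas a learned policy needs no such per-scene tuning at inference time.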

Frequency of Return Visits to Expert Experience
In this section, we regard expert experience and the policy as teacher and student, respectively. From the perspective of reinforcement learning, the student's policy learning may not require the teacher's full guidance; instead, the teacher usually provides regular guidance to correct the errors the student encounters while learning. Likewise, the model can be given guidance by revisiting expert experience to correct its blind exploration during training. When the frequency of return visits is too low, the guidance given to the model is ineffective, while when the frequency is too high, the generalization of the model is reduced. Therefore, this section explores the effect of regular return visits to expert experience on the learning process through experiments in which expert experience was reintroduced every 50 episodes. Figure 16 shows the curves of reward values corresponding to the different numbers of return visits. Expert experience was engaged within the first 500 episodes, after which it was accessed at the set visit frequency. The red line is the training curve without visits to expert experience, named the baseline, and the other colored curves represent training with different frequencies of visits to expert experience. As can be seen from the curves, the reward gradually increased as training proceeded and eventually converged.
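As a concrete reading of this schedule, a return-visit rule might look like the sketch below. This is our assumed form: the 500-episode guided phase and the 50-episode cycle come from the text above, but where the visited episodes fall inside a cycle is an assumption made for illustration.

```python
# Assumed return-visit schedule: within each 50-episode cycle after the
# initial expert-guided phase, the first `visits` episodes replay expert
# experience and the remaining episodes use only newly generated samples.
GUIDED_EPISODES = 500   # expert experience always engaged here (from the paper)
CYCLE = 50              # length of one return-visit cycle (from the paper)

def uses_expert(episode, visits=15):
    """True if this episode should draw samples from the expert buffer."""
    if episode < GUIDED_EPISODES:
        return True
    return (episode - GUIDED_EPISODES) % CYCLE < visits
```

With visits=15, exactly 15 of every 50 post-guidance episodes revisit expert experience, matching the best-performing setting reported below.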
As can be seen in Figure 16, the reward with expert experience increased rapidly at the early stage of training and reached a high level within 500 episodes, while the reward of the baseline experiment was relatively small and was still increasing at 500 episodes. As a result of regular visits to expert experience, the reward curves increased faster than the baseline and remained high, although the reward values decreased slightly in the following training period. Figure 17 shows the curves of the average success times and success rates corresponding to the different numbers of visits. As the frequency of return visits gradually increased, all the curves tended to rise and then fall. Compared to the baseline experiment, when the number of return visits was 15, the average success times were 2549, an increase of 43.11%, and the average success rate was 0.589, an increase of 24.11%. In addition, the times for 100 successful path plannings of the baseline model and the model with 15 return visits were 740 s and 650 s, respectively.
The above experiments showed that the reward increased to a higher score within a short period after the policy was initialized with expert experience, denoting that the guidance based on expert experience could avoid much unnecessary exploration, while regular visits to expert experience enabled the multi-DOF manipulator to receive guidance at the following stages of training, which corrected errors in policy updating and maintained the reward at a high level. However, it was found in the experiments that when the number of return visits was too large, the training performance was greatly reduced. This was because the higher the frequency of return visits, the less stable the updated policy, making the curve less stable and slightly decreasing the reward value in the following stages of training.
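The dynamic ratio of expert experience to newly generated samples could be realized by a mixed minibatch sampler along the following lines. This is a hedged sketch, not the paper's code: the linear annealing of k from 0.45 to 0.35 matches the ratios reported in this paper, while the buffer layout, batch size, and annealing horizon (`anneal_episodes`) are our assumptions.

```python
import random

def expert_ratio(episode, k_start=0.45, k_end=0.35, anneal_episodes=500):
    """Linearly anneal the expert-sample ratio k over the guided phase."""
    t = min(episode / anneal_episodes, 1.0)
    return k_start + t * (k_end - k_start)

def sample_batch(expert_buf, policy_buf, batch_size, episode, rng=random):
    """Draw a minibatch with a fraction k of samples from the expert buffer."""
    k = expert_ratio(episode)
    n_expert = round(batch_size * k)
    batch = rng.sample(expert_buf, min(n_expert, len(expert_buf)))
    # Fill the remainder of the batch from newly generated policy samples.
    batch += rng.sample(policy_buf, min(batch_size - len(batch), len(policy_buf)))
    rng.shuffle(batch)
    return batch
```

Annealing k downward reflects the finding above: heavy expert guidance helps early exploration, but its share must shrink so the policy's own samples dominate later training.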
Figure 18 shows the path-planning result for the multi-DOF manipulator when the number of return visits was 15. The red apple was the target fruit, the green apple was the obstacle, and the red line was the planned path of the multi-DOF manipulator.

Discussion
In this paper, a deep reinforcement learning strategy incorporating expert experience was proposed to address the blind exploration of the robot arm at the early stage of training in unstructured environments. In complex obstacle scenes, too much blind exploration made early learning efficiency very low. The effects of different k values and different numbers of return visits on the training results were discussed and demonstrated in the simulation environment. In this paper, the k value was obtained by enumeration, which has the advantage of being easy to implement. However, there may be more complex relationships between the path-planning results and the proportion of expert experience. Inspired by [33], we will investigate this further by using a neural network to optimize the k value in future work.
To obtain effective expert experience, we needed to manually choose the samples that formed the expert sample base. Meanwhile, the simulation environment in this paper was relatively simple compared with an actual picking scenario. In a real orchard, very small distances between fruits, and between fruits and branches, increase the complexity of the environment and place higher requirements on the path-planning accuracy of the robotic arm. Therefore, a future study will include more complex experimental environments in the training process to improve the applicability of the model in more realistic scenes.

Figure 18. Path-planning result of the multi-DOF manipulator. The red curve indicates the trajectory of the end-effector of the multi-DOF manipulator; the red spheres and green spheres indicate the ripe and unripe apples, respectively; and the background is the land model.


Conclusions
To solve the problem of blind exploration of DDPG at the initial stage of training, a reinforcement learning strategy combined with expert experience, generated by the RRT algorithm, was proposed in this paper. Moreover, the ratios of expert experience to newly generated samples and the frequency of return visits to expert experience were analyzed through simulation experiments. In terms of the average success times and success rate in the simulation experiments, several conclusions can be drawn. First, the proposed method was verified by comparing training with and without expert experience. Second, the dynamic k value, which declined from 0.45 to 0.35, was more effective in improving the learning efficiency of the model than a constant k value. Third, the highest success rate was achieved when the frequency of return visits was 15 in 50 episodes. The training method proposed in this paper has implications for the training process of reinforcement learning in other domains. However, picking experiments in real orchards need to be conducted and analyzed in subsequent studies due to the gap between the simulation environment and the natural picking environment.

Conflicts of Interest:
The authors declare no conflict of interest.