3.4. The Impact of Different Amounts of Expert Experience
Compared with the conventional DDPG training process, expert experience implicitly provides prior information about the unstructured environment to the multi-DOF manipulator, enabling it to adapt to the environment faster in the early stage of training. However, a model trained only on expert experience tends to perform poorly when it encounters unknown states. To improve sample diversity, two experience replay buffers were set up, one for expert experience and the other for newly generated experience. During training, $m_1$ and $m_2$ samples were drawn from the two buffers, respectively, and fed to the network as training samples. As shown in Formula (1), k is the ratio of the number of expert samples to the total number of training samples:

$$k = \frac{m_1}{m_1 + m_2} \tag{1}$$

where $m_1$ is the number of expert samples, $m_2$ is the number of newly generated samples, and $(m_1 + m_2)$ is the total number of training samples.
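The mixed sampling described by Formula (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_mixed_batch`, the list-based buffers, and the rounding of $m_1$ to the nearest integer are all assumptions made for the example.

```python
import random

def sample_mixed_batch(expert_buffer, agent_buffer, batch_size, k, rng=random):
    """Draw m1 expert and m2 newly generated transitions with m1 / (m1 + m2) close to k.

    Hypothetical helper: buffers are plain lists of transitions.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    # Number of expert samples, capped by what the expert buffer holds.
    m1 = min(round(k * batch_size), len(expert_buffer))
    # The remainder comes from the newly generated experience.
    m2 = min(batch_size - m1, len(agent_buffer))
    batch = rng.sample(expert_buffer, m1) + rng.sample(agent_buffer, m2)
    rng.shuffle(batch)  # avoid a fixed expert-first ordering inside the batch
    return batch, m1, m2
```

With `k = 0` the batch contains only newly generated samples, which corresponds to the baseline without expert experience.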
As the feedback received after performing an action, the reward value is the basis for optimizing the reinforcement learning strategy [33]. Figure 7 shows how the reward value changed with the number of episodes. The curve with k = 0 is the training curve without expert experience and was set as the baseline. The other colored curves are the reward curves for the other k values. As seen in Figure 7, the reward gradually rose and eventually converged.
As can be seen in the red circle in Figure 7, the reward values increased faster at the beginning of training owing to the involvement of expert experience and reached a high level within 500 episodes, while the reward value of the baseline increased relatively slowly.
The total reward values over the first 500 episodes were calculated for different k values, as shown in Figure 8. The lowest total reward value was −151.6 at k = 0 (the baseline), and the highest was −44 at k = 0.45, an improvement of 70.98% over the baseline. This indicated that the models combined with expert experience gave the multi-DOF manipulator a larger reward and improved the learning efficiency of the algorithm at the beginning of training.
Figure 9 shows the average success times and success rate for different k values during training; the point at k = 0 is the baseline, which involved no expert experience. Both the average success times and the success rate of the baseline were the lowest, and both improved for every other k value. This indicated that, in the unstructured environment, expert experience played an important role in sparing the multi-DOF manipulator unnecessary exploration during training. As k increased, both curves rose and then fell, which suggested that too much expert experience degraded the generalization performance of the model. In Figure 9, at k = 0.35, the average success times increased from 1450 for the baseline to 3018, an increase of 51.95%, and the average success rate increased from 0.447 to 0.712, an increase of 37.22% (both percentages are expressed relative to the improved value). This suggested that expert experience could contribute to the training of the multi-DOF manipulator in unstructured environments, improving the overall learning efficiency.
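The percentages quoted above appear to follow two different conventions: the reward improvement is consistent with a change measured relative to the baseline, whereas the success figures are consistent with a change measured relative to the improved value. The short check below reproduces both; the function names are hypothetical, and the convention attributed to each figure is inferred from the arithmetic rather than stated in the text.

```python
def relative_to_baseline(baseline, improved):
    """Percentage change with the baseline magnitude as the denominator."""
    return (improved - baseline) / abs(baseline) * 100.0

def relative_to_improved(baseline, improved):
    """Percentage change with the improved magnitude as the denominator."""
    return (improved - baseline) / abs(improved) * 100.0

# Total reward over the first 500 episodes (k = 0 vs. k = 0.45).
print(round(relative_to_baseline(-151.6, -44.0), 2))   # 70.98

# Average success times and success rate (k = 0 vs. k = 0.35).
print(round(relative_to_improved(1450, 3018), 2))      # 51.95
print(round(relative_to_improved(0.447, 0.712), 2))    # 37.22

# The same success figures measured against the baseline instead.
print(round(relative_to_baseline(1450, 3018), 2))      # 108.14
print(round(relative_to_baseline(0.447, 0.712), 2))    # 59.28
```

Measured against the baseline, the success times roughly doubled, so the convention matters when comparing these figures with the reward improvements.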
Figure 10 shows the path-planning results of the multi-DOF manipulator for k = 0.35. The red apple is the target fruit, the green apple is the obstacle, and the red line is the planned path of the multi-DOF manipulator.
In the training process described above, k was held constant, and the multi-DOF manipulator learned best at k = 0.35 in terms of the average success times and success rate. However, as can be seen in Figure 8, the reward value at k = 0.35 was not the highest. This indicated that different values of k affected the training process differently. Therefore, instead of holding k constant during the initial stage of training, k was made to decline dynamically from an initial value to 0.35, in order to evaluate the effect of a dynamic k value on the training process.
Figure 11 shows the reward curves for dynamic k values declining from different initial values to 0.35. The decline took place within the first 500 episodes, after which k was held at 0.35. The red line is the training curve with k = 0 and no expert experience involved, namely the baseline. The other colored curves correspond to the different initial k values. As can be seen in Figure 11, the reward value gradually increased and eventually converged.
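The dynamic k value can be sketched as a schedule over episodes. The text states only that k declined from an initial value to 0.35 within the first 500 episodes; the linear shape of the decline and the function name `dynamic_k` are assumptions made for this illustration.

```python
def dynamic_k(episode, k_init, k_final=0.35, decay_episodes=500):
    """Return the expert-sample ratio k for a given episode.

    Assumption: k is interpolated linearly from k_init to k_final over the
    first `decay_episodes` episodes and is held at k_final afterwards.
    """
    if episode >= decay_episodes:
        return k_final
    progress = episode / decay_episodes
    return k_init + (k_final - k_init) * progress

# Example: initial value 0.45, sampled at a few episodes.
for ep in (0, 250, 500, 1000):
    print(ep, round(dynamic_k(ep, k_init=0.45), 3))
```

Other decay shapes (stepwise or exponential, for example) would fit the description equally well; only the endpoints and the 500-episode window come from the text.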
The total reward values over the first 500 episodes for different initial k values are shown in Figure 12, in which the points at k = 0 are the reward values of the baseline experiment and the remaining points are the reward values with expert experience involved. With a dynamic k value, training with expert experience achieved a higher reward than the baseline experiment in the first 500 episodes. The highest reward value was −40.15 at an initial k value of 0.45, an improvement of 73.52% over the baseline. This indicated that expert experience significantly improved the initial performance of the model in the early stage of training, which accelerated learning.
Figure 13 shows the average success times and success rate curves for different initial k values, in which the points at k = 0 are the average success times and success rate of the baseline experiment. As the initial k value increased, all the curves in Figure 13 first rose and then fell. Compared with the baseline experiment, the average success times and success rate increased in every experiment with a dynamic k value. At an initial k value of 0.45, the average success times increased from 1450 to 3088, an increase of 53.04%, and the average success rate increased from 0.447 to 0.721, an increase of 38% (both percentages are expressed relative to the improved value).
Figure 14 shows the path-planning results of the multi-DOF manipulator with an initial k value of 0.45. The red apple is the target fruit, the green apple is the obstacle, and the red line is the planned path of the multi-DOF manipulator.
Table 3 compares the performance of the constant k values, the dynamic k value, and the baseline. The dynamic k value achieved the highest reward value in the early training period, 8.75% higher than the constant k value of 0.45 and 17.64% higher than the constant k value of 0.35. It also outperformed the constant k values in terms of the average success times and success rate. It was concluded that a dynamic k value in the early training period was more effective in improving the learning efficiency of the model: involving more expert experience in the early stage of training enabled the model to adapt to the environment faster, after which the amount of expert experience needed to be reduced appropriately and newly generated samples added, to compensate for the limited diversity of expert experience and to reduce its influence on policy updating.
To test the performance of the model, the average time and average path length of 100 successful path plannings were chosen as evaluation metrics, and comparative experiments were made between the proposed method and the RRT algorithm, as shown in Table 4 [1]. The path-planning result of the RRT algorithm is shown in Figure 15.
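The two evaluation metrics can be computed as sketched below. The sketch assumes that each successful run is recorded as a planning time in seconds together with a list of Cartesian waypoints, and that path length is the sum of the straight-line segment lengths; neither the data layout nor the function names are taken from the paper.

```python
import math
from statistics import mean

def path_length(waypoints):
    """Sum of Euclidean distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

def summarize(trials):
    """Average planning time and path length over successful trials.

    `trials` is a list of (planning_time_s, waypoints) pairs (hypothetical layout).
    """
    avg_time = mean(t for t, _ in trials)
    avg_length = mean(path_length(w) for _, w in trials)
    return avg_time, avg_length

# Toy example with two runs; the numbers are illustrative only.
runs = [
    (1.0, [(0, 0, 0), (3, 4, 0), (3, 4, 12)]),  # length 5 + 12 = 17
    (3.0, [(0, 0, 0), (0, 0, 3)]),              # length 3
]
print(summarize(runs))  # (2.0, 10.0)
```

Under this definition, a path with many inflection points is longer than a smoother path between the same endpoints.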
Table 4 shows that, compared with the RRT algorithm, the baseline reduced the path-planning time by 9.86% and the path length by 7.52%, indicating that the deep reinforcement learning method had an advantage over the RRT algorithm. In addition, the dynamic k value improved on the baseline by 30.54% in path-planning time and 11.84% in path length, which supports the effectiveness of the proposed method. Compared with the paths in Figure 10 and Figure 14, the path produced by the RRT algorithm had more inflection points and was longer than that of the proposed algorithm, and the time to complete a single picking was also longer. Moreover, the path-planning result of a traditional algorithm such as RRT is affected by many factors, such as the step length and the number of iterations, which makes it more complicated to apply than the proposed algorithm.
3.5. Frequency of Return Visits to Expert Experience
In this section, we liken expert experience and the policy to a teacher and a student, respectively. From the perspective of reinforcement learning, a student learning a policy may not require the full guidance of the teacher, who usually provides regular guidance to correct the errors the student encounters while learning. Similarly, the model may be guided by revisiting expert experience to correct its blind exploration during training. When return visits are too infrequent, the guidance given to the model is ineffective; when they are too frequent, the generalization of the model is reduced. Therefore, this section explores experimentally the effect of regular return visits to expert experience on the learning process, with expert experience introduced every 50 episodes.
Figure 16 shows the reward curves for different numbers of return visits. Expert experience was engaged within the first 500 episodes, after which it was accessed at a set frequency of visits. The red line is the training curve without return visits to expert experience, taken as the baseline, and the other colored curves are the training curves for different frequencies of return visits. As can be seen from the curves, the reward gradually increased as training proceeded and eventually converged.
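One possible reading of this schedule is sketched below: expert experience is used throughout the first 500 episodes, and afterwards it is revisited at fixed intervals for a set number of visits. How the 50-episode interval and the number of return visits combine is not spelled out in the text, so the placement of the visits and both function names are assumptions.

```python
def return_visit_episodes(n_visits, warmup=500, interval=50):
    """Episodes at which expert experience is revisited after the warm-up.

    Assumption: visits occur every `interval` episodes, starting one
    interval after the warm-up ends, until `n_visits` visits are made.
    """
    return {warmup + interval * i for i in range(1, n_visits + 1)}

def uses_expert(episode, visits, warmup=500):
    """True if expert experience is sampled in this episode."""
    return episode < warmup or episode in visits

visits = return_visit_episodes(n_visits=15)
print(sorted(visits)[:3], max(visits))  # [550, 600, 650] 1250
print(uses_expert(100, visits), uses_expert(600, visits), uses_expert(610, visits))
# True True False
```

In the episodes between visits, the batch would consist only of newly generated experience under this reading.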
As can be seen in Figure 16, the reward with expert experience increased rapidly in the early stage of training and reached a high level within 500 episodes, while the reward of the baseline experiment was relatively small and was still increasing at 500 episodes. With regular return visits to expert experience, the reward increased faster than in the baseline experiment and remained high, although it decreased slightly in the later training period.
Figure 17 shows the curves of the average success times and success rate for different numbers of return visits. As the frequency of return visits increased, all the curves rose and then fell. When the number of return visits was 15, the average success times were 2549, an increase of 43.11% over the baseline experiment, and the average success rate was 0.589, an increase of 24.11% (both percentages are expressed relative to the improved value). In addition, the times for 100 successful path plannings were 740 s for the baseline model and 650 s for the model with 15 return visits.
The above experiments showed that the reward rose to a higher score within a short period of time after the policy was initialized using expert experience, indicating that guidance based on expert experience could avoid unnecessary exploration. Regular return visits to expert experience allowed the multi-DOF manipulator to receive guidance in the later stage of training, which could correct errors in policy updating and keep the reward at a high level. However, the experiments also showed that when the number of return visits was too large, training performance was greatly reduced. This was because the higher the frequency of return visits, the less stable the updated policy, which made the curve less stable and slightly decreased the reward value in the later stages of training.
Figure 18 shows the path-planning results of the multi-DOF manipulator when the number of return visits was 15. The red apple is the target fruit, the green apple is the obstacle, and the red line is the planned path of the multi-DOF manipulator.