Article

A Research on Manipulator-Path Tracking Based on Deep Reinforcement Learning

Key Laboratory of State Forestry Administration on Forestry Equipment and Automation, College of Engineering, Beijing Forestry University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7867; https://doi.org/10.3390/app13137867
Submission received: 23 May 2023 / Revised: 25 June 2023 / Accepted: 29 June 2023 / Published: 4 July 2023

Abstract

The continuous path of a manipulator is often discretized into a series of independent action poses during path tracking, and computing the inverse kinematic solutions of these poses is computationally demanding and does not yield unique results. This research proposes a manipulator-path-tracking method that employs deep-reinforcement-learning techniques to address this problem. The method takes an end-to-end learning approach to closed-loop control and eliminates the inverse-solution step by converting the path-tracking task into a sequential decision problem. This paper first explores the feasibility of deep reinforcement learning for tracking the path of a manipulator. After verifying feasibility, path tracking of a multi-degree-of-freedom (multi-DOF) manipulator was performed with a maximum-entropy deep-reinforcement-learning algorithm. The experimental findings demonstrate that the approach performs well in manipulator path tracking, avoids the need for an inverse kinematic solution and a dynamics model, and achieves tracking control of the manipulator in continuous space. The proposed method is therefore of practical value for research on manipulator path tracking.

1. Introduction

A manipulator is a highly integrated electromechanical system. It is a typical time-varying, highly coupled, multi-input, multi-output nonlinear system, and its control is difficult because it is subject to many uncertainties. Researchers have studied manipulator-control techniques extensively over the past few decades. Existing control methods for robots include computed-torque control, robust adaptive control [1], adaptive neural-network control [2], output-feedback control [3,4], dead-zone nonlinear-compensation control [5], virtual decomposition control [6], sliding mode control [7], and so on.
Path tracking [8,9] is a crucial problem in manipulator control: once a path-planning algorithm has successfully produced an optimal path, the manipulator's end effector must be made to follow it. According to Cai ZX [10], the acceleration and speed of each manipulator joint are treated separately; through position or speed feedback, the target speed or desired acceleration of each joint is adjusted, and the error is used as a control input. Because the end effector frequently clamps different objects during actual operation, it is difficult to determine the dynamic characteristics of each manipulator link accurately, and external disturbances and imperfections in the dynamic model make tracking and control even more challenging. To keep the unknown components within a certain range and to account for the disparity between the nominal and the real dynamic model of the manipulator, Spong [11] introduced a robust term into the control input. To increase the robustness and stability of the tracking and to compensate for the error caused by external disturbance and the linearization of the dynamic model, Purwar [12] optimized the parameters of a neural-network controller using the Lyapunov stability criterion. To prevent collisions with objects in the workspace and irregular robot configurations, Jasour [13] presented nonlinear model predictive control (NMPC), which allows the end effector of a robot manipulator to follow preset geometric paths in Cartesian space while taking the robot's nonlinear dynamics, including actuator dynamics, into account. Cheng [14] presented an intelligent control approach that combines sliding mode control (SMC) and fuzzy neural networks (FNN) to accomplish backstepping control for the manipulator-path-tracking problem. Zhang [15] introduced the concepts of the virtual rate of change of torque and the virtual voltage, which are linear in the state and control variables; by adding kinetic constraints, he transformed the non-convex minimum-time path-tracking problem into a convex optimization problem that can be solved quickly, thereby improving motion accuracy.
All of the methods mentioned above rely on exact dynamic modeling and inverse kinematics solutions. However, as the manipulator's degrees of freedom increase, the complexity of the dynamic model and the cost of the inverse kinematics computations grow, and the inverse solution is not unique. All these approaches are therefore restricted in how they can be used [16,17]. Thanks to a number of recent successful applications of reinforcement learning to decision and control problems, path-tracking problems can now be solved without system-dynamics modeling, which opens up new perspectives. Using the reinforcement-learning approach [18,19], the manipulator-path-tracking problem can be modeled as a Markov decision process (MDP) in which the control policy is optimized by interacting with samples. Guo [20] completed path tracking of the UR5 manipulator using the deep Q-network (DQN) value-function approach; however, the action space is discretized, which makes fine control of the manipulator difficult. For controlling robots with unknown parameters and dead zones, Hu [21] offered a deep-reinforcement-learning system with three components: a state network that estimates information about the robot manipulator's state, a critic network that assesses the performance of the network, and an action network that learns a strategy for improving the performance metrics. Liu [22] achieved tracking control of a manipulator in a continuous action space, but the algorithm's learning process was unstable and heavily influenced by even minor changes to the hyperparameters. According to these research findings, there is currently no deep-reinforcement-learning method that can reliably and continuously control the manipulator for path-tracking operations.
To address the aforementioned issues, this paper proposes a reinforcement-learning method for multi-DOF manipulator-path tracking, which converts the tracking-accuracy requirements and energy constraints into cumulative rewards obtained by the control strategy to ensure the stability and control accuracy of the tracking trajectory. The entropy of the policy is used as an auxiliary gain of the agent and introduced into the training process of the control strategy, thereby increasing the robustness of the path tracking. The method has good results in manipulator-path tracking, which not only avoids the process of finding the inverse kinematic solution, but also does not require a dynamics model and can ensure control over the tracking of the manipulator in continuous space.
The remainder of this paper is structured as follows. Section 2 introduces the theoretical background of deep-reinforcement-learning algorithms and then presents the specific algorithmic applications and simulation settings for the three manipulators. Section 3 presents the results of the simulation experiments on the three manipulators. Section 4 analyzes the simulation results and compares the algorithm used in this paper with other algorithms. Section 5 concludes the work and identifies potential future research directions.

2. Method

2.1. Deep Q Network

The Deep Q-Network (DQN) algorithm [23] is a classic value-function method in deep reinforcement learning that evolved from the Q-Learning algorithm in classical reinforcement learning. Q-Learning [24] is based on the Q value, i.e., the expected future cumulative reward of taking action $a$ (the action space must be finite and discrete) in state $s$ and thereafter following policy $\pi$, as defined in Equation (1).
$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a \right]$ (1)
In Equation (1), $\gamma$ is the discount factor determining the agent's horizon. The optimal value $Q^{*}$ is defined as the maximum Q value, and the strategy that attains it is the optimal strategy $\pi^{*}$. DQN uses a deep neural network $Q^{\pi}(s,a;\theta)$ with parameters $\theta$ to replace the Q value $Q^{\pi}(s,a)$, which allows the algorithm to remain valid when the input state $s$ is high-dimensional and continuous. In addition, DQN adds an experience-replay buffer and a target Q network with parameters $\theta^{-}$: the replay buffer improves the utilization efficiency of samples, and the target Q network stabilizes the regression target of the neural network's loss function. The target value is defined in Equation (2).
$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$ (2)
Therefore, the problem can be transformed into a supervised-learning problem, namely $\min_{\theta}\left(y - Q(s,a;\theta)\right)^{2}$, where the parameters $\theta$ are copied to the target-network parameters $\theta^{-}$ every $\tau$ steps.
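As an illustration, the following is a minimal PyTorch sketch of the target value in Equation (2) and the resulting squared loss; the network sizes, optimizer, soft-update rule, and variable names are assumptions made for illustration and do not reproduce the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma, tau = 8, 4, 0.99, 0.001

def make_q(state_dim, n_actions):
    return nn.Sequential(nn.Linear(state_dim, 50), nn.ReLU(),
                         nn.Linear(50, 50), nn.ReLU(),
                         nn.Linear(50, n_actions))

q_net = make_q(state_dim, n_actions)
target_net = copy.deepcopy(q_net)          # target network with parameters theta^-
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def dqn_update(batch):
    """One gradient step on the squared Bellman error (y - Q(s, a; theta))^2."""
    s, a, r, s_next, done = batch           # tensors sampled from the replay buffer (a: int64, done: float)
    with torch.no_grad():
        # Target value of Equation (2): y = r + gamma * max_a' Q(s', a'; theta^-)
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    loss = ((y - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft update of the target network (the paper uses tau = 0.001)
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
    return loss.item()
```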

2.2. Soft Actor-Critic

Although the DQN algorithm, a milestone in deep reinforcement learning, solves the problem of high-dimensional and continuous input states, which classical reinforcement learning cannot handle, it still cannot deal with output actions that are high-dimensional and continuous (as in a multi-degree-of-freedom manipulator). Other deep-reinforcement-learning algorithms (such as DDPG [25] and TD3 [26]) can handle high-dimensional continuous output actions, but they usually suffer from high sample complexity and brittle convergence, which requires additional hyperparameter tuning.
The Soft Actor-Critic (SAC) algorithm [27,28] is a reinforcement-learning algorithm that introduces the maximum-entropy theory. In the framework of the algorithm, the strategy not only needs to maximize the expected cumulative reward value, but also needs to maximize the expected entropy, as shown in Equation (3),
$\pi^{*} = \arg\max_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim \rho_{\pi}}\left[ r(s_t, a_t) + \alpha \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right]$ (3)
where α is the weight of the entropy term, which can determine the relative importance of the entropy term relative to the reward term, thereby controlling the randomness of the optimal strategy.
The maximum-entropy framework uses a technique called "soft policy iteration" that alternates between policy evaluation and policy improvement to accomplish this goal. When the state space is discrete, the method obtains the soft Q value by starting from a randomly initialized function $Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and repeatedly applying the modified Bellman backup operator $\mathcal{T}^{\pi}$, as shown in Equation (4),
$\mathcal{T}^{\pi} Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim p}\left[ V(s_{t+1}) \right]$ (4)
where
$V(s_t) = \mathbb{E}_{a_t\sim\pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]$ (5)
is the soft state-value function used to calculate the policy value in policy evaluation (Equation (5)). When the state space is continuous, a neural network with parameters $\theta$ is first used to represent the soft Q-function $Q_{\theta}(s_t, a_t)$, and it is then trained to minimize the soft Bellman residual, as shown in Equation (6),
$J_{Q}(\theta) = \mathbb{E}_{(s_t, a_t)\sim\mathcal{D}}\left[ \tfrac{1}{2}\left( Q_{\theta}(s_t, a_t) - \left( r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim p}\left[ V_{\bar{\theta}}(s_{t+1}) \right] \right) \right)^{2} \right]$ (6)
which can also be optimized with stochastic gradients:
$\nabla_{\theta} J_{Q}(\theta) = \nabla_{\theta} Q_{\theta}(s_t, a_t)\left( Q_{\theta}(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\theta}}(s_{t+1}) \right)$ (7)
where $V_{\bar{\theta}}(s_{t+1})$ is estimated with the target Q network and a Monte Carlo estimate of the soft state-value function computed from samples drawn from the experience pool.
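A minimal sketch of how the soft Bellman residual of Equations (4)-(6) might be computed follows. It assumes a critic `q_net(s, a)`, a target critic `q_target_net(s, a)`, and a policy object whose `sample` method returns an action and its log-probability (a sketch of such a policy appears after Equation (11) below); these interfaces are illustrative assumptions rather than the authors' code.

```python
import torch

def soft_q_loss(q_net, q_target_net, policy, batch, gamma=0.99, alpha=1e-3):
    """Soft Bellman residual of Equation (6) for one minibatch from the replay pool D."""
    s, a, r, s_next, done = batch                     # tensors sampled from the replay pool
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)     # a' ~ pi(.|s'), log pi(a'|s')
        # Soft state value of Equation (5), with the entropy weight alpha of Equation (3):
        # V(s') = Q(s', a') - alpha * log pi(a'|s')
        v_next = q_target_net(s_next, a_next) - alpha * logp_next
        # Bellman backup of Equation (4)
        target = r + gamma * (1.0 - done) * v_next
    q = q_net(s, a)                                   # Q_theta(s, a)
    return 0.5 * ((q - target) ** 2).mean()
```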
The goal of policy improvement is to maximize the attainable reward, so the policy is updated towards the exponential of the new soft Q-function. Because the policy is restricted to a parameterized family of distributions (such as Gaussians), it must be projected back into the permissible policy space using an information projection defined in terms of the Kullback–Leibler (KL) divergence, as shown in Equation (8),
$\pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\left( \pi'(\cdot \mid s_t) \,\middle\|\, \frac{\exp\left( Q^{\pi_{\mathrm{old}}}(s_t, \cdot) \right)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right)$ (8)
where the partition function $Z^{\pi_{\mathrm{old}}}(s_t)$ can be ignored because it has no effect on the gradient. Furthermore, the policy $\pi_{\phi}(a_t \mid s_t)$ is parameterized with a neural network that outputs the mean and variance of a Gaussian distribution, and the parameters of the policy are then learned by minimizing the expected KL divergence, as shown in Equation (9).
$J_{\pi}(\phi) = \mathbb{E}_{s_t\sim\mathcal{D}}\left[ \mathbb{E}_{a_t\sim\pi_{\phi}}\left[ \log \pi_{\phi}(a_t \mid s_t) - Q_{\theta}(s_t, a_t) \right] \right]$ (9)
However, since it is difficult to differentiate through sampling from the Gaussian distribution $a \sim \mathcal{N}(m, s)$, the sample is rewritten in a form whose gradient is easy to obtain: $a = m + s\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$, i.e., $a_t = f_{\phi}(\varepsilon_t; s_t)$ (the reparameterization trick). The policy network can then be optimized by applying the policy gradient to the expected future reward, as shown in Equation (10).
$J_{\pi}(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\, \varepsilon_t\sim\mathcal{N}}\left[ \log \pi_{\phi}\left( f_{\phi}(\varepsilon_t; s_t) \mid s_t \right) - Q_{\theta}\left( s_t, f_{\phi}(\varepsilon_t; s_t) \right) \right]$ (10)
The gradient of Equation (10) can then be approximated by Equation (11):
$\nabla_{\phi} J_{\pi}(\phi) = \nabla_{\phi} \log \pi_{\phi}(a_t \mid s_t) + \left( \nabla_{a_t} \log \pi_{\phi}(a_t \mid s_t) - \nabla_{a_t} Q(s_t, a_t) \right) \nabla_{\phi} f_{\phi}(\varepsilon_t; s_t)$ (11)
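To illustrate the reparameterization trick behind Equations (9)-(11), the following is a minimal PyTorch sketch of a Gaussian policy whose samples can be differentiated; the layer sizes, clamping range, and the absence of action squashing are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy pi_phi(a|s) that outputs the mean and log-std of a Gaussian over joint actions."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.body(s)
        m, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(m, log_std.exp())
        a = dist.rsample()                    # reparameterized: a = m + sigma * eps, eps ~ N(0, 1)
        logp = dist.log_prob(a).sum(dim=-1)   # log pi_phi(a_t | s_t)
        return a, logp                        # bounding a (e.g. with tanh) would require a Jacobian correction

def policy_loss(policy, q_net, s):
    """Monte Carlo estimate of Equation (10): E[ log pi_phi(a|s) - Q_theta(s, a) ]."""
    a, logp = policy.sample(s)
    return (logp - q_net(s, a)).mean()
```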

2.3. Deep-Reinforcement-Learning Algorithm Combined with Manipulator

This paper models the path-tracking problem of the manipulator as a Markov decision process (MDP), represented by the tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$, where $s_t \in \mathcal{S}$ denotes the observations of the agent. The policy $\pi: \mathcal{S} \to \mathcal{A}$ maps the current environmental state $s_t$ to the control input $a_t \in \mathcal{A}$ of each joint of the manipulator, and $T(s_{t+1} \mid s_t, a_t)$ represents the dynamic characteristics of the robotic arm, that is, the probability that the system transitions from state $s_t$ to $s_{t+1}$ under the control input $a_t$. The desired path $P \in \mathbb{R}^{N\times 3}$ of the manipulator can be generated by traditional path-planning methods, where $N$ is the number of points on the path. The instantaneous reward obtained by the agent at time $t$ is denoted $r_t \in R$ and is related to the accuracy with which the robot arm tracks the desired path and to the energy consumed. The policy continuously interacts with the manipulator system to obtain sampled trajectories $\tau = (s_0, a_0, \ldots, s_t, a_t, \ldots, s_T, a_T)$. The objective of reinforcement learning is to maximize the expected cumulative reward that the agent receives, as illustrated in Equation (12),
$\max_{\pi}\ \mathbb{E}_{\tau\sim p(\tau)}\left[ \sum_{t=0}^{T} \gamma^{t} r(s_t, a_t) \right]$ (12)
where
$p(\tau) = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t)\, T(s_{t+1} \mid s_t, a_t)$ (13)
Figure 1 depicts the framework of the deep-reinforcement-learning-based robot-arm path-tracking model. The framework is made up of the manipulator body, the desired path, the control strategy, and a feedback controller. The tracking error, i.e., the desired path minus the actual path of the manipulator, is the input signal of the control strategy. From this input, the control strategy determines the expected position and speed of each joint at the next instant, which serve as the reference signal for the feedback controller. The feedback controller combines each joint's current position and velocity with this reference and produces the joint torques needed to move the manipulator's end point, thereby performing the path-tracking function.
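The following sketch illustrates this closed loop: the learned policy turns the tracking error into joint references, and a low-level feedback controller turns those references into joint torques, as in Figure 1. The environment interface (`env`), gains, and method names are illustrative assumptions and not the simulation platform's API.

```python
import numpy as np

def track_path(env, policy, desired_path, kp=2.0, kd=0.1, steps_per_point=1):
    """Closed-loop tracking sketch: policy output is the reference for a PD-style feedback controller."""
    q, dq = env.joint_positions(), env.joint_velocities()
    for p_star in desired_path:                        # iterate over desired path points
        error = p_star - env.end_effector_position()   # tracking error (input signal)
        state = np.concatenate([q, dq, error])
        dq_ref = policy(state)                         # reference joint-velocity increment from the strategy
        q_ref = q + dq_ref * env.dt                    # corresponding reference joint position
        for _ in range(steps_per_point):
            q, dq = env.joint_positions(), env.joint_velocities()
            tau = kp * (q_ref - q) + kd * (dq_ref - dq)    # feedback controller output torque
            env.apply_joint_torques(tau)
            env.step()
    return env.end_effector_position()
```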
This framework was applied to a two-link manipulator, a multi-degree-of-freedom manipulator, and a redundant manipulator to test path tracking on the V-REP PRO EDU 3.6.0 simulation platform.

2.3.1. Application of Two-Link Manipulator

The planar two-link manipulator simulation system is shown in Figure 2. The settings of the two-link manipulator in the simulation environment are as follows. The lengths of the links are $l_1 = 1.0\ \mathrm{m}$ and $l_2 = 0.8\ \mathrm{m}$, and the masses of the links are $m_1 = 0.1\ \mathrm{kg}$ and $m_2 = 0.08\ \mathrm{kg}$. Each joint adopts an incremental control method; that is, at each time step $t$, each joint rotates by a fixed angle increment $\Delta\theta_t^i = 0.05°$ ($i = 1, 2$) in the direction given by the control signal $a_t = (\Delta\theta_t^1, \Delta\theta_t^2) \in \mathbb{R}^2$. The state of the entire simulation system is $s_t = (\theta_t^1, \theta_t^2, \dot{\theta}_t^1, \dot{\theta}_t^2, x_t, y_t, x_t^*, y_t^*) \in \mathbb{R}^8$, where $\theta_t^i$ and $\dot{\theta}_t^i$ are the angle and angular velocity of the $i$-th joint at time $t$, $(x_t, y_t)$ is the position of the end point of the manipulator at time $t$, and the desired target-point position $(x_t^*, y_t^*)$ is set as in Equation (14).
$x_t^* = l_2 \cos(\omega_1 t + \omega_2 t) + l_1 \cos(\omega_1 t)$, $y_t^* = l_2 \sin(\omega_1 t + \omega_2 t) + l_1 \sin(\omega_1 t)$, with $\omega_1 = \omega_2 = 1\ \mathrm{rad/s}$ (14)
The two-link manipulator's output action has low dimensionality and can be approximated as a discrete quantity, so the classical DQN algorithm is used. The network structure of the strategy in the DQN algorithm is as follows: the input state is 8-dimensional, the output action is 2-dimensional, and there are two hidden layers with 50 nodes each. The hyperparameters are set as follows: replay buffer = 1 × 10⁶, learning rate = 3 × 10⁻⁴, discount factor = 0.99, batch size = 64; the update between the Q network and the target Q network adopts the soft update method with soft parameter tau = 0.001. In addition, the reward is set to $r_t = \exp\left(-\lVert p_t^* - p_t \rVert\right)$, where $p_t^*$ and $p_t$ are the target path point at time $t$ and the position of the end point of the robot arm, respectively.
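For concreteness, the following sketch evaluates the desired end-point trajectory of Equation (14) and the distance-based reward used in this experiment; variable names are illustrative and the code does not reproduce the simulation environment.

```python
import numpy as np

L1, L2 = 1.0, 0.8                 # link lengths (m)
W1 = W2 = 1.0                     # joint angular frequencies (rad/s)

def desired_point(t):
    """Desired end-point position (x*_t, y*_t) of Equation (14)."""
    x = L2 * np.cos(W1 * t + W2 * t) + L1 * np.cos(W1 * t)
    y = L2 * np.sin(W1 * t + W2 * t) + L1 * np.sin(W1 * t)
    return np.array([x, y])

def reward(p_star, p):
    """Distance-based tracking reward r_t = exp(-||p*_t - p_t||)."""
    return float(np.exp(-np.linalg.norm(p_star - p)))
```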

2.3.2. Application of Multi-Degree-of-Freedom Manipulator

This paper applies the algorithm to a multi-degree-of-freedom manipulator, the UR5, to achieve path tracking under continuous control. The UR5 simulation system, shown in Figure 3, uses a deep-reinforcement-learning algorithm to track a path in the presence of obstacles after the path is generated by a conventional path-generation algorithm. The system's actions are set as $a = (a_1, a_2, a_3, a_4, a_5, a_6) \in \mathbb{R}^6$ and its states as $s = (\theta_1, \ldots, \theta_6, \dot{\theta}_1, \ldots, \dot{\theta}_6, x, y, z) \in \mathbb{R}^{15}$, where $\theta_i$ and $\dot{\theta}_i$ are the angle and angular velocity of the $i$-th joint and $(x, y, z)$ are the components of the distance between the end point $p$ and the corresponding desired target point $p^*$. The initial position of the end point is [−0.095, −0.160, 0.892] and the initial position of the target point is [−0.386, 0.458, 0.495]. The desired path is generated by the traditional RRT [1,29] path-generation algorithm with the stride set to 100.
In addition, this experiment set up 4 additional variables to explore the impact of these factors on tracking performance:
  • The upper-level control method of the manipulator adopts two modes, position control and velocity control. Position control acts on the joint angle: the input action is the increment of the joint angle, and the range of the increment at each moment is set to [−0.05, 0.05] rad. Velocity control acts on the joint angular velocity: the input action is the increment of the joint angular velocity, and the increment range at each moment was set to [−0.8, 0.8] rad/s in the experiment. In addition, the low-level control of the manipulator adopts the traditional PID torque-control algorithm.
  • Addition of noise to the observations. Two groups of control experiments were set up, one of which added random noise to the observations; the noise was drawn from the standard normal distribution N(0, 1) and scaled by 0.005, i.e., 0.005 × N(0, 1).
  • The setting of the time-interval distance n. The target path point given to the manipulator every n time steps is the path point at time N·n, where N = 1, 2, 3, …; this is used to study the effect of different interval points on the tracking results. In the experiments, the interval distances were set to 1, 5, and 10.
  • Terminal reward. A control experiment was set up in which, during training, an additional reward of +5 was given whenever the distance between the end point of the robotic arm and the target point was within 0.05 m (the termination condition), to study its impact on the tracking results.
To improve sampling efficiency, the SAC algorithm combines the value-based and policy-based approaches: the value-based part learns the Q-value function or state-value function V, and the policy-based part learns the policy function. This makes it suitable for continuous, high-dimensional action spaces and, therefore, appropriate for tracking control of the manipulator. As a result, the SAC algorithm was chosen to control the tracking of the multi-degree-of-freedom manipulator in this research. All network structures were as follows: each network contained two hidden layers with 200 nodes per layer, and the activation function of the hidden layers was ReLU. The hyperparameters were set as follows: replay buffer = 1 × 10⁶, discount factor = 0.99, batch size = 128; the update between the Q network and the target Q network adopted the soft update method with soft parameter tau = 0.01; the learning rates of the Actor and Critic networks were both set to 1 × 10⁻³; and the weight coefficient of the policy entropy during the entire training process was α = 1 × 10⁻³. The reward for this experiment was set as in Equation (15),
$r_t = -\left\lVert p^{*}_{\tau_{t,n}} - p_t \right\rVert, \qquad \tau_{t,n} = \begin{cases} n\left(\lfloor t/n \rfloor + 1\right), & n > 1 \\ t, & n = 1 \end{cases}$ (15)
where $n$ is the interval distance. In addition, an episode was terminated when the robot arm had run for 100 steps or when the distance between the end point of the robot arm and the target point was within 0.05 m.
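The following is a minimal sketch of how the interval target index $\tau_{t,n}$, the reward of Equation (15), the 0.05 m termination check, and the optional +5 terminal reward might be computed; the function names, the index clipping, and the use of the last path point as the target point are illustrative assumptions.

```python
import numpy as np

def target_index(t, n):
    """Index tau_{t,n} of the desired path point used at time t (Equation (15))."""
    return t if n == 1 else n * (t // n + 1)

def step_reward(path, p_end, t, n, terminal_bonus=False, tol=0.05):
    """Negative distance to the interval target point, plus an optional +5 terminal reward.
    `path` is the desired path (an N x 3 array), `p_end` the current end-point position."""
    idx = min(target_index(t, n), len(path) - 1)          # clip to the last available path point
    r = -np.linalg.norm(path[idx] - p_end)                # reward of Equation (15)
    done = np.linalg.norm(path[-1] - p_end) < tol         # termination: end point within 0.05 m of the target
    if terminal_bonus and done:
        r += 5.0                                          # extra terminal reward studied in Section 2.3.2
    return r, done
```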

2.3.3. Application of Redundant Manipulator

The algorithm proposed in this paper was also applied to a 7-degree-of-freedom redundant manipulator for path tracking. The simulation system of the 7-DOF redundant manipulator is shown in Figure 4. The simulation platform was still V-REP PRO EDU 3.6.0, and the simulated manipulator was the KUKA LBR iiwa 7 R800 redundant manipulator. The actions were set as $a = (a_1, \ldots, a_7) \in \mathbb{R}^7$ and the states as $s = (\theta_1, \ldots, \theta_7, \dot{\theta}_1, \ldots, \dot{\theta}_7, x, y, z) \in \mathbb{R}^{17}$, where $\theta_i$ and $\dot{\theta}_i$ are the angle and angular velocity of the $i$-th joint and $(x, y, z)$ is the distance between the end point $p$ of the manipulator and the corresponding desired target point $p^*$. The initial position of the end point was [0.0044, 0.0001, 1.1743], and the initial position of the target point was [0.0193, 0.4008, 0.6715]. The expected path was an arc trajectory generated by the path-generation algorithm, with the step size set to 50.
The setup of the redundant-manipulator path-tracking experiment was the same as that of the UR5. The experiment still used the continuous-control reinforcement-learning algorithm SAC, and all network structures were the same as in the UR5 setup: each network contained two hidden layers with 200 nodes per layer, and the activation function of the hidden layers was ReLU. The hyperparameter settings were also the same: replay buffer = 1 × 10⁶, discount factor = 0.99, batch size = 128, soft update between the Q network and the target Q network with soft parameter tau = 0.01, learning rates of the Actor and Critic networks both set to 1 × 10⁻³, and the weight coefficient of the policy entropy during the whole training process α = 1 × 10⁻³.
The reward settings were also the same as those used before. The only difference was that the redundant-manipulator path-tracking experiment was conducted to verify the generalization of the algorithm, so it did not explore the effect of changing the other hyperparameters. Therefore, the default interval distance in the reward setting was n = 1.

3. Simulation Results

In this section, the simulation results of path tracking of three manipulators are compared and analyzed.

3.1. Simulation Results of Planar Two-Link Manipulator

The two-link manipulator experiment was mainly used to verify the feasibility of the reinforcement-learning algorithm in the field of manipulator-path tracking. The specific parameter settings for the experiment are detailed in Section 2.3.1. The tracking-curve results of this experiment are shown in Figure 5.
The blue line represents the real operating end path, while the red line represents the desired-goal path. The tracking findings in Figure 5 show that the deep-reinforcement-learning-based strategy completely succeeds in tracking the target path. The experimental findings in the simulation environment are displayed in Figure 6.
The experimental results show that it is completely feasible to use the deep-reinforcement-learning algorithm to achieve path tracking with a simple two-link manipulator.

3.2. Simulation Results of Multi-Degree-of-Freedom Manipulator

After exploring the application of reinforcement learning in path tracking, as well as its successful application to the two-link manipulator to achieve the tracking target, the application of the multi-degree-of-freedom manipulator UR5 to conduct path tracking under continuous control was also explored. The specific parameter settings for the experiment are detailed in Section 2.3.2.
The experimental results are shown in the following figures. Figure 7a,b show the path-tracking results without observation noise and with observation noise in the position-control mode, respectively. Figure 7c,d show the path-tracking results without observation noise and with observation noise in the velocity-control mode, respectively. Different time intervals are set in each picture; the upper three curves of each picture are the results without the terminal reward, and the lower three curves are the results with the terminal reward.
In addition, the tracking results were analyzed quantitatively: the average error between the obtained path and the target path and the distance between the end point of the manipulator and the target point at the final moment were calculated under the different experimental conditions. The tracking-accuracy requirement was defined as an average path error of less than 0.07 m and a final target-point error of less than 0.05 m. The results are shown in Table 1 and Table 2.
The training-process curve is shown in Figure 8, where (a) is the training-process curve without observation noise in position-control mode, (b) is the training-process curve with the addition of observation noise in position-control mode, (c) is the training-process curve without observation noise in velocity-control mode, and (d) is the training-process curve with the addition of observation noise in velocity-control mode. The X-axis represents the number of training sessions and the Y-axis represents the reward value set in the reinforcement learning. There are six curves in each graph, which represent the training curves with different time intervals with or without terminal reward. The training curves that reach a reward value of 0 on each graph are those with the terminal reward added.
In addition, since the system-dynamics model is not considered in the deep-reinforcement-learning-based path-tracking experiment, and in order to verify the advantage of not requiring a dynamic model, this paper further explores the influence of changes in the dynamic characteristics on the experimental results. To this end, the mass of the end-effector load was changed and the trained model was tested; the experimental results are shown in Table 3 and Table 4.
The smoothness of the end-effector trajectory [30,31] affects the overall working effect in both scientific trials and real-world production, and the energy that the manipulator consumes while it operates is another crucial reference indicator [32]. Therefore, the performance of the proposed algorithm was compared with that of the conventional inverse-kinematics approach in terms of trajectory smoothness and energy consumption; the experimental findings are displayed in Table 5 and Table 6. The smoothness of the trajectory was determined from the angle between the tangent vectors at neighboring points of the curve, and the smoothness of the manipulator motion was measured as the average of these turning angles over the entire trajectory, with the angle expressed in degrees. The energy consumption of the manipulator throughout the path-tracking process is calculated by Equation (16),
$E = \int_{t_0}^{t_M} P(t)\, dt \approx \sum_{k=0}^{M} P_k\, dt, \qquad P_k = \sum_{i=1}^{n} P_i^k, \qquad P_i^k = \tau_i^k\, \dot{\theta}_i^k$ (16)
where $k$ indexes the $k$-th path point in the entire path, $i$ indexes the $i$-th joint of the manipulator, $\tau_i^k$ and $\dot{\theta}_i^k$ are the joint torque and joint speed, $M$ is the number of path points, and $dt$ is the time interval between adjacent path points.
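As an illustration of these two metrics, the following sketch computes the average turning angle between successive tangent vectors and the energy of Equation (16); the array layouts and the constant time step are illustrative assumptions.

```python
import numpy as np

def smoothness(points):
    """Average turning angle (degrees) between tangent vectors of neighboring path points."""
    points = np.asarray(points)
    tangents = np.diff(points, axis=0)
    angles = []
    for v1, v2 in zip(tangents[:-1], tangents[1:]):
        cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        angles.append(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))
    return float(np.mean(angles))

def energy(joint_torques, joint_velocities, dt):
    """Energy of Equation (16): sum over path points of P_k * dt, with P_k = sum_i tau_i^k * dtheta_i^k.
    `joint_torques` and `joint_velocities` are M x n arrays; dt is the (assumed constant) time step."""
    power_per_point = np.sum(np.asarray(joint_torques) * np.asarray(joint_velocities), axis=1)
    return float(np.sum(power_per_point * dt))
```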

3.3. Simulation Results of Redundant Manipulator

The UR5 manipulator was used above to verify and examine the algorithm proposed in this study experimentally, and the findings demonstrate its effectiveness in solving the manipulator path-tracking problem. In addition, verification was conducted on a redundant manipulator to further confirm the efficacy and generalizability of the technique. The specific parameter settings for the experiment are detailed in Section 2.3.3.
Since this was a verification experiment, it was only used to explore the path-tracking results in the speed-control mode. The path-tracking results of the redundant manipulator are shown in Figure 9.
This study also takes into account the sampling randomness of the deep-reinforcement-learning algorithm. Numerous trials were conducted using a variety of random seed settings in order to show the robustness of the methodology proposed. Figure 10 displays the experimental results’ training-process curve. The X-axis represents the number of training sessions and the Y-axis represents the average return set in the reinforcement learning. It can be seen that the training process can still converge even if there are fluctuations.
The training results under various random seed settings demonstrate that the generalization and stability of the method in this work are assured, and the experimental results demonstrate that it still had an excellent tracking effect on the redundant manipulator.

4. Discussion

The data provided in Table 1 and Table 2 were used to examine how path tracking is affected by the four additional factors described in Section 2.3.2. The experimental outcomes demonstrate that the algorithm performs satisfactorily for both position control and speed control, and introducing noise into the observations during simulation training increased the control strategy's robustness and noise resistance. It was found that when the value of n was large (n = 10), convergence to the target point was better but tracking of the target path was worse; when n = 1, the situation was the opposite. Therefore, when choosing the value of n, a trade-off must be accepted between the path-tracking performance and the final position of the end point. Giving an additional reward when the manipulator's end point comes within the permissible error range of the target point during simulation training can help the manipulator converge to the target point better, albeit at the cost of some path-tracking precision.
In this experimental scheme, the target path generated by the RRT algorithm is clearly not smooth, as shown in Figure 7. However, the tracking path generated by the SAC reinforcement-learning algorithm according to this target path is smoother while still satisfying the tracking-accuracy requirement, which is advantageous for actual execution. The SAC algorithm also has a better capacity for exploration and can accelerate training because it is based on the maximum-entropy framework; Figure 8 shows that the training curves converged early.
Table 3 and Table 4 show that even when the load changes, the model trained with a fixed load mass still provides stable deep-reinforcement-learning-based path tracking, indicating that variations in the dynamic characteristics have little effect on the algorithm's performance. This reflects the benefit of applying deep reinforcement learning to manipulator path tracking: the approach in this study does not require dynamics modeling.
The experimental findings demonstrate that the algorithm presented in this research can, for the most part, satisfy the tracking-accuracy requirements. Since both the proposed method and the conventional Jacobian-matrix approach met these requirements, the two assessment indices of trajectory smoothness and energy consumption, which are more significant in actual operations, were chosen for comparison. Table 5 and Table 6 show that the proposed algorithm outperforms the conventional inverse-kinematics approach in terms of energy consumption and trajectory smoothness, demonstrating its efficacy and applicability.
Current research on deep-reinforcement-learning algorithms for path tracking mainly combines either the DQN algorithm or the DDPG algorithm with path tracking [33]. The DQN method can only handle discrete action quantities, discretizing the action space into 27 possible actions in Cartesian space, so it still depends on the robot system to compute the inverse kinematic solution of the position when executing actions. The DDPG algorithm resolves the discretization issue and permits continuous control during path tracking, but it lacks robustness: even minor parameter changes can have a significant impact on its performance or cause it to stop working altogether. The combination of the SAC algorithm with the manipulator proposed in this paper allows continuous control in the action space during path tracking, and the inputs and outputs are joint angles and joint velocities, which control the robotic arm directly without requiring an inverse kinematic solution for the action. Moreover, the SAC algorithm adopts the maximum-entropy principle and considers not only the optimal action but also other sub-optimal actions, achieving a trade-off between expected return and entropy and improving robustness. Therefore, when controlling the manipulator for tracking, the algorithm in this paper is easier to adjust in the face of disturbances. As a result, it is more suitable for the manipulator path-tracking problem than the DQN and DDPG algorithms. However, the proposed algorithm still has a limitation: the network is trained with the desired path as input, so the network must be trained first before the robotic arm can be controlled for path tracking. The proposed algorithm is not yet able to track an arbitrary input path in real time; it is therefore more suitable for work scenarios in which specific tasks are performed repeatedly.

5. Conclusions

This paper presents a technique for implementing manipulator path tracking using a deep-reinforcement-learning algorithm. The target path is generated by a conventional path-planning method, while the control signal for controlling the manipulator and tracking the target path is generated by a deep-reinforcement-learning approach. In this study, simulation experiments were conducted on a six-degree-of-freedom robotic arm, the form most widely used in practical applications and research. The experimental results show that the method performs well in manipulator path tracking: it avoids the process of finding the inverse kinematic solution, does not require a dynamics model, and achieves tracking control of the manipulator in continuous space. In addition, further verification experiments on path tracking with a redundant manipulator reflected the generalization and stability of our method. Therefore, the method used in this paper is important for the study of deep reinforcement learning in conjunction with manipulator path tracking. In response to the issues noted in the Discussion section, the network will next be trained using inputs consisting of randomly generated 3D paths. In this way, the problem of the inability to track in real time can be resolved, since the trained network will be able to control the manipulator to execute path tracking within a specific working region for randomly generated desired paths.

Author Contributions

Conceptualization, P.Z. and J.Z.; methodology, P.Z. and J.K.; software, P.Z. and J.Z.; resources, J.K.; data curation, J.Z.; writing—original draft preparation, P.Z.; writing—review and editing, P.Z.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key-Area Research and Development Program of Guangdong Province, grant no. 2019B020223003, and the Guangdong Basic and Applied Basic Research Foundation, grant no. 2022A1515140013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author only upon reasonable request.

Acknowledgments

We are very grateful to the anonymous reviewers for their constructive comments for improving this paper.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Arteaga-Peréz, M.A.; Pliego-Jiménez, J.; Romero, J.G. Experimental Results on the Robust and Adaptive Control of Robot Manipulators Without Velocity Measurements. IEEE Trans. Control Syst. Technol. 2020, 28, 2770–2773. [Google Scholar] [CrossRef]
  2. Liu, A.; Zhao, H.; Song, T.; Liu, Z.; Wang, H.; Sun, D. Adaptive control of manipulator based on neural network. Neural Comput. Appl. 2021, 33, 4077–4085. [Google Scholar] [CrossRef]
  3. Zhang, S.; Wu, Y.; He, X. Cooperative output feedback control of a mobile dual flexible manipulator. J. Frankl. Inst. 2021, 358, 6941–6961. [Google Scholar] [CrossRef]
  4. Gao, J.; He, W.; Qiao, H. Observer-based event and self-triggered adaptive output feedback control of robotic manipulators. Int. J. Robust Nonlinear Control 2022, 32, 8842–8873. [Google Scholar] [CrossRef]
  5. Zhou, Q.; Zhao, S.; Li, H.; Lu, R.; Wu, C. Adaptive neural network tracking control for robotic manipulators with dead zone. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3611–3620. [Google Scholar] [CrossRef]
  6. Zhu, W.H.; Lamarche, T.; Dupuis, E.; Liu, G. Networked embedded control of modular robot manipulators using VDC. IFAC Proc. Vol. 2014, 47, 8481–8486. [Google Scholar] [CrossRef]
  7. Jung, S. Improvement of Tracking Control of a Sliding Mode Controller for Robot Manipulators by a Neural Network. Int. J. Control Autom. Syst. 2018, 16, 937–943. [Google Scholar] [CrossRef]
  8. Cao, S.; Jin, Y.; Trautmann, T.; Liu, K. Design and Experiments of Autonomous Path Tracking Based on Dead Reckoning. Appl. Sci. 2023, 13, 317. [Google Scholar] [CrossRef]
  9. Leica, P.; Camacho, O.; Lozada, S.; Guamán, R.; Chávez, D.; Andaluz, V.H. Comparison of Control Schemes for Path Tracking of Mobile Manipulators. Int. J. Model. Identif. Control 2017, 28, 86–96. [Google Scholar] [CrossRef]
  10. Cai, Z.X. Robotics; Tsinghua University Press: Beijing, China, 2000. [Google Scholar]
  11. Fareh, R.; Khadraoui, S.; Abdallah, M.Y.; Baziyad, M.; Bettayeb, M. Active Disturbance Rejection Control for Robotic Systems: A Review. Mechatronics 2021, 80, 102671. [Google Scholar] [CrossRef]
  12. Purwar, S.; Kar, I.N.; Jha, A.N. Adaptive output feedback tracking control of robot manipulators using position measurements only. Expert Syst. Appl. 2008, 34, 2789–2798. [Google Scholar] [CrossRef]
  13. Jasour, A.M.; Farrokhi, M. Fuzzy Improved Adaptive Neuro-NMPC for Online Path Tracking and Obstacle Avoidance of Redundant Robotic Manipulators. Int. J. Autom. Control 2010, 4, 177–200. [Google Scholar] [CrossRef] [Green Version]
  14. Cheng, M.B.; Su, W.C.; Tsai, C.C.; Nguyen, T. Intelligent Tracking Control of a Dual-Arm Wheeled Mobile Manipulator with Dynamic Uncertainties. Int. J. Robust Nonlinear Control 2013, 23, 839–857. [Google Scholar] [CrossRef]
  15. Zhang, Q.; Li, S.; Guo, J.-X.; Gao, X.-S. Time-Optimal Path Tracking for Robots under Dynamics Constraints Based on Convex Optimization. Robotica 2016, 34, 2116–2139. [Google Scholar] [CrossRef]
  16. Annusewicz-Mistal, A.; Pietrala, D.S.; Laski, P.A.; Zwierzchowski, J.; Borkowski, K.; Bracha, G.; Borycki, K.; Kostecki, S.; Wlodarczyk, D. Autonomous Manipulator of a Mobile Robot Based on a Vision System. Appl. Sci. 2023, 13, 439. [Google Scholar] [CrossRef]
  17. Tappe, S.; Pohlmann, J.; Kotlarski, J.; Ortmaier, T. Towards a follow-the-leader control for a binary actuated hyper-redundant manipulator. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 3195–3201. [Google Scholar]
  18. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  19. Martín-Guerrero, J.D.; Lamata, L. Reinforcement Learning and Physics. Appl. Sci. 2021, 11, 8589. [Google Scholar] [CrossRef]
  20. Guo, X. Research on the Control Strategy of Manipulator Based on DQN. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2018. [Google Scholar]
  21. Hu, Y.; Si, B. A Reinforcement Learning Neural Network for Robotic Manipulator Control. Neural Comput. 2018, 30, 1983–2004. [Google Scholar] [CrossRef] [Green Version]
  22. Liu, Y.C.; Huang, C.Y. DDPG-Based Adaptive Robust Tracking Control for Aerial Manipulators With Decoupling Approach. IEEE Trans Cybern 2022, 52, 8258–8271. [Google Scholar] [CrossRef]
  23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  24. Fujimoto, S.; Meger, D.; Precup, D. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; pp. 2052–2062. [Google Scholar]
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  26. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  27. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  28. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  29. Karaman, S.; Frazzoli, E. Sampling-Based Algorithms for Optimal Motion Planning. Int. J. Robot. Res. 2011, 30, 846–894. [Google Scholar] [CrossRef] [Green Version]
  30. Yang, J.; Li, D.; Ye, C.; Ding, H. An Analytical C3 Continuous Tool Path Corner Smoothing Algorithm for 6R Robot Manipulator. Robot. Comput.-Integr. Manuf. 2020, 64, 101947. [Google Scholar] [CrossRef]
  31. Kim, M.; Han, D.-K.; Park, J.-H.; Kim, J.-S. Motion Planning of Robot Manipulators for a Smoother Path Using a Twin Delayed Deep Deterministic Policy Gradient with Hindsight Experience Replay. Appl. Sci. 2020, 10, 575. [Google Scholar] [CrossRef] [Green Version]
  32. Carvajal, C.P.; Andaluz, V.H.; Roberti, F.; Carelli, R. Path-Following Control for Aerial Manipulators Robots with Priority on Energy Saving. Control Eng. Pract. 2023, 131, 105401. [Google Scholar] [CrossRef]
  33. Li, B.; Wu, Y. Path Planning for UAV Ground Target Tracking via Deep Reinforcement Learning. IEEE Access 2020, 8, 29064–29074. [Google Scholar] [CrossRef]
Figure 1. The framework of the robot-arm path-tracking model based on deep reinforcement learning.
Figure 2. Simulation system of two-link manipulator.
Figure 3. UR5 simulation system.
Figure 4. The redundant-manipulator-simulation system.
Figure 5. Path-tracking curve of two-link manipulator based on DQN.
Figure 6. Simulation results of path tracking of two-link manipulator based on DQN.
Figure 7. Path-tracking results of UR5 manipulator based on maximum-entropy reinforcement learning. (a) Tracking results without observation noise in position-control mode. (b) Tracking results with observation noise in position-control mode. (c) Tracking results without observation noise in velocity-control mode. (d) Tracking results with observation noise in velocity-control mode.
Figure 8. Training-process curves. (a) Training-process curve without observed noise in position-control mode. (b) Training-process curve with observation noise in position-control mode. (c) Training-process curve without observed noise in speed-control mode. (d) Training-process curve with observation noise in speed-control mode.
Figure 9. Verification results of redundant-manipulator path tracking.
Figure 10. Redundant-manipulator path-tracking training-process curve.
Table 1. Results of position-control-mode path tracking.

| Position Control | | w/o Observation Noise | | | Observation Noise | | |
| | Interval | 1 | 5 | 10 | 1 | 5 | 10 |
| Average error between tracks (m) | w/o terminal reward | 0.0374 | 0.0330 | 0.0592 | 0.0394 | 0.0427 | 0.0784 |
| | terminal reward | 0.0335 | 0.0796 | 0.0502 | 0.0335 | 0.0475 | 0.0596 |
| Distance between end points (m) | w/o terminal reward | 0.0401 | 0.0633 | 0.0420 | 0.0443 | 0.0485 | 0.0292 |
| | terminal reward | 0.0316 | 0.0223 | 0.0231 | 0.0111 | 0.0148 | 0.0139 |
Table 2. Results of velocity-control-mode path tracking.

| Velocity Control | | w/o Observation Noise | | | Observation Noise | | |
| | Interval | 1 | 5 | 10 | 1 | 5 | 10 |
| Average error between tracks (m) | w/o terminal reward | 0.0343 | 0.0359 | 0.0646 | 0.0348 | 0.0318 | 0.0811 |
| | terminal reward | 0.0283 | 0.0569 | 0.0616 | 0.0350 | 0.0645 | 0.0605 |
| Distance between end points (m) | w/o terminal reward | 0.0233 | 0.0224 | 0.0521 | 0.0456 | 0.0365 | 0.0671 |
| | terminal reward | 0.0083 | 0.0030 | 0.0337 | 0.0275 | 0.0192 | 0.0197 |
Table 3. Analysis of dynamic characteristics of position control.

| Position Control | | | 0.5 kg | 1 kg | 2 kg | 3 kg | 5 kg |
| Average error between tracks (m) | w/o observation noise | w/o terminal reward | 0.03742 | 0.03743 | 0.03744 | 0.03745 | 0.03746 |
| | | terminal reward | 0.03354 | 0.03354 | 0.03359 | 0.03355 | 0.03355 |
| | observation noise | w/o terminal reward | 0.03943 | 0.03943 | 0.03943 | 0.03942 | 0.03941 |
| | | terminal reward | 0.03346 | 0.03346 | 0.03346 | 0.03346 | 0.03345 |
| Distance between end points (m) | w/o observation noise | w/o terminal reward | 0.04047 | 0.04047 | 0.04048 | 0.04049 | 0.04050 |
| | | terminal reward | 0.03165 | 0.03166 | 0.03157 | 0.03159 | 0.03161 |
| | observation noise | w/o terminal reward | 0.04430 | 0.04441 | 0.04438 | 0.04430 | 0.04436 |
| | | terminal reward | 0.01110 | 0.01109 | 0.01109 | 0.01108 | 0.01105 |
Table 4. Analysis of dynamic characteristics of velocity control.

| Velocity Control | | | 0.5 kg | 1 kg | 2 kg | 3 kg | 5 kg |
| Average error between tracks (m) | w/o observation noise | w/o terminal reward | 0.03426 | 0.03427 | 0.03425 | 0.03426 | 0.03425 |
| | | terminal reward | 0.02826 | 0.02825 | 0.02866 | 0.02873 | 0.02882 |
| | observation noise | w/o terminal reward | 0.03478 | 0.03479 | 0.03483 | 0.03486 | 0.03497 |
| | | terminal reward | 0.03503 | 0.03503 | 0.03503 | 0.03502 | 0.03501 |
| Distance between end points (m) | w/o observation noise | w/o terminal reward | 0.02326 | 0.02444 | 0.02436 | 0.02430 | 0.02422 |
| | | terminal reward | 0.00831 | 0.01201 | 0.01395 | 0.01463 | 0.01513 |
| | observation noise | w/o terminal reward | 0.04560 | 0.04562 | 0.04565 | 0.04569 | 0.04578 |
| | | terminal reward | 0.02748 | 0.02746 | 0.02743 | 0.02741 | 0.02733 |
Table 5. Analysis of track smoothness.

| Velocity Control | w/o Terminal Reward | | | Terminal Reward | | | Jacobian Matrix |
| Interval | 1 | 5 | 10 | 1 | 5 | 10 | |
| Smoothness | 0.5751 | 0.3351 | 0.5925 | 0.0816 | 0.5561 | 0.4442 | 0.7159 |
Table 6. Analysis of energy consumption.

| Energy Consumption | | 0.5 kg | 1 kg | 2 kg | 3 kg | 5 kg |
| Position Control | w/o terminal reward | 4.44438 | 4.71427 | 5.27507 | 5.79426 | 6.92146 |
| | terminal reward | 5.01889 | 5.34258 | 5.95310 | 6.55227 | 7.76305 |
| Velocity Control | w/o terminal reward | 4.97465 | 5.38062 | 6.23886 | 6.95099 | 8.33596 |
| | terminal reward | 6.03735 | 6.37981 | 7.05696 | 7.75185 | 9.15828 |
| Traditional | Jacobian matrix | 8.95234 | 9.81593 | 10.8907 | 10.9133 | 13.3241 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, P.; Zhang, J.; Kan, J. A Research on Manipulator-Path Tracking Based on Deep Reinforcement Learning. Appl. Sci. 2023, 13, 7867. https://doi.org/10.3390/app13137867
