Trajectory Control of An Articulated Robot Based on Direct Reinforcement Learning

Abstract: Reinforcement Learning (RL) is gaining much research attention because it allows the system to learn from interacting with the environment. Yet, despite RL's many successful applications, the application of RL to direct joint torque control without the help of an underlying dynamic model has not been reported in the literature. This study presents a split network structure that enables the successful training of RL to learn direct torque control for trajectory following with a six-axis articulated robot, without prior knowledge of the robot's dynamic model. The training took a very long time to converge. However, we were able to show successful control of four different trajectories without needing an accurate dynamics model or complex inverse kinematics computation. To show the RL-based control's effectiveness, we also compare the RL control with Model Predictive Control (MPC), another popular trajectory control method. Our results show that while the MPC achieves smoother and more accurate control, it does not automatically treat the singularity. In addition, it requires complex inverse dynamics calculations. On the other hand, the RL controller instinctively avoided violent action around the singularities.

Author Contributions: Conceptualization, C.-H.T. and J.-Y.Y.; methodology, C.-H.T. and J.-Y.Y.; software, C.-H.T., J.-J.L. and T.-F.H.; validation, C.-H.T., J.-J.L. and T.-F.H.; formal analysis, C.-H.T.; investigation, J.-J.L.; resources, J.-Y.Y.; data curation, C.-H.T.; writing—original draft preparation, writing—review and editing, J.-J.L. and J.-Y.Y.; visualization, supervision, J.-Y.Y.


Introduction
Reinforcement Learning allows the system to learn by interacting with the environment. There is no need for a large number of labeled samples, and the system learns as it accumulates experience. Since its introduction, RL has shown great potential in many applications.
This study addresses the direct application of Reinforcement Learning (RL) to the joint torque control of an articulated robot to follow pre-specified trajectories. The RL controller uses a neural network to compute the required joint torque. It requires no knowledge of an underlying dynamic robot model and needs no complex inverse kinematics calculation. From the application point of view, a neural network-based robot arm controller is desirable. However, very few of the available RL robot control examples have addressed the direct torque control of articulated robots. Searching through the literature, one finds that most available results dealt only with kinematic trajectory planning and left the torque control problem to the controller that came with the robot. Many related results based their control on model-based inverse dynamic cancellation and used RL to learn the control parameters [1,2]. Others that did address direct joint torque control were for SCARA robots with only two links [2,3]. Considering how humans learn to control their limbs, it remains desirable to know whether one can train the robot arm the way we train ourselves to use our arms.
Learning control has been around for a long time. In 1995, Nuttin and Van Brussel proposed a learning controller to increase the insertion speed in consecutive peg-into-hole operations; the learning controller could do so without increasing the contact force level [4]. In 1997, Schaal examined the types of learning problems that benefit from demonstration. Over the years, researchers have also proposed different MPC methods. In 2014, Nikdel designed an MPC controller to control a single-degree-of-freedom shape memory alloy (SMA)-actuated rotary robotic arm to track the desired joint angle [28]. MPC found applications in dynamic and unpredictable environments such as chemical plants and oil refineries [29], as well as power system balancing [30]. More recently, MPC has also begun to find applications in the control of autonomous vehicles [31] and robotic trajectory control [32][33][34]. In 2016, Best et al. applied MPC to robot joint control; their dynamic model-based control of a soft robot showed significant improvement over the regular controller [35]. In 2017, Lunni et al. applied a Nonlinear Model Predictive Controller (NMPC) to an aerial robotic arm for tracking desired 3D trajectories, demonstrating online use of the NMPC controller with limited computational power [36]. In 2018, Guechi et al. proposed a combination of MPC and feedback-linearization control for a two-link robotic arm [37,38]. In the same year, Car et al. applied MPC-based position control to an unmanned aerial vehicle (UAV); the onboard computer running linear MPC-based position control communicates through a serial port with the low-level altitude controller [39].
This paper describes the use of RL to control a HIWIN RA605 robotic arm. The task is for the robot arm to follow a desired trajectory while maintaining a specific attitude for the end effector. We propose a split network structure that allows for the direct application of RL to robot trajectory control. We also compare the control performance to a Twin Delayed Deep Deterministic Policy Gradient (TD3) network and an MPC. The results show that the proposed RL structure can achieve articulated robot control without the help of an accurate robot model. The RL trajectory is oscillatory, but the controller instinctively avoids the singularity and requires no background model; the behavior is similar to a human holding out their hand. Although the TD3 controller still converges, it oscillates around a noticeable offset. The MPC is smooth but requires a system model and additional treatments to avoid singularities.

Deep Deterministic Policy Gradient (DDPG)
The popular DDPG was first used in this study to train the RL network. The standard reinforcement learning algorithm considers the problem of an agent interacting with an environment $E$. The policy $\pi$ for the agent's action $a$ at state $s$ is represented by a probability distribution $\pi : S \to P(A)$, where $S$ represents the state space and $A$ represents the action space. Assuming that the actions can be parameterized with parameters $\theta$, the parameterized policy becomes $\pi_\theta(s)$, which selects the action $a$ in state $s$ with parameters $\theta$ by interacting with the environment. At each time step $t$, the agent receives a reward $r$ for its action, and the return from a state is defined as the sum of the discounted future rewards, $R_t = \sum_{i \geq t} \gamma^{i-t} r(s_i, a_i)$, $\gamma \in [0, 1]$. As a result, the expected return, or the value function, for the action $a = \pi_\theta(s)$ becomes $Q^\pi(s_t, \pi_\theta(s_t)) = \mathbb{E}_{r, s \sim E, a \sim \pi}[R_t \mid s_t, a_t]$. The goal of the agent is to maximize the value function $Q$ by updating $\pi_\theta(s_t)$ and $Q^\pi(s_t, \pi_\theta(s_t))$. To allow an iterative learning process, one can express the value function $Q$ in a recursive form: $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma \mathbb{E}_{a_{t+1} \sim \pi}[Q^\pi(s_{t+1}, a_{t+1})]\right]$. One can omit the inner expectation by choosing a deterministic policy $\mu : S \to A$, which reduces the value function to an expression that depends only on the environment: $Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}[r(s_t, a_t) + \gamma Q^\mu(s_{t+1}, \mu(s_{t+1}))]$. It is now possible to use an off-policy algorithm to learn $Q^\mu$ with a greedy policy $\mu(s) = \arg\max_a Q(s, a)$.
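As a concrete illustration of the discounted return and the one-step Bellman recursion above, consider the following minimal Python sketch (the function names are ours, not from the paper):

```python
def discounted_return(rewards, gamma):
    # R_t = sum_i gamma**(i - t) * r_i, accumulated backwards for stability
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

def td_target(r, gamma, q_next):
    # one-step Bellman target: r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}))
    return r + gamma * q_next
```

With `gamma = 0.5` and three unit rewards, the return is 1 + 0.5·(1 + 0.5·1) = 1.75, matching the recursive definition term by term.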
The standard RL algorithm is not directly applicable to a continuous action space due to the need for optimization at every time step. DDPG thus resolves into an actor-critic model structure. The actor specifies the deterministic policy by mapping the states to a specific action. The critic then learns the value function using the Bellman equation. The policy gradient of the expected return can be expressed as $\nabla_\theta J = \mathbb{E}_{s \sim E}\left[\nabla_a Q(s, a)\big|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s)\right]$. The basic algorithm for DDPG is shown in Algorithm 1.

Algorithm 1: DDPG

1. Initialize the critic network Q with parameters θ, the policy network π with parameters φ, their target networks, and the replay buffer.
2. For every episode:
   A. For every time step t:
      a. Given a state s_t, take action a_t based on policy π;
      b. Obtain reward r_t and the new state s_{t+1};
      c. Store (s_t, a_t, r_t, s_{t+1}) into the buffer.
   B. For every update:
      a. Sample a batch of (s_i, a_i, r_i, s_{i+1}) from the buffer;
      b. Set the cost function and update the parameters of Q, which are θ, to minimize C(θ);
      c. Set the value function and update the parameters of π, which are φ, to maximize V(φ);
      d. Softly update the target networks, θ' ← τθ + (1 − τ)θ' and φ' ← τφ + (1 − τ)φ', where τ is a factor between 0 and 1.
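The replay buffer and the τ-weighted target update used in Algorithm 1 can be sketched as below; this is a minimal Python illustration (the class and function names are ours, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) tuples and samples training batches."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random batch, no larger than the current buffer content
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def soft_update(target_params, source_params, tau):
    # theta' <- tau * theta + (1 - tau) * theta', with tau between 0 and 1
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

A small τ (e.g., 0.001) makes the target networks trail the trained networks slowly, which stabilizes the bootstrapped value targets.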

The Modified DDPG Method
As mentioned before, the algorithm in the previous subsection adopted the greedy policy to optimize a_t for each time step and is unsuitable for application to continuous-time robotic motion control. Therefore, modifying the algorithm into an applicable actor-critic algorithm is necessary.
In the DDPG algorithm, π(s_t) represents the decision-making policy that determines the corresponding action in state s_t. The decision of the action a_t by π(s_t) does not result directly from the information of the state s_t; it requires two steps to determine. Suppose there are five possible discrete actions: π(s_t) will first decide on the most probable action and use it as its decision. Nevertheless, due to the greedy selection, the final choice may not be the same as the action with the highest probability.
In this paper, the action consists of six joint torques; there are infinitely many possible actions because torque is a continuous variable. As a result, the set of possible actions cannot be exhaustively covered. Instead of using π(s_t) to determine the most probable action, the algorithm resolves to using T(s_t) to directly determine the action a_t, which consists of the six joint torques, based on the state s_t. The algorithm also adds noise to the action to ensure value function compatibility.
Algorithm 2: DDPG-1
1. Initialize the critic network Q with parameters θ, the policy network T with parameters φ, their target networks, and the replay buffer.
2. For every episode:
   A. For every time step t:
      a. Given a state s_t, take action a_t based on T;
      b. Obtain reward r_t and the new state s_{t+1};
      c. Store (s_t, a_t, r_t, s_{t+1}) into the buffer.
   B. For every update:
      a. Sample a batch of (s_i, a_i, r_i, s_{i+1}) from the buffer;
      b. Set the cost function and update the parameters of Q, which are θ, to minimize C(θ);
      c. Set the value function and update the parameters of T, which are φ, to maximize V(φ);
      d. Softly update the target networks, where τ is a factor between 0 and 1.
Using the DDPG-1 method, one can consistently update the parameters of the critic Q and the policy T during training to obtain the optimized critic Q and policy T. For each training step, the action selected is a_t = T(s_t; φ) + N, where N is the noise added for value function compatibility. We also explored the TD3 algorithm, but it did not provide much improvement.
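The noisy action selection a_t = T(s_t; φ) + N can be sketched as follows, assuming Gaussian exploration noise and symmetric torque limits (both are our assumptions; the paper does not specify the noise type or limits):

```python
import numpy as np

def select_torque(policy_torque, noise_std, torque_limit, rng):
    # a_t = T(s_t) + N: deterministic six-joint torque from the policy,
    # plus exploration noise, clipped to the actuator limits
    noise = rng.normal(0.0, noise_std, size=policy_torque.shape)
    return np.clip(policy_torque + noise, -torque_limit, torque_limit)
```

During evaluation, the noise term is simply omitted so the policy output is applied directly.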

The Differential Kinematics
Robotic control is slightly more complicated than standard control systems. Unlike a regular control system that directly takes the control error for feedback, the robotic control system needs to adequately translate the trajectory error in the task space into the various joint torques. In other words, there is a transformation of the measurement from the "task space" into the "joint space," and the translation involves the differentiation of the Jacobian matrices, which, in turn, introduces Coriolis accelerations. The robotic control system needs to cancel or suppress this nonlinear effect. In a common end-following task, the position and attitude of the robot end-effector in the Cartesian coordinate system contain three position values (P_x, P_y, and P_z) and three attitude angles (α, β, and γ), for a total of six degrees of freedom. For the RA605 robotic arm with six joints, the controller needs to perform the inverse kinematics, compute the nonlinear dynamic forces for cancellation, and exercise the control. One of the goals of this research is to use a deep learning network in place of the inverse dynamic model and directly control the robotic arm.
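The task-space-to-joint-space translation rests on the differential kinematics relation ẋ = J(q)q̇. A minimal sketch of the inverse mapping (ours, not from the paper) uses the pseudoinverse, which also degrades gracefully near singular configurations in the least-squares sense:

```python
import numpy as np

def task_to_joint_rates(J, x_dot):
    # q_dot = pinv(J) @ x_dot: map the 6-DOF task-space velocity
    # (position rates + attitude rates) to joint velocities
    return np.linalg.pinv(J) @ x_dot
```

For a square, well-conditioned Jacobian this reduces to the ordinary inverse; near a singularity the pseudoinverse returns the minimum-norm least-squares solution instead of blowing up numerically.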

The Reinforcement Learning Structure
The RL Network Structure

The following section briefly describes the deep learning network setup. For the agent, we set 43 observed results as the inputs, including the roll angle φ, pitch angle θ, and yaw angle ψ, together with their angular velocities and angular accelerations. The measurements were made with respect to the end of the last joint relative to the reference coordinate. In addition, there are also three positions (P_x, P_y, and P_z), velocities (V_x, V_y, and V_z), and accelerations (A_x, A_y, and A_z) of the endpoint of the robotic arm, also based on the reference coordinate. Furthermore, there are six joint angles (q_1∼q_6), six joint angular velocities (q̇_1∼q̇_6), six joint angular accelerations (q̈_1∼q̈_6), six joint torques (τ_1∼τ_6), and the total power of the joints (P_joints). The outputs of the actor network are the six joint torques.
Although deep learning is a powerful tool, the computational effort involved in training the network can be too great for practical applications. Our initial attempt to help with the training was to separate the effect of the positioning error and the effect of the velocity error, to prevent the conflicting demands of holding position and moving the arm. However, the learning behavior of the network was not predictable. After many attempts, we finally came up with a unique structure for the actor network. The actor network consists of two parallel networks, each with three 225-neuron fully connected hidden layers and a fully connected output layer, as shown in Figure 1. In addition, we also break down the critic network into two parts: one for observation and the other for the robot. The observation part of the network takes the 43 observed results as inputs, and the robot part takes the six joint torques as inputs. Both parts consist of three fully connected hidden layers with 225 neurons. Their outputs combine to form the network Q value output, as shown in Figure 2. It is now easy to use the RL blocks for training, as shown in Figure 3, consisting of the Observation, Reward, Agent, Robot, and Sensor blocks. The Observation block receives the resultant states reflecting the attitude of the end effector with 43 inputs, as shown in Figure 4. The Reward block receives the resultant states and calculates the reward, as shown in Figure 5. The Agent gets the output from the Observation, Reward, and Check-If-Done blocks, updates the Q function and the π function, and outputs the reaction torques to the robot. The Sensor computes the forward kinematics as the resulting states and feeds them back to the three blocks.
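The split actor structure (two parallel branches, each with three 225-neuron fully connected hidden layers and a fully connected output layer) can be sketched as below. How the two branch outputs combine is shown only in Figure 1, so the summation used here is our assumption, and all function names are ours:

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    # one (W, b) pair per fully connected layer
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def mlp(x, params):
    # ReLU hidden layers, linear output layer
    for W, b in params[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = params[-1]
    return W @ x + b

def split_actor(obs, branch_a, branch_b):
    # two parallel branches over the same 43 observations;
    # branch outputs combined (here: summed) into six joint torques
    return mlp(obs, branch_a) + mlp(obs, branch_b)
```

With the layer sizes [43, 225, 225, 225, 6] from the paper, each branch maps the 43 observations to six torque components.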

Model Predictive Control [11,12]
As described before, MPC is based on the system dynamic model. The robot is described by the rigid-body dynamics of Equation (9), M(q)q̈ + C(q, q̇)q̇ + g(q) = τ. Define the control law of Equation (10), τ = M(q)v + C(q, q̇)q̇ + g(q), to cancel the nonlinear effect; the system then reduces to the feedback-linearized system of Equation (11), q̈ = v.
In the MPC, we apply a proportional-derivative (PD) controller given by Equation (12) to stabilize the robotic arm system, where θ_d is the vector of joint variables reflecting the desired trajectory. Consider a time interval [t, t + h], where h is the timeslot of the prediction. From Equation (10), we can develop the following two prediction models (Equation (13)) [12]: θ(t + h) = θ(t) + h·θ̇(t) + (h²/2)·v and θ̇(t + h) = θ̇(t) + h·v.
If the target joint angle is θ_d, a simplified MPC design defines a one-horizon quadratic cost function to stabilize the system (Equation (14)), J = (1/2)e(t + h)² + (ρ/2)ė(t + h)², where e(t + h) = θ_d − θ(t + h) is the predicted angular position error and ė(t + h) = 0 − θ̇(t + h) is the predicted angular velocity error. In Equation (14), the timeslot h and the weight ρ are positive real control parameters to be designed. To minimize J, one introduces the prediction model of Equation (13) into Equation (14) and sets the partial derivative ∂J/∂v to 0. Considering that the robot slowly follows the trajectory, the resulting control becomes v = k₁e − k₂θ̇, where k₁ and k₂ are gains determined by h and ρ. By allowing the target joint angle to change over time, we obtain a more general control law, Equation (18). From Equations (9), (10), and (18), a block diagram of the MPC controller and the robotic arm system is shown in Figure 6, where θ_d(t) are the target joint angles in the joint space. The main goal of the control is for the joint angles to follow their targets according to the second-order dynamics specified by the transfer function ω₀²/(s² + 2ζω₀s + ω₀²), where ζ is a damping factor and ω₀ is a natural frequency by design.
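The one-horizon minimization can be reproduced in closed form. The sketch below (ours, single joint, scalar notation) inserts the Taylor prediction over [t, t + h] with constant commanded acceleration v into the quadratic cost and solves ∂J/∂v = 0:

```python
def mpc_one_horizon(theta, theta_dot, theta_d, h, rho):
    # Predictions over the horizon h with constant v:
    #   e(t+h)    = (theta_d - theta) - h*theta_dot - (h**2/2)*v
    #   edot(t+h) = -theta_dot - h*v
    # Setting dJ/dv = 0 for J = 0.5*e(t+h)**2 + 0.5*rho*edot(t+h)**2
    # yields v = k1*e - k2*theta_dot with the gains below.
    denom = h**4 / 4.0 + rho * h**2
    k1 = (h**2 / 2.0) / denom
    k2 = (h**3 / 2.0 + rho * h) / denom
    return k1 * (theta_d - theta) - k2 * theta_dot
```

Because J is quadratic in v with positive curvature, the stationary point is the unique minimizer; one can verify numerically that perturbing v in either direction increases the cost.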

Modeling of the Robotic Arm System
HIWIN Technologies Corporation provided the D-H table of the RA605 robot arm, which can be readily incorporated into the "rvctools" Robotics Toolbox in MATLAB for simulation purposes. It is also possible to obtain the system parameters for the dynamic model of the robot.
This research applied the DDPG to the RL training algorithm. We built an RA605 robotic arm model within the MATLAB simulation environment and, for generality, defined four different target trajectories: a small circle (0.1 m diameter), a larger circle (0.2 m diameter), a small square (0.1 m side length), and a large square (0.2 m side length), as shown in Figure 7. To demonstrate the ability of six-axis trajectory control, we fixed the roll, pitch, and yaw angles of the end effector at π, −π/2, and 0 rad, respectively.

The Control Results
This section compares the control performance of the RL and MPC trajectory controls.

The RL Training Results
The 0.1 m diameter circle trajectory took about 343,070 s (22,228 episodes) to train on an i7 computer with GPU acceleration. Figure 8 shows the roll, pitch, and yaw angle tracking results in the task space. The 0.1 m square trajectory converged after about 186,100 s of training (12,284 episodes); Figure 10 shows the roll, pitch, and yaw angle tracking results in the task space. The 0.2 m square trajectory took about 129,040 s (8338 episodes) to train; Figure 11 shows the roll, pitch, and yaw angle tracking results in the task space.

The Twin Delayed Deep Deterministic Policy Gradient Training
The authors also constructed a Twin Delayed Deep Deterministic Policy Gradient (TD3) network for comparison. The network took 25,000 episodes to reach stable control. Figure 12 shows the resulting response for a short 7.5 mm vertical-line trajectory. As one observes from Figure 12, the robot arm moved too fast and oscillated around a position with a 30 mm offset.

The MPC Results
For the MPC, we selected h = 0.003 s and ρ = 1.04 × 10⁻⁵ to achieve a closed-loop bandwidth of ω₀ = 200 rad/s, with ζ = 1 for no oscillation; the corresponding gains follow from Equation (17). The block diagram for the control system is shown in Figure 13. The MPC controller contains the inverse dynamic calculation and takes the values of v, q, and q̇ as inputs and the joint torques τ as outputs. The parameter v is the constant angular acceleration produced by the MPC controller. The four target trajectories are shown in Figure 14. For comparison, we ran the same trajectories as in the previous tests, with the roll, pitch, and yaw angles in the task space set at π, −π/2, and 0 rad, respectively, and the motion constrained to a slow speed.
To compare with the RL method easily, we put the MPC results together with those of the RL method. For the 0.1 m diameter circle trajectory, the roll, pitch, and yaw angle tracking results in the task space are shown in Figure 8. Figure 9 shows the roll, pitch, and yaw angle tracking results for the 0.2 m diameter circle. The results for the 0.1 m square trajectory are shown in Figure 10, and those for the 0.2 m square target trajectory in Figure 11.

Table 4. The roll angle of the end of the robotic arm (0.2 m circle).
Furthermore, looking at this experimental result, one finds that the MPC drives joints 4 and 6 into very strenuous movements and does not restore them to their original positions, as shown in Figure 15. After examining the trajectory, one sees that the target trajectory passes through the vicinity of the singularity. The model-based control cannot avoid violent action without further treatment. The RL control, on the other hand, instinctively avoided these movements.
When the two methods are compared, it is clear that while the MPC is very effective in dealing with highly nonlinear system dynamics, it still necessitates the use of a model and additional treatment for singularities. On the other hand, the RL requires no prior knowledge of the underlying model and can drive the robotic arm through any trajectory. In addition, it is not affected by singularities. Furthermore, as the trajectory becomes larger and more complicated, the MPC no longer shows an advantage over the RL. This may be because the MPC cannot accurately calculate the attitude using the forward kinematics when the six joint angles become large.

Conclusions
This paper proposes a split network structure for the direct application of RL to robotic trajectory-following control. We first show that it is possible to train the RL network to directly mimic the inverse dynamic cancellation action without the help of a precise system model. The RL uses the modified DDPG algorithm as the underlying learning algorithm, and the tests follow four different target trajectories. This study also compares RL performance to MPC performance. The RL control inherently avoids exercising violent movements while passing through the singularities and still succeeds in handling large and complicated trajectories. The MPC requires an accurate system model but is straightforward to implement. However, it cannot avoid singularities without additional treatment and fails to control large and complicated trajectories.
Comparing the two methods shows that RL successfully drives the robotic arm through the target trajectory without excessive joint angle movements. MPC needs no training time and achieves smoother control, but it requires an accurate system model. In future work, the authors will implement the controller experimentally for verification.