Manipulator Trajectory Optimization Using Reinforcement Learning on a Reduced-Order Dynamic Model with Deep Neural Network Compensation

: This article reports the construction of an articulated manipulator’s hybrid dynamic model and trajectory planning and optimization of the manipulator using deep reinforcement learning (RL) on the dynamic model. The hybrid model was composed of a physical-based reduced-order dynamic model, linear friction and damping terms, and a deep neural network model to compensate for the nonlinear characteristics of the manipulator. The hybrid model then served as the digital twin of the manipulator for trajectory planning to optimize energy efﬁciency and operation speed by using RL while taking obstacle avoidance into consideration. The proposed strategy was simulated and experimentally validated. The energy consumption along paths was reduced and the speed was increased so the manipulator could achieve more efﬁcient motion.


Introduction
The fourth industrial revolution is strongly promoted by governments around the world.Although industrial robots have already been widely deployed in factories for repetitive tasks, such as loading/unloading and pick-and-place, there are many tasks in factories that still require human operators.The use of manipulators is highly likely to increase, and the intelligence of the manipulator will therefore become increasingly important.For example, the smart and automatic generation of efficient or fast motion trajectories for a manipulator is needed.Trajectory planning has been an important issue for several decades because it strongly determines the effectiveness and cost of factory operations.Although the issue is classic, solutions continue to evolve due to the development of new technologies, such as artificial intelligence (AI).In the following section, related works on the trajectory optimization of robot arms are listed, including the traditional and the machine-learning methods.
The trajectory optimization of manipulators has been reported using the classic method, especially for energetic optimization.Singh and Leu used dynamic programming to optimize the trajectory to improve the energy efficiency and speed of one and two degrees of freedom (DOF) manipulators [1].Field and Stepanenko adopted iterative dynamic programming for the trajectory energy optimization of a six-DOF industrial robot [2].Hirakawa and Kawamura optimized the trajectory energy of a three-DOF robot arm using the cost function method [3].Gregory et al. studied trajectory planning problems for a selective compliance articulated robot arm (SCARA) using an optimal control method [4].Hansen et al. optimized the trajectory for an industrial robot; they considered a dynamic model of the robot, friction losses, and losses related to the servo and inverter [5].
Machines 2023, 11, 350 2 of 24 Wigström et al. used a dynamic planning method to reduce the energy consumption of a robot arm by 10-20% [6].Along with some works aiming for joint torque optimization [7,8], speed and time optimization are other popular targets.Sahar and Hollerbach used a graph search to optimize the time-minimum path for a two-DOF robot [9].Gasparetto and Zanotto optimized a trajectory by considering the integral of jerk and execution time [10].Rubio et al. proposed a computationally efficient algorithm to generate a time optimal and collision-free trajectory by evolving the trajectories in a discretization space and considering the physical constraints of industrial robots in industrial environments [11].Ghasemi et al. translated the control problem into a nonlinear two-point boundary value problem to calculate the time-optimal trajectories of robot arms [12].Schulman et al. developed trajectory optimization based on the sequential convex optimization procedure and efficient formulation of constraints to apply to various collision-free robot tasks in simulation [13].On the other hand, some papers dealt with the control optimization for manipulators based on dynamic models.For instance, Krivošej and Šika actuated a three-DOF manipulator by the auxiliary cable mechanism and optimized the cable force distribution by using the simplex and genetic algorithm (GA) based on dynamic model analysis [14].
The typical trajectory optimization problem can also be approached by AI.Choi et al. proposed a two-phase learning algorithm to achieve global and local trajectory optimization and obstacle avoidance for more energy-efficient aircraft navigation [15].Garg and Kumar used a GA and simulated annealing to find the optimal trajectory with minimum torque [16].Tian and Collins used a GA to plan the trajectory with obstacle avoidance [17].Števo et al. used the GA to optimize trajectory planning for an ABB robot arm [18].Abu-Dakka et al. planned time-optimal trajectories for an industrial robot using an evolutionary algorithm [19].Among research studies on AI, machine learning with deep neural network (DNN) models, aiding its versatility, has been utilized for the trajectory planning problem.Glasius et al. used the neural network method to generate trajectories with obstacle avoidance for a two-DOF robot arm [20].Martín and Millán also utilized DNN to calculate the inverse kinematics for a robot manipulator, which enabled the robot to avoid obstacles [21].Imajo et al. applied a recurrent neural network to generate obstacleavoidance trajectories [22].Levine et al. used deep learning for a robot arm to perform hand-eye coordinated grasping tasks [23].Qiao et al. designed constraint equations for various physical limitations and self-collision avoidance for a mobile manipulator, defined the trajectory tracking as an optimization problem, and proposed the differential evolution algorithm to solve it [24].
To tackle the implicit relationship between the goal and complex dynamics, reinforcement learning (RL) has recently become a popular method in robotics [25] and taskoriented research.Stulp et al. used RL to generate motions for a robot manipulator for pick-and-place tasks [26].Cheng and Zhang used deep reinforcement learning to develop obstacle avoidance control strategies and applied them to an unmanned marine vessel [27].Cao et al. utilized Q-learning to construct a scalable and applicable method that can maximize the probability of arriving on time for vehicles in transportation systems [28].For trajectory optimization, Kollar and Roy used RL for dynamic programming and a support vector machine to optimize the trajectory efficiency of robot campus exploration [29].Akrour et al. developed a model-free trajectory optimization algorithm, which backpropagated a local Q-function, achieved a policy update in a closed form, and was deployed for two trajectory optimization tasks in simulations [30].Li et al. used RL based on the path integral of dynamic movement primitive in joint space, and trained a manipulator with visual feedback to grasp objects under external perturbations [31].Bucinskas et al. deployed online deep Q-learning to improve the positioning accuracy for an industrial manipulator with various operation times, speed, and load [32].
While the energy consumption of trajectory planning algorithms largely depends on the dynamic model of robot arms, the reported works rarely investigated physical models and trajectory optimization simultaneously.However, many industrial tasks require the precise estimation of the energy consumption and joint torque of the manipulator.For Machines 2023, 11, 350 3 of 24 example, if a factory must use a manipulator to move very heavy loads, it needs to estimate and minimize energy consumption through trajectory optimization to save on electricity costs.In addition, the manipulator utilized in this paper was designed to carry sensors to examine patients lying in a hospital bed.More specifically, the proposed application is a postoperative free-flap registration and tracking system [33,34].The goal of the application is to monitor the condition of the free flap on the face, head, or neck after surgery.The robot arm is set up beside the bed, and attached is an infrared camera and an RGB camera.The image system periodically calculates the condition of the free flap.The cameras need to keep a specific distance and angle from the free flap to obtain valid images, but the patient might move slightly or roll over unconsciously.Therefore, the image system will calculate where the cameras should be placed, and then the robot arm will move the cameras to that position.Some monitors or instruments placed around the bed to measure the patient's health condition constitute obstacles, so the robot arm must avoid collisions with these objects.In such scenarios, the torque values of the manipulator should be precisely determined to meet medical safety standards and mitigate severe injuries that could result from accidents.
To precisely know the joint torques and calculate energy consumption, the precise dynamic model has to be figured out.Among research of robot arm dynamic model calibration, scarce reported work combined a physical-based model with a black-box (like DNN) model, while there exist many empirical factors in the dynamic model of manipulators other than analytic parts.For this paper, this approach was adopted, and the results are reported in two serial parts.The first is the development of a hybrid dynamic model, and the second is using the deep RL method to optimize the collision-free trajectory for energy and time consumption based on the developed dynamic model.The contributions are summarized as follows: • A hybrid dynamic model of the manipulator was constructed, which was composed of an analytic three-DOF reduced-order dynamic model, linear non-idealized terms for mechanical energy loss, and a DNN model to calibrate other non-measurable and highorder empirical factors.While the hybrid model can largely decrease the computation complexity of the analytic model and reduce the training load and dataset size for DNN, it maintained high similarity to the real robot arm.

•
To achieve energy efficiency with obstacle avoidance, the trajectory optimization of the manipulator was treated as a bandit problem.A trajectory planner initialized by a bi-directional rapidly-exploring random tree (Bi-RRT) and optimized by deep RL was introduced to spare complex mathematical deduction and calculations of the compound-goaled problem.

•
Simulations and experiments were thoroughly conducted.In addition, performance comparisons to GA, human-taught trajectories, and the time compression method were executed as well.The results show that the proposed optimization method yields better performance.
The remainder of the paper is organized as follows.Section 2 describes the hybrid model and bounding box model of the manipulator, and Section 3 introduces the trajectory planning and optimization using RL.Section 4 reports the results of the simulation and experiments, and Section 5 concludes the work.

The Manipulator Models
The main method proposed in this paper is divided into two stages.First, the improved dynamic model of the manipulator was proposed by using the physical and data-driven sub-models.Then, the stateless deep reinforcement learning was utilized to optimize the trajectory and achieve obstacle avoidance based on the improved manipulator dynamic model.In this section, the construction of the hybrid dynamic model and the bounding box model for collision-free trajectory planning is demonstrated.Parameters were set according to those of the experimental validation platform, the six-DOF manipulator (TM5-900, Techman Robot, Inc., Taoyuan City, Taiwan), as shown in Figure 1a.
bounding box model for collision-free trajectory planning is demonstrated.Parameters were set according to those of the experimental validation platform, the six-DOF manipulator (TM5-900, Techman Robot, Inc., Taoyuan City, Taiwan), as shown in Figure 1a.

The Hybrid Dynamic Model of The Manipulator
The schematic diagram and the reduced-order dynamic model of the manipulator are shown in Figure 1b,c, respectively.The configuration design in Figure 1b is in contrast to the conventional industrial manipulator, which features a symmetrical mechanical design for dynamic balance.Furthermore, the last three DOFs intersect at a point to avoid singularity.Such a configuration, which supposedly offers more maneuverability, flexibility, and agility, is becoming increasingly common in modern collaborative robots, such as Sawyer [35], UR5e [36], and Panda [37].The dynamics of the six-DOF articulated manipulator is approximated to a three-DOF reduced-order model to reduce the computation complexity.In more detail, in the general six-DOF dynamic model, the first and last three DOFs of the manipulator are mainly used for translational and rotational motions, respectively.To ease the process of solving the inverse kinematics, the axes of the last three DOFs in general design of articulated manipulators are intersected with one another, and the physical space of the manipulator that hosts these three DOFs (i.e., the waist) is usually compact and lightweight.Thus, when the manipulator performs tasks with an extended arm posture and large displacement, the rotational motion of the last three DOFs has little effect on the overall dynamics.Thus, the analysis of the overall six-DOF dynamics of the manipulator can be approximated by using a reduced-order three-DOF manipulator model that maintains the three translational DOFs and approximates the three rotational DOFs as a point mass mounted at the end of the third link.In this way, the computation cost of the reduced-order model is much lower than that of the original model, yet maintains the ability to extract the dynamic characteristics of the original manipulator.This approximation is achievable because the manipulator in general has less mass, a smaller size, and small motors that output less power in the last three joints compared to the first three DOFs.Thus, the manipulator's energy consumption is mainly the result of the first three DOFs.Note that the mass and inertia effects of the last three DOFs with respect to the base frame are still maintained in the reduced-order model because the mass of the last three DOFs are preserved by the modeling of the joints as a point mass; only

The Hybrid Dynamic Model of The Manipulator
The schematic diagram and the reduced-order dynamic model of the manipulator are shown in Figure 1b,c, respectively.The configuration design in Figure 1b is in contrast to the conventional industrial manipulator, which features a symmetrical mechanical design for dynamic balance.Furthermore, the last three DOFs intersect at a point to avoid singularity.Such a configuration, which supposedly offers more maneuverability, flexibility, and agility, is becoming increasingly common in modern collaborative robots, such as Sawyer [35], UR5e [36], and Panda [37].The dynamics of the six-DOF articulated manipulator is approximated to a three-DOF reduced-order model to reduce the computation complexity.In more detail, in the general six-DOF dynamic model, the first and last three DOFs of the manipulator are mainly used for translational and rotational motions, respectively.To ease the process of solving the inverse kinematics, the axes of the last three DOFs in general design of articulated manipulators are intersected with one another, and the physical space of the manipulator that hosts these three DOFs (i.e., the waist) is usually compact and lightweight.Thus, when the manipulator performs tasks with an extended arm posture and large displacement, the rotational motion of the last three DOFs has little effect on the overall dynamics.Thus, the analysis of the overall six-DOF dynamics of the manipulator can be approximated by using a reduced-order three-DOF manipulator model that maintains the three translational DOFs and approximates the three rotational DOFs as a point mass mounted at the end of the third link.In this way, the computation cost of the reduced-order model is much lower than that of the original model, yet maintains the ability to extract the dynamic characteristics of the original manipulator.This approximation is achievable because the manipulator in general has less mass, a smaller size, and small motors that output less power in the last three joints compared to the first three DOFs.Thus, the manipulator's energy consumption is mainly the result of the first three DOFs.Note that the mass and inertia effects of the last three DOFs with respect to the base frame are still maintained in the reduced-order model because the mass of the last three DOFs are preserved by the modeling of the joints as a point mass; only the rotational kinematic energy of the last three DOFs are ignored.This reduced-order model assumption restricts the solution to systems similar to the application proposed in [33,34], in which the author participated.The manipulator carries cameras at the endpoint and never changes the load during the task, so the last three joints could be considered a lump of mass.Meanwhile, the last joints do allow relatively small rotations during the task.In addition, the proposed algorithm can be extended to a six-DOF dynamic model as well, but the learning algorithm requires many more computational resources and a longer training time.
The hybrid dynamic model is composed of three sub-models: where where + means pseudo inverse.The data on the right side were experimentally collected, and the experiment part is described in Section 4.1.
In addition to the two sub-models based on physical dynamics, the sub-model τ DNN based on DNN was added to cover other nonlinear and higher-order dynamic behaviors of the manipulator, such as lubricant viscosity in the bearings, the dead zone of motor activation, and static friction.The DNN overweighs other methods for such nonlinear and arcane problems, and it can run faster with suitable equipment.The DNN model was trained using supervised learning.The inputs of the model include angle θ, velocity The DNN model has five hidden layers, with 256, 512, 1024, 512, and 256 neurons in each layer, respectively.The activation functions in the hidden layers are tanh, and those of the output layer are linear.These hyper-parameters and model structures were determined by heuristics and experience.During the training process, the dropout rate of each hidden layer was set to 0.5, and the Adam optimizer was utilized to minimize the mean square error between the actual torque and the torque value generated from the DNN model.θ , τ actual , τ analytic , and τ loss were utilized to train the DNN model using Equation (6).Finally, the hybrid dynamic model was constructed, as shown in Equation (1).
Machines 2023, 11, x FOR PEER REVIEW 6 of 25 square error between the actual torque and the torque value generated from the DNN model.Figure 2 depicts the overall construction flow of the hybrid dynamic model.First, the three-DOF analytic reduced-order dynamic model was constructed given the mechanical parameters of the manipulator from the manufacturer.Then, the joint states ,  ,  and the torques  of the manipulator at each instant during motion were collected.The torques of the analytic model  were correspondingly computed using Equation (2).The time-series of  and  were utilized to calculate the coefficients in Equation (3) by using Equation (5).Then, given the time series of joint states ,  ,  , the torques of the loss model  were computed by using Equation (3).Next, the time series of ,  ,  ,  ,  , and  were utilized to train the DNN model using Equation (6).Finally, the hybrid dynamic model was constructed, as shown in Equation ( 1).

The Bounding Box Model and Singularity Penalty
To generate collision-free motion trajectories, the actual geometric shape of the manipulator and obstacles should be considered; however, the actual shape of each manipulator link is too complicated to calculate collision in a reasonable time.Thus, oriented bounding box techniques [38] were utilized.
Each object in the simulation environment, including the manipulator links and surrounding obstacles, was enclosed by a compact oriented rectangular bounding box.The separating axis theorem (SAT) [39] was then applied to check for dynamic collisions between any two bounding boxes [40].In the three-dimensional case, if there was no collision of two cuboids, such as bounding boxes A and B for example, there was at least one separating plane between them.In considering the planes spanning two edges of the two boxes and the planes spanning two edges of the same box, there are three axes that constitute the face normal from A and three axes as the face normal from B. Furthermore,

The Bounding Box Model and Singularity Penalty
To generate collision-free motion trajectories, the actual geometric shape of the manipulator and obstacles should be considered; however, the actual shape of each manipulator link is too complicated to calculate collision in a reasonable time.Thus, oriented bounding box techniques [38] were utilized.
Each object in the simulation environment, including the manipulator links and surrounding obstacles, was enclosed by a compact oriented rectangular bounding box.The separating axis theorem (SAT) [39] was then applied to check for dynamic collisions between any two bounding boxes [40].In the three-dimensional case, if there was no collision of two cuboids, such as bounding boxes A and B for example, there was at least one separating plane between them.In considering the planes spanning two edges of the two boxes and the planes spanning two edges of the same box, there are three axes that constitute the face normal from A and three axes as the face normal from B. Furthermore, there are nine axes from all the pairs of edges of A and B, with Edge 1 of A crossing Edge 1 of B, Edge 1 of A crossing Edge 2 of B, Edge 2 of A crossing Edge 1 of B, and so on.Therefore, there were a total of 3 + 3 + 9 = 15 possible separating planes and axes that should be checked.The flow chart showing self-collision detection is shown in Figure 3.
there are nine axes from all the pairs of edges of A and B, with Edge 1 of A crossing Edge 1 of B, Edge 1 of A crossing Edge 2 of B, Edge 2 of A crossing Edge 1 of B, and so on.Therefore, there were a total of 3 + 3 + 9 = 15 possible separating planes and axes that should be checked.The flow chart showing self-collision detection is shown in Figure 3.In the empirical implementation, the position and orientation of each link along with the bounding box can be derived in real time using forward kinematics with known joint angles.The position and orientation of other static or moving obstacles should be predefined or detected using other sensory systems.Once the status of all the objects is known, the link-to-link or link-to-obstacle collisions can be detected.Figure 4 illustrates a simulation environment, where the manipulator and obstacles are covered by the bounding boxes ready to check the collision status.In the empirical implementation, the position and orientation of each link along with the bounding box can be derived in real time using forward kinematics with known joint angles.The position and orientation of other static or moving obstacles should be predefined or detected using other sensory systems.Once the status of all the objects is known, the link-to-link or link-to-obstacle collisions can be detected.Figure 4 illustrates a simulation environment, where the manipulator and obstacles are covered by the bounding boxes ready to check the collision status.
there are nine axes from all the pairs of edges of A and B, with Edge 1 of A crossing Edge 1 of B, Edge 1 of A crossing Edge 2 of B, Edge 2 of A crossing Edge 1 of B, and so on.Therefore, there were a total of 3 + 3 + 9 = 15 possible separating planes and axes that should be checked.The flow chart showing self-collision detection is shown in Figure 3.In the empirical implementation, the position and orientation of each link along with the bounding box can be derived in real time using forward kinematics with known joint angles.The position and orientation of other static or moving obstacles should be predefined or detected using other sensory systems.Once the status of all the objects is known, the link-to-link or link-to-obstacle collisions can be detected.Figure 4 illustrates a simulation environment, where the manipulator and obstacles are covered by the bounding boxes ready to check the collision status.Notice that the process to deal with singularities of the manipulator is similar to that for collision avoidance.When a singularity happens, a penalty score will be given so that the learning process knows that it is not a good trajectory.In addition, no other restrictions are placed for the lumped joints because the joints are fixed, and they also have separated bounding boxes when calculating the collision.

Trajectory Optimization by Reinforcement Learning
After the model was constructed and the collision avoidance strategy was set up, the next step was trajectory optimization for energy efficiency or motion speed, which improves the performance of the manipulator in an empirical factory utilization.
The general motion trajectory of the manipulator is composed of key positions, including a start point, via points, and an end point.While the start and end points are usually known a priori, the variation of the trajectory lies in the position and time stamp of the via points, which are the optimization variables.In between the start, via, and end points, a cubic spline trajectory was adopted [41].The speed limitation of the first three joints of the manipulator was also considered and set to 180 [deg/ sec] according to the manipulator's operation manual [42].In this case, a smooth trajectory (i.e., position and velocity continuity) prevents the manipulator from wasting energy with abrupt motions that easily reach joint torque limits; thus, the generated trajectory can be optimized for the overall motion requirement, but it is not constrained by undesired abrupt movements.
Deep reinforcement learning was utilized for trajectory optimization because it can spare complicated constraint setting and analytics while achieving comparably good results, which is user-friendly for industrial users.The algorithm used here was proximal policy optimization (PPO), proposed by OpenAI in 2017 [43].The PPO is based on the actor-critic structure.The actor, which is gradually optimized regarding its performance during the training process, takes actions by changing the positions or time stamps of via points in this case.This is followed by the critic, which gives the actor a score according to the action it takes and guides the actor to change its action to achieve a higher score.The reward here was set for energy or speed optimization, and the reward of the trajectory was higher if its energy consumption or elapsed time was lower.Different trajectories have different rewards, and the trajectory with the highest reward is the desired one.The actor-critic structure in PPO is illustrated in Figure 5. Notice that the process to deal with singularities of the manipulator is similar to that for collision avoidance.When a singularity happens, a penalty score will be given so that the learning process knows that it is not a good trajectory.In addition, no other restrictions are placed for the lumped joints because the joints are fixed, and they also have separated bounding boxes when calculating the collision.

Trajectory Optimization by Reinforcement Learning
After the model was constructed and the collision avoidance strategy was set up, the next step was trajectory optimization for energy efficiency or motion speed, which improves the performance of the manipulator in an empirical factory utilization.
The general motion trajectory of the manipulator is composed of key positions, including a start point, via points, and an end point.While the start and end points are usually known a priori, the variation of the trajectory lies in the position and time stamp of the via points, which are the optimization variables.In between the start, via, and end points, a cubic spline trajectory was adopted [41].The speed limitation of the first three joints of the manipulator was also considered and set to 180 [deg/sec] according to the manipulator's operation manual [42].In this case, a smooth trajectory (i.e., position and velocity continuity) prevents the manipulator from wasting energy with abrupt motions that easily reach joint torque limits; thus, the generated trajectory can be optimized for the overall motion requirement, but it is not constrained by undesired abrupt movements.
Deep reinforcement learning was utilized for trajectory optimization because it can spare complicated constraint setting and analytics while achieving comparably good results, which is user-friendly for industrial users.The algorithm used here was proximal policy optimization (PPO), proposed by OpenAI in 2017 [43].The PPO is based on the actor-critic structure.The actor, which is gradually optimized regarding its performance during the training process, takes actions by changing the positions or time stamps of via points in this case.This is followed by the critic, which gives the actor a score according to the action it takes and guides the actor to change its action to achieve a higher score.The reward here was set for energy or speed optimization, and the reward of the trajectory was higher if its energy consumption or elapsed time was lower.Different trajectories have different rewards, and the trajectory with the highest reward is the desired one.The actor-critic structure in PPO is illustrated in Figure 5.The optimization was treated as a bandit problem or a stateless problem since the arrangement of object position in industrial factories changes once in a long time; therefore, the state, containing the start point, end point, and obstacles, almost maintains constant.The classic RL problem generates action data corresponding to various state data.In contrast, each case of the trajectory planning was trained and optimized separately without any state data inputting to the DNN.During this process, it is not necessary to consider transition models as well.In the actor space, the algorithm adjusted the positions of the via points for energy optimization or adjusted the timestamps of the via points for The optimization was treated as a bandit problem or a stateless problem since the arrangement of object position in industrial factories changes once in a long time; therefore, the state, containing the start point, end point, and obstacles, almost maintains constant.The classic RL problem generates action data corresponding to various state data.In contrast, each case of the trajectory planning was trained and optimized separately without any state data inputting to the DNN.During this process, it is not necessary to consider transition models as well.In the actor space, the algorithm adjusted the positions of the via points for energy optimization or adjusted the timestamps of the via points for speed optimization.Then, the value function was updated by collecting action and reward data.Finally, the data were used to update the parameters of the actor and the critic.
The DNN models in the learning algorithm contain an actor network and a critic network.There is no input layer in the networks.Although they were tuned slightly using Machines 2023, 11, 350 9 of 24 heuristics, the following network configurations referenced the model structure in the original PPO paper [43].The two networks both have two fully-connected hidden layers based on rectified linear unit (ReLU) neurons.Both networks have 512 neurons in the first layer and 256 neurons in the second layer.The critic network only has one neuron in the output layer, and the action data from the actor network contain angles of the first three manipulator joints for all via points.The output layer of the actor network parallel includes the mean layer µ and the covariance layer σ, both as the same dimensions as the number of action data, passing through a Gaussian probability function to increase exploration when training [44].In this study, RL was used to solve this optimization problem, and the goal was to find the best via points connected by cubic equations.That is, once the via points were determined, the trajectory was also determined.As shown in Figure 5, the action output from the actor network led to adjustments to the via points, and the reward given back by the dynamic model considered the collision and energy consumption.Here, the principle was that the more serious the collision was, the more negative the reward would be; if there was no collision, less total energy consumed by the entire trajectory could obtain a more positive reward.The total algorithm to optimize the via points based on PPO is depicted in Figure 6.
Machines 2023, 11, x FOR PEER REVIEW 9 of 25 speed optimization.Then, the value function was updated by collecting action and reward data.Finally, the data were used to update the parameters of the actor and the critic.
The DNN models in the learning algorithm contain an actor network and a critic network.There is no input layer in the networks.Although they were tuned slightly using heuristics, the following network configurations referenced the model structure in the original PPO paper [43].The two networks both have two fully-connected hidden layers based on rectified linear unit (ReLU) neurons.Both networks have 512 neurons in the first layer and 256 neurons in the second layer.The critic network only has one neuron in the output layer, and the action data from the actor network contain angles of the first three manipulator joints for all via points.The output layer of the actor network parallel includes the mean layer  and the covariance layer , both as the same dimensions as the number of action data, passing through a Gaussian probability function to increase exploration when training [44].In this study, RL was used to solve this optimization problem, and the goal was to find the best via points connected by cubic equations.That is, once the via points were determined, the trajectory was also determined.As shown in Figure 5, the action output from the actor network led to adjustments to the via points, and the reward given back by the dynamic model considered the collision and energy consumption.Here, the principle was that the more serious the collision was, the more negative the reward would be; if there was no collision, less total energy consumed by the entire trajectory could obtain a more positive reward.The total algorithm to optimize the via points based on PPO is depicted in Figure 6.

Determination of The Initial Via Points
If there is no obstacle, the determination of the initial via points is not critical because the follow-up optimization will adjust the position and time stamp of the via points.In contrast, when obstacles do exist, the initially selected via points should be outside and away from the obstacles to speed the optimization convergence.The bi-directional rapidlyexploring random tree (Bi-RRT) [45] was utilized to find the initial via points.
The details of the implementation of the Bi-RRT are described as follows.First, the physical spaces taken by the manipulator and the obstacles were modeled as bounding boxes, as described in Section 2.2, which were to be avoided by the Bi-RRT.Second, points on the midplane were selected as goals for the Bi-RRT to search from the start and end points.The midplane was defined as the plane that was normal to and intersected with the midpoint of the line segment between the start and end points.Third, the convex hull formed by the projected vertices of the obstacles on the midplane was computed.Then, eight points with an equal angular distribution (i.e., 45 • apart) and on the boundary of the convex hull were chosen as the initial via points.In addition, the midpoint of the line segment between the start and end points was chosen.Finally, the Bi-RRT was utilized to find collision-free paths that connected the start and end points to the eight goal points, i.e., at most nine Bi-RRT paths if all searches were successful.
Because the nodes generated by the Bi-RRT were determined by the length of each search step, not all the nodes should be the initial via points, and the node was removed if the path without it was shorter and still collision-free.The nodes from the start point were sequentially and individually checked and iterated until no node could be removed.The remaining nodes were set as the initial via points.

Energy Optimization
For energy optimization, the reward was set as: ∑ Energy , i f collision f ree (7) The symbol dist min indicates the minimum distance between the two nearest bounding boxes in the whole trajectory.The symbol ∑ Energy is the total energy consumption of the model, which is defined as: where t i t f are the start and final time stamps of the trajectory, and τ j , θ j are the j th joint torque and joint angle, respectively.The main idea of the reward setting shown in Equation ( 7) is that if the collision occurs, then the reward is negative, and the reward is lower if the collision is more severe.In contrast, if there is no collision, the reward is positive, and the score is higher when the energy consumption is lower.Equation ( 7) can be utilized regardless of the number of obstacles.

Speed Optimization
The speed optimization process is similar to the energy optimization process, but the goal is to reduce the elapsed time from the start point to the end point while avoiding an overloaded operation.The reward was set as: reward S = − τ limit − τ peak , i f torque over limit where time reduction is the difference of time consumption before and after optimization.The main idea of the reward setting shown in Equation ( 9) is that if the computed torque exceeds the limit, the reward is negative, and the more the torque exceeds, the lower the reward.In contrast, the reward is positive and higher if the time reduction is more significant.Note that the energy-optimized trajectory was utilized as the initial trajectory for speed optimization because it is collision-free, and the motion profile is relatively smooth.In this case, the optimization aims at maintaining the same profile while increasing the magnitudes of the motion states and torques until the manipulator achieves its torque limit [46,47].
In this work, the energy and speed optimizations were executed separately because the optimization of these two goals is generally oriented toward different optimized sets.Therefore, rather than using a reward composed of both energy consumption and speed objectives, energy and speed were optimized separately with different reward functions.

Genetic Algorithm
Furthermore, to compare the performance of the proposed method with classic artificial intelligence methods, GA was adopted for evaluation as well.GA mimics natural genetic exchange rules and follows the idea of the "survival of the fittest" to generate increasingly better offspring over several generations.In each generation loop, a set of children, of which each consists of six lines of binary values as genes, would first be converted to decimal values as six via angles, and the energy cost based on Equation ( 8) would be calculated corresponding to each child.Then, the same number of children would be randomly selected as parents of the next generation in the children set, following the rule that the child with a lower energy cost was assigned a higher probability to be chosen.
After that, each parent would have a high probability to exchange some of their binary elements with one of the children (selected randomly) at random element places, generating offspring as the next generation.Through the process, parents with a lower energy cost would generate more offspring, and finally the optimal set of via angles would appear in the last generation.The illustration depicting the GA procedure is shown in Figure 7.

icant.
Note that the energy-optimized trajectory was utilized as the initial trajectory for speed optimization because it is collision-free, and the motion profile is relatively smooth.In this case, the optimization aims at maintaining the same profile while increasing the magnitudes of the motion states and torques until the manipulator achieves its torque limit [46,47].In this work, the energy and speed optimizations were executed separately because the optimization of these two goals is generally oriented toward different optimized sets.Therefore, rather than using a reward composed of both energy consumption and speed objectives, energy and speed were optimized separately with different reward functions.

Genetic Algorithm
Furthermore, to compare the performance of the proposed method with classic artificial intelligence methods, GA was adopted for evaluation as well.GA mimics natural genetic exchange rules and follows the idea of the "survival of the fittest" to generate increasingly better offspring over several generations.In each generation loop, a set of children, of which each consists of six lines of binary values as genes, would first be converted to decimal values as six via angles, and the energy cost based on Equation ( 8) would be calculated corresponding to each child.Then, the same number of children would be randomly selected as parents of the next generation in the children set, following the rule that the child with a lower energy cost was assigned a higher probability to be chosen.
After that, each parent would have a high probability to exchange some of their binary elements with one of the children (selected randomly) at random element places, generating offspring as the next generation.Through the process, parents with a lower energy cost would generate more offspring, and finally the optimal set of via angles would appear in the last generation.The illustration depicting the GA procedure is shown in Figure 7.

Simulation and Experiment Results
In this section, the experiment results related to the hybrid dynamic model described in Section 2 are first demonstrated.Then, the simulation and experiment results related to the trajectory optimization described in Section 3 are reported.

Simulation and Experiment Results
In this section, the experiment results related to the hybrid dynamic model described in Section 2 are first demonstrated.Then, the simulation and experiment results related to the trajectory optimization described in Section 3 are reported.

Performance of The Hybrid Dynamic Model
The construction of the hybrid dynamic model requires measurements of various states of the manipulator as described in Section 2, including joint angles, velocities, accelerations, and torques.Due to the necessity for motion data (i.e., velocity and acceleration), it was efficient to collect the required data while the manipulator was moving according to preset trajectories containing variations of joint orientation, velocity, and acceleration.The trajectories were randomly chosen within the robot's workspace, and the trajectories were set bidirectionally and with various speeds to cover a wide range of joint motions.The construction of the hybrid dynamic model requires measurements of various states of the manipulator as described in Section 2, including joint angles, velocities, accelerations, and torques.Due to the necessity for motion data (i.e., velocity and acceleration), it was efficient to collect the required data while the manipulator was moving according to preset trajectories containing variations of joint orientation, velocity, and acceleration.The trajectories were randomly chosen within the robot's workspace, and the trajectories were set bidirectionally and with various speeds to cover a wide range of joint motions.Figure 8a,b, drawn using Python and Matplotlib, show the training trajectory data groups, T1 and T2, each with five trajectories.The collected data at each instant (i.e., time stamp) were considered as one data set, and each trajectory contained 50 to 150 data sets.In total, 10,306 data sets were collected, of which 70% and 30% were used for training and validation in the DNN model training, respectively.The model was trained for 100 epochs, and the loss stabilized after the 15th epoch.The performance of the DNN sub-model was evaluated using the ten different trajectories, as shown in Figure 8c,d drawn using Python and Matplotlib.The means and standard deviations (STD) of the energy errors between the dynamic model and the manipulator were utilized for comparison, and the experiment results are shown in Table 1.Three scenarios were compared: the analytic reduced-order model without any fitting (i.e., ), the analytic model with energy compensation (i.e.,  +  ), and the hybrid dynamic model with energy compensation and the DNN model (i.e.,  =  +  +  ).Using  as the baseline, the averaged energy error of Joints 1 to 3 of the model  +  conceivably reduced by 100%, 69.8%, and 23.4%, respectively, and the overall energy error was reduced by 90.6%.The reduction of Joint 1 is most significant, because the joint supports the heaviest weight and inertia of the manipulator, so it ameliorates the most by  .Next, with the added DNN term, the averaged energy error of Joints 2 and 3 of the complete model further decreased by 68.6% and 98.4%,The performance of the DNN sub-model was evaluated using the ten different trajectories, as shown in Figure 8c,d drawn using Python and Matplotlib.The means and standard deviations (STD) of the energy errors between the dynamic model and the manipulator were utilized for comparison, and the experiment results are shown in Table 1.Three scenarios were compared: the analytic reduced-order model without any fitting (i.e., τ analytic ), the analytic model with energy compensation (i.e., τ analytic + τ loss ), and the hybrid dynamic model with energy compensation and the DNN model (i.e., τ hybrid = τ analytic + τ loss + τ DNN ).Using τ analytic as the baseline, the averaged energy error of Joints 1 to 3 of the model τ analytic + τ loss conceivably reduced by 100%, 69.8%, and 23.4%, respectively, and the overall energy error was reduced by 90.6%.The reduction of Joint 1 is most significant, because the joint supports the heaviest weight and inertia of the manipulator, so it ameliorates the most by τ loss .Next, with the added DNN term, the averaged energy error of Joints 2 and 3 of the complete model further decreased by 68.6% and 98.4%, respectively.The overall energy error was further reduced by 89.0%.In short, both the loss model and the DNN model significantly improved the accuracy of the dynamic model of the manipulator.
Furthermore, to examine the influence of the quantity of training data sets on the performance of the DNN model, the training and evaluating trajectory data groups, T1, T2, E1, and E2, shown in Figure 4, were separately utilized.Figure 9, drawn using Python and Matplotlib, depicts the errors of the manipulator energy consumption (E error ) and the root mean square errors (RMSE) of the joint torques (τ RMSE ) between the manipulator and the dynamic model.The results show that the model trained by a larger training set, T1+T2, has smaller total energy error means and STDs than those trained by a smaller set, T1 or T2, because the larger data set covers a larger variation for training.This also can be roughly observed in the RMSE of torques.There is no obvious trend to make conclusions among different evaluation data sets.Furthermore, to examine the influence of the quantity of training data sets on the performance of the DNN model, the training and evaluating trajectory data groups, T1, T2, E1, and E2, shown in Figure 4, were separately utilized.Figure 9, drawn using Python and Matplotlib, depicts the errors of the manipulator energy consumption ( ) and the root mean square errors (RMSE) of the joint torques ( ) between the manipulator and the dynamic model.The results show that the model trained by a larger training set, T1+T2, has smaller total energy error means and STDs than those trained by a smaller set, T1 or T2, because the larger data set covers a larger variation for training.This also can be roughly observed in the RMSE of torques.There is no obvious trend to make conclusions among different evaluation data sets.In this work, the DNN training process did not include variations in payload, but the experiments did evaluate the manipulator with various payloads.The results are shown in Table 2.The similarity of the results across tests with different payloads indicates that no matter how the payload increases, the added sub-models,  and  , can reduce the energy and torque errors.This further indicates that the regression and the DNN training process successfully learned the intrinsic mechanical and dynamic characteristics of the manipulator rather than overfitting to the characteristics of free motion without any In this work, the DNN training process did not include variations in payload, but the experiments did evaluate the manipulator with various payloads.The results are shown in Table 2.The similarity of the results across tests with different payloads indicates that no matter how the payload increases, the added sub-models, τ loss and τ DNN , can reduce the energy and torque errors.This further indicates that the regression and the DNN training process successfully learned the intrinsic mechanical and dynamic characteristics of the manipulator rather than overfitting to the characteristics of free motion without any payload.The results also show that as the payload increases, the errors also increase, which indicates that the payloads do affect the dynamic model.If the payloads are known a priori, the motion of the manipulator subjected to these specific payloads can be included in the DNN model training to improve model accuracy.

Simulation Results of Trajectory Optimization
Before the empirical implementation on the manipulator, the performance of RL for energy and speed optimization was evaluated in a simulation environment.The computer used here had an Intel ® CoreTM i9-9900K processor, NVIDIA GeForce RTX-2070 8 GB, and 32 GB RAM.The joint angles of the model at the start and final positions were 50.0 • , 0.0 • , −120.0 • and 80.0 • , 30.0 • , −90.0 • , respectively, with setting the trajectory elapsed time at 0.65 s.The classic optimization methods in robotics [49], GA, and brute force search were used for comparison.The generation number in GA was set to 200, and each generation contained 100 offspring, with mapping decimal values to 10 bits of binary representation as the genes.Notice that GA here could not be implemented through parallel computing, such as PyCUDA, since when calculating the torque values from the NN for each child in the stage of computing fitness, it occupied a large number of threads on the graphic card.Therefore, the graphic card could compute the fitness for only one child once.The resolution in the brute force search is 1.5 • , with the via point values ranging among the start and final positions.Equally divided segments between the initial and final points were set as initial conditions for the RL method and as a baseline for all optimization algorithms, and each algorithm, except for brute force, was run three times with different random seeds to obtain averaged values.Table 3 shows the simulation comparison results for energy optimization in tasks with two via points and without obstacles.Unsurprisingly, the brute force search yielded the best energy consumption of 45.46 J.The GA and the proposed algorithm yields were 45.73 J and 46.68 J, respectively, which are 0.6% and 2.7% more than the brute force method; however, considering computation time, the brute force method took about 100 times of GA, and GA took about 10 times of RL, even though RL had 2.5 times the iteration number of GA.By taking computing the energy for one trajectory as one step, and analyzing the time complexity for each algorithm, the brute force method took 116 = 1,771,561 steps.In GA, since each generation had 100 offspring to compute energy, and had 200 generations, it took 20,000 steps.RL only took one step for each episode with 500 episodes because it optimized the policy based on the gradient descent method, leveraging a fast calculation of GPU.Energy optimization in tasks with one obstacle, which was placed to interfere with the initial trajectory, was also conducted.In this case, the Bi-RRT initially generated seven sets of via points to build initial trajectories, and all seven trajectories were optimized for 100 epochs first.Afterwards, for each following 100 epochs, only one of the trajectories was randomly selected for optimization, where the likelihood of selection depended on the learning performance of the previous 100 epochs.There were 2100 epochs in total.For comparison, the GA was used with the target function shown in Equation ( 7), 100 offspring in each generation, and a total of 2000 generations.Figure 10, drawn using MATLAB, shows the simulation results.For the GA method, the offspring initially had diverse energy consumptions and then mostly converged to lower energy values in the final generation.In the final generation, the energy of the trajectory that could circumvent the obstacle was 54.08 J, and the total computing time was about two hours.For the proposed RL method, Trajectory 3 yielded the least energy at 58.08 J, and it was chosen to optimize for three multiples of 100 epochs.The total computation time was about five minutes.Similarly, although the energy result of RL was slightly larger than that of GA by 7%, the computing time was only 1/24 of the GA under the same generation number.

GA
(59.Energy optimization in tasks with one obstacle, which was placed to interfere with the initial trajectory, was also conducted.In this case, the Bi-RRT initially generated seven sets of via points to build initial trajectories, and all seven trajectories were optimized for 100 epochs first.Afterwards, for each following 100 epochs, only one of the trajectories was randomly selected for optimization, where the likelihood of selection depended on the learning performance of the previous 100 epochs.There were 2100 epochs in total.For comparison, the GA was used with the target function shown in Equation ( 7), 100 offspring in each generation, and a total of 2000 generations.Figure 10, drawn using MATLAB, shows the simulation results.For the GA method, the offspring initially had diverse energy consumptions and then mostly converged to lower energy values in the final generation.In the final generation, the energy of the trajectory that could circumvent the obstacle was 54.08 J, and the total computing time was about two hours.For the proposed RL method, Trajectory 3 yielded the least energy at 58.08 J, and it was chosen to optimize for three multiples of 100 epochs.The total computation time was about five minutes.Similarly, although the energy result of RL was slightly larger than that of GA by 7%, the computing time was only 1/24 of the GA under the same generation number.Based on Section 3.3, the performance of RL for speed optimization was also evaluated in the simulated task with one obstacle, and Figure 11, drawn using Python and Matplotlib, shows the simulation results with three trials under the same condition.The initial elapsed time was one second, and time consumption after optimization was reduced to 0.62, 0.64, and 0.63 s, respectively, because the torques after learning were apparently closer to the torque limits and were distributed appropriately to speed up the task as much as possible.Based on Section 3.3, the performance of RL for speed optimization was also evaluated in the simulated task with one obstacle, and Figure 11, drawn using Python and Matplotlib, shows the simulation results with three trials under the same condition.The initial elapsed time was one second, and time consumption after optimization was reduced to 0.62, 0.64, and 0.63 s, respectively, because the torques after learning were apparently closer to the torque limits and were distributed appropriately to speed up the task as much as possible.

Experimental Results of Energy Optimization
After simulation, an experimental task on the realistic platform was set up to evaluate the performance of the proposed strategy, wherein the robot needed to move from the start position to the end position while avoiding one obstacle placed on top of a desk.The start and end configurations of the manipulator are shown in Figure 12.For the purpose of comparison, in addition to the trajectory generated by the proposed strategy, three testers were invited, and each designed five collision-free and intuitively energysaving trajectories using the built-in teaching mode of the manipulator.Then, the five sets of via points selected by each tester were utilized to generate trajectories using cubic spline interpolation.The elapsed time for each trajectory was set to four seconds.At the same time, the kinematics data of the manipulator were also collected as input of the hybrid dynamic model to verify the model's accuracy by comparing the energy consumption of the dynamic model with the manipulator.

Experimental Results of Energy Optimization
After simulation, an experimental task on the realistic platform was set up to evaluate the performance of the proposed strategy, wherein the robot needed to move from the start position to the end position while avoiding one obstacle placed on top of a desk.The start and end configurations of the manipulator are shown in Figure 12.For the purpose of comparison, in addition to the trajectory generated by the proposed strategy, three testers were invited, and each designed five collision-free and intuitively energy-saving trajectories using the built-in teaching mode of the manipulator.Then, the five sets of via points selected by each tester were utilized to generate trajectories using cubic spline interpolation.The elapsed time for each trajectory was set to four seconds.At the same time, the kinematics data of the manipulator were also collected as input of the hybrid dynamic model to verify the model's accuracy by comparing the energy consumption of the dynamic model with the manipulator.Table 4, where Act.indicates values from the real manipulator and Mo.means values from the hybrid dynamic model, shows the energy consumptions of the manipulator that used the paths generated by the three testers and the RL method described in Section 3.For each tester and RL, the raw data of five trials and their mean and STD were presented.The results show that testers tended to design different trajectories to avoid the obstacles, so the STDs are high.In contrast, the trajectories from RL have consistent energy costs with small deviations.The table clearly shows that the trajectories designed by the testers consumed more energy than those designed by the RL.Among all 15 trajectories, 14 had a higher energy cost than that of the trajectories designed by the RL.On the other hand, the RMSEs between the hybrid dynamic model results and the experiment results of the testers and RL were 0.76 and 1.12, respectively, which confirms that the model can accurately predict the energy consumption with an error of less than 4% in this case.Table 4, where Act.indicates values from the real manipulator and Mo.means values from the hybrid dynamic model, shows the energy consumptions of the manipulator that used the paths generated by the three testers and the RL method described in Section 3.For each tester and RL, the raw data of five trials and their mean and STD were presented.The results show that testers tended to design different trajectories to avoid the obstacles, so the STDs are high.In contrast, the trajectories from RL have consistent energy costs with small deviations.The table clearly shows that the trajectories designed by the testers consumed more energy than those designed by the RL.Among all 15 trajectories, 14 had a higher energy cost than that of the trajectories designed by the RL.On the other hand, the RMSEs between the hybrid dynamic model results and the experiment results of the testers and RL were 0.76 and 1.12, respectively, which confirms that the model can accurately predict the energy consumption with an error of less than 4% in this case.In addition to the use of Bi-RRT generated via points for optimization, the two collision-free via points selected by the testers could also be utilized as the initial via points for optimization.Table 5 shows the original and optimized energy consumptions of five trajectories taught by Tester 1 before optimization.The energy consumption improved for all five trajectories by approximately 20% on average.Note that the energy consumption  In addition to the use of Bi-RRT generated via points for optimization, the two collision-free via points selected by the testers could also be utilized as the initial via points for optimization.Table 5 shows the original and optimized energy consumptions of five trajectories taught by Tester 1 before optimization.The energy consumption improved for all five trajectories by approximately 20% on average.Note that the energy consumption was a little higher than that of the via points initialized by Bi-RRT in Table 4 because the number of via points was fixed in this case, which imposed stronger constraints on optimization.

Experimental Results of Speed Optimization
Under the same experimental setup as that described in Section 4.2, the yielded energyefficient trajectories were further computed for speed optimization, i.e., reducing motion time.To evaluate the performance of the proposed speed optimization, a baseline trajectory was designed that individually compressed the elapsed time of all trajectory sections until the joint reached its torque limit.Hereafter, this method is referred to as "time compression (TC)" for comparison.
Table 6 compares the optimized time of the trajectories using TC and RL.The original elapsed time was four seconds, and the average optimized times using TC and RL were 1.57 ± 0.14 and 1.14 ± 0.12 s, respectively.The performance of the RL method is better than that of the TC method by approximately 27%.To examine this in more detail, Figure 13a manifests the joint torques of the hybrid dynamic model and the manipulator before and after speed optimization in one trial.By comparing the values of the model before and after optimization, RL magnified torque values within the torque limitation to speed up the task, rendering the torque profile more severe.In addition, it is difficult to simulate the initial and high-frequency responses of the actual torque profiles in the dynamic model.Similar to the concept of Table 5, the energy-optimized trajectories based on testerselected via points for simplicity could also be set as the initial trajectories for speed optimization, and the optimized time of the five energy-optimized trajectories using via points selected by Tester 1 are shown in Table 7.On average, the optimized times for TC and RL were 1.67 and 1.29 s, respectively, and the latter was better than the former by 23%.With further observation, Figure 13b shows the joint torques of the model and the manipulator before and after the speed optimization of the case, and the results were similar to those of Figure 13a.Similar to the concept of Table 5, the energy-optimized trajectories based on testerselected via points for simplicity could also be set as the initial trajectories for speed optimization, and the optimized time of the five energy-optimized trajectories using via points selected by Tester 1 are shown in Table 7.On average, the optimized times for TC and RL were 1.67 and 1.29 s, respectively, and the latter was better than the former by 23%.With further observation, Figure 13b shows the joint torques of the model and the manipulator before and after the speed optimization of the case, and the results were similar to those of Figure 13a.

Conclusions
This article has reported the construction of the hybrid dynamic model as the digital twin and strategy to optimize the energy and speed of a manipulator by using RL, based on the dynamic model.
The hybrid model is composed of a reduced-order analytic model, a linear empirical compensation, and a DNN model.To spare computing complexity, the analytic dynamic model was reduced to a three-DOF manipulator, with the other links fixed as a mass point, and constructed given mechanical parameters from specification.The empirical part, compensating for inertia, damping, and friction, and the DNN model, ameliorating for the rest of the un-modeled dynamics, were both trained by actual torque data.The experimental results showed that averaged energy errors between the manipulator results and the analytic model (τ analytic ), the analytic model with energy loss terms (τ analytic + τ loss ), and the hybrid model (τ hybrid = τ analytic + τ loss + τ DNN ), are −22.1,−2.1, and −0.2 J, respectively, which confirms that the hybrid composition of the model can significantly augment energy estimation accuracy.In cases where the manipulator was subjected to different payloads, the results exhibited a similar trend in amelioration.
Afterwards, the hybrid model was used in the manipulator trajectory optimization for energy efficiency and speed.The optimization was treated as a bandit problem and solved by RL.The Bi-RRT was utilized to find the initial via points of the trajectories, and then the PPO algorithm was used for trajectory optimization with an obstacle-avoidance capability.In addition, the trajectory optimization using GA was implemented for a performance comparison as well.The simulation results showed that RL can reach almost the same optimization effect as GA, but using only one tenth of the computation time of GA.In the energy optimization experiment, the human-taught method was compared to RL, and the results showed that the energy consumption of the 15 trajectories designed by the testers ranged between 33.9 and 90.9 J, while that of the five trajectories derived by RL had a mean and STD 33.1 ± 0.7 J, attesting to the effectiveness of energy optimization using RL.For the speed optimization experiments, the performance of the RL method was better than that of the time compression method by approximately 27%.In short, RL can effectively reduce The linear and angular velocities of the links are  The corresponding terms in the inertia matrix M(θ) are

Figure 1 .
Figure 1.The manipulator and the model: (a) a photograph of the manipulator for experimental validation, (b) schematic diagram and coordinate systems of the manipulator, and (c) the three-DOF reduced-order model of the manipulator.

Figure 1 .
Figure 1.The manipulator and the model: (a) a photograph of the manipulator for experimental validation, (b) schematic diagram and coordinate systems of the manipulator, and (c) the three-DOF reduced-order model of the manipulator.

θ
of the first three joints of the manipulator.The labeled outputs of the model are the residual torques τ residual between the torques of the manipulator τ actual and the torques of the physical models τ analytic + τ loss .The τresidual represents the estimates of the DNN model after training.   τ actual − τ analytic + τ loss = τ residual τ DNN = DNN θ,

Figure 2 θ
depicts the overall construction flow of the hybrid dynamic model.First, the three-DOF analytic reduced-order dynamic model was constructed given the me-chanical parameters of the manipulator from the manufacturer.Then, the joint states θ, and the torques τ actual of the manipulator at each instant during motion were collected.The torques of the analytic model τ analytic were correspondingly computed using Equation (2).The time-series of τ actual and τ analytic were utilized to calculate the coefficients in Equation (3) by using Equation (5).Then, given the time series of joint states θ, torques of the loss model τ loss were computed by using Equation (3).Next, the time series of θ,

Figure 2 .
Figure 2. The overall construction framework of the hybrid dynamic model.

Figure 2 .
Figure 2. The overall construction framework of the hybrid dynamic model.

Figure 3 .
Figure 3.The flow chart of the self-collision detection.

Figure 4 .
Figure 4.The manipulator covered by the bounding boxes ready to check collision, where the green box is Link 1, the blue box is Link 2, the red box is Link 3, and the gray boxes are the combination of Links 4 to 6.

Figure 3 .
Figure 3.The flow chart of the self-collision detection.

Figure 3 .
Figure 3.The flow chart of the self-collision detection.

Figure 4 .
Figure 4.The manipulator covered by the bounding boxes ready to check collision, where the green box is Link 1, the blue box is Link 2, the red box is Link 3, and the gray boxes are the combination of Links 4 to 6.

Figure 4 .
Figure 4.The manipulator covered by the bounding boxes ready to check collision, where the green box is Link 1, the blue box is Link 2, the red box is Link 3, and the gray boxes are the combination of Links 4 to 6.

Figure 5 .
Figure 5.The actor-critic structure in the proximal policy optimization (PPO).

Figure 5 .
Figure 5.The actor-critic structure in the proximal policy optimization (PPO).

Figure 6 .
Figure 6.The optimization algorithm for via points.Figure 6.The optimization algorithm for via points.

Figure 6 .
Figure 6.The optimization algorithm for via points.Figure 6.The optimization algorithm for via points.

Figure 7 .
Figure 7.The genetic algorithm (GA) procedure as the compared baseline.More detailed concepts of GA can refer to[48].

Figure 7 .
Figure 7.The genetic algorithm (GA) procedure as the compared baseline.More detailed concepts of GA can refer to [48].

Figure
Figure 8a,b, drawn using Python and Matplotlib, show the training trajectory data groups, T1 and T2, each with five trajectories.The collected data at each instant (i.e., time stamp) were considered as one data set, and each trajectory contained 50 to 150 data sets.In total, 10,306 data sets were collected, of which 70% and 30% were used for training and validation in the DNN model training, respectively.The model was trained for 100 epochs, and the loss stabilized after the 15th epoch.

Figure 8 .
Figure 8.The trajectory data groups used for constructing the DNN model of the manipulator.The training trajectories T1 shown in (a) and T2 shown in (b).The evaluation trajectories E1 shown in (c) and E2 shown in (d).

Figure 8 .
Figure 8.The trajectory data groups used for constructing the DNN model of the manipulator.The training trajectories T1 shown in (a) and T2 shown in (b).The evaluation trajectories E1 shown in (c) and E2 shown in (d).

Figure 9 .
Figure 9.The errors of the manipulator energy consumption and the RMSE of the joint torques.Subfigures (a-c) are the results using the evaluating trajectory data groups, E1, E2, and E1 + E2, respectively.In each subfigure, three bars represent the performance of the hybrid model by training data groups T1, T2, and T1 + T2, where black lines are plus and minus one STD and colored bars are mean values.

Figure 9 .
Figure 9.The errors of the manipulator energy consumption and the RMSE of the joint torques.Subfigures (a-c) are the results using the evaluating trajectory data groups, E1, E2, and E1 + E2, respectively.In each subfigure, three bars represent the performance of the hybrid model by training data groups T1, T2, and T1 + T2, where black lines are plus and minus one STD and colored bars are mean values.

Figure 10 .
Figure 10.The total energy consumption of the model versus the computation epoch using the RL algorithm and GA.The proposed RL learning algorithm utilizes seven initial trajectories found by Bi-RRT.

Figure 10 .
Figure 10.The total energy consumption of the model versus the computation epoch using the RL algorithm and GA.The proposed RL learning algorithm utilizes seven initial trajectories found by Bi-RRT.

Machines 2023 , 25 Figure 11 .
Figure 11.The joint torque profiles of the model (a) before and (b) after speed optimization.The rows represent three different trials.The black solid lines represent the torque limits.

11 .
The joint torque profiles of the model (a) before and (b) after speed optimization.The rows represent three different trials.The black solid lines represent the torque limits.

Figure 12 .
Figure 12.The (a) start and (b) end configurations of the manipulator in the experiment.The red shaded box is the obstacle placed on top of a desk, which is shown as transparent.The solid boxes represent the bounding boxes of the manipulator.(c) A photograph of the experimental setup.

Figure 12 .
Figure 12.The (a) start and (b) end configurations of the manipulator in the experiment.The red shaded box is the obstacle placed on top of a desk, which is shown as transparent.The solid boxes represent the bounding boxes of the manipulator.(c) A photograph of the experimental setup.

Figure 13 .
Figure 13.The speed optimization results.From the top: the joint torque profiles of the hybrid model before (1st row) and after (2nd row) speed optimization and of the manipulator (3rd row).The profiles shown in (a,b) were optimized from different energy-optimized trajectories: (a) using Bi-RRT generated via points and (b) using human-taught via points.All plots were drawn using Python and Matplotlib.

Figure 13 .
Figure 13.The speed optimization results.From the top: the joint torque profiles of the hybrid model before (1st row) and after (2nd row) speed optimization and of the manipulator (3rd row).The profiles shown in (a,b) were optimized from different energy-optimized trajectories: (a) using Bi-RRT generated via points and (b) using human-taught via points.All plots were drawn using Python and Matplotlib.

θ 3 (θ
−l 3 a 2 s 3 m 3 − a 3 a 2 s 3 m 4 ) C 23 = .θ 3 (−l 3 a 2 s 3 m 3 − a 3 a 2 s 3 m 4 ) C 31 = .θ 1 (I 3x c 23 s 23 − I 3y c 23 s 23 − l 3 2 c 23 s 23 m 3 − a 3 2 c 23 s 23 m 4 − l 3 a 2 s 2 c 23 m 3 − a 3 a 2 s 2 c 23 m 4 ) C 32 = .θ 2 (l 3 a 2 s 3 m 3 + a 3 a 2 s 3 m 4 ) (A9)The terms in gravity matrix G(θ) areG 2 = −g(l 3 s 23 m 3 + a 3 s 23 m 4 + l 2 s 2 m 2 + a 2 s 2 m 3 + a 3 s 2 m 4 ) G 3 = −g(l 3 s 23 m 3 + a 3 s 23 m 4 ) (A10)If a more complete dynamic model including mechanical energy loss were to be considered, more terms should be added to Equation (A5), τ actual = M(θ) , damping, friction, and high-ordered nonlinear and other unpredictable effects are included.The equation B m = I m • N 2 represents the equivalent inertia of motors, where I m is the motor inertia and N is the reduction ratio.C m is the equivalent damping coefficient, and f c is the friction coefficient of the simplified kinetic friction model.Since these parameters depend heavily on individual machine conditions and are therefore unknown, abundant data for τ actual , were collected in experiments to infer the values of B m , C m , and f c using linear regression.Afterwards, a neural network model represented as τ DNN was used to fit the high-ordered nonlinear and unpredictable effects.

Table 1 .
The energy consumption discrepancy between the empirical manipulator and the models with different compensation terms.

Table 1 .
The energy consumption discrepancy between the empirical manipulator and the models with different compensation terms.

Table 2 .
The energy consumption discrepancy and joint root mean square error (RMSE) torque between the empirical manipulator and the models with different payloads.

Table 3 .
Simulation results of the energy optimization task.

Table 4 .
The energy consumptions of the manipulator using human-taught and reinforcement learning (RL)-learned trajectories.

Table 4 .
The energy consumptions of the manipulator using human-taught and reinforcement learning (RL)-learned trajectories.

Table 5 .
The energy consumption of the RL-learned trajectories using initial via points selected by Tester 1.

Table 6 .
The results of speed optimization.

Table 7 .
The results of speed optimization using via points selected by Tester 1.