1. Introduction
Robotic arms possess attributes such as operational flexibility and heightened safety, enabling them to supplant human labor in executing a wide range of intricate and repetitive tasks. These robots have found extensive applications in sectors such as manufacturing and logistics. Motion planning is a critical aspect that ensures the successful completion of diverse operational tasks by robots, and motion-planning research within the realm of control systems has emerged as a focal point in the field of robotics. The motion-planning problem can be divided into two components: path planning and trajectory planning. The primary distinction between them lies in the temporal dimension: trajectory planning applies temporal laws to the path devised by the path planner [1]. The objective of this process is to facilitate the completion of tasks by the robot within a specified time frame, while adhering to a multitude of constraints. These constraints include ensuring a smooth trajectory, preventing the breach of the robot's physical limitations, and avoiding collision or contact incidents, among others.
Swarm intelligence optimization methods demonstrate a remarkable ability to manage complex environmental changes, facilitating efficient trajectory planning through autonomous global optimal solution searches, adaptive parameter adjustments, and rapid convergence. The application of these techniques to address trajectory-tracking challenges has garnered increasing attention, with examples including genetic algorithms [2,3], particle swarm algorithms [4], and ant colony algorithms [5]. An adaptive elite genetic algorithm incorporating singularity avoidance has been employed for online time-optimal trajectory planning in Cartesian space [2]. To address the path constraint problem in trajectory planning, a time-optimal trajectory-planning method for robotic arms has been proposed, which simultaneously searches for the optimal path [3]. Particle swarm optimization has emerged as a prevalent optimization technique, utilized in robot trajectory planning to guide practical industrial production applications [4]. The current century has seen the development of numerous exceptional bionic-inspired optimization algorithms, such as artificial bee colony algorithms [5], whale algorithms [6], and hybrid algorithms [7,8]. Although these algorithms exhibit impressive performance in resolving trajectory-tracking challenges, their extensive calculation times in dynamic environments may result in unstable optimal solution computations. Recently, learning-based theories have attracted significant attention in addressing motion-planning issues, owing to their outstanding adaptability to complex problems [9]. Notable examples include deep learning [10,11] and reinforcement learning [12,13]. The learning-based approach employs a machine learning strategy, utilizing data-driven solutions to eliminate dependence on intricate models and reduce the complexity of devising trajectory-planning methods. In robot trajectory planning, position coordinates function as the input parameters, while the joint angles serve as the output. Converged neural network parameters are obtained through sample-based training, with the learning strategy focused on estimating the solution. As the number of samples increases, the final results exhibit progressive improvement [11]. In dynamic environments, traditional offline planning methods become inapplicable when a space robot is in motion. A trajectory-planning strategy based on reinforcement learning enables the robot to rapidly adjust its executed actions, rendering this method suitable for real-time applications [13].
Time-optimal trajectory planning has been extensively investigated [14]. The fundamental concept involves determining the time points of the optimal trajectory within the planning period using optimization algorithms under constraints such as velocity, acceleration, and jerk. Moreover, energy consumption has emerged as a crucial reference factor in trajectory planning [15]. Multi-objective trajectory-planning problems are progressively supplanting single-objective problems, becoming the predominant focus of industrial robot trajectory-planning research. Since intense pulses can induce vibrations in robotic arm motion, joint pulses represent a significant consideration in multi-objective trajectory-planning research for robotic arms. Liu [16] introduced a trajectory competition multi-objective particle swarm optimization algorithm, addressing cooperative robot trajectory planning from the three aspects of time, energy, and pulse optimization. Hou [17] proposed an innovative time–energy consumption optimization trajectory-planning method for dynamic identification. Rout [18] employed a multi-objective ant lion optimization technique to obtain the optimal trajectory by minimizing the time–torque ratio. Huang [19], drawing on the elite non-dominated sorting genetic algorithm, put forward a time–pulse comprehensive optimal trajectory-planning method for industrial robots. Trajectories generated through multi-objective optimization ensure that robots can attain relatively optimal performance when reaching the same location. However, traditional six-axis robotic arm trajectory-planning methods may encounter challenges in fulfilling multi-objective optimization due to their reliance on numerical optimization, which necessitates precise mathematical models and adherence to the robot's physical constraints.
To attain more-efficient and adaptive trajectory planning, this paper presents a six-axis robotic-arm trajectory-planning method based on deep reinforcement learning. Compared with the existing work, three contributions are presented:
(1) This approach achieves multi-objective optimization through online learning and real-time strategy adjustments, comprehensively considering accuracy, smoothness, and energy consumption.
(2) In the training process, a decaying episode mechanism is introduced to find feasible solutions faster; it shortens the planned path so that the subsequent trajectory optimization can refine the remaining differences in more detail.
(3) By leveraging the properties of reinforcement learning, a trajectory-planning strategy that optimally balances the various performance indicators can be identified without depending on precise mathematical models and physical constraints.
The remainder of this paper is organized as follows.
Section 2 shows the kinematics and dynamics of the manipulator.
Section 3 details the multiobjective optimization problem.
Section 4 presents the proposed RL-based multi-objective optimal trajectory-planning method, along with a detailed account of the reward-function design.
Section 5 presents the simulation and physical experiments.
Section 6 concludes this work.
2. Kinematics and Dynamics of Manipulator
The kinematics of a manipulator represent its motion characteristics, exclusive of mass, inertia, and force considerations. To accurately control a robot's position, it is essential to construct its kinematic model. Each link of the robot can be characterized by four kinematic parameters, with two describing the link itself and the other two representing the connection relationship between the links. The convention by which these kinematic parameters express the kinematic relationship of the robot mechanism is commonly referred to as the Denavit–Hartenberg (DH) method [20], and the Denavit–Hartenberg parameters [21] are shown in Figure 1.
For a six-degree-of-freedom manipulator, the transformation from the base to the end effector is obtained by chaining the per-link homogeneous transformation matrices over the joint vector $\mathbf{q}=[{q}_{1},{q}_{2},...,{q}_{n}]$:

$${}_{6}^{0}T\left(\mathbf{q}\right)={}_{1}^{0}T\left({q}_{1}\right)\,{}_{2}^{1}T\left({q}_{2}\right)\cdots {}_{6}^{5}T\left({q}_{6}\right)\quad (1)$$

where ${}_{i}^{i-1}T\left({q}_{i}\right)$ is defined as:

$${}_{i}^{i-1}T\left({q}_{i}\right)=\begin{bmatrix}\cos {q}_{i} & -\sin {q}_{i}\cos {\alpha}_{i} & \sin {q}_{i}\sin {\alpha}_{i} & {a}_{i}\cos {q}_{i}\\ \sin {q}_{i} & \cos {q}_{i}\cos {\alpha}_{i} & -\cos {q}_{i}\sin {\alpha}_{i} & {a}_{i}\sin {q}_{i}\\ 0 & \sin {\alpha}_{i} & \cos {\alpha}_{i} & {d}_{i}\\ 0 & 0 & 0 & 1\end{bmatrix}$$
In accordance with the forward kinematics of the manipulator as indicated in Equation (1), the configuration space is expressed. Owing to the task-specific configuration of the manipulator's end effectors, it is often necessary to infer the particular configuration space based on the position space. This process, referred to as inverse kinematics, enables precise motion control for each joint of the manipulator according to the angle. At the velocity level, the desired workspace displacement $\dot{\mathbf{x}}$ and the initially unknown configuration-space displacement $\dot{\mathbf{q}}$ are related by

$$\dot{\mathbf{x}}=J\left(\mathbf{q}\right)\dot{\mathbf{q}}$$

where $J\left(\mathbf{q}\right)$ is the Jacobian matrix. With the forward and inverse kinematics of the manipulator solved, the optimal trajectory planning of the manipulator based on the reinforcement-learning method can be realized. The modified DH (MDH) method [22] differs from the standard DH method in its link frame assignments and parameter settings, yielding a different per-link transformation matrix.
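As an illustration, the per-link transform and the forward-kinematics chain can be sketched in a few lines of NumPy. This is a minimal sketch of the standard DH convention; the function names and the parameter-tuple layout are ours, not from the paper.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform from frame i-1 to frame i (standard DH convention)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_params):
    """Chain the per-link transforms; dh_params is a list of (d, a, alpha) per joint."""
    T = np.eye(4)
    for qi, (d, a, alpha) in zip(q, dh_params):
        T = T @ dh_transform(qi, d, a, alpha)
    return T  # base-to-end-effector pose
```

Feeding in the six joint angles and the robot's link parameters yields the base-to-end-effector pose of Equation (1); the MDH variant would reorder the rotation and translation factors accordingly.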
3. Multi-Objective Optimization Problem
In this work, we used a deep reinforcement learning (DRL) approach for the trajectory planning of a six-degree-of-freedom manipulator. We aimed to improve the accuracy, enhance the motion smoothness, and minimize the energy consumption. We formulated these goals into a multi-objective optimization problem that our DRL algorithm resolves. Specifically, we prioritized accuracy for precise trajectory tracking, smoothness to reduce joint-impact-induced vibrations, and energy efficiency to decrease overall system consumption.
The trajectory-planning problem can be transformed into a multi-objective optimization problem:

$$\underset{\mathit{\theta}}{\mathrm{min}}\ \sum_{i}{\lambda}_{i}{f}_{i}\left(\mathit{\theta}\right)\quad \mathrm{s.t.}\quad {h}_{ij}\left(\mathit{\theta}\right)\le 0$$

where $\mathit{\theta}=[{\theta}_{1},{\theta}_{2},{\theta}_{3},{\theta}_{4},{\theta}_{5},{\theta}_{6}]$, ${\lambda}_{i}\in [0,1]$ is the weight coefficient of the cost function ${f}_{i}\left(\mathit{\theta}\right)$, and ${h}_{ij}\left(\mathit{\theta}\right)$ represents the $j$th constraint condition for the $i$th joint of the robot. These constraints incorporate limitations on the joint angle, joint velocity, and joint acceleration.
The requirements of planned-trajectory accuracy, trajectory smoothness, and minimum energy consumption were considered. The corresponding cost functions are defined as follows:
(1) Accuracy:
This function is mainly concerned with the position and attitude error of the end effector when the manipulator executes the trajectory. The accuracy cost function is defined by the distance between the desired and current positions of the end effector:

$${f}_{a}\left(\mathit{\theta}\right)=\Vert p-{p}_{d}\Vert =\sqrt{{(x-{x}_{d})}^{2}+{(y-{y}_{d})}^{2}+{(z-{z}_{d})}^{2}}$$

where ${p}_{d}=[{x}_{d},{y}_{d},{z}_{d}]$ and $p=[x,y,z]$ denote the three-dimensional coordinate positions of the end effector for the desired trajectory and the actual trajectory, respectively, within Euclidean space.
(2) Smoothness:
To achieve a smooth trajectory and reduce the vibrational impact caused by the motion of each joint of the manipulator, the smoothness cost function is formulated as follows:

$${f}_{s}\left(\mathit{\theta}\right)=\sum_{i=1}^{n}\left({\lambda}_{v}{\dot{q}}_{i}^{2}+{\lambda}_{a}{\ddot{q}}_{i}^{2}\right)$$

The number of joints is denoted as $n$, and ${\lambda}_{v}$ and ${\lambda}_{a}$ are the weight coefficients of the velocity and acceleration, respectively. ${\dot{q}}_{i}$ and ${\ddot{q}}_{i}$ are the velocity and acceleration of the $i$th joint, respectively.
(3) Energy consumption:
The energy consumption cost function primarily addresses the energy consumed by the manipulator's joint movements. Taking into account the sum of the squared angular changes and squared joint torques within the discretized unit time, the energy consumption cost function is represented as follows:

$${f}_{e}\left(\mathit{\theta}\right)=\sum_{k=1}^{n}\left(\Delta {\theta}_{k}^{2}+{\tau}_{k}^{2}\right)$$

Here, $\Delta {\theta}_{k}$ signifies the angular change of the $k$th joint per unit time, and ${\tau}_{k}$ represents the torque of the $k$th joint.
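The three cost terms above can be written down almost directly. The sketch below is illustrative only; the function names and the default weights are our own choices, not values from the paper.

```python
import numpy as np

def accuracy_cost(p, p_d):
    """Euclidean distance between actual and desired end-effector positions."""
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(p_d, dtype=float))

def smoothness_cost(q_dot, q_ddot, lam_v=1.0, lam_a=1.0):
    """Weighted sum of squared joint velocities and accelerations over all joints."""
    return lam_v * np.sum(np.square(q_dot)) + lam_a * np.sum(np.square(q_ddot))

def energy_cost(delta_theta, tau):
    """Sum of squared angular changes and squared joint torques per time step."""
    return np.sum(np.square(delta_theta)) + np.sum(np.square(tau))
```

A weighted sum of these three terms, evaluated along the discretized trajectory, gives the scalar objective that the optimization (and later the reward shaping) works with.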
4. Trajectory Planning with Reinforcement Learning
4.1. PPO
Proximal policy optimization (PPO) represents a significant advancement in the field of reinforcement learning, particularly in addressing the balance between exploration and exploitation in policy-based methods. PPO has been recognized for its ability to maintain the stability of policy updates without sacrificing learning efficiency [23].
Reinforcement learning, as a subset of machine learning, is intrinsically about learning from interaction to achieve optimal behavior. Policybased methods, including PPO, directly parameterize the policy and make updates to maximize expected return. These methods have an advantage over valuebased methods due to their ability to handle continuous action spaces and learn stochastic policies. However, they often suffer from high variance, leading to instability during training.
The PPO algorithm attempts to mitigate this issue by optimizing an objective function designed to keep policy updates close to the current policy:

$$L\left(\theta \right)={\widehat{\mathbb{E}}}_{t}\left[\frac{{\pi}_{\theta}\left({a}_{t}\mid {s}_{t}\right)}{{\pi}_{{\theta}_{old}}\left({a}_{t}\mid {s}_{t}\right)}{\widehat{A}}_{t}\right]$$

This objective function maintains the relative advantage of the new policy over the old policy, encouraging the exploration of new behaviors but penalizing deviations that would significantly alter the current policy.
The cornerstone of PPO is the introduction of a novel surrogate objective function with clipping, which further constrains the policy updates to a neighborhood around the old policy:

$${L}^{CLIP}\left(\theta \right)={\widehat{\mathbb{E}}}_{t}\left[\mathrm{min}\left({r}_{t}\left(\theta \right){\widehat{A}}_{t},\ \mathrm{clip}\left({r}_{t}\left(\theta \right),1-\epsilon ,1+\epsilon \right){\widehat{A}}_{t}\right)\right],\quad {r}_{t}\left(\theta \right)=\frac{{\pi}_{\theta}\left({a}_{t}\mid {s}_{t}\right)}{{\pi}_{{\theta}_{old}}\left({a}_{t}\mid {s}_{t}\right)}$$

Here, $\epsilon$ is a hyperparameter that defines the limit of the policy-update clipping. This clipping mechanism ensures modest changes in the policy at each step, promoting stability during the learning process while preventing drastic policy deviations, which could lead to suboptimal solutions or convergence instability.
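A minimal, framework-agnostic sketch of the per-sample clipped surrogate follows (in NumPy rather than the paper's PyTorch setup; the function name is ours). It makes the clipping behavior concrete: when the probability ratio leaves the band [1 − ε, 1 + ε], the objective stops rewarding further movement in that direction.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the GAE estimate A-hat.
    Taking the elementwise minimum makes the bound pessimistic, so large
    policy steps are never rewarded.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

For example, with a positive advantage a ratio of 1.5 is credited only as 1.2 (the clip boundary), while with a negative advantage the pessimistic minimum keeps the full penalty.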
As the PPO algorithm undergoes iterations and updates, the final trajectory points of the robotic arm incrementally approach the ultimate target point. However, due to the inherent limitations within the algorithm, the points may not fully coincide with the designated target end pose. To address this, after planning with the deep reinforcement learning algorithm, this study incorporated the target end-pose point into the set of planned points, analogous to the planning method employed by the RRT algorithm [24]. Consequently, the trajectory-planning method utilized in this study ultimately forms a closed path from the starting point to the end point, enhancing its applicability in real-world scenarios.
4.2. Reinforcement Learning Settings
4.2.1. Observation and Action Spaces
Utilizing the PPO reinforcement learning algorithm, the states incorporated into the observation space of the environment encompass the current state ${\mathit{\theta}}_{start}$, current position ${p}_{c}$, target state ${\mathit{\theta}}_{end}$, and target position ${p}_{d}$ of the six-joint robotic arm, as well as the Euclidean distance ${d}_{e}$ between the present end pose and the target point. The end of the robotic arm is considered the end effector, culminating in a total of 25 state dimensions. Moreover, if the motion state of the robotic arm surpasses its own constraints, the specific joint state exceeding the observation space is adjusted to its maximum limit value.
The action space is configured to correspond to the speed ${\dot{\theta}}_{k}$ of each joint, and the state of the observation space is updated with every provided action.
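One plausible composition of the 25-dimensional observation is sketched below. We assume here that the current and target "positions" are 6-D end poses (position plus orientation), so the dimensions total 6 + 6 + 6 + 6 + 1 = 25; the exact layout is our assumption, as the paper does not enumerate it. The limit-clamping behavior described above is also shown.

```python
import numpy as np

def build_observation(theta_cur, pose_cur, theta_goal, pose_goal):
    """Concatenate joint angles, end poses, and the goal distance d_e.

    Assumed layout: 6 current joint angles, 6-D current end pose,
    6 target joint angles, 6-D target end pose, 1 Euclidean distance.
    """
    d_e = np.linalg.norm(np.asarray(pose_cur[:3]) - np.asarray(pose_goal[:3]))
    return np.concatenate([theta_cur, pose_cur, theta_goal, pose_goal, [d_e]])

def clip_to_limits(theta, low, high):
    """States exceeding the joint constraints are set to the limit values."""
    return np.clip(theta, low, high)
```

The action would then be the six joint velocities, with the observation rebuilt after every environment step.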
4.2.2. Reward Function
The trajectory-planning problem necessitates addressing both temporal considerations and the inherent complexity of the robotic-arm planning task, thus requiring a meticulously designed reward function. In light of the optimization principles previously discussed and in conjunction with real-world reward requirements, the present study proposes the following reward function $r({s}_{t},{a}_{t})$:

$$r\left({s}_{t},{a}_{t}\right)={r}_{a}^{t}+{r}_{s}^{t}+{r}_{e}^{t}+{r}_{ex}^{t}$$

The components ${r}_{a}^{t}$, ${r}_{\mathrm{s}}^{t}$, and ${r}_{\mathrm{e}}^{t}$ of the reward function pertain to accuracy, smoothness, and energy consumption, respectively. Here, ${w}_{\mathrm{a}}$ and ${\sigma}_{a}$ represent the weight coefficients of the accuracy reward; ${w}_{s}$, ${\lambda}_{v}$, and ${\lambda}_{acc}$ are the weight coefficients of the smoothness reward; and ${w}_{e}$ denotes the weight coefficient of the energy-consumption reward. The extra reward ${r}_{ex}^{t}$ is provided when the current training performance achieves a specified threshold, such as attaining a predetermined minimum distance between the current and desired end points.
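A hedged sketch of how such a composite reward might be assembled in code follows. The exact functional form of each term and the bonus magnitude `r_extra` are our assumptions for illustration; only the weight names and the accuracy threshold (`d_thresh = 0.003`, from the experiment settings) come from the text.

```python
import numpy as np

def reward(p, p_d, q_dot, q_ddot, delta_theta, tau,
           w_a=1.0, sigma_a=2.0, w_s=1.0, lam_v=1.0, lam_acc=1.0,
           w_e=1.0, d_thresh=0.003, r_extra=10.0):
    """Illustrative composite reward: negated accuracy, smoothness, and
    energy costs, plus a bonus when the end-effector error is small."""
    d_e = np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(p_d, dtype=float))
    r_a = -w_a * sigma_a * d_e                        # accuracy term
    r_s = -w_s * (lam_v * np.sum(np.square(q_dot))    # smoothness term
                  + lam_acc * np.sum(np.square(q_ddot)))
    r_e = -w_e * (np.sum(np.square(delta_theta))      # energy term
                  + np.sum(np.square(tau)))
    r_ex = r_extra if d_e < d_thresh else 0.0         # extra reward bonus
    return r_a + r_s + r_e + r_ex
```

Negating the cost terms is what maps cost minimization onto reward maximization, the transformation described in Section 4.3.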
4.3. Multi-Objective Optimal Trajectory Planning
In a deep-reinforcement-learning-based trajectory-planning approach, the traditional total cost function can be transformed into a corresponding reward function. Through this transformation, the agent engages in continuous learning and iteration, mapping the objective of minimizing the cost function to the objective of maximizing the reward function. The agent is capable of generating the desired trajectory based on the input conditions. To achieve this transformation, it is necessary to convert the various measures in the cost function into corresponding rewards. This conversion is accomplished by using the negative value of each cost item as part of the reward function. Thus, during the optimization process, the agent naturally chooses the behavior that minimizes the cost. These rewards are then integrated into a comprehensive reward function.
Within the reinforcement learning framework, the agent selects actions according to the current state and adjusts strategies by interacting with the environment. Throughout this process, the agent utilizes the received reward to continuously update the behavior strategy, aiming to maximize the reward while minimizing the cost. By employing this method, the traditional trajectory-planning problem is transformed into an optimization problem based on deep reinforcement learning. In the learning process, the agent gradually adapts to the task requirements in order to generate high-quality trajectories according to the given input conditions. This approach not only optimizes the smoothness, energy consumption, and accuracy of the trajectory, but also enables the introduction of other constraints or objectives as required by the practical application scenario, ensuring that the trajectory planning aligns more closely with the actual needs.
The concept of obtaining optimal trajectory planning through the cost function is thus realized as a deep reinforcement learning method: the total cost function is transformed into a specific reward function, and by mapping cost minimization to reward maximization, the agent is able to plan an ideal trajectory based on the input. The overall framework is illustrated in Figure 2.
As depicted in Figure 2, reinforcement learning serves as the critical mechanism in the trajectory planning of the six-axis robotic system. An episode within this context is delineated as a sequential arrangement of states, actions, and rewards that begins from an initial configuration and culminates upon achieving a predefined terminal condition within the operational environment.
The initial state is a representation of the robot’s configuration, which is articulated as a randomly selected pose, embodying both the positional attributes and potential velocities of each of the six joints in the robot’s structure. Actions are determined on the basis of the current state and are expressed as alterations in the angular positioning of the robot’s joints.
The execution of an action consequently leads to a transition of the robot to a new state, a process that is steered by the robot’s inherent kinematic properties. In tandem with this transition, the robot is accorded a reward, the extent of which is dictated by a composite measure of factors including accuracy in task performance, the energy consumed in the action’s execution, and the smoothness of the movement, thereby ensuring a balanced consideration of performance and efficiency.
The episode reaches its terminus upon the attainment of a goal configuration, indicating the successful completion of a given task. The overarching aim of the reinforcement learning framework is the procurement of an optimal policy—a strategic decisionmaking blueprint that optimizes the cumulative reward accumulated over the duration of an episode. This is achieved through a continuous, iterative interaction with the environment, allowing the system to learn from its experiences and progressively improve its performance.
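The episode structure described above can be summarized as a standard Gym-style interaction loop. Here `env` and `agent` are placeholders standing in for the arm simulation and the PPO policy, respectively; the method names follow the classic Gym `reset`/`step` convention rather than anything specified in the paper.

```python
def run_episode(env, agent, max_steps):
    """Roll out one episode and return the cumulative reward."""
    state = env.reset()                        # random initial arm configuration
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)              # e.g., joint-velocity command
        state, r, done, _ = env.step(action)   # kinematic transition + reward
        total_reward += r
        if done:                               # goal configuration reached
            break
    return total_reward
```

PPO then collects many such rollouts and updates the policy to increase the cumulative reward, which is exactly the "optimal policy" objective stated above.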
A decaying episode mechanism was set up, which automatically sets the step number of an episode to the current step count when the training accuracy reaches the set value ${d}_{e}$. This produces a new stable state in the training process, but does not affect the actual performance of the trajectory planning. The PPO algorithm of this paper is shown in Algorithm 1.
Algorithm 1 Proximal policy optimization (PPO) for trajectory planning
Require: start angle ${\mathit{\theta}}_{start}$, end angle ${\mathit{\theta}}_{end}$, hyperparameters of the actor and critic networks, learning rate $\alpha$, discount factor $\gamma$, generalized advantage estimation (GAE) parameter $\lambda$, clipping parameter $\epsilon$, entropy coefficient ${c}_{ent}$, number of iterations $N$, number of epochs $E$, batch size $B$, episode length $T$, accuracy setting ${d}_{e}$.
1: for $n=1,2,\dots ,N$ do
2:  Collect $T$ time-step transitions $\left\{{s}_{t},{a}_{t},{r}_{t},{s}_{t+1}\right\}$ using the running policy ${\pi}_{\theta}$
3:  if ${f}_{a}\left(\mathit{\theta}\right)<{d}_{e}$ then
4:   $T\leftarrow {T}_{\mathrm{current}}$
5:  end if
6:  Compute advantage estimates ${\widehat{A}}_{t}$
7:  for $e=1,2,\dots ,E$ do
8:   Randomly sample a batch of $B$ time steps
9:   Update the critic network parameters by minimizing the loss:
10:   $L\left(\varphi \right)=\frac{1}{B}\sum_{t=1}^{B}{\left({V}_{\varphi}\left({s}_{t}\right)-{\widehat{R}}_{t}\right)}^{2}$
11:   Update the actor network parameters using the PPO-clip objective and the entropy bonus:
12:   ${\theta}_{k+1}=\mathrm{arg}\,{\mathrm{max}}_{\theta}\ \frac{1}{B}\sum_{t=1}^{B}\left[{L}^{CLIP}\left(\theta \right)+{c}_{ent}H\left({\pi}_{\theta}\left(\cdot \mid {s}_{t}\right)\right)\right]$
13:   ${L}^{CLIP}\left(\theta \right)=\mathrm{min}\left(\frac{{\pi}_{\theta}\left({a}_{t}\mid {s}_{t}\right)}{{\pi}_{{\theta}_{k}}\left({a}_{t}\mid {s}_{t}\right)}{\widehat{A}}_{t},\ \mathrm{clip}\left(\frac{{\pi}_{\theta}\left({a}_{t}\mid {s}_{t}\right)}{{\pi}_{{\theta}_{k}}\left({a}_{t}\mid {s}_{t}\right)},1-\epsilon ,1+\epsilon \right){\widehat{A}}_{t}\right)$
14:  end for
15: end for
5. Experiments
5.1. Environment Settings
In this study, we employed a six-axis robotic arm with a 5 kg payload capacity, manufactured by SIASUN. The robot manipulator's kinematic characteristics were described using the modified Denavit–Hartenberg (MDH) parameters. Table 1 presents the MDH parameters for each link of the robot, providing a comprehensive overview of the manipulator's characteristics.
In Table 1, ${a}_{i}$, ${\alpha}_{i}$, and ${d}_{i}$ denote the link length, link twist, and link offset, respectively, while ${m}_{i}$ and ${I}_{xx}$, ${I}_{yy}$, and ${I}_{zz}$ represent the mass and the moments of inertia of each link, respectively.
In order to validate the efficacy of the proposed algorithm, a series of experiments was conducted using the digital model parameters derived from the physical entity of the SIASUN cooperative robot. The constraints of each joint of the manipulator were set to $[-\pi ,\ \pi ]$, with the exception of Joint 4, which was constrained to $[-8\pi /9,\ 8\pi /9]$. The initial $T$ was set to 10, the reward-function weight coefficients to ${w}_{a}={w}_{s}={\lambda}_{v}={\lambda}_{acc}={w}_{e}=1$ and ${\sigma}_{a}=2$, and the accuracy to ${d}_{e}=0.003$.
The training environment for the reinforcement learning process was established in simulation using the deep learning framework PyTorch and Gym. The hyperparameters employed by the PPO algorithm are presented in Table 2. To further demonstrate the algorithm's effectiveness and adaptability in practical applications, physical experiments were conducted on a collaborative robot following the trajectory planned by the proposed method.
Two different neural networks, namely the actor network and the critic network, were used to handle the continuous action space. The implementation of the networks was based on the PyTorch framework. The structure of the actor and critic networks is shown in Table 3.
5.2. Performance Evaluation in Simulation Environment
Taking the center of the actuator at the end of the robotic arm as the point of interest, the starting point was set to ${q}_{start}=[0,0,27\pi /35,7\pi /35,0,0]$ and the stopping point to ${q}_{end}=[3,1.5,1.5,2,2,2.5]$.
To provide a more-comprehensive evaluation of the proposed method, this study included an analysis of the bidirectional rapidly exploring random tree (BiRRT) algorithm, an advanced variant of the RRT algorithm. The RRT, a sampling-based motion-planning technique, progressively develops a tree structure to navigate the robot's configuration space. This method demonstrates exceptional effectiveness in high-dimensional spaces and in addressing intricate constraints, making it a pertinent reference point for assessing the performance of the proposed approach in a scientifically rigorous manner. To maintain consistency in the number of planned trajectory points, the step size of the BiRRT method was set to 0.145 m, ensuring efficient planning. Additionally, a bidirectional search method was employed to further optimize the process.
Figure 3 presents the trajectories planned with the proposed method and the BiRRT framework, where the blue dotted line represents BiRRT, the red line represents the proposed method, the yellow dots indicate the starting position of the planned trajectory, and the purple dots correspond to the ending position. As illustrated in Figure 3, both BiRRT and the proposed RL method can achieve trajectory planning. However, due to the configuration of the planned trajectory points, several unstable points emerge in the middle section of the BiRRT-planned trajectory, while the RL method attains smooth trajectory planning. In terms of convergence, the BiRRT method exhibits rapid convergence towards feasible solutions at the expense of requiring multiple adjustments. In contrast, the RL method does not necessitate this step, but requires numerous training iterations to discover the optimal solution.
Figure 4 and Figure 5 show, respectively, the velocity and acceleration of the six joint movements during the trajectory-planning process of the robotic arm. Figure 6 displays the jerk variation of the six joints for BiRRT and RL in the trajectory planning of the six-axis robotic arm. The proposed RL method exhibited a marginally smaller oscillation amplitude than the BiRRT method, further substantiating that the RL method can optimize the planned trajectory, whereas the BiRRT method offers no comparable optimization capability.
Figure 7 depicts the average reward during the training process. It can be seen that the training reward has several stable stages, which is due to the adaptive adjustment of the step time. When the step time changes, the corresponding cumulative reward will undergo new changes and eventually tend to be stable.
Figure 8 shows the change in the episode time step of the decaying episode mechanism throughout the training process. As the training continued, the time step of the episode decreased gradually, from 60 to 46, then to 36, and finally to 32 under the set maximum training conditions, where it stabilized. This result can be verified indirectly from Figure 7 and demonstrates the adaptive adjustment of the trajectory planning, which corresponds to the continuous updates occurring within the neural network.
5.3. Performance Evaluation in Physical Environment
To validate the effectiveness of the proposed RL-based trajectory-planning method in a real-world scenario, physical experiments were conducted using a cooperative robotic arm. The experiments were designed to closely follow the simulated scenarios, and the robotic arm's motion was captured at different time intervals (0 s, 6 s, 12 s, 18 s, 24 s, and 30 s). The changes in the robotic joints demonstrated the practical applicability of the proposed method.
Figure 9 presents the motion of the robotic arm at various time intervals during the experiment. The images reveal that the proposed RL-based trajectory-planning method resulted in smooth and efficient motion, with no sudden jerks or instabilities. The real-world performance of the robotic arm closely matched the simulated results, indicating that the RL method can effectively optimize the planned trajectory, whereas the BiRRT method cannot achieve such performance. In addition to the experimental results detailed in this paper, Supplementary Materials are provided for a more comprehensive understanding, including a demonstrative video (Video S1) that illustrates the path-planning mechanism employed by the six-axis robotic arm.
The physical experiments revealed that the proposed RL-based trajectory-planning method can successfully optimize the robotic arm's motion in terms of velocity, acceleration, and jerk minimization. This demonstrates the potential of the method for various practical applications, including, but not limited to, industrial automation, robotic manipulation, and autonomous systems.
6. Conclusions
This study presented a comprehensive investigation into the trajectory-planning problem of a six-axis robotic arm, utilizing deep reinforcement learning to develop a multi-objective optimization approach. The method integrates the goals of maximizing accuracy and smoothness while minimizing energy consumption, allowing for the generation of a desired trajectory for the robotic arm. The input parameters consisted of joint angles and Cartesian coordinates based on forward and inverse kinematics, while the output was the estimation of the joint angles. The decaying episode mechanism was employed throughout the training process, fostering rapid convergence to more-efficient solutions.
Through a comparative analysis with the BiRRT algorithm, the proposed method demonstrated its effectiveness and robustness, as evidenced by the simulation results. These findings have the potential to contribute significantly to the future development of advanced robotic arm trajectory planning, enhancing performance and broadening the applicability of deep reinforcement learning in complex motion-planning tasks.