Constrained Motion Planning of 7-DOF Space Manipulator via Deep Reinforcement Learning Combined with Artificial Potential Field

Abstract: During on-orbit operation tasks of a space manipulator, some specific scenarios, such as refueling and auxiliary docking, require strict constraints on both the position and orientation of the end-effector. To this end, a novel motion planning approach for a space manipulator is proposed in this paper. Firstly, a kinematic model of the 7-DOF free-floating space manipulator is established by introducing the generalized Jacobian matrix. On this basis, a planning approach is designed to realize the motion planning of the 7-DOF free-floating space manipulator. Since the on-orbit environment is dynamic, the motion planning approach must be robust; a deep reinforcement learning algorithm is therefore introduced to design the motion planning approach. Meanwhile, the deep reinforcement learning algorithm is combined with the artificial potential field to improve convergence. Besides, the self-collision avoidance constraint is considered during planning to ensure operational security. Finally, comparative simulations are conducted to demonstrate the performance of the proposed planning method.


Introduction
In the last few decades, aerospace technology has developed rapidly, and on-orbit operation tasks have become more and more complicated. Although most of those tasks can be carried out by astronauts, extravehicular activities are expensive and dangerous, especially when the targets are non-cooperative. With the development of automatic control, microelectronics, and other technologies, manipulators have been able to replace humans to complete tasks such as moving objects [1] and precise operations [2]. In this context, the space manipulator has attracted increasing attention from researchers in the fields of non-cooperative target capture [3], combined spacecraft control [4], space station maintenance [5,6], large-scale space structure assembly [7], and other on-orbit operations. Dynamic modeling, motion planning, and trajectory tracking control of robotic arms are critical to mission success [8]. Due to the dexterity brought by the humanoid arm structure, the 7-DOF free-floating space manipulator is one of the most widely used manipulators in aerospace engineering. However, the high degree of freedom means that the dynamic model of the manipulator is more complex, which also brings great challenges to motion planning, especially when there are complex constraints. Therefore, the constrained motion planning of the 7-DOF free-floating space manipulator is of significant theoretical and practical value.
Some specific operation tasks, e.g., refueling and auxiliary docking, require strict constraints on the position and orientation of the space manipulator end-effector, thus a large number of studies have been conducted in recent years to solve the motion planning problem of aligning the position and orientation of the space manipulator end-effector. To deal with the dynamic singularity limits in motion planning of the 7-DOF free-floating space robot, Wang proposed a constrained differential evolution (DE) scheme with a premature handling strategy, which can move the end-effector to the target pose while minimizing the disturbance on the base [9]. The same planning goal was achieved by Liu through particle swarm optimization combined with differential evolution [10]. To address the motion planning of a robotic arm mounted on a free-floating unmanned spacecraft, Rybus proposed an active-set algorithm-based planning method and obtained a collision-free path of the arm, which can move the gripper to the desired position and orientation and maneuver the spacecraft to the desired attitude [11]. Jin developed a method based on damped least squares and feedback compensation to avoid the dynamic singularities caused by the inverse kinematics during the motion planning of a high-degree-of-freedom space robot. The base attitude disturbance and moving time constraints were considered, and chaotic particle swarm optimization was employed to solve the multi-objective optimization problem [12]. Zhang considered the attitude disturbance of the base caused by the joint motion of the space manipulator and the collision between the end-effector and the target, and proposed a preimpact trajectory planning method for the non-redundant free-floating space manipulator based on particle swarm optimization, which allows the end-effector to move to the desired capture pose with minimum base disturbance [13].
Lu formulated the motion planning problem of the 10-DOF free-floating space manipulator as a constrained convex quadratic programming problem, thereby achieving spacecraft attitude stabilization and end-effector trajectory tracking simultaneously [14]. Similarly, Misra also solved the 10-DOF free-floating space manipulator trajectory planning problem by formulating it as a convex optimization problem; in addition to minimizing the disturbance on the base, obstacle avoidance was also considered in his work [15,16]. Lu proposed a Cartesian motion planning algorithm for the free-floating space manipulator and designed four optimization objective functions for different control requirements, thus achieving end-effector position and orientation alignment while minimizing the disturbance on the base [17]. It can be seen that the existing methods for solving the space manipulator end-effector position and orientation alignment motion planning problem are mainly based on heuristic algorithms and optimization methods. Their common shortcoming is that the planning method is very sensitive to the initial configuration and the desired pose of the space manipulator, which means a new planning process must be performed whenever the initial configuration or the desired pose changes. Meanwhile, when measurement errors occur during motion, the planning may fail. Considering that the real operation environment on orbit is complicated and dynamic, robustness is necessary for the motion planning approach. It should be noted that the definition of the position and orientation of the space manipulator end-effector in this paper is different from that in most of the above references. Since the mission scenarios considered in this paper, such as refueling and auxiliary docking, have no special requirement on the last joint of the space manipulator, the rotation of the end-effector around its main axis is not considered during planning.
Artificial potential field (APF) has been widely used for obstacle avoidance of manipulators since it was first proposed by Khatib in 1986 [18]. Liu proposed a motion planning method based on an improved APF to enable the space manipulator to pass through a window-shaped obstacle. However, the repulsive force generated by the repulsive potential field of obstacles was set to act only on the end-effector, so collision-free motion between the other parts of the manipulator and the obstacles could not be guaranteed [19]. To properly handle the interaction between the repulsive potential field of an obstacle and the entire manipulator, some studies have focused on the selection of the point of action. Wang placed the action point of the repulsive force on the joint closest to the obstacle [20]. Li described the repulsive force acting on the manipulator by selecting several evaluation points along the entire manipulator [21]. The above motion planning methods are all carried out in Cartesian space, and inverse kinematics solutions are required at each planning step. However, the inverse kinematics of a manipulator is a complicated multi-solution problem, which may result in a discontinuous joint angle path. Zhang proposed an improved APF method for six-DOF serial robot motion planning to address this problem; the planning was performed directly in the joint space to avoid solving the inverse kinematics [22]. Although this method no longer needs to solve the inverse kinematics, it requires substantial traversal at each step and easily falls into a local minimum; meanwhile, the robustness of this method is poor.
Since artificial intelligence approaches have strong learning and reasoning abilities [23], they have been introduced by scholars to improve the robustness of space manipulator motion planning and control methods [24]. Among them, reinforcement learning is one of the most widely used approaches in manipulator path planning [25]. Yan used soft Q-learning to solve the motion planning and control problem of free-floating space robots capturing targets [26]. Liang designed a motion control model for a space robot based on the actor-critic algorithm, which can move the end-effector to the desired position and orientation [27]. In contrast, the deep deterministic policy gradient (DDPG) algorithm is more commonly used in manipulator path planning problems due to its good convergence and its outstanding ability to address continuous action spaces [28]. For ground manipulators, some scholars use the DDPG algorithm to solve the point-to-point path planning problem [29][30][31][32]. However, their planning goals are relatively simple and do not consider complex motion constraints. For space manipulators, Du presented a "pretraining" skill to improve the learning efficiency of DDPG in space manipulator motion planning [33]. Hu considered multiple constraints, including safety performance, coupling disturbance, and path length, and proposed a multi-constrained reward deep deterministic policy gradient algorithm for the motion planning of a free-floating space robot, by which the end-effector can be moved to the desired position while avoiding obstacles [34]. In addition, the motion planning problem of the free-floating dual-arm space manipulator has also been researched by some scholars. Wu solved the motion planning problem of a dual-arm space robot by using DDPG [35]. Based on his work, Li considered complicated constraints, thereby making the path planning approach of the dual-arm space manipulator closer to actual engineering [36].
In the aforesaid works, only the position of the end-effector was planned, and the degrees of freedom of the considered manipulators are relatively low, which limits their range of application. However, considering a high-degree-of-freedom space manipulator model and complex planning goals makes the DDPG algorithm difficult to converge.
In this paper, a novel motion planning approach based on deep reinforcement learning combined with the artificial potential field is proposed to solve the 7-DOF space manipulator motion planning problem considering end-effector position and orientation alignment. Because the base of the space manipulator is usually in a free-floating state when performing on-orbit operation tasks, there exists a strong coupling effect between the base and the manipulator links. The generalized Jacobian matrix proposed by Umetani and Yoshida [37] is introduced to describe the kinematic characteristics of the 7-DOF free-floating space manipulator. Considering the complicated and dynamic environment in space, robustness is necessary for the motion planning algorithm, thus the DDPG algorithm is adopted to establish the algorithm framework of the proposed planning approach. The artificial potential field is utilized to improve the convergence of DDPG for planning missions with complex goals. Furthermore, the self-collision avoidance constraint is considered during the planning process to ensure operational security.
The rest of this paper is organized as follows: in Section 2, a preliminary, including the kinematics of 7-DOF free-floating space manipulator, the principle of DDPG algorithm, and the artificial potential field, is introduced. The motion planning algorithm design is elaborated in Section 3. Section 4 shows the simulation results and discussion. Additionally, the conclusions of this work are given in Section 5.

Preliminary
In this paper, the angular velocities of the manipulator joints are taken as the planning object, so the kinematic model of the 7-DOF space manipulator is indispensable. Meanwhile, the DDPG algorithm and the artificial potential field are the theoretical bases of the proposed motion planning approach. Therefore, in this section, the kinematics of the 7-DOF free-floating space manipulator are established, and a brief introduction of DDPG algorithm and artificial potential field is given.

Kinematic Modeling of 7-DOF Free-Floating Space Manipulator
The configuration and kinematic relationship of the 7-DOF space manipulator considered in this paper are shown in Figure 1. The dynamic parameters are listed in Table 1. Among them, link 1 is an offset link fixedly connected to the base to prevent link 2 and link 3 from colliding with the base. Different from ground manipulators, there exist complex coupling effects between the free-floating base and the links of the space manipulator, so the commonly used kinematic modeling approaches are inapplicable to space manipulators. To this end, the generalized Jacobian matrix is introduced to establish the kinematic model of the 7-DOF space manipulator.

Define joint j as the joint between link j − 1 and link j, and joint 1 as the joint between the base and link 1, which is a fixed joint as mentioned before. According to the geometric topological relation shown in Figure 1, the mass center position of each link can be described as follows:

$$ {}^I r_i = {}^I r_0 + {}^I b_0 + \sum_{k=1}^{i-1} \left( {}^I a_k + {}^I b_k \right) + {}^I a_i \tag{1} $$

where the superscript "I" denotes that the corresponding vector is represented in the inertial coordinate system. ${}^I r_0$ and ${}^I r_i$ are the radius vectors of the mass centers of the base and link $i$ ($i \in [1,8]$), respectively. ${}^I a_j$ denotes the vector from joint $j$ to the mass center of link $j$, and ${}^I b_j$ is the vector from the mass center of link $j$ to joint $j+1$. It should be noted that ${}^I b_0$ denotes the vector from the mass center of the base to the connection point between the base and link 1. Taking the derivative of ${}^I r_i$ with respect to time, the velocity of each link can be obtained as follows:

$$ {}^I v_i = {}^I v_0 + {}^I \omega_0 \times \left( {}^I r_i - {}^I r_0 \right) + \sum_{j=1}^{i} \dot{\varphi}_j \, {}^I z_j \times \left( {}^I r_i - {}^I p_j \right) \tag{2} $$

where ${}^I v_0$ and ${}^I \omega_0$ denote the linear velocity and angular velocity of the base, respectively; ${}^I v_i$ is the velocity of the mass center of link $i$; ${}^I z_j$ is the direction vector of the rotation axis of joint $j$; ${}^I p_j$ and $\dot{\varphi}_j$ are the position vector and angular velocity magnitude of joint $j$, respectively. Note that the angular velocity of joint 1 is zero.

The angular velocity of link $i$ can be described as follows:

$$ {}^I \omega_i = {}^I \omega_0 + \sum_{j=1}^{i} \dot{\varphi}_j \, {}^I z_j \tag{3} $$

From Equations (1)-(3), the linear velocity and angular velocity of the end-effector can be obtained as follows:

$$ {}^I v_e = {}^I v_0 + {}^I \omega_0 \times \left( {}^I p_e - {}^I r_0 \right) + \sum_{j=1}^{8} \dot{\varphi}_j \, {}^I z_j \times \left( {}^I p_e - {}^I p_j \right), \qquad {}^I \omega_e = {}^I \omega_0 + \sum_{j=1}^{8} \dot{\varphi}_j \, {}^I z_j \tag{4} $$

where ${}^I v_e$ and ${}^I \omega_e$ denote the linear velocity and angular velocity of the end-effector, respectively, and ${}^I p_e$ is the position vector of the end-effector. Equation (4) can be transformed into the following form:

$$ \begin{bmatrix} {}^I v_e \\ {}^I \omega_e \end{bmatrix} = J_0 \begin{bmatrix} {}^I v_0 \\ {}^I \omega_0 \end{bmatrix} + J_\varphi \dot{\varphi} \tag{5} $$

where $\dot{\varphi} = [\dot{\varphi}_1, \dots, \dot{\varphi}_8]^T$, and $J_0$ and $J_\varphi$ are the Jacobian matrices from the joint space to the inertial space with the following definitions:

$$ J_0 = \begin{bmatrix} E_3 & -\tilde{p}_{0e} \\ 0_{3\times3} & E_3 \end{bmatrix}, \qquad J_\varphi = \begin{bmatrix} {}^I z_1 \times \left( {}^I p_e - {}^I p_1 \right) & \cdots & {}^I z_8 \times \left( {}^I p_e - {}^I p_8 \right) \\ {}^I z_1 & \cdots & {}^I z_8 \end{bmatrix} \tag{6} $$

where $E_3$ denotes the $3 \times 3$ identity matrix and ${}^I p_{0e} = {}^I p_e - {}^I r_0$. Additionally, for a vector $e = [e_x, e_y, e_z]^T$, the operator $\tilde{e}$ is defined as:

$$ \tilde{e} = \begin{bmatrix} 0 & -e_z & e_y \\ e_z & 0 & -e_x \\ -e_y & e_x & 0 \end{bmatrix} \tag{7} $$

For a manipulator on the ground, the linear velocity and angular velocity of the base are both zero, thus Equation (5) is sufficient to describe the kinematic characteristics. However, for the space manipulator, the joint angular velocities are coupled with the linear velocity and angular velocity of the base, so Equation (5) needs to be rewritten. Considering that during operation the base of the space manipulator is in a free-floating state, it follows the laws of linear momentum and angular momentum conservation. The linear momentum and angular momentum of the space manipulator can be expressed as follows:

$$ P = m_0 \, {}^I v_0 + \sum_{i=1}^{8} m_i \, {}^I v_i, \qquad H = I_0 \, {}^I \omega_0 + {}^I r_0 \times m_0 \, {}^I v_0 + \sum_{i=1}^{8} \left( I_i \, {}^I \omega_i + {}^I r_i \times m_i \, {}^I v_i \right) \tag{8} $$

where $P$ and $H$ represent the linear momentum and angular momentum, respectively, and $m_i$ and $I_i$ are the mass and inertia tensor of body $i$. Substituting Equations (1)-(3) into Equation (8), we can obtain:

$$ \begin{bmatrix} P \\ H \end{bmatrix} = H_b \begin{bmatrix} {}^I v_0 \\ {}^I \omega_0 \end{bmatrix} + H_{b\varphi} \dot{\varphi} \tag{9} $$

where $H_b$ is the base inertia matrix, $H_{b\varphi}$ is the base-manipulator coupling inertia matrix, and ${}^I r_g$, the radius vector of the mass center of the whole space manipulator system, appears in the entries of these matrices. Assuming that the initial linear and angular momentum are both zero, the base velocities can be eliminated, and the kinematic model of the space manipulator is obtained by combining Equations (5) and (9):

$$ \begin{bmatrix} {}^I v_e \\ {}^I \omega_e \end{bmatrix} = \left( J_\varphi - J_0 H_b^{-1} H_{b\varphi} \right) \dot{\varphi} = J_g \dot{\varphi} \tag{10} $$

where $J_g$ is the generalized Jacobian matrix.
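The final step of the derivation, eliminating the base velocities from the end-effector kinematics via momentum conservation, can be sketched numerically. The matrix names below follow the standard Umetani-Yoshida formulation and are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def generalized_jacobian(J_phi, J_0, H_b, H_b_phi):
    """Generalized Jacobian of a free-floating manipulator:
    J_g = J_phi - J_0 @ inv(H_b) @ H_b_phi.

    J_phi   : (6, n) manipulator Jacobian (joint space -> end-effector twist)
    J_0     : (6, 6) base Jacobian
    H_b     : (6, 6) base inertia matrix (invertible)
    H_b_phi : (6, n) base-manipulator coupling inertia matrix
    """
    # solve() avoids forming the explicit inverse of H_b
    return J_phi - J_0 @ np.linalg.solve(H_b, H_b_phi)
```

With `J_g` in hand, the end-effector twist is simply `J_g @ phi_dot`; note that when the coupling matrix is zero, the expression collapses to the fixed-base Jacobian, as expected.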

Deep Deterministic Policy Gradient Algorithm
As one of the most widely used deep reinforcement learning algorithms, the DDPG algorithm has been extensively adopted to solve the motion planning problem of manipulators in recent years due to its capability to deal with continuous action space.
The implementation framework of the DDPG algorithm is demonstrated in Figure 2. As shown in Figure 2, the DDPG algorithm is constructed on the actor-critic learning framework. The algorithm flow of the actor-critic framework can be described as follows: (1) Firstly, a state s is obtained from the environment, and the actor generates an action a according to s; (2) After the action a is applied to the environment, a new state s′ and the reward r of the current step are fed back by the environment; (3) According to the reward r, the action evaluation function of the critic is updated, and then the actor updates its policy function in the direction suggested by the critic.
The above steps constitute one training step of the actor-critic learning framework; the loop then continues until the training succeeds.

Based on the actor-critic learning framework, the target networks and replay buffer of the deep Q network algorithm are introduced in DDPG to ensure the stability and convergence of the networks. The network corresponding to the actor is called the policy network, which is divided into the online policy network and the target policy network. Similarly, the network corresponding to the critic is called the value network or Q network, which is divided into the online Q network and the target Q network, as shown in Figure 2. The main functions of the above networks are as follows: (1) Online policy network: the online policy network interacts with the environment, generates the action a according to the current state s, and updates the parameters θ^μ of the policy network. (2) Target policy network: the target policy network is trained with the data sampled from the replay buffer and selects the next action a′ according to the next state s′; its parameters θ^μ′ are copied from the online policy network and soft updated. (3) Online Q network: the online Q network calculates the value Q(s, a|θ^Q) of the current state s and action a, and updates the parameters θ^Q of the Q network.
(4) Target Q network: the target Q network is mainly used to calculate the "label" y = r + γQ(s′, a′|θ^Q′), and its parameters θ^Q′ are copied from the online Q network and soft updated. The detailed algorithm flow will be shown in Section 3 and is not repeated here.
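The two target-network mechanisms described above, the critic label and the soft update, can be sketched in a few lines; the values of the discount factor γ and update rate τ below are assumed for illustration, not taken from the paper:

```python
import numpy as np

GAMMA = 0.99   # discount factor (assumed value)
TAU = 0.005    # soft-update rate (assumed value)

def td_target(r, q_next, gamma=GAMMA):
    """Critic training label: y = r + gamma * Q'(s', a' | theta^Q')."""
    return r + gamma * q_next

def soft_update(target_params, online_params, tau=TAU):
    """Soft update theta' <- tau * theta + (1 - tau) * theta'
    applied elementwise to each parameter array."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]
```

The small τ means the target networks trail the online networks slowly, which is what stabilizes the bootstrapped label y.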

Artificial Potential Field
Artificial potential field is widely used to solve the path planning problem of robots, especially when obstacles need to be avoided. The principle of the artificial potential field is to create a virtual attractive field U_a at the target to be reached, which affects the entire environment, and a virtual repulsive field U_r with a limited range of influence at each obstacle. When the agent moves in the environment, the total potential field can be expressed as:

$$ U = U_a + U_r \tag{13} $$

The force experienced by the agent in the potential field is the negative gradient of the potential energy:

$$ F = -\nabla U = -\nabla U_a - \nabla U_r \tag{14} $$

Equation (14) means that the resultant force on the agent at any position in the environment is the sum of all attractive and repulsive forces at that point. Under the action of the resultant field force, the agent can bypass obstacles on its motion path and move towards the target along the negative gradient direction.
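As a concrete illustration, the classic quadratic attractive and inverse-distance repulsive potentials (Khatib's forms, assumed here; the paper's own field is defined later) and the negative-gradient force can be sketched with a finite-difference gradient:

```python
import numpy as np

def attractive_U(x, goal, k_a=1.0):
    """Quadratic well centered at the goal."""
    return 0.5 * k_a * np.sum((x - goal) ** 2)

def repulsive_U(x, obstacle, k_r=1.0, rho0=1.0):
    """Repulsion active only within the influence range rho0."""
    rho = np.linalg.norm(x - obstacle)
    if rho >= rho0:
        return 0.0
    return 0.5 * k_r * (1.0 / rho - 1.0 / rho0) ** 2

def total_U(x, goal, obstacle):
    return attractive_U(x, goal) + repulsive_U(x, obstacle)

def force(x, goal, obstacle, eps=1e-6):
    """F = -grad U, estimated by central finite differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (total_U(x + dx, goal, obstacle)
                - total_U(x - dx, goal, obstacle)) / (2 * eps)
    return -g
```

Far from any obstacle the force reduces to the pure attraction toward the goal, which is the behavior Equation (14) describes.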

Motion Planning Algorithm Design
The algorithm flow of the artificial potential field in manipulator path planning is shown in Figure 3, and it can be described as follows: (1) Establish the artificial potential field according to the planning goal; (2) Calculate the configuration of the manipulator at the current step; (3) Traverse all possible sets of adjacent joint angles of the current configuration, find the set with the smallest total potential energy, and then return to step (2). Repeating the above cycle yields the complete optimal path along the negative gradient of the potential field. However, this method requires substantial traversal and easily falls into a local minimum. Meanwhile, it has no generalization ability, which means that when the initial configuration or the target point changes, the path needs to be replanned.
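The traversal in step (3) can be sketched as follows; the step size `delta` and the potential function `U` are placeholders. The sketch also makes the cost explicit: each step evaluates 3^n candidate joint sets for an n-joint arm (3^7 = 2187 for this manipulator), which is the "substantial traversal" criticized above:

```python
import itertools
import numpy as np

def greedy_apf_step(theta, U, delta=0.01):
    """One traversal step: each joint moves by -delta, 0, or +delta;
    keep the neighboring joint-angle set with the smallest potential U."""
    best, best_u = theta, U(theta)
    for signs in itertools.product((-1, 0, 1), repeat=len(theta)):
        cand = theta + delta * np.array(signs, dtype=float)
        u = U(cand)
        if u < best_u:
            best, best_u = cand, u
    return best
```

Iterating `greedy_apf_step` descends the potential, but it stalls at any local minimum, since no neighbor then improves U.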
To overcome the above shortcomings of the artificial potential field, and considering that the deep reinforcement learning method can avoid local optima and has strong generalization ability, the DDPG algorithm is combined with the artificial potential field in this paper. The flowchart of the proposed motion planning method is demonstrated in Figure 4.
As shown in Figure 4, the brief steps of the proposed motion planning method are as follows: (1) Establish the model of the space manipulator and the artificial potential field in the environment; (2) Observe the current state s from the environment and choose an action a; (3) Execute action a, obtain the new state s′, and calculate the reward r of the current step based on the potential energy difference between states s and s′; (4) Store (s, a, r, s′) into the replay buffer, and start training when the replay buffer is full; (5) Replace s by s′ and go to step (2). It should be noted that since the artificial potential field is introduced in the environment and the potential energy difference is used for the reward calculation, the proposed method can well prevent the agent from learning a skill that obtains higher rewards but fails to complete the goal, which is a frequent phenomenon in reinforcement learning [38].
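The steps above can be sketched as a training skeleton; `env` and `agent` are hypothetical interfaces standing in for the simulation environment and the DDPG networks, and are assumptions rather than the paper's code:

```python
import random
from collections import deque

def train(env, agent, episodes=10, buffer_size=1000, batch_size=32):
    """Skeleton of steps (1)-(5). Assumed interfaces:
    env.reset() -> (s, U)            initial state and its potential energy
    env.step(a) -> ((s2, U2), done)  next state, its potential energy, end flag
    agent.act(s) -> a, agent.learn(batch)
    """
    buffer = deque(maxlen=buffer_size)
    for _ in range(episodes):
        s, U1 = env.reset()
        done = False
        while not done:
            a = agent.act(s)                       # step (2)
            (s2, U2), done = env.step(a)           # step (3)
            r = U2 - U1                            # potential-energy difference
            buffer.append((s, a, r, s2))           # step (4)
            if len(buffer) == buffer.maxlen:       # train once buffer is full
                agent.learn(random.sample(buffer, batch_size))
            s, U1 = s2, U2                         # step (5)
```

The key design choice is that r is a potential-energy difference rather than a hand-shaped bonus, which is what prevents reward hacking as noted above.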

Setting of the State and Action in DDPG
In the motion planning of the space manipulator, the action can be chosen as the set of all the controllable joint angular velocities, since the kinematic model is adopted as a part of the environment:

$$ a = \left[ \dot{\varphi}_2, \dots, \dot{\varphi}_7 \right]^T \tag{15} $$
The information contained in the state is crucial to the training of the agent; hence, in the DDPG algorithm the state should be carefully set up to ensure that all the feature information is covered. Since the planning goal in this paper is to achieve the position and orientation alignment of the space manipulator end-effector simultaneously, the state must include not only the configuration and velocity information of the space manipulator, but also the differences between the current and target position and orientation of the end-effector. Meanwhile, the space manipulator moves in the artificial potential field, thus the potential energy of the current configuration should also be included in the state. Based on the above analysis, the state can be set as follows:

$$ s = \left[ \varphi, \dot{\varphi}, \theta_0, p_t, d_{et}, a_{et}, U \right] \tag{16} $$

where θ_0 is the attitude of the base, p_t denotes the target position and orientation of the end-effector, d_et represents the distance between the end-effector and the target point, and a_et denotes the angle between the pointing vector of the end-effector and the target vector. U denotes the potential energy corresponding to the current configuration of the space manipulator.
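A sketch of how such a state vector might be assembled; the argument names and layout are assumptions for illustration:

```python
import numpy as np

def build_state(theta, theta_dot, theta0, p_t, p_e, v_t, v_e, U):
    """Assemble the planning state. Assumed names:
    theta, theta_dot : joint angles and angular velocities
    theta0           : base attitude
    p_t, p_e         : target and end-effector positions
    v_t, v_e         : target vector and end-effector pointing vector
    U                : potential energy of the current configuration
    """
    d_et = np.linalg.norm(p_e - p_t)                       # distance to target
    cosang = np.dot(v_e, v_t) / (np.linalg.norm(v_e) * np.linalg.norm(v_t))
    a_et = np.arccos(np.clip(cosang, -1.0, 1.0))           # orientation error
    return np.concatenate([theta, theta_dot, theta0, p_t, [d_et, a_et, U]])
```

Packing d_et and a_et directly into the state spares the networks from having to rediscover the goal geometry from raw joint angles.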

Reward Function Design
In the DDPG algorithm, the reward function determines the convergence and performance of the algorithm directly. Hence, it needs to be properly designed. Considering that the motion planning algorithm needs to achieve the main planning goal while complying with the self-collision avoidance constraint, the reward function is designed from these two aspects.

The Main Planning Goal
As mentioned before, the potential energy-based reward function can prevent the agent from learning a strategy that obtains higher rewards but fails to complete the goal. Hence, an artificial potential field is designed to describe the reward of each step in DDPG. Since the main planning goal is to achieve the position and orientation alignment of the space manipulator end-effector simultaneously, d_et and a_et are the main basis of the potential field design. The form of the designed potential field is given in Equation (17), where K_d and K_a are the weight coefficients; in this paper, the coefficients are chosen as K_d = 10 and K_a = 100. The potential field has the following characteristics: when d_et is large, the potential energy is mainly determined by d_et, and the influence of a_et gradually increases as d_et decreases; when both d_et and a_et tend to zero, the potential energy reaches its maximum. The visualization of the potential field is demonstrated in Figure 5.
Let the corresponding potential energy of the current state s and the next state s′ be U_t(d_et, a_et) and U_{t+1}(d_et, a_et), respectively. Then, the reward function can be expressed as follows:

$$ r_t = U_{t+1}(d_{et}, a_{et}) - U_t(d_{et}, a_{et}) \tag{18} $$

The reward function in Equation (18) means that when the space manipulator transitions from a state with low potential energy to a state with high potential energy, a positive reward is obtained; otherwise, the reward is negative. Additionally, the total reward of each episode can be calculated as:

$$ R = \sum_{t=0}^{f-1} r_t = U_f(d_{et}, a_{et}) - U_0(d_{et}, a_{et}) \tag{19} $$

where R denotes the total reward of the episode, f is the total number of steps of the episode, and U_f(d_et, a_et) and U_0(d_et, a_et) represent the potential energy of the final state and the initial state, respectively.
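The potential-difference reward has a useful property that a small sketch makes visible: summing the per-step rewards telescopes, so the episode reward depends only on the initial and final potential energies, not on the path taken between them:

```python
def step_reward(U_next, U_curr):
    """Per-step reward: positive when the potential energy increases."""
    return U_next - U_curr

def episode_reward(U_trace):
    """Sum of per-step rewards over a trace of potential energies.
    Telescopes to U_trace[-1] - U_trace[0]."""
    return sum(step_reward(U_trace[k + 1], U_trace[k])
               for k in range(len(U_trace) - 1))
```

This is why the agent cannot inflate its return by oscillating: any detour that raises and then lowers the potential cancels out of the episode total.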
 


Self-Collision Avoidance Constraint
During the operation of the space manipulator, collisions between its links must be avoided for operational security; hence, the self-collision avoidance constraint is also considered in the reward function design. Considering that collision detection is required at each step of the DDPG algorithm, a complicated collision detection model would inevitably bring a heavy computational burden and might even cause the planning algorithm to fail. Therefore, a vector-based calculation method for the minimum distance between the links, proposed in reference [36], is introduced in this paper.
According to the configuration of the 7-DOF space manipulator, the collision situation of each link pair is listed in Table 2, where a check mark means a collision may occur between the links in the corresponding row and column, while a cross indicates that a collision cannot occur.
According to Table 2, the minimum distance among all links of the space manipulator can be described as:

d_min = min{ d_ij | i ∈ [1, 5], j ∈ [4, 8] }, (20)

where d_ij denotes the minimum distance between link i and link j, which can be calculated by the method proposed in reference [36].
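The vector-based computation of reference [36] is not reproduced here, but a standard segment-to-segment minimum distance routine of the same flavor can be sketched as follows, treating each link as a line segment; the function name and the clamping logic are illustrative assumptions rather than the authors' exact formulation:

```python
import numpy as np

def segment_distance(p1, q1, p2, q2, eps=1e-9):
    """Minimum distance between segments [p1, q1] and [p2, q2] in 3-D."""
    d1 = q1 - p1          # direction vector of segment 1
    d2 = q2 - p2          # direction vector of segment 2
    r = p1 - p2
    a = np.dot(d1, d1)
    e = np.dot(d2, d2)
    f = np.dot(d2, r)
    if a <= eps and e <= eps:            # both segments degenerate to points
        return np.linalg.norm(p1 - p2)
    if a <= eps:                          # segment 1 is a point
        s = 0.0
        t = np.clip(f / e, 0.0, 1.0)
    else:
        c = np.dot(d1, r)
        if e <= eps:                      # segment 2 is a point
            t = 0.0
            s = np.clip(-c / a, 0.0, 1.0)
        else:                             # general case: clamp parameters to [0, 1]
            b = np.dot(d1, d2)
            denom = a * e - b * b
            s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:
                t = 0.0
                s = np.clip(-c / a, 0.0, 1.0)
            elif t > 1.0:
                t = 1.0
                s = np.clip((b - c) / a, 0.0, 1.0)
    return np.linalg.norm((p1 + s * d1) - (p2 + t * d2))
```

Computing d_min in Equation (20) then amounts to taking the minimum of segment_distance over the link pairs marked as collision-prone in Table 2.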
To achieve self-collision avoidance, a penalty term related to d_min is added to the reward function in Equation (18). The penalty term r_d is designed as in Equation (21), where the coefficients k_1 and k_2 are chosen as k_1 = 2000 and k_2 = 20. From Equation (21), it can be seen that the penalty term has the following properties: when d_min is large enough, the current configuration of the space manipulator is considered safe, and the penalty is set to 0; when d_min falls below the safety threshold, the absolute value of the penalty increases rapidly as d_min decreases; and when d_min is less than the minimum safety distance, a very large penalty is applied. Thus, the composite reward function in Equation (22) combines the potential-difference reward in Equation (18) with the penalty term r_d.
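As a minimal sketch of how this composite reward could be assembled: the exact form of Equation (21) is not recoverable from the text, so the thresholds D_SAFE and D_CRIT and the branch expressions below are placeholder assumptions with the stated qualitative properties, while k_1 = 2000 and k_2 = 20 are taken from the paper.

```python
K1, K2 = 2000.0, 20.0      # penalty coefficients stated in the paper
D_SAFE, D_CRIT = 0.3, 0.1  # safety threshold / minimum safe distance (assumed values, m)

def collision_penalty(d_min):
    """Piecewise penalty with the properties described for Equation (21):
    zero when safe, steeply growing inside the threshold, very large below D_CRIT."""
    if d_min >= D_SAFE:
        return 0.0                                          # configuration is safe
    if d_min > D_CRIT:
        # magnitude grows rapidly as d_min approaches the minimum safe distance
        return -K2 * (D_SAFE - d_min) / (d_min - D_CRIT)
    return -K1                                              # inside the critical zone

def composite_reward(U_next, U_curr, d_min):
    """Potential-difference reward (Eq. 18) plus the self-collision penalty."""
    return (U_next - U_curr) + collision_penalty(d_min)
```

The piecewise branches mirror the three properties listed above; any smooth function with the same limits would serve equally well.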

Simulation Setup
The simulation is implemented in Python 3.6 (https://www.python.org/). The parameter settings of the actor and critic networks in DDPG are as follows: the number of hidden layers is set to 2, each layer has 200 neurons, and the learning rate of both networks is set to 0.001. The size of the replay buffer is set to 80,000, and the size of the mini-batch sampled at each step is 32. The maximum number of steps in each episode is 8000, and the step length is 0.03 s, which means that the maximum time length of each episode is 240 s. Additionally, the maximum number of episodes is set to 5000. The reference configuration of the space manipulator is shown in Figure 1, and the initial configurations for training and testing are listed in Table 3. It should be noted that since joint 1 is a fixed joint and joint 8 makes no contribution to the position and orientation of the end-effector, only joints 2–7 are set. The target point, represented in the inertial frame, is (0.5, 1.6, 1), and the target orientation vector is (0, 1, 0).
Table 3. The initial configuration of the space manipulator.

States | Initial Configuration for Training | Initial Configuration for Testing
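The settings above can be collected into one configuration object, with a toy actor showing only the stated topology (two hidden layers of 200 neurons, tanh-bounded output); all names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hyperparameters as stated in the simulation setup
DDPG_CONFIG = {
    "hidden_layers": 2, "hidden_units": 200,
    "actor_lr": 1e-3, "critic_lr": 1e-3,
    "replay_buffer_size": 80_000, "batch_size": 32,
    "max_steps_per_episode": 8000, "step_length_s": 0.03,  # 8000 * 0.03 s = 240 s
    "max_episodes": 5000,
}

def make_actor(state_dim, action_dim, rng):
    """Randomly initialised 2x200 MLP actor; tanh keeps actions bounded."""
    sizes = [state_dim, 200, 200, action_dim]
    params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    def actor(s):
        x = np.asarray(s, dtype=float)
        for W, b in params[:-1]:
            x = np.tanh(x @ W + b)     # hidden layers
        W, b = params[-1]
        return np.tanh(x @ W + b)      # bounded joint-velocity command
    return actor
```

The state and action dimensions are left as arguments, since the paper's observation vector (joint angles plus end-effector error terms) is not fully enumerated here.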

DDPG Training
The training of the designed algorithm for the position and orientation alignment of the space manipulator end-effector is carried out. The planning goal is considered achieved when d_et is within 0.05 m and a_et is less than 1°. To show the superiority of the motion planning approach based on deep reinforcement learning and the artificial potential field, a comparison is carried out using the potential function in Equation (17) as the reward function instead of the potential difference in Equation (18), following the reward function design method proposed in reference [36]; the simulation results are demonstrated in Figures 6–8.

Figure 6 demonstrates the success rate of the proposed motion planning approach and of the DDPG-based motion planning method proposed in reference [36]. It can be seen that the success rate of the proposed approach is maintained above 90% after about 4000 episodes, while the success rate of the DDPG-based method rises first and then falls, and ultimately fails to converge to a desirable result. This is because, in the DDPG-based method, when the end-effector keeps hovering near the target point, the agent can obtain a greater total reward than by completing the orientation alignment; this proves that the introduction of the artificial potential field prevents the agent from learning a strategy that obtains higher rewards but fails to complete the goal.
Figure 7 gives the episode reward: the light-blue line represents the total reward of each episode, while the dark-blue line is its smoothing-spline curve. According to the dark-blue line, the episode reward follows the same trend as the success rate, which further demonstrates the good convergence of the designed algorithm.
The self-collision rate is shown in Figure 8. As can be seen, the self-collision rate shows an opposite trend to the success rate, that is, there is a high probability that the planning goal cannot be successfully achieved when self-collisions occur. The self-collision rate finally converges to about 5%, which verifies the collision-avoidance capability of the proposed method.

Space Mission Application Case
Based on the trained DDPG networks, two space mission scenarios are simulated to test the effectiveness of the proposed approach and to demonstrate that it is not sensitive to the initial configuration of the space manipulator. The initial configuration of the space manipulator is set as shown in Table 3.

The differences in distance and angle between the end-effector and the target position and orientation are shown in Figure 9. It is obvious that after about 23 s, the position and orientation deviations converge to the preset accuracy range simultaneously, which verifies the effectiveness of the proposed motion planning approach.
The minimum distance between all links during the motion of the space manipulator is given in Figure 10. It can be seen that the minimum distance gradually decreases and enters the safety threshold at around 16 s, but it always remains greater than the minimum safety distance of 0.1 m during the motion, which verifies the effectiveness of the self-collision avoidance constraint. To show the motion characteristics of the space manipulator more clearly, the trajectory of the base and all joint angles are demonstrated in Figures 11 and 12.
Another set of simulations with different initial joint angles is conducted to demonstrate that the proposed method is not sensitive to the initial configuration of the space manipulator. The initial configuration settings are shown in Table 3, and the initial joint angles are set to (−1.6, −12.0, −33.9, 6.4, 17.6, 39.5) rad. The results are shown in Figure 13, from which it can be seen that even when the distance between the initial position of the end-effector and the target point becomes shorter, the proposed motion planning approach can still generate a trajectory that achieves the position and orientation alignment of the end-effector, which indicates that the proposed approach is not sensitive to the initial configuration of the space manipulator.

Robustness Verification
In practical applications, the state information of the space manipulator, such as the base position, base attitude, and joint angles obtained by the measuring units, usually contains errors and uncertainties, so robustness of the motion planning is required. To verify the robustness of the proposed approach, a Monte Carlo simulation is carried out for the mission scenario in which the state information input to the planning algorithm deviates from the actual state. The initial configuration of the space manipulator is shown in Table 3, and the initial joint angles are set to (0.1π, 0.1π, 0.2π, 0.5π, 0.5π, 0.3π) rad. During motion, random measuring errors in the range of [−5, +5] deg are added to all joint angle inputs at each step; 1000 Monte Carlo simulations are performed, and the results are shown in Figure 14.

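The noise-injection step described above can be sketched as follows; the planning episode itself is abstracted into a caller-supplied rollout function (a hypothetical stand-in), and only the uniform [−5, +5] deg measurement-error model matches the text:

```python
import numpy as np

NOISE_DEG = 5.0                      # measurement-error bound from the paper
rng = np.random.default_rng(42)

def noisy_joint_measurement(q_true):
    """Add uniform measurement error in [-5, +5] deg to each joint angle (rad)."""
    noise = np.deg2rad(rng.uniform(-NOISE_DEG, NOISE_DEG, size=q_true.shape))
    return q_true + noise

def monte_carlo(n_runs, q0, rollout):
    """Run `rollout` (a stand-in for one planning episode returning True on
    reaching the target zone) n_runs times and report the success fraction."""
    successes = sum(rollout(q0) for _ in range(n_runs))
    return successes / n_runs
```

In the paper's experiment, the rollout would apply noisy_joint_measurement to the planner's input at every step; here only the noise model and the tallying loop are shown.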
As presented in Figure 14, the end-effector terminal states in all 1000 Monte Carlo simulations are within the target zone, verifying that the proposed method is strongly robust to measuring errors. Note that, in theory, all terminal state points should lie on the blue line when the planning goal is achieved; however, since the sampling is discrete during the simulation process, some terminal state points fall inside the target zone rather than exactly on the line.

Conclusions
A motion planning approach based on the DDPG algorithm and the artificial potential field is proposed in this paper to achieve the position and orientation alignment of the space manipulator end-effector during on-orbit operation. The kinematic characteristics of the free-floating 7-DOF space manipulator are described by the generalized Jacobian matrix, and the framework of the motion planning algorithm is established based on the DDPG algorithm. The artificial potential field is constructed in the environment part of the DDPG algorithm, and the reward function is designed on this basis, which improves the convergence of the DDPG algorithm. Considering the operational security of the space manipulator, the self-collision avoidance constraint is also taken into account. The effectiveness and superiority of the proposed approach are verified by comparative simulations. The results of this paper can provide theoretical support for on-orbit operation tasks of space manipulators with end-effector position and orientation alignment requirements.