Buffer Compliance Control of Space Robots Capturing a Non-Cooperative Spacecraft Based on Reinforcement Learning

Abstract: To address the problem that joints are easily damaged by impact torque when a space robot captures a non-cooperative spacecraft on orbit, a reinforcement learning control algorithm combined with a compliant mechanism is proposed to achieve buffer compliance control. The compliant mechanism not only absorbs the impact energy through the deformation of its internal spring, but also, together with the compliance control strategy, limits the impact torque to a safe range. First, the dynamic models of the space robot and the target spacecraft before capture are obtained using the Lagrange approach and the Newton-Euler method. Then, based on the law of conservation of momentum and the kinematic and velocity constraints, the integrated dynamic model of the post-capture hybrid system is derived. Because the post-capture hybrid system is unstable, a buffer compliance control scheme based on reinforcement learning is proposed to stabilize it. An associative search network is employed to approximate the unknown nonlinear functions, and an adaptive critic network is used to construct the reinforcement signal that tunes the associative search network. Numerical simulation shows that the proposed control scheme reduces the impact torque acting on the joints by up to 76.6% and by at least 58.7% in the capturing operation phase. In the stable control phase, the impact torque acting on the joints is limited to within the safety threshold, which avoids overload and damage of the joint actuators.


Introduction
With the development of space technology, the number of spacecraft launched every year is increasing, which has created a series of demanding, high-risk space missions, such as on-orbit refueling, on-orbit maintenance, recovery of failed spacecraft, and space debris removal [1][2][3][4]. As outer space is a harsh environment with high pressure, extreme temperatures, high vacuum and strong electromagnetic radiation, it is very hazardous for astronauts to leave the module to carry out these operations. Giordano et al. [5] proposed a dynamics decomposition that decouples the end-effector task from the base force actuator and reduces the use of thrusters. Qin et al. [6] proposed a fuzzy adaptive robust control (FARC) strategy that adapts to model variations for trajectory tracking control of space robots. Virgili-Llop et al. [7] presented an optimization-based guidance algorithm for space robots that is suitable for onboard implementation and real-time use. Ai and Chen [8] considered the capture of a spacecraft by dual-arm clamping and the force/position control of the post-capture stabilization motion, and proposed a fuzzy control scheme based on passivity theory. Therefore, using space robots instead of astronauts to complete on-orbit service (OOS) missions is a better choice, and the capture and manipulation capability of space robots is the basic and key technology for realizing OOS missions. Liu et al. [9] studied an on-orbit service space robot with joint friction and derived the dynamic equations of the system based on Jourdain's velocity variation principle and the single direction recursive construction method. Lim and Chung [10] analyzed the dynamic behavior of a tethered satellite system for space debris capture and established its equations of motion using the absolute nodal coordinate formulation. Shah et al. [11] presented strategies for point-to-point reactionless manipulation of a satellite-mounted dual-arm robotic system for capturing tumbling orbiting objects. Uyama et al. [12] studied impedance-based contact control of a free-flying space robot with respect to the coefficient of restitution. Consequently, research on capture technology for space robots has become a hot topic in the aerospace field in recent years.
The process of a space robot capturing a non-cooperative spacecraft can be divided into four phases: (1) the observation phase, in which the position and attitude of the target spacecraft are observed; (2) the approaching phase, in which the space robot reaches the capture area through trajectory planning and motion control; (3) the capturing operation phase, in which the space robot uses its end-effector to capture the target spacecraft; and (4) the stable control phase, in which a stabilization control strategy is designed for the hybrid system formed by the space robot and the target spacecraft, whose post-capture motion is unstable because of the collision and impact in the capturing operation phase. During phase 3, the space robot inevitably experiences violent contact and collision with the target spacecraft, and the joints of the manipulator arm are subjected to a large impact torque [13]. If the impact torque acting on a joint is too large, it can damage the joint and cause the space mission to fail. At present, there is no effective way to solve this problem other than minimizing the relative approach velocity. Although this method is feasible for cooperative spacecraft, it is basically not applicable to non-cooperative spacecraft. Therefore, in phases 3 and 4 of the capture of a non-cooperative spacecraft, it is of great value to take measures that prevent damage to the joint actuators caused by such impacts and collisions.
Recently, the dynamics and control of space robots capturing a spacecraft have become a focus of aerospace research, and a number of results have emerged. These studies mainly address pre-capture motion planning and post-capture attitude control. For the motion planning and trajectory tracking control of space robots, Jiang et al. [14] investigated the finite-time control problem of attitude stabilization for a rigid spacecraft subject to external disturbances, actuator faults, and input saturation, and proposed an adaptive fixed-time-based finite-time attitude controller that guarantees finite-time reachability of the attitude into a small neighborhood of the equilibrium point. Liu et al. [15] studied the effect of payload collisions on the dynamics and control of a flexible dual-arm space robot capturing an object, proposed a method for determining the initial conditions for post-impact dynamic simulation of the system, and designed a PD controller to stabilize the robot system after capture. Walker et al. [16] presented an adaptive control method that achieves globally stable trajectory tracking in the presence of uncertainties in the inertial parameters of the system. Yi and Ge [17] studied an indirect Legendre pseudospectral method for attitude motion tracking control of an asymmetric underactuated rigid spacecraft equipped with only two pairs of jet thrusters. Sands [18] proposed a novel optimization whiplash compensation method to realize automatic control of flexible space robotics. Stolfi [19] focused on maintaining a stable first contact between the end-effectors of the arms and a target satellite before the grasp is performed, and investigated the application of the Impedance + PD control approach to a two-arm space manipulator used to capture a non-cooperative target. Zhang and Zhu [20] observed that the planning task does not need to solve the inverse kinematics and developed a novel motion planning algorithm based on rapidly-exploring random trees (RRTs) that moves a free-floating space robot from an initial configuration to a goal end-effector pose. Cocuzza [21] proposed a novel solution for the inverse kinematics of redundant space manipulators, based on a constrained least-squares approach, that locally minimizes the dynamic disturbances transferred to the spacecraft during trajectory tracking maneuvers. Du et al. [22] studied the attitude stabilization of spacecraft using the continuous finite-time control technique, designing a finite-time attitude tracking control law for a single spacecraft and a distributed finite-time attitude synchronization algorithm for a group of spacecraft. Aghili [23] presented a combined prediction and motion-planning scheme for robotic capture of a drifting and tumbling object with unknown dynamics using visual feedback, in which the estimated states, parameters, and predicted motion trajectories are used to plan the trajectory of the robot's end-effector so that it intercepts a grapple fixture on the object with zero relative velocity in an optimal way. To realize attitude stabilization and joint tracking control of a space robot with flexible links and an elastic base, Yu [24] proposed a terminal sliding mode controller based on the desired trajectory to control the free-flying space manipulator in the presence of parametric uncertainties and modeling errors.
For the post-capture attitude stabilization control of space robots, Cheng [25] studied the attitude management of space robots after capturing a satellite and the control of the auxiliary docking operation, and presented an adaptive control scheme based on an extreme learning machine to achieve coordinated control of the target. Wang et al. [26] considered identifying the mass properties and eliminating the unknown angular momentum of space robotic systems after capturing a non-cooperative target, designed an integrated control framework that includes a detumbling strategy, coordination control and parameter identification, and proposed an impedance-based coordination control scheme that stabilizes both the base and the end-effector while accounting for the target's parameter uncertainty. Zhang et al. [27] proposed a modified adaptive sliding mode control algorithm that reduces the unknown angular momentum of a target and uses a new signum function and time-delay estimation to ensure fast convergence and good performance with little chattering. Wu et al. [28] developed a generic frictional contact model that represents the contact forces between the robot's end-effector and the target object and designed a resolved motion admittance control method based on this model. Rekleitis [29] developed a planning and control methodology for manipulating passive objects by cooperating orbital free-flying servicers in zero gravity. Although the above control algorithms address the dynamics and control of space robots capturing a spacecraft, they do not consider protecting the joint actuators of the space robot against impact torque. Since a space robot's joints are easily damaged by impact torque while capturing a non-cooperative spacecraft on orbit, compliance control of space robots during the capturing process needs further study.
For series elastic actuators (SEAs) in ground robots, Gu et al. [30] presented a modularized series elastic actuator intended to improve the compliance of a robotic arm. Calanca and Fiorini [31] refined and improved the stability analysis of the environment-adaptive force control of SEAs. Wang et al. [32] presented a practical control approach for series elastic actuators that works well even in the presence of unknown payload parameters and external disturbances. Because SEA devices play a key role in protecting a ground robot's joints from impact damage when the robot collides with its environment, this paper designs a rotary series elastic actuator (RSEA) device suitable for space robots, together with an active control strategy that switches the joint actuators on and off in a timely manner to achieve buffer compliance control. The buffer spring inside the RSEA also introduces joint flexibility. Since the system obeys the laws of conservation of linear and angular momentum, its orbital dynamics and base attitude are coupled: motion of the links produces reactions on the base and, consequently, a variation of the end-effector position. In addition, linear momentum, angular momentum and energy are transferred between the space robot and the spacecraft from the pre-contact phase to the post-capture phase. Furthermore, because of the high velocity and rotation of the non-cooperative target spacecraft, the dynamic parameters of the post-capture hybrid system are difficult to obtain accurately. These complications make the dynamic modeling and control of the on-orbit capturing process of space robots equipped with RSEA devices very challenging.
To address the aforementioned difficulties, this work investigates the dynamic modeling, buffer compliance control and vibration suppression of a space robot capturing a non-cooperative spacecraft. First, dynamic models of the space robot and the target spacecraft before capture are obtained using the Lagrange approach and the Newton-Euler method. Second, based on singular perturbation theory, the post-capture hybrid system is decomposed into two subsystems: a slow rigid-motion subsystem and a fast flexible-joint subsystem. For the fast subsystem, a velocity difference feedback controller is used to actively suppress the elastic vibration caused by joint flexibility. For the slow subsystem, a buffer compliance control scheme based on reinforcement learning (RL) is proposed. The proposed RL scheme consists of two modules: an associative search network (ASN) and an adaptive critic network (ACN). The ASN approximates the unknown nonlinear terms of the hybrid system, and the ACN adopts an online learning method. The RL learning strategy obtains the original error evaluation signal through the performance evaluation unit, and this signal is combined with the ACN to generate the reinforcement signal. The reinforcement signal is then used as the learning rule in the weight adaptive laws of the ASN and ACN, which adjusts and optimizes the control strategy in real time. Regarding reinforcement learning strategies, Liu et al. [33] obtained the system dynamics model of a space robot by reinforcement learning and, by comparison with the traditional PD control method, demonstrated the self-learning ability of the reinforcement learning strategy. Sands [34] proposed deterministic artificial intelligence that can be applied to both unmanned underwater vehicles and space robotics. Tang and Liu [35] studied the control and stability issues of trajectory tracking for an n-link rigid robot manipulator, and obtained an optimal control signal by a reinforcement learning strategy. Cui et al. [36] proposed a reinforcement learning strategy for the trajectory tracking problem of a fully actuated autonomous underwater vehicle with external disturbances, control input nonlinearities and model uncertainties. On this basis, the proposed control scheme absorbs the impact energy generated in the collision through the stretching and compression of the built-in spring in the capture phase. In the stable control phase, the control strategy based on reinforcement learning actively turns the joint actuators on and off to ensure that they are not overloaded and damaged. In addition, the reinforcement learning strategy does not need a precise dynamic model of the hybrid system and can effectively improve the intelligence and reliability of the on-orbit capture operation of the space robot. Numerical simulation shows that the proposed control scheme not only effectively absorbs the impact energy generated by the on-orbit capture, but also switches the joint actuators off and on in a timely way when the impact energy is too large, which avoids overload and damage to the joint actuators.
The paper is organized as follows. In Section 2, the compliant mechanism and the buffer compliance strategy are introduced. In Section 3, the dynamic model of the space robot capturing a non-cooperative target spacecraft is established, and the impact effect during the capturing operation phase is discussed. In Section 4, a reinforcement learning control algorithm combined with the compliant mechanism is proposed to achieve buffer compliance control, and its stability is verified using a suitable Lyapunov function. In Section 5, numerical simulations are carried out to validate the proposed buffer compliance control strategy. Finally, the conclusions are given in Section 6.

Buffer Compliance Strategy
The RSEA consists of five modules: the input disk, the sweeping arm, the support axis, the springs, and the block. The RSEA device is installed between the actuators and the manipulator and is connected to the actuators through its input disk. The block is rigidly connected to the input disk. The hollow shaft of the sweeping arm is connected through a bearing to the support axis, which is fixed on the input disk. When the motor rotates, it drives the input disk; the block compresses the spring, and the spring transfers the force to the sweeping arm. The hollow shaft of the sweeping arm is directly connected to the manipulator, which completes the smooth transfer of motion and force. The general structure of the space manipulator is shown in Figure 1, and the structure of the designed RSEA device is shown in Figure 2. In Figure 2, R is the effective radius of the sweeping arm and r is the radius of the spring.
In the capture phase, the end-effector of the manipulator contacts and collides with the spacecraft, and the joints of the manipulator are subjected to a large impact torque. The impact torque first acts on the output sweeping arm of the RSEA device and is then transferred to the spring group. The impact energy generated by the collision is stored in the springs through their deformation, which protects the joint. In the stable control phase, the joints are still affected by the impact torque caused by the capture collision. If the torque exceeds the limit that the joint actuators can withstand and the actuators are not turned off, they will be damaged. Therefore, a shutdown torque threshold must be set according to the torque limit that the joint can withstand. When the impact torque on the joints is detected to exceed the shutdown torque threshold, all actuators turn off. At this time, the internal spring assembly of the RSEA device provides an elastic force that reduces the impact torque on the joints. In practical operation, if only the shutdown torque threshold were set, the actuators would be switched on and off frequently, which would degrade actuator performance. The proposed control strategy therefore also sets a startup torque threshold: when the joint torque exceeds the shutdown torque threshold, the actuators turn off, and when the joint torque falls below the startup torque threshold, the actuators turn on again.
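The two-threshold switching rule described above is a hysteresis switch. The following is a minimal sketch of that logic, not the authors' implementation. The class name `ActuatorSwitch`, the use of the torque magnitude, and the torque samples are assumptions made for illustration; the 60 N·m and 6 N·m thresholds are the values quoted in the simulation section.

```python
from dataclasses import dataclass


@dataclass
class ActuatorSwitch:
    """Two-threshold (hysteresis) on/off logic for the joint actuators."""

    shutdown_threshold: float  # N·m, actuators turn off above this torque
    startup_threshold: float   # N·m, actuators turn back on below this torque
    enabled: bool = True

    def update(self, joint_torque: float) -> bool:
        """Update and return the on/off state for a measured joint torque."""
        magnitude = abs(joint_torque)  # assumption: thresholds apply to |torque|
        if self.enabled and magnitude > self.shutdown_threshold:
            self.enabled = False
        elif not self.enabled and magnitude < self.startup_threshold:
            self.enabled = True
        return self.enabled


# Thresholds from the simulation section; the torque samples are hypothetical.
switch = ActuatorSwitch(shutdown_threshold=60.0, startup_threshold=6.0)
torques = [20.0, 65.0, 40.0, 10.0, 5.0, 30.0]
states = [switch.update(t) for t in torques]
print(states)  # [True, False, False, False, True, True]
```

Between the two thresholds the state is held, so a torque that hovers around 60 N·m does not toggle the actuators on every sample.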

Dynamics Modeling and Impact Effect Analysis
The structure of the space robot with RSEA and the target spacecraft system is shown in Figure 3. The system consists of a rigid base $B_0$, rigid links $B_i$ ($i = 1, 2$), and a rigid target spacecraft $B_3$. We establish the inertial coordinate system $XOY$ and the local coordinate system $x_i O_i y_i$ ($i = 0, 1, 2$) of each component $B_i$; $O_0$ is the rotation center of the base and $O_i$ is the rotation center of $B_i$ ($i = 1, 2$). $m_0$ is the mass of the base, $m_s$ is the mass of the non-cooperative spacecraft, and $m_i$ is the mass of $B_i$ ($i = 1, 2$). $I_0$ is the moment of inertia of the base about its mass center, $I_s$ is the moment of inertia of the non-cooperative spacecraft about its mass center, and $I_i$ ($i = 1, 2$) is the moment of inertia of $B_i$ about its mass center. $l_0$ is the distance from $O_0$ to $O_1$, and $l_i$ ($i = 1, 2$) is the length of $B_i$ along the $x_i$ axis. $d_i$ ($i = 1, 2$) is the distance from the mass center of $B_i$ to $O_i$. $I_{im}$ ($i = 1, 2$) is the moment of inertia of the $i$-th actuator, and $k_{im}$ ($i = 1, 2$) is the spring stiffness of the RSEA device. $r_c$ is the position vector of the mass center of the entire system in the inertial coordinate system $XOY$, and $r_i$ ($i = 1, 2$) is the position vector of the mass center of $B_i$ in $XOY$. Regarding the target spacecraft as a homogeneous rigid body, its dynamic equation can be obtained by the Newton-Euler method: where $q_s = [x_s, y_s, \theta_s]^T$ are the generalized coordinates of the target spacecraft; $x_s$, $y_s$ are the position coordinates of the mass center of $B_3$, and $\theta_s$ is the attitude angle of the spacecraft. $D(q) \in R^{3\times3}$ is the positive definite inertia matrix, $J_s \in R^{3\times3}$ is the motion Jacobian matrix corresponding to the impact contact point, and $F \in R^{3\times1}$ is the force acting on the spacecraft.
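As a small numerical illustration of the target model, the sketch below evaluates the generalized momentum and kinetic energy that the target carries into the capture. It assumes the standard planar rigid-body inertia matrix diag($m_s$, $m_s$, $I_s$) for coordinates taken at the mass center, and uses the mass, inertia and pre-capture velocity quoted in the simulation sections. The function names are hypothetical.

```python
def planar_inertia_matrix(mass, inertia):
    """Inertia matrix of a planar rigid body for coordinates [x, y, theta]
    measured at its mass center."""
    return [[mass, 0.0, 0.0],
            [0.0, mass, 0.0],
            [0.0, 0.0, inertia]]


def generalized_momentum(inertia_matrix, velocity):
    """Matrix-vector product D * q_dot."""
    return [sum(d * v for d, v in zip(row, velocity)) for row in inertia_matrix]


def kinetic_energy(inertia_matrix, velocity):
    """Quadratic form 0.5 * q_dot^T * D * q_dot."""
    momentum = generalized_momentum(inertia_matrix, velocity)
    return 0.5 * sum(p * v for p, v in zip(momentum, velocity))


# m_s = 30 kg, I_s = 15 kg·m^2, v_t = [0.45 m/s, 0.45 m/s, 0.5 rad/s]
D_s = planar_inertia_matrix(mass=30.0, inertia=15.0)
v_t = [0.45, 0.45, 0.5]
momentum = generalized_momentum(D_s, v_t)  # about [13.5, 13.5, 7.5]
energy = kinetic_energy(D_s, v_t)          # about 7.95 J
```

This momentum is what the conservation argument redistributes over the hybrid system at capture.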
According to the position vector relation in Figure 3, the position vectors of the mass centers of $B_i$ ($i = 0, 1, 2$) in the pre-contact phase are: where $x_a$, $y_a$ are the position coordinates of the mass center of the base $B_0$, and $e_i$ ($i = 0, 1, 2$) is the unit vector along the $x_i$ axis in the $x_i O_i y_i$ frame.
Differentiating Equation (2) with respect to time $t$, the total kinetic energy of the space robot with RSEA is: where $\omega_i$ ($i = 0, 1, 2$) is the angular velocity about the rotation center $O_i$, and $\omega_{jm}$ ($j = 1, 2$) is the angular velocity of the $j$-th actuator.
Neglecting micro-gravity in space, the potential energy of the system comes only from the RSEA device, so the total potential energy of the system is: where the spring deformation is that of the spring on the block of the RSEA device, and $\alpha_i$ is the angular difference between the sweeping arm and the input disk. Based on Equations (3) and (4), and combining them with the Lagrange equations, the dynamic equations of the space robot in the pre-capture phase are as follows: where $q = [x_a, y_a, \theta_0, \theta_1, \theta_2]^T$ are the generalized coordinates of the system, $\theta_0$ is the attitude angle of the base, $\theta_i$ ($i = 1, 2$) is the angular displacement of the $i$-th link, and $\theta_{im}$ ($i = 1, 2$) is the angular displacement of the $i$-th actuator. $D(q) \in R^{5\times5}$ is the positive definite inertia matrix, and $C(q, \dot{q})\dot{q} \in R^{5\times1}$ is the Coriolis/centrifugal term. $\tau_a = [0, 0]^T$ is the position control torque of the base, and $\tau_0$ is the attitude control torque of the base. $\tau_m = [\tau_{1m}, \tau_{2m}]^T$ is the joint torque delivered by the actuators. $I_m = \mathrm{diag}(I_{1m}, I_{2m})$, and $K = \mathrm{diag}(k_1, k_2)$ is the equivalent stiffness of the joints, whose calculation formula is given in Equation (46). $J \in R^{3\times5}$ is the motion Jacobian matrix corresponding to the end-effector impact contact point, and $F \in R^{3\times1}$ is the force acting on the end-effector. In the capturing operation phase, the space robot contacts and collides with the target spacecraft, and the interaction forces at the contact point satisfy: Based on Equation (6), and combining it with Equations (1) and (5), we obtain: The actuators are turned off during the capture phase, that is, $\tau_c = 0_{5\times1}$. Integrating Equation (7) over the momentary period of collision [13]: The space robot and the spacecraft satisfy the velocity constraint in the post-capture phase. Based on this, the following generalized velocity of the post-capture hybrid system can be obtained: where $N = D(q)$. Integrating the first item of Equation (5), we have: where $P = \int_{t_0}^{t_0+\Delta t} F\,dt$ is the impact impulse during the capture phase. Invoking Equations (9) and (10), we obtain: where $(J^T)^{+}$ is the Moore-Penrose pseudo-inverse of $J^T$. The period of contact is transient, $\Delta t \to 0$, so the collision force can be approximated as: After the space robot captures the target spacecraft, a hybrid system is formed. Considering the velocity constraint between the arm and the target, we obtain: Differentiating Equation (13), we have: Invoking Equations (1), (5) and (14), we obtain: To facilitate the design of the subsequent control strategies, the first item of Equation (15) of the hybrid system can be expressed in block-matrix form as follows, which gives the fully controllable form of the dynamic model: where $C_{A11}$ and $C_{A21}$, submatrices of $C_A$, are zero matrices. Equation (16) can be decomposed into: From Equation (17), we have: Invoking Equations (18) and (19), we obtain: where

Fast Subsystem and the Corresponding Controller
In order to actively suppress the flexible vibration of the joints caused by the RSEA device, the post-capture hybrid system is decomposed, based on singular perturbation theory, into two subsystems: a slow rigid-motion subsystem and a fast flexible-joint subsystem. The overall controller consists of a slow sub-controller and a fast flexible-joint sub-controller: where $\tau_f \in R^{2\times1}$ is the fast flexible-joint sub-controller and $\tau_s \in R^{2\times1}$ is the slow sub-controller. Define a positive scale factor $\varepsilon$ and a positive definite diagonal matrix $K_1$ that satisfy: Invoking Equation (22), the flexible-joint fast subsystem is: In order to suppress the elastic vibration of the system, the velocity difference feedback controller of Equation (24) is designed to control the fast subsystem. Substituting Equations (21) and (24) into Equation (23), we have: It can be shown that as $\varepsilon \to 0$, the equivalent joint stiffness $K \to \infty$, and the hybrid system becomes equivalent to a rigid model. The dynamic equation of the slow subsystem can then be obtained from the first item of Equation (20) and Equation (21): where $D_{x\theta} = D_x + I_x$, $I_x = \mathrm{diag}(0, I_{1m}, I_{2m})$, and $C_{x\theta}$ is the corresponding matrix of $C_x$ when
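To illustrate why feeding back the motor/link velocity difference damps the joint's elastic vibration, the sketch below simulates a single flexible joint (motor inertia, torsional spring, link inertia). It is a simplified stand-in for the fast subsystem, not the paper's Equation (24). The stiffness, feedback gain, initial wind-up and integration settings are all hypothetical; the two inertias are the $I_1$ and $I_{1m}$ values from the simulation section.

```python
def simulate_joint_deflection(feedback_gain, duration=2.0, dt=1e-4):
    """Integrate one flexible joint and return the largest spring deflection
    (rad) observed during the last quarter of the run."""
    link_inertia = 3.0     # kg·m^2 (I_1 in the simulation section)
    motor_inertia = 0.05   # kg·m^2 (I_1m in the simulation section)
    stiffness = 100.0      # N·m/rad, hypothetical equivalent joint stiffness

    theta_link, omega_link = 0.0, 0.0
    theta_motor, omega_motor = 0.05, 0.0  # hypothetical initial spring wind-up

    steps = int(duration / dt)
    tail_start = int(0.75 * steps)
    tail_peak = 0.0
    for step in range(steps):
        spring_torque = stiffness * (theta_motor - theta_link)
        # Velocity difference feedback: oppose the relative motor/link speed.
        control_torque = -feedback_gain * (omega_motor - omega_link)
        # Semi-implicit Euler: update velocities first, then positions.
        omega_motor += dt * (control_torque - spring_torque) / motor_inertia
        omega_link += dt * spring_torque / link_inertia
        theta_motor += dt * omega_motor
        theta_link += dt * omega_link
        if step >= tail_start:
            tail_peak = max(tail_peak, abs(theta_motor - theta_link))
    return tail_peak


undamped_peak = simulate_joint_deflection(feedback_gain=0.0)
damped_peak = simulate_joint_deflection(feedback_gain=1.0)
```

With the feedback off, the spring deflection keeps oscillating at roughly its initial amplitude; with it on, the deflection decays by orders of magnitude within the run. The paper's result for the full hybrid system is stronger: without the fast controller, the vibration grows and the system diverges.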

Slow Subsystem and the Corresponding Controller
The buffer compliance control based on reinforcement learning is shown in Figure 4, where the ASN is used to approximate the unknown nonlinear term of the system and the ACN is used to construct the reinforcement signal that optimizes the ASN.
Define the trajectory tracking error as: where $q_{\theta d} \in R^{3\times1}$ is the desired trajectory of the hybrid system. The error evaluation signal is defined as: where $\Lambda \in R^{3\times3}$ is a positive definite diagonal matrix. Invoking Equations (27) and (28), the dynamic equation of the slow subsystem can be written as: where $d = D_{x\theta}(\ddot{q}_{\theta d} + \Lambda\dot{e}) + C_{x\theta}(\dot{q}_{\theta d} + \Lambda e)$ is the unknown nonlinear term of the system. Since $d$ cannot be obtained directly, it is approximated by the ASN: where $W_a \in R^{n\times3}$ is the ideal weight matrix of a radial basis function neural network (RBFNN) and $\varsigma(x)$ is the optimal approximation error. The radial basis kernel functions $\Phi(x) = [\Phi_1, \Phi_2, \cdots, \Phi_n]^T$ are Gaussian radial basis functions (GRBFs): where $c$ and $\sigma$ are the centre vector and the variance of the GRBF, respectively. On this basis, the control law of the slow rigid-motion subsystem is given as: where $\hat{W}_a$ is the estimate of the ideal weight $W_a$. The estimation error is defined as $\tilde{W}_a = W_a - \hat{W}_a$, which satisfies $\dot{\tilde{W}}_a = -\dot{\hat{W}}_a$. $K_z \in R^{3\times3}$ is a positive definite diagonal matrix. $\tau_a$ is a robust control term, defined as: where $K_a \in R^{3\times3}$ is a positive definite diagonal matrix.
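The ASN output has the form $\hat{W}_a^T \Phi(x)$, with $\Phi$ a vector of Gaussian radial basis functions. The sketch below evaluates such a network for a toy case. Because the exact Gaussian expression is not reproduced in this text, the common form $\Phi_i(x) = \exp(-\|x - c_i\|^2 / \sigma_i^2)$ is assumed. The function names and the centres, widths and weights are hypothetical illustration values, not trained weights.

```python
import math


def gaussian_rbf_features(x, centres, widths):
    """Return [Phi_1(x), ..., Phi_n(x)], assuming
    Phi_i(x) = exp(-||x - c_i||^2 / sigma_i^2)."""
    features = []
    for centre, width in zip(centres, widths):
        squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        features.append(math.exp(-squared_distance / width ** 2))
    return features


def network_output(weights, features):
    """Compute W^T Phi for an n-by-m weight matrix stored as a list of rows."""
    n_outputs = len(weights[0])
    return [sum(weights[i][j] * features[i] for i in range(len(features)))
            for j in range(n_outputs)]


# Hypothetical two-node network with a 2-D input and three outputs.
centres = [[0.0, 0.0], [1.0, 1.0]]
widths = [1.0, 1.0]
weights = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 2.0]]

phi = gaussian_rbf_features([0.0, 0.0], centres, widths)
d_hat = network_output(weights, phi)
```

At an input equal to the first centre, the first basis function evaluates to 1 and the second to $e^{-2}$, so each output is a weighted blend dominated by the nearest node.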
Substituting Equations (32) and (33) into Equation (29), we have: In order to optimize the ASN, the reinforcement learning signal is defined by the ACN: where $\hat{W}_c \in R^{m\times3}$ is the estimate of the ideal weight $W_c$.

Assumption 1.
The ideal weights $W_a$, $W_c$ are bounded and satisfy: where $W_{aM}$ and $W_{cM}$ are unknown positive constants.

Assumption 2.
The optimal approximation error $\varsigma(x)$ is bounded and satisfies: where $\varsigma_M$ is an unknown positive constant.
Next, the weight adaptive laws of the neural networks are designed as Equations (36) and (37).

Theorem 1.
With the weight adaptive laws (36) and (37), the control law (32) based on the reinforcement learning signal (35) ensures that the trajectory tracking error $e$ converges to zero asymptotically.
Proof of Theorem 1. Introducing the Lyapunov function: Differentiating Equation (38), we have: Substituting Equations (34)-(37) into Equation (39) yields: Combining Assumption 1, we have: and, combining Assumption 2, Equation (40) is rewritten as: Since the regression vector $\Phi$ is bounded, we can set $\|\Phi\|^2 \le \bar{\Phi}$. Let $K_{zm}$ be the minimum eigenvalue of $K_z$. Then Equation (41) satisfies: To ensure $\dot{V} \le 0$, only one of the following conditions needs to hold: Based on the above analysis and the Lyapunov stability theorem, the whole closed-loop system is stable and the trajectory tracking error $e$ converges to zero asymptotically. The proof is thus completed.

Impact Resistance Performance Simulation in the Capture Phase
To show the performance of the proposed controller, simulations are carried out on the planar space robot with RSEA and the target spacecraft system shown in Figure 3. The actual parameters of the system are as follows: $m_0 = 80$ kg, $m_1 = 5$ kg, $m_2 = 5$ kg, $m_s = 30$ kg; $I_0 = 30$ kg·m², $I_1 = 3$ kg·m², $I_2 = 3$ kg·m², $I_s = 15$ kg·m², $I_{1m} = 0.05$ kg·m², $I_{2m} = 0.05$ kg·m²; $k_{1m} = k_{2m} = 1000$ N/m; $l_0 = 1$ m. The equivalent stiffness of the joints [30] is as follows: where $K_m = \mathrm{diag}(k_{1m}, k_{2m})$, $R = 0.1$ m, $r = 0.01$ m. $\phi$ is the angle of the sweeping arm when the force $F = [20\ \mathrm{N·m}, 20\ \mathrm{N·m}, 0]^T$ acts on the end of the space manipulator; here $\phi = \mathrm{diag}(3°, 2°)$ is selected.
To verify the impact resistance performance in the capture phase, the space robot system with and without the RSEA device was used in capture simulation tests on spacecraft with different velocities. The simulation results are shown in Table 1. In Table 1, the first column gives the velocity of the spacecraft: the first two entries are linear velocities and the third is the angular velocity. In the second and third columns, the first and second entries are the joint impact torques without and with the RSEA device, respectively. The fourth column gives the maximum percentage reduction in joint impact torque obtained with the RSEA device. As can be seen from Table 1, for the capture of spacecraft at different initial velocities, the RSEA device effectively reduces the impact torque acting on the joints and thereby protects them.
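The percentage reduction reported in the fourth column can be computed as sketched below, assuming it is defined relative to the impact torque without the RSEA device. The function name and the torque values in the example are hypothetical (they are not entries of Table 1); the example values were chosen only so that the result matches the reported 76.6% maximum.

```python
def torque_reduction_percent(torque_without_rsea, torque_with_rsea):
    """Percentage reduction of the joint impact torque due to the RSEA,
    relative to the torque measured without the device."""
    if torque_without_rsea <= 0.0:
        raise ValueError("reference torque must be positive")
    difference = torque_without_rsea - torque_with_rsea
    return 100.0 * difference / torque_without_rsea


# Hypothetical torques in N·m, for illustration only.
reduction = torque_reduction_percent(100.0, 23.4)  # about 76.6
```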

Buffer Compliance Control Performance Simulation in Stable Control Phase
To show the buffer compliance control performance of the proposed controller, simulations are carried out for the stable control phase. The parameters are as follows: $K_2 = \mathrm{diag}(5, 5)$, $\Lambda = \mathrm{diag}(5, 5, 5)$, $K_z = \mathrm{diag}(400, 400, 400)$, $\varepsilon = 0.5$, $K_a = \mathrm{diag}(20, 20, 20)$, $K_b = \mathrm{diag}(50, 50, 50)$, $\eta = 1$, $K_c = \mathrm{diag}(10, 10, 10)$. In the pre-impact phase, $q_\theta = [90°, 45°, 45°]^T$, and the space robot is assumed to capture the non-cooperative spacecraft at $t_0 = 0$ s. At this time, the velocity of the spacecraft is $v_t = [0.45\ \mathrm{m/s}, 0.45\ \mathrm{m/s}, 0.5\ \mathrm{rad/s}]^T$, and the desired trajectory of the post-capture hybrid system is $q_{\theta d} = [100°, 30°, 60°]^T$. Assume that the maximum impact torque the joint actuators can bear while running is 90 N·m. To protect the joint actuators, the buffer compliance control strategy of actively switching the actuators off and on (named the switching strategy) is adopted. The shutdown torque threshold is 60 N·m, and the startup torque threshold is 6 N·m. The simulation results are shown in the following figures. Figure 5 shows the impact torque acting on the joints when the switching strategy is not adopted; the impact torque still exceeds the safety threshold of the joint in this case. Figure 6 shows the impact torque acting on the joints when the switching strategy is adopted. Comparing Figures 5 and 6 shows that the buffer compliance control limits the impact torque acting on the joints to a safe range, which protects the joint motors during the stable control phase. Figure 7 shows the evaluation factor signal. The ACN is optimized through interaction with the environment, from which the reward signal is obtained, and finally reaches a stable state.
To show the effectiveness of the defined reinforcement signal, the tracking accuracy is analyzed quantitatively by comparing the trajectory tracking errors of three schemes: the proposed RL control scheme, RL with the robust controller turned off, and the neural network control strategy without the reinforcement signal (RL turned off). The mean absolute error $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e_i|$ was used to evaluate the tracking accuracy, and the simulation results are shown in Table 2. Table 2 shows that the mean absolute error of the proposed RL scheme is smaller than that of the other control strategies, which indicates that the proposed control method has high tracking accuracy and good tracking performance.
Figures 8-10 show the stabilization of the hybrid system when the proposed buffer compliance controller is adopted. The solid line is the trajectory tracking curve of the system when the control algorithm based on reinforcement learning is adopted, the dotted line is the trajectory tracking curve when the robust term $\tau_a$ is turned off, and the double line is the trajectory tracking curve when RL is turned off. Comparing them shows that the unstable hybrid system finally reaches the stable, expected state, and that the proposed RL control scheme has a faster convergence speed and higher tracking accuracy.
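The MAE metric used for Table 2 can be computed as in the short sketch below. The function name and the sample errors are hypothetical and serve only to illustrate the formula; they are not the paper's simulation data.

```python
from typing import Sequence


def mean_absolute_error(errors: Sequence[float]) -> float:
    """MAE = (1/n) * sum(|e_i|) over n sampled tracking errors."""
    if len(errors) == 0:
        raise ValueError("at least one error sample is required")
    return sum(abs(e) for e in errors) / len(errors)


# Made-up tracking error samples for illustration only.
sample_errors = [1.0, -2.0, 3.0]
print(mean_absolute_error(sample_errors))
```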
If the fast subsystem controller is turned off, the trajectory tracking curves shown in Figures 11-13 are obtained. Comparing Figures 8-10 with Figures 11-13 shows that, with the fast subsystem controller turned off, the elastic vibration of the unstable hybrid system keeps increasing and eventually causes the system to diverge. Therefore, the proposed velocity difference feedback controller can actively suppress the elastic vibration of the system joints and thereby achieve stable trajectory tracking.

Conclusions
In this paper, a space robot with an RSEA device is designed to protect the robot joints from impact torque during the satellite capture process. The timely opening and closing of the joint actuators was proposed to achieve buffer compliance control. The dynamic model of the post-capture hybrid system was derived from the Lagrange equations, the law of conservation of momentum, and the constraints of kinematics and velocity. Then, based on singular perturbation theory, the hybrid system was decomposed into a slow subsystem and a fast subsystem. A buffer compliance control based on a reinforcement learning algorithm was applied to control the slow subsystem with an unknown nonlinear disturbance term. The fast subsystem controller was designed as a velocity difference feedback controller. The simulation results show that the proposed strategy can reduce the impact torque by 76.6% at the maximum and 58.7% at the minimum during the capture phase, which reflects good anti-impact performance. In the stable control phase, the impact torque acting on the joints is limited within the safety threshold, which avoids overload and damage of the joint actuators. In addition, the proposed reinforcement learning strategy has strong online adaptability and autonomous learning ability under complex conditions and can be continuously optimized through real-time interaction with the complex space environment, which ensures the accuracy and stability of the system stabilization motion.
Note that this paper only considers the case in which the space manipulators mounted on the spacecraft are rigid. In future research, the buffer compliance control problem of a space robot with flexible links capturing a non-cooperative spacecraft will be studied, and the control scheme will be extended to practical applications.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The authors declare no conflict of interest.