A Pre-Trained Fuzzy Reinforcement Learning Method for the Pursuing Satellite in a One-to-One Game in Space

In order to help the pursuer find an advantageous control policy in a one-to-one game in space, this paper proposes an innovative pre-trained fuzzy reinforcement learning algorithm, which is conducted separately in the x, y, and z channels. Compared with previous algorithms applied in ground games, this is the first time reinforcement learning has been introduced to help a pursuer in space optimize its control policy. The known part of the environment is utilized to pre-train the pursuer's consequent set before learning. An actor-critic framework is built in each moving channel of the pursuer, and the consequent set of the pursuer is updated through the gradient descent method in fuzzy inference systems. Numerical experimental results validate the effectiveness of the proposed algorithm in improving the game ability of the pursuer.


Introduction
Tracking space targets is beneficial for orbital debris removal, recovery of important components, and early warning of space threats [1]. However, with the continuous development of space technology, targets in space are no longer only non-maneuverable; many can maneuver. Tracking a maneuverable target remains a challenging problem because the target is non-cooperative and the environment is usually partially unknown.
In order to track a non-cooperative target in space, one can apply control theory to design a control law. By establishing an attitude-position coupling model, an adaptive control law that accounts for unknown mass and inertia was proposed [2]. Considering system static errors and disturbances, further adaptive control laws were designed [3,4]. As the research developed, many mature control methods were introduced to the field of tracking targets in space [5][6][7]. In addition, a back-stepping adaptive control law that handles a variety of model uncertainties as well as input constraints, and an optimal inverse controller that handles external disturbances, were attempted in [8,9], respectively. It was proven that the closed-loop system remained stable in the presence of external disturbances and uncertain parameters. However, these control laws are essentially designed for targets that cannot maneuver.
For tracking a target that can move, the problem can be described as a pursuit-evasion problem, also known as the space differential game. The differential game, introduced in [10,11], is usually applied to continuous systems. To find a superior strategy for the pursuer in aircraft combat, scholars proposed the proportional navigation method [12]. In addition,

Dynamics of the Space Differential Game
To describe the space differential game, the following coordinate systems are established: (a) the Earth-centered inertial frame (OXYZ); (b) the orbital coordinate system of the spacecraft (Ox_o y_o z_o); (c) the orbital coordinate system of the virtual host spacecraft (Ox_r y_r z_r).
In this game, there are one pursuer and one evader: the pursuer P aims to track the evader E, and the evader E aims to escape from the pursuer P. The position relationship among the pursuer, the evader, and the virtual host point o is drawn in Figure 1. The virtual host point o is located near the two satellites. The pursuer and the evader can be abstracted as agents capable of interacting with the environment. This paper focuses on the control strategy of the pursuer, aiming to give it an advantage in the game. The pursuer is expected to update its control policy according to its interaction with the environment through reinforcement learning. Therefore, for the simulated experiments in this paper, it is necessary to build an environment that includes the dynamics of the agents.
This pursuit-evasion game is supposed to occur in the neighborhood of a near-circular reference orbit, and an external disturbance force may act on the pursuer and the evader. Denote the position of satellite P as x_P = [x_P, y_P, z_P]^T and the position of satellite E as x_E = [x_E, y_E, z_E]^T. The dynamics of the pursuer P is then expressed as Equation (1) [30], where µ represents the Earth's gravitational constant, ω(t) represents the instantaneous angular velocity of the reference orbit, and r(t) represents the instantaneous radius of the orbit. The dynamics of the evader E is expressed as Equation (2).
where u_i^j (j = x, y, z) represents the control force in the corresponding channel and T_i (i = P, E) represents the maximum thrust per unit mass of the satellite. Note that the external disturbance force is only added to the pursuer, because we always consider the relative states between the pursuer and the evader.
Through Equations (1) and (2), the environment for the learning algorithm is built. It is regarded as the real environment, as distinguished from the estimated environment referred to in Section 4.2.
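Since Equations (1) and (2) are not reproduced here, the real environment can be sketched with the standard Clohessy-Wiltshire relative-motion model for a circular reference orbit. This is a hedged stand-in: the exact nonlinear form in [30] may differ, and the function names, the linearization, and the Euler integrator below are illustrative assumptions.

```python
import numpy as np

MU = 3.986004418e14          # Earth's gravitational constant, m^3/s^2
R_REF = 6.9e6                # reference orbit radius, m (from the Simulation section)
N = np.sqrt(MU / R_REF**3)   # mean motion of the circular reference orbit, rad/s

def cw_accel(state, thrust, disturbance=np.zeros(3), n=N):
    """Clohessy-Wiltshire relative dynamics for one satellite.

    state = [x, y, z, vx, vy, vz] relative to the virtual host point;
    thrust and disturbance are per-unit-mass accelerations (m/s^2).
    Returns the state derivative.
    """
    x, y, z, vx, vy, vz = state
    ax = 3 * n**2 * x + 2 * n * vy + thrust[0] + disturbance[0]
    ay = -2 * n * vx + thrust[1] + disturbance[1]
    az = -n**2 * z + thrust[2] + disturbance[2]
    return np.array([vx, vy, vz, ax, ay, az])

def step(state, thrust, dt=1.0, disturbance=np.zeros(3)):
    """One explicit Euler step of the relative dynamics (simulation sketch)."""
    return state + dt * cw_accel(state, thrust, disturbance)
```

With one such model per satellite (disturbance applied to the pursuer only, as the text notes), the relative state s = x_E − x_P drives the learning algorithm.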

Reinforcement Learning in Continuous Systems
To avoid the curse of dimensionality, a generalization technique is required. In addition, the satellite-motion problem requires the inputs of the learning system to have a clear physical meaning. Therefore, the zero-order Takagi-Sugeno (T-S) fuzzy system, which provides more interpretable inference rules than a neural network, is employed as the approximator. On this basis, a fuzzy actor-critic learning framework is built, and the consequent parameters of the actor and the critic are updated through the gradient descent method.

The Fuzzy Inference System
The fuzzy inference rule of the employed Takagi-Sugeno (T-S) fuzzy system is expressed as below [31].

Rule l: IF s_1 is F_1^l, ..., and s_n is F_n^l, THEN z_l = φ_l.

If we assume that the fuzzy system has L rules and n input variables, and that each input has h membership functions, the output of the fuzzy system can be expressed as:

ŷ(s) = Σ_{l=1}^{L} Ψ_l(s) z_l,

where s_i (i = 1, ..., n) represents the ith input of the fuzzy system, F_i^l represents the fuzzy set of the ith input variable, z_l represents the output of the lth rule, φ_l represents the consequent parameter, s = [s_1, ..., s_n]^T represents the state vector, and µ_{F_i^l} represents the membership function of s_i under the lth rule. The expression of Ψ_l(s) is as follows:

Ψ_l(s) = Π_{i=1}^{n} µ_{F_i^l}(s_i) / Σ_{l=1}^{L} Π_{i=1}^{n} µ_{F_i^l}(s_i).
The membership functions applied here are triangular, as shown in Figure 2. With this shape, each input activates only two membership functions at a time, which saves computing cost as the number of membership functions grows.
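As a concrete sketch of zero-order T-S inference with triangular sets, the snippet below computes membership degrees and the fuzzy output for a two-input system. The fully overlapping triangular layout and the helper names are assumptions, not the paper's exact implementation, but they reproduce the property noted above: at most two sets fire per input.

```python
import numpy as np

def tri_memberships(x, centers):
    """Membership degrees of x in triangular sets peaking at `centers` (sorted).

    Neighbouring triangles overlap so the degrees sum to 1 and at most two
    sets are active for any input, as noted in the text.
    """
    c = np.asarray(centers, dtype=float)
    mu = np.zeros(len(c))
    x = float(np.clip(x, c[0], c[-1]))
    j = int(np.searchsorted(c, x))
    if j == 0:
        mu[0] = 1.0
    else:
        w = (x - c[j - 1]) / (c[j] - c[j - 1])
        mu[j - 1], mu[j] = 1.0 - w, w
    return mu

def ts_output(s, centers_per_input, phi):
    """Zero-order T-S output: normalized rule strengths Psi_l weight the
    consequent parameters phi (one matrix entry per rule)."""
    mu1 = tri_memberships(s[0], centers_per_input[0])
    mu2 = tri_memberships(s[1], centers_per_input[1])
    psi = np.outer(mu1, mu2)       # product inference over the two inputs
    psi = psi / psi.sum()          # normalization (already sums to 1 here)
    return float(np.sum(psi * phi)), psi
```

Because only two triangles fire per input, at most four of the L = h_1 × h_2 rules are active at once, which is the computational saving the text refers to.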

The Fuzzy Actor-Critic Learning Algorithm
In the actor-critic learning algorithm, the value function and the policy function are each approximated by a T-S fuzzy system. The critic is used to estimate the value function, while the actor is used to generate the action. To apply the actor-critic framework to a continuous system, we need two critic parts, which estimate the current value V̂_t(s_t) and the next value V̂_t(s_{t+1}), and one actor part, which generates the current control variable. In this way, the temporal difference can be expressed as:

∆_t = r_t + γ V̂_t(s_{t+1}) − V̂_t(s_t),

where γ is the discount factor.
Denote Ξ = (1/2)∆_t² as the squared temporal-difference error. Through gradient descent, the adaptive update rule of the parameters in the critic is expressed as:

φ_C ← φ_C − α ∂Ξ/∂φ_C, (7)

where φ_C represents the consequent parameter of the critic and α represents the learning rate of the critic.
In addition, we have ∂Ξ/∂φ_C = ∆_t ∂∆_t/∂φ_C, which can be combined with Equation (5); in this way, Equation (7) can be solved. Denoting the output of the actor as u_t, a random noise, σ, is added to u_t to explore better rewards; therefore, the real output is u_c = u_t + σ.
Further, the adaptive update rule of the parameters of the actor is expressed as Equation (10), where φ_A represents the consequent parameter of the actor and β represents the learning rate of the actor.
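The two updates above can be sketched as one function. With V̂(s) = Ψ(s)·φ_C for a zero-order T-S system, ∂V̂/∂φ_C = Ψ(s), so the gradient step on Ξ collapses to the simple rule below. The actor rule used here is the common fuzzy-actor-critic choice (TD error times exploration noise); the paper's exact Equation (10) is not reproduced and may differ, as may the assumed discount factor.

```python
import numpy as np

GAMMA = 0.95   # discount factor (assumed; not stated in this excerpt)
ALPHA = 0.01   # critic learning rate alpha, as used in the Simulation section
BETA = 0.001   # actor learning rate beta, as used in the Simulation section

def ac_update(phi_c, phi_a, psi_t, psi_t1, r_t, sigma):
    """One sketch step of the fuzzy actor-critic update for a single channel.

    psi_t, psi_t1: normalized firing-strength vectors at s_t and s_{t+1};
    sigma: the exploration noise actually applied at time t.
    """
    delta = r_t + GAMMA * (psi_t1 @ phi_c) - (psi_t @ phi_c)  # temporal difference
    phi_c = phi_c + ALPHA * delta * psi_t                     # critic: descend Xi = 0.5*delta^2
    phi_a = phi_a + BETA * delta * sigma * psi_t              # actor: reinforce useful noise
    return phi_c, phi_a, delta
```

Positive ∆_t means the noisy action did better than the critic predicted, so the actor is nudged in the direction of the noise; negative ∆_t nudges it away.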

Pre-Trained Fuzzy Reinforcement Learning for the Pursuing Satellite in a One-to-One Game in Space
The proposed algorithm is single-looped: the motion of the pursuing satellite P is divided into three channels, the x, y, and z channels. In each channel, there exist two inputs, the relative distance and the relative velocity of that channel. With the help of the genetic algorithm, the consequent set of the actor in each channel is initialized.

Fuzzy Reinforcement Learning Algorithm
Take the x channel as an example. The inputs are s_1 = x and s_2 = v_x; therefore, the inference rule is expressed as below, where ϕ_l represents the consequent parameter in the consequent set ϕ_x^P of the critic. In addition, the following relationship holds.
Similarly, the output of the actor is shown as below, where φ_l represents the consequent parameter in the consequent set φ_x^P of the actor. With a noise σ added for exploration, the final control variable is expressed as u_c = u_t + σ.
The designed reward function, r_t, is expressed as Equation (16), and the expressions of D_x(t), D_y(t), and D_z(t) follow it.
In Figure 3, the learning logic is illustrated. From this figure, it is seen that the learning framework is divided into x, y, and z channels, and each channel has two critic parts and one actor part.
The two critic parts are applied to estimate the value of the current time, V̂(t), and the value of the next time, V̂(t + 1). In the x channel, the combination of x and v_x is input into the critic part and the actor part to generate the estimated value V̂_x^P(s_t) and the control variable u_x^P, respectively. Combining u_x^P, u_y^P, and u_z^P, the control vector of the pursuing satellite, u_P, is generated. Under such a control policy, the pursuer interacts with the environment, which already contains the motion of the evader. Then, the next state s_{t+1} and the rewards for all the channels are obtained. Taking the x channel as an example, the temporal difference ∆_t can be calculated from r|_x^P, V̂_x^P(s_t), and V̂_x^P(s_{t+1}), and the consequent parameters of the critic part and the actor part can be adaptively tuned through Equations (7) and (10).
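The per-channel learning logic of Figure 3 can be sketched end to end. The snippet below substitutes a toy 1-D double integrator for the orbital x-channel dynamics and a one-hot state discretization for the fuzzy firing strengths Ψ(s); both, along with the FACL-style actor rule, the discount factor, and the quadratic reward, are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

GAMMA, ALPHA, BETA, SIGMA = 0.95, 0.01, 0.001, 0.1
rng = np.random.default_rng(0)

def features(x, v):
    """One-hot stand-in for the normalized rule strengths of the (x, v_x) system."""
    bx = int(np.clip(int(x + 5), 0, 9))
    bv = int(np.clip(int(v + 5), 0, 9))
    psi = np.zeros(100)
    psi[bx * 10 + bv] = 1.0
    return psi

phi_c, phi_a = np.zeros(100), np.zeros(100)      # critic / actor consequent sets
x, v = 3.0, 0.0                                  # relative distance / velocity
for t in range(2000):
    psi = features(x, v)
    noise = SIGMA * rng.standard_normal()
    u = psi @ phi_a + noise                      # actor output plus exploration noise
    x, v = x + 0.1 * v, v + 0.1 * u              # toy channel dynamics (Euler step)
    r = -(x**2 + 0.1 * v**2)                     # reward: penalize distance and speed
    psi1 = features(x, v)
    delta = r + GAMMA * (psi1 @ phi_c) - psi @ phi_c   # temporal difference
    phi_c += ALPHA * delta * psi                       # critic update, Eq. (7) analogue
    phi_a += BETA * delta * noise * psi                # actor update, Eq. (10) analogue
```

In the paper this loop runs once per channel, with the CW-type environment of Section 2 generating s_{t+1} for all three channels simultaneously.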

Pre-Training Process Based on the Genetic Algorithm
Denote φ_x^P, φ_y^P, and φ_z^P as the consequent sets of the actor parts in the x, y, and z channels of the pursuer, respectively. Each of these sets is structured as a two-dimensional matrix, where the number of rows equals the number of membership functions of the first input and the number of columns equals that of the second input. It is supposed that there are 13 membership functions for the relative distance and 7 membership functions for the relative velocity in each learning channel; therefore, these consequent sets are 13 × 7 matrices.
Conventionally, a reinforcement learning algorithm is conducted in a totally unknown environment, because the agent is expected to interact with the environment without any external help. However, based on human knowledge of orbital dynamics, one can build a mathematical model for the pursuer and the evader in space; thus, a part of the real environment is effectively known. Utilizing this known part to find the initial values of the consequent sets φ_x^P, φ_y^P, and φ_z^P is helpful for the learning. Training these consequent sets on the estimated environment is regarded as a pre-training process before the learning.
The known part is defined as an estimated environment, from which the estimated optimal strategy for the pursuer can be obtained. Denote x̂ = [x̂_P, x̂_E]^T as the state variable in the estimated environment, and denote the estimated ω as ω̂; the dynamics of the pursuer and the evader in the estimated environment can then be written in the corresponding state-space form. With the cost function shown in Equation (22), where i = x, y, z, the estimated optimal strategy for the pursuer is obtained. In this way, training pairs are generated, which can be used to train φ_x^P, φ_y^P, and φ_z^P.

To approximate the training pairs through the fuzzy inference system, the genetic algorithm (GA) is applied to conduct the pre-training process. Take the x channel as an example, and suppose that N pairs of training data can be obtained. The inputs for the GA in the x channel are x and v_x, which are fed into the fuzzy inference system. A "chromosome" is a consequent set composed of "genes", and the "genes" are the consequent parameters. The symbol M, which represents the fitness function during the pre-training, is calculated from the values u_tr of the training data and the values u_A obtained from the fuzzy inference system, where u_A is the output of the fuzzy inference system and u_tr(i) is the control value of the ith training pair. Sorted by the fitness error, the current chromosomes are updated by performing crossover and mutation on the genes. With the help of the GA technique [32], φ_x^P, φ_y^P, and φ_z^P are trained to better approximate the training data.
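The GA pre-training loop can be sketched as below. The chromosome layout (a flattened 13 × 7 consequent set), the sum-of-squared-errors fitness, the population size, and the averaging crossover with Gaussian mutation are assumptions standing in for the paper's exact fitness M and GA operators from [32].

```python
import numpy as np

rng = np.random.default_rng(0)

N_DIST, N_VEL = 13, 7        # membership-function counts from the text
POP, N_GEN = 40, 200         # GA sizes (assumed; not given in the excerpt)

def fitness(chrom, psi_rows, u_tr):
    """Squared error between the fuzzy outputs u_A and the training controls u_tr.

    psi_rows: (N, 91) matrix of normalized firing strengths, one row per
    training state; the paper's exact fitness M is not reproduced, so SSE
    is assumed here.
    """
    u_a = psi_rows @ chrom.ravel()
    return np.sum((u_a - u_tr) ** 2)

def ga_pretrain(psi_rows, u_tr):
    """Sketch of the GA pre-training: each chromosome is one 13x7 consequent
    set, and its genes are the individual consequent parameters."""
    pop = rng.normal(0.0, 1.0, size=(POP, N_DIST * N_VEL))
    for _ in range(N_GEN):
        errs = np.array([fitness(p.reshape(N_DIST, N_VEL), psi_rows, u_tr)
                         for p in pop])
        pop = pop[np.argsort(errs)]              # sort by fitness error
        elite = pop[:POP // 2]                   # elitism keeps the best half
        # crossover: average random parent pairs; mutation: small Gaussian noise
        parents = elite[rng.integers(0, len(elite), size=(POP - len(elite), 2))]
        children = parents.mean(axis=1) + rng.normal(
            0.0, 0.05, size=(POP - len(elite), N_DIST * N_VEL))
        pop = np.vstack([elite, children])
    return pop[0].reshape(N_DIST, N_VEL)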
It is noted that the proposed algorithm will make use of the estimated optimal strategy; therefore, the reward function shown in Equation (16) should be consistent with the cost function shown in Equation (22).

Simulation
A one-to-one space differential game was simulated. The scenario contained a pursuing satellite P and an evading satellite E. The reference orbit was a circular orbit with a radius of 6.9 × 10^3 km. In Table 1, the symbols x_P0 and x_E0 denote the initial states of the pursuer and the evader, respectively, where the first three items of each vector represent the position in m and the last three the velocity in m/s. In this scenario, it was supposed that there were deviations between the real environment and the estimated environment, with ω − ω̂ = 8 × 10^-4 rad/s. In addition, the real environment was supposed to contain the external disturbance d_t = [1.5 × 10^-5, 1.5 × 10^-3, 2.0 × 10^-3]^T m/s². With the learning rate of the critic α = 0.01, the learning rate of the actor β = 0.001, the random exploration noise σ = 0.1, T_P = 0.03 × 9.8 × 10^-3 m/s², and T_E = 0.01 × 9.8 × 10^-3 m/s², the proposed PTFRL was run. As the pursuer and the evader moved in the x, y, and z directions at the same time, the simulation results were drawn in the X-Y plane and the Y-Z plane, respectively. The whole learning process took 1560 iterations and 3496.98 s.

Figure 5a shows the trajectories of the pursuer and the evader after the pre-training process in the X-Y plane. In this figure, the evader follows its optimal strategy, and some tracking errors from the pursuer to the evader are visible because of the deviations between the estimated environment and the real environment. However, the pursuer still has the ability to track the moving trend of the evader, because it was pre-trained using the information of the estimated environment. Compared with Figure 5a, Figure 5b draws the trajectories of the pursuer and the evader after the proposed PTFRL; it clearly shows that the pursuer could track the evader better after the learning.
In the Y-Z plane, the trajectories before learning and after learning are illustrated in Figure 6a,b, respectively. Because the largest external disturbance acted in the z channel, Figure 6a shows that the pursuer tracked the evader poorly, leaving a large tracking error. In Figure 6b, the pursuer improved its control policy for tracking the evader in the z channel. Overall, Figures 5 and 6 show that, after the proposed learning algorithm, the pursuer could track the evader better because of a more suitable consequent set. During the learning process, the pursuer sought better consequent parameters for different relative states; in this way, the consequent set was updated, which made the pursuer get much closer to the evader.

The whole learning process can be divided into three periods: before pre-training, after pre-training, and after PTFRL. Before pre-training, the pursuer was in free flight without any control policy. When the pursuer finished the pre-training, it took the estimated optimal control policy based on the estimated environment. Finally, when the pursuer took the control policy after PTFRL, it had finished the learning. The tracking errors of the pursuer in the x, y, and z channels under these three periods are shown in Figure 7. Compared with the tracking error before pre-training, the error after pre-training decreased effectively, and the error after PTFRL further approached zero. The max errors of all channels under the different periods are drawn in Figure 8: compared with the max error before pre-training, the error decreased after pre-training and was further cut down after PTFRL. If all the rewards during the flight are accumulated, the total reward is obtained; therefore, there exists the real total reward under the real flight and the ideal total reward that would be obtained if the pursuer tracked the evader perfectly.
The ideal total rewards and the real ones in the x, y, and z channels are shown in Figure 9. The total reward of each channel rose after pre-training compared with that before pre-training, and after PTFRL the total rewards approached the ideal values in all channels.

Figure 9. Comparisons of total rewards in different periods.

Discussion
Based on the numerical experimental results in Section 5, the following points are discussed.
(a) From Figure 7, it can be concluded that in the x channel, compared with the terminal tracking error before pre-training, the errors decreased by 21.47% and by 85.74% after pre-training and after PTFRL, respectively. Similarly, the terminal tracking errors decreased by 45.68% and 90.80% after pre-training and after PTFRL in the y channel, while the errors decreased by 42.53% and 94.27% after pre-training and after PTFRL in the z channel.
(b) In Figure 8, it is seen that, compared with the condition before pre-training, the max tracking error in the x channel decreased by 21.47% after pre-training and by 69.36% after PTFRL. In the y channel, compared with the max tracking error before pre-training, it decreased by 57.26% both after pre-training and after PTFRL, because the max error equaled the initial error. Besides, the max error in the z channel decreased by 42.53% and by 73.76% after pre-training and after PTFRL, respectively.
(c) Figure 9 shows that if the ideal total reward was set as the target value, the real total reward in the x channel improved by 38.34% and by 97.97% after pre-training and after PTFRL, compared with that before pre-training. In addition, the reward improved by 70.49% and 99.15% after pre-training and after PTFRL in the y channel. As for the z channel, compared with the real total reward before pre-training, the reward improved by 66.98% and 99.67% after pre-training and after PTFRL, respectively.

Conclusions
To help a pursuer find an advantageous control policy in a one-to-one game in space, a pre-trained fuzzy reinforcement learning (PTFRL) algorithm was proposed in this paper. To reduce the difficulty of solving the game without prior information, the man-made model was defined as an estimated environment. By employing fuzzy inference systems, an actor-critic learning framework, divided into x, y, and z channels, was established. To make use of the estimated optimal strategy, a pre-training process was conducted to initialize the consequent set of the pursuer. With the relative position and the relative velocity in each channel as inputs, the proposed algorithm controlled the pursuer. Comparing the simulation results before pre-training, after pre-training, and after PTFRL showed that the tracking errors were effectively decreased by the pre-training process and further approached zero after the proposed PTFRL.