Active Debris Removal Mission Planning Method Based on Machine Learning

: To prevent the proliferation of space debris and stabilize the space environment, active debris removal (ADR) has increasingly gained public concern. Considering the complexity of space operations and the viability of ADR missions, it would be necessary to schedule the ADR process in order to remove as much debris as possible. This paper presents an active debris removal mission planning problem, devoted to generate an optimal debris removal plan to guide the mission process. According to the problem characteristics, a two-layer time-dependent traveling salesman problem(TSP) mathematical model is established, involving the debris removal sequence planning and the transfer trajectory planning. Subsequently, two main novel methods based on machine learning are proposed for the ADR mission planning problem, including a deep neural networks(DNN)-based estimation method for approximating the optimal velocity increments of perturbed multiple-impulse rendezvous and an reinforcement learning(RL)-based method for optimizing the sequence of debris removal and rendezvous time. Experimental results of different simulation scenarios have veriﬁed the effectiveness and superiority of the proposed method, indicating the good performance for solving the active debris removal mission planning problem


Introduction 1.Background
In the past 60 years, mankind has been actively exploring and developing near-Earth space through rockets, satellites, and other kinds of spacecraft.A by-product of this activity is space debris, which is structural fragments, the upper stages of rockets, or even the abandoned satellite itself after completing its mission [1].The debris can stay in orbit for decades or even hundreds of years, posing a serious threat to both functioning and newly launched satellites and spacecraft.Kessler and Cour-Palais [2] put forward the "Kessler syndrome", which attracted widespread attention.The theory pointed out a dramatic situation of space debris, when the collision of two large objects of space debris causes an avalanche-like cascade of mutual collisions, ultimately leading to the formation of a cloud of debris around the Earth [3].Research shows that at least five large space debris need to be removed every year to stabilize the space environment [4,5].
Active space debris removal mission planning problem is a complex optimization problem, which is used to determine the sequence of debris removal, rendezvous time, and transfer trajectory between two adjacent debris removal processes.
The combinatorial optimization characteristics of the problem are similar to those of the traveling salesman problem (TSP) [25] to a degree, but compared with it, ADR mission planning problem shows greater complexity and difficulty in solving.This can be summarized in two aspects: first, the calculation of the optimal transfer velocity increment between two debris is much more complicated than the calculation of the Euclidean distance between two cities with a fixed location, and it is particularly time-consuming; second, because space debris runs in different orbits and its spatial location changes with time, ADR mission planning problem has a larger search space than the classic TSP problem of the same scale.Thus, effective and efficient planning methods are of crucial significance for the completion of the ADR missions.

Literature Review
At present, the methods for solving active space debris removal mission planning problem can be divided into three main categories: explicit enumeration methods, implicit enumeration methods, and meta-heuristic methods [26].
(1) Explicit enumeration methods The explicit enumeration methods are committed to finding the exact global optimal solution by enumerating all possible combinations.This method is only applicable when the search space is very small.Zuiani and Vasile [27] presented a simplified ADR mission planning scenario of five space debris under the preliminary design of low-thrust, manyrevolution transfers.According to different propulsion systems, Braun et al. [28] proposed four space debris removal modes and the best removal sequence of four to six debris is obtained through the brute-force approach.However, when the number of debris is larger than 10, the considerable calculation time of enumerating all feasible combinations is unacceptable.
(2) Implicit enumeration methods Implicit enumeration methods are committed to explore the optimization process by finding those sequences with high probabilities of becoming the optimal solution and pruning other sequences [29].The tree search algorithm can be regarded as a typical implicit enumeration method, and the tree nodes represent the space debris involved in orbital transfers.The pruning strategy is crucial to the tree search algorithm, because excessive pruning may lose the optimal solution and inadequate pruning will increase the computational burden [26].Li et al. [30] proposed a beam search algorithm with the pruning strategy, in which only the first beam width's number of nodes was expanded for the next level, and the others were abandoned.Branch and bound is another effective pruning strategy.Olympio and Frouvelle [31] studied a space debris removal mission in solar synchronous orbit with J 2 perturbation considered, and designed the branch and prune algorithm with a pruning strategy to accelerate the removal sequence search process.Barea et al. [32] divided the ADR mission planning problem into two layers: the upper layer constructed a linear integer model, and the lower layer constructed a nonlinear mixed integer model and applied the branch and bound algorithm to search the best removal sequence.Cerf [26] used the branch and bound algorithm to optimize the debris removal sequence and transfer trajectory to obtain a optimal debris removal scheme with the minimum total propellant consumption.Further considering the mission completion time, Madakat et al. [33] modeled the problem as a multi-objective time-dependent traveling salesman problem (TDTSP) with the objective of minimizing the total mission time and total propellant consumption, and adopted the branch and bound algorithm to find the optimal solution for low-Earth-orbit(LEO) space debris removal.Olive et al. [34] further built the TOPAS platform (Tool for Optimal Planning of ADR Sequence) based on the branch and bound algorithm, which was applicable to complex ADR scenarios with up to 10 space debris.
However, in most of the cases mentioned above, still only small amount of space debris removal is scheduled.When it comes to cases of large scale, implicit enumeration methods may not be efficient and a suitable approach for ADR mission planning of a complicated characteristic of mixed integer programming is required.
(3) Meta-heuristic methods The meta-heuristic methods possess a great potential for finding the near-optimal solution within an acceptable time, and provide another idea for solving the ADR mission planning problem.In recent research, genetic algorithm (GA) [35], simulated annealing (SA) [36], Physarum Algorithm [37,38], particle swarm optimization (PSO) [39], and ant colony optimization (ACO) [40] have been applied to schedule debris removal missions.Murakami and Hokamoto [41] proposed two transfer trajectory selection rules to simplify the transfer trajectory optimization problem, and used GA by encoding the removal sequence and the transfer time into the chromosome together.Liu et al. [42] was devoted to developing a preliminary plan for a multi-nanosatellite active debris removal platform (MnADRP) for LEO missions and a dynamic multi-objective TSP scheme was proposed in which three optimization objectives, i.e., the debris removal priority, the MnADRP orbital transfer energy, and the number of required nanosatellites were modeled, respectively.Chen et al. [43] proposed a GA-based three-stage removal strategy for space debris: the first stage to develop the mission scenario with multiple spacecraft including one main spacecraft and some following spacecraft for debris removal missions; the second stage to define the fuel, time, and the quantity of the following spacecraft as the constraints; and the third stage to establish the mathematical model taking the minimum fuel consumption as the optimal objective.Missel and Mortari [44] studied the Space Sweeper with Sling-Sat (4S) mission of debris removal and optimized the combinatory selecting of the debris interaction order, ejection velocities, and sequence timing by an evolutionary algorithm.Federici et al. [3] designed an effective coding and mutation operator, and applied SA to optimize the removal sequence and the rendezvous time of ADR missions, which could accomplish the removal mission planning for 20 space debris.Medioni et al. [45] performed optimization using SA and a tool to classify the targets in groups gathered by similarities of orbital elements, with the objective for each mission being not to exceed a total ∆v = 4 km/s, and for a mission time lower than 3 years, which was proposed for removal target selection.Carlo et al. [46] proposed TSP and VRP(Vehicle Routing Problem) modeling strategies, respectively, for LEO ADR missions and designed a bio-inspired incremental automatic planning and scheduling discrete optimization algorithm based on the Physarum algorithm.Oriented to the geostationary-orbit(GEO) debris removal mission planning problem, Jing et al. [47] studied three key sub-problems: mission allocation, sequence planning, and trajectory transfer planning, which were modeled by a hybrid optimal control model and solved by an improved multi-objective PSO.Mohammadi-Dehabadi et al. [48] designed a multi-objective PSO algorithm for minimizing the total propellant consumption and mission completion time and, through their experiments, it was shown that the initial orbital elements of the spacecraft had a great impact on the propellant consumption and time cost for ADR missions.Stuart et al. [49] used ACO to generate a preliminary debris removal sequence and to determine the number of spacecrafts required to clear a debris set, and re-planned the preliminary scheme by multi-agent coordination via auctions.Shen et al. [50] used ACO to optimize the sequence planning model for the dynamic ADR mission planning problem under J 2 perturbation, and successfully obtained the optimal removal sequence of 10 debris.Zhang et al. [51] proposed an improved ACO with the inner-outer operator, whose effectiveness to select the optimal removal targets from a large set of debris (up to 2000) was proved by their experiments.Li and Baoyin [52] combined the advantage of an evolutionary algorithm and population intelligence algorithm, took ACO as the framework and added the GA operator, and developed the evolutionary elitist club algorithm (EECA) to optimize the multi-debris removal mission.Zhu [53] designed the dynamic sequence planning ant colony optimization algorithm based on the framework of the ant colony system, introduced the concept of a pheromone tensor to characterize the dynamic transfer preference between debris targets, and proposed a step-by-step rendezvous sequence planning strategy based on the time discretization method.Unlike the methods described in refs.[52,53], which added the time dimension into the pheromone matrix of the ACO by discretizing and fitting time, Zhang et al. [29] directly put the timeline particles at a certain moment of the corresponding timeline, but not at a series of discrete time points and solved the time-dependent characteristics of the ADR mission planning problem through a new structure called the Timeline Club.
As can be seen from the above, the literature may fall into three main drawbacks.First, most of the ADR mission merely involved several debris and did not discuss the situation with large-scale debris.Second, in order to simplify the planning problem, some studies adopted a static and non-time-varying transfer trajectory model, which was far from the actual situation.Last but not the least, a hierarchical optimization strategy was often adopted to decompose the ADR mission planning problem where the debris removal sequence and transfer trajectory were optimized separately.Consequently, the final solution may not be optimal without a global optimization.
Therefore, this paper develops two novel methods to confront the deficiencies above, including a method for estimating the optimal velocity increments of perturbed multipleimpulse rendezvous for actual transfer trajectory planning and a method for optimizing the sequence of debris removal and transfer trajectory simultaneously.

Contributions
This paper is devoted to solve the high complexity of the combination optimization in the active debris removal mission planning problem and proposes an ADR planning method based on machine learning, which performs well.The main contributions of this paper are summarized as follows.
(1) A two-layer time-dependent TSP mathematical model is proposed, which clarifies the structure and characteristics of the ADR mission planning problem.Based on this model, the solving method is designed subsequently; (2) A deep neural networks(DNN)-based estimation method for approximating the optimal velocity increments of perturbed multiple-impulse rendezvous is proposed, which overcomes the deficiency of time consumption for optimizing the velocity increments using conventional methods.Its accuracy is much higher than typical analytical approximation methods; (3) An reinforcement learning(RL)-based method for debris removal sequence planning is introduced for the first time.Unlike traditional intelligence optimization algorithms, the RL-based method can quickly generate solutions after training and learning, which is much more appropriate when it comes to large-scale debris removal situation.
The reminder of this paper is outlined as follows: Section 2 first defines the ADR mission planning problem and then transforms it into a special time-dependent TSP problem within several criteria and constraints.Based on that, it formally demonstrates a two-layer time-dependent TSP mathematical model to address the problem.Section 3 describes two main methods for solving the ADR mission planning problem, including a DNN-based estimation method for the optimal velocity increments of perturbed multiple-impulse rendezvous and an RL-based planning method for optimizing the sequence of debris removal.Section 4 first presents the design of two different test scenarios and then evaluates the performance of the proposed approach.In Section 5, the conclusions are given.

Problem Description
In this work, the problem is referred to as the active space debris removal mission planning problem, which is devoted to generate an optimal debris removal plan to guide the mission process.
The active space debris removal mission planning problem mainly involves the optimization of the removal sequence, rendezvous time, and transfer trajectory between two adjacent debris.According to the problem characteristics, the problem can be decomposed into two layers, as shown in Figure 1.The outer layer is the debris removal sequence planning, which optimizes the debris removal sequence and rendezvous time.The inner layer is the transfer trajectory planning, which optimizes the transfer time and transfer velocity increment.Note that, although the problem is decomposed into two layers in order to address it more clearly, the two layers are considered simultaneously by the following described solving methods.

Debris removal sequennce planning
Transfer trajectory planning The removal sequence is denoted by the removal order of the space debris targets.The removal order number is an integer variable while the rendezvous time, transfer time, and the transfer velocity increment are both real variables.It can be seen that this problem is a two-layer mixed integer optimization problem, and its difficulty lies in its nested optimization attribute and mixed integer variable characteristics.
Typically, adjusting the rendezvous time between spacecraft and debris with the removal sequence determined has a relatively small impact on the overall fulfillment of the mission.However, if the removal sequence is changed, the overall propulsion consumption may be greatly affected, or even exceed the maximum spacecraft loading capacity.Therefore, the debris removal sequence planning focuses more on the rendezvous order for spacecraft and debris, making it similar to the TSP problem to a certain degree.Variously, the classic TSP problem is a static target visiting sequence planning problem with a fixed city location, while the debris removal sequence planning is a time-dependent rendezvous sequence planning problem for moving targets, typically a time-dependent TSP problem (TD-TSP) [54], as shown in Figure 2. In the moving target rendezvous problem, the traveler can stay with the target for a period of time, or he can set out immediately for the next target.Although we focus more on the debris removal sequence, it cannot be optimized regardless of the rendezvous time.Otherwise, only the static debris removal sequence is obtained, which has no practical significance.However, since the rendezvous time is a continuous real variable, it is impractical to search its optimization space entirely.Thus, a time discretization strategy is adopted.As shown in Figure 3, the rendezvous time between spacecraft and space debris is no longer a continuous value in the time interval, but only the time point represented by each discrete spot after discretization.Obviously, the search space of the time-dependent sequence planning problem has been greatly reduced, laying a foundation for the efficient solution of the problem.As for the inner layer, i.e., the transfer trajectory planning, perturbed multiple-impulse transfer is involved.For the transfer trajectory planning problem where the rendezvous between two targets is caused by two impulses, if the initial and terminal conditions are given, the transfer velocity increment is a certain value, which can be directly solved by the Lambert algorithm.However, for the problem of perturbed multiple-impulse rendezvous, the transfer velocity increment is no longer a fixed value, and the optimal value needs to be obtained through optimization.The perturbation in this paper is referred to as the J 2 perturbation [30], which is caused by the Earth's oblateness.J 2 perturbation will affect the right ascension of the ascending node (RAAN) of the orbit for both the spacecraft and debris and add the complexity of solving the inner problem.

A Two-Layer TD-TSP Mathematical Model
Based on the analysis above, the mathematical model of the active space debris removal mission planning problem is constructed as a two-layer TD-TSP model. (

1) Decision variable
The main variables to be considered include the debris removal sequence, rendezvous time, and the relevant impulse parameters of spacecraft transfer.According to the two-layer attribute of the problem, the outer and inner decision variables can be designed, respectively.
The outer variable is defined in Equation (1).
where N represents the number of space debris, S i represents the removal order for debris i, T mt i represents the rendezvous time between the spacecraft and debris i, and T dp i represents the departure time of the spacecraft from debris i.
The inner design variables include two types, respectively representing the impulse time and impulse vector, and are defined in Equations ( 2) and (3).
where n is the number of the involved impulses.∆v i = ∆v ix , ∆v iy , ∆v iz points to the three components of the vector.
(2) Constraints Similarly, the constraints will also be described in two layers.
(i) Outer constraints Equation ( 4) indicates that the value of the debris removal sequence is a positive integer within [1, N], and the same debris cannot be removed repeatedly.
(5) Equation (5) restricts that the rendezvous time between the spacecraft and the next debris must be later than the departure time from the previous debris.
Equation ( 6) means that the waiting time of the spacecraft should be longer than the minimum operating time, that is, the spacecraft needs to stay in the target orbit for a sufficient time to achieve debris removal operation.
(ii) Inner constraints Equation (7) indicates that the impulses should be applied in sequence and the (i + 1)th impulse should be later than the ith impulse.
Equation ( 8) denotes the conditions to judge whether the spacecraft and debris rendezvous, where r ce , v ce and r te , v te are respectively the position and velocity vector of the spacecraft and debris, and ε r and ε v are the maximum allowable position and velocity error for rendezvous.
(3) Optimization objective The active space debris removal mission planning problem aims to complete the space debris removal mission with the goal of maximizing the utilization efficiency of spacecraft resource and minimizing the impulse propellant consumption subject to constraints.
Equation ( 9) addresses the sequence optimization objective for the outer layer.∆V i is the optimal impulse velocity increment for the transfer between debris i to debris i + 1.
Equation ( 10) addresses the transfer trajectory optimization objective for the inner layer, which is to minimize the impulse velocity increment for a certain transfer, related to the impulse vector.∆v ij represents the jth impulse velocity increment from debris i to debris i + 1.
Therefore, the overall optimization objective for the active space debris removal mission planning problem can be expressed as Equation (11).

Methods
Figure 4 illustrates the whole framework of the active debris removal mission planning method based on machine learning, which is composed of two main connected segments, namely the estimation of transfers and sequence planning.Accordingly, the two main parts of the method are addressed as follows.

A DNN-Based Estimation Method for Approximating the Optimal Velocity Increments of Perturbed Multiple-Impulse Rendezvous
The debris removal sequence planning needs to quickly obtain the transfer cost (velocity increment) between any two targets in a given initial and terminal state in order to evaluate the profit of the removal sequence plan.However, unlike the rendezvous caused by two impulses, the transfer cost for perturbed multiple-impulse rendezvous is no longer a fixed value and needs to be optimized.This is usually a time-consuming process.If it is nested in the sequence planning, optimizing the removal order and transfer trajectory simultaneously, the calculation time will be unacceptable.Therefore, to solve the ADR mission planning problem, a method that can quickly and accurately estimate the optimal velocity increments of perturbed multiple-impulse rendezvous is demanded.
Previously, researchers basically used analytical methods to roughly estimate the optimal velocity increment.This kind of method is fast, but its estimation accuracy is not high under the circumstances of perturbed multiple-impulse rendezvous.Large estimation errors will directly affect sequence optimization in the outer layer.Thus, a method for estimating the optimal velocity increments of perturbed multiple-impulse rendezvous based on deep neural networks (DNN) is proposed.
First of all, the perturbed multiple-impulse rendezvous can be divided into three types according to the difference variation trend of the right ascension of the ascending node (Ω) between the departure body and the rendezvous target.They are "Ω-closing rendezvous", "Ω-intersecting rendezvous", and "Ω-separating rendezvous" [53], as illustrated in Figure 5.The estimation of optimal velocity increments of the three types is implemented separately.The whole solving framework of the estimation method for perturbed multiple-impulse rendezvous is depicted in Figure 6.• Step 1: Build a training database.
The database is required to cover three types of the perturbed multiple-impulse rendezvous cases with different parameters and conditions for training.A two-step approach including an improved differential evolution (DE) [55] algorithm and a sequential quadratic programming (SQP) algorithm [56] is applied as the optimizer to generate the rendezvous solution for each case.More detailed information on the parameters and implementations for building the database is presented in Section 4.1.

•
Step 2: Train the deep neural network.
Three deep neural networks are demanded for training the three types of the perturbed multiple-impulse rendezvous.
A deep neural network is a complex nonlinear system composed of a large number of interconnected neurons (nodes) [57].Each node obtains one or more inputs from other nodes, and generates an output through an activation function over the weighted sum of these inputs.Neural networks have many different network structures, and most of them contain three types of layers: input layer, hidden layer, and output layer.For general regression problems, the fully connected multi-layer perceptron (MLP) is usually a more suitable network model [58].A well-trained MLP with a moderate network size can approximate any complex nonlinear function.In this paper, MLP is adopted as the architecture for the three DNNs.The activation process is expressed as Equation ( 12).
where x j represents the output of node j in the current layer, x i represents the output of node i in the previous layer, w ij is the connection weight from node i to node j, b j denotes the variable bias of node j, N d is the total number of nodes in the previous layer, and f is the activation function.In this paper, a Leaky Rectified Linear Unit (Leaky ReLU) [59] is selected as the hidden-layer activation function because it is easy to calculate and can avoid the problem of gradient disappearance.The Leaky ReLU is expressed as Equation ( 13) and the parameter γ is usually set to 0.01.The output-layer activation function is a linear function called identity activation function, expressed as Identity(x) = x.
LeakyReLU(x) = max(0, x) + γmin(0, x) Network training is an iteration process used to adjust the weight vectors continuously, aiming to minimize the loss function.The mean squared error function is selected as the loss function for MLP, and is calculated as Equation (14).
where b is the batch size, set equals to 32, o p (i) is the predicted output of the neural network, and o m (i) is the optimal velocity increment.The training adopts a cross-validation method, where, in each epoch, 90% of the data is used as training samples while the remaining 10% is used for validation.The early stop value is set to 50.Adaptive Moment Estimation (Adam) [60] is used to optimize the parameters of the network, and three MLPs are built and trained based on Keras [61] and TensorFlow [62].
Through the well-trained DNNs, based on the orbital elements of the departure body and rendezvous target, the optimal velocity increments of perturbed multiple-impulse rendezvous can be approximated.The flow chart is presented in the third step in Figure 6, where Ele c0 and Ele t0 are the initial orbit elements of the departure body and rendezvous target, Ele c f and Ele t f are the terminal orbit elements, Ω c0 , Ω t0 , Ω c f , Ω t f are the initial and terminal right ascension of the ascending node of the departure body and rendezvous target, respectively, and ∆T is the transfer time.

An RL-Based Method for Debris Removal Sequence Planning
Through the analysis and modeling above, it can be seen that the space debris removal sequence planning problem is a time-dependent rendezvous sequence planning problem for moving targets and its complicated two-layer optimization characteristics make the search space huge.When debris number increases sharply, the general optimization algorithms are no longer able to obtain the optimal solution in an acceptable time.Thus, appealing to machine learning methods, which are currently emerging vigorously in the classical optimization field, this paper designs an reinforcement learning-based (RL-based) debris removal sequence planning algorithm.Next, each module of the algorithm is introduced step by step.
(1) Improved pointer network Pointer Network is a variant of the neural network structure proposed by Vinyals et al. [63], based on the sequence to sequence (seq2seq) network model [64].It can learn the conditional probability of the output sequence from the input sequence, and can predict the solution of the combinatorial optimization problem with high accuracy.The pointer network is composed of an encoder and a decoder.Its principle is to map the output into a series of pointers pointing to the elements of the input sequence according to probability.Based on the problem characteristics and constraints, two improvement strategy are designed.
(i) Dynamic information embedding Dynamic information is embedded into the encoder.The spatial location of the dynamic space targets is represented by the corresponding six orbital elements, namely, semi-major axis (a), the inclination (i), eccentricity (e), right ascension of the ascending node (Ω), argument of perigee (w), and true anomaly (m).At the same time, the debris removal sequence planning involves the trajectory transfer process, which requires the rendezvous time between the spacecraft and debris T mt , and the departure time from the debris T dp .Therefore, the input of the improved pointer network can be expressed as Equation (15).X = (S, T) = a, i, e, Ω, ω, m, T mt , T dp (15) The improved pointer network after embedding dynamic information based on time variables is shown in Figure 7.At each step, the decoder network produces a vector that modulates a content-based attention mechanism [63] over the inputs.The output of the attention mechanism is a softmax distribution, namely the probability distribution.The size of the arrow in Figure 7  (ii) Mask design Mask refers to a processing method used to avoid the selection of certain elements by reducing the corresponding decision probability to 0. This is a good way to reduce the complexity of exploration for the pointer network that relies on the attention mechanism to calculate the decision probability distribution to guide the decision-making process.The attention mechanism of the improved pointer network with mask rectification is depicted as Equation (16).
where W 1 and W 2 are the network parameters to be trained, vector u i j is the pointer of the input element, p(C i | C 1 , . . ., C i−1 , P; θ) represents the probability of the sequence, and λ i j is the mask vector of the current step whose value is 0 or 1.
To sum up, for step k in the pointer network, the mask assignment rules are set as follows: 1.
On the basis of the mask matrix in step k, the constraint conditions of all decisions with non-zero mask will be judged, and the illegal decision mask will be assigned 0; 2.
Assign 0 to the selected decision mask in step k; 3.
Restore the masks that are assigned 0 under the first rule, 4.
If k equals to the maximum iteration number K, set all the masks to 0 and finish; otherwise save the mask matrix, and continue to step k + 1.
(2) An AC framework-based reinforcement learning method The improved pointer network model applicable to debris removal sequence planning is described above.However, since the neural network method usually belongs to the category of supervised learning, it needs to obtain a large amount of training data.Based on the complex reality of the space debris removal mission, it is difficult to obtain largescale realistic data that can cover all kinds of mission scenario information.Under this circumstance, it is hard to guarantee the problem optimality merely through the pointer network model.Consequently, this paper uses the Actor-Critic method (AC) [65] to train the improved pointer network.This method combines the advantages of value-based and policy gradient optimization.It can interact with the environment by itself and does not need a large number of training sample data, so it is applicable to the space debris removal sequence planning problem.
The Actor-Critic structure is composed of two neural networks, namely Actor network and Critic network.Actor network is a network based on strategy gradient optimization.It takes the state as the input and action as the output, selects actions based on the value calculated by the Critic network, and updates the network parameters and the probability of actions.The Critic network takes the current state and action as the input and the value as the output.The Critic network evaluates the action of the Actor network, and the evaluation needs to be fed back to the Actor network.
The loss function of the improved pointer network is defined as Equation (17).
where θ represents the parameter of pointer network, X represents the decision state space, θ(• | X) represents the probability distribution of the pointer network decision strategy corresponding to the parameter θ, π represents the current decision , L(π | X) represents the objective value of the current decision, and it is calculated according to Equation (11).The gradient of the loss function [66] can be defined as Equation (18).
where b(X) denotes the baseline function of the gradient and p θ (• | X) denotes the probability of decision π under the corresponding decision probability distribution for the parameter θ.
For the baseline function, Wang et al. [67] proposed that the network could be built separately for Actor-Critic outside the pointer network for calculation, but this method has poor stability, which may cause the training to be unable to converge.Therefore, the baseline function is set based on the exponential moving average.Comparing with the simple moving average, the exponential moving average focuses more on the recent data, and the weight of the data will decline exponentially over time.
The baseline function [68] can be expressed as Equation (19).
Based on the improved pointer network, this method introduces a critic network, which evaluates by mean square error as expressed in Equation ( 20), and applies stochastic gradient descent (SGD) for training.

Experimental Results and Discussion
The proposed methods for estimating optimal velocity increments of the perturbed multiple-impulse rendezvous and for planning the debris removal sequence are simulated respectively as follows.

Experiments for the Estimation of Transfer Impulse Velocity Increment
Sun-synchronous orbit (SSO) is a significant kind of satellite orbit, in which almost half of the Earth observation satellites (EOSs) run.With the number of SSO space debris increasing, it will pose a huge danger to the functioning EOSs.Therefore, SSO space debris removal missions are the main focus.The training database is built by randomly generating the six orbit elements of the departure body and rendezvous target both near the Sun-synchronous orbit.The ranges are shown in Table 1.For the SSO orbit, the semi-major axis, inclination, and eccentricity must satisfy the condition: cosi = −4.7736× 10 −15 (1 − e) 2 a 7/2 [69].The impulse solution for each rendezvous sample is obtained by a two-step approach including an improved differential evolution (DE) algorithm [55] and a sequential quadratic programming (SQP) algorithm [56].According to Zhu [53], due to the interference of huge numbers of local optima for the perturbed multiple-impulse rendezvous problem and the stochasticity of evolutionary algorithm, the optimality cannot be guaranteed by running only one time.Therefore, 100 independent runs are implemented for each rendezvous case and the best solution of the 100 runs is selected and determined as the satisfactory solution.The effectiveness of this method to guarantee the high quality of the selected solutions is verified through simulation experiments by Zhu [53].The computational time for each case of one run is about 5 s on average.The relevant parameter settings of the experiment for estimation of transfer impulse velocity increment are presented in Table 2.The number of the initial training samples for each transfer type 5000 N r1 The number of the terminal training samples for each transfer type 10,000

N t1
The number of the testing samples for each transfer type 1000 h 1 Hidden layers 2 On the basis of domain knowledge that the optimal velocity increments of perturbed multiple-impulse rendezvous should be associated with its initial and terminal states, we list all potential learning features in Table 3. Difference between initial RAAN of the departure body and terminal RAAN of the rendezvous target Ωc , Ωt RAAN variation rates of the departure body and rendezvous target ∆ϕ 0 , ∆ϕ f Initial and terminal phase differences between the departure body and rendezvous target ∆T Transfer time Mean relative error (MRE) is regarded as the evaluation criterion and is calculated as Equation (21).
where N t refers to the number of the testing samples; V i E and V i O represent the estimated and optimized optimal velocity increments of the multiple-impulse transfer i.
Comparing different combinations of learning features for estimation of the optimal velocity increment, these three combinations are the most appropriate, with the lowest MRE, for the three types of perturbed multiple-impulse transfer, as listed in Table 4.Moreover, two typical analytical approximation methods for estimating optimal perturbed multiple-impulse transfer are compared with the proposed method.The first method [70] is based on the Gauss form of variational equations [71] and the second [72] is based on the Edelbaum's method for approximating high-thrust transfers [73].The comparison of the simulated results is presented in Table 5.The simulation results manifest that the estimation accuracy of the proposed method is much higher than the analytical estimation methods.With the appropriate features determined, the training database was expanded to 100,00 and the training process was restarted.Finally, the estimation error of single rendezvous can be reduced to lower than 3%, which verifies the effectiveness of the proposed estimation method for approximating the optimal velocity increments of perturbed multiple-impulse rendezvous.

Experiments for Active Debris Removal Planning
In order to verify the performance of the proposed active debris removal mission planning method, two experiments are designed, namely, a dynamic TSP scenario based on moving city targets and the debris removal scenario.The following describes the settings, simulation results, and comparative analysis of the two experiments.
(1) Dynamic TSP scenario This scenario requires the traveler to find the best traveling route within a certain time horizon, so that he can visit all the cities without repetition or omission, and ensure the shortest total length of the route.It is assumed that there are 15 cities to be visited in 1 h.Table 6 shows the location coordinates of 15 cities at the initial time and the terminal time.The first city is the starting point, and the best travel route to other cities needs to be determined.Each city moves straight forward from the initial location to the terminal location at a constant rate in the time domain.A traveler can choose to stay in a certain city for a period of time, or he can immediately set out for the next city.The total length of the travel route refers to the sum of the travel distance between the cities, excluding the distance the traveler moving with the city.In order to learn as much as possible about the empirical knowledge under different city distribution situations, 1500 sets of cities are sampled respectively from the uniform distribution at an interval of [0, 1] and a normal distribution whose mean value is 0 and a standard deviation that lies in the interval of [0.5, 2].A total of 3000 sets of cities form the training set.The parameter settings in the training process are listed in Table 7.To present the effectiveness of the debris removal sequence planning algorithm based on the pointer network and reinforcement learning (Ptr-nets+AC algorithm) proposed in this paper, it is compared with the greedy-based heuristic (GH) algorithm and ant colony optimization (ACO), which perform well in TSP.The GH is a constructive algorithm adopting a greedy strategy in which the traveler always chooses the nearest city to visit.As for the parameters for ACO, the ant colony size N A and maximal iteration G A max are set to 100 and 1000 respectively; the local attenuation coefficient ζ and global attenuation coefficient ρ are both set to 0.9.
Table 8 presents the best three solutions optimized by the three algorithms in 20 runs.Through comparative analysis, it can be seen that the proposed Ptr-nets+AC algorithm obtains the shortest travel route with the shortest running time, and its optimal solution is (1, 12, 2, 9, 10, 6, 7, 5, 15,13,8,4,3,14,11).ACO can also obtain the optimal travel path during the 20 runs of optimization, but its shortest travel length is longer than that generated by the Ptr-nets+AC algorithm, which is caused by the difference in the visit time decision-making between the two algorithms.The GH algorithm can only get the suboptimal solution because it cannot realize global optimization according to the greedy-based searching rule.Accordingly, it is summarized that the designed Ptr-nets+AC algorithm can not only obtain the optimal travel route with high efficiency, but also has a great advantage in visit time optimization.Figure 8 illustrates the optimal travel route generated by the Ptr-nets+AC algorithm.(2) Debris removal scenario In order to further verify the reliability and effectiveness of the proposed ADR mission planning model and algorithm, a debris removal scenario is set based on the 9th Global Trajectory Optimization Competition (GTOC9) [74].GTOC9 assumes that a satellite operating in the sun-synchronous orbit exploded, resulting in a large number of space debris, which would greatly pollute and destroy the SSO space environment.It is required to generate a debris removal plan to clean up 123 space debris with the flight cost of the spacecraft being as low as possible.Table 9 shows the orbital elements of the partial space debris targets and the complete list of the 123 debris can be found on the GTOC9 website [74].
Twelve debris removal sequences obtained by the NUDT (National University of Defense Technology) team [70] in GTOC9 are taken as an example to test the performance of the proposed method.Table 10 presents the 12 selected sequences for removing 123 debris.The start time of the mission is the time when the spacecraft rendezvouses with the first debris, and the terminal time is the time when the spacecraft departs from the last debris.

Number Start Time Terminal Time Debris Removal Sequence Objective
The optimization objective of this space debris removal experiment is to minimize the flight cost of the spacecraft completing a sequence of debris removal, calculated as Equation ( 22).
where α = 2 × 10 −6 is the weight coefficient, N represents the total number of space debris, m de is the mass of debris removal equipment, and m pi is the mass of propellant consumed by the transfer from debris i to debris i + 1, which can be calculated as Equation (23).(23) where ∆v i is the velocity increment required for the spacecraft to maneuver from debris i to debris i + 1, m i is the mass of the spacecraft after the removal of debris i, m dry is the dry mass of the spacecraft, I sp represents the specific impulse, and g 0 represents the gravitational acceleration at sea level.
The Ptr-nets+AC algorithm is adopted to replan the 12 debris removal sequences.The start time and terminal time of the replanned debris removal mission remain the same.Similarly, the first debris in the original sequence is regarded as the starting point for the spacecraft.Due to the large time span of the mission scenario, the time discretization interval is set to 5 days.For the 12 debris removal sequences, they are all optimized independently as 20 runs and we take the best solution as the final debris removal scheme.Table 11 shows the replanned results of the 12 debris removal sequences by the Ptr-nets+AC algorithm.
We note that the 12 original debris removal sequences were obtained through the mixed integer genetic algorithm (MIGA) [70].The relevant parameters of ACO are set the same as those in the above dynamic TSP scenario.As for the parameters of MIGA, the population size N M and maximal iteration G M max is set to 1000; the crossover probability P c and mutation probability P m are set to 0.8 and 0.2, respectively.Consequently, MIGA, ACO [53], GH and the Ptr-nets+AC algorithm are compared in Figures 9 and 10. Figure 9 presents the comparison of the objective function, i.e., the flight cost of the spacecraft completing a sequence of debris removal, and Figure 10 illustrates the comparison of the computational time for the four algorithms.Comparing the results from Table 11 and Figures 9 and 10, it is found that the replanned flight cost by the Ptr-nets+AC algorithm is lower than that of other methods.The Ptr-nets+AC algorithm, based on the training and learning mechanism, can better capture the optimization information in complex environments and guide the generation of better debris removal schemes, while the MIGA, ACO, and heuristic algorithm still have much to improve on.Compared with the optimal debris removal sequence obtained by the Ptr-nets+AC and MIGA algorithm, it can be seen that most of the original debris removal sequences have been changed and only the 8th and 12th sequences remain the same, which are also the shortest debris chains among the 12 debris chains.To some extent, it indicates that MIGA is slightly weak in the larger-scale space debris removal mission planning problem.MIGA may fall into the local optimal solution easily, while the Ptr-nets+AC algorithm obtains global information from multiple rounds of training and learning, making it easier to realize global optimization.Moreover, it is presented that the Ptr-nets+AC algorithm consumes the minimum running time and can generate a high-quality debris removal solution almost instantaneously.Although it takes time to train the model, this method is valuable, especially for the research on large-scale debris removal missions.Taking the No. 2 removal sequence as an example, it involves 11 space debris and 10 transfers, which is depicted in both Tables 10 and 11. Figure 11 illustrates how the No. 2 removal sequence is changed after the replanning by the Ptr-nets+AC algorithm and Figure 12 illustrates the variation of the 10 transfer velocity increments for the removal sequence optimized by the MIGA and Ptr-nets+AC algorithm.As can be seen from the two figures, almost half of the removal sequence is rearranged and the velocity increments of the involved spacecraft transfers vary a lot.The velocity increments of 8 transfers are reduced by a large margin after the replanning and the total velocity increment is reduced by 544 m/s, which indicates the good performance of the Ptr-nets+AC algorithm in planning the active space debris removal missions.Furthermore, by analyzing the two unchanged debris removal sequences (the 8th and 12th), the flight cost has also been reduced after replanning by the Ptr-nets+AC algorithm, which indicates that the Ptr-nets+AC algorithm can further optimize the rendezvous time between the spacecraft and the debris so as to obtain a better removal scheme.By comparing and analyzing the experimental results above, the effectiveness and superiority of the active debris removal mission planning method proposed in this paper are verified.

Conclusions
This paper proposes an active debris removal mission planning problem, devoted to optimizing the debris removal sequence, rendezvous time, and involved transfer trajectory, to generate an optimal debris removal plan to guide the mission process.According to the problem characteristics, the problem is decomposed into two layers.The outer layer is the debris removal sequence planning, which optimizes the debris removal sequence and rendezvous time.The inner layer is the transfer trajectory planning, which optimizes the transfer time and transfer velocity increment.Subsequently, a two-layer time-dependent TSP mathematical model is established.Two main solving methods for the ADR mission planning problem are proposed, including a DNN-based estimation method for approximating the optimal velocity increments of perturbed multiple-impulse rendezvous and an RL-based method for optimizing the sequence of debris removal and rendezvous time.Experimental results of different simulation scenarios have verified the effectiveness and superiority of the two proposed methods, respectively, indicating the good performance for solving the active debris removal mission planning problem.
For future perspectives, multiple spacecraft should be involved for future debris removal missions since a single spacecraft owns the restricted capability.Then, the mission allocation and coordination between multiple spacecraft are required to be further optimized for the ADR mission planning problem, which is worth further research.

Figure 1 .
Figure 1.Decomposition of the active space debris removal mission planning problem.

Figure 2 .
Figure 2. A static target visiting sequence and a time-dependent rendezvous sequence for moving targets.

Figure 3 .
Figure 3.A time-dependent moving target rendezvous sequence with time discretization.

Figure 4 .
Figure 4. Framework of the active debris removal mission planning method based on machine learning.

Figure 5 .Figure 6 .
Figure 5. Variation trend of the right ascension of the ascending node (Ω) between the departure body and the rendezvous target for three types of perturbed multiple-impulse rendezvous.

Figure 7 .
Figure 7. Improved pointer network with dynamic information.

Figure 8 .
Figure 8.The optimal travel route generated by the Ptr-nets+AC algorithm.

Figure 9 .
Figure 9.Comparison of the optimal replanned debris removal plan for MIGA, ACO, GH, and the Ptr-nets+AC algorithm.

Figure 10 .
Figure 10.Comparison of the running time for GH, MIGA, ACO, and Ptr-nets+AC algorithm to plan the 12 space debris removal sequences.

Figure 11 .
Figure 11.Comparison between the original and replanned solutions for the No. 2 debris removal sequence generated by the MIGA and Ptr-nets+AC algorithm, respectively.

Figure 12 .
Figure 12.Variation of the 10 transfer velocity increments for the No. 2 debris removal sequence optimized by the MIGA and Ptr-nets+AC algorithm.

Table 1 .
The ranges of the six orbital elements.

Table 2 .
Parameters of the experiment for estimation of transfer impulse velocity increment.

Table 3 .
Potential learning features for estimating optimal velocity increments.Semi-major axis of the departure body and rendezvous target e c , e t Eccentricities of the departure body and rendezvous target i c , i t Inclinations of the departure body and rendezvous target ∆Ω c0t0 Difference between initial RAAN of the departure body and initial RAAN of the rendezvous target ∆Ω c f t f Difference between terminal RAAN of the departure body and terminal RAAN of the rendezvous target ∆Ω c0t f

Table 4 .
Selected learning features for the three types of perturbed multiple-impulse transfer. ,a t + e c , e t + i c , i t + ∆Ω c0t0 + ∆Ω c f t f + ∆T 5.98% Ω-intersecting a c , a t + e c , e t + i c , i t 5.95% Ω-separating a c , a t + e c , e t + i c , i t + ∆Ω c0t0 + ∆Ω c f t f + +∆Ω c0t f + ∆T 5.51%

Table 5 .
MREs of the estimation for the two analytical methods and DNN-based method.

Table 6 .
The initial location and the terminal location of 15 cities.

Table 7 .
Parameter settings in the training process for active debris removal planning.

Table 8 .
Optimal traveling route for the heuristic algorithm, ant colony optimization, and the Ptr-nets+ AC algorithm.

Table 9 .
The orbital elements of the partial space debris.

Table 10 .
Twelve removal sequences for the 123 debris.

Table 11 .
The 12 replanned debris removal sequences by the Ptr-nets+AC algorithm.