Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing

Planetary soft landing has been studied extensively due to its promising application prospects. In this paper, a soft landing control algorithm based on deep reinforcement learning (DRL) with good convergence properties is proposed. First, the soft landing problem of the powered descent phase is formulated and the theoretical basis of reinforcement learning (RL) used in this paper is introduced. Second, to ease convergence, the reward function is designed to include process rewards such as a velocity tracking reward, which addresses the problem of sparse reward. By further including a fuel consumption penalty and a constraint violation penalty, the lander learns to track the reference velocity while saving fuel and keeping the attitude angles within safe ranges. Then, training simulations are carried out under three classical RL frameworks, Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor Critic (SAC), and all of them converge. Finally, the trained policies are deployed in velocity tracking and soft landing experiments, the results of which demonstrate the validity of the proposed algorithm.


Introduction
With the development of space technology, the scope of space exploration is constantly expanding. To further explore and study celestial bodies such as the Moon and Mars, a large number of planetary surface exploration missions have been carried out and many more are being planned [1,2]. In planetary surface exploration missions, the lander faces many challenges. On the one hand, to avoid damaging onboard equipment, the landing velocity relative to the planet's surface must be kept under a threshold. On the other hand, the maneuvering ability of the landing probe is limited, so the lander needs a high landing accuracy to explore a specific area. Therefore, precise soft landing guidance has always been one of the key technologies of planetary exploration and has been widely studied [3][4][5].
At present, soft landing algorithms can be roughly divided into five categories: establishing the lunar vertical line, gravity turn guidance, nominal trajectory guidance, explicit guidance, and learning-based methods.
The establishment of the lunar vertical line is an open-loop guidance method, which requires high accuracy of orbit entry and mid-course correction [6]. Gravity turn guidance is a semi-open-loop, semi-closed-loop guidance method. In the main braking period, the main goal is to reduce the speed of the lander. Once the distance between the lander and the lunar surface has shortened to a certain range, closed-loop guidance based on sensor feedback takes over to improve the landing accuracy and stability [7]. These two methods were generally only used in early lunar landings [8].
Nominal trajectory guidance consists of open-loop offline trajectory planning and closed-loop online trajectory tracking. Before landing, the lander needs to plan an optimized landing trajectory. During the landing process, the deviation of the lander position and velocity from the planned trajectory is constantly measured and eliminated, through which the lander is controlled to land at the desired landing site [9][10][11][12]. The explicit guidance method solves the closed-loop guidance problem based on an explicit function of the control functional. To obtain an analytical expression for the optimal control problem of soft landing, it is necessary to simplify the soft landing model when building the mathematical model [13]. The work in [14] proposes a guidance method for the powered soft landing of a launcher with non-cluster configured engines, for which it is difficult to maintain a low thrust-to-weight ratio. The work in [15] deals with designing a soft landing trajectory from the lunar parking orbit to the surface of the Moon and solves the optimization problem with Differential Evolution (DE), which is superior in convergence speed.
In recent years, machine learning has made great breakthroughs. Some researchers have studied soft landing based on intelligent learning algorithms such as deep learning. In [16], the ability of deep architectures to drive the onboard decision-making system is investigated in detail. Under the assumption of perfect state information, deep networks are trained to approximate the optimal control action in pinpoint landing experiments and are able to cope with large sets of possible initial states. The work in [17] proposes an autonomous lunar landing method based on deep learning that takes raw images from onboard optical cameras as input and directly outputs fuel-optimal control actions, so that direct filters for state estimation are not necessary. Moreover, the deep networks are trained by a supervised machine learning algorithm, and the training datasets are generated by NLP solver software packages. Further, in [18], a recurrent neural network architecture is proposed to predict the fuel-optimal thrust from a sequence of states in the powered descent phase of planetary soft landing.
In future exploration missions, it is necessary to enable the lander with higher autonomy that can make real-time adjustments to landing trajectory according to the landing condition, which is difficult for offline planning methods.
As an important branch of machine learning, DRL integrates the perception ability of deep learning with the decision-making ability of RL. It is an end-to-end algorithm that directly takes the environmental information as inputs and outputs the control efforts. It is especially suitable for solving the decision-making and planning problems of complex systems and has been widely studied and applied in the fields of games [19], autonomous driving [20], and manipulator control [21].
In this paper, the problem of planetary soft landing in the powered descent phase is studied. The soft landing problem of the powered descent phase is formulated, and the theoretical basis of RL used in this paper is introduced. To make it easier for the training process to converge, a reward function including a process reward is designed to solve the problem of sparse reward. In addition, a fuel consumption penalty and a constraint violation penalty are included to save fuel and keep the attitude angle within constraints. The main contributions of this work are (i) a velocity tracking reward function with a process reward, which makes it easier for the lander to learn to achieve soft landing and gives it a better generalization capability, and (ii) the inclusion of a fuel consumption penalty and a constraint violation penalty in the reward function, through which the goals of keeping the attitude within constraints and reducing fuel consumption are reached.
The remainder of this paper is organized as follows. Section 2 gives the preliminaries about RL and formulates the soft landing as an RL problem. Section 3 describes the details about soft landing control method based on DRL. Simulation results and necessary discussions are given in Section 4. Finally, conclusions are presented in Section 5.

Soft Landing Problem Formulation
The planetary surface fixed frame of reference is defined as shown in Figure 1. The forces acting on the lander in powered descent include gravity, aerodynamic force, and engine thrust [22].
As the powered descent begins at an altitude that is quite low compared to the planet's radius, and the distance between the lander and target landing sites varies slightly during this phase, it is appropriate to assume that the planet's gravity is a constant g.
In the powered descent phase, the lander has already released the parachute and its speed is on the order of 100 meters per second [22]. Compared with planetary gravity, the acceleration caused by aerodynamic force is very small. Therefore, the external force on the lander is dominated by gravity, and the aerodynamic force caused by the wind field is added into the model as an environmental disturbance.
The lander is equipped with six thrusters deployed in the body frame, as shown in Figure 2. The thrust of each engine is $T_i$ and satisfies the constraint $T_{min} \le T_i \le T_{max}$. Let $T = [T_1\ T_2\ T_3\ T_4\ T_5\ T_6]^T$ be the vector composed of the thrusts of the six thrusters. The thrust vector in the body frame can be obtained from $T$ according to the geometric relation of the thruster layout, and the same holds for the torque vector in the body frame.
In the translational dynamics, $r = [x\ y\ z]^T$ is the position vector of the lander, $v$ is the velocity vector, and $v_b = [u\ v\ w]^T$ and $\omega = [p\ q\ r]^T$ are the velocity and angular velocity in the body frame, respectively. $Q = [q_0\ q_1\ q_2\ q_3]$ is the quaternion. $I$ and $m$ are the inertia matrix and mass of the lander, respectively. $C_b^e$ is the direction cosine matrix from the body frame to the surface fixed frame of reference.
The attitude dynamics are expressed with the quaternion $Q$, the angular velocity $\omega$, and the inertia matrix $I$ defined above. During soft landing, the mass of the lander gradually decreases with fuel consumption at a rate $\dot{m}$ determined by the thrust and the specific impulse $I_{sp}$ of the engine, and the inertia matrix also gradually decreases as the mass decreases. The lander is modeled as a cuboid with sides of length $a \times b \times c$ and uniform mass distribution, which determines its inertia matrix. The soft landing problem is described as a fuel optimization problem.
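The mass and inertia bookkeeping described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text: the cuboid sides $a$, $b$, $c$ are taken to lie along the body x, y, z axes, the mass flow is taken in the common form $\dot{m} = -\sum_i T_i / (I_{sp} g_0)$ with standard gravity $g_0$, and the function names and numbers are hypothetical.

```python
G0 = 9.80665  # standard gravity [m/s^2]; assumed reference for I_sp


def cuboid_inertia(mass, a, b, c):
    """Diagonal of the inertia matrix of a uniform cuboid about its center.

    Assumes the sides a, b, c lie along the body x, y, z axes.
    """
    k = mass / 12.0
    return (k * (b * b + c * c), k * (a * a + c * c), k * (a * a + b * b))


def mass_rate(thrusts, isp, g0=G0):
    """Mass flow rate m_dot = -sum(T_i) / (I_sp * g0) (assumed form)."""
    return -sum(thrusts) / (isp * g0)


# Hypothetical numbers, for illustration only
ixx, iyy, izz = cuboid_inertia(mass=12.0, a=1.0, b=2.0, c=3.0)
m_dot = mass_rate([100.0] * 6, isp=200.0)
```

Because each diagonal entry is proportional to the mass, the inertia matrix shrinks together with the mass as fuel is burned, which matches the behavior described in the text.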

RL Basis
RL is a data-driven algorithm, which is different from supervised learning. RL obtains training data (experience) through interaction with the environment, as shown in Figure 3. At time step $t$, the agent gets an observation $S_t = s$ from the environment and takes an action $A_t = a$ according to the policy $\pi(A_t = a \mid S_t = s)$; the environment then transfers to the next state $s'$ based on the model $P(S_{t+1} = s' \mid S_t = s, A_t = a)$ and returns a reward $R_{t+1}$. Define the accumulated return of an episode as
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where $\gamma \in [0, 1]$ is the discount factor. When $\gamma \to 0$, the agent only cares about the most recent rewards, whereas when $\gamma \to 1$, the agent has a longer horizon and cares more about future rewards. The target is to update the policy to maximize $G_t$. DDPG, TD3, and SAC are three of the most successful and popular RL algorithms, so we choose them as the frameworks in this paper.
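The effect of the discount factor on the accumulated return can be made concrete with a short sketch. The function name and the reward sequence are hypothetical; the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ is the standard way to evaluate a discounted sum over a finite episode.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], evaluated backwards over the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical per-step rewards
myopic = discounted_return(rewards, gamma=0.0)      # only the first reward counts
farsighted = discounted_return(rewards, gamma=1.0)  # all rewards count equally
```

With $\gamma = 0$ only the immediate reward contributes, while with $\gamma = 1$ the return is the plain sum of all rewards in the episode.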
(1) DDPG
DDPG is a deterministic policy RL framework that outputs a deterministic action, and it is the result of extending the deep Q network (DQN) to continuous control space. DDPG has been extensively researched and applied in the field of continuous control. It learns both a value function and a policy. First, it approximates the value function via the Bellman equation with off-policy experience through gradient descent. Then, the policy is updated by maximizing the approximated value function.
DDPG is one of the standard algorithms that train a deep neural network to approximate the Q function. It makes use of past experience through a replay buffer. When it is time to update, it randomly samples from the buffer. To stabilize the training process, the size of the replay buffer should be properly chosen. If it is too small, it can only store recent experiences, which makes the policy brittle. However, if it is too large, the probability of sampling a good experience decreases, and it takes more episodes for the training process to converge.
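The role of the buffer size can be illustrated with a minimal replay buffer sketch. The class name, its methods, and the capacity are hypothetical and are not taken from the implementation used in this paper; the sketch only shows that a bounded buffer discards the oldest experience and that updates draw uniformly at random from what remains.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience store with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen drops the oldest transition once capacity is reached
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step, 0.0, 0.0, step + 1, False)
batch = buffer.sample(2)
```

With a capacity of 3 and five stored transitions, only the three most recent ones remain available for sampling, which is the small-buffer effect described above.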
As shown in Algorithm 1, the whole learning process consists of two parts: Q-learning and policy learning.
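According to the Bellman equation, the optimal Q function under the optimal policy satisfies
$$Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$$
where $P$ is the environment model. Given the experience $(s, a, r, s', d)$, the value function $Q^{\pi_\theta}$ under the policy $\pi_\theta$ can be represented in the same recursive form.

Algorithm 1: DDPG based soft landing
Initialize policy network parameters $\theta$, value function network parameters $\phi$, and replay buffer $D$
Initialize target parameters equal to main parameters: $\theta_{targ} \leftarrow \theta$, $\phi_{targ} \leftarrow \phi$
for episode = 1, 2, ... do
    Observe landing state $s$ and select thrust action
    Update policy through gradient ascent

A mean-squared Bellman error function is set up as
$$L(\phi, D) = \mathbb{E}_{(s, a, r, s', d) \sim D}\left[ \left( Q_\phi(s, a) - y(r, s', d) \right)^2 \right]$$
where
$$y(r, s', d) = r + \gamma (1 - d) \max_{a'} Q_\phi(s', a')$$
is the value function target. In a continuous action space, it is difficult to compute the action $a'$ that maximizes $Q_\phi(s', a')$. Therefore, DDPG uses target networks to solve this problem:
$$y(r, s', d) = r + \gamma (1 - d)\, Q_{\phi_{targ}}(s', a'), \quad a' = \mu_{\theta_{targ}}(s')$$
where $a'$ is obtained via the target policy $\mu_{\theta_{targ}}$. As training proceeds, the policy and the value function gradually converge to the optimal policy and the optimal value function, respectively.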
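The target computation with target networks can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the target form $y = r + \gamma (1 - d) Q_{\phi_{targ}}(s', \mu_{\theta_{targ}}(s'))$ is the standard DDPG target, the networks are replaced by hypothetical scalar callables, and all names and numbers are illustrative.

```python
def ddpg_target(r, s_next, done, q_targ, mu_targ, gamma=0.99):
    """Bellman target built from the target policy and target value function."""
    a_next = mu_targ(s_next)  # a' from the target policy, no maximization needed
    return r + gamma * (1.0 - done) * q_targ(s_next, a_next)


def msbe_loss(batch, q, q_targ, mu_targ, gamma=0.99):
    """Mean-squared Bellman error over a batch of (s, a, r, s', d) tuples."""
    errors = [
        (q(s, a) - ddpg_target(r, s_next, d, q_targ, mu_targ, gamma)) ** 2
        for (s, a, r, s_next, d) in batch
    ]
    return sum(errors) / len(errors)


# Hypothetical stand-ins for the networks (scalars instead of tensors)
q = lambda s, a: 0.0
q_targ = lambda s, a: s + a
mu_targ = lambda s: 2.0 * s

batch = [(0.0, 0.0, 1.0, 3.0, 0.0), (0.0, 0.0, 1.0, 3.0, 1.0)]
loss = msbe_loss(batch, q, q_targ, mu_targ, gamma=0.5)
```

The factor $(1 - d)$ removes the bootstrap term for terminal transitions, so the second tuple in the batch contributes only its immediate reward to the target.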
(2) TD3
TD3 is modified from DDPG. DDPG can perform well sometimes, but its performance depends highly on the choice of hyperparameters. TD3 adopts three critical tricks to make the training process more stable.
• Clipped Double-Q Learning. TD3 learns two value function networks at the same time. When calculating the target, the next state and the target action are input to the two target value function networks at the same time, and the smaller of the two outputs is selected to compute the Bellman error loss when the value function networks are updated.
• Target Policy Smoothing. The value function learning method of TD3 is the same as that of DDPG. When the value function network is updated, noise $\epsilon \sim N(0, \sigma)$, a zero-mean Gaussian noise, is added to the action output of the target policy network with parameters $\theta_{targ}$. Adding noise to the output action of the target policy network serves as regularization, which avoids the overexploitation of the value function and stabilizes the training process.
• Delayed Policy Updates. As the output of the target policy network is used to compute the target of the value function, the agent can be brittle because of frequent policy updates, so TD3 adopts the Delayed Policy Updates trick: the policy network is updated less frequently than the value function network. This helps to suppress training fluctuation and makes the learning process more stable.
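The three tricks can be combined in a minimal sketch of the TD3 target. Several details here are assumptions rather than settings from this paper: the noise is clipped to a range `noise_clip` as in the original TD3 formulation, the action bounds are taken as $[-1, 1]$, the policy is updated every `policy_delay` value updates, and the networks are replaced by hypothetical scalar callables.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def td3_target(r, s_next, done, q1_targ, q2_targ, mu_targ, gamma=0.99,
               sigma=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0, rng=random):
    """Target with smoothed target action and the smaller of two Q estimates."""
    # Target policy smoothing: zero-mean Gaussian noise on the target action
    eps = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    a_next = clip(mu_targ(s_next) + eps, a_low, a_high)
    # Clipped double-Q: take the smaller of the two target value estimates
    q_min = min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min


def should_update_policy(step, policy_delay=2):
    """Delayed policy updates: update the policy once every policy_delay steps."""
    return step % policy_delay == 0


# Hypothetical stand-ins for the target networks
y = td3_target(1.0, 0.0, 0.0,
               q1_targ=lambda s, a: 10.0 * a,
               q2_targ=lambda s, a: 4.0,
               mu_targ=lambda s: 0.5,
               gamma=0.5, sigma=0.0)
```

Taking the minimum of the two estimates biases the target downward, which counteracts the overestimation that a single value function network tends to accumulate.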
(3) SAC
SAC is an RL framework that maximizes the entropy of the strategy in addition to the cumulative return. It applies the learning techniques of DDPG to the learning of stochastic strategies and optimizes them in an off-policy learning mode.
The agent starts from the initial state $s_0 \sim p(s_0)$, samples from the policy distribution $a_t \sim \pi(\cdot \mid s_t)$, and gets an action $a_t$ acting on the environment. Then, the environment returns a reward $r(s_t, a_t)$ and transfers to a new state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ according to the environmental model. By repeating this interaction, the trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is obtained. The probability distribution of the trajectory $\tau$ under the strategy $\pi$ is expressed as
$$\rho_\pi(\tau) = p(s_0) \prod_t \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Maximum entropy RL optimizes both the cumulative return and the entropy of the strategy. For Markov decision processes (MDPs) with infinite-horizon discounted rewards, the optimization objective is the expected discounted sum of the reward and the entropy of the strategy, and the optimal policy is the one that maximizes this objective. Here, $-\log \pi(a_t \mid s_t)$ is the entropy term of the strategy distribution, which is added to the optimization target to encourage the agent to explore the environment in training and to improve the robustness of the training results. $\alpha$ is the temperature coefficient, which adjusts the importance of the entropy and thus regulates the randomness of the optimal strategy. A large $\alpha$ encourages the agent to explore the environment: the larger $\alpha$ is, the more stochastic the strategy will be, while the smaller $\alpha$ is, the more likely it is that the policy falls into a local optimum. When $\alpha \to 0$, maximum entropy RL degenerates into conventional RL that maximizes the cumulative reward.
Based on the optimization objective above, the state value function and the action value function are defined to include the entropy term, and they are related through the Bellman equation. SAC makes use of "Clipped Double Q-learning", and its Q-learning is similar to that of TD3 except for the computation of the target value function, which follows from Equation (24). The update of the policy network makes use of the re-parameterization trick, in which the action distribution is Gaussian; $\mu_\theta(s)$ and $\sigma_\theta(s)$ are the mean value and variance of the Gaussian distribution, respectively, and $\xi$ is sampled from the standard Gaussian distribution. After sampling from the distribution, the action output is restricted to the constrained range through the Tanh activation function. Besides, SAC makes use of the clipped double-Q trick when updating its strategy, so that the smaller of the two value function outputs enters the strategy optimization objective. Because of its inherent stochasticity, SAC can effectively avoid the overexploitation of the value function.
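The re-parameterized sampling and the entropy-regularized target can be illustrated with a scalar sketch. It relies on assumptions beyond the text: the second output of the policy is treated as a standard deviation `std`, the log-probability uses the usual change-of-variables correction for the Tanh squashing, and the target takes the common form $y = r + \gamma (1 - d)\left(\min(Q_1, Q_2) - \alpha \log \pi(a' \mid s')\right)$. All names and numbers are hypothetical.

```python
import math
import random


def sample_squashed_action(mean, std, rng=random):
    """Draw a = tanh(mean + std * xi), xi ~ N(0, 1), and return (a, log_prob)."""
    xi = rng.gauss(0.0, 1.0)
    u = mean + std * xi  # re-parameterization: randomness enters only through xi
    a = math.tanh(u)     # squash the action into (-1, 1)
    # Gaussian log-density of u, then the change-of-variables term for tanh
    log_prob = -0.5 * xi * xi - math.log(std) - 0.5 * math.log(2.0 * math.pi)
    log_prob -= math.log(1.0 - a * a + 1e-6)
    return a, log_prob


def sac_target(r, done, q1_next, q2_next, log_prob_next, alpha=0.2, gamma=0.99):
    """Entropy-regularized target using the smaller of two Q estimates."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * (1.0 - done) * soft_value


rng = random.Random(0)
action, log_prob = sample_squashed_action(mean=0.0, std=1.0, rng=rng)
y = sac_target(1.0, 0.0, q1_next=5.0, q2_next=4.0, log_prob_next=log_prob)
```

Because the random draw $\xi$ is independent of the policy parameters, gradients can flow through `mean` and `std` when the same computation is written with differentiable tensors.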

Soft Landing with DRL
Based on the dynamic model established above, in this section we design an RL-based algorithm according to the characteristics of the soft landing problem, including the selection of observations, the design of the reward function, and other settings concerning how the agent interacts with the environment.

Reward Setting
The reward function is an index to evaluate the behavior of agents, which is directly related to the training result.
• Goal achieving reward: When the altitude of the lander is less than 0, the velocity is downward, the speed is less than the upper limit of the soft landing speed, and the attitude angles and angular rate are within the limited ranges, the lander is considered to have achieved soft landing and gets the reward $r_{goal} = \lambda\,(h < 0 \text{ and } v_z < 0 \text{ and } v < v_{lim} \text{ and } \phi < \phi_{lim} \text{ and } \theta < \theta_{lim} \text{ and } \psi < \psi_{lim} \text{ and } \omega < \omega_{lim})$, where $\lambda$ is a large positive constant serving as a soft landing bonus, $\phi_{lim}$, $\theta_{lim}$, $\psi_{lim}$ are the upper bounds of the Euler angles of the lander, and $\omega_{lim}$ is the upper bound of the angular rate.
• Velocity tracking reward: At the beginning of the powered descent phase, the lander is several kilometers away from the landing zone, and the initial velocity is around 100 m/s. If the agent is rewarded only when it achieves a soft landing at the target area, the reward is so sparse that training can hardly converge. Therefore, we transform the soft landing problem into a velocity tracking problem. A process reward is introduced in the landing process: a reference velocity is given according to the real-time relative position between the lander and the target landing area, with constant coefficients $k_{v1}$ and $k_{v2}$ that determine the mapping between position and reference velocity; larger coefficients lead to a smaller reference velocity in the powered descent process. The reward is then given according to the deviation between the real velocity and the reference velocity of the lander, weighted by the reward coefficient $\beta$ of the velocity tracking error.
• Crash penalty: To avoid the crash of the lander, a penalty is included in the reward. When the attitude angle or speed deviation exceeds the threshold, the episode terminates and the environment returns a large negative reward as a penalty, set by the attitude crash penalty $\eta$.
• Fuel consumption penalty: In planetary exploration missions, the fuel carried by the lander is limited, so the fuel consumption should be minimized. A reward term regarding fuel consumption is defined, in which $\alpha$ weights the penalty on fuel consumption. The fuel consumption coefficient $\alpha$ and the velocity tracking error coefficient $\beta$ explicitly control the trade-off between fuel consumption and velocity tracking. With higher $|\alpha|$ and lower $|\beta|$, fuel consumption weighs more in the reward and the lander trades some velocity tracking performance for lower fuel consumption.
• Constant reward: Notice that the rewards $r_{vel}$, $r_{crash}$, and $r_{fuel}$ are all negative. To encourage the agent to explore more, a positive constant reward is introduced into the reward.
Therefore, the overall reward is as follows:
$$r = r_{fuel} + r_{vel} + r_{crash} + r_{constant} + r_{goal} \quad (36)$$
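The way the five terms combine can be sketched as follows. The individual term shapes are assumptions made only for illustration, not the exact expressions of this paper: the velocity term is taken as $-\beta \lVert v - v_{ref} \rVert$, the fuel term as $-\alpha \sum_i T_i$, and the crash and goal terms as constants applied when the corresponding event occurs. The coefficients are treated as positive magnitudes, and all default values are hypothetical.

```python
import math


def total_reward(v, v_ref, thrusts, crashed, landed,
                 alpha=0.01, beta=1.0, eta=100.0, lam=100.0, r_constant=1.0):
    """Sum of fuel, velocity tracking, crash, constant, and goal terms."""
    vel_err = math.sqrt(sum((vi - ri) ** 2 for vi, ri in zip(v, v_ref)))
    r_vel = -beta * vel_err            # process reward: velocity tracking error
    r_fuel = -alpha * sum(thrusts)     # penalty growing with total thrust
    r_crash = -eta if crashed else 0.0 # large penalty on constraint violation
    r_goal = lam if landed else 0.0    # bonus on successful soft landing
    return r_fuel + r_vel + r_crash + r_constant + r_goal


# Hypothetical step: 5 m/s tracking error, six thrusters at 100 N each
r = total_reward(v=(3.0, 4.0, 0.0), v_ref=(0.0, 0.0, 0.0),
                 thrusts=[100.0] * 6, crashed=False, landed=False)
```

Raising `alpha` relative to `beta` makes the thrust term dominate, which corresponds to the trade-off between fuel consumption and velocity tracking described above.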

Observation Space
To improve the landing performance, it is necessary to include position, velocity, attitude, and angular rate into the observation s t . Based on the analysis of reward setting, the absolute velocity in s t is replaced by the velocity deviation in the lander body frame.
Tracking velocity deviation rather than direct position can improve the generalization ability of the trained agent.

Action Space
The lander's action is the vector of engine thrusts, which are bounded in a specific range. After going through the Tanh activation function, the output of the policy network is bounded to $a_i \in [-1, 1]$. Then, the actual thrust is obtained through the linear mapping
$$T_i = T_{min} + \frac{a_i + 1}{2}\,(T_{max} - T_{min})$$
where $T_{min}$ and $T_{max}$ are the lower and upper bounds of the thrust, respectively.
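The mapping from the bounded network output to a physical thrust can be checked with a short sketch. The affine form used here is the increasing linear map that sends $-1$ to $T_{min}$ and $1$ to $T_{max}$; the function name and the thrust bounds are hypothetical.

```python
import math


def action_to_thrust(a, t_min, t_max):
    """Map a normalized action a in [-1, 1] linearly onto [t_min, t_max]."""
    return t_min + 0.5 * (a + 1.0) * (t_max - t_min)


raw_output = 0.3                      # hypothetical pre-activation network output
a = math.tanh(raw_output)             # bounded to (-1, 1) by Tanh
thrust = action_to_thrust(a, t_min=0.0, t_max=200.0)  # hypothetical bounds [N]
```

Since Tanh keeps the action inside $(-1, 1)$, the resulting thrust always satisfies the thrust constraint without any additional clipping.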

Network Architecture
We use the deep learning framework PyTorch to build the neural networks. The hyperparameters, such as the network learning rates and noise variance, are given in Section 4. DDPG, TD3, and SAC all contain value function networks and policy networks. In this paper, all the value function networks have the same structure, as shown in Figure 4a. We employ three hidden layers to process the vector obtained by concatenating the observation and the action. All the hidden layers contain 200 nonlinear units and the activation function is ReLU.
The policy network structure of DDPG and TD3 is the same, as shown in Figure 4b. The network includes three hidden layers, of which the activation function is ReLU, and the activation function of the output layer is Tanh, through which the action is normalized to [−1, 1].
Different from the deterministic policies of DDPG and TD3, SAC uses a stochastic policy, and the structure of its policy network is shown in Figure 4c. The policy network consists of two paths with the same structure, which output the mean value and variance of the Gaussian distribution, respectively. The network output is then obtained by sampling followed by the activation function.

Simulation Results and Discussion
In this section, the simulation experiments are carried out, and the results and related discussions are presented.

Simulation Settings
We train the lander velocity controller in a 6DOF environment established in Section 2. The environmental parameters settings are shown in Table 1.

The hyperparameter settings of the DDPG, TD3, and SAC algorithms are shown in Tables 2-4, respectively. Because the training of DDPG is unstable, the learning rate of its value function is set lower than that of TD3.

Simulation Results
We deployed training algorithms according to the setting of environment and algorithm parameters listed above, and the reward change curves of DDPG, TD3, and SAC in the training process are shown in Figure 5.
The dark red curve is the average reward. As shown in Figure 5a, the episode reward of DDPG starts to rise at around episode 10,000 and continues to increase until episode 20,000. Though the average reward curve looks stable, the episode reward fluctuates between 100 and 400, which is very unstable. The reward curve of TD3 shows that the performance of the TD3 agent improves significantly after 700 episodes of training, and after 5000 episodes the average reward converges at 450. Compared with DDPG, TD3 learns faster and performs more stably, which is the result of the three tricks of "Clipped Double-Q Learning", "Target Policy Smoothing", and "Delayed Policy Updates".
The training of the SAC agent shows significant improvements at around episodes 3000 and 5000, after which it stabilizes and the reward converges to 500. Due to the use of "Clipped Double-Q Learning" as in TD3, plus the inherent smoothing characteristics of the stochastic strategy, the training process of SAC fluctuates within a very small range, and the exploration of the environment is sufficient. The accumulated rewards of some episodes are close to 600, which is higher than those of DDPG and TD3.
The reference velocity is set as $v_d = [0\ 0\ 0]^T$ m/s, and the policy obtained from training is used to control the velocity. With the initial velocity $v_b^0 = [-2.0\ 2.0\ 10.0]^T$ m/s, the curves of the velocity deviation are shown in Figure 6.
All of the trained controllers can keep the velocity deviation within a certain range, but there is a continuous oscillation in the velocity controlled by the DDPG agent. The agent trained by the SAC algorithm performs the best: it drives the velocity error to 0 and keeps it stable. Besides the velocity tracking experiments, we validate the trained policy of each algorithm by soft landing tests. The landing statistics over 100 shooting experiments are shown in Table 5. Among the three, SAC has the highest landing success rate of 96%, while DDPG terminates 26 times due to attitude constraint violations, which are caused by the continuous oscillation when tracking the reference velocity. In the soft landing process, the reference velocity changes dynamically, which makes the oscillation more serious and leads to a higher failure rate. Taking the target landing site as the origin, the landing trajectories and landing point distributions of the three controllers are shown in Figures 7 and 8, respectively. According to the distribution of landing points, DDPG and TD3 have nearly the same accuracy of 100 m in successful landing cases, while some landing points of SAC are 200 m from the origin.

Conclusions
This paper presents an end-to-end soft landing control algorithm based on RL. First, the 6DOF soft landing dynamics model is established and the soft landing problem of the powered descent phase is formulated. The theoretical basis of RL is briefly introduced. Then, to solve the problem of sparse reward, which makes it hard for the policy to converge, a reward function including a process reward is designed. Besides, a fuel consumption penalty and a constraint violation penalty are included in the reward function to optimize fuel consumption and keep the attitude angle within constraints. Moreover, the network architecture of the RL algorithms used is designed. The value functions of DDPG, TD3, and SAC are approximated by deep neural networks that have the same architecture. Both DDPG and TD3 have a policy network that outputs a deterministic action, while the action output of SAC is sampled from a Gaussian distribution characterized by the output of its policy network. Finally, training simulations are carried out to evaluate the proposed algorithm. The results show that the performance varies between RL frameworks, and the agent trained by SAC tracks the reference velocity best. In addition, the trained policies are deployed in soft landing experiments, the results of which demonstrate the validity of the proposed algorithm. Future work will focus on stability guarantees for the RL-based soft landing algorithm, which are of great importance for mission success in space exploration.