Safety-Aware Optimal Attitude Pointing for Low-Thrust Satellites

Abstract: In geostationary orbit, long eclipses and seasonal variations in the direction and intensity of the solar input can damage sensitive equipment during attitude maneuvers, which may inadvertently point the equipment towards the Sun. The requirement that transmitting and receiving antennae remain pointed towards the Earth further restricts the admissible pointing directions. The aim of this study is to construct a novel geometric and reinforcement-learning-based method for determining attitude guidance maneuvers that keep the equipment in safe and operational orientations throughout a maneuver. The attitude trajectory is computed numerically using the geometric framing of Pontryagin's maximum principle applied to the vehicle kinematics in the global matrix Lie group representation on SO(3), and the angular velocities are shaped using free parameters. The values of these free parameters are determined by a reinforcement learning algorithm so as to avoid the forbidden areas while keeping the pointing in operational areas (modeled as subsets of the two-sphere of all possible pointing directions of a particular axis). The method is applied to a model geosynchronous satellite and demonstrated in simulation.


Introduction
Earth-pointing satellites are sometimes required to perform attitude maneuvers from one direction in space to another in such a way that, en route, payloads such as telescopes do not see bright objects such as the Sun or Moon, while maintaining communication with the ground via the beam pencil of a communications antenna. To respect these pointing constraints, we require the satellite orientation to remain within a particular pointing set. The pointing set is a subset of the satellite's configuration set that guarantees safety (in the case of bright-spot avoidance) or achieves a mission outcome (in the case of the pointing of communications antennae). The parallel concept of a safe set in robot guidance traditionally refers solely to the robot's [x, y, z] position; in this paper, we extend this idea to orientation.
Several approaches to the overall constrained attitude motion planning problem have been developed. For instance, Reference [1] developed attitude commands from a geometrical perspective, defining exclusion (or keep-out) zones on a unit sphere and determining ideal tangential paths around these zones. In [2], the problem was approached by applying the potential function method: artificial potential functions guide the satellite during the attitude maneuver and avoid the violation of pointing constraints by overlaying regions of high potential around the forbidden regions. In [3], the unit sphere was discretized into a graph, and an admissible path between attitude keep-out zones was found with the A* pathfinding algorithm. Randomized path planning algorithms were used by [4]: solution paths were chosen at random, and a tree of possible paths was grown towards the target direction, from which the lowest-cost admissible path was chosen. In contrast to most other methods in the literature, Reference [4] treated the matter directly on the special orthogonal group SO(3). Using SO(3) to represent the spacecraft's rotation allows pointing constraints to be imposed on multiple body-fixed directions. Biggs [5] found reference motions for reorienting a spinning satellite. The path-planning problem with pointing constraints was addressed using optimization of an analytically defined cost function that includes free parameters. The analytical expression was written explicitly on the special orthogonal group SO(3) (instead of employing a local parameterization such as Euler angles). However, no explicit way of determining the parameters so as to respect the pointing constraints was presented.
One way of determining the parameters required to preserve the pointing set is to "teach" the satellite's guidance system, via a learning algorithm, to respect the boundaries of the pointing set. In reinforcement learning (RL), an agent (the vehicle) maps observations (states) to actions via a policy function, which can generally be expressed as a neural network, so as to maximize its reward, a scalar feedback signal. The goal of RL is to determine actions that generate a desired system behavior.
The application of RL to obstacle avoidance is not new; the study [6] used an artificial potential field to design the positive rewards of the RL algorithm. According to the different requirements of the task, rewards were modified in the training process to obtain different paths. A control strategy with learning capabilities for unknown environments was demonstrated in [7] using RL with only sparse reward information. A discrete coding of the input space using a neural network structure was presented, which enabled a faster and more efficient convergence of the RL process. Preservation of a safe set is a fundamental concept in motion planning: catastrophic outcomes are avoided by keeping the motion within constraints defined by safety. Collision avoidance is the most common example of this kind of motion planning. A method for generating plan-like, reflexive obstacle-avoidance behavior in a mobile robot was described in [8]; the paper showed experiments using a simulated vehicle with a primitive range sensor. As the vehicle explored its surroundings, it adapted its responses to sensory stimuli so as to minimize the negative reinforcement arising from collisions. The problem of a mobile robot learning to navigate an unknown environment while avoiding collisions was considered in [9]: an uncertainty-aware model-based learning algorithm estimates the probability of collision together with a statistical estimate of uncertainty. In order to learn collision avoidance, the robot must experience collisions at training time; the vehicle then moves at higher velocities only in settings where it has high confidence. In [10], a tailored design of state and action spaces and a dueling deep Q-network (a deep RL network for autonomous navigation and obstacle avoidance) were proposed for unmanned surface vehicles.
In [11], a method was given for enabling a quadrotor equipped with a monocular camera to autonomously avoid collisions with obstacles in unstructured environments. The authors proposed a deep RL-based method for obstacle avoidance that uses recurrent neural networks with temporal attention and provides better results than prior works in terms of distance covered without collisions. The paper [12] focused on the application of RL to obstacle avoidance in dynamic environments for the control of a mobile robot. An intelligent controller was proposed by integrating RL with a behavior-based control architecture and was applied to obstacle avoidance. Reference [13] considered the problem that many autonomous systems rely strongly on black-box predictions from deep neural networks, which tend to be overconfident about unseen data. The importance of predictions that are robust to this is evident for safety-critical applications such as collision avoidance. The paper used an RL framework to form uncertainty-aware navigation around pedestrians.
Our aim in this paper is to use RL in the planning layer of the architecture to produce a trajectory between two orientations that maintains the spacecraft in the pointing set throughout the maneuver. The approach combines and extends that described in [5]: we describe the system kinematics on SO(3) and then formulate the problem as an optimal kinematic control problem on a matrix Lie group. An application of Pontryagin's maximum principle (PMP) gives the necessary conditions for optimality in the form of extremal equations, which, as in [5], contain a number of unknown parameters. RL is applied to adapt the shape of the curve to avoid a forbidden region, by selecting the parameter values such that the path remains in the pointing set. These amended parameter values are then used to construct a new trajectory that maintains the satellite in the pointing set.

Methods
We considered the case of a satellite in geostationary orbit equipped with a communications antenna with a relatively small beam pencil that must be pointed towards the communications ground station at all times, and a sensitive on-board telescope that must avoid a bright celestial object. Expressing the attitude as R(t) ∈ SO(3), the kinematics are:

Ṙ(t) = R(t)(ω_1 A_1 + ω_2 A_2 + ω_3 A_3), (1)

where R(t) ∈ SO(3), the ω_i are the body-frame angular velocities, and A_1, A_2, A_3 is the standard basis of the Lie algebra so(3) (the skew-symmetric matrices generating rotations about the respective body axes). We developed a three-step method that solves the attitude guidance problem of determining the desired ω_1, ω_2, ω_3 and solving Equation (1) numerically such that the boundary conditions:

R(0) = R_0, R(T_f) = R_d

(where R_0 is the prescribed initial attitude and R_d the desired final attitude) are satisfied. Figure 1 shows the local vertical and local horizontal (LVLH) reference frame, which has its origin at the satellite center of mass and orthogonal axes (denoted by x, y, z), with the z-axis pointing from the origin of the frame towards the Earth center, the x-axis in the orbital plane in the direction of orbital motion and perpendicular to the z-axis, and the y-axis completing the frame by the right-hand rule. A second, body-fixed, set of axes (denoted by X, Y, Z), also centered at the satellite center of mass, moves with the body, and the spacecraft attitude is then defined as the relative angle from the local-level coordinates to the body frame. The line-of-sight unit vector ℓ_0 is along the initial direction of the boresight, and ℓ_f denotes its final desired direction. The figure depicts a bright object to be avoided by the telescope. The centroid of the bright point, s (from the Sun), is specified by an azimuth angle α* in the slew plane ℓ_0–ℓ_f and an elevation angle β* that measures its angle from the Y-axis. Although the Sun moves with respect to the Earth-centered frame, we may assume for our purposes that, since the maneuver time is significantly shorter than a solar day, the Sun's position is fixed in this frame.
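The kinematics (1) can be integrated directly on the group. The following is a minimal sketch (not the paper's implementation): a Lie-group Euler step that advances R(t) through the closed-form exponential of an so(3) element, so every iterate remains a rotation matrix by construction. The helper names (`hat`, `expm_so3`, `propagate`) are illustrative.

```python
import numpy as np

def hat(w):
    """Map a body-rate vector w = [w1, w2, w3] to the so(3) matrix
    w1*A1 + w2*A2 + w3*A3 (the standard skew-symmetric basis)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def expm_so3(W):
    """Rodrigues' formula: closed-form matrix exponential on so(3)."""
    theta = np.sqrt(0.5 * np.trace(W.T @ W))   # rotation angle |w|
    if theta < 1e-12:
        return np.eye(3) + W
    return np.eye(3) + (np.sin(theta) / theta) * W \
        + ((1.0 - np.cos(theta)) / theta**2) * (W @ W)

def propagate(R0, omega_of_t, t_grid):
    """Integrate Rdot = R * hat(omega(t)) with a Lie-group Euler step,
    so every iterate stays on SO(3) up to floating-point error."""
    R = R0.copy()
    out = [R.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        R = R @ expm_so3(hat(omega_of_t(t0)) * (t1 - t0))
        out.append(R.copy())
    return out
```

Because each step is an exact group exponential, no reprojection onto SO(3) is needed, which is the property the paper checks numerically via R(t)R(t)ᵀ − I.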
The celestial object must be avoided by a minimum specified angle, denoted θ_spec; that is, the dot product of the Sun's position vector s with the boresight axis position a must fall within the set:

S_1 = {a : s · a ≤ cos(θ_spec)}.

Bright Spot Avoidance
This set describes a region bounded by a cone in R³, and the intersection of this cone with the unit sphere describes the boundary of the forbidden region for the orientation of the boresight axis. In this paper, the boresight axis is denoted by a = [2, 2, 1], and the trajectory traced by the boresight is then given by:

a(t) = R(t)a.

This trajectory a(t) must lie within the set S_1 at all times.

Communications Pointing Maintenance
In addition to the telescope of the previous section, we considered that the spacecraft has a communications antenna with a beam pencil centered on the Z-axis with pitch φ. We wished to ensure that the beam pencil does not pass out of contact with a ground station directly at the nadir; i.e., the Z-axis of the body-fixed frame must remain within sin(2φ) of the z-axis of the LVLH frame. This criterion requires that the antenna axis fall within the set defined by the inequality:

S_2 = {b = [x_b, y_b, z_b] : x_b² + y_b² ≤ sin²(2φ), z_b > 0}.

The boundary of this region is a cone in R³, and the intersection of this cone with the unit sphere describes the boundary of the region in which the antenna axis must remain. In this paper, the antenna axis is denoted by b = [0, 0, 1], and the trajectory traced by the antenna is then given by:

b(t) = R(t)b.

This trajectory b(t) must lie within the set S_2 at all times.
In what follows, maintenance of the pointing set means that:

a(t) ∈ S_1 and b(t) ∈ S_2 (8)

for all t ∈ [0, 1]. The boundaries of these two sets are shown on the sphere in Figure 2.
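Condition (8) combines the two membership tests. The sketch below assumes the Sun-cone test s · a(t) ≤ cos(θ_spec) and the perpendicular-distance form of the antenna test described above; the function name is illustrative:

```python
import numpy as np

def in_pointing_set(R, a_body, b_body, s_sun, theta_spec, phi):
    """Membership test for condition (8): the telescope boresight must stay
    outside the Sun cone, and the antenna axis b(t) = R(t) b must stay within
    perpendicular distance sin(2*phi) of the LVLH z-axis (with z_b > 0)."""
    a_t = R @ (np.asarray(a_body, dtype=float) / np.linalg.norm(a_body))
    b_t = R @ (np.asarray(b_body, dtype=float) / np.linalg.norm(b_body))
    sun_ok = s_sun @ a_t <= np.cos(theta_spec)
    nadir_ok = b_t[0]**2 + b_t[1]**2 <= np.sin(2.0 * phi)**2 and b_t[2] > 0.0
    return bool(sun_ok and nadir_ok)
```

Applying this test at every sample of the trajectory implements the "for all t ∈ [0, 1]" requirement of (8) in discretized form.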

Trajectory Generation
We determined the solution g(t) ∈ G of the left-invariant differential system (1) that satisfies the boundary conditions:

g(0) = R_0, g(T_f) = R_{T_f}, (9)

while minimizing the expression:

J = (1/2) ∫₀^{T_f} (c_1 ω_1(t)² + c_2 ω_2(t)² + c_3 ω_3(t)²) dt. (10)

This cost J in Equation (10) allows one to define a large class of trajectories, since the weights c_i of the cost function can be chosen to alter the shape of the path between two prescribed boundary configurations. A standard geometric control theory approach was used to generate trajectories minimizing the cost J; for context, as well as for the nomenclature used in this section, the reader (particularly one unfamiliar with geometric control theory) is directed to Appendix A.
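Assuming the weighted-energy form J = (1/2)∫ Σ_i c_i ω_i(t)² dt implied by the free weights c_i, the cost can be estimated from uniformly sampled angular velocities by simple quadrature. This helper is an illustrative sketch, not the paper's code:

```python
import numpy as np

def cost_J(omega_samples, c, T_f):
    """Trapezoidal estimate of J = 1/2 * integral over [0, T_f] of
    sum_i c_i * w_i(t)^2, from uniformly sampled body rates
    (each row of omega_samples is [w1, w2, w3] at one time)."""
    w = np.asarray(omega_samples, dtype=float)
    integrand = 0.5 * (w**2 @ np.asarray(c, dtype=float))
    dt = T_f / (len(integrand) - 1)
    # trapezoidal rule on a uniform grid
    return dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
```

Varying the weights c_i reweights the three body axes in the integrand, which is exactly the mechanism the method exploits to reshape the path.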
As stated in Appendix A, determining the optimal trajectory minimizing the cost J requires integrating the kinematic equations (Equation (1)) and the extremal equations (Equation (A2)) simultaneously over the time interval [0, T_f]. To do this, we used numerical integration in the following way: collecting the extremal variables and the entries of g into a single state vector P, then, by the PMP, the optimal values of the velocities are given by:

ω_i = M_i / c_i, i = 1, 2, 3,

and, for g ∈ SO(3) the solution of the kinematic equation, we may assign:

[P_4 P_5 P_6; P_7 P_8 P_9; P_10 P_11 P_12] = [g_11 g_12 g_13; g_21 g_22 g_23; g_31 g_32 g_33]. (13)

Using the fact that, from (1), Ṙ(t) = R(t)(ω_1 A_1 + ω_2 A_2 + ω_3 A_3) (where R(t) denotes the desired trajectory), we may express the kinematic and extremal equations together in vector form:

Ṗ = f(P). (15)

The initial value is given by:

P_0 = [λ_0, R_11^0, . . . , R_33^0],

where R_ij^0 are the i, j components of the matrix R_0 in Equation (9), and λ_0 = [p_1^0, . . . , p_3^0, M_1^0, . . . , M_3^0] is the unknown initial vector for the extremal equations. Integrating (15) starting from P_0 with an initial guess for λ_0 and rearranging the solution components using the function Φ, we then obtained Φ([P_4(t), . . . , P_12(t)]) = R(t), the matrix that gives the optimal solution in the configuration space. Note: we are aware that, for integrating Equation (1) developed on a matrix Lie group, symplectic integration is preferable to a direct numerical integration of (15). However, we calculated the deviation from the Lie group over the time interval [0, T_f] used in the simulations presented in Section 3 by evaluating the norm of the difference between R(t)R(t)ᵀ and the identity matrix along the whole trajectory, to determine whether the structure of SO(3) was preserved. Figure 3 shows that the difference is of the order of magnitude of 10⁻¹³, while Figures 4 and 5 also strongly suggest this result, since the curves remain on the surface of the sphere throughout the maneuver.
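The reshaping function Φ of Equation (13) and the SO(3)-drift diagnostic described in the note can be sketched as follows (illustrative helpers, assuming row-major packing of the entries of g):

```python
import numpy as np

def Phi(P4_to_P12):
    """Rearrange the flat solution components [P4, ..., P12] back into
    the 3x3 attitude matrix R(t), as in Equation (13)."""
    return np.asarray(P4_to_P12, dtype=float).reshape(3, 3)

def so3_drift(R_samples):
    """Frobenius norm of R R^T - I along a trajectory: the diagnostic used
    in the text to check how far the integrated attitude leaves SO(3)."""
    return [float(np.linalg.norm(R @ R.T - np.eye(3))) for R in R_samples]
```

Plotting `so3_drift` over the maneuver reproduces the kind of check shown in Figure 3; values near machine precision indicate that the group structure is effectively preserved.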
The following steps were then used to solve the optimal control problem of moving the vehicle from the initial to the final configuration:
• Step (1): Choose initial values c_i^0 and determine the solution R(1, λ_0, c_i^0) of (1) that minimizes the cost (10) using (15), in terms of the unknown boundary conditions λ_0, on the imaginary time domain [0, 1].
• Step (2): Construct the "error matrix" R̃(1, λ_0, c_i^0) = R(1, λ_0, c_i^0) − R_{T_f} and the function S(λ_0).
• Step (3): Find λ_0 that minimizes S(λ_0) using a local minimizer.

After Step 3 was completed, an initial trajectory was determined in terms of the particular choice of variables c_i^0. We then applied RL to allow the vehicle to reactively choose a new optimal path that avoids the obstacles by adapting the c_i values away from the chosen initial values c_i^0. The initial choice of one for each c_i value was made, which, in the case that energy is being evaluated, corresponds to equal inertia values for each axis.
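This procedure amounts to single shooting over the unknown initial costates λ_0. The sketch below illustrates the idea with a representative left-invariant extremal field (ω_i = M_i/c_i together with the reduced Lie-Poisson dynamics Ṁ = M × ω) standing in for the paper's full Equation (A2), which also carries p-states that we omit; all function names are our own:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def extremal_rhs(t, P, c):
    """Flattened kinematics plus a representative extremal field.
    P[0:9] holds the entries of R (row-major); P[9:12] holds costates M.
    Assumes omega_i = M_i / c_i and Mdot = M x omega (a stand-in for
    the paper's full extremal equations (A2))."""
    R = P[:9].reshape(3, 3)
    M = P[9:]
    w = M / np.asarray(c, dtype=float)
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    return np.concatenate([(R @ W).ravel(), np.cross(M, w)])

def shoot(lam0, R0, c):
    """Integrate from P0 = [R0, lam0] over the unit time domain [0, 1]."""
    P0 = np.concatenate([R0.ravel(), lam0])
    sol = solve_ivp(extremal_rhs, (0.0, 1.0), P0, args=(c,),
                    rtol=1e-10, atol=1e-10)
    return sol.y[:9, -1].reshape(3, 3)

def solve_lambda0(R0, Rd, c, lam_guess):
    """Step 3: choose lambda0 to minimize S(lambda0) = ||R(1) - Rd||^2
    via nonlinear least squares on the entries of the error matrix."""
    res = least_squares(lambda l: (shoot(l, R0, c) - Rd).ravel(), lam_guess)
    return res.x
```

With the weights c fixed, the only unknowns are the three initial costates, so a standard local least-squares solver plays the role of the "local minimizer" of Step 3.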

Reinforcement Learning
In this subsection, we specify the observation and reward functions that we used in conjunction with MATLAB's Reinforcement Learning Toolbox software. In this software, the policy that selects the best action to take is represented as a neural network. In this case, the action is an amendment of the values c_1^0, c_2^0, c_3^0; the action output is added to the c_i^0 values selected in Step 1 in order to choose a new path such that the obstacle is avoided.
The observation sent to the RL block consists of detecting the boundaries of the sets S_1 and S_2, and the reward function r(t) is set up to ensure that the condition in (8) is maintained. In particular, when the antenna intersects the boundary of the pointing set S_2, the condition (denoting b = [x_b, y_b, z_b]) is expressed by a logical value:

O_2 = 1 if x_b² + y_b² ≥ r_2², and O_2 = 0 otherwise,

where r_2 = sin(2φ) and φ = 0.175. Since the antenna axis is taken as [0, 0, 1], the boundary of the region is a plane parallel to the x-y plane, intersecting the z-axis at z_b ≈ 0.79 for φ = 0.175; therefore, the coordinates of the boundary projected on the plane are given by X_b = sin(2φ) cos(θ), Y_b = sin(2φ) sin(θ) (for θ ∈ [0, 2π]), and z_b can be calculated from the inverse stereographic projection through the south pole:

z_b = (1 − r_2²)/(1 + r_2²).

To determine the boundary of the set S_1, we used the same projection from sphere to plane to map the vector a = [x_a, y_a, z_a] to the point [X_a, Y_a] and the Sun vector s = [x_s, y_s, z_s] to the point [X_s, Y_s], and we calculated the value:

d = sqrt((X_a − X_s)² + (Y_a − Y_s)²).

The intersection of a with the boundary of the set S_1 is then described by:

O_1 = 1 if d ≤ r_1, and O_1 = 0 otherwise,

where r_1 = 0.25. The reward is correspondingly calculated as:

r(t) = −1 if O_1 = 1 or O_2 = 1, and r(t) = 0 otherwise. (20)

The RL block then uses a neural network to calculate the three action outputs based on the values of the reward and observation functions. We chose the simple reward function (20) to avoid the issues of "reward shaping", which may lead the agent to make unnecessarily elaborate maneuvers to increase the reward. A drawback of the Reinforcement Learning Toolbox is that the neural network defining the policy function is a "black box", and so we had no real insight into the scaling involved. In this case, we noticed that the policy was outputting actions of the order of 10⁻³, so it was necessary to introduce a scaling of the actions to produce reasonable results.
After trial and error, we determined that a scaling of 100 was effective. Further insight into the neural networks used in the policy function could have assisted in choosing the scaling term less heuristically.
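The observation and reward computations described above can be sketched as follows. This is an illustrative reconstruction: the stereographic projection through the south pole, the boundary radii r_1 = 0.25 and r_2 = sin(2φ), and a sparse penalty on boundary crossings are taken from the text, while the function names and the exact reward values (−1/0) are our assumptions:

```python
import numpy as np

PHI = 0.175                      # antenna pitch parameter phi
R1 = 0.25                        # projected boundary radius of S1
R2 = np.sin(2.0 * PHI)           # projected boundary radius of S2

def stereo(v):
    """Project a point on the unit sphere onto the z = 0 plane through
    the south pole: (x, y, z) -> (x / (1 + z), y / (1 + z))."""
    return np.array([v[0] / (1.0 + v[2]), v[1] / (1.0 + v[2])])

def observation(a, b, s):
    """Boundary-detection flags sent to the RL block: each entry is 1.0
    when the corresponding constraint of (8) is violated."""
    hit_sun = np.linalg.norm(stereo(a) - stereo(s)) <= R1   # a enters Sun cone
    lost_link = b[0]**2 + b[1]**2 >= R2**2                  # antenna leaves S2
    return np.array([hit_sun, lost_link], dtype=float)

def reward(a, b, s):
    """Sparse reward in the spirit of (20): penalize any boundary crossing."""
    return -1.0 if observation(a, b, s).any() else 0.0
```

In this sketch, the 100× scaling discussed above would be applied to the raw policy outputs before they amend the c_i values, outside these functions.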

Choosing a New Optimal Path Using Reinforcement Learning
After the initial guess, the values c_i^0 were amended to the obstacle-avoidance values chosen by RL (which we denote by c_i^rl and which, as noted in the previous section, were determined by c_i^rl = c_i^0 + action), and we constructed a new trajectory that avoided the obstacle. We followed Steps (1)-(3) with the following amendments in order to construct the final trajectory:
• Step (1a): Determine the solution R(1, λ_0, c_i^rl) of (1) that minimizes the cost (10) using (15), in terms of the unknown boundary conditions λ_0, on the imaginary time domain [0, 1].
• Step (2a): Construct the "error matrix" R̃(1, λ_0, c_i^rl) = R(1, λ_0, c_i^rl) − R_{T_f} and the function S(λ_0).
• Step (3a): Find λ_0 that minimizes S(λ_0) using a local minimizer.

Results
We applied the steps described in Section 2 to create attitude maneuvers for a geostationary satellite from an initial configuration R_0 to a final configuration R_d. We show two examples, Path 1 and Path 2. Rather than simply plotting these two maneuvers, we ran a series of simulations on an incremented set of final positions for Paths 1 and 2 to create a class of curves starting from the initial point R_0 and going to a set of perturbations of the end point R_d. This showed that the method successfully adjusted each curve in the class to remain within the pointing set, rather than showing a single curve from R_0 to R_d, which could represent an exceptional case.
The incremented sets of final positions for Paths 1 and 2 were generated by a rotation matrix chosen such that the curves in the set, from the initial point R_0 to the final points in the sets R_i, would violate the conditions stated for each path. The plots of the maneuvers can be seen in Figures 4a,b and 5a,b. We chose Path 1 such that the original optimal trajectory could be seen to cause the telescope to pass across the bright spot s, while in Path 2, the original optimal trajectory designed for the antenna did not remain in the desired pointing region. One can see from the adapted trajectories in Figures 4b and 5b that the c_i^0 values were successfully updated so as to point the telescope a(t) away from the forbidden region at all times while maintaining the communications antenna curve b(t) in the desired region.

Discussion and Conclusions
This paper presented a method to obtain optimal attitude guidance controls for a spacecraft in geostationary orbit, avoiding pointing sensitive instrumentation towards the Sun during an attitude maneuver while maintaining the direction of a beam pencil. The method used RL to produce an optimal trajectory that reacts to the presence of obstacles in order to keep the configuration in the pointing set. The simulations carried out demonstrated that the designed trajectory avoided the forbidden pointing direction while maintaining the antenna in the operational pointing direction for two classes of paths, one that violated the bright-spot condition and a second that violated the antenna-pointing condition. When comparing the average CPU times for the two classes of solutions shown here (Path 1, 1.437 s, and Path 2, 2.113 s) with those reported in [5], which performs Lie group-based pointing while avoiding a bright spot (between 0.36 and 0.406 s), the comparison seems unfavorable. However, in that method, the user is required to first run the program with a "guessed" c_i value and then increment the c_i values manually, each time running the program and visually confirming that the trajectory avoids the forbidden region, a process that is certainly less efficient even if it is less computationally demanding. Therefore, a comparison of CPU times between the two papers is not entirely meaningful, and in terms of ease of operation, ours is certainly preferable.
A possible drawback of our method is that we assumed the attitude to be precisely known; in a real mission, there will be delays in the pose estimation. We also assumed that the initial and final orientations were chosen such that the Sun-avoidant and beam pencil vectors fell in the pointing set at the initial and final positions, which is reasonable, but a refinement could be to select a suitable replacement orientation if the initial or final orientation is chosen to fall outside of the pointing set. Another weakness is the "black box" nature of the policy function neural network. In addition, we constructed a simple reward function and did not perform any reward shaping; this was seen as preferable, since a poorly shaped reward function can produce a solution that is not ideal even though it yields the highest reward. This can be seen particularly in the fact that the vehicle passes the obstacle relatively closely and then overcompensates through the rest of the trajectory. Refinement of the reward function and the inclusion of time and sensor delays are areas to be addressed in future work.