Path Following Control for Underactuated Airships with Magnitude and Rate Saturation

This paper proposes a reinforcement learning (RL) based path following strategy for underactuated airships with magnitude and rate saturation. First, the Markov decision process (MDP) model for the control problem is established. Then, an error bounded line-of-sight (LOS) guidance law is investigated to restrain the state space. Subsequently, a proximal policy optimization (PPO) algorithm is employed to approximate the optimal action policy through trial and error. Since the optimal action policy is generated from the action space, magnitude and rate saturation are avoided. The simulation results, involving circular, general, broken-line, and anti-wind path following tasks, demonstrate that the proposed control scheme can transfer to new tasks without adaptation and possesses satisfactory real-time performance and robustness.


Introduction
As a kind of lighter-than-air vehicle, airships have distinct advantages over other vehicles, such as ultra-long-duration flight and low fuel consumption, making them a cost-effective platform for communication relay, monitoring, surveillance, and scientific exploration. Many countries regard airships as one of the most important platforms in near space and have developed airship techniques for decades [1-3].
With the rapid progress of airship technologies, the essential role of flight control has become more prominent. Path following is one of the most frequently performed tasks in flight control; it requires the airship to follow a predefined geometric path without temporal constraints. Due to inherent nonlinearity, unknown dynamics, parametric uncertainty, and external disturbance, airship path following control is a challenging research topic [4,5]. Numerous results on the path following problem have been reported, employing methods such as sliding mode control (SMC) [6,7], H-infinity control [8,9], backstepping control [4,10-12], fuzzy control [13-18], and so forth. In Reference [19], a computationally efficient observer-based model predictive control (MPC) method is presented to achieve arbitrary path following for marine vessels. In Reference [20], a path following controller for a quadrotor with a cable-suspended load is proposed. The coordinated path following control of multiple nonlinear autonomous vehicles connected by digital networks is studied in Reference [21]. In Reference [22], the authors investigate a reliable H-infinity path following controller for an autonomous ground vehicle.

• The capability of transferring between tasks is demonstrated by performing circular, general, and broken-line path following tasks, while the robustness is verified by circular path following under wind disturbance.
The rest of this paper is organized as follows: Section 2 presents the airship model. The controller design is described in Section 3. Section 4 presents simulation results for the proposed controller. The discussion and conclusions are given in Sections 5 and 6, respectively.

Airship Model
The airship studied in this paper is depicted in Figure 1. The static lift of the airship is provided by the helium contained in the envelope. The aerodynamic control surfaces attached to the rear of the airship, namely the tails and rudders, provide the control torques for the airship and enhance course stability. The propellers are mounted symmetrically on the sidewalls of the gondola fixed at the bottom of the envelope. To develop the airship model, the body reference frame (BRF) is first defined. The BRF is attached to the airship with its origin O coincident with the center of volume: Ox points towards the head of the airship, Oy is perpendicular to Ox and points to the right of the airship, and Oz is determined by the right-hand rule. The origin O_g of the earth reference frame (ERF) is fixed on the ground; O_g x_g and O_g y_g point north and east, respectively, and O_g z_g is determined by the right-hand rule.
According to Reference [45], the model of the airship is described as follows, where ζ = [x, y, z]^T and Θ = [φ, θ, ψ]^T are the position and attitude of the airship described in the ERF, respectively, and v = [u, v, w]^T and Ω = [p, q, r]^T are the velocity and angular velocity described in the BRF, respectively. The detailed expressions of Ā, N, Ḡ, B, u_F, and u_δ are given in Reference [45].
To facilitate the controller design, the pitch and roll motion in the vertical plane is ignored and only the horizontal motion is considered in this study [9]. Thus, the variables w, p, q, φ, θ, and z are set to zero. Consequently, the planar kinematic and dynamic state-space equations can be derived from (1) and (2), where X_1 = [r, u, v]^T is the state, X_2 = [ψ, x, y]^T is the measured output, and F = [F_T, δ_r]^T is the control input, with F_T the thrust force and δ_r the rudder deflection. The specific expressions of the variables above are given in Reference [9].
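The planar kinematic relations above can be sketched in code. This is a minimal sketch of the standard planar rigid-body kinematics only; the full dynamics and input matrices are given in Reference [9] and are not reproduced here.

```python
import numpy as np

def planar_kinematics(X1, X2):
    """Kinematic part of the planar model (standard form, a sketch).
    X1 = [r, u, v]: yaw rate and BRF velocities.
    X2 = [psi, x, y]: yaw angle and ERF position.
    Returns the time derivative of X2."""
    r, u, v = X1
    psi = X2[0]
    dpsi = r                                  # yaw kinematics
    dx = u * np.cos(psi) - v * np.sin(psi)    # ERF x-velocity
    dy = u * np.sin(psi) + v * np.cos(psi)    # ERF y-velocity
    return np.array([dpsi, dx, dy])
```

With zero side-slip (v = 0) and zero heading, the ERF x-velocity simply equals the forward speed u, as expected.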

Path Following Controller Design
In this section, a path following controller is developed for underactuated airships with magnitude and rate saturation. The proposed controller consists of two sub-systems: error bounded LOS guidance and PPO control. The guidance loop provides the desired heading angle and the target position, and the PPO loop computes a near-optimal solution in a bounded space. In the PPO control, the Markov decision process (MDP) model is constructed first. Then, the reward function is designed based on the MDP model. Finally, the optimization process of PPO is introduced. Figure 2 shows the structure of the controller. During the design process, the state space is reduced to an acceptable range, so that after adequate training most states in the space will have been experienced. Consequently, the desired action policy can be obtained, realizing real-time path following.

Error Bounded LOS Guidance
The underactuated airship is required to follow a continuous path ζ_p(ω) = [x_p(ω), y_p(ω)]^T parameterized by a time-independent variable ω. Suppose the target position provided by the path is ζ_p = (x_p, y_p); a path parallel frame (PPF) can then be defined: its origin is ζ_p, ζ_p x_p is parallel to the local tangent vector, and ζ_p y_p is the normal vector pointing to the right. The ERF must be rotated positively by an angle ψ_p to reach the PPF.
Thus, the tracking error in the PPF can be expressed as follows, where s is the along-track error, e is the cross-track error, and R(ψ_p) is the ERF-to-PPF rotation matrix. To reduce the state space, the target position ζ_c = (x_c, y_c) is selected as follows, where K_r > 0; the target position ζ_c then remains within a circle near the current position.
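The rotation of the position error into the PPF can be sketched as follows. This is a sketch of the standard planar rotation only; the exact form of R(ψ_p) used in the paper is given in the (omitted) display equation.

```python
import numpy as np

def ppf_tracking_error(zeta, zeta_p, psi_p):
    """Tracking error in the path parallel frame (a sketch).
    zeta, zeta_p: (x, y) of the airship and the path target in the ERF.
    psi_p: rotation angle from the ERF to the PPF.
    Returns the along-track error s and the cross-track error e."""
    R = np.array([[ np.cos(psi_p), np.sin(psi_p)],
                  [-np.sin(psi_p), np.cos(psi_p)]])
    s, e = R @ (np.asarray(zeta, float) - np.asarray(zeta_p, float))
    return s, e
```

When ψ_p = 0 the PPF coincides with the ERF, so s and e reduce to the raw position differences.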
Obviously, the tracking error ζ − ζ_p converges to zero as ζ − ζ_c converges to zero. Consider the Lyapunov function in Equation (11). Substituting Equation (10) into Equation (11) gives Equation (12), and differentiating Equation (12) with respect to time leads to Equation (13). Designing ψ and ω̇ as in Equations (14) and (15) [46], where k_e > 0 and k_s > 0, and substituting them into Equation (13) yields Equation (16). According to the derivation above, the desired yaw angle ψ and the derivative of the path parameter ω̇ are obtained. Since vertical-plane motion is not considered, the desired attitude Θ_c is also obtained.

MDP Model of the Airship
An MDP is a discrete-time stochastic control process providing a mathematical framework for optimization problems solved via RL. An MDP can be expressed as a tuple of four elements (S, A, P, R). At time step t, the state of the agent is S_t, and the controller chooses an action A_t based on S_t, independent of all previous states and actions. At the next time step t + 1, the agent moves to state S_{t+1} and gives the controller an immediate reward R, according to the state transition probability P. The airship's MDP model is described as follows.
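The interaction described above can be sketched as a minimal rollout loop. `ToyEnv` is a hypothetical stand-in for the airship environment, used only to show the MDP interface; its deterministic transition mirrors the fact that the airship's state transition is determined by the model.

```python
class ToyEnv:
    """Hypothetical deterministic environment (illustrative only)."""
    def reset(self):
        self.s = 0.0
        return self.s

    def step(self, a):
        self.s += a                    # unique next state for a given (S_t, A_t)
        return self.s, -abs(self.s)    # (S_{t+1}, immediate reward R)

def rollout(env, policy, T):
    """Collect one trajectory of T steps; A_t depends only on S_t (Markov)."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(T):
        a = policy(s)
        s_next, r = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards
```

A PPO implementation collects such trajectories and uses the stored (state, action, reward) triples for policy updates.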
First, the state space S. A qualifying state should contain sufficient information for the controller to make reasonable decisions, while avoiding invalid information, so as to accelerate the optimization process. In the path following control scenario, the state space S is selected as follows, where ∆ψ = ψ − ψ_c, ∆x = x − x_c, and ∆y = y − y_c. Considering the inherent features of airships and the guidance loop, the state space is bounded; for example, r ∈ [−r_max, r_max].
Second, the action space A. To deal with the magnitude and rate saturation problem, the acceleration of the actuators is selected as the continuous action space, which is also the output of the PPO network. sat_rate(a) represents the acceleration saturation, while sat_magnitude(F) denotes the control variable saturation.
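The way the action-space design avoids saturation can be sketched as follows. This is a minimal sketch under stated assumptions: the PPO action is treated as the rate of change of the control input, integrated once per step; `a_max`, `F_max`, and `dt` are assumed limits and step size, not values from the paper.

```python
import numpy as np

def apply_action(F, a, a_max, F_max, dt):
    """One actuator update step (a sketch).
    F: current control input [F_T, delta_r].
    a: rate of change commanded by the PPO actor."""
    a_sat = np.clip(a, -a_max, a_max)       # sat_rate: rate saturation
    F_new = F + a_sat * dt                  # integrate rate -> control input
    return np.clip(F_new, -F_max, F_max)    # sat_magnitude: magnitude saturation
```

Because the policy only ever commands bounded rates and the integrated input is clipped, the applied control input can never violate the magnitude or rate constraints.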
Third, the state transition probability P. In this study, the state transition is determined by the airship model; in other words, the result of an action in a given state is unique.
Fourth, the reward function R. The reward function is described as follows, where k_D < 0, k_ψ > 0, and k_u > 0.

Optimization Process of PPO
Policy gradient methods generally consist of a policy gradient estimator and a stochastic gradient ascent algorithm. Usually, the estimator is obtained by differentiating an objective function whose gradient serves as the policy gradient estimator. In PPO, the clipped objective function [40] includes an entropy bonus S and the clipped term L^CLIP(θ), where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio, π is the action policy, θ is the policy parameter, and ε is a clip hyperparameter. The term clip(r_t(θ), 1 − ε, 1 + ε) restricts aggressive policy updates, thus achieving satisfactory performance. Â_t is the advantage estimator built from δ_t = r_t + γV(s_{t+1}) − V(s_t), where T is a constant much less than the episode length.
The PPO adopted in this paper (see Algorithm 1) includes an actor network and a critic network. As shown in Figure 2, the input layer and hidden layers of the actor and the critic share the same structure. The input states pass through two fully connected layers and finally connect to the output layer. The output layer of the actor is a fully connected layer with two outputs, representing the probability distribution of the accelerations. The output layer of the critic is a fully connected layer with one output, representing the state value.
Algorithm 1 PPO:
for k = 0, 1, 2, . . . do
  Run policy π_θk in the environment for T timesteps.
  Compute the rewards-to-go R̂_t.
  Compute the advantage estimates Â_t based on the current critic network V_φk.
  Update the actor network by maximizing the clipped objective function, Equation (22).
  Update the critic network V_φ.
end for
During training, the processed states and the target of the airship are taken as the input of the neural network. The actor network outputs the derivative of the airship input based on the processed states, while the critic network learns a value function as the basis for actor updates. The output of the neural network is the acceleration of the control input.
After saturation and integration, the control torque and force are obtained.
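The two core computations of the PPO update, the clipped surrogate objective and the advantage estimate, can be sketched as follows. The advantage recursion is a GAE-style sketch built from the δ_t defined above; γ, λ, and ε = 0.2 are illustrative hyperparameter values, not the paper's.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """L^CLIP: mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    return np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv))

def advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates from delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has one extra bootstrap entry beyond `rewards`."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running   # discounted sum of deltas
        adv[t] = running
    return adv
```

With a positive advantage, a ratio of 2.0 contributes only (1 + ε) times the advantage, which is how the clip suppresses aggressive policy updates.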

Simulations
In this section, five sets of simulations are presented. In the first simulation, the controller is trained by following a circular arc path. To evaluate the performance of the trained controller, a comparative path following controller based on nonlinear MPC is introduced, and several comparative simulations are performed. Subsequently, the performance of the proposed controller is evaluated, without any adjustment, by following circular, general, and broken-line paths in the next three simulations, respectively. The last simulation is conducted under wind disturbance.

Controller Training and Comparing
During training, the initial states of the airship are drawn as uniformly distributed pseudorandom numbers on the open interval (0, 1). The desired path is given as follows. The physical parameters of the airship are the same as those in Reference [9], and some parameters of the airship and the controller are listed in Table 1. After 15,000 episodes of training, the episode reward of the proposed method converges from 150 to 15 on average. Table 1. Parameters of the airship and the controller.

As a comparison, an extended nonlinear MPC algorithm for time-varying reference in Reference [44] is introduced. The simulations are performed on an 8-core machine with an Intel Core i5-9300H CPU at 2.40 GHz and 16 GB of RAM. The detailed expression of the algorithm is presented in Algorithm 3.11, Chapter 3 in Reference [44].
To compare the performance of the two controllers, we first run four scenes with different initial states under the same task. The simulation time is set to 200 s with a time step of 0.1 s, and the results are shown in Table 2 and Figure 3. It is obvious from Table 2 that the time consumption of the compared MPC method, averaging 206.63 s, is far greater than that of the proposed method; it is even longer than the simulation time itself, which is unacceptable for practical application. In contrast, the proposed method takes only 1.15 s on average, demonstrating satisfactory real-time performance in the path following process. Figure 3 illustrates the tracking trajectories and the yaw angles. It must be emphasized that the desired yaw angles of the proposed and compared controllers are inconsistent, because the desired yaw angles depend on state variables such as position and velocity: the two algorithms produce different control inputs, leading to different state variables and hence different desired yaw angles. For both methods, the trajectories and yaw angles converge in all four scenes. Moreover, the greater the initial state differences, the slower the trajectory and yaw angle convergence. The distinction is that the tracking errors after convergence are 10-30 m for the proposed method, but less than 1 m for the compared method; the yaw angle error of the proposed method is 10^-1 to 10^-2 rad, while that of the compared method is less than 10^-3 rad. Furthermore, the convergence of the proposed method is faster than that of the compared method, as its trajectories and yaw angles approach the desired curves in less time in the illustrations. In short, the proposed method converges faster but with lower tracking precision.
To verify whether the magnitude and rate saturation problem is solved, the state variables of the two controllers in Scene 3 are shown in Figure 4. The simulation time is set to 200 s with a time step of 0.1 s. Figure 4a shows the time histories of the angular velocity and velocities, illustrating that the forward velocity u tracks its desired value well. Although no desired values are set for r and v, they are convergent and bounded. Figure 4b gives the time histories of the control inputs F_T and δ_r. Due to the actuator rate constraints, the control inputs avoid the large initial jumps seen in References [4,12], which is less harsh on the actuators. In addition, as can be seen from Figure 4b,c, the saturation constraints are never violated. However, the oscillation of the control inputs of the proposed method is relatively serious: because actions are sampled from a probability density function, the control inputs are not always smooth. In extreme circumstances where the airship must follow a path with large initial tracking errors, strong saturation is inevitable. To compare the performance of the two controllers in such situations, a numerical simulation is presented in Figure 5. The simulation time is set to 2000 s with a time step of 0.1 s, and the initial states are chosen as ∆x_0 = ∆y_0 = 50 m and ψ_0 = −1.5280 rad. For both controllers, convergence is much slower than in Scenes 1-4, taking more than 300 s. To adjust ψ, the rudder remains saturated for almost 300 s; during the adjustment, the magnitude and rate constraints are never violated. Moreover, compared with the MPC method, the proposed method converges faster while its control inputs and states are less stable. From Figures 3-5, it can be concluded that the airship can track the desired path subject to actuator magnitude and rate saturation under either of the two controllers.
Compared with the MPC method, the proposed one converges faster but has lower tracking precision and stronger oscillation of the control inputs. Meanwhile, the proposed method has satisfactory real-time performance, which the compared method does not. Although tracking errors remain, they are acceptable for the airship, and the proposed controller shows satisfactory effectiveness and robustness in the presence of actuator magnitude and rate saturation.

Circular Path Following
In this section, to validate the ability to transfer to new tasks, circular paths with radii varying from 1 km to 2 km are applied to the proposed controller. The simulation results are shown in Figures 6-8. As shown in Figure 6, the airship is capable of following all circular paths in the presence of magnitude and rate saturation. The average tracking errors are roughly less than 20 m, which implies the task-independent feature of the proposed controller. Figures 7a and 8a show that the actual yaw angle quickly converges to the desired yaw angle and remains stable. Moreover, from the period of the airship yaw angle, the airship takes about 800 s to fly one round with a radius of 1 km and 2000 s with a radius of 2 km, which implies that the airship has a larger speed when flying around a smaller circle. Figures 7b and 8b show that the desired forward speed u is 6 m/s in both cases, but the side-slip speed v differs: the side-slip speed is increased to achieve a smaller turning radius. Thus, when flying a circle with a smaller radius, the same forward speed yields a larger resultant velocity and hence a shorter endurance. The velocity is tracked well, and the states remain stable during the path following tasks. It is worth emphasizing that the controller is directly obtained from Section 4.1 without adjustment, which proves the effectiveness of the proposed controller.

General Path Following
This section presents the results of the airship following general paths with the proposed controller. As shown in Figures 9 and 10, two reference paths composed of four linear segments and four arc segments are investigated. As can be seen from these figures, the controller accomplishes the tracking tasks well when encountering more general paths: the position errors are reduced efficiently and quickly to an acceptable region, and despite the errors between the reference and actual paths, the tracking states remain quite stable.

Broken-line path following
In this section, a simulation of the airship tracking a continuous non-smooth path is performed. The initial states are chosen as ∆x_0 = ∆y_0 = −50 m and ψ_0 = 1.1980 rad. The desired path ζ_p(ω) = [x_p(ω), y_p(ω)]^T is a broken line containing two cusps, as illustrated in Figure 11, and its parametric equation is given as follows. As shown in Figure 11, the controller tracks the broken-line path well, with only a brief adjustment at each cusp of the path. The tracking error of the yaw angle quickly converges to an acceptable region. Moreover, Figure 12 indicates that the angular velocity and velocities are relatively stable during the tracking process.

Anti-Wind Path Following
In this section, a constant wind disturbance is taken into account when following the desired path. The simulation results are shown in Figure 13. The initial states are chosen as ∆x_0 = ∆y_0 = 200 m and ψ_0 = −1.4810 rad, and the wind speed is 5 m/s, as indicated by the light blue arrows in Figure 13a. Figure 13a illustrates that the airship is capable of following the desired path subject to magnitude and rate saturation as well as the wind disturbance. Compared with Figure 5a, the airship is forced to cruise extra distance to converge to the desired path due to the wind disturbance. Figure 13b shows that the yaw angle is tracked well, although the airship takes more than 500 s to converge to the desired value.
The simulation results prove the robustness of the controller. Although trained without disturbance, the controller is capable of resisting a wind disturbance.

Discussion
The proposed RL-based path following controller for underactuated airships shows satisfactory performance in numerical simulation. Not only is the magnitude and rate saturation of the actuators handled, but the capability of transferring between various tasks without adaptation is also obtained. Discussions focusing on these two aspects are presented next.
For a target orientation control system, actuator saturation is an inevitable problem. Generally, the actuator saturation problem is handled by auxiliary systems and backstepping techniques [28,30,48,49], smooth hyperbolic functions [27,50], model predictive control [9,19,51,52], or generalized nonquadratic cost functions [53-55] that guarantee constrained control inputs. However, the consideration of system capability always comes after the consideration of achieving the control objective; without first considering the system capability, the designed controllers are less meaningful in practical application. Unlike the controllers mentioned above, the proposed control strategy is based on building a capability envelope of the airship. The action policies are generated within the envelope, thus avoiding saturation. This approach takes full advantage of the adaptivity and robustness of neural networks, is simpler to design, and proves to be a feasible solution to the magnitude and rate saturation problems.
On the other hand, transferring to new tasks is a common and challenging problem for RL in practical applications. For airship path following control, the difficulty of transferring to new tasks lies in the variety of task requirements. By reasonably defining the bounded state space S and action space A, the policies and tasks are decoupled, meaning that the task-independent policies can transfer to new tasks without adaptation.
However, the proposed method still has some limitations. First, the oscillation of the control input signals is relatively serious, which indicates high-frequency actuation and will limit the practical implementation of the controller by shortening the service life of the actuators, occupying too much communication bandwidth, or consuming too much energy. Thus, the proposed method needs further improvement to reduce the oscillation. Second, the tracking precision of our approach is relatively low. On one hand, the defect in tracking accuracy is caused by the imperfect design of the reward function, which is closely related to the algorithm accuracy. On the other hand, the robustness of the RL algorithm is acquired at the expense of tracking precision, since no specific measure is taken to counteract the effect of disturbances. Therefore, although the episode reward converges to 10% of its initial value, there is still room for improvement.
In the future, the oscillation of the control inputs will be reduced and the tracking precision of our algorithm improved. To that end, the design of the reward function may be optimized by applying inverse RL methods, and disturbances will be further addressed by employing active observation or compensation. Furthermore, ways of applying the RL strategy to more complex scenarios, such as spatial flight, will be researched.
To sum up, combining the policy search strategy with the path following controller is a meaningful attempt. It can be applied not only to airships but also to other dynamic systems such as vessels and spacecraft [19,32,35,48], indicating the broad prospects of the approach. Moreover, the controller can transfer to various tasks without adaptation and resist wind disturbance without anti-disturbance training, demonstrating the research potential of applying RL methods to the flight control field.

Conclusions
In this paper, a real-time path following controller has been presented for airships with actuator magnitude and rate saturation. First, the MDP model of the airship path following control problem is established, and an error bounded LOS guidance law is proposed to restrain the state space. Next, a PPO algorithm containing an actor network and a critic network is utilized to acquire the optimal action policy through trial and error. Since the policy is obtained from the action space, magnitude and rate saturation cannot occur. Finally, the numerical simulation results show that the controller can track circular paths with different radii, general paths, and non-smooth curves without any adjustment of parameters. Moreover, even though it is trained without disturbance, the controller shows satisfactory performance when confronting wind disturbance. All these results validate the effectiveness and robustness of the proposed controller. The unique feature of this study is that the proposed error bounded LOS guidance law provides a task-independent state space for PPO, so the trained action policy can transfer to new tasks without adaptation. In the future, the oscillation of the control inputs will be reduced, the tracking precision will be improved by employing anti-disturbance techniques or optimizing the reward function, and RL-based spatial path following control will be studied.