Robust Quadrotor Control through Reinforcement Learning with Disturbance Compensation

Abstract: In this paper, a novel control strategy is presented for reinforcement learning with disturbance compensation to solve the problem of quadrotor positioning under external disturbance. The proposed control scheme applies a trained neural-network-based reinforcement learning agent to control the quadrotor, and its output is directly mapped to the four actuators in an end-to-end manner. The scheme constructs a disturbance observer to estimate the external forces exerted on the three axes of the quadrotor, such as wind gusts in an outdoor environment. By introducing a disturbance compensator into the neural network control agent, the tracking accuracy and robustness were significantly increased in indoor and outdoor experiments. The experimental results indicate that the proposed control strategy is highly robust to external disturbances: compensation improved control accuracy and reduced the positioning error by 75%. To the best of our knowledge, this study is the first to achieve quadrotor positioning control through low-level reinforcement learning by using a global positioning system in an outdoor environment.


Introduction
A quadrotor is an underactuated, nonlinear coupled system. Because quadrotors have various applications, researchers have long focused on the problems of attitude stabilization and trajectory tracking in quadrotors. Many control methods have been applied to quadrotors. Proportional-integral-derivative (PID) controllers are widely used in consumer quadrotor products and are often treated as baseline controllers for comparison with other controllers [1]. In practice, tuning the gains of a PID controller often requires expertise, and the gains are selected intuitively by trial and error. Advanced model-based control strategies have been applied to improve the flight performance of quadrotors. Methods such as feedback linearization [2], model predictive control (MPC) [3,4], robust control [5], sliding mode control (SMC) [6][7][8], and adaptive control [9,10] have been applied to optimize the flight performance of quadrotors. However, the performance and robustness of the aforementioned strategies are highly dependent on the accuracy of the manually developed dynamic model.
During outdoor flight, quadrotors are susceptible to wind gusts, which degrade flight performance or even lead to system instability [3]. Although quadrotors are sensitive to external disturbances [11], designers of most controllers have not accounted for this problem. Some active disturbance rejection methods have been proposed to estimate disturbances, and these methods perform well under sustained disturbances. Chovancova et al. [12] designed proportional-derivative (PD), linear quadratic regulator (LQR), and backstepping controllers for position tracking and compared their performance in a simulation. A disturbance observer with a position estimator was designed to improve positioning performance, which was evaluated under external disturbances in simulations. The active disturbance rejection control (ADRC) algorithm treats the total disturbance as a new state variable and estimates it through an extended state observer (ESO). Moreover, the ADRC algorithm does not require the exact mathematical model of the overall system to be known; therefore, it has become an attractive technique for the flight control of quadrotor unmanned aerial vehicles (UAVs) [13,14]. Yang et al. proposed the use of ADRC and PD control in a dual closed-loop control framework [15], in which an ESO estimates wind gust perturbations as dynamic disturbances in the inner control loop. A quadrotor flight controller with sliding mode control and a sliding mode disturbance observer (SMC-SMDO) was used in [16]; the SMC-SMDO is robust to external disturbances and model uncertainties without the use of high control gains. Chen et al. [17] constructed a nonlinear disturbance observer that considers external disturbances from wind and model uncertainties separately from the controller and compensates for their negative effects. In [18], a nonlinear observer based on an unscented Kalman filter was developed for estimating the external force and torque.
This estimator reacted to a wide variety of disturbances in the experiment conducted in [18].
Reinforcement learning (RL) has solved many complicated quadrotor control problems. RL outperforms other optimization approaches and does not require a predefined controller structure, which would limit the performance of an agent. In [19], a quadrotor with a deep neural network (DNN)-based controller was proposed for following trails in an unstructured outdoor environment. In [20], RL and MPC were combined to enable a quadrotor to navigate unknown environments: MPC handles vehicle control, whereas RL guides the quadrotor through complex environments. In addition to high-level planning and navigation problems, RL has been used to achieve robust attitude and position control [21,22]. The control policy generated through the RL training of a neural network achieves low-level stabilization and position control, and the policy can control the quadrotor directly from the quadrotor state inputs to the four motor outputs. The aforementioned studies have implemented their proposed control strategies in simulations and real environments. Although quadrotors with RL controllers exhibit stability under disturbance, the control policy cannot eliminate the steady-state error caused by wind or modeling error, and the performance of the controller can be improved. In [23], an integral compensator was used to enhance tracking accuracy; its effect on the tracking accuracy of the controller was verified by introducing a constant horizontal wind flowing parallel to the ground in a simulation. Although this integral compensator can eliminate the steady-state error, it slows the controller response and causes a large overshoot.
This paper presents a unique disturbance compensation RL (DCRL) framework that includes a disturbance compensator and an RL controller. The external disturbance observer in this framework is based on the work of [24]. The rest of this paper is organized as follows. Section 2 introduces the dynamic model of a quadrotor and the basics of RL. Section 3 describes the proposed DCRL control strategy. Section 4 describes the training and implementation of the proposed DCRL strategy in a quadrotor experiment in indoor and outdoor environments. Finally, Section 5 concludes the paper.

Preliminary Information
This section briefly introduces the dynamic model of a quadrotor, the basics of RL, and the use of RL in solving the quadrotor control problem.

Quadrotor Dynamic Model
In this paper, we assume that a quadrotor is a rigid and symmetrical body whose center of gravity coincides with its geometric center.
The vector x = [x, y, z]^T denotes the position of the quadrotor in an inertial frame. The translational dynamics of the quadrotor can be expressed as follows:

m ẍ = −mg i_z + (∑_{i=1..4} T_i) b_z + F_ext, (1)

where m and g are the mass of the quadrotor and the acceleration due to gravity, respectively; R = [b_x b_y b_z] ∈ SO(3) is the rotation matrix, which transforms a coordinate from the body-fixed reference frame to the inertial reference frame; and T_i is the thrust generated by motor i and applied along the body-frame z-axis b_z. Figure 1 displays the motor placement order. Finally, the vector F_ext represents the external disturbance force, which accounts for all other forces acting on the quadrotor. For the rotational dynamics of the system, a quaternion representation of the quadrotor attitude is used to avoid gimbal lock and achieve better computational efficiency:

q̇ = (1/2) q ⊗ [0, Ω]^T,   J Ω̇ = −Ω × (J Ω) + µ, (2)
where q = [q_w, q_x, q_y, q_z]^T is the unit quaternion attitude vector and ⊗ denotes quaternion multiplication; Ω is the body-frame angular velocity, µ is the control moment vector, and J is the matrix of the vehicle moment of inertia tensor. The rotation transformation between the quaternion q and the rotation matrix R can be expressed as follows:

R(q) = [ 1 − 2(q_y² + q_z²)    2(q_x q_y − q_w q_z)   2(q_x q_z + q_w q_y)
         2(q_x q_y + q_w q_z)  1 − 2(q_x² + q_z²)     2(q_y q_z − q_w q_x)
         2(q_x q_z − q_w q_y)  2(q_y q_z + q_w q_x)   1 − 2(q_x² + q_y²) ]. (3)

Each thrust from the propeller axis is assumed to be aligned perfectly with the body z-axis. The force T_i and the z-axis motor moment produced at a motor spinning speed of Ω_i can be expressed as follows:

T_i = c_f Ω_i²,   µ_z,i = c_d Ω_i²,   i = 1, 2, 3, 4,

where Ω_i is the speed of motor i; l is the arm length of the quadrotor, which maps the four thrusts into the roll and pitch moments; and c_f and c_d are the coefficients of the generated force and z-axis moment, respectively. The developed dynamic model is based on the following assumptions: (a) the quadrotor structure is rigid, (b) the center of mass of the quadrotor and the rotor thrusts lie in the same plane, and (c) blade flapping and aerodynamic effects can be ignored.
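To make the model concrete, the dynamics above can be integrated numerically. The following sketch, written in the same Python environment later used for training (though it is not the authors' simulator), advances the rigid-body state by one Euler step; the inertia matrix J, the time step dt, and all function names are illustrative assumptions.

```python
import numpy as np

def quat_mult(q, p):
    """Hamilton product of two quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_to_R(q):
    """Rotation matrix (body -> inertial) from a unit quaternion, as in (3)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def step(pos, vel, q, omega, thrusts, mu, F_ext, m, J, dt, g=9.81):
    """One Euler-integration step of the rigid-body model (1)-(2)."""
    R = quat_to_R(q)
    T = np.sum(thrusts)                                    # total thrust along body z
    acc = (-m*g*np.array([0.0, 0.0, 1.0]) + T*R[:, 2] + F_ext) / m
    q_dot = 0.5 * quat_mult(q, np.r_[0.0, omega])          # quaternion kinematics
    omega_dot = np.linalg.solve(J, mu - np.cross(omega, J @ omega))
    q_new = q + q_dot*dt
    return (pos + vel*dt, vel + acc*dt,
            q_new / np.linalg.norm(q_new), omega + omega_dot*dt)
```

With total thrust equal to the weight, no moment, and no disturbance, the sketch holds a hover, which is a quick sanity check on the signs in (1).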

Reinforcement Learning
The standard RL framework comprises a learning agent interacting with the environment according to a Markov decision process (MDP). A state transition has the Markov property if the probability of the transition is independent of its history. The MDP formalism addresses decision problems with the Markov property, and RL theories are based on the MDP. The standard MDP is defined by the tuple (S, A, P^a_{ss'}, r, ρ_0, γ), where S is the state space, A is the action space, P^a_{ss'} : S × A × S → R^+ is the transition probability density of the environment, r : S × A → R is the reward function, ρ_0 : S → R^+ is the distribution of the initial state s_0, and γ ∈ (0, 1) is the discount factor. In modern deep RL conducted using neural networks, the agent selects an action at each time step according to the policy π(a|s; θ) = Pr(a|s; θ), where θ ∈ R^{N_θ} contains the weights of the neural network.
The goal of the MDP is to find a policy π(a ∈ A|s) that can maximize the cumulative discounted reward.
A state-dependent value function V^π that measures the expected discounted reward with respect to π can be defined as follows:

V^π(s) = E[ ∑_{t=0..∞} γ^t r_t | s_0 = s, π ].

The state-action-dependent value function can be defined as follows:

Q^π(s, a) = E[ ∑_{t=0..∞} γ^t r_t | s_0 = s, a_0 = a, π ].

The advantage function can be defined as follows:

A^π(s, a) = Q^π(s, a) − V^π(s),

where A^π is the difference between the expected value when selecting a specific action a and the expected value of following policy π. The advantage function can be used to determine whether the selected action is suitable with respect to policy π. Many basic RL algorithms, such as the policy gradient method [25], the off-policy actor-critic algorithm [26], and trust region policy optimization [27], can be used to optimize a policy. To maximize the expected reward function V^π(s), the neural-network-parameterized policy π = π(a|s; θ) is adjusted as follows:

θ ← θ + α ∑_s ρ^π(s) ∑_a ∇_θ π(a|s; θ) Q^π(s, a), (9)

where ρ^π(s) is the state occurrence probability and α > 0 is the size of the learning step. Equation (9) is an expression for the policy gradient [28]. By using the state distribution ρ and the state-action value function Q, a policy can be improved without any environmental information. The state distribution ρ^π(s) depends on the policy π, which indicates that it must be re-estimated when the policy is changed. In [26], the policy gradient was analyzed by replacing the original policy π with another behavior policy µ; therefore, (9) takes the following form in [26]:

θ ← θ + α ∑_s ρ^µ(s) ∑_a ∇_θ π(a|s; θ) Q^π(s, a). (11)
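The definitions above can be illustrated with a minimal Monte Carlo estimate of the return and advantage over one trajectory; this is a simplification of what an actor-critic implementation would use, and the function names are ours.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one trajectory."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def advantages(rewards, values, gamma):
    """Monte Carlo advantage estimate A_t ~ G_t - V(s_t): a positive value
    means the chosen action did better than the critic expected."""
    return discounted_returns(rewards, gamma) - np.asarray(values)
```

For example, rewards [1, 1, 1] with γ = 0.5 give returns [1.75, 1.5, 1.0].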
Equation (11) can still maximize V π with distinct policy gradient strategies.
To solve the nonlinear dynamic control problem for a quadrotor, we used the proximal policy optimization (PPO) algorithm [29] and the off-policy training method of [22] to train the actor and critic functions. The inequalities that PPO imposes on the actor function bound each policy update; thus, for a policy search iteration, (13) provides improvement criteria for the action policy under a certain state. Figure 2 depicts a block diagram of quadrotor control with the DCRL framework. The proposed DCRL strategy augments the RL control policy with an external disturbance observer and a compensator to strengthen system robustness. The compensation algorithm estimates the external forces and adjusts the input command for the RL controller, which then changes the motor thrusts accordingly. In Figure 2, the observer takes the attitude q_f and the acceleration a_f as inputs and outputs the estimated external disturbance; F_ext and F̂_ext denote the external disturbance and its estimate from the observer, respectively. The disturbance compensator calculates q_comp from the quadrotor attitude q_f and F̂_ext. The original RL actor was trained to hover at the origin by receiving the state s, which contains the position, velocity, attitude, and angular velocity of the quadrotor, and outputting four motor thrusts according to the policy α(s). In the DCRL, to make the quadrotor hover at the reference position x_ref, the origin is shifted by the reference command, and the resulting position deviation x_dev is used as the position input to the RL actor. The DCRL generates the thrust command as the sum of the RL actor output and the compensation force. The RL controller in the DCRL was trained to recover and hover at the origin in an ideal simulation environment; external disturbances were not considered in RL control policy training. Several reasons exist for not adding external disturbances to RL training.
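One DCRL cycle described above can be sketched as follows; all interfaces (the observer, compensator, and actor callables, the state layout, and the way the compensation force is spread over the motors) are our assumptions rather than the authors' code.

```python
import numpy as np

def dcrl_step(x, v, q, w, a_f, x_ref, estimate_F, correct_attitude, actor):
    """One DCRL control cycle (sketch; all interfaces are assumptions).

    estimate_F(q, a_f)         -> estimated disturbance F_hat (inertial frame, N)
    correct_attitude(q, F_hat) -> attitude expressed in the compensation frame
    actor(s)                   -> four motor thrusts from the RL policy
    """
    F_hat = estimate_F(q, a_f)                       # disturbance observer
    q_comp = correct_attitude(q, F_hat)              # compensation-frame attitude
    s = np.concatenate([x - x_ref, v, q_comp, w])    # shift the origin to x_ref
    thrusts = actor(s)                               # RL actor output
    # Add the compensation force; distributing the vertical component evenly
    # over the four motors is an assumption, not the authors' exact scheme
    # (horizontal components are handled by the attitude correction above).
    return thrusts - F_hat[2] / 4.0
```

With a zero disturbance estimate and an identity attitude correction, the cycle reduces to the plain RL controller, which matches the block diagram in Figure 2.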
First, the RL controller sometimes outperforms traditional methods when the simulation environment closely matches the real-world system; however, such a system model has numerous uncertainties and is highly difficult to reproduce in a simulator. In this study, sensor noise was one of the uncertain factors because each inertial measurement unit (IMU) sensor on the flight computer had different physical characteristics. Second, sensor noise does not follow the assumptions of the MDP in RL theory; thus, the final performance of the trained policy cannot be guaranteed. Finally, the aforementioned traditional external disturbance estimation methods have been demonstrated to be effective. Therefore, in this study, we focused on eliminating known disturbances by combining an RL controller with a traditional observation method to achieve superior positioning performance.

Disturbance Observer
In general, by rearranging the terms in (1), the external force can be calculated directly from the acceleration information as follows:

F̂_ext = m R a_b − (∑_{i=1..4} T_i) b_z,

where the thrust forces are applied only along the body-frame z-axis of the quadrotor and a_b is the acceleration measured by the onboard IMU sensor; a_b includes the gravitational acceleration. A low-pass filter (LPF) with a cut-off frequency of 30 Hz is used to reduce the effects of noise caused by rotor spinning vibrations or the IMU. The thrust and acceleration are transformed to the inertial reference frame prior to filtering. The reason for this preprocessing is that the external force in the inertial reference frame F̂_ext is assumed to be slow-changing relative to the LPF dynamics.
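A discrete-time sketch of this observer is given below; the first-order filter gain alpha = dt / (dt + 1/(2πf_c)) and the class interface are implementation assumptions, with only the force estimate (rearranged from (1)) and the 30-Hz cut-off taken from the text.

```python
import numpy as np

class DisturbanceObserver:
    """External force estimate F_hat = m*R*a_b - (sum T_i)*b_z, obtained by
    rearranging (1), smoothed by a first-order low-pass filter."""

    def __init__(self, m, dt, f_c=30.0):
        self.m = m
        # Discrete gain of a first-order LPF with cut-off f_c (assumption).
        self.alpha = dt / (dt + 1.0 / (2.0 * np.pi * f_c))
        self.F_hat = np.zeros(3)

    def update(self, R, a_b, thrusts):
        # Transform to the inertial frame before filtering, as in the text;
        # a_b is the IMU specific force and already includes gravity.
        F_raw = self.m * (R @ a_b) - np.sum(thrusts) * R[:, 2]
        self.F_hat += self.alpha * (F_raw - self.F_hat)
        return self.F_hat.copy()
```

In level hover with total thrust equal to the weight, the estimate stays at zero; a 1-N lateral push appears on the x-axis after the filter settles.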

Disturbance Compensator
When an external disturbance F_ext acts on the quadrotor (Figure 3), this disturbance generates a translational acceleration vector a_ext = F_ext/m. For disturbance compensation, a new compensation thrust vector g_fc is defined. This vector combines a_ext and the gravitational acceleration vector g i_z and can be expressed as follows:

g_fc = g i_z − a_ext,

which considers only the hovering situation without an acceleration command from the trajectory tracking reference. The normalized vector ĝ_fc is then used to formulate a new coordinate frame (the force compensation frame) with a rotation matrix R_ci relative to the inertial frame. In three-dimensional space, any rotation of a coordinate system about a fixed point is equivalent to a single rotation by an angle θ about a fixed axis (the Euler axis) that passes through the fixed point.
To obtain the rotation matrix R_ci from i_z to g_fc, the rotation can be represented as a quaternion by using the following equations:

v' = q ⊗ v ⊗ q⁻¹,   q = [cos(θ/2), u sin(θ/2)]^T,

where v is the original quaternion vector, u is the unit vector of the rotation axis, v' is the rotated quaternion vector, q is the rotation quaternion between v and v', and θ is the rotation angle. For the rotation from i_z to ĝ_fc, the axis is u = (i_z × ĝ_fc)/‖i_z × ĝ_fc‖ and θ is the angle between i_z and ĝ_fc. By substituting the resulting q_ci into (3), R_ci can be determined; the result is equivalent to Rodrigues' rotation formula. After the rotation matrix R_ci is obtained, the quadrotor attitude in the force compensation frame can be calculated as R_cb = R_ci^T R. After the corrected coordinate for compensation is obtained, the magnitude of the compensated motor thrust is computed from α_i(s), the i-th motor action output of the neural network (specified in the following section), plus the compensation force term. By modifying the quadrotor attitude in the compensation frame R_cb and using this attitude as the input state of the RL controller, the controller can generate the corresponding motor thrusts to maintain the target attitude and thereby counteract the disturbance acting on the quadrotor.
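The axis-angle construction of q_ci can be written compactly; the helper below is our sketch, not the flight code, and the degenerate-case handling is an assumption.

```python
import numpy as np

def compensation_quaternion(g_fc):
    """Quaternion q_ci = [cos(theta/2), u*sin(theta/2)] rotating the inertial
    z-axis i_z onto the compensation thrust vector g_fc."""
    i_z = np.array([0.0, 0.0, 1.0])
    g_hat = g_fc / np.linalg.norm(g_fc)
    axis = np.cross(i_z, g_hat)            # Euler axis u (unnormalized)
    s = np.linalg.norm(axis)
    if s < 1e-9:                           # already aligned: identity rotation
        return np.array([1.0, 0.0, 0.0, 0.0])
    u = axis / s
    theta = np.arctan2(s, np.dot(i_z, g_hat))  # angle between i_z and g_fc
    return np.r_[np.cos(theta / 2.0), u * np.sin(theta / 2.0)]
```

For g_fc tilted 45° into the x-z plane, this returns a rotation of 45° about the y-axis, i.e., q = [cos 22.5°, 0, sin 22.5°, 0].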

Experiments
In this section, we introduce our training method for a low-level quadrotor control policy. The RL controller receives the information on the quadrotor state (position, velocity, attitude, and angular velocity) from sensors and directly outputs the control commands of four rotors. The training was first performed and tested in a simulator. The quadrotor simulator was established using Python according to the dynamic model described in Section 2.1 for training and verifying the flight performance.
After verifying that the RL control policy was trained successfully, we transferred the controller into our DCRL structure and performed real flights with the quadrotor. To implement the proposed DCRL control algorithm, the controller was developed in Simulink and deployed on PixRacer flight controller hardware. The DCRL control strategy was examined in an indoor environment by performing fixed-point hovering under an external wind disturbance. Then, the quadrotor was set to track a square trajectory in an outdoor experiment. The position and velocity of the quadrotor were obtained using an OptiTrack motion capture system on the ground station computer. These data were transmitted through Wi-Fi to the onboard PixRacer flight controller within a 10-m range in the indoor experiment. The physical parameters of the quadrotor platform are presented in Table 1. For outdoor trajectory tracking, position information was obtained only from an onboard global positioning system (GPS) sensor.

RL Controller Training
In the RL training process, we followed the dynamic equations in Section 2.1 and constructed a simulation environment in Python to generate training data. In the simulation environment, the state space of the MDP comprised the position, velocity, attitude, and angular velocity of the quadrotor, and the four motor thrust outputs were chosen as the action space. The training process follows the work in [22] and uses two processes: one for data collection and another for value and policy network updates. The update is based on off-policy training; the main difference from on-policy training is that on-policy training uses the collected data only once and discards them after each neural network update. In contrast, in off-policy training, the collection thread keeps generating trajectory data, and the neural network updating thread can reuse the collected data, which accelerates the learning process.
In the data collection process, the quadrotor was randomly launched in a 2-m cubic space with random states. The training data, which comprised the quadrotor states, actions, action probabilities, and rewards, were recorded in each episode, which contained 200 steps over a two-second flight, and then saved as a single data trajectory in a memory buffer. The normalized reward function for evaluating the current state of the quadrotor is as follows:

r = −(0.002 e_q + 0.002 e_p + 0.002 a), (20)

where e_q is the vehicle angle error, e_p is the vehicle position error, and a is the motor thrust command, which is penalized to constrain the energy cost. When the number of data trajectories in the memory buffer exceeded 10, the training process started. In the training process, trajectories were randomly sampled from the memory buffer. The advantage and value functions were defined recursively and calculated in the reverse time direction because they depend on the future time t + 1. With these two estimates, we used stochastic gradient descent to optimize the objectives. To approximate the function π(a|s) for proposing actions and V(s) for predicting the state value, two neural networks were formulated; a stochastic Gaussian policy was used for the actor network:

π(a|s) = (1/(√(2π) σ)) exp(−(a − µ(s))²/(2σ²)),

where the mean µ(s) is the output of the actor network. The state value function is approximated by the critic network, where h_i^j and y_i^j denote the i-th hidden layer and the output layer of a neural network with width j, and all layers are fully connected. In both the actor and critic networks, the input state s comprises the quadrotor's position, velocity, attitude, and angular velocity. The sin and cos functions are used to constrain the output range. A rectified linear unit (ReLU) is used as the activation function because of its fast calculation and easy implementation in a microcontroller unit. When the RL controller is implemented in a quadrotor flight computer system, only (25) is used to control the quadrotor.
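The reward of Eq. (20) is straightforward to express in code. Treating e_q, e_p, and a as vectors and penalizing their Euclidean norms is our reading of the equation; the paper writes the three terms without explicit norm bars.

```python
import numpy as np

def reward(e_q, e_p, a):
    """Normalized reward of Eq. (20): penalize the attitude error e_q, the
    position error e_p, and the motor thrust command a (energy cost).
    Using Euclidean norms for the vector terms is an assumption."""
    return -(0.002 * np.linalg.norm(e_q)
             + 0.002 * np.linalg.norm(e_p)
             + 0.002 * np.linalg.norm(a))
```

The reward is zero only in a perfect zero-thrust hover state and strictly negative otherwise, so the return is maximized by small, efficient corrections.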
To apply the developed RL controller in an outdoor environment, we extracted the parameters of a trained neural network and loaded them into a Simulink model. The input state of position was limited to the same finite range as that adopted in the training environment to prevent an untrained condition from occurring when using the developed RL controller with a GPS in outdoor environments. With the successful training of the external disturbance observer and RL neural network controller, the DCRL control policy was transferred to the PixRacer flight computer in real quadrotors to replace the original PID controller.
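Deploying the trained actor amounts to a plain feed-forward pass through the exported weights; the sketch below shows this inference path. The layer sizes, the export format, and the function names are our assumptions; only this deterministic forward pass (not the critic or the Gaussian sampling used in training) is needed on the flight computer.

```python
import numpy as np

def relu(x):
    """ReLU activation, chosen for cheap evaluation on a microcontroller."""
    return np.maximum(x, 0.0)

def actor_forward(s, weights, biases):
    """Deterministic actor inference: fully connected ReLU hidden layers and
    a linear output layer producing the four motor thrust commands.
    weights/biases are lists of per-layer arrays exported from training."""
    h = np.asarray(s, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                  # hidden layers
    return weights[-1] @ h + biases[-1]      # linear output layer
```

Because the pass uses only matrix products and max operations, it ports directly to a Simulink model or microcontroller without an ML runtime.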

Results of the Indoor Experiment
In the indoor experiment, we placed an electric fan to simulate a constant wind disturbance (Figure 4). We used a self-made quadrotor with a flight control board and a GPS mounted on the airframe. An OptiTrack motion capture system provided reliable state information, and a multisensor fusion framework in the flight computer fused the measurements from the onboard IMU with the motion capture data to compensate for the time delay and low update frequency of the OptiTrack system. We compared the position errors of the RL controller with and without compensation under the wind disturbance. The measured wind speed was 3.6 m/s at the center of the x-axis of the quadrotor. Figure 5 displays the position error histogram. The mean errors of the original RL and DCRL controllers were 8.4 and 2 cm, respectively; thus, compensation reduced the hovering error by 75%. Figure 6 presents the estimated position error and external disturbance force for a 30-s flight. Video clips of the indoor experiment can be found at https://youtu.be/RtAoiljZTSI (accessed on 5 April 2021).

Results of the Outdoor Experiment
After the DCRL control algorithm was verified under the accurate position and velocity measurements of the motion capture system and a relatively steady wind perturbation in a laboratory environment, the quadrotor was moved outdoors, and a GPS was used to obtain position feedback. The maximum wind speed was measured to be 4.2 m/s with an anemometer. The position trajectory was a square with 10-m sides at a constant height and a velocity of 1 m/s. The results of the outdoor experiment are shown in Figure 7, and the estimated external forces acting on the x-axis and y-axis are presented in Figure 8. Table 2 summarizes the position errors in the indoor and outdoor experiments. In the outdoor experiment, a position waypoint was used as the reference for trajectory tracking without a velocity command; thus, the quadrotor had to maintain a certain position error to obtain the moving velocity required to follow the waypoint. The tracking errors may have also been caused by the 2.5-m horizontal position accuracy and 10-Hz update rate of the adopted GPS sensor. However, the experimental results still indicate that the DCRL structure can reduce the quadrotor positioning error.

Conclusions
In this paper, an RL control structure with an external force disturbance observer and external force compensation is proposed for quadrotors. The DCRL controller can reduce the effects of wind gusts on quadrotors in fixed-position hovering and trajectory tracking tasks and improve their flight performance. In the experiments, compared with the original RL control algorithm, the proposed control strategy reduced the fixed-position hovering error by 75%. To the best of our knowledge, this study is the first to use a low-level RL controller with a GPS in an outdoor environment to eliminate the external disturbance acting on flying quadrotors.