Automatic Control Optimization for Large-Load Plant-Protection Quadrotor

Abstract: In this article, methods for the attitude control optimization of large-load plant-protection quadrotor unmanned aerial vehicles (UAVs) are presented. Large-load plant-protection quadrotors can be defined as quadrotors equipped with sprayers and a tank containing a large amount of water or pesticide, allowing the quadrotors to water plants or spray pesticide during flight. Compared to the control of common small quadrotors, two main points need to be considered in the control of large-load plant-protection quadrotors. First, the water in the tank gradually diminishes during flight and the physical parameters change during this process. Second, the size and mass of the rotors are especially large, which greatly slows the response rate of the rotors. We present an extended-state reinforcement learning (RL) algorithm to solve these problems. The moments of inertia (MOIs) of the three axes and the dynamic response constant of the rotors are included in the state list of the quadrotor during the training process, so that the controller can learn these changes in the models. The control laws are automatically generated and optimized, which greatly simplifies the tuning process compared to those of traditional control algorithms. The controller in this article is tested on a 10 kg class large-load plant-protection quadrotor, and the flight performance verifies the effectiveness of our work.


Introduction
Quadrotor unmanned aerial vehicles (UAVs) have been widely used in many areas because of their maneuverability and flexibility, as Reference [1] summarized. In plant-protection applications, quadrotor UAVs are helping to reduce agricultural damage [2]. However, quadrotors are high-dimensional nonlinear underactuated systems with differing physical parameters, which increases the difficulty of stable control. Several control algorithms have been tested and applied to quadrotors; the most widely used is the traditional proportional-integral-derivative (PID) controller, which achieves acceptable flight performance. Projects [3,4] are popular open-source flight controllers using PID-based control algorithms; their users must tune the PID parameters for each quadrotor platform. Another research effort [5] made many improvements to PID controllers, augmenting them with a Smith predictor and a rotational trajectory planner. The linear quadratic regulator (LQR) algorithm has also made progress in the control of quadrotor UAVs. One study [6] linearized the quadrotor system and applied the LQR method, with energy consumption also considered. Another report [7] presented a morphing quadrotor, whose arms could rotate around the body using servos. To achieve stable flight in any configuration, the researchers exploited an LQR-based control strategy that adapts to the drone morphology in real time. However, the performance of these linear control algorithms degrades at larger attitude angles because the model errors are larger under this condition. Studies [5,7] addressed this problem and compensated for the model errors in their attitude control loops.
Many studies have tested nonlinear control algorithms on quadrotor UAVs. One study [8] achieved stable control of a quadrotor with the backstepping algorithm, and this control method was found to control the quadrotor in any state without nonlinear compensation. Backstepping and sliding mode techniques were applied to an indoor micro quadrotor in another study [9]. A further study [10] developed sliding mode controllers for the attitude loop and the position loop separately. H-infinity control was found to achieve robust control for quadrotors in another study [11], and the effect of external disturbances was minimized. An additional study [12] introduced observers to further compensate for these external disturbances.
The present article focuses on the attitude control of large-load plant-protection quadrotors. Large-load plant-protection quadrotors can be defined as quadrotors equipped with sprayers and a tank containing a large amount of water or pesticide (the mass of the water or pesticide can even exceed that of the quadrotor itself), and the quadrotors can water plants or spray pesticide during flight. Compared to the control of common small quadrotors, some factors need to be considered in the control of large-load plant-protection quadrotors. First, the water in the tank gradually diminishes during application, so the physical parameters change during this process. The mass and moments of inertia (MOIs) of the three axes of a plant-protection quadrotor with an empty tank or with a full tank can be very different. The controller must learn these changes because the flight performance of the plant-protection quadrotor can otherwise be severely affected. One report [13] presents a fuzzy PID method to solve this problem on a plant-protection quadrotor, where the PID parameters were adjusted in real time using a fuzzy control table. This problem also appears in the control of the morphing quadrotor in the study [7], where it was solved using the adaptive LQR method. Deterministic artificial intelligence (DAI) is a novel intelligent control algorithm that can automatically enforce optimality without any controller tuning. This method has been applied to unmanned underwater vehicles (UUVs) [14] and spacecraft [15], and it is also able to solve the changing-MOI problem. Second, for large-load plant-protection quadrotors, the size and mass of the rotors are particularly large, which greatly slows the rotor response rate. This factor cannot be ignored; if the response rate is too low, the quadrotor cannot track the desired attitude in a timely manner, which limits the size of electrically powered rotors. Projects [3,4] present feedforward compensation methods to solve this problem.
In [5], a Smith predictor is designed to accelerate the response rate of rotors. Another report [7] includes the delay of the rotors in the state equation of the system and solves for the optimal control law using the LQR method. These methods can achieve acceptable performance but rely heavily on parameter tuning, which can be burdensome and dangerous in real applications.
We present a reinforcement learning (RL) method to solve the above problems. RL is an optimization algorithm in which agents take actions and receive rewards in an environment. In the training process, the agents optimize the policy to obtain higher rewards. RL has made remarkable achievements and demonstrated strong potential in complex nonlinear control systems. These techniques have been tested on quadrotor UAVs. One study [16] controlled the full quadrotor using only one neural network in an end-to-end style; the controller tracked the desired position and sent control values to the four rotors directly. In [17], a supplementary RL controller was designed to optimize the performance of a PID-based controller. Another article [18] studied the fault-tolerant control (FTC) problem using the RL method, and a simulation example of a quadrotor was conducted. The report [19] proposes an event-triggered RL method to control the quadrotor and achieves great performance. However, these RL methods focus on specific platforms and cannot be applied to platforms with other physical parameters or adapt to changes in the physical parameters of the platform. We presented an extended-state RL method in our previous research [20], whose core idea is to extend the critical physical parameters of the system into the state list during the RL training process, so that the controller can automatically fit changes in these parameters. In the present article, we choose the MOIs of the three axes and the dynamic response constant of the rotors as the critical physical parameters. The RL method in this article is a deterministic policy gradient (DPG) [21] style algorithm. To accelerate the training process, we train 128 quadrotors with different physical parameters in parallel.
Our method automatically generates and optimizes the control law with almost no parameter tuning, which is a great benefit compared to many traditional algorithms. This end-to-end controller design process simplifies the tuning work and ensures safety during the experimental process, which is very important for large-load quadrotors. The controller runs on flight controller hardware developed by our team, named the TensorFly Project, which is equipped with a high-performance Raspberry Pi microcomputer that meets the computational cost of our controller neural network.
The article is organized as follows: in Section 2.1, a nonlinear model of the large-load plant-protection quadrotor system is established and identification methods are introduced. In Section 2.2, the training process of the extended-state RL algorithm is presented. In Sections 3.1-3.4, we introduce the training details and results. Sections 3.5 and 3.6 present the real flight performance of a 10 kg class large-load plant-protection quadrotor using the RL controller. Section 4 concludes the article.

System Model and Identification
In this article, the quadrotor is a 10 kg class large-load plant-protection quadrotor, and Figure 1 introduces the definition of the quadrotor's frames. In Figure 1a, the attitude of the quadrotor is represented by φ (the roll angle), ϕ (the pitch angle) and ψ (the yaw angle). In Figure 1b, ω_1 to ω_4 represent the angular velocities of the four rotors. The movement and rotation of the quadrotor are determined by the lift forces of the rotors. The dynamic model of the quadrotor is presented in Equation (1):

J_xx dω_x/dt = (J_yy − J_zz) ω_y ω_z + M_x
J_yy dω_y/dt = (J_zz − J_xx) ω_x ω_z + M_y      (1)
J_zz dω_z/dt = (J_xx − J_yy) ω_x ω_y + M_z

where ω_x, ω_y, ω_z represent the angular velocities of the three axes and J_xx, J_yy, J_zz represent the MOIs of the three axes. M_x, M_y, M_z represent the torques generated by the rotors. The torques are generated by the four rotors, as Equations (2) and (3) present, where C_T and C_M are the lift coefficient and the torque coefficient of the rotors, and ω_1 to ω_4 are the rotors' angular velocities. In addition, d represents the arm length of the plant-protection quadrotor; C_x, C_y, C_z and S_x, S_y, S_z are intermediate variables; and a_x, a_y, a_z ∈ [−1.0, 1.0] are the control values of the attitude controller. We scale the controller output between −1.0 and 1.0 to simplify the design of the controller neural networks. The relationship between the angular velocity of a rotor and the control value of the rotor is modeled in Equation (4):

T_m dω/dt = ω_d − ω, with ω_d = C_R σ      (4)

where ω represents the current angular velocity of the rotor and ω_d represents the desired angular velocity of the rotor. σ ∈ [0, 1.0] represents the control value of the rotor, C_R represents the rotor control constant, and T_m represents the dynamic response constant.
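As a small illustration, the first-order rotor response of Equation (4) can be simulated as follows; the value of C_R below is hypothetical (chosen only for illustration), while T_m = 0.035 s is the constant identified later for this quadrotor:

```python
def rotor_step(omega, sigma, C_R=1400.0, T_m=0.035, dt=0.001):
    """One Euler step of the rotor model T_m * domega/dt = omega_d - omega,
    with the desired angular velocity omega_d = C_R * sigma."""
    omega_d = C_R * sigma                # map control value to desired speed
    return omega + (omega_d - omega) / T_m * dt

# a step command sigma = 1.0 drives omega toward C_R with time constant T_m
omega = 0.0
for _ in range(200):                     # simulate 0.2 s
    omega = rotor_step(omega, 1.0)
```

After roughly five time constants the rotor speed settles near C_R, which is why a large T_m (heavy rotors) directly limits how fast the attitude loop can act.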
System identification is an important aspect of controlling quadrotors, and we present several methods to measure or estimate the relevant physical parameters (shown in Figure 2). Parameters such as the mass and the length of the arms can be measured directly. MOIs can be measured using the three-wire pendulum method, but this method requires large and expensive equipment for a large-load quadrotor. We instead estimate the MOIs using the computer-aided design (CAD) method; that is, we build an accurate model in SolidWorks (Figure 2a), weigh every component of the quadrotor, assign the masses of these components in SolidWorks, and finally let the software calculate the MOIs of the total quadrotor. Moreover, this article focuses on controlling the plant-protection quadrotor with any amount of water carried in it, so the water model must be considered. It is worth noting that the water in the tank may wobble during flight, which can greatly worsen the flight performance. To reduce the impact of the water, we control the quadrotor at low speed (under 5.0 m/s in the x and y axes, under 1.0 m/s in the z axis) and low acceleration (under 1.0 m/s² in the three axes). Moreover, the tank in this project is an anti-oscillation tank specially designed for plant-protection quadrotors, in which the water is separated into small compartments, which greatly reduces the impact of the water. As a result, we assume that the water is shaped as a cuboid, as shown in Figure 3.
In Figure 3, cog_bo is the center of gravity (COG) of the body (including the empty tank), cog_wa is the COG of the water cube, and cog_all is the COG of the total quadrotor. In addition, d_cog denotes the distance between cog_bo and cog_wa, d_bo represents the distance between cog_bo and cog_all, and d_wa is the distance between cog_wa and cog_all. To simplify the calculation, the movement of the COG of the water is ignored, so d_cog is constant. V_wa is the volume of the water cube, and l_wa is the length and width of the water cube, so the height of the water cube can be calculated as h_wa = V_wa / l_wa². Then, d_bo and d_wa can be calculated as Equation (5) presents:

d_bo = ρ_wa V_wa d_cog / (m_bo + ρ_wa V_wa), d_wa = m_bo d_cog / (m_bo + ρ_wa V_wa)      (5)

where ρ_wa is the density of the water or pesticide and m_bo is the mass of the body. The MOIs of the water can be calculated using Equation (6):

J_xx,wa = J_yy,wa = ρ_wa V_wa (l_wa² + h_wa²) / 12, J_zz,wa = ρ_wa V_wa l_wa² / 6      (6)

where J_xx,wa, J_yy,wa, and J_zz,wa are the MOIs of the water cube in the three axes. Then, the MOIs of the total quadrotor can be calculated using the rigid-body parallel axis theorem, as Equation (7) presents:

J_xx = J_xx,bo + m_bo d_bo² + J_xx,wa + ρ_wa V_wa d_wa²
J_yy = J_yy,bo + m_bo d_bo² + J_yy,wa + ρ_wa V_wa d_wa²      (7)
J_zz = J_zz,bo + J_zz,wa

where J_xx,bo, J_yy,bo, and J_zz,bo are the MOIs of the quadrotor body in the three axes. In this way, we can estimate the MOIs of the total quadrotor with any amount of water carried in it. In this article, we identify the parameters of the motors and rotors with a rotor bench (Figure 2b). The bench can measure the lift force, the torque and the angular velocity of a rotor, so the lift coefficient C_T and the torque coefficient C_M can be calculated. The remaining unknown parameters, such as the dynamic response constant of the rotors T_m, can be found in an online quadrotor database: https://flyeval.com/ (accessed on 27 March 2021) [22].
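The MOI estimation described above can be sketched as follows; the code assumes the standard cuboid-MOI and parallel-axis expressions, a 10 kg body mass, and the empty-tank MOIs reported later in the article (all defaults are illustrative and should be replaced by identified values):

```python
def total_moi(V_wa, m_bo=10.0, J_bo=(1.066, 0.961, 1.843),
              l_wa=0.28, d_cog=0.3, rho_wa=1000.0):
    """Estimate the MOIs of the quadrotor carrying V_wa m^3 of water.
    J_bo are the empty-body MOIs (kg*m^2); the water is idealized as a
    cuboid of fixed footprint l_wa x l_wa, offset d_cog below the body COG."""
    m_wa = rho_wa * V_wa                       # mass of the water
    h_wa = V_wa / l_wa**2                      # height of the water cuboid
    # split d_cog between body and water by the lever rule
    d_bo = m_wa * d_cog / (m_bo + m_wa)
    d_wa = m_bo * d_cog / (m_bo + m_wa)
    # cuboid MOIs about its own COG
    J_xx_wa = m_wa * (l_wa**2 + h_wa**2) / 12.0
    J_yy_wa = J_xx_wa
    J_zz_wa = m_wa * (l_wa**2 + l_wa**2) / 12.0
    # parallel-axis theorem; the COG offset lies along the z axis,
    # so only the x- and y-axis MOIs gain offset terms
    J_xx = J_bo[0] + m_bo * d_bo**2 + J_xx_wa + m_wa * d_wa**2
    J_yy = J_bo[1] + m_bo * d_bo**2 + J_yy_wa + m_wa * d_wa**2
    J_zz = J_bo[2] + J_zz_wa
    return J_xx, J_yy, J_zz
```

With an empty tank the function returns the body MOIs unchanged, and the MOIs grow monotonically as water is added.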
In the simulation and training process, the simulation environment updates the states of the quadrotors, including the attitude, the angular velocities of the three axes and the angular velocities of the rotors. The attitude is described with unit quaternions in this article:

q = [q_0, q_1, q_2, q_3] = [cos(Θ/2), p_x sin(Θ/2), p_y sin(Θ/2), p_z sin(Θ/2)]      (8)

where Θ is the combined rotation of the roll angle, the pitch angle and the yaw angle, and p_x, p_y, p_z are the projections of Θ in the roll, pitch and yaw axes, respectively. The quaternion can be updated with the angular velocities in the three axes using the Runge-Kutta method:

dq/dt = (1/2) q ⊗ [0, ω_x, ω_y, ω_z]      (9)
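A minimal sketch of this quaternion update with a fourth-order Runge-Kutta step is given below (the body-rate kinematics dq/dt = ½ q ⊗ [0, ω] is standard; the step size and renormalization policy are our choices):

```python
import numpy as np

def quat_deriv(q, omega):
    """dq/dt = 0.5 * q ⊗ [0, omega], with omega the body angular velocity."""
    w, x, y, z = q
    wx, wy, wz = omega
    return 0.5 * np.array([
        -x * wx - y * wy - z * wz,
         w * wx + y * wz - z * wy,
         w * wy - x * wz + z * wx,
         w * wz + x * wy - y * wx,
    ])

def quat_update_rk4(q, omega, dt):
    """One fourth-order Runge-Kutta step for the attitude quaternion,
    followed by renormalization to keep it unit-length."""
    k1 = quat_deriv(q, omega)
    k2 = quat_deriv(q + 0.5 * dt * k1, omega)
    k3 = quat_deriv(q + 0.5 * dt * k2, omega)
    k4 = quat_deriv(q + dt * k3, omega)
    q = q + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return q / np.linalg.norm(q)
```

Rotating at π rad/s about the roll axis for one second, for example, carries the identity quaternion to a 180° roll.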

Attitude Controller Design for Large-Load Plant-Protection Quadrotors
The attitude controller receives the desired attitude att_d, the current attitude att, the angular velocities in the three axes ω_x, ω_y, ω_z, the MOIs in the three axes J_xx, J_yy, J_zz and the response constant of the rotors T_m, and then outputs the control values of the three axes a_x, a_y, a_z ∈ [−1.0, 1.0] according to the states. The control allocator transforms a_x, a_y, a_z and the desired throttle value th_d into the control values of each rotor σ_i, i = 1, 2, 3, 4. The control diagram is presented in Figure 4.
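As a concrete illustration, such a control allocator for an "X" configuration might look like the following sketch; the mixing-matrix sign convention and the gain are assumptions made for illustration, not the allocation used in the article:

```python
import numpy as np

# assumed mixing matrix for an "X" configuration: each row maps the
# roll/pitch/yaw control values (a_x, a_y, a_z) to one rotor
MIX = np.array([
    [-1.0,  1.0,  1.0],
    [ 1.0, -1.0,  1.0],
    [ 1.0,  1.0, -1.0],
    [-1.0, -1.0, -1.0],
])

def allocate(a_x, a_y, a_z, th_d, gain=0.25):
    """Map attitude control values and the desired throttle th_d to the
    rotor control values sigma_i, clipped to the valid range [0, 1]."""
    sigma = th_d + gain * MIX @ np.array([a_x, a_y, a_z])
    return np.clip(sigma, 0.0, 1.0)
```

With zero attitude commands every rotor simply receives the throttle value; a roll command raises the rotors on one side and lowers those on the other.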

Quadrotor Attitude Controller with Reinforcement Learning
The RL framework includes the following elements: the agent, the environment, the state s_t, the action a_t, the policy π(s_t) and the reward r_t(s_t+1, a_t). During the training process, the environment outputs the state s_t, and the agent takes the action a_t according to the policy π(s_t) and then obtains a reward r_t(s_t+1, a_t). To describe the performance of the agent in the long term, the value of the current state and action V_t(s_t+1, a_t) is introduced:

V_t(s_t+1, a_t) = Σ_{k=0}^{∞} λ^k r_{t+k}      (10)

where λ ∈ [0, 1) is the discount factor. Equation (10) shows that the state-action value V_t is the sum of expected future rewards. Although these rewards are unknown, we can estimate them from previous outcomes. We can assume that an optimal policy π_opt maximizes V_{t+1}:

π_opt(s_{t+1}) = argmax_{a ∈ A} V_{t+1}(s_{t+2}, a)      (11)

where A is the action space of the agent. During the training process, we apply the current policy π instead of π_opt. In this article, the agent is a quadrotor with four rotors, and the environment receives the angular velocities of the rotors and outputs the attitude and the angular velocities in the three axes.
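The discounted sum in Equation (10) can be computed over a finite rollout of rewards as follows; the backward recursion v = r + λv is equivalent to the forward sum:

```python
def state_action_value(rewards, lam=0.98):
    """Discounted sum of a sequence of future rewards:
    V = r_0 + lam * r_1 + lam^2 * r_2 + ... (Equation (10), truncated
    to a finite rollout)."""
    v = 0.0
    for r in reversed(rewards):
        v = r + lam * v
    return v
```

For instance, three unit rewards with λ = 0.5 give 1 + 0.5 + 0.25 = 1.75.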

Network Structure and Training Algorithm
The structure of the RL controller is an actor-critic [21] based structure, as presented in Figure 5. The critic network is built to estimate the value of the state and action, and the actor network is built to compute the control values from the state. The RL method in this article is the extended-state RL proposed in our previous work [20], where system parameters can be extended into the state list, which enables the controller to fit changes in these parameters. As a result, the state list consists of the attitude errors att_e = [q_0, q_1, q_2, q_3] and the current angular velocities in the three axes ω_x, ω_y, ω_z, along with the MOIs in the three axes J_xx, J_yy, J_zz and the response constant of the rotors T_m. The action list consists of the control values in the three axes a_x, a_y, a_z.
The training method in this article is the DPG method [21], which consists of two actor networks and two critic networks. The actor networks include the target actor network and the online actor network, both with the same structure, each containing two hidden layers of 128 nodes. The tanh activation function is used in the output layer to limit the output values between −1.0 and 1.0. The online actor network is trained at each time step; the gradients are computed with the sampled policy gradients, and the weights are updated using the Adam optimizer [23]. The target actor network is soft updated [21] from the online actor network.
The critic networks also include the target critic network and the online critic network. The two critic networks share the same structure as well, each with two hidden layers of 128 nodes. The cost function of the online critic network is a squared-error function, which we minimize with the Adam optimizer. We soft update the weights of the target critic network from the online critic network.
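The actor described above can be sketched in numpy as follows; the 11-dimensional state (four attitude-error quaternion components, three angular velocities, three MOIs and T_m) follows the state list given earlier, while the hidden-layer activation (ReLU) and the weight initialization are assumptions, since the article does not state them:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ActorNetwork:
    """Minimal numpy sketch of the actor: an 11-dim state passes through
    two hidden layers of 128 nodes, and a tanh output layer bounds the
    three control values a_x, a_y, a_z in [-1, 1]."""
    def __init__(self, state_dim=11, hidden=128, action_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0.0, 0.1, (hidden, action_dim))
        self.b3 = np.zeros(action_dim)

    def forward(self, s):
        h1 = relu(s @ self.W1 + self.b1)
        h2 = relu(h1 @ self.W2 + self.b2)
        return np.tanh(h2 @ self.W3 + self.b3)   # bounded control values
```

The tanh output layer is what guarantees the controller never commands values outside the scaled range, regardless of the state magnitude.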
The design of the reward function is critical work in RL algorithms. Squared-error style reward functions are widely used in the RL-based control domain, as Equation (12) presents:

r_t = −p_1 Θ² − p_2 (ω_x² + ω_y² + ω_z²)      (12)

where p_1 and p_2 are positive parameters and Θ = 2 arccos(q_0) is the total rotation angle of the attitude error. This reward function encourages the agent to stably reach the zero point. However, in many RL studies using these squared-error reward functions, controllers show obvious steady-state errors [24]. We analyzed the cause of steady-state errors and presented a reward shaping method to solve this problem in article [20]. The reward shaping technique is also applied in this article, and we construct a piecewise function as Equation (13) presents, where x is the input state value (the attitude error or an angular velocity in this article) and δ and ρ are constants. With Equation (13), the reward function is designed as Equation (14) presents, where p_1, p_2, δ_1, ρ_1, δ_2 and ρ_2 are positive parameters to be tuned. Compared to reward functions like Equation (12), this function outputs positive rewards for close-to-zero states. In a real flight environment, sensor noise can decrease the performance of the controller. The attitude angle data are estimated and filtered using gyroscope data, accelerometer data and magnetic compass data, so the errors of the attitude angle data cannot be ignored. The angular velocities from the sensors are relatively accurate, so errors in the angular velocity data are ignored in this research. To improve the robustness of the quadrotor to sensor noise, sensor noise is added to the agents in the training process.
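The shaping idea can be illustrated with the following sketch; the exact piecewise form of Equation (13) is given in reference [20], so the function below is only an assumed stand-in that reproduces the key property stated above, namely a positive reward near zero and a squared-error style penalty elsewhere:

```python
def shaped(x, delta, rho):
    """Assumed piecewise shaping term (illustrative, not Equation (13)
    itself): positive reward rho near zero (|x| < delta), decaying
    quadratically like a squared-error penalty outside that band."""
    if abs(x) < delta:
        return rho * (1.0 - abs(x) / delta)   # positive near the origin
    return -(abs(x) - delta) ** 2             # squared-error style outside

def reward(theta, omega, p1=1.0, p2=0.1, d1=0.05, r1=1.0, d2=0.5, r2=0.5):
    """Reward in the spirit of Equation (14): shaped terms for the total
    attitude-error angle theta and the angular-velocity magnitude omega;
    all parameter values here are placeholders to be tuned."""
    return p1 * shaped(theta, d1, r1) + p2 * shaped(omega, d2, r2)
```

Because the agent is paid a strictly positive reward only inside the small band around zero, it has an incentive to drive the residual error all the way down instead of parking at a small steady-state offset.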
In our algorithm, we run several agents in parallel and immediately employ their experience at each time step, which has some advantages: (1) different agents have different physical parameters and different initial states and take different actions, thus avoiding the experience correlation problem; (2) compared to traditional RL methods such as deep Q-networks (DQN) [25], parallel training does not need experience replay and thus requires much less memory; (3) several agents provide more experience at the same time, greatly accelerating the training process.
The whole algorithm is listed as Algorithm 1.

Algorithm 1
Extended-state RL parallel training algorithm
Input: n_agent (max number of agents), e_max (max number of episodes), s_max (max number of steps during an episode), R(s') (the reward function), τ (soft update rate)
1: create n_agent agents with different system parameters
2: randomly initialize the weights of the online critic network θ_v and the online actor network θ_π
3: initialize the weights of the target critic network θ_v' and the target actor network θ_π': θ_v' ← θ_v, θ_π' ← θ_π
4: for epi = 1, 2, ..., e_max do
5:   initialize all agents with random states
6:   for step = 1, 2, ..., s_max do
7:     clear the memory box B ← ∅
8:     for n = 1, 2, ..., n_agent do
9:       receive the current state s
10:      take action a according to s
11:      receive the next state s'
12:      put E = (s, a, s') into the memory box: B ← B ∪ {E}
13:    end for
14:    for i = 1, 2, ..., n_agent do
15:      obtain experience e_i = (s_i, a_i, s_i') from B
16:      compute the reward of experience e_i: r_i ← R(s_i')
17:      update θ_v by minimizing the loss
18:      update θ_π using the sampled policy gradients
19:      soft update θ_v' with θ_v: θ_v' ← τθ_v + (1 − τ)θ_v'
20:      soft update θ_π' with θ_π: θ_π' ← τθ_π + (1 − τ)θ_π'
21:    end for
22:  end for
23: end for
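Algorithm 1 can be sketched as the following skeleton; the agent, actor and critic objects and their methods are placeholders standing in for the simulation environment and the network-update steps, not a real API:

```python
def train(agents, actor, critic, e_max, s_max, reward_fn, tau):
    """Skeleton of the parallel training loop: all agents act once per
    step, their transitions fill a memory box, and the experience is
    consumed immediately (no replay buffer)."""
    for _ in range(e_max):                       # episodes
        for agent in agents:
            agent.reset_random_state()           # random initial states
        for _ in range(s_max):                   # steps within an episode
            memory = []                          # clear the memory box B
            for agent in agents:                 # every agent acts once
                s = agent.state()
                a = actor.act(s)
                s_next = agent.step(a)
                memory.append((s, a, s_next))
            for s, a, s_next in memory:          # learn from this step only
                r = reward_fn(s_next)
                critic.update(s, a, r, s_next)   # minimize the TD loss
                actor.update(s, critic)          # sampled policy gradient
            critic.soft_update_target(tau)       # target-network tracking
            actor.soft_update_target(tau)
```

Discarding the memory box after every step is what removes the need for a replay buffer: with many decorrelated agents, one fresh batch per step is enough.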

Training Environment
A simulation environment was developed to train the controller using Python 3.8 and TensorFlow 1.14. We can create quadrotors with the desired physical parameters and test the controllers in it. The time steps are not tied to real time, so the training process can run much faster than in the real world; the states of the quadrotors are updated at every simulation step, without waiting for real time to elapse, until the training or control process is finished. We train the controller on a laptop computer equipped with an Intel i7-7700HQ central processing unit (CPU) and an NVIDIA GTX 1060 graphics processing unit (GPU).

Quadrotor Details
The large-load plant-protection quadrotor is a 10 kg class "X" type plant-protection quadrotor equipped with HLY Q9L brushless motors, 3080 carbon fiber propellers, Hobbywing XRotor Pro HV 80A electronic speed controllers (ESCs) and two ACE TATTU 16,000 mAh 6S Li-polymer batteries in series. The parameters of the quadrotor with an empty tank are identified in Section 2.1 and listed in Table 1. The water in the tank is idealized as a cube with equal length and width:

Maximum volume of water V_wa,max: 10 L
Maximum mass of water M_wa,max: 10 kg
Density of water ρ_wa: 1.0 × 10³ kg/m³
Distance between the quadrotor COG and the water COG d_cog: 0.3 m
Length and width of the water cube l_wa: 0.28 m

Training Details
The rotor dynamic response constant of this large-load quadrotor is T_m = 0.035 s, while we calculate that this constant for a common small quadrotor is T_m = 0.0157 s. During the training process, T_m is generated randomly between 0.015 s and 0.045 s for the agents.
Using the data in Table 1, we can calculate the maximum and minimum MOIs in the three axes using Equation (7), and the results are listed in Table 2. As a result, the MOIs in the x and y axes range between 0.8 and 1.7 kg·m², and the MOI in the z axis ranges between 1.8 and 2.1 kg·m² during the training process.
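The randomization of the agents' physical parameters can be sketched as follows, using the stated T_m range and the MOI ranges from Table 2:

```python
import random

def sample_agent_params():
    """Draw one agent's extended-state parameters from the training ranges."""
    return {
        "J_xx": random.uniform(0.8, 1.7),     # kg*m^2, x-axis MOI range
        "J_yy": random.uniform(0.8, 1.7),     # kg*m^2, y-axis MOI range
        "J_zz": random.uniform(1.8, 2.1),     # kg*m^2, z-axis MOI range
        "T_m":  random.uniform(0.015, 0.045), # s, rotor response constant
    }
```

Each of the 128 parallel agents receives its own draw, so the learned policy is exposed to the full parameter range during training.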
In this research, sensor noise is added during the training process to improve the robustness of the controller. The attitude and heading reference system (AHRS) used in our flight controller is a mature product, so the attitude angle data and angular velocity data can be read directly from the module. The AHRS module claims that the errors of the attitude angle in the roll and pitch axes are within ±0.1° and the errors of the attitude angle in the yaw axis are within ±1°. As a result, we add noise uniformly distributed within ±1° to the attitude angle data in the three axes during the training process. Table 3 presents the training hyperparameters. The total training process takes approximately 3 h.

Performance in Simulation
The RL controller in this article can generate an optimal control law given the MOIs in the three axes and the rotor dynamic response constant as inputs. We first test the performance of the plant-protection quadrotor without water in the tank, that is: J_xx = 1.066, J_yy = 0.961, J_zz = 1.843, and T_m = 0.035. The performance along the roll axis and the yaw axis is shown in Figures 6 and 7, respectively. The convergence rate of our controller is very high, and nearly no overshoot or steady-state error exists. Next, we test the performance of plant-protection quadrotors with a half-full water tank (J_xx = 1.377, J_yy = 1.272, J_zz = 1.908, and T_m = 0.035) and a full water tank (J_xx = 1.540, J_yy = 1.435, J_zz = 1.973, T_m = 0.035). The performance along the roll axis is given in Figure 8. We conclude that our controller can adapt to changes in the system parameters and keep the quadrotor flying stably and accurately with any amount of water in the tank.
Figure 8. Step response performance along the roll axis: (a) quadrotor with a half-full water tank; (b) quadrotor with a full water tank.
We test the robustness of the controller to sensor noise in Figure 9. Noise uniformly distributed within ±1° is added to the attitude angle data of the three axes; the proposed controller still keeps the attitude of the quadrotor around the desired attitude angle, and the attitude errors are less than 1°, which is acceptable.
Traditional PID controllers are designed for comparison, and their performance and that of our RL controller are presented below. As shown in Figure 10, PID controller 1 performs well for the quadrotor without water in the tank, but when the tank is filled with water, PID controller 1 shows obvious overshoot compared to our RL controller. As shown in Figure 11, PID controller 2 performs well for the quadrotor with a full water tank, but when the tank is empty, PID controller 2 shows a notably lower convergence rate than our RL controller. In conclusion, it is difficult to control the quadrotor well with any amount of water using the same PID parameters, and the performance of our RL controller is much more robust to system parameter changes than that of traditional PID controllers.

Figure 9. Step response test for controllers with sensor noise: (a) step response test along the roll axis; (b) step response test along the yaw axis.
Figure 11. Step response performance of PID controller 2 along the roll axis.
In the training process, the states of the quadrotors are initialized randomly, so the quadrotor can recover from any state (even a fully overturned state) to the zero point. During the training process, the performance of our controller improves gradually. We designed tests to verify these two factors. We test controllers from different episodes (ep = 0, 1000, 5000 and 10,000) on the roll axis of the quadrotor without water in the tank. In Figure 12, the quadrotors recovered to the zero point from random initial states 100 times. We conclude that, after the training process, the controller tends to recover from any state to the zero point and that the performance improves as the training proceeds.

Flight Controller Platform Details
The flight controller in this project is developed by our team and is called the TensorFly Project; this controller focuses on flight control with RL methods. To meet the demand of real-time network calculation, we chose a Raspberry Pi 3B microcomputer (four 1.2 GHz CPU cores) as the main controller in our project. The controller is coded in C++ and performs matrix computation using the Eigen library [26]. The flight controller runs the Raspbian operating system on the board, and we performed substantial work toward reducing the latency of the system. The actor network forward pass takes approximately 260 µs on average. However, in rare cases, the computation may take over 1 ms; we believe this results from background processes of the onboard operating system. Regardless, the time consumption of the network calculation is far less than the control period (10 ms), which is acceptable. The AHRS module in this controller is a JY901 module, which directly outputs attitude angle data and angular velocity data in real time, with attitude errors of less than 1.0°. The flight controller and the AHRS module are shown in Figure 13.

Flight Performance in Real Flight
The 10 kg class large-load plant-protection quadrotor and its flight using our RL controller are presented in Figure 14. We test the flight performance of the attitude tracking task on this quadrotor without water and with a half-full water tank, and the performance is presented below. We manually command the desired attitudes of the quadrotor using a remote controller. The flight log of the quadrotor without water in the tank (J_xx = 1.3660, J_yy = 1.261, J_zz = 2.343, T_m = 0.035) is presented in Figure 15, and the attitude errors are presented in Figure 16. The flight log of the quadrotor with a half-full water tank (J_xx = 1.3660, J_yy = 1.261, J_zz = 2.343, T_m = 0.035) is presented in Figure 17, and the attitude errors are presented in Figure 18. We conclude that the quadrotor can rapidly and smoothly track the desired attitude (the attitude errors are within 6° during flight), regardless of the amount of water in the tank, which is very practical in real applications.

Discussion
In this article, we presented an extended-state RL method for large-load plant-protection quadrotors. We focused on critical problems of plant-protection quadrotors in real applications, such as changes in the moments of inertia during watering work and the slow response rates of large rotors. The RL controller extends these parameters into the state list during training and running, so the controller can automatically adapt to these changes. Parallel training greatly accelerates the training process, and reward shaping eliminates steady-state errors. The controller was tested in a simulation environment and applied on a real large-load plant-protection quadrotor platform, and it achieved great performance regardless of the amount of water in the tank. Compared to traditional methods that need manual modification and tuning of the controller, our method shows an end-to-end design style, greatly simplifies the controller design work and ensures safety in real flight experiments.

Future Work
During our experimental process, we found that uncertainties such as side winds, movement of the COG of the quadrotor, wobbling water in the tank and other external disturbances can degrade flight stability. Active disturbance rejection control (ADRC) methods will be applied and tested in our future work to solve these practical problems for large-load plant-protection quadrotors.
The extended-state RL method proposed in this paper can be widely used on several other platforms such as unmanned underwater vehicles and robot arms. We note that the DAI methods can achieve a similar performance, and we plan to consider this comparative methodology to further improve the performance of our methods.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.