Research on the Multiagent Joint Proximal Policy Optimization Algorithm Controlling Cooperative Fixed-Wing UAV Obstacle Avoidance

Multiple unmanned aerial vehicle (UAV) collaboration has great potential. To increase the intelligence and environmental adaptability of multi-UAV control, we study the application of deep reinforcement learning algorithms to multi-UAV cooperative control. Aiming at the problem of a non-stationary environment caused by the changing policies of learning agents in a multiagent environment, this paper presents an improved multiagent reinforcement learning algorithm, the multiagent joint proximal policy optimization (MAJPPO) algorithm, with centralized learning and decentralized execution. The algorithm uses a moving window averaging method to give each agent a centralized state-value function, so that the agents can collaborate better. The improved algorithm enhances collaboration and increases the sum of the reward values obtained by the multiagent system. To evaluate the performance of the algorithm, we use MAJPPO to accomplish multi-UAV formation and the crossing of multiple-obstacle environments. To simplify the control complexity of the UAV, we use the six-degree-of-freedom, 12-state dynamics model of the UAV with an attitude control loop. The experimental results show that the MAJPPO algorithm has better performance and environmental adaptability.


Introduction
The autonomous and intelligent development of the coordinated control of multiagent systems, such as multi-unmanned aerial vehicle (UAV) and multi-robot systems, has received increasing attention.
To solve the problem of coordinated control and obstacle avoidance of multiagent systems, this paper proposes the multiagent joint proximal policy optimization (MAJPPO) algorithm, which uses the moving window average of the state-value functions of different agents to obtain a centralized state-value function for multi-UAV cooperative control. The algorithm improves collaboration among the agents in the multiagent system compared with the multiagent independent PPO (MAIPPO) algorithm. Since the PPO algorithm uses a state-value function as the evaluation function, it differs from the Deep Q-Network (DQN) [23], which uses the action-value function. Therefore, the centralized value function of the MAJPPO algorithm does not require the policies of collaborating agents during training, which reduces the complexity of the algorithm. Finally, we train multi-UAV formations to cross a multi-obstacle environment to evaluate the performance of the algorithm. In reinforcement learning for UAVs, there are two choices of controlled object: the UAV dynamics model with an attitude control loop, or the UAV dynamics model without one. We use the UAV dynamics model with the attitude control loop as the controlled object of multi-UAV cooperative control, mainly because it has fewer degrees of freedom and fewer optimization targets.
Therefore, the main contributions of the paper are as follows: (1) the development of the MAJPPO algorithm; and (2) the application of a MARL algorithm to the field of multi-UAV formation and obstacle avoidance.
The rest of the paper is organized as follows. Section 2 introduces the background related to this work. Section 3 describes the PPO algorithm. Section 4 presents the independent PPO algorithm for the multiagent environment. Section 5 describes the novel MAJPPO algorithm and discusses it. Section 6 describes the dynamics model of the small UAV with the attitude control loop, RL of a single UAV, and the basic settings of the formation. Section 7 presents the experiments and analysis. The conclusions appear in Section 8.

Background and Preliminary
In the field of RL, the Markov decision process (MDP) is a key concept. RL enables an agent to learn a profitable policy through interaction with an unknown environment. Such environments are often formalized as MDPs, described by a five-tuple (S, A, P, R, γ). At each time step t, an agent interacting with the environment observes a state s_t ∈ S and chooses an action a_t ∈ A, which determines the reward r_t ∼ R(s_t, a_t) and the next state s_{t+1} ∼ P(s_t, a_t). The purpose of RL is to maximize the cumulative discounted reward G_t = Σ_{τ=t}^{T} γ^{τ−t} r_τ, where T is the time step at which an episode ends, t denotes the current time step, γ ∈ [0, 1] is the discount factor, and r_τ is the reward received at time step τ. The action-value function (abbreviated as Q-function) of a given policy π is defined as the expected return starting from a state-action pair (s, a), expressed as Q^π(s, a) = E[G_t | s_t = s, a_t = a, π]. Q-learning is a widely used RL algorithm that mainly uses the action-value function Q^π(s, a) to learn the policy [24]. DQN [23] is an RL algorithm combining Q-learning and a neural network, which learns the action-value function Q* corresponding to the optimal policy by minimizing the loss L(θ) = E_π[(Q_θ(s, a) − y)^2], where y = r + γ max_{a'} Q_θ(s', a') is the Q-learning target value.
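As a concrete illustration of the Q-learning target used by DQN, the following minimal tabular sketch applies one update toward y = r + γ max_{a'} Q(s', a'); the two-state, two-action Q-table is hypothetical:

```python
# Minimal tabular Q-learning update: Q(s, a) <- Q(s, a) + lr * (y - Q(s, a)),
# where y = r + gamma * max_a' Q(s', a') is the Q-learning target.
# DQN minimizes the same squared error (Q_theta(s, a) - y)^2 with a network.

def q_learning_update(Q, s, a, r, s_next, gamma=0.99, lr=0.1):
    """One Q-learning step on a dict-of-dicts Q-table."""
    y = r + gamma * max(Q[s_next].values())      # bootstrap target
    Q[s][a] += lr * (y - Q[s][a])                # move Q(s, a) toward y
    return Q

# Hypothetical two-state, two-action environment.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 2.0}}
q_learning_update(Q, s=0, a=1, r=0.5, s_next=1)
# target y = 0.5 + 0.99 * 2 = 2.48; Q(0, 1) moves 10% of the way toward it
print(round(Q[0][1], 4))
```

DQN replaces the table with a network Q_θ and samples (s, a, r, s') transitions from a replay buffer, but the target has the same form.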
In the real world, an agent often cannot obtain all the information of the environment, or the environmental information it obtains is incomplete and noisy; that is, only part of the environment can be observed. In this case, we can use the partially observable Markov decision process (POMDP) to model such problems. A POMDP can be described as a six-tuple (S, A, P, R, γ, O), where O is the observation perceived by the agent. The deep recurrent Q-network (DRQN) [25] was proposed to deal with partially observable problems and POMDPs, extending the architecture of DQN with Long Short-Term Memory (LSTM).
In a multiagent learning domain, the POMDP generalizes to a stochastic game or a Markov game. A multiagent learning environment can be modeled as a decentralized POMDP (Dec-POMDP) framework [26]. The Dec-POMDP model extends single-agent POMDP models by considering joint actions and observations. Solving the Dec-POMDP problem is at the core of the MARL algorithms. The mainstream solution is to optimize the decentralized policy by centralized learning, such as MADDPG, VDN, and QMIX.

PPO Algorithm
Policy gradient (PG) methods differ from Q-learning in that they explicitly learn a stochastic policy distribution π_θ parametrized by θ. The objective of PG is to maximize the expected return over the trajectories induced by the policy π_θ. If we denote the reward of a trajectory τ generated by policy π_θ as r(τ), the policy gradient estimator has the form g := E[r(τ) ∇_θ log π_θ(τ)]. This is the REINFORCE algorithm [27]. However, REINFORCE has high variance; a baseline, such as a value function baseline, can be used to mitigate this shortcoming. Generalized advantage estimation [28] uses this approach, accepting some bias to reduce variance.
Schulman et al. [29] proposed the TRPO algorithm to address the PG method's need for a carefully tuned step size. The PPO algorithm [30,31] is a simplification of TRPO, with simpler execution and sampling methods.
The PPO algorithm optimizes the surrogate objective (1): L^CLIP(θ) = E_t[min(ρ_t(θ)Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε)Â_t)], where ρ_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t) denotes the likelihood ratio and Â_t is the generalized advantage estimate.
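The clipping in the surrogate objective (1) can be illustrated with a minimal sketch of the per-sample clipped term; the likelihood ratio and advantage values below are hypothetical:

```python
# PPO clipped surrogate (1) for a single sample:
#   L = min(rho * A, clip(rho, 1 - eps, 1 + eps) * A),
# where rho is the likelihood ratio pi_theta(a|s) / pi_theta_old(a|s)
# and A is the generalized advantage estimate.

def clipped_surrogate(rho, adv, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * adv, clipped * adv)

# With a positive advantage, the benefit of increasing the ratio is capped at 1 + eps ...
print(clipped_surrogate(1.5, adv=1.0))   # 1.2
# ... while with a negative advantage the penalty is not clipped away.
print(clipped_surrogate(1.5, adv=-1.0))  # -1.5
```

The clip prevents the new policy from moving too far from the old one in a single update, which is what removes TRPO's explicit trust-region constraint.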
Similar to the DRQN algorithm, the combination of PPO and LSTM has a good effect on solving the POMDP problem [19,31].

Multiagent Independent PPO Algorithm
Tampuu et al. [11] demonstrated how competitive and collaborative behaviors emerge with independent Q-learning. Bansal et al. [19] showed that a multiagent environment produces complex behaviors with the independent Proximal Policy Optimization (PPO) algorithm (MAIPPO).
However, the MAIPPO algorithm may fail to converge due to the non-stationarity of the environment caused by the changing policies of the learning agents. The network structure of MAIPPO is shown in Figure 1.

Sensors 2020, 20, 4546
The structure of the MAIPPO algorithm we construct is relatively simple. The actor and critic networks of the MAIPPO algorithm are composed of an LSTM layer and a series of fully connected layers (abbreviated as FC layers). The critic network outputs the state-value function V(O_t) and is optimized by minimizing a loss; the generalized advantage estimate A(O_t, a_t) is calculated from V(O_t) and is then used to optimize the actor network through the surrogate objective.
The critic network is updated by minimizing the loss function (2): L(φ) = E_t[(V_φ(O_t) − V_t^target)^2]. The actor network is updated by optimizing the surrogate objective (4): L(θ) = E_t[min(ρ_t(θ)Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε)Â_t) + c S[π_θ](O_t)], where S denotes an entropy bonus and c is its coefficient. We can use a truncated version of generalized advantage estimation (5): Â_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T−t+1}δ_{T−1}, with δ_t = r_t + γV(O_{t+1}) − V(O_t). O_t in the above formulas denotes, for agent i and agent j respectively, the agent's own states together with partial observations of the other agent.
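The truncated generalized advantage estimator in (5) is usually computed with a backward recursion over the TD residuals δ_t = r_t + γV(s_{t+1}) − V(s_t). A minimal sketch, with hypothetical rewards and value estimates:

```python
# Truncated generalized advantage estimation (5):
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
#   A_t = sum_{l >= 0} (gamma * lambda)^l * delta_{t+l}, computed backward.

def gae(rewards, values, last_value, gamma=0.995, lam=0.95):
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # discounted sum
        advantages[t] = running
        next_value = values[t]
    return advantages

adv = gae([1.0, 1.0], values=[0.5, 0.5], last_value=0.0)
# A_1 = 1.0 + 0.995 * 0.0 - 0.5 = 0.5
# A_0 = (1.0 + 0.995 * 0.5 - 0.5) + (0.995 * 0.95) * A_1
```

With λ = 0 this reduces to the one-step TD residual (low variance, higher bias); with λ = 1 it recovers the Monte Carlo return minus the baseline.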

Multiagent Joint PPO Algorithm
In a multiagent learning environment, the environment becomes non-stationary due to the changing policies of the learning agents. For the independent Q-learning algorithm, agents optimize policies through local action-value functions, which obstructs convergence.
A series of improved algorithms aim to learn a centralized critic. The counterfactual multiagent (COMA) policy gradients algorithm and the multiagent Deep Deterministic Policy Gradient (MADDPG) use a centralized critic to estimate the Q-function and decentralized actors to optimize the agents' policies. VDN optimizes decentralized policies by using the sum of the agents' Q-value functions as a centralized evaluation function. QMIX, an improved version of VDN, learns a more complex joint action-value function by constructing a mixing network. DPIQN and DRPIQN propose to employ policy features of collaborators and opponents to infer and predict their policies.
The MAJPPO algorithm is proposed based on the MAIPPO algorithm. Different from the Q-learning algorithm, which uses the action-value function to evaluate and optimize a policy, the PPO algorithm mainly uses the state-value function and the generalized advantage estimate. Accordingly, the MAJPPO algorithm mainly learns a joint state-value function and the corresponding generalized advantage estimate to evaluate and optimize the decentralized policies. To enhance the stability of training and the cooperation between agents, we use the moving window average of the state-value functions of different agents to obtain the joint state-value functions V_joint^i (8) and V_joint^j (9): V_joint^i = ξV^i + (1 − ξ)V^j and V_joint^j = ξV^j + (1 − ξ)V^i, where ξ is a constant. Agent i and agent j simultaneously obtain their respective observations O^i and O^j, which include both their own observations and partial observations of the other agent. The state-value functions V^i and V^j are obtained through their respective critic networks, and the joint state-value functions V_joint^i and V_joint^j are then obtained through the weighted average of the state-value functions. The joint state-value function V_joint^i includes both the evaluation of the state of agent i and the evaluation of the state of the other agent. The small weight (1 − ξ) in V_joint^i mainly reduces the influence on the joint state-value function of the part of agent j's state s_j other than the observed part s_{j,p}. Likewise, the joint state-value function V_joint^j obtained by agent j includes both the evaluation of the state of agent j and the evaluation of the partial state of the other agent. The surrogate objectives derived from V_joint^i and V_joint^j optimize the actor networks to obtain cooperative policies. The value function that the MAJPPO algorithm learns through the critic networks thus combines the state features of the agent itself with those of other agents. The VDN paper points out that lazy agents arise due to the partial observability of the state.
The critic networks of the MAJPPO algorithm use global information to learn the value function, and the advantage functions derived from the value function are used to update the actor networks. This can alleviate the lazy agent problem to some extent. The MAJPPO algorithm has similarities with the VDN and QMIX algorithms: MAJPPO uses the weighted average of the agents' state-value functions in place of each agent's local state-value function to achieve centralized learning.
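Assuming the pairwise weighted-average form implied by (8) and (9), the moving window averaging of the two agents' state-value estimates can be sketched as:

```python
# Joint state-value functions (8) and (9) as a weighted average of each agent's
# own value estimate with its collaborator's:
#   V_joint^i = xi * V^i + (1 - xi) * V^j
#   V_joint^j = xi * V^j + (1 - xi) * V^i
# xi close to 1 keeps the evaluation mostly local; xi = 1 recovers independent PPO.

def joint_values(v_i, v_j, xi=0.9):
    v_joint_i = xi * v_i + (1.0 - xi) * v_j
    v_joint_j = xi * v_j + (1.0 - xi) * v_i
    return v_joint_i, v_joint_j

# Each joint value leans toward the agent's own estimate but is pulled
# toward the collaborator's, coupling the two advantage estimates.
v_joint_i, v_joint_j = joint_values(1.0, 2.0)
```

Because each agent's advantage estimate is now derived from a value that depends on the other agent's evaluation, the actor updates are coupled, which is what encourages cooperative policies.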

Dynamics Model of Small UAV and Attitude Control
We use the six-degree-of-freedom, 12-state equations of motion with the quasi-linear aerodynamic and propulsion models [32]. The model is provided in Appendix A. It is a fairly complicated set of 12 nonlinear, coupled, first-order ordinary differential equations. In addition to the 12 state variables [p_n; p_e; h; u; v; w; φ; θ; ψ; p; q; r], there are four input variables: the aileron deflection δ_a, the elevator deflection δ_e, the rudder deflection δ_r, and the throttle command δ_t.
We could use the attitude control method of Appendix B to control the attitude of the above-mentioned UAV dynamics model.

RL of Single UAV
We use the above-mentioned UAV model as the controlled object of RL. The basis of applying multiagent RL to multi-UAV collaborative control is that RL can stably control the flight of a single UAV.
We have two ways to control the UAV's stable flight using reinforcement learning: one is to control the dynamics model of the UAV directly, and the other is to control the dynamics model of the UAV through the attitude loop. For the first method, the details are as follows. We use the 12 states of the UAV as the input of the PPO neural network. The network output is [δ_a; δ_e; δ_r; δ_t], where [δ_a; δ_e; δ_r] ∈ [−1, 1] and δ_t ∈ [0, 1]. The output [δ_a; δ_e; δ_r; δ_t] is applied to the UAV motion model to obtain the next states of the UAV after 0.1 s, and so on in a cycle, as shown in Figure 2. We use 10 s, that is, 100 steps, as an episode, and update the network every 10 episodes. A reasonable reward function structure is necessary to learn a stable control model. The task is for the UAV to start from an appropriate position and reach a target position with target altitude h_target in a stable attitude and at a certain velocity V_target. The reward function (11) can then be constructed accordingly, where η_v is a constant, V = norm([u, v, w]), and r_navig can be transformed according to specific tasks. In this way, the UAV can learn a stable policy model to complete the task. For the second method, the details are as follows. We use the 6 states of the UAV (position and velocity) as the input of the PPO neural network. The network output is [θ; φ; δ_t], where [θ; φ] ∈ [−0.5, 0.5] and δ_t ∈ [0, 1]. The output [θ; φ; δ_t] is applied to the UAV dynamics model with the attitude control loop to obtain the next states of the UAV after 0.5 s, and so on in a cycle, as shown in Figure 3. Again, a reasonable reward function structure is necessary to learn a stable control model. The task is the same: the UAV starts from an appropriate position and reaches another position in a stable attitude and at a certain velocity, and the reward function (13) can be constructed similarly. In this way, the UAV can obtain a stable policy model to complete the task.
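A hedged sketch of a reward in the spirit of (11) is shown below; the exact form used in the paper is not reproduced here, so the weighting and the navigation term r_navig are illustrative assumptions:

```python
# Illustrative single-UAV reward: a velocity-tracking term weighted by eta_v
# plus a navigation term r_navig toward the target position. This is a sketch,
# not the paper's equation (11); the form of r_navig is an assumption.
import math

def reward(pos, vel, pos_target, v_target, eta_v=0.1):
    V = math.sqrt(sum(c * c for c in vel))      # V = norm([u, v, w])
    r_navig = -math.dist(pos, pos_target)       # closer to the target is better
    return -eta_v * abs(V - v_target) + r_navig

# Hypothetical example: 5 m/s velocity error and 10 m altitude error
# both reduce the reward.
r = reward(pos=(0.0, 0.0, 90.0), vel=(20.0, 0.0, 0.0),
           pos_target=(0.0, 0.0, 100.0), v_target=25.0)
print(r)  # -10.5
```

As the paper notes, r_navig can be swapped out to suit the specific task while keeping the velocity-tracking term.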
Comparing the two methods, we find that controlling the UAV with the attitude loop has fewer optimization targets, which reduces the complexity of UAV control and facilitates multi-UAV coordinated control.


Multi-UAV Formation
We use the MAIPPO and MAJPPO algorithms to solve multi-UAV collaborative control tasks. The control tasks we study mainly comprise the formation of three UAVs and obstacle avoidance.
The inputs to the MAIPPO and MAJPPO algorithms include each UAV's own states, its distances from obstacles, and partial states of the other UAVs. We found in experiments that using the positions of the other UAVs leads to more stable training results. For example, three UAVs are represented by UAV1, UAV2, and UAV3, and two obstacles by obstacle1 and obstacle2. The network input of UAV1 is then [p_n^1; p_e^1; h^1; u^1; v^1; w^1; p_n^2; p_e^2; h^2; p_n^3; p_e^3; h^3; dist_11; dist_12], where [p_n^2; p_e^2; h^2] is the position of UAV2, [p_n^3; p_e^3; h^3] is the position of UAV3, and dist_11 and dist_12 are the distances between UAV1 and obstacle1 and between UAV1 and obstacle2, respectively. A UAV has a detection distance d_detection for obstacles; when dist_11 > d_detection, dist_11 is set to d_detection. The reward function (16) consists of three parts: (i) one to fly the UAVs with a stable attitude and velocity, denoted by R_single; (ii) another to coordinate the UAVs' flight while maintaining a certain formation distance, denoted by R_form; and (iii) a third to implement the UAVs' obstacle avoidance, as follows.
where K, α_s, α_f, and α_o are constants with α_s + α_f + α_o = 1; d_form represents the safe distance of the formation; norm([p_n^12, p_e^12, h^12]) represents the distance between UAV1 and UAV2, while norm([p_n^13, p_e^13, h^13]) represents the distance between UAV1 and UAV3; and R_obstacle^1 and R_obstacle^2 are the reward functions of the UAV for obstacle1 and obstacle2.
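The weighted combination in (16) can be sketched as follows; the weight values below are illustrative (the paper requires only that they sum to one), and the individual reward terms are hypothetical stand-ins:

```python
# Composite formation reward in the shape of (16):
#   R = alpha_s * R_single + alpha_f * R_form + alpha_o * R_obstacle,
# with alpha_s + alpha_f + alpha_o = 1. The alpha_f and alpha_o values here
# are hypothetical; the reward terms passed in are placeholders.

def formation_reward(r_single, r_form, r_obstacle,
                     alpha_s=0.5, alpha_f=0.3, alpha_o=0.2):
    assert abs(alpha_s + alpha_f + alpha_o - 1.0) < 1e-9
    return alpha_s * r_single + alpha_f * r_form + alpha_o * r_obstacle

# 0.5 * 1.0 + 0.3 * 0.5 + 0.2 * (-1.0) = 0.45
print(round(formation_reward(1.0, 0.5, -1.0), 2))  # 0.45
```

Normalizing the weights to sum to one keeps the overall reward scale fixed while the balance between flight stability, formation keeping, and obstacle avoidance is tuned.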

Network Settings
The critic network architecture first processes the input with an LSTM layer with 128 hidden units, followed by two fully connected linear layers, each with 128 hidden units and a TanH activation.
The actor consists of two parts: a neural network and a normal distribution. The actor network has an LSTM layer with 128 hidden units, followed by a fully connected linear layer with 128 hidden units and a TanH activation. The output of the network is the mean of a normal distribution with covariance matrix C = 0.05·I, where I is the identity matrix [33], and the distribution generates the actions. The output range of the angles [θ; φ] is limited to [−0.5, 0.5], and the range of the throttle δ_t to [0, 1]. Therefore, the mean of the angles uses TanH as the activation function, and the mean of the throttle uses the sigmoid function.
Due to the computational complexity of the UAV motion model, we use multiple processes to shorten the training time.

Parameter Settings
The learning rate of Adam is 0.0001. The clipping parameter is ε = 0.2, the discount factor is γ = 0.995, and the generalized advantage estimation parameter is λ = 0.95. We use large batch sizes, which mitigate the variance problem to some extent and help exploration. In each iteration, we collect 1000 samples or 20 episodes, with 50 steps as one episode, and perform 20 epochs of training in mini-batches of 512 samples. We found l2 regularization of the policy and value network parameters with coefficient 0.01 to be useful. The entropy coefficient is c = 0.001. The parameters in the reward function are set to K = 100 and α_s = 0.5. The sampling time is set to Δt = 0.5 s.

Mission Environment
Assume that the three UAVs fly from the initial area to the target area at a certain speed and a stable attitude as required by the formation, and pass through the area with six obstacles. The following initial values are assumed to simplify the task environment:

Experimental Comparison and Analysis
In order to compare the performance of the algorithms intuitively, we use the above parameters and task environment to run the MAIPPO and MAJPPO algorithms until convergence, where the parameter of the MAJPPO algorithm is ξ = 0.9. The learning curves of the algorithms are shown in Figure 4; note that the reward in Figure 4 is the sum of the rewards of the three UAVs. We performed 10,000 iterations for each algorithm. It can be clearly seen from Figure 4 that the MAJPPO algorithm outperforms the MAIPPO algorithm in dealing with multi-UAV collaboration and obstacle avoidance problems. It can also be seen that the learning curve of the MAIPPO algorithm is not stable after convergence, while the MAJPPO algorithm achieves a higher reward value and converges more stably. Therefore, the MAJPPO algorithm obtains better results than the MAIPPO algorithm in this Dec-POMDP environment; the training curve of the MAIPPO algorithm cannot converge well because of the non-stationarity of the environment. Figure 5 shows the trajectory curves, distance curves, altitude curves, velocity curves, and UAV-obstacle distance curves of the UAVs after training with the MAIPPO and MAJPPO algorithms. As can be seen from Figure 5, the three UAVs fulfill the mission requirements well.
It can also be seen from Figure 5 that the network trained by the MAJPPO algorithm performs better in the multi-obstacle environment for multi-UAV obstacle avoidance control. To evaluate the distance, altitude, and velocity performance of the UAVs quantitatively, we calculate the sum of the first-order absolute central moments of d, h, and v separately, as in (17), (18), and (19).
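The first-order absolute central moment used in this evaluation can be sketched as follows; the altitude trace below is hypothetical:

```python
# Sum of first-order absolute central moments, in the spirit of (17)-(19):
# for a tracked quantity x (distance d, altitude h, or velocity v),
#   M_x = sum_t |x_t - mean(x)|,
# so smaller values indicate a steadier trace.

def abs_central_moment(xs):
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs)

altitudes = [100.0, 102.0, 98.0, 100.0]   # hypothetical altitude trace
print(abs_central_moment(altitudes))      # mean is 100.0, so 0 + 2 + 2 + 0 = 4.0
```

Computing this separately for d, h, and v gives a single scalar per quantity, which makes the steadiness of the two algorithms' trajectories directly comparable.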

Parameter Evaluation
Since the weighted average parameter ξ in the MAJPPO algorithm has a great influence on the performance of the algorithm, we discuss and analyze it. The learning curves of MAIPPO and MAJPPO for ξ = 0.8, 0.9, 0.99, and 0.999 are shown in Figure 6.

In the MAJPPO algorithm, when ξ = 1, it is actually the independent PPO algorithm. As can be seen from Figure 6, when the value of ξ is close to 1, the algorithm shows performance similar to that of the independent PPO algorithm, for example, at ξ = 0.999. However, the performance does not keep improving as ξ becomes smaller; for example, when ξ = 0.8, the performance is not as good as with ξ = 0.9.

Conclusions and Future Work
Based on the MAIPPO algorithm, we propose the MAJPPO algorithm, which uses moving window averaging of the state-value functions to obtain a centralized state-value function for multiagent coordination problems. The MAJPPO algorithm is a centralized training and distributed execution algorithm. We also presented a new cooperative multi-UAV simulation environment, in which multiple UAVs work together to accomplish formation and obstacle avoidance. To accomplish this task, we use the dynamics model of the UAV with an attitude control loop as the controlled object. The experimental comparison shows that the MAJPPO algorithm better handles the partial observability of the state in the multiagent system and obtains better experimental results.
The comparison of the MAJPPO algorithm with other multiagent reinforcement learning algorithms, such as MADDPG, VDN, and QMIX, is left for future work.


Conflicts of Interest:
The authors declare no conflict of interest.



Appendix A
"SMALL UNMANNED AIRCRAFT Theory and Practice" presents the six-degree-of-freedom, 12-state equations of motion with the quasi-linear aerodynamic and propulsion models for the small UAV in chapter 5 as follows.
The variables and constants in the above formulas are explained in detail in [32]. The data we used in the experiments are from the Aerosonde UAV in Appendix E of [32].

Appendix B
We use the following three control methods as the attitude control algorithms for the fixed-wing UAV dynamics model introduced in Appendix A. The three tables below contain some of the variables and values in the control block diagrams above, which are also used in our simulations.