Constrained Path Planning for Unmanned Aerial Vehicle in 3D Terrain Using Modiﬁed Multi-Objective Particle Swarm Optimization

Abstract: This paper considers the constrained unmanned aerial vehicle (UAV) path planning problem as a multi-objective optimization problem in which both costs and constraints are treated as objective functions. A novel multi-objective particle swarm optimization algorithm based on the Gaussian distribution and the Q-Learning technique (GMOPSO-QL) is proposed and applied to determine a feasible and optimal path for the UAV. In GMOPSO-QL, a Gaussian-distribution-based updating operator is adopted to generate new particles, and exploration and exploitation modes are introduced to enhance population diversity and convergence speed, respectively. Moreover, a Q-Learning-based mode selection logic is introduced to balance the global search with the local search in the evolution process. Simulation results indicate that the proposed GMOPSO-QL can deal with the constrained UAV path planning problem and is superior to existing optimization algorithms in terms of efficiency and robustness.


Introduction
The unmanned aerial vehicle (UAV) is a kind of aircraft without a human pilot, which is widely used in commercial, agricultural, military, and other fields [1][2][3]. Path planning is one of the most basic problems in the UAV flight management system; its purpose is to design a safe and sufficiently short trajectory that links the start and target points while satisfying certain constraints [4][5][6]. Taking the UAV flight distance and risk as objective functions, and the UAV's height, angle, and limited slope as constraints, Yu et al. [7] established an optimization model, which was solved by an adaptive selective mutation constrained differential evolution algorithm. Liu et al. [8] proposed a four-dimensional coordinated collision-free path planning method for multiple UAVs, which took time variables into account and successfully generated collision-free paths for multiple UAVs with the same time consumption. Maw et al. [9] proposed a hybrid path planning algorithm, which adopted an anytime graph-based path planning algorithm for global planning and deep reinforcement learning for local planning. This real-time mission planning system, applied to an autonomous UAV, can adapt to the real environment and effectively avoid both static and moving obstacles.
The most commonly used constraint-handling technique is the penalty function method [10], which constructs a new objective function by adding constraint terms to the original objective function, thus transforming the constrained optimization problem into an unconstrained one. Besides, the feasibility rule proposed by Deb [11] is another popular constraint-handling technique: by comparing individuals in pairs under three comparison rules, it selects the better ones to keep for the next generation. Furthermore, multi-objective optimization

Basic MOPSO
For a multi-objective optimization problem, it is both difficult and important to maintain solution diversity. Researchers first extended the PSO strategy to solve multi-objective problems [25]. MOPSO introduces the concept of external archives and adopts the non-dominated solutions in the archives to guide the search direction of the particles, which preserves population diversity. The velocity and position update formulas of the particles are modified as follows [25]:

v_i(t + 1) = ω · v_i(t) + c1 · r1 · (x_pbest,i(t) − x_i(t)) + c2 · r2 · (x_gbest(t) − x_i(t))    (1)

x_i(t + 1) = x_i(t) + v_i(t + 1)    (2)

where t represents the iteration count; ω represents the inertia factor; c1 and c2 represent learning factors; r1 and r2 are random numbers generated uniformly in [0, 1]; x_gbest denotes a random particle from the archives; and x_pbest,i denotes the historical optimal location of the i-th particle. The implementation of MOPSO can be described as Algorithm 1: while the stopping criterion is not met, update all particles; if the capacity of the archives A exceeds the maximum capacity requirement, calculate the crowding distance of all individuals in A, arrange them in descending order, and delete the individuals with the smallest crowding distance one by one until the capacity of A equals the maximum capacity; then set t = t + 1.
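The basic update above can be sketched in a few lines. This is an illustrative implementation of Equations (1) and (2) for one particle, not the authors' code; the function and parameter names are assumptions.

```python
import random

# Sketch of the basic MOPSO velocity/position update (Equations (1)-(2)).
# x, v, x_pbest, x_gbest are coordinate lists of equal length.
def mopso_update(x, v, x_pbest, x_gbest, omega=0.4, c1=2.0, c2=2.0, rng=random):
    """One velocity/position step for a single particle."""
    new_x, new_v = [], []
    for xi, vi, pb, gb in zip(x, v, x_pbest, x_gbest):
        r1, r2 = rng.random(), rng.random()  # uniform random numbers in [0, 1]
        vnew = omega * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)  # Eq. (1)
        new_v.append(vnew)
        new_x.append(xi + vnew)                                        # Eq. (2)
    return new_x, new_v
```

In MOPSO, `x_gbest` would be a randomly chosen member of the external archives rather than a single global best.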

Q-Learning
Reinforcement learning is a paradigm in which an agent learns in a "trial and error" manner, with rewards obtained through interaction with the environment guiding its behavior [24]. Q-Learning is currently one of the most commonly used reinforcement learning algorithms; through rewards and punishments, it learns how to choose the action that yields the largest accumulated reward.
Q-Learning involves an agent, a state set S, and an action set A. The agent performs actions in the environment that cause it to transition from one state to another, and performing an action in a specific state yields a reward. In short, Q(s, a) is the expectation of the reward that can be obtained by taking action a (a ∈ A) in state s (s ∈ S) at a certain moment. In Q-Learning, a Q-table is constructed by combining states s and actions a, the optimal action is chosen based on the Q-values, and the Q-table is then updated through reward or punishment. Therefore, the core idea of Q-Learning is to obtain the optimal control strategy by maximizing rewards. Over multiple iterations, the Q-table is updated dynamically as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + λ · [r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)]    (3)

where γ stands for the discount factor; λ is the learning rate; and r_{t+1} denotes the immediate reward obtained from performing a_t in state s_t. Figure 1 displays the structure diagram of Q-Learning.
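The tabular update can be sketched as follows; this is a minimal illustration of the standard Q-Learning rule with assumed names, not the paper's implementation.

```python
# Tabular Q-Learning update: Q is a list of rows, one per state,
# each row holding one Q-value per action.
def q_update(Q, s, a, r, s_next, lam=0.1, gamma=0.5):
    """Q[s][a] <- Q[s][a] + lam * (r + gamma * max_a' Q[s_next][a'] - Q[s][a])."""
    td_target = r + gamma * max(Q[s_next])   # immediate reward + discounted future value
    Q[s][a] += lam * (td_target - Q[s][a])   # move Q-value toward the TD target
    return Q
```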

Proposed Modified Algorithm
Global and local searches play significant roles in solving complex optimization problems. However, it is difficult to account for both in a single search mode. Introducing Q-Learning addresses this problem well. In this paper, two search modes are adopted: one tends to focus on the global search, and the other focuses more on the local search. Based on the learned experience, the algorithm constantly chooses one of them to carry out the search process. Therefore, a good balance between the global and local search processes can be realized.

Gaussian Based Exploration and Exploitation Update Modes
The contradiction between global search ability and local search ability restricts the performance of the algorithm. In GMOPSO-QL, the exploration mode performs a global search to find the possible range of the optimal solution. In contrast, the exploitation mode performs a local search to find the optimal solution within the possible range.
Equation (1) is the velocity update method of particles applied to the single-objective optimization problem. In that formula, x_gbest is the global optimal solution; in the multi-objective optimization problem, however, it is replaced by a set of solutions saved in the external archives. Meanwhile, to enhance the randomness of basic MOPSO, Gaussian random numbers are employed in place of the commonly used uniformly distributed random numbers, which performs better in many cases [26]. Therefore, a velocity update operator with Gaussian distribution is adopted to modify Equation (1), giving the velocity update methods of the particles in the exploration and exploitation update modes as follows:

(1) Exploration update mode

v_i(t + 1) = ω · v_i(t) + c11 · g1 · (x_pbest,i(t) − x_i(t)) + c21 · g2 · (x_b(t) − x_i(t))    (4)

where x_b is a random individual in the external archives, and g1 and g2 are generated with a Gaussian distribution. The learning factor c11 should be large and c21 should be small, so that the historical experience of the particles themselves mainly guides the search direction and enhances the diversity of the population.
(2) Exploitation update mode

v_i(t + 1) = ω · v_i(t) + c12 · g1 · (x_pbest,i(t) − x_i(t)) + c22 · g2 · (x_b(t) − x_i(t))    (5)

where the learning factor c12 should be small and c22 should be large, so that the optimal particles guide the search direction more strongly and the algorithm converges to the optimal solution more quickly.
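The two modes differ only in the learning-factor pair, so they can share one routine. Below is an illustrative sketch under the assumptions stated in the text (Gaussian coefficients replacing the uniform ones, an archive member x_b as guide); the Gaussian parameters are placeholder choices, not values from the paper.

```python
import random

# Gaussian-based velocity update shared by the exploration and exploitation
# modes; c_p plays the role of c11/c12 and c_g the role of c21/c22.
def gaussian_velocity(x, v, x_pbest, x_b, c_p, c_g, omega=0.4, rng=random):
    new_v = []
    for xi, vi, pb, gb in zip(x, v, x_pbest, x_b):
        g1 = rng.gauss(0.5, 0.15)  # Gaussian random numbers (assumed mean/std)
        g2 = rng.gauss(0.5, 0.15)
        new_v.append(omega * vi + c_p * g1 * (pb - xi) + c_g * g2 * (gb - xi))
    return new_v

# Exploration mode: large c_p, small c_g; exploitation mode: the reverse.
```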


Q-Learning Based Mode Selection
Q-Learning is used to select between the two update modes, namely the exploration update mode and the exploitation update mode. They share the same formula form but differ in the size of the learning factors. The exploration update mode has a large c1 and a small c2, so that the historical experience of the particles themselves can guide the search direction more, thus enhancing population diversity. In contrast, the exploitation update mode has a small c1 and a large c2, which allows the optimal particle to guide the search direction more and the algorithm to converge quickly.
In the proposed algorithm, each individual has its own Q-table to ensure that its learning process is independent. The Q-table is designed as a 2 × 2 matrix with columns representing actions and rows representing states. When the update mode is chosen, the action corresponding to the largest Q-value in the current state's row determines the update mode of the particle position in this iteration.
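The per-particle selection step is a simple row argmax over the 2 × 2 table; the sketch below uses illustrative names.

```python
# Each particle keeps its own Q-table: a 2x2 list of lists with
# rows = states and columns = actions (0: exploration, 1: exploitation).
def select_mode(q_table, state):
    """Return the action index with the largest Q-value in the state's row."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])
```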
In the Q-Learning algorithm, the setting of the learning rate λ is very important [24]. When λ is close to 1, the agent gives greater weight to newly acquired information; when it is close to 0, the agent relies mainly on previously learned values. Therefore, the learning rate can be adaptively reduced during the iterative process, which enables effective learning over the search space [27].
In this paper, the learning rate is updated as follows:

λ(t) = λ_initial − (λ_initial − λ_final) · t / t_max    (6)

where λ_initial and λ_final are the initial and final values of λ, respectively, and t_max denotes the maximum number of iterations.
The reward r is calculated as follows:

Updating Archives
The external archives are adopted to store the non-dominated solutions resulting from the search process. At the end of the search, the external archives are the final optimal solution set. The update strategy for external archives is shown in Algorithm 2.

Algorithm 2. Steps of updating external archives
/*Input parameters*/ The current capacity of the external archives m; the maximum capacity of the external archives M
/*Update Archives*/
For each particle x_i in the current population do
  For each particle x_j in the archives do
    If x_i dominates x_j then
      Delete x_j from the archives
    End if
  End for
  If x_i is not dominated by any solution in the archives then
    Place x_i in the archives
  End if
End for
If m > M then
  Calculate the crowding distance of all solutions in the archives and arrange them in descending order
  Delete the solutions with the smallest crowding distance until m = M
End if
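The archive update can be sketched as follows; this is an illustrative minimization-convention implementation of the dominance test and crowding-distance pruning, with assumed helper names.

```python
# a dominates b if a is no worse on every objective and strictly better on one.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Crowding distance over a front of objective vectors (boundary points get inf).
def crowding_distance(front):
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        dist[order[0]] = dist[order[-1]] = float("inf")
        span = front[order[-1]][k] - front[order[0]][k] or 1.0
        for j in range(1, n - 1):
            dist[order[j]] += (front[order[j + 1]][k] - front[order[j - 1]][k]) / span
    return dist

# Algorithm 2 for one candidate: remove dominated members, insert the
# candidate if non-dominated, then prune the most crowded solutions.
def update_archive(archive, candidate, max_size):
    archive = [s for s in archive if not dominates(candidate, s)]
    if not any(dominates(s, candidate) for s in archive):
        archive.append(candidate)
    while len(archive) > max_size:
        d = crowding_distance(archive)
        archive.pop(d.index(min(d)))   # drop the solution in the densest region
    return archive
```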

Framework of GMOPSO-QL
In summary, the framework of GMOPSO-QL is shown in Figure 2.


Computing Complexity
The main calculation of GMOPSO-QL lies in the iterative optimization process. Assume that N is the number of individuals in the particle population, M is the number of objective functions, and t_max is the maximum number of iterations. During each iteration, every particle selects one of the two update modes according to its own state; however, no matter which one is selected, the total time complexity of the particle updating process is O(N). When updating the external archives, the time complexity is dominated by the non-dominated sorting operator: each particle must be compared with the other particles on each objective function, and the time complexity of this process is O(MN²). Therefore, the time complexity of the proposed algorithm is O(MN²) per iteration, and its total time complexity is O(t_max · MN²), which is the same as that of the standard MOPSO algorithm. This analysis shows that the improved algorithm does not increase the time complexity and maintains high execution speed.

Problem Modelling
The representation of the flight path is the basis of UAV path planning [1,28]. In the three-dimensional (3D) flight space, the UAV path can be described by a set of discrete waypoints {p_0, p_1, p_2, ..., p_N, p_N+1}, where the coordinate of p_k is (x_k, y_k, z_k), the first waypoint p_0 is the start point, and the last waypoint p_N+1 is the target point. This method can represent any flight path in 3D space. The Bezier curve is usually used to ensure the smoothness of the UAV flight path [18,28], since it can describe a better path with fewer control points. In this paper, the cubic B-spline curve is employed to represent the flight path of the UAV.
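As one way to realize this smooth representation, a uniform cubic B-spline segment can be evaluated from four consecutive control points. This is a generic textbook formulation with assumed names, not the paper's specific parameterization (which may use a clamped knot vector).

```python
# Point on a uniform cubic B-spline segment for u in [0, 1];
# p0..p3 are consecutive (x, y, z) control points.
def cubic_bspline_point(p0, p1, p2, p3, u):
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0
    b2 = (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0
    b3 = u ** 3 / 6.0                      # basis functions sum to 1
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))
```

Sampling u over each segment yields a smooth sequence of waypoints from a small number of control points.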
Evaluation objectives (fitness functions) of a UAV flight path comprise the flight cost and the performance constraints. The following multi-objective optimization model can be established to describe the constrained path planning problem:

min F(X) = (J(X), C(X)), with J = J_L + J_H and C = C_1 + C_2 + C_3 + C_4

where J is the flight cost, including the length cost J_L and the flight altitude cost J_H; C is the sum of the constraint costs, including the threat constraint C_1, terrain constraint C_2, turning constraint C_3, and climbing and gliding constraint C_4. The cost functions and constraints are described as follows.
(1) J_L denotes the flight length cost, which can be calculated as the total length of the path [28]:

J_L = Σ_{k=0}^{N} ||p_{k+1} − p_k||

(2) J_H denotes the flight altitude cost, which encourages the benefits of low-altitude flight and is calculated as follows [28]:

J_H = Σ_p z_p

where z_p is the height of the waypoint p.
(3) The threat constraint function C_1 is the sum of all detection probabilities and kill probabilities, which is calculated as follows [5]: where d is the distance from the UAV to the threat center; P_R is the radar detection probability, P_M is the kill probability of ground-to-air missiles, and P_G is the kill probability of the anti-aircraft gun; R_Rmax is the radar's detection radius, and R_Mmax and R_Gmax are the maximum hit radii of the missile and the anti-aircraft gun, respectively.
(4) The terrain constraint function C_2 requires the UAV to fly above the predetermined safe flight altitude, which is expressed as follows [5]: where H_safe is the minimal safe flight height, and H_surf(x_k, y_k) is the ground level at the position (x_k, y_k).
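A terrain constraint of this kind can be sketched as a hinge-style penalty on the clearance above the ground; the exact penalty form in the paper is not reproduced here, and the function names, units, and the sum-of-violations shape are assumptions.

```python
# Sum of safe-clearance violations over the waypoints:
# penalize each waypoint whose height above the terrain is below h_safe.
def terrain_violation(waypoints, h_surf, h_safe=0.05):
    """waypoints: list of (x, y, z); h_surf: callable (x, y) -> ground level."""
    total = 0.0
    for x, y, z in waypoints:
        clearance = z - h_surf(x, y)
        if clearance < h_safe:
            total += h_safe - clearance   # amount by which the clearance is violated
    return total
```

A feasible path gives a violation of zero, so C_2 can be driven to zero by the optimizer.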
(5) The turning angle constraint function C_3 is introduced to constrain the UAV's flight in the horizontal plane and obtain a smoother path, which is expressed as follows [18]: where ϕ_k represents the trajectory turning angle at the waypoint p_k. The maximum turning angle ϕ_max can be expressed in terms of n_max, g, and V, where n_max denotes the maximum lateral overload, g is the acceleration of gravity, and V denotes the flight velocity.
The UAV slope s_k at the waypoint p_k must meet certain restrictions, which constrain the UAV's flight in the vertical direction. The climbing and gliding slope constraint function C_4 is calculated as follows [5]: where α_k and β_k denote the maximum climbing and gliding slopes, respectively.

GMOPSO-QL for Path Planning
The application of GMOPSO-QL to the UAV path planning problem in 3D complex environments can be implemented according to the framework shown in Algorithm 3.

Algorithm 3. GMOPSO-QL for UAV path planning
while t < t_max do
  Calculate the learning rate λ by Equation (6).
  for each particle i do
    Choose the best action a for the current state s from the Q-table.
    switch action
      case 1: Exploration update mode: update the velocity v_i and position x_i by Equations (2) and (4)
      case 2: Exploitation update mode: update the velocity v_i and position x_i by Equation (2) and the exploitation velocity update
    end switch
    Clamp the position to the search bounds (e.g., set x_i = x_min if x_i < x_min).
    Update the personal best position x_pbest,i.
    Calculate the reward r by Equation (7).
  end for
  Update the Q-table by Equation (3).
  Update the archives A.
  t = t + 1.
end while
/*Output*/ Output the best path for the UAV.
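The mode-selection and learning core of one iteration can be sketched compactly. Everything here is illustrative: the +1/−1 reward and the state transition are placeholder stand-ins for the paper's reward function and state definition, not the actual Equations (3) and (7).

```python
# One Q-Learning step for a single particle: select a mode from its own
# Q-table, then reinforce that choice based on whether the particle improved.
def step(q_table, state, improved, lam=0.1, gamma=0.5):
    action = max((0, 1), key=lambda a: q_table[state][a])   # mode selection
    reward = 1.0 if improved else -1.0                      # placeholder reward
    next_state = action                                     # assumed transition
    td = reward + gamma * max(q_table[next_state]) - q_table[state][action]
    q_table[state][action] += lam * td                      # tabular Q update
    return action, next_state
```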

Simulation Results
In this section, numerical simulations are performed on the test cases to verify the effectiveness of the proposed GMOPSO-QL in solving the constrained path planning problem. The experimental platform was an Intel(R) Core(TM) i5 with a 2.30 GHz CPU, 16 GB of memory, and Windows 10, and Matlab R2018a was used to perform the simulation experiments. We simulated mission scenarios similar to valley and mountainous environments, L_max = 100 km long and W_max = 100 km wide. Several threats are set in the scene, and their centers and radii are known. To ensure that the calculated constraint costs are as consistent as possible with the actual situation, the UAV's parameters in the simulation experiments are set according to real physical parameters and the relevant reference [5]: the maximum flying altitude is H_max = 4 km, the maximum lateral overload is n_max = 5, the minimum safe flying altitude is H_safe = 50 m, and the constant flying speed is V = 200 m/s. We compare the proposed GMOPSO-QL algorithm with the basic GMOPSO [29] and MOPSO [30] algorithms. To ensure the comparability of the three algorithms, the same fitness evaluation method is adopted, and the initial particles are generated by random initialization. Each algorithm is run 50 times for path planning independently.
The main parameters of GMOPSO-QL include the maximum iteration number t_max = 400, the population size NP = 50, the initial and final learning rates λ_initial = 0.9 and λ_final = 0.1, the inertia weight ω = 0.4, and the discount factor γ = 0.5. The number of control points is set to n = 4; that is, D = 3n = 12 parameters need to be determined to generate a smooth UAV flight path. The relevant parameters of GMOPSO and MOPSO are set according to references [29,30]. Figure 3 displays the plane view of the best paths generated by the three algorithms over 50 runs in Case 1. Figure 4 shows the best flight paths in the 3D environment: the black circles cover the threat areas, and the labels R, M, and G represent radars, missiles, and anti-aircraft guns, respectively. The vertical profiles corresponding to the best paths in Figure 4 are displayed in Figure 5. It can be seen that all three algorithms can avoid threats and obtain a safe path that meets the performance constraints. However, compared with GMOPSO and MOPSO, the best path obtained by the GMOPSO-QL algorithm is smoother and better. From the perspective of the vertical profiles, the path planned by GMOPSO-QL is gentle and maintains low-altitude flight. Intuitively, the proposed algorithm converges to a better solution and, compared with GMOPSO and MOPSO, has good convergence.
The Pareto frontiers of the 50 runs are processed by the Pareto sorting scheme, and the finally obtained Pareto frontier is shown in Figure 6. It can be seen that all three algorithms obtain a complete Pareto frontier. However, the Pareto solution set obtained by GMOPSO-QL is closer to the coordinate axes, which indicates that the objective function values obtained by the proposed algorithm are smaller and that its Pareto frontier is superior to those of the two comparison algorithms. The proposed algorithm shows the most satisfactory performance.
Table 1 lists the statistical data over the 50 runs, including the best, median, mean, and worst values, as well as the standard deviation of the path cost, where the best data among the three algorithms are marked in bold. FR represents the ratio of feasible paths that satisfy the constraints over all 50 runs. AT represents the average time for the algorithm to run once. Note that the statistics listed in columns 2-6 are computed only over runs that successfully generated a safe path. Figure 7 is the boxplot corresponding to the statistical results in Table 1.
From Table 1, it can be seen that the overall performance of the GMOPSO-QL is superior to MOPSO and GMOPSO when the average running time is the same, which proves that our proposed modified strategies to the original algorithm are effective. Although MOPSO's statistical results in Figure 3 appear to be more concentrated than those of the proposed algorithm, MOPSO's feasibility rate is only 70%, while the proposed algorithm achieves 100%. In addition, for the key indicators of statistical results, including best value, median value and mean value, the proposed algorithm is the best of the three algorithms. Figure 8 shows the plane view of the best flight paths planned by the three algorithms in 50 runs on Case 2. Figure 9 shows the corresponding best flight paths in the 3D environment. Vertical profiles of the obtained best UAV paths are displayed in Figure 10, and the Pareto frontiers are shown in Figure 11. In addition, Table 2 lists the statistical data of the flight cost values. Figure 12 is the boxplot of the statistical results in Case 2.

From Figure 8, we can see that all algorithms can achieve a safe and feasible path that avoids threats. Compared with the flight paths generated by MOPSO and GMOPSO, the path planned by GMOPSO-QL is the smoothest, and it makes better use of the terrain environment by maintaining low-altitude flight. Additionally, the Pareto frontier of GMOPSO-QL is better than those of MOPSO and GMOPSO. Numerical results show that GMOPSO-QL has the best performance. In Table 2, each item of GMOPSO-QL is superior to MOPSO and GMOPSO, which proves the effectiveness of GMOPSO-QL.
Above all, the GMOPSO-QL algorithm performs better than MOPSO and GMOPSO, and is proved to be effective, efficient, and robust enough to achieve a safe, feasible, and smooth path for UAV.

Conclusions
In this paper, a multi-objective optimization approach was employed to solve the constrained UAV path planning problem, and a GMOPSO-QL-based UAV path planning method was proposed. In GMOPSO-QL, the exploration and exploitation update modes are designed to improve the performance of the original MOPSO. The exploration update mode refers more to the historical experience of the particles themselves to enhance population diversity, while the exploitation update mode leads particles toward the optimal particle to achieve a rapid convergence speed. Based on the Q-Learning strategy, GMOPSO-QL balances the global and local searches by organically combining the two update modes. Numerical results indicate that GMOPSO-QL performs well in solving the constrained path planning problem for UAVs. Compared with MOPSO and GMOPSO, GMOPSO-QL has better convergence and robustness. In the future, we will focus on extending the GMOPSO-QL-based path planner to multiple-UAV systems.