Multiple Unmanned Aerial Vehicle Autonomous Path Planning Algorithm Based on Whale-Inspired Deep Q-Network

: In emergency rescue missions, rescue teams can use UAVs and efﬁcient path planning strategies to provide ﬂexible rescue services for trapped people, which can improve rescue efﬁciency and reduce personnel risks. However, since the task environment of UAVs is usually complex, uncertain


Introduction
Due to the continuous development of multi-sensor data fusion technology and autonomous flight control technology, Unmanned Aerial Vehicles (UAVs) have become an important tool in the field of emergency rescue.Particularly, in the face of natural disasters and emergencies, UAVs are widely used in search and rescue operations [1], for example, the 9.0-magnitude earthquake in Japan in 2011 [2], the 8.1-magnitude earthquake in Nepal in 2015 [3], and Hurricane Harvey in Texas in 2017 [4], which caused massive damage and casualties.UAVs were used to assist rescue teams to provide necessary rescue services for trapped people in danger.However, due to the lack of effective path planning, some UAVs collided during the flight, resulting in mission delays and resource waste, and even secondary disasters, which posed additional risks to both rescue workers and trapped people.Therefore, effective and collision-free path planning is critical for the successful rescue mission of UAVs.A better path planning strategy can avoid collisions between UAVs or even with buildings or other obstacles, improve rescue efficiency, and ensure the smooth progress of rescue operations [5].
The complexity of the rescue environment is the primary challenge of the path planning problem [6].In general, obstacles (e.g., buildings, bridges, trees, gravel) in the disaster area are relatively messy, and the geographical environment is complex.Planning a collision-free flight path for UAVs has practical significance in assisting rescue missions.At present, some studies [7][8][9][10] (e.g., the A star algorithm, Dijkstra's algorithm, and the Bellman-Ford algorithm) have been used in collision-free path planning algorithms.These algorithms evaluate the quality of the path with certain rules and heuristic functions on the existing map and perform path planning by searching and optimizing to find the best path.However, some disasters may change the inherent topography of the affected area and damage the structure of buildings and infrastructure, which will cause unexpected changes in the task area.
For this reason, it is more practical to study the path planning problem in an uncertain task environment.In a disaster environment, the communications and navigation infrastructure may not be able to function properly, requiring UAVs to perform tasks autonomously without a GPS signal or effective communication with the ground command centers [11], which will cause traditional path planning algorithms to fail to obtain the latest maps and environmental information.In addition, unforeseen changes may occur in disaster environments, such as road damage, obstacles, and personnel evacuations.These changes will put forward higher requirements for path planning and navigation algorithms that can adapt to environmental changes in real time and quickly re-plan paths to meet these challenges.Traditional algorithms are usually designed based on specific maps and road network structures.When the map changes, the algorithm may not be able to adapt to the new environment.In order to meet these challenges, it is necessary to adopt more flexible and real-time path planning algorithms, such as the path planning algorithm based on reinforcement learning.The path planning algorithm based on reinforcement learning has the advantages of adaptability, real-time planning, and a learning ability.Researchers try to introduce the reinforcement learning (RL) model into the path planning algorithm [12][13][14] and learn flexible and feasible strategies through the continuous trial and error of the agent in unknown environments.
However, training RL models requires a large amount of training data, training time, and computing resources [15].In addition, as the complexity of the task environment increases, RL-based path planning algorithms face challenges when dealing with complex environmental constraints (such as avoiding dangerous areas or other agents, energy constraints, and time constraints).It is necessary to design appropriate state representation and action selection strategies.The heuristic algorithm can provide prior knowledge for the reinforcement learning model by considering and dealing with these complex constraints using certain rules and heuristic functions.For example, the genetic algorithm [16] uses operations such as selection, crossover, and mutation; the particle swarm algorithm [17] updates the velocity and position of particles; and the bee colony algorithm [18] uses a local search and information exchange.These algorithms then combine fitness functions and constraints to generate UAV path solutions that adapt to the environment.These heuristic algorithms have advantages in terms of computational efficiency, data requirements, and solution speed, and they can effectively reduce path redundancy and unnecessary trial and error.
To this end, we propose a multi-UAV path planning algorithm based on a whaleinspired deep Q-network (WDQN).Specifically, we use the deep Q-network (DQN) as the framework and design a comprehensive reward function to balance the three factors of path length, obstacle avoidance, and energy consumption.Then, a deep neural network is used to approximate the Q-value function, and the comprehensive reward function is used to evaluate and reward the planned path in real time during training.In addition, the WoA is introduced when training the DQN, and the WoA balances exploration and exploitation to help the DQN model explore a wider range of actions and states, enabling it to discover new and possibly better strategies.Moreover, the search strategy is dynamically adjusted according to the fitness of the solution to help the DQN model to focus the search on promising areas of the state-action space, thereby improving the convergence speed and learning efficiency.Simulation experiments show that the proposed algorithm can enable UAVs to explore unknown and complex environments and plan collision-free feasible paths.
Finally, compared with the classic RL algorithm, the WDQN has a better performance in learning efficiency, path planning success rate, and path length.
The main contributions of this paper are as follows: • We combine a WoA and DQN to propose a path planning algorithm for UAV-assisted disaster rescue.The introduced WoA significantly improves the learning efficiency of the DQN, and the designed comprehensive reward function effectively improves the sparse reward problem of the RL algorithm in complex environments.

•
We train the model with increasing scene complexity to improve the robustness and generalization of the algorithm.During the training phase, we gradually increase the number of obstacles in the task scene, so that the UAV can gradually adapt to the complex environment and learn a better pathfinding strategy.

•
Compared with the global path planning algorithm, this algorithm is suitable for disaster areas with limited communication and has good scalability; that is, it can apply the experience learned in the training process to unknown environments.In addition, the comparative experiments with the classical RL algorithms show that the WDQN has a better training efficiency and path planning ability.
The rest of the paper is organized as follows.In Section 2, we introduce research work related to this paper.The construction of a disaster scenario and the formulation of the rescue problem are described in Section 3. In Section 4, the WDQN-based multi-UAV path planning algorithm proposed in this paper is described in detail.In Section 5, the relevant parameters of the algorithm are introduced, and experiments are carried out to validate the proposed algorithm.The full text is summarized in Section 5.

Related Works
Multi-UAV reliable communication and path planning in emergency rescue scenarios are two current research hotspots [19][20][21].On the one hand, the development of UAV communication security is constantly advancing in the direction of encrypted communication technology [22], anti-jamming technology [23], and anti-attack technology [24].On the other hand, the development of path planning for UAVs is shifting from methods based on known maps to exploring unknown environments.In this paper, we assume that the communication environment of multi-UAVs is reliable and focus on the collision-free path planning algorithm in emergency scenarios.The following reviews the existing path planning algorithms from two aspects, namely global path planning based on a fully known environment and local path planning based on an uncertain environment [25].

Global Path Planning
Global path planning based on a completely known environment is a static planning algorithm.Global path planning is to calculate the optimal path from the starting point to the end point through an algorithm based on the known environmental map.Common global path planning algorithms include Dijkstra's algorithm, the A* algorithm, the sampling-based Rapidly Exploring Random Tree Star (RRT*) algorithm, etc.
Dijkstra's algorithm is widely used in solving the shortest path problem in weighted graphs.Dijkstra's algorithm is an efficient choice, especially when the target is uncertain.However, its time complexity and space complexity are relatively high, which is not suitable for complex environments.In order to avoid the sequential bottleneck problem that may occur in Dijkstra's algorithm, Wang et al. [26] first used the extended hierarchical graph to model the dynamic grid environment, then used the matrix alignment method to perform parallel exploration in the simulated environment, and finally traced the safe path according to the navigator matrix.This method shows a good performance in a large-scale dynamic grid environment.The A* algorithm is currently one of the most widely used graph traversal and path exploration techniques, which is an extension of the Dijkstra algorithm.The traditional A* algorithm has the advantages of simple principles and a fast calculation speed.But as the map grows larger, its computational load will increase exponentially, resulting in a high memory usage, long computing time, and low exploration efficiency [27].To this end, Guilherme et al. [28] proposed an improved A* algorithm for UAVs for target search and rescue, which introduced a truncation mechanism to prevent UAVs from falling into the local optimum.The RTT* algorithm has great advantages in search efficiency and search quality, and it has been successfully applied in the field of unmanned driving and UAV navigation.Since the classic RTT* algorithm adopts a fixed step size expansion strategy, setting the step size too large or too small will directly affect the search results.To this end, Wang et al. [29] designed a dynamic step size strategy.When the underground intelligent vehicle is far away from obstacles, a larger step size is used to speed up the convergence speed; when it is close to obstacles, a smaller step size is used to ensure the safety of the path.Hu et al. [30] first used the bidirectional extending random tree search strategy to plan the static path and then pruned the random tree according to the change in environmental information, so as to reduce the calculation amount of the RRT* algorithm in uniform random sampling.Compared with the classic RRT algorithm, the improved algorithms converge faster and have higher accuracy in path planning.But it is easy for them to fall into local optimum, and the energy of the system is relatively large.Therefore, intelligent algorithms, such as PSO and the WOA, are often used to improve efficiency and quality.Dong et al. [31] used the Adaptive Whale Algorithm (AWOA) to optimize the deployment of UAVs, used the variable length population strategy to find the optimal number of stopping points, and introduced nonlinear parameters and partial mutation rules to balance exploration and exploitation, so as to minimize the energy consumption of UAVs.Yu et al. [32] improved the PSO algorithm through the simulated annealing algorithm, and proved by experiments that the algorithm can plan high-quality paths for UAVs.
The performance of the global path planning algorithm is highly dependent on the predicted environment information.When the task environment of UAVs undergoes unpredictable changes, the paths planned by these methods will no longer be reliable.

Local Path Planning
Local path planning based on uncertain environments belongs to the dynamic programming algorithm.When the environmental information is completely unknown or partially known, this type of algorithm uses the local information obtained by UAVs to plan a collisionfree path.Classical local path planning usually introduces technologies such as the dynamic window method (DWA), the artificial potential field method, and reinforcement learning.These algorithms have a good robustness to environmental errors and noise.
Traditional DWA algorithms prefer to bypass the periphery of the dense obstacle area, which increases the total distance.Additionally, the objective cost function fails when a "C"-shaped obstacle is encountered, resulting in no path [33] being found.Tan et al. [34] introduced the concept of the obstacle search angle to the traditional DWA.When the static obstacle is not within a certain angle range of the agent's forward direction, the impact on the agent's subsequent navigation is not considered, which improves the adaptability of obstacle avoidance in different scenarios.The artificial potential field method simulates the repulsive field generated by obstacles and the gravitational field generated by the end point, so that the robot can avoid obstacles in the task environment and move forward to the end point.This method is relatively simple in theory and operation, but there are also problems, such as unreachable goals and the way it is easy to fall into the local optimum [35].To this end, Sun et al. [36] introduced the gravity barrier function in the repulsion function and further restricted the flight boundary and flight speed of the UAV, so as to avoid falling into the local optimum and improve the safety and stability of the track route.However, the ability to deal with unknown task environments using methods based on the artificial potential field method is limited.
In recent years, methods based on reinforcement learning have been widely used.The Q-learning algorithm is one of the commonly used reinforcement learning methods, in which UAVs can learn the optimal action strategy by interacting with the environment.For example, Souto et al. [37] used Q-learning as a reinforcement learning technique to control the UAV to move to the target, aiming to reduce its energy consumption in the disaster area rescue process.However, the Q-learning algorithm faces challenges when dealing with continuous state and action spaces.To solve this problem, deep reinforcement learning (DRL) is introduced into local path planning tasks.DRL not only has the powerful perception and representation ability of a deep neural network when processing high-dimensional decision-making information but also inherits the learning mechanism of reinforcement learning in the interaction with the environment, which can realize real-time path planning for agents [38].Among them, the DQN is one of the most classic methods.Zhang et al. [39] transformed the trajectory optimization problem into a constrained Markov decision process with a UAV as an agent and proposed a trajectory planning method based on the deep Q-network.Although the DQN is effective in obstacle avoidance and path planning, the algorithm has the problems of large randomness in action selection and slow convergence during training.To this end, Huang et al. [40] proposed the Dueling DQN algorithm based on the tree sampling mechanism.In addition, Wang et al. [41] proposed a DDQN based on a greedy strategy to solve the problem of mountain rescue in complex environments.
These algorithms enable UAVs to perform tasks in uncertain environments.Due to the lack of a grasp on global information, the planned path may not be optimal or even feasible.In addition, the path planning problem in an unknown environment needs to consider multiple possible states and transition probabilities, which will lead to a high computational complexity of the algorithm.

System Model
In order to simulate the real flight environment, a disaster search and rescue environment model is established first, which is the basis of UAV path planning.In this disaster search and rescue environment, the operating environment of UAVs is considered a twodimensional nonlinear horizontal plane.The scientific issues considered in this paper are described as follows.In a complex and unknown disaster environment, multiple UAVs autonomously avoid unknown environmental obstacles during high-speed movement, search, approach, and find trapped people in the shortest possible time, aiming to provide reliable action routes and more time for rescue teams.Each UAV is considered a particle moving in two dimensions, independently making new decisions at all times to find its target.As shown in Figure 1, we model the entire auxiliary rescue mission from three aspects: environment information processing, value function acquisition, and action selection.

Environment Information Processing
The environmental information processing module converts the current environmental information of the UAV into a state representation.In the search process, many objects can be regarded as obstacles, such as trees, houses, mountains, etc.For the convenience of discussion, Figure 1a is a simplified environment model of a multi-UAV auxiliary rescue mission.We first divide the entire rectangular task area = H x × H y into M cells and denote the position of the i-th cell as h i x , h i y , where 1 < i < M. Let N denote the number of UAVs participating in the disaster search and rescue mission, which are distributed in the entire mission area.We use p t uxj , p t uyj ∈ to represent the position of U AV j at time t, where 1 < j < N. In the case of a fixed scanning angle, the flight altitude of the UAV is directly proportional to the coverage radius of the scanning area.Therefore, for each cell, the UAV's flight altitude can be adjusted based on a determined coverage radius to achieve accurate coverage of each cell.In other words, the U AV j flight altitude al j can be calculated based on the fixed scanning angle θ and coverage radius cr j using the formula al j = cr j tan θ 2 .O is the set of obstacles in the environment, which are randomly distributed in the task area.Let p oxk , p oyk ∈ denote the position of obstacle k.Considering that the people on the ground will be trapped in the ruins when the house collapses, the movement of the trapped people is often limited by the space environment, and it is assumed that the position of the trapped people does not change with time.The coordinates of the trapped person j(1 < j < N) can be denoted by p txj , p tyj .

Value Function Acquisition
The purpose of the value function acquisition module is to train a path planning policy network for UAVs through learning algorithms.In this module, we build a neural network to represent the Q function, which consists of an input layer, multiple hidden layers, and an output layer.The input layer receives the current state representation as input, which represents the currently known information about the environment and the state of each UAV.The hidden layer maps the input to a higher-level feature representation through the calculation of multiple layers of neurons and the processing of activation functions.The output layer gives the Q-value of each possible action and selects the optimal action to guide the UAV's behavioral decision making.The details will be introduced in Section 4.

Action Selection
The UAV has a fixed field of view (FOV) and can randomly select any surrounding cells to move during flight.As shown in Figure 1c, the movement of the UAV mainly includes the following nine actions: east, south, west, north, southeast, southwest, northeast, northwest, and hovering.The action selection module first selects the optimal action according to the action value function and then transmits it to each UAV in the environment in the form of coordinates.

Method
In order to reduce the frequency of blind trial and error for UAVs in the search process, we design a whale-inspired deep Q-network.A WOA is used to accumulate prior knowledge about action selection.In this section, we first briefly introduce the basic principles of the WOA and DQN then introduce the proposed algorithm in detail.

Action Decision Based on WoA
The WoA is an optimization algorithm inspired by the foraging behavior of whales in nature.Each UAV is considered a whale, and there are three hunting models for each whale, which are the encircling model, the searching model, and the bubble-net attacking model.

Encircling Model
At this time, the whale chooses to swim towards the optimal individual, and the position update formula of the whale is as follows.
where t represents the current number of iterations, X best .represents the whale currently in the optimal position, β 1 and β 2 represent random variables in the range of (0, 1), respec- tively, and T represents the maximum number of iterations.The initial value of a is set to 2; it will decrease to 0 as the number of iterations increases.

Searching Model
At this time, the whale randomly selects an individual to approach, and the position update formula of the whale is as follows.
where X rand is the position of the whale randomly selected in the current group.

Bubble-Net Attacking Model
Whales also need to constantly adjust their position when using bubble nets to drive away prey.In this case, the position update formula of the whale is as follows.
In the formula, b is a constant, and the default value is 1. l is a random number uniformly distributed in the range (0, 1).

Path Exploration Based on DQN
RL is a machine learning method.The agent obtains feedback and rewards from the environment by constantly trying different actions and gradually forms an optimal strategy.The interaction process between the agent and the environment can be represented by the quaternion (S, A, R, P, γ) of the Markov decision process (MDP), where S represents the state space of the agent, A represents the action space of the agent, R = S × A represents the immediate reward obtained by the agent after taking an action in the current state, P : S × A × S → [0, 1] represents the transition probability of the agent from the current state to the next state after taking an action, and γ is a discount factor.
The core idea of the DQN is to use a deep neural network (DNN) to approximate the Q-value function used to evaluate the value of taking a specific action in a given state, thereby maximizing the discounted cumulative reward R in the reinforcement learning environment.
where γ ∈ [0, 1] represents the discount factor and r represents the instant reward obtained at time t.
The DQN uses the basic idea of a Q-learning algorithm, which is to guide the agent to take the best action in a given state by learning a value function Q(s, a).Unlike traditional Q-learning, the DQN uses a DNN to approximate the optimal value function.Specifically, a Q-network takes a state s as input and outputs a corresponding Q-value a to each action.
By optimizing the parameters of the Q-network, it can accurately predict the cumulative reward of performing each action in a given state; that is, And the Q-function is shown in Equation (7).
where θ represents the weight of the Q-network.The optimal Q-function is shown in Equation (8).
The function Q * returns the maximum reward of all strategies, which can be expressed by the Bellman equation as follows.
where s t+1 and a t+1 represent the subsequent state and action in the next time step, respectively.θ represents the weight of the target Q-network, and y t represents the target value.
In the DQN, the parameters θ of the Q-network are updated by optimizing the loss function, that is, minimizing the mean square error between the target Q-value and the current Q-value.The formula of the loss function is as follows.
We choose the gradient descent method to update the parameters of the Q-network to gradually approximate the true Q-value function.The update formula of parameter θ is where a represents the learning rate, which is used to control the step size of the parameter update.∇ represents the gradient operation, and ∇Loss(θ) represents the gradient of the loss function with respect to the network parameters.The network parameters are updated iteratively, so that the Q-function gradually approaches the true Q-value function.Finally, we can obtain the appropriate action through the well-trained network.

The Autonomous Path Planning Strategy Based on WDQN
For the path planning task, we proposes a whale-inspired DQN algorithm.The core idea of the WDQN is to use the WOA to provide the DQN with the decision-making experience required for training, improve the interaction efficiency between agents and the environment, and accelerate the convergence speed of the DQN algorithm.The framework of the WDQN algorithm consists of four parts, as shown in Figure 2. First of all, the input of the proposed algorithm is a two-dimensional vector composed of the real-time position of each UAV and the environment information.Second, we design a comprehensive reward function to generate dynamic rewards in real time based on environment information, which enables UAVs to have a good control performance.Then, each UAV is regarded as a whale, and the corresponding fitness function value is calculated according to the reward obtained during training.The action decision output by the WOA is added to the replay buffer of the Q-network for training, so that the model can obtain more valuable experience in the early stage of training.Finally, we increase the complexity of the environment step by step to improve the robustness and generalization of the algorithm.

Reward Function
The setting of the reward function is crucial to the performance of reinforcement learning algorithms.Choosing an appropriate reward function can effectively promote the convergence of the algorithm, while an inappropriate one may make it difficult for the algorithm to converge [42].In traditional reinforcement learning algorithms, a learner is rewarded only after completing a task, and there is no reward for the previous series of behaviors.It has been pointed out that this reward mechanism can lead to the sparse reward problem when faced with complex environments [43].When the set of environmental states is large, the learner encounters a series of nonfeedback states before completing the task.Since effective rewards cannot be obtained in time, the algorithm will be difficult to converge.To solve this problem, we design a comprehensive reward function, as shown in Equation (14).
The reward function is divided into three parts: the target reward function R target , obstacle avoidance reward function R avoid , and energy consumption reward function R cost , which are used to evaluate the directionality, safety, and efficiency of the model, respectively: (1) The target reward function R target is used to guide the UAV to quickly reach the target position, which is calculated using the following formula.
where β 1 represents the weight of the target reward function, d origin represents the initial Euclidean distance between the UAV and the target node, d t distance represents the Euclidean distance between the UAV and the target node at time t, and represents the displacement of the UAV from time step t − 1 to t.
(2) The obstacle avoidance reward function R avoid is a negative reward used to measure the safe distance between the UAV and nearby obstacles.This function can guide the UAV away from obstacles and ensure the safety of the planned path, which is calculated using the following formula.
where β 2 represents the weight of the obstacle avoidance reward function and d nearobs represents the distance between the UAV and nearby obstacles.If there are no obstacles around the current position of the UAV, then d nearobs = 0; otherwise, calculate the sum of the Euclidean distances to all surrounding obstacles.
(3) To make the UAV tend to plan shorter paths, the R cost is designed as a reward function to measure the remaining energy, which is also a negative reward.It can be seen from the following formula that the faster the search, the higher the reward.
where β 3 represents the weight of the energy consumption reward function, B cost represents the current energy consumption value, and B orgina represents the initial energy value.

Complexity of Task Scene
In this paper, the reinforcement learning model is trained in the form of the complexity of the task scene.In order to simulate the disaster-affected scene, we assume that the task scene of UAVs contains N obs obstacles (e.g., buildings trees, and vehicles), and use a quaternion (x, y, l, w) to identify the attributes of each obstacle.Among them, x and y represent the abscissa and ordinate of the obstacle center, respectively, and l and w represent the length and width of the obstacle, respectively.Obviously, as the number of obstacles increases, the complexity of the UAVs path planning increases.Therefore, we assume that the complexity L of the task scene is proportional to the number of obstacles N obs , and the initial complexity of the training scene is L 0 = 1.Specifically, N obs is a random integer in the interval [L, 2L].In the initial stage of training, we use simple and easily avoidable obstacles to enable the UAV to quickly learn basic path planning techniques.As the training progresses, when the number N suc of UAVs that find the target is greater than 80% of the total number of UAVs N, the complexity L of the scene will be upgraded, meaning that the terrain becomes gradually more complex and the passage becomes narrower.Otherwise, the difficulty remains the same.Namely,

WDQN for Multi-UAV Path Planning
In this paper, we use the Adam algorithm to adaptively adjust the learning rate and optimize the model parameters by calculating the first-order moment estimation and second-order moment estimation of the gradient, so as to speed up the convergence speed of the model and improve the robustness and stability of the algorithm.The update formula of the parameters is as follows.
where ∧ m t represents the value of the first-order moment estimator m t after bias correction, ∧ v t represents the value of the second-order moment estimator v t after bias correction, and σ = 10e −8 is an offset used to prevent division by zero.
where µ 1 and µ 2 represent the exponential decay rates of the first-order moment estimation and second-order moment estimation, respectively.In order to guide the UAV to choose actions reasonably, we use the ε−greedy strategy to balance exploration and utilization, thereby preventing the algorithm from falling into the local optimum.Decreasing ε gradually during the training process makes the Q-network perform more random exploration in the initial stage, which is beneficial to try more actions.As the training progresses, gradually decrease ε to make it more inclined to choose the optimal action for a better performance.The update formula of ε is as follows.
where ε min represents the minimum value of the exploration rate.In summary, we give a detailed description of the WDQN-based multi-UAV autonomous path planning algorithm (see Algorithm 1).

Algorithm 1 Pseudocode of simulated WDQN
Input: Q-network weights θ, maximum iterations T, learning rate α, number of whales W, discount factor γ; Output: The optimal policy π; 1: Initialization the Q network, initialization the positions of UAV and the trapped people, initialization the level of obstacles L, initialization Q network Q and target Q-network Q , initialization greedy probability ε, set t = 0; 2: for t < T do 3: while P uavs != P obstacles and P uavs != P people do 4: update ε according to Equation (26) 5: eps = random(0, 1)

Simulation Settings
In order to better explore and learn the optimal path planning in different environments, it is necessary to train the WDQN model first.In the training phase, we simulated a 100 m × 100 m task area, in which we set up six randomly distributed UAVs and six randomly distributed trapped people.In addition, we gradually increased the complexity of the task scene to improve the training performance of the DQN.The initial complexity of the model training is 1, which means that the initial number of obstacles in the task scene is a random integer in the interval [1, 2].The maximum complexity of the model training is 10, and in this case the number of obstacles in the task scene is a random integer in the interval [10, 20].
A whole task environment consists of four parts: background, obstacles, UAVs, and targets.Figure 3 shows three examples of initial task environments with different complexities in a 100 m × 100 m task area at the test time.It can be seen that the red pentagrams represent targets, the Y-shaped marks of different colors represent UAVs with different numbers, and the black rectangles represent obstacles.This paper uses the machine learning library PyTorch to build and train the WDQN.Table 1 describes the main parameters involved in the training phase of the model.At the beginning of each episode of the model training phase, the task environment is randomly initialized, including the positions of entities such as UAVs, targets, and obstacles.Each episode is terminated when the UAV hits an obstacle or runs out of battery or successfully finds the target.The constants β 1 , β 2 , and β 3 in Equations ( 15), (18), and (19), respectively, represent the weights of the target reward function, the obstacle avoidance reward function, and the energy consumption reward function.The value range of these parameters is usually between [0, 1], and β 1 + β 2 + β 3 = 1, indicating the importance of the corresponding reward function.When β 1 = 0, the target reward function does not work, and the path planning algorithm will not give priority to reaching the target position but may pay more attention to the optimization of obstacle avoidance and energy consumption.When β 2 = 0, the obstacle avoidance reward function does not work, and the algorithm is prone to collisions.When β 3 = 0, the energy consumption reward function will not work, and the algorithm will not give priority to the energy consumption index, which will easily increase the path cost.Therefore, appropriate values of β 1 , β 2 , and β 3 should be selected to meet the requirements of the task.
The performance of the WDQN algorithm with different reward function weight combinations is shown in Table 2.It shows the data obtained by taking the average value of 50 experiments in the scene of 100 × 100, with an obstacle complexity level of 7 and the number of targets set to six.Since this paper is applied in the emergency rescue scenario, it is most important to find a path to the trapped people; that is, the value of β 1 should not be too small.When the value of β 2 is too small, the ability of the algorithm to avoid obstacles will be reduced.And when the value of β 3 is too small, the path cost of the algorithm will increase.Therefore, comprehensively considering the three indicators of success rate, collision rate, and path length, we set β 1 = 0.7, β 2 = 0.2, and β 3 = 0.1.In this case, the algorithm achieves the best results.The learning rate is a very important hyperparameter in the deep learning algorithm, which has an important impact on the training performance of the reinforcement learning model.In order to select appropriate parameters, we conducted seven independent experiments to compare the training performance of the WDQN under different initial learning rate settings, as shown in Figure 4.In terms of changes in the cumulative reward curve, when the learning rate is 0.01, the performance of the model is the worst.When the learning rate is 0.001, the training performance is the best.When the learning rate is lower than 0.001, the performance of the model gradually decreases, and the convergence of the cumulative reward function becomes worse.Therefore, we set the learning rate α to 0.001 in subsequent experiments.

Effectiveness of WDQN Algorithm
In this part, we tested the effectiveness and generalization of the proposed algorithm in three aspects: task scenes with different complexities, task targets with different numbers, and task areas with different sizes.

Experiments with Different Task Scene Complexities
Figure 5 shows the effectiveness tests of the proposed algorithm under three different task scene complexity settings, which are the test results of the corresponding task environments in Figure 3.The size of the test scene is fixed at 100 m × 100 m.The number of UAVs and targets is six, and they are randomly distributed in the test scene.The black rectangle is used to simulate obstacles.As the complexity of the task scene increases, the proportion of obstacles occupying the entire task area gradually increases.It can be seen from Figure 5 that although the difficulty of obstacle avoidance gradually increases the proposed algorithm can still enable UAVs to successfully avoid obstacles and find these targets.In the figure, the red pentagrams represent targets, the black rectangles represent obstacles, the Y-shaped marks of different colors represent UAVs with different numbers, and the line segments represent the paths taken by the UAVs.Since each UAV can only obtain the local information it is close to during the pathfinding process, it cannot predict the distribution of obstacles in the task environment.The final path of the UAV is usually not the shortest path to the target.This is normal and understandable in local path planning algorithms.In order to test the execution time of path planning in task scenes of different complexities, we conducted 30 experiments with different complexity settings and recorded the execution times.The average execution time in level 1 is 225.21 s, the average execution time in level 5 is 126.52 s, and the average execution time in level 10 is 173.33 s.It can be seen that although fewer obstacles can reduce the limitation of the path planning algorithm it may increase the computational complexity and execution time of the algorithm.This is because the algorithm needs to consider more path selections and calculations to find the optimal path.And as the complexity of the task scenario increases, the execution time is normally distributed.

Experiments with Different Numbers of Task Targets
Figure 6 shows the test results of the proposed algorithm under three settings with different numbers of task targets.The size of the test environment is still 100 m × 100 m, and the test scene complexity is fixed at 10.The red pentagrams represent targets, the black rectangles represent obstacles, the Y-shaped marks of different colors represent UAVs with different numbers, and the line segments represent the paths taken by the UAVs.The number of UAVs and targets is gradually increasing, and they are randomly distributed in the mission area.It can be seen that the proposed algorithm has good scalability in the number of task targets.When the number of task targets in the test scene is more or less than that of the training scene, the proposed algorithm can still plan a safe and feasible path for each UAV.In order to test the execution time of path planning with different numbers of task targets, we conducted 30 experiments with different numbers of task targets and recorded the execution time.The average execution time with two task targets is 15.95 s, the average execution time with five task targets is 71.39 s, and the average execution time with seven task targets is 193.65 s.It can be seen that as the number of task targets increases the execution time also increases.

Experiments with Different Sizes of Task Area
Figure 7 shows the test results of the proposed algorithm in three different sizes of task area.Since it is difficult to accommodate too many obstacles in a small-sized task area, we fixed the test scene complexity to 9, set the number of UAVs and targets to three, and randomly distributed them in the task area.The red pentagrams represent targets, the black rectangles represent obstacles, the Y-shaped marks of different colors represent UAVs with different numbers, and the line segments represent the paths taken by the UAVs.The sizes of the three test scenes are set to 60 m × 60 m, 160 m × 160 m, and 200 m × 200 m, respectively.It can be seen that the proposed algorithm has good scalability in the size of the task area.When the test scene is larger or smaller than the training scene, the proposed algorithm can still plan a safe and feasible path for each UAV.
In order to test the execution time of path planning under different task area sizes, we conducted 30 experiments for different task area sizes and recorded the execution time.The average execution time in the 60 × 60 area is 27.26 s, the average execution time in the 160 × 160 area is 70.73 s, and the average execution time in the 200 × 200 area is 239.31 s.It can be seen that as the size of the task area increases the execution time also increases.

Comparison and Analysis
In order to further evaluate the proposed WDQN-based path planning algorithm, we conducted multiple comparative experiments involving mainstream reinforcement learning models including Q-learning, the DQN, the DDQN, and the Dueling DQN.During the experiments, we focused on four metrics: the reward and loss in the training phase, the success rate, and the average path length in the testing phase.

Reward
The cumulative reward is the feedback signal provided to the agent in the task environment, which is used to guide the agent to learn and adjust its behavior.Rewards can be positive, negative, or zero, representing encouraging, punishing, or neutral evaluations of the agent's actions.In our experiments, the initial reward is set to a sufficiently small value.To measure the cumulative reward, we train these reinforcement learning models under the same experimental settings.The figure below shows the change curves in the cumulative reward obtained by each algorithm during the training phase.
First of all, observing the convergence of each curve in Figure 8, it can be found that there are unstable fluctuations in the curves of Q-learning and the DQN, and the convergence of the DDQN, Dueling DQN, and WDQN is relatively good, which shows that the proposed algorithm is easy to train.Then, observing the level of the reward value, it can be found that when the training tends to be stable (i.e., after 8000 episodes), the average reward obtained by UAVs in Q-learning and the DQN from the task environment is relatively low, while the average reward obtained by UAVs in the Dueling DQN and WDQN is relatively high, indicating that actions taken by UAVs can achieve better scores in path planning tasks.Finally, observing the training reward of the first 2000 episodes, it can be seen that the proposed WDQN can obtain a higher cumulative reward faster.This is because the introduction of the WOA enriches the action decisions in the replay buffer of the DQN and greatly speeds up the learning efficiency of the model.However, the learning progress of several other methods, especially Q-learning and the DQN, is relatively slow, and it is difficult to complete the path planning task in a short period of time.From the analysis of the above three points, it can be seen that the proposed model is superior to other reinforcement learning models in terms of stability, action decision quality, and learning efficiency.

Loss
In reinforcement learning, the loss is a metric used to measure the discrepancy between the expected future reward and the actual observed reward.The goal of the UAV is to find the optimal path planning strategy by minimizing the loss function, so that the maximum reward can be obtained in the long-term accumulation process.In our experiments, the initial loss is set to a sufficiently large value.To measure the loss, we train these reinforcement learning models under the same experimental settings.The figure below shows the change curve of the loss obtained by each algorithm during the training phase.
First of all, observing the convergence of each curve in Figure 9, it can be found that the loss functions of Q-learning and the DQN have not achieved a satisfactory convergence, and the convergence efficiency of the DDQN, Dueling DQN, and WDQN is relatively high.Then, observing the value of the loss, it can be found that when the training tends to be stable (that is, after 8000 episodes) the loss of UAVs in the Q-learning and DQN algorithms is relatively high, indicating that there is a certain gap between the obtained path planning strategy and the expected planning effect.The loss of UAVs in the WDQN is the lowest, which shows that the proposed algorithm can show a better performance in path planning tasks.Finally, observing the changes in the loss of the first 2000 episodes, it can be seen that the proposed algorithm greatly reduces the loss of UAVs in a short period of time, which is due to the auxiliary effect of the WOA on the DQN model.The loss of other methods, especially Q-learning and the DQN, is maintained at a high level in the early stage, indicating that their early exploration and learning capabilities are relatively poor, and more episodes of training are needed to adapt to the rules of the path planning task.From the analysis of the above three points, it can be seen that the proposed model has more advantages than other reinforcement learning models in terms of stability, path planning quality, and learning efficiency.

Success Rate
The success rate of a path planning algorithm refers to the probability of planning a feasible path in an unknown test environment.To compare the success rate of these five well-trained reinforcement learning models, we conduct 1000 independent tests under three different scene complexity settings.Each test starts with changing the task environment, which means reinitializing the positions of objects, such as UAVs, targets, and obstacles within the 100 m × 100 m mission area.In this part, the success rate can be obtained by calculating the ratio between the number of feasible paths and the number of tests.
The figure below shows the success rate of five well-trained models for path planning under different scene complexity settings.First of all, observing the test results under the level 1 scene complexity setting, it can be seen from Figure 10 that the success rate of the path planning algorithm based on the WDQN is as high as 99.5%.Among the remaining four comparison methods, the success rates of Q-learning and the DQN are relatively good.This is because the network structure of these two models is simple, and the demand for training data is not large, so they are suitable for dealing with path planning problems in simple task scenes.
Then, observing the test results under the level 5 scene complexity setting, it can be seen that the relatively simple learning method of the Q-learning algorithm limits its performance in complex scenes, and its success rate is the lowest.Due to the more effective exploration and learning in the training phase, the WDQN and Dueling DQN show higher performances.Between the two algorithms, the success rate of the WDQN exceeds 95%.Finally, under the level 10 scene complexity setting, the performances of all five reinforcement learning models in the path planning task decrease.Among them, the success rate of the Dueling DQN is close to 70%, and the success rate of the DDQN, the DQN, and Q-learning is below 50%.Under this setting, the performance of the WDQN proposed in this paper is least affected by scene complexity, and its success rate still remains above 90%.Overall, our algorithm has good robustness under different scene settings and can perform path planning tasks with a high success rate.

Average Path Length
The average path length of a path planning algorithm refers to the average length of feasible paths planned in an unknown test environment.A short average path length means less distance, less time, and less energy consumption for the UAV to fly.To compare the average path lengths of these five well-trained reinforcement learning models, we conduct 1000 independent tests under three different scene complexity settings, and each test starts with changing the task environment.In order to ensure the fairness of the experiments, we set the number of UAVs and targets as one, the initial position of the UAV is fixed as [0,0], and the position of the trapped target is fixed as [95,95].In addition, each algorithm has obstacles with the same properties when changing the task scene; that is, the number, location, and size of obstacles remain consistent.In this part, the average path length can be obtained by calculating the ratio between the total length of feasible paths and the number of feasible paths.Figure 11 shows the average path length of five reinforcement learning models under different scene complexity settings.According to the changes in the three sub-images, it can be seen that, in order to avoid obstacles, the average path length of these algorithms will increase as the scene complexity increases.In the tests of a single-UAV single-target search, the effective paths planned by Q-learning and the DQN are relatively long, and the success rate is low.It is difficult for these two algorithms to plan a satisfactory rescue route in a complex environment.Both the DDQN and Dueling DQN show similar performance in terms of success rate and effective path length.The WDQN algorithm proposed in this paper can plan the shortest path with a high success rate.Under these three levels of scene complexity settings, the average paths are 230.19m, 262.755 m, and 318.153 m, respectively.

Conclusions
In this paper, we study the path planning problem of multiple UAVs in emergency rescue scenarios based on the RL model and propose the WDQN-based path planning algorithm.The WDQN overcomes the limitations of traditional path planning algorithms in uncertain task scenes and improves the learning efficiency of RL models, enabling the UAV to autonomously plan a feasible collision-free path in an uncertain scene.Through continuous trial and error and studying feedback from the environment, the model gradually adapts and explores feasible collision-free paths in different scenes.In addition, we train the Q-network by gradually increasing the difficulty of obstacle avoidance, which enhances the robustness of the algorithm.In order to test the effectiveness of the proposed algorithm, we conduct experiments under different difficulty levels of obstacle avoidance, different numbers of task targets, and different scales of task areas.The results show that the proposed algorithm can successfully find targets and plan safe and feasible paths in all task scenes.Moreover, compared with several other RL algorithms, the proposed algorithm shows a better performance in terms of training efficiency, task completion rate, and average path length.
In three-dimensional space, the UAV needs to consider movement in both the horizontal and vertical directions, as well as different angles of rotation.This increases the number of actions in the UAV's action set to 27, which in turn increases the complexity of the action set and leads to a larger search space in the path planning algorithm.In addition, the jitter can lead to instability in the UAV's path, resulting in an increased energy consumption and flight risks, particularly in three-dimensional space.Consequently, in future work, we plan to propose a path planning strategy to address the complexity of a real-world three-dimensional environment, aiming to reduce the search space, accelerate the path planning process, and generate high-quality paths.
Furthermore, UAV path planning in large-scale mission environments is also a challenging problem.First of all, in large-scale task scenarios, UAVs need to plan paths in complex environments, avoid obstacles, and take into account factors such as task priorities and time constraints.This requires path planning algorithms to efficiently handle largescale maps and tasks while ensuring the safety and efficiency of the path.In future work, we will consider dividing the large-scale region into multiple small task regions, decomposing the complex path planning problem into multiple relatively simple sub-problems.By decomposing the problem, we will reduce the computational complexity and the difficulty of path planning and design appropriate allocation algorithms and mechanisms to achieve intra-regional and inter-regional path planning and collaboration.

Figure 1 .
Figure 1.The schematic diagram of multi-UAV collaborative path planning based on reinforcement learning.

Figure 3 .
Figure 3. Examples of initial task environments with different complexities.

Figure 4 .
Figure 4.The rewards with different rates.

Figure 5 .
Figure 5. Path planning in three task scene of different complexities.

Figure 6 .
Figure 6.Path planning in different numbers of task targets.

Figure 7 .
Figure 7. Path planning under different task area sizes.

Figure 9 .
Figure 9.Comparison of loss during episodes.

Figure 10 .
Figure 10.Comparison of success rates of different algorithms across varying difficulty levels.

10 Figure 11 .
Figure 11.Comparison of average path lengths and success rates of different algorithms across varying difficulty levels.
the transition (s t , a t , r t , s t+1 ) into replay buffer random minibatch of transition (s t , a t , r t , s t+1 ) from replay buffer store sample

Table 1 .
Summary of hyperparameters involved in training phase

Table 2 .
The performance of the WDQN algorithm with different reward function weight combinations.