Article

Path Planning Research of a UAV Base Station Searching for Disaster Victims’ Location Information Based on Deep Reinforcement Learning

Zhejiang Integrated Circuits and Intelligent Hardware Collaborative Innovation Center, Hangzhou Dianzi University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Entropy 2022, 24(12), 1767; https://doi.org/10.3390/e24121767
Submission received: 31 October 2022 / Revised: 25 November 2022 / Accepted: 28 November 2022 / Published: 2 December 2022
(This article belongs to the Special Issue Advances in Information Sciences and Applications)

Abstract

To address the path planning problem of unmanned aerial vehicle (UAV) base stations performing search tasks, this paper proposes the Double DQN-state splitting Q network (DDQN-SSQN) algorithm, which combines state splitting with optimal-state selection on top of the deep reinforcement learning DDQN algorithm to achieve optimal UAV path planning. The method stores multidimensional state information by category and uses targeted training to obtain optimal path information. It also incorporates the received signal strength indicator (RSSI) into the reward received by the agent, thereby reducing the decision difficulty of the UAV. To simulate realistic UAV operating scenarios, the OpenAI Gym simulation platform is used to construct a mission system model. Simulation results show that the proposed scheme plans the optimal path faster than traditional algorithmic schemes and offers clear advantages in stability and convergence speed.

1. Introduction

Nowadays, with the development of 5G communication technology and the growth of the mobile Internet of Things, unmanned aerial vehicle (UAV) base stations have begun to be widely used in auxiliary communication and post-disaster rescue tasks [1,2]. A UAV aerial base station is highly maneuverable, has controllable mobility, is convenient to deploy, and can support high-speed transmission of communication data [3]. Applying UAV base stations in disaster relief scenarios alleviates the difficulty of communication signals reaching complex environments and makes all-round signal coverage of the disaster relief mission area possible. Considering the special nature of UAV base stations and the complexity of the disaster relief environment [4,5], how to use the high flexibility of UAVs to plan an optimal collision-free path in the complex, obstacle-filled situation after a disaster is the most pressing problem for UAVs in disaster rescue missions.
In scenarios where UAVs search for the location information of disaster victims, the path planning problem of the UAV is the key to accomplishing such tasks. Traditional UAV path planning algorithms include the artificial potential field method, heuristic algorithms, the ant colony algorithm, etc. UAV path planning methods using an improved artificial potential field were proposed in [6,7], which effectively solved multi-UAV path planning and UAV obstacle avoidance against dynamic obstacles by introducing a rotating potential field and a Markov prediction model. In [8,9,10], improved heuristic algorithms were proposed to solve UAV path planning and mission area coverage in complex environments. To solve the coverage path planning problem for autonomous heterogeneous UAVs over a finite area, an original clustering-based algorithm was designed in [11], which divided the mission area into clusters to obtain the optimal UAV point-to-point path. For the multi-target search and path planning problem in unknown environments, Refs. [12,13] proposed an improved artificial bee colony algorithm, which greatly improved the stability of UAV flight and the speed of UAV path planning. In [14], a combination of a pseudospectral algorithm and an ant colony algorithm was used to solve the 3D path planning problem of a solar-powered UAV. For the path planning problem of multiple UAVs, the authors of [15,16] proposed a sparrow search algorithm and a two-level coordination framework, respectively, to achieve path planning for UAV swarms in dynamic obstacle environments. To improve algorithm stability and task efficiency during task execution, designing a new alternative framework for UAV motion performance can effectively reduce the task execution time of UAVs while improving the robustness of the algorithmic framework [17,18]. A non-rigid hierarchical discrete grid structure was proposed in [19] to achieve path planning of UAVs in 3D space. These optimization methods all convert the UAV path planning problem into a path optimization problem and solve the optimization model to obtain the optimal flight path. However, they often suffer from a low level of intelligence, long solution times, and restricted application scenarios, and cannot be applied to post-disaster rescue scenarios with complex and changing environments.
In recent years, artificial intelligence technology has flourished, and deep reinforcement learning has gradually been applied to solve UAV path planning problems in various complex environments. In [20,21], the DDPG algorithm was used to enable UAVs to autonomously avoid threat areas and thereby obtain an optimal flight path. Improved DRL algorithms were used in [22,23,24] to implement real-time UAV path planning in their respective application scenarios. In [25], to improve the collection of global and local information during UAV flight, a multi-layer path planning algorithm based on reinforcement learning (RL) was proposed, which divided the information into upper and lower layers and coordinated their processing to finally plan a collision-free path for the UAV. By introducing a UAV mobile edge computing platform, the studies in [26,27,28] provided better-quality path planning for reinforcement learning algorithms while achieving risk avoidance. For scenarios with no basic communication infrastructure, or where communication with the Internet is impossible due to emergencies such as disasters, a reinforcement learning-based path planning scheme for IoT UAVs was proposed in [29] to achieve autonomous path planning in unknown environments. A DL-based collision avoidance method for UAV communication networks was proposed in [30], in which a series of convex optimization problems were formulated and solved to address the optimal trajectory planning problem for UAV communication networks. To optimize the flight trajectory and improve the energy management capability of a solar-powered aircraft, a neural network controller was trained with the RL method in [31] and used as an integrated controller for aircraft navigation and guidance. In [32,33], a twin delayed deep deterministic policy gradient (TD3) actor-critic deep reinforcement learning (DRL) framework and a multi-step dueling DDQN (multi-step D3QN) algorithm were used to implement UAV path design in 3D space, respectively. For the path planning problem of multi-UAV wireless data collection, the process was solved by a DRL method in [34], achieving an approximately optimal UAV control strategy with unknown environmental data information.
Traditional reinforcement learning algorithms and traditional path planning algorithms often perform well in simple scenarios with simple environments and low-dimensional input states, but their performance degrades greatly in scenarios with complex environments, many obstacles, and high-dimensional state information as algorithm input.
To address the shortcomings of traditional path planning algorithms, this paper studies the path planning problem of UAV base stations collecting the location information of trapped people based on deep reinforcement learning, proposes a reinforcement learning idea combining state splitting and optimal value, and designs the DDQN-SSQN algorithm, which incorporates the received signal strength indicator (RSSI) collected in real time by the UAV base station during flight, to solve the path planning problem of UAVs performing search and rescue tasks. Finally, the DDQN-SSQN algorithm is evaluated in the constructed task environment and compared with traditional reinforcement learning algorithms to demonstrate its advantages. The proposed algorithm improves the network structure of the traditional DDQN algorithm, classifies the multi-dimensional environmental state, and conducts targeted training, which greatly improves the decision-making efficiency of the Agent.
The main contributions of this paper are as follows:
(1) A path decision-making algorithm based on deep reinforcement learning is used to classify and store multi-dimensional state information, and the optimal path information is obtained through targeted training, which greatly improves the efficiency of the UAV in completing search and rescue tasks;
(2) A self-built virtual environment model is used to train the UAV, which avoids wear on a physical UAV during training and reduces the cost of model training;
(3) The received signal strength indicator (RSSI) is introduced into the deep reinforcement learning model and used to affect the reward obtained by the agent, which reduces the difficulty of UAV decision-making and improves the positioning accuracy of the ground users' location coordinates.
The structure of this article is as follows. Section 2 introduces the construction of the virtual disaster relief scenario; the scene model is built from three aspects: the system environment model, the obstacle model, and the ground user model. Section 3 introduces the design ideas of the proposed DDQN-SSQN algorithm and its network structure. Section 4 gives the relevant simulation parameters and compares the performance of the proposed algorithm with traditional reinforcement learning algorithms to verify its advantages. Section 5 summarizes the paper and discusses future work.

2. Model Construction

This section introduces the construction of the model in which the UAV searches for disaster victims on the ground. The model describes the environmental information for the UAV search and rescue mission and also provides a training ground for the UAV. The system model transforms the UAV path planning problem into a reward and punishment decision problem by introducing decision theory and a related mathematical model of the cost function. In order to simulate a realistic UAV disaster relief environment, the whole mission model is built from three aspects: the system environment, obstacles, and ground users.

2.1. System Environment Model

Figure 1 shows the system environment model of the UAV base station performing search and rescue missions for trapped people. $S_{xy} = L_x \times L_y$ is the area of the entire mission region. The scene model is divided into multiple subtask areas according to the number of UAVs, and the area of each subtask area is $S_{xy}^i$. $N$ is the number of drones used to perform search and rescue missions in the scenario, indexed by $n = 1, 2, 3, \ldots, N$, and $M$ is the number of affected people on the ground, indexed by $m = 1, 2, 3, \ldots, M$. The position coordinates of the UAV change over time and can be expressed as:
$$P_o^n(t) = \left( x_o^n(t),\ y_o^n(t),\ z_o^n(t) \right) \in \mathbb{R}^3$$
In the equation, $t$ denotes the flight time of the UAV, with $t \in [0, T_n]$ and $n \in (0, N]$, where $T_n$ indicates the time required for the $n$-th UAV to complete the search and rescue mission. Considering that ground users will be trapped in debris when a house collapses and that the movement of trapped people is often restricted by the spatial environment, the trapped people are set to appear randomly at any location of the scene model, and their location coordinates do not change over time. The spatial coordinates of a trapped person can be expressed as:
$$P_i^m = \left( x_i^m,\ y_i^m,\ z_i^m \right) \in \mathbb{R}^3$$
Here, $m \in (0, M]$. $K$ indicates the number of obstacles in the scene, indexed by $k = 1, 2, 3, \ldots, K$. Since the position of obstacles in the mission scene generally does not move after a house collapses, the obstacle locations are fixed in the scene. The spatial coordinates of an obstacle are as follows:
$$P_c^k = \left( x_c^k,\ y_c^k,\ z_c^k \right)$$
$$x_c^k,\ y_c^k \in \mathbb{R}$$
$$z_c^k \in [0, H]$$
In the formula, H denotes the height of the obstacle from the horizontal ground.
The entire area to be searched is divided into $N$ subtask areas according to the performance of the $N$ UAVs, and one UAV is assigned to each subtask area to collect the location information of trapped people, so that the $N$ UAVs together cover the entire mission area. The areas of the $N$ subtask regions are $S_{xy}^i$, $i = 1, 2, \ldots, N$. The area $S_{xy}$ of the scene model can be expressed as:
$$S_{xy} = \sum_{i=1}^{N} S_{xy}^i$$
The size of each subtask area $S_{xy}^i$ obtained from the scene model segmentation is set according to the performance of the $N$ UAVs. The performance of a drone is judged by a combination of indicators such as its endurance, flight speed, and climb rate. Denoting the performance of the $N$ UAVs in the scenario by $I_1, I_2, I_3, \ldots, I_N$, the area of subtask region $S_{xy}^i$ obtained by segmentation can be expressed as:
$$S_{xy}^i = S_{xy} \times \frac{I_i}{\sum_{n=1}^{N} I_n}$$
Here, $i = 1, 2, \ldots, N$.
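As a minimal illustration of this area-allocation rule, the Python sketch below splits the mission area in proportion to the UAV performance indices; the function name and the example values are illustrative assumptions rather than the authors' implementation.

```python
def allocate_subtask_areas(total_area, performance):
    """Split the mission area S_xy among N UAVs in proportion to
    their performance indices I_1..I_N (area-allocation formula above)."""
    total_performance = sum(performance)
    return [total_area * I_i / total_performance for I_i in performance]

# Example: a 100 m x 100 m area shared by three UAVs with indices 1.0, 1.5, 2.5
areas = allocate_subtask_areas(100 * 100, [1.0, 1.5, 2.5])
print(areas)  # [2000.0, 3000.0, 5000.0]
```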

2.2. Obstacle Model

UAVs inevitably encounter obstacles in the actual flight process. Since there is randomness in whether an obstacle will appear in the UAV’s line-of-sight range, the obstacle model can be built by random generation. The virtual scene dynamically adjusts the number of obstacles K in the environment according to the size of the actual task area and places the obstacles randomly at any location in the scene. Considering the UAV’s own load and the search effect of the UAV, the UAV needs to fly close to the ground when performing rescue missions, which leads to the UAV having to avoid various obstacles such as trees, houses, and ruins during the process of moving. In order to enable the UAV to better detect the surrounding environment, this paper used millimeter-wave radar technology to realize the real-time monitoring and identification of obstacles during the flight of the UAV. The back-end signal processing module of millimeter-wave radar can use the echo signal to calculate the presence, speed, direction, distance, angle and other target information of moving objects. At the same time, millimeter-wave radar also has the characteristics of small size and light weight, so it can be carried around the airframe without affecting the UAV load, and effectively complete the obstacle avoidance task.
The obstacle avoidance detection of the UAV is shown in Figure 2. As can be seen in the figure, the position coordinates of the UAV in flight are $(x_o, y_o)$, the distance between the UAV and an obstacle is $d_i$, and the corresponding angle (phase difference) is $\beta_i$. The two-dimensional coordinates $(x_i, y_i)$ of the obstacle obtained by the UAV during flight can be expressed as:
$$\begin{cases} x_i = x_o + d_i \cos\beta_i \\ y_i = y_o + d_i \sin\beta_i \end{cases}$$
Here, $i \in (0, K]$. In the figure, $v_u$ denotes the flight velocity of the UAV, $v_{u\_x}$ denotes its component along the $x$-axis, and $v_{u\_y}$ denotes its component along the $y$-axis. The flight speed of the UAV at moment $t+1$ and its components on the $x$ and $y$ axes can be expressed as:
$$v_u^{t+1} = v_u^{t} + a^{t}$$
$$v_{u\_x}^{t+1} = v_{u\_x}^{t} + a^{t} \cos\theta$$
$$v_{u\_y}^{t+1} = v_{u\_y}^{t} + a^{t} \sin\theta$$
In the formula, $a$ denotes the flight acceleration of the UAV, and $\theta$ denotes the angle between the actual flight direction of the UAV and the $x$-axis.
During flight, the UAV monitors the location information of surrounding obstacles in real time through the millimeter-wave radar mounted around the airframe. The signal processing module of the UAV transmits a feedback signal $f_c$ to the UAV based on the distance $d_c^i$ between the airframe and the obstacle, and the signal $f_c$ can be expressed as:
$$f_c = \begin{cases} 1, & d_c^i \ge 15 \\ 0, & D_c \le d_c^i < 15 \\ -1, & 0 \le d_c^i < D_c \end{cases}$$
Here, $d_c^i$ denotes the distance between the UAV and the obstacle, and $D_c$ denotes the danger radius within which the UAV is at risk of collision. As can be seen from Equation (12), when the horizontal distance between the UAV and the obstacle is greater than 15 m, the threat of the obstacle to the UAV is very small and it can be considered that there is no obstacle around the airframe, i.e., $f_c = 1$; when the distance is less than 15 m but greater than the danger radius, the UAV is considered to be approaching the obstacle, i.e., $f_c = 0$; when the distance is less than the danger radius $D_c$, the UAV may collide with the obstacle at any time and the mission is considered to have failed, i.e., $f_c = -1$.
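A short Python sketch of the obstacle geometry and the feedback signal of Equation (12) follows; the 15 m threshold comes from the text, while the function names and the default danger radius are illustrative assumptions.

```python
import math

def obstacle_position(x_o, y_o, d_i, beta_i):
    """Obstacle coordinates from the UAV position, the radar-measured
    distance d_i and the measured angle beta_i (in radians)."""
    return x_o + d_i * math.cos(beta_i), y_o + d_i * math.sin(beta_i)

def feedback_signal(d_ci, danger_radius=2.0):
    """Equation (12): 1 when no obstacle threatens the UAV (d >= 15 m),
    0 when the UAV is approaching an obstacle, -1 inside the danger radius."""
    if d_ci >= 15.0:
        return 1
    if d_ci >= danger_radius:
        return 0
    return -1
```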

2.3. Ground User Model

This scenario model is used to simulate a situation where people are buried when a building collapses. The UAV carries an airborne base station over the mission area and collects in real time the received signal strength between the cell phones of the affected people on the ground and the base station. The formula for calculating RSSI is as follows:
$$R = 10 \lg \frac{P_r}{1\ \mathrm{mW}}$$
Here, $R$ denotes the received signal strength measured by the UAV from the ground phone during flight, and $P_r$ denotes the power of the received signal.
The received signal strength indicator (RSSI) is mainly used to measure the distance between a transmitter and a receiver. Its ranging method is based on the principle that the power of a radio signal decays as the propagation distance increases while the signal travels through the medium. Therefore, once the transmit power and the received power of a known node are obtained, the distance between the transmitter and the receiver can be calculated from the attenuation model relating signal power to distance. The relationship between received signal power and distance can be expressed as:
$$P_r = \left( \frac{P_t}{d_a} \right)^{n}$$
Here, $P_r$ denotes the power of the received signal, $P_t$ denotes the power of the transmitted signal, $d_a$ denotes the distance between the UAV base station and the ground user, and $n$ denotes the propagation factor of the signal. Substituting Equation (14) into Equation (13), the expression for the received signal strength as a function of distance can be obtained as:
$$R = 10 n \lg \frac{P_t}{d_a}$$
The received signal power $A$ between the UAV base station and the ground user's cell phone at a distance of 1 m can be expressed as:
$$A = 10 n \lg P_t$$
Substituting Equation (16) into Equation (15) yields:
$$R = A - 10 n \lg d_a$$
Since the number of trapped people in the ground scenario is $M$, the RSSI between the cell phone terminal of user $m$ and the UAV base station can be expressed as:
$$R^m = A - 10 n \lg d_a^m$$
Here, $R^m$ denotes the received signal strength measured by the UAV from the cell phone terminal of ground user $m$ during flight, and $d_a^m$ denotes the distance between the UAV base station and user $m$.
From Equation (18), the distance between the aerial UAV base station $u$ and the ground user $m$ can be expressed as:
$$d_a^m = 10^{\frac{\mathrm{abs}(R^m) - A}{10 n}}$$
Here, $\mathrm{abs}(R^m)$ denotes the absolute value of the received signal strength measured by the UAV from the cell phone terminal of ground user $m$ during flight, $A$ denotes the received signal power between the UAV base station and the ground user's cell phone at a distance of 1 m, and $n$ denotes the propagation factor of the signal.
Considering that in an actual house collapse the specific number and locations of buried people are unknown, the ground user model is constructed by random generation. From Equation (2), the number of ground users is $M$, with $m \in (0, M]$, and the 3D position coordinates of each user are $(x_i^m, y_i^m, z_i^m)$. The UAV base station collects the received signal strength $R$ from the ground users' cell phones in real time during flight and simultaneously calculates the distance $d_a$ between the ground user and the UAV base station; finally, the UAV base station sends the search result $f_R$ to the ground control station. The search result $f_R$ is defined as follows:
$$f_R = \begin{cases} 0, & d_a > D_r \\ 1, & d_a \le D_r \end{cases}$$
Here, D r denotes the maximum search radius of the UAV base station for ground users. When the received signal strength between the ground user’s cell phone and the base station exceeds the set threshold, i.e., the distance between the UAV base station and the ground user’s cell phone terminal is less than the maximum search radius of the UAV, the ground console can precisely locate the location information of the ground user.
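The log-distance relation above can be inverted to estimate the UAV-to-user distance and checked against the search radius $D_r$. The following sketch does this under assumed values for the reference power $A$, the propagation factor $n$, and the radius; it uses the direct algebraic inversion of Equation (18) rather than the absolute-value form written above.

```python
def rssi_to_distance(rssi, ref_power_a, prop_factor_n):
    """Invert R = A - 10*n*lg(d) (Equation (18)) to estimate the distance
    between the UAV base station and a ground user's phone."""
    return 10 ** ((ref_power_a - rssi) / (10 * prop_factor_n))

def search_result(distance, search_radius_dr):
    """Search flag f_R: 1 when the user lies within the UAV base station's
    maximum search radius D_r, otherwise 0."""
    return 1 if distance <= search_radius_dr else 0

# Example with assumed parameters A = -40 dBm, n = 2.5, D_r = 5 m
d_hat = rssi_to_distance(rssi=-55.0, ref_power_a=-40.0, prop_factor_n=2.5)
print(round(d_hat, 2), search_result(d_hat, search_radius_dr=5.0))  # 3.98 1
```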

3. DDQN-SSQN UAV Path Planning Algorithm

This section first reviews the fundamentals of deep reinforcement learning (DRL) and then details the design ideas and network structure of the proposed Double DQN-state splitting Q network (DDQN-SSQN) algorithm.

3.1. Deep Reinforcement Learning

Reinforcement learning (RL) is an important branch of artificial intelligence and is now widely used in robotics, autonomous driving, computer vision, natural language processing, and other fields. The core of reinforcement learning is an agent placed in a complex, uncertain environment with which it interacts. The interaction process between the agent and the environment is shown in Figure 3 and can be represented by a Markov Decision Process (MDP), whose basic framework is (S, A, R) [35]. After observing a state S in the environment, the Agent outputs an action A based on its learned experience. When action A is executed in the environment, the environment returns the agent's next state and the reward r for that action, and the rewards accumulate as the agent interacts with the environment. The ultimate goal of reinforcement learning is to maximize the cumulative reward obtained over long-term iterative exploration of the environment. Deep Reinforcement Learning (DRL) combines reinforcement learning with the feature extraction and neural network ideas of Deep Learning (DL) to form a decision control system. By taking multidimensional environmental data as input, DRL uses the success and failure experience gained during exploration to update the decision network and optimize the network parameters in order to obtain the optimal strategy.
Deep Q-network (DQN) is a Q-learning algorithm based on deep learning. In Q-learning, the value of taking action A in each state S is usually stored in tabular form. Since the learning tasks that the Agent faces in practical reinforcement learning are often continuous, there can be an infinite number of states, which leads to the curse of dimensionality. DQN combines value function approximation with neural networks and adopts a target network and experience replay to train the network, which reduces the storage requirements and solves the problem of excessive dimensionality. The value function approximation can be expressed as:
$$Q(s, a, \theta) \approx Q_\pi(s, a)$$
Here, $s$ represents the state $S$, $a$ represents the action $A$, $\theta$ represents the network parameters, and $Q_\pi(s, a)$ represents the actual value function. The target Q-value $y$ of the DQN algorithm can be expressed as:
$$y_{DQN} = \begin{cases} r, & end = \mathrm{True} \\ r + \gamma \max\limits_{a \in A(s_{t+1})} Q(s_{t+1}, a; \theta), & end = \mathrm{False} \end{cases}$$
Here, r represents the reward value obtained and γ represents the discount factor.
Double DQN (DDQN) is an improvement on DQN. The model structure of DDQN and the model structure of DQN are basically the same; the difference lies in their objective function. The objective function of DDQN can be expressed as:
$$y_{DDQN} = \begin{cases} r, & end = \mathrm{True} \\ r + \gamma\, Q\!\left( s_{t+1},\ \arg\max\limits_{a \in A(s_{t+1})} Q(s_{t+1}, a; \theta);\ \theta' \right), & end = \mathrm{False} \end{cases}$$
Here, $\theta'$ is a periodic copy of $\theta$ and represents the target network parameters. When DDQN computes the target value, the maximizing action is selected according to the parameters of the current Q network rather than the target Q network as in DQN, and the selected action is then evaluated with the target network. The computed target value is therefore slightly smaller than in DQN, which alleviates the overestimation problem and makes the Q value closer to the true value.
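A minimal PyTorch sketch of the Double DQN target described above; the network objects, tensor shapes, and function name are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def ddqn_target(reward, next_state, done, gamma, q_net, target_net):
    """Double DQN target: the online network selects the greedy action,
    the target network evaluates it (see the DDQN objective above)."""
    best_action = q_net(next_state).argmax(dim=1, keepdim=True)         # action selection
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)   # action evaluation
    return reward + gamma * next_q * (1.0 - done)
```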
Dueling DQN is also an improvement on DQN, with the modification at the last layer of the DQN neural network. Originally, the last layer of the DQN network was a fully connected layer whose output was the $n$ Q values. Dueling DQN does not train these $n$ Q values directly; instead, it learns two intermediate quantities, the state value $V$ and the action advantage $A$, and then expresses the Q value as their sum [36]. The Q function can be expressed as:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$
where $\theta$ denotes the convolutional layer parameters, and $\beta$ and $\alpha$ are the parameters of the two fully connected branches. In practice, the action advantage is generally taken as the action advantage function minus the average of all action advantages in state $s$ [37]. The specific formula is as follows:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|A(s)|} \sum_{a' \in A(s)} A(s, a'; \theta, \alpha) \right)$$
This narrows the range of Q values without affecting the ordering of the advantage function, which greatly improves the stability of the algorithm.
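A hedged PyTorch sketch of a dueling head implementing the value-plus-mean-centred-advantage combination above; the layer widths are arbitrary placeholders, not the authors' settings.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: a shared feature layer feeds a state-value
    branch V(s) and an advantage branch A(s, a); Q(s, a) is V plus the
    mean-centred advantage, as in the formula above."""
    def __init__(self, in_dim, n_actions, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)                         # shape: (batch, 1)
        a = self.advantage(h)                     # shape: (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)
```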

3.2. DDQN-SSQN Algorithm Design

In this paper, an improved state-splitting Q network algorithm, DDQN-SSQN, is proposed based on the Double DQN (DDQN) and Dueling DQN algorithms. The DDQN-SSQN algorithm is used to solve the path decision problem of the UAV during flight. In this algorithm, the process of a UAV base station executing a task in the simulated scenario is modeled, and the whole model consists of five parts: agent, state, action, reward, and task completion. The modules are constructed as follows.
(1) Agent: The UAV base station flies as the Agent in the mission area and collects environmental data in real time. The UAV selects the next action according to the current state $s_0$ and its internal policy $\pi$. After action $a_0$ is completed, the UAV enters a new state $s_1$, then obtains the next action $a_1$ through the policy $\pi$, and so on until the mission is completed.
(2) State: the state the UAV is in during flight is $S = (p_o, p_c, d_c, d_a, t_{r\_a}, c)$, where $p_o$ denotes the two-dimensional coordinates $(x_{s\_o}, y_{s\_o})$ of the UAV's location, $p_c$ denotes the two-dimensional coordinates $(x_{s\_c}, y_{s\_c})$ of the obstacle detected within the UAV's line of sight, $d_c$ denotes the horizontal distance between the UAV and the obstacle, $d_a$ denotes the distance between the UAV base station and the trapped person on the ground, and $t_{r\_a}$ denotes the degree of convergence of the UAV toward the trapped person at the current moment compared with the previous moment. The expression for $t_{r\_a}$ is as follows:
$$t_{r\_a} = d_a^{t} - d_a^{t-1}$$
c indicates the number of cell phone locations of people trapped on the ground that were successfully searched by the UAV base station.
(3) Action: Since the flight altitude of the UAV is fixed at 10 m above the ground, the flight of the UAV can be regarded as a two-dimensional planar movement problem. The flight action set $A$ consists of $\{a_0, a_1, a_2, a_3\}$, where $a_0$ means one step forward, $a_1$ one step to the left, $a_2$ one step to the right, and $a_3$ one step backward.
(4) Reward: The real-time reward $r$ obtained by the UAV while completing the mission is determined by the state $s_t$ the UAV is in at that moment and the action $a_t$ to be performed next, so the reward obtained at moment $t$ can be expressed as $r_t(s_t, a_t)$. The reward $r_t(s_t, a_t)$ obtained by the Agent at each moment consists of five components: the reward $r_t^1$ for approaching an obstacle, the reward $r_t^2$ for a collision, the reward $r_t^3$ for approaching the target, the reward $r_t^4$ for completing the task, and the reward $r_t^5$ for the UAV's own resource consumption. $r_t(s_t, a_t)$ is calculated as follows:
$$r_t(s_t, a_t) = \varepsilon_1 r_t^1 + \varepsilon_2 r_t^2 + \varepsilon_3 r_t^3 + \varepsilon_4 r_t^4 + \varepsilon_5 r_t^5$$
Here, $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$, $\varepsilon_4$, and $\varepsilon_5$ denote the weights of the rewards obtained by completing the corresponding target tasks, and the sum of all reward weights satisfies:
$$\varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \varepsilon_4 + \varepsilon_5 = 1$$
The UAV can detect the horizontal distance $d_c$ between the airframe and surrounding obstacles in real time with its millimeter-wave radar. Since the detection range of the radar is limited and the threat of an obstacle to the airframe is minimal when the horizontal distance exceeds 10 m, $d_c \in [0, 10]$ is set. Therefore, the proximity-to-obstacle reward $r_t^1$ can be expressed as:
$$r_t^1 = \begin{cases} \mu_1 \left( d_c^{t} - d_c^{t-1} \right), & d_c \in [0, 10] \\ 0, & d_c > 10 \end{cases}$$
Here, $\mu_1$ denotes the reward factor for proximity to an obstacle.
The collision reward is the reward obtained when the UAV collides with an obstacle on its flight path. In practice, the UAV base station is costly and easily damaged, and a collision with an obstacle causes large property losses. Therefore, the collision range is set to 2 m in the collision model, i.e., a collision event is considered to occur when the distance $d_c$ between the UAV and the obstacle is less than 2 m. The collision reward $r_t^2$ can then be expressed as:
$$r_t^2 = \begin{cases} -100\,\mu_2, & d_c \le 2 \\ 0, & d_c \in (2, 10] \end{cases}$$
The UAV base station continuously collects RSSI data between the airborne base station and the ground users' cell phones during the disaster relief mission, and uses them to obtain the distance $d_a$ of the affected person from the UAV and the convergence degree $t_{r\_a}$ of the UAV toward the affected person's location at the current moment compared with the previous moment. Therefore, the proximity-to-target reward $r_t^3$ can be expressed as:
$$r_t^3 = \mu_3\, t_{r\_a} + \frac{c}{10}$$
Here, $\mu_3$ represents the reward factor for approaching the target, and $c$ represents the number of trapped people on the ground that the UAV has found.
The UAV base station establishes communication with the trapped person's cell phone to obtain the received signal strength indicator (RSSI) and determines the direction and specific location of the trapped person's cell phone according to changes in signal strength. When the distance between the UAV and the ground user's cell phone terminal is large, low signal strength and strong interference make it impossible to accurately locate the trapped person; therefore, when the horizontal distance between the UAV base station and the ground cell phone terminal is less than 5 m, the disaster victim is considered to have been successfully found. If the locations of all trapped persons are found, the mission is considered complete. The task completion reward $r_t^4$ can be expressed as:
$$r_t^4 = 100\,\mu_4$$
Here, $\mu_4$ represents the reward coefficient for task completion.
During task execution, the UAV inevitably consumes battery power and wears out its hardware. The UAV resource consumption reward $r_t^5$ can be expressed as:
$$r_t^5 = -10\,\mu_5$$
Here, $\mu_5$ represents the reward coefficient for the UAV's resource consumption.
(5) Task completion: The completion status $f_d$ of the UAV flight task is determined by the number of flight steps $p$ relative to the maximum allowed number of steps and the number of ground cell phone terminals $c$ successfully located by the UAV base station. The expression for task completion $f_d$ is as follows:
$$f_d = \begin{cases} 0, & p < P_{MAX}\ \text{and}\ c < M \\ 1, & p = P_{MAX}\ \text{or}\ c = M \end{cases}$$
Here, $f_d = 0$ means that the search and rescue mission is still in progress, and $f_d = 1$ means that the UAV has completed this search and rescue mission; $P_{MAX}$ indicates the maximum number of flight steps of the UAV, and $M$ represents the number of trapped people on the ground.
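The reward shaping and termination logic of this subsection can be summarised in the sketch below. The signs of the collision and consumption terms and the helper names reflect this editor's reading of the description above, so treat them as assumptions rather than the authors' exact values.

```python
def step_reward(d_c, d_c_prev, d_a, d_a_prev, found, mission_complete,
                weights, coeffs):
    """Weighted sum of the five reward components: obstacle proximity,
    collision, target convergence, task completion, and resource consumption."""
    mu1, mu2, mu3, mu4, mu5 = coeffs
    e1, e2, e3, e4, e5 = weights                        # assumed to sum to 1
    r1 = mu1 * (d_c - d_c_prev) if d_c <= 10 else 0.0   # approaching an obstacle
    r2 = -100 * mu2 if d_c <= 2 else 0.0                # collision penalty (sign assumed)
    r3 = mu3 * (d_a - d_a_prev) + found / 10            # convergence toward the target
    r4 = 100 * mu4 if mission_complete else 0.0         # mission completion bonus
    r5 = -10 * mu5                                      # per-step consumption (sign assumed)
    return e1 * r1 + e2 * r2 + e3 * r3 + e4 * r4 + e5 * r5

def episode_done(steps, found, max_steps, n_victims):
    """Termination flag f_d: 1 at the step limit or once all victims are found."""
    return 1 if steps >= max_steps or found >= n_victims else 0
```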

3.3. DDQN-SSQN Algorithm Structure

The DDQN-SSQN algorithm proposed in this paper improves the neural network of the traditional DDQN algorithm. The network structure of the DDQN-SSQN algorithm is shown in Figure 4, where $S_1 \sim S_8$ denote the state information of the agent, mainly including the position coordinates $p_o(x_{s\_n}, y_{s\_n})$ of the UAV, the position coordinates $p_c(x_{s\_c}, y_{s\_c})$ of the obstacle, the horizontal distance $d_c$ between the obstacle and the UAV, the straight-line distance $d_a$ between the trapped person and the UAV, the convergence degree $t_{r\_a}$ of the UAV moving toward the trapped person, and the number $c$ of trapped people found. Before the UAV's state information is input into the neural network, the overall state is divided into three parts: the position state State1 ($S_1 \sim S_4$), the distance state State2 ($S_5, S_6$), and the remaining state State3 ($S_7, S_8$). State1 ($S_1 \sim S_4$) passes through hidden layers 1 and 2 to predict the position state of the UAV and the obstacle at the next moment ($S_1' \sim S_4'$); State2 ($S_5, S_6$) together with ($S_1' \sim S_4'$) passes through hidden layers 3 and 4 to predict the position and distance states at the next moment ($S_1' \sim S_4', S_5', S_6'$); State3 ($S_7, S_8$) together with ($S_1' \sim S_4', S_5', S_6'$) passes through hidden layers 5 and 6 to output the action advantage $A_i(s, a)$ and the state value $V(s)$, whose combination finally gives $Q_i(s, a)$. $Q_i(s, a)$ represents the value of each action performed by the Agent in the current state and is calculated as:
$$Q_i(s, a) = A_i(s, a) - \frac{1}{|A(s)|} \sum_{a' \in A(s)} A(s, a') + V(s)$$
Here, $i \in [1, 4]$, $s$ represents the state $S$, and $a$ represents the action $A$.
The execution architecture of the DDQN-SSQN algorithm is shown in Figure 5. The action policy used by the algorithm is the $\varepsilon$-greedy policy [38], i.e., the Agent executes the action corresponding to the maximum Q function with probability $1-\varepsilon$ and otherwise selects a random action. The executed action $A$ can be expressed as:
$$A = \begin{cases} \arg\max\limits_{a} Q(s, a), & P < 1 - \varepsilon \\ \mathrm{Random}, & P \ge 1 - \varepsilon \end{cases}$$
Here, $\arg\max_a Q(s, a)$ denotes the action corresponding to the maximum value of the $Q(s, a)$ function.
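The description of Figure 4 suggests a network in which the position, distance, and remaining state components pass through successive hidden-layer stages before a dueling output. The PyTorch sketch below is one plausible reading of that structure under assumed layer widths, together with the $\varepsilon$-greedy action selection; it should not be taken as the authors' exact architecture.

```python
import random
import torch
import torch.nn as nn

class SSQNetwork(nn.Module):
    """State-splitting Q network sketch: State1 (positions, S1-S4), State2
    (distances, S5-S6) and State3 (S7-S8) are processed in stages, then a
    dueling head produces Q(s, a) for the four flight actions."""
    def __init__(self, n_actions=4, hidden=64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 4), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Linear(4 + 2, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 6), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Linear(6 + 2, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, state):                        # state: (batch, 8)
        s1, s2, s3 = state[:, :4], state[:, 4:6], state[:, 6:]
        h1 = self.stage1(s1)                         # predicted position state
        h2 = self.stage2(torch.cat([h1, s2], dim=1)) # position + distance state
        h3 = self.stage3(torch.cat([h2, s3], dim=1)) # all state components
        a = self.advantage(h3)
        v = self.value(h3)
        return v + a - a.mean(dim=1, keepdim=True)   # dueling combination

def epsilon_greedy(q_net, state, epsilon, n_actions=4):
    """Pick argmax_a Q(s, a) with probability 1 - epsilon, else a random action.
    `state` is a 1-D tensor holding the eight state components."""
    if random.random() < 1 - epsilon:
        with torch.no_grad():
            return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
    return random.randrange(n_actions)
```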
The DDQN-SSQN algorithm proposed in this paper is summarized in Algorithm 1, where the input is composed of the UAV’s own flight parameters, agent model, and environmental parameters (step 1), and the output is the set of relevant data parameters obtained during training and the trained agent model (step 29).
In the model training phase, relevant training parameters such as the number of training rounds, the discount factor, the experience pool capacity, and the update frequency are first initialized. Next, the numbers of obstacles and trapped people are set, and the agent's target Q network is reset (step 2).
Algorithm 1: The Proposed DDQN-SSQN Algorithm Scheme
(The full pseudocode of Algorithm 1 is presented as an image in the original article; its steps are described in the surrounding text.)
At the beginning of each training round, the position coordinates of the obstacles and of the trapped people are randomly reset, the mission environment faced by the drone is reset to obtain the initial state $S_t$, and the cumulative reward received by the agent is cleared to zero (step 4). The training of the model undergoes $N_{ep}$ iterations (step 3).
During training, the agent obtains the next action $a_t$ to be performed according to the current state $S_t$ (step 7), with actions selected using the $\varepsilon$-greedy strategy (Equation (32)). After the agent executes action $a_t$, the task environment is updated, a new state $S_{t+1}$ is obtained, and the reward $r_t(s_t, a_t)$ for performing the action is received (steps 8–9). Throughout training, the agent collects task parameter information in real time, including the task completion status, flight steps, flight time, and the number of trapped people found (steps 10–13). The collected data are packaged and sent to the experience pool (step 14) to provide a data set for the agent's learning. After data collection, the system updates the state information $S_t$ and the agent in time and obtains the sum of rewards $r_a^i$ accumulated by the agent from the beginning of the first round to the end of round $i$ (steps 15–17). In each round, the agent repeats the flight mission in the simulation environment until the flight mission is completed, the flight time exceeds the set threshold, or all the trapped people are found, at which point the flight mission is terminated (steps 6–18). After the task terminates, the data obtained up to this round are packaged and stored as a collection (steps 22–26), and if the number of training rounds meets the system update frequency, the target Q network is updated (steps 19–21). A new round of training then begins, and training does not stop until the number of training rounds reaches $N_{ep}$. Finally, the task parameter information obtained during training is output, together with the trained agent model. Afterwards, the drone uses the trained model to carry out real-time path planning and collection of trapped people's location information in the disaster relief environment.
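Based on the step-by-step description of Algorithm 1, a condensed training loop might look as follows. It reuses the `epsilon_greedy` and `ddqn_target` helpers sketched earlier, and the environment interface, replay buffer, and hyperparameter names are assumptions in the spirit of the text rather than the published implementation.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def train(env, q_net, target_net, optimizer, n_episodes=700, gamma=0.95,
          batch_size=128, update_freq=2, epsilon=0.1, buffer_size=100_000):
    """Condensed DDQN-SSQN training loop in the spirit of Algorithm 1:
    interact, store transitions, learn from minibatches, and periodically
    sync the target Q network."""
    replay, returns = deque(maxlen=buffer_size), []
    for episode in range(n_episodes):
        state, done, episode_reward = env.reset(), False, 0.0
        while not done:
            s_tensor = torch.as_tensor(state, dtype=torch.float32)
            action = epsilon_greedy(q_net, s_tensor, epsilon)
            next_state, reward, done, _ = env.step(action)     # classic Gym API
            replay.append((state, action, reward, next_state, float(done)))
            state, episode_reward = next_state, episode_reward + reward

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                s, a, r, s2, d = (torch.as_tensor(x, dtype=torch.float32)
                                  for x in zip(*batch))
                q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
                target = ddqn_target(r, s2, d, gamma, q_net, target_net)
                loss = F.mse_loss(q, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        returns.append(episode_reward)
        if episode % update_freq == 0:              # sync the target Q network
            target_net.load_state_dict(q_net.state_dict())
    return returns
```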

4. Experimental Simulation and Analysis

This section presents the experimental simulation design. First, the algorithm is simulated and analyzed using the OpenAI Gym simulation platform, and then the DDQN-SSQN algorithm is compared with the traditional DDQN, DQN, and Q-Learning schemes to verify the superiority of the proposed algorithm.

4.1. Simulate Experimental Design

The simulation parameters of the scene in which the UAV base station collects the location information of affected people were as follows: the flight area of the UAV base station was set to a $20 \times 20$ grid, with the entire area composed of 400 squares with a side length of 5 m, so the task range of the scene was 100 m × 100 m. The flight height of the UAV was fixed at $H = 10$ m above the ground, the maximum flight speed was set to $V_{max} = 5$ m/s, and the maximum flight time was set to $T_{max} = 140$ min. The obstacles encountered by the drone during flight appeared randomly in the scene model, with $N_b = 20$ obstacles generated, and the two-dimensional position coordinates of the obstacles were $(x_o^n, y_o^n)$, where $x_o^n \in [0, 19]$, $y_o^n \in [0, 19]$, and $n \in [1, 20]$. The ground disaster victims also appeared randomly in the scene model, with $M_s = 500$ victims generated, and the two-dimensional location coordinates of the ground users were $(x_i^m, y_i^m)$, where $x_i^m \in [0, 19]$, $y_i^m \in [0, 19]$, and $m \in [1, 500]$. Table 1 lists the simulation parameters used during training.
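For convenience, the scene and training settings above (and in Table 1) can be gathered into a single configuration object; the field names below are illustrative, not taken from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class SimulationConfig:
    """Scene and training parameters reported in Section 4.1 and Table 1."""
    grid_size: int = 20              # 20 x 20 cells, 5 m per cell (100 m x 100 m)
    cell_length_m: float = 5.0
    flight_height_m: float = 10.0
    max_speed_mps: float = 5.0
    max_flight_time_min: float = 140.0
    n_obstacles: int = 20
    n_victims: int = 500
    n_states: int = 8
    n_actions: int = 4
    n_episodes: int = 700
    discount_factor: float = 0.95
    learning_rate: float = 1e-4
    replay_capacity: int = 100_000
    target_update_freq: int = 2
    batch_size: int = 128
```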

4.2. Analysis of Simulation Result

This paper used the PyTorch deep learning architecture to complete the network part of the deep reinforcement learning algorithm. When the drone collided with an obstacle or the flight time of the UAV was greater than T m a x , the training round was set to end. The training reward results of the four algorithms in the simulation scenario are shown in Figure 6.
As shown in Figure 6, in the first 100 rounds of training the four algorithms were in the environment exploration stage and the reward obtained by the agent oscillated erratically. Between rounds 100 and 200, the reward obtained by the DDQN-SSQN algorithm began to rise rapidly, the rewards obtained by the DDQN and DQN algorithms rose slowly, and the reward obtained by the Q-Learning algorithm showed no upward trend. When training reached 200 rounds, the reward of the DDQN-SSQN algorithm began to converge and eventually stabilized, while the remaining three algorithms only began to converge after 400 rounds and were still not completely stable before 500 rounds, retaining a large oscillation amplitude. The proposed DDQN-SSQN algorithm accelerates the learning process by learning the environmental state and, compared with the other three algorithms, provides the best flight strategy for the UAV in a shorter time and obtains more reward.
Figure 7a–d show the obstacle avoidance paths obtained by UAV training under the DQN, DDQN, Q-Learning, and DDQN-SSQN algorithms, respectively. The path diagrams show that the UAV can effectively avoid obstacles under all four algorithms, but the paths taken differ. In the obstacle avoidance path of the Q-Learning algorithm shown in Figure 7c, the UAV consumed the most steps; the search strategy selected by the algorithm was to fly along the edges of obstacles, which avoids obstacles but also increases the risk of a crash. Figure 7a shows the obstacle avoidance path of the DQN algorithm, whose strategy was to search far away from obstacles; although this effectively avoids the threat of obstacles, it reduces the UAV's search coverage of the mission area. Figure 7b shows the obstacle avoidance path of the DDQN algorithm, which was clearly better than those of Figure 7a,c, but repeated path segments appeared during the search, increasing the energy consumption of the UAV. The path under the DDQN-SSQN algorithm shown in Figure 7d demonstrates that, while guaranteeing coverage of the mission area, the UAV completed the entire search task with the fewest flight steps, effectively reducing flight energy consumption. Therefore, the DDQN-SSQN scheme outperformed the other schemes.
The time taken by the UAV to complete the flight task under the control of four algorithms was compared, and the comparison results are shown in Figure 8.
According to Figure 8, in the first 250 rounds the four algorithms took more time to complete a task cycle because the UAV needed to explore the environment and gain experience. When training reached 250–300 rounds, the time required by the DDQN-SSQN-based search and rescue UAV to complete the task dropped sharply and stabilized after 300 rounds at about 35–38 min. In terms of the overall convergence trend, the Q-Learning scheme took the longest, about 50–60 min, while the time consumed by the DDQN and DQN schemes decreased after 450 rounds but remained higher than that of the proposed algorithm. Therefore, the DDQN-SSQN scheme was superior to the other schemes, allowing the UAV to complete the search and rescue task in the shortest time.
Figure 9 shows the 3D path planning of the UAV under the DQN, DDQN, Q-Learning, and DDQN-SSQN algorithms. The UAV made its path planning decisions based on the RSSI value and the coordinates of obstacle locations. Figure 10 compares the number of people found by the UAV under the four algorithms. The number of people found by the DQN, DDQN, and DDQN-SSQN algorithms rose rapidly in rounds 0–100, while that of the Q-Learning algorithm rose very slowly. When training reached 100–200 rounds, the upward trend of the DQN and DDQN algorithms moderated with slight oscillation, whereas the DDQN-SSQN algorithm continued to rise rapidly and began to stabilize after 200 rounds. After 250 rounds the DDQN algorithm gradually stabilized, while the DQN and Q-Learning algorithms kept oscillating until the end of training. This analysis shows that the proposed scheme could find more affected people in the shortest time, so the DDQN-SSQN scheme outperformed the other schemes.
The number of buried people found throughout the search task was used as an indicator of task completion: the more buried people found in a task cycle, the higher the task completion rate. Figure 11 shows the completion of search and rescue tasks by the UAV base station under the four algorithms. The task completion rate of all four schemes gradually increased with the number of training rounds; the DDQN-SSQN scheme improved fastest and could complete the whole search and rescue task after 250 training rounds. Figure 12 compares the obstacle avoidance success rate of the UAV under the four schemes. The four stages in the figure correspond to rounds 0–200, 200–400, 400–600, and 600–700 of the training process, respectively. The figure shows that the obstacle avoidance success rate of the DDQN-SSQN scheme was much higher than that of the other three algorithms throughout training. According to the above analysis, the DDQN-SSQN algorithm enabled the UAV to maximize the search and rescue task completion rate within the fewest training rounds and without collision. Therefore, the proposed scheme was better than the other three algorithms.

5. Conclusions

This paper investigated the path planning problem in the process of collecting the location information of affected people by a UAV base station based on a deep reinforcement learning algorithm and proposed the DDQN-SSQN reinforcement learning algorithm, which combines state splitting and optimal-state selection, to complete optimal path planning during UAV information collection. To enable the UAV base station to collect the location information of people trapped on the ground, we introduced RSSI as the search indicator for disaster victims and fed it into the algorithm network as state information for learning, which reduced the difficulty of search and rescue decision-making for the UAV base station. The virtual disaster relief scenario model was constructed using the OpenAI Gym simulation platform, the network structure of the algorithm was improved on the basis of the DDQN algorithm, and the new algorithm was compared with traditional reinforcement learning algorithms through simulation experiments. The results showed that the proposed algorithm could find the optimal search path in the shortest time among the compared reinforcement learning algorithms and outperformed the traditional algorithms in training speed, task completion rate, and obstacle avoidance rate. In future research, we plan to consider terrain and environmental constraints when dividing the mission area so as to avoid wasting resources on UAV search and exploration over uninhabited land. In addition, the location information of trapped people collected by the UAV can be further processed to achieve three-dimensional coordinate positioning of disaster victims.

Author Contributions

Methodology, J.L.; Software, C.W.; Formal analysis, R.L.; Investigation, Z.G.; Writing—original draft, J.Z.; Supervision, W.L.; Project administration, K.Y.; Funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Key Research and Development Project (grant number 2022C01048) and the Zhejiang Province Public Welfare Project (grant number LGG22F010012).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, Z.; Guo, J.; Chen, Z.; Yu, L.; Wang, Y.; Rao, H. Robust secure UAV relay-assisted cognitive communications with resource allocation and cooperative jamming. J. Commun. Netw. 2022, 24, 139–153. [Google Scholar] [CrossRef]
  2. Sun, M.; Xu, X.; Qin, X.; Zhang, P. AoI-Energy-Aware UAV-Assisted Data Collection for IoT Networks: A Deep Reinforcement Learning Method. IEEE Internet Things J. 2021, 8, 17275–17289. [Google Scholar] [CrossRef]
  3. Ouahouah, S.; Bagaa, M.; Prados-Garzon, J.; Taleb, T. Deep-Reinforcement-Learning-Based Collision Avoidance in UAV Environment. IEEE Internet Things J. 2022, 9, 4015–4030. [Google Scholar] [CrossRef]
  4. Shiri, H.; Seo, H.; Park, J.; Bennis, M. Attention-Based Communication and Control for Multi-UAV Path Planning. IEEE Wirel. Commun. Lett. 2022, 11, 1409–1413. [Google Scholar] [CrossRef]
  5. Sun, R.; Zhao, D.; Ding, L.; Zhang, J.; Ma, H. UAV-Net+: Effective and Energy-Efficient UAV Network Deployment for Extending Cell Tower Coverage with Dynamic Demands. IEEE Trans. Veh. Technol. 2022, 1–13. [Google Scholar] [CrossRef]
  6. Pan, Z.; Zhang, C.; Xia, Y.; Xiong, H.; Shao, X. An Improved Artificial Potential Field Method for Path Planning and Formation Control of the Multi-UAV Systems. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1129–1133. [Google Scholar]
  7. Feng, J.; Zhang, J.; Zhang, G.; Xie, S.; Ding, Y.; Liu, Z. UAV Dynamic Path Planning Based on Obstacle Position Prediction in an Unknown Environment. IEEE Access 2021, 9, 154679–154691. [Google Scholar] [CrossRef]
  8. Qadir, Z.; Zafar, M.H.; Moosavi, S.K.R.; Le, K.N.; Mahmud, M.A.P. Autonomous UAV Path-Planning Optimization Using Metaheuristic Approach for Predisaster Assessment. IEEE Internet Things J. 2021, 9, 12505–12514. [Google Scholar] [CrossRef]
  9. Shao, X.X.; Gong, Y.J.; Zhan, Z.H.; Zhang, J. Bipartite Cooperative Coevolution for Energy-Aware Coverage Path Planning of UAVs. IEEE Trans. Artif. Intell. 2022, 3, 29–42. [Google Scholar] [CrossRef]
  10. Sanchez-Fernandez, A.J.; Romero, L.F.; Bandera, G.; Tabik, S. VPP: Visibility-Based Path Planning Heuristic for Monitoring Large Regions of Complex Terrain Using a UAV Onboard Camera. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 944–955. [Google Scholar] [CrossRef]
  11. Chen, J.; Du, C.; Zhang, Y.; Han, P.; Wei, W. A Clustering-Based Coverage Path Planning Method for Autonomous Heterogeneous UAVs. IEEE Trans. Intell. Transp. Syst. 2021, 1–11. [Google Scholar] [CrossRef]
  12. Jinqiang, H.; Husheng, W.; Renjun, Z.; Rafik, M.; Xuanwu, Z. Self-organized search-attack mission planning for UAV swarm based on wolf pack hunting behavior. J. Syst. Eng. Electron. 2021, 32, 1463–1476. [Google Scholar] [CrossRef]
  13. Zhou, X.; Gao, F.; Fang, X.; Lan, Z. Improved Bat Algorithm for UAV Path Planning in Three-Dimensional Space. IEEE Access 2021, 9, 20100–20116. [Google Scholar] [CrossRef]
  14. Wang, X.; Yang, Y.; Wu, D.; Zhang, Z.; Ma, X. Mission-Oriented 3D Path Planning for High-Altitude Long-Endurance Solar-Powered UAVs with Optimal Energy Management. IEEE Access 2020, 8, 227629–227641. [Google Scholar] [CrossRef]
  15. Liu, Q.; Zhang, Y.; Li, M.; Zhang, Z.; Cao, N.; Shang, J. Multi-UAV Path Planning Based on Fusion of Sparrow Search Algorithm and Improved Bioinspired Neural Network. IEEE Access 2021, 9, 124670–124681. [Google Scholar] [CrossRef]
  16. Cheng, Z.; Zhao, L.; Shi, Z. Decentralized Multi-UAV Path Planning Based on Two-Layer Coordinative Framework for Formation Rendezvous. IEEE Access 2022, 10, 45695–45708. [Google Scholar] [CrossRef]
  17. Huang, Z.; Chen, C.; Pan, M. Multiobjective UAV Path Planning for Emergency Information Collection and Transmission. IEEE Internet Things J. 2020, 7, 6993–7009. [Google Scholar] [CrossRef]
  18. Airlangga, G.; Liu, A. Online Path Planning Framework for UAV in Rural Areas. IEEE Access 2022, 10, 37572–37585. [Google Scholar] [CrossRef]
  19. Wu, X.; Lei, Y.; Tong, X.; Zhang, Y.; Li, H.; Qiu, C.; Guo, C.; Sun, Y.; Lai, G. A Non-rigid Hierarchical Discrete Grid Structure and its Application to UAVs Conflict Detection and Path Planning. IEEE Trans. Aerosp. Electron. Syst. 2022. [Google Scholar] [CrossRef]
  20. Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
  21. Li, B.; Wu, Y. Path Planning for UAV Ground Target Tracking via Deep Reinforcement Learning. IEEE Access 2020, 8, 29064–29074. [Google Scholar] [CrossRef]
  22. Wei, K.; Huang, K.; Wu, Y.; Li, Z.; He, H.; Zhang, J.; Chen, J.; Guo, S. High-Performance UAV Crowdsensing: A Deep Reinforcement Learning Approach. IEEE Internet Things J. 2022, 9, 18487–18499. [Google Scholar] [CrossRef]
  23. Zhu, B.; Bedeer, E.; Nguyen, H.H.; Barton, R.; Henry, J. Joint Cluster Head Selection and Trajectory Planning in UAV-Aided IoT Networks by Reinforcement Learning with Sequential Model. IEEE Internet Things J. 2021, 9, 12071–12084. [Google Scholar] [CrossRef]
  24. Zhu, S.; Gui, L.; Cheng, N.; Sun, F.; Zhang, Q. Joint Design of Access Point Selection and Path Planning for UAV-Assisted Cellular Networks. IEEE Internet Things J. 2019, 7, 220–233. [Google Scholar] [CrossRef]
  25. Cui, Z.; Wang, Y. UAV Path Planning Based on Multi-Layer Reinforcement Learning Technique. IEEE Access 2021, 9, 59486–59497. [Google Scholar] [CrossRef]
  26. Wang, Z.; Rong, H.; Jiang, H.; Xiao, Z.; Zeng, F. A Load-Balanced and Energy-Efficient Navigation Scheme for UAV-Mounted Mobile Edge Computing. IEEE Trans. Netw. Sci. Eng. 2022, 9, 3659–3674. [Google Scholar] [CrossRef]
  27. Zhang, L.; Jabbari, B.; Ansari, N. Deep Reinforcement Learning Driven UAV-assisted Edge Computing. IEEE Internet Things J. 2022. [Google Scholar] [CrossRef]
  28. Chang, H.; Chen, Y.; Zhang, B.; Doermann, D. Multi-UAV Mobile Edge Computing and Path Planning Platform Based on Reinforcement Learning. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 6, 489–498. [Google Scholar] [CrossRef]
  29. Liu, R.; Qu, Z.; Huang, G.; Dong, M.; Wang, T.; Zhang, S.; Liu, A. DRL-UTPS: DRL-based Trajectory Planning for Unmanned Aerial Vehicles for Data Collection in Dynamic IoT Network. IEEE Trans. Intell. Veh. 2022, 1–14. [Google Scholar] [CrossRef]
  30. Hsu, Y.-H.; Gau, R.-H. Reinforcement Learning-Based Collision Avoidance and Optimal Trajectory Planning in UAV Communication Networks. IEEE Trans. Mob. Comput. 2022, 21, 306–320. [Google Scholar] [CrossRef]
  31. Xi, Z.; Wu, D.; Ni, W.; Ma, X. Energy-Optimized Trajectory Planning for Solar-Powered Aircraft in a Wind Field Using Reinforcement Learning. IEEE Access 2022, 10, 87715–87732. [Google Scholar] [CrossRef]
  32. Babu, N.; Donevski, I.; Valcarce, A.; Popovski, P.; Nielsen, J.J.; Papadias, C.B. Fairness-Based Energy-Efficient 3-D Path Planning of a Portable Access Point: A Deep Reinforcement Learning Approach. IEEE Open J. Commun. Soc. 2022, 3, 1487–1500. [Google Scholar] [CrossRef]
  33. Xie, H.; Yang, D.; Xiao, L.; Lyu, J. Connectivity-Aware 3D UAV Path Design with Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2021, 70, 13022–13034. [Google Scholar] [CrossRef]
  34. Bayerlein, H.; Theile, M.; Caccamo, M.; Gesbert, D. Multi-UAV Path Planning for Wireless Data Harvesting with Deep Reinforcement Learning. IEEE Open J. Commun. Soc. 2021, 2, 1171–1187. [Google Scholar] [CrossRef]
  35. Li, Y.; Aghvami, A.H.; Dong, D. Path Planning for Cellular-Connected UAV: A DRL Solution with Quantum-Inspired Experience Replay. IEEE Trans. Wirel. Commun. 2022, 21, 7897–7912. [Google Scholar] [CrossRef]
  36. Khamidehi, B.; Sousa, E.S. Reinforcement Learning-aided Safe Planning for Aerial Robots to Collect Data in Dynamic Environments. IEEE Internet Things J. 2022, 9, 13901–13912. [Google Scholar] [CrossRef]
  37. Wang, X.; Gursoy, M.C.; Erpek, T.; Sagduyu, Y.E. Learning-Based UAV Path Planning for Data Collection with Integrated Collision Avoidance. IEEE Internet Things J. 2022, 9, 16663–16676. [Google Scholar] [CrossRef]
  38. Li, D.; Yin, W.; Wong, W.E.; Jian, M.; Chau, M. Quality-Oriented Hybrid Path Planning Based on A* and Q-Learning for Unmanned Aerial Vehicle. IEEE Access 2021, 10, 7664–7674. [Google Scholar] [CrossRef]
Figure 1. System environment model.
Figure 2. UAV obstacle avoidance detection.
Figure 3. Reinforcement learning schematic.
Figure 4. Network structure of DDQN-SSQN algorithm.
Figure 5. DDQN-SSQN algorithm execution architecture.
Figure 6. Comparison chart of rewards earned.
Figure 7. UAV obstacle avoidance path diagram under four algorithm schemes.
Figure 8. Comparison of UAV flight time under four algorithm schemes.
Figure 9. Three-dimensional path planning of UAV under four algorithm schemes.
Figure 10. Comparison of the number of disaster victims found by the UAV under four algorithm schemes.
Figure 11. Comparison of UAV mission completion rates under four algorithm schemes.
Figure 12. Comparison of the success rate of UAV obstacle avoidance under four algorithm schemes.
Table 1. Training simulation parameters.

Parameter Symbol | Parameter Name | Setting Value
$S_n$ | Number of states | 8
$A_n$ | Number of actions | 4
$N_{ep}$ | Number of training rounds | 700
$\gamma$ | Discount factor | 0.95
$L$ | Learning rate | 0.0001
$E$ | Experience pool capacity | 100,000
$V_u$ | Update frequency | 2
$N_s$ | Number of samples taken | 128