DDQN with Prioritized Experience Replay-Based Optimized Geographical Routing Protocol Considering Link Stability and Energy Prediction for UANETs

Unmanned aerial vehicles (UAVs) are important equipment for efficiently executing search and rescue missions in disaster or air-crash scenarios. In UAV ad hoc networks (UANETs), each node communicates with the others through a routing protocol. However, UAV routing protocols face the challenges of high mobility and limited node energy, which lead to unstable links and, through premature node death, sparse network topology; this severely degrades network performance. To solve these problems, we propose a deep-reinforcement-learning-based geographical routing protocol considering link stability and energy prediction (DSEGR) for UANETs. First, we devise a link stability evaluation indicator and use the autoregressive integrated moving average (ARIMA) model to predict the residual energy of neighbor nodes. Then, the packet-forwarding process is modeled as a Markov Decision Process, and a double deep Q network with prioritized experience replay learns the routing decisions. Meanwhile, a reward function is designed to obtain a better convergence rate, and the analytic hierarchy process (AHP) is used to analyze the weights of the factors considered in the reward function. Finally, to verify the effectiveness of DSEGR, we conduct simulation experiments to analyze network performance. The simulation results demonstrate that our proposed routing protocol remarkably outperforms others in packet delivery ratio and has a faster convergence rate.


Introduction
Hardware and communication technologies have developed by leaps and bounds in recent years, making UAVs more agile, more robust, and cheaper. The applications of UAVs are becoming increasingly extensive, such as wildfire monitoring [1], border surveillance and reconnaissance [2], post-disaster communication aid [3], pesticide spraying for farming [4], and assisting sensor-network communication [5]. A UAV ad hoc network is more efficient in dealing with complicated multitasking and has higher scalability than a single UAV.
A UAV ad hoc network (UANET) is a particular ad hoc network similar to VANET and MANET. The similarities are that each node can move freely and communicate with other nodes; the differences are that UANET nodes move faster and the network density is sparse in some scenarios [6]. Furthermore, UAVs can move in 2D and 3D environments [7] at high speed. The longer distances and higher speeds consume a lot of energy for communication, which is a fatal problem for energy-limited UAVs and leads to link interruption when a node's energy is exhausted. The high-speed mobility of UAVs brings about frequent topology changes and link instability. Moreover, UANET environments are more flexible and changeable. Therefore, considering the challenges mentioned above, we must design an adaptive and intelligent routing protocol that selects the next hop for better network performance in UANETs.
The primary purpose of routing is for a source node to send packets to a destination node by forwarding them through intermediate nodes. Routing protocols fall into three categories: traditional routing protocols, heuristic routing protocols, and reinforcement-learning (RL)-based routing protocols. Traditional routing protocols mainly include four classes: (1) Proactive protocols, such as the optimized link state routing (OLSR) protocol [8], a table-driven method that forwards packets in a timely manner based on global network topology; however, it needs to update routing tables frequently, resulting in excessive overhead. (2) Reactive routing protocols, like ad hoc on-demand distance vector (AODV) [9], which can reduce overhead but take a long time to reestablish a new route when the network topology changes or a route failure occurs. (3) Hybrid protocols, like the zone routing protocol (ZRP) [10], which combines the advantages of proactive and reactive routing; however, it is challenging to keep gathering information about highly dynamic nodes and link behavior. (4) Geographic location protocols, for instance GPSR [11], which utilize location information for greedy forwarding of packets and fall back to perimeter forwarding when faced with a void area. Although GPSR is superior to non-position-based routing protocols, perimeter forwarding causes large delays, which may severely impact network performance. In summary, due to the unique features of UANETs, it is tough to adapt these protocols to highly dynamic UANETs. Many studies have proposed modifications of the traditional routing protocols to adapt them to UANETs [12,13]. Although traditional routing protocols are simple and easy to implement, their major drawback is a lack of intelligence and autonomy in adapting to the high-speed mobility, complex environments, and varied flight tasks of UANETs.
Some researchers have been inspired by laws of nature or experience with specific problems, proposing heuristic routing protocols. In [14,15], a genetic algorithm-based routing protocol to improve the efficiency of MANET was proposed. Ant colony optimization has been used for routing protocol [16,17]. However, in all of these protocols, only simple intelligent interactions are implemented, lacking explicit theoretical support, and global optimal solutions cannot be guaranteed.
In order to overcome the above problems, some researchers have proposed reinforcement-learning (RL)-based routing protocols that are intelligent and highly autonomous in UANETs. As we know, RL is an essential part of machine learning and is inspired by behaviorist psychology; agents make decisions by interacting with an environment so as to maximize the cumulative reward. In RL-based routing protocols, each UAV is regarded as an agent and learns to select the next hop based on the designed reward function and the algorithmic model, maximizing the overall performance of the network according to different optimization criteria. Currently, most researchers use value-based RL to solve routing problems in UANETs [18][19][20][21]. However, these value-based reinforcement-learning routing protocols mainly operate by establishing a Q-table to select the next hop, which consumes a lot of memory space when the network scale is large. Meanwhile, the Q-table needs to be retrained when nodes join or leave, which leads to poor extensibility. Therefore, deep-neural-network-based reinforcement learning has been proposed to solve the above routing problems [22]. However, the majority of the aforementioned studies do not comprehensively consider the limited-energy problem and link stability, which seriously affect network performance.
Motivated by the above considerations, a distributed, model-free deep-reinforcement-learning algorithm is proposed to optimize the geographical routing protocol. When a node selects the next hop, it considers location information, link stability, and energy information. Finally, nodes learn to choose the optimal action quickly to maximize network performance through constant training. The main contributions of this paper can be summarized as follows:
• We introduce a link-stability evaluation indicator, which uses the variance of the distance between nodes over a period of time to measure the degree of link stability, decreasing the packet loss caused by high mobility.
• We use the ARIMA model to predict neighbor nodes' residual energy to prevent premature node death, which achieves energy balance and decreases packet loss.
• A double deep Q network with prioritized experience replay is used to make routing decisions efficiently. We take geographical location, residual energy, and link stability into account when selecting the next hop. Based on these factors, a more appropriate reward function is designed to make the algorithm converge more quickly.
• We conducted extensive experiments and analyzed various performance metrics to verify the advantages of our proposed protocol. The results show that the network performance of DSEGR, in terms of packet delivery rate and convergence rate, is better than that of the compared routing protocols.
The remainder of this paper is arranged as follows. Section 2 reviews the related works on UANETs and indicates the problems concerning the routing protocols. The system model is presented in Section 3. Section 4 describes our proposed DSEGR protocol in detail. Experimental results and network performance analyses are presented in Section 5. Finally, the conclusions and future work are presented in Section 6.

Conventional Routing Protocol
Traditional routing protocols have developed rapidly in the last few decades. The authors of [23] carried out performance tests and comparisons of classical routing protocols: AODV, OLSR, Dynamic Source Routing (DSR), and the geographic routing protocol (GRP). The simulation results indicate that different routing protocols suit different UAV communication network scenarios. Moreover, some studies have proposed amending the traditional routing protocols. The authors of [24] introduced a predictive OLSR (P-OLSR) protocol that takes advantage of GPS information to predict the quality of the links among the nodes. The experimental results show that P-OLSR outperforms the classic OLSR in average goodput. The authors of [25] proposed the mobility- and load-aware OLSR (ML-OLSR) routing protocol. The results of the study show improved delay performance and packet delivery rate compared to OLSR. However, each node consumes large amounts of energy for movement and communication, especially in energy-limited small UANETs; thus, ML-OLSR lacks an energy metric that would achieve better performance. The authors of [26] proposed a new mobility model based on a spiral line (SLMM) for aerial backbone networks, with whose help AODV can be improved. These routing protocols are all topology-based and have a low degree of autonomy; they cannot adapt well to high mobility, complex flight environments, and diverse flight tasks. In addition, topology-based routing protocols aim at building a routing table for the whole network, which increases maintenance costs. Thus, location-aided routing (LAR) [27] uses geographic location information to reduce the routing overhead, and the later classic greedy perimeter stateless routing (GPSR) protocol [11] was proposed, in which every UAV periodically broadcasts its geographical location to assist packet forwarding.
GPSR uses greedy forwarding to reduce the end-to-end delay. As the number of network nodes increases, the GPSR protocol shows stronger expansibility, which makes it more suitable for UANETs. However, it only achieves local optimality and causes excessive routing overhead when nodes face a void area. To overcome the excessive overhead caused by perimeter forwarding, the authors of [28] used game theory to select the next-hop node, considering the forwarding angle and distance in the perimeter forwarding stage. In reference [29], the authors proposed ECORA, which considers position prediction and link expiration time to provide a better quality of experience (QoE).

Heuristic Routing Protocols
Researchers have come up with some heuristic routing protocols, which have primary intelligence and can optimize and adjust network communications in a timely manner. The authors of [30] integrated Tabu-search meta-heuristics and simulated annealing to improve local-search results in discrete optimization for a geographical routing protocol in the Delay-Tolerant Network. The results show that the Tabu list improves the packet delivery ratio, and when Tabu is used, a better performance can be obtained with the simulated annealing routing strategy. In [30], an ant colony optimization algorithm combined with the Dynamic Source Routing algorithm is proposed. The pheromone level is calculated from the distance, the congestion level, and the stability of a route, and a new pheromone volatilization mechanism is proposed; judging by the results, the proposed algorithm outperforms traditional algorithms in terms of network performance in a battle environment. In [31], the authors extended the Gaussian-Markov mobility model into the fitness function of the bee colony algorithm (BCA) to predict the link expiration time, and then used BCA for the route discovery of FANETs in the 3D environment [32]. A genetic-algorithm (GA)-based ad hoc on-demand multi-path distance-vector routing protocol is proposed in [33], which utilizes a fitness function to optimize the path based on energy consumption; the proposed routing protocol extends the network's lifetime. Although a heuristic algorithm can employ simple intelligence to manage an ad hoc network, it is more suitable for solving static optimization problems. A heuristic routing protocol must relearn when nodes join or leave, which lacks scalability and consumes considerable computation time in high-mobility UAV networks.

Reinforcement Learning Based Routing Protocol
With the development of artificial intelligence, RL technologies have shown prominent advantages in dynamic routing optimization. Some works utilize deep RL to autonomously learn routing decisions, giving nodes independent processing capabilities and full intelligence to handle complex tasks and understand an environment. The paper of Boyan and Littman [34] is a groundbreaking work that used RL in a communication network. After its publication, many studies followed the original RL idea to improve routing performance according to different network features and performance requirements [35][36][37]. The QGrid [38] routing protocol divides the whole network into grids and then selects the next hop in the optimal grid, which can avoid void areas and achieve the optimal distance to the destination; however, QGrid does not consider link status, which may cause a suboptimal solution. To settle this problem, the authors of [39] proposed a Q-learning-based geographic routing protocol (QGeo), which considers the packet link error and location error to improve the packet delivery rate. However, the method has difficulty converging when the Q-table is very large. In [40], the authors proposed a hybrid routing protocol that uses position prediction to overcome the directional deafness problem in the MAC layer and, furthermore, a Q-learning routing protocol in the network layer, but it has the same problems as QGeo and therefore cannot be applied to large-scale networks. In [41], the authors proposed a full-echo Q-routing algorithm with simulated annealing interference (FEQSAI) to adaptively control the exploration rate of the algorithm. It is an online learning algorithm that learns by adjusting an adaptive temperature parameter related to the energy and the speed of nodes.
It is suitable for emergency FANET scenarios that cannot be trained in advance, but it cannot obtain a better performance at the beginning and needs to retrain in new scenarios. In order to adapt to large-scale networks, a deep Q network enhanced GPSR protocol (QNGPSR) is proposed in [42]; it improves routing performance by using a deep Q network and two-hop node estimation to avoid the probability of packet forwarding to void areas. However, QNGPSR cannot register load information. To solve this problem, a traffic-aware QNGPSR (TQNGPSR) protocol was proposed in [43]. TQNGPSR achieves load balance and adapts to large traffic scenarios. In Table 1, we list some of the routing protocols, highlighting the characteristics of different protocols and comparing them with our proposed routing protocol. Although the previous works provide relatively effective improvements, there are still some drawbacks that need to be optimized.

System Model
In this paper, we study a routing problem in such a scenario where UAVs perform search and rescue missions in natural disasters or air crashes. During the search phase, the UAVs move randomly with high speed in a large area to find urgent and critical target areas quickly. In the rescue phase, multiple UAVs collaboratively execute tasks in a relatively small and fixed area; the speed of the UAVs is low and the relative displacements between the nodes are not very large. In order to efficiently complete the search and rescue missions, UAVs need to interact with the other nodes. The communication radius of UAVs is R. The initial energy and the type of UAV are all the same. Each node is equipped with GPS.

Link Stability Model
In order to find targets quickly in the search phase, high mobility is one of the major challenges for the UAV routing protocol, as it leads to frequent topology changes and communication link interruptions. To reduce the packet loss due to the rapid movement of nodes, we propose a link-stability evaluation indicator. Each node records its own location at the last five hello moments and stores the location sequence in the hello packet header. A node i receives the location sequence L_j = {L_j(t_1), L_j(t_2), ..., L_j(t_5)} from its neighbor node j. The distance between node i and its neighbor node j at time t_1 can be calculated as

D_ij(t_1) = √[(x_i(t_1) − x_j(t_1))² + (y_i(t_1) − y_j(t_1))² + (z_i(t_1) − z_j(t_1))²],

where (x_i(t_1), y_i(t_1), z_i(t_1)) and (x_j(t_1), y_j(t_1), z_j(t_1)) are the coordinates of nodes i and j at time t_1, respectively. Hence, the distance sequence between i and j during the last five hello intervals is expressed as D_ij = {D_ij(t_1), D_ij(t_2), ..., D_ij(t_5)}. Then, in accordance with the variance of D_ij, the link change extent between i and j over a period of time can be evaluated. We estimate the link stability LS_ij as

LS_ij = D(D_ij) = E[D_ij²] − (E[D_ij])²,

where E[D_ij] and E[D_ij²] are the expected distance and the expected square of the distance, respectively, and D(D_ij) is the variance of D_ij. If the variance is equal to zero, the link between nodes i and j is strictly stable during the last five hello intervals. The smaller D(D_ij) is, the more stable the link is.
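The indicator reduces to a variance computation over the recorded distance samples. The sketch below (the function names and the five-sample window layout are our own, not taken from the paper's implementation) assumes each node's last five positions are available as 3D coordinate tuples:

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def link_stability(pos_i, pos_j):
    """LS_ij = Var(D_ij) = E[D_ij^2] - (E[D_ij])^2 over the last
    five hello moments; zero means a strictly stable link."""
    d = [distance(p, q) for p, q in zip(pos_i, pos_j)]
    mean = sum(d) / len(d)
    mean_sq = sum(x * x for x in d) / len(d)
    return mean_sq - mean ** 2
```

A node whose relative distance to its neighbor never changes scores 0; a drifting distance sequence raises the score, marking the link as less stable.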

ARIMA-Based Residual Energy Prediction Model
In the rescue phase, some nodes may be required to hover for a period of time or fly relatively slowly. Nodes in the marginal area then consume less energy, while nodes in the central region consume much more, i.e., an energy imbalance arises. Eventually, some central nodes die early, leading to sparse network topology and further decreasing the packet delivery rate.
In order to clarify the problem, we use Figure 1 as an example, which shows UAV nodes and communication links in UANETs. Source nodes i, b, and c want to send packets to the destination node d in different time slots. The packets are forwarded through nodes j and k successively and finally reach the destination node d. Hence, the relay nodes j and k take on more communication tasks, becoming congested and more likely to run out of energy. Assume node j is on the verge of death but node i has not received the latest hello packet from it; if node i still sends packets to node j, packet loss results. Therefore, we take energy into consideration when selecting the next hop to address the energy imbalance problem. In our scenario, the influence of the external environment, the weather, and other unimportant factors on energy is ignored. The energy consumption of UAVs is mainly divided into three parts: flight, packet transmission, and packet reception. Nodes move at the same velocity magnitude but in random directions, so the flight energy consumption is basically proportional to time. The residual energy of node j after receiving packets from nodes i, b, and c and sending packets to node k during a hello interval is denoted as follows:

E_res_j(t_1) = E_res_j(t_0) − E_fly − 3·E_rx − (E_tx + ε_fs·d²_{j,k}),

where E_res_j(t_0) is the residual energy of the node at the last hello moment t_0, E_res_j(t_1) is the residual energy of the node at the current hello moment t_1, E_fly is the flight energy consumption during a hello interval, and E_rx and E_tx are the energy consumed when receiving and transmitting a packet, respectively. ε_fs·d²_{j,k} is the required energy loss in the free-space model, where ε_fs is the energy dissipation coefficient of the circuit amplifier, d_{j,k} is the distance between nodes j and k, and the path-loss exponent is 2.
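As a quick numeric illustration of this energy model, the residual-energy update over one hello interval can be sketched as follows; the function name and the default packet counts (three receptions and one transmission, matching the example with sources i, b, c and relay k) are our own illustrative choices:

```python
def residual_energy(e_prev, e_fly, e_rx, e_tx, eps_fs, d_jk,
                    n_rx=3, n_tx=1):
    """Residual energy after a hello interval: previous energy minus
    flight cost, the cost of receiving n_rx packets, and the cost of
    transmitting n_tx packets over distance d_jk (free-space model,
    amplifier loss eps_fs * d_jk**2 per packet)."""
    return (e_prev - e_fly
            - n_rx * e_rx
            - n_tx * (e_tx + eps_fs * d_jk ** 2))
```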
Meanwhile, each node records its residual energy sequence during the last ten hello intervals and stores the energy sequence in the hello packet header. For example, the residual energy sequence of node j is represented as E_j = {E_res_j(t_1), E_res_j(t_2), ..., E_res_j(t_10)}. Each node periodically broadcasts its energy sequence to its neighbor nodes in hello packets. To avoid the packet loss caused by nodes dying early, each node uses the ARIMA model to predict its neighbors' residual energy from the received energy sequences. The ARIMA model is a time-series analysis model that uses historical data to predict future values; it has flexible and excellent forecasting performance and a lightweight computational cost. Moreover, the reason we use the ARIMA model is that the energy of UAVs is affected by many factors, so the variation pattern of energy consumption is neither stationary nor purely linear or non-linear. Some statistical prediction methods, such as the autoregressive (AR) model, the moving-average (MA) model, and the ARMA model, cannot adapt to our scenario; the ARIMA model, however, can be applied to a non-stationary time series.
Since a UAV's residual energy sequence is non-stationary, we first preprocess it when a node receives a hello packet from a neighboring node: d-order differencing transforms it into a stationary time series, where d is the number of times the residual energy sequence is differenced against its past values. Then, we check whether the differenced sequence is a white-noise sequence. If it is not, we determine the parameters p and q of the ARIMA model from the autocorrelation function (ACF) and the partial autocorrelation function (PACF), respectively. p is the order of the autoregressive model, representing the relationship between the node's remaining energy at the next moment and its remaining energy over the previous p moments; q is the order of the moving-average model, describing the relationship between the remaining energy at the next moment and the white noise over the previous q moments. The Akaike information criterion is used to specify the optimal order (p, q) of the model. After the model order is determined, the mean-square-error method or the maximum-likelihood method is used to obtain the weight coefficients η of the autoregressive term and ϕ of the moving-average term. According to the ARIMA model established above, node i can predict the residual energy of the neighboring node j by the following formula:

E_res_j(t) = Σ_{v=1}^{p} η_v · E_res_j(t − v) + w(t) + Σ_{l=1}^{q} ϕ_l · w(t − l),

where E_res_j(t) is the predicted residual energy at the next hello moment t, E_res_j(t − v) is the residual energy at the previous v-th hello moment, w(t) is the Gaussian white noise of the residual energy of node j at the next moment t, w(t − l) is the Gaussian white noise at the previous l-th hello moment, and η_v and ϕ_l are the weight coefficients of the residual energy and the Gaussian white noise at the previous v-th and l-th hello moments, respectively.
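A full ARIMA fit would select (p, d, q) via the ACF/PACF and the Akaike information criterion as described above; the sketch below is only a minimal ARIMA(1,1,0)-style forecast (first-difference for stationarity, one least-squares AR coefficient, no moving-average term), with names of our own choosing:

```python
def predict_next_energy(history):
    """One-step residual-energy forecast, ARIMA(1,1,0) style:
    difference the sequence once (d=1), fit a single autoregressive
    coefficient eta by least squares (p=1, q=0), then integrate the
    forecast difference back onto the last observation."""
    diffs = [b - a for a, b in zip(history, history[1:])]
    x, y = diffs[:-1], diffs[1:]
    denom = sum(v * v for v in x)
    eta = sum(u * v for u, v in zip(x, y)) / denom if denom else 0.0
    return history[-1] + eta * diffs[-1]
```

For a steadily draining node the differenced series is constant, so the forecast simply extends the drain rate; production code would instead rely on a library fit such as statsmodels' ARIMA.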

Proposed Routing Protocol
In this section, we propose a model-free reinforcement-learning-based geographic routing protocol. The routing-decision process is modeled as a Markov Decision Process. Our protocol mainly includes the following phases: sensing, feature extraction, learning, and routing decision. The routing protocol architecture is shown in Figure 2. The sensing phase acquires features, and the deep-reinforcement-learning algorithm trains nodes in how to make routing decisions. Next, the DSEGR protocol is described in detail.

Sensing Phase
In the sensing phase, each UAV node periodically broadcasts its location in hello packets and maintains its neighbor table. For each neighbor x, the neighbor table of node i stores LS_ix, E_res_x, and L_x, which respectively represent the link stability between node i and its neighbor node x during a period of time, the predicted residual energy of the neighbor node x at the next hello moment, and the geographical location of node x. NF_x = (D_1d, D_2d, ..., D_8d) is the topology information of the neighbor node x, where D_hd denotes the farthest distance between the neighbors of node i in the h-th region and the destination node d, used to estimate the location of the two-hop neighbors of the node in that region, as already described in [42]. In the process of data-packet forwarding, additional information is recorded in the data packet, including the source node s, the last-hop node l, the destination node d and its position L_d, and the set of nodes HVN = {i, ..., l} already visited during packet forwarding. NF_i makes it possible to evaluate the ability of node i's two-hop neighbors to reach the destination node, so as to avoid forwarding a packet into a void area. The purpose of recording additional information in a packet is to construct feature values and prevent the formation of loops.

Markov Decision Process
In this section, the use of deep reinforcement learning (DRL) to learn routing decisions for UANETs is presented. The Markov Decision Process (MDP) is used to simplify the modeling of reinforcement learning, so we first model the routing problem as an MDP. An MDP consists of the four-tuple (S, A, P, R), where S is a state space, A is an action space, P is the probability of transition from one state to another, and R is an immediate reward after a state transition. Consider the following scenario: when node i receives a packet, the decision target is to select a neighbor node to forward the packet; the state is then transferred from the current node to the next node, and the process repeats until packet delivery fails or succeeds. Based on the above scenario, the MDP is created as follows.
(1) State space: S = {s_1, s_2, ..., s_N}, where N is the total number of nodes in a UANET. The optimal decision of node i is determined by the information of its neighbors. Therefore, the features of node i for a candidate neighbor x can be extracted as

s_i = (D_i,d, D_x,d, D_i2,d, C_ix→xd),

where D_i,d is the distance between node i and the destination node d; D_x,d is the distance between the neighbor node x of node i and the destination node d; D_i2,d is the minimum distance between the two-hop node i_2 of node i and d; and C_ix→xd is the cosine similarity between the vector from the last-hop node i to node x and the vector from node x toward the destination [37]. The smaller D_sum is, the greater the probability that the two-hop neighbors of the node can reach the destination node and avoid the void area.
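The cosine-similarity feature C_ix→xd can be computed directly from node positions. In this sketch (the helper names are ours) we assume the second vector points from the candidate neighbor x toward the destination d:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def forwarding_cosine(pos_i, pos_x, pos_d):
    """C_ix->xd: similarity between the i->x hop direction and the
    x->d direction; values near 1 mean neighbor x keeps the packet
    heading toward the destination."""
    u = tuple(b - a for a, b in zip(pos_i, pos_x))
    v = tuple(b - a for a, b in zip(pos_x, pos_d))
    return cosine_similarity(u, v)
```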
(2) Action space: A = {a_1, a_2, ..., a_N}. When node i receives a packet, it needs to select a neighbor as the next hop to forward the packet. Thus, the optional action of node i is expressed as a_i ∈ {x | x ∈ N_nbr(i)}, where x is a neighbor node of node i.
(3) Transition probability: The probability of transition between the node i and its neighbor node is determined by the environment, and it is random and unknown.
(4) Reward function: The aim of the reward function R is to make nodes learn how to take an action and achieve faster convergence. The reward for node i selecting its neighbor node x is

R_i(x) = 100, if x = d;
R_i(x) = w_1·R_D + w_2·R_E + w_3·R_LS, otherwise,

where R_D, R_E, and R_LS are the reward terms for the distance between node x and the destination d, the predicted residual energy of node x, and the link stability between node i and its neighbor node x, respectively, and w_1, w_2, and w_3 are the corresponding weights, with w_1 + w_2 + w_3 = 1. If the neighbor node of node i is the destination, the reward is 100; otherwise, the reward is the weighted sum of the factors considered above.
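The branch structure of the reward can be sketched as below; the normalized per-factor arguments r_dist, r_energy, and r_ls and the default weights (matching the AHP result discussed next) are our own naming, not the paper's code:

```python
def reward(is_destination, r_dist, r_energy, r_ls,
           w=(0.5, 0.25, 0.25)):
    """Reward for node i choosing neighbor x: a terminal bonus of 100
    when x is the destination d, otherwise a weighted sum of the
    normalized distance, predicted-energy, and link-stability terms
    (weights sum to 1)."""
    if is_destination:
        return 100.0
    w1, w2, w3 = w
    return w1 * r_dist + w2 * r_energy + w3 * r_ls
```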
In order to acquire appropriate weights w_1, w_2, and w_3, the analytic hierarchy process (AHP) method is used. The AHP structure of the reward function is shown in Figure 3. In our scenario, the target layer of the AHP model is the reward of a selected next-hop node, the criteria layer comprises the factors considered above, and the index layer is the reward of the different neighbor nodes. First of all, we need to build the judgment matrix A by comparing the factors of the criteria layer in pairs:

A = | 1    2  2 |
    | 1/2  1  1 |
    | 1/2  1  1 |
In accordance with the judgment matrix A, we obtain the normalized eigenvector ω = (0.5, 0.25, 0.25), whose entries correspond to the weights of the distance, the predicted energy, and the link stability, respectively. In accordance with Aω = λω, the maximum eigenvalue of matrix A is calculated as λ_max = 3. Then, we check the consistency to judge the rationality of the judgment matrix A:

CI = (λ_max − n) / (n − 1), CR = CI / RI,

where n is the dimension of matrix A, CI is the consistency index, CR is the consistency ratio, and RI is the random index. If CR is less than 0.1, the construction of matrix A is reasonable and satisfies the consistency condition.
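The AHP computation can be reproduced in a few lines. The judgment matrix below is one consistent pairwise comparison that yields the weights (0.5, 0.25, 0.25); power iteration extracts the principal eigenvector, and CR = CI/RI checks consistency (RI = 0.58 is the standard random index for a 3×3 matrix):

```python
def ahp_weights(A, ri=0.58):
    """Principal-eigenvector weights of a pairwise judgment matrix
    via power iteration, plus the consistency ratio CR = CI / RI."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(100):
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        w = [x / s for x in v]
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n  # lambda_max
    ci = (lam - n) / (n - 1)                       # consistency index
    return w, ci / ri

# Distance judged twice as important as energy and link stability.
A = [[1, 2, 2], [0.5, 1, 1], [0.5, 1, 1]]
```

For this matrix the weights come out as (0.5, 0.25, 0.25) with CR = 0, i.e., the comparison is perfectly consistent.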
In accordance with the above reward function, as we can see, the weight of the distance is the largest, which means that when the link stability and the neighbors' predicted energy are the same, the shortest route is a priority, except for a void area situation. When nodes move slowly, the link stability has little effect on the reward function, but the energy factor mainly affects the reward function. When nodes move quickly, link stability is the dominant factor.

DDQN with Prioritized Experience Replay Algorithm
In our scenario, since the action space is discrete, we adopt value-based RL algorithms. Value-based RL aims to learn the optimal policy function π*(a_t | s_t) based on the action-value function Q(s_t, a_t). The optimal policy function π* is denoted as π*(a_t | s_t) = P(A = a_t | S = s_t), representing the optimal probability that the node selects the next hop a_t in the state s_t. The action-value function Q is expressed as Q_π(s_t, a_t) = E_π[G | S = s_t, A = a_t], representing the expected cumulative reward of taking the next hop a_t in the state s_t under the policy π, where the cumulative reward G is denoted as G = Σ_{t=0}^{T} γ^t R_t and represents the sum of the reward values from the current node to the destination node over T hops. γ ∈ [0, 1] is the discount factor, which evaluates the importance of future rewards.
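The cumulative reward G = Σ γ^t R_t collected along a delivery path can be evaluated as below (a trivial helper of our own naming):

```python
def discounted_return(rewards, gamma=0.9):
    """G = sum_t gamma^t * R_t: discounted sum of the per-hop rewards
    from the current node to the destination."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```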
We hope the designed routing protocol can adapt to large-scale UANETs. Deep reinforcement learning is a powerful approach that combines the perception of deep learning with the decision making of reinforcement learning, allowing nodes to interact with the environment and make decisions intelligently in UANETs. In [42], the authors use a deep neural network to approximate the action-value function with two Q networks: the current Q network with parameter set θ and the target Q network with parameter set θ′. The target Q network is always greedy, selecting the maximum Q value of the next-hop node as the target Q value, which accelerates convergence but leads to overestimation and large biases. In our paper, we use double deep Q-learning (DDQN) [44] to eliminate the overestimation problem by decoupling the selection of the action for the target Q value from the calculation of the target Q value. That is to say, the next state s′ is input into the current Q network to obtain the action a′ with the optimal Q value; then, the action a′ and the next state s′ are fed to the target Q network to obtain the target Q value. The update of the target Q value is as follows:

y_t = r_t + γ · Q(s′, argmax_{a′} Q(s′, a′; θ); θ′).
In our work, γ is set to 0.9 based on our experience. The objective of the Q network is to minimize the loss function L(θ_t), which is shown as:

L(θ_t) = E[(Q_target − Q(s_t, a_t; θ_t))²]

Then, the Adam gradient descent method is used to optimize the loss function. The gradient of the loss function can be depicted as:

∇_{θ_t} L(θ_t) = −E[(Q_target − Q(s_t, a_t; θ_t)) ∇_{θ_t} Q(s_t, a_t; θ_t)]

In [44], when calculating the target Q value, all samples e_t = (s_t, a_t, r_t, s_{t+1}) in the experience replay pool are sampled with the same probability. However, different samples have different effects on backpropagation: the larger the TD-error, the greater the effect, where TD-error = Q_target − Q(s_t, a_t; θ). So we adopted the prioritized experience replay in [45], which samples transitions with larger absolute TD-error with higher probability. Generally, the SumTree structure is used to store samples with their priorities. This method accelerates the convergence rate of the current Q network.
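A minimal sketch of proportional prioritized replay is shown below. For clarity it replaces the SumTree with a linear scan, so sampling is O(n) instead of O(log n); the class and parameter names are ours, not from the paper:

```python
import numpy as np

class SimplePrioritizedReplay:
    """Proportional prioritized experience replay (simplified: a linear scan
    stands in for the SumTree used in practice). Transitions with larger
    absolute TD-error receive higher priority and are sampled more often."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        # Priority p = (|TD-error| + eps)^alpha; eps keeps it strictly positive.
        p = (abs(td_error) + self.eps) ** self.alpha
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size, rng=np.random):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = rng.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx

    def update(self, idx, td_errors):
        # After a training step, refresh priorities with the new TD-errors.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha
```

In a full implementation, importance-sampling weights would also be applied to correct the bias introduced by non-uniform sampling, as in [45].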
The parameters of the target Q network are updated from the newest parameters of the current Q network at regular intervals. We use the moving-mean method to update the target network parameters, θ′_t = αθ′_{t−1} + (1 − α)θ_t, which saves storage space and avoids large jumps in the update process, thereby accelerating learning.
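A minimal sketch of this moving-mean update on flat parameter lists (in practice the update would run over each layer's weight tensors):

```python
def moving_mean_update(theta_target, theta_current, alpha=0.9):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t: the target
    parameters drift slowly toward the current network's parameters, so the
    target values change smoothly rather than in abrupt jumps."""
    return [alpha * t + (1 - alpha) * c for t, c in zip(theta_target, theta_current)]
```

A large α keeps the target network nearly frozen between updates; α = 0 would copy the current network outright.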
In this paper, the training and run phases adopt two different strategies. In the training phase, we use the ε-greedy method to select the next-hop node, which is depicted as:

a_t = argmax_a Q(s_t, a; θ) with probability 1 − ε; a random neighbor node with probability ε

where ε is the exploration probability used to balance exploitation and exploration in order to avoid falling into a local optimum. Based on our experience, ε is set to 0.05. With probability 1 − ε, the next hop with the maximum Q value is selected; with probability ε, a node is randomly selected as the next-hop forwarding node. In addition, we use the softmax function to choose the next hop in the run (test) phase, which can distribute traffic among nodes that have nearly identical Q values. When node i needs to forward packets, the softmax policy of node i is given by:

π_i(a) = e^{Q_i(a)} / ∑_{a′ ∈ A_i} e^{Q_i(a′)}
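The two selection strategies can be sketched as follows (our own minimal implementations of the standard ε-greedy and softmax policies over a vector of neighbor Q values):

```python
import numpy as np

def epsilon_greedy(q_values, eps=0.05, rng=None):
    """Training policy: exploit the max-Q next hop with probability 1 - eps,
    explore a uniformly random neighbor with probability eps."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, rng=None):
    """Run/test policy: sample the next hop in proportion to exp(Q), which
    spreads traffic over neighbors with nearly identical Q values."""
    rng = rng or np.random.default_rng()
    z = np.exp(np.asarray(q_values) - np.max(q_values))  # numerically stable
    return int(rng.choice(len(q_values), p=z / z.sum()))
```

With Q values (1.0, 3.0, 2.0), ε-greedy almost always picks the second neighbor, while softmax routes roughly two thirds of packets through it and the rest through the others.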

Routing Decision Phase
According to the reinforcement learning model constructed above, we design the DSEGR protocol. The DSEGR protocol mainly contains neighbor table maintenance and routing decision. Each node maintains and updates its neighbor topology table via periodic hello packets. Whenever a node receives a packet, the routing algorithm is executed; a detailed description is shown in Algorithm 1. Take node i as an instance. First, the features of node i, denoted F_i, are extracted from the neighbor topology table and the data packet, where L_i, L_l, L_d, and L_x are obtained from the data packet, and L_{i,x}, E^pre_x, and NF_x are obtained from the NT table. Then, the features of all neighbors are input into the RL model to learn how to make an optimal routing decision.

Algorithm 1 DSEGR routing protocol
Input: L_i: location of node i; p: data packet; NT_i: neighbor topology table of node i
Output: the next hop with the optimal Q value
1: Obtain the set of neighbor nodes A_i from NT_i;
2: Initialize the optimal next hop a* = −1;
3: if A_i = ∅ then
4:    Drop the packet;
5: else
6:    for each neighbor x ∈ A_i do
7:        Input F_i into our RL model and output Q_x;
8:        if the Q network is in the training phase then
9:            Select the next hop according to (13);
10:           Store e_t = (s_i, a_i, r_i, s_x) into the experience set D_t = {e_1, · · · , e_t};
11:           Update the Q network parameters θ and θ′;
12:       else
13:           Select the next hop using (14);
14:       end if
15:   end for
16: end if
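The per-packet decision of Algorithm 1 can be sketched as a single function (hypothetical helper names; the Q model and the two selection policies are passed in as callables):

```python
def dsegr_next_hop(neighbors, features, q_model, training, train_select, test_select):
    """Score each candidate neighbor with the learned Q model, then pick the
    next hop with the training policy (eps-greedy) or the run policy (softmax).
    Returns -1 when there is no neighbor, i.e., the packet is dropped."""
    if not neighbors:
        return -1
    q_values = [q_model(features[x]) for x in neighbors]
    idx = train_select(q_values) if training else test_select(q_values)
    return neighbors[idx]
```

Keeping the selection policies as parameters mirrors the split in the algorithm: the same scoring path is reused in both phases, and only the final choice rule differs.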

Simulation Environment and Performance Metrics
In this paper, we use Python 3.6 and the SimPy module to simulate the underlying communication of UANETs. We assume that each UAV flies at the same altitude. A total of 40∼100 UAVs were randomly distributed in a 2000 m × 2000 m area. Nodes move at 10 m/s in the training phase and at 1∼20 m/s in the test phase for the DSEGR protocol. The communication radius is 350 m; the hello packet interval is 0.5 s; the DDQN contains four layers with 5 × 16 × 4 × 1 neurons, and each layer uses SELU as the activation function. Considering that the features have different dimensions, they are processed with the Min-Max normalization method, which speeds up the convergence of DSEGR. Adam is used as the optimizer. The learning rate is set to 0.001. The mini-batch gradient descent method is used to update parameters, and the batch size is 32. The initial energy of UAVs is 1000 J. The simulation parameter settings of the DSEGR protocol are listed in Table 2.
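The Min-Max step can be sketched as follows, assuming fixed per-feature bounds taken from the simulation setup (e.g., coordinates in [0, 2000] m, residual energy in [0, 1000] J); the function name and bounds-passing convention are ours:

```python
import numpy as np

def min_max_normalize(x, lo, hi):
    """Scale each feature into [0, 1] using fixed per-feature bounds so that
    inputs with different physical units are comparable for the DDQN."""
    x = np.asarray(x, dtype=float)
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    return (x - lo) / (hi - lo)
```

Using fixed environment bounds (rather than per-batch min/max) keeps the normalization consistent between the training and test phases.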
In this paper, we compare the performance of the QNGPSR, DSEGR, FEQSAI, and GPSR protocols in the training and testing phases. To verify the validity of our proposed routing protocol, we analyze the convergence rate of the different routing protocols in the training phase and take the average end-to-end (E2E) delay, packet delivery ratio (PDR), and the time of the first dead node as routing performance evaluation indicators: the average E2E delay reflects timeliness, the PDR demonstrates the reliability of the routing algorithms, and the time of the first dead node indicates whether nodes' energy levels are balanced in a UANET.

Convergence Analysis
In this section, we mainly analyze the convergence rate of QNGPSR and DSEGR in the training phase, since FEQSAI and GPSR make routing decisions online. One hundred random maps are used to train the QNGPSR and DSEGR protocols for 200 episodes. Each episode corresponds to 10 s of simulation time. Maps are collected from flight trajectories around Paris Charles de Gaulle Airport in November 2017 from https://www.flightradar24.com/ (accessed on 5 October 2021). The QNGPSR protocol fails to converge when nodes move during the training phase, so QNGPSR nodes are trained in the stationary state, whereas DSEGR nodes move in random directions. In order to verify the generalization performance of the algorithm, the maps used in the testing phase differ from those used in the training phase. The convergence result is shown in Figure 4, which presents the learning curves of QNGPSR and DSEGR. As can be seen, the DSEGR protocol basically converges after 5 episodes, whereas QNGPSR needs 35 episodes to converge. Furthermore, the average E2E delay of DSEGR is close to that of QNGPSR after final convergence. Although nodes in QNGPSR are static during the training phase, DSEGR still converges faster, which verifies that our proposed DDQN with prioritized experience replay and the designed reward function are reasonable and efficient.

Effects of the Nodes' Movement Speed
In this section, we compare the different protocols' performances as the nodes' movement speed varies between 1 and 20 m/s in the test phase. There are 100 nodes in the network. The packet send rate is 1 Hz and the initial energy is 1000 J. The simulation results are shown in Figures 5-7. Figure 5 presents the effects of different nodes' movement speeds on the average E2E delay. It is obvious that DSEGR and QNGPSR outperform GPSR and FEQSAI on this metric because QNGPSR and DSEGR can reduce the probability of reaching void areas by two-hop neighbor estimation. The average E2E delay of QNGPSR is basically similar to that of DSEGR. As the speed increases, the delay of QNGPSR and GPSR first increases, because the average hop count increases. However, once the speed reaches a threshold value, link stability decreases dramatically and packet loss increases, so the E2E delay decreases accordingly. The average E2E delay of FEQSAI is the largest since it learns online, which makes its delay very large at the beginning. Although the maximum average E2E delay difference between DSEGR and QNGPSR is 8 ms at a speed of 1 m/s, the PDR of DSEGR is 26% higher than that of QNGPSR at 1 m/s, as presented in Figure 6. Moreover, DSEGR has a slight advantage over QNGPSR at 5 m/s and 10 m/s. Figure 6 shows the effects of different nodes' movement speeds on the packet delivery ratio. As the speed increases, the PDR of DSEGR significantly outperforms that of QNGPSR, GPSR, and FEQSAI. DSEGR considers link stability and predicts the residual energy of neighbor nodes in the reward function, which greatly reduces the packet loss rate. Mobility is the main influencing factor when nodes' speeds are high; thus, unstable links seriously affect the PDR. We can see in the figure that DSEGR can still achieve a higher PDR when the node speed is 20 m/s.
At low speeds, the nodes' residual energy is the main influencing factor; even so, the DSEGR protocol achieves a higher PDR. Figure 7 depicts the effects of the different nodes' movement speeds on the time of the first dead node. As can be seen, the nodes in DSEGR die later than those in QNGPSR, GPSR, and FEQSAI. The reason is that DSEGR has an energy prediction model, which can predict the remaining energy of neighbor nodes to balance energy consumption. Nodes moving at low speeds may cause certain nodes to act as central nodes. As a result, the central nodes die early due to energy exhaustion. With increasing speed, the relative displacement between nodes changes more obviously and the links become unstable. There are then no so-called central nodes; thus, the energy consumption of the nodes becomes more balanced, and the time of the first dead node is postponed.

Effects of the Node Density
In this section, we compare the different protocols' performances with 40, 60, 80, 100, and 150 nodes in the test phase, separately. The movement speed of nodes is 10 m/s. The packet send rate is 1 Hz and the initial energy is 1000 J. The simulation results are shown in Figures 8-10. Figure 8 shows the effects of the different network densities on the average E2E delay. As can be seen, with the increase in node density, the average E2E delays of DSEGR, QNGPSR, and GPSR first increase and then decline. That is because the nodes can find more long paths, which leads to an initial increase in delay, but the routing void area shrinks as the node density increases, which then causes the delay to decrease. However, FEQSAI needs to relearn online when the number of nodes increases, and its average E2E delay increases since the Q-table learning complexity grows. So FEQSAI has the worst delay performance. The average E2E delay of DSEGR is similar to that of QNGPSR at the same node density. However, GPSR switches to perimeter forwarding when it encounters a void area, which causes its average E2E delay to be larger than that of DSEGR and QNGPSR. Figure 9 depicts the relationship between the PDR and different node densities. It can be seen that, as the total number of nodes in the UANET increases, each node has more neighbor nodes. Even if some neighbor nodes' energy is exhausted, other neighbor nodes can assist in forwarding packets. Thus, the packet delivery ratio of each protocol shows an increasing trend. DSEGR significantly outperforms QNGPSR, GPSR, and FEQSAI since DSEGR considers link stability and residual energy prediction. Although FEQSAI considers E2E transmission energy, it is an online learning algorithm and cannot converge in our scenario, so its PDR is the worst. When there are 40 nodes in the network, there are few feasible links.
Thus, few packets are effectively and successfully delivered to the destination, which results in network nodes consuming less energy. So there is no dead node. As the node density increases, DSEGR can predict the residual energy of neighbor nodes, which avoids forwarding packets via a low-energy neighbor node. Thus, the time of the first dead node is postponed, and the energy consumption of nodes is more balanced. Finally, it can reach a stable state. Hence, the time of the first dead node is almost the same for DSEGR, GPSR, and QNGPSR when the node density is high.

Effects of Packet Send Rate
In this section, we compare the different protocols' performances as the packet send rate varies between 1 and 5 Hz in the test phase. The node speed is 10 m/s. There are 100 nodes moving in the network and the initial energy is 1000 J. The simulation results are shown in Figures 11-13. Figure 11 presents the effects of the different packet send rates on the average E2E delay. As shown in the figure, when the packet send rate is 1 Hz, DSEGR has a lower average E2E delay than GPSR and QNGPSR. As the packet send rate increases, the average E2E delay of DSEGR also increases. The reason is that DSEGR adopts an energy prediction model, letting nodes die later. Thus, there exist more feasible long paths, which can be verified in Figures 12 and 13. Figure 12 demonstrates the results of the PDR versus different packet send rates. It is clear that, with the increase of the packet send rate, the PDR of DSEGR, QNGPSR, and GPSR decreases. This is because of the nodes' buffer overflow. DSEGR achieves a higher PDR than the other protocols. The reason is that DSEGR selects the next hop considering link stability and the node's residual energy. Figure 13 examines the effects of different packet send rates on the time of the first dead node. As we can see, when the send rate is 1 Hz, DSEGR has no dead nodes. Afterward, nodes consume more energy and die earlier as the packet send rate grows. However, the time of the first dead node for DSEGR is still better than that of the other protocols. The reason is that DSEGR has an energy prediction model, which can balance the energy.

Effects of Initial Energy
In this section, we compare the different protocols' performances as the initial energy varies between 400 and 1200 J in the test phase. The speed of nodes is 10 m/s and the packet send rate is 1 Hz. There are 100 nodes moving in the network. The simulation results are shown in Figures 14-16. Figure 14 depicts the relationship between the average E2E delay and different initial energies. As we can see from the figure, as the initial energy increases, the average E2E delay of DSEGR is similar to that of QNGPSR. The delay of DSEGR and QNGPSR is better than that of the others since they reduce the routing void area. Figure 15 illustrates the results of the PDR versus different initial energies. With the increase of initial energy, the PDR of all the protocols also increases. This is because the nodes die later. It is clear that DSEGR achieves a higher PDR than the other protocols. The reason is that DSEGR selects the next hop considering neighbors' residual energy. Although FEQSAI considers the E2E transmission energy, it cannot converge because it must retrain on different maps, so its PDR is the lowest. Figure 16 depicts the effects of different initial energies on the time of the first dead node. As the initial energy increases, the nodes die later and later for all tested protocols. As we can see from the figure, the time of the first dead node for DSEGR is better than that of the other protocols for different initial energies. There is not even a dead node for larger initial energies, such as 1000 J and 1200 J, because DSEGR incorporates an energy prediction method.

Conclusions
In this paper, we introduced a novel UAV routing protocol to optimize UAV network performance. Our protocol overcomes the shortcomings of other deep-reinforcement-learning-based protocols. Meanwhile, we designed a reasonable link stability evaluation indicator to adapt to high mobility and utilized the ARIMA energy prediction model to balance energy. Extensive simulation results show that DSEGR significantly outperforms GPSR and QNGPSR in terms of PDR. The average E2E delay of DSEGR is close to that of QNGPSR. The classic GPSR has poor performance metrics. DSEGR achieves a faster convergence speed, and the time of the first dead node is postponed for DSEGR. In summary, our proposed routing algorithm has better robustness and reliability and is more suitable for performing search and rescue tasks in highly dynamic and energy-limited UANETs.
In our work, we only considered a simple scenario. In the future, we will extend the UANET to a more complex integrated ground-air-space scenario. In addition, we will apply multi-agent reinforcement learning in a collaborative manner to obtain better network performance.