Reinforcement-Learning-Based Geographic Routing Considering Future Evolution of Link States for UAV Networks

Xu, Ming; Xia, Yu; Liu, Wei; Huang, Daqing

doi:10.3390/drones10020150

¹ College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

² School of Electrical and Electronic Engineering, Chongqing University of Technology, Chongqing 400054, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(2), 150; https://doi.org/10.3390/drones10020150

Submission received: 12 January 2026 / Revised: 17 February 2026 / Accepted: 19 February 2026 / Published: 21 February 2026

(This article belongs to the Section Drone Communications)

Highlights

What are the main findings?

This paper proposes a reinforcement-learning-based geographic routing protocol that incorporates a new multi-parameter fusion link-state evaluation method and a new routing hole bypass method.
Simulation results show that, compared to existing geographic routing protocols, the proposed one achieves a higher packet reception rate, lower energy consumption, and lower end-to-end latency.

What are the implications of the main findings?

Considering the future evolution of link states helps the geographic routing protocol cope effectively with link fluctuations caused by the high-speed movement of UAVs, making it more suitable for highly dynamic UAV networks.

Abstract

Achieving autonomous and reliable unmanned aerial vehicle (UAV) swarm applications requires a flexible and efficient communication network structure. Unfortunately, the high-speed movement of UAVs leads to drastic changes in wireless links and topology structures, posing significant challenges to reliable data transmissions. Geographic routing protocols exhibit better adaptability to highly dynamic network topologies and have garnered extensive attention in UAV networks. However, existing works did not effectively address the impact of factors such as link-state fluctuations and routing holes on the performance of these protocols. To this end, by considering the future evolution of link states, this paper proposes a reinforcement-learning-based geographic routing protocol (Evo-QGeo) and introduces a new routing hole bypass method. Thanks to the evaluation of the future evolution of link states and the multihop optimization capability of reinforcement learning, the end-to-end packet reception rate of Evo-QGeo is improved by up to 11.81~44.61% compared to existing protocols. Meanwhile, the energy consumption is reduced by up to 36.94~74.47%, the latency is reduced by up to 21.63~38.68%, and the end-to-end expected transmission count is reduced by up to 19.60~26.10%. This makes Evo-QGeo more suitable for highly dynamic UAV networks.

Keywords:

unmanned aerial vehicle (UAV); swarm; flying ad hoc network (FANET); geographic routing; link state; routing hole; reinforcement learning

1. Introduction

With the gradual maturity of unmanned aerial vehicle (UAV) technology and the continuous reduction in their costs, large-scale UAV swarms have become an important direction for the future development of UAVs, with broad application prospects in military surveillance, emergency rescue, transportation logistics and other fields, as shown in Figure 1 [1,2]. In order to achieve the intelligence and autonomy of UAV swarms, a flexible, efficient, and reliable network structure is necessary to achieve smooth communication and collaborative operation between UAVs. This has positioned Flying Ad Hoc Networks (FANETs) at the forefront of current academic studies [3]. Similar to the case of MANETs, the success of FANET deployments relies heavily on designing routing protocols that are efficient, reliable, and stable [4].

Due to the higher speed and flexibility of UAV nodes, the changes in link and topology in FANETs are more severe, which poses significant challenges for reliable data transmissions [5]. As traditional table-driven routing protocols depend on the stability of the topology, their performance degrades severely when used in UAV networks [6,7,8]. In contrast, geographic routing protocols utilize node locations for forwarding decisions, without requiring global topology information, thus better shielding topology changes. This renders geographic routing protocols particularly effective for handling the fast topology changes inherent to aerial networks [9].

In order to improve the performance of geographic routing in UAV networks, a large amount of work has been done in recent years. On the one hand, by evaluating the link stability, routing decisions can obtain more reliable local topology information. On the other hand, through techniques such as reinforcement learning, routing decisions can self-optimize and avoid falling into local optima. Although these improvements have effectively improved the performance of geographic routing in UAV networks, they did not effectively address the impact of factors such as link-state fluctuations and routing holes [10,11,12,13,14,15,16,17,18,19,20,21,22,23].

To solve these problems, this paper proposes a novel geographic routing protocol that utilizes reinforcement learning to achieve intelligent decision-making and self-optimization of routing selection. In order to achieve optimal single-hop relay selection for nodes, a new multi-parameter fusion link-state evaluation method is designed, which considers the future evolution of link states, effectively improving the stability and reliability of link-state evaluation. At the same time, a new routing hole bypass method is designed to improve the robustness of networks. Simulation results demonstrate that, compared to existing geographic routing protocols, the proposed Evo-QGeo achieves up to 11.81~44.61% higher end-to-end packet reception rate, up to 36.94~74.47% lower energy consumption, up to 21.63~38.68% lower latency and up to 19.60~26.10% lower end-to-end expected transmission count. This is due to the evaluation of the future evolution of link states and the multihop optimization capability of reinforcement learning.

The remainder of this paper is organized as follows: Section 2 discusses the related works. Section 3 presents the system model. Section 4 provides a detailed description of the proposed reinforcement-learning-based geographic routing protocol. Section 5 shows the simulation results compared with existing protocols. Section 6 presents the conclusion and points out future research directions.

2. Related Works

In the past decade, a large number of routing protocols for ad hoc networks have emerged, such as Optimized Link State Routing (OLSR) protocol [24], Ad hoc On-demand Distance Vector (AODV) protocol [25], Greedy Perimeter Stateless Routing (GPSR) protocol [26], etc. [3]. A multitude of studies suggest that topology-driven approaches struggle within volatile FANET scenarios. As these protocols rely on pre-established paths, their substantial routing overhead and delayed convergence make them ill-equipped to handle the high-speed mobility of UAV nodes. For instance, Ferronato et al. used the NS2 simulation tool to analyze the performance of OLSR, AODV, and Zone Routing Protocol (ZRP) in UAV networks [6]. Khanh et al. considered the requirements of modern cellular UAV communications and evaluated the performance of AODV, Dynamic Source Routing (DSR), and OLSR using the NS2 simulation tool [7]. Tan et al. performed performance comparisons on AODV, DSR and OLSR using the OPNET simulation tool [8]. The results of these simulation evaluations all demonstrate that the performance of traditional topology-based routing protocols degrades significantly as the speed of UAV nodes increases.

Some studies attempt to improve traditional topology-based routing protocols to address issues such as link instability and drastic topology changes. To refine relay decisions, these improvements prioritize stability-centric factors, specifically evaluating parameters like link consistency and the estimated duration of connectivity. For example, Rosati et al. proposed an enhanced OLSR protocol, which uses the relative velocity between neighbors to optimize the stability of the established routes [10]. Gangopadhyay et al. proposed an improved OLSR protocol for efficient multipoint relay selection, which uses the position of nodes to predict the link duration [11]. Wang et al. defined the stability and efficiency of a wireless link and used them as a prior knowledge to improve the robustness and agility of route establishment [12]. However, these protocols have limited performance improvements when the node speed is high. The main reason is that when network nodes move faster, topology changes become more dramatic, while traditional ad hoc routing protocols are not originally designed to accommodate such drastic topology changes.

Although the above improvements have been proposed for FANETs, their inherent shortcomings in the face of highly dynamic networks have not been fundamentally changed. In contrast, geographic routing protocols utilize node locations for forwarding decisions, which can better shield topology changes and are therefore more suitable for highly dynamic FANETs [9]. Therefore, some studies have applied geographic routing to UAV networks and made enhancements based on their characteristics. The most direct method is again to evaluate the stability of links or routes to improve the reliability of routing decisions. For example, Cui et al. proposed an improved geographical routing protocol. They designed a best-link forwarding strategy to select the next-hop relay from a region specified by the effective transmission range. Furthermore, a routing recovery mechanism is proposed to avoid local optima. Simulation results demonstrate that it performs better than traditional geographical routing algorithms in terms of packet delivery ratio (PDR) and goodput [13].

Singh et al. proposed a dynamic geographic routing protocol comprising two mechanisms. The first mechanism provides a forwarding decision for next-hop selection which utilizes multiple metrics related to geographic progress, link stability, network connectivity, and queue size. The second mechanism adaptively corrects the weights of relay metrics in response to the changing network dynamics of UAV networks. The simulation results demonstrate that the proposed protocol is superior to existing routing protocols, considering multiple Quality of Service (QoS) metrics under several different network scenarios [14]. Cui et al. proposed an efficient geographic routing scheme that utilizes computer vision technology to sense and maintain the information of neighbors. The relay selection process incorporates the link quality, neighbor stability, and the distance to destination. Real-world experiments were conducted to evaluate the proposed scheme, which confirmed its effectiveness. Meanwhile, simulation results show that the proposed scheme reduces the overhead and latency while not compromising the PDR [15].

Considering the rapid changes in UAV links, some studies propose to improve the effectiveness and sustainability of link-stability assessment by predicting future positions of UAVs. For example, Jiang et al. addressed UAV routing challenges by proposing a predictive mobility-based virtual algorithm, achieving superior results in both link stability and routing longevity. The Gaussian distribution was used to model UAV movements. Moreover, an optimization model was designed to select the optimal relay with the best performance between UAVs. Numerical experiments indicate that the proposed algorithm can improve the routing lifetime, the end-to-end delay, and the PDR compared to other traditional algorithms [16]. Based on practical link characteristics and making use of positioning devices onboard, Asadpour et al. proposed a mobility-driven routing protocol. Given the current position, velocity, and angle of UAVs, future positions could be predicted and employed to refine relay selection decisions. Both field experiment and simulation demonstrate that by predicting future positions of UAVs, the proposed protocol deals with intermittent network connectivity well and the routing performance can then be enhanced [17].

Zhou et al. proposed an improved geographic routing protocol based on position prediction for UAV networks. The greedy forwarding strategy adopted by the proposed protocol considered several different factors such as distance, velocity, movement direction and the number of neighboring nodes. If the routing hole appears, the failed node selects a node with better decision values from its two-hop neighbors. Evaluations reveal substantial improvements in both PDR and delay compared to standard protocols like AODV and GPSR [18]. Additionally, Li et al. addressed stability issues by proposing a link prediction-based adaptive protocol. The proposed protocol utilizes a time-series-based mobility model to characterize the link stability and selects the relays that are expected to last for a longer duration. Numerical experiments show that it achieves better performance under several different conditions, with a 20.58% higher PDR, a 15.72% lower delay, and a 37.22% lower control overhead [19].

Although the above improvements indeed enhance the performance of geographic routing in UAV networks, the inherent flaws of these protocols, such as non-global optima and routing hole, have not been resolved. In recent years, reinforcement-learning algorithms have been used in geographic routing to enable the self-optimization of routing decisions. For example, to address the volatility of UAV networks, Jung et al. introduced QGeo, a position-based protocol that utilizes Q-learning to optimize packet delivery. Numerical experiments show that QGeo has a higher PDR and a lower control overhead than the original GPSR protocol [20]. Wei et al. proposed an adaptive geographic routing protocol that also uses Q-learning. The parameters of reinforcement learning and the interval of beacon messages are adjusted by sensing the local topology changes. Meanwhile, a new routing-hole-avoidance mechanism is designed, which broadcasts the routing-hole information and selects the data forwarding route based on the node degree of neighbors and the distance to destination. Numerical experiments demonstrate that the proposed protocol improves throughput by 6% and reduces delay by 20% compared to QGeo and GPSR [21].

Huang et al. proposed a geographic routing protocol for UAV networks, in which Q-learning and fuzzy logic are employed. The proposed protocol adopts an efficient Q-value update mechanism based on HELLO and ACK messages. In order to mitigate the blindness of random exploration, a fuzzy-logic-based mechanism was designed to incorporate multiple metrics such as Q-value, link quality and access delay. Numerical experiments show that the proposed protocol can make efficient routing decisions within dynamic FANETs, and outperforms existing geographic routing protocols regarding PDR, transmission delay, and control overhead [22]. Wu et al. proposed a trajectory-informed routing protocol enhanced by Q-learning for mission-oriented FANETs. In order to dynamically optimize the next-hop selection, the mission-planned flight trajectories are integrated into a reinforcement learning framework, which are used to evaluate the link stability, queue length, and node mobility patterns. Extensive simulations show that the proposed protocol outperforms the original GPSR, achieving up to 23% higher PDR, over 80% reduction in transmission delay, and up to 37% and 52% improvements in throughput and efficiency, respectively [23].

Although the above improvements have enhanced the global optimization capability of geographic routing, they did not consider how to effectively evaluate link states. Evaluating link states more effectively in highly dynamic environments is a key factor determining routing performance. Existing methods use indicators such as relative position, relative velocity, statistical link quality, and expected link duration to make decisions, which can improve the accuracy of link-state assessment to a certain extent. However, some important factors have still not been considered, mainly including: (1) it is difficult to accurately detect and respond to link changes with a single link parameter; (2) existing studies often overlook the impact of the future evolution of link states and are thus unable to make optimal link-state assessments.

3. System Model

This work considers a FANET consisting of n UAV nodes in a three-dimensional (3D) task area, where U = {u₁, u₂, …, u_n} is the set of UAV nodes. Let G = (U, E) represent the network graph, where E is the set of wireless links between nodes. Each node u_i can obtain its accurate geographic coordinates P_i = (x_i, y_i, z_i) from the onboard satellite positioning system. Let d_i,j denote the distance between nodes u_i and u_j, and R denote the effective communication range between UAV nodes. Nodes periodically broadcast Beacon messages to share information with neighbors, including the location of the current node, its flight speed, link quality, and set of neighbors.

3.1. Mobility Model

In actual deployments, the trajectory of nodes is affected by their maneuverability characteristics. Considering that the direction and magnitude of the velocity of each node in FANETs may vary, this paper used the Random WayPoint (RWP) mobility model. Each node was initialized with speeds v_x, v_y and v_z along the three coordinate axes. Random speed increments within the specified range were added to v_x, v_y and v_z independently. Therefore, the coordinates of UAV node i are defined as follows:

\left\{ \begin{matrix} x_{i}(t + T) = x_{i}(t) + (v_{x} + Δv_{x}(t)) \cdot T \\ y_{i}(t + T) = y_{i}(t) + (v_{y} + Δv_{y}(t)) \cdot T \\ z_{i}(t + T) = z_{i}(t) + (v_{z} + Δv_{z}(t)) \cdot T \end{matrix} \right.

(1)

where (x_i(t), y_i(t), z_i(t)) represents the coordinates of node i at time t, T is the update interval, ∆v_x(t), ∆v_y(t) and ∆v_z(t) are the random speed increments at time t, respectively.
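
As an illustration of Equation (1), the following is a minimal Python sketch of one position update. The function name `step_position`, the parameter `max_dv`, and the choice of a uniform distribution for the speed increments are assumptions made for this example; the text only states that the increments are random and lie within a specified range. Boundary handling of the task area is not modeled.

```python
import random

def step_position(pos, base_vel, max_dv, T, rng=random):
    """Advance one node by one update interval T, following Equation (1).

    pos      -- current coordinates (x, y, z)
    base_vel -- initial speeds (v_x, v_y, v_z)
    max_dv   -- assumed bound: each increment is drawn uniformly from [-max_dv, max_dv]
    """
    new_pos = []
    for coord, v in zip(pos, base_vel):
        dv = rng.uniform(-max_dv, max_dv)   # independent increment per axis
        new_pos.append(coord + (v + dv) * T)
    return tuple(new_pos)

if __name__ == "__main__":
    rng = random.Random(0)                  # seeded only to make runs repeatable
    p = (0.0, 0.0, 100.0)
    for _ in range(3):
        p = step_position(p, (10.0, 5.0, 0.0), max_dv=2.0, T=1.0, rng=rng)
        print(p)
```

With `max_dv = 0` the update reduces to straight-line motion at the initial velocity, which is a convenient sanity check.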

3.2. Channel Model

This paper mainly studies the communication between UAVs, therefore only considering the air-to-air channel [27,28], for which the path loss is usually modeled using the log-distance model as follows:

PL(d) = PL(d_{0}) + 10 η \log_{10} \left(\frac{d}{d_{0}}\right) + N(0, σ)

(2)

where d₀ is the reference distance, PL(d₀) is the free-space path loss in dB at d₀, η is the path loss exponent characterizing the attenuation of signals, and N(0, σ) is a Gaussian random variable with mean 0 and standard deviation σ. The small-scale fading adopts the Rayleigh model as follows:

p (L_{R}) = \frac{L_{R}}{σ_{R}^{2}} e^{(- \frac{{(L_{R})}^{2}}{2 σ_{R}^{2}})}

(3)

where σ_R is the scale parameter of the Rayleigh fading in dB. Therefore, the received signal power can be expressed as:

P_{r}(d)|_{dBm} = P_{t}|_{dBm} - PL(d)|_{dB} - p(d)|_{dB}

(4)

where P_t is the transmit power in dBm.
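
To make the link budget of Equations (2) and (4) concrete, a minimal Python sketch is given below. All numeric values in the usage example are hypothetical, and the small-scale fading term is passed in as a plain loss value in dB (`fading_loss_db`) rather than sampled, because the text does not specify how the Rayleigh term is converted to dB; that simplification is an assumption of this sketch.

```python
import math
import random

def path_loss_db(d, d0, pl_d0_db, eta, sigma_db=0.0, rng=random):
    """Log-distance path loss of Equation (2), in dB."""
    # Shadowing term N(0, sigma); skipped when sigma_db is zero.
    shadowing = rng.gauss(0.0, sigma_db) if sigma_db > 0 else 0.0
    return pl_d0_db + 10.0 * eta * math.log10(d / d0) + shadowing

def received_power_dbm(pt_dbm, d, d0, pl_d0_db, eta,
                       fading_loss_db=0.0, sigma_db=0.0, rng=random):
    """Received power of Equation (4), in dBm.

    fading_loss_db stands in for the small-scale fading term (assumption).
    """
    return pt_dbm - path_loss_db(d, d0, pl_d0_db, eta, sigma_db, rng) - fading_loss_db

if __name__ == "__main__":
    # Hypothetical values: 20 dBm transmit power, 40 dB loss at 1 m, exponent 2.
    print(received_power_dbm(20.0, d=350.0, d0=1.0, pl_d0_db=40.0, eta=2.0))
```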

3.3. Network Performance Metrics

The end-to-end packet delivery rate (PDR), network energy consumption, end-to-end delay (E2ED) and end-to-end expected transmission count (ETX_E2E) were used to evaluate the network performance. PDR is defined as follows:

PDR = \frac{\sum_{j = 1}^{K_{r}} RP_{j}}{\sum_{i = 1}^{K_{t}} TP_{i}}

(5)

where i and j denote the source node and the destination node, respectively, K_t is the number of source nodes, and K_r is the number of destination nodes. RP_j is the number of data packets received by node j and TP_i is the number of data packets sent by node i.

The network energy consumption can be computed as follows:

E = \sum_{i = 1}^{n} E_{bt} \times num_{i} + \sum_{i = 1}^{n} \sum_{j = 1}^{n_{j}} E_{br} \times numr_{i, j} + \sum_{i = 1}^{n_{t}} E_{pt} \times TP_{i} + \sum_{j = 1}^{n_{r}} E_{pr} \times RP_{j}

(6)

where E_bt denotes the energy consumption of broadcasting a beacon, num_i is the number of beacons broadcasted from node i in T, E_br denotes the energy consumption of receiving a beacon, numr_i,j is the number of beacons received by node j from node i in T, E_pt denotes the energy consumption of transmitting a data packet, n_t is the number of transmitting nodes, TP_i is the number of data packets transmitted by node i, E_pr denotes the energy consumption of receiving a data packet, n_r is the number of receiving nodes, RP_j is the number of data packets received by node j.
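
Equation (6) is a sum of four per-event energy terms, which the following Python sketch evaluates. The default per-event energies are the values used in the simulation setup of this paper (E_bt = 2.1 μJ, E_br = 0.26 μJ, E_pt = 0.46 μJ, E_pr = 0.12 μJ); the function name and the flattened list arguments are assumptions made for illustration.

```python
def network_energy(beacons_sent, beacons_received, packets_sent, packets_received,
                   e_bt=2.1e-6, e_br=0.26e-6, e_pt=0.46e-6, e_pr=0.12e-6):
    """Total network energy in joules, following Equation (6).

    beacons_sent     -- num_i for every node i
    beacons_received -- numr_{i,j} for every (sender i, receiver j) pair, flattened
    packets_sent     -- TP_i for every transmitting node
    packets_received -- RP_j for every receiving node
    """
    return (e_bt * sum(beacons_sent)
            + e_br * sum(beacons_received)
            + e_pt * sum(packets_sent)
            + e_pr * sum(packets_received))

if __name__ == "__main__":
    # Hypothetical counts for a three-node network.
    print(network_energy([10, 10, 10], [9, 8, 10, 7], [100], [93]))
```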

E2ED is defined as follows:

D = \frac{1}{N} \sum_{i = 1}^{N} (t_{r}^{i} - t_{s}^{i})

(7)

where t_r^i denotes the receiving timestamp of the ith successfully received packet, t_s^i denotes the sending timestamp of the ith successfully received packet, and N is the number of successfully received packets in the network.

ETX_E2E is defined as follows:

ETX_{E2E} = \sum_{i = 1}^{n} \frac{1}{p_{fi} \times p_{ri}}

(8)

where p_fi denotes the probability of successful data packet transmission over the ith hop of the route, p_ri represents the probability of successful transmission of the corresponding acknowledgment (ACK) packet, and n stands for the end-to-end hop count.
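
The PDR, E2ED and ETX_E2E definitions in Equations (5), (7) and (8) can be sketched in Python as follows. The function names and input layouts are assumptions made for this example, not part of the protocol.

```python
def pdr(received_per_dest, sent_per_source):
    """Equation (5): total packets received over total packets sent."""
    return sum(received_per_dest) / sum(sent_per_source)

def mean_e2e_delay(send_times, recv_times):
    """Equation (7): mean of (receive time - send time) over delivered packets."""
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return sum(delays) / len(delays)

def etx_e2e(hops):
    """Equation (8): hops is a sequence of (p_f, p_r) pairs, one per hop."""
    return sum(1.0 / (p_f * p_r) for p_f, p_r in hops)

if __name__ == "__main__":
    print(pdr([45, 30], [50, 50]))
    print(mean_e2e_delay([0.0, 1.0], [0.5, 2.5]))
    print(etx_e2e([(0.8, 0.9), (0.9, 0.95), (0.7, 0.9)]))
```

A hop with a perfect data link and a perfect ACK link contributes exactly 1 to ETX_E2E, so the metric is bounded below by the hop count.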

3.4. Problem Formulation

The goal of this work is to maximize the PDR and minimize the energy consumption, ETX_E2E and E2ED. Therefore, the optimization function was defined as follows:

\min \sum_{i = 1}^{n} (ω_{d} D_{i} + ω_{e} E_{i} + ω_{t} ET_{i} - ω_{p} PDR_{i})

(9)

where D_i, E_i, ET_i and PDR_i denote the E2ED, energy consumption, ETX_E2E and PDR of node i, respectively, and ω_d, ω_e, ω_t and ω_p represent the corresponding weights.

4. Description of the Proposed Protocol

To solve the optimization problem above, this paper proposes a novel geographic routing protocol (denoted as Evo-QGeo) that utilizes reinforcement learning to achieve intelligent decision-making and self-optimization of routing selection. In order to achieve optimal single-hop relay selection for nodes, a new multi-parameter fusion-based link-state evaluation method is designed, which considers the future evolution of link states to improve the stability and reliability of link-state evaluation. At the same time, a new routing hole bypass method is designed to improve the robustness of the network. This section provides detailed descriptions of each component algorithm.

4.1. Multi-Parameter Fusion-Based Link-State Evaluation

A new multi-parameter fusion-based link-state evaluation method is designed to achieve optimal single-hop relay selection. It fuses three main parameters that affect the link performance: the degree of progress towards the destination, the link quality, and the expected link duration. Meanwhile, it considers the future evolution of link states to further improve the stability and reliability of link-state evaluation.

The closer the relay node is to the destination, the more advantageous it is in terms of hop count and latency. The forward distance shown in Figure 2 was used to measure the degree of progress of the relay node towards the destination. Find the farther intersection point of the communication range boundary of relay node j with the line connecting the sending node and the destination node, and define the distance between this point and the sending node as the forward distance of relay node j. Assume that the coordinates of the sending node i, the relay node j, and the destination node r are (x_i, y_i, z_i), (x_j, y_j, z_j) and (x_r, y_r, z_r), respectively. Let \vec{ij} and \vec{ir} denote the distance vectors from the sending node to the relay node and to the destination node, respectively. The projection of \vec{ij} on \vec{ir} can be obtained as:

d_{1} = \frac{\vec{i j} \cdot \vec{i r}}{|\vec{i r}|}

(10)

The distance from relay node j to the line connecting the sending node and the destination node is:

d_{0} = \sqrt{{(x_{i} - x_{j})}^{2} + {(y_{i} - y_{j})}^{2} + {(z_{i} - z_{j})}^{2} - {d_{1}}^{2}}

(11)

Therefore, the forward distance P_i,j of the relay node j is:

P_{i, j} = d_{1} + \sqrt{R^{2} - {d_{0}}^{2}}

(12)
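
Equations (10) to (12) can be evaluated directly from the three node positions. The Python sketch below follows them term by term; the function name and the clamping of negative values under the square roots (to absorb rounding error, or a relay farther than R from the line) are assumptions added for robustness.

```python
import math

def forward_distance(sender, relay, dest, R):
    """Forward distance P_{i,j} of a relay, following Equations (10)-(12)."""
    ij = [b - a for a, b in zip(sender, relay)]   # vector from sender i to relay j
    ir = [b - a for a, b in zip(sender, dest)]    # vector from sender i to destination r
    norm_ir = math.sqrt(sum(c * c for c in ir))
    d1 = sum(a * b for a, b in zip(ij, ir)) / norm_ir        # Equation (10)
    d0_sq = max(sum(c * c for c in ij) - d1 * d1, 0.0)       # square of Equation (11)
    return d1 + math.sqrt(max(R * R - d0_sq, 0.0))           # Equation (12)

if __name__ == "__main__":
    # Relay located on the sender-destination line, 100 m ahead of the sender.
    print(forward_distance((0, 0, 0), (100, 0, 0), (1000, 0, 0), R=350.0))
```

A relay lying exactly on the line from sender to destination gets the largest possible bonus of R on top of its projection d₁, while a relay far off the line is credited with less.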

It is necessary to minimize the frequency of route switching caused by link changes as much as possible. Therefore, knowing the expected link duration between nodes can enable the selection of more stable relays. The expected link duration between node i and j can be computed as follows: assuming that the coordinates of node i and j are (x_i, y_i, z_i) and (x_j, y_j, z_j), and their velocities are \vec{V_{i}} and \vec{V_{j}}, respectively, as shown in Figure 3, the relative velocity between them can be calculated as follows:

\vec{V_{i j}} = \vec{V_{i}} - \vec{V_{j}}

(13)

Meanwhile, the relative distance l_i that node i travels before flying out of the communication range of node j can be calculated as follows:

l_{i} = l_{i j} \times \cos θ + \sqrt{R^{2} - {(l_{i j} \times \sin θ)}^{2}}

(14)

where l_ij denotes the distance between nodes i and j, and θ is the angle between the distance vector from node i to j and the relative velocity \vec{V_{ij}}. Therefore, the expected link duration between node i and j is:

t_{i j} = \frac{l_{i}}{|\vec{V_{i j}}|}

(15)
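
The computation in Equations (13) to (15) can be sketched in Python without evaluating θ explicitly, since l_ij cos θ is the projection of the distance vector onto the relative velocity. The function name and the convention of returning infinity when the two nodes have no relative motion are assumptions of this sketch.

```python
import math

def expected_link_duration(pos_i, pos_j, vel_i, vel_j, R):
    """Expected link duration t_ij, following Equations (13)-(15)."""
    d = [b - a for a, b in zip(pos_i, pos_j)]     # distance vector from i to j
    v = [a - b for a, b in zip(vel_i, vel_j)]     # Equation (13): V_ij = V_i - V_j
    l_ij = math.sqrt(sum(c * c for c in d))
    speed = math.sqrt(sum(c * c for c in v))
    if speed == 0.0:
        return math.inf                           # assumed convention: link never breaks
    l_cos = sum(a * b for a, b in zip(d, v)) / speed          # l_ij * cos(theta)
    l_sin_sq = max(l_ij * l_ij - l_cos * l_cos, 0.0)          # (l_ij * sin(theta))^2
    l_i = l_cos + math.sqrt(max(R * R - l_sin_sq, 0.0))       # Equation (14)
    return l_i / speed                                        # Equation (15)

if __name__ == "__main__":
    # Node i approaches node j head-on at 10 m/s from 100 m away, R = 350 m.
    print(expected_link_duration((0, 0, 0), (100, 0, 0), (10, 0, 0), (0, 0, 0), 350.0))
```

In the usage example node i must cover the 100 m to node j and a further 350 m beyond it before the link breaks, so the link is expected to last 45 s; if node i instead flies directly away at the same speed, only 250 m remain and the duration drops to 25 s.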

Considering the high dynamics of FANETs, the statistics of Packet Reception Rate (PRR) in the current time window may not accurately reflect the link quality. In view of this, a channel fluctuation adaptive link quality estimator was designed. It dynamically adjusts the smoothing factor based on the degree of fluctuation in link quality to achieve adaptive estimation of dynamic link state, shown as follows:

PE_{i, j}(t) = α(t) \times PE_{i, j}(t - T) + (1 - α(t)) \times PRR_{i, j}(t)

(16)

where PE_i,j(t) and PE_i,j(t − T) are the estimated link quality at time t and t − T respectively, and PRR_i,j(t) is the PRR measured at time t. The smoothing factor α(t) is adjusted adaptively according to the change in the measured PRR between consecutive windows as follows:

α(t) = \left\{ \begin{matrix} 0.6, & |PRR(t) - PRR(t - T)| < 0.2 \\ 0.4, & 0.2 \leq |PRR(t) - PRR(t - T)| \leq 0.5 \\ 0.1, & |PRR(t) - PRR(t - T)| > 0.5 \end{matrix} \right.

(17)
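
The estimator of Equations (16) and (17) is an exponentially weighted moving average whose memory shrinks when the measured PRR jumps. A minimal Python sketch, with function names chosen for this example, is:

```python
def smoothing_factor(prr_now, prr_prev):
    """Equation (17): pick alpha from the change in measured PRR."""
    diff = abs(prr_now - prr_prev)
    if diff < 0.2:
        return 0.6      # stable link: rely mostly on history
    if diff <= 0.5:
        return 0.4
    return 0.1          # sharp change: rely mostly on the new measurement

def update_link_quality(pe_prev, prr_now, prr_prev):
    """Equation (16): adaptive exponentially weighted estimate PE(t)."""
    alpha = smoothing_factor(prr_now, prr_prev)
    return alpha * pe_prev + (1.0 - alpha) * prr_now

if __name__ == "__main__":
    pe, prr_prev = 0.8, 0.9
    for prr in (0.85, 0.6, 0.1):          # hypothetical PRR measurements
        pe = update_link_quality(pe, prr, prr_prev)
        prr_prev = prr
        print(round(pe, 4))
```

When the PRR collapses between two windows, α drops to 0.1 and the estimate follows the new measurement almost immediately, which is the intended reaction to abrupt link changes.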

The above three parameters are fused to obtain a statistical description of link state. Define the link state metric LS_ij between nodes i and j as follows:

LS_{i, j} = α_{i, j} PE_{i, j} + β_{i, j} P_{i, j} + λ_{i, j} R_{i, j}

(18)

where PE_i,j, P_i,j and R_i,j are the link quality, the normalized degree of progress, and the normalized expected link duration between nodes i and j. α_i,j, β_i,j and λ_i,j are the corresponding weights for them. The larger the value of LS_ij, the higher the probability of neighbors being selected as relays. The weights α_i,j, β_i,j and λ_i,j are determined by analyzing the trend of changes in corresponding link parameters. If the corresponding parameter becomes better, the corresponding weight will be increased. If the corresponding parameter deteriorates, the corresponding weight will be reduced. For example, if the expected link duration increases, then its weight λ_i,j will be increased.
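
The fusion in Equation (18) and the trend-based weight adjustment described above might be sketched in Python as follows. The text does not specify how much a weight changes or whether the weights are renormalized, so the fixed `step`, the lower bound `floor`, and the renormalization to a unit sum are all assumptions of this sketch rather than the method used by Evo-QGeo.

```python
def adapt_weights(weights, current, previous, step=0.05, floor=0.05):
    """Raise the weight of each improving parameter and lower that of each
    deteriorating one, then renormalize so the weights sum to 1.

    weights, current, previous are ordered as (PE, P, R): link quality,
    normalized progress, normalized expected link duration.
    step and floor are hypothetical tuning constants.
    """
    raw = []
    for w, cur, prev in zip(weights, current, previous):
        if cur > prev:
            w += step
        elif cur < prev:
            w -= step
        raw.append(max(w, floor))
    total = sum(raw)
    return tuple(w / total for w in raw)

def link_state(params, weights):
    """Equation (18): weighted sum of (PE, P, R)."""
    return sum(w * p for w, p in zip(weights, params))

if __name__ == "__main__":
    w = adapt_weights((0.4, 0.3, 0.3), (0.9, 0.5, 0.7), (0.9, 0.6, 0.5))
    print(w, link_state((0.9, 0.5, 0.7), w))
```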

4.2. Reinforcement-Learning-Based Geographic Routing

Reinforcement-learning-based geographic routing can achieve intelligent decision-making and self-optimization of routing selection, which makes it more suitable for FANETs with rapidly changing topology. The set of actions, each corresponding to forwarding a packet from a node to one of its neighbors, is defined as A, and each possible action a is associated with a Q value. Each node maintains a table to store the Q values of its neighbors. Each node can update its own Q-value table and make corresponding decisions based on the Q-value tables broadcast by its neighbors. Taking node i sending data to node j as an example, node i updates the Q value corresponding to node j as follows:

Q_{i, j}(s, a) \leftarrow (1 - α) \times Q_{i, j}(s, a) + α \times (r + γ \max_{a^{'}} Q_{i, j}(s^{'}, a^{'}))

(19)

where Q_i,j(s, a) is the Q value of taking action a in state s, α is the learning rate, and γ is the discount factor, both with values ranging from 0 to 1. s′ is the new state after executing action a, and a′ is the action under the new state s′. The choice of action adopts the ε-greedy strategy, where ε is the probability of selecting the optimal action, with a value between 0 and 1. The initial Q value of a node can be set to the LS value at the initial time. When node j is the destination node, a large positive value should be set for the reward function. When node j is before a routing hole, a large negative constant should be set for the reward function. Therefore, the reward function is defined as follows:

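r = \left\{ \begin{matrix} 10, & \text{node } j \text{ is the destination node} \\ - 10, & \text{node } j \text{ is before a routing hole} \\ LS_{i, j}(t), & \text{otherwise} \end{matrix} \right.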

(20)

where LS_i,j(t) is the link state between node i and node j at time t.
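
A minimal Python sketch of the reward in Equation (20), a standard Q-learning update corresponding to Equation (19), and the relay choice is given below. The learning rate and discount factor defaults are hypothetical, the data layout (a dictionary of neighbor Q values) is an assumption, and the selection rule follows the text in treating ε as the probability of choosing the neighbor with the highest Q value.

```python
import random

REWARD_DEST = 10.0    # next hop is the destination, Equation (20)
REWARD_HOLE = -10.0   # next hop is before a routing hole, Equation (20)

def reward(is_destination, is_before_hole, ls_ij):
    """Equation (20): reward for forwarding from node i to neighbor j."""
    if is_destination:
        return REWARD_DEST
    if is_before_hole:
        return REWARD_HOLE
    return ls_ij

def q_update(q_sa, r, next_q_values, alpha=0.5, gamma=0.8):
    """Q-learning update: blend the old value with r + gamma * max next Q.

    alpha and gamma are hypothetical defaults; next_q_values holds the
    Q values available in the next state (empty if there are none).
    """
    best_next = max(next_q_values) if next_q_values else 0.0
    return (1.0 - alpha) * q_sa + alpha * (r + gamma * best_next)

def choose_relay(q_table, epsilon, rng=random):
    """Pick a neighbor: greedy with probability epsilon, random otherwise."""
    if rng.random() < epsilon:
        return max(q_table, key=q_table.get)
    return rng.choice(list(q_table))

if __name__ == "__main__":
    q = {"u2": 0.4, "u3": 0.7}            # hypothetical neighbor Q values
    relay = choose_relay(q, epsilon=0.9)
    q[relay] = q_update(q[relay], reward(False, False, 0.6), [0.5, 0.9])
    print(relay, q[relay])
```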

4.3. The Routing-Hole Bypass Method

As the beacon can contain information about the set of neighbors, it is possible to construct a local topology within two hops. In this way, the nodes before a routing hole can evaluate the link state within two hops to find a bypass route, as shown in Figure 4. By making decisions on a weighted fusion of the two link states along a two-hop path, relay selection can be optimized within the two-hop range in the absence of multi-hop topology information. The method prioritizes nodes within subregions I and III as relay nodes, and only requests nodes within subregion II for transmission when there are no suitable nodes within I and III. By adopting the above approach, it is possible to avoid maintaining multihop topology information while solving the routing hole problem.
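
The weighted two-hop evaluation can be illustrated with the Python sketch below. It only covers the fusion of the first-hop and second-hop link states; the subregion priorities depend on the geometry in Figure 4 and are not modeled. The weights `w1` and `w2`, the data layout, and the function name are assumptions of this example.

```python
def two_hop_bypass(ls, neighbors, node, w1=0.5, w2=0.5):
    """Choose the first-hop relay with the best weighted two-hop link state.

    ls        -- dict mapping (a, b) to the link state LS_{a,b}
    neighbors -- dict mapping a node to the set of its one-hop neighbors,
                 as learned from beacons
    w1, w2    -- hypothetical weights of the first and second hop
    Returns None when no neighbor offers a usable second hop.
    """
    best, best_score = None, float("-inf")
    for k in neighbors.get(node, ()):
        if (node, k) not in ls:
            continue
        # Link states of second hops that do not lead straight back to node.
        second = [ls[(k, m)] for m in neighbors.get(k, ())
                  if m != node and (k, m) in ls]
        if not second:
            continue
        score = w1 * ls[(node, k)] + w2 * max(second)
        if score > best_score:
            best, best_score = k, score
    return best

if __name__ == "__main__":
    nbrs = {"s": {"a", "b"}, "a": {"s", "c"}, "b": {"s", "d"}}
    states = {("s", "a"): 0.9, ("s", "b"): 0.6, ("a", "c"): 0.2, ("b", "d"): 0.9}
    print(two_hop_bypass(states, nbrs, "s"))
```

In the usage example the neighbor with the best first-hop link leads to a poor second hop, so the weighted score favors the other neighbor, which a purely single-hop decision would not select.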

5. Simulation Setup

5.1. Protocols for Comparison

The original GPSR protocol [26] and an improved geographic routing protocol, QGeo [20], were chosen for comparison with the Evo-QGeo protocol proposed in this paper. QGeo represents the existing reinforcement-learning-based geographic routing protocols. It does not take into account fluctuations of link states, and its perimeter forwarding strategy only considers single-hop link information.

5.2. Simulation Parameters

Multiple scenarios were simulated with different maximum flight speeds, numbers of UAVs, communication ranges and deployment areas, as shown in Table 1. The default scenario is set with a maximum flight speed of 20 m/s, a deployment area of 800 × 800 × 800 m³, a communication range of 350 m and 40 nodes. Different scenarios were obtained by changing one of these parameters while keeping the other three parameters unchanged. For instance, with the maximum flight speed of 20 m/s, the communication range of 350 m and the deployment area of 800 × 800 × 800 m³ unchanged, the number of UAVs was set as 20, 30, 40, 50, 60 in sequence to evaluate the performance of the candidate protocols. We repeated the simulation 10 times for each scenario to reduce the randomness of a single simulation. The average PDR, energy consumption, ETX_E2E and E2ED of the 10 repeated simulations were taken as the final results.

In the initialization phase, the initial positions of UAVs were randomly generated from near to far according to the IDs of UAV and the size of the deployment area. At the same time, we randomly selected the UAVs with smaller IDs as source nodes and those with larger IDs as destination nodes, so that the randomly generated source nodes were sufficiently far away from the destination nodes. Energy consumptions for transmitting and receiving beacon and data packets were set as E_bt = 2.1 μJ, E_br = 0.26 μJ, E_pt = 0.46 μJ, and E_pr = 0.12 μJ.

5.3. Implementing Hardware-in-the-Loop Protocol Simulation

Considering that large-scale UAV ad hoc network flight is difficult to realize at the current stage, a Hardware-in-the-Loop (HIL) protocol simulator was selected to further validate the performance of the proposed protocol [29]. As shown in Figure 5, this HIL simulator consists of UAVs, ad hoc network radios, and a simulation computer. The actual flight trajectories of the UAVs are transmitted back to the simulation computer via the ad hoc network radios. Subsequently, the simulation computer calculates the movement trajectories within the simulator based on the relative positions between the UAV trajectories and the takeoff points. The scenario in the HIL simulator is set with a maximum flight speed of 20 m/s, a deployment area of 800 × 800 × 800 m³, a communication range of 350 m and 40 nodes.

6. Simulation Results of Protocols

6.1. Simulation Results

Figure 6 shows the PDR of the three protocols under different scenarios. The QGeo protocol is more suitable for dynamic networks than the original GPSR protocol, but its PDR is lower than that of Evo-QGeo. This is because Evo-QGeo provides a more comprehensive and accurate evaluation of link states, making it more suitable for highly dynamic FANETs. The PDR of Evo-QGeo is the highest regardless of changes in deployment area, number of nodes, node speed, or communication range. It is 3.15% to 44.61% higher than that of the original GPSR protocol and 0.81% to 31.35% higher than that of the QGeo protocol.

Figure 7 shows the energy consumption of the three protocols under different scenarios. As the number of nodes, the deployment area size, and the maximum flight speed increase, the energy consumption of all three protocols increases. This is because more nodes lead to more packet transmissions and receptions, while a larger deployment area and a higher maximum flight speed lower the PDR and therefore increase the number of packet retransmissions. Evo-QGeo has the lowest energy consumption in all scenarios. Its energy consumption is 38.79% to 74.47% lower than that of the original GPSR protocol and 29.22% to 48.40% lower than that of the QGeo protocol.
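For reference, a percentage reduction of this kind is conventionally computed relative to the baseline protocol. The expression below assumes that convention; the symbols E_base (energy of GPSR or QGeo) and E_Evo (energy of Evo-QGeo) are introduced here for illustration, and the text does not state the formula explicitly.

```latex
\Delta E\,(\%) = \frac{E_{\mathrm{base}} - E_{\mathrm{Evo}}}{E_{\mathrm{base}}} \times 100
```

Under this reading, a reduction of 74.47% means that Evo-QGeo consumes 25.53% of the baseline energy in that scenario.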

Figure 8 shows the E2ED of the three protocols under different scenarios. As the deployment area size and the maximum flight speed increase, the E2ED of all three protocols increases, because suitable relays become harder to find. Conversely, as the number of nodes increases, the E2ED of all three protocols decreases, because suitable relays become easier to find. The E2ED of Evo-QGeo is the smallest in all scenarios. It is 7.43% to 38.68% lower than that of the original GPSR protocol and 4.55% to 21.63% lower than that of the QGeo protocol.

Figure 9 shows the ETX_E2E of the three protocols under different scenarios. GPSR exhibits poor link quality because the next-hop nodes it selects are typically far away, which results in the highest ETX_E2E of the three protocols. Benefiting from more reliable next-hop selection, Evo-QGeo achieves the lowest ETX_E2E across the different scenarios. It is also evident that, with higher node density and slower relative topological changes, the performance fluctuations of the three routing protocols decrease. Because it evaluates link states more comprehensively, Evo-QGeo outperforms the other protocols even in adverse scenarios. Its ETX_E2E is 3.88% to 23.64% lower than that of the original GPSR protocol and 2.15% to 19.60% lower than that of the QGeo protocol.
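The effect of hop length on ETX_E2E can be illustrated with a small calculation. The sketch assumes the common definition in which the expected transmission count of a link is the reciprocal of its packet reception probability and the end-to-end value is the sum over the hops of the path; the definition used in this paper is given in its metrics section and may differ. The reception probabilities are hypothetical.

```python
def link_etx(prr):
    """Expected number of transmissions on one link with reception probability prr."""
    if not 0.0 < prr <= 1.0:
        raise ValueError("prr must be in (0, 1]")
    return 1.0 / prr


def path_etx(prrs):
    """End-to-end ETX as the sum of the per-hop ETX values."""
    return sum(link_etx(p) for p in prrs)


# Two long hops near the edge of the communication range (hypothetical PRR of 0.4 each).
long_hops = path_etx([0.4, 0.4])
# Three shorter hops over more reliable links (hypothetical PRR of 0.9 each).
short_hops = path_etx([0.9, 0.9, 0.9])
```

With these numbers the two-hop path needs 5 expected transmissions while the three-hop path needs about 3.33, which is the kind of trade-off that penalizes greedy selection of distant neighbours.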

6.2. HIL Simulation Results

Figure 10 compares the HIL simulation results with the numerical simulation results. The results obtained under the HIL simulation, which incorporates real UAV trajectories, are consistent with the numerical simulation results, indicating that the proposed Evo-QGeo outperforms the existing methods. Specifically, the PDR of Evo-QGeo is higher than that of GPSR and QGeo, whereas its energy consumption, E2ED, and ETX_E2E are lower. Compared with the existing methods, the PDR of Evo-QGeo is improved by 7.58~11.81%, the energy consumption is reduced by 36.94~66.88%, the E2ED is reduced by 12.12~21.51%, and the ETX_E2E is reduced by 19.31~26.10%.

7. Conclusions

Intelligent and autonomous UAV swarms require a flexible, efficient, and reliable network structure. However, the high speed and flexibility of UAV nodes pose significant challenges for routing protocols. To improve the performance of geographic routing in UAV networks, this paper proposes a reinforcement-learning-based geographic routing protocol, Evo-QGeo, that considers the future evolution of link states and introduces a new routing-hole bypass method. Simulation results demonstrate that, compared to existing geographic routing protocols, Evo-QGeo achieves up to 11.81~44.61% higher end-to-end packet reception rate, up to 36.94~74.47% lower energy consumption, up to 21.63~38.68% lower latency and up to 19.60~26.10% lower end-to-end expected transmission count. In the future, approaches such as adaptive beaconing and multipath routing will be considered to further improve the performance of geographic routing in UAV networks.

Author Contributions

Conceptualization, M.X. and W.L.; methodology, M.X. and Y.X.; software, M.X. and W.L.; validation, Y.X. and W.L.; formal analysis, M.X.; data curation, M.X. and Y.X.; writing—original draft preparation, M.X. and W.L.; writing—review and editing, Y.X. and D.H.; visualization, M.X. and W.L.; supervision, Y.X. and W.L.; funding acquisition, Y.X. and D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities under Grant B240203012 and the Scientific and Technological Research Program of Chongqing Municipal Education Commission under Grant No. KJQN202501158.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fan, C.; Zhou, X.; Zhang, T.; Yi, W.; Liu, Y. Cache-enabled UAV emergency communication networks: Performance analysis with stochastic geometry. IEEE Trans. Veh. Technol. 2023, 72, 9308–9321.
2. Du, Z.; Luo, C.; Min, G.; Wu, J.; Luo, C.; Pu, J.; Li, S. A survey on autonomous and intelligent swarms of uncrewed aerial vehicles (UAVs). IEEE Trans. Intell. Transp. Syst. 2025, 26, 14477–14500.
3. Lakew, D.S.; Sa’ad, U.; Dao, N.-N.; Na, W.; Cho, S. Routing in flying ad hoc networks: A comprehensive survey. IEEE Commun. Surv. Tutor. 2020, 22, 1071–1120.
4. Yuan, X.; Su, J.; Xia, Y. A survey on routing in flying ad hoc networks: Scenario characteristics, multi-dimensional classification and future prospects. Chin. J. Comput. 2025, 48, 3000–3030.
5. Zhou, Z.; Tang, J.; Feng, W.; Zhao, N.; Yang, Z.; Wong, K.-K. Optimized routing protocol through exploitation of trajectory knowledge for UAV swarms. IEEE Trans. Veh. Technol. 2024, 73, 15499–15512.
6. Ferronato, J.J.; Trentin, M.A.S. Analysis of routing protocols OLSR, AODV and ZRP in real urban vehicular scenario with density variation. IEEE Lat. Am. Trans. 2017, 15, 1727–1734.
7. Khanh, Q.V.; Chehri, A.; Nam, V.H.; Hue, C.T.M.; Quy, N.M. Performance evaluation of routing protocol for 6G UAV communication networks. In Proceedings of the 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring), Singapore, 24–27 June 2024; IEEE: New York, NY, USA, 2024.
8. Tan, X.; Zuo, Z.; Su, S.; Guo, X.; Sun, X.; Jiang, D. Performance analysis of routing protocols for UAV communication networks. IEEE Access 2020, 8, 92212–92224.
9. Bujari, A.; Palazzi, C.E.; Ronzani, D. A comparison of stateless position-based packet routing algorithms for FANETs. IEEE Trans. Mob. Comput. 2018, 17, 2468–2482.
10. Rosati, S.; Krużelecki, K.; Heitz, G.; Floreano, D.; Rimoldi, B. Dynamic routing for flying ad hoc networks. IEEE Trans. Veh. Technol. 2016, 65, 1690–1700.
11. Gangopadhyay, S.; Jain, V.K. A position-based modified OLSR routing protocol for flying ad hoc networks. IEEE Trans. Veh. Technol. 2023, 72, 12087–12098.
12. Wang, C.-M.; Yang, S.; Dong, W.-Y.; Zhao, W.; Lin, W. A distributed hybrid proactive–reactive ant colony routing protocol for highly dynamic FANETs with link quality prediction. IEEE Trans. Veh. Technol. 2025, 74, 1817–1822.
13. Cui, Y.; Tian, H.; Chen, C.; Ni, W.; Wu, H.; Nie, G. New geographical routing protocol for three-dimensional flying ad hoc network based on new effective transmission range. IEEE Trans. Veh. Technol. 2023, 72, 16135–16147.
14. Singh, V.; Sharma, K.P.; Verma, H.K.; Kumar, G.; Balusamy, B.; Rani, S.; Jiang, W.; Çırpan, H.A. A-Geo: Adaptive geographic routing for consumer FANETs in next-generation communication. IEEE Trans. Consum. Electron. 2025, 71, 11034–11043.
15. Cui, Y.; Liu, L.; Zuo, X.; Yang, X.; Zhang, W.; Hou, Z.; Feng, Z. Seeing is better than hearing: Sensing-assisted efficient routing scheme for UAV network. In Proceedings of the IEEE/CIC International Conference on Communications in China, Shanghai, China, 26–28 August 2025; IEEE: New York, NY, USA, 2025.
16. Jiang, M.; Zhang, Q.; Feng, Z.; Han, Z.; Li, W. Mobility prediction based virtual routing for ad hoc UAV network. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; IEEE: New York, NY, USA, 2019.
17. Asadpour, M.; Hummel, K.A.; Giustiniano, D.; Draskovic, S. Route or carry: Motion-driven packet forwarding in micro aerial vehicle networks. IEEE Trans. Mob. Comput. 2017, 16, 843–856.
18. Zhou, T.; Yan, F.; Shen, F.; Xia, W.; Shen, L. A geographic location prediction-based routing algorithm for flying ad hoc networks. In Proceedings of the 2023 IEEE/CIC International Conference on Communications in China (ICCC), Dalian, China, 10–12 August 2023; IEEE: New York, NY, USA, 2023.
19. Li, X.; Sun, H. Prediction-based reactive-greedy routing protocol for flying ad hoc networks. Wirel. Netw. 2025, 31, 2893–2907.
20. Jung, W.-S.; Yim, J.; Ko, Y.-B. QGeo: Q-learning-based geographic ad hoc routing protocol for unmanned robotic networks. IEEE Commun. Lett. 2017, 21, 2258–2261.
21. Wei, C.; Wang, Y.; Wang, X.; Tang, Y. QFAGR: A Q-learning-based fast adaptive geographic routing protocol for flying ad hoc networks. In Proceedings of the IEEE Global Communications Conference, Kuala Lumpur, Malaysia, 4–8 December 2023; IEEE: New York, NY, USA, 2023.
22. Huang, S.; Tang, J.; Zhou, Z.; Yang, G.; Davydov, M.V.; Wong, K.K. A Q-learning and fuzzy logic based routing protocol for UAV networks. In Proceedings of the 2024 16th International Conference on Wireless Communications and Signal Processing (WCSP), Hefei, China, 23–25 October 2024; IEEE: New York, NY, USA, 2024.
23. Wu, M.; Jiang, B.; Chen, S.; Xu, H.; Pang, T.; Gao, M.; Xia, F. Traj-Q-GPSR: A trajectory-informed and Q-learning enhanced GPSR protocol for mission-oriented FANETs. Drones 2025, 9, 489.
24. Clausen, T.; Jacquet, P. Optimized Link State Routing Protocol (OLSR); RFC 3626; The Internet Engineering Task Force (IETF): Fremont, CA, USA, 2003.
25. Perkins, C.; Belding-Royer, E.; Das, S. Ad Hoc On-Demand Distance Vector (AODV) Routing; RFC 3561; The Internet Engineering Task Force (IETF): Fremont, CA, USA, 2003.
26. Karp, B.; Kung, H.-T. GPSR: Greedy perimeter stateless routing for wireless networks. In Proceedings of the 6th ACM Annual International Conference on Mobile Computing and Networking (MobiCom), Boston, MA, USA, 6–11 October 2000; Association for Computing Machinery: New York, NY, USA, 2000.
27. Ede, B.; Kaplan, B.; Kahraman, İ.; Keşir, S.; Yarkan, S.; Ekti, A.R.; Baykaş, T.; Görçin, A.; Çırpan, H.A. Measurement-based large-scale statistical modeling of air-to-air wireless UAV channels via novel time–frequency analysis. IEEE Wirel. Commun. Lett. 2022, 11, 136–140.
28. Hua, B.; Han, L.; Deng, Q.; Zhu, Q.; Li, H.; Qu, Y.; Briso-Rodríguez, C.; Mao, K. AAV air-to-air channel: Statistical properties and experimental verification. IEEE Internet Things J. 2025, 12, 25790–25803.
29. Xu, M.; Liu, W.; Xu, C.; Zhang, Y.; Zhang, K.; Feng, Y.; Xia, Y.; Huang, D. Implementing hardware-in-the-loop protocol simulation for UAV Networks. In Proceedings of the 28th Asia Pacific Conference on Communications (APCC), Sydney, Australia, 19–22 November 2023; IEEE: New York, NY, USA, 2023.

Figure 1. Typical application scenarios of UAV swarms.

Figure 2. The degree of progress of the relay node towards the destination.

Figure 3. The illustration of determining the expected link duration.

Figure 4. The proposed method to bypass the routing hole.

Figure 5. HIL Simulator [29].

Figure 6. PDR of three protocols under different scenarios: (a) Different deployment areas; (b) Different number of nodes; (c) Different node speeds; (d) Different communication ranges.

Figure 7. Energy consumption of three protocols under different scenarios: (a) Different deployment areas; (b) Different number of nodes; (c) Different node speeds; (d) Different communication ranges.

Figure 8. E2ED of three protocols under different scenarios: (a) Different deployment areas; (b) Different number of nodes; (c) Different node speeds; (d) Different communication ranges.

Figure 9. ETX_E2E of three protocols under different scenarios: (a) Different deployment areas; (b) Different number of nodes; (c) Different node speeds; (d) Different communication ranges.

Figure 10. Comparison between HIL simulation results and numerical simulation: (a) PDR; (b) Energy consumption; (c) E2ED; (d) ETX_E2E.

Table 1. Simulation Parameters.

Parameter	Value
Network lifetime (s)	120
Deployment area (m³)	400³, 600³, 800³, 1000³, 1200³
Number of UAVs	20, 30, 40, 50, 60
Communication range (m)	250, 300, 350, 400, 450
Maximum flight speed (m/s)	5, 10, 20, 40, 60
Mobility model	RWP
Propagation loss model	Log-normal model combining shadowing and Rayleigh fading
Channel model parameters	η = 2, σ = 2 dB, σ_R = 1 dB
Receiver sensitivity (dBm)	−98
Beacon interval (s)	3
Minimum beacon interval (s)	1
MAC	IEEE 802.11b
Packet size (bits)	200
Bandwidth (kHz)	384
Data rate (bps)	250
