Article

Deep-Reinforcement-Learning-Based Intelligent Routing Strategy for FANETs

Deping Lin, Tao Peng, Peiliang Zuo and Wenbo Wang
1 Wireless Signal Processing and Network Laboratory, Key Laboratory of Universal Wireless Communication, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Department of Electronic and Communication Engineering, Beijing Institute of Electronic Science and Technology (BESTI), Beijing 100070, China
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(9), 1787; https://doi.org/10.3390/sym14091787
Submission received: 2 August 2022 / Revised: 21 August 2022 / Accepted: 23 August 2022 / Published: 28 August 2022
(This article belongs to the Special Issue Optical and Wireless Communications towards 6G Networks)

Abstract: Flying ad hoc networks (FANETs), which are composed of autonomous flying vehicles, constitute an important supplement to satellite networks and terrestrial networks, and they are indispensable for many scenarios including emergency communication. Unfortunately, routing in FANETs is strongly affected by rapid topology changes, frequent link disconnections, and high vehicle mobility. In this paper, an intelligent routing strategy based on deep reinforcement learning (DRL) is proposed, which is decentralized and takes into account the status of symmetrical nodes within two hops. In order to perceive the local dynamics of the network as comprehensively as possible, the location, moving speed, load degree, and link quality of the nodes are incorporated into the design of the state elements. Each node can then adaptively select the next-hop neighbor according to the Q values calculated by the model obtained through Deep Q-Network training. The simulation and analysis show that the proposed method possesses good convergence characteristics and clearly outperforms several common methods.

1. Introduction

The next generation mobile communication technology (6G) aims to realize full connection between terrestrial communication networks and nonterrestrial communication networks, so as to achieve seamless global coverage and allow network signals to reach any remote area [1,2]. Wireless communication assisted by unmanned aerial vehicles (UAVs), including blimps, balloons, and fixed-wing and rotary-wing UAVs, is attracting increasing attention from researchers [3,4,5,6] and is expected to become an indispensable part of 6G [7,8,9,10,11]. Compared with terrestrial networks, flying ad hoc networks (FANETs) have more flexible nodes and communication links, so they can exploit these asymmetric advantages to provide auxiliary data transmission or emergency communication. FANETs based on UAVs are capable of improving the coverage of ground wireless services as well as providing reliable communications for ground devices [12,13].
Routing technology aims to ensure that the control and business data in the network can be delivered to the destination quickly, stably, and completely [14,15,16,17,18]. FANETs differ from traditional ground communication networks in terms of connectivity, mobility, application areas, etc. Consequently, routing algorithms designed for the fixed topology characteristics of ground networks may not apply well to FANETs, whose topology changes continuously. This situation has made location-based algorithms the main development direction of FANET routing technology [19].
A lot of research effort has been devoted to routing technology in FANETs, for which some effective and robust algorithms have been presented [20,21,22,23,24,25,26,27,28]. In [20], a bidirectional Q-learning routing strategy was proposed on the basis of the Ad Hoc On-Demand Distance Vector Routing (AODV) method, which could obviously accelerate the convergence of the Q-learning process. Particle swarm optimization was adopted in [21] to address the suboptimal choice problem of greedy forwarding. Relying on the cooperation between ground equipment and UAV nodes, S. Jiang et al. [22] proposed a method that integrated fuzzy logic and a depth-first search to construct a Q-value routing table with good convergence performance. In [23], a routing strategy dominated by upper UAV nodes participating in multilayer FANETs was proposed, which was based on information such as node location and network connectivity. P. K. Deb et al. [24] introduced LEO satellites into FANETs and adopted Q-learning to adapt routing decisions, so as to improve the efficiency of the network in providing services to the ground in a 6G scenario. An improved Optimized Link State Routing (OLSR) protocol based on dynamic topology was presented in [25], which could adaptively adjust the sending period of HELLO packets according to the dynamics of the network topology. In [26], the distance and link lifetime factors were mainly considered, and the routing problem was effectively solved by linear programming. Taking into account the deviation angle, node spacing, and link lifetime factors, the authors in [27] designed an optimal virtual routing model to select the neighboring relay node. K. Zhu et al. [28] utilized the ant colony optimization approach for calculating routes in UAV-assisted massive machine-type communications networks to reduce the energy consumption and prolong the network lifetime.
The UAV nodes in FANETs are symmetrical, and each node is responsible for data collection and packet forwarding, which limits the application of central routing algorithms in FANETs. It is worth noting that the aforementioned routing mechanisms have all or some of the following problems: (1) the decision-making method is central rather than distributed, so the decision-making process may consume substantial communication resources and introduce large delays; (2) the method does not take the mobility characteristics of FANET nodes into account, so it may not be applicable; (3) information of only one-hop neighboring nodes is obtained, so the method can merely reach a local optimum with limited performance; (4) Q-learning is adopted to record and update the routing table, but the table becomes cumbersome and inefficient due to the topology dynamics of FANETs.
This paper focuses on exploring and proposing an intelligent decentralized routing method that mainly uses the state information of nodes within two hops. Moreover, the location, moving speed, load degree, and link quality of the nodes are utilized flexibly to build the supporting elements of the method. The proposed method does not need to predict the state of links/nodes or set classification thresholds. The remainder of this paper is organized as follows. The system model is introduced in Section 2. Section 3 gives an introduction to reinforcement learning. The proposed intelligent routing method is detailed in Section 4. Section 5 presents the performance verification, Section 6 discusses future directions, and Section 7 concludes the paper.
Notation: Throughout the paper, scalars are denoted by a nonboldface type, while vectors and matrices are denoted by a boldface type. $(\cdot)^{T}$ and $\mathbb{E}\{\cdot\}$, respectively, signify matrix transpose and statistical expectation. Furthermore, $\omega_i$ represents the $i$th entry of the vector $\boldsymbol{\omega}$.

2. System Model

Figure 1 illustrates the routing scenario in the FANET, where the blue, green, and brown circles represent the current node, one-hop neighboring nodes, and two-hop neighboring nodes, respectively, while $\mathbf{S}_{\mathrm{I}}$ and $\mathbf{S}_{\mathrm{II}}$ separately denote the state information matrices of the one-hop and two-hop neighboring nodes. As aforementioned, the purpose of routing in this scenario is to select a one-hop neighboring node for the current UAV node, so as to transmit the data to the final destination in a timely and complete manner.
This paper considered the information of two-hop nodes in order to provide a rich reference for routing decision-making. We assumed that the sets of one-hop neighboring nodes and two-hop neighboring nodes for the current node $c$ are $\mathcal{N}_c$ and $\mathcal{M}_c$, respectively; then, $N_c = |\mathcal{N}_c|$ and $M_c = |\mathcal{M}_c|$ separately represent the numbers of one-hop and two-hop neighboring nodes. Each UAV node in the FANET was equipped with a positioning system and thus could obtain its own position, movement speed, and direction. Without loss of generality, this paper set the effective communication distance of each node as $R$. Meanwhile, it was assumed that the current node and the one-hop neighboring nodes could acquire or estimate the load and link quality of each of their neighboring nodes by communication interaction, e.g., by exchanging HELLO packets.
On the basis of this information, the proposed method conducts further processing; its intelligent output, obtained through careful design of the related parameters, then provides a recommendation for selecting among the one-hop neighboring (i.e., candidate) nodes. It is worth noting that the current node does not require the information of all nodes within its two hops in the proposed method, as such a strict requirement could make the method infeasible when some links fail. We introduce the details of the proposed decentralized and intelligent routing method in Section 4.
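For illustration only, the following sketch shows one possible per-neighbor record that the current node could populate from HELLO exchanges and its positioning system; the field names and types are our own assumptions rather than structures defined in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class NeighborInfo:
    """Illustrative record of what the current node learns about a neighbor via HELLO packets."""
    node_id: int
    position: tuple                    # (x, y) in meters, reported by the neighbor
    velocity: tuple                    # (vx, vy) in m/s
    queue_len: float                   # MAC-layer queue length (load indicator)
    sinr: float                        # estimated SINR of the link towards this neighbor
    bandwidth: float                   # available bandwidth of the link
    two_hop: Dict[int, "NeighborInfo"] = field(default_factory=dict)   # this neighbor's own neighbors

# The current node keeps one record per one-hop neighbor; two-hop information is nested.
one_hop_table: Dict[int, NeighborInfo] = {}
```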

3. Reinforcement Learning

Reinforcement learning is different from other machine learning methods in the process of model training [29,30,31,32]. Its agent relies on the interactive process with the environment to update the evaluation of decision-making actions under different environmental states (see Figure 2). The optimal strategy $\pi^{*}$ can be acquired in a process of continuous interaction, with which the cumulative reward $R$ is maximized, i.e.,
$R = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}$
where $r_t$ and $\gamma^{t}$, respectively, denote the instantaneous reward and the decay value of the future reward at time $t$. Among the RL algorithms, Q-learning is commonly utilized. The action-value function of Q-learning can be mathematically expressed as [29,33]
$Q^{*}(s,a) = \mathbb{E}_{s'}\!\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s',a') \,\middle|\, s_t = s, a_t = a \right]$
where $s$ and $a$ are, respectively, the current state and action, while $s'$ and $a'$ denote the next state and action, respectively. Therefore, $Q^{*}(s,a)$ can be understood as the expected reward when taking action $a$ under state $s$.
Considering that it is difficult to obtain the optimal strategy $\pi^{*}$ straightforwardly, Q-learning obtains $\pi^{*}$ by continuously updating the action-value function $Q^{\pi}(s,a)$. Concretely, the following rule is utilized to update the Q-table that stores the action-value function
$Q^{\pi}(s,a) \leftarrow Q^{\pi}(s,a) + \alpha \left[ r(s,a) + \gamma \max_{a'} Q^{\pi}(s',a') - Q^{\pi}(s,a) \right]$
where $r(s,a)$ denotes the reward for performing action $a$ under state $s$, $\alpha$ represents the learning rate, and $\max_{a'} Q^{\pi}(s',a')$ is the maximum Q value under the next state $s'$. To balance the exploration and exploitation processes in the continuously interactive Q-learning method, and to prevent the learning process from settling on a suboptimal solution with unsatisfactory performance, the $\varepsilon$-greedy algorithm is commonly adopted, which can be described as
$a = \begin{cases} \text{a random action from } \mathcal{A} & \text{with probability } \varepsilon \\ \arg\max_{a \in \mathcal{A}} Q^{\pi}(s,a) & \text{with probability } 1 - \varepsilon \end{cases}$
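As a quick illustration of the rule above, a minimal sketch of $\varepsilon$-greedy action selection (the function name and signature are ours):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if np.random.random() < epsilon:
        return int(np.random.randint(len(q_values)))   # exploration
    return int(np.argmax(q_values))                    # exploitation
```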
Although the Q-learning method based on a Q-table has the advantages of an intuitive principle, simplicity, and effectiveness, it becomes extremely inefficient when the state space or action space is huge, as the growing number of stored entries significantly reduces the efficiency of reading from and writing to the table. Deep reinforcement learning (DRL) [34] avoids this problem to a great extent: it introduces a deep neural network (the Deep Q-Network, DQN) to perceive the logical relationship between the environmental state and the agent's action, which ensures that DRL does not need to traverse all state–action pairs as Q-learning does. The typical realization of DRL is the DQN algorithm, which approximates the action-value function through
$Q^{*}(s,a) \approx Q(s,a\,|\,\theta)$
where $\theta$ denotes the parameter matrix of the neural network. In order to avoid performance turbulence of the DQN in the training process, both a main network and a target network are adopted. Moreover, with Equations (3) and (5), the following objective function is utilized to update $\theta$, i.e.,
$L(\theta,\theta') = \mathbb{E}\!\left[ \left( r(s,a) + \gamma \max_{a'} Q(s',a'\,|\,\theta') - Q(s,a\,|\,\theta) \right)^{2} \right]$
where $\theta$ and $\theta'$ separately represent the parameters of the main network and the target network.
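To make the objective concrete, the following NumPy sketch computes the target $r + \gamma \max_{a'} Q(s',a'\,|\,\theta')$ and the squared error above; here q_main and q_target are assumed to be callables that return a matrix of Q values (one row per state, one column per action), which is our own stand-in for the two networks:

```python
import numpy as np

def dqn_loss(batch, q_main, q_target, gamma=0.9):
    """Mean squared TD error between the main network and the bootstrapped target."""
    states, actions, rewards, next_states = batch                    # arrays of equal length
    targets = rewards + gamma * q_target(next_states).max(axis=1)    # r + gamma * max_a' Q(s',a'|theta')
    q_sa = q_main(states)[np.arange(len(actions)), actions]          # Q(s,a|theta) of the taken actions
    return float(np.mean((targets - q_sa) ** 2))
```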

4. Proposed Intelligent Routing Method

The effectiveness of a reinforcement-learning-based routing method depends heavily on how its supporting elements are designed. In this section, the details of the proposed DRL-based FANETs' intelligent routing (DRL-FIR) strategy, including the state space, action space, and reward settings, are introduced.

4.1. State Space

As the input of the DRL model, the state should be able to objectively and comprehensively reflect the environment of the agent (i.e., the current UAV). In order to enable the agent to grasp more information about the environment, this paper adopted four routing-related parameters of neighboring nodes within two hops of the current node as the composition of the state space, which are introduced below.
The signal-to-interference-noise ratio (SINR) can well reflect the quality of the channel, and this paper utilized $\eta_{c,i}^{\mathrm{I}}(t)$ and $\eta_{i,j}^{\mathrm{II}}(t)$ to represent the SINR of the channel between the current node $c$ and one-hop node $i$, and the SINR of the channel between one-hop node $i$ and two-hop node $j$ at time $t$, respectively. We have:
$\eta_{c,i}^{\mathrm{I}}(t) = \dfrac{g_{c,i}(t)\, p_{c,i}(t)}{\sigma_i^{2}(t)}$
$\eta_{i,j}^{\mathrm{II}}(t) = \dfrac{g_{i,j}(t)\, p_{i,j}(t)}{\sigma_j^{2}(t)}$
where $g_{c,i}(t)$ and $g_{i,j}(t)$ denote the average channel gains of the two links (i.e., $c \to i$ and $i \to j$), respectively, $p_{c,i}(t)$ and $p_{i,j}(t)$ are the transmission powers of node $c$ and node $i$, respectively, while $\sigma_i^{2}(t)$ and $\sigma_j^{2}(t)$ represent the variances of the Gaussian white noise at routing nodes $i$ and $j$, respectively. For the sake of clarity, the time index is omitted in the remainder of this paper. We can then acquire the channel capacity by the Shannon formula
$C_{c,i}^{\mathrm{I}} = B_{c,i} \log_2\!\left(1 + \eta_{c,i}^{\mathrm{I}}\right)$
$C_{i,j}^{\mathrm{II}} = B_{i,j} \log_2\!\left(1 + \eta_{i,j}^{\mathrm{II}}\right)$
with $B_{c,i}$ and $B_{i,j}$ denoting the available bandwidths of the two links, respectively. In order to fairly compare the potential of different candidate nodes, we took the channel capacity ratios as part of the state, which can be, respectively, denoted as
$\bar{C}_{c,i}^{\mathrm{I}} = \dfrac{C_{c,i}^{\mathrm{I}}}{\frac{1}{N_c}\sum_{n \in \mathcal{N}_c} C_{c,n}^{\mathrm{I}}}$
$\bar{C}_{c,i}^{\mathrm{II}} = \dfrac{\left|\mathcal{M}_c \setminus \mathcal{N}_c\right| \times \sum_{j \in \mathcal{N}_i \setminus \mathcal{N}_c} C_{i,j}^{\mathrm{II}}}{\left|\mathcal{N}_i \setminus \mathcal{N}_c\right| \times \sum_{n,m \in \mathcal{M}_c \setminus \mathcal{N}_c} C_{n,m}^{\mathrm{II}}}$
where $\mathcal{N}_i$ represents the set of one-hop neighboring nodes of node $i$, which is itself a one-hop neighbor of $c$, while $|\cdot|$ denotes the cardinality of a set. Note that nodes simultaneously belonging to the one-hop set are excluded when calculating the two-hop ratio.
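For illustration, the small sketch below evaluates the one-hop and two-hop ratios from per-link capacities; the dictionary layout is our own assumption, and the same routine applies unchanged to the distance, load, and lifetime ratios introduced next.

```python
import numpy as np

def one_hop_ratio(one_hop_values: dict, i) -> float:
    """Value of candidate i divided by the average over all one-hop neighbors of c."""
    return float(one_hop_values[i] / np.mean(list(one_hop_values.values())))

def two_hop_ratio(links_of_i: dict, all_two_hop_links: dict) -> float:
    """Average metric of i's links towards two-hop nodes divided by the average over all
    two-hop links known to the current node (one-hop nodes already excluded from both sets)."""
    return float(np.mean(list(links_of_i.values())) / np.mean(list(all_two_hop_links.values())))

# e.g., capacity ratio of candidate node 1 among three one-hop neighbors (hypothetical values):
# one_hop_ratio({1: 3.2, 2: 2.8, 3: 4.0}, i=1)  -> ~0.96
```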
The second element of the state set in the proposed method was distance, which plays a key role in almost all location-based routing algorithms. With the positioning ability, the distance between nodes could be easily calculated. The following distance ratios were adopted in DRL-FIR:
$\bar{D}_{c,i}^{\mathrm{I}} = \dfrac{D_{c,i}^{\mathrm{I}}}{\frac{1}{N_c}\sum_{n \in \mathcal{N}_c} D_{c,n}^{\mathrm{I}}}$
$\bar{D}_{c,i}^{\mathrm{II}} = \dfrac{\left|\mathcal{M}_c \setminus \mathcal{N}_c\right| \times \sum_{j \in \mathcal{N}_i \setminus \mathcal{N}_c} D_{i,j}^{\mathrm{II}}}{\left|\mathcal{N}_i \setminus \mathcal{N}_c\right| \times \sum_{n,m \in \mathcal{M}_c \setminus \mathcal{N}_c} D_{n,m}^{\mathrm{II}}}$
where $D_{c,i}^{\mathrm{I}}$ and $D_{i,j}^{\mathrm{II}}$ denote the distance between the current node $c$ and the one-hop node $i$, and the distance between node $i$ and the two-hop node $j$, respectively.
Considering that the business load of nodes can greatly affect the performance of the routing process in terms of delay and packet loss rate, this paper also took into account the load ratio of nodes within two hops in the setting of the state, which can be expressed as
$\bar{L}_{c,i}^{\mathrm{I}} = \dfrac{L_{c,i}^{\mathrm{I}}}{\frac{1}{N_c}\sum_{n \in \mathcal{N}_c} L_{c,n}^{\mathrm{I}}}$
$\bar{L}_{c,i}^{\mathrm{II}} = \dfrac{\left|\mathcal{M}_c \setminus \mathcal{N}_c\right| \times \sum_{j \in \mathcal{N}_i \setminus \mathcal{N}_c} L_{i,j}^{\mathrm{II}}}{\left|\mathcal{N}_i \setminus \mathcal{N}_c\right| \times \sum_{n,m \in \mathcal{M}_c \setminus \mathcal{N}_c} L_{n,m}^{\mathrm{II}}}$
where $L_{c,i}^{\mathrm{I}}$ and $L_{i,j}^{\mathrm{II}}$ represent the MAC-layer queue lengths of nodes $i$ and $j$, respectively.
Furthermore, we note that the mobility characteristics of nodes should also be considered, as they can affect the lifetime of links to a great extent and, in turn, the reliability of routing. The lifetime of the link between nodes $i$ and $j$ (i.e., $T_{i,j}$) can be calculated by solving the following equation
$\left( x_i + v_{x_i} T_{i,j} - x_j - v_{x_j} T_{i,j} \right)^{2} + \left( y_i + v_{y_i} T_{i,j} - y_j - v_{y_j} T_{i,j} \right)^{2} = R^{2}$
where $(x_a, y_a)$ and $(v_{x_a}, v_{y_a})$ denote the position and velocity vector of node $a$, respectively. The link lifetime ratios were then included in the state of the proposed method, which can be expressed as
$\bar{T}_{c,i}^{\mathrm{I}} = \dfrac{T_{c,i}^{\mathrm{I}}}{\frac{1}{N_c}\sum_{n \in \mathcal{N}_c} T_{c,n}^{\mathrm{I}}}$
$\bar{T}_{c,i}^{\mathrm{II}} = \dfrac{\left|\mathcal{M}_c \setminus \mathcal{N}_c\right| \times \sum_{j \in \mathcal{N}_i \setminus \mathcal{N}_c} T_{i,j}^{\mathrm{II}}}{\left|\mathcal{N}_i \setminus \mathcal{N}_c\right| \times \sum_{n,m \in \mathcal{M}_c \setminus \mathcal{N}_c} T_{n,m}^{\mathrm{II}}}$
with $T_{c,i}^{\mathrm{I}}$ and $T_{i,j}^{\mathrm{II}}$, respectively, denoting the lifetimes of links $c \to i$ and $i \to j$.
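The lifetime equation above is a quadratic in $T_{i,j}$; a small sketch that solves it directly is given below (positions in meters, velocities in m/s; the function name is ours).

```python
import math

def link_lifetime(pos_i, vel_i, pos_j, vel_j, R):
    """Time until nodes i and j drift farther apart than the communication range R."""
    dx, dy = pos_i[0] - pos_j[0], pos_i[1] - pos_j[1]
    dvx, dvy = vel_i[0] - vel_j[0], vel_i[1] - vel_j[1]
    a = dvx ** 2 + dvy ** 2
    b = 2.0 * (dx * dvx + dy * dvy)
    c = dx ** 2 + dy ** 2 - R ** 2
    if a == 0.0:                        # identical velocities: the distance never changes
        return math.inf if c <= 0.0 else 0.0
    disc = b ** 2 - 4.0 * a * c
    if disc < 0.0:                      # cannot happen while the nodes are within range
        return 0.0
    return max((-b + math.sqrt(disc)) / (2.0 * a), 0.0)   # non-negative root of the quadratic

# e.g., two nodes 50 m apart closing at 6 m/s with R = 120 m stay connected for about 28.3 s:
# link_lifetime((0, 0), (3, 0), (50, 0), (-3, 0), R=120)
```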
Finally, the state of the DRL-FIR method can be written as the matrix $\mathbf{S} = \left[ \mathbf{S}_{\mathrm{I}}, \mathbf{S}_{\mathrm{II}} \right]$ with $\mathbf{S}_{\mathrm{I}} = \left[ \bar{C}_{c,i}^{\mathrm{I}}, \bar{D}_{c,i}^{\mathrm{I}}, \bar{L}_{c,i}^{\mathrm{I}}, \bar{T}_{c,i}^{\mathrm{I}} \right]_{i \in \mathcal{N}_c}^{T}$ and $\mathbf{S}_{\mathrm{II}} = \left[ \bar{C}_{c,i}^{\mathrm{II}}, \bar{D}_{c,i}^{\mathrm{II}}, \bar{L}_{c,i}^{\mathrm{II}}, \bar{T}_{c,i}^{\mathrm{II}} \right]_{i \in \mathcal{N}_c}^{T}$.

4.2. Action Space

Intuitively, the actions in the action space correspond to the choice of the next-hop node. Unlike Q-learning, which updates the Q value of each neighboring node online, DRL-FIR is trained offline; the size of the action space therefore needs to be set in advance to equal the number of one-hop nodes represented in the state space. Mathematically, we have $a \in \left\{ node_c^{1}, node_c^{2}, \ldots, node_c^{N_c} \right\}$, where $node_c^{*}$ means that the next routing node of the current node $c$ is node $*$.
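At run time, the action index returned by the trained network is simply mapped back to a one-hop neighbor; a minimal sketch (the helper name and the Keras-style predict call are our own illustration):

```python
import numpy as np

def select_next_hop(q_network, state, one_hop_ids):
    """Greedy next-hop choice: the candidate with the largest predicted Q value."""
    q_values = q_network.predict(state[np.newaxis, ...], verbose=0)[0]   # one Q value per action slot
    return one_hop_ids[int(np.argmax(q_values[: len(one_hop_ids)]))]     # ignore unused action slots
```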

4.3. Reward

The proposed method aimed to realize the reliability, stability, and integrity of data transmission through node selection. Under the guidance of this goal, if the next hop is the destination node, the agent should obtain the maximum reward, and then we have
$F_j = \begin{cases} 1, & \text{node } j \text{ is the destination node} \\ 0, & \text{node } j \text{ is not the destination node} \end{cases}$
Meanwhile, we note that the selection of ordinary relay nodes should also be rewarded in order to guide the convergence of the routing process. Therefore, the above four factors were taken into account in the setting of the reward. Specifically, denote $C_j$ and $T_j$, respectively, as the channel capacity and the link lifetime between the current node and the selected next node $j$, and $D_j$ and $L_j$ as the distance to the destination and the MAC queue length of node $j$. In the proposed method, the reward was set as
$r = \mu_1 e^{C_j / C_{\max}} + \mu_2 \dfrac{D_j}{D_{\max}} + \mu_3 \dfrac{L_j}{L_{\max}} + \mu_4 e^{T_j / T_{\max}} + F_j$
where $C_{\max}$, $T_{\max}$, $D_{\max}$, and $L_{\max}$, respectively, represent the maximum channel capacity, link lifetime, distance to the destination, and MAC queue length among the neighboring one-hop nodes, while $\mu_{*}$ is a weight factor, and we have $\mu_1 + \mu_2 + \mu_3 + \mu_4 = 1$.
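The sketch below mirrors the reward expression above as reconstructed here, using the weights listed in Table 1; the function name and argument layout are our own.

```python
import math

def step_reward(C_j, D_j, L_j, T_j, maxima, mu=(0.25, 0.25, 0.2, 0.3), is_destination=False):
    """Per-step reward for choosing next hop j; `maxima` holds (C_max, D_max, L_max, T_max)
    over the one-hop candidates, and the weights mu sum to one (Table 1 values by default)."""
    C_max, D_max, L_max, T_max = maxima
    F_j = 1.0 if is_destination else 0.0
    return (mu[0] * math.exp(C_j / C_max)
            + mu[1] * (D_j / D_max)
            + mu[2] * (L_j / L_max)
            + mu[3] * math.exp(T_j / T_max)
            + F_j)
```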

4.4. Other Details

The residual network (ResNet), which can effectively avoid the degradation problem of conventional deep network structures, was adopted in the proposed DRL-FIR, and the Q values were estimated by an 8-layer ResNet. Moreover, the Adam optimizer and the ReLU activation function were employed in the training process of the ResNet. The input and output of the network, respectively, correspond to the assembled state of the current node and the Q values of the one-hop nodes.
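The paper does not specify the layer widths of the ResNet, so the sketch below is only one plausible Keras realization of an 8-layer fully connected Q-network with residual skip connections, ReLU activations, and the Adam optimizer; all sizes are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(state_dim: int, num_actions: int, width: int = 128) -> keras.Model:
    """Fully connected Q-network with residual skips (eight Dense layers in total)."""
    inputs = keras.Input(shape=(state_dim,))
    x = layers.Dense(width, activation="relu")(inputs)
    for _ in range(3):                                  # three residual blocks of two Dense layers
        shortcut = x
        x = layers.Dense(width, activation="relu")(x)
        x = layers.Dense(width)(x)
        x = layers.Activation("relu")(layers.Add()([x, shortcut]))
    outputs = layers.Dense(num_actions)(x)              # one Q value per candidate next hop
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    return model
```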
Algorithm 1 summarizes the proposed intelligent routing strategy; it mainly shows the training process of the model. In order to avoid violent fluctuations during training, the algorithm adopts the dual neural network (Step 12), replay memory (Step 9), and small-batch learning (Step 11) approaches. It is worth noting that the complexity of the proposed strategy lies mainly in the training stage of the model, while the complexity of the testing or application stage is quite low, as the method only needs relatively simple forward linear and nonlinear calculations in these stages. Considering that the training of the proposed method can be carried out by ground equipment, and the UAV nodes only need to record the historical states and action decisions, the method is computationally feasible for FANETs.
We note that the proposed DRL-FIR method is forward-compatible, which means that the trained model can also be applied to scenarios that are simpler than the one used during training; for the latter, the corresponding model inputs only need to be set to zero or replaced with simple values. Meanwhile, it is worth adding that the proposed method does not require the information of all the two-hop nodes around the current node, as the state set by the method contains relative values rather than absolute values. This ensures the practicality of the method to a great extent, considering that the links between nodes may be unstable in dynamic FANETs. The method proposed in this paper is applicable to all nodes in FANETs: the current node obtains and processes the asymmetry information of surrounding nodes for rapid decision-making, and then sends data packets to the next symmetric node until the data routing process is completed.
Algorithm 1: The proposed DRL-FIR strategy.
1: Input: state $\mathbf{S}$, action set $\mathcal{A}$, learning rate $\alpha$, discount factor $\gamma$, update frequency of the target network $F_t$, the source UAV node, and the destination UAV node.
2: Initialize the replay memory $M_R$, the main network $Q(s,a|\theta)$ with random weights $\theta$, the target network $Q(s,a|\theta')$ with weights $\theta' = \theta$, $\varepsilon$, $\varepsilon_{\mathrm{decay}}$, $\varepsilon_{\min}$, the training start threshold $Z_s$, $n = 0$, and $\bar{r} = 0$.
3: For $k \in \{1, \ldots, K\}$ Do
4:    if $\varepsilon > \varepsilon_{\min}$ then $\varepsilon \leftarrow \varepsilon \cdot \varepsilon_{\mathrm{decay}}$
5:    For $z \in \{1, \ldots, Z\}$ Do
6:       Generate a random number $p$ from 0 to 1;
7:       $a \leftarrow$ a random action from $\mathcal{A}$ if $0 \le p < \varepsilon$, otherwise $a \leftarrow \arg\max_{a \in \mathcal{A}} Q(s,a|\theta)$;
8:       Take action $a$ and make the state transition $s \to s'$; calculate $r \leftarrow R(s,a,s')$;
9:       Save $(s,a,r,s')$ into the replay memory $M_R$; $\bar{r} \leftarrow \bar{r} + r$;
10:      if $z > Z_s$ then
11:         Randomly sample a batch of experiences from $M_R$; calculate $L(\theta,\theta')$; update the weights $\theta$; $n \leftarrow n + 1$
12:      if $n \bmod F_t = 0$ then $\theta' \leftarrow \theta$, $s \leftarrow s'$
13:      if $s'$ is the destination then save $\bar{r}$ into the reward record.
14:    End
15: End
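A compact Python sketch of the training loop in Algorithm 1 is given below (replay memory, $\varepsilon$ decay, and periodic synchronization of the target network); the environment interface env.reset()/env.step() is a stand-in introduced here, assumed to return the state of Section 4.1 and the reward of Section 4.3.

```python
import random
from collections import deque
import numpy as np

def train(env, q_main, q_target, episodes=5000, steps=3000, gamma=0.9,
          eps=0.6, eps_decay=0.995, eps_min=0.001, batch_size=32,
          warmup=600, sync_every=500, memory_size=200):
    """Sketch of Algorithm 1 with the hyperparameters of Table 1 as defaults."""
    memory = deque(maxlen=memory_size)
    q_target.set_weights(q_main.get_weights())
    n = 0
    for _ in range(episodes):                                        # Step 3
        eps = max(eps * eps_decay, eps_min)                          # Step 4
        state = env.reset()
        for z in range(steps):                                       # Step 5
            if np.random.random() < eps:                             # Steps 6-7: epsilon-greedy
                action = np.random.randint(env.num_actions)
            else:
                action = int(np.argmax(q_main.predict(state[None], verbose=0)[0]))
            next_state, reward, done = env.step(action)              # Step 8
            memory.append((state, action, reward, next_state))       # Step 9
            if z > warmup:                                           # Steps 10-11: minibatch update
                batch = random.sample(memory, min(batch_size, len(memory)))
                s, a, r, s2 = map(np.array, zip(*batch))
                target_q = q_main.predict(s, verbose=0)
                target_q[np.arange(len(a)), a] = r + gamma * q_target.predict(s2, verbose=0).max(axis=1)
                q_main.train_on_batch(s, target_q)
                n += 1
            if n and n % sync_every == 0:                            # Step 12: target-network sync
                q_target.set_weights(q_main.get_weights())
            state = next_state
            if done:                                                 # Step 13: destination reached
                break
```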

5. Simulation and Performance Analysis

This paper utilized Keras as the deep learning platform for training the model of the proposed DRL-FIR method. The simulation area of the FANET was set as 1000 m × 1000 m, containing 45 UAV nodes with randomly generated positions. The packet size and the transmission range of each node were, respectively, set as 512 bytes and 120 m. The available bandwidth, SINR, queue length, and moving speed were generated randomly and obeyed uniform distributions. The source node and the destination node were randomly selected, and a node was reflected back into the region when it reached the boundary. The simulation parameters, including the model-related parameters, are summarized in Table 1. Meanwhile, this paper randomly generated a large number of samples (i.e., snapshots) for model training and adopted the same method to generate the test set used to verify the performance of the model (see Figure 3). For each snapshot in the training set, the model was trained for 1000 iterations.
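For context, the following hedged sketch shows one way the random snapshots summarized in Table 1 could be drawn (uniform positions, speeds, SINRs, queue lengths, and bandwidths); the exact generator used for the simulations is not described in the paper, so this is an illustration only.

```python
import numpy as np

def random_snapshot(num_nodes=45, area=1000.0, rng=None):
    """Draw one FANET snapshot using the uniform ranges listed in Table 1."""
    rng = rng or np.random.default_rng()
    return {
        "pos":   rng.uniform(0.0, area, size=(num_nodes, 2)),            # m
        "speed": rng.uniform(3.0, 10.0, size=num_nodes),                 # m/s
        "dir":   rng.uniform(0.0, 2 * np.pi, size=num_nodes),            # heading (rad)
        "sinr":  rng.uniform(10.0, 40.0, size=(num_nodes, num_nodes)),   # dB, per link
        "queue": rng.uniform(1.0, 10.0, size=num_nodes),                 # Kb
        "bw":    rng.uniform(1.0, 5.0, size=(num_nodes, num_nodes)),     # Mb, per link
    }

train_set = [random_snapshot() for _ in range(2000)]   # training snapshots
test_set  = [random_snapshot() for _ in range(500)]    # test snapshots
```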
To demonstrate the performance of the proposed method, five algorithms were adopted for comparison, as listed below. (It should be noted that most existing methods rely on central decision-making, or their applicable scenarios differ in topology from the scenario considered in this paper; therefore, the comparison methods considered here are variations of the proposed method or typical distributed routing methods.)
  • FIR-OH: This method considers only the information of one-hop neighboring nodes. Note that the performance of this method can basically correspond to the results of some existing Q-learning based methods, e.g., [20].
  • GPSR: A classical distributed routing algorithm proposed in [35]. Many methods have been built on it [19], and it is often used for performance comparison.
  • W-GPCR: The weighted greedy perimeter coordinator routing method proposed in [36]. We modified this method so that it could be used in the scenario of this paper, setting it to consider the Euclidean distance, channel capacity, and link lifetime with equal weights.
  • FIR-CD: a variant of the proposed DRL-FIR method that only considers the channel capacity and distance factors defined in Section 4.1. Ignoring factors such as the MAC queue length and link lifetime may prevent the method from adapting well to the dynamics of FANETs.
  • FIR-DLT: a variant of the proposed DRL-FIR method that only considers the distance, MAC queue length, and link lifetime factors defined in Section 4.1. This method is similar to [27] in terms of the factors considered, although it does not predict the lifetime of candidate links.
We first show the convergence performance of the proposed DRL-FIR method with different decay values in Figure 4. The decay value $\gamma$ reflects the preference of the model for present and future rewards in the training process. As can be seen, both the proposed method and FIR-OH converge under different parameter configurations, which reflects the effectiveness of the proposed method. It can be observed that the proposed method performs best when $\gamma = 0.9$, which is consistent with the fact that the data ultimately need to be delivered to the destination in the routing scenario. Since DRL-FIR is capable of perceiving the regional dynamics of the node's environment more comprehensively than FIR-OH, its performance after convergence is better than that of FIR-OH under the same configuration. More specifically, the information about the channel state, link lifetime, spatial location, and MAC queue length of the neighboring nodes within two hops rather than one hop enables DRL-FIR to determine a better next hop than FIR-OH, so as to obtain greater rewards. In the subsequent simulations, we set the decay value to 0.9.
In order to intuitively show the performance of the proposed DRL-FIR in this paper, the packet loss rate (PLR) and delay performance of methods for random FANET topology snapshots are shown in Figure 5 and Figure 6, respectively. It can be observed that the methods show obvious differences in their performance in terms of PLR. It should be noted that compared with factors such as channel capacity and spatial location, the PLR performance is closely related to the link lifetime and MAC queue length, as these two can greatly affect the reliable reception of data packets. In contrast, the delay performance is more concerned with channel capacity and spatial location factors.
The performance of FIR-DLT is slightly better than that of FIR-OH; this is because the PLR has little to do with the channel capacity, and FIR-DLT can exploit its multihop perception of the node environment without considering it. Meanwhile, the PLR performance of the FIR-CD method deteriorates because the relevant queuing and link lifetime factors are not considered. GPSR has the worst performance because it only considers the distance between each surrounding one-hop node and the destination node. The PLR performance of W-GPCR is significantly better than that of GPSR due to the richer set of influence factors considered; as the link lifetime parameter is referenced in its decision-making, its PLR is also lower than that of FIR-CD.
It is worth noting that the relative performance of the methods changes significantly in terms of transmission delay. Since the distance and channel capacity factors have the greatest influence on the transmission delay, the performance of the FIR-CD method is better than that of the FIR-OH, W-GPCR, and FIR-DLT methods. For the same reason, the FIR-DLT and GPSR methods, which do not consider the channel capacity factor, show basically similar performance. By observing the two figures, it can be concluded that the performance of DRL-FIR is better because it considers more comprehensive influencing factors.
Finally, the average performance of the methods over 1000 randomly generated FANET topology snapshots is summarized in Table 2. It can be seen that GPSR shows the worst performance in terms of both PLR and transmission delay, because the factors it considers are too limited. FIR-OH performs moderately well among the DQN-based methods, while the FIR-CD and FIR-DLT methods show their respective advantages in the two performance aspects because of their different emphases on the factors considered. The performance of W-GPCR is similar to that of FIR-OH on the whole, but a certain gap remains, as the latter is more comprehensive in its consideration of factors and has stronger adaptive decision-making ability. Since all the key factors are taken into account, the proposed DRL-FIR method achieves the lowest PLR and transmission delay among all methods.

6. Future Work

Suitable routing algorithms are quite important for FANETs. The proposed method based on deep reinforcement learning can be well applied to FANETs with dynamic topologies, as it considers multiple states of the nodes in a local region. Furthermore, it is not mandatory to obtain the information of all the nodes within two hops, which also makes the method feasible in practice. Compared with the comparison methods, the main disadvantage of the proposed method is its higher complexity, as it needs to obtain more information and carry out the corresponding preprocessing; however, the resulting calculations consume very few resources, which is completely affordable for UAV nodes.
It should be noted that although central routing algorithms cannot handle the dynamics of FANETs well, their global planning process may have some advantages over distributed algorithms in certain respects. Therefore, a routing algorithm combining central and distributed modes may achieve better performance, which can be taken as a further research direction. Meanwhile, the types of messages will become increasingly abundant with the rapid development of wireless communication networks; considering the different performance requirements of various message types in the design of a routing algorithm is expected to provide users with the required services more flexibly and efficiently.

7. Conclusions

This paper considered the routing problem in FANETs. A distributed adaptive routing strategy based on deep reinforcement learning was proposed to adapt to node mobility and network topology dynamics. The state of the method was designed to reflect the local characteristics of the network as completely and comprehensively as possible, integrating the moving speed, location, link quality, load, and link lifetime of nodes within two hops. Simulation results showed that the proposed strategy performs significantly better than commonly utilized methods.

Author Contributions

Conceptualization, D.L. and P.Z.; Investigation, D.L. and T.P.; Methodology, P.Z.; Project administration, T.P. and W.W.; Validation, D.L. and P.Z.; Writing—original draft, D.L. and P.Z.; Writing—review and editing, T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the China National Key R&D Program (no. 2020YF-B1808000) and the Beijing Natural Science Foundation (no. L192002), and in part by the “Advanced and sophisticated” discipline construction project of universities in Beijing (no. 20210013Z0401).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FANET: flying ad hoc network
DRL: deep reinforcement learning
UAV: unmanned aerial vehicle
AODV: Ad Hoc On-Demand Distance Vector Routing
OLSR: Optimized Link State Routing
DQN: Deep Q-Network
DRL-FIR: DRL-based FANETs' intelligent routing
SNR: signal-to-noise ratio
SINR: signal-to-interference-noise ratio
ResNet: residual network
MDP: Markov decision process
DNN: deep neural network
GPSR: greedy perimeter stateless routing
UA: uniform allocation
QoS: quality of service
UE: user equipment

References

  1. Pattnaik, S.K.; Samal, S.R.; Bandopadhaya, S.; Swain, K.; Choudhury, S.; Das, J.K.; Mihovska, A.; Poulkov, V. Future Wireless Communication Technology towards 6G IoT: An Application-Based Analysis of IoT in Real-Time Location Monitoring of Employees Inside Underground Mines by Using BLE. Sensors 2022, 22, 3438.
  2. Rana, A.; Taneja, A.; Saluja, N.; Rani, S.; Singh, A.; Alharithi, F.S.; Aldossary, S.M. Intelligent Network Solution for Improved Efficiency in 6G-Enabled Expanded IoT Network. Electronics 2022, 11, 2569.
  3. Khan, A.; Zhang, J.; Ahmad, S.; Memon, S.; Qureshi, H.A.; Ishfaq, M. Dynamic Positioning and Energy-Efficient Path Planning for Disaster Scenarios in 5G-Assisted Multi-UAV Environments. Electronics 2022, 11, 2197.
  4. Dahmane, S.; Yagoubi, M.B.; Brik, B.; Kerrache, C.A.; Calafate, C.T.; Lorenz, P. Multi-Constrained and Edge-Enabled Selection of UAV Participants in Federated Learning Process. Electronics 2022, 11, 2119.
  5. Sharma, R.; Patel, K.; Shah, S.; Aibin, M. Aerial Footage Analysis Using Computer Vision for Efficient Detection of Points of Interest near Railway Tracks. Aerospace 2022, 9, 370.
  6. Zhang, R.; Li, S.; Ding, Y.; Qin, X.; Xia, Q. UAV Path Planning Algorithm Based on Improved Harris Hawks Optimization. Sensors 2022, 22, 5232.
  7. You, X.; Wang, C.X.; Huang, J.; Gao, X.; Zhang, Z.; Wang, M.; Huang, Y.; Zhang, C.; Jiang, Y.; Liang, Y.C.; et al. Towards 6G wireless communication networks: Vision, enabling technologies, and new paradigm shifts. Sci. China Inf. Sci. 2021, 64, 1–74.
  8. Zhang, Z.; Xiao, Y.; Ma, Z.; Xiao, M.; Ding, Z.; Lei, X.; Fan, P. 6G Wireless Networks: Vision, Requirements, Architecture, and Key Technologies. IEEE Veh. Technol. Mag. 2019, 14, 28–41.
  9. Mozaffari, M.; Lin, X.; Hayes, S. Toward 6G with Connected Sky: UAVs and Beyond. IEEE Commun. Mag. 2021, 59, 74–80.
  10. Park, K.-W.; Kim, H.M.; Shin, O.-S. A Survey on Intelligent-Reflecting-Surface-Assisted UAV Communications. Energies 2022, 15, 5143.
  11. Alsamhi, S.H.; Shvetsov, A.V.; Kumar, S.; Hassan, J.; Alhartomi, M.A.; Shvetsova, S.V.; Sahal, R.; Hawbani, A. Computing in the Sky: A Survey on Intelligent Ubiquitous Computing for UAV-Assisted 6G Networks and Industry 4.0/5.0. Drones 2022, 6, 177.
  12. Zhang, Z.; Zhou, C.; Sheng, L.; Cao, S. Optimization Schemes for UAV Data Collection with LoRa 2.4 GHz Technology in Remote Areas without Infrastructure. Drones 2022, 6, 173.
  13. Cardoso, C.M.M.; Barros, F.J.B.; Carvalho, J.A.R.; Machado, A.A.; Cruz, H.A.O.; de Alcântara Neto, M.C.; Araújo, J.P.L. SNR Prediction with ANN for UAV Applications in IoT Networks Based on Measurements. Sensors 2022, 22, 5233.
  14. Pang, X.; Liu, M.; Li, Z.; Gao, B.; Guo, X. Geographic Position based Hopless Opportunistic Routing for UAV networks. Ad Hoc Netw. 2021, 120.
  15. Wheeb, A.H.; Nordin, R.; Samah, A.A.; Alsharif, M.H.; Khan, M.A. Topology-Based Routing Protocols and Mobility Models for Flying Ad Hoc Networks: A Contemporary Review and Future Research Directions. Drones 2022, 6, 9.
  16. Hong, L.; Guo, H.; Liu, J.; Zhang, Y. Toward Swarm Coordination: Topology-Aware Inter-UAV Routing Optimization. IEEE Trans. Veh. Technol. 2020, 69, 10177–10187.
  17. Zhang, Y.; Qiu, H. DDQN with Prioritized Experience Replay-Based Optimized Geographical Routing Protocol of Considering Link Stability and Energy Prediction for UANET. Sensors 2022, 22, 5020.
  18. Shen, H.; Jiang, Y.; Deng, F.; Shan, Y. Task Unloading Strategy of Multi UAV for Transmission Line Inspection Based on Deep Reinforcement Learning. Electronics 2022, 11, 2188.
  19. Bujari, A.; Palazzi, C.E.; Ronzani, D. A Comparison of Stateless Position-based Packet Routing Algorithms for FANETs. IEEE Trans. Mob. Comput. 2018, 17, 2468–2482.
  20. Zhou, J.; Liu, J.; Shi, W.; Xia, B. A bidirectional Q-learning routing protocol for UAV networks. In Proceedings of the 13th International Conference on Wireless Communications and Signal Processing (WCSP), Changsha, China, 20–22 October 2021; pp. 1–5.
  21. Wang, F.; Chen, Z.; Zhang, J.; Zhou, C.; Yue, W. Greedy forwarding and limited flooding based routing protocol for UAV flying Ad-Hoc networks. In Proceedings of the IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, 12–14 July 2019; pp. 1–4.
  22. Jiang, S.; Huang, Z.; Ji, Y. Adaptive UAV-Assisted Geographic Routing With Q-Learning in VANET. IEEE Commun. Lett. 2021, 25, 1358–1362.
  23. Zhang, Q.; Jiang, M.; Feng, Z.; Li, W.; Zhang, W.; Pan, M. IoT Enabled UAV: Network Architecture and Routing Algorithm. IEEE Internet Things J. 2019, 6, 3727–3742.
  24. Deb, P.K.; Mukherjee, A.; Misra, S. XiA: Send-It-Anyway Q-Routing for 6G-Enabled UAV-LEO Communications. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2722–2731.
  25. Jiang, Y.; Mi, Z.; Wang, H.; Sun, Y.; Zhao, N. Research on OLSR adaptive routing strategy based on dynamic topology of UANET. In Proceedings of the IEEE 19th International Conference on Communication Technology (ICCT), Xi’an, China, 16–19 October 2019; pp. 1258–1263.
  26. Gharib, M.; Afghah, F.; Bentley, E. OPAR: Optimized predictive and adaptive routing for cooperative UAV networks. In Proceedings of the IEEE INFOCOM 2021—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Vancouver, BC, Canada, 10–13 May 2021; pp. 1–6.
  27. Jiang, M.; Zhang, Q.; Feng, Z.; Han, Z.; Li, W. Mobility prediction based virtual routing for Ad Hoc UAV network. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6.
  28. Zhu, K.; Xu, X.; Huang, Z. Energy-efficient routing algorithms for UAV-assisted mMTC networks. In Proceedings of the IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Istanbul, Turkey, 8–11 September 2019; pp. 1–6.
  29. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  30. Zhao, X.; Yang, R.; Zhang, Y.; Yan, M.; Yue, L. Deep Reinforcement Learning for Intelligent Dual-UAV Reconnaissance Mission Planning. Electronics 2022, 11, 2031.
  31. ud Din, A.F.; Mir, I.; Gul, F.; Mir, S.; Saeed, N.; Althobaiti, T.; Abbas, S.M.; Abualigah, L. Deep Reinforcement Learning for Integrated Non-Linear Control of Autonomous UAVs. Processes 2022, 10, 1307.
  32. Zhan, G.; Zhang, X.; Li, Z.; Xu, L.; Zhou, D.; Yang, Z. Multiple-UAV Reinforcement Learning Algorithm Based on Improved PPO in Ray Framework. Drones 2022, 6, 166.
  33. Yu, Y.; Wang, T.; Liew, S.C. Deep-Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks. IEEE J. Sel. Areas Commun. 2019, 37, 1277–1290.
  34. Lange, S.; Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–8.
  35. Karp, B.; Kung, H.T. GPSR: Greedy perimeter stateless routing for wireless networks. In Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, Boston, MA, USA, 6–11 August 2000; pp. 243–254.
  36. Li, M.; Gu, Z.; Long, Y.; Shu, X.; Rong, Q.; Ma, Z.; Shao, X. W-GPCR Routing Method for Vehicular Ad Hoc Networks. Sensors 2021, 21, 1998.
Figure 1. Schematic diagram of the routing model in the FANET. Blue, green, and brown circles represent the current node, one-hop neighboring nodes, and two-hop neighboring nodes, respectively.
Figure 2. Schematic diagram of the reinforcement learning. By taking an action in the current state corresponding to the environment, the agent makes the state change, and then gets the corresponding reward.
Figure 3. Schematic diagram of using snapshots to train and test the model of the proposed method. $N_{\mathrm{train}} = 2000$ and $N_{\mathrm{test}} = 500$, respectively, denote the training set size and the test set size in the simulation.
Figure 4. Convergence performance of the proposed method under different decay values.
Figure 5. PLR performance comparison of different routing algorithms.
Figure 6. Delay performance comparison of different routing algorithms.
Table 1. Simulation parameters.
Simulation area: 1000 m × 1000 m
Node transmission range $R$: 120 m
Node number: 45
Available bandwidth $B$: 1 Mb∼5 Mb
Packet size: 512 Byte
SINR $\eta$: 10 dB∼40 dB
MAC queue length $L$: 1 Kb∼10 Kb
Moving speed: 3 m/s∼10 m/s
$\gamma$, $\varepsilon$, $\alpha$: 0.9, 0.6, 0.01
Decay rate of $\varepsilon$, $\varepsilon_{\mathrm{decay}}$: 0.995
Minimum value of $\varepsilon$, $\varepsilon_{\min}$: 0.001
Experience-replay memory capacity: 200
Target network update frequency $F_t$: 500
Experience-replay minibatch size: 32
$\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$: 0.25, 0.25, 0.2, 0.3
$K$, $Z$, $Z_s$: 5000, 3000, 600
Training/test set size: 2000/500
Table 2. Performance summary.
Method | FIR-OH | GPSR | W-GPCR | FIR-CD | FIR-DLT | DRL-FIR
PLR | 21.2% | 36.3% | 23.6% | 25.6% | 19.4% | 13.8%
Delay (s) | 2.72 | 2.96 | 2.75 | 2.54 | 2.94 | 2.31
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
