A Survey on Applications of Reinforcement Learning in Flying Ad-Hoc Networks

Abstract: Flying ad-hoc networks (FANETs) are one of the most important branches of wireless ad-hoc networks, consisting of multiple unmanned aerial vehicles (UAVs) performing assigned tasks and communicating with each other. Nowadays, FANETs are used for commercial and civilian applications such as handling traffic congestion, remote data collection, remote sensing, network relaying, and delivering products. However, some major challenges, such as adaptive routing protocols, flight trajectory selection, energy limitations, charging, and autonomous deployment, need to be addressed in FANETs. Several researchers have been working for the last few years to resolve these problems. The main obstacles are the high mobility and unpredictable topology changes of FANETs. Hence, many researchers have introduced reinforcement learning (RL) algorithms in FANETs to overcome these shortcomings. In this study, we comprehensively survey and qualitatively compare the applications of RL in different FANET scenarios such as routing, flight trajectory selection, relaying, and charging. We also discuss open research issues that can provide researchers with clear and direct insights for further research.


Introduction
Flying ad-hoc networks (FANETs) are gaining popularity because of their versatility, easy deployment, high mobility, and low operational cost [1]. FANETs are usually formed by unmanned aerial vehicles (UAVs), which can fly autonomously or can be controlled remotely [2]. UAVs have long been used by militaries around the globe for surveillance and rescue purposes [3]. Nowadays, with the advancement of technology, UAVs are extensively used in many domains for sensitive tasks such as traffic monitoring [4], disaster monitoring [5], relaying for other ad-hoc networks [6], remote sensing [7], and wildfire monitoring [8]. Multiple UAVs can be used to perform different tasks individually; however, in a FANET, the UAVs must communicate with each other and coordinate accordingly, as shown in Figure 1. A FANET is an ad-hoc network of UAVs. Generally, small UAVs are used in FANETs because coordination and collaboration among small UAVs can outperform large UAVs. Moreover, small UAVs have low acquisition and operational costs, increased scalability, and better survivability [9]. However, FANETs have some major challenges to overcome, such as:

• Communication: UAVs can move at high speed, which poses difficulties in maintaining communication with other UAVs. In addition, the distance among nodes is greater than in other ad-hoc networks [10].
• Power constraint: Generally, UAVs carry batteries as a power supply, which limits their operations and flying time. Increasing the capacity of the battery may degrade the performance of the UAVs after a certain point owing to the energy-to-weight ratio. Therefore, effective battery and charging management is one of the major challenges of FANETs [11].
• Routing protocol: Routing in FANETs is also a challenge owing to the high mobility and power constraints of the UAVs. Many routing protocols have been designed for ad-hoc networks, but FANETs require a highly dynamic routing protocol to cope with the dynamic changes in the FANET topology [12].
• Ensuring QoS: There are also some quality of service (QoS) related challenges that should be addressed, such as ensuring low latency, determining the trajectory path to provide service, synchronization among UAVs, and protection against jamming attacks.

Many researchers have been working for the last few years to overcome these challenges. They have been using different techniques related to artificial intelligence (AI) so that the network can autonomously and adaptively learn and overcome the challenges by itself. Reinforcement learning (RL) is one of the most important algorithms, with a significant contribution to the development of AI [13][14][15]. RL is popular for its trial-and-optimize scheme. RL consists of an agent and an environment, in which the agent explores the environment by taking actions and reaches an optimal policy for the system [16]. However, to achieve the optimal policy, the agent must know the entire system, which makes RL time-consuming and unsuitable for large networks. With the computational capability provided by GPUs, this problem can be addressed by integrating deep neural networks (DNNs) into RL, namely deep reinforcement learning (DRL) [15,17].
Currently, there is no survey discussing the applications of RL in FANETs. This motivates us to deliver this survey, covering the fundamentals of RL and DRL along with a comprehensive literature review on the applications of RL and DRL to address the challenges in FANETs. The major issues include the routing protocol, flight trajectory selection, charging UAVs, anti-jamming, and ensuring the QoS of FANETs.

Fundamentals of Deep Reinforcement Learning
In this section, we briefly discuss the internal structure, decision-making process, and convergence process of reinforcement learning (RL) and deep reinforcement learning (DRL).

Reinforcement Learning
Reinforcement learning is an effective and extensively used AI tool that learns about the environment by taking different actions and achieves an optimal policy for operation. RL consists of two main components: an agent and an environment. The agent explores the environment and decides which action to take using the Markov decision process (MDP) [18].
MDP is a framework for modeling decision-making problems and helps the agent control the process stochastically [18]. MDP is a useful tool for dynamic programming and RL techniques. Generally, an MDP has four parameters represented by the tuple (S, A, P, R), where S is a finite state space, A is a finite action space, P is the transition probability from the present state s to the next state s' after taking action a, and R is the immediate reward given by the environment for action a [19]. As shown in Figure 2, at each time step t, the agent observes its present state s_t in the environment and takes action a_t. Then, the agent receives a reward r_t and the next state s_{t+1} from the environment. The main goal of the agent is to determine a policy π to accumulate the maximum possible reward from the environment. In the long term, the agent also tries to maximize the expected discounted total reward defined by max E[Σ_{t=0}^{∞} γ^t r(s_t, π(s_t))], where γ ∈ [0, 1] is the discount factor. Using the discounted reward, a Bellman equation named the Q-function (Equation (1)) is formed to take the next action using MDP when the state transition probabilities are known in advance. The Q-function can be expressed as

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [r_t + γ max_a Q(s_{t+1}, a)],    (1)

where α is the learning rate. RL with a Q-function is also known as Q-learning. Initially, the agent explores every state of the environment taking different actions and forms a Q-table using the Q-function for each state-action pair. Then, the agent starts exploiting the environment by taking actions with the maximum Q-value from the Q-table. This policy is known as the ε-greedy policy, where the agent explores or exploits the environment depending on the value of the probability ε. An illustration of Q-learning is presented in Algorithm 1.

Algorithm 1
The Q-learning Algorithm.
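The Q-learning update and ε-greedy policy described above can be sketched in a few lines of Python. This is a minimal tabular example on a hypothetical five-state chain environment; the environment, hyperparameters, and names are illustrative, not taken from the survey:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical toy environment: a 1-D chain of states 0..4; reaching state 4
# yields reward 1.  All hyperparameters are illustrative.
N_STATES = 5
ACTIONS = [-1, +1]                     # move left / move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration prob.

Q = defaultdict(float)                 # Q-table: (state, action) -> Q-value

def epsilon_greedy(state):
    """Explore with probability EPSILON (and on ties), else exploit the best action."""
    if random.random() < EPSILON or Q[(state, -1)] == Q[(state, +1)]:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def step(state, action):
    """Environment transition: clamp to the chain; reward 1 on reaching the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        a = epsilon_greedy(s)
        s_next, r = step(s, a)
        # Bellman update: Q(s,a) <- (1-alpha)Q(s,a) + alpha[r + gamma * max_a' Q(s',a')]
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
        s = s_next
```

After training, the learned Q-values reflect the discounted distance to the goal: from state 3, moving right (toward the reward) has a higher Q-value than moving left.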

Deep Reinforcement Learning
The Q-learning algorithm is efficient when the action and state spaces are comparatively small. However, the system becomes more complicated for large action and state spaces. In this situation, the Q-learning algorithm may not be able to achieve an optimal policy owing to the complex and large Q-table. To overcome this problem, researchers replaced the Q-table with a deep neural network (DNN) and named the result deep Q-learning (DQL) [15]. DQL is a deep reinforcement learning (DRL) algorithm that works with Q-values similar to Q-learning, except for the Q-table part, as shown in Figure 3. The main goal of the DNN is to skip manual calculations each time by learning from the data. A DNN is a computational nonlinear model inspired by the structure of the human brain, which can learn and perform tasks such as decision-making, prediction, classification, and visualization [20]. It is composed of neurons arranged in multiple layers. It typically has one input layer, several hidden layers (two in Figure 4), and one output layer, interconnected as depicted in Figure 4 [21]. The input layer accepts the inputs with the input neurons and sends them to the hidden layers. The hidden layers then send the data to the output layer. Every neuron has a weighted input, an activation function, and an output. The activation function determines the output depending on the input of the neuron [22]. It acts as a trigger that depends on the weighted input. During the training phase, the weights of the neurons' inputs are updated based on the outputs of the output layer using backpropagation. The agent takes the output of the policy DNN, compares it with a target DNN model, and calculates the error [23]. Then, the agent updates the policy DNN using backpropagation. This process is generally referred to as optimization with gradient descent. After a certain time, the agent updates the target DNN using the policy DNN. For a more stable convergence to the optimal policy, experience replay memory (ERM) is introduced into the DQL framework
[24,25]. The agent takes different actions and saves the present states, obtained rewards, next states, and actions taken in the ERM [24,25]. Then, the agent takes a mini-batch of data from the ERM and trains the policy DNN. Figure 5 and Algorithm 2 illustrate the framework and flow of the DQL [26]. Thus, the agent can make decisions efficiently and in a timely manner using the learned DNN.

Algorithm 2
The Deep Q-learning Algorithm.
Require: Initialize the policy and target DQL networks with random weights θ and θ⁻, respectively.
Require: Initialize the experience replay memory (ERM).
Require: Initialize ε.
for t = 1, 2, . . . do
    Select an action a_t for the present state s_t based on probability ε.
    Observe the immediate reward r_t and the next state s_{t+1}.
    Insert (s_t, a_t, r_t, s_{t+1}) into the ERM.
    Create a mini-batch with random samples of (s, a, r, s') from the ERM.
    Optimize the weights of the policy DNN with gradient descent.
    θ⁻ ← θ after a certain number of time steps.
end
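The DQL training flow of Algorithm 2 can be illustrated compactly in Python. This sketch substitutes a linear function approximator for the policy and target DNNs but keeps the essential DQL machinery: the ERM, mini-batch sampling, a gradient-descent update toward the target network's bootstrap value, and periodic target synchronization. The toy environment and all hyperparameters are assumptions for illustration:

```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)

# Toy 1-D chain environment as a stand-in for a real task; a linear approximator
# (one weight row per action) replaces the policy and target DNNs.
N_STATES, N_ACTIONS = 5, 2
GAMMA, LR, EPSILON = 0.9, 0.05, 0.1
BATCH, SYNC_EVERY = 16, 50

W_policy = np.zeros((N_ACTIONS, N_STATES))  # "policy DNN" weights (theta)
W_target = W_policy.copy()                  # "target DNN" weights (theta-)
erm = []                                    # experience replay memory

def onehot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def act(s):
    """Epsilon-greedy action selection (random tie-breaking while untrained)."""
    qs = W_policy @ onehot(s)
    if random.random() < EPSILON or qs[0] == qs[1]:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(qs))

def env_step(s, a):
    """Action 1 moves right, action 0 moves left; reward 1 at the last state."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

s = 0
for t in range(1, 3001):
    a = act(s)
    s_next, r = env_step(s, a)
    erm.append((s, a, r, s_next))                # insert transition into ERM
    s = 0 if s_next == N_STATES - 1 else s_next  # restart the episode at the goal

    if len(erm) >= BATCH:
        for (si, ai, ri, sn) in random.sample(erm, BATCH):  # mini-batch from ERM
            target = ri + GAMMA * np.max(W_target @ onehot(sn))
            td_error = target - (W_policy @ onehot(si))[ai]
            W_policy[ai] += LR * td_error * onehot(si)      # gradient-descent step
    if t % SYNC_EVERY == 0:
        W_target = W_policy.copy()  # periodically sync target <- policy
```

A real DQL agent would replace the linear weights with a multi-layer network trained by backpropagation, but the loop structure is the same.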

Fundamentals of FANET
In this section, we briefly discuss the architectural design and characteristics of FANETs. We also compare FANETs with other ad-hoc networks such as vehicular ad-hoc networks (VANETs), robot ad-hoc networks (RANETs), ship ad-hoc networks (SANETs), smartphone ad-hoc networks (SPANs), and wireless sensor networks (WSNs). Finally, we discuss the optimal FANET design that researchers are trying to achieve.

FANET Architecture
The architecture of a FANET is similar to that of a MANET, as it is a subset of MANETs. A FANET contains multiple manned or unmanned aerial vehicles and ground gateway units (GGUs) communicating with each other in an ad-hoc manner [9,27]. There are different types of topologies in FANETs, such as:
• Centralized topology: An example of a centralized topology is shown in Figure 6, where all UAVs communicate directly with a GGU to transmit data to the control center. In this topology, UAVs also communicate with each other via the GGU [28]. This topology is more fault-tolerant but requires higher bandwidth, causes high latency, and constrains high-speed UAV mobility. Furthermore, setting up GGUs for multiple UAV groups is not economically feasible.

Characteristics of FANET
FANETs have some unique characteristics that make them different from other ad-hoc networks. Some of the major characteristics are as follows:
• Node mobility and model: There are different types of aerial vehicles, which can move at an average speed of 6-500 km/h [9]. Thus, node mobility is the most important distinguishing factor that sets FANETs apart from other ad-hoc networks. Furthermore, node mobility results in several challenges in communication design. In a FANET, UAVs can move freely in any direction and at any speed, depending on the task. By contrast, other ad-hoc networks have regular, low, predefined, and controlled mobility [27]. Moreover, the high mobility in FANETs results in frequent changes in network topology compared to other ad-hoc networks.

• Node density: In wireless ad-hoc networks, node density is a crucial parameter for selecting the data routing path. In FANETs, node density mostly depends on the type of UAV, the objective, the UAV speed, and the communication range. As UAVs can be fast and have a long communication range, the number of UAVs per unit area can decrease [30]. In other ad-hoc networks, such as VANETs, SANETs, WSNs, and SPANs, the node density is high compared to FANETs [31].
• Localization: In ad-hoc networks, the global positioning system (GPS) is widely used to locate the nodes. However, owing to their high-speed mobility, FANETs use low-latency positioning systems to locate the UAVs, such as network-based positioning [32], height-based positioning [33], differential GPS (DGPS) [34], and assisted GPS (AGPS) [35]. Moreover, localization is a major factor in flight trajectory and routing path selection.

• Radio propagation: When it comes to the radio propagation model, FANETs have a great line-of-sight (LoS) advantage over other ad-hoc networks. In a FANET, UAVs can have a clear LoS among them owing to their free mobility in the air. By contrast, in other ad-hoc networks, there is little or no LoS between the source and the destination owing to the geographical structure of the terrain.
• Energy constraint: Energy limitation is one of the major design issues in ad-hoc networks. In FANETs, it depends on the size of the UAV. Most large UAVs are not power-sensitive, whereas energy limitation is a concern for mini-UAVs [9]. In other ad-hoc networks, it varies from type to type, as shown in Table 1.

Optimal FANET Design
Many researchers are trying to establish an optimal solution for FANETs that is more adaptable to any situation and more scalable to any extent. We discuss some optimal conditions that many researchers are trying to achieve. Moreover, we discuss the advantages of using RL over conventional methods in FANETs.
As discussed earlier, FANETs are unpredictable in nature owing to their high mobility and speed. The flying routes may vary from UAV to UAV in a multi-UAV system, depending on the operational requirements. More UAVs can join an ongoing operation to complete the task faster. UAVs may also fail owing to technical problems or environmental issues. There are many variables in the FANET environment that need to be addressed. Thus, the optimal design should be more adaptive, very fast, highly scalable, energy-efficient, more stable, and highly secure.
To achieve these features, there is no alternative to RL owing to its self-learning capability and energy efficiency. The conventional methods of selecting routing paths and flying trajectories are energy-inefficient and slow. Moreover, they are not self-learning methods. To make the design solutions more adaptive and scalable, UAVs should learn to make their own decisions based on the current situation. To establish self-learning design solutions, researchers have started using RL. Furthermore, many other problems, such as autonomous charging, jamming protection, relaying, localization, and fault handling, can be addressed using RL.

Applications of RL in FANET
In this section, we discuss in detail the challenges of FANETs that researchers have solved with RL or DRL and how they implemented RL or DRL in FANETs. We focus on the main challenges of FANETs, such as the routing protocol, flight trajectory selection, and protection against jamming, along with other challenges such as charging and relaying, as shown in Figure 9.


Routing Protocol
We discuss the basics of the routing protocol and the RL-based approaches for solving routing problems such as energy consumption, end-to-end delay, and path stability, and we present a comparative analysis among them. The routing protocol specifies how one node communicates with other nodes in a wireless ad-hoc network [12]. Figure 10 illustrates two possible routing paths from source to destination in a multi-UAV FANET. The main goal of the routing protocol is to direct the traffic toward the destination regardless of node mobility [41]. There are no dedicated routing protocols currently available for FANETs [41]. FANETs still use conventional routing protocols designed for mobile ad-hoc networks (MANETs) and VANETs. There are different types of conventional routing protocols [42], given as follows:
• Proactive routing: As in wired network routing, all nodes in the ad-hoc network maintain a route table consisting of routes to other nodes. Whenever a node transmits data, the route table is used to determine the route to the destination. The route table is continually updated to track changes in topology. This type of routing protocol is unsuitable for FANETs owing to the frequent high-speed mobility of the nodes [43].

• Reactive routing: Whenever a node initiates communication, this type of routing protocol starts discovering routes to the destination. Predefined routing tables are not maintained in this protocol. These types of routing protocols are known as on-demand routing protocols. The main drawbacks of this protocol in terms of FANETs are poor stability, high delay, high energy consumption, and low security [44].

• Hybrid routing: This is a combination of, and a trade-off between, proactive and reactive routing protocols. In this protocol, nodes maintain a route table consisting of routes to their neighbors and start route discovery whenever they try to communicate with nodes beyond their neighbors.

Owing to the complex flying environment and high mobility, UAV nodes are unpredictable [47]. Hence, the conventional protocols of VANETs and MANETs cannot cope with changes in the network in real time. Therefore, many researchers have attempted to develop self-learning, highly reliable, adaptive, and autonomous routing protocols using reinforcement learning (RL) [48]. The main purpose of using RL in FANET routing is to ensure fast and stable routing with minimum energy consumption.

QMR
Liu et al. [12] proposed a Q-learning-based multiobjective optimization routing protocol (QMR) in which end-to-end delay and energy consumption are optimized simultaneously. They also dynamically change the Q-learning parameters, such as the learning rate, discount factor, and ε-value for exploration and exploitation. QMR consists of routing neighbor discovery, a Q-learning algorithm, a routing decision, and a penalty mechanism. Initially, QMR collects the geographic locations of neighbors using a global positioning system (GPS) and sends HELLO packets to start the route discovery process. Each HELLO packet contains the node's geo-location, energy, mobility model, queuing delay, and discount factor. Nodes maintain and update their neighbor tables upon receiving HELLO packets. A neighbor table contains the arrival time, learning rate, MAC delay, and Q-value along with the information of the HELLO packet [12].
After route discovery, QMR selects a neighbor to forward the data packet using Q-learning. The Q-learning algorithm considers energy consumption, link stability, one-hop delay, and neighbor relationships to select the next hop for data forwarding. The learning rate of the algorithm is an exponential adaptive function that depends on the one-hop delay. The discount factor varies with the velocity of the neighbor: for faster neighbors, the discount factor is low, and vice versa. Moreover, the trade-off between exploration and exploitation depends on the actual velocity of the data packet traveling over a link, the link quality, and the neighbor relationship [12].
By incorporating all these variables, the source node computes weighted Q-values and forms a Q-table, where the weight represents the link quality and neighbor relationship. Then, the source node selects the link with the maximum weighted Q-value to forward the data and obtains the maximum reward [12]. If there is no neighbor with a nonzero weighted Q-value, the source node receives the minimum reward for all neighbors, updates the neighbor table, and searches for new neighbors using route discovery [12].
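QMR's weighted next-hop selection, including the fallback to route discovery when every weighted Q-value is zero, can be illustrated as follows. The neighbor-table entries, field names, and weight values here are hypothetical, not taken from [12]:

```python
# Hypothetical neighbor table for one UAV: a Q-value per neighbor plus a weight
# in [0, 1] capturing link quality and neighbor relationship (values made up).
neighbor_table = {
    "uav_2": {"q": 0.80, "weight": 0.9},  # stable, high-quality link
    "uav_3": {"q": 0.95, "weight": 0.3},  # high Q-value but poor link quality
    "uav_4": {"q": 0.00, "weight": 0.0},  # unreachable neighbor
}

def select_next_hop(table):
    """Pick the neighbor with the largest weighted Q-value; if every weighted
    value is zero, signal that route discovery must be restarted."""
    weighted = {n: e["q"] * e["weight"] for n, e in table.items()}
    best = max(weighted, key=weighted.get)
    if weighted[best] == 0.0:
        return None  # QMR's penalty case: no usable neighbor
    return best

print(select_next_hop(neighbor_table))  # uav_2 (0.8 * 0.9 = 0.72 beats 0.285)
```

Weighting the Q-value by link quality keeps the protocol from greedily choosing a high-valued but unstable link.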

RLSRP with PPMAC
The reinforcement-learning-based self-learning routing protocol (RLSRP) with position-prediction-based directional MAC (PPMAC) is a hybrid communication protocol proposed in [49], wherein PPMAC resolves the directional deafness problem of directional antennas and RLSRP provides the routing path using RL.
In [49], Zheng et al. predicted the positions of other nodes and controlled communication and data transmission using the PPMAC scheme. The authors used self-learning RL to determine the route with the shortest delay from the source to the destination. A partially observable Markov decision process (POMDP) is incorporated into the proposed RL algorithm, where the end-to-end data transmission delay is provided as the reward. Similar to QMR, RLSRP maintains a neighbor table to keep track of changes in the network topology. The learning parameters, such as the discount factor and learning rate, are fixed. Moreover, RLSRP uses a greedy policy and selects the route with the maximum value function, where the end-to-end delay is minimum.

Multiobjective Routing Protocol
Yang et al. [50] proposed a Q-learning-based fuzzy-logic multiobjective routing protocol. The source node determines the routing path using the proposed algorithm while considering the transmission rate, residual energy, energy drain rate, hop count, and successful packet delivery time. A fuzzy system is used to identify reliable links, and Q-learning supports the fuzzy system by providing a reward on the path [50]. The algorithm considers not only single-link performance but also whole-path performance, using two types of Q-values from two Q-learning algorithms. After obtaining the Q-values for the single links and the entire path, the fuzzy logic evaluates the Q-values and determines the optimal path for routing. Moreover, the learning parameters, such as the discount factor and learning rate, are fixed in the Q-learning algorithm.
Similarly, in [51], He et al. determined the routing path using a fuzzy-logic-based RL algorithm, but they considered delay, stability, and bandwidth efficiency factors. Figure 11 summarizes the applications of RL in the routing protocol via block diagrams. Moreover, a comparative analysis of the aforementioned protocols is presented in Table 2.
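The idea of ranking routes by combining a per-link Q-value and a whole-path Q-value through fuzzy rules can be sketched as follows. The membership thresholds and the min-based aggregation are illustrative simplifications, not the exact fuzzy system of [50]:

```python
# Illustrative sketch (not the exact scheme of [50]): combine a per-link
# Q-value and a whole-path Q-value with fuzzy-style grades to rank routes.
def fuzzy_score(q_link, q_path):
    """Grade each Q-value as low/medium/high; reject routes with any 'low'
    component, otherwise aggregate conservatively (a fuzzy AND via min)."""
    def grade(q):
        return "high" if q >= 0.7 else ("medium" if q >= 0.4 else "low")
    if "low" in (grade(q_link), grade(q_path)):
        return 0.0
    return min(q_link, q_path)

# Candidate routes as (single-link Q, whole-path Q) pairs -- values made up.
routes = {"A": (0.9, 0.5), "B": (0.8, 0.8), "C": (0.95, 0.3)}
best = max(routes, key=lambda r: fuzzy_score(*routes[r]))
print(best)  # B: min(0.8, 0.8) = 0.8 beats A's min(0.9, 0.5) = 0.5; C is rejected
```

The conservative aggregation captures the intuition that a route is only as good as its weakest component, which is why the whole-path Q-value matters alongside the single-link one.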

Table 2. Comparative analysis of RL-based routing protocols (routing protocol, algorithm, advantages, and limitations).

QMR [12]
Algorithm: Q-learning with a dynamic learning rate, dynamic discount factor, and an adaptive mechanism for exploration and exploitation.
Advantages:
1. Multiple objectives, such as end-to-end delay and energy consumption, are considered.
2. Dynamic and adaptive Q-learning parameters, such as the learning rate and discount factor, based on nodes' velocity and link stability.
3. An adaptive mechanism is used for balancing exploration and exploitation.
4. A penalty mechanism is used to combat the "neighbor unavailability" problem.
Limitations:
1. Re-establishing communication is uncertain if a node gets lost.
2. Whole-route stability is not considered.
3. Computational energy consumption is not considered.

RLSRP with PPMAC [49]
Algorithm: Reinforcement learning with a partially observable Markov decision process (POMDP).
Advantages:
1. The positions of nodes are predictable.
2. The antenna direction can be changed toward the routing direction.
Limitations:
1. Only end-to-end delay is considered for route selection.
2. There is no adaptive mechanism for balancing exploration and exploitation.
3. Broadcasting is used for re-establishing communication with other nodes.
4. Computational energy consumption is not considered.

Multiobjective Routing Protocol [50,51]
Algorithm: Q-learning-based fuzzy logic.
Advantages:
1. Multiple factors, such as the transmission rate, residual energy, energy drain rate, hop count, and successful packet delivery time, are considered.
2. Both single-link and whole-route performances are considered, using two Q-values from two Q-learning algorithms.
3. Fuzzy logic is used to select the optimal route.
Limitations:
1. There is no adaptive mechanism for balancing exploration and exploitation.
2. There is no mechanism to remedy the "neighbor unavailability" problem.
3. Computational energy consumption is not considered.

Flight Trajectory Selection
We discuss the basics of UAV flight trajectory and the RL-based approaches for solving flight trajectory selection problems such as energy consumption [52], data fetching, QoS, quality of experience (QoE), coverage [53,54], and obstacle avoidance [55]. In addition, we present a comparative analysis.
As the existence of FANETs rests on their flying nodes, selecting the flying trajectory is a crucial factor in autonomous flying scenarios. There are various uses of FANETs in which flying trajectory selection plays a vital role. Using FANETs as portable interconnected aerial base stations (BSs) is one of the major commercial and civilian applications of FANETs, because UAV base stations (UBSs) can be easily deployed to handle temporary traffic congestion, provide emergency coverage in disaster areas, ensure QoS, or collect data from remote Internet of Things (IoT) devices regardless of terrestrial territory, as shown in Figure 12 [14]. Moreover, UAVs can also be used to deliver products to people's doorsteps. Owing to the complex flying environment, limited data memory, limited power supply, user mobility, and various QoS requirements, many researchers have proposed different trajectory designs that incorporate RL. The main reason for using RL is to obtain an optimal solution for the aforementioned challenges. The applications of RL in flight trajectory selection are summarized below, and a comparative analysis is presented in Table 3.

Q-SQUARE
Q-SQUARE is a Q-learning-based UAV flight planning algorithm, proposed in [52], that improves the quality of experience (QoE) of video users. A macro BS is considered to serve several user clusters that require video streaming. Multiple UBSs hover over multiple clusters with prefetched or on-demand data, depending on the QoE demand of the clusters, without interfering with each other. The flying path is determined by the Q-learning algorithm, which considers the location of the cluster with a high QoE requirement, the residual energy of the UBS, and the flying time. Paths to multiple recharge points are also considered for recharging the UBSs. While hovering over a cluster, if the energy level approaches a certain threshold, the UBS flies to the charging point to recharge and comes back. The UBS flies back to the macro BS if more video data need to be fetched. Here, the UBS agent follows the ε-greedy policy to determine the flight trajectory [52].
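The energy- and buffer-aware action selection described above can be sketched as follows. The action names, threshold, and Q-values are hypothetical; the actual state and action design in [52] is richer:

```python
import random

random.seed(1)

# Hypothetical action set and threshold for a UBS agent (illustrative only).
ACTIONS = ["serve_cluster", "fly_to_charger", "fetch_from_macro_bs"]
BATTERY_THRESHOLD = 0.2
EPSILON = 0.1

def choose_action(q_values, battery, video_buffer_empty):
    """Hard rules override learning: below the energy threshold the UBS must
    recharge; with an empty buffer it must fetch; otherwise act epsilon-greedily."""
    if battery < BATTERY_THRESHOLD:
        return "fly_to_charger"
    if video_buffer_empty:
        return "fetch_from_macro_bs"
    if random.random() < EPSILON:
        return random.choice(ACTIONS)       # explore
    return max(q_values, key=q_values.get)  # exploit the best known action

q = {"serve_cluster": 0.9, "fly_to_charger": 0.1, "fetch_from_macro_bs": 0.2}
print(choose_action(q, battery=0.15, video_buffer_empty=False))  # fly_to_charger
print(choose_action(q, battery=0.80, video_buffer_empty=True))   # fetch_from_macro_bs
```

Encoding the battery rule as a hard constraint rather than leaving it to the learned Q-values guarantees the UBS never strands itself with an empty battery while exploring.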

Decentralized Trajectory Design
A scenario is considered in [56] where multiple UAVs perform real-time sensing and transmission tasks. The main objective is to determine decentralized flight trajectories using the opponent-modeling Q-learning algorithm to transmit data efficiently using the sense-and-send protocol. The opponent-modeling Q-learning algorithm is a variant of Q-learning in which each agent models the behavior of the other agents. Moreover, the agents use a greedy policy to achieve optimal solutions for joint trajectory design and power allocation.

Multi-UAV Deployment and Movement Design
Multiple UAVs are deployed in a 3D space to serve mobile users in [59]. The Q-learning algorithm is used to solve the NP-hard problem [60] of 3D deployment and movement toward the users while considering user mobility. The main goal is to maximize the sum mean opinion score (MOS) of the users while maintaining the QoE.

Liu et al. [59] proposed a three-step solution in which they used the k-means algorithm to cluster the users and then trained the UAV agents using a Q-learning algorithm to find their optimal 3D positions with respect to the mobile users. Finally, they also used a Q-learning algorithm to determine the flying trajectory toward the moving users. However, there is considerable scope for implementing deep Q-learning to overcome constraints such as inter-cluster user mobility and UAVs flying in all possible directions. Ghanavi et al. adopted a similar approach to maintain QoS in [61]. However, a double Q-learning approach was used instead of simple Q-learning in [62] for similar 3D scenarios and achieved a 14.1% gain in user satisfaction compared to simple Q-learning.
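The first step of this three-step scheme, clustering ground users with k-means to obtain initial UAV hover positions, can be sketched as follows. The user coordinates and the deterministic centroid initialization are illustrative assumptions:

```python
# Minimal 2-D k-means to cluster ground users before UAV placement -- the first
# step of the three-step scheme.  Coordinates and initialization are made up.
users = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 8.5), (8.5, 9)]

def kmeans(points, k, iters=20):
    # Deterministic init for this sketch: pick evenly spaced points as centroids.
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each user to the nearest centroid
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # move each centroid to its cluster mean
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Each centroid is a candidate initial hover position for one UAV.
centroids, clusters = kmeans(users, k=2)
```

After clustering, a Q-learning agent per UAV would refine the 3D position around its cluster centroid and then track the cluster as the users move.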

Trajectory Optimization for UBS
Bayerlein et al. [55] optimized the trajectory of a UBS using Q-learning to maximize the sum rate for multiple users. The authors considered a scenario in which a UBS agent flies at a fixed altitude to serve multiple ground users. A cuboid obstacle was also considered in this scenario. The UBS selects the flying trajectory toward the users while avoiding the obstacle using both table-based and neural network (NN) based Q-learning. Finally, the authors compared the results of the table-based and NN-based Q-learning approaches, showing that NN-based Q-learning is more efficient and scalable.
A similar approach was taken in [53], where Klaine et al. used UBSs to provide emergency radio coverage in disaster areas. The main goal of the approach was to provide an efficient emergency network, maximizing coverage and sum rate while avoiding obstacles and interference.

Other Scenarios
There are other use cases and challenges of FANETs, such as charging UAVs, using UAVs as network relays, and using UAVs for protection against jamming, that have been addressed by researchers using RL. The applications of RL in these scenarios are summarized in Table 4.

Protection against jamming: Following their earlier work on jamming protection for other networks [64], the authors developed an adaptive model-free jamming defense mechanism based on federated Q-learning with a spatial retreat strategy for FANETs [65].

Charging UAVs (deep Q-learning): The mobile charging scheduling problem is interpreted as an auction problem in which each UAV bids its own valuation, and the charging station then schedules the drones based on the bids in terms of revenue optimality. The charging auction enables efficient scheduling of UAVs for charging by learning the bid distribution using DQL [11].

Open Research Issues
This section discusses and highlights possible future research issues based on the analysis performed in the previous sections. We have summarized and compared multiple applications of RL in routing protocols and flight trajectory selection, and we have summarized the applications of RL in other FANET issues. In designing routing protocols or selecting flight trajectories, multiple researchers have implemented RL and attempted to solve different issues. However, there are still some open research issues in FANETs that have not been addressed by any studies. These open research issues are summarized below:

• Energy constraint: UAVs carry batteries as the main power source to support all functionalities, such as flying, communication, and computation. However, the capacity of the batteries is insufficient for long-term deployment. Many researchers have used solar energy for on-board energy harvesting and used RL to optimize energy consumption. Unfortunately, these solutions are not sufficient for long flights. This opens a key research issue in which UAVs could harvest power wirelessly from nearby roadside units, base stations, or power beacons for communication and computation functionalities utilizing RL. Another way to address the energy issue is for a UAV to exploit DRL to visit charging stations while other UAVs fill the void.
• 3D deployment and movement: Many studies have been carried out regarding deployment and movement. However, most researchers have made significant assumptions, such as constraining UAV and user mobility [59] or reducing the action-state space [56], in multi-UAV scenarios. Consequently, 3D deployment and movement design considering all the constraints is still an open research issue of FANETs. Furthermore, it is also important for cooperative communication with other networks, where UAVs act as relays.

• Routing issues: Only a few works have been done on routing protocols utilizing RL for FANETs. The routing protocol is crucial for FANETs owing to their high node mobility, low node density, and 3D node movement. There is still scope for improvement, such as handling the no-neighbor problem, multiflow transmission, the directional antenna problem, and scalability issues, utilizing RL. Moreover, extending the routing protocols of VANETs and MANETs to FANETs using RL is still an open research issue.

• Interference management: UAVs commonly use WiFi to communicate with each other. However, interference can occur when the working areas of two different FANETs with different targets overlap. Furthermore, UBSs can interfere with each other's UAV-to-ground communication owing to their high moving speed. These scenarios are still open challenges where RL can be utilized.

• Fault handling: Faults are widespread in any network, and fault handling is crucial in FANETs to avoid interruption. However, there are no existing RL-based solutions that can handle faults such as UAV hardware problems, failures of equipped components, or communication failures caused by software issues. Thus, fault handling using RL needs to be explored in depth.

• Security issue: Many RL-based strategies have been developed to prevent jamming and cyber attacks in MANETs and VANETs [66]. However, few RL-based security solutions are available for FANETs. Even if all the aforementioned issues were solved, communication in a FANET could still be interrupted by a security breach. Consequently, RL-based security solutions require in-depth investigation.
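To make the routing direction above concrete, the following is a minimal, illustrative sketch of RL-based next-hop selection in the style of classical Q-routing (Boyan and Littman), not a protocol from any surveyed work: each node maintains Q[n][d], an estimate of the delivery time to destination d via neighbor n, and updates it from the chosen neighbor's own best estimate. The topology, link delays, learning rate, and exploration rate are all assumptions for illustration.

```python
import random

class QRoutingNode:
    """Q-routing node: Q[n][d] estimates the delivery time to destination d via neighbor n."""
    def __init__(self, nid, neighbors, n_nodes, lr=0.5):
        self.nid = nid
        self.neighbors = neighbors
        self.lr = lr
        self.Q = {n: [0.0] * n_nodes for n in neighbors}

    def best_estimate(self, dest):
        """Minimum estimated delivery time to dest over all neighbors (0 if we are dest)."""
        if dest == self.nid:
            return 0.0
        return min(self.Q[n][dest] for n in self.neighbors)

    def choose_next_hop(self, dest):
        return min(self.neighbors, key=lambda n: self.Q[n][dest])

    def update(self, next_hop, dest, link_delay, neighbor_best):
        # TD update toward (link delay + the chosen neighbor's own best estimate).
        target = link_delay + neighbor_best
        self.Q[next_hop][dest] += self.lr * (target - self.Q[next_hop][dest])

# Assumed 4-node line topology 0-1-2-3 with unit link delays.
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nodes = {i: QRoutingNode(i, nbrs, 4) for i, nbrs in links.items()}

rng = random.Random(0)
dest = 3
for _ in range(2000):
    cur = rng.choice([0, 1, 2])  # packet injected at a random non-destination node
    while cur != dest:
        nh = nodes[cur].choose_next_hop(dest)
        if rng.random() < 0.1:   # occasional random exploration
            nh = rng.choice(nodes[cur].neighbors)
        nodes[cur].update(nh, dest, 1.0, nodes[nh].best_estimate(dest))
        cur = nh
```

After the simulated packets, node 0's estimated delivery time to node 3 approaches the true 3-hop delay, and each intermediate node greedily forwards toward the destination.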

Conclusions
In this study, the latest applications of RL in FANETs have been exhaustively reviewed in terms of their major features and characteristics and qualitatively compared with each other. Although RL can be computationally expensive, it delivers promising results on major performance parameters such as energy consumption, flight time, communication delay, QoS, QoE, and network lifetime. The comparative analysis of RL applications in different FANET scenarios presented in this study can be used effectively for choosing and improving flight paths, routing protocols, charging, relaying, and more. We also discuss the RL-based open research issues of FANETs that need to be explored. Finally, it can be concluded that adaptive RL parameters and a balance between exploration and exploitation help RL converge more rapidly while overcoming the challenges of FANETs.

Figure 5. DQL framework.

Algorithm 2 The Deep Q-learning Algorithm.
Require: Initialize the policy and target DQL networks with random weights θ and θ⁻, respectively.
Require: Initialize the experience replay memory (ERM).
Require: Initialize ε.
for t = 1, 2, . . . , T do
    Select an action a_t for the present state s_t based on probability ε.
    Observe the immediate reward r_t and next state s_{t+1}.
    Insert (s_t, a_t, r_t, s_{t+1}) in the ERM.
    Create a mini-batch with random samples of (s_t, a_t, r_t, s_{t+1}) from the ERM.
    Optimize the weights θ of the policy DNN with gradient descent.
    θ⁻ ← θ after a certain number of time steps.
end for
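The DQL loop of Algorithm 2 can be sketched as follows. This is a minimal illustration, not an implementation from any surveyed paper: the "network" is a linear Q-function over one-hot states (standing in for a DNN), and the toy one-dimensional grid environment, learning rate, exploration rate, and target-sync schedule are all assumptions.

```python
import random
from collections import deque

import numpy as np

class ToyEnv:
    """Toy 1-D grid: the agent moves left/right to reach the goal cell."""
    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self._obs()

    def _obs(self):
        x = np.zeros(self.size)
        x[self.pos] = 1.0  # one-hot state encoding
        return x

    def step(self, action):  # 0 = left, 1 = right
        self.pos = max(0, min(self.size - 1, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.size - 1
        return self._obs(), (1.0 if done else -0.1), done

def q_values(theta, s):
    return theta @ s  # linear Q-function: one weight row per action

def train(episodes=300, gamma=0.9, lr=0.05, eps=0.2, sync_every=20, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    theta = np.zeros((2, env.size))  # policy network weights θ
    theta_target = theta.copy()      # target network weights θ⁻
    erm = deque(maxlen=500)          # experience replay memory
    step_count = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy action selection.
            a = rng.randrange(2) if rng.random() < eps else int(np.argmax(q_values(theta, s)))
            s2, r, done = env.step(a)
            erm.append((s, a, r, s2, done))
            # Mini-batch gradient step toward the target-network bootstrap.
            for (bs, ba, br, bs2, bd) in rng.sample(erm, min(len(erm), 16)):
                target = br + (0.0 if bd else gamma * np.max(q_values(theta_target, bs2)))
                td_err = target - q_values(theta, bs)[ba]
                theta[ba] += lr * td_err * bs
            step_count += 1
            if step_count % sync_every == 0:
                theta_target = theta.copy()  # θ⁻ ← θ
            s = s2
    return theta

theta = train()
```

After training on this toy grid, the greedy policy moves right (toward the goal) from every non-terminal cell.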

Figure 6. Centralized topology of FANET.

• Decentralized topology: In this topology, UAVs can communicate with each other as well as with the GGUs, as shown in Figure 7 [9]. This topology gives the UAVs more flexibility of movement and requires less bandwidth, but increases power consumption owing to the large overheads.

Figure 9. A taxonomy of the applications of RL in FANET.

Figure 11. Application of RL in routing protocols of FANET.

Figure 2. The agent-environment interaction in a Markov decision process: from state s_t, the agent's policy/controller selects action a_t, and the environment returns reward r_t and next state s_{t+1}.

Table 1. Comparative analysis of different ad-hoc networks.

Table 2. Comparative analysis of the routing protocols based on RL in FANET.

Table 4. Summary of other scenarios based on RL in FANET.