A Dyna-Q-Based Solution for UAV Networks Against Smart Jamming Attacks

Unmanned aerial vehicle (UAV) networks have a wide range of applications, such as in the Internet of Things (IoT) and 5G communications. However, communications among UAVs and between UAVs and ground control stations mainly use radio channels and are therefore vulnerable to cyberattacks. With the advent of software-defined radio (SDR), smart attacks that can flexibly select attack strategies according to the defender’s state information are gradually attracting the attention of researchers and potential attackers of UAV networks. A smart attack can even induce the defender to take a specific defense strategy, causing even greater damage. Inspired by symmetrical thinking, we propose a solution that uses a software-defined network (SDN) to combat software-defined radio. The proposed network architecture uses dual controllers, a UAV flight controller and an SDN controller, to achieve collaborative decision-making. Built on top of the SDN, the state information of the whole network converges quickly and is fitted to an environment model used to develop an improved Dyna-Q-based reinforcement learning algorithm. The improved algorithm integrates the power allocation and track planning of UAVs into a unified action space. Simulation results show that the proposed communication solution can effectively avoid smart jamming attacks, and that it learns faster and converges to better performance than the compared algorithms.


Introduction
Applying UAVs in the Internet of Things (IoT) enables a variety of IoT services, including video surveillance, sensor data collection, disaster relief emergency communications, and intelligent transportation. With the rapid development of IoT applications, the use of UAVs in the IoT has gradually shifted from single-service delivery (e.g., Amazon parcel delivery, power line monitoring) to UAV swarm applications (e.g., urban pollution monitoring, geological disaster prevention, and military 'bee colonies').
UAV networks will foreseeably carry more and more high-value sensitive data, but they can easily become the target of cyberattacks because their wireless channels are exposed to the air. As UAV networks face serious security threats, it is imperative to manage these threats effectively.
With the advent of software-defined radio (SDR) technology, programmable, intelligent wireless attack devices will pose a greater threat to UAV networks. A smart attacker can autonomously perceive the state of the wireless spectrum. By analyzing the characteristics of the defender's behavior, it can carry out various types of attacks, such as eavesdropping, jamming, and spoofing. Among them, the jamming attack has the lowest technical threshold, the lowest implementation cost, and the most direct effect. Because jamming directly weakens or blocks UAV network communication links, it is urgent to address this kind of smart attack.
Inspired by symmetrical thinking, a solution using a software-defined network (SDN) to combat SDR may be the best choice. Both SDN and SDR are highly flexible and controllable and can host intelligent algorithms. Although the SDN architecture has been applied successfully in MANETs (Mobile Ad-hoc NETworks) and VANETs (Vehicular Ad-hoc NETworks), the mobility of the UAV network and its rapidly changing topology make it necessary to design and deploy a new SDN architecture that meets the mission and security requirements. Similarly, there is some successful experience in deploying intelligent algorithms to ground SDNs. However, deploying intelligent algorithms to UAVs, and especially to UAV networks, still requires solving a series of problems, such as communication process optimization and convergence optimization.
The main contributions of this paper can be summarized as follows:
• We propose a module-level SDN controller design for UAV networks;
• We propose a dual-controller cooperative SDN-based UAV network wireless communication scheme;
• We propose a Dyna-Q-based reinforcement learning algorithm for the collaborative optimization of power allocation and track planning against smart jamming.
The remainder of this paper is organized as follows. We review related work in Section 2. In Section 3, we design the topology architecture and functional architecture of the SDN-based UAV network, present the design of the module-level SDN controller, and establish the network model and jamming attack model. We propose an improved Dyna-Q-based smart defense communication solution in Section 4. Simulation results and discussion are presented in Sections 5 and 6.

Related Work
Several studies on different approaches have been conducted regarding secure UAV communication, but few have focused on UAV network communication. The security problem of the UAV network has remained an open issue until now. In Section 2.1, we present the latest research advances in SDN architectures, which have previously received less attention yet have great potential in addressing UAV network communication problems. In Section 2.2, we highlight the unique security risks in the UAV network. In Section 2.3, we track the security risks and development opportunities that intelligent technology brings to UAV networks.

SDN-Based UAV Network Control
SDN architecture is essential to enhancing the controllability of networks against attackers, since it can calculate an optimal forwarding path according to near-real-time network status. This architecture has had many successful applications in ground networks, and many researchers have therefore attempted to replicate this success in UAV networks. Zhang et al. [1] designed an SDN-based network framework for UAV backbone networks. In this framework, an SDN controller is deployed in the ground control station. They showed that deploying an SDN can extend the UAVs' battery life thanks to a load-balancing algorithm integrated in the SDN controller. White et al. [2] proposed an SDN/NFV (Network Function Virtualization)-based lightweight modular network architecture that meets the high-mobility requirement of a UAV swarm. Their main contribution is the highly robust migration of many network services related to UAV networks, such as network monitoring, intrusion detection, and smooth migration of UAVs among different clusters. Zhao et al. [3] developed an SDN-based UAV network with a single centralized SDN controller to solve the problem of how to replan the locations of the relay UAVs, thus improving the QoS (Quality of Service) of a real-time video monitoring service. Alioua et al. [4] implemented SDN-based UAV-aided VANETs and investigated how to realize efficient data processing by offloading computing tasks and sharing state information. They modeled the tradeoff between computational delay and energy expenditure as a two-person sequential game. Barritt et al. [5] introduced a network framework called a temporospatial software-defined network (TS-SDN), which can be applied in UAV networks. Kirichek et al. [6] proposed a software-defined flight ubiquitous sensor network (FUSN) and a set of message interaction rules between a UAV and a ground control station. They suggested that sensor modules and routing control modules should be deployed in different UAVs. Rahman et al. [7] studied how to adjust the location of the SDN controller to reduce the communication overhead of control messages. Ramaprasath et al. [8] reported an SDN-based scheme to control routing, with the goal of maximizing throughput, balancing traffic, and reducing network delay. Toufga et al. [9] defined a topology-discovery service for SDN-based hybrid VANETs. Their method takes into account the flow load balancing of an SDN controller and the computing resources needed. Rahman et al. [10] studied the deployment of an SDN controller. They claimed that SDN controllers should be deployed in the central area of UAV networks to reduce the hop count of control packets and reduce network delay.

Cyber Threats Against UAV Networks
Recently, there have been some attempts to bring machine learning into UAV network defense. Kim et al. [11] evaluated the behavior of attackers targeting UAV nodes, especially automatic dependent surveillance-broadcast (ADS-B) attacks and false data injection attacks. They designed a set of rules to identify the normal behavior of UAVs, but they did not evaluate the performance of their methods in terms of detection accuracy and resource overhead. Strohmeier et al. [12] summarized the attacks targeting ADS-B components, such as eavesdropping, jamming, and false data injection, but they did not validate their countermeasures via simulation. Strohmeier [12] and Wesson [13] claimed that ADS-B is vulnerable to cyberattacks because it does not have built-in security mechanisms. They investigated a confidentiality-based communication scheme to protect the privacy of messages broadcast by ADS-B components. Unfortunately, they did not offer a solution to prevent the detected attacks. Shepard et al. [14] argued that GPS (Global Positioning System) spoofing attacks are the most lethal attacks on UAV networks and proposed an improved scheme to identify them. Manesh et al. [15] systematically reviewed the security risks and security solutions of ADS-B and divided the security solutions into ten categories, namely lightweight PKI (Public Key Infrastructure), message authentication code, µTESLA (Timed Efficient Stream Loss-tolerant Authentication), multilateration, fingerprinting, spread spectrum, distance bounding, Kalman filtering, data fusion, and traffic modelling. However, they did not consider air-to-air interference, so their review only covers interference to ground stations. Brust et al. [16] proposed a method for escorting detected malicious UAVs by transforming the formation of the UAV swarm. However, if the malicious UAV suddenly launches a jamming attack, the large number of UAVs near it will suffer disastrous consequences. Zhao et al. [17] proposed two algorithms for UAV airborne networks, a centralized deployment algorithm and a distributed motion control algorithm, which realize on-demand coverage when a disaster occurs; seen from another angle, however, a jamming UAV can use these algorithms to achieve maximum jamming efficiency.

Smart Defense Technology for Jamming Attacks
Jamming attacks should be given higher priority than other types of attacks in UAV networks, since available radio channels are the physical basis of any type of UAV communication. Using game theory and reinforcement learning techniques to tackle the smart jamming problem is an emerging research hotspot. Wu et al. [18] proposed a power allocation scheme based on a Colonel Blotto anti-jamming game, which can enhance the anti-jamming performance in cognitive radio networks. Xiao et al. [19] studied the interactions among a source node, a relay node, and a jamming node using a Stackelberg game, in which these nodes chose their transmit power in turn on the premise that they did not interfere with primary users. Tang et al. [20] proposed a Stackelberg game-based packet transmission scheme to establish a power allocation strategy that improves the SINR (Signal to Interference plus Noise Ratio) of radio channels. Xiao et al. [21] investigated the subjective decision process of smart jammers in a time-varying environment based on prospect theory. El-Bardan et al. [22] introduced a stochastic differential game model to solve power allocation problems under uncertain channel gains. Wang et al. [23] provided a systematic review of UAV networks from a cyber-physical system perspective. They summarized the security requirements of UAV networks as sensing security, storage security, communication security, actuation control security, and feedback security. They divided the UAV network into three hierarchies, i.e., the cell level, the system level, and the system-of-systems level, for detailed investigation, and also discussed the coupling effects. They outlined several uses of intelligent algorithms for UAV networks, including flight control, path planning, machine vision, and pattern recognition, among others.
In conclusion, most studies on the application of SDNs in UAV networks focus on the deployment of the SDN controller, routing control, and load balancing. To the best of our knowledge, no existing work focuses on the joint optimization of network performance and UAV track planning. Although some researchers have made progress in anti-jamming by formulating a game between UAVs and jammers, they do not consider the multi-UAV scenario. Studies that do address multi-UAV scenarios and smart attacks assume that the UAVs are stationary, which is not consistent with reality. In this paper, we focus on the smart defense communication strategy of UAV networks by optimizing the power allocation of each UAV under an action space of radio channel selection and track planning.

System Model and Network Controller Design
In this section, our goal is to construct an SDN-based UAV network and to model the network and the jamming attacks against it. In Section 3.1, we propose an SDN-based UAV network architecture. In Section 3.2, a module-level SDN controller design is given in detail. In Section 3.3, we build a hierarchical mesh UAV network model and model UAV location as a spatial grid world. In Section 3.4, we build a model of a UAV network against multiple jammers.

Network Architecture
In this subsection, we establish the topology architecture and the functional architecture of the UAV network. The UAV network topology we study is a hierarchical mesh network. We also propose the functional structure of the dual controller; the functional architecture clearly reflects the design concept that the data plane and the control plane are separated from each other.

Topology Architecture
There are six kinds of communication objects in the SDN-based UAV network under a smart jamming environment: the backbone UAV, mission UAV, GPS, ground station, jamming UAV, and jamming station, as illustrated in Figure 1. Among them, the first four belong to the UAV network, and the latter two are jamming sources, which are described in detail in Section 3.4. The UAV network topology studied in this paper is a hierarchical mesh network structure, which is widely used in engineering practice because of its strong scalability. The backbone UAVs, which have wider communication bandwidth, stronger computing capacity, and larger wireless coverage than the mission UAVs, are connected to each other to form the core layer of the UAV network. Each backbone UAV provides a communication relay service for several mission UAVs, which form a cluster, as shown by the dotted oval in Figure 1. Each of these clusters is equivalent to a small ad hoc network, and together they form the edge layer of the UAV network. Mission UAVs can be equipped with many kinds of payload, and they periodically send and receive packets to evaluate the SINR of each channel. All backbone UAVs and some mission UAVs are equipped with GPS modules to obtain their position, speed information, and positioning time. The ground station is connected to at least one backbone UAV.
The backbone UAVs periodically generate a network situation view report and send it to the ground station and other UAVs. The ground station can dispatch the latest mission plan to the UAV network.

Functional Architecture
To achieve autonomous decision-making, and unlike the conventional method of deploying an SDN controller on the ground station, we deploy a network controller that includes one UAV flight controller and one SDN controller in the core layer of the UAV network. The functional architecture of the UAV network is shown in Figure 2. The separation of the data plane and the control plane of the SDN architecture provides a solid foundation for the autonomous control of the UAV network. The control plane consists of the network controller deployed in a UAV at the core layer, and the data plane consists of all the edge-layer UAVs and the other UAVs in the core layer. The network controller chooses the optimal strategy based on the state information, including GPS data, transmission rate data, network delay data, and SINR data, collected from the data plane. The UAV flight controller is responsible for UAV track planning and flight attitude control, such as heading angle adjustment, location control, and energy control. The SDN controller is responsible for transmission channel configuration and data packet forwarding control, including routing control, congestion control, and power allocation control. The UAV flight controller and the SDN controller work collaboratively. Specifically, the flight of the UAV swarm is usually controlled by the flight controller. However, the SDN controller can fine-tune the UAV track when the network is subjected to a jamming attack.


SDN Controller Design
The control of the SDN-based UAV network adopts a large and small dual-loop design. The large loop is the loop formed by each UAV and the network controller. Each UAV periodically collects attack features, including the SINR of each channel, GPS coordinates, flight velocity, and positioning time, and sends them to the network controller, which uses them to generate a network attack situation map. The SDN controller generates the power allocation scheme and the new flow table according to the attack situation map and sends the commands and new flow table to the UAVs involved for execution. The small loop is the control loop of each UAV itself. Each UAV periodically receives a global network attack situation map from the network controller. Each UAV combines the full network situation map with its newly received situation information as the basis for its next action.
Specifically, we propose a design scheme for an SDN controller, as illustrated in Figure 3. The design diagram consists of three parts. The lower area is the information collection area of the UAV network, the middle area is the SDN controller area, and the upper area is the monitoring area.
1. The information collection area generates state information. The information collection area collects various types of information to help the SDN controller make decisions. For example, the msg/packet/byte count module counts the number of messages, packets, and bytes transmitted in the network, respectively, and the Src/Dst (Source/Destination) Address Module records the address information of the packets forwarded by each node.
2. At the same time, the SDN controller also periodically transmits the summarized important information to the ground station. The flow of state information is as indicated by the arrows in Figure 2.
3. The monitoring area generates a network situation view. The function of this area is to convert the received status information into a network situation report suitable for human reading to assist the ground operator in mastering the latest network situation.
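The large loop can be pictured with a minimal Python sketch. It is only an illustration under assumptions: the names `UavReport` and `build_situation_map` and the field layout are hypothetical and are not part of the controller design in Figure 3. The sketch shows per-UAV reports (SINR per channel, position, velocity, positioning time) being aggregated into one map that keeps the most recent report of each UAV.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UavReport:
    """One periodic report from a UAV (hypothetical field layout)."""
    uav_id: str
    sinr_per_channel: Tuple[float, ...]   # measured SINR of each radio channel
    position: Tuple[float, float, float]  # GPS-derived coordinates
    velocity: float                       # flight velocity
    positioning_time: float               # timestamp of the GPS fix


def build_situation_map(reports: List[UavReport]) -> Dict[str, UavReport]:
    """Aggregate reports into a map that keeps the newest report per UAV."""
    situation: Dict[str, UavReport] = {}
    for rep in reports:
        old = situation.get(rep.uav_id)
        if old is None or rep.positioning_time >= old.positioning_time:
            situation[rep.uav_id] = rep
    return situation
```

Keying the map by UAV and comparing positioning times is one simple way to tolerate reports that arrive out of order; a real controller would also need to age out stale entries.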


Network Model
In the hierarchical mesh network, chosen for its flexible scalability, the communication processes of the core layer and the edge layer are similar. For simplicity, we model only the core layer of the network; the edge layer can be modeled in the same way. The notations used in our model are summarized in Table 1.

Table 1. Notations.
$C_\psi$: The $\psi$-th frequency pattern
$f^{(k)}$: The chosen channel at time slot $k$
$P_u^{(k)}$: Transmit power of UAV $u$ at time slot $k$
$P_{T/J}$: Total power constraints of the UAV/jammer
$d^{(k)}_{u_i u_j}$: The distance between UAV $u_i$ and $u_j$ at time slot $k$
$h^{(k)}_{u_i u_j}$: Channel power gains between UAV $u_i$ and $u_j$ at time slot $k$
$C_h$: Cost of frequency hopping
$C_p$: Cost of data transmitting
$C_m$: Cost of UAV path replanning
There are $N_U$ UAVs in the network, and each UAV flies at a certain height $h_u$ to avoid collisions. The flying height can be adjusted on command from the network controller or the UAV's own flight controller, but it should be maintained after adjustment. UAV nodes transmit messages over $N$ radio channels. All the UAV nodes follow the same set of frequency patterns, in which the $\psi$-th pattern $C_\psi$ consists of $\kappa$ time slots, and the channel chosen at time slot $k$ is denoted by $f^{(k)}$. The transmit power of the $u$-th UAV at time slot $k$ is denoted by $P_u^{(k)}$, and the total transmit power $P_T$ is quantized into $L + 1$ levels. The position of the $u$-th UAV at time slot $k$ is denoted by $L_u^{(k)}$, expressed in converted Cartesian coordinates $(x, y, z)$. The distance between UAV $u_i$ and UAV $u_j$ at time slot $k$ is $d^{(k)}_{u_i u_j}$. The UAV swarm needs to maintain a relatively stable topology during flight. For simplicity, we assume that only one UAV is allowed to relocate at each time slot and that the other UAVs maintain uniform motion. Within each time slot, the UAV can relocate to one of the eight surrounding spatial grids, which in the clockwise direction are the north, northeast, east, southeast, south, southwest, west, and northwest grids, as illustrated in Figure 4.
Since the height difference within the UAV network is much smaller than the UAVs' flight range, the spatial grid is approximated by a planar grid. The side length of the spatial grid is determined by the maneuverability of the UAV. Specifically, the side length for UAV $u_i$ at time slot $k$, denoted by $d^{(k)}_{u_i}$, is equal to the minimum displacement of a UAV flying in the eight directions according to the Dubins path. The entire UAV swarm can be regarded as one large virtual UAV, and the front direction of the eight spatial grids coincides with the flight direction of the virtual UAV at time slot $k$. The eight relocation grids of UAV $u_i$ at time slot $k$ are indexed by $(s, t)$, where $s, t \in \{-1, 0, 1\}$ are the coordinates of the spatial grids. Specifically, $s$ represents the left-right direction, where $-1$ means leftward, $0$ means no motion, and $1$ means rightward; $t$ represents the front-rear direction, where $-1$ means forward, $0$ means no motion, and $1$ means backward. For example, $(s, t) = (-1, -1)$ is the grid to the front-left of the UAV.
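The unified action space built from grid relocation and quantized transmit power can be sketched in Python. This is an illustration under assumptions, not the paper's implementation: grid offsets are the $(s, t)$ pairs with $s, t \in \{-1, 0, 1\}$ excluding $(0, 0)$, the $L + 1$ power levels are assumed to be evenly spaced between 0 and the total power, positions are expressed in a swarm-local planar frame, and the names `GRID_OFFSETS`, `power_levels`, and `relocate` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Eight relocation offsets (s, t): s is left/right, t is front/rear.
# (0, 0) means "no motion" and is therefore not a relocation.
GRID_OFFSETS: List[Tuple[int, int]] = [
    (s, t) for s, t in product((-1, 0, 1), repeat=2) if (s, t) != (0, 0)
]


def power_levels(total_power: float, num_levels: int) -> List[float]:
    """Quantize [0, total_power] into num_levels + 1 evenly spaced levels."""
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    return [total_power * level / num_levels for level in range(num_levels + 1)]


def relocate(pos: Tuple[float, float], offset: Tuple[int, int],
             side: float) -> Tuple[float, float]:
    """Move a planar position by one grid cell in the swarm-local (s, t) frame."""
    s, t = offset
    return (pos[0] + s * side, pos[1] + t * side)
```

With $L = 4$ and a total power of 10, for instance, `power_levels(10.0, 4)` yields the five levels 0, 2.5, 5, 7.5, and 10; an action can then be represented as a pair of one power level and one grid offset (or no motion).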

Jamming Attack Model
There are two types of interference sources: a low-mobility jamming station and a high-mobility jamming UAV, as illustrated in Figure 1. When the velocity of the jammer is not much different from the velocity of the UAVs, continuous jamming can occur. When the UAV swarm is flying at high speed, the impact of the jamming station is negligible. It follows that the jammers with a great influence on the UAV network have a low velocity relative to it. Therefore, we can assume that the state information of the whole network converges at the SDN controller within one time slot.
Let $h^{(k)}_{u_i u_j}$ denote the channel power gains between UAV $u_i$ and UAV $u_j$ at time slot $k$, and let $H$ denote the channel power gains between UAV $u_i$ and jammer $J_j$. Let $d_0$ denote a reference distance and $\rho$ the path loss exponent; the path loss $PL$ at distance $d$ is then modeled by Equation (1), where $\nu$ is a constant indicating the antenna gain and the path loss exponent $\rho$ characterizes the radio environment: $\rho = 2$ describes free-space propagation and $\rho = 4$ describes a two-ray model. Assuming that $r_i \sim N(0, 1)$ follows the standard normal distribution, the channel power gain $h_i$ is scaled by the inverse of the path loss $PL_g$, where $g = T/J$.
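To make the roles of $d_0$, $\rho$, and $\nu$ concrete, the following Python sketch assumes a common simplified power-law form, $PL(d) = \nu (d / d_0)^{\rho}$, and a fading gain equal to a squared standard-normal draw divided by the path loss. Both forms are assumptions chosen for illustration and may differ from Equation (1) and from the gain model used in this paper; the function names are hypothetical.

```python
import random


def path_loss(d: float, d0: float = 1.0, rho: float = 2.0, nu: float = 1.0) -> float:
    """Assumed power-law path loss: PL(d) = nu * (d / d0) ** rho."""
    if d <= 0 or d0 <= 0:
        raise ValueError("distances must be positive")
    return nu * (d / d0) ** rho


def channel_gain(d: float, rng: random.Random, **kwargs: float) -> float:
    """Assumed fading model: squared N(0, 1) draw divided by the path loss."""
    r = rng.gauss(0.0, 1.0)
    return r * r / path_loss(d, **kwargs)
```

Under this assumed form, doubling the distance multiplies the path loss by 4 for $\rho = 2$ (free space) and by 16 for $\rho = 4$ (two-ray), which is why the exponent dominates how quickly a distant jammer loses influence.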

Improved Dyna-Q-Based Smart Defense Communication Solution
Artificial intelligence can make the UAV network smarter and thus better able to resist SDR-based smart attacks. The Dyna-Q-based algorithm is a model-based reinforcement learning technique. If the limits on computing resources are ignored, a deep reinforcement learning technique (e.g., DQN (Deep Q Network), fast-DQN) can be adopted to tackle problems with large state and action spaces, as convolutional neural networks (CNNs) have powerful feature extraction capacity. Unfortunately, it is very uneconomical to deploy the large amount of computing resources that CNNs need on UAVs. A practical alternative is to quantize continuous values into several levels to compress the state space and the action space, and then to speed up the convergence of the algorithm. The SDN-based UAV network has strong environmental awareness and control flexibility, and the error between radio channel models and the real radio environment is easily reduced over multiple iterations. Therefore, we chose the Dyna-Q technique to calculate a better smart defense strategy for the UAV network.
The interaction between UAVs and jammers can be formulated as a multistage repeated game. To derive the optimal communication policy, Dyna-Q-based reinforcement learning techniques can be used to control both the UAV power allocation policy and the UAV relocation policy. In the proposed algorithm, the main function of the SDN controller is to aggregate the SINR values of the UAV nodes of the whole network, fit a quadratic surface to these SINR values and their GPS coordinates, and use the surface equation to predict the SINR values of the eight spatial grids that surround each UAV. Compared to the conventional Dyna-Q algorithm, the proposed algorithm uses global information to predict the SINR values in the eight directions around each UAV and uses the predicted values to optimize the Q function.
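The surface-fitting step can be sketched in plain Python. The sketch assumes a general quadratic in the planar coordinates with six coefficients and fits it by least squares through the normal equations; the function names, the Gaussian-elimination solver, and the data layout are illustrative choices, not the paper's implementation.

```python
from typing import Dict, List, Sequence, Tuple


def features(x: float, y: float) -> List[float]:
    """Basis of a general quadratic surface: 1, x, y, x^2, x*y, y^2."""
    return [1.0, x, y, x * x, x * y, y * y]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: need more distinct samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    q = [0.0] * n
    for i in reversed(range(n)):
        q[i] = (m[i][n] - sum(m[i][j] * q[j] for j in range(i + 1, n))) / m[i][i]
    return q


def fit_surface(samples: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Least-squares coefficients from (x, y, sinr) samples via normal equations."""
    rows = [features(x, y) for x, y, _ in samples]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atv = [sum(r[i] * v for r, (_, _, v) in zip(rows, samples)) for i in range(6)]
    return solve_linear(ata, atv)


def predict_neighbors(q: Sequence[float], x0: float, y0: float,
                      side: float) -> Dict[Tuple[int, int], float]:
    """Predicted SINR at the eight grids around (x0, y0), keyed by (s, t)."""
    return {
        (s, t): sum(c * f for c, f in zip(q, features(x0 + s * side, y0 + t * side)))
        for s in (-1, 0, 1) for t in (-1, 0, 1) if (s, t) != (0, 0)
    }
```

At least six samples at suitably distinct positions are needed for the fit to be determined. Normal equations can also become ill-conditioned when coordinates are large, so centering and scaling the positions before fitting is advisable in practice.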
Upon receiving packets from other UAVs, the receiving UAV node extracts the SINR estimated by the sending UAV node and formulates the state $s^{(k)}$ from it. In the process of communication, the sending UAV node evaluates the SINR from the feedback packets and calculates the utility based on the SINR and the communication cost, which includes the cost of frequency hopping $C_h$, the cost of data transmitting $C_p$, and the cost of UAV path replanning $C_m$. The utility at time slot $k$, $r^{(k)}(s^{(k)}, a^{(k)})$, is given by Equation (2), where $F(\cdot)$ is a Boolean function that equals 0 if its argument equals 0 and equals 1 otherwise.
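As a rough illustration of how such a utility can be computed, the Python sketch below subtracts the three costs from the SINR. The exact arguments of $F(\cdot)$ and the way each cost is weighted are assumptions here: the hopping cost is charged when the channel changes, the transmission cost is taken as proportional to the transmit power, and the replanning cost is charged when the UAV relocates. The function names are hypothetical.

```python
def indicator(x: float) -> int:
    """F(.): equals 0 if the argument equals 0, otherwise 1."""
    return 0 if x == 0 else 1


def utility(sinr: float, channel_change: float, tx_power: float,
            grid_move: float, c_h: float, c_p: float, c_m: float) -> float:
    """Assumed utility: SINR minus hopping, transmission, and replanning costs."""
    return (sinr
            - c_h * indicator(channel_change)   # charged only if the channel changed
            - c_p * tx_power                    # assumed proportional to power
            - c_m * indicator(grid_move))       # charged only if the UAV relocated
```

Keeping the SINR as the first term and the costs as separate subtracted terms makes it easy to report the SINR and the utility side by side, as is done in the simulation results.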
The value function $V(s)$ is the maximum of the Q function over the actions. The UAVs update their Q function and value function at time slot $k$ as follows:

$$Q(s^{(k)}, a^{(k)}) \leftarrow (1 - \alpha)\, Q(s^{(k)}, a^{(k)}) + \alpha \left( r^{(k)}(s^{(k)}, a^{(k)}) + \gamma V(s^{(k+1)}) \right) \tag{3}$$

$$V(s^{(k)}) \leftarrow \max_{a} Q(s^{(k)}, a) \tag{4}$$

where the learning factor $\alpha$ adjusts the UAV's learning speed and the discount factor $\gamma$ adjusts the importance of future rewards. To balance exploitation and exploration, the $\varepsilon$-greedy strategy is often used. Here $\varepsilon \in (0, 1]$ is a small positive value that represents the probability of choosing the exploration strategy, and $1 - \varepsilon$ is the probability of choosing the exploitation strategy.
When all the UAVs send their states $s^{(k)}_{u_i}$ at time slot $k$ to the SDN controller, the SDN controller can fit a quadratic equation (Equation (5)); that is:

$$v = q_0 + q_1 x + q_2 y + q_3 x^2 + q_4 x y + q_5 y^2 \tag{5}$$

where $(x, y)$ are the position coordinates of the UAV, $v$ is the SINR at $(x, y)$, and $q_0$ to $q_5$ are the parameters of the quadratic equation. Let $N^{(k)}$ be the number of states received by the SDN controller at time slot $k$, and let $\delta$ be the least-squares fitting error, summed over the received states:

$$\delta = \sum_{i,j} \left[ v_{i,j} - \left( q_0 + q_1 x_{i,j} + q_2 y_{i,j} + q_3 x_{i,j}^2 + q_4 x_{i,j} y_{i,j} + q_5 y_{i,j}^2 \right) \right]^2 \tag{6}$$

Setting the partial derivatives of $\delta$ with respect to $q_0$ to $q_5$ to 0 and solving the resulting equations gives the values of $q_0$ to $q_5$ (Equation (7)). Using Equations (5) and (7), we can estimate the SINR of the spatial grids surrounding each UAV. In each episode of the Dyna-Q-based learning algorithm, the agents in the UAVs additionally learn $n$ steps from the model of $(s, a)$, and eight experiences from the SDN global model created by the fitted quadratic surface via Equations (5) and (7) speed up the convergence. The details of the algorithm are shown in Table 2.
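A minimal tabular sketch of this learning loop is given below in Python. It uses the standard Dyna-Q pattern (one real update, a stored one-step model, and a number of replayed planning updates) together with $\varepsilon$-greedy action selection, and it accepts optional predicted experiences to stand in for the eight surface-model predictions. The class name, method names, and the representation of states and actions are assumptions for illustration; Table 2 remains the authoritative description of the algorithm.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

State = Hashable
Action = Hashable
Experience = Tuple[State, Action, float, State]


class DynaQAgent:
    """Tabular Dyna-Q with epsilon-greedy exploration (illustrative sketch)."""

    def __init__(self, actions: List[Action], alpha: float, gamma: float,
                 epsilon: float, planning_steps: int, seed: int = 0) -> None:
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.planning_steps = planning_steps
        self.q: Dict[Tuple[State, Action], float] = defaultdict(float)
        self.model: Dict[Tuple[State, Action], Tuple[float, State]] = {}
        self.rng = random.Random(seed)

    def value(self, s: State) -> float:
        """V(s) = max over actions of Q(s, a)."""
        return max(self.q[(s, a)] for a in self.actions)

    def choose(self, s: State) -> Action:
        """Explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def _update(self, s: State, a: Action, r: float, s_next: State) -> None:
        target = r + self.gamma * self.value(s_next)
        self.q[(s, a)] = (1 - self.alpha) * self.q[(s, a)] + self.alpha * target

    def learn(self, s: State, a: Action, r: float, s_next: State,
              predicted: Iterable[Experience] = ()) -> None:
        """One real step, then updates from predicted and replayed experiences."""
        self._update(s, a, r, s_next)
        self.model[(s, a)] = (r, s_next)
        for ps, pa, pr, pn in predicted:        # e.g., global-model predictions
            self._update(ps, pa, pr, pn)
        keys = list(self.model)
        for _ in range(self.planning_steps):    # classic Dyna-Q planning replay
            ks, ka = self.rng.choice(keys)
            kr, kn = self.model[(ks, ka)]
            self._update(ks, ka, kr, kn)
```

The extra predicted experiences let the agent update Q values for neighbouring grids it has not yet visited, which is the intuition behind the faster convergence claimed for the improved algorithm.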

Simulation Results
Simulations are carried out to evaluate the performance of the proposed power allocation policy and track relocation policy against a smart jammer, using the Dyna-Q-based reinforcement learning algorithm. Simulation parameters similar to those used previously [21,24] are chosen, with α = 0.95, γ = 0.7, and = 5. The simulated flight area is 1000 km × 800 km. The horizontal and vertical position coordinates of the moving objects are the remainders after division by 1000 km and 800 km, respectively. The initial position coordinates of the three UAVs are (260, 610), (790, 110), and (520, 270), and those of the two jamming UAVs are (320, 360) and (450, 100). The mobility model of the jammers is a random waypoint model. The side length of every spatial grid is 1 km. The speed of all the UAVs is 50 km/h. At most one UAV can be relocated in one time slot. When calculating the channel gains via Equation (1), the distance d is calculated from the position coordinates after the remainder operation.
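The remainder operation on the coordinates can be sketched as follows. This is only an illustration under stated assumptions: coordinates are taken to be in kilometres, the helper names `wrap` and `distance` are hypothetical, and the distance is the plain Euclidean distance between the wrapped positions, as the text describes, not a shortest path across the area boundary.

```python
import math

AREA_W, AREA_H = 1000.0, 800.0  # flight area in km, from the simulation setup


def wrap(x, y):
    """Map a position into the flight area by taking remainders."""
    return x % AREA_W, y % AREA_H


def distance(p, q):
    """Euclidean distance between two positions after the remainder operation."""
    (px, py), (qx, qy) = wrap(*p), wrap(*q)
    return math.hypot(px - qx, py - qy)
```

For example, `distance((260, 610), (320, 360))` gives the separation between the first UAV and the first jammer at their initial positions.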
We use four algorithms to form the smart defense algorithm of the UAV network: the WoLF-PHC (Win or Learn Fast-Policy Hill-Climbing) algorithm [21], the Q-learning algorithm [25], the Dyna-Q algorithm [26], and our improved Dyna-Q algorithm. For each algorithm, we simulated 100 time slots, each containing 30 episodes, on a computer with a 3.6 GHz Intel Core i7-4790 and 8 GB of RAM. Since the state space and the action space are both quantized into discrete levels, all four algorithms finish within 10 s.
We conducted three simulation experiments. The first experiment verified the performance of the four algorithms under a fixed jamming attack strategy. The second experiment verified the performance of the four algorithms under a smart jamming strategy, in which the jammers can use SDR to change their jamming strategy. The third experiment verified the influence of several key parameters on the utility metric.
In the first experiment, we compared the performance of the four reinforcement learning algorithms over 100 time slots under a fixed jamming strategy randomly selected by the jammers. Two metrics are used in this performance analysis: the utility and the SINR of the UAV network. The utility is calculated by Equation (2), and the SINR is the first term on the right-hand side of Equation (2). As shown in Figure 5, the utility and SINR of the UAV network increase over time and gradually converge to different levels. The improved Dyna-Q-based strategy has the highest utility, followed by the Dyna-Q, WoLF-PHC, and Q-learning-based strategies. After 100 time slots, the utility of the improved Dyna-Q-based strategy is 1.1252, which is 6.9%, 13.8%, and 18.6% higher than that of the Dyna-Q, WoLF-PHC, and Q-learning-based strategies, respectively. The performance of the WoLF-PHC algorithm fluctuates strongly: after 40 time slots, the utility of the WoLF-PHC-based strategy ranges from 0.88 to 1.11, which is wider than the range of the other three strategies. The SINR values of the four algorithms follow a trend similar to that of their utilities. The utility values are calculated by subtracting each cost from the SINR values. Although the three costs (path planning, data transmission, and frequency hopping) vary randomly, their variations largely offset one another, so their sum remains stable and the trend of the SINR is very similar to that of the utility. By the end of the simulation, the SINR with improved Dyna-Q is 3.1141, which is 2.8% higher than that of Dyna-Q, 5.4% higher than that of WoLF-PHC, and 6.6% higher than that of the Q-learning strategy.
The second experiment tested the performance of the four algorithms when the jammers used an SDR-based smart jamming strategy. Specifically, the jammers can continuously adjust their attack strategies using the Q-learning algorithm, including reallocating the jamming power and replanning the location of the jammers, but can only move one jammer to one of its eight surrounding grids per time slot, with ε = 0.3, µ = 0.7, $P_T = P_J = 0.4$, $C_m = 0.8$, $C_p = 0.2$, and $C_h = 0.4$. The UAV network selects and performs actions from the action space according to each of the four algorithms, and the utility and SINR are then calculated. Similarly, at most one UAV is allowed to move to one of its eight surrounding grids per time slot. The jammers change their jamming strategy nine times at equal intervals in 300 time slots. The experimental results are shown in Figure 6. It can be seen from Figure 6 that each change of the attack strategy causes an abrupt performance change of a different degree in each of the four algorithms. The improved Dyna-Q-based strategy has the highest utility, followed by the Dyna-Q, WoLF-PHC, and Q-learning-based strategies. For instance, at the 300th time slot, the utility of the improved Dyna-Q-based strategy is 1.1626, which is 8.7%, 14.9%, and 22.5% higher than that of the Dyna-Q, WoLF-PHC, and Q-learning-based strategies, respectively. The trend of the SINR of the UAV network is basically the same as that of the utility when facing smart jamming. At the 300th time slot, the SINR with improved Dyna-Q is 3.1928, which is 3.4% higher than that of Dyna-Q, 5.7% higher than that of WoLF-PHC, and 10.4% higher than that of the Q-learning strategy.
To better observe the impact of changes in the jamming strategy, we averaged the utility and SINR within each jamming strategy. The results are shown in Figure 7, Table 3, and Table 4. The average performance of the UAV network with the improved Dyna-Q strategy is the best, the average performances of Dyna-Q and WoLF-PHC are lower and similar to each other, and the average performance of the Q-learning algorithm is the worst. For instance, the mean of the nine average utilities in Table 3 is 1.0968 with improved Dyna-Q, 0.9861 with Dyna-Q, 0.9618 with WoLF-PHC, and 0.8380 with Q-learning. The performance advantage of the improved Dyna-Q algorithm is obvious and is closely related to the situation awareness capabilities of the SDN architecture. The Q-learning algorithm performs poorly because the fast-paced changes in the attack and defense strategies prevent it from converging in time. It is well known that the Q-learning algorithm can only update the Q value of one state per episode if the delayed update mechanism is not used. This slow learning inevitably leads to slow convergence. For example, in the 60-80th time slots of Figure 6a, the utility of the Q-learning algorithm shows continuous but slow growth. The phenomenon of nonconvergence after 20 time slots is not uncommon in the other three algorithms. The advantage of the Dyna-Q algorithm over Q-learning is that it can learn from models built with historical experience to speed up convergence. The advantage of the improved Dyna-Q algorithm over the Dyna-Q algorithm is that it can learn from the SINR model fitted by Equation (7) to speed up convergence.
The last experiment tested the effect of changes in key parameters on UAV network performance with $P_T = P_J = 0.4$, $C_p \in [0, 1.5]$, $C_h \in [0.1, 0.8]$, and $C_m \in [0.1, 0.8]$. The key parameters we chose are three costs: the frequency hopping cost $C_h$, the UAV path replanning cost $C_m$, and the unit transmission cost $C_p$. The simulation results are shown in Figure 8 and Tables 5-7. The common effect of the three cost parameters on network performance is that the utility decreases almost linearly as the cost increases. For instance, the utility of the UAV network decreases by 68.0% with Dyna-Q if the frequency hopping cost $C_h$ changes from 0.1 to 0.8. The utility of the UAV network decreases by 34.1% with improved Dyna-Q if the UAV path replanning cost $C_m$ changes from 0.1 to 0.8. The utility of the UAV network decreases by 60.4% with WoLF-PHC if the unit transmission cost $C_p$ changes from 0.1 to 1.5.
It can be seen from the simulation results that each of the three cost parameters has a particular pattern. For the frequency hopping cost $C_h$, the utility of the other three algorithms decreases rapidly as the cost increases, whereas that of the improved Dyna-Q declines slowly. This is because when the frequency hopping cost increases, path replanning becomes the main strategy to avoid jamming, and the SDN-based improved Dyna-Q can then use the situation information of the whole network to better avoid jamming. For the UAV path replanning cost $C_m$, the UAV uses path planning less to avoid jamming as the cost increases. The problem then gradually turns into the traditional problem of using a power allocation method to avoid jamming, and the advantages of improved Dyna-Q gradually diminish. As the unit transmission cost $C_p$ grows, every form of communication cost in the network increases, because any type of network defense strategy relies on packet transmission. Consistent with expectations, as $C_p$ grows, the utility declines in roughly the same proportion, regardless of the algorithm used.

Discussion and Conclusions
In this paper, we propose a dual-controller cooperative SDN-based wireless communication scheme for UAV networks and design a Dyna-Q-based reinforcement learning algorithm that jointly optimizes power allocation and track planning against smart jamming. The proposed Dyna-Q algorithm converges faster and performs more stably than the other three algorithms. Researchers have applied many algorithms, such as WoLF-PHC, Q-learning, DQN, and fast-DQN, to study the interactions between smart attackers and smart defenders. The DQN and fast-DQN algorithms are deep reinforcement learning algorithms, which need a large amount of computing resources and have high energy consumption. Although these kinds of algorithms can reach higher performance, it is neither economical nor realistic to deploy such a large amount of computing resources on a UAV platform. WoLF-PHC is a simple and practical decentralized algorithm for mixed-strategy learning. It does not need to know the recent behaviors of the agent or the current strategy of the opponent. However, the algorithm has not been proven to converge to the Nash equilibrium strategy, and its stability is insufficient. The Q-learning algorithm is the most basic model-free reinforcement learning algorithm, and its temporal-difference (TD) learning idea is the basis of many reinforcement learning algorithms. However, the action space of the UAV network is too large: the Q-learning algorithm can only update one Q value in an episode, so its learning efficiency is low. Because the method is not based on any model, the SDN-based UAV network cannot use its cooperative sensing ability. The Dyna-Q algorithm is often overlooked because its environment model is difficult to build. The SDN-based UAV network, however, has a large number of sensors, and its state information can quickly converge to the network controller, where it can be used to build an environment model. Therefore, the Dyna-Q-based reinforcement learning algorithm is more suitable for solving the smart defense problem of UAV networks.
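The contrast drawn above between Q-learning, which updates one Q value per real experience, and Dyna-Q, which additionally replays experiences from a model, can be sketched in tabular form. This is a generic textbook-style sketch and not the algorithm of Table 2: the update rule is the standard Q-learning form, the model is assumed to be deterministic, and all function and variable names are hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, eps, rng):
    """Explore with probability eps, otherwise pick the greedy action."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning update; V(s') is the maximum Q value in s'."""
    v_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)


def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha, gamma, n, rng):
    """Learn from one real experience, then replay n modelled experiences."""
    q_update(Q, s, a, r, s_next, actions, alpha, gamma)  # direct learning
    model[(s, a)] = (r, s_next)                          # remember the outcome
    for _ in range(n):                                   # planning steps
        (ps, pa), (pr, pn) = rng.choice(list(model.items()))
        q_update(Q, ps, pa, pr, pn, actions, alpha, gamma)


# Usage: Q values default to 0 and the model starts empty.
Q = defaultdict(float)
model = {}
rng = random.Random(0)
dyna_q_step(Q, model, "s0", "a0", 1.0, "s1", ["a0", "a1"],
            alpha=0.5, gamma=0.7, n=3, rng=rng)
```

With `n = 0` the step reduces to plain Q-learning. Under the same assumptions, experiences predicted by a global model, such as estimated SINR values for the surrounding grids, could be written into `model` before the planning loop so that they are replayed as well.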
other UAVs. The ground station can dispatch the latest mission plan to the UAV network.

Figure 1. Topology of a software-defined network (SDN)-based unmanned aerial vehicle (UAV) network under a smart jamming environment. The dotted lines in the figure indicate the wireless channels between the UAVs or between the UAVs and the ground station. GPS: Global Positioning System.

Figure 2. Functional architecture of the SDN-based UAV network. SINR: Signal to Interference plus Noise Ratio. Arrows indicate the flow of state information.

u at time slot $k$
$C_h$: Cost of frequency hopping
$C_p$: Cost of data transmitting
$C_m$: Cost of UAV path replanning

There are $N_U$ UAVs in the network, and each UAV flies at a certain height $h_u$ to avoid collisions. The flying height can be adjusted when receiving commands from the network controller or the UAV's own flight controller, but the height should be maintained after adjustment. UAV nodes transmit messages over $N$ radio channels. All the UAV nodes follow the same frequency pattern

Figure 3. Design of functional modules in the SDN controller and flow of state information. Msg: Message, Src: Source, Dst: Destination.
u are the converted Cartesian coordinates. The distance between UAV $u_i$ and UAV $u_j$ at time slot $k$ is denoted by $d^{(k)}$ with the UAV indices as subscripts. For UAV $u_i$ at time slot $k$, the quantity denoted by $d^{(k)}_{u_i}$ is equal to the minimum displacement of the UAV that flies in eight directions, according to the Dubins path. The entire UAV swarm can be regarded as one large virtual UAV, and the front direction of the eight spatial grids coincides with the flight direction of the virtual UAV at time slot $k$. The eight relocation spatial grids of UAV $u_i$ at time slot $k$ are denoted by $L^{u_i(k)}_{s,t}$, where $(s, t)$ represents the coordinates of the spatial grids. Specifically, $s$ represents the left-right direction, where $-1$ means leftward, 0 means no motion, and 1 means rightward; $t$ represents the front-rear direction, where $-1$ means forward, 0 means no motion, and 1 means backward. For example, $L^{u_i(k)}_{-1,-1}$ represents the spatial grid to the front left of UAV $u_i$ at time slot $k$.
When receiving a message, the UAV evaluates the signal-to-interference-plus-noise ratio (SINR) of the channel based on the bit error rate (BER) of the message. For simplicity, the value of the SINR is quantized into $\xi$ levels. Each UAV broadcasts its quantized SINR value and the index of the transmit frequency chosen according to $C_\psi$ at time slot $k$ to its neighbor nodes. Let the vector $h^{(k)}_u = \left[h^{(k)}_{u,i}\right]_{1 \le i \le N}$ denote the channel power gains of UAV $u$'s $N$ channels, $C_h$ the cost of frequency hopping, $C_p$ the cost of data transmission, and $C_m$ the cost of track replanning.
Symmetry 2019, 11, x

Figure 4. Schematic diagram of the relocation of a UAV. The black dotted lines represent the wireless channels between the UAVs, the red grid represents the grid world of the UAVs, and the red arrows represent the directions in which the UAV may move after the jammer adjusts the jamming strategy.
k) u represents the position coordinates of the sending UAV acquired by GPS, and $S$ is the state set. The receiving UAV node adopts a Dyna-Q-based reinforcement learning algorithm to choose the transmit power $P_s^{(k)}$ and determines whether to launch the track relocation action, with the communication action denoted by $a^{(k)} = \left[P_s^{(k)}, L^{u(k)}_{st}\right] \in A$, where $L^{u(k)}_{st}$ is UAV $u$'s flight direction approximated by eight spatial grids and $A$ is the action space.

Figure 5. Performance of the SDN-based learning algorithms using the power allocation and path replanning strategy against smart attacks with $P_T = P_J = 0.4$, $C_m = 0.8$, $C_p = 0.2$, and $C_h = 0.4$, where $P_{T/J}$ denotes the total power constraint of the UAV/jammer, $C_m$ the cost of UAV path replanning, $C_p$ the cost of data transmitting, and $C_h$ the cost of frequency hopping. (a) Utility of the UAV network; (b) SINR of the UAV network.

Figure 6. Performance of the SDN-based learning algorithms using the power allocation and path replanning strategy against smart attacks with $P_T = P_J = 0.4$, $C_m = 0.8$, $C_p = 0.2$, and $C_h = 0.4$, under the condition that the jammers change their jamming strategy, including changing jamming channels and changing the positions of jammers, nine times at equal intervals in 300 time slots. (a) Utility of the UAV network; (b) SINR of the UAV network.
The advantage of WoLF-PHC is that it can dynamically adjust its learning rate parameters according to the learning effect to speed up the convergence. Under the smart jamming attack, Dyna-Q has almost no advantage over WoLF-PHC. The reason is that the changed attack strategy invalidates many experiences in the Dyna model, and learning from these outdated experiences makes the learning effect fall rather than rise.

Figure 7. Average performance of the SDN-based learning algorithms using the power allocation and path replanning strategy against smart attacks with $P_T = P_J = 0.4$, $C_m = 0.8$, $C_p = 0.2$, and $C_h = 0.4$, under the condition that the jammers change their jamming strategy, including changing jamming channels and changing the positions of jammers, nine times at equal intervals over 300 time slots. (a) Average utility of the UAV network; (b) average SINR of the UAV network.
1.5], $C_h \in [0.1, 0.8]$, and $C_m \in [0.1, 0.8]$. The key parameters we chose are three costs: the frequency hopping cost $C_h$, the UAV path replanning cost $C_m$, and the unit transmission cost $C_p$. The simulation results are shown in Figure 8 and Tables

Figure 8. Average performance of the SDN-based communication scheme with
The Flow Table Module stores the flow table currently executed by each substrate node, the Channel SINR Module periodically collects the SINR value of each wireless channel, and the GPS Coordinate Module periodically reports the position and velocity information to the controller or the ground station.
2. The SDN controller area maps state information. The SDN controller acts like a function that maps the state information provided by the data plane to control instructions such as the Optimal Flow Table, Power Allocation Policy, and UAV Location Policy, as shown by the three black horizontal arrows in Figure

Table 1. List of main notations used in this work.
$N_J$: Number of jamming UAVs
$h_u$: Flying height of the $u$-th UAV
$N$: Number of radio channels
Number of frequency patterns

Table 2. The algorithm of Dyna-Q-based UAV networks' smart defense communication.

Table 3. Average utility values of the UAV network over nine rounds of smart jamming attack.

Table 4. Average SINR values of the UAV network over nine rounds of smart jamming attack.

Table 5. Average utility values with different frequency hopping costs.

Table 6. Average utility values with different UAV path replanning costs.

Table 7. Average utility values with different unit transmission costs.