Location-aware Cooperative Anti-jamming Distributed Channel Selection Approach in UAV Communication Networks

This paper investigates the cooperative anti-jamming distributed channel selection problem in UAV communication networks. Considering the existence of malicious jamming and co-channel interference, a location-aware cooperative anti-jamming scheme is designed for the purpose of maximizing the users' utilities. Users in the UAV group cooperate with each other via location information sharing. When the received interference energy is lower than the mutual interference threshold, users conduct channel selection strategies independently. Otherwise, users take joint actions in a cooperative anti-jamming pattern under the impact of mutual interference. For the independent anti-jamming channel selection problem without mutual interference, a Markov decision process framework is introduced, whereas for the cooperative anti-jamming channel selection case under the influence of co-channel mutual interference, a Markov game framework is employed. Furthermore, motivated by reinforcement learning with a "Cooperation-Decision-Feedback-Adjustment" idea, we design a location-aware cooperative anti-jamming distributed channel selection algorithm (LCADCSA) to obtain the optimal anti-jamming channel strategies for the users in a distributed way. In addition, the channel switching cost and cooperation cost, which have a great impact on the users' utilities, are introduced. Finally, simulation results show that the proposed algorithm converges to a stable solution with which the UAV group can avoid the malicious jamming as well as co-channel interference effectively.


Introduction
Unmanned aerial vehicle (UAV) communication networks, as a kind of newly-developing wireless communication network, have become a hot research topic [1,2]. When important tasks are carried out, how to construct a reliable and robust UAV network is of great significance. In some scenarios with strongly competitive characteristics, the destructive effect caused by malicious jamming must be taken into consideration.
However, traditional anti-jamming techniques have some limitations: i) Anti-jamming power control is ineffective under the circumstance of high jamming power. ii) Traditional uncoordinated frequency hopping (UFH) and frequency-hopping spread spectrum (FHSS) consume a large amount of spectrum resources, and they are not able to work well in a dynamic spectrum environment [4-6].
In addition, game theory [7-10], as a strong theoretical tool, is suitable to model the competitive anti-jamming scenario. Specifically, the Stackelberg game approach [10], as a kind of hierarchical game, has been widely used in the anti-jamming field. For example, in [11], the authors summarized the application of Stackelberg games in anti-jamming dense networks, and introduced several classical anti-jamming scenarios and system models. Moreover, an outlook on the application of anti-jamming Stackelberg games was also given. In [12-14], Stackelberg game approaches were adopted for the anti-jamming power control problem, where the user acted as the leader, and the jammer acted as the follower of the game. Utility functions were designed, and Stackelberg Equilibria (SE) were obtained via game approaches. Moreover, considering the channel selection problem in a malicious jamming environment, a hierarchical anti-jamming channel selection scheme was proposed using a Stackelberg game framework [15]. However, most existing studies under the Stackelberg game framework formulated the interactions between the user side and the jammer side, which brought large deviations in information acquisition. Thus, studies focusing on anti-jamming channel selection in dynamic jamming environments are of great importance.
In fact, the dynamic feature of the channel state brings some challenges to anti-jamming channel selection. In addition, the mobility of UAVs also influences the receiver's signal energy, causing a decline in communication quality [16]. In [17], the authors investigated the multi-stage spectrum access problem for Flying Ad-Hoc Networks (FANETs). The Markov decision process (MDP) [18,19], as a decision framework under dynamic channel environments, has been adopted to model the anti-jamming problem. For the purpose of solving the MDP, Q-learning [20] methods are usually employed with a "Decision-Feedback-Adjustment" structure to obtain the optimal strategy. For instance, in [4], the authors formulated the anti-jamming decision problem as an MDP, and obtained the best anti-jamming scheme via Q-learning. Furthermore, in [21], a deep Q network was built, and the anti-jamming channel selection problem was solved using a deep reinforcement learning method. In addition, in view of the multi-user scenarios in the anti-jamming field, the MDP has been extended to the Markov game [22], and several learning algorithms have been designed for multi-user scenarios. In [23], a multi-agent learning algorithm was proposed to obtain a stable solution for the dynamic spectrum access problem. In [24-26], some multi-user reinforcement learning methods were adopted, where users took actions independently. However, in the methods mentioned in [24-26], users' states are influenced by each other, which leads to unsteady learning environments and poor decision effects.
Taking an overall consideration of the challenges and inspirations brought by the above studies, in this paper, we mainly focus on the anti-jamming channel selection problem in a dynamic environment, where the channel state and the UAVs' locations are time-varying. Moreover, the channel switching cost and cooperation cost are introduced, which have a great impact on the users' utilities. A cooperative anti-jamming mechanism is constructed, in which users can realize information sharing and take actions jointly. Specifically, users in the UAV group sense the location information, calculate the received signal energy, and estimate whether they are influenced by co-channel interference. For the case where users are not influenced by co-channel interference, an MDP is formulated to model the anti-jamming problem for the UAVs, and an independent Q-learning method is employed to obtain the users' channel selection strategies. For the case where users are indeed influenced by co-channel interference, a Markov game is formulated, and a multi-agent Q-learning method is designed for UAV communication networks. To sum up, the main contributions are summarized as follows:
• A cooperative anti-jamming mechanism is designed for UAV communication networks, where UAVs cooperate via information exchange. Considering the influence of co-channel interference, an MDP and a Markov game are formulated respectively.
• A location-aware cooperative anti-jamming distributed channel selection algorithm (LCADCSA) is designed for the anti-jamming channel selection problem. Without the influence of co-channel interference, an independent Q-learning method is adopted, while under the influence of co-channel interference, a multi-agent Q-learning method is employed.
• Simulation results exhibit the performance of the proposed LCADCSA, which can avoid the malicious jamming and co-channel interference effectively. Moreover, the influence of the channel switching cost and cooperation cost is investigated.
Comparing this paper with our previous works [27,28], which studied anti-jamming channel selection in wireless communication networks, and with our previous work [29], the main differences are: i) Work [27] investigated a multi-agent learning method for the anti-jamming problem, and work [28] considered single-agent reinforcement learning in a fading environment. However, neither of these two works took the mobility of UAVs into consideration. In this paper, the anti-jamming channel selection approach in UAV communication networks is investigated while taking the mobility of UAVs into consideration, which causes the variation of co-channel interference. Moreover, the channel switching cost and cooperation cost are introduced, which influence the users' utilities. ii) In [29], we focused on the anti-jamming power control problem in UAV communication networks, whereas in this paper, a cooperative anti-jamming channel selection scheme is designed, and a cooperative anti-jamming algorithm based on multi-agent reinforcement learning is derived, which obtains strategies by interacting with the environment.
The rest of this paper is organized as follows. In Section 2, the system model and problem formulation are presented. In Section 3, the location-aware cooperative anti-jamming mechanism in the UAV group is designed. In Section 4, the proposed location-aware cooperative anti-jamming distributed channel selection algorithm (LCADCSA) is presented. In Section 5, simulations and discussions are conducted. In the end, we conclude in Section 6.

System Model and Problem Formulation
The system model is shown in Fig. 1. Assume that there are N users (a transmitter-receiver UAV formation is treated as one user) and one jammer in the system scenario. The UAVs are under the threat of a malicious jammer. In the UAV group, the locations of the UAVs are time-varying, and the UAVs cooperate with each other via information exchange. Denote the user set as N = {1, ..., n, ..., N}. The available channel set for each user is M = {1, ..., m, ..., M}.
Consider two different cases of UAV transmission: i) When users are close to each other and transmitting on the same channel, the high received signal energy from other users makes them influenced by co-channel interference. ii) When users are far away from each other, the received signal energy from other users is low, which means the users are not influenced by co-channel interference.
A mutual interference threshold τ_0 is used to measure the influence of co-channel interference: when the received interference energy is lower than τ_0, the UAV communication network is not influenced by co-channel interference, and vice versa.
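This mode decision can be sketched as follows; the received-energy model P·d^(−α) follows the system model, while the numeric values in the example are taken from the simulation settings later in the paper (0.1 W transmit power, α = 2, τ_0 = 6.25 × 10^−7 W). The function names are hypothetical, not from the paper:

```python
def received_energy(p_tx, dist, alpha=2.0):
    """Received signal energy under the path-loss model P * d^(-alpha)."""
    return p_tx * dist ** (-alpha)

def needs_cooperation(co_channel_users, tau0):
    """Decide the operating mode: True -> joint (cooperative) channel actions,
    False -> independent channel selection.
    co_channel_users: (tx_power, distance) pairs of users on the same channel."""
    interference = sum(received_energy(p, d) for p, d in co_channel_users)
    return interference >= tau0

# With 0.1 W transmitters, alpha = 2 and tau0 = 6.25e-7 W, the mode flips
# at exactly 400 m separation (0.1 / 400^2 = 6.25e-7).
print(needs_cooperation([(0.1, 300.0)], 6.25e-7))  # True: closer than 400 m
```

Note how the 400 m crossover implied by these parameters matches the stage boundaries described in the simulation section.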
Assume that channel strategy a_n means user n chooses channel c_n, c_n ∈ M, to transmit, a_{−n} is the channel strategy combination of all users except user n, and a_j is the jamming channel. Users transmit in a CSMA pattern; the throughput of user n is then expressed as:

Tr_n(a_n, a_{−n}, a_j) = log2( 1 + P_n d_n^{−α} / ( N_{c_n} + I_n(c_n) + P_j d_{j,n}^{−α} f(a_n, a_j) ) ),   (1)

where d_n denotes the distance between the transmitter and the receiver of user n, P_n represents user n's transmission power, P_j and d_{j,n} denote the jammer's transmission power and its distance to user n, α is the path-loss exponent, and N_{c_n} represents the noise power on channel c_n. Moreover, I_n(c_n) is the congestion degree of channel c_n, which is expressed as:

I_n(c_n) = Σ_{x∈N, x≠n} P_x d_{x,n}^{−α} f(a_n, a_x),   (2)

where P_x is the transmission power of user x, x ∈ N \ {n}, d_{x,n} denotes the interference distance from user x to user n, and P_x d_{x,n}^{−α} can be viewed as the received signal energy from user x at user n. f(a_n, a_x) is an indicator function, which depicts the occupation of user n's selected channel:

f(a_n, a_x) = 1 if c_n = c_x, and f(a_n, a_x) = 0 otherwise.

As shown in Eq. (1), Tr_n(a_n, a_{−n}, a_j) depicts user n's throughput under the threat of malicious jamming and co-channel interference, and in Eq. (2), the congestion degree I_n(c_n) reflects the co-channel interference caused by the other users occupying channel c_n.
Considering the channel switching of a user, we introduce the channel switching cost unit W_s to evaluate the performance loss. Moreover, if UAVs cooperate with each other to share more information and take actions jointly, a cooperation cost unit W_c is also incurred. Then, as a tradeoff between throughput and its cost, the utility of user n in one time slot is defined as:

U_n(a_n, a_{−n}, a_j) = Tr_n(a_n, a_{−n}, a_j) − δ_s W_s − δ_c W_c,

where δ_s and δ_c are indicator functions for channel switching and cooperation. δ_s = 1 indicates that a channel switch occurs at the beginning of the current slot, whereas δ_s = 0 means that the user keeps its channel strategy. δ_c = 1 indicates that users are cooperating with each other and take joint channel actions, whereas δ_c = 0 means users choose channels independently. The optimization objective of user n is:

max_{a_n} U_n(a_n, a_{−n}, a_j).

Every user in the UAV group wants to employ an optimal anti-jamming channel selection strategy for the purpose of maximizing the network's throughput. However, due to the dynamic feature of the jamming channel and the time-varying locations of the UAVs, solving the optimization problem is challenging. Therefore, in the next section, we combine the MDP, the Markov game and Q-learning to investigate and solve the anti-jamming channel selection problem in UAV communication networks.
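As a rough sketch of the utility model above, one can evaluate a user's per-slot utility numerically. The cost weights `w_s`, `w_c` and the example link parameters here are illustrative assumptions, not the paper's values:

```python
import math

def throughput(p_n, d_n, noise, congestion, p_j, d_jn, jammed, alpha=2.0):
    """SINR-style throughput of user n: the congestion degree I_n(c_n) and the
    jamming energy (when the jammer shares the channel) add to the noise."""
    denom = noise + congestion + (p_j * d_jn ** (-alpha) if jammed else 0.0)
    return math.log2(1.0 + p_n * d_n ** (-alpha) / denom)

def utility(tr, switched, cooperated, w_s=0.1, w_c=0.1):
    """Per-slot utility: throughput minus switching cost and cooperation cost."""
    return tr - (w_s if switched else 0.0) - (w_c if cooperated else 0.0)

# Example: a 0.1 W link over 100 m on a clean, unjammed channel.
tr = throughput(0.1, 100.0, 1e-9, 0.0, p_j=10.0, d_jn=500.0, jammed=False)
u = utility(tr, switched=True, cooperated=False)
```

The two cost terms are what create the tradeoff studied in the simulations: switching or cooperating only pays off when the throughput gained exceeds the corresponding cost unit.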

Location-aware Cooperative Anti-jamming Mechanism in UAV Group
In this part, the location-aware cooperative anti-jamming mechanism in the UAV group is designed and analyzed. According to the shared location information of the UAVs, the process of the designed cooperative anti-jamming mechanism is shown in Fig. 2.

Markov Decision Process
As mentioned above, when users are not influenced by co-channel interference, the anti-jamming channel selection problem can be formulated as a Markov decision process, and each user's strategy is independent of the others'. Definition 1. When users are free from the influence of co-channel mutual interference, the Markov decision process of user n can be expressed as (S_n, A_n, R_n, T_n), where:
• S_n is the discrete environment state set of user n. s_n(t) = {f_n(t), f_j(t)}, s_n(t) ∈ S_n, is the environment state of user n at time t, where f_n(t) and f_j(t) represent user n's transmission channel and the jamming channel respectively. In this case, user n's state is not influenced by other users.
• A_n is the channel strategy set of user n, and a_n(t) ∈ A_n denotes the channel selection strategy under the state at time t; similarly, user n's strategy is not influenced by others.
• R_n is the reward function of user n, which satisfies S_n × A_n → ℝ. Specifically, for every state s_n(t), the user obtains a reward after taking action a_n(t).
• T_n is the state transition function, which satisfies S_n × A_n × S_n → [0,1]. Moreover, it meets the Markov property:

Pr{ s_n(t+1) | s_n(t), a_n(t), s_n(t−1), a_n(t−1), ..., s_n(0), a_n(0) } = Pr{ s_n(t+1) | s_n(t), a_n(t) }.

For each user in the UAV group, the corresponding Markov decision process can be solved using the single Q-learning method, and the optimal anti-jamming channel selection strategies can be derived as well.

Markov Game
When users are under the influence of co-channel interference, the anti-jamming channel selection problem can be formulated as a Markov game, in which each user's strategy is related to the other users' strategies. Thus, all users in the group take joint actions to fight against the malicious jammer, and avoid co-channel interference as much as possible.

Definition 2. When users are influenced by co-channel interference, the anti-jamming channel selection problem can be formulated as a Markov game, which can be expressed as G = {S, A, T, R_1, ..., R_N}. The details are as follows:
• S is the discrete state set. In the cooperative anti-jamming case, s(t) = {f_1(t), ..., f_N(t), f_j(t)}, s(t) ∈ S, represents all users' states and the jammer's state. The users' states are correlated.
• Denote A_n as the channel selection set of user n, and A as the joint action set of all users in the UAV group. The action space is A = A_1 × A_2 × ... × A_N.
• T is the state transition function over the space S × A × S, which satisfies T(s, a, s') = Pr{ s' | s, a }. Specifically, a is the joint channel selection strategy, s is the current state, and s' is the coming state after all users take the joint action a under state s. The state transition function T satisfies the Markov property as well.
• R_1, ..., R_N are the reward functions of the users, satisfying S × A → ℝ for each n ∈ N. For the UAVs in the group, no matter what joint actions are taken, each one obtains an immediate reward.
Aiming at the two different states of the UAV group, the anti-jamming channel selection problem is formulated as a Markov decision process and a Markov game respectively. For the Markov decision process, the single Q-learning approach is used to obtain each user's optimal channel selection strategy.
For the Markov game, a multi-agent learning method is adopted for the purpose of acquiring the joint channel selection strategies of all users.

Single Q-learning
The single Q-learning method is suitable for the case where the UAV group is not influenced by co-channel mutual interference. In the traditional single Q-learning algorithm, every user maintains and updates its independent Q table Q_n. For user n, the Q function is updated as:

Q_n(s_n, a_n) ← (1 − λ_n) Q_n(s_n, a_n) + λ_n ( r_n^t + γ_n V_n(s_n') ),

where λ_n is the learning rate of user n, γ_n represents the discount factor for the Q table update, and s_n' is the next state. r_n^t is the immediate reward of user n under environment state s_n, which can also be viewed as the normalized utility:

r_n^t = tr_n − δ_s w_s − δ_c w_c,

where tr_n is the normalized throughput, and w_s and w_c are the normalized switching cost unit and the normalized cooperation cost unit respectively. V_n(s_n) is the value function of user n; in single Q-learning, V_n(s_n) can be expressed as:

V_n(s_n) = max_{a_n∈A_n} Q_n(s_n, a_n).

The defined value function V_n(s_n) can be viewed as finding the highest benefit in user n's "memory" under state s_n.
Each user in the UAV group adopts independent Q-learning in a "Decision-Feedback-Adjustment" way, and each user can converge to an optimal channel selection strategy.
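The independent update rule above can be sketched in a few lines. The learning rate λ = 0.8 matches the simulation settings, but the discount factor and exploration rate are illustrative assumptions, and the class name is hypothetical:

```python
import random
from collections import defaultdict

class SingleQLearner:
    """Independent Q-learning for one user (no co-channel interference).
    A state is (own channel, jamming channel); an action is the next channel."""
    def __init__(self, n_channels, lam=0.8, gamma=0.6, eps=0.1):
        self.q = defaultdict(float)  # Q_n(s_n, a_n), implicitly zero-initialized
        self.n_channels, self.lam, self.gamma, self.eps = n_channels, lam, gamma, eps

    def value(self, s):
        """V_n(s_n) = max_a Q_n(s_n, a): the best remembered payoff in s_n."""
        return max(self.q[(s, a)] for a in range(self.n_channels))

    def choose(self, s):
        """Epsilon-greedy channel choice, to avoid local optima."""
        if random.random() < self.eps:
            return random.randrange(self.n_channels)
        return max(range(self.n_channels), key=lambda a: self.q[(s, a)])

    def update(self, s, a, reward, s_next):
        """Q <- (1 - lam) * Q + lam * (r + gamma * V(s'))."""
        self.q[(s, a)] = ((1 - self.lam) * self.q[(s, a)]
                          + self.lam * (reward + self.gamma * self.value(s_next)))
```

One learner per user suffices here precisely because, below the interference threshold, a user's state and reward do not depend on the other users' actions.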

Multi-agent Q-learning
Aimed at the case where the UAVs are influenced by co-channel interference, a cooperative anti-jamming channel selection algorithm based on multi-agent Q-learning is designed. In multi-agent Q-learning, each user maintains and updates a Q table Q̄_n which is based on the joint action a. Similar to single Q-learning, the Q function is updated using the following rule:

Q̄_n(s, a) ← (1 − λ̄_n) Q̄_n(s, a) + λ̄_n ( r̄_n^t + γ̄_n V_n(s') ),

where λ̄_n is user n's learning rate under joint actions, and γ̄_n is the corresponding discount factor. r̄_n^t denotes user n's immediate reward when taking the joint action a under state s, i.e., the normalized utility under the joint action:

r̄_n^t = tr̄_n − δ_s w_s − δ_c w_c,

where tr̄_n is the normalized throughput under the joint action. V_n(s') is user n's value function in multi-agent Q-learning:

V_n(s') = Q̄_n(s', a*),

where a* represents the best joint action, with which all users' total benefit reaches its maximum:

a* = argmax_{a∈A} Σ_{n∈N} Q̄_n(s', a).

Without loss of generality, in both single Q-learning and multi-agent Q-learning, the ε-greedy policy is introduced for the purpose of avoiding local optima. Moreover, obviously, δ_c = 0 in single Q-learning and δ_c = 1 in multi-agent Q-learning, since in single Q-learning users take actions independently, while in multi-agent Q-learning users cooperate with each other to avoid mutual interference.
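A minimal sketch of this joint-action update follows, with the argmax over joint actions implementing the a* definition above. The learning rate and discount factor are illustrative, and the class name is hypothetical; for the paper's 3 users and 4 channels the joint-action space has only 4^3 = 64 entries, so exhaustive maximization is cheap:

```python
from collections import defaultdict
from itertools import product

class MultiAgentQ:
    """Cooperative multi-agent Q-learning over joint channel actions.
    Each user n keeps Q_n(s, a) indexed by the joint action a; the joint
    action a* maximizes the sum of all users' Q-values."""
    def __init__(self, n_users, n_channels, lam=0.8, gamma=0.6):
        self.tables = [defaultdict(float) for _ in range(n_users)]
        self.joint_actions = list(product(range(n_channels), repeat=n_users))
        self.lam, self.gamma = lam, gamma

    def best_joint(self, s):
        """a* = argmax_a sum_n Q_n(s, a)."""
        return max(self.joint_actions,
                   key=lambda a: sum(q[(s, a)] for q in self.tables))

    def update(self, s, a, rewards, s_next):
        """Move each user's table toward its reward plus gamma * Q_n(s', a*)."""
        a_star = self.best_joint(s_next)
        for q, r in zip(self.tables, rewards):
            v_next = q[(s_next, a_star)]  # V_n(s') = Q_n(s', a*)
            q[(s, a)] = (1 - self.lam) * q[(s, a)] + self.lam * (r + self.gamma * v_next)
```

Sharing one a* across all users is what makes the scheme cooperative: every user evaluates its next-state value at the joint action that is best for the group, not for itself alone.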

Location-aware Cooperative Anti-jamming Distributed Channel Selection Algorithm
In this section, the location-aware cooperative anti-jamming distributed channel selection algorithm is designed.
Fig. 3 depicts the anti-jamming distributed channel selection framework under the different cases. In the left part of Fig. 3, the framework under the influence of co-channel interference is shown: users in the UAV group adopt a "Joint action-Feedback-Adjustment" idea, and realize cooperative anti-jamming using multi-agent learning. In the right part, the framework without the influence of co-channel interference is shown, where each user selects channels independently via single Q-learning in a "Decision-Feedback-Adjustment" way.

Simulation Setting
In the simulation part, a UAV communication network which consists of three users and one jammer is investigated. The number of available channels for the users is 4. The jammer sends a sweeping jamming signal over the available channels, and the jamming signal stays on one channel for about 2.28 ms. The transmission time in each user's slot is T_tr = 0.98 ms, and the time for WBSS, information sharing, ACK and learning is in total T_wbss + T_is + T_ack + T_le = 0.2 ms in each slot.
Other simulation settings are as follows. Assume the transmission power of each user is 0.1 W, and the initial locations of the three users are (100 m, 100 m), (300 m, 800 m) and (150 m, 0 m) respectively.
The trajectories of the UAVs are shown in Fig. 5, and the flying time is divided into 10 epochs. The UAVs move 150 m per epoch, and the duration of each epoch is set to be 3 s. Furthermore, the path-loss exponent is α = 2, and the co-channel interference threshold is 6.25 × 10^−7 W. The total simulation time is equal to the flying time (approximately 30 s). Motivated by [30], the learning rates are set to λ_1 = ... = λ_N = λ̄_1 = ... = λ̄_N = 0.8. Moreover, Fig. 6 depicts the interference distances of the UAVs. In detail, the flying process is divided into four stages. During flying time 0 s to 6 s (the first stage), the distance between user 1 and user 3 is less than 400 m, and they are influenced by co-channel interference, as the received signal energy is higher than the threshold. During 6 s to 15 s (the second stage), user 1 and user 2 are influenced by co-channel interference. From 15 s to 21 s (the third stage), all users keep relatively far from each other, so there exists no co-channel interference, while from 21 s to 30 s (the fourth stage), user 2 and user 3 are influenced by co-channel interference.

Cumulative Normalized Utility of Users
In this part, the users' performance is analyzed. As mentioned in Algorithm 1, when users are influenced by co-channel interference, the proposed location-aware cooperative anti-jamming distributed channel selection algorithm (LCADCSA) is based on multi-agent Q-learning; when users are not influenced by co-channel interference, the proposed LCADCSA is based on single Q-learning. For better clarification, we use the cumulative normalized utility U_cum to show the effectiveness of the LCADCSA approach, which is defined as:

U_cum = Σ_{t=1}^{PN} r_n^t,

where PN is the number of packets in every update. PN is set to be 20 in the simulation, which means that the cumulative normalized utility is updated every 20 slots, and the time of each update is 23.6 ms.
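Reading U_cum as a per-block sum of the normalized slot utilities, one update point per PN slots, it can be computed with a small helper (the function name is hypothetical and the block-sum interpretation is an assumption about the definition above):

```python
def cumulative_normalized_utility(slot_rewards, pn=20):
    """One U_cum point per update: sum the normalized per-slot utilities
    over each block of PN consecutive slots (PN = 20 in the simulation)."""
    return [sum(slot_rewards[i:i + pn]) for i in range(0, len(slot_rewards), pn)]
```

With 20 slots per update and the slot timing above (0.98 ms transmission + 0.2 ms overhead = 1.18 ms per slot), each update indeed spans 20 × 1.18 ms = 23.6 ms, matching the stated update period.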
The cumulative normalized utilities of the users are shown in Fig. 7. To evaluate the effectiveness of the LCADCSA algorithm, it is compared with a sensing-based algorithm and multi-user single Q-learning. In the sensing-based algorithm, users select channels that are not jammed by the jammer after sensing the current channel states, and in multi-user single Q-learning, each user adopts single Q-learning independently to avoid the jamming channel while ignoring the existence of mutual interference. Simulation results show that users can achieve higher cumulative normalized utilities U_cum using the LCADCSA algorithm when there exists mutual interference between users. The reason is that, in the proposed algorithm, users can learn the actions of the jammer, and can also adjust their channel selection strategies jointly according to their location and interference information. Thus, the users can avoid malicious jamming and co-channel interference simultaneously.
In addition, in Fig. 8 and Fig. 9, we compare user 1's cumulative normalized utilities under different channel switching costs and cooperation costs. As shown in Fig. 8, with the increase of the channel switching cost, user 1's cumulative normalized utility decreases in both the LCADCSA algorithm and the multi-user single Q-learning algorithm. As shown in Fig. 9, with the increase of the cooperation cost, user 1's cumulative normalized utility decreases significantly in the multi-agent Q-learning stages, while the utility stays invariant in the single Q-learning stages. The reason is that in multi-agent Q-learning, users cooperate with each other and share their joint Q tables as well as actions, whereas in single Q-learning, users only need to take actions and update their Q tables independently. Moreover, if the cooperation cost is too high, the impact of cooperation is greater than that of co-channel interference, which makes it unwise to cooperate to avoid co-channel interference.

Channel Selection Strategies of Users and the Jammer
As an example, Fig. 10 shows the time-frequency diagram after the LCADCSA algorithm converges in the first stage, where user 1 and user 3 are influenced by co-channel interference. As shown in Fig. 10, the users' channel selections avoid the vast majority of jamming channels. Moreover, user 1 and user 3 avoid being influenced by co-channel interference, as they select different channels in each time slot. In addition, although there exist some overlapping areas between user 2's channels and the other users' channels, the communication of user 2 is not influenced by co-channel interference, as its received co-channel interference signal energy is lower than the threshold. In a word, the time-frequency diagram shows that the proposed LCADCSA algorithm is effective.

Conclusion
This paper investigated the anti-jamming channel selection problem in UAV communication networks. Via constructing a cooperative anti-jamming mechanism, users can realize information sharing and then take actions according to the interference level in the network. The channel switching cost and cooperation cost, which have a great impact on the users' utilities, were introduced. For the case where users were not influenced by co-channel interference, a Markov decision process was formulated for independent anti-jamming channel selection, and a single Q-learning method was designed to obtain the independent anti-jamming channel selection strategies. For the case where users were influenced by co-channel interference, a Markov game was formulated, and a multi-agent Q-learning method was designed to obtain the joint anti-jamming channel selection strategies. Simulation results showed that the proposed LCADCSA algorithm converges to a stable solution with which the UAV group can avoid the malicious jamming as well as co-channel interference effectively.

Figure 3 .
Figure 3. Anti-jamming distributed channel selection framework under different cases.
The cumulative normalized utilities of the users are shown in Fig. 7(a), Fig. 7(b) and Fig. 7(c) respectively, where the channel switching cost and cooperation cost are set to be 0. As shown in these three figures, the users' channel selection processes are divided into four stages: In the first stage, user 1 and user 3 cooperate with each other and adopt multi-agent Q-learning, while user 2 employs single Q-learning. In the second stage, user 1 and user 2 cooperate with each other and adopt multi-agent Q-learning, while user 3 employs single Q-learning. In the third stage, as no user is influenced by co-channel interference, each user adopts the single Q-learning method. In the fourth stage, user 2 and user 3 cooperate via multi-agent Q-learning, whereas user 1 chooses its transmission channel independently via single Q-learning.

Figure 8 .
Figure 8. The cumulative normalized utility of user 1 with different channel switching costs.
Fig. 10 shows the time-frequency diagram after the LCADCSA algorithm converges in the first stage (4800 ms-4850 ms), where user 1 and user 3 are influenced by co-channel interference. The red squares denote the jamming channel, and the blue, black and yellow squares represent the channel selections of user 1, user 2 and user 3 respectively. A mixed-color square means that either two or more users choose the same channel, or a user and the jammer choose the same channel in a certain slot. During the first stage, the users are under the threat of the malicious jammer and co-channel interference. Thus, user 1 and user 3 adopt multi-agent Q-learning and take joint channel selections, whereas user 2 employs single-agent Q-learning as it is not influenced by co-channel interference.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 13 September 2018, doi:10.20944/preprints201809.0227.v1. Peer-reviewed version available at Appl. Sci. 2018, 8, 1911; doi:10.3390/app8101911.

Figure 9 .
Figure 9. The cumulative normalized utility of user 1 with different cooperation costs.

Figure 10 .
Figure 10. The time-frequency diagram after LCADCSA converges in the first stage.

The details of the mechanism are as follows. When users are transmitting, they share location information with each other. After that, each user makes a judgement according to the received interference energy. If users are influenced by co-channel interference, a Markov game is formulated to model the cooperative anti-jamming problem: every user has to avoid the jamming channel, as well as the co-channel interference channel, for the purpose of realizing a higher throughput. If users are at locations where they are not influenced by co-channel interference, a Markov decision process is formulated for the resulting optimization problem, and each user makes anti-jamming decisions independently via the single Q-learning approach.