On the Performance of Deep Reinforcement Learning-Based Anti-Jamming Method Confronting Intelligent Jammer

With the development of access technologies and artificial intelligence, deep reinforcement learning (DRL) algorithms have been applied to channel access and anti-jamming. When the jamming modes are sweeping, comb, dynamic, and statistics-based, a trained DRL-based method can almost perfectly avoid the jamming signal and communicate successfully. In this paper, by contrast, we investigate the performance of a DRL-based anti-jamming method from the perspective of the jammer. First, we design an intelligent jamming method based on reinforcement learning to combat the DRL-based user. Then, we theoretically analyze the condition under which the DRL-based anti-jamming algorithm cannot converge, and provide the proof. Finally, to investigate the performance of the DRL-based method, we compare various scenarios in which users with different communication modes combat jammers with different jamming modes. The simulation results verify the theoretical analysis and show that the proposed RL-based jamming can effectively restrict the performance of the DRL-based anti-jamming method.


Introduction
With the rapid development of wireless communications, information security has attracted more and more attention. In sensitive areas such as airports [1] and electronic-warfare battlefields [2], anti-jamming techniques are needed against illegal jammers. With the development of artificial intelligence (AI) [3], communication devices are becoming increasingly intelligent [4]. Users can learn the jamming modes with the help of AI technologies, and the probability of being jammed can be reduced by making the right decision, such as switching to idle communication frequencies [5] or adjusting the communication power [6].
The traditional jamming modes include sweeping jamming, comb jamming, and barrage jamming [7]. Because of their fixed jamming patterns, traditional jammers can be easily avoided by users. For example, sweeping jamming can be avoided by selecting the communication time [8], and comb jamming can be avoided by selecting the communication frequency. For barrage jamming, as the range of jammed frequencies grows, the output power of the jamming is reduced proportionally [9]. When the jamming power is low, the jamming can be made ineffective by choosing an appropriate communication power [6]. Anti-jamming strategies based on statistics [10], game theory [11], and other approaches [12,13] are highly effective in reducing the probability of being jammed.
To improve the jamming effect, anti-jamming methods need to be taken into consideration when designing a jamming method. There are many kinds of anti-jamming methods [14][15][16], such as frequency hopping [14], fixed frequency with a greater communication power [15], and direct sequence spread spectrum (DSSS) [16]. In [17], link-layer jamming and the corresponding anti-jamming methods are reviewed. Link-layer jamming can analyze the user's Media Access Control (MAC) protocol parameters and launch a highly energy-efficient jamming attack. In addition to adversarial jamming, cooperative jamming techniques can be used to improve physical-layer security [18]. To keep up with the development of anti-jamming, many researchers use optimization, game-theoretic, or information-theoretic principles to improve jamming performance [19][20][21][22]. However, these methods require prior information (the communication protocol, the user's transmission power, etc.) that is difficult to obtain in a real environment. Other researchers solve the jamming problem by repeatedly interacting with the victim transmitter-receiver pair [23], combining optimization, game theory, and other theoretical tools. However, these works only consider a user that has a fixed strategy or that chooses a strategy from a strategy set. It is difficult to design an efficient jamming strategy against an intelligent user that can change its communication strategy by learning [24].
In [24], a dynamic spectrum access anti-jamming method based on a deep reinforcement learning (DRL) algorithm was designed. With a DRL algorithm, intelligent users can learn the jammer's strategy through a trial-and-error channel access process and obtain the optimal anti-jamming policy without needing the jammer's information. As concluded in [24], even when the space of spectral states is infinite, the powerful approximation capability of the deep neural network allows the algorithm to converge to the optimal policy, and the success rate can reach around 95% against sweeping, comb, and dynamic jamming, which is much better than that of the traditional reinforcement learning algorithm.
The DRL-based anti-jamming method is powerful, but to the best of our knowledge, an efficient jamming approach against the DRL-based communication user has not been studied. What will happen when the jammer is also intelligent, so that it can learn the communication mode of users and adjust its jamming strategy dynamically? Inspired by [25], in this paper we investigate the performance of the DRL-based anti-jamming method against different jamming modes, including traditional jammers and an intelligent jammer based on reinforcement learning (RL). RL [26,27] is an intelligent algorithm that obtains feedback from the environment and does not need prior information. Compared to DRL, an RL algorithm needs less computation and fewer iterations to converge. In a wireless network, communication devices often access the spectrum for a short time, which provides the jammer with only a small amount of data. For this reason, we assume that the jammer adopts an RL algorithm instead of DRL.
In this paper, we design an RL-based jamming algorithm and investigate the performance of a DRL-based anti-jamming method. The implementation as well as the theoretical performance are discussed. Different from [28], we use a small probability of random jamming as a way to counter the DRL-based user. We theoretically analyze the condition under which the DRL-based anti-jamming algorithm does not converge and provide the proof. The main contributions are summarized as follows:

• We investigate the performance of the DRL-based anti-jamming method against different jamming modes. Without prior information about the users, we design an RL-based jamming algorithm to counter the DRL algorithm, and the simulation results verify the effectiveness of the jamming method.

• We theoretically analyze the condition under which the DRL-based algorithm cannot converge, and verify it by simulation.

The rest of this paper is organized as follows. Section 2 discusses the system model. In Section 3, we present the RL-based jamming algorithm. The analysis of the DRL algorithm is provided in Section 4, followed by the simulation results in Section 5. Concluding remarks are given in Section 6.

System Model and Problem Formulation
As shown in Figure 1, the system model consists of two users (a transmitter-receiver pair) and two agents (one for the user, the other for the jammer), which sense the frequency spectrum by performing full-band detection every $t_s$ with step $\Delta f$. Based on the sensing results, the agents guide the frequency choice of the jammer or the user, respectively. The transmitter is equipped with one transmission radio interface that can communicate on one frequency band at a time. The communication frequency range of the transmitter is denoted by $B_u$, and the transmission signal bandwidth is $b_u$. The channel selection set of the transmitter is $A_u = \{a_1, a_2, \ldots, a_N\}$ ($|A_u| = N$). The agent senses the whole communication band continuously and selects a communication channel among the $N$ ($N = B_u / b_u$) channels for the transmitter through a control link. The receiver is a full-band receiver; it reports the quality of the received signal to the agent through a control link or a wired link. For the jammer, $b_j$ denotes the band jammed at one time, and a jammer can jam several frequency bands. The jamming frequency range is denoted by $B_j$, and we set $B_j = B_u$.
The size of the jammer's decision set is $M = B_j / b_j$. The action set of the jammer is $A_j = \{a_1, a_2, \ldots, a_M\}$ ($|A_j| = M$).
To simplify the analysis, we divide continuous time into discrete time slots (as shown in Figure 3) and assume that the jammer and the user take the same time $\tau$ to make a decision, and that the decision strategies of the user and the jammer remain unchanged within a time slot $\tau$.
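As a small illustration of the channelization above, the following sketch computes the sizes of the decision sets from the band and signal widths. The function name `channel_count` is hypothetical, and the numeric values are the ones quoted later in the simulation parameters (a 20 MHz band with 4 MHz user and jammer signals); non-overlapping channels are assumed.

```python
def channel_count(total_band_hz: float, signal_band_hz: float) -> int:
    """Number of non-overlapping channels of a given width inside a band."""
    if signal_band_hz <= 0:
        raise ValueError("signal bandwidth must be positive")
    return int(total_band_hz // signal_band_hz)


# Values taken from the simulation parameters of this paper.
B_u = 20e6  # communication frequency range of the transmitter (Hz)
b_u = 4e6   # user signal bandwidth (Hz)
B_j = B_u   # the jamming range is set equal to the user range
b_j = 4e6   # bandwidth jammed at one time (Hz)

N = channel_count(B_u, b_u)  # size of the user's channel set A_u
M = channel_count(B_j, b_j)  # size of the jammer's action set A_j
print(N, M)
```

With these values both sets contain five channels, which matches the action-set size used in the simulations.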

Agent and Optimized Objective of Jammers
The agent of a jammer senses the communication channels, analyzes the communication preference of the user, and guides the deployment of jamming through a reliable control link or a wired link. The signal-to-interference-plus-noise ratio (SINR) of the user's receiver is denoted by $\beta(f_t)$, where $f_t$ is the center transmission frequency at time $t$. The parameter $b_u$ is the user transmission bandwidth, $p_u$ is the transmitted power of the user, $n(f)$ is the power spectral density (PSD) of the noise, and $f_t^j$ is the jamming center frequency selected by jammer $j$ at time $t$. The PSD of one jamming band is denoted by $J(f)$, and $g_j$ is the channel gain from the jammer to the user's receiver. The channel gain from the user's transmitter to the jammer is denoted by $g_u$. Let $\mu(f_t)$ be an indicator function for successful transmission:

$$\mu(f_t) = \begin{cases} 1, & \beta(f_t) \geq \beta_{th} \\ 0, & \beta(f_t) < \beta_{th} \end{cases}$$

where $\beta_{th}$ is the SINR threshold. When the SINR is below the threshold, $\beta(f_t) < \beta_{th}$, the transmission is regarded as failed. The aim of the jammer is to minimize the discounted sum of successful transmissions, $\sum_{t} \gamma^{t} \mu(f_t)$, where $\gamma \in (0, 1)$ is the discount factor.
We denote by $u_{t,i}$ the discrete spectrum sample value (in dB) obtained from the jammer's spectrum analysis. It is computed from the received PSD $R_j(f)$, in which $g_a$ is the channel gain from the user's transmitter to the agent and $g_{a,j}$ is the channel gain from the jammer to the jammer's agent. With $u_{t,i}$, we can rebuild the spectrum at time $t$ and find the user's communication frequency, denoted by $u_{t,c}$. The total range that the jammer can jam at one time, $B_j^n$, is the sum of the individual jamming bands; note that $B_u > B_j^n$.
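The success indicator and the jammer's objective can be sketched as follows. This is only an illustration: the function names are hypothetical, the SINR values are supplied directly in dB rather than computed from the PSDs, and the objective is assumed to be the discounted sum of the indicator over time.

```python
def transmission_success(sinr_db: float, threshold_db: float) -> int:
    """Indicator mu: 1 if the SINR reaches the threshold, otherwise 0."""
    return 1 if sinr_db >= threshold_db else 0


def discounted_success(sinr_trace_db, threshold_db: float, gamma: float) -> float:
    """Discounted sum of successful transmissions (assumed jammer objective).

    The jammer tries to make this quantity as small as possible.
    """
    total = 0.0
    for t, sinr_db in enumerate(sinr_trace_db):
        total += (gamma ** t) * transmission_success(sinr_db, threshold_db)
    return total


# Illustrative SINR trace in dB; 10 dB is the demodulation threshold
# quoted in the simulation parameters.
trace = [12.0, 5.0, 15.0]
print(discounted_success(trace, threshold_db=10.0, gamma=0.5))
```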

RL-Based Jamming Algorithm
The main task of RL is to adjust the strategy according to the feedback obtained from the environment. In our case, the feedback is the jamming effect. Through this feedback, the algorithm improves the effectiveness of its decision strategy. Some researchers have applied RL to anti-jamming or jamming problems [29][30][31]. However, they assumed that the opponent's strategy is fixed or is chosen from a strategy set. Such algorithms do not apply well if the opponent is intelligent and can change its strategy through learning. We explain the essence of RL through four crucial elements: agent, state, action, and reward [32].

• Agent: The agent is responsible for making decisions. It usually has the capability of sensing the environment and taking actions. In our case, the agent senses the frequency spectrum and guides the jammer in choosing the jamming band.

• State: The environment state for the corresponding action. In our case, the state is the frequency spectrum that the agent senses.

• Action: All actions that can be taken by the agent, or under the guidance of the agent. In our case, the action is the choice of which frequency band to jam.

• Reward: The feedback for the corresponding action in the environment. In our case, the reward is the jamming effect.

Figure 2 shows the process of the RL algorithm. The agent observes state $s_t$ at time $t$, takes action $a_t$, receives the reward $r_t$, and uses $r_t$ to update its value function. Then, the agent repeats the process at time $t + 1$.

Reward
The reward value tells the intelligent agent which actions are effective in a given environment and which are not. When jamming succeeds, the reward $R$ is 1; otherwise, $R$ is 0. The agent then learns to make more effective decisions in the corresponding state. As shown in Figure 3, the horizontal axis represents frequency, and the vertical axis represents time. To simplify the analysis, we set $b_u = b_j$. In this case, the state is the user's communication frequency band, denoted by $u_t$, and the action is the jamming frequency band, denoted by $a_t$. Note that decision $a_{t-1}$ is made at time $t - 1$ and executed at time $t$. Figure 3 shows the relation between state and action, and $R$ can be described as:

$$R = \begin{cases} 1, & u_t = a_{t-1} \\ 0, & \text{otherwise} \end{cases}$$

Note that, in practice, it is hard for a jammer to judge whether $u_t = a_{t-1}$. To assess the jamming effect, the following methods can be used.

1. Detect NACK. If a NACK is detected, we affirm that $u_t = a_{t-1}$ and $R = 1$. Since the communication protocols of civil communication systems are public, jammers can easily recognize a NACK. In other practical situations, however, the users' communication protocol is unknown, and auxiliary methods are needed by the jammer.

2. Detect a change of communication power. In ref. [34], the detection of increasing power is taken as the sign of successful jamming. Many wireless communication devices are designed to increase their power when they encounter interference or noise in the environment. For example, when a cell phone cannot communicate well with the base station, it increases its communication power. By sensing a change in the users' transmit power, we can affirm that $u_t = a_{t-1}$ and $R = 1$.

3. Detect users switching channels. In actual communication, switching channels usually requires negotiation between the transmitter and the receiver and often consumes more energy. Because of their limited energy, users are more inclined to stay on the current channel. Therefore, detecting channel switching can also be used to evaluate the jamming effect.
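The reward rule and the three observation-based proxies can be sketched as below. The function names are hypothetical, and combining the three cues with a logical OR is an assumption made for illustration; the text above does not specify how several cues would be fused.

```python
def ideal_reward(user_channel: int, previous_action: int) -> int:
    """Reward when the jammer knows the user's channel: 1 if u_t == a_{t-1}."""
    return 1 if user_channel == previous_action else 0


def observed_reward(nack_detected: bool = False,
                    power_increased: bool = False,
                    channel_switched: bool = False) -> int:
    """Reward inferred from indirect cues (assumed: any single cue suffices)."""
    return 1 if (nack_detected or power_increased or channel_switched) else 0


# The jammer decided at t-1 to jam channel 3; the user transmits on channel 3 at t.
print(ideal_reward(user_channel=3, previous_action=3))
# No NACK is visible, but the user raised its transmit power.
print(observed_reward(power_increased=True))
```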

Q-Learning
Q-learning is one of the most popular RL algorithms. The core update formula [35] is given by:

$$Q(u_{t-1}, a_t) \leftarrow (1 - \alpha)\, Q(u_{t-1}, a_t) + \alpha \left[ R + \gamma \max_{a \in A_j} Q(u_t, a) \right]$$

where $Q(u_{t-1}, a_t)$ is the Q-value that evaluates action $a_t$ at state $u_{t-1}$ (we set $\tau = 1$), $\alpha \in (0, 1)$ is the learning rate, and $R$ is the reward value.
In practice, a jammer can select $n$ frequency bands to jam from the decision set $A_j$ (under the guidance of the agent). At time $t$, the decision is selected from $Q(u_{t,c}, A_j)$ ($Q$ is a two-dimensional matrix). Let $A_{s,t}$ be the actions with the top $n$ values in $Q(u_{t,c}, A_j)$ at time $t$; we define this operation as $\mathrm{sort}[Q(u_{t,c}, :)]_{1:n}$. $R_{s,t}$ is the reward vector for the corresponding actions. For example, if the actions are $A_{s,t} = \{a_{1,t}, a_{2,t}, a_{4,t}\}$ and $R = 1$ for $a_{4,t}$, then $R_{s,t} = \{0, 0, 0, 1, 0, \ldots, 0\}$ ($|A_j| = |R_{s,t}| = M$). The jammer's update applies the Q-learning update to the selected actions $A_{s,t}$ with the rewards in $R_{s,t}$. The reinforcement learning jamming algorithm is repeated until a stop criterion is met (see Algorithm 1).
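A minimal sketch of the top-$n$ selection and the multi-band update is given below. It assumes the standard Q-learning update is applied separately to each selected action, with the reward read from the reward vector; the function names and numeric values are illustrative only and are not taken from the paper.

```python
def select_top_n(q_row, n):
    """Indices of the n largest Q-values in a row, highest first."""
    order = sorted(range(len(q_row)), key=lambda a: q_row[a], reverse=True)
    return order[:n]


def update_selected(Q, prev_state, state, actions, rewards, alpha, gamma):
    """Apply the Q-learning update to every selected action (assumed form)."""
    best_next = max(Q[state])  # value of the best action in the new state
    for a in actions:
        target = rewards[a] + gamma * best_next
        Q[prev_state][a] = (1 - alpha) * Q[prev_state][a] + alpha * target
    return Q


M = 5                                   # size of the jammer's action set
Q = [[0.0] * M for _ in range(M)]       # rows: user channel, columns: jammed band
Q[2] = [0.1, 0.5, 0.0, 0.9, 0.2]        # illustrative values for user channel 2

actions = select_top_n(Q[2], n=3)       # the three bands the jammer will jam
rewards = [0, 0, 0, 1, 0]               # only band 3 hit the user
update_selected(Q, prev_state=2, state=0, actions=actions,
                rewards=rewards, alpha=0.5, gamma=0.9)
print(actions, Q[2])
```

Only the entries of the selected bands change; the Q-values of bands that were not jammed in this slot are left untouched.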

Algorithm 1 Reinforcement learning jamming algorithm (RLJA).
Initialize: Set learning rate α ∈ (0, 1), discount rate γ ∈ (0, 1), Q = 0, training time T, and probability of random action ε ∈ (0, 1). The number of jamming frequency bands is n.
for t = 1, 2, ..., ∞ do
    Generate a random value η ∈ (0, 1)
    if t < T then
        The agent chooses n actions from A_j uniformly at random (a_t ∈ A_j, P(a_t = a) = 1/N, N = |A_j|), gets the reward, and updates the Q matrix
    else if η < ε then
        The agent chooses n actions from A_j at random, gets the reward, and updates the Q matrix
    else
        The agent guides the actions based on the Q matrix
    end if
end for

Theorem 1. The jamming process of the RL algorithm can converge for different ε under the condition that the communication process is an action-replay process (ARP).
Proof. The proof is given in reference [35] and is divided into two parts. The first part explains what an action-replay process (ARP) is. An ARP is derived from a particular sequence of episodes observed in the real process, with a finite state space S and a finite action set A. The second part proves that the Q-learning algorithm converges under the condition that the real process is an ARP. The proof does not require exploration conditions; it only requires sufficient iterations. Since the jamming pattern without a random scheme in this case is an ARP, a sufficient number of iterations is enough. Therefore, the algorithm can converge.
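The loop of Algorithm 1 can be sketched for the single-band case ($n = 1$) against a user that follows a fixed periodic hopping table. Everything below is an illustrative assumption rather than the paper's implementation: the hopping table, the parameter values, the function name `run_rlja`, and the idealized reward (the jammer is told whether it hit the user).

```python
import random


def run_rlja(hop_table, num_channels, steps, train_steps,
             alpha, gamma, epsilon, seed=0):
    """Single-band RLJA sketch: returns the Q matrix and the per-slot rewards."""
    rng = random.Random(seed)
    Q = [[0.0] * num_channels for _ in range(num_channels)]
    rewards = []
    state = hop_table[0]                      # user's channel in the current slot
    for t in range(1, steps):
        eta = rng.random()
        if t < train_steps or eta < epsilon:
            action = rng.randrange(num_channels)   # random exploration
        else:
            row = Q[state]
            action = row.index(max(row))           # greedy action from the Q matrix
        next_state = hop_table[t % len(hop_table)]  # user hops to its next channel
        reward = 1 if action == next_state else 0   # jamming succeeded?
        target = reward + gamma * max(Q[next_state])
        Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target
        rewards.append(reward)
        state = next_state
    return Q, rewards


# Hypothetical hopping table over five channels; the user's behaviour never changes.
Q, rewards = run_rlja(hop_table=[0, 2, 4, 1, 3], num_channels=5, steps=1000,
                      train_steps=500, alpha=0.5, gamma=0.3, epsilon=0.0, seed=1)
print(sum(rewards[-100:]) / 100)   # jamming success rate over the last 100 slots
```

Because the user's behaviour is fixed, each state has a single rewarding action, and the greedy policy learned during the random training phase keeps hitting the user once exploration is switched off. Setting `epsilon` above zero reintroduces random actions after training.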

Theoretical Analysis of the Condition When DRL-Based Algorithm Does Not Converge
A DRL algorithm is a combination of a deep learning (DL) algorithm and an RL algorithm [36]; it combines the perception ability of DL with the decision-making ability of RL. DL is used to extract environmental features from the complex environment, while RL learns from the features and optimizes itself by trial and error. A neural network is a nonlinear function approximator that, through training on data, finds the best-fitting function for a given situation [37], which in this paper means approximating the Q-function of RL. Motivated by [38], in this paper we prove that the DRL-based algorithm does not converge when facing a certain probability of random actions in steady-state dynamics. Then we verify our proof by simulation.
Theorem 2. The anti-jamming process of the DRL algorithm does not converge when facing a certain probability of random actions.
Proof. The proof follows the lines given in [38]. To start with, note that the anti-jamming process of DRL is a Markov process [24]. Consider a Markov process with finite states $S = \{s_1, s_2, \ldots, s_n\}$. We assume that the cost of transition between states is zero and that the discount factor is $\alpha \in (0, 1)$. Let the function approximator be parametrized by a single scalar $r$.
Let $J(0)$ be a nonzero vector that satisfies $e^T J(0) = 0$, where $e = (1, 1, \ldots, 1)^T$ has $n$ entries. $J(r)$ is the unique solution of the linear differential equation [38]:

$$\frac{dJ(r)}{dr} = (Q + \sigma I)\, J(r)$$

where $I$ is the $n \times n$ identity matrix and $\sigma$ is a small positive constant. The matrix $Q$ represents the average direction resulting from a constant probability of random actions. According to our definition of $J$, all functions represented by $J$ can be mapped into the three-dimensional space $\{J \in \mathbb{R}^3 \mid e^T J = 0\}$. Let $P$ denote the transition probability matrix of the Markov chain. Since the transition cost is set to 0, the TD(0) operator [39] can be defined as $T^{(0)} = \alpha P$: for any $J \in \mathbb{R}^3$, there exist a certain angle $\theta$ and a scalar $\beta \in (0, 1)$ such that, for any $r$, $T^{(0)} J(r)$ is equal to the vector $J(r)$ scaled by $\beta$ and rotated by the angle $\theta$. Concretely, TD(0) updates the parameter $r_t$ using the temporal difference $\alpha J(i_{t+1}, r_t) - J(i_t, r_t)$, where $i_t$ is the state at time $t$. In the expected update direction, where $j$ is the state visited after state $i$, the term $\alpha \sum_{j=1}^{n} p_{i,j} J(j, r) - J(i, r)$ is the $i$-th component of $(\alpha P - I) J(r)$. From Equation (9), and as $r_{t+1} - r_t$ becomes small enough, the update process of the algorithm can be approximated by another differential equation. With the average direction given by the parameter, the differential equation describing this process is:

$$\frac{dr}{dt} = J^T(r)\,(Q^T + \sigma I)\,(\alpha P - I)\, J(r)$$

For $\sigma = 0$ (the neural network converges slowly, so $\sigma$ can be approximated by 0), we have:

$$\frac{dr}{dt} = J^T(r)\, Q^T (\alpha P - I)\, J(r) = \alpha\, J^T(r)\, Q^T P\, J(r) = \frac{\alpha}{2}\, J^T(r)\,(Q^T P + P^T Q)\, J(r)$$

where the first equality follows from $J^T(r) Q^T J(r) = 0$ for any $r$, because for large enough $n$, $\mathrm{rank}(Q^T) < n$ (see the example in [38]). Recall that $Q$ is the average direction under the influence of a constant probability of random actions. Because $Q$ guides $J(r)$ in the direction of misconvergence, $\frac{Q \cdot P}{\|Q\|\,\|P\|} = \theta > 0$.
Therefore, $Q^T P + P^T Q$ is positive definite. With the setting of random actions, there exists a positive constant $c$ such that:

$$\frac{dr}{dt} \geq c\, \|J(r)\|^2$$

When $\sigma$ is positive and sufficiently small, the inequality remains true. Combining this inequality with the fact that

$$\frac{d}{dr} \|J(r)\|^2 = J^T(r)\,(Q + Q^T)\, J(r) + 2\sigma \|J(r)\|^2 = 2\sigma \|J(r)\|^2$$

it can be seen that $J(r)$ does not converge. The approximation process (DRL) never reaches the target if these conditions are met.
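The two algebraic facts used in the proof can be checked numerically. The matrices $P$ and $Q$ below are an assumption: they are the three-state example matrices as recalled from [38], not values recoverable from this section. Under that assumption, the sketch verifies that $J^T Q J = 0$ for a vector with $e^T J = 0$, and that $Q^T P + P^T Q$ is positive definite by Sylvester's criterion.

```python
# Assumed example matrices (three-state chain, as recalled from [38]).
P = [[0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
Q = [[1.0, 0.5, 1.5],
     [1.5, 1.0, 0.5],
     [0.5, 1.5, 1.0]]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def quad_form(J, A):
    """Scalar J^T A J."""
    return sum(J[i] * A[i][j] * J[j]
               for i in range(len(J)) for j in range(len(J)))


def leading_minors_3x3(S):
    """Leading principal minors of a 3x3 matrix (Sylvester's criterion)."""
    m1 = S[0][0]
    m2 = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    m3 = (S[0][0] * (S[1][1] * S[2][2] - S[1][2] * S[2][1])
          - S[0][1] * (S[1][0] * S[2][2] - S[1][2] * S[2][0])
          + S[0][2] * (S[1][0] * S[2][1] - S[1][1] * S[2][0]))
    return m1, m2, m3


QtP = matmul(transpose(Q), P)
S = [[QtP[i][j] + QtP[j][i] for j in range(3)] for i in range(3)]  # Q^T P + P^T Q

J = [1.0, -2.0, 1.0]           # any vector whose entries sum to zero
print(quad_form(J, Q))         # J^T Q J
print(leading_minors_3x3(S))   # all positive means S is positive definite
print(quad_form(J, S))         # positive, so dr/dt > 0 for this J
```

With these matrices, $r$ keeps increasing while $\|J(r)\|$ grows at rate $2\sigma\|J(r)\|^2$, which is the divergence mechanism that the proof relies on.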

Remark 1. According to Theorem 2, we may conclude that if the jammer's strategy is random, the DRL-based anti-jamming method will fail to converge. However, from the perspective of the jammer, random jamming may have low jamming efficiency since it is non-targeted. Recalling that the "exploration" process in the Q-learning algorithm is random, in the simulation part we investigate how the randomness of RL-based jamming affects the performance of the DRL-based anti-jamming algorithm.

Simulation Parameters
Consider one transmitter-receiver pair and jammers working in a frequency band of 20 MHz. Both the user's and the jammer's signals were raised-cosine waveforms with roll-off factor η = 0.6, and full-band detection was performed every $t_s$ = 1 ms with $\Delta f$ = 100 kHz. The bandwidth of both the jammer and the transmitter was 4 MHz, so the size of the action set was $|A_j| = 5$. The jamming power was 30 dBm, and the signal power was 0 dBm. The demodulation threshold $\beta_{th}$ was 10 dB. Five kinds of jamming modes were considered against the DRL-based user:
1. Sweeping jamming;
2. Comb jamming;
3. Dynamic jamming (the sweeping and comb jamming patterns are changed periodically every 100 ms);
4. Jamming based on statistics (the three channels most frequently used by the user in the last 100 ms are selected);
5. RL-based jamming.
Figure 4a shows the spectrum waterfall of the first three jamming modes, where the horizontal axis represents frequency and the vertical axis represents time. The simulation results of these modes combating a DRL-based anti-jamming method can be found in [24]. The patterns of the statistics-based and RL-based jamming changed dynamically according to the user's signal, as can be seen in Figure 4b. Three kinds of users were considered against the RL-based jammer:
1. Frequency-fixed user;
2. Frequency-hopping user with the frequency hopping table shown in Table 1;
3. DRL-based user [24].
According to [40], the DRL algorithm in our simulation required about $10^9$ floating-point operations per second (FLOPS), while the processing speed of a normal CPU is on the order of $10^{11}$ FLOPS. At present, an embedded neural-network processing unit (NPU) can reach $1.6 \times 10^{13}$ FLOPS. It is therefore likely that DRL algorithms can be run in many future scenarios, such as underwater and unmanned aerial vehicle (UAV) systems [41,42].
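The statistics-based jamming mode described in this section can be sketched as follows. The function name and the sample history are hypothetical; the history is assumed to hold one observed user channel per sensing instant, so that 100 entries correspond to 100 ms at $t_s$ = 1 ms.

```python
from collections import Counter


def top_k_channels(history, k=3):
    """Channels most frequently used by the user in the observed window."""
    counts = Counter(history)
    return [channel for channel, _ in counts.most_common(k)]


# Hypothetical record of the user's channel over the last observation window.
history = [2, 2, 2, 1, 1, 4, 4, 4, 4, 0]
print(top_k_channels(history))   # channels the statistical jammer would jam next
```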

Performance Analysis of User versus Jammers
To investigate the performance of the DRL-based anti-jamming algorithm and verify the theoretical analysis in Section 4, we let DRL-based users combat different kinds of jammers. In Figure 5, the horizontal axis represents cumulative iterations. Each iteration corresponds to 2 s (200 times) of the user's communication. The vertical axis is the normalized throughput in one iteration. In Figure 5, four kinds of jammers jam the DRL-based user. It can be seen that, for all four types of jamming, the DRL-based user can almost perfectly avoid the jamming signal after training. To illustrate the effect of randomness on the DRL-based anti-jamming algorithm, Figure 6 shows the throughput against the RL-based jammer with different ε. As shown in Figure 6, if the RL-based jammer used a completely greedy strategy (ε = 0), the jammer failed after 50 iterations. As ε increased, the performance of DRL-based anti-jamming dropped dramatically. The deep neural network did not converge because of the randomness, so the DRL-based user failed to learn the pattern of the RL-based jamming. From the right of Figure 4b, we can see that the DRL-based method is jammed with high probability by RL-based jamming. Figure 7 shows the performance of the RL-based jammer (ε = 0.2) against different modes of users. The RL-based jammer can effectively jam the frequency-fixed and frequency-hopping users because their communication processes are ARPs according to the definition in [35]. As expected, since the DRL-based user's strategy changes according to the communication effect, the user's process is not an ARP, and the RL-based jammer cannot achieve its ideal effect because the RL algorithm does not converge. Note that the throughput of the DRL-based user is still unsatisfactory because of ε = 0.2. Figure 8 shows the normalized throughput of different modes of users with different ε after 50 iterations. As ε increased, the probability that the RL algorithm chose actions randomly increased, so the jammer explored excessively and the actions with good reward could not be fully exploited. However, as ε increased, the performance of the DRL-based user decreased rapidly because of the randomness, as explained above. Table 2 summarizes the performance of different kinds of jammers against different kinds of users. The numbers in the iteration column are the numbers of iterations that the RL or DRL algorithm needed to converge. The RL-based jamming was simulated with ε = 0.2. The normalized throughput column shows the average normalized throughput calculated after 50 iterations. As shown in Table 2, the throughput of the DRL-based anti-jamming method was restricted by the RL-based jamming with ε = 0.2. Given the fast convergence and strong jamming effect of the RL-based jammer, effective anti-jamming methods against it need to be studied further.
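The per-iteration metric plotted in the figures can be sketched as below, assuming that the normalized throughput of an iteration is the fraction of its 200 transmissions that succeed. This definition and the function name are assumptions made for illustration; the text above does not spell out the normalization.

```python
def normalized_throughput(success_flags, slots_per_iteration=200):
    """Fraction of successful transmissions in each complete iteration."""
    result = []
    last_start = len(success_flags) - slots_per_iteration
    for start in range(0, last_start + 1, slots_per_iteration):
        window = success_flags[start:start + slots_per_iteration]
        result.append(sum(window) / slots_per_iteration)
    return result


# Two illustrative iterations: one without jamming hits, one jammed every other slot.
flags = [1] * 200 + [1, 0] * 100
print(normalized_throughput(flags))
```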

Conclusions
In this paper, we investigated the performance of the DRL-based anti-jamming method. We first designed an RL-based jammer. To find when the DRL-based method would fail, we theoretically analyzed the condition under which the DRL algorithm cannot converge, and then verified the analysis by simulation. The simulation results showed that the performance of the DRL-based user could be effectively restricted by the RL-based jammer with a small ε, while the RL-based jammer could achieve an excellent jamming effect against common users. In future work, we will study situations where a user can work on multiple channels simultaneously.

Figure 3. Time-frequency two-dimensional diagram of jammer and user.
Illustration of sweeping, comb and dynamic jamming modes.

Figure 5. Performance of deep reinforcement learning (DRL)-based user versus different kinds of jammers.

Figure 6. Performance of DRL-based user versus RL-based jammer with different ε.

Figure 7. Performance of reinforcement learning (RL)-based jammer with ε = 0.2 versus different modes of user.

Figure 8. Performance of RL-based jammer with different ε versus DRL-based user.

Table 2. Performance comparison of different modes of user and jammer.