An Anti-Jamming Hierarchical Optimization Approach in Relay Communication System via Stackelberg Game

In this paper, we study the joint relay selection and power control optimization problem in an anti-jamming relay communication system. Considering the hierarchical competitive relationship between a user and a jammer, we formulate the anti-jamming problem as a Stackelberg game: the user, acting as the leader, selects its relay and power strategy first, while the jammer, acting as the follower, then chooses its power strategy. Moreover, we prove the existence of the Stackelberg equilibrium. Based on the Q-learning algorithm and the multi-armed bandit method, a hierarchical joint optimization algorithm is proposed. Simulation results show the user's strategy selection probability and the jammer's regret. We compare the user's and jammer's utilities under the proposed algorithm with a random selection algorithm to verify the algorithm's superiority. Moreover, the influence of feedback error and eavesdropping error on utility is analyzed.

Power adjustment is considered an effective method to respond directly to the jammer's attacks. In [1], the authors made receivers more robust against jamming with adaptive arrays utilizing the power inversion algorithm. In [2], based on the competitive relationship between the user and jammer, a Stackelberg game was proposed to model the hierarchical anti-jamming optimization problem in the presence of a smart jammer. Considering that it is hard for the user and jammer to obtain accurate power and channel state information, the authors of [3][4][5] analyzed the anti-jamming performance with observation error. In [6], the authors studied the communication problem in wireless body area networks (WBANs) facing jamming attacks via a Stackelberg framework. In [7], the authors investigated a cross-layer anti-jamming joint optimization problem based on the Q-learning algorithm. In [8], the authors investigated a discrete power strategy optimization problem with imperfect information. In [9], the authors analyzed the user's strategy selection probability and the jammer's regret. In this paper, we give a utility comparison under the proposed algorithm and a random selection algorithm with feedback error and eavesdropping error. The main contributions of this paper are summarized as follows:

• In the relay communication scenario, considering the competitive relationship between the user and jammer, we formulate the anti-jamming joint optimization problem as a Stackelberg game, in which the user acts as the leader and the jammer acts as the follower.

• We prove the existence of the SE and propose a hierarchical joint optimization algorithm based on Q-learning and the MAB method under the Stackelberg game framework.

• Simulation results show the user's strategy selection probability and the jammer's regret. Utility under the proposed algorithm is compared with a random selection algorithm to verify the algorithm's superiority. Moreover, the influence of feedback error and eavesdropping error on utility is analyzed.
The main differences from our previous work [36] are summarized as follows: (i) multiple relays exist in the anti-jamming system, which requires optimizing relay selection and power control simultaneously; (ii) the feedback error and eavesdropping error are introduced because it is hard for the user and jammer to obtain accurate feedback information; (iii) the user and jammer do not need to know the power and channel fading information, an aspect ignored in [36].
In the rest of this paper, we establish the anti-jamming model in the proposed relay communication scenario and give the related problem formulations in Section 2. In Section 3, we formulate the anti-jamming joint optimization problem as a Stackelberg game, prove the existence of the SE, and propose a hierarchical joint optimization algorithm. In Section 4, simulation results show the user's strategy selection probability and the jammer's regret; in addition, a utility comparison under different algorithms with feedback error and eavesdropping error is given. Finally, we draw conclusions in Section 5.

System Model
We consider an anti-jamming relay communication scheme consisting of a base station (BS), a user, a jammer and a relay group, as shown in Figure 1. The user collects nearby information and sends it to the BS. Owing to the severe channel fading of wireless communication, the user cannot transmit messages to the BS directly and instead communicates with the BS through a relay group. In phase 1, the user transmits messages to the relay group, containing its collected information and an instruction indicating which relay is required to assist in forwarding the messages to the BS. In phase 2, the selected relay adopts the amplify-and-forward (AF) mode and retransmits the messages received from the user to the BS. The user and relay use the same channel. The jammer sends malicious noise-like signals to degrade the communication quality. Moreover, both the user and jammer are able to adjust their power to achieve a better communication or jamming effect. Note that, for simplicity, we consider the relay's power to remain unchanged: since the relay is a fixed ground station, its power constraint is less restrictive than the user's. We consider that the user and jammer update their respective strategies on different time scales. On the one hand, the jammer updates its power strategy every epoch. On the other hand, in order to improve its anti-jamming ability, the user updates its joint relay and power strategy more frequently than the jammer, so that it can adjust in time and obtain the optimal strategy. As shown in Figure 2, each epoch is divided into T time slots, and at the end of each time slot, the BS feeds the signal-to-interference-plus-noise ratio (SINR) of the current time slot back to the user as the communication reward; this feedback is also eavesdropped by the jammer [40][41][42] to measure the jamming effect in the current epoch. In the formulated model, we assume that the user has M power levels and its power set is P = [P_1, P_2, ..., P_M]. The jammer has L power levels and its power set is J = [J_1, J_2, ..., J_L].
The relay group contains N relays and the relay set is R = [R_1, R_2, ..., R_N]. The nth relay's power is a fixed value Q_n, and the set of all relays' powers is Q = [Q_1, Q_2, ..., Q_N]. The distances between the nth relay and the user, jammer and BS are d_{u,r_n}, d_{j,r_n} and d_{B,r_n}, respectively. The distance between the BS and the jammer is d_{B,j}. For convenience, frequently used notations are listed in Table 1.

Table 1. Summary of notation.

P, J: Power sets of the user and jammer
R, Q: Relay set and the set of all relays' powers
P, J: Transmitting powers of the user and jammer, i.e., P ∈ P, J ∈ J
α_{u,r_n}, β_{j,r_n}: Channel gains between the user, jammer and the nth relay
α_{B,r_n}, β_{B,j}: Channel gains between the nth relay, jammer and the BS
N_{r_n}, N_B: Background noise powers at the nth relay and the BS
G_n: The nth relay's amplification factor
X_u, X_j, X_{r_n}: Transmitted signals of the user, jammer and nth relay
Y_{r_n}, Y_B: Received signals at the nth relay and the BS
C_u, C_j: Transmitting costs of the user and jammer
ε_1, ε_2: Feedback error and eavesdropping error
γ: SINR received at the BS
γ̂, γ̃: Feedback SINR received by the user and SINR eavesdropped by the jammer

Problem Formulation
In the anti-jamming system, the communication process is divided into two phases. In phase 1, the user transmits messages and selects one relay to help forward the messages to the BS. When the jammer senses the communication signals between the user and relay, it immediately releases jamming signals to disrupt the normal communication. Let P and J denote the user's and jammer's powers, where P ∈ P and J ∈ J. Inspired by [43], we define α_{u,r_n} = d_{u,r_n}^{-δ} and β_{j,r_n} = d_{j,r_n}^{-δ} as the channel gains between the user, jammer and the nth relay, respectively, where δ is the path-loss factor. The background noise signal at the nth relay is m_{r_n}. X_u and X_j denote the user's and jammer's transmitted signals, respectively, and Y_{r_n} denotes the nth relay's received signal from the user and jammer. In phase 2, the relay adopts the amplify-and-forward (AF) mode. When the nth relay is selected to help the user forward messages to the BS, it applies the amplification factor G_n and retransmits the signal X_{r_n}. We define the received signal at the BS as Y_B, where α_{B,r_n} = d_{B,r_n}^{-δ} and β_{B,j} = d_{B,j}^{-δ} denote the channel gains between the nth relay, jammer and the BS, respectively, and m_B is the background noise signal at the BS.
Thus the SINR received at the BS is denoted by γ. The user obtains the feedback SINR γ̂ at the end of each time slot. Considering that feedback information error may exist, we introduce the feedback error ε_1 = |γ − γ̂|/γ, which measures the deviation of the feedback SINR γ̂ from the actual SINR γ. Inspired by [2][3][4] and considering the transmitting cost, we define the user's utility function U based on the received SINR. The jammer eavesdrops the feedback SINR from the BS to the user at the end of each time slot and obtains the eavesdropping result γ̃. Similarly, considering that it is hard for the jammer to eavesdrop γ accurately, we introduce the eavesdropping error ε_2 = |γ − γ̃|/γ, which measures the deviation of the eavesdropped SINR γ̃ from the actual SINR γ. Considering the jamming cost, we define the jammer's utility function V based on the eavesdropped result γ̃.
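The display equations of the two-phase AF model did not survive extraction. Under standard AF-relay assumptions they can be sketched as follows; this is a reconstruction from the surrounding definitions, so the exact forms and cost terms of the original Eqs. (1)-(6) may differ:

```latex
% Phase 1: signal received at the n-th relay (Eq. (1), reconstructed)
Y_{r_n} = \sqrt{P\,\alpha_{u,r_n}}\,X_u + \sqrt{J\,\beta_{j,r_n}}\,X_j + m_{r_n}

% AF amplification factor normalizing to the relay power Q_n (Eq. (2))
G_n = \sqrt{\frac{Q_n}{P\,\alpha_{u,r_n} + J\,\beta_{j,r_n} + N_{r_n}}}

% Phase 2: signal received at the BS (Eq. (3))
Y_B = \sqrt{\alpha_{B,r_n}}\,G_n Y_{r_n} + \sqrt{J\,\beta_{B,j}}\,X_j + m_B

% End-to-end SINR at the BS (Eq. (4))
\gamma = \frac{P\,\alpha_{u,r_n}\,G_n^2\,\alpha_{B,r_n}}
             {G_n^2\,\alpha_{B,r_n}\left(J\,\beta_{j,r_n} + N_{r_n}\right)
              + J\,\beta_{B,j} + N_B}

% Utilities with linear transmitting costs (Eqs. (5)-(6), assumed forms)
U(P, Q_n, J) = \hat{\gamma} - C_u P, \qquad
V(P, Q_n, J) = -\tilde{\gamma} - C_j J
```

The utility forms follow the common convention in [2]-style Stackelberg anti-jamming models (reward minus a linear power cost); the signs of the cost terms are assumptions.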

The Joint Relay Selection and Power Control Optimization Method via Stackelberg Game
In this section, we first formulate the anti-jamming joint optimization problem as a Stackelberg game, which is an effective theoretical framework for the hierarchical confrontation between the user and jammer. We then prove the existence of the Stackelberg equilibrium (SE). Finally, we propose a hierarchical joint optimization algorithm under the game framework.

Stackelberg Game Model
In anti-jamming relay communication, the user transmits messages first and selects a relay to help forward the messages to the base station; after sensing the user's signals, the jammer immediately releases jamming signals. From the perspective of the Stackelberg game, the user acts as the leader and the jammer acts as the follower in the proposed scenario. The proposed game can be denoted as G = {P, Q, J, U, V}, where P and Q denote the user's power strategy space and relay selection strategy space, respectively, J denotes the jammer's power strategy space, and U and V denote the user's and jammer's utilities, respectively.
Based on the utilities given in Section 2.2, both the user and jammer aim to maximize their utilities to obtain a better communication or jamming effect. For the jammer, given the user's joint strategy P ∈ P, Q_n ∈ Q, it chooses the optimal power strategy to disrupt normal communication, and its optimization problem is expressed as: max_J V(P, Q_n, J).
For the user, it makes a joint relay selection and power control strategy to guarantee anti-jamming communication quality, and the user's optimization problem can be expressed as follows: max P,Q n U(P, Q n , J).
Based on the above analysis, the joint optimization problem can be solved by a hierarchical decision-making method under the Stackelberg game framework:

(P*, Q_n*) = arg max_{P, Q_n} U(P, Q_n, J*(P, Q_n)), subject to P ∈ P, Q_n ∈ Q,

where the follower's optimal response is

J*(P, Q_n) = arg max_J V(P, Q_n, J), subject to J ∈ J.

Existence of Stackelberg Equilibrium
In the proposed scenario, we consider that the user adopts a mixed strategy to fool the jammer: the randomness of the strategy can effectively improve the anti-jamming performance. Let q denote the user's mixed strategy, i.e., the probability distribution over the user's optional relay selections and power strategies. Motivated by [4,8], we define the Stackelberg equilibrium (SE) and prove its existence in the following.

Definition 1. If no player can improve its utility by unilaterally deviating from its optimal strategy, the policy profile (q*, J*) constitutes an SE, which satisfies the following conditions: U(q*, J*) ≥ U(q, J*) for all q, and V(q*, J*) ≥ V(q*, J) for all J ∈ J.

Lemma 1. There exist a user's stationary strategy and a smart jammer's stationary strategy that constitute an SE [8].
Proof. Inspired by [44,45], every finite strategy game has a mixed strategy equilibrium [46], which means that an SE exists in the formulated game. The jammer aims to maximize its utility and makes its strategy based on the best response: J*(q) = arg max_J V(q, J). Knowing the jammer's optimal strategy, the user's optimal strategy can be obtained as: q* = arg max_q U(q, J*(q)). Based on the above analysis, the policy profile (q*, J*) constitutes an SE.

Hierarchical Joint Optimization Algorithm
In this section, we propose a hierarchical joint optimization algorithm, which realizes the user's strategy optimization based on the Q-learning algorithm [47] and the jammer's strategy optimization based on the multi-armed bandit (MAB) method [48][49][50], respectively.
For the user, its mixed power and relay selection strategy at the tth time slot is expressed as q(t) = {q_{1,1}(t), q_{1,2}(t), ..., q_{m,n}(t), ..., q_{M,N}(t)}, where q_{m,n}(t) denotes the probability of selecting the nth relay and power P_m at the tth time slot. The user's Q value is updated as:

Q_{m,n}(t+1) = (1 − κ_t) Q_{m,n}(t) + κ_t r(t),

where κ_t denotes the user's learning rate at the tth time slot, which satisfies Σ_{t=0}^{∞} κ_t = ∞ and Σ_{t=0}^{∞} (κ_t)^2 < ∞, and r(t) denotes the user's reward, i.e., its utility U at the tth time slot. Based on the above, the user's mixed strategy is updated as a Boltzmann distribution:

q_{m,n}(t+1) = exp(Q_{m,n}(t+1)/τ_0) / Σ_{m'=1}^{M} Σ_{n'=1}^{N} exp(Q_{m',n'}(t+1)/τ_0),

where τ_0 controls the exploration-exploitation tradeoff.

For the jammer, we formulate the finite strategy optimization as an MAB problem and consider each power strategy as an arm, i.e., the jamming power strategy J_l is considered as the lth arm. J(k) denotes the jammer's power strategy at the kth epoch. The number of times power J_l has been selected in the past K epochs is:

T_l(K) = Σ_{k=1}^{K} δ(J(k), J_l),

where δ(x, y) is the Kronecker delta function, i.e., δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise. The jammer's statistical average reward μ_l of the lth arm in the past K epochs is:

μ_l = (1 / T_l(K)) Σ_{k=1}^{K} δ(J(k), J_l) V(k),

and B(K) = Σ_{k=1}^{K} V(k) denotes the jammer's total utility in the past K epochs. Inspired by [48][49][50], we define the jammer's regret in the past K epochs, an important index representing the utility lost so far through failing to select the optimal jamming power strategy:

R(K) = B*(K) − B(K),

where l* denotes the jammer's estimated optimal arm based on the historical statistical average reward, and B*(K) denotes the accumulated utility the jammer could have obtained had it always chosen l* in the past K epochs.
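The user-side update above (stateless Q-learning with a Boltzmann mixed strategy) can be sketched in Python. The joint action space, reward values and temperature below are toy assumptions rather than the paper's simulation settings, and the learning rate is taken as the per-action sample-average rate 1/count, one valid choice satisfying the stated conditions on κ_t:

```python
import math
import random

def boltzmann(q_values, tau=0.5):
    """Map Q-values to a mixed strategy via a Boltzmann (softmax) distribution."""
    m = max(q_values.values())  # subtract the max for numerical stability
    exp_q = {a: math.exp((v - m) / tau) for a, v in q_values.items()}
    z = sum(exp_q.values())
    return {a: e / z for a, e in exp_q.items()}

# Joint actions (power index m, relay index n) for a toy M = 2, N = 2 setting.
actions = [(m, n) for m in range(2) for n in range(2)]
Q = {a: 0.0 for a in actions}
count = {a: 0 for a in actions}

random.seed(0)
for t in range(200):
    strategy = boltzmann(Q)
    a = random.choices(actions, weights=[strategy[x] for x in actions])[0]
    # Toy reward: joint action (1, 0) pays 1.0 on average, the others 0.2.
    r = (1.0 if a == (1, 0) else 0.2) + random.uniform(-0.05, 0.05)
    count[a] += 1
    kappa = 1.0 / count[a]  # per-action diminishing learning rate
    Q[a] = (1 - kappa) * Q[a] + kappa * r  # Q <- (1 - kappa) Q + kappa r

print(max(Q, key=Q.get))
```

After 200 slots the mixed strategy concentrates on the best joint action while keeping positive probability on the others, which is the randomness the paper relies on to fool the jammer.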
In the MAB problem, UCB1 [48][49][50] is an effective policy for optimizing the power strategy. Adopting the UCB1 method, the jammer updates its strategy in the (K+1)th epoch according to:

J(K+1) = arg max_{l} [ μ_l + sqrt(2 ln K / T_l(K)) ].

According to the above analysis, based on the Q-learning algorithm and the UCB1 method, both the user's and jammer's optimal strategies can be obtained through the hierarchical joint optimization algorithm, which is shown in Algorithm 1.

Algorithm 1: Hierarchical joint optimization algorithm (HJOA).
Initialization: the number T of time slots in one epoch, the number k_max of epochs; the user selects power P randomly; t = 1, k = 0.
Outer iteration (for each epoch k):
1. The jammer selects the optimal arm J(k + 1) according to Equation (21).
   Inner iteration (for each time slot t):
   (1) In the tth time slot, the user selects its joint relay and power strategy according to q(t).
   (2) Obtain the user's utility U(t) through the SINR feedback at the end of the current time slot according to Equation (5), and update the user's Q value and mixed strategy q(t).
2. Obtain the jammer's utility V(k) by eavesdropping the SINR feedback in the kth epoch according to Equation (6), and update μ_l and T_l.
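The two-timescale structure of Algorithm 1 can be sketched end to end with a toy, single-relay surrogate. The SINR and utility functions below are hypothetical stand-ins for Eqs. (4)-(6), and the temperature, epoch count and cost constants are illustrative assumptions:

```python
import math
import random

# Hypothetical SINR surrogate and utilities (stand-ins for Eqs. (4)-(6)).
def user_utility(P, J):
    gamma = P / (J + 0.5)          # toy SINR: grows with P, shrinks with J
    return gamma - 0.1 * P         # reward minus transmitting cost C_u = 0.1

def jammer_utility(P, J):
    gamma = P / (J + 0.5)
    return -gamma - 0.1 * J        # jammer gains by suppressing gamma

powers = [0.5, 1.0, 1.5, 2.0, 2.5]     # shared discrete power set P = J
random.seed(2)
Qv = {p: 0.0 for p in powers}          # user's Q-values (single relay, for brevity)
cnt_u = {p: 0 for p in powers}
mu = {j: 0.0 for j in powers}          # jammer's average reward per arm
cnt_j = {j: 0 for j in powers}
T = 100                                # time slots per epoch

for k in range(1, 201):                # outer loop: jammer epochs (UCB1)
    unseen = [j for j in powers if cnt_j[j] == 0]
    J = unseen[0] if unseen else max(
        powers, key=lambda j: mu[j] + math.sqrt(2 * math.log(k) / cnt_j[j]))
    V = 0.0
    for t in range(T):                 # inner loop: user time slots (Q-learning)
        m = max(Qv.values())
        w = [math.exp((Qv[p] - m) / 0.3) for p in powers]  # Boltzmann strategy
        P = random.choices(powers, weights=w)[0]
        cnt_u[P] += 1
        Qv[P] += (user_utility(P, J) - Qv[P]) / cnt_u[P]
        V += jammer_utility(P, J) / T  # epoch-average eavesdropped reward
    cnt_j[J] += 1
    mu[J] += (V - mu[J]) / cnt_j[J]

print(max(Qv, key=Qv.get), max(mu, key=mu.get))
```

In this toy surrogate both players drift toward their highest power level; with the paper's actual SINR and cost model the converged strategies would of course differ.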

Simulation Results and Discussions
In this section, the simulation parameters and system location settings are given, and then we present the simulation results with brief discussions. In Section 4.1, we give the user's power strategy probability and relay selection probability, respectively; the jammer's regret is also given. In Section 4.2, we analyze the influence of feedback error and eavesdropping error, and we also compare the user's and jammer's utilities under the proposed hierarchical joint optimization algorithm (HJOA) with a random selection algorithm to verify the algorithm's superiority.
In the simulation, we assume the user's and jammer's discrete power sets are P = J = [0.5 W, 1 W, 1.5 W, 2 W, 2.5 W], and their transmission costs are C_u = C_j = 0.1. The relay set is R = [R_1, R_2, R_3, R_4], and all relays have the same power Q = 2 W. The number of time slots in one epoch is T = 100, and the path-loss factor is δ = 2.
As shown in Figure 3, there exist a BS, a user, a jammer and a relay group in the investigated scenario. The coordinates of the BS and the user are (0 km, 10 km) and (10 km, 0 km), and the jammer is located at (10 km, 10 km). The relay group consists of four relays, located at (2.5 km, 2.5 km), (5 km, 2.5 km), (5 km, 5 km) and (7.5 km, 5 km), respectively. The red arrows denote jamming signals and the blue arrows denote communication signals.

Figure 4 shows the user's power selection probability in one epoch, which converges to a stationary mixed strategy after about 70 time slots. All power strategies retain a positive probability of being selected, which increases the randomness of the strategy and fools the jammer effectively. The user tends to choose power strategy P_3 and tends not to choose P_1 and P_2, because a higher transmission power helps guarantee the SINR at the BS. However, an excessive power increases the power cost, so the probability of P_3 is higher than those of P_4 and P_5, realizing a tradeoff between the anti-jamming effect and the power cost.

Figure 5 shows the user's relay selection probability in one epoch, which converges to a stationary mixed strategy after about 50 time slots. After convergence, the selection probabilities of R_1, R_2, R_3 and R_4 are about 0.247, 0.355, 0.204 and 0.194, respectively. Although R_3 is closer to the BS and the user, the distance between R_3 and the jammer is also shorter. Hence, to minimize the jammer's influence, the user tends to choose R_2 and R_1 even though they are farther from the BS and the user than R_3 and R_4. Considering the influence of distance on channel fading, the user prefers R_2 to R_1.

Figure 6 shows the jammer's regret over 1000 epochs, which denotes the loss of payoff due to the fact that the optimal strategy is not always chosen during the decision-making process.
We can find that the jammer's regret grows nearly logarithmically, i.e., it grows quickly at the start and slowly afterwards, which means that the loss caused by not choosing the optimal strategy becomes smaller and smaller. However, the regret keeps growing because the jammer occasionally selects non-optimal strategies to avoid missing a potentially better strategy in the future, thereby realizing exploration and exploitation simultaneously.

Utility Comparison under Different Algorithms with Feedback Error and Eavesdropping Error
In one epoch, we average the user's utility every 20 time slots and compare the user's utility under the proposed HJOA and the random selection algorithm, as shown in Figure 7. Moreover, we analyze the influence of the feedback error on utility. Under HJOA, the user's utility first grows and then remains stable, because the user gradually obtains the optimal strategy as the algorithm iterates. When the user receives correct feedback, i.e., ε_1 = 0, the user's utility under HJOA reaches its maximum value and is improved compared with the random selection algorithm. The greater the feedback error ε_1, the larger the deviation between the user's received feedback and the actual value, which decreases the utility.

In Figure 8, we average the jammer's utility every 20 epochs and compare the jammer's utility under the proposed algorithm and the random selection algorithm, and we analyze the influence of the eavesdropping error on utility. Under the proposed HJOA, the jammer's utility grows quickly at first and then more slowly, because the jammer gradually obtains the optimal strategy. When the jammer eavesdrops the correct feedback, i.e., ε_2 = 0, its utility reaches the maximum value and is higher than that under the random selection algorithm. The greater the eavesdropping error ε_2, the lower the jammer's utility, because ε_2 prevents the jammer from eavesdropping the user's feedback information accurately and thus affects its decision-making.

Conclusions
In this paper, we studied the joint relay selection and power control optimization problem in an anti-jamming relay communication system via a Stackelberg game, in which a user acted as the leader and a jammer acted as the follower. Based on the Q-learning algorithm and the multi-armed bandit method, a hierarchical joint optimization algorithm was proposed. Simulation results showed the user's strategy selection probability and the jammer's regret. Moreover, we analyzed the influence of feedback error and eavesdropping error, and compared the user's and jammer's utilities under the proposed algorithm with a random selection algorithm to verify the algorithm's superiority. In the future, we will consider dynamic changes of the user's and jammer's positions in the multi-user scenario to further improve the anti-jamming performance.