Article

Jamming Strategy Optimization through Dual Q-Learning Model against Adaptive Radar

Key Lab of Universal Wireless Communications, Ministry of Education of China, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(1), 145; https://doi.org/10.3390/s22010145
Submission received: 9 November 2021 / Revised: 16 December 2021 / Accepted: 21 December 2021 / Published: 26 December 2021
(This article belongs to the Special Issue Signal Processing and Machine Learning for Smart Sensing Applications)

Abstract

Modern adaptive radars can switch work modes to perform various missions and simultaneously use pulse parameter agility in each mode to improve survivability, which multiplies the decision-making complexity and degrades the performance of existing jamming methods. In this paper, a two-level jamming decision-making framework is developed, based on which a dual Q-learning (DQL) model is proposed to optimize the jamming strategy, and a dynamic method for jamming effectiveness evaluation is designed to update the model. Specifically, the jamming procedure is modeled as a finite Markov decision process. On this basis, the high-dimensional jamming action space is decomposed into two low-dimensional subspaces containing the jamming mode and the pulse parameters respectively; two interacting, specialized Q-learning models are then built to obtain the optimal solution. Moreover, the jamming effectiveness is evaluated by measuring the distance between indicator vectors to provide feedback for the DQL model, where the indicators are dynamically weighted to adapt to the environment. Experiments demonstrate the advantage of the proposed method in learning the radar's joint strategy of mode switching and parameter agility: it improves the average jamming-to-signal ratio (JSR) by 4.05% while reducing the convergence time by 34.94% compared with the normal Q-learning method.

1. Introduction

In radar jamming, accurate decision-making is an important prerequisite for effective jamming. For a single-mode radar with only a few fixed parameter combinations, jamming decision-making based on template matching can be efficient [1]. Nowadays, with advances in electronic technology, modern radars tend to be adaptive. An adaptive radar performs various tasks by switching work modes automatically, and the available modes vary across radars. Take a ground-based radar with search, tracking, and recognition modes as an example: the radar initially scans the entire airspace in the search mode, switches to the tracking mode when a mission-related target is detected, and transitions to the recognition mode after the target is confirmed. If the echo signal quality drops or the target is lost due to jamming, the radar takes anti-jamming measures autonomously or returns to the search mode. For phased array radars with modes such as "range while scan" (RWS), "track and search" (TAS), and "single target track" (STT), the adaptive switching strategy is more complicated. Additionally, in each work mode, the pulse parameters can be changed in real time, based on sensing of the environment, to improve the radar's performance or survivability [2,3]. In this context, the effectiveness of conventional jamming decision-making methods is decreasing because of a severe shortage of prior knowledge about the radars [4]. It is therefore urgent to develop new radar jamming technology.
Inspired by cognitive radio, applying intelligent algorithms to radar countermeasures has become possible [1,5,6,7]. To address the low matching rate of template matching under an incomplete jamming rule library, a jamming decision-making method based on clustering and a resampling support vector machine is proposed in [1]. In [5], a discrete dynamic Bayesian network is established to guide decision-making in self-defense electronic jamming. In [6], an improved chaos genetic algorithm is applied to the allocation of jamming strategies. In [7], a particle swarm optimization algorithm is applied to solving the optimal jamming power allocation strategy against cognitive MIMO radar. However, these traditional machine learning methods often require large amounts of tagged radar data acquired in advance, which are difficult to obtain in actual scenarios.
To date, there have been many works on the application of reinforcement learning (RL) to communication jamming and anti-jamming [8,9,10,11,12], providing new ideas for radar countermeasures. RL is a type of machine learning in which an agent learns by interacting with the environment, taking the maximization of environmental feedback as its learning goal [13]. In [14], a framework for intelligent jamming based on RL is described, where the cognitive jammer and the radar are regarded as the agent and the environment, respectively. In [4], the transitions between different radar working states are modeled, and an RL model is trained to choose the jamming mode against each radar state. A frequency-agile radar is considered in [15], and a jamming frequency selection algorithm based on Q-learning is proposed to solve for the optimal frequency of each jamming pulse. In [16,17,18,19], other RL-based methods are used to optimize the radar's strategy for combating jamming. Compared with the traditional machine learning methods mentioned above, an RL agent can learn without tagged data, which makes it more adaptable to unknown environments. A jamming system equipped with RL can obtain training samples during the jamming process and update the jamming strategy dynamically as the radar signal changes.
However, existing work on RL-based jamming decision-making has not jointly considered the adaptive radar's two common measures: mode switching and parameter agility. For example, [14] focuses on macro-level modeling of radar jamming, abstracting radar modes into the environment state of RL; it overlooks the fact that jamming effectiveness is weakened by agile actions, such as frequency agility [16] and dynamic pulse repetition interval [20], taken by the radar in each mode. In [15], the authors address the frequency hopping of the radar without considering multiple work modes. According to the authors of [3], considering only static radar behavior, such as a single work mode, easily results in a subjective or locally optimal jamming strategy. Moreover, when facing an adaptive radar with multiple modes and parameter agility, a large jamming action space is usually needed to ensure that the correct actions are included, which greatly increases the complexity of generating jamming parameters. Since high complexity leads to a long convergence time, the jammer can hardly find the optimal jamming strategy within a limited time, which can be fatal to the protected target.
To overcome these problems, a two-level jamming decision-making framework is developed in this paper. On this basis, a dual Q-learning (DQL) model is proposed to obtain the optimal jamming strategy. Specifically, the jamming decision-making process is decomposed into two levels: the jamming mode is decided in the first level with the outer Q-learning, and the pulse parameters for the decided jamming mode are selected in the second level with the inner Q-learning. This structure greatly reduces the dimension of the jammer's action space. With a smaller action space and fewer parameters to learn, it effectively avoids falling into local optima while shortening the convergence time.
Another issue to be considered is how to express the environmental feedback of RL in the jamming decision-making scenario. Unlike previous methods, where the feedback is statically assigned through an experience matrix, in this paper we evaluate the effectiveness of jamming as the feedback of the DQL model. Most current research on jamming effectiveness evaluation relies on data collected at the radar side, which is impractical given the non-cooperative nature of the battlefield; only a few studies evaluate from the jamming side. In [21], the accumulated amplitude extracted from signals is chosen as the characteristic statistic to evaluate jamming effectiveness. In [22], the authors advise using the change of the radar threat level as the basis for the evaluation. In [23], an evaluation method based on feature space weighting in non-cooperative scenes is proposed, where the weight values of evaluation indicators are solved with offline simulated radar data. Considering that each indicator's contribution to the evaluation result varies, in this paper the indicator weights are calculated from their entropy and updated constantly based on real-time radar data. With these dynamic weights, the jamming effectiveness is evaluated by measuring the distance between the indicator vectors before and after jamming. The evaluation result serves as the feedback from the environment, based on which the DQL model updates dynamically.
The main contributions of this paper are summarized as follows:
  • An RL model named DQL is constructed to guide jamming decision-making against adaptive radars, where the jamming mode and jamming parameters are hierarchically selected and jointly optimized. Owing to the reduced dimensionality of the action space, the globally optimal solution can be found more easily and with a shorter convergence time.
  • A new jamming effectiveness evaluation method based on an indicator vector space is proposed to provide the feedback for the DQL model, which effectively removes the dependence on subjective experience when the model updates. Additionally, in view of the variable electromagnetic environment, the indicator weights are calculated dynamically from real-time radar data to make the evaluation result more credible.
The rest of this paper is organized as follows. The system model is introduced and the problem of jamming strategy optimization is formulated in Section 2. The proposed DQL model and jamming effectiveness evaluation method are explained in Section 3. The details of the simulations and the analysis of results are shown in Section 4, followed by the conclusion presented in Section 5.

2. System Model and Problem Formulation

2.1. System Model

Consider a self-defense electronic jamming scenario in which each target is equipped with a cognitive jammer, as shown in Figure 1. Considering one radar and one jammer, the jammer strives to optimize the jamming effectiveness by learning the radar's strategy, in order to protect the target from detection. The adaptive radar has $I$ work modes, such as search, tracking, and guidance, denoted as $\{M_{r1}, M_{r2}, \ldots, M_{rI}\}$. The work modes are switched adaptively between two adjacent beam dwell periods according to the signal-to-jamming ratio (SJR). In different modes, the radar changes its signal parameters pulse by pulse according to different rules. Similarly, the jammer has $J$ jamming modes, such as frequency-spot jamming, blocking jamming, and swept jamming, denoted as $\{M_{j1}, M_{j2}, \ldots, M_{jJ}\}$. For each jamming mode, the jammer can select parameters for each jamming pulse.
To simplify the analysis, we regard the target as a point target with radar cross-section (RCS) $\sigma$. Assume the radar is jammed in each beam dwell period, which is called a jamming round in this paper. The number of radar pulses in a jamming round depends on the beam dwell time and the pulse repetition interval (PRI). For the $n$th radar pulse in a jamming round, the carrier frequency (CF) is $f_r(n)$, the bandwidth (BW) is $B_r(n)$, the PRI is $pri_r(n)$, which represents the time between the rising edges of the $(n-1)$th and $n$th radar pulses, the pulse width (PW) is $pw_r(n)$, and the transmission power is $P_r(n)$. At each pulse, the jammer attempts to align the jamming signal with the radar signal in both the time and frequency domains. For the $n$th jamming pulse, the CF is $f_j(n)$, the BW is $B_j(n)$, the pulse delay time is $dt_j(n)$, which represents the time from receiving the $(n-1)$th radar pulse to transmitting the next jamming pulse, the PW is $pw_j(n)$, and the transmission power is $P_j(n)$. In addition, the distance from the radar to the target is $D$, and the wavelength of the radar is $\lambda$. The antenna gains of the radar and the jammer are $G_r$ and $G_j$. The transmission losses of the radar and jamming signals are $L_r$ and $L_j$. The loss coefficient of polarization matching between the jamming signal and the radar signal is $\mu$. Based on the above definitions, the power of the $n$th echo at the radar receiver can be expressed as
$$P_{rs}(n) = \frac{P_r(n)\, G_r^2\, \sigma\, \lambda^2}{(4\pi)^3 D^4 L_r} \tag{1}$$
The power of the $n$th jamming pulse at the radar receiver is
$$P_{rj}(n) = \frac{P_j(n)\, G_j\, G_r\, \lambda^2 \mu}{(4\pi)^2 D^2 L_j} \tag{2}$$
Introducing the effective jamming coefficients to amend the calculation formula of the SJR, the average SJR for the $n$th radar pulse, $SJR(n)$, is calculated as:
$$SJR(n) = \frac{P_{rs}(n)}{P_{rj}(n)} \cdot \frac{1}{X_f(n)} \cdot \frac{1}{X_t(n)} = \frac{P_r(n)\, G_r\, \sigma\, L_j}{P_j(n)\, G_j\, \mu\, 4\pi D^2 L_r\, X_f(n)\, X_t(n)}, \tag{3}$$
where $X_f(n)$ and $X_t(n)$ are the effective jamming coefficients in the frequency and time domains respectively, expressed as:
$$X_f(n) = \frac{\Delta f(n)}{B_j(n)} \cdot \mathrm{sgn}\big(\Delta f(n)\big), \tag{4}$$
$$X_t(n) = \frac{\Delta t(n)}{pw_j(n)} \cdot \mathrm{sgn}\big(\Delta t(n)\big), \tag{5}$$
where $\Delta f(n)$ and $\Delta t(n)$ are the overlapping rates in the frequency and time domains respectively, defined as:
$$\Delta f(n) = \min\!\big(f_j(n) + B_j(n)/2,\; f_r(n) + B_r(n)/2\big) - \max\!\big(f_j(n) - B_j(n)/2,\; f_r(n) - B_r(n)/2\big), \tag{6}$$
$$\Delta t(n) = \min\!\big(dt_j(n) + pw_j(n),\; pri_r(n) + pw_r(n)\big) - \max\!\big(dt_j(n),\; pri_r(n)\big), \tag{7}$$
where $\mathrm{sgn}(x) = 1$ if $x > 0$, and $\mathrm{sgn}(x) = -1$ otherwise. To reflect the performance of the algorithm more intuitively, we calculate the jamming-to-signal ratio (JSR) in the simulations to measure the jamming effect, where $JSR(n) = 1/SJR(n)$.
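For concreteness, the following is a minimal Python sketch of Equations (3)-(7) for a single pulse. It assumes all quantities are given in consistent linear units (gains and losses converted out of dB beforehand); the function names are ours, not from the paper.

```python
import math

def sgn(x):
    """Sign convention from the text: +1 for x > 0, otherwise -1."""
    return 1.0 if x > 0 else -1.0

def effective_jamming_coeffs(f_r, B_r, pri_r, pw_r, f_j, B_j, dt_j, pw_j):
    """Effective jamming coefficients X_f(n), X_t(n) of Equations (4)-(7)."""
    # Frequency-domain overlap, Equation (6).
    delta_f = min(f_j + B_j / 2, f_r + B_r / 2) - max(f_j - B_j / 2, f_r - B_r / 2)
    # Time-domain overlap, Equation (7).
    delta_t = min(dt_j + pw_j, pri_r + pw_r) - max(dt_j, pri_r)
    return delta_f / B_j * sgn(delta_f), delta_t / pw_j * sgn(delta_t)

def jsr(P_r, P_j, G_r, G_j, sigma, mu, D, L_r, L_j, X_f, X_t):
    """JSR(n) = 1/SJR(n), with SJR from Equation (3)."""
    sjr = (P_r * G_r * sigma * L_j) / (P_j * G_j * mu * 4 * math.pi * D**2 * L_r * X_f * X_t)
    return 1.0 / sjr
```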

2.2. Problem Formulation

The jamming problem can be modeled as a finite Markov decision process (MDP) [24], expressed as a 4-tuple $\{S, A, P, R\}$. $S$ is a finite set of radar states, where a state $s \in S$ is determined by the radar mode and the radar pulse parameters. $A$ is a finite set of jammer actions, where an action $a \in A$ is defined by the jamming mode and the jamming pulse parameters. $P(s(n+1) \mid s(n), a(n))$ is the transition probability describing how the current state $s(n)$ transfers to the next state $s(n+1)$ when the jammer takes action $a(n)$. $R$ is the immediate reward after each action is taken.
Reinforcement learning has been proved to be an effective way to solve MDP problems, the key of which is to find the optimal policy $\pi: S \to A$ that determines which action should be taken in each state. To estimate the effect of a policy, the state-value function for policy $\pi$ is introduced as:
$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i R(n+i+1) \,\middle|\, s(n) = s\right], \tag{8}$$
where $\mathbb{E}_\pi[\cdot]$ stands for the expected value given the policy $\pi$, and $\gamma \in (0, 1]$ is the discount rate of the reward $R$, which means that the long-term reward is considered and its influence decreases with time. Then, the optimal policy $\pi^*$ we aim to find is:
$$\pi^* = \arg\max_\pi v_\pi(s), \quad \forall s \in S \tag{9}$$
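As a brief illustration of Equation (8), the expected discounted return can be approximated by Monte Carlo rollouts; the sketch below assumes hypothetical callables `env_step` (sampling the transition probability $P$ and reward $R$) and `policy` (a candidate $\pi$), neither of which is defined in the paper.

```python
def estimate_state_value(env_step, policy, s0, gamma=0.8, horizon=200, episodes=500):
    """Monte Carlo estimate of v_pi(s0) in Equation (8): the expected
    discounted return when following policy pi from state s0.
    env_step(s, a) -> (s_next, reward) samples one MDP transition."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)             # pi: S -> A
            s, r = env_step(s, a)     # one transition of the MDP
            ret += discount * r
            discount *= gamma         # gamma in (0, 1]
        total += ret
    return total / episodes
```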

3. Proposed Jamming Scheme Based on DQL Model

To jam the adaptive radar with mode switching and parameter agility, we propose an RL model named dual Q-learning to optimize the jamming strategy, as shown in Figure 2. The jammer's action space is decomposed into two subspaces containing the jamming mode and the pulse parameters respectively to reduce the dimensionality, based on which the jamming procedure can be divided into two levels. The jamming mode is determined in the first decision-making level, and specific parameters in the frequency and time domains are selected in the second level according to the jamming mode. Two interacting Q-learning models are constructed to find the global optimal solution, and a dynamic method for jamming effectiveness evaluation is designed to obtain the feedback of the DQL model. The interaction between the two levels can be described as follows: the jamming mode determined in the first level guides the selection in the second level, and the pulse parameters selected in the second level directly determine the SJR at the radar receiver and affect the mode switching of the radar, thereby affecting the next input state of the first level.

3.1. Jamming Decision-Making through DQL Model

In the jamming procedure based on the DQL model described above, the outer and inner Q-learning models are trained simultaneously to solve for the optimal jamming strategy.

3.1.1. Outer Q-Learning

The outer Q-learning is modeled to acquire the jamming mode in the first decision-making level, where the radar work mode and the jamming mode are regarded as the environment state and the action of the agent, respectively. When obtaining the radar work mode $M_r(k)$ at time $k$, the jammer chooses the jamming mode $M_j(k)$ as:
$$M_j(k) = \arg\max_{M_j} Q_o\big(M_r(k), M_j\big) \tag{10}$$
$Q_o\big(M_r(k), M_j(k)\big)$ in the outer Q-table is updated when the new radar mode $M_r(k+1)$ is obtained at time $(k+1)$, according to the following rule:
$$Q_o\big(M_r(k), M_j(k)\big) \leftarrow Q_o\big(M_r(k), M_j(k)\big) + \alpha\Big[R_o(k+1) + \gamma \max_{M_j} Q_o\big(M_r(k+1), M_j\big) - Q_o\big(M_r(k), M_j(k)\big)\Big], \tag{11}$$
where $\alpha \in (0, 1]$ is the learning rate of the outer Q-learning, which determines the update stride of the Q value. $R_o(k+1)$ is the reward calculated at time $(k+1)$ for the outer Q-learning, which depends on the evaluation result of the jamming effectiveness according to radar mode switching. The evaluation method is introduced in Section 3.2; a minimal code sketch of the outer level is given below.
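The sketch below is one plausible realization of Equations (10) and (11) in Python, folding in the $\varepsilon$-greedy exploration defined at the end of this subsection; the class name and integer indexing of modes are our assumptions, while the default hyperparameters follow Section 4.

```python
import numpy as np

class OuterQLearning:
    """Outer Q-table over (radar mode, jamming mode); Equations (10)-(11).
    Radar modes and jamming modes are indexed 0..I-1 and 0..J-1."""

    def __init__(self, n_radar_modes, n_jam_modes, alpha=0.01, gamma=0.8, delta=0.08):
        self.Q = np.zeros((n_radar_modes, n_jam_modes))
        self.alpha, self.gamma, self.delta = alpha, gamma, delta

    def choose(self, radar_mode, k):
        eps = np.exp(-self.delta * k)              # epsilon decays with iteration k
        if np.random.rand() < eps:                 # explore
            return int(np.random.randint(self.Q.shape[1]))
        return int(np.argmax(self.Q[radar_mode]))  # exploit, Equation (10)

    def update(self, radar_mode, jam_mode, reward, next_radar_mode):
        # One-step Q-learning update, Equation (11).
        td_target = reward + self.gamma * self.Q[next_radar_mode].max()
        self.Q[radar_mode, jam_mode] += self.alpha * (td_target - self.Q[radar_mode, jam_mode])
```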

3.1.2. Inner Q-Learning

The inner Q-learning is modeled to solve for the optimal jamming parameters in the second decision-making level. The jamming mode selected in the first level maps to inner Q-tables in the frequency and time domains. When obtaining the $n$th radar pulse parameter vector $s(n) = [f_r(n), B_r(n), pri_r(n), pw_r(n), P_r(n)]$, the jammer takes $[f_r(n), B_r(n)]$ and $[pri_r(n), pw_r(n)]$ as the input states of the frequency-domain and time-domain Q-tables respectively, and the jamming parameters are chosen according to:
$$p_j(n+1) = \arg\max_{p_j} Q_i\big(p_r(n), p_j\big), \tag{12}$$
where $p_r(n) = [f_r(n), B_r(n)]$ and $p_j(n) = [f_j(n), B_j(n)]$ in the frequency-domain Q-table, and $p_r(n) = [pri_r(n), pw_r(n)]$ and $p_j(n) = [dt_j(n), pw_j(n)]$ in the time-domain Q-table. The power of the jamming signal $P_j(n+1)$ is calculated as $P_j(n+1) = \eta \cdot P_r(n)$, where $\eta$ varies with the jamming mode. Then the jamming parameter vector $a(n+1) = [f_j(n+1), B_j(n+1), dt_j(n+1), pw_j(n+1), P_j(n+1)]$ can be constituted. After receiving the $n$th radar pulse, the transmission of the next jamming pulse starts after a delay of $dt_j(n+1)$, aiming at the $(n+1)$th radar pulse. In summary, the jammer predicts the radar parameters one step in advance in order to interfere with the next radar pulse in time. When the $(n+1)$th radar pulse parameter vector $s(n+1)$ is obtained, $Q_i\big(p_r(n), p_j(n+1)\big)$ in the inner Q-table is updated according to the following rule:
$$Q_i\big(p_r(n), p_j(n+1)\big) \leftarrow Q_i\big(p_r(n), p_j(n+1)\big) + \alpha\Big[R_i(n+1) + \gamma \max_{p_j} Q_i\big(p_r(n+1), p_j\big) - Q_i\big(p_r(n), p_j(n+1)\big)\Big], \tag{13}$$
where $R_i(n+1)$ is the reward of the inner Q-learning calculated from $s(n+1)$ and $a(n+1)$: $R_i(n+1) = X_f(n+1)$ for the frequency-domain Q-table, and $R_i(n+1) = X_t(n+1)$ for the time-domain Q-table.
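A matching sketch of one inner Q-table (instantiated once per jamming mode for each of the frequency and time domains) might look as follows; discretizing the radar parameter pairs into hashable table keys is an assumption, and the exploration rate `eps` is taken as an argument since its schedule is defined in the next paragraph.

```python
from collections import defaultdict
import numpy as np

class InnerQLearning:
    """One inner Q-table (frequency or time domain); Equations (12)-(13).
    States are radar parameter pairs, actions are the optional jamming
    parameter pairs of the current jamming mode (see Table 2)."""

    def __init__(self, actions, alpha=0.01, gamma=0.8):
        self.actions = actions                        # list of (param1, param2) tuples
        self.Q = defaultdict(lambda: np.zeros(len(actions)))
        self.alpha, self.gamma = alpha, gamma

    def choose(self, p_r, eps):
        if np.random.rand() < eps:                    # explore
            return int(np.random.randint(len(self.actions)))
        return int(np.argmax(self.Q[p_r]))            # exploit, Equation (12)

    def update(self, p_r, a_idx, reward, p_r_next):
        # reward is X_f (frequency table) or X_t (time table); Equation (13).
        td_target = reward + self.gamma * self.Q[p_r_next].max()
        self.Q[p_r][a_idx] += self.alpha * (td_target - self.Q[p_r][a_idx])
```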
In both the outer and inner Q-learning, an $\varepsilon$-greedy policy is used to choose the jamming mode or parameters, which is an effective way to balance exploration and exploitation. In this paper, we define $\varepsilon = e^{-\delta \cdot k}$, which decreases as the number of iterations increases, where $k$ is the iteration index and the coefficient $\delta$ determines the decay rate. $\delta$ is set differently for the two decision-making levels: $\delta = \delta_o$ for the outer Q-learning and $\delta = \delta_i$ for the inner Q-learning. Taking the outer Q-learning as an example, the jammer randomly chooses a jamming mode with probability $\varepsilon$, and chooses the optimal jamming mode $M_j(k)$ according to Equation (10) with probability $1 - \varepsilon$. Based on the above, a jamming algorithm based on the DQL model is proposed; details are shown in Algorithm 1, and a code sketch of the full loop follows it.
The time series of learning and jamming decision-making with the DQL model is shown in Figure 3. During a jamming round, the outer Q-learning is performed only once, at the beginning, to obtain the jamming mode. Under the constraints of the jamming mode, the inner Q-learning is then performed in each subsequent PRI to determine the jamming parameters.
Algorithm 1: Jamming algorithm based on the DQL model
($K$ and $N_k$ denote the number of jamming rounds in the simulation and the total number of pulses in the $k$th jamming round respectively; $m$ represents the number of pulses required for radar mode discernment.)
for $k = 1, 2, \ldots, K$ do
  [loop body rendered as an image in the source]
end
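Since the body of Algorithm 1 is rendered as an image in the source, the following Python sketch reconstructs the loop only from the text above and Figure 3; the `radar_env` object and its methods are hypothetical stand-ins for the simulation environment, and details such as the exact update order within a round are assumptions.

```python
import numpy as np

def run_dql(radar_env, outer, inner_freq, inner_time, K, m, delta_i=0.3):
    """Sketch of Algorithm 1. outer is an OuterQLearning instance;
    inner_freq/inner_time map each jamming mode to an InnerQLearning
    instance (frequency / time domain respectively)."""
    for k in range(K):
        pulses = radar_env.start_round()              # list of (f_r, B_r, pri_r, pw_r, P_r)
        mode = radar_env.discern_mode(pulses[:m])     # first m pulses -> radar work mode
        jam_mode = outer.choose(mode, k)              # first level: jamming mode
        eps = np.exp(-delta_i * k)                    # inner epsilon-greedy rate
        for n in range(m, len(pulses) - 1):
            f_r, B_r, pri_r, pw_r, P_r = pulses[n]
            a_f = inner_freq[jam_mode].choose((f_r, B_r), eps)
            a_t = inner_time[jam_mode].choose((pri_r, pw_r), eps)
            # Jam the (n+1)th pulse; the environment returns X_f, X_t as rewards.
            X_f, X_t = radar_env.jam(jam_mode, a_f, a_t)
            f2, B2, pri2, pw2, _ = pulses[n + 1]
            inner_freq[jam_mode].update((f_r, B_r), a_f, X_f, (f2, B2))
            inner_time[jam_mode].update((pri_r, pw_r), a_t, X_t, (pri2, pw2))
        next_mode = radar_env.next_mode()             # radar switches between rounds
        outer.update(mode, jam_mode, radar_env.reward(), next_mode)   # Equation (11)
```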

3.2. Jamming Effectiveness Evaluation through Dynamic Measuring of Vector Distance

Jamming effectiveness evaluation is an important part of the DQL-based jamming decision-making method proposed in the previous section, as the evaluation result provides the reward for the outer Q-learning. In radar countermeasures, the most intuitive impact of effective jamming on the radar is a reduction in detection probability, which is hard to know from the jamming side. Therefore, the jamming effectiveness can only be estimated from the parameters of the radar signal received by the jammer.
To evaluate the jamming effectiveness, we first construct an evaluation indicator set $I = \{I_1, I_2, \ldots, I_l\}$, where each indicator is a measurable parameter of the radar signal. When the radar is jammed, there are two possible ways for it to change its parameters. One is to switch its work mode; for example, a low SNR causes the radar to lose track of the target, so it shifts from the tracking mode to the search mode. The other is to take anti-jamming measures to improve its performance; for instance, after receiving suppressive jamming, the radar increases its bandwidth to improve its range resolution. Usually, the indicators that characterize the radar's work mode switching include beam dwell time, PRI, etc., and the indicators that characterize the radar's anti-jamming measures include BW, PW, transmission power, and the range and speed of frequency agility. Indicators of both kinds are considered in constructing the evaluation indicator set $I$.
Based on $I$, a method of vector distance measuring is proposed to calculate the jamming effectiveness, where the weight values of the evaluation indicators are updated dynamically during the confrontation process. Details are explained as follows.

3.2.1. The Jamming Effectiveness Evaluation Method Based on Vector Distance Measuring

We construct an $l$-dimensional vector space $V_l$ based on $I$, where each dimension represents an evaluation indicator. For clarity, Figure 4 shows a 3-dimensional vector space of three indicators. As shown, for the same radar state, the evaluation indicator vectors tend to cluster together. When the radar switches its work mode or takes anti-jamming measures, the evaluation indicator vector shifts in the space. The greater the offset along the increasing direction of the coordinate axes, the more effective the jamming. Therefore, the jamming effectiveness can be evaluated by measuring the shift of the evaluation indicator vector before and after jamming.
Based on the above analysis, a jamming effectiveness evaluation method based on vector distance measuring is proposed. The Euclidean distance can be used to calculate the absolute distance between vectors. We assign a weight to each evaluation indicator on the basis of the Euclidean distance, where each weight reflects the contribution of the corresponding indicator to the evaluation result. The Euclidean distance with indicator weights is calculated as:
$$d(\mathbf{u}, \mathbf{v}, \boldsymbol{\omega}) = \mathrm{sgn}\!\left(\sum_{i=1}^{l} \omega_i (v_i - u_i)\right) \cdot \left[\sum_{i=1}^{l} \omega_i (v_i - u_i)^2\right]^{\frac{1}{2}}, \tag{14}$$
where $\mathbf{u}, \mathbf{v}$ are two indicator vectors in $V_l$, $\boldsymbol{\omega} = (\omega_1, \omega_2, \ldots, \omega_l)$ is the weight vector for the $l$ indicators with $\omega_i \in (0, 1)$, and $u_i$ is the value of the $i$th indicator in vector $\mathbf{u}$. The factor $\mathrm{sgn}\big(\sum_{i=1}^{l} \omega_i (v_i - u_i)\big)$ determines the sign of the distance $d(\mathbf{u}, \mathbf{v}, \boldsymbol{\omega})$, which indicates the direction in which the evaluation indicator vector shifts.
The feedback $R_o(k)$ of the DQL model at time $k$, based on the jamming effectiveness evaluation, is calculated as:
$$R_o(k) = \begin{cases} 10 \cdot d\big(\mathbf{x}(k-1), \mathbf{x}(k), \boldsymbol{\omega}(k)\big), & \text{if } d\big(\mathbf{x}(k-1), \mathbf{x}(k), \boldsymbol{\omega}(k)\big) \geq 0, \\ -10, & \text{if } d\big(\mathbf{x}(k-1), \mathbf{x}(k), \boldsymbol{\omega}(k)\big) < 0, \end{cases} \tag{15}$$
where $\mathbf{x}(k)$ is the normalized evaluation indicator vector obtained at time $k$, and $\boldsymbol{\omega}(k)$ is the weight vector calculated at time $k$; its calculation method is described below.
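A minimal sketch of Equations (14) and (15) follows; `x_prev`, `x_curr`, and `w` correspond to $\mathbf{x}(k-1)$, $\mathbf{x}(k)$, and $\boldsymbol{\omega}(k)$, and the function names are ours.

```python
import numpy as np

def signed_weighted_distance(u, v, w):
    """Signed weighted Euclidean distance of Equation (14);
    u, v are normalized indicator vectors, w the weight vector."""
    diff = v - u
    sign = 1.0 if np.sum(w * diff) > 0 else -1.0
    return sign * np.sqrt(np.sum(w * diff**2))

def outer_reward(x_prev, x_curr, w):
    """Feedback R_o(k) of Equation (15)."""
    d = signed_weighted_distance(x_prev, x_curr, w)
    return 10.0 * d if d >= 0 else -10.0
```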

3.2.2. The Method of Dynamic Weighting for Evaluation Indicators

As the contribution of different indicators to the evaluation result varies with the radar status, $\boldsymbol{\omega}$ is objectively modified through dynamic entropy weight calculation. We compare the value of each indicator across different jamming rounds and calculate their entropy values to obtain the weights. Since the measurements differ from round to round, the weight of each indicator is not static but changes with the received radar signal parameters. Thus, an online evaluation model is established.
For calculating the dynamic entropy weights of the evaluation indicators, we define a matrix $\mathbf{A} \in \mathbb{R}^{l \times m}$ in which $m$ evaluation indicator vectors can be stored. Once a new radar state is detected, its $l$ evaluation indicators are calculated and assigned to a column of $\mathbf{A}$. If all columns of $\mathbf{A}$ have been assigned values, the earliest assigned column is overwritten by the new evaluation indicator vector. $a_{ij}$ denotes the value of the $i$th indicator in the $j$th vector. To nondimensionalize the calculation and eliminate the impact of different orders of magnitude, the original matrix is normalized to a matrix $\mathbf{B} = (b_{ij})_{l \times m}$. The normalization formula is:
$$b_{ij} = \begin{cases} \dfrac{a_{ij} - \min_j(a_{ij})}{\max_j(a_{ij}) - \min_j(a_{ij})}, & \text{if } I_i \uparrow d, \\[2mm] \dfrac{\max_j(a_{ij}) - a_{ij}}{\max_j(a_{ij}) - \min_j(a_{ij})}, & \text{otherwise}, \end{cases} \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, m, \tag{16}$$
where $I_i \uparrow d$ indicates that the vector distance $d$ is positively correlated with the $i$th indicator in $I$, which means the larger the value of $(v_i - u_i)$, the better the jamming effectiveness. Then the proportion of the $j$th vector for indicator $I_i$ is calculated as $p_{ij} = b_{ij} / \sum_{j=1}^{m} b_{ij}$, and the entropy of indicator $I_i$ is:
$$e_i = -\frac{1}{\ln(m)} \sum_{j=1}^{m} p_{ij} \ln(p_{ij}), \quad i = 1, 2, \ldots, l \tag{17}$$
Finally, the weight of each indicator is expressed as:
$$\omega_i = \frac{1 - e_i}{\sum_{i=1}^{l} (1 - e_i)}, \quad i = 1, 2, \ldots, l \tag{18}$$
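The weighting procedure of Equations (16)-(18) can be sketched as follows; the small epsilon guards against empty indicator rows are our additions for numerical safety.

```python
import numpy as np

def entropy_weights(A, positive_mask):
    """Dynamic entropy weights for l indicators over m stored rounds,
    Equations (16)-(18). A: (l, m) matrix of indicator values;
    positive_mask: boolean array of shape (l,), True where indicator i
    is positively correlated with the distance d (see Table 3)."""
    l, m = A.shape
    lo = A.min(axis=1, keepdims=True)
    hi = A.max(axis=1, keepdims=True)
    rng = np.maximum(hi - lo, 1e-12)              # avoid division by zero
    pos = np.asarray(positive_mask, dtype=bool)[:, None]
    B = np.where(pos, (A - lo) / rng, (hi - A) / rng)   # Equation (16)
    P = B / np.maximum(B.sum(axis=1, keepdims=True), 1e-12)
    plogp = np.where(P > 0, P * np.log(np.maximum(P, 1e-12)), 0.0)
    e = -plogp.sum(axis=1) / np.log(m)            # Equation (17)
    return (1 - e) / np.sum(1 - e)                # Equation (18)
```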
As shown in Figure 2, when a new radar mode is detected at time $k$, the jamming effectiveness of the last jamming round is evaluated, from which $R_o(k)$ is calculated and serves as the feedback of the DQL model. The jamming effectiveness evaluation algorithm based on vector distance measuring with dynamic weights is shown in Algorithm 2.
Algorithm 2: Jamming effectiveness evaluation algorithm
Input: evaluation indicator vector $\mathbf{x}(k)$
Output: jamming effectiveness evaluation result $R_o(k)$
  [weighting steps rendered as an image in the source]
  Calculate the Euclidean distance $d\big(\mathbf{x}(k-1), \mathbf{x}(k), \boldsymbol{\omega}(k)\big)$ between the normalized $\mathbf{x}(k)$ and the last indicator vector $\mathbf{x}(k-1)$ with the weight vector $\boldsymbol{\omega}(k)$ according to Equation (14).
  Calculate the feedback $R_o(k)$ of the DQL model through Equation (15).

4. Numerical Results

To verify the proposed algorithms, a radar parameter template is first created, as shown in Table 1. Referring to [25,26], four radar work modes are considered in this paper: search, acquisition, tracking, and guidance. For the target, the guidance mode has the highest threat level, followed by tracking, acquisition, and search. The beam dwell times in these four work modes are 80 ms, 100 ms, 120 ms, and 140 ms respectively. For each work mode, two sub-modes are specified with different parameter agility patterns.
The mode switching rule of the radar can be described as follows: when SJR > −4 dB, the radar raises its threat level; when −7 dB < SJR < −4 dB, the radar takes anti-jamming measures while maintaining the current work mode, switching $M_{r1} \to M_{r2}$, $M_{r3} \to M_{r4}$, $M_{r5} \to M_{r6}$, or $M_{r7} \to M_{r8}$; when SJR < −7 dB, the radar reduces its threat level. A sketch of this rule follows.
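The switching rule can be sketched as below. The mapping of threat levels to mode indices and the behavior at the lowest and highest levels are assumptions consistent with Table 1.

```python
def next_radar_mode(mode, sjr_db):
    """Sketch of the mode switching rule above. Modes 1..8 are M_r1..M_r8;
    threat levels 1..4 = search, acquisition, tracking, guidance, each level
    having an anti-jamming sub-mode at the even index."""
    level = (mode + 1) // 2
    if sjr_db > -4:                       # jamming ineffective: raise threat level
        return 2 * min(level + 1, 4) - 1
    if sjr_db < -7:                       # jamming effective: lower threat level
        return 2 * max(level - 1, 1) - 1
    # -7 <= SJR <= -4: keep the level, move to the anti-jamming sub-mode
    return 2 * level
```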
For the jammer, the jamming mode can be switched between $M_{j1}$ and $M_{j2}$, which denote frequency-spot jamming and blocking jamming respectively. The optional jamming parameters for each jamming mode are listed in Table 2.
Other parameters in our simulation are given as: $G_r = 30$ dB, $G_j = 5$ dB, $L_r = 10$ dB, $L_j = 5$ dB, $D = 10$ km, and $\sigma = 1\,\mathrm{m}^2$. Considering that radar antennas are generally linearly polarized while jammer antennas are circularly or obliquely polarized, $\mu$ is set to 0.5. The parameters of the proposed algorithms are set as: $\alpha = 0.01$, $\gamma = 0.8$, $\delta_o = 0.08$, and $\delta_i = 0.3$. The evaluation indicator set $I$ consists of indicators $I_1$ to $I_7$, shown in Table 3. For each radar mode, the average PRI of all pulses in a period is taken as the PRI indicator, the difference between the maximum and minimum frequency is taken as the range of frequency agility, and the reciprocal of the number of consecutive pulses with the same frequency is taken as the speed of frequency agility.
Based on the above description, 200 jamming rounds are simulated. Figure 5 intuitively shows the time-frequency information of four different radar work modes at the initial and convergent stages of learning. Figure 5(a1,b1,c1,d1) show the radar and jamming signals at the initial stage, and the other four panels show the convergent stage under the same radar work modes. Compared with the random selection of jamming parameters at the initial stage, at the convergent stage the jamming pulses accurately cover the radar pulses in the time and frequency domains, achieving effective jamming. Besides, it can be found that if the radar is in a mode where the CF changes regularly, the jammer chooses the frequency-spot jamming mode $M_{j1}$; otherwise, it chooses the blocking jamming mode $M_{j2}$, which accords with common perception.
We compare the performance of our algorithm with the improved chaos genetic algorithm [6], standard Q-learning [15], and random parameter selection. The average JSR of each jamming round is calculated to reflect the jamming effect intuitively. The learning rate and discount rate are set to the same values in both standard Q-learning and our algorithm. The following results are averaged over 500 independent simulations.
Figure 6 and Figure 7 respectively show the average JSR and the radar threat level obtained by the different methods during 200 jamming rounds. As shown, the improved chaos genetic algorithm [6], standard Q-learning [15], and the proposed DQL-based jamming algorithm can all minimize the threat level of the radar, and the latter two methods converge and stabilize the JSR within 200 jamming rounds. Compared with standard Q-learning, however, the proposed algorithm reaches an optimal average JSR of 7.98 dB, a 4.05% improvement. Furthermore, the convergence time of the proposed jamming algorithm declines by 34.94%, and the number of jamming rounds in which the radar is in the guidance modes $M_{r7}$ and $M_{r8}$ is reduced by 64.94%. As a high radar threat level is dangerous for the target, the proposed DQL-based jamming algorithm improves the survivability of the target.
To further explore the convergence performance of the proposed jamming algorithm, we use the jamming round in which the JSR becomes stable to indicate the convergence time. For jamming round $i$, we calculate the variance of the JSR from jamming round $(i-19)$ to jamming round $i$ ($i \geq 20$); if the variance is less than 0.01, the JSR is considered to have reached a stable state in jamming round $i$ (a sketch of this criterion follows this paragraph). Figure 8 compares the convergence times of the jamming algorithms based on standard Q-learning [15] and the proposed DQL model. The convergence time of the proposed jamming algorithm is generally lower than that of standard Q-learning, and the gap between the two grows as the size of the jamming action space increases. Thus, our DQL-based jamming algorithm has better scalability, and is more adaptable when a larger jamming action space is needed in the face of adaptive or even unknown radars.
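The convergence criterion described above can be sketched as follows, with `jsr` holding the average JSR of each jamming round.

```python
import numpy as np

def convergence_round(jsr, window=20, tol=0.01):
    """Return the first jamming round i (1-indexed) at which the variance
    of JSR over rounds i-19..i falls below tol, as described above."""
    for i in range(window, len(jsr) + 1):
        if np.var(jsr[i - window:i]) < tol:
            return i
    return None  # JSR never stabilized within the simulated rounds
```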

5. Conclusions

In this paper, a two-level framework is developed for jamming decision-making against adaptive radar, and a dual Q-learning model is proposed to optimize the jamming strategy. The jamming mode and pulse parameters are determined hierarchically, greatly reducing the dimensionality of the search space and improving the learning efficiency of the model. In addition, we propose a new method to calculate the jamming effectiveness by measuring the distance between indicator vectors, where the indicators are dynamically weighted to adapt to the changing environment. The jamming effectiveness evaluation result serves as the feedback value to update the DQL model.
Simulation results show that, with the proposed jamming method, the radar's joint strategy of mode switching and pulse parameter agility can be learned within a limited number of interactions, and the optimal jamming effectiveness is reached while the radar's threat level is minimized. Furthermore, compared with standard Q-learning, our method improves the average JSR by 4.05% and reduces the convergence time by 34.94%.
It should be emphasized that, due to the complex electromagnetic environment, the estimation of the radar work mode and pulse parameters is often inaccurate. When there are errors in the input state, the performance of the proposed DQL model deteriorates. Therefore, enhancing the robustness of the model against input uncertainty is the focus of our future investigation.

Author Contributions

Investigation, H.L.; methodology, H.L.; project administration, Y.H.; resources, H.Z.; supervision, Y.H.; validation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.Z.; data curation, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 61971064 and Grant 61901049 and the Beijing Natural Science Foundation under Grant 4202048 and Grant L212028.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xing, Q.; Zhu, W.; Chi, Z.; Zheng, G. Jamming decision under condition of incomplete jamming rule library. J. Eng. 2019, 2019, 7449–7454.
  2. Haykin, S. Cognitive Radar: A Way of the Future. IEEE Signal Processing Mag. 2006, 23, 30–40.
  3. Gao, L.; Liu, L.; Cao, Y.; Wang, S.; You, S. Performance analysis of one-step prediction-based cognitive jamming in jammer-radar countermeasure model. J. Eng. 2019, 2019, 7958–7961.
  4. Zhang, B.; Zhu, W. Research on Decision-making System of Cognitive Jamming against Multifunctional Radar. In Proceedings of the 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Dalian, China, 20–22 September 2019; pp. 1–6.
  5. Zheng, T.; Gao, X. Research on the self-defence electronic jamming decision-making based on the discrete dynamic Bayesian network. J. Syst. Eng. Electron. 2008, 19, 702–708.
  6. Pan, W.; Jin, X.; Xie, H.; Xia, Y. Radar Jamming Strategy Allocation Algorithm based on Improved Chaos Genetic Algorithm. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 4478–4483.
  7. Wang, L.; Zeng, Y.; Li, Y.; Wang, M. An Optimal Jamming Strategy Aiming at Cognitive MIMO Radar. In Proceedings of the 2016 CIE International Conference on Radar (RADAR), Guangzhou, China, 10–13 October 2016; pp. 1–5.
  8. Slimeni, F.; Scheers, B.; Chtourou, Z.; Nir, V.L. Jamming mitigation in cognitive radio networks using a modified Q-learning algorithm. In Proceedings of the 2015 International Conference on Military Communications and Information Systems (ICMCIS), Cracow, Poland, 18–19 May 2015; pp. 1–7.
  9. Machuzak, S.; Jayaweera, S.K. Reinforcement Learning Based Anti-jamming with Wideband Autonomous Cognitive Radios. In Proceedings of the 2016 IEEE/CIC International Conference on Communications in China (ICCC), Chengdu, China, 27–29 July 2016; pp. 1–5.
  10. Peng, J.; Zhang, Z.; Wu, Q.; Zhang, B. Anti-Jamming Communications in UAV Swarms: A Reinforcement Learning Approach. IEEE Access 2019, 7, 180532–180543.
  11. Lu, X.; Xiao, L.; Dai, C.; Dai, H. UAV-aided cellular communications with deep reinforcement learning against jamming. IEEE Wirel. Commun. 2020, 27, 48–53.
  12. Yao, F.; Jia, L. A collaborative multi-agent reinforcement learning anti-jamming algorithm in wireless networks. IEEE Wirel. Commun. Lett. 2019, 8, 1024–1027.
  13. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 1st ed.; MIT Press: Cambridge, MA, USA, 1998; pp. 216–224.
  14. Xing, Q.; Zhu, W.; Jia, X. Research on method of intelligent radar confrontation based on reinforcement learning. In Proceedings of the 2017 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA), Beijing, China, 8–11 September 2017; pp. 471–475.
  15. Wang, L.; Peng, J.; Xie, Z.; Zhang, Y. Optimal jamming frequency selection for cognitive jammer based on reinforcement learning. In Proceedings of the 2019 IEEE 2nd International Conference on Information Communication and Signal Processing (ICICSP), Weihai, China, 28–30 September 2019; pp. 39–43.
  16. Li, K.; Jiu, B.; Liu, H.; Liang, S. Reinforcement learning based anti-jamming frequency hopping strategies design for cognitive radar. In Proceedings of the 2018 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Qingdao, China, 14–16 September 2018; pp. 1–5.
  17. Lei, M.; Zhang, J. Study on anti-jamming frequency selection in radar netting. In Proceedings of the 2016 2nd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 14–17 October 2016; pp. 1781–1784.
  18. Ak, S.; Brüggenwirth, S. Avoiding Jammers: A Reinforcement Learning Approach. In Proceedings of the 2020 IEEE International Radar Conference (RADAR), Florence, Italy, 21–25 September 2020; pp. 321–326.
  19. Li, K.; Jiu, B.; Liu, H.; Pu, W. Robust antijamming strategy design for frequency-agile radar against main lobe jamming. Remote Sens. 2021, 13, 3043.
  20. Quan, Y.; Wu, Y.; Li, Y.; Sun, G.; Xing, M. Range-Doppler reconstruction for frequency agile and PRF-jittering radar. IET Radar Sonar Navig. 2018, 12, 348–352.
  21. Ou, J.; Zhao, F.; Ai, X.; Liu, J.; Xiao, S. Quantitative evaluation for self-screening jamming effectiveness based on the changing characteristics of intercepted radar signals. In Proceedings of the 2016 CIE International Conference on Radar (RADAR), Guangzhou, China, 10–13 October 2016; pp. 1–5.
  22. Li, C.; Zhou, J. Jamming effectiveness evaluation from the jamming side. Electron. Inf. Warf. Technol. 2008, 23, 46–49.
  23. Peng, X.; Yu, J.; Ren, W.; Weng, X. Radar jamming effectiveness evaluation method based on feature space weighting. In Proceedings of the IET International Radar Conference (IET IRC 2020), Chongqing, China, 4–6 November 2020; pp. 629–633.
  24. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
  25. Osner, N.R.; du Plessis, W.P. Threat evaluation and jamming allocation. IET Radar Sonar Navig. 2017, 11, 459–465.
  26. Han, L.; Ning, Q.; Chen, B.; Lei, Y.; Zhou, X. Ground threat evaluation and jamming allocation model with Markov chain for aircraft. IET Radar Sonar Navig. 2020, 14, 1039–1045.
Figure 1. The self-defense electronic jamming scenario: through effective jamming, the jammer can shorten the detection range of the radar to protect the target.
Figure 2. Jamming decision-making based on the DQL model: During a jamming round, the radar work mode is discerned from the earliest $m$ received radar pulses. Once a new radar mode $M_r(k)$ is recognized, it is sent to the outer Q-learning module. The jamming mode $M_j(k)$ is then chosen and mapped to the time-domain and frequency-domain Q-tables in the inner Q-learning module. According to $M_r(k)$ and $M_r(k-1)$, the jamming effectiveness of the last jamming round is evaluated, with which the outer Q-table is updated according to Equation (11). When receiving the $n$th ($n > m$) radar pulse, the radar pulse parameter vector $s(n)$ is obtained through parameter estimation. The jamming parameters are then selected to constitute the parameter vector $a(n+1)$, and the jamming signal is generated. From $s(n)$, the two effective jamming coefficients are evaluated, with which the inner Q-table is updated according to Equation (13).
Figure 3. Process of learning and jamming decision-making with the DQL model.
Figure 4. Vector space $V_3$ composed of three evaluation indicators. $\mathbf{u}$, $\mathbf{v}$ are two evaluation indicator vectors, and $d(\mathbf{u}, \mathbf{v}, \boldsymbol{\omega})$ is the Euclidean distance between $\mathbf{u}$ and $\mathbf{v}$ with weight vector $\boldsymbol{\omega}$.
Figure 5. Time-frequency information at the initial and convergent stages: for the signals in each image, the brighter strips are radar pulses and the darker strips are jamming pulses. (a1,b1,c1,d1) show the time-frequency information of radar and jamming pulses at the initial stage when the radar is in modes $M_{r1}$, $M_{r4}$, $M_{r5}$, and $M_{r7}$, respectively. (a2,b2,c2,d2) show the corresponding time-frequency information at the convergent stage for the same four radar modes.
Figure 6. JSR comparison among different methods.
Figure 7. Radar mode switching comparison among different methods: radar modes 1 to 8 represent $M_{r1}$ to $M_{r8}$ respectively.
Figure 8. Convergence time comparison among different methods. The size of the jamming action space is defined as the product of the total numbers of optional parameters in the time and frequency domains.
Table 1. Radar parameter template. When the radar is in a certain mode, it selects the pulse parameters according to the rules in the template. Reside and switch, slippery, staggered, and jittered are four common radar parameter agility patterns. Reside and switch A:k B:m C:n means that the parameter value stays at A for k pulses, at B for m pulses, and at C for n pulses; slippery A:B:C means that the parameter value changes from A to C in steps of B; staggered [A B C] means that the parameter value cycles in the order of the list; jittered (A, B) means the parameter value is randomly selected from the range A to B.

| Work Mode | Sub-Mode | f_r /MHz | B_r /MHz | pri_r /µs | pw_r /µs | P_r /kW |
|---|---|---|---|---|---|---|
| search | M_r1 | reside and switch: 8500:5 9500:5 9000:5 | 100 | staggered: [1100 1320 1470] | 80 | 120 |
| search | M_r2 | reside and switch: 8600:3 9600:3 9100:3 | 100 | staggered: [1100 1320 1470] | 120 | 120 |
| acquisition | M_r3 | slippery: 8800:600:10000 | 150 | staggered: [1070 1430 857] | 120 | 170 |
| acquisition | M_r4 | slippery: 9800:600:12200 | 150 | staggered: [1070 1430 857] | 120 | 170 |
| tracking | M_r5 | jittered: (8500, 11500) | 800 | reside and switch: 830:2 890:4 960:3 | 120 | 170 |
| tracking | M_r6 | jittered: (7500, 12500) | 1000 | reside and switch: 830:2 890:4 960:3 | 120 | 170 |
| guidance | M_r7 | jittered: (9500, 12500) | 800 | slippery: 740:40:900 | 120 | 200 |
| guidance | M_r8 | jittered: (8500, 13500) | 1000 | slippery: 740:40:900 | 120 | 200 |
Table 2. Jamming parameter template. When a jamming mode is determined, the corresponding pulse parameters are selected according to the rules in the template. {A:B:C} denotes a set of optional values consisting of an arithmetic sequence from A to C with common difference B.

| Mode | f_j /MHz | B_j (× B_r) | dt_j /µs | pw_j (× pw_r) | P_j (× P_r) |
|---|---|---|---|---|---|
| M_j1 | {8000:500:12000} | 3 | {800:100:1500} | 2 | 0.003 |
| M_j2 | {8500:1000:11500} | 6 | {800:100:1500} | 2 | 0.006 |
Table 3. Jamming effectiveness evaluation indicator set. A positive correlation to the evaluation result means that the greater the increase of the indicator, the better the jamming effectiveness; a negative correlation means that the greater the decrease of the indicator, the better the jamming effectiveness.

| Evaluation Indicator | Explanation | Correlation to Evaluation Result |
|---|---|---|
| I_1 | PRI | positive |
| I_2 | power | positive |
| I_3 | beam dwell time | negative |
| I_4 | BW | positive |
| I_5 | PW | positive |
| I_6 | range of frequency agility | positive |
| I_7 | speed of frequency agility | positive |