Robust Antijamming Strategy Design for Frequency-Agile Radar against Main Lobe Jamming

Abstract: To combat main lobe jamming, preventive measures can be applied to radar in advance based on the concept of active antagonism, and efficient antijamming strategies can be designed through reinforcement learning. However, uncertainties in the radar and the jammer, which will result in a mismatch between the test and training environments, are not considered. Therefore, a robust antijamming strategy design method is proposed in this paper, in which frequency-agile radar and a main lobe jammer are considered. This problem is first formulated under the framework of Wasserstein robust reinforcement learning. Then, the method of imitation learning-based jamming strategy parameterization is presented to express the given jamming strategy mathematically. To reduce the number of parameters that require optimization, a perturbation method inspired by NoisyNet is also proposed. Finally, robust antijamming strategies are designed by incorporating jamming strategy parameterization and jamming strategy perturbation into Wasserstein robust reinforcement learning. The simulation results show that the robust antijamming strategy leads to improved radar performance compared with the nonrobust antijamming strategy when uncertainties exist in the radar and the jammer.


Introduction
Main lobe jamming is one of the most challenging jamming types because the jammer and the target are close enough that both are in the main beam of the radar. Common strategies to combat main lobe jamming involve identifying and eliminating jamming signals after the radar is jammed [1][2][3], which can be regarded as passive suppression methods. However, these methods usually require the jammer and the direction-of-look to be separable in angular space. Generally, the smaller the angular separation between the jammer and the direction-of-look, the worse the antijamming performance of the radar is. As a result, these passive suppression methods do not work well (or do not work at all) if the angular separation is small. In this paper, we focus on a situation in which the directions-of-arrival of the received signals incident from the target and the jamming signals are the same, which is common for a target equipped with a self-protection jamming system [4].
The principle of electronic counter-countermeasure (ECCM) techniques is to identify a domain in which the received target signals and the jamming signals can be separated from each other. In fact, such a domain may exist since it is difficult for the jammer to effectively and simultaneously jam both the time and frequency domains. Therefore, in contrast to passive suppression methods, active antagonism requires the radar to actively sense possible unjammed domains and agilely take actions in such domains to avoid being jammed. Specifically, these agile actions include frequency agility in transmission [5], pulse repetition interval agility [6], pulse diversity [7], and so on. Among the above-mentioned agile actions, frequency agility in transmission is considered one effective way to combat main lobe jamming because frequency-agile (FA) radar can actively change its carrier frequency in a random manner. This makes it difficult for the jammer to intercept and jam the radar [5,8,9].
To design FA radar antijamming strategies (hereafter, strategy and policy are used interchangeably), the works in [10][11][12][13] considered a specific jamming strategy situation, and the antijamming strategy design problems were formulated within the framework of the Markov decision process (MDP), which is solved through reinforcement learning (RL) algorithms. The use of RL algorithms to design antijamming strategies has received much attention in the domain of communication [14,15], but its potentiality for radar antijamming requires further exploration. In [10], an RL-based approach was proposed in which FA radar learns the dynamics of the jammer and avoids being jammed. In contrast to the signal-to-noise ratio (SNR) reward signal used in [10], the authors in [11] proposed utilizing the probability of detection as the reward signal, and a similar deep RL-based antijamming scheme for FA radar was proposed. In contrast to the pulse-level FA radar in [10,11], subpulse-level FA radar and a jammer that works in a transmit/receive time-sharing mode were considered in [16], which is more similar to real electronic warfare than the scenarios in [10,11]. In addition, a policy gradient-based RL algorithm known as proximal policy optimization (PPO) [12] was used in [16] to further facilitate the stability of the learning process and improve convergence performance. In [13], the antijamming strategy design was investigated under a partially observable condition, and the authors highlighted that antijamming performance depends on the random nature of the jammer.
As discussed in [10,11,16], FA radar can learn antijamming strategies offline in the training environment and then utilize the learned strategies to combat the jammer in the test environment. At every time step, the jammer will intercept the action of the radar, and the radar will also sense the whole electronic spectrum to infer the action of the jammer. The sensing in these procedures was assumed to be accurate and perfect in the training environment in [10,11,16]. This assumption is not always true in practice because uncertainties exist in both the radar and jammer. For example, if the interception occurs in the frequency domain, then the jammer cannot intercept each radar pulse if it is equipped with a scanning superheterodyne receiver. This is because such a receiver is time multiplexed and scans only a limited number of bandwidths [17] according to a preprogrammed scanning strategy. Even if the jammer is equipped with receivers that have a large instantaneous bandwidth, such as channelized receivers, measurement errors cannot be excluded [17]. Similarly, due to noise and hardware system errors, the radar cannot acquire perfect information about the jammer, even if it can sense the entire electronic spectrum through spectrum sensing [18].
The existence of uncertainties in both the radar and jammer will lead to a mismatch between the presumed and true environment. If uncertainties in the environment are not considered, then radar antijamming performance will be heavily degraded. Therefore, it is of vital importance to design robust antijamming strategies to maintain good performance when uncertainties exist. It should be noted that uncertainties in the jammer were considered in [13], but the best approach to designing robust antijamming strategies remains unknown.
To overcome the uncertainties in both the radar and jammer, a robust antijamming strategy design method for FA radar is proposed in this paper. FA radar and main lobe jamming with a transmit/receive time-sharing jammer are considered and modeled within the framework of RL. The proposed robust method was based on imitation learning [19] and Wasserstein robust reinforcement learning (WR²L) [20], where imitation learning was used to learn the jammer's strategy, and WR²L was utilized to design a radar strategy that was robust against uncertainties in the jammer's strategy and itself. The main contributions of this paper are summarized as follows:
• To express the jamming strategy mathematically, a jamming strategy parameterization method based on imitation learning is proposed, where the jammer is assumed to be an expert making decisions in an MDP. Through the proposed method, we can transform the jamming strategy from a "text description" to a neural network consisting of a series of parameters that can be optimized and perturbed;
• To reduce the computational burden of designing robust antijamming strategies, a jamming strategy perturbation method is presented, where only some of the weights of the neural network need to be optimized and perturbed;
• By incorporating jamming strategy parameterization and jamming strategy perturbation into WR²L, a robust antijamming strategy design method is proposed to obtain robust antijamming strategies.
The remainder of this paper is organized as follows. The backgrounds of RL, robust RL, and imitation learning are briefly introduced in Section 2. In Section 3, the signal models of the FA radar and the main lobe jammer are presented, and then, the RL framework for the FA radar antijamming strategy design is described. The proposed robust antijamming strategy design method, which incorporates jamming strategy parameterization and jamming strategy perturbation into WR²L, is explored in Section 4. Simulation results are shown in Section 5, and Section 6 concludes the paper.

Reinforcement Learning
An RL problem can be formulated within the framework of the MDP, which consists of a five-tuple ⟨S, A, P, R, γ⟩ [21], where S is the set of states, A is the set of actions, P(s_{t+1}|s_t, a_t) describes the probability of transitioning from the current state s_t to the next state s_{t+1} under the chosen action a_t, R(s, a) provides a scalar reward given a state s and action a, and γ ∈ [0, 1] is a discount factor.
RL emphasizes the interaction between the agent and its environment, and the procedure can be described as follows. At each discrete time step t, the agent is in the state s_t ∈ S and chooses the action a_t ∈ A according to the specified policy π(a|s), which is a function mapping states to a probability distribution over all possible actions. With the obtained state s_t and action a_t, the environment and the agent transition to the next state s_{t+1} according to P(s_{t+1}|s_t, a_t). After that, the agent receives a scalar reward r_{t+1}. Finally, the agent collects a trajectory τ = {s_0, a_0, r_1, s_1, a_1, r_2, ...}, and the objective of the agent is to find an optimal policy π* to maximize the cumulative reward, which can be expressed as follows:

π* = arg max_π E_{τ∼p_π(τ)}[R(τ)],   (1)

where R(τ) = ∑_{t=0}^{T−1} γ^t r_{t+1} is the cumulative reward of τ and p_π(τ) is the probability density function of trajectory τ. p_π(τ) can be expressed in terms of the transition probability and the policy, as defined below:

p_π(τ) = p_0(s_0) ∏_{t=0}^{T−1} π(a_t|s_t) P(s_{t+1}|s_t, a_t),   (2)

where p_0(s_0) denotes the initial state distribution.
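To make the notation concrete, the return R(τ) and the trajectory probability p_π(τ) can be sketched for a toy finite MDP; the dictionary-based policy and transition tables below are illustrative stand-ins, not part of any radar model:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Cumulative discounted reward R(tau) = sum_t gamma^t * r_{t+1}."""
    return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

def trajectory_probability(p0, policy, transition, states, actions):
    """p_pi(tau) = p0(s_0) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t, a_t).

    `policy` and `transition` are probability tables for a toy finite MDP.
    """
    p = p0[states[0]]
    for t, a in enumerate(actions):
        p *= policy[(states[t], a)]           # pi(a_t | s_t)
        p *= transition[(states[t], a, states[t + 1])]  # P(s_{t+1} | s_t, a_t)
    return p
```

With γ = 0.5 and rewards [1, 1, 1], the return is 1 + 0.5 + 0.25 = 1.75, matching the definition of R(τ) above.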

Robust Reinforcement Learning
To train an efficient policy for a real-world task, one practical approach is to let the agent interact with the environment in a simulator and then transfer the learned policy to the real world [22]. However, there is a discrepancy between the training environment in a simulator and the real world. Therefore, robust policies are needed to alleviate this discrepancy.
Robust RL is usually based on the idea of the "maxmin" criterion [20,22,23], which aims to maximize the performance of the agent in the worst case. In [23], a softer version of the maxmin objective, the conditional value at risk, was used, and the agent maximized the long-term return for the worst ε-th percentile of MDPs. In [22], similar to [23], an adversarial agent was introduced to model the uncertainties, and the original agent maximized the long-term reward, while the adversarial agent minimized it.
In addition to the methods mentioned above, directly optimizing a maxmin objective can also be used to design robust policies. In [20], a model-free robust policy design method called WR²L was proposed. WR²L formulates robust reinforcement learning as a maxmin game, where the agent aims to improve performance by optimizing its policy, while the environment tries to worsen performance by changing the dynamic parameters.
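The maxmin criterion itself can be illustrated with a toy two-by-two game; the payoff matrix below is invented purely for illustration (rows are candidate agent policies, columns are environment dynamics):

```python
import numpy as np

# Toy payoff matrix: entry (i, j) is the agent's return when it uses
# policy i and the environment picks dynamics j.
payoff = np.array([[3.0, 1.0],
                   [2.0, 2.0]])

# Worst-case return of each policy: the environment picks the
# minimizing column (the "min" in maxmin).
worst_case = payoff.min(axis=1)

# The robust (maxmin) policy maximizes the worst-case return.
robust_policy = int(worst_case.argmax())
```

Here policy 0 has the higher best-case return (3.0), but policy 1 is the maxmin choice because its worst case (2.0) beats policy 0's worst case (1.0).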
Note that the method in [23] requires knowledge of the distribution of the environmental parameters that determine the environmental dynamics. Although the method in [22] overcame this problem, a carefully designed adversarial agent was needed, which was difficult to obtain in our problem. In contrast to the methods in [22,23], WR²L is model-free and does not require knowledge of the dynamics of the environment. Furthermore, it is based on mathematical optimization and is thus more reliable. As a result, WR²L was adopted in this paper.

Imitation Learning
Imitation learning aims to derive a policy from demonstration data that are generated by an underlying policy π e [19]. The demonstration data consist of a series of states and their corresponding actions, which can be expressed as d = s 0 , a 0 , ..., s T−1 , a T−1 . Note that the states and actions in d are generated by the expert who executes the underlying policy π e and are different from those in τ described for RL.
Behavior cloning regards imitation learning as a supervised learning problem, where a supervised model is trained with training data and labels, which are the states and actions in d, respectively. After the training process ends, the model is capable of predicting an appropriate action for a given state. Behavior cloning is simple and easy to implement. However, each action in the demonstration data d depends on the states and actions that precede it, which violates the independent and identically distributed (i.i.d.) assumption of supervised learning and results in poor generalization [26].
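A minimal sketch of behavior cloning on toy (state, action) pairs, with a majority-vote lookup table standing in for a trained supervised model (the demonstration data here are invented):

```python
from collections import Counter, defaultdict

def behavior_clone(demos):
    """Behavior cloning as supervised learning on (state, action) pairs:
    a majority-vote table stands in for a trained classifier."""
    votes = defaultdict(Counter)
    for s, a in demos:
        votes[s][a] += 1
    # Predict the most frequent expert action for each seen state.
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

policy = behavior_clone([(0, 'left'), (0, 'left'), (0, 'right'), (1, 'right')])
```

The sketch also exposes the weakness noted above: states absent from the demonstrations have no entry, so the cloned policy generalizes poorly off the expert's state distribution.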
In inverse reinforcement learning (IRL), the expert is assumed to make decisions in an MDP/R, which is an MDP without the reward function. In contrast to behavior cloning, IRL can be regarded as an indirect imitation learning method, and it aims to recover the reward function on which the expert decisions are based [24]. IRL does not fit single-time-step decisions, so the problem encountered in behavior cloning can be avoided [25].
Generative adversarial imitation learning (GAIL) extracts a policy from the demonstration data directly and does not need to recover the reward function. Combining imitation learning with generative adversarial networks (GANs), GAIL trains a generator and a discriminator [25]. The generator is used to produce trajectories whose distribution is close to that of the demonstration data, while the discriminator is used to distinguish between them. GAIL has been shown to outperform most existing methods [25]. Based on the above analysis, GAIL was adopted in this paper.

Signal Models of FA Radar and Jammer
Pulse-level FA radar has the capability of changing carrier frequency randomly from pulse to pulse, which imparts the radar with a good ECCM capability [27]. However, if the jammer can react to the current intercepted radar pulse, then the ECCM performance of the pulse-level FA radar will degrade [16]. To improve the ECCM performance against the jammer mentioned above, a subpulse-level frequency-agile waveform [9] was adopted in this paper. For a subpulse-level frequency-agile waveform, one pulse consists of several subpulses, and the radar can change the carrier frequency of each subpulse randomly. It was assumed that a deception subpulse can be chosen for transmission in each pulse. Compared with regular subpulses, less transmitted power can be allocated to the deception subpulse in order to mislead the jammer and protect the regular subpulses from being jammed.
The expression of the subpulse-level frequency-agile waveform in a single pulse at time instant k is provided in (3):

s_TX(k) = a(k) ∑_{m=0}^{M−1} u_m rect((k − mT_c)/T_c) exp(j2π f_m (k − mT_c)),   (3)

where a(k) is the complex envelope, M denotes the number of subpulses, T_c denotes the duration of each subpulse, u_m ∈ [0, 1] represents how much transmitted power is distributed to the m-th subpulse, and f_m denotes the subcarrier of the m-th subpulse. f_m can be expressed as f_0 + d_m ∆f, where ∆f denotes the step size between two subcarriers, f_0 represents the initial carrier frequency, and d_m denotes an integer varying from 0 to N − 1, with N denoting the number of frequencies available to the radar. The 0-th subpulse of s_TX(k) is the deception subpulse. Here, rect(k) represents the rectangle function:

rect(k) = 1 for 0 ≤ k ≤ 1, and rect(k) = 0 otherwise.   (4)

The received signal s_RX(k), which includes the target return, the noise signal, and the main lobe suppression jamming signal at time instant k, can be expressed as follows:

s_RX(k) = ∑_{m=0}^{M−1} µ(m) u_m rect((k − mT_c − T_d)/T_c) exp(j2π f_m (k − mT_c − T_d)) exp(j2π f_m^d k) + n(k) + J(k),   (5)

where µ(m) is the complex amplitude with respect to subcarrier f_m, T_d is the time delay of the target, f_m^d is the Doppler frequency with respect to subcarrier f_m, n(k) is the noise signal, and J(k) is the main lobe suppression jamming signal. Here, n(k) is white Gaussian noise with zero mean and variance σ_n^2. The suppression jamming signal can be regarded as having the same statistical characteristics as the noise signal and can also be modeled as a complex Gaussian distribution [4].
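A baseband sketch of such a subpulse-level frequency-agile pulse can be generated as follows; the sampling rate, carrier values, and unit complex envelope a(k) = 1 are assumptions chosen purely for illustration:

```python
import numpy as np

def fa_waveform(d, u, f0=1.0e9, delta_f=10.0e6, Tc=1.0e-6, fs=100.0e6):
    """Sketch of one subpulse-level frequency-agile pulse.

    d : subcarrier codes d_m in [0, N-1], one per subpulse
    u : power weights u_m in [0, 1]; u[0] belongs to the deception subpulse
    Subpulse m occupies [m*Tc, (m+1)*Tc) with subcarrier f_m = f0 + d_m*delta_f.
    """
    n_per = int(round(Tc * fs))        # samples per subpulse
    k = np.arange(n_per) / fs          # local time axis within one subpulse
    subpulses = []
    for dm, um in zip(d, u):
        fm = f0 + dm * delta_f
        # constant envelope a(k) = 1; amplitude u_m sets the subpulse power
        subpulses.append(um * np.exp(2j * np.pi * fm * k))
    return np.concatenate(subpulses)

# Four subpulses; the deception subpulse (index 0) gets reduced power.
s = fa_waveform(d=[2, 0, 3, 1], u=[0.3, 1.0, 1.0, 1.0])
```

The reduced amplitude of the first subpulse mirrors the paper's point that less transmitted power is allocated to the deception subpulse to mislead the jammer.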
In this paper, it was assumed that the jammer works in a transmit/receive time-sharing mode, which means that the jammer cannot transmit jamming signals and intercept radar signals simultaneously. To jam the radar efficiently, the jammer cannot transmit jamming signals continuously because of the agility of the carrier frequency of the FA radar, and it will interrupt jamming to allow time for the jammer to catch up with the current radar parameters, which is referred to as "look-through" [28].
The jammer was assumed to adopt spot jamming and barrage jamming, which are two typical active suppression jamming types [4]. It should be emphasized that spot jamming is a narrowband signal and barrage jamming is a wideband signal. Although the bandwidth of barrage jamming is wide enough to cover all carrier frequencies of the radar, its power density is much lower than that of spot jamming given the same jammer transmitter power, which greatly weakens the jamming performance. Therefore, it was assumed that the jammer prefers spot jamming to barrage jamming and only adopts the latter under certain conditions due to its limited transmitter power.
For one radar pulse at time step t (the time step is equivalent to the pulse index in the radar scenario), we considered three possible jammer choices, which are stated as follows and depicted in Figure 1:
• Choice 1: The jammer performs the look-through operation throughout the whole pulse, which means that the jammer does not transmit a jamming signal and just intercepts the radar waveform;
• Choice 2: The jammer performs the look-through operation for a short period, and then, the jammer transmits a spot jamming signal with a central carrier frequency of f_t^j or a barrage jamming signal;
• Choice 3: The jammer does not perform the look-through operation and just transmits a spot jamming signal with a central carrier frequency of f_t^j or a barrage jamming signal.
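The three choices can be captured by a toy predicate that reports which radar subpulse frequencies would be suppressed; the barrage power-density penalty discussed earlier is ignored in this sketch:

```python
def jammed_subpulses(radar_freqs, choice, spot_freq=None, barrage=False):
    """Toy model of which subpulse frequency codes are suppressed.

    choice 1: pure look-through, nothing is jammed.
    choice 2/3: spot jamming hits subpulses whose code matches spot_freq;
                barrage jamming covers every frequency (at low power density,
                which this sketch does not model).
    """
    if choice == 1:
        return []
    if barrage:
        return list(range(len(radar_freqs)))
    return [m for m, f in enumerate(radar_freqs) if f == spot_freq]
```

For example, with subpulse codes [0, 1, 2, 1] and spot jamming on code 1, subpulses 1 and 3 are hit, while a pure look-through pulse is untouched.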

RL Formulation of the Antijamming Strategy Problem
As in [10,11], the MDP was used to describe the interaction between the FA radar and the jammer, which were regarded as the agent and the environment, respectively. Here, M is used to denote this MDP.
At time step t, the FA radar is in state s t and then takes action a t . The jammer performs look-through and/or transmits jamming signals according to predefined rules, and as a result, the state transitions to s t+1 and the radar receives a scalar reward r t+1 . The basic elements, including actions, states, and rewards, in our RL problem were previously defined in [16], and we apply these definitions herein. These definitions are briefly reviewed below.
Actions: There are M subpulses in one pulse, including the deception subpulse and regular subpulses. For each regular subpulse, the radar can select one frequency from N available frequencies. For the deception subpulse, the radar can not only decide whether it is transmitted or not, but also determine its subcarrier if it is transmitted.
Here, the radar action at time step t is encoded into a vector a_t with size 1 × M. All elements except the first in a_t take values between 0 and N − 1, corresponding to the subcarriers of the regular subpulses varying from f_0 to f_0 + (N − 1)∆f. The first element in a_t, which encodes the deception subpulse, takes values between 0 and N, in which N means that the deception subpulse is not transmitted. Taking M = 3 and N = 3 as an example, a_t = [3, 1, 2] means that the radar does not transmit the deception subpulse, and the subcarriers of the other two subpulses are f_0 + ∆f and f_0 + 2∆f, respectively. The action of the jammer can also be encoded into a vector a_t^j with size 1 × 3, which indicates which of the three choices described above is selected and, for Choices 2 and 3, the jamming type and the central carrier frequency f_t^j.
States: The k-th-order history [21] is used to approximate the state to alleviate the problem of partial observability, and state s_t can be expressed as follows:

s_t = [o_{t−k+1}, ..., o_t],   (6)

where o_t = a_t^j is the observation of the radar and is actually the action of the jammer at time step t.
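The radar action encoding can be sketched as a small decoder; f_0 and ∆f below are placeholder values, and N = 3 matches the example in the text:

```python
def decode_radar_action(a, N, f0=1.0e9, delta_f=10.0e6):
    """Decode an action vector a_t (length M) into subcarrier frequencies.

    a[0] in [0, N] controls the deception subpulse: the value N means
    "not transmitted"; a[1:] in [0, N-1] are the regular subcarrier codes.
    """
    deception = None if a[0] == N else f0 + a[0] * delta_f
    regular = [f0 + d * delta_f for d in a[1:]]
    return deception, regular

# The example from the text: a_t = [3, 1, 2] with N = 3 means the
# deception subpulse is withheld.
dec, reg = decode_radar_action([3, 1, 2], N=3)
```

With f_0 = 1 GHz and ∆f = 10 MHz, the two regular subpulses land on f_0 + ∆f and f_0 + 2∆f, exactly as in the worked example above.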

Reward:
The proposed method still applies the probability of detection p_d as the reward signal [11,16], and the goal of the FA radar is to find an optimal strategy to maximize p_d in one coherent processing interval (CPI). If the frequency step between two frequencies is greater than ∆F = c/(2l), with c denoting the speed of light and l denoting the length of the target along the radar boresight, then their corresponding target returns will be decorrelated [29]. If the frequency step is less than ∆F, then their corresponding target returns are partially correlated.
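The decorrelation step ∆F = c/(2l) is easy to compute; the speed-of-light constant and the 10 m target length below are assumed example values:

```python
C = 3.0e8  # speed of light in m/s (assumed rounded value)

def decorrelation_step(target_length_m):
    """Frequency step Delta_F = c / (2 l) beyond which the target
    returns of two subcarriers decorrelate."""
    return C / (2.0 * target_length_m)

step = decorrelation_step(10.0)  # a 10 m target along the boresight
```

So for a 10 m target, any frequency step larger than 15 MHz yields decorrelated returns, while smaller steps give partially correlated returns.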
To simplify the analysis, we assumed that the frequency step was large enough to decorrelate the target returns. In one CPI, the target returns with the same subcarriers can be first integrated coherently, and then, all coherent integration results of all subcarriers can be processed by the SNR weighting-based detection (SWD) algorithm [30]. This procedure is illustrated in Figure 2, and the detailed calculation procedure of p d is given in Appendix A.
In practice, the radar will make use of all pulses in one CPI to detect the target, meaning that it only receives the reward p_d at the end of one CPI. This will result in a sparse reward problem, which hinders the learning of the radar. To address this problem, an additional negative reward v, which is proportional to the signal-to-interference-plus-noise ratio (SINR) of that pulse, is given for each pulse. The overall reward signal can be expressed as follows:

r_{t+1} = v for t = 0, 1, ..., T − 2, and r_{t+1} = v + p_d for t = T − 1,   (7)

where T is the number of pulses in one CPI.
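This reward shaping can be sketched as a per-pulse reward function; the exact form of the SINR-based term v_t is a placeholder here, since the paper only states that it is proportional to the per-pulse SINR:

```python
def shaped_reward(t, T, v_t, p_d):
    """Shaped reward for pulse t of a T-pulse CPI.

    v_t : SINR-based per-pulse shaping term (placeholder for the paper's v)
    p_d : probability of detection, only available at the last pulse t = T-1
    """
    return v_t + (p_d if t == T - 1 else 0.0)
```

Intermediate pulses receive only the dense shaping term, so the agent gets a learning signal every pulse instead of a single sparse p_d at the end of the CPI.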

Robust Formulation
Given a predefined jamming strategy, an optimal antijamming strategy can be obtained by using a large number of RL algorithms in the training environment based on the perfect sensing and interception assumption. As mentioned previously, if this antijamming strategy is used in a test environment in which uncertainties exist in the radar and the jammer, then antijamming performance may degrade because of the mismatch between the training and test environments.
From an RL perspective, uncertainties in the radar and the jammer will result in a discrepancy in the transition probability between the training and test environments. A detailed explanation is given as follows. In Figure 3, the left and right images illustrate the transition probability in the training and test environments, respectively. When the radar is in the training environment, the current state of the radar is s t , and it will choose an action a t according to the policy π, as shown in the left image in Figure 3. Based on the perfect sensing and interception assumption, the observation of the radar is o t+1 , and the next state will transition to s t+1 ∼ P(s t+1 |s t , a t ). When the radar is in the test environment, it is assumed that it is also in s t and chooses an action a t , which is the same as the radar in the training environment. The difference is that the observation of the radar may not be the same as o t+1 due to errors caused by the jammer. In the right-hand image in Figure 3, we use a circle filled with dotted lines to distinguish it from the observation in the training environment. As a result, the next state will not be s t+1 . Given the same current state s t and action a t , the resultant next state is different. Therefore, there exists a discrepancy between the transition probability in the training and test environments. Thus, we considered a robust antijamming strategy design problem, where the transition probability of the test environment deviates from that of the training environment.

Figure 3. Illustration of the transition probability in the training and test environments.
Based on the above analysis, WR²L [20] is used to solve the radar robust antijamming strategy design problem. As reported in [20], the transition probability of the environment is determined by dynamic parameters. Taking the CartPole [31] task as an example, the length of the pole is the dynamic parameter, and the transition probability varies with the length of the pole. Given the reference dynamic parameters φ_0 of a task, WR²L perturbs the dynamic parameters φ to determine the worst-case scenarios within an ε-Wasserstein ball and identifies a policy with parameters θ to maximize the worst-case performance. With the help of zeroth-order optimization [32], WR²L is able to handle high-dimensional tasks.
The objective function of WR²L can be expressed as follows:

max_θ min_φ E_{τ∼p_{θ,φ}(τ)}[R(τ)]  s.t.  E_{(s,a)∼π_u(·)ρ_{π_u}^{φ_0}(·)}[W_2^2(P_φ(·|s, a), P_{φ_0}(·|s, a))] ≤ ε,   (8)

where W_2^2(P_φ(·|s, a), P_{φ_0}(·|s, a)) is the Wasserstein distance of order 2 [20] between P_φ(·|s, a) and P_{φ_0}(·|s, a), ε ≥ 0 is the radius of the ε-Wasserstein ball, π_u(a|s) is a policy with a uniform distribution over actions a given the state s, and ρ_{π_u}^{φ_0}(s) follows a uniform distribution over states s. For notational convenience, the term (s, a) ∼ π_u(·)ρ_{π_u}^{φ_0}(·) in (8) is omitted in the rest of the paper.
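A small numerical sketch of the ε-ball constraint: for illustration only, the squared Euclidean distance between parameter vectors stands in for the Wasserstein term, and out-of-ball perturbations are projected back onto the ball boundary:

```python
import numpy as np

def project_to_ball(phi, phi0, eps):
    """Keep perturbed dynamics parameters phi within the epsilon-ball around
    phi0 (squared L2 distance as a stand-in for the W2^2 constraint)."""
    d2 = float(np.sum((phi - phi0) ** 2))
    if d2 <= eps:
        return phi
    # Rescale the perturbation so its squared distance equals eps.
    return phi0 + (phi - phi0) * np.sqrt(eps / d2)

phi0 = np.zeros(3)                                   # reference parameters
phi = project_to_ball(np.array([2.0, 0.0, 0.0]), phi0, eps=1.0)
```

A candidate perturbation at squared distance 4 from φ_0 is pulled back to squared distance ε = 1, mirroring how the inner minimization in (8) is restricted to the ε-Wasserstein ball.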
As discussed previously, the uncertainties in the radar and those in the jammer have the same effect on the change in the transition probability. As a result, only the uncertainties in the jammer are considered in this paper. In the radar and jammer scenario, the reference dynamic parameters φ_0 can be regarded as the jamming strategy in the training environment under the perfect interception assumption. The dynamic parameters φ can be regarded as jamming strategies in the presence of uncertainties. However, WR²L cannot be applied directly. The reasons and their corresponding solutions are given below:
(1) The dynamic parameters remain unknown for a given jamming strategy, and we can only describe it using predefined rules. For example, a jamming strategy can be expressed by the following rule: the jammer transmits a spot jamming signal whose central frequency is based on the last intercepted radar pulse. Therefore, we proposed a method of imitation learning-based jamming strategy parameterization, as presented in Section 4.2, which aims to express the jamming strategy mathematically;
(2) After jamming strategy parameterization, the jamming strategy can be expressed as a neural network consisting of a series of parameters. As shown later, the number of parameters of this neural network is large, which will lead to a heavy computational burden. Thus, a jamming parameter perturbation method is provided in Section 4.3 to alleviate this problem.
The final robust radar antijamming strategy design method is described in Section 4 and incorporates jamming strategy parameterization and jamming parameter perturbation into WR²L.

Jamming Strategy Parameterization
Dynamic parameters can be easily acquired from a gym environment and perturbed to determine the worst-case scenarios when designing a robust RL strategy [20]. Intuitively, the jamming strategy could be perturbed in the same way as the dynamic parameters in [20]. However, how to characterize or describe the jamming strategy mathematically remains unsolved. To this end, a method of imitation learning-based jamming strategy parameterization was proposed to express jamming strategies as a neural network that consists of a series of parameters.
To realize the target mentioned above, a basic assumption about the jammer was made and can be stated as follows.
Assumption: During the interaction between the radar and the jammer, the jammer is also an agent described by MDP M′ ≡ ⟨S′, A′, P′, R′, γ′⟩ with an optimal policy π^j, meaning that its action at every time step is optimal and maximizes its long-term expected reward.
It should be emphasized that M′ is different from the M mentioned in Section 3, and a prime is used to distinguish between them. Note that M′ may not exist in practice; however, this assumption is indeed reasonable because there is always an internal motivation for the jammer's decisions, which it views as optimal. As a consequence, the jammer can be regarded as an expert whose actions are optimal, and we can learn its implicit policy π^j from the expert trajectories using a series of parameters, which is referred to as jamming strategy parameterization in this paper.
Jamming strategy parameterization can be segmented into two phases: gathering expert trajectories and deriving a policy from these expert data [33]. The first phase is easy to implement. Given a predefined jamming strategy, the trajectories d_E = {d_1, d_2, ..., d_{N_E}} can be collected through the interaction between the radar and the jammer, as shown in Figure 4, with N_E denoting the number of trajectories to be collected. Note that this predefined jamming strategy cannot be expressed mathematically and can thus be regarded as a given rule that instructs the jammer how to choose actions to jam the radar. The trajectory d_i can be expressed as d_i = {s′_0, a′_0, s′_1, a′_1, ..., s′_{T−1}, a′_{T−1}}, where s′_t ∈ S′ and a′_t ∈ A′ are the states and actions of the jammer, respectively.
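The first phase (gathering expert trajectories) can be sketched as a generic rollout loop; the environment step, policies, and mapping f below are toy placeholders, not the paper's actual radar/jammer dynamics:

```python
def collect_expert_trajectories(env_step, f, jammer_policy, radar_policy,
                                s0, T, n_traj):
    """Roll out the radar/jammer interaction and record the jammer-side
    (s'_t, a'_t) pairs, where s'_t = f(s_t) extracts the jammer features."""
    d_E = []
    for _ in range(n_traj):
        s, traj = s0, []
        for _ in range(T):
            s_j = f(s)                 # jammer-side feature state s'_t
            a_j = jammer_policy(s_j)   # expert (rule-based) jammer action a'_t
            traj += [s_j, a_j]
            s = env_step(s, radar_policy(s), a_j)  # advance the radar MDP
        d_E.append(traj)
    return d_E
```

Each returned trajectory interleaves states and actions, matching the form d_i = {s′_0, a′_0, ..., s′_{T−1}, a′_{T−1}} given above.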
As shown in Figure 4, gathering the expert trajectories can be achieved based on the MDP M of the radar antijamming strategy design. At time step t, the jammer action a′_t is actually the observation o_t in M, and the state s′_t can be obtained as s′_t = f(s_t), with the input being the state in M. Here, f(·) is a function that maps s_t to s′_t and is designed to extract useful features for the jammer. Once the trajectories are obtained, imitation learning methods can be used to derive the jammer's policy.
The derivation of the policy π^j can be regarded as an IRL problem based on the aforementioned assumption, which can be described by a unified objective function:

IRL_ψ(π^j) = arg max_{R′} −ψ(R′) + E_{π^j}[R′(s′, a′)] − (max_{π′∈Π} H(π′) + E_{π′}[R′(s′, a′)]),   (9)

where R′(s′, a′) is the implicit reward function of M′, ψ(·) is a convex cost function regularizer to avoid overfitting, Π is the set of all stationary stochastic policies, and H(π′) is the entropy of the policy. Note that (9) looks slightly different from the function given in [25]; the difference is that, here, the agent is assumed to maximize the long-term reward rather than minimize it.
As shown in (9), IRL aims to find a reward function that assigns high reward to the expert policy π^j and low reward to other policies. If IRL methods, such as [24,34], are used to derive the jamming strategy, two steps are needed: recovering the reward function and finding the optimal policy under that recovered reward function. Let the recovered implicit reward function be R̂ and its corresponding optimal policy be π_R̂, which can be considered the derived jamming strategy.
In this paper, we applied an alternative method called GAIL, which stems from the basic concept of IRL in (9) but does not require the step of recovering the reward function. More specifically, it can be proven that the final policy π_R̂ mentioned above can be obtained directly by solving for π′ in (10) [25]:

π_R̂ = arg min_{π′∈Π} max_D E_{π′}[log D(s′, a′)] + E_{π^j}[log(1 − D(s′, a′))] − ρH(π′).   (10)
In (10), D(s′, a′) is a discriminative classifier that maps the input (s′, a′) to a real number ranging from 0 to 1, and ρ ≥ 0 is a real number controlling the entropy regularizer.
The policy π′ and the classifier D(s′, a′) can be parameterized with ϕ and ω, respectively, where the solution π_ϕ to (10) is the derived jamming strategy, and ϕ contains its corresponding jamming parameters. In fact, (10) satisfies the definition of the GAN [35], where the policy π′ can be regarded as the generator and the classifier D as the discriminator.
Algorithm 1 summarizes the overall procedure of jamming strategy parameterization. Before the algorithm starts, a predefined jamming strategy and a mapping function f(·) are needed. Note that f(·) was specifically designed for this given jamming strategy. The predefined radar policy π_pre is used in the expert trajectory collection phase and can be a random policy. The first phase is gathering the expert trajectories. In the second phase, a fixed number of trajectories of π_{ϕ_i} are first collected, and then, the gradient of the discriminator can be estimated based on Monte Carlo estimation, which is given below:

Ê_{τ_i}[∇_ω log D_ω(s′, a′)] + Ê_{d_E}[∇_ω log(1 − D_ω(s′, a′))],   (11)

where Ê_{τ_i} and Ê_{d_E} denote empirical expectations over the generator trajectories and the expert trajectories, respectively.
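The Monte Carlo estimate behind the discriminator update can be sketched as a scalar objective whose gradient with respect to ω would drive the update; D here is a placeholder classifier rather than a trained network:

```python
import numpy as np

def discriminator_objective(D, generator_pairs, expert_pairs):
    """Monte Carlo estimate of the discriminator objective:
    E_generator[log D(s', a')] + E_expert[log(1 - D(s', a'))].

    generator_pairs / expert_pairs are lists of (state, action) tuples
    sampled from the generator policy and the expert trajectories."""
    g = np.mean([np.log(D(s, a)) for s, a in generator_pairs])
    e = np.mean([np.log(1.0 - D(s, a)) for s, a in expert_pairs])
    return g + e
```

An untrained discriminator that outputs 0.5 everywhere scores 2·log(0.5) ≈ −1.386; training pushes D toward 1 on generator pairs and 0 on expert pairs, increasing this objective.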
The update of the generator can be performed by any RL algorithm with the reward function log(D_ω(s', a')); here, TRPO [36] was adopted. The termination condition can be the convergence of the cumulative reward of the generator. Once the termination condition is satisfied, jamming strategy parameterization is complete.
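The discriminator step at the heart of this procedure can be illustrated numerically. The sketch below trains a logistic discriminator on separable stand-in features; the feature construction, dimensions, and data are illustrative assumptions, not the paper's actual state-action encoding.

```python
import numpy as np

# Minimal numeric sketch of the GAIL discriminator update. D_w(s', a') =
# sigmoid(w . x) is a logistic classifier over illustrative jammer
# state-action features x; expert features come from the predefined jamming
# strategy, generator features from the current pi_phi. All data are stand-ins.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_grad(w, exp_x, gen_x):
    """Monte Carlo estimate of the discriminator gradient:
    E_expert[grad log D] + E_generator[grad log(1 - D)]."""
    d_exp = sigmoid(exp_x @ w)
    d_gen = sigmoid(gen_x @ w)
    g_exp = ((1.0 - d_exp)[:, None] * exp_x).mean(axis=0)  # pushes D -> 1 on expert
    g_gen = (-d_gen[:, None] * gen_x).mean(axis=0)         # pushes D -> 0 on generator
    return g_exp + g_gen

# Illustrative trajectories: expert and generator features are separable.
exp_x = rng.normal(loc=+1.5, size=(200, 4))
gen_x = rng.normal(loc=-1.5, size=(200, 4))

w = np.zeros(4)
for _ in range(300):                    # gradient ascent on the discriminator
    w += 0.1 * discriminator_grad(w, exp_x, gen_x)

# The generator would now be updated by an RL step (TRPO in the paper) with
# reward log D_w(s', a'), driving pi_phi toward the expert jamming strategy.
print(sigmoid(exp_x @ w).mean(), sigmoid(gen_x @ w).mean())
```

After training, the discriminator scores expert samples near one and generator samples near zero, so the generator's reward log D_ω is highest when it imitates the expert.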

Jamming Parameter Perturbation
Through jamming strategy parameterization, both the reference jamming parameters φ_0 and the jamming parameters φ that need to be optimized and perturbed can be expressed by neural networks. Let W^h_φ ∈ R^{p_h×q_h} and W^h_{φ_0} ∈ R^{p_h×q_h} be the weights of the hth layer of φ and φ_0, respectively, with p_h and q_h denoting the input and output sizes of this layer. If there are H layers in total, the parameters can be denoted as φ = [W^1_φ, ..., W^H_φ] and φ_0 = [W^1_{φ_0}, ..., W^H_{φ_0}], respectively (here, we ignore the biases in φ and φ_0). As shown later, a matrix inversion is needed frequently during the design of robust antijamming strategies, and the dimension of that matrix is related to the dimension of φ, which may be high. For example, for a three-layer neural network with p_h = q_h = 20 and h = 1, 2, 3, the number of parameters is 1200, and complicated jamming strategies often require networks with even more parameters. As a result, solving the minimization problem in (8) directly incurs a heavy computational burden.

Algorithm 1: Jamming strategy parameterization.
Input: Predefined jamming strategy; mapping function f(·); the number of pulses in one CPI T; the number of trajectories to be collected N_E; the initial parameters of π_φ and D_ω as φ_0, ω_0; predefined radar policy π_pre; an empty list d_E
Output: The parameters of π_φ when GAIL has converged
/* Gathering the expert trajectories */
1  for n = 1, 2, ..., N_E do
2      Sample s_0 according to the given distribution p_0(s_0)
3      for t = 0, 1, ..., T − 1 do
4          Obtain s'_t based on f(·)
5          Radar takes action a_t according to π_pre(a_t|s_t)
6          Jammer takes action a'_t according to the predefined jamming strategy
7          Append (s'_t, a'_t) to d_E
8      end
9  end
/* Training the generator and the discriminator */
10 for i = 0, 1, 2, ... until the termination condition is satisfied do
11     Collect a fixed number of trajectories of π_{φ_i}
12     Update the discriminator parameters from ω_i to ω_{i+1} with the gradient in (11)
13     Update the generator parameters from φ_i to φ_{i+1} using the RL algorithm TRPO [36] with reward function log(D_{ω_{i+1}}(s', a'))
14 end

To alleviate the problem mentioned above, we propose an alternative procedure, inspired by NoisyNet [37], to perturb the jamming parameters. More specifically, φ can be expressed as the combination of two terms: the reference jamming parameters φ_0 and an extra term ∆φ = [W^1, W^2, ..., W^H], W^h ∈ R^{p_h×q_h}. To reduce the number of parameters that need to be perturbed, the elements in each column of W^h are set to be identical, which means that only q_h variables per layer need to be perturbed. The relationship between the hth element of φ, the hth element of φ_0, and the hth element of ∆φ is displayed in Figure 5, and the mathematical expression is

W^h_φ = W^h_{φ_0} + W^h,  h = 1, ..., H.   (12)

If the proposed perturbation method is adopted, only ∆φ needs to be perturbed and optimized in (8), and a new objective function is obtained, as shown in (13).
where φ = φ_0 + ∆φ. Clearly, the computational burden decreases greatly: for the example mentioned above, there are only 60 parameters in total.

Figure 5. Jamming parameter perturbation. In W^h, the elements in each column are the same, and their background color is blue.
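The column-tying idea of Figure 5 can be sketched as follows; the shapes follow the 1200-parameter example in the text, and tiling a q_h-vector down the p_h rows of each layer is the only mechanism shown.

```python
import numpy as np

# Sketch of the column-tied perturbation: each layer's perturbation matrix
# repeats one row of q_h free variables down its p_h rows, so a layer
# contributes q_h parameters instead of p_h * q_h. Shapes follow the
# three-layer 20x20 example in the text.

rng = np.random.default_rng(1)
layer_shapes = [(20, 20), (20, 20), (20, 20)]          # (p_h, q_h) for h = 1..3

phi_0 = [rng.normal(size=s) for s in layer_shapes]     # reference parameters
delta = [rng.normal(size=s[1]) for s in layer_shapes]  # q_h free variables/layer

# W_phi^h = W_phi0^h + tile(delta_h): every element in column j of the
# perturbation equals delta_h[j].
phi = [w0 + np.tile(d, (w0.shape[0], 1)) for w0, d in zip(phi_0, delta)]

n_full = sum(p * q for p, q in layer_shapes)           # 1200 in the example
n_perturbed = sum(q for _, q in layer_shapes)          # only 60 remain
print(n_full, n_perturbed)  # -> 1200 60
```

Only the `delta` vectors enter the minimization in (8); the reference weights stay fixed.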

WR²L-Based Robust Antijamming Strategy Design
In the above subsections, imitation learning-based jamming strategy parameterization was first proposed, and its corresponding perturbation method was then presented. By incorporating them into WR²L, the robust radar antijamming strategy design is presented in this subsection.
As described above in (13), a "max-min" objective function with respect to θ and ∆φ needs to be solved to design a robust strategy, which differs slightly from the original objective function of WR²L. However, the method proposed in [20] can still be used to solve it: the problem is handled through an alternating procedure that updates one variable while the other remains fixed. This procedure is described briefly in the following.
Let the jamming parameters at the jth iteration be φ^[j] = φ_0 + ∆φ^[j]. The policy parameter θ is first updated to find the optimal policy under φ^[j].
In fact, this is simply an RL problem and can be solved by any RL algorithm; in this paper, TRPO [36] was used, and the resulting policy is denoted by θ^[j+1]. After that, the jamming parameters are updated to determine the worst case with respect to θ^[j+1], as expressed in (15), where φ = φ_0 + ∆φ.
To solve this constrained minimization problem, first-order and second-order Taylor expansions of the objective function and the constraint, respectively, are performed around the pair (θ^[j+1], φ_0) to simplify the analysis. The result is given in (16), and the detailed derivation can be found in [20].
The closed-form solution to (16) can be obtained through the Lagrange multiplier method [20], as shown in (17), where g^[j+1] is the gradient of the expected cumulative reward with respect to φ evaluated at φ_0. Thus, the jamming parameters φ^[j+1] can be expressed as in (18). The estimation of g^[j+1] and the Hessian H_0 can be achieved via zero-order optimization [32,38]. According to the two propositions in [20], g^[j+1] can be estimated from the returns of rollouts under perturbed jamming parameters, and H_0 can be estimated from the Wasserstein distance between the transition kernels P_{φ_0}(·|s, a) and P_{φ_0+ξ}(·|s, a); the resulting estimators are given in (19) and (20). In both (19) and (20), a random variable ξ with the same size as φ_0 is sampled from the Gaussian distribution N(0, σ²I) to perturb φ_0.
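The Lagrange-multiplier step combined with a zero-order gradient estimate can be sketched as below; the surrogate return `J_hat`, the identity Hessian, and all dimensions are illustrative stand-ins for the rollout-based estimators.

```python
import numpy as np

# Sketch of the jamming-parameter update: solve
#   min_{dphi} g . dphi   s.t.   0.5 * dphi^T H0 dphi <= eps
# in closed form, with g estimated zero-order from perturbed returns.
# J_hat and H0 are stand-ins for the rollout-based quantities in the text.

rng = np.random.default_rng(2)
dim, sigma, eps = 6, 0.05, 0.1

def J_hat(phi):
    # stand-in for the radar's expected cumulative reward under parameters phi
    return -np.sum((phi - 1.0) ** 2)

phi_0 = np.zeros(dim)

# zero-order gradient estimate: g ~ E[xi * (J(phi_0 + xi) - baseline)] / sigma^2
xis = rng.normal(scale=sigma, size=(4000, dim))
returns = np.array([J_hat(phi_0 + xi) for xi in xis])
g = (xis * (returns - returns.mean())[:, None]).mean(axis=0) / sigma ** 2

H0 = np.eye(dim)                         # illustrative constraint Hessian
H0_inv_g = np.linalg.solve(H0, g)
# Lagrange solution: the step length saturates the Wasserstein-ball constraint
step = np.sqrt(2.0 * eps / (g @ H0_inv_g)) * H0_inv_g
phi_new = phi_0 - step                   # jammer moves against the reward gradient

print(np.round(g, 2))                    # close to the true gradient (2, ..., 2)
print(round(0.5 * step @ H0 @ step, 6))  # equals eps: constraint is active
```

The true gradient of `J_hat` at `phi_0` is 2 in every coordinate, so the printed estimate verifies the zero-order scheme before the closed-form step is applied.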
The procedure mentioned above can be repeated until the maximum number of iterations is reached.
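The alternating procedure can be illustrated on a toy smooth objective whose saddle point is known; the surrogate `J` below stands in for the rollout-based TRPO and jamming-parameter updates, which is the only assumption made here.

```python
# Toy sketch of the alternating max-min procedure: theta takes an ascent step
# on the objective while the jamming perturbation dphi takes a descent step,
# both computed from the current iterate. The smooth surrogate J replaces the
# rollout-based updates; its saddle point is at (0, 0).

def J(theta, dphi):
    # concave in theta, convex in dphi
    return -theta ** 2 + 2 * theta * dphi + dphi ** 2

theta, dphi = 1.0, -0.5
eta = 0.1                                  # step size for both players
for _ in range(200):
    g_theta = -2 * theta + 2 * dphi        # dJ/dtheta (ascent direction)
    g_dphi = 2 * theta + 2 * dphi          # dJ/ddphi (descent direction)
    theta, dphi = theta + eta * g_theta, dphi - eta * g_dphi

print(abs(theta) < 1e-6, abs(dphi) < 1e-6)  # -> True True
```

With a small step size, the coupled iteration contracts toward the saddle point, mirroring how the radar policy and the worst-case jamming parameters settle in the full procedure.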

Simulation Results
In this section, the performance of jamming strategy parameterization and the robust antijamming strategy design is verified. The basic parameters of the FA radar and the main lobe jammer are given in Table 1.

Table 1. Basic parameters of the FA radar and the main lobe jammer.

Parameter                                          Value
radar transmitter power P_T                        30 kW
radar transmit antenna gain G_T                    30 dB
radar initial frequency f_0                        3 GHz
bandwidth of each subpulse B                       2 MHz
number of subpulses in a single pulse              3
number of frequencies available for the radar      3
number of pulses in one CPI                        32
distance between the radar and the jammer R_d      100 km
false alarm rate p_f                               1 × 10^−4
length of the target along the radar boresight l   10 m
jammer transmitter power P_J                       1 W
jammer transmit antenna gain G_j                   0 dB

Given the length of the target along the radar boresight l, the frequency step ∆F required to decorrelate the target can be calculated as ∆F = c/(2l) = (3 × 10^8)/(2 × 10) = 15 MHz. Therefore, the frequency step size ∆f needs to be larger than ∆F. Moreover, if ∆f is only comparable to ∆F, the power density of barrage jamming may still be high because its power is distributed over a narrow bandwidth. Thus, ∆f was set to 100 MHz, which is large enough both to decorrelate the target and to reduce the power density of barrage jamming. It was assumed that the radar cross-section (RCS) of the target does not fluctuate at a given frequency, although the RCS may differ among frequencies. Without loss of generality, the RCS at the three frequencies was set to σ_RCS = [3 m², 3 m², 3 m²].
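The decorrelation step size above can be checked numerically:

```python
# Quick check of the decorrelation frequency step: with target length
# l = 10 m, Delta_F = c / (2 l) = 15 MHz, so the chosen frequency step
# Delta_f = 100 MHz comfortably exceeds it.

c = 3e8          # speed of light, m/s
l = 10.0         # target length along the radar boresight, m
delta_F = c / (2 * l)
delta_f = 100e6  # chosen frequency step size, Hz
print(delta_F / 1e6, delta_f > delta_F)  # -> 15.0 True
```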
If the jammer adopts spot jamming, its jamming power is assumed to be distributed over a frequency band of bandwidth B_spot = 2B, which is wider than B. If the jammer adopts barrage jamming, its jamming power is assumed to be distributed over a band of bandwidth B_bar = 500 MHz to cover all possible frequencies of the FA radar. The last k = 3 observations and actions were used to approximate the history H_t. In addition, u_0, u_1, and u_2 in (3) were predefined and set to [0, 1, 1] or [0.2, 0.9, 0.9], depending on whether the deception subpulse was transmitted.
According to the radar equation [29], the SINR used to calculate the probability of detection in (A2) and (A3) can be computed from these basic simulation parameters. More specifically, the received power P_r scattered by the target with respect to the different RCSs, the received jamming power P_r^j, and the noise power can be calculated as in (21), where λ = c/f is the wavelength. The power of the thermal noise in the radar receiver is P_N = kT_sB_n, where k = 1.38 × 10^−23 J/K is Boltzmann's constant, T_s = 290 K is the system noise temperature, and B_n ≈ B is the noise bandwidth. With all parameters of the radar and the jammer given, P_r, P_N, and P_r^j can be obtained; therefore, the SNR (when the radar is not jammed) and the SINR (when the radar is jammed) can be easily calculated.

Figure 6 shows the three different jamming strategies that were used to verify the effectiveness of the proposed method. Jamming Strategy 1 selects Choice 2, while Jamming Strategies 2 and 3 select Choices 1 and 3 simultaneously. To simplify the analysis, it was assumed that the durations of look-through and jamming signal transmission are integer multiples of the duration of each subpulse, as shown in Figure 6. The three jamming strategies are described as follows.
Jamming Strategy 1: The duration of look-through is short, and the jammer transmits spot jamming once the radar signal is intercepted. The central frequency of the spot jamming is the same as the subcarrier of the intercepted radar signal, meaning that the jammer will be misled if the deception subpulse is transmitted.

Jamming Strategy 2: The jammer performs the look-through operation during the first pulse to intercept the whole pulse. During the next pulse, the jammer only transmits the jamming signal, ignoring the deception subpulse and jamming the regular subpulses. If there are two different subcarriers in the intercepted pulse, the jammer adopts barrage jamming; if not, it adopts spot jamming whose central frequency is the same as that of the intercepted subpulse.
Jamming Strategy 3: Jamming Strategy 3 is similar to Jamming Strategy 2. The only difference is that the jammer will jam the next two pulses based on the last intercepted pulse.
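The SINR calculation described before Figure 6 can be sketched as follows, assuming the standard monostatic radar range equation, a one-way jamming equation, and the Table 1 values; the exact expressions and any loss terms used in the paper may differ.

```python
import math

# Link-budget sketch with the Table 1 values. The radar range equation and the
# one-way jamming equation below are the standard textbook forms, used here
# only as illustrative stand-ins for the paper's exact expressions.

c = 3.0e8              # speed of light, m/s
k_B = 1.38e-23         # Boltzmann's constant, J/K
P_T = 30e3             # radar transmitter power, W
G_T = 10 ** (30 / 10)  # radar antenna gain (30 dB)
P_J = 1.0              # jammer transmitter power, W
G_j = 10 ** (0 / 10)   # jammer antenna gain (0 dB)
f = 3.0e9              # radar initial frequency, Hz
B = 2.0e6              # subpulse bandwidth, Hz
R = 100e3              # radar-jammer range, m (self-protection: target co-located)
sigma_rcs = 3.0        # target RCS, m^2
T_s = 290.0            # system noise temperature, K

lam = c / f            # wavelength

# monostatic radar range equation (two-way propagation)
P_r = P_T * G_T ** 2 * lam ** 2 * sigma_rcs / ((4 * math.pi) ** 3 * R ** 4)

# one-way jamming equation; only the fraction B / B_spot of the spot-jamming
# power (B_spot = 2B) falls inside the radar receiver bandwidth
B_spot = 2 * B
P_jr = P_J * G_j * G_T * lam ** 2 / ((4 * math.pi) ** 2 * R ** 2) * (B / B_spot)

P_N = k_B * T_s * B    # thermal noise power, with B_n ~ B

snr_db = 10 * math.log10(P_r / P_N)
sinr_db = 10 * math.log10(P_r / (P_N + P_jr))
print(round(snr_db, 1), round(sinr_db, 1))
```

Because the jamming path is one-way while the target echo is two-way, even a 1 W jammer dominates the thermal noise here, which is why the SINR drops sharply below the SNR.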

Performance of Jamming Strategy Parameterization
In this subsection, the performance of jamming strategy parameterization is tested. Some details are first provided.
As mentioned above, a mapping function f (·) is needed when the expert trajectories are collected. For the three different jamming strategies, different mapping functions were designed to enhance learning performance.
With respect to Jamming Strategies 1 and 2, f(·) can be expressed as in (22).
The state s'_t for Jamming Strategies 1 and 2 at time step t extracts only the most recent action of the radar, since this information is sufficient for GAIL to derive the jammer's strategy. With respect to Jamming Strategy 3, f(·) can be expressed as in (23), where mod denotes the remainder operation and 1_{f_1 = f_2} is an indicator function that equals one if the subcarriers f_1 and f_2 in a_{t−1} are the same. The state s'_t for Jamming Strategy 3 contains not only the most recent action of the radar but also time and frequency information about the radar.

Fully connected neural networks with four layers and 32 hidden units per layer were used to parameterize the generator and the discriminator in GAIL, and N_E = 100 expert trajectories were generated for training. The parameterization performance with respect to the three jamming strategies is given in Figure 7. The Wasserstein distance was used to evaluate how close the derived jamming strategies π_{φ_0} were to the predefined jamming strategies; the y-axis in Figure 7 denotes their Wasserstein distance after each training epoch. As shown in Figure 7, the Wasserstein distance converged to zero, which means that the predefined jamming strategy can be expressed by the derived jamming strategy, which consists of a series of parameters φ_0.
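For intuition, a toy mapping function in the spirit of (23) might look as follows; the exact feature layout is an assumption, keeping only the most recent radar action, a timing feature, and the subcarrier-match indicator.

```python
# Illustrative sketch of a mapping function f(.) for Jamming Strategy 3: the
# jammer state keeps the most recent radar action, the pulse index modulo the
# jammer's 3-pulse look-through/jam cycle, and the indicator 1_{f1 = f2}.
# The actual feature layout of (23) may differ; this only shows the idea.

def f(t, last_action):
    """last_action = (f1, f2): subcarriers of the most recent radar pulse."""
    f1, f2 = last_action
    phase = t % 3                      # position within the jammer's cycle
    same = 1 if f1 == f2 else 0        # indicator 1_{f1 = f2}
    return (f1, f2, phase, same)

print(f(4, (2, 2)))   # -> (2, 2, 1, 1)
print(f(5, (0, 1)))   # -> (0, 1, 2, 0)
```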
For a better understanding, Figure 8 presents the learning results of the derived jamming strategy for Jamming Strategy 1 in multiple phases. Here, the radar adopted a random strategy to select subcarriers. The derived jamming strategies were used to jam the random radar when their Wasserstein distance was 0.17, 0.025, and 0. The actions of the jammer induced by the derived jamming strategies and predefined Jamming Strategy 1 are plotted in Figure 8, which are denoted as "parameterization" and "ground truth", respectively. It can be seen that the difference between the actions induced by the derived jamming strategies and predefined Jamming Strategy 1 became smaller as the Wasserstein distance decreased.
For jamming strategy parameterization, the number of expert trajectories is of critical importance. For an imitation learning task, more expert trajectories mean that the agent can collect more information about the expert, which will result in better performance. For the problem considered here, as N E increased, the performance of jamming strategy parameterization improved.
In Figure 9, Jamming Strategy 1 is used as an example to show the influence of N E on the performance of jamming strategy parameterization. Three different cases were considered, where N E was set to 10, 100, and 200, respectively. As shown in Figure 9, when N E = 10, the performance of jamming strategy parameterization was the worst. It can be seen in Figure 9 that the performance of jamming strategy parameterization was similar when N E = 100 and N E = 200. Therefore, N E was set to 100 in this paper.

Performance of Robust Antijamming Strategy Design
Before presenting the results of the robust antijamming strategy design, we first show the training performance against the three different jamming strategies under the perfect interception assumption. As shown in Figure 10, the performance obtained through the RL algorithm (TRPO was used here) was compared with that of a random strategy, i.e., a radar that chooses its action randomly at each pulse, to show its effectiveness.

As shown in Figure 8, the magnitude of the uncertainties in the jammer is equivalent to the magnitude of the Wasserstein distance between a jamming strategy and its corresponding reference jamming strategy. Consequently, numerous random jamming strategies were generated to test the robustness of the obtained antijamming strategies. The Wasserstein distance between these random jamming strategies and their corresponding reference jamming strategies varied within a given range to model the variation in the magnitude of the uncertainties in the jammer. The three jamming strategies described previously were regarded as the reference jamming strategies, and their parameters, which can be obtained by the proposed parameterization method, were the reference dynamic parameters. Taking Jamming Strategy 1 as an example, the following describes how the test samples were generated.
Let the parameters of Jamming Strategy 1 be φ_0. The proposed jamming parameter perturbation method was used to perturb φ_0 to generate test samples. More specifically, a large number of ∆φ were sampled independently from a Gaussian distribution with given mean and variance. According to jamming parameter perturbation, each ∆φ was added to φ_0 to generate the parameters of a random jamming strategy. The Wasserstein distance between these random strategies and Jamming Strategy 1 was then calculated, and each random jamming strategy was labeled with its Wasserstein distance. We collected the random jamming strategies whose Wasserstein distance varied from 0 to 0.2 and divided them uniformly into ten groups, so that the Wasserstein distance between the random strategies in the ith group and Jamming Strategy 1 lay within [(i − 1) × 0.02, i × 0.02]. In this analysis, there were 100 random jamming strategies in each group.
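The grouping procedure can be sketched as follows; the Wasserstein distance between the induced strategies is replaced here by a stand-in score on the parameters, and the dimensions are illustrative.

```python
import numpy as np

# Sketch of the test-sample generation: perturb the reference jamming
# parameters phi_0, score each perturbed strategy by its distance to the
# reference, and collect ten groups of 100 strategies whose distances fall in
# [i*0.02, (i+1)*0.02). The Wasserstein distance between induced strategies
# is replaced by a simple stand-in score on the parameters.

rng = np.random.default_rng(3)
phi_0 = np.zeros(60)                        # reference jamming parameters

def distance(phi, phi_ref):
    # stand-in for the Wasserstein distance between the induced strategies
    return float(np.linalg.norm(phi - phi_ref) / np.sqrt(phi.size))

groups = [[] for _ in range(10)]
while min(len(g) for g in groups) < 100:
    scale = rng.uniform(0.0, 0.25)          # vary the perturbation magnitude
    phi = phi_0 + rng.normal(scale=scale, size=phi_0.size)
    d = distance(phi, phi_0)
    if d < 0.2:
        i = int(d / 0.02)                   # group index by distance bin
        if len(groups[i]) < 100:
            groups[i].append(phi)

print([len(g) for g in groups])             # 100 strategies in each group
```

Varying the perturbation scale is what lets every distance bin fill up; a fixed scale would concentrate all samples in one or two bins.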
The performance of the robust antijamming strategies against the three different jamming strategies is given in Figure 11. For each jamming strategy, the antijamming strategy with ε = 0, which is actually a nonrobust design, was compared with two robust antijamming strategies. It should be emphasized that ε does not determine the exact radius of the ε-Wasserstein ball because the objective function and the constraint of WR²L are approximated.
For all three jamming strategies, the performance of nonrobust and robust antijamming strategies decreased as the uncertainty increased, which was caused by the mismatch between the test and training jamming strategies. However, it can be seen in Figure 11 that the robust antijamming strategies outperformed the nonrobust antijamming strategies if the uncertainty reached a certain level. Here, "nonrobust" indicates that the antijamming strategies were directly designed by TRPO [36] with the perfect interception assumption. The training performance of nonrobust antijamming strategies is given in Figure 10.
Taking the performance of the robust antijamming strategies against Jamming Strategy 1 as an example, a detailed explanation of the simulation results in Figure 11a is provided (the simulation results in Figure 11 are similar, so only one result is explained). To describe the simulation results more clearly, the x-axis, which ranges from 0 to 0.2, was divided into four stages, as shown in Figure 12, and each stage was analyzed.
In Stage 1, the performance of the nonrobust antijamming strategy was the best, and the performance of the robust antijamming strategy with ε = 0.3 was the worst. In this stage, the mismatch between the training and test environments was small enough to be ignored; therefore, the nonrobust antijamming strategy performed best.
In Stage 2, the performance of the robust antijamming strategy with ε = 0.1 was the best, and that with ε = 0.3 was the worst. In this stage, although the mismatch could no longer be ignored, the nonrobust antijamming strategy could still outperform the robust antijamming strategy with ε = 0.3.
In Stage 3, the performance of the robust antijamming strategy with ε = 0.1 was still the best, and the performance of the nonrobust antijamming strategy was the worst; the mismatch in this stage was large enough that the nonrobust antijamming strategy performed worst.
In Stage 4, the performance of the robust antijamming strategy with ε = 0.3 was the best, and, not surprisingly, the performance of the nonrobust antijamming strategy was still the worst. The mismatch in this stage was large enough that it could not be covered by the ε-Wasserstein ball with ε = 0.1, so the robust antijamming strategy with ε = 0.1 was no longer the best.
In theory, the performance of the nonrobust antijamming strategies would be the best if the Wasserstein distance between the test jamming strategies and the reference jamming strategies were zero. However, each tick label of the x-axis in Figure 11 actually indicates that the Wasserstein distance varies within a given range. Therefore, it is possible for the performance of nonrobust antijamming strategies to be worse than that of robust antijamming strategies at the zero tick label, as shown in Figure 11c.

Figure 11. The probability of detection for the three different jamming strategies with respect to different magnitudes of the Wasserstein distance. The x-axis is the Wasserstein distance between the test jamming strategies and the reference jamming strategies, and the y-axis is the final probability of detection. (a-c) present the detection probabilities for Jamming Strategies 1, 2, and 3, respectively.
To test the robustness of the radar under adversarial circumstances, we assumed that the jammer was capable of learning to design an adversarial jamming strategy to combat the nonrobust antijamming strategy.
For a jammer with a predefined jamming strategy, a nonrobust antijamming strategy can be obtained through RL algorithms such as TRPO. Therefore, three nonrobust antijamming strategies with parameters θ_nl^1, θ_nl^2, and θ_nl^3 against Jamming Strategies 1, 2, and 3 were obtained, as shown in Figure 13. Given different radii ε of the ε-Wasserstein ball, robust antijamming strategies against Jamming Strategies 1, 2, and 3 were also obtained through the proposed robust antijamming strategy design method.
As shown in Figure 13, the adversarial jamming strategy against each nonrobust antijamming strategy can be obtained by solving (15), and the parameters of the resultant adversarial jamming strategies are denoted accordingly. Different values of ε in (15), referred to here as the adversarial strategy radius, were chosen to design the adversarial jamming strategies. As shown in Figure 13, the nonrobust and robust antijamming strategies were made to combat their corresponding adversarial jamming strategies with different adversarial strategy radii, and the results are given in Figure 14, where Figure 14a-c shows the results related to Jamming Strategies 1, 2, and 3, respectively.
As shown in Figure 14, a larger adversarial strategy radius usually led to worse performance, and the robust antijamming strategies outperformed the nonrobust antijamming strategies in most instances. This can be explained by the fact that jamming is more effective within a larger ε-Wasserstein ball.

Figure 14. The probability of detection for the three different jamming strategies with respect to different adversarial strategy radii. The x-axis is the radius of the ε-Wasserstein ball used when the adversarial strategies were generated. (a-c) present the detection probabilities for Jamming Strategies 1, 2, and 3, respectively.

Conclusions
In this paper, we proposed a robust antijamming strategy design method to combat main lobe jamming for FA radar when uncertainties exist in the environment. The proposed method incorporates jamming strategy parameterization and jamming parameter perturbation into WR²L. We showed that, by regarding the jammer as an expert and applying imitation learning, a given jamming strategy can be represented by a series of parameters, and the simulation results showed that the obtained jamming parameters can replace the given jamming strategy with minor errors. Jamming parameter perturbation reduces the dimension of the parameters and generates the random jamming strategies used to test the proposed method. Most importantly, the results showed that the robust antijamming strategies outperformed the nonrobust antijamming strategies when the uncertainties in the jamming strategies reached a certain level. In addition, the proposed method can also be applied to antijamming strategy design in the communication domain [15,39].
It should be emphasized that the proposed method only addresses how to design robust antijamming strategies against known jamming strategies. If the jamming strategy is unknown, the radar needs to perform jamming strategy parameterization in an online fashion, and the collected expert trajectories will be inaccurate because uncertainties exist in the jammer. As a result, the performance of the proposed method will worsen. This problem will be investigated in the future.

... with degrees of freedom v = [2, 2, ..., 2]_{1×N}, which is denoted as Θ^v_{p_1} [30]. Similarly, the probability of detection can be expressed as in (A3). Given p_f, the decision threshold T_thres can be obtained through (A2), and then p_d can be obtained through (A3). The CDF of the weighted Chi-squared distribution in (A2) and (A3) can be calculated through the method in [40].