Reinforcement Learning for Compressed-Sensing Based Frequency Agile Radar in the Presence of Active Interference

Abstract: Compressed sensing (CS)-based frequency agile radar (FAR) is attractive due to its superior data rate and target measurement performance. However, traditional frequency strategies for CS-based FAR are not cognitive enough to adapt well to the increasingly severe active interference environment. In this paper, we propose a cognitive frequency design method for CS-based FAR using reinforcement learning (RL). Specifically, we formulate the frequency design of CS-based FAR as a model-free partially observable Markov decision process (POMDP) to cope with the non-cooperation of the active interference environment. Then, a recognizer-based belief state computing method is proposed to relieve the storage and computation burdens in solving the model-free POMDP. This method is independent of the environmental knowledge and robust to the sensing scenario. Finally, the double deep Q network-based method using the exploration strategy integrating the CS-based recovery (CSR) metric into the ε-greedy strategy (DDQN-CSR-ε-greedy) is proposed to solve the model-free POMDP. This achieves better target measurement performance while avoiding active interference compared to the existing techniques. A number of examples are presented to demonstrate the effectiveness and advantages of the proposed design.


Introduction
In electronic warfare scenarios, hostile jammers emit active interference by intercepting and imitating radar signals [1,2], having a significant negative effect on radar functioning. Hence, it is necessary to equip radar systems with anti-jamming techniques. In addition, since an ever-growing number of electromagnetic systems require access to the limited frequency resource, especially after the wide deployment of the fifth generation (5G), minimizing the active co-frequency interference between different radiators becomes an attractive consideration. Frequency agile radar (FAR), which transmits pulses with different carrier frequencies in a coherent processing interval (CPI), possesses anti-jamming capabilities and has the potential to realize spectrum compatibility [3,4]. Random frequency is a common strategy for FAR with the thumbtack-type ambiguity function, but it cannot avoid active interference flexibly due to the lack of utilization of environmental information. Sense-and-avoid (SAA) techniques can be employed to select unoccupied frequency bands automatically [5,6], but the selection is based on the active interference knowledge sensed in the previous time. This cannot handle the anti-interference design in a dynamically changing environment. Therefore, it is of great significance to learn the interference dynamics and design a more cognitive frequency strategy for FAR.
Reinforcement learning (RL) is a branch of machine learning that aims at making an agent learn a control strategy through interaction with the environment [7][8][9]. It has been widely studied in the cognitive communication field to learn spectrum sensing and access strategies [10,11]. Inspired by these investigations, some researchers have attempted to apply RL to radar transmit design.
(3) We implement the proposed recognizer-based belief state computing method. This avoids dependence on environmental knowledge. Moreover, the proposed recognizer-based method has superior robustness to the sensing scenario. This is verified by the numerical results.
(4) We propose the DDQN-CSR-ε-greedy method to solve the model-free POMDP. This is able to achieve better target measurement performance in active interference than the state-of-the-art methods. Concretely, the DDQN-CSR-ε-greedy method takes actions based on the agent state and the output posterior probability, which is independent of the environmental model. In addition, this method uses the CSR metric to guide both the anti-interference action exploration and exploitation phases. Consequently, the target measurement performance can be optimized while avoiding active interference.
The rest of the paper is organized as follows. Section 2 presents the signal model of CS-based FAR in active interference. Section 3 formulates and solves the problem of transmit frequency design for CS-based FAR in active interference. Section 4 presents the results and corresponding analyses. Section 5 provides the conclusion.

Signal Model
In this section, we introduce the signal model of CS-based FAR in active interference. Figure 1 presents a simplified working scenario where clutter is negligible and radar returns are not subject to multipath. The scenario contains a hostile jammer and a communication system that shares frequency channels with the CS-based FAR. The hostile jammer transmits intentional active interference by imitating intercepted radar signals. The communication equipment generates unintentional electromagnetic interference to the radar system. Consider that the targets of interest are measured by the CS-based FAR within the CPI consisting of N pulses. The pulse width and the pulse repetition interval are T_p and T_r, respectively. As shown in Figure 2, the N pulses are transmitted with the agile frequency f_n = f_c + Δf_n, n = 1, . . . , N, where f_c is the lowest carrier frequency, Δf_n ∈ [0, B] is the frequency hopping interval of the nth pulse, and B is the maximum value of Δf_n. The nth transmit pulse is defined as

    u_n(t) = rect((t − (n−1)T_r)/T_p) e^{j2π f_n (t − (n−1)T_r)}    (2)

where j = √−1 and rect(·) denotes the unit rectangular window. Considering K targets in a coarse range and sampling each received pulse once, the nth received target echo can be represented by

    x_n = Σ_{k=1}^{K} β_k e^{−j(4π/c) f_n (R_k − v_k (n−1)T_r)}    (4)

where β_k, R_k, and v_k are the scattering intensity, range, and velocity of the kth target, respectively, and c is the speed of light. As seen in Equation (4), the phase of the received signal is discontinuous due to the agile frequency, which will degrade the performance of the conventional FFT-based MTD method. Since the target distribution is usually sparse within a coarse range, the CSR technique can be employed to realize moving target measurement [19][20][21]. To do so, uniformly divide the coarse range and the velocity scope of interest into P and Q grids, respectively. Define the measurement matrix as Φ = [φ_11, . . . , φ_pq, . . . , φ_PQ] ∈ C^{N×PQ}. The columns of Φ are given by

    φ_pq = exp(−j(4π/c)(r_p f − v_q f ⊙ t))    (5)

where f = [f_1, . . . , f_N]^T, t = [0, . . . , (N−1)T_r]^T, (·)^T denotes the transpose, and ⊙ denotes the Hadamard product. Then, the target echo can be written as

    x = Φσ    (6)

where σ ∈ C^{PQ} is a K-sparse vector, and the positions of the nonzeros in σ correspond to the range-Doppler cells of the targets. Given the received target echo x, the vector σ can be reconstructed using l_1-minimization CSR algorithms such as orthogonal matching pursuit (OMP) [23], and correspondingly, the target range-Doppler measurement can be completed. In the numerical experiments later, we adopt OMP for sparse recovery.
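The sparse-recovery pipeline above can be sketched in a few lines of NumPy. The grid sizes, frequency values, and the simple OMP routine below are hypothetical stand-ins for illustration, not the paper's experimental settings.

```python
import numpy as np

c = 3e8                                  # speed of light (m/s)
N, P, Q = 16, 8, 8                       # pulses, range grids, velocity grids
Tr = 1e-3                                # pulse repetition interval (s)
rng = np.random.default_rng(0)
f = 1e9 + 1e6 * rng.integers(0, 10, N)   # agile carrier frequencies (hypothetical)
t = np.arange(N) * Tr
r_grid = np.linspace(0, 150, P)          # coarse-range grid (m), hypothetical
v_grid = np.linspace(-30, 30, Q)         # velocity grid (m/s), hypothetical

# Measurement matrix Phi (N x PQ): column (p, q) holds the phase history of a
# scatterer at range r_p moving at velocity v_q, as in Equation (5).
Phi = np.zeros((N, P * Q), dtype=complex)
for p in range(P):
    for q in range(Q):
        Phi[:, p * Q + q] = np.exp(-1j * 4 * np.pi / c *
                                   f * (r_grid[p] - v_grid[q] * t))

# K-sparse scene and its noiseless echo x = Phi @ sigma
sigma = np.zeros(P * Q, dtype=complex)
sigma[[5, 20]] = [1.0, 0.7]
x = Phi @ sigma

def omp(Phi, x, K):
    """Orthogonal matching pursuit: greedily pick K columns of Phi."""
    residual, support = x.copy(), []
    for _ in range(K):
        # select the column most correlated with the current residual
        support.append(int(np.argmax(np.abs(Phi.conj().T @ residual))))
        sub = Phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ coef
    est = np.zeros(Phi.shape[1], dtype=complex)
    est[support] = coef
    return est

sigma_hat = omp(Phi, x, 2)
print("recovered support:", np.flatnonzero(np.abs(sigma_hat) > 0.1))
```

The recovered support indices map back to range-Doppler cells via p = index // Q, q = index % Q.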
In practice, the received signal is contaminated by noise and active interference. Uniformly dividing the available frequency band [ f c , f c + B] into M channels Θ = {α 1 , . . . , α M }, the signal received in the mth frequency channel can be classified into the following four cases.
    y_mn = w_mn                  (H_0, channel not used for measurement)
    y_mn = x_mn + w_mn           (H_0, channel used for measurement)
    y_mn = J_mn + w_mn           (H_1, channel not used for measurement)
    y_mn = x_mn + J_mn + w_mn    (H_1, channel used for measurement)    (7)

where w_mn ∼ CN(0, N_0) is an independent and identically distributed complex Gaussian noise vector, x_mn is the target echo, J_mn is the active interference signal, and H_0 and H_1 represent the hypotheses of the absence and presence of active interference, respectively. In Equation (7), the second and fourth cases appear in the frequency channel used for target measurement at the receiver. Obviously, the presence of active interference in the measured channel, i.e., the fourth case in Equation (7), will degrade the target measurement performance significantly. Therefore, the transmit frequency should satisfy f_n ∉ f_n^j to guarantee target measurement performance in the active interference environment, where f_n^j ⊆ Θ is the set of frequency channels occupied by the active interference.
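The four reception cases can be illustrated with a toy simulation. The sample count, signal amplitudes, and normalized frequencies below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 64                                   # samples per channel (hypothetical)
N0 = 1.0                                 # noise power

# complex Gaussian noise w ~ CN(0, N0)
w = np.sqrt(N0 / 2) * (rng.standard_normal(L) + 1j * rng.standard_normal(L))
x = 2.0 * np.exp(1j * 2 * np.pi * 0.10 * np.arange(L))   # toy target echo
J = 5.0 * np.exp(1j * 2 * np.pi * 0.30 * np.arange(L))   # toy active interference

# The four reception cases of Equation (7)
cases = {
    "H0, channel not measured": w,
    "H0, channel measured":     x + w,
    "H1, channel not measured": J + w,
    "H1, channel measured":     x + J + w,
}
power = {name: float(np.mean(np.abs(y) ** 2)) for name, y in cases.items()}
for name, pw in power.items():
    print(f"{name}: average power {pw:.2f}")
```

The fourth case carries both the target echo and the much stronger interference, which is why emission into a jammed channel ruins the measurement.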

Problem Formulation and Solution Method
In this section, we present a model-free POMDP model for designing frequency strategies in active interference and provide a recognizer-based belief state computing method to relieve the storage and computation burdens in solving the POMDP. Then, the DDQN-CSR-ε-greedy method is proposed to solve the model-free POMDP to obtain the transmit frequency strategy of the CS-based FAR.

Model-Free Partially Observable Markov Decision Process
As analyzed in Section 2, the transmit frequency should satisfy f_n ∉ f_n^j to protect the CS-based FAR against active interference. To this end, SAA-based transmit design techniques have been studied. In the SAA framework, radar senses environmental knowledge first. Then, the anti-interference strategy is designed based on the sensed information. We can learn from the working process above that the SAA-based method will perform poorly in the dynamically changing interference scenario where the sensed interference information is inconsistent with the current one. Hence, the frequency design should be carried out based on learning the dynamics of the interference. For the CS-based FAR, the received contaminated signal is the only information that can be used to complete the learning process. POMDP is a well-studied mathematical framework for learning dynamic environments and making decisions by an agent under imperfect observations. Therefore, following the scenario in Figure 1, we formulate the frequency design for anti-interference as a POMDP by regarding the CS-based FAR as an autonomous agent working in the dynamic interference environment consisting of a hostile jammer and a communication system. Furthermore, due to the non-cooperation of the active interference environment, the environmental model is hard to obtain. Hence, we formulate the design as a model-free POMDP specified by the tuple {A, S, R, O, γ}:

1.
A is the action space of the agent. Here, the action a_n ∈ A takes a value in {1, . . . , M}, denoting the transmit frequency channel selected by the CS-based FAR.

2.
S is the state space. In our case, the state is represented by

    s_n = [s_n^a, s_n^j]    (8)

where s_n^a is an M_1-dimensional vector denoting the agent state, composed of the recent actions of the CS-based FAR, and s_n^j is an M-dimensional binary vector showing the frequency channels occupied by the active interference. As an example, with M_1 = 2 and M = 3, s_n = [1, 3, 0, 1, 0] denotes that the last two actions taken by the CS-based FAR were to select the first and third frequency channels for emission and that the second frequency channel is occupied in the nth step. In practice, the value of M_1 is determined by balancing the computational complexity and the ability of s_n to represent the agent state. Since the agent state s_n^a is known, the number of underlying states is 2^M under a given observation.

3.
R(s_n, a_n, s_{n+1}) is the reward obtained by the agent. Our work aims to avoid active interference from other electromagnetic equipment, so the reward function we adopt has the form

    R(s_n, a_n, s_{n+1}) = { 1,  if s_{n+1}^j(a_n) = 0;  −1,  if s_{n+1}^j(a_n) = 1 }    (9)

4.
O denotes the observation space, where the observation o_n consists of the contaminated signals received in the M frequency channels.

5.

γ ∈ [0, 1] is the discount parameter used to put weights on future rewards.

Remote Sens. 2022, 14, 968

As shown in Figure 3, the CS-based FAR takes action a_n at first. Then, the state s_n is transformed into the next state s_{n+1}, and the CS-based FAR obtains the observation o_{n+1} and the reward R(s_n, a_n, s_{n+1}). After that, the new action a_{n+1} is taken. The loop above is repeated until the target measurement task is finished.
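The state and reward elements of the POMDP above can be sketched as follows; the channel counts and the worked example are taken from the text, while the helper names are hypothetical.

```python
import numpy as np

M, M1 = 3, 2          # channels and remembered actions (illustrative values)

def make_state(recent_actions, occupied):
    """s_n = [s_a (last M1 actions), s_j (M-dim binary occupancy)], Equation (8)."""
    return np.concatenate([recent_actions, occupied])

def reward(a_n, occupied_next):
    """Equation (9): +1 if the chosen 1-indexed channel is free at step n+1, else -1."""
    return 1 if occupied_next[a_n - 1] == 0 else -1

s_n = make_state([1, 3], [0, 1, 0])   # the worked example from the text
print(s_n)                            # [1 3 0 1 0]
print(reward(1, [0, 1, 0]))           # channel 1 free -> 1
print(reward(2, [0, 1, 0]))           # channel 2 occupied -> -1
```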

Recognizer-Based Belief State Computing Method
In the POMDP, we do not have direct access to the state, so the next action must be determined according to historical observations and actions [22]. This places heavy burdens on storage and computation. To overcome this, the belief state b is introduced in the POMDP to represent historical information. The belief state is the posterior probability of the underlying states under a given observation. In light of the Bayesian rule, the updating expression for the belief state is defined as

    b_{n+1}(s_{n+1}) = η P_obs(o_{n+1}|s_{n+1}) Σ_{s_n∈S} T(s_{n+1}|s_n, a_n) b_n(s_n)    (11)

where P_obs(o_{n+1}|s_{n+1}) is the probability of receiving observation o_{n+1} under state s_{n+1}, T(s_{n+1}|s_n, a_n) is the probability of the transition from state s_n ∈ S to state s_{n+1} ∈ S under action a_n ∈ A, η = 1/P_r(o_{n+1}|b_n, a_n) is a normalizing constant, and

    P_r(o_{n+1}|b_n, a_n) = Σ_{s_{n+1}∈S} P_obs(o_{n+1}|s_{n+1}) Σ_{s_n∈S} T(s_{n+1}|s_n, a_n) b_n(s_n)    (12)

In Equation (11), the values of P_obs(o_{n+1}|s_{n+1}) and T(s_{n+1}|s_n, a_n) are hard to obtain in the non-cooperative active interference environment. This makes the implementation of the updating expression intractable. Here, we propose a model-free recognizer-based method for calculating the belief state. The framework of the recognizer-based belief state computing method is presented in Figure 4. Specifically, we use the classification algorithm combining a neural network and softmax regression (NN-softmax) to construct the active interference recognizer. The input of the neural network is the contaminated observation y_mn, and the output features z of the neural network are addressed by softmax regression to obtain the probability p_mn of the presence of active interference in the observation, i.e.,

    p_mn = e^{ε_1^T z + ρ_1} / (e^{ε_1^T z + ρ_1} + e^{ε_2^T z + ρ_2})    (13)

where ε_1 and ρ_1 denote the weight and bias for the output p_mn, respectively, and ε_2 and ρ_2 denote the weight and bias for the output 1 − p_mn, respectively. Assume that the signals received by different channels are independent.
According to probability theory, the belief state b_n can be computed by

    b_n(s_n^i) = ∏_{m∈Ω_i} p_mn · ∏_{m∉Ω_i} (1 − p_mn)    (14)

where Ω_i is the index set of the occupied channels for the ith underlying state s_n^i.
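A minimal sketch of this belief-state computation, assuming channel independence and hypothetical recognizer outputs:

```python
import numpy as np
from itertools import product

M = 3
p = np.array([0.9, 0.1, 0.8])   # recognizer outputs p_mn per channel (hypothetical)

# Under the channel-independence assumption, the belief of an underlying
# occupancy pattern s^i is a product of per-channel posterior probabilities.
states = list(product([0, 1], repeat=M))            # all 2^M occupancy patterns
belief = np.array([np.prod([p[m] if s[m] else 1.0 - p[m] for m in range(M)])
                   for s in states])
print("sum of beliefs:", round(float(belief.sum()), 6))
print("most likely pattern:", states[int(np.argmax(belief))])
```

The beliefs form a valid distribution over the 2^M patterns, and the most likely pattern here marks channels 1 and 3 as occupied.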
Thus, the proposed recognizer-based belief state computing method avoids the requirement of environmental knowledge, which makes it more suitable for application in a non-cooperative environment.
Given the belief state, the reward function of the POMDP can be derived as

    R_n^*(b_n, a_n) = Σ_{s_{n+1}∈S} b_{n+1}(s_{n+1}) R(s_n, a_n, s_{n+1})    (15)

According to Equation (9), the value of the reward R depends only on whether the frequency channel used for emission is occupied. Considering that the m*th frequency channel is used for emission, the probabilities of the presence and absence of active interference in the emission channel are p_{m*,n+1} and (1 − p_{m*,n+1}), respectively. Hence, by substituting Equations (9) and (14) into Equation (15), the derived reward function is

    R_n^* = (1 − p_{m*,n+1}) − p_{m*,n+1} = 1 − 2 p_{m*,n+1}    (16)

where m* = a_n.
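Since substituting Equation (9) into the expectation collapses to (1 − p)·1 + p·(−1) = 1 − 2p, the derived reward is a one-liner; the posterior values below are hypothetical.

```python
def derived_reward(p_next, a_n):
    """Derived reward: R* = (1 - p) - p = 1 - 2p for the emission channel
    m* = a_n (1-indexed), with p the recognizer's interference posterior."""
    return 1.0 - 2.0 * p_next[a_n - 1]

print(derived_reward([0.05, 0.90], 1))   # nearly free channel  -> close to +1
print(derived_reward([0.05, 0.90], 2))   # likely jammed channel -> close to -1
```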

Transmit Frequency Strategy Design Using the DDQN-CSR-ε-Greedy Method
In this subsection, the model-free POMDP is solved by the proposed DDQN-CSR-ε-greedy method to obtain the transmit frequency strategy of the CS-based FAR.
First, the formulated POMDP is transformed into a belief-state-based MDP so that solution methods for MDPs can be employed in the solving stage. As shown in Equation (14), the belief state can be determined by the output posterior probability vector p_n = [p_1n, p_2n, . . . , p_Mn]. Therefore, we specify the belief-state-based MDP by the tuple {S*, A, R*, γ}, where the state space S* contains states defined as

    s_n^* = [s_n^a, p_n]    (17)

The belief-state-based MDP aims to obtain the optimal solution that yields the highest expected discounted reward. The DDQN-based solution method, which can mitigate estimation bias in the learning process, is developed to find an approximate solution by learning the value function Q(s*, a) through the loss function as follows:

    L(θ) = E[ ( R_n^* + γ Q(s_{n+1}^*, argmax_a Q(s_{n+1}^*, a; θ); θ^−) − Q(s_n^*, a_n; θ) )^2 ]    (18)

where θ is the weight of the main network used for selecting actions, and θ^− is the weight of the target network used for evaluating actions. The advantages of using the quadratic loss function mainly include reasonable penalties for errors and easy computation of gradients. Associating the step and episode of RL with the transmit pulse and CPI of the CS-based FAR, the flow chart of the DDQN-based transmit frequency design for CS-based FAR is given in Figure 5, and the corresponding learning details are given in Algorithm 1. Note that '%' in Algorithm 1 denotes the remainder operator.
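The double-Q target inside the loss above can be illustrated for a single transition. The random vectors stand in for network outputs; a real implementation would evaluate the main and target Q networks on the belief states.

```python
import numpy as np

rng = np.random.default_rng(0)
M, gamma = 4, 0.9

# Stand-ins for network outputs at one transition (hypothetical values)
q_main_next = rng.standard_normal(M)     # Q(s*_{n+1}, . ; theta)
q_target_next = rng.standard_normal(M)   # Q(s*_{n+1}, . ; theta^-)
q_main_now = rng.standard_normal(M)      # Q(s*_n, . ; theta)
a_n, r = 2, 1.0

# Double DQN: the MAIN network selects the next action and the TARGET network
# evaluates it, which mitigates the overestimation bias of vanilla DQN.
a_star = int(np.argmax(q_main_next))
y = r + gamma * q_target_next[a_star]
loss = (y - q_main_now[a_n]) ** 2        # quadratic loss for one transition
print("target:", round(float(y), 3), "loss:", round(float(loss), 3))
```

Averaging this squared error over N_tr transitions sampled from the replay memory gives the training loss of the main network.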
Algorithm 1. DDQN-based frequency design for CS-based FAR in active interference.
Input: the maximum number of episodes N_e, the maximum number of steps N_st for each episode, the number of transitions N_tr used for training the main network, the updating interval N_m of the main network, the updating interval N_t of the target network, the dimension of the agent state M_1, and the number of divided frequency channels M.
Observation phase: Select the transmit frequency channel a randomly, observe the transition from the state s* to the next state s*′, compute the reward R*, and select the transmit frequency channel randomly again. Perform the loop above and store the generated transitions (s*, a, R*, s*′) in the replay memory Ξ.

Interaction phase:
Initialize the main network Q(θ), the target network Q(θ^−), and n_e = 1.
Repeat (for each episode):
Repeat (for each step of the episode):
(a) Select the frequency channel a_n ∈ A according to the exploration strategy to act on the environment, and obtain M observations in different frequency channels.
(b) Calculate M posterior probabilities by putting the M observations into the NN-softmax-based active interference recognizer, and obtain the next state s*_{n+1}.
(c) Compute the reward R*_n(s*_n, a_n, s*_{n+1}) via (16), and store the transition (s*_n, a_n, R*_n, s*_{n+1}) in the replay memory Ξ.
(d) Update the parameter θ using the loss function (18) with N_tr transitions randomly selected from Ξ.
n_e = n_e + 1.
• until n_e > N_e or the target measurement task is finished.
As shown in Algorithm 1, the exploration strategy plays a significant connecting role in the learning loop. Conventional DDQN adopts the ε-greedy strategy to explore and exploit anti-interference actions, which is insufficient to achieve good target measurement performance. In this paper, we develop the CSR-ε-greedy exploration strategy, the main idea of which is to use the target measurement metric to guide the exploration and exploitation of the anti-interference action.
The recovery performance of l_1-minimization-based CSR algorithms is guaranteed by the restricted isometry property (RIP) and the mutual coherence of the measurement matrix [19][20][21]. The coherence of Φ is defined as follows:

    μ(Φ) = max_{i≠j} |φ_i^H φ_j| / (‖φ_i‖_2 ‖φ_j‖_2)    (19)

where (·)^H denotes the conjugate transpose. The smaller the coherence is, the better the sparse recovery performance will be [19][20][21]. Therefore, to achieve better CSR capability, the action a_n can be obtained by solving the following optimization problem using exhaustive search or other numerical methods [24]:

    a_n = argmin_{a∈A} μ(Φ([f_{n−1}^T, f_a]^T))    (20)

where f_{n−1} is a vector consisting of the previous transmit frequencies and f_a is the carrier frequency corresponding to action a.
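The coherence metric and the exhaustive search over candidate channels can be sketched as below. The grids, PRI, and frequency values are hypothetical, and the measurement-matrix builder mirrors the column definition of the signal model.

```python
import numpy as np
from itertools import product

c = 3e8
Tr = 1e-3
r_grid = np.linspace(0, 150, 6)          # hypothetical range grid (m)
v_grid = np.linspace(-30, 30, 6)         # hypothetical velocity grid (m/s)

def measurement_matrix(f):
    """Phi for a frequency sequence f: one column per (range, velocity) cell."""
    f = np.asarray(f, dtype=float)
    t = np.arange(len(f)) * Tr
    cols = [np.exp(-1j * 4 * np.pi / c * f * (r - v * t))
            for r, v in product(r_grid, v_grid)]
    return np.array(cols).T              # N x PQ

def coherence(Phi):
    """Equation (19): largest normalized inner product between distinct columns."""
    norms = np.linalg.norm(Phi, axis=0)
    G = np.abs(Phi.conj().T @ Phi) / np.outer(norms, norms)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

# Equation (20): exhaustively try each candidate channel for the next pulse and
# keep the one minimizing the coherence of the extended measurement matrix.
f_prev = 1e9 + 1e6 * np.array([0, 3, 7, 2, 9, 5, 1], dtype=float)
candidates = 1e9 + 1e6 * np.arange(10, dtype=float)
mus = [coherence(measurement_matrix(np.append(f_prev, fc))) for fc in candidates]
print("chosen channel:", int(np.argmin(mus)))
```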
In the exploration phase of the traditional ε-greedy strategy, the action is selected randomly to realize action exploration. This strategy can explore the anti-interference action effectively but cannot guarantee the sparse recovery capability of the transmit pulses. Since the transmit frequency sequence possesses randomness to achieve a high sparse recovery probability [21], it is feasible to use the CSR metric to conduct anti-interference action exploration. Therefore, as described in Algorithm 2, the action is selected using Equation (20) in the exploration phase. In addition, the CSR metric can be used in the exploitation phase to achieve better target measurement performance while avoiding active interference. According to the definition of the reward function, the frequency channels corresponding to the lower output Q values are more likely to contain interference and vice versa. Therefore, the set A_sub of unoccupied frequency channels can be obtained by performing the following clustering method on the outputs of the main Q network for different frequency channels. Specifically, rank the output Q values as

    Q_r = [Q_1, Q_2, . . . , Q_M],  Q_1 ≤ Q_2 ≤ . . . ≤ Q_M    (21)

Next, the occupied and unoccupied frequency channels can be divided by clustering the lower and higher Q values in Q_r, respectively. In detail, compute the difference between the adjacent Q values in Q_r and obtain the vector

    D = [Q_2 − Q_1, Q_3 − Q_2, . . . , Q_M − Q_{M−1}]    (22)

The clustering boundary can be obtained by

    i* = argmax_i D(i)    (23)

The set of the high Q values is defined as

    Q_high = {Q_{i*+1}, . . . , Q_M}    (24)

Correspondingly, the set A_sub can be obtained by

    A_sub = {a | Q(s_n^*, a; θ) ∈ Q_high}    (25)

Finally, the action of the CS-based FAR can be selected from A_sub using Equation (20) in the exploitation phase. The process of the CSR-ε-greedy exploration method is summarized in Algorithm 2.
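The largest-gap clustering of the Q values can be sketched in a few lines; the Q values below are invented for illustration.

```python
import numpy as np

def unoccupied_set(q_values):
    """Cluster the main network's Q values by the largest adjacent gap:
    channels whose Q value lies above the biggest gap are treated as
    unoccupied. Returns 1-indexed channels."""
    q_values = np.asarray(q_values, dtype=float)
    order = np.argsort(q_values)            # ranking step
    diffs = np.diff(q_values[order])        # adjacent differences
    boundary = int(np.argmax(diffs))        # clustering boundary (largest gap)
    return sorted(int(i) + 1 for i in order[boundary + 1:])

# Channels 2 and 4 look jammed (low Q); channels 1, 3, 5 look free.
print(unoccupied_set([0.8, -0.9, 0.7, -1.0, 0.9]))   # -> [1, 3, 5]
```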

•
Perform the following step with probability ε (exploration phase): Select the action a_n by solving (20).
• Perform the following steps with probability 1 − ε (exploitation phase): (a) Compute the output Q values for different actions using the main network Q(s*_n, a; θ), a = 1, . . . , M. (b) Generate the set A_sub of the unoccupied frequency channels by performing the clustering method described in (21)–(25). (c) Select the action a_n from A_sub using (20).
Output: the selected action a_n.
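Putting the two phases together, the selection logic of Algorithm 2 can be sketched as follows. The gap-clustering helper, the coherence surrogate, and all numeric values are hypothetical illustrations.

```python
import numpy as np

def unoccupied_set(q_values):
    # largest-gap clustering of the Q values, returning 1-indexed channels
    q_values = np.asarray(q_values, dtype=float)
    order = np.argsort(q_values)
    boundary = int(np.argmax(np.diff(q_values[order])))
    return sorted(int(i) + 1 for i in order[boundary + 1:])

def csr_epsilon_greedy(q_values, coherence_of, epsilon, rng):
    """Algorithm 2 sketch: exploration searches all channels, exploitation only
    the inferred-unoccupied ones; both phases minimize the CSR coherence
    metric. `coherence_of` maps a channel to its resulting coherence value."""
    if rng.random() < epsilon:                       # exploration phase
        pool = range(1, len(q_values) + 1)
    else:                                            # exploitation phase
        pool = unoccupied_set(q_values)
    return min(pool, key=coherence_of)

rng = np.random.default_rng(0)
# Toy coherence surrogate: pretend channel 3 yields the least-coherent matrix.
mu = {1: 0.9, 2: 0.7, 3: 0.2, 4: 0.8, 5: 0.6}
a = csr_epsilon_greedy([0.8, -0.9, 0.7, -1.0, 0.9], mu.get, epsilon=0.0, rng=rng)
print("selected channel:", a)   # exploitation picks channel 3 (free, lowest mu)
```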

Numerical Results
In this section, experimental results are presented to show the effectiveness and advantages of the design. Sections 4.1 and 4.2 first analyze the proposed recognizer-based belief state computing method and the developed CSR-ε-greedy exploration strategy. Sections 4.3 and 4.4 present comparisons between different anti-interference strategies and different target measurement methods to demonstrate the superiority of the proposed DDQN-CSR-ε-greedy method. Unless otherwise stated, the experimental conditions are set as follows: (1) The parameters used to define the radar environment and characterize the DDQN-based cognitive frequency design are given in Table 1. To eliminate the limitation brought by the pulse width on the frequency step, a linear frequency modulated (LFM) signal is transmitted in each pulse. (2) A fully connected feedforward neural network is employed as the Q network. The parameters of the Q network are given in Table 2. (3) Based on the related works [12][13][14][15][16][17], several active interference dynamics are employed to evaluate the performance of the proposed design. The interference strategies are detailed in Table 3. (4) For the recognizer-based belief state computing method, to balance the computational complexity and recognition performance, an NN-softmax with two hidden layers is used to construct the active interference recognizer. The network parameters are given in Table 4. (5) For the CSR-ε-greedy and ε-greedy exploration strategies, the exploration probability ε is linearly reduced from 1 to 0.

Analysis of the Recognizer-Based Belief State Computing Method
The belief state computing formula in Equation (14) is strictly derived based on probability theory, and the output posterior probability is the only variable in the computing formula. In addition, we can see in Figure 5 that the output posterior probability plays a key role in implementing the RL-based radar strategy design. Therefore, in this subsection, we analyze the output posterior probability of the designed active interference recognizer under different scenarios to verify the effectiveness of the proposed recognizer-based method.
In the following examples, the active interference data are from the jammer, and the target echo is simulated by modulating the transmit signal in the range-Doppler domain. We train the active interference recognizer with 79 active interference signals and test the recognizer with 100 noise signals, 100 target echoes, and 100 active interference signals. Figure 6 plots the output posterior probabilities for different observation cases in Equation (7). In Figure 6a, the noise-only environment is considered. In Figure 6b-d, white Gaussian noise of power 0 dBW is added, and a signal-to-noise ratio (SNR) of 0 dB is assumed in Figure 6d.
As Figure 6 shows, the output posterior probabilities are less than 0.01 and 0.04 for the noise-only and target scenarios, respectively. When the interference-to-noise ratio (INR) is greater than 0 dB, the values of the output posterior probability exceed 0.96 and 0.7 for the interference and coexisting cases, respectively. We can observe that the value of the output posterior probability is low in both noise-only and target scenarios and increases significantly in the presence of active interference. This illustrates that the output posterior probability can well reflect the state of the channel, and the reflection is robust to the sensing scenarios. In [17], the energy detector is used to connect the observation and state of the POMDP, which will make an incorrect judgment in the presence of high-powered noise and target echoes. As presented in Figure 6a,b, the proposed recognizer-based belief state computing method can maintain good performance in this case. Furthermore, as Figure 6c,d shows, the value of the output posterior probability increases with increasing interference energy, which illustrates that the output posterior probability can not only predict the presence of interference but also reflect the degree of interference. This capability contributes to the success of the DDQN-based solution method for avoiding active interference, which can be understood through the expression of the reward function in Equation (16).


Analysis of the CSR-ε-Greedy Exploration Strategy
In this subsection, we examine the effectiveness of the proposed CSR-ε-greedy exploration strategy by presenting the convergence, anti-interference performance, and coherence achieved by the method. Figure 7 plots the convergence curves of the average rewards versus episodes for different interference scenarios. All curves rise in the early stage of the interaction and reach stabilized values quickly. This means that anti-interference strategies have been learned efficiently by the CS-based FAR agents with the two exploration strategies. We can observe in Figure 7 that the learning speed of the developed CSR-ε-greedy exploration strategy is similar to that of the ε-greedy exploration strategy. In fact, due to the improved action selection process, the CSR-ε-greedy strategy has better convergence performance of target measurement capability than the ε-greedy strategy, which is demonstrated in the next subsection.
To see how well the CS-based FAR agent has learned, we test the anti-interference performance of the learned frequency strategy. As shown in Figure 8a-e, the CS-based FAR can totally avoid constant, sweep, and signal-dependent active interference. In Figure 8f, we test the proposed method in the stochastic interference environment. As presented, the CS-based FAR cannot predict the frequency of stochastic interference accurately, but it can learn the probability of the interference and select the frequency channel with the low probability of being occupied to avoid active interference.
To quantify and highlight the performance of the developed CSR-ε-greedy exploration method, the anti-interference probability and coherence achieved after 100 episodes are given in Table 5. In all active interference scenarios, the proposed CSR-ε-greedy exploration strategy can achieve lower coherence than the ε-greedy strategy while obtaining anti-interference probabilities comparable to the ones obtained by the ε-greedy strategy. Specifically, both exploration strategies can achieve 100% anti-interference probabilities in the constant, sweep, and signal-dependent interference scenarios. For the stochastic case, since the CSR-ε-greedy strategy considers the target measurement metric in the action selection, it obtains a slightly lower anti-interference probability than the ε-greedy strategy. Nevertheless, due to the lower coherence, the developed method can achieve better target measurement performance than the ε-greedy strategy in the active interference environment. This is illustrated in the following subsection.

Table 5. Anti-interference probability (%)/coherence.

Target Measurement Comparison between Different Anti-Interference Frequency Strategies in Active Interference
In this subsection, we compare the proposed DDQN-CSR-ϵ-greedy method with other anti-interference frequency control techniques, including the random frequency strategy, the SAA method, and the DDQN with the ϵ-greedy exploration strategy (DDQN-ϵ-greedy). For the random strategy, the frequency channels are selected with uniform probabilities. For the SAA method, the CS-based FAR first senses the active interference frequency and then picks an unoccupied frequency channel randomly. The OMP method [23] is used at the receiver to measure the target parameters. To evaluate the target measurement performance, we define the correct measurement probability as

Cr = Nc / Nal × 100%, (26)

where Nal is the total number of target measurement experiments, and Nc is the number of correct measurements. When both the range and velocity of the target are measured correctly, the value of Nc is increased by 1. Consider that the INR is 10 dB, the SNR is −5 dB, and 64 pulses (one CPI) are used for measuring the target parameters. Figure 9 plots the convergence curves of Cr versus episode for different exploration strategies under 100 Monte Carlo experiments. As Figure 9 shows, the proposed DDQN-CSR-ϵ-greedy method outperforms the DDQN-ϵ-greedy method in terms of convergence speed, stability, and the average value of Cr upon convergence. This is due to the improved learning process, in which the CSR-ϵ-greedy strategy can optimize the CSR metric while exploring and exploiting anti-interference actions, unlike the ϵ-greedy strategy.
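Equation (26) can be evaluated over Monte Carlo trials as in the sketch below. The tolerances and array names are illustrative assumptions; a trial increments Nc only when both the range and the velocity estimates are correct:

```python
import numpy as np

def correct_measurement_probability(range_est, range_true, vel_est, vel_true,
                                    range_tol, vel_tol):
    """Estimate Cr = Nc / Nal over Nal Monte Carlo trials (sketch).

    A trial counts as correct only when BOTH the range and the velocity
    estimates fall within the given tolerance of the truth.
    """
    range_ok = np.abs(np.asarray(range_est) - np.asarray(range_true)) <= range_tol
    vel_ok = np.abs(np.asarray(vel_est) - np.asarray(vel_true)) <= vel_tol
    n_c = int(np.count_nonzero(range_ok & vel_ok))
    n_al = len(range_ok)
    return n_c / n_al
```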
Setting the number of episodes in the DDQN-CSR-ϵ-greedy and DDQN-ϵ-greedy methods to 100, Figure 10 plots the value of Cr achieved by different anti-interference strategies under 100 Monte Carlo experiments. As shown, the value of Cr increases with increasing SNR, and the proposed method achieves better target measurement performance than the other techniques. Specifically, the random strategy performs poorly in all interference scenarios due to the absence of environmental knowledge. The SAA method is suitable for countering constant interference but cannot handle dynamic interference scenarios. The DDQN-ϵ-greedy method can learn the dynamics of the active interference; however, since it does not consider the measurement performance in the action selection, it obtains a target measurement performance that is even worse than that of the random strategy in some interference scenarios. In contrast, the proposed DDQN-CSR-ϵ-greedy method achieves better target measurement performance in all interference scenarios since it optimizes the CSR performance while learning the interference behaviors. In detail, the value of Cr achieved by the proposed method is approximately 100% when the SNR is greater than −2 dB for all interference scenarios.
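The two non-learning baselines compared above can be sketched as follows; the boolean channel-occupancy mask is an assumed interface for illustration:

```python
import numpy as np

def random_strategy(n_channels, rng):
    """Random frequency strategy: pick any channel with uniform probability."""
    return int(rng.integers(n_channels))

def saa_strategy(sensed_occupied, rng):
    """Sense-and-avoid (sketch): pick uniformly among channels sensed as
    unoccupied at the PREVIOUS step, falling back to a random channel if
    all channels were occupied. Because the sensing lags the environment,
    this cannot track dynamically changing interference.
    """
    free = np.flatnonzero(~np.asarray(sensed_occupied))
    if free.size == 0:
        return int(rng.integers(len(sensed_occupied)))
    return int(rng.choice(free))
```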

Target Measurement Comparison between Different Target Measurement Techniques in Active Interference
To further illustrate the superiority of the proposed method, we compare the DDQN-CSR-ϵ-greedy-based CSR method with other moving target measurement techniques, including the FFT-based MTD and the min-coherence-based CSR. The FFT-based MTD technique transmits an LFM signal with a bandwidth of 100 MHz and uses pulse compression and FFT-based MTD for signal processing at the receiver. For the min-coherence-based CSR, the carrier frequency of the transmit pulse is determined by Equation (20), and the other parameters are the same as those of the DDQN-CSR-ϵ-greedy-based CSR. As Figure 11 illustrates, the DDQN-CSR-ϵ-greedy-based CSR method shifts the performance curve significantly to the left compared to the FFT-based MTD and min-coherence-based CSR techniques, demonstrating a considerable improvement in target measurement in the presence of active interference.
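For reference, the FFT-based MTD baseline — pulse compression of each pulse followed by an FFT across slow time — can be sketched as below. The baseband data layout is an assumption, and the paper's LFM parameters are not reproduced here:

```python
import numpy as np

def fft_mtd(echoes, waveform):
    """FFT-based MTD sketch: matched-filter each pulse (pulse compression),
    then FFT across slow time to form a Doppler-range map.

    echoes: (n_pulses, n_samples) complex baseband returns (assumed layout).
    Returns an (n_pulses, n_samples) map with the Doppler axis fftshifted.
    """
    # Matched filter: correlate each echo with the transmitted waveform.
    mf = np.conj(waveform[::-1])
    compressed = np.array([np.convolve(e, mf, mode="same") for e in echoes])
    # Coherent integration: FFT along the slow-time (pulse) axis.
    return np.fft.fftshift(np.fft.fft(compressed, axis=0), axes=0)
```

Unlike the CSR-based methods, this pipeline needs a fixed waveform across the CPI, which is why it cannot exploit frequency agility against active interference.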

Conclusions
This work presented and validated an effective cognitive frequency design method for CS-based FAR in the presence of non-cooperative active interference.
As the problem formulation shows, the developed model does not require environmental knowledge, unlike previous decision models for radar frequency strategy design. Hence, it is more applicable in a non-cooperative interference environment. In addition, both agent and environment states are represented in the model to cope with signal-dependent and signal-independent interference. Beyond not relying on environmental knowledge, the results illustrate that the proposed recognizer-based belief state computing method for the model-free POMDP can well reflect the state of the environment and is robust to the sensing scenario. In the solution stage, the proposed DDQN-CSR-ϵ-greedy-based solution method can find a frequency strategy that achieves good target measurement performance while avoiding active interference. As the simulation results show, the proposed DDQN-CSR-ϵ-greedy method achieves a considerable target measurement performance improvement in the presence of active interference over existing anti-interference and target measurement techniques.
In addition to the advantages above, the proposed designs can be flexibly extended to other fields. For example, the proposed recognizer-based method can be employed in other control problems to relate the observation to the state of the control model, resolving the difficulties caused by non-cooperation. Additionally, the developed DDQN-CSR-ϵ-greedy-based design method can be employed in other tasks involving active interference by substituting the CSR metric with other metrics.
This work assumes that observations in different frequency channels can be obtained at the same time, which increases the complexity of CS-based FAR receivers. Future work can focus on how to design a cognitive frequency strategy by observing only the frequency channel used for target measurement at each step. Perhaps some RL-based methods for spectrum sensing in the communication field can be used to solve this problem. In addition, the designed method is based on a single agent, i.e., CS-based FAR. Extending the design to multi-agent scenarios is also a research direction. Due to the incomplete information in the considered model, combining the proposed method with Bayesian game theory [25,26] may be an approach.
Funding: This research was funded by the National Natural Science Foundation of China, grant number 62001346.

Conflicts of Interest: The authors declare no conflict of interest.