2.1. Radar Countermeasure System Model
The operating environment of a modern radar system usually contains complex interference, which prevents the detection radar from obtaining accurate target information. As shown in
Figure 1, the system model consists of two agents (a frequency agile radar that detects targets and an airborne jammer): the radar obtains the echo signal through its receiver and evaluates the transmitted detection signal through effectiveness evaluation. It then makes a decision and launches a new anti-jamming detection signal. The jammer receives the radar detection signal and performs real-time interference processing.
Traditional frequency agile radar usually adopts a "cover pulse + tracking signal" composite detection waveform [34,35] to achieve anti-jamming. That is, the radar transmitter transmits two kinds of signals. One is a cover pulse with obvious spectral characteristics, used to induce the jammer to lock the frequency and waveform of the jamming signal onto the cover pulse. The other is a tracking signal with low probability of interception, which performs the real detection and tracking functions. The tracking signal is staggered with respect to the cover pulse, which reduces the effective power of the jamming signal entering the radar receiver and thus achieves the anti-jamming effect. The traditional radar anti-jamming system is shown in
Figure 2a. First, the radar receiver receives echo signals from the electromagnetic environment to obtain measurements of the target and interference signal parameters. Then, the type of the interference signal is determined from the measured parameters and the existing prior knowledge base, e.g., the pulse width, modulation method and repetition period of the target and interference signals. Finally, the target and interference measurements are combined to schedule anti-jamming resources, select an effective cover pulse and tracking signal, and transmit them through the transmitter.
According to the anti-jamming characteristics of frequency agile radar, jammers usually adopt synchronous targeting jamming [36,37], a form of suppressive jamming. As shown in
Figure 3, the jammer performs frequency targeting on each radar pulse it receives. It then jams the current radar signal and stops after a period of jamming. When the next pulse arrives, the jammer performs frequency aiming again, jams for another period of time, and stops, and so on.
In
Figure 3, $T_{r1}$ and $T_{r2}$ denote the pulse repetition periods of the frequency agile radar. The distance between the radar and the target is determined by the propagation speed $c$ and the waveform propagation time $\tau$; here, $R = c\tau/2$ is used to describe the distance between the target and the radar. $t_1$, $t_2$ and $t_3$ are the frequency-aiming delay times of each pulse; $t_{j1}$, $t_{j2}$ and $t_{j3}$ are the real-time jamming durations of the jammer based on the aiming results; $t_{s1}$ and $t_{s2}$ are the jamming stop times.
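The timing pattern above (aim, jam, stop, repeat for every pulse) can be sketched as a small timeline computation. This is an illustrative toy model, not part of the paper's system: the function name, the microsecond values, and the fixed aiming delay and jamming duration are all assumptions chosen for clarity.

```python
# Toy timeline of synchronous targeting jamming (all names and durations are illustrative).
def jam_timeline(pulse_times, aim_delay, jam_duration):
    """For each received radar pulse: aim for aim_delay, jam for jam_duration, then stop."""
    intervals = []
    for t in pulse_times:
        start = t + aim_delay              # frequency aiming finishes here
        intervals.append((start, start + jam_duration))
    return intervals

# Pulses arrive every 100 us; the jammer needs 5 us to aim and jams for 60 us.
print(jam_timeline([0, 100, 200], aim_delay=5, jam_duration=60))
# → [(5, 65), (105, 165), (205, 265)]
```

The gap between the end of one jamming interval and the next pulse corresponds to the stop time in Figure 3.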
Synchronous targeting jamming technology can effectively interfere with frequency agile radar, and it is independent of the frequency agile modulation method. When the frequency agile radar system detects interference signal parameters that do not exist in its prior expert knowledge, it cannot dynamically adjust the anti-jamming detection signal pattern based on the anti-jamming effect feedback obtained from the environment. In this situation, the anti-jamming performance of the radar deteriorates drastically, and its processing efficiency drops greatly.
Compared with a traditional radar system that relies on a static prior knowledge base to make decisions, an intelligent radar anti-jamming system can dynamically perceive the external environment and optimize its knowledge base. The system can dynamically adjust its anti-jamming measures based on feedback from the external environment (whether the target is tracked, whether the jamming signal is locked onto the cover pulse). The intelligent radar anti-jamming system is shown in
Figure 2b. First, the echo signal is received from the receiver, and measurements of the target and interference signal parameters are obtained. Then, the interference pattern is analyzed according to the prior knowledge base. If the interference pattern does not exist in the prior knowledge base, the situation of the target and the interference signal must be identified and analyzed, and the current anti-jamming effectiveness must be evaluated. The evaluation includes the accuracy of the target information and whether the jamming signal is locked onto the cover pulse, etc. Finally, the anti-jamming method is adjusted according to the results of the effectiveness evaluation and transmitted. In addition, the newly detected interference pattern feature parameters and the effectiveness evaluation results are added to the prior knowledge base. The intelligent radar anti-jamming system continuously interacts with the external environment to learn, online, anti-jamming strategies that are not in the prior knowledge base. Through the accumulation of experience, the system can finally form a set of optimal strategies for a specific environment.
For the different radar system environments in the intelligent radar anti-jamming system, we establish an RL model suitable for a single radar system and an RL model suitable for a radar-jammer system environment. We analyze why the existing basic theoretical models and frameworks of RL are not suitable for the radar-jammer system environment, and propose an efficient RL model for this system.
2.2. Reinforcement Learning Model for Single Radar System
MDP is the theoretical basis of single-agent RL [38], and solving RL problems depends on this framework. As shown in
Figure 4a, the Markov decision process is a tuple $\langle s, a, T, r \rangle$ composed of four elements, where $s$ represents the finite set of all states of the environment observed by the agent, $a$ represents the finite set of all actions that the agent can take in the environment, $T$ represents the state transition function, and $r$ is the reward function. The decision sequence of the agent under the MDP framework can be expressed as $\{s_0, a_0, r_1, s_1, a_1, r_2, \dots\}$, where $s_{t+1}$ and $r_{t+1}$ depend only on the previous state $s_t$ and action $a_t$, and $t$ represents time.
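The Markov property of this decision sequence can be illustrated with a toy environment whose next state and reward are functions of the current state-action pair only. The environment, its transition rule, and the reward values below are illustrative assumptions, not part of the radar model.

```python
import random

class ToyMDP:
    """Toy MDP: 3 states, 2 actions; transition and reward are illustrative."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Next state and reward depend only on (current state, action) — the Markov property.
        next_state = (self.state + action + 1) % 3
        reward = 1.0 if next_state == 0 else 0.0
        self.state = next_state
        return next_state, reward

env = ToyMDP()
trajectory = []          # the sequence {s_0, a_0, r_1, s_1, a_1, r_2, ...}
s = env.state
for t in range(5):
    a = random.choice([0, 1])      # a_t
    s_next, r = env.step(a)        # s_{t+1}, r_{t+1} from (s_t, a_t) only
    trajectory.append((s, a, r))
    s = s_next
print(trajectory)
```

No history beyond $(s_t, a_t)$ enters `step`, which is exactly the condition the MDP framework imposes.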
The frequency agile radar system based on RL theory first launches a frequency agile detection signal $a_t$. Then the echo signal is received by the receiver. Finally, according to the current target state $s_{t+1}$ and the detection result evaluation $r_{t+1}$, the next anti-jamming frequency agile detection signal $a_{t+1}$ is selected and transmitted. The radar system obtains target status information by continuously interacting with the environment. According to the evaluation of the target status information, the transmitted signal frequency is changed to improve the radar's anti-jamming target detection ability in real time.
Early radar systems were simple and usually contained only a radar and a target. As shown in
Figure 4b, this simple single radar-target environment satisfies the MDP condition and can be mapped onto the RL theoretical model. The radar is regarded as the agent, and the target together with its surroundings is regarded as a complete MDP environment. Its parameters are defined in
Table 1. During the interaction between the radar and the environment, the goal of the radar is to adjust the emitted signal source to best detect the target's position, speed and acceleration in the environment.
This paper gives the parameter definitions of the RL model for a single radar system. Assume a discrete time sequence: at each time $t$, the radar can detect the state $s_t$ of a target from the environment. Define the signal source chosen by the radar at time $t$ as action $a_t$. At the next moment, the radar receives the value return $r_{t+1}$, which evaluates whether the target detected as a result of taking action $a_t$ in state $s_t$ is accurate. At each moment, the radar performs a mapping from the state $s_t$ to the selection probability of each possible signal action $a_t$. This mapping is called the radar strategy, denoted $\pi(a_t|s_t)$, which is the probability of selecting $a_t$ when the state is $s_t$.
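A strategy of this form can be sketched with tabular Q-learning and an $\varepsilon$-greedy policy over a discrete set of candidate waveforms. This is a minimal illustration under assumed discretization; the action indices, learning rate, and other constants are hypothetical, not values from the paper.

```python
import random
from collections import defaultdict

# Hypothetical discretization: a few candidate frequency-agile waveforms as actions.
ACTIONS = [0, 1, 2]
Q = defaultdict(float)           # Q[(state, action)] -> estimated return
alpha, gamma, eps = 0.1, 0.9, 0.2

def policy(state):
    """epsilon-greedy pi(a|s): mostly exploit the best-known waveform, sometimes explore."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(s, a, r, s_next):
    """Standard tabular Q-learning update from a transition (s, a, r, s')."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Each call to `update` corresponds to one radar-environment interaction: transmit $a_t$ in state $s_t$, observe $r_{t+1}$ and $s_{t+1}$, and refine the value estimate behind $\pi(a_t|s_t)$.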
Through the above analysis, a single radar system can be mapped onto the RL theoretical model. However, when there are multiple agents in the radar system environment (radar, jammer, and target), the MDP-based RL framework cannot be directly applied. New theory and conditional constraints must be introduced to ensure the convergence of the algorithm model.
2.3. Reinforcement Learning Modeling for Radar-Jammer System
With the rapid development of information technology and its widespread application in the military field, the application environment of modern radars has become increasingly complex. Usually, the combat environment contains a radar and a jammer, and the interference between the two prevents the detection radar from obtaining accurate target information. In this scenario, existing RL algorithm models cannot be directly applied to the radar-jammer countermeasure system. RL is developed from MDP theory: a single agent learns a strategy that maximizes a possibly delayed return signal in a stochastic, stationary environment, e.g., QL [18], DQN [21], DDQN [24], DQN+LSTM [25] and other classic algorithm models. These algorithms rely on certain prerequisites, namely that the agent can perform sufficient experiments in the environment and that the environment is an MDP; only then can RL guarantee convergence to the optimal strategy.
The radar-jammer system does not satisfy MDP theory in two respects. On one hand, the radar-jammer system is beyond the scope of application of MDP theory. The main reason is that in this kind of environment, the optimal strategy of an agent depends not only on the environment but also on the strategies adopted by the other agents, which violates the assumptions required to ensure convergence. When agents have opposing goals, there may be no single optimal strategy, so the equilibrium among agents is repeatedly re-searched and the environment becomes unstable. On the other hand, the decision-making of the radar is not a typical sequential decision process, and step changes are possible: a single step can move from a parameter action with weak anti-jamming performance to one with good anti-jamming performance, and successive actions are independent of each other, which also violates MDP theory. Therefore, in a complex electromagnetic environment with multiple coexisting agents, it is difficult for existing RL algorithms to obtain a correct evaluation of the current strategy from the environment, resulting in poor convergence and failure to satisfy the real-time requirements of military radars.
The combination of game theory and RL is the theoretical basis of solutions in the field of multi-agent reinforcement learning (MARL) [31]. The characteristic of a multi-agent system is the interaction of strategies between multiple agents. Each agent has independent goals and decision-making capabilities, and at the same time each agent is affected by the behavioral decisions of the other agents. When the radar is in a multi-agent environment, the value return of the radar is affected by the behavioral decisions of the jammer. According to MG theory, the radar and the jammer independently choose actions that form a joint action. If the learning model of multiple agents sharing the same environment is adopted, the process is as shown in
Figure 5: the actions $a^1_t$ and $a^2_t$ taken by the radar and the jammer are regarded as shared actions, and the joint action $(a^1_t, a^2_t)$ is used to obtain the joint state $s_{t+1}$ and value return $r_{t+1}$.
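The shared joint-action loop can be sketched as follows. The environment, its payoff values, and the convention of using the joint action itself as the joint state are illustrative assumptions made only to show the data flow of Figure 5.

```python
class JointEnv:
    """Toy shared environment: the return depends on the (radar, jammer) joint action."""
    # Illustrative payoffs: the radar gains when the jammer misses its frequency.
    REWARD = {(0, 0): -1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): -1.0}

    def step(self, joint_action):
        next_state = joint_action            # joint state observed by both agents
        return next_state, self.REWARD[joint_action]

env = JointEnv()
s_next, r = env.step((0, 1))   # radar picks action 0, jammer picks action 1
print(s_next, r)
```

Both agents submit their actions together and receive one joint state and return, which is exactly the coupling that makes the environment non-stationary from either agent's individual point of view.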
However, in reality, the radar and the jammer cannot know each other's state and action strategy, and each should regard the other as part of the environment. As shown in
Figure 6, this paper proposes a radar-jammer system environment model that regards the radar and the jammer as two agent-environment groups: the radar is environment 1 (Envi1), and the jammer and target together compose environment 2 (Envi2).
The parameter definitions of the radar side and the jammer side are shown in
Table 2. For the radar side, the jammer and the target together are regarded as Envi2. The radar sends a signal $a_t$ to Envi2, from which it obtains a target state $s_{t+1}$ containing jamming information and a return value $r_{t+1}$ that evaluates the accuracy of the target. The radar then adjusts its transmitted signal action $a_{t+1}$ to detect accurate target information.
Here, the target error rate refers to the probability that the target in the echo signal detected by the radar front-end detector is a false target produced by the jammer. The jammer interferes with the adaptive radar using the synchronous targeting jamming technology of
Section 2.1. Nevertheless, it cannot optimize its jamming strategy based on environmental feedback.
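From the radar's side, this interaction reduces to an ordinary agent-environment loop in which Envi2 hides the jammer. The sketch below is a toy stand-in: the jamming rule (the jammer hits any frequency the radar repeats), the state encoding, and the 0/1 returns are all assumptions, not the paper's model.

```python
class Envi2:
    """Toy stand-in for the jammer+target environment (Envi2); behavior is illustrative."""
    def step(self, radar_action):
        # Synchronous targeting caricature: repeating the previous frequency gets jammed.
        jammed = getattr(self, "last", None) == radar_action
        self.last = radar_action
        s_next = (radar_action, jammed)        # echo state seen by the radar
        r = 0.0 if jammed else 1.0             # return: 1 if the target is detected cleanly
        return s_next, r

env = Envi2()
returns = [env.step(f % 3)[1] for f in range(6)]   # hop over 3 frequencies
print(returns)
```

A radar that hops frequencies every pulse collects full returns in this toy, while one that repeats a frequency is jammed from the second pulse on; the radar can discover this from $r_{t+1}$ alone, without ever observing the jammer's internal strategy.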
This section proposes a radar-jammer system environment model suitable for the actual radar system environment. The theoretical basis of this model will be analyzed in
Section 2.4.
2.4. Markov Game Inference Applicable to Radar-Jammer System Environment
MDP includes one agent and multiple states. A radar-jammer system is a typical multi-agent system, and MDP theory cannot be applied to multi-agent systems. For a game with multiple agents and multiple states, the Markov game (MG) is defined. An MG is represented by a tuple $\{n, s, A^1, \dots, A^n, T, R^1, \dots, R^n\}$, where $n$ represents the number of agents in the current environment, $T$ represents the transfer function, $A^i$ ($i = 1, \dots, n$) represents the action set of the $i$th agent, and $R^i$ represents the reward function of agent $i$. The transfer function $T$ in an MG gives the probability distribution of the next state when the current state and the joint action of the agents are given. The reward function $R^i$ gives the reward obtained by agent $i$ in the next state $s'$ after the joint action $(a^1, \dots, a^n)$ is taken in state $s$.
Combining with MG, we give the definition of the radar-jammer countermeasure system. Assuming the system contains one radar and one jammer, the tuple can be expressed as $\{2, s, A^1, A^2, T, R^1, R^2\}$:
2 refers to the number of agents in the radar-jammer system, i.e., the radar and the jammer;
$s = \{s^1, s^2\}$ indicates the joint state of the radar-jammer system model, where $s^1$ indicates the current state of the radar side and $s^2$ indicates the state of the jammer;
$A^1$ represents the action set of the radar, and $A^2$ represents the action set of the jammer;
$T$ represents the transfer function, i.e., the probability of transferring to the next state when the radar takes action $a^1$ and the jammer takes action $a^2$ in the current state;
$R^1$ represents the reward function of the radar, and $R^2$ represents the reward function of the jammer. The radar side and the jammer side have independent reward functions, so different agents can receive different rewards from the same state transition.
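The two-agent tuple can be held in a small container to make the independent reward functions explicit. The class name, field names, and payoff values below are illustrative; only the tuple's structure comes from the definition above.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class RadarJammerGame:
    """Illustrative container for the radar-jammer Markov game tuple {2, s, A1, A2, T, R1, R2}."""
    n_agents: int = 2
    radar_actions: Tuple[int, ...] = (0, 1)        # A^1: candidate radar waveforms
    jammer_actions: Tuple[int, ...] = (0, 1)       # A^2: candidate jamming choices
    # R^1 and R^2: independent reward tables indexed by the joint action (a^1, a^2)
    r_radar: Dict[Tuple[int, int], float] = field(default_factory=dict)
    r_jammer: Dict[Tuple[int, int], float] = field(default_factory=dict)

game = RadarJammerGame(
    r_radar={(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0},
    r_jammer={(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0},
)
```

The same joint action indexes both tables and yields different rewards for the two agents, which is the defining difference from the single-agent MDP tuple.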
To judge whether MG is applicable to the model, the key is to determine whether a Nash equilibrium exists in the model. The core of game theory is to establish a strategy interaction model for the game between multiple agents. Each game is a mathematical model describing how the agents' reward strategies interact. In a radar-jammer system, each round of target detection and recognition by the radar is one game. In each detection, the radar selects a group of signals from the set of transmitted radar waveform signals as its action $a^1$.
Radar-jammer systems differ from zero-sum games such as Go [21], in which one side's gains are exactly the other side's losses. In the actual situation, neither party is certain that its action will bring a loss to the other party, so this is a general game, and a Nash equilibrium exists. As shown in
Table 3, taking two actions for each of the two agents as an example, we analyze the Nash equilibrium between the detecting radar and the jammer: the radar side (vertical) takes actions $a^1_1$ and $a^1_2$, and the jammer side (horizontal) takes actions $a^2_1$ and $a^2_2$. Each side has its own optimal decision action, $a^1_*$ and $a^2_*$, respectively.
Combining the Nash equilibrium theorem, as shown in Equation (1), let $\pi = (\pi^1, \pi^2)$ be the joint strategy of the radar-jammer system, where $\pi^1$ indicates the radar strategy and $\pi^2$ indicates the jammer strategy:

$$R^1(\pi^1_*, \pi^2) \geq R^1(\pi^1, \pi^2), \quad \forall \pi^1 \in \Pi^1 \qquad (1)$$

where $\pi^2$ represents the strategy implemented by the agent in the system other than the radar side, i.e., the jammer, $R^1$ represents the radar return function, $A^1$ represents the action set of the radar, and $\Pi^1$ represents the set of probability distributions on the action set $A^1$. When Equation (1) holds, strategy $\pi^1_*$ is the optimal strategy of the radar side. Here $\pi^1$ denotes an arbitrary strategy the radar may adopt when the jamming strategy $\pi^2$ is fixed, and $\pi^1_*$ denotes the optimal strategy of the radar when the jamming strategy $\pi^2$ is fixed. In the game process, when the radar side makes the optimal decision and the jammer keeps its strategy unchanged, the radar side cannot further improve its return, and a Nash equilibrium is reached. According to this, we give the definition of the Nash equilibrium of the radar-jammer system: in the game model
$\{2, s, A^1, A^2, T, R^1, R^2\}$, if, for the strategy combination $(\pi^1_*, \pi^2_*)$ of the radar and the jammer, the strategy $\pi^i_*$ of either party is the best response to the other party's strategy $\pi^{-i}_*$, then $(\pi^1_*, \pi^2_*)$ is a Nash equilibrium of the radar-jammer system. Because the relationship between the radar and the jammer is adversarial, the two parties cannot adopt their optimal strategies at the same time. Therefore, the pair $(a^1_*, a^2_*)$ in
Table 3 does not exist in the radar-jammer system. Accordingly, we take $(a^1_*, a^2)$ to be the Nash equilibrium state of successful radar anti-jamming, and $(a^1, a^2_*)$ to be the Nash equilibrium state of successful jamming by the jammer. In the process of the game between the radar and the jammer, there exists a Nash equilibrium of one side's optimal strategy. Therefore, MG theory can be used in the radar-jammer system environment model proposed in this paper.
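The best-response check behind this definition can be made concrete for a 2x2 game. The payoff tables below are illustrative, fully adversarial values (not the paper's Table 3); under them, no joint action survives both best-response tests, mirroring the claim that both sides' optimal actions cannot coexist.

```python
import itertools

# Illustrative 2x2 adversarial payoff tables (R^1 for the radar, R^2 for the jammer).
R1 = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
R2 = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}

def is_pure_nash(a1, a2):
    """(a1, a2) is a pure Nash equilibrium iff neither side gains by deviating alone."""
    radar_ok = all(R1[(a1, a2)] >= R1[(b, a2)] for b in (0, 1))
    jammer_ok = all(R2[(a1, a2)] >= R2[(a1, b)] for b in (0, 1))
    return radar_ok and jammer_ok

equilibria = [ja for ja in itertools.product((0, 1), repeat=2) if is_pure_nash(*ja)]
print(equilibria)
# → [] : in a fully adversarial table, no pure joint action is stable for both sides
```

In a general (non-zero-sum) payoff table, the same check would return the stable joint actions; running it on the system's actual rewards is how one would verify which of the one-sided equilibria described above holds.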