1. Introduction
With the development of modern electronic countermeasure (ECM) technologies and computing power, radar jamming techniques have become increasingly complex and diverse [1,2]. Consequently, radar anti-jamming decision-making plays an increasingly important role in modern radar applications [3]. In 2006, Haykin introduced the concept of cognitive radar [4] and foresightedly highlighted the application prospects of Bellman's dynamic programming equation in adaptive algorithms for radar signal processing. To address the problem of intelligent decision-making in radar anti-jamming, existing research analyzes the problem from the perspectives of game theory and reinforcement learning (RL).
Generally, the interaction between radar systems and jammers can be modeled as a two-player zero-sum (TPZS) game [5,6], where both sides attempt to select optimal strategies to maximize their own gains and maximize the losses of the opponent [7]. The strategy that brings both sides to a steady state is referred to as the Nash Equilibrium (NE) strategy [8]. However, in complex jamming environments, the model faces a large state and action space, and the information obtained by the model is usually limited. Strategy optimization based solely on game theory struggles to achieve absolute rationality, thus requiring strategies to be gradually improved through continuous learning.
As a branch of machine learning (ML), RL primarily addresses the exploration and learning of optimal strategies. RL has shown great application potential and achieved significant progress in energy management [9], robotic control [10], communication network optimization [11], and many other fields [12,13,14]. RL aims to achieve a stated goal by continuously learning and adapting its strategy in a dynamic environment, learning how to take actions that maximize rewards. Driven by the wave of intelligent development, the introduction of RL provides radar with an adaptive and dynamic anti-jamming strategy optimization method that can effectively handle the decision-making difficulties caused by the diversity of jamming signals. As a typical RL algorithm, Q-learning uses a value function approximation represented by a Q-value table. Q-learning provides an effective solution for low-dimensional decision-making problems, but as the state and action spaces grow, the size of the Q-value table increases exponentially, which not only slows down computation but also severely increases the cost of the decision-making system [15].
Compared to table-based value approximation, the deep Q-network (DQN) employs neural networks to approximate value functions and uses experience replay to reduce temporal correlation among data samples. Experiments have shown that DQN substantially outperforms Q-learning in high-dimensional state and action spaces. With the application of DQN, the radar anti-jamming strategy model can accommodate more anti-jamming actions to cope with the complex jamming environment.
However, DQN still has drawbacks. Since action evaluation and selection are performed by the same network, DQN tends to overestimate action values to varying degrees, and this overestimation is magnified step by step through bootstrapping during training, causing local optima and convergence difficulties. To address these issues, the double deep Q-network (DDQN) employs separate evaluation and target networks to mitigate action value overestimation. At the same time, DDQN improves the calculation of the target Q-value to further mitigate overestimation compared to DQN. DDQN thus reduces the overestimation of the value function, but in more complex environments its performance still faces challenges. First, DDQN relies on the ε-greedy exploration strategy, which selects random rather than optimal actions with a certain probability. The usual remedy is to gradually decrease ε with the number of training steps, but this relies heavily on the initial value of ε and its decay rate; inappropriate values lead to local optima and slow convergence. Second, traditional experience replay samples experiences with equal probability, which ignores differences between experiences; in a dynamic environment, this slows model convergence. Lastly, hyperparameters such as the learning rate, discount factor, and target network update frequency significantly impact DDQN's convergence speed and generalization ability [16], and fixed hyperparameter settings are inadequate for complex tasks. Regarding these issues, some models optimize the DDQN network structure [17,18,19,20] by employing a dueling double deep Q-network [21], which reconstructs the Q-value as the sum of a state-value function and an action advantage, allowing the agent to learn the state value more frequently and accurately. Other models improve experience replay [22,23] by adopting prioritized experience replay (PER) [24] to preferentially replay high-value experiences, thereby improving data utilization and learning speed.
In existing research, radar anti-jamming decision-making methods mainly focus on three aspects: overall anti-jamming strategy optimization, anti-jamming waveform strategy optimization, and frequency selection strategy optimization. Ref. [19] proposed a cognitive radar anti-jamming strategy generation algorithm based on a dueling double deep Q-network, showing that the dueling double deep Q-network generates a more effective anti-jamming strategy than DQN. Ref. [20] generates anti-jamming waveform strategies for airborne radars based on the dueling double deep Q-network and PER, and uses an iterative transformation method (ITM) to generate the time-domain signals of the optimal waveform strategy; the simulation results demonstrate the improvement brought by the dueling double deep Q-network and PER. Yet these improvements still exhibit slow convergence and potential local optima in complex jamming environments. Ref. [25] modeled both the radar and the jammer as agents and found the optimal power allocation strategy for the radar using the multi-agent deep deterministic policy gradient (MADDPG) algorithm based on multi-agent reinforcement learning. Ref. [26] presented an NE strategy for radar–jammer competitions involving dynamic jamming power allocation by utilizing an end-to-end deep reinforcement learning (DRL) algorithm, considering that the competition between the radar and the jammer involves imperfect information. Additionally, Ref. [27] proposed a slope-varying linear frequency modulation (SV-LFM) signal strategy generation method based on DQN, enabling radar systems to learn optimal anti-jamming strategies without prior information.
Moreover, compared to other types of radar systems, phased-array radars offer greater flexibility in selecting the positions and number of auxiliary antennas, which provides higher adaptability in dynamically changing jamming environments. Therefore, to address the above problems, this paper makes comprehensive improvements to the aforementioned methods and proposes a phased-array radar anti-jamming decision-making approach with high accuracy in environments where multiple types of jamming coexist. A noisy network constructed using factorized Gaussian noise [28] is incorporated to dynamically adjust action selection randomness during training, thereby enhancing exploration efficiency. Moreover, n-step temporal difference learning [29] is employed to update the Q-value using multi-step returns, accelerating reward propagation. These improvements are combined with a dueling network and DDQN to achieve a comprehensive performance enhancement. Additionally, as an improvement of double-depth priority experience replay (DDPER) [30], this paper proposes variable double-depth priority experience replay (VDDPER), which refines the experience learning mechanism by dynamically adjusting the proportion of deep and shallow experiences utilized during training, further accelerating model convergence. The main contributions of this work are as follows:
To improve the reward convergence speed and decision accuracy, a multi-aspect improved deep Q-network (MAI-DQN) algorithm for phased-array radar anti-jamming decision-making is proposed. In the MAI-DQN, a noisy network based on factorized Gaussian noise improves the exploration policy, and n-step learning and DDQN are used to update the temporal difference target value, thereby improving reward propagation in long-sequence tasks. In addition, a dueling network decomposes the Q-value into two parts, introducing the concept of advantage and further improving the network structure. Moreover, target soft updates, a variable learning rate, and gradient clipping are used to further improve training.
To enhance the utilization of all experiences in the early stages of training and emphasize high-value experiences in the later stages, the VDDPER mechanism is proposed as an improvement over DDPER. VDDPER dynamically adjusts the sample sizes drawn from experiences of different values, directing the model's attention toward the most useful experiences at each stage of training.
To accurately characterize jamming states in complex environments and enable optimal anti-jamming measures, a novel and effective design of the state space, action space, and reward function—combining data-level and signal-level benefits—is proposed and shown to achieve rapid convergence during training.
The remainder of this paper is organized as follows: Section 2 introduces the phased-array radar echo signal model, array configuration, and the interactive system model of the radar–jammer competition. Section 3 analyzes and describes the RL decision-making process and the basic concepts of DQN. Section 4 presents the implementation of MAI-DQN in phased-array radar anti-jamming decision-making and details its improvements. Section 5 describes the experimental setup for several RL algorithms and MAI-DQN, followed by a comparison and analysis of the experimental results. Finally, the conclusion and future work are presented in Section 6.
2. System Model
In this section, we introduce the basic concept of phased-array radar and analyze the echo signal model. Then, we introduce and analyze the environmental interaction model between the phased-array radar and the jammer.
As a kind of array radar, phased-array radar controls the beam direction by adjusting the phase of individual antenna elements within the array, allowing for rapid scanning, multi-target tracking, and enhanced anti-jamming capabilities. Essentially, it represents a signal processing approach implemented in the analog domain using radio frequency (RF) components and feed networks. This paper focuses on a monostatic uniform linear array (ULA) phased-array radar. Therefore, an accurate echo model is established. When employing a linear frequency modulation (LFM) signal, the single-target echo signal model for an M-element phased-array radar can be expressed as follows:
where u(t) represents the signal envelope with a pulse width of T; f_c denotes the carrier frequency; τ represents the target propagation delay; k denotes the signal chirp rate; f_d denotes the target Doppler shift; and φ_0 represents the initial phase of the echo signal. Additionally, a(θ) denotes the signal steering vector:
a(θ) = [1, e^(j2πd sin θ/λ), …, e^(j2π(M−1)d sin θ/λ)]^T,
where d denotes the inter-element spacing of the antenna array, and λ denotes the radar wavelength. To avoid grating lobes, most ULA-configured phased-array radars set the element spacing to d = λ/2. The echo amplitude also accounts for the influence of the target's radar cross-section (RCS). In this paper, a log-normal RCS fluctuation model is employed to simulate large naval targets, with the probability density function defined as follows:
where σ_m represents the median RCS value, and ρ denotes the ratio between the average and median RCS values.
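To make the echo model above concrete, the following NumPy sketch builds a baseband LFM envelope, the ULA steering vector with half-wavelength spacing, and a log-normal RCS sample. The function names, pulse parameters, and the simplifications noted in the comments are illustrative assumptions rather than the paper's exact simulation settings.

```python
import numpy as np

def lfm_pulse(T=10e-6, k=5e12, fs=100e6):
    """Baseband LFM envelope of width T with chirp rate k, sampled at fs."""
    t = np.arange(0, T, 1.0 / fs)
    return np.exp(1j * np.pi * k * t ** 2)

def ula_steering(theta_deg, M=16, d_over_lambda=0.5):
    """Steering vector a(theta) of an M-element ULA with spacing d = lambda/2."""
    theta = np.deg2rad(theta_deg)
    m = np.arange(M)
    return np.exp(1j * 2 * np.pi * d_over_lambda * m * np.sin(theta))

# Spatial-temporal structure of a single-target echo: a(theta) * s(t).
# Delay, Doppler, and the carrier term are omitted here for brevity.
s = lfm_pulse()
a = ula_steering(theta_deg=20.0)
echo = np.outer(a, s)                      # shape: (M, number of samples)

# Log-normal RCS fluctuation: median sigma_m, mean-to-median ratio rho,
# so the log-domain standard deviation is sqrt(2 * ln(rho)).
sigma_m, rho = 100.0, 1.5                  # illustrative values
rcs_sample = np.random.lognormal(np.log(sigma_m), np.sqrt(2.0 * np.log(rho)))
```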
Moreover, this paper employs frequency agility (FA) and cover pulse techniques as active anti-jamming strategies. When the radar adopts an FA waveform, the single-target echo signal model can be expressed as follows:
where f_m = f_c + c_m·Δf represents the carrier frequency of the m-th pulse, c_m controls the frequency hopping of the m-th pulse, and Δf denotes the minimum frequency hopping interval. When the radar employs a cover pulse waveform, it simultaneously transmits multiple carrier frequencies to deceive jammers into intercepting incorrect frequency points. Assuming that the real signal carrier frequency at this time is f_m, the cover pulse waveform echo signal model will not be repeated here.
As shown in Figure 1, when operating in search mode, the radar transmits signals to scan the target region. Part of the signal is reflected by the target and received as an echo by the radar, while part of the signal is intercepted by the jammer. Using digital radio frequency memory (DRFM) technology, the jammer samples and stores the intercepted signal at high speed. After analyzing the signal characteristics, the jammer decides whether to transmit suppression or deception jamming signals to interfere with the radar in the time, frequency, or energy domain. At the same time, noise, represented by Gaussian white noise, is also received by the radar due to thermal noise from the radar electronics, atmospheric noise, and other effects.
Regarding jamming, this paper investigates various types of jamming signals [31,32,33,34], including dense false target jamming, sweep jamming, coherent slice forwarding jamming, and spot jamming, as shown in Figure 2, alongside the target echo and noise signals. Due to the presence of multiple jammers, the above types of jamming may coexist, and based on the jamming direction, they can be categorized as mainlobe, sidelobe, or combined mainlobe–sidelobe jamming. In addition, random phase perturbations are introduced across all types of jamming to simulate potential uncertainties such as phase noise from the jammer, phase disturbance strategies, and multipath effects.
Regarding anti-jamming measures and anti-jamming strategies, this paper incorporates frequency agility, cover pulses, the sidelobe canceller (SLC) algorithm, and sidelobe blanking (SLB) as anti-jamming measures. During the game process with the jammer, the model iteratively explores and generates the most effective combination of anti-jamming measures, leading to the derivation of the optimal anti-jamming strategy.
Since electromagnetic waves in free space satisfy the principle of linear superposition, and the target echo signal, jamming signal, and noise are independent, these three signals are generally additive. Under such conditions, the radar's received signal model can be expressed as follows:
x(t) = x_s(t) + x_j(t) + n(t),
where x_s(t) denotes the target echo signal, x_j(t) represents the jamming signal under one or more types of jamming, and n(t) accounts for the noise. The amplitude factor of the target echo is given by the following:
where SNR represents the signal-to-noise ratio and σ_n denotes the standard deviation of the Gaussian white noise. The amplitude factor of the jamming signal is as follows:
where JNR represents the jamming-to-noise ratio.
After constructing the signal model, additional metrics beyond the SNR and JNR are required to comprehensively evaluate the signal quality before and after anti-jamming measures, thereby accurately reflecting the effectiveness of anti-jamming strategies. The first metric is the interference suppression ratio, which reflects the effectiveness of anti-jamming techniques in suppressing jamming signal energy. The interference suppression ratio can be expressed as follows:
where E_before denotes the energy of the jamming signal before suppression, and E_after represents the remaining energy of the jamming signal after suppression.
Another evaluation metric is the radar target detection result after pulse compression and cell-averaging constant false alarm rate (CA-CFAR) processing, which assesses the effectiveness of anti-jamming measures in filtering out false targets. As a classical constant false alarm rate method, CA-CFAR estimates the detection threshold from a set of neighboring background noise cells and compares it with the signal level in the cell under test to determine the presence of a target. Generally, the CA-CFAR detection threshold is given by the following:
V_T = α·P_n, with α = N·(P_fa^(−1/N) − 1) and P_n = (1/N)·Σ_{i=1}^{N} x_i,
where α represents the threshold scaling factor, P_n denotes the estimated reference noise level, P_fa denotes the predetermined false alarm probability, N denotes the number of reference cells, and x_i denotes the amplitude of the echo signal in the i-th reference cell.
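A minimal sketch of the CA-CFAR rule described above, assuming a one-dimensional pulse-compressed power profile; the reference-cell count, guard-cell count, and false-alarm probability used here are illustrative, not the paper's settings.

```python
import numpy as np

def ca_cfar(x, num_ref=32, num_guard=4, pfa=1e-6):
    """Cell-averaging CFAR over a 1-D pulse-compressed power profile."""
    n = len(x)
    alpha = num_ref * (pfa ** (-1.0 / num_ref) - 1.0)   # threshold scaling factor
    half = num_ref // 2
    detections = np.zeros(n, dtype=bool)
    for i in range(half + num_guard, n - half - num_guard):
        lead = x[i - num_guard - half : i - num_guard]          # leading reference cells
        lag = x[i + num_guard + 1 : i + num_guard + half + 1]   # lagging reference cells
        noise_level = np.mean(np.concatenate([lead, lag]))      # P_n estimate
        detections[i] = x[i] > alpha * noise_level              # compare with alpha * P_n
    return detections
```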
Moreover, a set of data-level evaluation metrics is designed within the reward function to assess the effectiveness of anti-jamming measures in the frequency domain.
4. MAI-DQN for Anti-Jamming Decision-Making
The traditional DQN algorithm cannot achieve fast reward convergence and an accurate anti-jamming strategy in complex jamming environments. For this reason, the MAI-DQN is proposed.
In this section, the improvements made to DQN in this paper are detailed in terms of the network architecture, exploration strategy, and training methods. Then, the state and action spaces of the phased-array radar anti-jamming decision-making model are introduced in detail, and a detailed solution for the phased-array radar anti-jamming strategy based on MAI-DQN is given. As shown in Figure 4, in terms of network architecture, the current signal state s_t of the phased-array radar is input into the network architecture enhanced with DDQN and a dueling network. In terms of exploration policy, the action is selected through the noisy network, and the TD error values for training are then obtained. In terms of training methods, the network parameters are updated using the TD error generated by the n-step TD method, and a variable learning rate, soft updates, and gradient clipping are used to improve the training results. Finally, the Q-value corresponding to the current anti-jamming action is output. In this process, VDDPER stores the current experience in three separate pools based on the reward magnitude for efficient experience replay and adjusts the sample sizes according to the value of each experience.
4.1. Network Architecture
4.1.1. Double Deep Q-Network
In the network architecture, the MAI-DQN algorithm incorporates a double deep Q-network (DDQN) to further mitigate the overestimation issue observed in DQN. When calculating the one-step TD target value in DDQN, the action corresponding to the maximum Q-value is first selected using the evaluation network:
a* = argmax_a Q(s_{t+1}, a; θ);
then, the improved one-step TD target value can be expressed as follows:
y_t = r_t + γ·Q(s_{t+1}, a*; θ⁻),
where θ and θ⁻ denote the parameters of the evaluation network and the target network, respectively, and γ is the discount factor.
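A minimal PyTorch sketch of this double-DQN target computation; eval_net and target_net are assumed to map a batch of states to per-action Q-values, and the done flag marking terminal transitions is an added assumption for completeness.

```python
import torch

@torch.no_grad()
def ddqn_target(reward, next_state, done, gamma, eval_net, target_net):
    """One-step double-DQN target: the evaluation network selects the action,
    while the target network evaluates it, mitigating Q-value overestimation."""
    best_action = eval_net(next_state).argmax(dim=1, keepdim=True)      # a* from theta
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)   # Q(s', a*; theta-)
    return reward + gamma * next_q * (1.0 - done)
```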
4.1.2. Dueling Network
MAI-DQN further improves the learning efficiency and generalization capability of DQN by introducing a dueling network structure. In the DQN architecture, the fully connected layers connect directly to the output layer, producing the Q-values corresponding to each action in the current state after training. In contrast, the dueling network introduces two separate branches following the linear layers, forming two evaluation modules: one evaluates the state value V(s), referred to as the state-value network, and the other predicts the advantage A(s, a) of each action, known as the advantage network. This structure ensures that the assessment of the state does not depend entirely on the values of the actions, enabling the model to converge better when actions have a limited impact on the state value. In the dueling network, the Q-value output can be represented as follows:
Q(s, a; α, β) = V(s; β) + A(s, a; α),
where α and β are the parameters of the advantage network and the state-value network, respectively, used for function optimization.
However, because the above equation only constrains the sum of V(s; β) and A(s, a; α), it cannot yield accurate and unique values of V and A for a given Q-value, potentially leading to unstable or non-convergent training. Therefore, the following improvement is adopted:
Q(s, a; α, β) = V(s; β) + A(s, a; α) − (1/|A|)·Σ_{a′} A(s, a′; α),
where |A| denotes the size of the action space. With this improvement, the state value and the advantage can be recovered consistently for a given Q-value.
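The dueling head below is a sketch of the mean-subtracted combination in the improved equation; the hidden dimension and the single shared feature input are assumptions about the surrounding network rather than the paper's exact architecture.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    def __init__(self, hidden_dim, num_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)                 # state-value branch
        self.advantage = nn.Linear(hidden_dim, num_actions)   # advantage branch

    def forward(self, features):
        v = self.value(features)                              # (batch, 1)
        a = self.advantage(features)                          # (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)            # identifiable Q-values
```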
4.2. Exploration Policy
As previously mentioned, the traditional DQN employs an ε-greedy exploration policy. However, both the parameter ε and its decay rate require manual configuration. Optimal exploration strategies may vary across different scenarios or different parameter settings within the same scenario, and manually tuning the parameters of the exploration policy may cause slower convergence or suboptimal performance. Additionally, in certain states, some actions may clearly be suboptimal, yet the ε-greedy policy still selects these actions with probability ε, reducing exploration efficiency.
To address this issue, this paper introduces a noisy network to enhance the exploration policy. The noisy network incorporates learnable stochastic noise into the weights of the neural network, enabling the agent to automatically adjust the degree of exploration according to different states. For a neural network, a linear layer with input dimension p and output dimension q can be represented by the following equation:
y = w·x + b,
where x ∈ R^p, y ∈ R^q, w ∈ R^{q×p} represents the weight matrix, and b denotes the bias vector. Both w and b are learnable parameters of the network. When noise is introduced, the equation becomes the following:
y = (μ_w + σ_w ⊙ ε_w)·x + (μ_b + σ_b ⊙ ε_b),
where μ_w ∈ R^{q×p}, σ_w ∈ R^{q×p}, μ_b ∈ R^q, and σ_b ∈ R^q. The symbol ⊙ denotes the Hadamard product, and ε_w, ε_b are random noise variables.
In the noisy network, the parameters μ_w, σ_w, μ_b, and σ_b are learned and updated through gradient descent, whereas the noise variables ε_w and ε_b are generated using factorized Gaussian noise. Specifically, p values are first sampled independently from a Gaussian distribution to obtain ε_p, and subsequently, q values are sampled independently from a Gaussian distribution to obtain ε_q. The noise variable ε_w is then calculated as follows:
ε_w(i, j) = f(ε_q(i))·f(ε_p(j));
similarly, ε_b can be expressed as follows:
ε_b(i) = f(ε_q(i)).
In the above equations, f(x) = sgn(x)·√|x|, while the initial values of μ_w and μ_b are randomly sampled from independent uniform distributions over [−1/√p, 1/√p]. In this paper, the initial values of σ_w and σ_b are set to σ_0/√p.
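A compact noisy linear layer following the factorized-noise construction above; the initialization constant sigma0 = 0.5 is the common default from the noisy-network literature and is an assumption here, as is the per-forward-pass resampling of the noise.

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer with factorized Gaussian noise on weights and biases."""
    def __init__(self, p, q, sigma0=0.5):
        super().__init__()
        self.p, self.q = p, q
        bound = 1.0 / math.sqrt(p)
        self.mu_w = nn.Parameter(torch.empty(q, p).uniform_(-bound, bound))
        self.mu_b = nn.Parameter(torch.empty(q).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((q, p), sigma0 / math.sqrt(p)))
        self.sigma_b = nn.Parameter(torch.full((q,), sigma0 / math.sqrt(p)))

    @staticmethod
    def _f(x):                        # f(x) = sgn(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_p = self._f(torch.randn(self.p, device=x.device))   # input-side noise
        eps_q = self._f(torch.randn(self.q, device=x.device))   # output-side noise
        eps_w = eps_q.outer(eps_p)                               # factorized weight noise
        w = self.mu_w + self.sigma_w * eps_w
        b = self.mu_b + self.sigma_b * eps_q
        return x @ w.t() + b
```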
4.3. Training Methods
4.3.1. Variable Double-Depth Priority Experience Replay
Traditional DQN and DDQN improve training efficiency through experience replay, sampling experiences uniformly with equal probability. However, during training, experiences with larger TD errors typically provide greater learning value, making uniform sampling inefficient and resulting in slower convergence. To address this, the PER method replaces uniform sampling with priority-based sampling, improving the agent's learning efficiency. Additionally, PER employs a Sum Tree data structure to store and retrieve sampling weights, which helps reduce computational complexity. In PER, the sampling probability of each experience is defined as follows:
P(i) = p_i / Σ_k p_k,
where p_i = |δ_i| + ε_0 represents the sampling priority of the i-th experience, δ_i is its TD error, and ε_0 is a small positive constant used to prevent division by zero when the TD error is zero. Building on the above method, Ref. [31] proposed the DDPER algorithm to further enhance the algorithm's efficiency in utilizing scarce, high-value experiences. In short, DDPER compares the reward r_t of the current experience with a reward threshold r_th, which is dynamically adjusted during training. In addition to storing experiences in a shallow experience pool, those with rewards exceeding the threshold are classified as positive experiences and stored in a positive deep experience pool. Additionally, DDPER identifies experiences with negative rewards as negative experiences and stores them in a negative deep experience pool. Thus, high-value experiences are distinguished from ordinary experiences.
During training, experiences from the shallow experience pool are sampled according to the priorities p_i, whereas those from the deep experience pools are sampled uniformly. The per-batch sample sizes are defined in terms of the batch size B: the shallow experience pool contributes a base number of samples, while the positive and negative deep experience pools contribute B_p and B_n samples, respectively.
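A flat-array sketch of the priority-proportional sampling used for the shallow pool; the paper uses a Sum Tree to draw from the same distribution at lower cost, and eps below plays the role of the small constant ε_0.

```python
import numpy as np

def sample_by_priority(td_errors, batch_size, eps=1e-6):
    """Draw indices with probability proportional to p_i = |TD error| + eps."""
    p = np.abs(np.asarray(td_errors, dtype=np.float64)) + eps
    probs = p / p.sum()                                   # P(i) = p_i / sum_k p_k
    return np.random.choice(len(p), size=batch_size, replace=False, p=probs)
```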
In DDPER, both B_p and B_n are fixed values. In contrast, in the proposed VDDPER, the model dynamically adjusts B_p and B_n according to the training step. This dynamic adjustment enables the experience replay mechanism to better align with the human memory update process. The adjustment strategy for B_p and B_n is described as follows:
where B_p,0 and B_n,0 represent the preset initial values, B_p,max and B_n,max denote their respective maximum limits, ΔB_p and ΔB_n signify the total increments, and t and T_max are the current training step and the maximum number of training steps, respectively. Through this update mechanism, both B_p and B_n gradually increase from their initial values, B_p,0 and B_n,0, during training until they reach their maximum limits, B_p,max and B_n,max. At this point, the model achieves maximum efficiency in utilizing high-value experiences.
This approach aims to direct the model’s attention primarily toward ordinary experiences in the initial training stages, enabling the model to adequately learn from a diverse range of experiences. Conversely, in later stages, the model progressively shifts its focus toward high-value experiences, thus enhancing convergence speed and stability.
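The text above specifies that B_p and B_n grow from preset initial values by total increments until reaching maximum limits; a linear ramp over the training steps is one plausible instantiation and is assumed in the sketch below, whose parameter names and numbers are illustrative.

```python
def deep_sample_sizes(step, max_steps, b_pos0, b_neg0, d_pos, d_neg,
                      b_pos_max, b_neg_max):
    """Grow the per-batch counts of positive/negative deep-pool samples
    linearly with training progress, capped at their maximum limits."""
    progress = min(step / max_steps, 1.0)
    b_pos = min(b_pos0 + int(d_pos * progress), b_pos_max)
    b_neg = min(b_neg0 + int(d_neg * progress), b_neg_max)
    return b_pos, b_neg

# Example: a batch of 64 with an increasing share of deep (high-value) experiences.
b_pos, b_neg = deep_sample_sizes(step=5000, max_steps=10000,
                                 b_pos0=4, b_neg0=4, d_pos=12, d_neg=12,
                                 b_pos_max=16, b_neg_max=16)
b_shallow = 64 - b_pos - b_neg   # remainder drawn from the shallow pool by priority
```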
4.3.2. N-Step Learning
In the one-step TD method, the TD target value uses only a single time step to estimate the cumulative discounted reward. However, when stochastic elements or noise are present in the environment, the TD target value may become biased. Moreover, in tasks with long sequences, the one-step TD method is less effective at incorporating recent experiences into the TD target value. To address these issues, the n-step TD method is adopted, incorporating the cumulative rewards of the subsequent N steps. The n-step TD method enables rewards to propagate more rapidly toward earlier decisions, thereby enhancing learning efficiency. The n-step TD target value can be expressed as follows:
y_t^(N) = Σ_{i=0}^{N−1} γ^i·r_{t+i} + γ^N·Q(s_{t+N}, argmax_a Q(s_{t+N}, a; θ); θ⁻).
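A PyTorch sketch of this n-step target with a double-DQN bootstrap at step N; it assumes the replay buffer already stores, for each sampled transition, the N per-step rewards and the state reached after N steps.

```python
import torch

@torch.no_grad()
def n_step_target(rewards, final_state, done, gamma, eval_net, target_net):
    """N-step TD target: summed discounted rewards plus a double-DQN bootstrap.
    rewards holds r_t ... r_{t+N-1} for each transition (shape: batch x N)."""
    batch, n = rewards.shape
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype, device=rewards.device)
    n_step_return = (rewards * discounts).sum(dim=1)                 # sum_i gamma^i r_{t+i}
    best_action = eval_net(final_state).argmax(dim=1, keepdim=True)  # action from theta
    bootstrap = target_net(final_state).gather(1, best_action).squeeze(1)
    return n_step_return + (gamma ** n) * bootstrap * (1.0 - done)
```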
After applying the noisy network, n-step learning, the dueling network, and DDQN, the loss function used for training can be expressed as follows:
L(θ) = E_D[ ( Σ_{i=0}^{N−1} γ^i·r_{t+i} + γ^N·Q(s_{t+N}, argmax_a Q(s_{t+N}, a, ε′; θ), ε″; θ⁻) − Q(s_t, a_t, ε; θ) )² ],
where D denotes the experience replay buffer; ε, ε′, and ε″ represent noise variable samples at different stages; θ represents the parameters of the evaluation network; and θ⁻ denotes the parameters of the target network.
4.3.3. Variable Learning Rate, Soft Update, and Gradient Clipping
In addition to the improvements mentioned above, several further methods are incorporated in this paper to improve network training performance. Firstly, given the varying demands on the learning rate at different stages of training, this paper utilizes a variable learning rate strategy to enhance convergence speed and stability:
where η_0 denotes the initial learning rate, and Δη represents the total learning rate reduction. As training progresses, the learning rate gradually decreases to η_0 − Δη.
Secondly, this paper introduces an improved network-update strategy [36], transitioning from the traditional hard update method, in which the evaluation network's parameters are copied directly to the target network every K steps, to a soft update method, in which a portion of the evaluation network's parameters is continuously transferred to the target network at each training step. This modification enhances training stability. The parameter update of the target network in the soft update method is given by the following:
θ⁻ ← τ·θ + (1 − τ)·θ⁻,
where τ is the update ratio of the network.
Thirdly, to prevent gradient explosion, gradient clipping is employed during the optimization of the neural network's loss via gradient descent. Specifically, when the ℓ2-norm of the gradient exceeds a predefined threshold, the gradient clipping method proportionally scales the gradient, constraining its norm to the specified threshold.
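The three training techniques can be combined in a single optimization step as sketched below; the linear decay shape of the learning rate, the clipping threshold, and the numerical values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

def training_step(eval_net, target_net, optimizer, loss, step, max_steps,
                  lr0=1e-2, lr_drop=9e-3, tau=0.01, max_grad_norm=10.0):
    """One optimization step with a decayed learning rate, gradient clipping,
    and a soft update of the target network."""
    # Variable learning rate: decay from lr0 toward lr0 - lr_drop.
    lr = lr0 - lr_drop * min(step / max_steps, 1.0)
    for group in optimizer.param_groups:
        group["lr"] = lr

    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(eval_net.parameters(), max_grad_norm)  # l2-norm clipping
    optimizer.step()

    # Soft update: theta_target <- tau * theta + (1 - tau) * theta_target.
    with torch.no_grad():
        for p_t, p_e in zip(target_net.parameters(), eval_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p_e)
```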
4.4. State Space
In RL, a state represents all the environmental information accessible to the agent at a specific moment, while the state space denotes the complete set of all possible states. In complex jamming environments, designing an appropriate state space is crucial. In this paper, a state space based on the parameters necessary for signal-quality assessment is proposed, capturing signal characteristics from multiple perspectives. The state space defined in this study can be expressed as follows:
s_t = {j_t, n_f, d_t},
where the parameter j_t denotes the type of jamming in the current state, n_f represents the number of jammed frequency points in one coherent processing interval (CPI), and d_t indicates the target identification result obtained after CA-CFAR processing.
As shown in Figure 5, this study selects ten game iterations as a complete game process. During the game between the phased-array radar and the jammer, the radar first transmits a detection signal into the environment. After receiving the radar signal, the jammer transmits a jamming signal back toward the radar. The radar then senses the received signal to obtain the state s_t and subsequently adopts an anti-jamming action a_t based on the current policy π. This forces the jammer to change its jamming type, resulting in a transition to the next state s_{t+1}.
4.5. Action Space
In this paper, the action space is defined as a discrete set. Since we focus on optimizing anti-jamming strategies over a fixed time interval, the anti-jamming actions must be structured in a "packaged" form. Considering the diverse jamming types, this study incorporates frequency agility and cover pulses as active anti-jamming methods, along with the SLC and SLB as passive anti-jamming methods. Accordingly, the action space is formulated as follows:
where n_c represents either the number of frequency points used in cover pulses or the number of frequency-agility points within one CPI; when n_c equals 1, the action corresponds to a conventional LFM signal. The parameters b_SLC and b_SLB indicate the activation status of SLC and SLB, respectively.
Specifically, SLB exhibits superior target recovery performance against sidelobe spot jamming, whereas SLC is more effective against coherent slice forwarding jamming. Regarding mainlobe deceptive jamming, only active anti-jamming methods can partially mitigate interference effectively. Additionally, pulse compression is utilized as a fixed action for target detection throughout the anti-jamming decision-making optimization process.
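One possible "packaged" encoding of this discrete action space is enumerated below; the set of frequency-point counts and the flat dictionary encoding are assumptions for illustration, not the paper's exact action list.

```python
from itertools import product

# Each packaged action bundles the number of carrier frequencies used by the
# active waveform (1 = conventional LFM) with on/off flags for SLC and SLB.
FREQ_POINTS = [1, 2, 4, 8]        # cover-pulse / frequency-agility points per CPI (assumed)
ACTIONS = [
    {"num_freq": n_c, "slc": slc, "slb": slb}
    for n_c, slc, slb in product(FREQ_POINTS, [0, 1], [0, 1])
]
print(len(ACTIONS))               # size of the discrete action space
```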
4.6. Reward Function
In RL, the reward function guides the agent in learning the values of different actions in given states [37]. In radar anti-jamming strategy optimization, a suitably designed reward function directly influences the training efficiency and the effectiveness of the final policy. An optimal reward function should incentivize the radar system to learn the best anti-jamming strategies, ensuring effective target detection even in complex jamming environments, without misleading the learning process. After reviewing the reward function designs in relevant papers [38,39,40], a reward function combining the signal level and the data level is designed in this paper. At the signal level, the reward is based on the target detection results within a CPI, defined as follows:
R_sig = Σ_{n=1}^{I} r_n,
where I denotes the number of pulses within one CPI, and r_n is calculated as follows: within a CPI, r_n = +1 when the n-th pulse successfully detects the target via CA-CFAR processing, and r_n takes a lower value otherwise. Target detection is determined by the radar's estimation of the current environmental situation.
At the data level, the reward function is designed based on the active anti-jamming gain, the passive anti-jamming gain, the jamming probabilities in the frequency and space domains, and the number of jammed frequency points:
where the active anti-jamming gain quantifies the effectiveness of active anti-jamming measures. The probability of frequency-domain jamming is derived from the number of jammed frequency points within one CPI. The probability of space-domain jamming is defined in terms of R_max and R_j, with R_max representing the radar's maximum detection range and R_j denoting the estimated distance to the jammer. The passive anti-jamming gain is defined in terms of the SNR loss resulting from the anti-jamming measures.
Considering these signal-level and data-level rewards, the final reward function can be expressed as follows:
In the above equations, w_1, w_2, w_3, w_4, w_5, and w_6 represent the weighting factors used to constrain the rewards to similar scales. Through RL, the agent progressively discovers the optimal anti-jamming strategy that maximizes the average reward. For comparative purposes, the same reward function is applied to the different RL algorithms in the simulation experiments. In summary, the pseudocode for the phased-array radar anti-jamming decision-making algorithm based on MAI-DQN is outlined in Algorithm 1.
Algorithm 1 Training procedure of phased-array radar anti-jamming decision-making
Initialize: the state dimension; the size of the action space; the hyperparameters of the MAI-DQN; the dueling noisy double evaluation network and the target network with parameters θ and θ⁻; the priorities in the experience pools; the sizes of the shallow, positive deep, and negative deep experience pools; the other VDDPER parameters, including B_p, B_n, and the reward threshold r_th; and the settings of the phased-array radar and the jammer.
1: for episode = 1 → M do
2:   if a new game process begins then
3:     Generate the target echo, random jamming, and noise in the environment, and initialize them as state s_t.
4:   end if
5:   Sample the noise ε of the noisy network.
6:   Choose action a_t according to the noisy evaluation network.
7:   Observe the environment again and obtain r_t as well as s_{t+1}.
8:   Store the experience (s_t, a_t, r_t, s_{t+1}) in the shallow experience pool.
9:   Store the experience in the positive or negative deep experience pool, depending on the reward threshold r_th, and update r_th if required.
10:  if episode exceeds the training start threshold then
11:    Sample experiences from the shallow experience pool according to the sampling probabilities P(i).
12:    Sample experiences from the positive deep experience pool uniformly.
13:    Sample experiences from the negative deep experience pool uniformly.
14:    Integrate the sampled experiences into a mini-batch.
15:    Update the learning rate, B_p, and B_n.
16:    Update the TD target value through the n-step TD method.
17:    Train the evaluation network through backpropagation.
18:    Update the priorities of the sampled experiences using the TD errors.
19:    Update the sample sizes drawn from the shallow, positive deep, and negative deep experience pools according to B_p, as well as B_n.
20:    Train the evaluation network using the loss function and gradient clipping.
21:    Update the target network using the soft update method.
22:  end if
23:  if the current game process ends then
24:    Reset the game process.
25:  else
26:    Proceed to the next game iteration.
27:    Update the type of jamming.
28:  end if
29: end for
5. Experiments
This section presents the simulation results of the phased-array radar anti-jamming decision-making model based on the MAI-DQN. The simulation outcomes are analyzed and discussed to illustrate and validate the effectiveness and performance improvements achieved by the proposed algorithm in the decision-making process. Firstly, the comparison result of the loss function verifies the improvement of the variable learning rate on model training. Secondly, the impact of varying the values of the n-step learning parameter, soft update percentage, discount factor, and noisy network parameter on the average reward is analyzed to validate the chosen hyperparameters. Thirdly, compared with the other algorithms, the performance of MAI-DQN is analyzed from three aspects, that is, reward mean, standard deviation, and average decision accuracy. Fourthly, the pulse compression results using MAI-DQN and DDQN are compared. Lastly, an ablation analysis is conducted to compare the complete MAI-DQN with its variants that exclude improvements. The results demonstrate the effectiveness of the proposed MAI-DQN in phased-array radar anti-jamming decision-making.
5.1. Experimental Settings
In the simulation, the radar is capable of employing the following signal transmission modes: conventional LFM signals, frequency agile signals, and cover pulse signals. The jammer can select from a total of nine jamming types, including mainlobe jamming, sidelobe jamming, and combined mainlobe–sidelobe jamming, such as dense false target jamming, sweep jamming, coherent slice forwarding jamming, and spot jamming. Simulation parameters for the radar and jammer settings are presented in Table 1.
Regarding the RL network structure, this paper adopts two linear network layers as hidden layers, each containing 64 units. The activation function employed in the output layer is the rectified linear unit (ReLU) function [41], defined as follows:
ReLU(x) = max(0, x).
As one of the classical neural network activation functions, ReLU introduces nonlinearity into the network, enabling neural networks to effectively learn and approximate complex nonlinear relationships. Compared with Sigmoid and Tanh functions, ReLU maintains a constant gradient of 1 within the positive range, providing better gradient propagation and effectively mitigating the vanishing gradient problem. Moreover, ReLU induces sparsity through neuron deactivation, enhancing the model’s generalization capability and reducing overfitting.
Regarding network training, as previously mentioned, this study utilizes VDDPER, an adaptive learning rate, and soft updating of the target network. The detailed parameter settings for the model are presented in Table 2.
5.2. Experimental Results and Discussion
To verify the advantages of using an adaptive learning rate compared to a fixed learning rate, this paper compares their respective loss functions while keeping the other parameters of the MAI-DQN model unchanged. As illustrated in Figure 6, the training convergence achieved by gradually decaying the learning rate η from 0.01 to 0.001 shows significant improvements over the fixed learning rates η = 0.01 and η = 0.001. Specifically, when the fixed learning rate η = 0.001 is employed, the model's convergence speed noticeably slows down, requiring approximately 6000 steps to converge. Conversely, with the fixed learning rate η = 0.01, although the model initially converges more rapidly, it experiences noticeable oscillations starting at around 1000 training steps.
To verify the appropriateness of the hyperparameter settings in this paper, as shown in Figure 7, we compare the impact of varying the values of the n-step learning parameter, the soft update percentage, the discount factor, and the noisy network parameter on the average reward. The results indicate that when the values listed in Table 2 are used, the model achieves an optimal average reward.
In terms of the algorithm comparison, the proposed MAI-DQN is compared with four other algorithms: Nature DQN, DDQN, the DDPER-VGA-DDQN algorithm proposed by Professor Xiao in [31], and MAI-DQN using DDPER instead of VDDPER (MAI-DQN-DDPER). To ensure a fair comparison, all five algorithms share the same neural network parameters, except for the improvements introduced in this paper.
Figure 8a shows the comparison of average rewards during the training process for Nature DQN, DDQN, DDPER-VGA-DDQN, MAI-DQN-DDPER, and the proposed MAI-DQN algorithm.
Firstly, regarding convergence speed, MAI-DQN achieves substantial improvements, reaching convergence at around step 600, whereas DDQN and Nature DQN converge significantly more slowly. Meanwhile, MAI-DQN-DDPER and DDPER-VGA-DDQN both converge at around step 2200. This notable difference arises because the VDDPER method dynamically adjusts the sampling ratio of experiences with different values during training, thereby enhancing exploratory depth in the initial training stage and significantly accelerating learning.
Secondly, in terms of reward quality and stability, this paper utilizes the mean and standard deviation of rewards to quantify performance, as summarized in Table 3. Compared with the other four algorithms, MAI-DQN achieves not only rapid convergence but also a higher average reward due to its increased focus on high-value experiences in the later training stages. Additionally, the multifaceted improvements to the network architecture, particularly the integration of the noisy network structure, result in significantly higher reward stability for MAI-DQN compared to DDQN and DQN.
As shown in Figure 8b, this paper computes the average decision-making accuracy from the beginning to the end of training to evaluate the decision accuracy of the five algorithms. The results indicate that the MAI-DQN algorithm significantly outperforms MAI-DQN-DDPER, DDPER-VGA-DDQN, DDQN, and Nature DQN in terms of average decision accuracy. Furthermore, a notable performance gap appears between Nature DQN and the other four algorithms, attributable to Nature DQN's tendency to overestimate Q-values in complex jamming environments, causing the model to converge to local optima. It should be noted that none of the algorithms reach 100% accuracy post-convergence, as the results include data collected during the initial exploration stage.
After completing the anti-jamming decision-making optimization, this paper compares the pulse compression results obtained using the anti-jamming strategy generated by MAI-DQN with those obtained using the anti-jamming strategy generated by DDQN, thereby validating the effectiveness of the proposed algorithm. To enhance the SNR, the following results are obtained by performing incoherent accumulation of the pulse compression outputs within one CPI.
As shown in Figure 9, the target signal is masked by the jamming signal in the pulse compression output of the jammed phased-array radar system, resulting in a failure to detect the target. As shown in Figure 10, in the pulse compression result processed using the anti-jamming strategy generated by MAI-DQN, the target signal is successfully recovered. It is worth noting that although some jamming residue remains in the result, the target can still be detected by CFAR in most cases. However, as shown in Figure 11, the pulse compression results processed using the anti-jamming strategy generated by DDQN are affected to varying degrees. Specifically, in Figure 11a,c–f, the target signals fail to be recovered to a level detectable by CFAR.
As shown in Figure 12, under 100 Monte Carlo simulations, this paper compares the performance differences among various jamming countermeasures. In the presence of sidelobe jamming, or when sidelobe suppression jamming coexists with mainlobe dense false target jamming, the target detection rate achieved after applying the anti-jamming strategies generated by MAI-DQN shows good performance at JNR levels of 30 dB and below. However, when the JNR exceeds 30 dB, the detection rates decline significantly under all countermeasure settings.
Lastly, as shown in Table 4, an ablation study was conducted to compare the complete MAI-DQN with variants that exclude the improvements in network architecture, exploration strategy, and training methods, respectively. The comparison focuses on the mean value and standard deviation of the reward and the average target detection rate after convergence, in order to validate the contribution of each component to the overall model performance. The average target detection rate is obtained from the results of 100 Monte Carlo simulations under all jamming scenarios.
In this paper, the computational hardware includes an Intel(R) Core(TM) i9-12900H CPU @ 2.50 GHz, 32 GB of RAM (Intel, Santa Clara, CA, USA), and an NVIDIA GeForce RTX 3070 Ti Laptop GPU (Nvidia, Santa Clara, CA, USA). The programming environment is Python 3.7, and the framework used is Torch 1.13.1.
Table 5 summarizes the average computational time per decision for the MAI-DQN, DDPER-VGA-DDQN, DDQN, and Nature DQN algorithms during the decision-making process.
Compared to the DDPER-VGA-DDQN algorithm, MAI-DQN employs a more sophisticated network architecture, resulting in a slightly longer average decision-making time. In contrast, Nature DQN, owing to its simpler network structure, achieves the shortest decision-making time among the algorithms. Additionally, because phased-array radar systems typically process larger datasets than traditional radars, and because each decision-making instance in this paper spans one CPI, the single-decision time observed here is relatively long. However, in realistic combat scenarios, anti-jamming strategies are usually generated in advance; thus, the increased decision-making time resulting from the adoption of MAI-DQN is generally acceptable.