Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network

Hao, Wanbing; Ke, Wentao; Feng, Xiaoyi; Xia, Zhaoqiang

doi:10.3390/app15168898

Open AccessArticle

Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network

¹

School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China

²

Xi’an Electronic Engineering Research Institute, Xi’an 710100, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(16), 8898; https://doi.org/10.3390/app15168898

Submission received: 31 May 2025 / Revised: 3 August 2025 / Accepted: 10 August 2025 / Published: 12 August 2025

(This article belongs to the Section Applied Physics General)

Download

Browse Figures

Versions Notes

Abstract

Radar jamming resource allocation is crucial for maximizing jamming effectiveness and ensuring operational superiority in complex electromagnetic environments. However, the existing approaches still sufferfrom inefficiency, instability, and suboptimal global solutions. To address these issues, this work proposes addressing effective jamming resource allocation in dynamic radar countermeasures with multiple jamming types. A deep reinforcement learning framework is designed to jointly optimize transceiver strategies for adaptive jamming under state-switching scenarios. In this framework, a hybrid policy network is proposed to coordinate beam selection and power allocation, while a dynamic fusion metric is integrated to evaluate jamming effectiveness. Then the non-convex optimization is resolved via a proximal policy optimization version 2 (PPO2)-driven iterative algorithm. Experiments demonstrate that the proposed method achieves superior convergence speed and reduced power consumption compared to baseline methods, ensuring robust jamming performance against eavesdroppers under stringent resource constraints.

Keywords:

dynamic radar countermeasures; adaptive jamming; resource allocation; hybrid policy network

1. Introduction

Radar jamming is an electronic countermeasure where a transmitter emits signals to disrupt a radar’s ability of detecting and tracking targets. By suppressing or deceiving the radar receiver, the jammer degrades situational awareness and protects assets in modern warfare. Radar jamming resource allocation is crucial for maximizing jamming effectiveness and ensuring operational superiority in complex electromagnetic environments, which has widely adopted heuristic algorithms due to their scalability and adaptability to large-scale optimization [1]. Previous studies optimized jamming allocation using detection probability models and cooperative strategies [2,3,4] or combinatorial algorithms [5,6], while radar-side resource allocation employed modified PSO [7]. Distinctly, this work [8] introduces a game-theoretic framework, formulating the radar–jammer interaction via a novel utility function and the CPAG algorithm to find the strategic Nash Equilibrium—moving beyond single-sided optimization. These optimization-based approaches primarily employ frame-by-frame optimization, which suffers from inefficiency, instability, and suboptimal global solutions. As deep reinforcement learning (DRL) offers dynamic resource allocation through agent–environment interactions, some approaches [9,10,11] have begun to employ DRL algorithms for jamming policy optimization in radar countermeasure scenarios. Among them, to quickly find a jamming policy for multifunctional radar, an advantage actor–critic (A2C)-driven jamming policy generation scheme [12] was introduced, utilizing heuristic reward functions and Markov decision process-based interactive learning. In [13], to address the joint optimization issue of jamming task and power allocation within the netted radars anti-jamming fusion game, a hierarchical reinforcement learning-based jamming resource allocation scheme was established, neglecting the dynamic action switching of the jammer based on radar operating state. These DRL-based approaches could optimally adjust the jamming policy to the limited states of radars.

However, these existing approaches fail to comprehensively model the intricate game-theoretic interplay between jammers and netted radars, particularly in mixed discrete-continuous action spaces of jammers. It does not address adaptive jamming policy optimization under dynamic radar state-switching constraints using DRL-based joint transceiver design. The efficient collaborative exploration is absent in heterogeneous action spaces. Meanwhile, the difficulty in adapting to non-stationary environmental disturbances caused by real-time radar state switching is increased. Consequently, the policy network in DRL framework needs to be adaptive to the dynamic variation of jammers, which has not been explored sufficiently.

In this work, to address the challenge of previous works, we design a multi-stage game optimization framework with a hybrid policy network (HPN)-based DRL. The game process between the jammer agent and the netted radars is shown in Figure 1. In the framework, jammers are modeled as intelligent agents with rigorously defined state transition rules under composite jamming environments. For various jamming types, the framework aims to maximize penetration distance, ensuring the successful accomplishment of the mission, while minimizing jamming power consumption. The resulting non-convex optimization problem is then addressed via a PPO2-driven [14] iterative algorithm, enabling adaptive policy generation under dynamic radar state-switching constraints. The main contributions are summarized as follows.

A hybrid policy network is designed to simultaneously generate coordinated beam selection and power allocation actions across the decomposed action space, enabling the jammer to efficiently handle mixed actions in complex confrontation scenarios.
A dynamic weighted fusion metric is introduced to comprehensively assess the jamming effectiveness, with weights dynamically adjusted based on the radar’s various operational stages.
The PPO2 algorithm is employed to train the reinforcement network, preventing policy collapse due to improper power allocation through importance sampling ratio clipping, and balancing task success and energy efficiency.

In Section 2, we introduce the the jamming model for the netted radars, while the proposed method based on DRL is described in Section 3. We present the results of our experiments in Section 4 and summarize the features of our method in Section 5. In Section 6, we conclude the work.

2. Jamming Model of Netted Radars

In this context, we aim to address a one-to-many electronic warfare scenario where an adaptive jammer, escorting a penetration target, allocates jamming resources against an N-node netted radar system. The jammer employs multi-beam coverage to engage all radars simultaneously, optimizing type-power coordination under full radar-state observability.

2.1. Active Jamming Type

Research on active jamming techniques for netted radar systems must be systematically designed based on their netted characteristics. Netted radars enhance detection capabilities through multi-node collaboration, yet significant differences in parameters such as frequency band configuration, signal structure, and deployment locations among nodes render a single jamming strategy ineffective in synchronously suppressing all network nodes. Radar jamming can be categorized by energy source into two primary types as follows: passive jamming and active jamming [15,16]. Passive jamming, which employs media such as chaff to reflect or absorb radar waves, exhibits strong concealment but suffers from poor controllability. Active jamming, which actively transmits electromagnetic signals to disrupt radar operations, has become the preferred solution in contemporary netted radars confrontation scenarios due to its high adaptability and precise control advantages.

The engineering implementation of active jamming encompasses two distinct paradigms—suppression jamming and deception jamming—whose operational principles and application scenarios exhibit complementary relationships. Suppression jamming transmits high-power noise signals to saturate the target frequency band, significantly degrading the signal-to-noise ratio (SNR) of the radar receiver and thus reducing its target detection probability and operational range. Such jamming directly disrupts radar signal processing systems, particularly demonstrating efficacy against modern radars employing complex signal modulation techniques. Deception jamming replicates radar signal waveforms while introducing artificial error parameters, generating false echoes with characteristics similar to genuine targets to mislead radar perception.

2.2. Radar Echo Signal Model Under Jamming Condition

The composite signal received by radar node n at time k consists of target reflections and jamming components as follows:

s_{n}^{k} (t) = r_{n}^{k} (t) + j_{n}^{k} (t)

(1)

r_{n}^{k} (t) = \sqrt{α_{n}^{k} P_{n}^{k} h_{n}^{k}} \cdot {\tilde{s}}_{n}^{k} (t) (t - τ_{n}^{k}) \cdot exp (- j 2 π f_{n}^{k} t) + w_{n}^{k} (t)

(2)

where t is a continuous-time variable,

α_{n}^{k}

represents the transmission path loss,

P_{n}^{k}

denotes the transmitted power of the radar,

h_{n}^{k}

stands for the reflection coefficient of the target,

{\tilde{s}}_{n}^{k} (t)

represents the normalized complex envelope of radar transmitted signal,

τ_{n}^{k}

indicates the transmission delay,

f_{n}^{k}

is the Doppler shift, and

w_{n}^{k} (t)

is the zero-mean white Gaussian noise.

τ_{n}^{k}

and

f_{n}^{k}

can be calculated by the following formulas:

τ_{n}^{k} = \frac{2 R_{n}^{k}}{c}

(3)

f_{n}^{k} = \frac{2 [v_{x}^{k} (x^{k} - x_{n}) + v_{y}^{k} (y^{k} - y_{n}) + v_{z}^{k} (z^{k} - z_{n})]}{λ R_{n}^{k}}

(4)

where c denotes the speed of light,

λ

represents the wavelength of the radar-transmitted signal,

v_{x}^{k}

,

v_{y}^{k}

and

v_{z}^{k}

correspond to the velocity components of the penetrating target,

x^{k}

,

y^{k}

and

z^{k}

denote the target’s 3D Cartesian coordinates at time k,

x_{n}^{k}

,

y_{n}^{k}

and

z_{n}^{k}

denote the radar’s 3D Cartesian coordinates at time k, and

R_{n}^{k}

quantifies the instantaneous range between radar n and the target at the time k as follows:

R_{n}^{k} = \sqrt{{(x^{k} - x_{n})}^{2} + {(y^{k} - y_{n})}^{2} + {(z^{k} - z_{n})}^{2}}

(5)

The radar detects penetrating target by transmitting pulsed signals and receiving target echoes. Assuming identical operational parameters across all radars, the received echo-signal power at radar n from the target is given by the following [3,17]:

P_{r, n}^{k} = \frac{P_{t, n} G_{r}^{2} λ^{2} σ^{k}}{{(4 π)}^{3} {(R_{n}^{k})}^{4}}

(6)

where

P_{t, n}

denotes the transmit power of the radar n,

G_{r}

represents the antenna gain,

σ^{k}

is the radar cross-section (RCS) of the target relative to radar n at time k, and

R_{n}^{k}

quantifies the instantaneous distance between radar n and the target at time k.

The received signal at radar n contains both target echoes and jamming components, and the received jamming power

P_{j, n}^{k}

is as follows [3,17]:

P_{j, n}^{k} = \frac{u_{n}^{k} P_{j, n} G_{j} G_{i} (θ_{n}^{k}) λ_{j}^{2} γ_{j}}{{(4 π)}^{2} {(R_{m, n}^{k})}^{2}}

(7)

where

P_{j, n}

denotes the power of the jamming signal transmitted by the jammer to radar n,

u_{n}^{k} \in {0, 1}

is a binary variable indicating whether radar n is being jammed at discrete time k,

G_{j}

indicates the jammer transmit antenna gain,

γ_{j}

corresponds to polarization mismatch loss,

R_{m, n}^{k}

defines the distance between the jammer and radar n at time k. As shown in Figure 2,

θ_{n}^{k}

represents the angular separation within the target–jammer–radar geometric configuration, and

G_{i} (θ_{n}^{k})

describes the radar antenna gain pattern as a function of

θ_{n}^{k}

, typically modeled by the following [13]:

G_{i} (θ_{n}^{k}) = \{\begin{matrix} G_{r}, & | θ_{n}^{k} | \leq θ_{0.5} / 2 \\ ζ G_{r} \cdot {(\frac{θ_{0.5}}{θ_{n}^{k}})}^{2}, & θ_{0.5} / 2 < | θ_{n}^{k} | \leq 90^{\circ} \\ ζ G_{r} \cdot {(\frac{θ_{0.5}}{90^{\circ}})}^{2}, & 90^{\circ} < | θ_{n}^{k} | \leq 180^{\circ} \end{matrix}

(8)

where

θ_{0.5}

represents the 3-dB beamwidth of the radar antenna, and

ζ

denotes the antenna gain coefficient.

2.3. Jamming Effect Evaluation Index

2.3.1. Detection Probability

When subjected to active suppression jamming, the radar receiver acquires not only target echo signals and inherent system noise but also suppression jamming signals. Considering a radar n detecting the target under such conditions, the SNR is defined as follows:

{SNR}_{n}^{k} = \frac{P_{r, n}^{k}}{P_{j, n}^{k} + P_{noise}}

(9)

where

P_{noise}

represents the receiver’s thermal noise power.

The fundamental principles of radar detection indicate that when the target fluctuation type is known, the detection probability for a specific target can be calculated by combining the SNR [18]. We assume that the target fluctuation follows the Swerling I detection model, and the single-pulse detection probability of radar n for target at time k is expressed as:

P r o b_{d, n}^{k} = exp (- \frac{y_{0}}{1 + {SNR}_{n}^{k}}), n_{p} = 1

(10)

where

y_{0}

represents the detection threshold determined by the false alarm probability requirement,

n_{p} = 1

denotes the number of non-coherent pulse integrations.

In netted radar systems, each radar generates local detection criteria, which will be aggregated at the fusion center via the

K - N

fusion rule [19] to achieve robust decision-making. This framework ensures efficient information fusion while maintaining operational resilience against jamming threats. We assume that each radar n makes a local binary decision

d_{n} \in {0, 1}

, where

d_{n} = 1

indicates target detection and

d_{n} = 0

indicates no detection. The fusion center collects these decisions into a global decision vector

D = [d_{1}, d_{2}, \dots, d_{N}]

. Since each

d_{n}

can take one of two values, there are

2^{N}

possible distinct combinations for D. The fusion rule

R (D)

for the netted radar system is then defined as follows: A target is declared detected (

R (D) = 1

) if at least K radars confirm its presence; otherwise, it is declared undetected (

R (D) = 0

).

R (D) = \{\begin{matrix} 1, & if \sum_{n = 1}^{N} d_{n} \geq K \\ 0, & otherwise \end{matrix}

(11)

According to the

K - N

fusion rule, the joint detection probability

P_{d}

of the netted radar system is defined as follows:

P_{d} = \sum_{D} R (D) \prod_{d_{n} \in S_{1}} P_{d}^{n} \prod_{d_{n} \in S_{0}} (1 - P_{d}^{n})

(12)

where

S_{1}

is the set of radars that have detected a target,

S_{0}

is the set of radars that have not detected a target, and

P_{d}^{n}

is the detection probability of the individual radar.

2.3.2. Positioning Accuracy

Deception jamming operates through a fundamentally distinct mechanism. Unlike the suppression technique, it exploits radar signal triggering to autonomously generate modulated pulses synchronized with the target radar’s frequency. These pulses embed falsified target parameters into the victim radar’s processing chain, inducing systematic biases in its detection outputs.

Under deception jamming and noise environments, the range measurement error

Σ_{r}

, azimuth angle error

Σ_{φ}

, and elevation angle error

Σ_{θ}

of radar systems follow Gaussian distributions with variances given by the following [20]:

\{\begin{matrix} Σ_{r} & = \frac{c τ}{k_{1} \sqrt{\frac{f_{r} \cdot SNR}{β_{n}}}} \\ Σ_{φ} & = \frac{1.4 θ_{0.5}}{k_{2} \sqrt{\frac{B τ f_{r} \cdot SNR}{β_{n}}}} \\ Σ_{θ} & = \frac{1.4 θ_{0.5}}{k_{2} \sqrt{\frac{B τ f_{r} \cdot SNR}{β_{n}}}} \end{matrix}

(13)

where

α

and

β

denote the elevation angle and azimuth angle of the target relative to the radar, respectively. c is the speed of light,

τ

denotes the pulse width,

f_{r}

represents the pulse repetition frequency,

β_{n}

is the antenna servo bandwidth, B indicates the radar receiver bandwidth, and

k_{1}

,

k_{2}

are system-dependent calibration constants.

The resulting degradation in positioning accuracy can be quantified using the geometric dilution of precision (GDOP), which is defined as follows:

\begin{matrix} [\begin{matrix} Σ_{X}^{2} \\ Σ_{Y}^{2} \\ Σ_{Z}^{2} \end{matrix}] & = [\begin{matrix} {(cos α cos β)}^{2} & {(r cos α sin β)}^{2} & {(r sin α cos β)}^{2} \\ {(cos α sin β)}^{2} & {(r cos α cos β)}^{2} & {(r sin α sin β)}^{2} \\ {(sin α)}^{2} & 0 & {(r cos α)}^{2} \end{matrix}] [\begin{matrix} Σ_{r}^{2} \\ Σ_{θ}^{2} \\ Σ_{φ}^{2} \end{matrix}] \end{matrix}

(14)

\begin{matrix} Q_{g d o p} & = \sqrt{Σ_{X}^{2} + Σ_{Y}^{2} + Σ_{Z}^{2}} \end{matrix}

(15)

where

Q_{g d o p}

is the GDOP of a single radar. By employing the dynamic weighted fusion rule [21] and Equation (15), it can be obtained that the GDOP of the netted radar system

G_{d}

for the target position measurement under jamming is formulated as follows:

G_{d} = \sqrt{\frac{1}{\sum_{n = 1}^{N} \frac{1}{Σ_{x}^{2}}} + \frac{1}{\sum_{n = 1}^{N} \frac{1}{Σ_{y}^{2}}} + \frac{1}{\sum_{n = 1}^{N} \frac{1}{Σ_{z}^{2}}}}

(16)

2.3.3. Weighted Model for Evaluation Indicators

The netted radar system exhibits varying threat priorities across operational phases [22,23]. During long-range target search, the primary objective is detection confirmation, necessitating jamming strategies that suppress radar detection probability. As target proximity increases and detection confidence surpasses a defined threshold, the system transmits to the positioning phase. In this stage, resource allocation shifts toward optimizing geometric positioning accuracy, guiding adaptive countermeasure deployment aligned with real-time mission demands. In view of the differences in the characteristics of netted radars in different operational states, we use the dynamic weighted fusion metric F to comprehensively and accurately evaluate the jamming effectiveness and improve the overall countermeasure ability and task execution effect as follows:

F = [|\frac{P_{d} - {\tilde{P}}_{d}}{P_{d}^{max} - P_{d}^{min}}|, |\frac{G_{d} - {\tilde{G}}_{d}}{G_{d}^{max} - G_{d}^{min}}|] [\begin{matrix} ω_{1} \\ ω_{2} \end{matrix}]

(17)

where

ω_{1}

and

ω_{2}

represent the dynamic weights of the metrics, and the two jamming metrics are dimensionless normalized. In the search state,

ω_{1} = 0.8

and

ω_{2} = 0.2

, while in the tracking state,

ω_{1} = 0.2

and

ω_{2} = 0.8

. The indicators of max, min, and tilde denote the maximum value, minimum value, and the value without any jamming for the detection probability and GDOP of netted radars. The metric F quantifies the enhancement of jamming effectiveness relative to the non-jamming baseline, and the netted radar system can be identified as capable of detecting the jammer under current operational states when F exceeds the predefined threshold range.

3. HPN Based DRL for Jamming

To establish the theoretical foundation for our hybrid policy network, we first formalize the core reinforcement learning paradigms enabling adaptive jamming control, including policy gradient methods and the actor–critic framework. These fundamentals underpin the proximal policy optimization and its extension to hybrid action spaces for coordinated jamming.

3.1. Policy Gradient Methods

DRL aims to maximize cumulative rewards through two primary approaches: value-based methods and policy gradient methods. Value-based methods (e.g., Q-learning, DQN) derive optimal policies indirectly by constructing value functions for states or state–action pairs, demonstrating notable advantages in discrete action spaces. In contrast, policy gradient methods directly optimize policy parameters

θ

through iterative policy evaluation and improvement phases.

Let the parameterized policy

π_{θ} (s)

, a stochastic mapping from states to action probability distributions, interact with the environment to generate trajectories

τ = (s_{0}, a_{0}, r_{0}, \dots, s_{T}, a_{T}, r_{T})

through sequential sampling. The optimization objective of policy gradient methods is to maximize the expected return

E_{τ \sim π_{θ}} [R (τ)]

, where

R (τ)

denotes the discounted cumulative reward over a trajectory of length T. By leveraging the policy gradient theorem, this optimization problem is addressed through parameter updates along the direction

E_{τ \sim π_{θ}} [\sum_{t = 0}^{T} R (τ) \nabla_{θ} log π_{θ} (τ)]

.

\nabla_{θ}

denotes the gradient with respect to policy parameters

θ

, thereby specifying the gradient direction for optimizing

θ

.

Due to the intractability of direct expectation computation, Monte Carlo approximation is typically employed. A set of Q samples of trajectories

{τ^{q}}_{q = 1}^{Q}

are sampled, and the gradient is estimated as follows:

E_{τ \sim π_{θ}} [\sum_{t = 0}^{T} R (τ) \nabla_{θ} log π_{θ} (τ)] \approx \frac{1}{Q} \sum_{q = 1}^{Q} \sum_{t = 0}^{T} R (τ^{q}) \nabla_{θ} log π_{θ} (a_{t}^{q} | s_{t}^{q})

(18)

Policy optimization increases the selection probability of actions associated with positively rewarded trajectories

τ^{q}

and decreases it otherwise.

Traditional policy gradient methods require repeated trajectory sampling for updates, which severely limits their data efficiency. Importance sampling techniques address this by reusing historical data from policy

π_{θ^{'}}

, constructing importance weight ratios for gradient correction as follows:

\nabla_{θ} \bar{R} (θ) = E_{τ \sim π_{θ^{'}} (τ)} [\frac{π_{θ} (τ)}{π_{θ^{'}} (τ)} \sum_{t = 0}^{T} R (τ) \nabla log π_{θ} (τ)]

(19)

Policy gradient methods demonstrate significant advantages, particularly in their capacity to effectively address high-dimensional continuous action spaces, circumvent the need for explicit environment dynamics modeling, and achieve robustness to environmental noise via direct policy optimization.

3.2. Advantage Actor-Critic

Traditional policy gradient methods assign uniform reward weighting across entire trajectories, resulting in high variance. The actor–critic framework mitigates this by introducing a state-value function

V_{π} (s)

as a baseline, with the advantage function

A d v_{π} (s_{t}, a_{t}) = Q_{π} (s_{t}, a_{t}) - V_{π} (s_{t})

quantifying the relative merit of action

a_{t}

compared to the policy average. The improved gradient estimate becomes as follows:

\nabla_{θ} J (θ) = E_{τ \sim π_{θ}} [\sum_{t = 0}^{T} A d v_{π} (s_{t}, a_{t}) \nabla_{θ} log π_{θ} (a_{t} | s_{t})]

(20)

Practical implementation must address distribution shift between old

θ_{o l d}

and new

θ_{n e w}

policies. Gradient updates incorporate the following importance ratio:

\nabla_{θ} J (θ) = E_{τ \sim π_{θ_{old}}} [\sum_{t = 0}^{T} \frac{π_{θ_{new}} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})} A d v_{θ_{old}} (s_{t}, a_{t}) \nabla_{θ} log π_{θ_{new}} (a_{t} | s_{t})]

(21)

3.3. Proximal Policy Optimization

To stabilize policy updates, Proximal Policy Optimization (PPO) constrains the Kullback–Leibler (KL) divergence between successive policies [14]. The following two principal variants exist:

(1) PPO-Penalty: Augments the objective with KL divergence penalty

J^{C L I P} (θ) = E_{t} [min (ρ_{t} (θ) A d v_{t}, clip (ρ_{t} (θ), 1 - ϵ, 1 + ϵ) A d v_{t})] - β D_{K L} (π_{θ_{old}} ∥ π_{θ_{new}})

(22)

where

ρ_{t} (θ) = π_{θ_{new}} (a_{t} | s_{t}) / π_{θ_{old}} (a_{t} | s_{t})

denotes importance ratio and

β

is an adaptive coefficient.

(2) PPO-Clip (PPO2): Implicitly constrains policy updates via a clipping mechanism

J^{C L I P} (θ) = E_{t} [min (ρ_{t} (θ) A d v_{t}, clip (ρ_{t} (θ), 1 - ϵ, 1 + ϵ) A d v_{t})]

(23)

where

ϵ

controls maximum update deviation. By eliminating KL divergence computation, this variant achieves superior computational efficiency, establishing itself as the dominant approach in deep reinforcement learning.

3.4. Jammer Agent Model

In the complex confrontation between the jammer and netted radars, a hybrid DRL based resource allocation network is designed to generate jamming policies, selecting jamming types and allocating power. The jammer applies jamming to the chosen radar, collects feedback, and updates its state information. Key data such as states, actions, and rewards are stored in the experience pool, allowing the hybrid policy network (HPN) to learn from these samples and optimize its parameters, improving decision-making and jamming effectiveness.

3.4.1. States of Jammer Agent

The distance between the jammer and radar nodes is a key factor in resource allocation and serves as the jammer agent’s raw observation data. During the penetration process, the radar switches dynamically between search, track, and guidance states [24]. The operational state transition mechanism of the multi-functional netted radar system is modeled through a state machine framework, as illustrated in Figure 3. The system state variable

w_{r} \in {0, 1, 2}

quantifies the operational state for netted radar system, corresponding to search state (

w_{r} = 0

), track state (

w_{r} = 1

), and guidance state (

w_{r} = 2

). The search state serves as the initial state, while the guidance state functions as the terminal state, with the operational mission terminating if either the target successfully executes penetration (i.e., exceeds the maximum penetration distance) or the netted radar system transitions into guidance state prior to the target exceeding the maximum penetration distance, and the state transition protocol is defined as follows:

(1) Search-to-Track Transition Criterion: The netted radars transition from search state (

w_{r} = 0

) to track state (

w_{r} = 1

) when three successful target acquisition confirmations are achieved within four consecutive detection cycles. Failure to meet this empirical threshold maintains the system in its current search operational state.

(2) Track State Management Criterion: The tracking state executes adaptive state evaluation through:

State Promotion Condition: Activation of transition to the guidance state ( $w_{r} = 2$ ) occurs when at least two valid target locks are confirmed across three sequential detection intervals.
State Regression Condition: Return to search state ( $w_{r} = 0$ ) is triggered if zero valid target detections are recorded during three consecutive detection windows.
State Retention Policy: Detection outcomes sustain the track state if just one successful confirmation occurs in three consecutive detections.

This state representation framework establishes quantitative mapping between observational data and state transition probabilities, enabling dynamic resource allocation optimization in contested electromagnetic environments. The threshold parameters incorporate both statistical characteristics of target detection processes and weapon system reaction latency considerations, while maintaining strict adherence to the original operational specifications without introducing extraneous interpretations.

The raw observations are normalized and combined with the operational state of the netted radar system to construct the input state of the jammer agent as follows:

S_{t} = Con (r_{1}, r_{2}, \dots, r_{i}, \dots, w_{r})

(24)

where

r_{i}

represents the distance between the jammer and the i-th radar, and the greater the value of the state variable

w_{r}

corresponding to the operational state of the netted radar system, the higher the associated threat level.

3.4.2. Action of the Jammer Agent

Jamming actions are divided into continuous jamming power action

A_{p}

and discrete jamming type action

A_{m}

, which are defined as follows:

\begin{matrix} A_{p} & = (p_{1}, p_{2}, \dots, p_{i}, \dots) \end{matrix}

(25)

\begin{matrix} A_{m} & = [m_{1}, m_{2}, \dots, m_{i}, \dots], m_{i} \in {0, 1} \end{matrix}

(26)

where

p_{i}

and

m_{i}

are the jamming power and jamming type for i-th radar, respectively.

The jammer agent faces two decision tasks: selecting discrete jamming beams and allocating continuous power. We present a new HPN, where a discrete Actor network in DRL framework determines beam selection, and a continuous Actor network based on Gaussian distribution handles power allocation. This design enables efficient and coordinated handling of both discrete and continuous actions, enhancing the jammer’s performance in complex confrontation scenarios.

3.4.3. Rewards for Jammer Agent

Reasonable reward value setting can accelerate the agent’s learning and convergence in the interaction with the environment. Therefore, we set rewards from radar state transition reward

R_{1}

and jamming power reward

R_{2}

as follows:

\begin{matrix} R_{1} = \{\begin{matrix} + 0.1 & w_{r} decreases \\ - 0.1 & w_{r} increases \\ + 0.01 & w_{r} unchanged \\ - 1 & mission abort \\ + 1 & penetration completed \end{matrix} \end{matrix}

(27)

\begin{matrix} R_{2} = \frac{1}{1 + P_{s u m}} \cdot e^{- α \frac{t}{T}} \end{matrix}

(28)

where

α

is the attenuation coefficient that controls the rate of exponential decay, T represents the total frame of the task, and

P_{s u m} = \sum_{i = 1}^{N} p_{i}

denotes the cumulative jamming power allocated to all jammed radars. Thus, the per-step total reward is obtained as

r_{t} = R_{1} + R_{2}

.

In typical electronic countermeasure scenarios, combat processes exhibit distinct phased characteristics as follows: the initial search phase prioritizes suppression of target detection probability, while the subsequent tracking phase emphasizes positioning accuracy assurance. To address this, we established a state transition model based on operating states of the netted radar system, enabling multi-scale quantitative evaluation of jamming effects. The proposed dual-dimensional reward system integrates radar state evolution information with power consumption constraints, constructing an environmentally aware dynamic value assessment framework. The core innovation lies in introducing phase-aware weight allocation mechanisms and exponentially decaying power penalty terms, with design principles stemming from the following two engineering considerations:

Mitigation of the Sparse Reward Problem: Traditional reward mechanisms often suffer from sparsity in complex confrontation scenarios, leading to unstable policy updates. We designed the following three-tier reward scheme: a phased reward upon radar state escalation, a penalty for state degradation due to electronic countermeasures, and terminal rewards for task success or failure. This event-driven reward injection increases effective reward density during training while reducing policy gradient estimation variance.
The Balance Between Energy Efficiency and Task Performance Efficacy: $P_{sum}$ enforces hard energy constraints via reciprocal power accumulation, while the exponential term $e^{- α t / T}$ introduces temporal decay. This forces agents to prioritize jamming intensity during early mission phases while shifting to refined power allocation near mission deadlines, as the marginal benefit of power consumption decays exponentially over time.

3.4.4. HPN Optimization

The DRL framework with HPN shown in Figure 4 is optimized according to the PPO2 algorithm [14] described in Algorithm 1. The core objective of the DRL framework with HPN jamming resource allocation is to enable intelligent agents to simultaneously execute two heterogeneous action types in complex electromagnetic environments as follows: discrete jamming pattern selection and continuous beam power allocation. The traditional PPO2 algorithm inherently supports only single-action spaces (either discrete or continuous), making them unsuitable for such hybrid action scenarios. To address this, the framework extends PPO2 through a dual-branch HPN. The discrete branch employs a Softmax activation function to generate categorical distributions for jamming beam type selection, while the continuous branch utilizes Tanh activation to output mean and variance parameters for Gaussian-distributed power allocation sampling. Two parallel branches share a unified state encoder, followed by an intermediate fully connected network with a 128-unit hidden layer employing the ReLU activation function, ensuring feature extraction consistency when decoupling action spaces.

Algorithm 1 Optimized jamming strategy allocation-based HPN-enhanced PPO2

Require: Maximum episodes M, discount factor

γ

, clip threshold

ϵ

, learning rates

α_{actor}, α_{critic}

Ensure: Optimized policy network

π_{θ}

and value network

V_{ϕ}

1:: Initialize:
2:: Policy network $π_{θ}$ (with old policy $π_{θ_{old}}$ and new policy $π_{θ_{new}}$ )
3:: Value network $V_{ϕ}$ , replay buffer $B$
4:: for episode $= 1$ to M do
5:: Reset environment, obtain initial state $s_{0}$
6:: for timestep $t = 0$ to $T - 1$ do
7:: Generate hybrid actions $A_{p}$ and $A_{m}$ from $π_{θ_{old}}$
8:: Execute $a_{t}$ , observe reward $r_{t}$ and next state $s_{t + 1}$
9:: Store transition $(s_{t}, A_{t}, r_{t}, s_{t + 1})$ in $B$
10:: end for
11:: Compute discounted returns $R_{t}$ and advantages $A d v_{t}$
12:: for iteration $k = 1$ to $K_{epoch}$ do
13:: Sample batch ${(s_{i}, A_{i}, R_{i}, A d v_{i})}$ from $B$
14:: Critic Update:
15:: Compute value loss $L_{critic}$
16:: Update $ϕ \leftarrow ϕ - α_{critic} \nabla_{ϕ} L_{critic}$
17:: Actor Update:
18:: Compute importance ratios $ρ {(θ)}_{discrete}$ and $ρ {(θ)}_{continuous}$
19:: Compute clipped loss $L_{actor}$
20:: Update $θ \leftarrow θ - α_{actor} \nabla_{θ} L_{actor}$
21:: end for
22:: Synchronize policies: $θ_{old} \leftarrow θ_{new}$
23:: end for

The proposed algorithm workflow consists of data interaction and policy update. In the interaction phase, the agent generates hybrid actions

A_{p}

and

A_{m}

based on the current state

S_{t}

. The jammer executes these actions, and the environment returns the next state

S_{t + 1}

and reward

r_{t}

. These trajectory data are stored in the replay buffer until task termination or reaching the maximum step limit. In the update phase, the algorithm samples batches of data from the replay buffer, and optimizes both the policy network

A c t o r

and the value network

C r i t i c

through a combination of Generalized Advantage Estimation (GAE) [25] and a clipped surrogate objective. A detailed elucidation of the core architectural components and algorithmic update procedures is provided below.

GAE advantage calculation: The GAE method balances bias and variance in advantage estimation by aggregating multi-step temporal difference (TD) errors. The advantage function $A d v_{t}$ is computed as follows:

$A d v_{t} = \sum_{k = 0}^{T - t} {(γ λ)}^{k} δ_{t + k},$

(29)

$δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})$

(30)

where $δ_{t}$ represents the TD error at time step t, $γ \in [0, 1]$ is the discount factor, and $λ$ controls the trade-off between bias and variance. The term $V (s)$ is the state-value predicted by the critic network.
Actor Update: The policy network is updated using a clipped surrogate objective to ensure stable policy improvement. For hybrid action spaces, the importance ratios for both action types are computed separately as follows:

$ρ {(θ)}_{discrete} = \frac{π_{new} (A_{p} | s)}{π_{old} (A_{p} | s)}$

(31)

$ρ {(θ)}_{continuous} = \frac{π_{new} (A_{m} | s)}{π_{old} (A_{m} | s)}$

(32)

According to Equation (23), the overall policy loss integrates both ratios with the advantage function and applies a clipping mechanism:

$L (θ) = - E [min (ρ (θ) \cdot A d v, clip (ρ (θ), 1 - ϵ, 1 + ϵ) \cdot A d v)]$

(33)

where $ϵ$ (typically $0.1 \leq ϵ \leq 0.3$ ) is the clipping threshold. The gradients are computed via backpropagation, and gradient clipping is applied to prevent exploding gradients. The final loss is a weighted sum of discrete and continuous action losses.
Critic Update: The value network minimizes the mean squared error (MSE) between predicted state-values and target values derived from discounted cumulative rewards. The Critic loss function is calculated as follows:

$L_{critic} = E [{(V_{target} - V (s))}^{2}]$

(34)

$V_{target} = \sum_{k = t}^{T} γ^{k - t} r_{k}$

(35)

where $V_{target}$ is the target value. The value network undergoes multiple optimization steps per batch to enhance value estimation accuracy.

4. Experiment and Simulation

4.1. Scene Description and Parameter Settings

In the experiments, we investigate a complex electromagnetic countermeasure scenario involving coordinated jamming against a widely distributed networked radar system during target penetration. A simulation environment is constructed, consisting of multiple spatially dispersed monostatic radars and a multi-beam jammer. As illustrated in Figure 5, the system comprises

N = 4

radar nodes, with the fusion center adopting the

K - N

fusion role (

K = 3

). Based on the data provided in [3,26], all radar nodes share identical parameters, with detailed configurations for both the jammer and radars provided in Table 1 and Table 2, respectively. The initial motion states of the jammer and the target are shown in Table 3. The netted radars are deployed in a fan-shaped spatial configuration with overlapping detection ranges, while the jammer—equipped with a phased array and multi-beam system—maneuvers synchronously with the penetrating target. These configurations are denoted as Config A. The most of the following experiments are conducted using the aforementioned parameter configuration (denoted as Config A) unless specifically indicated. To further evaluate the effectiveness of our proposed method, we also perform two experiments on extended configurations, namely Config B and Config C, which are discussed in Section 4.4. By dynamically allocating jamming beam patterns and transmission power, the jammer implements coordinated suppression.

In the design and implementation of algorithms, the rational configuration of hyper-parameters is a critical factor in ensuring model performance and training stability. For the dynamic characteristics of jamming resource allocation in complex electromagnetic countermeasure scenarios, parameter settings must balance task requirements, the heterogeneity of network architectures, and practical computational resource constraints. The multi-dimensional action space inherent in such tasks demands a careful equilibrium between exploration in strategy optimization and accurate value estimation, while the dual-branch network architecture introduces additional complexity by coupling discrete and continuous action representations. To address these challenges, we conducted extensive comparative experiments and comprehensive evaluations of algorithm convergence trends and generalization capabilities. Ultimately, we identified a set of hyper-parameters tailored to the task’s characteristics (as shown in Table 4), whose design rationale reflects a judicious trade-off among dynamic environment adaptability, policy update stability, and computational efficiency.

4.2. Comparison Strategies

To validate the effectiveness of the proposed method, the HPN with PPO2 algorithm is compared with the following two baseline strategies: (1) Distribution policy based on real digital action key encoding (AKE) of jamming type and jamming power [27]: This method encodes discrete jamming type selection as the integer part and maps continuous power allocation to the decimal part. By leveraging prior knowledge, it simplifies the action space and performs hybrid decision-making through a fixed encoding coupling mechanism. (2) Discrete action space deep Q-network (DQN) algorithm with interval division of power [11]: Although this method simplifies the action space, the coarse-grained power quantization degrades jamming effectiveness and introduces suboptimal solutions.

4.3. Training Process

Based on the parameter settings in Section 4.1 and the training framework in Section 3.4.4, we conduct hybrid action strategy training for the jammer agent. After every

K_{epoch}

policy-update steps, the target’s motion state is re-initialized, and a new training episode is initiated. The total number of training episodes is set to 1000 to balance exploration and convergence requirements. As shown in Figure 6, the reward function converges across different training episodes, with the average reward gradually increasing to a steady-state value as the number of episodes grows. This indicates that the policy network effectively learns the joint optimization pattern of beamforming and power allocation.

4.4. Experimental Results

Maximum penetration distance is defined as the farthest distance the penetrating target achieves before the netted radar system transitions to the guidance state across all training episodes. Figure 7 shows the change in the maximum penetration distance with increasing simulation times. Results show that after about 440 training runs (

M = 440

), the maximum penetration distance under the HPN stabilizes. In comparison, the AKE and DQN algorithms need around 600 and 800 runs to finish the penetration task. Figure 8 shows total power consumption of different strategies at each penetration time step. The figure reveals that the HPN with PPO2 algorithm generally has a lower proportion of jamming power consumption than the AKE and DQN approaches, and it utilizes jamming power more efficiently while fulfilling the penetration task.

In addition, to observe the effectiveness of our proposed method in more scenarios, we have also added several experiments on various configurations (i.e., Config B and Config C). The results are illustrated in Figure 9 and Figure 10, which depict the jamming resource consumption curves under different operational settings. In Figure 9, we have evaluated a 3-node radar network with

K - N

fusion role (

K = 2

); the method maintains effective jamming while strategically concentrating power on critical nodes to prevent detection confirmation (Config B). Figure 10 examines the modified penetration geometry, where the z-coordinates of both radars and jammer are adjusted to 35 km, with their velocity vectors set to (0, −0.4, −0.2) km/s. In this scenario, power distribution dynamically adjusts throughout the mission according to real-time positional changes (Config C).

To validate the stability of the proposed algorithm, Figure 11 and Figure 12 illustrate the dynamic evolution of jamming beam power distribution and beam type allocation. The analysis reveals that during the initial phase (Frames 1–24), Radars 1 and 4 exhibited significantly higher received jamming power due to their geometric configuration, as follows: the angle between the jammer–radar line-of-sight and the target-radar line-of-sight remained within the mainlobe beamwidth of the radar, allowing energy penetration through the main lobe. Jamming power demonstrated a strong inverse correlation with jammer-to-radar distance, resulting in higher received power at Radars 1 and 4. As the jammer accompanied the target into the intermediate phase (Frames 25–28), Radars 2 and 3 experienced shorter distances to the jammer, while Radars 1 and 4 became relatively distant. At this stage, the angle between the jammer–radar and target-radar lines-of-sight exceeded the mainlobe beamwidth, causing significant received power reduction due to sidelobe attenuation effects and leading to power equalization across radars. In the final penetration phase, Radars 2 and 3 further approached the jammer, where near-field jamming gain surpassed the path loss advantage of mainlobe-directed energy, becoming the dominant factor in jamming power. Consequently, jamming power was redistributed primarily towards Radars 2 and 3.

Additionally, concurrent analysis of beam type allocation indicates that the netted radar system operated in search state during the initial phase (1–23 frames), employing suppression jamming to overwhelm radar search beams. After radar transitioned to track state, the jamming strategy adaptively shifted to deception jamming to counter the dynamically enhanced radar signal processing capabilities.

Furthermore, based on the computational efficiency analysis, we have conducted rigorous timing experiments to evaluate the three methods. The results in Table 5 demonstrate significant differences in algorithm efficiency. HPN and AKE exhibit significantly shorter training times compared to DQN, primarily because DQN must handle a high-dimensional discrete action space (with 20 optional actions per radar, resulting in a total of 80-dimensional outputs), which increases the computational overhead of both forward and backward propagation in the network. In terms of inference efficiency, HPN and AKE require approximately 15 milliseconds per step for decision-making, whereas DQN needs 32.5 milliseconds. This discrepancy arises because DQN must evaluate all possible actions at each step to select the optimal one. Notably, although HPN has a slightly longer training time than AKE (3560 s vs. 3200 s), its superior jamming performance justifies this minor computational overhead. Overall, HPN achieves the best balance between effectiveness and efficiency.

These results demonstrate the robustness and timeliness of the proposed algorithm in multi-target cooperative jamming under complex electromagnetic environments. Our proposed method preserves the physical independence of actions by separately modeling them through a discrete actor and a continuous actor, and its HPN achieves cross-action-space parameter sharing through a shared feature extraction layer. In contrast, the AKE method forcibly couples discrete beam selection with continuous power allocation through real-number coding, which forces two action dimensions with distinct physical meanings to share the same coding space and introduces spurious correlations; DQN suffers from quantization errors when discretizing power into discrete levels. Meanwhile, compared with DQN’s greedy exploration policy, HPN’s exploration policy employs importance sampling ratio clipping to prevent policy collapse caused by improper power allocation during early exploration phases.

5. Discussion

The proposed HPN-PPO2 framework demonstrates superior convergence and energy efficiency compared to conventional AKE and DQN approaches. This stems from the following three key innovations: (1) The hybrid action space decomposition eliminates spurious correlations between beam selection and power allocation, mitigating the dimensional coupling inherent in AKE’s real-number encoding. (2) The dynamic fusion metric aligns jamming objectives with radar operational states, enabling phase-aware resource allocation—critical for countering adaptive radar state transitions. (3) PPO2’s clipped surrogate objective ensures stable policy updates in hybrid action spaces, overcoming DQN’s quantization errors and exploration inefficiency.

However, three limitations require further investigation. First, the assumption of perfect radar state observability may not hold in practical scenarios with signal occlusion. In the future, we may develop recurrent neural networks with attention mechanism to reconstruct hidden states from partial observations, enabling robust decision-making under uncertainty. Second, mobile radar networks may create coverage gaps that exceed beam steering rates. This might be addressed by integrating Kalman filtering with graph neural networks to predict radar trajectories, combined with threat-prioritized beam-sweeping protocols that dynamically adjust sector coverage. Third, scenarios with more radars than available beams (

N > L

) or multiple penetration targets require architectural extensions. For radar-beam imbalance, we propose adaptive beam-sharing via multi-beam waveforms and time-division hopping. Multiple targets will be handled through distributed multi-agent coordination using shared critic networks.

These findings highlight the potential of hybrid policy architectures in electromagnetic countermeasure optimization. Our framework’s modular design permits straightforward integration of additional jamming modalities, suggesting promising directions for cognitive electronic warfare systems.

6. Conclusions

In this work, we proposed a DRL-based collaborative jamming resource allocation method. A hybrid policy network (HPN) coordinated beam selection and power allocation simultaneously, and a dynamic metric quantified the jamming effectiveness. Using the HPN with PPO2 algorithm, it enabled adaptive policy optimization for the DRL framework. Experiments show it outperformed the existing approaches (AKE and DQN) in terms of the following: 30% faster convergence in penetration, 18.7% less power consumption, 95.2% utilization, and enhancing jamming adaptability and resource efficiency. The framework’s modular design supports seamless integration of emerging jamming techniques while maintaining adaptability in complex electromagnetic environments. Future work will extend this architecture to multi-agent collaborative jamming scenarios and incorporate partial observability modeling for enhanced battlefield realism.

Author Contributions

Conceptualization, W.H. and Z.X.; Data curation, W.H.; Funding acquisition, X.F.; Methodology, W.H. and W.K.; Project administration, Z.X.; Supervision, Z.X.; Validation, W.K.; Writing—original draft, W.H. and W.K.; Writing—review and editing, X.F. and Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partly supported by the Key Research & Development Program of Shaanxi (Nos. 2023-ZDLGY-12 and 2023-ZDLGY-16.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, B.; Tian, L.; Chen, D.; Liang, S. An adaptive dwell time scheduling model for phased array radar based on three-way decision. J. Syst. Eng. Electron. 2020, 31, 500–509. [Google Scholar] [CrossRef]
Jiang, H.; Zhang, Y.; Xu, H. Optimal allocation of cooperative jamming resource based on hybrid quantum-behaved particle swarm optimisation and genetic algorithm. IET Radar Sonar Navig. 2017, 11, 185–192. [Google Scholar] [CrossRef]
Zhang, D.; Sun, J.; Yang, C.; Yi, W. Joint jamming beam and power scheduling for suppressing netted radar system. In Proceedings of the 2021 IEEE Radar Conference (RadarConf21), Atlanta, GA, USA, 7–14 May 2021; pp. 1–6. [Google Scholar]
Yao, Z.; Tang, C.; Wang, C.; Shi, Q.; Yuan, N. Cooperative jamming resource allocation model and algorithm for netted radar. Electron. Lett. 2022, 58, 834–836. [Google Scholar] [CrossRef]
You, S.; Diao, M.; Gao, L. Implementation of a combinatorial-optimisation-based threat evaluation and jamming allocation system. IET Radar Sonar Navig. 2019, 13, 1636–1645. [Google Scholar] [CrossRef]
Zou, W.; Niu, C.; Liu, W.; Wang, Y.; Zhan, J. Combination search strategy-based improved particle swarm optimisation for resource allocation of multiple jammers for jamming netted radar system. IET Signal Process. 2023, 17, e12198. [Google Scholar] [CrossRef]
Tian, L.; Liu, F.; Miao, Y.; Li, K.; Liu, Q. Resource allocation of radar network based on particle swarm optimisation. J. Eng. 2019, 2019, 6568–6572. [Google Scholar] [CrossRef]
He, B.; Yang, N. Power allocation between radar and jammer using conflict game theory. Electron. Lett. 2024, 60, e13311. [Google Scholar] [CrossRef]
Zhang, S.; Tian, H. Design and implementation of reinforcement learning-based intelligent jamming system. IET Commun. 2020, 14, 3231–3238. [Google Scholar] [CrossRef]
Li, S.; Liu, G.; Zhang, K.; Qian, Z.; Ding, S. DRL-Based joint path planning and jamming power allocation optimization for suppressing netted radar system. IEEE Signal Process. Lett. 2023, 30, 548–552. [Google Scholar] [CrossRef]
Feng, L.; Liu, S.; Xu, H. Multifunctional radar cognitive jamming decision based on dueling double deep Q-network. IEEE Access 2021, 10, 112150–112157. [Google Scholar] [CrossRef]
Zhang, C.; Yang, B.; Ji, W.; Hu, J.; Xu, S.; Xiao, Y. Cognitive jamming policy generation based on A2C algorithm. In Proceedings of the 2024 International Radar Symposium (IRS), Wroclaw, Poland, 2–4 July 2024; pp. 33–38. [Google Scholar]
Wang, Y.; Liang, Y.; Wang, Z. Hierarchical reinforcement learning-based joint allocation of jamming task and power for countering netted radar. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 2149–2167. [Google Scholar] [CrossRef]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Wang, B.; Cui, G.; Zhang, B.; Sheng, B.; Kong, L.; Ran, D. Deceptive jamming suppression based on coherent cancelling in multistatic radar system. In Proceedings of the 2016 IEEE Radar Conference (RadarConf), Philadelphia, PA, USA, 2–6 May 2016; pp. 1–5. [Google Scholar]
Wang, Y.; Dong, Q.; Jin, Q.; Mao, X. A deception jamming detection and suppression method for multichannel SAR. In Proceedings of the 2022 7th International Conference on Signal and Image Processing (ICSIP), Suzhou, China, 20–22 July 2022; pp. 29–34. [Google Scholar]
Li, J.; Shen, X.; Xiao, S. Robust jamming resource allocation for cooperatively suppressing multi-station radar systems in multi-jammer systems. In Proceedings of the 2022 25th International Conference on Information Fusion (FUSION), Linköping, Sweden, 4–7 July 2022; pp. 1–8. [Google Scholar]
Liu, W.; Wang, Y.; Liu, J.; Huang, L.; Jao, C. Performance analysis of adaptive detectors for point targets in subspace interference and Gaussian noise. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 429–441. [Google Scholar] [CrossRef]
Pham, V.; Nguyen, T.; Nguyen, D.; Morishita, H. A new method based on copula theory for evaluating detection performance of distributed-processing multistatic radar system. IEICE Trans. Commun. 2022, 105, 67–75. [Google Scholar] [CrossRef]
Zhao, Z.; Zhou, X.; Hong, S.; Gong, Y. Receiver placement in passive radar through GDOP coverage ratio with TDOA-AOA hybrid localization. In Proceedings of the IET International Radar Conference, Chongqing, China, 4–6 November 2020; pp. 476–480. [Google Scholar]
Xia, J.; Ma, J.; Li, Y.; Song, M. Cooperative jamming resource allocation based on integer-encoded directed mutation artificial bee colony algorithm. In Proceedings of the 2021 IEEE 4th International Conference on Electronic Information and Communication Technology (ICEICT), Xi’an, China, 18–20 August 2021; pp. 695–700. [Google Scholar]
Bachamann, D.; Evans, R.; Moran, B. Game theoretic analysis of adaptive radar jamming. IEEE Trans. Aerosp. Electron. Syst. 2011, 47, 1081–1100. [Google Scholar] [CrossRef]
Nichlors, R.; Warren, P. Threat evaluation and jamming allocation. IET Radar Sonar Navig. 2017, 11, 459–465. [Google Scholar]
Tang, Z.; Gong, Y.; Tao, M.; Su, J.; Fan, Y.; Li, T. Recognition of working mode for multifunctional phased array radar Uunder small sample condition. In Proceedings of the 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 21–24 July 2023; pp. 1157–1160. [Google Scholar]
Schulman, S.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage Estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
Wu, Z.; Hu, S.; Luo, Y.; Li, X. Optimal distributed cooperative jamming resource allocation for multi-missile threat scenario. IET Radar Sonar Navig. 2022, 16, 113–128. [Google Scholar] [CrossRef]
Zhao, Z.; Liu, Y.; Xiao, S. Dynamic weighted fusion algorithm and its accuracy analysis for multi-radar localization. Electron. Opt. Control 2010, 5, 35–37. [Google Scholar]

Figure 1. The game process of the jammer agent and the netted radar.

Figure 2. Schematic diagram of relative positions among the jammer, radar, and target.

Figure 3. Schematic diagram of the netted radar system working state conversion.

Figure 4. Schematic diagram of hybrid policy network (HPN)-based DRL.

Figure 5. Penetration scenario simulation diagram.

Figure 6. Reward variation curve of the jammer.

Figure 7. Maximum penetration distance change.

Figure 8. Total power consumption of jamming.

Figure 9. Total power consumption of jamming in Config B.

Figure 10. Total power consumption of jamming in Config C.

Figure 11. Jamming power allocation result.

Figure 12. Jamming type allocation result.

Table 1. Operating parameters of jammer.

Parameter	Value
Number of beams generated by the jammer L	4
Total power of the jammer	100 W
Operating wavelength $λ_{j}$	0.1 m
Antenna gain of the jammer $G_{j}$	10 dB
Polarization mismatch loss $γ_{j}$	0.5
RCS of target	1 $m^{2}$

Table 2. Operating parameters of radar.

Parameter	Value
Transmit power $P_{t}$	$2 \times 10^{8}$ W
Transmit-antenna gain $G_{r}$	40 dB
Operating wavelength $λ$	0.1 m
Main-lobe beamwidth of the antenna $θ_{0.5}$	$3^{\circ}$
Detection threshold $y_{0}$	1
Thermal noise power of receiver $P_{noise}$	$1 \times 10^{- 12}$ W
Pulse width $τ$	$1 \times 10^{- 5}$ s
Pulse repetition frequency $f_{r}$	$1 \times 10^{4}$ Hz
Antenna servo bandwidth $β_{n}$	$1 \times 10^{7}$ Hz
Receiver bandwidth B	$1 \times 10^{7}$ Hz

Table 3. Initial motion states of jammer and target.

Type	Initial Location (km)	Initial Speed (m/s)
Jammer	(25,55,30)	(0,−0.4,−0.15)
Target	(24,55,30)	(0,−0.4,−0.15)
Radar 1	(10,20,0)	/
Radar 2	(20,10,0)	/
Radar 3	(30,10,0)	/
Radar 4	(40,20,0)	/

Table 4. Parameters of PPO2 algorithm based on HPN.

Parameter	Value
Maximum training episodes M	1000
Maximum simulation steps $K_{e p o c h}$	50
Replay buffer size	1024
Clipping threshold $ϵ$	0.2
Discount factor $γ$	0.9
Actor learning rate $α_{a c t o r}$	0.0001
Critic learning rate $α_{c r i t i c}$	0.0003
Actor update steps	10
Critic update steps	10
Gradient clipping norm	0.5
Optimizer	Adam
Rate of exponential decay $α$	0.9999

Table 5. Differences in algorithm efficiency and complexity.

Method	Training Time (s)	Avg. Inference Time (ms)	Complexity Class
HPN	3560	15.2	$O (N \times L)$
AKE	3200	14.8	$O (N \times L)$
DQN	8900	32.5	$O (N \times T \times P)$

(N: Radar nodes, L: Jammer beams, T: Jamming types, P: Power quantization levels).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hao, W.; Ke, W.; Feng, X.; Xia, Z. Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network. Appl. Sci. 2025, 15, 8898. https://doi.org/10.3390/app15168898

AMA Style

Hao W, Ke W, Feng X, Xia Z. Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network. Applied Sciences. 2025; 15(16):8898. https://doi.org/10.3390/app15168898

Chicago/Turabian Style

Hao, Wanbing, Wentao Ke, Xiaoyi Feng, and Zhaoqiang Xia. 2025. "Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network" Applied Sciences 15, no. 16: 8898. https://doi.org/10.3390/app15168898

APA Style

Hao, W., Ke, W., Feng, X., & Xia, Z. (2025). Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network. Applied Sciences, 15(16), 8898. https://doi.org/10.3390/app15168898

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Dynamic Jamming Policy Generation for Netted Radars Using Hybrid Policy Network

Abstract

1. Introduction

2. Jamming Model of Netted Radars

2.1. Active Jamming Type

2.2. Radar Echo Signal Model Under Jamming Condition

2.3. Jamming Effect Evaluation Index

2.3.1. Detection Probability

2.3.2. Positioning Accuracy

2.3.3. Weighted Model for Evaluation Indicators

3. HPN Based DRL for Jamming

3.1. Policy Gradient Methods

3.2. Advantage Actor-Critic

3.3. Proximal Policy Optimization

3.4. Jammer Agent Model

3.4.1. States of Jammer Agent

3.4.2. Action of the Jammer Agent

3.4.3. Rewards for Jammer Agent

3.4.4. HPN Optimization

4. Experiment and Simulation

4.1. Scene Description and Parameter Settings

4.2. Comparison Strategies

4.3. Training Process

4.4. Experimental Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI