1. Introduction
With the rapid rise of the low-altitude economy and the continuous development of unmanned aerial vehicle (UAV) technology, UAV communication has become a significant research topic in the field of wireless communication [1,2]. However, due to the openness of wireless channels, UAV communication is highly susceptible to malicious jamming, especially random pulse jamming in the time domain. Moreover, the rapidly time-varying characteristics of wireless channels further amplify the destructive impact of pulse jamming, posing severe challenges to the stability and reliability of UAV communication links [3].
Traditional anti-jamming techniques, such as direct-sequence spread spectrum (DSSS) [4] and frequency-hopping spread spectrum (FHSS) [5], mitigate jamming by spreading the signal spectrum to disperse the jammer's power. However, these methods are ineffective against highly dynamic jamming patterns such as random pulse jamming, and they often require large time–frequency resources to achieve sufficient processing gain, thereby significantly reducing spectral efficiency [6,7].
To overcome these limitations, intelligent anti-jamming technologies have emerged. By leveraging machine learning for dynamic policy optimization, these methods provide new insights into adaptive anti-jamming communication. For instance, ref. [8] proposed a proximal policy optimization (PPO)-based intelligent anti-jamming algorithm that formulates the anti-jamming problem as a Markov decision process and achieves fast adaptation in dynamic jamming environments; ref. [9] applied Q-learning to interact with the environment and learn optimal anti-jamming strategies in static scenarios; ref. [8] proposed the Slot-Cross Q-Learning (SCQL) algorithm, which enables parallel sensing and learning of multiple jamming patterns within a single timeslot, effectively reducing learning latency and improving anti-jamming performance in rapidly changing environments. Moreover, ref. [10] integrated Q-learning with online learning to design a dynamic power control policy that significantly reduced bit error rates and accelerated convergence; ref. [11] introduced a Time-Domain Anti-Pulse Jamming Algorithm (TDAA), which discriminates temporal jamming patterns and adapts strategies accordingly, improving system performance under random pulse jamming. Additionally, ref. [12] proposed the Double Q-Learning algorithm, which employs a dual Q-table structure to alleviate the overestimation of conventional Q-learning, thereby stabilizing the learning process and enhancing convergence performance.
However, most of the aforementioned studies rely on an ideal-channel assumption and do not fully consider how the time-varying characteristics of the channel affect the decision-making performance of Q-learning in high-speed UAV scenarios. In high-mobility scenarios, the channel fading produced by multipath propagation has a particularly strong impact on signals, which markedly reduces the effectiveness of existing anti-jamming methods under rapidly changing channel conditions [13]. In particular, when UAVs collect data along prescribed routes, trajectory-related shadowing and multipath effects cause the channel state to vary in a partly periodic manner. Therefore, studying anti-jamming methods applicable to multipath fading in mobile communication environments is of great significance for enhancing the reliability and effectiveness of wireless communication systems.
The channel characteristics in mobile communication scenarios are usually characterized by channel models. Ref. [14] revealed that, under a correlated two-path Rician model, multipath delay spread disrupts signal orthogonality and thereby leads to a significant increase in bit error rate. Ref. [15] further points out that, in a time-varying channel environment, the statistical characteristics of the received signal change rapidly over time, which degrades the reliability of jamming detection methods based on energy or spectral features and increases the modeling difficulty of intelligent anti-jamming algorithms in dynamic scenarios. For the Q-learning algorithm, the time-varying nature of the channel raises two challenges. First, key parameters such as the amplitude and phase of the received signal fluctuate rapidly and randomly due to the multipath effect, so state-space representations built from historical data become distorted and the Q-table converges with difficulty. Second, traditional Q-learning anti-jamming methods usually classify states only according to whether jamming is present, without considering dynamic channel variations; as a result, when time-varying fading and random pulse jamming coexist, the communication strategy does not match the actual environment, leading to an increased transmission failure rate and slow convergence. Consequently, designing a more robust intelligent anti-jamming method that explicitly accounts for channel fading has become an urgent problem.
To address this issue, this paper proposes an anti-jamming algorithm based on time-varying fading channel awareness (TFCAAJ). The algorithm combines time-domain signal characteristics with Q-learning to output a "silent/transmit" strategy, which effectively enhances transmission reliability and energy efficiency while maintaining the system's successful transmission rate. The contributions of this paper are twofold:
A channel-aware method against random pulse jamming is proposed. The Q-learning state space is optimized by introducing a channel gain variable, which effectively solves the problem of transmission strategy adaptation when channel fading and random pulse jamming coexist in UAV communication.
Based on the optimized Q-learning decision loop, a reward function combining channel quality and jamming characteristics is designed, which effectively suppresses ineffective transmissions over deeply faded channels, improves the accuracy of action-value estimation, accelerates convergence, and enhances the adaptability and robustness of the system in time-varying jamming environments.
3. Problem Modeling
In traditional Q-learning anti-jamming methods, the communication strategy is usually determined simply by judging whether jamming is present in each time slot. This may be effective in a static environment or when channel conditions are close to ideal, but it is inadequate in complex dynamic environments, especially over time-varying channels. Traditional Q-learning methods typically ignore the influence of channel fading, and thus overlook the fact that the time-frequency characteristics of the jamming signal change with the channel. In reality, wireless signals are affected by multipath propagation along their paths, which causes rapid fluctuations of the channel gain and signal quality over time. The impact of a jamming signal therefore depends not only on its presence or absence but also on the fading characteristics of the channel, which renders existing methods ineffective over dynamic channels.
For instance, when the channel experiences a deep fade, even if a jamming signal is present, its impact on the wireless communication link may be greatly weakened, and it may not even be correctly perceived by the communication system. Traditional Q-learning methods usually overlook this point, which may lead to false alarms or missed detections in the system's perception of jamming and, in turn, to incorrect anti-jamming decisions. Take pulse jamming as an example and assume that the transmission channel exhibits a significant multipath delay spread. When the pulse jamming signal passes through the channel, its time-domain waveform is as shown in Figure 3: the multipath effect causes the originally steep pulse waveform to exhibit obvious time-domain expansion and tailing, and its amplitude undergoes deep random fading, i.e., the jamming signal is spread in time. This indicates that, in scenarios with channel fading, anti-jamming decision-making based on static assumptions suffers from an increased probability of decision errors and reduced spectral utilization due to false alarms or missed detections of jamming.
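To make the spreading effect concrete, the following minimal Python sketch passes a rectangular jamming pulse through a hypothetical three-tap multipath channel; the sample rate, tap delays, and tap gains are illustrative assumptions, not the values used in this paper's simulations.

```python
import numpy as np

# Minimal sketch: a rectangular jamming pulse convolved with a hypothetical
# 3-tap multipath channel to illustrate time-domain spreading and tailing.
pulse = np.zeros(200)
pulse[50:70] = 1.0            # 20-sample rectangular jamming pulse

# Hypothetical path delays (samples) and complex gains (illustrative only)
delays = [0, 8, 21]
gains  = [1.0, 0.5 * np.exp(1j * 0.7), 0.25 * np.exp(-1j * 1.9)]

h = np.zeros(max(delays) + 1, dtype=complex)
for d, g in zip(delays, gains):
    h[d] += g

received = np.convolve(pulse, h)   # the channel spreads the pulse in time
print("input support :", np.count_nonzero(pulse), "samples")
print("output support:", np.count_nonzero(np.abs(received) > 1e-3), "samples")
```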
The transmit-or-silent decision of the transmitter under random jamming and time-varying fading channels is modeled as a Markov decision process with a discount factor, denoted as $\langle S, A, P, R, \gamma \rangle$. Here, $S$ represents the state space, $A$ the action space, $P$ the state transition probability, $R$ the reward function, and $\gamma$ the discount factor.
To represent both jamming and channel fading simultaneously, the state space is defined as
$$S = \{\, s_t = (n_t, f_t, g_t) \,\},$$
where $n_t$ is the position of the current time slot within the jamming period, $f_t$ is the jamming marker, and $g_t$ is the sampled channel envelope gain.
In this model, each time slot is identified by the integer $n_t$ giving its position in the current jamming period. The jamming marker $f_t$ indicates whether any jamming has been detected during the cycle; it is initialized to 0 in the first time slot of the cycle (i.e., $n_t = 1$). Once a pulse is detected via the perception function in any time slot, $f_t$ is set to 1 for that slot and all subsequent slots until it is reset at the end of the cycle. Simultaneously, to characterize the time-varying Rician fading channel, the envelope gain $g_t$ is sampled at the center of each time slot; its amplitude follows the Rician distribution, whereas the temporal correlation between samples is determined by the Jakes fading process and its Doppler spectrum rather than by the marginal Rician distribution itself. Therefore, the state triplet $(n_t, f_t, g_t)$ fully reflects the time slot position, the history of jamming occurrence, and the current channel quality.
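As an illustration of how such a triplet can index a finite Q-table, the sketch below quantizes the continuous envelope gain into a few discrete levels; the bin edges and the quantization scheme are modelling assumptions made here for illustration, since the exact discretization is not specified in this section.

```python
import numpy as np

# Sketch of the state triplet (n, f, g). The continuous gain g is mapped to a
# small number of levels so that the triplet can index a finite Q-table.
GAIN_BINS = np.array([0.3, 0.7, 1.2])   # hypothetical bin edges

def make_state(slot_idx: int, jam_flag: int, gain: float) -> tuple:
    """Return the discrete state (n, f, quantized g)."""
    gain_level = int(np.digitize(gain, GAIN_BINS))   # level in 0..3
    return (slot_idx, jam_flag, gain_level)

# Example: slot 4 of the cycle, jamming already detected, envelope gain 0.55
print(make_state(4, 1, 0.55))   # -> (4, 1, 1)
```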
The action space includes two types of actions, "silent" and "transmit", defined as
$$A = \{0, 1\}.$$
Here, $a = 0$ indicates that the transmitter takes the "silent" action: it does not transmit any signal during the time slot, and only measures the channel and jamming level and updates the corresponding state parameters. $a = 1$ indicates that the transmitter takes the "transmit" action: it transmits data at a fixed power during the time slot and is simultaneously affected by possible jamming and channel fading.
In this system model, the baseband signal observed by the receiver in each time slot is
$$y(t) = \sum_{\ell=1}^{L} h_\ell(t)\, x(t - \tau_\ell) + j(t) + w(t),$$
where $x(t)$ is the legitimate transmitted signal, $w(t)$ is additive white Gaussian noise, $h_\ell(t)$ denotes the time-varying complex gain of the $\ell$-th path with relative delay $\tau_\ell$, and $j(t)$ is the random jamming. In our simulations we consider a three-path channel ($L = 3$) whose delays and average gains are given in Table 1. A binary process $d_t \in \{0, 1\}$ marks whether jamming is present at the time-slot center ($d_t = 1$ indicates jamming). The channel gain $h_\ell(t)$ is generated using the classic Jakes model to reflect the rapid amplitude and phase changes caused by multipath propagation. At the center time $t_c$ of the $t$-th time slot, an envelope sample of the composite channel is taken:
$$g_t = \Big| \sum_{\ell=1}^{L} h_\ell(t_c) \Big|.$$
Here, $|\cdot|$ denotes the amplitude (modulus) operator. Because the UAV air-to-ground channel generally contains a direct path superimposed with a large number of independent multipath components, the statistical characteristics of $g_t$ follow the Rician distribution in the presence of a specular (line-of-sight) component and the Rayleigh distribution in its absence. Performing an arithmetic average of $g_t$ over multiple consecutive time slots effectively filters out the short-term sharp fluctuations caused by multipath fading, so that link-quality estimation and adaptive control reflect the overall availability and reliability of the channel.
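The following sketch shows one way to generate correlated envelope samples of this kind: a sum-of-sinusoids Jakes-type diffuse component plus a constant specular term gives a Rician-faded gain, which is then sampled at slot centers and averaged over consecutive slots. The Doppler frequency, K-factor, slot length, and averaging window are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def jakes_fader(t, fd, n_paths=32, seed=0):
    """Complex diffuse component (sum of sinusoids) with max Doppler fd (Hz)."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(0, 2 * np.pi, n_paths)    # arrival angles
    phi = rng.uniform(0, 2 * np.pi, n_paths)      # initial phases
    omega = 2 * np.pi * fd * np.cos(alpha)        # per-path Doppler shifts
    return np.exp(1j * (np.outer(t, omega) + phi)).sum(axis=1) / np.sqrt(n_paths)

fd = 200.0                     # assumed maximum Doppler shift (Hz)
K = 4.0                        # assumed Rician K-factor
slot = 1e-3                    # assumed slot length (s)
t_centers = slot * (np.arange(100) + 0.5)   # slot-center sampling instants

diffuse = jakes_fader(t_centers, fd)
los = np.sqrt(K / (K + 1))                        # constant specular part
h = los + np.sqrt(1 / (K + 1)) * diffuse          # Rician-faded complex gain
g = np.abs(h)                                     # per-slot envelope samples

# Arithmetic averaging over 5 consecutive slots to smooth deep fades
g_avg = np.convolve(g, np.ones(5) / 5, mode="same")
print(g[:5], g_avg[:5])
```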
The state transition from the current state $s_t = (n_t, f_t, g_t)$ to the next state $s_{t+1} = (n_{t+1}, f_{t+1}, g_{t+1})$ is determined as follows. First, the time slot number $n_{t+1}$ depends only on the current slot number $n_t$ and is independent of the jamming indicator and the channel condition. Denoting the number of time slots per jamming cycle by $N$, the transition is
$$n_{t+1} = \begin{cases} n_t + 1, & n_t < N, \\ 1, & n_t = N. \end{cases}$$
The transition of the jamming indicator $f_{t+1}$ is determined by the current indicator $f_t$ and the jamming detection result $d_t$, where $d_t = 1$ indicates that jamming is detected in the current time slot and $d_t = 0$ indicates that it is not:
$$f_{t+1} = \begin{cases} 0, & n_{t+1} = 1, \\ \max(f_t, d_t), & \text{otherwise.} \end{cases}$$
During the first time slot of each jamming cycle ($n_t = 1$), the jamming flag is initialized to 0. If jamming is first detected in any subsequent time slot ($d_t = 1$), the jamming flag is set to 1 from that slot onward and remains 1 for all remaining time slots of the cycle. This persists until the end of the jamming cycle ($n_t = N$), after which the flag is reset to 0 at the start of the next cycle. Finally, the channel state $g_{t+1}$ for the next time slot depends only on the Rician fading statistics of the channel and is independent of the current slot number and jamming indicator. It follows a Rician distribution with $K$-factor $K$:
$$p(g) = \frac{g}{\sigma^2} \exp\!\left(-\frac{g^2 + A^2}{2\sigma^2}\right) I_0\!\left(\frac{g A}{\sigma^2}\right), \qquad K = \frac{A^2}{2\sigma^2},$$
where $A$ is the amplitude of the specular component, $2\sigma^2$ is the average power of the diffuse component, and $I_0(\cdot)$ is the zeroth-order modified Bessel function of the first kind.
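A compact sketch of the deterministic part of this transition (slot index and jamming flag) is given below; the cycle length N is a placeholder, and the channel gain is assumed to be redrawn independently from the fading process as described above.

```python
# Sketch of the deterministic state-transition rules, assuming a jamming
# cycle of N time slots indexed 1..N.
N = 10  # hypothetical number of slots per jamming cycle

def next_slot_and_flag(n: int, f: int, detected: int) -> tuple:
    """Advance the slot index and jamming flag (the channel gain is redrawn
    independently from the Rician fading process)."""
    n_next = n + 1 if n < N else 1          # slot index wraps at the cycle boundary
    if n_next == 1:
        f_next = 0                          # flag reset at the start of a new cycle
    else:
        f_next = max(f, detected)           # latches to 1 once jamming is detected
    return n_next, f_next

print(next_slot_and_flag(3, 0, 1))   # -> (4, 1)
print(next_slot_and_flag(N, 1, 0))   # -> (1, 0)
```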
In summary, the conditional probability density for the system to move from state $s = (n, f, g)$ to state $s' = (n', f', g')$ after executing action $a$ is
$$p(s' \mid s, a) = \mathbb{1}\{n' = n_{t+1}\}\; \mathbb{1}\{f' = f_{t+1}\}\; p(g'),$$
where the two indicator terms enforce the deterministic slot and flag transitions given above and $p(g')$ is the Rician density. Thus, when the channel gain is discretized, the transition probability is the integral of the Rician PDF over the corresponding gain interval of $g'$.
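For a discretized gain state, this integral can be evaluated directly with the Rician CDF, as in the sketch below; the K-factor, the power normalization, and the bin edges are illustrative assumptions.

```python
from scipy.stats import rice
import numpy as np

# Sketch: probability that the next-slot envelope falls in a given gain bin,
# computed as the integral of the Rician PDF over that bin. scipy's `rice`
# uses shape b = A/sigma; with K = A^2 / (2*sigma^2), we have b = sqrt(2K).
K = 4.0                               # assumed Rician K-factor
sigma = 1.0 / np.sqrt(2 * (K + 1))    # normalizes E[g^2] = 1
b = np.sqrt(2 * K)

g_lo, g_hi = 0.3, 0.7                 # hypothetical gain-bin edges
p_bin = rice.cdf(g_hi, b, scale=sigma) - rice.cdf(g_lo, b, scale=sigma)
print(f"P(g' in [{g_lo}, {g_hi}]) = {p_bin:.3f}")
```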
The immediate reward for the transmitter executing action $a$ in state $s$ covers the following cases: ① within a jamming cycle, if jamming has already been detected, the data will be successfully transmitted when the transmitter performs the "transmit" action in the subsequent slots of the cycle, yielding a system benefit $r(g_t)$; ② if jamming has not yet been detected, a "transmit" action also succeeds and yields the benefit $r(g_t)$; ③ if jamming is detected in the current time slot and the channel gain is $g_t$, transmission in that slot fails, resulting in a system loss $c(g_t)$; ④ when the transmitter stays silent, the reward is 0. Based on this, the reward mechanism of the TFCAAJ algorithm is defined as
$$R(s, a) = \begin{cases} r(g_t), & a = 1 \text{ and the transmission succeeds}, \\ -c(g_t), & a = 1 \text{ and the transmission fails}, \\ 0, & a = 0. \end{cases}$$
Here, $r(g_t)$ and $c(g_t)$ represent the reward and penalty values dynamically calculated from the channel gain $g_t$. Specifically, $r(g_t)$ and $c(g_t)$ are computed from the given base reward and penalty values together with the channel gain threshold $g_{\mathrm{th}}$, which is set to a fixed value.
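The sketch below illustrates one possible channel-aware reward of this kind; the specific way the reward and penalty are scaled by the channel gain is an assumption made for illustration and is not this paper's exact formula.

```python
# Minimal sketch of a channel-aware reward: 0 when silent, a gain-scaled
# reward on success, a penalty on failure. R0, C0, and G_TH stand in for the
# base reward, base penalty, and fixed gain threshold; the scaling is assumed.
R0, C0 = 1.0, 1.0        # base reward / penalty values
G_TH = 0.5               # fixed channel-gain threshold (assumed)

def reward(action: int, jammed: int, gain: float) -> float:
    """Per-slot reward for the chosen action given jamming and channel gain."""
    if action == 0:                 # "silent": no reward, no loss
        return 0.0
    if jammed or gain < G_TH:       # jammed slot or deep fade: transmission fails
        return -C0
    return R0 * min(gain / G_TH, 2.0)   # success, scaled (and capped) by channel quality
```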
The objective of solving the Markov decision process (MDP) is to find the optimal strategy $\pi^{*}$ that maximizes the long-term cumulative reward. The state-action value function (Q value) can be expressed as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s,\ a_t = a \right], \qquad G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},$$
where $G_t$ represents the cumulative future reward obtained by starting from state $s$ at time $t$, first executing action $a$, and then following strategy $\pi$. This cumulative reward is the weighted sum of all rewards $R_{t+k+1}$ from time $t+1$ onward, with weights controlled by the discount factor $\gamma$, which expresses the relative importance of future rewards. In other words, $Q^{\pi}(s, a)$ is the expected discounted sum of all future rewards obtainable under strategy $\pi$ given the current state $s$ and action $a$. Once the optimal Q value for every state-action pair is found, the optimal strategy $\pi^{*}$ can be derived directly. This approach, based on dynamic programming or other reinforcement-learning algorithms, maximizes the long-term return by gradually refining the Q values. Therefore, by optimizing the Q value of each state-action pair, we ultimately obtain a strategy that selects the optimal action in each state and thus maximizes the cumulative reward.
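As a small illustration, once the Q values have converged, the strategy can be read off by a greedy comparison of the two action values in each state, as sketched below; the dictionary-based Q-table and the example values are assumptions for illustration only.

```python
from collections import defaultdict

# Sketch: greedy policy extraction from a converged Q-table
# (action 0 = silent, action 1 = transmit).
Q = defaultdict(lambda: [0.0, 0.0])      # Q[state] -> [Q(s,0), Q(s,1)]
Q[(4, 1, 1)] = [0.0, 0.8]                # example learned values

def greedy_action(state) -> int:
    q_silent, q_transmit = Q[state]
    return int(q_transmit > q_silent)

print(greedy_action((4, 1, 1)))   # -> 1 (transmit)
```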
4. TFCAAJ Algorithm Based on Q-Learning
This paper proposes a time-varying fading channel-aware anti-jamming (TFCAAJ) algorithm based on Q-learning. In response to the issues discussed above, the following improvements are introduced:
State representation optimization: In traditional Q-learning methods, states are usually represented by a fixed state space that does not fully account for dynamic channel changes and complex jamming patterns. To address this, we optimize the design of the state space by integrating the characteristics of the time-varying Rician fading channel and the time-frequency characteristics of the jamming. In each time slot, the state includes not only the current slot position but also the jamming-awareness flag and the channel gain, describing the environment more comprehensively. Specifically, the state triplet is updated in real time within each time slot, fully reflecting the coupling among the slot position, the jamming history, and the current channel quality. In this way, the algorithm can more accurately perceive the jamming and channel conditions in the current environment and optimize the decision-making process.
Reward function remodeling: In traditional Q-learning, the reward function is relatively simple, usually providing a fixed reward value based solely on the presence or absence of jamming. In a time-varying channel environment, however, the presence of jamming alone is insufficient to reflect actual system performance. We therefore design a dynamic reward function based on both the jamming and the channel state. The reward not only accounts for whether data is transmitted successfully but also incorporates the jamming intensity and the channel gain. Specifically, when the jamming is strong and the channel quality is poor, a transmission failure incurs a relatively high penalty; conversely, if the jamming is weak or the channel quality is good, the transmission succeeds and earns the corresponding reward. This improvement allows the reward function to adapt dynamically to the time-frequency variations of channel fading and jamming, better matching the requirements of practical communication.
At each time slot, the transmitter chooses to perform the "transmit" or "silent" action based on the current state and updates the state according to the received jamming feedback. Through the Q-learning mechanism, the transmitter continually adjusts its strategy to maximize the transmission success rate under jamming and to avoid unnecessary collisions and failures. Initially, the algorithm randomly initializes the Q values for each time slot. Subsequently, in each time slot, the transmitter acts according to the current state $s_t$ and the selected action $a_t$, while the receiver updates the next state $s_{t+1}$ and the immediate reward $R_{t+1}$ based on the jamming perception feedback. The receiver then applies the Q-value update formula to refine the current action value and selects the action $a_{t+1}$ for the next time slot:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right].$$
In this update, $\alpha$ denotes the learning rate, which controls how strongly new observations override old estimates.
The algorithm process is as follows (Algorithm 1):
| Algorithm 1 Time-varying Fading Channel-Awareness (TFCAAJ) |
| 1: Initialization: set $t = 1$ and the initial state $s_1$; for any $s \in S$ and $a \in A$, randomly initialize $Q(s, a)$. |
| 2: for $t = 1, 2, \ldots, T$ do |
| 3: The transmitter performs the action decided in the previous time slot (or the initial action) and the channel condition $g_t$ is measured. |
| 4: Detect whether jamming exists in the current environment during the perception sub-slot, obtaining $d_t$. |
| 5: The receiver calculates the immediate reward $R_{t+1}$ and the next state $s_{t+1}$ based on the jamming perception result $d_t$ and the channel state $g_{t+1}$. |
| 6: The receiver updates the Q value according to Formula (14). |
| 7: The receiver infers the next action according to the criterion $a_{t+1} = \arg\max_{a \in A} Q(s_{t+1}, a)$. |
| 8: The receiver generates a new strategy and passes the new strategy to the transmitter. |
| 9: end for |
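For concreteness, the sketch below puts the steps of Algorithm 1 together in a self-contained training loop; the jamming sensing, channel sampling, gain quantizer, and ε-greedy exploration are placeholder assumptions standing in for the perception sub-slot, the Rician/Jakes channel, and an exploration scheme not detailed here, rather than the paper's actual implementation.

```python
import random
from collections import defaultdict

# Self-contained sketch of the decision loop in Algorithm 1 (not the paper's code).
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1         # learning rate, discount, exploration rate
N, G_TH, R0, C0 = 10, 0.5, 1.0, 1.0       # cycle length, gain threshold, reward, penalty
Q = defaultdict(lambda: [0.0, 0.0])        # Q[state] = [Q(s, silent), Q(s, transmit)]

def sense_jamming(n: int) -> int:          # placeholder detection result d_t
    return int(random.random() < 0.2)

def sample_gain() -> float:                # placeholder envelope sample g_t
    return random.random() * 1.5

def quantize(g: float) -> int:             # hypothetical 4-level gain quantizer
    return min(int(g / 0.4), 3)

def step(n, f, g, action):
    """One time slot: sense jamming, reward the chosen action using the
    current-slot gain, then advance the slot index, flag, and channel gain."""
    d = sense_jamming(n)
    if action == 0:
        r = 0.0                            # silent: no gain, no loss
    elif d or g < G_TH:
        r = -C0                            # jammed slot or deep fade: failure
    else:
        r = R0                             # successful transmission
    n_next = n + 1 if n < N else 1         # slot index wraps at the cycle end
    f_next = 0 if n_next == 1 else max(f, d)   # flag latches within the cycle
    return r, n_next, f_next, sample_gain()

n, f, g = 1, 0, sample_gain()
for t in range(5000):
    state = (n, f, quantize(g))
    if random.random() < EPS:              # epsilon-greedy action selection
        action = random.randint(0, 1)
    else:
        action = int(Q[state][1] > Q[state][0])
    r, n, f, g = step(n, f, g, action)
    next_state = (n, f, quantize(g))
    # Q-learning update in the spirit of Formula (14)
    Q[state][action] += ALPHA * (r + GAMMA * max(Q[next_state]) - Q[state][action])
```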