Intent-Aware Collision Avoidance for UAVs in High-Density Non-Cooperative Environments Using Deep Reinforcement Learning

Xuchuan Liu; Yuan Zheng; Chenglong Li; Bo Jiang; Wenyong Gu

doi:10.3390/aerospace13020111

Abstract

Collision avoidance between unmanned aerial vehicles (UAVs) and non-cooperative targets (e.g., off-nominal operations or birds) presents significant challenges in urban air mobility (UAM). This difficulty arises due to the highly dynamic and unpredictable flight intentions of these targets. Traditional collision-avoidance methods primarily focus on cooperative targets or non-cooperative ones with fixed behavior, rendering them ineffective when dealing with highly unpredictable flight patterns. To address this, we introduce a deep reinforcement learning-based collision-avoidance approach leveraging global and local intent prediction. Specifically, we propose a Global and Local Perception Prediction Module (GLPPM) that combines a state-space-based global intent association mechanism with a local feature extraction module, enabling accurate prediction of short- and long-term flight intents. Additionally, we propose a Fusion Sector Flight Control Module (FSFCM) that is trained with a Dueling Double Deep Q-Network (D3QN). The module integrates both predicted future and current intents into the state space and employs a specifically designed reward function, thereby ensuring safe UAV operations. Experimental results demonstrate that the proposed method significantly improves mission success rates in high-density environments, with up to 80 non-cooperative targets per square kilometer. In 1000 flight tests, the mission success rate is 15.2 percentage points higher than that of the baseline D3QN. Furthermore, the approach retains an 88.1% success rate even under extreme target densities of 120 targets per square kilometer. Finally, interpretability analysis via Deep SHAP further verifies the decision-making rationality of the algorithm.

Keywords:

deep learning; deep reinforcement learning; unmanned aerial vehicle; collision avoidance

1. Introduction

In recent years, with the rapid development of UAV technology, a wide variety of UAV applications have emerged, and UAM has become a highly promising component of future transportation. UAM is an emerging low-altitude transportation model designed to provide efficient transportation services through UAVs and other aviation technologies. The prerequisite for the widespread adoption of UAM is that UAVs must complete flight missions safely and collision-free. Ensuring safe aircraft operations fundamentally relies on conflict resolution technology, which is divided into two stages [1,2]: (1) strategic conflict resolution, conducted pre-flight through planned flight paths and scheduling rules to minimize the need for real-time avoidance maneuvers, and (2) tactical conflict resolution, executed mid-flight via real-time adjustments to UAV flight parameters (e.g., heading, speed) to resolve conflicts with cooperative or non-cooperative targets rapidly and locally. Collision avoidance technology is the core of tactical conflict resolution; it serves as the last line of defense for flight safety, ensuring operational integrity when other preventive measures fail.

Existing collision avoidance approaches primarily include: 1. Geometric methods [3,4,5,6,7,8]. 2. Optimization algorithms [9,10,11,12]. 3. Intelligent algorithms [13,14,15,16,17,18,19,20]. Unlike cooperative targets, whose intentions and pre-planned routes can be communicated via technologies like automatic dependent surveillance broadcast system (ADS-B) and integrated into conflict resolution protocols, non-cooperative targets are inherently harder to address. In practical scenarios, unpredictable non-cooperative targets (e.g., off-nominal operations or birds) pose significant challenges due to their complex, dynamic behaviors and lack of fixed patterns; moreover, directly applying these methods to non-cooperative targets often results in low collision avoidance success rates and solutions prone to local optima.

The complexity of intent prediction arises because global intent and local intent, though interrelated and markedly distinct, require distinct analytical approaches. As shown in Figure 1, flight intent comprises a global intent (green line, denoting the bird’s overall objective of moving from the treetop to the farmland) and a local intent (yellow line, representing the localized adjustments driven by obstacles or wind). When predicting intent, two main categories of methods are typically distinguished: first, traditional algorithm-based approaches [21,22,23,24,25,26], which rely on kinematic models, and second, deep learning-based approaches [27,28,29], which employ machine learning algorithms. However, existing approaches typically focus on only global or local levels; while this suffices for targets whose intent changes little, it struggles to deliver accurate predictions for non-cooperative targets whose intent varies frequently.

Figure 1. Schematic diagram of flight intent and flight intent prediction process. Intent prediction is complex because global and local intents—linked yet distinct—need separate analyses. Global intent is the bird’s overall flight from treetop to farmland; local intent comprises short-term obstacle-avoiding maneuvers.

As such, to better predict such dynamic flight intents, we first model the global flight intent by feeding long-term global flight behaviors of non-cooperative targets into a state-space model network. Simultaneously, we incorporate the target’s current state by inputting short-term local flight behaviors into a localized intent prediction module focused on fine-grained adjustments. The predicted global and local intents are then used to generate the flight intent via feature fusion. Furthermore, we propose the FSFCM based on deep reinforcement learning. Specifically, by superimposing the current state with the predicted flight intent, the module constructs a fused state space, enabling the reinforcement learning framework to integrate real-time and predictive information. This ensures high collision avoidance success rates across diverse non-cooperative target scenarios with varying flight intents. In summary, the specific contributions of this article are as follows:

We construct the GLPPM, which enhances the capability to recognize and predict the intents of non-cooperative targets with diverse flight patterns. Specifically, we introduce a Global Association Block (GAB) to capture global intent and a Local Extraction Block (LEB) that focuses on recent local intent.
We construct an FSFCM that integrates information from both the current state space and the future-intent state space, thereby enabling the model to identify current intentions as well as future intentions, providing a better action basis for collision-avoidance strategies.
We provide extensive experimental validation of the superiority and generalization ability of our approach across a spectrum of detection and avoidance tasks. The numerical results show that, in terms of flight-mission success rate, the proposed method has a rate 15.2% higher than that of the original D3QN algorithm without flight-intent prediction. For flight-intent prediction, our method provides better support for flight missions and outperforms the LSTM-based algorithm by 1.5%. In a high-density operating environment with 120 non-cooperative targets per square kilometer, the proposed method still maintains a high success rate of 88.1%. Finally, we employ a Deep-SHAP-based interpretable machine-learning approach to analyze the collision-avoidance decision-making process and identify the underlying rationale.

The remainder of this paper is structured as follows: Section 2 reviews related research work; Section 3 presents the problem formulation and modeling; Section 4 introduces the core algorithms and background knowledge of the proposed methodology; Section 5 details the Global and Local Perception Prediction Module, as well as the state space, action space, and reward function design of the deep reinforcement learning algorithm in the Fusion Sector Flight Control Module, followed by an explanation of the algorithm workflow; Section 6 describes the design of the simulation environment, along with experimental tasks, procedures, and results; Section 7 summarizes the simulation results and outlines future research directions.

2. Related Works

2.1. Collision Avoidance Methods

To prevent collisions between UAVs and static obstacles, cooperative targets, or non-cooperative targets, existing collision avoidance methods are primarily categorized into geometry-based, optimization-based, and reinforcement learning-based approaches.

Geometry-based collision avoidance methods: These methods leverage obstacle geometric features and UAV kinematic constraints to calculate time-to-collision, ensuring safe distances between entities. Tan et al. [7] proposed a geometric collision avoidance strategy for UAV swarms by combining line-of-sight vectors with relative velocity vectors and integrating dynamic environmental constraints. By computing collision envelopes, UAVs identify safe maneuvering directions to avoid collisions while maintaining formation. Wolf et al. [8] introduced a hybrid method combining geometric collision avoidance with tracking control. When obstacles are detected, a bounding sphere is generated around them, defining a risk zone. UAVs then compute collision detection angles based on tangent lines to the sphere and their own motion direction to mitigate collision risks.

Optimization-based collision avoidance methods: These methods transform collision avoidance problems into mathematical optimization tasks by defining objective functions and constraints. These methods globally analyze conflict scenarios through numerical computation and optimization techniques, enabling optimal path planning in complex environments. LEE et al. [9] proposed an obstacle-based genetic algorithm integrating a direction factor, which reduces the search space, shortens computation time, and accelerates convergence to solve collision-free shortest-path planning for mobile agents in 2D spaces. Yao et al. [11] introduced the Dynamic Window Approach (DWA), which computes a velocity space (dynamic window) based on the current state. Within this window, velocity pairs (linear and angular velocities) satisfying dynamic constraints and collision-free criteria are sampled. By evaluating motion models and constraints, DWA generates safe and feasible trajectories through multi-set sampling.

Reinforcement learning (RL) is a machine learning technique that trains agents to interact with environments and learn optimal strategies through rewards or penalties. Yang et al. [15] employed a Markov Decision Process (MDP)-based approach using RL to solve MDP problems, providing aircraft separation assurance in urban air mobility environments. Cetin et al. [16] established an autonomous collision avoidance scenario for UAVs in the AirSim and Unreal Engine simulation platforms. They applied the DQN (Deep Q-Network) algorithm to train UAVs in real time, successfully integrating image and scalar information into the state input. Experiments demonstrated that this fused state space enables autonomous collision avoidance decisions and maneuvers in simulations. Additionally, improving deep RL algorithms themselves—for example, by modifying state space structures—can enhance UAV autonomous collision avoidance capabilities [17,18]. Zhang et al. [19] used an improved Actor-Critic network structure to achieve path tracking and navigation. The experimental results validated the feasibility of the proposed method and demonstrated its significant contribution to intelligent navigation in maritime transportation. Yan et al. [20] proposed a multi-agent reinforcement learning algorithm with a spatiotemporal attention mechanism, enhancing each UAV’s ability to learn state information representations in complex dynamic environments.

In summary, existing collision avoidance algorithms are limited by their specialized application scenarios and fail to account for environments with numerous non-cooperative targets exhibiting diverse flight intents. These methods lack the capability to integrate non-cooperative target intentions into collision avoidance strategies. To address this, this work constructs a Fusion Sector Flight Control Module based on a deep reinforcement learning model. By incorporating an improved state space design, the module enables the integration of current and predicted future states, thereby enhancing the system’s ability to avoid collisions with non-cooperative targets of varying flight intents.

2.2. Trajectory Prediction Methods

Based on the intelligence level of trajectory prediction algorithms, they can be broadly categorized into two types: traditional algorithm-based methods and deep learning-based methods.

In the category of traditional algorithms, Pepy et al. [21] reduced vehicle navigation accuracy errors by incorporating real vehicle dynamic models during the path planning phase. Compared to this, kinematic-based algorithms have simpler structures and are more frequently adopted in trajectory prediction tasks. The simplest examples include the Constant-Velocity (CV) and Constant-Acceleration (CA) models [22], which are often used in safety risk testing and assessment. Zhang et al. [23] proposed an improved Kalman filter that dynamically adjusts the model transition probability matrix in real-time, overcoming the limitation of the fixed transition matrix in traditional IMM algorithms, thereby improving the model switching speed and enhancing tracking accuracy.

In summary, traditional trajectory prediction algorithms rely on dynamics, kinematics, Kalman filtering, and Monte Carlo methods. These approaches offer low computational costs but face challenges in handling trajectory uncertainty and limited predictive capabilities for nonlinear systems.

In recent years, deep learning-based approaches have achieved significant progress in trajectory prediction tasks for dynamic targets such as vehicles, ships, and pedestrians. Yoon et al. [27] proposed an improved GRU algorithm, which was specially optimized for low-altitude maneuvering targets. Experiments demonstrated that this architecture has a significant advantage in accuracy. Nacar et al. [28] proposed a trajectory feature analysis-enhanced GRU neural network, which combines historical position and velocity estimation information as input to the model, improving prediction accuracy. Dai et al. [29] proposed a joint prediction neural network combining a CNN and LSTM, using simulation-generated trajectory data for spatiotemporal feature extraction. Experiments showed that this network effectively integrates the advantages of both neural networks, improving prediction accuracy.

In summary, existing trajectory prediction methods fail to address the challenges posed by non-cooperative targets with dynamic flight intents and complex trajectories. They also lack the capability to integrate global temporal information with local temporal patterns, which limits their ability to accurately predict target intentions. To overcome these limitations, this work adopts a deep learning-based trajectory prediction approach and introduces the GAB and LEB. These components enhance the model’s global information association capabilities and local memory retention abilities, enabling more precise prediction of complex flight behaviors.

3. Model Construction

This study focuses on the collision resolution problem between UAVs and non-cooperative targets with diverse flight intents in UAM scenarios. The core challenge lies in achieving two primary objectives. Safe avoidance: The UAV must evade non-cooperative targets with dynamic and unpredictable maneuvering behaviors. Autonomous navigation: The UAV must autonomously decide its trajectory while ensuring arrival at a designated destination. Formally, the conflict resolution problem is defined as follows:

\min J (u) = \sum_{t = t_{0}}^{n} \sum_{i = 0}^{m} R (x_{t}, X_{i})

(1)

x_{t + 1} = f (x_{t})

(2)

where

J

represents the total collision risk cost over the mission;

R

denotes the collision risk between the UAV and the i-th non-cooperative target

X_{i}

at time step

t

;

x_{t}

indicating the UAV state at time

t

;

n

being the total time steps required to reach the goal point;

m

representing the total number of non-cooperative targets;

f (x_{t})

describing the state transition dynamics that govern the UAV’s motion.

In our consideration, the method we propose will aim to limit the collision avoidance process within the two-dimensional space as much as possible, minimizing the risk of collision avoidance maneuvers affecting drones flying at other altitude layers. Vertical space will only be considered for collision avoidance when the two-dimensional space cannot accomplish it.

The UAV’s state at time step

t

is governed by the state transition function

f (x_{t})

:

\{\begin{array}{l} v_{t + 1} = v_{t} + a_{t} T \\ x_{t + 1} = x_{t} + v_{t} T c o s θ_{t} \\ y_{t + 1} = y_{t} + v_{t} T s i n θ_{t} \\ θ_{t + 1} = θ_{t} + ω_{t} T \end{array}

(3)

where

v_{t} \in [0.1, 10] m / s

denotes the UAV’s flight speed;

a_{t} \in [- 3, + 3] m / s^{2}

represents its acceleration;

θ_{t}

is the yaw angle relative to the x-axis;

ω_{t} \in [- π / 30, + π / 30] r a d / s

indicates the yaw angular velocity;

x_{t}

,

y_{t}

are the UAV’s Cartesian coordinates at time

t

.

Based on the above definitions, the conflict resolution problem between UAVs and non-cooperative targets with diverse flight intents in UAM scenarios can be formulated as a discrete-time optimal control problem. The objective is to minimize the total collision risk cost

J

by manipulating the UAV’s acceleration and yaw angular velocity while ensuring arrival at the designated target location.

4. Review of Traditional Methods

4.1. Introduction to the LSTM Algorithm

The LSTM network is a specialized type of RNN designed to address the gradient explosion or vanishing gradient problems commonly encountered in traditional RNNs when processing sequential data. LSTM introduces memory cells, which are special hidden states with the same shape as the standard hidden state, enabling the network to retain additional temporal information. They can be mathematically described by the following equation:

C_{t} = f_{t} \cdot C_{t - 1} + i_{t} \cdot \tilde{C_{t}}

(4)

where

C_{t}

is state of the memory cell;

i_{t}

and

f_{t}

denote the input and forget gates.

The LSTM network utilizes a memory cell state

C_{t}

, which is regulated by three key components: the forget gate function

f_{t}

, the input gate function

i_{t}

, and the candidate memory cell state

\tilde{C_{t}}

. To control the memory cell, multiple gating mechanisms are employed: First, the input gate determines when to input new data into the memory cell. Second, the output gate regulates the output of information from the memory cell. Finally, the forget gate resets or clears the memory cell’s stored information.

4.2. Introduction to the MAMBA Algorithm

To better capture global contextual information, Gu et al. [30] proposed the MAMBA algorithm, a deep learning method based on selective state space models (SSMs). By discretizing data to construct global state relationships, MAMBA reduces computational overhead while achieving superior global feature extraction capabilities compared to Transformers, enabling stronger preservation of contextual coherence. SSMs are traditional mathematical frameworks that describe system dynamics over time. They model system behavior through hidden variables called states, which effectively capture temporal dependencies in sequential data. A classical SSM defines two key equations, a state equation and observation equation:

h (t) = A h (t) + B x (t)

(5)

y (t) = C h (t)

(6)

where

h (t)

is the N-dimensional hidden state at time

t

, modeling the relationship between input

x (t)

and output

y (t)

; A, B, C are learnable parameters.

For discrete inputs, the zero-order hold technique discretizes data by maintaining signal values over intervals. A learnable parameter

Δ

represents the time interval between discrete steps. This process is defined by:

\bar{A} = \exp (Δ A)

(7)

\bar{B} = {(Δ A)}^{- 1} (\exp (Δ A) - I) \cdot Δ B

(8)

leading to the discrete-state equations:

h_{k} = \bar{A} h_{k - 1} + \bar{B} x_{k}

(9)

y_{k} = C h_{k}

(10)

4.3. Introduction to the D3QN Algorithm

The D3QN algorithm is a reinforcement learning framework that combines the advantages of Double DQN and Dueling DQN to address the overestimation bias inherent in traditional DQN algorithms. By decoupling action selection and Q-value estimation through two independent networks—the Training Network (used for selecting actions) and the Target Network (used for computing Q-values)—Double DQN mitigates overestimation bias, while Dueling DQN improves value function accuracy by separately estimating state-value and action-advantage components. The optimal Q-value function

Q_{t}

in D3QN is defined as:

Q_{t} = r_{t + 1} + γ Q (s_{t + 1}, a r g m a x_{a} Q (s_{t + 1}, a; ω_{e}); ω_{t})

(11)

where

r_{t + 1}

denotes the reward at time

t + 1

;

γ

is the discount factor;

s_{t + 1}

represents the state at time

t + 1

;

ω_{e}

is the parameter of the Training Network;

ω_{t}

denotes the parameters of the Target Network.

5. Method

5.1. Global and Local Awareness for Collision Avoidance

To enhance UAV collision avoidance capabilities against non-cooperative targets with diverse intents, this study proposes a Global and Local Awareness collision avoidance method tailored for dynamic conflict scenarios. The framework comprises two submodules: (1) the GLPPM, which predicts the intent of non-cooperative targets by analyzing historical and real-time data, and (2) the FSFCM, which leverages the D3QN-based deep reinforcement learning algorithm to process observed state-space sector information. The predicted intent is then converted into forecasted sectors within the state space, which are fed back into the FSFCM to generate maneuver commands for conflict resolution. Figure 2 below illustrates a schematic diagram of the proposed framework.

Figure 2. Schematic diagram of a deep reinforcement learning obstacle avoidance approach for unmanned aerial vehicles based on global and local flight intent prediction. The proposed method consists of two components. The first is the GLPPM, which senses and predicts the flight intent of non-cooperative targets by integrating their long-term global objectives with short-term local intents. The second component is the FSFCM, responsible for autonomous collision-avoidance decision-making and action execution. Based on a reinforcement learning algorithm, this module employs a specially designed state space—incorporating the flight intent predictions from the first module to endow the agent with future collision risk assessment capabilities—and a tailored reward function. The trained model learns effective avoidance strategies and automatically executes collision-avoidance maneuvers.

5.2. Global and Local Perception Prediction Module

The proposed GLPPM integrates the strengths of two sequence modeling architectures: LEB and GAB. Specifically, the LEB component employs differentiable gating mechanisms to construct a precise local temporal feature extractor, ensuring robust gradient propagation of critical historical information. Meanwhile, the GAB leverages its state space model-based linear-complexity architecture to efficiently capture long-term dependencies, with its parameter-sharing mechanism significantly reducing computational overhead in large-scale temporal data processing. By establishing cross-hierarchical feature interaction channels, this module dynamically couples global pattern recognition with localized detail modeling, achieving adaptive intent prediction for non-cooperative targets. A schematic illustration of the module’s architecture is provided in Figure 3.

Figure 3. The proposed GLPPM consists of LEB and GAB, which are responsible for local feature extraction and global association of flight intents, respectively. The LEB, built on long short-term memory networks, first extracts features that capture the non-cooperative target’s local intent. These feature matrices are then fed into the GAB, which employs a state-space model to identify the target’s global intent and explicitly correlate it with the previously extracted local intent.

The proposed GLPPM first processes temporal sequences through LEB, where LSTM units extract localized temporal features via gating mechanisms (including forget, input, and output gates) to capture dynamic dependencies between adjacent time steps. Subsequently, a fully connected layer performs linear transformation on the feature matrix, reshaping its dimensions to match the input format required by the GAB. The reshaped matrix is then fed into the GAB, where the MAMBA network leverages its SSM with linear-complexity architecture to efficiently model long-range dependencies. This integration preserves LSTM’s sensitivity to short-term dynamics while enhancing the model’s ability to perceive long-distance temporal associations through MAMBA’s global modeling capabilities. The above process can be mathematically described as follows:

Y = F_{G} (F_{C} (F_{L} (X)))

(12)

where

F_{G}

,

F_{C}

, and

F_{L}

denote the GAB, fully connected function, and LEB, respectively, with

X

representing the input temporal trajectory.

First, the LEB—comprising stacked LSTM layers—extracts local temporal features through gating mechanisms (forget gate

f_{t}

, input gate

i_{t}

, output gate

o_{t}

) and memory cell updates. At time step

t

, the hidden state

h_{t}

and memory cell state

C_{t}

are computed as:

h_{t} = F_{L} (C_{t}) = o_{t} \times t a n h (C_{t})

(13)

C_{t} = f_{t} C_{t - 1} + i_{t} \tilde{C_{t}}

(14)

where

\tilde{C_{t}} = \tanh (W_{c} [h_{t - 1}, x_{t}] + b_{c})

is the candidate memory state.

The gates are defined by:

f_{t} = σ (W_{i} [h_{t - 1}, x_{t}] + b_{f})

(15)

i_{t} = σ (W_{i} [h_{t - 1}, x_{t}] + b_{i})

(16)

o_{t} = σ (W_{o} [h_{t - 1}, x_{t}] + b_{o})

(17)

where

x_{t}

is the input sequence;

σ

denotes the sigmoid activation;

h_{t - 1}

is the previous hidden state;

W

and

b

are learnable weights and biases.

Next, the feature matrix is linearly transformed via

F_{C}

to match the input dimensions of the GAB, where the network computes:

y_{k} = F_{G} (h_{k}) = C h_{k}

(18)

h_{k} = \bar{A} h_{k - 1} + \bar{B} F_{C} (F_{L})

(19)

This integration ensures the model retains LSTM’s sensitivity to short-term dynamics while leveraging MAMBA’s selective state space model for efficient global dependency modeling, enabling accurate intent prediction for non-cooperative targets.

To train the GAB and LEB, we developed a non-cooperative target kinematic model. This model is based on the typical flight capabilities of birds and uncontrolled UAVs, where acceleration and yaw angular velocity vary randomly within the corresponding flight capability ranges, following a uniform distribution. This approach simulates the unpredictable behavior of non-cooperative targets with changing intentions. Using this model, we can simulate the flight trajectory data of non-cooperative targets. We collected these flight trajectory data as the dataset for prediction training. The collected data is then used to train the deep learning network. Once a well-performing model is obtained, it is applied in the intent prediction part of the reinforcement learning model.

5.3. Fusion Sector Flight Control Module

5.3.1. Structure of Fusion Sector Flight Control Module

The proposed FSFCM employs the D3QN-based deep reinforcement learning model for training and testing, with its specific architecture illustrated in the schematic diagram below. The dual-network mechanism in D3QN helps reduce overestimation bias, while the experience replay mechanism breaks data correlation and improves sample utilization efficiency. Figure 4 shows a schematic diagram of the Fusion Sector Flight Control Module.

Figure 4. Schematic diagram of Fusion Sector Flight Control Module based on the D3QN algorithm. Built upon the fusion state space, the model acquires the ability to integrate both the current risk-profile sectors and the future high-risk sectors, endowing it with a stronger perception and assessment of future collision risk. Complemented by a specially designed reward function, it guides the UAV to autonomously navigate toward the target waypoint while automatically avoiding non-cooperative targets encountered along the trajectory.

5.3.2. State Space

Traditional state space construction methods struggle to handle dynamic numbers of non-cooperative targets, as collision avoidance success rates heavily rely on fixed non-cooperative target counts—a limitation inconsistent with real-world UAV flight scenarios. To address this, we propose a sector-based state space framework: the UAV’s perception area (a 100 m radius circle centered on itself) is divided into

N

equal sectors, and only the nearest non-cooperative target within each sector is retained for threat modeling. This design assumes that multiple targets within the same sector pose equivalent directional threats relative to the UAV; thus, the closest target represents the sector’s overall risk.

In contrast to conventional state spaces—where each non-cooperative target consumes at least two dimensions (relative angle and distance)—our sector-based approach reduces dimensionality from

2 M

(for

M

targets) to

N

. For example, as illustrated in the figure below, dividing the perception area into 9 sectors requires only 9 dimensions, whereas a traditional method would demand at least 18. This simplification not only accommodates variable numbers of non-cooperative targets but also accelerates neural network convergence. The fusion state space is shown in Figure 5.

Figure 5. Composition diagram of fusion state space.

The proposed state space at time t consists of four components:

S_{t} = \{S_{t}^{1}, S_{t}^{2}, S_{t}^{3}, S_{t + 1}^{4}\}

(20)

where

S_{t}^{1}

represents the UAV’s own state information;

S_{t}^{2}

represents the relative position information between the UAV and the goal point;

S_{t}^{3}

represents the threat information of non-cooperative targets in each sector;

S_{t + 1}^{4}

represents the predicted threat information of non-cooperative targets in each sector.

Furthermore,

S_{t}^{1}

is composed of the direction angle and velocity of the UAV at time

t

, denoted as

θ_{t}

and

v_{t}

respectively.

S_{t}^{2}

is composed of the relative angle and normalized relative distance between the UAV and the goal point at time

t

, denoted as

Δ_{t}

and

d_{t}

respectively. Therefore,

S_{t}^{1}

and

S_{t}^{2}

can be represented as:

S_{t}^{1} = [θ_{t}, v_{t}]

(21)

S_{t}^{2} = [d_{t}, Δ_{t}]

(22)

The information in

S_{t}^{3}

consists of the normalized relative distances between the UAV and the nearest non-cooperative targets within each sector at time

t

. For the nth sector, this distance is denoted by

D_{n}

; if there are no non-cooperative targets in a sector, its value is 1. The information in

S_{t + 1}^{4}

consists of the predicted normalized relative distances between the UAV and the nearest non-cooperative targets within each sector at time

t + 1

. For the nth sector, this predicted distance is denoted by

D_{n}^{p}

. Thus,

S_{t}^{3}

and

S_{t + 1}^{4}

can be represented as:

S_{t}^{3} = [D_{1}, D_{2}, \dots, D_{n}]

(23)

S_{t + 1}^{4} = [D_{1}^{p}, D_{2}^{p}, \dots, D_{n}^{p}]

(24)

5.3.3. Action Space

At the initial moment, the algorithm will randomly select an initial action for the UAV. Afterward, the algorithm will choose the next action from the action space. Each action consists of a specific yaw angular velocity and acceleration to control the UAV’s direction and speed. According to real-world UAV maneuverability data, this paper limits the UAV’s speed within the range of

0.1 - 10 m / s

, sets the yaw angular velocity at

[- π / 30, 0, + π / 30] r a d / s

, and sets the acceleration at

[- 3, 0, + 3] m / s^{2}

. The detailed composition of each action in the action space is shown in Table 1 below:

Table 1. Action Space.

5.3.4. Reward Function

The reward function is the sole channel through which an agent receives environmental feedback in reinforcement learning, indicating whether an action taken in the current state is favorable or unfavorable. To enable the UAV to avoid all non-cooperative targets and successfully reach its designated destination, this paper primarily defines three reward components. The first is the mission reward guiding the UAV toward its goal point. The second is the collision avoidance reward activated when the UAV successfully circumvents non-cooperative targets at the current time step.

The composite reward function for the UAV is expressed as:

R = R_{1} + R_{2}

(25)

(1): Mission reward $R_{1}$

To drive the UAV toward the goal point, a corresponding reward is defined, expressed as:

R_{1} = R_{1}^{1} + R_{1}^{2} + R_{1}^{3}

(26)

Firstly,

R_{1}^{1}

steers the UAV’s heading—its directional orientation—squarely toward the goal point, formulated as:

R_{1}^{1} = \{\begin{array}{l} r_{1} i f Δ_{t} ϵ [0, \frac{π}{18}] \cup [\frac{35}{18} π, 2 π] a n d p = 0 \\ r_{2} i f Δ_{t} ϵ (\frac{π}{18}, \frac{2}{18} π] \cup [\frac{34}{18} π, \frac{35}{18} π) a n d p = 0 \\ 0 i f (Δ_{t} ϵ (\frac{9}{18} π, \frac{27}{18} π) a n d p = 0) o r p > 0 \\ - r_{3} i f Δ_{t} ϵ (\frac{2}{18} π, \frac{9}{18} π] \cup [\frac{27}{18} π, \frac{34}{18} π) a n d p = 0 \end{array}

(27)

where

p

denotes the number of non-cooperative targets within the detection range.

r_{1}

and

r_{2}

represent reward values assigned when the UAV is oriented toward the target location, while

r_{3}

signifies the penalty for failing to face the target direction.

Secondly,

R_{1}^{2}

guides the UAV to approach the target position, formulated as:

R_{1}^{2} = \{\begin{array}{l} d_{t - 1} - d_{t} i f - 20 < d_{t - 1} - d_{t} < 20 \\ 0 e l s e \end{array}

(28)

Finally,

R_{1}^{3}

provides rewards or penalties based on mission outcomes, defined as:

R_{1}^{3} = \{\begin{array}{l} r_{4} i f r e a c h g o a l \\ r_{5} i f o u t o f b o u n d s \\ r_{6} i f c o l l i s i o n \end{array}

(29)

where

r_{3}

,

r_{4}

, and

r_{5}

are constants.

(2): Collision avoidance reward $R_{2}$

The comprehensive collision avoidance rewards obtained by the UAV are as follows:

R_{2} = R_{2}^{1} + R_{2}^{1 p} + R_{2}^{2} + R_{2}^{2 p} + R_{2}^{3} + R_{2}^{3 p}

(30)

The terms

R_{2}^{1}

and

R_{2}^{1 p}

are designed to avoid non-cooperative targets within the UAV’s perception range. Specifically,

R_{2}^{1}

calculates the reward for the current time step’s perception range, while

R_{2}^{1 p}

evaluates the reward for the predicted

p

step future perception range. Their definitions are as follows:

R_{2}^{1} = \{\begin{array}{l} r_{7} i f d_{t}^{i} - d_{t - 1}^{i} \geq 0 \\ - r_{7} i f d_{t}^{i} - d_{t - 1}^{i} < 0 \end{array}

(31)

R_{2}^{1 p} = \{\begin{array}{l} r_{8} i f d_{p}^{i} - d_{p - 1}^{i} \geq 0 \\ - r_{8} i f d_{p}^{i} - d_{p - 1}^{i} < 0 \end{array}

(32)

where

r_{7}

represents the relative motion trend between the UAV and non-cooperative targets within each sector, and

r_{8}

represents the relative motion trend between the UAV and non-cooperative targets within each sector in predicted

p

step future.

d_{t}^{i}

and

d_{t - 1}^{i}

denote the distances between the UAV and the nearest non-cooperative target in the ith sector at time steps

t

and

t - 1

, respectively. The calculation logic for

R_{2}^{1 p}

follows the same principle.

The components

R_{2}^{2}

and

R_{2}^{2 p}

are designed to differentiate the threat levels of non-cooperative targets intruding into the UAV’s perception range. Specifically,

R_{2}^{2}

calculates the threat level for the current time step, while

R_{2}^{2 p}

evaluates the threat level for the predicted

p

step future time interval. Their definitions are as follows:

R_{2}^{2} = - (1 - \frac{d_{t}^{i}}{100})

(33)

R_{2}^{2 p} = - (1 - \frac{d_{p}^{i}}{100})

(34)

The terms

R_{2}^{3}

and

R_{2}^{3 p}

are designed to prevent non-cooperative targets from approaching the UAV too closely. Penalties are imposed when non-cooperative targets enter an unsafe proximity, defined as:

R_{2}^{3} = \{\begin{array}{l} 0 i f d_{t}^{i} > d_{s} \\ - r_{9} i f d_{t}^{i} \leq d_{s} \end{array}

(35)

R_{2}^{3 p} = \{\begin{array}{l} 0 i f d_{p}^{i} > d_{s} \\ - r_{10} i f d_{p}^{i} \leq d_{s} \end{array}

(36)

where

r_{9}

and

r_{10}

denote fixed penalty values, while

d_{s}

represents the predefined minimum safe distance.

Typically, the penalties for boundary violations

r_{5}

and crashes

r_{6}

are set to be very large because boundary violations and crashes are unacceptable. The reward for reaching

r_{4}

should be relatively small to prevent the positive reward from overly compensating for the other penalties. For other dense rewards related to collision avoidance and target guidance, the specific values should be kept very small, as these rewards accumulate at each simulation step. If the values are too large, they may overshadow the penalties for collisions and boundary violations. Specifically, the ratio of boundary violation penalty to crash penalty to destination arrival reward to other constant rewards can be set as 10,000:10,000:100:1. Additionally, the values of the constant rewards should be adjusted based on the environment. When there are a large number of non-cooperative targets in the flight scenario and the UAV faces significant collision avoidance pressure, we need to increase the penalties related to approaching non-cooperative targets and decrease the target guidance rewards to reduce the collision risk. Conversely, the penalties related to approaching non-cooperative targets should be reduced, and the target guidance reward should be increased.

6. Experiments

6.1. Experimental Setup

6.1.1. Simulation Scenario Configuration

In this section, experiments are conducted in a 2D planar environment to demonstrate the superiority of the proposed method. The UAV’s flight altitude is set to a realistic 120 m horizontal plane, with all movements constrained to this plane. The simulated airspace is a 1 km × 1 km area containing the UAV, non-cooperative targets, and goal points. At initialization, these entities are randomly generated at arbitrary positions within the airspace, with the UAV and non-cooperative targets assigned random initial directions and velocities. Non-cooperative targets are designed to perform variable acceleration and variable angular velocity motion with stochastic maneuvering patterns, simulating diverse flight intentions of non-cooperative entities. The flight mission is considered complete when the UAV enters a 10 m radius around the goal point. The simulation environment is illustrated in Figure 6:

Figure 6. Visualization of simulated training environment.

6.1.2. Training Parameter Settings

For deep learning networks and deep reinforcement learning models, hyperparameter configurations directly influence final training outcomes. To eliminate this impact, this study employs unified public hyperparameter settings, as shown in Table 2 and Table 3:

Table 2. Deep Learning Hyperparameter Configuration.

Table 3. Reinforcement Learning Hyperparameter Configuration.

All training and testing were conducted on a unified platform based on the Windows 11 operating system, equipped with an Intel i9-14900KF CPU, an ASUS Prime Z790-P WIFI motherboard, and an NVIDIA GeForce RTX 4090 D GPU.

During training, the reward value obtained by the agent served as a critical reference metric. The model underwent a total of 8000 epochs of training. In the initial 0–1000 epochs, the agent’s cumulative reward gradually increased, indicating that the agent was progressively learning strategies to maximize rewards. After the 1000th epoch, the reward stabilized and reached its maximum value. Detailed trends are illustrated in Figure 7 below.

Figure 7. Reward values.

6.2. Case 1: Comparison Study

This experiment evaluates how different network architectures in the GLPPM proposed in this paper affect the UAV’s mission success rate. It also compares the effectiveness of these modules with traditional LSTM and MAMBA algorithms in predicting intentions of non-cooperative targets for UAV collision avoidance assistance. All configurations mentioned above were integrated with the same fused sector flight control module for training and testing, aiming to assess which network architecture achieves higher flight mission success rates for the UAV. The network architecture is shown in Figure 8.

Figure 8. Schematic diagrams of the five network architectures. (a) Series structure 1; (b) Series structure 2; (c) Parallel structure; (d) MAMBA algorithm; (e) LSTM algorithm.

In this experiment, the number of non-cooperative targets is set to 80 per square kilometer. Considering the UAV’s maneuverability, both the UAV and non-cooperative targets are assigned a maximum speed of 10 m/s. For each group, the best-performing trained strategy is selected for flight mission testing. Detailed results are summarized in Table 4:

Table 4. GLPPM Support Effect Test for Flight Missions.

The success rate refers to the percentage of completed flight missions (i.e., reaching the goal point without collisions or boundary violations) relative to the total number of attempted missions. The test results clearly indicate that the proposed GLPPM outperforms alternative architectures in a high-density non-cooperative target environment. Furthermore, the serial integration of the LEB and GAB demonstrates superior performance compared to their parallel configuration. Notably, combining the LEB with the GAB achieves better results than using either MAMBA or LSTM alone.

6.3. Case 2: Comparison Experiment of Prediction Methods

This study aims to verify that the proposed method outperforms the RT-GRU and AIMM-IAKF methods, which are used as comparison methods, in providing non-cooperative target predicted locations and assisting reinforcement learning in collision avoidance. Specifically, the RT-GRU method [27] is based on the GRU algorithm, while the AIMM-IAKF method [23] relies on the Kalman method.

We trained the three methods mentioned above in a flight task environment with 80 non-cooperative targets per square kilometer and conducted 1000 flight mission tests. The success rates are shown in Table 5:

Table 5. Comparison results of prediction methods.

The experimental results demonstrate that the proposed method achieves a higher task success rate. Compared to the RT-GRU algorithm, which also uses deep learning, the proposed method better extracts global and local flight intentions, making more accurate non-cooperative target position predictions and providing stronger data support for reinforcement learning collision avoidance decisions. The AIMM-IAKF algorithm, based on traditional methods, performs poorly in handling the high-density, high uncertainty in flight intentions environment proposed in this paper, highlighting the challenges traditional methods face in predicting tasks for nonlinear systems.

6.4. Case 3: Ablation Study

This experiment aims to validate the effectiveness and contribution of the proposed GLPPM combined with the fused state space architecture to the overall system performance. The experimental setup maintains a constant density of 80 non-cooperative targets per square kilometer and a maximum flight speed of 10 m/s. By incrementally integrating system components and evaluating the modified architectures, the validity and contribution of each component are demonstrated. For each model modification, the best-performing trained strategy is selected for testing. The detailed results are summarized in Table 6:

Table 6. Ablation Study Results.

Firstly, it is evident that incorporating the proposed GLPPM into the baseline D3QN reinforcement learning model (referred to as “Baseline”) and replacing the non-cooperative targets’ current positions in the original D3QN state space with their predicted future positions significantly improves flight strategy learning, leading to a substantial increase in the success rate. This confirms the module’s effectiveness in aiding UAV collision avoidance. Subsequently, replacing the original state space with the fused state space— which integrates both current and predicted future information of non-cooperative targets—increases the mission success rate to 94.7%. This demonstrates that the fused state space provides a more nuanced representation of risk scenarios, offering enhanced auxiliary information for collision avoidance decision-making. Furthermore, the proposed method maintains a high task success rate even under the challenging condition of 80 non-cooperative targets per square kilometer, proving the system’s robustness and stability.

6.5. Case 4: Statistically Significant Study

This study aims to verify that the proposed method significantly outperforms the baseline D3QN method in the flight arrival rate indicator. We conducted flight mission tests for both the proposed method and the baseline in an environment with 80 non-cooperative targets per square kilometer. Every 100 tests were considered as one data sample, and we collected 100 samples for each method. The sample results are shown in Figure 9. We then formulated hypotheses, with the null hypothesis being that there is no significant difference between the two sets of sample data and the success rates are the same. The alternative hypothesis is that there is a significant difference between the two sets of sample data. We set the significance level to 0.05. We then performed a Mann–Whitney U test on the two sets of sample data, and the results are shown in Table 7:

Figure 9. D3QN and GLPPM + FSFCM flight arrival rate test sample plot.

Table 7. Statistically Significant Study Results.

The U statistic is 1, and this value is close to 0, which means that the observed data of GLPPM + FSFCM is almost entirely greater than that of D3QN. The test results show that the p-value is clearly less than 0.05; thus we reject the null hypothesis and accept the alternative hypothesis. There is a significant difference in the flight arrival rate performance between the proposed method and the baseline method. The statistical results in Figure 9 demonstrate that the proposed method consistently achieves a higher arrival rate.

6.6. Case 5: Density Stress Test

This experiment is designed to validate the system’s capability to perform conflict resolution under varying densities and flight speeds of non-cooperative targets, including ultra-high-density scenarios. The system is trained and tested in environments with different configurations, using the mission success rate as the primary evaluation metric to assess conflict resolution performance. The test results are shown in Table 8.

Table 8. Density Stress Test Results.

The test results demonstrate that the system achieves a high mission success rate when operating in environments with a maximum flight speed of 10 m/s and 60 non-cooperative targets per square kilometer. Additionally, the results show that the maximum flight speed significantly impacts mission success: higher speeds correlate with lower success rates. This indicates that increased UAV velocity reduces the time window for conflict resolution, aligning with real-world logic. Similarly, higher densities of non-cooperative targets lead to lower success rates, reflecting the increased difficulty of conflict resolution in congested airspace—a result consistent with practical observations. Notably, the system maintains a 75% mission success rate even under high-speed and ultra-high-density conditions, demonstrating the effectiveness of the proposed conflict resolution strategies in challenging scenarios.

6.7. Local Explanation

This experiment aims to analyze the impact of individual features on model decision-making in specific scenarios, thereby enhancing model interpretability. A widely adopted machine learning explanation framework is based on Shapley values, which quantify the contribution of each feature to the model’s predictions. The SHAP framework [31] provides a unified measure to explain how input features influence the model’s output. By analyzing feature contributions, the decision-making process becomes more transparent, and the model’s preferences can be effectively revealed. Lundberg et al. [32] proposed the Deep SHAP method, which combines SHAP theory with deep learning by estimating average feature contributions through multiple reference points to compute Shapley values. Deep SHAP is model-agnostic and applicable to any deep learning architecture. Experimental results demonstrate its ability to reasonably interpret decision processes in deep learning models.

The specific SHAP values are presented in Table 9. For the first sample (as shown in Figure 10), the UAV has a non-cooperative target in Sector 1 and a relative angle of 320° counterclockwise toward the goal point. Since Sector 1 (located at the UAV’s left front) contains a non-cooperative target, this feature contributes to the decision to decelerate, while other sectors without non-cooperative targets favor acceleration. In addition, based on the predicted flight intent, the non-cooperative target within Sector 1 will move away from the UAV in the future, so the incentive for slowing the UAV down will correspondingly diminish. Ultimately, the UAV selects deceleration to avoid the non-cooperative target. Additionally, due to the presence of a non-cooperative target on the left side and the goal point’s position on the right, the target location contributes substantially to a right-turn decision. The UAV ultimately executes a right turn to resolve the conflict.

Table 9. SHAP values of different features in 4 samples.

Figure 10. We selected four representative UAV decision-making samples, employed sector schematic diagrams to illustrate the operational scenarios faced by the UAV, and visualized the SHAP feature values corresponding to each sample.

For the second sample, the UAV has non-cooperative targets in Sectors 4 and 5 and a relative angle of 200° counterclockwise toward the goal point. Non-cooperative targets exist in Sectors 4 and 5, located at the UAV’s rear. Notably, Sector 5 contains a non-cooperative target at an extremely close proximity, which significantly contributes to the decision to accelerate. However, the goal point is positioned behind the UAV, favoring deceleration to reach it, thereby counteracting the acceleration tendency. Furthermore, according to the flight intent prediction, the non-cooperative target in sector 4 will move away from the UAV, whereas the one in Sector 5 will approach it. Consequently, the future contribution of Sector 4 to accelerating the UAV diminishes, while that of Sector 5 increases. Ultimately, to prioritize safety and increase distance from non-cooperative targets, the UAV executes an acceleration maneuver. Additionally, the presence of a non-cooperative target in the left-rear Sector 4 drives the UAV to perform a right-turn decision for collision avoidance.

For the third sample, the UAV has non-cooperative targets in Sectors 4–6, with a relative angle of 150° counterclockwise toward the goal point. Non-cooperative targets exist in Sectors 4–6 (located at the UAV’s rear), and Sectors 4 and 5 contain targets at extremely close proximity. This significantly contributes to the decision to accelerate. However, the goal point lies behind the UAV, favoring deceleration to approach it, thereby counteracting the acceleration tendency. Furthermore, based on the flight intent prediction, the non-cooperative target in Sector 4 will move away from the UAV and enter Sector 5, while the one in Sector 5 will approach the UAV and enter Sector 4; the non-cooperative target in Sector 6 will move away from the UAV. Consequently, the future contribution of sector 4 to accelerating the UAV increases, whereas those of Sectors 5 and 6 decrease. Ultimately, to prioritize safety and increase separation from non-cooperative targets, the UAV chooses to maintain a straight course while accelerating. Additionally, the presence of non-cooperative targets in the left-rear Sector 4 and right-rear Sector 6 neutralizes turning tendencies, leading to a decision to continue straight.

For the fourth sample, the UAV has a non-cooperative target in Sector 7, with a relative angle of 320° counterclockwise toward the goal point. Since Sector 7 (located at the UAV’s right front) contains a non-cooperative target, this strongly contributes to the decision to decelerate, while other sectors without non-cooperative targets favor acceleration. Furthermore, based on the flight intent prediction, the non-cooperative target in sector 7 will maintain an almost constant distance from the UAV in the future; therefore, the incentive for the UAV to decelerate will remain almost unchanged. Ultimately, to avoid the non-cooperative target, the UAV selects a deceleration maneuver. Similarly, the presence of a non-cooperative target on the right side encourages a left-turn decision, while the goal point’s position on the UAV’s right side counteracts this maneuver. To prioritize collision avoidance, the UAV ultimately executes a left turn.

6.8. Global Explanation

This experiment aims to evaluate how the model makes decisions under diverse scenarios and analyze its behavioral preferences in different situations.

(1): UAV yaw decisions are influenced by the positions of the goal point and non-cooperative targets.

Regarding Figure 11a, it illustrates the influence of the spatial distribution of non-cooperative targets on the drone’s right-turn maneuver. Each dot in the figure represents the SHAP value contributed by a non-cooperative target located at that point toward the drone’s decision to turn right. It is evident that the location of non-cooperative targets significantly impacts the agent’s decision-making. When the agent selects a right-turn action, non-cooperative targets in Sectors 1–5 exert a notable influence, with the impact intensity increasing as the distance decreases. This indicates that targets in this direction encourage the agent to perform a right-turn maneuver. As the distance between the agent and the non-cooperative target increases, the influence diminishes. When the goal point approaches the agent’s perception boundary, the encouraging effect nearly disappears. Regarding Figure 11b, it shows the impact of the spatial distribution of non-cooperative targets on the drone’s left-turn maneuver; non-cooperative targets in Sectors 4–9 also promote left-turn actions. These results demonstrate that the UAV adopts appropriate turning maneuvers to avoid non-cooperative targets.

Figure 11. Analysis of UAV yaw decision-making, where (a,b) are the results of UAV yaw decision affected by non-cooperative target position, (c,d) are the results of UAV yaw decision affected by goal position.

Regarding Figure 11c, it depicts how the spatial distribution of goal points influences the drone’s straight-ahead action; the goal point located in Sector 9 (directly ahead on the right) has the strongest influence on this decision, with the impact intensity increasing as the distance decreases. This indicates that targets in this direction encourage straight movement. As the clockwise angle between the goal point and the agent’s heading increases, the influence weakens. When the goal point is positioned behind the agent, the encouraging effect vanishes. Further, as the angle increases, targets on the agent’s left side begin to suppress straight movement. The strongest suppression occurs when the goal point is in Sector 1 (directly ahead on the left), with the suppression effect intensifying as the distance decreases. Regarding Figure 11d, it shows how the spatial distribution of goal points influences the drone’s left-turn maneuver; the goal point located in Sector 1 (directly ahead on the left) has the greatest impact on this decision, with influence intensity increasing as the distance decreases. This indicates that targets in this direction encourage a left-turn maneuver. As the counterclockwise angle between the goal point and the agent’s heading increases, the influence weakens. When the goal point is positioned behind the agent, the encouraging effect disappears. Further, as the angle increases, targets on the agent’s right side begin to suppress left-turn actions. The strongest suppression occurs when the goal point is in Sector 9 (directly ahead on the right), with the suppression effect intensifying as the distance decreases.

(2): UAV acceleration decisions are influenced by the positions of both the goal point and non-cooperative targets.

Regarding Figure 12a, it illustrates the influence of the spatial distribution of non-cooperative targets on the drone’s deceleration maneuver. Each point in the figure represents the SHAP value contributed by a non-cooperative target located at that position toward the drone’s decision to slow down. When non-cooperative targets are located in the front sectors (Sectors 1, 2, 8, and 9) of the UAV, they significantly impact the decision to decelerate, with the influence intensifying as the distance decreases. This indicates that non-cooperative targets in the forward direction strongly encourage the UAV to decelerate. When non-cooperative targets are near the UAV’s perception boundary, the encouraging effect diminishes to nearly zero. Regarding Figure 12b, it illustrates how the acceleration maneuver of the UAV is influenced by the positional distribution of non-cooperative targets. When non-cooperative targets are present in the adjacent sectors (Sectors 2–8) of the UAV, they significantly promote the decision to accelerate, with the influence increasing as the distance decreases. This suggests that non-cooperative targets in non-frontal directions (excluding the forward sectors) encourage acceleration. However, this encouraging effect also vanishes when the targets approach the perception boundary.

Figure 12. UAV acceleration decision analysis, where (a,b) are the results of UAV acceleration decision affected by non-cooperative target position, (c,d) are the results of UAV acceleration decision affected by goal position.

Regarding Figure 12c,d, they, respectively, depict the UAV performing a speed-holding maneuver and an acceleration maneuver, both influenced by the positional distribution of the goal points. Beyond a certain distance, the goal point’s position generally suppresses acceleration. The suppression effect strengthens as the distance decreases. Within a certain proximity threshold, however, the goal point begins to encourage acceleration, with the promoting effect intensifying as the distance shrinks. Notably, the influence varies by action type: the range within which the goal point encourages the speed-holding maneuver is comparatively small, whereas the range that encourages the acceleration maneuver is significantly larger. This is because higher UAV speeds reduce the time window for collision avoidance maneuvers, and collision risks persist near the goal point. Consequently, the UAV only prioritizes acceleration when the goal point is sufficiently close. Conversely, to mitigate collision risks, the UAV cautiously regulates its speed most of the time. For the speed-holding maneuver, when the goal point is far from the UAV, the UAV should change its speed instead of maintaining it; consequently, viewed overall, the inhibitory region far exceeds the encouraging region, and the goal point more strongly tends to inhibit the UAV’s speed-holding action.

7. Conclusions

To cope with the highly dynamic intents of non-cooperative targets, we integrate the GLPPM for intent forecasting and the FSFCM that feeds these forecasts into a D3QN agent with a safety-oriented reward. The GLPPM, by combining deep learning neural networks with different strengths, is capable of extracting both the global and local flight intentions of non-cooperative targets. Meanwhile, the FSFCM is specifically designed to fully utilize the prediction capabilities of the GLPPM and monitor non-cooperative target threats in the current and future environments. It does so by processing a variable number of non-cooperative targets in a fused sector state space, significantly enhancing the model’s ability to perceive safety risks. Extensive experiments corroborate the value of this design. In a baseline comparison with 80 non-cooperative targets per km², the GLPPM-augmented D3QN achieved a 15.2 percentage point higher success rate than the original D3QN in 1000 flight missions. Comparative experiments demonstrate that the proposed method can better extract the global and local intentions of non-cooperative targets, providing a reference for collision avoidance strategies. An ablation study further reveals that the GLPPM alone contributes 9.6%, while the fused state space adds another 5.6%. Density stress tests show that the method proposed in this paper performs well under various non-cooperative target densities and different flight speed conditions. Deep SHAP analyses verify the decision-making rationality of the algorithm. Future work will extend the framework to real-world urban environments containing both dynamic non-cooperative agents and dense static obstacles. We will pay particular attention to the reaction speed requirements of reinforcement learning algorithms in real-world scenarios and the challenges posed by cooperative collision avoidance with other UAVs. These challenges require the algorithm to have a higher level of optimization and more refined communication and cooperation rules between UAVs. In addition, we will improve the UAV collision avoidance environment in three-dimensional space in future work.

Author Contributions

Conceptualization, W.G.; methodology, Y.Z.; software, X.L.; validation, X.L.; writing—original draft, X.L.; writing—review and editing, B.J., W.G., Y.Z. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 52502410).

Data Availability Statement

The data presented in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
UAM	Urban Air Mobility
GLPPM	Global and Local Perception Prediction Module
FSFCM	Fusion Sector Flight Control Module
D3QN	Dueling Double Deep Q-Network
GAB	Global Association Block
LSTM	Long Short-Term Memory
LEB	Local Extraction Block
DWA	Dynamic Window Approach
RL	Reinforcement Learning
MDP	Markov Decision Process
DQN	Deep Q-Network
CV	Constant-Velocity
CA	Constant-Acceleration
RNN	Recurrent Neural Network
LLM	Large Language Model
GRU	Gated Recurrent Unit
SAE	Stacked Autoencoders
$J$	Total collision risk cost over the mission, -
$R$	Collision risk between the UAV and the i-th non-cooperative target, -
$X_{i}$	Non-cooperative target i, -
$x_{t}$	UAV state at time $t$ , -
$n$	Total time steps required to reach the goal point, -
$m$	Total number of non-cooperative targets, -
$f (x_{t})$	State transition dynamics that govern the UAV’s motion, -
$v_{t}$	UAV’s flight speed, $m / s$
$a_{t}$	UAV’s acceleration, $m / s^{2}$
$θ_{t}$	Yaw angle relative to the x-axis, $r a d$
$ω_{t}$	Yaw angular velocity, $r a d / s$
$C_{t}$	State of the memory cell, -
$i_{t}$	Input gates, -
$f_{t}$	Forget gates, -
$\tilde{C_{t}}$	Candidate memory cell state, -
$h (t)$	N-dimensional hidden state at time $t$ , -
A, B, C	Learnable parameters, -
$Δ$	Time interval between discrete steps, -
$Q_{t}$	Optimal Q-value function, -
$r_{t + 1}$	Reward at time $t + 1$ , -
$γ$	Discount factor, -
$s_{t + 1}$	State at time $t + 1$ , -
$ω_{e}$	Parameter of the Training Network, -
$ω_{t}$	Parameters of the Target Network, -
$R_{1}$	Mission reward, -
$R_{2}$	Collision avoidance reward, -
$Δ_{t}$	Relative angle between UAV and goal point at time $t$ $, r a d$
$d_{t}$	Normalized relative distance between UAV and goal point at time $t$ , -
$p$	Number of non-cooperative targets within the detection range, -

References

Netjasov, F. Framework for airspace planning and design based on conflict risk assessment: Part 1: Conflict risk assessment model for airspace strategic planning. Transp. Res. Part C Emerg. Technol. 2012, 24, 190–212. [Google Scholar] [CrossRef]
Netjasov, F. Framework for airspace planning and design based on conflict risk assessment: Part 2: Conflict risk assessment model for airspace tactical planning. Transp. Res. Part C Emerg. Technol. 2012, 24, 213–226. [Google Scholar] [CrossRef]
Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
Chakravarthy, A.; Ghose, D. Obstacle avoidance in a dynamic environment: A collision cone approach. IEEE Trans. Syst. Man. Cybern. Part A Syst. Humans 2002, 28, 562–574. [Google Scholar] [CrossRef]
Alexopoulos, A.; Kandil, A.; Orzechowski, P.; Badreddin, E. A comparative study of collision avoidance techniques for unmanned aerial vehicles. In Proceedings of the 2013 IEEE International Conference on Systems, Man and Cybernetics, Manchester, UK, 13–16 October 2013; pp. 1969–1974. [Google Scholar] [CrossRef]
Payal, A.; Akashdeep; Singh, C.R. A summarization of collision avoidance techniques for autonomous navigation of UAV. In Proceedings of the International Conference on Unmanned Aerial System in Geomatics, Roorkee, India, 6–7 April 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 393–401. [Google Scholar] [CrossRef]
Tan, C.Y.; Huang, S.; Tan, K.K.; Teo, R.S.H. Three dimensional collision avoidance for multi unmanned aerial vehicles using velocity obstacle. J. Intell. Robot. Syst. 2020, 97, 227–248. [Google Scholar] [CrossRef]
Wolf, M.T.; Burdick, J.W. Artificial potential functions for highway driving with collision avoidance. In Proceedings of the 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, 19–23 May 2008; pp. 3731–3736. [Google Scholar] [CrossRef]
Lee, H.Y.; Shin, H.; Chae, J. Path planning for mobile agents using a genetic algorithm with a direction guided factor. Electronics 2018, 7, 212. [Google Scholar] [CrossRef]
Kuo, P.H.; Li, T.H.S.; Chen, G.Y.; Ho, Y.F.; Lin, C.J. A migrant-inspired path planning algorithm for obstacle run using particle swarm optimization, potential field navigation, and fuzzy logic controller. Knowl. Eng. Rev. 2017, 32, e5. [Google Scholar] [CrossRef]
Yao, M.; Deng, H.; Feng, X.; Li, P.; Li, Y.; Liu, H. Improved dynamic windows approach based on energy consumption management and fuzzy logic control for local path planning of mobile robots. Comput. Ind. Eng. 2024, 187, 109767. [Google Scholar] [CrossRef]
Pérez-Carabaza, S.; Scherer, J.; Rinner, B.; López-Orozco, J.A.; Besada-Portas, E. UAV trajectory optimization for Minimum Time Search with communication constraints and collision avoidance. Eng. Appl. Artif. Intell. 2019, 85, 357–371. [Google Scholar] [CrossRef]
Polvara, R.; Sharma, S.; Wan, J.; Manning, A.; Sutton, R. Obstacle avoidance approaches for autonomous navigation of unmanned surface vehicles. J. Navig. 2018, 71, 241–256. [Google Scholar] [CrossRef]
Boivin, E.; Desbiens, A.; Gagnon, E. UAV collision avoidance using cooperative predictive control. In Proceedings of the 2008 16th Mediterranean Conference on Control and Automation, Ajaccio, France, 25–27 June 2008; pp. 682–688. [Google Scholar] [CrossRef]
Yang, X.; Wei, P. Scalable multi-agent computational guidance with separation assurance for autonomous urban air mobility. J. Guid. Control. Dyn. 2020, 43, 1473–1486. [Google Scholar] [CrossRef]
Çetin, E.; Barrado, C.; Muñoz, G.; Macias, M.; Pastor, E. Drone navigation and avoidance of obstacles through deep reinforcement learning. In Proceedings of the 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), San Diego, CA, USA, 8–12 September 2019; pp. 1–7. [Google Scholar] [CrossRef]
Wan, K.; Gao, X.; Hu, Z.; Wu, G. Robust motion control for UAV in dynamic uncertain environments using deep reinforcement learning. Remote Sens. 2020, 12, 640. [Google Scholar] [CrossRef]
Pham, D.T.; Tran, N.P.; Alam, S.; Duong, V.; Delahaye, D. A machine learning approach for conflict resolution in dense traffic scenarios with uncertainties. In Proceedings of the 2019 Thirteenth USA/Europe Air Traffic Management Research and Development Seminar, Vienna, Austria, 17–21 June 2019. [Google Scholar]
Zhang, G.; Li, Z.; Li, J.; Shu, Y.; Zhang, X. Reinforcement learning-driven autonomous navigation strategy for rotor-assisted vehicles via integral event-triggered mechanism. Transp. Res. Part D Transp. Environ. 2025, 146, 104841. [Google Scholar] [CrossRef]
Yan, C.; Wang, C.; Zhou, H.; Xiang, X.; Wang, X.; Shen, L. Multi-agent reinforcement learning with spatial–temporal attention for flocking with collision avoidance of a scalable fixed-wing UAV fleet. IEEE Trans. Intell. Transp. Syst. 2025, 26, 1769–1782. [Google Scholar] [CrossRef]
Pepy, R.; Lambert, A.; Mounier, H. Reducing navigation errors by planning with realistic vehicle model. In Proceedings of the 2006 IEEE Intelligent Vehicles Symposium, Meguro-Ku, Japan, 13–15 June 2006; pp. 300–307. [Google Scholar] [CrossRef]
Ammoun, S.; Nashashibi, F. Real time trajectory prediction for collision risk estimation between vehicles. In Proceedings of the 2009 IEEE 5th International Conference on Intelligent Computer Communication and Processing, Cluj-Napoca, Romania, 27–29 August 2009; pp. 417–422. [Google Scholar] [CrossRef]
Zhang, P.; Liu, C.; Ji, Y.; Wang, Z.; Li, Y. Enhanced UAV trajectory tracking using AIMM-IAKF with adaptive model transition probability. Appl. Sci. 2025, 15, 11111. [Google Scholar] [CrossRef]
Jin, B.; Jiu, B.; Su, T.; Liu, H.; Liu, G. Switched Kalman filter-interacting multiple model algorithm based on optimal autoregressive model for manoeuvring target tracking. IET Radar Sonar Navig. 2015, 9, 199–209. [Google Scholar] [CrossRef]
Krozel, J.; Andrisani, D. Intent inference with path prediction. J. Guid. Control. Dyn. 2006, 29, 225–236. [Google Scholar] [CrossRef]
Zhang, H.; Yan, Y.; Li, S.; Hu, Y.; Liu, H. UAV behavior-intention estimation method based on 4-D flight-trajectory prediction. Sustainability 2021, 13, 12528. [Google Scholar] [CrossRef]
Yoon, S.; Jang, D.; Yoon, H.; Park, T.; Lee, K. GRU-Based Deep Learning Framework for Real-Time, Accurate, and Scalable UAV Trajectory Prediction. Drones 2025, 9, 142. [Google Scholar] [CrossRef]
Nacar, O.; Abdelkader, M.; Ghouti, L.; Gabr, K.; Al-Batati, A.; Koubaa, A. VECTOR: Velocity Enhanced GRU Neural Network for Real Time 3D UAV Trajectory Prediction. arXiv 2024, arXiv:2410.23305. [Google Scholar] [CrossRef]
Dai, R.; Ruan, J.; Liu, S.; Chen, H. Trajectory deviation prediction of UAV formation by joint neural networks. Appl. Sci. 2025, 15, 382. [Google Scholar] [CrossRef]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Lipovetsky, S.; Conklin, M. Analysis of regression in game theory approach. Appl. Stoch. Model. Bus. Ind. 2001, 17, 319–330. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]

Figure 1. Schematic diagram of flight intent and flight intent prediction process. Intent prediction is complex because global and local intents—linked yet distinct—need separate analyses. Global intent is the bird’s overall flight from treetop to farmland; local intent comprises short-term obstacle-avoiding maneuvers.

Figure 2. Schematic diagram of a deep reinforcement learning obstacle avoidance approach for unmanned aerial vehicles based on global and local flight intent prediction. The proposed method consists of two components. The first is the GLPPM, which senses and predicts the flight intent of non-cooperative targets by integrating their long-term global objectives with short-term local intents. The second component is the FSFCM, responsible for autonomous collision-avoidance decision-making and action execution. Based on a reinforcement learning algorithm, this module employs a specially designed state space—incorporating the flight intent predictions from the first module to endow the agent with future collision risk assessment capabilities—and a tailored reward function. The trained model learns effective avoidance strategies and automatically executes collision-avoidance maneuvers.

Figure 3. The proposed GLPPM consists of LEB and GAB, which are responsible for local feature extraction and global association of flight intents, respectively. The LEB, built on long short-term memory networks, first extracts features that capture the non-cooperative target’s local intent. These feature matrices are then fed into the GAB, which employs a state-space model to identify the target’s global intent and explicitly correlate it with the previously extracted local intent.

Figure 4. Schematic diagram of Fusion Sector Flight Control Module based on the D3QN algorithm. Built upon the fusion state space, the model acquires the ability to integrate both the current risk-profile sectors and the future high-risk sectors, endowing it with a stronger perception and assessment of future collision risk. Complemented by a specially designed reward function, it guides the UAV to autonomously navigate toward the target waypoint while automatically avoiding non-cooperative targets encountered along the trajectory.

Figure 5. Composition diagram of fusion state space.

Figure 6. Visualization of simulated training environment.

Figure 7. Reward values.

Figure 8. Schematic diagrams of the five network architectures. (a) Series structure 1; (b) Series structure 2; (c) Parallel structure; (d) MAMBA algorithm; (e) LSTM algorithm.

Figure 9. D3QN and GLPPM + FSFCM flight arrival rate test sample plot.

Figure 10. We selected four representative UAV decision-making samples, employed sector schematic diagrams to illustrate the operational scenarios faced by the UAV, and visualized the SHAP feature values corresponding to each sample.

Figure 11. Analysis of UAV yaw decision-making, where (a,b) are the results of UAV yaw decision affected by non-cooperative target position, (c,d) are the results of UAV yaw decision affected by goal position.

Figure 12. UAV acceleration decision analysis, where (a,b) are the results of UAV acceleration decision affected by non-cooperative target position, (c,d) are the results of UAV acceleration decision affected by goal position.

Table 1. Action Space.

	$- 3$	$0$	$3$
$ω_{t} (r a d / s)$	$- 3$	$0$	$3$
$- π / 30$	$(- π / 30, - 3)$	$(- π / 30, 0)$	$(- π / 30, 3)$
$0$	$(0, - 3)$	$(0, 0)$	$(0, 3)$
$π / 30$	$(π / 30, - 3)$	$(0, - 3)$	$(π / 30, 3)$

Table 2. Deep Learning Hyperparameter Configuration.

Parameter	Value
D_state	16
D_conv	4
Expand	2
Learning rate	0.001
Sequence length	5
Batch size	4096
Hidden size	64
Loss function	MSE
Total number of training rounds	100

Table 3. Reinforcement Learning Hyperparameter Configuration.

Parameter	Value
Learning rate	0.000001
Discount factor	0.99
Buffer size	1,000,000
Batch size	512
Multi-step update	5
Update delay of current network	10 steps
Update delay of target network	Upon completion of each round
Loss function	MSE
Total number of training rounds	8000
Total number of tests	1000

Table 4. GLPPM Support Effect Test for Flight Missions.

Module Structure	Success Rate
Structure (a)	94.7%
Structure (b)	94.4%
Structure (c)	94.3%
Structure (d)	94.1%
Structure (e)	93.2%

Table 5. Comparison results of prediction methods.

System Composition	Success Rate
GLPPM + FSFCM	94.7%
AIMM-IAKF	85.8%
RT-GRU	93.3%

Table 6. Ablation Study Results.

System Composition	Success Rate
D3QN	79.5%
GLPPM + D3QN	89.1%
GLPPM + FSFCM	94.7%

Table 7. Statistically Significant Study Results.

Test Metric	Value
U stat	1.0
p value	2.4 × 10⁻³⁴

Table 8. Density Stress Test Results.

Maximum Speed (m/s)	Non-Cooperative Target Density (Aircraft Per Square km)	Success Rate
10	60	98.0%
	90	93.4%
	120	88.1%
15	60	93.7%
	90	82.7%
	120	75.0%

Table 9. SHAP values of different features in 4 samples.

Feature	The First Sample	The Second Sample	The Third Sample	The Fourth Sample
Direction	0.02	−0.05	−0.11	0.04
Speed	0.1	−0.1	0.01	0
Progress	−0.56	0	−0.47	0.28
Angle	3.14	−0.1	−0.15	−2.65
Sector 1	0.73	−0.08	−0.06	−0.02
Sector 2	−0.11	−0.11	−0.15	0
Sector 3	−0.62	−0.66	−0.66	−0.02
Sector 4	−0.33	0.15	3.62	−0.38
Sector 5	−0.33	2.53	4.88	−0.29
Sector 6	−0.24	−0.36	0.53	−0.35
Sector 7	−0.01	−0.02	−0.84	4.56
Sector 8	0	0	−0.27	−0.65
Sector 9	−0.06	−0.03	−0.45	−0.26
Predict sector 1	−0.04	−0.01	−0.53	−0.01
Predict sector 2	0.07	0.08	−0.07	−0.01
Predict sector 3	−0.46	−0.4	−1.46	−0.12
Predict sector 4	−0.24	0.04	5.47	−0.84
Predict sector 5	−0.62	2.62	1.97	−0.46
Predict sector 6	−1.16	−0.97	0.5	−0.57
Predict sector 7	−0.11	−0.1	−0.81	4.58
Predict sector 8	−0.01	−0.01	−0.18	−0.32
Predict sector 9	−0.05	−0.06	−0.84	−0.33

The bolded values are the SHAP values that have the greatest impact on the UAV’s decision-making.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.