Resource-Aware Surprise Reinforcement Learning for Collision Avoidance in Maritime UAV Encounters

Liu, Zuocheng; Feng, Qi; Wang, Zidong; Gao, Xiaoguang

doi:10.3390/drones10060450

¹ School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China

² Department of Computer Science, City University of Hong Kong, Hong Kong

* Author to whom correspondence should be addressed.

Drones 2026, 10(6), 450; https://doi.org/10.3390/drones10060450

Submission received: 3 April 2026 / Revised: 4 June 2026 / Accepted: 7 June 2026 / Published: 9 June 2026

(This article belongs to the Special Issue Artificial Intelligence-Driven Drones Systems for Marine Engineering Applications)

Highlights

What are the main findings?

A unified framework, termed Resource-Aware Intrinsic Surprise Exploration (RAISE), is developed by integrating resource-aware modulation, surprise-based intrinsic reward, and adaptive EMA scaling into SAC-D for maritime UAV collision avoidance.
Across constrained, moderate, and resource-rich settings, RAISE consistently improves collision-avoidance performance, issues fewer reversal and strengthening advisories, and produces more stable command behavior than baseline methods.

What are the implications of the main findings?

The results show that collision avoidance for maritime UAVs should be optimized not only for safety, but also for advisory stability and command economy, so that the learned policy remains operationally acceptable.
The proposed framework offers a transferable design for safety-critical reinforcement learning under operational constraints, with strong generalization across unseen encounter scenarios.

Abstract

Collision avoidance in maritime unmanned aerial vehicle (UAV) operations must satisfy two competing objectives: ensuring reliable safety separation and minimizing unnecessary maneuver commands that increase operator burden and communication overhead. While deep reinforcement learning (DRL) has shown promise in handling high-dimensional encounter states, standard DRL approaches often prioritize safety at the cost of operational suitability, leading to frequent, oscillatory, or unnecessary avoidance commands that erode remote operator trust and consume limited communication bandwidth. To address this challenge, this paper proposes Resource-Aware Intrinsic Surprise Exploration (RAISE), a unified framework that balances collision avoidance performance with command economy. We conceptualize the issuance of avoidance maneuvers as a consumable “virtual resource”, compelling the agent to optimize its intervention budget. RAISE integrates this mechanism into the Soft Actor–Critic (SAC) architecture, augmented by a surprise-based intrinsic reward derived from the ensemble forward dynamics prediction error. This allows the agent to efficiently explore complex encounter scenarios driven by curiosity, while a resource-aware coefficient adaptively suppresses redundant actions when the communication or operational budget is constrained. Furthermore, an adaptive exponential moving average (EMA) scaling mechanism is introduced to stabilize the interplay between intrinsic and extrinsic rewards. Extensive simulations under diverse resource constraints and encounter geometries demonstrate that RAISE outperforms state-of-the-art baselines. It significantly reduces maneuver reversal rates and strengthens command stability without compromising safety margins. 
Specifically, under resource-constrained settings, RAISE suppresses excessive and unstable advisory behavior by reducing strengthening and reversal commands while maintaining effective collision avoidance; under resource-rich settings, it flexibly enhances safety buffers, demonstrating superior adaptability and operational realism for autonomous maritime UAV systems. Robustness evaluation confirms that RAISE maintains stable performance under sensor noise and wind disturbances.

Keywords:

maritime UAV collision avoidance; deep reinforcement learning; resource-aware exploration; intrinsic motivation; autonomous UAV navigation

1. Introduction

Ensuring safe separation in increasingly congested maritime airspaces is a paramount challenge for modern offshore operations. With the rapid expansion of unmanned aerial vehicles (UAVs) deployed for maritime monitoring, infrastructure inspection, and search-and-rescue [1], the frequency of complex UAV encounters has risen sharply, exacerbating the risk of aerial collisions [2,3]. Operational experience reveals that while autonomous systems can successfully prevent most collisions, they often trigger excessive or continuous avoidance commands. A critical yet often overlooked issue is that frequent, nuisance, or oscillatory maneuver commands consume valuable and often limited maritime communication bandwidth. Furthermore, they severely disrupt the remote operator’s workflow and increase cognitive load, eventually leading to the desensitization of human supervisors—a phenomenon known as “trust erosion” in human-autonomy teaming [4,5,6]. In maritime environments where connectivity is unstable, executing continuous, high-frequency commands is highly impractical [7,8]. Therefore, the next generation of maritime UAV collision avoidance systems must strictly adhere to a dual mandate: providing high safety assurance while maintaining “command precision”—issuing avoidance maneuvers only when absolutely necessary to preserve communication resources and operator trust.

Current solutions, however, struggle to satisfy this dual mandate simultaneously. Traditional UAV collision avoidance methods [9,10,11] often rely on fixed geometric rules or reactive potential fields. While effective in simple scenarios, they are computationally rigid and unable to adapt online to varying operational needs, such as communication constraints or a remote operator’s tolerance for frequent interventions [5,12]. Conversely, deep reinforcement learning (DRL) has emerged as a powerful alternative, capable of mapping continuous, high-dimensional states to maneuver commands without massive storage overhead [13,14]. Recent RL studies have further expanded UAV decision-making from value-based control to actor–critic, multi-agent, and safe/constrained learning frameworks, improving adaptability in navigation, control, resource scheduling, and cooperative UAV operations [15,16]. Safe/constrained RL also provides a principled way to incorporate safety or resource-related constraints into policy learning [17]. However, these advances still primarily optimize safety, path efficiency, or mission performance, while the operational cost of frequent maneuver advisories remains insufficiently modeled. Even when recent DRL agents have demonstrated excellent safety performance [18], they typically lack “operational awareness”. Existing UAV DRL research is predominantly safety-driven; without explicit constraints on the “cost” of maneuvering or transmitting commands, these agents often exhibit “jittery” or oscillatory behaviors to marginally increase safety buffers. This creates a critical gap: algorithms that are mathematically safe but operationally unacceptable for communication-constrained maritime environments.

To bridge the gap between algorithmic safety and operational suitability, we draw inspiration from resource-constrained reinforcement learning [19]. We propose a novel paradigm that conceptualizes the issuance of avoidance commands as a consumable “virtual resource”, representing the finite capacity of communication bandwidth and remote operator attention within an encounter episode. This abstraction forces the agent to treat every maneuver command as a “costly investment”, thereby naturally suppressing redundant or low-value actions.

However, simply penalizing maneuver commands can hinder the agent’s exploration during training, leading to overly conservative policies that fail to discover safe avoidance trajectories. To address this, we introduce Resource-Aware Intrinsic Surprise Exploration (RAISE). This framework unifies safety assurance and command economy by integrating synergistic mechanisms within the Soft Actor–Critic Discrete (SAC-D) architecture [20]. Specifically, to prevent the resource constraint from hindering learning, we employ a surprise-based intrinsic motivation module that utilizes an ensemble dynamics model to generate prediction errors [21], driving the agent to explore novel scenarios. This exploration is dynamically regulated by a resource-modulated control coefficient, which couples exploration intensity with the remaining command budget. Furthermore, to address the scale discrepancy between varying intrinsic signals and stable extrinsic rewards, we introduce an adaptive exponential moving average (EMA) scaling mechanism, ensuring consistent reward normalization and stable convergence.

By unifying safety assurance, communication efficiency, and operator burden reduction, RAISE provides a practical path toward operationally suitable AI collision avoidance for maritime UAVs. The main contributions of this work are summarized as follows:

Resource-aware decision formulation: Introduction of a virtual resource mechanism that regulates maneuver command frequency and intensity during both training and deployment.
Resource-aware surprise exploration with adaptive scaling: An ensemble-based prediction-error signal is used for novelty estimation, while its influence is adaptively scaled and modulated by the remaining advisory resource.
Unified SAC-D integration: Seamless incorporation of resource modulation and intrinsic exploration into a standard actor–critic framework.
Comprehensive evaluation: Experiments across resource-constrained, moderate, and resource-rich conditions demonstrate that RAISE improves advisory stability by reducing strengthening and reversal behaviors while maintaining reliable collision-avoidance performance and stable training convergence.

2. Related Work

2.1. Conventional UAV Collision Avoidance Methods

Early UAV collision avoidance relied heavily on reactive and rule-based logic. Traditional geometric and force-field approaches, such as artificial potential fields (APFs) [10] and velocity obstacles (VOs) [9], determine avoidance maneuvers based on relative distances and velocities. While effective in standard, low-density encounters, several limitations remain. For instance, classical APF methods may suffer from local-minimum traps in certain obstacle configurations, although many improved APF variants have been proposed to mitigate this limitation [11,22]. In addition, purely reactive potential-field planners may exhibit local path oscillations near obstacles or in narrow passages, reducing trajectory smoothness under certain conditions [23,24]. More importantly, because these methods are typically governed by predefined geometric or force-field rules, they lack the online adaptability required to flexibly recalibrate the trade-off between safety margins and operational costs, such as communication bandwidth consumption or remote operator intervention frequency, based on real-time maritime contexts [5,12].

2.2. Deep Reinforcement Learning for Autonomous Navigation

DRL has emerged as a promising alternative to overcome the scalability and adaptability issues of traditional reactive systems. Algorithms like Deep Q-Networks (DQN) [13] and Soft Actor–Critic (SAC) [14] have been successfully applied to safety-critical domains, including UAV obstacle avoidance [25,26], autonomous maritime navigation [27,28], and aircraft separation assurance [29,30]. By mapping high-dimensional, continuous states directly to maneuver commands, DRL agents can navigate complex environments without the need for massive computational overhead [18]. More recent studies have further broadened this line of research toward policy-gradient and actor–critic methods, multi-agent reinforcement learning, curriculum-based training, and safe/constrained reinforcement learning [15,16,17]. These developments improve the scalability of RL-based UAV control in uncertain, dynamic, and cooperative environments, and they also provide tools for incorporating safety or resource-related constraints during policy optimization.

Despite these advances, most UAV DRL collision-avoidance studies still evaluate performance mainly through collision rate, path efficiency, or separation distance, rather than through the operational cost and temporal stability of maneuver advisories. Without explicit mechanisms to regulate the frequency or severity of maneuver commands, safety-driven agents may generate “jittery” or excessive control signals. While mathematically safe, such high-frequency behavior can saturate limited maritime communication networks, increase remote operator cognitive load, and erode trust in autonomous UAV operations.

2.3. Constrained Exploration and Intrinsic Motivation

To balance safety with operational constraints, this work draws inspiration from constrained reinforcement learning (CRL) [31]. In domains such as robotic control with finite energy budgets [32,33] or games with consumable items [34,35], agents are trained to maximize rewards under strict resource limitations. We adapt this concept by formulating the “command budget” as a virtual constraint, a novel perspective in the context of maritime UAV collision avoidance.

Furthermore, to ensure efficient policy learning under such constraints, intrinsic motivation (IM) mechanisms are often employed. IM tackles the sparse reward problem by encouraging exploration. Existing approaches include count-based exploration [36,37], information-gain approaches [38,39], prediction-error-based exploration [21,40], and empowerment-based approaches [41]. Among these, surprise-based exploration [21], which rewards the agent for prediction errors in forward dynamics, has shown superior sample efficiency in continuous control tasks.

However, standard intrinsic exploration is typically cost-agnostic. In safety-critical UAV operations, unconstrained curiosity can lead to dangerous or communication-heavy behaviors. To the best of our knowledge, no prior work has integrated surprise-based exploration with resource-aware constraints specifically to solve the “safety vs. command economy” trade-off in maritime UAV encounters. This paper bridges this gap by introducing the RAISE framework.

3. Problem Formulation

3.1. UAV Encounter Dynamics

In this study, the UAV encounter scenario is formulated as a strategic vertical resolution advisory problem for medium-to-large fixed-wing UAVs conducting beyond visual line of sight (BVLOS) maritime missions. The purpose of the model is not to generate full three-dimensional trajectories, but to determine whether sufficient vertical separation can be achieved within the remaining horizontal encounter time. Accordingly, the horizontal encounter geometry is represented by the time-to-conflict variable τ_t, while the learning problem focuses on vertical separation and vertical advisory generation.

As illustrated in Figure 1, the encounter geometry is characterized by the relative altitude between the ego-UAV and the intruding UAV, alongside the time to the closest point of approach (CPA) in the horizontal direction. The horizontal geometry is thereby parameterized by the time-to-conflict variable (τ_t), which represents the remaining time until the UAVs breach the minimum horizontal safety threshold:

\tau_{t} = \frac{D_{t}}{v_{rel}}

(1)

where D_t denotes the current horizontal distance and v_rel denotes the locally estimated relative horizontal speed over the short encounter horizon. The resulting variable τ_t represents the remaining time before the horizontal separation reaches the predefined safety threshold. Thus, horizontal motion is retained as a temporal constraint on the vertical resolution process, rather than being explicitly controlled as an additional maneuver dimension.
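As a small numerical illustration of Equation (1), the sketch below computes the time-to-conflict from a horizontal distance and a relative closing speed. The function name, the units (metres and metres per second), and the treatment of a non-closing geometry are assumptions made for illustration and are not specified in the text.

```python
def time_to_conflict(horizontal_distance_m: float, relative_speed_mps: float) -> float:
    """Eq. (1): tau_t = D_t / v_rel."""
    if relative_speed_mps <= 0.0:
        # Assumption: a non-closing geometry is treated as having no pending conflict.
        return float("inf")
    return horizontal_distance_m / relative_speed_mps


# Hypothetical encounter: 3000 m apart, closing at 60 m/s.
tau = time_to_conflict(3000.0, 60.0)
```

With these hypothetical values the ego-UAV would have 50 one-second decision steps in which to establish vertical separation.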

Under this abstraction, the relative motion of the two UAVs is described by a compact set of vertical variables, including the relative altitude

h_{t} = h_{int,t} - h_{ego,t}

and the respective vertical velocities of the ego-UAV and the intruder. This formulation provides sufficient information for evaluating collision risk and generating precise avoidance maneuver alerts without explicitly representing lateral motion.

3.2. UAV Dynamic Model

It should be noted that the dynamic model used in this study is an advisory-level vertical kinematic response model, rather than a full six-degree-of-freedom UAV flight dynamics model. The objective of this model is to describe how high-level climb, descent, and no-command advisories affect the vertical separation between the ego-UAV and the intruding UAV during a short encounter window. Therefore, attitude dynamics, aerodynamic forces, actuator dynamics, and low-level flight-control loops are not explicitly modeled.

Given the two-dimensional encounter abstraction introduced above, the horizontal interaction between UAVs is fully captured by the time-to-conflict variable τ_t. The collision-avoidance dynamics can therefore be modeled exclusively in the vertical dimension, consistent with altitude-based deconfliction strategies commonly used in UAV operations, where vertical maneuvers can rapidly establish separation within a limited encounter time [42]. This formulation is suitable for the considered fixed-wing maritime encounter scenario because the key decision variable is whether the ego-UAV can establish adequate vertical separation before the horizontal encounter time expires. In this sense, τ_t acts as a countdown variable that links horizontal closure with the urgency of vertical maneuvering. Under this formulation, both UAVs follow one-dimensional longitudinal dynamics updated at a discrete frequency of 1 Hz. This frequency reflects the high-level strategic decision rate suitable for maritime environments, directly accommodating the severe bandwidth limitations of long-range UAV telemetry. At each time step t, the environment state is represented as:

s_{t} = (h_{t}, \dot{h}_{ego,t}, \dot{h}_{int,t}, \tau_{t}, a_{prev})

(2)

Here, h_t denotes the vertical relative altitude between the intruding UAV and the ego-UAV, defined as h_t = h_{int,t} − h_{ego,t}. The variables \dot{h}_{ego,t} and \dot{h}_{int,t} represent the vertical velocities of the ego-UAV and intruding UAV, respectively. The variable τ_t indicates the estimated time to the closest point of approach (CPA), assuming a constant relative horizontal speed, i.e., the remaining time before the horizontal separation between the UAVs decreases below a predefined maritime safety threshold (e.g., 150 m). For medium-to-large fixed-wing UAVs executing BVLOS maritime missions, this relatively conservative separation minimum is necessary. It strictly accounts for the high cruising speeds, severe offshore wind disturbances, GPS inaccuracies, and inherent latency in satellite or long-range communications.

To capture the temporal mismatch between decision issuance and execution in real maritime operations, the model introduces an action execution delay mechanism. Specifically, the action applied at time step t corresponds to the maneuver command generated at the previous step t − 1. To support this mechanism, the state representation includes the previous command index a_prev, which determines the acceleration applied at time t. This design effectively models realistic latency effects—such as communication delays or mechanical response lags—thereby improving the fidelity and temporal continuity of the simulated trajectories. The transition to the next time step t + 1 follows discrete motion equations that update the altitude and vertical velocity of both UAVs according to their respective accelerations, as described in Equation (3).

s_{t+1} = \begin{bmatrix} h_{t} + \dot{h}_{int,t} + \frac{1}{2}\ddot{h}_{int,t} - \dot{h}_{ego,t} - \frac{1}{2}\ddot{h}_{ego,t} \\ \dot{h}_{ego,t} + \ddot{h}_{ego,t} \\ \dot{h}_{int,t} + \ddot{h}_{int,t} \\ \tau_{t} - 1 \\ a_{prev} \end{bmatrix}

(3)

Here, \ddot{h}_{ego,t} and \ddot{h}_{int,t} denote the vertical accelerations of the ego-UAV and the intruding UAV at time step t, respectively. For the ego-UAV, the applied acceleration is directly determined by the maneuver command issued by the collision avoidance system. The acceleration values in Equation (3) should be interpreted as commanded vertical acceleration responses associated with discrete advisories, rather than direct actuator-level control inputs. This representation is appropriate for evaluating whether the advisory policy can generate timely and physically bounded vertical separation during the encounter. The intruding UAV, in contrast, is modeled using a goal-directed vertical response that drives it toward the ego-UAV’s altitude within the remaining encounter time, as described in Equation (4). This design is not intended to represent all possible maritime UAV traffic behaviors. Rather, it provides a controlled conflict-generation mechanism for constructing repeatable encounter cases, so that the learned policy can be evaluated under clear vertical conflict pressure. At each step, the intruding UAV estimates a target vertical velocity that would allow it to reach the ego-UAV’s altitude by the predicted time-to-CPA under the current kinematic conditions.

\dot{h}_{int,t+1} = \dot{h}_{ego,t} - \frac{h_{int,t} - h_{ego,t}}{\tau_{t}}

(4)

\ddot{h}_{int,t} = \mathrm{clip}(\dot{h}_{int,t+1} - \dot{h}_{int,t}, -a_{\max}, a_{\max})

(5)

On this basis, the intruding UAV’s vertical acceleration is determined by the difference between its current and target vertical velocities, subject to a predefined maximum acceleration limit. To ensure physically feasible motion and stable trajectory evolution, the acceleration is constrained within the allowable range using a clipping function, as shown in Equation (5). Similarly, the vertical velocity is bounded by a maximum value \dot{h}_{int,\max}, preventing unrealistic climb or descent rates and maintaining smooth motion continuity throughout the encounter. In this setting, the intruder behavior serves as a stress-case model for policy training and evaluation. More diverse intruder behaviors, such as route-following, cooperative, non-cooperative, and stochastic traffic models, can be investigated in future extensions of the proposed resource-aware advisory framework.
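The update in Equations (3)–(5) can be sketched as a single 1 Hz step function. This is a minimal illustration under stated assumptions, not the authors' simulator: the names `EncounterState`, `intruder_acceleration`, and `step` are hypothetical, the ego acceleration is passed in directly because the advisory-to-acceleration mapping of Table 2 is not reproduced here, and storing the newly issued advisory index as the next `a_prev` is an assumed reading of the execution-delay mechanism.

```python
from dataclasses import dataclass


@dataclass
class EncounterState:
    h: float         # relative altitude h_int - h_ego
    hdot_ego: float  # ego-UAV vertical rate
    hdot_int: float  # intruder vertical rate
    tau: float       # time to conflict, counted in 1 s steps
    a_prev: int      # index of the previously issued advisory


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def intruder_acceleration(s: EncounterState, a_max: float) -> float:
    if s.tau <= 0:
        raise ValueError("tau must be positive; the encounter ends at tau = 0")
    # Eq. (4): target rate that closes the altitude gap by the time of conflict.
    target_rate = s.hdot_ego - s.h / s.tau
    # Eq. (5): bounded change in vertical rate over one step.
    return clip(target_rate - s.hdot_int, -a_max, a_max)


def step(s: EncounterState, hddot_ego: float, a_issued: int,
         a_max: float, hdot_int_max: float) -> EncounterState:
    """One step of Eq. (3). hddot_ego is the acceleration commanded by s.a_prev."""
    hddot_int = intruder_acceleration(s, a_max)
    h_next = (s.h + s.hdot_int + 0.5 * hddot_int
              - s.hdot_ego - 0.5 * hddot_ego)
    return EncounterState(
        h=h_next,
        hdot_ego=s.hdot_ego + hddot_ego,
        # The intruder rate is additionally bounded, as described in the text.
        hdot_int=clip(s.hdot_int + hddot_int, -hdot_int_max, hdot_int_max),
        tau=s.tau - 1.0,
        a_prev=a_issued,  # assumption: the new advisory is executed one step later
    )
```

For example, with a hypothetical intruder 100 m above the ego-UAV, both in level flight, τ_t = 20, and a_max = 2, the target rate is −5 and the clipped intruder acceleration is −2, so the relative altitude after one step is 99.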

3.3. Resource-Aware Collision Avoidance Decision Framework

3.3.1. Resource-Aware Decision Framework

In conventional collision avoidance systems, advisories are issued solely based on safety requirements, without accounting for operator workload or alert fatigue. However, in real-world UAV operations, frequent or redundant advisories can overload the remote operators and reduce trust in the automation system.

To address this issue, we introduce a resource-aware decision framework, in which a virtual resource variable r_t^{res} represents the remaining command-resource budget available for issuing collision-avoidance advisories during an encounter. This variable is not intended to directly measure physical communication bandwidth, link latency, or human cognitive workload. Instead, it is a decision-level proxy for advisory burden, reflecting the operational cost associated with frequent, continued, strengthened, or reversed maneuver advisories. The cost values are designed to distinguish different levels of command burden at the advisory-transition level, rather than to convert advisories into exact units of bandwidth or operator workload. Specifically, issuing a new or continued alert consumes 3 units, a strengthening advisory consumes 2 units, and a reversal advisory consumes 5 units. The reversal cost is assigned the largest value because switching between climb and descent advisories may lead to oscillatory guidance and represents a high-burden advisory transition from the perspective of supervisory control. The resource state is updated as:

r_{t + 1}^{res} = \max (0, r_{t}^{res} - c (a_{t}, a_{t - 1}))

(6)

The cost function is defined as:

c(a_{t}, a_{t-1}) = \begin{cases} 0, & \text{NOC or no resource-consuming transition}, \\ 3, & \text{new or continued alert}, \\ 2, & \text{strengthening advisory}, \\ 5, & \text{reversal advisory}. \end{cases}

(7)

If the remaining resource is insufficient to support the requested advisory type, the action is suppressed and replaced by NOC. This mechanism enforces a finite alert budget and prevents the policy from relying on frequent high-cost interventions.

A resource coefficient ρ_t ∈ [0, 1] is further introduced to regulate the influence of this limited resource on the decision-making process. When the available resource decreases, the coefficient proportionally reduces the intensity or frequency of advisories, encouraging more conservative and resource-efficient behavior. In essence, this mechanism allows the system to adapt its decision policy according to the current alert capacity, ensuring both operational safety and operator acceptance.
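The budget update of Equations (6) and (7), together with the suppression rule, can be sketched as follows. This is an illustrative sketch only: the `Transition` enum and `apply_budget` function are hypothetical names, and the classification of an action pair (a_t, a_{t−1}) into a transition type is assumed to be done elsewhere, since it depends on the availability rules of Table 2.

```python
from enum import Enum


class Transition(Enum):
    NONE = "none"              # NOC or no resource-consuming transition
    ALERT = "alert"            # new or continued alert
    STRENGTHEN = "strengthen"  # strengthening advisory
    REVERSAL = "reversal"      # reversal advisory


# Cost table from Eq. (7).
COST = {
    Transition.NONE: 0,
    Transition.ALERT: 3,
    Transition.STRENGTHEN: 2,
    Transition.REVERSAL: 5,
}


def apply_budget(resource: int, requested: Transition) -> tuple[Transition, int]:
    """Return the transition actually executed and the updated resource level."""
    cost = COST[requested]
    if cost > resource:
        # Insufficient budget: the advisory is suppressed and replaced by NOC.
        return Transition.NONE, resource
    # Eq. (6): the resource level never drops below zero.
    return requested, max(0, resource - cost)


# Hypothetical episode with an initial budget of 6 units.
budget = 6
executed, budget = apply_budget(budget, Transition.ALERT)       # costs 3
executed, budget = apply_budget(budget, Transition.REVERSAL)    # needs 5, suppressed
executed, budget = apply_budget(budget, Transition.STRENGTHEN)  # costs 2
```

In this hypothetical episode the reversal request is suppressed because only 3 units remain, while the cheaper strengthening advisory is still affordable and leaves 1 unit.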

3.3.2. State Space

The state space for the collision avoidance task consists of six variables, as summarized in Table 1. These variables capture the essential kinematic and decision-related information required by the reinforcement learning agent to assess encounter risk and determine appropriate advisories. The first three variables describe the vertical geometry of the encounter: the relative altitude between the intruding UAV and ego-UAV (h_t), the vertical rate of the ego-UAV (\dot{h}_{ego,t}), and the vertical rate of the intruding UAV (\dot{h}_{int,t}). The fourth variable, time to loss of horizontal separation (τ_t), characterizes the remaining time until the two UAVs reach the minimum allowed horizontal distance (typically 150 m), effectively representing the horizontal encounter geometry. The fifth variable, previous advisory (a_prev), encodes the advisory issued at the previous time step, which helps the agent penalize unnecessary command reversals or escalations in advisory intensity, thereby maintaining temporal consistency and preserving the Markov property.

Finally, a new variable, the resource level (r_t^{res}), is introduced to represent the remaining alert resource available to the system at time t. This variable reflects the agent’s alert budget, influencing how aggressively it can issue future advisories. By including r_t^{res} in the observation space, the system becomes aware of its operational constraints, enabling adaptive behavior that balances collision safety with alert economy.

3.3.3. Action Space

The action space consists of seven discrete advisories: NOC, DES-N, CLB-N, DES-T, CLB-T, DES-E, and CLB-E, as listed in Table 2. Here, N, T, and E denote normal, transitional, and escalated advisories, respectively. Terms such as reversal or strengthening describe transition types between consecutive advisories and are not separate action labels. Each advisory, except NOC (No conflict), instructs the ego-UAV to achieve or maintain a specific vertical rate, corresponding to a designated vertical acceleration. The NOC action indicates that no immediate collision threat exists, allowing the UAV to maintain its nominal trajectory, and can be issued at any time.

Table 2 also defines the availability rules between advisories, ensuring physically consistent and operationally feasible trajectory transitions. For instance, NOC may precede any other advisory, while normal descent (DES-N) and climb (CLB-N) can only be initiated from NOC. Transitional or strengthened advisories (DES-T, CLB-T, DES-E, CLB-E) can only follow compatible preceding advisories, ensuring smooth kinematic transitions and avoiding abrupt or contradictory guidance. This discrete action design ensures that the agent’s policy remains interpretable and aligned with standard deterministic state-machine constraints, while still allowing flexibility for optimization under resource-aware and learning-based settings.

3.3.4. Reward Shaping

To ensure both safety and operational efficiency in the collision avoidance process, the reward function is designed to balance three competing objectives: (1) preventing near mid-air collisions (safety), (2) minimizing unnecessary alerts (remote operator workload), and (3) maintaining acceptable altitude deviations (flight stability). The overall reward consists of two main components: terminal altitude penalty and alert and altitude management. The main weighting parameters used in the extrinsic reward are summarized in Table 3 before the individual reward terms are introduced.

The parameters follow a safety-priority hierarchy. The conflict penalty ω_col = 100 has the largest magnitude, ensuring that collision avoidance remains the dominant objective. Advisory-management penalties are deliberately smaller, including ω_alert = 0.3, ω_str = 0.3, ω_rev = 0.5, and ω_cross = 0.5. Therefore, advisory economy shapes the policy only after the primary safety objective has been prioritized.

1. Terminal Altitude Penalty

This component evaluates the final vertical separation between the ego-UAV and the intruder when the time to loss of horizontal separation reaches zero—that is, at the critical moment of closest horizontal proximity. It penalizes unsafe altitude configurations such as loss of separation or excessive climb/descent, reinforcing safe and stable avoidance maneuvers. The overall final altitude penalty is defined as:

R_{terminal} = R_{collision} + R_{deviation} + R_{excessive}

(8)

where:

Collision Penalty:

R_{collision} = -\omega_{col} \, \mathbf{1}\{|\Delta h| \leq h_{\min}\}

(9)

penalizes any case where the relative altitude Δh between the two UAVs falls below the collision threshold h_min.

Terminal Separation Penalty:

R_{deviation} = \omega_{col} \left(\frac{|\Delta h| - h_{\min}}{h_{safe} - h_{\min}}\right) \mathbf{1}\{h_{\min} \leq |\Delta h| < h_{safe}\}

(10)

penalizes moderate deviations from the safe altitude range h_safe.

Excessive Terminal-Separation Penalty:

R_{excessive} = \max\left(-2\omega_{ex}(|\Delta h| - h_{safe}), -\gamma_{alt}\right) \mathbf{1}\{|\Delta h| \geq h_{safe}\}

(11)

applies a saturation penalty when the UAVs exceed the maximum permitted altitude deviation, with γ_alt denoting the upper penalty bound. In the study, γ_alt = 25, which serves as the saturation bound for excessive-altitude penalties. This value is larger than the advisory-related penalties but smaller than the conflict penalty, preserving the safety-priority hierarchy. The bounded form prevents extremely large altitude deviations from causing unbounded reward magnitudes and improves critic stability during training.
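The three terminal terms in Equations (8)–(11) can be evaluated with a short function such as the one below. It is a sketch that transcribes the terms with the signs as printed in Equations (9)–(11). The values ω_col = 100 and γ_alt = 25 come from the text, while h_min = 30, h_safe = 100, and ω_ex = 0.5 used in the example calls are hypothetical placeholders, since their values are not stated in this section.

```python
def terminal_reward(delta_h: float, h_min: float, h_safe: float, w_ex: float,
                    w_col: float = 100.0, gamma_alt: float = 25.0) -> float:
    """Eq. (8): sum of the collision, deviation, and excessive-separation terms."""
    sep = abs(delta_h)

    # Eq. (9): collision penalty when separation is at or below h_min.
    r_collision = -w_col if sep <= h_min else 0.0

    # Eq. (10): term active for h_min <= |dh| < h_safe, sign as printed.
    r_deviation = 0.0
    if h_min <= sep < h_safe:
        r_deviation = w_col * (sep - h_min) / (h_safe - h_min)

    # Eq. (11): linear penalty beyond h_safe, saturated at -gamma_alt.
    r_excessive = 0.0
    if sep >= h_safe:
        r_excessive = max(-2.0 * w_ex * (sep - h_safe), -gamma_alt)

    return r_collision + r_deviation + r_excessive


# Hypothetical thresholds: h_min = 30, h_safe = 100, w_ex = 0.5.
r_conflict = terminal_reward(10.0, 30.0, 100.0, 0.5)    # inside the collision band
r_slight = terminal_reward(110.0, 30.0, 100.0, 0.5)     # 10 beyond h_safe
r_saturated = terminal_reward(1000.0, 30.0, 100.0, 0.5) # far beyond h_safe
```

With these placeholder values, the collision case returns −100, a separation 10 beyond h_safe returns −10, and a very large separation is capped at −25, which illustrates how the saturation bound keeps the terminal reward magnitude bounded.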

2. Alert and Altitude Management

This component reflects the dynamic control efficiency of the avoidance policy—encouraging timely yet minimal advisory usage while maintaining altitude constraints. The corresponding reward term is defined as:

R_{manage} = R_{alt\_limit} + R_{alert\_penalty} + R_{clear}

(12)

where:

Altitude Limit Penalty:

R_{alt\_limit} = -\omega_{lim} \, \mathbf{1}\{|h_{ego}| > h_{\max}\}

(13)

penalizes ego-UAV altitude exceeding the predefined upper limit h_max.

Alert Penalty:

R_{alert\_penalty} = -(\omega_{alert} I_{alert} + \omega_{rev} I_{rev} + \omega_{str} I_{str} + \omega_{cross} I_{cross}) \exp(\tau_{t} - T)

(14)

where I_alert, I_str, I_rev, and I_cross are binary indicators for active advisory issuance, advisory strengthening, advisory reversal, and vertical path crossing, respectively. The exponential factor exp(τ_t − T) increases the penalty magnitude as the encounter approaches CPA, discouraging late-stage unnecessary advisories and inconsistent command changes near the conflict point.

Conflict Clearance Reward:

R_{clear} = ω_{NOC} I_{NOC}

(15)

provides a small positive reward when the conflict is successfully cleared. The binary indicator I_NOC equals 1 when the system transitions back to the no-conflict (NOC) advisory state and 0 otherwise.
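The management reward of Equations (12) to (15) can be sketched as follows. The weights and the dictionary keys are hypothetical placeholders, not values from the study. The sketch also assumes that τ_t ≤ T, so that the factor exp(τ_t − T) lies in (0, 1] and reaches 1 at the closest point of approach.

```python
import math

INDICATORS = ("alert", "rev", "str", "cross")


def alert_penalty(flags, weights, tau_t, horizon):
    """Eq. (14): weighted advisory indicators scaled by exp(tau_t - T)."""
    weighted = sum(weights[k] * flags[k] for k in INDICATORS)
    return -weighted * math.exp(tau_t - horizon)


def manage_reward(h_ego, h_max, w_lim, alert_pen, i_noc, w_noc):
    """Eq. (12): altitude-limit penalty + alert penalty + clearance reward."""
    r_alt_limit = -w_lim if abs(h_ego) > h_max else 0.0  # Eq. (13)
    r_clear = w_noc * i_noc                               # Eq. (15)
    return r_alt_limit + alert_pen + r_clear


# Hypothetical weights, for illustration only.
weights = {"alert": 1.0, "rev": 2.0, "str": 1.5, "cross": 1.0}
flags = {"alert": 1, "rev": 0, "str": 0, "cross": 0}
early = alert_penalty(flags, weights, tau_t=8.0, horizon=10.0)
late = alert_penalty(flags, weights, tau_t=10.0, horizon=10.0)
```

Under these assumptions the same advisory costs more when issued late (`late`) than when issued early (`early`), which is the time-dependence the exponential factor is meant to introduce.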

4. UAV Collision Avoidance Based on Resource-Aware Intrinsic Surprise Exploration

This section introduces the proposed collision-avoidance decision-making framework based on deep reinforcement learning. The method extends the standard Soft Actor–Critic (SAC) algorithm by incorporating a surprise-based intrinsic exploration term and a resource-aware modulation mechanism. The overall objective is to improve collision-avoidance efficiency while maintaining an alert economy and reducing unnecessary advisories.

4.1. SAC Algorithm for Collision Avoidance

To address the sequential decision-making problem of UAV collision avoidance under complex dynamic conditions, this study adopts the Soft Actor–Critic (SAC) algorithm as the fundamental learning framework [14]. SAC is an off-policy, entropy-regularized actor–critic method that optimizes both task performance and policy stochasticity, thereby achieving a balance between exploitation and exploration during training. In the context of collision avoidance, the agent must issue timely and stable vertical advisories to maintain safe separation while minimizing unnecessary command reversals or oscillations that could confuse the remote operator.

The optimization objective of SAC is formulated as a maximum entropy reinforcement learning problem, where the policy aims not only to maximize the expected cumulative reward but also to preserve sufficient action entropy for exploration:

J (π) = E_{τ ~ π} [\sum_{t = 0}^{T} (r_{t} + α H (π (\cdot | s_{t})))]

(16)

where r_t denotes the extrinsic reward at time t, and

H (π (\cdot | s_{t})) = - E_{a_{t} ~ π} [\log π (a_{t} | s_{t})]

represents the policy entropy. The temperature parameter α > 0 regulates the trade-off between maximizing return and maintaining policy diversity.

The SAC framework is composed of three primary components: two Q-value networks, Q_θ₁(s,a) and Q_θ₂(s,a), a stochastic policy network π_ϕ(a∣s), and a target critic used for stabilization. The Q-networks are trained to minimize the soft Bellman residual:

J_{Q} (θ) = E_{(s_{t}, a_{t}, r_{t}, s_{t + 1})} [\frac{1}{2} {(Q_{θ} (s_{t}, a_{t}) - y_{t})}^{2}]

(17)

where the soft target value y_t is computed as:

y_{t} = r_{t} + γ E_{a_{t + 1} ~ π} [\min_{i} Q_{{\bar{θ}}_{i}} (s_{t + 1}, a_{t + 1}) - α \log π (a_{t + 1} | s_{t + 1})]

(18)

This clipped double-Q formulation reduces overestimation bias and improves the stability of learning. The policy network is optimized by minimizing the Kullback–Leibler divergence between the current policy and the soft Q-function-induced Boltzmann distribution:

J_{π} (ϕ) = E_{s_{t} ~ D} [α \log π_{ϕ} (a_{t} | s_{t}) - Q_{θ} (s_{t}, a_{t})]

(19)

where a_t is sampled from π_ϕ(a_t∣s_t) and D denotes the replay buffer.

Since the action space in the collision avoidance task is discrete, we employ SAC-Discrete, an adaptation of the Soft Actor–Critic (SAC) framework for discrete control problems [20]. This formulation preserves the entropy-regularized objective and the dual Q-network structure of the original SAC while incorporating dedicated modifications to the policy representation, Q-function estimation, and policy update scheme. These adaptations enable SAC-D to perform stable and efficient policy optimization under the maximum-entropy reinforcement learning paradigm in discrete action domains.
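Because a discrete policy outputs a full probability vector, the expectation over a_t₊₁ in Equation (18) can be computed exactly instead of being sampled. The sketch below shows this computation for a single transition. It is a minimal illustration and not necessarily the implementation used here: the function names are hypothetical, and the `done` mask for terminal transitions is a common convention added for completeness.

```python
import math


def soft_state_value(probs, q1, q2, alpha):
    """Exact expectation over actions of min(Q1, Q2) - alpha * log pi."""
    value = 0.0
    for p, a, b in zip(probs, q1, q2):
        if p > 0.0:  # skip zero-probability actions to avoid log(0)
            value += p * (min(a, b) - alpha * math.log(p))
    return value


def soft_target(r, gamma, done, probs_next, q1_next, q2_next, alpha):
    """Soft Bellman target y_t of Eq. (18) for one transition."""
    v_next = soft_state_value(probs_next, q1_next, q2_next, alpha)
    return r + gamma * (1.0 - done) * v_next
```

Taking the element-wise minimum of the two target critics before averaging is the clipped double-Q step described above; the entropy bonus enters through the `- alpha * math.log(p)` term.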

4.2. Quantifying Novelty via Surprise-Based Intrinsic Reward

In reinforcement learning for UAV collision avoidance, the external reward primarily captures high-level safety objectives, such as maintaining vertical separation or preventing near mid-air collisions. However, this extrinsic feedback provides limited guidance in early-stage exploration, as it is sparse and only weakly correlated with intermediate advisory decisions. Consequently, the agent may struggle to efficiently explore novel yet safety-critical encounter situations. To address this limitation, a surprise-based intrinsic motivation mechanism is introduced, serving as a quantitative measure of state-transition novelty and guiding exploration toward poorly understood regions of the dynamics [21]. The intuition is that transitions that are difficult for the model to predict are likely underrepresented in the agent’s knowledge and therefore should receive additional reward incentives.

To quantify transition novelty, we trained an ensemble of probabilistic forward dynamics models. Given the current state–action pair (s_t, a_t), each ensemble member predicts a diagonal Gaussian distribution over the next state:

q_{θ_i}(s_{t+1} | s_t, a_t) = N(\hat{μ}_i(s_t, a_t), diag(\hat{σ}_i^2(s_t, a_t))), i = 1, \dots, N

(20)

where N denotes the ensemble size. The dynamics ensemble is trained by minimizing the negative log-likelihood (NLL) of the observed transition samples:

L_{model} = -\frac{1}{N} \sum_{i=1}^{N} E_{(s_t, a_t, s_{t+1}) ~ D} [\log q_{θ_i}(s_{t+1} | s_t, a_t)]

(21)

Here, s_t, a_t, and s_t₊₁ denote the current state, action, and observed next state, respectively. D denotes the replay buffer containing sampled transitions, and N is the number of ensemble dynamics models. q_{θ_i}(s_{t+1} | s_t, a_t) is the predictive distribution of the i-th dynamics model parameterized by θ_i, and \hat{μ}_i(s_t, a_t) and \hat{σ}_i^2(s_t, a_t) denote the predicted mean and variance of the next state.

Equivalently, up to constants independent of the model parameters, this objective can be expressed as:

L_{model} = \frac{1}{2N} \sum_{i=1}^{N} 𝔼_{D} [\sum_{j=1}^{d} (\frac{(s_{t+1}^{j} - \hat{μ}_i^{j}(s_t, a_t))^2}{\hat{σ}_i^{2,j}(s_t, a_t)} + \log \hat{σ}_i^{2,j}(s_t, a_t))]

(22)

In the expanded form, d is the dimension of the state vector, and the superscript j denotes the j-th state dimension. A large prediction error or high model variance indicates that the transition deviates from the learned dynamics and is therefore “surprising”. The intrinsic reward is then defined as the ensemble-averaged negative log-likelihood of the actually observed next state:

r_t^{int} = -\frac{1}{N} \sum_{i=1}^{N} \log q_{θ_i}(s_{t+1} | s_t, a_t)

(23)

For the diagonal Gaussian predictive distribution, Equation (23) can be expanded as:

r_t^{int} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{d} [\frac{(s_{t+1}^{j} - \hat{μ}_i^{j}(s_t, a_t))^2}{2 \hat{σ}_i^{2,j}(s_t, a_t)} + \frac{1}{2} \log(2π \hat{σ}_i^{2,j}(s_t, a_t))]

(24)

Thus, r_t^{int} measures the sample-wise surprisal of the observed transition under the learned dynamics model. A transition that is poorly predicted by the ensemble receives a larger intrinsic reward, encouraging the agent to explore dynamically uncertain or insufficiently modeled regions of the state-action space.
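Equation (24) can be sketched directly from its definition. The code below assumes that each ensemble member has already produced a predicted mean and variance vector for the next state. The function names and the list-of-pairs data layout are hypothetical choices made for illustration.

```python
import math


def gaussian_nll(s_next, mean, var):
    """Negative log-likelihood of s_next under a diagonal Gaussian."""
    total = 0.0
    for x, m, v in zip(s_next, mean, var):
        total += (x - m) ** 2 / (2.0 * v) + 0.5 * math.log(2.0 * math.pi * v)
    return total


def intrinsic_reward(s_next, ensemble_preds):
    """Eq. (24): NLL averaged over ensemble members.

    ensemble_preds is a list of (mean, var) pairs, one per member.
    """
    nlls = [gaussian_nll(s_next, m, v) for m, v in ensemble_preds]
    return sum(nlls) / len(nlls)


# A well-predicted transition versus a poorly predicted one (toy numbers).
preds = [([0.0, 0.0], [1.0, 1.0]), ([0.1, -0.1], [1.0, 1.0])]
familiar = intrinsic_reward([0.0, 0.0], preds)
surprising = intrinsic_reward([3.0, -3.0], preds)
```

In this toy case `surprising` exceeds `familiar`, because the squared-error term grows with the distance between the observed next state and the predicted means.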

By integrating this surprise-driven reward into the actor–critic framework, the learning process becomes more adaptive to the dynamic collision scenarios. The agent can identify previously unseen altitude separation patterns, encounter configurations, and conflict-resolution dynamics, thereby refining its advisory policy in a data-efficient and safety-oriented manner.

It should be noted that Equation (24) is implemented as a sample-wise NLL-based surprisal measure, rather than as a directly computed KL divergence. Let p(s_{t+1} | s_t, a_t) denote the true transition distribution of the environment and q_θ(s_{t+1} | s_t, a_t) denote the learned predictive distribution. Although p is unknown and is not explicitly estimated in the implementation, taking the expectation of the NLL under the true transition distribution gives the cross entropy between p and q_θ:

𝔼_{s_{t+1} ~ p} [r_t^{int}] = 𝔼_{s_{t+1} ~ p} [-\log q_θ(s_{t+1} | s_t, a_t)] = H(p, q_θ)

(25)

The cross entropy can be decomposed as:

H(p, q_θ) = H(p) + D_{KL}(p(s_{t+1} | s_t, a_t) ‖ q_θ(s_{t+1} | s_t, a_t))

(26)

Since H(p) is independent of the model parameters, reducing the expected NLL is equivalent to reducing the KL divergence between the true transition dynamics and the learned predictive dynamics up to a constant. Therefore, the proposed intrinsic reward should be interpreted as a sample-wise NLL-based proxy for transition novelty, with a theoretical connection to KL divergence in expectation.
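The decomposition in Equation (26) can be checked numerically in a case where every term has a closed form. The sketch below uses one-dimensional Gaussians for both p and q_θ. This is purely an illustrative assumption, since the true transition distribution p is unknown in practice, and the function names are hypothetical.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy H(p) of N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between two one-dimensional Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def gaussian_cross_entropy(mu_p, sigma_p, mu_q, sigma_q):
    """H(p, q) = E_p[-log q], i.e. the expected NLL of q under p."""
    return (0.5 * math.log(2.0 * math.pi * sigma_q ** 2)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2))


# Toy parameters: true dynamics p and an imperfect model q.
lhs = gaussian_cross_entropy(0.0, 1.0, 0.5, 2.0)
rhs = gaussian_entropy(1.0) + gaussian_kl(0.0, 1.0, 0.5, 2.0)
```

Here `lhs` and `rhs` agree up to floating-point error, and the KL term vanishes when q matches p, leaving only the irreducible entropy H(p).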

4.3. Resource-Aware Intrinsic Surprise Exploration

In practical UAV collision avoidance tasks, excessive exploratory actions or redundant alerts can increase remote operator workload and reduce system trustworthiness. To further examine this problem, we trained collision avoidance agents using state-of-the-art reinforcement learning methods, including the baseline SAC-D and the surprise-based intrinsic exploration model, under unconstrained resource conditions. The training statistics summarized in Figure 2 and Table 4 reveal that both approaches tend to generate an excessive number of advisories—on average more than ten alerts per episode—with frequent strengthening and reversal actions. These results indicate that while intrinsic-exploration-based learning improves policy responsiveness, it also amplifies advisory activity, leading to unstable and operationally inefficient behavior. The challenge, therefore, lies in achieving a better balance between effective exploration and practical alert management. To address this issue, a resource-aware intrinsic reward mechanism is introduced, guiding the learning process to adapt exploration intensity according to the available alert resources. This design enables the agent to maintain efficient policy learning while avoiding unnecessary advisory activations under resource-limited conditions.

The key component of this mechanism is the resource coefficient ρ_t ∈ [0, 1], which quantifies the remaining decision resource or allowable alert capacity at time step t. The coefficient is computed as a normalized function of the resource state embedded in the observation vector:

ρ_t = f(r_t^{res}) = (\frac{r_t^{res}}{r_{\max}})^{β}

(27)

where r_t^{res} denotes the current resource level, r_max is the predefined maximum, and β > 1 controls the sensitivity of the reduction. When the available resource decreases, ρ_t shrinks and correspondingly suppresses the influence of intrinsic exploration, encouraging the policy to focus on task-relevant, low-cost advisories. This mechanism reflects the intuition that the collision-avoidance system should be more cautious and less exploratory when its “alert budget” becomes constrained.
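A minimal sketch of Equation (27) is given below. The value β = 2 is a hypothetical choice that merely satisfies β > 1, and clamping the ratio to [0, 1] is an added safeguard rather than part of the stated formula.

```python
def resource_coefficient(r_res, r_max, beta=2.0):
    """Eq. (27): rho_t = (r_res / r_max) ** beta, kept within [0, 1]."""
    frac = min(max(r_res / r_max, 0.0), 1.0)  # safeguard, not in Eq. (27)
    return frac ** beta


# With r_max = 20 (the moderate-resource setting) and the assumed beta = 2:
full = resource_coefficient(20, 20)   # all resource remaining
half = resource_coefficient(10, 20)   # half of the resource remaining
empty = resource_coefficient(0, 20)   # resource exhausted
```

With β > 1 the gate closes faster than linearly: in this example, half of the budget already reduces the coefficient to one quarter.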

Because the numerical magnitudes of the external and intrinsic rewards may differ significantly and may vary during training, an adaptive scaling coefficient is introduced:

η_{t} = η_{0} λ_{tar} \frac{EMA (S_{ext})}{EMA (S_{int}) + ϵ}

(28)

The batch-level reward magnitude estimates are defined as:

S_{ext} = \frac{1}{B} \sum_{b=1}^{B} |r_b^{ext}|, S_{int} = \frac{1}{B} \sum_{b=1}^{B} |r_b^{int}|

(29)

Here, S_ext and S_int denote the batch-level magnitude estimates of the external and intrinsic rewards, respectively, and B is the batch size. EMA(⋅) denotes the exponential moving average (EMA) used to estimate the running reward scale, and ϵ is a small positive constant for numerical stability.

The EMA operation provides a running estimate of the external and intrinsic reward scales and prevents the adaptive coefficient from reacting excessively to high-variance batch-level fluctuations. Therefore, η_t maintains a stable relative contribution of the intrinsic reward throughout training.
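Equations (28) and (29) can be sketched as a small stateful helper. The class name, the decay rate of 0.99, and the value of ϵ are hypothetical. The defaults η₀ = 0.3 and λ_tar = 0.6 and the initialization of both running statistics to 1 follow the values reported in this section and in Algorithm 1.

```python
class AdaptiveScale:
    """Tracks EMA reward magnitudes and returns eta_t as in Eq. (28)."""

    def __init__(self, eta0=0.3, lam_tar=0.6, decay=0.99, eps=1e-8):
        self.eta0 = eta0
        self.lam_tar = lam_tar
        self.decay = decay      # assumed EMA decay rate
        self.eps = eps
        self.ema_ext = 1.0      # running scale of the external reward
        self.ema_int = 1.0      # running scale of the intrinsic reward

    def update(self, r_ext_batch, r_int_batch):
        # Eq. (29): batch-level mean absolute reward magnitudes.
        s_ext = sum(abs(r) for r in r_ext_batch) / len(r_ext_batch)
        s_int = sum(abs(r) for r in r_int_batch) / len(r_int_batch)
        # Exponential moving averages smooth batch-to-batch fluctuations.
        self.ema_ext = self.decay * self.ema_ext + (1.0 - self.decay) * s_ext
        self.ema_int = self.decay * self.ema_int + (1.0 - self.decay) * s_int
        # Eq. (28): adaptive coefficient.
        return self.eta0 * self.lam_tar * self.ema_ext / (self.ema_int + self.eps)
```

A decay close to 1 makes η_t respond slowly to any single batch, which corresponds to the smoothing role the text assigns to the EMA.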

The target ratio λ_tar controls the desired contribution of the intrinsic reward relative to the external reward scale. It should be emphasized that λ_tar is not the final intrinsic reward ratio by itself. Instead, η₀λ_tar determines the pre-gating target scale, while the actual intrinsic contribution in the final reward is also modulated by the resource-aware coefficient ρ_t. Combining both mechanisms, the overall reward shaping function for each transition is expressed as:

r'_t = r_t + η_t ρ_t r_t^{int}

(30)

where r_t^{int} denotes the intrinsic reward derived from the prediction-error-based surprise model described in Section 4.2. Thus, the actual intrinsic reward term added to the external reward is η_t ρ_t r_t^{int}. When the remaining resource is sufficient, ρ_t ≈ 1, and the intrinsic reward is allowed to contribute according to the target scale controlled by η_t. When the remaining resource becomes scarce, ρ_t → 0, and the intrinsic reward is progressively suppressed regardless of its surprise value.

In our study, η₀ = 0.3 and λ_tar = 0.6, giving a pre-gating target scale of η₀λ_tar = 0.18. Therefore, when the resource gate is fully open, the surprise-based intrinsic reward is encouraged to remain approximately within 10–20% of the external reward scale. Under resource-limited conditions, the final contribution is further reduced by ρ_t, which prevents exploration bonuses from overpowering the task-specific objectives related to safety, collision avoidance, and resource conservation.
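The effect of the resource gate in Equation (30) can be illustrated with a short worked example. It assumes that the running reward scales are equal, so that η_t reduces to η₀λ_tar = 0.18. The reward values and the function name are hypothetical.

```python
def shaped_reward(r_ext, r_int, eta_t, rho_t):
    """Eq. (30): external reward plus the gated, scaled intrinsic bonus."""
    return r_ext + eta_t * rho_t * r_int


ETA_T = 0.3 * 0.6  # eta_0 * lambda_tar, assuming EMA(S_ext) == EMA(S_int)

# Same transition (r_ext = -1.0, r_int = 2.0) under three gate settings.
open_gate = shaped_reward(-1.0, 2.0, ETA_T, rho_t=1.0)
half_gate = shaped_reward(-1.0, 2.0, ETA_T, rho_t=0.25)
closed_gate = shaped_reward(-1.0, 2.0, ETA_T, rho_t=0.0)
```

As ρ_t falls from 1 to 0, the bonus shrinks from 0.36 to 0.09 to 0, so the shaped reward moves from about −0.64 through −0.91 to the purely external value of −1.0.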

The resulting algorithm, termed RAISE, effectively integrates adaptive exploration with resource-constrained decision-making, as illustrated in Figure 3 and summarized in Algorithm 1. This design ensures that the intrinsic reward serves as an auxiliary exploration signal whose magnitude is jointly controlled by EMA normalization and the resource coefficient. It enables the agent to balance two key objectives: ensuring safety through effective conflict-resolution advisories and reducing unnecessary or high-cost alerts under limited resource conditions. This mechanism provides a robust and data-efficient learning framework that promotes stable policy optimization in dynamic encounter scenarios.

Algorithm 1: RAISE—Resource-Aware Intrinsic Surprise Exploration

Input: Environment E, policy π_θ, ensemble model f_ϕ, replay buffer B, base coefficient η₀, target ratio λ_tar, resource limit r_max
Initialize: θ, ϕ, B, EMA statistics \hat{R}_{ext} ← 1, \hat{R}_{int} ← 1
1: for training step t = 1, 2, … do
2:   Observe state s_t and resource level r_t^{res}
3:   Select action a_t ~ π_θ(a_t | s_t) and execute in environment E
4:   Obtain next state s_t₊₁ and external reward r_t
5:   Store (s_t, a_t, r_t, s_{t+1}) into buffer B
7:   if update step is due then
8:     Sample a mini-batch {(s, a, r, s')} from B
9:     Compute intrinsic reward: r^{int} = -\log p_ϕ(s' | s, a)
10:    Update EMA statistics:
11:    \hat{R}_{ext} ← α \hat{R}_{ext} + (1 - α) 𝔼[|r|]; \hat{R}_{int} ← α \hat{R}_{int} + (1 - α) Std[r^{int}]
12:    Compute adaptive coefficient: η_t = η_0 λ_{tar} \frac{\hat{R}_{ext}}{\hat{R}_{int} + ε}
13:    Compute resource weight: ρ_t = (r_t^{res} / r_{\max})^{β}
14:    Form shaped reward: r'_t = r_t + η_t ρ_t r^{int}
15:    Update critic Q_ψ and actor π_θ using SAC with r'_t
16:    Update model f_ϕ by minimizing negative log-likelihood loss
17:   end if
18: end for
Output: Trained policy parameters θ

5. Experimental Simulation and Result Analysis

5.1. Experimental Setup

To evaluate the effectiveness of the proposed RAISE algorithm, comparative experiments were conducted against three baselines: SAC-D, Surprise, and SAC-D-resource. All experiments were performed in a simulated two-UAV vertical encounter environment, where the ego-UAV executes vertical resolution advisories while the intruder follows a goal-directed approach strategy. The simulation runs in discrete time steps of 1 s, consistent with high-level tactical decision-making frequencies and typical surveillance update rates (e.g., ADS-B) for autonomous UAV frameworks.

The proposed RAISE algorithm extends the SAC-Discrete framework by incorporating an ensemble dynamics model for surprise-based intrinsic motivation and a resource-aware exploration coefficient that adapts the intensity of intrinsic rewards according to the remaining alert resource. This design allows the agent to maintain efficient exploration while reducing redundant or high-cost advisories under limited resources.

All algorithms were trained under the same conditions for 1.6 × 10⁶ environment steps (approximately 1500 iterations). The main experimental configuration used a resource limit of r_max = 20, representing a moderate operational alert capacity. Additional analyses in Section 5.4 examine the effects of constrained (r_max = 15) and abundant (r_max = 25) resource settings. Replay buffer sizes, batch sizes, and learning rates were tuned for stable performance across models. For all methods, the entropy temperature α was automatically adjusted to balance exploration and exploitation, and target networks were updated using a soft-update rate of τ = 0.005. The detailed hyperparameter configuration is summarized in Table 5.

5.2. Algorithm Performance Evaluation Metrics

To comprehensively assess the performance of RL models in longitudinal UAV collision avoidance tasks, a set of interpretable and safety-oriented evaluation metrics is defined. These metrics are derived from the reward structure of the environment and are designed to capture the agent’s effectiveness across three critical dimensions: flight safety, avoidance efficiency, and advisory quality. All statistics are computed over the last 400 evaluation episodes to ensure convergence and stability.

Reward: Mean cumulative reward per episode, representing the overall trade-off between safety and alert efficiency.
Collision rate: Frequency of near-miss events, directly measuring flight safety.
Resolution success: Proportion of encounters maintaining safe vertical separation.
Alert: Mean number of advisories per episode, reflecting alert frequency.
Strengthening: Average number of increased-intensity advisories (e.g., transitioning from a standard to an aggressive vertical maneuver), indicating risk sensitivity.
Reversal: Number of reversals between climb and descend advisories, reflecting policy stability.
Crossing: Frequency of altitude crossings between UAVs, highlighting residual conflict risks.

These indicators provide a concise yet comprehensive evaluation framework, linking reinforcement learning performance with operationally relevant safety outcomes.
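The aggregation described above can be sketched as follows, assuming that each evaluation episode is logged as a dictionary with one entry per metric. The key names and the record layout are hypothetical. The population standard deviation is used here as an assumption, since the text does not state which estimator underlies the reported spread.

```python
from statistics import mean, pstdev

# Hypothetical per-episode log keys, one per metric listed above.
METRIC_KEYS = ("reward", "collision", "resolved", "alerts",
               "strengthenings", "reversals", "crossings")


def summarize(episodes, window=400, keys=METRIC_KEYS):
    """Return {metric: (mean, std)} over the last `window` episodes."""
    recent = episodes[-window:]
    summary = {}
    for key in keys:
        values = [float(ep[key]) for ep in recent]
        summary[key] = (mean(values), pstdev(values))
    return summary


# Two toy episodes, for illustration only.
log = [
    {"reward": -10.0, "collision": 0, "resolved": 1, "alerts": 2,
     "strengthenings": 2, "reversals": 0, "crossings": 0},
    {"reward": -20.0, "collision": 0, "resolved": 0, "alerts": 4,
     "strengthenings": 4, "reversals": 2, "crossings": 0},
]
stats = summarize(log)
```

Binary per-episode flags such as `collision` and `resolved` average to the collision rate and the resolution success rate, while count fields average to the per-episode means of alerts, strengthening actions, reversals, and crossings.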

5.3. Test Scenario Design

To evaluate the generalization ability of the proposed algorithm under diverse encounter conditions, a hybrid scenario generation strategy combining semi-random, fixed-point, and fully random initialization was adopted. This approach provides a balance between interpretability—by using structured and repeatable conflict geometries—and robustness, by exposing the agent to diverse and stochastic encounters.

Four representative scenarios (A–D) were constructed as shown in Table 6. Each scenario specifies the initial vertical states of the ego-UAV and intruder while introducing partial randomness in altitude and velocity to simulate realistic variability in flight encounters. Notably, the vertical rate boundaries (up to ±20 m/s) were specifically designed to encompass the kinematic capabilities of high-performance fixed-wing maritime UAVs and ship-borne VTOL platforms, accommodating the rapid altitude transitions required in dynamic marine environments.

Scenarios A–C represent semi-random and fixed-point configurations designed to analyze policy behavior in structured and interpretable conflict geometries. Scenario D introduces a fully random configuration to test algorithmic robustness and generalization under stochastic conditions.

For reproducibility, random seeds were fixed during evaluation, and each model was trained with multiple random initializations. This ensured that the comparative results across SAC-D, Surprise, and RAISE were statistically consistent and not affected by sampling variance.

5.4. Training Performance and Results of the Model

5.4.1. Performance Comparison Under Moderate Resource (20)

To evaluate the effectiveness of the proposed RAISE algorithm in a balanced operational environment, training experiments were conducted under a moderate resource limit of 20. This setting represents a practical operating condition in which the collision avoidance system retains sufficient advisory capability while remaining constrained from issuing excessive or unnecessarily complex maneuvers. The comparative performance of SAC-D, SAC-D-resource, Surprise, and RAISE is reported in Table 7, while their training dynamics and detailed behavioral metrics are illustrated in Figure 4 and Figure 5. All numerical results are reported as the mean ± standard deviation.

Overall, RAISE achieved the best average performance among the four methods under the resource-constrained setting. As shown in Table 7, RAISE obtained the highest average reward of −14.183 ± 2.757, improving over SAC-D (−15.912 ± 4.084), SAC-D-resource (−15.204 ± 3.586), and Surprise (−15.546 ± 3.707). In addition, RAISE achieved the highest resolution success rate of 0.843 ± 0.056 compared with 0.821 ± 0.080 for SAC-D, 0.831 ± 0.070 for SAC-D-resource, and 0.821 ± 0.073 for Surprise. Its collision rate was also among the lowest, reaching 0.003 ± 0.006, which is comparable to Surprise and lower than those of SAC-D and SAC-D-resource. These results indicate that RAISE improves the overall quality of conflict-resolution decisions while maintaining a low residual collision risk.

The advantage of RAISE is particularly evident in its advisory behavior. Although RAISE generated the highest average number of alerts (2.157 ± 0.529), the increase was moderate relative to SAC-D-resource (2.036 ± 0.426) and Surprise (2.082 ± 0.300). More importantly, RAISE substantially reduces complex or unstable advisory adjustments. It achieved the lowest strengthening value of 2.397 ± 0.497, compared with 3.226 ± 0.582 for SAC-D, 3.113 ± 0.538 for SAC-D-resource, and 2.834 ± 0.558 for Surprise. RAISE also produced the lowest reversal value of 0.014 ± 0.016, further improving upon Surprise (0.024 ± 0.026) and markedly outperforming SAC-D and SAC-D-resource. Its crossing value of 0.056 ± 0.039 was close to the best-performing SAC-D-resource baseline (0.052 ± 0.037) and lower than those of SAC-D and Surprise. Collectively, these results suggest that RAISE does not achieve improved performance simply by issuing more advisories; rather, it tends to generate more stable and less contradictory resolution actions.

Figure 4 presents the evolution of the evaluation return during training. All four methods improved rapidly in the initial stage and gradually stabilized thereafter. RAISE established an early return advantage and maintained a higher mean return than the baselines through most of the training. The enlarged late-stage view further highlights this advantage: while the baseline curves remained clustered at lower levels, RAISE preserved the highest mean return throughout the zoomed interval. This indicates that RAISE learns a better-performing policy under the same advisory resource constraint and retains this benefit after convergence. Despite partial overlap of the shaded regions due to seed-dependent variability, the sustained separation of the mean curves supports a clear and persistent improvement.

The detailed training metrics in Figure 5 further clarify the source of this improvement. All methods rapidly reduced the collision rate and increased resolution success during the early stage of learning. In the later stage, RAISE maintained slightly higher resolution success while exhibiting noticeably lower strengthening and reversal values than the competing approaches. Although its alert count was relatively high, the associated advisories were less frequently strengthened or reversed, indicating a more coherent resolution policy under limited resources. This behavior is consistent with the results in Table 7: RAISE improves the average reward and resolution success mainly by reducing unstable or unnecessarily escalated advisory actions, rather than by aggressively increasing maneuver frequency.

Overall, the results demonstrate that RAISE provides a favorable balance between conflict-resolution effectiveness and advisory stability under a resource limit of 20. Compared with the baseline methods, it achieved the highest average reward and resolution success rate, maintained a low collision rate, and substantially reduced strengthening and reversal behaviors. These findings support the effectiveness of the proposed resource-aware exploration mechanism in learning more reliable and stable collision-resolution policies in constrained operational environments.

5.4.2. Resource-Level Sensitivity and Adaptive Behavior Under Resource-Constrained and Resource-Rich Environments (15 and 25)

To further examine the adaptability and robustness of the proposed RAISE framework, additional experiments were conducted under resource-constrained (15) and resource-rich (25) conditions. These settings respectively simulate operational scenarios with limited alert capacity and ample advisory freedom, allowing an evaluation of how the resource-aware mechanism adjusts exploration and alert issuance behaviors.

Under resource level 15, where the system operates under tight alert constraints, RAISE maintained superior overall training performance compared to SAC-D and Surprise (Table 8). As shown in Figure 6 (left), RAISE not only converged faster but also achieved the highest final return (−23.39), outperforming SAC-D (−24.83) and Surprise (−26.74). Its collision rate remained the lowest throughout most of the training process (≈0.02), confirming that safety can still be preserved even when alert resources are scarce. In terms of advisory behavior (Figure 7), RAISE exhibited a pattern consistent with the moderate-resource case (Section 5.4.1): it issued slightly more alerts (1.65 per episode) than the baselines but with notably fewer strengthening actions (2.07 vs. 3.00) and almost no reversal actions (0.003). This indicates that RAISE allocates its limited alert budget more effectively, favoring timely, stable advisories over reactive corrections and unnecessary intensifications.

Under resource level 25, representing a resource-rich operating condition, all algorithms achieved comparably high returns (around −8 to −10; Table 9, Figure 6 right), with RAISE maintaining a slight performance edge (−8.22 vs. −8.59 for SAC-D and −9.79 for Surprise). Although the alert frequency naturally increased due to relaxed resource limitations, RAISE continued to demonstrate the lowest reversal rate (0.19) and maintained a moderate alert strength (2.35) throughout training, reflecting its ability to adaptively adjust advisory intensity once sufficient safety margins are established (Figure 8). Notably, while Surprise initially exhibited a higher alert frequency, RAISE gradually surpassed it in later stages, showing that the algorithm dynamically balances safety assurance with efficient resource utilization as the environment stabilizes.

Overall, these supplementary experiments validate that RAISE achieves consistent, balanced performance across resource regimes. When resources are scarce, it effectively suppresses excessive advisory changes while maintaining a low collision rate and stable convergence. When resources are abundant, it flexibly expands advisory activity and exploration intensity without destabilizing the learned policy. This adaptability highlights the generalization capability of the resource-aware intrinsic exploration mechanism, supporting the consistency and scalability of the main findings under diverse operational constraints.

These results can also be interpreted as a sensitivity evaluation with respect to the available advisory-resource budget. As the resource level changes from 15 to 20 and 25, RAISE does not exhibit abrupt degradation or unstable advisory behavior. Instead, the policy adapts its command strategy according to the available budget: under scarce resources, it suppresses high-burden advisory transitions such as strengthening and reversal; under abundant resources, it permits more advisory activity to improve safety margins while maintaining stable command behavior. This indicates that the proposed resource-aware mechanism is not tuned to a single resource setting, but remains effective across different operational budgets.

5.5. Performance Evaluation in Testing Scenarios

To further evaluate the generalization capability of the proposed RAISE algorithm beyond the training distribution, testing experiments were conducted under four representative encounter scenarios, denoted as Scenarios A–D, using a moderate resource limit of 20. These scenarios cover different vertical-rate configurations and conflict geometries, enabling the policies to be examined under both structured and stochastic encounter conditions. In addition to the learning-based baselines, a rule-based controller and an artificial potential field controller were included as deterministic non-learning baselines. All methods were evaluated under the same scenarios and metrics, and the results are summarized in Figure 9 and Table 10.

Overall, RAISE achieved the most favorable safety-performance trade-off among the compared learning-based methods. Across Scenarios A, B, and D, RAISE obtained the highest average reward, with values of −21.739, −21.133, and −21.486, respectively. These results outperformed SAC-D, SAC-D-resource, and Surprise, indicating that the resource-aware intrinsic exploration mechanism improves policy generalization under unseen encounter configurations. In Scenario C, APF obtained a slightly higher reward than RAISE; however, this advantage was accompanied by a substantially higher collision rate and reversal count. Therefore, the reward result in Scenario C should be interpreted together with the safety and advisory-stability metrics rather than as an isolated measure.

In terms of collision avoidance, RAISE demonstrated the most reliable safety behavior. It achieved the lowest collision rate in all four scenarios, with collision rates of 0.026, 0.024, 0.004, and 0.012 in Scenarios A–D, respectively. Compared with SAC-D and Surprise, RAISE consistently reduced the collision risk while maintaining competitive or superior resolution success. For example, in Scenario A, the resolution success rate increased from 0.566 for SAC-D and 0.654 for Surprise to 0.742 for RAISE. A similar improvement was observed in Scenario B, where RAISE reached a resolution success rate of 0.752. In Scenario D, which represents a more random testing condition, RAISE still achieved a resolution success rate of 0.696, outperforming all other methods. Although APF achieved the highest resolution success in Scenario C, its collision rate remained considerably higher than that of RAISE, suggesting that its successful resolutions are less consistently associated with safe separation maintenance.

The comparison with deterministic controllers further highlights the advantage of learning a resource-aware policy. Rule-based and APF methods issue fewer alerts and produce very low strengthening counts, but this apparent command economy is achieved at the cost of weaker safety and poorer adaptability. In Scenarios A, B, and D, both deterministic methods showed substantially higher collision rates than RAISE. For instance, APF reached collision rates of 0.144, 0.120, and 0.144 in Scenarios A, B, and D, respectively, whereas RAISE kept the corresponding rates at 0.026, 0.024, and 0.012. Rule-based control exhibited a similar limitation, with low alert frequency but reduced resolution success and increased collision risk. These results indicate that simply minimizing the number of advisories does not necessarily lead to operationally acceptable behavior; rather, effective collision avoidance requires timely and context-sensitive interventions.

RAISE also showed a clear advantage in advisory stability. Although it generated more alerts than the other learning-based baselines, these alerts were more consistent and less oscillatory. The reversal count of RAISE was the lowest in all four scenarios, with values of 0.044, 0.050, 0.090, and 0.048 in Scenarios A–D. These values were substantially lower than those of SAC-D, SAC-D-resource, rule-based, and APF, and were also lower than those of Surprise in all scenarios. The reduction in reversal behavior is important because frequent changes in advisory direction may lead to confusing or contradictory guidance for remote operators. In addition, RAISE reduced strengthening actions compared with SAC-D and SAC-D-resource in every scenario, indicating that the policy tends to generate earlier and more stable advisories instead of repeatedly escalating avoidance commands.

It is also worth noting that the deterministic controllers showed low strengthening counts mainly because their advisory activity is limited and less adaptive, not because they achieve a better balance between safety and command economy. Their high reversal rates and collision rates reveal that low command frequency alone is insufficient for robust collision avoidance. In contrast, RAISE uses a larger number of well-targeted alerts to maintain separation while simultaneously suppressing reversals and excessive strengthening. This behavior is consistent with the design objective of the resource-aware mechanism: the goal is not to minimize all commands indiscriminately, but to reduce redundant, contradictory, or operationally inefficient advisories while preserving safety.

In summary, the testing results confirm that RAISE generalizes well to unseen encounter scenarios and achieves a strong balance between collision safety and advisory stability. Compared with learning-based baselines, it consistently improves reward, reduces collision rate, and suppresses reversal behavior. Compared with deterministic controllers, it provides more adaptive and reliable avoidance decisions under diverse encounter geometries. These findings support the effectiveness of integrating resource-aware modulation and surprise-based intrinsic exploration for operationally suitable maritime UAV collision avoidance.

Figure 10 provides a trajectory-level visualization of the representative testing scenarios, where the Ego-UAV is shown on the left side of each subplot and the intruding aircraft is shown on the right side. The trajectories further illustrate the behavioral differences among the compared methods. The learning-based baselines exhibited less consistent vertical maneuver patterns in several scenarios, while the deterministic controllers generally produced simpler but relatively rigid avoidance responses. These observations are consistent with the quantitative results reported above, where the baselines showed higher collision rates, higher reversal counts, or weaker overall conflict-resolution performance. In contrast, RAISE generated more stable and progressive vertical maneuvers across the testing scenarios. Its trajectories avoid abrupt direction changes and excessive oscillations, supporting the quantitative results that RAISE achieves lower collision rates and fewer reversals while maintaining effective conflict resolution. These visual results therefore provide additional evidence that the proposed resource-aware mechanism contributes to both safety performance and avoidance-behavior stability.

5.6. Robustness Evaluation

While full sim-to-real deployment was beyond the scope of this work, robustness under realistic perturbations provides an important proxy for evaluating the operational reliability of collision-avoidance policies. In real maritime UAV operations, the state information available to the avoidance system may be affected by GPS positioning errors, barometric altitude inaccuracies, state-estimation uncertainty, and stochastic wind gusts. Therefore, additive Gaussian observation noise and stochastic vertical wind perturbations were adopted as controlled simulation proxies for sensor-measurement uncertainty and external wind-induced dynamic disturbances, respectively. A robustness evaluation was conducted to examine whether the learned policies could maintain reliable performance under degraded observation and dynamic conditions.

The robustness evaluation considered three types of perturbation in addition to the clean setting. First, sensor noise was introduced by adding Gaussian noise to the directly measured state variables: relative altitude, ego-UAV vertical velocity, and intruder vertical velocity. Two noise levels were tested, corresponding to 5% and 10% of the operational ranges of the corresponding state variables. Second, wind disturbance was introduced by applying stochastic vertical wind perturbations to the ego-UAV dynamics. Third, a combined perturbation condition was considered, in which sensor noise and wind disturbance were applied simultaneously. With the two noise levels, this yields four perturbed configurations. All methods were evaluated under the same perturbation settings and metrics, and the results are reported in Table 11.
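The perturbation setup described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the dictionary keys, the function names, and the i.i.d. Gaussian gust model are hypothetical, and the sketch assumes that "5% of the operational range" means a noise standard deviation equal to 0.05 times the full width of each range in Table 1.

```python
import numpy as np

# Full width (max - min) of the operational ranges listed in Table 1.
STATE_RANGES = {
    "rel_altitude": 1500.0,      # [-750, 750] m
    "ego_vertical_rate": 40.0,   # [-20, 20] m/s
    "int_vertical_rate": 30.0,   # [-15, 15] m/s
}


def add_sensor_noise(obs, noise_level, rng):
    """Return a copy of obs with zero-mean Gaussian noise on the measured variables.

    Assumption: the standard deviation is noise_level times the range width.
    Variables not listed in STATE_RANGES (for example tau) are left untouched.
    """
    noisy = dict(obs)
    for key, width in STATE_RANGES.items():
        noisy[key] = obs[key] + rng.normal(0.0, noise_level * width)
    return noisy


def sample_vertical_gust(gust_std, rng):
    """Hypothetical wind model: an i.i.d. Gaussian vertical disturbance."""
    return float(rng.normal(0.0, gust_std))


rng = np.random.default_rng(0)
obs = {"rel_altitude": 120.0, "ego_vertical_rate": -2.0,
       "int_vertical_rate": 3.0, "tau": 25.0}
noisy_obs = add_sensor_noise(obs, noise_level=0.05, rng=rng)
gust = sample_vertical_gust(gust_std=1.0, rng=rng)
```

In the combined condition, both functions would be applied in the same time step: the gust enters the ego-UAV dynamics, and the noise is added to the observation passed to the policy.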

Overall, RAISE demonstrated the strongest robustness among the compared methods. Under the clean condition, RAISE achieved the highest reward, the lowest collision rate, and the highest resolution success rate, with a collision rate of 0.022 and a resolution success rate of 0.637. When 5% sensor noise was introduced, the collision rate of RAISE increased to 0.044, but remained lower than those of SAC-D, Surprise, and APF. Under 10% sensor noise, all methods experienced performance degradation, but RAISE still maintained the lowest collision rate among the compared methods, with a value of 0.127. This indicates that although sensor uncertainty makes the avoidance task more difficult, RAISE remains less sensitive to observation perturbations than the baseline methods.

The advantage of RAISE was also evident under wind disturbance. In the wind-disturbed environment, RAISE achieved a collision rate of 0.020 and a resolution success rate of 0.609, both of which remained close to its clean-environment performance. In comparison, SAC-D and Surprise exhibited lower resolution success rates, while APF showed a higher collision rate and weaker overall robustness. Under the combined Noise + Wind condition, all methods degraded further, as expected. Nevertheless, RAISE still obtained the best reward and the lowest collision rate, with a collision rate of 0.128, outperforming SAC-D, Surprise, and APF. These results suggest that the proposed method maintains a more reliable safety margin under simultaneous observation and dynamic perturbations.

In terms of advisory behavior, RAISE achieved the lowest reversal count in every environment configuration. Its reversal count was 0.076 under the clean condition, 0.089 under 5% noise, 0.122 under 10% noise, 0.077 under wind disturbance, and 0.121 under the combined Noise + Wind condition. These values were substantially lower than those of APF and were also lower than those of SAC-D and Surprise in every configuration. This result indicates that RAISE maintains stable advisory directions even when the observed state or vehicle dynamics are perturbed. Although RAISE produced more alerts than the other methods, these alerts were associated with lower collision risk and fewer reversals, suggesting that the additional interventions are consistent and safety-oriented rather than oscillatory.

The deterministic APF controller exhibited a different robustness pattern. APF generated the fewest alerts and strengthening actions in the clean environment, but its performance deteriorated substantially under sensor noise. Its collision rate increased from 0.109 in the clean environment to 0.225 under 5% noise and 0.296 under 10% noise, while its reversal count also increased markedly. Under the combined Noise + Wind condition, APF reached the highest collision rate among all methods, with a value of 0.303. These results show that low advisory frequency alone does not guarantee robust avoidance performance. A policy must also adapt its maneuvers to uncertain and perturbed encounter states.

The robustness of RAISE may be partly related to its resource-aware learning design. During training, the resource-aware modulation encourages the policy to balance collision avoidance performance with advisory stability, rather than minimizing or maximizing maneuver commands indiscriminately. In addition, the surprise-based intrinsic reward exposes the policy to less predictable state transitions, which may improve its tolerance to moderate distributional variations during testing. The bounded and threshold-based reward terms also help avoid overly sensitive responses to small state deviations near critical safety boundaries. These design factors provide a possible explanation for why RAISE maintains lower collision rates and fewer reversals under sensor noise, wind disturbance, and combined perturbations.
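To make the link between perturbed observations and the surprise signal concrete, the following toy sketch computes a prediction-error surprise with a small ensemble of forward models and compares a clean transition with noisy ones. Everything in it is an assumption for illustration: the kinematic model, the time step, the way the ensemble members are built, and the use of squared error as a stand-in for the negative log-likelihood listed in Table 5. It is not the article's dynamics model or reward.

```python
import numpy as np

DT = 1.0  # hypothetical time step (s)


def true_step(state, accel):
    """Toy vertical kinematics: state = (relative altitude, vertical rate)."""
    h, v = state
    return np.array([h + v * DT, v + accel * DT])


def make_ensemble(n_members, rng):
    """Hypothetical ensemble: the true model with a small random gain error."""
    gains = 1.0 + 0.01 * rng.standard_normal(n_members)
    return [lambda s, a, g=g: g * true_step(s, a) for g in gains]


def surprise(ensemble, state, accel, observed_next):
    """Squared one-step prediction error, averaged over ensemble members."""
    errors = [np.sum((f(state, accel) - observed_next) ** 2) for f in ensemble]
    return float(np.mean(errors))


rng = np.random.default_rng(0)
ensemble = make_ensemble(5, rng)
state, accel = np.array([100.0, -5.0]), 2.0
clean_next = true_step(state, accel)

clean = surprise(ensemble, state, accel, clean_next)
# Noise std of 75 m and 2 m/s, i.e. 5% of the 1500 m and 40 m/s ranges in Table 1.
noisy = float(np.mean([
    surprise(ensemble, state, accel, clean_next + rng.normal(0.0, [75.0, 2.0]))
    for _ in range(200)
]))
print(f"clean surprise: {clean:.2f}, mean surprise under noise: {noisy:.2f}")
```

Because the observation noise adds its own variance to the prediction error, the average surprise under noise is larger than on the clean transition, which is the qualitative behavior the paragraph above relies on.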

In summary, the robustness results confirm that RAISE provides more reliable performance under degraded operating conditions. Compared with learning-based baselines, it maintains lower collision rates, higher or competitive resolution success, and more stable advisory behavior. Compared with deterministic controllers, it shows stronger adaptability to sensor uncertainty and wind disturbance. These findings further support the effectiveness of the proposed resource-aware reinforcement learning framework for maritime UAV collision avoidance under realistic perturbations.

6. Conclusions

In this paper, we proposed a resource-aware intrinsic surprise exploration framework (RAISE) to address the challenge of balancing flight safety with operational suitability in deep reinforcement learning-based maritime UAV collision avoidance. By introducing a virtual resource mechanism, RAISE regulates alert frequency and intensity, treating system advisory capacity as a finite budget. This resource-aware modulation is combined with a surprise-based intrinsic motivation module, which supports effective exploration even under strict resource constraints. Comprehensive simulations demonstrate that RAISE consistently outperforms baseline methods across varying resource levels. It achieves superior operational suitability by substantially reducing alert reversals and strengthening counts, which are key indicators of control stability and operational overhead, while maintaining robust safety margins (low near-miss rates). When resources are scarce, the agent suppresses unstable and high-burden advisory transitions, especially reversal and strengthening commands, rather than simply minimizing the raw alert count; when resources are abundant, it flexibly optimizes trajectory smoothness without compromising stability. Results in unseen encounter geometries confirm the framework’s strong generalization capabilities beyond the training distribution. Furthermore, the surprise-based intrinsic motivation mechanism inherently reflects distribution shifts: under sensor noise or wind perturbations, prediction error increases naturally, helping the policy respond more cautiously to degraded operating conditions without retraining. Future research will focus on extending this approach to full 3D environments that incorporate horizontal maneuvers and on handling complex multi-UAV encounters. Additionally, we intend to integrate probabilistic remote operator response models and formal safe reinforcement learning techniques to enable a more comprehensive assessment of human–machine coordination with stronger safety guarantees in next-generation supervisory collision avoidance systems.

Author Contributions

Methodology, Z.L. and X.G.; software, Z.L. and Q.F.; validation, Q.F. and Z.W.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L., Q.F. and Z.W.; supervision, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly funded by the National Key Laboratory of Air-based Information Perception and Fusion and the Aeronautical Science Foundation of China, Grant No. 202471.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

RAISE	Resource-Aware Intrinsic Surprise Exploration
UAV	Unmanned Aerial Vehicle
EMA	Exponential Moving Average
DRL	Deep Reinforcement Learning
SAC-D	Soft Actor–Critic with Discrete Actions
APF	Artificial Potential Fields
VO	Velocity Obstacles
DQN	Deep Q-Networks
CRL	Constrained Reinforcement Learning
IM	Intrinsic Motivation
CPA	Closest Point of Approach
BVLOS	Beyond Visual Line of Sight

Figure 1. Representation of the UAV’s velocity and the relative position of the intruding UAV (I) with respect to the ego-UAV (E) at the CPA.

Figure 2. Excessive advisory activity during training under unconstrained conditions.

Figure 3. Conceptual architecture of the RAISE framework.

Figure 4. Average return curves of the algorithms during training under a moderate resource level (20).

Figure 5. Core performance metrics of the algorithms under a moderate resource level (20).

Figure 6. Average return of the algorithms under resource-constrained (15, left) and resource-rich (25, right) conditions.

Figure 7. Core performance metrics of the algorithms under the resource-constrained condition (15).

Figure 8. Core performance metrics of the algorithms under the resource-rich condition (25).

Figure 9. Performance comparison of four representative metrics of all trained models under a moderate resource level (20).

Figure 10. Collision-avoidance trajectories of the compared models across four representative test scenarios (A–D).

Table 1. State space variables.

Variable	Description	Values	Units
$h_t$	Relative intruder altitude	[−750, 750]	m
$\dot{h}_{ego,t}$	Ego-UAV vertical rate	[−20, 20]	m/s
$\dot{h}_{int,t}$	Intruding UAV vertical rate	[−15, 15]	m/s
$\tau_t$	Time to loss of horizontal separation	[0, 40]	s
$a_{prev}$	Previous advisory	See Table 2	-
$r_t^{res}$	Remaining resource	$[0, r_{max}]$	-

Table 2. Advisory set and advisory availability.

Action	Description	Acceleration	Available From
NOC	No conflict	0	All
DES-N	Normal descend ≤ −7.5 m/s	−g/3	NOC
CLB-N	Normal climb ≥ 7.5 m/s	g/3	NOC
DES-T	Transitional descend ≤ −7.5 m/s	−g/2.5	CLB-N, CLB-T, CLB-E, DES-E
CLB-T	Transitional climb ≥ 7.5 m/s	g/2.5	DES-N, DES-T, DES-E, CLB-E
DES-E	Escalated descend ≤ −12.5 m/s	−g/2.5	DES-N, DES-T
CLB-E	Escalated climb ≥ 12.5 m/s	g/2.5	CLB-N, CLB-T
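The "Available From" column of Table 2 can be read as a transition constraint on the discrete action space. The sketch below encodes it as a lookup table and derives a binary action mask from the previous advisory. The names `ADVISORIES`, `AVAILABLE_FROM`, and `action_mask` are hypothetical, and the table does not say whether an active advisory may simply be kept, so that choice is exposed as an explicit flag rather than assumed.

```python
# Advisory order used for the mask; the ordering itself is an arbitrary choice.
ADVISORIES = ["NOC", "DES-N", "CLB-N", "DES-T", "CLB-T", "DES-E", "CLB-E"]

# advisory -> set of previous advisories from which it may be issued (Table 2).
# "All" is expanded to every advisory.
AVAILABLE_FROM = {
    "NOC": set(ADVISORIES),
    "DES-N": {"NOC"},
    "CLB-N": {"NOC"},
    "DES-T": {"CLB-N", "CLB-T", "CLB-E", "DES-E"},
    "CLB-T": {"DES-N", "DES-T", "DES-E", "CLB-E"},
    "DES-E": {"DES-N", "DES-T"},
    "CLB-E": {"CLB-N", "CLB-T"},
}


def action_mask(prev_advisory, allow_hold=False):
    """Return a 0/1 list marking which advisories may follow prev_advisory.

    allow_hold=True additionally permits repeating the current advisory,
    which is an assumption that Table 2 neither confirms nor excludes.
    """
    mask = []
    for advisory in ADVISORIES:
        allowed = prev_advisory in AVAILABLE_FROM[advisory]
        if allow_hold and advisory == prev_advisory:
            allowed = True
        mask.append(int(allowed))
    return mask


print(action_mask("NOC"))    # initial advisories only
print(action_mask("CLB-N"))  # clear, reverse to DES-T, or escalate to CLB-E
```

Under this strict reading, a normal climb (CLB-N) can be followed by NOC, by a reversal to DES-T, or by an escalation to CLB-E, which matches the reversal and strengthening categories tracked in the result tables.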

Table 3. Parameters used in the extrinsic reward function.

Parameter	Value	Meaning
ω_col	100	Conflict penalty weight
ω_alert	0.3	Alert penalty
ω_str	0.3	Strengthening advisory penalty
ω_rev	0.5	Reversal advisory penalty
ω_cross	0.5	Vertical path-crossing penalty
ω_lim	0.1	Altitude-limit penalty coefficient
ω_ex	0.1	Excessive terminal-separation coefficient
γ_alt	25	Saturation bound of excessive terminal-separation penalty
ω_NOC	0.2	Conflict-clearance reward

Table 4. Comparison of advisory behaviors for SAC-D and surprise models.

Model	Alerts	Strengthening	Reversal
SAC-D	11.400	4.530	0.433
Surprise	10.733	6.790	0.900

Table 5. Simulation settings.

Parameter	Value/Setting
Learning rate	3 × 10⁻⁴
Optimizer	Adam
Discount factor (γ)	0.975
Replay buffer size	4 × 10⁴
Batch size	512
Intrinsic model	Ensemble dynamics (for Surprise, RAISE)
Intrinsic reward	Negative log-likelihood (Surprise)
Resource limit (r_max)	20 (default); 15/25 (for comparison)
Target update (τ)	0.005
Entropy coefficient (α)	Auto-tuned
Training iterations	1500 (all algorithms)
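For reproduction purposes, the settings in Table 5 can be collected in a single immutable configuration object. The class and field names below are hypothetical; only the values come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingConfig:
    """Values from Table 5; field names are illustrative."""
    learning_rate: float = 3e-4
    optimizer: str = "adam"
    discount: float = 0.975
    replay_buffer_size: int = 40_000
    batch_size: int = 512
    resource_limit: int = 20           # default; 15 and 25 used for comparison
    target_update_tau: float = 0.005
    entropy_coefficient: str = "auto"  # auto-tuned
    training_iterations: int = 1500


moderate = TrainingConfig()
constrained = replace(moderate, resource_limit=15)
rich = replace(moderate, resource_limit=25)
```

Keeping the three resource levels as variants of one base object makes it explicit that only the resource limit differs between the compared settings.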

Table 6. Parameter settings for the four test scenarios.

Scenario	$\dot{h}_{ego}$ (m/s)	$h_{int}$ (m)	$\dot{h}_{int}$ (m/s)
A	Uniform (−20, 20)	Uniform (−300, 300)	−3
B	10	Uniform (−300, 300)	Uniform (−15, 15)
C	Uniform (−20, 20)	0	Uniform (−15, 15)
D	Uniform (−20, 20)	Uniform (−300, 300)	Uniform (−15, 15)
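The scenario definitions in Table 6 amount to a small sampling specification: each parameter is either fixed or drawn from a uniform range. A minimal sketch of such a sampler is given below. The key names and the function `sample_encounter` are hypothetical, and the sketch covers only the three parameters listed in the table.

```python
import numpy as np

# Each entry is a fixed value or a (low, high) uniform range, as in Table 6.
SCENARIOS = {
    "A": {"ego_rate": (-20, 20), "int_alt": (-300, 300), "int_rate": -3},
    "B": {"ego_rate": 10, "int_alt": (-300, 300), "int_rate": (-15, 15)},
    "C": {"ego_rate": (-20, 20), "int_alt": 0, "int_rate": (-15, 15)},
    "D": {"ego_rate": (-20, 20), "int_alt": (-300, 300), "int_rate": (-15, 15)},
}


def sample_encounter(name, rng):
    """Draw one initial encounter configuration for the named scenario."""
    encounter = {}
    for key, value in SCENARIOS[name].items():
        if isinstance(value, tuple):
            low, high = value
            encounter[key] = float(rng.uniform(low, high))
        else:
            encounter[key] = float(value)
    return encounter


rng = np.random.default_rng(0)
for name in SCENARIOS:
    print(name, sample_encounter(name, rng))
```

Scenario D randomizes all three parameters, which is why the text refers to it as the more random testing condition.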

Table 7. Overall training results of all algorithms under a moderate resource level (20).

Model	Reward	Collision Rate	Resolution Success	Alert	Strengthening	Reversal	Crossing
SAC-D	−15.912 ± 4.084	0.006 ± 0.008	0.821 ± 0.080	1.506 ± 0.239	3.226 ± 0.582	0.112 ± 0.095	0.068 ± 0.042
SAC-D-resource	−15.204 ± 3.586	0.004 ± 0.006	0.831 ± 0.070	2.036 ± 0.426	3.113 ± 0.538	0.107 ± 0.056	0.052 ± 0.037
Surprise	−15.546 ± 3.707	0.003 ± 0.005	0.821 ± 0.073	2.082 ± 0.300	2.834 ± 0.558	0.024 ± 0.026	0.070 ± 0.046
RAISE	−14.183 ± 2.757	0.003 ± 0.006	0.843 ± 0.056	2.157 ± 0.529	2.397 ± 0.497	0.014 ± 0.016	0.056 ± 0.039

Table 8. Overall training results of the algorithms under the resource-constrained condition (15).

Model	Reward	Collision Rate	Resolution Success	Alert	Strengthening	Reversal	Crossing
SAC-D	−24.825	0.013	0.680	1.320	3.003	0.047	0.087
Surprise	−26.739	0.050	0.673	1.367	2.747	0.037	0.147
RAISE	−23.392	0.020	0.693	1.650	2.070	0.003	0.107

Table 9. Overall training results of the algorithms under the resource-rich condition (25).

Model	Reward	Resolution Success	Alert	Strengthening	Reversal	Crossing
SAC-D	−8.588	0.940	2.540	3.210	0.340	0.020
Surprise	−9.793	0.930	2.910	1.900	0.160	0.040
RAISE	−8.223	0.950	3.020	2.350	0.190	0.020

Table 10. Testing performance of all trained models across representative encounter scenarios under a moderate resource level (20).

Scenario	Agent	Reward	Collision Rate	Resolution Success	Alert	Reversal	Strengthening	Crossing
A	SAC-D	−32.127 ± 29.665	0.058 ± 0.234	0.566 ± 0.496	0.842 ± 0.932	0.138 ± 0.345	4.372 ± 3.266	0.248 ± 0.450
	SAC-D-resource	−28.257 ± 25.933	0.026 ± 0.159	0.622 ± 0.485	1.400 ± 1.446	0.158 ± 0.370	4.706 ± 3.331	0.214 ± 0.415
	Surprise	−27.427 ± 29.637	0.068 ± 0.252	0.654 ± 0.476	1.670 ± 1.348	0.140 ± 0.347	2.952 ± 2.942	0.194 ± 0.420
	RAISE	−21.739 ± 25.514	0.026 ± 0.159	0.742 ± 0.438	2.712 ± 1.579	0.044 ± 0.215	2.232 ± 2.468	0.182 ± 0.411
	Rule-based	−39.056 ± 30.374	0.126 ± 0.332	0.468 ± 0.499	0.676 ± 0.669	0.390 ± 0.520	0.136 ± 0.343	0.030 ± 0.182
	APF	−39.607 ± 32.237	0.144 ± 0.351	0.520 ± 0.500	0.502 ± 0.612	0.522 ± 0.664	0.222 ± 0.448	0.060 ± 0.284
B	SAC-D	−31.585 ± 27.116	0.044 ± 0.205	0.528 ± 0.499	0.638 ± 0.753	0.214 ± 0.410	4.270 ± 2.924	0.260 ± 0.439
	SAC-D-resource	−32.154 ± 28.539	0.032 ± 0.176	0.602 ± 0.489	1.040 ± 1.384	0.306 ± 0.486	4.986 ± 3.091	0.284 ± 0.451
	Surprise	−28.155 ± 30.499	0.064 ± 0.245	0.634 ± 0.482	1.546 ± 1.263	0.130 ± 0.365	2.654 ± 2.612	0.190 ± 0.407
	RAISE	−21.133 ± 23.556	0.024 ± 0.153	0.752 ± 0.432	3.086 ± 1.863	0.050 ± 0.227	1.568 ± 1.542	0.146 ± 0.353
	Rule-based	−42.088 ± 31.682	0.134 ± 0.341	0.394 ± 0.489	0.704 ± 0.658	0.382 ± 0.522	0.156 ± 0.368	0.028 ± 0.226
	APF	−41.883 ± 31.452	0.120 ± 0.325	0.460 ± 0.498	0.516 ± 0.627	0.496 ± 0.608	0.258 ± 0.451	0.034 ± 0.255
C	SAC-D	−35.408 ± 21.293	0.046 ± 0.209	0.442 ± 0.497	0.548 ± 0.795	0.240 ± 0.427	4.948 ± 3.220	0.270 ± 0.453
	SAC-D-resource	−36.139 ± 19.622	0.018 ± 0.133	0.414 ± 0.493	0.548 ± 0.510	0.178 ± 0.383	6.106 ± 2.841	0.432 ± 0.511
	Surprise	−35.151 ± 18.724	0.020 ± 0.140	0.440 ± 0.496	0.712 ± 0.897	0.212 ± 0.409	4.938 ± 2.883	0.362 ± 0.505
	RAISE	−33.436 ± 17.469	0.004 ± 0.063	0.452 ± 0.498	2.400 ± 1.498	0.090 ± 0.286	3.164 ± 2.508	0.320 ± 0.492
	Rule-based	−35.528 ± 24.032	0.070 ± 0.255	0.494 ± 0.500	0.656 ± 0.605	0.692 ± 0.770	0.342 ± 0.523	0.130 ± 0.617
	APF	−32.294 ± 23.037	0.054 ± 0.226	0.566 ± 0.496	0.564 ± 0.634	0.476 ± 0.634	0.168 ± 0.414	0.072 ± 0.428
D	SAC-D	−36.274 ± 31.526	0.102 ± 0.303	0.472 ± 0.499	0.802 ± 0.887	0.184 ± 0.387	3.884 ± 3.145	0.258 ± 0.451
	SAC-D-resource	−28.079 ± 24.354	0.028 ± 0.165	0.582 ± 0.493	1.286 ± 1.399	0.186 ± 0.433	4.310 ± 3.222	0.222 ± 0.452
	Surprise	−29.642 ± 30.082	0.068 ± 0.252	0.596 ± 0.491	1.524 ± 1.327	0.096 ± 0.314	2.934 ± 2.987	0.234 ± 0.455
	RAISE	−21.486 ± 22.068	0.012 ± 0.109	0.696 ± 0.460	2.786 ± 1.609	0.048 ± 0.240	2.264 ± 2.401	0.178 ± 0.422
	Rule-based	−36.610 ± 29.325	0.102 ± 0.303	0.458 ± 0.498	0.702 ± 0.673	0.416 ± 0.539	0.126 ± 0.332	0.046 ± 0.219
	APF	−39.022 ± 32.126	0.144 ± 0.351	0.490 ± 0.500	0.528 ± 0.630	0.544 ± 0.782	0.252 ± 0.494	0.072 ± 0.333
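The ± values for the rate metrics in Table 10 are much larger than the means, which can look alarming at first. They are consistent with a per-episode standard deviation of a binary (0/1) outcome, for which the standard deviation is sqrt(p(1 − p)). That interpretation is an assumption, since the table caption does not define the ± term, but the short check below reproduces several reported values from the means alone.

```python
import math


def bernoulli_std(p):
    """Standard deviation of a 0/1 indicator whose mean is p."""
    return math.sqrt(p * (1.0 - p))


# (reported mean, reported std) pairs copied from Scenario A of Table 10.
reported = {
    "RAISE collision rate": (0.026, 0.159),
    "SAC-D collision rate": (0.058, 0.234),
    "RAISE resolution success": (0.742, 0.438),
}

for label, (mean, std) in reported.items():
    print(f"{label}: computed {bernoulli_std(mean):.3f}, reported {std:.3f}")
```

Under this reading, the large ± values reflect the binary nature of the outcome rather than instability across runs, so differences between methods are better judged from the means and the number of test episodes than from the overlap of the ± intervals.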

Table 11. Robustness evaluation under sensor noise and wind disturbances (mean over 4 scenarios).

Environment Configuration	Agent	Reward	Collision Rate	Resolution Success	Alert	Reversal	Strengthening
Clean	SAC-D	−34.147 ± 27.647	0.065 ± 0.243	0.508 ± 0.499	0.682 ± 0.839	0.207 ± 0.404	4.514 ± 3.213
	Surprise	−29.300 ± 26.248	0.049 ± 0.213	0.597 ± 0.482	1.370 ± 1.238	0.143 ± 0.347	3.348 ± 2.950
	APF	−38.286 ± 28.781	0.109 ± 0.298	0.496 ± 0.497	0.499 ± 0.598	0.531 ± 0.705	0.246 ± 0.478
	RAISE	−25.532 ± 22.699	0.022 ± 0.144	0.637 ± 0.467	2.722 ± 1.632	0.076 ± 0.257	2.364 ± 2.316
Noise-5%	SAC-D	−37.391 ± 29.217	0.091 ± 0.283	0.477 ± 0.498	0.725 ± 0.881	0.219 ± 0.411	4.359 ± 3.108
	Surprise	−34.921 ± 28.015	0.081 ± 0.270	0.536 ± 0.493	1.726 ± 1.364	0.176 ± 0.389	3.139 ± 2.751
	APF	−46.455 ± 34.101	0.225 ± 0.408	0.435 ± 0.493	0.672 ± 0.859	0.919 ± 1.005	0.513 ± 0.662
	RAISE	−30.658 ± 25.052	0.044 ± 0.205	0.590 ± 0.485	3.086 ± 1.657	0.089 ± 0.280	2.216 ± 2.159
Noise-10%	SAC-D	−44.322 ± 31.355	0.153 ± 0.355	0.403 ± 0.490	0.822 ± 0.953	0.272 ± 0.446	4.105 ± 2.815
	Surprise	−42.701 ± 31.522	0.144 ± 0.350	0.453 ± 0.494	2.010 ± 1.445	0.204 ± 0.409	2.875 ± 2.437
	APF	−53.139 ± 35.377	0.296 ± 0.445	0.344 ± 0.473	0.954 ± 1.155	1.167 ± 1.161	0.664 ± 0.706
	RAISE	−41.192 ± 30.724	0.127 ± 0.331	0.471 ± 0.498	3.282 ± 1.607	0.122 ± 0.326	1.997 ± 1.774
Wind	SAC-D	−37.437 ± 27.931	0.077 ± 0.262	0.467 ± 0.495	0.702 ± 0.847	0.203 ± 0.399	4.492 ± 3.222
	Surprise	−32.292 ± 27.005	0.053 ± 0.221	0.559 ± 0.486	1.378 ± 1.252	0.157 ± 0.364	3.285 ± 2.924
	APF	−39.627 ± 29.538	0.120 ± 0.311	0.490 ± 0.499	0.520 ± 0.607	0.541 ± 0.688	0.244 ± 0.473
	RAISE	−28.387 ± 23.407	0.020 ± 0.132	0.609 ± 0.470	2.787 ± 1.675	0.077 ± 0.263	2.325 ± 2.331
Noise + Wind	SAC-D	−47.731 ± 32.431	0.173 ± 0.376	0.376 ± 0.483	0.872 ± 0.939	0.276 ± 0.443	3.992 ± 2.815
	Surprise	−44.602 ± 30.960	0.133 ± 0.337	0.414 ± 0.488	2.035 ± 1.474	0.180 ± 0.407	2.796 ± 2.450
	APF	−53.182 ± 35.617	0.303 ± 0.450	0.357 ± 0.473	0.970 ± 1.178	1.121 ± 1.111	0.647 ± 0.689
	RAISE	−42.580 ± 30.798	0.128 ± 0.333	0.440 ± 0.494	3.296 ± 1.640	0.121 ± 0.336	2.004 ± 1.805
