Article

Resource-Aware Surprise Reinforcement Learning for Collision Avoidance in Maritime UAV Encounters

1
School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China
2
Department of Computer Science, City University of Hong Kong, Hong Kong
*
Author to whom correspondence should be addressed.
Drones 2026, 10(6), 450; https://doi.org/10.3390/drones10060450
Submission received: 3 April 2026 / Revised: 4 June 2026 / Accepted: 7 June 2026 / Published: 9 June 2026

Highlights

What are the main findings?
  • A unified framework, termed Resource-Aware Intrinsic Surprise Exploration (RAISE), is developed by integrating resource-aware modulation, surprise-based intrinsic reward, and adaptive EMA scaling into SAC-D for maritime UAV collision avoidance.
  • Across constrained, moderate, and resource-rich settings, RAISE consistently improves collision-avoidance performance while reducing reversal and strengthening advisories, producing more stable command behavior than baseline methods.
What are the implications of the main findings?
  • The results show that collision avoidance for maritime UAVs should be optimized not only for safety, but also for advisory stability and command economy, so that the learned policy remains operationally acceptable.
  • The proposed framework offers a transferable design for safety-critical reinforcement learning under operational constraints, with strong generalization across unseen encounter scenarios.

Abstract

Collision avoidance in maritime unmanned aerial vehicle (UAV) operations must satisfy two competing objectives: ensuring reliable safety separation and minimizing unnecessary maneuver commands that increase operator burden and communication overhead. While deep reinforcement learning (DRL) has shown promise in handling high-dimensional encounter states, standard DRL approaches often prioritize safety at the cost of operational suitability, leading to frequent, oscillatory, or unnecessary avoidance commands that erode remote operator trust and consume limited communication bandwidth. To address this challenge, this paper proposes Resource-Aware Intrinsic Surprise Exploration (RAISE), a unified framework that balances collision avoidance performance with command economy. We conceptualize the issuance of avoidance maneuvers as a consumable “virtual resource”, compelling the agent to optimize its intervention budget. RAISE integrates this mechanism into the Soft Actor–Critic (SAC) architecture, augmented by a surprise-based intrinsic reward derived from the ensemble forward dynamics prediction error. This allows the agent to efficiently explore complex encounter scenarios driven by curiosity, while a resource-aware coefficient adaptively suppresses redundant actions when the communication or operational budget is constrained. Furthermore, an adaptive exponential moving average (EMA) scaling mechanism is introduced to stabilize the interplay between intrinsic and extrinsic rewards. Extensive simulations under diverse resource constraints and encounter geometries demonstrate that RAISE outperforms state-of-the-art baselines. It significantly reduces maneuver reversal rates and strengthens command stability without compromising safety margins. 
Specifically, under resource-constrained settings, RAISE suppresses excessive and unstable advisory behavior by reducing strengthening and reversal commands while maintaining effective collision avoidance; under resource-rich settings, it flexibly enhances safety buffers, demonstrating superior adaptability and operational realism for autonomous maritime UAV systems. Robustness evaluation confirms that RAISE maintains stable performance under sensor noise and wind disturbances.

1. Introduction

Ensuring safe separation in increasingly congested maritime airspaces is a paramount challenge for modern offshore operations. With the rapid expansion of unmanned aerial vehicles (UAVs) deployed for maritime monitoring, infrastructure inspection, and search-and-rescue [1], the frequency of complex UAV encounters has risen sharply, exacerbating the risk of aerial collisions [2,3]. Operational experience reveals that while autonomous systems can successfully prevent most collisions, they often trigger excessive or continuous avoidance commands. A critical yet often overlooked issue is that frequent, nuisance, or oscillatory maneuver commands consume valuable and often limited maritime communication bandwidth. Furthermore, they severely disrupt the remote operator’s workflow and increase cognitive load, eventually leading to the desensitization of human supervisors—a phenomenon known as “trust erosion” in human-autonomy teaming [4,5,6]. In maritime environments where connectivity is unstable, executing continuous, high-frequency commands is highly impractical [7,8]. Therefore, the next generation of maritime UAV collision avoidance systems must strictly adhere to a dual mandate: providing high safety assurance while maintaining “command precision”—issuing avoidance maneuvers only when absolutely necessary to preserve communication resources and operator trust.
Current solutions, however, struggle to satisfy this dual mandate simultaneously. Traditional UAV collision avoidance methods [9,10,11] often rely on fixed geometric rules or reactive potential fields. While effective in simple scenarios, they are computationally rigid and unable to adapt online to varying operational needs, such as communication constraints or a remote operator’s tolerance for frequent interventions [5,12]. Conversely, deep reinforcement learning (DRL) has emerged as a powerful alternative, capable of mapping continuous, high-dimensional states to maneuver commands without massive storage overhead [13,14]. Recent RL studies have further expanded UAV decision-making from value-based control to actor–critic, multi-agent, and safe/constrained learning frameworks, improving adaptability in navigation, control, resource scheduling, and cooperative UAV operations [15,16]. Safe/constrained RL also provides a principled way to incorporate safety or resource-related constraints into policy learning [17]. However, these advances still primarily optimize safety, path efficiency, or mission performance, while the operational cost of frequent maneuver advisories remains insufficiently modeled. Even when recent DRL agents have demonstrated excellent safety performance [18], they typically lack “operational awareness”. Existing UAV DRL research is predominantly safety-driven; without explicit constraints on the “cost” of maneuvering or transmitting commands, these agents often exhibit “jittery” or oscillatory behaviors to marginally increase safety buffers. This creates a critical gap: algorithms that are mathematically safe but operationally unacceptable for communication-constrained maritime environments.
To bridge the gap between algorithmic safety and operational suitability, we draw inspiration from resource-constrained reinforcement learning [19]. We propose a novel paradigm that conceptualizes the issuance of avoidance commands as a consumable “virtual resource”, representing the finite capacity of communication bandwidth and remote operator attention within an encounter episode. This abstraction forces the agent to treat every maneuver command as a “costly investment”, thereby naturally suppressing redundant or low-value actions.
However, simply penalizing maneuver commands can hinder the agent’s exploration during training, leading to overly conservative policies that fail to discover safe avoidance trajectories. To address this, we introduce Resource-Aware Intrinsic Surprise Exploration (RAISE). This framework unifies safety assurance and command economy by integrating synergistic mechanisms within the Soft Actor–Critic Discrete (SAC-D) architecture [20]. Specifically, to prevent the resource constraint from hindering learning, we employ a surprise-based intrinsic motivation module that utilizes an ensemble dynamics model to generate prediction errors [21], driving the agent to explore novel scenarios. This exploration is dynamically regulated by a resource-modulated control coefficient, which couples exploration intensity with the remaining command budget. Furthermore, to address the scale discrepancy between varying intrinsic signals and stable extrinsic rewards, we introduce an adaptive exponential moving average (EMA) scaling mechanism, ensuring consistent reward normalization and stable convergence.
By unifying safety assurance, communication efficiency, and operator burden reduction, RAISE provides a practical path toward operationally suitable AI collision avoidance for maritime UAVs. The main contributions of this work are summarized as follows:
  • Resource-aware decision formulation: Introduction of a virtual resource mechanism that regulates maneuver command frequency and intensity during both training and deployment.
  • Resource-aware surprise exploration with adaptive scaling: An ensemble-based prediction-error signal is used for novelty estimation, while its influence is adaptively scaled and modulated by the remaining advisory resource.
  • Unified SAC-D integration: Seamless incorporation of resource modulation and intrinsic exploration into a standard actor–critic framework.
  • Comprehensive evaluation: Experiments across resource-constrained, moderate, and resource-rich conditions demonstrate that RAISE improves advisory stability by reducing strengthening and reversal behaviors while maintaining reliable collision-avoidance performance and stable training convergence.

2. Related Work

2.1. Conventional UAV Collision Avoidance Methods

Early UAV collision avoidance relied heavily on reactive and rule-based logic. Traditional geometric and force-field approaches, such as artificial potential fields (APFs) [10] and velocity obstacles (VOs) [9], determine avoidance maneuvers based on relative distances and velocities. While effective in standard, low-density encounters, several limitations remain. For instance, classical APF methods may suffer from local-minimum traps in certain obstacle configurations, although many improved APF variants have been proposed to mitigate this limitation [11,22]. In addition, purely reactive potential-field planners may exhibit local path oscillations near obstacles or in narrow passages, reducing trajectory smoothness under certain conditions [23,24]. More importantly, because these methods are typically governed by predefined geometric or force-field rules, they lack the online adaptability required to flexibly recalibrate the trade-off between safety margins and operational costs, such as communication bandwidth consumption or remote operator intervention frequency, based on real-time maritime contexts [5,12].

2.2. Deep Reinforcement Learning for Autonomous Navigation

DRL has emerged as a promising alternative to overcome the scalability and adaptability issues of traditional reactive systems. Algorithms like Deep Q-Networks (DQN) [13] and Soft Actor–Critic (SAC) [14] have been successfully applied to safety-critical domains, including UAV obstacle avoidance [25,26], autonomous maritime navigation [27,28], and aircraft separation assurance [29,30]. By mapping high-dimensional, continuous states directly to maneuver commands, DRL agents can navigate complex environments without the need for massive computational overhead [18]. More recent studies have further broadened this line of research toward policy-gradient and actor–critic methods, multi-agent reinforcement learning, curriculum-based training, and safe/constrained reinforcement learning [15,16,17]. These developments improve the scalability of RL-based UAV control in uncertain, dynamic, and cooperative environments, and they also provide tools for incorporating safety or resource-related constraints during policy optimization.
Despite these advances, most UAV DRL collision-avoidance studies still evaluate performance mainly through collision rate, path efficiency, or separation distance, rather than through the operational cost and temporal stability of maneuver advisories. Without explicit mechanisms to regulate the frequency or severity of maneuver commands, safety-driven agents may generate “jittery” or excessive control signals. While mathematically safe, such high-frequency behavior can saturate limited maritime communication networks, increase remote operator cognitive load, and erode trust in autonomous UAV operations.

2.3. Constrained Exploration and Intrinsic Motivation

To balance safety with operational constraints, this work draws inspiration from constrained reinforcement learning (CRL) [31]. In domains such as robotic control with finite energy budgets [32,33] or games with consumable items [34,35], agents are trained to maximize rewards under strict resource limitations. We adapt this concept by formulating the “command budget” as a virtual constraint, a novel perspective in the context of maritime UAV collision avoidance.
Furthermore, to ensure efficient policy learning under such constraints, intrinsic motivation (IM) mechanisms are often employed. IM tackles the sparse reward problem by encouraging exploration. Existing approaches include count-based exploration [36,37], information-gain approaches [38,39], prediction-error-based exploration [21,40], and empowerment-based approaches [41]. Among these, surprise-based exploration [21], which rewards the agent for prediction errors in forward dynamics, has shown superior sample efficiency in continuous control tasks.
However, standard intrinsic exploration is typically cost-agnostic. In safety-critical UAV operations, unconstrained curiosity can lead to dangerous or communication-heavy behaviors. To the best of our knowledge, no prior work has integrated surprise-based exploration with resource-aware constraints specifically to solve the “safety vs. command economy” trade-off in maritime UAV encounters. This paper bridges this gap by introducing the RAISE framework.

3. Problem Formulation

3.1. UAV Encounter Dynamics

In this study, the UAV encounter scenario is formulated as a strategic vertical resolution advisory problem for medium-to-large fixed-wing UAVs conducting beyond visual line of sight (BVLOS) maritime missions. The purpose of the model is not to generate full three-dimensional trajectories, but to determine whether sufficient vertical separation can be achieved within the remaining horizontal encounter time. Accordingly, the horizontal encounter geometry is represented by the time-to-conflict variable $\tau_t$, while the learning problem focuses on vertical separation and vertical advisory generation.
As illustrated in Figure 1, the encounter geometry is characterized by the relative altitude between the ego-UAV and the intruding UAV, alongside the time to the closest point of approach (CPA) in the horizontal direction. The horizontal geometry is thereby parameterized by the time-to-conflict variable ($\tau_t$), which represents the remaining time until the UAVs breach the minimum horizontal safety threshold:
$$\tau_t = \frac{D_t}{v_{rel}} \tag{1}$$
where $D_t$ denotes the current horizontal distance and $v_{rel}$ denotes the locally estimated relative horizontal speed over the short encounter horizon. The resulting variable $\tau_t$ represents the remaining time before the horizontal separation reaches the predefined safety threshold. Thus, horizontal motion is retained as a temporal constraint on the vertical resolution process, rather than being explicitly controlled as an additional maneuver dimension.
Under this abstraction, the relative motion of the two UAVs is described by a compact set of vertical variables, including the relative altitude $h_t = h_{int,t} - h_{ego,t}$ and the respective vertical velocities of the ego-UAV and the intruder. This formulation provides sufficient information for evaluating collision risk and generating precise avoidance maneuver alerts without explicitly representing lateral motion.
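As a small illustration of the time-to-conflict definition above, the following Python sketch evaluates the ratio of horizontal distance to relative horizontal speed. The function name, the argument names, and the choice to return infinity when the UAVs are not closing are assumptions made for this example and are not specified in the text.

```python
import math


def time_to_conflict(horizontal_distance_m: float, closing_speed_mps: float) -> float:
    """Return tau_t = D_t / v_rel in seconds.

    Assumption for this sketch: a non-positive closing speed means the
    UAVs are not converging, so the time to conflict is treated as infinite.
    """
    if closing_speed_mps <= 0.0:
        return math.inf
    return horizontal_distance_m / closing_speed_mps


# Example: 6 km apart, closing at 150 m/s.
tau_example = time_to_conflict(6000.0, 150.0)
```

With a 1 Hz decision rate, the example value corresponds to the number of advisory decisions remaining before the horizontal encounter time expires.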

3.2. UAV Dynamic Model

It should be noted that the dynamic model used in this study is an advisory-level vertical kinematic response model, rather than a full six-degree-of-freedom UAV flight dynamics model. The objective of this model is to describe how high-level climb, descent, and no-command advisories affect the vertical separation between the ego-UAV and the intruding UAV during a short encounter window. Therefore, attitude dynamics, aerodynamic forces, actuator dynamics, and low-level flight-control loops are not explicitly modeled.
Given the two-dimensional encounter abstraction introduced above, the horizontal interaction between UAVs is fully captured by the time-to-conflict variable $\tau_t$, so horizontal motion enters only as a temporal constraint on the vertical resolution process. The collision-avoidance dynamics can therefore be modeled exclusively in the vertical dimension, consistent with altitude-based deconfliction strategies commonly used in UAV operations, where vertical maneuvers can rapidly establish separation within a limited encounter time [42]. This formulation is suitable for the considered fixed-wing maritime encounter scenario because the key decision variable is whether the ego-UAV can establish adequate vertical separation before the horizontal encounter time expires. In this sense, $\tau_t$ acts as a countdown variable that links horizontal closure with the urgency of vertical maneuvering. Under this formulation, both UAVs follow one-dimensional vertical dynamics updated at a discrete frequency of 1 Hz. This frequency reflects the high-level strategic decision rate suitable for maritime environments and accommodates the severe bandwidth limitations of long-range UAV telemetry. At each time step $t$, the environment state is represented as:
$$s_t = \left[h_t,\ \dot{h}_{ego,t},\ \dot{h}_{int,t},\ \tau_t,\ a_{prev}\right] \tag{2}$$
Here, $h_t$ denotes the vertical relative altitude between the intruding UAV and the ego-UAV, defined as $h_t = h_{int,t} - h_{ego,t}$. The variables $\dot{h}_{ego,t}$ and $\dot{h}_{int,t}$ represent the vertical velocities of the ego-UAV and intruding UAV, respectively. The variable $\tau_t$ indicates the estimated time to the closest point of approach (CPA), assuming a constant relative horizontal speed, i.e., the remaining time before the horizontal separation between the UAVs decreases below a predefined maritime safety threshold (e.g., 150 m). For medium-to-large fixed-wing UAVs executing BVLOS maritime missions, this relatively conservative separation minimum is necessary: it accounts for the high cruising speeds, severe offshore wind disturbances, GPS inaccuracies, and inherent latency in satellite or long-range communications.
To capture the temporal mismatch between decision issuance and execution in real maritime operations, the model introduces an action execution delay mechanism. Specifically, the action applied at time step $t$ corresponds to the maneuver command generated at the previous step $t - 1$. To support this mechanism, the state representation includes the previous command index $a_{prev}$, which determines the acceleration applied at time $t$. This design models realistic latency effects, such as communication delays or mechanical response lags, thereby improving the fidelity and temporal continuity of the simulated trajectories. The transition to the next time step $t + 1$ follows discrete motion equations that update the altitude and vertical velocity of both UAVs according to their respective accelerations, as described in Equation (3).
$$s_{t+1} = \begin{bmatrix} h_t + \dot{h}_{int,t} + \tfrac{1}{2}\ddot{h}_{int,t} - \dot{h}_{ego,t} - \tfrac{1}{2}\ddot{h}_{ego,t} \\ \dot{h}_{ego,t} + \ddot{h}_{ego,t} \\ \dot{h}_{int,t} + \ddot{h}_{int,t} \\ \tau_t - 1 \\ a_{prev} \end{bmatrix} \tag{3}$$
Here, $\ddot{h}_{ego,t}$ and $\ddot{h}_{int,t}$ denote the vertical accelerations of the ego-UAV and the intruding UAV at time step $t$, respectively. For the ego-UAV, the applied acceleration is directly determined by the maneuver command issued by the collision avoidance system. The acceleration values in Equation (3) should be interpreted as commanded vertical acceleration responses associated with discrete advisories, rather than direct actuator-level control inputs. This representation is appropriate for evaluating whether the advisory policy can generate timely and physically bounded vertical separation during the encounter. The intruding UAV, in contrast, is modeled using a goal-directed vertical response that drives it toward the ego-UAV’s altitude within the remaining encounter time, as described in Equation (4). This design is not intended to represent all possible maritime UAV traffic behaviors. Rather, it provides a controlled conflict-generation mechanism for constructing repeatable encounter cases, so that the learned policy can be evaluated under clear vertical conflict pressure. At each step, the intruding UAV estimates a target vertical velocity that would allow it to reach the ego-UAV’s altitude by the predicted time-to-CPA under the current kinematic conditions.
$$\dot{h}_{int,t+1} = \dot{h}_{ego,t} - \frac{h_{int,t} - h_{ego,t}}{\tau_t} \tag{4}$$
$$\ddot{h}_{int,t} = \mathrm{clip}\left(\dot{h}_{int,t+1} - \dot{h}_{int,t},\ -a_{\max},\ a_{\max}\right) \tag{5}$$
On this basis, the intruding UAV’s vertical acceleration is determined by the difference between its current and target vertical velocities, subject to a predefined maximum acceleration limit. To ensure physically feasible motion and stable trajectory evolution, the acceleration is constrained within the allowable range using a clipping function, as shown in Equation (5). Similarly, the vertical velocity is bounded by a maximum value $\dot{h}_{int,\max}$, preventing unrealistic climb or descent rates and maintaining smooth motion continuity throughout the encounter. In this setting, the intruder behavior serves as a stress-case model for policy training and evaluation. More diverse intruder behaviors, such as route-following, cooperative, non-cooperative, and stochastic traffic models, can be investigated in future extensions of the proposed resource-aware advisory framework.
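The transition and intruder model described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' simulator: the numerical limits `A_MAX` and `V_INT_MAX`, the class and function names, the guard against division by zero near CPA, and the convention that the newly issued command is stored as the next step's previous command are all choices made for this example. The ego acceleration is passed in directly because the mapping from advisories to accelerations is given in Table 2, which is not reproduced here.

```python
from dataclasses import dataclass

A_MAX = 3.0       # hypothetical intruder acceleration limit (m/s^2)
V_INT_MAX = 15.0  # hypothetical intruder vertical-rate limit (m/s)


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


@dataclass
class EncounterState:
    h: float      # relative altitude h_int - h_ego (m)
    v_ego: float  # ego vertical rate (m/s)
    v_int: float  # intruder vertical rate (m/s)
    tau: float    # time to conflict (s)
    a_prev: int   # index of the previously issued advisory


def intruder_acceleration(s: EncounterState) -> float:
    """Goal-directed intruder response: target rate, then clipped acceleration."""
    tau = max(s.tau, 1.0)  # assumption: avoid dividing by zero at CPA
    v_target = clip(s.v_ego - s.h / tau, -V_INT_MAX, V_INT_MAX)
    return clip(v_target - s.v_int, -A_MAX, A_MAX)


def step(s: EncounterState, ego_accel: float, new_command: int) -> EncounterState:
    """Advance one 1 s step; ego_accel is the response to the delayed command a_prev."""
    int_accel = intruder_acceleration(s)
    h_next = s.h + s.v_int + 0.5 * int_accel - s.v_ego - 0.5 * ego_accel
    return EncounterState(
        h=h_next,
        v_ego=s.v_ego + ego_accel,
        v_int=clip(s.v_int + int_accel, -V_INT_MAX, V_INT_MAX),
        tau=s.tau - 1.0,
        a_prev=new_command,
    )


# Intruder starts 100 m above a level ego-UAV with 20 s to conflict.
s0 = EncounterState(h=100.0, v_ego=0.0, v_int=0.0, tau=20.0, a_prev=0)
s1 = step(s0, ego_accel=0.0, new_command=0)
```

In this example the intruder's target rate is 5 m/s downward, but the acceleration limit caps the first-step change at 3 m/s, so the relative altitude shrinks by only 1.5 m in the first second.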

3.3. Resource-Aware Collision Avoidance Decision Framework

3.3.1. Resource-Aware Decision Framework

In conventional collision avoidance systems, advisories are issued solely based on safety requirements, without accounting for operator workload or alert fatigue. However, in real-world UAV operations, frequent or redundant advisories can overload the remote operators and reduce trust in the automation system.
To address this issue, we introduce a resource-aware decision framework, in which a virtual resource variable $r_t^{res}$ represents the remaining command-resource budget available for issuing collision-avoidance advisories during an encounter. This variable is not intended to directly measure physical communication bandwidth, link latency, or human cognitive workload. Instead, it is a decision-level proxy for advisory burden, reflecting the operational cost associated with frequent, continued, strengthened, or reversed maneuver advisories. The cost values are designed to distinguish different levels of command burden at the advisory-transition level, rather than to convert advisories into exact units of bandwidth or operator workload. Specifically, issuing a new or continued alert consumes 3 units, a strengthening advisory consumes 2 units, and a reversal advisory consumes 5 units. The reversal cost is assigned the largest value because switching between climb and descent advisories may lead to oscillatory guidance and represents a high-burden advisory transition from the perspective of supervisory control. The resource state is updated as:
$$r_{t+1}^{res} = \max\left(0,\ r_t^{res} - c(a_t, a_{t-1})\right)$$
The cost function is defined as:
$$c(a_t, a_{t-1}) = \begin{cases} 0, & \text{NOC or no resource-consuming transition}, \\ 3, & \text{new or continued alert}, \\ 2, & \text{strengthening advisory}, \\ 5, & \text{reversal advisory}. \end{cases}$$
If the remaining resource is insufficient to support the requested advisory type, the action is suppressed and replaced by NOC. This mechanism enforces a finite alert budget and prevents the policy from relying on frequent high-cost interventions.
A resource coefficient $\rho_t \in [0, 1]$ is further introduced to regulate the influence of this limited resource on the decision-making process. When the available resource decreases, the coefficient proportionally reduces the intensity or frequency of advisories, encouraging more conservative and resource-efficient behavior. In essence, this mechanism allows the system to adapt its decision policy according to the current alert capacity, ensuring both operational safety and operator acceptance.
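The budget update and suppression rule described above can be sketched as follows. The cost values (0, 3, 2, 5) are taken from the text; the `Transition` enum, the function name, and the choice to pass the transition type in directly are assumptions for this example, since classifying a pair of consecutive advisories into a transition type depends on Table 2. The resource coefficient is not modeled here.

```python
from enum import Enum


class Transition(Enum):
    NONE = "none"              # NOC or no resource-consuming transition
    ALERT = "alert"            # new or continued alert
    STRENGTHEN = "strengthen"  # strengthening advisory
    REVERSAL = "reversal"      # reversal advisory


# Cost per advisory transition, as stated in the text.
COST = {
    Transition.NONE: 0,
    Transition.ALERT: 3,
    Transition.STRENGTHEN: 2,
    Transition.REVERSAL: 5,
}


def apply_advisory(resource: int, requested: Transition) -> tuple[int, Transition]:
    """Return (remaining resource, transition actually issued).

    If the budget cannot cover the requested transition, the advisory is
    suppressed and replaced by NOC, which leaves the budget unchanged.
    """
    cost = COST[requested]
    if cost > resource:
        return resource, Transition.NONE
    return max(0, resource - cost), requested


# Walk a hypothetical budget of 10 units through four requests.
budget = 10
trace = []
for request in (Transition.ALERT, Transition.ALERT,
                Transition.REVERSAL, Transition.STRENGTHEN):
    budget, issued = apply_advisory(budget, request)
    trace.append((budget, issued))
```

In this walk-through the third request (a reversal costing 5 units) arrives with only 4 units left, so it is suppressed, while the cheaper strengthening request that follows is still affordable.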

3.3.2. State Space

The state space for the collision avoidance task consists of six variables, as summarized in Table 1. These variables capture the essential kinematic and decision-related information required by the reinforcement learning agent to assess encounter risk and determine appropriate advisories. The first three variables describe the vertical geometry of the encounter: the relative altitude between the intruding UAV and ego-UAV ($h_t$), the vertical rate of the ego-UAV ($\dot{h}_{ego,t}$), and the vertical rate of the intruding UAV ($\dot{h}_{int,t}$). The fourth variable, time to loss of horizontal separation ($\tau_t$), characterizes the remaining time until the two UAVs reach the minimum allowed horizontal distance (typically 150 m), effectively representing the horizontal encounter geometry. The fifth variable, previous advisory ($a_{prev}$), encodes the advisory issued at the previous time step. Including it allows unnecessary command reversals or escalations in advisory intensity to be penalized, which maintains temporal consistency and preserves the Markov property.
Finally, a new variable, the resource level ($r_t^{res}$), is introduced to represent the remaining alert resource available to the system at time $t$. This variable reflects the agent’s alert budget, influencing how aggressively it can issue future advisories. By including $r_t^{res}$ in the observation space, the system becomes aware of its operational constraints, enabling adaptive behavior that balances collision safety with alert economy.

3.3.3. Action Space

The action space consists of seven discrete advisories: NOC, DES-N, CLB-N, DES-T, CLB-T, DES-E, and CLB-E, as listed in Table 2. Here, N, T, and E denote normal, transitional, and escalated advisories, respectively. Terms such as reversal or strengthening describe transition types between consecutive advisories and are not separate action labels. Each advisory, except NOC (No conflict), instructs the ego-UAV to achieve or maintain a specific vertical rate, corresponding to a designated vertical acceleration. The NOC action indicates that no immediate collision threat exists, allowing the UAV to maintain its nominal trajectory, and can be issued at any time.
Table 2 also defines the availability rules between advisories, ensuring physically consistent and operationally feasible trajectory transitions. For instance, NOC may precede any other advisory, while normal descent (DES-N) and climb (CLB-N) can only be initiated from NOC. Transitional or strengthened advisories (DES-T, CLB-T, DES-E, CLB-E) can only follow compatible preceding advisories, ensuring smooth kinematic transitions and avoiding abrupt or contradictory guidance. This discrete action design ensures that the agent’s policy remains interpretable and aligned with standard deterministic state-machine constraints, while still allowing flexibility for optimization under resource-aware and learning-based settings.

3.3.4. Reward Shaping

To ensure both safety and operational efficiency in the collision avoidance process, the reward function is designed to balance three competing objectives: (1) preventing near mid-air collisions (safety), (2) minimizing unnecessary alerts (remote operator workload), and (3) maintaining acceptable altitude deviations (flight stability). The overall reward consists of two main components: terminal altitude penalty and alert and altitude management. The main weighting parameters used in the extrinsic reward are summarized in Table 3 before the individual reward terms are introduced.
The parameters follow a safety-priority hierarchy. The conflict penalty $\omega_{col} = 100$ has the largest magnitude, ensuring that collision avoidance remains the dominant objective. Advisory-management penalties are deliberately smaller, including $\omega_{alert} = 0.3$, $\omega_{str} = 0.3$, $\omega_{rev} = 0.5$, and $\omega_{cross} = 0.5$. Therefore, advisory economy shapes the policy only after the primary safety objective has been prioritized.
1. Terminal Altitude Penalty
This component evaluates the final vertical separation between the ego-UAV and the intruder when the time to loss of horizontal separation reaches zero—that is, at the critical moment of closest horizontal proximity. It penalizes unsafe altitude configurations such as loss of separation or excessive climb/descent, reinforcing safe and stable avoidance maneuvers. The overall final altitude penalty is defined as:
$$R_{terminal} = R_{collision} + R_{deviation} + R_{excessive}$$
where:
  • Collision Penalty:
$$R_{collision} = -\omega_{col}\,\mathbf{1}\left(\Delta h < h_{\min}\right)$$
penalizes any case where the relative altitude $\Delta h$ between the two UAVs falls below the collision threshold $h_{\min}$.
  • Terminal Separation Penalty:
$$R_{deviation} = -\omega_{col}\,\frac{\Delta h - h_{\min}}{h_{safe} - h_{\min}}\,\mathbf{1}\left(h_{\min} \le \Delta h < h_{safe}\right)$$
penalizes moderate deviations from the safe altitude range $h_{safe}$.
  • Excessive Terminal-Separation Penalty:
$$R_{excessive} = \max\left(-2\,\omega_{ex}\left(\Delta h - h_{safe}\right),\ -\gamma_{alt}\right)\mathbf{1}\left(\Delta h \ge h_{safe}\right)$$
applies a saturation penalty when the UAVs exceed the maximum permitted altitude deviation, with $\gamma_{alt}$ denoting the upper penalty bound. In the study, $\gamma_{alt} = 25$, which serves as the saturation bound for excessive-altitude penalties. This value is larger than the advisory-related penalties but smaller than the conflict penalty, preserving the safety-priority hierarchy. The bounded form prevents extremely large altitude deviations from causing unbounded reward magnitudes and improves critic stability during training.
2. Alert and Altitude Management
This component reflects the dynamic control efficiency of the avoidance policy—encouraging timely yet minimal advisory usage while maintaining altitude constraints. The corresponding reward term is defined as:
$$R_{manage} = R_{alt\_limit} + R_{alert\_penalty} + R_{clear}$$
where:
  • Altitude Limit Penalty:
$$R_{alt\_limit} = -\omega_{lim}\,\mathbf{1}\left(h_{ego} > h_{\max}\right)$$
penalizes ego-UAV altitude exceeding the predefined upper limit $h_{\max}$.
  • Alert Penalty:
$$R_{alert\_penalty} = -\left(\omega_{alert} I_{alert} + \omega_{rev} I_{rev} + \omega_{str} I_{str} + \omega_{cross} I_{cross}\right)\exp\left(-\frac{\tau_t}{T}\right)$$
where $I_{alert}$, $I_{str}$, $I_{rev}$, and $I_{cross}$ are binary indicators for active advisory issuance, advisory strengthening, advisory reversal, and vertical path crossing, respectively. The exponential factor $\exp(-\tau_t / T)$ increases the penalty magnitude as the encounter approaches CPA, discouraging late-stage unnecessary advisories and inconsistent command changes near the conflict point.
  • Conflict Clearance Reward:
$$R_{clear} = \omega_{NOC}\, I_{NOC}$$
provides a small positive reward when the conflict is successfully cleared and the system transitions to a no conflict (NOC) state. $I_{NOC}$ indicates that the conflict has been cleared and the system returns to the no-conflict advisory state.
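To make the time-weighted alert penalty concrete, the sketch below evaluates it for a given time to conflict. The four weights are the values stated in the text. The function name, the horizon value used for $T$ in the example, and the reading that the penalty enters the reward with a negative sign and an exponent of $-\tau_t / T$ (so that its magnitude grows toward CPA, as described) are assumptions for this illustration.

```python
import math

# Advisory-management weights as stated in the text.
W_ALERT, W_STR, W_REV, W_CROSS = 0.3, 0.3, 0.5, 0.5


def alert_penalty(tau: float, horizon: float, alert: bool,
                  strengthen: bool, reversal: bool, crossing: bool) -> float:
    """Return the (non-positive) alert penalty for one decision step.

    The weighted sum of active indicators is scaled by exp(-tau / horizon),
    which equals 1 at CPA (tau = 0) and decays for earlier decisions.
    """
    weighted = (W_ALERT * alert + W_REV * reversal
                + W_STR * strengthen + W_CROSS * crossing)
    return -weighted * math.exp(-tau / horizon)


HORIZON = 40.0  # hypothetical encounter horizon T (s)

# The same reversal, issued early versus right at CPA.
early = alert_penalty(40.0, HORIZON, alert=True, strengthen=False,
                      reversal=True, crossing=False)
late = alert_penalty(0.0, HORIZON, alert=True, strengthen=False,
                     reversal=True, crossing=False)
```

Under these assumptions, a reversal issued at CPA is penalized by the full weighted sum, while the same reversal issued one horizon earlier is discounted by a factor of $e^{-1}$.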

4. UAV Collision Avoidance Based on Resource-Aware Intrinsic Surprise Exploration

This section introduces the proposed collision-avoidance decision-making framework based on deep reinforcement learning. The method extends the standard Soft Actor–Critic (SAC) algorithm by incorporating a surprise-based intrinsic exploration term and a resource-aware modulation mechanism. The overall objective is to improve collision-avoidance efficiency while maintaining alert economy and reducing unnecessary advisories.

4.1. SAC Algorithm for Collision Avoidance

To address the sequential decision-making problem of UAV collision avoidance under complex dynamic conditions, this study adopts the Soft Actor–Critic (SAC) algorithm as the fundamental learning framework [14]. SAC is an off-policy, entropy-regularized actor–critic method that optimizes both task performance and policy stochasticity, thereby achieving a balance between exploitation and exploration during training. In the context of collision avoidance, the agent must issue timely and stable vertical advisories to maintain safe separation while minimizing unnecessary command reversals or oscillations that could confuse the remote operator.
The optimization objective of SAC is formulated as a maximum entropy reinforcement learning problem, where the policy aims not only to maximize the expected cumulative reward but also to preserve sufficient action entropy for exploration:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T}\left(r_t + \alpha\,\mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right)\right]$$
where rt denotes the extrinsic reward at time t, and $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a_t \sim \pi}\left[\log \pi(a_t \mid s_t)\right]$ represents the policy entropy. The temperature parameter α > 0 regulates the trade-off between maximizing return and maintaining policy diversity.
The SAC framework is composed of three primary components: two Q-value networks, Qθ1(s,a) and Qθ2(s,a), a stochastic policy network πϕ(a|s), and a target critic used for stabilization. The Q-networks are trained to minimize the soft Bellman residual:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - y_t\right)^2\right]$$
where the soft target value yt is computed as:
$$y_t = r_t + \gamma\,\mathbb{E}_{a_{t+1} \sim \pi}\left[\min_{i} Q_{\bar{\theta}_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\right]$$
This clipped double-Q formulation reduces overestimation bias and improves the stability of learning. The policy network is optimized by minimizing the Kullback–Leibler divergence between the current policy and the soft Q-function-induced Boltzmann distribution:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t)\right]$$
where at is sampled from πϕ(at|st) and D denotes the replay buffer.
Since the action space in the collision avoidance task is discrete, we employ SAC-Discrete (SAC-D), an adaptation of the SAC framework for discrete control problems [20]. This formulation preserves the entropy-regularized objective and the dual Q-network structure of the original SAC while incorporating dedicated modifications to the policy representation, Q-function estimation, and policy update scheme. These adaptations enable SAC-D to perform stable and efficient policy optimization under the maximum-entropy reinforcement learning paradigm in discrete action domains.
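As a minimal sketch of how the soft target value above is commonly evaluated when the action set is discrete, the expectation over the next action can be computed exactly as a probability-weighted sum over all actions instead of being sampled. The function name, argument layout, and numerical values below are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def soft_target_discrete(reward, gamma, alpha, probs_next, q1_next, q2_next):
    """Soft Bellman target y_t for a discrete action set (illustrative sketch).

    probs_next : pi(a | s_{t+1}) for every action, summing to 1
    q1_next    : target-critic values Q_1(s_{t+1}, a) for every action
    q2_next    : target-critic values Q_2(s_{t+1}, a) for every action
    """
    soft_value = 0.0
    for p, q1, q2 in zip(probs_next, q1_next, q2_next):
        if p > 0.0:  # an action with zero probability contributes nothing
            # clipped double-Q value minus the entropy term alpha * log pi
            soft_value += p * (min(q1, q2) - alpha * math.log(p))
    return reward + gamma * soft_value


# Hypothetical two-action example: both actions equally likely.
y = soft_target_discrete(
    reward=-1.0, gamma=0.99, alpha=0.2,
    probs_next=[0.5, 0.5], q1_next=[1.0, 2.0], q2_next=[1.5, 1.0],
)
print(round(y, 4))
```

In this example the element-wise minimum of the two critics is 1.0 for both actions, so the target reduces to the reward plus the discounted sum of that value and the entropy bonus α log 2.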

4.2. Quantifying Novelty via Surprise-Based Intrinsic Reward

In reinforcement learning for UAV collision avoidance, the external reward primarily captures high-level safety objectives, such as maintaining vertical separation or preventing near mid-air collisions. However, this extrinsic feedback provides limited guidance in early-stage exploration, as it is sparse and only weakly correlated with intermediate advisory decisions. Consequently, the agent may struggle to efficiently explore novel yet safety-critical encounter situations. To address this limitation, a surprise-based intrinsic motivation mechanism is introduced, serving as a quantitative measure of state-transition novelty and guiding exploration toward poorly understood regions of the dynamics [21]. The intuition is that transitions that are difficult for the model to predict are likely underrepresented in the agent’s knowledge and therefore should receive additional reward incentives.
To quantify transition novelty, we trained an ensemble of probabilistic forward dynamics models. Given the current state–action pair (st, at), each ensemble member predicts a diagonal Gaussian distribution over the next state:
$$q_{\theta_i}(s_{t+1} \mid s_t, a_t) = \mathcal{N}\left(\hat{\mu}_i(s_t, a_t),\; \operatorname{diag}\left(\hat{\sigma}_i^2(s_t, a_t)\right)\right), \quad i = 1, \ldots, N$$
where N denotes the ensemble size. The dynamics ensemble is trained by minimizing the negative log-likelihood (NLL) of the observed transition samples:
$$\mathcal{L}_{model} = -\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[\log q_{\theta_i}(s_{t+1} \mid s_t, a_t)\right]$$
Here, st, at, and st+1 denote the current state, action, and observed next state, respectively. D denotes the replay buffer containing sampled transitions. N is the number of ensemble dynamics models. $q_{\theta_i}(s_{t+1} \mid s_t, a_t)$ is the predictive distribution of the i-th dynamics model parameterized by θi. $\hat{\mu}_i(s_t, a_t)$ and $\hat{\sigma}_i^2(s_t, a_t)$ denote the predicted mean and variance of the next state.
Equivalently, up to constants independent of the model parameters, this objective can be expressed as:
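$$\mathcal{L}_{model} = \frac{1}{2N}\sum_{i=1}^{N}\mathbb{E}_{\mathcal{D}}\left[\sum_{j=1}^{d}\left(\frac{\left(s_{t+1}^{j} - \hat{\mu}_i^{j}(s_t, a_t)\right)^2}{\hat{\sigma}_i^{2,j}(s_t, a_t)} + \log \hat{\sigma}_i^{2,j}(s_t, a_t)\right)\right]$$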
In the expanded form, d is the dimension of the state vector, and the superscript j denotes the j-th state dimension. A large prediction error or high model variance indicates that the transition deviates from the learned dynamics and is therefore “surprising”. The intrinsic reward is then defined as the ensemble-averaged negative log-likelihood of the actually observed next state:
$$r_t^{int} = -\frac{1}{N}\sum_{i=1}^{N}\log q_{\theta_i}(s_{t+1} \mid s_t, a_t) \tag{23}$$
For the diagonal Gaussian predictive distribution, Equation (23) can be expanded as:
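$$r_t^{int} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left[\frac{\left(s_{t+1}^{j} - \hat{\mu}_i^{j}(s_t, a_t)\right)^2}{2\,\hat{\sigma}_i^{2,j}(s_t, a_t)} + \frac{1}{2}\log\left(2\pi\,\hat{\sigma}_i^{2,j}(s_t, a_t)\right)\right] \tag{24}$$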
Thus, $r_t^{int}$ measures the sample-wise surprisal of the observed transition under the learned dynamics model. A transition that is poorly predicted by the ensemble receives a larger intrinsic reward, encouraging the agent to explore dynamically uncertain or insufficiently modeled regions of the state-action space.
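The ensemble-averaged surprisal described above can be sketched in a few lines. This is a hedged illustration only: the function names, the representation of each ensemble member's output as a (mean, variance) pair of lists, and the numbers in the example are assumptions made for the sketch, not details reported in the study.

```python
import math


def gaussian_nll(next_state, mean, var):
    """Negative log-likelihood of next_state under a diagonal Gaussian,
    summed over the d state dimensions."""
    total = 0.0
    for s, m, v in zip(next_state, mean, var):
        total += (s - m) ** 2 / (2.0 * v) + 0.5 * math.log(2.0 * math.pi * v)
    return total


def surprise_reward(next_state, ensemble_predictions):
    """Intrinsic reward as the NLL averaged over the N ensemble members.

    ensemble_predictions: list of (mean, var) pairs, one per member,
    each a list with one entry per state dimension.
    """
    nlls = [gaussian_nll(next_state, mean, var) for mean, var in ensemble_predictions]
    return sum(nlls) / len(nlls)


# Hypothetical 2-D state and a two-member ensemble that agrees on its prediction.
predictions = [([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [1.0, 1.0])]
familiar = surprise_reward([1.0, 0.0], predictions)    # matches the predicted mean
surprising = surprise_reward([4.0, 0.0], predictions)  # 3 units off in one dimension
print(familiar < surprising)
```

With unit variances, the second transition differs from the first only through the squared-error term, so its reward is larger by 3²/2 = 4.5.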
By integrating this surprise-driven reward into the actor–critic framework, the learning process becomes more adaptive to the dynamic collision scenarios. The agent can identify previously unseen altitude separation patterns, encounter configurations, and conflict-resolution dynamics, thereby refining its advisory policy in a data-efficient and safety-oriented manner.
It should be noted that Equation (24) is implemented as a sample-wise NLL-based surprisal measure, rather than as a directly computed KL divergence. Let $p(s_{t+1} \mid s_t, a_t)$ denote the true transition distribution of the environment and $q_\theta(s_{t+1} \mid s_t, a_t)$ denote the learned predictive distribution. Although p is unknown and is not explicitly estimated in the implementation, taking the expectation of the NLL under the true transition distribution gives the cross entropy between p and qθ:
$$\mathbb{E}_{s_{t+1} \sim p}\left[r_t^{int}\right] = -\,\mathbb{E}_{s_{t+1} \sim p}\left[\log q_\theta(s_{t+1} \mid s_t, a_t)\right] = H(p, q_\theta)$$
The cross entropy can be decomposed as:
$$H(p, q_\theta) = H(p) + D_{KL}\left(p(s_{t+1} \mid s_t, a_t)\,\|\,q_\theta(s_{t+1} \mid s_t, a_t)\right)$$
Since H(p) is independent of the model parameters, reducing the expected NLL is equivalent to reducing the KL divergence between the true transition dynamics and the learned predictive dynamics up to a constant. Therefore, the proposed intrinsic reward should be interpreted as a sample-wise NLL-based proxy for transition novelty, with a theoretical connection to KL divergence in expectation.
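The decomposition above can be checked numerically in a simple setting. The sketch below assumes, purely for illustration, that both the true transition distribution p and the model q are one-dimensional Gaussians with hypothetical parameters, so that H(p) and the KL divergence have closed forms; it then compares their sum with a Monte Carlo average of the sample-wise NLL.

```python
import math
import random


def gaussian_entropy(sigma):
    """Differential entropy H(p) of a 1-D Gaussian with std sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between two 1-D Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def nll(x, mu_q, sigma_q):
    """Sample-wise surprisal -log q(x) under the model q."""
    return 0.5 * math.log(2.0 * math.pi * sigma_q ** 2) + (x - mu_q) ** 2 / (2.0 * sigma_q ** 2)


def mean_sample_nll(mu_p, sigma_p, mu_q, sigma_q, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E_{x ~ p}[-log q(x)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += nll(rng.gauss(mu_p, sigma_p), mu_q, sigma_q)
    return total / n_samples


# Hypothetical "true" dynamics p and imperfect model q.
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 0.5, 1.5
cross_entropy = gaussian_entropy(sigma_p) + gaussian_kl(mu_p, sigma_p, mu_q, sigma_q)
estimate = mean_sample_nll(mu_p, sigma_p, mu_q, sigma_q)
print(abs(estimate - cross_entropy) < 0.02)
```

The averaged surprisal approaches H(p) + KL(p ‖ q) as the number of samples grows, which is the sense in which the intrinsic reward is connected to the KL divergence only in expectation.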

4.3. Resource-Aware Intrinsic Surprise Exploration

In practical UAV collision avoidance tasks, excessive exploratory actions or redundant alerts can increase remote operator workload and reduce system trustworthiness. To further examine this problem, we trained collision avoidance agents using state-of-the-art reinforcement learning methods, including the baseline SAC-D and the surprise-based intrinsic exploration model, under unconstrained resource conditions. The training statistics summarized in Figure 2 and Table 4 reveal that both approaches tend to generate an excessive number of advisories—on average more than ten alerts per episode—with frequent strengthening and reversal actions. These results indicate that while intrinsic-exploration-based learning improves policy responsiveness, it also amplifies advisory activity, leading to unstable and operationally inefficient behavior. The challenge, therefore, lies in achieving a better balance between effective exploration and practical alert management. To address this issue, a resource-aware intrinsic reward mechanism is introduced, guiding the learning process to adapt exploration intensity according to the available alert resources. This design enables the agent to maintain efficient policy learning while avoiding unnecessary advisory activations under resource-limited conditions.
The key component of this mechanism is the resource coefficient ρt ∈ [0, 1], which quantifies the remaining decision resource or allowable alert capacity at time step t. The coefficient is computed as a normalized function of the resource state embedded in the observation vector:
$$\rho_t = f\left(r_t^{res}\right) = \left(\frac{r_t^{res}}{r_{max}}\right)^{\beta}$$
where $r_t^{res}$ denotes the current resource level, $r_{max}$ is the predefined maximum, and β > 1 controls the sensitivity of reduction. When the available resource decreases, ρt suppresses the influence of intrinsic exploration (faster than linearly in the remaining resource fraction, since β > 1), encouraging the policy to focus on task-relevant, low-cost advisories. This mechanism reflects the intuition that the collision-avoidance system should be more cautious and less exploratory when its “alert budget” becomes constrained.
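A minimal sketch of the resource coefficient follows. The value β = 2 is a placeholder chosen for illustration, since the text only states β > 1, and the clipping of the resource fraction to [0, 1] is an added safeguard assumed here rather than something described in the study.

```python
def resource_coefficient(resource_level, resource_max, beta=2.0):
    """rho_t = (r_res / r_max) ** beta, with the fraction clipped to [0, 1].

    beta=2.0 is a hypothetical default; the clipping is an assumption.
    """
    fraction = min(max(resource_level / resource_max, 0.0), 1.0)
    return fraction ** beta


# With a budget of r_max = 20, the gate closes faster than the budget drains.
for level in (20, 10, 5, 0):
    print(level, resource_coefficient(level, 20))
```

For β = 2, half of the budget remaining already reduces the gate to 0.25, and a quarter remaining reduces it to 0.0625.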
Because the numerical magnitudes of the external and intrinsic rewards may differ significantly and may vary during training, an adaptive scaling coefficient is introduced:
$$\eta_t = \eta_0\,\lambda_{tar}\,\frac{\mathrm{EMA}(S_{ext})}{\mathrm{EMA}(S_{int}) + \epsilon}$$
The batch-level reward magnitude estimates are defined as:
$$S_{ext} = \frac{1}{B}\sum_{b=1}^{B}\left|r_b^{ext}\right|, \qquad S_{int} = \frac{1}{B}\sum_{b=1}^{B}\left|r_b^{int}\right|$$
Here, Sext and Sint denote the batch-level magnitude estimates of the external and intrinsic rewards, respectively, and B is the batch size. EMA(⋅) denotes the exponential moving average (EMA) used to estimate the running reward scale, and ϵ is a small positive constant for numerical stability.
The EMA operation provides a running estimate of the external and intrinsic reward scales and prevents the adaptive coefficient from reacting excessively to high-variance batch-level fluctuations. Therefore, ηt maintains a stable relative contribution of the intrinsic reward throughout training.
The target ratio λtar controls the desired contribution of the intrinsic reward relative to the external reward scale. It should be emphasized that λtar is not the final intrinsic reward ratio by itself. Instead, η0λtar determines the pre-gating target scale, while the actual intrinsic contribution in the final reward is also modulated by the resource-aware coefficient ρt. Combining both mechanisms, the overall reward shaping function for each transition is expressed as:
$$r'_t = r_t + \eta_t\,\rho_t\,r_t^{int}$$
where $r_t^{int}$ denotes the intrinsic reward derived from the prediction-error-based surprise model described in Section 4.2. Thus, the actual intrinsic reward term added to the external reward is $\eta_t \rho_t r_t^{int}$. When the remaining resource is sufficient, ρt ≈ 1, and the intrinsic reward is allowed to contribute according to the target scale controlled by ηt. When the remaining resource becomes scarce, ρt → 0, and the intrinsic reward is progressively suppressed regardless of its surprise value.
In our study, η0 = 0.3 and λtar = 0.6, giving a pre-gating target scale of η0λtar = 0.18. Therefore, when the resource gate is fully open, the surprise-based intrinsic reward is encouraged to remain approximately within 10–20% of the external reward scale. Under resource-limited conditions, the final contribution is further reduced by ρt, which prevents exploration bonuses from overpowering the task-specific objectives related to safety, collision avoidance, and resource conservation.
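The adaptive scaling and the final reward combination can be sketched as follows. The values η0 = 0.3 and λtar = 0.6 come from the text and the EMA statistics start at 1 as in Algorithm 1. The EMA decay of 0.99, the value of ϵ, the use of mean absolute rewards for both batch statistics, and all class and function names are assumptions made for this illustration.

```python
class AdaptiveIntrinsicScale:
    """Running estimate of eta_t from EMA-smoothed reward magnitudes (sketch)."""

    def __init__(self, eta0=0.3, lambda_tar=0.6, decay=0.99, eps=1e-8):
        self.eta0 = eta0
        self.lambda_tar = lambda_tar
        self.decay = decay      # hypothetical EMA decay, not reported in the text
        self.eps = eps          # small constant for numerical stability
        self.ema_ext = 1.0      # EMA statistics initialized to 1
        self.ema_int = 1.0

    def update(self, ext_rewards, int_rewards):
        """Update both EMAs from one mini-batch and return eta_t."""
        s_ext = sum(abs(r) for r in ext_rewards) / len(ext_rewards)
        s_int = sum(abs(r) for r in int_rewards) / len(int_rewards)
        self.ema_ext = self.decay * self.ema_ext + (1.0 - self.decay) * s_ext
        self.ema_int = self.decay * self.ema_int + (1.0 - self.decay) * s_int
        return self.eta0 * self.lambda_tar * self.ema_ext / (self.ema_int + self.eps)


def shaped_reward(ext_reward, int_reward, eta, rho):
    """External reward plus the scaled, resource-gated intrinsic reward."""
    return ext_reward + eta * rho * int_reward


# decay=0.0 makes the EMA equal to the current batch statistics,
# which keeps the arithmetic easy to follow.
scaler = AdaptiveIntrinsicScale(decay=0.0)
eta = scaler.update(ext_rewards=[-2.0, -4.0], int_rewards=[5.0, 7.0])
print(round(eta, 4))                                  # 0.18 * 3 / 6
print(round(shaped_reward(-1.0, 6.0, eta, 0.5), 4))   # gate half open
```

In this example the mean intrinsic magnitude is 6 and the mean external magnitude is 3, so η ≈ 0.09 rescales the intrinsic term to about 0.54, that is, 18% of the external scale before the resource gate is applied.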
The resulting algorithm, termed RAISE, effectively integrates adaptive exploration with resource-constrained decision-making, as illustrated in Figure 3 and summarized in Algorithm 1. This design ensures that the intrinsic reward serves as an auxiliary exploration signal whose magnitude is jointly controlled by EMA normalization and the resource coefficient. It enables the agent to balance two key objectives: ensuring safety through effective conflict-resolution advisories and reducing unnecessary or high-cost alerts under limited resource conditions. This mechanism provides a robust and data-efficient learning framework that promotes stable policy optimization in dynamic encounter scenarios.
Algorithm 1: RAISE—Resource-Aware Intrinsic Surprise Exploration
Input: Environment $\mathcal{E}$, policy $\pi_\theta$, ensemble model $f_\phi$, replay buffer $\mathcal{B}$,
     base coefficient $\eta_0$, target ratio $\lambda_{tar}$, resource limit $r_{max}$
Initialize: $\theta$, $\phi$, $\mathcal{B}$, EMA statistics $\hat{R}_{ext} \leftarrow 1$, $\hat{R}_{int} \leftarrow 1$
1: for training step t = 1, 2, … do
2:   Observe state $s_t$ and resource level $r_t^{res}$
3:   Select action $a_t \sim \pi_\theta(a_t \mid s_t)$ and execute in environment $\mathcal{E}$
4:   Obtain next state $s_{t+1}$ and external reward $r_t$
5:   Store $(s_t, a_t, r_t, s_{t+1})$ into buffer $\mathcal{B}$
7:   if update step is due then
8:     Sample a mini-batch {(s, a, r, s′)} from $\mathcal{B}$
9:     Compute intrinsic reward: $r^{int} = -\log p_\phi(s' \mid s, a)$
10:    Update EMA statistics:
11:       $\hat{R}_{ext} \leftarrow \alpha \hat{R}_{ext} + (1 - \alpha)\,\mathbb{E}[|r|]$; $\hat{R}_{int} \leftarrow \alpha \hat{R}_{int} + (1 - \alpha)\,\mathrm{Std}[r^{int}]$
12:    Compute adaptive coefficient: $\eta_t = \eta_0\,\lambda_{tar}\,\hat{R}_{ext} / (\hat{R}_{int} + \varepsilon)$
13:    Compute resource weight: $\rho_t = (r_t^{res} / r_{max})^{\beta}$
14:    Form shaped reward: $r'_t = r_t + \eta_t\,\rho_t\,r^{int}$
15:    Update critic $Q_\psi$ and actor $\pi_\theta$ using SAC with $r'_t$
16:    Update model $f_\phi$ by minimizing negative log-likelihood loss
17:   end if
18: end for
Output: Trained policy parameters $\theta$

5. Experimental Simulation and Result Analysis

5.1. Experimental Setup

To evaluate the effectiveness of the proposed RAISE algorithm, comparative experiments were conducted against three baselines: SAC-D, Surprise, and SAC-D-resource. All experiments were performed in a simulated two-UAV vertical encounter environment, where the ego-UAV executes vertical resolution advisories while the intruder follows a goal-directed approach strategy. The simulation runs in discrete time steps of 1 s, consistent with high-level tactical decision-making frequencies and typical surveillance update rates (e.g., ADS-B) for autonomous UAV frameworks.
The proposed RAISE algorithm extends the SAC-Discrete framework by incorporating an ensemble dynamics model for surprise-based intrinsic motivation and a resource-aware exploration coefficient that adapts the intensity of intrinsic rewards according to the remaining alert resource. This design allows the agent to maintain efficient exploration while reducing redundant or high-cost advisories under limited resources.
All algorithms were trained under the same conditions for 1.6 × 10^6 environment steps (approximately 1500 iterations). The main experimental configuration used a resource limit of rmax = 20, representing a moderate operational alert capacity. Additional analyses in Section 5.4 examine the effects of constrained (rmax = 15) and abundant (rmax = 25) resource settings. Replay buffer sizes, batch sizes, and learning rates were tuned for stable performance across models. For all methods, the entropy temperature α was automatically adjusted to balance exploration and exploitation, and target networks were updated using a soft-update rate of τ = 0.005. The detailed hyperparameter configuration is summarized in Table 5.

5.2. Algorithm Performance Evaluation Metrics

To comprehensively assess the performance of RL models in longitudinal UAV collision avoidance tasks, a set of interpretable and safety-oriented evaluation metrics is defined. These metrics are derived from the reward structure of the environment and are designed to capture the agent’s effectiveness across three critical dimensions: flight safety, avoidance efficiency, and advisory quality. All statistics are computed over the last 400 evaluation episodes to ensure convergence and stability.
  • Reward: Mean cumulative reward per episode, representing the overall trade-off between safety and alert efficiency.
  • Collision rate: Frequency of near-miss events, directly measuring flight safety.
  • Resolution success: Proportion of encounters maintaining safe vertical separation.
  • Alert: Mean number of advisories per episode, reflecting alert frequency.
  • Strengthening: Average number of increased-intensity advisories (e.g., transitioning from a standard to an aggressive vertical maneuver), indicating risk sensitivity.
  • Reversal: Number of reversals between climb and descend advisories, reflecting policy stability.
  • Crossing: Frequency of altitude crossings between UAVs, highlighting residual conflict risks.
These indicators provide a concise yet comprehensive evaluation framework, linking reinforcement learning performance with operationally relevant safety outcomes.
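As an illustration of how such per-episode statistics could be aggregated over the final evaluation window, the sketch below reports a mean and standard deviation for each metric. The record field names, the use of the population standard deviation, and the aggregation over episodes (rather than, for example, over random seeds) are assumptions of this sketch and are not specified in the text.

```python
import statistics

# Hypothetical per-episode record fields, one per metric listed above.
METRICS = ("reward", "collision", "resolved", "alerts",
           "strengthenings", "reversals", "crossings")


def summarize(episodes, window=400):
    """Return {metric: (mean, std)} over the last `window` episodes.

    Binary flags such as "collision" and "resolved" average to a rate.
    """
    recent = episodes[-window:]
    summary = {}
    for name in METRICS:
        values = [float(ep[name]) for ep in recent]
        summary[name] = (statistics.fmean(values), statistics.pstdev(values))
    return summary


# Two made-up episodes, only to show the call pattern.
log = [
    {"reward": -12.0, "collision": 0, "resolved": 1, "alerts": 2,
     "strengthenings": 2, "reversals": 0, "crossings": 0},
    {"reward": -16.0, "collision": 1, "resolved": 0, "alerts": 4,
     "strengthenings": 3, "reversals": 0, "crossings": 1},
]
for metric, (mean, std) in summarize(log).items():
    print(f"{metric}: {mean:.3f} ± {std:.3f}")
```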

5.3. Test Scenario Design

To evaluate the generalization ability of the proposed algorithm under diverse encounter conditions, a hybrid scenario generation strategy combining semi-random, fixed-point, and fully random initialization was adopted. This approach provides a balance between interpretability—by using structured and repeatable conflict geometries—and robustness, by exposing the agent to diverse and stochastic encounters.
Four representative scenarios (A–D) were constructed as shown in Table 6. Each scenario specifies the initial vertical states of the ego-UAV and intruder while introducing partial randomness in altitude and velocity to simulate realistic variability in flight encounters. Notably, the vertical rate boundaries (up to ±20 m/s) were specifically designed to encompass the kinematic capabilities of high-performance fixed-wing maritime UAVs and ship-borne VTOL platforms, accommodating the rapid altitude transitions required in dynamic marine environments.
Scenarios A–C represent semi-random and fixed-point configurations designed to analyze policy behavior in structured and interpretable conflict geometries. Scenario D introduces a fully random configuration to test algorithmic robustness and generalization under stochastic conditions.
For reproducibility, random seeds were fixed during evaluation, and each model was trained with multiple random initializations. This ensured that the comparative results across SAC-D, Surprise, and RAISE were statistically consistent and not affected by sampling variance.

5.4. Training Performance and Results of the Model

5.4.1. Performance Comparison Under Moderate Resource (20)

To evaluate the effectiveness of the proposed RAISE algorithm in a balanced operational environment, training experiments were conducted under a moderate resource limit of 20. This setting represents a practical operating condition in which the collision avoidance system retains sufficient advisory capability while remaining constrained from issuing excessive or unnecessarily complex maneuvers. The comparative performance of SAC-D, SAC-D-resource, Surprise, and RAISE is reported in Table 7, while their training dynamics and detailed behavioral metrics are illustrated in Figure 4 and Figure 5. All numerical results are reported as the mean ± standard deviation.
Overall, RAISE achieved the best average performance among the four methods under the moderate-resource setting. As shown in Table 7, RAISE obtained the highest average reward of −14.183 ± 2.757, improving over SAC-D (−15.912 ± 4.084), SAC-D-resource (−15.204 ± 3.586), and Surprise (−15.546 ± 3.707). In addition, RAISE achieved the highest resolution success rate of 0.843 ± 0.056 compared with 0.821 ± 0.080 for SAC-D, 0.831 ± 0.070 for SAC-D-resource, and 0.821 ± 0.073 for Surprise. Its collision rate was also among the lowest, reaching 0.003 ± 0.006, which is comparable to Surprise and lower than those of SAC-D and SAC-D-resource. These results indicate that RAISE improves the overall quality of conflict-resolution decisions while maintaining a low residual collision risk.
The advantage of RAISE is particularly evident in its advisory behavior. Although RAISE generated the highest average number of alerts (2.157 ± 0.529), the increase was moderate relative to SAC-D-resource (2.036 ± 0.426) and Surprise (2.082 ± 0.300). More importantly, RAISE substantially reduces complex or unstable advisory adjustments. It achieved the lowest strengthening value of 2.397 ± 0.497, compared with 3.226 ± 0.582 for SAC-D, 3.113 ± 0.538 for SAC-D-resource, and 2.834 ± 0.558 for Surprise. RAISE also produced the lowest reversal value of 0.014 ± 0.016, further improving upon Surprise (0.024 ± 0.026) and markedly outperforming SAC-D and SAC-D-resource. Its crossing value of 0.056 ± 0.039 was close to the best-performing SAC-D-resource baseline (0.052 ± 0.037) and lower than those of SAC-D and Surprise. Collectively, these results suggest that RAISE does not achieve improved performance simply by issuing more advisories; rather, it tends to generate more stable and less contradictory resolution actions.
Figure 4 presents the evolution of the evaluation return during training. All four methods improved rapidly in the initial stage and gradually stabilized thereafter. RAISE established an early return advantage and maintained a higher mean return than the baselines through most of the training. The enlarged late-stage view further highlights this advantage: while the baseline curves remained clustered at lower levels, RAISE preserved the highest mean return throughout the zoomed interval. This indicates that RAISE learns a better-performing policy under the same advisory resource constraint and retains this benefit after convergence. Despite partial overlap of the shaded regions due to seed-dependent variability, the sustained separation of the mean curves supports a clear and persistent improvement.
The detailed training metrics in Figure 5 further clarify the source of this improvement. All methods rapidly reduced the collision rate and increased resolution success during the early stage of learning. In the later stage, RAISE maintained slightly higher resolution success while exhibiting noticeably lower strengthening and reversal values than the competing approaches. Although its alert count was relatively high, the associated advisories were less frequently strengthened or reversed, indicating a more coherent resolution policy under limited resources. This behavior is consistent with the results in Table 7: RAISE improves the average reward and resolution success mainly by reducing unstable or unnecessarily escalated advisory actions, rather than by aggressively increasing maneuver frequency.
Overall, the results demonstrate that RAISE provides a favorable balance between conflict-resolution effectiveness and advisory stability under a resource limit of 20. Compared with the baseline methods, it achieved the highest average reward and resolution success rate, maintained a low collision rate, and substantially reduced strengthening and reversal behaviors. These findings support the effectiveness of the proposed resource-aware exploration mechanism in learning more reliable and stable collision-resolution policies in constrained operational environments.

5.4.2. Resource-Level Sensitivity and Adaptive Behavior Under Resource-Constrained and Resource-Rich Environments (15 and 25)

To further examine the adaptability and robustness of the proposed RAISE framework, additional experiments were conducted under resource-constrained (15) and resource-rich (25) conditions. These settings respectively simulate operational scenarios with limited alert capacity and ample advisory freedom, allowing an evaluation of how the resource-aware mechanism adjusts exploration and alert issuance behaviors.
Under resource level 15, where the system operates under tight alert constraints, RAISE maintained superior overall training performance compared to SAC-D and Surprise (Table 8). As shown in Figure 6 (left), RAISE not only converged faster but also achieved the highest final return (−23.39), outperforming SAC-D (−24.83) and Surprise (−26.74). Its collision rate remained the lowest throughout most of the training process (≈0.02), confirming that safety can still be preserved even when alert resources are scarce. In terms of advisory behavior (Figure 7), RAISE exhibited a pattern consistent with the moderate-resource case (Section 5.4.1): it issued slightly more alerts (1.65 per episode) than the baselines but with notably fewer strengthening actions (2.07 vs. 3.00) and almost no reversal actions (0.003). This indicates that RAISE allocates its limited alert budget more effectively, favoring timely, stable advisories over reactive corrections and unnecessary intensifications.
Under resource level 25, representing a resource-rich operating condition, all algorithms achieved comparably high returns (around −8 to −10; Table 9, Figure 6 right), with RAISE maintaining a slight performance edge (−8.22 vs. −8.59 for SAC-D and −9.79 for Surprise). Although the alert frequency naturally increased due to relaxed resource limitations, RAISE continued to demonstrate the lowest reversal rate (0.19) and maintained a moderate strengthening count (2.35) throughout training, reflecting its ability to adaptively adjust advisory intensity once sufficient safety margins are established (Figure 8). Notably, while Surprise initially exhibited a higher alert frequency, RAISE gradually surpassed it in later stages, showing that the algorithm dynamically balances safety assurance with efficient resource utilization as the environment stabilizes.
Overall, these supplementary experiments indicate that RAISE achieves consistent, balanced performance across resource regimes. When resources are scarce, it suppresses excessive advisory changes while maintaining a low collision rate and stable convergence. When resources are abundant, it flexibly expands advisory activity and exploration intensity without destabilizing the learned policy. This adaptability highlights the generalization capability of the resource-aware intrinsic exploration mechanism, supporting the consistency and scalability of the main findings under diverse operational constraints.
These results can also be interpreted as a sensitivity evaluation with respect to the available advisory-resource budget. As the resource level changes from 15 to 20 and 25, RAISE does not exhibit abrupt degradation or unstable advisory behavior. Instead, the policy adapts its command strategy according to the available budget: under scarce resources, it suppresses high-burden advisory transitions such as strengthening and reversal; under abundant resources, it permits more advisory activity to improve safety margins while maintaining stable command behavior. This indicates that the proposed resource-aware mechanism is not tuned to a single resource setting, but remains effective across different operational budgets.

5.5. Performance Evaluation in Testing Scenarios

To further evaluate the generalization capability of the proposed RAISE algorithm beyond the training distribution, testing experiments were conducted under four representative encounter scenarios, denoted as Scenarios A–D, using a moderate resource limit of 20. These scenarios cover different vertical-rate configurations and conflict geometries, enabling the policies to be examined under both structured and stochastic encounter conditions. In addition to the learning-based baselines, a rule-based controller and an artificial potential field controller were included as deterministic non-learning baselines. All methods were evaluated under the same scenarios and metrics, and the results are summarized in Figure 9 and Table 10.
Overall, RAISE achieved the most favorable safety-performance trade-off among the compared learning-based methods. Across Scenarios A, B, and D, RAISE obtained the highest average reward, with values of −21.739, −21.133, and −21.486, respectively. These results outperformed SAC-D, SAC-D-resource, and Surprise, indicating that the resource-aware intrinsic exploration mechanism improves policy generalization under unseen encounter configurations. In Scenario C, APF obtained a slightly higher reward than RAISE; however, this advantage was accompanied by a substantially higher collision rate and reversal count. Therefore, the reward result in Scenario C should be interpreted together with the safety and advisory-stability metrics rather than as an isolated measure.
In terms of collision avoidance, RAISE demonstrated the most reliable safety behavior. It achieved the lowest collision rate in all four scenarios, with collision rates of 0.026, 0.024, 0.004, and 0.012 in Scenarios A–D, respectively. Compared with SAC-D and Surprise, RAISE consistently reduced the collision risk while maintaining competitive or superior resolution success. For example, in Scenario A, the resolution success rate increased from 0.566 for SAC-D and 0.654 for Surprise to 0.742 for RAISE. A similar improvement was observed in Scenario B, where RAISE reached a resolution success rate of 0.752. In Scenario D, which represents a more random testing condition, RAISE still achieved a resolution success rate of 0.696, outperforming all other methods. Although APF achieved the highest resolution success in Scenario C, its collision rate remained considerably higher than that of RAISE, suggesting that its successful resolutions are less consistently associated with safe separation maintenance.
The comparison with deterministic controllers further highlights the advantage of learning a resource-aware policy. Rule-based and APF methods issue fewer alerts and produce very low strengthening counts, but this apparent command economy is achieved at the cost of weaker safety and poorer adaptability. In Scenarios A, B, and D, both deterministic methods showed substantially higher collision rates than RAISE. For instance, APF reached collision rates of 0.144, 0.120, and 0.144 in Scenarios A, B, and D, respectively, whereas RAISE kept the corresponding rates at 0.026, 0.024, and 0.012. Rule-based control exhibited a similar limitation, with low alert frequency but reduced resolution success and increased collision risk. These results indicate that simply minimizing the number of advisories does not necessarily lead to operationally acceptable behavior; rather, effective collision avoidance requires timely and context-sensitive interventions.
RAISE also showed a clear advantage in advisory stability. Although it generated more alerts than the other learning-based baselines, these alerts were more consistent and less oscillatory. The reversal count of RAISE remained the lowest or among the lowest across all scenarios, with values of 0.044, 0.050, 0.090, and 0.048 in Scenarios A–D. These values were substantially lower than those of SAC-D, SAC-D-resource, rule-based, and APF, and were also lower than Surprise in all scenarios. The reduction in reversal behavior is important because frequent changes in advisory direction may lead to confusing or contradictory guidance for remote operators. In addition, RAISE generally reduces strengthening actions compared with SAC-D and SAC-D-resource, indicating that the policy tends to generate earlier and more stable advisories instead of repeatedly escalating avoidance commands.
It is also worth noting that the deterministic controllers showed low strengthening counts mainly because their advisory activity is limited and less adaptive, not because they achieve a better balance between safety and command economy. Their high reversal rates and collision rates reveal that low command frequency alone is insufficient for robust collision avoidance. In contrast, RAISE uses a larger number of well-targeted alerts to maintain separation while simultaneously suppressing reversals and excessive strengthening. This behavior is consistent with the design objective of the resource-aware mechanism: the goal is not to minimize all commands indiscriminately, but to reduce redundant, contradictory, or operationally inefficient advisories while preserving safety.
In summary, the testing results confirm that RAISE generalizes well to unseen encounter scenarios and achieves a strong balance between collision safety and advisory stability. Compared with learning-based baselines, it consistently improves reward, reduces collision rate, and suppresses reversal behavior. Compared with deterministic controllers, it provides more adaptive and reliable avoidance decisions under diverse encounter geometries. These findings support the effectiveness of integrating resource-aware modulation and surprise-based intrinsic exploration for operationally suitable maritime UAV collision avoidance.
Figure 10 provides a trajectory-level visualization of the representative testing scenarios, where the Ego-UAV is shown on the left side of each subplot and the intruding aircraft is shown on the right side. The trajectories further illustrate the behavioral differences among the compared methods. The learning-based baselines exhibited less consistent vertical maneuver patterns in several scenarios, while the deterministic controllers generally produced simpler but relatively rigid avoidance responses. These observations are consistent with the quantitative results reported above, where the baselines showed higher collision rates, higher reversal counts, or weaker overall conflict-resolution performance. In contrast, RAISE generated more stable and progressive vertical maneuvers across the testing scenarios. Its trajectories avoid abrupt direction changes and excessive oscillations, supporting the quantitative results that RAISE achieves lower collision rates and fewer reversals while maintaining effective conflict resolution. These visual results therefore provide additional evidence that the proposed resource-aware mechanism contributes to both safety performance and avoidance-behavior stability.

5.6. Robustness Evaluation

While full sim-to-real deployment was beyond the scope of this work, robustness under realistic perturbations provides an important proxy for evaluating the operational reliability of collision-avoidance policies. In real maritime UAV operations, the state information available to the avoidance system may be affected by GPS positioning errors, barometric altitude inaccuracies, state-estimation uncertainty, and stochastic wind gusts. Therefore, additive Gaussian observation noise and stochastic vertical wind perturbations were adopted as controlled simulation proxies for sensor-measurement uncertainty and external wind-induced dynamic disturbances, respectively. A robustness evaluation was conducted to examine whether the learned policies could maintain reliable performance under degraded observation and dynamic conditions.
The robustness evaluation considered three types of perturbation in addition to the clean setting, giving the four perturbed configurations reported in Table 11. First, sensor noise was introduced by adding Gaussian noise to the directly measured state variables: relative altitude, ego-UAV vertical velocity, and intruder vertical velocity. Two noise levels were tested, corresponding to 5% and 10% of the operational ranges of the corresponding state variables. Second, wind disturbance was introduced by applying stochastic vertical wind perturbations to the ego-UAV dynamics. Third, a combined perturbation condition was considered, in which sensor noise and wind disturbance were applied simultaneously. All methods were evaluated under the same perturbation settings and metrics, and the results are reported in Table 11.
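The perturbation settings described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the noise standard deviation equals the stated fraction of each variable's full range width from Table 1, that perturbed values are clipped back to those ranges, and that wind acts as a zero-mean Gaussian offset on the ego vertical rate. The function names, dictionary keys, and the `wind_std` value are hypothetical.

```python
import random

# Operational ranges taken from Table 1 (altitude in m, vertical rates in m/s).
RANGES = {
    "h_rel": (-750.0, 750.0),
    "hdot_ego": (-20.0, 20.0),
    "hdot_int": (-15.0, 15.0),
}


def clip(value, lo, hi):
    """Clamp value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def add_sensor_noise(obs, level, rng):
    """Return a copy of obs with Gaussian noise on the measured variables.

    Assumption: std = level * (range width), with level = 0.05 or 0.10.
    Variables not listed in RANGES (e.g. tau) are left untouched.
    """
    noisy = dict(obs)
    for key, (lo, hi) in RANGES.items():
        sigma = level * (hi - lo)
        noisy[key] = clip(obs[key] + rng.gauss(0.0, sigma), lo, hi)
    return noisy


def apply_vertical_wind(hdot_ego, wind_std, rng):
    """Add a zero-mean Gaussian gust to the ego vertical rate (hypothetical wind model)."""
    lo, hi = RANGES["hdot_ego"]
    return clip(hdot_ego + rng.gauss(0.0, wind_std), lo, hi)


rng = random.Random(0)
obs = {"h_rel": 120.0, "hdot_ego": 2.0, "hdot_int": -3.0, "tau": 25.0}
noisy_obs = add_sensor_noise(obs, level=0.05, rng=rng)
gusty_rate = apply_vertical_wind(obs["hdot_ego"], wind_std=1.0, rng=rng)
```

The combined condition would correspond to calling both functions in the same step. Whether the paper scales noise by the full range width or by the half-range is not stated in this section, so the `sigma` line is the first thing to adjust.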
Overall, RAISE demonstrated the strongest robustness among the compared methods. Under the clean condition, RAISE achieved the highest reward, the lowest collision rate, and the highest resolution success rate, with a collision rate of 0.022 and a resolution success rate of 0.637. When 5% sensor noise was introduced, the collision rate of RAISE increased to 0.044, but remained lower than those of SAC-D, Surprise, and APF. Under 10% sensor noise, all methods experienced performance degradation, but RAISE still maintained the lowest collision rate among the compared methods, with a value of 0.127. This indicates that although sensor uncertainty makes the avoidance task more difficult for every method, RAISE retains the lowest collision rate at each tested noise level.
The advantage of RAISE was also evident under wind disturbance. In the wind-disturbed environment, RAISE achieved a collision rate of 0.020 and a resolution success rate of 0.609, both of which remained close to its clean-environment performance. In comparison, SAC-D and Surprise exhibited lower resolution success rates, while APF showed a higher collision rate and weaker overall robustness. Under the combined Noise + Wind condition, all methods degraded further, as expected. Nevertheless, RAISE still obtained the best reward and the lowest collision rate, with a collision rate of 0.128, outperforming SAC-D, Surprise, and APF. These results suggest that the proposed method maintains a more reliable safety margin under simultaneous observation and dynamic perturbations.
In terms of advisory behavior, RAISE achieved the lowest reversal count in every environment configuration: 0.076 under the clean condition, 0.089 under 5% noise, 0.122 under 10% noise, 0.077 under wind disturbance, and 0.121 under the combined Noise + Wind condition. These values were substantially lower than those of APF (0.531 to 1.167) and were also lower than those of SAC-D and Surprise in all five configurations. This result indicates that RAISE maintains stable advisory directions even when the observed state or vehicle dynamics are perturbed. Although RAISE produced more alerts than the other methods, these alerts were associated with lower collision risk and fewer reversals, suggesting that the additional interventions are consistent and safety-oriented rather than oscillatory.
The deterministic APF controller exhibited a different robustness pattern. APF generated the fewest alerts and strengthening actions in the clean environment, but its performance deteriorated substantially under sensor noise. Its collision rate increased from 0.109 in the clean environment to 0.225 under 5% noise and 0.296 under 10% noise, while its reversal count also increased markedly. Under the combined Noise + Wind condition, APF reached the highest collision rate among all methods, with a value of 0.303. These results show that low advisory frequency alone does not guarantee robust avoidance performance. A policy must also adapt its maneuvers to uncertain and perturbed encounter states.
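The degradation figures quoted above can be recomputed directly from Table 11. The short sketch below copies the collision rates from that table and reports the absolute increase relative to the clean setting; the dictionary keys and function names are illustrative only.

```python
# Collision rates copied from Table 11 (mean over the four test scenarios).
COLLISION_RATE = {
    "SAC-D":    {"clean": 0.065, "noise_5": 0.091, "noise_10": 0.153, "wind": 0.077, "noise_wind": 0.173},
    "Surprise": {"clean": 0.049, "noise_5": 0.081, "noise_10": 0.144, "wind": 0.053, "noise_wind": 0.133},
    "APF":      {"clean": 0.109, "noise_5": 0.225, "noise_10": 0.296, "wind": 0.120, "noise_wind": 0.303},
    "RAISE":    {"clean": 0.022, "noise_5": 0.044, "noise_10": 0.127, "wind": 0.020, "noise_wind": 0.128},
}


def collision_increase(agent, env):
    """Absolute increase in collision rate relative to the clean setting."""
    rates = COLLISION_RATE[agent]
    return round(rates[env] - rates["clean"], 3)


def lowest_collision_agent(env):
    """Agent with the lowest collision rate in the given configuration."""
    return min(COLLISION_RATE, key=lambda agent: COLLISION_RATE[agent][env])


increase_10 = {agent: collision_increase(agent, "noise_10") for agent in COLLISION_RATE}
```

With these numbers, APF shows the largest absolute increase at 10% noise (0.187), the learned policies rise by roughly 0.09 to 0.11, and RAISE has the lowest collision rate in every configuration.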
The robustness of RAISE may be partly related to its resource-aware learning design. During training, the resource-aware modulation encourages the policy to balance collision avoidance performance with advisory stability, rather than minimizing or maximizing maneuver commands indiscriminately. In addition, the surprise-based intrinsic reward exposes the policy to less predictable state transitions, which may improve its tolerance to moderate distributional variations during testing. The bounded and threshold-based reward terms also help avoid overly sensitive responses to small state deviations near critical safety boundaries. These design factors provide a possible explanation for why RAISE maintains lower collision rates and fewer reversals under sensor noise, wind disturbance, and combined perturbations.
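To make the surprise signal concrete, the sketch below computes a negative log-likelihood under a diagonal Gaussian prediction and rescales it with an exponential moving average. It is only an illustration of the general technique: the Gaussian form, the `decay` value, and the way the EMA is applied are assumptions, and the paper's ensemble dynamics model and resource-aware modulation are not reproduced here.

```python
import math


def gaussian_nll(next_state, pred_mean, pred_var):
    """Negative log-likelihood of next_state under a diagonal Gaussian prediction."""
    nll = 0.0
    for x, mu, var in zip(next_state, pred_mean, pred_var):
        nll += 0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return nll


class EmaScaler:
    """Rescale a signal by an exponential moving average of its magnitude.

    Assumption: decay=0.99 and division by the running mean are illustrative
    choices, not values taken from the paper.
    """

    def __init__(self, decay=0.99, eps=1e-8):
        self.decay = decay
        self.eps = eps
        self.mean_abs = None

    def __call__(self, value):
        magnitude = abs(value)
        if self.mean_abs is None:
            self.mean_abs = magnitude
        else:
            self.mean_abs = self.decay * self.mean_abs + (1.0 - self.decay) * magnitude
        return value / (self.mean_abs + self.eps)


scaler = EmaScaler()
# A transition far from the predicted mean yields a larger surprise value.
small = gaussian_nll([0.1], [0.0], [1.0])
large = gaussian_nll([3.0], [0.0], [1.0])
scaled_large = scaler(large)
```

Under this reading, a perturbed observation that the dynamics model predicts poorly produces a larger prediction error, which is the effect the paragraph above appeals to.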
In summary, the robustness results confirm that RAISE provides more reliable performance under degraded operating conditions. Compared with learning-based baselines, it maintains lower collision rates, higher or competitive resolution success, and more stable advisory behavior. Compared with the deterministic APF controller, it shows stronger adaptability to sensor uncertainty and wind disturbance. These findings further support the effectiveness of the proposed resource-aware reinforcement learning framework for maritime UAV collision avoidance under realistic perturbations.

6. Conclusions

In this paper, we proposed a resource-aware intrinsic surprise exploration framework (RAISE) to address the challenge of balancing flight safety with operational suitability in deep reinforcement learning-based maritime UAV collision avoidance. By introducing a virtual resource mechanism, RAISE regulates alert frequency and intensity, treating system advisory capacity as a finite budget. This resource-aware modulation is combined with a surprise-based intrinsic motivation module, which maintains effective exploration even under strict resource constraints. In simulation, RAISE achieved the highest reward and resolution success among the compared learning-based methods at every tested resource level (15, 20, and 25). It improves operational suitability by generally reducing alert reversals and strengthening counts, which are key indicators of control stability and operational overhead, while maintaining robust safety margins (low near-miss rates). When resources are scarce, the agent suppresses unstable and high-burden advisory transitions, especially reversal and strengthening commands, rather than simply minimizing the raw alert count; when resources are abundant, it flexibly optimizes trajectory smoothness without compromising stability. Results in unseen encounter geometries indicate that the framework generalizes beyond the training distribution. Furthermore, the surprise-based intrinsic motivation mechanism reflects distribution shifts: under sensor noise or wind perturbations, prediction error increases naturally, which may help the policy respond more cautiously to degraded operating conditions without retraining. Future research will focus on extending this approach to full 3D environments that incorporate horizontal maneuvers and on handling complex multi-UAV encounters. Additionally, we intend to integrate probabilistic remote operator response models and formal safe reinforcement learning techniques to enable a more comprehensive assessment of human–machine coordination with stronger safety guarantees in next-generation supervisory collision avoidance systems.

Author Contributions

Methodology, Z.L. and X.G.; software, Z.L. and Q.F.; validation, Q.F. and Z.W.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L., Q.F. and Z.W.; supervision, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly funded by the National Key Laboratory of Air-based Information Perception and Fusion and the Aeronautical Science Foundation of China, Grant No. 202471.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
RAISE: Resource-Aware Intrinsic Surprise Exploration
UAV: Unmanned Aerial Vehicle
EMA: Exponential Moving Average
DRL: Deep Reinforcement Learning
SAC-D: Soft Actor–Critic with Discrete Actions
APF: Artificial Potential Fields
VO: Velocity Obstacles
DQN: Deep Q-Networks
CRL: Constrained Reinforcement Learning
IM: Intrinsic Motivation
CPA: Closest Point of Approach
BVLOS: Beyond Visual Line of Sight

References

  1. Nomikos, N.; Gkonis, P.K.; Bithas, P.S.; Trakadas, P. A Survey on UAV-Aided Maritime Communications: Deployment Considerations, Applications, and Future Challenges. IEEE Open J. Commun. Soc. 2023, 4, 56–78. [Google Scholar] [CrossRef]
  2. Cao, Y.; Zhao, G.; Wu, Y.; Wang, H.; Sun, J.; Zhang, L. Dynamic Separation Standards for Multi-Category UAV Operations. Aerospace 2025, 12, 1064. [Google Scholar] [CrossRef]
  3. Riedel, M. A Review of Detect and Avoid Standards for Unmanned Aircraft Systems. Aerospace 2025, 12, 344. [Google Scholar] [CrossRef]
  4. Lee, J.D.; See, K.A. Trust in Automation: Designing for Appropriate Reliance. Hum. Factors J. Hum. Factors Ergon. Soc. 2004, 46, 50–80. [Google Scholar] [CrossRef]
  5. Cummings, M.L.; Mitchell, P.J. Predicting Controller Capacity in Supervisory Control of Multiple UAVs. IEEE Trans. Syst. Man Cybern.—Part A Syst. Hum. 2008, 38, 451–460. [Google Scholar] [CrossRef]
  6. Tang, J.; Lao, S.; Wan, Y. Systematic Review of Collision-Avoidance Approaches for Unmanned Aerial Vehicles. IEEE Syst. J. 2022, 16, 4356–4367. [Google Scholar] [CrossRef]
  7. Liu, Y.; Yan, J.; Zhao, X. Deep Reinforcement Learning Based Latency Minimization for Mobile Edge Computing with Virtualization in Maritime UAV Communication Network. IEEE Trans. Veh. Technol. 2022, 71, 4225–4236. [Google Scholar] [CrossRef]
  8. Zhang, C.; Lin, B.; Hu, X.; Qi, S.; Qian, L.; Wu, Y. Resource Management and Trajectory Optimization for UAV-IRS Assisted Maritime Edge Computing Networks. Tsinghua Sci. Technol. 2025, 30, 1600–1616. [Google Scholar] [CrossRef]
  9. Jenie, Y.I.; Kampen, E.-J.v.; de Visser, C.C.; Ellerbroek, J.; Hoekstra, J.M. Selective Velocity Obstacle Method for Deconflicting Maneuvers Applied to Unmanned Aerial Vehicles. J. Guid. Control. Dyn. 2015, 38, 1140–1146. [Google Scholar] [CrossRef]
  10. Sun, J.; Tang, J.; Lao, S. Collision Avoidance for Cooperative UAVs With Optimized Artificial Potential Field Algorithm. IEEE Access 2017, 5, 18382–18390. [Google Scholar] [CrossRef]
  11. Merei, A.; Mcheick, H.; Ghaddar, A.; Rebaine, D. A Survey on Obstacle Detection and Avoidance Methods for UAVs. Drones 2025, 9, 203. [Google Scholar] [CrossRef]
  12. Kamel, M.; Alonso-Mora, J.; Siegwart, R.; Nieto, J. Robust collision avoidance for multiple micro aerial vehicles using nonlinear model predictive control. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2017; pp. 236–243. [Google Scholar]
  13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  14. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  15. Chen, H.; Lin, Y.; Fu, M.; Yao, L.; Sheng, M. A Survey on Reinforcement Learning Methods for UAV Systems. ACM Comput. Surv. 2025, 58, 103. [Google Scholar] [CrossRef]
  16. Ekechi, C.C.; Elfouly, T.; Alouani, A.; Khattab, T. A Survey on UAV Control with Multi-Agent Reinforcement Learning. Drones 2025, 9, 484. [Google Scholar] [CrossRef]
  17. Wachi, A.; Shen, X.; Sui, Y. A Survey of Constraint Formulations in Safe Reinforcement Learning. arXiv 2024, arXiv:2402.02025. [Google Scholar] [CrossRef]
  18. Li, S.; Egorov, M.; Kochenderfer, M. Optimizing collision avoidance in dense airspace using deep reinforcement learning. arXiv 2019, arXiv:1912.10146. [Google Scholar] [CrossRef]
  19. Wang, Z.; Pan, T.; Zhou, Q.; Wang, J. Efficient exploration in resource-restricted reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2023, 37, 10279–10287. [Google Scholar] [CrossRef]
  20. Christodoulou, P. Soft actor-critic for discrete action settings. arXiv 2019, arXiv:1910.07207. [Google Scholar] [CrossRef]
  21. Achiam, J.; Sastry, S. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv 2017, arXiv:1703.01732. [Google Scholar] [CrossRef]
  22. Bounini, F.; Gingras, D.; Pollart, H.; Gruyer, D. Modified artificial potential field method for online path planning applications. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2017; pp. 180–185. [Google Scholar]
  23. Koren, Y.; Borenstein, J. Potential field methods and their inherent limitations for mobile robot navigation. In Proceedings of the 1991 IEEE International Conference on Robotics and Automation; IEEE: New York, NY, USA, 1991; Volume 1392, pp. 1398–1404. [Google Scholar]
  24. Biswas, K.; Kar, I. On reduction of oscillations in target tracking by artificial potential field method. In Proceedings of the 2014 9th International Conference on Industrial and Information Systems (ICIIS), Gwalior, India, 15–17 December 2014; pp. 1–6. [Google Scholar]
  25. Jiang, W.; Cai, T.; Xu, G.; Wang, Y. Autonomous obstacle avoidance and target tracking of UAV: Transformer for observation sequence in reinforcement learning. Knowl.-Based Syst. 2024, 290, 111604. [Google Scholar] [CrossRef]
  26. Liang, C.; Liu, L.; Liu, C. Multi-UAV autonomous collision avoidance based on PPO-GIC algorithm with CNN–LSTM fusion network. Neural Netw. 2023, 162, 21–33. [Google Scholar] [CrossRef]
  27. Zheng, K.; Zhang, X.; Wang, C.; Zhang, M.; Cui, H. A partially observable multi-ship collision avoidance decision-making model based on deep reinforcement learning. Ocean Coast. Manag. 2023, 242, 106689. [Google Scholar] [CrossRef]
  28. Wang, Y.; Xu, H.; Feng, H.; He, J.; Yang, H.; Li, F.; Yang, Z.J.O.E. Deep reinforcement learning based collision avoidance system for autonomous ships. Ocean Eng. 2024, 292, 116527. [Google Scholar] [CrossRef]
  29. Hu, Z.; Gao, X.; Wan, K.; Wang, Q.; Zhai, Y. Asynchronous Curriculum Experience Replay: A Deep Reinforcement Learning Approach for UAV Autonomous Motion Control in Unknown Dynamic Environments. IEEE Trans. Veh. Technol. 2023, 72, 13985–14001. [Google Scholar] [CrossRef]
  30. Song, C.; Zhang, Y.; Bai, S.; Li, B.; Gan, Z.; Neretin, E. An end-to-end Flight Control Method for UAVs Based on MD-SAC. IEEE Trans. Consum. Electron. 2025, 71, 3641–3653. [Google Scholar] [CrossRef]
  31. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 22–31. [Google Scholar]
  32. Kormushev, P.; Calinon, S.; Caldwell, D.G. Reinforcement learning in robotics: Applications and real-world challenges. Robotics 2013, 2, 122–148. [Google Scholar] [CrossRef]
  33. Kormushev, P.; Ugurlu, B.; Calinon, S.; Tsagarakis, N.G.; Caldwell, D.G. Bipedal walking energy minimization by reinforcement learning with evolving policy parameterization. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems; IEEE: New York, NY, USA, 2011; pp. 318–324. [Google Scholar]
  34. Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; Jaśkowski, W. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Proceedings of the 2016 IEEE Conference on Computational Intelligence and Games (CIG); IEEE: New York, NY, USA, 2016; pp. 1–8. [Google Scholar]
  35. Resnick, C.; Eldridge, W.; Ha, D.; Britz, D.; Foerster, J.; Togelius, J.; Cho, K.; Bruna, J. Pommerman: A multi-agent playground. arXiv 2018, arXiv:1809.07124. [Google Scholar]
  36. Strehl, A.L.; Littman, M.L. An empirical evaluation of interval estimation for markov decision processes. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence; IEEE: New York, NY, USA, 2004; pp. 128–135. [Google Scholar]
  37. Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. In Proceedings of the Advances in Neural Information Processing Systems, Sydney, Australia, 6–12 December 2016; Volume 29. [Google Scholar]
  38. Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; Abbeel, P. Vime: Variational information maximizing exploration. In Proceedings of the Advances in Neural Information Processing Systems, Sydney, Australia, 6–12 December 2016. [Google Scholar]
  39. Shyam, P.; Jaśkowski, W.; Gomez, F. Model-based active exploration. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5779–5788. [Google Scholar]
  40. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 21–26 July 2017; pp. 2778–2787. [Google Scholar]
  41. Mohamed, S.; Jimenez Rezende, D. Variational information maximisation for intrinsically motivated reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  42. Thipphavong, D.; Cone, A.; Lee, S. Ensuring Interoperability Between Unmanned Aircraft Detect-and-Avoid and Manned Aircraft Collision Avoidance. In Proceedings of the USA/Europe Air Traffic Management Research and Development Seminar, Seattle, WA, USA, 26–30 June 2017. [Google Scholar]
Figure 1. Representation of the UAV’s velocity and the relative position of the intruding UAV (I) with respect to the ego-UAV (E) at the CPA.
Figure 2. Excessive advisory activity during training under unconstrained conditions.
Figure 3. Conceptual architecture of the RAISE framework.
Figure 4. Average return curves of the algorithms during training under a moderate resource level (20).
Figure 5. Core performance metrics of the algorithms under a moderate resource level (20).
Figure 6. Average return of the algorithms under resource-constrained (15, left) and resource-rich (25, right) conditions.
Figure 7. Core performance metrics of the algorithms under the resource-constrained condition (15).
Figure 8. Core performance metrics of the algorithms under the resource-rich condition (25).
Figure 9. Performance comparison of four representative metrics of all trained models under a moderate resource level (20).
Figure 10. Collision-avoidance trajectories of the compared models across four representative test scenarios (A–D).
Table 1. State space variables.
| Variable | Description | Values | Units |
|---|---|---|---|
| $h_t$ | Relative intruder altitude | [−750, 750] | m |
| $\dot{h}_{ego,t}$ | Ego-UAV vertical rate | [−20, 20] | m/s |
| $\dot{h}_{int,t}$ | Intruding UAV vertical rate | [−15, 15] | m/s |
| $\tau$ | Time to loss of horizontal separation | [0, 40] | s |
| $a_{prev}$ | Previous advisory | See Table 2 | - |
| $r_t^{res}$ | Remaining resource | [0, $r_{max}$] | - |
Table 2. Advisory set and advisory availability.
| Action | Description | Acceleration | Available From |
|---|---|---|---|
| NOC | No conflict | 0 | All |
| DES-N | Normal descend ≤ −7.5 m/s | −g/3 | NOC |
| CLB-N | Normal climb ≥ 7.5 m/s | g/3 | NOC |
| DES-T | Transitional descend ≤ −7.5 m/s | −g/2.5 | CLB-N, CLB-T, CLB-E, DES-E |
| CLB-T | Transitional climb ≥ 7.5 m/s | g/2.5 | DES-N, DES-T, DES-E, CLB-E |
| DES-E | Escalated descend ≤ −12.5 m/s | −g/2.5 | DES-N, DES-T |
| CLB-E | Escalated climb ≥ 12.5 m/s | g/2.5 | CLB-N, CLB-T |
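The "Available From" column of Table 2 can be read as an action mask over the previous advisory. The sketch below transcribes that column into a lookup and returns the advisories selectable from a given previous advisory. Table 2 does not say whether an advisory may simply be continued, so the `allow_hold` parameter is a hypothetical switch added for illustration, and the function name is likewise an assumption.

```python
ADVISORIES = ["NOC", "DES-N", "CLB-N", "DES-T", "CLB-T", "DES-E", "CLB-E"]

# "Available From" column of Table 2: advisory -> previous advisories it may follow.
AVAILABLE_FROM = {
    "NOC": set(ADVISORIES),  # "All"
    "DES-N": {"NOC"},
    "CLB-N": {"NOC"},
    "DES-T": {"CLB-N", "CLB-T", "CLB-E", "DES-E"},
    "CLB-T": {"DES-N", "DES-T", "DES-E", "CLB-E"},
    "DES-E": {"DES-N", "DES-T"},
    "CLB-E": {"CLB-N", "CLB-T"},
}


def available_advisories(previous, allow_hold=True):
    """List the advisories that may follow `previous` according to Table 2.

    allow_hold is an assumption: when True, continuing the current advisory
    is also permitted even though Table 2 does not list it explicitly.
    """
    allowed = [a for a in ADVISORIES if previous in AVAILABLE_FROM[a]]
    if allow_hold and previous not in allowed:
        allowed.append(previous)
    return allowed


from_noc = available_advisories("NOC")
from_des_n = available_advisories("DES-N", allow_hold=False)
```

Read this way, a normal descend can be cleared (NOC), reversed into a transitional climb (CLB-T), or escalated (DES-E), which matches the reversal and strengthening transitions discussed in the results.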
Table 3. Parameters used in the extrinsic reward function.
| Parameter | Value | Meaning |
|---|---|---|
| $\omega_{col}$ | 100 | Conflict penalty weight |
| $\omega_{alert}$ | 0.3 | Alert penalty |
| $\omega_{str}$ | 0.3 | Strengthening advisory penalty |
| $\omega_{rev}$ | 0.5 | Reversal advisory penalty |
| $\omega_{cross}$ | 0.5 | Vertical path-crossing penalty |
| $\omega_{lim}$ | 0.1 | Altitude-limit penalty coefficient |
| $\omega_{ex}$ | 0.1 | Excessive terminal-separation coefficient |
| $\gamma_{alt}$ | 25 | Saturation bound of excessive terminal-separation penalty |
| $\omega_{NOC}$ | 0.2 | Conflict-clearance reward |
Table 4. Comparison of advisory behaviors for SAC-D and surprise models.
| Model | Alerts | Strengthening | Reversal |
|---|---|---|---|
| SAC-D | 11.400 | 4.530 | 0.433 |
| Surprise | 10.733 | 6.790 | 0.900 |
Table 5. Simulation settings.
| Parameter | Value/Setting |
|---|---|
| Learning rate | $3 \times 10^{-4}$ |
| Optimizer | Adam |
| Discount factor (γ) | 0.975 |
| Replay buffer size | $4 \times 10^{4}$ |
| Batch size | 512 |
| Intrinsic model | Ensemble dynamics (for Surprise, RAISE) |
| Intrinsic reward | Negative log-likelihood (Surprise) |
| Resource limit ($r_{max}$) | 20 (default); 15/25 (for comparison) |
| Target update (τ) | 0.005 |
| Entropy coefficient (α) | Auto-tuned |
| Training iterations | 1500 (all algorithms) |
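For readers who want to reproduce the setup, the settings in Table 5 can be collected into a single configuration object. The sketch below is one possible layout rather than the authors' code: the class and field names are hypothetical, the learning rate and replay buffer size are taken as 3e-4 and 4e4, and the auto-tuned entropy coefficient is represented by a placeholder string.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Training settings transcribed from Table 5 (field names are illustrative)."""
    learning_rate: float = 3e-4
    optimizer: str = "Adam"
    discount: float = 0.975
    replay_buffer_size: int = 40_000   # assumed reading of 4 x 10^4
    batch_size: int = 512
    resource_limit: int = 20           # r_max; 15 and 25 are used for comparison
    target_update_tau: float = 0.005
    entropy_coefficient: str = "auto"  # placeholder for the auto-tuned alpha
    training_iterations: int = 1500


default_cfg = TrainConfig()
# Variants for the resource-constrained and resource-rich experiments.
constrained_cfg = replace(default_cfg, resource_limit=15)
rich_cfg = replace(default_cfg, resource_limit=25)
```

Keeping the configuration frozen and deriving the 15 and 25 variants with `replace` makes it explicit that only the resource limit changes between the three experimental conditions.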
Table 6. Parameter settings for the four test scenarios.
| Scenario | $h_{ego}$ | $\dot{h}_{ego}$ | $h_{int}$ | $\dot{h}_{int}$ |
|---|---|---|---|---|
| A | 0 | Uniform (−20, 20) | Uniform (−300, 300) | −3 |
| B | 0 | 10 | Uniform (−300, 300) | Uniform (−15, 15) |
| C | 0 | Uniform (−20, 20) | 0 | Uniform (−15, 15) |
| D | 0 | Uniform (−20, 20) | Uniform (−300, 300) | Uniform (−15, 15) |
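The scenario definitions in Table 6 translate naturally into a small sampler of initial conditions. The following sketch is an illustration only: the helper names and dictionary keys are hypothetical, and the units are presumed to follow Table 1 (metres for altitudes, metres per second for vertical rates), which Table 6 itself does not state.

```python
import random


def uniform(lo, hi):
    """Sampler drawing from Uniform(lo, hi)."""
    return lambda rng: rng.uniform(lo, hi)


def constant(value):
    """Sampler that always returns the same value."""
    return lambda rng: value


# Initial-condition samplers per scenario, transcribed from Table 6.
SCENARIOS = {
    "A": {"h_ego": constant(0.0), "hdot_ego": uniform(-20.0, 20.0),
          "h_int": uniform(-300.0, 300.0), "hdot_int": constant(-3.0)},
    "B": {"h_ego": constant(0.0), "hdot_ego": constant(10.0),
          "h_int": uniform(-300.0, 300.0), "hdot_int": uniform(-15.0, 15.0)},
    "C": {"h_ego": constant(0.0), "hdot_ego": uniform(-20.0, 20.0),
          "h_int": constant(0.0), "hdot_int": uniform(-15.0, 15.0)},
    "D": {"h_ego": constant(0.0), "hdot_ego": uniform(-20.0, 20.0),
          "h_int": uniform(-300.0, 300.0), "hdot_int": uniform(-15.0, 15.0)},
}


def sample_encounter(name, rng):
    """Draw one set of initial conditions for scenario `name` ('A' to 'D')."""
    return {var: draw(rng) for var, draw in SCENARIOS[name].items()}


rng = random.Random(42)
encounter = sample_encounter("B", rng)
```

Scenario D randomizes every quantity except the ego altitude, which is why the text describes it as the more random testing condition.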
Table 7. Overall training results of all algorithms under a moderate resource level (20).
| Model | Reward | Collision Rate | Resolution Success | Alert | Strengthening | Reversal | Crossing |
|---|---|---|---|---|---|---|---|
| SAC-D | −15.912 ± 4.084 | 0.006 ± 0.008 | 0.821 ± 0.080 | 1.506 ± 0.239 | 3.226 ± 0.582 | 0.112 ± 0.095 | 0.068 ± 0.042 |
| SAC-D-resource | −15.204 ± 3.586 | 0.004 ± 0.006 | 0.831 ± 0.070 | 2.036 ± 0.426 | 3.113 ± 0.538 | 0.107 ± 0.056 | 0.052 ± 0.037 |
| Surprise | −15.546 ± 3.707 | 0.003 ± 0.005 | 0.821 ± 0.073 | 2.082 ± 0.300 | 2.834 ± 0.558 | 0.024 ± 0.026 | 0.070 ± 0.046 |
| RAISE | −14.183 ± 2.757 | 0.003 ± 0.006 | 0.843 ± 0.056 | 2.157 ± 0.529 | 2.397 ± 0.497 | 0.014 ± 0.016 | 0.056 ± 0.039 |
Table 8. Overall training results of the algorithms under the resource-constrained condition (15).
| Model | Reward | Collision Rate | Resolution Success | Alert | Strengthening | Reversal | Crossing |
|---|---|---|---|---|---|---|---|
| SAC-D | −24.825 | 0.013 | 0.680 | 1.320 | 3.003 | 0.047 | 0.087 |
| Surprise | −26.739 | 0.050 | 0.673 | 1.367 | 2.747 | 0.037 | 0.147 |
| RAISE | −23.392 | 0.020 | 0.693 | 1.650 | 2.070 | 0.003 | 0.107 |
Table 9. Overall training results of the algorithms under the resource-rich condition (25).
| Model | Reward | Collision Rate | Resolution Success | Alert | Strengthening | Reversal | Crossing |
|---|---|---|---|---|---|---|---|
| SAC-D | −8.588 | 0.000 | 0.940 | 2.540 | 3.210 | 0.340 | 0.020 |
| Surprise | −9.793 | 0.000 | 0.930 | 2.910 | 1.900 | 0.160 | 0.040 |
| RAISE | −8.223 | 0.000 | 0.950 | 3.020 | 2.350 | 0.190 | 0.020 |
Table 10. Testing performance of all trained models across representative encounter scenarios under a moderate resource level (20).
| Scenario | Agent | Reward | Collision Rate | Resolution Success | Alert | Reversal | Strengthening | Crossing |
|---|---|---|---|---|---|---|---|---|
| A | SAC-D | −32.127 ± 29.665 | 0.058 ± 0.234 | 0.566 ± 0.496 | 0.842 ± 0.932 | 0.138 ± 0.345 | 4.372 ± 3.266 | 0.248 ± 0.450 |
| A | SAC-D-resource | −28.257 ± 25.933 | 0.026 ± 0.159 | 0.622 ± 0.485 | 1.400 ± 1.446 | 0.158 ± 0.370 | 4.706 ± 3.331 | 0.214 ± 0.415 |
| A | Surprise | −27.427 ± 29.637 | 0.068 ± 0.252 | 0.654 ± 0.476 | 1.670 ± 1.348 | 0.140 ± 0.347 | 2.952 ± 2.942 | 0.194 ± 0.420 |
| A | RAISE | −21.739 ± 25.514 | 0.026 ± 0.159 | 0.742 ± 0.438 | 2.712 ± 1.579 | 0.044 ± 0.215 | 2.232 ± 2.468 | 0.182 ± 0.411 |
| A | Rule-based | −39.056 ± 30.374 | 0.126 ± 0.332 | 0.468 ± 0.499 | 0.676 ± 0.669 | 0.390 ± 0.520 | 0.136 ± 0.343 | 0.030 ± 0.182 |
| A | APF | −39.607 ± 32.237 | 0.144 ± 0.351 | 0.520 ± 0.500 | 0.502 ± 0.612 | 0.522 ± 0.664 | 0.222 ± 0.448 | 0.060 ± 0.284 |
| B | SAC-D | −31.585 ± 27.116 | 0.044 ± 0.205 | 0.528 ± 0.499 | 0.638 ± 0.753 | 0.214 ± 0.410 | 4.270 ± 2.924 | 0.260 ± 0.439 |
| B | SAC-D-resource | −32.154 ± 28.539 | 0.032 ± 0.176 | 0.602 ± 0.489 | 1.040 ± 1.384 | 0.306 ± 0.486 | 4.986 ± 3.091 | 0.284 ± 0.451 |
| B | Surprise | −28.155 ± 30.499 | 0.064 ± 0.245 | 0.634 ± 0.482 | 1.546 ± 1.263 | 0.130 ± 0.365 | 2.654 ± 2.612 | 0.190 ± 0.407 |
| B | RAISE | −21.133 ± 23.556 | 0.024 ± 0.153 | 0.752 ± 0.432 | 3.086 ± 1.863 | 0.050 ± 0.227 | 1.568 ± 1.542 | 0.146 ± 0.353 |
| B | Rule-based | −42.088 ± 31.682 | 0.134 ± 0.341 | 0.394 ± 0.489 | 0.704 ± 0.658 | 0.382 ± 0.522 | 0.156 ± 0.368 | 0.028 ± 0.226 |
| B | APF | −41.883 ± 31.452 | 0.120 ± 0.325 | 0.460 ± 0.498 | 0.516 ± 0.627 | 0.496 ± 0.608 | 0.258 ± 0.451 | 0.034 ± 0.255 |
| C | SAC-D | −35.408 ± 21.293 | 0.046 ± 0.209 | 0.442 ± 0.497 | 0.548 ± 0.795 | 0.240 ± 0.427 | 4.948 ± 3.220 | 0.270 ± 0.453 |
| C | SAC-D-resource | −36.139 ± 19.622 | 0.018 ± 0.133 | 0.414 ± 0.493 | 0.548 ± 0.510 | 0.178 ± 0.383 | 6.106 ± 2.841 | 0.432 ± 0.511 |
| C | Surprise | −35.151 ± 18.724 | 0.020 ± 0.140 | 0.440 ± 0.496 | 0.712 ± 0.897 | 0.212 ± 0.409 | 4.938 ± 2.883 | 0.362 ± 0.505 |
| C | RAISE | −33.436 ± 17.469 | 0.004 ± 0.063 | 0.452 ± 0.498 | 2.400 ± 1.498 | 0.090 ± 0.286 | 3.164 ± 2.508 | 0.320 ± 0.492 |
| C | Rule-based | −35.528 ± 24.032 | 0.070 ± 0.255 | 0.494 ± 0.500 | 0.656 ± 0.605 | 0.692 ± 0.770 | 0.342 ± 0.523 | 0.130 ± 0.617 |
| C | APF | −32.294 ± 23.037 | 0.054 ± 0.226 | 0.566 ± 0.496 | 0.564 ± 0.634 | 0.476 ± 0.634 | 0.168 ± 0.414 | 0.072 ± 0.428 |
| D | SAC-D | −36.274 ± 31.526 | 0.102 ± 0.303 | 0.472 ± 0.499 | 0.802 ± 0.887 | 0.184 ± 0.387 | 3.884 ± 3.145 | 0.258 ± 0.451 |
| D | SAC-D-resource | −28.079 ± 24.354 | 0.028 ± 0.165 | 0.582 ± 0.493 | 1.286 ± 1.399 | 0.186 ± 0.433 | 4.310 ± 3.222 | 0.222 ± 0.452 |
| D | Surprise | −29.642 ± 30.082 | 0.068 ± 0.252 | 0.596 ± 0.491 | 1.524 ± 1.327 | 0.096 ± 0.314 | 2.934 ± 2.987 | 0.234 ± 0.455 |
| D | RAISE | −21.486 ± 22.068 | 0.012 ± 0.109 | 0.696 ± 0.460 | 2.786 ± 1.609 | 0.048 ± 0.240 | 2.264 ± 2.401 | 0.178 ± 0.422 |
| D | Rule-based | −36.610 ± 29.325 | 0.102 ± 0.303 | 0.458 ± 0.498 | 0.702 ± 0.673 | 0.416 ± 0.539 | 0.126 ± 0.332 | 0.046 ± 0.219 |
| D | APF | −39.022 ± 32.126 | 0.144 ± 0.351 | 0.490 ± 0.500 | 0.528 ± 0.630 | 0.544 ± 0.782 | 0.252 ± 0.494 | 0.072 ± 0.333 |
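A note on reading the ± values in Table 10: for the binary metrics (collision rate and resolution success), the reported spreads appear consistent with the per-episode standard deviation of a 0/1 indicator, sqrt(p(1 − p)), rather than with variation across training seeds. This interpretation is an inference from the numbers, not something stated in this section, and the check below is only a sketch using a handful of entries copied from the table.

```python
import math


def bernoulli_std(p):
    """Standard deviation of a 0/1 outcome with success probability p."""
    return math.sqrt(p * (1.0 - p))


# (mean, reported spread) pairs copied from Table 10.
REPORTED = [
    (0.026, 0.159),  # RAISE, Scenario A, collision rate
    (0.058, 0.234),  # SAC-D, Scenario A, collision rate
    (0.144, 0.351),  # APF, Scenario A, collision rate
    (0.004, 0.063),  # RAISE, Scenario C, collision rate
    (0.742, 0.438),  # RAISE, Scenario A, resolution success
]

# True where the reported spread equals sqrt(p(1 - p)) to three decimals.
matches = [round(bernoulli_std(p), 3) == spread for p, spread in REPORTED]
```

If this reading is right, the large ± values next to small collision rates reflect the binary nature of the outcome and should not be taken as a sign of unstable policies.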
Table 11. Robustness evaluation under sensor noise and wind disturbances (mean over 4 scenarios).
| Environment Configuration | Agent | Reward | Collision Rate | Resolution Success | Alert | Reversal | Strengthening |
|---|---|---|---|---|---|---|---|
| Clean | SAC-D | −34.147 ± 27.647 | 0.065 ± 0.243 | 0.508 ± 0.499 | 0.682 ± 0.839 | 0.207 ± 0.404 | 4.514 ± 3.213 |
| Clean | Surprise | −29.300 ± 26.248 | 0.049 ± 0.213 | 0.597 ± 0.482 | 1.370 ± 1.238 | 0.143 ± 0.347 | 3.348 ± 2.950 |
| Clean | APF | −38.286 ± 28.781 | 0.109 ± 0.298 | 0.496 ± 0.497 | 0.499 ± 0.598 | 0.531 ± 0.705 | 0.246 ± 0.478 |
| Clean | RAISE | −25.532 ± 22.699 | 0.022 ± 0.144 | 0.637 ± 0.467 | 2.722 ± 1.632 | 0.076 ± 0.257 | 2.364 ± 2.316 |
| Noise-5% | SAC-D | −37.391 ± 29.217 | 0.091 ± 0.283 | 0.477 ± 0.498 | 0.725 ± 0.881 | 0.219 ± 0.411 | 4.359 ± 3.108 |
| Noise-5% | Surprise | −34.921 ± 28.015 | 0.081 ± 0.270 | 0.536 ± 0.493 | 1.726 ± 1.364 | 0.176 ± 0.389 | 3.139 ± 2.751 |
| Noise-5% | APF | −46.455 ± 34.101 | 0.225 ± 0.408 | 0.435 ± 0.493 | 0.672 ± 0.859 | 0.919 ± 1.005 | 0.513 ± 0.662 |
| Noise-5% | RAISE | −30.658 ± 25.052 | 0.044 ± 0.205 | 0.590 ± 0.485 | 3.086 ± 1.657 | 0.089 ± 0.280 | 2.216 ± 2.159 |
| Noise-10% | SAC-D | −44.322 ± 31.355 | 0.153 ± 0.355 | 0.403 ± 0.490 | 0.822 ± 0.953 | 0.272 ± 0.446 | 4.105 ± 2.815 |
| Noise-10% | Surprise | −42.701 ± 31.522 | 0.144 ± 0.350 | 0.453 ± 0.494 | 2.010 ± 1.445 | 0.204 ± 0.409 | 2.875 ± 2.437 |
| Noise-10% | APF | −53.139 ± 35.377 | 0.296 ± 0.445 | 0.344 ± 0.473 | 0.954 ± 1.155 | 1.167 ± 1.161 | 0.664 ± 0.706 |
| Noise-10% | RAISE | −41.192 ± 30.724 | 0.127 ± 0.331 | 0.471 ± 0.498 | 3.282 ± 1.607 | 0.122 ± 0.326 | 1.997 ± 1.774 |
| Wind | SAC-D | −37.437 ± 27.931 | 0.077 ± 0.262 | 0.467 ± 0.495 | 0.702 ± 0.847 | 0.203 ± 0.399 | 4.492 ± 3.222 |
| Wind | Surprise | −32.292 ± 27.005 | 0.053 ± 0.221 | 0.559 ± 0.486 | 1.378 ± 1.252 | 0.157 ± 0.364 | 3.285 ± 2.924 |
| Wind | APF | −39.627 ± 29.538 | 0.120 ± 0.311 | 0.490 ± 0.499 | 0.520 ± 0.607 | 0.541 ± 0.688 | 0.244 ± 0.473 |
| Wind | RAISE | −28.387 ± 23.407 | 0.020 ± 0.132 | 0.609 ± 0.470 | 2.787 ± 1.675 | 0.077 ± 0.263 | 2.325 ± 2.331 |
| Noise + Wind | SAC-D | −47.731 ± 32.431 | 0.173 ± 0.376 | 0.376 ± 0.483 | 0.872 ± 0.939 | 0.276 ± 0.443 | 3.992 ± 2.815 |
| Noise + Wind | Surprise | −44.602 ± 30.960 | 0.133 ± 0.337 | 0.414 ± 0.488 | 2.035 ± 1.474 | 0.180 ± 0.407 | 2.796 ± 2.450 |
| Noise + Wind | APF | −53.182 ± 35.617 | 0.303 ± 0.450 | 0.357 ± 0.473 | 0.970 ± 1.178 | 1.121 ± 1.111 | 0.647 ± 0.689 |
| Noise + Wind | RAISE | −42.580 ± 30.798 | 0.128 ± 0.333 | 0.440 ± 0.494 | 3.296 ± 1.640 | 0.121 ± 0.336 | 2.004 ± 1.805 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, Z.; Feng, Q.; Wang, Z.; Gao, X. Resource-Aware Surprise Reinforcement Learning for Collision Avoidance in Maritime UAV Encounters. Drones 2026, 10, 450. https://doi.org/10.3390/drones10060450

