1. Introduction
Ensuring safe separation in increasingly congested maritime airspace is a central challenge for modern offshore operations. With the rapid expansion of unmanned aerial vehicles (UAVs) deployed for maritime monitoring, infrastructure inspection, and search-and-rescue [1], the frequency of complex UAV encounters has risen sharply, exacerbating the risk of aerial collisions [2,3]. Operational experience reveals that while autonomous systems can successfully prevent most collisions, they often trigger excessive or continuous avoidance commands. A critical yet often overlooked issue is that frequent, nuisance, or oscillatory maneuver commands consume valuable and often limited maritime communication bandwidth. Furthermore, they disrupt the remote operator’s workflow and increase cognitive load, eventually desensitizing human supervisors, a phenomenon known as “trust erosion” in human-autonomy teaming [4,5,6]. In maritime environments where connectivity is unstable, executing continuous, high-frequency commands is highly impractical [7,8]. Therefore, the next generation of maritime UAV collision avoidance systems must satisfy a dual mandate: providing high safety assurance while maintaining “command precision”, that is, issuing avoidance maneuvers only when necessary to preserve communication resources and operator trust.
Current solutions, however, struggle to satisfy this dual mandate simultaneously. Traditional UAV collision avoidance methods [9,10,11] often rely on fixed geometric rules or reactive potential fields. While effective in simple scenarios, they are rigid and unable to adapt online to varying operational needs, such as communication constraints or a remote operator’s tolerance for frequent interventions [5,12]. Conversely, deep reinforcement learning (DRL) has emerged as a powerful alternative, capable of mapping continuous, high-dimensional states to maneuver commands without massive storage overhead [13,14]. Recent RL studies have further expanded UAV decision-making from value-based control to actor–critic, multi-agent, and safe/constrained learning frameworks, improving adaptability in navigation, control, resource scheduling, and cooperative UAV operations [15,16]. Safe/constrained RL also provides a principled way to incorporate safety or resource-related constraints into policy learning [17]. However, these advances still primarily optimize safety, path efficiency, or mission performance, while the operational cost of frequent maneuver advisories remains insufficiently modeled. Even though recent DRL agents have demonstrated excellent safety performance [18], they typically lack “operational awareness”. Existing UAV DRL research is predominantly safety-driven; without explicit constraints on the “cost” of maneuvering or transmitting commands, these agents often exhibit “jittery” or oscillatory behaviors to marginally increase safety buffers. This creates a critical gap: algorithms that are mathematically safe but operationally unacceptable for communication-constrained maritime environments.
To bridge the gap between algorithmic safety and operational suitability, we draw inspiration from resource-constrained reinforcement learning [19]. We propose a novel paradigm that conceptualizes the issuance of avoidance commands as a consumable “virtual resource”, representing the finite capacity of communication bandwidth and remote operator attention within an encounter episode. This abstraction forces the agent to treat every maneuver command as a “costly investment”, thereby naturally suppressing redundant or low-value actions.
However, simply penalizing maneuver commands can hinder the agent’s exploration during training, leading to overly conservative policies that fail to discover safe avoidance trajectories. To address this, we introduce Resource-Aware Intrinsic Surprise Exploration (RAISE). This framework unifies safety assurance and command economy by integrating synergistic mechanisms within the Soft Actor–Critic Discrete (SAC-D) architecture [20]. Specifically, to prevent the resource constraint from hindering learning, we employ a surprise-based intrinsic motivation module that utilizes an ensemble dynamics model to generate prediction errors [21], driving the agent to explore novel scenarios. This exploration is dynamically regulated by a resource-modulated control coefficient, which couples exploration intensity with the remaining command budget. Furthermore, to address the scale discrepancy between varying intrinsic signals and stable extrinsic rewards, we introduce an adaptive exponential moving average (EMA) scaling mechanism, ensuring consistent reward normalization and stable convergence.
By unifying safety assurance, communication efficiency, and operator burden reduction, RAISE provides a practical path toward operationally suitable AI collision avoidance for maritime UAVs. The main contributions of this work are summarized as follows:
Resource-aware decision formulation: Introduction of a virtual resource mechanism that regulates maneuver command frequency and intensity during both training and deployment.
Resource-aware surprise exploration with adaptive scaling: An ensemble-based prediction-error signal is used for novelty estimation, while its influence is adaptively scaled and modulated by the remaining advisory resource.
Unified SAC-D integration: Seamless incorporation of resource modulation and intrinsic exploration into a standard actor–critic framework.
Comprehensive evaluation: Experiments across resource-constrained, moderate, and resource-rich conditions demonstrate that RAISE improves advisory stability by reducing strengthening and reversal behaviors while maintaining reliable collision-avoidance performance and stable training convergence.
3. Problem Formulation
3.1. UAV Encounter Dynamics
In this study, the UAV encounter scenario is formulated as a strategic vertical resolution advisory problem for medium-to-large fixed-wing UAVs conducting beyond visual line of sight (BVLOS) maritime missions. The purpose of the model is not to generate full three-dimensional trajectories, but to determine whether sufficient vertical separation can be achieved within the remaining horizontal encounter time. Accordingly, the horizontal encounter geometry is represented by the time-to-conflict variable τt, while the learning problem focuses on vertical separation and vertical advisory generation.
As illustrated in Figure 1, the encounter geometry is characterized by the relative altitude between the ego-UAV and the intruding UAV, alongside the time to the closest point of approach (CPA) in the horizontal direction. The horizontal geometry is thereby parameterized by the time-to-conflict variable τt, which represents the remaining time until the UAVs breach the minimum horizontal safety threshold. It is computed from Dt, the current horizontal distance, and vrel, the locally estimated relative horizontal speed over the short encounter horizon. Thus, horizontal motion is retained as a temporal constraint on the vertical resolution process, rather than being explicitly controlled as an additional maneuver dimension.
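The role of τt as a countdown can be illustrated with a minimal sketch. It assumes the simplest form consistent with the description above, namely the distance still to be closed divided by the closing speed; the function name, the `d_safe` parameter, and the handling of non-closing geometry are illustrative assumptions rather than the authors' definition.

```python
def time_to_conflict(d_t, v_rel, d_safe=150.0):
    """Remaining time (s) before horizontal separation reaches d_safe.

    d_t    : current horizontal distance (m)
    v_rel  : relative horizontal closing speed (m/s)
    d_safe : horizontal safety threshold (m); 150 m is the example value in the text
    """
    if d_t <= d_safe:
        return 0.0           # threshold already reached
    if v_rel <= 0.0:
        return float("inf")  # not closing: no conflict on this geometry (assumed convention)
    return (d_t - d_safe) / v_rel
```

For example, two UAVs 1150 m apart closing at 50 m/s would have 20 s left in which a vertical advisory can still build separation.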
Under this abstraction, the relative motion of the two UAVs is described by a compact set of vertical variables, including the relative altitude and the respective vertical velocities of the ego-UAV and the intruder. This formulation provides sufficient information for evaluating collision risk and generating precise avoidance maneuver alerts without explicitly representing lateral motion.
3.2. UAV Dynamic Model
It should be noted that the dynamic model used in this study is an advisory-level vertical kinematic response model, rather than a full six-degree-of-freedom UAV flight dynamics model. The objective of this model is to describe how high-level climb, descent, and no-command advisories affect the vertical separation between the ego-UAV and the intruding UAV during a short encounter window. Therefore, attitude dynamics, aerodynamic forces, actuator dynamics, and low-level flight-control loops are not explicitly modeled.
Given the two-dimensional encounter abstraction introduced above, the horizontal interaction between UAVs is fully captured by the time-to-conflict variable τt. Therefore, the collision-avoidance dynamics can be modeled exclusively in the vertical dimension, consistent with altitude-based deconfliction strategies commonly used in UAV operations, where vertical maneuvers can rapidly establish separation within a limited encounter time [42]. This formulation is suitable for the considered fixed-wing maritime encounter scenario because the key decision variable is whether the ego-UAV can establish adequate vertical separation before the horizontal encounter time expires. In this sense, τt acts as a countdown variable that links horizontal closure with the urgency of vertical maneuvering. Under this formulation, both UAVs follow one-dimensional vertical dynamics updated at a discrete frequency of 1 Hz. This frequency reflects the high-level strategic decision rate suitable for maritime environments and accommodates the severe bandwidth limitations of long-range UAV telemetry. At each time step t, the environment state is represented as:
Here, ht denotes the vertical relative altitude between the intruding UAV and the ego-UAV, defined as ht = hint,t − hego,t. The two vertical-rate variables represent the vertical velocities of the ego-UAV and intruding UAV, respectively. The variable τt indicates the estimated time to the closest point of approach (CPA), assuming a constant relative horizontal speed; that is, the remaining time before the horizontal separation between the UAVs decreases below a predefined maritime safety threshold (e.g., 150 m). For medium-to-large fixed-wing UAVs executing BVLOS maritime missions, this relatively conservative separation minimum is necessary because it accounts for the high cruising speeds, severe offshore wind disturbances, GPS inaccuracies, and inherent latency in satellite or long-range communications.
To capture the temporal mismatch between decision issuance and execution in real maritime operations, the model introduces an action execution delay mechanism. Specifically, the action applied at time step t corresponds to the maneuver command generated at the previous step t − 1. To support this mechanism, the state representation includes the previous command index aprev, which determines the acceleration applied at time t. This design models realistic latency effects (such as communication delays or mechanical response lags), thereby improving the fidelity and temporal continuity of the simulated trajectories. The transition to the next time step t + 1 follows discrete motion equations that update the altitude and vertical velocity of both UAVs according to their respective accelerations, as described in Equation (3).
Here, the two acceleration terms denote the vertical accelerations of the ego-UAV and the intruding UAV at time step t, respectively. For the ego-UAV, the applied acceleration is directly determined by the maneuver command issued by the collision avoidance system. The acceleration values in Equation (3) should be interpreted as commanded vertical acceleration responses associated with discrete advisories, rather than direct actuator-level control inputs. This representation is appropriate for evaluating whether the advisory policy can generate timely and physically bounded vertical separation during the encounter. The intruding UAV, in contrast, is modeled using a goal-directed vertical response that drives it toward the ego-UAV’s altitude within the remaining encounter time, as described in Equation (4). This design is not intended to represent all possible maritime UAV traffic behaviors. Rather, it provides a controlled conflict-generation mechanism for constructing repeatable encounter cases, so that the learned policy can be evaluated under clear vertical conflict pressure. At each step, the intruding UAV estimates a target vertical velocity that would allow it to reach the ego-UAV’s altitude by the predicted time-to-CPA under the current kinematic conditions.
On this basis, the intruding UAV’s vertical acceleration is determined by the difference between its current and target vertical velocities, subject to a predefined maximum acceleration limit. To ensure physically feasible motion and stable trajectory evolution, the acceleration is constrained within the allowable range using a clipping function, as shown in Equation (5). Similarly, the vertical velocity is bounded by a maximum value, preventing unrealistic climb or descent rates and maintaining smooth motion continuity throughout the encounter. In this setting, the intruder behavior serves as a stress-case model for policy training and evaluation. More diverse intruder behaviors, such as route-following, cooperative, non-cooperative, and stochastic traffic models, can be investigated in future extensions of the proposed resource-aware advisory framework.
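The one-step transition described in this subsection can be sketched as follows. This is a hedged illustration, not the authors' Equations (3)–(5): the Euler-style update, the limits `A_MAX` and `V_MAX`, and the names `Encounter`, `intruder_accel`, and `step` are all assumptions introduced here. The sketch does reflect the three stated ingredients: the one-step command delay via `a_prev`, the goal-directed intruder, and clipping of acceleration and vertical rate.

```python
from dataclasses import dataclass

DT = 1.0      # s; 1 Hz update rate stated in the text
A_MAX = 2.0   # m/s^2; hypothetical intruder acceleration limit
V_MAX = 10.0  # m/s; hypothetical vertical-rate limit


def clip(x, lo, hi):
    return max(lo, min(hi, x))


@dataclass
class Encounter:
    h: float       # relative altitude h_int - h_ego (m)
    v_ego: float   # ego vertical rate (m/s)
    v_int: float   # intruder vertical rate (m/s)
    tau: float     # time to loss of horizontal separation (s)
    a_prev: float  # ego acceleration commanded at the previous step (m/s^2)


def intruder_accel(s):
    # Target rate that would bring the intruder to the ego altitude by CPA,
    # assuming the ego keeps its current vertical rate.
    v_target = s.v_ego - s.h / max(s.tau, DT)
    return clip((v_target - s.v_int) / DT, -A_MAX, A_MAX)


def step(s, a_cmd):
    """Advance one step; a_cmd takes effect only at the NEXT step (delay)."""
    a_ego = s.a_prev                      # delayed command is applied now
    a_int = intruder_accel(s)
    v_ego = clip(s.v_ego + a_ego * DT, -V_MAX, V_MAX)
    v_int = clip(s.v_int + a_int * DT, -V_MAX, V_MAX)
    h = s.h + (v_int - v_ego) * DT        # simple Euler-style altitude update
    return Encounter(h, v_ego, v_int, max(s.tau - DT, 0.0), a_cmd)
```

Starting from a 100 m offset with 20 s remaining, a climb command issued at the first step leaves the ego rate unchanged until the second step, while the intruder immediately begins closing the gap at its acceleration limit.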
3.3. Resource-Aware Collision Avoidance Decision Framework
3.3.1. Resource-Aware Decision Framework
In conventional collision avoidance systems, advisories are issued solely based on safety requirements, without accounting for operator workload or alert fatigue. However, in real-world UAV operations, frequent or redundant advisories can overload the remote operators and reduce trust in the automation system.
To address this issue, we introduce a resource-aware decision framework, in which a virtual resource variable represents the remaining command-resource budget available for issuing collision-avoidance advisories during an encounter. This variable is not intended to directly measure physical communication bandwidth, link latency, or human cognitive workload. Instead, it is a decision-level proxy for advisory burden, reflecting the operational cost associated with frequent, continued, strengthened, or reversed maneuver advisories. The cost values are designed to distinguish different levels of command burden at the advisory-transition level, rather than to convert advisories into exact units of bandwidth or operator workload. Specifically, issuing a new or continued alert consumes 3 units, a strengthening advisory consumes 2 units, and a reversal advisory consumes 5 units. The reversal cost is assigned the largest value because switching between climb and descent advisories may lead to oscillatory guidance and represents a high-burden advisory transition from the perspective of supervisory control. At each step, the resource state is reduced by the cost of the advisory transition that is issued, with the cost function taking the values listed above.
If the remaining resource is insufficient to support the requested advisory type, the action is suppressed and replaced by NOC. This mechanism enforces a finite alert budget and prevents the policy from relying on frequent high-cost interventions.
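A minimal sketch of this budget logic is given below. The cost values are those stated in the text; the zero cost for issuing no advisory, the choice to leave the budget unchanged when an advisory is suppressed, and the names `COST` and `apply_budget` are assumptions made for illustration. Classifying a transition as new, continued, strengthened, or reversed depends on the advisory table and is taken as an input here.

```python
# Cost per advisory-transition type (units from the text; "none" is assumed to be free).
COST = {"new": 3, "continue": 3, "strengthen": 2, "reverse": 5, "none": 0}


def apply_budget(action, transition, resource):
    """Return (executed_action, remaining_resource).

    If the budget cannot cover the requested transition, the advisory is
    suppressed and replaced by NOC, and the budget is left unchanged (assumed).
    """
    cost = COST[transition]
    if cost > resource:
        return "NOC", resource
    return action, resource - cost
```

With 10 units left, a new climb advisory is executed and leaves 7 units, whereas a reversal requested with only 4 units left is replaced by NOC.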
A resource coefficient ρt ∈ [0, 1] is further introduced to regulate the influence of this limited resource on the decision-making process. When the available resource decreases, the coefficient proportionally reduces the intensity or frequency of advisories, encouraging more conservative and resource-efficient behavior. In essence, this mechanism allows the system to adapt its decision policy according to the current alert capacity, ensuring both operational safety and operator acceptance.
3.3.2. State Space
The state space for the collision avoidance task consists of six variables, as summarized in Table 1. These variables capture the essential kinematic and decision-related information required by the reinforcement learning agent to assess encounter risk and determine appropriate advisories. The first three variables describe the vertical geometry of the encounter: the relative altitude between the intruding UAV and ego-UAV (ht), the vertical rate of the ego-UAV, and the vertical rate of the intruding UAV. The fourth variable, time to loss of horizontal separation (τt), characterizes the remaining time until the two UAVs reach the minimum allowed horizontal distance (typically 150 m), effectively representing the horizontal encounter geometry. The fifth variable, previous advisory (aprev), encodes the advisory issued at the previous time step, which helps the agent penalize unnecessary command reversals or escalations in advisory intensity, thereby maintaining temporal consistency and preserving the Markov property.
Finally, a new variable, the resource level, is introduced to represent the remaining alert resource available to the system at time t. This variable reflects the agent’s alert budget, influencing how aggressively it can issue future advisories. By including it in the observation space, the system becomes aware of its operational constraints, enabling adaptive behavior that balances collision safety with alert economy.
3.3.3. Action Space
The action space consists of seven discrete advisories: NOC, DES-N, CLB-N, DES-T, CLB-T, DES-E, and CLB-E, as listed in Table 2. Here, N, T, and E denote normal, transitional, and escalated advisories, respectively. Terms such as reversal or strengthening describe transition types between consecutive advisories and are not separate action labels. Each advisory, except NOC (no conflict), instructs the ego-UAV to achieve or maintain a specific vertical rate, corresponding to a designated vertical acceleration. The NOC action indicates that no immediate collision threat exists, allowing the UAV to maintain its nominal trajectory, and can be issued at any time.
Table 2 also defines the availability rules between advisories, ensuring physically consistent and operationally feasible trajectory transitions. For instance, NOC may precede any other advisory, while normal descent (DES-N) and climb (CLB-N) can only be initiated from NOC. Transitional or strengthened advisories (DES-T, CLB-T, DES-E, CLB-E) can only follow compatible preceding advisories, ensuring smooth kinematic transitions and avoiding abrupt or contradictory guidance. This discrete action design ensures that the agent’s policy remains interpretable and aligned with standard deterministic state-machine constraints, while still allowing flexibility for optimization under resource-aware and learning-based settings.
3.3.4. Reward Shaping
To ensure both safety and operational efficiency in the collision avoidance process, the reward function is designed to balance three competing objectives: (1) preventing near mid-air collisions (safety), (2) minimizing unnecessary alerts (remote operator workload), and (3) maintaining acceptable altitude deviations (flight stability). The overall reward consists of two main components: terminal altitude penalty and alert and altitude management. The main weighting parameters used in the extrinsic reward are summarized in Table 3 before the individual reward terms are introduced.
The parameters follow a safety-priority hierarchy. The conflict penalty ωcol = 100 has the largest magnitude, ensuring that collision avoidance remains the dominant objective. Advisory-management penalties are deliberately smaller, including ωalert = 0.3, ωstr = 0.3, ωrev = 0.5, and ωcross = 0.5. Therefore, advisory economy shapes the policy only after the primary safety objective has been prioritized.
- 1. Terminal Altitude Penalty
This component evaluates the final vertical separation between the ego-UAV and the intruder when the time to loss of horizontal separation reaches zero, that is, at the critical moment of closest horizontal proximity. It penalizes unsafe altitude configurations such as loss of separation or excessive climb/descent, reinforcing safe and stable avoidance maneuvers. The overall final altitude penalty is composed of three terms. The first penalizes any case where the relative altitude Δh between the two UAVs falls below the collision threshold hmin. The second penalizes moderate deviations from the safe altitude range hsafe. The third applies a saturation penalty when the UAVs exceed the maximum permitted altitude deviation, with γalt denoting the upper penalty bound. In this study, γalt = 25, which serves as the saturation bound for excessive-altitude penalties. This value is larger than the advisory-related penalties but smaller than the conflict penalty, preserving the safety-priority hierarchy. The bounded form prevents extremely large altitude deviations from causing unbounded reward magnitudes and improves critic stability during training.
- 2. Alert and Altitude Management
This component reflects the dynamic control efficiency of the avoidance policy by encouraging timely yet minimal advisory usage while maintaining altitude constraints. The corresponding reward term combines three parts. The altitude-limit part penalizes ego-UAV altitude exceeding the predefined upper limit hmax. The advisory part uses Ialert, Istr, Irev, and Icross, which are binary indicators for active advisory issuance, advisory strengthening, advisory reversal, and vertical path crossing, respectively. The exponential factor exp(τt − T) increases the penalty magnitude as the encounter approaches CPA, discouraging late-stage unnecessary advisories and inconsistent command changes near the conflict point. The clearance part provides a small positive reward when the conflict is successfully cleared and the system transitions to a no-conflict (NOC) state; here, INOC indicates that the conflict has been cleared and the system returns to the no-conflict advisory state.
4. UAV Collision Avoidance Based on Resource-Aware Intrinsic Surprise Exploration
This section introduces the proposed collision-avoidance decision-making framework based on deep reinforcement learning. The method extends the standard Soft Actor–Critic (SAC) algorithm by incorporating a surprise-based intrinsic exploration term and a resource-aware modulation mechanism. The overall objective is to improve collision-avoidance efficiency while maintaining an alert economy and reducing unnecessary advisories.
4.1. SAC Algorithm for Collision Avoidance
To address the sequential decision-making problem of UAV collision avoidance under complex dynamic conditions, this study adopts the Soft Actor–Critic (SAC) algorithm as the fundamental learning framework [14]. SAC is an off-policy, entropy-regularized actor–critic method that optimizes both task performance and policy stochasticity, thereby achieving a balance between exploitation and exploration during training. In the context of collision avoidance, the agent must issue timely and stable vertical advisories to maintain safe separation while minimizing unnecessary command reversals or oscillations that could confuse the remote operator.
The optimization objective of SAC is formulated as a maximum entropy reinforcement learning problem, where the policy aims not only to maximize the expected cumulative reward but also to preserve sufficient action entropy for exploration:
where rt denotes the extrinsic reward at time t, and the entropy term represents the policy entropy. The temperature parameter α > 0 regulates the trade-off between maximizing return and maintaining policy diversity.
The SAC framework is composed of three primary components: two Q-value networks, Qθ1(s, a) and Qθ2(s, a), a stochastic policy network πϕ(a∣s), and a target critic used for stabilization. The Q-networks are trained to minimize the soft Bellman residual:
where the soft target value yt is computed as:
This clipped double-Q formulation reduces overestimation bias and improves the stability of learning. The policy network is optimized by minimizing the Kullback–Leibler divergence between the current policy and the soft Q-function-induced Boltzmann distribution:
where at is sampled from πϕ(at∣st) and D denotes the replay buffer.
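For reference, the quantities discussed in this subsection are usually written in the SAC literature in the following form. This is the standard textbook formulation rather than a reproduction of the authors' numbered equations; the discount factor γ and the target-network parameters (denoted by a bar) are symbols introduced here, and terminal-state handling is omitted.

```latex
\begin{align*}
J(\pi) &= \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}
  \big[\, r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big] \\
L_Q(\theta_i) &= \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}
  \Big[ \tfrac{1}{2} \big( Q_{\theta_i}(s_t, a_t) - y_t \big)^2 \Big],
  \qquad i \in \{1, 2\} \\
y_t &= r_t + \gamma\, \mathbb{E}_{a' \sim \pi_\phi(\cdot \mid s_{t+1})}
  \Big[ \min_{i=1,2} Q_{\bar{\theta}_i}(s_{t+1}, a')
        - \alpha \log \pi_\phi(a' \mid s_{t+1}) \Big] \\
J_\pi(\phi) &= \mathbb{E}_{s_t \sim D,\; a_t \sim \pi_\phi}
  \Big[ \alpha \log \pi_\phi(a_t \mid s_t)
        - \min_{i=1,2} Q_{\theta_i}(s_t, a_t) \Big]
\end{align*}
```

The minimum over the two critics in the target is what the text refers to as the clipped double-Q formulation.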
Since the action space in the collision avoidance task is discrete, we employ SAC-Discrete, an adaptation of the Soft Actor–Critic (SAC) framework for discrete control problems [20]. This formulation preserves the entropy-regularized objective and the dual Q-network structure of the original SAC while incorporating dedicated modifications to the policy representation, Q-function estimation, and policy update scheme. These adaptations enable SAC-D to perform stable and efficient policy optimization under the maximum-entropy reinforcement learning paradigm in discrete action domains.
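With a discrete action set, the expectation over next actions in the soft target can be computed exactly as a weighted sum over the policy's action probabilities instead of being sampled. The sketch below shows this for a single transition using plain Python lists; the function names, the `done` flag, and the discount `gamma` are illustrative assumptions and are not taken from the paper.

```python
import math


def soft_state_value(pi, q1, q2, alpha):
    """V(s) = sum_a pi(a|s) * (min(Q1, Q2)(s, a) - alpha * log pi(a|s))."""
    v = 0.0
    for p, a1, a2 in zip(pi, q1, q2):
        if p > 0.0:  # zero-probability actions contribute nothing
            v += p * (min(a1, a2) - alpha * math.log(p))
    return v


def soft_target(r, done, gamma, pi_next, q1_next, q2_next, alpha):
    """Soft Bellman target for one transition (done is 0.0 or 1.0)."""
    v_next = soft_state_value(pi_next, q1_next, q2_next, alpha)
    return r + gamma * (1.0 - done) * v_next
```

In the seven-advisory setting, `pi`, `q1`, and `q2` would each have seven entries, one per advisory.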
4.2. Quantifying Novelty via Surprise-Based Intrinsic Reward
In reinforcement learning for UAV collision avoidance, the external reward primarily captures high-level safety objectives, such as maintaining vertical separation or preventing near mid-air collisions. However, this extrinsic feedback provides limited guidance in early-stage exploration, as it is sparse and only weakly correlated with intermediate advisory decisions. Consequently, the agent may struggle to efficiently explore novel yet safety-critical encounter situations. To address this limitation, a surprise-based intrinsic motivation mechanism is introduced, serving as a quantitative measure of state-transition novelty and guiding exploration toward poorly understood regions of the dynamics [21]. The intuition is that transitions that are difficult for the model to predict are likely underrepresented in the agent’s knowledge and therefore should receive additional reward incentives.
To quantify transition novelty, we train an ensemble of probabilistic forward dynamics models. Given the current state–action pair (st, at), each ensemble member predicts a diagonal Gaussian distribution over the next state:
where N denotes the ensemble size. The dynamics ensemble is trained by minimizing the negative log-likelihood (NLL) of the observed transition samples:
Here, st, at, and st+1 denote the current state, action, and observed next state, respectively. D denotes the replay buffer containing sampled transitions. N is the number of ensemble dynamics models. The i-th dynamics model is parameterized by θi, and its predictive distribution is characterized by the predicted mean and variance of the next state.
Equivalently, up to constants independent of the model parameters, this objective can be expressed as:
In the expanded form, d is the dimension of the state vector, and the superscript j denotes the j-th state dimension. A large prediction error or high model variance indicates that the transition deviates from the learned dynamics and is therefore “surprising”. The intrinsic reward is then defined as the ensemble-averaged negative log-likelihood of the actually observed next state:
For the diagonal Gaussian predictive distribution, Equation (23) can be expanded as:
Thus, the intrinsic reward measures the sample-wise surprisal of the observed transition under the learned dynamics model. A transition that is poorly predicted by the ensemble receives a larger intrinsic reward, encouraging the agent to explore dynamically uncertain or insufficiently modeled regions of the state-action space.
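The ensemble-averaged NLL described above can be sketched in a few lines. The sketch assumes each ensemble member has already produced a per-dimension mean and variance for the next state; the function names and the list-of-pairs interface are hypothetical, and the constant term of the Gaussian log-density is kept for completeness.

```python
import math


def gaussian_nll(x, mean, var):
    """NLL of x under a diagonal Gaussian with per-dimension mean and variance."""
    return 0.5 * sum(
        math.log(2.0 * math.pi * v) + (xj - m) ** 2 / v
        for xj, m, v in zip(x, mean, var)
    )


def intrinsic_reward(next_state, predictions):
    """Average NLL of the observed next state over the ensemble.

    predictions: list of (mean, var) pairs, one per ensemble member.
    """
    total = sum(gaussian_nll(next_state, m, v) for m, v in predictions)
    return total / len(predictions)
```

A next state far from every member's predicted mean yields a larger value than one close to the predictions, which is the sense in which the signal rewards poorly modeled transitions.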
By integrating this surprise-driven reward into the actor–critic framework, the learning process becomes more adaptive to the dynamic collision scenarios. The agent can identify previously unseen altitude separation patterns, encounter configurations, and conflict-resolution dynamics, thereby refining its advisory policy in a data-efficient and safety-oriented manner.
It should be noted that Equation (24) is implemented as a sample-wise NLL-based surprisal measure, rather than as a directly computed KL divergence. Let p denote the true transition distribution of the environment and qθ denote the learned predictive distribution. Although p is unknown and is not explicitly estimated in the implementation, taking the expectation of the NLL under the true transition distribution gives the cross entropy between p and qθ:
The cross entropy can be decomposed as:
Since H(p) is independent of the model parameters, reducing the expected NLL is equivalent to reducing the KL divergence between the true transition dynamics and the learned predictive dynamics up to a constant. Therefore, the proposed intrinsic reward should be interpreted as a sample-wise NLL-based proxy for transition novelty, with a theoretical connection to KL divergence in expectation.
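The argument rests on the standard identity linking cross entropy, entropy, and KL divergence, written out below with the conditioning on (st, at) made explicit. The notation follows the symbols p and qθ used in the text; the arguments of the distributions are spelled out here for clarity and are an assumption about the authors' exact notation.

```latex
\begin{align*}
\mathbb{E}_{s_{t+1} \sim p}\big[ -\log q_\theta(s_{t+1} \mid s_t, a_t) \big]
  &= H(p, q_\theta) \\
  &= \mathbb{E}_{p}\big[ -\log p(s_{t+1} \mid s_t, a_t) \big]
   + \mathbb{E}_{p}\!\left[ \log \frac{p(s_{t+1} \mid s_t, a_t)}
                                      {q_\theta(s_{t+1} \mid s_t, a_t)} \right] \\
  &= H(p) + D_{\mathrm{KL}}\big( p \,\|\, q_\theta \big)
\end{align*}
```

Because the KL term is non-negative and H(p) does not depend on θ, the expected NLL is bounded below by H(p) and attains that bound only when qθ matches p.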
4.3. Resource-Aware Intrinsic Surprise Exploration
In practical UAV collision avoidance tasks, excessive exploratory actions or redundant alerts can increase remote operator workload and reduce system trustworthiness. To further examine this problem, we trained collision avoidance agents using state-of-the-art reinforcement learning methods, including the baseline SAC-D and the surprise-based intrinsic exploration model, under unconstrained resource conditions. The training statistics summarized in Figure 2 and Table 4 reveal that both approaches tend to generate an excessive number of advisories (on average more than ten alerts per episode), with frequent strengthening and reversal actions. These results indicate that while intrinsic-exploration-based learning improves policy responsiveness, it also amplifies advisory activity, leading to unstable and operationally inefficient behavior. The challenge, therefore, lies in achieving a better balance between effective exploration and practical alert management. To address this issue, a resource-aware intrinsic reward mechanism is introduced, guiding the learning process to adapt exploration intensity according to the available alert resources. This design enables the agent to maintain efficient policy learning while avoiding unnecessary advisory activations under resource-limited conditions.
The key component of this mechanism is the resource coefficient ρt, which quantifies the remaining decision resource or allowable alert capacity at time step t. The coefficient is computed as a normalized function of the resource state embedded in the observation vector:
where the current resource level is normalized by the predefined maximum rmax, and β > 1 controls the sensitivity of the reduction. When the available resource decreases, ρt proportionally suppresses the influence of intrinsic exploration, encouraging the policy to focus on task-relevant, low-cost advisories. This mechanism reflects the intuition that the collision-avoidance system should be more cautious and less exploratory when its “alert budget” becomes constrained.
Because the numerical magnitudes of the external and intrinsic rewards may differ significantly and may vary during training, an adaptive scaling coefficient is introduced:
The batch-level reward magnitude estimates are defined as:
Here, Sext and Sint denote the batch-level magnitude estimates of the external and intrinsic rewards, respectively, and B is the batch size. EMA(⋅) denotes the exponential moving average (EMA) used to estimate the running reward scale, and ϵ is a small positive constant for numerical stability.
The EMA operation provides a running estimate of the external and intrinsic reward scales and prevents the adaptive coefficient from reacting excessively to high-variance batch-level fluctuations. Therefore, ηt maintains a stable relative contribution of the intrinsic reward throughout training.
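One way such an EMA-based coefficient could be realized is sketched below. The specific form, in which the coefficient equals η0·λtar times the ratio of the running external scale to the running intrinsic scale, is an assumption chosen to be consistent with the description; so are the use of the mean absolute reward as the batch-level magnitude, the decay value, and the class name `AdaptiveScale`. The defaults for `eta0` and `lam_tar` are the values reported in this section.

```python
class AdaptiveScale:
    """Running estimate of reward scales and the resulting coefficient eta_t."""

    def __init__(self, eta0=0.3, lam_tar=0.6, decay=0.99, eps=1e-8):
        self.eta0, self.lam_tar = eta0, lam_tar
        self.decay, self.eps = decay, eps  # decay and eps are hypothetical values
        self.s_ext = None                  # EMA of external reward magnitude
        self.s_int = None                  # EMA of intrinsic reward magnitude

    def _ema(self, old, new):
        # The first batch initializes the estimate; later batches are blended in.
        return new if old is None else self.decay * old + (1.0 - self.decay) * new

    def update(self, r_ext_batch, r_int_batch):
        b_ext = sum(abs(r) for r in r_ext_batch) / len(r_ext_batch)
        b_int = sum(abs(r) for r in r_int_batch) / len(r_int_batch)
        self.s_ext = self._ema(self.s_ext, b_ext)
        self.s_int = self._ema(self.s_int, b_int)
        return self.eta0 * self.lam_tar * self.s_ext / (self.s_int + self.eps)
```

Under this form, multiplying the intrinsic reward by the returned coefficient brings its typical magnitude to roughly η0·λtar times the external reward scale, and a single noisy batch shifts the estimate only by a fraction (1 − decay).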
The target ratio λtar controls the desired contribution of the intrinsic reward relative to the external reward scale. It should be emphasized that λtar is not the final intrinsic reward ratio by itself. Instead, η0λtar determines the pre-gating target scale, while the actual intrinsic contribution in the final reward is also modulated by the resource-aware coefficient ρt. Combining both mechanisms, the overall reward shaping function for each transition is expressed as:
where the intrinsic reward is derived from the prediction-error-based surprise model described in Section 4.2. Thus, the actual intrinsic reward term added to the external reward is the surprise signal scaled by both ηt and ρt. When the remaining resource is sufficient, ρt ≈ 1, and the intrinsic reward is allowed to contribute according to the target scale controlled by ηt. When the remaining resource becomes scarce, ρt → 0, and the intrinsic reward is progressively suppressed regardless of its surprise value.
In our study, η0 = 0.3 and λtar = 0.6, giving a pre-gating target scale of η0λtar = 0.18. Therefore, when the resource gate is fully open, the surprise-based intrinsic reward is encouraged to remain approximately within 10–20% of the external reward scale. Under resource-limited conditions, the final contribution is further reduced by ρt, which prevents exploration bonuses from overpowering the task-specific objectives related to safety, collision avoidance, and resource conservation.
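The arithmetic behind these figures can be written out as a short worked check. It assumes that ηt takes the ratio form η0λtar S̄ext/(S̄int + ϵ), where S̄ext and S̄int denote the EMA-smoothed reward magnitudes; this notation and form are assumptions consistent with the definitions above rather than the exact expressions of this work.

```latex
\begin{align*}
\eta_0 \lambda_{\mathrm{tar}} &= 0.3 \times 0.6 = 0.18, \\
\eta_t \, \bar{S}_{\mathrm{int}} &\approx \eta_0 \lambda_{\mathrm{tar}} \, \bar{S}_{\mathrm{ext}}
  = 0.18 \, \bar{S}_{\mathrm{ext}} \qquad (\rho_t = 1), \\
\rho_t \, \eta_t \, \bar{S}_{\mathrm{int}} &\approx 0.18 \, \rho_t \, \bar{S}_{\mathrm{ext}}
  \le 0.18 \, \bar{S}_{\mathrm{ext}} \qquad (0 \le \rho_t \le 1).
\end{align*}
```

In other words, 18% of the external reward scale is the ceiling reached only when the gate is fully open; any reduction in the remaining resource lowers the effective share in proportion to ρt.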
The resulting algorithm, termed RAISE, effectively integrates adaptive exploration with resource-constrained decision-making, as illustrated in
Figure 3 and summarized in Algorithm 1. This design ensures that the intrinsic reward serves as an auxiliary exploration signal whose magnitude is jointly controlled by EMA normalization and the resource coefficient. It enables the agent to balance two key objectives: ensuring safety through effective conflict-resolution advisories and reducing unnecessary or high-cost alerts under limited resource conditions. This mechanism provides a robust and data-efficient learning framework that promotes stable policy optimization in dynamic encounter scenarios.
Algorithm 1: RAISE (Resource-Aware Intrinsic Surprise Exploration)
Input: environment, policy πθ, ensemble model fϕ, replay buffer, base coefficient η0, target ratio λtar, resource limit rmax
Initialize: EMA statistics
1: for training step t = 1, 2, … do
2:   Observe state st and the current resource level
3:   Select an action from πθ and execute it in the environment
4:   Obtain next state st+1 and external reward rt
5:   Store the transition in the replay buffer
7:   if update step is due then
8:     Sample a mini-batch {(s, a, r, s’)} from the replay buffer
9:     Compute the intrinsic reward
10–11: Update the EMA statistics of the external and intrinsic reward magnitudes
12:    Compute the adaptive coefficient ηt
13:    Compute the resource weight ρt
14:    Form the shaped reward
15:    Update critic Qψ and actor πθ using SAC with the shaped reward
16:    Update model fϕ by minimizing the negative log-likelihood loss
17:  end if
18: end for
Output: Trained policy parameters θ
5. Experimental Simulation and Result Analysis
5.1. Experimental Setup
To evaluate the effectiveness of the proposed RAISE algorithm, comparative experiments were conducted against three baselines: SAC-D, Surprise, and SAC-D-resource. All experiments were performed in a simulated two-UAV vertical encounter environment, where the ego-UAV executes vertical resolution advisories while the intruder follows a goal-directed approach strategy. The simulation runs in discrete time steps of 1 s, consistent with high-level tactical decision-making frequencies and typical surveillance update rates (e.g., ADS-B) for autonomous UAV frameworks.
The proposed RAISE algorithm extends the SAC-Discrete framework by incorporating an ensemble dynamics model for surprise-based intrinsic motivation and a resource-aware exploration coefficient that adapts the intensity of intrinsic rewards according to the remaining alert resource. This design allows the agent to maintain efficient exploration while reducing redundant or high-cost advisories under limited resources.
All algorithms were trained under the same conditions for 1.6 × 10^6 environment steps (approximately 1500 iterations). The main experimental configuration used a resource limit of rmax = 20, representing a moderate operational alert capacity. Additional analyses in Section 5.4 examine the effects of constrained (rmax = 15) and abundant (rmax = 25) resource settings. Replay buffer sizes, batch sizes, and learning rates were tuned for stable performance across models. For all methods, the entropy temperature α was automatically adjusted to balance exploration and exploitation, and target networks were updated using a soft-update rate of τ = 0.005. The detailed hyperparameter configuration is summarized in Table 5.
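For reference, the soft target update mentioned above is standard Polyak averaging, sketched here on plain Python lists so the example runs without a deep learning framework. The function name and the dictionary-of-lists parameter layout are illustrative assumptions; an actual implementation would apply the same rule to network tensors.

```python
from typing import Dict, List


def soft_update(
    target: Dict[str, List[float]],
    online: Dict[str, List[float]],
    tau: float = 0.005,
) -> None:
    """Polyak averaging, in place: target <- (1 - tau) * target + tau * online."""
    for name, online_values in online.items():
        target[name] = [
            (1.0 - tau) * t + tau * o
            for t, o in zip(target[name], online_values)
        ]
```

With τ = 0.005, each update moves the target parameters only 0.5% of the way toward the online parameters, which keeps the bootstrapped critic targets slowly varying.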
5.2. Algorithm Performance Evaluation Metrics
To comprehensively assess the performance of RL models in longitudinal UAV collision avoidance tasks, a set of interpretable and safety-oriented evaluation metrics is defined. These metrics are derived from the reward structure of the environment and are designed to capture the agent’s effectiveness across three critical dimensions: flight safety, avoidance efficiency, and advisory quality. All statistics are computed over the last 400 evaluation episodes to ensure convergence and stability.
Reward: Mean cumulative reward per episode, representing the overall trade-off between safety and alert efficiency.
Collision rate: Frequency of near-miss events, directly measuring flight safety.
Resolution success: Proportion of encounters maintaining safe vertical separation.
Alert: Mean number of advisories per episode, reflecting alert frequency.
Strengthening: Average number of increased-intensity advisories (e.g., transitioning from a standard to an aggressive vertical maneuver), indicating risk sensitivity.
Reversal: Number of reversals between climb and descend advisories, reflecting policy stability.
Crossing: Frequency of altitude crossings between UAVs, highlighting residual conflict risks.
These indicators provide a concise yet comprehensive evaluation framework, linking reinforcement learning performance with operationally relevant safety outcomes.
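As an illustration of how the advisory-quality metrics could be extracted from a single episode, the sketch below counts alerts, strengthenings, and reversals from a sequence of advisories. The signed-integer encoding (0 for no advisory, ±1 for a standard climb or descend, ±2 for an aggressive one) and the counting convention are hypothetical; the environment used in this work may encode and count advisories differently.

```python
from typing import Dict, Sequence


def advisory_counts(advisories: Sequence[int]) -> Dict[str, int]:
    """Count alert, strengthening, and reversal events in one episode.

    Hypothetical encoding: 0 = no advisory, +1/-1 = standard climb/descend,
    +2/-2 = aggressive climb/descend.
    """
    counts = {"alert": 0, "strengthening": 0, "reversal": 0}
    prev = 0
    for adv in advisories:
        if adv != 0 and prev == 0:
            # A new advisory is issued after a period without one.
            counts["alert"] += 1
        elif adv * prev < 0:
            # Direction flips between climb and descend.
            counts["reversal"] += 1
        elif adv * prev > 0 and abs(adv) > abs(prev):
            # Same direction, higher intensity.
            counts["strengthening"] += 1
        prev = adv
    return counts
```

Per-episode counts of this kind would then be averaged over the evaluation episodes to obtain the reported Alert, Strengthening, and Reversal values.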
5.3. Test Scenario Design
To evaluate the generalization ability of the proposed algorithm under diverse encounter conditions, a hybrid scenario generation strategy combining semi-random, fixed-point, and fully random initialization was adopted. This approach provides a balance between interpretability—by using structured and repeatable conflict geometries—and robustness, by exposing the agent to diverse and stochastic encounters.
Four representative scenarios (A–D) were constructed as shown in
Table 6. Each scenario specifies the initial vertical states of the ego-UAV and intruder while introducing partial randomness in altitude and velocity to simulate realistic variability in flight encounters. Notably, the vertical rate boundaries (up to ±20 m/s) were specifically designed to encompass the kinematic capabilities of high-performance fixed-wing maritime UAVs and ship-borne VTOL platforms, accommodating the rapid altitude transitions required in dynamic marine environments.
Scenarios A–C represent semi-random and fixed-point configurations designed to analyze policy behavior in structured and interpretable conflict geometries. Scenario D introduces a fully random configuration to test algorithmic robustness and generalization under stochastic conditions.
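The three initialization modes can be sketched as follows. Only the ±20 m/s vertical-rate bound is taken from the text; the altitude band, the jitter magnitudes, the mode names, and the InitialState container are hypothetical placeholders and do not reproduce the values in Table 6.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_VERTICAL_RATE = 20.0  # m/s; bound stated in the text


@dataclass(frozen=True)
class InitialState:
    ego_altitude: float            # m
    ego_vertical_rate: float       # m/s
    intruder_altitude: float       # m
    intruder_vertical_rate: float  # m/s


def _clip_rate(rate: float) -> float:
    return max(-MAX_VERTICAL_RATE, min(MAX_VERTICAL_RATE, rate))


def sample_initial_state(
    mode: str,
    rng: random.Random,
    nominal: Optional[InitialState] = None,
    altitude_band: Tuple[float, float] = (100.0, 500.0),  # hypothetical, m
    altitude_jitter: float = 20.0,                        # hypothetical, m
    rate_jitter: float = 2.0,                             # hypothetical, m/s
) -> InitialState:
    """Draw an initial encounter state in 'fixed', 'semi_random', or 'random' mode."""
    if mode == "random":
        # Fully random: every quantity is drawn from its whole admissible range.
        return InitialState(
            rng.uniform(*altitude_band),
            rng.uniform(-MAX_VERTICAL_RATE, MAX_VERTICAL_RATE),
            rng.uniform(*altitude_band),
            rng.uniform(-MAX_VERTICAL_RATE, MAX_VERTICAL_RATE),
        )
    if nominal is None:
        raise ValueError("fixed and semi_random modes need a nominal state")
    if mode == "fixed":
        # Fixed-point: a repeatable, fully specified geometry.
        return nominal
    if mode == "semi_random":
        # Semi-random: the nominal geometry plus bounded perturbations.
        return InitialState(
            nominal.ego_altitude + rng.uniform(-altitude_jitter, altitude_jitter),
            _clip_rate(nominal.ego_vertical_rate + rng.uniform(-rate_jitter, rate_jitter)),
            nominal.intruder_altitude + rng.uniform(-altitude_jitter, altitude_jitter),
            _clip_rate(nominal.intruder_vertical_rate + rng.uniform(-rate_jitter, rate_jitter)),
        )
    raise ValueError(f"unknown mode: {mode}")
```

Passing a seeded random.Random instance makes every draw repeatable, which matches the use of fixed seeds during evaluation.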
For reproducibility, random seeds were fixed during evaluation, and each model was trained with multiple random initializations. This reduced the sensitivity of the comparative results across SAC-D, SAC-D-resource, Surprise, and RAISE to sampling variance.
5.4. Training Performance and Results of the Model
5.4.1. Performance Comparison Under Moderate Resource (20)
To evaluate the effectiveness of the proposed RAISE algorithm in a balanced operational environment, training experiments were conducted under a moderate resource limit of 20. This setting represents a practical operating condition in which the collision avoidance system retains sufficient advisory capability while remaining constrained from issuing excessive or unnecessarily complex maneuvers. The comparative performance of SAC-D, SAC-D-resource, Surprise, and RAISE is reported in
Table 7, while their training dynamics and detailed behavioral metrics are illustrated in
Figure 4 and
Figure 5. All numerical results are reported as the mean ± standard deviation.
Overall, RAISE achieved the best average performance among the four methods under the resource-constrained setting. As shown in Table 7, RAISE obtained the highest average reward of −14.183 ± 2.757, improving over SAC-D (−15.912 ± 4.084), SAC-D-resource (−15.204 ± 3.586), and Surprise (−15.546 ± 3.707). In addition, RAISE achieved the highest resolution success rate of 0.843 ± 0.056 compared with 0.821 ± 0.080 for SAC-D, 0.831 ± 0.070 for SAC-D-resource, and 0.821 ± 0.073 for Surprise. Its collision rate was also among the lowest, reaching 0.003 ± 0.006, which is comparable to Surprise and lower than those of SAC-D and SAC-D-resource. These results indicate that RAISE improves the overall quality of conflict-resolution decisions while maintaining a low residual collision risk.
The advantage of RAISE is particularly evident in its advisory behavior. Although RAISE generated the highest average number of alerts (2.157 ± 0.529), the increase was moderate relative to SAC-D-resource (2.036 ± 0.426) and Surprise (2.082 ± 0.300). More importantly, RAISE substantially reduces complex or unstable advisory adjustments. It achieved the lowest strengthening value of 2.397 ± 0.497, compared with 3.226 ± 0.582 for SAC-D, 3.113 ± 0.538 for SAC-D-resource, and 2.834 ± 0.558 for Surprise. RAISE also produced the lowest reversal value of 0.014 ± 0.016, further improving upon Surprise (0.024 ± 0.026) and markedly outperforming SAC-D and SAC-D-resource. Its crossing value of 0.056 ± 0.039 was close to the best-performing SAC-D-resource baseline (0.052 ± 0.037) and lower than those of SAC-D and Surprise. Collectively, these results suggest that RAISE does not achieve improved performance simply by issuing more advisories; rather, it tends to generate more stable and less contradictory resolution actions.
Figure 4 presents the evolution of the evaluation return during training. All four methods improved rapidly in the initial stage and gradually stabilized thereafter. RAISE established an early return advantage and maintained a higher mean return than the baselines through most of the training. The enlarged late-stage view further highlights this advantage: while the baseline curves remained clustered at lower levels, RAISE preserved the highest mean return throughout the zoomed interval. This indicates that RAISE learns a better-performing policy under the same advisory resource constraint and retains this benefit after convergence. Despite partial overlap of the shaded regions due to seed-dependent variability, the sustained separation of the mean curves supports a clear and persistent improvement.
The detailed training metrics in
Figure 5 further clarify the source of this improvement. All methods rapidly reduced the collision rate and increased resolution success during the early stage of learning. In the later stage, RAISE maintained slightly higher resolution success while exhibiting noticeably lower strengthening and reversal values than the competing approaches. Although its alert count was relatively high, the associated advisories were less frequently strengthened or reversed, indicating a more coherent resolution policy under limited resources. This behavior is consistent with the results in
Table 7: RAISE improves the average reward and resolution success mainly by reducing unstable or unnecessarily escalated advisory actions, rather than by aggressively increasing maneuver frequency.
Overall, the results demonstrate that RAISE provides a favorable balance between conflict-resolution effectiveness and advisory stability under a resource limit of 20. Compared with the baseline methods, it achieved the highest average reward and resolution success rate, maintained a low collision rate, and substantially reduced strengthening and reversal behaviors. These findings support the effectiveness of the proposed resource-aware exploration mechanism in learning more reliable and stable collision-resolution policies in constrained operational environments.
5.4.2. Resource-Level Sensitivity and Adaptive Behavior Under Resource-Constrained and Resource-Rich Environments (15 and 25)
To further examine the adaptability and robustness of the proposed RAISE framework, additional experiments were conducted under resource-constrained (15) and resource-rich (25) conditions. These settings respectively simulate operational scenarios with limited alert capacity and ample advisory freedom, allowing an evaluation of how the resource-aware mechanism adjusts exploration and alert issuance behaviors.
Under resource level 15, where the system operates under tight alert constraints, RAISE maintained superior overall training performance compared to SAC-D and Surprise (
Table 8). As shown in
Figure 6 (left), RAISE not only converged faster but also achieved the highest final return (−23.39), outperforming SAC-D (−24.83) and Surprise (−26.74). Its collision rate remained the lowest throughout most of the training process (≈0.02), confirming that safety can still be preserved even when alert resources are scarce. In terms of advisory behavior (
Figure 7), RAISE exhibited a consistent pattern with the moderate-resource case (
Section 5.4.1): it issued slightly more alerts (1.65 per episode) than the baselines but with notably fewer strengthening (2.07 vs. 3.00) and almost no reversal actions (0.003). This indicates that RAISE allocates its limited alert budget more effectively, favoring timely, stable advisories over reactive corrections and unnecessary intensifications.
Under resource level 25, representing a resource-rich operating condition, all algorithms achieved comparably high returns (around −8 to −10; Table 9, Figure 6 right), with RAISE maintaining a slight performance edge (−8.22 vs. −8.59 for SAC-D and −9.79 for Surprise). Although the alert frequency naturally increased due to relaxed resource limitations, RAISE continued to demonstrate the lowest reversal rate (0.19) and maintained a moderate strengthening value (2.35) throughout training, reflecting its ability to adaptively adjust advisory intensity once sufficient safety margins are established (Figure 8). Notably, while Surprise initially exhibited a higher alert frequency, RAISE gradually surpassed it in later stages, showing that the algorithm dynamically balances safety assurance with efficient resource utilization as training stabilizes.
Overall, these supplementary experiments validate that RAISE achieves consistent, balanced performance across resource regimes. When resources are scarce, it effectively suppresses excessive advisory changes while maintaining a low collision rate and stable convergence. When resources are abundant, it flexibly expands advisory activity and exploration intensity without destabilizing the learned policy. This adaptability highlights the generalization capability of the resource-aware intrinsic exploration mechanism, supporting the consistency and scalability of the main findings under diverse operational constraints.
These results can also be interpreted as a sensitivity evaluation with respect to the available advisory-resource budget. As the resource level changes from 15 to 20 and 25, RAISE does not exhibit abrupt degradation or unstable advisory behavior. Instead, the policy adapts its command strategy according to the available budget: under scarce resources, it suppresses high-burden advisory transitions such as strengthening and reversal; under abundant resources, it permits more advisory activity to improve safety margins while maintaining stable command behavior. This indicates that the proposed resource-aware mechanism is not tuned to a single resource setting, but remains effective across different operational budgets.
5.5. Performance Evaluation in Testing Scenarios
To further evaluate the generalization capability of the proposed RAISE algorithm beyond the training distribution, testing experiments were conducted under four representative encounter scenarios, denoted as Scenarios A–D, using a moderate resource limit of 20. These scenarios cover different vertical-rate configurations and conflict geometries, enabling the policies to be examined under both structured and stochastic encounter conditions. In addition to the learning-based baselines, a rule-based controller and an artificial potential field controller were included as deterministic non-learning baselines. All methods were evaluated under the same scenarios and metrics, and the results are summarized in
Figure 9 and
Table 10.
Overall, RAISE achieved the most favorable safety-performance trade-off among the compared learning-based methods. Across Scenarios A, B, and D, RAISE obtained the highest average reward, with values of −21.739, −21.133, and −21.486, respectively. These results outperformed SAC-D, SAC-D-resource, and Surprise, indicating that the resource-aware intrinsic exploration mechanism improves policy generalization under unseen encounter configurations. In Scenario C, APF obtained a slightly higher reward than RAISE; however, this advantage was accompanied by a substantially higher collision rate and reversal count. Therefore, the reward result in Scenario C should be interpreted together with the safety and advisory-stability metrics rather than as an isolated measure.
In terms of collision avoidance, RAISE demonstrated the most reliable safety behavior. It achieved the lowest collision rate in all four scenarios, with collision rates of 0.026, 0.024, 0.004, and 0.012 in Scenarios A–D, respectively. Compared with SAC-D and Surprise, RAISE consistently reduced the collision risk while maintaining competitive or superior resolution success. For example, in Scenario A, the resolution success rate increased from 0.566 for SAC-D and 0.654 for Surprise to 0.742 for RAISE. A similar improvement was observed in Scenario B, where RAISE reached a resolution success rate of 0.752. In Scenario D, which represents a more random testing condition, RAISE still achieved a resolution success rate of 0.696, outperforming all other methods. Although APF achieved the highest resolution success in Scenario C, its collision rate remained considerably higher than that of RAISE, suggesting that its successful resolutions are less consistently associated with safe separation maintenance.
The comparison with deterministic controllers further highlights the advantage of learning a resource-aware policy. Rule-based and APF methods issue fewer alerts and produce very low strengthening counts, but this apparent command economy is achieved at the cost of weaker safety and poorer adaptability. In Scenarios A, B, and D, both deterministic methods showed substantially higher collision rates than RAISE. For instance, APF reached collision rates of 0.144, 0.120, and 0.144 in Scenarios A, B, and D, respectively, whereas RAISE kept the corresponding rates at 0.026, 0.024, and 0.012. Rule-based control exhibited a similar limitation, with low alert frequency but reduced resolution success and increased collision risk. These results indicate that simply minimizing the number of advisories does not necessarily lead to operationally acceptable behavior; rather, effective collision avoidance requires timely and context-sensitive interventions.
RAISE also showed a clear advantage in advisory stability. Although it generated more alerts than the other learning-based baselines, these alerts were more consistent and less oscillatory. The reversal count of RAISE remained the lowest or among the lowest across all scenarios, with values of 0.044, 0.050, 0.090, and 0.048 in Scenarios A–D. These values were substantially lower than those of SAC-D, SAC-D-resource, rule-based, and APF, and were also lower than Surprise in all scenarios. The reduction in reversal behavior is important because frequent changes in advisory direction may lead to confusing or contradictory guidance for remote operators. In addition, RAISE generally reduces strengthening actions compared with SAC-D and SAC-D-resource, indicating that the policy tends to generate earlier and more stable advisories instead of repeatedly escalating avoidance commands.
It is also worth noting that the deterministic controllers showed low strengthening counts mainly because their advisory activity is limited and less adaptive, not because they achieve a better balance between safety and command economy. Their high reversal rates and collision rates reveal that low command frequency alone is insufficient for robust collision avoidance. In contrast, RAISE uses a larger number of well-targeted alerts to maintain separation while simultaneously suppressing reversals and excessive strengthening. This behavior is consistent with the design objective of the resource-aware mechanism: the goal is not to minimize all commands indiscriminately, but to reduce redundant, contradictory, or operationally inefficient advisories while preserving safety.
In summary, the testing results confirm that RAISE generalizes well to unseen encounter scenarios and achieves a strong balance between collision safety and advisory stability. Compared with learning-based baselines, it consistently improves reward, reduces collision rate, and suppresses reversal behavior. Compared with deterministic controllers, it provides more adaptive and reliable avoidance decisions under diverse encounter geometries. These findings support the effectiveness of integrating resource-aware modulation and surprise-based intrinsic exploration for operationally suitable maritime UAV collision avoidance.
Figure 10 provides a trajectory-level visualization of the representative testing scenarios, where the ego-UAV is shown on the left side of each subplot and the intruding aircraft is shown on the right side. The trajectories further illustrate the behavioral differences among the compared methods. The learning-based baselines exhibited less consistent vertical maneuver patterns in several scenarios, while the deterministic controllers generally produced simpler but relatively rigid avoidance responses. These observations are consistent with the quantitative results reported above, where the baselines showed higher collision rates, higher reversal counts, or weaker overall conflict-resolution performance. In contrast, RAISE generated more stable and progressive vertical maneuvers across the testing scenarios. Its trajectories avoid abrupt direction changes and excessive oscillations, supporting the quantitative finding that RAISE achieves lower collision rates and fewer reversals while maintaining effective conflict resolution. These visual results therefore provide additional evidence that the proposed resource-aware mechanism contributes to both safety performance and avoidance-behavior stability.
5.6. Robustness Evaluation
While full sim-to-real deployment was beyond the scope of this work, robustness under realistic perturbations provides an important proxy for evaluating the operational reliability of collision-avoidance policies. In real maritime UAV operations, the state information available to the avoidance system may be affected by GPS positioning errors, barometric altitude inaccuracies, state-estimation uncertainty, and stochastic wind gusts. Therefore, additive Gaussian observation noise and stochastic vertical wind perturbations were adopted as controlled simulation proxies for sensor-measurement uncertainty and external wind-induced dynamic disturbances, respectively. A robustness evaluation was conducted to examine whether the learned policies could maintain reliable performance under degraded observation and dynamic conditions.
The robustness evaluation considered three perturbed environments in addition to the clean setting. First, sensor noise was introduced by adding Gaussian noise to the directly measured state variables, including relative altitude, ego-UAV vertical velocity, and intruder vertical velocity. Two noise levels were tested, corresponding to 5% and 10% of the operational ranges of the corresponding state variables. Second, wind disturbance was introduced by applying stochastic vertical wind perturbations to the ego-UAV dynamics. Third, a combined perturbation condition was considered, where sensor noise and wind disturbance were applied simultaneously. All methods were evaluated under the same perturbation settings and metrics, and the results are reported in
Table 11.
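The perturbation protocol can be sketched as follows. The sketch assumes that the 5% and 10% levels refer to the standard deviation of zero-mean Gaussian noise expressed as a fraction of each variable's operational range, and that wind acts as an additive Gaussian disturbance on the ego-UAV vertical rate. The 40 m/s rate range follows from the ±20 m/s bounds in Section 5.3; the altitude range, the gust standard deviation, and all names are hypothetical.

```python
import random
from typing import Dict

# Operational ranges (max minus min). The vertical-rate range follows from the
# +/-20 m/s bounds in the text; the altitude range is a hypothetical placeholder.
OPERATIONAL_RANGE = {
    "relative_altitude": 600.0,       # m (hypothetical)
    "ego_vertical_rate": 40.0,        # m/s
    "intruder_vertical_rate": 40.0,   # m/s
}


def perturb_observation(
    obs: Dict[str, float], noise_fraction: float, rng: random.Random
) -> Dict[str, float]:
    """Add zero-mean Gaussian noise with std = noise_fraction * operational range."""
    noisy = {}
    for key, value in obs.items():
        if key in OPERATIONAL_RANGE:
            sigma = noise_fraction * OPERATIONAL_RANGE[key]
            noisy[key] = value + rng.gauss(0.0, sigma)
        else:
            # Variables that are not directly measured are passed through unchanged.
            noisy[key] = value
    return noisy


def apply_vertical_gust(
    ego_vertical_rate: float, rng: random.Random, gust_std: float = 1.0
) -> float:
    """Additive vertical wind disturbance on the ego dynamics (gust_std is hypothetical)."""
    return ego_vertical_rate + rng.gauss(0.0, gust_std)
```

The combined condition corresponds to applying both functions within the same step: the gust alters the true ego vertical rate, and the noisy observation is then drawn from the resulting state.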
Overall, RAISE demonstrated the strongest robustness among the compared methods. Under the clean condition, RAISE achieved the highest reward, the lowest collision rate, and the highest resolution success rate, with a collision rate of 0.022 and a resolution success rate of 0.637. When 5% sensor noise was introduced, the collision rate of RAISE increased to 0.044, but remained lower than those of SAC-D, Surprise, and APF. Under 10% sensor noise, all methods experienced performance degradation, but RAISE still maintained the lowest collision rate among the compared methods, with a value of 0.127. This indicates that although sensor uncertainty makes the avoidance task more difficult, RAISE remains less sensitive to observation perturbations than the baseline methods.
The advantage of RAISE was also evident under wind disturbance. In the wind-disturbed environment, RAISE achieved a collision rate of 0.020 and a resolution success rate of 0.609, both of which remained close to its clean-environment performance. In comparison, SAC-D and Surprise exhibited lower resolution success rates, while APF showed a higher collision rate and weaker overall robustness. Under the combined Noise + Wind condition, all methods degraded further, as expected. Nevertheless, RAISE still obtained the best reward and the lowest collision rate, with a collision rate of 0.128, outperforming SAC-D, Surprise, and APF. These results suggest that the proposed method maintains a more reliable safety margin under simultaneous observation and dynamic perturbations.
In terms of advisory behavior, RAISE consistently achieved the lowest or among the lowest reversal counts across all environment configurations. For example, its reversal count remained 0.076 under the clean condition, 0.089 under 5% noise, 0.122 under 10% noise, 0.077 under wind disturbance, and 0.121 under the combined Noise + Wind condition. These values were substantially lower than those of APF and were also lower than those of SAC-D and Surprise in most cases. This result indicates that RAISE maintains stable advisory directions even when the observed state or vehicle dynamics are perturbed. Although RAISE produced more alerts than the other methods, these alerts were associated with lower collision risk and fewer reversals, suggesting that the additional interventions are more consistent and safety-oriented rather than oscillatory.
The deterministic APF controller exhibited a different robustness pattern. APF generated the fewest alerts and strengthening actions in the clean environment, but its performance deteriorated substantially under sensor noise. Its collision rate increased from 0.109 in the clean environment to 0.225 under 5% noise and 0.296 under 10% noise, while its reversal count also increased markedly. Under the combined Noise + Wind condition, APF reached the highest collision rate among all methods, with a value of 0.303. These results show that low advisory frequency alone does not guarantee robust avoidance performance. A policy must also adapt its maneuvers to uncertain and perturbed encounter states.
The robustness of RAISE may be partly related to its resource-aware learning design. During training, the resource-aware modulation encourages the policy to balance collision avoidance performance with advisory stability, rather than minimizing or maximizing maneuver commands indiscriminately. In addition, the surprise-based intrinsic reward exposes the policy to less predictable state transitions, which may improve its tolerance to moderate distributional variations during testing. The bounded and threshold-based reward terms also help avoid overly sensitive responses to small state deviations near critical safety boundaries. These design factors provide a possible explanation for why RAISE maintains lower collision rates and fewer reversals under sensor noise, wind disturbance, and combined perturbations.
In summary, the robustness results confirm that RAISE provides more reliable performance under degraded operating conditions. Compared with learning-based baselines, it maintains lower collision rates, higher or competitive resolution success, and more stable advisory behavior. Compared with deterministic controllers, it shows stronger adaptability to sensor uncertainty and wind disturbance. These findings further support the effectiveness of the proposed resource-aware reinforcement learning framework for maritime UAV collision avoidance under realistic perturbations.