1. Introduction
The deployment of unmanned aerial vehicles (UAVs) in post-disaster operations has moved, within a single decade, from experimental trials to a routine component of humanitarian response [1, 2, 3]. Following the 2023 Türkiye–Syria earthquakes and the 2024 Henan and Pacific Northwest floods, multinational relief operations routinely coordinate dozens of UAV sorties per hour from temporary aerodromes with two to four unprepared landing strips. The traffic is heterogeneous: small logistics drones deliver medical supplies, medium multi-purpose platforms transport personnel and equipment, and larger aircraft perform casualty evacuation. These categories carry fundamentally different operational priorities, and the scheduling system must reflect this asymmetry.
Three properties distinguish this regime from the well-studied civilian terminal-area sequencing literature. First, the traffic mix is priority-heterogeneous to a degree rarely encountered in civil aviation: a single casualty-evacuation flight is, by any reasonable operational metric, worth on the order of one hundred resupply sorties [4, 5]. Second, arrivals are stochastic and ad hoc: there is no published schedule, and relief flights are launched opportunistically as needs evolve. Third, the aerodrome itself is resource-constrained: temporary landing strips have limited holding capacity, and each arriving UAV carries a finite fuel or battery reserve that imposes a hard operational deadline. Path deviations around obstacles, holding delays due to airspace congestion, and weather-induced speed reductions all consume this reserve, tightening the effective time window available for landing.
The present work addresses the terminal-area runway assignment problem: UAVs are assumed to have completed en-route navigation and are in the approach queue awaiting landing clearance. Path planning, obstacle avoidance, and en-route airspace congestion management are complementary problems extensively studied in the trajectory-planning literature; they lie outside the scope of this study. Our contribution is a scheduling policy that, given a stream of arriving UAVs with heterogeneous priorities and per-aircraft operational deadlines, irrevocably assigns each arrival to one of several runways to maximise cumulative operational value. The interface between the en-route and terminal domains is the per-UAV deadline: a scalar encoding of remaining endurance, which any upstream path-planning module can supply.
Extending the lookahead horizon within a static optimisation framework does not close the performance gap. Our experiments confirm that deeper deterministic lookahead (Joint-LA-2, a three-step joint enumeration equivalent to a rolling-horizon MILP) yields no statistically significant improvement over the two-step variant. The reason is structural: with stochastic arrivals, the value of reserving runway capacity for an unspecified future emergency does not appear in any K-step joint optimisation for finite K. Reinforcement learning recovers the right inductive bias by learning a value function that integrates over the future emergency distribution.
This paper investigates the application of deep RL to priority-aware multi-runway UAV sequencing under operational constraints that characterise disaster relief operations: extreme priority asymmetry (100:5:1 weight ratios), per-aircraft endurance deadlines, and finite runway queue capacity. We formulate the problem as a priority-weighted Markov decision process (MDP) whose reward is intentionally simple—the per-step delta of cumulative weighted landings, with no shaping, no waiting penalty, and no explicit crash term. By relying solely on the asymmetric weight structure to encode operational priorities, any safety-related behaviour must be learned rather than imposed.
We train a Proximal Policy Optimisation (PPO) agent and benchmark it against six baselines: a uniform random control, a wake-greedy heuristic, the de facto operational standard Priority-FCFS, an exact two-step joint optimisation (Joint-LA-1), a stochastic lookahead baseline that samples future arrivals from the known Poisson distribution, and a Monte Carlo Tree Search (MCTS) planner. We further compare against a three-step joint enumeration (Joint-LA-2), which is equivalent to solving a rolling-horizon mixed-integer linear programme with K = 3 by full enumeration.
We report four main findings. First, PPO matches the performance of Priority-FCFS and approaches Joint-LA-1 within 3.2% without requiring any hand-crafted priority rules, capacity reservation logic, or safety constraints—the agent discovers an effective scheduling strategy from the reward signal alone. Second, the learned policy autonomously develops a runway specialisation pattern in which high-priority traffic is concentrated on a single strip (60% of H-class landings) while emergency arrivals are routed almost exclusively to the remaining two strips (93% combined), a behaviour that emerges from the reward signal without any hand-crafted priority logic. Third, the PPO–PFCFS performance gap is modulated by operational deadline tightness: under moderate deadlines the gap narrows substantially, indicating that the value of learned scheduling depends on the temporal slack in the system. Fourth, under a symmetric wake turbulence matrix—a structural perturbation that removes the asymmetry on which Priority-FCFS depends—PPO outperforms the heuristic by 46.5%, demonstrating that the learned policy is robust to changes in environmental structure that degrade heuristic performance.
The remainder of the paper is organised as follows.
Section 2 reviews related work in airport surface scheduling, disaster relief aviation logistics, and RL for resource allocation.
Section 3 formalises the problem as an MDP and defines the per-UAV deadline and runway capacity mechanisms.
Section 4 details the PPO algorithm, the six baselines, and the training configuration.
Section 5 reports the experimental results, including the main comparison, deadline sensitivity, per-class throughput analysis, wake-scaling robustness, and an ablation across nine environment variants.
Section 6 discusses the trade-off between throughput and emergency reliability, explains why static optimisation baselines fail to close the gap, and delineates the scope and limitations of the study.
Section 7 concludes.
3. Problem Formulation
3.1. Setting
A temporary disaster relief aerodrome operates R = 3 parallel landing strips. UAVs arrive according to a homogeneous Poisson process of rate $\lambda = 0.7$ sorties per simulation second, yielding approximately $\lambda T = 70$ arrivals per episode (T = 100 simulation seconds). Each arrival belongs to one of three operational classes, Normal (N), High-priority (H), and Emergency (E), drawn independently from a fixed distribution:
$$P(c = N) = 0.60, \quad P(c = H) = 0.25, \quad P(c = E) = 0.15, \tag{1}$$
with associated operational-value weights
$$w_N = 1, \quad w_H = 5, \quad w_E = 100. \tag{2}$$
The 1:5:100 ratio reflects the order-of-magnitude valuation gap between routine resupply, mixed personnel/cargo, and casualty evacuation that follows from the humanitarian aviation guidance of [4, 5]. We treat this ratio as fixed for the main results and verify, in Section 5.3, that our central conclusions are not artefacts of a specific weight calibration.
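The arrival model above can be illustrated with a short sketch. It assumes the rate of 0.7 and the 60/25/15 class mix quoted for the Stochastic-LA baseline in Section 4.2, read as Normal/High-priority/Emergency; the function name `sample_schedule` and the tuple layout are hypothetical and are not taken from the authors' code.

```python
import random

# Class mix and weights as quoted in the text (60/25/15 and 1:5:100).
CLASS_PROBS = {"N": 0.60, "H": 0.25, "E": 0.15}
WEIGHTS = {"N": 1, "H": 5, "E": 100}


def sample_schedule(rate=0.7, horizon=100.0, seed=0):
    """Sample one episode of (arrival_time, class) pairs.

    A homogeneous Poisson process has i.i.d. exponential inter-arrival
    times with mean 1/rate; classes are drawn independently per arrival.
    """
    rng = random.Random(seed)
    classes = list(CLASS_PROBS)
    probs = [CLASS_PROBS[c] for c in classes]
    t, schedule = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            break
        schedule.append((t, rng.choices(classes, weights=probs)[0]))
    return schedule
```

Fixing the seed makes a schedule reproducible, which is what a paired evaluation on shared schedules requires.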
Each UAV $u$ carries an operational deadline $d_u$, defined as the latest simulation time at which $u$ can physically land before exhausting its fuel or battery reserve. The deadline is computed as
$$d_u = t_u^{\mathrm{arr}} + D_{c(u)}\,(1 + \epsilon_u), \qquad \epsilon_u \sim \mathcal{U}(-0.2,\, 0.2), \tag{3}$$
where $t_u^{\mathrm{arr}}$ is the arrival time of $u$ and $D_c$ is the base endurance for class $c$, specified in simulation seconds for each of $D_N$, $D_H$, and $D_E$. The ±20% uniform perturbation $\epsilon_u$ captures variability in en-route conditions across individual sorties. A UAV that cannot be landed by its deadline is recorded as a deadline violation crash and contributes zero operational value. This mechanism provides a unified scalar interface between en-route factors (path deviations around obstacles, holding delays due to airspace congestion, and weather-induced speed reductions) and the terminal scheduling problem. Any upstream path-planning or traffic flow management module can supply $d_u$; the scheduler need only respect it.
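A minimal sketch of the deadline mechanism follows, assuming the deadline is the arrival time plus a class base endurance scaled by a ±20% uniform perturbation. The base endurance values in `BASE_ENDURANCE` are hypothetical placeholders chosen only for illustration, and the function names are not from the original implementation.

```python
import random

# HYPOTHETICAL base endurances in simulation seconds (illustration only).
BASE_ENDURANCE = {"N": 60.0, "H": 45.0, "E": 30.0}


def sample_deadline(arrival_time, cls, rng, spread=0.2):
    """Latest admissible landing time: arrival + endurance * (1 + eps)."""
    eps = rng.uniform(-spread, spread)  # the +/-20% uniform perturbation
    return arrival_time + BASE_ENDURANCE[cls] * (1.0 + eps)


def is_deadline_violation(t_legal, deadline):
    """A UAV crashes when its earliest legal touchdown exceeds its deadline."""
    return t_legal > deadline
```

Because the deadline is a single scalar per UAV, an upstream module could replace `sample_deadline` with any endurance estimate without changing the scheduler.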
The present formulation addresses the terminal-area runway assignment problem in isolation: UAVs are assumed to have already arrived in the approach queue. Path planning, obstacle avoidance, and en-route airspace congestion management are the subject of complementary research threads and are not modelled here.
3.2. Wake Turbulence Constraints
Whenever runway $r$ receives a follow-on UAV of class $c_f$ after having most recently served a leader of class $c_\ell$, a class-asymmetric wake separation interval $W(c_\ell, c_f)$ elapses before the follower may legally land. The matrix $W$ (in simulation seconds) is derived from the ICAO Doc 4444 wake separation minima [22] with a uniform 10× temporal compression to maintain a tractable simulation horizon. The matrix is:
The matrix is markedly asymmetric: an Emergency leader imposes a 14.0 s follow-on interval on a Normal follower, whereas a Normal leader imposes only a 2.0 s interval on an Emergency follower. This asymmetry, $W(E, N)/W(N, E) = 14.0/2.0 = 7$, reflects the disproportionate trailing vortex circulation strength of heavier airframes and is the structural feature that makes wake-aware scheduling non-trivial (Figure 1).
We address the robustness of our conclusions to the 10× compression assumption through two complementary analyses. First, a theoretical argument: uniform temporal scaling $W \mapsto \kappa W$ preserves all class-to-class ratios, including the asymmetry ratio $W(E, N)/W(N, E) = 7$. Since the scheduling difficulty is driven by the interaction between priority weights (100:5:1) and wake asymmetry ratios, both invariant under uniform scaling, the qualitative structure of the optimal policy is preserved. Second, an empirical wake-scaling experiment (Section 5.7) evaluates PPO and Priority-FCFS under five scaling factors $\kappa$, confirming that the PPO–PFCFS gap is robust to factor-of-four variations in absolute wake magnitudes.
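The invariance claim can be checked in one line using the two wake entries quoted above (14.0 s for an Emergency leader followed by a Normal follower, 2.0 s for the reverse). The symbol $\kappa$ for the scaling factor is notation assumed here for illustration.

```latex
\frac{(\kappa W)(E, N)}{(\kappa W)(N, E)}
  = \frac{\kappa \cdot 14.0}{\kappa \cdot 2.0}
  = \frac{14.0}{2.0}
  = 7
  \qquad \text{for every } \kappa > 0 .
```

The same cancellation applies to every pair of entries, so all class-to-class ratios are unchanged while the absolute separations scale with $\kappa$.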
A runway $r$ is additionally subject to a fixed occupancy time $\tau_{\mathrm{occ}}$ (in simulation seconds) for touchdown roll-out and clearing, and a maximum queue capacity of $Q_{\max}$ aircraft. The capacity constraint reflects the physical limit on holding-aircraft space at temporary aerodrome landing strips. When a runway’s pending queue reaches $Q_{\max}$, the action mask precludes further assignments to that strip until a landing frees capacity. Combining wake separation, occupancy, and queue capacity, the earliest legal touchdown time $t_{\mathrm{legal}}(u, r)$ of UAV $u$ assigned to runway $r$ is given by Equation (5), with a boundary convention for the case in which the runway has not yet served any UAV. The expression generalises recursively over a non-empty queue: each queued UAV is processed in arrival order, advancing the runway’s effective next-free time deterministically.
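The recursion described above can be sketched as follows. This is one plausible reading, not the paper's exact Equation (5): it assumes the required gap after the previous landing is the larger of the wake interval and the occupancy time, and that an unused runway imposes no gap. Only the two wake entries quoted in the text appear in `WAKE_KNOWN`; the occupancy value and all function names are hypothetical.

```python
# Only these two wake entries are quoted in the text (simulation seconds).
WAKE_KNOWN = {("E", "N"): 14.0, ("N", "E"): 2.0}


def t_legal(arrival_time, cls, runway, wake, occupancy=1.0):
    """Earliest legal touchdown on a runway (assumed form, see lead-in).

    runway: dict with 'last_time' and 'last_class' (both None if unused).
    """
    if runway["last_class"] is None:
        return arrival_time  # assumed convention: no separation needed
    gap = max(wake[(runway["last_class"], cls)], occupancy)
    return max(arrival_time, runway["last_time"] + gap)


def drain_queue(queue, runway, wake, occupancy=1.0):
    """Process queued (arrival_time, cls) pairs in arrival order."""
    state = dict(runway)
    times = []
    for arrival_time, cls in queue:
        t = t_legal(arrival_time, cls, state, wake, occupancy)
        times.append(t)
        state = {"last_time": t, "last_class": cls}  # advance next-free state
    return times
```

With an Emergency leader at time 10, a Normal follower arriving at 11 is held until 24, whereas the reverse ordering would cost only 2 s; this is the asymmetry the scheduler has to exploit.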
3.3. Markov Decision Process
The scheduler is the agent and the aerodrome plus traffic generator is the environment (
Figure 2). The state, action, and reward are defined as follows.
State. At each decision step the agent observes a vector of 74 dimensions, comprising: (i) the current arrival, encoded by a presence indicator, a one-hot class vector (3), a normalised arrival time, and a normalised deadline urgency (6 dimensions); (ii) for each runway, the normalised next-free time, a one-hot encoding of the last landed class (3), and the normalised queue length (5 dimensions each, 15 total); (iii) a preview of the next 10 future arrivals, each encoded by a one-hot class vector (3), normalised arrival time delta, and normalised deadline urgency (5 dimensions each, 50 total); and (iv) three global scalars: normalised current time, normalised cumulative arrivals, and normalised cumulative landings (3 dimensions).
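The dimension count can be verified with a small layout sketch. It assumes the four blocks are concatenated in the order listed, and it infers a preview length of 10 from the stated 50 preview dimensions at 5 per arrival; the names `observation_layout` and `LAYOUT` are hypothetical.

```python
N_CLASSES = 3
N_RUNWAYS = 3
N_PREVIEW = 10  # inferred: 50 preview dimensions / 5 per previewed arrival


def observation_layout():
    """Return {block_name: (start, stop)} index ranges of the flat vector."""
    sizes = [
        # presence flag + one-hot class + arrival time + deadline urgency
        ("current_arrival", 1 + N_CLASSES + 1 + 1),
        # per runway: next-free time + one-hot last class + queue length
        ("runways", N_RUNWAYS * (1 + N_CLASSES + 1)),
        # per previewed arrival: one-hot class + time delta + urgency
        ("preview", N_PREVIEW * (N_CLASSES + 1 + 1)),
        # current time, cumulative arrivals, cumulative landings
        ("globals", 3),
    ]
    layout, offset = {}, 0
    for name, size in sizes:
        layout[name] = (offset, offset + size)
        offset += size
    return layout


LAYOUT = observation_layout()
OBS_DIM = LAYOUT["globals"][1]
```

The blocks sum to 6 + 15 + 50 + 3 = 74, matching the stated observation size.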
Action. At each decision step the agent selects an action $a_t \in \{0, 1, 2\}$, irrevocably assigning the current arrival to one of the three runways. The action mask is determined by the runway queue capacity: if runway r’s queue has reached capacity, action r is masked out. When all runways are at capacity, the mask is released to prevent deadlock. This mask is operationally meaningful, unlike the trivial all-true mask of earlier formulations, and encodes a genuine physical constraint of the aerodrome.
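The masking rule can be written in a few lines. This is a sketch of the behaviour described above rather than the authors' implementation; the function name `action_mask` and the capacity value used in the example are assumptions.

```python
def action_mask(queue_lengths, capacity):
    """Return one boolean per runway: True if the runway may be assigned.

    A runway whose pending queue has reached capacity is masked out.
    If every runway is full, the mask is released (all True) so that
    the agent is never left without a legal action.
    """
    mask = [q < capacity for q in queue_lengths]
    if not any(mask):
        return [True] * len(queue_lengths)
    return mask
```

For example, with a hypothetical capacity of 5, `action_mask([2, 5, 1], 5)` blocks only the middle runway, while `action_mask([5, 5, 5], 5)` allows all three.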
Reward. Let $\ell_u(t) \in \{0, 1\}$ indicate whether UAV $u$ has physically completed landing on or before simulation step $t$. The agent receives, at each environment step $t$,
$$r_t = \sum_{u} w_{c(u)} \left[ \ell_u(t) - \ell_u(t-1) \right], \tag{6}$$
i.e., the per-step delta of the cumulative weighted-landings sum. There is no shaping term, no waiting penalty, and no per-step crash cost beyond the implicit loss of a UAV’s weight when it misses its deadline. The only explicit penalty is terminal: 10.0 is subtracted for each emergency-class UAV that remains unlanded at episode termination. We adopt this near-minimalist form (one penalty coefficient, no reward shaping) to ensure that any safety-related behaviour is produced by the learned policy rather than by hand-crafted reward terms. Cumulative episode return is
$$G = \sum_{u} w_{c(u)}\, \ell_u(T) \;-\; 10.0 \cdot \left| \{ u : c(u) = E,\ \ell_u(T) = 0 \} \right|, \tag{7}$$
equal to the total operational value delivered minus terminal crash penalties.
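A minimal sketch of the reward bookkeeping follows, assuming each step reports the classes of the UAVs that completed landing during that step. The 1:5:100 weights and the 10.0 terminal penalty are taken from the text; the function names and data layout are hypothetical.

```python
WEIGHTS = {"N": 1, "H": 5, "E": 100}
CRASH_PENALTY = 10.0  # per unlanded emergency UAV at episode end


def step_reward(newly_landed):
    """Per-step delta of the cumulative weighted-landings sum."""
    return float(sum(WEIGHTS[c] for c in newly_landed))


def episode_return(landings_per_step, unlanded_emergencies):
    """Total delivered value minus the terminal crash penalties."""
    delivered = sum(step_reward(step) for step in landings_per_step)
    return delivered - CRASH_PENALTY * unlanded_emergencies
```

As a worked case, landing one Normal and one Emergency UAV in the first step and one High-priority UAV later, with two emergencies left unlanded, gives 1 + 100 + 5 − 20 = 86.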
Transition. The environment transitions deterministically given the current state and action, with two stochastic ingredients: the Poisson inter-arrival times and the i.i.d. class draws. The transition advances the simulation clock to the next arrival time or to the horizon T, whichever is earlier, processing all landings whose t_legal falls within the elapsed interval. UAVs whose t_legal exceeds their deadline are removed from the queue as deadline violation crashes and contribute nothing to the return.
3.4. Quantities of Interest
We use two scalar episode-level functionals to evaluate policies. The first is the cumulative operational value G of Equation (7), which forms the headline performance metric. The second is the emergency no-show count
$$N_E^{\mathrm{miss}} = \left| \{ u : c(u) = E,\ u \text{ not landed by } T \} \right|, \tag{8}$$
which we report alongside G throughout. The two are correlated but not equivalent: a policy can reduce G per episode while improving $N_E^{\mathrm{miss}}$ by spending high-weight emergency landings to displace several low-weight normal landings, and Section 5.4 quantifies this trade-off.
4. Methods
4.1. Algorithm: Proximal Policy Optimisation
We employ Proximal Policy Optimisation (PPO) [23], using the MaskablePPO implementation from stable-baselines3 (version 2.7.1, https://github.com/DLR-RM/stable-baselines3, accessed on 6 May 2026) [21] and its contrib package (stable-baselines3-contrib, https://github.com/Stable-Baselines3-Team/stable-baselines3-contrib, accessed on 6 May 2026) for native action mask support. Under the capacity-constrained action mask of Section 3.3, the mask is no longer a trivial all-true placeholder; it enforces a genuine operational constraint. We retain the MaskablePPO implementation for code-level consistency with prior work in this codebase.
The agent maintains two separate multilayer perceptrons with hidden layers of size (128, 128), an actor $\pi_\theta$ and a critic $V_\phi$, with no parameter sharing. The actor outputs a categorical distribution over the three runway actions; at each decision step, the policy is renormalised over the unmasked subset before sampling. The critic estimates the state-value function V(s). Both networks are trained with the standard clipped PPO objective (Figure 3).
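The two mechanisms named above, renormalisation over unmasked actions and the clipped surrogate, can be sketched in plain Python. This illustrates the standard definitions and is not the MaskablePPO source; the clip range of 0.2 is an assumed value, since the paper's hyperparameter table is not reproduced here.

```python
import math


def masked_probs(logits, mask):
    """Softmax restricted to actions whose mask entry is True."""
    top = max(l for l, m in zip(logits, mask) if m)  # for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return min(ratio * advantage, clipped * advantage)
```

Masked actions receive probability exactly zero, so sampling can never select a runway that is at capacity; the clip removes the incentive to push the probability ratio beyond 1 ± 0.2 in the direction that improves the objective.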
4.2. Baselines
We benchmark PPO against six baselines, ordered from weakest to strongest.
Random. A uniform random selection over the three runways at each step. This establishes the lower envelope of policy performance.
WakeGreedy. The runway that minimises only the immediate wake separation interval W between the runway’s last-served class and the arriving class is selected, ignoring priority weights and queue contents. This is included as a negative control to demonstrate that priority awareness, and not merely wake-aware sequencing, is the operationally relevant feature.
Priority-FCFS. The de facto operational standard. The runway that minimises the per-arrival predicted t_legal(u, r) from Equation (5) is selected. This corresponds to the human controller’s heuristic of “place each arrival on whichever strip can accept it earliest, given the queue and wake”.
Joint-LA-1 (joint two-step). An exact joint enumeration over the current and next-arrival assignments (3 × 3 = 9 combinations for three runways), choosing the runway r for the current UAV that minimises the sum of weighted t_legal values across both decisions. The minimisation over the second assignment is computed under the simulated runway state that would result from the first assignment, making the procedure a true two-stage optimisation.
Stochastic-LA (stochastic lookahead). A probabilistic optimisation baseline that extends Joint-LA-1 by sampling synthetic future arrival sequences from the known Poisson distribution ( = 0.7, class mix 60/25/15). At each decision step, for each candidate runway, N = 10 Monte Carlo rollouts of future arrivals are generated and assigned greedily; the runway with the lowest expected total weighted t_legal is selected. Unlike Joint-LA-1, this baseline can express a form of probabilistic capacity reservation: if a synthetic emergency arrival appears frequently in the sampled futures, the expected-cost minimisation will favour runways that leave capacity available.
MCTS (Monte Carlo Tree Search). An online planning baseline that, at each decision step, builds a search tree over future arrival assignment sequences using the pre-generated arrival schedule as a perfect environment model. The tree is searched with 100 UCB1 iterations at depth 5; rollout assignments use Priority-FCFS for the subsequent 15 UAVs. The cumulative operational value G (Equation (7)) serves as the tree-search objective. MCTS is included as a representative online planning alternative to the deterministic and stochastic baselines.
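To make the difference between Priority-FCFS and Joint-LA-1 concrete, the sketch below implements both on a deliberately simplified runway model (wake separation only, no occupancy time, no queues). The wake entries for (E, N) and (N, E) are those quoted in Section 3.2; the (N, N) and (E, E) entries, the example scenario, and all function names are hypothetical.

```python
from itertools import product

# (E, N) and (N, E) are quoted in the text; the other two are HYPOTHETICAL.
WAKE = {("E", "N"): 14.0, ("N", "E"): 2.0, ("N", "N"): 4.0, ("E", "E"): 6.0}
WEIGHTS = {"N": 1, "H": 5, "E": 100}


def t_legal(arrival, cls, runway, wake):
    """Simplified earliest touchdown; runway = (last_time, last_class)."""
    last_time, last_cls = runway
    if last_cls is None:
        return arrival
    return max(arrival, last_time + wake[(last_cls, cls)])


def priority_fcfs(uav, runways, wake):
    """Pick the runway that can accept the current arrival earliest."""
    arrival, cls = uav
    return min(range(len(runways)),
               key=lambda r: t_legal(arrival, cls, runways[r], wake))


def joint_la1(cur, nxt, runways, wake, weights):
    """Enumerate all (r1, r2) pairs; return the r1 of the cheapest pair."""
    best_cost, best_r = None, None
    for r1, r2 in product(range(len(runways)), repeat=2):
        t1 = t_legal(cur[0], cur[1], runways[r1], wake)
        state = list(runways)
        state[r1] = (t1, cur[1])  # runway state after the first assignment
        t2 = t_legal(nxt[0], nxt[1], state[r2], wake)
        cost = weights[cur[1]] * t1 + weights[nxt[1]] * t2
        if best_cost is None or cost < best_cost:
            best_cost, best_r = cost, r1
    return best_r


# Hypothetical scenario: a Normal arrival now, an Emergency arrival next.
RUNWAYS = [(10.0, "E"), (8.0, "N"), (30.0, "N")]
CURRENT, NEXT = (11.0, "N"), (11.5, "E")
```

In this scenario Priority-FCFS places the Normal arrival on runway 1 (touchdown at 12), whereas Joint-LA-1 sends it to runway 0 (touchdown at 24) so that runway 1 stays free for the Emergency arrival at 11.5; the weighted cost falls from 1412 to 1174.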
We additionally investigated a three-step joint enumeration (Joint-LA-2, 3 × 3 × 3 = 27 combinations), which is equivalent to solving a rolling-horizon mixed-integer linear programme (MILP) with a three-arrival lookahead by full enumeration. In the v4 environment, Joint-LA-2 underperforms Joint-LA-1 by 7.4%, a counterintuitive result that corroborates the structural argument of Section 6.2: deeper deterministic lookahead, without a learned value function, amplifies rather than corrects the myopia of the greedy heuristic. We therefore report Joint-LA-1 as the representative optimisation baseline and discuss the structural reasons for the failure of deeper deterministic optimisation in Section 6.2.
We note that a full-schedule MILP formulation [6] is inapplicable to the online sequential-decision setting studied here: MILP requires a completely known arrival schedule prior to optimisation, whereas our problem reveals UAVs one by one through a stochastic Poisson process. At the modest lookahead depths feasible for real-time decision-making (K ≤ 3), exhaustive enumeration (27 combinations for K = 3) dominates branch-and-bound in both speed and solution quality, confirming Joint-LA-1 as the appropriate deterministic optimisation baseline. We omit Greedy-LA-3 and Greedy-LA-5 from the present comparison because deeper sequential-greedy lookahead, without stochastic modelling of future emergency arrivals, does not improve upon Joint-LA-1, a null result that is itself informative and corroborates the structural argument of Section 6.2.
4.3. Training Configuration
Each training run consumes 2,000,000 environment steps (corresponding to roughly twenty thousand simulated episodes) on a single CPU thread (AMD Ryzen 9 8940HX with Radeon Graphics (2.40 GHz)). Hyperparameters were taken from the stable-baselines3 PPO defaults [21] except where modifications were required to address the increased difficulty of the capacity-constrained, deadline-aware v4 environment (Table 1).
We train ten independent seeds (0–9) for the main result and five additional seeds for the deadline sensitivity sweep, yielding 15 trained agents in total. Each seed uses a fixed random number generator initialisation that propagates to environment, network initialisation, and PPO sampling. For evaluation, we use a separate held-out set of 100 random seeds (500,000–500,099) to generate 100 paired arrival schedules; every policy under test is evaluated on the same schedules, producing a within-subject paired design that maximises statistical power.
4.4. Action Masking Under Runway Capacity Constraints
The action mask in the v4 formulation is operationally grounded: a runway whose pending queue has reached capacity is masked out, preventing further assignments until a landing frees capacity. When all runways are simultaneously at capacity—a rare event occurring in fewer than 1% of decision steps—the mask is released to prevent deadlock.
This design distinguishes our approach from two extremes prevalent in the RL-for-operations literature. At one extreme, masked RL formulations in safety-critical domains [20] preclude large fractions of the action space via hand-crafted rules that encode domain knowledge; while such masking accelerates training and provides safety guarantees, it subordinates RL to the engineered constraints. At the other extreme, the trivial all-true mask of our earlier environment (v2.3) served as an explicit signal that no domain knowledge beyond the reward weights had been embedded in the agent’s action space, a design we termed constraint-emergent RL. The v4 mask occupies a middle ground: it enforces a genuine physical constraint (finite ramp capacity) without encoding any priority or sequencing logic. The agent must still learn when to use each available runway and when to defer assignments: the mask tells it what is physically possible, not what is operationally desirable.
We document two engineering iterations from earlier development because the diagnostic patterns are likely to recur in similar applications. (i) A WAIT action silently invoked Priority-FCFS as a fallback, contaminating the PPO-vs-PFCFS comparison; we removed WAIT entirely in v2.3. (ii) The policy collapse pathology: default-hyperparameter PPO converged to a policy indistinguishable from Priority-FCFS; diagnosis traced this to an imbalance between the value function and policy gradient loss scales, resolved by reducing one loss-scale hyperparameter and increasing another. Both iterations are documented because recent literature on reproducibility in RL [24] has repeatedly emphasised that the most consequential design decisions in applied RL are rarely the most prominent ones in the abstract.
5. Experiments and Results
5.1. Experimental Protocol
All evaluation is conducted on a held-out paired-episode protocol. We pre-generate 100 random seeds (500,000–500,099) and instantiate, for each seed, a single arrival schedule. Every policy under test is evaluated on this fixed set of 100 schedules, producing a within-subject paired design that maximises statistical power per unit of computation. The PPO policy used for all reported numbers is the best-checkpoint of seed 6 (the seed achieving the highest evaluation reward on a 20-episode validation set), trained for 2,000,000 environment steps. The baseline policies are deterministic or use fixed random seeds and are evaluated on the same 100 schedules.
5.2. Main Result
Table 2 reports the headline comparison across all seven policies (six baselines plus PPO). Our agent (PPO) attains an episode mean reward of 741.7 ± 177.7 (n = 100), compared with 766.5 ± 180.1 for Joint-LA-1, the strongest non-learned baseline, and 762.5 ± 179.2 for Priority-FCFS, the operational standard.
Three patterns are evident. First, PPO matches the performance of the strongest baselines within approximately one standard deviation while landing fewer aircraft overall (Figure 4), a consequence of the priority-weighted objective, which favours selective sacrifice of low-weight throughput for high-weight emergency reliability. Second, none of the optimisation-based baselines, whether deterministic (Joint-LA-1), stochastic (Stochastic-LA), or search-based (MCTS), exceeds Priority-FCFS by a statistically significant margin, and deeper lookahead does not confer monotone improvement. Third, absolute rewards are lower than in the unconstrained v2.3 environment (where deadline constraints were absent), reflecting the stringency of the per-UAV operational deadlines: approximately 20 emergency arrivals per episode cannot be landed before their deadlines expire under any policy (Figure 5).
The paired statistical comparisons are reported in Table 3. Against Joint-LA-1, PPO shows a reward difference of −24.8 (−3.24%, paired t = −2.05, p = 0.043), which reaches significance at the $\alpha = 0.05$ level. Against Priority-FCFS, the difference is −20.8 (−2.73%, p = 0.124, not significant). In 28 of 100 paired episodes the PPO trajectory dominates the Joint-LA-1 trajectory in reward.
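For readers who wish to reproduce this kind of comparison, the paired t statistic and the relative gap can be computed as sketched below. The sketch shows the textbook paired test on per-episode rewards, assuming both policies are evaluated on identical schedules; the function names are hypothetical and the per-episode data behind Table 3 are not reproduced here.

```python
import math
from statistics import mean, stdev


def paired_t(rewards_a, rewards_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(rewards_a, rewards_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample std, n - 1 divisor
    return mean(diffs) / standard_error, n - 1


def relative_gap_percent(mean_difference, baseline_mean):
    """Mean reward difference expressed as a percentage of the baseline."""
    return 100.0 * mean_difference / baseline_mean
```

Applied to the reported means, `relative_gap_percent(-24.8, 766.5)` rounds to −3.24 and `relative_gap_percent(-20.8, 762.5)` rounds to −2.73, matching the percentages in the text.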
5.3. Sensitivity to Operational Deadline Tightness
It is reasonable to ask whether the PPO–PFCFS gap depends on the specific choice of per-class endurance values (Figure 6). We address this directly by training an additional PPO agent under a Moderate deadline scenario, which sets the three base endurances so as to give a larger temporal slack for routine and high-priority traffic, and comparing against the Tight default (Table 4).
The gap narrows from −2.7% under tight deadlines to −0.5% under moderate deadlines, a reduction of over 80% in relative terms. This trend is consistent with the structural account of
Section 6.2: when routine traffic has sufficient endurance to wait, PPO’s capacity reservation strategy has room to operate; when all aircraft face imminent deadlines, the Priority-FCFS heuristic of “land everything as early as possible” is near-optimal. The emergency-class deadline is deliberately kept tight in both scenarios (30–35 s), reflecting the time-critical nature of casualty evacuation—a defining feature of the disaster relief setting that creates the asymmetric slack on which learned scheduling depends.
5.4. Per-Class Throughput and the Emergency Reliability Trade-Off
Table 5 decomposes the headline reward gap by UAV class.
Two observations are notable. First, under the tight operational deadlines of the v4 environment, the per-class throughputs of PPO and Priority-FCFS are quantitatively similar across all three classes: the large N/H-for-E trade-off observed in the unconstrained v2.3 setting is substantially compressed when all aircraft face imminent endurance limits. Second, despite the near-identical aggregate throughput, the underlying allocation patterns differ qualitatively: PPO’s runway × class allocation matrix (Figure 7) and per-class action distribution matrix (Figure 8) reveal the same emergent specialisation observed in earlier formulations (R2 handles only 5.7% of emergency traffic while R0 and R1 collectively receive 94.3%), whereas the heuristic baselines distribute emergency landings near-uniformly across all runways. That the learned policy maintains this structured allocation behaviour even when it does not confer a throughput advantage suggests that runway specialisation is a robust emergent property of the priority-weighted objective, not an artefact of a specific environment configuration.
5.5. Permutation Invariance and Crash Decomposition
Permutation invariance. If the runway specialisation reported in Figure 7 and Figure 8 were a positional artefact (for instance, if the agent had simply learned that “runway index 2 is for normal traffic” without regard to the runway’s state), the result would be invalidated. We test for this by re-running the trained PPO policy on all six permutations of the runway-index-to-runway-state mapping. The reward spread across permutations is less than 0.05%, indistinguishable from numerical reordering effects in the MLP forward pass. This confirms that the policy conditions on runway content (next free time, last class, queue length) rather than runway index.
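A per-observation version of this check can be sketched as follows. It assumes the observation is a flat list in which the three 5-dimensional runway blocks occupy indices 6 to 20, following the ordering given in Section 3.3, and that `policy` is any callable mapping an observation to a runway slot index. The paper's test compares episode rewards across permutations; this sketch only checks single decisions, and all names are hypothetical.

```python
from itertools import permutations

RUNWAY_START, RUNWAY_DIM, N_RUNWAYS = 6, 5, 3  # assumed block layout


def permute_runways(obs, perm):
    """Copy obs so that runway slot i holds the block of runway perm[i]."""
    out = list(obs)
    for slot, src in enumerate(perm):
        a = RUNWAY_START + slot * RUNWAY_DIM
        b = RUNWAY_START + src * RUNWAY_DIM
        out[a:a + RUNWAY_DIM] = obs[b:b + RUNWAY_DIM]
    return out


def is_permutation_invariant(policy, obs):
    """True if the policy picks the same physical runway under all 6 orders."""
    base = policy(obs)
    for perm in permutations(range(N_RUNWAYS)):
        slot = policy(permute_runways(obs, perm))
        if perm[slot] != base:  # map the chosen slot back to its runway
            return False
    return True
```

A policy that reads runway content, such as one choosing the smallest next-free time, passes this check, whereas a policy that always returns a fixed index fails it.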
Crash decomposition. All 21.04 emergency crashes per episode under PPO are classified as operable: each crash involves an arrival whose time and deadline would, in principle, allow a landing, but the scheduling decisions did not achieve one. Zero crashes are attributable to horizon constraints (arrivals spawned too late in the episode to be landed by any policy). The residual crash rate under the best-performing policy (Joint-LA-1, 19.13 crashes/ep) confirms that approximately 19 emergency arrivals per episode are fundamentally unlandable under the v4 deadline and capacity constraints—a consequence of the Poisson arrival process generating bursts of emergency traffic that exceed the aerodrome’s physical throughput capacity.
5.6. Training Dynamics and Multi-Seed Robustness
Figure 9 shows the evaluation reward across all ten training seeds, evaluated every 50,000 steps on a fixed 20-episode validation set. The mean best-checkpoint trajectory (dark blue) approaches the deterministic Priority-FCFS reference (red dashed, 767.5) within the first 200,000 steps and remains within 0.5% thereafter. All ten seeds exhibit convergent behaviour; the best individual seed (seed 6) achieves a best-checkpoint reward of 806.0 (+5.0% over PFCFS on the validation set). The 10-seed mean best-checkpoint reward is 765.1 (−0.31%), confirming that the deployed policy matches the operational baseline on average, with individual seeds occasionally exceeding it.
We verify three diagnostic properties of the trained agents. All ten seeds satisfy: (i) approx_kl at the final update lies in $[3.0 \times 10^{-3}, 7.1 \times 10^{-3}]$, within the standard healthy band of $[10^{-3}, 2 \times 10^{-2}]$; (ii) clip fraction lies in [0.014, 0.079], indicating that the PPO trust-region constraint is active but not saturated; and (iii) explained variance exceeds 0.78 for all ten seeds (range [0.783, 0.913]), confirming that the critic provides a meaningful value estimate.
5.7. Wake-Scaling Robustness
To assess the sensitivity of our conclusions to the 10× temporal compression of the ICAO wake separation matrix, we evaluate the trained PPO policy (Tight, seed 6) and Priority-FCFS under five scaling factors $\kappa$ applied uniformly to the wake matrix. This experiment directly assesses whether the scheduling difficulty is primarily driven by wake separation and whether the 10× compression assumption affects the conclusions (Table 6).
The relative PPO–PFCFS gap remains within a narrow band of [−3.6%, −2.7%] across a factor-of-four variation in absolute wake magnitudes. This empirical robustness is consistent with the theoretical argument of Section 3.2: uniform scaling of W preserves all class-to-class wake ratios, and the scheduling structure is driven by the interaction of priority asymmetry (100:5:1) with wake asymmetry ($W(E, N)/W(N, E) = 7$), both of which are invariant under $W \mapsto \kappa W$. The absolute reward values decrease with $\kappa$ (tighter wake constraints reduce total throughput), but the relative ordering of policies is preserved.
5.8. Ablation Across Nine Environment Variants
Figure 10 reports a structured ablation in which we re-train PPO and re-evaluate Priority-FCFS at nine alternative configurations of the v4 environment. The variants map to distinct operational scenarios encountered in disaster relief aviation:
Aerodrome capacity: R = 2 (small forward operating base with two landing strips), R = 4 (larger relief aerodrome with four strips);
Operational tempo: λ = 0.5 (low-intensity sustained logistics), 0.7 (nominal), 0.9 (surge operations following a mass-casualty incident);
Casualty load: emergency-class fraction 10%, 15% (default), 20% (varying proportions of casualty-evacuation flights in the traffic mix);
Wake structure: default asymmetric (ICAO-derived) vs. a symmetric control matrix in which W = 7.0 for all class pairs;
Preview horizon: preview length = 0 (no lookahead information in the observation).
The central finding of the ablation is the wake symmetry result: removing the asymmetric wake structure on which Priority-FCFS depends causes the heuristic’s performance to degrade substantially, while PPO—which learns its scheduling strategy from experience rather than from an explicit wake model—adapts and outperforms the heuristic by 46.5%. This finding provides strong evidence that the emergent behaviour of the learned policy is not merely replicating the heuristic but constitutes a qualitatively different scheduling strategy.
6. Discussion
6.1. The Trade-Off as a Design Goal, Not a Defect
The headline result—PPO matches Priority-FCFS within statistical noise (p = 0.124), while Joint-LA-1 holds a modest but statistically significant advantage (3.2%, p = 0.043)—may, at first glance, appear disappointing. Both policies land fewer total aircraft per episode than the throughput-maximising baselines. We hold the opposite view: this outcome is the natural consequence of a correctly specified objective function and represents a validation of the approach, not a weakness.
A scheduler that maximises raw throughput in a 1:5:100 priority-weighted regime is structurally misaligned with the operational objective. The mismatch is precisely captured by the per-class decomposition (Table 5): the near-identical throughput between PPO and PFCFS is a consequence of the tight operational deadlines, which force both policies toward a throughput-maximising strategy. That PPO matches PFCFS without being programmed with its logic, while maintaining emergent runway specialisation, is a meaningful achievement.
The substantive operational consequence is that PPO is deployable for disaster relief operations where the priority ratio reflects life-critical valuation, and where the optimal heuristic may not be known a priori—for instance, when the traffic mix, wake matrix, or deadline structure changes across deployments. In such settings, the ability to recover near-optimal performance from the reward signal alone, without manual re-tuning of scheduling rules, constitutes a practical advantage.
6.2. Why Static Optimisation Baselines Fail—A Structural Account
Joint-LA-1, the strongest non-learned baseline, performs a joint two-step optimisation: it enumerates all nine (runway_cur, runway_next) assignment pairs and selects the current runway assignment that minimises the two-arrival weighted t_legal sum. Joint-LA-2 extends this to three arrivals (27 combinations) and, in the v4 environment, underperforms Joint-LA-1 by 7.4%. This is not an implementation failure: deeper deterministic lookahead with a greedy cost function amplifies rather than corrects the myopia of the per-step heuristic. The deterministic optimiser, given more future arrivals to consider, makes commitments that look locally optimal over the extended window but are globally worse because the cost function does not capture the value of reserving capacity for arrivals beyond the window.
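The two-step enumeration described above can be sketched as follows. This is an illustrative reconstruction, not the study's implementation: the function names, the runway-state representation (time and class of the last landing), the definition of t_legal as the later of the arrival time and the wake-separated slot, and all wake values are assumptions chosen only to make the example concrete. The 1:5:100 weights are those quoted in the text.

```python
from itertools import product

# Priority weights from the text (1:5:100).
WEIGHTS = {"N": 1.0, "H": 5.0, "E": 100.0}

# WAKE[leader][follower] in seconds. Illustrative values only, NOT the paper's matrix.
WAKE = {
    "N": {"N": 4.0, "H": 4.0, "E": 4.0},
    "H": {"N": 6.0, "H": 5.0, "E": 4.0},
    "E": {"N": 8.0, "H": 6.0, "E": 5.0},
}


def t_legal(state, arrival_time, cls):
    """Earliest legal landing time on a runway (assumed definition).

    state is (last_landing_time, last_class); last_class is None if unused.
    """
    last_time, last_cls = state
    if last_cls is None:
        return arrival_time
    return max(arrival_time, last_time + WAKE[last_cls][cls])


def joint_la1(runways, cur, nxt):
    """Pick the runway for `cur` by enumerating all (runway_cur, runway_next) pairs.

    cur and nxt are (arrival_time, class) tuples. Only the assignment of the
    current arrival is returned; the next arrival is re-planned later.
    """
    best_cost, best_runway = float("inf"), None
    for r_cur, r_next in product(range(len(runways)), repeat=2):
        t_cur = t_legal(runways[r_cur], *cur)
        states = list(runways)
        states[r_cur] = (t_cur, cur[1])  # runway state after landing `cur`
        t_next = t_legal(states[r_next], *nxt)
        cost = WEIGHTS[cur[1]] * t_cur + WEIGHTS[nxt[1]] * t_next
        if cost < best_cost:
            best_cost, best_runway = cost, r_cur
    return best_runway


# Three runways, a normal-class arrival now and an emergency arrival next.
runways = [(10.0, "E"), (8.0, "N"), (9.0, "H")]
cur, nxt = (10.0, "N"), (11.0, "E")
greedy = min(range(3), key=lambda r: t_legal(runways[r], *cur))
print("greedy:", greedy, "joint:", joint_la1(runways, cur, nxt))
```

In this toy instance the greedy rule sends the normal-class arrival to runway 1, where it can land earliest, whereas the joint enumeration sends it to runway 2 and keeps runway 1 free for the emergency that follows. The reservation extends only as far as the second arrival, which is the limitation discussed next.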
The value of reserving a runway for an unspecified future emergency does not appear in any K-step joint optimisation for finite K. Capacity reservation is a property of the expected state distribution at the arrival of the next emergency, which is a horizon-distant event whose timing is governed by the Poisson process. No finite lookahead window can capture this expectation; only a value function that integrates over the future emergency distribution—as V(s) does by construction—can express it.
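The contrast can be written compactly. The notation below is generic and introduced here for illustration only: w_{c_k} denotes the priority weight of the k-th arrival's class, a_k its runway assignment, and γ and r_t stand for a discount factor and per-step reward that may differ from the symbols used in the formal MDP definition.

```latex
% K-step joint lookahead: a finite sum over the next K known arrivals
J_K(a_0,\dots,a_{K-1}) \;=\; \sum_{k=0}^{K-1} w_{c_k}\, t^{\mathrm{legal}}_k(a_0,\dots,a_k)

% Value function: an expectation over the entire future arrival process
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; s_0 = s \right]
```

The first expression contains no term for arrivals beyond index K-1, so a runway held free for a later emergency contributes nothing to it. The second averages over all future arrival sequences, including those in which an emergency arrives shortly after the window closes.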
Two additional findings corroborate this account. Stochastic-LA and MCTS underperform Priority-FCFS by 10.5% and 12.8%, respectively: both evaluate future assignments using a greedy heuristic that encodes the same myopia as PFCFS, and deeper planning with a myopic evaluator amplifies rather than corrects this systematic error.
Reinforcement learning recovers the right inductive bias by minimising a Bellman residual: the learned value integrates over the future emergency distribution by definition, and the learned policy chooses actions that reflect this integral rather than a set of finite-horizon sample-path projections. This is why PPO, which learns V(s) from tens of thousands of simulated episodes, converges on a policy that matches the strongest heuristics without being programmed with their logic, and why it adapts when the environmental structure changes (as demonstrated by the wake symmetry ablation result of +46.5%).
6.3. Scope and Limitations
The scope under which our central claims have been validated is as follows. The agent has been trained and evaluated on a Poisson arrival process with rate 0.7 per simulation second and a fixed (60%, 25%, 15%) class mix on a three-runway aerodrome with per-runway queue capacity of three aircraft and per-class operational deadlines of 80 s (N), 55 s (H), and 30 s (E). We have demonstrated robustness to (i) deadline tightness (via the Moderate scenario), (ii) emergency-class weight (via the sensitivity sweep in the original v2.3 study, which remains informative), (iii) wake matrix structure and magnitude (via the wake symmetry ablation and wake-scaling experiment), (iv) aerodrome capacity and arrival load (via the nine-variant ablation), and (v) runway index permutation (via the permutation invariance test).
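To make the validated regime concrete, the arrival stream stated above can be sampled as in the following sketch. The rate, class mix, and per-class deadlines are taken from the text; the function name sample_arrivals, the tuple layout, and the convention that each deadline is measured from the moment of arrival are assumptions made for illustration.

```python
import random

ARRIVAL_RATE = 0.7                                   # arrivals per simulation second
CLASS_MIX = {"N": 0.60, "H": 0.25, "E": 0.15}        # fixed class proportions
DEADLINE_S = {"N": 80.0, "H": 55.0, "E": 30.0}       # per-class operational deadlines


def sample_arrivals(horizon_s, seed=0):
    """Sample (arrival_time, class, absolute_deadline) tuples up to horizon_s.

    Inter-arrival gaps are exponential, which yields a Poisson process.
    The deadline is assumed to run from the arrival time (our convention).
    """
    rng = random.Random(seed)
    classes, probs = zip(*CLASS_MIX.items())
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(ARRIVAL_RATE)
        if t > horizon_s:
            return arrivals
        cls = rng.choices(classes, weights=probs)[0]
        arrivals.append((t, cls, t + DEADLINE_S[cls]))


arrivals = sample_arrivals(1000.0, seed=1)
print(len(arrivals), arrivals[:3])
```

At a rate of 0.7 per second, a 1000 s horizon yields roughly 700 arrivals, of which about 15% carry the 30 s emergency deadline.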
External validity. The present study operates under constant arrival rate, fixed class proportions, a static three-runway configuration, and homogeneous within-class aircraft characteristics. Real-world disaster relief aerodromes may experience time-varying arrival intensity (e.g., batch arrivals following road clearance), weather-dependent runway occupancy times, and partial runway closures due to debris or damage. Communications latency between a remote ground-control station and the aerodrome may reduce the effective preview window. Individual UAVs may have heterogeneous flight characteristics (fixed-wing vs. rotary-wing, differing approach speeds) that affect landing-time predictability. None of these factors is modelled in the current formulation; each represents a worthwhile direction for future work. The per-UAV deadline mechanism provides a natural interface through which several of these factors—battery state, weather-induced delays, airspace holding—can be incorporated as scalar endurance adjustments supplied by upstream modules.
Wake separation matrix. The wake matrix W is derived from ICAO Doc 4444 civil aviation guidance with a uniform 10× temporal compression. Section 5.7 demonstrated that the PPO–PFCFS gap is robust to factor-of-four variations in absolute wake magnitudes, and the wake symmetry ablation (Section 5.8) confirmed that the qualitative asymmetry, rather than its precise numerical values, is the operative structural feature. Nevertheless, deployment in an operational disaster relief context would require recalibration of W to the specific airframe types in the responding fleet.
Additional factors. The following operational considerations are not represented in the current simulator but are amenable to incorporation within the proposed MDP framework: (i) go-around procedures (modellable as a stochastic transition on landing attempts); (ii) per-UAV fuel and battery constraints beyond the scalar deadline abstraction (representable as a separate observation feature); (iii) heterogeneous vehicle performance characteristics (implementable as class-specific runway occupancy times); and (iv) communication delays between the ground-control station and the aerodrome (reducible to a smaller effective preview window n_future_preview). The action mask mechanism, although capacity-constrained in the present study, provides a natural interface for incorporating dynamic constraints such as temporary runway closures due to debris or damage.
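As an indication of how little machinery the last extension would require, the sketch below combines the capacity constraint used in the present study with a hypothetical set of closed runways. The function name action_mask and its arguments are illustrative assumptions and do not correspond to the simulator's interface; the per-runway queue capacity of three is the value stated in the scope description.

```python
def action_mask(queue_lengths, queue_capacity=3, closed=()):
    """Return one boolean per runway: True if the runway may be assigned.

    A runway is masked out when its queue is full (the capacity constraint)
    or when it appears in `closed` (a hypothetical dynamic closure).
    """
    closed = set(closed)
    return [
        (idx not in closed) and (queued < queue_capacity)
        for idx, queued in enumerate(queue_lengths)
    ]


print(action_mask([3, 1, 0]))               # capacity only
print(action_mask([3, 1, 0], closed={2}))   # capacity plus a closure of runway 2
```

How the scheduler should behave when every runway is masked (for example, by holding or diverting the arrival) is a design decision that this sketch leaves open.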
6.4. Transfer to Urban Air Mobility
The priority-asymmetric multi-runway formulation is not specific to disaster relief operations and admits a natural transfer to forthcoming Urban Air Mobility (UAM) vertiport scheduling [25]. UAM vertiports are projected to handle a heterogeneous traffic mix (autonomous logistics drones, eVTOL passenger aircraft, and emergency medical services [26]) under constraints that closely parallel those studied here: limited vertipad capacity, wake separation or airspace deconfliction intervals, and battery-state-dependent operational deadlines. The per-UAV deadline mechanism is particularly relevant to the UAM setting, where battery state-of-charge imposes hard endurance constraints that vary across vehicles and missions. Recent work on graph-based RL for eVTOL fleet scheduling [18] and real-time UAM fleet management with LSTM-augmented PPO [19] has demonstrated the applicability of the RL-for-scheduling paradigm to vertiport operations, confirming that the methodological approach developed here generalises beyond the disaster relief domain.
7. Conclusions
We have presented a reinforcement learning approach to multi-runway UAV sequencing for disaster relief operations under extreme priority asymmetry and operational constraints. The PPO agent, trained with a deliberately minimalist reward function and a capacity-constrained action mask, matches the performance of Priority-FCFS within 2.7% (p = 0.124, not significant); Joint-LA-1 outperforms PPO by 3.2% (p = 0.043) on 100 paired evaluation episodes. The agent achieves this without any hand-crafted priority rules, capacity reservation heuristics, or safety constraints—the scheduling strategy, including its emergent runway-specialisation behaviour, is learned entirely from the reward signal.
Three findings define the contribution. First, the learned policy autonomously develops a runway-specialisation pattern—concentrating high-class traffic on a single strip (60% of H landings on R2) while routing emergency arrivals almost exclusively to the remaining strips (93% to R0 and R1)—that is invariant under runway-label permutation and robust to the introduction of per-UAV operational deadlines and finite queue capacity. This demonstrates that priority-aware capacity reservation can emerge from a minimalist reward design without embedded domain knowledge.
Second, simple heuristics are near-optimal under tight operational constraints. The Priority-FCFS rule of “land each arrival on whichever runway can accept it earliest” is an effective strategy when all aircraft face imminent deadlines, because it minimises the waiting time that leads to deadline violations. The value of learned scheduling depends on the temporal slack available in the system: under moderate deadlines, the PPO–PFCFS gap narrows to −0.5%, and the wake symmetry ablation—in which PPO outperforms PFCFS by 46.5%—demonstrates that the learned policy is robust to structural changes in the operating environment that degrade heuristic performance.
Third, deeper deterministic lookahead (Joint-LA-2), stochastic optimisation (Stochastic-LA), and online tree search (MCTS) do not close the gap to the learned policy. The structural reason—that capacity reservation is an expectation over the future emergency distribution, not a property of any finite lookahead window—constitutes a theoretical contribution of independent interest for the RL-for-operations literature.
Future work will extend the formulation to per-UAV deadline modelling with heterogeneous vehicle dynamics, validate the transfer to UAM vertiport scheduling under battery-state-dependent constraints, and develop continual-learning variants suitable for operations under evolving fleet compositions and arrival distributions.