1. Introduction
Fifth-generation (5G) mobile networks deliver high data rates and stringent quality of service (QoS) that guarantees low latency, high throughput, and reliable coverage through a dense and flexible radio access network (RAN). In adverse situations, such as natural disasters, power outages, or sudden traffic surges, fixed terrestrial base stations (BSs) may become unavailable or severely degraded. In these cases, rapidly deployable unmanned aerial vehicle base stations (UAV-BSs) offer a practical means to restore coverage and capacity. Yet, realizing dependable uplink connectivity with a UAV-BS is challenging: the air-to-ground (A2G) channel is dynamic and height-dependent, flight and power budgets are constrained, and user scheduling must satisfy minimum-rate requirements while coping with inter-user interference.
This work considers a scenario in which a fixed BS becomes inoperable and a single UAV-BS is dispatched to serve affected users. We target the uplink and adopt non-orthogonal multiple access (NOMA) with adaptive successive interference cancellation (SIC) to increase spectral efficiency under minimum per-user throughput constraints. The decision-making problem is inherently continuous and coupled: the UAV must select its 3D motion while the network jointly schedules users and allocates per-subchannel transmit powers. Classical trajectory planners (e.g., particle swarm or direct search) can struggle with scalability and non-stationarity, and value-based deep reinforcement learning (DRL) methods such as deep Q networks (DQN) operate in discrete action spaces and may suffer from overestimation bias and target-network instability [1,2,3,4,5,6]. In contrast, policy-gradient methods directly optimize in continuous action spaces [7]. Because the policy is stochastic, the reward an agent collects varies with randomness in both the environment and the action selection, which inflates the variance of the gradient estimate; conversely, overly large updates to the policy parameters can cause the policy to perform poorly. Mitigating both effects is central to stable policy-gradient training.
Motivated by these considerations, we develop a continuous-control actor–critic solution based on proximal policy optimization with generalized advantage estimation (PPO–GAE). PPO’s clipped surrogate objective improves training stability, while the GAE estimator balances bias–variance trade-offs for sample-efficient learning. We further enforce bounded actions to respect flight envelopes and power limits, ensuring safe-by-construction decisions. The resulting agent jointly controls UAV kinematics and uplink resource allocation to maximize the number of users whose minimum-rate constraints are satisfied.
Research Gap. Existing surveys synthesize broad UAV networking applications but do not provide a unified, learning-based formulation that jointly optimizes UAV motion, uplink NOMA scheduling with adaptive SIC, and per-subchannel power allocation under minimum-rate constraints in a realistic 3GPP A2G setting [8]. Metaheuristic trajectory designs (e.g., particle swarm and direct search) can improve channel quality [9], yet they usually decouple motion from radio resource management and do not exploit continuous-control RL. Offline neural surrogates for throughput prediction and deployment planning [10] bypass closed-loop control and are less responsive to fast channel and traffic fluctuations than on-policy methods such as PPO–GAE. Works on spectrum/energy efficiency in cognitive UAV networks [11] and DRL for downlink multi-UAV systems under fronthaul limits [12] address important but different regimes; they neither tackle the uplink NOMA case with adaptive SIC nor the bounded continuous-action control that jointly handles UAV kinematics and per-subchannel power in disaster-response scenarios. Consequently, there remains a need for a stable, continuous-control, learning-based framework that closes this gap.
Our contributions are summarized as follows:
Joint control formulation. We pose a coupled optimization that integrates UAV kinematics, uplink NOMA scheduling with adaptive SIC ordering, and per-subchannel power allocation under minimum user-rate constraints, with the objective of maximizing the number of served users.
Bounded-action PPO–GAE agent. We design a continuous-action actor–critic algorithm (PPO–GAE) with explicit action bounding for flight and power feasibility, yielding stable learning and safe-by-construction decisions.
Realistic A2G modeling and robustness. We employ a 3GPP-compliant A2G channel and evaluate robustness to imperfect SIC and channel-state information (CSI), capturing practical impairments often overlooked in prior art.
Ablation studies. We isolate the gains due to (i) NOMA vs. OMA, (ii) adaptive SIC ordering, and (iii) bounded-action parameterization and quantify their individual and combined benefits.
Reproducibility. We release complete code and configurations to facilitate verification and extension by the community.
Why PPO instead of DQN? DQN assumes a discrete, typically small action space and is prone to overestimation bias and target-network lag. PPO, by contrast, directly optimizes a stochastic policy over continuous actions and uses a clipped objective to curb destructive policy updates. This is well aligned with the continuous, multi-dimensional action vector arising from simultaneous UAV motion and power-control decisions, and it yields improved training stability and sample efficiency compared with value-based baselines [1,2,3,4,5,6].
The remainder of this paper is organized as follows.
Section 2 reviews related work on UAV-enabled cellular systems, NOMA scheduling, and DRL for wireless control.
Section 3 details the system model, the problem formulation, and the proposed bounded-action PPO–GAE algorithm.
Section 4 presents quantitative results, including ablations and robustness analyses.
Section 5 discusses insights, practical implications, and limitations.
Section 6 concludes the paper and outlines future directions.
2. Related Work
UAV-assisted 5G networking has attracted sustained interest across wireless communications, while reinforcement learning (RL) has emerged as a powerful tool for control and resource optimization in nonstationary environments. Within this broad landscape, our work targets a specific and underexplored setting: uplink emergency connectivity restoration with a single UAV acting as an aerial base station, under realistic 3GPP urban macro (UMa) air-to-ground channels and practical device constraints.
In [13], the authors study energy sustainability for UAVs via wireless power transfer from flying energy sources, coordinating multiple agents with multi-agent DRL (MADRL). Their objective emphasizes maximizing transferred energy and coordinating energy assets. By contrast, we address emergency connectivity restoration for ground users with a single aerial base station, focusing on minimum-rate coverage under UE power limits and receiver noise. Methodologically, we employ a bounded-action PPO–GAE agent to jointly control UAV kinematics and uplink resource allocation, whereas [13] centers on energy-transfer optimization and multi-agent coordination.
Trajectory learning without side information has been demonstrated in [14], where deterministic policy gradients learn UAV paths in a continuous action space. Our formulation differs in both scope and modeling: we couple UAV motion with uplink NOMA scheduling (with adaptive SIC) and per-subchannel power allocation, and we train an actor–critic PPO–GAE agent under realistic 3GPP UMa LoS/NLoS channels with rigorous ablations isolating the effects of NOMA, SIC ordering, and bounded-action parameterization.
A broad survey in [15] reviews supervised, unsupervised, semi-supervised, RL, and deep learning techniques for UAV-enabled wireless systems, highlighting the promise of learning-based control. Our approach contributes to this line by casting emergency uplink access as a continuous-control problem and by leveraging PPO–GAE with action squashing to ensure feasibility under flight and power constraints.
Work in [16] considers multiple UAVs serving as aerial base stations during congestion, aiming to maximize throughput. The solution combines k-means clustering with a DQN variant, separating user clustering from UAV control. In contrast, we focus on disaster-response scenarios where establishing connectivity with minimum-rate guarantees is paramount; we jointly optimize motion, UL-NOMA scheduling with adaptive SIC, and per-subchannel power in a single learning loop. Unlike [16], our setting enforces minimum-rate fairness, adopts 3GPP-compliant channel modeling, and respects UE transmit-power limits, while avoiding the discretization and overestimation issues that can affect DQN in continuous domains.
The authors of [17] investigate UAV-aided MEC trajectory optimization for IoT latency/QoE, primarily benchmarking computing-centric baselines. Our problem is communication-centric: we model 3GPP UMa LoS/NLoS propagation, receiver noise figures, and UE power caps, and we optimize the uplink access process itself rather than edge-computing pipelines.
Energy-efficiency maximization with quantum RL is explored in [18], where a layerwise quantum actor–critic with quantum embeddings is proposed. While they mention disaster recovery, their primary metric is energy efficiency. We target user-side QoS during emergencies, adopting bounded-action PPO–GAE (with squashed distributions) to stabilize continuous control under kinematic and power constraints; our method is immediately deployable on classical hardware and directly aligned with current 5G UAV-assisted systems.
Path planning for post-disaster environments is addressed in [19] via an Adaptive Grey Wolf Optimization (AGWO) algorithm focused on trajectory efficiency. Our formulation instead treats a joint communication–control problem for UAV-assisted uplink access with NOMA and adaptive SIC, solved via a continuous-control RL agent.
Finally, ref. [20] studies joint resource allocation and UAV trajectory optimization in downlink UAV-NOMA networks with QoS guarantees using a heuristic matching-and-swapping scheduler and convex optimization. We consider the complementary uplink case in disaster response, replacing heuristic matching with an RL-driven policy (bounded-action PPO–GAE) that adapts online across varied user spatial distributions.
2.1. UAV Path and Trajectory Optimization: Prior Art and Research Gap
Trajectory and placement optimization for UAVs spans surveillance, mapping, IoT data collection, and cellular augmentation. Surveys synthesize challenges in 3D placement and motion planning under realistic constraints, emphasizing the coupling between mobility and communication objectives [21,22,23,24,25,26,27,28]. Algorithmically, metaheuristics (e.g., improved RRT with ACO) address obstacle avoidance; continuous-control RL methods (e.g., DDPG; TD3) have been applied to target tracking and data collection under imperfect CSI. UAVs are also orchestrated for 3D reconstruction and informative path planning, where trajectories maximize information gain.
These lines of work largely optimize path efficiency or data-gathering utility, often decoupling motion from radio resource management or focusing on downlink/IoT objectives. Such decoupling limits system performance because UAV trajectory or placement decisions are made without considering instantaneous channel conditions or interference patterns, while power control and scheduling are optimized for static positions. This separation can yield locally optimal but globally inefficient behavior, where the UAV hovers in coverage-poor regions or allocates power sub-optimally. In contrast, our coupled optimization jointly updates UAV motion and per-subchannel resource allocation within a single policy, enabling the agent to reposition adaptively to improve link quality, spectral efficiency, and fairness across users. We close a specific gap: uplink emergency access with minimum-rate constraints, where the UAV must jointly (i) respect kinematic limits, (ii) schedule UL-NOMA users with adaptive SIC, and (iii) allocate per-subchannel powers—all under a realistic 3GPP UMa channel. Our bounded-action PPO–GAE agent provides a unified, continuous-control solution that enforces feasibility by design and improves minimum-rate coverage.
2.2. State of the Art in UAV Wireless Optimization and the Disaster-Response Uplink Gap
UAV-enabled wireless systems have been optimized for security, energy, spectrum efficiency, and waveform robustness. Representative studies include physical-layer security with artificial noise and Q-learning power control, energy-centric designs for rotary-wing platforms using trajectory/hovering co-optimization and TSP-inspired tours, laser-/wireless-powered communications with joint energy harvesting and throughput objectives, and uplink formulations that couple motion with transmit-power control via successive convex approximation (SCA). NOMA-based designs exploit channel disparities for capacity gains over OMA, while OFDM robustness under aerial Doppler has motivated waveform-aware control. Disaster scenarios have been examined through fading/topology models and aerial overlay architectures; game-theoretic approaches address adversarial jamming in vehicular IoT [29,30,31,32,33,34,35,36,37,38,39,40].
Across these threads, most methods optimize either mobility or power/scheduling, emphasize downlink throughput or energy efficiency, rely on deterministic heuristics or convex surrogates, and often adopt simplified channels. Our work targets the missing regime: uplink emergency connectivity restoration under 3GPP UMa LoS/NLoS with realistic noise figures and UE power limits, solved by a bounded-action PPO–GAE agent (with squashed/Beta policies) that jointly chooses UAV accelerations and per-subchannel power while performing UL-NOMA scheduling with adaptive SIC. Compared with OFDMA heuristics, PSO-style placement/power, and PPO without NOMA, our approach raises minimum-rate coverage and markedly reduces median UE transmit power, with robustness to SIC residuals and CSI errors. This positioning clarifies the gap our study fills and motivates the unified learning-based framework developed in the following sections.
3. Materials and Methods
This section provides a concise yet comprehensive description of the uplink air-to-ground (A2G) scenario, user distribution, parameter initialization, experiment model, optimization problem, constraints, and the reinforcement learning solution framework.
3.1. Initialization
Scenario assumptions and resources. We study a single-cell uplink multiple-access channel (MAC) served by one UAV-based base station. Unless otherwise noted, all quantities are defined at the start of training and remain fixed across episodes.
Users and channel setting.
A set of users, , transmit to the UAV over frequency-selective channels impaired by additive Gaussian noise. Per-user channel gains and power variables are denoted and , respectively.
Spectrum partitioning and MAC policy.
The system bandwidth is
, partitioned into
orthogonal subchannels,
, each with
. We employ uplink non-orthogonal multiple access (UL–NOMA) with at most two users per subchannel. User
n allocates power
subject to the per-UE budget
which is enforced in the optimization (see constraints). The UL–NOMA pairing and successive interference cancellation (SIC) rule are detailed later and used consistently during training (see Algorithm 1).
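For concreteness, the two-user UL–NOMA rate structure on a single subchannel can be sketched as follows. The sketch assumes Shannon rates, ideal SIC, and a stronger-received-signal-first decoding order; the symbol names (`p`, `g`, `bw_hz`, `noise_w`) and the fixed ordering are illustrative assumptions, since the paper's own SIC rule adapts the order.

```python
import math

def noma_pair_rates(p1, g1, p2, g2, bw_hz, noise_w):
    """Shannon rates (bit/s) for a two-user uplink NOMA pair on one subchannel.

    The receiver decodes the stronger received signal first, treating the
    weaker one as interference, then cancels it (ideal SIC) and decodes the
    weaker signal interference-free.
    """
    # Order users by received power so index 0 is decoded first.
    (pa, ga), (pb, gb) = sorted([(p1, g1), (p2, g2)],
                                key=lambda ug: ug[0] * ug[1], reverse=True)
    sinr_first = (pa * ga) / (pb * gb + noise_w)   # decoded under interference
    sinr_second = (pb * gb) / noise_w              # decoded after cancellation
    return (bw_hz * math.log2(1.0 + sinr_first),
            bw_hz * math.log2(1.0 + sinr_second))
```

The second user enjoys an interference-free rate, which is the capacity motivation for pairing users with disparate channel gains.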
| Algorithm 1 Bounded-action PPO–GAE for joint UAV motion, UL–NOMA scheduling, and power allocation [41]. The procedure alternates between (i) trajectory collection under the frozen policy , (ii) advantage/target computation (GAE), and (iii) minibatch PPO updates for actor and critic with clipping. Symbols and losses are defined in Section 3 (System/Rate Models) and (RL Formulation).
Require: Initial actor parameters θ, critic parameters w; horizon T; number of parallel actors E; PPO clip ε; discount γ; GAE parameter λ; number of epochs K; minibatch size M.
Ensure: Updated parameters that maximize the coverage-driven reward.
1: θ_old ← θ ▹ Sync old and current policy parameters
2: for each training iteration do
   Phase A: Trajectory collection (frozen policy)
3:   for each actor in parallel do
4:     Roll out the frozen policy for T steps and store transitions
5:   end for
6:   Concatenate all actors' trajectories into a dataset
   Phase B: Advantage and target computation (GAE)
7:   for each time index t do
8:     Compute advantages
9:     Compute critic targets
10:    end for
   Phase C: PPO updates (minibatch, K epochs)
11:   for epoch k = 1, …, K do
12:     for each minibatch of size M do
13:       Actor step: maximize the clipped surrogate (with clip ε)
14:       Critic step: minimize the value loss using the targets
15:     end for
16:   end for
17:   Policy sync: θ_old ← θ
18: end for
User field (spatial layout).
Users are uniformly instantiated within a square deployment area. Alternative layouts (e.g., clustered, ring, and edge-heavy) can be sampled for robustness; the initialization here defines the default field for the baseline experiments.
UAV platform and kinematic bounds.
A single UAV acts as the RL agent and is controlled via 3D acceleration commands under hard feasibility limits:
- −
Altitude (≤400 ft);
- −
Speed (≈100 mph);
- −
Acceleration (bounded in magnitude).
These bounds are encoded in the action parameterization to ensure feasibility (see PPO action head and spherical parameterization).
Time discretization.
The environment advances in fixed steps of , which is used consistently in the kinematic updates, scheduling decisions, and reward aggregation.
Regulatory note. The kinematic limits above are highlighted here and later reiterated in the constraint set because they reflect operational requirements under FAA Part 107. Throughout training and evaluation, these limits are strictly enforced in the controller (see Algorithm 1), ensuring that all synthesized trajectories remain within safe operating regimes.
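A minimal sketch of how the kinematic bounds above can be enforced inside the controller is given below. The SI conversions (400 ft ≈ 121.92 m; 100 mph ≈ 44.7 m/s) follow the stated limits; the acceleration bound and step size are illustrative assumptions, not the paper's values.

```python
import math

# FAA Part 107 limits in SI units (from the stated bounds).
ALT_MAX_M = 400 * 0.3048   # 400 ft ceiling
V_MAX_MS = 44.7            # ~100 mph ground speed
A_MAX_MS2 = 5.0            # assumed acceleration bound
DT_S = 1.0                 # assumed time step

def step_kinematics(pos, vel, acc):
    """One Euler step of the UAV state with hard feasibility clipping."""
    # Clip the commanded acceleration magnitude.
    a_norm = math.sqrt(sum(a * a for a in acc))
    if a_norm > A_MAX_MS2:
        acc = [a * A_MAX_MS2 / a_norm for a in acc]
    vel = [v + a * DT_S for v, a in zip(vel, acc)]
    # Clip the speed magnitude.
    v_norm = math.sqrt(sum(v * v for v in vel))
    if v_norm > V_MAX_MS:
        vel = [v * V_MAX_MS / v_norm for v in vel]
    pos = [p + v * DT_S for p, v in zip(pos, vel)]
    # Enforce the altitude ceiling (and keep the UAV airborne).
    pos[2] = min(max(pos[2], 0.0), ALT_MAX_M)
    return pos, vel
```

Because clipping is applied before integration, every synthesized trajectory stays inside the safe operating regime regardless of the raw policy output.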
Notation and Symbols
Table 1 summarizes the main symbols used throughout the section; these symbols are referenced within the channel model, the throughput/SIC expressions, and the training pseudocode (as shown in Algorithm 1).
3.2. System Model
This study adopts the Al-Hourani air-to-ground (A2G) path loss model for UAV communications, which is widely used and validated in the literature [42,43,44]. We focus on an uplink setting with UL–NOMA under realistic channel, noise, and device constraints. The geometric relationships and line-of-sight (LoS) behavior, as well as elevation-dependent trends in loss and LoS probability, are illustrated in Section 3.2.1 and later used by the rate model. UAV mobility intrinsically alters radio geometry through distance-dependent path loss, altitude-dependent LoS likelihood, and interference coupling, thereby shifting per-user SINR and QoS guarantees. The proposed formulation therefore couples the kinematic actions with resource allocation within a single objective.
3.2.1. A2G Channel in 3GPP UMa
Environment and propagation modes.
Following [45,46], the UAV base station (UAV-BS) is modeled as a low-altitude platform (LAP) operating in a 3GPP urban macro (UMa) environment. Radio propagation alternates probabilistically between LoS and NLoS conditions depending primarily on the elevation angle between the UAV and the given user equipment (UE). In the 3GPP UMa environment, it is standard practice to account for all parameters required to compute the path loss under LoS and NLoS conditions, as shown in Figure 1; these values in turn determine the per-subchannel channel gains of the users and their throughput.
Geometry and LoS probability (Al-Hourani).
Let the UAV be at horizontal coordinates $(x, y)$ and altitude $h$, and user $n$ be at $(x_n, y_n, 0)$. The ground distance and slant range are
$$ r_n = \sqrt{(x - x_n)^2 + (y - y_n)^2}, \qquad d_n = \sqrt{r_n^2 + h^2}. $$
With elevation angle (degrees) $\theta_n = \tfrac{180}{\pi}\arctan(h / r_n)$, the LoS probability is
$$ P_{\mathrm{LoS}}(\theta_n) = \frac{1}{1 + a \exp\!\left(-b\,(\theta_n - a)\right)}, $$
with environment-dependent constants $a$ and $b$ for urban environments [45]. Larger elevation angles typically increase $P_{\mathrm{LoS}}$, but higher altitudes also increase the distance $d_n$, creating a distance–visibility trade-off.
Path loss (dB) and effective channel gains (linear).
Free-space loss at carrier frequency $f_c$ is
$$ \mathrm{FSPL}(d_n) = 20 \log_{10}\!\left(\frac{4\pi f_c d_n}{c}\right), $$
with slant range $d_n$ and $c$ the speed of light. Excess losses for UMa are typically $\eta_{\mathrm{LoS}}$ dB and $\eta_{\mathrm{NLoS}}$ dB, yielding
$$ \mathrm{PL}_{\mathrm{LoS}} = \mathrm{FSPL}(d_n) + \eta_{\mathrm{LoS}}, \qquad \mathrm{PL}_{\mathrm{NLoS}} = \mathrm{FSPL}(d_n) + \eta_{\mathrm{NLoS}}. $$
Convert to linear scale before mixing, $g = 10^{-\mathrm{PL}/10}$, and form the effective per-UE, per-subchannel gain as
$$ \bar{g}_n = P_{\mathrm{LoS}}\, g_{\mathrm{LoS}} + \left(1 - P_{\mathrm{LoS}}\right) g_{\mathrm{NLoS}}. $$
Computation recipe (linked to Figure 1).
1. Compute $r_n$ and $d_n$ via (2).
2. Evaluate $P_{\mathrm{LoS}}$ using (3); set $P_{\mathrm{NLoS}} = 1 - P_{\mathrm{LoS}}$.
3. Compute $\mathrm{FSPL}$ and add the excess losses in (5).
4. Convert to linear gains via (6).
5. Mix LoS/NLoS per (7) (or sample the state in Monte Carlo).
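The recipe above can be sketched end-to-end as follows. The Al-Hourani constants (a = 9.61, b = 0.16) and excess losses (1 dB LoS, 20 dB NLoS) are commonly cited urban defaults, assumed here for illustration; the paper's own parameter values may differ.

```python
import math

C = 3.0e8  # speed of light (m/s)

def effective_gain(r_m, h_m, fc_hz, a=9.61, b=0.16,
                   eta_los_db=1.0, eta_nlos_db=20.0):
    """Mean linear channel gain for one UE, following the five-step recipe."""
    d = math.hypot(r_m, h_m)                               # slant range
    theta = math.degrees(math.atan2(h_m, r_m))             # elevation (deg)
    p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))   # LoS probability
    fspl_db = 20.0 * math.log10(4.0 * math.pi * fc_hz * d / C)
    pl_los_db = fspl_db + eta_los_db                       # add excess losses
    pl_nlos_db = fspl_db + eta_nlos_db
    g_los = 10.0 ** (-pl_los_db / 10.0)                    # to linear scale
    g_nlos = 10.0 ** (-pl_nlos_db / 10.0)
    return p_los * g_los + (1.0 - p_los) * g_nlos          # probabilistic mix
```

As expected from the distance–visibility trade-off, the mean gain decays with ground distance at fixed altitude.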
Interpretation and design intuition.
Raising altitude improves visibility (higher LoS probability) but increases the slant distance (and hence path loss). Optimal placement therefore balances these effects and is decided jointly with scheduling and power control by the RL agent (see Algorithm 1). The net elevation trends are shown next.
3.2.2. Throughput Model (UL–NOMA with SIC)
3.2.3. Aerodynamic/Kinematic Update Model
3.3. Problem Formulation
We jointly optimize per-UE power allocation and UAV acceleration to maximize rate coverage, i.e., the number of users meeting a target throughput within each episode. Unless otherwise stated, we consider
users over a
area and
subchannels. Time is slotted as
with step
s and horizon
(i.e., 200 time steps per episode; training uses 1000 episodes). The chosen user density aligns with typical population scales [48] and demonstrates scalability.
Scope, horizon, and decision variables.
At each time t, the controller selects (i) per-UE per-subchannel transmit powers and (ii) the UAV acceleration vector. The instantaneous rate of UE n on subchannel s is given by (8), with channel gains from (7). The aggregate rate of UE n is the sum of its per-subchannel rates, where each rate follows Shannon's law with the UL–NOMA/SIC interference structure (cf. System Model).
Objective: rate coverage maximization.
Let
be the per-UE target rate. Define the coverage indicator
The optimization objective over an episode is
Constraints. We enforce communication, kinematic, geofencing, and regulatory constraints at every time step t:
- −
- −
- −
UL–NOMA scheduling and SIC order (at most two UEs per subchannel; valid SIC decoding order):
- −
Kinematics (acceleration and velocity; FAA bound [49]):
- −
Geofencing (UAV horizontal):
- −
User field (fixed deployment region):
- −
Altitude bound (FAA Part 107 [49]):
Solution strategy.
We solve the coupled communication–control problem with a reinforcement learning approach based on proximal policy optimization with generalized advantage estimation (PPO–GAE), using bounded continuous actions to ensure feasibility and training stability; see Algorithm 1 for the training loop placed near its first reference in the text.
Reward shaping.
To align the RL objective with rate coverage while enforcing feasibility, the per-step reward is
with a composite penalty
where each term is implemented as a hinge (or indicator) cost that activates only upon violation. The weighting coefficients reflect the relative importance of each constraint and were empirically tuned; these values ensure that constraint violations impose noticeable but non-destabilizing penalties, allowing the agent to learn feasible and stable UAV behavior while maximizing rate coverage.
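A minimal sketch of the hinge-shaped reward is given below. The bound values, penalty weights, and function names are illustrative assumptions; the paper's tuned coefficients are not reproduced here.

```python
def hinge(value, limit):
    """Penalty that activates only when `value` exceeds `limit`."""
    return max(0.0, value - limit)

def shaped_reward(n_covered, speed, altitude, tx_powers,
                  v_max=44.7, alt_max=121.92, p_max=0.2,
                  w_kin=1.0, w_alt=1.0, w_pow=1.0):
    """Coverage count minus weighted hinge penalties (illustrative weights)."""
    penalty = (w_kin * hinge(speed, v_max)
               + w_alt * hinge(altitude, alt_max)
               + w_pow * sum(hinge(p, p_max) for p in tx_powers))
    return float(n_covered) - penalty
```

When all constraints hold, every hinge is zero and the reward reduces exactly to the coverage count, so feasible behavior is never penalized.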
State and actions.
State at time t. We include (i) channel gain status , (ii) UAV kinematic state (previous position/velocity) , and (iii) current allocation summaries (e.g., or normalized logits, subchannel occupancy, and recent SIC order statistics).
Actions at time t. Two heads are produced by the actor: (i) UAV acceleration , and (ii) per-UE per-subchannel power fractions that are normalized into feasible powers.
Action representation (spherical parameterization).
To guarantee that the acceleration magnitude never exceeds its bound by design, the actor outputs spherical parameters (a magnitude fraction $\kappa$, a polar angle $\vartheta$, and an azimuth $\varphi$; symbol names illustrative) and maps them to the Cartesian acceleration
$$ a_x = \kappa\, a_{\max} \sin\vartheta \cos\varphi, \qquad a_y = \kappa\, a_{\max} \sin\vartheta \sin\varphi, \qquad a_z = \kappa\, a_{\max} \cos\vartheta, $$
with feasible domains $\kappa \in [0, 1]$, $\vartheta \in [0, \pi]$, and $\varphi \in [0, 2\pi)$. This parameterization simplifies constraint handling and improves numerical stability [50].
Action sampling (Beta-distribution heads).
The magnitude fraction, polar angle, and azimuth are each sampled from Beta distributions, naturally producing values on the unit interval. After linear remapping (for the angles), this yields bounded, well-behaved continuous actions and stable exploration near the acceleration limit.
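The Beta-head sampling and spherical mapping can be sketched as follows. The acceleration bound and the (alpha, beta) shape parameters are illustrative assumptions.

```python
import math
import random

A_MAX = 5.0  # assumed acceleration bound (m/s^2)

def sample_acceleration(alpha_beta_params):
    """Sample a bounded 3D acceleration via Beta heads + spherical mapping.

    `alpha_beta_params` holds (alpha, beta) pairs for the magnitude fraction,
    polar angle, and azimuth. Beta samples live on [0, 1] and are linearly
    remapped, so the Cartesian vector is feasible by construction.
    """
    (a_m, b_m), (a_th, b_th), (a_ph, b_ph) = alpha_beta_params
    mag = A_MAX * random.betavariate(a_m, b_m)             # |a| in [0, A_MAX]
    theta = math.pi * random.betavariate(a_th, b_th)       # polar in [0, pi]
    phi = 2.0 * math.pi * random.betavariate(a_ph, b_ph)   # azimuth in [0, 2pi)
    return (mag * math.sin(theta) * math.cos(phi),
            mag * math.sin(theta) * math.sin(phi),
            mag * math.cos(theta))
```

No clipping or squashing of a Gaussian is needed: the sampled vector satisfies the acceleration bound for every draw, which avoids the boundary-gradient pathologies of truncated distributions.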
Training hyperparameters.
To ensure stable learning and reproducibility across layouts, we adopt a conservative PPO–GAE configuration drawn from widely used defaults and tuned with small grid sweeps around clipping, entropy, and rollout length. Unless otherwise stated, all values in
Table 2 are fixed across experiments, with linear learning-rate decay and early stopping to avoid overfitting to any one topology.
PPO–GAE framework (losses and advantages).
The actor is trained with the clipped surrogate,
$$ L^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \rho_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, $$
while the critic minimizes the value regression loss,
$$ L^{V}(w) = \mathbb{E}_t\!\left[ \left( V_w(s_t) - \hat{V}_t \right)^2 \right]. $$
Generalized advantage estimation uses the TD error
$$ \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t) $$
and
$$ \hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}, $$
balancing bias and variance via $\lambda \in [0, 1]$.
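The backward-recursive form of the GAE advantage can be sketched in a few lines (function and argument names are illustrative):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` has length len(rewards) + 1 (a bootstrap value is appended).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); advantages are the
    (gamma*lambda)-discounted sums of deltas, computed backwards in one pass.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    # Critic regression targets: advantage plus the value baseline.
    targets = [a + v for a, v in zip(adv, values[:-1])]
    return adv, targets
```

Setting lambda = 0 recovers one-step TD errors (low variance, high bias), while lambda = 1 recovers Monte Carlo returns minus the baseline (high variance, low bias).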
Training loop and placement.
The PPO–GAE training procedure for joint UAV motion, UL–NOMA scheduling, and power allocation is summarized in Algorithm 1.
3.4. Evaluation Protocol, Baselines, and Metrics
This subsection specifies the data generation, train/validation splits, baselines, metrics, statistical treatment, and ablations used to evaluate the proposed solution.
User layouts (testbeds).
We evaluate four canonical spatial layouts within a
field (
Section 3, Initialization):
Uniform: users are sampled i.i.d. uniformly over .
Clustered: users are drawn from a mixture of isotropic Gaussian clusters (centers sampled uniformly in the field; cluster spreads chosen to keep users within bounds).
Ring: users are placed at an approximately fixed radius around the field center with small radial/azimuthal jitter, producing pronounced near–far differences.
Edge-heavy: sampling density is biased toward the four borders (users within bands near the edges), emulating disadvantaged cell-edge populations.
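The four layouts can be sketched with a single generator. Cluster count and spread, ring radius and jitter, and edge-band width are illustrative assumptions, not the paper's exact parameters.

```python
import math
import random

def sample_users(layout, n, side=1000.0, seed=0):
    """Generate n user positions in a side-by-side field for the four layouts."""
    rng = random.Random(seed)
    clip = lambda v: min(max(v, 0.0), side)
    users = []
    if layout == "uniform":
        users = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    elif layout == "clustered":
        centers = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(4)]
        for _ in range(n):
            cx, cy = rng.choice(centers)
            users.append((clip(rng.gauss(cx, side * 0.05)),
                          clip(rng.gauss(cy, side * 0.05))))
    elif layout == "ring":
        for _ in range(n):
            ang = rng.uniform(0, 2 * math.pi)
            rad = side * 0.35 + rng.gauss(0, side * 0.02)
            users.append((clip(side / 2 + rad * math.cos(ang)),
                          clip(side / 2 + rad * math.sin(ang))))
    elif layout == "edge_heavy":
        band = side * 0.1
        for _ in range(n):
            edge = rng.randrange(4)
            along, depth = rng.uniform(0, side), rng.uniform(0, band)
            users.append({0: (along, depth), 1: (along, side - depth),
                          2: (depth, along), 3: (side - depth, along)}[edge])
    return users
```

Seeding the generator per evaluation run keeps the held-out layouts reproducible across methods.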
Unless stated otherwise, the UAV geofence is the
square centered on the user field (
Section 3), with altitude constrained by FAA Part 107.
Train/validation protocol.
Training uses 1000 episodes with horizon
time steps (step
). We employ five training random seeds and five
disjoint evaluation seeds (
Section 3), fixing all hyperparameters across runs. After each PPO update (rollout length 2048 steps), we evaluate the current policy on held-out seeds and report the mean and 95% confidence intervals (CIs).
Baselines.
We compare the proposed PPO+UL–NOMA agent against the following:
- −
PPO (OFDMA): same architecture/hyperparameters, but limited to one UE per subchannel (OMA) and no SIC.
- −
OFDMA + heuristic placement: grid/elevation search for a feasible hovering point and altitude; OFDMA scheduling with per-UE power budget.
- −
PSO (placement/power): particle swarm optimization over plus a global per-UE power scaling factor; OFDMA scheduling.
All baselines share the same bandwidth, noise figure, and UE power constraints as the proposed method. An ‘exhaustive’ enumeration of the joint decision space—UAV position/altitude together with user association and power allocation—exhibits combinatorial growth and quickly becomes intractable even for moderate problem sizes. Consequently, we benchmark against competitive heuristic and learning baselines that are standard in the literature. Our choice of PPO reflects its robust training dynamics, clipped-objective regularization, and straightforward hyperparameterization for coupled motion-and-allocation control. A broader benchmark against SARSA/A3C and additional actor–critic variants is an important extension we plan to pursue in follow-up work.
Ablations.
To quantify the contribution of each component we perform the following:
No-NOMA: PPO agent with OMA only.
- −
Fixed SIC order: PPO+NOMA with a fixed decoding order (ascending received power), disabling adaptive reordering.
- −
No mobility: PPO+NOMA with UAV motion frozen at its initial position (power/scheduling still learned).
- −
Robustness sweeps: imperfect SIC residual factor and additive CSI perturbations to channel gains.
Primary and secondary metrics.
The primary metric is rate coverage, i.e., the fraction of users meeting the minimum rate $R_{\min}$:
$$ \mathrm{Coverage} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\{ R_n \ge R_{\min} \}. $$
Secondary metrics include (i) per-user rate CDFs to characterize fairness, (ii) median UE transmit power to reflect energy burden at the user side, and (iii) training curves (coverage vs. PPO updates) to assess convergence behavior.
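The primary and median-power metrics reduce to simple aggregations (symbol and function names assumed for illustration):

```python
import statistics

def rate_coverage(rates, r_min):
    """Fraction of users whose achieved rate meets the minimum-rate target."""
    return sum(r >= r_min for r in rates) / len(rates)

def median_ue_power(tx_powers):
    """Secondary metric: median UE transmit power (user-side energy burden)."""
    return statistics.median(tx_powers)
```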
Statistical treatment and reporting.
For each configuration (layout × method), we average metrics over evaluation seeds and episodes and report the mean ± 95% CI. CIs are computed from the empirical standard error under a t-distribution with degrees of freedom equal to the number of independent trials minus one. Where appropriate (paired comparisons across seeds), we also report percentage-point (pp) gains.
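The CI computation described above can be sketched as follows, assuming five independent trials (df = 4); the small lookup table of two-sided 95% t critical values avoids external dependencies and is an implementation convenience, not part of the paper's pipeline.

```python
import math

# Two-sided 95% t critical values for small degrees of freedom.
T_CRIT = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}

def mean_ci95(samples):
    """Mean and half-width of the 95% CI under a t-distribution (df = n - 1)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)   # sample variance
    sem = math.sqrt(var / n)                                # standard error
    return mean, T_CRIT[n - 1] * sem
```

Paired percentage-point gains follow by differencing per-seed metrics before applying the same interval formula.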
Reproducibility.
All random seeds, environment initializations, and hyperparameters are logged. The symbols used throughout are collected in
Table 1 (
Section 3).
5. Discussion and Limitations
This section interprets the empirical findings, discusses practical implications for UAV-assisted uplink connectivity, and identifies limitations and avenues for future research.
5.1. Key Findings and Practical Implications
Coverage gains across diverse layouts.
Across uniform, clustered, ring, and edge-heavy deployments, the proposed PPO+UL–NOMA agent consistently improves
rate coverage relative to strong baselines (
Table 4 and
Table 5). Typical gains over PPO with OFDMA lie in the 8–10 pp range, with the largest improvements in edge-heavy scenarios (
Figure 8) where near–far disparities are most pronounced and adaptive SIC can be exploited effectively.
Fairness and user experience.
Per-user rate CDFs (
Figure 13a–d) show that the learned policy not only raises average performance but also shifts the distribution upward so that
most users exceed
. This is particularly relevant for emergency and temporary coverage, where serving many users with a minimum quality of service (QoS) is paramount.
Lower user-side power.
Relative to baselines, the learned controller reduces median UE transmit power (by up to tens of percent in our runs), reflecting more favorable placement and pairing decisions. Lower UE power is desirable for battery-limited devices and improves thermal/noise robustness at the receiver.
Feasibility by design.
The
bounded-action parameterization guarantees kinematic feasibility, contributing to stable training and trajectories that respect FAA altitude and speed limits. The learned paths (
Figure 12) exhibit quick exploration followed by convergence to stable hovering locations that balance distance and visibility (
Section 3).
5.2. Limitations and Threats to Validity
Single-UAV, single-cell abstraction.
Results are obtained for one UAV serving a single cell. Interference coupling and coordination in multi-UAV, multi-cell networks are not modeled and may affect achievable coverage.
Channel and hardware simplifications.
We adopt a widely used A2G model (Al-Hourani in 3GPP UMa) with probabilistic LoS/NLoS and a fixed noise figure. Small-scale fading dynamics, antenna patterns, and hardware impairments (e.g., timing offsets) are abstracted, and Shannon rates are used as a proxy for link adaptation.
Energy, endurance, and environment.
UAV battery dynamics, wind/gusts, no-fly zones, and backhaul constraints are outside our scope. These factors can influence feasible trajectories and airtime.
Objective design.
We optimize rate coverage at a fixed . Other system objectives, e.g., joint optimization of coverage, average throughput, and energy, introduce multi-objective trade-offs that we do not explore here.
Simulation-to-reality gap.
Although our simulator is based on standardized 3GPP UMa channel models with realistic LoS/NLoS probability, noise figure, and power constraints, differences from real-world measurements (e.g., due to small-scale fading, hardware impairments, or environmental obstructions) may cause a simulation-to-reality gap. We partially address this by training across multiple user distributions to avoid overfitting to a single topology and by using parameter values grounded in physical measurements. Future work will explore domain randomization and transfer learning to adapt the trained policy to empirical data.
5.3. Future Work
We identify several natural extensions: (i) multi-UAV coordination via MARL with interference-aware pairing and collision-avoidance constraints; (ii) energy-aware control that co-optimizes flight energy, airtime, and user coverage under battery/endurance models; (iii) environment realism, including wind fields, no-fly zones, and 3D urban geometry; (iv) robust learning, with explicit modeling of SIC residuals and CSI uncertainty, and safety layers for constraint satisfaction; (v) multi-objective optimization, e.g., Pareto-efficient policies trading coverage, throughput, and energy; (vi) sample-efficient training through model-based RL, curriculum learning, or offline pretraining before online fine-tuning; and (vii) adaptive retraining under environmental shifts. While the current model generalizes across four representative user layouts, drastic environmental changes (e.g., different propagation regimes or user densities) would require updated simulation or fine-tuning. Future extensions will investigate transfer reinforcement learning and online domain adaptation, enabling the agent to update its policy using a limited number of real-environment samples without full retraining.
6. Conclusions
We studied joint UAV motion control, uplink power allocation, and UL–NOMA scheduling under a realistic A2G channel and regulatory kinematic constraints. Our bounded-action PPO–GAE agent coordinates UAV acceleration with per-subchannel power and adaptive SIC to
maximize rate coverage. Across four canonical spatial layouts, it consistently outperforms PPO with OFDMA and placement/power baselines, raising the fraction of users above the minimum-rate threshold and reducing median UE transmit power (
Figure 5,
Figure 6,
Figure 7 and
Figure 8 and
Figure 13;
Table 4 and
Table 5). Ablations indicate that UL–NOMA with adaptive SIC, feasibility-aware actions, and joint trajectory–power decisions are all critical to the gains.
Limitations include the single-UAV/single-cell abstraction, simplified environment physics, and the use of a single primary objective. Future work will address multi-UAV settings, energy/flight-time constraints, weather and airspace restrictions, and multi-objective formulations. We plan to release seeds, configuration files, and environment scripts to facilitate reproducibility and benchmarking in this domain. Our current formulation omits an explicit propulsion/hover energy model for the UAV. Although endurance constraints can reshape feasible trajectories and scheduling, the present results focus on the coupling of motion and uplink resource allocation. Incorporating an energy-aware term in the reward and budgeted constraints is a promising extension we defer to future work.