Review

A Review of Reinforcement Learning for Multirotor UAVs from a Hierarchical Control Perspective: Biomimetic Architecture and Sim-to-Real

1 School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 Chongqing Innovation Center, Beijing Institute of Technology, Chongqing 401120, China
3 B&H Unmanned Intelligent System Research Institute, Hefei 230041, China
4 Institute of Advanced Technology, Beijing Institute of Technology, Jinan 250300, China
* Author to whom correspondence should be addressed.
Drones 2026, 10(6), 448; https://doi.org/10.3390/drones10060448 (registering DOI)
Submission received: 12 March 2026 / Revised: 21 May 2026 / Accepted: 22 May 2026 / Published: 8 June 2026

Highlights

What are the main findings?
  • This review classifies reinforcement learning methods for multirotor UAVs using a biomimetic spinal cord–cerebellum–cerebrum architecture, which maps various learning algorithms to their specific functional roles within the UAV control loop.
  • It systematically identifies the distinct deployment assumptions, safety treatments, and specific sim-to-real bottlenecks inherent to each functional layer within the control loop.
What are the implications of the main findings?
  • This classification clarifies the correspondence between algorithmic characteristics and hierarchical control requirements.
  • It offers a clear pathway for coupling data-driven policies with analytical models to achieve robust sim-to-real transfer.

Abstract

As unmanned aerial vehicle (UAV) systems evolve from automated execution toward autonomous decision-making, multirotor UAVs increasingly face complex dynamics, uncertain sensing conditions, and task-level autonomy demands. Reinforcement learning (RL) has emerged as a promising learning-based paradigm for addressing these challenges. Existing surveys on RL-based UAV control predominantly classify methods from an algorithmic or learning-paradigm perspective, while relatively little attention has been paid to the functional roles of RL policies within the control loop. This often leads to an unclear correspondence between algorithmic characteristics and the requirements of different control layers. To address this gap, this review proposes a biomimetic “spinal cord–cerebellum–cerebrum” framework, organizing existing RL studies into low-level dynamic stabilization, mid-level perception–action coordination, and high-level task planning and decision-making. The proposed hierarchy emphasizes the functional role and intervention depth of RL policies within the control architecture, further supporting a layer-wise analysis of sim-to-real challenges. This review aims to provide a structured understanding of the roles of reinforcement learning in hierarchical UAV control and to highlight future research directions toward robust real-world deployment.

1. Introduction

Unmanned aerial vehicles (UAVs) are advancing from automated execution to autonomous perception, control, and decision-making. Multirotors, particularly quadrotors, are widely used in infrastructure inspection [1], disaster rescue [2], and terrain surveying [3] because of their simple structure and hovering capability. However, real flight still faces aerodynamic disturbances, actuator limits, payload variation, and uncertain environments. Reinforcement learning (RL) provides a data-driven way to learn nonlinear control and decision policies, but its deployment risk depends strongly on where the learned policy is inserted into the UAV autonomy loop.
Current learning-based flight-control research largely focuses on standard quadrotors, where the actuation layout, simulation model, and experimental protocol are relatively mature. These studies provide the main basis for analyzing how RL policies move from simulation to real flight. However, practical deployment often involves conditions that go beyond standard quadrotor assumptions, such as changes in platform morphology, sensing quality, or broader task constraints. These changes introduce sim-to-real gaps that are not fully captured in nominal training environments.
Existing reviews typically classify RL methods by algorithm type or application. Chen et al. [4] and Rezwan and Choi [5] mapped algorithm types and navigation applications. Sönmez et al. [6] and Xiao et al. [7] reviewed specific learning paradigms and perception modalities. These studies provide useful references, but they pay less attention to where the learned policy is inserted into the UAV autonomy loop. As a result, it remains difficult to compare different studies from the perspective of control responsibility, deployment risk, and sim-to-real transfer.
As shown in Figure 1, this review adopts a function-oriented biomimetic taxonomy. The terms spinal cord, cerebellum, and cerebrum are used as functional references rather than strict biological mappings. In biological motor control, these structures are broadly associated with reflexive stabilization, sensorimotor prediction, and goal-directed action selection, respectively [8,9,10]. By analogy, the proposed taxonomy organizes UAV-RL methods according to the input–output role of the learned policy.
At the spinal cord layer, RL operates close to physical execution and is mainly used for low-level stabilization or control adaptation. At the cerebellum layer, RL connects perception and action, allowing the UAV to respond to incomplete, delayed, or high-dimensional sensory information. At the cerebrum layer, RL supports task-level planning, swarm cooperation, and semantic decision-making. This hierarchy is not intended to rename low-, middle-, and high-level control. Its purpose is to distinguish how deeply RL intervenes in the autonomy loop and what type of decision the learned policy produces.
This classification also provides a basis for sim-to-real analysis. A policy that outputs controller parameters, visual decisions, or task plans will encounter different deployment risks. Therefore, the present survey evaluates the reviewed methods according to their policy interface, safety treatment, deployment assumptions, and sensitivity to real-world variation. The recurring question is not only what a method achieves in simulation or experiments, but also under what conditions its learned policy remains valid.
The contributions of this review are threefold. First, it reorganizes RL-based multirotor UAV studies through a spinal cord–cerebellum–cerebrum hierarchy according to the input–output role of the learned policy. Second, it compares the main paradigms in each layer by their safety treatment, sim-to-real bottlenecks, and deployment assumptions. Third, it uses deployment variations as a recurring perspective to examine where methods developed in standard settings may require redesign, physical priors, or explicit safety mechanisms before real-world use.
The remainder of this review is organized as follows. Section 2 introduces the RL formulation from the perspective of policy interface, platform response, and training/deployment constraints. Section 3 analyzes low-level dynamic stabilization and robust control at the spinal cord layer. Section 4 discusses mid-level perception–action coordination at the cerebellum layer. Section 5 reviews high-level planning, swarm cooperation, and semantic decision-making at the cerebrum layer. Section 6 summarizes deployment bottlenecks and future research directions, and Section 7 concludes the review.

2. Reinforcement Learning

Classical value-based algorithms, such as Q-learning [11], estimate the value of each discrete action. Deep Q-Networks (DQN) [12] further demonstrated that neural networks can process high-dimensional observations, using experience replay and target networks to stabilize the learning of discrete action values. While these methods effectively handle discrete tasks such as grid-map navigation [13] and communication link selection [14,15], flight control intrinsically requires continuous state and action spaces. Actor–Critic (AC) algorithms [16] address this by learning direct state-to-action mappings, thereby avoiding the loss of precision caused by forced action discretization. Applying this continuous mapping to physical flight, Rodriguez-Ramos et al. [17] used the Deep Deterministic Policy Gradient (DDPG) algorithm for dynamic landings.
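To make the discrete-action case concrete, the following is a minimal sketch of one tabular Q-learning update for a grid-map navigation step. The grid coordinates, action names, and the values of the learning rate `alpha` and discount `gamma` are illustrative assumptions, not settings taken from the cited studies.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)  # greedy bootstrap value
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

# Hypothetical grid cell (row, col) and four discrete moves.
ACTIONS = ("north", "south", "east", "west")
q_table = defaultdict(float)  # unseen state-action pairs start at 0.0

new_value = q_learning_update(q_table, state=(0, 0), action="east",
                              reward=1.0, next_state=(0, 1), actions=ACTIONS)
```

The table is indexed by discrete actions, which is why this form does not transfer directly to continuous thrust or torque commands.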
Nevertheless, early continuous control methods often suffer from training instability and low sample efficiency. To address these bottlenecks, modern algorithms have evolved in two primary directions. Proximal Policy Optimization (PPO) [18] improves stability by clipping the probability ratio in its surrogate objective function, which restricts each policy update and prevents destructively large changes. Leveraging PPO’s stability to handle increasing degrees of freedom, Deshpande et al. [19] controlled novel tilt-rotor UAVs with a curriculum learning mechanism, showing scalability to thrust-vectoring configurations. In parallel, Soft Actor–Critic (SAC) [20] introduced an off-policy maximum entropy framework. By training a stochastic actor to simultaneously maximize expected return and policy entropy, SAC enhances exploration capability and sample efficiency, providing a robust alternative for UAV control under dynamic environmental uncertainties.
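The clipping mechanism can be illustrated for a single sample. The sketch below evaluates the per-sample clipped surrogate term from the log-probabilities of the new and old policies; the function name and the default `clip_eps=0.2` are illustrative choices rather than values reported in the cited UAV studies.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # r = pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound on the update.
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio of 1.5 with positive advantage is capped at 1 + eps.
gain_capped = ppo_clip_term(math.log(1.5), 0.0, advantage=1.0)
# A ratio of 0.5 with negative advantage is capped at 1 - eps.
loss_capped = ppo_clip_term(math.log(0.5), 0.0, advantage=-1.0)
```

Once the ratio leaves the interval around 1 in the direction that would improve the objective, the term becomes constant, so the sample no longer pushes the policy further in that direction.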
Furthermore, as missions extend from single UAVs to swarms, the algorithmic foundation expands to multi-agent settings. The centralized training with decentralized execution (CTDE) paradigm, established by Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [21], was introduced to mitigate the non-stationarity of multi-agent environments. It allows the centralized critic to use global states and joint actions during training, while the actors rely solely on local observations during execution. Building upon this, the challenge of credit assignment in cooperative teams has been tackled by value factorization methods such as QMIX [22]. QMIX employs a monotonic mixing network to decompose the joint team action-value into individual utilities, ensuring that the greedy joint action coincides with the agents' decentralized greedy actions. These multi-agent baselines provide the algorithmic basis for the cooperative tasks discussed in subsequent sections.
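The monotonicity property can be checked on a toy example. The sketch below uses a linear mixer with non-negative weights, which is only a simple special case of monotonic mixing and not the state-conditioned mixing network of QMIX itself; the utilities and weights are made-up numbers.

```python
from itertools import product

def mix(utilities, weights, bias=0.0):
    """Linear monotonic mixer: abs() keeps every weight non-negative,
    so the joint value never decreases when any agent utility increases."""
    return sum(abs(w) * q for w, q in zip(weights, utilities)) + bias

# Hypothetical per-agent utilities over three discrete actions each.
agent_q = [
    [0.2, 1.0, -0.5],  # agent 0
    [0.7, 0.1, 0.3],   # agent 1
]
weights = [-1.5, 0.4]  # raw mixer weights; the sign is removed inside mix()

# Decentralized execution: each agent maximizes its own utility.
decentralized = tuple(max(range(len(q)), key=q.__getitem__) for q in agent_q)

# Centralized check: brute-force search over all joint actions.
joint = max(
    product(*(range(len(q)) for q in agent_q)),
    key=lambda acts: mix([agent_q[i][a] for i, a in enumerate(acts)], weights),
)
```

Here `joint` and `decentralized` select the same actions, which is the consistency that lets each UAV act on local utilities at execution time.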
RL is an experiential learning framework based on trial-and-error. As shown in Figure 2, the UAV acts as an agent in a dynamic environment, exploring the world by executing actions and adjusting its policy based on reward or penalty signals. The goal is to find a policy that maximizes long-term returns.
Prior to exploring advanced paradigms, this review establishes a standard RL formulation for UAV policy design. This baseline is formulated as a standard Markov Decision Process (MDP) [16], defined by a five-tuple:
$$M = (S, A, P, R, \gamma)$$
Here, $S$ denotes the state space, $A$ denotes the action space, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability from state $s_t$ to state $s_{t+1}$ after action $a_t$, $R(s_t, a_t)$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor for future rewards.
The state and action spaces jointly define the interface between an RL policy and the UAV autonomy loop. A learned policy can be written as
$$a_t = \pi_\theta(s_t)$$
Here, $s_t \in S$ is the policy input at time $t$, $a_t \in A$ is the policy output, and $\pi_\theta$ denotes a policy parameterized by $\theta$.
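As a small illustration of this formulation, the sketch below rolls out a deterministic policy in a toy one-dimensional altitude-error environment and accumulates the discounted return. The proportional policy, the transition rule, and the reward are hypothetical stand-ins chosen for readability, not a UAV model from the reviewed literature.

```python
def rollout(policy, step, s0, horizon):
    """Apply a_t = policy(s_t) for `horizon` steps and collect the rewards."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        rewards.append(r)
    return rewards

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t by backward accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Toy MDP (assumption): the state is an altitude error and the action shifts it.
policy = lambda s: -0.5 * s           # stand-in for pi_theta
def step(s, a):
    s_next = s + a                    # deterministic transition
    return s_next, -abs(s_next)       # reward penalizes the remaining error

rewards = rollout(policy, step, s0=1.0, horizon=3)
ret = discounted_return(rewards, gamma=0.5)
```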
The state space defines what information is available to the policy during learning and deployment. In UAV-RL studies, low-dimensional flight states such as position, velocity, attitude, angular velocity, and tracking error are commonly used for trajectory tracking, wind rejection, and attitude stabilization [23,24]. Historical states are also used to capture hidden or slowly varying dynamics, such as mass variation during aerial 3D printing [25] and aerodynamic changes during tilt-rotor mode transition [26]. In perception-based tasks, visual or geometric features can be included as policy inputs, such as image moments or optical-flow-based observations [27]. These examples show that the state space determines which disturbances, configuration changes, or environmental variations can be observed by the policy. Compact states are easier to train and deploy, but may omit actuator degradation, payload variation, or mode-dependent effects. Richer states expose more deployment-relevant information, but may increase sensing requirements, estimation error, and onboard inference delay.
The action space defines how the policy intervenes in the UAV system. In high-level or intermediate tasks, actions can be waypoints, velocity commands, attitude references, or local navigation decisions. In low-level control, the action space often shifts to direct thrust [28], generalized torque vectors [29], or hybrid commands for nonstandard configurations [30]. Abstract actions leave more responsibility to conventional controllers and are usually easier to integrate with existing flight stacks. Low-level actions give the policy more control authority, but make the learned behavior more dependent on actuator response, control allocation, and platform configuration. When the vehicle structure, actuator condition, or flight mode changes, the same action may no longer produce the same physical effect.
The reward function specifies the optimization preference of the policy. UAV reward functions usually combine tracking error, energy consumption, collision avoidance, attitude stability, swing suppression, formation quality, or task completion. Specific reward components are used to penalize unsafe motion [31], improve trajectory tracking [32], or enforce formation constraints [33]. Such shaping is useful because it converts task requirements into a scalar learning signal and guides exploration toward desirable behaviors. However, penalty terms do not necessarily prevent unsafe actions during real-time execution. The weights among tracking, energy, safety, and smoothness terms are often task-specific and may require retuning when the platform, payload, or environment changes. If the reward is optimized mainly under simplified simulation assumptions, the learned policy may exploit those assumptions rather than remain reliable in real flight.
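A weighted reward of this kind might be sketched as follows. The term names, weights, and safe-distance threshold are illustrative assumptions and are not taken from any cited study.

```python
def shaped_reward(tracking_error, energy, min_obstacle_distance,
                  weights=None, safe_distance=1.0):
    """Scalar reward combining tracking, energy, and a soft safety penalty."""
    w = {"tracking": 1.0, "energy": 0.05, "safety": 5.0}  # assumed defaults
    if weights:
        w.update(weights)
    # The penalty is active only inside the assumed safety margin.
    violation = max(0.0, safe_distance - min_obstacle_distance)
    return -(w["tracking"] * tracking_error ** 2
             + w["energy"] * energy
             + w["safety"] * violation)

r_clear = shaped_reward(tracking_error=0.0, energy=0.0, min_obstacle_distance=2.0)
r_close = shaped_reward(tracking_error=1.0, energy=0.0, min_obstacle_distance=0.5)
```

The safety term only lowers the reward of a close approach. It does not block the action that caused it, which is the gap between reward shaping and execution-time constraints noted above.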
The neural network determines how the policy approximates the mapping from input to action. Multilayer perceptrons are often used for low-dimensional flight states, while convolutional and recurrent networks are used for images or temporal histories. Attention-based networks originate from the Transformer architecture proposed by Vaswani et al. [34], in which self-attention assigns different weights to tokens or features according to their pairwise relevance. In UAV-RL studies, this mechanism is mainly useful for visual feature aggregation, temporal dependency modeling, and multi-agent relation encoding. Physics-informed networks introduce consistency with known flight dynamics [35,36]. Spiking and lightweight networks reduce the computation and energy costs of high-frequency onboard execution [37,38]. These architectures improve the representational capacity of RL policies, but also introduce deployment tradeoffs. Larger models may improve perception, memory, or planning, yet increase latency and onboard computation. Physics-informed structures reduce dependence on pure reward feedback, but still rely on the correctness of the assumed model. Lightweight networks improve real-time feasibility, but may reduce robustness when observations or dynamics differ from training conditions.
The training process determines how the policy acquires robustness before deployment. Curriculum learning, domain randomization, imitation learning, privileged information, and policy distillation are commonly used to reduce exploration difficulty and improve sim-to-real performance. Curriculum strategies gradually expose policies to harder flight conditions [19]. Domain randomization and robustness-oriented training broaden the distribution of dynamics, visual inputs, attack conditions, or platform parameters [39,40]. These methods are useful because real-world exploration is risky and simulation rarely matches all deployment conditions. Their limitation is that training coverage remains finite. Curriculum learning may bias the policy toward the selected task sequence. Domain randomization improves robustness to sampled variations, but cannot guarantee coverage of unmodeled actuator behavior, sensing errors, or configuration-dependent dynamics. Distillation can produce compact deployable policies, but the deployed policy must still rely only on signals available in real flight.
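The interaction between curriculum learning and domain randomization can be sketched as a sampling range that widens with training progress. The parameter names, nominal values, and the maximum half-width of 0.5 are assumptions made for illustration.

```python
import random

# Nominal platform parameters; names and values are illustrative assumptions.
NOMINAL = {"mass_kg": 1.0, "thrust_coeff": 1.0, "motor_delay_s": 0.02}

def curriculum_range(progress, max_half_width=0.5):
    """Widen the randomization range as training progress goes from 0 to 1."""
    p = min(max(progress, 0.0), 1.0)
    half = max_half_width * p
    return (1.0 - half, 1.0 + half)

def sample_dynamics(rng, scale_range):
    """Scale every nominal parameter by an independent uniform factor."""
    lo, hi = scale_range
    return {name: value * rng.uniform(lo, hi) for name, value in NOMINAL.items()}

rng = random.Random(0)  # seeded for reproducibility
early = sample_dynamics(rng, curriculum_range(0.0))  # no randomization yet
late = [sample_dynamics(rng, curriculum_range(1.0)) for _ in range(5)]
```

Every sampled platform stays inside the chosen range, so effects outside it, such as an unmodeled actuator fault, are never seen during training regardless of how many episodes are drawn.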
The following sections organize UAV-RL studies by the input–output role of the learned policy. Methods close to physical execution are categorized into indirect, semi-direct, and end-to-end control based on whether the policy outputs controller parameters, corrective terms, or low-level commands. Methods linking perception and action are categorized into indirect, semi-direct, and end-to-end perception based on whether the policy relies on predictive states, structured perception-guided representations, or direct perception-to-action mappings. Methods for task-level autonomy are categorized into temporal hierarchical planning, spatial swarm cooperation, and semantic decision-making based on whether the policy interface is organized over time, across agents, or through task semantics. These categories are then evaluated according to their task characteristics and deployment challenges.

3. Spinal Cord Layer: Low-Level Dynamic Stabilization and Robust Control

The spinal cord layer corresponds to low-level physical execution in UAV flight, where the controller directly affects attitude stabilization, trajectory tracking, and actuator-level response. Outside the RL literature, this layer has also been widely studied through classical robust control and estimation-based designs. Sliding-mode and backstepping controllers are commonly used to improve tracking stability under external disturbances and model uncertainty [41]. Observer-based methods, such as extended-state observers and high-order sliding-mode observers, estimate lumped disturbances, unmodeled dynamics, or actuator faults before compensation is applied [42,43]. Recent fractional-order control and observation schemes further introduce memory-related dynamics into sliding surfaces or fault observers, and have been investigated for robust tracking under noise, chattering, random disturbances, and actuator uncertainty [44,45].
The following discussion focuses on how RL is inserted into the low-level control loop. A policy may tune controller parameters or reference commands, augment an existing controller through residual or feedforward correction, or directly generate low-level control commands. These different intervention depths determine how much analytical control structure is retained, how safety is handled during execution, and how sensitive the learned action is to actuator delay, motor chattering, and model mismatch. Accordingly, spinal-cord-layer RL methods are divided into indirect, semi-direct, and end-to-end control approaches, as summarized in Table 1.

3.1. Indirect Control

Indirect control keeps the conventional controller as the only module that generates the final control command. The RL policy does not output motor thrusts, torques, or residual commands. Instead, it adjusts controller parameters, gain schedules, or control constraints around a predefined control law:
$$\theta_t = \pi_\omega(s_t), \quad u_t = C(x_t; \theta_t)$$
In this formulation, $s_t$ is the state or error signal observed by the RL policy, $\theta_t$ denotes the adapted controller parameter, and $C(\cdot)$ is the analytical controller that produces the actual command. This input–output structure explains both the sim-to-real advantage and the limitation of indirect control. Since the final command is still generated by a known controller, deployment is less exposed to unsafe exploratory actions. However, the learned policy can only compensate for errors that can be represented as parameter variation within the original control structure.
Indirect control leaves high-frequency motor execution to conventional controllers such as proportional-integral-derivative (PID) or model predictive control (MPC), while RL only adjusts controller parameters online. Among these baselines, PID remains one of the most widely used low-level controllers because it regulates tracking error through proportional, integral, and derivative feedback:
$$u(t) = K_p e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d \frac{d e(t)}{dt}$$
Here, $e(t)$ is the tracking error, $u(t)$ is the control command, and $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative gains. In this paradigm, RL does not replace the stabilizing loop itself, but only modifies its gains. Learning is restricted to parameter adaptation around an interpretable controller rather than direct actuator-level command generation.
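The division of labor in indirect control can be sketched with a discrete-time PID whose gains are proposed by a policy. The `gain_policy` function is a hypothetical stand-in for a trained network, and the gain bounds and the clamping step are assumptions added here for illustration rather than part of any cited method.

```python
class PID:
    """Discrete-time PID; it remains the only module that emits the command."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def set_gains(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, error, dt):
        self.integral += error * dt
        # No derivative action on the first sample (no previous error yet).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Assumed admissible gain ranges, e.g. from an offline stability analysis.
GAIN_BOUNDS = {"kp": (0.5, 4.0), "ki": (0.0, 1.0), "kd": (0.0, 2.0)}

def clamp_gains(proposed):
    return {k: min(max(proposed[k], lo), hi) for k, (lo, hi) in GAIN_BOUNDS.items()}

def gain_policy(error):
    """Hypothetical stand-in for the learned gain policy."""
    return {"kp": 1.0 + 10.0 * abs(error), "ki": 0.1, "kd": 0.2}

pid = PID(kp=1.0, ki=0.1, kd=0.2)
error = 1.0
gains = clamp_gains(gain_policy(error))  # RL proposes, bounds restrict
pid.set_gains(**gains)
u = pid.step(error, dt=0.01)             # the PID law produces the command
```

The policy output never reaches the actuators directly. It only changes how the fixed control law weighs the error terms.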
Sönmez et al. [46] used DDPG to tune PID gains online for quadrotor trajectory tracking. Their results show that RL can improve disturbance adaptation without discarding the conventional feedback structure. However, the later PD-based variant in [47] also reveals the limitation of this paradigm. When delayed feedback and exploration noise become significant, the integral term may amplify instability instead of improving steady-state performance. Indirect control therefore remains effective only when the nominal controller already captures the dominant flight dynamics and RL corrects parameters rather than compensating for missing control structure.
Hoover et al. [48] extended this idea to wind rejection through RL-based gain scheduling. Their results show that once too many controller gains are scheduled simultaneously, exploration becomes inefficient and convergence deteriorates. In particular, scheduling 36 gains led to non-convergent exploration, which makes the scalability limit of indirect adaptation much more concrete. This is not merely an optimization difficulty. It indicates that indirect control scales poorly when adaptation can no longer be confined to a compact parameter space. The method works best when the mismatch is moderate and can still be represented as parameter variation around a valid baseline controller.
Indirect control relies on a static control structure in which RL adapts only the parameters of a pre-designed controller. This retains the stabilizing prior of the baseline controller and usually makes deployment more conservative than direct actuator-level learning. However, this advantage does not imply guaranteed safety, because inappropriate gain adaptation can still degrade closed-loop stability unless the gain range is constrained or the controller is explicitly verified.
The main limitation of this paradigm is that its adaptability is bounded by the validity of the baseline controller. Indirect control is more suitable for moderate parameter variations, such as changes in mass, inertia, or disturbance intensity, than for errors that cannot be represented by gain adjustment. When the nominal controller no longer captures the dominant flight dynamics, RL-based gain tuning can still improve local adaptation, but it cannot compensate for missing control structure or unmodeled actuator behavior. In such cases, deeper intervention, such as residual correction or direct command learning, may be required. This also shifts the main challenge from parameter adaptation to action safety, execution stability, and real-flight validation.

3.2. Semi-Direct Control

Semi-direct control keeps a conventional controller as the stabilizing backbone, while RL provides a corrective term for effects that are difficult to capture analytically:
$$u = u_b(x) + \Delta u_\theta(o, m)$$
where $u_b(x)$ denotes the baseline control command generated from the vehicle state $x$, and $\Delta u_\theta(o, m)$ denotes the learned correction based on current observations $o$ and, when necessary, historical information $m$. Compared with indirect control, RL is no longer limited to tuning controller parameters. It can modify the control input through a residual or feedforward term. Compared with end-to-end control, however, the analytical controller is still retained, so the learned component does not carry the full burden of stabilization. Unlike observer-based robust control, where compensation is computed from an explicit disturbance or fault estimate, the residual term in semi-direct RL is learned from data and is therefore more dependent on the training distribution and action constraints.
This structure is useful when the nominal controller remains valid but cannot fully compensate for disturbances or unmodeled local dynamics. Ishihara et al. [49] trained a residual policy built upon a cascaded PID controller, as shown in Figure 3. They reported an approximately 50% reduction in position deviation relative to the baseline and demonstrated outdoor flight under wind speeds exceeding 13 m/s. The same controller also remained effective when the vehicle mass and propeller lift coefficient varied from 50% to 150% of their nominal training values. These results indicate that the learned correction can target wind-induced errors that are not explicitly handled by the baseline controller. Song et al. [50] further showed that semi-direct compensation does not need to rely only on state-error feedback. By introducing visual feedforward cues for gust response, the learned correction could act before the disturbance was fully reflected in the tracking error. This is an important step, because it shows that semi-direct control can move from reactive correction toward anticipatory compensation. In [51], residual learning was further used to capture downwash-induced interaction effects in close-proximity flight, reducing the position root mean square error by 29%. These studies suggest that the main strength of semi-direct control lies in repairing local model deficiency around a valid controller, rather than replacing the controller itself.
Once RL begins to modify the control input itself, however, safety can no longer be attributed to the baseline alone. In practice, semi-direct methods differ mainly in how the residual is restricted. One common form is bounded compensation:
$$\|\Delta u_\theta\| \le \alpha \|u_b\|, \quad 0 < \alpha < 1$$
This constraint keeps the learned residual smaller than the nominal command, which is consistent with [52], where residual actions were clipped to within 20% of the baseline input. However, bounded residuals should not be treated as equivalent to certified safety. A stronger safety mechanism appears when the learned action is filtered by an explicit constraint. For a control-affine system, a standard control barrier function (CBF) condition can be written as
$$L_f h(x) + L_g h(x)\, u + \kappa h(x) \ge 0$$
Here, $h(x)$ defines the safe set, $L_f h(x)$ and $L_g h(x)$ are Lie derivatives along the drift and input vector fields, $u$ is the applied control input, and $\kappa > 0$ sets the linear class-$\mathcal{K}$ term $\kappa h(x)$. In [53], CBF filtering was used to restrict learned actions within a theoretically admissible region. This distinction is central. A residual policy is not automatically safe because it is residual. Its safety depends on whether the correction is only encouraged to remain small, explicitly clipped, or filtered before execution.
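The difference between clipping and filtering can be shown on a scalar toy system. The sketch below assumes a single-integrator model (position rate equals the command) with the safe set defined by h(x) = x_max - x, for which the CBF condition reduces to u <= kappa * (x_max - x). This model, the numbers, and the function names are illustrative assumptions and do not represent quadrotor dynamics or the filter used in [53].

```python
def clip_residual(u_b, delta_u, alpha=0.2):
    """Bounded compensation: enforce |delta_u| <= alpha * |u_b|."""
    bound = alpha * abs(u_b)
    return max(-bound, min(bound, delta_u))

def cbf_filter(u_nom, x, x_max, kappa=1.0):
    """Scalar CBF filter for x_dot = u with h(x) = x_max - x.
    Here L_f h = 0 and L_g h = -1, so the condition is u <= kappa * (x_max - x).
    The closest admissible command to u_nom is then a simple minimum."""
    return min(u_nom, kappa * (x_max - x))

u_b = 1.0        # baseline controller command
delta_u = 0.5    # raw learned residual, larger than the 20% bound
u_clipped = u_b + clip_residual(u_b, delta_u)           # residual limited to 0.2
u_exec = cbf_filter(u_clipped, x=9.5, x_max=10.0)       # limited by the boundary
```

Clipping keeps the residual small relative to the baseline but is blind to the state, whereas the filter reduces the command further because the vehicle is already close to the boundary.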
Some semi-direct works further show that the missing dynamics may not be observable from instantaneous tracking error alone. In [25], historical attitude errors were used to reflect mass decay during aerial 3D printing, allowing adaptation without a separate mass-estimation module. In [26], a Long Short-Term Memory (LSTM) extracted time-dependent aerodynamic features during tilt-rotor mode transition.
The limitation of semi-direct control is therefore tied to the validity of the baseline controller. It is more suitable when the main discrepancy can be expressed as a bounded correction, delayed disturbance, hidden parameter change, or temporally varying local effect. When the baseline no longer provides an adequate stabilizing structure, the residual term may become difficult to interpret and may even weaken the nominal loop if it is not constrained. Thus, semi-direct control improves deployability only when the learned correction, the baseline controller, and the safety mechanism remain consistent with the real execution conditions.

3.3. End-to-End Control

End-to-end control represents the deepest RL intervention in the spinal cord layer. In this review, it refers to learned policies that generate low-level flight-control commands more directly. Depending on the implementation, the output may be a velocity-level command, an attitude or body-rate reference, a thrust command, or an actuator-level command. These interfaces should not be treated as identical. A velocity or attitude command still relies on an inner-loop controller, whereas actuator-level outputs place the learned policy closer to physical execution.
Early exploratory studies provided foundational validations for this direct actuator-level mapping. Hwangbo et al. [54] demonstrated that an RL agent could directly output rotor thrusts to stabilize a quadrotor, bypassing conventional inner-loop structures. To systematically evaluate such capabilities, Koch et al. [55] compared RL-based high-frequency attitude controllers against classical PID, noting that algorithms like PPO could achieve comparable or faster rise times. Low-level autonomous tracking was further investigated by Pi et al. [56], who mapped states directly to four actuator outputs for hovering and trajectory tracking. To address the sample inefficiency typical of learning such fast-scale dynamics, Lambert et al. [57] designed a deep model-based RL approach. As illustrated in Figure 4, their model predictive control loop enabled a Crazyflie quadrotor to achieve stable hovering using only 10,000 training data points, equivalent to just three minutes of actual flight.
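The idea of planning through a learned model can be sketched with a generic random-shooting planner, one common choice in model-based RL. This is not claimed to be the implementation in [57]: the `learned_model` function is a hand-written stand-in for a trained dynamics network, and the cost, horizon, and candidate count are assumptions.

```python
import random

def learned_model(state, action, dt=0.02):
    """Hypothetical stand-in for a learned model; state = (height, velocity)."""
    z, v = state
    v_next = v + action * dt
    return (z + v_next * dt, v_next)

def cost(state, target=1.0):
    z, v = state
    return (z - target) ** 2 + 0.1 * v ** 2

def random_shooting(state, rng, horizon=10, candidates=200, a_max=5.0):
    """Sample action sequences, roll each through the model, keep the cheapest.
    Only the first action of the best sequence is returned for execution."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(candidates):
        seq = [rng.uniform(-a_max, a_max) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = learned_model(s, a)
            total += cost(s)
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first, best_cost

rng = random.Random(0)  # seeded for reproducibility
first_action, best_cost = random_shooting((0.0, 0.0), rng)
```

In a receding-horizon loop the planner would be called again at every control step with the newly measured state, so model errors are corrected by feedback rather than accumulated over the horizon.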
End-to-end control should also be distinguished by the information structure used by the policy. Some policies map the current state directly to a command, whereas others use temporal histories or latent representations to infer hidden disturbances before generating the command. ARMOR [39] represents this latter direction. It uses long-term temporal information to construct a robust latent representation for UAV control under physical attacks, so that gust disturbances and sensor attacks can be separated more effectively before the policy output is produced. This example shows that end-to-end control is not necessarily a memoryless state-to-action mapping. However, its safety still comes mainly from training-time representation learning and attack exposure, rather than from an execution-time safety filter.
One limitation concerns how safety is handled during learning and execution. In [58], near-wall penalties were used to discourage unsafe actions in confined-space flight. Such reward shaping can guide the learned policy toward safer behavior during training, but it does not impose an execution-time constraint on the final command. Its effect remains tied to the training distribution and to whether the penalty terms represent the real safety boundary with sufficient accuracy. If sensing noise, actuator delay, or environmental conditions differ from training, a penalty-shaped policy may still generate unsafe or oscillatory commands.
This limitation motivates the distinction between reward-based safety and execution-time safety. Reward penalties influence policy optimization, whereas safety filters modify or restrict the action before it reaches the vehicle. A general filtered structure can be written as
$$u_{\mathrm{exec}} = F_{\mathrm{safety}}(u_{\mathrm{rl}}, x)$$
Here, $u_{\mathrm{rl}}$ is the command proposed by the learned policy, $x$ is the current vehicle state, $F_{\mathrm{safety}}$ is the safety-filtering operator, and $u_{\mathrm{exec}}$ is the command actually applied to the UAV. CBF-, MPC-, or reachability-based filters follow this logic by projecting or modifying nominal actions to satisfy admissible constraints. Cheng et al. [59] represent this hybrid direction by combining an RL-based model-free controller with a CBF-based model-based controller. Compared with reward penalties such as those in [58], this structure provides a clearer execution-time safety mechanism because unsafe nominal actions can be corrected before they reach the vehicle. However, because the final command is produced after an analytical filtering step, this architecture should be treated as safety-filtered hybrid RL rather than pure end-to-end control.
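The filtered structure above can be illustrated with a minimal sketch. It assumes a one-dimensional toy problem that is not taken from [59]: the state is the distance $d$ to a wall, the dynamics are a single integrator $\dot{d} = u$, and the barrier function is $h = d - d_{\min}$. Under these assumptions, the CBF condition $\dot{h} \geq -\alpha h$ reduces to a lower bound on the command, so the usual quadratic program has the closed-form solution $\max(u_{\mathrm{rl}}, -\alpha h)$. The names `safety_filter` and `simulate` and all numerical values are hypothetical.

```python
def safety_filter(u_rl: float, d: float, d_min: float = 0.5, alpha: float = 2.0) -> float:
    """Toy 1-D CBF filter: smallest change to u_rl such that dh/dt >= -alpha * h."""
    h = d - d_min          # barrier value; the safe set is h >= 0
    u_lower = -alpha * h   # most negative (wall-approaching) velocity allowed
    return max(u_rl, u_lower)


def simulate(policy, d0: float = 2.0, dt: float = 0.01, steps: int = 500) -> float:
    """Roll out the filtered command and return the closest wall distance reached."""
    d = d0
    closest = d
    for _ in range(steps):
        u_exec = safety_filter(policy(d), d)  # u_exec = F_safety(u_rl, x)
        d += dt * u_exec                      # single-integrator dynamics (assumption)
        closest = min(closest, d)
    return closest


def aggressive_policy(d: float) -> float:
    """Stand-in for a learned policy that always drives toward the wall."""
    return -3.0


closest_distance = simulate(aggressive_policy)
```

In this sketch, a command that already satisfies the bound passes through unchanged, while the wall-approaching command is slowed as the barrier value shrinks. A reward penalty alone would not provide this execution-time correction.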
Real-time execution also affects safety. A low-level policy must produce commands fast enough for the flight controller, otherwise inference delay or computational overload can degrade closed-loop behavior. van Breukelen Castillo et al. [37] addressed this issue by using a spiking neural policy for high-frequency control, thereby reducing the computational cost associated with conventional neural networks. This supports onboard deployment, but it does not directly address reward-based safety limits.
Overall, end-to-end control offers a direct command interface, but this also places stronger requirements on safety handling. Reward penalties can shape safer behavior during training, but they remain heuristic unless unsafe actions are constrained during execution. Safety-filtered designs provide stronger protection, but they also reintroduce analytical control structure and should not be treated as pure end-to-end control.
Table 2 compares the three spinal-cord-layer paradigms according to the depth of RL intervention, retained analytical structure, safety treatment, and deployment evidence. Indirect methods rely on analytical baseline protection, but their adaptation is limited when the mismatch becomes structural. Semi-direct methods extend the baseline controller through residual or feedforward correction, while safety still mainly comes from the baseline controller or bounded correction. The CBF-based filtering in [53] is a stronger case because it can modify unsafe actions before execution. Pure end-to-end entries in Table 2 mainly use reward penalties or training guidance, not online action correction. The safety-filtered direct RL discussed in Section 3.3 should therefore be treated as a hybrid contrast rather than evidence of pure end-to-end online safety filtering. The result in [40] supports cross-scale transfer within a related morphology and similar actuation layout, not general heterogeneous-platform adaptation. Overall, most spinal-cord-layer RL methods still improve robustness mainly within fixed or closely related platforms, whereas thrust-vectoring, mode switching, and heterogeneous actuation can fundamentally change the control-allocation map.

4. Cerebellum Layer: Mid-Level Agile Perception and Motor Coordination

The cerebellum layer concerns the transformation from sensory information to action-relevant decisions. At this layer, RL does not mainly address high-frequency motor stabilization or long-horizon mission planning. Instead, it determines how incomplete, delayed, or high-dimensional observations are converted into short-horizon flight responses. The deployment difficulty therefore lies not only in whether the UAV can perceive the environment, but also in whether the perceived information leads to physically consistent actions after deployment.
This perception–action interface is the main criterion of this chapter. A method that first constructs a predictive state or map may improve interpretability and risk assessment, but its reliability depends on representation fidelity and inference latency. A method that uses intermediate semantic or geometric features can reduce raw-domain sensitivity, but it inherits the bias of the perception frontend. A method that maps raw observations directly to actions shortens the interface, but its deployment depends more strongly on training coverage, domain randomization, real-flight validation, and onboard inference speed. Based on this interface, this chapter divides cerebellum-layer RL methods into indirect perception, semi-direct perception, and end-to-end perception, as summarized in Table 3.

4.1. Indirect Perception

Indirect perception preserves a cascaded architecture in which environmental understanding is completed before policy execution. The RL policy does not act on raw sensory input directly. Instead, it first infers a compact latent state and then predicts its temporal evolution, so that decision-making is based on a belief about the future rather than on the current observation alone. This makes the paradigm particularly suitable for scenarios involving occlusion, sensing delay, and feature sparsity, where reactive perception is often unreliable.
This framework can be expressed through a latent-state update under partial observability:
$$z_t \sim p_\theta(z_t \mid z_{t-1}, a_{t-1}, o_t)$$
Here, $z_t$ is the internal belief state used to summarize the hidden environment dynamics, $o_t$ is the current sensory observation, and $a_{t-1}$ is the previously executed action. The role of this update is to transform incomplete raw input into a predictive state that can support downstream planning.
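As a rough illustration of this update, the sketch below replaces the learned model $p_\theta$ with a hand-written scalar linear-Gaussian filter, which is an assumption made for brevity and not the method of any cited work. The belief is summarized by a mean and a variance over the distance to an obstacle. When the observation is missing (occlusion), only the prediction step runs, so the belief keeps evolving while its uncertainty grows. The class `Belief`, the function `belief_update`, and the noise values `q` and `r` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Belief:
    mean: float  # estimated distance to the obstacle [m]
    var: float   # uncertainty of that estimate


def belief_update(b: Belief, a_prev: float, o_t: Optional[float],
                  dt: float = 0.1, q: float = 0.01, r: float = 0.04) -> Belief:
    """Gaussian stand-in for z_t ~ p(z_t | z_{t-1}, a_{t-1}, o_t)."""
    # Prediction: the vehicle closes in at speed a_prev, so the distance shrinks.
    mean = b.mean - a_prev * dt
    var = b.var + q
    if o_t is None:              # occluded frame: rely on the prediction alone
        return Belief(mean, var)
    gain = var / (var + r)       # scalar Kalman gain
    return Belief(mean + gain * (o_t - mean), (1.0 - gain) * var)


belief = Belief(mean=5.0, var=1.0)
observations = [4.9, 4.8, None, None, 4.5]  # None marks an occluded frame
history: List[Belief] = []
for o in observations:
    belief = belief_update(belief, a_prev=1.0, o_t=o)
    history.append(belief)
```

The variance stored in `history` increases over the two occluded frames and drops again once a measurement returns, which is the kind of signal a downstream planner could use for pre-execution risk evaluation.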
The benefit of this design is most evident when the main challenge is partial observability rather than control execution. In visually cluttered environments, Zhang et al. [36] used a Diffusion Transformer world model to predict future visual states and preserve navigation capability under occlusion. The significance of this result is not simply the use of a larger generative model. More importantly, the agent gains access to a predicted future scene, which turns obstacle avoidance from a purely reactive behavior into anticipatory decision-making. However, this advantage depends directly on prediction fidelity. If the latent transition is inaccurate, planning quality degrades before the controller is even invoked.
This dependence becomes clearer when the sensing modality changes. Wang et al. [60] found that conventional visual prediction transferred poorly to feature-sparse open-water scenes, where stable texture cues are limited. Instead of predicting raw images, they modeled the temporal evolution of homography-based semantic features and used future water boundaries as an augmented state. This comparison shows that indirect perception is not tied to vision itself. What matters is whether the predictive representation captures the dominant invariant of the environment. Transfer therefore depends less on the nominal network structure than on whether the latent state is matched to the sensing physics of the task.
A similar logic appears in non-visual settings. Jiang et al. [61] predicted target motion from intermittent radar point clouds, while Ebrahimi et al. [62] used radio-frequency propagation priors for localization in vision-denied environments. Liang et al. [63] further introduced event-triggered temporal updating, whereby recurrent inference is activated only when the observation stream changes significantly. These studies suggest that the main strength of indirect perception lies in modality-agnostic predictive abstraction. Its main limitation is that the predictive model must remain reliable under sparse, delayed, or modality-specific sensing conditions.
Overall, indirect perception is best understood as a prediction-first strategy for handling perceptual uncertainty. It offers more opportunities for pre-execution risk evaluation than direct perception-to-action policies, because future risk can be estimated before the action is applied. This benefit, however, comes at the cost of model complexity and inference latency. Large predictive models can improve foresight under occlusion, but they also increase onboard computation. More fundamentally, indirect perception improves robustness mainly at the sensing level. It helps preserve decision quality under occlusion, texture loss, or modality shift, but it does not by itself address downstream differences in control allocation or platform-dependent action consequences.

4.2. Semi-Direct Perception

Semi-direct perception maintains a two-stage architecture in which a perception frontend compresses raw observations into a semantically structured representation before the RL policy acts. Rather than mapping pixels directly to commands, the policy operates on filtered features:
$$z_t = \phi(o_t), \qquad a_t = \pi_\theta(z_t, m_t)$$
Here, $o_t$ denotes the raw sensory input, $\phi$ the frontend extractor, $z_t$ the semantic state, and $m_t$ the optional temporal memory. This architecture matters because performance loss at the cerebellum layer often comes less from the policy itself than from nuisance variation in the observation space, such as illumination changes, background texture, clutter, or social interference. By removing these factors before policy inference, semi-direct perception shifts the transfer problem from raw-domain alignment to feature-level invariance.
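The two-stage structure can be sketched with a deliberately simple example. It assumes a grayscale image containing one bright target, a thresholding frontend in place of a learned detector, and a hand-written proportional rule in place of a trained RL policy. The functions `frontend` and `policy` and all gains are hypothetical and serve only to show where nuisance variation is removed.

```python
import numpy as np


def frontend(image: np.ndarray) -> np.ndarray:
    """phi: normalized horizontal offset of the bright target, in [-1, 1].

    Assumes the target is present; an image without contrast is not handled.
    """
    lo, hi = image.min(), image.max()
    # Threshold relative to the image's own range, so a global brightness
    # or contrast change does not alter the extracted feature.
    mask = image > lo + 0.5 * (hi - lo)
    cols = np.nonzero(mask)[1]
    center = cols.mean()
    return np.array([2.0 * center / (image.shape[1] - 1) - 1.0])


def policy(z: np.ndarray, m: float, gain: float = 0.8, smooth: float = 0.5) -> float:
    """pi_theta stand-in: yaw-rate command blended with memory m (previous command)."""
    return smooth * m + (1.0 - smooth) * gain * float(z[0])


img = np.zeros((32, 32))
img[10:14, 24:28] = 1.0                 # target to the right of the image center
z_nominal = frontend(img)
z_shifted = frontend(0.5 * img + 0.3)   # same scene under a different exposure
a_t = policy(z_nominal, m=0.0)
```

Because the exposure change is absorbed by the frontend, the policy receives the same feature in both cases. The same sketch also shows the limitation discussed in this section: if the frontend extracts the wrong blob, the policy has no way to correct it.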
Semi-direct perception can appear in two forms. In the first form, image processing extracts compact features, such as obstacle position, optical flow, intention cues, target bounding boxes, or temporal motion descriptors, and the RL policy acts on these features. Early implementations demonstrated the power of this decoupling. For example, Ross et al. [64] extracted structured features, such as optical flow and Radon transforms, from monocular images, mapping them to reactive control commands via imitation learning. Similarly, in target-tracking missions, Mitakidis et al. [27] utilized a Convolutional Neural Network (CNN)-based frontend to detect an unmanned ground vehicle, providing bounding-box information to a DDPG policy that determined the necessary roll, pitch, and yaw actions.
In the second form, learning is used in the perception frontend to estimate depth, pose, semantic boundaries, or a local map, while the final action is still generated by a planner or heuristic controller. For instance, as illustrated in Figure 5, Xie et al. [65] constructed a convolutional network to first predict depth from RGB images, and then used this intermediate depth representation to train a Dueling Double Deep Q-Network (D3QN) for obstacle avoidance.
By decoupling raw visual variations from the control policy, this semi-direct architecture proves particularly effective against disturbance sources with a recognizable structure. In human-centric environments, the core difficulty is not simply obstacle avoidance, but intention-aware interaction. Xu et al. [66] therefore used pedestrian social intentions rather than raw images as policy input, while Kang et al. [67] introduced human prior knowledge into competitive drone racing, ensuring that exploration remained closer to expert flight behavior. In both cases, the policy does not learn from raw visual input alone; it is guided by a representation that already contains task-relevant structure.
In purely physical dynamic environments, a similar logic applies. The goal is no longer social compliance, but separating motion-relevant cues from static background interference. Zhang et al. [68] addressed this by introducing a two-stream Actor–Critic structure that extracts features from both the observation and its temporal change, and reported better generalization than conventional DDPG and TD3 baselines in multi-obstacle navigation. Liang et al. [69] extended this direction by combining CNN–LSTM feature fusion with a generalized integral compensator, so that temporal interaction cues are preserved while steady-state correction is injected into the policy loop. In both cases, RL is not learning perception from scratch. It operates on a representation already biased toward dynamic saliency.
Kaufmann et al. [70] presented the Swift drone-racing system (Figure 6). Onboard IMU signals and camera images are first processed through Visual–Inertial Odometry (VIO) and gate detection, followed by Kalman filtering, to construct a compact observation state. The learned policy then maps this estimated state, together with the previous action, to control commands at 100 Hz. Swift thus highlights a central tradeoff of semi-direct perception. The retained perception and estimation frontend reduces the burden of learning directly from raw images, but real-flight errors in detection, VIO, and filtering are transferred to the policy as state-estimation bias. For sim-to-real evaluation, the key question is therefore not only whether the policy runs fast enough, but whether the estimated observation state remains dynamically consistent with the real vehicle and environment during aggressive flight.
The key strength of semi-direct perception is therefore selective decoupling rather than full abstraction. It is often more robust than direct pixel control when the deployment gap is dominated by observation noise, lighting variation, background texture, or interaction semantics, because the frontend can absorb part of the visual shift before the policy acts. However, this also defines its limit. The policy is only as invariant as the frontend representation. If the semantic extractor misses a latent interaction cue, misclassifies intent, or overfits to a specific sensing pipeline, the downstream RL policy inherits that bias rather than correcting it.
Overall, semi-direct perception is useful when the intermediate representation preserves the variables that dominate the task. It can reduce raw sensory variation, improve interpretability, and lower the learning burden on the policy. Its deployment reliability, however, depends on frontend robustness, feature sufficiency, and real-time computation. For sim-to-real deployment, the central question is not only whether the extracted feature is visually stable, but also whether it remains action-relevant after the UAV is embedded in a real environment.

4.3. End-to-End Perception

End-to-end perception removes the explicit perception–planning interface and maps high-dimensional observations directly to flight actions or target attitudes. Its main advantage is low-latency reactivity, because visual interpretation, short-horizon decision-making, and motor response are learned within a single policy. However, this same coupling also makes deployment evaluation more demanding. For end-to-end perception, the key question is not only whether the policy succeeds in a task, but also how the observation is encoded, how the policy is trained, and how the simulator-trained behavior is transferred to real flight.
Early explorations in this domain focused on direct visual–motor coupling through supervised learning. As illustrated in Figure 7, Giusti et al. [71] formulated trail following as an image classification task, utilizing deep neural networks to determine heading directions directly from raw RGB frames. To alleviate the burden of collecting expert flight data, Loquercio et al. [72] explored cross-domain imitation by training a steering and collision-avoidance policy on datasets recorded from cars and bicycles. Subsequent research then shifted toward deep reinforcement learning to handle goal-oriented visual tasks. Zhu et al. [73] introduced a target-driven visual navigation framework, combining the current observation with a target image in an end-to-end Siamese Actor–Critic network.
Furthermore, to safely train deep visual policies without risking catastrophic crashes during exploration, the concept of privileged learning emerged. Kahn et al. [74] utilized an adaptive MPC teacher operating on full state information to safely supervise a neural network that only received raw sensory inputs. Expanding on this privileged learning paradigm, Loquercio et al. [75] developed an end-to-end policy capable of high-speed flight in the wild. By training the sensorimotor agent via a sampling-based expert in simulation with realistic sensor noise, their method achieved zero-shot transfer from simulated noisy depth images to challenging physical environments.
For a policy deployed in different environments, the expected return can be written as
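$$J_{\mathrm{env}}(\pi) = \mathbb{E}_{\tau \sim p_{\mathrm{env}}(\tau \mid \pi)}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right], \qquad \Delta_{\text{sim-to-real}} = J_{\mathrm{sim}}(\pi) - J_{\mathrm{real}}(\pi)$$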
Here, τ denotes the trajectory generated by the closed-loop interaction between the policy, the UAV, and the environment. This formulation is important for end-to-end perception because visual inputs are not independent samples. They are produced by the motion of the vehicle itself. Once the real trajectory distribution differs from the simulated one, visual errors and action errors may be amplified together.
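The gap defined above can be estimated by Monte Carlo rollouts of the same policy in two environments. The sketch below is a toy assumption rather than a visual task: a scalar tracking error stands in for the closed-loop state, the "simulated" environment has nominal actuator gain and no noise, and the "real" environment has a weaker actuator and additive noise. The functions `rollout` and `expected_return`, the fixed linear policy, and all parameter values are hypothetical.

```python
import random


def rollout(policy, actuator_gain: float, noise_std: float, rng: random.Random,
            horizon: int = 50, gamma: float = 0.99) -> float:
    """Discounted return of one closed-loop trajectory on a toy tracking task."""
    x = 1.0          # initial tracking error
    ret = 0.0
    for t in range(horizon):
        u = policy(x)
        # The next state depends on the action just taken, so errors compound
        # along the trajectory rather than appearing as independent samples.
        x = x + 0.1 * actuator_gain * u + rng.gauss(0.0, noise_std)
        ret += (gamma ** t) * (-(x ** 2))
    return ret


def expected_return(policy, actuator_gain: float, noise_std: float,
                    episodes: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of J_env(pi)."""
    rng = random.Random(seed)
    total = sum(rollout(policy, actuator_gain, noise_std, rng) for _ in range(episodes))
    return total / episodes


def linear_policy(x: float) -> float:
    """Fixed stand-in for a policy tuned in the nominal simulator."""
    return -5.0 * x


J_sim = expected_return(linear_policy, actuator_gain=1.0, noise_std=0.0)
J_real = expected_return(linear_policy, actuator_gain=0.6, noise_std=0.05)
gap = J_sim - J_real
```

Even in this simplified setting the policy is unchanged between the two evaluations, so the positive gap comes entirely from the shifted trajectory distribution.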
The first dimension is the observation encoder. Geles et al. [76] used a stateless visual policy for agile traversal, which reduces memory and state-estimation overhead. This design is suitable for low-latency reflexive behavior, and the study reported 20 successful laps out of 20 attempts across six real-world runs. However, the stateless encoder also narrows the information available to the policy, so its generalization still depends on whether the visual scene remains close to the training distribution. Doukhi and Lee [77] processed sparse LiDAR point clouds for mapless navigation. This shifts the encoder from image appearance to geometric point-cloud structure, reducing dependence on texture and illumination. Its limitation is that transfer then depends on whether simulated and real point-cloud statistics match.
The second dimension is the training process. Xing et al. [78] combined imitation learning with RL in an asymmetric training framework for visual racing. This indicates that end-to-end perception does not rely on RL exploration alone; expert demonstrations or privileged information can reduce the cost of learning agile visual behavior. The reported best lap time of 5.54 s supports the effectiveness of this training design in a racing task, but the result should still be interpreted under the visual and geometric regularities of the track. Zhang et al. [79] introduced differentiable physics into visual training and reported 20 m/s forest flight with a 90% success rate. In this case, the training process is not purely data-driven. Physical gradients help shape the policy and improve sample efficiency. However, the point-mass assumption also means that rigid-body rotation and aerodynamic coupling are simplified. Serpiva et al. [80] used video diffusion models to synthesize state–action pairs and enlarge the replay distribution. This expands training coverage, but generated video artifacts and offline generation latency may reduce its suitability for real-time closed-loop deployment.
The third dimension is the transfer mechanism. In [76,78], real-flight validation provides stronger deployment evidence than simulation-only evaluation, but both results still depend on the tested visual distribution and task structure. In [79], differentiable physics improves the connection between visual learning and physical motion, but the simplified dynamics leave a residual gap between simulated and real action consequences. In [80], diffusion-based data generation enables zero-shot gate traversal with a reported real success rate of 61.7%, but this transfer mechanism mainly expands visual and state–action coverage. It does not by itself guarantee that generated samples preserve all physical consequences required by high-speed flight.
Overall, end-to-end perception provides a low-latency route for reflexive flight, but its apparent simplicity hides a tight coupling between sensing, learning, and execution. Its deployment should be evaluated through three linked questions: whether the encoder preserves task-relevant information, whether the training process covers the required visual and dynamic variations, and whether the transfer mechanism has been validated under real onboard constraints. For configuration-varying UAVs, the practical route is not end-to-end scaling alone, but end-to-end perception supported by physics-aware training, explicit deployment evidence, and a clearer separation between visual robustness and platform-specific control realization.
Table 4 shows how sim-to-real gaps enter the cerebellum layer through the perception–action interface. Indirect methods depend on whether predicted states or maps remain accurate under real sensing conditions. Semi-direct methods reduce raw observation shifts through intermediate features, but errors from the frontend extractor can still pass to the RL policy. End-to-end methods are more directly exposed to visual domain shifts, because changes in texture, lighting, sensor noise, or scene layout may immediately alter the action output. Therefore, real-flight results in [76,78,79,80] provide stronger evidence than simulation-only studies, but they should still be interpreted within their tested scenes, sensors, training settings, and platforms. Overall, cerebellum-layer RL mainly addresses whether simulated perception can still support correct action decisions after deployment. It can mitigate observation uncertainty and perception–action coupling errors, but it does not by itself guarantee reliable execution when the platform configuration, flight mode, or control allocation changes. In such cases, perception-layer learning still needs support from lower-layer control interfaces and explicit safety mechanisms.

5. Cerebrum Layer: High-Level Task Planning and Cognitive Decision-Making

For high-level UAV tasks, the role of RL shifts from direct flight execution to task-level organization. At this layer, the policy does not mainly stabilize attitude, reject disturbances, or convert sensory input into short-horizon reactions. Instead, it transforms mission states, multi-agent relations, or semantic instructions into subgoals, schedules, coordination actions, or task plans. Therefore, the central question is not only whether a high-level decision improves task performance, but whether this decision remains executable when lower-layer control, perception, communication, and platform constraints are considered.
This chapter evaluates cerebrum-layer RL through the input–output interface of the learned policy. Temporal hierarchical planning is organized around long-horizon time decomposition. Spatial swarm cooperation is organized around inter-agent relations. Semantic instruction decision-making is organized around language, prompts, or symbolic task descriptions. These paradigms are summarized in Table 5.

5.1. Time-Domain Hierarchical Planning

Temporal hierarchical planning is used when the difficulty of a UAV mission lies not in instantaneous motion generation, but in decisions whose consequences unfold over a longer horizon. In such problems, optimizing all actions at a single timescale may enlarge the effective decision horizon and make credit assignment inefficient. A common solution is to let a high-level policy update sparse mission decisions while a lower-level executor handles short-horizon realization:
$$g_k = \pi_H(s_k), \qquad a_t = \pi_L(o_t, g_k), \qquad t \in [k\Delta, (k+1)\Delta)$$
Here, $g_k$ denotes the macro decision updated every $\Delta$ steps, and $a_t$ is the lower-level action conditioned on it. The purpose of this decomposition is not to impose hierarchy for its own sake, but rather to align with the decision structure of missions in which planning variables evolve more slowly than the actuator loop.
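The two-timescale structure can be made concrete with a short sketch. It assumes a one-dimensional position task with hand-written rules in place of learned policies: the high-level rule proposes a subgoal at most one metre ahead, and the low-level rule is a saturated proportional velocity command. The names `pi_high`, `pi_low`, `run`, and `DELTA` and all gains are hypothetical.

```python
from typing import List, Tuple

DELTA = 10  # number of low-level steps between two macro decisions


def pi_high(s_k: float, goal: float, horizon: float = 1.0) -> float:
    """g_k: subgoal at most `horizon` metres from the current position."""
    step = max(-horizon, min(horizon, goal - s_k))
    return s_k + step


def pi_low(o_t: float, g_k: float, gain: float = 2.0, v_max: float = 1.5) -> float:
    """a_t: saturated proportional velocity command toward the active subgoal."""
    return max(-v_max, min(v_max, gain * (g_k - o_t)))


def run(goal: float = 3.0, steps: int = 100, dt: float = 0.05) -> Tuple[float, List[float]]:
    """Simulate the hierarchy and return the final position and all subgoals."""
    x = 0.0
    g = x
    subgoals: List[float] = []
    for t in range(steps):
        if t % DELTA == 0:        # t = k * DELTA: refresh the macro decision
            g = pi_high(x, goal)
            subgoals.append(g)
        x += dt * pi_low(x, g)    # g_k is held for t in [k*DELTA, (k+1)*DELTA)
    return x, subgoals


final_position, subgoal_log = run()
```

The high-level rule is queried only ten times over one hundred control steps, which is the property that shortens the effective decision horizon. The sketch also reflects the dependence noted later in this section: the macro decisions are only useful because the low-level rule tracks each subgoal reliably.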
In [81], hierarchical RL is used to organize global movement intention before local execution is handled at a finer scale. Here, the value of temporal decomposition lies in shifting navigation progress from instantaneous motion commands to subgoal-level decisions. This supports high-level organization in unknown environments, but the evidence mainly verifies navigation and obstacle-avoidance performance in the tested setting. It does not remove the need for reliable local sensing, short-horizon tracking, and collision avoidance.
The same idea becomes more explicit in endurance-oriented missions. In [82], the planning problem is no longer only where to fly, but also when a route remains favorable under energy and weather conditions. The high-level policy therefore carries a scheduling role, because the value of a decision depends on future energy balance rather than immediate geometric feasibility alone. This makes temporal hierarchy suitable for energy-aware route selection. However, its deployment reliability depends on the accuracy of the energy, weather, and aerodynamic models used during planning.
A further extension appears when flight decisions are coupled with onboard resources. In [83], trajectory planning is combined with distributed inference scheduling, so the high-level policy must coordinate mobility and computation together. In [84], traffic monitoring introduces another slow variable: monitoring persistence over time. In these cases, temporal planning moves from route generation toward task orchestration. The relevant evidence is therefore not only path quality, but also inference delay, computation load, sensing duration, and energy consumption. If communication delay, onboard computation, or sensing reliability changes after deployment, the selected path, computation ratio, or monitoring schedule may no longer remain optimal. Mowla et al. [85] further studied IoT-enabled real-time UAV path planning for dynamic disaster response. Their method combines PPO with a kinematic optimization layer, so that the RL planner adapts to changing hazard zones while the generated path is refined under curvature constraints. In a 25 × 25 dynamic disaster grid, PPO + KinOpt reported a 34.463 m path length, 0.903 path efficiency, 0.099 m$^{-1}$ maximum curvature, and 0.093 s planning time. This work fits the cerebrum layer because RL mainly supports route-level planning rather than low-level flight stabilization. Its limitation is that the evidence still comes from a simulated grid environment, so it should be treated as feasibility-aware path-planning evidence rather than direct flight-control validation.
Taken together, these studies suggest that temporal hierarchy is useful when the dominant decision is delayed, cumulative, and difficult to express through stepwise local rewards. Its effectiveness depends on whether the macro layer is built around variables that actually constrain mission performance. If endurance, communication load, sensing persistence, or computation cost are simplified too aggressively, the hierarchy may become easier to optimize while losing planning validity.
A further limitation is that temporal hierarchical methods usually presuppose a reliable lower layer. Their gains reflect not only better high-level organization, but also the assumption that local tracking, safety preservation, and perception updates are already available elsewhere in the autonomy stack. This limitation becomes more evident in heterogeneous UAV teams. Once platforms differ in endurance, sensing range, onboard computation, or maneuverability, the same macro decision can induce different costs and feasible outcomes. Under such conditions, temporal hierarchy remains useful only if platform differences are represented in the planning state or constraints.

5.2. Spatial-Domain Swarm Cooperation

Spatial swarm cooperation addresses cerebrum-layer tasks in which performance depends on how multiple UAVs coordinate through spatial or network relations. The difficulty is relational rather than purely temporal. Each UAV may observe only a local neighborhood, while the team objective depends on collision avoidance, formation geometry, coverage overlap, communication quality, task offloading, or tactical interaction. In this setting, the main evidence to examine is whether the learned coordination remains valid when team size, topology, communication condition, or agent capability changes.
A common representation is a graph-structured policy. The local relational update can be expressed as
$$v_i^{(l+1)} = \psi\left(v_i^{(l)}, \operatorname{Agg}_{j \in \mathcal{N}_i} \varphi\left(v_i^{(l)}, v_j^{(l)}, e_{ij}\right)\right), \qquad a_i = \pi_\theta\left(v_i^{(L)}\right)$$
Here, $v_i^{(l)}$ is the latent state of agent $i$ at layer $l$, $\mathcal{N}_i$ is its neighbor set, $e_{ij}$ denotes relational features such as distance or link quality, and $a_i$ is the cooperative action. This formulation is useful because it makes local relations part of the decision process.
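A minimal NumPy sketch of this relational update is given below. It assumes mean aggregation, single-layer tanh maps for the message and update functions, random untrained weights shared across layers, and a linear policy head. None of these choices is taken from the cited works, and the names `message_passing_layer`, `W_msg`, `W_upd`, and `W_pi` are hypothetical.

```python
import numpy as np


def message_passing_layer(V, neighbors, E, W_msg, W_upd):
    """One relational update: mean-aggregated messages followed by a node update."""
    V_next = np.zeros((V.shape[0], W_upd.shape[1]))
    for i, nbrs in neighbors.items():
        # Message function: depends on both endpoint states and the edge feature.
        msgs = [np.tanh(np.concatenate([V[i], V[j], E[(i, j)]]) @ W_msg) for j in nbrs]
        # Mean aggregation does not depend on the order of the neighbors.
        agg = np.mean(msgs, axis=0) if msgs else np.zeros(W_msg.shape[1])
        # Update function: combines the agent's own state with the aggregate.
        V_next[i] = np.tanh(np.concatenate([V[i], agg]) @ W_upd)
    return V_next


rng = np.random.default_rng(0)
pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # three UAV positions
neighbors = {0: [1, 2], 1: [0], 2: [0]}                # local communication graph
E = {(i, j): np.array([np.linalg.norm(pos[i] - pos[j])])
     for i in neighbors for j in neighbors[i]}         # edge feature: distance
W_msg = rng.normal(size=(5, 4))   # input: two 2-D states plus one edge feature
W_upd = rng.normal(size=(6, 2))   # input: own 2-D state plus 4-D aggregate
W_pi = rng.normal(size=(2, 2))    # linear policy head

V = pos.copy()
for _ in range(2):                # L = 2 layers, weights shared for brevity
    V = message_passing_layer(V, neighbors, E, W_msg, W_upd)
actions = V @ W_pi                # one 2-D cooperative action per agent
```

Because each agent only reads its own neighbor list, the same weights apply to teams of different sizes. The sketch also makes the deployment dependence visible: if an entry in `neighbors` or `E` is stale or missing, the resulting action changes accordingly.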
Spatial geometry is the most direct form of this relational decision. Yang et al. [86] used Graph Neural Network (GNN) and PPO for cooperative path planning, where the next waypoint is conditioned on neighboring agents. Their reported results show that the success rate improved by over 78% and the collision rate was reduced by 73% compared with representative Multi-Agent Reinforcement Learning (MARL) baselines. This provides relatively direct evidence that graph-based policies can improve dense multi-UAV path planning. However, this result still depends on whether neighbor information and communication links remain available during deployment.
Relative geometry also appears in formation and tracking tasks. Liu et al. [87] considered leader–follower formation control, where the decision output is a formation-keeping command. Zhou et al. [88] studied multi-target cooperative tracking, where the policy must distribute UAVs around moving targets rather than control each vehicle independently. These works support the use of relative geometry as a high-level decision variable. Their deployment limitation is that relative pose, neighbor information, and target observation must remain reliable. When communication delay, sensing noise, or target occlusion occurs, the same learned relation may no longer support the same cooperative behavior.
When communication quality enters the objective, spatial cooperation is no longer only a problem of maintaining geometric formation. It becomes a topology-aware resource-coordination problem, because the policy must consider how UAV positions affect channel quality, connectivity, interference, or secure transmission. Figure 8 illustrates this mechanism in Ref. [89]. In this framework, channel conditions, security-beamforming information, and node features are encoded by graph convolutional layers, and the resulting topology-aware representation is used by a SAC policy. The role of the learned policy is therefore not limited to choosing spatial movements; it also links motion decisions with communication-state changes.
The policy operates on relational information among UAVs, users, and communication links, rather than only on the physical states of individual vehicles. Similar communication-aware formulations appear in [90], where UAV deployment is optimized for communication coverage, in [91], where collision avoidance and trajectory planning are considered together, and in [92], where the policy jointly adjusts power and position under jamming. These studies show that spatial cooperation at the cerebrum layer often extends beyond formation control toward graph-structured resource allocation. However, their conclusions are still conditioned on modeled channel states, connectivity graphs, or jamming assumptions. Therefore, they provide evidence for communication-aware coordination under the tested models, but they do not by themselves establish robustness under real RF uncertainty, packet loss, asynchronous communication, or hardware-constrained networking.
More complex settings introduce dynamic relations or unequal agent roles. In [93], the decision is a cooperative action based on a dynamic graph. In [94], partition paths are planned for IoT data collection. In [95], a heterogeneous policy network is used for swarm confrontation. These studies are closer to practical multi-UAV missions because the value of a decision depends on changing relations or unequal roles. However, this also raises the evidence requirement. A graph may describe who interacts with whom, but it does not automatically describe what each platform can do. In heterogeneous teams, a relay UAV, a sensing UAV, and a high-endurance UAV may occupy similar graph positions while having different feasible actions, communication roles, and energy costs.
Taken together, spatial swarm cooperation provides a structured way to represent inter-agent dependence. Its strength is that it turns neighborhood structure, topology, and tactical geometry into learnable decision variables. Its limitation is that relational abstraction does not remove communication uncertainty or capability mismatch.

5.3. Semantic-Based Instruction Decision-Making

Semantic instruction decision-making is used when the main difficulty lies in translating human intent, task rules, or abstract mission descriptions into executable decision variables. At this layer, RL is no longer required to discover the entire task structure from reward feedback alone. Part of the structure is introduced through language, prompts, symbolic code, or other semantic priors. The main effect is to reduce unguided exploration in large decision spaces by biasing the policy toward semantically plausible actions.
Anderson et al. [96] introduced the Vision-and-Language Navigation (VLN) task. By training a sequence-to-sequence neural network agent to interpret natural language instructions in visual environments, their work provided a method for grounding human semantic commands into sequential navigational actions. In addition to human-provided semantics, agents can also learn emergent communication protocols. Foerster et al. [97] proposed Differentiable Inter-Agent Learning (DIAL), showing that agents can generate and interpret discrete communication messages to coordinate in cooperative tasks.
The process of incorporating these semantic priors to guide exploration can be expressed as a prior-regularized policy optimization problem:
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right] - \lambda\, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid s_{t}, c)\,\|\,\pi_{F}(\cdot \mid s_{t}, c)\right)$$
Here, c denotes the language instruction or semantic prompt, π_F is a foundation-model prior or heuristic policy induced by it, D_KL is the Kullback–Leibler divergence, and λ ≥ 0 weights the regularization term. The second term does not replace reinforcement learning. It biases exploration toward semantically plausible regions of the action space. The key evaluation question is therefore whether the semantic prior is grounded in the real task, not merely whether language is used.
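The trade-off in this objective can be made concrete for a discrete action space. The sketch below evaluates a single trajectory under the assumption that the KL term is averaged over the visited steps; that averaging, the function names, and the example numbers are illustrative choices rather than part of the reviewed formulations.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def regularized_objective(rewards, policy_probs, prior_probs, gamma=0.99, lam=0.1):
    """Discounted return minus lam times the mean per-step KL to the prior."""
    kl = np.mean([kl_divergence(p, q) for p, q in zip(policy_probs, prior_probs)])
    return discounted_return(rewards, gamma) - lam * kl

# Hypothetical 3-step trajectory with 3 discrete actions.
rewards = [1.0, 1.0, 1.0]
prior = [[0.8, 0.1, 0.1]] * 3          # semantic prior favours action 0
aligned = [[0.8, 0.1, 0.1]] * 3        # policy identical to the prior
deviating = [[0.1, 0.1, 0.8]] * 3      # policy that ignores the prior

j_aligned = regularized_objective(rewards, aligned, prior, gamma=0.5, lam=1.0)
j_deviating = regularized_objective(rewards, deviating, prior, gamma=0.5, lam=1.0)
```

With equal rewards, the deviating policy scores lower purely because of the KL penalty. Setting `lam=0` recovers the unregularized return, which is why the regularizer guides exploration without replacing the reward signal.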
Instruction-following navigation provides a direct example of semantic input entering the decision loop. In [98], the policy aligns visual observations with language instructions and outputs navigation actions. This supports the use of semantic input for specifying task progress. However, the evidence mainly concerns vision-language navigation. It does not by itself prove that language can support long-horizon task decomposition or guarantee physical execution. A policy may correctly interpret a command but still fail if the lower-level controller cannot execute the implied maneuver.
Semantic information can also guide the RL search process before the final policy is obtained. In [99], heuristic trajectories generated by a Large Language Model (LLM) first guide the search in a secure heterogeneous UAV network and are then distilled into the RL process. This gives semantic information a stronger role than post hoc explanation, because it influences which candidate policies are explored. The reported evidence shows improvement in secrecy rate and energy efficiency, with robustness across varying swarm sizes and random seeds. This supports the value of semantic priors in complex communication and task-allocation settings. Its limitation is that the heuristic must match the physical and network constraints. If the LLM prior is plausible in language space but inconsistent with communication, energy, or maneuver constraints, it can guide the policy toward a suboptimal or infeasible solution.
Language can also be used to generate executable task logic. In [100], language and music semantics are translated into swarm choreography with motion-planning support. In [101], language is compiled into executable policy code. These studies show that semantic interfaces can improve compositionality and make high-level tasks easier to specify. However, executable code or a generated trajectory is not equivalent to physically safe flight. The generated plan remains valid only if the required primitives, timing assumptions, collision margins, and vehicle dynamics are valid for the actual UAVs.
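A simple way to make this validity condition operational is to screen a generated plan before it reaches the flight stack. The following sketch checks a timed waypoint list against speed, acceleration, and obstacle-margin limits. The limits, the spherical obstacle model, and the function name are assumptions for illustration, and the check covers only the waypoints themselves, not the path between them.

```python
import numpy as np

def check_plan(waypoints, dt, v_max, a_max, obstacles, margin):
    """Return the list of violated constraints for a timed waypoint plan.

    waypoints: sequence of 3D positions sampled every dt seconds.
    obstacles: list of (centre, radius) spheres.
    """
    p = np.asarray(waypoints, dtype=float)
    vel = np.diff(p, axis=0) / dt          # finite-difference velocity
    acc = np.diff(vel, axis=0) / dt        # finite-difference acceleration
    violations = []
    if np.any(np.linalg.norm(vel, axis=1) > v_max):
        violations.append("speed")
    if len(acc) and np.any(np.linalg.norm(acc, axis=1) > a_max):
        violations.append("acceleration")
    for centre, radius in obstacles:
        dist = np.linalg.norm(p - np.asarray(centre, dtype=float), axis=1)
        if np.any(dist < radius + margin):
            violations.append("collision_margin")
            break
    return violations

# Hypothetical plans: a gentle straight line and an aggressive corner.
obstacles = [((5.0, 5.0, 1.0), 0.5)]
gentle = [[0, 0, 1], [1, 0, 1], [2, 0, 1], [3, 0, 1]]
aggressive = [[0, 0, 1], [5, 0, 1], [5, 5, 1]]

ok = check_plan(gentle, dt=1.0, v_max=2.0, a_max=1.0, obstacles=obstacles, margin=0.5)
bad = check_plan(aggressive, dt=1.0, v_max=2.0, a_max=1.0, obstacles=obstacles, margin=0.5)
```

Both plans could be syntactically valid outputs of a language-driven planner. Only the screening step reveals that the second one exceeds the assumed vehicle limits and ends inside the obstacle margin.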
Benchmark and representation studies should be interpreted more cautiously. Works such as [96,102] help define how instruction–vision alignment and aerial vision-language navigation can be evaluated, but they do not by themselves demonstrate a complete semantic RL loop for real UAV deployment. In particular, ref. [102] provides an aerial vision-language navigation benchmark with a 3D simulator rendered from 25 city-level scenarios. This is valuable for training and comparison, but benchmark evidence should not be treated as real-world deployment proof.
Overall, semantic instruction decision-making mainly contributes task-space compression. Language and foundation-model priors can define intent, constrain exploration, and improve task specification. However, semantic correctness is not the same as physical correctness. The same instruction, such as inspecting a target, maintaining formation, or avoiding interference, may correspond to different feasible trajectories, communication actions, or timing constraints on different UAV platforms. Therefore, semantic decision-making should be evaluated by grounding accuracy, plan feasibility, inference latency, safety checking, and dependence on lower-layer motion primitives. Language can specify what to do, but it does not fully determine how the task can be physically executed.
Table 6 shows that cerebrum-layer RL mainly validates high-level task organization rather than direct physical execution. Temporal hierarchical methods use macro decisions such as routes, schedules, or monitoring actions to reduce long-horizon planning difficulty. However, these results still rely on lower-layer tracking, sensing, communication, and energy models. Spatial swarm cooperation provides more task-level evidence in path planning, formation, tracking, communication coverage, and confrontation. Yet most studies still depend on simulated communication graphs, channel states, neighbor observations, or simplified multi-agent settings, so their scalability should not be directly interpreted as real swarm deployability.
Semantic instruction decision-making provides a different type of evidence. These methods show that language, prompts, or foundation-model priors can help define task objectives, guide exploration, or generate task logic. However, semantic correctness does not guarantee physical executability. Vision-language benchmarks and language-programmed policies mainly test grounding, planning, or code generation, rather than full closed-loop UAV deployment. Overall, Table 6 indicates that the main sim-to-real bottleneck at the cerebrum layer is whether high-level decisions remain compatible with real lower-layer control, perception, communication, and platform constraints.

6. Discussion

6.1. Sim-to-Real Challenges

RL-based UAV policies are usually trained in simulation, but real deployment exposes gaps in dynamics, sensing, safety, computation, and training coverage. These issues can be regarded as different forms of sim-to-real challenge. The proposed spinal cord–cerebellum–cerebrum hierarchy helps locate where each challenge enters the UAV autonomy loop.
At the spinal cord layer, the sim-to-real gap mainly arises from dynamics, actuation, control allocation, and safety-critical execution. For standard quadrotors, domain randomization can partially cover variations in mass, inertia, wind, or thrust coefficients. However, this strategy becomes less sufficient when the real platform involves mode transition, variable actuation, or configuration-dependent allocation. In these cases, the learned action may no longer have the same physical effect after deployment, because the mapping from command to body force and moment changes with the vehicle configuration.
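For reference, domain randomization over the parameters listed above can be sketched as follows. The nominal values, the ±20% range, and the wind bound are placeholder assumptions, and `QuadParams` is a hypothetical container rather than an interface of any particular simulator.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class QuadParams:
    mass: float            # kg
    inertia: np.ndarray    # diagonal of the inertia matrix, kg m^2
    thrust_coeff: float    # N per (rad/s)^2
    wind: np.ndarray       # m/s, constant wind vector for one episode

# Placeholder nominal model (assumed values).
NOMINAL = QuadParams(mass=1.0,
                     inertia=np.array([0.01, 0.01, 0.02]),
                     thrust_coeff=1e-5,
                     wind=np.zeros(3))

def sample_params(rng, nominal=NOMINAL, rel_range=0.2, wind_max=3.0):
    """Draw one randomized parameter set for a training episode."""
    def scale(x):
        return x * rng.uniform(1.0 - rel_range, 1.0 + rel_range, size=np.shape(x))
    return QuadParams(
        mass=float(scale(nominal.mass)),
        inertia=scale(nominal.inertia),
        thrust_coeff=float(scale(nominal.thrust_coeff)),
        wind=rng.uniform(-wind_max, wind_max, size=3),
    )

rng = np.random.default_rng(0)
episode_params = [sample_params(rng) for _ in range(100)]
```

Note that every sampled model keeps the same structure: only numerical values change. This is the limitation discussed above, since a mode transition or a configuration change alters the mapping itself rather than its coefficients.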
Nahrendra et al. [103] provide a representative example with a tilting-rotor drone, as shown in Figure 9. Their Retro-RL framework does not allow the RL policy to replace the flight stack directly. Instead, it retains a nominal cascade controller, an allocation matrix, and a state estimator, while the learned policy supplies an additional action under perturbed state inputs. The final command is generated through an uncertainty-aware mixer. This design shows that, for thrust-vectoring platforms, sim-to-real transfer is not only a policy-learning problem, but also an actuator-allocation and uncertainty-handling problem. A similar difficulty is reported in variable-pitch micro air vehicles, where Wang and Zhao [104] observed physical-flight failure caused by nonlinear dynamic mismatch. These results indicate that complex actuation can turn a parameter-mismatch problem into a control-interface mismatch.
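The general idea of keeping a nominal controller and adding a bounded learned correction can be sketched as below. This is not the Retro-RL mixer of [103]: the weighting rule, the residual bound, and all names are hypothetical, chosen only to show how an uncertainty estimate can reduce the authority of the learned action.

```python
import numpy as np

def mix_command(u_nominal, u_residual, uncertainty,
                residual_limit=0.2, sigma_ref=1.0):
    """Blend a nominal command with a bounded learned residual.

    The residual weight w = 1 / (1 + (uncertainty / sigma_ref)^2) equals 1
    when the uncertainty is zero and decays toward 0 as it grows, so the
    command falls back to the nominal controller (assumed weighting rule).
    """
    u_nominal = np.asarray(u_nominal, dtype=float)
    u_residual = np.asarray(u_residual, dtype=float)
    w = 1.0 / (1.0 + (uncertainty / sigma_ref) ** 2)
    bounded = np.clip(u_residual, -residual_limit, residual_limit)
    return u_nominal + w * bounded

# Hypothetical two-channel command (e.g. normalized thrust and tilt).
u_nom = [0.5, 0.5]
u_res = [1.0, -0.05]                      # first entry exceeds the bound
confident = mix_command(u_nom, u_res, uncertainty=0.0)
uncertain = mix_command(u_nom, u_res, uncertainty=100.0)
```

The design choice is that the learned policy can never move the command further than `residual_limit` from the nominal output, which keeps the allocation and estimation logic of the existing flight stack in charge.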
This issue becomes more evident in reconfigurable multirotors. Robust and observer-based controllers, as discussed in Section 3, can provide useful baselines for disturbance rejection, actuator fault estimation, and stability preservation, but they are usually designed around a fixed actuation structure. Morphing quadrotors violate this assumption because geometric reconfiguration changes both the mechanical structure and the control problem. Acar et al. [105] reviewed morphing mechanisms and control strategies for quadrotors from this perspective. Yang et al. [106] studied an arm-length-varying morphing quadrotor and used a convex-combined DRL controller for morphology-dependent flight control. Li et al. [107] further treated flight control and morphing control as a coupled problem, using reinforcement learning to coordinate trajectory tracking and morphology adjustment. These studies suggest that reconfigurable multirotors should not be treated as simple extensions of standard quadrotors. Their deployment gap may involve coupled changes in dynamics, control allocation, actuator coordination, and feasible maneuver sets.
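The dependence of control allocation on morphology can be seen in a small numerical example. The sketch below assumes an X-configuration quadrotor whose four arms change length together, with a hypothetical torque-to-thrust ratio `kappa`. It is a simplified illustration and not the model used in [106] or [107].

```python
import numpy as np

def allocation_matrix(arm_length, kappa=0.016):
    """Map rotor thrusts [N] to [total thrust, roll, pitch, yaw torque].

    Rotors sit at 45, 135, 225, 315 degrees in the body x-y plane and
    thrust along body z, so torque = r x F = (y*f, -x*f, 0).
    """
    angles = np.deg2rad([45.0, 135.0, 225.0, 315.0])
    spin = np.array([1.0, -1.0, 1.0, -1.0])   # alternating rotor directions
    x = arm_length * np.cos(angles)
    y = arm_length * np.sin(angles)
    return np.vstack([np.ones(4), y, -x, spin * kappa])

def wrench(rotor_thrusts, arm_length):
    """Body force and moments produced by a given rotor command."""
    return allocation_matrix(arm_length) @ np.asarray(rotor_thrusts, dtype=float)

# The same rotor command applied in two morphologies.
thrusts = np.array([2.0, 2.0, 3.0, 3.0])
w_short = wrench(thrusts, arm_length=0.15)
w_long = wrench(thrusts, arm_length=0.25)
```

Total thrust and yaw torque are identical in both cases, but the roll torque scales with arm length. A policy trained on one morphology therefore issues commands whose physical effect changes after reconfiguration, even if every simulated parameter was accurate.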
At the cerebellum layer, the gap mainly appears in observation and perception–action coupling. Visual texture, lighting, occlusion, sensor noise, and latency can change the policy input after deployment. World models and generated visual data can enlarge perception coverage, as in [36,80], but they also introduce prediction fidelity, artifact, and latency issues. Therefore, perception robustness does not automatically guarantee that the resulting action remains physically valid on the real platform.
At the cerebrum layer, the gap mainly appears in task feasibility and lower-layer assumptions. High-level planning, swarm coordination, and semantic policies often assume reliable tracking, sensing, communication, and safety checking. In real deployment, these assumptions may fail because of communication delay, heterogeneous vehicle capability, or infeasible generated plans. For example, LLM-generated policies may violate dynamic limits without explicit safety filters [100]. Although language-based policy generation [101] can improve task specification, the generated plan still needs to be checked against vehicle dynamics, communication conditions, and safety constraints.
Other deployment challenges can also be interpreted as sim-to-real problems. Hardware constraints determine whether a policy can satisfy onboard latency, energy, and SWaP (Size, Weight, and Power) limits. The diffusion models in [36,80] have high inference or generation latency, while multi-agent onboard policies may increase energy consumption [84]. Training cost reflects another form of sim-to-real limitation. Real-world data collection is expensive and risky, and although differentiable physics [79] and diffusion-based data generation [80] can improve sampling efficiency, they still require validation that simulated or generated data preserve real action consequences.
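The latency part of this constraint reduces to simple arithmetic: the policy inference time plus other per-cycle work must fit within the control period. The sketch below uses a hypothetical 80% utilization margin and made-up timings, not measurements from the cited studies.

```python
def fits_control_loop(inference_ms, control_hz, other_ms=0.0, utilization=0.8):
    """Check whether per-cycle compute fits within the control period.

    utilization is the assumed fraction of the period available for compute.
    """
    period_ms = 1000.0 / control_hz
    return inference_ms + other_ms <= utilization * period_ms

def max_control_rate(inference_ms, other_ms=0.0, utilization=0.8):
    """Highest control frequency [Hz] that the compute time allows."""
    return utilization * 1000.0 / (inference_ms + other_ms)

# Hypothetical timings: a small policy network versus a generative model.
small_ok = fits_control_loop(inference_ms=2.0, control_hz=100.0)
large_ok = fits_control_loop(inference_ms=50.0, control_hz=100.0)
large_rate = max_control_rate(inference_ms=50.0)
```

Under these assumed numbers, a model with 50 ms inference could run at no more than 16 Hz, which suggests placing it at a slower planning layer rather than inside a fast inner loop.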
Table 7 shows the value of the proposed hierarchy. Beyond grouping studies by task level, it indicates which sim-to-real evidence should be checked at each layer: execution safety for the spinal cord layer, perception–action consistency for the cerebellum layer, and task feasibility for the cerebrum layer.

6.2. Future Trends

UAV reinforcement learning is gradually moving from purely algorithm-level updates toward closer integration with physical priors, model-based constraints, and data-driven representations. Earlier studies often focused on improving standard RL algorithms, for example by moving from PPO to MAPPO. These algorithmic improvements remain useful, but they may be insufficient when the main deployment difficulty comes from dynamics mismatch, safety constraints, onboard computation, or task-level uncertainty. Therefore, future progress may depend not only on new RL algorithms, but also on how learning policies are combined with physical models, safety mechanisms, and deployable network architectures.
Frontier research is mainly reflected in the following four directions:
  • Adding Physical Priors: Purely data-driven policies may require extensive exploration to learn basic aerodynamic and dynamic regularities. To reduce this burden, RL controllers may benefit from incorporating physical priors, equivariant structures, differentiable dynamics, or barrier function constraints. In [108], Yu and Lee built an equivariant network using the rotational symmetry of quadrotors to reduce the parameter search space. Peng et al. [109] added CBFs to the Actor–Critic model to keep actions within safe physical limits. A possible next step is to make these priors more adaptable to aerodynamic disturbances, actuator delay, and configuration changes, so that learned policies can better account for real UAV dynamics.
  • Using Foundation Models: Foundation models may provide useful priors for task specification, heuristic generation, and vision-language grounding. Instead of relying only on random exploration, future RL agents may use the reasoning or generative ability of LLMs, Vision-Language Models (VLMs), and diffusion models to reduce the search space of high-level planning or communication decision-making [110]. Lin et al. [111] built a VLM-based navigation framework to plan paths using visual inputs and text instructions. Swarm choreography studies [100] also indicate that language models can help specify complex multi-agent behaviors. For UAV deployment, this direction could be further connected with vehicle dynamics, safety constraints, communication conditions, and mission-specific rules to improve the feasibility of semantically generated decisions.
  • Fast Network Architectures: As neural policies become more complex, micro-UAV deployment may be limited by onboard computation, energy consumption, and thermal constraints. Lightweight and neuromorphic network architectures may provide a useful route for reducing inference cost, especially when policies operate close to the low-level control loop. In this direction, an important optimization issue is the balance among inference speed, control frequency, disturbance robustness, and compatibility with embedded flight hardware.
  • Distributed Swarm Control: As UAV swarms scale up, centralized MARL may face increasing difficulty in state representation, communication, and training complexity. Distributed or partially distributed architectures may reduce the reliance on a central controller and improve scalability under communication constraints. Self-attention mechanisms [112] can process variable numbers of neighboring agents, while distributed machine learning [113] may help share sensing and computing resources across the fleet. Further development may focus on maintaining coordination under packet loss, asynchronous updates, heterogeneous vehicle capabilities, and limited onboard computation, which remain important constraints in practical swarm deployment.
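To illustrate the barrier-function constraint mentioned in the first direction above, the sketch below filters an RL velocity command for a one-dimensional altitude model. It assumes single-integrator dynamics (the command is the vertical velocity) and a linear class-K function with gain `alpha`. It is a textbook-style example and not the Actor–Critic formulation of [109].

```python
def cbf_filter(u_rl, h, h_min=1.0, alpha=2.0):
    """Minimally modify u_rl so that altitude h stays above h_min.

    Barrier b(h) = h - h_min with dynamics dh/dt = u. The condition
    db/dt + alpha * b >= 0 becomes u >= -alpha * (h - h_min), and the
    closest admissible command to u_rl is a simple maximum.
    """
    lower_bound = -alpha * (h - h_min)
    return max(u_rl, lower_bound)

def simulate(h0, u_rl, steps=200, dt=0.01, h_min=1.0, alpha=2.0):
    """Roll out the filtered command with forward-Euler integration."""
    h = h0
    trace = [h]
    for _ in range(steps):
        h += dt * cbf_filter(u_rl, h, h_min=h_min, alpha=alpha)
        trace.append(h)
    return trace

# A policy that keeps commanding a fast descent of 5 m/s from 3 m altitude.
altitudes = simulate(h0=3.0, u_rl=-5.0)
lowest = min(altitudes)
```

The filter leaves safe commands untouched and only slows the descent as the vehicle approaches the altitude floor. For real multirotors the same idea requires higher-order dynamics and a constrained optimization step, which is where adaptation to actuator delay and aerodynamic disturbances becomes relevant.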

7. Conclusions

This review analyzes reinforcement learning for multirotor UAVs through a biomimetic spinal cord–cerebellum–cerebrum hierarchy. The framework classifies existing studies according to the input–output role of the learned policy rather than only by algorithm type or task scenario. This view separates low-level control intervention, mid-level perception–action coordination, and high-level task organization, and shows that each layer faces different sim-to-real risks.
The reviewed studies are mainly based on standard quadrotor or multirotor platforms, which provide the main evidence for current UAV-RL methods. However, real deployment may involve more complex configurations, such as thrust-vectoring, tilt-rotor, variable-pitch, or morphing multirotors. These platforms are not treated here as separate survey objects, but as cases where standard sim-to-real problems become more severe. In such systems, the gap may involve not only dynamics or sensing errors, but also changes in action definition, control allocation, flight mode, and feasible maneuvers. Therefore, future UAV-RL research should move beyond algorithm replacement and focus on interface-aware, safety-constrained, allocation-aware, and hardware-feasible policy design.

Author Contributions

Conceptualization, W.W.; methodology, W.W.; validation, M.D.; formal analysis, X.Z.; investigation, Y.S., Q.M., M.D. and Y.W.; resources, W.W. and Q.Y.; writing—original draft preparation, X.Z.; writing—review and editing, Y.S. and Q.M.; supervision, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Grant No. 2020YFC1512500).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ham, Y.; Han, K.K.; Lin, J.J.; Golparvar-Fard, M. Visual Monitoring of Civil Infrastructure Systems via Camera-Equipped Unmanned Aerial Vehicles (UAVs): A Review of Related Works. Vis. Eng. 2016, 4, 1.
  2. Erdelj, M.; Natalizio, E.; Chowdhury, K.R.; Akyildiz, I.F. Help from the Sky: Leveraging UAVs for Disaster Management. IEEE Pervasive Comput. 2017, 16, 24–32.
  3. Nex, F.; Remondino, F. UAV for 3D Mapping Applications: A Review. Appl. Geomat. 2014, 6, 1–15.
  4. Chen, H.; Lin, Y.; Fu, M.; Yao, L.; Sheng, M. A Survey on Reinforcement Learning Methods for UAV Systems. ACM Comput. Surv. 2025, 58, 103:1–103:37.
  5. Rezwan, S.; Choi, W. Artificial Intelligence Approaches for UAV Navigation: Recent Advances and Future Challenges. IEEE Access 2022, 10, 26320–26339.
  6. Sönmez, S.; Rutherford, M.J.; Valavanis, K.P. A survey of offline- and online-learning-based algorithms for multirotor UAVs. Drones 2024, 8, 116.
  7. Xiao, J.; Zhang, R.; Zhang, Y.; Feroskhan, M. Vision-Based Learning for Drones: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 15601–15621.
  8. Cisek, P.; Kalaska, J.F. Neural Mechanisms for Interacting with a World Full of Action Choices. Annu. Rev. Neurosci. 2010, 33, 269–298.
  9. Ito, M. Control of mental activities by internal models in the cerebellum. Nat. Rev. Neurosci. 2008, 9, 304–313.
  10. Grillner, S.; El Manira, A. Current Principles of Motor Control, with Special Reference to Vertebrate Locomotion. Physiol. Rev. 2020, 100, 271–320.
  11. Hosseinzadeh, M.; Ali, S.; Ionescu-Feleaga, L.; Ionescu, B.-S.; Yousefpoor, M.S.; Yousefpoor, E.; Ahmed, O.H.; Rahmani, A.M.; Mehmood, A. A Novel Q-Learning-Based Routing Scheme Using an Intelligent Filtering Algorithm for Flying Ad Hoc Networks (FANETs). J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101817.
  12. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
  13. Ni, J.; Ge, Y.; Zhao, Y.; Gu, Y. An Improved Multi-UAV Area Coverage Path Planning Approach Based on Deep Q-Networks. Appl. Sci. 2025, 15, 11211.
  14. Huang, H.; Yang, Y.; Wang, H.; Ding, Z.; Sari, H.; Adachi, F. Deep Reinforcement Learning for UAV Navigation Through Massive MIMO Technique. IEEE Trans. Veh. Technol. 2020, 69, 1117–1121.
  15. Cherif, N.; Jaafar, W.; Yanikomeroglu, H.; Yongacoglu, A. RL-Based Cargo-UAV Trajectory Planning and Cell Association for Minimum Handoffs, Disconnectivity, and Energy Consumption. IEEE Trans. Veh. Technol. 2024, 73, 7304–7309.
  16. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
  17. Rodriguez-Ramos, A.; Sampedro, C.; Bavle, H.; Moreno, I.G.; Campoy, P. A deep reinforcement learning technique for vision-based autonomous multirotor landing on a moving platform. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1010–1017.
  18. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  19. Deshpande, A.M.; Kumar, R.; Minai, A.A.; Kumar, M. Developmental reinforcement learning of control policy of a quadcopter UAV with thrust vectoring rotors. In Proceedings of the ASME 2020 Dynamic Systems and Control Conference (DSCC), Virtual, 5–7 October 2020; p. V002T36A011.
  20. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
  21. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6382–6393.
  22. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304.
  23. Ma, B.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, J.; Cheng, Y.; Yuan, Z. Deep Reinforcement Learning of UAV Tracking Control Under Wind Disturbances Environments. IEEE Trans. Instrum. Meas. 2023, 72, 2510913.
  24. Wu, H.; Ye, H.; Xue, W.; Yang, X. Improved Reinforcement Learning Using Stability Augmentation with Application to Quadrotor Attitude Control. IEEE Access 2022, 10, 67590–67604.
  25. Shetty, G.; Ramezani, M.; Habibi, H.; Voos, H.; Sanchez-Lopez, J.L. Motion control in multi-rotor aerial robots using deep reinforcement learning. In Proceedings of the 2025 International Conference on Unmanned Aircraft Systems (ICUAS), Charlotte, NC, USA, 14–17 May 2025; pp. 29–36.
  26. Jin, S.; Zhao, W. Temporal-sequence offline reinforcement learning for transition control of a novel tilt-wing unmanned aerial vehicle. Aerospace 2025, 12, 435.
  27. Mitakidis, A.; Aspragkathos, S.N.; Panetsos, F.; Karras, G.C.; Kyriakopoulos, K.J. A deep reinforcement learning visual servoing control strategy for target tracking using a multirotor UAV. In Proceedings of the 2023 9th International Conference on Automation, Robotics and Applications (ICARA), Abu Dhabi, United Arab Emirates, 10–12 February 2023; pp. 219–224.
  28. Pi, C.-H.; Dai, Y.-W.; Hu, K.-C.; Cheng, S. General purpose low-level reinforcement learning control for multi-axis rotor aerial vehicles. Sensors 2021, 21, 4560.
  29. Sharifi, D.; Torabi, K.; Rahaghi, M.I.; Shahbazi, H. Designing a Reinforcement Learning-Based Robust Controller Resisting External Forces for a Octarotor with New Structure. J. Braz. Soc. Mech. Sci. Eng. 2023, 45, 141.
  30. Gamal, Z.; Mahran, Y.; El-Badawy, A. Control of a twin rotor using twin delayed deep deterministic policy gradient (TD3). In Proceedings of the 2024 28th International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 10–12 October 2024; pp. 329–335.
  31. Panetsos, F.; Karras, G.C.; Kyriakopoulos, K.J. A deep reinforcement learning motion control strategy of a multi-rotor UAV for payload transportation with minimum swing. In Proceedings of the 2022 30th Mediterranean Conference on Control and Automation (MED), Vouliagmeni, Greece, 28 June–1 July 2022; pp. 368–374.
  32. Hua, H.; Fang, Y.; Zhang, X.; Qian, C. A New Nonlinear Control Strategy Embedded with Reinforcement Learning for a Multirotor Transporting a Suspended Payload. IEEE/ASME Trans. Mechatron. 2022, 27, 1174–1184.
  33. Xie, Y.; Yu, C.; Zang, H.; Gao, F.; Tang, W.; Huang, J.; Chen, J.; Xu, B.; Wu, Y.; Wang, Y. Multi-UAV formation control with static and dynamic obstacle avoidance via reinforcement learning. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, 19–25 October 2025; pp. 20410–20417.
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  35. Abdulkadirov, R.; Lyakhov, P.; Butusov, D.; Nagornov, N.; Kalita, D. Physics-Aware Machine Learning Approach for High-Precision Quadcopter Dynamics Modeling. Drones 2025, 9, 187.
  36. Zhang, W.; Tang, P.; Zeng, X.; Man, F.; Yu, S.; Dai, Z.; Zhao, B.; Chen, H.; Shang, Y.; Wu, W.; et al. Aerial world model for long-horizon visual generation and navigation in 3D space. arXiv 2025, arXiv:2512.21887.
  37. van Breukelen Castillo, M.F.; de Wagter, C.; Ferede, R.; Vos, R.W.; de Croon, G.C.H.E. Spiking neural networks for high-speed continuous quadcopter control using proximal policy optimization. In Proceedings of the 16th International Micro Air Vehicle Conference and Competition (IMAV 2025), San Andrés Cholula, Mexico, 3–7 November 2025; pp. 133–141.
  38. Peng, C.; Qiao, G.; Ge, B. Dynamic Cascade Spiking Neural Network Supervisory Controller for a Nonplanar Twelve-Rotor UAV. Sensors 2025, 25, 1177.
  39. Dash, P.; Chan, E.; Lawrence, N.P.; Pattabiraman, K. ARMOR: Robust reinforcement learning-based control for UAVs under physical attacks. arXiv 2025, arXiv:2506.22423.
  40. Vaidya, V.; Keshavan, J. Dynamics-invariant quadrotor control using scale-aware deep reinforcement learning. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, 19–25 October 2025; pp. 2368–2375.
  41. Borja-Jaimes, V.; Valdez-Martínez, J.S.; Beltrán-Escobar, M.; Ramírez-Zúñiga, G.; Reyes-Mayer, A.; Calixto-Rodríguez, M. Robust Backstepping-Sliding Control of a Quadrotor UAV with Disturbance Compensation. Computation 2026, 14, 51.
  42. Zhao, Z.; Cao, D.; Yang, J.; Wang, H. High-Order Sliding Mode Observer-Based Trajectory Tracking Control for a Quadrotor UAV with Uncertain Dynamics. Nonlinear Dyn. 2020, 102, 2583–2596.
  43. Chen, L.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, G. Robust trajectory tracking control for a quadrotor using recursive sliding mode control and nonlinear extended state observer. Aerosp. Sci. Technol. 2022, 128, 107749.
  44. Borja-Jaimes, V.; Coronel-Escamilla, A.; Escobar-Jiménez, R.F.; Adam-Medina, M.; Guerrero-Ramírez, G.V.; Sánchez-Coronado, E.M.; García-Morales, J. Fractional-order sliding mode observer for actuator fault estimation in a quadrotor UAV. Mathematics 2024, 12, 1247.
  45. Labbadi, M.; Cherkaoui, M. Adaptive fractional-order nonsingular fast terminal sliding mode based robust tracking control of quadrotor UAV with Gaussian random disturbances and uncertainties. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 2265–2277.
  46. Sönmez, S.; Martini, S.; Rutherford, M.J.; Valavanis, K.P. Reinforcement learning based PID parameter tuning and estimation for multirotor UAVs. In Proceedings of the 2024 International Conference on Unmanned Aircraft Systems (ICUAS), Chania, Greece, 4–7 June 2024; pp. 1224–1231.
  47. Sönmez, S.; Montecchio, L.; Martini, S.; Rutherford, M.J.; Rizzo, A.; Stefanovic, M.; Valavanis, K.P. Reinforcement Learning-Based PD Controller Gains Prediction for Quadrotor UAVs. Drones 2025, 9, 581.
  48. Hoover, R.J.; Wu, W.; Shimada, K. Applying reinforcement learning to PID flight control of a quadrotor drone to mitigate wind disturbances. In Proceedings of the 2024 10th International Conference on Automation, Robotics and Applications (ICARA), Athens, Greece, 22–24 February 2024; pp. 285–293.
  49. Ishihara, Y.; Hazama, Y.; Suzuki, K.; Yokono, J.J.; Sabe, K.; Kawamoto, K. Improving wind resistance performance of cascaded PID controlled quadcopters using residual reinforcement learning. arXiv 2023, arXiv:2308.01648.
  50. Song, F.; Li, Z.; Yu, X. A feedforward quadrotor disturbance rejection method for visually identified gust sources based on transfer reinforcement learning. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 6612–6623.
  51. Zhang, R.; Zhang, D.; Mueller, M.W. ProxFly: Robust control for close proximity quadcopter flight via residual reinforcement learning. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; pp. 13683–13689.
  52. Kim, D.; Lee, J.D.; Bang, H.; Bae, J. Reinforcement learning-based fault-tolerant control for quadrotor with online transformer adaptation. arXiv 2025, arXiv:2505.08223. [Google Scholar]
  53. Huang, Y.-H.; Liu, E.-J.; Wu, B.-C.; Ning, Y.-J. Safe UAV Control Against Wind Disturbances via Demonstration-Guided Reinforcement Learning. Drones 2026, 10, 2. [Google Scholar] [CrossRef]
  54. Hwangbo, J.; Sa, I.; Siegwart, R.; Hutter, M. Control of a Quadrotor with Reinforcement Learning. IEEE Robot. Autom. Lett. 2017, 2, 2096–2103. [Google Scholar] [CrossRef]
  55. Koch, W.; Mancuso, R.; West, R.; Bestavros, A. Reinforcement Learning for UAV Attitude Control. ACM Trans. Cyber Phys. Syst. 2019, 3, 1–21. [Google Scholar] [CrossRef]
  56. Pi, C.-H.; Hu, K.-C.; Cheng, S.; Wu, I.-C. Low-Level Autonomous Control and Tracking of Quadrotor Using Reinforcement Learning. Control Eng. Pract. 2020, 95, 104222. [Google Scholar] [CrossRef]
  57. Lambert, N.O.; Drew, D.S.; Yaconelli, J.; Levine, S.; Calandra, R.; Pister, K.S.J. Low-Level Control of a Quadrotor with Deep Model-Based Reinforcement Learning. IEEE Robot. Autom. Lett. 2019, 4, 4224–4230. [Google Scholar] [CrossRef]
  58. Tayar, M.S.; de Oliveira, L.K.; Negri, J.D.; Segreto, T.H.; Godoy, R.V.; Becker, M. Autonomous UAV flight navigation in confined spaces: A reinforcement learning approach. arXiv 2025, arXiv:2508.16807. [Google Scholar] [CrossRef]
  59. Cheng, R.; Orosz, G.; Murray, R.M.; Burdick, J.W. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3387–3395. [Google Scholar] [CrossRef]
  60. Wang, Z.; Mahmoudian, N. Vision-driven river following of UAV via safe reinforcement learning using semantic dynamics model. Robot. Auton. Syst. 2026, 198, 105357. [Google Scholar] [CrossRef]
  61. Jiang, W.; Cai, T.; Xu, G.; Wang, Y. Autonomous obstacle avoidance and target tracking of UAV: Transformer for observation sequence in reinforcement learning. Knowl. Based Syst. 2024, 290, 111604. [Google Scholar] [CrossRef]
  62. Ebrahimi, D.; Sharafeddine, S.; Ho, P.-H.; Assi, C. Autonomous UAV Trajectory for Localizing Ground Objects: A Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2021, 20, 1312–1324. [Google Scholar] [CrossRef]
  63. Liang, C.; Liu, L.; Cao, J.; Li, X. Autonomous Collision-Avoiding for Multi-UAVs in Complex Dynamic Environments: An Event-Triggered PPO Approach with LSTM-Attention Integration. Neural Netw. 2026, 194, 108196. [Google Scholar] [CrossRef]
  64. Ross, S.; Melik-Barkhudarov, N.; Shankar, K.S.; Wendel, A.; Dey, D.; Bagnell, J.A.; Hebert, M. Learning Monocular Reactive UAV Control in Cluttered Natural Environments. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation; IEEE: Karlsruhe, Germany, 2013; pp. 1765–1772. [Google Scholar]
  65. Xie, L.; Wang, S.; Markham, A.; Trigoni, N. Towards Monocular Vision Based Obstacle Avoidance through Deep Reinforcement Learning. arXiv 2017, arXiv:1706.09829. [Google Scholar] [CrossRef]
  66. Xu, Z.; Han, X.; Shen, H.; Jin, H.; Shimada, K. NavRL: Learning safe flight in dynamic environments. IEEE Robot. Autom. Lett. 2025, 10, 3668–3675.
  67. Kang, Y.; Di, J.; Li, M.; Zhao, Y.; Wang, Y. Autonomous Multi-Drone Racing Method Based on Deep Reinforcement Learning. Sci. China Inf. Sci. 2024, 67, 180203.
  68. Zhang, S.; Li, Y.; Dong, Q. Autonomous Navigation of UAV in Multi-Obstacle Environments Based on a Deep Reinforcement Learning Approach. Appl. Soft Comput. 2022, 115, 108194.
  69. Liang, C.; Liu, L.; Liu, C. Multi-UAV Autonomous Collision Avoidance Based on PPO-GIC Algorithm with CNN–LSTM Fusion Network. Neural Netw. 2023, 162, 21–33.
  70. Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-Level Drone Racing Using Deep Reinforcement Learning. Nature 2023, 620, 982–987.
  71. Giusti, A.; Guzzi, J.; Ciresan, D.C.; He, F.-L.; Rodriguez, J.P.; Fontana, F.; Faessler, M.; Forster, C.; Schmidhuber, J.; Caro, G.D.; et al. A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots. IEEE Robot. Autom. Lett. 2016, 1, 661–667.
  72. Loquercio, A.; Maqueda, A.I.; del-Blanco, C.R.; Scaramuzza, D. DroNet: Learning to Fly by Driving. IEEE Robot. Autom. Lett. 2018, 3, 1088–1095.
  73. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-Driven Visual Navigation in Indoor Scenes Using Deep Reinforcement Learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Singapore, 2017; pp. 3357–3364.
  74. Kahn, G.; Zhang, T.; Levine, S.; Abbeel, P. PLATO: Policy Learning Using Adaptive Trajectory Optimization. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Singapore, 2017; pp. 3342–3349.
  75. Loquercio, A.; Kaufmann, E.; Ranftl, R.; Müller, M.; Koltun, V.; Scaramuzza, D. Learning High-Speed Flight in the Wild. Sci. Robot. 2021, 6, eabg5810.
  76. Geles, I.; Bauersfeld, L.; Romero, A.; Xing, J.; Scaramuzza, D. Demonstrating agile flight from pixels without state estimation. In Proceedings of Robotics: Science and Systems (RSS), Delft, The Netherlands, 15–19 July 2024.
  77. Doukhi, O.; Lee, D.-J. Deep Reinforcement Learning for End-to-End Local Motion Planning of Autonomous Aerial Robots in Unknown Outdoor Environments: Real-Time Flight Experiments. Sensors 2021, 21, 2534.
  78. Xing, J.; Romero, A.; Bauersfeld, L.; Scaramuzza, D. Bootstrapping reinforcement learning with imitation for vision-based agile flight. arXiv 2024, arXiv:2403.12203.
  79. Zhang, Y.; Hu, Y.; Song, Y.; Zou, D.; Lin, W. Learning Vision-Based Agile Flight via Differentiable Physics. Nat. Mach. Intell. 2025, 7, 954–966.
  80. Serpiva, V.; Lykov, A.; Batool, F.; Kozlovskiy, V.; Cabrera, M.A.; Tsetserukou, D. FlightDiffusion: Revolutionising autonomous drone training with diffusion models generating FPV video. arXiv 2025, arXiv:2509.14082.
  81. Zhang, X.; Zong, H.; Wu, W. Cooperative Obstacle Avoidance of Unmanned System Swarm via Reinforcement Learning Under Unknown Environments. IEEE Trans. Instrum. Meas. 2025, 74, 7500615.
  82. Xu, T.; Wu, D.; Meng, W.; Ni, W.; Zhang, Z. Energy-Optimal Trajectory Planning for Near-Space Solar-Powered UAV Based on Hierarchical Reinforcement Learning. IEEE Access 2024, 12, 21420–21436.
  83. Dhuheir, M.A.; Baccour, E.; Erbad, A.; Al-Obaidi, S.S.; Hamdi, M. Deep Reinforcement Learning for Trajectory Path Planning and Distributed Inference in Resource-Constrained UAV Swarms. IEEE Internet Things J. 2023, 10, 8185–8201.
  84. Kong, X.; Ni, C.; Duan, G.; Shen, G.; Yang, Y.; Das, S.K. Energy Consumption Optimization of UAV-Assisted Traffic Monitoring Scheme with Tiny Reinforcement Learning. IEEE Internet Things J. 2024, 11, 21135–21145.
  85. Mowla, M.N.; Asadi, D.; Rabie, K. IoT-Enabled Real-Time UAV Path Planning for Dynamic Disaster Response. IEEE Internet Things Mag. 2025, 1–8, Early access.
  86. Yang, Q.; Lu, J.; Zhang, Y.; Shao, S. Multi-UAV Cooperative Path Planning via Graph Neural Network and Proximal Policy Optimization. IEEE Access 2025, 13, 193575–193588.
  87. Liu, Z.; Li, J.; Shen, J.; Wang, X.; Chen, P. Leader–Follower UAVs Formation Control Based on a Deep Q-Network Collaborative Framework. Sci. Rep. 2024, 14, 4674.
  88. Zhou, W.; Li, J.; Liu, Z.; Shen, L. Improving Multi-Target Cooperative Tracking Guidance for UAV Swarms Using Multi-Agent Reinforcement Learning. Chin. J. Aeronaut. 2022, 35, 100–112.
  89. Tang, X.; Zhao, K.; Shen, C.; Du, Q.; Wang, Y.; Niyato, D.; Han, Z. Deep Graph Reinforcement Learning for UAV-Enabled Multi-User Secure Communications. IEEE Trans. Mob. Comput. 2025, 24, 8780–8793.
  90. Jiang, Z.; Chen, Y.; Wang, K.; Yang, B.; Song, G. A graph-based PPO approach in multi-UAV navigation for communication coverage. Int. J. Comput. Commun. Control 2023, 18, 5505.
  91. Hsu, Y.-H.; Gau, R.-H. Reinforcement Learning-Based Collision Avoidance and Optimal Trajectory Planning in UAV Communication Networks. IEEE Trans. Mob. Comput. 2022, 21, 306–320.
  92. Feng, Z.; Huang, M.; Wu, D.; Wu, E.Q.; Yuen, C. Multi-Agent Reinforcement Learning with Policy Clipping and Average Evaluation for UAV-Assisted Communication Markov Game. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14281–14293.
  93. Elrod, M.; Mehrabi, N.; Amin, R.; Kaur, M.; Cheng, L.; Martin, J.; Razi, A. Graph based deep reinforcement learning aided by transformers for multi-agent cooperation. In Proceedings of the 2025 IEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, 8–12 June 2025; pp. 415–420.
  94. Ge, G.; Sun, M.; Xue, Y.; Pavlova, S. Transformer-based soft actor–critic for UAV path planning in precision agriculture IoT networks. Sensors 2025, 25, 7463.
  95. Su, A.; Hou, F.; Hong, Y. Heterogeneous policy network reinforcement learning for UAV swarm confrontation. In Proceedings of the 2024 China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; pp. 722–727.
  96. Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sunderhauf, N.; Reid, I.; Gould, S.; Van Den Hengel, A. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Salt Lake City, UT, USA, 2018; pp. 3674–3683.
  97. Foerster, J.N.; Assael, Y.M.; de Freitas, N.; Whiteson, S. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 2145–2153.
  98. Saxena, P.; Raghuvanshi, N.; Goveas, N. UAV-VLN: End-to-end vision-language guided navigation for UAVs. In Proceedings of the 2025 European Conference on Mobile Robots (ECMR), Padua, Italy, 2–5 September 2025; pp. 1–6.
  99. Zheng, L.; He, J.; Chang, S.Y.; Shen, Y.; Niyato, D. LLM meets the sky: Heuristic multi-agent reinforcement learning for secure heterogeneous UAV networks. arXiv 2025, arXiv:2507.17188.
  100. Schuck, M.; Dahanaggamaarachchi, D.O.; Sprenger, B.; Vyas, V.; Zhou, S.; Schoellig, A.P. SwarmGPT: Combining large language models with safe motion planning for drone swarm choreography. IEEE Robot. Autom. Lett. 2025, 10, 12237–12244.
  101. Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Florence, P.; Zeng, A. Code as policies: Language model programs for embodied control. arXiv 2023, arXiv:2209.07753.
  102. Liu, S.; Zhang, H.; Qi, Y.; Wang, P.; Zhang, Y.; Wu, Q. AerialVLN: Vision-and-language navigation for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 15384–15394.
  103. Nahrendra, I.M.A.; Tirtawardhana, C.; Yu, B.; Lee, E.M.; Myung, H. Retro-RL: Reinforcing Nominal Controller with Deep Reinforcement Learning for Tilting-Rotor Drones. IEEE Robot. Autom. Lett. 2022, 7, 9004–9011.
  104. Wang, Z.; Zhao, S. Sim-to-real transfer in reinforcement learning for maneuver control of a variable-pitch MAV. IEEE Trans. Ind. Electron. 2025, 72, 10445–10454.
  105. Acar, O.; Honkavaara, E.; Botez, R.M.; Bayburt, D.Ç. Mechanisms and control strategies for morphing structures in quadrotors: A review and future prospects. Drones 2025, 9, 663.
  106. Yang, T.; Wu, H.-N.; Wang, J.-W. cc-DRL: A convex combined deep reinforcement learning flight control design of a morphing quadrotor. IEEE Trans. Cybern. 2025, 55, 4554–4566.
  107. Li, C.-X.; Wu, H.-N.; Yang, T. Coordinated control of flight and morphing for morphing quadrotor via reinforcement learning. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 12755–12766.
  108. Yu, B.; Lee, T. Equivariant reinforcement learning frameworks for quadrotor low-level control. IEEE Trans. Control Syst. Technol. 2026, 34, 86–99.
  109. Peng, C.; Liu, X.; Ma, J. Design of safe optimal guidance with obstacle avoidance using control barrier function-based actor–critic reinforcement learning. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 6861–6873.
  110. Zhang, C.; Sun, G.; Li, J.; Wu, Q.; Wang, J.; Niyato, D.; Liu, Y. Multi-objective aerial collaborative secure communication optimization via generative diffusion model-enabled deep reinforcement learning. IEEE Trans. Mob. Comput. 2025, 24, 3041–3058.
  111. Lin, P.; Sun, G.; Liu, C.; Li, F.; Ren, W.; Cong, Y. OpenVLN: Open-world aerial vision-language navigation. arXiv 2025, arXiv:2511.06182.
  112. Chen, D.; Qi, Q.; Fu, Q.; Wang, J.; Liao, J.; Han, Z. Transformer-based reinforcement learning for scalable multi-UAV area coverage. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10062–10077.
  113. Ding, Y.; Yang, Z.; Pham, Q.-V.; Hu, Y.; Zhang, Z.; Shikh-Bahaei, M. Distributed machine learning for UAV swarms: Computing, sensing, and semantics. IEEE Internet Things J. 2024, 11, 7447–7473.
Figure 1. Taxonomy of RL-based UAV systems in this survey.
Figure 2. Agent–environment interaction [16].
Figure 3. Architecture of the residual reinforcement learning controller [49].
Figure 4. The model predictive control loop used to stabilize the Crazyflie [57]. a* represents the optimal action selected by the objective function from simulated candidate actions.
Figure 5. Network architecture of monocular image based obstacle avoidance through deep reinforcement learning [65].
Figure 6. Technical architecture of the Swift drone-racing system, including visual–inertial perception, state estimation, learned control policy, simulation training loop, and real-world deployment [70].
Figure 7. An end-to-end perception framework for autonomous UAV trail following [71].
Figure 8. GNN-assisted SAC framework for topology-aware UAV communication coordination. The graph encoder converts node features, channel conditions, and security-beamforming information into a topology-aware state representation, which is then used by the SAC update [89].
Figure 9. Retro-RL framework for a tilting-rotor drone [103].
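The selection of a* described in the Figure 4 caption (simulate candidate actions through a model, score them with an objective, apply the best one) can be illustrated with a random-shooting sketch. This is a minimal illustration under stated assumptions, not the controller of [57]: `toy_model`, the quadratic cost, the horizon, and the candidate count are all hypothetical stand-ins, and a scalar state replaces the real vehicle state.

```python
import random
from typing import Callable, Sequence

Model = Callable[[float, float], float]


def toy_model(state: float, action: float) -> float:
    """Hypothetical stand-in for a learned one-step dynamics model."""
    return 0.9 * state + 0.5 * action


def rollout_cost(model: Model, state: float, actions: Sequence[float]) -> float:
    """Simulate an action sequence through the model and sum a quadratic cost."""
    cost = 0.0
    for action in actions:
        state = model(state, action)
        cost += state * state + 0.01 * action * action
    return cost


def select_action(model: Model, state: float, horizon: int,
                  n_candidates: int, rng: random.Random) -> float:
    """Return the first action of the lowest-cost sampled sequence (a*)."""
    best_cost = float("inf")
    best_first_action = 0.0
    for _ in range(n_candidates):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(model, state, candidate)
        if cost < best_cost:
            best_cost = cost
            best_first_action = candidate[0]
    return best_first_action


if __name__ == "__main__":
    rng = random.Random(0)
    state = 1.0
    for step in range(20):
        # Re-plan at every step and apply only the first action (receding horizon).
        action = select_action(toy_model, state, horizon=3, n_candidates=200, rng=rng)
        state = toy_model(state, action)
    print(f"state after 20 steps: {state:.3f}")
```

Only the first action of the best sequence is applied before re-planning, which is what makes the loop receding-horizon rather than open-loop.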
Table 1. Comparison of three reinforcement learning paradigms in the spinal cord layer.
| Comparison Dimension | Indirect Control | Semi-Direct Control | End-to-End Control |
|---|---|---|---|
| Control mechanism | Parameter adaptation around a nominal controller | Residual or feedforward correction around a baseline controller | Direct generation of low-level flight-control commands |
| Safety treatment | Relatively conservative when the baseline controller and gain bounds remain valid | Depends on baseline protection, residual bounds, or explicit filtering | Often relies on reward shaping or training guidance unless a safety filter is added |
| Suitable use | Moderate parameter adaptation around a valid controller | Local disturbance compensation and actuator-level correction | Agile command generation and high-frequency response |
| Main deployment concern | Limited when the baseline controller is inaccurate | Residual may weaken the baseline if unconstrained | High training burden, action safety, onboard latency, and allocation dependence |
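The semi-direct column of Table 1 (a residual correction added to a baseline controller, with the residual kept bounded so it cannot weaken the baseline) can be sketched as follows. This is an illustrative sketch only: the `PDBaseline` gains, the one-dimensional error signal, and the limits `residual_limit` and `cmd_limit` are assumptions chosen for the example, and the residual value stands in for the output of a trained policy.

```python
from dataclasses import dataclass


def clip(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class PDBaseline:
    """Hypothetical nominal controller acting on a 1-D tracking error."""
    kp: float = 2.0
    kd: float = 0.5

    def command(self, pos_error: float, vel_error: float) -> float:
        return self.kp * pos_error + self.kd * vel_error


def semi_direct_command(baseline_cmd: float, residual: float,
                        residual_limit: float, cmd_limit: float) -> float:
    """Add a bounded learned residual to the baseline command.

    The residual is clipped first so the learned term has limited authority,
    then the sum is clipped to the actuator range.
    """
    bounded_residual = clip(residual, -residual_limit, residual_limit)
    return clip(baseline_cmd + bounded_residual, -cmd_limit, cmd_limit)


if __name__ == "__main__":
    baseline = PDBaseline()
    u_base = baseline.command(pos_error=0.4, vel_error=-0.2)
    # A large raw residual (1.5) is limited to 0.2 before being added.
    u = semi_direct_command(u_base, residual=1.5, residual_limit=0.2, cmd_limit=1.0)
    print(f"baseline={u_base:.3f}, combined={u:.3f}")
```

Setting `residual_limit` to zero recovers the baseline controller, which is one way to read the "baseline protection" entries in the tables.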
Table 2. Comprehensive comparison of relevant RL techniques in the spinal cord layer.
| Paradigm | Paper | Task | Prior Retained | Safety Treatment | Validation/Reported Evidence |
|---|---|---|---|---|---|
| Indirect | [46] | Trajectory tracking | PID controller | Baseline controller protection | Simulation; tracking improvement reported |
| Indirect | [47] | Trajectory tracking | PD controller | Baseline controller protection | Outdoor test |
| Indirect | [48] | Wind rejection | PID controller | Baseline controller protection | Simulation; 36-gain scheduling led to non-convergent exploration |
| Semi-direct | [49] | Anti-wind hover | Cascaded PID | Baseline protection + residual correction | Outdoor test; about 50% lower position deviation; wind >13 m/s; 50–150% mass/lift variation |
| Semi-direct | [50] | Gust rejection | Visual gust prior | Baseline protection; feedforward correction | Transfer-learning validation reported |
| Semi-direct | [51] | Downwash rejection | Proprioceptive mapping | Baseline protection; residual correction | Reduced position RMSE by 29% |
| Semi-direct | [52] | Actuator fault tolerance | Information distillation | Action clipping | Fault-tolerant adaptation reported |
| Semi-direct | [53] | Safe hovering | Expert demonstration | CBF-based action filtering | Simulation; execution-time constraint handling |
| Semi-direct | [25] | Aerial 3D printing | Temporal attitude-error history | Baseline protection; temporal adaptation | Mass-variation adaptation reported |
| Semi-direct | [26] | Tilt-wing transition | LSTM transition prior | Baseline protection; mode-transition feature learning | Transition-control validation reported |
| E2E | [39] | Attack-resilient UAV control | Latent-state representation | Teacher–student robust representation learning | Simulation; unseen-attack generalization |
| E2E | [58] | Confined-space flight | Spatial or environment prior | Reward penalty; heuristic safety shaping | Simulation or experimental validation reported |
| E2E | [59] | Safe continuous control | CBF safety layer | Execution-time safety filtering | Simulation |
| E2E | [37] | High-frequency control | Neuromorphic prior | No explicit safety filter reported | High-frequency control with reduced computation |
| E2E | [40] | Cross-scale quadrotor control | Scale-aware dynamics prior | Training-time robustness design | Cross-scale transfer within related quadrotor morphology |
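The "CBF-based action filtering" and "execution-time safety filtering" entries in Table 2 refer to modifying a policy's action at run time so that a safety constraint keeps holding. The sketch below shows the idea on the simplest possible case and is not the formulation of [53] or [59]. It assumes a hypothetical single-integrator model x_dot = u with safe set h(x) = x_max - x >= 0, for which the barrier condition h_dot + alpha * h >= 0 reduces to u <= alpha * (x_max - x), so the minimally invasive filter has a closed form.

```python
from typing import List


def cbf_filter(u_nominal: float, x: float, x_max: float, alpha: float) -> float:
    """Closed-form safety filter for x_dot = u with h(x) = x_max - x.

    Returns the action closest to u_nominal that satisfies
    u <= alpha * (x_max - x). All symbols are assumptions for this sketch.
    """
    u_upper = alpha * (x_max - x)
    return min(u_nominal, u_upper)


def simulate(x0: float, u_policy: float, x_max: float,
             alpha: float, dt: float, steps: int) -> List[float]:
    """Euler-integrate the filtered system under a constant policy action."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u_safe = cbf_filter(u_policy, x, x_max, alpha)
        x += dt * u_safe
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    # The policy keeps pushing toward the boundary; the filter slows it down.
    traj = simulate(x0=0.0, u_policy=1.0, x_max=1.0, alpha=2.0, dt=0.05, steps=100)
    print(f"max position reached: {max(traj):.4f} (limit 1.0)")
```

Far from the boundary the filter leaves the policy action unchanged; it only intervenes as x approaches x_max. With Euler integration, the bound holds when alpha * dt <= 1, which is the case for the values used here.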
Table 3. Comparison of three perception paradigms in the cerebellum layer.
| Comparison Dimension | Indirect Perception | Semi-Direct Perception | End-to-End Perception |
|---|---|---|---|
| Perception–action interface | Predictive state, map, or latent model before policy execution | Structured visual, semantic, or dynamic features before policy execution | Raw or minimally processed observation directly mapped to action |
| Main benefit | Supports anticipatory decision-making under occlusion or partial observability | Reduces raw observation shift through feature-level filtering | Reduces interface latency and supports fast reactive behavior |
| Main evidence to check | Prediction fidelity, map reliability, sensing modality, and planning compatibility | Feature sufficiency, frontend robustness, and policy dependence on extracted cues | Training data, domain randomization, real-flight validation, and inference speed |
| Deployment concern | Model complexity, sensor burden, and prediction latency | Frontend bias and accumulated perception error | Visual domain shift, data demand, and action-consequence mismatch |
Table 4. Comprehensive comparison of relevant RL techniques in the cerebellum layer.
| Paradigm | Paper | Task | Input | Validation | Key Result | Main Limitation |
|---|---|---|---|---|---|---|
| Indirect | [36] | 3D navigation | Image sequence | Zero-shot | Zero-shot transfer | Prediction fidelity and generative latency |
| Indirect | [60] | River following | Semantic masks | Simulation | Reduced collisions | Modality-specific feature dependence |
| Indirect | [61] | Target tracking | Observation sequence | Simulation | Robust tracking | Depends on temporal coverage and sequence quality |
| Indirect | [63] | Obstacle avoidance | State sequence | Simulation | Reduced communication overhead | Trigger design may miss fast state changes |
| Indirect | [62] | RF localization | RF signal | Simulation | RF-only localization | Sensitive to propagation-model mismatch |
| Semi-direct | [66] | Social navigation | RGB images | Simulation | Socially compliant navigation | Intention misclassification affects policy |
| Semi-direct | [67] | Drone racing | RGB images | Simulation | Improved collision avoidance | Track and opponent assumptions may limit transfer |
| Semi-direct | [68] | Multi-obstacle avoidance | Dynamic features | Simulation | Improved robustness over DDPG and TD3 | Feature extractor may miss hidden interaction cues |
| Semi-direct | [69] | Collision avoidance | Spatiotemporal states | Simulation | Improved accuracy | Depends on temporal feature quality and compensator assumptions |
| Semi-direct | [70] | Drone racing | Estimated state; previous action | Real flight | 15/25 wins; fastest lap 17.47 s | State-estimation bias under visual shift |
| End-to-End | [76] | Agile traversal | RGB images | Sim; real flight | 20/20 successful laps | Generalization depends on visual and task distribution |
| End-to-End | [77] | Mapless navigation | Point clouds | Simulation | Lightweight control | Sensitive to point-cloud statistics and sensor mismatch |
| End-to-End | [78] | Visual racing | RGB images | Real flight | Best lap time 5.54 s | Track-specific visual distribution may limit reuse |
| End-to-End | [79] | Forest traversal | RGB images | Real flight | 20 m/s flight; success rate 90% | Simplified dynamics may omit rotation and aerodynamic coupling |
| End-to-End | [80] | Zero-shot traversal | FPV video | Sim; real flight | Real success rate 61.7% | Generated artifacts and offline latency |
Table 5. Comparison of three decision paradigms in the cerebrum layer.
| Comparison Dimension | Temporal Hierarchical Planning | Spatial Swarm Cooperation | Semantic Instruction Decision-Making |
|---|---|---|---|
| Main question | How should long-horizon missions be decomposed over time? | How should multiple UAVs coordinate through spatial or network relations? | How should human intent or task semantics be converted into executable decisions? |
| Policy input | Mission state, energy state, task queue, monitoring demand | Neighbor state, relative pose, topology graph, channel state | Language instruction, prompt, visual-language state, symbolic rule |
| Policy output | Subgoal, waypoint, route schedule, computation ratio | Cooperative action, role, trajectory, offloading ratio, communication parameter | Task plan, heuristic trajectory, policy code, semantic action |
| Main evaluation focus | Whether the macro decision matches the real mission bottleneck | Whether coordination remains valid under changing graph, scale, and communication conditions | Whether semantic decisions are grounded, feasible, and safe to execute |
Table 6. Comprehensive comparison of relevant RL techniques in the cerebrum layer.
| Paradigm | Paper | Task | Decision Output | Evidence Checked in the Study |
|---|---|---|---|---|
| Temporal | [81] | Hierarchical navigation | Macro movement vector | Simulation-based obstacle avoidance in unknown environments |
| Temporal | [82] | Solar UAV endurance | Route and attitude decision | Energy-oriented trajectory planning in near-space UAV simulation |
| Temporal | [83] | Inference scheduling | Path and computation ratio | Joint trajectory planning and distributed inference in resource-constrained UAV swarms |
| Temporal | [84] | Traffic monitoring | Navigation and sensing persistence | Energy-aware UAV-assisted traffic monitoring |
| Temporal | [85] | IoT-enabled disaster path planning | Curvature-constrained path | 25 × 25 dynamic grid; 34.463 m path length; 0.903 efficiency; 0.099 m⁻¹ max curvature; 0.093 s planning time |
| Spatial | [86] | Cooperative planning | Next waypoint | Success rate improved by over 78%; collision rate reduced by 73% in high-density simulation |
| Spatial | [87] | Formation control | Formation-keeping command | Formation-maintenance performance in leader–follower simulation |
| Spatial | [88] | Cooperative tracking | Containment policy | Multi-target cooperative tracking compared with baseline methods |
| Spatial | [89] | Secure communication | Communication parameter | Security-oriented communication optimization under modeled channels |
| Spatial | [90] | Communication coverage | Navigation action | Graph-PPO evaluation for multi-UAV communication coverage |
| Spatial | [91] | Collision avoidance | Trajectory | Constraint-aware collision-free trajectory planning |
| Spatial | [92] | Anti-jamming | Power and position | Robustness under modeled jamming conditions |
| Spatial | [93] | Target collection | Cooperative action | Dynamic graph coordination for multi-agent cooperation |
| Spatial | [94] | IoT collection | Partition path | Coverage-oriented UAV path planning for precision-agriculture IoT |
| Spatial | [95] | Swarm confrontation | Tactical maneuver | Tactical decision performance in a simplified confrontation setting |
| Semantic | [98] | Visual navigation | Navigation action | Vision-language navigation with free-form instructions |
| Semantic | [99] | Secure routing | Heuristic trajectory | Secrecy rate and energy efficiency improved; robustness tested across swarm sizes and random seeds |
| Semantic | [100] | Drone choreography | Formation trajectory | Language-driven choreography with motion-planning support |
| Semantic | [101] | Policy programming | Python 3 policy | Executable policy-code generation |
| Semantic | [102] | 3D aerial navigation | Flight action/benchmark task | 25 city-level scenarios, 8446 paths, 25,338 instructions, and over 870 object types |
Table 7. Layer-wise sim-to-real challenges and guidance from the proposed hierarchy.
| Layer | Main Sim-to-Real Gap | Typical Deployment Risk | Guidance from Hierarchy |
|---|---|---|---|
| Spinal cord | Dynamics, actuation, allocation, safety | Low-level command becomes unsafe or inaccurate | Check control interface, actuator limits, allocation dependence, and safety filters |
| Cerebellum | Observation and perception–action gap | Visual robustness does not ensure valid action consequence | Evaluate sensing robustness together with physical action validity |
| Cerebrum | Task feasibility and lower-layer assumptions | High-level plan cannot be executed safely | Ground decisions in control, perception, communication, and safety constraints |
| Cross-layer | Hardware and training coverage | Policy cannot run or generalize onboard | Report latency, update rate, energy cost, training coverage, and real-flight evidence |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wei, W.; Zhao, X.; Shu, Y.; Meng, Q.; Ding, M.; Wang, Y.; Yan, Q. A Review of Reinforcement Learning for Multirotor UAVs from a Hierarchical Control Perspective: Biomimetic Architecture and Sim-to-Real. Drones 2026, 10, 448. https://doi.org/10.3390/drones10060448

