Article

Hierarchical Adaptive PID Tuning for Agile Flight: A Safety-Constrained Reinforcement Learning Approach

by Zhong Tian, Sen Hu, Hao Fu, Weiyu Zhu and Bangchu Zhang *
School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, China
* Author to whom correspondence should be addressed.
Aerospace 2026, 13(5), 446; https://doi.org/10.3390/aerospace13050446
Submission received: 6 April 2026 / Revised: 4 May 2026 / Accepted: 5 May 2026 / Published: 9 May 2026
(This article belongs to the Section Aeronautics)

Abstract

Multirotor unmanned aerial vehicles (UAVs) suffer from significant control performance degradation during aggressive maneuvers, primarily due to aerodynamic nonlinearities and coupling effects. Conventional fixed-gain PID controllers struggle to simultaneously satisfy performance and robustness requirements across the wide flight envelope. To address this challenge, this paper presents a novel hierarchical safety-constrained reinforcement learning (RL) framework for adaptive PID tuning: the inner loop employs fixed gains, the outer loop leverages proximal policy optimization (PPO) for online adaptive gain scheduling, and linear matrix inequality (LMI) constraints delineate robust parameter boundaries for the adaptive exploration. Importantly, the LMI feasibility strictly guarantees theoretical stability for the fixed inner-loop parameters at the linearization vertices within a linear parameter-varying (LPV) framework. Concurrently, the online outer-loop RL stage is protected by safety boundaries and a Lagrangian penalty mechanism acting as an effective engineering safeguard rather than a rigorous global stability proof. Comprehensive high-fidelity simulation benchmarks demonstrate that, compared with a baseline fixed-gain PID controller, the proposed framework reduces overshoot by 18.5% in high-speed step responses and improves the overall mean RMSE by 15.0% across 100 randomized mixed-trajectory trials (with improvements of up to 40.9% in highly dynamic scenarios), yielding consistent gains in trajectory tracking accuracy and disturbance rejection despite uncertain model variations. By seamlessly blending control-theoretic hard constraints with RL-based soft-parameter tuning, the proposed architecture offers a safe and highly adaptive solution for large-envelope flight control, demonstrating strong engineering relevance.

1. Introduction

Precise control is fundamental to the stable flight and mission execution of UAVs, representing a key enabling technology across aerospace and robotics domains. Quadrotors, as representative underactuated systems, require stable and robust flight control as a prerequisite for autonomous operation. The proportional–integral–derivative (PID) controller remains the most widely adopted control scheme for quadrotor UAVs owing to its structural simplicity, ease of implementation, and well-established stability properties [1,2,3].
The control effectiveness of PID controllers on quadrotor platforms is critically dependent on the accuracy of parameter tuning. Quadrotors are strongly nonlinear, tightly coupled, six-degree-of-freedom underactuated systems: nonlinearities remain mild under low-dynamic conditions such as hovering, but become substantially more pronounced during high-speed aggressive flight as aerodynamic drag (proportional to the square of velocity) and inertial-coupling effects intensify. The resulting dynamics exhibit characteristic linear parameter-varying (LPV) behavior [4,5], rendering fixed-gain PID controllers incapable of simultaneously meeting performance and robustness requirements across the full flight envelope.
Conventional tuning approaches suffer from three fundamental limitations [6,7]. First, trial-and-error methods are time-consuming and rely heavily on expert experience; although metaheuristic search algorithms such as differential evolution [8] can partially automate the process, coupled multi-loop tuning remains intractable. Second, rule-based methods (e.g., adaptive control [9] and active disturbance-rejection control [10]) offer limited coverage and struggle to accommodate the diverse nonlinear operating conditions spanning the full envelope. Third, model-based methods (e.g., backstepping/sliding mode [11] and model predictive control [12]) require accurate global models, yet the strong nonlinearities of real systems make high-fidelity model acquisition prohibitively expensive. Moreover, the tuning process itself entails flight safety risks, as inappropriate parameters may lead to vehicle loss.
Beyond multirotor UAVs, the demand for robust responses to dynamic environments and unmodeled perturbations represents a universal challenge across modern automation domains. For instance, in autonomous driving and active suspension systems, dynamic motion-planning frameworks must adapt rapidly to parameter variations to ensure safety [13]. Similarly, maintaining system stability against external disruptions is equally critical in networked systems facing disturbances and actuator attacks [14]. Inspired by these cross-domain demands for safety and adaptive capability, reinforcement learning (RL)-based adaptive tuning has emerged as an active research area in recent years [15,16].
Existing RL-PID studies can be categorized into three classes. First, direct RL control (RL directly outputs control commands): Koch et al. [17] and Wang et al. [18] applied PPO to quadrotor attitude/velocity control, but generally lacked explicit safety constraints. Second, RL gain prediction (RL outputs PID gains): Sönmez et al. [19] employed action clipping to restrict the exploration range, and Wang et al. [18] and Alrubyli et al. [20] applied PPO and Q-learning, respectively, for online PID tuning. Third, RL-PID hybrid (PID structure preserved and RL optimizes gain scheduling): Dogru et al. [21] applied RL to autonomous PID tuning in chemical processes; Ping et al. [22] extended this approach to fixed-wing aircraft; Xue et al. [23] improved the PPO clipping objective to enhance sample efficiency; and Zhai et al. [24] and Wang et al. [25] explored deep RL for automatic PID parameter tuning. While these studies have advanced performance optimization, they share three common shortcomings: the absence of theoretically grounded safe-parameter-domain constraints [26,27], with reliance instead on reward penalties or simple clipping; a lack of targeted designs addressing the LPV-parameter-variation characteristics inherent to large-envelope flight; and little systematic exploitation of the platform's dynamic symmetry for cross-channel policy transfer.
Building upon the foregoing analysis, this paper develops an RL-based PID-tuning methodology focused on outer-loop nonlinearity compensation. The core design rationale stems from the frequency-domain separation characteristics of quadrotor dynamics: the inner loops (attitude/angular rate) are dominated by rigid-body dynamics with relatively mild nonlinearities and high bandwidth, making them suitable for robust fixed-parameter control, whereas the outer loops (position/velocity) experience significant nonlinear effects from aerodynamic drag (proportional to velocity squared) and inertial coupling during aggressive maneuvers, constituting the region where adaptive parameter scheduling yields the greatest benefit.
A systematic comparison with representative related approaches is presented in Table 1. The key innovation lies in the joint design of three complementary mechanisms: (1) LPV polytopic modeling combined with common Lyapunov LMI constraints [28], which construct a parameterized safety domain across the entire operating envelope, compressing the RL exploration space; (2) dual-timescale Lagrangian-constrained PPO, where the policy network (fast timescale, $\alpha_\theta = 10^{-4}$) and Lagrange multiplier (slow timescale, $\alpha_\lambda = 10^{-5}$) are updated in a decoupled manner; and (3) symmetry-based policy transfer, leveraging $I_{xx} \approx I_{yy}$ to directly map the longitudinal policy to the lateral channel. The joint integration of these three mechanisms is absent in existing work.
The main contributions are summarized as follows:
  • Hierarchical adaptive control architecture: An “outer-loop adaptive, inner-loop fixed” strategy is adopted, reducing the full-system adaptive tuning problem to low-dimensional parameter optimization of the pitch-channel position loop (P) and velocity loop (PID), thereby lowering the learning complexity while precisely compensating for the dominant dynamic nonlinearities.
  • Symmetry-based policy transfer: Exploiting the strong symmetry between the pitch and roll channels inherent to the X-configuration quadrotor, the longitudinal policy is directly mapped to the lateral channel, reducing training costs and enhancing engineering practicality.
  • Safety constraints with online fine-tuning: An LMI-derived safe-parameter domain is combined with an online fine-tuning mechanism in a high-fidelity simulation environment, providing bounded parameter guarantees for RL exploration within the LPV linearization framework while enhancing policy adaptability to real-world environments.
Paper organization. The remainder of this paper is structured as follows. Section 2 details the problem formulation, LPV modeling, and the derivation of LMI safety bounds, alongside the RL framework formulation. Section 3 presents the progressive curriculum training configurations and high-fidelity experimental benchmark results. Section 4 offers an in-depth analysis of the adaptive mechanisms, extended baseline comparisons, hyperparameter sensitivity, and corresponding system limitations. Finally, Section 5 concludes the paper with broader engineering perspectives.

2. Hierarchical Adaptive PID Tuning Framework

To address the dual challenges of aerodynamic nonlinearity and model uncertainty across the wide flight envelope, a hierarchical “inner-loop fixed, outer-loop adaptive” control architecture is adopted: an LPV model captures the parameter-varying dynamics, LMI constraints delineate the safe exploration domain, and a safety-constrained PPO algorithm tunes the outer-loop gains online.

2.1. LMI-Based Safe-Parameter-Domain Construction

Taking the longitudinal channel of an X-configuration quadrotor as the modeling target, a dynamics model incorporating aerodynamic nonlinearity ($C_d \cdot u \cdot |u|$) and geometric nonlinearity ($\sin\theta$ terms) is constructed (Equation (1)). Under high-speed flight and aggressive maneuvering, these nonlinearities cause substantial variation in open-loop gain, rendering fixed-gain PID controllers unable to simultaneously satisfy performance and robustness requirements.
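$$
\begin{aligned}
\dot{x} &= u\cos\theta \\
m\,\dot{u} &= -m g \sin\theta - C_d\, u\,|u| + F_T \cos\theta \\
\dot{\theta} &= q \\
I_{yy}\,\dot{q} &= \tau_{\text{pitch}}
\end{aligned}
\tag{1}
$$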
where $x$ denotes the forward position, $u$ the forward velocity, $\theta$ the pitch angle, $q$ the pitch rate, $m$ the mass, $g$ the gravitational acceleration, $C_d$ the fuselage-drag coefficient, $F_T$ the total thrust, $I_{yy}$ the pitch-axis moment of inertia, and $\tau_{\text{pitch}}$ the pitch-control torque.
Using the forward velocity $u$ as the scheduling parameter, the nonlinear system is transformed into an LPV state-space representation (Equation (2)) [4,5]. Jacobian linearization is performed at four characteristic velocity operating points $\{0, 5, 10, 15\}$ m/s, yielding vertex subsystems $(A_k, B_k)$, which are assembled into a standard LMI feasibility framework through polytopic convex combination (Equations (3) and (4)) [28]. The selection of these four vertices stems from a practical trade-off between interpolation accuracy and LMI feasibility region size: a 5 m/s interval provides sufficient grid density to accurately capture the mild nonlinearities in the drag coefficient and pitch dynamics up to the aggressive envelope limit of 15 m/s, ensuring that the intermediate interpolated dynamics remain closely bounded by the convex hull. Conversely, introducing an excessive number of vertices would overly shrink the intersection of the LMI constraints, resulting in a highly conservative safe-parameter domain that stifles outer-loop RL exploration capability.
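$$
\dot{x} = A(u(t))\,x + B(u(t))\,v, \quad v = \tau_{\text{pitch}} \tag{2}
$$
$$
\dot{x} = A_k\,x + B_k\,v, \quad k = 1, \dots, 4 \tag{3}
$$
$$
A(u(t)) = \sum_{k=1}^{4} \mu_k A_k, \quad B(u(t)) = \sum_{k=1}^{4} \mu_k B_k, \quad \mu_k \ge 0, \quad \sum_{k} \mu_k = 1 \tag{4}
$$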
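The polytopic combination in Equation (4) can be illustrated with a small sketch. It assumes piecewise-linear interpolation between the two vertices that bracket the current velocity, which is only one possible choice of $\mu_k$ (the text does not state how the weights are computed). The function names are hypothetical, and the drag term is used as a single representative Jacobian entry rather than the full $(A_k, B_k)$ pair.

```python
from bisect import bisect_right

VERTICES = [0.0, 5.0, 10.0, 15.0]  # vertex velocities from the text, m/s
C_D, MASS = 0.073, 1.4             # drag coefficient and mass from Section 3.1.1


def polytopic_weights(u, vertices=VERTICES):
    """Return weights mu_k >= 0 that sum to 1 for scheduling velocity u.

    Assumption: linear interpolation between the two bracketing vertices.
    """
    u = min(max(u, vertices[0]), vertices[-1])  # clamp to the envelope
    hi = min(bisect_right(vertices, u), len(vertices) - 1)
    lo = hi - 1
    w_hi = (u - vertices[lo]) / (vertices[hi] - vertices[lo])
    mu = [0.0] * len(vertices)
    mu[lo] = 1.0 - w_hi
    mu[hi] = w_hi
    return mu


def drag_slope(u):
    """Derivative of the drag acceleration -C_d*u*|u|/m with respect to u."""
    return -2.0 * C_D * abs(u) / MASS


def blended_drag_slope(u):
    """Convex combination of the vertex drag slopes, as in Equation (4)."""
    mu = polytopic_weights(u)
    return sum(m * drag_slope(v) for m, v in zip(mu, VERTICES))


if __name__ == "__main__":
    print(polytopic_weights(7.5))
    print(blended_drag_slope(7.5), drag_slope(7.5))
```

Because the drag slope is linear in $|u|$, the blended value matches the exact Jacobian entry anywhere inside the envelope; entries that depend on $\sin\theta$ or $\cos\theta$ would not interpolate exactly.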
A four-level cascaded control structure is adopted, with the control law defined in Equations (5)–(8):
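$$
u_{\text{ref}} = K_{p,x}\,(x_{\text{ref}} - x) \tag{5}
$$
$$
\theta_{\text{ref}} = K_{p,u}\, e_u + K_{i,u} \int e_u \, dt + K_{d,u}\, \dot{e}_u \tag{6}
$$
$$
q_{\text{ref}} = K_{p,\theta}\,(\theta_{\text{ref}} - \theta) \tag{7}
$$
$$
\tau_{\text{pitch}} = K_{p,q}\, e_q + K_{i,q} \int e_q \, dt + K_{d,q}\, \dot{e}_q \tag{8}
$$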
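A minimal discrete-time sketch of the cascade in Equations (5)–(8) follows. It assumes $e_u = u_{\text{ref}} - u$ and $e_q = q_{\text{ref}} - q$, rectangular integration, a backward-difference derivative, and a 20 ms step; none of these discretization details are specified in the text. The class and function names are hypothetical, and output saturation and anti-windup are deliberately omitted.

```python
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float = 0.0
    kd: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error, dt):
        """One PID update: rectangular integral, backward-difference derivative."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def cascade_step(x_ref, x, u, theta, q, vel_pid, rate_pid, kp_x, kp_theta, dt=0.02):
    """Position P -> velocity PID -> attitude P -> rate PID, returning tau_pitch."""
    u_ref = kp_x * (x_ref - x)                # Eq. (5)
    theta_ref = vel_pid.step(u_ref - u, dt)   # Eq. (6), e_u = u_ref - u (assumed)
    q_ref = kp_theta * (theta_ref - theta)    # Eq. (7)
    return rate_pid.step(q_ref - q, dt)       # Eq. (8), e_q = q_ref - q (assumed)


if __name__ == "__main__":
    # Outer gains: the fixed-gain baseline of Section 3.3; inner gains: Section 3.1.3.
    vel = PID(kp=3.0, ki=3.0, kd=0.3)
    rate = PID(kp=1.5, ki=12.0, kd=0.04)
    print(cascade_step(1.0, 0.0, 0.0, 0.0, 0.0, vel, rate, kp_x=0.7, kp_theta=8.0))
```

The first call produces a large derivative kick because the stored previous error starts at zero; a practical implementation would typically filter or initialize the derivative term.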
Introducing the integral errors as augmented states, $X_a = [e_x, u, \theta, q, \xi_u, \xi_q]^{\mathsf{T}}$, the cascaded PID is equivalently reconstructed as structured state feedback, yielding the augmented closed-loop system:
$$
\dot{X}_a = \left(A_{a,k} - B_{a,k} K_{\text{total}}\right) X_a + E_{a,k}\, r, \quad k = 1, \dots, 4 \tag{9}
$$
where $A_{a,k}$, $B_{a,k}$, and $E_{a,k}$ are the augmented system matrix, input matrix, and reference matrix at vertex $k$, respectively; $K_{\text{total}}$ encodes all PID gains; $r \in \mathbb{R}$ is the external reference input (velocity reference $u_{\text{ref}}$, in m/s), and $E_{a,k}$ is the corresponding reference signal distribution vector.
The parameter determination proceeds in two steps:
Step 1: Inner-loop parameter fixation. The LMI constraint set is jointly solved across all four vertices, encompassing quadratic stability, D-stability region ($\alpha$-stability and conic sector) constraints, and $H_\infty$ performance bounds (complete derivation in Appendix B, Equations (A1)–(A5)), yielding the fixed inner-loop parameter set $\Theta_{\text{in}}^{*} = \{K_{p,\theta}^{*}, K_{p,q}^{*}, K_{i,q}^{*}, K_{d,q}^{*}\}$, which remains constant throughout subsequent RL training.
Step 2: Outer-loop safety boundary computation. Substituting $\Theta_{\text{in}}^{*}$ into the augmented system, the LMI feasible region for the outer-loop parameters $\{K_{p,x}, K_{p,u}, K_{i,u}, K_{d,u}\}$ is solved as the intersection across all vertices, yielding the safe convex hull boundary $[K_{\min}, K_{\max}]$. This boundary serves as the physical constraint on the RL action space, with the midpoint of each parameter interval used to initialize the RL agent.
The safe domain construction simultaneously compresses the RL search space and provides bounded parameter-domain constraints within the LPV linearization framework, thereby improving sampling efficiency. It should be noted that the LMI feasibility guarantee applies strictly to the closed-loop stability of the fixed inner-loop parameters at the four linearization vertices. During outer-loop RL online tuning, closed-loop stability is jointly maintained by safety boundary constraints (hard-boundary backstop) and the Lagrangian penalty mechanism (soft-boundary guidance), constituting an engineering safeguard rather than a rigorous global mathematical proof.

2.2. Safety-Constrained Dual-Timescale Reinforcement Learning

The safety-constrained dual-timescale RL framework rests on three elements: the LMI-fixed inner-loop parameters provide a stable foundation, PPO performs online adaptive adjustment of the outer-loop parameters, and the framework incorporates both longitudinal training and lateral transfer mechanisms (see Figure 1).
As depicted in Figure 1, the entire closed-loop control and training process is formulated into five primary processing steps across the Agent and Environment.
Step 1: Policy Execution (Agent). Based on the current state $s_t$, the Actor network computes the raw action $a_t$ to adjust the outer-loop parameters.
Step 2: Action Processing (Env). The action $a_t$ first undergoes direct affine mapping based on the prescribed safety constraints to produce an intermediate gain $K_{\text{target}}$. To prevent aggressive gain fluctuation, an EMA smoothing filter is applied to yield the final deployed gain $K_t$.
Step 3: Actuation and Dynamics (Env). The smoothed gain $K_t$ is transmitted to the cascaded PID controller, functioning alongside the reference trajectory to drive the quadcopter mixer and physical components.
Step 4: Reward Evaluation (Env). The system states $s_t$ resulting from the performed action are assessed to calculate the immediate composite return $R_t$ and safety cost $C_t$.
Step 5: Policy and Multiplier Update (Agent). The calculated rewards and costs flow back to the Agent. At the step level, the Critic network estimates the value function $V^{\pi}(x, \lambda)$ to guide the Actor; at the episode level, the expected cumulative safety cost $J_C(\pi_\theta)$ directs the dynamic adjustment of the Lagrange multiplier $\lambda$, ensuring robust constraint adherence.

2.2.1. MDP Formulation

The optimization objective is to maximize the discounted cumulative return subject to safety constraints (Equation (10)):
$$
\max_{\theta}\ \mathbb{E}_{\tau}\left[\sum_{t} \gamma^{t} R_t\right], \quad \text{s.t.} \quad \mathbb{E}\left[C(s_t, a_t)\right] \le \xi \tag{10}
$$
State space $\mathcal{S}$: $s_t \in \mathbb{R}^{12}$, comprising longitudinal kinematic states (four dimensions: $x, u, \theta, q$), tracking errors (two dimensions: position error and velocity error), current normalized gains (four dimensions, reflecting the historical tuning trajectory), and attitude dynamic information (two dimensions: $\dot{\theta}, \dot{q}$). The inclusion of current normalized gains in the state vector enables the agent to perceive “how far the current parameters lie from the boundary”, thereby facilitating proactive convergence when approaching the LMI constraint boundary, rather than relying on external hard clipping to passively trigger corrections.
Action space $\mathcal{A}$: $a_t = [K_{p,x}, K_{p,u}, K_{i,u}, K_{d,u}]^{\mathsf{T}} \in [-1, 1]^4$, mapped to physical gains through an affine transformation (Equation (11)) that naturally aligns with the LMI safety boundary $[K_{\min}, K_{\max}]$:
$$
K_{\text{target}} = \frac{a_t + 1}{2}\left(K_{\max} - K_{\min}\right) + K_{\min} \tag{11}
$$
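The affine mapping of Equation (11), followed by the EMA smoothing mentioned in the action-processing step, can be sketched as below. The bounds `k_min`/`k_max` in the usage lines are placeholders (the actual LMI bounds are those of Table 2), and the smoothing factor `beta` is a hypothetical value, since the text does not state the EMA coefficient.

```python
def action_to_gain(a, k_min, k_max):
    """Map a normalized action a in [-1, 1] onto [k_min, k_max] (Equation (11))."""
    a = max(-1.0, min(1.0, a))  # guard against out-of-range network outputs
    return (a + 1.0) / 2.0 * (k_max - k_min) + k_min


def ema_update(k_prev, k_target, beta=0.9):
    """Exponential moving average; beta is an assumed smoothing factor."""
    return beta * k_prev + (1.0 - beta) * k_target


if __name__ == "__main__":
    k_min, k_max = 0.5, 1.5  # placeholder bounds, not the values of Table 2
    k = action_to_gain(0.0, k_min, k_max)  # midpoint, used as the initial gain
    for a in (1.0, 1.0, 1.0):
        k = ema_update(k, action_to_gain(a, k_min, k_max))
        print(round(k, 4))
```

By construction, $a_t = -1$ and $a_t = +1$ land exactly on the two ends of the safe interval and $a_t = 0$ on its midpoint, so the policy cannot command a gain outside $[K_{\min}, K_{\max}]$.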
Reward function $R_t$ (Equation (12)); the constraint cost $C_t$ is defined in Equation (13):
$$
R_t = r_t - \lambda\, C_t, \quad r_t = -\operatorname{clip}\!\left(\sum_{i} w_i e_i^{2},\ 0,\ 50\right) \tag{12}
$$
$$
C_t = \sum_{i} C_{\text{gain},i} + C_{\text{vel}} + C_{\text{att}} + C_{\text{rate}}, \quad C_{\text{gain},i} = \max\left(0,\ |a_{t,i}| - m_{\text{gain}}\right) \tag{13}
$$
The reward-truncation upper bound of 50 prevents a small number of extreme trajectories from dominating gradient updates. The constraint cost $C_t$ simultaneously monitors four categories of safety violations: gain boundary exceedance ($C_{\text{gain},i}$), velocity limit violation ($C_{\text{vel}}$), excessive attitude ($C_{\text{att}}$), and angular rate exceedance ($C_{\text{rate}}$), covering the primary modes of UAV loss of control.
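A small sketch of the reward and gain-cost terms may help fix ideas. It assumes the tracking term enters as a negative, clipped quadratic penalty (consistent with the negative episode rewards reported later) and uses the soft margin $m_{\text{gain}} = 0.70$ from Section 3.2.1. The error weights and the lumped `other_costs` argument are hypothetical stand-ins for $w_i$ and for $C_{\text{vel}} + C_{\text{att}} + C_{\text{rate}}$, whose exact forms are not given in this section.

```python
M_GAIN = 0.70  # soft gain margin from Section 3.2.1


def gain_cost(action, m_gain=M_GAIN):
    """Sum over channels of max(0, |a_i| - m_gain), as in Equation (13)."""
    return sum(max(0.0, abs(a) - m_gain) for a in action)


def tracking_reward(errors, weights, cap=50.0):
    """Negative weighted squared error, with the penalty clipped to [0, cap].

    The leading minus sign is an assumption about the reward convention.
    """
    penalty = sum(w * e * e for w, e in zip(weights, errors))
    return -min(max(penalty, 0.0), cap)


def composite_reward(errors, weights, action, lam, other_costs=0.0):
    """Return (R_t, C_t) with R_t = r_t - lambda * C_t."""
    c_t = gain_cost(action) + other_costs
    return tracking_reward(errors, weights) - lam * c_t, c_t


if __name__ == "__main__":
    r, c = composite_reward([0.5, 1.0], [1.0, 2.0], [0.9, -0.8, 0.1, 0.7], lam=0.5)
    print(r, c)
```

Only actions whose magnitude exceeds the soft margin contribute to the gain cost, so the multiplier $\lambda$ penalizes the policy before the hard boundary is reached.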
Value function and advantage estimation: Standard MDP value function $V^{\pi}(s_t)$, action-value function $Q^{\pi}(s_t, a_t)$, and generalized advantage estimation (GAE, $\lambda_{\text{GAE}} = 0.95$) are employed; complete definitions are provided in Appendix C (Equations (A6) and (A7)). The discount factor $\gamma = 0.99$ endows the agent with an effective planning horizon of approximately 2 s at a 20 ms control period, sufficient to encompass a complete velocity step-response transient.
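The 2 s figure follows from the usual rule of thumb that a discount factor $\gamma$ corresponds to an effective horizon of roughly $1/(1-\gamma)$ steps. This is a heuristic reading, not a derivation given in the paper:

```latex
N_{\text{eff}} \approx \frac{1}{1-\gamma} = \frac{1}{1-0.99} = 100 \ \text{steps},
\qquad
T_{\text{eff}} \approx N_{\text{eff}} \cdot \Delta t = 100 \times 0.02\ \text{s} = 2\ \text{s}.
```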

2.2.2. Constraint-Aware PPO Algorithm

Algorithm-selection rationale. PPO is selected over SAC or DDPG [17] based on two engineering considerations. First, the PPO clipping surrogate objective inherently limits the magnitude of single-step policy updates. Let $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ denote the probability ratio; the clipping surrogate objective is:
$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\ 1-\varepsilon,\ 1+\varepsilon\right)\hat{A}_t\right)\right] \tag{14}
$$
where $\varepsilon = 0.2$ and $\hat{A}_t$ is the generalized advantage estimate. This truncation mechanism is highly compatible with the requirement to prevent catastrophic forgetting during Stage 6 online fine-tuning. Second, in the offline stages (Sub-stages 1–5) with 32 parallel environments, PPO’s on-policy sampling eliminates the need for an experience replay buffer, significantly reducing memory footprint and hyperparameter sensitivity. The total loss function integrates the clipping surrogate objective, value function error, and policy entropy:
$$
L(\theta) = L^{\text{CLIP}}(\theta) - c_1 L^{\text{VF}}(\theta) + c_2 H\left[\pi_\theta\right] \tag{15}
$$
where $c_1 = 0.5$ (value function coefficient), $c_2 = 0.01$ (entropy coefficient), and $L^{\text{VF}}(\theta) = \tfrac{1}{2}\,\mathbb{E}_t\left[\left(V_\theta(s_t) - V_{\text{target},t}\right)^2\right]$.
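The effect of the clipping surrogate in Equation (14) is easiest to see on individual samples. The sketch below evaluates the per-sample term and a batch average in plain Python; the function names are hypothetical, and a real implementation would operate on tensors and differentiate through the ratio.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch average of the per-sample clipped surrogate."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # A large ratio with positive advantage is capped at (1 + eps) * A.
    print(clipped_surrogate(1.5, 1.0))
    # A large ratio with negative advantage is not capped, so the penalty is kept.
    print(clipped_surrogate(1.5, -1.0))
```

The asymmetry is the point of the construction: the objective stops rewarding moves that push the ratio outside $[1-\varepsilon, 1+\varepsilon]$ in the favorable direction, while still penalizing moves that make the policy worse.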
Motivation for Lagrangian constraints. Embedding fixed-weight penalty terms directly into the reward function requires extensive manual tuning, and the optimal weights vary significantly across training stages. Lagrangian relaxation [26] is adopted to transform the constrained optimization objective (Equation (10)) into an unconstrained composite objective:
$$
\mathcal{L}(\theta, \lambda) = J_R(\pi_\theta) - \lambda\left(J_C(\pi_\theta) - \xi\right) \tag{16}
$$
where $J_R(\pi_\theta)$ is the expected cumulative performance reward, $J_C(\pi_\theta)$ is the expected cumulative safety cost, and $\xi = 0.03$ is the violation tolerance threshold. The multiplier $\lambda$ automatically tracks the degree of constraint violation at runtime: it increases to intensify penalties when gains exceed boundaries or flight states are violated, and decreases to restore exploration freedom once constraints are satisfied. This mechanism converts the safety–performance trade-off into an online optimization problem in dual space, eliminating the need for manually setting penalty weights for each training stage.
Design rationale for dual-timescale separation. Taking the gradient of the Lagrangian objective (Equation (16)) with respect to θ yields the policy update direction; subgradient ascent on λ yields the multiplier update. The two are decoupled via different learning rates to achieve timescale separation:
$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mathcal{L}(\theta_t, \lambda_t), \quad \lambda_{t+1} = P_{+}\left[\lambda_t + \alpha_\lambda\left(J_C(\pi_{\theta_t}) - \xi\right)\right] \tag{17}
$$
where $P_{+}$ is a projection operator ensuring $\lambda \ge 0$; the policy network updates rapidly at $\alpha_\theta = 10^{-4}$, while the Lagrange multiplier adjusts slowly at $\alpha_\lambda = 10^{-5}$ ($\alpha_\lambda \ll \alpha_\theta$). The order-of-magnitude difference between the two learning rates ensures that the multiplier undergoes significant adjustment only after the policy has sufficiently converged, analogous to the “fast–slow system separation” in singular perturbation theory: if the multiplier updates too rapidly, the policy is forced to change direction before it has responded to the constraint signal, leading to oscillation or even divergence. The LMI constraint-ablation study (see Section 3.2.1) corroborates the necessity of this mechanism from a complementary perspective: disabling the gain-constraint penalty causes the policy to consistently fail to converge to low constraint-violation levels throughout training.
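The slow multiplier update in Equation (17) reduces to a projected scalar step, sketched below with the stated values $\alpha_\lambda = 10^{-5}$ and $\xi = 0.03$. The function name and the episode-cost inputs in the usage lines are hypothetical; in training, the cost would be the empirical estimate of $J_C(\pi_\theta)$ over recent episodes.

```python
ALPHA_LAMBDA = 1e-5  # slow-timescale step size from the text
XI = 0.03            # violation tolerance threshold from the text


def update_multiplier(lam, mean_episode_cost, alpha=ALPHA_LAMBDA, xi=XI):
    """Projected subgradient step: lambda <- max(0, lambda + alpha * (J_C - xi))."""
    return max(0.0, lam + alpha * (mean_episode_cost - xi))


if __name__ == "__main__":
    lam = 0.0
    for cost in (0.50, 0.50, 0.01, 0.01):  # hypothetical cost estimates
        lam = update_multiplier(lam, cost)
        print(f"{lam:.3e}")
```

The multiplier grows while the estimated cost exceeds $\xi$ and shrinks toward zero once it falls below, and the projection keeps it non-negative.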
Structural basis for policy transfer. The X-configuration quadrotor satisfies $I_{xx} \approx I_{yy}$ ($\Delta I / I < 4\%$) and the fuselage-drag coefficient $C_d$ is approximately isotropic, endowing the longitudinal and lateral channels with approximately isomorphic dynamical structures in state space. Consequently, the converged longitudinal policy $\pi_{\text{long}}^{*}$ can be directly mapped to the lateral channel, reducing lateral-channel training costs to zero while maintaining performance degradation within acceptable bounds (validated in Section 3.2.2).

2.3. Overall Methodology and Agent-Training Framework

The proposed hierarchical safe-RL method is structured into four interconnected steps that sequentially transition from theoretical modeling to practical deployment (see Figure 2).
As illustrated in Figure 2, the consecutive steps closely interact to systematically elevate control performance. Step I (LPV Modeling) decouples the quadcopter dynamics and synthesizes an augmented multi-cell linear parameter-varying (LPV) model covering the flight envelope. This mathematical structure is directly fed into Step II (LMI Safety Domain), where Lyapunov functions are constructed to solve the safe outer-loop parameter boundaries and optimal initial PID gains. These theoretical bounds provide a physically meaningful and constraint-guaranteed search space for the reinforcement learning agent, fundamentally accelerating subsequent training efforts.
Offline multi-stage curriculum training (Step III, Sub-stages 1–5): A progressive curriculum design gradually escalates task difficulty: Stages 1–4 employ fixed simulation parameters, progressing from constant-velocity tracking to chirp-based aggressive maneuvers with wind disturbances, and Stage 5 introduces full-domain randomization to enhance generalization. All five offline stages employ 32-environment parallel PPO, warm-started from the best checkpoint of the preceding stage. The interconnection from Step II to Step III restricts the neural network’s exploration within the derived rigorous physical bounds. Consequently, the initial offline stages (Stages 1–4) successfully yield a basic maneuvering policy that achieves rapid convergence on simplified dynamics, thereby establishing a fundamental tracking capability. The subsequent domain randomization phase (Stage 5) produces an intermediate robust policy ($\pi_5^{*}$) resilient to parameter variations, bridging the gap to real-world deployment by significantly reducing sim-to-real disparities.
Online fine-tuning and policy transfer (Step IV, Sub-stage 6): The Stage 5 converged policy $\pi_5^{*}$ is transferred to a Gazebo M480 model integrated with the PX4 autopilot stack for online fine-tuning, employing a conservative learning rate ($\alpha = 10^{-5}$) and a tightened PPO clipping range ($\varepsilon = 0.1$) to prevent catastrophic forgetting. The interconnection from Step III to Step IV passes this highly generalized but moderately conservative policy ($\pi_5^{*}$) into high-fidelity environments. This final fine-tuning step yields the optimal deployable policy, actively compensating for unmodeled aerodynamics and actuator delays. Cumulatively, the results from each step, from theoretical bounding to progressive simulation and final sim-to-real transfer [30], contribute to the overarching performance improvement by safely decomposing an otherwise intractable end-to-end safe RL exploration into sequential, physically informed optimization tasks. The dual-timescale parameter update still follows the rule described in Equation (17). After longitudinal policy convergence, the policy is directly reused for the lateral channel via dynamic symmetry ($I_{xx} \approx I_{yy}$), without retraining.

3. Results

Our results are presented in the following order: LMI safe domain construction, offline five-stage curriculum training and online fine-tuning results, and performance evaluation across four comparative experiments.

3.1. Results of LMI-Based Safe-Parameter-Domain Construction

3.1.1. UAV Physical Parameters

An X-configuration quadrotor is employed; the complete physical parameters are listed in Appendix D, Table A2 (mass 1.4 kg, arm length 0.241 m, $I_{xx} = 0.0211$ kg·m², $I_{yy} = 0.0219$ kg·m², $C_d = 0.073$ N/(m/s)², motor time constant $T_m = 0.02$ s).
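These inertia values can be checked against the symmetry condition $\Delta I / I < 4\%$ used to justify policy transfer in Section 2.2.2. The normalization is not stated in the text, so both natural choices are shown as an assumption; the bound holds under either:

```latex
\frac{|I_{yy} - I_{xx}|}{I_{yy}} = \frac{0.0219 - 0.0211}{0.0219} = \frac{0.0008}{0.0219} \approx 3.65\%,
\qquad
\frac{|I_{yy} - I_{xx}|}{I_{xx}} = \frac{0.0008}{0.0211} \approx 3.79\%.
```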

3.1.2. Longitudinal LPV Model and Polytopic Construction

Within the operating envelope $u \in [0, 15]$ m/s, linearization is performed at four characteristic points, yielding vertex systems $(A_k, B_k)$, $k = 1, \dots, 4$. The augmented state $X_a$ comprises position error, velocity, attitude angle, angular rate, and integral states of the velocity and angular rate loops.

3.1.3. LMI Constraint Design and Solution Results

The LMI constraint parameters for each control loop are listed in Appendix D, Table A3.
Step 1: Inner-loop parameter fixation. A high-bandwidth, well-damped configuration is selected: attitude loop $K_{p,\theta} = 8.0$ and angular rate loop $K_{p,q} = 1.5$, $K_{i,q} = 12.0$, and $K_{d,q} = 0.04$. The closed-loop poles at all four velocity vertices fall within the $\mathcal{D}_{\alpha = 2.0,\ \varphi = 50^\circ}$ sector, with a phase margin $\ge 60^\circ$ and a gain margin $\ge 10$ dB (see Figure 3).
Step 2: Outer-loop safety boundary. Substituting $\Theta_{\text{in}}^{*}$ into the augmented system, the outer-loop LMI feasible region is solved; the results are presented in Table 2 (see Figure 4).
Leveraging the X-configuration structural symmetry ($I_{xx} \approx I_{yy}$), the lateral channel adopts the same parameter boundaries as the longitudinal channel. The yaw and altitude channels employ fixed parameters obtained from single-point hovering LMI solutions: altitude outer loop $K_{p,z} = 1.0$; altitude inner loop $K_{p,\dot{z}} = 4.85$, $K_{i,\dot{z}} = 2.85$, and $K_{d,\dot{z}} = 1.0$; yaw outer loop $K_{p,\psi} = 1.5$; and yaw inner loop $K_{p,r} = 1.35$, $K_{i,r} = 0.0$, and $K_{d,r} = 0.5$.
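The pole-region condition reported for the inner loop in Step 1 can be checked numerically once closed-loop poles are available. The sketch below assumes a common D-region convention: a minimum decay rate $\mathrm{Re}(s) \le -\alpha$ combined with a conic sector of half-angle $\varphi$ about the negative real axis. The paper's exact region definition is in Appendix B, so this convention, the function name, and the example poles are all assumptions.

```python
import math


def in_d_region(pole, alpha=2.0, phi_deg=50.0):
    """True if the pole satisfies Re(s) <= -alpha and lies in the conic sector.

    Assumed convention: phi is the sector half-angle measured from the
    negative real axis, so |Im(s)| <= -Re(s) * tan(phi).
    """
    decay_ok = pole.real <= -alpha
    sector_ok = abs(pole.imag) <= -pole.real * math.tan(math.radians(phi_deg))
    return decay_ok and sector_ok


if __name__ == "__main__":
    # Hypothetical poles for illustration only, not results from the paper.
    for p in (complex(-3.0, 2.0), complex(-1.0, 0.0), complex(-3.0, 4.0)):
        print(p, in_d_region(p))
```

Under this convention the sector bound corresponds to a minimum damping ratio of $\cos 50^\circ \approx 0.64$.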

3.2. PPO-Based Adaptive Controller Parameter Training

3.2.1. Offline Five-Stage Curriculum Training (Sub-Stages 1–5)

The RL agent exclusively adjusts the longitudinal outer-loop parameters: state $s_t \in \mathbb{R}^{12}$ and action $a_t \in [-1, 1]^4$, affine-mapped to the safety boundary $[K_{\min}, K_{\max}]$ of Table 2. The five-stage offline training configuration is detailed in Table 3. The penalty weight $W_{\text{gain}} = 5.0$ and the gain soft-constraint margin $m_{\text{gain}} = 0.70$ are maintained consistently across all stages (hard-constraint margin $m_{\text{hard}} = 0.95$).
The Lagrange multiplier $\lambda_j$ is updated according to the slow-timescale rule of Equation (17) ($\alpha_\lambda = 10^{-5}$, violation tolerance threshold $\xi = 0.03$).
Core PPO hyperparameters are listed in Appendix D, Table A4 ($\varepsilon_{\text{clip}} = 0.2$, $\gamma = 0.99$, $\lambda_{\text{GAE}} = 0.95$, $N_{\text{env}} = 32$, $N_{\text{step}} = 1024$).
Training platform: AMD Ryzen-9 5950X/NVIDIA RTX 4060/32 GB DDR4, Ubuntu 22.04 LTS. The total offline training comprises approximately $1.5 \times 10^{7}$ timesteps (Stages 1–5 combined).
Figure 5 illustrates the reward-curve evolution across the five offline training stages. The magnitude of negative transfer at each stage transition progressively diminishes with advancing stages, indicating that the progressive curriculum design effectively reduces the cost of cross-task knowledge transfer.
Quantitative performance summaries at the conclusion of the five offline stages are presented in Table 4.
LMI constraint-ablation study. To verify the necessity of the LMI safety domain constraints, the RL action constraints are degraded to simple symmetric clipping ($K \in [K_{\text{mid}} \pm \Delta K]$, where $\Delta K$ is comparable to the LMI boundary width), the gain-constraint penalty weight $W_{\text{gain}}$ is set to zero, and all other hyperparameters are held constant. Comparative training is conducted on the Stage 5 (domain randomization) task. As shown in Table 5 and Figure 6, the LMI-constrained variant achieves a 71.4% reduction in constraint violations (gain-boundary exceedances per episode): $13.8 \pm 5.6$ vs. $48.4 \pm 2.5$. The policy without LMI constraints, lacking penalty guidance, consistently fails to converge to low constraint-violation levels throughout training, validating that the LMI constraint mechanism is a necessary condition for achieving control within safe boundaries.

3.2.2. Gazebo Online Fine-Tuning (Stage 6) and Lateral Transfer

(1) Stage 6—High-Fidelity Online Fine-Tuning
Policy $\pi_5^{*}$ is transferred to a Gazebo M480 model integrated with the PX4 autopilot stack for online fine-tuning. Realistic sensor noise is introduced ($\sigma_v = 0.10$ m/s, $\sigma_p = 0.05$ m, and $\sigma_\theta = 0.02$ rad), with the system operating at 50 Hz via ROS2 DDS and a communication latency of 3–5 ms. Fine-tuning proceeds for a total of $6.0 \times 10^{6}$ timesteps; the smoothed episode reward rises steadily from approximately $-5576$ to $-5171$, reflecting gradual adaptation to fixed-bias plant parameters and sensor noise (see Figure 7). The limited absolute improvement ($\Delta \approx 400$) indicates that Stage 5 full-domain randomization had already brought the policy close to the Stage 6 performance ceiling, so Stage 6 fine-tuning primarily consolidates robustness rather than relearning the task. Stage 6 employs single-environment serial sampling in Gazebo ($N_{\text{env}} = 1$), resulting in an effective data throughput far lower than that of the offline 32-environment parallel stages (e.g., Stage 5 effective throughput: approximately 5.0 M × 32 = 160 M). The policy, nonetheless, converges rapidly for three reasons: (1) Stage 5 full-domain randomization has sufficiently enhanced generalization, and online fine-tuning only needs to eliminate motor dynamic residuals rather than relearn the task; (2) the conservative learning rate ($\alpha = 10^{-5}$) prevents catastrophic forgetting; and (3) the initial policy $\pi_5^{*}$ already possesses strong tracking capability, significantly narrowing the fine-tuning search space.
(2) Lateral Channel Policy Transfer
Based on the dynamic symmetry of the X-configuration quadrotor ($I_{xx} \approx I_{yy}$, approximately isotropic $C_d$), the converged longitudinal policy $\pi_{\mathrm{long}}^*$ is directly mapped to the lateral channel without retraining. The test conditions are $u_y = 8\sin(0.2\pi t)$ m/s superimposed with 0–5 m/s random wind, comprising $n = 100$ independent trials (50 trials each at 3 m/s and 5 m/s wind speeds); the results are shown in Figure 8.
The lateral-channel RMSE exceeds the longitudinal value by only 16.3%, which is below the commonly adopted 20% transfer feasibility threshold, validating the viability of the “train longitudinal–reuse lateral” approach.

3.3. Experimental Results and Analysis

The following four experiments compare RL-PID against an engineering-tuned fixed-gain PID ($K_{p,x} = 0.7$, $K_{p,u} = 3.0$, $K_{i,u} = 3.0$, $K_{d,u} = 0.3$).
The three experiments in Section 3.3.1, Section 3.3.2 and Section 3.3.3 are each conducted with $n = 10$ independent repetitions in the high-fidelity simulation environment, each run under 1–3 m/s random wind disturbance (uniformly random wind direction and seed fixed at 42 for reproducibility), with statistics reported as mean ± standard deviation ($\bar{x} \pm s$, $n = 10$). Improvement is computed as the ratio of means: $\Delta = (\bar{x}_{\mathrm{PID}} - \bar{x}_{\mathrm{RL}}) / \bar{x}_{\mathrm{PID}} \times 100\%$.
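The reporting convention above (mean ± sample standard deviation, improvement as a ratio of means) can be sketched in a few lines of Python. The function names and the sample values below are illustrative assumptions, not measurements from the paper.

```python
from statistics import mean, stdev


def summarize(samples):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(samples), stdev(samples)


def improvement_percent(pid_samples, rl_samples):
    """Ratio-of-means improvement: (mean_PID - mean_RL) / mean_PID * 100."""
    m_pid = mean(pid_samples)
    m_rl = mean(rl_samples)
    return (m_pid - m_rl) / m_pid * 100.0


# Hypothetical metric values for a lower-is-better quantity (e.g., an error).
pid_runs = [2.0, 2.2, 1.8, 2.1, 1.9]
rl_runs = [1.6, 1.7, 1.5, 1.6, 1.6]

pid_mean, pid_std = summarize(pid_runs)
delta = improvement_percent(pid_runs, rl_runs)
print(f"PID: {pid_mean:.2f} +/- {pid_std:.2f}, improvement: {delta:.1f}%")
```

Note that a ratio of means generally differs from the mean of per-run ratios; the former is less sensitive to individual runs with a small baseline value.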

3.3.1. High-Speed Step Response

Experimental setup. The UAV starts from hover and receives a 13 m/s step velocity command sustained for 10 s, with a pitch angle constraint $|\theta| \le 60^\circ$ (see Figure 9). The detailed quantitative results are summarized in Table 6.
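For readers who want to reproduce step-response metrics of this kind, the following Python sketch computes percentage overshoot and steady-state error from a sampled velocity trace. The metric definitions (peak relative to the commanded step, error averaged over the final 10% of samples) are assumptions made for illustration; the paper's exact definitions are not given in this section, and the trace is made up.

```python
def step_metrics(response, reference, tail_fraction=0.1):
    """Return (overshoot in percent, steady-state error) for a positive step.

    Assumed definitions: overshoot is the peak excess over the reference,
    relative to the reference; steady-state error is the absolute gap between
    the reference and the mean of the last `tail_fraction` of samples.
    """
    peak = max(response)
    overshoot = max(0.0, (peak - reference) / reference * 100.0)
    n_tail = max(1, int(len(response) * tail_fraction))
    tail = response[-n_tail:]
    steady_state_error = abs(reference - sum(tail) / len(tail))
    return overshoot, steady_state_error


# Hypothetical velocity trace for a 13 m/s step command (not measured data).
trace = [0.0, 6.5, 13.0, 14.3, 13.4, 13.0, 13.0, 13.0, 13.0, 13.0]
overshoot_pct, sse = step_metrics(trace, reference=13.0)
print(f"overshoot = {overshoot_pct:.1f}%, steady-state error = {sse:.2f} m/s")
```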

3.3.2. Emergency-Braking Response

Experimental setup. The UAV cruises at u = 13 m/s; at t = 3 s, a step command to 0 m/s is issued, which is sustained for 10 s (see Figure 10). The quantitative results are summarized in Table 7.

3.3.3. Frequency-Domain Sweep Test

Experimental setup. Two test configurations using chirp signals: Test A (constant-amplitude sweep): amplitude ±8 m/s and frequency linearly increasing from 0.1 Hz to 1.0 Hz over 20 s. Test B (constant-frequency amplitude sweep): frequency fixed at 0.5 Hz and amplitude linearly increasing from 3 m/s to 13 m/s over 20 s. The quantitative results are summarized in Table 8, and the corresponding response curves are shown in Figure 11.
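The two reference signals can be generated as follows. This is a minimal sketch assuming a sine waveform with zero initial phase and a 50 Hz sample rate; neither detail is specified for these tests in this section, and the function names are hypothetical.

```python
import math


def chirp_test_a(t, amp=8.0, f0=0.1, f1=1.0, duration=20.0):
    """Test A: constant amplitude, frequency rising linearly from f0 to f1.

    The phase is the integral of the instantaneous frequency
    f(t) = f0 + (f1 - f0) * t / duration.
    """
    phase = 2.0 * math.pi * (f0 * t + 0.5 * (f1 - f0) / duration * t * t)
    return amp * math.sin(phase)


def amplitude_sweep_test_b(t, freq=0.5, a0=3.0, a1=13.0, duration=20.0):
    """Test B: fixed frequency, amplitude rising linearly from a0 to a1."""
    amp = a0 + (a1 - a0) * t / duration
    return amp * math.sin(2.0 * math.pi * freq * t)


# Sample both references over 20 s (1000 steps, i.e., an assumed 50 Hz rate).
duration, n_steps = 20.0, 1000
times = [duration * i / n_steps for i in range(n_steps + 1)]
ref_a = [chirp_test_a(t) for t in times]
ref_b = [amplitude_sweep_test_b(t) for t in times]
```

Integrating the instantaneous frequency, rather than evaluating `sin(2*pi*f(t)*t)`, keeps the sweep ending at 1.0 Hz instead of overshooting it.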

3.3.4. Comprehensive Statistical Performance Test

Experimental setup. A total of 100 randomized mixed-trajectory trials (see Figure 12) covering four trajectory types (constant velocity, sinusoidal, step, and chirp) were conducted under 0–5 m/s random wind disturbance. Sample sizes per category were weighted by flight mission frequency: step 30 trials, constant velocity 24 trials, sinusoidal 24 trials, and chirp 22 trials, totaling 100 trials. Initial velocity, wind speed, and wind direction were uniformly randomly sampled within each category. All reported improvements were computed relative to the fixed PID baseline as $\Delta = (\mathrm{RMSE}_{\mathrm{PID}} - \mathrm{RMSE}_{\mathrm{RL}}) / \mathrm{RMSE}_{\mathrm{PID}} \times 100\%$.
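A trial schedule with these category sizes, together with the RMSE used for scoring, can be sketched in Python as follows. The seed, the dictionary keys, and the function names are hypothetical, and initial-velocity sampling is omitted because its range is not stated here.

```python
import math
import random

# Category sizes as stated in the experimental setup.
TRIALS_PER_TYPE = {"step": 30, "constant_velocity": 24, "sinusoidal": 24, "chirp": 22}


def build_schedule(seed=0, max_wind=5.0):
    """Draw wind speed and direction uniformly for every trial in each category."""
    rng = random.Random(seed)
    schedule = []
    for traj_type, count in TRIALS_PER_TYPE.items():
        for _ in range(count):
            schedule.append({
                "type": traj_type,
                "wind_speed": rng.uniform(0.0, max_wind),    # m/s
                "wind_dir_deg": rng.uniform(0.0, 360.0),     # degrees
            })
    return schedule


def rmse(reference, measured):
    """Root mean square error between two equally long sequences."""
    squared = [(r - m) ** 2 for r, m in zip(reference, measured)]
    return math.sqrt(sum(squared) / len(squared))


schedule = build_schedule()
print(len(schedule), rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
```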

4. Discussion

4.1. Adaptive-Mechanism Analysis

By synthesizing the gain-variation curves across the four comparative experiments (Figure 9d, Figure 10d and Figure 11c,f), three characteristic adaptive scheduling behaviors can be identified:
  • Error-driven regulation. When the tracking error is large, $K_p$ and $K_i$ increase to accelerate convergence; as the error diminishes, $K_i$ decreases to suppress steady-state oscillation. In the step-response experiment (Section 3.3.1), $K_{i,u}$ actively rises during the transient phase ($t < 2$ s) and returns to a lower value at steady state, ultimately reducing steady-state error by 62.8% while simultaneously decreasing overshoot by 18.5%.
  • State-synchronous regulation. Under periodic reference signals, gain variations exhibit correlation with both the amplitude and frequency of the reference signal. In the frequency sweep test (Section 3.3.3), $K_{p,u}$ and $K_{i,u}$ increase correspondingly during high-amplitude segments and decrease during low-amplitude segments, with the gain-adjustment rhythm synchronized to the reference signal period. This indicates that the policy exploits the dynamic characteristics of the reference signal as an implicit scheduling variable, enabling RL-PID to significantly outperform the fixed-gain scheme on chirp trajectories (improvement of 40.9%) and sinusoidal trajectories (improvement of 18.8%).
  • Constraint-aware regulation. When system states approach the $60^\circ$ pitch-angle-constraint boundary, the policy proactively reduces gains to mitigate the risk of constraint violation. In the emergency-braking experiment (Section 3.3.2), the RL-PID mean maximum pitch angle ($59.4^\circ \pm 0.7^\circ$) is slightly higher than that of the fixed PID ($58.0^\circ \pm 2.1^\circ$); neither violates the $60^\circ$ hard constraint, while RL-PID achieves a 9.5% reduction in braking time through more aggressive gain scheduling without any constraint violations.
These three behaviors demonstrate that the control law learned by the RL policy is not a fixed-gain mapping, but rather a comprehensive response to the current error state, reference signal dynamics, and system safety margins.

4.2. Comprehensive Performance Discussion

Two consistent conclusions emerge from Table 9, Table 10 and Table 11:
Performance improvement is positively correlated with task-dynamic complexity. For steady-state tracking (constant-velocity trajectories, 4.1% improvement) and large-amplitude step responses (RMSE improvement of 3.3%), the fixed PID already performs reasonably well, leaving limited marginal benefit from RL-PID. However, for time-varying dynamic signals (chirp, sinusoidal, and sweep), the advantage of RL-PID expands significantly with increasing task-dynamic complexity.
Safety constraints are effectively maintained throughout performance enhancement. In the mixed-trajectory test, the RL-PID maximum pitch angle ($53.9^\circ \pm 11.1^\circ$) is slightly higher than that of the fixed PID ($52.3^\circ \pm 12.3^\circ$), yet none of the 100 trials exceeded the $60^\circ$ hard constraint (see Table 9). The slightly higher mean maximum pitch angle of RL-PID is attributable to its more aggressive gain scheduling: the policy actively increases $K_{p,u}$ during high-dynamic segments to accelerate tracking, which drives pitch angles closer to the constraint boundary, while the Lagrange multiplier mechanism suppresses boundary exceedance risk in a timely manner when the system approaches constraints. This observation is consistent with the “constraint-aware regulation” analysis in Section 4.1: RL-PID trades a limited pitch-angle margin for faster dynamic response while maintaining overall compliance with hard constraints. The joint action of the LMI safety domain and Lagrangian-constrained PPO forms a dual-layer protection mechanism of “soft-boundary guidance + hard-boundary backstop.”
Variance analysis of the mixed-trajectory test. The coefficient of variation (CV ≈ 89%) of the RL-PID velocity RMSE in Table 9 originates from trajectory-type heterogeneity rather than sporadic controller instability. The per-category statistics in Table 10 corroborate this assessment: the inherent task-difficulty gap between chirp-trajectory RL-PID RMSE (1.11 m/s) and step-trajectory RMSE (2.80 m/s) is approximately 2.5-fold, naturally inflating the pooled variance when mixed. The fixed-PID pooled standard deviation (±1.69) is virtually identical to that of RL-PID (±1.70), further supporting this conclusion.
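The pooling argument follows from the law of total variance: the pooled variance equals the size-weighted within-category variance plus the variance of the category means. The Python sketch below demonstrates the decomposition on made-up numbers; only the two category means loosely echo the reported 1.11 m/s and 2.80 m/s, and every individual sample is hypothetical.

```python
from statistics import mean, pvariance


def variance_decomposition(groups):
    """Split the pooled population variance into within- and between-group parts."""
    pooled = [x for g in groups.values() for x in g]
    n = len(pooled)
    grand_mean = mean(pooled)
    within = sum(len(g) * pvariance(g) for g in groups.values()) / n
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()) / n
    return pvariance(pooled), within, between


# Hypothetical per-trial RMSE values (m/s), chosen only to illustrate the effect.
groups = {
    "chirp": [1.0, 1.1, 1.2],
    "step": [2.7, 2.8, 2.9],
}
total, within, between = variance_decomposition(groups)
print(f"total={total:.4f}, within={within:.4f}, between={between:.4f}")
```

In this toy case almost all of the pooled variance comes from the gap between category means, which is the mechanism the paragraph attributes to the high coefficient of variation.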

4.3. Extended Baselines, Runtime Trade-Offs, and Training Sensitivity

Positioning against extended baselines. Since fixed-gain PID and the LMI-ablated variant do not cover the full control landscape, it is useful to position the proposed method against Gain-Scheduled PID, LPV/LMI fixed-gain scheduling, MPC, and ADRC. Conceptually, RL-PID separates three roles that are often coupled: robust feasibility, online performance adaptation, and computational deployment. The LMI region provides an explicit safe-search domain, avoiding unconstrained gain exploration; the RL policy then exploits the remaining physical margin to reduce the conservatism of worst-case LPV/LMI tuning under time-varying references. Compared with MPC, this design shifts the optimization burden from online receding-horizon solving to offline policy training, leaving only a lightweight neural-network evaluation during flight. Compared with ADRC, the search space is bounded by explicit physical and stability constraints rather than relying solely on feedback-error-driven disturbance rejection. Thus, the observed gain does not come from RL alone or from the LMI box alone, but from their division of labor: LMI defines admissible behavior, while RL selects task-dependent gains within that admissible set.
Computation–performance trade-off analysis. Figure 13 provides a quantitative view of this positioning by comparing Fixed PID, Gain-Scheduled PID, Linear MPC, and RL-PID across three benchmark scenarios. The x-axis reports per-step computation time in μs, and the y-axis reports velocity RMSE; error bars denote the standard deviation over $n = 10$ trials. The results show that Linear MPC achieves the lowest RMSE (≈0.81–0.93 m/s), but at a much higher computational cost (276 μs/step), whereas Fixed PID is cheapest (≈3.4 μs/step) but loses accuracy in dynamic trajectories such as Chirp Sweep. RL-PID lies between these extremes: its runtime cost is close to Gain-Scheduled PID (approximately 9–10 μs/step), while its accuracy approaches MPC in the Chirp Sweep and Mixed Maneuver scenarios. In the easier step-response case, all methods converge to a similar RMSE (≈0.76 m/s), so computation becomes the dominant differentiator. Overall, RL-PID occupies a favorable region of the Pareto front, retaining PID-class computational efficiency while recovering much of the adaptive performance usually associated with more expensive online optimization.
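Pareto optimality in the (computation time, RMSE) plane can be checked with a simple dominance filter. In the Python sketch below, the computation times follow the approximate figures quoted above, but every RMSE value and the extra "Detuned scheduler (hypothetical)" entry are placeholders chosen for illustration, not results from Figure 13.

```python
def pareto_front(points):
    """Return the names of points not dominated in (cost, error).

    A point is dominated when another point is no worse in both
    coordinates and strictly better in at least one; lower is better.
    """
    front = []
    for name, cost, err in points:
        dominated = any(
            (c <= cost and e <= err) and (c < cost or e < err)
            for other, c, e in points
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# (name, microseconds per step, RMSE in m/s); RMSE values are placeholders.
candidates = [
    ("Fixed PID", 3.4, 1.60),
    ("Gain-Scheduled PID", 9.0, 1.30),
    ("RL-PID", 10.0, 0.95),
    ("Linear MPC", 276.0, 0.85),
    ("Detuned scheduler (hypothetical)", 12.0, 1.40),
]
print(pareto_front(candidates))
```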
Training sensitivity and robustness of the comparison. The above trade-off is meaningful only if the RL-PID policy is not a fragile outcome of hyperparameter tuning. In practice, stable learning requires the training dynamics to follow the same hierarchy as the controller design: the actor should first learn useful tracking behavior, while the Lagrange multiplier gradually enforces constraint discipline. We therefore use separated timescales ($\alpha_\theta = 10^{-4}$ for the policy and $\alpha_\lambda = 10^{-5}$ for the multiplier). When $\alpha_\lambda$ was increased toward $\alpha_\theta$, small transient violations were penalized too early, causing oscillatory policies and loss of tracking ability; when it was too small, constraint enforcement became delayed during long transients. Similarly, overly tight violation margins or oversized penalty weights pushed the policy toward nearly static gains inside the LMI domain, recreating the conservatism that RL adaptation is intended to reduce. Initializing near the LMI midpoint and using soft margins therefore implements a practical “soft guidance before hard backstop” mechanism, supporting the stability of the Pareto comparison rather than merely improving one tuned run.
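The slow multiplier timescale can be illustrated with a projected dual-ascent update. This is a minimal sketch under stated assumptions: the update form is the standard one for Lagrangian-constrained policy optimization and may differ in detail from the update rules in the main text, and the cost values and cost limit are hypothetical.

```python
ALPHA_THETA = 1e-4   # policy learning rate quoted in the text
ALPHA_LAMBDA = 1e-5  # multiplier learning rate quoted in the text


def update_multiplier(lmbda, episode_cost, cost_limit, lr=ALPHA_LAMBDA):
    """Projected dual ascent: raise lambda while the cost exceeds the limit.

    The max(0, .) projection keeps the multiplier non-negative.
    """
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))


# Hypothetical scenario: a constant safety cost of 20 against a limit of 10.
lmbda = 0.0
for _ in range(1000):
    lmbda = update_multiplier(lmbda, episode_cost=20.0, cost_limit=10.0)
print(f"lambda after 1000 episodes: {lmbda:.3f}")
```

With a step size ten times smaller than the policy's, the penalty builds up over many episodes, so short transient violations early in training do not immediately dominate the policy objective.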

4.4. Limitations and Future Work

  • Real-World Applicability and Sim-to-Real Gaps. The safety guarantees in this study hold under the bounded-LPV derivation, but remain sensitive to significant model mismatch or unmodeled aerodynamic perturbations. Unmodeled actuator latency, aggressive sensor noise, or structural asymmetry ($\Delta I / I > 4\%$) could breach the current strict offline bounds. While the proposed method is validated in a high-fidelity Gazebo environment incorporating standard actuator dynamics, the present contribution is an analytical and simulation-based engineering study rather than a full hardware-level deployment. Future work must account for unmodeled noise scaling and explicitly address the sim-to-real gap in hardware deployment using domain adaptation or fine-tuning.
  • Position–velocity trade-off. The proposed method prioritizes velocity tracking performance; position error may become non-negligible during highly dynamic tests. Future work may introduce multi-objective reward functions that explicitly balance velocity-tracking accuracy and position drift.
  • Extended baseline comparisons. The analytical discussion of the MPC, LPV, and ADRC benchmarks lays the foundation for future empirical validation. Implementing these baselines in full on identical test-bed environments is required to isolate the improvement attributable specifically to RL from that achievable with other sophisticated controllers.

5. Conclusions

This paper develops a hierarchical adaptive PID control method that integrates proximal policy optimization with LMI-based constraints to alleviate multirotor control degradation during large-envelope flight. Within an LPV formulation, LMI feasibility guarantees inner-loop stability at the linearization vertices and bounds the range of outer-loop online adjustments. In high-fidelity simulation under demanding benchmarks, the proposed framework outperforms the fixed-gain baseline, with a 15% overall RMSE reduction and an improvement of up to 40.9% on chirp trajectories. The mathematical safety guarantee covers only the fixed inner loop at the defined vertices; the online RL parameter adaptation is protected by Lagrangian boundaries, which act as an engineering safeguard rather than a proof of global stability outside simulation. Validating robustness against pronounced structural asymmetry and unmodeled actuator delays, and conducting comparative flight tests against advanced controllers such as model predictive control, remain priorities for moving the framework toward deployment.

Author Contributions

Conceptualization, Z.T. and B.Z.; methodology, Z.T.; software, Z.T.; validation, Z.T., S.H. and H.F.; formal analysis, Z.T.; investigation, Z.T.; resources, B.Z.; data curation, S.H.; writing—original draft preparation, Z.T.; writing—review and editing, W.Z.; visualization, H.F.; supervision, W.Z.; project administration, B.Z.; and funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52202513 and 52302511) and the Guangdong Basic and Applied Basic Research Foundation (No. 2021A1515110797 and No. 2023A1515010023).

Data Availability Statement

The original contributions presented in this study are included in the article and its appendices.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned aerial vehicle
RL: Reinforcement learning
PID: Proportional–integral–derivative
PPO: Proximal policy optimization
LMI: Linear matrix inequality
LPV: Linear parameter-varying
MDP: Markov decision process
RMSE: Root mean square error
MAE: Mean absolute error
GAE: Generalized advantage estimation
DDS: Data distribution service

Appendix A. Longitudinal-Dynamics-Model Symbol Definitions

Table A1. Symbol definitions for Equations (1)–(9).
Symbol | Description | Unit
$x$ | Forward position | m
$u$ | Forward velocity | m/s
$\theta$ | Pitch angle | rad
$q$ | Pitch rate | rad/s
$m$ | Total vehicle mass | kg
$g$ | Gravitational acceleration | m/s²
$C_d$ | Fuselage-drag coefficient | N/(m/s)²
$F_T$ | Total thrust | N
$I_{yy}$ | Pitch-axis moment of inertia | kg·m²
$\tau_{\mathrm{pitch}}$ | Pitch control torque | N·m
$X_a$ | Augmented state vector $[e_x, u, \theta, q, \xi_u, \xi_q]^T$ | –
$\xi_u, \xi_q$ | Velocity loop and angular rate loop integral states | –
$K_{\mathrm{total}}$ | Structured feedback matrix encoding all cascaded PID gains | –
$r$ | External reference input (velocity reference $u_{\mathrm{ref}}$) | m/s
$E_{a,k}$ | Reference signal distribution vector at vertex $k$ | –
$\mu_k$ | Polytopic convex combination weights, $\mu_k \ge 0$, $\sum_k \mu_k = 1$ | –

Appendix B. Complete LMI-Constraint Derivation

The LMI constraint set jointly guarantees closed-loop stability and robust performance across all four LPV vertices. Let the augmented closed-loop system matrix be $A_c = A_{a,k} - B_{a,k} K$, $k = 1, \dots, 4$.
(B1) Quadratic (Lyapunov) stability: A common symmetric positive-definite matrix $P \succ 0$ is required to satisfy the following for all vertices:
$$A_c^T P + P A_c \prec 0, \quad k = 1, \dots, 4$$
(B2) $\alpha$-stability: All closed-loop eigenvalues are constrained to have real parts satisfying $\mathrm{Re}(\lambda) < -\alpha$:
$$A_c^T P + P A_c + 2\alpha P \prec 0$$
(B3) Conic-sector constraint: Eigenvalue arguments are restricted to $|\arg(-\lambda)| < \varphi$ (a sector of half-angle $\varphi$ about the negative real axis), ensuring a minimum damping ratio:
$$\begin{bmatrix} \sin\varphi\,(A_c^T P + P A_c) & \cos\varphi\,(P A_c - A_c^T P) \\ \cos\varphi\,(A_c^T P - P A_c) & \sin\varphi\,(A_c^T P + P A_c) \end{bmatrix} \prec 0$$
(B4) $H_\infty$ performance constraint: Given the disturbance-augmented system $\dot{X}_a = A_c X_a + B_w w$, $z = C_z X_a$, the $L_2$ gain from $w$ to $z$ is bounded by $\gamma$:
$$\begin{bmatrix} A_c^T P + P A_c & P B_w & C_z^T \\ B_w^T P & -\gamma I & 0 \\ C_z & 0 & -\gamma I \end{bmatrix} \prec 0$$
(B5) Positive-definiteness constraint:
$$P \succ 0, \quad \gamma > 0$$
The selected α and φ values for each control loop are listed in Appendix D, Table A3.
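Conditions (B1)–(B3) and (B5) can be checked numerically for a given candidate matrix $P$ by testing the sign of the eigenvalues of each matrix inequality. The Python sketch below does this with NumPy for toy 2×2 vertex matrices; it verifies a candidate rather than solving the semidefinite program, omits the $H_\infty$ condition (B4), and uses made-up matrices and function names in place of the paper's augmented six-state model.

```python
import numpy as np


def is_negative_definite(M, tol=1e-9):
    """True if the symmetric part of M has all eigenvalues below -tol."""
    sym = 0.5 * (M + M.T)
    return bool(np.max(np.linalg.eigvalsh(sym)) < -tol)


def check_vertex(A_c, P, alpha, phi_deg):
    """Evaluate (B1)-(B3) for one closed-loop vertex matrix and a candidate P."""
    phi = np.deg2rad(phi_deg)
    lyap = A_c.T @ P + P @ A_c        # Lyapunov term shared by (B1)-(B3)
    skew = P @ A_c - A_c.T @ P        # off-diagonal block of (B3)
    sector = np.block([
        [np.sin(phi) * lyap, np.cos(phi) * skew],
        [np.cos(phi) * skew.T, np.sin(phi) * lyap],
    ])
    return {
        "B1_quadratic": is_negative_definite(lyap),
        "B2_alpha": is_negative_definite(lyap + 2.0 * alpha * P),
        "B3_sector": is_negative_definite(sector),
    }


def check_all_vertices(vertices, P, alpha, phi_deg):
    """Require P > 0 (B5) and (B1)-(B3) at every vertex with the same P."""
    p_ok = bool(np.min(np.linalg.eigvalsh(P)) > 0.0)
    return p_ok and all(
        all(check_vertex(A, P, alpha, phi_deg).values()) for A in vertices
    )


# Toy vertices with eigenvalues -3 +/- 1j and -4 +/- 1j (not the paper's model).
vertices = [
    np.array([[-3.0, 1.0], [-1.0, -3.0]]),
    np.array([[-4.0, 1.0], [-1.0, -4.0]]),
]
P_candidate = np.eye(2)
print(check_all_vertices(vertices, P_candidate, alpha=2.0, phi_deg=50.0))
```

Finding a feasible $P$ in the first place requires an SDP solver; this check is only useful for validating a solution or probing how tight a given $\alpha$ or $\varphi$ is.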

Appendix C. Supplementary PPO-Algorithm Formulas

(C1) State-value function:
$$V^\pi(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t\right]$$
(C2) Action-value function:
$$Q^\pi(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, a_t\right]$$
The advantage function $\hat{A}^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ is computed via generalized advantage estimation (GAE, $\lambda_{\mathrm{GAE}} = 0.95$). The composite advantage function is $\hat{A}_t = \hat{A}_R - \lambda \hat{A}_C$, where $\hat{A}_R$ and $\hat{A}_C$ are the advantage estimates for performance reward and safety cost, respectively.
(C3) Policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
The PPO clipping surrogate objective $L^{\mathrm{CLIP}}(\theta)$, total loss $L(\theta)$, Lagrangian composite objective $L(\theta, \lambda)$, and dual-timescale update rules are provided in Equations (14)–(17) of the main text and are not repeated here.
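The advantage computation can be sketched in Python as follows. This is a minimal single-trajectory version of generalized advantage estimation without episode-termination handling, followed by the composite advantage (reward advantage minus the multiplier times the cost advantage); the function names and toy inputs are illustrative assumptions. Note that the GAE parameter and the Lagrange multiplier are distinct quantities even though both are written with $\lambda$.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def composite_advantage(adv_reward, adv_cost, multiplier):
    """Combine reward and cost advantages: A_R - multiplier * A_C."""
    return [ar - multiplier * ac for ar, ac in zip(adv_reward, adv_cost)]


# Toy inputs: zero value estimates and no discounting, so that the
# advantages reduce to the remaining reward sums.
adv_r = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
adv_c = gae([1.0, 0.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(composite_advantage(adv_r, adv_c, multiplier=0.5))
```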

Appendix D. Experimental-Configuration Parameters

Table A2. Simulation-UAV physical parameters.
Category | Parameter | Symbol | Value | Unit
Basic | Total mass | $m$ | 1.4 | kg
Basic | Gravitational accel. | $g$ | 9.8 | m/s²
Inertia | Roll axis | $I_{xx}$ | 0.0211 | kg·m²
Inertia | Pitch axis | $I_{yy}$ | 0.0219 | kg·m²
Inertia | Yaw axis | $I_{zz}$ | 0.0366 | kg·m²
Geometry | Arm length | $l$ | 0.241 | m
Aerodynamics | Thrust coeff. | $C_t$ | $1.105 \times 10^{-5}$ | N/(rad/s)²
Aerodynamics | Torque coeff. | $C_m$ | $1.779 \times 10^{-7}$ | N·m/(rad/s)²
Aerodynamics | Drag coeff. | $C_d$ | 0.073 | N/(m/s)²
Aerodynamics | Damping torque coeff. | $C_{dm}$ | 0.0055 | N·m/(rad/s)²
Motor | Response time const. | $T_m$ | 0.02 | s
Table A3. LMI-constraint parameters for each control loop.
Control Loop | $\alpha$-Stability | Conic Angle $\varphi$ | Phase Margin | Gain Margin | Bandwidth
Angular rate loop | ≥2.0 | 50° | 60° | ≥10 dB | [10, 30] rad/s
Attitude loop | ≥2.0 | 50° | 60° | ≥10 dB | [5, 15] rad/s
Velocity loop | ≥1.0 | 60° | 50° | ≥8 dB | [1, 5] rad/s
Position loop | ≥0.8 | 65° | 45° | ≥6 dB | [0.5, 2] rad/s
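Table A4. PPO-hyperparameter configuration.
Category | Parameter | Value
Algorithm | $\varepsilon_{\mathrm{clip}}$ | 0.2
Algorithm | $\gamma$ | 0.99
Algorithm | $\lambda_{\mathrm{GAE}}$ | 0.95
Network training | $\alpha_\theta$ | $10^{-4}$ (linear decay)
Network training | $\alpha_\lambda$ | $10^{-5}$
Network training | $\delta_{\mathrm{grad}}$ (gradient clipping) | 0.5
Network training | $N_{\mathrm{epoch}}$ | 5
Network training | $N_{\mathrm{batch}}$ | 512
Loss weights | $c_1$ (vf_coef) | 0.5
Loss weights | $c_2$ (ent_coef) | 0.01
Sampling | $N_{\mathrm{step}}$ | 1024
Sampling | $N_{\mathrm{env}}$ (parallel envs) | 32 (Stages 1–5)/1 (Stage 6)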
Training platform: AMD Ryzen-9 5950X/NVIDIA RTX 4060/32 GB DDR4, Ubuntu 22.04 LTS. Total training: approximately $1.68 \times 10^7$ timesteps (offline Stages 1–5: 6.5 M; online Stage 6: approximately 6.0 M).

References

  1. Shauqee, M.N.; Rajendran, P.; Suhadis, N.M. Quadrotor Controller Design Techniques and Applications Review. INCAS Bull. 2021, 13, 179–194. [Google Scholar] [CrossRef]
  2. Lopez-Sanchez, I.; Moreno-Valenzuela, J. PID control of quadrotor UAVs: A survey. Annu. Rev. Control 2023, 56, 100900. [Google Scholar] [CrossRef]
  3. Moreno-Valenzuela, J.; Perez-Alcocer, R.; Guerrero-Medina, M.; Dzul, A. Nonlinear PID-Type Controller for Quadrotor Trajectory Tracking. IEEE/ASME Trans. Mechatron. 2018, 23, 2436–2447. [Google Scholar] [CrossRef]
  4. Kazemi, M.H.; Tarighi, R. PID-based attitude control of quadrotor using robust pole assignment and LPV modeling. Int. J. Dyn. Control 2024, 12, 2385–2397. [Google Scholar] [CrossRef]
  5. Zhu, X.; Li, Y.; Wang, H.; Shuai, Z.; Huang, H.; Yin, G. Integrated Physics-Data Based LPV Attitude Control of Quadrotor UAV System. IEEE Trans. Ind. Electron. 2025, 72, 9635–9644. [Google Scholar] [CrossRef]
  6. Borase, R.P.; Maghade, D.K.; Sondkar, S.Y.; Pawar, S.N. A review of PID control, tuning methods and applications. Int. J. Dyn. Control 2021, 9, 818–827. [Google Scholar] [CrossRef]
  7. Rinaldi, M.; Primatesta, S.; Guglieri, G. A Comparative Study for Control of Quadrotor UAVs. Appl. Sci. 2023, 13, 3464. [Google Scholar] [CrossRef]
  8. Gün, A. Attitude control of a quadrotor using PID controller based on differential evolution algorithm. Expert Syst. Appl. 2023, 229, 120518. [Google Scholar] [CrossRef]
  9. Muthusamy, P.K.; Garratt, M.; Pota, H.; Muthusamy, R. Real-Time Adaptive Intelligent Control System for Quadcopter Unmanned Aerial Vehicles With Payload Uncertainties. IEEE Trans. Ind. Electron. 2022, 69, 1641–1653. [Google Scholar] [CrossRef]
  10. Yang, S.; Xi, L.; Hao, J.; Wang, W. Aerodynamic-Parameter Identification and Attitude Control of Quadrotor Model with CIFER and Adaptive LADRC. Chin. J. Mech. Eng. 2021, 34, 18. [Google Scholar] [CrossRef]
  11. Ullah, S.; Alghamdi, H.; Algethami, A.A.; Alghamdi, B.; Hafeez, G. Robust Control Design of Under-Actuated Nonlinear Systems: Quadcopter Unmanned Aerial Vehicles with Integral Backstepping Integral Terminal Fractional-Order Sliding Mode. Fractal Fract. 2024, 8, 412. [Google Scholar] [CrossRef]
  12. Nwafor, S.C.; Eneh, J.N.; Ndefo, M.I.; Ugbe, O.C.; Ugwu, H.I.; Ani, O. An optimal hybrid quadcopter control technique with MPC-based backstepping. Arch. Control Sci. 2024, 34, 39–62. [Google Scholar] [CrossRef]
  13. Huang, T.; Pan, H.; Sun, W.; Gao, H. Sine Resistance Network-Based Motion-Planning Approach for Autonomous Electric Vehicles in Dynamic Environments. IEEE Trans. Transp. Electrif. 2022, 8, 2862–2873. [Google Scholar] [CrossRef]
  14. Zhao, N.; Lun, D.; Zhang, H.; Zhao, X.; Rudas, I.J. Composite Anti-Disturbance Control for Networked Systems With Disturbances and Actuator Attacks via Event-Triggered Output Feedback. IEEE Trans. Cybern. 2026, 56, 393–403. [Google Scholar] [CrossRef]
  15. Liu, H.; Kiumarsi, B.; Kartal, Y.; Taha Koru, A.; Modares, H.; Lewis, F.L. Reinforcement Learning Applications in Unmanned Vehicle Control: A Comprehensive Overview. Unmanned Syst. 2023, 11, 17–26. [Google Scholar] [CrossRef]
  16. Azar, A.T.; Koubaa, A.; Mohamed, N.A.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone Deep Reinforcement Learning: A Review. Electronics 2021, 10, 999. [Google Scholar] [CrossRef]
  17. Koch, W.; Mancuso, R.; West, R.; Bestavros, A. Reinforcement Learning for UAV Attitude Control. ACM Trans. Cyber-Phys. Syst. 2019, 3, 1–21. [Google Scholar] [CrossRef]
  18. Wang, Y.; Zhang, W.; Mou, J.; Zheng, K. Attitude Control Based on Reinforcement Learning for Quadrotor. In Proceedings of the 2021 International Conference on Autonomous Unmanned Systems (ICAUS), Changsha, China, 24–26 September 2021; pp. 331–338. [Google Scholar] [CrossRef]
  19. Sönmez, S.; Montecchio, L.; Martini, S.; Rutherford, M.J.; Rizzo, A.; Stefanovic, M.; Valavanis, K.P. Reinforcement Learning-Based PD Controller Gains Prediction for Quadrotor UAVs. Drones 2025, 9, 581. [Google Scholar] [CrossRef]
  20. Alrubyli, Y.; Bonarini, A. Using Q-Learning to Automatically Tune Quadcopter PID Controller Online for Fast Altitude Stabilization. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA), Guilin, China, 7–10 August 2022; pp. 514–519. [Google Scholar] [CrossRef]
  21. Dogru, O.; Velswamy, K.; Ibrahim, F.; Wu, Y.; Sundaramoorthy, A.S.; Huang, B.; Xu, S.; Nixon, M.; Bell, N. Reinforcement learning approach to autonomous PID tuning. Comput. Chem. Eng. 2022, 161, 107760. [Google Scholar] [CrossRef]
  22. Ping, H.; Han, B. An Automatic PID Tuning Method for DEP Fixed-Wing Aircraft Based on Reinforcement Learning. In Proceedings of the 2024 International Conference on Autonomous Unmanned Systems (ICAUS); Springer: Singapore, 2024; pp. 15–24. [Google Scholar] [CrossRef]
  23. Xue, W.; Wu, H.; Ye, H.; Shao, S. An Improved Proximal Policy Optimization Method for Low-Level Control of a Quadrotor. Actuators 2022, 11, 105. [Google Scholar] [CrossRef]
  24. Zhai, Y.; Zhao, Q.; Han, Y.; Wang, J.; Zeng, W. Intelligent PID Controller Based on Deep Reinforcement Learning. In Proceedings of the 2024 8th International Conference on Robotics, Control and Automation (ICRCA), Shanghai, China, 12–14 January 2024; pp. 343–348. [Google Scholar] [CrossRef]
  25. Wang, H.; Ricardez-Sandoval, L.A. A Deep Reinforcement Learning-Based PID Tuning Strategy for Nonlinear MIMO Systems with Time-varying Uncertainty. IFAC-PapersOnLine 2024, 58, 887–892. [Google Scholar] [CrossRef]
  26. Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Knoll, A. A Review of Safe Reinforcement Learning: Methods, Theories, and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11216–11235. [Google Scholar] [CrossRef]
  27. Mannucci, T.; van Kampen, E.J.; de Visser, C.; Chu, Q. Safe Exploration Algorithms for Reinforcement Learning Controllers. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1069–1081. [Google Scholar] [CrossRef] [PubMed]
  28. Boyd, S.; Hast, M.; Åström, K.J. MIMO PID tuning via iterated LMI restriction. Int. J. Robust Nonlinear Control 2016, 26, 1718–1731. [Google Scholar] [CrossRef]
  29. Saeed, A.; Bhatti, A.I.; Malik, F.M. LMIs-Based LPV Control of Quadrotor with Time-Varying Payload. Appl. Sci. 2023, 13, 6553. [Google Scholar] [CrossRef]
  30. Salvato, E.; Fenu, G.; Medvet, E.; Pellegrino, F.A. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. IEEE Access 2021, 9, 153171–153187. [Google Scholar] [CrossRef]
Figure 1. Safety-constrained dual-timescale reinforcement learning framework. The framework consists of the Agent (step-level and episode-level learning) and the Environment (Env). Symbols are defined as follows: $s_t$ represents the current state; $a_t$ is the action generated by the Actor; $V^\pi(x, \lambda)$ is the value function estimated by the Critic; $\lambda$ is the Lagrange multiplier dynamically adjusting the safety penalty; $J_C(\pi_\theta)$ is the expected cumulative safety cost; $C_t$ and $R_t$ denote the episode-level total cost and composite return, respectively, fed back to the Critic; $r_t$ is the step-level reward; $c_t$ is the step-level cost (aggregated into $C_t$); and $K_{\mathrm{target}}$ and $K_t$ are the intermediate and final step PID gains processed through safety mapping and exponential moving average (EMA) smoothing.
Figure 2. Overall hierarchical framework encompassing LPV modeling, LMI safety domain construction, and two-phase RL training. Labels on the interconnections delineate the progression of mathematical models, parameter bounds, and control policies across the integrated steps.
Figure 3. Pole-placement verification (fixed inner-loop parameters).
Figure 4. Three-dimensional cross-section of the longitudinal outer-loop parameter feasible region.
Figure 5. Reward-curve evolution across five-stage offline curriculum training. Different colors distinguish the five training stages; in each subplot, the solid line indicates the mean episode reward and the shaded region represents the standard deviation.
Figure 6. LMI constraint-ablation comparison (Stage 5). (a) Episode-reward-training curves and (b) constraint violations per episode. Dashed lines indicate the mean over the last 1000 episodes.
Figure 7. Stage 6, Gazebo online fine-tuning training curve. The shaded colored area represents the standard deviation of the episode reward. The smoothed episode reward (window = 2000 episodes) increases from approximately $-5576$ at the start to $-5171$ at the end of $6.0 \times 10^6$ timesteps, demonstrating steady policy improvement under fixed-bias parameters and sensor noise.
Figure 8. Performance comparison of longitudinal and lateral channel policy transfer. (a) Representative velocity tracking curves (wind speed 1.5 m/s); (b) RMSE mean ± standard deviation comparison across four control schemes ($n = 100$); and (c) time-averaged gain value comparison. The longitudinal RL-PID RMSE is 2.503 m/s; after lateral transfer, it increases to only 2.912 m/s (+16.3%), well below the 20% feasibility threshold.
Figure 9. High-speed step-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.
Figure 10. Emergency-braking dynamic-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.
Figure 10. Emergency- braking dynamic-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.
Aerospace 13 00446 g010
Figure 11. Frequency-sweep-test response comparison. (a) Test A velocity response; (b) Test A pitch angle; (c) Test A adaptive-gain curves; (d) Test B velocity response; (e) Test B pitch angle; and (f) Test B adaptive-gain curves. Dashed lines denote reference commands in (a,d), initial states in (b,e), and Fixed-PID values in (c,f).
Figure 12. Statistical distribution of mixed-trajectory test results. (a) Velocity RMSE; (b) velocity MAE; (c) maximum $|\theta|$; and (d) RMSE by trajectory type (n = 100). Circles denote outliers, and black diamonds denote mean values.
Figure 13. Computation–performance Pareto front across three benchmark scenarios. Error bars: mean ± std (n = 10). Dashed grey lines connect Pareto-optimal points (circled). Lower-left is better.
Table 1. Systematic comparison with representative related methods.

| Method | Safety Mechanism | Control Architecture | Policy Transfer | Distinction of This Work |
|---|---|---|---|---|
| Wang et al. [18] | No explicit constraints | RL outputs attitude commands | None | LMI safety domain for outer loop |
| Sönmez et al. [19] | Action clipping | RL predicts PD gains | None | LMI polytopic constraints; integral terms & transfer |
| Xue et al. [23] | Penalty reward | PPO outputs control commands | None | Fixed inner-loop PID; LMI safety domain |
| Saeed et al. [29] | LMI-LPV robust control | Fixed LPV gain scheduling | None | RL online adaptation within LMI domain |
| This work | LMI + Lagrangian PPO | Inner fixed + outer RL | Long.→Lat. transfer | |
Table 2. Longitudinal outer-loop parameter LMI safety domain.

| Parameter | Lower Bound $K_{\min}$ | Upper Bound $K_{\max}$ | Initial Value (Midpoint) |
|---|---|---|---|
| $K_{p,x}$ | 0.2 | 4.5 | 2.4 |
| $K_{p,u}$ | 1.0 | 10.0 | 5.5 |
| $K_{i,u}$ | 3.0 | 25.0 | 14.0 |
| $K_{d,u}$ | 0.05 | 0.55 | 0.3 |
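One way to read Table 2 is as a box of admissible gains with the midpoint as the starting value. The sketch below is a minimal illustration under that reading. Element-wise clipping is an assumption made here for clarity, and the paper may enforce the domain differently (for example, through its Lagrangian penalty). The names `GainBounds`, `SAFETY_DOMAIN`, and `project_gains` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GainBounds:
    lower: float
    upper: float

    @property
    def midpoint(self) -> float:
        return 0.5 * (self.lower + self.upper)

    def clip(self, value: float) -> float:
        """Saturate a proposed gain to the closed interval [lower, upper]."""
        return min(max(value, self.lower), self.upper)


# Bounds copied from Table 2; the dictionary keys are illustrative labels.
SAFETY_DOMAIN = {
    "Kp_x": GainBounds(0.2, 4.5),
    "Kp_u": GainBounds(1.0, 10.0),
    "Ki_u": GainBounds(3.0, 25.0),
    "Kd_u": GainBounds(0.05, 0.55),
}


def project_gains(proposed: dict[str, float]) -> dict[str, float]:
    """Assumed safeguard: clip each proposed gain into its Table 2 interval."""
    return {name: SAFETY_DOMAIN[name].clip(value) for name, value in proposed.items()}


initial_gains = {name: bounds.midpoint for name, bounds in SAFETY_DOMAIN.items()}
safe_gains = project_gains({"Kp_x": 5.2, "Kp_u": 0.4, "Ki_u": 14.0, "Kd_u": 0.3})
print(initial_gains)
print(safe_gains)
```

The exact midpoint for $K_{p,x}$ is 2.35, which matches the tabulated initial value of 2.4 after rounding to one decimal place.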
Table 3. Five-stage offline curriculum training configuration.

| Stage | Reference Signal | Environment/Randomization | Training Objective | Steps |
|---|---|---|---|---|
| Stage 1 | Constant velocity $u_x = 5.0$ m/s | No wind, widest constraints | Baseline tracking | 1.0 M |
| Stage 2 | Sinusoidal, 8 m/s, 0.3 Hz | No wind | Periodic dynamic tracking | 2.0 M |
| Stage 3 | Chirp, 0.1–0.5 Hz, 8–0 m/s | No wind | Wideband frequency response | 3.0 M |
| Stage 4 | Chirp (same as Stage 3) | Mild wind 0–2 m/s | Tracking under disturbance | 4.0 M |
| Stage 5 | Mixed chirp/sinusoidal/step | Full-domain rand., wind 0–2.5 m/s | Generalization | 5.0 M |
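The curriculum in Table 3 could be captured as a small configuration structure. The sketch below is only one possible encoding. The class and field names are hypothetical, the step counts are copied as listed without assuming whether they are per-stage or cumulative, and a zero wind range is used to represent "no wind".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurriculumStage:
    name: str
    reference: str
    wind_range_mps: tuple[float, float]  # (min, max) wind speed in m/s
    domain_randomization: bool
    objective: str
    steps_millions: float                # value from the "Steps" column


CURRICULUM = [
    CurriculumStage("Stage 1", "constant velocity, 5.0 m/s", (0.0, 0.0), False,
                    "baseline tracking", 1.0),
    CurriculumStage("Stage 2", "sinusoidal, 8 m/s, 0.3 Hz", (0.0, 0.0), False,
                    "periodic dynamic tracking", 2.0),
    CurriculumStage("Stage 3", "chirp, 0.1-0.5 Hz", (0.0, 0.0), False,
                    "wideband frequency response", 3.0),
    CurriculumStage("Stage 4", "chirp, same as Stage 3", (0.0, 2.0), False,
                    "tracking under disturbance", 4.0),
    CurriculumStage("Stage 5", "mixed chirp/sinusoidal/step", (0.0, 2.5), True,
                    "generalization", 5.0),
]

# Stages that expose the policy to wind disturbances.
windy_stages = [stage.name for stage in CURRICULUM if stage.wind_range_mps[1] > 0.0]
print(windy_stages)
```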
Table 4. Performance summary at the conclusion of each offline training stage.

| Stage | Environment | Final Reward | Mean Velocity Error |
|---|---|---|---|
| Stage 1 | Simplified simulation | ≈−570 | 0.41 m/s |
| Stage 2 | Simplified simulation | ≈−4750 | 1.03 m/s |
| Stage 3 | Simplified simulation | ≈−2870 | 1.24 m/s |
| Stage 4 | Simplified simulation | ≈−2940 | 1.24 m/s |
| Stage 5 | Simplified + domain rand. | ≈−5460 | 1.47 m/s |
Table 5. LMI constraint-ablation study statistics (Stage 5, last 1000 episodes).

| Metric | With LMI | Without LMI | Difference |
|---|---|---|---|
| Constraint violations/ep | 13.8 ± 5.6 | 48.4 ± 2.5 | −71.4% |
| Episode Reward | −6185 ± 2610 | −4783 ± 1052 | |

The reward functions of the two variants differ in composition (the variant with LMI includes constraint penalty terms); hence, the episode rewards are not directly comparable.
Table 6. Step-response performance comparison (n = 10, random wind 0–3 m/s, mean ± std). Bold values indicate the best performance.

| Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|
| Rise time $t_r$ (s) | **1.451 ± 0.604** | 1.837 ± 0.869 | +21.0% |
| Settling time $t_s$ (s) | **3.982 ± 2.918** | 8.042 ± 2.054 | +50.5% |
| Overshoot $M_p$ (%) | **1.692 ± 1.178** | 2.077 ± 0.977 | +18.5% |
| Steady-state error $e_{ss}$ (m/s) | **0.175 ± 0.240** | 0.471 ± 0.462 | +62.8% |
| RMSE (m/s) | **3.309 ± 0.206** | 3.423 ± 0.311 | +3.3% |
| MAE (m/s) | **1.363 ± 0.310** | 1.646 ± 0.441 | +17.1% |
Table 7. Emergency-braking performance comparison (n = 10, random wind 0–3 m/s, mean ± std). Bold values indicate the best performance.

| Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|
| Braking time $t_{\text{brake}}$ (s) | **1.194 ± 0.141** | 1.319 ± 0.091 | +9.5% |
| Braking distance $d_{\text{brake}}$ (m) | **7.773 ± 1.319** | 8.428 ± 1.287 | +7.8% |
| Max. pitch angle $\theta_{\max}$ (°) | 59.418 ± 0.688 | **58.001 ± 2.132** | −2.4% |
Table 8. Frequency-sweep-test performance comparison (n = 10, random wind 0–3 m/s, mean ± std).

| Test Type | Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|---|
| Const.-amplitude | RMSE (m/s) | 3.781 ± 0.137 | 4.086 ± 0.059 | +7.5% |
| | MAE (m/s) | 2.947 ± 0.130 | 3.218 ± 0.064 | +8.4% |
| Const.-frequency | RMSE (m/s) | 3.873 ± 0.118 | 4.429 ± 0.060 | +12.6% |
| | MAE (m/s) | 3.066 ± 0.062 | 3.453 ± 0.029 | +11.2% |
Table 9. Mixed-trajectory test overall statistics (n = 100).

| Metric | RL-PID | Fixed PID | Improvement |
|---|---|---|---|
| Velocity RMSE (m/s) | 1.90 ± 1.70 | 2.23 ± 1.69 | 15.0% |
| Velocity MAE (m/s) | 1.21 ± 1.21 | 1.52 ± 1.26 | 20.5% |
| Max. pitch angle (°) | 53.9 ± 11.1 | 52.3 ± 12.3 | −3.1% |
Table 10. Per-trajectory RMSE comparison (mean ± std, unit: m/s).

| Trajectory Type | n | RL-PID | Fixed PID | Improvement |
|---|---|---|---|---|
| Constant velocity | 24 | 1.65 ± 0.90 | 1.72 ± 0.91 | +4.1% |
| Sinusoidal | 24 | 1.75 ± 1.98 | 2.16 ± 2.00 | +18.8% |
| Step | 30 | 2.80 ± 2.13 | 2.97 ± 2.22 | +5.8% |
| Chirp | 22 | 1.11 ± 0.52 | 1.87 ± 0.46 | +40.9% |
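The per-trajectory means in Table 10 can be cross-checked against the overall velocity RMSE in Table 9 by weighting each trajectory type by its trial count. The sketch below uses only the rounded table values; the names `PER_TRAJECTORY` and `weighted_mean` are illustrative.

```python
# (n, RL-PID mean RMSE, Fixed PID mean RMSE) per trajectory type, from Table 10.
PER_TRAJECTORY = {
    "constant velocity": (24, 1.65, 1.72),
    "sinusoidal": (24, 1.75, 2.16),
    "step": (30, 2.80, 2.97),
    "chirp": (22, 1.11, 1.87),
}


def weighted_mean(column: int) -> float:
    """Trial-count-weighted mean of the selected column (1 = RL-PID, 2 = Fixed PID)."""
    total_trials = sum(row[0] for row in PER_TRAJECTORY.values())
    weighted_sum = sum(row[0] * row[column] for row in PER_TRAJECTORY.values())
    return weighted_sum / total_trials


total_trials = sum(row[0] for row in PER_TRAJECTORY.values())
rl_mean = weighted_mean(1)
pid_mean = weighted_mean(2)
print(f"n = {total_trials}, RL-PID = {rl_mean:.2f} m/s, Fixed PID = {pid_mean:.2f} m/s")
```

The weighted means come out at 1.90 m/s and 2.23 m/s over 100 trials, which agrees with the overall velocity RMSE reported in Table 9.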
Table 11. Summary of core performance improvements across four experiments (improvement = (PID − RL)/PID × 100%).

| Experiment | Core Metric | RL-PID | Fixed PID | Improvement |
|---|---|---|---|---|
| Step response | Overshoot (%) | 1.69 ± 1.18 | 2.08 ± 0.98 | 18.5% |
| | Steady-state error (m/s) | 0.18 ± 0.24 | 0.47 ± 0.46 | 62.8% |
| | MAE (m/s) | 1.36 ± 0.31 | 1.65 ± 0.44 | 17.1% |
| Emergency braking | Braking time (s) | 1.19 ± 0.14 | 1.32 ± 0.09 | 9.5% |
| | Braking distance (m) | 7.77 ± 1.32 | 8.43 ± 1.29 | 7.8% |
| Sweep (const.-amp.) | RMSE (m/s) | 3.78 ± 0.14 | 4.09 ± 0.06 | 7.5% |
| Sweep (const.-freq.) | RMSE (m/s) | 3.87 ± 0.12 | 4.43 ± 0.06 | 12.6% |
| Mixed (n = 100) | Velocity RMSE (m/s) | 1.90 ± 1.70 | 2.23 ± 1.69 | 15.0% |
| | Chirp RMSE (m/s) | 1.11 ± 0.52 | 1.87 ± 0.46 | 40.9% |
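The improvement metric in the Table 11 caption, read here as (PID − RL)/PID × 100%, can be checked against the three-decimal means in Tables 6 and 7. The sketch below is illustrative: the function name `improvement_percent` is hypothetical, and a negative result means the RL-PID value is larger than the Fixed-PID value.

```python
def improvement_percent(pid: float, rl: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent."""
    return (pid - rl) / pid * 100.0


# (Fixed-PID mean, RL-PID mean) pairs taken from Tables 6 and 7.
cases = {
    "overshoot (%)": (2.077, 1.692),
    "settling time (s)": (8.042, 3.982),
    "braking time (s)": (1.319, 1.194),
    "max pitch angle (deg)": (58.001, 59.418),
}

results = {name: round(improvement_percent(pid, rl), 1) for name, (pid, rl) in cases.items()}
for name, value in results.items():
    print(f"{name}: {value:+.1f}%")
```

Some tabulated percentages differ from this recomputation in the last digit (for example, MAE in Table 6), presumably because the published values were computed from unrounded means.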

Share and Cite

MDPI and ACS Style

Tian, Z.; Hu, S.; Fu, H.; Zhu, W.; Zhang, B. Hierarchical Adaptive PID Tuning for Agile Flight: A Safety-Constrained Reinforcement Learning Approach. Aerospace 2026, 13, 446. https://doi.org/10.3390/aerospace13050446


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
