Hierarchical Adaptive PID Tuning for Agile Flight: A Safety-Constrained Reinforcement Learning Approach

Tian, Zhong; Hu, Sen; Fu, Hao; Zhu, Weiyu; Zhang, Bangchu

doi:10.3390/aerospace13050446


by Zhong Tian, Sen Hu, Hao Fu, Weiyu Zhu and Bangchu Zhang ^*

School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, China

^* Author to whom correspondence should be addressed.

Aerospace 2026, 13(5), 446; https://doi.org/10.3390/aerospace13050446

Submission received: 6 April 2026 / Revised: 4 May 2026 / Accepted: 5 May 2026 / Published: 9 May 2026

(This article belongs to the Section Aeronautics)


Abstract

Multirotor unmanned aerial vehicles (UAVs) suffer from significant control performance degradation during aggressive maneuvers, primarily due to aerodynamic nonlinearities and coupling effects. Conventional fixed-gain PID controllers struggle to simultaneously satisfy performance and robustness requirements across the wide flight envelope. To address this challenge, this paper presents a novel hierarchical safety-constrained reinforcement learning (RL) framework for adaptive PID tuning: the inner loop employs fixed gains, the outer loop leverages proximal policy optimization (PPO) for online adaptive gain scheduling, and linear matrix inequality (LMI) constraints delineate robust parameter boundaries for the adaptive exploration. Importantly, the LMI feasibility strictly guarantees theoretical stability for the fixed inner-loop parameters at the linearization vertices within a linear parameter-varying (LPV) framework. Concurrently, the online outer-loop RL stage is protected by safety boundaries and a Lagrangian penalty mechanism acting as an effective engineering safeguard rather than a rigorous global stability proof. Comprehensive high-fidelity simulation benchmarks demonstrate that, compared with a baseline fixed-gain PID controller, the proposed framework reduces overshoot by 18.5% in high-speed step responses and improves the overall mean RMSE by 15.0% across 100 randomized mixed-trajectory trials (with improvements of up to 40.9% in highly dynamic scenarios), yielding consistent gains in trajectory tracking accuracy and disturbance rejection despite uncertain model variations. By seamlessly blending control-theoretic hard constraints with RL-based soft-parameter tuning, the proposed architecture offers a safe and highly adaptive solution for large-envelope flight control, demonstrating strong engineering relevance.

Keywords: multirotor UAV; LMI safety constraints; proximal policy optimization; hierarchical reinforcement learning; adaptive PID control

1. Introduction

Precise control is fundamental to the stable flight and mission execution of UAVs, representing a key enabling technology across aerospace and robotics domains. Quadrotors, as representative underactuated systems, require stable and robust flight control as a prerequisite for autonomous operation. The proportional–integral–derivative (PID) controller remains the most widely adopted control scheme for quadrotor UAVs owing to its structural simplicity, ease of implementation, and well-established stability properties [1,2,3].

The control effectiveness of PID controllers on quadrotor platforms is critically dependent on the accuracy of parameter tuning. Quadrotors are strongly nonlinear, tightly coupled, six-degree-of-freedom underactuated systems: nonlinearities remain mild under low-dynamic conditions such as hovering, but become substantially more pronounced during high-speed aggressive flight as aerodynamic drag (proportional to the square of velocity) and inertial-coupling effects intensify. The resulting dynamics exhibit characteristic linear parameter-varying (LPV) behavior [4,5], rendering fixed-gain PID controllers incapable of simultaneously meeting performance and robustness requirements across the full flight envelope.

Conventional tuning approaches suffer from three fundamental limitations [6,7]. First, trial-and-error methods are time-consuming and rely heavily on expert experience; although metaheuristic search algorithms such as differential evolution [8] can partially automate the process, coupled multi-loop tuning remains intractable. Second, rule-based methods (e.g., adaptive control [9] and active disturbance-rejection control [10]) offer limited coverage and struggle to accommodate the diverse nonlinear operating conditions spanning the full envelope. Third, model-based methods (e.g., backstepping/sliding mode [11] and model predictive control [12]) require accurate global models, yet the strong nonlinearities of real systems make high-fidelity model acquisition prohibitively expensive. Moreover, the tuning process itself entails flight safety risks, as inappropriate parameters may lead to vehicle loss.

Beyond multirotor UAVs, the demand for robust responses to dynamic environments and unmodeled perturbations represents a universal challenge across modern automation domains. For instance, in autonomous driving and active suspension systems, dynamic motion-planning frameworks must adapt rapidly to parameter variations to ensure safety [13]. Similarly, maintaining system stability against external disruptions is equally critical in networked systems facing disturbances and actuator attacks [14]. Inspired by these cross-domain demands for safety and adaptive capability, reinforcement learning (RL)-based adaptive tuning has emerged as an active research area in recent years [15,16].

Existing RL-PID studies can be categorized into three classes. First, direct RL control (RL directly outputs control commands): Koch et al. [17] and Wang et al. [18] applied PPO to quadrotor attitude/velocity control, but generally lacked explicit safety constraints. Second, RL gain prediction (RL outputs the PID gains): Sönmez et al. [19] employed action clipping to restrict the exploration range, and Wang et al. [18] and Alrubyli et al. [20] applied PPO and Q-learning, respectively, for online PID tuning. Third, RL-PID hybrid (PID structure preserved and RL optimizes gain scheduling): Dogru et al. [21] applied RL to autonomous PID tuning in chemical processes; Ping et al. [22] extended this approach to fixed-wing aircraft; Xue et al. [23] improved the PPO clipping objective to enhance sample efficiency; and Zhai et al. [24] and Wang et al. [25] explored deep RL for automatic PID parameter tuning. While these studies have advanced performance optimization, they share three common shortcomings: the absence of theoretically grounded safe-parameter-domain constraints [26,27], with reliance instead on reward penalties or simple clipping; a lack of targeted designs addressing the LPV-parameter-variation characteristics inherent to large-envelope flight; and little systematic exploitation of the platform's dynamic symmetry for cross-channel policy transfer.

Building upon the foregoing analysis, this paper develops an RL-based PID-tuning methodology focused on outer-loop nonlinearity compensation. The core design rationale stems from the frequency-domain separation characteristics of quadrotor dynamics: the inner loops (attitude/angular rate) are dominated by rigid-body dynamics with relatively mild nonlinearities and high bandwidth, making them suitable for robust fixed-parameter control, whereas the outer loops (position/velocity) experience significant nonlinear effects from aerodynamic drag (proportional to velocity squared) and inertial coupling during aggressive maneuvers, constituting the region where adaptive parameter scheduling yields the greatest benefit.

A systematic comparison with representative related approaches is presented in Table 1. The key innovation lies in the joint design of three complementary mechanisms: (1) LPV polytopic modeling combined with common Lyapunov LMI constraints [28], which construct a parameterized safety domain across the entire operating envelope, compressing the RL exploration space; (2) dual-timescale Lagrangian-constrained PPO, where the policy network (fast timescale, α_{θ} = 10^{-4}) and the Lagrange multiplier (slow timescale, α_{λ} = 10^{-5}) are updated in a decoupled manner; and (3) symmetry-based policy transfer, leveraging I_{xx} \approx I_{yy} to directly map the longitudinal policy to the lateral channel. The joint integration of these three mechanisms is absent in existing work.

The main contributions are summarized as follows:

(1) Hierarchical adaptive control architecture: An “outer-loop adaptive, inner-loop fixed” strategy is adopted, reducing the full-system adaptive tuning problem to low-dimensional parameter optimization of the pitch-channel position loop (P) and velocity loop (PID), thereby lowering the learning complexity while precisely compensating for the dominant dynamic nonlinearities.
(2) Symmetry-based policy transfer: Exploiting the strong symmetry between the pitch and roll channels inherent to the X-configuration quadrotor, the longitudinal policy is directly mapped to the lateral channel, reducing training costs and enhancing engineering practicality.
(3) Safety constraints with online fine-tuning: An LMI-derived safe-parameter domain is combined with an online fine-tuning mechanism in a high-fidelity simulation environment, providing bounded parameter guarantees for RL exploration within the LPV linearization framework while enhancing policy adaptability to real-world environments.

Paper organization. The remainder of this paper is structured as follows. Section 2 details the problem formulation, LPV modeling, and the derivation of LMI safety bounds, alongside the RL framework formulation. Section 3 presents the progressive curriculum training configurations and high-fidelity experimental benchmark results. Section 4 offers an in-depth analysis of the adaptive mechanisms, extended baseline comparisons, hyperparameter sensitivity, and corresponding system limitations. Finally, Section 5 concludes the paper with broader engineering perspectives.

2. Hierarchical Adaptive PID Tuning Framework

To address the dual challenges of aerodynamic nonlinearity and model uncertainty across the wide flight envelope, a hierarchical “inner-loop fixed, outer-loop adaptive” control architecture is adopted: an LPV model captures the parameter-varying dynamics, LMI constraints delineate the safe exploration domain, and a safety-constrained PPO algorithm tunes the outer-loop gains online.

2.1. LMI-Based Safe-Parameter-Domain Construction

Taking the longitudinal channel of an X-configuration quadrotor as the modeling target, a dynamics model incorporating aerodynamic nonlinearity (C_{d} \cdot u \cdot |u|) and geometric nonlinearity (\sin θ terms) is constructed (Equation (1)). Under high-speed flight and aggressive maneuvering, these nonlinearities cause substantial variation in open-loop gain, rendering fixed-gain PID controllers unable to simultaneously satisfy performance and robustness requirements.

\left\{ \begin{matrix} \dot{x} = u \cos θ \\ m \dot{u} = - m g \sin θ - C_{d} u |u| + F_{T} \cos θ \\ \dot{θ} = q \\ I_{yy} \dot{q} = τ_{pitch} \end{matrix} \right.

(1)

where x denotes the forward position, u the forward velocity, θ the pitch angle, q the pitch rate, m the mass, g the gravitational acceleration, C_{d} the fuselage-drag coefficient, F_{T} the total thrust, I_{yy} the pitch-axis moment of inertia, and τ_{pitch} the pitch-control torque.

Using the forward velocity u as the scheduling parameter, the nonlinear system is transformed into an LPV state-space representation (Equation (2)) [4,5]. Jacobian linearization is performed at four characteristic velocity operating points \{0, 5, 10, 15\} m/s, yielding vertex subsystems (A_{k}, B_{k}), which are assembled into a standard LMI feasibility framework through polytopic convex combination (Equations (3) and (4)) [28]. The selection of these four vertices stems from a practical trade-off between interpolation accuracy and LMI feasibility region size: a 5 m/s interval provides sufficient grid density to accurately capture the mild nonlinearities in the drag coefficient and pitch dynamics up to the aggressive envelope limit of 15 m/s, ensuring that the intermediate interpolated dynamics remain closely bounded by the convex hull. Conversely, introducing an excessive number of vertices would overly shrink the intersection of the LMI constraints, resulting in a highly conservative safe-parameter domain that stifles outer-loop RL exploration capability.

\dot{x} = A (u (t)) x + B (u (t)) v, v = τ_{pitch}

(2)

\dot{x} = A_{k} x + B_{k} v, k = 1, \dots, 4

(3)

A(u(t)) = \sum_{k=1}^{4} μ_{k} A_{k}, \quad B(u(t)) = \sum_{k=1}^{4} μ_{k} B_{k}, \quad μ_{k} \geq 0, \quad \sum_{k=1}^{4} μ_{k} = 1

(4)
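The convex combination in Equation (4) can be illustrated with a short sketch. The paper does not state how the weights μ_{k} are obtained from the scheduling speed, so the piecewise-linear interpolation between adjacent vertices used here is an assumption, and the names `polytopic_weights` and `blend` are hypothetical.

```python
from bisect import bisect_right

# Velocity vertices in m/s, taken from the text.
VERTICES = (0.0, 5.0, 10.0, 15.0)


def polytopic_weights(u, vertices=VERTICES):
    """Return weights mu_k (mu_k >= 0, sum = 1) for scheduling speed u.

    Assumption: linear interpolation between the two adjacent vertices,
    with u clamped to the modeled envelope.
    """
    u = min(max(u, vertices[0]), vertices[-1])
    mu = [0.0] * len(vertices)
    hi = bisect_right(vertices, u)
    if hi >= len(vertices):  # u sits exactly on the last vertex
        mu[-1] = 1.0
        return mu
    lo = hi - 1
    frac = (u - vertices[lo]) / (vertices[hi] - vertices[lo])
    mu[lo] = 1.0 - frac
    mu[hi] = frac
    return mu


def blend(mats, mu):
    """Convex combination sum_k mu_k * M_k of equally sized matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(mu, mats)) for j in range(cols)]
        for i in range(rows)
    ]
```

With this choice, at most two weights are nonzero at any speed, so the interpolated pair (A(u), B(u)) always lies on the segment between two neighboring vertex systems.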

A four-level cascaded control structure is adopted, with the control law defined in Equations (5)–(8):

u_{ref} = K_{p, x} (x_{ref} - x)

(5)

θ_{ref} = K_{p, u} e_{u} + K_{i, u} \int e_{u} d t + K_{d, u} {\dot{e}}_{u}

(6)

q_{ref} = K_{p, θ} (θ_{ref} - θ)

(7)

τ_{pitch} = K_{p, q} e_{q} + K_{i, q} \int e_{q} d t + K_{d, q} {\dot{e}}_{q}

(8)
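The cascade of Equations (5)–(8) can be sketched as four chained loops. This is a minimal illustration, not the authors' implementation: the error definitions e_{u} = u_{ref} - u and e_{q} = q_{ref} - q, the backward-difference derivative, and the outer-loop gain values are assumptions, while the inner-loop gains are those reported in Section 3.1.3.

```python
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float = 0.0
    kd: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0
    first: bool = True

    def step(self, error, dt):
        """One discrete PID update (rectangular integral, backward difference)."""
        self.integral += error * dt
        deriv = 0.0 if self.first else (error - self.prev_error) / dt
        self.prev_error = error
        self.first = False
        return self.kp * error + self.ki * self.integral + self.kd * deriv


@dataclass
class CascadedPitchController:
    position: PID  # Eq. (5), P only
    velocity: PID  # Eq. (6), PID
    attitude: PID  # Eq. (7), P only
    rate: PID      # Eq. (8), PID

    def step(self, x_ref, x, u, theta, q, dt):
        u_ref = self.position.step(x_ref - x, dt)
        theta_ref = self.velocity.step(u_ref - u, dt)
        q_ref = self.attitude.step(theta_ref - theta, dt)
        return self.rate.step(q_ref - q, dt)  # tau_pitch


def make_controller():
    # Outer-loop gains are hypothetical placeholders; inner-loop gains
    # are the fixed values quoted in Section 3.1.3.
    return CascadedPitchController(
        position=PID(kp=1.0),
        velocity=PID(kp=0.2, ki=0.05, kd=0.01),
        attitude=PID(kp=8.0),
        rate=PID(kp=1.5, ki=12.0, kd=0.04),
    )
```

In the proposed architecture only the `position` and `velocity` gains would be rewritten by the RL agent at run time; the `attitude` and `rate` gains stay fixed.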

Introducing the integral errors ξ_{u} and ξ_{q} as augmented states, with X_{a} = {[e_{x}, u, θ, q, ξ_{u}, ξ_{q}]}^{T}, the cascaded PID is equivalently reconstructed as structured state feedback, yielding the augmented closed-loop system:

{\dot{X}}_{a} = (A_{a, k} - B_{a, k} K_{total}) X_{a} + E_{a, k} r, k = 1, \dots, 4

(9)

where A_{a,k}, B_{a,k}, and E_{a,k} are the augmented system matrix, input matrix, and reference-signal distribution vector at vertex k, respectively; K_{total} encodes all PID gains; and r \in \mathbb{R} is the external reference input (the velocity reference u_{ref}, in m/s).

The parameter determination proceeds in two steps:

Step 1: Inner-loop parameter fixation. The LMI constraint set is jointly solved across all four vertices, encompassing quadratic stability, D-stability region (α-stability and conic sector) constraints, and H_{\infty} performance bounds (complete derivation in Appendix B, Equations (A1)–(A5)), yielding the fixed inner-loop parameter set Θ_{in}^{*} = \{K_{p,θ}^{*}, K_{p,q}^{*}, K_{i,q}^{*}, K_{d,q}^{*}\}, which remains constant throughout subsequent RL training.

Step 2: Outer-loop safety boundary computation. Substituting Θ_{in}^{*} into the augmented system, the LMI feasible region for the outer-loop parameters \{K_{p,x}, K_{p,u}, K_{i,u}, K_{d,u}\} is solved as the intersection across all vertices, yielding the safe convex hull boundary [K_{min}, K_{max}]. This boundary serves as the physical constraint on the RL action space, with the midpoint of each parameter interval used to initialize the RL agent.

The safe domain construction simultaneously compresses the RL search space and provides bounded parameter-domain constraints within the LPV linearization framework, thereby improving sampling efficiency. It should be noted that the LMI feasibility guarantee applies strictly to the closed-loop stability of the fixed inner-loop parameters at the four linearization vertices. During outer-loop RL online tuning, closed-loop stability is jointly maintained by safety boundary constraints (hard-boundary backstop) and the Lagrangian penalty mechanism (soft-boundary guidance), constituting an engineering safeguard rather than a rigorous global mathematical proof.

2.2. Safety-Constrained Dual-Timescale Reinforcement Learning

The safety-constrained dual-timescale RL framework builds on the LMI-fixed inner-loop parameters, which provide a stable foundation, while PPO performs online adaptive adjustment of the outer-loop parameters. The framework incorporates both longitudinal training and lateral transfer mechanisms (see Figure 1).

As depicted in Figure 1, the entire closed-loop control and training process is organized into five primary processing steps across the Agent and Environment.

Step 1: Policy Execution (Agent). Based on the current state s_{t}, the Actor network computes the raw action a_{t} to adjust the outer-loop parameters.

Step 2: Action Processing (Env). The action a_{t} first undergoes direct affine mapping based on the prescribed safety constraints to produce an intermediate gain K_{target}. To prevent aggressive gain fluctuation, an exponential moving average (EMA) smoothing filter is applied to yield the final deployed gain K_{t}.

Step 3: Actuation and Dynamics (Env). The smoothed gain K_{t} is transmitted to the cascaded PID controller, which works with the reference trajectory to drive the quadcopter mixer and physical components.

Step 4: Reward Evaluation (Env). The system states s_{t} resulting from the performed action are assessed to calculate the immediate composite return R_{t} and safety cost C_{t}.

Step 5: Policy and Multiplier Update (Agent). The calculated rewards and costs flow back to the Agent. At the step level, the Critic network estimates the value function V_{π}(x, λ) to guide the Actor; at the episode level, the expected cumulative safety cost J_{C}(π_{θ}) directs the dynamic adjustment of the Lagrange multiplier λ, ensuring robust constraint adherence.

2.2.1. MDP Formulation

The optimization objective is to maximize the discounted cumulative return subject to safety constraints (Equation (10)):

\max_{θ} \mathbb{E}_{τ} \left[ \sum_{t} γ^{t} R_{t} \right], \quad \text{s.t.} \quad \mathbb{E} [C(s_{t}, a_{t})] \leq ξ

(10)

State space S: s_{t} \in \mathbb{R}^{12}, comprising longitudinal kinematic states (four dimensions: x, u, θ, q), tracking errors (two dimensions: position error and velocity error), current normalized gains (four dimensions, reflecting the historical tuning trajectory), and attitude dynamic information (two dimensions: \dot{θ}, \dot{q}). The inclusion of current normalized gains in the state vector enables the agent to perceive “how far the current parameters lie from the boundary”, thereby facilitating proactive convergence when approaching the LMI constraint boundary, rather than relying on external hard clipping to passively trigger corrections.

Action space A: a_{t} = {[K_{p,x}, K_{p,u}, K_{i,u}, K_{d,u}]}^{T} \in {[-1, 1]}^{4}, mapped to physical gains through an affine transformation (Equation (11)) that naturally aligns with the LMI safety boundary [K_{min}, K_{max}]:

K_{target} = \frac{a_{t} + 1}{2} (K_{\max} - K_{\min}) + K_{\min}

(11)
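The mapping of Equation (11), followed by the EMA smoothing described in Step 2 of Figure 1, can be sketched as below. The bounds used in the usage note and the smoothing factor `beta` are illustrative assumptions, since the actual bounds are those of Table 2 and the filter coefficient is not stated in this section.

```python
def action_to_gain(a, k_min, k_max):
    """Affine map of a normalized action in [-1, 1] onto [k_min, k_max], Eq. (11)."""
    a = max(-1.0, min(1.0, a))  # guard against out-of-range policy outputs
    return 0.5 * (a + 1.0) * (k_max - k_min) + k_min


def ema(prev_gain, target_gain, beta):
    """Exponential moving average; beta in [0, 1) is an assumed smoothing factor."""
    return beta * prev_gain + (1.0 - beta) * target_gain
```

For a hypothetical interval [0.5, 2.5], an action of 0 maps to the midpoint 1.5, which matches the midpoint initialization of the agent, and actions of -1 and +1 map exactly to the two boundaries, so no action can leave the safe interval.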

The reward function R_{t} is given in Equation (12), and the constraint cost C_{t} is defined in Equation (13):

R_{t} = r_{t} - λ C_{t}, r_{t} = - clip (\sum_{i} w_{i} e_{i}^{2}, 0, 50)

(12)

C_{t} = \sum_{i} C_{gain, i} + C_{vel} + C_{att} + C_{rate}, C_{gain, i} = max (0, | a_{t, i} | - m_{gain})

(13)

Clipping the weighted squared-error penalty at an upper bound of 50 (so that r_{t} \geq -50) prevents a small number of extreme trajectories from dominating gradient updates. The constraint cost C_{t} simultaneously monitors four categories of safety violations: gain boundary exceedance (C_{gain,i}), velocity limit violation (C_{vel}), excessive attitude (C_{att}), and angular rate exceedance (C_{rate}), covering the primary modes of UAV loss of control.
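Equations (12) and (13) can be illustrated with the following sketch. Only the gain term C_{gain,i} and the clip bound are specified in the text; the hinge form used for C_{vel}, C_{att}, and C_{rate}, as well as the limits `theta_max` and `q_max`, are assumptions made for illustration (`u_max` is set to the 15 m/s envelope limit).

```python
def hinge(value, limit):
    """Amount by which |value| exceeds limit, zero inside the limit."""
    return max(0.0, abs(value) - limit)


def constraint_cost(action, u, theta, q,
                    m_gain=0.70, u_max=15.0, theta_max=0.6, q_max=3.0):
    """Safety cost C_t of Eq. (13).

    The gain term follows the paper; the state terms use an assumed hinge
    form with hypothetical attitude (rad) and rate (rad/s) limits.
    """
    c_gain = sum(hinge(a_i, m_gain) for a_i in action)
    return c_gain + hinge(u, u_max) + hinge(theta, theta_max) + hinge(q, q_max)


def shaped_reward(errors, weights, cost, lam, clip_max=50.0):
    """Composite return R_t = r_t - lam * C_t of Eq. (12)."""
    penalty = sum(w * e * e for w, e in zip(weights, errors))
    r_t = -min(max(penalty, 0.0), clip_max)
    return r_t - lam * cost
```

With the soft margin m_{gain} = 0.70, a normalized action component only incurs cost once it enters the outer 30% of the interval on either side, which penalizes gains that drift toward the LMI boundary before any hard limit is reached.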

Value function and advantage estimation: Standard MDP value function V^{π}(s_{t}), action-value function Q^{π}(s_{t}, a_{t}), and generalized advantage estimation (GAE, λ_{GAE} = 0.95) are employed; complete definitions are provided in Appendix C (Equations (A6) and (A7)). The discount factor γ = 0.99 endows the agent with an effective planning horizon of approximately 2 s at a 20 ms control period, sufficient to encompass a complete velocity step-response transient.
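The 2 s figure is consistent with the common rule of thumb that a discount factor γ weights roughly 1/(1 - γ) future steps. The sketch below reproduces that arithmetic and gives a plain-Python version of generalized advantage estimation. The function names are hypothetical and episode-termination handling is omitted for brevity; the paper's own definitions are in Appendix C.

```python
def effective_horizon(gamma, dt):
    """Approximate planning horizon in seconds: dt / (1 - gamma)."""
    return dt / (1.0 - gamma)


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout segment.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state following the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

For γ = 0.99 and a 20 ms period, `effective_horizon` gives about 100 steps, or 2 s.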

2.2.2. Constraint-Aware PPO Algorithm

Algorithm-selection rationale. PPO is selected over SAC or DDPG [17] based on two engineering considerations. First, the PPO clipped surrogate objective inherently limits the magnitude of single-step policy updates. Let ρ_{t}(θ) = π_{θ}(a_{t} | s_{t}) / π_{θ_{old}}(a_{t} | s_{t}) denote the probability ratio; the clipped surrogate objective is:

L^{CLIP} (θ) = E_{t} [min (ρ_{t} (θ) {\hat{A}}_{t}, clip (ρ_{t} (θ), 1 - ε, 1 + ε) {\hat{A}}_{t})]

(14)

where ε = 0.2 and {\hat{A}}_{t} is the generalized advantage estimate. This clipping mechanism is highly compatible with the requirement to prevent catastrophic forgetting during Stage 6 online fine-tuning. Second, in the offline stages (Sub-stages 1–5) with 32 parallel environments, PPO’s on-policy sampling eliminates the need for an experience replay buffer, significantly reducing memory footprint and hyperparameter sensitivity. The total objective, which is maximized, integrates the clipped surrogate objective, value function error, and policy entropy:

L (θ) = L^{CLIP} (θ) - c_{1} L_{VF} (θ) + c_{2} H_{π_{θ}}

(15)

where c_{1} = 0.5 (value function coefficient), c_{2} = 0.01 (entropy coefficient), and L_{VF}(θ) = \frac{1}{2} E_{t} [{(V_{θ}(s_{t}) - V_{target,t})}^{2}].
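The effect of the clip in Equation (14) is easiest to see on individual samples. The following sketch evaluates the clipped surrogate on lists of probability ratios and advantage estimates; it is a numerical illustration only, with a hypothetical function name, and performs no gradient computation.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A), Eq. (14)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(1.0 + eps, rho))
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)
```

For a positive advantage, a ratio of 1.5 is credited only as 1.2, so the objective offers no incentive to move the policy further in that direction. For a negative advantage, the same ratio is kept unclipped at -1.5, so a harmful change is penalized in full.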

Motivation for Lagrangian constraints. Embedding fixed-weight penalty terms directly into the reward function requires extensive manual tuning, and the optimal weights vary significantly across training stages. Lagrangian relaxation [26] is adopted to transform the constrained optimization objective (Equation (10)) into an unconstrained composite objective:

L (θ, λ) = J_{R} (π_{θ}) - λ (J_{C} (π_{θ}) - ξ)

(16)

where J_{R}(π_{θ}) is the expected cumulative performance reward, J_{C}(π_{θ}) is the expected cumulative safety cost, and ξ = 0.03 is the violation tolerance threshold. The multiplier λ automatically tracks the degree of constraint violation at runtime: it increases to intensify penalties when gains exceed boundaries or flight states are violated, and decreases to restore exploration freedom once constraints are satisfied. This mechanism converts the safety–performance trade-off into an online optimization problem in dual space, eliminating the need for manually setting penalty weights for each training stage.

Design rationale for dual-timescale separation. Taking the gradient of the Lagrangian objective (Equation (16)) with respect to θ yields the policy update direction; subgradient ascent on λ yields the multiplier update. The two are decoupled via different learning rates to achieve timescale separation:

θ_{t + 1} = θ_{t} + α_{θ} \nabla_{θ} L (θ_{t}, λ_{t}), λ_{t + 1} = P^{+} [λ_{t} + α_{λ} (J_{C} (π_{θ_{t}}) - ξ)]

(17)

where P^{+} is a projection operator ensuring λ \geq 0; the policy network updates rapidly at α_{θ} = 10^{-4}, while the Lagrange multiplier adjusts slowly at α_{λ} = 10^{-5} (α_{λ} \ll α_{θ}). The order-of-magnitude difference between the two learning rates ensures that the multiplier undergoes significant adjustment only after the policy has sufficiently converged, analogous to the “fast–slow system separation” in singular perturbation theory: if the multiplier updates too rapidly, the policy is forced to change direction before it has responded to the constraint signal, leading to oscillation or even divergence. The LMI constraint-ablation study (see Section 3.2.1) corroborates the necessity of this mechanism from a complementary perspective: disabling the gain-constraint penalty causes the policy to consistently fail to converge to low constraint-violation levels throughout training.
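The slow-timescale half of Equation (17) reduces to a projected scalar update. A minimal sketch follows, assuming that J_{C} is estimated by the batch mean of the per-episode cumulative safety cost; the function name and that estimator are assumptions.

```python
def update_multiplier(lam, mean_cost, xi=0.03, alpha_lam=1e-5):
    """Projected subgradient step on the Lagrange multiplier, Eq. (17).

    lam grows while the estimated cost exceeds the tolerance xi and
    shrinks otherwise; max(0, .) implements the projection P^+.
    """
    return max(0.0, lam + alpha_lam * (mean_cost - xi))
```

Because α_{λ} is an order of magnitude smaller than α_{θ}, a single violating batch barely moves λ; only a persistent violation accumulates into a noticeably larger penalty, which is the intended fast–slow separation.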

Structural basis for policy transfer. The X-configuration quadrotor satisfies I_{xx} \approx I_{yy} (Δ I / I < 4%) and the fuselage-drag coefficient C_{d} is approximately isotropic, endowing the longitudinal and lateral channels with approximately isomorphic dynamical structures in state space. Consequently, the converged longitudinal policy π_{long}^{*} can be directly mapped to the lateral channel, reducing lateral-channel training costs to zero while maintaining performance degradation within acceptable bounds (validated in Section 3.2.2).
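As a quick check of the symmetry bound, the relative inertia mismatch can be computed from the values reported in Section 3.1.1 (I_{xx} = 0.0211 kg·m², I_{yy} = 0.0219 kg·m²). The text does not state which inertia is used as the reference, so both normalizations are shown here as an assumption:

```latex
\frac{|I_{yy} - I_{xx}|}{I_{yy}} = \frac{0.0008}{0.0219} \approx 3.7\%,
\qquad
\frac{|I_{yy} - I_{xx}|}{I_{xx}} = \frac{0.0008}{0.0211} \approx 3.8\%
```

Both values stay below the stated 4% bound.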

2.3. Overall Methodology and Agent-Training Framework

The proposed hierarchical safe-RL method is structured into four interconnected steps that sequentially transition from theoretical modeling to practical deployment (see Figure 2).

As illustrated in Figure 2, the consecutive steps closely interact to systematically elevate control performance. Step I (LPV Modeling) decouples the quadcopter dynamics and synthesizes an augmented multi-cell linear parameter-varying (LPV) model covering the flight envelope. This mathematical structure is directly fed into Step II (LMI Safety Domain), where Lyapunov functions are constructed to solve the safe outer-loop parameter boundaries and optimal initial PID gains. These theoretical bounds provide a physically meaningful and constraint-guaranteed search space for the reinforcement learning agent, fundamentally accelerating subsequent training efforts.

Offline multi-stage curriculum training (Step III, Sub-stages 1–5): A progressive curriculum design gradually escalates task difficulty: Stages 1–4 employ fixed simulation parameters, progressing from constant-velocity tracking to chirp-based aggressive maneuvers with wind disturbances, and Stage 5 introduces full-domain randomization to enhance generalization. All five offline stages employ 32-environment parallel PPO, with each stage after the first warm-started from the best checkpoint of the preceding stage. The interconnection from Step II to Step III restricts the neural network’s exploration to the LMI-derived physical bounds. Consequently, the initial offline stages (Stages 1–4) yield a basic maneuvering policy that converges rapidly on simplified dynamics, thereby establishing a fundamental tracking capability. The subsequent domain randomization phase (Stage 5) produces an intermediate robust policy (π_{5}^{*}) resilient to parameter variations, narrowing the sim-to-real gap ahead of deployment.

Online fine-tuning and policy transfer (Step IV, Sub-stage 6): The Stage 5 converged policy π_{5}^{*} is transferred to a Gazebo M480 model integrated with the PX4 autopilot stack for online fine-tuning, employing a conservative learning rate (α = 10^{-5}) and a tightened PPO clipping range (ε = 0.1) to prevent catastrophic forgetting. The interconnection from Step III to Step IV passes this highly generalized but moderately conservative policy (π_{5}^{*}) into high-fidelity environments. This final fine-tuning step yields the optimal deployable policy, actively compensating for unmodeled aerodynamics and actuator delays. Cumulatively, the results from each step, from theoretical bounding to progressive simulation and final sim-to-real transfer [30], contribute to the overarching performance improvement by safely decomposing an otherwise intractable end-to-end safe RL exploration into sequential, physically informed optimization tasks. The dual-timescale parameter update still follows the rule described in Equation (17). After longitudinal policy convergence, the policy is directly reused for the lateral channel via dynamic symmetry (I_{xx} \approx I_{yy}), without retraining.

3. Results

Our results are presented in the following order: LMI safe domain construction, offline five-stage curriculum training and online fine-tuning results, and performance evaluation across four comparative experiments.

3.1. Results of LMI-Based Safe-Parameter-Domain Construction

3.1.1. UAV Physical Parameters

An X-configuration quadrotor is employed; the complete physical parameters are listed in Appendix D, Table A2 (mass 1.4 kg, arm length 0.241 m, I_{xx} = 0.0211 kg·m², I_{yy} = 0.0219 kg·m², C_{d} = 0.073 N/(m/s)², motor time constant T_{m} = 0.02 s).

3.1.2. Longitudinal LPV Model and Polytopic Construction

Within the operating envelope u \in [0, 15] m/s, linearization is performed at four characteristic points, yielding vertex systems (A_{k}, B_{k}), k = 1, \dots, 4. The augmented state X_{a} comprises position error, velocity, attitude angle, angular rate, and integral states of the velocity and angular rate loops.

3.1.3. LMI Constraint Design and Solution Results

The LMI constraint parameters for each control loop are listed in Appendix D, Table A3.

Step 1: Inner-loop parameter fixation. A high-bandwidth, well-damped configuration is selected: attitude loop K_{p,θ} = 8.0 and angular rate loop K_{p,q} = 1.5, K_{i,q} = 12.0, and K_{d,q} = 0.04. The closed-loop poles at all four velocity vertices fall within the D_{α = 2.0, φ = 50°} sector, with a phase margin \geq 60° and a gain margin \geq 10 dB (see Figure 3).

Step 2: Outer-loop safety boundary. Substituting Θ_{in}^{*} into the augmented system, the outer-loop LMI feasible region is solved; the results are presented in Table 2 (see Figure 4).

Leveraging the X-configuration structural symmetry (I_{xx} \approx I_{yy}), the lateral channel adopts the same parameter boundaries as the longitudinal channel. The yaw and altitude channels employ fixed parameters obtained from single-point hovering LMI solutions: altitude outer loop K_{p,z} = 1.0; altitude inner loop K_{p,\dot{z}} = 4.85, K_{i,\dot{z}} = 2.85, and K_{d,\dot{z}} = 1.0; yaw outer loop K_{p,ψ} = 1.5; and yaw inner loop K_{p,r} = 1.35, K_{i,r} = 0.0, and K_{d,r} = 0.5.

3.2. PPO-Based Adaptive Controller Parameter Training

3.2.1. Offline Five-Stage Curriculum Training (Sub-Stages 1–5)

The RL agent exclusively adjusts the longitudinal outer-loop parameters: state s_{t} \in \mathbb{R}^{12} and action a_{t} \in {[-1, 1]}^{4}, affine-mapped to the safety boundary [K_{min}, K_{max}] of Table 2. The five-stage offline training configuration is detailed in Table 3. The penalty weight W_{gain} = 5.0 and the gain soft-constraint margin m_{gain} = 0.70 are maintained consistently across all stages (hard-constraint margin m_{hard} = 0.95).

The Lagrange multiplier λ_{j} is updated according to the slow-timescale rule of Equation (17) (α_{λ} = 10^{-5}, violation tolerance threshold ξ = 0.03).

Core PPO hyperparameters are listed in Appendix D, Table A4 (ε_{clip} = 0.2, γ = 0.99, λ_{GAE} = 0.95, N_{env} = 32, N_{step} = 1024).
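For reference, the hyperparameters quoted in the text can be collected in a small configuration sketch. The class and field names are hypothetical, N_{step} is assumed to be counted per environment, and the Stage 6 variant changes only the three values that the text states explicitly (learning rate, clipping range, and number of environments); whether the remaining settings carry over unchanged to Stage 6 is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PPOConfig:
    learning_rate: float = 1e-4    # alpha_theta, policy (fast timescale)
    multiplier_rate: float = 1e-5  # alpha_lambda, multiplier (slow timescale)
    clip_range: float = 0.2        # epsilon_clip
    gamma: float = 0.99
    gae_lambda: float = 0.95
    value_coef: float = 0.5        # c_1
    entropy_coef: float = 0.01     # c_2
    cost_threshold: float = 0.03   # xi
    n_envs: int = 32
    n_steps: int = 1024            # assumed to be per environment

    @property
    def rollout_size(self):
        """Samples collected per policy update under the per-env assumption."""
        return self.n_envs * self.n_steps


OFFLINE = PPOConfig()
# Stage 6 values stated in the text: lr 1e-5, clip 0.1, a single environment.
ONLINE = replace(OFFLINE, learning_rate=1e-5, clip_range=0.1, n_envs=1)
```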

Training platform: AMD Ryzen-9 5950X/NVIDIA RTX 4060/32 GB DDR4, Ubuntu 22.04 LTS. The total offline training comprises approximately 1.5 \times 10^{7} timesteps (Stages 1–5 combined).

Figure 5 illustrates the reward-curve evolution across the five offline training stages. The magnitude of negative transfer at each stage transition progressively diminishes with advancing stages, indicating that the progressive curriculum design effectively reduces the cost of cross-task knowledge transfer.

Quantitative performance summaries at the conclusion of the five offline stages are presented in Table 4.

LMI constraint-ablation study. To verify the necessity of the LMI safety domain constraints, the RL action constraints are degraded to simple symmetric clipping (K \in [K_{mid} - ΔK, K_{mid} + ΔK], where ΔK is comparable to the LMI boundary width), the gain-constraint penalty weight W_{gain} is set to zero, and all other hyperparameters are held constant. Comparative training is conducted on the Stage 5 (domain randomization) task. As shown in Table 5 and Figure 6, the LMI-constrained variant achieves a 71.4% reduction in constraint violations (gain-boundary exceedances per episode): 13.8 \pm 5.6 vs. 48.4 \pm 2.5. The policy without LMI constraints, lacking penalty guidance, consistently fails to converge to low constraint-violation levels throughout training, validating that the LMI constraint mechanism is a necessary condition for achieving control within safe boundaries.

3.2.2. Gazebo Online Fine-Tuning (Stage 6) and Lateral Transfer

(1) Stage 6—High-Fidelity Online Fine-Tuning

Policy π_{5}^{*} is transferred to a Gazebo M480 model integrated with the PX4 autopilot stack for online fine-tuning. Realistic sensor noise is introduced (σ_{v} = 0.10 m/s, σ_{p} = 0.05 m, and σ_{θ} = 0.02 rad), with the system operating at 50 Hz via ROS2 DDS and a communication latency of 3–5 ms. Fine-tuning proceeds for a total of 6.0 \times 10^{6} timesteps; the smoothed episode reward rises steadily from approximately -5576 to -5171, reflecting gradual adaptation to fixed-bias plant parameters and sensor noise (see Figure 7). The limited absolute improvement (Δ \approx 400) indicates that Stage 5 full-domain randomization had already brought the policy close to the Stage 6 performance ceiling, so Stage 6 fine-tuning primarily consolidates robustness rather than relearning the task. Stage 6 employs single-environment serial sampling in Gazebo (N_{env} = 1), resulting in an effective data throughput far lower than that of the offline 32-environment parallel stages (e.g., Stage 5 effective throughput: approximately 5.0 M \times 32 = 160 M). The policy nonetheless converges rapidly for three reasons: (1) Stage 5 full-domain randomization has sufficiently enhanced generalization, and online fine-tuning only needs to eliminate motor dynamic residuals rather than relearn the task; (2) the conservative learning rate (α = 10^{-5}) prevents catastrophic forgetting; and (3) the initial policy π_{5}^{*} already possesses strong tracking capability, significantly narrowing the fine-tuning search space.

(2) Lateral Channel Policy Transfer

Based on the dynamic symmetry of the X-configuration quadrotor ($I_{xx} \approx I_{yy}$, approximately isotropic $C_{d}$), the converged longitudinal policy $π_{long}^{*}$ is directly mapped to the lateral channel without retraining. The test conditions are $u_{y} = 8 \sin(0.2 π t)$ m/s superimposed with 0–5 m/s random wind, comprising $n = 100$ independent trials (50 trials each at 3 m/s and 5 m/s wind speeds); the results are shown in Figure 8.

The lateral-channel RMSE exceeds the longitudinal value by only 16.3%, which is below the commonly adopted 20% transfer feasibility threshold, validating the viability of the “train longitudinal–reuse lateral” approach.
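The transfer check can be reproduced with a few lines of arithmetic. The sketch below generates the lateral reference command and computes the relative RMSE increase from the two values given in the Figure 8 caption; the function names are illustrative only.

```python
import math


def lateral_reference(t: float) -> float:
    """Lateral velocity command u_y = 8 sin(0.2*pi*t) in m/s (period 10 s)."""
    return 8.0 * math.sin(0.2 * math.pi * t)


def relative_increase(baseline: float, value: float) -> float:
    """Percentage by which `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100.0


# RMSE values as reported in the Figure 8 caption (m/s).
gap = relative_increase(2.503, 2.912)
transfer_ok = gap < 20.0   # 20% feasibility threshold cited in the text
```

With these inputs, `gap` evaluates to about 16.3, matching the figure quoted above.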

3.3. Experimental Results and Analysis

The following four experiments compare RL-PID against an engineering-tuned fixed-gain PID ($K_{p,x} = 0.7$, $K_{p,u} = 3.0$, $K_{i,u} = 3.0$, $K_{d,u} = 0.3$).

The three experiments in Sections 3.3.1, 3.3.2 and 3.3.3 are each conducted with $n = 10$ independent repetitions in the high-fidelity simulation environment, each run under 1–3 m/s random wind disturbance (uniformly random wind direction; seed fixed at 42 for reproducibility), with statistics reported as mean ± standard deviation ($\bar{x} \pm s$, $n = 10$). Improvement is computed as the ratio of means: $Δ = (\bar{x}_{PID} - \bar{x}_{RL}) / \bar{x}_{PID} \times 100\%$.
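The reporting convention above can be written out as a short sketch. The per-run values are hypothetical placeholders rather than data from Tables 6–8; only the formula follows the text.

```python
from statistics import mean, stdev


def improvement_percent(pid_runs, rl_runs):
    """Ratio-of-means improvement: (mean_PID - mean_RL) / mean_PID * 100."""
    pid_mean = mean(pid_runs)
    return (pid_mean - mean(rl_runs)) / pid_mean * 100.0


def summarize(runs):
    """Return (mean, sample standard deviation) for one metric."""
    return mean(runs), stdev(runs)


# Hypothetical per-run metric values, for illustration only.
pid = [2.0, 2.2, 1.8, 2.0]
rl = [1.5, 1.7, 1.4, 1.4]
delta = improvement_percent(pid, rl)
```

Computing the improvement from the two means, rather than averaging per-run percentages, avoids giving extra weight to runs in which the baseline metric happens to be small.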

3.3.1. High-Speed Step Response

Experimental setup. The UAV starts from hover and receives a 13 m/s step velocity command sustained for 10 s, with a pitch angle constraint $|θ| \leq 60°$ (see Figure 9). The detailed quantitative results are summarized in Table 6.

3.3.2. Emergency-Braking Response

Experimental setup. The UAV cruises at $u = 13$ m/s; at $t = 3$ s, a step command to 0 m/s is issued and sustained for 10 s (see Figure 10). The quantitative results are summarized in Table 7.

3.3.3. Frequency-Domain Sweep Test

Experimental setup. Two test configurations are used. Test A (constant-amplitude frequency sweep) applies a chirp with amplitude ±8 m/s and frequency increasing linearly from 0.1 Hz to 1.0 Hz over 20 s. Test B (constant-frequency amplitude sweep) fixes the frequency at 0.5 Hz and increases the amplitude linearly from 3 m/s to 13 m/s over 20 s. The quantitative results are summarized in Table 8, and the corresponding response curves are shown in Figure 11.
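The two reference signals could be generated along the following lines. The sine form with zero initial phase and the 50 Hz sampling step are assumptions; the text specifies only the amplitudes, frequencies, and the 20 s duration.

```python
import math


def sweep_a_reference(t: float, amp: float = 8.0, f0: float = 0.1,
                      f1: float = 1.0, duration: float = 20.0) -> float:
    """Constant-amplitude linear chirp: frequency ramps f0 -> f1 over `duration`."""
    rate = (f1 - f0) / duration                      # Hz per second
    # Phase is the integral of the instantaneous frequency f0 + rate * t.
    phase = 2.0 * math.pi * (f0 * t + 0.5 * rate * t * t)
    return amp * math.sin(phase)


def sweep_b_reference(t: float, freq: float = 0.5, a0: float = 3.0,
                      a1: float = 13.0, duration: float = 20.0) -> float:
    """Fixed-frequency sine whose amplitude ramps a0 -> a1 over `duration`."""
    amp = a0 + (a1 - a0) * t / duration
    return amp * math.sin(2.0 * math.pi * freq * t)


DT = 0.02  # assumed 50 Hz sampling step
ref_a = [sweep_a_reference(k * DT) for k in range(1001)]
ref_b = [sweep_b_reference(k * DT) for k in range(1001)]
```

Integrating the instantaneous frequency to obtain the phase, instead of multiplying the current frequency by time, is what keeps the sweep in Test A within the stated 0.1–1.0 Hz range.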

3.3.4. Comprehensive Statistical Performance Test

Experimental setup. A total of 100 randomized mixed-trajectory trials (see Figure 12) covering four trajectory types (constant velocity, sinusoidal, step, and chirp) were conducted under 0–5 m/s random wind disturbance. Sample sizes per category were weighted by flight mission frequency: step 30 trials, constant velocity 24 trials, sinusoidal 24 trials, and chirp 22 trials, totaling 100 trials. Initial velocity, wind speed, and wind direction were uniformly randomly sampled within each category. All reported improvements were computed relative to the fixed PID baseline as $Δ = (RMSE_{PID} - RMSE_{RL}) / RMSE_{PID} \times 100\%$.

4. Discussion

4.1. Adaptive-Mechanism Analysis

By synthesizing the gain-variation curves across the four comparative experiments (Figure 9d, Figure 10d and Figure 11c,f), three characteristic adaptive scheduling behaviors can be identified:

Error-driven regulation. When the tracking error is large, $K_{p}$ and $K_{i}$ increase to accelerate convergence; as the error diminishes, $K_{i}$ decreases to suppress steady-state oscillation. In the step-response experiment (Section 3.3.1), $K_{i, u}$ actively rises during the transient phase ( $t < 2$ s) and returns to a lower value at steady state, ultimately reducing steady-state error by 62.8% while simultaneously decreasing overshoot by 18.5%.
State-synchronous regulation. Under periodic reference signals, gain variations exhibit correlation with both the amplitude and frequency of the reference signal. In the frequency sweep test (Section 3.3.3), $K_{p, u}$ and $K_{i, u}$ increase correspondingly during high-amplitude segments and decrease during low-amplitude segments, with the gain-adjustment rhythm synchronized to the reference signal period. This indicates that the policy exploits the dynamic characteristics of the reference signal as an implicit scheduling variable, enabling RL-PID to significantly outperform the fixed-gain scheme on chirp trajectories (improvement of 40.9%) and sinusoidal trajectories (improvement of 18.8%).
Constraint-aware regulation. When system states approach the $60 °$ pitch-angle-constraint boundary, the policy proactively reduces gains to mitigate the risk of constraint violation. In the emergency-braking experiment (Section 3.3.2), the RL-PID mean maximum pitch angle ( $59.4 \pm 0.7 °$ ) is slightly higher than that of the fixed PID ( $58.0 \pm 2.1 °$ ); neither violates the $60 °$ hard constraint, while RL-PID achieves a 9.5% reduction in braking time through more aggressive gain scheduling without any constraint violations.

These three behaviors demonstrate that the control law learned by the RL policy is not a fixed-gain mapping, but rather a comprehensive response to the current error state, reference signal dynamics, and system safety margins.

4.2. Comprehensive Performance Discussion

Two consistent conclusions emerge from Table 9, Table 10 and Table 11:

Performance improvement is positively correlated with task-dynamic complexity. For steady-state tracking (constant-velocity trajectories, 4.1% improvement) and large-amplitude step responses (RMSE improvement of 3.3%), the fixed PID already performs reasonably well, leaving limited marginal benefit from RL-PID. However, for time-varying dynamic signals (chirp, sinusoidal, and sweep), the advantage of RL-PID expands significantly with increasing task-dynamic complexity.

Safety constraints are effectively maintained throughout performance enhancement. In the mixed-trajectory test, the RL-PID maximum pitch angle ($53.9° \pm 11.1°$) is slightly higher than that of the fixed PID ($52.3° \pm 12.3°$), yet none of the 100 trials exceeded the $60°$ hard constraint (see Table 9). The slightly higher mean maximum pitch angle of RL-PID is attributable to its more aggressive gain scheduling, which actively increases $K_{p,u}$ during high-dynamic segments to accelerate tracking and thereby drives pitch angles closer to the constraint boundary, while the Lagrange multiplier mechanism suppresses boundary exceedance risk in a timely manner when the system approaches constraints. This observation is consistent with the “constraint-aware regulation” analysis in Section 4.1: RL-PID trades a limited pitch-angle margin for faster dynamic response while maintaining overall compliance with hard constraints. The joint action of the LMI safety domain and Lagrangian-constrained PPO forms a dual-layer protection mechanism of “soft-boundary guidance + hard-boundary backstop.”

Variance analysis of the mixed-trajectory test. The coefficient of variation (CV ≈ 89%) of the RL-PID velocity RMSE in Table 9 originates from trajectory-type heterogeneity rather than sporadic controller instability. The per-category statistics in Table 10 corroborate this assessment: the inherent task-difficulty gap between chirp-trajectory RL-PID RMSE (1.11 m/s) and step-trajectory RMSE (2.80 m/s) is approximately 2.5-fold, naturally inflating the pooled variance when mixed. The fixed-PID pooled standard deviation (±1.69) is virtually identical to that of RL-PID (±1.70), further supporting this conclusion.
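The effect of pooling heterogeneous trajectory types can be illustrated numerically. In the sketch below, the two category means are the values quoted above, while the within-category spread (±0.2 m/s) and the equal mix of the two categories are hypothetical choices made only to show the mechanism.

```python
from statistics import mean, pstdev

# Category means reported in the text (m/s); the within-category spread and
# the equal 50/50 mix are hypothetical, chosen only to illustrate the effect.
chirp = [1.11 - 0.2, 1.11, 1.11 + 0.2]
step = [2.80 - 0.2, 2.80, 2.80 + 0.2]


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    return pstdev(samples) / mean(samples)


cv_chirp = coefficient_of_variation(chirp)
cv_step = coefficient_of_variation(step)
cv_pooled = coefficient_of_variation(chirp + step)
```

The pooled coefficient of variation exceeds both per-category values because the pooled variance adds the spread between category means to the spread within each category.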

4.3. Extended Baselines, Runtime Trade-Offs, and Training Sensitivity

Positioning against extended baselines. Since fixed-gain PID and the LMI-ablated variant do not cover the full control landscape, it is useful to position the proposed method against Gain-Scheduled PID, LPV/LMI fixed-gain scheduling, MPC, and ADRC. Conceptually, RL-PID separates three roles that are often coupled: robust feasibility, online performance adaptation, and computational deployment. The LMI region provides an explicit safe-search domain, avoiding unconstrained gain exploration; the RL policy then exploits the remaining physical margin to reduce the conservatism of worst-case LPV/LMI tuning under time-varying references. Compared with MPC, this design shifts the optimization burden from online receding-horizon solving to offline policy training, leaving only a lightweight neural-network evaluation during flight. Compared with ADRC, the search space is bounded by explicit physical and stability constraints rather than relying solely on feedback-error-driven disturbance rejection. Thus, the observed gain does not come from RL alone or from the LMI box alone, but from their division of labor: LMI defines admissible behavior, while RL selects task-dependent gains within that admissible set.

Computation–performance trade-off analysis. Figure 13 provides a quantitative view of this positioning by comparing Fixed PID, Gain-Scheduled PID, Linear MPC, and RL-PID across three benchmark scenarios. The x-axis reports per-step computation time in μs, and the y-axis reports velocity RMSE; error bars denote the standard deviation over $n = 10$ trials. The results show that Linear MPC achieves the lowest RMSE (≈0.81–0.93 m/s), but at a much higher computational cost (276 μs/step), whereas Fixed PID is cheapest (≈3.4 μs/step) but loses accuracy in dynamic trajectories such as Chirp Sweep. RL-PID lies between these extremes: its runtime cost is close to Gain-Scheduled PID (approximately 9–10 μs/step), while its accuracy approaches MPC in the Chirp Sweep and Mixed Maneuver scenarios. In the easier step-response case, all methods converge to a similar RMSE (≈0.76 m/s), so computation becomes the dominant differentiator. Overall, RL-PID occupies a favorable region of the Pareto front, retaining PID-class computational efficiency while recovering much of the adaptive performance usually associated with more expensive online optimization.
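A Pareto front over (computation time, RMSE) pairs can be extracted with a simple dominance test, sketched below. The computation times follow the figures quoted in the text; the RMSE entries for Fixed PID, Gain-Scheduled PID, and RL-PID are hypothetical placeholders, not values read from Figure 13.

```python
def pareto_front(points):
    """Return names of points not dominated in (time, rmse); lower is better."""
    front = []
    for name, t, e in points:
        # A point is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            (t2 <= t and e2 <= e) and (t2 < t or e2 < e)
            for _, t2, e2 in points
        )
        if not dominated:
            front.append(name)
    return front


# (controller, per-step time in microseconds, velocity RMSE in m/s).
# RMSE values other than the MPC entry are hypothetical placeholders.
candidates = [
    ("Fixed PID", 3.4, 1.60),
    ("Gain-Scheduled PID", 9.0, 1.30),
    ("RL-PID", 10.0, 0.95),
    ("Linear MPC", 276.0, 0.85),
]
front = pareto_front(candidates)
```

With these placeholder numbers every controller is Pareto-optimal, so the practical question becomes which region of the front suits the available onboard compute budget.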

Training sensitivity and robustness of the comparison. The above trade-off is meaningful only if the RL-PID policy is not a fragile outcome of hyperparameter tuning. In practice, stable learning requires the training dynamics to follow the same hierarchy as the controller design: the actor should first learn useful tracking behavior, while the Lagrange multiplier gradually enforces constraint discipline. We therefore use separated timescales ($α_{θ} = 10^{-4}$ for the policy and $α_{λ} = 10^{-5}$ for the multiplier). When $α_{λ}$ was increased toward $α_{θ}$, small transient violations were penalized too early, causing oscillatory policies and loss of tracking ability; when it was too small, constraint enforcement became delayed during long transients. Similarly, overly tight violation margins or oversized penalty weights pushed the policy toward nearly static gains inside the LMI domain, recreating the conservatism that RL adaptation is intended to reduce. Initializing near the LMI midpoint and using soft margins therefore implements a practical “soft guidance before hard backstop” mechanism, supporting the stability of the Pareto comparison rather than merely improving one tuned run.

4.4. Limitations and Future Work

Real-World Applicability and Sim-to-Real Gaps. The safety guarantees in this study hold under the bounded-LPV assumptions used in their derivation and remain sensitive to significant model mismatch or unmodeled aerodynamic perturbations. Unmodeled actuator latency, aggressive sensor noise, or structural asymmetry ( $ΔI / I > 4\%$ ) could breach the current offline bounds. Although the proposed method is validated in a high-fidelity Gazebo environment incorporating standard actuator dynamics, the present contribution is an analytical and simulation-based engineering study rather than a full hardware-level deployment. Future work must account for unmodeled noise scaling and address the sim-to-real gap in hardware deployment, for example through domain adaptation or fine-tuning.
Position–velocity trade-off. The proposed method prioritizes velocity tracking performance; position error may become non-negligible during highly dynamic tests. Future work may introduce multi-objective reward functions that explicitly balance velocity-tracking accuracy and position drift.
Extended baseline comparisons. The analytical positioning against MPC, LPV, and ADRC lays the foundation for future empirical validation. Implementing these baselines in parallel on identical test-bed environments is required to isolate the improvement attributable specifically to RL from that of other sophisticated controllers.

5. Conclusions

This paper develops a hierarchical adaptive PID control method that integrates proximal policy optimization with LMI-based constraints to alleviate multirotor control degradation during large-envelope flight. Using an LPV formulation, LMI feasibility guarantees inner-loop stability at the linearization vertices and bounds the range of outer-loop online adjustments. In high-fidelity simulation benchmarks, the proposed framework outperforms the fixed-gain baseline, with a 15.0% reduction in overall mean RMSE and up to 40.9% improvement on chirp trajectories. The mathematical stability guarantee covers only the fixed inner loop at the defined vertices; the online RL parameter adaptation is protected by safety boundaries and a Lagrangian penalty, which act as an engineering safeguard rather than a proof of global stability, particularly outside simulation. Validating robustness against pronounced structural asymmetry and unmodeled actuator delays, and conducting comparative flight tests against advanced controllers such as model predictive control, remain priorities for moving the framework toward deployment.

Author Contributions

Conceptualization, Z.T. and B.Z.; methodology, Z.T.; software, Z.T.; validation, Z.T., S.H. and H.F.; formal analysis, Z.T.; investigation, Z.T.; resources, B.Z.; data curation, S.H.; writing—original draft preparation, Z.T.; writing—review and editing, W.Z.; visualization, H.F.; supervision, W.Z.; project administration, B.Z.; and funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52202513 and 52302511) and the Guangdong Basic and Applied Basic Research Foundation (No. 2021A1515110797 and No. 2023A1515010023).

Data Availability Statement

The original contributions presented in this study are included in the article and its appendices.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned aerial vehicle
RL	Reinforcement learning
PID	Proportional–integral–derivative
PPO	Proximal policy optimization
LMI	Linear matrix inequality
LPV	Linear parameter-varying
MDP	Markov decision process
RMSE	Root mean square error
MAE	Mean absolute error
GAE	Generalized advantage estimation
DDS	Data distribution service

Appendix A. Longitudinal-Dynamics-Model Symbol Definitions

Table A1. Symbol definitions for Equations (1)–(9).

Symbol	Description	Unit
x	Forward position	m
u	Forward velocity	m/s
$θ$	Pitch angle	rad
q	Pitch rate	rad/s
m	Total vehicle mass	kg
g	Gravitational acceleration	m/s²
$C_{d}$	Fuselage-drag coefficient	N/(m/s)²
$F_{T}$	Total thrust	N
$I_{y y}$	Pitch-axis moment of inertia	kg·m²
$τ_{pitch}$	Pitch control torque	N·m
$X_{a}$	Augmented state vector ${[e_{x}, u, θ, q, ξ_{u}, ξ_{q}]}^{T}$	—
$ξ_{u}, ξ_{q}$	Velocity loop and angular rate loop integral states	—
$K_{total}$	Structured feedback matrix encoding all cascaded PID gains	—
r	External reference input (velocity reference $u_{ref}$ )	m/s
$E_{a, k}$	Reference signal distribution vector at vertex k	—
$μ_{k}$	Polytopic convex combination weights, $μ_{k} \geq 0$ , $\sum μ_{k} = 1$	—

Appendix B. Complete LMI-Constraint Derivation

The LMI constraint set jointly guarantees closed-loop stability and robust performance across all four LPV vertices. Let the augmented closed-loop system matrix at vertex $k$ be $A_{c} = A_{a,k} - B_{a,k} K$, $k = 1, \dots, 4$.

(B1) Quadratic (Lyapunov) stability: A common symmetric positive-definite matrix $P ≻ 0$ is required to satisfy the following for all vertices:

A_{c}^{T} P + P A_{c} ≺ 0, k = 1, \dots, 4

(A1)

(B2) $α$-stability: All closed-loop eigenvalues are constrained to have real parts satisfying $Re(λ) < -α$:

A_{c}^{T} P + P A_{c} + 2 α P ≺ 0

(A2)

(B3) Conic-sector constraint: Eigenvalues are restricted to the left-half-plane sector $|\arg(-λ)| < φ$, ensuring a minimum damping ratio of $\cos φ$:

[\begin{matrix} sin φ (A_{c}^{T} P + P A_{c}) & cos φ (P A_{c} - A_{c}^{T} P) \\ cos φ (A_{c}^{T} P - P A_{c}) & sin φ (A_{c}^{T} P + P A_{c}) \end{matrix}] ≺ 0

(A3)

(B4) $H_{\infty}$ performance constraint: Given the disturbance-augmented system $\dot{X}_{a} = A_{c} X_{a} + B_{w} w$, $z = C_{z} X_{a}$, the $L_{2}$ gain from $w$ to $z$ is bounded by $γ_{\infty}$:

[\begin{matrix} A_{c}^{T} P + P A_{c} & P B_{w} & C_{z}^{T} \\ B_{w}^{T} P & - γ_{\infty} I & 0 \\ C_{z} & 0 & - γ_{\infty} I \end{matrix}] ≺ 0

(A4)

(B5) Positive-definiteness constraint:

P ≻ 0, γ_{\infty} > 0

(A5)

The selected $α$ and $φ$ values for each control loop are listed in Appendix D, Table A3.
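For readers who wish to sanity-check a candidate Lyapunov matrix, conditions (A1), (A2), and (A5) can be verified numerically by eigenvalue tests, as sketched below. The sketch assumes NumPy is available, uses hypothetical 2×2 vertex matrices rather than the paper's $A_{a,k} - B_{a,k} K$, and builds the candidate $P$ from a single Lyapunov equation; a real design would obtain a common $P$ from a semidefinite-programming solver over all vertices. The conic-sector and $H_{\infty}$ conditions are not covered.

```python
import numpy as np


def lyapunov_solve(a: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Solve A^T P + P A = -Q for P via the Kronecker-product (vec) form."""
    n = a.shape[0]
    eye = np.eye(n)
    lhs = np.kron(eye, a.T) + np.kron(a.T, eye)
    p = np.linalg.solve(lhs, -q.reshape(-1)).reshape(n, n)
    return 0.5 * (p + p.T)   # symmetrize against round-off


def satisfies_alpha_stability(a_c, p, alpha):
    """Check P > 0 and A_c^T P + P A_c + 2*alpha*P < 0 by eigenvalues."""
    m = a_c.T @ p + p @ a_c + 2.0 * alpha * p
    return bool(np.all(np.linalg.eigvalsh(p) > 0)
                and np.all(np.linalg.eigvalsh(m) < 0))


# Hypothetical closed-loop vertices (not the paper's matrices).
vertices = [np.array([[0.0, 1.0], [-6.0, -5.0]]),
            np.array([[0.0, 1.0], [-8.0, -6.0]])]
alpha = 1.0
# Candidate common P from the first vertex shifted by alpha.
p = lyapunov_solve(vertices[0] + alpha * np.eye(2), np.eye(2))
feasible = all(satisfies_alpha_stability(a, p, alpha) for a in vertices)
```

Setting `alpha = 0` in the check reduces it to the plain quadratic-stability condition (A1).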

Appendix C. Supplementary PPO-Algorithm Formulas

(C1) State-value function:

V^{π} (s_{t}) = E [\sum_{k = 0}^{\infty} γ^{k} r_{t + k} | s_{t}]

(A6)

(C2) Action-value function:

Q^{π} (s_{t}, a_{t}) = E [\sum_{k = 0}^{\infty} γ^{k} r_{t + k} | s_{t}, a_{t}]

(A7)

The advantage function $\hat{A}^{π}(s_{t}, a_{t}) = Q^{π}(s_{t}, a_{t}) - V^{π}(s_{t})$ is computed via generalized advantage estimation (GAE, $λ_{GAE} = 0.95$). The composite advantage function is $\hat{A}_{t} = \hat{A}_{R} - λ \hat{A}_{C}$, where $\hat{A}_{R}$ and $\hat{A}_{C}$ are the advantage estimates for performance reward and safety cost, respectively.
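The GAE recursion and the composite advantage can be sketched as follows. The sketch assumes separate value estimates for reward and cost and uses a hypothetical two-step trajectory with zero value estimates; it is not the authors' training code.

```python
def gae(rewards, values, gamma=0.99, lam_gae=0.95):
    """Generalized advantage estimation for one trajectory.

    `values` holds V(s_0..s_T), i.e. one more entry than `rewards`, with the
    final entry serving as the bootstrap value.
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam_gae * running
        adv[t] = running
    return adv


def composite_advantage(adv_reward, adv_cost, lagrange_multiplier):
    """A_t = A_R - lambda * A_C, combining reward and safety-cost advantages."""
    return [ar - lagrange_multiplier * ac
            for ar, ac in zip(adv_reward, adv_cost)]


# Hypothetical two-step trajectory with zero value estimates.
a_r = gae([1.0, 1.0], [0.0, 0.0, 0.0])
a_c = gae([0.0, 1.0], [0.0, 0.0, 0.0])
a_t = composite_advantage(a_r, a_c, lagrange_multiplier=0.5)
```

A larger multiplier shifts the composite advantage toward penalizing actions that precede safety cost, which is how the constraint enters the policy gradient in (A8).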

(C3) Policy gradient:

\nabla_{θ} J (θ) = E_{t} [\nabla_{θ} log π_{θ} (a_{t} ∣ s_{t}) {\hat{A}}_{t}]

(A8)

The PPO clipping surrogate objective $L^{CLIP}(θ)$, total loss $L(θ)$, Lagrangian composite objective $L(θ, λ)$, and dual-timescale update rules are provided in Equations (14)–(17) of the main text and are not repeated here.

Appendix D. Experimental-Configuration Parameters

Table A2. Simulation-UAV physical parameters.

Category	Parameter	Symbol	Value	Unit
Basic	Total mass	m	1.4	kg
	Gravitational accel.	g	9.8	m/s²
Inertia	Roll axis	$I_{x x}$	0.0211	kg·m²
	Pitch axis	$I_{y y}$	0.0219	kg·m²
	Yaw axis	$I_{z z}$	0.0366	kg·m²
Geometry	Arm length	l	0.241	m
Aerodynamics	Thrust coeff.	$C_{t}$	$1.105 \times 10^{- 5}$	N/(rad/s)²
	Torque coeff.	$C_{m}$	$1.779 \times 10^{- 7}$	N·m/(rad/s)²
	Drag coeff.	$C_{d}$	0.073	N/(m/s)²
	Damping torque coeff.	$C_{d m}$	0.0055	N·m/(rad/s)²
Motor	Response time const.	$T_{m}$	0.02	s

Table A3. LMI-constraint parameters for each control loop.

Control Loop	$α$-Stability	Conic Angle $φ$	Phase Margin	Gain Margin	Bandwidth
Angular rate loop	≥2.0	≤ $50 °$	≥ $60 °$	≥10 dB	[10, 30] rad/s
Attitude loop	≥2.0	≤ $50 °$	≥ $60 °$	≥10 dB	[5, 15] rad/s
Velocity loop	≥1.0	≤ $60 °$	≥ $50 °$	≥8 dB	[1, 5] rad/s
Position loop	≥0.8	≤ $65 °$	≥ $45 °$	≥6 dB	[0.5, 2] rad/s

Table A4. PPO-hyperparameter configuration.

Category	Parameter	Value
Algorithm	$ε_{clip}$	0.2
	$γ$	0.99
	$λ_{GAE}$	0.95
Network training	$α_{θ}$	$10^{- 4}$ (linear decay)
	$α_{λ}$	$10^{- 5}$
	$δ_{grad}$ (gradient clipping)	0.5
	$N_{epoch}$	5
	$N_{batch}$	512
Loss weights	$c_{1}$ (vf_coef)	0.5
	$c_{2}$ (ent_coef)	0.01
Sampling	$N_{step}$	1024
	$N_{env}$ (parallel envs)	32 (Stages 1–5)/1 (Stage 6)

Training platform: AMD Ryzen-9 5950X/NVIDIA RTX 4060/32 GB DDR4, Ubuntu 22.04 LTS. Total training: approximately $1.68 \times 10^{7}$ timesteps (offline Stages 1–5: 6.5 M; online Stage 6: approximately 6.0 M).

References

Shauqee, M.N.; Rajendran, P.; Suhadis, N.M. Quadrotor Controller Design Techniques and Applications Review. INCAS Bull. 2021, 13, 179–194.
Lopez-Sanchez, I.; Moreno-Valenzuela, J. PID control of quadrotor UAVs: A survey. Annu. Rev. Control 2023, 56, 100900.
Moreno-Valenzuela, J.; Perez-Alcocer, R.; Guerrero-Medina, M.; Dzul, A. Nonlinear PID-Type Controller for Quadrotor Trajectory Tracking. IEEE/ASME Trans. Mechatron. 2018, 23, 2436–2447.
Kazemi, M.H.; Tarighi, R. PID-based attitude control of quadrotor using robust pole assignment and LPV modeling. Int. J. Dyn. Control 2024, 12, 2385–2397.
Zhu, X.; Li, Y.; Wang, H.; Shuai, Z.; Huang, H.; Yin, G. Integrated Physics-Data Based LPV Attitude Control of Quadrotor UAV System. IEEE Trans. Ind. Electron. 2025, 72, 9635–9644.
Borase, R.P.; Maghade, D.K.; Sondkar, S.Y.; Pawar, S.N. A review of PID control, tuning methods and applications. Int. J. Dyn. Control 2021, 9, 818–827.
Rinaldi, M.; Primatesta, S.; Guglieri, G. A Comparative Study for Control of Quadrotor UAVs. Appl. Sci. 2023, 13, 3464.
Gün, A. Attitude control of a quadrotor using PID controller based on differential evolution algorithm. Expert Syst. Appl. 2023, 229, 120518.
Muthusamy, P.K.; Garratt, M.; Pota, H.; Muthusamy, R. Real-Time Adaptive Intelligent Control System for Quadcopter Unmanned Aerial Vehicles With Payload Uncertainties. IEEE Trans. Ind. Electron. 2022, 69, 1641–1653.
Yang, S.; Xi, L.; Hao, J.; Wang, W. Aerodynamic-Parameter Identification and Attitude Control of Quadrotor Model with CIFER and Adaptive LADRC. Chin. J. Mech. Eng. 2021, 34, 18.
Ullah, S.; Alghamdi, H.; Algethami, A.A.; Alghamdi, B.; Hafeez, G. Robust Control Design of Under-Actuated Nonlinear Systems: Quadcopter Unmanned Aerial Vehicles with Integral Backstepping Integral Terminal Fractional-Order Sliding Mode. Fractal Fract. 2024, 8, 412.
Nwafor, S.C.; Eneh, J.N.; Ndefo, M.I.; Ugbe, O.C.; Ugwu, H.I.; Ani, O. An optimal hybrid quadcopter control technique with MPC-based backstepping. Arch. Control Sci. 2024, 34, 39–62.
Huang, T.; Pan, H.; Sun, W.; Gao, H. Sine Resistance Network-Based Motion-Planning Approach for Autonomous Electric Vehicles in Dynamic Environments. IEEE Trans. Transp. Electrif. 2022, 8, 2862–2873.
Zhao, N.; Lun, D.; Zhang, H.; Zhao, X.; Rudas, I.J. Composite Anti-Disturbance Control for Networked Systems With Disturbances and Actuator Attacks via Event-Triggered Output Feedback. IEEE Trans. Cybern. 2026, 56, 393–403.
Liu, H.; Kiumarsi, B.; Kartal, Y.; Taha Koru, A.; Modares, H.; Lewis, F.L. Reinforcement Learning Applications in Unmanned Vehicle Control: A Comprehensive Overview. Unmanned Syst. 2023, 11, 17–26.
Azar, A.T.; Koubaa, A.; Mohamed, N.A.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone Deep Reinforcement Learning: A Review. Electronics 2021, 10, 999.
Koch, W.; Mancuso, R.; West, R.; Bestavros, A. Reinforcement Learning for UAV Attitude Control. ACM Trans. Cyber-Phys. Syst. 2019, 3, 1–21.
Wang, Y.; Zhang, W.; Mou, J.; Zheng, K. Attitude Control Based on Reinforcement Learning for Quadrotor. In Proceedings of the 2021 International Conference on Autonomous Unmanned Systems (ICAUS), Changsha, China, 24–26 September 2021; pp. 331–338.
Sönmez, S.; Montecchio, L.; Martini, S.; Rutherford, M.J.; Rizzo, A.; Stefanovic, M.; Valavanis, K.P. Reinforcement Learning-Based PD Controller Gains Prediction for Quadrotor UAVs. Drones 2025, 9, 581.
Alrubyli, Y.; Bonarini, A. Using Q-Learning to Automatically Tune Quadcopter PID Controller Online for Fast Altitude Stabilization. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA), Guilin, China, 7–10 August 2022; pp. 514–519.
Dogru, O.; Velswamy, K.; Ibrahim, F.; Wu, Y.; Sundaramoorthy, A.S.; Huang, B.; Xu, S.; Nixon, M.; Bell, N. Reinforcement learning approach to autonomous PID tuning. Comput. Chem. Eng. 2022, 161, 107760.
Ping, H.; Han, B. An Automatic PID Tuning Method for DEP Fixed-Wing Aircraft Based on Reinforcement Learning. In Proceedings of the 2024 International Conference on Autonomous Unmanned Systems (ICAUS); Springer: Singapore, 2024; pp. 15–24.
Xue, W.; Wu, H.; Ye, H.; Shao, S. An Improved Proximal Policy Optimization Method for Low-Level Control of a Quadrotor. Actuators 2022, 11, 105.
Zhai, Y.; Zhao, Q.; Han, Y.; Wang, J.; Zeng, W. Intelligent PID Controller Based on Deep Reinforcement Learning. In Proceedings of the 2024 8th International Conference on Robotics, Control and Automation (ICRCA), Shanghai, China, 12–14 January 2024; pp. 343–348.
Wang, H.; Ricardez-Sandoval, L.A. A Deep Reinforcement Learning-Based PID Tuning Strategy for Nonlinear MIMO Systems with Time-varying Uncertainty. IFAC-PapersOnLine 2024, 58, 887–892.
Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Knoll, A. A Review of Safe Reinforcement Learning: Methods, Theories, and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11216–11235.
Mannucci, T.; van Kampen, E.J.; de Visser, C.; Chu, Q. Safe Exploration Algorithms for Reinforcement Learning Controllers. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1069–1081.
Boyd, S.; Hast, M.; Åström, K.J. MIMO PID tuning via iterated LMI restriction. Int. J. Robust Nonlinear Control 2016, 26, 1718–1731.
Saeed, A.; Bhatti, A.I.; Malik, F.M. LMIs-Based LPV Control of Quadrotor with Time-Varying Payload. Appl. Sci. 2023, 13, 6553.
Salvato, E.; Fenu, G.; Medvet, E.; Pellegrino, F.A. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. IEEE Access 2021, 9, 153171–153187.

Figure 1. Safety-constrained dual-timescale reinforcement learning framework. The framework consists of the Agent (step-level and episode-level learning) and the Environment (Env). Symbols are defined as follows: $s_{t}$ represents the current state; $a_{t}$ is the action generated by the Actor; $V_{π}(x, λ)$ is the value function estimated by the Critic; $λ$ is the Lagrange multiplier dynamically adjusting the safety penalty; $J_{C}(π_{θ})$ is the expected cumulative safety cost; $C_{t}$ and $R_{t}$ denote the episode-level total cost and composite return, respectively, fed back to the Critic; $r_{t}$ is the step-level reward; $c_{t}$ is the step-level cost (aggregated into $C_{t}$); and $K_{target}$ and $K_{t}$ are the intermediate and final step PID gains processed through safety mapping and exponential moving average (EMA) smoothing.

Figure 2. Overall hierarchical framework encompassing LPV modeling, LMI safety domain construction, and two-phase RL training. Labels on the interconnections delineate the progression of mathematical models, parameter bounds, and control policies across the integrated steps.

Figure 3. Pole-placement verification (fixed inner-loop parameters).

Figure 4. Three-dimensional cross-section of the longitudinal outer-loop parameter feasible region.

Figure 5. Reward-curve evolution across five-stage offline curriculum training. Different colors distinguish the five training stages; in each subplot, the solid line indicates the mean episode reward and the shaded region represents the standard deviation.

Figure 6. LMI constraint-ablation comparison (Stage 5). (a) Episode-reward-training curves and (b) constraint violations per episode. Dashed lines indicate the mean over the last 1000 episodes.

Figure 7. Stage 6, Gazebo online fine-tuning training curve. The shaded colored area represents the standard deviation of the episode reward. The smoothed episode reward (window = 2000 episodes) increases from approximately $-5576$ at the start to $-5171$ at the end of $6.0 \times 10^{6}$ timesteps, demonstrating steady policy improvement under fixed-bias parameters and sensor noise.

Figure 8. Performance comparison of longitudinal and lateral channel policy transfer. (a) Representative velocity tracking curves (wind speed 1.5 m/s); (b) RMSE mean ± standard deviation comparison across four control schemes ($n = 100$); and (c) time-averaged gain value comparison. The longitudinal RL-PID RMSE is 2.503 m/s; after lateral transfer, it increases to only 2.912 m/s (+16.3%), well below the 20% feasibility threshold.

Figure 9. High-speed step-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.

Figure 10. Emergency-braking dynamic-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.

Figure 11. Frequency-sweep-test response comparison. (a) Test A velocity response; (b) Test A pitch angle; (c) Test A adaptive-gain curves; (d) Test B velocity response; (e) Test B pitch angle; and (f) Test B adaptive-gain curves. Dashed lines denote reference commands in (a,d), initial states in (b,e), and Fixed-PID values in (c,f).

Figure 12. Statistical distribution of mixed-trajectory test results. (a) Velocity RMSE; (b) velocity MAE; (c) maximum $|\theta|$; and (d) RMSE by trajectory type ($n = 100$). Circles denote outliers, and black diamonds denote mean values.

Figure 13. Computation–performance Pareto front across three benchmark scenarios. Error bars: mean ± std ($n = 10$). Dashed grey lines connect Pareto-optimal points (circled). Lower-left is better.

Table 1. Systematic comparison with representative related methods.

Method	Safety Mechanism	Control Architecture	Policy Transfer	Distinction of This Work
Wang et al. [18]	No explicit constraints	RL outputs attitude commands	None	LMI safety domain for outer loop
Sönmez et al. [19]	Action clipping	RL predicts PD gains	None	LMI polytopic constraints; integral terms & transfer
Xue et al. [23]	Penalty reward	PPO outputs control commands	None	Fixed inner-loop PID; LMI safety domain
Saeed et al. [29]	LMI-LPV robust control	Fixed LPV gain scheduling	None	RL online adaptation within LMI domain
This work	LMI + Lagrangian PPO	Inner fixed + outer RL	Long.→Lat. transfer	—

Table 2. Longitudinal outer-loop parameter LMI safety domain.

Parameter	Lower Bound $K_{min}$	Upper Bound $K_{max}$	Initial Value (Midpoint)
$K_{p, x}$	0.2	4.5	2.4
$K_{p, u}$	1.0	10.0	5.5
$K_{i, u}$	3.0	25.0	14.0
$K_{d, u}$	0.05	0.55	0.3
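To make the role of the Table 2 bounds concrete, the following sketch stores them as a box and maps a normalized policy action onto each gain range. The bounds are taken from Table 2; the affine action-to-gain map, the `[-1, 1]` action range, and the names `GainBounds` and `SAFE_DOMAIN` are illustrative assumptions, since this section does not specify how the policy output is scaled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GainBounds:
    """Lower and upper bound of one outer-loop gain (values from Table 2)."""
    lower: float
    upper: float

    def midpoint(self) -> float:
        return 0.5 * (self.lower + self.upper)

    def clip(self, value: float) -> float:
        """Project a proposed gain back into [lower, upper]."""
        return min(max(value, self.lower), self.upper)

    def from_action(self, action: float) -> float:
        """Assumed affine map: -1 -> lower, 0 -> midpoint, +1 -> upper."""
        a = min(max(action, -1.0), 1.0)
        return self.midpoint() + 0.5 * (self.upper - self.lower) * a


# Hypothetical container keyed by gain name; bounds copied from Table 2.
SAFE_DOMAIN = {
    "Kp_x": GainBounds(0.2, 4.5),
    "Kp_u": GainBounds(1.0, 10.0),
    "Ki_u": GainBounds(3.0, 25.0),
    "Kd_u": GainBounds(0.05, 0.55),
}

for name, bounds in SAFE_DOMAIN.items():
    print(name, round(bounds.midpoint(), 3))
```

Under this mapping a zero action reproduces the midpoint initial values in the last column of Table 2. The exact midpoint of $K_{p, x}$ is 2.35, which the table appears to report rounded to 2.4.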

Table 3. Five-stage offline curriculum training configuration.

Stage	Reference Signal	Environment/Randomization	Training Objective	Steps
Stage 1	Constant velocity $u_{x} = 5.0$ m/s	No wind, widest constraints	Baseline tracking	1.0 M
Stage 2	Sinusoidal, 8 m/s, 0.3 Hz	No wind	Periodic dynamic tracking	2.0 M
Stage 3	Chirp, 0.1–0.5 Hz, 8–0 m/s	No wind	Wideband frequency response	3.0 M
Stage 4	Chirp (same as Stage 3)	Mild wind 0–2 m/s	Tracking under disturbance	4.0 M
Stage 5	Mixed chirp/sinusoidal/step	Full-domain rand., wind 0–2.5 m/s	Generalization	5.0 M

Table 4. Performance summary at the conclusion of each offline training stage.

Stage	Environment	Final Reward	Mean Velocity Error
Stage 1	Simplified simulation	≈−570	0.41 m/s
Stage 2	Simplified simulation	≈−4750	1.03 m/s
Stage 3	Simplified simulation	≈−2870	1.24 m/s
Stage 4	Simplified simulation	≈−2940	1.24 m/s
Stage 5	Simplified + domain rand.	≈−5460	1.47 m/s

Table 5. LMI constraint-ablation study statistics (Stage 5, last 1000 episodes).

Metric	With LMI	Without LMI	Difference
Constraint violations/ep	$13.8 \pm 5.6$	$48.4 \pm 2.5$	$-71.4\%$
Episode Reward	$- 6185 \pm 2610$	$- 4783 \pm 1052$	— ^†

^† The reward function compositions differ between the two variants (with LMI including constraint penalty terms); hence, episode reward is not directly comparable.

Table 6. Step-response performance comparison ($n = 10$, random wind 0–3 m/s, and mean ± std). Bold values indicate the best performance.

Metric	RL-PID	Fixed-PID	Improvement
Rise time $t_{r}$ (s)	$1.451 \pm 0.604$	$1.837 \pm 0.869$	$+21.0\%$
Settling time $t_{s}$ (s)	$3.982 \pm 2.918$	$8.042 \pm 2.054$	$+50.5\%$
Overshoot $M_{p}$ (%)	$1.692 \pm 1.178$	$2.077 \pm 0.977$	$+18.5\%$
Steady-state error $e_{ss}$ (m/s)	$0.175 \pm 0.240$	$0.471 \pm 0.462$	$+62.8\%$
RMSE (m/s)	$3.309 \pm 0.206$	$3.423 \pm 0.311$	$+3.3\%$
MAE (m/s)	$1.363 \pm 0.310$	$1.646 \pm 0.441$	$+17.1\%$

Table 7. Emergency-braking performance comparison ($n = 10$, random wind 0–3 m/s, and mean ± std). Bold values indicate the best performance.

Metric	RL-PID	Fixed-PID	Improvement
Braking time $t_{brake}$ (s)	$1.194 \pm 0.141$	$1.319 \pm 0.091$	$+9.5\%$
Braking distance $d_{brake}$ (m)	$7.773 \pm 1.319$	$8.428 \pm 1.287$	$+7.8\%$
Max. pitch angle $\theta_{max}$ (°)	$59.418 \pm 0.688$	$58.001 \pm 2.132$	$-2.4\%$

Table 8. Frequency-sweep-test performance comparison ($n = 10$, random wind 0–3 m/s, and mean ± std).

Test Type	Metric	RL-PID	Fixed-PID	Improvement
Const.-amplitude	RMSE (m/s)	$3.781 \pm 0.137$	$4.086 \pm 0.059$	$+7.5\%$
	MAE (m/s)	$2.947 \pm 0.130$	$3.218 \pm 0.064$	$+8.4\%$
Const.-frequency	RMSE (m/s)	$3.873 \pm 0.118$	$4.429 \pm 0.060$	$+12.6\%$
	MAE (m/s)	$3.066 \pm 0.062$	$3.453 \pm 0.029$	$+11.2\%$

Table 9. Mixed-trajectory test overall statistics ($n = 100$).

Metric	RL-PID	Fixed PID	Improvement
Velocity RMSE (m/s)	$1.90 \pm 1.70$	$2.23 \pm 1.69$	15.0%
Velocity MAE (m/s)	$1.21 \pm 1.21$	$1.52 \pm 1.26$	20.5%
Max. pitch angle (°)	$53.9 \pm 11.1$	$52.3 \pm 12.3$	$-3.1\%$

Table 10. Per-trajectory RMSE comparison (mean ± std, unit: m/s).

Trajectory Type	n	RL-PID	Fixed PID	Improvement
Constant velocity	24	$1.65 \pm 0.90$	$1.72 \pm 0.91$	$+4.1\%$
Sinusoidal	24	$1.75 \pm 1.98$	$2.16 \pm 2.00$	$+18.8\%$
Step	30	$2.80 \pm 2.13$	$2.97 \pm 2.22$	$+5.8\%$
Chirp	22	$1.11 \pm 0.52$	$1.87 \pm 0.46$	$+40.9\%$
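The per-trajectory means in Table 10 can be cross-checked against the overall velocity RMSE in Table 9 by weighting each trajectory type by its trial count. The sketch below does only that arithmetic; the dictionary layout and the function name `pooled_mean` are illustrative, and the inputs are the rounded values printed in Table 10.

```python
# (n, RL-PID mean RMSE, Fixed PID mean RMSE) per trajectory type, from Table 10.
PER_TRAJECTORY = {
    "Constant velocity": (24, 1.65, 1.72),
    "Sinusoidal": (24, 1.75, 2.16),
    "Step": (30, 2.80, 2.97),
    "Chirp": (22, 1.11, 1.87),
}


def pooled_mean(rows, column):
    """Count-weighted mean of one column (1 = RL-PID, 2 = Fixed PID)."""
    total_n = sum(row[0] for row in rows.values())
    weighted_sum = sum(row[0] * row[column] for row in rows.values())
    return weighted_sum / total_n


rl_overall = pooled_mean(PER_TRAJECTORY, 1)
pid_overall = pooled_mean(PER_TRAJECTORY, 2)
print(f"RL-PID overall RMSE:    {rl_overall:.2f} m/s")
print(f"Fixed PID overall RMSE: {pid_overall:.2f} m/s")
```

The weighted means come out at about 1.90 m/s and 2.23 m/s, matching the overall velocity RMSE values reported in Table 9. Only the means pool this way; the overall standard deviations cannot be recovered from the per-type standard deviations without the per-trial data.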

Table 11. Summary of core performance improvements across four experiments (improvement $= (\mathrm{PID} - \mathrm{RL}) / \mathrm{PID} \times 100\%$).

Experiment	Core Metric	RL-PID	Fixed PID	Improvement
Step response	Overshoot (%)	$1.69 \pm 1.18$	$2.08 \pm 0.98$	18.5%
	Steady-state error (m/s)	$0.18 \pm 0.24$	$0.47 \pm 0.46$	62.8%
	MAE (m/s)	$1.36 \pm 0.31$	$1.65 \pm 0.44$	17.1%
Emergency braking	Braking time (s)	$1.19 \pm 0.14$	$1.32 \pm 0.09$	9.5%
	Braking distance (m)	$7.77 \pm 1.32$	$8.43 \pm 1.29$	7.8%
Sweep (const.-amp.)	RMSE (m/s)	$3.78 \pm 0.14$	$4.09 \pm 0.06$	7.5%
Sweep (const.-freq.)	RMSE (m/s)	$3.87 \pm 0.12$	$4.43 \pm 0.06$	12.6%
Mixed ( $n = 100$ )	Velocity RMSE (m/s)	$1.90 \pm 1.70$	$2.23 \pm 1.69$	15.0%
	Chirp RMSE (m/s)	$1.11 \pm 0.52$	$1.87 \pm 0.46$	40.9%
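The improvement definition given in the Table 11 caption is easy to reproduce. The sketch below applies it to a few three-decimal means from Tables 6 and 7; the function name `improvement_percent` is an illustrative choice.

```python
def improvement_percent(fixed_pid, rl_pid):
    """Improvement = (PID - RL) / PID * 100, for lower-is-better metrics."""
    return (fixed_pid - rl_pid) / fixed_pid * 100.0


# (metric, Fixed-PID mean, RL-PID mean), copied from Tables 6 and 7.
examples = [
    ("Settling time (s)", 8.042, 3.982),
    ("Overshoot (%)", 2.077, 1.692),
    ("Steady-state error (m/s)", 0.471, 0.175),
    ("Max. pitch angle, braking (deg)", 58.001, 59.418),
]

for metric, pid, rl in examples:
    print(f"{metric}: {improvement_percent(pid, rl):+.1f}%")
```

A negative result, as for the maximum pitch angle, means the RL-PID value is larger than the fixed-gain value. Recomputing from the two-decimal means printed in Table 11 can differ from the listed percentages by a few tenths (for example, the mixed-trajectory RMSE gives about 14.8% rather than 15.0%), presumably because the published figures were computed before rounding.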
