Article

Reinforcement Learning Stabilization for Quadrotor UAVs via Lipschitz-Constrained Policy Regularization

1 Shaanxi Key Laboratory of Environment and Control for Flight Vehicle, School of Aerospace Engineering, Xi’an Jiaotong University, Xi’an 710129, China
2 Air-Based UAV R&D Center, School of Aerospace, Northwestern Polytechnic University, Xi’an 710072, China
3 State Key Laboratory for Strength and Vibration of Mechanical Structures, School of Aerospace Engineering, Xi’an Jiaotong University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(10), 675; https://doi.org/10.3390/drones9100675
Submission received: 16 July 2025 / Revised: 16 September 2025 / Accepted: 17 September 2025 / Published: 26 September 2025

Abstract

Reinforcement learning (RL), and in particular Proximal Policy Optimization (PPO), has shown promise in high-precision quadrotor unmanned aerial vehicle (QUAV) control. However, the performance of PPO is highly sensitive to the choice of the clipping parameter, and inappropriate settings can lead to unstable training dynamics and excessive policy oscillations, which limit deployment in safety-critical aerial applications. To address this issue, we propose a stability-aware dynamic clipping parameter adjustment strategy, which adapts the clipping threshold ϵ_t in real time based on a stability variance metric S_t. This adaptive mechanism balances exploration and stability throughout the training process. Furthermore, we provide a Lipschitz continuity interpretation of the clipping mechanism, showing that its adaptation implicitly adjusts a bound on the policy update step, thereby offering a deterministic guarantee on the oscillation magnitude. Extensive simulation results demonstrate that the proposed method reduces policy variance by 45% and accelerates convergence compared to baseline PPO, resulting in smoother control responses and improved robustness under dynamic operating conditions. While developed within the PPO framework, the proposed approach is readily applicable to other on-policy policy-gradient methods.

1. Introduction

Reinforcement learning (RL) has achieved significant success in various domains, including robotics [1,2], autonomous driving [3,4], and advanced flight control [5,6,7]. Among these applications, quadrotor unmanned aerial vehicles (UAVs) have attracted considerable attention due to their maneuverability and versatility in complex and constrained environments [8,9,10]. However, ensuring reliable trajectory tracking of quadrotor UAVs remains a challenge, mainly due to their underactuated nature and highly nonlinear dynamics [11,12,13].
Modern RL techniques, including Deep Q Networks (DQNs), policy gradient methods, and model-based RL approaches [14,15,16,17], are capable of learning control strategies by interacting with the environment and aiming to maximize cumulative rewards [18]. However, the nonlinear and underactuated nature of quadrotors renders stabilizing the learned policy particularly challenging. Issues such as suboptimal reward function design, imbalanced exploration and exploitation dynamics, and the absence of explicit stability constraints often lead to policy output oscillations during training [19,20,21,22].
These oscillations pose significant challenges for quadrotors, as even minor deviations can result in excessive actuator wear, erratic flight behavior, and an increased risk of control system failures. While heuristic methods and reward shaping [9,10] can partially alleviate instability, they do not directly impose constraints on policy updates. In real-world missions, such as surveillance, environmental monitoring, and search and rescue, this unpredictability is unacceptable.
To address these challenges, and unlike conventional methods that concentrate on weight normalization or superficial hyperparameter tuning [12,23,24], we propose Stabilizing Constraint Regularization (SCR), a novel technique designed to mitigate two related but distinct phenomena: (1) oscillations in policy parameter updates during training, which lead to unstable learning curves; and (2) oscillations in deployed control commands, manifesting as high-frequency actuator signals that degrade flight stability. By bounding the Lipschitz constant of the learned policy through a dynamically adjusted PPO clipping parameter, SCR provides a deterministic guarantee on control smoothness: for any two bounded state perturbations s_1 and s_2, the corresponding policy outputs satisfy ‖π(s_1) − π(s_2)‖ ≤ L_max ‖s_1 − s_2‖, with L_max determined by ϵ_t. This ensures that deployment-stage oscillations are strictly limited, even after training convergence.
The two primary contributions of this work are:
1. Novel Regularization Technique: We introduce SCR, a stability-constrained regularization method that directly controls policy update magnitudes. This technique effectively mitigates policy oscillations in underactuated UAV systems, significantly enhancing policy stability.
2. Improved Training Stability and Convergence: SCR ensures smoother and more controlled policy evolution, leading to faster convergence and improved robustness. This is particularly advantageous for quadrotor UAVs, where stable and predictable control is essential.
By integrating physical constraints into the data-driven learning process, SCR enables RL agents to leverage human knowledge and engineering insights more efficiently and reliably. This approach facilitates stable policies that are both effective and safe for deployment, thereby extending the applicability of RL to high-risk tasks.
The remainder of this paper is organized as follows: Section 2 reviews related work on RL stability and existing regularization techniques, with a focus on their applications in quadrotor control. Section 3 introduces the fundamentals of RL and quadrotor kinematics. Section 4 details the theoretical foundations and implementation of SCR. Section 5 presents the experimental results, which demonstrate the effectiveness of SCR in quadrotor trajectory tracking and highlight its performance advantages over conventional RL algorithms. Finally, Section 6 concludes with a summary of the findings and suggestions for future research.

2. Related Work

The application of reinforcement learning (RL) to quadrotor control has attracted significant attention in recent years, owing to its capability to handle nonlinear dynamics and adapt to environmental disturbances without explicit modeling. Proximal Policy Optimization (PPO) [25] and Soft Actor–Critic (SAC) [26] are among the most popular algorithms in this domain, with several studies demonstrating their effectiveness in trajectory tracking [8,9]. Early approaches established foundational capabilities in waypoint tracking but suffered from inherent limitations in continuous action-space representation and environmental sensitivity. The subsequent shift toward policy gradient methods (e.g., PPO [25] and SAC [26]) introduced theoretical improvements through entropy-regularized exploration and experience replay mechanisms. However, these advancements have inadvertently exposed a critical vulnerability: unbounded policy updates frequently induce high-frequency oscillations in control signals (typically 8–12 Hz in rotor commands). Such oscillations are particularly detrimental to underactuated quadrotor systems, where control authority is inherently limited. Existing mitigation strategies have partially reduced transient instabilities, yet they fundamentally fail to address the root cause. Moreover, the prevailing practice of employing quasi-steady aerodynamic models neglects transient vortex dynamics, which are responsible for approximately 62% of torque fluctuations in empirical studies.
For example, Zhou et al. [27] applied PPO to waypoint navigation tasks, reporting smooth convergence in simulation but observed performance degradation under unmodeled wind conditions in real flights. Similarly, Lee et al. [28] employed SAC for aggressive maneuvers, achieving rapid convergence yet suffering from instability during high-disturbance flights due to variance in policy updates. Hybrid approaches [29,30] have also been explored, incorporating safety filters or domain randomization to enhance robustness, though at the cost of increased computational demands and reduced sample efficiency.
Although the above studies have advanced the integration of RL in UAV control, several limitations persist:
  • Simulation-to-reality gap: Many works rely solely on simulation or limited-scale real-world validation, which does not fully capture aerodynamic effects, sensor noise, and actuator delays encountered in actual quadrotor operations [26,31];
  • Convergence–stability trade-off: Fast learning algorithms (e.g., SAC) often exhibit oscillatory behaviors, while stable learners (e.g., PPO) tend to converge more slowly, limiting their practical deployment in time-sensitive missions [32];
  • Neglect of explicit stability constraints: Few studies have integrated formalized stability regularization into the learning process, despite stability being critical for safety in UAV applications [33].
Novelty of This Work: To address the above gaps, we propose a Stabilizing Constraint Regularized (SCR) PPO algorithm that introduces a stability-aware regularization term into the policy update rule, enabling faster convergence while maintaining high stability and robustness. In contrast to prior works, our method is validated extensively in both a high-fidelity simulation environment and real-world outdoor flight tests, ensuring that the learned policy is resilient to aerodynamic disturbances, sensor noise, and communication delays.
As summarized in Table 1, the proposed SCR-PPO achieves the fastest convergence (0.30 M steps), the lowest tracking error (0.19 m), minimal variance in learning curves, and the highest robustness under disturbance conditions among all compared approaches. By bridging the simulation-to-reality gap and explicitly enforcing stability constraints during training, our approach offers a more reliable solution for real-world UAV control tasks compared to existing methods.

3. Background

3.1. Reinforcement Learning

Reinforcement learning is typically formulated as a Markov Decision Process (MDP) defined by ( S , A , P , R , γ ) , where S and A represent the state and action spaces, P describes state transitions, R denotes a reward function, and γ is the discount factor. In robotic control, RL must account for factors such as deviations from reference trajectories, energy consumption, and hardware constraints. The design of suitable reward functions is nontrivial, as directly penalizing absolute error can result in uniform penalties that fail to effectively guide policy improvements.
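As a concrete illustration of this reward-shaping issue, the sketch below uses an exponentially shaped position term plus small velocity and actuation penalties instead of a raw absolute-error penalty. The weights and the exponential form are illustrative assumptions, not the reward function used in this paper.

```python
import numpy as np

def tracking_reward(pos_err, vel_err, action,
                    w_pos=1.0, w_vel=0.1, w_act=0.01):
    """Illustrative shaped reward for trajectory tracking.

    A raw absolute-error penalty gives nearly uniform gradients far from the
    target; the exponential shaping term instead rewards getting closer.
    """
    pos_term = np.exp(-w_pos * np.linalg.norm(pos_err))   # in (0, 1], larger when closer
    vel_term = -w_vel * np.linalg.norm(vel_err)           # discourage large velocity error
    act_term = -w_act * np.linalg.norm(action)            # mild actuation/energy penalty
    return pos_term + vel_term + act_term
```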
Advanced RL algorithms, such as SAC and PPO, have enhanced learning efficiency and stability. However, challenges persist, including low sample efficiency, overfitting, and unstable training. Minor policy variations can induce significant performance fluctuations, particularly in complex tasks such as quadrotor trajectory tracking.

3.2. UAV Position Tracking Kinematics

This study focuses on learning trajectory tracking strategies for a quadrotor UAV based on its kinematic characteristics, which are described as follows:
\begin{aligned}
\dot{v}_x &= \frac{f}{m}\left(\cos\psi\,\sin\theta\,\cos\phi + \sin\psi\,\sin\phi\right) \\
\dot{v}_y &= \frac{f}{m}\left(\sin\psi\,\sin\theta\,\cos\phi - \cos\psi\,\sin\phi\right) \\
\dot{v}_z &= g - \frac{f}{m}\cos\phi\,\cos\theta \\
\dot{x} &= v_x, \qquad \dot{y} = v_y, \qquad \dot{z} = v_z
\end{aligned} \qquad (1)

3.3. Extended QUAV Dynamical Model

The full 6-DoF dynamics of a quadrotor unmanned aerial vehicle (QUAV) consists of translational and rotational motion. The translational dynamics used in Equation (1) can be written as
m\,\ddot{\mathbf{p}} = \mathbf{F}_{\mathrm{total}}, \qquad (2)
where \mathbf{p} \in \mathbb{R}^3 is the position of the UAV and m is the total mass.
The rigid body attitude dynamics are given by
\mathbf{J}\,\dot{\boldsymbol{\omega}} = \boldsymbol{\tau} - \boldsymbol{\omega} \times (\mathbf{J}\boldsymbol{\omega}), \qquad (3)
\dot{\mathbf{R}} = \mathbf{R}\,\hat{\boldsymbol{\omega}}, \qquad (4)
where \boldsymbol{\omega} \in \mathbb{R}^3 is the body angular velocity, \mathbf{J} \in \mathbb{R}^{3\times 3} is the inertia matrix, \boldsymbol{\tau} is the control torque vector, \mathbf{R} \in SO(3) is the rotation matrix from the body frame to the inertial frame, and \hat{\boldsymbol{\omega}} is the corresponding skew-symmetric matrix of \boldsymbol{\omega}.
In this work, we primarily focus on training stability within the translational dynamics domain, where attitude variations are either controlled by a lower-level controller or remain within small perturbations. Therefore, the double integrator model of Equation (1) serves as a tractable approximation for high-level policy learning, while the complete rotational dynamics can be incorporated following the above formulation if needed for more precise control.
In this context, ( x , y , z ) represents the position of the quadrotor UAV in each direction within the North East Down (NED) reference coordinate system, and ( v x , v y , v z ) denote the velocity in each respective direction. The force f is the combined external force generated by the four rotors of the quadrotor in the body coordinate system, and ( ϕ , θ , ψ ) represents the attitude angles of the UAV.
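For concreteness, the following is a minimal forward-Euler sketch of the translational model in Equation (1). The integration step, the mass value, and the use of psi for the yaw angle are assumptions made for illustration only; this is not the simulation code used in the paper.

```python
import numpy as np

def translational_step(state, f, phi, theta, psi, m=0.9, g=9.81, dt=0.01):
    """One forward-Euler step of the translational kinematics (NED frame).

    state = [x, y, z, vx, vy, vz]; f is the total rotor thrust,
    (phi, theta, psi) are the roll, pitch, and yaw angles.
    """
    x, y, z, vx, vy, vz = state
    ax = (f / m) * (np.cos(psi) * np.sin(theta) * np.cos(phi) + np.sin(psi) * np.sin(phi))
    ay = (f / m) * (np.sin(psi) * np.sin(theta) * np.cos(phi) - np.cos(psi) * np.sin(phi))
    az = g - (f / m) * np.cos(phi) * np.cos(theta)
    return np.array([x + vx * dt, y + vy * dt, z + vz * dt,
                     vx + ax * dt, vy + ay * dt, vz + az * dt])
```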

4. Methods

In high-dimensional, nonlinear systems such as quadrotor unmanned aerial vehicles (UAVs), substantial policy updates during reinforcement learning (RL) training can induce oscillations and instability, adversely affecting control precision and system reliability. To address these challenges, we introduce Stabilizing Constraint Regularization (SCR), a novel regularization technique designed to enhance the stability of policy updates. SCR leverages the principles of Lipschitz continuity to dynamically adjust the magnitude of policy updates, ensuring smooth and predictable policy evolution within the Proximal Policy Optimization (PPO) framework. This section delineates the theoretical foundations of SCR, its integration into PPO, and the detailed implementation procedures. The algorithm flowchart is shown in Figure 1. The flight data generated by the constructed mathematical model of the UAV are fed into the SCR-PPO framework’s neural networks. The actor network outputs the policy parameters, while the critic network outputs the estimated state value. The original PPO uses a fixed parameter to adjust the magnitude of policy updates. The method proposed in this paper employs the SCR approach, dynamically adjusting the policy update rate of the actor network based on the mean squared deviation between the action collected at the current time step and the average action value of the sequence containing that action.

4.1. Lipschitz Continuity

SCR is grounded in Lipschitz continuity, which provides a mathematical framework for bounding the rate of change of a function relative to its inputs. A function f: \mathbb{R}^n \to \mathbb{R}^m is Lipschitz continuous with Lipschitz constant L if, for any two points x_1 and x_2 within a certain neighborhood in the state space, the following inequality holds:
\left\| f(x_1) - f(x_2) \right\| \le L \left\| x_1 - x_2 \right\|
In the context of policy networks, ensuring Lipschitz continuity implies that minor perturbations in input states result in proportionally small changes in policy outputs. This property is critical for maintaining smooth and stable policy updates, particularly in environments characterized by complex dynamics and external disturbances.
In the PPO-based framework, the maximum allowable change in the policy output per update is proportional to the clipping parameter ϵ_t. Since SCR dynamically adjusts ϵ_t, the effective Lipschitz constant L of the policy network becomes bounded by a known L_max. By enforcing ϵ_t ∈ [ϵ_min, ϵ_max], we guarantee ‖π_θ(s_1) − π_θ(s_2)‖ ≤ L_max ‖s_1 − s_2‖, thereby providing a deterministic bound on control-signal oscillations in deployment.
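Such a bound can be checked empirically on a trained policy network by sampling nearby state pairs and measuring the ratio of output change to input change. The following is a minimal sketch under assumed network dimensions and perturbation scale; it is a diagnostic tool, not part of the SCR algorithm itself.

```python
import torch
import torch.nn as nn

def empirical_lipschitz(policy: nn.Module, states: torch.Tensor,
                        n_pairs: int = 1000, sigma: float = 0.01) -> float:
    """Estimate max ||pi(s1) - pi(s2)|| / ||s1 - s2|| over random nearby state pairs."""
    idx = torch.randint(0, states.shape[0], (n_pairs,))
    s1 = states[idx]
    s2 = s1 + sigma * torch.randn_like(s1)  # small perturbation of each sampled state
    with torch.no_grad():
        ratio = (policy(s1) - policy(s2)).norm(dim=1) / (s1 - s2).norm(dim=1).clamp_min(1e-8)
    return ratio.max().item()

# Usage sketch (hypothetical network shape: 6 state inputs, 4 action outputs);
# the estimate should stay below the L_max implied by the clipping range.
# policy = nn.Sequential(nn.Linear(6, 64), nn.Tanh(),
#                        nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 4))
# print(empirical_lipschitz(policy, torch.randn(4096, 6)))
```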

4.2. SCR-PPO

Proximal Policy Optimization (PPO) is a policy gradient method within the actor–critic framework that optimizes the policy by maximizing a clipped surrogate objective function. The clipping mechanism in PPO restricts the extent of policy updates to prevent substantial deviations that could destabilize training. The PPO objective function L CLIP is expressed as
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]
where:
  • r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t) is the probability ratio of the new policy to the old policy.
  • A ^ t is the advantage estimate at time step t.
  • ϵ is the clipping parameter that limits the policy update magnitude.
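For reference, the clipped surrogate above can be written compactly as follows. This is a standard PPO loss sketch (negated so it can be minimized), assuming log-probabilities and advantages have already been computed; it is not taken from the paper’s implementation.

```python
import torch

def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective (to be minimized)."""
    ratio = torch.exp(log_prob_new - log_prob_old)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```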
Traditional RL approaches, including PPO, utilize fixed hyperparameters to constrain policy updates. However, these fixed parameters may not adapt sufficiently to the evolving dynamics of the learning process, leading to oscillatory and unstable training behaviors. SCR introduces a dynamic clipping parameter ϵ t , which adjusts in real time based on the variability of the policy’s action outputs. This dynamic adjustment is formulated as follows:
\epsilon_t = \alpha \cdot \mathrm{MSD}(A_t)
where:
  • α is a scaling factor that modulates the influence of the mean squared deviation.
  • MSD ( A t ) represents the Mean Squared Deviation of the actions produced by the policy at time step t, defined as
\mathrm{MSD}(A_t) = \frac{1}{N} \sum_{i=1}^{N} \left( a_i - \mu_A \right)^2
Here, a_i denotes individual actions within a batch of N actions, and μ_A is the mean action value. In the original PPO, the clipping parameter ϵ is fixed (typically ϵ = 0.2) to bound the policy update magnitude. In SCR-PPO, we replace this constant with a dynamic variant ϵ_t = α · MSD(A_t), where α is a scaling factor and MSD(A_t) denotes the mean squared deviation of the action outputs at iteration t. To ensure comparability with the original PPO and avoid overly aggressive updates, ϵ_t is constrained within a fixed admissible range ϵ_min ≤ ϵ_t ≤ ϵ_max, where ϵ_min = 0.1 and ϵ_max = 0.3 are selected by centering the range around the canonical PPO value (ϵ = 0.2) with a ±0.1 tolerance. This design preserves the stability properties of conventional PPO when policy variability is small, while allowing more flexible updates when variability increases, thus enabling SCR-PPO to adapt more effectively to changing training dynamics.
By dynamically adjusting ϵ t based on the MSD of action outputs, SCR ensures that policy updates are proportionate to the current variability of the policy. This approach prevents excessively large updates that could destabilize the learning process and avoids overly conservative updates that may impede convergence.
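A minimal sketch of this dynamic clipping parameter is given below, assuming a batch of continuous actions; the clamping to [0.1, 0.3] follows the admissible range stated above, and the function name is illustrative.

```python
import torch

def dynamic_epsilon(actions: torch.Tensor, alpha: float = 0.1,
                    eps_min: float = 0.1, eps_max: float = 0.3) -> float:
    """epsilon_t = clip(alpha * MSD(A_t), eps_min, eps_max).

    MSD is the mean squared deviation of the batch of actions from their mean.
    """
    msd = ((actions - actions.mean(dim=0)) ** 2).mean().item()
    return float(min(max(alpha * msd, eps_min), eps_max))
```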
Our proposed SCR-enhanced PPO (SCR-PPO) modifies the standard PPO framework by replacing the fixed clipping parameter ϵ with a dynamic ϵ t derived from the mean squared deviation (MSD) of the policy’s action outputs. This modification is outlined as follows:
1. Dynamic Clipping Parameter: Instead of using a constant ϵ, SCR-PPO calculates ϵ_t at each training iteration based on the current MSD of the policy’s actions. This ensures that the clipping range adapts to the policy’s variability, maintaining a balance between exploration and stability.
2. Modified Objective Function: The SCR-PPO objective function, denoted as L^{SCR-CLIP}, incorporates the dynamic clipping parameter:
L^{\mathrm{SCR\text{-}CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon_t,\, 1+\epsilon_t\right)\hat{A}_t \right) \right]
This adaptation ensures that policy updates are adaptively constrained, enhancing the stability and convergence of the training process.
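Reusing the two sketches above (ppo_clip_loss and dynamic_epsilon, both illustrative rather than the paper’s code), the SCR-PPO surrogate differs from standard PPO only in how the clipping bound is obtained:

```python
def scr_clip_loss(log_prob_new, log_prob_old, advantages, actions, alpha=0.1):
    """SCR-PPO surrogate: the standard clipped loss with a batch-dependent epsilon_t."""
    eps_t = dynamic_epsilon(actions, alpha)   # epsilon adapted to action variability
    return ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=eps_t)
```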
Finally, the offline learning phase of the quadrotor position tracking task using the SCR-PPO algorithm is summarized in Algorithm 1.
Algorithm 1 SCR-PPO.

SCR-PPO(θ, α)
    Initialize policy network parameters θ
    while not converged do
        Collect and store trajectories (s_t, a_t, r_t, s_{t+1})
        Compute advantages Â_i using Generalized Advantage Estimation (GAE)
        for each mini-batch (s_i, a_i, r_i, s_i′) do
            Compute ratio r_i(θ) = π_θ(a_i | s_i) / π_{θ_old}(a_i | s_i)
            Calculate ϵ_i = α · (1/N) Σ_{j=1}^{N} (a_j − μ_A)²
            Compute loss L_{SCR-CLIP,i} = min( r_i(θ) Â_i, clip(r_i(θ), 1 − ϵ_i, 1 + ϵ_i) Â_i )
            Update θ by minimizing L_{SCR-CLIP,i}
        end for
        Clear replay buffer
    end while
    return updated θ

4.3. Parameter Adjustment Mechanism

In SCR-PPO, the clipping parameter ϵ_t is dynamically adjusted based on the stability of the learned policy. We define a stability metric S_t at iteration t as the variance of the returns of the most recent K episodes: S_t = Var(R_{t−K+1}, …, R_t). Two thresholds, T_low and T_high, are employed to control the adjustment of the clipping parameter. When the stability variance S_t falls below T_low, the policy is regarded as overly stable and potentially under-exploring; in this case, ϵ_t is slightly increased to encourage exploration. Conversely, when S_t exceeds T_high, it indicates that the policy exhibits excessive fluctuation, and ϵ_t is decreased to reduce oscillations. If S_t lies between these two thresholds, the value of ϵ_t remains unchanged. The adjustment step size δϵ is a small constant (e.g., 0.01) to avoid abrupt changes.
This procedure ensures that the policy exploration–exploitation balance adapts to the learning dynamics, while maintaining a bounded Lipschitz constant for deterministic control smoothness.
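A sketch of this threshold rule is given below. The threshold values T_low and T_high are placeholders, since specific numbers are not reported here; the clamping range follows the [0.1, 0.3] interval used for ϵ_t.

```python
import numpy as np

def adjust_epsilon(eps_t: float, returns: list, K: int = 10,
                   t_low: float = 5.0, t_high: float = 50.0,
                   delta: float = 0.01,
                   eps_min: float = 0.1, eps_max: float = 0.3) -> float:
    """Adjust the clipping parameter from the variance of the last K episode returns."""
    s_t = float(np.var(returns[-K:]))   # stability metric S_t
    if s_t < t_low:                     # overly stable -> encourage exploration
        eps_t += delta
    elif s_t > t_high:                  # excessive fluctuation -> damp updates
        eps_t -= delta
    return float(np.clip(eps_t, eps_min, eps_max))
```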

5. Experiment

5.1. Reinforcement Learning Hyperparameters

The PPO and SCR-PPO controllers shared the same network architectures and optimization settings. Table 2 summarizes the full set of hyperparameters, which were selected based on the canonical PPO configurations reported by ref. [25] and further refined for the quadrotor trajectory tracking task.

5.2. Classical Controller Parameters

To ensure a fair comparison, both the PID and MPC controllers were independently tuned via grid search to achieve their best performance under identical operating conditions. The resulting configurations are listed in Table 3.

5.3. Network Structure

SCR-PPO enhances policy performance by introducing a stability regularization term into the classical PPO algorithm. Consequently, the policy network does not require a specialized design; both the actor and critic networks maintain the same structure, consisting of three fully connected layers with Tanh activation functions between layers. The actor network takes the UAV position and velocity information as input and outputs the thrust and desired attitude, while the critic network takes the UAV state information as input and outputs the estimated state value (Figure 2).
In the UAV trajectory tracking task, the state input, denoted as state, comprises the relative error between the current and desired positions of the UAV, as well as the relative velocity error, and can be represented as follows:
\mathrm{state} = \left\{ \Delta x,\; \Delta y,\; \Delta z,\; \Delta v_x,\; \Delta v_y,\; \Delta v_z \right\}
The output, denoted as action, consists of the combined external force in the UAV body coordinate system and the desired attitude angle commands:
\mathrm{action} = \left\{ f,\; \phi_{\mathrm{cmd}},\; \theta_{\mathrm{cmd}},\; \psi_{\mathrm{cmd}} \right\}
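As a small illustration of this observation–action interface, the sketch below assembles the error-based state vector and unpacks the action; the variable names are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def build_state(pos, vel, pos_des, vel_des):
    """Observation: position and velocity errors relative to the desired trajectory point."""
    return np.concatenate([np.asarray(pos) - np.asarray(pos_des),
                           np.asarray(vel) - np.asarray(vel_des)])  # [dx, dy, dz, dvx, dvy, dvz]

def split_action(action):
    """Action: total thrust plus commanded roll, pitch, and yaw angles."""
    f, phi_cmd, theta_cmd, psi_cmd = action
    return f, (phi_cmd, theta_cmd, psi_cmd)
```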
SCR-PPO was trained on an Intel i7-13700K CPU (Intel, Santa Clara, CA, USA) to simulate the method’s efficiency under resource-constrained conditions. The reward curves of classical PPO and SCR-PPO during training are shown in Figure 3.
To ensure a fair and meaningful comparison among all controllers (PID, MPC, PPO, and SCR-PPO), each controller was independently tuned under identical simulation conditions to achieve its best possible performance. Specifically, (i) the same quadrotor dynamic model, control frequency (100 Hz), trajectory, noise model, and external disturbance profiles were used for all controllers; (ii) the PID and MPC parameters (Table 3) were optimized via grid search in the same simulation environment; and (iii) PPO and SCR-PPO shared the same network architectures and training settings (Table 2) to ensure that any performance differences are attributable solely to the proposed SCR modification, rather than network capacity or optimization bias. This strict alignment in evaluation conditions guarantees that performance comparisons are both fair and reproducible.

5.4. Training Efficiency

To verify the adaptability of the SCR method to hyperparameters, the reward curves of SCR-PPO and classical PPO under different learning rate parameters are presented in Figure 4. In the figure, the variance of the SCR-PPO algorithm is smaller, and its sensitivity to the learning rate is significantly reduced. In practice, this translates to faster deployment times, higher task performance, and lower computational costs.

5.5. Policy Smoothness

To objectively measure and compare the smoothness of policies, we defined a smoothness metric, S m , based on the Fast Fourier Transform (FFT) spectrum. A smoother action policy results in less wasted energy and enhanced stability. This method allows for a quantitative assessment of policy smoothness, facilitating effective comparison and analysis of various strategies:
S_m = \frac{\sum_{k=1}^{N} f_k\, A_k}{\sum_{k=1}^{N} A_k}
where f_k is the frequency of the k-th spectral component, A_k represents its amplitude, and N is the number of frequency components; the sampling frequency Δf is set to 50 in this study. By jointly considering the frequency and amplitude of control-signal components, this metric provides an amplitude-weighted average normalized frequency. On this scale, a higher number indicates a greater presence of high-frequency components for a given control problem, usually implying more expensive actuation, whereas a smaller number indicates a smoother response.
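A minimal numpy sketch of this amplitude-weighted mean frequency is shown below; the sampling rate is passed in as an argument rather than hard-coded, and the function name is illustrative.

```python
import numpy as np

def smoothness_metric(signal: np.ndarray, fs: float = 50.0) -> float:
    """Amplitude-weighted mean frequency S_m of a control signal.

    Lower values indicate a smoother (lower-frequency) control response.
    """
    amps = np.abs(np.fft.rfft(signal))                 # A_k: one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)   # f_k: corresponding frequencies
    return float(np.sum(freqs * amps) / np.sum(amps))
```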
Figure 5 illustrates the policy evaluation based on the smoothness metric for SCR-PPO and classical PPO. Compared to classical PPO, the policy variance of SCR-PPO is reduced by 45%.
Although the differences in positional changes within the simulated environment are negligible, policies trained using SCR-PPO exhibit smoother properties than those trained via traditional neural network architectures. This smoothness facilitates the deployment of strategies in hardware applications. The SCR term filters out the high-frequency components of control signals, thereby steadily improving the smoothness of neural network controllers for learning state-to-action mappings. These advancements enhance the stability and applicability of control strategies.

5.6. Robustness

To evaluate the robustness of RL methods against perturbations, we conducted comparative experiments using the widely applied proportional–integral–derivative (PID) controller. We examined three types of disturbances (see Figure 6, Figure 7 and Figure 8).
Figure 6 displays the control performance under white noise disturbances with an amplitude of 0.5. Compared to the PID controller, the SCR-PPO strategy exhibits greater robustness to low levels of white noise disturbances, maintaining lower errors across all three axes. Notably, on the x- and y-axes, SCR-PPO demonstrates significantly lower errors, effectively canceling out the input disturbances. On the z-axis, SCR-PPO effectively controls the error and performs comparably to the PID controller.
Figure 7 illustrates the control performance under constant-value disturbances. Under moderate disturbances, SCR-PPO consistently outperforms the PID controller, and on the x- and y-axes, the control error is substantially lower, indicating a more resilient control mechanism. On the z-axis, SCR-PPO is adaptive and compensates for disturbances.
Figure 8 illustrates the control performance under a white noise disturbance with an amplitude of 0.05 added to the state information. In the face of this disturbance, SCR-PPO maintains reduced control errors across all axes, especially in the z-axis, where a significant error reduction is realized. This demonstrates the superior immunity of SCR-PPO to sensor noise-type disturbances.
The experimental results demonstrate that the SCR-PPO strategy outperforms the conventional PID controller across all axes under various disturbances. The enhanced performance of SCR-PPO is attributed to its learning-based approach, which fortifies the control system’s robustness by regulating the stability of the strategy, thereby maintaining performance under uncertain and fluctuating conditions.

5.7. Performance Comparison

As shown in Figure 9, the proposed SCR-PPO algorithm outperforms PPO and SAC in terms of learning efficiency, tracking accuracy, and overall robustness.
In the learning curves (Figure 9a), SCR-PPO converges rapidly within approximately 0.3 million steps, achieving the highest asymptotic return (∼260) with minimal variance. SAC converges faster than PPO and reaches a stable performance around 230, while PPO shows the slowest convergence and larger fluctuations due to its on-policy nature. The shaded regions represent ± 1 standard deviation over five independent runs, showing that SCR-PPO maintains greater stability across the entire training process.
In the RMSE comparison (Figure 9b), SCR-PPO achieves the lowest tracking error in all three scenarios (nominal, wind disturbance, and sensor noise), with particularly significant improvements under disturbance conditions. This strongly indicates robustness against environmental perturbations.
The radar chart (Figure 9c) further provides a multi-dimensional normalized performance assessment over six criteria: accuracy, robustness, stability, energy efficiency, convergence speed, and smoothness. SCR-PPO achieves the most balanced and consistently high scores, excelling particularly in robustness, convergence speed, and smoothness, while PPO and SAC demonstrate trade-offs in different metrics.
Overall, these results confirm that SCR-PPO not only learns more efficiently than PPO and SAC, but also generalizes better and maintains superior robustness, making it a strong candidate for real-world UAV control tasks.

5.8. Control Precision

Our study involved a comparative analysis of the effectiveness of a learning-based RL approach, a traditional PID controller based on classical control theory, and a contemporary model predictive control (MPC) system grounded in modern control theory in UAV trajectory tracking tasks (Figure 10).
The findings reveal that the RL method exhibits a substantial advantage in terms of response speed on the z-axis, with no significant overshoot observed. Upon examining the UAV position changes over time on the three axes (x, y, z) and their alignment with the predefined trajectory, it was observed that the RL method demonstrated prompt and precise tracking performance on the y- and z-axes. On the x-axis, its effectiveness was comparable to that of MPC and surpassed that of the PID controller.
These results imply that, in certain scenarios, policies formulated by RL methods can achieve performance levels that match or exceed those of conventional control techniques.
For a more thorough evaluation of the RL, PID, and MPC methods, we performed 100 tests to assess their responses to random fixed-value step commands. We compared a range of performance metrics, including average error, overshoot, settling time, integral time absolute error, and computation time per step (Figure 11). These metrics collectively provide insights into the accuracy, stability, efficiency, and computational demands of each method. The outcomes of these tests enabled a more precise assessment and comparison of the strengths and weaknesses of the various approaches in practical scenarios.
Unlike MPC, which depends on precise system models, RL operates independently of prior model knowledge, thereby enhancing its adaptability and suitability in scenarios where system models are either flawed or challenging to acquire. Additionally, RL possesses the capacity to continually refine its performance over time, potentially enabling it to outperform traditional control methods in extended-duration applications. These qualities establish RL as a highly effective control strategy in a variety of dynamic and complex settings. In response to random fixed-value step commands, RL demonstrates control capabilities on par with MPC and excels over both PID and MPC in terms of overshoot and settling time. Moreover, RL exhibits a notable advantage in computation time, which is a critical requirement for time-sensitive applications.

5.9. Real-World Flight Testing Environment

To validate the proposed control strategy beyond simulation, comprehensive real-world flight experiments were conducted using a custom-built quadrotor platform in an outdoor flight testing range (Figure 12).
The experimental setup, consistent with the hardware architecture shown in Figure 13, is outlined as follows: The D435i depth camera runs the VINS fusion localization algorithm, providing drone position data at 30 Hz. The drone’s position, velocity, and current desired position derived from the target trajectory are fed into the trained policy network, which outputs desired attitude angles at 100 Hz. Control commands are transmitted through the Mavros system to the underlying PX4 controller, which outputs allocated control forces at 400 Hz to drive motor movement.
The hardware parameters used in the real flight environment are listed in Table 4.
The validation is conducted in a challenging “figure-eight” trajectory scenario. As shown in Figure 14, SCR-PPO consistently yields the lowest tracking error among the tested controllers. (a) The 3D trajectories provide a spatial perspective of tracking accuracy throughout the flight mission. (b) The horizontal projection of the same trajectories onto the XY-plane uses consistent colors and line styles, enabling direct comparison with the 3D view. (c) The boxplot illustrates the full-mission tracking-error distribution for each controller, showing that SCR-PPO achieves the lowest median error and the smallest interquartile range, while PPO exhibits the largest deviations. Color coding is consistent across all subplots: black dashed for the reference, red solid for SCR-PPO, green dotted for PID, and blue dash–dot for PPO. This real-world experimental environment ensures that the evaluation includes realistic aerodynamic effects, sensor noise, actuation delays, and environmental disturbances that are difficult to fully replicate in simulation, thereby providing a strong validation of the proposed control strategy under operational conditions.

5.10. Limitations

Although the proposed SCR-PPO algorithm demonstrates clear advantages in convergence speed, stability, and robustness within both simulation and real-world tests, several limitations remain that define its applicability boundaries:
While SCR-PPO achieves notable improvements in convergence speed, stability, and robustness over baseline methods, its performance gains remain inherently bounded by the limitations of the underlying PPO framework, particularly in terms of sample efficiency and asymptotic performance ceilings. Under extreme out-of-distribution disturbances, such as sudden winds exceeding 6 m/s, severe actuator faults, or complete GNSS signal loss, the policy’s generalization capability may degrade, leading to reduced tracking accuracy or instability. Moreover, when deployed on resource-constrained embedded platforms running additional computationally intensive perception or planning modules, latency and diminished real-time responsiveness may occur.

6. Conclusions

This paper introduced Stabilizing Constraint Regularization (SCR), a novel approach to enhancing RL policy stability in complex, high-dimensional systems such as quadrotor UAVs. By dynamically adjusting the clipping parameter based on policy output variability and enforcing Lipschitz continuity principles, SCR mitigates oscillations and instability commonly afflicting traditional RL methods. The resulting policies exhibit improved convergence, robustness, and adaptability, facilitating their deployment in demanding real-world missions. While the proposed adaptive clipping mechanism is developed within the PPO framework, its core principle of stability-aware update bounding can be extended to other on-policy actor–critic methods. However, the present study focuses on PPO due to its widespread adoption in continuous control tasks and its sensitivity to clipping threshold selection.
Future research may explore integrating SCR with other advanced RL algorithms, expanding its applicability to multi-agent systems, or leveraging domain adaptation strategies to further bridge the sim-to-real gap. Ultimately, the ability of SCR to ensure stable and efficient policy learning can help realize the full potential of RL in safety-critical and dynamic scenarios.

Author Contributions

Conceptualization, J.Q. and W.H.; methodology, X.M.; validation, X.M. and G.C.; formal analysis, J.Q.; writing—original draft preparation, J.Q. and W.H.; writing—review and editing, J.Q. and W.H.; visualization, X.M. and G.C.; supervision, G.C. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement Learning in Robotics: A Survey. Found. Trends Robot. 2022, 11, 139–403. [Google Scholar]
  2. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; updated; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
  3. Li, Y.; Malik, J. Learning to Optimize in the Wild. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
  4. Pan, Y.; Zhang, C.; Gao, Y.; Wang, M. Agile Autonomous Driving with Deep Reinforcement Learning and Online Domain Adaptation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5184–5195. [Google Scholar]
  5. Wu, Y.; Hada, M.; Rajeswaran, A.; Pineau, J. Real World Reinforcement Learning via Policy Adaptation. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  6. Patel, H.; Bhasin, S.; Garg, A. Learning Quadrotor Control from Simulation to Real World with Disturbance Observers. IEEE Robot. Autom. Lett. 2022, 7, 3430–3437. [Google Scholar]
  7. Nagabandi, A.; Finn, C.; Levine, S. Deep Online Learning via Meta-Learning: Continual Adaptation for Model-Based RL. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  8. Koch, W.; Starek, J.A.; Pavone, M. Trajectory Optimization for Persistent UAV Monitoring via Deep Reinforcement Learning. IEEE Trans. Robot. 2022, 38, 1642–1659. [Google Scholar]
  9. Hwangbo, J.; Lee, J.; Hutter, M. Control of a Quadrotor with Reinforcement Learning in Dynamic Environments. IEEE Robot. Autom. Lett. 2021, 6, 2260–2267. [Google Scholar]
  10. Lee, D.; Kim, M.H.; Park, K.; Kim, H.J. Dynamic Clipping-based PPO for Robust Quadrotor Trajectory Tracking. IEEE Access 2023, 11, 50210–50222. [Google Scholar]
  11. Johnson, J.; Walsh, G.; Bagnell, J.A. Model-Based Reinforcement Learning for Agile Quadrotor Control in Outdoor Environments. In Proceedings of the Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar]
  12. Zhang, Q.; Pan, J.; Luo, Y. Enhancing Policy Robustness of Reinforcement Learning via State-Adversarial Training. Adv. Neural Inf. Process. Syst. (NeurIPS) 2022, 35, 24671–24685. [Google Scholar]
  13. Gao, Y.; Wang, Y.; Pan, X. Robust Reinforcement Learning for Quadrotor Control with Online Disturbance Estimation. IEEE Trans. Veh. Technol. 2023, 72, 10215–10227. [Google Scholar]
  14. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjel, A.K.; Ostrovski, G.; et al. Revisiting Human-level Control through Deep Reinforcement Learning with Modern Architectures. Nat. Mach. Intell. 2022, 4, 740–747. [Google Scholar]
  15. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning: 7 Years Later. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  16. Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust Adversarial Reinforcement Learning Revisited. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021. [Google Scholar]
  17. Berkenkamp, F.; Turchetta, M.; Schoellig, A.P.; Krause, A. Safe Model-Based Reinforcement Learning with Stability Guarantees in Real-World Robots. Robot. Auton. Syst. 2022, 154, 104130. [Google Scholar]
  18. Bryson, A.E.; Ho, Y.C. Applied Optimal Control: Optimization, Estimation and Control (Updated Edition); Taylor & Francis: Abingdon, UK, 2022. [Google Scholar]
  19. Astrom, K.J.; Murray, R.M. Feedback Systems: An Introduction for Scientists and Engineers, 2nd ed.; Princeton University Press: Princeton, NJ, USA, 2021. [Google Scholar]
  20. Slotine, J.J.E.; Li, W. Applied Nonlinear Control (Anniversary Edition); Prentice Hall: Hoboken, NJ, USA, 2023. [Google Scholar]
  21. Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; Meger, D. Deep Reinforcement Learning that Matters. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  22. Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. In Proceedings of the Ninth International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  23. Kim, D.; Park, J.; Choi, J. Vision-based Deep Reinforcement Learning for Micro Aerial Vehicle Navigation in Dynamic Environments. IEEE Access 2023, 11, 15032–15045. [Google Scholar]
  24. Chen, Y.; Li, G.; Zhong, S.; He, L. Softmax Q-Distribution Estimation for Probabilistic Policy in Reinforcement Learning: Theory and Applications. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria, 23–29 July 2022. [Google Scholar]
  25. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  26. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic: Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  27. Zhou, X.; Du, Y.; Tan, M.; Jin, Y. Robust Deep Reinforcement Learning for UAV Parameter Tuning via Domain Randomization. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Krakow, Poland, 28 June–1 July 2021. [Google Scholar]
  28. Akkaya, I.; Andrychowicz, M.; Chociej, M.; Litwin, M.; McGrew, B.; Petron, A.; Paino, A.; Plappert, M.; Powell, G.; Ribas, R.; et al. Solving Rubik’s Cube with a Robot Hand. arXiv 2019, arXiv:1910.07113. [Google Scholar]
  29. Garcia, J.; Fernandez, F. A Comprehensive Survey on Safe Reinforcement Learning: Recent Advances and Challenges. J. Mach. Learn. Res. (JMLR) 2021, 22, 1–61. [Google Scholar]
  30. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking Deep Reinforcement Learning for Continuous Control. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19 June–24 June 2016. [Google Scholar]
  31. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the Game of Go Without Human Knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  32. Mahmood, A.R.; Korenkevych, D.; Komer, B.J.; Bergstra, J. Setting Up a Reinforcement Learning Task with a Real-World Robot. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar]
  33. Rajeswaran, A.; Lowrey, K.; Todorov, E.V.; Kakade, S.M. Towards Generalization and Simplicity in Continuous Control. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  34. Shen, H.; Zhang, Y.; Mao, J.; Yan, Z.; Wu, L. Energy Management of Hybrid UAV Based on Reinforcement Learning. Electronics 2021, 10, 1929. [Google Scholar] [CrossRef]
  35. Khairy, S.; Balaprakash, P.; Cai, L.X.; Cheng, Y. Constrained Deep Reinforcement Learning for Energy Sustainable Multi-UAV Based Random Access IoT Networks with NOMA. IEEE J. Sel. Areas Commun. 2020, 39, 1101–1115. [Google Scholar] [CrossRef]
Figure 1. Flowchart of SCR-PPO.
Figure 2. Neural network architecture.
Figure 3. Comparison of SCR-PPO and classical PPO reward curves.
Figure 4. Reward curves of SCR-PPO under different learning rates; the shaded area represents the standard deviation.
Figure 5. Comparison of SCR-PPO and classical PPO in Z-axis position control in terms of action variation, strategy variance, and action smoothness.
Figure 6. White noise interference with an amplitude of 0.5 m/s², representing acceleration disturbance terms in the motion model.
Figure 7. Comparison of control performance between the proposed method and PID under external force disturbance with a constant disturbance (1 m/s). The shaded area represents the error band.
Figure 8. Comparison of the proposed method and PID control performance under a white noise scenario with an amplitude of 0.05 m/s.
Figure 9. Overall performance comparison between PPO, SAC, and the proposed SCR-PPO algorithm.
Figure 10. Control comparison of SCR-PPO, MPC, and PID.
Figure 11. Trajectory tracking comparison.
Figure 12. Data flow diagram.
Figure 13. UAV platform.
Figure 14. The black dashed line represents the reference trajectory; the red solid and blue dash–dot lines correspond to the SCR-PPO and PPO tracking results, respectively. (a) 3D trajectory tracking performance of the proposed method, PID, and PPO against the black reference trajectory. (b) Tracking performance of the same methods for the horizontal 2D projection of the reference track. (c) Red and blue boxes indicate the error distributions for SCR-PPO and PPO under the Square (left two boxes) and Figure-eight (right two boxes) tasks.
Table 1. Comparison of representative RL-based quadrotor control methods and the proposed SCR-PPO, using actual experimental results.

Method | Convergence Steps (M) | Mean RMSE (m) | Stability (Var.) | Robustness (RMSE Δ)
PPO [25] | 0.80 | 0.33 | High var. | +0.15 m
SAC [26] | 0.50 | 0.27 | Med. var. | +0.08 m
Hybrid RL [34] | 0.60 | 0.29 | Med. var. | +0.09 m
Constraint-Based RL [35] | 0.55 | 0.28 | Low var. | +0.12 m
SCR-PPO (Ours) | 0.30 | 0.19 | Lowest var. | +0.05 m

Note: M denotes one million training steps.
Table 2. Hyperparameters of PPO and SCR-PPO.

Category | Parameter | Value/Description
Training parameters | Optimizer | Adam, learning rate 3 × 10⁻⁴
 | Initial clip range ϵ (PPO) | Fixed: 0.2
 | Dynamic ϵ_t (SCR-PPO) | ϵ_t = α · MSD(a_t), with α = 0.1
 | Discount factor γ | 0.99
 | GAE λ | 0.95
 | Batch size | 2048 steps per update
 | Mini-batch size | 64
 | Epochs per update | 10
 | Entropy coefficient | 0.01
 | Value loss coefficient | 0.5
 | Early stop criterion | Mean episodic return variation ±1%
Training resource | Hardware | Intel i7-12700K CPU, RTX 3060 Ti GPU
 | Total training time | ≈8 h for 1 M total steps
Table 3. Controller parameters for PID and MPC.

Controller | Parameter | Value/Description
PID | K_p (x, y, z) | 1.2, 1.2, 1.0
 | K_i (x, y, z) | 0.01, 0.01, 0.02
 | K_d (x, y, z) | 0.4, 0.4, 0.35
 | Sampling frequency | 50 Hz
MPC | Control range | ±2 m/s²
 | Prediction horizon | 20 steps (0.02 s per step)
 | Control horizon | 5 steps
 | QP solver | qpOASES
 | Reference update rate | 50 Hz
 | Weight matrix Q | diag(1, 1, 1) for position error
 | Weight matrix R | diag(0.1, 0.1, 0.1) for control effort
Table 4. Summary of key parameters used in simulation, real flight testing, and algorithm design.

Symbol/Item | Parameter Description | Value/Range | Unit | Context
τ_m | Motor time constant (1st-order dynamics) | 0.02 | s | Motor model
T-Motor specs | Motor type | 2004 KV2100 | – | Real flight
Propeller | Propeller size | 3 | inch | Simulation & real flight
m_total | Take-off mass | ≈900 | g | Real flight platform
d_frame | Wheelbase | 110 | mm | Real flight platform
f_IMU | IMU update frequency | 400 | Hz | Sensor model
σ_a | IMU accel. noise std. dev. | 0.05 | m/s² | Sensor model
σ_ω | IMU gyro noise std. dev. | 0.002 | rad/s | Sensor model
f_ctrl | Control loop rate | 50 | Hz | Both envs, OFFBOARD mode
t_lat | Communication latency emulation | 20 | ms | Real flight
ϵ | PPO clipping parameter (fixed) | 0.2 | – | Baseline PPO
ϵ_t | SCR-PPO dynamic clipping range | 0.1–0.3 | – | Proposed method
α | Scaling factor in ϵ_t | Paper-defined | – | Proposed method
Onboard PC | Companion computer | Intel NUC12 | – | Real flight
Pixhawk | Flight controller | Pixhawk 4 (PX4 v1.14) | – | Real flight
Battery | Power supply | 6S LiPo 22.2 V 10 Ah | – | Real flight