1. Introduction
Quadrotor attitude and altitude control has recently become an active area of research due to its critical role in fields such as aerial photography, geographic mapping, shipping and delivery, disaster management, search and rescue, law enforcement, wildlife monitoring, precision agriculture, and weather forecasting. In most of these applications, quadrotors are required to hover, track paths, and remain stable in outdoor environments [
1,
2,
3,
4,
5]. However, outdoor environments are affected by wind disturbances, turbulence, and sudden gusts, which can seriously affect flight stability and mission success. Several control strategies have been reported in the literature for quadrotor position and attitude tracking under disturbances. In [
6], a sliding mode control (SMC)-based tracking controller for quadrotor UAVs considering nominal dynamics with bounded uncertainties has been proposed; however, the method suffers from oscillations and chattering, and lacks an explicit adaptive disturbance estimation mechanism. In [
7], a fault-tolerant predictive control scheme is developed using a discrete-time sliding mode observer to handle generic bounded disturbances, but the approach remains sensitive to observer noise, exhibits chattering, and does not incorporate adaptive learning capabilities. Similarly, disturbance observer-based attitude control methods for aggressive maneuvers are highly model-dependent and predominantly compensate only matched disturbances, which limits their robustness against unmodelled dynamics and external wind disturbances [
8]. These limitations motivate the need for an adaptive, learning-assisted disturbance rejection framework capable of handling nonlinear uncertainties with reduced chattering and improved robustness.
In recent years, active disturbance rejection control (ADRC) and intelligent optimization techniques have been widely explored for UAV attitude control under disturbances. A dual closed-loop ADRC scheme for quadrotor attitude stabilization under wind gusts [
9] employs a Proportional-Derivative (PD) outer loop, which limits disturbance rejection performance and lacks validation with realistic wind field models. In [
10], an improved extended state observer (ESO) was developed with nonlinear feedback to address actuator faults and wind gusts; however, the method lacks online parameter adaptation. Hybrid intelligent algorithms, such as fish swarm optimization and Particle Swarm Optimization (PSO) with elite Gaussian learning, have been proposed for ADRC parameter optimization in fixed-wing UAVs; nevertheless, the tuning remains environment-independent, and disturbances are treated as unknown without adaptive updating [
11]. A pigeon-inspired optimization approach has also been proposed for ADRC tuning in Vertical Take-off and Landing (VTOL) UAVs, achieving faster convergence but without providing rigorous stability analysis [
12]. In [
13], an improved Beetle Antennae Search–Sine Cosine Algorithm (BAS–SCA)-based ADRC tuning method with variable step-size search was proposed, and a second-order ADRC was combined with Ant Colony Optimization (ACO) and BAS optimization; both methods rely on offline parameter tuning and focus only on attitude control, without addressing highly dynamic wind conditions. An adaptive composite disturbance rejection control (ACDRC) scheme using iterative learning has also been proposed for agricultural micro-UAVs [
14].
In recent years [
15,
16,
17,
18], ADRC improvements have focused on enhanced observer structures and composite disturbance rejection. An improved ESO was developed to address measurement noise, along with an adaptive composite disturbance rejection scheme for quadrotor attitude control. Later, a finite-time composite control strategy for wind disturbance rejection in UAVs was introduced, and a switching semi-decoupled ADRC framework was proposed for ground vehicles. Although these approaches enhance robustness and convergence, they still rely on fixed or offline-tuned parameters and do not provide online learning capability, which limits their adaptability under rapidly varying and highly nonlinear disturbance conditions in realistic quadrotor operations. The adaptive fast-finite-time observer in [
19] provides rapid convergence under uncertainties; however, it requires complex adaptive gain design and may increase implementation complexity for high-dimensional nonlinear UAV systems. The sliding mode observer presented in [
20] exhibits strong robustness against uncertainties and disturbances, but the discontinuous switching structure may introduce chattering effects and increased sensitivity to measurement noise. The nonlinear ESO-based sliding mode controller in [
21] demonstrates improved disturbance estimation capability; nevertheless, the control structure involves additional optimization and sliding-mode components that increase computational burden. The Extended Kalman Filter (EKF)-based observer in [
22] achieves effective state estimation for robotic manipulators, but its performance strongly depends on accurate system modeling and covariance tuning, which may limit robustness under significant model uncertainties and external disturbances.
In contrast, the proposed ESO offers a comparatively simpler, model-independent disturbance estimation framework with lower computational complexity and easier parameterization for real-time quadrotor implementation. Moreover, integrating the ESO with a Deep Deterministic Policy Gradient (DDPG)-based mechanism that adaptively tunes the nonlinear state error feedback (NLSEF) law enhances disturbance rejection and tracking performance under nonlinear and uncertain operating conditions without requiring precise system modeling or complex switching logic.
More recent studies have integrated fuzzy logic, neural networks, and deep reinforcement learning with ADRC to enhance disturbance rejection capability [
23,
24,
25,
26]. A fuzzy ADRC scheme with parameters optimized using an improved whale optimization algorithm achieved reduced steady-state error and improved anti-interference performance; however, the parameter adaptation is performed offline, limiting real-time adaptability. An intelligent attitude controller combining ADRC with fuzzy logic and an adaptive radial basis function neural network was proposed to tune the ESO and NLSEF parameters online, but the approach lacks extensive real-world validation. A reinforcement learning-based parameter optimization strategy for active disturbance rejection control of an autonomous underwater vehicle was also proposed; however, it formulates discrete action spaces, making the approach unsuitable for continuous-control quadrotor systems.
A PID–deep reinforcement learning (DRL)-based wind disturbance compensation strategy was also proposed; although the controller improves adaptability compared to classical PID, it exhibits limited responsiveness to fast, highly time-varying real-world wind fields.
Overall, existing ADRC-based and optimization-assisted controllers rely on offline tuning, lack rigorous stability guarantees, or are validated only under mild or unrealistic disturbance conditions, and many learning-based methods adopt discrete action spaces that are unsuitable for UAV dynamics. These gaps highlight the need for an online adaptive, learning-based disturbance rejection framework that operates in a continuous-control space and can handle strong, time-varying disturbances in full-scale quadrotor systems.
The main contributions of this article are summarized as follows:
A Deep Deterministic Policy Gradient (DDPG)-based adaptive tuning strategy is proposed for the nonlinear active disturbance rejection control (NLADRC) framework, with specific emphasis on the online optimization of the NLSEF gain parameter β1. In contrast to existing reinforcement learning (RL)-based ADRC approaches that tune multiple parameters or observer gains, the proposed method emphasizes selective tuning of the most sensitive nonlinear feedback gain, thereby reducing learning complexity while improving control smoothness and preventing actuator saturation under composite time-varying disturbances.
A composite control architecture is developed by integrating the DDPG-optimized NLADRC with a complete six-degree-of-freedom (6-DOF) quadrotor dynamic model. The extended state observer (ESO) estimates the combined effects of system uncertainties and disturbances in real time, while the DDPG agent adaptively tunes the roll-channel NLSEF gain β1. Unlike many existing studies based on simplified or decoupled models, the proposed framework considers a composite disturbance, enabling a more realistic robustness evaluation.
Numerical simulations demonstrate that the proposed roll-channel DDPG-NLADRC scheme improves trajectory tracking accuracy, disturbance rejection capability, and actuator stress mitigation relative to the conventional ADRC.
The structure of this article is organized as follows:
Section 2 presents the mathematical modeling of the quadrotor system;
Section 3 describes the design of the ADRC strategy;
Section 4 presents the design of the proposed DDPG-based ADRC scheme;
Section 5 provides the numerical simulation results and comparative analysis; finally,
Section 6 concludes the paper and discusses future research directions.
2. Nonlinear Quadrotor System Modeling
A quadrotor has three linear position variables and three angular position variables. Accurate derivation of the quadrotor dynamics is challenging due to its under-actuated nature, strong coupling among states, and high sensitivity to external disturbances. In this work, the system dynamics are formulated using the Newton–Euler framework. To describe the motion, two coordinate systems are defined: an earth-fixed inertial reference frame and a body-fixed reference frame. The quadrotor motion is governed by the combined lift forces generated by the four rotors together with gravitational effects. Both translational and rotational motions are regulated through appropriate adjustment of the individual rotor speeds [
27].
Quadrotor flight states are defined by three translational coordinates (x, y, z) and three rotational angles (ϕ, θ, ψ), as depicted in Figure 1. The roll angle, ϕ, determines the rotation around the x-axis; the pitch angle, θ, determines the rotation around the y-axis; and the yaw angle, ψ, determines the rotation around the z-axis. The inertial frame (Earth frame) is fixed to the ground and used to represent the global position and orientation of the vehicle, whereas the body-fixed frame is attached to the quadrotor’s center of mass and used to express the forces, torques, and rotational dynamics.
Using the coordinate transformation principle, a rotation matrix maps vectors from the body-fixed frame to the earth-fixed inertial frame; its entries are written with the shorthand $C_x = \cos(x)$ and $S_x = \sin(x)$. The combined rotor forces create a thrust T in the direction of the z-axis. A second transformation matrix maps the angular velocities from the inertial frame to the body frame. The angular speed of the ith rotor is $\omega_i$, and it generates the force $f_i$ in the rotor-axis direction. The rotor’s angular velocity and acceleration also produce a torque around the rotor axis; in these relations, b is the drag constant and k is the thrust factor. The combined forces generated by the rotors produce a total thrust, T, along the body z-axis. The torque vector consists of the roll, pitch, and yaw components acting about the corresponding body-frame axes, and it depends on the distance between the rotor and the center of mass of the quadrotor. The quadcopter is assumed to be a rigid body, and the Newton–Euler equations are used to describe the system dynamics. The combined effect of the inertial and centrifugal forces is equal to the sum of the gravitational force and the total rotor thrust. In the inertial frame, the quadrotor motion is governed only by the gravitational force and the rotor-generated thrust. In the body frame, the rotational dynamics additionally involve the gyroscopic force and the external torque.
Based on the complex dynamic characteristics of the quadrotor model, the overall control system is typically decomposed into four independent control channels: the altitude control channel (z), the pitch angle channel (θ), the roll angle channel (ϕ), and the yaw angle channel (ψ). This four-channel decomposition enables a decentralized control design and simplifies the implementation of the control algorithms. However, due to the strong coupling between translational and rotational motions, a robust and adaptive control strategy is essential to ensure stability and tracking performance under disturbances and modeling uncertainties.
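For reference, the relations described in this section can be written compactly in the form commonly used for Newton–Euler quadrotor models. The block below is a sketch under stated assumptions, not necessarily the exact equations of this work: it assumes ZYX (yaw–pitch–roll) Euler angles, an upward-pointing z-axis, a plus-type rotor layout in which rotors 1 and 3 spin opposite to rotors 2 and 4, and the symbols $R$, $W_\eta$, $\eta$, $\nu$, $\xi$, $V_B$, $G$, $T_B$, $I$, $I_M$, $l$, and $\Gamma$, several of which are introduced here only for illustration.

```latex
% Assumed notation (illustrative, not taken from the text):
%   \eta = [\phi,\theta,\psi]^T  Euler angles,   \nu = [p,q,r]^T  body rates,
%   \xi  = [x,y,z]^T  inertial position,         V_B  body-frame linear velocity,
%   G = [0,0,-mg]^T,  T_B = [0,0,T]^T,  I  inertia matrix,  I_M  rotor inertia,
%   l  rotor-to-center-of-mass distance,  \Gamma  gyroscopic term.
\begin{align}
R &= \begin{bmatrix}
C_\psi C_\theta & C_\psi S_\theta S_\phi - S_\psi C_\phi & C_\psi S_\theta C_\phi + S_\psi S_\phi \\
S_\psi C_\theta & S_\psi S_\theta S_\phi + C_\psi C_\phi & S_\psi S_\theta C_\phi - C_\psi S_\phi \\
-S_\theta       & C_\theta S_\phi                         & C_\theta C_\phi
\end{bmatrix}, \\
\nu &= W_\eta \dot{\eta}, \qquad
W_\eta = \begin{bmatrix}
1 & 0       & -S_\theta \\
0 & C_\phi  & C_\theta S_\phi \\
0 & -S_\phi & C_\theta C_\phi
\end{bmatrix}, \\
f_i &= k\,\omega_i^2, \qquad \tau_{M_i} = b\,\omega_i^2 + I_M \dot{\omega}_i, \\
T &= \sum_{i=1}^{4} f_i = k \sum_{i=1}^{4} \omega_i^2, \\
\tau &= \begin{bmatrix} \tau_\phi \\ \tau_\theta \\ \tau_\psi \end{bmatrix}
      = \begin{bmatrix}
        l\,k\,(-\omega_2^2 + \omega_4^2) \\
        l\,k\,(-\omega_1^2 + \omega_3^2) \\
        \sum_{i=1}^{4} (-1)^{i+1}\, \tau_{M_i}
        \end{bmatrix}, \\
m \dot{V}_B + \nu \times (m V_B) &= R^{T} G + T_B, \\
m \ddot{\xi} &= G + R\, T_B, \\
I \dot{\nu} + \nu \times (I \nu) + \Gamma &= \tau .
\end{align}
```

Under a different axis or rotor-numbering convention (for example, z pointing downward or an X-type layout), the signs of $G$, $T_B$, and the individual torque entries change accordingly.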
3. Active Disturbance Rejection Controller Design
Owing to its dynamic structure, the quadrotor system can be represented by four decoupled control channels: the altitude (z) channel, the pitch angle (θ) channel, the roll angle (ϕ) channel, and the yaw angle (ψ) channel. ADRC is a nonlinear control strategy developed in response to the limitations of the PID control algorithm. It comprises a nonlinear tracking differentiator (TD), an extended state observer (ESO), and a nonlinear state error feedback (NLSEF) law. The TD eliminates sudden changes in the setpoint and provides a smoothed input signal. The ESO estimates each state variable together with the total disturbance. The NLSEF law generates the control signal applied to the controlled object. A nonlinear TD is used because numerical integration is more reliable and stable than numerical differentiation in a noisy environment. The design of the roll-channel controller is presented in this section as an example. A second-order TD can be designed as follows [
27]:
where
is reference signal,
is the tracking signal and
is its derivative,
h is the sampling time,
r is the speed factor that determines the convergence rate of the tracking differentiator and
h0 is the filtering factor associated with the sampling step size. The function
denotes the signum function. The nonlinear function
fhan(
·) is referred to as the optimal synthetic rapid control function and is defined as follows:
The extended state observer (ESO) designed for the roll dynamics is expressed by the following equations:
where
,
and
are the observed values of
and roll-channel total disturbance, respectively.
denote the roll-channel observer gains, whereas y denotes the output of the roll channel. The parameter
denotes the sampling step size, parameter
represents the compensation factor, and
is the control input.
are nonlinear factors; δ is a filter factor.
fal(∙) is a nonlinear function, employed to improve estimation accuracy and enhance disturbance rejection capability, and it is defined as
Finally, with the properly designed extended state observer, the nonlinear state error feedback law (NLSEF) for the roll channel is given by
where
are the nonlinear combination coefficients,
are nonlinear factors, and u
is the final control quantity.
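The three ADRC blocks described in this section can be sketched in code as follows. This is a minimal sketch assuming the widely used discrete-time forms of `fhan`, `fal`, the third-order nonlinear ESO, and the NLSEF law; the exact equations, symbol names, and gain values of this work may differ. The names `td_step`, `eso_step`, `nlsef`, the tuple layouts, and all numeric values in the usage lines are hypothetical.

```python
import math


def sign(x):
    """Signum function returning -1.0, 0.0, or 1.0."""
    return float((x > 0) - (x < 0))


def fhan(x1, x2, r, h):
    """Commonly used discrete 'optimal synthetic rapid control' function (assumed form)."""
    d = r * h
    d0 = h * d
    y = x1 + h * x2
    a0 = math.sqrt(d * d + 8.0 * r * abs(y))
    if abs(y) > d0:
        a = x2 + 0.5 * (a0 - d) * sign(y)
    else:
        a = x2 + y / h
    if abs(a) > d:
        return -r * sign(a)
    return -r * a / d


def fal(e, alpha, delta):
    """Nonlinear gain: linear inside |e| <= delta, power law outside."""
    if abs(e) <= delta:
        return e / (delta ** (1.0 - alpha))
    return (abs(e) ** alpha) * sign(e)


def td_step(v1, v2, v_ref, r, h, h0):
    """One tracking-differentiator update; returns the new (v1, v2)."""
    fh = fhan(v1 - v_ref, v2, r, h0)
    return v1 + h * v2, v2 + h * fh


def eso_step(z, y, u, h, b0, betas, alphas, delta):
    """One third-order ESO update; z = (z1, z2, z3), betas = observer gains."""
    z1, z2, z3 = z
    b01, b02, b03 = betas
    a1, a2 = alphas
    e = z1 - y  # observation error
    z1_new = z1 + h * (z2 - b01 * e)
    z2_new = z2 + h * (z3 - b02 * fal(e, a1, delta) + b0 * u)
    z3_new = z3 - h * b03 * fal(e, a2, delta)
    return z1_new, z2_new, z3_new


def nlsef(v1, v2, z, b0, beta1, beta2, alphas, delta):
    """Nonlinear state error feedback with disturbance compensation (assumed form)."""
    z1, z2, z3 = z
    a1, a2 = alphas
    e1, e2 = v1 - z1, v2 - z2
    u0 = beta1 * fal(e1, a1, delta) + beta2 * fal(e2, a2, delta)
    return (u0 - z3) / b0


# Usage with hypothetical values: smooth a unit step with the TD alone.
v1, v2 = 0.0, 0.0
for _ in range(400):
    v1, v2 = td_step(v1, v2, v_ref=1.0, r=50.0, h=0.005, h0=0.005)
```

In a closed loop, each sampling instant would call `td_step`, then `eso_step` with the latest measurement and the previous control input, and finally `nlsef` to produce the new control input.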
4. DDPG-Enhanced ADRC Design and Convergence Analysis
Reinforcement learning (RL) has emerged as an effective framework for enhancing controller autonomy in complex dynamical systems by enabling performance improvement through continuous interaction with the environment, rather than relying on explicit system models. In this work, a Deep Deterministic Policy Gradient (DDPG) algorithm is employed due to its suitability for continuous state and action spaces, which is essential for online tuning of ADRC parameters in nonlinear quadrotor dynamics. The DDPG architecture consists of actor and critic neural networks, where the actor generates continuous-valued tuning actions and the critic evaluates their quality by jointly processing system states and control inputs through parallel information pathways to estimate the action–value function.
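The actor–critic bookkeeping behind DDPG can be illustrated without any deep learning library. The sketch below shows only the critic target, the mean squared error loss, and the soft target-network update on plain Python lists; the discount factor 0.99 matches the value reported in this work, whereas the smoothing factor `tau`, the masking of terminal transitions, and the function names are assumptions made for illustration.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Critic targets y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    next_q_values are assumed to come from the target critic evaluated at the
    target actor's action. Terminal transitions are not bootstrapped, which is
    common practice and an assumption here.
    """
    return [
        r + (0.0 if done else gamma * q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def critic_mse_loss(q_pred, targets):
    """Mean squared error between predicted Q-values and targets."""
    return sum((y - q) ** 2 for q, y in zip(q_pred, targets)) / len(targets)


def soft_update(target_params, online_params, tau):
    """Polyak averaging: theta' <- (1 - tau) * theta' + tau * theta."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_params, online_params)]


# Usage with made-up numbers (tau = 0.5 only to make the effect visible).
targets = td_targets([1.0, 2.0], [10.0, 20.0], [False, True])
loss = critic_mse_loss([1.0, 2.0], [1.0, 4.0])
new_target = soft_update([0.0, 0.0], [2.0, 4.0], tau=0.5)
```

In a full implementation the actor would then be updated by ascending the critic's gradient with respect to the action, which requires an automatic differentiation framework and is omitted here.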
The observation vector of the RL agent is designed to capture both the transient and steady-state characteristics of the system under disturbed operating conditions. The reward function is formulated to penalize tracking errors, excessive control effort, and abrupt variations in the control input. Furthermore, episode termination is implemented through the “Is Done” condition, whereby the training episode is terminated whenever actuator constraints are violated, safety limits are exceeded, or the maximum episode duration is reached, thereby ensuring stable and safe learning. Through the integration of observation space, reward shaping, and termination logic, the DDPG-based agent learns optimal ADRC tuning parameters in real time, enabling adaptive disturbance rejection and improved control performance under varying flight conditions.
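The reward shaping and termination logic described above might be organized as in the following sketch. The termination thresholds (0.8 rad roll angle, 5 rad/s roll rate, 500 steps per episode) are the values reported in this work; the reward weights, the bonus value, the error tolerance, and the function names are hypothetical placeholders, since the exact reward expression is not reproduced here.

```python
MAX_ROLL_RAD = 0.8     # roll-angle safety limit reported in this work
MAX_ROLL_RATE = 5.0    # roll-rate safety limit (rad/s) reported in this work
MAX_STEPS = 500        # maximum episode length reported in this work


def is_done(roll, roll_rate, step):
    """Terminate on a safety-limit violation or when the episode budget is spent."""
    return (
        abs(roll) > MAX_ROLL_RAD
        or abs(roll_rate) > MAX_ROLL_RATE
        or step >= MAX_STEPS
    )


def reward(error, roll_rate, u, du,
           w_e=10.0, w_rate=0.1, w_u=0.001, w_du=0.01,
           bonus=1.0, tol=0.01):
    """Quadratic penalties plus a small bonus near the reference.

    All weights, the bonus, and the tolerance are hypothetical values chosen
    only to make the sketch runnable.
    """
    penalty = (
        w_e * error ** 2          # tracking error
        + w_rate * roll_rate ** 2  # angular velocity
        + w_u * u ** 2             # control effort
        + w_du * du ** 2           # abrupt change in the control input
    )
    return (bonus if abs(error) < tol else 0.0) - penalty
```

Actuator-constraint checks could be added to `is_done` in the same way once the actuator limits are specified.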
In this study, a Deep Deterministic Policy Gradient (DDPG) framework is employed to adaptively tune the nonlinear state error feedback (NLSEF) gain β1 of the ADRC controller for roll motion control of the quadrotor, as shown in Figure 2. This choice is motivated by the dominant role of β1 in shaping the transient response and control effort in ADRC. In comparison, β2 mainly affects damping characteristics, while the ESO gains primarily influence observer convergence and disturbance estimation dynamics. Therefore, by restricting the reinforcement learning agent to a single, highly sensitive parameter, the complexity of the learning problem is significantly reduced, leading to improved convergence and stability of the DDPG algorithm.
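One way to wire a single-gain adaptation into the control loop is sketched below. It assumes that the actor output is bounded in [-1, 1], that it is scaled to an increment of the gain, and that the resulting gain is clipped to a safe interval; the scaling factor, the bounds, and the function names are hypothetical and are not taken from this work.

```python
def scale_action(raw_action, max_step):
    """Map a bounded actor output in [-1, 1] to a gain increment."""
    clipped = max(-1.0, min(1.0, raw_action))
    return clipped * max_step


def update_gain(beta1, delta_beta1, beta_min, beta_max):
    """Apply the increment and keep the gain inside [beta_min, beta_max]."""
    return max(beta_min, min(beta_max, beta1 + delta_beta1))


# Usage with made-up numbers: the agent asks for half of the maximum decrease.
beta1 = 10.0
beta1 = update_gain(beta1, scale_action(-0.5, max_step=0.5), beta_min=1.0, beta_max=50.0)
```

Clipping the gain keeps exploratory actions from pushing the feedback law into a region where the actuators would saturate.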
The state (observation) vector is defined as $s = [e_\phi, \dot{e}_\phi, \phi, \dot{\phi}]^T$, where $e_\phi$ and $\dot{e}_\phi$ denote the roll angle tracking error and its derivative, respectively, while $\phi$ and $\dot{\phi}$ represent the roll angle and roll rate. The action of the agent is chosen as the incremental adjustment of the NLSEF gain, expressed as $a = \Delta\beta_1$, which is then used to update the adaptive NLSEF gain $\beta_1$. This formulation allows the agent to dynamically tune the control effort in response to system dynamics and disturbances. The reward function is designed to penalize large tracking errors and angular velocities, and it is given by
, where the bonus term
is awarded when
, and
otherwise. Stable training and safe exploration are ensured through proper reward formulation, while the training episode is terminated if the roll angle exceeds 0.8 rad or the roll rate exceeds 5 rad/s. The main hyperparameters used for training are as follows: sampling time is set to 0.005 s, the discount factor is chosen as 0.99, and the target network smoothing factor is fixed at
. A mini-batch size of 64 is employed with a replay buffer capacity of
. The initial exploration noise standard deviation is set to 0.3 and decays at a rate of
to ensure sufficient exploration during early training and stable convergence thereafter. The actor network is trained by performing gradient ascent on the expected Q-value, following the deterministic policy gradient formulation. The policy gradient is computed as
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$$
The critic is optimized by minimizing the mean squared error (MSE) loss between its predicted Q-value and the target Q-value, expressed as
$$L = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$$
where the target value is defined as
$$y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$$
Here, $\mu(s \mid \theta^{\mu})$ denotes the actor network parameterized by $\theta^{\mu}$, which maps the state $s$ to a continuous action $a$, and $Q(s, a \mid \theta^{Q})$ represents the critic network parameterized by $\theta^{Q}$, which estimates the action–value function. The symbols $\mu'$ and $Q'$ indicate the corresponding target networks used to stabilize learning. The term $r_i$ denotes the immediate reward at sample $i$, $\gamma$ is the discount factor, and $N$ is the mini-batch size drawn from the replay buffer. The gradients $\nabla_{a} Q$ and $\nabla_{\theta^{\mu}} \mu$ represent the critic and actor gradients, respectively, which together drive the policy update through backpropagation. This learning mechanism enables the DDPG agent to iteratively improve the control policy by accurately estimating the long-term reward associated with each NLSEF gain adjustment.
For the Lyapunov-based stability analysis of the proposed DDPG-based ADRC applied to the roll channel of the quadrotor system, the roll dynamics can be expressed in the ADRC canonical form as
where
represents the lumped total disturbance including model uncertainties and external aerodynamic effects and
denotes the nominal control gain. The tracking errors are defined as
A standard quadratic Lyapunov candidate function is selected as
which is positive definite and radially unbounded. The time derivative of
along the system trajectories is given by
The DDPG-based ADRC control law is defined as
where
is the ESO estimate of the total disturbance and
is the virtual control signal generated by the DDPG-tuned nonlinear state error feedback. Substituting the control law yields
where
denotes the ESO estimation error. Since the ESO is designed to be asymptotically convergent, the estimation error satisfies
. Furthermore, the DDPG policy generates a stabilizing feedback law of the form
, with
and
, leading to
. For
and
, the first two terms are negative semi-definite, while the last term vanishes asymptotically due to ESO convergence.
Hence, the Lyapunov derivative satisfies for some after a finite transient. To account for observer dynamics, the ESO estimation errors are defined as , , and a composite Lyapunov function is constructed as . Using standard ADRC observer theory, the ESO error dynamics are exponentially stable, which implies , with positive constants . Therefore, all tracking errors and ESO estimation errors are uniformly ultimately bounded, and the closed-loop roll subsystem under the proposed DDPG-based ADRC is globally asymptotically stable in the sense of Lyapunov.
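The boundedness argument can be made concrete with a simplified worked sketch. It assumes, purely for illustration, that the learned feedback behaves like a linear law with positive gains $k_1$ and $k_2$ (the actual NLSEF is nonlinear), that the tracking error is defined as $e_1 = \phi - \phi_d$, and that the ESO estimation error $\tilde{f}$ is bounded. The matrices $A$, $B$, $P$, and $Q$ are notation introduced here, and a weighted quadratic form $e^{T} P e$ is used so that the cross term between $e_1$ and $e_2$ is absorbed by the Lyapunov equation.

```latex
\begin{align}
\ddot{\phi} &= f + b_0 u, \qquad u = \frac{u_0 - \hat{f}}{b_0}, \qquad \tilde{f} = f - \hat{f}, \\
e_1 &= \phi - \phi_d, \qquad e_2 = \dot{e}_1, \qquad u_0 = \ddot{\phi}_d - k_1 e_1 - k_2 e_2, \\
\dot{e} &= A e + B \tilde{f}, \qquad
A = \begin{bmatrix} 0 & 1 \\ -k_1 & -k_2 \end{bmatrix}, \quad
B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad
e = \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}, \\
V &= e^{T} P e, \qquad A^{T} P + P A = -Q, \qquad P \succ 0,\; Q \succ 0, \\
\dot{V} &= -e^{T} Q e + 2\, e^{T} P B\, \tilde{f}
\;\le\; -\lambda_{\min}(Q)\, \|e\|^{2} + 2\, \|P B\|\, \|e\|\, |\tilde{f}| .
\end{align}
```

Since $A$ is Hurwitz for $k_1, k_2 > 0$, a suitable $P$ exists for any chosen $Q \succ 0$. The bound gives $\dot{V} < 0$ whenever $\|e\| > 2\|PB\|\,|\tilde{f}| / \lambda_{\min}(Q)$, so under these assumptions the tracking error is ultimately bounded by a quantity proportional to the ESO estimation error and converges to zero when $\tilde{f} \to 0$.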
The stability analysis of the proposed DDPG-assisted ADRC framework is performed under the assumption that the extended state observer (ESO) provides sufficiently accurate disturbance estimation and that the DDPG algorithm converges to a bounded control policy. In practical implementations, reinforcement learning-induced uncertainties such as exploration noise, neural network approximation errors, and incomplete convergence may influence the transient system response and robustness characteristics. Nevertheless, since the DDPG algorithm is utilized mainly for adaptive parameter tuning, the primary closed-loop stability behavior is governed by the ADRC framework and ESO dynamics. Therefore, the proposed method preserves stable tracking performance under bounded disturbances and parameter variations within the considered operating conditions.
To assess the convergence characteristics of the proposed DDPG-assisted ADRC framework, the reinforcement learning agent was trained for a total of 300 episodes, each consisting of a maximum of 500 time steps. The evolution of the episode reward and the corresponding 20-episode moving-average reward during the training process is illustrated in
Figure 3. As observed, the episode reward exhibits minor fluctuations during the initial training phase due to the exploration mechanism of the DDPG algorithm. However, the moving-average reward converges smoothly and stabilizes at approximately 8957 after nearly 100 episodes, indicating successful policy convergence. Although several isolated reward drops are observed throughout the training process, these are attributed to exploratory actions and do not affect the overall learning stability. Furthermore, all training episodes reached the maximum episode length of 500 steps without premature termination, demonstrating stable closed-loop operation during learning. The critic-estimated Q-value increased progressively throughout the training process and reached 1133.29 at the final episode, confirming successful value-function learning and convergence of the DDPG algorithm.
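The 20-episode moving-average reward used to judge convergence can be computed as in the short sketch below. It assumes a trailing window that is simply shortened during the first episodes; the function name and the sample reward values are illustrative only.

```python
def moving_average(rewards, window=20):
    """Trailing moving average; early entries average over the episodes seen so far."""
    averaged = []
    for i in range(len(rewards)):
        recent = rewards[max(0, i - window + 1): i + 1]
        averaged.append(sum(recent) / len(recent))
    return averaged


# Usage with made-up episode rewards and a short window.
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
```

A plateau in this smoothed curve is a practical, though not rigorous, indicator that the policy has stopped improving.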
5. Numerical Simulation
The parameters used in the quadrotor simulation are listed in
Table 1. The performance of the proposed control framework is investigated through time-domain simulations under both nominal and disturbed operating conditions. Initially, the quadrotor is commanded to track a constant reference altitude of
m, while the desired attitude angles are set to
. These reference values are selected to represent a realistic non-hover maneuver involving simultaneous roll and pitch deflections, rather than a trivial equilibrium condition. The initial state of the system is chosen as
, where altitude is expressed in meters and angular states are expressed in radians. The chosen initial conditions introduce a moderate initial roll angle deviation while maintaining zero initial displacement in the remaining channels, thereby facilitating a meaningful understanding of the transient behavior, coupling effects, and convergence rate between the translational and rotational dynamics.
Two simulation scenarios are considered to evaluate the disturbance rejection capability and robustness of the control strategies under roll-channel disturbances. In the first case, the roll channel is subjected to external disturbances, and the system performance is analyzed using the conventional ADRC controller (
Figure 4,
Figure 5,
Figure 6 and
Figure 7). In the second case, the same reference trajectory, initial conditions, and disturbance profile are maintained. However, the controller is replaced with the proposed DDPG-based ADRC framework (
Figure 8,
Figure 9,
Figure 10 and
Figure 11).
This allows a direct comparison of the roll tracking performance, control effort, and robustness improvement achieved through reinforcement learning-based adaptive tuning over the conventional ADRC approach. To evaluate the disturbance rejection capability of the conventional and the proposed DDPG-based ADRC schemes, an external disturbance is injected into the roll channel in the form of a composite torque disturbance. The disturbance is modeled as the sum of three components: a step input, a sinusoidal component, and a zero-mean Gaussian white noise process with a variance of 0.1. This composite disturbance emulates realistic environmental effects such as sudden wind gusts, periodic aerodynamic oscillations, and measurement or process uncertainties. Among all attitude channels, the roll dynamics are particularly sensitive to such disturbances and are strongly coupled with the overall system behavior. Hence, injecting disturbances only in the roll channel is sufficient to evaluate the global disturbance rejection capability of the controller. Furthermore, the NLSEF gain parameter β1 of the roll channel is adaptively tuned using the DDPG agent. This parameter directly governs the nonlinear feedback intensity and significantly influences transient performance indices such as rise time, overshoot, and control effort. Adapting a single, critical control parameter reduces learning complexity without compromising performance, resulting in improved stability, faster convergence, and greater practicality of the proposed control framework.
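A composite disturbance of this kind can be generated as in the sketch below. Only the noise variance of 0.1 is taken from this work; the step amplitude, step time, sinusoid amplitude, sinusoid frequency, and the function name are hypothetical values chosen so that the example runs.

```python
import math
import random


def composite_disturbance(t, rng, step_amp=0.5, step_time=1.0,
                          sine_amp=0.2, sine_freq_hz=1.0, noise_var=0.1):
    """Step + sinusoid + zero-mean Gaussian noise, evaluated at time t (seconds)."""
    step = step_amp if t >= step_time else 0.0
    sine = sine_amp * math.sin(2.0 * math.pi * sine_freq_hz * t)
    noise = rng.gauss(0.0, math.sqrt(noise_var)) if noise_var > 0.0 else 0.0
    return step + sine + noise


# Usage: sample the disturbance on a 0.005 s grid for 5 s with a fixed seed.
rng = random.Random(0)
profile = [composite_disturbance(k * 0.005, rng) for k in range(1000)]
```

Passing an explicit `random.Random` instance keeps the noise realization reproducible, so that both controllers can be compared against the same disturbance profile.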
The quantitative performance comparison between the conventional ADRC and the proposed DDPG-based ADRC is summarized in Table 2 for all four channels under disturbed operating conditions. Both controllers are evaluated using standard time-domain performance indices, including root mean square error (RMSE), settling time ($t_s$), steady-state error ($e_{ss}$), and control effort metrics. Tracking accuracy in each channel is evaluated using RMSE, defined as $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(y(k) - r(k)\right)^2}$, where $y(k)$ is the system output, $r(k)$ is the reference trajectory, and $N$ is the number of samples. A lower RMSE indicates superior trajectory tracking and enhanced disturbance rejection capability. The RMS control effort gives a quantitative measure of the average control energy used by the system over the simulation interval, $U_{\mathrm{rms}} = \sqrt{\frac{1}{N}\sum_{k=1}^{N} u(k)^2}$. The peak control effort is the maximum absolute control magnitude applied, $U_{\mathrm{peak}} = \max_{k} |u(k)|$.
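These indices are straightforward to compute from logged simulation data. The sketch below assumes uniformly sampled signals stored as Python lists; the function names and the sample values are illustrative.

```python
import math


def rmse(outputs, references):
    """Root mean square tracking error over all samples."""
    n = len(outputs)
    return math.sqrt(sum((y - r) ** 2 for y, r in zip(outputs, references)) / n)


def rms_effort(controls):
    """Root mean square of the control signal."""
    return math.sqrt(sum(u ** 2 for u in controls) / len(controls))


def peak_effort(controls):
    """Largest absolute control value."""
    return max(abs(u) for u in controls)


# Usage with made-up samples.
tracking_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
u_rms = rms_effort([3.0, -4.0])
u_peak = peak_effort([3.0, -4.0, 2.0])
```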
The tracking response of the baseline controller, illustrated in
Figure 4, demonstrates that the ADRC is capable of maintaining stable reference tracking across all four channels, namely altitude, roll, pitch, and yaw. In the altitude channel, the controller achieves an RMSE of 3.30 m with a settling time of 1.81 s. The response is smooth and free from overshoot, indicating satisfactory baseline tracking capability. RMS control effort of 174.02 N and peak value of 218.88 N indicate sustained actuator activity during disturbance rejection. ESO outputs and disturbance estimates of each channel are depicted in
Figure 5 and
Figure 6, respectively.
In attitude channels, the roll channel exhibits the most pronounced transient oscillations, reflected by an RMSE of 0.23 rad and a settling time of 5.0 s. The oscillatory behavior of the roll channel results from the combined effects of strong nonlinear coupling among rotational dynamics, high sensitivity of angular acceleration to torque disturbances and rapid disturbance variation.
This results in a high RMS control effort of 191.78 N·m and peak value of 267.19 N·m, confirming the aggressive compensation depicted in
Figure 6. The pitch channel performs better, with an RMSE of 0.08 rad, a settling time of 0.88 s, an RMS control effort of 174.24 N·m, and a peak control effort of 220.01 N·m. The yaw channel achieves an RMSE of 0.002 rad with a settling time of 0.63 s, while requiring an RMS control effort of 152.83 N·m and a peak control effort of 251.85 N·m.
The yaw channel is characterized by relatively low rotational inertia, and it exhibits a particularly high peak torque because small angular deviations require rapid corrective action. Overall, although the conventional ADRC guarantees system stability and satisfactory tracking performance across all channels, the results clearly reveal its primary drawback: excessive control effort and increased actuator stress, particularly in the roll channel. This motivates the need for an adaptive learning-based enhancement to improve transient performance while simultaneously reducing control aggressiveness.
The DDPG-optimized ADRC tracking responses are depicted in
Figure 8. The altitude RMSE remains unchanged at 3.30 m, demonstrating that steady-state accuracy is preserved. However, the RMS control effort decreases from 174.02 N to 166.55 N, indicating improved actuation smoothness and energy efficiency. Although the peak control effort increases slightly to 228.21 N, the transient duration is reduced, suggesting sharper but shorter corrective action. This reflects a more efficient redistribution of control energy. The state estimates and disturbance estimates provided by the ESO in each channel with the proposed controller are depicted in
Figure 9 and
Figure 10 respectively.
Table 2 presents a quantitative comparison between conventional ADRC and DDPG-based ADRC under composite disturbances. In the disturbance-sensitive roll channel, the DDPG-based ADRC demonstrates a marked improvement in transient and steady-state behavior. The settling time is reduced to 0.50 s with a reduced RMSE of 0.10 rad and zero steady-state error. The reduction in settling time highlights the effectiveness of the learned adaptation in enhancing system responsiveness. Furthermore, the reduced control effort indices, characterized by an RMS value of 182.02 N·m and a peak value of 246.81 N·m, indicate that the improved transient performance is achieved without significant increase in actuation demand. From the results, it is clear that the DDPG-assisted tuning mechanism successfully balances disturbance rejection and control smoothness. In particular, the moderate RMS value reflects reduced average control energy expenditure, while the bounded peak torque confirms the absence of aggressive or impulsive control action during disturbance rejection.
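The relative improvements in the roll channel follow directly from the values quoted above. The short calculation below uses only the conventional and DDPG-based figures reported in the text; the dictionary keys and the helper name are illustrative.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of a metric with respect to its baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline


# (conventional ADRC, DDPG-based ADRC) pairs for the roll channel, from the text.
roll_metrics = {
    "settling_time_s": (5.0, 0.50),
    "rmse_rad": (0.23, 0.10),
    "u_rms_Nm": (191.78, 182.02),
    "u_peak_Nm": (267.19, 246.81),
}

reductions = {name: percent_reduction(*pair) for name, pair in roll_metrics.items()}
```

This gives reductions of about 90% in settling time, 56.5% in RMSE, 5.1% in RMS control effort, and 7.6% in peak control effort.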
In the pitch channel, the RMSE is slightly reduced from 0.08 rad to 0.07 rad, while the RMS control torque decreases from 174.24 N·m to 167.15 N·m, indicating that improved tracking accuracy is achieved with lower average actuation effort. A slight increase in peak control effort to 231.69 N·m along with reduced RMS control effort indicates that the controller applies a brief but stronger corrective torque during the initial transient phase to accelerate disturbance suppression, while maintaining lower average energy consumption over the full response. In yaw dynamics, RMSE is maintained at 0.002 rad, indicating excellent tracking accuracy and effective disturbance attenuation. Settling time decreases from 0.63 s to 0.56 s, reflecting faster convergence of yaw angle error. The RMS control torque remaining nearly unchanged indicates that the improved transient response is achieved without increasing the average control energy. This reflects efficient gain tuning, where the controller enhances responsiveness while preserving energy neutrality over the full simulation horizon. The peak control effort decreases significantly from 251.85 N·m to 214.39 N·m. This reduction in maximum instantaneous torque suggests that the DDPG-based tuning mitigates abrupt control spikes typically associated with disturbance rejection in yaw motion. The proposed controller reduces actuator stress and potential saturation risk, thereby enhancing hardware implementation feasibility.
It is observed that the DDPG-based ADRC controller produces higher-frequency variations in the control signal during the transient interval (1–2 s) compared to the conventional ADRC. This behavior arises primarily from the interaction between the learned policy and the observer dynamics. In particular, the DDPG policy adaptively increases the nonlinear state error feedback (NLSEF) gains to rapidly reduce the tracking error during transients. As a result, the controller becomes highly sensitive to small variations in the error states and to residual estimation errors from the extended state observer (ESO), which is still converging in this interval. These effects lead to rapid gain fluctuations and overcompensation, which in turn produce oscillatory control action. While this improves tracking accuracy, the resulting control signal may not be directly implementable in practical quadrotor systems due to actuator bandwidth and rate limitations. To address this, smoothing techniques such as low-pass filtering or control rate limiting can be incorporated.
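The smoothing techniques mentioned above can be illustrated with a first-order low-pass filter followed by a slew-rate limiter applied to the raw control signal. The sketch below is illustrative only and is not part of the evaluated controller: the function name and the values of the filter time constant `tau` and the rate limit `max_rate` are hypothetical and would need to be chosen from the actual actuator bandwidth.

```python
import numpy as np

def smooth_control(u_raw, dt, tau=0.02, max_rate=500.0):
    """Low-pass filter, then rate-limit, a sampled control signal.

    u_raw    : raw control samples (e.g. torque commands)
    dt       : sample period in seconds
    tau      : filter time constant in seconds (assumed value)
    max_rate : largest allowed change per second (assumed value)
    """
    u_raw = np.asarray(u_raw, dtype=float)
    alpha = dt / (tau + dt)      # discrete first-order filter gain
    max_step = max_rate * dt     # largest allowed change per sample
    u_out = np.empty_like(u_raw)
    u_filt = u_raw[0]            # filter state
    u_prev = u_raw[0]            # last command actually issued
    for k, u in enumerate(u_raw):
        u_filt += alpha * (u - u_filt)
        step = np.clip(u_filt - u_prev, -max_step, max_step)
        u_prev = u_prev + step
        u_out[k] = u_prev
    return u_out
```

Both stages introduce additional lag in the control path, so their parameters would have to be re-evaluated in closed loop together with the ESO bandwidth rather than applied as a post-processing step only.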
Overall, a substantial performance enhancement is observed with the DDPG-based ADRC compared to the conventional ADRC. The roll channel exhibits a reduction in settling time and steady-state error, demonstrating superior disturbance rejection and faster transient recovery. Collectively, these results confirm that adaptive tuning of the NLSEF gains through the DDPG method enhances the inherent robustness of the ADRC framework while improving overall control performance. The proposed DDPG-based adaptation not only accelerates the dynamic response and improves disturbance rejection capability, but also reduces average control energy and limits excessive actuation demands. Consequently, the controller achieves a balanced trade-off between responsiveness, robustness, and actuator efficiency, thereby strengthening the practical feasibility and real-time deployment potential of the control system.
The proposed DDPG-assisted ADRC framework was evaluated under a representative disturbance profile consisting of step, sinusoidal, and noise components. Although improved tracking and disturbance rejection performance was achieved under the considered conditions, the generalization capability of the learned policy under unseen disturbance scenarios has not been fully investigated in this study. Further validation under varying disturbance conditions and parameter uncertainties will be considered in future work.
While the proposed DDPG-based NLADRC framework demonstrates improved tracking performance and control smoothness, certain limitations should be noted. The adaptive mechanism is restricted to tuning a single NLSEF parameter (β1), which may limit the achievable performance compared to full multi-parameter optimization of the ADRC structure. The parameters of the extended state observer (ESO) and the other NLSEF gains remain fixed, which may reduce adaptability under significantly varying operating conditions. Future work will focus on extending the proposed approach to multi-parameter and multi-axis tuning, as well as experimental validation on real quadrotor platforms.
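To make the single-parameter restriction concrete, the following sketch shows one way a bounded agent action could be mapped to β1 inside an NLSEF law. It assumes the widely used `fal`-based form of the NLSEF and an action normalized to [-1, 1]; the gain bounds, the fixed values of β2, α1, α2, and δ, and all function names are hypothetical and are not taken from this study.

```python
import numpy as np

def fal(e, alpha, delta):
    """Standard ADRC nonlinear function: linear near zero, power law outside."""
    if abs(e) > delta:
        return abs(e) ** alpha * np.sign(e)
    return e / delta ** (1.0 - alpha)

def action_to_beta1(action, beta_min=5.0, beta_max=50.0):
    """Map a normalized action in [-1, 1] to a gain in [beta_min, beta_max].

    The bounds are assumed values chosen for illustration.
    """
    a = float(np.clip(action, -1.0, 1.0))
    return beta_min + 0.5 * (a + 1.0) * (beta_max - beta_min)

def nlsef(e1, e2, beta1, beta2=5.0, alpha1=0.75, alpha2=1.25, delta=0.01):
    """Nonlinear state error feedback with only beta1 adapted online.

    beta2, alpha1, alpha2 and delta stay fixed (assumed values),
    mirroring the single-parameter limitation described above.
    """
    return beta1 * fal(e1, alpha1, delta) + beta2 * fal(e2, alpha2, delta)
```

Extending the action vector so that β2 or the ESO gains are also mapped in this way would be one route toward the multi-parameter tuning identified as future work, at the cost of a larger action space for the agent to explore.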
The present study mainly focuses on the closed-loop performance of the proposed DDPG-assisted ADRC framework under the considered simulation conditions. A detailed convergence analysis and an evaluation across multiple random seeds have not been included in the current work, and these repeatability and convergence investigations will be considered in future work. Although the proposed method is validated only through numerical simulations, its practical implementation is expected to be feasible due to the structure of the ADRC framework and the continuous control actions generated by the DDPG agent. However, real-world deployment may introduce additional challenges, including sensor noise, actuator constraints, communication delays, and onboard computational limitations. Hardware-in-the-loop (HIL) simulation and real-time experimental validation are therefore necessary to assess the performance of the proposed method under practical operating conditions, and will also be addressed in future work.