2. Dynamics Modeling
Although the proposed controller does not rely on a model, the dynamics of the drone considered are described in this section for the purposes of simulation and for the interpretation of the results presented later.
The model dynamics follows the rigid body equations of motion [12]. In this case, an inertial reference frame attached to the ground and a fixed body frame attached to the vehicle are defined as represented in Figure 1.
The rigid body state comprises a position in the inertial frame $p$, a velocity in the body frame $v$, an angular velocity in the body frame $\omega$, and a rotation matrix $R$ that represents the transformation between the fixed body frame and the inertial frame orientations. The full kinematics equations are

$$\dot{p} = R\,v, \qquad \dot{\lambda} = Q\,\omega, \qquad (1)$$

where the quadrotor attitude $\lambda = [\phi\ \theta\ \psi]^{\mathsf{T}}$ is expressed in Euler angles, with $\phi$ the angle of rotation about the $x$ axis, $\theta$ the angle of rotation about the $y$ axis, and $\psi$ the angle of rotation about the $z$ axis. The matrix $Q$ is defined as

$$Q = \begin{bmatrix} 1 & \sin\phi\tan\theta & \cos\phi\tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi/\cos\theta & \cos\phi/\cos\theta \end{bmatrix}. \qquad (2)$$
The dynamics equations are

$$m\,\dot{v} = -S(\omega)\,m\,v + T\,e_3 - m\,g\,R^{\mathsf{T}} e_3, \qquad J\,\dot{\omega} = -S(\omega)\,J\,\omega + n, \qquad (3)$$

where $e_3 = [0\ 0\ 1]^{\mathsf{T}}$, $g$ is the gravitational acceleration, and $J$ represents the moment of inertia matrix, which is assumed to be diagonal. The application $S(\cdot)$ corresponds to the skew-symmetric matrix such that $S(a)b = a \times b$. The inputs $T$ and $n$ represent the thrust force and torque, respectively.
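For simulation, the model above can be implemented directly. The following is a minimal Python sketch of one forward-Euler integration step of Equations (1)-(3) as reconstructed here; the function names, the integration scheme, and any numerical choices are illustrative assumptions rather than the exact simulator used for the results reported later.

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix S(a) such that S(a) @ b equals the cross product a x b."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def euler_rate_matrix(phi, theta):
    """Matrix Q of Equation (2), mapping body angular velocity to Euler angle rates."""
    return np.array([[1.0, np.sin(phi) * np.tan(theta), np.cos(phi) * np.tan(theta)],
                     [0.0, np.cos(phi), -np.sin(phi)],
                     [0.0, np.sin(phi) / np.cos(theta), np.cos(phi) / np.cos(theta)]])

def rotation_matrix(lam):
    """Body-to-inertial rotation R from Euler angles lam = [phi, theta, psi] (ZYX)."""
    phi, theta, psi = lam
    cf, sf = np.cos(phi), np.sin(phi)
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(psi), np.sin(psi)
    return np.array([[cp * ct, cp * st * sf - sp * cf, cp * st * cf + sp * sf],
                     [sp * ct, sp * st * sf + cp * cf, sp * st * cf - cp * sf],
                     [-st,     ct * sf,                ct * cf]])

def step(p, v, lam, omega, T, n, m, J, g, h):
    """One forward-Euler step of the rigid-body model, Equations (1) and (3)."""
    R = rotation_matrix(lam)
    e3 = np.array([0.0, 0.0, 1.0])
    p_dot = R @ v                                                # translational kinematics
    lam_dot = euler_rate_matrix(lam[0], lam[1]) @ omega          # rotational kinematics
    v_dot = -skew(omega) @ v + (T / m) * e3 - g * (R.T @ e3)     # translational dynamics
    omega_dot = np.linalg.solve(J, n - skew(omega) @ (J @ omega))  # rotational dynamics
    return p + h * p_dot, v + h * v_dot, lam + h * lam_dot, omega + h * omega_dot
```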
3. Controlling the Quadrotor
3.1. Controller Structure
The relationship between the two sets of Equations (1) and (3) allows for the development of a cascade control structure with an inner loop that controls the attitude of the quadrotor and an outer loop that controls the position and generates, as its manipulated variable, the desired orientation passed to the inner (attitude) controller. This type of controller is commonly used in UAVs, and its block diagram is shown in Figure 2.
The trajectory generator block creates the references $p_d$, $\dot{p}_d$, and $\ddot{p}_d$ for the position controller and the reference $\psi_d$ for the attitude controller. These signals represent the desired position with its derivatives and the desired yaw angle, respectively. Besides generating the input thrust $T$, the position controller also generates a desired upward orientation $r_{3d}$, corresponding to the third column of $R$, which is combined with $\psi_d$ in the attitude controller to track the desired orientation angles.
3.2. Position Controller
Neglecting the rotational dynamics, it is assumed that the desired orientation, $r_{3d}$, can be forced upon the system. Considering that the translational dynamics written in the inertial frame are $\ddot{p} = \frac{T}{m} r_3 - g\,e_3$, with $r_3$ the third column of $R$, the virtual input $u = \frac{T}{m} r_3 - g\,e_3$ is defined and transforms the first line of (3) into

$$\ddot{p} = u, \qquad (4)$$

which is now a linear system that can be controlled with a linear controller.
For the sake of reference trajectory tracking, define the tracking error $e$ by

$$e = p - p_d, \qquad (5)$$

where $p_d$ represents the reference position yielded by the guidance subsystem. Taking the second derivative of (5) and resorting to (4), the error dynamics

$$\ddot{e} = u - \ddot{p}_d \qquad (6)$$

is obtained. Selecting the virtual control variable $u$ to be

$$u = \ddot{p}_d - K_d\,\dot{e} - K_p\,e \qquad (7)$$

introduces linear state feedback, where each state has a proportional gain; the feedforward term $\ddot{p}_d$ cancels the second derivative of the reference present in the error dynamics. Applying the control law (7) transforms (6) into

$$\ddot{e} + K_d\,\dot{e} + K_p\,e = 0, \qquad (8)$$

where the gains $K_p$ and $K_d$ are selected so as to drive the error $e$ to zero, enabling tracking of a desired trajectory $p_d$ with time derivatives $\dot{p}_d$ and $\ddot{p}_d$.
The virtual control input $u$, which is a three-dimensional vector, is translated into the thrust input $T$ and the desired orientation $r_{3d}$ (forced upon the system). From the value of $u$ given by Equation (7) and its definition, the thrust $T$ is calculated with

$$T = m \left\| u + g\,e_3 \right\|, \qquad (9)$$

and the desired orientation $r_{3d}$ is obtained through

$$r_{3d} = \frac{u + g\,e_3}{\left\| u + g\,e_3 \right\|}. \qquad (10)$$
The thrust input is fed directly into the system, whereas $r_{3d}$ is passed to the attitude controller so that the orientation errors can also be regulated.
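A minimal sketch of this outer-loop law, under the reconstruction above (virtual input $u = \frac{T}{m} r_3 - g\,e_3$), is shown below; the function name, its signature, and the use of diagonal gain matrices are illustrative assumptions.

```python
import numpy as np

def position_controller(p, p_dot, p_d, pd_dot, pd_ddot, Kp, Kd, m, g=9.81):
    """Outer-loop law of Section 3.2: returns thrust T and desired upward direction r3_d.

    Kp and Kd are 3x3 (e.g., diagonal) gain matrices acting on the tracking error
    and its derivative, as in Equation (7).
    """
    e = p - p_d                               # tracking error, Equation (5)
    e_dot = p_dot - pd_dot
    u = pd_ddot - Kd @ e_dot - Kp @ e         # virtual input, Equation (7)
    f = u + g * np.array([0.0, 0.0, 1.0])     # required specific force (T/m) * r3
    T = m * np.linalg.norm(f)                 # thrust magnitude, Equation (9)
    r3_d = f / np.linalg.norm(f)              # desired upward direction, Equation (10)
    return T, r3_d
```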
3.3. Attitude Controller
Neglecting the translational dynamics, the desired orientation is fed to the attitude controller, which produces, as its manipulated variable, the input torque for the system.
To track the desired orientation, the desired yaw angle $\psi_d$ is combined with $r_{3d}$ to obtain the remaining desired angles $\phi_d$ and $\theta_d$.
Through decomposition of the desired rotation matrix $R_d$, whose third column is $r_{3d}$ and whose yaw is $\psi_d$, the desired attitude $\lambda_d = [\phi_d\ \theta_d\ \psi_d]^{\mathsf{T}}$ can be computed. To track this value, a possible strategy is to linearize the second equations of systems (1) and (3) around the hover condition, a valid approach for smooth trajectories where aggressive maneuvers are not required, resulting in

$$\ddot{\lambda} \approx J^{-1} n, \qquad (12)$$

given that the matrix $J$ is diagonal. The linearized system obtained is also a double integrator. The input $n$ is chosen to account for the angle and angular velocity errors, resulting in

$$n = J\left( -K_{p,\lambda}\,(\lambda - \lambda_d) - K_{d,\lambda}\,(\dot{\lambda} - \dot{\lambda}_d) \right). \qquad (13)$$
The inner loop that comprises the attitude controller must react faster than the outer loop containing the position controller. This time-scale separation is enforced to ensure that the system quickly corrects angular displacements that are detrimental to effectively tracking the desired position. To guarantee this, the gains of each controller must be chosen such that the real parts of the inner-loop poles are at least 10 times larger in magnitude than those of the outer-loop poles.
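The inner loop and the time-scale separation rule can be sketched in the same spirit. The code below implements the linearized attitude law (13) and compares the closed-loop pole real parts of the two double-integrator loops; all gain values are placeholders chosen only to illustrate the factor-of-10 rule.

```python
import numpy as np

def attitude_controller(lam, lam_dot, lam_d, lam_d_dot, Kp, Kd, J):
    """Inner-loop torque for the linearized model ddot(lambda) = J^{-1} n, Equation (13)."""
    e = lam - lam_d
    e_dot = lam_dot - lam_d_dot
    return J @ (-Kp @ e - Kd @ e_dot)

def loop_poles(kp, kd):
    """Closed-loop poles of a double integrator under u = -kp*e - kd*e_dot."""
    return np.roots([1.0, kd, kp])

outer = loop_poles(kp=4.0, kd=4.0)      # position loop (placeholder gains): poles at -2
inner = loop_poles(kp=400.0, kd=40.0)   # attitude loop (placeholder gains): poles at -20
ratio = np.min(np.abs(inner.real)) / np.max(np.abs(outer.real))
print(f"inner/outer pole real-part ratio: {ratio:.1f} (should be at least 10)")
```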
3.4. Underlying LQ Control
The adaptive RL controller proposed in this article consists of a state feedback in which the gains converge to those of an LQ controller. Hereafter, the model and the quadratic cost that define this underlying controller are defined. Assuming that the inputs are piecewise constant over the sampling period $h$, the discrete-time equivalent [13] of the double integrator dynamics takes the form $x_{k+1} = A x_k + B u_k$, with

$$A = \begin{bmatrix} 1 & h \\ 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} h^2/(2a) \\ h/a \end{bmatrix},$$

where, for the case of the rotational dynamics, $a$ is the diagonal element of the moment of inertia matrix $J$ associated with the angle to be controlled. For the case of the translational dynamics, the mass scaling is incorporated in the virtual input $u$, and thus $a = 1$.
Having both $A$ and $B$, the discrete controller gains for the control laws in (7) and (13) can be calculated by resorting to the LQR algorithm, which aims at minimizing the quadratic cost

$$\sum_{k=0}^{\infty} \left( x_k^{\mathsf{T}} Q\,x_k + u_k^{\mathsf{T}} R\,u_k \right).$$

A careful choice of the $Q$ and $R$ weight matrices is necessary for good performance.
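For reference, the gains of this underlying LQ controller can be obtained with standard tools. The sketch below builds the sampled double integrator described above and solves the discrete-time Riccati equation with SciPy; the weights shown are those used later for the position loop in Experiment 1, and $a = 1$ corresponds to the translational case.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dlqr_gain(h, a, Q, R):
    """LQ gain for the sampled double integrator x_{k+1} = A x_k + B u_k."""
    A = np.array([[1.0, h],
                  [0.0, 1.0]])
    B = np.array([[h**2 / (2.0 * a)],
                  [h / a]])
    P = solve_discrete_are(A, B, Q, R)                  # Riccati solution
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal state-feedback gain
    return K

# Position loop of Experiment 1 (translational case, a = 1, h = 0.01 s).
K_pos = dlqr_gain(h=0.01, a=1.0, Q=np.diag([200.0, 1.0]), R=np.array([[100.0]]))
print(K_pos)  # gains on [position error, velocity error]
```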
The sampling period selected throughout all experiments is $h = 0.01$ s, equivalent to a frequency of 100 Hz. The quadrotor state variables are assumed to be fully observable. Sensor noise is modeled by adding zero-mean white noise after the A/D sampling process.
5. Simulation Study
This section presents simulations that illustrate the results obtained when applying the above algorithm to control the motion of the quadrotor model described in
Section 2. The simulation experiments show that, with the proposed controller, the closed-loop system is able to track a reference in which there are curved sections followed by segments that are approximately straight. More importantly, the simulations illustrate adaptation. There is an initial period in which a constant vector of a priori chosen controller gains is used. During this period, the algorithm is learning the optimal gains, but these estimates are not used for feedback. After this initial period, the gains learned by the RL algorithm are used.
There are two sets of experiments. In the first set, the initial gains are far from the optimum. In the second set, the initial gains are not optimal, but closer to the optimum than those in the first set. Two relevant results are shown: first, the ability of the algorithm to improve the performance once the gains obtained by RL are used; second, the fact that the dither noise can be reduced when the initial gains are closer to the optimum (that is to say, when more a priori information is available). The evaluation and comparison of the different situations is done using an objective index (the "score") defined below.
Usually, control action decisions based on RL are obtained by training neural networks, which requires very large amounts of plant input/output data and, therefore, a long time before convergence. In the approach followed in this article, however, the approximation of the Q-function does not rely on neural network training but, instead, on recursive least squares, which has a fast convergence rate. Again, this feature is rendered possible by the class of control problems (linear-quadratic) considered.
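To make the contrast with neural-network training concrete, the sketch below shows one recursive least-squares update of a quadratic Q-function approximation for a fixed policy; the parametrization (quadratic features of the joint state-input vector) and the forgetting factor are illustrative and do not reproduce the authors' exact algorithm.

```python
import numpy as np

def quad_features(x, u):
    """Quadratic features of z = [x; u]: all products z_i * z_j with i <= j."""
    z = np.concatenate([x, u])
    return np.array([z[i] * z[j] for i in range(len(z)) for j in range(i, len(z))])

def rls_q_update(theta, P, x, u, cost, x_next, u_next, forgetting=1.0):
    """One RLS step enforcing Q(x, u) = cost + Q(x', u') for the current policy.

    theta: parameter vector of the quadratic Q-function approximation.
    P:     covariance matrix of the least-squares estimator.
    """
    phi = quad_features(x, u) - quad_features(x_next, u_next)  # regressor
    k = P @ phi / (forgetting + phi @ P @ phi)                 # estimator gain
    theta = theta + k * (cost - phi @ theta)                   # parameter update
    P = (P - np.outer(k, phi @ P)) / forgetting                # covariance update
    return theta, P
```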
The wind force $w$ is characterized by its $x$, $y$, and $z$ components, which determine its direction and magnitude. White noise is passed through a low-pass filter and added to the average wind to produce more realistic deviations from the average magnitude. The average value of the disturbance is assumed to be measurable and is compensated for, but the model is still affected by the filtered noise. The wind disturbance for the $i$-th component is described as

$$w_i = \bar{w}_i + \tilde{w}_i,$$

where $w_i$ represents the generated wind, $\bar{w}_i$ represents the average value, and $\tilde{w}_i$ represents the filtered white noise. The dynamics of the motors are neglected, since they are much faster than the remaining quadrotor dynamics. The aerodynamic drag effect is also neglected, since most of the quadrotor operations are maintained in a near-hover condition.
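A simple way to generate such a disturbance in simulation is to add first-order low-pass-filtered white noise to the average wind component; the filter coefficient, noise level, and average value below are illustrative and not the values used in the paper.

```python
import numpy as np

def wind_component(w_mean, noise_std, alpha, num_steps, seed=0):
    """Sequence w_i(k) = mean + filtered white noise for one wind component.

    alpha in (0, 1) is the first-order low-pass coefficient; values close to 1
    produce slower variations around the average wind.
    """
    rng = np.random.default_rng(seed)
    w_tilde = 0.0
    out = np.empty(num_steps)
    for k in range(num_steps):
        w_tilde = alpha * w_tilde + (1.0 - alpha) * rng.normal(0.0, noise_std)
        out[k] = w_mean + w_tilde
    return out

# Example: x component with an average of 1 m/s, sampled at 100 Hz for 10 s.
w_x = wind_component(w_mean=1.0, noise_std=0.5, alpha=0.99, num_steps=1000)
```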
The model considered has the following parameters:
- • Kg;
- • m;
- • Kg·m²;
- • Kg·m².
For simulation purposes, the constant component of the wind disturbance is compensated for, assuming that it is possible to measure its average value. The noisy oscillations around the average value still affect the system.
To gain better insight into how well the controller performs before and after the learning process, a performance metric is defined as

$$\text{score} = \frac{1}{N} \sum_{k=1}^{N} \left\| p_d(k) - p(k) \right\|, \qquad (31)$$

where $p_d$ represents the desired position, $p$ the actual position, and $N$ the number of samples considered. The metric is the average distance between the moving point on the reference trajectory and the actual position of the quadrotor; the lower this value, the better the controller tracks the given reference trajectory.
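Assuming the averaged-distance form given above, the score can be computed directly from the logged trajectories, for example:

```python
import numpy as np

def score(p_ref, p_actual):
    """Average Euclidean distance between reference and actual positions.

    p_ref and p_actual are arrays of shape (N, 3) sampled at the same instants.
    """
    return float(np.mean(np.linalg.norm(p_ref - p_actual, axis=1)))
```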
The control algorithm parameters are selected by trial and error in simulation. The main parameters to adjust are the weights in the quadratic cost and the variance of the dither noise added to the control variable. The dither noise variance must be chosen as a trade-off between not disturbing optimality (which calls for a small variance) and providing enough excitation to identify the parameters of the quadratic function that approximates the Q-function (which calls for increasing the variance). The cost weights adjust the controller bandwidth and must be selected to ensure that the inner loop is much faster than the outer loop.
In order to avoid singularities in the model, the references are such that the attitude deviates from the vertical by no more than a prescribed maximum angle.
5.1. Experiment 1
The first test attempts to improve a controller that is tuned with the following weights in the quadratic cost defined above:
Position controller: $Q$ = diag(200, 1), $R$ = 100;
Attitude controller (two of the angles): $Q$ = diag(100, 1), $R$ = 10;
Attitude controller (remaining angle): $Q$ = diag(10, 5), $R$ = 10.
This calibration affects the attitude control of the latter angle, for which a deliberately poor selection of weights is chosen so that it can be improved.
The algorithm has the parameters shown in
Table 1.
The learning process (during which a priori chosen controller gains are used) lasts 400 s, and the trajectory, in the form of a lemniscate, has cycles of 20 s, meaning that each learning cycle of 10 s comprises half a curve. The starting point is (2, 0, 0) at rest. The lemniscate has been selected as a test reference since it combines approximately straight and curved stretches.
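The exact parametrization of the lemniscate is not reproduced here; a Gerono-type curve consistent with a 20 s cycle and a start on the x axis is a reasonable stand-in, for example the following sketch (amplitude, altitude, and period are assumptions):

```python
import numpy as np

def lemniscate_reference(t, a=2.0, period=20.0, z=3.0):
    """Figure-eight reference p_d(t) with analytic first and second derivatives.

    Starts at (a, 0, z) and completes one cycle every `period` seconds.
    """
    w = 2.0 * np.pi / period
    p_d = np.array([a * np.cos(w * t),
                    0.5 * a * np.sin(2.0 * w * t),
                    z])
    pd_dot = np.array([-a * w * np.sin(w * t),
                        a * w * np.cos(2.0 * w * t),
                        0.0])
    pd_ddot = np.array([-a * w**2 * np.cos(w * t),
                        -2.0 * a * w**2 * np.sin(2.0 * w * t),
                        0.0])
    return p_d, pd_dot, pd_ddot
```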
Simulation results for the trajectory tracking improvement are presented in
Figure 3. To test the performance of both the initial and learned gains, a single cycle of the lemniscate curve is used.
For a starting point at (2, 0, 3) that coincides with the beginning of the lemniscate curve, the controller, after learning a better set of gains, has a score (as defined in Equation (31)) of 0.0177, whereas the untuned controller produced a score of 0.1203. These results show a clear improvement in trajectory tracking performance. The corresponding gain evolution in time is presented in Figure 4. The symbol $t$ denotes discrete time; hence, the horizontal scale is in number of samples, and the continuous time elapsed since the beginning of the simulation is obtained by multiplying it by the sampling period $h$.
Despite some oscillations, the convergence is quick. This is due to the selection of a high variance for the dither noise. However, this procedure can be a problem, given that, in real life, quadrotors have actuator limitations that might render such high input values infeasible. The magnitude of the input during the learning stage can induce large oscillations in the quadrotor, since the angular velocity and the angle are excited by the dither noise.
In order to reduce the dither noise power and still obtain satisfactory results, one alternative is to increase the number of steps required for an update of the learned controller gains. However, this approach slows down the learning process. Another possibility is to inflate the covariance matrix of the LS estimator at each update so that the updates are less influenced by previous data. This comes at the price of larger oscillations. The new parameters are shown in
Table 2.
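The covariance-inflation idea mentioned above amounts to a small modification of the recursive least-squares estimator sketched earlier: between controller-gain updates, the covariance matrix is scaled up so that new data dominate. The factor and the cap below are illustrative tuning choices.

```python
import numpy as np

def inflate_covariance(P, factor=10.0, max_eig=1e6):
    """Scale up the RLS covariance so that subsequent data weigh more.

    Intended to be applied once per controller-gain update; the largest
    eigenvalue is capped to keep the estimator numerically well behaved.
    """
    P = factor * P
    largest = np.max(np.linalg.eigvalsh(0.5 * (P + P.T)))
    if largest > max_eig:
        P = P * (max_eig / largest)
    return P
```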
With these updates, a new simulation rendered the results in
Figure 5, which shows the evolution of the gains with time. Even though the learning process slows down as expected, improvements are still achieved.
Figure 6 demonstrates that good trajectory tracking was obtained.
For a starting point at (2, 0, 3) that coincides with the beginning of the lemniscate curve, the controller performs reference tracking with a score (as defined in Equation (31)) of 0.0175 after the learning process, whereas the untuned controller produced a score of 0.1204, with the input torque taking smaller values as a direct consequence of the reduction in the dither noise.
5.2. Experiment 2
The previous two tests aimed to improve the controller gains starting from non-optimal values. The following three tests try to improve on already near-optimal results.
The starting point for the controller gains corresponds to the following configuration of weights:
Position controller: $Q$ = diag(200, 1), $R$ = 100;
Attitude controller: $Q$ = diag(100, 1), $R$ = 10.
The difference between the experiments is the value of the moment of inertia assumed when computing the initial controller gains, taken in each experiment to be:
Third experiment: Kg·m²;
Fourth experiment: Kg·m²;
Fifth experiment: Kg·m².
The actual value of this moment of inertia is 0.01 Kg·m². This means that, in each experiment, the initial attitude controller gains differ from the optimal ones.
The algorithm has the same parameters throughout all three experiments, as presented in
Table 3.
The dither is now significantly reduced, since less excitation is required.
The third experiment rendered the results shown in
Figure 7 and
Figure 8. A closer look at the trajectories reveals that improvements do occur, with the improved gains producing a better tracking performance. The score for the non-optimized version is 0.0202, whereas the version with learned gains scored 0.0175, reflecting a small improvement over the original configuration.
The simulation of the fourth experiment rendered the results in
Figure 9 and
Figure 10.
The score for the performance metric before the gains computed with RL are applied is 0.01781 and, after applying the algorithm, it becomes 0.01765. In this case, the improvement is almost negligible. This can be seen in Figure 10, where both trajectories practically overlap, ending up at essentially the same distance from the reference lemniscate curve.
In the fifth experiment, the score for the original set of gains is 0.01811 and, after applying the algorithm, it becomes 0.01765. In this last experiment, a small improvement was achieved: a good tuning of the algorithm allowed for a modest performance gain over an already well-tuned controller.
6. Conclusions
A reinforcement learning-based adaptive controller for a quadrotor was developed, including an adapted version of the Q-learning policy iteration algorithm for linear-quadratic problems. The particular class of RL-based controllers considered allows adaptation in real time.
Disturbances affecting the system input and output have a strong effect on the correct functioning of the algorithm. A careful choice of, and balance among, the estimation algorithm parameters addresses this problem. However, when large disturbances are present, it is only possible to make the gains converge to a neighborhood of the optimal values. Increasing the influence of prior estimates provides a greater degree of robustness, with the drawback of biasing the convergence towards values close to, but not exactly at, the original optimal gains. Nonetheless, this is necessary to prevent harsh oscillations in the learned gains, which are still present in the quadrotor tests, most likely due to the non-linear nature of the rotational dynamics and other unmodeled dynamics besides the disturbances.
Still, the algorithm produced good results provided that the drone is kept within the near-linear zone of operation, that is, performing safe maneuvers close to the hovering condition.
The selection of the dither noise power to inject must solve a dual problem: the control problem requires a dither noise power as small as possible (ideally, zero), while the estimation problem requires a high value for the dither noise variance. The exact value of the dither noise power that provides the best compromise can be found using multi-objective optimization, but this is computationally very demanding. Good approximations, such as the one proposed in [16], are available for predictive adaptive controllers. A possibility is then to adapt this approach to RL adaptive control, but a much more complicated algorithm is expected to arise. Although promising as future work, such a research track is outside the scope of the present article. Instead, the approach followed here was to adjust the dither noise power by trial and error in order to obtain the best results.