Deep Reinforcement Learning with Corrective Feedback for Autonomous UAV Landing on a Mobile Platform

Abstract: Autonomous Unmanned Aerial Vehicle (UAV) landing remains a challenge in uncertain environments, e.g., landing on a mobile ground platform such as an Unmanned Ground Vehicle (UGV) without knowing its motion dynamics. A traditional PID (Proportional, Integral, Derivative) controller is a common choice for the UAV landing task, but it suffers from the problem of manual parameter tuning, which becomes intractable if the initial landing condition changes or the mobile platform keeps moving. In this paper, we design a novel learning-based controller that integrates a standard PID module with a deep reinforcement learning (DRL) module, which can automatically optimize the PID parameters for velocity control. In addition, corrective feedback based on heuristics of parameter tuning can speed up the learning process compared with traditional DRL algorithms that are typically time-consuming. Moreover, the learned policy makes the UAV landing smooth and fast by allowing the UAV to adjust its speed adaptively according to the dynamics of the environment. We demonstrate the effectiveness of the proposed algorithm in a variety of quadrotor UAV landing tasks with both static and dynamic environmental settings.


Introduction
Unmanned Aerial Vehicles (UAVs) have been widely used in a variety of real-world applications, such as civil engineering [1], precision agriculture [2], and monitoring in mining areas [3]. One advantage of using UAVs is that they can fly to and land on complex terrains that are difficult to reach by ground traversal. However, UAVs have the drawbacks of relatively short flight time and low load limit compared with ground platforms such as Unmanned Ground Vehicles (UGVs). Alternatively, collaboration between UAVs and UGVs is a more efficient and effective way to solve complex field tasks [4]. On the one hand, UAVs can fly up to a certain height and provide a global map that aids UGVs in planning and choosing the shortest path to the destination. On the other hand, UGVs can provide UAVs with charging facilities that guarantee the flight time as needed.
However, the autonomous landing of a UAV on a UGV is still challenging, as discussed in [5]. Specifically, the motion dynamics of the UGV are unknown to the UAV, which has to perform the landing task under high uncertainty. To solve the landing problem, a variety of methods have been proposed, such as fuzzy control [6], Model Predictive Control (MPC) [7], PD (Proportional, Derivative) control [8], PID (Proportional, Integral, Derivative) control [9], and vision-based control [10], together with reinforcement learning [11][12][13][14].
Some of the approaches only considered UAV landing on static platforms [6,8] or in simulation [7]. A basic PID controller was used to design a collaborative UGV-UAV system for data collection in the construction industry [9]. One drawback of the PID controller is that a fixed gain cannot provide an immediate response to overcome the nonlinear thrust effect with decreasing altitude. In addition, the parameters of traditional PID controllers are all constants that need manual tuning. Therefore, such controllers can hardly handle dynamic situations such as landing with various initial conditions or landing on a moving platform.
Alternatively, learning-based methods have been integrated with the traditional PID controller for solving tasks in dynamic environments. Specifically, Reinforcement Learning (RL) has become popular and has been combined with PID to improve the accuracy of path planning for mobile robots [15,16]. The combination of Q-learning [17] and PID has been shown to perform better than Q-learning or PID alone. However, tabular Q-learning requires discrete states and actions and can hardly handle the high-dimensional or continuous control problems found in many real-world tasks. Recently, more advanced Deep Reinforcement Learning (DRL) algorithms, such as Deep Deterministic Policy Gradient (DDPG) [18], Proximal Policy Optimization (PPO) [19], and Soft Actor-Critic (SAC) [20], have been able to output continuous actions based on high-dimensional sensory input. Specifically, DDPG was found effective in handling disturbances for vision-based UAV landing [12]. The controller was trained in simulation and transferred to a real-world environment, but the output would be the same even if the heights were different because the altitude (z-direction) was not considered in the state representation. Another work [14] solved this problem by considering all three dimensions in the state representation and also chose DDPG for vision-based UAV landing on a moving platform. However, these methods suffer from the same problem as most deep reinforcement learning algorithms, which rely on heavy offline training with high-quality samples.
Corrective feedback from a human teacher can possibly speed up the learning process if the teacher has a good understanding of the task as well as the dynamics of the environment. The DAGGER method required the human expert to label each queried state visited by the learner [21]. HG-DAGGER reduced the alertness burden on the expert by executing a human-gated mixed control trajectory and using the human-labeled portions of the data as the online batch update [22]. In a more natural and efficient manner, the EIL approach made use of non-intervention in addition to the intervention of human feedback [23]. In another work, the TAMER framework allowed a human to interactively shape an agent's policy via evaluative feedback [24]. The credit assignment mechanism associated the feedback with the relevant data of state-action pairs. Based on the structure of TAMER, the COACH framework advocated using the feedback in the action domain, and past feedback was considered for adaptively adjusting the amount of human feedback that a given action received [25]. Furthermore, corrective feedback was used to construct action exploration strategies in continuous spaces [26,27]. However, the human teacher would not always be able to give appropriate feedback for problems with fast and complex transitions in high-dimensional action spaces, e.g., learning to control a UAV landing on a mobile platform. In that case, the learning curve would be similar to pure reinforcement learning since few feedback signals would be given by the teacher [26].
In this paper, we solve the quadrotor UAV landing task in dynamic environments using the PID controller combined with deep reinforcement learning as well as corrective feedback based on heuristics. Similar to a recent study [28] that uses an adaptive learning navigation rule for UAV landing on a moving vehicle, the heuristics in this paper are expressed as rules based on the experience of a human expert. We note that there are many choices of reinforcement learning algorithms that can handle high-dimensional states and continuous actions, and we choose the DDPG algorithm without loss of generality. As a result, our method can automatically learn the optimal parameters of the PID controller so that the human operator is relieved from the heavy workload of manual parameter tuning. Compared with the previous work [15,16], our method has better generalization capability for landing with uncertain initial conditions, as well as reliable performance when landing on mobile ground platforms. In addition, our method is more efficient than the vision-based deep reinforcement learning methods [12,14] due to the use of heuristics for parameter tuning, which speeds up the learning process with immediate feedback rather than waiting for sparse rewards, as in many RL algorithms. Different from the interactive learning literature in which a human typically intervenes occasionally [21,23-25], our corrective feedback is available at every time step if needed for the PID controller.
From the perspective of designing an intelligent control system with respect to human-computer interaction, the main innovation of our work is that we have decoupled the UAV landing control problem using a hierarchical framework. Specifically, a low-level PID controller is responsible for providing fast reactive signals to control the speed of the upward rotors, while a high-level agent or human corrective feedback does not need to pay attention to the rotor control. However, the PID controller is known for its difficult parameter tuning, and human designers are usually needed to fine-tune the PID gains, which is a time-consuming and challenging task for the risky UAV landing problem. In this work, the gains are adapted by the high-level learning agent if the operating conditions change. To achieve the fine-tuning of the PID gains, the agent does not need to learn from scratch, as the human corrective feedback can regulate the agent's action selection. On the one hand, the human's knowledge about the landing task can be incorporated before the task starts to improve the safety of the UAV. On the other hand, the real-time feedback from a human can accelerate the convergence of the task learning process.
The remainder of the paper is organized as follows. Section 2 briefly introduces reinforcement learning. Section 3 proposes our approach, followed by experiments and results in Section 4. Finally, we conclude the paper in Section 5.

Reinforcement Learning
A Reinforcement Learning (RL) agent manages to find optimal actions in given states by maximizing the expected accumulated rewards through trial-and-error interaction with the environment. Typically, an RL problem can be described by five elements $S$, $A$, $P$, $r$ and $\gamma$, where $S$ denotes the state space with a specific state $s \in S$, $A$ denotes the action space with an action $a \in A$, $P$ represents the state-transition model, $r$ denotes the reward function, and $\gamma$ is a discount factor. $R_t$ stands for the accumulated reward received from the time step $t$ to $T$:
$$R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i).$$
The state-value function $V^\pi$ of a state $s_t$ following a policy $\pi$ is defined as the expected accumulated reward as follows:
$$V^\pi(s_t) = \mathbb{E}_\pi\left[R_t \mid s_t\right].$$
Similarly, the action-value function $Q^\pi$ of $(s_t, a_t)$ following a policy $\pi$ is defined as follows:
$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[R_t \mid s_t, a_t\right].$$
The expected reward $J(\pi)$ is an evaluation function of a policy $\pi$, defined as the expected accumulated reward from the first time step:
$$J(\pi) = \mathbb{E}_\pi\left[R_1\right].$$
The optimal policy $\pi^*(a_t|s_t)$ means that an optimal action $a^*$ would be selected in the state $s_t$, which maximizes the Q-value function as follows:
$$a^* = \arg\max_{a \in A} Q^{\pi^*}(s_t, a).$$
Q-learning [17] is a popular algorithm for finding the optimal action selection policy for discrete states and actions. Based on Q-learning, a variety of algorithms such as DQN [29], double DQN [30] and dueling DQN [30] have proved effective in solving high-dimensional problems.
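As a small numerical illustration of the accumulated reward $R_t$, the following sketch computes the discounted return of a finite reward sequence, both directly from the sum and with the equivalent backward recursion $R_t = r_t + \gamma R_{t+1}$. The function names are illustrative only and are not taken from the paper; rewards are indexed from 0 here for convenience.

```python
def discounted_return(rewards, gamma, t=0):
    """Direct evaluation of R_t = sum_{i=t}^{T} gamma^(i-t) * r_i."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)


def all_returns(rewards, gamma):
    """Return [R_0, ..., R_T] using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses the already computed R_{t+1}.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print(discounted_return(rewards, gamma=0.9))
    print(all_returns(rewards, gamma=0.9))
```

Both functions agree on every time step; the recursive form is the one that value-based methods exploit when they bootstrap from the value of the next state.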

Deep Reinforcement Learning
Many real-world control problems have to be solved in continuous state and action spaces. Function approximators have been used to represent the state-value and action-value functions in order to alleviate the curse of dimensionality. Neural networks have become a popular choice of function approximators, especially due to the power of deep neural networks such as Convolutional Neural Networks (CNNs). Accordingly, we can optimize the parameters $\theta^Q$ of a neural network by a loss function as follows:
$$L(\theta^Q) = \mathbb{E}\left[\left(Q(s_t, a_t|\theta^Q) - y_t\right)^2\right],$$
where
$$y_t = r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[Q(s_{t+1}, a_{t+1}|\theta^Q)\right].$$
If $\pi$ is an arbitrary deterministic policy, we describe it as a mapping from states to actions $\mu : S \to A$ and omit the expectation:
$$y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1})|\theta^Q).$$
Then, we define an actor function $\mu(s|\theta^\mu)$ as a mapping from every state to a particular action. The actor function represented by a neural network is updated based on the expected return $J(\pi)$ as follows:
$$\nabla_{\theta^\mu} J \approx \mathbb{E}\left[\nabla_a Q(s, a|\theta^Q)\big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_t}\right].$$
The DDPG [18] algorithm concurrently learns a Q-function and a policy using two neural networks, one for the actor and one for the critic. The actor network takes the current state as input and outputs an action in the continuous action space. The critic evaluates the current state and the action of the actor by calculating the corresponding Q-value. However, simultaneously updating the two neural networks is unstable and can cause divergence. Therefore, two additional target networks, one for the actor and one for the critic, are employed to generate the targets for computing the Temporal Difference (TD) errors during learning. As a result, the stability of the algorithm is increased.
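The target networks, denoted by $Q'(s, a|\theta^{Q'})$ and $\mu'(s|\theta^{\mu'})$, have the same structures as the critic and actor networks, respectively. In practice, a random disturbance is added to every action for exploration. After each action execution, the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in a replay buffer. When the replay buffer is full, the critic network is updated by minimizing the loss in Equation (8),
$$L = \frac{1}{B} \sum_{i=1}^{B} \left(y_i - Q(s_i, a_i|\theta^Q)\right)^2, \tag{8}$$
where $B$ is the size of a sampled batch and
$$y_i = r_i + \gamma\, Q'\left(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})\,\middle|\,\theta^{Q'}\right).$$
For every step, the actor network is updated as follows:
$$\nabla_{\theta^\mu} J \approx \frac{1}{B} \sum_{i=1}^{B} \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_i}. \tag{9}$$
Then, the target networks are updated from the current critic and actor networks according to Equation (10). After training with sufficient episodes, the converged target networks can be used to solve the problems.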
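The replay mechanism and the TD targets described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the class and function names are hypothetical, the target critic and target actor are passed in as ordinary callables standing in for the neural networks, and terminal states are not treated specially because the stored transition $(s_t, a_t, r_t, s_{t+1})$ carries no termination flag. The default capacity, batch size and discount factor follow the values reported in the experimental settings.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next) transitions (illustrative)."""

    def __init__(self, capacity=2000):
        # A deque with maxlen silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def is_full(self):
        return len(self.storage) == self.storage.maxlen

    def sample(self, batch_size=64):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.storage), batch_size)


def td_targets(batch, target_q, target_mu, gamma=0.9):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each sampled transition."""
    return [r + gamma * target_q(s_next, target_mu(s_next))
            for (_, _, r, s_next) in batch]
```

Sampling uniformly from the buffer breaks the temporal correlation between consecutive transitions, which is the usual motivation for experience replay in DDPG-style training.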

Reinforcement Learning with Corrective Feedback
As an RL agent typically requires trial-and-error interactions with the environment to collect sufficient experiences so as to optimize its control policy, learning from scratch requires exploring the entire state and action spaces, which can take quite some time.
Similar to the interactive learning framework COACH [25], we use corrective feedback in terms of a binary signal, i.e., to increase or decrease the action selected by the RL agent, to speed up the RL process (see Figure 1). The corrective feedback serves as guidance for action selection during reinforcement learning. In other words, the agent selects a primary action $a'_t$, and the feedback bias $a_h$ is added to decide the final action $a_t = a'_t + a_h$. It is expected that the human has a better understanding of how well the task is performed and, therefore, can provide an immediate positive or negative reward to generate appropriate action advice $a_h$ towards the optimal action, as shown in the literature [25][26][27]. However, human advice cannot be guaranteed to be always correct or accurately associated with the situations to be improved. In contrast, we design the corrective feedback as a module of heuristic rules that define when and how the actions should be biased (see Section 3.4). After the action $a_t$ is performed, the agent observes a reward $r_t$ and a new state $s_{t+1}$, and the data $(s_t, a_t, r_t, s_{t+1})$ is saved in a memory buffer $M$. Then, $M$ can be used by a deep reinforcement learning algorithm through the experience replay mechanism. Typically, function approximation can be used to represent the actor and critic. If $M$ is full, the critic and actor networks are updated once per episode. The parameters are updated in an online learning fashion.

Approach
In this section, we first introduce the UAV dynamics and present how we use the standard PID controller for UAV landing. Then, we explain how to combine PID with RL. Finally, we modify the learning-aided PID control with corrective feedback.

UAV Dynamics
In order for the UAV to land on the ground vehicle, the UAV estimates its relative position to the landing platform using a camera installed underneath the UAV. We use the North East Down (NED) frame and a body frame to describe the UAV landing process. Since the UAV is a rigid body, the NED frame $\{o_e, x_e, y_e, z_e\}$ is the inertial frame based on the earth, and $o_e$ denotes the center of the earth. The body frame $\{o_b, x_b, y_b, z_b\}$ is attached to the UAV fuselage, and $o_b$ indicates the mass center of the UAV.
To describe the rotational motion, we define the rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and the Euler angles $[\phi, \theta, \psi]^T$ that represent the roll, pitch and yaw, respectively. The rotation matrix can be obtained from the Euler angles, where the operators $s$ and $c$ denote $\sin(\cdot)$ and $\cos(\cdot)$ for the sake of simplicity. The kinematics of the UAV can be described as follows. Here we use $p = [p_x, p_y, p_z]^T$, $v = [v_x, v_y, v_z]^T$, and $\omega = [\omega_x, \omega_y, \omega_z]^T$ to denote the position, the linear velocity, and the angular velocity of the UAV according to the body frame, respectively. With regard to the acceleration, $g$ indicates the local gravitational acceleration, $m$ is the total mass, $T$ is the applied thrust along the vector $e_3 = [0, 0, 1]^T$, and $T_d$ represents the disturbance. The angular velocity can be calculated based on the Euler angles $\gamma = [\psi, \theta, \phi]^T$ and the attitude transition matrix $W$. In Equation (12), $J$ denotes the inertia matrix according to the UAV body frame, and $\tau$ and $\tau_d$ represent the applied torque and the disturbance torque, respectively.
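To make the relation between the Euler angles and the rotation matrix concrete, the sketch below builds $R$ under the common ZYX (yaw-pitch-roll) convention, $R = R_z(\psi) R_y(\theta) R_x(\phi)$, which maps body-frame vectors to the inertial frame. The choice of convention is an assumption made for illustration; it is widely used with NED frames but is not confirmed by the text above, and the function names are hypothetical.

```python
import math


def rotation_matrix(phi, theta, psi):
    """Body-to-inertial rotation R = Rz(psi) * Ry(theta) * Rx(phi) (assumed ZYX convention)."""
    s, c = math.sin, math.cos  # shorthand matching the s/c notation in the text
    return [
        [c(theta) * c(psi),
         s(phi) * s(theta) * c(psi) - c(phi) * s(psi),
         c(phi) * s(theta) * c(psi) + s(phi) * s(psi)],
        [c(theta) * s(psi),
         s(phi) * s(theta) * s(psi) + c(phi) * c(psi),
         c(phi) * s(theta) * s(psi) - s(phi) * c(psi)],
        [-s(theta),
         s(phi) * c(theta),
         c(phi) * c(theta)],
    ]


def rotate(R, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # A pure yaw of 90 degrees turns the body x-axis into the inertial y-axis.
    R = rotation_matrix(0.0, 0.0, math.pi / 2)
    print(rotate(R, [1.0, 0.0, 0.0]))
```

Whatever convention is used, the resulting matrix should be orthogonal with unit determinant, which is a convenient sanity check when implementing the kinematics.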

Baseline: Standard PID for UAV Landing
A PID controller provides a low-level control loop that calculates control actions based on the error signal $e(t)$, which is the deviation between the desired set-point and the current measurement. The structure of a standard PID controller is shown in Figure 2. The PID controller continuously corrects the output based on the three control parameters, i.e., proportional, integral and derivative gains, denoted by $k_P$, $k_I$ and $k_D$, respectively. The three terms are computed from the error signal, and the control signal $u(t)$ is obtained as follows:
$$u(t) = k_P\, e(t) + k_I \int_0^t e(\tau)\, d\tau + k_D\, \frac{de(t)}{dt}.$$
In this work, we employ velocity control for safe landing, and thus, the commands are sent to the UAV to adjust its velocities until it reaches the landing platform. Here the error signal $e(t)$ reflects the distance between the UAV and the centroid of the landing area, which is detected and localized using a vision-based method. The control variables are calculated based on the detected errors and the PID gains.
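A discrete-time version of the PID law above can be sketched as follows. This is a generic textbook discretization (rectangular integration and a backward-difference derivative), not the authors' implementation; the class name, the `set_gains` method and the sampling period `dt` are assumptions introduced for illustration. The `set_gains` method reflects the idea that the gains may be changed while the loop is running.

```python
class PID:
    """Minimal discrete PID controller: u = kp*e + ki*sum(e*dt) + kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # running approximation of the integral of e(t)
        self.prev_error = None   # no derivative estimate before the second sample

    def set_gains(self, kp, ki, kd):
        """Replace the gains online without resetting the controller state."""
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, error, dt):
        """Return the control signal for the current error sample."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


if __name__ == "__main__":
    # One controller per horizontal axis, as in velocity control along x and y.
    pid_x, pid_y = PID(0.5, 0.0, 0.1), PID(0.5, 0.0, 0.1)
    print(pid_x.step(error=1.0, dt=0.1), pid_y.step(error=-0.4, dt=0.1))
```

In a velocity-control setting, the returned value would be interpreted as the commanded velocity along the corresponding axis.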

PID with RL for UAV Landing
The framework of PID integrated with RL is shown in Figure 3. The framework consists of two modules. The RL module is shown within the blue dashed lines, and the PID module is shown within the green dashed lines. Denote by $p_{uav} = (p_x, p_y, p_z)$ the position of the UAV in the 3D world coordinate system, and $x = (p_x, p_y, p_z)$. The reference signal $Ref$ indicates the goal position $p_g = (p^g_x, p^g_y, p^g_z)$ of the UAV, i.e., the center of the horizontal surface of the ground vehicle. The state vector $s = (d_x, d_y, d_z)$ is three-dimensional, indicating the distances from $p_{uav}$ to $p_g$ in the x, y and z directions, respectively. The output of the PID controller is $u = (v_x, v_y)$, where $v_x$ and $v_y$ are the velocities in the x and y directions.
The action of the agent $a$ consists of the three PID parameters $k_P$, $k_I$, $k_D$ that can be adjusted by the RL module at any time step if needed. In this paper, we use PID to control the velocities in the x and y directions, assuming the velocity in the vertical z direction to be constant for safety reasons. In other words, the action is $a = (k^{v_x}_P, k^{v_x}_I, k^{v_x}_D, k^{v_y}_P, k^{v_y}_I, k^{v_y}_D)$. The reward function $r_t$ is defined as follows:
$$r_t = \begin{cases} 1, & \text{if the UAV lands successfully}, \\ -1, & \text{if the UAV fails}, \\ d_{t-1} - d_t, & \text{otherwise}, \end{cases}$$
where $d_t$ indicates the distance between the UAV and the goal position at the time step $t$. If the UAV reaches the target position and lands successfully, the reward is 1, and the episode ends. If the UAV fails, the reward is −1, and the episode also ends. Otherwise, the reward is the difference between the distance at the last time step and the distance at the current time step. We note that this reward function encourages fast motion towards the goal position and penalizes fast motion away from it. Due to the contribution of the RL module, the PID controller is expected to be more adaptive to changing situations.
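The reward described above translates directly into a small function. The sketch below is illustrative: the function names and the boolean flags `landed` and `failed` are assumptions, and how success and failure are detected (e.g., touchdown or losing sight of the marker) is left outside the function.

```python
import math


def distance_to_goal(p_uav, p_goal):
    """Euclidean distance d_t between the UAV position and the goal position."""
    return math.dist(p_uav, p_goal)


def landing_reward(prev_distance, distance, landed=False, failed=False):
    """+1 on success, -1 on failure, otherwise the distance reduction d_{t-1} - d_t."""
    if landed:
        return 1.0
    if failed:
        return -1.0
    return prev_distance - distance


if __name__ == "__main__":
    d_prev = distance_to_goal((0.0, 0.0, 4.0), (0.0, 0.0, 0.2))
    d_now = distance_to_goal((0.0, 0.0, 3.8), (0.0, 0.0, 0.2))
    print(landing_reward(d_prev, d_now))  # positive: the UAV moved closer
```

Because the shaping term is a difference of distances, the intermediate rewards along an episode sum to the total distance reduction, so moving away from the goal is penalized by exactly the amount that has to be recovered later.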

RL with Corrective Feedback for UAV Landing
Although the RL algorithm enables automatic parameter tuning of the PID controller, the learning process is time-consuming. We assume that the human is likely to have a good understanding of how the landing task should be carried out and can therefore provide heuristics to influence the action selection of the UAV towards faster learning of the optimal landing policy.
According to the experience of the human expert, the P-gains of the PID controller, i.e., $k^{v_x}_P$ and $k^{v_y}_P$, have a significant influence on the UAV landing task. Higher values of $k^{v_x}_P$ and $k^{v_y}_P$ may result in a greater change in speed in the x and y directions. When the UAV is far from the goal position, i.e., the error signal $e(t)$ is high, higher values of $k^{v_x}_P$ and $k^{v_y}_P$ are preferred for decreasing $e(t)$ faster. However, if the P-gain is too high (e.g., higher than 1.0), it might result in such a high velocity that the UAV would easily lose sight of the ground vehicle. On the other hand, if the P-gain is too small (e.g., smaller than 0.2), it would have little impact on the velocity change; therefore, the P-gain needs to be increased.
We illustrate the proposed approach of PID with DDPG [18] and corrective feedback in Algorithm 1. We note that many other reinforcement learning algorithms should also work with the method illustrated in Figure 3.

Algorithm 1 PID with DDPG and corrective feedback
1: Randomly initialize critic network $Q(s, a|\theta^Q)$ and actor network $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
2: Initialize target networks $Q'(s, a|\theta^{Q'})$ and $\mu'(s|\theta^{\mu'})$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
3: Initialize the replay buffer $M$
4: for episode = 1 to $N_1$ do
5:     Receive initial observation state $s_t$
6:     for t = 1 to $N_2$ do
7:         Select a primary action $a'_t = \mu(s_t|\theta^\mu)$ according to the current policy
8:         Receive corrective feedback $a_h$
9:         Select the action $a_t = a'_t + a_h$
10:        Update the parameters of the PID controller with $a_t$
11:        Observe the reward $r_t$ and the new state $s_{t+1}$
12:        Save the transition $(s_t, a_t, r_t, s_{t+1})$ in $M$
13:        Sample a random mini-batch of $(s_i, a_i, r_i, s_{i+1})$ from $M$
14:        Update the critic network using Equation (8)
15:        Update the actor policy using Equation (9)
16:        Update the target network using Equation (10)
17:    end for
18: end for
We assumed that the range of the action selected by the RL agent was (0, 0.6), and the corrective feedback was $a_h = -0.2$ or $a_h = 0.2$. As mentioned above, if the human considered that the velocity of the UAV could be increased, then $a_h = 0.2$. As a result, the UAV would accelerate towards the target position. Otherwise, if the human considered that the velocity of the UAV should be decreased, then $a_h = -0.2$. The heuristics described above were used to construct the rules of corrective feedback.
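One possible way to encode such heuristics as a rule-based feedback module is sketched below. It is a hypothetical illustration rather than the rule set used in the experiments: the thresholds 0.2 and 1.0 and the step size 0.2 come from the text, whereas the function names, the `far_threshold` parameter, the rule ordering and the choice to return zero when no rule fires are assumptions.

```python
FEEDBACK_STEP = 0.2   # magnitude of the corrective feedback a_h (from the text)
P_GAIN_LOW = 0.2      # below this the P-gain has little effect (from the text)
P_GAIN_HIGH = 1.0     # above this the UAV may lose sight of the marker (from the text)


def corrective_feedback(p_gain, error, far_threshold=1.0):
    """Return a_h for one P-gain proposed by the agent (hypothetical rule set).

    p_gain: primary P-gain selected by the RL agent.
    error: current error signal e(t) along the corresponding axis, in metres.
    far_threshold: assumed distance beyond which the UAV counts as 'far'.
    """
    if p_gain > P_GAIN_HIGH:
        return -FEEDBACK_STEP          # too aggressive: slow down
    if p_gain < P_GAIN_LOW:
        return FEEDBACK_STEP           # too timid: speed up
    if abs(error) > far_threshold and p_gain + FEEDBACK_STEP <= P_GAIN_HIGH:
        return FEEDBACK_STEP           # far from the goal and still room to speed up
    return 0.0                         # no correction needed at this step


def apply_feedback(primary_action, feedback):
    """Final action a_t = a'_t + a_h."""
    return primary_action + feedback
```

Since the rules only depend on the proposed gain and the current error, the feedback can be evaluated at every time step without a human in the loop, which is what distinguishes this module from occasional human intervention.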

Environmental Settings
We first carried out the quadrotor UAV landing task in a simulated environment using the Gazebo simulator [31] (see Figure 4). The UAV was controlled by the ROS package [32]. The velocity of the UAV in the z direction was set to 0.2 m/s by default.
The ground vehicle could move forward and backward and turn at a certain angle. The size of the ground vehicle (0.6 m × 0.8 m × 0.2 m) was larger than that of the UAV (0.4 m × 0.4 m) to leave enough space for landing. We attached a designed marker (0.6 m × 0.8 m) to the top horizontal surface of the mobile ground vehicle. It was recognized by the UAV's downward-facing camera for the purpose of detection and estimation. The marker had smaller circular patterns at its center, used for localization when the UAV was close to the ground vehicle.
In this work, since the UAV needs to land on the ground vehicle, we develop a vision-based method to detect the landing platform. As shown in Figure 4, a designed landmark is placed on the surface of the platform for the UAV to recognize. In order to achieve lightweight vision-based detection, the relative position of the platform is estimated based on the circle of the landmark. We first convert the RGB images captured by the UAV camera to the HSV color model so as to eliminate all colors except blue. Then, the HSV mask converts the RGB image to a grey image, and a binary image of the landmark is obtained by threshold segmentation of the grey image. Finally, we identify the circular feature and estimate the center $(x_c, y_c)$ and the diameter of the detected circle. In addition, the altitude of the UAV can be calculated based on the focal length of the UAV's camera and the size of the detected circle in the UAV's camera view.
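The altitude estimate mentioned at the end of the detection pipeline follows from the pinhole camera model: an object of real diameter $D$ that appears with diameter $d$ pixels under a focal length of $f$ pixels lies at distance $Z = f D / d$. The sketch below illustrates this relation; the function names and the numerical values are assumptions for illustration, and lens distortion and camera tilt are ignored.

```python
def estimate_altitude(focal_length_px, real_diameter_m, detected_diameter_px):
    """Pinhole model: distance to the marker plane, Z = f * D / d."""
    if detected_diameter_px <= 0:
        raise ValueError("detected diameter must be positive")
    return focal_length_px * real_diameter_m / detected_diameter_px


def pixel_offset_to_metres(offset_px, altitude_m, focal_length_px):
    """Convert an image-plane offset of the circle center into a metric offset."""
    return offset_px * altitude_m / focal_length_px


if __name__ == "__main__":
    # Hypothetical numbers: 500 px focal length, 0.2 m circle seen as 50 px wide.
    z = estimate_altitude(500.0, 0.2, 50.0)
    dx = pixel_offset_to_metres(25.0, z, 500.0)
    print(z, dx)
```

The same scale factor $Z / f$ converts the pixel offset of the detected center $(x_c, y_c)$ from the image center into horizontal distances, which is one way the state components along x and y could be obtained.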
For the three learning-based approaches, i.e., RL (DDPG), PID with RL (RL-PID) and PID with RL and corrective feedback (RLC-PID), we trained the agent for 400 episodes, where the initial position was $p_0 = (0, 0, 4.0)$ and the goal position was $p_g = (0, 0, 0.2)$. We expected that the UAV could always keep track of the marker for localization. If the marker was out of sight, the episode was considered a failure, and the UAV started a new episode. The ground vehicle was assumed static during training for the purpose of faster policy learning. Then, it was allowed to move during testing to compare the performance of the controllers in dynamic situations. We note that the UAV hovered for a while before landing towards $p_g$. Due to the hovering error, the actual initial position during training was within a radius of about 0.1 m around $p_0$ in the three-dimensional space. The introduced uncertainty made the problem more challenging than landing from exactly the same initial position.
The parameters of DDPG were set empirically as follows. The learning rates for both the actor and critic networks were 0.0001. The target network was updated every 100 time steps. The discount factor of the reward was 0.9. The size of the memory buffer $M$ was 2000, and the mini-batch size was 64.

Training in the Simulation Environment
The success times and training time (in minutes) were compared among RL (DDPG), RL-PID and RLC-PID in Figure 5. The RL-PID method succeeded more often than the RL method, and the RLC-PID method was even better, with a near 100% success rate. We note that the RL approach resulted in many failures in which the UAV lost track of the marker; therefore, it was terminated earlier and took less time than RL-PID. The required training time of RLC-PID was also the shortest. The reason is that the RL module encouraged the UAV to optimize the PID parameters for fast learning of a stable landing policy. In addition, the corrective feedback can further speed up the parameter optimization process. In order to demonstrate the stability and convergence of the proposed method, we compared the accumulated reward of RL, RL-PID and RLC-PID in Figure 6, and we also compared the loss of RL, RL-PID and RLC-PID in Figure 7. The RL approach had the lowest reward, and RL-PID was close to RLC-PID during the training. Finally, the loss of the three approaches was close to zero, indicating that the learned policies became stable, although this does not guarantee the quality of the policies.

Testing in the Simulation Environment

Testing with a Static Vehicle
We tested the learned controllers together with the PID controller in a static scenario, with two initial landing conditions $p_0 = (0.2, 0.2, 4.0)$ and $p_0 = (1.0, 1.0, 4.0)$. The results of success times are compared in Table 1. They show that the PID parameters of RLC-PID resulted in the best performance among the compared approaches. The condition $p_0 = (0.2, 0.2, 4.0)$ was relatively easy because it was close to the condition $p_0 = (0, 0, 4.0)$ used for training. In other words, the marker was close to the center of the field of view (FOV) of the UAV and easily tracked by the UAV. Accordingly, PID, RL-PID and RLC-PID solved it with high success rates, whereas RL alone failed many times. In contrast, the condition $p_0 = (1.0, 1.0, 4.0)$ was more difficult as the marker was close to the boundary of the UAV's FOV. In other words, the UAV would lose track of the marker if it flew in the wrong direction. As a result, the performance of PID and RL-PID dropped almost by half while RLC-PID still maintained high performance.
The trajectories of PID, RL, RL-PID and RLC-PID were also compared in Figure 8. We note that the RLC-PID approach encouraged a circular landing pattern compared with other approaches that showed longer trajectories of a vertical landing pattern. In other words, RLC-PID suggested the UAV speed up in the x and y directions in the beginning when the UAV was far from the goal location, and it suggested the UAV slow down in the end when it was close to the destination. The PID parameters were compared for PID, RL-PID and RLC-PID (see Figure 9).

Testing with a Moving Vehicle
We tested 100 episodes for UAV landing on the moving ground vehicle from $p_0 = (0, 0, 4.0)$ using the PID, RL-PID and RLC-PID methods, in which the vertical speed of the UAV was set to $v_z = 0.1$ m/s and $v_z = 0.2$ m/s, respectively. The moving velocity of the ground vehicle was set to 0.1 m/s, but this information was unknown to the UAV, and the ground vehicle could occasionally move backward during the experiment. The PID parameters were set empirically as in the previous section. The task settings were the same as in the static landing task, except that the ground vehicle was allowed to move back and forth in a straight line. Thus, it was more difficult for the UAV to land on the moving vehicle because the environment was changing with uncertainty. The results of success times were compared in Table 2, which lists the success times of each approach over 100 tests. It illustrates that the PID parameters of RLC-PID resulted in the best performance among the three approaches. Neither the PID nor the RL approach alone could reliably solve the landing task. For the RL-PID approach, the vertical speed of the UAV had a great influence on the success rate. Figure 10 illustrates the trajectories of the UAV and the UGV in the experiments, where the blue cross represents the initial position of the UAV, the red line indicates the landing trajectory of the UAV, the green line represents the moving trajectory of the ground vehicle, and the purple dot represents the final position of the ground vehicle. An intuitive finding is that the UAV's landing trajectory under our approach is smoother, and it can follow the motion of the ground vehicle. Figure 11 shows how the PID parameters were adapted for RL-PID and RLC-PID during the experiment. Similar to the testing results in the static vehicle setting, $k^{v_x}_P$ and $k^{v_y}_P$ of RL-PID kept increasing monotonically throughout the experiment, while $k^{v_x}_P$ and $k^{v_y}_P$ of RLC-PID changed more rapidly at the beginning and more slowly afterward. As a result, the PID parameters had a significant influence on the landing trajectories and the success rates of the landing task.

Real-World Experiments
The models and parameters of the RLC-PID approach trained in the simulation were directly transferred to real-world experiments without any modification. The settings of the real-world experiments were similar to the simulation. Due to safety concerns, we first tested the learned RLC-PID controller with a static landmark five times (see Figure 12). Here, the landmark was exactly the same as in the simulation environment. In all five tests, the UAV successfully landed on the landmark, and the final landing locations were all close to the center of the landmark. Then, we used a movable landmark pulled by two human operators through two strings to imitate the case of landing on a mobile platform (see Figure 13). We note that the UAV had no information about the moving directions of the landmark. During the five tests, the landmark was pulled back and forth in random directions, and the results show that the UAV also successfully landed on the landmark, but the final landing positions had a larger deviation from the center position of the landmark compared with the static cases. Finally, we tested the landing performance on a real mobile vehicle (see Figure 14). In the experiment, the moving velocity of the UGV was set to 0.1 m/s, but its moving direction was uncertain. We can also see from the figure that since the UAV needed to track and minimize the distance to the center of the landmark, its trajectory reflected the movement of the ground vehicle to some extent. In all five tests, we found that the UAV still managed to land on the ground vehicle successfully, but the final positions were close to the boundary of the landmark. In comparison with the indoor experiments, the outdoor experiments were affected by wind and airflow. Therefore, the landing accuracy was slightly worse than in the indoor experiments.

Conclusions and Future Work
In this paper, we have proposed an autonomous UAV landing approach by combining the advantages of the traditional PID control method and reinforcement learning. Specifically, we have designed an RL-PID framework that allows the RL module to adaptively adjust the parameters of the PID module in an online fashion. In addition, we have used corrective human feedback to provide immediate rewards to speed up the learning process. In both simulation and real-world experiments, we have demonstrated the effectiveness of the proposed RLC-PID algorithm in terms of success rate. The models and parameters of the RLC-PID controller trained in simulation could be directly transferred to real-world experiments without much fine-tuning. In future work, we will incorporate online human intervention into our framework and develop a more sophisticated credit assignment mechanism.

Figure 2. Standard structure of a PID controller.

Figure 3. The framework of PID with RL.

Figure 4. A quadrotor UAV landing task in the simulation environment. (a) Environmental setting. (b) Recognized marker on the mobile vehicle.

Figure 9. PID parameters when landing on a static ground vehicle during testing. Both $k^{v_x}_P$ and $k^{v_y}_P$ of PID remained around 1.0 according to human experience, while $k^{v_x}_P$ and $k^{v_y}_P$ of RL-PID kept increasing monotonically driven by the RL module.

Figure 11. PID parameters when landing on a moving ground vehicle during testing.

Figure 12. Real-world UAV landing on a static landmark.

Figure 13. Real-world UAV landing on a movable landmark.
(a) Early stage of landing. (b) Late stage of landing.

Figure 14. Real-world UAV landing on a mobile ground vehicle.
Figure 1. RL with corrective feedback based on the human experience of the task.

Table 1. Success times of landing on a static ground vehicle during testing, $N_{test} = 100$.

Table 2. Success times of landing on a moving ground vehicle during testing, $N_{test} = 100$.