An Adaptive Multi-Level Quantization-Based Reinforcement Learning Model for Enhancing UAV Landing on Moving Targets

Abstract: The autonomous landing of an unmanned aerial vehicle (UAV) on a moving platform is an essential functionality in various UAV-based applications. It can be added to a teleoperated UAV system or be part of an autonomous UAV control system. Various robust and predictive control systems based on traditional control theory are used for operating a UAV. Recently, some attempts have been made to land a UAV on a moving target using reinforcement learning (RL). Vision is the typical way of sensing and detecting the moving target. The related works have mainly deployed a deep neural network (DNN) for RL, which takes the image as input and provides the optimal navigation action as output. However, the delay of the multi-layer topology of the deep neural network affects the real-time aspect of such control. This paper proposes an adaptive multi-level quantization-based reinforcement learning (AMLQ) model. The AMLQ model quantizes the continuous actions and states to directly incorporate simple Q-learning and resolve the delay issue. This solution makes the training faster and enables simple knowledge representation without needing the DNN. For evaluation, the AMLQ model was compared with state-of-the-art approaches and was found to be superior in terms of root mean square error (RMSE), achieving an RMSE of 8.7052 compared with the proportional-integral-derivative (PID) controller, which achieved an RMSE of 10.0592.


Introduction
Unmanned aerial vehicle (UAV) applications are increasing daily and are part of many recent technological applications. Some examples of UAV or drone applications are shipping [1], surveillance [2,3], battlefield operations [4], rescue applications [5,6], inspection [7,8], tracking [3], etc. One of the appealing applications of UAVs is traffic sensing [9] and congestion estimation and/or detection [10], which is beneficial for intelligent transportation systems (ITS) [11]. The UAV has the advantage of having no humans on board, making it a flexible and desirable platform for exploring and applying new ideas. Subsequently, UAV technology has opened the door to many sustainability-related studies, including agriculture, air quality, fire control, pollen counting, etc.
The UAV's control system is categorized into three parts: teleoperated [12,13], semiautonomous [14,15], and fully autonomous [7,16]. Each category defines the involvement level of humans in UAV flight control and daily activities. The sustainability of UAV applications requires essential autonomous features that provide a high degree of autonomy in UAV systems. These autonomous features improve the UAV's performance in complex environments and sustain its safety. One essential autonomous feature is auto-landing on a moving target. In various applications, such as ground-aerial vehicle collaboration, aerial vehicles need to identify a certain landing area. This functionality has to be autonomous because of the challenging aspect of teleoperation when landing on a moving target. In addition, there is a risk of failure, which might cause damage to the aerial vehicle and other property. It is necessary to include the functionality of autonomous landing in all categories of UAV operation, even in the teleoperation category. Hence, the landing of UAVs on moving targets is an essential function in robotics competitions [17,18]. UAVs have a limited flight time. When the targeted task is far from the ground station, one solution is to use a mobile carrier such as a truck, helicopter, or ship. For example, landing on a moving ship deck requires identifying the landing spot and assuring a safe and precise landing while the ship is moving [19].
The mathematical function of the plant is important for ensuring a reliable controller in nonlinear and dynamic control systems. The controller's stability is assessed using complex mathematical approaches and techniques. The accuracy of the mathematical model of the plant is questionable in many real-world applications. Engineers have also used mathematical approximations to make model development easier. These approximations are based on assumptions limiting the controller's generalizability, resulting in difficulties in application and reliability. The concept of model-free control has been utilized to avoid such approximations and invalid assumptions. Instead of utilizing it to tune a simplified controller through repeated trial and error, it can be used to construct an accurate controller that incorporates enough plant knowledge [20].
Reinforcement learning (RL) is a sort of artificial intelligence (AI) based on model-free control. It has proven to be a useful and effective control method in nonlinear and highly dynamic systems, especially when proper modeling is difficult. Furthermore, combining RL with a deep neural network for video analysis and decision making based on lengthy training has made its way into the automotive industry and driverless automobiles as a valuable AI product [21]. It has also found its way into UAV control [22]. The reason relates to the ability to train the RL model based on an extensive number of driving scenarios and then use the learned knowledge for operation. Hence, RL is considered one type of model-free control, as it does not need to build a model for the control. One appealing application for RL is autonomous landing on a moving target because of its non-linearity and the lack of an accurate plant model. The plant is inaccurate because of various environment- and platform-dependent factors. Furthermore, autonomous landing on a moving target includes a dynamic aspect that makes the problem more challenging. Q-learning is used due to its simplicity in terms of preserving the knowledge as a table of states and their corresponding actions for optimizing the reward. Unlike more advanced deep Q-learning methods, Q-learning does not need a neural network to preserve the knowledge, which makes it applicable on limited-resource hardware, such as that which exists in drones.
The problem of the autonomous landing of a quadrotor on a moving target is a nonlinear control problem with a dynamic nature. The non-linearity comes from various aspects, e.g., the quadrotor's nonlinear kinematic model, the motors' nonlinear response, the curvature, the geometric trajectory, etc. The dynamical aspect also comes from various sources, e.g., the dynamics of the target mobility, the air disturbance, the battery changes, etc., as well as the difficulty in achieving a full mathematical description of the various blocks in the system, including the quadrotor, the environment, and the target. Hence, solving this problem using a model-based approach is ineffective compared to the model-free approach. Therefore, we propose reinforcement learning (RL) as a control algorithm for accomplishing the smooth control of a quadrotor landing on a moving target while maintaining various dynamic requirements of control performance. Incorporating RL as an approach to control the landing process requires a special type of modeling of the interface between the RL agent and the environment. The modeling includes the quadrotor's low-level control commands, the quadrotor itself, and its geometrical and dynamic relation with the moving target.
The early works of UAV landing were focused on landing in a safe area representing a stationary target. An example is the study in [23]. The fusion between inertial sensing and vision was performed to build a map of the environment. Next, the landing was performed with the assistance of the map. Other approaches have also used sensor fusion, but between the Global Positioning System (GPS) and the inertial measurement unit (IMU) in an outdoor environment, such as the work in [24]. Similarly, Ref. [25] proposed a fusion between differential GPS and onboard sensing of a hexacopter for outdoor landing. Infrared lamps have been used for guiding the UAV based on a vision system to perform landing on a stationary area [26]. This work was applied to a fixed-wing aerial vehicle, similar to the work in [27], where stereo vision was used with a global navigation satellite system (GNSS) with fusion under the Kalman filtering model. In addition, many classical works have been built based on visual servoing for control and optical flow for perception. The authors of [28] proposed the autonomous navigation of UAVs using an adjustable autonomous agent in an indoor environment. The sensing mainly depended on proximity sensors and optical flow mechanisms.
The literature includes numerous works related to the development of an autonomous landing of an aerial vehicle on a moving target. Some of them were based on classical or modern control, while others were based on RL. One important component in the landing works is the Kalman filter incorporation for tracking the target [29]. Some algorithms have used sliding mode control, such as the work in [30], where sliding mode control was combined with a 2D map representation of the target. Recently, more interest has been shown towards the use of RL-based landing on a moving target. Deep Q-learning was used the most for a single drone [31] and multiple UAVs [32]. Similarly, in the work in [33], an approach for tracking the mobility of moving targets using a camera was developed. The approach was based on an extended Kalman filter and visual-inertial data. For the target detection, the AprilTag was used. The approach concentrated more on tracking the moving target without attention to the landing control. Other approaches have adopted deep reinforcement learning to handle the continuous nature of control.
In the work in [34], marker alignment and vertical descent were decomposed as two discrete jobs during the landing. Furthermore, the divide-and-conquer paradigm was employed to divide the tasks into two sub-tasks, each of which was allocated to a deep Q network (DQN). In [35], an RL framework was combined with Deep Deterministic Policy Gradients (DDPG). The technique considered tracking in X and Y as part of the reinforcement control, while Z was handled separately. Furthermore, the work developed a rewarding function that did not take enough dynamics into account, limiting the applicability of the strategy to simple landing procedures. In [36], a sequential deep Q network was trained in a simulator to deploy it to the real world while handling noisy conditions.
In [37], the least-square policy iteration was used to produce an autonomous landing based on RL. The target was assumed to be stationary, and the rewarding function employed two terms with adaptive weighting: one for position error and the other for velocity error. The weights were assumed to change exponentially with the error so that when the error was high, the position error obtained more weight, and when the error was small, the velocity error obtained more weight. The authors did not cover the quantization of velocity and location in their paper. In [38], Kalman filtering and reinforcement learning were proposed for image-based visual servoing. This research demonstrated the importance of including velocity inaccuracy in the reward function, as well as the efficacy of asymmetric rewards.
Overall, none of the previous approaches proposed autonomous landing on a moving target based on standalone RL to address quantization issues. Basically, quantization leads to a tradeoff between high computation (slow convergence) and low accuracy (less gained knowledge). In order to avoid this tradeoff, we propose a novel type of action quantization that we call adaptive quantization. It uses a feedback loop from the target to change the magnitude of the action. This feedback enables an adaptive change of the action according to the state and provides a compact representation of the Q-matrix. Another essential matter in solving the problem is the dynamic incorporation of the rewarding. More specifically, using a fixed weighted average reward formula makes the system controllable only over a small interval. However, incorporating an adaptive weighted average rewarding formula enables a wide interval of controllability.
Consequently, this article proposes a novel definition of the elements of RL based on the intended goal. The definitions of the state, actions, and reward can be proposed according to the nature of the problem to be solved. The article uses Q-learning, which is a special type of RL that uses a recursive equation to update the Q-values of various state-action associations. Q-learning is based on the dynamic programming updating equation named the Bellman equation. This equation was selected because of its simplicity and sufficiency in performing the iterative process of updating the Q-value. It yields, after training, the criterion for selecting the best action from a set of candidate actions to move the UAV toward the target for landing in an optimized way. The approach of defining the reward of each action selection based on the next state leads to the definition of the maximum accumulated reward, or the Q-value, accomplishing the optimization of the control performance metrics.
Considering that the RL model contains continuous states and actions, a quantization is needed to preserve the discrete nature of the problem. However, the system will fall into the problem of a slow convergence-low accuracy tradeoff. To avoid this problem, adaptive quantization is proposed by using environmental feedback to change the quantization level. Subsequently, an Adaptive Multi-Level Quantization-based Reinforcement Learning (AMLQ) model for autonomous landing on moving targets is simulated in this study. The study has resulted in the following contributions:

1. A novel RL-based formulation of the problem of autonomous landing based on Q-learning through defining states, actions, and rewards;
2. An adaptive quantization of actions that relies on a compact type of Q-matrix, useful for the fast convergence of training and high gained knowledge while preserving the aimed accuracy;
3. A thorough evaluation process comparing various types of RL-based autonomous landing using several types of target mobility scenarios and comparing them with classical proportional-integral-derivative (PID)-based control.
This paper is organized as follows. The methods are provided in Section 2. Next, the results and discussion are illustrated in Section 3. Lastly, the conclusion and future works are summarized in Section 4.

Methods
This section presents the developed methodology of the autonomous control of the quadrotor on a moving target. It gives the problem formulation and defines the Q-learning elements: state, reward, and action. Next, the model of the Q-table update and the Q-learning control algorithm are presented. The terms and symbols that we use throughout this paper are shown in Table 1. Among the parameters are the granularities g_p and g_v, which represent the resolution of decomposition based on the quantization. Increasing the quantization means a smaller number of decompositions and, consequently, less granularity.
Table 1. Terms and symbols (excerpt): (v_xd, v_yd, v_zd), the velocity of the UAV; (x_ta, y_ta, z_ta), the position of the target; the final error of the UAV with respect to the position; the final error of the UAV with respect to the velocity; the coefficients of the adaptive linear action model; the coefficients of the adaptive exponential action model.

Autonomous Landing
The classical way of solving the problem of autonomous landing is the use of a PID controller to control each of the three coordinates of the aerial vehicle, as well as its heading. A conceptual diagram of the autonomous landing problem is presented in Figure 1. The diagram shows the actual axes and the calculated axes, which are denoted by the * symbol, while C denotes the planned direction. This landing approach is successful for a stationary target after careful tuning. However, in the case of moving targets, a dynamical and nonlinear component is added to the plant, which takes it beyond traditional PID control [39].
We used the Q-learning algorithm, which is based on defined states, a termination state, actions, a rewarding function, a transition function, a learning rate, and a discounting factor. The output of the algorithm is the Q-matrix. As we show in Algorithm 1, the algorithm starts by randomly initiating the values of the Q-matrix. Next, it runs for a certain number of iterations until convergence. We call each iteration an episode. Each episode starts from a random state and keeps running until reaching the final state (or the termination state). It selects an action based on the policy derived from the Q-matrix. It enables the action, which makes the system move from its current state toward the next state. It measures the reward and uses this measurement to update the Q-matrix. Whenever the system reaches the final state, it returns to the outer loop representing the episodes.

Algorithm 1 Pseudocode of the Q-learning algorithm
Step-1: Initiate Q : X × A → Q_0 with arbitrary values.
Step-2: Check for convergence; if NOT converged, continue; otherwise, go to Step-3.
  Step-2.1: Select a random state s from X.
  Step-2.2: Check if the state has reached the termination state s_end; if NOT, continue; otherwise, go to Step-2.
    Step-2.2.1: Select action a based on the policy π(s) and the exploration strategy.
    Step-2.2.2: Find the next state based on the transition T.
    Step-2.2.3: Receive the reward r.
    Step-2.2.4: Update the value of Q using α and γ.
    Go to Step-2.2
Step-3: End
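As an illustration, the steps of Algorithm 1 can be sketched as a generic tabular Q-learning loop. The toy corridor environment, reward values, episode budget, and ε-greedy exploration below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def q_learning(n_states, n_actions, transition, reward, s_terminal,
               alpha=0.1, gamma=0.9, epsilon=0.2, episodes=500, rng=None):
    """Generic tabular Q-learning following Algorithm 1."""
    rng = rng or np.random.default_rng(0)
    Q = rng.random((n_states, n_actions))      # Step-1: arbitrary initiation
    for _ in range(episodes):                  # Step-2: fixed budget stands in for a convergence test
        s = int(rng.integers(n_states))        # Step-2.1: random start state
        while s != s_terminal:                 # Step-2.2: run until the termination state
            if rng.random() < epsilon:         # Step-2.2.1: exploration strategy
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))       # greedy policy pi(s)
            s_next = transition(s, a)          # Step-2.2.2: transition T
            r = reward(s, a, s_next)           # Step-2.2.3: reward r
            # Step-2.2.4: Bellman update with learning rate alpha, discount gamma
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q

# Toy 1-D corridor: 6 states, action 1 moves right, action 0 moves left, goal at state 5.
T = lambda s, a: min(5, max(0, s + (1 if a == 1 else -1)))
R = lambda s, a, s2: 1.0 if s2 == 5 else -0.01
Q = q_learning(6, 2, T, R, s_terminal=5)
```

After training, the greedy policy derived from the Q-matrix moves right toward the goal from every non-terminal state.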
The autonomous landing problem on a moving target assumes a quadrotor moving in an indoor environment with p = (x_d, y_d, z_d) ∈ R³ as the position and v = (v_xd, v_yd, v_zd) ∈ R³ as the velocity. It aims to land on a moving target with the position (x_ta, y_ta) ∈ R² and the velocity (v_xta, v_yta) ∈ R². The landing is a control problem that brings e = (x_d − x_ta, y_d − y_ta, z_d) to (0, 0, 0). The error measures the accuracy of the landing control after (v_xd, v_yd, v_zd) = 0. Figure 2 shows the quadrotor body frame {D} and reference frame {R}. The quadrotor Euler angles (roll φ, pitch θ, and yaw ψ) in D and R describe the rotations of the quadrotor [40]. The positions and velocities of the quadrotor are used to calculate the quadrotor position and estimate the target position [39]. The autonomous landing includes tracking the target in the XY plane and landing on the moving target.

Q-Learning
The elements of Q-learning are states, action, and reward. We present each of them in the following.

State
The state describes the relative difference between the UAV and the target. The same target with fixed dimensions was used in all experiments. The state is described as a vector of relative differences with respect to position and velocity in two dimensions, x and y:

s = (x_d − x_ta, y_d − y_ta) (1)
Moreover, we included the third coordinate z_d as a static distance range between the UAV and the target. The autonomous landing is activated when the UAV is flying within this range. Subsequently, the target is assumed to be moving within one plane and has the same value of z_td. Hence, the phi/theta angles are fixed during the autonomous landing process, and the flight control focuses on the dynamic changes of x_d and y_d only to reduce the process complexity. The vision, with the assistance of a marker located on the moving target, is needed to determine the navigation states of the UAV. In order to calculate the state, we used the AprilTags code, which provides 3D relative information about the position of a target with respect to the UAV. In addition, the state is quantized using a quantization vector q_s = (q_1 q_2 . . . q_7) for each of the x and y components. The quantization factor is introduced by dividing x_d − x_ta or y_d − y_ta by the quantization parameter. The result is used after rounding to indicate the value of the component of the state. The quantization leads to a number of states equaling 7 × 7 = 49. The parameters q_i, i = 1 . . . 7 are not uniform, to counter the non-linearity aspect of the model. They can be set by tuning to obtain the adjacency matrix S = [s_ij] = [q_ij], i, j = 1, 2, . . . , 7.
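The 7-level, non-uniform quantization of each axis can be illustrated as follows. The breakpoint values are illustrative assumptions, not the tuned q_i from the paper; the only grounded choices are the 7 levels per axis and the 7 × 7 = 49 combined states:

```python
import numpy as np

# Six non-uniform edges (assumed values, in metres) split each axis error into
# seven cells, denser near zero where fine control matters most.
EDGES = np.array([-1.0, -0.4, -0.1, 0.1, 0.4, 1.0])

def quantize_axis(err):
    """Map a continuous relative error (x_d - x_ta or y_d - y_ta)
    to a discrete level 0..6, mirroring the 7-level quantization."""
    return int(np.digitize(err, EDGES))

def state_index(ex, ey):
    """Combine the two axis levels into one of the 7 x 7 = 49 states."""
    return quantize_axis(ex) * 7 + quantize_axis(ey)
```

For example, a UAV hovering exactly above the target maps to the central cell, while large errors saturate at the outermost levels.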

Action
The command that moves the UAV along the x-axis based on tilting with respect to the pitch angle is denoted by c_x. The UAV receives a normalized value for this action as a command of tilting with a positive or negative value, which causes the UAV to move in a positive or negative direction along the x-axis. The set of possible actions in the Q-learning is given as two sets; the first set includes two vectors, namely, c_x and c_y:

c_x = [−c_{n_x} −c_{n_x−1} . . . −c_2 −c_1 0 c_1 c_2 . . . c_{n_x}]
c_y = [−c_{n_y} −c_{n_y−1} . . . −c_2 −c_1 0 c_1 c_2 . . . c_{n_y}]

where c_y denotes the command that moves the UAV along the y-axis based on tilting with respect to the roll angle. The UAV will receive a normalized value for this action as a command of tilting with a positive or negative value, which causes the UAV to move in a positive or negative direction along the y-axis.
For the second set, we provided two actions: a control signal that moves the UAV along the z-axis, c_ż, and a control signal that rotates the UAV around the z-axis, c_ωz. The c_ż will be responsible for the taking-off and landing process, while the c_ωz will be responsible for searching for the target for the first time or in the case of target loss, in order to trigger the autonomous landing process. Figure 3 visualizes the required movements to be performed by the UAV based on the dynamics of the instant position of the moving target [40].



Q-Table Representation
The presentation of the Q-table is determined based on the number of rows, which equals the number of states n_x × n_y. The number of actions equals n_cx × n_cy, which represents the number of columns. We present the Q-table in the form Q = [q_ij], i = 1, . . . , n_x × n_y, j = 1, . . . , n_cx × n_cy.
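A minimal sketch of this representation follows. The paper fixes 7 position levels per axis; the per-axis action counts n_cx = n_cy = 5 are illustrative assumptions:

```python
import numpy as np

# Assumed dimensions: 7 position levels per axis (49 states) and
# 5 discrete commands per axis (25 action pairs, illustrative).
n_x = n_y = 7
n_cx = n_cy = 5

Q = np.zeros((n_x * n_y, n_cx * n_cy))   # rows: states, columns: actions

def row(ix, iy):
    """State pair (ix, iy) -> row index in the Q-table."""
    return ix * n_y + iy

def col(jx, jy):
    """Action pair (jx, jy) -> column index in the Q-table."""
    return jx * n_cy + jy

best_action = int(np.argmax(Q[row(3, 3)]))  # greedy column for the centred state
```

Flattening the two axis indices into a single row or column index keeps the Q-table a plain 2D matrix, which is the compact representation the model relies on.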

Reward
The reward represents the ranking of the various action-state associations used to select the action that provides the maximum accumulated reward until the goal is reached. The design of the reward is critical to the performance of the system. In order to calculate the reward, we defined the reward function as given in the equation below.
We can observe from the equation that the weighting of the reward is higher for the position term (α_1) when the UAV is far from the target. However, when the UAV moves closer to the target, the weighting is higher for the velocity term (α_2).
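Since the reward equation itself is not reproduced here, the following is only a minimal sketch of the adaptive weighting idea, assuming a weighted sum of position and velocity error magnitudes with complementary exponential weights; the functional form and gain k are assumptions, not the paper's tuned formula:

```python
import math

def reward(ex, ey, evx, evy, k=1.0):
    """Sketch of an adaptive weighted reward: alpha1 (position term)
    dominates far from the target, alpha2 (velocity term) dominates
    close to it. The exponential weighting and gain k are assumed."""
    pos_err = math.hypot(ex, ey)
    vel_err = math.hypot(evx, evy)
    alpha1 = 1.0 - math.exp(-k * pos_err)   # -> 1 when far, -> 0 when close
    alpha2 = 1.0 - alpha1                   # complementary weight
    return -(alpha1 * pos_err + alpha2 * vel_err)
```

Far from the target the reward penalizes mostly position error, so the agent closes the distance first; near the target it penalizes mostly relative velocity, encouraging a matched-speed descent.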

Terminating State
The termination state is defined as the state of landing. The landing is completed when the UAV coordinates are equal to the target's coordinates and the UAV has zero velocity. In Q-learning, each episode ends when the system reaches the termination state.

Q-Table Update
The Q-table preserves the built knowledge of the system. It combines entries for states and actions while embedding the current Q-value for each association of states and actions. Observing the definitions of the states and actions shows an infinite number of values due to their continuous nature. However, we quantized the states and actions to define a finite-size Q-table. The quantization is based on a pre-defined value of resolution or granularity. For the state, (x_d − x_ta, y_d − y_ta) ∈ R², if we define a granularity factor g_s for the states, the quantized state becomes ((x_d − x_ta)/g_s, (y_d − y_ta)/g_s) ∈ Z².
In addition, for the action, we defined a granularity factor g_a, which leads to an action component (c_x/g_a, c_y/g_a) ∈ Z². A computation-performance tradeoff results from the value of both g_s and g_a. Furthermore, uniform quantization over the range of the state or action is not effective due to the non-linearity. In order to handle these issues, we used non-uniform state quantization based on the vector g_s = [g_1 g_2 . . . g_s], where g_i ∈ N, and we replaced action quantization with an adaptive model of the action, shown for c_y in the equation below:

c_y = ∆_9 + ∆_10 e^(v_y,d − v_y,ta) (9)

The model combines two ranges; the first one gives an almost fixed value of c_x = α when the relative velocity is around zero, or when the UAV is about to catch the target. The second one gives a linear change of the action when the UAV's speed is not close to that of the target.
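The adaptive exponential action model of Equation (9) can be sketched directly; the numeric coefficient values below are placeholders, not the tuned ∆ values from the paper:

```python
import math

# Coefficients of the adaptive exponential action model (Equation (9));
# the numeric values here are placeholder assumptions, not the tuned Deltas.
D9, D10 = 0.25, 0.032

def c_y(v_yd, v_yta):
    """Adaptive roll command: near-constant when the relative velocity is
    about zero (the UAV is about to catch the target), and growing with
    the relative velocity otherwise."""
    return D9 + D10 * math.exp(v_yd - v_yta)

hover_cmd = c_y(0.0, 0.0)   # relative velocity ~ 0 -> command close to D9 + D10
chase_cmd = c_y(1.5, 0.0)   # larger relative velocity -> larger corrective command
```

Because the command is a smooth function of the relative velocity rather than a fixed quantized level, the effective action magnitude adapts continuously while the Q-matrix stays compact.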

Adaptive Multi-Level Quantization (AMLQ) Model
The control AMLQ model performs within two phases: the first one is the learning phase, and the second one is the operation phase. We used the term adaptive because we selected the actions c_x and c_y to be adaptive with respect to the relative velocities v_x,d − v_x,ta and v_y,d − v_y,ta. We used the term multi-level because the model performs multi-level quantization of both actions and states.

Learning Phase
The role of the learning phase is to build the Q-matrix based on consecutive iterations of training called episodes. In each episode, the algorithm places the target at a certain location. It searches for the best sequence of actions to enable the traveling of the UAV from its current location toward the target to perform the landing. We considered that autonomous landing combines two stages: tracking and landing. The Q-learning part is responsible for the former, while the latter is performed separately. We placed the target at four different locations in the environment, and we built a Q-matrix in an accumulated way based on a set of episodes at each location. We selected the locations at the corners of the environment.
It is important to point out that the Q-matrix contains no knowledge at the beginning of the learning. Hence, the action is selected using a uniform random distribution over the list of actions. This enables an equal exploration of all actions and their rewards given their corresponding states. We have named this strategy the heuristic strategy. It lasts for a certain number of iterations before the control policy is derived from the Q-matrix. The learning phase is summarized in Algorithm 2.

Operation Phase
The operation phase is enabled after the learning phase is performed. The input of the operation phase is the Q-matrix that was created in the learning phase. The role of the Q-matrix, when it is consulted, is to provide the policy of control, or the action that will be selected based on the current state and action, a_t+1 = π(s_t, a_t). Once the action is enabled, the system will move from its current state s_t to the next state s_t+1. Next, the system will again use the policy to select the new action a_t+2, and this process will be repeated until the final state is reached, which is the termination of the control, as shown in Algorithm 3.

Algorithm 3 Pseudocode of the operation phase
While the final state is not reached
  Measure the target position using the vision
  Update the state
  Select the action based on the state using the Q-matrix
  Enable the action
End While
Perform the landing using a gradual decrease in altitude
End
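The operation loop of Algorithm 3 can be sketched as follows; observe_state, apply_action, is_final, and land are placeholder callbacks standing in for the vision pipeline, low-level control, termination test, and descent routine:

```python
import numpy as np

def operate(Q, observe_state, apply_action, is_final, land, max_steps=1000):
    """Operation-phase loop (Algorithm 3 sketch): consult the learned
    Q-matrix greedily until the final state is observed (or a step
    budget is exhausted), then perform the landing."""
    for _ in range(max_steps):
        s = observe_state()          # measure the target position via vision
        if is_final(s):
            break
        a = int(np.argmax(Q[s]))     # policy derived from the Q-matrix
        apply_action(a)              # enable the selected action
    land()                           # gradual decrease in altitude
```

The loop is purely a table lookup per step, which is the real-time advantage claimed over DNN-based policies.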

Evaluation Metrics
The generated performance metrics reflect the difference between the position of the target and the UAV based on the tested scenario.

Mean Square Error (MSE)
This metric estimates the difference between the output signal of the control and the target. In our system, which is a multi-input multi-output (MIMO) system, we provide the MSE for each of the three coordinates x, y, and z. Assuming any of them is the signal y, the MSE is given in the following Equation (10):

MSE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)² (10)
where N denotes the number of samples that are used for evaluation, i denotes the index of the sample, ŷ_i denotes the output signal of the system, and y_i denotes the target signal of the system.

Root Mean Square Error (RMSE)
Root mean square error (RMSE) is the square root of the MSE. The equation of the RMSE is shown below:

RMSE = √MSE
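The two metrics can be computed per coordinate as follows; the function names are our own, and the definitions follow Equation (10) and its square root directly.

```python
import math

def mse(y_hat, y):
    """Mean square error between the system output y_hat and the target y,
    both sequences of N samples for one coordinate (Equation (10))."""
    n = len(y)
    return sum((yh - yt) ** 2 for yh, yt in zip(y_hat, y)) / n

def rmse(y_hat, y):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(y_hat, y))
```

For the MIMO system, these are evaluated separately on the x, y, and z coordinates of the UAV-target relative position.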

Experimental Design
The evaluation considered a target moving in a circular path in the plane. The experimental design was based on the parameters of the scenarios provided in Table 2. The radius of the circle was 1.5 m, and the speed of the target along the circle was 15 cm/s. The parameters used for the AMLQ-based autonomous landing are shown in Table 3. Table 2. PID gains used in the simulation.
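The circular target trajectory can be generated from the two stated parameters (radius 1.5 m, speed 15 cm/s); the sampling interval DT is an assumption, as it is not given in the text.

```python
import math

RADIUS = 1.5    # m, circle radius from the experimental design
SPEED = 0.15    # m/s, 15 cm/s tangential speed of the target
DT = 0.1        # s, sampling interval (an assumption)

def target_position(t):
    """Position of the moving target on the circular path at time t.

    The angular rate follows from speed = radius * omega.
    """
    omega = SPEED / RADIUS                  # rad/s
    theta = omega * t
    return RADIUS * math.cos(theta), RADIUS * math.sin(theta)

# One full circle takes 2*pi*R / v = 2*pi*1.5 / 0.15 ~ 62.8 s
waypoints = [target_position(k * DT) for k in range(int(62.8 / DT))]
```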

Results and Discussion
The evaluation results of the circular trajectory are presented in Figure 4. We used two types of the AMLQ model: the first with four actions, denoted with the suffix 4A, and the second with five actions, denoted with the suffix 5A. The error with respect to X is presented in Figure 4. The figure shows that the AMLQ-4A model had a better performance and fewer errors than the AMLQ-5A model. This result can be explained by the slower convergence of the five-action model due to its larger Q-matrix. Similarly, we present the relative error with respect to the Y-axis in Figure 5 for both the AMLQ-4A and AMLQ-5A models. It was also observed that the performance of the AMLQ-4A was better than that of the AMLQ-5A in terms of the magnitude of the error.
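The difference between the 4A and 5A variants can be illustrated with quantized action sets; the velocity levels below are assumptions for illustration, not the values used in the paper.

```python
# Illustrative quantized action sets for the 4A and 5A variants.
# V is an assumed magnitude for the quantized velocity command.
V = 0.25  # m/s (assumption)

# Four actions: move along +X, -X, +Y, -Y at the quantized speed.
ACTIONS_4A = [(+V, 0.0), (-V, 0.0), (0.0, +V), (0.0, -V)]

# Five actions: the same set plus a "hold position" action.
ACTIONS_5A = ACTIONS_4A + [(0.0, 0.0)]

# A larger action set enlarges the Q-matrix (|S| x |A| entries), which is
# one reason a five-action model converges more slowly than a four-action one.
```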

The performance of the developed AMLQ-4A and AMLQ-5A models was compared with the PID control. The graphs of the relative error with respect to X and Y are presented in Figure 6. It was observed that the magnitude of the error of the PID was higher than its equivalent for the AMLQ-4A and AMLQ-5A. However, the latter generated more oscillation compared with the PID. This result can be explained by the quantization of the actions in the QL. An approximating model of the Q-function via a neural network is recommended to solve this issue.
In addition to the error with respect to X and Y, we present the trajectories conducted by the UAV and the target in Figure 7a-c for each of the AMLQ-5A, AMLQ-4A, and PID controllers. We observed that in both AMLQ models the oscillation was higher than for the PID. However, the errors resulting from the AMLQ were smaller, as observed in the response graphs of the error on both X and Y. The main issue with the AMLQ was the low resolution of quantization, which led to sudden actions of the UAV and more oscillation. On the other hand, increasing the resolution of the quantization led to a slow, and even reduced, convergence of the Q-matrix. In summary, we provide the RMSE for each of the three controllers, AMLQ-4A, AMLQ-5A, and PID, in Table 4. As observed in Table 4, the lowest RMSE was scored by the AMLQ-5A. The results reveal that the developed AMLQ can reduce the error of landing on the target. This is regarded as a valuable functionality that can be added to UAV management software. Furthermore, it enhances human-UAV interaction and provides a safer condition for their collaboration.

Conclusions and Future Works
The problem of the autonomous landing of a UAV on a moving target is a nonlinear control problem with high dynamics. It includes tracking the target in the XY plane and landing on the moving target. To conduct the tracking, adaptive multi-level Q-learning was proposed. The definition of the states considered quantizing the relative position between the UAV and the target in the X and Y coordinates. The definition of the actions considered the quantization of the control signal in the XY plane using pre-defined quantization levels. One of the actions was proposed to be adaptive with respect to the velocity to mitigate the problem of the low granularity of quantization, which led to oscillation. The evaluation was conducted on various scenarios of trajectories of the moving target, namely linear and circular. For the evaluation, we provided the time response of the relative position of the UAV with respect to the target in the X and Y coordinates. The evaluation showed smaller errors in the X and Y coordinates for the AMLQ model compared with the PID-based tracking model. However, the lower levels of quantization resolution in the Q-matrix caused oscillation when compared with the PID. Future work could involve the incorporation of a neural network to approximate the Q-function and to enable the handling of a continuous representation of states and actions, consequently obtaining smoother control and less oscillation.

Informed Consent Statement: Ethical review and approval are not applicable because this study does not involve humans or animals.

Data Availability Statement: Our data were auto-generated by the system during the training and simulation.