UAV Maneuvering Target Tracking in Uncertain Environments Based on Deep Reinforcement Learning and Meta-Learning

: This paper combines deep reinforcement learning (DRL) with meta-learning and proposes a novel approach, named meta twin delayed deep deterministic policy gradient (Meta-TD3), to realize the control of unmanned aerial vehicle (UAV), allowing a UAV to quickly track a target in an environment where the motion of a target is uncertain. This approach can be applied to a variety of scenarios, such as wildlife protection, emergency aid, and remote sensing. We consider a multi-task experience replay bu ﬀ er to provide data for the multi-task learning of the DRL algorithm, and we combine meta-learning to develop a multi-task reinforcement learning update method to ensure the generalization capability of reinforcement learning. Compared with the state-of-the-art algorithms, namely the deep deterministic policy gradient (DDPG) and twin delayed deep deterministic policy gradient (TD3), experimental results show that the Meta-TD3 algorithm has achieved a great improvement in terms of both convergence value and convergence rate. In a UAV target tracking problem, Meta-TD3 only requires a few steps to train to enable a UAV to adapt quickly to a new target movement mode more and maintain a better tracking e ﬀ ectiveness.


Introduction
With the development of technology, various types of unmanned aerial vehicles (UAVs) have become available, such as fixed-wing and quadcopter. In particularly, quadcopters play an important role in a variety of scenarios. The advantages of using quadcopters include low-cost, low fault rates, light weight, fast response, high maneuverability, hovering capabilities, and vertical take-off and landing. In recent years, quadcopters have been extensively applied for different types of tasks, e.g., object tracking [1], wildlife protection [2], disaster rescue [3], 3D reconstruction [4], and maneuvering target tracking [5]. As an important application of UAV, maneuvering target tracking means that UAV observes a moving target through sensors, perceives the environment of the target, follows the target, and maintains the tracking range with the target. However, when dealing with the uncertain environment and the movement mode of the uncertain target, UAVs do not have the ability to autonomously complete decision-making and planning, and cannot achieve high-precision tracking of maneuvering targets. Accurate maneuvering target tracking under uncertain environments is the requirement of future development UAV.
Traditional methods of target tracking include genetic algorithms, Bayesian inference, and statistical theory, etc. These methods transform the target tracking problem into an optimization problem. By solving the optimization model, the tracking path can be obtained. However, these traditional algorithms have a poor real-time performance for uncertain environment problems and cannot realize real-time tracking decision.
In order to tackle these problems, researchers have employed artificial intelligence-based approaches, and one of such approaches is reinforcement learning (RL). RL is an effective approach to train an agent to make decisions in a complex and uncertain environment. Mnih [6] developed a novel artificial agent, named a deep Q-network (DQN), that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. Since then, deep reinforcement learning algorithms have been more widely used in UAV navigation and target tracking. Huang [7] combined DQN with UAV navigation to enable the agent to make decisions based on the received signal strength. Wu [8] proposed a search algorithm based on DQN, which can realize automatic search and path finding of UAV. Wang [9] developed a DRL framework for UAV navigation in large and complex environments for the partially observable Markov decision process (POMDP), and realized UAV navigation based on DDPG [10]. Wan proposed a novel DRL method to achieve robust control of UAV in dynamic uncertain environment [11]. Based on DQN, Bhagat proposed a deep reinforcement learning method for tracking targets in the presence of obstacles and target motion [12].
These researches have achieved good results in UAV navigation, but they all have some drawbacks. First of all, these researches have usually focused on the navigation of UAVs with fixed targets and the movement of targets is simple and single [13][14][15]. UAV maneuvering target tracking is different to navigation, and it requires a UAV to make autonomous decision in real time. If the navigation of a stationary target is applied to the tracking of a moving target directly, the tracking effect will not be guaranteed. Secondly, most of the current researches only consider to train DRL model in a single environment, and as such a UAV can only perceive a specific scenario, leading to a poor generalization capability of a learned policy. UAV cannot adapt or achieve better effects in a new environment. Meanwhile, recent research has found the existing DRL methods brittle [16,17]. If a DRL model is only trained for a single particular environment or task, when faced with uncertain environments and stochastic tasks, the model may be inefficient and unable to achieve a desired performance due to the model's limited applicability [18].
In this paper, we have conducted research on UAV maneuvering target tracking with different target maneuvering methods. We have combined the DRL algorithm TD3 [19] with meta-learning [20,21] to develop an improved TD3 algorithm, named meta twin delayed deep deterministic policy gradient (Meta-TD3), to be used to realize the tracking of UAV to maneuvering targets in uncertain environments.
The contributions of this paper are summarized as follows: (1) A UAV motion model is constructed, and the UAV maneuvering target tracking is defined as an MDP. Based on TD3 algorithm, a decision-making framework is established to control the course and velocity of UAV. Through the decision-making framework, autonomous real-time maneuvering target tracking can be realized. (2) In order to make the UAV quickly adapt to the tracking of multiple uncertain target motion modes, a deep reinforcement learning algorithm, meta-TD3, is developed by combining DRL and meta-learning. A multi-task experience replay buffer is proposed, which contains several replay buffers of different tasks. A multi-task meta-learning update method is developed which breaks the original DRL network update. In this way, DRL model can learn multiple tasks at the same time and perform meta-learning on weights obtained by training from multiple tasks. Through these, meta-TD3 can overcome the shortcomings of DRL and enable UAV to quickly adapt to an uncertain environment and realize target tracking quickly and efficiently.
(3) Through a series of experiments, compared with the state-of-the-art algorithms DDPG and TD3, the performance, training efficiency and the generalization capability of meta-TD3 have been verified. In the UAV target tracking problem, Meta-TD3 only requires a few steps to train to enable the UAV to adapt to the new target movement mode more quickly and maintain a better tracking effect.
The remainder of this paper is organized as follows. Section 2 introduces the UAV motion model and defines the UAV maneuvering target tracking as an MDP. In Section 3, we constructed a DRL framework for UAV maneuvering target tracking based on TD3 algorithm, and propose a reinforcement learning algorithm combined with meta-learning. The experiments in Section 4 demonstrate the effectiveness and adaptability of the proposed algorithm. Section 5 summarizes this paper and looks forward to future work.

Problem Formulation
In order to describe the UAV maneuvering target tracking, we construct the UAV motion model and target tracking model based on Markov decision process (MDP).

UAV Motion Model
The UAV motion model is the basis for completing navigation and target tracking missions. The UAV can be thought as a rigid body with forces and torques applied from the four rotors and gravity [22,23]. In navigation and target tracking scenario, we assume that UAV is flying at a fixed altitude. The experiments of this paper are set in a x-y plane of Cartesian inertial coordinates [24]. The UAV motion model in Cartesian inertial coordinates is expressed as where x t , y t denote the position coordinates of the center of mass of UAV in Cartesian inertial coordinates at time t. x t+1 , y t+1 denote the position coordinates of the center of mass of UAV in Cartesian inertial coordinates at time t + 1. v t denotes the linear velocity of the UAV, n t is the acceleration of the linear velocity of the UAV. ϕ t denotes the heading angle, ω t is the angular velocity of the heading angle and t (t = 1, . . . , n) t(t=1, . . . ,n)denotes each timestep, and dt is defined as the time interval between t and t + 1. The control of UAV is realized through the acceleration and the angular velocity of the heading angle, where acceleration represents the control of throttle and the angular velocity represents the control of steering. The UAV realizes the flight through the control of the throttle and the steering. The UAV motion model is shown as in Figure 1. ψ t denotes the angle between the linear velocity of the UAV and the target line at time t, d t is the distance between UAV and target at time t, x

Target Tracking Model Based on Reinforcement Learning
In order to realize the robust control of UAV in complex and uncertain environments, we focus on reinforcement learning. Reinforcement learning is a machine learning method wherein an autonomous agent learns to find a near-optimal behavior through trial-and-error interactions with its environment [2]. The specific implementation process is defined as the Markov decision process [25,26].
Markov decision process can be described by a five-tuple (S, A, P, R, γ). State space S is the set of all states of the UAV. Action space A is the set of all the actions that the UAV can perform. is denoted as P(s s, a) , which indicates the transition probability of the agent taking action a ∈ A from state s ∈ S to the next state s ∈ S. R is denoted as R(s, a), indicating the reward that can be obtained by taking action a ∈ A in state s ∈ S. γ represents the discount factor. The implementation process of reinforcement learning is as follows.
The agent's policy π(s) provides the guideline on what is the optimal action to take in a certain state with the goal to maximize the total rewards. denotes the expected reward when performing actions in state and following policy π(s). The state value function of the judgement strategy is defined as: The reinforcement learning model framework of the UAV navigation and target tracking is shown in Figure 2. The state of the UAV and the state of the target are integrated into the Agent. During the interaction process, the agent outputs the action to the environment to change the state of the UAV, and the target changes the state according to its own movement mode. For the action taken by the UAV, the environment feedbacks the corresponding reward to the agent. The policy is dynamically updated in the process, so that the action tends to be optimal, thus realizing the target tracking.

State Space
The state space describes the collection of information that the UAV can obtain from the environment. The position of the center of mass of the UAV relative to the inertial coordinate system and the motion variables of UAV are defined as state. In this paper, we assume that the UAV can get the position, attitude information and target distance through GPS and sensors. Therefore, the state space consists of the following variables, x, y, ϕ, v, ψ, d. x, y denote the coordinate of UAV, ϕ denotes the heading angle, v is the linear velocity of the UAV, ψ denotes the angle between the linear velocity of the UAV and the target line and d is the Euclidean distance between the UAV and the target. In order to unify the range of each state variable, eliminate errors caused by the variable scale, and improve the efficiency of neural network training, normalization operation is carried out for each state variable.
In addition, every state s is defined as:

Action Space
In the Markov decision process, the agent selects actions from the action space. Considering the actual situation, the speed of the UAV cannot be changed abruptly. We set the acceleration and the angular velocity of the heading angle as actions. In other words, with the acceleration and the angle change, the state of UAV will change simultaneously.
The action of the model is defined as: where n t denotes the acceleration of the linear velocity of the UAV, ω t is the angular velocity of the heading angle. The action space is defined as a continuous interval A= [a min , a max ] from which the actions of the reinforcement learning are selected.

Reward Shaping
The reward function is the assessment of the agent's actions by the environment. The reward function can determine the performance of the agent in the reinforcement learning. In this paper, the reward function is composed of three parts: distance reward, angle reward and extra distance penalty and reward.
In order to improve the convergence speed and accuracy of the reinforcement learning model, this paper normalizes the distance reward function. The normalized distance reward function is defined as: where δ is a positive constant, d t denotes the distance between UAV's current position and the target at timestep t. d max is the max distance that UAV can detect. To ensure the UAV flying in the direction of the target at all times, UAV would obtain an angle reward. The normalized angle reward function r θ (s t , a t ) is defined as: where ε is a positive constant and ψ t denotes the angle between the linear velocity of the UAV and the target line at timestep t.
In order to achieve faster tracking of a target, we set extra distance penalty and reward. The extra distance penalty and reward is defined as: where D denotes the optimal tracking distance, p is a negative constant and r is a positive constant.
To summarize, the final reward function can be formulated as:

Meta-TD3 for Target Tracking Model
In this section, we introduce the UAV target tracking based on the actor-critic reinforcement learning framework and propose a reinforcement learning algorithm combined with meta-learning.

UAV Motion Model
In Section 2.2, we define the target tracking model based on reinforcement learning, and action space is defined as the continuous space A. Because of the continuous space, some reinforcement learning algorithms such as Q-learning and DQN cannot be applied [27][28][29]. These algorithms are more suitable for reinforcement learning in the discrete action space. For the target tracking problem to be solved in this paper, these algorithms require optimize all the values in the action space [10]. It's difficult to effectively explore such a large action space and ensure the ideal effect of training. In this section, we introduce the model architecture based on TD3 (twin delayed deep deterministic policy gradient). The reinforcement learning model architecture we created consists of an actor module, a critic module and an experience replay buffer.
In order to increase the stability of training, actor module is composed of two neural networks and critic module is composed of four neural networks with the same structure. Actor module consists of an online policy neural network and a target policy network, critic module consists of two online Q-value neural networks and two target Q-value neural networks.
The structure and implementation process of target tracking model is shown in Figure 3. At each step, the agent selects an action by a t = π φ (s t ) + N t , where N t is exploration noise. Based on the reward function in Section 2.2.3, the environment returns the reward r(s t , a t ) to the UAV. The samples (s t , a t , r(s t , a t ), s t+1 ) from the interaction between the agent and the environment are stored in the experience replay buffer. When training, a mini batch of samples N * (s j , a j , r j , s j+1 ) are randomly selected to update the network parameters by gradient descent. This random sampling method greatly reduces the correlation between samples, thereby improving the stability of the training process. The main purpose of actor module is to generate reasonable actions according to the current state of the UAV to approach the target and finally realize the target tracking. The input of the actor module is the state of the UAV, and the output is the action of the UAV. The training process of actor module is to find the optimal policy π φ .
The loss function of actor module is defined as: where θ 1 denotes the neural networks parameters of the online Q-value Networks, φ is the neural networks parameters of online policy networks. Q θ 1 (s, a) s=s j ,a=π φ (s j ) is the reward that the agent can obtain in state s j by choosing action according to policy π φ (s). During the training process, Adam gradient descent is performed on the network parameters φ. The online policy networks parameters φ are updated to maximize the loss function J as much as possible.
Through the actor module, the agent can get action according to the current state. In order to evaluate the agent's decision, we introduce the critic module to guide the policy to be optimal. The input of the critic module is the state and the action of the UAV, the output is the Q-value. During the training process, the networks parameters of the critic module are updated along the direction of increasing the Q value. In order to obtain the maximum Q-value and reduce the over-estimation of the Q-value, according to the Bellman equations and [30], the target optimal Q-value function y t is defined as: where φ denotes the neural networks parameters of target policy network,θ i (i = 1, 2) are the neural networks parameters of target Q-value Networks. r(s j , a j ) denotes the immediate reward the agent can receive in state s j by taking action a j . γ denotes the discounting factor of future reward. Qθ i (s j+1 , πφ(s j+1 )) is the future reward that the agent can receive in state s j+1 according to policy πφ. The loss function of critic module is defined as (12). L is similar to the form of MSE (mean square error) in supervised learning, where θ i , i = 1, 2, . . . , denotes the neural networks parameters of online Q-value networks: During training, the parameters of online policy neural networks are updated by maximizing the loss function J, the parameters of online Q-value neural networks are updated by minimizing the loss function L. The parameters of target neural networks are updated by soft update. The updating method is shown in (13), where τ is a constant coefficient, used to the regulate the update rate:

Multi-Tasks Experience Replay Buffer and Meta-Learning Update
In Section 3.1, we introduce the UAV target tracking model based TD3, which can realize the target tracking in a single motion form. However, the target movement mode may be more complex and changeable. Aiming at the problem of insufficient generalization ability of reinforcement learning, we consider a combination of reinforcement learning and meta-learning to improve the generalization ability of reinforcement learning.
In order to find a more general tracking strategy, it is necessary to train for multi-tasks. In Section 3.2.1, we introduce the setting of multi-tasks and the experience replay buffer for multi-tasks. In Section 3.2.2, we propose a new reinforcement learning updating method, which breaks the original TD3 update method through meta-learning, thereby obtaining a general tracking strategy.

Multi-Tasks Setting
Generalization ability refers to the predictive ability of the model learned by the method to unknown data, which is an essential and important property of the learning method. As for reinforcement learning, we expect a good reinforcement model capable of well adapting or generalizing to new tasks and new environments that have never been encountered during training time. In traditional reinforcement learning, the agent is usually trained in a specific environment. The exploration is usually employed to ensure the generalization of the policy. However, it remains to explore more implementation possibilities for a specific environment and task. In this way, the generalization capability of reinforcement learning has not improved much. Taking the target tracking problem as an example, the tracking policy trained for a specific target motion (speed and motion direction change in a specific way) is difficult to directly adapt to the task of tracking other target motion modes.
In this section, we propose the multi-task setting. Agent needs to learn a variety of different tasks at the same time, and finally learn a policy that can adapt to many tasks well. We construct task experience replay buffer B T j for task T j ( j = 1, 2, . . . , n) and combine all task experience replay buffers to form multi-task experience replay bufferB. The structure of multi-tasks experience replay buffer is shown in Figure 4. For target tracking, different motion modes of target, different initial relative positions of the UAV and target are defined as different target tracking tasks. In task T j , ) generated by the agent interacting with the environment are stored in the corresponding task experience replay buffer B T j . During training, an equal amount of experience is randomly selected from each task experience replay buffer for training, thus realizing the learning of a variety of different tasks.

Meta-Learning Update Method
In this section, based on meta-learning, we propose Meta-TD3 to learn a variety of different tasks and obtain a more generalized policy meta-policy. The meta-policy obtained through meta-learning can enable agent to use only a small number of learning steps to acquire excellent strategies for new tasks.
During the training of reinforcement learning, agent only considers the current task and environment, trying to find a policy that can adapt to the current task and environment [31,32]. Although the reinforcement learning algorithm adopts exploration to ensure that the policy has a certain generalization ability, the previous policy cannot be used directly when dealing with different tasks and environments [33][34][35]. Aiming to resolve the problem of insufficient generalization ability for different tasks in the implementation process of reinforcement learning, we introduce meta-learning into reinforcement learning (Meta-TD3) and optimize the initial parameters of reinforcement learning to improve generalization ability.
The purpose of Meta-TD3 is to maximize the expected generalization ability of reinforcement learning algorithms in all the training tasks. Meta-TD3 is used to learn many different types of parameters to obtain the initial parameters of reinforcement learning for different type of tasks [36]. Meta-TD3 obtains prior knowledge from many related tasks as a guide to learning new tasks. When responding to a new task, based on the previous initial parameters learned through Meta-TD3, only a few steps of fine tuning are needed to meet the requirements of task.
Meta-TD3 includes update processes: internal TD3 reinforcement learning for a single task and external meta learning update for multiple different tasks. These processes can be viewed from the perspective of feature learning as building an internal representation that is widely applicable to many tasks. In Meta-TD3, internal TD3 training and external meta-learning update are performed alternately, which meets a certain update frequency. TD3 model learns multiple tasks separately to obtain different parameters, Meta-TD3 obtains the initial parameters of reinforcement learning by optimizing different parameters.
For the internal TD3 reinforcement learning update process, it can be described as a learning process for a specific task T j . The agent will optimize the policy to achieve target tracking for the current task. The specific update process is the same as that introduced in Section 3.1. After completing the internal TD3 reinforcement learning update for the task T j , the parameters φ T j ,φ T j of the policy network about the task T j and the parameters θ T j 2 of the Q-value network about the task T j can be obtained.
For external meta-learning update, it can be described as quadratic gradient optimization for a series of task network parameters. The external meta-learning update process is as follows: where φ meta denotes the parameters of the policy network updated through Meta-TD3, θ meta 1 and θ meta 2 denote the parameters of the Q-value network updated through Meta-TD3,φ T j denotes the parameters of the policy target network for internal TD3 update for task T j ,θ T j 2 andθ T j 1 are the parameters of the Q-value target network for internal TD3 update for task T j , and β is the meta step size of outer meta-learning. n denotes the number of tasks.
The Meta-TD3 update process is shown in Figure 5. During training, we randomly select tasks from the multi-tasks experience replay buffer. Take the update of policy neural network parameters φ meta as an example. When learning task T j , the neural network parametersφ T j are obtained by the internal TD3 reinforcement learning update. When all tasks sampled have completed the internal TD3 reinforcement learning update, the neural network parameters are further optimized through external meta-learning update. Internal and external updates are carried out constantly to obtain the neural network parameters as meta-policy. Ultimately, we present Meta-TD3 by introducing the multi-tasks experience replay buffer and meta-learning update described in Sections 3.2.1 and 3.2.2 to TD3. Meta-TD3 is summarized in Table 1. , and meta-actor network π φ meta with random parameters θ meta 1 , θ meta 2 , φ meta 4: Initialize critic networks Q θ 1 , Q θ 2 , and actor network π φ θ 1 ← θ meta 1 , θ 2 ← θ meta 2 , φ ← φ meta 5: Initialize target networksθ 1 ← θ 1 ,θ 2 ← θ 2 ,φ ← φ 6: Sample batch of tasks T j ∼ p(T) j = 1, 2, . . . , n 7: Initialize replay buffer B T j 8: For meta_iteration = 1, 2, . . . do 9: for t = 1, 2, . . . do 10: Select action a T j ∼ π φ (s T j ) + ε with exploration noise ε ∼ N(0, σ) 11: Observe reward r T J and new state s T j 12: Store transition (s T J , a T J , r T J , s T j ) in B T j 13: Sample mini-batch of N transitions (s T J , a T J , r T J , s T j ) from B T j 14: if t mod update_policy_freq then 17: Update φ by the deterministic policy gradient: 18: Update the target networks: 20: The neural network parameters of task T j :φ T j =φ,θ if t mod meta_update_freq then 25: end for 26: After completing the training of Meta-TD3, we will obtain the parameters of neural networks for tasks θ meta 1 , θ meta 2 , φ meta and take them as prior knowledge. When dealing with new tasks, the model is trained on these parameters. New tasks can be achieved only through internal fine tune of Meta-TD3 algorithm.

Results
In this section, we set up comparative experiments to verify the implementation effect, training efficiency, and generalization ability of Meta-TD3 algorithm.

Experimental Platform and Environment Setting
The experimental environment of this paper is defined as 5000 m × 5000 m 2D plane in which the UAV can complete the target tracking task. Assuming that the UAV can the coordinates of the target through GPS in real time. The UAV is supposed to fly at a fixed altitude of 50 m and needs to approach the target as soon as possible and maintain the target within the distance range. Assuming that the target is moving on the ground, the motion mode of target will change in real time. At time t, if the horizontal distance d t between the UAV and the target exceeds L = 500 m, the UAV will get a penalty p = −1. At time t, if the horizontal distance d t between the UAV and the target is less than D = 100 m, it is considered that the UAV achieves the tracking at time t, the UAV will get a reward r = 3.
The parameters of the UAV motion model are set up as follows. Since the research object of this paper is the Quadrotor UAV, it is considered that UAV can complete the hovering. Therefore, the maximum speed of UAV is v max = 20 m/s and the minimum speed is v min = 0 m/s. The maximum speed of target is 15 m/s and the minimum speed is 0. The acceleration of the UAV is n t ∈ [−2 m/s 2 , 2 m/s 2 ] and the angular acceleration is ω t ∈ [−3/s 2 , 3/s 2 ]. In order to unify the range of each variable and improve the efficiency of learning, each state variable is normalized into [−1.0, 1.0]. The coefficients in reward shaping are instantiated as d max = 6000 m, δ = 0.7, ε = 0.4.
According to the definition of the state space and the definition of the action space, it is clear that the Meta-TD3 model has 6-dimensional input and 2-dimensional output. In Meta-TD3, all the policy neural networks have the same structure, and all the Q-value neural networks have the same structure. The policy neural networks and the Q-value neural networks are shown in Figure 6. The policy neural networks are constructed as 6 × 256 × 256 × 2 and the Q-value neural networks are constructed as 8 × 256 × 256 × 1. Except that the activation function of the last layer of the policy neural networks is tanh function and the last layer of the Q-value neural networks has no activation function, the remain layers are ReLU layers. The Adam optimizer is employed to optimize the actor module and the critic module, the learning rate is both 0.0003. Other hyper-parameters are set with soft update rate τ = 0.005, the discounting factor γ = 0.99, the variance of gaussian noise σ = 0.2. When training, the parameters of actor module are updated as a frequency of 2. The maximum size of the task experience replay buffer R is set to 100,000, and the size of Minibatch is set to 256. In the simulation, the decision period dt is set to 1 s, and an episode contains 400 decision steps.

Model Training and Testing
To verify the performance of Meta-TD3 in target tracking and the generalization in different environments, the experiment is divided into two parts: one is training in a series of tasks through Meta-TD3 algorithm to obtain a general tracking policy, and the other is to test the performance and the generalization of general policy in specific new tasks.  N(0, 3). Based on this, we generate five different target motion modes for training. The multi-task experience replay buffer contains five separate task experience replay buffer. The UAV is trained to track the target under the conditions of five different target motion modes, and stores the corresponding interactive data into the corresponding experience replay buffer. During training, for each task, samples of equal amount are randomly selected from the corresponding experience replay buffer for training. We train the Meta-TD3 algorithm for 1250 episodes and update the parameters of meta-learning every 10 steps.
In order to test the generalization performance of the parameters trained through Meta-TD3 algorithm, we set up three groups of different tasks for experiments. To be specific, we resort to other two state-of-the-art DRLs, DDPG and TD3, as baselines and re-implement them with almost the same hyper-parameter settings to the fine tune of Meta-TD3. We set up three tracking tasks according to increasing difficulty, and all three tasks are trained (fine-tuned) for 1000 episodes.
In Task 1, UAV is close to the target at the beginning, the initial speed direction of the UAV is basically consistent with the target's initial speed direction. The specific settings of Task 1 are as follows. The target initial position is (1000 m, 800 m), the initial speed is 3 m/s, the acceleration obeys  N(0.2, 3). The initial position of the UAV is (0 m, −300 m), and the initial speed of UAV is 4 m/s. After the start of the task, UAV will approach the target as soon as possible and achieve the distance to the target within the set range. Figure 7 shows the tracking trajectory with the Meta-TD3 weights for fine tune, TD3 and DDPG algorithms with randomly initialized network weights after 1000 episodes of training, and the distance between the UAV and the target at each decision step after 1000 episodes of training. In the of Figure 7a, the red star indicates the initial position of the UAV and the blue star indicates the initial position of the target. At the beginning of the Task 1, Meta-TD3, TD3 and DDPG can make the UAV quickly approach the target. It can be seen from the Figure 7a that the tracking trajectory of the fine tune using Meta-TD3 weights can be closer to the target trajectory, and the tracking effect is significantly better than that of the randomly initialized TD3 algorithm and the DDPG algorithm. As can be seen from the Figure 7b, both Meta-TD3 algorithm and TD3 algorithm can basically reach the tracking range at 100 decision steps and achieve stable tracking of the target while the DDPG algorithm cannot achieve stable tracking. Fine tune tracking with Meta-TD3 weights can ensure that the distance between the target is stable near 0 m, and the tracking stability and tracking effect are higher than the TD3 algorithm and the DDPG algorithm. Figure 8 shows the average reward for Task 1. The average reward obtained by using Meta-TD3 weights for fine tune has been much higher than the other two algorithms within 1000 episodes, and the average reward can basically reach the end around 400 episodes and maintain convergence. But the average reward of TD3 algorithm and DDPG algorithm increases slowly during training, and the maximum average reward can basically be reached in 1000 episodes.  Figure 9 shows the tracking trajectory with the Meta-TD3 weights for fine tune, TD3 and DDPG algorithms with randomly initialized network weights after 1000 episodes of training, and the distance between the UAV and the target at each decision step after 1000 episodes of training. As shown in Figure 9, all the three algorithms have achieved a good performance on target tracking. In the whole tracking process, fine tune with the parameters of Meta-TD3 algorithm can ensure that the distance between UAV and the target is smaller than the TD3 algorithm and DDPG algorithm. The average reward during training of the three algorithms is shown in Figure 10. For Task 2, the average reward of fine tune rises rapidly, reaching the maximum average reward in 300 episodes and maintaining convergence. The average reward obtained by TD3 and DDPG with randomly initialized training cannot reach the maximum value in 1000 episodes, and the average reward cannot converge.  Figure 11. shows the tracking trajectory with the Meta-TD3 weights for fine tune, TD3 and DDPG algorithms with randomly initialized network weights after 1000 episodes of training, and the distance between the UAV and the target at each decision step after 1000 episodes of training. In Figure 11, fine tune with the parameters of Meta-TD3 algorithm provides better effect to Task 3. It can be seen from the tracking distance that fine tune with the parameters of Meta-TD3 algorithm keeps a small tracking distance from the target. The average reward of Task 3 during training of the three algorithms is shown in Figure 12. Among the three algorithms, the average reward convergence speed of Meta-TD3 algorithm is much faster than other algorithms, and the maximum average reward of Meta-TD3 is higher than other algorithms.

Discussion
To evaluate the proposed Meta-TD3 algorithm, we conducted a large number of experiments in three different tasks by comparing with the TD3 and DDPG algorithms to verify the implementation effect, training efficiency and generalization ability of the Meta-TD3 algorithm.
For the target tracking effect, we compared the average reward, the average minimum tracking distance, and the average fluctuation range of the tracking distance (when UAV is close to the target) obtained by the three algorithms during target tracking in different environments. We conducted five comparative experiments under three different environments, and compared the average rewards obtained by different algorithms. The experimental results are shown in Table 2. It can be clearly seen that Meta-TD3 algorithm obtained the highest average reward in different environments, followed by TD3 algorithm, and DDPG algorithm obtained the lowest average reward. The results in Table 3 are the average minimum tracking distance, and the results in Table 4 are the fluctuation range of the average tracking distance. According to Tables 3 and 4, Meta-TD3 algorithm has the smallest average fluctuation range of the minimum tracking distance and the smallest average tracking distance under different tasks. Combined with Tables 2-4, the advantages of Meta-TD3 algorithm in tracking effect can be concluded. The great improvements come from the multi-tasks experience replay buffer and meta-learning update in Sections 3.2.1 and 3.2.2. The former ensures that multiple tasks data can be sampled during training, while the latter ensures that the algorithm learns multiple tasks during training. This enables the meta-TD3 algorithm to have better tracking effect in testing.  For the training efficiency and generalization ability, we set the average reward and the convergence speed of training as indicators, carried out three groups of experiments, and obtained the average reward change curve. Figures 8-12 shows the average reward changes of Meta-TD3, TD3, and DDPG algorithms during experiments. According to the results shown in the Figures 8-12, it can be clearly seen that the reward curve of Meta-TD3 can converge more quickly and the maximum reward is higher than TD3 and DDPG algorithms. The average rewards and convergence steps of different environments are shown in Tables 2 and 5. According to Table 2, it can be clearly seen that the average reward of Meta-TD3 is the highest in different environments. In Table 5, it can be found that Meta-TD3 algorithm has the fastest convergence speed in different environments, which is nearly 50% faster than TD3, while DDPG algorithm can hardly converge in the three environments. Therefore, it can be determined that the training efficiency of Meta-TD3 is faster. Combining Tables 3 and 4, we can determine that the tracking distance and tracking distance fluctuation range of the Meta-TD3 algorithm are the smallest when converging. The policy obtained by Meta-TD3 has stronger generalization ability, which can maintain the training efficiency while improving Tracking effect in different environments. Stronger generalization ability means that Meta-TD3 can be promoted in different environments. This is promoted by the two improvements proposed in this paper. Learning a variety of tasks during training can ensure that the obtained policy converges quickly in the new environment and maintains a better tracking effect. In the implementation process, this paper assumes that the height of UAV is fixed, and the computer performs Meta-TD3 update on the simulation scene in advance to obtain the meta-policy. Based on the meta-policy, UAV can quickly adapt to specific tracking scenarios to obtain more adaptive Tracking policy. From the experimental results, it can be seen that Meta-TD3 has great advantages in tracking effect, training efficiency and generalization ability, and can provide support for target tracking of UAV in uncertain environment. Although we have done some work on tracking effects and generalization capabilities, there is still challenges to deploy the algorithm to a real UAV. In a real situation, the external air pressure change as time passes. It is difficult for a UAV to maintain maneuvering at a fixed altitude. No matter how well UAV is trained in a virtual environment, there is a big gap with reality. But we can continuously increase the reality of the scene and model in the virtual environment, and get closer to the reality. For future research, we will consider realizing more detailed UAV control, including controlling a UAV with motor speed, taking the image information obtained by a UAV's vision sensor and the ranging information from Lidar as state information, so as to realize target tracking in a scenario closer to a real-life three-dimensional situation.

Conclusions
This paper discusses UAV maneuvering target tracking in uncertain environments. We construct UAV motion controlled by acceleration and angular acceleration on a two-dimensional plane and define the UAV maneuvering target tracking as MDP. To ensure that the UAV autonomously achieves target tracking in uncertain environments, this paper proposes Meta-TD3 based on DRL and meta-learning, which can realize fast target tracking under uncertain target motion.
Based on actor-critic framework and TD3 algorithm, we construct UAV tracking model that can realize maneuvering target tracking in a single deterministic environment. Considering the tracing policy for a deterministic environment cannot be directly applied in the new environment, we design the method to learn more general tracking policy by meta-learning from a variety of different environments. We propose multi-tasks experience replay buffer to provide data for multi-tasks learning of DRL algorithm. We combine meta-learning to propose a multi-task reinforcement learning update method to ensure the generalization of the policy. Through a series of experiments, compared with the state-of-the-art algorithms DDPG and TD3, the performance and the generalization capability of Meta-TD3 have been verified. Experiments show that the Meta-TD3 is superior to the DDPG and TD3 algorithms in terms of convergence value and convergence speed. In the UAV target tracking problem, Meta-TD3 only requires a few steps to train to enable the UAV to adapt to the new target movement mode more quickly and maintain a better tracking effect.
Some future work includes the realization of more detailed UAV control to 3D space and target tracking in more complex environments.