Hierarchical Active Tracking Control for UAVs via Deep Reinforcement Learning

Active tracking control is essential for UAVs to perform autonomous operations in GPS-denied environments. In the active tracking task, UAVs take high-dimensional raw images as input and execute motor actions to actively follow a dynamic target. Most research focuses on three-stage methods, which entail perception first, followed by high-level decision-making based on extracted spatial information of the dynamic target, and then UAV movement control using a low-level dynamic controller. Perception methods based on deep neural networks are powerful but require considerable effort for manual ground-truth labeling. Instead, we unify the perception and decision-making stages into a high-level controller and then leverage deep reinforcement learning to learn the mapping from raw images to high-level action commands in a V-REP-based environment, where simulation data are infinite and inexpensive. This end-to-end method also has the advantages of a small parameter size and reduced effort for parameter tuning in the decision-making stage. The high-level controller, which has a novel architecture, explicitly encodes the spatial and temporal features of the dynamic target. Auxiliary segmentation and motion-in-depth losses are introduced to generate denser training signals for the high-level controller's fast and stable training. The high-level controller and a conventional low-level PID controller constitute our hierarchical control framework for the UAVs' active tracking task. Simulation experiments show that our controller, trained with several augmentation techniques, generalizes well to dynamic targets with random appearances and velocities and achieves significantly better performance than three-stage methods.


Introduction
Unmanned aerial vehicles (UAVs) are becoming an ideal platform for executing dirty and dangerous tasks, due to their high agility and low cost. Perception and control are the two key modules of autonomous UAVs. Without these, UAVs cannot derive rich information from complex environments, make proper decisions, or behave correctly. Autonomous perception and smart control have long been topics of interest in the UAV community.
In this paper, we focus on the active tracking task of UAVs. Active tracking for a dynamic target is a fundamental function for UAVs to perform monitoring and anti-terrorism operations in GPS-denied environments. This specific task requires both autonomous perception to determine the location of the dynamic target and control to actively track the target, which can be transferred and generalized to more difficult autonomous tasks.
Most research considers the active tracking task to be three separate subproblems, namely, perceive first, make movement decisions based on the target's estimated position, and then control the UAV dynamics [1][2][3][4]. In the perception stage, early research used traditional computer vision techniques, such as filtering in HSV space and the Hough transform, to detect objects with certain colors or shapes. Later research used hand-crafted features. The main contributions of this paper are as follows:

1. We develop a novel and interpretable neural architecture for the high-level controller to derive the spatial and temporal latent features of the dynamic target and output continuous tracking speed commands directly. This compact architecture reduces the number of parameters of the high-level controller.

2. The end-to-end mapping from raw images to high-level decisions is trained via deep RL in a virtual environment based on V-REP [21]. We also leverage PyRep [22] to accelerate the simulation speed and run parallel environments for faster data collection. The simulation data are inexpensive, and no effort for ground-truth labeling is needed.

3. To further accelerate the training process, we adopt auxiliary segmentation and motion-in-depth losses, which generate denser training signals.

4. Augmentation techniques are applied in the virtual environment to increase the robustness of the trained controller.
In our experiments, the proposed hierarchical control framework and auxiliary losses effectively decrease the training time. The quadrotor with a trained high-level control module and a low-level PID control module can adapt to dynamic targets with random colors, paths, and speeds in the UAVs' active tracking task.

Markov Decision Process
In the RL problem, the interaction process between the agent and the environment is usually modeled as a Markov decision process (MDP) [23]. The MDP can be denoted as M = (S, A, P, ρ_0, r, γ), where S is the state space, A is the action space, P(S_{t+1} = s' | S_t = s, A_t = a) : S × A × S → [0, 1] is the probability of transitioning into state s' upon taking action a in state s, r(s, a) : S × A → ℝ is the immediate reward associated with taking action a in state s, γ ∈ [0, 1) is the discount factor and defines the horizon of the RL problem, and ρ_0 : S → [0, 1] is the initial state distribution.
The goal of the deep RL problem is to identify an optimal policy π*_θ that maximizes the expected discounted return:

J(π_θ) = E_{τ ∼ π_θ} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],    π*_θ = argmax_θ J(π_θ).

Proximal Policy Optimization Algorithm
Policy gradient methods in RL are sensitive to hyperparameters, such as the policy update step size. If the policy is updated in an unfavorable direction, it becomes worse, and the experiences sampled from it degrade the next update iteration.
Trust region policy optimization (TRPO) [24] avoids significant and destructive policy parameter changes by placing a KL divergence constraint δ on the size of the policy update at each iteration:

maximize_θ  E_t [ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t ]
subject to  E_t [ KL(π_θ_old(· | s_t), π_θ(· | s_t)) ] ≤ δ,    (3)

where Â_t is an estimator of the advantage function at timestep t and θ_old represents the parameters of the policy before the update. TRPO approximates Equation (3) with a linear objective and a quadratic constraint and then solves it with the conjugate gradient algorithm.
Schulman et al. [25] proposed the proximal policy optimization (PPO) algorithm to further simplify TRPO.
PPO denotes the probability ratio as r_t(θ):

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t),    (4)

and proposes the following clipped objective:

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],    (5)

where ε is a hyperparameter that clips the moving r_t(θ) into the range [1 − ε, 1 + ε].
The minimum of the unclipped and clipped terms is taken as the final objective, which restricts large policy updates. PPO has been demonstrated to work well on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing [25], and is one of the most powerful model-free deep RL algorithms.
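As a concrete illustration, the clipped surrogate objective described above can be sketched in a few lines of NumPy (a minimal sketch, not the authors' implementation):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO.

    ratio:     r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage: advantage estimate A_hat_t
    eps:       clipping hyperparameter epsilon
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum removes the incentive for overly large updates.
    return float(np.minimum(unclipped, clipped).mean())
```

For a ratio outside [1 − ε, 1 + ε] whose unclipped term would improve the objective, only the clipped term contributes, so the policy gradient through that sample vanishes and the update is restricted.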

Generalized Advantage Estimation
To reduce variance in policy gradient methods, Schulman et al. [26] proposed the generalized advantage estimator (GAE):

Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l},  where  δ_t = r_t + γV(s_{t+1}) − V(s_t).    (6)

GAE is a general form of the following two advantage estimators:

Â_t = δ_t = r_t + γV(s_{t+1}) − V(s_t),    (7)

Â_t = Σ_{l=0}^{∞} γ^l r_{t+l} − V(s_t).    (8)

Equation (7) is biased but has low variance, while Equation (8) is unbiased but has high variance, due to the sum of many terms. GAE unifies these two advantage estimators and balances the bias and variance of the advantage estimate through the parameter λ ∈ (0, 1). Equations (7) and (8) are the specific cases of Equation (6) when λ = 0 and λ = 1, respectively.
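Over a finite episode, the exponentially weighted sum in GAE is usually computed with a single backward recursion, which can be sketched as follows (a NumPy sketch, not the authors' code):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finite episode.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), including one bootstrap value for the last state
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # (gamma*lam)-discounted sum
        advantages[t] = running
    return advantages
```

With λ = 0 each advantage reduces to the one-step TD residual of Equation (7); with λ = 1 the sum telescopes to the Monte Carlo estimate of Equation (8).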

Hierarchical Control Framework
In the active tracking task, quadrotors use only the raw images captured by the onboard camera to execute proper subsequent actions. Since the raw images are very high-dimensional observations, designing an effective controller for quadrotors manually is challenging. Instead, we build a neural network controller and train it via deep RL. In theory, this learning method can map raw images to quadrotor motor commands directly. However, the deep RL method suffers from the problems of exploration and local optimality. Training such a complicated policy is not easy, even with sufficient interaction data between the quadrotor and the environment.
We propose a hierarchical control framework, as shown in Figure 1, to resolve this dilemma. The framework consists of two policy layers. In the high-level policy layer, the neural RL controller outputs the desired speed commands for the quadrotor, given three sequential raw images [O_{t−2}, O_{t−1}, O_t] as inputs. This high-level RL controller perceives the features of the dynamic target and makes high-level decisions to follow it. The high-level decisions consist of the desired linear speed along the head direction V_fb and the desired yaw speed V_yaw. In the low-level policy layer, a PID controller tracks the high-level decisions given the current altitude and speed and outputs the desired motor commands [u_1, u_2, u_3, u_4] for the quadrotor. This hierarchical framework enables us to focus on learning the high-level RL controller without concern for the low-level dynamic control of the quadrotor.

Figure 1. Hierarchical control framework for the quadrotor active tracking task.
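The two-layer control cycle described above can be sketched as follows; `rl_controller`, `pid_controller`, and `env` are hypothetical placeholder objects, not the authors' code:

```python
from collections import deque

def control_step(rl_controller, pid_controller, env, obs_buffer):
    """One control cycle: high-level speed commands, then low-level motors.

    obs_buffer is a deque(maxlen=3) holding the last three raw images.
    """
    # High level: three sequential raw images -> desired speed commands.
    v_fb, v_yaw = rl_controller.act(list(obs_buffer))  # [O_{t-2}, O_{t-1}, O_t]
    # Low level: the PID controller tracks the desired speeds while holding
    # altitude, producing the four motor commands [u1, u2, u3, u4].
    state = env.read_imu_and_altimeter()
    motors = pid_controller.track(v_fb, v_yaw, state)
    # Apply the motor commands and push the new observation into the buffer.
    next_obs = env.apply(motors)
    obs_buffer.append(next_obs)
    return motors
```

The point of the split is visible in the interfaces: the learned component only ever produces two scalars, while the conventional PID layer absorbs the quadrotor dynamics.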

Simulator Set-Up and Augmentation
Due to the data inefficiency issue of model-free RL, extensive interaction data with the environment are necessary to improve the high-level RL controller through trial and error. We cannot afford to train our RL controller for quadrotors in the real world. Instead, we set up a simulation environment where the quadrotors can gain infinite and high-fidelity training data.
Based on the virtual robot experimentation platform (V-REP) [21], we build an RL training environment for the quadrotor's active tracking task, as shown in Figure 2. In this environment, three main entities exist: a quadrotor equipped with a camera, an IMU, and an altitude ranging sensor; a dynamic person (the target); and a path along which the target walks from beginning to end. The camera is tilted down 30 degrees relative to the horizontal plane. The quadrotor uses sequential images from the camera as the observation and decides the desired speed commands with the RL controller to follow the dynamic target. Then, the quadrotor uses the IMU and altitude ranging sensor to obtain the current velocity and height information and applies the desired motor commands, computed by the PID controller, to track the desired speed commands while maintaining its altitude. Next, the environment returns the reward and the next observation. At the initialization of each training episode, the target person is in front of the quadrotor. To enhance the generalizability of the trained RL controller to various domains, we adopt the following environmental augmentation methods.

1. Visual randomization: We divide the appearance of the target person into five parts: hair, skin, shirt, trousers, and shoes. Then, we change the color of each part at every timestep, which helps the controller learn the dynamic target's more essential features rather than simply memorizing colors.

2. Speed randomization: To learn an RL controller that is less sensitive to the target's velocity, we change the walking speed of the target person at every timestep, sampling it uniformly from the interval [0, V_target].

3. Path randomization: To increase the robustness of the RL controller to the gestures and relative positions of the target, we randomly sample n control points between the beginning and the end of the path and then generate a smooth trajectory with a B-spline for every training episode. The target person then follows these random trajectories.
The simulation speed is the main concern when training with data-driven model-free RL methods. However, the original Python remote API of V-REP is not sufficiently fast. To accelerate the training process, we leverage PyRep [22] to break this data generation bottleneck. PyRep is built on top of V-REP and provides a flexible API and significant acceleration. Moreover, the simulation environments are easy to parallelize, so several workers can interact with different environments at the same time and accelerate the training even further.

RL Controller Architecture
The neural RL controller is an actor-critic style architecture, as shown in Figure 3.
Given three sequential raw images as observations, the actor network and the critic network share the perception layer, which consists of the spatial and temporal feature encoders. This shared perception layer is designed to extract the spatial and temporal features of the dynamic target. The rest of the actor network and the critic network calculate the continuous normalized desired speed commands [V̂_fb, V̂_yaw] and the estimated value of the observation, respectively. This end-to-end perception and control method maps the high-dimensional raw images to the high-level actions directly. The mapping is learned via deep RL in the virtual environment, which saves the effort typically required for ground-truth labeling in perception learning and for parameter tuning in combined perception and control. The layers with specific functions improve the interpretability of the RL controller.

Attention-Based Spatial Feature Encoder
In the active tracking task, the quadrotor should first know the location of the dynamic target. To reduce the computational cost of the RL controller, a compact neural network architecture with powerful representational capacity is necessary. We leverage the convolutional block attention module (CBAM) [27], which consists of a 1D channel attention module and a 2D spatial attention module. The channel attention module focuses on 'what' is meaningful in the input image, while the spatial attention module attends to 'where' the informative parts are located. Given a raw image input, CBAM helps extract the spatial information of the dynamic target of interest.

Feature Difference-Based Temporal Feature Encoder
For better adaptation to the target's different velocities, not only the spatial features, but also the temporal features of the dynamic target must be determined. A small optical flow network is used to incorporate explicit temporal information of the dynamic target in [28]. However, computing optical flow is still expensive, and we can encode the temporal information more efficiently. Unlike the latent flow method in [29], which fuses raw images and their differences directly, we focus on our target of interest and first extract the spatial features of three sequential raw images by the shared convolutional block attention module and then concatenate the spatial features and their differences. A 2D convolution layer processes these fused features further to extract the temporal information of the dynamic target.

Decision-Making and Value-Fitting Layer
The spatial and temporal feature encoders constitute the final perception layer, which encodes the spatial and temporal features of the dynamic target, followed by the standard RL actor-critic architecture. We adopt two fully connected (FC) layers for decision-making and value fitting. The value-fitting layer estimates the value function of the observation. The decision-making layer computes the mean of the continuous high-level actions μ_a = [a_1, a_2]^T. In the training stage, the high-level actions are sampled from the following:

a ∼ N(μ_a, Σ),

where Σ is the diagonal covariance matrix. The Gaussian noise is helpful for exploration. Then, the high-level actions are clipped into the normalized range:

a = clip(a, −1, 1).

The details of the RL controller's network architecture are presented in Appendix A.

Figure 4 shows the aerial view and abstraction of the training environment. The y-axis is set to the head direction of the quadrotor, and the x-axis points from the quadrotor's right to its left. The coordinate of the target in the quadrotor's body frame is denoted as (dx, dy); then, the yaw angle ϕ and distance ρ between the target and quadrotor can be calculated as follows:

ϕ = arctan(dx / dy),    ρ = √(dx² + dy²).
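The body-frame geometry and the sample-then-clip action rule can be sketched as follows; the sign convention of the yaw angle is an assumption chosen so that a target on the quadrotor's right (dx < 0, since x points left) yields a negative yaw error, matching the later analysis:

```python
import numpy as np

def yaw_and_distance(dx, dy):
    """Yaw angle and distance to a target at (dx, dy) in the body frame,
    with y along the head direction and x pointing to the quadrotor's left.
    Sign convention (an assumption): a target on the right gives phi < 0."""
    rho = np.hypot(dx, dy)
    phi = np.arctan2(dx, dy)  # angle of the target off the head direction
    return phi, rho

def sample_action(mu, sigma, rng=None):
    """Sample a high-level action from N(mu, diag(sigma^2)) and clip it
    into the normalized range [-1, 1]."""
    rng = rng or np.random.default_rng()
    return np.clip(rng.normal(mu, sigma), -1.0, 1.0)
```

Using `arctan2` rather than a plain ratio keeps the angle well defined even when the target is level with the quadrotor (dy close to 0).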

Reward Engineering
We want the quadrotor to follow the target at a fixed distance ρ_d while being oriented toward the target; thus, the following naive reward function is proposed, which encourages ρ = ρ_d and ϕ = 0:

r = α_1 exp(−β_1 (ρ − ρ_d)²) + α_2 exp(−β_2 ϕ²),

where α_1 > 0, α_2 > 0, β_1 > 0, β_2 > 0 are tunable hyperparameters. β_1 and β_2 control the gradient of the first and second exponential terms, respectively, and α_1 and α_2 are their combination coefficients.
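A reward of this form can be sketched as follows; the default hyperparameter values are placeholders, not the paper's tuned values:

```python
import numpy as np

def tracking_reward(rho, phi, rho_d=3.0, a1=1.0, a2=1.0, b1=1.0, b2=1.0):
    """Exponential tracking reward, maximal when rho == rho_d and phi == 0.
    a1, a2 weight the two terms; b1, b2 control how sharply the reward
    falls off with distance error and yaw error, respectively."""
    return a1 * np.exp(-b1 * (rho - rho_d) ** 2) + a2 * np.exp(-b2 * phi ** 2)
```

The reward peaks at a1 + a2 when both errors vanish and decays smoothly, so it always provides a nonzero gradient toward the desired pose.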

Auxiliary Segmentation and Motion-in-Depth Loss
While the RL controller's architecture is designed to have powerful spatial and temporal representational capacity, the RL controller is still difficult to train properly with a model-free RL method, since the input state is so high-dimensional that the training process is very data inefficient. Moreover, guided only by such a highly abstract reward signal, the perception layer in the RL controller may not learn the correct spatial and temporal features of the dynamic target.
To accelerate the training process and build a good representation in the perception layer of the RL controller, supervised learning is combined with the RL method. Two auxiliary losses are added to the training process, namely, the segmentation loss and the motion-in-depth loss. These auxiliary losses generate denser signals that facilitate the learning of spatial and temporal representations.

Auxiliary Segmentation Loss
In the spatial features encoder, the output latent features are supposed to represent the spatial information of the dynamic target in our design. To provide denser training signals, we add the auxiliary segmentation loss after the convolutional block attention module, as shown in Figure 3.
The output map of the convolutional block attention module given input O_n is denoted as Z_n, and the corresponding ground-truth segmentation map is denoted as G_n. Z_n is followed by the sigmoid activation function σ to predict the probability map P_n:

P_n = σ(Z_n).

Then, the auxiliary segmentation loss is the binary cross-entropy between the predicted probability p^n_{i,j} and the ground truth g^n_{i,j} for each pixel in Z_n:

L_Seg = −(1 / |Z_n|) Σ_{i,j} [ g^n_{i,j} log p^n_{i,j} + (1 − g^n_{i,j}) log(1 − p^n_{i,j}) ],

where g^n_{i,j} ∈ {0, 1}, and 1 corresponds to the target class while 0 corresponds to the background class.
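The auxiliary segmentation loss can be sketched as follows (a NumPy sketch; the probability clipping is an added detail for numerical safety, not part of the paper):

```python
import numpy as np

def aux_segmentation_loss(z, g, eps=1e-7):
    """Binary cross-entropy between sigmoid(z) and the ground-truth map g
    (1 = target pixel, 0 = background), averaged over all pixels."""
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation on the output map Z_n
    p = np.clip(p, eps, 1.0 - eps)  # numerical safety for the logarithms
    return float(-np.mean(g * np.log(p) + (1.0 - g) * np.log(1.0 - p)))
```

Every pixel contributes a training signal, which is exactly what makes this loss so much denser than the scalar tracking reward.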

Auxiliary Motion-in-Depth Loss
The temporal feature encoder in the RL controller is designed to encode the temporal information of the dynamic target. The temporal information should contain the change in the dynamic target's position that is parallel to the quadrotor's camera view plane and the relative depth change of the dynamic target that is vertical to the quadrotor's camera view plane. Among these, the depth change feature is more difficult to extract since the input resolution is set to be small for faster training, and the occupied resolution of the dynamic target is even smaller. To learn the dynamic target's position change vertical to the quadrotor's camera view plane, we need to introduce some auxiliary training signals at the end of the velocity perception layer.
For a dynamic target, suppose that the length projected on the image plane and the depth from the camera center at timestep t are l_t and d_t, respectively. The optical expansion s between two timesteps t_i and t_j, which indicates the relative scale change of the dynamic target, is

s^j_i = l_{t_j} / l_{t_i}.

The motion-in-depth τ between two timesteps t_i and t_j, which indicates the relative depth change of the dynamic target, is

τ^j_i = d_{t_j} / d_{t_i}.

Interestingly, the motion-in-depth is the reciprocal of the optical expansion [30], indicating that the two quantities are unified and can both represent the vertical position change information of the dynamic target. Because the motion-in-depth is easier to calculate, we adopt the auxiliary motion-in-depth loss as the additional training signal for the velocity perception layer, as shown in Figure 3.
In [30], the motion-in-depth τ is calculated as the ratio between the depths of corresponding points over two frames:

τ^j_i = d(x_j) / d(x_i),

where x_i represents the occupied pixels of the dynamic target in the first frame and x_j represents the correspondence of x_i in the second frame. We further simplify the calculation of motion-in-depth by ignoring the pixel matching over the two frames for fast training:

τ̂^j_i = d(y_j) / d(y_i),

where y_i and y_j are the occupied pixels of the dynamic target in the two frames at timesteps t_i and t_j, respectively. Due to the condition for training episode termination in our setting (listed in Section 4.1.1), the dynamic target is always in view of the quadrotor at training time, and d(y_j), d(y_i) can be obtained directly in V-REP; thus, τ̂^j_i is always available during training.
The output map of the velocity perception layer is denoted as W, and the motion-in-depth map over timesteps t − 2 and t is denoted as τ̂^t_{t−2}. We resize τ̂^t_{t−2} into M so that it matches the size of W and then calculate the auxiliary motion-in-depth loss as follows:

L_Dep = (1 / |W|) Σ_{i,j} (m_{i,j} − w_{i,j})²,

where m_{i,j} and w_{i,j} are the pixels of M and W, respectively. Note that we skip frame t − 1 because the motion-in-depth over two contiguous frames can be too small to guide the learning process effectively.
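The simplified motion-in-depth and the auxiliary loss can be sketched as follows; averaging the occupied-pixel depths and using a mean squared error are assumptions, since the paper does not spell out these aggregation details:

```python
import numpy as np

def motion_in_depth(depths_prev, depths_curr):
    """Simplified motion-in-depth over two frames: the ratio of the target's
    occupied-pixel depths, skipping per-pixel matching. Averaging the depths
    (an assumption) aggregates the per-pixel values into one ratio."""
    return float(np.mean(depths_curr) / np.mean(depths_prev))

def aux_depth_loss(w, m):
    """Pixelwise error between the velocity perception output map w and the
    resized motion-in-depth map m; mean squared error is an assumption."""
    return float(np.mean((m - w) ** 2))
```

A ratio above 1 means the target moved away from the camera between the two frames, and a ratio below 1 means it approached.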
With both the auxiliary segmentation and motion-in-depth losses, the total loss for updating the policy is the following:

L = L_PPO + C_Seg L_Seg + C_Dep L_Dep,    (20)

where C_Seg > 0 and C_Dep > 0 are tunable hyperparameters.

Environment Settings
The V-REP-based simulator is set up for the active tracking task. The observed state of the quadrotor is three sequential raw images s_t = [O_{t−2}, O_{t−1}, O_t] with a resolution of 64 × 64. Given the observed state s_t, the RL controller outputs the normalized desired speed commands, which consist of the normalized linear velocity along the head direction V̂_fb ∈ [−1, 1] and the normalized yaw speed V̂_yaw ∈ [−1, 1]. Then, the normalized desired speed commands are scaled to the final desired speed commands:

V_fb = C_fb V̂_fb,    V_yaw = C_yaw V̂_yaw,

where the scales C_fb and C_yaw are set to 0.8 m/s and 0.1 rad/s, respectively. The final desired speed commands are tracked by the PID controller while maintaining the altitude at 2 m. The desired distance between the quadrotor and the target ρ_d is 3 m.
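The command scaling can be sketched in a single helper, using the scales stated above:

```python
def scale_commands(v_fb_norm, v_yaw_norm, c_fb=0.8, c_yaw=0.1):
    """Scale normalized speed commands in [-1, 1] to physical commands,
    with C_fb = 0.8 m/s and C_yaw = 0.1 rad/s as given in the text."""
    return c_fb * v_fb_norm, c_yaw * v_yaw_norm
```

Keeping the policy output normalized and applying the physical scales afterward lets the same network be retuned for a different platform by changing only the two constants.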
At the start of each training episode, the quadrotor is 3 m behind the dynamic target (a virtual person), and the dynamic target is in the center of the quadrotor's view. The target's end position is set to be 10 m away from its starting position. During training, the visual, speed, and path randomization augmentation techniques of Section 3.2 are applied. The maximum speed of the dynamic target, V_target, is 0.5 m/s, and the number of sampled control points between the target's start and end positions, n, is 8. When one of the following conditions is satisfied, the current training episode is completed, and the next training episode begins:

1. The dynamic target is out of the view of the quadrotor.

2. The distance between the target and the quadrotor is outside the interval [2, 4] m.

3. The target reaches its end position.
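The termination check can be sketched as a single predicate mirroring the three conditions above:

```python
def episode_done(target_in_view, rho, target_at_goal, rho_min=2.0, rho_max=4.0):
    """True when any termination condition holds: the target leaves the
    quadrotor's view, the distance leaves [2, 4] m, or the target reaches
    its end position."""
    return (not target_in_view) or not (rho_min <= rho <= rho_max) or target_at_goal
```

Note that the distance condition is what guarantees the target stays in view during training, which in turn keeps the motion-in-depth label available at every training step.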

Implementation Details
We use the model-free PPO algorithm with auxiliary losses to train our active tracking RL controller. The modified PPO algorithm is summarized in Algorithm 1. Adam [31] is used to optimize the RL controller. The hyperparameters for the reward function, the modified PPO algorithm, and the optimizer are listed in Table 1. Our algorithm is implemented in PyTorch [32].

Ablation Studies in the Learning Process
We train our RL controller for the active tracking task through trial and error in the customized V-REP-based environment. We also perform ablation studies to evaluate the importance of the proposed auxiliary losses and the hierarchical control framework. Two groups of experiments are compared:

1. The proposed hierarchical active tracking control framework, in which the RL controller performs perception and high-level decision-making and the PID controller performs low-level dynamic control.

2. The end-to-end active tracking control framework, which maps raw image observations to the UAV's motor commands directly.

In each group, four additional experiments are also compared:

1. Training with both the auxiliary segmentation and motion-in-depth losses.

2. Training with the auxiliary segmentation loss but without the motion-in-depth loss.

3. Training without the auxiliary segmentation loss but with the motion-in-depth loss.

4. Training without either auxiliary loss.
The learning curves of all experiments are presented in Figure 5. As shown in Figure 5a, none of the experiments in the end-to-end active tracking control group achieves notable progress during the learning process, indicating that learning a direct mapping from high-dimensional observations to low-level motor commands is extremely complicated. With the hierarchical control framework, the learning of an active tracking controller is much more efficient, as presented in Figure 5b. We observe that training with both the auxiliary segmentation and motion-in-depth losses converges quickly and achieves good asymptotic performance. Training with only one of the two auxiliary losses suffers from weak stability, and training without any auxiliary loss is relatively data inefficient and cannot achieve good performance within 10^5 training episodes. Therefore, the proposed hierarchical control framework and auxiliary losses improve the data efficiency of model-free RL and should both be used in the quadrotor's active tracking task to achieve good and stable performance.

Comparison with Baselines
To highlight the effectiveness of our hierarchical active tracking controller, we compare our controller with two baseline controllers in an unseen scenario, as illustrated in Figure 6. The framework of the baseline controllers, which corresponds to a three-stage method, is presented in Figure 7. The baseline controller obtains the bounding box of the dynamic target using a passive tracker and calculates the desired speed commands to pull the bounding box to the center of the image with a high-level PID controller. Another low-level PID controller then tracks the speed commands. We do not use DNN-based passive trackers, such as SiamRPN [15], for perception in the baselines, considering that they need additional training with manual ground-truth labeling and that their models have far more parameters than ours (the backbone network ResNet-50, usually used in DNN-based trackers, has 25.5 M parameters, while our end-to-end perception and high-level decision-making controller has only 0.4 M in total). Instead, two widely used passive trackers, KCF [12] and MIL [13], are used as the perception components of the baseline controllers. In the test scenario, we apply different randomness to the color and speed of the dynamic target. The results of the validation experiments are presented in Figure 8. The bars show the results averaged over 10 episodes in the unseen test scenario, and the delimiters show the maximum and minimum performance. These results suggest that our method achieves significantly better performance than the baseline methods under different randomness conditions and is robust to the dynamic target's appearance and velocity. The failure of the traditional passive trackers within a few steps is probably due to the resolution of the observations and the dynamics of the quadrotor.
The 64 × 64 observation is already high-dimensional for the high-level RL controller, but it is still not sufficiently informative for the passive tracker to update the bounding box precisely. Moreover, the dynamics of the quadrotor and the observations are coupled: the movement of the quadrotor is realized by changing its attitude, which in turn changes the target's position in the observation and complicates stable active tracking. Our method can handle these problems.

Analysis of Simulation Results
The spatial and temporal feature encoders in the RL controller are designed to encode the dynamic target's spatial and temporal features, respectively. To evaluate these encoders, we visualize the output actions of the RL controller for a test episode in the unseen scenario, where the dynamic target changes its color and speed at every timestep.
The mapping from the ground-truth distance error δρ = ρ − ρ_d to the output normalized linear velocity V̂_fb is shown in Figure 9a. In general, the correlation between δρ and V̂_fb is positive, according to the fitted curve. However, the correlation is slightly weak due to the random velocities of the dynamic target: to keep the target at the fixed distance ρ_d, the quadrotor must change speed frequently. Figure 9b shows the mapping from the yaw angle error δϕ = ϕ to the normalized yaw speed V̂_yaw. We observe that when the dynamic target is to the right of the quadrotor (δϕ < 0), the RL controller forces the quadrotor to yaw right (V̂_yaw < 0). Similarly, the RL controller outputs the command to yaw left when the target is to the quadrotor's left. The mapping is almost linear in the interval δϕ ∈ [−0.1, 0.1] rad and becomes saturated beyond this interval. We also note some slow-down commands in the interval δϕ ∈ […].

To illustrate the effectiveness of our controller's temporal representation ability, we visualize the sequences of desired speed commands in the test episode. The desired normalized linear velocity V̂_fb and the ground-truth distance error δρ over time are presented in Figure 10a. When the distance error δρ increases, the output V̂_fb also increases. Similarly, V̂_fb decreases when δρ begins to drop. Moreover, the change in V̂_fb can adapt to different variation scales of δρ. These results indicate that the RL controller captures not only the spatial features but also the temporal features of the dynamic target. We can draw a similar conclusion from Figure 10b, which shows the sequences of the ground-truth yaw angle error δϕ and the desired normalized angular velocity V̂_yaw. By comparing the changes in δϕ and V̂_yaw over the [0, 350] and [350, 600] timestep ranges, we observe that the correlation between the change rates of δϕ and those of V̂_yaw is positive.
To better understand the proposed RL controller, we sample three sequences with a length of six timesteps during the test episode and then visualize the layers in the RL controller and the corresponding action commands in Figures 11-13.
The dynamic target changes its appearance, position, and speed at every timestep, as shown in Figures 11a, 12a and 13a. Given the sequence of raw observations, the RL controller's position perception layer can precisely segment the dynamic target from the background, as presented in Figures 11b, 12b and 13b. The spatial information of the dynamic target is extracted for further temporal sensing. The corresponding outputs of the velocity perception layer W_t in Figures 11c, 12c and 13c are highly abstract and fuse the spatial and temporal features of the dynamic target. We observe that the RL controller concentrates on the dynamic target when it is about to vanish from the quadrotor's view; otherwise, it concentrates on the overall information in the observation.
From timesteps t = 640 to t = 645, the dynamic target is to the left of the quadrotor, and the RL controller yields the maximum yaw-left command to pull the target back to the center of the view. From timesteps t = 720 to t = 725, the dynamic target is in front of the quadrotor and accelerates, so the RL controller adjusts the yaw command V̂_yaw slightly toward 0 and increases the linear speed command V̂_fb to keep up with the dynamic target. From timesteps t = 880 to t = 885, the target accelerates and moves to the quadrotor's right; the RL controller then outputs the maximum V̂_yaw and adjusts V̂_fb to follow the dynamic target. The sequences of action commands shown in Figures 11d, 12d and 13d and the corresponding observations indicate that the RL controller for the active tracking task is robust to a dynamic target with different appearances, positions, and speeds.

Conclusions
In this paper, we propose a hierarchical control framework for the UAVs' active tracking task. This framework combines a PID-based low-level controller with a high-level RL controller. The RL controller consists of a novel perception layer and a standard actor-critic layer, enabling end-to-end perception and high-level control with high-dimensional raw images as input. The perception layer encodes the spatial and temporal features of the dynamic target via the convolutional block attention module and spatial feature differences, respectively. The introduced auxiliary segmentation and motion-in-depth losses are essential for the RL controller's fast and stable learning, and no ground-truth labeling is required in the learning process. Simulation experiments show that, compared with conventional three-stage methods, our method is effective in the active tracking task and robust to the dynamic target's appearance and velocity.
Further research should be devoted to smoothing the RL controller so that it outputs energy-saving actions. More augmentation techniques should also be introduced to achieve stable tracking of various dynamic targets in complex environments and to mitigate the sim-to-real problem.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this paper are available on request from the first author.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Architecture of the Network
The network architecture of the RL controller is detailed in Table A1, where Conv2a + ReLU and Conv2b are shared in the channel attention module of the position perception layer, and Plus2a, Plus2b and Plus2c are the outputs of the shared position perception layer given the observations O t , O t−1 , O t−2 , respectively. The parameters kernel_size, stride, and padding are used for the height and width dimensions.