Object Tracking in Unmanned Aerial Vehicle Videos via Multifeature Discrimination and Instance-Aware Attention Network

Introduction
In recent years, low-altitude aerial photography based on unmanned aerial vehicles (UAVs) has been widely used as an effective supplement to aviation and satellite remote sensing for traffic data collection. This technology can clearly capture ground targets through airborne high-resolution cameras, and the imaging resolution can reach the centimeter level. Correspondingly, the amount of traffic image data generated by low-altitude UAV photography has increased dramatically. Processing such a large amount of data manually is both laborious and inefficient. Therefore, the intelligent processing of UAV imagery has gradually become a research hotspot. Single object tracking in UAV images, as one of these technologies, provides the basis for subsequent tasks such as vehicle traffic assessment and road conflict prediction. Single object tracking involves manually setting the initial position of the target in the first frame and then continuously predicting the target's bounding box in the subsequent frames. Although great progress has been made in recent years, robust tracking remains very challenging in UAV scenarios because of appearance variation, background clutter, and severe occlusion.

Related Work
UAV video object tracking is an important application of visual tracking technology. In recent years, two mainstream families of visual tracking methods have developed, one based on correlation filtering and the other based on deep learning. Here, we only introduce the works most related to ours. For a more comprehensive and detailed review, please refer to [15][16][17].

Discriminative Correlation Filter (DCF)-Based Tracker
The pioneering work of applying correlation filtering in visual object tracking is the minimum output sum of squared error (MOSSE) [1] filter method. Until 2013, the circulant structure of tracking-by-detection with kernels (CSK) [2] and the kernelized correlation filters (KCF) [18] based on MOSSE [1] used the property of circulant matrices to deduce closed-form solution and utilized fast Fourier transform (FFT) to carry out complicated calculations in the frequency domain rather than in the spatial domain, which minimized computational cost and achieved high tracking speed, usually hundreds of frames per second. At the same time, the kernel function was applied to map the samples to a high-dimensional space, which improved the tracking performance. Discriminative scale space tracking (DSST) [19] is a novel scale adaptive tracker which is implemented by learning separate discriminative correlation filters for translation and scale estimation. The explicit scale filter is learned online using target appearance sampled at a set of different scales. Bertinetto et al. [20] designed a tracking method for the deformation and illumination challenges that often occur in the tracking process. By using color features to characterize the targets, their experimental results showed that better performance was obtained for dealing with the deformation and motion blur problems in the tracking process. However, color feature is very sensitive to illumination changes. At this time, the histogram of gradient (HOG) feature is preferable which can obtain better tracking results in illumination variation cases. Spatially regularized discriminative correlation filters (SRDCF) [21] can effectively suppress the boundary effects by adding spatial regularization but destroy the closed-form solution. Therefore, the iterative Gauss-Seidel method is used to obtain the optimal correlation filter parameters, which makes the speed very slow but significantly improves the robustness.
For the task of visual object tracking in UAV videos, several algorithms have been proposed based on correlation filtering. In [22], a fast-tracking stability measurement metric was designed, based on the peak-to-sidelobe ratio values, which made the DCF algorithms more robust to complicated appearance variations. In [23], a novel approach to repress the aberrances happening during the detection process was proposed, i.e., aberrance repressed correlation filter (ARCF). By enforcing the restriction on the rate of alteration in response maps generated in the detection phase, the ARCF tracker suppresses aberrances, and thus is more robust and accurate for tracking objects. By integrating three kinds of attention, namely contextual attention, dimensional attention, and spatiotemporal attention, into the correlation filter tracking framework, a UAV tracker TACF [24] with multilevel visual attention was proposed, improving the robustness to challenging visual factors such as partial occlusion and clutter background.
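As an illustration of the peak-to-sidelobe ratio mentioned above, the following minimal sketch uses the common definition: the peak value minus the sidelobe mean, divided by the sidelobe standard deviation, where the sidelobe is everything outside a small window around the peak. The function name and the 11 × 11 exclusion window are assumptions made for illustration and are not taken from [22].

```python
import numpy as np

def peak_to_sidelobe_ratio(response, exclude=5):
    """Peak-to-sidelobe ratio (PSR) of a 2-D correlation response map.

    `exclude` is the half-width of the square window around the peak that is
    left out of the sidelobe statistics (5 gives an 11 x 11 window; this value
    is an assumption for illustration).
    """
    response = np.asarray(response, dtype=float)
    py, px = np.unravel_index(np.argmax(response), response.shape)
    peak = response[py, px]

    # Everything outside the window around the peak counts as sidelobe.
    mask = np.ones(response.shape, dtype=bool)
    mask[max(py - exclude, 0):py + exclude + 1,
         max(px - exclude, 0):px + exclude + 1] = False
    sidelobe = response[mask]

    # The small epsilon avoids division by zero for a perfectly flat sidelobe.
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)
```

A sharp, isolated peak yields a large PSR, whereas a response map dominated by noise or distractors yields a small one, which is why the value can serve as a rough tracking-confidence signal.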

Deep Learning (DL)-Based Tracker
In recent years, Siamese networks have drawn great attention in the visual object tracking community because of their balanced accuracy and speed. A Siamese network consists of two branches, which implicitly encode the original patches into another space and then fuse them with a specific tensor to produce a response map. SiameseFC [6] uses AlexNet [25] as the backbone network for feature extraction and first introduces the correlation layer as the fusion tensor, obtaining good performance. CFNet [3] adds a correlation filter to the template branch of SiameseFC [6], which makes the Siamese network shallower but more effective. However, both SiameseFC [6] and CFNet [3] need multiscale input for the multiscale test, which makes them unable to cope well with the constant change of target scale in UAV videos. Currently, these Siamese network-based trackers still have an accuracy gap compared with state-of-the-art algorithms, and they cannot benefit from deep networks such as ResNet-50 [26] or deeper.
Recently, a new residual module, called a cropping-inside residual (CIR) unit [27], has been proposed to eliminate the negative effect of padding, and to further improve the SiamFC [6] using the modules with controlled receptive field size and network stride. The limitation of a shallow network can be relieved by using the CIR unit [27], and the performance of the trackers based on the Siamese network is greatly improved. In the last two years, because of the rapid development of the target detection algorithms, researchers in the field of visual object tracking have tried to combine a target detection algorithm with a target tracking algorithm. SiamRPN [11] formulates the tracking problem as a local one-shot detection, through combining a Siamese network and a region proposal network (RPN), which obtains the target position by classifying the foreground and background of each region proposals, as well as predicting bounding-box regression. SiamRPN++ [12] proposes a simple yet effective sampling strategy to break the spatial invariance restriction of Siamese tracker and successfully train a ResNet-driven Siamese tracker with significant performance gain. Accurate tracking by overlap maximization (ATOM) [8] decomposes the tracking task into two tasks of classification and estimation. For the estimation task, it modifies IoU-Net [14] to perform target-specific prediction to estimate the target bounding-box, which is suitable for tracking tasks. For the classification task, it consists of a two-layer fully convolutional network head, which is trained online using a conjugate gradient-based strategy to guarantee processing speed.

Combination of DL-and DCF-Based Tracker
In recent years, we have witnessed continuous development of deep learning in the field of visual object tracking. Several works have attempted to combine deep features with traditional features to obtain a superior tracking performance. The authors of [28] used the semantic information of high-level features and the texture information of low-level features to obtain a tracker with better performance, where the feature maps from conv5-4, conv4-4, and conv3-4 of VGG-19 [29] were exploited. These features from three layers were resized to a uniform resolution, and then used as samples for training three correlation filters, respectively. Finally, three response graphs were merged by the linear weighting with fixed weights of 1, 0.5, and 0.02 as the final response graph, and the maximum response point was the position of the target. For the DeepSRDCF tracker, [30] carried out experiments on the features of different layers of VGG-19 and proved that the features of low-level convolutional neural networks (CNN) had a better effect on solving the tracking problem, and excessive semantic information was not necessary for tracking. It was concluded that the convolutional feature of the first layer of VGG-19 was the best. Therefore, the DeepSRDCF tracker [30] replaced the traditional features used in SRDCF [21] with the deep features extracted from the ImageNet-VGG-2048 network [31] which used principal component analysis (PCA) to reduce the dimension of the features. C-COT [32] employs an implicit interpolation model to solve the problem of multiresolution features fusion and its formulation enables efficient integration of multiresolution deep feature maps in the continuous spatial domain. ECO [33] is further optimized based on C-COT [32]. The author considered that the number of deep feature channels was quite large as well as massively redundant. 
Therefore, a factorized convolution operator was proposed to greatly reduce the useless feature channels for object tracking. Meanwhile, ECO [33] also proposed a conservative model update strategy and a compact generation model to improve the robustness of the algorithm in complex scenes. In [34], an attentional correlation filter network (ACFN) was proposed, which adopted an attentional mechanism to adaptively select the best fitting subset from all available correlation filters.

Two-Stage Tracking Structure
We decomposed the visual object tracking into two dedicated subtasks, i.e., target detection and bounding-box estimation, which was inspired by the framework adopted in seminal works [8,11,12]. The structure of the proposed method is shown in Figure 1, which contains two components, i.e., the target detection stage (TDS) and the bounding-box estimation stage (BES). TDS is used to locate the object based on discriminative correlation filter (DCF), and BES is used to estimate the accurate region of the target with an instance-aware IoU-Net. These two components comprise a two-stage object tracking framework for the typical challenges of background clutter, scale variation, and viewpoint changes in UAV videos. In the TDS, the fundamental function of the DCF tracker is to roughly locate the target position. In order to obtain a better performance and a more robust DCF tracker in the UAV scenarios, we used a variety of handcrafted features and multilevel deep features. Furthermore, considering that the size of the target in UAV videos was relatively small, we used the shallow network ResNet-18 [26] trained on ImageNet as the backbone for feature extraction, in order to avoid the loss of informative feature caused by pooling and large convolution steps in the deeper network. This stage is detailed in Section 3.2.
The BES accurately estimates the target bounding-box based on the proposals given by the TDS. Visual object tracking in a scene usually focuses on specific instances of a given object rather than a class of objects. Therefore, we designed two sub-networks, one for modeling the instance-based prior knowledge of the given object, and the other for estimating the IoU scores against different proposals in the current frame, namely instance-aware attention network and instance-aware IoU-Net, respectively. The BES accepts the following four inputs: (1) deep features of the current frame, (2) the region proposals generated from the current frame, (3) deep features of the reference frame, and (4) the target bounding box in the reference frame. It outputs the predicted IoU scores of the region proposals in each frame. In the tracking process, the final bounding box was obtained by maximizing the score value with gradient ascent. We discuss this stage in detail, in Section 3.3.

Target Detection Stage
Considering that the correlation filter-based trackers reached a better balance between computational efficiency and robustness of target discrimination, we explored a target localization model for UAV video tracking scenarios. The purpose of the discriminative correlation filter inference process was to learn multiple filters from a set of labeled samples $\{(x_k, y_k)\}_{k=1}^{T}$. Each training sample $x_k$ contained a $C$-dimensional feature map extracted from the region of interest (ROI) of the image. Assume that all samples have the same spatial size $W \times H$; then $x_k^c$ is used to represent the $c$-th feature layer of $x_k$. The desired filter $F$ contained a set of $W \times H$ convolution filters $f^c$, one learned for each feature layer. The response map $R(x_k)$ is obtained by convolving the filter $F$ with the sample $x_k$ of size $W \times H$, which is given by:
$$R(x_k) = \sum_{c=1}^{C} x_k^c \otimes f^c \qquad (1)$$
where $\otimes$ represents the circular convolution operation, $C$ denotes the total number of features extracted from the sample $x_k$, and $f^c$ represents the filter parameter of size $W \times H$ corresponding to the $c$-th feature layer. The filter parameters are obtained by minimizing the L2 error between the response map $R(x_k)$ on the training sample $x_k$ and the label $y_k$, which can be formalized as:
$$\min_{f} \sum_{k=1}^{T} a_k \left\| R(x_k) - y_k \right\|^2 + \lambda \sum_{c=1}^{C} \left\| f^c \right\|^2 \qquad (2)$$
where the coefficient $a_k \geq 0$ determines the influence of each training sample, and $\lambda \geq 0$ is the weight of the regularization term. The desired output $y_k$ is a two-dimensional Gaussian distribution in the frequency domain, which includes the labels for each position in the sample $x_k$. Equation (2) is a linear least-squares problem; therefore, we can use Parseval's theorem to convert it to the frequency domain and solve it there, which improves the speed of the algorithm. Considering the complexity of the UAV scene and the challenge incurred by the small target, we used a variety of feature fusion methods to construct a rich target representation.
Specifically, in the implementation process, on the one hand, we adopted two handcrafted features, a color histogram score [20] that was insensitive to spatial structure and a HOG descriptor that was robust to the illumination variation. On the other hand, we also used the features of the first and third convolution layers in ResNet-18 [26] to make full use of representation capabilities of the deep features. The handcrafted features complemented deep features to tackle the difficult problem of extracting tiny small target features from a deep neural network. It is worth noting that when the target and the scene changed only slightly between adjacent frames, the abovementioned DCF-based tracker could accurately estimate the new position of the target. However, in real UAV videos, there are also huge changes in viewpoint, illumination, background, and target scale. Therefore, we only used the target localization results at this stage as guidance and further combined the bounding-box prediction scheme in the next stage to accurately estimate the position of the target.
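To make the frequency-domain solution of Equation (2) concrete, the following minimal sketch trains a multi-channel correlation filter on a single sample and evaluates its response map. It assumes one training sample with unit weight and solves the ridge-regression problem independently at each frequency; the function names, the Gaussian label width, and the value of the regularization weight are illustrative assumptions, not the settings used in our tracker.

```python
import numpy as np

def gaussian_label(h, w, center, sigma=2.0):
    """Circular 2-D Gaussian label with its peak at `center` = (row, col)."""
    ys, xs = np.mgrid[0:h, 0:w]
    dy = np.abs(ys - center[0])
    dx = np.abs(xs - center[1])
    dy = np.minimum(dy, h - dy)  # wrap-around distance (circular model)
    dx = np.minimum(dx, w - dx)
    return np.exp(-(dx ** 2 + dy ** 2) / (2 * sigma ** 2))

def train_dcf(x, y, lam=1e-2):
    """Closed-form filter for one sample x of shape (H, W, C) and label y (H, W).

    By Parseval's theorem the objective decouples per frequency, giving
    F_c = conj(X_c) * Y / (sum_c |X_c|^2 + lam) in the Fourier domain.
    """
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    denom = np.sum(np.abs(X) ** 2, axis=2) + lam
    return np.conj(X) * Y[..., None] / denom[..., None]

def respond(F, x):
    """Response map: sum over channels of the circular convolution x_c (*) f_c."""
    X = np.fft.fft2(x, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(X * F, axis=2)))
```

Applying `respond` to a circularly shifted copy of the training sample shifts the response peak by the same amount, which is how the target displacement is read off from the response map.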

Bounding-Box Estimation Stage
IoU-Net [14] is used to predict intersection over union (IoU) values between the region proposals and the target bounding box, which shows superiority in object detection tasks. We migrated it to perform visual object tracking in UAV videos by pruning category prediction branches and using a Siamese network architecture. Considering that the low-level features of CNN contain more texture information, and the high-level features contain more semantic information, we combined the features of first convolutional layer and high-level features of the ResNet-18 [26] network to avoid the loss of target information. As shown in Figure 2, in order to make the network focus on the target rather than the background, we used the reference frame to generate an instance-aware attention value, which gave different weights to different feature channels. During the tracking process, instance-aware IoU-Net was guided by these weights to predict the IoU scores of the bounding-box proposals given by TDS. Then, the bounding-box proposals were refined by using gradient ascent maximizing IoU and the final bounding box was obtained by fusing the proposals whose IoU scores were the top three.
Remote Sens. 2020, 12, 2646

Instance-Aware Attention Network
As shown in Figure 2, the generation process of an instance-aware vector representing the prior information of an instance can be formalized as:
$$\omega(z, B_0) = \nu\big(W_{chn} \cdot \mathrm{PrPool}(\beta(z), B_0)\big) \qquad (3)$$
where $\beta(z)$ represents the convolutional features of a reference frame $z$; $B_0$ is the normalized initial bounding box in the reference frame; $W_{chn}$ denotes the weights obtained by channel attention; $\mathrm{PrPool}(\cdot)$ is the precise ROI pooling operation [14]; and $\nu(\cdot)$ represents the convolution, concatenate, and fully connected layer operations after feature channel attention is applied. Specifically, $\mathrm{PrPool}(\cdot)$ pools the features based on ROIs in each channel. It is calculated as the average pooling of the features $\beta$ in any given region $B$, i.e.,
$$\mathrm{PrPool}(\beta, B) = \frac{\int_{y_1}^{y_2} \int_{x_1}^{x_2} g(x, y)\, dx\, dy}{(x_2 - x_1)(y_2 - y_1)} \qquad (4)$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are the continuous coordinates of the top-left and bottom-right corners of the bounding box $B$, and $g(\cdot)$ is a continuous feature value obtained by performing a bilinear interpolation operation on $\beta$.
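The precise ROI pooling described above, i.e., the average of the bilinearly interpolated feature over a continuous box, can be evaluated exactly because the bilinear kernel is a product of two one-dimensional hat functions whose integrals have a closed form. The sketch below is a minimal NumPy illustration for a single output bin. It assumes that the feature value with index (j, i) sits at the integer coordinate (x = i, y = j) and that features outside the grid are zero; the function names are hypothetical.

```python
import numpy as np

def hat_cdf(t):
    """Integral from -inf to t of the hat function max(0, 1 - |u|)."""
    t = np.clip(t, -1.0, 1.0)
    return np.where(t < 0, 0.5 * (t + 1) ** 2, 1 - 0.5 * (1 - t) ** 2)

def prpool(feat, box):
    """Exact average of bilinearly interpolated features inside `box`.

    feat: array of shape (C, H, W); box: (x1, y1, x2, y2) in continuous
    feature-map coordinates. Returns one pooled value per channel.
    """
    x1, y1, x2, y2 = box
    _, h, w = feat.shape
    # Integral of each column's / row's hat function over the box extent.
    wx = hat_cdf(x2 - np.arange(w)) - hat_cdf(x1 - np.arange(w))
    wy = hat_cdf(y2 - np.arange(h)) - hat_cdf(y1 - np.arange(h))
    integral = np.einsum('j,cji,i->c', wy, feat, wx)
    return integral / ((x2 - x1) * (y2 - y1))
```

Because the result is a smooth function of the box coordinates, it can be differentiated with respect to them, which is what allows a box to be refined by gradient ascent on a score computed from the pooled features.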
The channel attention module assigns weights to the feature channels by calculating $W_{chn}$ to suppress feature channels that are not related to the target. This channel attention module contains three operations. Firstly, each channel performs a global pooling operation (implemented by max pooling in this paper), which gives the results a global receptive field [35] and characterizes the global response distribution of the feature channels. Then, considering the flexibility and simplicity of feature modeling, we used two cascaded fully connected layers, the first compressing the number of channels and the second restoring it to the original value, to further learn a set of parameters that model the correlation between feature channels. Next, a convolution operation was performed on the weighted features, with the size of the convolution kernel consistent with the feature dimensions of each channel, to obtain a feature with a dimension of $1 \times 1 \times C_z$. We performed the above operations on the low-level and high-level convolutional features, respectively, and used the concatenate operation, as well as the fully connected layer, to obtain an instance-aware vector.
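The three operations of the channel attention module can be sketched in a few lines of NumPy. This is only a schematic forward pass with externally supplied weights: the ReLU and sigmoid nonlinearities, the reduction ratio, and all function and parameter names are assumptions in the spirit of squeeze-and-excitation style attention, since the text above does not specify them.

```python
import numpy as np

def channel_attention(feat, w1, b1, w2, b2):
    """Channel weights and re-weighted features for feat of shape (C, H, W).

    w1: (C // r, C), b1: (C // r,), w2: (C, C // r), b2: (C,), where r is an
    assumed reduction ratio.
    """
    # 1) Global max pooling: one scalar per channel.
    pooled = feat.max(axis=(1, 2))
    # 2) Two cascaded fully connected layers: compress, then restore channels.
    hidden = np.maximum(w1 @ pooled + b1, 0.0)            # ReLU (assumed)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid (assumed)
    return weights, feat * weights[:, None, None]

def full_size_conv(weighted, kernel):
    """3) Convolution whose kernel covers the whole spatial extent.

    weighted: (C, H, W); kernel: (C_out, C, H, W). Output: (C_out,), i.e. a
    1 x 1 x C_out feature.
    """
    return np.einsum('ochw,chw->o', kernel, weighted)
```

Channels whose weight is close to zero contribute little to the resulting vector, which is the mechanism by which channels unrelated to the target are suppressed.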

Instance-Aware IoU-Net
As shown in Figure 2, this sub-network is used to evaluate the IoU score of the $i$-th bounding-box proposal $B_t^i$ in the current frame $x_t$, which can be formulated as follows:
$$S_{IoU}(B_t^i) = p\big(\omega \cdot \mathrm{PrPool}(\beta(x_t), B_t^i)\big) \qquad (5)$$
where $\beta(\cdot)$ represents the convolutional features of sample $x_t$; $\omega$ contains the prior information of the instance, i.e., the instance-aware vector obtained by Equation (3) in the reference image; and $p(\cdot)$ represents the operations of a subsequent convolutional layer and a fully connected layer. In particular, this sub-network is similar to the instance-aware attention network. The convolutional features of the current frame are first subjected to a convolution operation to adapt to the specific tracking task. Then, the features are input to the PrPool layer along with the bounding-box proposals; because the PrPool layer operates in continuous space, the IoU score can subsequently be maximized to obtain more accurate tracking results. We used the concatenate operation on different convolutional features of the sample, and applied a fully connected layer to fuse the features and predict the IoU score. It is worth noting that, since the proposals given by the TDS may not completely include the target during the tracking process, the hyperparameters of the PrPool layer in the instance-aware IoU-Net adopted larger values than those of the instance-aware attention network in order to retain as much target information as possible.
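The refinement of proposals by maximizing a predicted score can be illustrated with a toy example. In the sketch below, the score function is an arbitrary callable standing in for the network output, and its gradient is approximated by central differences rather than by backpropagation. The step size, the default step count, and the choice of averaging the three best boxes as the fusion rule are assumptions made for illustration.

```python
import numpy as np

def refine_boxes(score_fn, proposals, n_steps=5, step_size=0.1,
                 eps=1e-3, top_k=3):
    """Gradient-ascent refinement of box proposals, then top-k fusion.

    score_fn maps a box (x1, y1, x2, y2) to a scalar score; it is a stand-in
    for the predicted IoU. Returns (fused box, refined boxes, final scores).
    """
    boxes = np.array(proposals, dtype=float)
    for _ in range(n_steps):
        for n in range(len(boxes)):
            grad = np.zeros(boxes.shape[1])
            for d in range(boxes.shape[1]):
                delta = np.zeros(boxes.shape[1])
                delta[d] = eps
                # Central-difference estimate of d(score) / d(coordinate d).
                grad[d] = (score_fn(boxes[n] + delta)
                           - score_fn(boxes[n] - delta)) / (2 * eps)
            boxes[n] += step_size * grad
    scores = np.array([score_fn(b) for b in boxes])
    best = np.argsort(scores)[::-1][:top_k]
    # Fusion rule (assumed): mean of the top-k refined boxes.
    return boxes[best].mean(axis=0), boxes, scores
```

With a differentiable network, the finite-difference loop would be replaced by a single backward pass with respect to the box coordinates.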

Offline Training
In order to train the network, we minimized the error between the prediction value $S_{IoU}$ in Equation (5) and the ground truth, given the image patch pairs. For the training dataset, we used the recently released large tracking datasets, namely TrackingNet [36] and VisDrone2018 [37]. In our proposed two-stage architecture, the backbone network used for deep feature extraction was the ResNet-18 [26] network pretrained on ImageNet. Because the TDS is updated online, only the parameters of the instance-aware attention network and instance-aware IoU-Net are optimized by offline training. For clarity and completeness of the BES description, we briefly introduce the configuration of the deep neural network below, as shown in Tables 1 and 2. Our instance-aware attention network consists of two sets of symmetrical structures, and its configuration and parameters are shown in Table 1. The convolution kernel size of all convolutional layers is set as 3 × 3, the hyperparameters of the two PrPool layers are 5 × 5 and 3 × 3, respectively, and the parameters of the fully connected layers are as shown in Table 1.
As shown in Table 2, our instance-aware IoU-Net is mainly composed of four convolutional layers with a convolution kernel size of 3 × 3, two PrPool layers with hyperparameters of 7 × 7 and 5 × 5, respectively, and a fully connected layer with a parameter of 521 × 1. All convolutional layers and the fully connected layer are followed by BatchNorm and ReLU. During the training process, a pair of image patches is sampled from different frames of a video, and the interval between the two frames is set below 40. Specifically, for the branch that obtains the instance-aware vector, we take the target as the center and sample a square image patch on the original image with $5^2$ times the size of the target. Then, we resize the cropped regions to a fixed size of 288 × 288. For the IoU score prediction branch, we use the same strategy to sample a patch, but add some perturbations to the position and scale to simulate the movement of the target. For each pair of image patches, Gaussian noise is added to the ground truth to generate 16 candidate images, and the minimum IoU between them and the ground truth is 0.1. Throughout the training process, we use image flipping and color dithering for data augmentation.
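The generation of noisy candidates around the ground truth can be sketched as rejection sampling. The example below interprets the candidates as perturbed bounding boxes and keeps only those whose IoU with the ground truth is at least 0.1. The noise parameterization (Gaussian jitter of the center and log-normal jitter of the size), the noise level, and the function names are assumptions, since only the candidate count and the minimum IoU are stated above.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def sample_candidates(gt, n=16, min_iou=0.1, sigma=0.3, rng=None,
                      max_tries=10000):
    """Draw up to n perturbed boxes whose IoU with gt is at least min_iou."""
    rng = np.random.default_rng() if rng is None else rng
    cx, cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    out = []
    for _ in range(max_tries):
        if len(out) == n:
            break
        # Jitter the center in units of the box size; jitter size in log space
        # so that width and height stay positive.
        ncx = cx + rng.normal(0, sigma) * w
        ncy = cy + rng.normal(0, sigma) * h
        nw = w * np.exp(rng.normal(0, sigma))
        nh = h * np.exp(rng.normal(0, sigma))
        box = np.array([ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2])
        if iou(box, gt) >= min_iou:
            out.append(box)
    return np.array(out)
```

Each accepted candidate, together with its IoU against the ground truth, would form one regression target for the IoU prediction branch.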

Online Tracking
The TDS first crops a 150 × 150 image patch as the target search area according to the target position in the previous frame, and then extracts the HOG features with a cell size of 4 in this area, giving a feature map of size 37 × 37 × 28. Then, the TDS extracts the color histogram with 32 bins, and the size of the feature map is 32 × 32 × 32. Next, the TDS samples a square patch based on the target position in the previous frame, whose area is about 5 times that of the target, and inputs it to ResNet-18 [26] to obtain the deep features of the target. The correlation filtering operation is then performed on each feature map to obtain the response maps. All response maps are linearly interpolated to the same resolution of 75 × 75 and fused to obtain the final position of the target. The fusion method has not been studied in depth here; only linear weighting is used. For the BES, the instance-aware vector is calculated by using Equation (3) in the first frame, and its dimension is 1 × 1 × 256. For subsequent frames $x_t$, the target bounding-box proposals are evaluated by the instance-aware IoU-Net. Note that all the network weights of the BES are fixed while performing online tracking. We employ five gradient ascent iterations to maximize $S_{IoU}$, and finally obtain the tracking result by fusing the results of the top three proposals.
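The interpolation of the per-feature response maps to a common 75 × 75 resolution followed by linear weighting can be sketched as follows. The separable linear interpolation with aligned corners, the fusion weights, and the function names are illustrative assumptions; the text above only specifies the target resolution and that the fusion is a linear weighting.

```python
import numpy as np

def resize_bilinear(resp, size):
    """Resize a 2-D response map to (size, size) by separable linear interpolation."""
    h, w = resp.shape
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    # Interpolate along x for every row, then along y for every column.
    tmp = np.stack([np.interp(xs, np.arange(w), row) for row in resp])
    return np.stack([np.interp(ys, np.arange(h), col) for col in tmp.T], axis=1)

def fuse_responses(responses, weights, size=75):
    """Weighted sum of resized response maps and the location of its peak."""
    fused = sum(wgt * resize_bilinear(r, size)
                for r, wgt in zip(responses, weights))
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (int(peak[0]), int(peak[1]))
```

For example, a 37 × 37 HOG response and a 32 × 32 color-histogram response can be passed together with two weights; the returned peak is expressed in the common 75 × 75 grid.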

Datasets
We provide a comprehensive evaluation of our proposed tracker on challenging tracking datasets, including UAV123 [15] and UAVDT [16]. UAV123 [15] is an extremely large dataset which contains 123 aerial video sequences comprising more than 110 K frames. These sequences contain 12 common attributes for visual tracking. UAVDT [16] is composed of about 80 K representative frames which are selected from 10 hours of raw videos and fully annotated with bounding boxes for the following three fundamental computer vision tasks: object detection, single object tracking, and multiple object tracking. For the single object tracking task, there are 50 video sequences, and 8 attributes are annotated for each sequence.

Metrics
In this paper, we use a one-pass evaluation (OPE) process mentioned in [17] as the metric, which means to run trackers throughout a test sequence with initialization from the ground-truth position in the first frame and report the success rate plot or the precision plot. The success rate plot shows the ratios of number of successful frames to the total number of frames in the sequence when the thresholds varied from 0 to 1. The precision plot shows the percentage of frames whose center location error is within the given threshold. In the success rate plot, we use the area under curve (AUC) to rank all the tracking algorithms. In the precision plot, we use the precision value when the threshold of location error is set as 20 pixels to sort all the trackers.
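The two scores used for ranking can be computed from per-frame predictions as in the following sketch. It assumes boxes in (x, y, w, h) format, 21 evenly spaced overlap thresholds between 0 and 1, and the mean of the success curve as the AUC; these conventions are common in one-pass evaluation toolkits but are assumptions here, as are the function names.

```python
import numpy as np

def iou_xywh(pred, gt):
    """Per-frame IoU for boxes given as rows of (x, y, w, h)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def success_auc(pred, gt, n_thresholds=21):
    """Success curve (fraction of frames with IoU above each threshold) and its mean."""
    overlaps = iou_xywh(pred, gt)
    thresholds = np.linspace(0, 1, n_thresholds)
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return curve.mean(), curve

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is within `threshold` pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return (np.linalg.norm(pc - gc, axis=1) <= threshold).mean()
```

Both functions operate on one sequence; dataset-level scores are typically obtained by averaging the per-sequence curves.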

Experimental Details
Our proposed tracker uses the ResNet-18 [26] pretrained on ImageNet as the backbone network, with all of its parameters fixed during training. The parameters of our instance-aware IoU network architecture are initialized using [14] and are obtained by minimizing the error of the prediction value in Equation (5) with ADAM [38]. We train for 40 epochs with 64 image pairs per mini-batch. For the ADAM optimizer, the initial learning rate is set as 0.02 and is decayed by a factor of 0.2 every 10 epochs. Our proposed method is implemented on the PyTorch 0.4.1 platform, using Ubuntu 16.04 with an Intel Xeon(R) E5-2602 v4, 16G RAM, and Nvidia RTX 2080Ti.
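The step-wise learning rate decay described above can be written as a one-line function. This is a plain-Python sketch with a hypothetical function name; in PyTorch the same schedule is commonly expressed with `torch.optim.lr_scheduler.StepLR` using `step_size=10` and `gamma=0.2`.

```python
def learning_rate(epoch, base_lr=0.02, gamma=0.2, step=10):
    """Learning rate for a zero-based epoch index under step decay."""
    return base_lr * gamma ** (epoch // step)

# Learning rate used in each of the 40 training epochs.
schedule = [learning_rate(e) for e in range(40)]
```

Under this schedule the learning rate takes four distinct values over the 40 epochs, one for each block of 10 epochs.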

Attribute-Based Evaluation:
We also perform an attribute-based analysis for the proposed method on UAV123 [15]. For the 123 videos in this dataset, each video sequence is annotated with 12 attributes, including aspect ratio change (ARC), background clutter (BC), camera motion (CM), fast motion (FM), full occlusion (FOC), illumination variation (IV), low resolution (LR), out-of-view (OV), partial occlusion (POC), similar object (SOB), scale variation (SV), and viewpoint change (VC). As shown in Table 3, in scale variation cases, DCF-based trackers use multiscale inputs to solve the scale change problem, but these methods are not optimal because the scale change of objects is continuous. Benefiting from our proposed instance-aware IoU-Net, we obtained an AUC score at least 15.4% higher than that of the DCF-based trackers and 10.4% higher than that of the deep learning-based trackers. Meanwhile, in the video sequences with other attributes, the proposed tracker has also achieved a higher performance. It is worth noting that the performance of the proposed method for the similar object (SOB) attribute is not significantly improved as compared with other attributes, which shows that our model is less effective at distinguishing objects with similar appearance than at distinguishing targets from backgrounds.
Table 3. Tracker performance analysis based on challenge attributes for the proposed method and the state-of-the-art methods on the UAV123 dataset [15]. Red, green, and blue fonts indicate the first, second, and third place.

Evaluation on UAVDT Benchmark
Overall Evaluation: Figure 4 shows the overall performance of our proposed tracker compared with several state-of-the-art trackers on UAVDT [16]. In addition to the seven trackers mentioned in Section 4.2.1, two further trackers that combine DCF and deep learning, i.e., C-COT [32] and CF2 [28], are included in the comparison experiments. Specifically, on the UAVDT [16] dataset, the precision of our method reaches 78.0% when the location error threshold is 20 pixels, which is 2.8% and 5.5% higher than that of the second and third best trackers, i.e., SiamDW (75.2%) and MDNet (72.5%), respectively. The AUC score reaches 56.6%, which is 10.2% higher than those of the second (SiamDW, 46.4%) and third best trackers (MDNet, 46.4%).
Attribute-Based Evaluation: Again, we perform an attribute-based analysis of the proposed method on UAVDT [16]. The video sequences are annotated with 8 attributes: background clutter (BC), camera rotation (CR), object rotation (OR), small object (SO), illumination variation (IV), object blur (OB), scale variation (SV), and large occlusion (LO). As shown in Table 4, in the case of small objects, the AUC score of the proposed method reaches 51.8%, which is 5% higher than that of ARCF-HC, a tracker designed for UAV target tracking; this gain benefits from the fusion of shallow and deep features. In the case of scale variation, we improve the AUC score by at least 16.7% compared with the DCF-based trackers, by 13.5% compared with the deep learning based trackers, and by 17.4% compared with the trackers that combine DCF and deep learning. The proposed tracker also ranks higher on the video sequences with the other attributes.
Figure 4. Comparison results of several state-of-the-art trackers on the UAVDT dataset [16]. The values next to the method names represent the AUC or the precision when the location error threshold is set to 20 pixels. (a) The success rate plot; (b) the precision plot.
Table 4. Tracker performance analysis based on challenge attributes for the proposed method and the state-of-the-art methods on the UAVDT dataset [16]. Red, green, and blue fonts indicate the first, second, and third place.
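The two scores reported in these plots can be illustrated with a short sketch. It assumes boxes given as (x, y, w, h), a precision score defined as the fraction of frames whose center location error is within 20 pixels, and a success AUC approximated as the mean success rate over 21 evenly spaced overlap thresholds. The exact threshold grid and the strictness of the comparisons are assumptions; the evaluation toolkit used for the reported numbers may differ in these details.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_at(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_auc(preds, gts, steps=21):
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


# Two-frame toy example: one perfect prediction, one that misses by 30 pixels.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (30, 0, 10, 10)]
precision = precision_at(preds, gts)
auc = success_auc(preds, gts)
```

In this toy example, half of the frames fall within the 20-pixel threshold, so the precision is 0.5, while the success AUC rewards only the frame whose box overlaps the ground truth.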

Efficiency Analysis
To validate the effectiveness of the proposed method, we evaluate its performance on UAV videos that exhibit challenges such as small objects, viewpoint changes, scale variation, and camera motion. As shown in Figures 5 and 6, our method achieves better robustness and practicability in handling those cases. For example, in the S010 sequence, KCF [18] loses the target in the early stage and SiamDW [27] fails to cover the body of the object. CF2 [28] produces a bounding box that covers the target but is much larger than the target, while our tracker performs well over the entire sequence. Because the target in this sequence is very large at the beginning and shrinks to one seventh of its original size within only 100 frames, methods that rely solely on multiscale input to handle scale change easily lose the target. The same situation occurs in the Person21 sequence. The S0103, S0307, and S1301 sequences contain different degrees of viewpoint change and object rotation. Although most of the trackers do not lose the target, the accuracy of their predicted target size is uneven, while our method achieves more robust results. In the Car15 sequence, where two trucks cross each other, trackers are likely to be distracted by the similar object. SiamDW [27] loses the target in the early stage because it is based on an offline end-to-end architecture, whereas our method has both an online positioning module and an offline-trained bounding-box prediction module and can therefore stay locked on the target.
Screenshots of the tracking results on the UAVDT dataset [16]. From left to right and top to bottom: the videos S0101, S0103, S0307, and S1301.
In addition to the satisfactory accuracy of the tracking results, our method can run in real time on a general-purpose single-board GPU computing platform. The comparison results in terms of average frames per second (FPS) and milliseconds per frame (MSPF) are shown in Table 5. The proposed method achieves 25.6 FPS and 39.1 MSPF, which meets the requirements of real-time processing. This is because our proposed method adopts a correlation filtering operation, which achieves high tracking speed. Notably, real-time performance here refers to a UAV tracker that runs on a general-purpose GPU parallel computing platform, which is suitable for real-time tracking of offline video and wireless backhaul video. The existing airborne computing platforms of miniaturized low-altitude drones still have difficulty meeting the board-level real-time processing requirement.
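The two speed figures are two views of the same measurement, since the time per frame in milliseconds is 1000 divided by the frame rate. A minimal check, where `ms_per_frame` is a hypothetical helper name:

```python
def ms_per_frame(fps):
    """Convert a frame rate in frames per second to milliseconds per frame."""
    return 1000.0 / fps


# 25.6 FPS corresponds to roughly 39.06 ms per frame,
# which agrees with the reported 39.1 MSPF after rounding to one decimal.
mspf = ms_per_frame(25.6)
```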

Ablation Studies
To validate the effectiveness of our method, we compare two variants of our method with the original method that integrates all modules on the UAV123 [15] dataset, as shown in Table 6. Specifically, the TDS locates the target by only using the correlation filtering algorithm without multiscale input.
TDS + BES-NoIA uses the correlation filtering algorithm to locate the target and generates 16 candidate regions, and then uses IoU-Net to predict the IoU score of each candidate region. In this process, the coordinates of the candidate regions are refined through five gradient optimization steps, and the final bounding box is obtained by fusing the results of the top three proposals. It should be noted that the IoU-Net in TDS + BES-NoIA does not include the instance-aware vector. TDS + BES is our proposed two-stage object tracking method, which uses the correlation filter to locate the target position and then uses the instance-aware IoU-Net to predict the scale of the target and obtain the final bounding box.
Table 6. Inner module ablation study comparing the proposed method with two of its variants on the UAVDT dataset [15]. The AUC is the overall area under curve score. The precision is the value when the location error threshold is set to 20 pixels. The others are attribute-based AUC scores.
As shown in Table 6, TDS + BES obtains 7.4% and 12.4% improvements over TDS + BES-NoIA and TDS, respectively, according to the success plot AUC on the UAVDT dataset. Moreover, TDS + BES-NoIA has a 5.0% improvement over the method using only TDS. Hence, our two-stage object tracking method is more effective, and the instance-aware network further improves the accuracy compared with the method using only IoU-Net.
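The bounding-box estimation step described above (16 candidate regions, five gradient optimization steps, fusion of the top three proposals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: an analytic IoU against a hypothetical reference box stands in for the learned IoU-Net score, gradients are taken by finite differences instead of backpropagation, candidates are produced by random jitter, the fusion is a plain average, and a step is accepted only if it does not lower the score. All function names and numeric values are hypothetical.

```python
import random


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def make_candidates(box, n=16, jitter=0.1, seed=0):
    """Jitter position and size of the coarse box (assumed sampling scheme)."""
    rng = random.Random(seed)
    x, y, w, h = box
    return [
        (
            x + rng.uniform(-jitter, jitter) * w,
            y + rng.uniform(-jitter, jitter) * h,
            w * (1.0 + rng.uniform(-jitter, jitter)),
            h * (1.0 + rng.uniform(-jitter, jitter)),
        )
        for _ in range(n)
    ]


def refine(box, score_fn, steps=5, lr=1.0, eps=1e-3):
    """Gradient ascent on score_fn using central finite differences."""
    box = list(box)
    for _ in range(steps):
        grad = []
        for i in range(4):
            hi = box[:]
            lo = box[:]
            hi[i] += eps
            lo[i] -= eps
            grad.append((score_fn(hi) - score_fn(lo)) / (2.0 * eps))
        scale = [box[2], box[3], box[2], box[3]]  # step relative to box size
        trial = [box[i] + lr * scale[i] * grad[i] for i in range(4)]
        # Safeguard (an assumption of this sketch): keep only non-worsening steps.
        if trial[2] > 0 and trial[3] > 0 and score_fn(trial) >= score_fn(box):
            box = trial
        else:
            lr *= 0.5
    return tuple(box)


def estimate_box(coarse_box, score_fn, n_candidates=16, top_k=3):
    """Refine all candidates, then average the top_k highest-scoring boxes."""
    refined = [refine(c, score_fn) for c in make_candidates(coarse_box, n_candidates)]
    refined.sort(key=score_fn, reverse=True)
    best = refined[:top_k]
    return tuple(sum(b[i] for b in best) / len(best) for i in range(4))


target = (100.0, 80.0, 50.0, 30.0)        # hypothetical reference box
score_fn = lambda box: iou(box, target)   # stand-in for the learned IoU predictor
coarse = (95.0, 84.0, 60.0, 36.0)         # hypothetical first-stage output
candidates = make_candidates(coarse)
refined = [refine(c, score_fn) for c in candidates]
final_box = estimate_box(coarse, score_fn)
```

The point of the sketch is that the box size is adjusted continuously by following the score gradient, rather than being chosen from a fixed set of scales as in multiscale input.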

More specifically, in the case of scale variation (SV), TDS + BES-NoIA achieves about 5.5% higher performance than TDS alone. This shows that IoU-Net tracks targets with scale variation better than multiscale input does. Introducing the instance-aware IoU-Net improves the performance further: the final version of our method achieves a 7.8% improvement over TDS alone. For the sequences with other attributes, the AUC score of the final version is also higher than those of the other two variants, which confirms that the instance-aware IoU-Net, by introducing weights on the feature maps, makes the tracker less sensitive to various interferences than TDS + BES-NoIA.
In addition, the handcrafted features and the deep features are tested separately while using the instance-aware IoU-Net. The results are shown in Table 7, where DCF_DH_IOU stands for the original method, DCF_D_IOU represents a method using only deep features, and DCF_H_IOU represents a method using only handcrafted features. The method using only handcrafted features has the lowest AUC score on the video sequences of all attributes except the small target case. On the video sequences of six attributes, the AUC scores of the original method are all higher than those of the method using only deep features.
Table 7. Feature ablation study comparing the proposed method with two of its variants on the UAVDT dataset [15]. The AUC is the overall area under curve score. The precision is the value when the location error threshold is set to 20 pixels. The others are attribute-based AUC scores.

Conclusions
In this paper, we exploit a two-stage scheme that combines a detection-based network (IoU-Net) with a DCF-based tracker for object tracking in UAV videos. We integrate multilayer deep features and handcrafted features to build a discriminative target representation, and use the instance-aware attention network to further embed prior knowledge of the target. In the tracking process, the DCF-based tracking stage is responsible for locating the target, and the instance-aware IoU-Net is responsible for predicting the refined target bounding box. We extensively evaluate the proposed tracker on two UAV tracking datasets (i.e., UAV123 [15] and UAVDT [16]). The experimental results show that our approach improves the overall performance by a large margin and runs at over 25 FPS on a general-purpose single-board GPU computing platform.
We summarize the potential directions for improving the proposed method and clarify our future work. First, in the bounding-box estimation stage, the instance-aware IoU-Net only gives the IoU score between the ground-truth bounding box and the candidate objects, and the spatial position relationship between the two is not accurately described. This problem eventually causes the online update of the correlation filter to be disturbed by the background area around the target. In the future, we should improve the loss function to suppress the abnormal expansion of the bounding box during the optimization process. Second, our method has not fully utilized the inter-frame temporal information in the video, which leads to a lack of robustness when interfering targets with similar appearance exist. In the future, we plan to explore a prior information embedding method that incorporates temporal context into the instance-aware attention network.

Conflicts of Interest:
The authors declare no conflict of interest.