Benchmarking Deep Trackers on Aerial Videos

In recent years, deep learning-based visual object trackers have achieved state-of-the-art performance on several visual object tracking benchmarks. However, most tracking benchmarks are focused on ground level videos, whereas aerial tracking presents a new set of challenges. In this paper, we compare ten trackers based on deep learning techniques on four aerial datasets. We choose top performing trackers utilizing different approaches, specifically tracking by detection, discriminative correlation filters, Siamese networks and reinforcement learning. In our experiments, we use a subset of OTB2015 dataset with aerial style videos; the UAV123 dataset without synthetic sequences; the UAV20L dataset, which contains 20 long sequences; and DTB70 dataset as our benchmark datasets. We compare the advantages and disadvantages of different trackers in different tracking situations encountered in aerial data. Our findings indicate that the trackers perform significantly worse in aerial datasets compared to standard ground level videos. We attribute this effect to smaller target size, camera motion, significant camera rotation with respect to the target, out of view movement, and clutter in the form of occlusions or similar looking distractors near tracked object.

Reinforcement learning (RL). In CF based trackers, correlation filters are learned to match the target distribution aiming for a response that is Gaussian distributed. The implementation of the CF trackers is usually done in the Fourier domain for computational efficiency and filters are updated during online tracking. However, the filter update reduces the speed of the tracking procedure [23,28,29]. SN-based trackers approach tracking as a similarity learning problem, where matching is done in feature space. SN trackers are trained offline and are not updated online, which is efficient but may reduce tracking performance [30][31][32]. TD-based trackers treat tracking as a binary classification problem that aims to separate the target from the background. Multiple patches are taken in the target frame (tth frame) near the target location in the previous frame ((t − 1)th frame) and the patch with the highest score is selected as the target patch [33][34][35]. RL-based trackers learn an optimal path to the tracked object, either by moving the predicted location to the target location or by learning hyperparameters for tracking [36,37]. Evaluating tracking algorithms requires large and diverse datasets with annotated ground truth of the target location at every frame. Additionally, attribute annotation is important to fully assess tracker performance in different challenging situations, such as occlusion, illumination variation, etc. There are several tracking benchmarks in the literature [10,26,27,[38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54]. Two of the most popular tracking benchmarks are the OTB [27,38] and the VOT challenge [26,[39][40][41][42][43] datasets. In our study we benchmark various trackers on aerial datasets, another important category of datasets that consist of sequences taken from aerial platforms. We consider a video to be aerial if the camera is on an airplane or unmanned aerial vehicle (UAV). To include a greater variety of datasets, we broaden this definition to include cases where the camera is located anywhere above the ground ranging from a roof to a satellite.
There are some key factors that make aerial tracking different from tracking in standard ground level videos. In the aerial videos, the area covered by the field of view of the camera is usually much larger than that of ground level videos. More importantly, in aerial videos, the tracked object is much smaller in size in terms of pixels. Due to the smaller object size, it is more difficult to generate discriminative features, which adversely affects tracking performance. The object's size and viewpoint can change significantly and quickly in aerial videos. Additionally, the tracked object may be occluded for a long time and even disappear for several frames. Camera motion often causes abrupt changes in the object appearance and may result in out-of-frame movement for the tracked object. These characteristics present a unique set of challenges for aerial tracking compared to standard tracking on ground level videos. Initial work by Minnehan et al. [55] indicated that the performance of deep trackers varies significantly when tracking in aerial videos. Popular trackers, such as tracking by detection MDNet [33], or CF-based CCOT [29], were not as effective when tracking in aerial videos due to occlusion, smaller target size, and camera motion [55]. Our study differs from [55] in various aspects. We consider a larger number of more recent trackers from four general groups. We perform our benchmarking on multiple datasets and conduct detailed evaluation for various attributes of aerial imagery.
Several review papers exist in the literature that overview visual object tracking research [9,10,56,57]. Li et al. [56] studied deep learning trackers based on network architecture, network training, and network function and ran experiments on OTB100, TC-128 [46], and VOT2015 [41] to compare different deep learning based trackers. Fiaz et al. [57] performed an extensive review that compared various trackers based on different feature extraction methods. Trackers based on both deep learning and hand crafted features were evaluated on benchmarks, such as OTB 2015, OTB 2013 [38], TC-128, OTTC [57], and VOT 2017 [39]. VOT challenges compare many different trackers and provide a good overview of the performance of recent trackers.
However, these review papers and tracker benchmark studies deal with datasets that are ground based. In this work, we focus on aerial tracking because the conditions encountered vary significantly from ground level situations. The main contributions of this paper can be described as follows.

•
We focus on aerial tracking using videos taken from aerial platforms. To our knowledge, this is the first comprehensive benchmarking study of visual object tracking on aerial videos. • We benchmark ten recent deep learning based trackers from four tracking groups. • We consider four different aerial benchmarks and compare the trackers' performance in various challenging situations which provides a better understanding of the state-of-the-art of visual object tracking on aerial videos.
The rest of the paper is organized as follows. In Section 2, we discuss the relevant tracking algorithms we incorporated in our benchmarking. In Section 3, we discuss the datasets used in our experiments and the evaluation metrics. In Section 4, we show the evaluation results on different benchmarks and discuss the comparison among different trackers on different benchmarks as well as specific attributes present within the datasets. In Section 5, we present final remarks based on our evaluation.

Tracking Algorithms
In this section, we overview the tracking algorithms based on their approach to tracking and network architectures. We only selected trackers that use deep learning features and considered their performance and source code availability.
A brief description follows for the four types of trackers outlined in Figure 1.

TD-Based Trackers
Tracking by detection frameworks view tracking as a foreground (target) vs. background classification problem. TD-based methods have been used for some time and continue to be popular with deep trackers in recent years [33][34][35][58][59][60][61][62]. In these frameworks, the networks generally learn the region of interest by sampling multiple patches from the input image. Positive and negative instances within the training samples are selected based on the intersection-over-union (IoU) score with the ground truth. Online training is usually done in these networks to improve the classification accuracy during tracking.
The TD-based trackers that we included in our benchmarking are MDNet [33], DAT [35], Meta-Tracker [59], and RT-MDNet [34]. MDNet was the winner of the VOT2015 challenge and was used as a baseline tracker for the other TD-based trackers. DAT improved over MDNet using reciprocation learning. The Meta-Tracker improved over MDNet using a meta-learning approach and RT-MDNet improved the real time performance of MDNet. Short descriptions of these trackers are given below.

Multidomain Network Tracker (MDNet)
In visual object tracking, there are some desirable properties for target representation learning such as invariance with respect to illumination, scale, perspective, and motion blur. The goal of using multidomain learning is to learn a discriminative model that learns a shared representation of the target in various domains. To achieve this, MDNet is trained offline with large set of video sequences, where each sequence is considered as a domain.
The MDNet architecture is shown in Figure 2. In the network shown, the domain specific layers are shown as FC6_1 to FC6_n where all the preceding layers are considered as shared layers. During tracking, all branches of the sixth fully connected layer are removed and replaced with a single fully connected layer, where online adaptation is performed by fine-tuning the fully connected layers. For precise localization of the object during tracking, a bounding box regression technique is used on bounding boxes with high scores. DAT learns to attend a specific foreground class from the background by using reciprocation learning. Here, reciprocation learning refers to utilizing the backpropagated gradient to aid the learning procedure. It is achieved by passing the input image through the network and gathering the backpropagated gradients in the input layer which are later used as a regularization term with the classification loss. The MDNet architecture is used for the implementation. However, VOT benchmark training is no longer required for this tracker. Instead, for the first three convolutional blocks, VGG-M weights pretrained on ImageNet [63] are used and never updated. Similarly to MDNet, the fully connected layers are updated during online fine tuning. During this process, multiple patches are sampled at every input frame. The sampled patches are passed through the network and an attention map is obtained by taking the partial derivative of the classification score with respect to the input image at each patch. This attention map regularizes the original loss function through an additional term, with the goal of better classification where the map localizes the target. The objective is to maximize the mean and minimize the standard deviation of the attention map corresponding to the true class and do the inverse for the background. Eventually a hyperparameter is added to combine the two losses. Evaluation on OTB benchmarks showed the effectiveness of this tracking algorithm.

Meta-Tracker (Meta-SDNet)
Meta-SDNet improves over the baseline MDNet tracker using a meta-learning approach for online adaptation taking into consideration the uncertainty during tracking. Meta learning refers to a few shot learning procedure, where an algorithm adapts in a new environment by learning either model parameters, or a metric, or proper optimization techniques. In the meta-learning process, the network parameters are learned such that uncertainties in future frames can be minimized by better modeling the target appearance without overfitting the recent target appearances. Specific video sequences suitable for training were selected from a large scale video detection dataset and used to learn the appropriate parameters for meta-learning. The first three layers of Meta-SDNet are based on the pretrained VGG16 framework and used as feature extractors. The last three fully connected layers were randomly initialized and trained with the Adam optimizer. RT-MDNet improves over MDNet tracking framework by incorporating ROI alignment technique from Fast-RCNN [64]. The technique allowed to construct a high resolution feature map. The channel activation was based on large receptive field. An adaptive ROI alignment layer was added after the convolutional layers and before the fully connected layers in the original MDNet architecture. This feature extraction method improves the computational complexity of the overall tracking process. Dilated convolution was used to extract high resolution feature maps, which improved the quality of the representation of the target in feature space. Modified bilinear interpolation was considered for the adaptive ROIAlign layer instead of linear interpolation which changed the tracking performance significantly. Another contribution was combining instance embedding loss with the classification loss to discriminate between similar foreground targets in multiple domains. The network was tested on OTB 2015 and UAV123 datasets [47] and performed comparable to the state-of-the-art trackers in real-time.

CF-Based Trackers
A popular method for visual object tracking is learning Discriminative Correlation Filters (DCF) to predict the location of the tracked object in a patch [65][66][67][68][69][70][71][72][73][74][75][76][77][78][79]. A basic correlation filter based tracking framework is shown in Figure 3. Generally, a large patch around the tracked object is cropped at tth frame during tracking. Any feature extraction technique may be used to extract features from the cropped patch. Then the features are utilized to learn a bank of correlation filters that generates a Gaussian response map at the desired target location. Based on the response map, the bounding box of the tracked object is predicted. However, the filter learning procedure is performed at various time instances. Generally, a few frames and the corresponding target locations are saved based on the tracking confidence and utilized during the filter learning procedure.
CF-based trackers aim to find the filters f where the template x is given from a patch with ideal response map y, typically modeled by a Gaussian distribution. For example, the Kernelized Correlation Filter (KCF) [20] utilized a Gaussian kernel to model the target response. The best filter parameters are computed as follows.
However, CFs have limited detection range and may perform poorly when the object undergoes deformation. The patch size and the filter size must be equal, which often makes the tracker learn the background within the patch for irregularly shaped objects. If the patch is small, then in cases of occlusion the object may not be redetected after reappearing. Therefore, it is important to incorporate regularization in the CF-based tracking framework.
Danelljan et al. proposed Spatial Regularization for learning the DCFs (SRDCF [23]). The objective was to weaken the responses due the background information by spatially modifying the filter coefficients. The background is suppressed by assigning higher values of the filter coefficients which are outside of the target bounding box and vice versa. The filter parameters f * are learned using Equation (2) where α t describes the impact of each training sample and w is the spatial regularization term.
An improved version of SRDCF is the Continuous Convolution Operation Tracker (CCOT [29]), where filters are learned for multiple resolutions of target patches in the continuous domain. These filters are then used to produce multiple resolution feature maps. However, CCOT suffers from the large number of filters that are required to be learned to capture the target representation. Another limitation is that the tracker updates at every frame, which causes overfitting to the most recent target appearance. There are many variations and improvements to CF trackers since SRDCF and CCOT. We selected the following CF based trackers in our benchmarking. The Efficient Convolution Operators (ECO) [28], the Spatial Temporal Regularized Correlation Filter (STRCF) [80], and the Multi-Fusion Tracker (MFT) [81]. ECO was the winner tracker of VOT2016 challenge. MFT was the winner tracker of VOT2018 challenge. STRCF is also one of the top performing CF-based trackers. A brief overview of these approaches is given below.

Efficient Convolution Operators (ECO)
The ECO formulation is based on discriminative correlation filters with a factorized convolution operator introduced to reduce overfitting, a generative model to estimate the training sample distribution, and an efficient model update strategy. The base framework of CCOT had many filters that contained negligible energy, and these were eliminated in ECO to make the training more efficient. Then, filters are reproduced as the linear combination of the learned filters. The Gauss-Newton method was used to optimize the quadratic loss using conjugate gradient method. The factorized convolution operation reduces the computational and memory complexity of the tracker. To improve overfitting compared with CCOT, a new sample space was introduced based on Gaussian mixtures to obtain a representative sample set. A model update strategy was also introduced to reduce overfitting based on updating the model every N s frames, where this parameter was identified by heuristics. Small values of N s may result in overfitting, whereas large values reduce the convergence speed of the optimization. The base network model was VGG, where features from the first and last convolutional layers were used along with HOG and Color features. Finally, the comparison was done on the tracking benchmarks and the model achieved state-of-the-art performance on the VOT2016 challenge.

Multi-Fusion Tracker (MFT)
The Multi-Fusion Tracker (MFT) improves upon the baseline SRDCF tracker by using a motion model and hierarchical feature selection with adaptive fusion. Typically, motion models are ignored in CF and TD trackers. However, MFT utilized a motion model to improve over partial occlusion and bounding box estimation. For the motion model, Kalman filtering was used, which effectively reduces the noise of the bounding box center locations from the DCF predictions. Another important distinction is that the online training was formulated to learn independent correlation filters for multilevel CNN features. It was determined that early layer features are better for adapting the scale changes for small deformation, but deeper features are better for adapting with larger deformation. Additionally, middle-level features are the most representative of the target scale, as the deeper features are prone to drifting towards similar objects. This shortcoming is solved by fusing multilevel features from the network. For these multilevel features, adaptive independent correlation filters were learned using conjugate gradient method. The model was updated at a fixed frame interval instead of every frame following ECO. Then, the outputs of the multiple independent hierarchical filters are fused using an adaptive weighting scheme, where the center location of the target bounding box is extracted from the feature map. Scale change is achieved using a multiscale search strategy of the image patches, after cropping based on the motion estimation model which predicts the center location of the patch. The MFT algorithm outperforms the baseline ECO architecture on the OTB dataset.

Spatial Temporal Regularized Correlation Filter (STRCF)
Spatial Temporal Regularized Correlation Filter (STRCF) tracking incorporates both spatial and temporal regularization on the DCF framework. In a comparative study on different sequences where the target appearance varies significantly, STRCF outperforms SRDCF due to its effective appearance modeling. To solve for the filter parameters, the Alternating Direction Method of Multipliers (ADMM) was used to achieve closed form solutions. The algorithm can operate in real-time when using handcrafted features. With the temporal regularization term, the objective function to update the filters f * is given by Here, f t−1 represents the filters used in the (t − 1)th frame and µ denotes the hyperparameter for regularization. This expansion to online passive-aggressive algorithm [82] improves over SRDCF in two ways: (i) better model updating with multiple samples and (ii) better occlusion handling by passively updating the correlation filters. Evaluation results on Temple-color, VOT-2016, and OTB 2015 benchmarks showed that this algorithm achieved state-of-the-art accuracy.

SN-Based Trackers
Siamese networks contain twin branches with shared weights and are widely used in visual object tracking [30,[83][84][85][86][87][88][89][90][91][92][93]. The objective of Siamese networks is to learn a shared representation of the input images in a similarity learning fashion. Generally, two input images are fed to the network in each of the branches, and both branches of the network are updated at the same time with shared weights. Bertinetto and Valmadre et al. [30] proposed a fully convolutional Siamese architecture for tracking (SiamFC) that has a template branch and a search window branch. The network was trained with the ImageNet video detection dataset [63], where two different frames in a sequence are cropped and resized such that the area A of the resized patch is where w and h are the width and height of the corresponding target bounding box, p is the context amount which is set to p = (w + h)/4, and s is the scale factor. Finally, cross-correlation is applied in feature space to get the final response map. Multiples scales and aspects are considered to deal with the scale and aspect ratio changes. Training the network is done using positive and negative pairs with logistic loss where v is the score for one pair of patches and y ∈ {+1, −1} is the target value. The overall network is trained using SGD with average loss. One limitation of SiamFC is that it does not always find the tightest bounding box around the target and the resulting localization accuracy is not as good compared to the CF-based trackers. Li et al. [32] incorporated the Faster R-CNN [94] with the SiamFC architecture. The bounding box proposal generation and bounding box regression from Faster R-CNN improved the overall performance of the tracking framework. He et al. [95] proposed a twofold Siamese network architecture where semantic appearance information is encoded to get a better response map. Zhu et al. [31] further improved SiamRPN using distractor-aware training. They also used a local to global search strategy to improve tracking during occlusion.
Among SN-based trackers, we select SiamFC [30] and DaSiamRPN [31]. SiamFC was the winning tracker of the VOT2017 realtime challenge and DaSiamRPN was the winning tracker of the VOT2018 realtime challenge. Furthermore, these trackers are the backbone of many other state-of-the-art SN-based trackers [83][84][85][86]. Brief descriptions about these trackers are provided below.

Fully Convolutional Siamese Tracker (SiamFC)
SiamFC casts the tracking problem as a similarity learning problem and uses fully convolutional branches for feature embedding. The SiamFC network architecture is shown in Figure 4. Two input images of different size are fed into the two branches of the network and the same transformation is applied to both images. Keeping the target at the center, the first frame of the sequence is cropped and resized to pass through one of the branches of the Siamese network. The first frame of the sequence is kept fixed in one of the branches in the network. All the other frames of the sequence are cropped, resized, and passed through the second branch of the network. After getting the feature embedding from both branches, a correlation layer is used to find the correlation between two images in the latent space. In the final correlation map, the bounding box is determined in the detection frame based on the similarity score. In the SiamFC architecture, binary cross-entropy loss with SGD is used during training. The network is trained offline on the 2015 version of ImageNet Large Scale Visual Tracking Challenge [63] dataset and no online update is made during tracking. One drawback of this approach is that it uses an expensive multiscale test to adapt for changes in the object's scale, which is not very efficient and does not capture the scale change very well.

Distractor-Aware Siamese Region Proposal Networks (DaSiamRPN)
The initial SiamRPN [32] architecture utilizes a key concept in Siamese networks, where the goal is to learn an embedding space maximizes the distance between the inter-class objects and minimizes the distance between intra-class objects. SiamRPN introduces several new concepts including a Region Proposal Network (RPN) and one-shot learning. The region proposal network is utilized on top of the base Siamese network adopted from the original SiamFC architecture for feature extraction. The DaSiamRPN network improves the training procedure by increasing the number of positive examples to reduce the imbalance between positive and negative examples. The original SiamRPN is trained on Youtube-BB dataset [96] consisting of 200,000 video sequences which are annotated every 30 frames. In DaSiamRPN, the ImageNet Detection [63] and COCO Detection [97] datasets were augmented into image pairs, and used to train the network. Semantic negative pairs were included, instead of considering hard negative pairs as in object detection. Additionally, motion blur images were used for training.
Distractors were collected using non maximum suppression to avoid redundant candidates. Among all the predicted bounding boxes, the box with highest score was set as the target and boxes with score greater than a threshold were chosen as distractors. This approach works well for short-term tracking, but it may fail for long-term tracking, as the search region is not large enough to cover the entire range within the image where the object may reappear. DaSiamRPN tried to solve this issue by using local-to-global search strategy when tracking failure occurs and keeping track of the number of frames in which the target is not found. For the training architecture, a modified AlexNet pretrained on ImageNet was used. Evaluation on multiple benchmarks showed that the tracker achieves state-of-the-art performance.

RL-Based Trackers
In RL-based approaches an agent learns to find an optimal path in an environment from its own experience using feedback. The agent generally receives observations in discrete time steps with rewards and chooses an action from a set of available options. This process continues until convergence. Reinforcement learning algorithms are extensively used in game theory and have been considered for visual object tracking as well [36,37,[98][99][100][101]. For example, in the HP [99] tracker, Dong et al. introduced a novel hyper parameter selection technique which can learn sequence specific hyperparameters using continuous deep Q-learning. Supancic et al. [100] formulated the tracking problem as an online partially-observable Markov decision making process (POMDP) where, instead of heuristic target initialization, an optimal decision making policy is learned to update the appearance model. Among the RL-based trackers, the Actor-Critic [37] tracker outperformed the other trackers based on the evaluation results provided in the corresponding papers. We have included the Actor-Critic tracker in our benchmarking and provide a short description below.

Actor-Critic Tracking (ACT)
ACT is a reinforcement learning tracker that can operate in real-time. The two main components of the framework are the Actor and the Critic models. The actor network moves the bounding box to the target location and the critic network guides the learning process during offline training. The process is guided by calculating the Q-value using reinforcement learning to train both the Actor and the Critic networks. A modified deep deterministic policy gradient algorithm is used to effectively train the model. During online tracking, the Actor model employs a dynamic search framework to learn the position of the target and the Critic model verifies the position to make the tracker more robust. A pretrained VGG architecture was used to initialize the Actor and Critic networks. In the Critic network, Q learning was done using the Bellman equation in Q-learning, while the Actor network learns using chain rule. During training, the samples were generated from translation and scaling of the bounding box, where the scale was sampled from a Gaussian distribution centered by the object location. The ImageNet Videos were used to train the Actor network so that for each iteration 20 to 40 frames were randomly chosen for training. The tracker achieves 30 fps speed with performance comparable with state-of-the-art trackers on popular tracking benchmarks.
The DTB70 dataset contains 70 sequences of UAV collected data where the bounding boxes are drawn manually. Some of the sequences are collected from YouTube. Different types of camera motion, including translation and rotation, are incorporated to make the dataset more challenging. Three types of targets appear in the videos: human, animal, and rigid objects.
The UAV123 dataset contains 123 video sequences taken from UAV platforms. Note that we excluded the seven synthetic sequences from the UAV123 dataset for our evaluation. A subset of the UAV123 dataset, UAV20L, is also evaluated for long-term tracking analysis.
The attributes of the aerial datasets are listed in Table 1. These attributes make aerial tracking more challenging and their annotations are available within the corresponding datasets. The comparison of different trackers with these attributes provide better understanding of their performance under different tracking scenarios. Comparisons across UAV123 and DTB70 will indicate the generalizability of different trackers. For long-term tracking, an important consideration is consistent performance in a long temporal span, which tests the tracker's ability to create a robust model and perform efficient model updates. Some trackers may drift a little from one frame to the next, which may not be noticeable in short term sequences, but eventually could result in target loss during long-term tracking.
Regarding the datasets, there are inherent differences between the OTB aerial subset and other aerial datasets in terms of resolution, attributes, and size of the objects to be tracked. In the standard OTB sequences, the objects are much larger and occupy a larger portion of the image frame, while in aerial datasets such as DTB70, UAV123, and UAV20L, the object to track takes a smaller portion of the image because the camera is higher and covers a larger area. Another important distinction of the DTB or UAV sequences is that the camera rotates around the object, whereas none of the videos in OTB sequences have this attribute. Additionally, the OTB benchmark sequences have minimal camera jitter and less clutter in the background. Tracker codes were obtained from their Github repository at the URLs provided in Table 2. None of the algorithms in our implementation were trained from scratch. All of the trackers were implemented in a server workstation with NVIDIA TITAN-V GPU. The CF-based algorithms were implemented using MATLAB and MATConvNet, whereas the other trackers are implemented using Python and PyTorch. Readers are referred to the Github pages in Table 2 for further implementation details.
All the trackers were evaluated using one pass evaluation (OPE) introduced by the OTB benchmark evaluation procedure. The trackers were initialized in the very first frame and never re-initialized after a tracking failure. All the trackers were set to provide the output in the OTB format ([tclx, tcly, w, h]), where tclx and tcly are the coordinates of the top left corner of the bounding box, respectively, and w and h are the width and height of the box, respectively.
To evaluate tracker performance on the aerial benchmarking datasets, we examined visual examples as well as results based on evaluation metrics. To assess overlap performance, successful tracking is considered if the predicted bounding box and the groundtruth bounding box have an intersection over union (IOU) overlap greater than or equal to some threshold (e.g., 0.5). The tracker is evaluated for different thresholds and the success vs. threshold plot is obtained. The area under the curve (AUC) is computed based on the success vs. threshold plot and the trackers are ranked based on this value.

Results and Discussion
In this section, we have present our benchmarking results. In Figure 5, the overlap success plot is shown for all benchmark datasets. The AUC, computed for each of the trackers, is indicated in brackets in the legends. The results show that DaSiamRPN outperforms the other trackers in three out of four datasets. In Figure 6, the overlap success plot is shown for various attributes in the UAV123 dataset. The AUC is computed and shown in the legends. This plot shows how well various trackers perform in specific challenges such as occlusion, out of view, fast motion, etc. It can be seen from these results that DaSiamRPN does the best in most of the challenges. In Table 3, tracker performance (in terms of AUC) is compared for the ground based OTB dataset and the aerial datasets. The results show that even though the AUC for all trackers is high for OTB datasets, their performance was significantly reduced in the aerial datasets.  In terms of precision, the success rate of each tracker is evaluated based on the center to center distance, between the predicted bounding box and the groundtruth bounding box, compared to some predefined threshold in pixels. The center distance threshold is swept to find the precision vs. threshold plots. The precision values for a threshold of 20 pixels are shown in the brackets of the legends in the precision plots. Figure 7 shows the precision plots for all the benchmark datasets. It is seen that DaSiamRPN outperforms the other trackers in terms of precision as well. In Figure 8     In Table 4, the AUC of overlap success is shown for various attributes present in the UAV123 dataset. In Table 5, the precision at 20 pixel threshold is shown for the UAV123 dataset attributes. Both DaSiamRPN and DAT trackers perform well in these challenges. In Table 6, the AUC of overlap success is provided for various attributes in the DTB70 dataset. In Table 7, the precision at the 20 pixel threshold is provided for different challenges present in the DTB70 dataset. DaSiamRPN achieved better results in terms of both overlap success and precision in most of the challenges of the DTB70 dataset.
Visual results on the aerial datasets are shown in Figure 9. These illustrate the challenges for trackers due to the small target size, image rotation change in zoom level, and presence of similar looking distractors. The speed comparison is provided in Figure 10 on the UAV123 dataset. The fastest tracker is DaSiamRPN which runs above 200 fps and the slowest tracker is DAT which runs below 1 fps. The ECO and DAT trackers achieve good tracking performance at lower speeds.  Table 1. Top performing tracker is shown in red and the second best performer is shown in blue. Best viewed in color.   Table 6. Overlap results for various DTB70 dataset attributes listed in Table 1. Top performing tracker is shown in red and the second best performer is shown in blue. Best viewed in color.

Overall Comparison
The overall quantitative comparison of the trackers is shown in Table 3 for all benchmark datasets. The results for the OTB aerial subset are slightly worse than the overall OTB results. As the OTB subset does not exhibit all the attributes of the other aerial datasets, tracker performance remains similar with respect to the overall ground level OTB dataset. The evaluation plots in Figures 5 and 7 showed that the ECO tracker performs the best and STRCF is the second best for the OTB subset.
Tracker performance degrades consistently for the aerial datasets compared to ground level tracking. For the DTB70 dataset, the DaSiamRPN and MFT trackers perform the best for overlap and precision respectively. The DaSiamRPN tracker performs the best for the UAV123 dataset and UAV20L datasets. The distractor aware training, region proposal network for better IoU, and local to global search for redetection make the DaSiamRPN the best overall performer for aerial tracking.

Attribute Comparison
Attributes are annotated in the datasets, as shown in Table 1, and are used to evaluate the tracker performance under various challenging conditions. In different sequences, one or several attributes may be present. The evaluation of trackers based on specific video attributes is shown in Figures 6 and 8 and Tables 4 and 5 for UAV123 and in Tables 6 and 7 for DTB70. These results provide insights in the performance of the trackers under various conditions.
The aspect ratio change (ARC-UAV123) and the aspect ratio variation (ARV-DTB70) attributes specify the target's aspect ratio change over the temporal span of the sequence. The evaluation shows that the DaSiamRPN outperforms all the other trackers on this challenge. The region proposal network helps DaSiamRPN achieve better performance in terms of aspect ratio change. Note that in this specific challenge, in the UAV123 dataset, CF-based trackers and MDNet-based trackers perform similarly, whereas CF-based trackers perform relatively well compared to the MDNet-based trackers in the DTB70 dataset. We conclude that the model update strategy of the CF-based trackers help to achieve relatively better performance compared to he MDNet-based trackers.
The Background Clutter (BC) attribute is present in both datasets. This attribute specifies instances where the target is hardly distinguishable from the background. Based on the evaluation, DAT outperforms the other trackers for UAV123 and SiamFC does best for DTB70, for both overlap and precision. Note that SiamFC performs the worst for UAV123 sequences where BC is present.
Camera motion (CM-UAV123), fast motion (FM-UAV123), and fast camera motion (FCM-DTB70) attributes are evaluated. These attributes specify faster relative speed between the camera and the target. The DaSiamRPN tracker outperforms the other trackers for both datasets. However, MDNet based trackers performed comparatively well in the UAV123 dataset when camera motion is involved. However, when the relative motion is fast, the CF-based trackers outperform the MDNet based trackers in both datasets. We attribute this to the better model update strategy of the correlation filter based trackers.
Full occlusion (FOC-UAV123), partial occlusion (POC-UAV123), and Occlusion (OCC-DTB70) attributes are also evaluated. In POC, the target becomes partially occluded and then reappears without full occlusion, whereas in FOC, the target may be fully occluded for several frames and then reappears. The OCC attribute in DTB70 addresses both full and partial occlusion. It is important for the tracker to have redetection capabilities and stop updating the appearance model during full or partial occlusion to prevent drift from the target. Based on the tracker evaluation on these datasets, we find that for full occlusion DAT performed well, whereas for partial occlusion, DaSiamRPN performed well. For OCC in the DTB70 dataset, ECO performed best.
The Out of View (OV) attribute is present in both datasets. This attribute specifies that the target is no longer within the field of view of the camera, which is particularly challenging for most trackers. To do well in this challenge, the tracker must have redetection capabilities and be discriminative enough to distinguish the target from other similar looking objects. DaSiamRPN outperforms all the other trackers in this challenge, because it incorporates local-to-global search strategy.
The attributes Similar Object (SOB-UAV123) and Similar Object Around (SOA-DTB70) indicate that objects with shape and appearance similar to the target appear near the target. These objects are also called distractors. Tracking is more challenging when the target is partially occluded and distractors are present close to the target. The object may also be fully occluded by the distractor. For this challenge, the results show that CF-based trackers perform well in the DTB70 dataset, whereas MDNet-based trackers perform well in the UAV123 dataset. The performance of the Siamese trackers is comparatively lower because the correlation map generally provides high scores on the distractors, which sometimes causes tracker failure.
The Scale Variation (SV) attribute is present in both datasets. It specifies those sequences where the scale of the target changes over time. The results show that DaSiamRPN is significantly better than other trackers. Again, we attribute this to the Region Proposal Network architecture of the DaSiamRPN tracker. Illumination variation (IV), low resolution (LR), and viewpoint change (VC) are some other attributes that are present in the UAV123 dataset. IV specifies the sequences where illumination change is involved related to the target and LR specifies those sequences where the target has low resolution. VC specifies the changes in the camera viewpoint over the temporal span of a sequence. In terms of overlap, DaSiamRPN outperforms the other trackers in these challenges. However, for the precision, MDNet outperforms the other trackers for LR and IV attributes.
Deformation (DEF), In-plane Rotation (IPR), Out of Plane Rotation (OPR), and Motion Blur (MB) are some other attributes present in the DTB70 dataset. Deformation specifies the shape change of the object, in-plane and out-of-plane rotation specifies whether the target object is rotating inside or outside from the image plane, and motion blur specifies blurred target during tracking. For deformation, out-of-plane rotation, and in-plane-rotation, DaSiamRPN performs best. For MB, ECO performs best among the compared trackers for overlap success. However, for the precision success, MFT performs best. The results show that trackers perform much better in the IPR challenges compared to the OPR challenges. In presence of the MB attribute, CF-based trackers generally performs better compared to the other trackers.

Visual Comparison
We present visual examples of the results for all trackers in Figure 9. The Basketball sequence is taken from the OTB dataset. Although it is not an aerial image, it contains distractors that are large enough to observe. Partial occlusion and distractor attributes are present in this sequence. SiamFC and some MDNet based trackers lost the target by the end of the sequence.
The Car2 sequence from DTB70 dataset specifies the camera motion where the camera rotates around the target. The results show that eight out of 10 trackers lost track as soon as there is significant rotation of the camera in this sequence.
The Boat8 and Car7 sequences are taken from the UAV123 dataset. Boat8 has significant scale and aspect ratio change where the appearance of the target also changed. These results show that ECO and MFT trackers perform well but eventually drift. Surprisingly, DaSiamRPN got stuck on small part of the object, most likely due to a proposal generated for that region. In Car7, the target was occluded by a tree while another distractor appeared at the same time. Only the ECO and the DaSiamRPN trackers were able to successfully handle the situation, whereas all other trackers started to track the distractor. In cases of full occlusion, all the trackers lost the target.
Finally, the bird1 sequence from UAV123 dataset is evaluated. Here the target is moving fast and went out-of-view multiple times for several frames. It is seen that due to the small target size, fast motion and partial occlusion, only STRCF and RT-MDNet successfully track the object until it goes out-of-view, but no tracker can redetect the target when it reappears.

Speed Comparison
The speed comparison among all the trackers is shown in Figure 10 where the AUC vs. frame rate is plotted for the UAV123 dataset. SN-based trackers have higher frame rate compared to the other trackers. This is because the network parameters are not updated during online tracking. The DaSiamRPN achieves significantly higher frame rate because of its approach to online tracking as one shot learning. Among the CF-based trackers, ECO has the highest frame rate and outperforms the other CF-based trackers. The factorized convolution operation makes the tracker more efficient, thereby allowing it to achieve a higher frame rate and better performance. Among the MDNet based trackers, RT-MDNet achieves real-time performance with a small accuracy drop from the the original MDNet tracker. It is also seen that DAT performs well, but it has the lowest frame rate due to the update strategy of the tracker where the tracker updates on all the frames with score lower than a certain threshold.

Conclusions
In this study, we benchmarked ten state-of-the-art CNN-based visual object trackers from four different classes: Siamese Network-based, Tracking by Detection-based, Correlation Filter-based, and Reinforcement Learning. We considered four datasets: a subset of OTB, DTB70, UAV123, and UAV20L datasets for testing and comparing the tracking algorithms. Visual examples of different trackers are shown and the results of an One Pass Evaluation (OPE) are reported. We compared the results among different datasets as well as specific attribute challenges within the datasets. Trackers performed worse in the aerial datasets than in the typical ground level videos.
In our study, we found that Siamese network based trackers face difficulty when there are distractors present within the sequence. This is because the cross-correlation operation will create strong peaks on the distractors and drift may occur particularly when the main target is occluded. Siamese trackers do not have any online model update, which makes them fast but occasionally cannot handle significant appearance change of the object. However, DaSiamRPN performs well due to distractor aware training and accurate localization based on the RPN. The challenge for the CF-based trackers is to find a proper update strategy such that the tracker does not update the appearance model when the target is absent. Among MDNet based trackers, py-MDNet and DAT are computationally expensive where RT-MDNet and meta-tracker run at higher frame rate with relatively lower accuracy. Finally, the RL-based tracker is yet to achive the desired accuracy.
It is notable that none of the trackers is designed for reidentification, which is needed for target reacquisition when the target goes out of view and reappears.
The overall performance of the implemented trackers indicates that further research is needed to reach the full potential of deep learning tracking on aerial sequences.