Robust Correlation Tracking for UAV Videos via Feature Fusion and Saliency Proposals

Abstract: Following the growing availability of low-cost, commercially available unmanned aerial vehicles (UAVs), more and more research effort has been focused on object tracking using videos recorded from UAVs. However, tracking from UAV videos poses many challenges due to platform motion, including background clutter, occlusion, and illumination variation. This paper tackles these challenges by proposing a correlation filter-based tracker with feature fusion and saliency proposals. First, we integrate multiple feature types, such as dimensionality-reduced color names (CN) and histograms of oriented gradient (HOG) features, to improve the performance of correlation filters for UAV videos. Yet, a fused feature acting as a multivector descriptor cannot be used directly in prior correlation filters. Therefore, a fused-feature correlation filter is proposed that can directly convolve with a multivector descriptor, in order to obtain a single-channel response that indicates the location of an object. Furthermore, we introduce saliency proposals as a re-detector to reduce background interference caused by occlusion or any distracter. Finally, an adaptive template-update strategy based on saliency information is utilized to alleviate possible model drift. Systematic comparative evaluations performed on two popular UAV datasets show the effectiveness of the proposed approach.

It is critical to employ an efficient feature representation in order to improve performance in object tracking. Gradient and color features are the most popular single feature types. In particular, color features, such as color names (CN), help capture rich color characteristics, and histogram of oriented gradient (HOG) [12] features are adept at capturing abundant gradient information. Based on these feature descriptions, a variety of target tracking techniques have been proposed. For instance, FragTrack [13] builds object appearance models by exploiting multiple parts of the target. Babenko et al. [14] presented a multiple instance learning (MIL) algorithm that develops a discriminative model by bagging all ambiguous negative and positive samples. Grabner et al. [15] utilized a novel online AdaBoost feature selection method (OAB), benefitting considerably from online training. In a past paper [2], a structural local sparse representation is applied to the tracking task, where both partial and spatial information are exploited. Zhang et al. [16] discovered the relationship between an object and its spatiotemporal context using a Bayesian framework. The extended Lucas-Kanade (ELK) method [17] considers two log-likelihood terms related to object-pixel or background affiliation, in addition to the standard LK template-matching term. Most of the aforementioned techniques depend on intensity or texture information to characterize a given image. However, it is difficult for them to process a large number of frames per second without resorting to parallel computation on a standard PC when dealing with real-time tasks [17]. From this viewpoint, correlation filters [18][19][20][21][22] show their strengths in both speed and accuracy: the tracking problem is converted from the time domain to the frequency domain with the fast Fourier transform (FFT). In so doing, convolution can be substituted with multiplication to achieve fast learning and target detection.
Although high tracking speed may be obtained, long-term tracking can often result in model drift. To ensure the stability of model updating in object tracking, Kalal et al. [1] decomposed the ultimate task of tracking into the subtasks of tracking, learning, and detection (TLD), where tracking and detection reinforce each other. However, if the location of an object is predicted only with respect to the previous frame, the appearance model may suffer from noisy samples. In particular, when the object becomes occluded by something else, the tracker will fail immediately. Having taken notice of this, Hare et al. [2] adjusted the appearance model in a more reliable way, learning a joint structured output (Struck) to predict the object location. Apart from using a correlation filter, Zhu et al. [21] introduced an additional filter for detection, which greatly alleviates the problems of location error and model drift caused by serious occlusion. Benefiting from temporal context and an online redetector, a method described previously [22] performs robustly under appearance variation.
Note that, following the increasing availability of low-cost, commercially available unmanned aerial vehicles (UAVs), more and more research efforts have been focusing on object detection and tracking in UAV videos. For example, Logoglu et al. [23] designed a feature-based moving object detection method for aerial videos. Fu et al. [24] proposed a technique named ORVT for onboard robust visual tracking of targets in aerial images using a reliable global-local object model. However, all of the methods mentioned above cannot cope well with the challenges appearing in such videos, which typically involve illumination variation, background clutter, and occlusion. To address these issues, we propose a robust tracking approach for UAV videos, which offers three main contributions: (1) Composed of the HOG and dimension-reduced CN features, fused features are introduced into the correlation filter in order to improve the robustness of the appearance model in describing the target.
(2) To deal with background clutter and, meanwhile, to reduce the risk of model drift caused by occlusion, saliency proposals are introduced as posterior information to relocate the object. (3) A new adaptive template update method is proposed to further alleviate the problem of model drift caused by occlusion or distraction. The effectiveness of this approach is demonstrated through systematic comparisons against other techniques.
The rest of this paper is organized as follows. Section 2 discusses relevant previous work on correlation filters and saliency detection. Under the general framework of the correlation filter, Section 3 describes our approach. Section 4 presents an evaluation of the proposed approach and a comparative study with state-of-the-art techniques. Section 5 discusses the tracking speed of different methods and assesses the effects of each contribution made by the proposed work. Finally, Section 6 concludes this study and points out interesting further research.

Correlation Filters
Because of their impressively high speed, correlation filters have attracted a great deal of interest in object tracking. For instance, Bolme et al. [25] proposed the minimum output sum of squared errors (MOSSE) filter, which works by finding the maximum cross-correlation response between the model and a candidate patch. Henriques et al. [26] exploited the circulant structure and Fourier transformation in a kernel space (CSK), offering excellent performance on a range of computer vision problems. A vector correlation filter (VCF) was proposed by Boddeti et al. [27] to minimize localization errors while improving the tracking speed. Danelljan et al. [28] exploited the color attributes of an object and introduced CN features into CSK to perform object tracking. Combining the kernel trick and cyclic shifts [26], the kernelized correlation filter (KCF) [29] attains more adaptive performance for diverse scenarios using multichannel HOG features. The DSST tracker [19] learns adaptive multiscale correlation filters using HOG features to handle the scale change of target objects. To learn a model that is inherently robust to both color changes and deformations, Staple [30] combines two image patch representations that are sensitive to competing factors. Danelljan et al. [31] utilized a spatial regularization component in the learning process to penalize correlation filter coefficients as a function of their spatial location. Recently, the authors of a past paper [20] proposed a background-aware correlation filter (BACF) that can model how the background as well as the foreground of an object may vary over time. To drastically reduce the number of parameters in the model, Danelljan et al. [32] proposed a factorized convolution operator; the use of a compact generative model of the training sample distribution significantly reduces the memory and time complexity, while providing better sample diversity.
Whilst many methods exist, as outlined above, they do not address the critical issue of online model updating. As a result, such correlation trackers are susceptible to model drift and, hence, are less effective at handling important problems such as long-term occlusion and object out-of-view.

Saliency Detection
Saliency is considered to represent an object or a pixel that is more conspicuous than its neighbors. Saliency detection aims to capture the regions that stand out in an image. In terms of algorithmic strategy, saliency detection approaches can be categorized into two subgroups: bottom-up data-driven methods [9,33,34] and top-down task-driven methods [10].
Top-down methods are task-driven and learn a supervised classifier for salient object detection. In DRFI [9], hand-crafted features were extracted to classify each region. Xi et al. [10] proposed an SVM-based method with color information as the input. On the other hand, most bottom-up methods employ low-level features to calculate the saliency value. By analyzing the log spectrum of an input image, Hou et al. [8] introduced a mechanism to extract the spectral residual of an image in the spectral domain. They proposed a fast method for constructing the corresponding saliency map in the spatial domain that is independent of features, categories, or other forms of prior knowledge about the domain objects. To preserve the structure of the objects, region-based methods were also proposed; these segment images into coherent regions to obtain a proper spatial structure. Goferman et al. [33] used a patch-based approach to capture global properties. Cheng et al. [34] combined soft abstraction to decompose an image into large, perceptually homogeneous elements in order to achieve efficient saliency detection. Additionally, boundary cues have been used to improve saliency detection performance, with boundary prior knowledge treating image boundary regions as labeled background.

Proposed Approach
We aim to develop an online tracking algorithm that is adaptive to significant appearance change without being prone to drifting, in which the extracted fused features are encoded as multivectors. Further, saliency information is obtained to provide reliable proposals for the correlation filter to redetect objects in case of tracking failure. In particular, adaptive template updating rules are put forward in order to achieve robust performance. The flowchart of the proposed tracking approach is illustrated in Figure 1, where the speed of the tracker is ensured by the use of a correlation filter.

Correlation Tracking through Fused Features
Features play an important role in computer vision. For example, much of the impressive progress in object detection can be attributed to the improvement in the representation power of features [35]. Gradient and color features are the most widely exploited in object detection and tracking. Indeed, previous work [36] has verified that there exists a strong complementarity between gradient and color features. However, how to jointly utilize different features for aerial tracking is still an open question. Compared with generic visual object tracking, certain tracking challenges are amplified in aerial scenarios, including abrupt camera motion, low resolution, significant changes in scale and aspect ratio, fast-moving objects, as well as partial or full occlusion. It is difficult to obtain comprehensive information about objects of interest using a single feature type like HOG or CN [37] under such circumstances. Hence, we employ fused features to achieve robust performance in aerial tracking. Inspired by CN from a linguistic viewpoint [37], which involves eleven preliminary color terms (black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow), we extract CN features from the original image and substantially reduce the number of color dimensions in an effort to enable a significant speed boost, supported by a technique reported previously [28]. In addition, any given input color image is transformed into one with grey values, and then HOG features are extracted from the resulting grey image. All these features are concatenated directly to form a multivector as a fused feature descriptor.
In this paper, we utilize the multivector representation of fused features, which fits well with the correlation tracking framework. More specifically, we denote by $x^d$ the $d$-th channel ($d \in \{1, \ldots, D\}$) of the fused feature multivector $x \in \mathbb{R}^D$, and by $y$ the desired correlation output corresponding to a given sample $x$. A correlation filter $w$, of the same dimensionality as $x$, is then learned by solving the following minimization problem:

$$\min_{w} \left\| \sum_{d=1}^{D} w^{d} \star x^{d} - y \right\|^{2} + \lambda \sum_{d=1}^{D} \left\| w^{d} \right\|^{2}, \qquad (1)$$

where $\lambda$ is a regularization parameter and $\star$ denotes circular correlation. Note that the minimization problem in Equation (1) is akin to training the multivector correlation filters in a past paper [27], and can be resolved within each individual feature channel using the FFT. Let capital letters denote the corresponding Fourier-transformed signals. The learned filter in the frequency domain on the $d$-th ($d \in \{1, \ldots, D\}$) channel can be written as

$$W^{d} = \frac{\overline{Y} X^{d}}{\sum_{k=1}^{D} \overline{X^{k}} X^{k} + \lambda}, \qquad (2)$$

where $Y$, $X$, $W$ denote the discrete Fourier transforms (DFT) of $y$, $x$, $w$, respectively; $\overline{Y}$ represents the complex conjugate of $Y$, and $\overline{Y} X^{d}$ is a point-wise product. Given an image patch in the next frame (of the video sequence concerned), the fused feature multivector is denoted by $z \in \mathbb{R}^D$. The correlation response map is computed by

$$r = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} \overline{W^{d}} Z^{d} \right), \qquad (3)$$

where the operator $\mathcal{F}^{-1}$ denotes the inverse FFT. The target location can then be estimated by searching for the position of the maximum value of the correlation response map $r$, such that

$$(a, b) = \arg\max_{(x, y)} r(x, y). \qquad (4)$$
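For illustration, a minimal NumPy sketch of Equations (2)-(4) is given below. The array shapes, the Gaussian-shaped desired output `y`, and the value of `lam` are assumptions of this sketch, not details taken from the paper's MATLAB implementation.

```python
import numpy as np

def learn_filter(x, y, lam=1e-2):
    """Learn the multichannel correlation filter of Eq. (2).
    x: (H, W, D) fused feature patch; y: (H, W) desired response
    (typically a Gaussian peaked at the target center)."""
    Y = np.fft.fft2(y)
    X = np.fft.fft2(x, axes=(0, 1))
    A = np.conj(Y)[..., None] * X                    # per-channel numerator
    B = np.sum(np.conj(X) * X, axis=2).real + lam    # shared denominator
    return A, B                                      # filter: W^d = A^d / B

def detect(A, B, z):
    """Response map of Eq. (3) and target position of Eq. (4)
    for a new (H, W, D) feature patch z."""
    Z = np.fft.fft2(z, axes=(0, 1))
    W = A / B[..., None]
    r = np.real(np.fft.ifft2(np.sum(np.conj(W) * Z, axis=2)))
    return np.unravel_index(np.argmax(r), r.shape), r
```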

Object Redetection Based on Saliency Proposals
For traditional correlation filter-based trackers [26][27][28][29], the use of the FFT helps greatly reduce the computational cost, demonstrating the ability of real-time tracking on UAV videos. Nevertheless, two main challenges remain: (a) distraction and (b) model drift, both of which can be caused by occlusion or background clutter.
In DSST [19], an independent scale prediction filter is presented, but it fails to perform well when serious occlusion exists, as shown in Figure 2. A common approach to handling model drift is to integrate a short-term tracker and an online long-term detector, as is done in the TLD algorithm [1]. However, learning an online long-term detector relies heavily on large numbers of well-labeled training samples, which can be difficult to collect. Additionally, an exhaustive search through the entire image with sliding windows is time-consuming, especially when employing complex but discriminative features.

To provide relatively few proposals and to suppress background interference, in this paper we not only utilize an adaptive update strategy to learn the appearance model, but also exploit a few pieces of reliable information from the biologically inspired saliency map. We postulate that the redetector can alleviate the model drift problem caused by occlusion or distraction.

Saliency Proposal Detection
Due to its simplicity and efficiency, we propose to utilize the spectral residual-based saliency detection algorithm [8] to obtain saliency proposals. Then we iteratively redetect the object based on the resulting saliency proposals. Given an original image $I$, the Fourier transform is used to extract the phase spectrum $P(f)$ and the amplitude spectrum $A(f)$ of the image (in the frequency domain), as shown in Equations (5) and (6):

$$A(f) = \mathfrak{A}\left( \mathcal{F}[I(x)] \right), \qquad (5)$$

$$P(f) = \mathfrak{P}\left( \mathcal{F}[I(x)] \right). \qquad (6)$$

From these, the averaged log spectrum is approximated by the convolution $h_n(f) * L(f)$, where $L(f) = \log(A(f))$ and $h_n(f)$ denotes a local average filter that approximates the general shape of $A(f)$. Thus, the spectral residual $R(f)$ can be obtained by Equation (7):

$$R(f) = L(f) - h_n(f) * L(f). \qquad (7)$$

In the subsequent experimental studies, the size $n$ of $h_n(f)$ is empirically set to 3. The spectral residual $R(f)$ helps capture the key information contained within an image. In particular, it serves as a compressed representation of the underlying scene reflected by the image. Using the inverse Fourier transform (IFT), we can construct the saliency map in the spatial domain. The saliency map contains primarily the nontrivial parts of the scene; the content of the residual spectrum can also be interpreted as the unexpected portion of the image. Thus, the value at each point of the saliency map is squared to indicate the estimation error. For better visual effects, we smooth the saliency map with a Gaussian filter $g(x)$. In sum, given an image $I(x)$, we have

$$S(x) = g(x) * \mathcal{F}^{-1}\left[ \exp\left( R(f) + i P(f) \right) \right]^{2},$$

where $g(x)$ is a Gaussian filter with kernel size $k = 4$ and $\sigma = 2.5$, $(i, j)$ is the coordinate of pixel $x$, and $\mathcal{F}^{-1}$ denotes the IFT.
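A minimal Python transcription of this saliency computation is sketched below. The input normalization and the use of `gaussian_filter` with sigma 2.5 in place of an explicit k = 4 kernel are assumptions of this example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def spectral_residual_saliency(img, n=3, sigma=2.5):
    """Spectral residual saliency map [8], Eqs. (5)-(7) plus the
    smoothed, squared IFT. img: 2-D grayscale float array."""
    F = np.fft.fft2(img)
    A = np.abs(F)                       # amplitude spectrum, Eq. (5)
    P = np.angle(F)                     # phase spectrum, Eq. (6)
    L = np.log(A + 1e-8)                # log spectrum (epsilon avoids log(0))
    R = L - uniform_filter(L, size=n)   # spectral residual, Eq. (7)
    S = np.abs(np.fft.ifft2(np.exp(R + 1j * P))) ** 2   # squared IFT
    return gaussian_filter(S, sigma=sigma)              # smoothed map S(x)
```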
Having built a saliency map $S(x)$, saliency proposals can be obtained using threshold segmentation and region connection. Specifically, the saliency map is first segmented according to adaptive thresholding [38], which generates a number of connected domains. Without loss of generality, supposing that the connected domain corresponding to the real object does not appear at the border of the image, we can exclude the connected domains whose centers lie within a certain number of pixels of the boundary of the segmented image to derive the final saliency proposals (in the implementation herein, this number is set to 15).
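The proposal extraction step might look as follows. This sketch substitutes Otsu thresholding for the adaptive thresholding of [38] (an assumption made for brevity) and uses OpenCV connected components together with the 15-pixel border rule described above.

```python
import cv2
import numpy as np

def saliency_proposals(sal_map, border=15):
    """Segment a saliency map and return the centers of its connected
    regions, discarding regions whose centers fall within `border`
    pixels of the image boundary."""
    s = cv2.normalize(sal_map.astype(np.float32), None, 0, 255,
                      cv2.NORM_MINMAX).astype(np.uint8)
    # Otsu thresholding stands in for the adaptive thresholding of [38].
    _, mask = cv2.threshold(s, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n_labels, _, _, centroids = cv2.connectedComponentsWithStats(mask)
    h, w = mask.shape
    return [(cx, cy) for cx, cy in centroids[1:]    # label 0 = background
            if border <= cx < w - border and border <= cy < h - border]
```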

Redetection Based on Saliency Proposals
The traditional correlation tracker cannot perform well when serious occlusion exists. To address this issue, we equip our tracker with a redetection approach based on saliency proposals. If the correlation response $r$ is less than the threshold $T_1$ for more than $L$ consecutive frames, it is highly likely that the target is seriously occluded. In this case we redetect the object using saliency information; otherwise, the object is located by the correlation filter alone.
Specifically, we consider the location of the object in the previous frame as a center point, around which an image patch is cropped from the original image. The image patch is of size $B \times B$, where $B$ grows with num_lost, the number of consecutive frames in which serious occlusion has occurred; $w$ and $h$ denote the initialized horizontal width and vertical height of the object of interest in the first frame, respectively, and $\lfloor \cdot \rfloor$ means rounding down. Such an image patch is designed to guarantee that the longer the object is lost, the bigger the image patch cropped from the original image.
From this we can obtain the saliency proposals in the image patch and sample paddings of size $3w \times 3h$ around the center of every saliency proposal. Then, correlation filtering is applied between the center of each saliency proposal and the template in the first frame, with the point of the largest response $r_m$ taken as the center of the new object if $r_m$ exceeds a certain value $T_2$. Otherwise, in order to ensure that the object remains within an image patch, the patch is expanded when repeating the redetection step in the next frame.
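Putting the pieces together, the per-frame redetection logic can be sketched as follows, reusing the saliency functions above. Here `correlate_at` and `crop_growing_patch` are hypothetical helpers standing in for the correlation response of Equation (3) and the growing B x B crop described above, and the thresholds follow the experimental settings (T1 = 0.2, T2 = 0.25, L = 7).

```python
import numpy as np

def track_frame(frame, state, T1=0.2, T2=0.25, L=7):
    """One tracking step with saliency-based redetection (sketch).
    `state` carries the learned filter, the first-frame template, the
    last position, and num_lost, the count of consecutive low-confidence
    frames. correlate_at() and crop_growing_patch() are hypothetical
    helpers for Eq. (3) and the growing B x B crop, respectively."""
    pos, r = correlate_at(frame, state.pos, state.filter)
    if r >= T1:
        state.pos, state.num_lost = pos, 0        # confident: normal tracking
        return state
    state.num_lost += 1
    if state.num_lost <= L:
        return state                              # not yet declared lost
    # Redetection: crop a patch that grows with num_lost, test proposals.
    patch, offset = crop_growing_patch(frame, state.pos, state.num_lost)
    best, r_m = None, -np.inf
    for cx, cy in saliency_proposals(spectral_residual_saliency(patch)):
        cand, resp = correlate_at(patch, (cx, cy), state.first_template)
        if resp > r_m:
            best, r_m = cand, resp
    if r_m > T2:                                  # accept the best proposal
        state.pos = (best[0] + offset[0], best[1] + offset[1])
        state.num_lost = 0
    return state
```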
Figure 3 shows how an object is redetected based on saliency proposals. As can be seen from this figure, when the object is lost, we can gradually relocate its approximate position by saliency detection in the area where the object may appear. Following this, the object can then be relocated accurately using correlation filtering.

Adaptive Model Updating
To obtain a robust and efficient approximation, we update the numerator $A^d$ and the denominator $B$ of the correlation filter $W^d$ in Equation (2) separately, using a moving average:

$$A_t^d = (1 - \eta) A_{t-1}^d + \eta \, \overline{Y_t} X_t^d,$$

$$B_t = (1 - \eta) B_{t-1} + \eta \sum_{k=1}^{D} \overline{X_t^k} X_t^k,$$

where $t$ is the frame index and the learning rate $\eta$ is set to 0.025 empirically.
If the object position is relocated according to saliency information, we update the template according to Equation (12). Then, the previous templates and the first-frame template are combined to update the target template, thereby minimizing potential model drift.
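A minimal sketch of this moving-average update, consistent with the quantities defined for Equation (2); the per-channel numerator with a shared denominator is an assumption of this example:

```python
import numpy as np

def update_filter(A_prev, B_prev, x, y, eta=0.025):
    """Moving-average update of the filter numerator and denominator.
    x: (H, W, D) features of the current frame; y: desired response."""
    Y = np.fft.fft2(y)
    X = np.fft.fft2(x, axes=(0, 1))
    A = (1 - eta) * A_prev + eta * np.conj(Y)[..., None] * X
    B = (1 - eta) * B_prev + eta * np.sum(np.conj(X) * X, axis=2).real
    return A, B   # the filter is W^d = A^d / (B + lambda), as in Eq. (2)
```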

Experimental Results
We provide representative experimental results in this section. The proposed tracker is implemented in MATLAB 2014 on a PC with a 3.4 GHz processor and 16 GB RAM, without involving any sophisticated program optimization. In order to present an objective evaluation of the performance of the proposed approach, we conduct experiments on two datasets, namely the VIVID dataset [39] and the UAV123 dataset [40], for both qualitative and quantitative evaluations. In these experiments, the parameters are fixed for all of the sequences, with $T_1$ and $T_2$ set to 0.2 and 0.25, respectively. In addition, $L$ is set to 7 and the candidate region size for the correlation filter is set to three times that of the object under tracking.
We compare the proposed tracker with a range of excellent state-of-the-art trackers, including TLD [1], DSST [19], BACF [20], ORVT [24], Staple [30], SRDCF [31], ECO_HC [32], KCFDP [41], BIT [42], and fDSST [43]. Among these trackers, TLD introduces a detection method into the tracking problem, which performs well when occlusion exists, while DSST, KCFDP, SRDCF, Staple, BACF, ECO_HC, and fDSST involve the use of correlation filters to improve the speed of tracking. In particular, ORVT is an onboard robust aerial tracking algorithm that works with a reliable global-local object model. Additionally, BIT is a biologically inspired tracker that extracts low-level biologically inspired features while imitating an advanced learning mechanism to combine generative and discriminative models for target location. Note that we employ the publicly available code of the compared trackers for fair comparison.
We follow the standard evaluation metrics for object tracking algorithms in two respects: the precision rate and the success rate. The precision rate is the percentage of successfully tracked frames on which the center location error (CLE) of a tracker is within a given threshold (e.g., 20 pixels), with CLE defined as the average Euclidean distance between the center locations of the targets and the manually labeled ground truths. A tracking result in a frame is considered successful if

$$\frac{|r_t \cap r_d|}{|r_t \cup r_d|} > \theta,$$

where $r_d$ and $r_t$ denote the areas of the bounding boxes of the tracked region and the ground truth, respectively; $\cap$ and $\cup$ represent the intersection and union of two regions, respectively; and $|\cdot|$ denotes the number of pixels in a region. Thus, the success rate is defined as the percentage of frames whose overlap rates are greater than a threshold $\theta$; normally, $\theta$ is set to 0.5.
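In code, the two metrics reduce to a per-frame center-distance check and an intersection-over-union check; the (x, y, w, h) box convention below is an assumption of this sketch.

```python
import numpy as np

def center_error(box_a, box_b):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return np.linalg.norm(ca - cb)

def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# success rate at threshold theta = 0.5:
# np.mean([overlap_ratio(t, g) > 0.5 for t, g in zip(tracked, ground_truth)])
```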
We present the results under one-pass evaluation (OPE) using the average precision and success rates over all sequences. OPE is the most common evaluation method, which runs each tracker on each sequence once: it initializes the tracker with the ground-truth object state in the first frame and reports the average precision or success rates across all the results obtained.

Experiments on VIVID Dataset
There are eleven video sequences in the VIVID dataset. Apart from motion blur and fast motion, these video sequences also suffer from further difficulties such as occlusion, scale variation, background clutter, and low resolution. In the VIVID dataset, the ground truth is given every ten frames. To evaluate the trackers more accurately, we annotate the ground truths of all the videos, referring to the official data, for quantitative evaluation. The experimental results on these videos are summarized in Tables 1 and 2, which show the overall rates of the success plots and those of the precision plots, respectively. As can be seen from these tables, our tracker performs reliably and achieves optimal outcomes overall. In particular, on the first three video sequences, where serious occlusion occurs, our method exhibits excellent performance benefitting from saliency-based redetection and adaptive template updating, while the other trackers lose the targets under these circumstances. However, the remaining video sequences are frequently affected by scale change, rotation, and similar objects, which also led to a decline in the performance of our algorithm. Figure 4 shows the qualitative evaluation on the VIVID dataset. Figure 4a illustrates the performance of our approach and the compared algorithms on the sequence pktest01. Only our method maintains robust tracking after more than 100 frames of occlusion. It is evident that, by redetecting the object through saliency information, the proposed tracker is more robust than the other trackers. In the sequence pktest03, in addition to motion blur and fast motion, the other main challenges for tracking are illumination variation, serious occlusion, and background clutter. From the last picture of Figure 4b, it is obvious that the full occlusion by the car is handled well by our tracker, while the other methods drift from the target. This implies that saliency detection makes an important contribution to achieving such an outstanding performance. In addition, almost every frame is subject to a varying degree of background clutter. Note that the scale of the target is so small as to be hard to recognize: it is almost integrated with the background, with certain texture and other details lost. It can be seen from the results that only our algorithm can successfully deal with the problem of background clutter, as the other methods fail to track the target completely. There is no doubt that fused features help improve the robustness of the proposed appearance model. In addition, the adaptive model update strategy also helps reduce model drift. Both of the above measures lead to the excellent performance of our method.
As shown in Figure 4c-e, where there is no significant occlusion, our method can always follow the target, as can the other trackers. It works even when similar cars appear in the sequence egtest02.
However, when scale variation and rotation occur, the calculated scales of the bounding boxes are not sufficiently accurate, causing a decrease in the accuracy of our tracker. For the sequences egtest01 and redteam, the background is similar to the edge of the target. If the response of the correlation filter is less than the threshold for a long time, our tracker automatically tries to relocate the target by exploiting visual saliency. Of course, this strategy may gradually introduce certain noise from the background around the target into the template, leading to slight model drift.

Experiments on UAV123 Dataset
In order to evaluate the performance of our proposed approach, we conduct experiments on twenty challenging video sequences selected from the UAV123 dataset for both quantitative and qualitative analysis. The UAV123 dataset provides a facility for the evaluation of different trackers on a number of fully annotated HD videos captured from a professional-grade UAV. It complements existing benchmarks by establishing the aerial component of tracking while providing a more comprehensive sampling of tracking nuisances that are ubiquitous in low-altitude UAV videos. Apart from aspect ratio change (ARC) and fast motion (FM), these video sequences are also affected by several adverse conditions, such as background clutter (BC), camera motion (CM), full occlusion (FOC), illumination variation (IV), low resolution (LR), out of view (OV), partial occlusion (POC), similar object (SOB), scale variation (SV), and viewpoint change (VC). Thus, the experiments carried out herein include all typical challenges involved in real-world aerial tracking problems.
Ranging from 535 to 1783 frames, the twenty selected sequences used here involve all the challenging factors in the UAV123 dataset at different resolutions. Various scenes exist in these sequences, such as roads, buildings, fields, and beaches. The targets include aerial vehicles, persons, trucks, boats, cars, etc. Detailed information on these sequences is listed in Table 3. Tables 4 and 5 exhibit the overall rates of the precision plots and those of the success plots on the twenty sequences, respectively. It can be seen that our tracker achieves the best performance on average, demonstrating its robustness in dealing with object tracking tasks involving different challenging factors and various background types. We also perform an attribute-based comparison with other methods on this subset of the UAV123 dataset. Figures 5 and 6 show the precision plots and the success plots for the twelve attributes, respectively. As can be seen from these results, our tracker always performs reliably and achieves the optimal, or at least a close-to-optimal, solution in most cases. Specifically, for the amplified challenging factors in aerial tracking, including CM, BC, SV, ARC, FM, IV, FOC, and VC, our tracker is able to achieve promising results, benefitting from the robustness of fused features as well as from the employment of the appearance template and model updating strategy. For videos with fast-moving objects, camera motion, and background clutter, the fused features have a stronger ability to capture information about the objects and, therefore, lead to better results compared to the classic single-feature trackers. In addition, when the aspect ratio of an object changes significantly, our adaptive appearance template updating strategy can adjust the template to the appearance of the object. Moreover, thanks to the high-confidence model updating method, background noise is suppressed as much as possible when serious occlusion exists in aerial videos. Nevertheless, our tracker may not perform equally well when dealing with images of low resolution and targets that go out of view. This is likely due to the fact that such challenging factors usually create very serious problems for saliency detection, resulting in model drift.
Figure 7 illustrates qualitative evaluations of the application of different trackers to example sequences selected from the UAV123 dataset. In the sequence person16, the background has a similar color to the person, making it difficult, to varying extents, for the trackers to function successfully.
Owing to the use of saliency information, our tracker is able to relocate the object after it has been occluded by the tree, and it outperforms the state-of-the-art tracking methods. As shown in Figure 7b, the sequence uav1_3 contains almost all the possible challenges in aerial tracking, especially low resolution and serious background clutter. Benefiting from the target redetection strategy, our tracker can track the target successfully all the time, while the others locate the target correctly only once in a while. Of course, the robustness of fused features also helps ensure the good performance of our tracker. However, for certain sequences with serious scale variation and similar objects, for example the sequence car10, our tracker slightly underperforms in comparison to several state-of-the-art algorithms (e.g., ECO_HC, BACF, and Staple). Under such circumstances, our tracker may incur small model drift, but it does not lose the target.

Discussion
In this section, we discuss the tracking speed of different methods and assess the effect of each technical contribution incorporated within the proposed approach. All the experimental results are again obtained on the twenty selected sequences of the UAV123 dataset, as indicated previously.

Speed Analysis
For practical applications of aerial tracking, the computational efficiency of a given tracker also needs to be considered. Table 6 lists the running speed of each compared tracker; the average speeds over all of the sequences are shown in the last row. As we can see, the fDSST tracker achieves the fastest running speed, at almost 133 fps, and the biologically inspired BIT tracker performs well in terms of running efficiency, too. Mainly due to the low cost of computing the color histogram, the Staple tracker also performs well in terms of tracking speed. However, the SRDCF and BACF trackers show low running efficiencies on all twenty test sequences, at approximately 10.79 fps and 9.65 fps, respectively, which may not meet the standard of real-time running. It is worth noting that our tracker can meet real-time requirements while attaining highly satisfactory results on both the success rate and the precision rate. This owes much to the robustness of fused features and the efficacy of saliency detection. To further strengthen the performance of our proposed tracker, we are trying to find an optimization method to speed it up.

Effect of Fused Features
Computationally, feature construction is an essential part of our tracker, as it provides sufficient information for the correlation filter. We perform an experimental study to show the advantage of feature fusion. In particular, we test our tracker with fused features against versions of the tracker using only HOG or only CN features. The results are reported in Figure 8. It is obvious that fused features lead to better performance in terms of both the precision rate and the success rate.

Effect of Saliency Proposals
To demonstrate the effectiveness of our saliency proposals in the detection stage, we evaluate the quantitative performance of our tracker with and without saliency proposals, respectively. Note that almost all of the sequences used for these experiments suffer from partial or full occlusion. The results are shown in Figure 9. Compared with the version without saliency proposals, the one utilizing saliency achieves considerably better performance. In addition, these results demonstrate that the tracking-by-detection mechanism is very helpful once integrated with correlation-based tracking for occlusion-dominated scenes.

Comparison of Saliency-Based Detection and Sliding Window-Based Detection
To further verify the contribution of the saliency-based mechanisms, we substitute traditional sliding window-based detection for the saliency-based detection within the generic framework of our tracking algorithm. Specifically, the detector is applied to the entire frame with sliding windows when $\max(r) < T_1$. In our implementation, the detector is trained as a random fern classifier [1], where each fern performs a number of pixel comparisons on an image patch, producing feature vectors that point to a leaf node with a certain posterior probability. The posteriors from all ferns are averaged as the target response, and detection is based on the scanning-window strategy. We use a k-nearest neighbor (KNN) classifier to select the most confident tracked results as positive training samples; e.g., a new patch is predicted as the target if the k nearest feature vectors in the training set all have positive labels (k = 5 in this work).
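For illustration, a minimal sketch of such a fern is given below; the number of comparisons S, the Laplace-smoothed counts, and the flattened-patch indexing are assumptions of this example rather than details of the implementation in [1].

```python
import numpy as np

class Fern:
    """Minimal random-fern sketch: S random pixel-pair comparisons map a
    patch to one of 2^S leaves; per-leaf positive/negative counts give
    the posterior probability used as the detection response."""
    def __init__(self, n_pixels, S=10, seed=0):
        rng = np.random.default_rng(seed)
        self.pairs = rng.integers(0, n_pixels, size=(S, 2))
        self.pos = np.ones(2 ** S)    # Laplace-smoothed counts
        self.neg = np.ones(2 ** S)

    def leaf(self, patch):
        v = patch.ravel()
        bits = (v[self.pairs[:, 0]] > v[self.pairs[:, 1]]).astype(int)
        return int(np.dot(bits, 1 << np.arange(bits.size)))

    def posterior(self, patch):
        c = self.leaf(patch)
        return self.pos[c] / (self.pos[c] + self.neg[c])

    def update(self, patch, is_target):
        c = self.leaf(patch)
        (self.pos if is_target else self.neg)[c] += 1

# The detector averages posterior() over an ensemble of ferns at every
# sliding-window position, which is what makes this baseline costly.
```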
Figure 10 presents the success and precision plots of these two trackers on the testing sequences. Obviously, our tracker performs significantly better on both evaluations. Due to the fast motion of the UAV, great changes can occur in the scale and appearance of the target in the videos, which may reduce the similarity between the target and the corresponding tracking templates. Hence, it is hard for methods using a sliding window, which discriminate the target according to similarity measures between windows, to obtain satisfactory results. What should not be ignored is that object tracking is closely related to attentional tasks in the biological world. Inspired by this observation, we exploit the abundant saliency information in the videos. Then, the adaptive template updating strategy ensures that new templates obtained by saliency detection can be introduced in time. This helps minimize the occurrence of possible model drift when the appearance of the target changes drastically.

Furthermore, we compare the speeds of these two methods. The results are illustrated in Figure 11. It can be seen that, with the introduction of saliency information, the proposed approach achieves a higher running speed on the majority of testing sequences compared to the version with sliding window-based detection. This is to be expected, because the proposed approach is intended to imitate biological vision systems, which are able to pop out the salient locations in the visual field [44] even under the most adverse conditions (e.g., highly cluttered scenes, low light, etc.). These salient locations become the focus of attention for the post-attentive stages of visual processing, which can effectively provide proposals for target relocation. However, for the detector without saliency detection, every tracking outcome is computed by running a sliding window, inevitably at the expense of more computing resources.

Conclusions
In this paper, we have proposed a robust tracking method for UAV videos via a fused-feature-based correlation filter and saliency detection. The correlation filter, which combines the HOG and dimension-reduced CN features, contributes significantly to tracking performance when dealing with challenging factors such as occlusion, noise, and illumination. To handle serious occlusion, this work has introduced saliency information into the tracker for redetection, thereby reducing background interference. Moreover, an adaptive model update strategy, which is both robust and computationally efficient, is adopted to alleviate possible model drift. Experimental investigations have demonstrated, both quantitatively and qualitatively, that our approach achieves favorable average performance on two popular aerial tracking datasets in comparison with state-of-the-art methods. Given its reliability and robustness, the proposed tracker can be successfully employed in a wide variety of UAV video applications (beyond those related to surveillance), such as wildlife monitoring, activity control, navigation/localization, and obstacle/object avoidance, especially when real-time processing is mandatory, as in the case of rescue or defense purposes.
As a generic approach for aerial videos, we plan to further develop more robust fused features and to reinforce the fast nature of the redetection methods in future work, while operating in real time. Also, in this work, it has been assumed that each channel's features are independent of the rest and, hence, no interaction between such features has been considered; as such, a channel-wise filter was successfully adopted. However, it would be interesting to explore the interconnections among the information contents conveyed by different channels and to introduce a general linear filter to deal with such interactions.

Author Contributions: All the authors made significant contributions to this work. X.X. and Y.L. devised the approach and analyzed the data; Q.S. provided advice for the preparation and revision of the work; X.X. performed the experiments; and H.D. helped with the experiments.

Figure 2 .
Figure 2. Tracking results of DSST: (a) tracking well without occlusion; (b) tracking failed within occlusion; and (c) model drift after occlusion.

Figure 4 .
Figure 4. Qualitative evaluation of tracking results on VIVID dataset.

Figure 5 .
Figure 5. Precision plots of proposed tracker compared with state-of-the-art approaches on different attributes of UAV123 dataset.

Figure 6 .
Figure 6. Success plots of proposed tracker compared with state-of-the-art approaches on different attributes of UAV123 dataset.

Figure 7 .
Figure 7. Qualitative evaluation of tracking results on UAV123 dataset.

Figure 8 .
Figure 8. Tracking results using fused, color names (CN) or histograms of oriented gradient (HOG) features on 20 sequences from UAV123 dataset.

Figure 9 .
Figure 9. Tracking results with or without saliency proposals on 20 sequences from UAV123 dataset.

Figure 10 .
Figure 10. Tracking results with saliency-based detection and sliding window-based detection on 20 sequences from UAV123 dataset.

Figure 11 .
Figure 11. Running speeds of tracking methods with saliency-based detection and sliding window-based detection on 20 sequences from UAV123.

Funding:
This work was supported by the National Natural Science Foundation of China (61871460, 61876152), the National Key Research and Development Program of China (2016YFB0502502), and the Foundation Project for Advanced Research Field of China (614023804016HK03002).

Table 1 .
Overall rates of precision plots on different sequences of VIVID dataset.

Table 2 .
Overall rates of success plots on different sequences of VIVID dataset.

Table 3 .
Description of sequences selected from UAV123 for experimental investigations.

Table 4 .
Overall rates of precision plots on different sequences of UAV123 dataset.

Table 5 .
Overall rates of success plots on different sequences of UAV123 dataset.

Table 6 .
Running speed (frames per second) of each tracker on sequences from the UAV123 dataset.