Learning Future-Aware Correlation Filters for Efficient UAV Tracking

Abstract: In recent years, discriminative correlation filter (DCF)-based trackers have made considerable progress and drawn widespread attention in the unmanned aerial vehicle (UAV) tracking community. Most existing trackers collect historical information, e.g., training samples, previous filters, and response maps, to promote their discrimination and robustness. Under UAV-specific tracking challenges, e.g., fast motion and view change, variations of both the target and its environment in the new frame are unpredictable. Interfered with by unknown future environments, trackers trained only on historical information may be confused by the new context, resulting in tracking failure. In this paper, we propose a novel future-aware correlation filter tracker, i.e., FACF. The proposed method aims to effectively utilize the context information in the new frame for better discriminative power and robustness, and consists of two stages: future state awareness and future context awareness. In the former stage, an effective time series forecasting method is employed to infer a coarse position of the target, which serves as the reference for obtaining a context patch in the new frame. In the latter stage, we first obtain a single context patch with an efficient target-aware method. Then, we train a filter with this future context information in order to perform robust tracking. Extensive experimental results on three UAV benchmarks, i.e., UAV123_10fps, DTB70, and UAVTrack112, demonstrate the effectiveness and robustness of the proposed tracker. Our tracker achieves performance comparable with other state-of-the-art trackers while running at ∼49 FPS on a single CPU.


Introduction
Visual object tracking is a popular but challenging task in the domain of multimedia and computer vision. Given a video sequence, the task is to precisely estimate the position of the target of interest. With the popularity of unmanned aerial vehicles (UAVs), visual tracking on UAV platforms has attracted extensive attention, e.g., in public security [1], disaster investigation [2], and remote sensor mounting [3]. Although this technique has made impressive progress, its performance is unsatisfactory when the target undergoes UAV-specific tracking challenges, such as viewpoint change, fast motion, and low resolution.
There are two main streams in the visual tracking community: DCF-based trackers [4][5][6][7] and Siamese-based trackers [8]. Although Siamese-based trackers achieve impressive tracking performance using one or more GPUs, the complex calculation of the deep network inevitably brings a large energy cost to mobile platforms such as UAVs. DCF-based trackers [9,10] are among the most suitable choices for resource-limited UAV platforms because of their balanced accuracy and speed as well as low cost. However, training with synthesized samples [11] inevitably impedes the discriminative power of the filter; this is known as the boundary effect. In the literature, many attempts have been made to address this problem and enhance filter discrimination between the target and its environment, such as fixed or adaptive spatial regularization [12][13][14][15] and context learning [16][17][18]. For context learning, these methods [16][17][18] suppress the response of context patches in multiple directions to zero, thus achieving effective performance improvement. Nevertheless, multiple context patches may introduce irrelevant background noise, resulting in a suboptimal filter. Moreover, the feature extraction of these patches, especially when using deep features [18], hinders the real-time ability of trackers with context learning.
Moreover, traditional DCF-based trackers model the filter by virtue of historical information, e.g., accumulated training samples [4,[11][12][13][14][15], previously generated filters [5,19,20], or response maps [21][22][23]. Although DCF-based trackers benefit from previous clues, this paradigm may fail to deal well with complex and changeable UAV tracking challenges, such as fast motion and viewpoint change. In these cases, both the volatile environment and target appearance changes bring about severe uncertainties. It has been shown that information from the future frame [20] plays a vital role in improving the adaptability of the tracker. In [20], the object observation in the next frame is predicted by exploring spatial-temporal similarities of the target change in consecutive frames. Then, it is integrated with historical samples to form a more robust object model. However, this similarity assumption is not always valid for UAV object tracking because of the complex, changeable nature of UAV tracking scenarios.
With respect to the above concerns, we propose a two-stage correlation filter tracker that can efficiently exploit the contextual information of the upcoming frame. This relies on two successive future-aware stages, i.e., future state awareness and future context awareness. The former stage predicts the spatial location change of the target in the upcoming frame, and the latter suppresses distractions caused by the complex future background while enhancing filter discriminative power. In the first stage, when a new frame arrives, the simple yet effective single exponential smoothing forecast method [24] is used to predict a coarse target position. In the second stage, we employ an efficient mask generation method to segment a single context patch based on the coarse position. Then, the segmented contextual information is incorporated into the training phase to improve discrimination. Lastly, the resulting more powerful filter rectifies the prediction error of the first stage. We perform comprehensive experiments on three challenging UAV benchmarks, i.e., DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27]. The results confirm that the proposed tracker has superiority in terms of accuracy and speed compared with 29 other state-of-the-art trackers. Figure 1 shows the overall performance of all trackers on the DTB70 [25] benchmark. Clearly, our tracker has comparable performance against other trackers while maintaining real-time speed on a single CPU, which demonstrates that the FACF tracker is suitable for real-time UAV applications.
The main contributions are summarized as follows:
• A coarse-to-fine DCF-based tracking framework is proposed to exploit the context information hidden in the frame that is to be detected;
• Single exponential smoothing forecast is used to provide a coarse position, which serves as the reference for acquiring a context patch;
• We obtain a single future-aware context patch through an efficient target-aware mask generation method without additional feature extraction;
• Experimental results on three UAV benchmarks verify the advancement of the proposed tracker; our tracker maintains real-time speed in real-world tracking scenarios.
The remainder of this paper is organized as follows: Section 2 generalizes the most relevant works; Section 3 introduces the baseline tracker; Section 4 details the proposed method; Section 5 exhibits extensive and comprehensive experiments; and Section 6 provides a brief summary of this work.
Figure 1. Overall performance based on area under curve (AUC) and distance precision (DP) of the proposed FACF tracker and 29 other state-of-the-art trackers on the DTB70 [25] benchmark. AUC and DP are two metrics for evaluating tracking accuracy; their detailed explanation is in Section 5. The legend provides the detailed speed of each tracker. The superscript * denotes a GPU-based tracker.

Related Works
In this section, we briefly discuss the most related trackers, including DCF-based trackers, trackers with context learning, trackers with future information, and trackers for UAVs.

DCF-Based Trackers
DCF-based trackers formulate the tracking task as a ridge regression problem, with the aim of training a filter to distinguish the target from the background. The use of a cyclic matrix and calculation in the Fourier domain simplifies the filter optimization process. Recently, many methods have been proposed to improve tracking accuracy from different aspects. These methods include kernel tricks [11], scale estimation [6,28,29], mitigation of boundary effects [12][13][14][15], solutions for temporal degradation [5,19], training-set management [7,30], more powerful feature representation [7,11,[31][32][33], and consequent feature de-redundancy [19,34,35]. In general, the above methods collect historically known information to predict unknown future target states; future information is not utilized to raise the robustness and adaptability of the tracker.

Trackers with Context Learning
In the literature, context learning is one of the most efficacious strategies for enhancing the discrimination of the filter. Mueller et al. [16] proposed a novel context-aware DCF-based tracker (CACF). They used multiple regularization terms to repress the response of context patches in four directions around the target. Later, different trackers [17,18,36] equipped with context learning all achieved significant performance improvement. In detail, Fu et al. [17] selected more reasonable surrounding samples according to the position and scale of the tracked object. Based on CACF [16], Yan et al. [36] cropped four context patches on the basis of the location of the distractor response generated in the last frame. To repress context noise more adequately, Li et al. [18] considered context samples located at the four corners. Moreover, to avoid frame-by-frame learning, they proposed a periodic key frame selection method for context learning and used temporal regularization to retain the discrimination against background interference. All of the above methods are limited by the heavy feature extraction of context samples, especially when using deep features.
In addition, it is easy for these methods to introduce background noise outside of the search region. Different from the methods mentioned above, our work produces a context-aware mask to segment a single pixel-level context sample, which can drastically increase speed and evade unrelated interference while boosting performance remarkably.

Trackers with Future Information
Most trackers [12][13][14][15][16][17][18][21][22][23]34,36,37] based on the DCF framework update the target appearance model with a fixed online learning rate through historical observations. Then, the filter trained with the appearance model is used for predicting the target state in the upcoming frame. By mining temporal-spatial similarities in consecutive video sequences, Zhang et al. [20] predicted a target observation of the next frame and incorporated it into the target model with a large learning rate. This can improve the adaptation of the model to target appearance variations, thus promoting tracking performance. The method implies that the change rate of the target appearance is constant, which is not always valid, as fast motion and viewpoint change often exist in UAV tracking. Different from the similarity assumption [20] for obtaining future information, we predict a coarse position based on single exponential smoothing forecast. Based on this position, an actual context patch in the next frame is efficiently segmented for context learning.

Trackers for UAVs
DCF-based trackers [10] are gradually becoming the most pervasive tracking paradigm in the UAV tracking community due to their high efficiency. In the real-world UAV object tracking process, low resolution, fast object motion, and viewpoint change pose extreme challenges. With respect to these issues, a large number of works have been proposed for better tracking performance. They can mainly be divided into the following strategies: spatial attention mechanisms, adaptive regression labels, and temporal consistency. Concretely, in contrast to the fixed spatial attention in [12], later works [23,38] proposed dynamic spatial attention mechanisms using target salient information or response variations. Different from the traditional predefined label, References [39,40] generate an adaptive regression label to repress distractors. On the other hand, References [21,41] keep temporal consistency at the response level, which largely improves positioning accuracy. Beyond the above works, there also exists work [42] that focuses on adaptive templates and model updates by using cellular automata and high-confidence assessment. Recently, some lightweight Siamese-based trackers [27,43,44] have been designed for UAV object tracking, such as SiamAPN++ [27] and MultiRPN-DIDNet [44]. All the above-mentioned works ignore the threat posed by rapid context changes in real-world UAV tracking. By incorporating future contextual information into filter learning, the proposed tracker with handcrafted features is more discriminative to the scene changes of UAV object tracking.

Revisit BACF
In this section, we review the BACF [13] tracker, which serves as the baseline in this work. To mitigate the problem of boundary effects, the background-aware correlation filter (BACF) [13] enlarges the search region and introduces a binary cropping matrix B ∈ R^{M×N} (M ≪ N) to extract more complete negative samples, which largely improves tracking performance. Given the vectorized desired response y ∈ R^{N×1} and the vectorized training sample of each channel x_d ∈ R^{N×1}, the filter w in the current frame f can be obtained by minimizing the following objective:

E(w) = (1/2) ‖ y − Σ_{d=1}^{D} x_d ⋆ (B^T w_d) ‖² + (λ/2) Σ_{d=1}^{D} ‖ w_d ‖²,   (1)

where λ is the filter regularization parameter, D is the total number of channels, and ⋆ represents the correlation operator.
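BACF's actual solver crops real negative samples via B and optimizes with ADMM. As a minimal runnable sketch of the underlying ridge-regression machinery only (a single-channel filter without the cropping matrix; the function names and synthetic data are ours), the unconstrained problem admits a closed-form solution in the Fourier domain:

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    """Closed-form single-channel correlation filter in the Fourier domain.

    x : (N,) training sample; y : (N,) desired Gaussian response;
    lam : regularization weight (lambda in the objective)."""
    X = np.fft.fft(x)
    Y = np.fft.fft(y)
    # Ridge-regression solution: H = conj(X) * Y / (|X|^2 + lambda)
    return np.conj(X) * Y / (X * np.conj(X) + lam)

def detect(H, z):
    """Correlate a test sample z with the trained filter; return the response."""
    return np.real(np.fft.ifft(H * np.fft.fft(z)))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
# Gaussian label whose peak is circularly shifted to index 0
y = np.roll(np.exp(-0.5 * (np.arange(64) - 32) ** 2 / 4.0), 32)
H = train_filter(x, y)
resp = detect(H, x)  # response to the training sample itself
print(int(np.argmax(resp)))
```

Applying the filter to its own training sample reproduces the label almost exactly, so the response peaks at the label's peak position.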
Unfortunately, a large number of real negative samples inevitably introduce background noise, which results in insufficient discrimination of the tracker, especially when distractors appear in the future frame.

Proposed Approach
In this section, we first carefully analyze the existing problems. Then, we introduce the proposed method, including two stages: future state awareness and future context awareness. Lastly, the complete tracking procedure is detailed.

Problem Formulation
In the tracking process of DCF-based trackers, the filter learned in the current frame is used to localize the object in the upcoming frame. Variations of the tracked object and its surroundings in the new frame are unforeseeable. The interference of surroundings, along with the boundary effects, may result in tracking drift. Many trackers [16][17][18]36] exploit current context patches surrounding the tracked object to enhance the discrimination power. However, this strategy cannot ensure robustness when facing unpredictable background variations in the new frame.
Usually, DCF-based trackers use a padded object patch, which contains certain context information, and its corresponding Gaussian label for filter training. It is expected that the response within the target region has a Gaussian shape while the response in the background region tends to zero. Trackers with context learning [16,18] obtain context information by cropping several context patches around the target and then exploit context regularization terms to restrain the response of these patches to zero. As shown in the top figure in Figure 2, the context patches (blue dotted boxes) are the same size as the object patch (yellow dotted box). Therefore, these context patches may include the target of interest, which contradicts the regression term. Moreover, this strategy brings a heavy calculation burden and redundancy: on the one hand, context patches need to be cropped and their features extracted separately; on the other hand, the context patches contain a large percentage of overlapping regions, which is inefficient. With respect to these concerns, we utilize context information in the upcoming frame for filter training, which aims to cope with unpredictable background variations. In addition, designing an efficient method to obtain the context patch is another goal of this work. Inspired by two-stage detectors [45,46], we propose a coarse-to-fine search strategy to improve localization precision. The pipeline of the proposed tracker is shown in Figure 3. Specifically, a preliminary prediction of the object location is performed to precisely segment contextual pixels in the new frame. Then, the future-aware filter trained with future context information corrects the prediction bias in order to obtain the final prediction.

Figure 3. Tracking process of the proposed FACF tracker, which consists of two stages: future state awareness and future context awareness. Future state awareness: when a new frame arrives, the single exponential smoothing forecast produces a coarse target position. Future context awareness: we extract feature maps of the predicted region at the coarse position; the feature maps are multiplied by the context mask to obtain the features of the single context patch, which are then fed into the filter training phase. Finally, the resulting filter performs target localization on the feature maps of the predicted region.

Stage One: Future State Awareness
From another perspective, the task of visual tracking is to predict target displacement in subsequent frames. Normally, the motion state of the target within a certain time interval is approximately unchanged. Based on this assumption, we use a simple and effective time series forecasting method, i.e., single exponential smoothing forecast (SESF) [24], to roughly estimate the displacement of the target in the new frame. Assuming that the true displacement vector (estimated by the filter) ∆^t_f = [∆x^t_f, ∆y^t_f] of the f-th frame is given, the predicted displacement (estimated by SESF) ∆^p_{f+1} in the (f+1)-th frame can be obtained by the following formula:

∆^p_{f+1} = α ∆^t_f + (1 − α) ∆^p_f,   (2)

where ∆^p_f is the predicted displacement vector in the f-th frame, [∆x, ∆y] represents the displacement deviation in the x and y directions, and α is the smoothing index. We can then obtain the initial prediction P^p_{f+1} of the target position in the next frame:

P^p_{f+1} = P^t_f + ∆^p_{f+1},   (3)

where P^t_f denotes the position estimated by the filter in the f-th frame. As shown in Figure 4, we provide some examples to verify the effectiveness of single exponential smoothing forecast for the initial prediction. We compare the center location error (CLE) of the initial and final predictions produced by the SESF module and our filter, respectively. In some cases, the initial prediction is more accurate than the final prediction. Next, we discuss how the future context information is used and how the final position P^t_{f+1} is obtained on the basis of the single exponential smoothing forecast.
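Equations (2) and (3) can be sketched directly; the following toy example (the trajectory values are invented for illustration) updates the forecast over three frames and produces the coarse position for the next frame:

```python
import numpy as np

def sesf_update(delta_true, delta_pred_prev, alpha=0.88):
    """Single exponential smoothing forecast of the next-frame displacement.

    delta_true      : measured displacement [dx, dy] in the current frame
    delta_pred_prev : displacement predicted for the current frame
    alpha           : smoothing index (the paper selects 0.88)"""
    return alpha * np.asarray(delta_true) + (1 - alpha) * np.asarray(delta_pred_prev)

pos = np.array([100.0, 50.0])      # P^t_f: filter-estimated position
delta_pred = np.array([0.0, 0.0])  # initial forecast
for delta_true in ([4.0, 1.0], [5.0, 0.0], [4.5, 0.5]):  # per-frame displacements
    delta_pred = sesf_update(delta_true, delta_pred)      # Equation (2)
    pos = pos + np.asarray(delta_true)                    # filter-refined position
coarse_next = pos + delta_pred                            # Equation (3): P^p_{f+1}
print(coarse_next)
```

Because α is close to one, the forecast is dominated by the most recent displacement, matching the short-term constant-motion assumption.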

Fast Context Acquisition
Usually, previous context learning methods [16][17][18] are limited by the tedious feature extraction of context patches in multiple directions, which also increases the computational complexity of filter training. Moreover, context information outside of the search region may introduce unnecessary information into the model. Furthermore, previous methods use the current context information for discrimination enhancement, which cannot cope with unpredictable changes in the new frame, such as the appearance of similar targets or a sudden viewpoint change.
While most trackers [14,47] strive to improve focus on the target through mask generation methods, we take the opposite approach. As shown in the bottom of Figures 2 and 3, when a new frame arrives, we first obtain the coarse object patch with the initial prediction P^p_f. Then, we use an efficient and effective mask generation method [47] to acquire the mask m of the target. Finally, the context-aware mask (1 − m) is used to segment a single context patch from the coarse object patch. In practice, for efficiency, we directly segment the features of the context patch after obtaining the features of the predicted patch. The coarse object patch is regarded as the new search region to correlate with the filter, after which we acquire the final prediction.
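The pixel-level segmentation step amounts to one element-wise multiplication on already-extracted features. The sketch below uses a simple box-shaped mask as a stand-in for the target-aware mask of [47] (the real mask is learned from target saliency; the box and all sizes here are illustrative assumptions):

```python
import numpy as np

def target_mask(h, w, target_h, target_w):
    """Toy stand-in for the target-aware mask m of [47]: ones inside a
    centered estimate of the target box, zeros elsewhere."""
    m = np.zeros((h, w))
    top, left = (h - target_h) // 2, (w - target_w) // 2
    m[top:top + target_h, left:left + target_w] = 1.0
    return m

# Feature maps of the predicted (coarse) region: (H, W, D) channels.
feat = np.random.default_rng(1).standard_normal((40, 40, 8))
m = target_mask(40, 40, 16, 16)
# Single pixel-level context sample: multiply features by the context mask (1 - m).
context_feat = feat * (1.0 - m)[:, :, None]
print(context_feat[20, 20, 0])  # inside the target box, so suppressed to 0
```

No extra feature extraction is needed: the context features reuse the predicted patch's features, which is the source of the method's speed advantage over cropping multiple context patches.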

Filter Training
Based on BACF [13], we incorporate pixel-level future context information into the training phase. The objective function of the proposed tracker is expressed as follows:

E(w) = (1/2) ‖ y − Σ_{d=1}^{D} x_d ⋆ (B^T w_d) ‖² + (λ/2) Σ_{d=1}^{D} ‖ w_d ‖² + (γ/2) ‖ Σ_{d=1}^{D} x^c_d ⋆ (B^T w_d) ‖²,   (4)

where x^c_d = (1 − m) ⊙ x^p_d represents the surrounding context of the sought object in the upcoming frame (f + 1), ⊙ is the dot product operator, and γ is the context regularization parameter.
Denoting auxiliary variable h d = B T w d ∈ R N×1 , Equation (4) can be rewritten as follows.
Then, the ADMM [48] algorithm is adopted to optimize Equation (6) by alternately solving the following three subproblems. Each subproblem has its own closed-form solution.
Subproblem w: Equation (7) can be solved in the spatial domain: where ζ_f and h_f can be obtained, respectively, by the inverse Fourier transform. Since Equation (9) involves an element-wise dot product, we process the pixels at the same location together in order to decrease its high computational complexity. Equation (9) can thus be reformulated as follows.
Taking the derivative of Equation (10) with respect to ĥ_f(n) and setting the result equal to zero, we obtain the following.
Equation (11) involves a matrix inversion, which is computationally heavy. Owing to its rank-one structure, the Sherman–Morrison [49] formula can be applied to accelerate the computation. For convenience, we denote b = µN + Σ_{a=0}^{1} S_a x̂_a(n)^T x̂_a(n) and ŝ_p(n) = Σ_{a=0}^{1} S_a x̂_a(n)^T x̂_f(n). Then, Equation (11) can be converted into the following.
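The rank-one inversion trick from [49] can be checked numerically. Assuming the Sherman–Morrison identity (A + uv^T)^{-1} = A^{-1} − (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u), a small sketch (the diagonal base matrix stands in for the µN·I term; all values are illustrative):

```python
import numpy as np

def sherman_morrison_inv(A_inv, u, v):
    """(A + u v^T)^{-1} computed from A^{-1} via a rank-one update,
    avoiding a full matrix inversion."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(2)
n = 5
A = np.eye(n) * 3.0          # well-conditioned base matrix (e.g., mu*N*I)
u = rng.standard_normal(n)
v = u.copy()                 # symmetric rank-one update keeps the denominator > 1
fast = sherman_morrison_inv(np.linalg.inv(A), u, v)
slow = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(fast, slow))
```

When A is diagonal, A^{-1} is trivial, so the update costs O(n^2) instead of the O(n^3) of a direct inversion.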

Object Detection
The final object position can be estimated through the peak of the generated response map R. Given the predicted patch x̂^p_{f+1} and the trained filter ĥ_f, the response map in frame f + 1 can be obtained by the following:

R = F^{−1}( Σ_{d=1}^{D} x̂^p_{f+1,d} ⊙ ĥ_{f,d} ),   (14)

where F^{−1} represents the inverse Fourier transform (IFT). The biggest difference between our tracker and previous trackers is that it incorporates future context information, making it more robust to uncertain environment changes.
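The detection step can be sketched as follows. The matched filter below is a deliberately simplified stand-in for the trained FACF filter (the sizes, seed, and shift are illustrative); it shows how the channel-summed inverse FFT of Equation (14) turns a spatial shift of the target into a shifted response peak:

```python
import numpy as np

def localize(x_pred_hat, h_hat):
    """Response map and peak location for frame f+1.

    x_pred_hat : (H, W, D) Fourier-domain features of the predicted patch
    h_hat      : (H, W, D) Fourier-domain filter"""
    # R = F^{-1}( sum_d x^p_d .* h_d ), summed over feature channels
    R = np.real(np.fft.ifft2(np.sum(x_pred_hat * h_hat, axis=2)))
    peak = np.unravel_index(np.argmax(R), R.shape)
    return R, peak

rng = np.random.default_rng(3)
x = rng.standard_normal((32, 32, 1))          # template patch
z = np.roll(x, shift=(5, 7), axis=(0, 1))     # target moved by (5, 7) pixels
h_hat = np.conj(np.fft.fft2(x, axes=(0, 1)))  # matched filter stand-in
R, peak = localize(np.fft.fft2(z, axes=(0, 1)), h_hat)
print(peak)
```

The peak of R lands exactly at the applied circular shift, which is how the displacement of the target is read off the response map.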

Model Update
The object appearance model is updated frame by frame using a linear weighted combination:

x̂^M_f = (1 − β) x̂^M_{f−1} + β x̂^o_f,   (15)

where x̂^M represents the object model, x̂^o_f is the training sample of the current frame, and β is the online learning rate.
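The linear update is a one-line exponential moving average; the sketch below (with the paper's β = 0.019 and toy data) shows how repeated observations gradually pull the model toward the new appearance:

```python
import numpy as np

def update_model(x_model, x_obs, beta=0.019):
    """Frame-by-frame linear update of the appearance model (paper's beta)."""
    return (1.0 - beta) * x_model + beta * x_obs

x_model = np.zeros(4)   # stale appearance model
x_obs = np.ones(4)      # repeated new observation
for _ in range(100):
    x_model = update_model(x_model, x_obs)
print(float(x_model[0]))
```

A small β makes the model robust to transient noise but slow to adapt; after 100 identical observations the model has covered roughly 85% of the gap to the new appearance.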

Tracking Procedure
In this work, we train the scale filter [6] to estimate the scale variation. The complete tracking procedure of our tracker is shown in Algorithm 1.

Algorithm 1: FACF Tracker
Input: A video sequence with F frames. The position P 1 and scale S 1 of the target in the first frame I 1 .

Output:
The position P_f and scale S_f of the target in subsequent frames I_f, f > 1.
8: Learn the filter ĥ_f using the context information of the upcoming frame with Equation (12).
9: Detection phase:
10: Generate the response map R_f with x̂^p_f and ĥ_f using Equation (14).
11: Obtain the final object position P^t_f (P_f) via the response map R_f and estimate the scale S_f.
12: Return P_f and S_f.
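A highly simplified sketch of the two-stage loop is given below. The synthetic constant-velocity trajectory and the `filter_detect` stand-in (which plays the role of the trained future-aware filter) are illustrative assumptions, not the paper's implementation; the point is how the SESF forecast (stage one) feeds the detection (stage two), whose result in turn refines the next forecast:

```python
import numpy as np

def sesf(delta_true, delta_prev, alpha=0.88):
    """Single exponential smoothing forecast of the next displacement."""
    return alpha * delta_true + (1 - alpha) * delta_prev

def filter_detect(true_pos, coarse_pos):
    """Stand-in for the future-aware filter: refines the coarse position.
    Here it simply returns the known true position to illustrate the loop."""
    return true_pos

# Synthetic trajectory: target moves +3 px/frame in x.
true_positions = [np.array([50.0 + 3 * f, 40.0]) for f in range(6)]
pos = true_positions[0]
delta_pred = np.zeros(2)
errors = []
for f in range(1, len(true_positions)):
    coarse = pos + delta_pred              # stage 1: future state awareness
    # stage 2: segment context at `coarse`, train the filter, then detect
    new_pos = filter_detect(true_positions[f], coarse)
    delta_pred = sesf(new_pos - pos, delta_pred)
    errors.append(float(np.linalg.norm(coarse - true_positions[f])))
    pos = new_pos
print(errors)
```

The coarse-prediction error shrinks frame by frame as the forecast locks onto the motion, which is exactly why the coarse position is a reliable reference for segmenting the future context patch.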

Experiments
In this section, we perform extensive and comprehensive experiments on three challenging UAV benchmarks. First, we introduce the detailed experimental settings, including parameters, benchmarks, metrics, and platform. Next, we compare our tracker with 29 other state-of-the-art trackers with handcrafted or deep features. Then, we verify the rationality of the parameters and the effectiveness of each component. Afterward, different context learning strategies are analyzed. The last subsection provides some failure cases of the proposed tracker.

Parameters
Our tracker uses Hog, CN, and Grayscale features for object representation. We use two ADMM iterations to train the filter. The learning rate β of the model update is set to 0.019, and the context regularization parameter is chosen as γ = 0.009. The smoothing index α is set to 0.88. During the entire experiment, the parameters of the proposed tracker remain unchanged. The other trackers used for comparison retain their initial parameter setting. The code of our tracker is available at https://github.com/FreeZhang96/FACF, accessed on 10 September 2021.

Benchmarks
Experiments are conducted on three well-known UAV benchmarks: DTB70 [25], UAV123_10fps [26], and the recently built UAVTrack112 [27]. These benchmarks contain 305 video sequences in total, all captured on UAV platforms.

Metrics
We use the one-pass evaluation (OPE) protocol to test all trackers. Evaluations of tracking accuracy are based on IoU and CLE. IoU (intersection over union) is the overlap ratio between the predicted and ground-truth bounding boxes. CLE (center location error) denotes the error (in pixels) between the predicted and true center locations. A frame is deemed successfully tracked when IoU exceeds (or CLE falls below) a given threshold. By sweeping the thresholds (IoU ∈ [0, 1] and CLE ∈ [0, 50]), we obtain the success and precision plots. The area under the success-plot curve is denoted as AUC. DP (distance precision) is the score in the precision plot at CLE = 20 pixels. FPS (frames per second) is used to measure the speed of each tracker.
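The two accuracy metrics can be computed per frame as follows (box layout `[x, y, w, h]` is an assumption of this sketch; benchmarks store ground truth in this format):

```python
import numpy as np

def iou(b1, b2):
    """Intersection over union of two [x, y, w, h] boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (b1[2] * b1[3] + b2[2] * b2[3] - inter)

def cle(b1, b2):
    """Center location error (pixels) between two [x, y, w, h] boxes."""
    c1 = (b1[0] + b1[2] / 2.0, b1[1] + b1[3] / 2.0)
    c2 = (b2[0] + b2[2] / 2.0, b2[1] + b2[3] / 2.0)
    return float(np.hypot(c1[0] - c2[0], c1[1] - c2[1]))

pred, gt = [10, 10, 20, 20], [15, 10, 20, 20]  # prediction offset by 5 px in x
print(iou(pred, gt), cle(pred, gt))
```

Averaging the per-frame success indicator over a sweep of IoU thresholds yields the success plot (AUC), and the fraction of frames with CLE below 20 pixels gives DP.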

Platform
All experiments are carried out using Matlab2019b. The experimental platform is a PC with an Intel(R) Core(TM) i7-9750H CPU (2.60 GHz), 32 GB RAM, and a single RTX 2060 GPU.
Overall Evaluation. Precision and success plots of our tracker and other trackers on all three benchmarks are presented in Figure 5.
The DTB70 [25] benchmark contains 70 video sequences with 12 attributes. Our tracker has the best AUC and DP scores, namely 0.496 and 0.727, which surpass those of the second-best tracker, AutoTrack [23], by 1.8% and 1.1%, respectively.
UAV123_10fps [26] is a large benchmark composed of 123 challenging UAV video sequences. We report the precision and success plots in Figure 5. The proposed tracker FACF outperforms the other trackers in terms of AUC and DP scores.
Figure 5. Precision plots (first row) and success plots (second row) on (a) DTB70 [25], (b) UAV123_10fps [26], and (c) UAVTrack112 [27].
UAVTrack112 [27] is a newly built benchmark collected with a DJI Mavic Air2. This benchmark contains 112 sequences with 13 attributes. From the plots in Figure 5c, our tracker performs the best, with AUC and DP scores of 0.478 and 0.709, respectively. Table 1 presents the average AUC, DP, and speed comparison between our tracker and the other handcrafted-feature-based trackers on the three benchmarks. In terms of AUC and DP, our tracker FACF performs best (0.486 and 0.707), followed by AutoTrack [23] (0.473 and 0.694) and ARCF [21] (0.467 and 0.677). The average speed of our FACF tracker reaches 48.816 FPS, which is sufficient for real-time applications.
Attribute-oriented Evaluation. To verify the performance of the proposed tracker in UAV-specific scenarios, this part provides extensive attribute-based analysis following the attribute categorization in [10]. The new attributes for all benchmarks include VC (camera motion and viewpoint change), FM (fast motion), LR (low resolution), OCC (partial occlusion and full occlusion), and IV (illumination variation). For each new attribute, we take the average AUC/DP score over all related attributes in all benchmarks as the final score. For example, for the DP score of VC, the five scores of all related attributes (camera motion and viewpoint change in both UAV123_10fps [26] and UAVTrack112 [27], and fast camera motion in DTB70 [25]) are averaged to obtain the desired result. Table 2 exhibits the average performance of 17 different trackers under these specific attributes. Our tracker achieves the best AUC and DP scores under the VC, FM, and LR attributes.
Table 1. Average performance of all trackers on benchmarks DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27]. Red, green, and blue represent the top three trackers in terms of DP, AUC, and FPS, respectively.
Figure 6 provides detailed success plots of representative attribute-based analysis on different benchmarks. In terms of camera motion, fast motion, and low resolution, our tracker is in a leading position, surpassing the second place by a large margin. As shown in Figure 7, we compare the tracking results of the proposed tracker with five other state-of-the-art trackers, i.e., STRCF [5], BACF [13], ECO_HC [7], AutoTrack [23], and ARCF [21], on 10 challenging video sequences. In these UAV-specific scenarios (including VC, LR, and FM), the proposed tracker achieves robust tracking while the other trackers fail.

The Impact of Key Parameter
To investigate the impact of key parameters on performance, we perform extensive experiments on the DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27] benchmarks. As shown in Figures 8 and 9, we only provide the results of the most important parameters, i.e., the smoothing index α and the context regularization parameter γ.
Figure 7. Qualitative results on challenging sequences from DTB70 [25]; car13, uav3, and wakeboard from UAV123_10fps [26]; and courier1, electric box, and uav1 from UAVTrack112 [27].
Smoothing index α. The smoothing index α ∈ [0, 1] controls the weight of the true displacement in the current frame for prediction. For the tracking task, the displacement of the tracked object over a short time is approximately constant; therefore, the smoothing index should be close to one. In Figure 8, we provide the average AUC and DP scores as α varies from 0 to 1. When the smoothing index reaches α = 0.88 (red dotted line in Figure 8), our tracker achieves the best performance in terms of AUC and DP. The result confirms our analysis, and we select α = 0.88.
Context regularization parameter γ. Figure 9 shows the average AUC and DP scores on all three benchmarks when the value of context regularization parameter varies from 0 to 0.02 with a step of 0.001. The red dotted line denotes the performance when γ = 0. As γ increases, AUC and DP reach the maximum value when γ = 0.009. Therefore, this work selects γ = 0.009.

The Validity of Each Component
To verify the effectiveness of each component, we develop four trackers equipped with different components. The average performance of the different trackers on the three benchmarks is shown in Figure 10. FACF-FCA denotes the FACF tracker with future context awareness (FCA) disabled. FACF-FSA stands for the FACF tracker without future state awareness (FSA). FACF-(FSA+FCA) represents the BACF tracker with Hog, CN, and Grayscale features (the baseline tracker). Clearly, both the FSA and FCA modules improve tracking performance. The FSA module is beneficial for obtaining the future context patch as well as for improving tracking accuracy, and only on the basis of FSA can FCA contribute to more accurate tracking.

The Strategy for Context Learning
In this part, different context learning methods based on the current or upcoming frame are compared on the DTB70 [25] benchmark. We denote the context learning methods of FACF and CACF as FCA and CA, respectively. FACF+CA means the FACF-FCA tracker equipped with the context learning method of CACF [16]. BACF+CA and BACF+FCA are the baseline trackers (with Hog, CN, and Grayscale features) using the context learning strategy of CACF [16] and the FCA of our tracker, respectively. For these trackers, we fine-tuned the context regularization parameter to obtain the best performance on the DTB70 [25] benchmark (0.01 for FACF+CA, 0.008 for BACF+CA, and 0.001 for BACF+FCA). The results displayed in Table 4 indicate that the context learning strategy proposed in this paper is superior to that of CACF [16]. Fast context segmentation not only prevents the context patch from containing the target but also effectively reduces computational complexity.

Failure Cases

Figure 11 visualizes three representative and challenging sequences from the three benchmarks in which the proposed method fails to track. In the sequence Animal3, there are targets similar to the alpaca to be tracked around it. Although our tracker uses context learning to repress interference, their similar appearance still confuses the proposed tracker. In the sequence uav1_3, the UAV moves so fast and irregularly that the SESF module cannot work well, resulting in tracking failure. In the sequence sand truck, the sand truck is under a low-illumination condition; when the target enters a dark environment, the proposed tracker cannot localize it.
Figure 11. Some representative tracking failure cases of the proposed FACF tracker. From top to bottom: Animal3, uav1_3, and sand truck from DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27], respectively.

Conclusions
In this work, in order to enhance filter discriminative power in unknown future environments, we proposed a novel future-aware correlation filter tracker, namely FACF. By virtue of an effective time series forecasting method, we obtain the predicted patch at the coarse target position, which is beneficial for localizing the target more precisely. Then, a context-aware mask is produced through an efficient target-aware method. Afterward, we obtain a single pixel-level context patch by the element-wise dot product between the context-aware mask and the feature maps of the predicted patch. Finally, the feature maps of the context patch are utilized to improve discriminative ability. Extensive experiments on three UAV benchmarks verify the superiority of the FACF tracker against other state-of-the-art handcrafted-feature-based and deep-feature-based trackers.
The proposed future-aware strategy aims to deal with unpredicted surrounding changes by learning the future context rather than the current context. The fast context acquisition avoids additional feature extraction as well as unrelated background noise. In general, our method guarantees robustness, accuracy, and efficiency, which is promising for real-time UAV applications. We believe the proposed context learning method can be extended to other trackers for more robust UAV tracking. In future work, we will explore more accurate and efficient strategies to exploit future information in order to boost tracking performance without drastically sacrificing speed.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.