Remote Sensing
  • Article
  • Open Access

14 October 2021

Learning Future-Aware Correlation Filters for Efficient UAV Tracking

1 Aeronautics Engineering College, Air Force Engineering University, Xi’an 710038, China
2 Air Traffic Control and Navigation College, Air Force Engineering University, Xi’an 710051, China
3 Air and Missile Defense College, Air Force Engineering University, Xi’an 710051, China
4 Harbin Institute of Technology, Harbin 150080, China
This article belongs to the Special Issue Advanced Application of Artificial Intelligence and Machine Vision in Remote Sensing

Abstract

In recent years, discriminative correlation filter (DCF)-based trackers have made considerable progress and drawn widespread attention in the unmanned aerial vehicle (UAV) tracking community. Most existing trackers collect historical information, e.g., training samples, previous filters, and response maps, to promote their discrimination and robustness. Under UAV-specific tracking challenges, e.g., fast motion and view change, variations of both the target and its environment in the new frame are unpredictable. Interfered with by unknown future environments, trackers trained only on historical information may be confused by the new context, resulting in tracking failure. In this paper, we propose a novel future-aware correlation filter tracker, i.e., FACF. The proposed method aims at effectively utilizing context information in the new frame for better discriminative and robust abilities and consists of two stages: future state awareness and future context awareness. In the former stage, an effective time series forecasting method is employed to infer a coarse position of the target, which serves as the reference for obtaining a context patch in the new frame. In the latter stage, we first obtain a single context patch with an efficient target-aware method. Then, we train the filter with this future context information in order to perform robust tracking. Extensive experimental results obtained from three UAV benchmarks, i.e., UAV123_10fps, DTB70, and UAVTrack112, demonstrate the effectiveness and robustness of the proposed tracker. Our tracker has comparable performance with other state-of-the-art trackers while running at ∼49 FPS on a single CPU.

1. Introduction

Visual object tracking is a popular but challenging task in the domain of multimedia and computer vision. Given a video sequence, the task is to precisely estimate the position of the target of interest. With the popularity of unmanned aerial vehicles (UAVs), visual tracking on UAV platforms has attracted extensive attention, e.g., for public security [1], disaster investigation [2], and remote sensor mounting [3]. Although this technique has made impressive progress, its performance remains unsatisfactory when the target undergoes UAV-specific tracking challenges, such as viewpoint change, fast motion, and low resolution.
There are two main streams in the visual tracking community: DCF-based trackers [4,5,6,7] and Siamese-based trackers [8]. Although Siamese-based trackers achieve impressive tracking performance using one or more GPUs, the heavy computation of the deep network inevitably brings large energy consumption to mobile platforms such as UAVs. DCF-based trackers [9,10] are among the most suitable choices for resource-limited UAV platforms because of their balanced accuracy and speed as well as low cost. However, training on synthetic samples generated by circular shifts [11] inevitably impedes the discriminative power of the filter, i.e., the boundary effect. In the literature, many attempts have been made to enhance filter discrimination between the target and its environment, such as fixed or adaptive spatial regularization [12,13,14,15] and context learning [16,17,18]. Context learning methods [16,17,18] suppress the responses of context patches in multiple directions to zero, thus achieving effective performance improvements. Nevertheless, multiple context patches may introduce irrelevant background noise, resulting in a suboptimal filter. Moreover, the feature extraction of these patches, especially when using deep features [18], hinders the real-time ability of trackers with context learning.
Moreover, traditional DCF-based trackers model the filter by virtue of historical information, e.g., accumulated training samples [4,11,12,13,14,15], previously generated filters [5,19,20], or response maps [21,22,23]. Although DCF-based trackers benefit from such previous clues, this paradigm may fail to deal well with complex and changeable UAV tracking challenges, such as fast motion and viewpoint change. In these cases, both the volatile environment and the target appearance changes bring about severe uncertainties. It has been shown that information from the future frame [20] plays a vital role in improving the adaptability of the tracker. In [20], the object observation in the next frame is predicted by exploring spatial-temporal similarities of the target change in consecutive frames. It is then integrated with historical samples to form a more robust object model. However, this similarity assumption is not always valid for UAV object tracking because of the complex and changeable nature of UAV tracking scenarios.
With respect to the above concerns, we propose a two-stage correlation filter tracker that can efficiently exploit the contextual information of the upcoming frame. This purpose is achieved through two successive future-aware stages, i.e., future state awareness and future context awareness. The former stage predicts the spatial location change of the target in the upcoming frame, and the latter suppresses distractions caused by the future complex background while enhancing the discriminative power of the filter. In the first stage, when a new frame arrives, the simple yet effective single exponential smoothing forecast method [24] is used to predict a coarse target position. In the latter stage, we employ an efficient mask generation method to segment a single context patch based on the coarse position. Then, the segmented contextual information is incorporated into the training phase for discrimination improvement. Lastly, the more powerful filter rectifies the prediction error of the first stage. We perform comprehensive experiments on three challenging UAV benchmarks, i.e., DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27]. The results confirm that the proposed tracker is superior in terms of accuracy and speed compared with 29 other state-of-the-art trackers. Figure 1 shows the overall performance of all trackers on the DTB70 [25] benchmark. Clearly, our tracker has comparable performance against other trackers while maintaining real-time speed on a single CPU, which demonstrates that the FACF tracker is suitable for real-time UAV applications.
Figure 1. Overall performance based on area under curve (AUC) and distance precision (DP) of the proposed FACF tracker and 29 other state-of-the-art trackers on the DTB70 [25] benchmark. AUC and DP are two metrics for evaluating tracking accuracy; their detailed explanation is given in Section 5. The legend provides the detailed speed value of each tracker. The superscript * indicates a GPU-based tracker.
The main contributions are summarized as follows:
  • A coarse-to-fine DCF-based tracking framework is proposed to exploit the context information hidden in the frame that is to be detected;
  • Single exponential smoothing forecast is used to provide a coarse position, which is the reference for acquiring a context patch;
  • We obtain a single future-aware context patch through an efficient target-aware mask generation method without additional feature extraction;
  • Experimental results on three UAV benchmarks verify the advancement of the proposed tracker. Our tracker can maintain real-time speed in real-world tracking scenarios.
The remainder of this paper is organized as follows: Section 2 generalizes the most relevant works; Section 3 introduces the baseline tracker; Section 4 details the proposed method; Section 5 exhibits extensive and comprehensive experiments; and Section 6 provides a brief summary of this work.

3. Revisit BACF

In this section, we review the BACF [13] tracker, which is the baseline in this work. To mitigate the problem of boundary effects, the background-aware correlation filter (BACF) [13] enlarges the search region and introduces a binary cropping matrix $\mathbf{B} \in \mathbb{R}^{M \times N}$ ($M \ll N$) to crop more complete negative samples, which largely improves tracking performance. Given the vectorized desired response $\mathbf{y} \in \mathbb{R}^{N \times 1}$ and the vectorized training sample of each channel $\mathbf{x}^{d} \in \mathbb{R}^{N \times 1}$, the filter $\mathbf{w}$ in the current frame $f$ can be obtained by minimizing the following objective:
$$
\epsilon(\mathbf{w}_f) = \frac{1}{2}\Big\|\sum_{d=1}^{D}\mathbf{B}\mathbf{x}_f^{d}\circledast\mathbf{w}_f^{d}-\mathbf{y}\Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{w}_f^{d}\big\|_2^2, \tag{1}
$$
where $\lambda$ is the filter regularization parameter, $D$ is the total number of channels, and $\circledast$ represents the correlation operator.
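For intuition, the following minimal NumPy sketch (our own illustration, not the authors' code) builds the cropping operator $\mathbf{B}$ as an explicit binary matrix that keeps the central $M$ samples of an $N$-length signal; practical implementations apply $\mathbf{B}$ implicitly by slicing.

```python
import numpy as np

def crop_operator(N, M):
    """Binary cropping matrix B (M x N) that keeps the central M samples of an
    N-length signal, mirroring how BACF extracts real negative samples from an
    enlarged search region. Illustrative only; real code slices instead."""
    B = np.zeros((M, N))
    start = (N - M) // 2
    B[np.arange(M), start + np.arange(M)] = 1.0
    return B

# Example: crop the central 4 samples of a length-10 "search region".
x = np.arange(10, dtype=float)
print(crop_operator(N=10, M=4) @ x)   # -> [3. 4. 5. 6.]
```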
Unfortunately, a large number of real negative samples inevitably introduce background noise, which results in insufficient discrimination of the tracker, especially when distractors appear in the future frame.

4. Proposed Approach

In this section, we first carefully analyze the existing problems. Then, we introduce the proposed method, including two stages: future state awareness and future context awareness. Lastly, the complete tracking procedure is detailed.

4.1. Problem Formulation

In the tracking process of DCF-based trackers, the filter learned in the current frame is used to localize the object in the upcoming frame. Variations of the tracked object and its surroundings in the new frame are unforeseeable. The interference of surroundings, along with the boundary effects, may result in tracking drift. Many trackers [16,17,18,36] exploit current context patches surrounding the tracked object to enhance the discrimination power. However, this strategy cannot ensure robustness when facing unpredictable background variations in the new frame.
Usually, DCF-based trackers use a padded object patch, which contains certain context information, and its corresponding Gaussian label for filter training. The response within the target region is expected to have a Gaussian shape, while the response in the background region should tend to zero. Trackers with context learning [16,18] obtain context information by cropping several context patches around the target and then exploit context regularization terms to restrain the responses of the context patches to zero. As shown in the top part of Figure 2, the context patches (blue dotted boxes) are the same size as the object patch (yellow dotted box). Therefore, these context patches may include the target of interest, which contradicts the regression term. Moreover, this strategy brings a heavy calculation burden and redundancy. On the one hand, the context patches need to be cropped and their features extracted separately. On the other hand, the context patches contain a large proportion of overlapping regions, which is inefficient.
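For reference, the sketch below (a simplified illustration of standard DCF practice, not the authors' code; the choice of sigma is an assumption) generates the 2-D Gaussian regression label described above.

```python
import numpy as np

def gaussian_label(height, width, sigma=2.5):
    """2-D Gaussian regression label with its peak at the patch centre: the
    response is expected to be ~1 on the target and to decay towards zero in
    the background. sigma is usually tied to the target size (assumed here)."""
    yy, xx = np.meshgrid(np.arange(height) - height // 2,
                         np.arange(width) - width // 2, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

y = gaussian_label(50, 50)   # label for a 50 x 50 feature grid
```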
Figure 2. Different strategies of obtaining context patch. Top: traditional method. Bottom: our efficient method.
With respect to these concerns, we utilize context information in the upcoming frame for filter training, which aims to cope with unpredictable background variations. In addition, designing an efficient method to obtain the context patch is another goal of this work. Inspired by two-stage detectors [45,46], we propose a coarse-to-fine search strategy to improve localization precision. The pipeline of the proposed tracker is shown in Figure 3. Specifically, a preliminary prediction of the object location is performed in order to precisely segment contextual pixels in the new frame. Then, the future-aware filter trained with the future context information corrects the prediction bias in order to obtain the final prediction.
Figure 3. Tracking process of the proposed FACF tracker, which consists of two stages: future state awareness and future context awareness. Future state awareness: when a new frame arrives, the single exponential smoothing forecast is used to obtain a coarse target position. Future context awareness: we extract feature maps of the predicted region around the coarse position. Next, the feature maps are multiplied by the context mask to obtain the features of the single context patch, which are then fed into the filter training phase. Finally, the resulting filter performs target localization on the feature maps of the predicted region.

4.2. Stage One: Future State Awareness

From another perspective, the task of visual tracking is to predict the target displacement in subsequent frames. Normally, the motion state of the target within a certain time interval is approximately unchanged. Based on this assumption, we use a simple and effective time series forecasting method, i.e., single exponential smoothing forecast (SESF) [24], to roughly estimate the displacement of the target in the new frame. Assuming that the true displacement vector (estimated by the filter) $\Delta_f^{t} = [\Delta x_f^{t}, \Delta y_f^{t}]$ of the $f$-th frame is given, the predicted displacement (estimated by SESF) $\Delta_{f+1}^{p}$ in the $(f+1)$-th frame can be obtained by the following formula:
$$
\Delta_{f+1}^{p} = \alpha\,\Delta_{f}^{t} + (1-\alpha)\,\Delta_{f}^{p}, \tag{2}
$$
where $\Delta_f^{p}$ is the predicted displacement vector in the $f$-th frame, $[\Delta x, \Delta y]$ represents the displacement deviation in the $x$ and $y$ directions, and $\alpha$ is the smoothing index. So far, we can obtain the initial prediction $P_{f+1}^{p}$ of the target position in the next frame:
$$
P_{f+1}^{p} = P_{f}^{t} + \Delta_{f+1}^{p}, \tag{3}
$$
where $P_f^{t}$ denotes the position estimated by the filter in the $f$-th frame. As shown in Figure 4, we provide some examples to verify the effectiveness of the single exponential smoothing forecast for the initial prediction. We compare the center location error (CLE) of the initial prediction by the SESF module and the final prediction by our filter. In some cases, the initial prediction is even more accurate than the final prediction.
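A minimal Python sketch of this forecasting step is given below; it follows Equations (2) and (3) with the paper's smoothing index $\alpha = 0.88$, while the class and variable names are ours rather than those of the released MATLAB code.

```python
import numpy as np

class SESFPredictor:
    """Single exponential smoothing forecast (SESF) for stage one."""

    def __init__(self, alpha=0.88):
        self.alpha = alpha
        self.pred_disp = np.zeros(2)   # Delta_f^p: previous predicted displacement

    def predict(self, last_pos, true_disp):
        """last_pos: P_f^t estimated by the filter; true_disp: Delta_f^t."""
        # Eq. (2): blend the observed displacement with the previous prediction.
        self.pred_disp = self.alpha * np.asarray(true_disp, float) \
                         + (1.0 - self.alpha) * self.pred_disp
        # Eq. (3): coarse target position for the upcoming frame.
        return np.asarray(last_pos, float) + self.pred_disp

sesf = SESFPredictor()
coarse_pos = sesf.predict(last_pos=[120.0, 80.0], true_disp=[4.0, -2.0])
```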
Figure 4. Center location error (CLE) comparison of the single exponential smoothing forecast method (SESF) and the proposed tracker (FACF) on three challenging sequences. From top to bottom are (a) Car5 from DTB70 [25], (b) wakeboard7 from UAV123_10fps [26], and (c) air conditioning box1 from UAVTrack112 [27].
Next, we discuss how the future context information is used and how the final position $P_{f+1}^{t}$ is obtained on the basis of the single exponential smoothing forecast.

4.3. Stage Two: Future Context Awareness

4.3.1. Fast Context Acquisition

Usually, previous context learning methods [16,17,18] are limited by the tedious feature extraction of context patches in multiple directions, which also increases the computational complexity of filter training. Moreover, context information outside of the search region may introduce unnecessary information into the model. Furthermore, previous methods use the current context information for discrimination enhancement, which cannot deal with unpredictable changes in the new frame, such as the appearance of similar targets or a sudden viewpoint change.
While most trackers [14,47] strive to improve focus on the target through mask generation methods, we take an alternative approach. As shown in the bottom parts of Figure 2 and Figure 3, when a new frame arrives, we first obtain the coarse object patch centered at the initial prediction $P_{f+1}^{p}$. Then, we use an efficient and effective mask generation method [47] to acquire the mask $\mathbf{m}$ of the target. Finally, the context-aware mask $(1-\mathbf{m})$ is used to segment a single context patch from the coarse object patch. In practice, for efficiency, we directly segment the features of the context patch after obtaining the features of the predicted patch. The coarse object patch is regarded as the new search region to correlate with the filter, which yields the final prediction.
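The sketch below illustrates this pixel-level segmentation of the context features, assuming the feature maps and a target-aware mask $\mathbf{m}$ (generated following [47]) are already available; the function and array names are illustrative, not the authors' API.

```python
import numpy as np

def context_features(search_feat, target_mask):
    """Segment the single future context patch at the pixel level:
    x_c = x * (1 - m), with no additional feature extraction.
    search_feat: (H, W, D) features of the predicted region.
    target_mask: (H, W) target-aware mask m in [0, 1]."""
    return search_feat * (1.0 - target_mask[..., None])   # keep surroundings, suppress target

H, W, D = 50, 50, 31
feat = np.random.rand(H, W, D)
mask = np.zeros((H, W)); mask[20:30, 20:30] = 1.0   # hypothetical target support
x_c = context_features(feat, mask)
```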

4.3.2. Filter Training

Based on BACF [13], we incorporate future context information at the pixel level into the training phase. The objective function of the proposed tracker is expressed as follows:
$$
\epsilon(\mathbf{w}_f) = \frac{1}{2}\Big\|\sum_{d=1}^{D}\mathbf{B}\mathbf{x}_f^{d}\circledast\mathbf{w}_f^{d}-\mathbf{y}\Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{w}_f^{d}\big\|_2^2 + \frac{\gamma}{2}\Big\|\sum_{d=1}^{D}\mathbf{B}\mathbf{x}_{c,f+1}^{d}\circledast\mathbf{w}_f^{d}\Big\|_2^2, \tag{4}
$$
where $\mathbf{x}_{c,f+1} = \mathbf{x}_{f+1}^{p}\odot(1-\mathbf{m})$ represents the surrounding context of the sought object in the upcoming frame $(f+1)$, $\odot$ is the element-wise (dot) product operator, and $\gamma$ is the context regularization parameter.
Denoting the auxiliary variable $\mathbf{h}^{d} = \mathbf{B}^{\top}\mathbf{w}^{d} \in \mathbb{R}^{N\times 1}$, Equation (4) can be rewritten as follows:
$$
\epsilon(\mathbf{w}_f,\mathbf{h}_f) = \frac{1}{2}\Big\|\sum_{d=1}^{D}\mathbf{x}_f^{d}\circledast\mathbf{h}_f^{d}-\mathbf{y}\Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{w}_f^{d}\big\|_2^2 + \frac{\gamma}{2}\Big\|\sum_{d=1}^{D}\mathbf{x}_{c,f+1}^{d}\circledast\mathbf{h}_f^{d}\Big\|_2^2. \tag{5}
$$
After converting Equation (5) into the Fourier domain, the augmented Lagrangian form of Equation (5) is expressed as follows:
$$
\begin{aligned}
\epsilon(\mathbf{w}_f,\hat{\mathbf{h}}_f,\hat{\boldsymbol{\zeta}}_f) ={}& \frac{1}{2}\Big\|\sum_{d=1}^{D}\hat{\mathbf{x}}_f^{d}\odot\hat{\mathbf{h}}_f^{d}-\hat{\mathbf{y}}\Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{w}_f^{d}\big\|_2^2 + \frac{\gamma}{2}\Big\|\sum_{d=1}^{D}\hat{\mathbf{x}}_{c,f+1}^{d}\odot\hat{\mathbf{h}}_f^{d}\Big\|_2^2 \\
&+ \sum_{d=1}^{D}\big(\hat{\mathbf{h}}_f^{d}-N\mathbf{F}\mathbf{B}^{\top}\mathbf{w}_f^{d}\big)^{\top}\hat{\boldsymbol{\zeta}}_f^{d} + \frac{\mu}{2}\sum_{d=1}^{D}\big\|\hat{\mathbf{h}}_f^{d}-N\mathbf{F}\mathbf{B}^{\top}\mathbf{w}_f^{d}\big\|_2^2, \tag{6}
\end{aligned}
$$
where $\hat{\cdot}$ denotes the Discrete Fourier Transform (DFT) and $\mathbf{F}$ is the DFT matrix. $\hat{\boldsymbol{\zeta}}_f = [\hat{\boldsymbol{\zeta}}_f^{1\top},\hat{\boldsymbol{\zeta}}_f^{2\top},\ldots,\hat{\boldsymbol{\zeta}}_f^{D\top}]^{\top} \in \mathbb{R}^{ND\times 1}$ and $\mu$ are the Lagrangian vector and a penalty factor, respectively.
Then, the ADMM [48] algorithm is adopted to optimize Equation (6) by alternately solving the following three subproblems. Each subproblem has its own closed-form solution.
Subproblem $\mathbf{w}_f$:
$$
\mathbf{w}_f^{*} = \arg\min_{\mathbf{w}_f}\bigg\{\frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{w}_f^{d}\big\|_2^2 + \sum_{d=1}^{D}\big(\hat{\mathbf{h}}_f^{d}-N\mathbf{F}\mathbf{B}^{\top}\mathbf{w}_f^{d}\big)^{\top}\hat{\boldsymbol{\zeta}}_f^{d} + \frac{\mu}{2}\sum_{d=1}^{D}\big\|\hat{\mathbf{h}}_f^{d}-N\mathbf{F}\mathbf{B}^{\top}\mathbf{w}_f^{d}\big\|_2^2\bigg\}. \tag{7}
$$
The solution of Equation (7) can be obtained in the spatial domain as follows:
$$
\mathbf{w}_f^{*} = \frac{\boldsymbol{\zeta}_f + \mu\,\mathbf{h}_f}{\lambda/N + \mu}, \tag{8}
$$
where $\boldsymbol{\zeta}_f$ and $\mathbf{h}_f$ are obtained by the inverse Fourier transform, i.e., $\boldsymbol{\zeta}_f = \frac{1}{N}\mathbf{B}\mathbf{F}^{\top}\hat{\boldsymbol{\zeta}}_f$ and $\mathbf{h}_f = \frac{1}{N}\mathbf{B}\mathbf{F}^{\top}\hat{\mathbf{h}}_f$.
Subproblem $\hat{\mathbf{h}}_f$:
$$
\hat{\mathbf{h}}_f^{*} = \arg\min_{\hat{\mathbf{h}}_f}\bigg\{\frac{1}{2}\Big\|\sum_{d=1}^{D}\hat{\mathbf{x}}_f^{d}\odot\hat{\mathbf{h}}_f^{d}-\hat{\mathbf{y}}\Big\|_2^2 + \frac{\gamma}{2}\Big\|\sum_{d=1}^{D}\hat{\mathbf{x}}_{c,f+1}^{d}\odot\hat{\mathbf{h}}_f^{d}\Big\|_2^2 + \sum_{d=1}^{D}\big(\hat{\mathbf{h}}_f^{d}-N\mathbf{F}\mathbf{B}^{\top}\mathbf{w}_f^{d}\big)^{\top}\hat{\boldsymbol{\zeta}}_f^{d} + \frac{\mu}{2}\sum_{d=1}^{D}\big\|\hat{\mathbf{h}}_f^{d}-N\mathbf{F}\mathbf{B}^{\top}\mathbf{w}_f^{d}\big\|_2^2\bigg\}. \tag{9}
$$
Since Equation (9) involves element-wise products, we process the pixels at the same spatial location together, i.e., each of the $N$ pixel-wise vectors stacks the values of all $D$ channels at that location, in order to reduce the high computational complexity. Equation (9) can be reformulated as follows:
$$
\hat{\mathbf{h}}_f(n)^{*} = \arg\min_{\hat{\mathbf{h}}_f(n)}\bigg\{\frac{1}{2}\big\|\hat{\mathbf{x}}_f(n)^{\top}\hat{\mathbf{h}}_f(n)-\hat{y}(n)\big\|_2^2 + \frac{\gamma}{2}\big\|\hat{\mathbf{x}}_{c,f+1}(n)^{\top}\hat{\mathbf{h}}_f(n)\big\|_2^2 + \big(\hat{\mathbf{h}}_f(n)-\hat{\mathbf{w}}_f(n)\big)^{\top}\hat{\boldsymbol{\zeta}}_f(n) + \frac{\mu}{2}\big\|\hat{\mathbf{h}}_f(n)-\hat{\mathbf{w}}_f(n)\big\|_2^2\bigg\}. \tag{10}
$$
Taking the derivative of Equation (10) with respect to $\hat{\mathbf{h}}_f(n)$ and setting the result equal to zero, we obtain the following:
$$
\hat{\mathbf{h}}_f(n)^{*} = \Big(\hat{\mathbf{x}}_f(n)\hat{\mathbf{x}}_f(n)^{\top} + \gamma\,\hat{\mathbf{x}}_{c,f+1}(n)\hat{\mathbf{x}}_{c,f+1}(n)^{\top} + \mu N\mathbf{I}_D\Big)^{-1}\Big(\hat{\mathbf{x}}_f(n)\hat{y}(n) - N\hat{\boldsymbol{\zeta}}_f(n) + \mu N\hat{\mathbf{w}}_f(n)\Big). \tag{11}
$$
Equation (11) involves a matrix inversion, which is computationally heavy. Writing $\hat{\mathbf{x}}_f(n)\hat{\mathbf{x}}_f(n)^{\top} + \gamma\,\hat{\mathbf{x}}_{c,f+1}(n)\hat{\mathbf{x}}_{c,f+1}(n)^{\top} = \sum_{a=0}^{1}S_a\hat{\mathbf{x}}_a(n)\hat{\mathbf{x}}_a(n)^{\top}$ (where $S_0=1$, $S_1=\gamma$, $\hat{\mathbf{x}}_0(n)=\hat{\mathbf{x}}_f(n)$, and $\hat{\mathbf{x}}_1(n)=\hat{\mathbf{x}}_{c,f+1}(n)$), the Sherman–Morrison formula [49] can be applied to accelerate the computation. For convenience, we denote $b = \mu N + \sum_{a=0}^{1}S_a\hat{\mathbf{x}}_a(n)^{\top}\hat{\mathbf{x}}_a(n)$ and $\hat{s}_p(n) = \sum_{a=0}^{1}S_a\hat{\mathbf{x}}_a(n)^{\top}\hat{\mathbf{x}}_f(n)$. Then, Equation (11) can be converted into the following:
$$
\hat{\mathbf{h}}_f(n)^{*} = \frac{1}{\mu N}\Big(\hat{\mathbf{x}}_f(n)\hat{y}(n) - N\hat{\boldsymbol{\zeta}}_f(n) + \mu N\hat{\mathbf{w}}_f(n)\Big) - \frac{\sum_{a=0}^{1}S_a\hat{\mathbf{x}}_a(n)}{\mu b}\Big(\frac{1}{N}\hat{s}_p(n)\hat{y}(n) - \sum_{a=0}^{1}\hat{\mathbf{x}}_a(n)^{\top}\hat{\boldsymbol{\zeta}}_f(n) + \mu\sum_{a=0}^{1}\hat{\mathbf{x}}_a(n)^{\top}\hat{\mathbf{w}}_f(n)\Big). \tag{12}
$$
The Lagrangian multiplier $\hat{\boldsymbol{\zeta}}$ is updated as follows:
$$
\hat{\boldsymbol{\zeta}}^{(i+1)} = \hat{\boldsymbol{\zeta}}^{(i)} + \mu\big(\hat{\mathbf{h}}^{(i)} - \hat{\mathbf{w}}^{(i)}\big), \tag{13}
$$
where $(i)$ and $(i+1)$ represent the $i$-th and $(i+1)$-th iterations, respectively. The penalty factor $\mu$ is updated as $\mu^{(i+1)} = \min(\mu_{\max}, \delta\mu^{(i)})$.
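For clarity, the sketch below writes the per-pixel update of Equation (11) with an explicit $D\times D$ solve and pairs it with the multiplier and penalty updates of Equation (13); the paper instead uses the Sherman–Morrison form of Equation (12) for speed. The conjugate transpose, variable names, and the placeholder values of $\mu_{\max}$ and $\delta$ are our assumptions.

```python
import numpy as np

def admm_h_update(x_hat, xc_hat, y_hat, w_hat, zeta_hat, mu, gamma, N):
    """Per-pixel filter update of Eq. (11). All arrays are complex Fourier-domain
    quantities of shape (P, D): P pixels, D channels. Unaccelerated form for
    readability; the paper applies Sherman-Morrison (Eq. (12)) instead."""
    P, D = x_hat.shape
    h_hat = np.zeros_like(x_hat)
    for n in range(P):
        x = x_hat[n][:, None]                      # (D, 1)
        xc = xc_hat[n][:, None]
        A = x @ x.conj().T + gamma * (xc @ xc.conj().T) + mu * N * np.eye(D)
        rhs = x[:, 0] * y_hat[n] - N * zeta_hat[n] + mu * N * w_hat[n]
        h_hat[n] = np.linalg.solve(A, rhs)
    return h_hat

def admm_dual_update(zeta_hat, h_hat, w_hat, mu, mu_max=1e4, delta=10.0):
    """Eq. (13) plus the penalty schedule mu^(i+1) = min(mu_max, delta * mu^(i))."""
    zeta_hat = zeta_hat + mu * (h_hat - w_hat)
    return zeta_hat, min(mu_max, delta * mu)
```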

4.3.3. Object Detection

The final object position is estimated through the peak of the generated response map $\mathbf{R}$. Given the predicted patch $\hat{\mathbf{x}}_{f+1}^{p}$ and the trained filter $\hat{\mathbf{h}}_f$, the response map in frame $f+1$ is obtained as follows:
$$
\mathbf{R}_{f+1} = \mathcal{F}^{-1}\Big(\sum_{d=1}^{D}\hat{\mathbf{x}}_{f+1}^{p,d}\odot\hat{\mathbf{h}}_f^{d}\Big), \tag{14}
$$
where $\mathcal{F}^{-1}$ represents the inverse Fourier transform (IFT). The key difference between our tracker and previous trackers is that the filter already contains future context information, resulting in greater robustness to uncertain environment changes.
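A minimal sketch of this detection step is shown below; whether the filter is conjugated before the product is an implementation detail we leave out, and the random inputs are only there to make the example self-contained.

```python
import numpy as np

def response_map(x_hat_pred, h_hat):
    """Eq. (14): sum the per-channel element-wise products in the Fourier
    domain and apply the inverse FFT. Inputs: complex arrays of shape (H, W, D)."""
    fused = np.sum(x_hat_pred * h_hat, axis=-1)        # sum over the D channels
    return np.real(np.fft.ifft2(fused))

H, W, D = 50, 50, 31
Xp = np.fft.fft2(np.random.rand(H, W, D), axes=(0, 1))    # predicted patch features
Hf = np.fft.fft2(np.random.rand(H, W, D), axes=(0, 1))    # trained filter
R = response_map(Xp, Hf)
dy, dx = np.unravel_index(np.argmax(R), R.shape)           # peak -> target position offset
```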

4.3.4. Model Update

The object appearance model is updated frame by frame using a linear weighted combination:
$$
\hat{\mathbf{x}}_f^{M} = (1-\beta)\,\hat{\mathbf{x}}_{f-1}^{M} + \beta\,\hat{\mathbf{x}}_f^{o}, \tag{15}
$$
where $\hat{\mathbf{x}}^{M}$ represents the object appearance model, $\hat{\mathbf{x}}_f^{o}$ is the training sample of the current frame, and $\beta$ is the online learning rate.
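The update reduces to a single line in code; the sketch below uses the learning rate reported in Section 5.1.1.

```python
def update_model(model_hat, sample_hat, beta=0.019):
    """Linear weighted model update of Eq. (15)."""
    return (1.0 - beta) * model_hat + beta * sample_hat
```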

4.4. Tracking Procedure

In this work, we train the scale filter [6] to estimate the scale variation. The complete tracking procedure of our tracker is shown in Algorithm 1.
Algorithm 1: FACF Tracker
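Since the algorithm figure is not reproduced here, the following high-level sketch summarizes the tracking loop as we read it from Sections 4.2, 4.3, and 4.4. The helper functions (extract_features, generate_mask, train_filter, detect, update_scale) are placeholders for the corresponding steps, not the authors' API, and SESFPredictor/update_model refer to the sketches above.

```python
import numpy as np

def facf_track(frames, init_pos):
    """High-level FACF loop: stage one (SESF) gives a coarse position, stage two
    trains the filter with the future context of that region, and detection on
    the same region corrects the coarse prediction."""
    sesf = SESFPredictor()
    pos = prev_pos = np.asarray(init_pos, float)
    feat = extract_features(frames[0], pos)
    model = feat
    filt = train_filter(model, context=None)               # first frame: no future context yet
    for frame in frames[1:]:
        # Stage one: future state awareness.
        coarse_pos = sesf.predict(last_pos=pos, true_disp=pos - prev_pos)
        # Stage two: future context awareness.
        feat = extract_features(frame, coarse_pos)          # features of the predicted region
        mask = generate_mask(feat)                          # target-aware mask m, following [47]
        context = feat * (1.0 - mask[..., None])            # single future context patch
        filt = train_filter(model, context)                 # Eq. (4), solved via ADMM
        prev_pos, pos = pos, detect(filt, feat, coarse_pos) # Eq. (14) corrects the coarse guess
        model = update_model(model, feat)                   # Eq. (15)
        update_scale(frame, pos)                            # DSST-style scale filter [6]
        yield pos
```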

5. Experiments

In this section, we perform extensive and comprehensive experiments on three challenging UAV benchmarks. First, we introduce the detailed experimental settings, including parameters, benchmarks, metrics, and platform. Next, we compare our tracker with 29 other state-of-the-art trackers with handcrafted or deep features. Then, we verify the rationality of the parameters and the effectiveness of each component. Afterward, different context learning strategies are analyzed. The last subsection provides some failure cases of the proposed tracker.

5.1. Implementation Details

5.1.1. Parameters

Our tracker uses HOG, CN, and grayscale features for object representation. We use two ADMM iterations to train the filter. The learning rate $\beta$ of the model update is set to 0.019, and the context regularization parameter is chosen as $\gamma = 0.009$. The smoothing index $\alpha$ is set to 0.88. Throughout the experiments, the parameters of the proposed tracker remain unchanged. The other trackers used for comparison retain their initial parameter settings. The code of our tracker is available at https://github.com/FreeZhang96/FACF, accessed on 10 September 2021.
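For convenience, the hyperparameters above can be summarized as follows; the dictionary layout is ours, not the structure of the released code.

```python
FACF_PARAMS = {
    "features": ["HOG", "CN", "grayscale"],
    "admm_iterations": 2,
    "learning_rate_beta": 0.019,
    "context_reg_gamma": 0.009,
    "smoothing_index_alpha": 0.88,
}
```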

5.1.2. Benchmarks

Experiments are conducted on three well-known UAV benchmarks: DTB70 [25], UAV123_10fps [26], and the recently built UAVTrack112 [27]. These benchmarks contain 305 video sequences in total, all captured from UAV platforms.

5.1.3. Metrics

We use the one-pass evaluation (OPE) protocol to test all trackers. Evaluations of tracking accuracy are based on IoU and CLE. IoU (Intersection over Union) is the ratio of the overlap area to the union area of the predicted and ground-truth bounding boxes. CLE (Center Location Error) denotes the location error (in pixels) between the predicted center location and the true location. A frame is deemed successfully tracked when IoU exceeds a given overlap threshold or CLE falls below a given distance threshold. By varying the thresholds (IoU over $[0,1]$ and CLE over $[0,50]$ pixels), we obtain the success plots and precision plots. The area under the success-plot curve is denoted as AUC. DP (Distance Precision) represents the score in the precision plots at CLE = 20 pixels. FPS (Frames Per Second) is used to measure the speed of each tracker.
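The sketch below (our own simplification of the OPE protocol, assuming [x, y, w, h] boxes and 101 IoU thresholds) shows how IoU, CLE, and the resulting AUC and DP scores are computed for one sequence.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two [x, y, w, h] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cle(a, b):
    """Center location error (pixels) between two [x, y, w, h] boxes."""
    return float(np.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                          (a[1] + a[3] / 2) - (b[1] + b[3] / 2)))

def auc_and_dp(pred_boxes, gt_boxes):
    """AUC: mean success rate over IoU thresholds in [0, 1].
    DP: fraction of frames with CLE <= 20 pixels."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    cles = np.array([cle(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    thresholds = np.linspace(0.0, 1.0, 101)
    return float(np.mean([(ious > t).mean() for t in thresholds])), float((cles <= 20).mean())
```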

5.1.4. Platform

All experiments are carried out using MATLAB R2019b. The experimental platform is a PC with an Intel(R) Core(TM) i7-9750H CPU (2.60 GHz), 32 GB RAM, and a single NVIDIA RTX 2060 GPU.

5.2. Performance Comparison

5.2.1. Comparison with Handcrafted Feature-Based Trackers

In this part, we comprehensively compare our tracker with 16 other state-of-the-art handcrafted feature-based trackers, i.e., STRCF [5], SAMF [28], KCF [11], DSST [6], ECO_HC [7], Staple [50], KCC [29], SAMF_CA [16], ARCF [21], AutoTrack [23], SRDCF [12], MCCT_H [37], CSRDCF [14], BACF [13], Staple_CA [16], and SRDCFdecon [30].
Overall Evaluation. Precision and success plots of our tracker and other trackers on all three benchmarks are presented in Figure 5.
Figure 5. Overall performance comparison of the proposed FACF tracker and 16 other state-of-the-art handcrafted feature-based trackers on (a) DTB70 [25], (b) UAV123_10fps [26], and (c) UAVTrack112 [27]. First row: precision plots. Second row: success plots.
The DTB70 [25] benchmark contains 50 video sequences annotated with 12 attributes. Our tracker achieves the best AUC and DP scores, namely 0.496 and 0.727, respectively, surpassing the second-best tracker, AutoTrack [23], by 1.8% and 1.1%.
UAV123_10fps [26] is a large benchmark composed of 123 challenging UAV video sequences. We report the precision and success plots in Figure 5b. The proposed FACF tracker outperforms the other trackers in terms of both AUC and DP scores.
UAVTrack112 [27] is a newly built benchmark collected with a DJI Mavic Air 2. It contains 112 sequences annotated with 13 attributes. As shown in Figure 5c, our tracker performs best, with AUC and DP scores of 0.478 and 0.709, respectively.
Table 1 presents the average AUC, DP, and speed comparison between our tracker and the other handcrafted feature-based trackers on the three benchmarks. In terms of AUC and DP, our FACF tracker performs best (0.486 and 0.707), followed by AutoTrack [23] (0.473 and 0.694) and ARCF [21] (0.467 and 0.677). The average speed of our FACF tracker reaches 48.816 FPS, which is sufficient for real-time applications.
Table 1. Average performance of all trackers on the DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27] benchmarks. Red, green, and blue represent the top three trackers in terms of DP, AUC, and FPS, respectively.
Attribute-oriented Evaluation. To verify the performance of the proposed tracker in UAV-specific scenarios, this part provides an extensive attribute-based analysis following the attribute categorization in [10]. The merged attributes for all benchmarks include VC (camera motion and viewpoint change), FM (fast motion), LR (low resolution), OCC (partial occlusion and full occlusion), and IV (illumination variation). For each merged attribute, we take the average AUC/DP score of all related original attributes in all benchmarks as the final score. For example, for the DP score of VC, the five scores of all related attributes (camera motion and viewpoint change in both UAV123_10fps [26] and UAVTrack112 [27], and fast camera motion in DTB70 [25]) are averaged to obtain the desired result. Table 2 exhibits the average performance of the 17 trackers under these specific attributes. Our tracker achieves the best AUC and DP scores under the VC, FM, and LR attributes.
Table 2. AUC and DP scores of all trackers under UAV-specific attributes, including VC, FM, LR, OCC, and IV. Red, green, and blue denote the top three results.
Figure 6 provides detailed success plots of representative attribute-based analyses on the different benchmarks. In terms of camera motion, fast motion, and low resolution, our tracker is in a leading position, surpassing the second-place tracker by a large margin. As shown in Figure 7, we compare the tracking results of the proposed tracker with five other state-of-the-art trackers on 10 challenging video sequences. The compared trackers are STRCF [5], BACF [13], ECO_HC [7], AutoTrack [23], and ARCF [21]. In these UAV-specific scenarios (including VC, LR, and FM), the proposed tracker achieves robust tracking while the other trackers fail.
Figure 6. Attribute-based analysis of the proposed tracker and the other 16 state-of-the-art handcrafted feature-based trackers on DTB70 [25], UAV123_10fps [26] and UAVTrack112 [27].
Figure 7. Visualization of the tracking results between the proposed tracker and 5 other state-of-the-art trackers on 10 challenging sequences. From left to right and from top to bottom are Car2, ChasingDrones, Gull1 and Snowboarding from DTB70 [25]; car13, uav3 and wakeboard from UAV123_10fps [26]; and courier1, electric box and uav1 from UAVTrack112 [27].

5.2.2. Comparison with Deep-based Trackers

Thirteen state-of-the-art deep feature-based trackers, i.e., LUDT [51], LUDT+ [51], fECO [52], fDeepSTRCF [52], TADT [53], CoKCF [54], UDT [55], CF2 [56], UDT+ [55], ECO [7], DeepSTRCF [5], and KAOT [18], are used for comparison. The overall performance of all trackers on the DTB70 benchmark is shown in Table 3. Our tracker has comparable performance with respect to the deep-based trackers. In particular, the AUC and DP scores (0.496 and 0.727) of the proposed tracker rank third and second, respectively. Meanwhile, our FACF tracker achieves real-time speed on a single CPU, while the other deep-based trackers rely on a GPU for acceleration.
Table 3. Performance comparison of the proposed tracker and 13 other deep-based trackers on the DTB70 [25] benchmark. Red, green, and blue represent the top three trackers in terms of DP, AUC, and FPS, respectively. The superscript * denotes GPU speed.

5.3. Parameter Analysis and Ablation Study

5.3.1. The Impact of Key Parameter

To investigate the impact of the key parameters on performance, we perform extensive experiments on the DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27] benchmarks. As shown in Figure 8 and Figure 9, we report the results for the two most important parameters, i.e., the smoothing index $\alpha$ and the context regularization parameter $\gamma$.
Figure 8. Average performance on the three benchmarks when the smoothing index α varies from 0 to 1. Top: the curve of the AUC score. Bottom: the curve of the DP score.
Figure 9. Average performance on three benchmarks when the context regularization parameter γ varies from 0 to 0.02. Top: the curve of AUC score. Bottom: the curve of DP score.
Smoothing index $\alpha$. The smoothing index $\alpha \in [0,1]$ controls the weight of the true displacement in the current frame for the prediction. For the tracking task, the displacement of the tracked object over a short time is approximately constant. Therefore, the value of the smoothing index should be close to one. In Figure 8, we provide the average AUC and DP scores when $\alpha$ varies from 0 to 1. When the smoothing index reaches $\alpha = 0.88$ (red dotted line in Figure 8), our tracker achieves the best performance in terms of AUC and DP. The result confirms our analysis, and we therefore select $\alpha = 0.88$.
Context regularization parameter $\gamma$. Figure 9 shows the average AUC and DP scores on all three benchmarks when the context regularization parameter varies from 0 to 0.02 with a step of 0.001. The red dotted line denotes the performance when $\gamma = 0$. As $\gamma$ increases, AUC and DP reach their maximum values at $\gamma = 0.009$. Therefore, this work selects $\gamma = 0.009$.

5.3.2. The Validity of Each Component

To verify the effectiveness of each component, we evaluate four trackers equipped with different components. The average performance of the different trackers on the three benchmarks is shown in Figure 10. FACF-FCA denotes the FACF tracker with future context awareness (FCA) disabled. FACF-FSA stands for the FACF tracker without future state awareness (FSA). FACF-(FSA+FCA) represents the BACF tracker with HOG, CN, and grayscale features (the baseline tracker). Clearly, both the FSA and FCA modules improve the tracking performance. The FSA module is beneficial for obtaining the future context patch as well as for improving tracking accuracy. Only on the basis of FSA can FCA contribute to more accurate tracking.
Figure 10. Ablation analysis of the proposed FACF tracker on three benchmarks.

5.4. The Strategy for Context Learning

In this part, different context learning methods based on the current or upcoming frame are compared on the DTB70 [25] benchmark. We denote the context learning methods of FACF and CACF as FCA and CA, respectively. FACF+CA means the FACF-FCA tracker equipped with the context learning method of CACF [16]. BACF+CA and BACF+FCA are the baseline trackers (with HOG, CN, and grayscale features) using the context learning strategy of CACF [16] and the FCA strategy of our tracker, respectively. For these trackers, we finetuned the context regularization parameter to obtain the best performance on the DTB70 [25] benchmark (0.01 for FACF+CA, 0.008 for BACF+CA, and 0.001 for BACF+FCA). The results displayed in Table 4 indicate that the context learning strategy proposed in this paper is superior to that in CACF [16]. The fast context segmentation not only prevents the context patch from containing the target but also effectively reduces the computational complexity.
Table 4. Performance with different context learning strategies on DTB70 [25] benchmark.

5.5. Failure Cases

Figure 11 visualizes three representative and challenging sequences from the three benchmarks on which the proposed method fails to track. In the sequence Animal3, there are similar targets around the alpaca to be tracked. Although our tracker uses context learning to suppress interference, their similar appearance still confuses the proposed tracker. In the sequence uav1_3, the UAV moves so fast and irregularly that the SESF module cannot work well, resulting in tracking failure. In the sequence sand truck, the sand truck is under a low-illumination condition. When the target enters the dark environment, the proposed tracker cannot localize it. Table 2 also confirms that the performance of our tracker is not state of the art under the illumination variation attribute.
Figure 11. Some representative tracking failure cases of the proposed FACF tracker. From top to bottom: Animal3, uav1_3, and sand truck from DTB70 [25], UAV123_10fps [26], and UAVTrack112 [27], respectively.

6. Conclusions

In this work, in order to enhance the discriminative power of the filter in unknown future environments, we proposed a novel future-aware correlation filter tracker, namely FACF. By virtue of an effective time series forecasting method, we obtain a coarse target position and the corresponding predicted patch in the upcoming frame, which is beneficial for localizing the target more precisely. Then, a context-aware mask is produced through an efficient target-aware method. Afterward, we obtain a single context patch at the pixel level by the element-wise product between the context-aware mask and the feature maps of the predicted patch. Finally, the feature maps of the context patch are utilized to improve the discriminative ability. Extensive experiments on three UAV benchmarks verify the superiority of the FACF tracker against other state-of-the-art handcrafted and deep feature-based trackers.
The proposed future-aware strategy aims at dealing with unpredictable surrounding changes by learning the future context rather than the current context. The fast context acquisition avoids additional feature extraction as well as unrelated background noise. In general, our method provides robustness, accuracy, and efficiency, which is promising for real-time UAV applications. We believe the proposed context learning method can be extended to other trackers for more robust UAV tracking. In future work, we will explore more accurate and efficient strategies to exploit future information in order to boost tracking performance without drastically sacrificing speed.

Author Contributions

Methodology, S.M.; software, F.Z.; validation, Z.Q., L.Y. and Y.Z.; formal analysis, L.Y.; investigation, S.M.; resources, Z.Q.; data curation, Y.Z.; writing—original draft preparation, F.Z.; writing—review and editing, F.Z.; visualization, Z.Q.; supervision, S.M.; project administration, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Miao, Y.; Li, J.; Bao, Y.; Liu, F.; Hu, C. Efficient Multipath Clutter Cancellation for UAV Monitoring Using DAB Satellite-Based PBR. Remote Sens. 2021, 13, 3429. [Google Scholar] [CrossRef]
  2. Zhang, F.; Yang, T.; Liu, L.; Liang, B.; Bai, Y.; Li, J. Image-Only Real-Time Incremental UAV Image Mosaic for Multi-Strip Flight. IEEE Trans. Multimed. 2021, 23, 1410–1425. [Google Scholar] [CrossRef]
  3. Mcarthur, D.R.; Chowdhury, A.B.; Cappelleri, D.J. Autonomous Control of the Interacting-BoomCopter UAV for Remote Sensor Mounting. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 5219–5224. [Google Scholar]
  4. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar]
  5. Li, F.; Tian, C.; Zuo, W.; Zhang, L.; Yang, M.H. Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4904–4913. [Google Scholar]
  6. Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK, 1–5 September 2014. [Google Scholar]
  7. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  8. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW), Amsterdam, The Netherlands, 8–16 October 2016; pp. 850–865. [Google Scholar]
  9. Fu, C.; Lin, F.; Li, Y.; Chen, G. Correlation Filter-Based Visual Tracking for UAV with Online Multi-Feature Learning. Remote Sens. 2019, 11, 549. [Google Scholar] [CrossRef] [Green Version]
  10. Fu, C.; Li, B.; Ding, F.; Lin, F.; Lu, G. Correlation Filter for UAV-Based Aerial Tracking: A Review and Experimental Evaluation. arXiv 2020, arXiv:2010.06255. [Google Scholar]
  11. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [Green Version]
  12. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
  13. Kiani Galoogahi, H.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1135–1143. [Google Scholar]
  14. Lukezic, A.; Vojir, T.; Čehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6309–6318. [Google Scholar]
  15. Dai, K.; Wang, D.; Lu, H.; Sun, C.; Li, J. Visual Tracking via Adaptive Spatially-Regularized Correlation Filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4665–4674. [Google Scholar]
  16. Mueller, M.; Smith, N.; Ghanem, B. Context-aware correlation filter tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1396–1404. [Google Scholar]
  17. Fu, C.; Jiang, W.; Lin, F.; Yue, Y. Surrounding-Aware Correlation Filter for UAV Tracking with Selective Spatial Regularization. Signal Process. 2020, 167, 1–17. [Google Scholar] [CrossRef]
  18. Li, Y.; Fu, C.; Huang, Z.; Zhang, Y.; Pan, J. Intermittent Contextual Learning for Keyfilter-Aware UAV Object Tracking Using Deep Convolutional Feature. IEEE Trans. Multimed. 2021, 23, 810–822. [Google Scholar] [CrossRef]
  19. Xu, T.; Feng, Z.H.; Wu, X.J.; Kittler, J. Joint group feature selection and discriminative filter learning for robust visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 7950–7960. [Google Scholar]
  20. Zhang, Y.; Gao, X.; Chen, Z.; Zhong, H.; Xie, H.; Yan, C. Mining Spatial-Temporal Similarity for Visual Tracking. IEEE Trans. Image Process. 2020, 29, 8107–8119. [Google Scholar] [CrossRef] [PubMed]
  21. Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 2891–2900. [Google Scholar]
  22. Fu, C.; Ye, J.; Xu, J.; He, Y.; Lin, F. Disruptor-Aware Interval-Based Response Inconsistency for Correlation Filters in Real-Time Aerial Tracking. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6301–6313. [Google Scholar] [CrossRef]
  23. Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11923–11932. [Google Scholar]
  24. Nazim, A.; Afthanorhan, A. A comparison between single exponential smoothing (SES), double exponential smoothing (DES), holt’s (brown) and adaptive response rate exponential smoothing (ARRES) techniques in forecasting Malaysia population. Glob. J. Math. Anal. 2014, 2, 276–280. [Google Scholar]
  25. Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4140–4146. [Google Scholar]
  26. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 445–461. [Google Scholar]
  27. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Onboard Real-Time Aerial Tracking with Efficient Siamese Anchor Proposal Network. IEEE Trans. Geosci. Remote Sens. 2021, 1–13. [Google Scholar] [CrossRef]
  28. Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision Workshops (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 254–265. [Google Scholar]
  29. Wang, C.; Zhang, L.; Xie, L.; Yuan, J. Kernel cross-correlator. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2017; pp. 4179–4186. [Google Scholar]
  30. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1430–1438. [Google Scholar]
  31. Danelljan, M.; Khan, F.S.; Felsberg, M.; Van De Weijer, J. Adaptive Color Attributes for Real-Time Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1090–1097. [Google Scholar]
  32. Qi, Y.; Zhang, S.; Qin, L.; Yao, H.; Huang, Q.; Lim, J.; Yang, M.H. Hedged Deep Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4303–4311. [Google Scholar]
  33. Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 472–488. [Google Scholar]
  34. Xu, T.; Feng, Z.; Wu, X.; Kittler, J. Learning Adaptive Discriminative Correlation Filters via Temporal Consistency Preserving Spatial Feature Selection for Robust Visual Object Tracking. IEEE Trans. Image Process. 2019, 28, 5596–5609. [Google Scholar] [CrossRef] [Green Version]
  35. Zhu, X.F.; Wu, X.J.; Xu, T.; Feng, Z.; Kittler, J. Robust Visual Object Tracking via Adaptive Attribute-Aware Discriminative Correlation Filters. IEEE Trans. Multimed. 2021, 1. [Google Scholar] [CrossRef]
  36. Yan, Y.; Guo, X.; Tang, J.; Li, C.; Wang, X. Learning spatio-temporal correlation filter for visual tracking. Neurocomputing 2021, 436, 273–282. [Google Scholar] [CrossRef]
  37. Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4844–4853. [Google Scholar]
  38. Fu, C.; Xu, J.; Lin, F.; Guo, F.; Liu, T.; Zhang, Z. Object Saliency-Aware Dual Regularized Correlation Filter for Real-Time Aerial Tracking. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8940–8951. [Google Scholar] [CrossRef]
  39. Fu, C.; Ding, F.; Li, Y.; Jin, J.; Feng, C. Learning dynamic regression with automatic distractor repression for real-time UAV tracking. Eng. Appl. Artif. Intell. 2021, 98, 104116. [Google Scholar] [CrossRef]
  40. Zheng, G.; Fu, C.; Ye, J.; Lin, F.; Ding, F. Mutation Sensitive Correlation Filter for Real-Time UAV Tracking with Adaptive Hybrid Label. In Proceedings of the International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 1–8. [Google Scholar]
  41. Lin, F.; Fu, C.; He, Y.; Guo, F.; Tang, Q. Learning Temporary Block-Based Bidirectional Incongruity-Aware Correlation Filters for Efficient UAV Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2160–2174. [Google Scholar] [CrossRef]
  42. Xue, X.; Li, Y.; Shen, Q. Unmanned aerial vehicle object tracking by correlation filter with adaptive appearance model. Sensors 2018, 18, 2751. [Google Scholar] [CrossRef] [Green Version]
  43. Zha, Y.; Wu, M.; Qiu, Z.; Sun, J.; Zhang, P.; Huang, W. Online Semantic Subspace Learning with Siamese Network for UAV Tracking. Remote Sens. 2020, 12, 325. [Google Scholar] [CrossRef] [Green Version]
  44. Zhuo, L.; Liu, B.; Zhang, H.; Zhang, S.; Li, J. MultiRPN-DIDNet: Multiple RPNs and Distance-IoU Discriminative Network for Real-Time UAV Target Tracking. Remote Sens. 2021, 13, 2772. [Google Scholar] [CrossRef]
  45. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  46. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  47. Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. All-Day Object Tracking for Unmanned Aerial Vehicle. arXiv 2021, arXiv:2101.08446. [Google Scholar]
  48. Boyd, S.; Parikh, N.; Chu, E. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar] [CrossRef]
  49. Sherman, J.; Morrison, W.J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat. 1950, 21, 124–127. [Google Scholar] [CrossRef]
  50. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409. [Google Scholar]
  51. Wang, N.; Zhou, W.; Song, Y.; Ma, C.; Liu, W.; Li, H. Unsupervised Deep Representation Learning for Real-Time Tracking. Int. J. Comput. Vis. 2021, 129, 400–418. [Google Scholar] [CrossRef]
  52. Wang, N.; Zhou, W.; Song, Y.; Ma, C.; Li, H. Real-Time Correlation Tracking Via Joint Model Compression and Transfer. IEEE Trans. Image Process. 2020, 29, 6123–6135. [Google Scholar] [CrossRef] [PubMed]
  53. Li, X.; Ma, C.; Wu, B.; He, Z.; Yang, M.H. Target-aware deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1369–1378. [Google Scholar]
  54. Zhang, L.; Suganthan, P. Robust Visual Tracking via Co-trained Kernelized Correlation Filters. Pattern Recognit. 2017, 69, 82–93. [Google Scholar] [CrossRef]
  55. Wang, N.; Song, Y.; Ma, C.; Zhou, W.; Liu, W.; Li, H. Unsupervised deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1308–1317. [Google Scholar]
  56. Valmadre, J.; Bertinetto, L.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. End-to-End Representation Learning for Correlation Filter Based Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5000–5008. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
