Article

Visual Object Tracking Based on the Motion Prediction and Block Search in UAV Videos

1 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
2 Longmen Laboratory, Luoyang 471000, China
3 Henan Academy of Sciences, Zhengzhou 450046, China
4 Xiaomi Technology Co., Ltd., Beijing 100102, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(6), 252; https://doi.org/10.3390/drones8060252
Submission received: 21 April 2024 / Revised: 28 May 2024 / Accepted: 5 June 2024 / Published: 7 June 2024

Abstract:
With the development of computer vision and Unmanned Aerial Vehicles (UAVs) technology, visual object tracking has become an indispensable core technology for UAVs, and it has been widely used in both civil and military fields. Visual object tracking from the UAV perspective experiences interference from various complex conditions such as background clutter, occlusion, and being out of view, which can easily lead to tracking drift. Once tracking drift occurs, it will lead to almost complete failure of the subsequent tracking. Currently, few trackers have been designed to solve the tracking drift problem. Thus, this paper proposes a tracking algorithm based on motion prediction and block search to address the tracking drift problem caused by various complex conditions. Specifically, when the tracker experiences tracking drift, we first use a Kalman filter to predict the motion state of the target, and then use a block search module to relocate the target. In addition, to improve the tracker’s ability to adapt to changes in the target’s appearance and the environment, we propose a dynamic template updating network (DTUN) that allows the tracker to make appropriate template decisions based on various tracking conditions. We also introduce three tracking evaluation metrics: namely, average peak correlation energy, size change ratio, and tracking score. They serve as prior information for tracking status identification in the DTUN and the block prediction module. Extensive experiments and comparisons with many competitive algorithms on five aerial benchmarks, UAV20L, UAV123, UAVDT, DTB70, and VisDrone2018-SOT, demonstrate that our method achieves significant performance improvements. Especially in UAV20L long-term tracking, our method outperforms the baseline in terms of success rate and accuracy by 19.1% and 20.8%, respectively. This demonstrates the superior performance of our method in the task of long-term tracking from the UAV perspective, and we achieve a real-time speed of 43 FPS.

1. Introduction

In recent years, UAV technology and industry have developed rapidly. Object tracking from the UAV perspective has been widely applied and has become a core technology of UAVs [1,2,3]. Visual object tracking is not only widely used in civil fields such as security, logistics, and rescue but also plays an important role in military fields such as intelligence collection, guidance, and remote sensing [4]. However, object tracking from the UAV perspective faces several challenges [5]. (1) Tracking robustness under complex scenarios: aerial tracking tasks involve many challenging conditions, such as fast motion, object deformation, background clutter, occlusion, and being out of view. (2) Resource constraints: UAVs have limited computational resources and energy, and must achieve efficient, real-time object tracking within these limits. Therefore, researchers still need to explore and solve the problems faced by object tracking from the UAV perspective.
Traditional object-tracking algorithms usually use filtering methods (e.g., the Kalman filter [6]) to estimate the motion state of the object, and predict the position of the object by establishing state equations and observation equations. However, these algorithms are prone to tracking drift when the object undergoes sudden changes in motion or when the sensor wobbles. In recent years, with the development of deep learning (DL) technology, researchers have begun to apply DL techniques to visual object tracking and have made significant progress. Due to their excellent performance, Siamese network trackers [7,8] have been favored by many researchers. In visual object tracking, a Siamese network is a deep neural network for learning target features and computing object similarities. It consists of two similar convolutional neural networks (CNNs) that process the template image and the search image, respectively, and output their feature representations. Derived methods such as STMTrack [9], SiamGAT [10], and SiamRN [11] are based on Siamese networks. These methods usually improve the training method of the algorithm and the similarity calculation process but do not address the tracking drift problem caused by disturbances such as object occlusion and being out of view. As a result, these methods are not advantageous for long-term tracking tasks in complex scenes.
With the improvement of computing resources and the success of the Transformer architecture in the field of natural language processing, target tracking algorithms based on the Transformer architecture have become a hot research topic, such as DropTrack [12], SwinTrack [13], and AiATrack [14]. In visual object tracking, tracking algorithms based on the Transformer architecture mainly consist of an encoder and a decoder, each composed of multiple identical layers containing a self-attention mechanism and a feedforward neural network. The self-attention mechanism can model associations between different locations in the input sequence, while the feedforward neural network introduces nonlinear transformations into the feature representation. SwinTrack [13] uses the Transformer architecture for both representation learning and feature fusion. The Transformer architecture allows for better feature interaction for tracking than a pure CNN framework. AiATrack [14] proposes an attention-in-attention (AiA) module, which enhances appropriate correlations by seeking consensus among all correlation vectors. The AiA module can be applied to both self-attention blocks and cross-attention blocks to facilitate feature aggregation. Thanks to the attention mechanism and the feed-forward networks of the Transformer architecture, these trackers perform the tracking task with better robustness and generalization. However, for the same reason, trackers based on the Transformer architecture are more computationally intensive, so certain limitations exist when performing tracking tasks on UAV platforms.
In order to solve the tracking drift problem caused by complex conditions such as occlusion and being out of view, this paper proposes a visual object tracking algorithm based on motion prediction and block search and tailors it to the tracking task from the UAV perspective. We introduce three metrics for evaluating tracking results: average peak correlation energy (APCE), size change ratio (SCR), and tracking score (TS). These metrics jointly identify the tracking status and provide prior information for the proposed dynamic template updating network (DTUN). The proposed DTUN employs the optimal template strategy based on different tracking statuses to improve the tracker's robustness. We utilize Kalman filtering for motion prediction when the object is temporarily occluded or when other disturbances cause tracking failure. In cases of near-linear motion, the Kalman filter can be applied simply and efficiently. For nonlinear object motion, the Kalman filter predicts the approximate motion direction of the object, which is crucial for the block-search module. The proposed block-prediction module mainly solves the long-term tracking drift problem. Considering the tiny objects common in the UAV perspective, we process the enlarged search region with a block search, significantly improving the search efficiency for tiny objects. Overall, our method performs well in solving the tracking drift problem in long-term tracking caused by complex statuses such as object occlusion and being out of view. Figure 1 shows a visualization of our method compared with the baseline algorithm. When tracking drift occurs due to object occlusion, it will continue if no action is taken (as shown by the blue bounding box). In contrast, our method employs a target prediction and block search framework, which can effectively relocate the tracker to the object (as shown by the red bounding box).
Extensive experiments on five aerial datasets, UAV20L [15], UAV123 [15], UAVDT [16], DTB70 [17], and VisDrone2018-SOT [18], demonstrate the superior performance of our method. In particular, on the large-scale long-term tracking dataset UAV20L, our method achieves a significant improvement of 19.1% and 20.8% in terms of success and accuracy, respectively, compared to the baseline method. In addition, our method achieves a tracking speed of 43 FPS, which far exceeds the real-time requirement. This demonstrates the high-efficiency and high-accuracy performance of our method in performing the object tracking task from the UAV perspective. Figure 2 shows the results of our method compared with other methods in terms of success rate and tracking speed.
In summary, the main contributions of this paper can be summarized in the following three aspects:
(1)
We introduce three evaluation metrics: the APCE, SCR, and TS, which are used to evaluate the tracking results for each frame. These evaluation metrics are used to jointly identify the tracking status and provide feedback information to the DTUN in the tracking of subsequent frames. The proposed DTUN adjusts the template strategy according to different tracking statuses, enabling the tracker to adapt to changes in the object and the tracking scenario.
(2)
We propose a motion prediction and block search module. When tracking drift occurs due to complex statuses (e.g., occlusion and being out of view), we first predict the motion state of the object using a Kalman filter, and then utilize the block search to re-locate the object. Its performance is excellent for solving the tracking drift problem from the UAV perspective.
(3)
Our proposed algorithm achieves significant performance improvements on five aerial datasets, UAV20L [15], UAV123 [15], UAVDT [16], DTB70 [17], and VisDrone2018-SOT [18]. In particular, on the long-term tracking dataset UAV20L, our method achieves increases of 19.1% and 20.8% in success and precision, respectively, compared to the baseline method, and reaches a real-time speed of 43 FPS.
This paper is organized as follows: Section 2 describes other work related to our approach. Section 3 describes the framework and specific details of our proposed approach. Section 4 shows the results of experiments on five aerial datasets and compares them with other methods. Finally, our work is summarized in Section 5.

2. Related Work

Visual object tracking is an important task in computer vision that has developed rapidly in recent years. It aims to accurately track objects in video sequences in real time. In addition, the application of visual object tracking in the field of UAVs is receiving increasing attention and plays an important role. In this section, research on visual object tracking algorithms and their development on UAVs is discussed.

2.1. Object Tracking Algorithm

With the development of computer vision and DL, researchers began to apply DL to visual object tracking. The introduction of the Siamese network has made a significant contribution to the progress of visual object tracking. It learns the representation of the template and search images and calculates the similarity between them. The template and search images are fed into two networks with shared weights. Siamese network-based trackers have achieved high accuracy and real-time performance in object-tracking tasks. Recently, the Transformer model achieved great success in natural language processing and computer vision with the improvement of computational resources. Transformer-based tracking algorithms achieve more accurate tracking by introducing an attention mechanism to model the relationship between the object and its surrounding context. These methods take a sequence of object features as input and use a multilayer Transformer encoder to learn the object’s representation.
In recent years, Siamese network-based tracking algorithms have been extensively researched. Li et al. proposed SiamRPN [19] by introducing the region proposal network (RPN) [20] into the Siamese network. The RPN consists of two branches, object classification and bounding box regression. The introduction of the RPN eliminates the traditional multiscale testing and online fine-tuning process, and greatly improves the accuracy and speed of object tracking. Guo et al. argued that anchor-based regression networks require tricky hyperparameters to be set manually, so they proposed an anchor-free tracking algorithm (SiamCAR) [21]. Compared to the anchor-based approach, the anchor-free regression network has fewer hyperparameters. It directly calculates the distance from the center of the object to the four edges during the bounding box regression process. Fu et al. argued that the fixed initial template information has been fully mined and that the existing online-learning template update process is time-consuming. Therefore, they proposed a tracking framework based on space-time memory networks (STMTrack) [9] that utilizes the historical templates from the tracking process to better adapt to changes in the object's appearance. RTS [22] proposed a segmentation-centered tracking framework that can better distinguish between object and background information. It can generate accurate segmentation masks in the tracking results, but at the cost of reduced tracking speed. Although the above trackers improve tracking accuracy by improving the model training method and bounding box regression, they do not effectively solve the tracking drift problem caused by complex situations. In addition, Zheng et al. [23] argued that the imbalance of the training data makes the learned features lack significant discriminative properties. Therefore, in the offline training phase, they made the model focus more on semantic interference by controlling the sampling strategy. In the inference phase, an interference-aware module and a global search strategy are used to improve the tracker's resistance to interference. However, this global search strategy is not well suited to tracking tiny objects from the UAV perspective, especially when there are similar objects around, background clutter, or low resolution.
After Siamese networks, trackers based on large Transformer models have also achieved excellent results. Yan et al. [24] presented a tracking architecture with an encoder–decoder transformer (STARK). The encoder models the global spatio-temporal feature information of the target and the search region, and the decoder predicts the spatial location of the target. The encoder–decoder transformer captures long-range dependencies in both spatial and temporal dimensions and does not require subsequent hyperparameter-dependent post-processing. Cao et al. proposed an efficient Hierarchical Feature Transformer (HiFT) [25], which inputs hierarchical similarity maps into the feature transformer for the interactive fusion of spatial and semantic cues. It not only improves the global contextual information but also efficiently learns the dependencies between multilevel features. Ye et al. [26] argued that the features extracted by existing two-stage tracking frameworks lack target perceptibility and have limited target–background discriminability. Therefore, they proposed a one-stage tracking framework (OSTrack), which bridges templates and search images with bidirectional information flows to unify feature learning and relation modeling. To further improve the inference efficiency, they proposed an in-network candidate early elimination module to gradually discard candidates belonging to the background. These Transformer-based trackers achieve significant performance improvements in visual object tracking, but they require high computational resources, which is a disadvantage for applications on UAV platforms.

2.2. Tracking Algorithms in the UAV

Visual object-tracking technology is widely used as the core technology of UAV applications. The fast and automatic tracking of moving targets can be achieved by equipping UAVs with visual sensors and object-tracking algorithms. However, tiny objects detected by aerial sensors are easily occluded and susceptible to environmental influences. In addition, there are limitations in the computational power of UAV devices. Therefore, researchers have focused on developing, researching, and improving object-tracking algorithms specifically for aerial tracking to address these challenges.
Cao et al. [25] argued that using only the last layer of image features will reduce the accuracy of aerial tracking, and simply using multiple layers of features will increase the online inference time. Therefore, they proposed the hierarchical feature Transformer framework. This framework enables the interactive fusion of spatial and semantic information to achieve efficient aerial tracking. TCTrack [27] proposed a temporal context information tracking framework that can fully utilize the temporal context of aerial tracking. Specifically, it uses online temporal adaptive convolution to enhance the temporal information of spatial features and an adaptive temporal transformer that uses temporal knowledge for encoding and decoding. Since visual trackers may perform the tracking task at night, low-light conditions are unfavorable for tracking. Therefore, Li et al. [28] proposed a novel discriminative correlation filter-based tracker (ADTrack) with illumination-adaptive and anti-dark capability. This tracker first extracts the illumination information from the image, then performs enhancement preprocessing, and finally generates an object-aware mask to realize object tracking. For airborne tracking, if the object is within a complex environment, tracking performance is severely compromised. With the application of depth cameras on UAVs, adding depth information can more effectively deal with complex scenes such as background interference. Therefore, Yang et al. [29] proposed a multimodal fusion and feature-matching algorithm and constructed a large-scale RGB-D (RGB-depth) tracking dataset. In addition, RGB-T (RGB-thermal) [30] tracking is also an effective means in the field of visual object tracking to address the difficulty of UAV tracking in complex scenes. However, adding additional sensor devices also significantly increases the cost and flight burden of UAVs.

3. Proposed Method

This section provides a comprehensive overview of the proposed method. Section 3.1 outlines the overall structure, Section 3.2 delves into the dynamic template updating network, Section 3.3 discusses the specifics of the search–evaluation network, and Section 3.4 presents details related to the block-prediction module.

3.1. Overall Framework

Figure 3 illustrates the structure of the proposed MPBTrack algorithm, comprising three parts: the dynamic template updating network, the search–evaluation network, and the block-prediction module. The dynamic template updating network adjusts the number of templates according to the evaluation results regarding the tracking condition. The tracking results for each frame are filtered, and high-quality templates are stored in the template memory. The template extraction module extracts a corresponding number of diverse high-quality template features from the template memory and then concatenates these template features with the initial templates. The number of templates used for the current frame tracking is determined by the received APCE feedback results. The search–evaluation network calculates the similarity between the template image and the search image and evaluates the tracking results. Following the similarity calculation, the network obtains results for the object’s classification, center-ness, and regression from three branches. The tracking status evaluation network yields three metrics: APCE, TS, and SCR. The block-prediction module first recognizes the tracking status based on the three joint evaluation metrics. When the tracker detects tracking drift caused by occlusion or other interference, it will predict the target’s motion trajectory using a Kalman filter. If the object’s position is not accurately predicted within 20 frames, a block search in an expanded area is utilized to relocate the object.

3.2. Dynamic Template Updating Network

Since fixed templates alone cannot adapt to changes in the object's appearance and tracking scenarios (e.g., illumination variation and background clutter), they can easily lead to tracking failures. Therefore, we propose a dynamic template updating network to adapt to changes in the object's appearance and tracking scenarios during the tracking process. Under normal tracking conditions, the tracker can achieve accurate tracking by using only a few templates. Under complex tracking conditions, we employ more historical templates to improve the robustness of object tracking. The method dynamically switches the number of templates according to the tracking status, which improves the tracking robustness as well as the computational efficiency.
Figure 4 shows the structure of the DTUN. The network includes a dynamic template in addition to the traditional initial template. The dynamic templates are obtained from high-quality regions filtered from the tracking results. The dynamic features extracted by the feature extraction network are stored in the feature memory. Under poor tracking conditions (e.g., background clutter, motion blur, and object occlusion), the target region of the tracking result is corrupted and the resulting template quality is low. If low-quality templates are stored in the template memory, the subsequent tracking accuracy will be greatly affected. The tracking score is the most intuitive indicator of the quality of the tracked region. In order to improve the quality of templates in the template memory, we filter the tracking results to ensure that high-quality templates are stored in the feature memory while low-quality templates are discarded.
The TS metric is employed to assess the quality of the tracked region for each frame. If the TS of the current region exceeds a predefined threshold (0.6), it will be incorporated into the feature memory as a new template, following the processes of cropping and feature extraction. Specifically, the following operations are performed on the new template to yield the optimal template features: (1) obtain the cropping size of the current image according to Equation (23), and then crop with the target as the center point; (2) resize the cropped region to 289 × 289 to obtain a template containing both the foreground region and the background region; (3) generate a foreground–background mask map of the same size, marking the foreground and the background in the template; and (4) feed the cropped template and the mask map jointly into the feature extraction network to obtain the template features (the purpose is to improve the tracker's discrimination between the target and the background); finally, the template features are stored in the feature memory.
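As a concrete illustration of steps (1)–(4), the following Python sketch shows one way the cropping, resizing, and foreground–background masking could be implemented. The crop-size rule of Equation (23) is not reproduced here, so the crop size is passed in as an argument, and the 0/1 mask convention is an assumption made for this sketch rather than a detail taken from the paper.

```python
import numpy as np
import cv2

def make_template(frame, box, crop_size, out_size=289):
    """Build a candidate template: crop around the target, resize to
    out_size x out_size, and create a foreground-background mask.

    frame:     H x W x 3 image (numpy array)
    box:       (cx, cy, w, h) target box in frame coordinates
    crop_size: side length of the square crop (Equation (23) of the paper,
               not reproduced here)
    """
    cx, cy, w, h = box
    half = crop_size / 2.0
    x1, y1 = int(round(cx - half)), int(round(cy - half))
    x2, y2 = x1 + int(round(crop_size)), y1 + int(round(crop_size))

    # Pad so that crops near the image border stay square.
    pad = max(0, -x1, -y1, x2 - frame.shape[1], y2 - frame.shape[0])
    padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=0)
    crop = padded[y1 + pad:y2 + pad, x1 + pad:x2 + pad]

    # Resize the crop (foreground plus background context) to the template size.
    template = cv2.resize(crop, (out_size, out_size))

    # Foreground-background mask of the same size: 1 inside the target box,
    # 0 elsewhere (this 0/1 convention is an assumption).
    scale = out_size / crop_size
    mask = np.zeros((out_size, out_size), dtype=np.float32)
    fx1, fy1 = int((half - w / 2) * scale), int((half - h / 2) * scale)
    fx2, fy2 = int((half + w / 2) * scale), int((half + h / 2) * scale)
    mask[max(fy1, 0):fy2, max(fx1, 0):fx2] = 1.0
    return template, mask
```

In the tracker, the resulting template and mask would then be passed through the feature extraction network, and the features stored in the memory together with the frame's tracking score.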
The APCE score measures the quality of the response map of the tracking results (described in more detail in Section 3.3.2). The tracker receives feedback on the APCE results at each frame and transforms the APCE values into the corresponding number of templates. The equation for this transformation process is as follows:
T = N_{\max} - \frac{N_{\max} - 1}{1 + \exp\left( -\left( a \cdot \mathrm{APCE} - b \right) \right)}
where T denotes the number of templates, and N_max denotes the maximum value of the template range. We set the template range to 1–10, and a and b denote the slope and horizontal offset of the function, respectively. If the tracking quality of the previous frame is high, a small number of templates will be used in the next frame to achieve better tracking results. When the tracker receives a lower tracking quality in the previous frame, more historical templates will be utilized to adapt to the current complex tracking situation.
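To make the mapping concrete, the sketch below implements the logistic rule above under the sign convention that a higher APCE (more reliable tracking) yields fewer templates; the slope a, offset b, and the example APCE values are illustrative rather than the paper's settings.

```python
import math

def templates_from_apce(apce, n_max=10, a=0.5, b=10.0):
    """Map the APCE of the previous frame to a template count in [1, n_max].

    High APCE (reliable tracking) -> a count close to 1 (few templates).
    Low APCE (disturbed tracking) -> a count close to n_max.
    a and b are the slope and horizontal offset of the logistic curve; the
    values here are illustrative, not the paper's settings.
    """
    t = n_max - (n_max - 1) / (1.0 + math.exp(-(a * apce - b)))
    return max(1, min(n_max, round(t)))

# A confident frame asks for few templates, a disturbed frame for many.
print(templates_from_apce(40.0))  # -> 1
print(templates_from_apce(5.0))   # -> 10
```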
The template extraction mechanism is an important part of the dynamic template-updating process. A large number of historical templates are stored in the template memory, and effective utilization of these templates is crucial to the robustness and accuracy of tracking. By using Equation (1), we can calculate the number of templates that need to be used in the next frame, and then use the template extraction mechanism to select high-quality and diverse templates from the historical ones, suitable for tracking in the next frame. The template extraction can be denoted as:
\tau_i = \frac{t}{N} \times i, \quad i = 0, 1, 2, \ldots, N
T_j = \mathrm{maxarg}\left( \tau_j, \tau_{j+1} \right), \quad j = 0, 1, 2, \ldots, N - 1
T_{con} = \mathrm{concat}\left( T_0, \ldots, T_{N-1} \right)
where t is the total number of templates in the library, and N is the number of tracked templates. τ_j denotes the j-th segmentation point, and maxarg(φ_1, φ_2) denotes the maximum value in the interval [φ_1, φ_2]. concat(·,·) denotes the concatenation operation. Assuming that N templates are needed for the next frame, the specific extraction steps are as follows: (1) The initial template is necessary, as it contains primary information about the tracking target. Note that we exclude the last frame template, as it may introduce additional interfering information. (2) The templates in the template memory are sequentially divided into N − 1 segments, and then the template with the highest tracking score is selected from each segment to serve as the optimal tracking template. This extraction mechanism can enhance the diversity of templates and extract high-quality templates, thus significantly improving the robustness of the tracker.
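The extraction rule above can be written compactly as follows; the memory layout (a chronological list of (feature, tracking score) pairs with the initial template first) is an assumption made for this sketch.

```python
def extract_templates(memory, n):
    """Select n diverse, high-quality templates from the template memory.

    memory: chronological list of (feature, tracking_score) pairs, with
            memory[0] holding the initial template (layout assumed here).
    n:      number of templates requested for the next frame.
    """
    initial = memory[0][0]
    if n <= 1 or len(memory) <= 2:
        return [initial]

    # Always keep the initial template; drop the last frame's template,
    # which may carry interference introduced just before a drift.
    history = memory[1:-1]
    if not history:
        return [initial]

    segments = n - 1                       # split the history into N-1 segments
    seg_len = len(history) / segments
    chosen = []
    for j in range(segments):
        lo = int(round(j * seg_len))       # segmentation point tau_j
        hi = max(lo + 1, int(round((j + 1) * seg_len)))
        segment = history[lo:hi] or history[-1:]
        # "maxarg" over the segment: keep the highest-scoring template.
        chosen.append(max(segment, key=lambda item: item[1])[0])
    return [initial] + chosen              # concat(T_0, ..., T_{N-1})
```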

3.3. Search–Evaluation Network

3.3.1. Search Subnetwork

The search network utilizes STMTrack [9] as the baseline and GoogLeNet [31] as the feature extraction network. As shown in Figure 5, the template image z and search image x undergo feature extraction to obtain their respective features φ(z) and φ(x). In the tracking process, the extracted historical template features are concatenated, and then the concatenated template features and the search image features are subjected to a cross-correlation operation to obtain the response map R^*. The process can be represented as:
R^* = \mathrm{concat}\left( \varphi_1(z), \ldots, \varphi_i(z) \right) \star \varphi(x)
where φ denotes the feature extraction operation, ★ denotes the cross-correlation operation, and concat(·,·) denotes the concatenation operation. Following the response map, the classification convolutional neural network and the regression convolutional neural network are used to obtain the classification feature maps R_cls and regression feature maps R_reg, respectively. The purpose of the classification branch is to classify the target to be tracked from the background. The classification branch includes a center-ness branch that boosts confidence for positions closer to the center of the image. Multiplying the classification response map s_cls with the center-ness response map s_ctr suppresses the classification confidence for locations farther from the center of the target, resulting in the final tracking response map. The purpose of the regression branch is to determine the distance from the center of the target to the left, top, right, and bottom edges of the target bounding box in the search image.
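As an illustration of the concatenation and ★ operation, the sketch below uses a depth-wise cross-correlation in PyTorch; this is a common way to correlate template and search features, not necessarily the exact memory-reading operator of the STMTrack baseline.

```python
import torch
import torch.nn.functional as F

def cross_correlation(template_feats, search_feat):
    """Depth-wise cross-correlation between concatenated template features
    and search features (a generic realization of the star operation).

    template_feats: list of historical template features, each (C, Hz, Wz)
    search_feat:    search-image features of shape (C, Hx, Wx)
    """
    z = torch.cat(template_feats, dim=0)                # concat -> (i*C, Hz, Wz)
    c = search_feat.shape[0]
    kernels = z.view(-1, 1, *z.shape[-2:])              # (i*C, 1, Hz, Wz)
    x = search_feat.unsqueeze(0)                        # (1, C, Hx, Wx)
    x = x.repeat(1, kernels.shape[0] // c, 1, 1)        # (1, i*C, Hx, Wx)
    # Each template channel is slid over the matching search channel.
    return F.conv2d(x, kernels, groups=kernels.shape[0])  # (1, i*C, Hr, Wr)

# Example shapes: 3 historical templates with 256-channel 25x25 features
# correlated against a 29x29 search feature map give a 5x5 response.
feats = [torch.randn(256, 25, 25) for _ in range(3)]
search = torch.randn(256, 29, 29)
print(cross_correlation(feats, search).shape)           # torch.Size([1, 768, 5, 5])
```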
During the training phase of the network model, the classification branch is trained using the focal loss function, which can be expressed as:
L_{cls} = -\alpha_t \left( 1 - p_t \right)^{\gamma} \log\left( p_t \right)
where (1 − p_t)^γ is a modulating factor, γ is a tunable focusing parameter, and p_t is the predicted probability of a positive or negative sample. The regression branch utilizes intersection over union (IoU) as a loss function, which can be expressed as:
L_{reg} = 1 - \frac{\mathrm{Intersection}\left( B, B^* \right)}{\mathrm{Union}\left( B, B^* \right)}
where B is the predicted bounding box and B^* is its corresponding ground-truth bounding box. The center-ness branch uses a binary cross-entropy loss (BCE), which can be written as:
L_{cen} = -\left[ c(i,j) \log p(i,j) + \left( 1 - c(i,j) \right) \log\left( 1 - p(i,j) \right) \right]
where p(i,j) represents the center-ness score at point (i,j), and c(i,j) denotes the label value at position (i,j). The final loss function is:
\mathrm{Loss} = \frac{1}{N} \sum_{x,y} L_{cls} + \frac{\lambda}{N} \sum_{x,y} L_{reg} + \frac{\lambda}{N} \sum_{x,y} L_{cen}
where N denotes the number of sample points, and λ is a weight value.
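A minimal PyTorch sketch of the three loss terms and their combination is given below; it follows the equations above literally (including normalizing every term by the total number of points N), and the α, γ, and λ values are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(pred_prob, target, alpha=0.25, gamma=2.0):
    # Focal loss on per-location classification probabilities (targets in {0, 1}).
    p_t = torch.where(target == 1, pred_prob, 1 - pred_prob)
    alpha_t = torch.where(target == 1,
                          torch.full_like(pred_prob, alpha),
                          torch.full_like(pred_prob, 1 - alpha))
    return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))

def iou_loss(pred_box, gt_box):
    # 1 - IoU for boxes given as (x1, y1, x2, y2) in the last dimension.
    ix1 = torch.max(pred_box[..., 0], gt_box[..., 0])
    iy1 = torch.max(pred_box[..., 1], gt_box[..., 1])
    ix2 = torch.min(pred_box[..., 2], gt_box[..., 2])
    iy2 = torch.min(pred_box[..., 3], gt_box[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred_box[..., 2] - pred_box[..., 0]) * (pred_box[..., 3] - pred_box[..., 1])
    area_g = (gt_box[..., 2] - gt_box[..., 0]) * (gt_box[..., 3] - gt_box[..., 1])
    union = area_p + area_g - inter
    return 1 - inter / union.clamp(min=1e-6)

def total_loss(cls_prob, cls_label, pred_box, gt_box, ctr_pred, ctr_label, lam=1.0):
    # Combine the three terms, normalizing each by the number of points N.
    n = cls_prob.numel()
    l_cls = focal_loss(cls_prob, cls_label).sum() / n
    l_reg = iou_loss(pred_box, gt_box).sum() / n
    l_cen = F.binary_cross_entropy(ctr_pred, ctr_label, reduction='sum') / n
    return l_cls + lam * l_reg + lam * l_cen
```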

3.3.2. Evaluation Network

In the evaluation network, we introduce three evaluation metrics: APCE, SCR, and TS. These three metrics will be jointly used for status identification in the block-prediction module (detailed in Section 3.4). In addition, APCE will also be fed back to the template extraction mechanism as prior information about the tracking quality. The tracking score will also be used to filter high-quality templates and place them in the template memory.
Average Peak Correlation Energy: Inspired by the LMCF [32] method, we introduce the APCE metric to evaluate the quality of the target region in the tracking response map. When the target is not disturbed by poor conditions, the APCE value is high, and the 3D heatmap of the tracking response map shows a sharp single-peak shape. If the target is affected by a cluttered background or occlusion, the APCE value will be lower, and the 3D heatmap of the tracking response map will show a multipeak shape. The more the target is affected, the lower the APCE value. The APCE is calculated as follows:
\mathrm{APCE} = \frac{\left| F_{\max} - F_{\min} \right|^2}{\mathrm{mean}\left( \sum_{w,h} \left( F_{w,h} - F_{\min} \right)^2 \right)}
where F_max and F_min represent the maximum and minimum values of the response map of the tracking result, respectively, and F_{w,h} is the response value at position (w, h) in the response map.
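In code, the APCE of a 2-D response map can be computed as below; the small epsilon guarding the denominator is our addition.

```python
import numpy as np

def apce(response, eps=1e-12):
    """Average peak-to-correlation energy of a 2-D response map."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / (np.mean((response - f_min) ** 2) + eps)

# A sharp single peak yields a high APCE; a flat or multi-peak map a low one.
sharp = np.zeros((25, 25)); sharp[12, 12] = 1.0
noisy = np.random.rand(25, 25)
print(apce(sharp) > apce(noisy))  # True
```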
Figure 6 shows the APCE values and response maps of the tracking results for different statuses. The APCE values are high under normal conditions, and the response map of the tracking result exhibits a sharp single-peak form. However, when the target is affected by a cluttered background, the APCE value decreases, and the response map of the tracking result begins to show a trend of multiple peaks. When the target is occluded, the value of APCE decreases significantly, and the response map shows a low multipeak form. Our proposed DTUN is able to adapt to tracking situations where the impact on the target is small. However, when the target suffers from more serious impacts (e.g., occlusion), tracking drift will occur and lead to complete failure in subsequent tracking if it cannot be effectively addressed. With the APCE metric, we can better measure the tracking results and determine the extent to which the target is affected by the external environment. This provides more effective prior guidance for identifying the tracking status in the block-prediction module.
Size Change Ratio: During the target-tracking process, the size of the target typically changes continuously, without any significant changes between consecutive frames. If the size of the target changes significantly in consecutive frames, external interference has likely caused the tracker to drift. Smaller influences are usually insufficient to cause tracking drift, and only severe influences (e.g., target occlusion) can cause tracking drift. To identify tracking drift in the block-prediction module, we introduce the size change ratio as an indicator. The size change ratio is expressed as:
\mathrm{SCR} = \frac{2}{m} \sum_{i = m/2}^{m} \frac{F^{i}_{w \times h}}{F^{c}_{w \times h}}
where m denotes the number of templates in the template memory, F^i_{w×h} denotes the size of the i-th target template in the template memory, and F^c_{w×h} denotes the template size of the current frame.
Specifically, in order to obtain a more accurate SCR value, target sizes of poor quality are discarded. The average of the latter m/2 target sizes in the template memory is used to calculate the historical target size. Then, the ratio of the historical target size relative to the current target size is calculated. If this ratio falls within the threshold, the target size is considered to be within the normal range of variation. Otherwise, a sudden change in size is considered to have occurred. Such sudden size changes are often caused by tracking drift due to severe impacts (e.g., occlusion) on the target. Therefore, we use the SCR as a tracking status identification metric to assess whether the tracker is experiencing tracking drift. Since the rapid movement of the sensor may also cause the target size to change rapidly in a short period, which could lead to misjudgment by the tracker, we add the tracking score and the APCE score to support the judgment. Tracking drift is only considered to have occurred when all three conditions meet their thresholds, and this decision is then transmitted to the condition recognizer. This auxiliary judgment prevents misjudgment when the tracking condition is good and effectively improves the recognition accuracy of tracking drift.
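The sketch below shows the SCR computation over the newer half of the template memory and the joint three-metric drift check just described; all threshold values are placeholders, not the paper's tuned settings.

```python
def size_change_ratio(memory_sizes, current_size):
    """Ratio of the recent historical target size to the current size,
    averaged over the newer half of the template memory.

    memory_sizes: chronological list of w*h template areas in the memory
    current_size: w*h area of the current frame's estimate
    """
    m = len(memory_sizes)
    recent = memory_sizes[m // 2:]                 # the latter m/2 sizes
    return sum(s / current_size for s in recent) / max(len(recent), 1)

def drift_detected(scr, apce, ts,
                   scr_range=(0.5, 2.0), apce_thr=10.0, ts_thr=0.4):
    """Report drift only when all three metrics agree; thresholds here are
    placeholders, not the paper's tuned settings."""
    scr_abnormal = not (scr_range[0] <= scr <= scr_range[1])
    return scr_abnormal and apce < apce_thr and ts < ts_thr
```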
Figure 7 displays the SCR variation curve during target tracking. The blue curve and bounding box represent the baseline algorithm, while the red curve and bounding box represent our method. Tracking drift occurs, and the size of the target bounding box gradually increases after the target is occluded starting from frame 1837. The blue curve and target bounding box indicate that the baseline algorithm has entered a tracking drift state. At frame 2148, when the SCR exceeds the threshold, it prompts our method to utilize the block prediction to relocate the target. As a result, the target is successfully tracked at frame 2343. This demonstrates the significance of the SCR metric in determining tracking drift and the effectiveness of our method in resolving such situations.
Tracking score: The tracking score is a direct measure of the quality of tracking results. In Siamese networks, the tracking score is calculated by multiplying the classification confidence and center-ness score. The classification confidence reflects the similarity between the template and the search region, while the center-ness reflects the distance between the target and the center point in the search image. The classification confidence s_cls can be multiplied by the center-ness s_ctr to suppress the score for positively classified targets that are far from the target center. The tracking score s_trc can be expressed as s_trc = s_cls × s_ctr. To suppress large variations in the target scale, the scale penalty function is used to penalize large-scale variations in the target, and the process can be written as:
s^* = s_{trc} \times p_n = s_{trc} \times e^{-k \times \max\left( \frac{r}{r'}, \frac{r'}{r} \right) \times \max\left( \frac{s}{s'}, \frac{s'}{s} \right)}
where k is a hyperparameter, r represents the proposal's ratio of height and width, and r′ represents that of the last frame. s and s′ represent the overall scale of the proposal and the last frame, respectively. s is calculated by:
\left( w + p \right) \times \left( h + p \right) = s^2
where w and h denote the width and height of the target, respectively, and p is the padding value. Additionally, to suppress scores away from the center, the response map is post-processed using a cosine window function. The final tracking score TS can be expressed as:
\mathrm{TS} = \max\left( s^* \times (1 - d) + H \times d \right)
where d is a hyperparameter, and H denotes the cosine matrix.
Finally, a 5 × 5 response map of tracking scores is obtained. The position with the maximum confidence score in the response map indicates where the target is most likely to be present. If the target is affected by external factors, the maximum confidence score will be lower. The tracker records tracking results with confidence scores below the threshold and sends this information to the condition recognizer. The condition recognizer receives the three evaluation scores from the evaluation network and performs further processing in the block-prediction module.
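The complete tracking-score computation (raw score, scale/ratio penalty, and cosine-window weighting) can be sketched as follows; the padding choice p = (w + h)/2 and the hyperparameter values are assumptions for illustration, and the penalty is written with a negative exponent so that large scale or ratio changes reduce the score.

```python
import numpy as np

def overall_scale(w, h):
    # s with s^2 = (w + p)(h + p); p = (w + h)/2 is a common padding choice
    # and an assumption here.
    pad = (w + h) / 2.0
    return np.sqrt((w + pad) * (h + pad))

def tracking_score(s_cls, s_ctr, prop_wh, last_wh, k=0.05, d=0.3):
    """Penalized tracking score with a cosine-window prior.

    s_cls, s_ctr: classification and center-ness maps of the same shape
    prop_wh:      (w, h) maps of the proposals at each location
    last_wh:      (w, h) of the previous frame's target
    """
    s_trc = s_cls * s_ctr                                  # raw tracking score
    r, r_last = prop_wh[1] / prop_wh[0], last_wh[1] / last_wh[0]  # h/w ratios
    s, s_last = overall_scale(*prop_wh), overall_scale(*last_wh)
    # Large ratio or scale changes w.r.t. the last frame are suppressed.
    pn = np.exp(-k * np.maximum(r / r_last, r_last / r)
                   * np.maximum(s / s_last, s_last / s))
    s_star = s_trc * pn
    # The cosine window suppresses scores far from the search-region center.
    hann = np.outer(np.hanning(s_star.shape[0]), np.hanning(s_star.shape[1]))
    return np.max(s_star * (1 - d) + hann * d)             # final TS
```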
Algorithm 1 shows the details of the DTUN.
Algorithm 1: Dynamic template-updating network.

3.4. Block-Prediction Module

The block-prediction module comprises three parts: the condition recognizer, the Kalman filter, and block search. Its purpose is to address the tracking drift problem that arises from target occlusion and from targets moving out of the field of view. These problems are difficult to solve with traditional trackers. If the target is completely occluded and then reappears, tracking drift may occur, preventing the target from appearing in the tracker’s search area and resulting in tracking failure. Similarly, if the target moves out of view and then reappears, the tracker may not be able to locate the target in the search area. These challenging situations can cause tracking drift and subsequent tracking failures. Therefore, it is of great significance to solve the tracking drift problem in long-term tracking.
In our analysis of the search–evaluation network, we have determined the significance of the APCE, SCR, and TS in identifying tracking drift. The status recognizer identifies the tracking status for each frame during the tracking process. The Kalman filter is used to predict the target trajectory in the subsequent frame if the target is occluded. The block search is used to re-locate the target if the predictor fails to predict the target within 20 frames. The block search network has two modes: block search and expanded block search. If the block search fails to locate the target, the expanded block search will be used. Additionally, we check if the target moves out of the field of view during each frame’s tracking process. If the target moves out of the field of view, the expanded block search will be used to relocate it.

3.4.1. Motion Prediction

In the field of target tracking, most trackers do not include a target prediction process. This can result in tracking drift when the target is disturbed by external factors, such as occlusion. This is particularly problematic for long-term tracking, as tracking drift can lead to complete failure of subsequent tracking. Therefore, it is crucial to introduce a target prediction process to address tracking drift caused by target occlusion. In airborne tracking, due to the distant sensors and small targets, the motion of the target before it is occluded can be regarded as approximately linear.
The Kalman filter [6] is used in the target prediction stage. The state and observation equations of the Kalman filter can be expressed as:
x_k = A x_{k-1} + w_{k-1}
z_k = H x_k + v_k
where k denotes the moment of the k-th frame, x_k is the state vector, and z_k is the observation vector. H is the observation matrix, with H = I, where I is the identity matrix. A is the state transition matrix. w_{k−1} and v_k are the process error and observation error, respectively, and are assumed to follow Gaussian distributions with covariance matrices Q and R. We set Q = 0.1·I and R = I. During the process of object tracking, the updating process takes up most of the time, while the prediction process takes up very little time. Therefore, it can be assumed that the size of the target will not change significantly in a short period. The state space of the target is set as follows:
X = \left[ x, y, w, h, v_x, v_y \right]^T
where x and y denote the center coordinates of the target, and w and h denote the width and height of the target, respectively. v_x and v_y denote the rates of change of the center coordinates of the target, respectively.
The Kalman filtering computational steps consist of the prediction process and the update process. The prediction process focuses on predicting the state and error covariance variables. It can be expressed as:
\hat{x}_k^- = A \hat{x}_{k-1}
P_k^- = A P_{k-1} A^T + Q
where P_{k−1} is the error covariance of the prediction for the (k−1)-th frame. The state update phase includes the optimal estimation of the system state and the update of the error covariance matrix. It can be expressed as:
\hat{x}_k = \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right)
P_k = \left( I - K_k H \right) P_k^-
K_k = P_k^- H^T \left( H P_k^- H^T + R \right)^{-1}
where K_k is the Kalman filter gain of the k-th frame, and z_k is the actual observation of the k-th frame.
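A minimal NumPy implementation of this constant-velocity filter over the state X = [x, y, w, h, v_x, v_y]^T, with Q = 0.1·I and R = I as stated above, is sketched below; a time step of one frame is assumed.

```python
import numpy as np

class TargetKalmanFilter:
    """Constant-velocity Kalman filter over X = [x, y, w, h, vx, vy]^T,
    with Q = 0.1*I and R = I as stated above; one frame per time step."""

    def __init__(self, init_state):
        self.x = np.asarray(init_state, dtype=float).reshape(6, 1)
        self.P = np.eye(6)
        # Positions advance by one-frame velocities; size and velocities
        # are treated as constant over a single frame.
        self.A = np.eye(6)
        self.A[0, 4] = 1.0   # x += vx
        self.A[1, 5] = 1.0   # y += vy
        self.H = np.eye(6)   # observation matrix H = I
        self.Q = 0.1 * np.eye(6)
        self.R = np.eye(6)

    def predict(self):
        # Prediction step: propagate the state and the error covariance.
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x.ravel()[:4]                 # predicted (x, y, w, h)

    def update(self, z):
        # Update step: correct the estimate with the observation z (6-vector).
        z = np.asarray(z, dtype=float).reshape(6, 1)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x.ravel()[:4]
```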

3.4.2. Block Search

The Kalman filter can effectively solve most target occlusion problems with near-linear motion during target prediction. However, the motion trajectory of the target after occlusion may sometimes exhibit significant nonlinearity. To deal with this situation, we first predict the approximate motion direction of the target using the Kalman filter, and then perform a block search to further locate the target. In target tracking from the UAV perspective, the target is relatively small and moves slowly. Therefore, if the Kalman filter fails to locate the target within 20 frames, we will use block search to further search for it.
Figure 8 illustrates the process of block search. Figure ➀ shows the process of target prediction using Kalman filtering after the target is occluded. When the target is occluded, its motion trajectory exhibits a large nonlinearity, which differs from the predicted trajectory, resulting in prediction failure. Therefore, the block search is necessary for further target searching. In Figure ➁, the red dot indicates the predicted position of the Kalman filter, and the green rectangular box indicates the original search area. The target is not within the search area due to the small size of the search area. Therefore, in the block search module, the search area will be expanded first. The process can be expressed as follows:
s_w = \frac{t_w^{i-1}}{b_w} \times 289 \times n, \quad s_h = \frac{t_h^{i-1}}{b_h} \times 289 \times n
where s_w and s_h represent the width and height of the search area, respectively. n is the magnification factor. In block search, the value of n is 3 when tracking drift occurs because the object moves out of view, and 2 when tracking drift occurs in other situations. b_w and b_h are the width and height of the target in the initially sampled image, obtained as follows:
b_w = \frac{t_w^{1}}{t_s}, \quad b_h = \frac{t_h^{1}}{t_s}
t_s = \frac{\sqrt{t_w^{1} \times t_h^{1} \times p^2}}{289}
where t_w^1 and t_h^1 are the width and height of the initial target, respectively, and p is the search region factor, typically set to 4. Due to the enlarged search region being larger than the target, accurately localizing the target can be difficult if searching for it directly within the region. Figure ➂ shows the response map results of searching in this way. As the response map shows, this direct search makes target localization difficult. Therefore, as shown in Figure ➃, we segment the enlarged search area into 3 × 3 blocks and search for the target in each block. Block search effectively overcomes the challenge of accurately localizing small targets in a larger search area. Figure ➄ shows the response map result of the block search, which accurately localizes the target in the block image containing it. The target region has the highest response value score, while the other blocks have significantly lower scores.
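The search-area expansion and the 3 × 3 split can be sketched as follows; the scoring callable stands in for the tracker's own search network, and the boundary handling is simplified for illustration.

```python
import numpy as np

def block_search(frame, center, prev_wh, init_wh, score_fn,
                 n=2, p=4, template_size=289, blocks=3):
    """Expand the search area around `center`, split it into blocks x blocks
    sub-images, score each, and return the best block.

    score_fn(subimage) -> scalar confidence; it stands in for the tracker's
    own search network. n is the magnification factor (3 when the target
    left the view, 2 otherwise); p is the search region factor.
    """
    tw1, th1 = init_wh                         # initial target size
    twi, thi = prev_wh                         # previous-frame target size
    ts = np.sqrt(tw1 * th1 * p ** 2) / template_size
    bw, bh = tw1 / ts, th1 / ts                # target size in the resized sample
    sw = twi / bw * template_size * n          # expanded search-area width
    sh = thi / bh * template_size * n          # expanded search-area height

    cx, cy = center
    x1, y1 = int(cx - sw / 2), int(cy - sh / 2)
    step_w, step_h = sw / blocks, sh / blocks

    best = (-np.inf, None, None)
    for i in range(blocks):                    # block row index (c_h)
        for j in range(blocks):                # block column index (c_w)
            bx1, by1 = int(x1 + j * step_w), int(y1 + i * step_h)
            sub = frame[max(by1, 0):by1 + int(step_h),
                        max(bx1, 0):bx1 + int(step_w)]
            if sub.size == 0:
                continue
            score = score_fn(sub)
            if score > best[0]:
                best = (score, (j, i), (bx1, by1))
    return best        # (max score, block index (c_w, c_h), block origin)
```

The expanded block search for out-of-view targets described later in this section would correspond to calling such a routine with n = 3 and a 5 × 5 split.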
Figure ➅ illustrates the calculation of the target center position. The center coordinates (x, y) of the target in the original search image are calculated as follows:
x = \left( c_w - \frac{C}{2} \right) \times S + x' - x_1 + x_0
y = \left( c_h - \frac{C}{2} \right) \times S + y' - y_1 + y_0
where C represents the number of block subimages. (c_w, c_h) represents the coordinate position of the block subimage where the target is located, with the horizontal and vertical axes represented by w and h, respectively. S denotes the width and height of the square block subimage. (x_0, y_0) represents the coordinates of the center position of the search area, with x_0 = s_w/2 and y_0 = s_h/2. (x_1, y_1) represents the coordinates of the center position of the block subimage where the maximum score of the target is located, with x_1 = y_1 = S/2. (x′, y′) represents the coordinates of the position with the maximum score in the block search response map. The width w and height h of the target bounding box are calculated using the following equations:
w = (1 - r) \, w_{i-1} + r \times w_p
h = (1 - r) \, h_{i-1} + r \times h_p
where w_p and h_p denote the width and height of the predicted target, respectively, and w_{i−1} and h_{i−1} denote the width and height of the target in the previous frame, respectively. r is derived by:
r = p_n\left( \arg\max \{ s_{trc} \} \right) \times \max \{ s_{trc} \} \times q
where q is a hyperparameter, s_trc is the tracking score, and p_n represents the scale penalty function.
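Under our reading of the formulas above (C taken as the number of blocks per side, and (x′, y′) the peak position inside the winning block's response map), mapping the block-search result back to frame coordinates and smoothing the box size looks roughly like this:

```python
def recover_box(block_idx, block_size, peak_xy, search_center,
                pred_wh, prev_wh, r, blocks=3):
    """Map a block-search peak back to frame coordinates and smooth the size.

    block_idx:     (c_w, c_h) index of the winning block
    block_size:    S, side length of the square block sub-image
    peak_xy:       (x', y') peak position inside the block's response map
    search_center: (x0, y0) center of the expanded search area in the frame
    pred_wh:       (w_p, h_p) size predicted inside the winning block
    prev_wh:       (w_{i-1}, h_{i-1}) size from the previous frame
    r:             update rate derived from the penalized peak score
    blocks:        C, taken here as the number of blocks per side (assumption)
    """
    c_w, c_h = block_idx
    x0, y0 = search_center
    xp, yp = peak_xy
    x1 = y1 = block_size / 2.0                 # center of the block sub-image
    x = (c_w - blocks / 2.0) * block_size + xp - x1 + x0
    y = (c_h - blocks / 2.0) * block_size + yp - y1 + y0
    w = (1 - r) * prev_wh[0] + r * pred_wh[0]  # linear size smoothing
    h = (1 - r) * prev_wh[1] + r * pred_wh[1]
    return x, y, w, h
```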
Tracking drift can also be caused by targets moving out of view, which makes it difficult for the tracker to localize the target due to the uncertainty in the center of the search area. To address this issue, the enlarged block search method is utilized. To bring the search area closer to the target’s reappearance, we calculate the average of the target’s historical trajectory coordinates and use it as the center of the new search area. The search area is expanded to three times the original size, and a 5 × 5 block search is used to search the target.
When the target moves out of the field of view, the tracker is unable to track the target correctly. During this time, the tracker will perform expanded block search every 20 frames. (It should be noted that the Kalman filter is not effective in acquiring the target when it moves out of view or reappears. In this case, only block search will be employed, and the Kalman filter will not be used). It is not until the target reappears within the field of view (e.g., the target changes its direction of motion or the UAV turns its camera towards the target) that the expanded block search can correctly localize the target. Given the uncertainty surrounding the moment of target reappearance, the expanded block search is conducted every 20 frames, which avoids repeated search calculations and improves the tracking efficiency.
Figure 9 shows the overlap rate graphs of the proposed method compared to the baseline algorithm in the UAV20L long-term tracking dataset. The overlap ratio is defined as the intersection-over-union ratio between the predicted target bounding box and the ground truth. A low overlap ratio in the graph indicates tracking drift, which may be caused by occlusion or the disappearance of the target. Tracking drift leads to an almost complete failure of subsequent tracking, as shown by the blue curve in the figure. Our method (red curve) effectively re-tracks the target after tracking drift, demonstrating its effectiveness in solving the tracking drift problem.
Algorithm 2 shows the complete details of the MPBTrack.
Algorithm 2: The proposed MPBTrack algorithm.

4. Experiments

This section presents the experimental validation of the MPBTrack algorithm. Section 4.1 describes the details of model training and experimental evaluation. In Section 4.2, we perform ablation experiments for analysis. Section 4.3, Section 4.4, Section 4.5, Section 4.6 and Section 4.7 present quantitative experimental results, and Section 4.8 presents a qualitative experimental analysis.

4.1. Experimental Details

The experiments were conducted on a platform with an Intel(R) Xeon(R) Silver 4110 2.10 GHz CPU and an NVIDIA GeForce RTX 2080 GPU. We use GoogLeNet [31] as the backbone. The algorithmic model is trained on the TrackingNet [33], LaSOT [34], and GOT-10k [35] training sets, as well as the ILSVRC VID [36], ILSVRC DET [36], and COCO [37] datasets. The model was trained for 20 epochs using the SGD optimizer. The learning rate increased from 1 × 10^{−2} to 8 × 10^{−2} with a warm-up strategy in the first epoch and then decreased from 8 × 10^{−2} to 1 × 10^{−6} with a cosine annealing learning rate schedule. The momentum and weight decay rate were set to 0.9 and 1 × 10^{−4}, respectively. During the inference phase, the block subimages were uniformly cropped to 289 × 289 to be fed into the feature extraction network. The block search was performed every 20 frames after tracking drift, while the expanded block search was performed every 15 frames.
To demonstrate the effectiveness of our approach in tracking an object on UAV platforms, quantitative and qualitative experiments were conducted on five aerial object tracking datasets: UAV20L [15], UAV123 [15], UAVDT [16], DTB70 [17], and VisDrone2018-SOT [18]. The evaluation of the experimental results included overall performance assessment and evaluation of the challenge attributes. The metrics used to evaluate the experiments were success and precision. Success is measured as the area under the curve (AUC) of the success plot, which reports the fraction of frames whose overlap exceeds each threshold. The overlap is the intersection over union (IoU) between the predicted bounding box region A and the ground truth region B. It can be expressed as:
\mathrm{Overlap} = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}
where |·| denotes the area of the region. Because evaluation results can be inconsistent at different overlap thresholds, the AUC is used to rank the success scores of the trackers. Precision measures the distance between the center position of the predicted bounding box and the center position of the ground truth. Traditionally, this distance threshold is set to 20 pixels.
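For reference, the per-frame overlap, the success AUC, and the 20-pixel precision can be computed as in the following sketch (boxes given as (x, y, w, h) with (x, y) the top-left corner; the 21-point threshold grid is the common convention and an assumption here).

```python
import numpy as np

def overlap(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / max(aw * ah + bw * bh - inter, 1e-12)

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Area under the success curve: fraction of frames whose IoU exceeds
    each threshold, averaged over the threshold grid."""
    ious = np.array([overlap(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_at(pred_boxes, gt_boxes, dist_thr=20.0):
    """Fraction of frames whose predicted center lies within dist_thr pixels
    (traditionally 20) of the ground-truth center."""
    def center(b):
        return np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    dists = [np.linalg.norm(center(p) - center(g))
             for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(np.array(dists) <= dist_thr))
```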

4.2. Ablation Experiments

To validate the effectiveness of the proposed modules, we performed ablation experiments on the UAV123 and UAV20L datasets. Our tracking framework comprises two parts: the dynamic template-updating network (DTUN) and the block-prediction module (BPM). We added DTUN, BPM, and DTUN+BPM to the baseline algorithm to validate the effectiveness of each module separately.
Table 1 presents the experimental results for the UAV20L and UAV123 datasets. Our method demonstrates a 16.5% and 18.9% improvement in success and precision, respectively, compared to the baseline on the UAV20L dataset. Specifically, DTUN shows a 4.4% and 5.3% improvement in success and precision, respectively, while BPM shows a 14.3% and 16.0% improvement in success and precision, respectively. The ablation experiments demonstrate that both the proposed DTUN and BPM achieve excellent performance improvement. Our method provides a significant improvement in solving the tracking drift problem in long-term airborne object tracking.
The target in the UAV view is small and moves slowly over a short period. To avoid reduced tracking speed, the block search should not be used for every frame when tracking drift occurs. Instead, appropriate time intervals should be used to balance the success rate and speed. Table 2 shows the changes in tracking performance and speed using different time intervals. Tracking speed may decrease without improving performance if a smaller time interval is used. Conversely, using a larger time interval can decrease tracking performance. The decrease in tracking speed may be due to unresolved tracking drift. The experimental analysis determined that the frame interval suitable for tracking in UAVs is 20.
Section 3.3.2 discusses the importance of the SCR metric in determining tracking drift conditions in UAV video object tracking. Choosing a suitable SCR threshold can improve the accuracy of recognizing tracking conditions. Table 3 demonstrates the impact of different SCR values on tracking performance. The table shows that choosing a threshold that is too large or too small results in misjudging the tracking drift condition and further leads to the degradation of tracking performance.

4.3. Experiments on UAV20L Benchmark

UAV20L [15] is a long-term object tracking dataset, which contains 20 long-term sequences with an average of 2933 frames per sequence. The dataset comprises approximately 58k frames, with the longest sequence consisting of 5527 frames. It presents a significant challenge for the tracker’s long-term tracking capabilities. The challenging attributes of UAV20L include aspect ratio change (ARC), background clutter (BC), camera motion (CM), fast motion (FM), full occlusion (FO), illumination variation (IV), low resolution (LS), out of view (OV), partial occlusion (PO), scale variation (SV), similar object (SO), and viewpoint change (VC). Precision and success serve as evaluation metrics for this dataset.
Figure 10 shows a comparison of our method with other competitive methods, such as TaMOs [38], ARTrack [39], RTS [22], SLT-TransT [40], OSTrack [26], and TransT [41] based on Transformer architecture, and STMTrack [9], SiamBAN [42], SiamCAR [21], and SiamGAT [10] based on Siamese network derivation. Our method outperforms the baseline STMTrack by improving the success by 19.1% and improving the precision by 20.8%. In terms of success, our method surpasses the second-ranked OSTrack by 2.3% and achieves faster tracking speeds. The results demonstrate the superior tracking robustness and accuracy of our method for long-term tracking on UAVs, and the tracking speed far exceeds the real-time requirements.
Figure 11 and Figure 12 show the success and precision plots for attribute evaluation, respectively. The proposed method achieves excellent performance in both attribute evaluations. In terms of success, the scores are ARC (0.683), BC (0.613), CM (0.698), FO (0.599), IV (0.663), OV (0.703), PO (0.689), SV (0.706), and VC (0.742); in terms of precision, the scores are ARC (0.881), BC (0.870), CM (0.900), FO (0.857), IV (0.856), OV (0.898), PO (0.894), SV (0.900), and VC (0.906). The attribute evaluations demonstrate the remarkable performance of the proposed DTUN and BPM in dealing with various complex situations, proving the effectiveness of our approach.

4.4. Experiments on UAV123 Benchmark

UAV123 [15] consists of 123 low-altitude video sequences captured by UAVs, with a total of more than 110K frames. This dataset poses significant challenges for object tracking due to its inclusion of numerous video sequences with complex conditions. Table 4 outlines the results of the comparison experiments between our method and other competitors, such as TaMOs [38], HiFT [25], STMTrack [9], SiamPW-RBO [43], and LightTrack [44]. Compared with the baseline method, our method delivers competitive performance, improving success and precision by 1.4% and 2.1%, respectively.

4.5. Experiments on UAVDT Benchmark

UAVDT [16] is a dataset for object tracking and detection captured by UAVs, which contains 50 video sequences with moving vehicles as its targets of interest. This dataset encompasses various challenges such as long-term tracking (LT), large occlusion (LO), object blur (OB), small object (SO), background clutter (BC), camera rotation (CR), object motion (OM), camera motion (CM), illumination variation (IV), and scale variation (SV).
Figure 13 illustrates the experimental results of our method in comparison with other methods (TaMOs [38], ARTrack [39], DropTrack [12], ROMTrack [48], TransT [41], LightTrack [44], SiamCAR [21], and STMTrack [9]) on the UAVDT dataset. Our approach demonstrates superior performance compared to other Transformer-based large model structures and Siamese network trackers. Compared to the baseline STMTrack, our method shows an improvement of 2.0% and 1.2% in terms of success and precision, respectively. Figure 14 shows the success plots for attribute evaluation. Our method secures excellent success scores across various challenging attributes, including BC (0.608), CM (0.651), IV (0.695), LO (0.615), LT (0.767), OB (0.677), OM (0.687), SV (0.678), and SO (0.675). The evaluation results highlight the excellent performance of our method in adapting to various complex conditions on UAVs.

4.6. Experiments on DTB70 Benchmark

DTB70 [17] is a highly diverse dataset that consists of 70 videos captured by UAVs. Figure 15 shows the success and precision results of the proposed method on the DTB70 dataset. Our method achieves an excellent score of 0.670 in terms of success and outperforms the baseline method by 2.3% in terms of both success and precision. Compared to Transformer architecture-based methods such as TaMOs [38], ROMTrack [48], ARTrack [39], TransT [41], and OSTrack [26], our method achieves better tracking results without requiring large model parameters and computational resources. Similarly, compared to Siamese network-based methods such as STMTrack [9], SiamAPN [49], SiamGAT [10], and SiamCAR [21], our method achieves significant performance improvement in terms of success and precision. In summary, our Siamese network-based tracker achieves a higher success score than the Transformer-based trackers, and it demonstrates significant competitiveness in tracking from UAVs.

4.7. Experiments on VisDrone2018-SOT Benchmark

VisDrone2018-SOT [18] contains 35 video sequences totaling approximately 29K frames. Figure 16 shows the success and precision plots of several competitive trackers (e.g., TaMOs-R50 [38], ARTrack [39], ToMP-101 [50], CNNInMo [46], and STMTrack [9]) on the VisDrone2018-SOT dataset. Our method outperforms the baseline by 5.6% and 4.9% in terms of success and precision, a significant improvement. The success vs. speed and precision vs. speed results are shown in Figure 17. Compared to the Transformer-based and Siamese network-based approaches, our method not only achieves leading success scores but also runs at a real-time speed of 43 FPS, which is a great advantage for object tracking from UAVs.

4.8. Qualitative Analysis

To visually compare tracking performance, we perform a qualitative comparison of the proposed method with the ground truth and with other methods (e.g., ARTrack [39], TCTrack [27], TransT [41], SiamCAR [21], and STMTrack [9]) on seven sequences from the UAV20L and VisDrone2018-SOT datasets. Figure 18 shows the sequences in the following order from top to bottom: car1, group2, person14, uav180, uav1, group1, and uav93.
(1)
car1: This sequence presents two challenges: occlusion of the object and the object moving out of view. Some trackers drift after the object is occluded, and more trackers drift after the target moves out of the field of view. Our method, however, successfully reacquires the object after both challenges.
(2)
group2, person14, uav180: These three sequences present the challenge of object occlusion. The visualization results demonstrate that when the object is occluded, only our tracker successfully tracks it, while the other trackers experience tracking drift or track the wrong object for a prolonged period in the subsequent frames. This highlights the significant advantages of our tracker in long-term tracking and handling challenging situations.
(3)
uav1: The uav1 sequence involves camera motion, background clutter, and fast object motion. The simultaneous interference of these three challenges causes tracking drift in multiple trackers. Our tracker, however, relies on dynamic template updating and block search to remain relatively resistant to such interference.
(4)
group1 and uav93: These two sequences present challenges with similar targets and object occlusion. When mutual occlusion between objects occurs, other trackers switch to the wrong object, whereas our tracker continues to accurately track the correct one in this challenging scenario.

5. Conclusions

This paper proposes a visual object-tracking algorithm based on motion prediction and block search, aiming to solve the tracking drift problem from the UAV perspective. Specifically, when the tracker experiences tracking drift due to object occlusion or the object moving out of view, our approach predicts the motion state of the object using a Kalman filter. The proposed block search module then efficiently relocates the drifted target. In addition, to enhance the adaptability of the tracker in changing scenarios, we propose a dynamic template updating network, which selects an appropriate template strategy according to the tracking conditions and thereby improves the tracker’s robustness. Finally, we introduce three evaluation metrics: APCE, SCR, and TS. These metrics identify the tracking drift status and provide prior information for object tracking in subsequent frames. Extensive experiments and comparisons with many competitive algorithms on five aerial benchmarks, namely, UAV123, UAV20L, UAVDT, DTB70, and VisDrone2018-SOT, demonstrate the effectiveness of our approach in resisting tracking drift in complex UAV-view environments, while achieving a real-time speed of 43 FPS.
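
To make the drift-handling mechanism summarized above concrete, the sketch below shows one way the motion-prediction step and the APCE cue could be realized: a constant-velocity Kalman filter over the target center, together with the APCE formula from [32]. The state layout, noise values, and class/function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a response map (as defined in [32]):
    APCE = |F_max - F_min|^2 / mean((F - F_min)^2).
    A sharp drop in APCE typically indicates occlusion or drift."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over the target center (cx, cy).
    State: [cx, cy, vx, vy]. Illustrative only: the paper predicts the target's
    motion state with a Kalman filter, but its exact state vector and noise
    settings are not reproduced here."""

    def __init__(self, cx, cy, dt=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0], dtype=float)   # state estimate
        self.P = np.eye(4) * 10.0                             # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)       # constant-velocity transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)        # only (cx, cy) is observed
        self.Q = np.eye(4) * 0.01                             # process noise (assumed)
        self.R = np.eye(2) * 1.0                              # measurement noise (assumed)

    def predict(self):
        """Propagate the state one frame ahead; returns the predicted center."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, cx, cy):
        """Correct the filter with a center measurement from a reliable frame."""
        z = np.array([cx, cy], dtype=float)
        y = z - self.H @ self.x                               # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)              # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

In a tracker organized along these lines, update() would be fed the box center on frames judged reliable (e.g., high APCE and tracking score), while predict() alone would supply the center around which the block search is carried out once drift is detected.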

Author Contributions

L.S. and X.L. conceived of the idea and developed the proposed approaches. Z.Y. advised the research. D.G. helped edit the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62271193), the Aeronautical Science Foundation of China (No. 20185142003), Natural Science Foundation of Henan Province, China (No. 222300420433), Science and Technology Innovative Talents in Universities of Henan Province, China (No. 21HASTIT030), Young Backbone Teachers in Universities of Henan Province, China (No. 2020GGJS073), and Major Science and Technology Projects of Longmen Laboratory (No. 231100220200).

Data Availability Statement

Code and data are available upon request from the authors.

Conflicts of Interest

Author Z.Y. was employed by the company Xiaomi Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yeom, S. Thermal Image Tracking for Search and Rescue Missions with a Drone. Drones 2024, 8, 53. [Google Scholar] [CrossRef]
  2. Han, Y.; Yu, X.; Luan, H.; Suo, J. Event-Assisted Object Tracking on High-Speed Drones in Harsh Illumination Environment. Drones 2024, 8, 22. [Google Scholar] [CrossRef]
  3. Chen, Q.; Liu, J.; Liu, F.; Xu, F.; Liu, C. Lightweight Spatial-Temporal Contextual Aggregation Siamese Network for Unmanned Aerial Vehicle Tracking. Drones 2024, 8, 24. [Google Scholar] [CrossRef]
  4. Memon, S.A.; Son, H.; Kim, W.G.; Khan, A.M.; Shahzad, M.; Khan, U. Tracking Multiple Unmanned Aerial Vehicles through Occlusion in Low-Altitude Airspace. Drones 2023, 7, 241. [Google Scholar] [CrossRef]
  5. Gao, Y.; Gan, Z.; Chen, M.; Ma, H.; Mao, X. Hybrid Dual-Scale Neural Network Model for Tracking Complex Maneuvering UAVs. Drones 2023, 8, 3. [Google Scholar] [CrossRef]
  6. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  7. Xie, X.; Xi, J.; Yang, X.; Lu, R.; Xia, W. STFTrack: Spatio-Temporal-Focused Siamese Network for Infrared UAV Tracking. Drones 2023, 7, 296. [Google Scholar] [CrossRef]
  8. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3086–3092. [Google Scholar]
  9. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. Stmtrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783. [Google Scholar]
  10. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552. [Google Scholar]
  11. Cheng, S.; Zhong, B.; Li, G.; Liu, X.; Tang, Z.; Li, X.; Wang, J. Learning to filter: Siamese relation network for robust tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4421–4431. [Google Scholar]
  12. Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14561–14571. [Google Scholar]
  13. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754. [Google Scholar]
  14. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164. [Google Scholar]
  15. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
  16. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  17. Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  18. Wen, L.; Zhu, P.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Liu, C.; Cheng, H.; Liu, X.; Ma, W.; et al. Visdrone-sot2018: The vision meets drone single-object tracking challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  19. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef]
  21. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2020; pp. 6269–6277. [Google Scholar]
  22. Paul, M.; Danelljan, M.; Mayer, C.; Van Gool, L. Robust visual tracking by segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 571–588. [Google Scholar]
  23. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  24. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  25. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. Hift: Hierarchical feature transformer for aerial tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15457–15466. [Google Scholar]
  26. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357. [Google Scholar]
  27. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Tctrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808. [Google Scholar]
  28. Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. All-day object tracking for unmanned aerial vehicle. IEEE Trans. Mob. Comput. 2022, 22, 4515–4529. [Google Scholar] [CrossRef]
  29. Yang, J.; Gao, S.; Li, Z.; Zheng, F.; Leonardis, A. Resource-efficient RGBD aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13374–13383. [Google Scholar]
  30. Luo, Y.; Guo, X.; Dong, M.; Yu, J. RGB-T Tracking Based on Mixed Attention. arXiv 2023, arXiv:2304.04264. [Google Scholar]
  31. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  32. Wang, M.; Liu, Y.; Huang, Z. Large margin object tracking with circulant feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4021–4029. [Google Scholar]
  33. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
  34. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
  35. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef]
  36. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  38. Mayer, C.; Danelljan, M.; Yang, M.H.; Ferrari, V.; Van Gool, L.; Kuznetsova, A. Beyond SOT: Tracking Multiple Generic Objects at Once. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 6826–6836. [Google Scholar]
  39. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706. [Google Scholar]
  40. Kim, M.; Lee, S.; Ok, J.; Han, B.; Cho, M. Towards sequence-level training for visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 534–551. [Google Scholar]
  41. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  42. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
  43. Tang, F.; Ling, Q. Ranking-based Siamese visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8741–8750. [Google Scholar]
  44. Yan, B.; Peng, H.; Wu, K.; Wang, D.; Fu, J.; Lu, H. Lighttrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15180–15189. [Google Scholar]
  45. Zhang, D.; Zheng, Z.; Jia, R.; Li, M. Visual tracking via hierarchical deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3315–3323. [Google Scholar]
  46. Guo, M.; Zhang, Z.; Fan, H.; Jing, L.; Lyu, Y.; Li, B.; Hu, W. Learning target-aware representation for visual tracking via informative interactions. arXiv 2022, arXiv:2201.02526. [Google Scholar]
  47. Zhang, Z.; Liu, Y.; Wang, X.; Li, B.; Hu, W. Learn to match: Automatic matching network design for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13339–13348. [Google Scholar]
  48. Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9589–9600. [Google Scholar]
  49. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Siamese anchor proposal network for high-speed aerial tracking. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 510–516. [Google Scholar]
  50. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740. [Google Scholar]
Figure 1. Visualization results of the proposed MPBTrack method compared with the baseline on the UAV123 dataset. The first row shows the bounding box of the object, where green, red, and blue colors indicate the ground truth, our method and the baseline method, respectively. The second and third rows display the response map results for our method and the baseline method, respectively. (a) represents the tracking drift condition caused by the object being out of view. (b,c) represent the tracking drift conditions caused by the occlusion. (d) represents the tracking error caused by the occlusion and the interference from a similar object.
Figure 2. Performance comparison of our method with others on the UAV20L long-term tracking dataset in terms of success and speed.
Figure 3. Overall network framework of MPBTrack. “★” indicates cross correlation operation.
Figure 4. Dynamic template updating network structure. “+” indicates concatenation operation. The red box in the response map represents the maximum tracking score.
Figure 5. Search network structure. “★” denotes the cross correlation operation. R * represents the response map.
Figure 6. The heatmaps and APCE values of response maps for different tracking statuses. (a) Normal situation, (b) background clutter, and (c) object occlusion.
Figure 7. SCR curve diagram.
Figure 8. Block search module.
Figure 9. Comparison results of overlap rates on the UAV20L long-term tracking dataset. The red curve represents the method using motion prediction and block search, while the blue curve represents the baseline method. The intermittent blank areas in the figure indicate cases where the target disappears, resulting in no overlap rate values.
Figure 10. Success and precision plots on the UAV20L dataset.
Figure 11. Success plot for attribute evaluation on the UAV20L dataset.
Figure 12. Precision plot for attribute evaluation on the UAV20L dataset.
Figure 13. Success and precision plots on the UAVDT dataset.
Figure 14. Success plots for attribute evaluation on the UAVDT dataset.
Figure 15. Success and precision plots on the DTB70 dataset.
Figure 16. Success and precision plots on the VisDrone2018-SOT dataset.
Figure 17. The success vs. speed and precision vs. speed plots on the VisDrone2018-SOT dataset.
Figure 18. Qualitative evaluation results. From top to bottom, the sequences are car1, group2, person14, uav180, uav1, group1, and uav93.
Table 1. Analysis of ablation experiments using DTUN and BPM on the UAV123 and UAV20L datasets.
Module                     UAV20L               UAV123
                           Success  Precision   Success  Precision
Baseline                   0.589    0.742       0.647    0.825
Baseline + DTUN            0.617    0.784       0.651    0.834
Baseline + BPM             0.683    0.875       0.651    0.841
Baseline + DTUN + BPM      0.706    0.905       0.656    0.842
Table 2. Ablation experiments with different search intervals.
Interval Frames    UAV20L                        VisDrone2018-SOT
                   Success  Precision  FPS       Success  Precision  FPS
10                 0.697    0.893      32.4      0.637    0.822      27.9
15                 0.700    0.898      32.6      0.637    0.823      28.3
20                 0.706    0.905      43.5      0.665    0.864      43.4
25                 0.689    0.811      32.8      0.633    0.818      26.7
Table 3. Experimental results for various SCR thresholds on the UAV20L dataset.
SCR Threshold    2.0      2.4      2.8      3.2      3.6
Success          0.696    0.704    0.706    0.695    0.688
Precision        0.892    0.901    0.905    0.890    0.882
Table 4. Experimental results comparing our method with other methods on the UAV123 dataset. Trackers are ranked based on their success scores.
Tracker              Succ.    Prec.
TaMOs [38]           0.571    0.791
HiFT [25]            0.589    0.787
TCTrack [27]         0.604    0.800
PACNet [45]          0.620    0.827
SiamCAR [21]         0.623    0.813
LightTrack [44]      0.626    0.809
CNNInMo [46]         0.629    0.818
SiamBAN [42]         0.631    0.833
SiamRN [11]          0.643    -
AutoMatch [47]       0.644    -
SiamPW-RBO [43]      0.645    -
SiamGAT [10]         0.646    0.843
STMTrack [9]         0.647    0.825
MPBTrack (ours)      0.656    0.842
