Object-Tracking Algorithm Combining Motion Direction and Time Series

Su, Jianjun; Wu, Chenmou; Yang, Shuqun

doi:10.3390/app13084835


by Jianjun Su ¹, Chenmou Wu ² and Shuqun Yang ¹,*

¹ School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

² Department of Computer Science & Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(8), 4835; https://doi.org/10.3390/app13084835

Submission received: 1 March 2023 / Revised: 7 April 2023 / Accepted: 10 April 2023 / Published: 12 April 2023


Abstract:

Object tracking using deep learning is a crucial research direction within intelligent vision processing. One of the key challenges in object tracking is accurately predicting the object’s motion direction in consecutive frames while accounting for the reliability of the tracking results during template updates. In this work, we propose an object-tracking algorithm that leverages both motion direction and time series information. We propose a loss function that guides the tracking model to learn the direction of object motion between consecutive frames, resulting in improved object localization accuracy. Furthermore, to enhance the algorithm’s ability to discriminate the reliability of tracking results and improve the quality of template updates, the proposed approach includes an attention mechanism-based tracking result reliability scoring module, which takes into account the time series of tracking results. Comprehensive experimental evaluation on four datasets shows that our algorithm effectively improves object-tracking performance. The ablation experiments and qualitative analysis confirm the effectiveness of the proposed module and loss function.

Keywords:

object tracking; deep learning; motion direction; attention mechanism; time series

1. Introduction

Object tracking has become one of the most important research areas in computer vision due to its wide range of applications in surveillance, video analysis, autonomous driving, drone photography, and traffic patrol [1,2,3]. The primary goal of object-tracking is to predict the position and size of a target in subsequent frames of a video sequence after the initial frame has been provided. With the advances in deep neural networks, the data-driven method has become mainstream for object tracking. Compared with the traditional filter-based method, the data-driven method achieves the best performance in terms of tracking accuracy [4,5,6]. Recently, researchers have paid more attention to designing stronger neural networks for target discrimination and incorporating detection modules to enhance the algorithm’s long-term tracking ability [7,8,9]. Despite these approaches achieving significant performance, they are still challenged by deformation, occlusion, complex backgrounds, illumination, and scale changes.

Object-tracking models can be broadly classified into two categories: online training models and offline training models. Online training models, such as MDNet [10], DLT [11], TCNN [12], and STCT [13], design a classifier to distinguish the target from the environmental context by using large-scale image data to learn target commonality offline and positive and negative samples to learn target characteristics online during tracking. Offline training models extract the target template from the image first, and then locate the target by integrating the features of target templates and search images, which avoids updating the network model during tracking. One common method is using a Siamese network to project the target template and search image to another space, and then integrate the features to locate the target, as performed by SiamFC [14], SiamFC++ [15], SiamRPN [16], SiamBAN [17], SiamCAR [18], and SiamRCNN [19]. The other way relies on the attention mechanism [20], which finds the correlation between the target template and the search image to highlight important features and suppress irrelevant ones, as performed by TransT [21], TMT [22], STMTrack [23], DTT [24], and STARK [25]. However, these algorithms neglect the continuous motion of the target from frame to frame, which leads to discontinuity in the predicted target position and affects tracking accuracy. Furthermore, previous methods pay little attention to the reliability of tracking results, which plays an important role in the algorithm’s ability to discriminate the target. To address these challenges, we propose a novel object-tracking algorithm that combines motion direction and time series.

Our proposed approach addresses the lack of consideration of target motion in previous algorithms by adding direction information constraint. During the training process, the algorithm learns general patterns of target motion, which enable it to obtain the motion information of a specific target during tracking, leading to improved tracking accuracy. Additionally, we propose a tracking result scoring module that processes time series to analyze the correlation between current tracking results and historical targets. This module captures high-quality templates via the dependability of tracking results in facilitating template updates. Our approach is independent of any particular object-tracking model and can be applied to a range of algorithms. In this work, we combine the proposed algorithm with the STARK [25] to design an object-tracking algorithm that integrates motion direction and time series analysis. Our main contributions are summarized as follows:

1. We add a constraint on the angle quantity to strengthen the network’s understanding of motion direction. The resulting directional guidance loss guides the network to learn motion direction information, which effectively improves the tracking performance.
2. A tracking result scoring module (TRSM) is proposed to determine the reliability of tracking results. The scoring results are used to assist in the selection of high-quality templates and promote the updating of templates.
3. The proposed algorithm is evaluated on four benchmark datasets: TrackingNet, GOT-10K, LaSOT, and VOT2020. Both quantitative experiments and qualitative analysis validate the effectiveness of the proposed method.

The rest of the paper is organized as follows. In Section 2, we review related work in the field of object-tracking algorithms. Section 3 describes the proposed algorithm in detail, including the motion direction-guided loss function and the tracking result scoring module based on the attention mechanism. Section 4 presents the experimental setup and results, as well as an ablation study to analyze the effectiveness of the proposed algorithm. Section 5 provides an additional analysis of our approach. Finally, we conclude the paper in Section 6.

2. Related Works

Online model update. The object-tracking algorithm of online model update generally necessitates the online fine-tuning of the network to suit the specific target. Wang et al. [11] introduced the DLT algorithm by applying deep neural networks to target tracking. The algorithm performs unsupervised training on a stacked denoising autoencoder when offline and fine-tunes the model when online. During the tracking process, DLT extracts candidate regions and determines the target location based on the region with the highest confidence level. MDNet [10] is an algorithm that uses a multi-domain convolutional network, which divides the network into shared layers and domain-specific layers. The shared layers acquire the target commonality, while the domain-specific layers acquire the target characteristics. Nam et al. [12] proposed the use of Convolutional Neural Networks (CNN) with a tree structure to locate targets by combining multiple convolutional networks based on linear weighting to reduce the impact of interference samples on model updates. During tracking, CNN nodes are added at certain intervals, while old nodes are removed to maintain network stability. In response to the limitation of using a large number of training samples in tracking, Wang et al. [13] viewed the CNN training process as an integrated learning process by using each convolutional kernel that outputs the corresponding feature map as a baseline learner. Han et al. [26] proposed the BranchOut method, which uses a network similar to MDNet but is designed with multiple fully connected layer branches, each with a variable number of fully connected layers. This method ensures the representation capability of the target features.

Offline training models. The online fine-tuning method can be computationally expensive, prompting researchers to investigate models based entirely on offline training. One popular algorithm that has received significant attention is the Siamese network-based object-tracking algorithm. The SINT algorithm [27] was the first to introduce the Siamese network to object tracking by sampling the search image, selecting candidate areas, and using the target template to calculate cross-correlation responses for all candidate areas. Similarly, the classic SiamFC algorithm [14] directly extracts the target template and search image features using the same convolutional network and calculates the position in the search image that is most similar to the target using cross correlation. The SiamRPN algorithm [16] converts the object-tracking problem into a classification and regression problem by using the Region Proposal Network (RPN) to classify and regress the template features and search image features. Meanwhile, the SiamCAR algorithm [18] uses an anchor-free and proposal-free strategy to perform classification and regression prediction on the target, which reduces hyperparameters and improves algorithm performance. Rotation of the target significantly decreases the tracking accuracy of object-tracking algorithms. To address this issue, the RE-SiamNets algorithm [28] incorporates rotation equivariance into the Siamese network model, which effectively mitigates the adverse impact of target rotation. The attention mechanism has been shown to be effective at integrating the features of the target template and the search image. The TMT algorithm [22] incorporates the Transformer’s encoder and decoder into the Siamese network and uses the encoder to extract target template features and the decoder to fuse target template features with search image features. 
The TransT algorithm [21] enhances feature extraction and fusion by designing the Ego-Context Augment module and the Cross-Feature Augment Modules based on the attention mechanism. The VTT algorithm [29] fuses the features of the template and search image using the Siamese network, enhances global information using the Transformer, and then determines the tracking result using the Top-1 result of classification and regression. The STARK algorithm [25] uses an attention mechanism to encode temporal and spatial features of the target template and search image, and predicts the coordinates of the upper left and lower right corners of the target. The STARK algorithm is trained using L_1 loss and GIoU [30] loss. For template updating, it employs a fully connected network to determine the reliability of the tracking results, controlling the template update process. Recently, the STARK algorithm has demonstrated state-of-the-art performance on various benchmarks.

3. Methods

Our work implements optimization in two distinct phases: tracking and training, as illustrated in Figure 1. In the tracking phase, the tracking result scoring module (TRSM) is utilized to assess the reliability of tracking outcomes. Results with reliability values surpassing a predefined threshold τ are subsequently selected as high-quality target templates. During training, a loss function incorporating directional guidance is employed to instruct the tracking model in learning target motion direction information.

3.1. Loss Function with Directional Guidance

Given the assumption of minor changes in the target’s position and appearance across consecutive frames, the direction vector from the predicted target position to the actual target position in the current frame, expressed as $\overset{\rightharpoonup}{R}$, approximates the direction toward the actual target position in the subsequent frame. The direction vector $\overset{\rightharpoonup}{R}$ can be utilized by the tracker to forecast the position of the target in the subsequent frame. Inspired by the SIoU [31], an additional constraint is built during the training process, designed to facilitate the model’s exploitation of commonalities associated with target motion.

The SIoU loss function integrates data pertaining to angle, distance, aspect ratio, and IoU to offer directional guidance throughout the training process. In accordance with Figure 2, the direction vector is visualized as the line connecting the centroids of the prediction box $\hat{B}$ and the actual box $B$, with the centroid distance serving as the magnitude of the direction vector. Angle information is provided by the angle between the centroid line and the coordinate axis.

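The angular measure, denoted as $\eta$ [31], is specified by the following equation:

$$\eta = 1 - 2 \sin^{2}\left(\arcsin(\xi) - \frac{\pi}{4}\right), \tag{1}$$

where $\alpha$ is the angle between the centroid line and the coordinate axis. When $\alpha < \frac{\pi}{4}$, we have $\xi = \frac{C_{h}}{d} = \sin(\alpha)$, and when $\alpha \geq \frac{\pi}{4}$, we have $\xi = \frac{C_{w}}{d} = \sin\left(\frac{\pi}{2} - \alpha\right)$. Here, $C_{h}$ and $C_{w}$ denote the distances in the height and width directions, respectively, between the center points of $\hat{B}$ and $B$, and $d$ is the distance between those center points.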
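As a short derivation offered for intuition (it is not part of the original formulation), Equation (1) can be simplified with the identity $1 - 2\sin^{2}(x) = \cos(2x)$, assuming $0 \leq \alpha \leq \frac{\pi}{2}$:

```latex
\begin{aligned}
\eta &= 1 - 2\sin^{2}\!\left(\arcsin(\xi) - \tfrac{\pi}{4}\right)
      = \cos\!\left(2\arcsin(\xi) - \tfrac{\pi}{2}\right)
      = \sin\!\left(2\arcsin(\xi)\right), \\
\alpha < \tfrac{\pi}{4}:\quad & \arcsin(\xi) = \alpha
      \;\Rightarrow\; \eta = \sin(2\alpha), \\
\alpha \geq \tfrac{\pi}{4}:\quad & \arcsin(\xi) = \tfrac{\pi}{2} - \alpha
      \;\Rightarrow\; \eta = \sin(\pi - 2\alpha) = \sin(2\alpha).
\end{aligned}
```

Under this reading, the angular measure is 0 when the two centers are aligned along a coordinate axis and reaches its maximum of 1 when the centroid line lies at 45° to the axes.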

Consider prediction box $\hat{B} = (\hat{c}_{x}, \hat{c}_{y}, \hat{w}, \hat{h})$ and actual box $B = (c_{x}, c_{y}, w, h)$, with the four dimensions indicating the x and y coordinates of the center point and the width and height of the rectangular box. The distance between $\hat{B}$ and $B$ is divided into two directions along the x and y axes and is incorporated with angle information to define the distance loss $L_{Distance}$ [31] as follows:

$$L_{Distance} = \sum_{i = x, y} \left(1 - e^{-\lambda \rho_{i}}\right), \tag{2}$$

where $\rho_{x} = \left(\frac{c_{x} - \hat{c}_{x}}{W_{closure}}\right)^{2}$, $\rho_{y} = \left(\frac{c_{y} - \hat{c}_{y}}{H_{closure}}\right)^{2}$, and $\lambda = 2 - \eta$. Furthermore, $W_{closure}$ and $H_{closure}$ denote the width and height, respectively, of the smallest box enclosing $\hat{B}$ and $B$.

Define the aspect ratio loss $L_{Shape}$ [31] as follows:

$$L_{Shape} = \sum_{i = w, h} \left(1 - e^{-\omega_{i}}\right)^{\epsilon}, \tag{3}$$

where $\omega_{w} = \frac{|w - \hat{w}|}{\max(w, \hat{w})}$ and $\omega_{h} = \frac{|h - \hat{h}|}{\max(h, \hat{h})}$. Here, $\epsilon$ is a hyperparameter that controls the weight of the aspect ratio loss in the overall loss.

The IoU loss [31] refers to the measurement of the intersection over union of boxes $\hat{B}$ and $B$:

$$IoU = \frac{|B \cap \hat{B}|}{|B \cup \hat{B}|}. \tag{4}$$

The SIoU loss function $L_{SIoU}$ [31] can be expressed as

$$L_{SIoU} = 1 - IoU + \frac{1}{2}\left(L_{Distance} + L_{Shape}\right). \tag{5}$$

The inclusion of angle information provides directional guidance in the x and y coordinate axes during distance calculation. This training process enables the model to approach the actual target position along the coordinate axes, thus learning the commonality of target motion information. It allows the algorithm to better capture the direction of target motion during inference and results in improved tracking accuracy.
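The pieces in Equations (1) to (5) can be assembled into a minimal sketch for a single pair of boxes. This is an illustration under stated assumptions rather than the authors' implementation: the function name `siou_loss` and the parameter `shape_exponent` (standing in for the shape hyperparameter) are hypothetical, boxes are assumed to be `(cx, cy, w, h)` tuples with positive width and height, and coincident centers are assumed to yield an angular measure of 0.

```python
import math


def siou_loss(pred, true, shape_exponent=4.0):
    """SIoU loss (Equation (5)) for two boxes given as (cx, cy, w, h)."""
    pcx, pcy, pw, ph = pred
    tcx, tcy, tw, th = true

    # Corner coordinates of the predicted and actual boxes.
    px1, py1, px2, py2 = pcx - pw / 2, pcy - ph / 2, pcx + pw / 2, pcy + ph / 2
    tx1, ty1, tx2, ty2 = tcx - tw / 2, tcy - th / 2, tcx + tw / 2, tcy + th / 2

    # Equation (4): intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + tw * th - inter)

    # Equation (1): angular measure from the center offsets.
    c_w, c_h = abs(tcx - pcx), abs(tcy - pcy)
    d = math.hypot(c_w, c_h)
    if d == 0.0:
        eta = 0.0  # assumption: no direction to penalise when centers coincide
    else:
        alpha = math.asin(min(1.0, c_h / d))
        xi = min(1.0, (c_h if alpha < math.pi / 4 else c_w) / d)
        eta = 1.0 - 2.0 * math.sin(math.asin(xi) - math.pi / 4) ** 2

    # Equation (2): distance loss, normalised by the smallest enclosing box.
    w_closure = max(px2, tx2) - min(px1, tx1)
    h_closure = max(py2, ty2) - min(py1, ty1)
    lam = 2.0 - eta
    rho_x = ((tcx - pcx) / w_closure) ** 2
    rho_y = ((tcy - pcy) / h_closure) ** 2
    l_distance = (1.0 - math.exp(-lam * rho_x)) + (1.0 - math.exp(-lam * rho_y))

    # Equation (3): aspect ratio (shape) loss.
    omega_w = abs(tw - pw) / max(tw, pw)
    omega_h = abs(th - ph) / max(th, ph)
    l_shape = ((1.0 - math.exp(-omega_w)) ** shape_exponent
               + (1.0 - math.exp(-omega_h)) ** shape_exponent)

    return 1.0 - iou + 0.5 * (l_distance + l_shape)
```

With this sketch, identical boxes give a loss of 0, and the loss grows as the predicted center moves away from the actual center. A training implementation would operate on batched tensors and add a small constant to guard the divisions.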

The loss function utilized in this study is a combination of the $L_{1}$ loss function and the SIoU loss function with respective weightings. Specifically, the loss function $L_{loss}$ is given by the following:

$$L_{loss} = \lambda_{L_{1}} L_{1}(B, \hat{B}) + \lambda_{SIoU} L_{SIoU}(B, \hat{B}), \tag{6}$$

where $\lambda_{L_{1}}$ and $\lambda_{SIoU}$ are the weights assigned to the $L_{1}$ loss function and SIoU loss function, respectively.

3.2. Tracking Result Scoring Module

The TRSM module comprises three stages. In stage 1, the image features of the tracking results are obtained and transformed into vector sequences that can be processed by the attention mechanism. In stage 2, a Transformer structure encodes the vector sequence using the self-attention mechanism in the encoding phase and integrates the time series information using the cross-attention mechanism in the decoding phase. In stage 3, a fully connected network outputs the reliability score of the tracking result. The overall architecture of the module is illustrated in Figure 3.

The TRSM module operates on a set of historical tracking results and the current frame tracking results, using them as templates for the input.

Stage 1. We apply a fully convolutional network $\varphi$ to extract the template features. The structure of the convolutional network $\varphi$ is not restricted to any particular design. The module receives an input template $T = \{T_{1}, T_{2}, \dots, T_{t}, T_{result}\}$, where $T \in \mathbb{R}^{3 \times H \times W \times (t + 1)}$, and $T_{result}$ represents the current tracking result template, while $T_{history} = \{T_{i}, i = 1, 2, \dots, t\}$ denotes the historical tracking result templates. The template $T$ is mapped to $\mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s} \times (t + 1)}$ via network $\varphi$, then flattened and concatenated to form a vector sequence $T'$. The length of $T'$ is $\frac{H}{s} \times \frac{W}{s} \times (t + 1)$, and its dimension is $C$.
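The shape bookkeeping in Stage 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `template_sequence_shape` is hypothetical, the defaults for the stride and channel count follow the values reported in Section 4.2 (s = 16, C = 256), and the 128 × 128 template size in the usage note is an assumed example value, not one stated in the text.

```python
def template_sequence_shape(height, width, num_history, stride=16, channels=256):
    """Return (sequence_length, feature_dim) of the flattened template sequence.

    num_history is t, the number of historical templates; the current
    tracking result adds one more template, giving t + 1 in total.
    """
    if height % stride or width % stride:
        # Assumption of this sketch: the backbone stride divides the input size.
        raise ValueError("height and width must be multiples of the stride")
    tokens_per_template = (height // stride) * (width // stride)
    return tokens_per_template * (num_history + 1), channels
```

For instance, with two historical templates and a hypothetical 128 × 128 crop, `template_sequence_shape(128, 128, 2)` gives a sequence of 8 × 8 × 3 = 192 vectors, each of dimension 256.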

Stage 2. $T'$ is fed into the encoding stage, which contains $N$ encoder layers. Each encoder layer consists of a multi-head self-attention mechanism and a feed-forward network. The encoder captures the dependencies between individual sequence elements and enhances features based on contextual information to exploit temporal features between individual templates. In the decoding stage, unlike the decoder in the Transformer architecture, only the cross-attention mechanism and the feed-forward network are employed to construct the decoder, with no self-attention mechanism. A query vector is used to query the temporal features of the current and historical tracking results in the $N$-layer decoder, and a feature vector with strong discriminative power is generated.
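The decoder's use of a single query vector can be illustrated with a stripped-down sketch of scaled dot-product cross-attention. It is a simplification under stated assumptions: it uses one head, omits the learned projection matrices, feed-forward network, and normalization layers that a real Transformer decoder layer contains, and the name `cross_attention` is hypothetical.

```python
import math


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    query:  list of floats (the query vector)
    keys:   list of key vectors, one per encoded sequence element
    values: list of value vectors, aligned with keys
    Returns the attended feature vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

    # Softmax over the sequence, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights
```

The returned weights sum to 1 over the sequence, so the output is a convex combination of the encoded template features, with the largest contribution from the elements most similar to the query.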

Stage 3. We use a three-layer fully connected network that takes the feature vector output by the decoder as input and produces a score value through a sigmoid activation function. A higher score value output by the TRSM module indicates that the tracking result is more similar to the historical results, which means higher tracking result reliability; selecting such results improves the quality of template updates and, in turn, the performance of the algorithm. The loss function used to train the TRSM module is the binary cross-entropy loss, denoted by $L_{TRSM}$, which is expressed as follows:

$$L_{TRSM} = -\left[y_{i} \log(P_{i}) + (1 - y_{i}) \log(1 - P_{i})\right], \tag{7}$$

where $y_{i}$ represents the ground truth value and $P_{i}$ is the estimated value output by the TRSM module.
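For reference, binary cross-entropy in its conventional form (with a leading minus sign, so the loss is non-negative and lower is better) can be sketched as below. The function name `trsm_bce_loss`, the averaging over samples, and the clipping constant `eps` are assumptions of this sketch, not details given in the text.

```python
import math


def trsm_bce_loss(labels, scores, eps=1e-7):
    """Mean binary cross-entropy between labels y_i and predicted scores P_i."""
    total = 0.0
    for y, p in zip(labels, scores):
        # Clip the score away from 0 and 1 so that log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

A confident, correct score gives a small loss, while a confident, wrong score is penalized heavily, which is what pushes the sigmoid output toward a usable reliability score.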

4. Experimental Results and Analysis

4.1. Datasets

The proposed algorithm is evaluated on four public datasets.

LaSOT [32]. LaSOT is a large-scale long-term single-object-tracking dataset that comprises 1400 video sequences spanning 70 categories and a minimum of 1000 frames per sequence, with an average frame count of 2500. Evaluation of LaSOT is performed using success rate and accuracy rate.

TrackingNet [33]. TrackingNet is a large-scale short-term object-tracking dataset with over 30,000 video sequences and a test set comprising 511 video sequences that feature a diverse range of target classes. To mitigate the influence of image resolution and target box size, TrackingNet adopts normalized accuracy and success rate for evaluation, with test results requiring official server evaluation.

GOT-10K [34]. GOT-10K is a comprehensive object-tracking dataset with a collection of 10,000 video sequences featuring 563 categories. Evaluation of GOT-10K is conducted on an official server, with algorithm performance measured using average overlap and success rates.

VOT2020 [35]. VOT2020 is a dataset specifically designed for the VOT (Visual Object Tracking) Challenge, containing 60 video sequences featuring various challenging scenarios, such as fast motion and occlusion. Expected average overlap (EAO) is employed to measure algorithm performance on the VOT2020 dataset.

4.2. Implementation Details

The base tracker used in this work is STARK-ST50 [25]. Note that, instead of STARK’s own tracking result scoring mechanism, we incorporate the TRSM module to determine the reliability of tracking results. The tracker is trained using Equation (6), while the TRSM module is trained using Equation (7).

To construct the historical templates in the TRSM module, we select the initial target template and the dynamic template updated over time in the STARK algorithm. The TRSM module employs a pre-trained ResNet-50 [36] as its feature extraction network, which has also been utilized in STARK-ST50. The scaling factor of this network is 16 ($s = 16$). The encoder and decoder sections in the encoding–decoding architecture have two layers each, and we set the number of heads to 8 and dimensions to $C = 256$. Additionally, the fully connected network’s hidden layer has 256 neurons.

During the tracking process, we set the reliability threshold $\tau$ to 0.6 based on experience. The TRSM module determines the reliability of the tracking results, and the template is updated only if the reliability score exceeds this threshold. Following [31], we use the hyperparameter $\epsilon = 4$ in the loss function to control the weight of the aspect ratio loss in the total loss. Following the research on the linear combination of $L_{1}$ loss and GIoU loss in DETR, we set the weights $\lambda_{L_{1}}$ and $\lambda_{SIoU}$ to 5.0 and 2.0, respectively. The aforementioned parameters are applicable to all datasets.
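The threshold rule for template updates can be written as a one-line decision. This is a minimal sketch: the function and argument names are hypothetical, and it covers only the threshold test described above, not any other scheduling a tracker may apply around template updates.

```python
def select_template(current_template, candidate_template, score, tau=0.6):
    """Keep the candidate only if its reliability score exceeds the threshold."""
    return candidate_template if score > tau else current_template
```

For example, a candidate scored 0.7 replaces the current template, while one scored exactly 0.6 or lower leaves the current template in place.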

The algorithm is trained in two steps: the tracker is trained first, and then the TRSM module is trained separately with the tracker model parameters fixed. The experiments were conducted on an Ubuntu Server system equipped with four RTX 3090 GPUs with 24 GB of memory each. We trained our algorithm on several datasets: LaSOT, GOT-10K, COCO [37], and TrackingNet.

4.3. Evaluation Metrics

Average overlap (AO) [34]: Average overlap refers to the mean value of the overlap score, which is computed as the Intersection over Union (IoU) between the prediction box and the ground truth, as specified by Equation (4).

Success [32]: Success denotes the ratio of frames with an overlap score exceeding a particular threshold (typically 0.5) to the total number of frames. The success plot can be generated by varying the threshold over the range of 0 to 1.

Precision [32]: Precision is characterized by the center location error (CLE), which is defined as the Euclidean distance between the predicted box center and the ground truth center. The ratio of frames with CLE less than a certain threshold (usually 20 pixels) to the total number of frames represents the precision. The precision plot can be plotted by varying the threshold values.
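The three metrics above reduce to simple per-frame statistics. The sketch below assumes the per-frame IoU values and box centers have already been computed; the function names are hypothetical, and the default thresholds are the typical values quoted in the definitions (0.5 for overlap, 20 pixels for center location error).

```python
import math


def average_overlap(ious):
    """Mean IoU over all frames."""
    return sum(ious) / len(ious)


def success_rate(ious, threshold=0.5):
    """Fraction of frames whose overlap score exceeds the threshold."""
    return sum(1 for value in ious if value > threshold) / len(ious)


def precision(pred_centers, true_centers, threshold=20.0):
    """Fraction of frames whose center location error is below the threshold."""
    errors = [math.dist(p, t) for p, t in zip(pred_centers, true_centers)]
    return sum(1 for error in errors if error < threshold) / len(errors)
```

Sweeping `threshold` over a range and recording the resulting rates produces the success and precision plots; the area under the success plot is the AUC reported later.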

Normalized precision (NP) [33]: Normalized precision is a measure that normalizes the precision by the real target box size to mitigate the adverse effects caused by the target scale. It is computed as follows:

$$P_{NP} = \left\| W \left(C^{tr} - C^{gt}\right) \right\|_{2}, \tag{8}$$

$$W = diag\left(BB_{x}^{gt}, BB_{y}^{gt}\right), \tag{9}$$

where $C^{tr}$ and $C^{gt}$ represent the center position of the prediction box and the ground truth, respectively. The function $diag(\cdot)$ denotes the diagonal matrix. $BB_{x}^{gt}$ and $BB_{y}^{gt}$ denote the size of the ground truth box along the $x$ and $y$ axes, respectively. The normalized precision plot is obtained by varying the threshold values from 0 to 0.5.
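A minimal sketch of the normalized center error follows. It rests on one interpretive assumption, stated plainly: because the measure is described as normalizing by the ground truth box size, the sketch divides each axis of the center offset by the ground truth width and height, which treats the weighting in Equations (8) and (9) as a per-axis rescaling. The function names and the single example threshold of 0.2 are hypothetical.

```python
import math


def normalized_center_error(pred_center, true_center, true_size):
    """Center offset with each axis divided by the ground truth box size."""
    dx = (pred_center[0] - true_center[0]) / true_size[0]
    dy = (pred_center[1] - true_center[1]) / true_size[1]
    return math.hypot(dx, dy)


def normalized_precision(pred_centers, true_centers, true_sizes, threshold=0.2):
    """Fraction of frames whose normalized center error is below the threshold."""
    errors = [
        normalized_center_error(p, t, s)
        for p, t, s in zip(pred_centers, true_centers, true_sizes)
    ]
    return sum(1 for error in errors if error < threshold) / len(errors)
```

Because the error is expressed in units of the target's own width and height, the same pixel offset counts for less on a large target than on a small one, which is the scale effect the metric is meant to remove.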

Expected average overlap (EAO) [35]: The VOT2020 dataset employs an anchor-based evaluation protocol that consists of accuracy, robustness, and expected average overlap (EAO). EAO is a composite measure that combines accuracy and robustness and is used to rank algorithms in the VOT2020 dataset. The related calculation is presented in [35].

4.4. Comparison of Results

Evaluation on LaSOT. We evaluate the performance of our proposed algorithm on the LaSOT test set. AUC (the area under the success plot curve) and P (the precision) are used to compare our algorithm against other trackers. The results presented in Table 1 show that our algorithm outperforms all other compared trackers, with an AUC value of 67.1%, which is 0.7 percentage points higher than that of the STARK-ST50 algorithm (66.4%). In terms of precision (P), our algorithm achieves a score of 72.2%, which is 1.0 percentage points higher than the STARK-ST50 algorithm’s score of 71.2%. The precision and success plots are shown in Figure 4. These results demonstrate the effectiveness of our approach in improving both success rate and precision.

Evaluation on TrackingNet. We evaluate our algorithm on the TrackingNet test set by submitting the results to the official server. The results are presented in Table 2, where $P_{norm}$ represents the area under the curve of the normalized precision plot. Our algorithm achieves a $P_{norm}$ value of 87.2%, an improvement of 1.1 and 1.8 percentage points over the STARK-ST50 and SiamRCNN algorithms, respectively, which makes it the best performer among all the compared algorithms. This supports the effectiveness of our approach.

Evaluation on GOT-10K. In this experiment, we followed the standard protocol prescribed by the GOT-10K benchmark: training solely on the training set and then evaluating the model on the test set. The test results were submitted to the official server for evaluation. As presented in Table 3, we report the success rate at two thresholds, SR0.5 and SR0.75. Our approach surpasses the STARK-ST50 algorithm in both SR0.5 and SR0.75. Notably, our algorithm achieves an AO value of 69.7%, which is 1.7 percentage points higher than that of the STARK-ST50 algorithm and the highest among all compared algorithms. This indicates that our approach also improves the AO metric.

Evaluation on VOT2020. Table 4 presents the evaluation results of our algorithm in terms of robustness and accuracy, as well as the expected average overlap (EAO) metric. Our algorithm shows similar robustness compared to the STARK algorithm, but outperforms it in terms of accuracy. Moreover, our algorithm achieves the best EAO score compared to all other evaluated algorithms.

4.5. Speed, FLOPs, and Params

Table 5 compares our algorithm with the baseline algorithm, STARK-ST50, in terms of Speed, Params, and FLOPs. STARK-ST50 achieves a Speed of 27 fps, requires 28.2 M Params, and has 12.8 G FLOPs, while our algorithm achieves 25 fps, requires 29.8 M Params, and has 12.9 G FLOPs. Compared to STARK-ST50, our algorithm shows a slight increase in both FLOPs and Params. Importantly, our algorithm maintains real-time tracking performance.
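As a quick arithmetic check on the figures quoted above, the relative overhead with respect to the baseline can be computed directly. The helper name `relative_change` is hypothetical; the inputs in the usage note are the values reported in this section.

```python
def relative_change(baseline, ours):
    """Percentage change of `ours` relative to `baseline`."""
    return (ours - baseline) / baseline * 100.0
```

Applied to the reported numbers, `relative_change(28.2, 29.8)` is about 5.7% more parameters, `relative_change(12.8, 12.9)` is about 0.8% more FLOPs, and `relative_change(27, 25)` is about 7.4% lower frame rate.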

4.6. Ablation Study

In the ablation experiments, we optimized the STARK algorithm using DIoU, CIoU [50], and the SIoU loss function, which included directional guidance, as well as the TRSM module. The resulting algorithm was then validated on the TrackingNet and LaSOT test set to evaluate the impact of each loss function and the TRSM module on algorithm performance. The experimental results are shown in Table 6.

The experimental results demonstrate that the impact of DIoU and CIoU on algorithm performance is minimal, resulting in an improvement of only 0.3 percentage points in the $P_{norm}$ metric and slight improvements in the other metrics. However, using the SIoU loss function with angle information improved the $P_{norm}$ metric on the TrackingNet dataset by 0.8 percentage points and the P metric on LaSOT by 0.6 percentage points, and also improved the AUC metric on both datasets. These results suggest that angle information is beneficial for improving the algorithm’s precision and success rate.

Furthermore, when the TRSM module is used in isolation, the $P_{norm}$ metrics increase by 0.5 percentage points, and the other metrics also improve. This underscores the effectiveness of the TRSM module in optimizing the template update of the STARK algorithm. These findings indicate that mining the time series information of tracking results can enhance the algorithm’s ability to assess the reliability of tracking results. When used in conjunction with the SIoU loss function, the TRSM module and the SIoU loss function complement each other, resulting in further improvements in tracking precision and success rate.

4.7. Qualitative Results Analysis

Figure 5 depicts a visual comparison between the STARK-ST50 algorithm and our proposed algorithm on the LaSOT dataset. In the “bird-2” video sequence, the target in frame 255 is obstructed by similar objects, causing all algorithms to track incorrectly. However, our algorithm successfully repositioned the target in frame 275 and accurately captured it again in frame 345 after being partially occluded. In contrast, STARK-ST50 was unable to distinguish the interfering object. In the “fox-20” video sequence, both algorithms were able to track the target well in frames 2100 and 2102. However, STARK-ST50 displayed an abnormal position change in the prediction box from 2102 to 2104, resulting in an incorrect target position, which was gradually rectified in subsequent frames. In contrast, our algorithm tracked the target steadily throughout.

Overall, our algorithm demonstrated a superior ability to discriminate interferers and suppress abnormal abrupt changes in predicted box positions, leading to improved tracking accuracy and success rate. The observed improvement in performance can be attributed to the optimization of the algorithm’s template update, resulting from its enhanced ability to discriminate the reliability of tracking results, as well as its proficiency in capturing the motion of the target.

5. Discussion

While comprehensive experiments validated the effectiveness of the loss function with directional guidance and the TRSM module, further analysis of their characteristics is still necessary. Our method optimizes the object-tracking algorithm separately in the training and tracking phases and is not dependent on any specific algorithm, making it applicable to various deep learning-based object-tracking algorithms to improve their tracking accuracy. When applied to the STARK algorithm, our method effectively improves its performance with only a small increase in parameters. Although the algorithm in this paper is not compared with the most advanced algorithms, the consistent gains over the baseline indicate that our method is useful for improving tracking accuracy. In particular, the TRSM module is applicable to long-term object-tracking algorithms, which determine target loss based on the reliability of tracking results before conducting target re-detection.

However, additional research is still required for the proposed method. Firstly, the TRSM module must rely on the object-tracking algorithm for training, which may prove inconvenient. Therefore, our future research work will focus on optimizing the TRSM module and conducting independent training to make it a completely independent module for easier application to various object-tracking algorithms. Secondly, the hyperparameters set in the loss function were based on research in the object detection task, which could have an adverse impact on the object-tracking task. In the future, we will investigate the sensitivity of these hyperparameters in the object-tracking task.

6. Conclusions

In this paper, we proposed an object-tracking algorithm that leverages motion direction and time series information. To account for the lack of consideration of target motion direction, we introduced a loss function that incorporates directional guidance. Furthermore, we designed a tracking result scoring module (TRSM) based on the attention mechanism by integrating tracking result time series information. The TRSM module improves the algorithm’s ability to evaluate the reliability of tracking results and facilitates template updates. Our algorithm’s performance was assessed through various experiments conducted on LaSOT, TrackingNet, GOT-10K, and VOT2020 datasets, and it demonstrated notable improvements in tracking accuracy. The ablation experiments and qualitative analysis also validated the effectiveness of the proposed method. Our findings have important implications for the research aimed at enhancing the accuracy and success rate of deep learning-based object-tracking algorithms and improving the reliability of tracking results. However, the TRSM module still requires the support of the object-tracking algorithm during training. To address this limitation, in our future research, we aim to improve the TRSM module by exploring alternative training methods that do not involve object-tracking algorithms. Additionally, we will investigate the effect of various hyperparameters on object-tracking performance in the loss function. Moreover, we will apply the proposed method to long-term object-tracking scenarios that more closely resemble real-life situations.

Author Contributions

Conceptualization, J.S., C.W. and S.Y.; methodology, J.S.; software, J.S.; validation, J.S.; formal analysis, J.S., C.W. and S.Y.; investigation, J.S. and C.W.; resources, C.W. and S.Y.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, J.S. and C.W.; visualization, J.S.; supervision, C.W. and S.Y.; project administration, S.Y.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank our friends for their suggestions on this paper and the program.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 24–27 October 2017; pp. 1125–1134.
2. Bonatti, R.; Ho, C.; Wang, W.; Choudhury, S.; Scherer, S. Towards a robust aerial cinematography platform: Localizing and tracking moving targets in unstructured environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 229–236.
3. Karaduman, M.; Cınar, A.; Eren, H. UAV traffic patrolling via road detection and tracking in anonymous aerial video frames. J. Intell. Robot. Syst. 2019, 95, 675–690.
4. Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep learning for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 3943–3968.
5. Soleimanitaleb, Z.; Keyvanrad, M.A.; Jafari, A. Object tracking methods: A review. In Proceedings of the 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 24–25 October 2019; pp. 282–288.
6. Zhou, J.; Yao, Y.; Yang, R. Deep Learning for Single-object Tracking: A Survey. In Proceedings of the 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 10–12 June 2022; pp. 12–19.
7. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618.
8. Wei, Y.; Hua, Y.; Xiang, W. Research on Specific Long-term Single Object Tracking Algorithms in the Context of Traffic. Procedia Comput. Sci. 2022, 214, 304–311.
9. Wang, J.; Yang, H.; Xu, N.; Wu, C.; Wu, D.O. Long-term target tracking combined with re-detection. EURASIP J. Adv. Signal Process. 2021, 2021, 1–16.
10. Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302.
11. Wang, N.; Yeung, D.-Y. Learning a deep compact image representation for visual tracking. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 809–817.
12. Nam, H.; Baek, M.; Han, B. Modeling and propagating cnns in a tree structure for visual tracking. arXiv 2016, arXiv:1608.07242.
13. Wang, L.; Ouyang, W.; Wang, X.; Lu, H. Stct: Sequentially training convolutional networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1373–1381.
14. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. Eur. Conf. Comput. Vis. 2016, 9914, 850–865.
15. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. AAAI Conf. Artif. Intell. 2020, 34, 12549–12556.
16. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
17. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677.
18. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277.
19. Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6578–6588.
20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
21. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
22. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580.
23. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. Stmtrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783.
24. Yu, B.; Tang, M.; Zheng, L.; Zhu, G.; Wang, J.; Feng, H.; Feng, X.; Lu, H. High-performance discriminative tracking with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 9856–9865.
25. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 10448–10457.
26. Han, B.; Sim, J.; Adam, H. Branchout: Regularization for online ensemble tracking with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3356–3365.
27. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429.
28. Gupta, D.K.; Arya, D.; Gavves, E. Rotation equivariant siamese networks for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12362–12371.
29. Bian, T.; Hua, Y.; Song, T.; Xue, Z.; Ma, R.; Robertson, N.; Guan, H. Vtt: Long-term visual tracking with transformers. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 9585–9592.
30. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
31. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740.
32. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383.
33. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 300–317.
34. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
35. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.-K.; Danelljan, M.; Zajc, L.Č.; Lukežič, A.; Drbohlav, O. The eighth visual object tracking VOT2020 challenge results. Eur. Conf. Comput. Vis. 2020, 12539, 547–601.
36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
37. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. Eur. Conf. Comput. Vis. 2014, 8693, 740–755.
38. Dai, K.; Zhang, Y.; Wang, D.; Li, J.; Lu, H.; Yang, X. High-performance long-term tracking with meta-updater. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6298–6307.
39. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 6182–6191.
40. Danelljan, M.; Bhat, G. PyTracking: Visual Tracking Library Based on PyTorch. Available online: https://github.com/visionml/pytracking (accessed on 31 March 2021).
41. Yan, B.; Zhang, X.; Wang, D.; Lu, H.; Yang, X. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5289–5298.
42. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291.
43. Zhang, Q.; Wang, Z.; Liang, H. SiamRDT: An object tracking algorithm based on a reliable dynamic template. Symmetry 2022, 14, 762.
44. Danelljan, M.; Gool, L.V.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7183–7192.
45. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669.
46. Deng, A.; Liu, J.; Chen, Q.; Wang, X.; Zuo, Y. Visual Tracking with FPN Based on Transformer and Response Map Enhancement. Appl. Sci. 2022, 12, 6551.
47. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. Eur. Conf. Comput. Vis. 2020, 12366, 771–787.
48. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
49. Bhat, G.; Johnander, J.; Danelljan, M.; Khan, F.S.; Felsberg, M. Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 483–498.
50. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.

Figure 1. The framework of our object-tracking algorithm. The TRSM module assumes the responsibility of identifying high-quality tracking results for template updates. The loss function with direction contributes to guiding the model in capturing commonalities associated with the target’s motion.

Figure 2. Illustration of angular quantity calculation.

Figure 3. Tracking result scoring module (TRSM). TRSM consists of three stages: stage 1 focuses on extracting features from both history and tracking results; stage 2 processes time series based on attention mechanism; and stage 3 calculates confidence value by fully connected network (FCN).

Figure 4. Accuracy and success rate on the LaSOT test set.

Figure 5. Comparison of visual tracking results on the LaSOT dataset.

Table 1. Evaluation on the LaSOT test set. Bold indicates best result.

Tracker	Published	AUC (%)	P (%)
LTMU [38]	2020	56.8	56.4
DiMP50 [39]	2019	57.9	57.7
TrDiMP [22]	2021	63.9	66.2
SuperDiMP [40]	2019	63.9	66.2
SiamRCNN [19]	2020	64.8	68.4
TransT [21]	2021	64.9	69.0
AlphaRefine [41]	2021	65.9	68.8
STARK-ST50 [25]	2021	66.4	71.2
Ours	-	67.1	72.2

Table 2. Evaluation on the TrackingNet test set. Bold indicates best result.

Tracker	Published	AUC (%)	$P_{norm}$ (%)
SiamRPN++ [42]	2019	73.3	80.0
DiMP50 [39]	2019	74.0	80.1
SiamRDT [43]	2022	74.6	-
SiamFC++ [15]	2020	75.4	80.0
PrDiMP50 [44]	2020	75.8	81.6
AlphaRefine [41]	2021	80.5	85.6
SiamRCNN [19]	2020	81.2	85.4
TransT [21]	2021	81.4	86.7
STARK-ST50 [25]	2021	81.3	86.1
Ours	-	81.9	87.2

Table 3. Evaluation on the GOT-10K test set. Bold indicates best result.

Tracker	Published	AO (%)	SR0.5 (%)	SR0.75 (%)
ATOM [45]	2019	55.6	63.4	40.2
TR-Siam [46]	2022	58.2	68.3	45.7
SiamFC++ [15]	2020	59.5	69.5	47.9
DiMP50 [39]	2019	61.1	71.7	49.2
Ocean [47]	2020	61.1	72.1	47.3
SiamRDT [43]	2022	61.3	72.5	49.2
PrDiMP50 [44]	2020	63.4	73.8	54.3
SiamRCNN [19]	2020	64.9	72.8	59.7
TransT [21]	2021	67.1	76.8	60.9
STARK-ST50 [25]	2021	68.0	77.7	62.3
Ours	-	69.7	79.4	63.9

Table 4. Evaluation on the VOT2020 dataset. Bold indicates best result.

Tracker	Published	EAO	Accuracy	Robustness
KCF [48]	2014	0.154	0.407	0.432
SiamFC [14]	2016	0.179	0.418	0.502
ATOM [45]	2019	0.271	0.462	0.734
DiMP50 [39]	2019	0.274	0.457	0.740
UPDT [49]	2018	0.278	0.465	0.755
SuperDiMP [40]	2019	0.305	0.477	0.786
STARK-ST50 [25]	2021	0.308	0.478	0.799
Ours	-	0.313	0.486	0.798

Table 5. Comparison of the speed, FLOPs, and Params.

Tracker	Speed (fps)	Params (M)	FLOPs (G)
Ours	25	29.8	12.9
STARK-ST50	27	28.2	12.8

Table 6. Ablation study on the TrackingNet and LaSOT test sets; “√” indicates that the method was used.

	STARK-ST50	DIOU	CIOU	SIOU	TRSM	TrackingNet $P_{norm}$ (%)	TrackingNet AUC (%)	LaSOT P (%)	LaSOT AUC (%)
1	√					86.1	81.3	71.2	66.4
2	√	√				86.4	81.4	71.3	66.4
3	√		√			86.4	81.4	71.4	66.5
4	√			√		86.9	81.6	71.8	66.8
5	√				√	86.6	81.5	71.6	66.7
6	√			√	√	87.2	81.9	72.2	67.1
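
To make the ablation easier to read, the short sketch below computes each configuration's gain over the STARK-ST50 baseline (row 1) in percentage points. The values are copied from Table 6; the names `ROWS` and `gains_over_baseline` are hypothetical helpers introduced only for this illustration.

```python
# Rows copied from Table 6:
# (configuration, TrackingNet P_norm, TrackingNet AUC, LaSOT P, LaSOT AUC), all in %.
ROWS = [
    ("STARK-ST50", 86.1, 81.3, 71.2, 66.4),
    ("+DIOU", 86.4, 81.4, 71.3, 66.4),
    ("+CIOU", 86.4, 81.4, 71.4, 66.5),
    ("+SIOU", 86.9, 81.6, 71.8, 66.8),
    ("+TRSM", 86.6, 81.5, 71.6, 66.7),
    ("+SIOU+TRSM", 87.2, 81.9, 72.2, 67.1),
]


def gains_over_baseline(rows):
    """Return {configuration: per-metric gain over the first row}, rounded to 0.1."""
    baseline = rows[0][1:]
    return {
        name: tuple(round(v - b, 1) for v, b in zip(values, baseline))
        for name, *values in rows[1:]
    }


if __name__ == "__main__":
    for name, deltas in gains_over_baseline(ROWS).items():
        print(name, deltas)
```

Read this way, the SIOU-based loss and the TRSM each improve all four metrics on their own, and the combined configuration (row 6) gives the largest gain on every metric.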

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
