1. Introduction
Object tracking is one of the fundamental tasks of computer vision: given the initial information about an arbitrary object, the tracker locates that object in each subsequent frame of a video and returns its position. Unmanned Aerial Vehicle (UAV) tracking refers specifically to the processing of videos captured from UAVs. As one of the important tasks of remote sensing observation, UAV tracking is widely used in many fields [1,2], such as aerial inspection, aerial photography, and visual localization [3].
Currently, the mainstream trackers for object tracking include Discriminative Correlation Filters (DCFs) and Siamese Networks (SNs). In DCFs, a filter is trained by learning the object template online [4]. The DCF then finds the object by matching the search region against the filter, while updating the filter during tracking. SNs consist of a template branch and a search branch; these two branches realize object tracking by minimizing the distance between the target template and the target in the search patch [5]. An SN first obtains the feature maps of the template patch and the search region through the same feature extraction network, and then finds the object by similarity matching between the two feature maps. Both DCFs and SNs share the common goal of matching the similarity information between the template patch and the search region to effectively distinguish the object from the background. In recent years, SNs have outperformed DCFs in terms of accuracy and efficiency, so many trackers adopt Siamese networks for UAV tracking [6]. The core of classical SNs is to establish an accurate appearance model of the given object as the template, so that the correct object can be found reliably when the template is used for matching. In this process, the importance of the template is evident.
However, there are two inherent limitations when SNs are used for UAV tracking. Firstly, the template is created only from the ground truth box of the first frame of the tracking sequence and is not modified during the whole tracking process [5]. During tracking, especially long-term tracking, the object inevitably undergoes appearance variations, and a fixed template is not conducive to accurate similarity matching. The consecutive frames processed by the tracker contain rich temporal context information that could be used for template updates, yet most SNs ignore it, which makes the tracker ineffective when the object undergoes obvious appearance changes. Secondly, UAV platforms are accompanied by changes in flight altitude and horizontal movement, which cause scale variation and observation-angle changes of the object. Moreover, aerial scenarios are more complex, often containing similar objects and background interference; such challenges place higher demands on the feature extraction capability of trackers. However, it is difficult to run heavy, highly accurate feature extraction networks on UAVs because of their limited computing platforms, while shallow CNN backbones bring fixed receptive fields with single-scale, low-order features. This limited feature extraction capability cannot provide multiscale features, and it is difficult to achieve long-range, high-order interactions between pixels, which loses much of the spatial information.
To make up for the first limitation, some SNs try to use template replacement [7] or linear template updates [8]. However, directly and bluntly replacing the template cannot guarantee the correctness of the target information, and a linear template update is severely restricted by the linear model. To increase the spatial information and break the spatial limitation of the CNN's fixed receptive field, many trackers use transformer structures for feature extraction [9,10]. They do improve tracking accuracy by increasing the spatial discrimination of the features, but the computation of a transformer grows quadratically with the input size. Since the input image contains at least 255 × 255 pixels, using a transformer directly for feature extraction requires a computational amount that is difficult to afford on UAV platforms. This does not mean that transformers cannot be run on UAV platforms, but the input size of the transformer must be considered.
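To make this cost argument concrete, the following back-of-the-envelope Python snippet compares the number of query-key pairs a single self-attention layer must score when every pixel of a 255 × 255 input is a token versus when only a small template feature map is attended over. The 6 × 6 template size is an illustrative assumption, not a value taken from this paper.

```python
# Self-attention score matrices scale with the square of the token count.
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs a self-attention layer must score."""
    return num_tokens * num_tokens

pixel_tokens = 255 * 255       # ~65k tokens if every pixel of the search image is a token
template_tokens = 6 * 6        # ~36 tokens for a small template feature map (assumed size)

print(attention_pairs(pixel_tokens))     # ~4.2e9 pairs -> prohibitive on a UAV platform
print(attention_pairs(template_tokens))  # 1296 pairs   -> negligible extra cost
```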
Inspired by the above research, we propose a lightweight spatial-temporal context Siamese tracker for UAV tracking. Specifically, a high-order multiscale spatial module is designed to extract multiscale, long-range, high-order spatial information, and a temporal template transformer is used to update the template by introducing temporal context information. The proposed high-order multiscale spatial module first adopts adaptive channel splitting and convolution to extract multiscale features. Then, high-order self-interaction of the feature maps is realized by recursive multiplication, which achieves not only parallel feature extraction over multiscale receptive fields but also high-order interaction of features. The proposed temporal template transformer introduces target context information from continuous frames to adjust the template. Instead of manually specifying a template update mechanism, we adaptively introduce temporal context information through a transformer encoder-decoder structure. Although we use a transformer structure here, the template size is small, so the added computation of the temporal template transformer is acceptable compared with the high cost of applying a transformer directly to pixels. Our main contributions are as follows:
- (1)
A spatial-temporal contextual aggregation Siamese network (SiamST) is proposed for UAV tracking to improve tracking performance by enhancing horizontal spatial information and introducing vertical temporal information.
- (2)
We propose a high-order multiscale spatial module to extract multiscale spatial information and enrich the spatial representation. Through adaptive channel-splitting convolution and recursive multiplication, the feature map achieves high-order self-interaction. This interaction architecture effectively overcomes the limited expressiveness of the inherent linear sequential mapping of traditional CNN-based models for shallow features, without changing the feature depth or introducing a huge amount of computation.
- (3)
The proposed temporal template transformer adaptively introduces temporal context information to realize adaptive template updates through the encoder and decoder structure, thereby enabling effective perception of dynamic feature information flows with a lightweight architectural design.
- (4)
Experimental results on three UAV tracking benchmarks, UAV20L, UAV123@10fps, and DTB70, demonstrate that the proposed SiamST achieves impressive performance compared with other cutting-edge methods and can reach 75.5 fps on an NVIDIA RTX 3060Ti.
The rest of this article is arranged as follows: In Section 2, we briefly review the work related to SNs, UAV trackers, and spatial-temporal context information in UAV tracking. In Section 3, we introduce the proposed SiamST in detail. In Section 4, we describe the details of the experiments, including test benchmarks, experimental settings, ablation experiments, and comparison results. In Section 5, we discuss the experimental results, the limitations of the proposed method, and directions for future work. Section 6 provides the conclusion.
3. Method
This section describes the proposed SiamST. As shown in Figure 1, the proposed SiamST contains four main parts: a feature extraction network, a high-order multiscale spatial module and temporal template transformer, a cross-correlation fusion network, and a prediction head. We first briefly introduce the overall process of the proposed SiamST, as well as the input and output of each part. Then, we focus on the proposed high-order multiscale spatial module and temporal template transformer.
3.1. Overall Objective
Given a UAV tracking sequence, we take the ground truth of the first frame as the template patch $Z \in \mathbb{R}^{127 \times 127 \times 3}$ and each subsequent frame of the sequence as the search patch $X \in \mathbb{R}^{255 \times 255 \times 3}$, where 127 and 255 are the sizes of the template patch and search patch, respectively, and each patch has three channels. The template and search patches are first sent into the backbone simultaneously to obtain their respective feature maps $F_Z$ and $F_X$. In this work, we choose the lightweight CNN AlexNet as the backbone. To further extract spatial information, $F_Z$ and $F_X$ are then sent into the proposed high-order multiscale spatial module, and we denote the outputs as $\hat{F}_Z$ and $\hat{F}_X$. Thirdly, the temporal template transformer is used to update the template. Its inputs are the original template feature of the first frame, denoted $T_0$ (i.e., $\hat{F}_Z$), and the tracking results of consecutive frames serving as reference templates $\{T_1, \ldots, T_m\}$. The updated template is output as $\tilde{T}_0$. Then, we use the cross-correlation fusion network to compute the similarity between $\tilde{T}_0$ and $\hat{F}_X$ and generate the response map $R$:

$$R = \tilde{T}_0 \star \hat{F}_X,$$

where $\star$ is the cross-correlation fusion operation. Finally, by decoding the response map, the prediction head obtains the accurate object estimate.
During the training process, the template patch and search patch are input to the network in pairs. In addition, multiple template patches are input to the network as reference templates, so the tracking network is trained using multiple templates and a single search patch. We take the template corresponding to the search region as the original template and the remaining reference templates as the temporal context template patches. This input is used to train the proposed temporal template transformer.

During the testing process, we take the template information of the first frame as the original template. In the second frame of the tracking sequence, the only available template information is the original template, so we take the original template as the only input of the proposed temporal template transformer. When the tracking result of the second frame is obtained, we first determine the maximum position of the response map, i.e., the location of the object center in the second frame, and then select the corresponding position of the search feature map as the target template feature map of this frame. The advantage of using the response map to locate the object is that it avoids both the computation of the tracking head to judge the target position and a second pass of feature extraction.
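To make the data flow above concrete, the following is a minimal PyTorch-style sketch of the forward pass. The module stand-ins (a toy two-layer backbone, a single convolution in place of the high-order multiscale spatial module, one multihead-attention layer in place of the temporal template transformer, a depthwise cross-correlation) and all layer sizes are illustrative placeholders rather than the actual SiamST implementation; the reference templates are assumed to be already-extracted feature maps of the same size as the template feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def xcorr_depthwise(search_feat, template_feat):
    """Depthwise cross-correlation: slide the template feature over the search feature."""
    b, c, h, w = search_feat.shape
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[-2:])
    out = F.conv2d(search_feat.reshape(1, b * c, h, w), kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

class SiamSTSketch(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Stand-ins for the real components (AlexNet backbone, HMSM, TTTrans).
        self.backbone = nn.Sequential(nn.Conv2d(3, channels, 11, stride=4), nn.ReLU(),
                                      nn.Conv2d(channels, channels, 5, stride=2), nn.ReLU())
        self.hmsm = nn.Conv2d(channels, channels, 3, padding=1)         # placeholder
        self.ttt = nn.MultiheadAttention(channels, num_heads=8, batch_first=True)
        self.head = nn.Conv2d(channels, 2 + 4 + 1, 1)                   # cls / reg / centerness

    def forward(self, template, search, reference_feats):
        # template: (B,3,127,127); search: (B,3,255,255);
        # reference_feats: (B, M, C, Hz, Wz) pre-extracted reference template features.
        f_z = self.hmsm(self.backbone(template))          # enhanced template feature T0
        f_x = self.hmsm(self.backbone(search))            # enhanced search feature
        z_tok = f_z.flatten(2).transpose(1, 2)            # template pixels as tokens (B,HW,C)
        ref_tok = reference_feats.mean(dim=1).flatten(2).transpose(1, 2)
        updated, _ = self.ttt(ref_tok, z_tok, z_tok)      # temporal update (references query T0)
        f_z = (z_tok + updated).transpose(1, 2).reshape_as(f_z)
        response = xcorr_depthwise(f_x, f_z)              # response map R
        return self.head(response)                        # decoded by the prediction head
```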
3.2. High-Order Multiscale Spatial Module
The specific structure of the proposed high-order multiscale spatial module is shown in Figure 2. We first implement multiscale feature extraction by channel splitting and multiple residual convolutions using a residual block hybrid structure. In the process of feature extraction, the number of channels of a feature map represents the diversity of the extracted features. In the designed high-order multiscale spatial module, we first carry out adaptive channel splitting:

$$[x_1, x_2, x_3, x_4] = \mathrm{Split}(F),$$

where $F$ is the feature map obtained through the feature extraction network and $\{x_i\}$ is the feature set obtained by splitting. The residual convolution of the obtained feature set is then carried out:

$$y_i = \begin{cases} \mathrm{Conv}_{3 \times 3}(x_i), & i = 1, \\ \mathrm{Conv}_{3 \times 3}(x_i + y_{i-1}), & 1 < i \le 4, \end{cases}$$

where $\mathrm{Conv}_{3 \times 3}$ is the $3 \times 3$ convolution operation and $y_i$ is the output. Through adaptive splitting of channels, complex features can be extracted through multiple convolution residual structures to obtain deep features, while shallow features adaptively undergo fewer convolution calculations. The feature sets thus adaptively pass through one, two, three, or four convolutions. We compare the effects of different numbers of convolutions on the feature scale.
As shown in Figure 2, the input area covered by a $3 \times 3$ convolution kernel is $3 \times 3$, which means each pixel in the output feature map corresponds to $3 \times 3$ pixels of the input feature map. Stacking convolution operations enlarges the corresponding input region: the receptive fields after one, two, three, and four convolutions are $3 \times 3$, $5 \times 5$, $7 \times 7$, and $9 \times 9$, respectively. Through the above structure, we obtain different receptive fields and thereby multiscale features. In conclusion, this adaptive, nonuniform convolution scheme enhances the multiscale property and feature extraction ability of the model.
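The following is a minimal sketch of the channel splitting and cascaded residual convolutions described above, written in the style of Res2Net; the number of groups (four) follows the text, while layer details such as the kernel configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiscaleSplitConv(nn.Module):
    """Cascaded split-conv block: the deepest path through group i passes i stacked
    3x3 convolutions, so the four groups see receptive fields of 3, 5, 7, and 9."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        width = channels // groups
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1) for _ in range(groups)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, self.groups, dim=1)    # adaptive channel splitting
        outs, prev = [], 0
        for conv, s in zip(self.convs, splits):
            prev = conv(s + prev)                      # residual reuse of the previous group
            outs.append(prev)                          # later groups -> larger receptive field
        return torch.cat(outs, dim=1)

# Example: MultiscaleSplitConv(256)(torch.randn(1, 256, 16, 16)) keeps the input shape.
```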
After that, inspired by HorNet, we implement long-range and high-order feature interaction. We first apply positional encoding to the features. After the positional encoding, a $1 \times 1$ convolution is used to divide the features into multiple groups; notably, to realize long-range interaction over the feature map, a large-kernel depth-wise convolution $\mathrm{DWConv}$ is used to realize the multihead mapping. The calculation formula is as follows:

$$[p_0, q_0, q_1, \ldots, q_{n-1}] = \phi_{\mathrm{in}}(Y),$$

where $Y = [y_1, y_2, y_3, y_4]$ is the concatenated multiscale feature and $\phi_{\mathrm{in}}$ is the $1 \times 1$ projection. Next, we perform the grouped recursive multiplication:

$$p_{k+1} = f_k(p_k) \odot \mathrm{DWConv}(q_k), \quad k = 0, 1, \ldots, n-1,$$

where $\mathrm{DWConv}$ is the depth-wise convolution operation, $f_0$ is the identity, and $f_k$ ($k \ge 1$) is a $1 \times 1$ convolution that realizes the alignment of channels. In the calculation process, we realize the high-order self-interaction of spatial information through element-wise multiplication of the feature maps. In this paper, we choose third-order computation ($n = 3$) to balance computational efficiency while realizing higher-order interaction of spatial information. After splitting, perception, and multihead mapping with channel alignment, the aggregation of long-range spatial contextual information is achieved on shallower feature channels with low computational effort. The multiscale and multidimensional feature rearrangement guided by spatial contextual information breaks the limitation of mechanical feature transfer caused by the linear mapping mechanism of the traditional CNN backbone.
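Below is a simplified sketch of the recursive multiplication idea, patterned after HorNet's gnConv with order 3. The channel split ratios, the 7 × 7 depth-wise kernel, and the projection layout are assumptions for illustration rather than the exact configuration used in SiamST.

```python
import torch
import torch.nn as nn

class HighOrderInteraction(nn.Module):
    """Recursive gated convolution in the spirit of HorNet's gnConv (order 3):
    a 1x1 projection splits the features, a large depth-wise convolution mixes
    long-range spatial context, and repeated element-wise multiplication raises
    the interaction order step by step."""
    def __init__(self, dim: int, order: int = 3, dw_kernel: int = 7):
        super().__init__()
        self.order = order
        self.dims = [dim // 2 ** i for i in range(order)][::-1]       # e.g. [C/4, C/2, C]
        self.proj_in = nn.Conv2d(dim, self.dims[0] + sum(self.dims), 1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), dw_kernel,
                                padding=dw_kernel // 2, groups=sum(self.dims))
        self.align = nn.ModuleList(
            nn.Conv2d(self.dims[i], self.dims[i + 1], 1) for i in range(order - 1)
        )
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p, q = torch.split(self.proj_in(x), [self.dims[0], sum(self.dims)], dim=1)
        q = torch.split(self.dwconv(q), self.dims, dim=1)   # long-range spatial context
        p = p * q[0]                                        # first-order interaction
        for i in range(self.order - 1):
            p = self.align[i](p) * q[i + 1]                 # raise the interaction order
        return self.proj_out(p)

# Example: HighOrderInteraction(256)(torch.randn(1, 256, 16, 16)) keeps the input shape.
```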
3.3. Temporal Template Transformer
In traditional transformer structures, the self-attention mechanism only performs self-correlation of sequential feature blocks in multiple subspaces, which makes it difficult to mine the long-distance spatial-temporal contextual information that is crucial for aerial tracking. Different from the traditional transformer architecture, we introduce multistream reference temporal context information before the feature interaction in the subspaces and inject dynamic long-range relevance information into the current feature processing, which achieves a lightweight and efficient exploration of the temporal context information used for feature updating and saliency optimization. We now introduce the proposed temporal template transformer in detail. As shown in Figure 3, the proposed temporal template transformer contains an encoder and a decoder. The input of the encoder is the original template and the reference templates of consecutive frames, and the input of the decoder is the original template. The function of the encoder is to obtain the temporal context information for template enhancement by comparing and fusing the reference templates with the original template, while the function of the decoder is to adaptively update the original template with the temporal context information obtained by the encoder. This process avoids a complex template update mechanism and sensitive hyperparameter tuning.
We chose the template as the carrier of temporal context information fusion for two reasons. First, template updates can be realized by introducing object information from continuous frames to alleviate the impact of object appearance changes on tracking. Secondly, although the transformer structure used to realize the adaptive fusion of temporal context costs more computation than a CNN, the template patch after feature extraction is only a small feature map in this work. Therefore, we believe that the increased calculation amount is acceptable.
Specifically, we fuse the reference templates, so the structure of the model does not need to change with the number of reference frames. The calculation process is as follows:

$$\bar{T} = \mathrm{Fuse}(T_1, T_2, \ldots, T_m),$$

where $\{T_i\}$ is the set of reference templates and $\bar{T}$ is the fusion result of the reference frames. In this work, to introduce the target information of adjacent frames, we use the tracking results of the three frames preceding the search frame as the reference templates. Next, we employ multihead attention to operate on the original template $T_0$ and the representative of the reference frames $\bar{T}$. Multihead attention can better learn the multiple relationships and similarity information between pixels and effectively extract the global context information of the original template and reference templates. The specific formula is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_N)\,W^{O},$$

where

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q W_i^{Q} \left(K W_i^{K}\right)^{\top}}{\sqrt{d_k}}\right) V W_i^{V},$$

where $Q$, $K$, and $V$ are the input query, key, and value of the attention module, $\sqrt{d_k}$ is the scaling factor, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable parameters, and $N$ is the number of heads of multihead attention. In this work, we set $N$ to 8. During the encoding process, we consider that the temporal templates of consecutive frames need to be queried with the original template information to obtain the difference and similarity information. Thus, we set the query $Q$ as $\bar{T}$ and the key $K$ and value $V$ as the original template $T_0$:
$$A_1 = \mathrm{MultiHead}(\bar{T}, T_0, T_0),$$

where $A_1$ is the output of the first multihead attention. After passing the result through a residual connection and layer normalization, we realize further information fusion through multihead attention again:

$$\hat{A}_1 = \mathrm{LN}(A_1 + \bar{T}), \qquad M = \mathrm{LN}\big(\mathrm{MultiHead}(\hat{A}_1, \hat{A}_1, \hat{A}_1) + \hat{A}_1\big),$$

where $M$ is the output of the encoder. We regard $M$ as the adaptive update knowledge of the original template, which is input to the decoder to realize the enhancement of the template. During the decoding process, we first enhance the original template $T_0$ through multihead attention:

$$D_1 = \mathrm{LN}\big(\mathrm{MultiHead}(T_0, T_0, T_0) + T_0\big).$$

Then $D_1$ and $M$ are sent to multihead attention again to perform the decoding process. In the encoder, we take the temporal context information as the query, aiming to encode it with the template information to a greater extent. In the decoder, our purpose is to carry out an adaptive update of the template information; therefore, we use the template information as the query in the decoder. The formula is as follows:

$$D_2 = \mathrm{LN}\big(\mathrm{MultiHead}(D_1, M, M) + D_1\big).$$

Finally, through a feed-forward network and the last layer normalization, we get the output of the proposed temporal template transformer:

$$\tilde{T}_0 = \mathrm{LN}\big(\mathrm{FFN}(D_2) + D_2\big),$$

where $\mathrm{FFN}(\cdot)$ denotes the feed-forward network and $\tilde{T}_0$ is the updated template.
In the proposed temporal template transformer, we keep the residual structure. Since not all reference templates are conducive to template updates, we choose to let the network update adaptively instead of specifying what template is useful. Thus, the residual structure helps to ensure that the model will not get worse.
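The sketch below illustrates the encoder-decoder update described above using PyTorch's nn.MultiheadAttention with 8 heads. The token dimension, the mean fusion of reference templates, the exact placement of the normalization layers, and the feed-forward width are assumptions for illustration, not the authors' verified configuration.

```python
import torch
import torch.nn as nn

class TemporalTemplateTransformerSketch(nn.Module):
    """Encoder: fused reference templates query the original template.
    Decoder: the original template queries the encoder output; an FFN and a
    final LayerNorm then produce the updated template."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.enc_attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.enc_attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(5))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, original: torch.Tensor, references: torch.Tensor) -> torch.Tensor:
        # original: (B, N, C) template tokens; references: (B, M, N, C) reference templates.
        fused = references.mean(dim=1)                     # fuse the reference templates
        # Encoder: temporal context as query, original template as key/value.
        a1, _ = self.enc_attn1(fused, original, original)
        a1 = self.norms[0](a1 + fused)
        m, _ = self.enc_attn2(a1, a1, a1)
        m = self.norms[1](m + a1)                          # encoder output (update knowledge)
        # Decoder: original template as query.
        d1, _ = self.dec_attn1(original, original, original)
        d1 = self.norms[2](d1 + original)
        d2, _ = self.dec_attn2(d1, m, m)
        d2 = self.norms[3](d2 + d1)
        return self.norms[4](self.ffn(d2) + d2)            # updated template
```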
3.4. Prediction Head
The prediction head is used to decode the response map and estimate the object. During the training process, the prediction head requires a lot of prior knowledge, which is crucial to achieving accurate target estimation. In this work, we use three branches for object estimation: the classification branch, the regression branch, and the centrality branch. The input of these three branches is the response map, and the outputs are the classification map, the regression map, and the centerness map, respectively.

The output of the classification branch is a classification map, denoted as $A^{\mathrm{cls}} \in \mathbb{R}^{25 \times 25 \times 2}$, where 25 is the spatial size of $A^{\mathrm{cls}}$ and each pixel of $A^{\mathrm{cls}}$ has two scores, $p^{+}$ and $p^{-}$, for foreground and background. The scores are used to classify whether the pixel belongs to the foreground or the background. The cross-entropy loss function is used to calculate the classification loss:

$$L_{\mathrm{cls}} = -\frac{1}{N} \sum_{(i,j)} \big[\, y_{ij} \log p^{+}_{ij} + (1 - y_{ij}) \log p^{-}_{ij} \,\big],$$

where $y_{ij} \in \{0, 1\}$ is the label of each pixel and $N$ is the number of pixels within the ground truth. The output of the regression branch is $A^{\mathrm{reg}} \in \mathbb{R}^{25 \times 25 \times 4}$; each pixel $(i, j)$ of $A^{\mathrm{reg}}$ has a four-dimensional vector $d_{ij} = (l, t, r, b)$, which gives the distances from the corresponding image location to the left, top, right, and bottom edges of the estimated target box. We denote the ground truth as $(x_0, y_0, x_1, y_1)$, where $(x_0, y_0)$ and $(x_1, y_1)$ are the left-top and right-bottom corners of the ground truth. We only consider regression boxes within the ground truth to balance positive and negative samples when calculating the regression loss. Each pixel $(i, j)$ within the ground truth needs its regression label, computed as follows:

$$l^{*} = x - x_0, \quad t^{*} = y - y_0, \quad r^{*} = x_1 - x, \quad b^{*} = y_1 - y,$$

where $(x, y)$ is the location of the image patch corresponding to the pixel $(i, j)$ of $A^{\mathrm{reg}}$. The regression loss is then calculated by the IoU loss:

$$L_{\mathrm{reg}} = \frac{1}{N} \sum_{(i,j)} \mathbb{1}_{\{y_{ij} = 1\}} \, L_{\mathrm{IoU}}\big(d_{ij}, d^{*}_{ij}\big),$$

where $d^{*}_{ij} = (l^{*}, t^{*}, r^{*}, b^{*})$ denotes the regression label and $L_{\mathrm{IoU}}$ denotes the IoU loss.
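As a short illustration of the per-location regression targets and the IoU loss just described, the sketch below follows the common anchor-free formulation; the function names and the log-IoU form of the loss are assumptions consistent with trackers of this type, not necessarily the paper's exact choice.

```python
import torch

def regression_targets(points: torch.Tensor, gt_box) -> torch.Tensor:
    """Distances (l*, t*, r*, b*) from each location inside the ground-truth box to
    its four edges; `points` is (N, 2) image coordinates, `gt_box` is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = gt_box
    l = points[:, 0] - x0
    t = points[:, 1] - y0
    r = x1 - points[:, 0]
    b = y1 - points[:, 1]
    return torch.stack([l, t, r, b], dim=1)

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Log-IoU loss between predicted and target (l, t, r, b) distances at the same points."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    target_area = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    iw = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    ih = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = iw.clamp(min=0) * ih.clamp(min=0)
    union = pred_area + target_area - inter
    return (-torch.log((inter + eps) / (union + eps))).mean()
```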
Considering that prediction boxes far from the target center are of lower quality and less reliable, we add the centrality branch to improve the quality of the prediction box. The output of the centrality branch is $A^{\mathrm{cen}} \in \mathbb{R}^{25 \times 25 \times 1}$; each pixel of $A^{\mathrm{cen}}$ has one centerness score. The centerness label $C(i, j)$ of each pixel is calculated from the distances between the corresponding location and the ground truth, i.e., $(l^{*}, t^{*}, r^{*}, b^{*})$:

$$C(i, j) = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}.$$

With this label, we can calculate the centerness loss $L_{\mathrm{cen}}$. The overall loss can be calculated as follows:

$$L = L_{\mathrm{cls}} + \lambda_1 L_{\mathrm{reg}} + \lambda_2 L_{\mathrm{cen}},$$

where $\lambda_1$ and $\lambda_2$ are weighting coefficients.
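The snippet below illustrates the centerness label and the loss combination above, following the FCOS-style definition the text describes; the loss weights and the helper names are assumptions (this section does not give their values).

```python
import torch

def centerness_targets(reg_targets: torch.Tensor) -> torch.Tensor:
    """Centerness label computed from the (l*, t*, r*, b*) regression targets, (N, 4)."""
    l, t, r, b = reg_targets.unbind(dim=1)
    return torch.sqrt((torch.min(l, r) / torch.max(l, r)) *
                      (torch.min(t, b) / torch.max(t, b)))

def overall_loss(loss_cls, loss_reg, loss_cen, lam_reg: float = 1.0, lam_cen: float = 1.0):
    """L = L_cls + lam_reg * L_reg + lam_cen * L_cen; the weights here are illustrative."""
    return loss_cls + lam_reg * loss_reg + lam_cen * loss_cen
```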
4. Experiments
4.1. Implementation Detail
We train the proposed SiamST in an end-to-end manner. Our training platform is an NVIDIA RTX 3060Ti GPU with an AMD 3700X CPU. PyTorch 1.10 with Python 3.8 is used to build the training network, and stochastic gradient descent (SGD) is used to train for a total of 20 epochs. The learning rate rises from 0.001 to 0.005 over the first five epochs and decreases from 0.005 to 0.0005 over the remaining 15 epochs. Our training benchmarks comprise GOT-10K [33], COCO [34], VID [35], and LaSOT [36]. We evaluate the trained network on testing benchmarks commonly used for UAV tracking evaluation, namely UAV123@10fps [37], UAV20L, and DTB70 [38]. We use the one-pass evaluation (OPE) protocol and report precision rate and success rate as comparison metrics.
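The stated optimization settings can be reproduced with a simple schedule like the one sketched below; the placeholder model, the momentum and weight-decay values, and the exact decay shape between epochs 5 and 20 are assumptions for illustration.

```python
import torch

def lr_at_epoch(epoch: int, warmup_epochs: int = 5, total_epochs: int = 20,
                start_lr: float = 1e-3, peak_lr: float = 5e-3, end_lr: float = 5e-4) -> float:
    """Linear warmup from 0.001 to 0.005 over the first 5 epochs, then a log-linear
    decay from 0.005 to 0.0005 over the remaining 15 (the decay shape is an assumption)."""
    if epoch < warmup_epochs:
        return start_lr + (peak_lr - start_lr) * epoch / (warmup_epochs - 1)
    frac = (epoch - warmup_epochs) / (total_epochs - warmup_epochs - 1)
    return peak_lr * (end_lr / peak_lr) ** frac

model = torch.nn.Conv2d(3, 8, 3)   # placeholder for the SiamST network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)

for epoch in range(20):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... one training epoch over GOT-10K / COCO / VID / LaSOT template-search pairs ...
```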
4.2. Comparison with the State-of-the-Art
To show the tracking performance of the proposed SiamST, the UAV evaluation benchmarks DTB70, UAV123@10fps, and UAV20L are selected. As one of the most widely used UAV evaluation benchmarks, DTB70 contains 70 tracking sequences from UAV perspectives and includes a variety of challenges that often occur in UAV tracking, such as occlusion, scale change, and low resolution, so it can well evaluate the robustness of a tracker. UAV123@10fps and UAV20L contain 123 and 20 common UAV scene sequences, respectively. They serve as short-term and long-term evaluation benchmarks for UAV tracking and can comprehensively evaluate the tracking performance over both short and long durations. To prove the superiority of the proposed tracker, we chose multiple state-of-the-art (SOTA) Siamese trackers for UAV tracking for comparison, such as TCTrack [32], SGDViT [25], SiamAPN [21], SiamAPN++ [22], SiamSA [39], and HiFT [23]. In addition, general trackers such as SiamRPN [19], SiamCAR [14], UpdateNet [30], and others [40,41,42,43] are also selected as comparison trackers.
4.2.1. UAV123@10fps Benchmark
As shown in Figure 4, we evaluate the tracking results on UAV123@10fps by success rate and precision rate. The precision rate and success rate of the proposed SiamST are 0.784 and 0.599, ranking first among the 12 compared trackers. Among the compared trackers, the precision rate and success rate of TCTrack are 0.774 and 0.588. Similar to SiamST, TCTrack also introduces temporal context information for UAV tracking. Different from the proposed SiamST, which enhances the template information, TCTrack designs a transformer structure to improve the response maps by introducing temporal context information. The tracking results of SiamST and TCTrack prove that temporal context plays a very important role in UAV tracking. In addition, TCTrack uses temporal adaptive convolution to enhance the spatial information of template features, while SiamST proposes a high-order multiscale spatial module to realize high-order interaction of information and long-distance interaction between target and background; thus, the organic complementarity of contextual information in the spatial and temporal dimensions is realized. Both SiamST and TCTrack improve the Siamese network at the spatial and temporal levels, so their tracking performance is higher than that of the other trackers. The excellent results of the proposed SiamST on UAV123@10fps prove that the proposed tracking algorithm is accurate for UAV tracking.
4.2.2. UAV20L Benchmark
UAV20L serves as a long-term UAV tracking benchmark, with each tracking sequence exceeding 1000 frames. We use UAV20L to evaluate the long-term tracking performance of SiamST. As shown in Figure 4, the precision rate and success rate of the proposed SiamST are 0.720 and 0.545, respectively, which are 1.7% and 1.2% higher than those of the Siamese-based UAV tracker SiamAPN++. SiamAPN++ introduces an attention mechanism into the Siamese framework for UAV tracking. Compared with SiamAPN++, the proposed SiamST has two advantages. Firstly, SiamST uses an atypical transformer structure that introduces a long-distance temporal contextual information flow, achieving a lightweight and efficient exploration of the temporal context information that is crucial for aerial tracking, whereas SiamAPN++ ignores the temporal context information. Secondly, SiamAPN++ uses an anchor-based mechanism to perform scale estimation, while SiamST uses an anchor-free prediction head that avoids hyperparameter adjustment. Therefore, the results of the proposed SiamST are higher than those of SiamAPN++. TCTrack ranks second in precision, which also proves the important role of temporal context information in long-term tracking.
4.2.3. DTB70 Benchmark
As shown in Figure 4, the proposed SiamST also performs well on the DTB70 benchmark. The precision rate and success rate are 0.813 and 0.631, respectively, which are 0.7% and 2.8% higher than those of SGDViT. SGDViT observes that background information increases when the object changes; focusing on aspect-ratio change and scale variation, it proposes a saliency-guided dynamic vision transformer to distinguish foreground from background information and refine the cross-correlation operation, which can meet many challenges in UAV tracking. With the help of the saliency-guided information, the performance of SGDViT is outstanding. SiamST also adds guiding information, namely the temporal information from context frames. The multidimensional architecture design based on the organic complementarity of contextual information in the spatial and temporal dimensions makes its performance better than that of the SOTA aerial tracker SGDViT.
4.3. Results in Different Attributes
To further analyze the capability of the proposed SiamST to cope with UAV tracking in complex backgrounds, we evaluated its performance on UAV123@10fps under nine common UAV tracking challenges; the results are shown in Figure 5. Firstly, since the UAV is a mobile platform, we tested the performance under camera motion, scale variation, and viewpoint change. The success rates of the proposed algorithm are 0.594, 0.577, and 0.609, respectively, all higher than those of the comparison trackers. This proves that the spatial information enhancement proposed in this paper improves the multiscale adaptability and spatial information extraction ability of the tracker. Additionally, UAV tracking scenes are complex, often facing background clutter, partial occlusion, full occlusion, and similar objects. Occlusion easily changes the appearance of the object and makes it difficult to extract the overall features of the target.
Additionally, background clutter and the appearance of similar targets easily cause tracking drift. Here, the introduction of temporal context information compensates for these shortcomings, so the tracker performs well in these scenarios. As can be seen from Figure 5, when facing the above challenges, the success rates of the proposed SiamST are 0.414, 0.520, 0.388, and 0.559, respectively, and SiamST always ranks first. Under illumination variation and out-of-view, two other common challenges, the proposed SiamST also performs well, which proves its generalization ability and strong adaptability to complex environments.
4.4. Ablation Experiments
We conducted ablation experiments to verify the effectiveness of the proposed high-order multiscale spatial module and temporal template transformer. As shown in Table 1, we evaluated on the short-term and long-term UAV tracking benchmarks UAV123@10fps and UAV20L, respectively. For a fair evaluation, the baseline is identical to SiamST except for the high-order multiscale spatial module and the temporal template transformer. The proposed high-order multiscale spatial module is denoted as HMSM and the temporal template transformer as TTTrans in Table 1.
As shown in Table 1, the precision rate and success rate of the baseline with the proposed high-order multiscale spatial module on UAV123@10fps are 0.757 and 0.576, respectively, an increase of 0.9% in precision rate and 0.7% in success rate over the baseline. Meanwhile, the tracking results on UAV20L are 0.685 and 0.525, which are 10.1% and 8.2% higher than the baseline.
The baseline with the proposed temporal template transformer also achieves better tracking performance than the baseline alone: the gains are 1.4% and 0.9%, respectively, on UAV123@10fps, while on UAV20L the precision rate and success rate are 0.702 and 0.527, gains of 11.8% and 8.4%, respectively.
Meanwhile, the gains of the tracker with both the proposed high-order multiscale spatial module and temporal template transformer are 3.3% and 3% on UAV123@10fps, achieving the best tracking results. The comparison is even more obvious on the UAV20L benchmark: the precision rate and success rate of the full model are 0.720 and 0.545, respectively, which are 13.6% and 10.2% higher than the baseline.
Ablation experiments demonstrate the effectiveness of the proposed high-order multiscale spatial module and temporal template transformer for improving UAV tracking, and the interaction between them can further improve the tracking performance. It further illustrates that the design of spatial-temporal context aggregation architecture in an aerial tracker using a lightweight backbone network is an efficient and necessary adaptive change, which is a novel and promising research idea in the field of UAV tracking.
Finally, we also compared the computational cost (FLOPs) and the number of parameters. We have implemented the proposed SiamST on the mobile platform RK3588, achieving real-time speed. As shown in Table 1, the proposed high-order multiscale spatial module and temporal template transformer both achieve a large accuracy improvement while adding only a small amount of computation. To intuitively show the specific FLOPs and parameter counts, we chose the Siamese UAV tracker SGDViT for comparison; our tracker achieves better performance with a lower computational cost than the SOTA aerial tracker SGDViT, which proves the superiority of the proposed SiamST.
4.5. Visual Comparisons and Case Discussion
To intuitively show the tracking effect of the proposed SiamST, we present the tracking results of SiamST together with TCTrack, SGDViT, SiamAPN, and SiamAPN++. All the comparison trackers are SNs and share the same backbone.
As shown in Figure 6, we present three UAV tracking sequences. In the first sequence, the object experiences partial and full occlusion by buildings during tracking. We can see that SGDViT, SiamAPN, and SiamAPN++ all drift to varying degrees. This is because they only consider the original template and the current tracking frame and ignore the temporal context information, so the target features cannot be extracted when the target is occluded, making continuous tracking impossible. The proposed SiamST and TCTrack can cope with the occlusion challenge because the introduced temporal context information guides the tracker to the right location. Afterwards, SGDViT recovers the target, while SiamAPN and SiamAPN++ fail due to occlusion. In the second sequence, part of the target moves out of view against background clutter during tracking, which makes it hard to extract accurate features. As can be seen from Figure 6, TCTrack, SGDViT, and SiamAPN all drift due to background interference in this process, while the proposed SiamST tracks the target continuously. These tracking results demonstrate the robustness of the proposed SiamST.
In the third tracking sequence, both SGDViT and SiamAPN drift successively when dealing with complex background interference even without occlusion, and SGDViT fails after drifting, which indicates that the feature extraction ability of both needs to be improved. Across the three UAV tracking sequences, only the proposed SiamST tracks the target throughout, which proves that the proposed SiamST can cope with many challenges in UAV tracking and intuitively shows the excellent tracking performance and robustness brought by the introduced multidimensional spatial-temporal context aggregation network architecture.
5. Discussion
In the experimental part, we performed comparison experiments with SOTA trackers on three common UAV benchmarks, and the results show that the proposed SiamST achieves the best tracking performance. We attribute this to two reasons: first, the proposed high-order multiscale spatial module improves the feature extraction capability of the model; second, the target template information from successive frames helps to alleviate the challenge brought by changes in target appearance. In the attribute analysis in Section 4.3, the proposed SiamST is superior to the comparison algorithms under the challenges of occlusion, scale variation, background interference, and out-of-view. It is worth mentioning that the proposed SiamST achieves 13.6% and 10.2% improvements over the baseline on the UAV20L benchmark, which proves the validity of the proposed high-order multiscale spatial module and temporal template transformer.
Although it is better than the comparison trackers, SiamST is not good enough to deal with all occlusions. We believe that adding trajectory prediction on the basis of the Siamese network will help to further improve tracking performance. Therefore, how to combine trajectory prediction with the Siamese network to achieve target tracking is the direction of our future research.