1. Introduction
Visual target tracking is a critical research area within computer vision [1], aiming to achieve continuous and stable tracking of a target across successive video frames based on the initial position and size of the target in the video sequence. Aerial remote sensing tracking, as a subset of visual tracking tasks, is fundamental to remote sensing image processing. It plays a pivotal role in numerous applications, including video surveillance [2,3], airborne detection [4], intelligent transportation [5], and various other domains.
Advances in UAV technology have significantly facilitated the acquisition of high-quality aerial remote sensing image data. Therefore, the study of UAV technology and its subtasks has garnered significant interest from researchers [6,7], and aerial tracking has emerged as a focal point of interest. However, unlike typical tracking tasks, aerial tracking faces unique challenges due to its distinctive imaging perspective [8]. For example: (1) in practical application scenarios, drastic attitude changes of the UAV lead to large-scale changes and rapid movements, requiring the algorithm to cope with target appearance changes; (2) a single aerial viewpoint can result in interference from similar targets, requiring the tracking algorithm to resist such interference; (3) complex and variable natural environments can produce problems such as target occlusion, requiring the algorithm to handle occlusions as they occur; (4) aerial tracking tasks are often long and continuous, requiring the algorithm to have long-term tracking capabilities. Given these challenges, designing an accurate, robust, and real-time tracking algorithm for aerial tracking remains difficult.
The emergence of deep learning has brought significant advancements in many fields [9,10,11], and deep learning-based target trackers have demonstrated improved robustness [12]. In recent years, various state-of-the-art deep learning-based tracking algorithms have been proposed [13,14]. Among these, the most representative is the Siamese tracker [15], which has achieved satisfactory results in terms of both accuracy and efficiency. A Siamese-based tracker comprises three main stages [16]. First, the deep features of the target template and the search area are extracted using a weight-sharing backbone network. Then, the similarity response between the features of the two branches is calculated. Finally, the best-matching search region is obtained through a classification and regression network, and the corresponding tracking result is output. However, despite their impressive tracking performance, Siamese trackers still have some shortcomings. The features of the template and the search image are extracted independently, leading to a lack of interaction between the two branches before cross-correlation. As a result, the Siamese network cannot fully utilize the features of the two branches to suppress background clutter and enrich the template information in the search area, and it performs poorly in challenging scenarios such as large appearance changes, rapid camera motion, or interference from similar objects. This inevitably limits the application scenarios of the tracker. Additionally, it is generally believed that shallow features contain more texture information, whereas deep features carry more semantic information; the former is beneficial for locating the target, while the latter is more suitable for distinguishing the target from the background. The localization network should therefore attend differently to shallow and deep features. However, many networks overlook this aspect: they fuse the cross-correlation response maps of different layers into a single similarity response and use it as the input of the localization subnetwork.
To address these issues, we propose a contextual enhancement–interaction and multi-scale weighted fusion network for aerial tracking. The overall structure is illustrated in Figure 1. Specifically, it consists of three main parts: a feature extraction network, a feature interaction network guided by the contextual enhancement–interaction module (CEIM), and a target localization network driven by the multi-scale weighted fusion (MSWF) module. In the CEIM, the features from the two branches are first independently encoded by the Transformer-based enhancement module (TEM) to strengthen their representations. Then, a spatial cross-attention module (CAM) is used in parallel to aggregate the rich contextual interdependencies between the corresponding layers of the template and search features. Finally, the two sets of features are fused using element-wise addition, and similarity responses at three different levels are obtained. Combining the Transformer-enhanced template with cross-attention strengthens the network's ability to distinguish the target from similar distractors and complex backgrounds, allowing the tracker to achieve more accurate and stable tracking. Additionally, we designed a selective fusion module that weighs the deep and shallow similarity responses to generate a better feature representation as input for target localization, enabling the network to achieve better tracking performance.
Our main contributions can be summarized as follows:
We introduce a novel tracker designed specifically for aerial tracking, capable of achieving accurate and stable tracking across diverse and complex aerial scenarios.
To mitigate insufficient information exchange between the template and search branches, we designed the contextual enhancement–interaction module (CEIM). This module facilitates interaction between the two branches at three different levels, thereby bolstering the algorithm’s ability to handle complex scenarios.
We introduce a new fusion module named multi-scale weighted fusion (MSWF), which dynamically learns and assigns weights to cross-correlation response maps across three different levels, enhancing the algorithm’s ability to generate robust responses.
We conducted experiments on four tracking benchmarks: DTB70, UAV123, UAV20L, and UAV123@10fps. Our results demonstrate that the proposed network exhibits high stability and effectiveness.
The rest of this paper is organized as follows: Section 2 introduces Siamese-based trackers. Section 3 describes our tracker in detail. Section 4 presents the experiments, including ablation studies and a series of comparative experiments. Finally, the discussion and conclusions are given in Section 5.
2. Related Work
The Siamese model was initially used for signature verification, and its simple yet efficient dual-branch structure later gained significant attention and gradually expanded into the field of object tracking. With the rapid evolution of deep learning, the precision of object tracking has been significantly enhanced. Consequently, the Siamese network is considered one of the most promising tracking technologies at present. It initially extracts the features of the search and template images separately through two weight-sharing backbone networks. Then, it utilizes a similarity function to calculate the resemblance between the template from the first frame and subsequent search images, enabling the realization of object tracking.
SINT [17] was the first to apply the Siamese network to tracking, directly learning a matching function between the two branches to address tracking as a similarity problem. SiamFC [18] determines the object's position in each frame by comparing the similarity between the template features and different locations in the search image, achieving an end-to-end deep learning-based tracking method; however, this approach requires strict translational invariance for the correlation operation. SiamRPN [19] introduces a region proposal network (RPN), transforming tracking from a similarity matching problem between two branches into a classification problem of distinguishing the target foreground from the irrelevant background. By presetting different anchor boxes at the same location, it improves the accuracy and robustness of the task. DaSiamRPN [20] improves tracking ability by addressing the imbalanced distribution of training data. SiamRPN++ [21] employs a deeper feature extraction network to achieve better tracking performance but overlooks the advantages of fusing features from deep networks. Moreover, anchor-based tracking networks are sensitive to the predefined anchor box parameters and struggle to handle object deformations. Trackers such as SiamCAR [22], SiamBAN [23], and SiamFC++ [24] have been proposed to address this issue; they use an anchor-free mechanism to reduce the computational load and achieve faster and more stable tracking.
The attention mechanism emphasizes important features, and its effectiveness has been validated by many tracking algorithms. RASNet [25] introduced attention models into target tracking, proposing a general attention mechanism, a residual attention mechanism, and a channel attention mechanism, and achieving good tracking results. SiamAttn [26] integrates self-attention and cross-attention to form Deformable Siamese Attention Networks, enabling feature communication between the two branches; however, it overlooks spatial-level communication between the branches. SiamAPN++ [27] utilizes the attention mechanism to form an Attention Aggregation Network, achieving real-time aerial tracking.
The Transformer was first proposed for natural language processing [28]. Its basic building block is the attention module, which aggregates information from the entire input sequence. Following the success of Transformers in other vision tasks, many approaches have emerged in recent years that use Transformers to improve tracking performance. HiFT [29] directly employs a Transformer to fuse multi-scale feature information but struggles to achieve real-time tracking. TMT [14] regards the transfer of information across multiple frames as important for mutual reinforcement and obtains good tracking results by introducing a Transformer structure that bridges multiple template frames to exploit temporal information. TCTrack [30] uses the Transformer architecture to enhance feature extraction and improves the cross-correlation response maps using temporal information.
All of the above trackers achieve impressive results; however, they perform poorly in challenging scenarios such as long-term tracking and the aerial tracking of similar targets. Most of them do not focus on information exchange between the two branches. Furthermore, the prediction head of the tracking network only attends to the fused feature context, ignoring the complementary roles of deep and shallow features, which potentially reduces the accuracy of aerial tracking. To address these issues, we propose a new aerial tracker that fully considers the correlations between branch features, implicitly updates the template features, and suppresses background noise in the search area. Additionally, we introduce a learnable multi-scale adaptive fusion module that integrates deep and shallow features with learned weights, emphasizing the cross-correlation response maps fed to the target localization network.
3. Methods
In this section, we first introduce the overall structure of our proposed network and then present its design details, namely the CEIM and the MSWF module. The overall structure of the proposed network is displayed in Figure 1. We denote the template and search images by $Z$ and $X$, respectively, their feature mappings by $F_z$ and $F_x$, and the feature shapes by $C \times H_z \times W_z$ and $C \times H_x \times W_x$, where $C$ represents the number of channels, $H_z$ and $W_z$ denote the height and width of the template features, and $H_x$ and $W_x$ represent the height and width of the search features.
3.1. Overall Structure
As shown in Figure 1, the template and search region images are first separately input into a weight-sharing backbone network to extract their respective three levels of deep features. To guide the network in establishing mutual dependencies between the template and search region features and to enhance the feature expression capability of these two branches, we meticulously designed a contextual enhancement–interaction module. The cross-attention module and the Transformer-based enhancement module guide the features to focus on the similar information between the branches and strengthen the feature representation before the cross-correlation operation. The strengthened features are then fed into the deep cross-correlation module to generate the cross-correlation responses. The use of depth-wise correlation yields multiple response maps rich in similarity information, which play a crucial role in the subsequent prediction subnets, helping to determine the target's position more accurately. To further exploit the complementary advantages of cross-correlation response maps at different scales, we developed a multi-scale weighted fusion module. This module adaptively learns and allocates weights for each hierarchical feature and performs effective response fusion. Finally, we pass the fused response maps to the target localization subnet to achieve precise target localization. In the following sections, we elaborate on the working principles and advantages of both the contextual enhancement–interaction module and the multi-scale weighted fusion module.
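To make the data flow concrete, the following is a minimal PyTorch-style sketch of the forward pass just described. The class and attribute names (CEIMTracker, xcorr, head, and so on) are illustrative placeholders rather than the actual implementation, and the individual submodules are sketched in the subsections below.

```python
# Illustrative sketch of the overall forward pass (not the authors' code).
import torch.nn as nn

class CEIMTracker(nn.Module):
    def __init__(self, backbone, ceim, xcorr, mswf, head):
        super().__init__()
        self.backbone = backbone  # weight-sharing feature extractor (e.g., ResNet-50)
        self.ceim = ceim          # contextual enhancement-interaction module
        self.xcorr = xcorr        # depth-wise cross-correlation function
        self.mswf = mswf          # multi-scale weighted fusion module
        self.head = head          # classification/regression localization subnet

    def forward(self, template, search):
        # 1. extract three levels of deep features with the shared backbone
        z_feats = self.backbone(template)   # list of 3 template feature maps
        x_feats = self.backbone(search)     # list of 3 search feature maps

        # 2. per level: enhance and interact the two branches, then cross-correlate
        responses = []
        for z, x in zip(z_feats, x_feats):
            z_enh, x_enh = self.ceim(z, x)
            responses.append(self.xcorr(x_enh, z_enh))

        # 3. adaptively weight and fuse the three similarity responses
        fused = self.mswf(responses)

        # 4. predict classification scores and bounding-box regression
        return self.head(fused)
```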
3.2. Contextual Enhancement–Interaction Module
In many previous Siamese tracking networks, the template and search branches were largely independent. However, we believe that allowing them to complement each other is highly meaningful. On the one hand, learning target information from the template branch makes the features of the search branch more discriminative when multiple similar objects or occlusions occur during tracking, enhancing the saliency of the target features and helping the network identify the target accurately. On the other hand, for the template branch, encoding more contextual information from the search images is beneficial in challenging tracking scenarios such as long-term tracking and target deformation. At the same time, this can also be regarded as using the information in the search image to update the template, providing an implicit way for the network to adaptively update the target template features. In many past Siamese networks, the template was fixed and could not be updated online, so the network could not adapt to scenarios in which the target undergoes significant deformation, which is common in long-term tracking. Moreover, explicit template updates in Siamese networks can incur excessive computation and fail to meet real-time requirements. Adopting an implicit template update therefore effectively addresses these issues.
Inspired by [26], we designed a contextual enhancement–interaction module that allows feature interaction between the template and search branches before cross-correlation, thereby aggregating rich contextual interdependencies between the template and search images. Unlike [26], however, our module also considers spatial-level interactions: [26] directly uses the self-attention weights of one branch to weight the channels of the other branch, so the weights are computed independently at each spatial position and the two branches lack sufficient spatial-level interaction. We therefore propose a CEIM that concurrently addresses feature enhancement within each branch and feature interaction between the branches. It contains two parts: a Transformer-based feature enhancement module for the individual branches and a cross-attention module for feature interaction between the branches. Next, we elaborate on their operations.
3.2.1. Cross-Attention Module
The cross-attention mechanism captures correlations between features. Therefore, we designed the CAM to capture correlations between the template and search features. The structure of the CAM is shown in Figure 2.
The template and search features extracted by the backbone are input into the CAM, which aims to obtain interaction features based on their content. To capture the similarity between the feature maps, we employ dot-product operations.

For illustration, we take one level of the template-branch features as an example. First, two 1 × 1 convolutional layers are used to adjust the number of channels and thus reduce the amount of computation, producing the query and value embeddings $Q_z$ and $V_z$. These are then reshaped by flattening their spatial dimensions, so that each has $H_z W_z$ rows. The same approach is applied to the search features, obtaining $Q_x$ and $V_x$ with $H_x W_x$ rows. The correlation matrix $A$ is obtained by computing the dot product of each key in $Q_z$ with all keys in $Q_x$; for computational convenience, we directly multiply $Q_z$ with the transpose of $Q_x$, which yields $A$ with shape $(H_z W_z) \times (H_x W_x)$. The formula is given by
$$A = Q_z Q_x^{\top}.$$

Since each row of $A$ represents the similarity between a specific spatial position in $Q_z$ and all positions in $Q_x$, we apply the SoftMax function to the rows of $A$ to obtain the weight matrix $W_z$. We then compute the weighted sum of all spatial positions in $V_x$ with each row of $W_z$, constrain its magnitude with a learnable scale factor $\gamma_z$, and perform element-wise addition with the original template feature $F_z$ to obtain the interaction-enhanced template feature $\hat{F}_z$. The above process can be expressed as
$$W_z = \mathrm{SoftMax}_{\mathrm{row}}(A), \qquad \hat{F}_z = F_z + \gamma_z \, W_z V_x.$$

Similarly, since each column of $A$ represents the similarity between a particular spatial location in $Q_x$ and all locations in $Q_z$, we apply the SoftMax function to the rows of $A^{\top}$ to obtain the weight matrix $W_x$. We then compute the weighted sum of all spatial locations in $V_z$ with each row of $W_x$, constrain its magnitude with a learnable scale factor $\gamma_x$, and perform element-wise addition with the original search feature $F_x$ to obtain the interaction-enhanced search feature $\hat{F}_x$. The above process can be expressed as
$$W_x = \mathrm{SoftMax}_{\mathrm{row}}(A^{\top}), \qquad \hat{F}_x = F_x + \gamma_x \, W_x V_z.$$
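As an illustration, the following PyTorch sketch implements the cross-attention described above under the stated notation. The exact channel dimensions of the 1 × 1 projections (queries reduced, values kept at the original channel count so the residual addition matches) are our assumption rather than a detail given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionModule(nn.Module):
    """Sketch of the CAM: spatial cross-attention between template and search
    features. Projection names and channel choices are illustrative."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        # 1x1 convs: queries are channel-reduced to save computation,
        # values keep the full channel count so the residual addition matches
        self.q_z = nn.Conv2d(channels, reduced, 1)
        self.q_x = nn.Conv2d(channels, reduced, 1)
        self.v_z = nn.Conv2d(channels, channels, 1)
        self.v_x = nn.Conv2d(channels, channels, 1)
        # learnable scale factors (gamma) that constrain the attended features
        self.gamma_z = nn.Parameter(torch.zeros(1))
        self.gamma_x = nn.Parameter(torch.zeros(1))

    def forward(self, f_z, f_x):
        b, c, hz, wz = f_z.shape
        _, _, hx, wx = f_x.shape
        # flatten spatial dimensions: (B, N, C') with N = H * W
        qz = self.q_z(f_z).flatten(2).transpose(1, 2)   # (B, Nz, C')
        qx = self.q_x(f_x).flatten(2).transpose(1, 2)   # (B, Nx, C')
        vz = self.v_z(f_z).flatten(2).transpose(1, 2)   # (B, Nz, C)
        vx = self.v_x(f_x).flatten(2).transpose(1, 2)   # (B, Nx, C)

        # correlation matrix A: similarity of every template position
        # to every search position, shape (B, Nz, Nx)
        attn = torch.bmm(qz, qx.transpose(1, 2))

        # template branch: rows of A weight the search values
        z_ctx = torch.bmm(F.softmax(attn, dim=-1), vx)                   # (B, Nz, C)
        z_out = f_z + self.gamma_z * z_ctx.transpose(1, 2).reshape(b, c, hz, wz)

        # search branch: columns of A (rows of A^T) weight the template values
        x_ctx = torch.bmm(F.softmax(attn.transpose(1, 2), dim=-1), vz)   # (B, Nx, C)
        x_out = f_x + self.gamma_x * x_ctx.transpose(1, 2).reshape(b, c, hx, wx)
        return z_out, x_out
```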
3.2.2. Transformer-Based Enhancement Module
We not only realize the information interaction between the two branches through the CAM but also use the Transformer encoder structure to independently encode the features of each branch. As shown in Figure 3, the core components of the TEM are a multi-head attention module and a feedforward network. Taking the template branch as an example, we first encode the features of each layer through three 1 × 1 convolutional layers and then reshape the encoded features into the query, key, and value vectors $Q$, $K$, and $V$. Subsequently, we use dot-product operations to calculate the similarity between the queries $Q$ and the transposed keys $K^{\top}$. To control the magnitude of the similarity scores, we scale them before applying the SoftMax operation, ensuring a more reasonable probability distribution as the output. Finally, the SoftMax output serves as the attention weights for the values $V$, resulting in a weighted feature representation. In addition, we use a head count of 4 for multi-head attention. The computation of the single-head attention module is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where $d$ is the number of feature channels, which is set to 256.

The output enhanced by self-attention is added to the values $V$, and layer normalization is then applied. The result is fed into a feedforward network consisting of two fully connected layers; the first layer uses the ReLU activation function, while the second applies no activation function. Additionally, residual connections and layer normalization are employed to ensure training stability and mitigate gradient vanishing. This process can be represented as
$$X_{1} = \mathrm{LN}\big(V + \mathrm{Attention}(Q, K, V)\big),$$
$$X_{\mathrm{TEM}} = \mathrm{LN}\big(X_{1} + \mathrm{FFN}(X_{1})\big), \qquad \mathrm{FFN}(x) = W_{2}\,\mathrm{ReLU}(W_{1} x + b_{1}) + b_{2},$$
where $\mathrm{LN}$ denotes layer normalization, $\mathrm{FFN}$ denotes the feedforward network, and $W_{1}$, $W_{2}$ and $b_{1}$, $b_{2}$ denote the weights and biases of the feedforward network.
Element-wise addition is utilized to fuse the output of the CAM with that of the TEM. Finally, a convolutional layer with a 3 × 3 kernel is applied to smooth the features, resulting in enhanced features for the template branch. These enhanced features, along with the corresponding enhanced features from the search branch, serve as the input to a deep cross-correlation operation to compute the cross-correlation response.
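A corresponding sketch of the TEM for one feature level is given below. It relies on PyTorch's nn.MultiheadAttention in place of a hand-written scaled dot-product attention, and the feedforward hidden width is an assumed value. In the full CEIM, its output would be added element-wise to the CAM output of the same branch and smoothed by the 3 × 3 convolution before cross-correlation.

```python
import torch.nn as nn

class TransformerEnhancementModule(nn.Module):
    """Sketch of the TEM: per-branch self-attention encoding of one feature
    level, assuming d = 256 channels and 4 attention heads as stated above."""
    def __init__(self, channels=256, heads=4, ffn_dim=1024):
        super().__init__()
        # 1x1 convs that encode the feature map into queries, keys, and values
        self.q_proj = nn.Conv2d(channels, channels, 1)
        self.k_proj = nn.Conv2d(channels, channels, 1)
        self.v_proj = nn.Conv2d(channels, channels, 1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        # feedforward network: two linear layers, ReLU only after the first
        self.ffn = nn.Sequential(
            nn.Linear(channels, ffn_dim), nn.ReLU(inplace=True),
            nn.Linear(ffn_dim, channels))
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, f):
        b, c, h, w = f.shape
        # reshape each projection into a token sequence of shape (B, H*W, C)
        q = self.q_proj(f).flatten(2).transpose(1, 2)
        k = self.k_proj(f).flatten(2).transpose(1, 2)
        v = self.v_proj(f).flatten(2).transpose(1, 2)
        # scaled dot-product multi-head attention, residual with the values,
        # then layer normalization
        attn_out, _ = self.attn(q, k, v)
        x = self.norm1(attn_out + v)
        # feedforward network with residual connection and layer normalization
        x = self.norm2(x + self.ffn(x))
        return x.transpose(1, 2).reshape(b, c, h, w)
```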
3.3. Multi-Scale Weighted Fusion Module
In Siamese trackers, after deep features are extracted from the template and search branches, cross-correlation operations are most commonly used to generate single-channel response maps. This approach is prone to overlooking crucial information, since different feature channels contain distinct semantic details. We therefore aim to preserve as much useful information as possible in the generated response maps. Inspired by [21], we introduce a depth-wise correlation layer to compute cross-correlations between feature maps channel by channel. First, the template and search features are each split along the channel dimension into 256 two-dimensional matrices. Each pair of matrices from the template and search features is then correlated to obtain the corresponding response. Finally, the responses are concatenated depth-wise to produce multi-channel response maps. This allows the network to obtain multiple semantic similarity maps that retain rich information, which facilitates a more accurate determination of the tracked target's location in the subsequent prediction network. The depth-wise cross-correlation between the template and search features processed by the CEIM can be represented as
$$R_{i} = \tilde{F}_{x}^{\,i} \star \tilde{F}_{z}^{\,i}, \qquad i = 1, 2, 3,$$
where $\star$ denotes the depth-wise correlation operation, $\tilde{F}_{z}^{\,i}$ and $\tilde{F}_{x}^{\,i}$ are the CEIM-enhanced template and search features at level $i$, and $R_{i}$ is the corresponding response.
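The depth-wise correlation can be implemented with a grouped convolution, treating each template channel as a convolution kernel for the corresponding search channel. The following is a common sketch of this operation.

```python
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Sketch of depth-wise cross-correlation: the template feature acts as a
    per-channel convolution kernel over the search feature, so each of the
    channels is correlated independently via a grouped convolution."""
    b, c, h, w = search.shape
    # fold the batch into the channel dimension and correlate with groups=b*c,
    # i.e., a per-sample, per-channel correlation
    search = search.reshape(1, b * c, h, w)
    kernel = template.reshape(b * c, 1, template.size(2), template.size(3))
    response = F.conv2d(search, kernel, groups=b * c)
    return response.reshape(b, c, response.size(2), response.size(3))
```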
Unlike tasks with predefined categories, such as classification or detection, visual object tracking deals with unknown target categories, and the target category remains consistent throughout the tracking process. Deep and shallow convolutional features capture different levels of image information. Shallow features typically contain texture and edge information, aiding target localization, whereas deep features contain higher-level semantic information, helping to distinguish similar targets and improve tracking accuracy. A comprehensive consideration of both deep and shallow features is therefore essential for effective object tracking: treating all response features equally may limit the tracking capability of the network, whereas combining deep and shallow features in a balanced manner can effectively exploit diverse image information. Utilizing features from different levels holistically enhances the robustness and performance of the tracking algorithm, enabling effective tracking across diverse scenarios and target conditions.
This paper proposes the MSWF module to guide the network to mine discriminative features, thus achieving more accurate tracking results. The structure of this module is shown in Figure 4. First, the response features from the three convolutional depths are concatenated along the channel dimension, and the channel count is then reduced to 256 using a 1 × 1 convolutional layer. Average pooling computes the mean of all elements and thus retains background information, while max pooling focuses on the maximum element and primarily preserves texture features; we therefore combine their advantages and employ both pooling methods, enabling the network to learn which information deserves more attention. The two pooled maps are concatenated along the channel dimension, and a convolutional layer with a 7 × 7 kernel expands the number of channels to 3. A SoftMax operation is then applied across the channel dimension to obtain the weights for the cross-correlation responses at the three levels. Finally, the responses at the three levels are weighted by these obtained weights and added element-wise to obtain the final fused features. The above operations can be expressed as
$$P = \mathrm{Conv}_{1\times 1}\big([R_{1}, R_{2}, R_{3}]\big),$$
$$W = \mathrm{SoftMax}\Big(\mathrm{Conv}_{7\times 7}\big([\mathrm{AvgPool}(P),\ \mathrm{MaxPool}(P)]\big)\Big), \qquad R_{\mathrm{fused}} = \sum_{i=1}^{3} W_{i} \odot R_{i},$$
where $[\cdot,\cdot]$ represents concatenation in the channel direction, $R_{i}$ ($i = 1, 2, 3$) represents the cross-correlation response at the $i$-th level, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote the average pooling and max pooling operations, respectively, $W_{i}$ is the $i$-th channel of the weight map $W$, and $\odot$ denotes element-wise multiplication.
4. Results
In this section, first, the training and testing process is described in detail. Then, the test benchmarks and metrics are presented. The effectiveness of our proposed tracker in aerial remote sensing tracking is verified through ablation experiments. Finally, the performance of the proposed network is illustrated by comparing it with some SOTA trackers on the DTB70 [
31], UAV123 [
32], UAV20L, and UAV123@10fps benchmarks.
4.1. Implementation Details
Our platform utilizes the Windows 10 operating system, CUDA version 11.8, and the Python 3.7 programming framework with PyTorch 1.13 for training and validating algorithmic performance. The hardware platform consists of an AMD Ryzen 5 5600 CPU and an Nvidia GeForce RTX 3080 GPU. We leveraged the COCO [33], GOT-10K [34], VID, and LaSOT [35] datasets to train our proposed network. Subsequently, we fine-tuned the tracker using stochastic gradient descent with a batch size of 12. Employing a warm-up [36] training strategy, we froze the ResNet50 backbone for the first 10 epochs and conducted a total of 20 epochs throughout the process. For the first five epochs, we used a warm-up learning rate rising from 0.001 to 0.005. The learning rate was then reduced from 0.005 to 0.0005 over the subsequent 15 epochs.
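For reference, a minimal sketch of this learning-rate schedule is given below. The linear warm-up and log-space decay interpolation are assumptions, since only the endpoint values are stated above.

```python
# Sketch of the 20-epoch schedule: warm-up from 0.001 to 0.005 over 5 epochs,
# then decay from 0.005 to 0.0005 over the remaining 15 epochs (interpolation
# scheme assumed; the backbone would additionally stay frozen for 10 epochs).
import numpy as np

def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  warmup=(1e-3, 5e-3), decay=(5e-3, 5e-4)):
    if epoch < warmup_epochs:
        # linear ramp from 0.001 to 0.005
        t = epoch / (warmup_epochs - 1)
        return warmup[0] + t * (warmup[1] - warmup[0])
    # log-space decay from 0.005 to 0.0005
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs - 1)
    return float(np.exp(np.log(decay[0]) + t * (np.log(decay[1]) - np.log(decay[0]))))

lrs = [learning_rate(e) for e in range(20)]
```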
4.2. Evaluation Index
We conducted a series of experiments with our tracker and compared it against state-of-the-art trackers on four UAV tracking benchmarks: DTB70, UAV123, UAV20L, and UAV123@10fps. We evaluated it using the One-Pass Evaluation (OPE) protocol commonly used for target tracking. Specifically, the ground truth is used to initialize the tracker in the first frame of each sequence; the tracker then predicts the target location in subsequent frames based only on this initialization, without access to any further ground truth, until the end of the sequence. Finally, the average precision and success rate are used as the evaluation metrics for tracker performance.
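For clarity, the two OPE metrics can be computed per sequence as sketched below. The 20-pixel precision threshold and the 0.5 overlap threshold are the conventional operating points, while the full precision and success plots sweep these thresholds.

```python
import numpy as np

def precision_and_success(pred_boxes, gt_boxes, dist_thresh=20.0):
    """Sketch of the OPE metrics: precision is the fraction of frames whose
    center-location error is below a 20-pixel threshold; success is the
    fraction of frames whose IoU exceeds 0.5. Boxes are (x, y, w, h) rows."""
    pred, gt = np.asarray(pred_boxes, float), np.asarray(gt_boxes, float)
    # center-location error per frame
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    cle = np.linalg.norm(pc - gc, axis=1)
    # intersection-over-union per frame
    x1 = np.maximum(pred[:, 0], gt[:, 0]); y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / np.maximum(union, 1e-9)
    return (cle <= dist_thresh).mean(), (iou > 0.5).mean()
```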
4.3. Ablation Study
To verify the effectiveness of the proposed structure, Table 1 lists the success rate and accuracy metrics of the different components and the baseline. With the addition of the multi-scale weighted fusion module, the network's perception of the cross-correlation response maps at different depths is enhanced: the tracker achieves a 0.9% improvement in the success rate and a 1.2% improvement in accuracy. The module integrates shallow texture features and deep semantic features well, so the network can deal with a variety of complex scenes. In addition, adding the CEIM to the model strengthens the feature representation of each branch and realizes the information exchange between the branches, further improving tracking performance and robustness; specifically, the tracker's success rate increases by 2.3%, and the accuracy improves by 2.4%. When the two modules are used together, the tracker achieves a 65.5% success rate and an 83.9% accuracy. Notably, while the addition of both modules slightly reduces the running speed, our tracker still operates in real time.
4.4. Experiments on the DTB70 Benchmark
There are a total of 70 video sequences in the DTB70 benchmark. The videos were mostly captured at low altitudes and cover a variety of scenes captured by UAV cameras. With the rapid movement of drones and quick changes in camera perspective, targets undergo significant variations in shape and aspect ratio, resulting in complex tracking scenarios. Additionally, the dataset defines 11 challenging attributes: scale variation (SV), aspect ratio variation (ARV), occlusion (OCC), deformation (DEF), fast camera motion (FCM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), surrounding similar objects (SOA), and motion blur (MB). These attributes cover the various difficulties that may be faced during aerial tracking, making the evaluation of algorithm performance more comprehensive. We extensively compared our algorithm with 15 state-of-the-art trackers, including SiamAttn [26], TCTrack++, SiamCAR [22], TCTrack [30], SGDViT, HiFT [29], SiamAPN++ [27], LightTrack [37], SiamSA [38], SiamAPN [39], SiamGAT [40], SiamMask [41], DaSiamRPN [20], and Ocean [42], on the DTB70 benchmark to validate the tracking performance of the proposed tracker.
Overall Evaluation: The success rate and accuracy plots are shown in Figure 5. Our tracker achieves a success rate of 0.655 and an accuracy of 0.838, which are 0.9% and 1.2% higher, respectively, than those of SiamAttn, which has a similar structure. This is because SiamAttn only focuses on the interactions between branches and ignores the fusion of similarity responses; thus, our tracker exhibits better performance. The proposed network achieves the best results among all the trackers, which shows that our algorithm has superior accuracy and robustness.
Attribute-Based Evaluation: To analyze the capability of the proposed tracker in handling various complex scenes, we analyzed the six most common attributes in the DTB70 benchmark: occlusion (OCC), fast camera motion (FCM), in-plane rotation (IPR), background clutter (BC), surrounding similar objects (SOA), and motion blur (MB). The success rate and accuracy plots are shown in Figure 6.
As shown in Figure 6, in cases of severe image motion (FCM, IPR), interference from other sources (BC, SOA, OCC), or motion blur (MB) of the target, our algorithm shows excellent performance by strengthening the information exchange between the search and template branches and adaptively weighting and fusing the multi-layer responses. Compared with other SOTA algorithms, our tracker achieves significant results. This indicates that the interaction between the two branches' features and the adaptive feature fusion in our algorithm can effectively enhance tracking robustness in complex scenarios.
4.5. Experiments on the UAV123 Benchmark
The UAV123 benchmark consists of 123 video sequences, encompassing common remote sensing scenarios with objects such as bicycles, boats, cars, groups, pedestrians, trucks, UAVs, and wakeboards. It is one of the most commonly used aerial tracking datasets. In this test, our algorithm was compared to other SOTA algorithms, including Ocean, CGACD [
43], SiamCAR, SiamRPN++, SiamBAN, HiFT, SiamRPN, SiamDW [
44], and SiamFC.
The success rate and precision plots are shown in Figure 7. Our algorithm ranks first, with a success rate of 0.636 and an accuracy of 0.823. Compared with CGACD, the success rate and accuracy are 1.6% and 0.8% higher, respectively. This indicates the superior accuracy and robustness of our algorithm on the UAV123 dataset.
4.6. Experiments on the UAV20L Benchmark
The UAV20L benchmark is an aerial video dataset designed for long-term tracking and contains 20 long sequences. The average number of frames per test sequence is about 3000, and the longest sequence reaches 5527 frames. As the number of video frames increases, predicting the position of the tracked target in subsequent frames becomes more difficult, and its shape becomes more irregular. Long-term tracking introduces a variety of challenges within a single video sequence and therefore requires good robustness from the tracker. Hence, in practical UAV applications for tracking ground targets, only trackers that overcome the challenge of long-term tracking can meet practical requirements.
We compared our algorithm with the following SOTA algorithms with publicly available results on this benchmark: SiamFC++, SiamBAN, SiamAPN++, SiamCAR, SiamAPN, SiamSA, SiamMask, TCTrack, and SGDViT [45].
Overall Evaluation: As shown in Figure 8, our tracker performs excellently in the long-term tracking scenario, with a success rate as high as 0.598 and an accuracy of 0.805, both among the top results. Compared with the baseline algorithm SiamCAR, our algorithm achieves significant improvements of 7.5% and 11.8% in the success rate and accuracy, respectively. This significant advantage fully validates the effectiveness of our strategy in coping with long-term UAV ground-tracking scenarios. Unlike the baseline, our tracker enhances the interaction between the template and search branches to achieve better long-term tracking results. Our CEIM implements implicit updating of the template and enhances the template representation itself, which prevents error accumulation due to over-updating.
Attribute-Based Evaluation: To further demonstrate the tracking performance of the proposed tracker in complex scenarios, a radar chart is used to compare the success rates on the nine challenging attributes of the UAV20L benchmark. As shown in Figure 9, our algorithm significantly outperforms several other state-of-the-art algorithms in handling complex scenarios such as fast motion, occlusion, viewpoint change, and low resolution. This proves that our algorithm can effectively tackle various complex scenarios encountered during the tracking process.
4.7. Experiments on the UAV123@10fps Benchmark
The UAV123@10fps benchmark was obtained by downsampling the UAV123 benchmark and contains 123 sequences with a frame rate of 10 fps. This benchmark also contains 12 aerial tracking challenge attributes, making UAV123@10fps appropriate for evaluating our tracker. Figure 10 shows comparisons of our tracker with 12 other state-of-the-art trackers on the UAV123@10fps dataset. The success and accuracy plots show that our tracker achieved the best performance, with a success rate of 61.5% and an accuracy of 79.7%. Compared to the baseline SiamCAR, our method improved the success rate and accuracy by 1% and 2%, respectively. This is because SiamCAR neglects further enhancement of the branch features and does not effectively fuse the deep and shallow similarity responses. These results further demonstrate the effectiveness of our strategy.
4.8. Qualitative Analysis
To intuitively showcase the tracking capability of our algorithm and further emphasize its performance in handling challenging scenarios, we visualized the tracking results using heatmaps, which vividly illustrate the tracker's regions of interest. As depicted in Figure 11, in the Bike1 sequence, where similar targets surrounding the target create interference, the baseline tracker diverts partial attention to the distracting objects, while our algorithm exhibits only minor drift and maintains a stable focus on the target. In the Boat8 sequence, the rapid movement of the target results in a significant amount of background noise, causing the baseline method to become distracted by the cluttered background, whereas our algorithm remains accurate in capturing the target. In the Car1 sequence, despite the coexistence of partial occlusion and interference from similar targets, our algorithm's attention remains unwavering, whereas the baseline method shifts its focus to the distracting objects. Lastly, in the Car13 sequence, the low resolution leaves few usable features on the target itself, compounded by the added difficulty posed by similar surrounding objects; nevertheless, the attention of our method remains focused on the target and is not disturbed.
To further demonstrate the performance of our tracker in aerial tracking, we performed a tracking visualization comparison on four challenging video sequences from the DTB70 benchmark. The tracking results of our tracker are visually compared with those of three SOTA trackers, namely SiamAPN++, SiamCAR, and TCTrack++. The tracking results are shown in Figure 12.
In the RcCar6 sequence, all four trackers drifted due to aspect ratio variation (ARV), surrounding similar objects (SOA), and motion blur (MB). However, our tracker quickly recovered correct tracking of the target, while the baseline SiamCAR did not re-acquire the target until around the 220th frame and with inferior accuracy compared to our tracker; the other two trackers failed to re-acquire the target, resulting in tracking failure. In the Sheep1 and SpeedCar4 sequences, facing challenges such as aspect ratio variation (ARV), deformation (DEF), fast camera motion (FCM), in-plane rotation (IPR), and surrounding similar objects (SOA), only our tracker achieved successful tracking, proving the effectiveness of our strategy. In the Surfing3 sequence, which includes aspect ratio variation (ARV), fast camera motion (FCM), and background clutter (BC), our tracker stably tracked the target throughout the entire process.
5. Discussion and Conclusions
In this work, we propose a new Siamese tracker, named the contextual enhancement–interaction and multi-scale weighted fusion network, to improve tracking performance in challenging UAV tracking scenarios, including similar targets, scale variations, and background clutter. The proposed tracker improves tracking accuracy in two ways. First, we introduce the contextual enhancement–interaction module (CEIM), which enhances the tracker's feature representation capability. Using a cross-attention module (CAM), the CEIM effectively integrates contextual information across branches, suppressing background clutter in the search features and enriching the features of the template branch; a parallel Transformer-based enhancement module (TEM) further reinforces the feature representation. Second, we introduce the multi-scale weighted fusion (MSWF) module, which fuses similarity responses from different scales by learning a weight for each level, thereby enhancing the tracker's discriminative capability.
Ablation experiments on the DTB70 benchmark demonstrate the effectiveness of our CEIM and MSWF module in improving tracking accuracy. Comparative experiments on the DTB70, UAV123, UAV20L, and UAV123@10fps benchmarks demonstrate that our tracker achieves efficient and accurate performance in aerial tracking tasks, comparable to state-of-the-art algorithms. Additionally, heatmap comparisons and qualitative analysis further demonstrate the effectiveness of the algorithmic strategy in dealing with complex scenes. The proposed algorithm suppresses interference from similar targets by exchanging information between branches and enhancing the feature representation, thereby improving target recognition. Employing the MSWF module to fuse deep and shallow texture and semantic features makes the algorithm more discriminative. Consequently, our tracker effectively manages diverse challenging attributes and accurately estimates the location and size of the tracked target.
We expect that the work presented in this paper will inspire further innovative research on UAV tracking algorithms. However, our algorithm still has some limitations, such as tracking speed. Although the algorithm achieves real-time tracking, its speed decreased with the addition of the CEIM and the MSWF module. Since tracking speed is a critical issue in aerial tracking, we hope in future work to better balance tracking speed and accuracy to achieve faster and more accurate aerial tracking. We will work on further optimizing the network structure to make it more lightweight and to meet the needs of more practical application scenarios, for example, by using a more lightweight backbone and adopting a lighter regression-box prediction method.