SiamMAN: Siamese Multi-Phase Aware Network for Real-Time Unmanned Aerial Vehicle Tracking

Abstract: In this paper, we address aerial tracking tasks by designing multi-phase aware networks to obtain rich long-range dependencies. Existing aerial tracking methods are prone to tracking drift in scenarios that place high demands on multi-layer long-range feature dependencies, such as viewpoint changes caused by the UAV shooting perspective and low resolution. In contrast to previous works that use only multi-scale feature fusion to obtain contextual information, we design a new architecture that adapts to the characteristics of different feature levels in challenging scenarios and adaptively integrates regional features with the corresponding global dependency information. Specifically, for the proposed tracker (SiamMAN), we first propose a two-stage aware neck (TAN), in which a cascaded splitting encoder (CSE) obtains the distributed long-range relevance among the sub-branches by splitting the feature channels, and a multi-level contextual decoder (MCD) then achieves further global dependency fusion. Finally, we design the response map context encoder (RCE), which utilizes long-range contextual information in backpropagation to accomplish pixel-level updating of the deeper features and to better balance semantic and spatial information. Experiments on several well-known tracking benchmarks show that the proposed method outperforms SOTA trackers, which results from the effective use of the proposed multi-phase aware network for different levels of features.


Introduction
Aerial tracking is a challenging task that aims to determine the target's position in subsequent frames and generate predicted boxes, given the initial position of the target in the first frame. Originally, it was a task that simulated human cognitive mechanisms; recently, benefiting from the rapid development of cross-disciplinary research, it has been widely used in video surveillance [1,2], UAV applications [3][4][5], and intelligent transportation [6,7]. To achieve efficient and accurate tracking, we need to distinguish the target foreground from the background. Unlike general tracking, aerial tracking faces many challenges introduced by the UAV's shooting perspective: occlusions, which place high demands on the long-range dependence of shallow features containing more spatial information; scale and viewpoint changes, which require strong global generalization of mid-level features; and small-target or low-resolution scenarios, which are more sensitive to pixel-level context optimization of deeper semantic features. Based on this analysis of the properties of aerial tracking, one question arises naturally: Can we design a new multi-phase aware framework that adapts to the characteristics of different feature levels and adaptively integrates regional features with the corresponding long-range relevance information, so as to improve the feature representation capability for pixel-level tracking?
In recent years, Siamese tracker-based methods [8][9][10][11] have become highly efficient approaches to aerial tracking, with large performance improvements and a balance between accuracy and real-time performance, and have become a hot research area among deep learning-based methods [12][13][14][15]. The core idea of a Siamese tracker is to apply two branches of the same feature extraction network to the target template and the search region, respectively, and to transform the tracking problem into a similarity matching problem between the features of the two branches through a correlation operation. The best-matching search area is then obtained by the subsequent classification and regression network. The development of Siamese trackers shows that effectively utilizing different levels of features is the key to improving performance. One way is linear multi-scale context fusion. For example, some works [16,17] achieve feature fusion by direct summation or channel concatenation of the feature blocks extracted by the backbone network. Other works [18,19] enable the network to obtain richer dependency information by designing efficient local modeling encoders or by expanding the receptive field through decomposition of the feature information. While these approaches enable the tracker to utilize dependency information, linear fusion and local modeling do not take full advantage of a global view of the feature information, and the pixel-level intercorrelation between features at different levels, which is necessary for accurate tracking, is often neglected.
To address the above problems, designing an adaptive aware network for different levels of features is an effective and feasible approach. We propose a new Siamese multi-phase aware network called SiamMAN for aerial tracking tasks, as shown in Figure 1. It contains a multi-phase aware network adapted to features at different depth levels, in order to better capture the dependencies between features at different levels and to improve the utilization of shallow spatial location features and deep semantic features. In the two-stage aware sub-network, the three feature blocks (3, 4, and 5) extracted by the backbone network are first sent to the proposed cascaded splitting encoder (CSE), which breaks the receptive field limitations and obtains the distributed long-range relevance among the sub-branches by splitting the feature channels. Then, the multi-level contextual decoder (MCD), which uses a pooling strategy, achieves further global dependency fusion. Finally, in the similarity matching sub-network, we design the response map context encoder (RCE), which utilizes long-range contextual information in backpropagation to accomplish pixel-level updating of the deeper features and to better balance semantic and spatial information. Our main contributions can be summarized as follows:
(1) We propose a novel multi-phase Siamese tracking method, SiamMAN, to enhance the network's ability to distinguish feature representations for aerial tracking and to improve accuracy in scenarios with high requirements at different feature levels. Specifically, the response map context encoder (RCE) module optimizes deep semantic features by means of non-local perceptual modeling, the multi-level contextual decoder (MCD) module achieves global relevance aggregation of features using an improved transformer structure, and the cascaded splitting encoder (CSE) module obtains long-range relevance information through channel splitting.
(2) A multi-phase aware framework adapted to features of different depths is proposed to learn the dependency information between channels from a global view. We propose solutions that achieve better feature representation and utilization for features at different depth levels, relying on the rich dependency information obtained from the different levels to significantly improve the tracking results.
(3) We achieve the best performance compared with SOTA trackers on several well-known tracking benchmarks containing challenging scenes, including UAV123, UAV20L, DTB70, and LaSOT. Experiments show that the proposed SiamMAN effectively improves tracking performance in challenging scenes, such as those with low resolution and scale variation.

Related Work
In this part, we briefly review the research related to our work in recent years, including a summary of the Siamese tracker and fusion networks.

Siamese Trackers
In recent years, Siamese network-based trackers have stood out for their excellent tracking performance. Before that, correlation filtering-based approaches [20][21][22] received widespread attention for their efficient processing and easy deployment with low computation, driving the development of the aerial tracking field. However, the lower performance caused by hand-crafted features makes it difficult for them to cope with challenging scenarios. In contrast, Siamese trackers emerged with many model variants that enhance contextual information aggregation to achieve more efficient feature utilization and better performance. Early Siamese trackers were not specifically designed for aerial tracking but rather for general target tracking, in pursuit of model generalization. The first algorithm to apply Siamese networks to the tracking task was SINT [23], and the subsequent SiamFC [24] first introduced a correlation layer to unite feature maps, pioneering an end-to-end deep learning-based tracking approach; however, the correlation operation required the network to satisfy strict translational invariance. Inspired by Faster R-CNN, Li et al. proposed SiamRPN [25], which avoids multi-scale extraction of feature maps by introducing the RPN [26] commonly used in target detection tasks, and the subsequent DaSiamRPN [27] achieved further performance improvements, but both extract features at shallow depths. SiamRPN++ [28] applies a simple and efficient spatial perception strategy to enable a deeper feature extraction network, but it is sensitive to parameters such as pre-defined anchors. To address these problems, trackers such as SiamCAR [16], SiamBAN [17], and SiamFC++ [29], which redesign the regression network using an anchor-free strategy, have been proposed, but the interference of unbalanced samples on features at different levels still exists. Later, Siamese trackers designed for aerial tracking began to appear, such as the SOTA trackers SiamAPN [30] and SiamAPN++ [31]; they enhance the ability to cope with unbalanced samples through the study of adaptive anchors, but the adaptive anchor strategy still cannot cope well with the need for multi-level feature utilization in challenging aerial scenarios.

Transformer and Fusion Networks
The Transformer was first proposed in [32], and the transformer structure has been widely used in NLP in recent years, driving breakthroughs in many tasks in the field of artificial intelligence [33]. Its core idea is to selectively filter, from a large amount of information, the small portion that is most critical to the current task and to focus on it while ignoring most of the unimportant information. For example, NonLocal [34] proposed a non-local information statistical mechanism that captures dependencies between long-range features; it directly integrates global information and provides richer semantic information, rather than obtaining global information through multiple stacked convolutional layers. DaNet [35] proposed a pixel-level optimization module based on a self-attention mechanism to capture global contextual dependencies for image segmentation, with good results. Later, ViT [36] and MobileViT [37] introduced more effective transformer architectures into computer vision tasks, breaking the limitation that CNNs acquire only local information and ignore global information, and thus enabling the modeling of dependencies between distant pixels. SiamHAS [38] proposed a tracking method with a hierarchical attention strategy that makes better use of the global relevance of features by introducing a multi-layer attention mechanism, achieving more accurate tracking. SE-SiamFC [39] used a scale model to break the limits of translational invariance and improve the accuracy of the predicted boxes output by the classification and regression network. SiamTPN [18] and HiFT [19] use the transformer structure directly in feature fusion networks but do not take into account the adaptation of features at different depth levels; their transformer structure is still limited by the receptive field of local modeling and cannot achieve global contextual modeling for feature optimization at multi-level scales. SGDViT [40] applies a large-scale transformer attention structure designed specifically for aerial tracking and is the current SOTA tracker in this field. Unlike the above trackers, which employ various attention networks, we design the CSE module at the shallower feature level to acquire the distributed long-range dependencies of each branch through channel splitting, to better cope with the need for long-range dependencies in scenarios such as occlusion. In addition, we design the MCD module to further learn the global dependencies of the middle-level features, both to meet the demand for global generalizability of features under the scale and viewpoint changes common in aerial tracking and to address the problem that the CSE module cannot fully explore global information because of the splitting of feature channels. Finally, we design the RCE module to complete the pixel-level updating of deep features by utilizing contextual information and the characteristics of receptive field mapping in backpropagation, so that the network achieves a better balance between deep semantic information and spatial information and better copes with scenarios such as small-target tracking and low resolution, which are particularly sensitive to semantic information. To summarize, SiamMAN adopts a multi-phase awareness network strategy in which each network designed to solve the aerial tracking challenges at a particular depth level is well integrated into the framework, giving it an advantage over trackers that use an attention network in challenging aerial tracking scenarios. Comprehensive experimental results validate the effectiveness of the proposed method.

Proposed Approach
In this section, we first present the general framework of the proposed network and then describe the designed two-stage aware network and the response map context encoder for obtaining rich pixel-level global contextual information. Finally, we present the adaptations made for different levels of features and the loss function used during training.

Overall Architecture
The overall framework of the proposed tracking algorithm is shown in Figure 1.
The Siamese multi-phase aware network (SiamMAN) tracker consists of four main sub-networks: the feature extraction backbone network, the two-stage aware neck, the similarity matching network, and the prediction heads. The ResNet50 feature extraction backbone takes as input a pair of images, one for the target template branch and one for the search region branch, and is initialized from a model pre-trained on ImageNet. The backbone extracts the feature maps of the target template image patch Z and the search region image patch X, respectively, and the feature blocks extracted at the 3rd, 4th, and 5th layers are used as the input of the subsequent CSE block in the TAN module. During backpropagation in the training process, the parameters are shared between the search region and template branches of the Siamese network. The two-stage aware neck aggregates the global contextual information of the features using a transformer architecture designed to adapt to features of different scales. The adjustment layers use multiple convolutional layers to adjust the dimensions of the CSE output of each branch: the channel counts of the three feature blocks are uniformly reduced from the original [512, 1024, 2048] to [256, 256, 256] to reduce the subsequent parameters and computation. In the similarity matching sub-network, a depth-separable correlation operation convolves the target template features with the search region features at the corresponding layers 3, 4, and 5, fusing deep and shallow features in the output response maps. The feature fusion module then optimizes the response map features by modeling the dependencies between long-range pixel features, which provides richer semantic information and better balances the utilization of deep feature information. Finally, the classification and regression network with an anchor-free strategy produces the binary classification result and the predicted box size for each pixel.
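As an illustration of the depth-separable correlation used for similarity matching, the following is a minimal NumPy sketch, not the authors' implementation. It treats each template channel as a kernel slid over the same channel of the search features. The feature sizes (31 × 31 for the search region, 7 × 7 for the template) and the function name `depthwise_xcorr` are assumptions chosen so that the response map comes out at 25 × 25 × 256, the size quoted for the prediction heads.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Correlate each channel of `search` with the same channel of `template`.

    search:   (C, Hs, Ws) search-region features
    template: (C, Ht, Wt) template features, used as per-channel kernels
    returns:  (C, Hs - Ht + 1, Ws - Wt + 1) response map
    """
    c, hs, ws = search.shape
    ct, ht, wt = template.shape
    assert c == ct, "both branches must have the same channel count"
    out = np.zeros((c, hs - ht + 1, ws - wt + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search[:, i:i + ht, j:j + wt]
            # Sum over the spatial window only; channels stay separate.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 31, 31))  # hypothetical search features
z = rng.standard_normal((256, 7, 7))    # hypothetical template features
response = depthwise_xcorr(x, z)
print(response.shape)  # (256, 25, 25)
```

Because channels are never mixed, the response map keeps one channel per feature channel, which is what lets later modules reweight channels individually.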

Two-Stage Aware Neck
Challenging scenarios in aerial tracking, such as viewpoint change and target occlusion, can occur in different frames and place high demands on multi-layer feature utilization and algorithmic robustness. Existing trackers such as SiamCAR and SiamBAN, which use linear summation or concatenation fusion strategies, can neither fully utilize contextual information nor cope well with the scale changes of small targets. Therefore, we propose a two-stage aware neck feature fusion network that contains two functional components, placed before and after the adjustment layers: a cascaded splitting encoder and a multi-level contextual decoder.
For the cascaded splitting encoder, take as an example the third feature block extracted by the backbone in the target template branch Z. The 512-channel input is decomposed into four sub-branches with 128 input channels each and two sub-branches with 512 input channels, and passes through four convolution operations, one pooling layer, and a gamma function. Distributed long-range information is obtained under each sub-branch through channel decomposition and cascaded additive information exchange between sub-branches. The detailed calculation process is shown in Figure 2.
For the first four branches, this is equivalent to dividing the input features $x(Z)$ into 4 subsets. Each subset of channels has the same size and is denoted $x_i(Z) \in \mathbb{R}^{C \times H \times W}$, where $i \in \{1, 2, 3, 4\}$ and C, H, and W denote the number of channels, the height, and the width of each feature map, respectively. The first subset is sent into a 3 × 3 depth-wise convolution, and the output is added to the next subset and used as the input of the next branch. The output of each sub-branch is denoted $F_i$. The four outputs are then concatenated and summed with the output of the fifth sub-branch to form the final output of the module. In the fifth branch, the average pooling layer removes part of the noise interference from the input, and the gamma parameter function is adjusted continuously during training to find the optimal fusion and achieve better feature utilization. The specific formulas are as follows.
$$F_1 = \mathrm{Conv}_{3 \times 3}\big(x_1(Z)\big)$$
$$F_i = \mathrm{Conv}_{3 \times 3}\big(x_i(Z) + F_{i-1}\big), \quad i \in \{2, 3, 4\}$$
$$F_5 = \mathrm{Gamma}\big(\mathrm{AvgPool}(x(Z))\big)$$
Finally, after cross-channel information optimization, we obtain the output feature map $x'$:
$$x' = \mathrm{Concat}(F_1, F_2, F_3, F_4) + F_5 \quad (5)$$
Compared with the traditional convolutional operation, the cascaded splitting encoder obtains the distributed long-range relevance among the sub-branches by splitting the feature channels, breaks the receptive field limitation of the traditional CNN structure, and makes full use of the multi-scale features between different feature levels, enhancing the recognition ability of the network at relatively shallow feature layers that contain more spatial information.
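The cascaded splitting described above can be sketched in NumPy as follows. This is a rough illustration under several assumptions rather than the published implementation: the 3 × 3 convolution is taken to be depth-wise with zero padding, the pooling is a 3 × 3 stride-1 average, the gamma function is modeled as a single learnable scalar, and all function names are hypothetical.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """3x3 depth-wise convolution with zero padding; x: (C, H, W), kernels: (C, 3, 3)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for di in range(3):
        for dj in range(3):
            out += kernels[:, di, dj, None, None] * padded[:, di:di + h, dj:dj + w]
    return out

def avg_pool3x3(x):
    """3x3 average pooling, stride 1, zero padding (keeps the spatial size)."""
    return depthwise_conv3x3(x, np.full((x.shape[0], 3, 3), 1.0 / 9.0))

def cascaded_splitting_encoder(x, kernels, gamma):
    """x: (C, H, W); kernels: four arrays of shape (C/4, 3, 3); gamma: scalar (assumed)."""
    subsets = np.split(x, 4, axis=0)               # x_1 .. x_4
    outputs, carry = [], 0.0
    for x_i, k_i in zip(subsets, kernels):
        f_i = depthwise_conv3x3(x_i + carry, k_i)  # F_i uses x_i plus the previous output
        outputs.append(f_i)
        carry = f_i
    f_5 = gamma * avg_pool3x3(x)                   # fifth branch: pooling, then gamma scaling
    return np.concatenate(outputs, axis=0) + f_5

# Identity kernels make the cascade visible: branch i accumulates i copies of the input.
x = np.ones((8, 5, 5))
identity = np.zeros((2, 3, 3))
identity[:, 1, 1] = 1.0
y = cascaded_splitting_encoder(x, [identity] * 4, gamma=0.0)
print(y[:, 2, 2])  # [1. 1. 2. 2. 3. 3. 4. 4.]
```

With identity kernels the later branches carry the sum of all earlier subsets, which shows how information from one channel group reaches the others despite the split.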
For the multi-level contextual decoder (MCD), after the CSE modules and the subsequent dimension adjustment layers, the feature blocks of each layer are flattened into sequences using convolutional operations and used as the input of the MCD modules. Inspired by the global dependency modeling capability of the Transformer, we design the global feature modeling (GFM) network, which uses a multi-head awareness component to obtain global, long-distance dependency relationships between channels. This achieves further global dependency fusion and addresses the problem that the cascaded splitting encoder does not fully explore global information because of the splitting of feature channels. The MCD blocks in the target template and search region branches each contain four of the proposed GFM modules. Specifically, the adjusted feature block L4, corresponding to the fourth-layer feature block of the feature extraction backbone, is used as the query input Q of each of three parallel GFM modules, realizing a mutual awareness mechanism and information exchange between branches for better global dependency modeling of long-range location and semantic information. The key-value inputs of the three paths are the dimension-adjusted output features L3, L4, and L5, respectively. The output tensor T of the path whose key-value input is L4 is sent to another GFM block, so that a two-layer calculation achieves a better balance of deep and shallow features and yields the final output tensor L4′.
For the proposed GFM module, in contrast to the traditional Transformer encoder structure, we apply an average pooling strategy on the input side of the awareness computation as a preprocessing mechanism that optimizes the K and V inputs. To obtain a more lightweight structure for aerial tracking tasks, we replace the position-encoding step of the traditional Transformer with the image itself, encoding sequence information using a zero-padding strategy to ensure the integrity of the sequence information. The GFM module consists of a multi-head awareness module, a feed-forward network, and a normalization layer, and its core process is the processing of the Q, K, V input triplet. The multi-head awareness strategy enables the model to attend jointly to information from different representation subspaces at different locations. Here Q, K, and V denote the queries, keys, and values, respectively. The module calculates the similarity of Q and K, normalizes it with softmax into a distribution of weights, and multiplies V by these weights to enhance the features of V. The final output has the same dimensionality as the original input. To prevent network degradation, we add the input to the output of the computation as a residual term and perform layer normalization after the residual connection.
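The core of the GFM computation (similarity of Q and K, softmax weighting of V, residual connection, normalization) can be sketched as below. This is a simplified single-head illustration, not the authors' module: it omits the learned projections and the feed-forward network, pools K and V along the token axis rather than spatially, and scales the scores by the square root of the feature dimension as in the standard Transformer. All of these choices, and the function names, are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def avg_pool_tokens(x, stride=2):
    """Average neighbouring tokens to shorten the K/V sequence: (N, D) -> (N // stride, D)."""
    n = (x.shape[0] // stride) * stride
    return x[:n].reshape(-1, stride, x.shape[1]).mean(axis=1)

def gfm_attention(q_tokens, kv_tokens, stride=2):
    """Single-head attention with pooled keys/values, residual connection, and layer norm."""
    k = v = avg_pool_tokens(kv_tokens, stride)
    scores = q_tokens @ k.T / np.sqrt(q_tokens.shape[1])  # similarity of Q and K
    weights = softmax(scores, axis=-1)                    # normalized distribution weights
    return layer_norm(q_tokens + weights @ v)             # residual term, then normalization

rng = np.random.default_rng(0)
l3 = rng.standard_normal((16, 8))  # hypothetical flattened L3 tokens (N, D)
l4 = rng.standard_normal((16, 8))  # hypothetical flattened L4 tokens, used as the query
out = gfm_attention(l4, l3)
print(out.shape)  # (16, 8)
```

Pooling only K and V halves the size of the score matrix while leaving the output with one row per query token, so the output keeps the shape of the query input.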

Response Map Context Encoder
After performing the depth-separable correlation to obtain the response maps, we design the response map context encoder, which utilizes long-range context information in backpropagation to accomplish pixel-level updating of the deeper features and to better balance semantic and spatial information. This allows the model to break through the limitation of local modeling; its structure is shown in Figure 3.
Specifically, the response map features sent into the module are first converted into four dimensions by a linear mapping process of unsqueezing and reshaping, to fit the subsequent high-dimensional convolutional optimization. The three-branch input feature maps are then reduced to half the original number of channels by three convolutions. The feature block of the θ-branch is multiplied with the feature block of the φ-branch after flattening and transpose operations, and the result is normalized by a softmax layer to obtain the distribution score; this score is multiplied with the flattened and transposed feature block of the g-branch to obtain the optimized feature R. In theoretical terms, the above process can be summarized as
$$R_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
where x denotes the input feature maps, i is the spatial and temporal index of the corresponding features, j enumerates all positions, the function f calculates the similarity of positions i and j, the function g computes the representation of the feature map at position j, and the response factor C(x) normalizes the output. The temporal information obtained in the training phase through the temporal index can break the limitations of the local receptive field to obtain long-range relevance information, which is important for scenes with occlusion and low resolution.
Drones 2023, 7, 707
After that, R is transposed and flattened by a convolution layer, the dimensionally expanded X features are added, and the result is reshaped to the dimensions of the features sent into the branches, giving the feature R′. Finally, R′ is concatenated with the initial input feature X, and the channel dimension is adjusted by a 1 × 1 convolution layer to be consistent with the input X, giving the final output Y. Compared with repeatedly stacked convolution and RNN operators, the above operation can quickly capture long-range dependence by directly computing the relationship between two spatial-temporal locations. The high-dimensional global modeling of long-range dependence effectively improves the feature expression of deep response maps, balances deep and shallow information at the pixel level, optimizes semantic information, and has higher computational efficiency.
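The data flow of the response map context encoder can be sketched in NumPy as follows. This is a simplified, purely spatial illustration under stated assumptions: the four-dimensional (temporal) reshaping is omitted, every convolution is taken to be 1 × 1 and is written as a channel-mixing matrix, and the weight names (`w_theta`, `w_phi`, `w_g`, `w_out`, `w_fuse`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, weight):
    """1x1 convolution as channel mixing; x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def response_context_encoder(x, w_theta, w_phi, w_g, w_out, w_fuse):
    c, h, w = x.shape
    theta = conv1x1(x, w_theta).reshape(-1, h * w).T  # (HW, C/2)
    phi = conv1x1(x, w_phi).reshape(-1, h * w)        # (C/2, HW)
    g = conv1x1(x, w_g).reshape(-1, h * w).T          # (HW, C/2)
    scores = softmax(theta @ phi, axis=-1)            # (HW, HW) pairwise distribution score
    r = (scores @ g).T.reshape(-1, h, w)              # optimized feature R, (C/2, H, W)
    r_prime = conv1x1(r, w_out) + x                   # restore C channels and add X
    stacked = np.concatenate([r_prime, x], axis=0)    # concatenate R' with the input X
    return conv1x1(stacked, w_fuse)                   # 1x1 conv back to C channels: Y

c, h, w = 4, 5, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((c, h, w))
y = response_context_encoder(
    x,
    w_theta=rng.standard_normal((c // 2, c)),
    w_phi=rng.standard_normal((c // 2, c)),
    w_g=rng.standard_normal((c // 2, c)),
    w_out=rng.standard_normal((c, c // 2)),
    w_fuse=rng.standard_normal((c, 2 * c)),
)
print(y.shape)  # (4, 5, 5)
```

Every output position is a weighted sum over all positions of the g-branch, so a single layer already relates distant pixels, and the output Y has the same shape as the input X.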

Training Loss
For the prediction heads, after similarity matching and feature optimization by the response map context encoder, the output tensor of dimension 25 × 25 × 256 is used as the input of each head. The regression head outputs the regression maps $F_{reg} \in \mathbb{R}^{H \times W \times 4}$, where W and H denote the width and height of the output feature map, both of which are 25. Each pixel position of the 4-channel feature maps records the distances from the corresponding point to the four edges of the bounding box, denoted by the four-dimensional vector $t(i, j) = (l, t, r, b)$, which can be calculated as follows:
$$l = x - x_0, \quad t = y - y_0, \quad r = x_1 - x, \quad b = y_1 - y$$
where $(x, y)$ denotes the coordinates in the search area corresponding to the point $(i, j)$, and $(x_0, y_0)$ and $(x_1, y_1)$ denote the top-left and bottom-right corner coordinates of the ground-truth box.
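As a small worked example of the regression target, the sketch below computes (l, t, r, b) for one point, assuming that (x0, y0) and (x1, y1) are the top-left and bottom-right corners of the ground-truth box. The helper `is_inside` is a hypothetical addition that follows the usual anchor-free convention of treating a point as a valid regression sample only when all four distances are positive; the source does not spell out this rule.

```python
def regression_target(x, y, x0, y0, x1, y1):
    """Distances from point (x, y) to the left, top, right, and bottom box edges."""
    return (x - x0, y - y0, x1 - x, y1 - y)

def is_inside(target):
    """Assumed convention: the point lies strictly inside the box."""
    return all(d > 0 for d in target)

inside = regression_target(60, 50, 40, 30, 100, 90)
outside = regression_target(20, 50, 40, 30, 100, 90)
print(inside, is_inside(inside))    # (20, 20, 40, 40) True
print(outside, is_inside(outside))  # (-20, 20, 80, 40) False
```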
For the classification head, we use the binary cross-entropy (BCE) loss to calculate the classification loss:

$$L_{cls} = 0.5 \times \mathrm{BCE}(\delta_{pos}, I) + 0.5 \times \mathrm{BCE}(\delta_{neg}, I)$$

where I is the ground truth, and $\delta_{pos}$ and $\delta_{neg}$ denote the foreground and background scores corresponding to the specific location of the search area branch, respectively.

We use the regression target bounding box T(i, j) and the predicted bounding box t(i, j) to calculate the regression loss, which can be calculated by the following equation:

$$L_{reg} = \frac{1}{\sum I(i, j)} \sum_{i, j} I(i, j)\, L_{IOU}\big(T(i, j), t(i, j)\big)$$

where $L_{IOU}(T(i, j), t(i, j))$ is the IOU loss between T(i, j) and t(i, j), and I(i, j) is an indicator function defined by:

$$I(i, j) = \begin{cases} 1 & \text{if } T^{k}(i, j) > 0,\ k = 0, 1, 2, 3 \\ 0 & \text{otherwise} \end{cases}$$

with $T^{k}(i, j)$ the k-th component of T(i, j).

The centrality head outputs a single-channel centrality feature map $F^{cen} \in \mathbb{R}^{H \times W \times 1}$ of size 25 × 25, recording the centrality score C(i, j) at the corresponding position. The centrality loss is calculated as

$$L_{cen} = -\frac{1}{\sum I(i, j)} \sum_{I(i, j) = 1} \Big[ R(i, j) \log C(i, j) + \big(1 - R(i, j)\big) \log\big(1 - C(i, j)\big) \Big]$$

where C(i, j) denotes the predicted centrality score for a specific location and R(i, j) represents the actual centrality score for this location.
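The masked regression loss can be sketched as follows. The text names an IOU loss and an indicator function but does not give their exact form here, so two assumptions are made and labeled in the code: the IOU loss is taken as -ln(IoU), a common choice for distance-based box regression, and the indicator is taken to be 1 when all four target distances are positive. Function names are hypothetical.

```python
import math


def inside(target):
    """Indicator: 1 when the location lies inside the ground-truth box.

    Assumption: this is expressed as all four target distances
    (l, t, r, b) being positive.
    """
    return all(d > 0 for d in target)


def iou_from_distances(pred, target):
    """IoU of two boxes given as (l, t, r, b) around the same point."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter_h = min(pt, tt) + min(pb, tb)
    inter = inter_w * inter_h
    union = pred_area + target_area - inter
    return inter / union


def regression_loss(preds, targets, eps=1e-12):
    """Mean -ln(IoU) over locations selected by the indicator.

    Assumes predicted distances are positive (for example because the
    network output passes through an exponential).
    """
    total, count = 0.0, 0
    for pred, target in zip(preds, targets):
        if inside(target):
            iou = max(iou_from_distances(pred, target), eps)
            total += -math.log(iou)
            count += 1
    return total / count if count else 0.0
```

Locations outside the ground-truth box contribute nothing, so the loss is averaged only over positions that can meaningfully regress the four distances.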
The overall loss of the algorithm is as follows:

$$L = L_{cls} + \alpha_1 L_{cen} + \alpha_2 L_{reg}$$

where $L_{cls}$, $L_{cen}$, and $L_{reg}$ represent the classification loss, centrality loss, and regression loss, respectively. $\alpha_1$ and $\alpha_2$ are weight hyperparameters that balance the loss terms and are set to 1 and 3, respectively, during the training process.
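A one-function sketch of the weighted combination is given below. The pairing of the weights with the loss terms (α1 with the centrality loss, α2 with the regression loss) follows the order in which the terms are listed and is an assumption; the function name is hypothetical.

```python
def total_loss(l_cls, l_cen, l_reg, alpha1=1.0, alpha2=3.0):
    """Weighted sum of the three head losses.

    Assumption: alpha1 scales the centrality loss and alpha2 scales
    the regression loss; the defaults are the values 1 and 3 stated
    for training.
    """
    return l_cls + alpha1 * l_cen + alpha2 * l_reg
```

With these defaults the regression term is weighted three times as heavily as the other two, so errors in box geometry dominate the gradient.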

Implementation Details
The experimental environment for the algorithms in this paper is set up as follows: the operating system is Windows 10 with CUDA 11.8, and Python 3.7 with PyTorch 1.13 is used to train the algorithm and verify its performance. The hardware platform uses an AMD Ryzen 5 5600 CPU and an Nvidia GeForce RTX3080 GPU.

The parameters in the training process were set as follows. We trained the proposed network using the COCO [41], GOT-10K [42], VID, and LaSOT [43] datasets. The model is trained with a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a batch size of 12. A warm-up [44] training strategy is used: the ResNet50 backbone network is frozen for the first ten epochs and unfrozen for the following ten epochs, for a total of 20 epochs.

To evaluate the generality and robustness of the proposed algorithm from multiple perspectives, the traditional one-pass evaluation (OPE) setup was used in our testing experiments. That is, the tracker is initialized with the ground-truth position of the target in the first frame and is then run through to the last frame without re-initialization to obtain the average precision and success rate.
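The freeze-then-unfreeze schedule described above can be written down as a short sketch. It only encodes values stated in the text (momentum 0.9, batch size 12, ten frozen rounds followed by ten unfrozen rounds); the names are hypothetical, the rounds are treated as epochs, and the learning rate is left out because the text does not state it.

```python
# Values taken from the text; the dictionary layout is illustrative.
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "batch_size": 12,
    "total_epochs": 20,
    "frozen_backbone_epochs": 10,
}


def backbone_trainable(epoch, config=TRAIN_CONFIG):
    """Return True when the backbone should receive gradient updates.

    epoch is 0-based. The backbone stays frozen for the first
    `frozen_backbone_epochs` epochs and is trained afterwards.
    """
    return epoch >= config["frozen_backbone_epochs"]


# One (epoch number, backbone trainable?) entry per training epoch.
schedule = [
    (epoch + 1, backbone_trainable(epoch))
    for epoch in range(TRAIN_CONFIG["total_epochs"])
]
```

In a PyTorch training loop, the flag would typically be applied at the start of each epoch by toggling `requires_grad` on the backbone parameters.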

UAV123 Benchmark
UAV123 [45] contains 123 image sequences collected by low-altitude UAVs, covering various challenging features such as scale variation, low resolution, and occlusion, and is one of the authoritative datasets in the field of aerial tracking. The UAV123 dataset involves the following attributes common to general and aerial tracking scenes: aspect ratio change (ARC), background clutter (BC), fast motion (FM), full occlusion (FOC), illumination variation (IV), out of view (OV), partial occlusion (POC), similar object (SOB), and scale variation (SV). The attributes that are especially difficult in aerial tracking are camera motion (CM), low resolution (LR), and viewpoint change (VC).

Success rate and Precision are used as evaluation metrics. A frame is judged as successfully tracked in terms of Precision when the center location error between the prediction box and the ground truth is within 20 pixels, and in terms of success rate when the region overlap ratio exceeds 50%. The Precision and success rate are defined as the ratio of the number of frames judged successful under the respective criterion to the total number of frames. The center location error used for Precision is calculated as follows:

$$CLE = \sqrt{(x_{pr} - x_{gt})^2 + (y_{pr} - y_{gt})^2}$$

In this formula, $(x_{pr}, y_{pr})$ and $(x_{gt}, y_{gt})$ refer to the coordinates of the centers of the prediction box and the ground truth, and CLE refers to the Euclidean distance between their centers. Accordingly, the overlap ratio used for the success rate is calculated as follows:

$$S = \frac{|B_{pr} \cap B_{gt}|}{|B_{pr} \cup B_{gt}|}$$

where $B_{pr}$ and $B_{gt}$ denote the prediction box and the ground-truth box, and $|\cdot|$ denotes the area of a region.

UAV20L Benchmark

UAV20L is an authoritative benchmark for evaluating and analyzing long-duration aerial tracking, with 20 long-duration video sequences of urban neighborhood scenes. These sequences include complex scenes in various types of urban neighborhoods and challenging frame intervals involving target occlusion, scale changes, and disappearance of targets.

DTB70 Benchmark

DTB70 [46] contains 70 UAV video sequences and is one of the most commonly used authoritative benchmarks for testing the comprehensive generalization performance of algorithms in the field of aerial tracking. The video sequences contain numerous challenging scenarios such as occlusion, scale variation, and low resolution. DTB70 also uses Precision and success rate as evaluation metrics.
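The Precision and success-rate criteria used by these benchmarks (center location error within 20 pixels, region overlap above 50%) can be sketched in a few lines. The box layout (x, y, w, h) with (x, y) the top-left corner, the function names, and the use of a strict inequality for the overlap threshold are assumptions made for illustration.

```python
import math


def center_error(pred, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx, gy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    return math.hypot(px - gx, py - gy)


def overlap_ratio(pred, gt):
    """Intersection over union of two (x, y, w, h) boxes."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    inter_w = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    inter_h = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0


def precision_and_success(preds, gts, dist_thr=20.0, overlap_thr=0.5):
    """Fraction of frames passing each criterion over one sequence."""
    n = len(gts)
    hits_p = sum(center_error(p, g) <= dist_thr for p, g in zip(preds, gts))
    hits_s = sum(overlap_ratio(p, g) > overlap_thr for p, g in zip(preds, gts))
    return hits_p / n, hits_s / n
```

The two criteria can disagree: a box that is well centered but badly sized passes the distance test while failing the overlap test, which is why both numbers are reported.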

LaSOT Benchmark
LaSOT is a large-scale, high-quality, comprehensive benchmark for evaluating long-term tracking performance, and is a commonly used authoritative dataset in the field of target tracking, containing 280 long-term test video sequences of 70 object classes in a variety of scenarios. The LaSOT dataset also uses Precision and success rate to evaluate tracking effectiveness.

Ablation Studies
To verify the effectiveness of the proposed multi-phase aware strategy and of the CSE, MCD, and RCE modules, we conduct comprehensive ablation experiments in aerial tracking scenes on the UAV123 and UAV20L benchmarks. First, we add the two-stage aware network (TAN), which includes CSE and MCD, to the framework and compare the tracking results on the two benchmarks before and after adding it. Then, we verify the effectiveness of adding the proposed response map context encoder (RCE) to the framework and compare the tracking results on the two benchmarks. Finally, we add the proposed TAN and RCE networks to the framework together and compare the tracking results to determine the performance improvement achieved by using both.

As shown in Table 1, on the UAV123 benchmark, the RCE network improves the success rate and Precision of the tracker by 0.6% and 1.3% compared to the baseline, reaching 62.1% and 80.5%, respectively. The MCD network improves the success rate and Precision by 1.4% and 2.5%, reaching 62.9% and 81.7%, respectively. Adding all the networks improves the success rate and Precision by 2.4% and 3.8%, reaching 63.9% and 83.0%, respectively.

On the UAV20L benchmark, the RCE network improves the success rate and Precision of the tracker by 0.3% and 1.1% compared to the baseline, reaching 55.5% and 71.5%, respectively. The CSE network improves the success rate and Precision by 1.0% and 1.5%, reaching 56.2% and 71.9%, respectively, while adding all the networks improves the success rate and Precision by 2.6% and 4.7%, reaching 57.8% and 75.1%, respectively.

In summary, the ablation experiments show that the RCE, CSE, and MCD modules each contribute to the Precision and success rate of the framework to different degrees, and that the best performance is obtained by using RCE, CSE, and MCD simultaneously.
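The gains quoted above are absolute percentage-point differences, so each row implies a baseline score (reported score minus reported gain). The short sketch below derives that baseline from every row as a consistency check; the numbers are the ones quoted in the text, while the derived baselines are computed here rather than quoted from Table 1, and the dictionary layout is illustrative.

```python
# benchmark -> configuration -> ((success, precision), (gain_s, gain_p)),
# all in percent, as quoted in the ablation discussion.
REPORTED = {
    "UAV123": {
        "RCE": ((62.1, 80.5), (0.6, 1.3)),
        "MCD": ((62.9, 81.7), (1.4, 2.5)),
        "all": ((63.9, 83.0), (2.4, 3.8)),
    },
    "UAV20L": {
        "RCE": ((55.5, 71.5), (0.3, 1.1)),
        "CSE": ((56.2, 71.9), (1.0, 1.5)),
        "all": ((57.8, 75.1), (2.6, 4.7)),
    },
}


def implied_baselines(rows):
    """Baseline (success, precision) implied by each row, to 0.1%."""
    return {
        name: tuple(round(s - g, 1) for s, g in zip(score, gain))
        for name, (score, gain) in rows.items()
    }


for benchmark, rows in REPORTED.items():
    print(benchmark, implied_baselines(rows))
```

Every row of a benchmark should imply the same baseline; if the rows disagreed, one of the quoted scores or gains would be inconsistent.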

UAV123 Benchmark (a) Overall performance:
Table 2 shows the success rate (Succ.) and Precision (Pre.) of the compared trackers. Compared with a tracker of similar architecture design such as SiamTPN, the proposed SiamMAN shows a clear improvement: the success rate increases from 59.3% to 63.9%, and the Precision increases from 79.0% to 83.0%. Our proposed SiamMAN ranks first in both success rate and Precision, outperforming all the selected state-of-the-art trackers. This is mainly because the adaptive awareness networks used in SiamMAN have advantages in measuring the edges of objects and in scenarios such as scale changes. Furthermore, the new architecture adapts to the characteristics of different levels of features in challenging scenarios to adaptively integrate regional features and the corresponding global dependency information; the multi-phase awareness network adopted by SiamMAN aggregates long-distance contextual information, which allows pixel-level measurements to be completed more accurately and the centre position of the target to be determined more precisely, and thus brings an advantage in terms of success rate. These results illustrate the effectiveness of the multi-phase awareness network used in SiamMAN.

SiamMAN also maintains a high success rate while meeting real-time requirements, running at 43 FPS on the RTX3080. For a fair speed comparison, we ran the accessible code of SiamBAN, SiamCAR, SiamHAS, Ocean, SiamTPN, and HiFT on our RTX3080 platform with the same environmental parameter settings. The experiment shows that SiamMAN achieves the same level of tracking speed and real-time performance as the mainstream Siamese trackers. Additionally, SiamMAN obtains a score of 64.6% in AUC and outperforms Ocean by 7.2 percentage points, which is a substantial improvement relative to mainstream, non-large-model SOTA trackers.

(b) Performance under different challenges:
Tables 3 and 4 show the comparison of the success rate and Precision of the trackers on ten groups of video sequences, including LR, POC, OV, VC, CM, and SOB challenges. In the majority of cases, the proposed SiamMAN achieves the best or second-best results among the compared state-of-the-art trackers, in scenarios such as viewpoint change, fast motion, scale change, and low resolution, demonstrating the effectiveness of the proposed method in improving performance in challenging scenarios. Specifically, for scenes including LR, POC, and VC attributes, such as Bike3, Car15, Person21, and Car1_s, SiamMAN obtains the best scores in success rate. For scenes including SOB, POC, and CM attributes, such as Bike3, Person21, and Car1_s, SiamMAN obtains the best scores in Precision.

SiamMAN copes effectively with these challenging attributes for two reasons. First, thanks to the multi-phase awareness network adapted to different levels of feature characteristics, SiamMAN has a powerful multi-level global long-range dependency modelling capability, which meets the demand for long-range contextual relationships in scenes such as target occlusion and scale and viewpoint change. Second, the RCE module utilizes contextual information to accomplish pixel-level updating of deeper semantic features, which is critical for small targets and low-resolution scenes that are extremely sensitive to deep semantic information. SiamMAN shows particular advantages on the attributes that are uniquely challenging in aerial tracking. Compared to a traditional method such as TMCS, SiamMAN is superior under the vast majority of attribute challenges.

In addition, we observe that the proposed method is inferior to Ocean in terms of success rate and Precision on the POC and CM attributes. This may be because the template update strategy adopted by Ocean adapts better to real-time changes of the target aspect ratio features; incorporating an online update feature extraction strategy is therefore an important direction for further improvement of our research.

(c) Qualitative evaluation:

To visualize the actual tracking performance of the proposed SiamMAN in various challenging scenarios compared with other advanced trackers, we visualized and compared the tracking results of seven video sequences containing various challenging scenarios in the UAV123 benchmark, as shown in Figure 4. In the video sequences containing target occlusion, such as bike2_1, car7_1, and person21, only SiamMAN completes tracking through the occlusion, while all other trackers show tracking drift or failure, verifying that the TAN network can enhance tracking performance in scenes lacking target information by using long-range global dependency modelling. In the video sequences containing low-resolution, fast-moving scenes, such as Uav1_1 and Uav3_1, SiamMAN overcomes the effects of low resolution and background clutter to complete tracking, while the rest of the trackers all fail. This further verifies the robustness of SiamMAN in complex scenes that combine multiple challenges, and also demonstrates the effectiveness of the RCE network, which extracts global contextual features to accomplish pixel-level updating of deeper semantic features, in improving tracking performance for small targets and low-resolution scenes that are extremely sensitive to deep semantic information.
The above seven tracking sequences demonstrate that the proposed SiamMAN has excellent robustness and tracking performance in scenarios such as scale change, occlusion, background clutter, and fast motion. The tracking accuracy is greatly improved for small targets and in low-resolution scenarios, which is mainly due to the RCE module that utilizes contextual information for pixel-level updating of deep semantic features. In the occlusion and scale and viewpoint change scenarios, the anchor-free strategy and the powerful long-range dependency modeling capability of the TAN module allow the objects to be measured more accurately, which not only overcomes occlusion and other interferences but also further improves the accuracy of the prediction boxes. SiamMAN provides a new and efficient tracking method for the aerial tracking field.

UAV20L Benchmark
To evaluate the performance of the proposed SiamMAN in a long-time aerial target tracking scenario, we compared it with eight other state-of-the-art trackers on the UAV20L benchmark, and the obtained results are shown in Table 5. The experiments show that our tracker achieves the best results (Precision: 75.1%, success rate: 57.8%), with a Precision 1.5% higher and a success rate 1.8% higher than SiamAPN++. Compared to traditional methods that have already been used for long-duration tracking, such as CFIT, SiamMAN makes great progress, with a 22.1% increase in success rate and a 25.9% increase in Precision. Thanks to the multi-phase awareness network adapted to the characteristics of different levels, SiamMAN can aggregate long-range dependency information for features at different levels, so that a short-time loss of target features does not degrade long-time tracking and its accuracy. SiamMAN thus obtains long-range global dependency modeling information through the multi-phase aware mechanism and copes well with long-time tracking scenarios.

DTB70 Benchmark

We compare our proposed method with seven SOTA trackers on the DTB70 benchmark, and the success rate and Precision results are shown in Table 6. SiamMAN achieves excellent performance compared with other advanced trackers, with a success rate of 64.9% and a Precision of 83.6%. Compared with SGDViT, the success rate is improved by 1.9%, and compared with SiamAttn, the Precision is improved by 1.0%. The multi-phase awareness network adopted by SiamMAN is more effective than the attention networks adopted by SiamAttn: it obtains context-dependent updated feature expressions adapted to different levels from a global perspective, expresses positional feature information better at the shallow level and semantic information better at the deeper level, and balances deep and shallow information better, which yields better performance on aerial tracking tasks. We can also see that SGDViT obtains a success rate of 63.0%, which is mainly because it employs a large-scale transformer structure that can better measure the foreground and background properties of the target's edge pixels. However, its high computational effort makes it difficult to apply to real-time tracking. In contrast, our SiamMAN achieves a better balance between performance and computational effort.

LaSOT Benchmark
To evaluate the performance of SiamMAN for long-time tracking generalization in more types of scenarios, we compare the proposed method with nine advanced trackers, as shown in Figure 5.
SiamMAN achieves the best results in both the success plot and the Precision plot (Precision: 53.1%, success rate: 52.8%), with the Precision improving by 3.4% over ATOM and the success rate improving by 6.1% over SiamMask. Compared to Siamese trackers that utilize a similar architecture, such as SiamCAR, SiamMAN offers a 0.7% improvement in Precision and a 1.1% improvement in success rate. SiamMAN therefore shows good performance for long-time target tracking on the LaSOT benchmark, which contains more generalized scenes. Achieving such improvements on an extremely challenging, large, and comprehensive dataset further illustrates the contribution of the proposed multi-phase awareness network to the performance of the Siamese tracker.

The TAN module accomplishes information exchange in a global view at a relatively shallow feature level, which makes the spatial feature representation more accurate and enhances the model's ability to handle tracking in scenes with occlusion, viewpoint change, and fast motion, which have a high demand for spatial information. The RCE module utilizes contextual information at a deeper level to complete the pixel-level updating of semantic information, which helps the model measure the position of objects more accurately at the pixel level for small targets, low resolution, and other scenarios that have a high demand for semantic information. The proposed SiamMAN thus optimizes the feature representation of multi-layer features adaptively and generalizes well to the accurate tracking of objects in multiple scenes, providing an efficient method with good generalization in the field of target tracking.

Heatmap Comparison Experiments
To demonstrate more intuitively how the proposed modules improve the attention of a Siamese tracker to regions of interest in challenging video sequences, and to further validate the contribution of the proposed two-stage aware network and response map context encoder in challenging scenarios, we selected the image sequence bike1 from the UAV123 benchmark for the heatmap experiments. We added the three proposed functional modules, RCE, CSE, and MCD, to the model in turn, and the resulting heatmaps are shown in Figure 6.

Before adding the modules, the tracker's heatmap area is very fragmented, which means that the tracker's attention is easily affected by distractions rather than focusing on the target's own features. After adding the RCE module, the distraction of the tracker is significantly reduced, which is mainly due to the pixel-level optimization of deep semantic features by the RCE module. After adding the CSE module, the tracker's attention area is more focused on the target itself, which is mainly due to the ability of the CSE module to obtain long-distance relevance that allows the model to focus on more levels of features of the target. After adding the MCD module, the tracker's heatmap area is more concentrated, which is mainly due to the powerful global relevance extraction ability of the MCD module, which obtains a more accurate representation of the target's features. The SiamMAN tracker containing all three functional modules achieves the best heatmap coverage of the target region, and the combination of modules ultimately gives SiamMAN better accuracy and robustness.

Real-World Tests
In this section, we deploy our tracker on the UAV onboard embedded platform Jetson kits to test its practicability in real-world scenes. During the real-world tests, the average utilization of the GPU and CPU is 71% and 36.8%, respectively. The challenging scenes in the real-world tests include scale variation, occlusion, motion blur, and low resolution. Our real-world tracking results using the UAV platform are shown in Figure 7. Our tracker accurately tracks the pedestrian by extracting global relevance when facing a complex background and small-target tracking scenarios (real-world subset1). When facing similar object interference (real-world subset2), SiamMAN effectively distinguishes the target object. In the scenario of changing viewpoints (real-world subset3), SiamMAN effectively performs the tracking task under different viewpoints due to its strong spatial and temporal dependency modeling capability. Finally, our tracker maintains a speed of over 20 FPS during the tests.

Conclusions
In this work, we propose a new multi-phase aware framework integrated into the Siamese tracker to improve the performance of the algorithm in various challenging scenarios. Firstly, we propose a response map context encoder (RCE) that enables deep features to aggregate more contextual information and better balance the deep semantic information, enhancing the tracker's ability to distinguish target features among deep semantic information. Secondly, we propose a two-stage aware neck, which includes the multi-level contextual decoder (MCD) and cascade splitting encoder (CSE) modules, to aggregate more long-range spatial-temporal information across channels, achieving global modeling and enhancing the tracker's ability to cope with complex scenarios such as target occlusion and scale change. Finally, the new multi-phase aware feature-optimized functional structure is efficiently integrated into the tracker framework. Comprehensive and extensive experiments validate the effectiveness of our proposed neural network framework. Overall, we believe that our work can boost development in the fields of remote sensing, aerial tracking, and learning systems.
tracking performance in challenging scenes, such as those with low resolution and scale variation.

Figure 1. The overall framework of the proposed tracker.

information under each sub-branch by channel decomposition and cascading additive information exchange between sub-branches. The detailed calculation process is shown in Figure 2.

Figure 2. The structure of the proposed cascade splitting encoder.

Figure 3. The structure of the proposed response map context encoder.

Figure 7. Results of real-world tests on the embedded platform. The tracking targets are marked with red boxes.

Table 1. Ablation study on RCE, CSE, and MCD modules. The symbol means that we add the corresponding module to the baseline model.

Table 2. UAV123 benchmark comparison results. The bold font is the best score.

Table 3. The success rate achieved by the SiamMAN tracker and eight other trackers on ten videos in the UAV123 benchmark. The best and the second-best results are highlighted in red and green, respectively.

Table 4. The precision achieved by the SiamMAN tracker and eight other trackers on ten videos in the UAV123 benchmark. The best and the second-best results are highlighted in red and green, respectively.

Table 5. UAV20L benchmark comparison results. The bold font is the best score.

Table 6. DTB70 benchmark comparison results. The best and the second-best results are highlighted in red and green, respectively.