Discriminative Siamese Tracker Based on Multi-Channel-Aware and Adaptive Hierarchical Deep Features

Abstract: Most existing Siamese trackers mainly use a pre-trained convolutional neural network to extract target features. However, because pre-trained deep features discriminate weakly between the target and background information, the performance of a Siamese tracker can significantly degrade when facing similar targets or changes in target appearance. This paper proposes a multi-channel-aware and adaptive hierarchical deep features module to enhance the discriminative ability of the tracker. Firstly, through the multi-channel-aware deep features module, the importance values of feature channels are obtained from both the target details and the overall information, to identify the more important feature channels. Secondly, by introducing the adaptive hierarchical deep features module, the importance of each feature layer can be determined according to the response value of each frame, so that the hierarchical features can be integrated to represent the target, which better adapts to changes in the target's appearance. Finally, the two proposed modules are integrated into the Siamese framework for target tracking. The Siamese network used in this paper is a symmetric neural network with two input branches that share the same weights, an architecture widely used in the field of target tracking. Experiments on several benchmarks show that the proposed Siamese tracker improves by several points over the baseline tracker.


Introduction
Object tracking is a fundamental research hotspot in the field of computer vision and has many applications in daily life, such as autonomous driving [1], video surveillance [2], and human-computer interaction [3]. Usually, the information of the tracked object is given in the first frame, and the new position of the target in subsequent frames is predicted by the designed tracker. Since only the target information in the first frame is given, prior knowledge is seriously insufficient. Therefore, when facing complex scenes, such as background clutter, lighting changes, fast motion, and partial occlusion, tracking performance drops sharply. A number of models have been proposed to extract target features in target tracking, such as hand-crafted features [4], correlation filters [5][6][7], regressors [8,9], and classifiers [10][11][12]. While most Siamese-based trackers use pre-trained deep models to extract features for the tracking task, they pay less attention to learning more discriminative deep features.
Recent work in this area includes designing loss functions to select appropriate feature channels [13], using memory networks to preserve the latest appearance models [14], attention mechanisms to enhance feature representation [15], and multi-layer feature fusion to represent targets [16]. For example, MemDTC [14], an algorithm that uses memory networks to memorize the appearance of targets, achieves good performance; however, due to the presence of memory banks, it occupies a large amount of device memory during tracking. This consumes the limited computational resources and leads to a drop in tracking speed.
The Siamese network is a symmetric network that was originally applied to the field of template matching. It contains two input branches that share the same network structure and internal parameters. SiamFC introduced the Siamese network to the field of target tracking and achieved good results. We integrated the two proposed modules with a Siamese network for the object-tracking task, and evaluated the proposed tracker on several benchmarks, including OTB-50 [24], OTB-100 [25], UAV123 [26], Temple Color-128 [27], and VOT2016 [28]. Extensive experiments show that the proposed tracker is more effective in terms of success rate and precision rate than trackers based on pre-trained deep features. The main contributions of the proposed method can be summarized as: 1. We design a multi-channel-aware deep-feature module to establish the interdependence between feature channels; it includes two branches that learn the overall information and the saliency information of the target, and adopts feature recalibration to enhance the channel weights that play a positive role in target representation.
2. To effectively fuse the features of different layers, the proposed method uses adaptive hierarchical deep features to guide the generation of the most significant target features; it estimates the contribution of each feature layer, fuses the two feature layers according to these contributions, and adaptively updates the contribution values.
3. We integrate the two methods with a Siamese network for object tracking and evaluate the proposed method on some benchmarks. The experimental results have shown that the proposed tracker is more effective than some other trackers.

Related Work
Visual object tracking has been developed for decades, and many tracking methods have been proposed. This section provides short outlines for some representative trackers related to our work, such as trackers using deep features, trackers based on the Siamese network, and trackers based on the deep feature and attention mechanism.

Deep Features Based Tracker
Thanks to the powerful appearance-modeling ability of deep features, the performance of trackers can be significantly improved; as a result, traditional hand-crafted features have gradually been replaced. DCF-based trackers also use deep features to improve performance, such as DeepSRDCF [29], C-COT [30], and ECO [31]. To take full advantage of deep features, CF2 [22] and FCNT [32] fuse shallow and deep features to represent targets efficiently.
Although these trackers have outstanding feature-representation power, there is a significant problem: only limited training samples and the ground-truth visual appearance of the target in the first frame are available. In addition, trackers that utilize only the last layer of CNN features discard useful detail; unlike their approach, our tracker uses multiple convolutional layers to model the target, and the weights between the convolutional layers are adaptively updated.

Siamese Network Based Tracker
Siamese network-based trackers [21,33,34,35] view tracking as a matching problem and learn a similarity-metric network. The input to a Siamese tracker consists of two parts: the initial-frame template of the tracked object and the search region of the current frame. Both use the same fully convolutional network to extract target features, and finally a cross-correlation operation performs template matching to generate a response map. The position of the maximum value in the response map corresponds to the position of the target in the search area. SiamFC [21] is a tracking method based on offline end-to-end training of a Siamese network, which aims to learn a similarity function for target matching. Since SiamFC mainly focuses on appearance features and ignores the high-level semantic information of the target, SA-Siam [35] improves on it with a dual Siamese network tracking scheme, in which one Siamese branch is responsible for appearance-feature matching and the other for semantic-information matching; combining appearance features and semantic information makes the tracker's performance more stable. Differing from the detection networks used in some methods, GOTURN [33] uses a regression method based on the Siamese network to learn the relationship between the appearance and movement of the target. After the template and search region are fed in, the Siamese network extracts target features, and the regression network then compares the two images and regresses the target position. SiamMCF [36] and DSiam [37] address the similarity problem through multi-layer interconnection.
Although these Siamese networks have been pre-trained on large datasets, such pre-training is more suited to classification tasks and does not take full advantage of the semantic and object information associated with a particular target object. Therefore, certain problems remain in modeling the target's feature expression.

Deep Feature and Attention Based Tracker
The attention mechanism has been widely used in the field of computer vision, for example in object detection [38], person search [39], and image segmentation [40]. Introducing the attention mechanism into target tracking can help the tracker pay more attention to the information of the target itself and reduce the influence of unimportant parts during positioning. This strategy is applicable in most scenarios. With the development of attention mechanisms in the field of tracking, several related trackers have been proposed. To acquire spatial and semantic features of thermal infrared targets, HSSNet [41] designs a Siamese CNN with multiple hierarchical features. MLSSNet [42] proposes a multi-level similarity network to learn the global semantic features and local structural features of objects. RASNet [43] integrates three attention modules (channel attention, general attention, and residual attention) into one layer of the Siamese network, which alleviates the overfitting problem in deep-network training and improves its discriminative ability and adaptability. MemTrack [23] and MemDTC [14] introduce attention mechanisms for spatial location and use a long short-term memory (LSTM)-based controller to manage the read and write operations of feature maps in memory. IMG-Siam [44] introduces channel attention to better learn the matching models.
This paper proposes a multi-channel-aware deep-feature method that includes a two-branch attention mechanism. The method operates on two feature layers and finally obtains a fusion of multi-layer, multi-channel attention features.

Proposed Method
By carefully designing feature extraction strategies, the matching accuracy can be improved. However, the tracking target is arbitrary, and it is impractical to design features that are suitable for any target. To deal with these problems, this paper proposes a novel scheme to learn target deep features via the multi-channel-aware and adaptive hierarchical deep features module to guide the generation of the most significant target features. The proposed method uses the features extracted by existing methods to improve the performance of the Siamese-based tracker.
In this section, we introduce the details of the proposed tracking framework. As shown in Figure 2, the proposed framework consists of a Siamese network for feature extraction and a feature-learning mechanism to enhance the target feature representation. Specifically, the feature-learning mechanism consists of two modules: one responsible for learning the multi-channel-aware deep features and the other for fusing the adaptive hierarchical deep features. The multi-channel-aware deep-feature learning module has two branches with the same structure, which are responsible for recalibrating the corresponding feature channels. The adaptive hierarchical deep-feature fusion module determines the weight of each feature layer by combining the peak-to-sidelobe ratio (PSLR) with a peak-point constraint. The whole system is trained end-to-end by feeding image blocks containing the target into the framework. Our tracker is based on the SiamFC [21] framework. We describe the proposed method in detail below.

Basic Siamese Network for Visual Tracking
As mentioned above, Siamese networks are widely used to solve the template-matching problem, and their input contains two parts: the template image z and the search image x of the current frame. For visual tracking, the template image is usually given in the first frame, while the search image of the current frame is cropped from a region around the target location estimated in the previous frame. Both inputs use the same CNN φ_θ to extract deep features, and the target response is then obtained by cross-correlation. This can be represented as:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x) + d,

where ⋆ represents the cross-correlation operation between the two features, d represents the bias, and φ_θ(·) denotes the feature-extraction network. The position of the maximum response value of f_θ(z, x) represents the target position.
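The cross-correlation above can be implemented directly with a standard convolution by using the template features as the kernel. The following is a minimal sketch, not the authors' code; feature shapes and the zero bias are illustrative.

```python
import torch
import torch.nn.functional as F

def xcorr(template_feat: torch.Tensor, search_feat: torch.Tensor,
          bias: float = 0.0) -> torch.Tensor:
    """SiamFC-style response map f(z, x) = phi(z) * phi(x) + d.

    template_feat: (1, C, h, w) features of the template z
    search_feat:   (1, C, H, W) features of the search region x
    returns:       (1, 1, H-h+1, W-w+1) response map
    """
    # conv2d with the template features as the kernel is exactly a
    # sliding-window cross-correlation of the two feature maps
    return F.conv2d(search_feat, template_feat) + bias
```

The location of the maximum of the returned map gives the estimated target position within the search region.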

Multi-Channel Aware Deep Features
Each channel in a feature layer contains different target information, and in the template-matching process the contribution of each channel differs. The proposed method designs two branches to focus on multi-channel-aware deep features. The two branches begin with global average pooling and global maximum pooling, respectively. Global average pooling preserves the overall information about the target, while global maximum pooling preserves its salient information, so the proposed method uses the two pooling operations to obtain compact feature descriptors that preserve both the overall and the detailed knowledge of the target. After the pooling operation, two feature vectors of size 1 × 1 × 512 are obtained; a 1 × 1 convolution reduces them to 1 × 1 × 256 and then restores them to 1 × 1 × 512. Adding a ReLU function between the dimension-reduction and dimension-restoration operations increases the non-linear expressive ability of the feature vectors. Traditional attention networks use fully connected layers, but fully connected layers were mainly proposed for classification tasks. In tracking, the target information is known from the first frame, so there is no need for object classification. Moreover, a fully connected layer destroys the spatial structure of the image, whereas a convolution operation does not; the latter helps retain the local features of the image, which is more conducive to target positioning. Finally, the sigmoid function normalizes the 512-dimensional feature vectors to values between 0 and 1.
Hence, the proposed method obtains two pooled feature vectors, f_max^{1×1×C} and f_avg^{1×1×C}, for the max and average branches, respectively, from CONV2 and CONV5. Finally, the two pooled feature vectors from each feature layer are fused to obtain weight vectors φ_θ2(·)^{1×1×C} and φ_θ5(·)^{1×1×C} that represent the channel weights. Each weight vector is multiplied by the original feature to obtain the channel-weighted feature maps C_M2^{H×W×C} and C_M5^{H×W×C}. This process is called feature recalibration. The calculation process can be expressed as:

φ_θd(·) = ε(W_2 δ(W_1 f_avg) + W_2 δ(W_1 f_max)), d ∈ {2, 5}, (2)

Finally, the multi-channel-aware deep features that are obtained can be expressed as:

C_Md = φ_θd(·) ⊗ F_d, d ∈ {2, 5}, (3)

where ε represents the sigmoid function, δ the ReLU function, W_1 and W_2 the 1 × 1 convolutions that reduce and restore the channel dimension, F_d the original feature of layer d, and ⊗ channel-wise multiplication.
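The two-branch recalibration described above can be sketched as a small PyTorch module. This is a hedged reconstruction from the text: the channel count of 512, the reduction to 256, the use of 1 × 1 convolutions instead of fully connected layers, and the fusion of the two branches by addition before the sigmoid are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAware(nn.Module):
    """Multi-channel-aware recalibration: GAP + GMP branches, a shared
    1x1-conv bottleneck with ReLU, sigmoid gating, channel reweighting."""

    def __init__(self, channels: int = 512, reduction: int = 2):
        super().__init__()
        # 1x1 convolutions in place of fully connected layers, as in the text
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), e.g. from CONV2 or CONV5
        avg = torch.mean(feat, dim=(2, 3), keepdim=True)  # overall information
        mx = torch.amax(feat, dim=(2, 3), keepdim=True)   # salient details
        weights = self.sigmoid(self.bottleneck(avg) + self.bottleneck(mx))
        return feat * weights  # feature recalibration
```

Because the gate values lie in (0, 1), the module can only attenuate channels, never amplify them, which matches the recalibration role described above.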

Adaptive Hierarchical Deep Features
Due to the characteristics of deep features, low-level features contain more target details thanks to their higher resolution, while high-level features encode more semantic information despite their lower resolution. In the tracking stage, fusing high- and low-level features is therefore an effective way to improve positioning accuracy, and how to fuse them has become a research problem. Figure 4 contains two video sequences, showing the response values of different feature layers on the same video frame. Clearly, different feature layers contribute differently to the target response.

The CNN used in this paper has five feature layers in total; after each convolution, the resolution decreases. To ensure that the template features contain both rich detail and high-level semantic information, the proposed method adopts an adaptive weighted fusion method to enhance performance. The CONV2 and CONV5 layers are given different reliability weights, each derived from the responses of the layer itself, and the reliability weights are updated in real time. In our method, the reliability weight of a feature layer consists of two parts: (1) the layer max-response reliability weight w_d^max, namely, the response peak between the feature layer and the template region; the larger the response value, the higher the reliability. (2) The layer interference-detection reliability weight w_d^ratio, that is, the ratio of the main-lobe peak intensity to the peak intensity of the strongest sidelobe; the lower the ratio, the higher the reliability. In the tracking stage, the two parts jointly determine the reliability of the feature layer, which can be expressed as:

W_d = w_d^max · w_d^ratio,

normalized s.t. Σ_d W_d = 1. The reliability measures are described in the following paragraphs.

Layer Response Learning Reliability
The ideal response peak should be the unique peak obtained by cross-correlation between the template and the search area, and its value should be close to 1. However, in the actual tracking process, due to strong background interference, the response map is noisy in some frames and has low discriminative ability. Therefore, the response weight of each feature layer can be obtained as follows:

w_d2^max = max(φ_θ(z) ⋆ F_2), w_d5^max = max(φ_θ(z) ⋆ F_5),

where ⋆ is the cross-correlation, w_d2^max is the max response of CONV2, w_d5^max is the max response of CONV5, and F_2 and F_5 are the CONV2 and CONV5 features of the search area. Figure 5 shows the influence of the hierarchical deep features on the response map. It can be seen that the response map with hierarchical deep features has a higher peak value and a more concentrated response point.

Layer Interference Detection Reliability
The second part of the feature-layer reliability reflects the ratio of the contributions of different feature layers to target localization. Unlike the similar method proposed by Bolme et al. [45] to detect target loss, our method detects the primary and secondary peaks in the response map and determines the interference strength of each feature layer by the ratio of these two peak points; the smaller the ratio, the lower the interference. In this way, the influence of nearby strong interfering objects on target modeling can be reduced, and the final ratio can be kept below 0.5. The PSLR weight can be expressed as:

w_d^ratio = 1 − P_d^side / P_d^main,

where P_d^main and P_d^side are the primary and secondary peak values of layer d. Therefore, the adaptive hierarchical feature weight can be expressed as:

W_d = (w_d^max · w_d^ratio) / Σ_d (w_d^max · w_d^ratio).

Figure 6 compares the responses of the same video sequence at different frames before and after using adaptive hierarchical deep features. We can clearly see that, when the appearance and position of the biker change significantly, tracking fails without adaptive hierarchical deep features; the tracking box then stays in the same position and the response value remains approximately constant. With adaptive hierarchical deep features, although the response value drops sharply as the biker's appearance and position change, tracking can still be completed, and the response gradually returns to normal and remains so in subsequent frames.
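The two reliability terms can be sketched as below. The sidelobe-search radius around the main peak and the exact combination rule (here, the max response scaled by one minus the peak ratio) are assumptions; the text only states that a lower sidelobe-to-peak ratio means higher reliability and that the weights are normalized to sum to 1. Non-negative response maps are assumed.

```python
import numpy as np

def layer_weights(responses, exclude_radius: int = 2):
    """Compute normalized reliability weights, one per feature layer.

    responses: list of 2-D non-negative response maps (e.g. CONV2, CONV5).
    """
    w = []
    for r in responses:
        r = np.asarray(r, dtype=float)
        main_peak = r.max()
        # mask a small neighborhood around the main peak, then take the
        # strongest remaining value as the sidelobe peak
        iy, ix = np.unravel_index(r.argmax(), r.shape)
        masked = r.copy()
        masked[max(0, iy - exclude_radius):iy + exclude_radius + 1,
               max(0, ix - exclude_radius):ix + exclude_radius + 1] = -np.inf
        side_peak = masked.max()
        ratio = side_peak / main_peak        # lower ratio -> higher reliability
        w.append(main_peak * (1.0 - ratio))  # assumed combination rule
    w = np.asarray(w)
    return w / w.sum()                       # normalize s.t. sum_d W_d = 1
```

A layer with a single sharp peak thus receives a larger weight than a layer whose response is flat or has strong sidelobes.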

Training Detail
The proposed method used the GOT-10k dataset [46] and the ImageNet Large Scale Visual Recognition Challenge 2015 VID dataset [47] to train the model. In the training process, the SiamFC cropping strategy was used to crop the template image z and the search image x, with the target position taken as the center. Image pairs (z, x) were randomly selected from the training set, and a logistic loss function of the following form was used:

L(g, f) = (1 / |N|) Σ_{n∈N} log(1 + exp(−g[n] · f(z, x)[n])),

where N is the set of possible target locations on the response map, f(z, x)[n] is the response-map score at position n, and g[n] ∈ {1, −1} is the ground-truth label. To obtain more training samples, we randomly selected 10 image pairs from each video sequence and set the maximum interval between the template and search images to 100 frames; the batch size was set to 32. Stochastic gradient descent (SGD) was used to optimize the objective function. In the test stage, the same strategy as SiamFC was used for target positioning.
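The logistic loss above, averaged over the response-map positions, can be written compactly with a numerically stable softplus; the construction of the {+1, −1} label map is left to the training code and is not shown here.

```python
import torch

def logistic_loss(response: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean logistic loss over all response-map positions.

    response, labels: tensors of shape (B, H, W); labels in {+1, -1}.
    """
    # log(1 + exp(-g * f)) == softplus(-g * f), numerically stable
    return torch.nn.functional.softplus(-labels * response).mean()
```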
Based on experience, the momentum was set to 0.9, the learning rate decayed from 1 × 10−2 to 1 × 10−5, the weight decay was 5 × 10−4, and the model was trained for a total of 35 epochs.
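A sketch of the optimizer setup implied by these hyper-parameters, assuming an exponential schedule (the decay type is not stated in the text); the model below is a stand-in for the tracking network.

```python
import torch

model = torch.nn.Conv2d(3, 8, 3)  # stand-in for the tracking network
num_epochs = 35
# per-epoch decay factor such that gamma**(35 - 1) = 1e-5 / 1e-2 (~0.816)
gamma = (1e-5 / 1e-2) ** (1.0 / (num_epochs - 1))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
# training loop would call scheduler.step() once per epoch
```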
We implemented the proposed tracker in Python with the PyTorch framework, on a PC with 16 GB of memory, an Intel(R) Core i7-9700 CPU @ 3.0 GHz, and an NVIDIA GeForce RTX 2060 GPU.

Evaluation on OTB Benchmark
The OTB dataset is a public dataset to test the effectiveness of target-tracking algorithms, which is divided into OTB50 [24] and OTB100 [25], containing 50 and 100 video sequences, respectively.
As shown in Figure 7, on the OTB100 dataset our tracker achieved excellent results in terms of both success rate and precision rate, with a success rate of 63.1% and a precision rate of 84.2%, which are 4.8% and 7.0% better than the baseline algorithm SiamFC, respectively. Compared with the attention Siamese tracker MemTrack, our tracker was 0.4% and 3.1% ahead in success rate and precision rate, respectively. However, compared to the attention memory tracker MemDTC, our tracker lagged behind in success rate and precision rate by 0.7% and 0.5%, which we speculate is due to the dynamic memory network introduced by MemDTC, which enables the target template to adapt to changes in target appearance during tracking. We also compared some CNN- and correlation-filter-based trackers, such as SRDCF, CREST, and CSR-DCF. The proposed tracker achieved 3.1%, 1.1%, and 5.2% improvements in success rate and 5.0%, 0.8%, and 4.3% improvements in precision rate over these methods. Figure 7. Success and precision rates on the OTB100 dataset. Figure 8 shows the overall performance of the proposed tracker on OTB50. It can be seen that the proposed tracker performed best on OTB50. Compared with the baseline algorithm SiamFC, the proposed tracker leads in the two metrics, success rate and precision rate, by 8.7% and 13.8%, respectively. The proposed tracker also achieved 4%, 3.2%, 6.4%, and 8.4% improvements in success rate and 8%, 3.6%, 9.7%, and 10.8% improvements in precision rate compared to the MemTrack, CREST, SRDCF, and CSR-DCF trackers, respectively. Unlike its OTB100 performance, our tracker performs better than MemDTC in terms of both success rate and precision rate. Experiments on both datasets show that our tracker has excellent performance, proving the effectiveness of the proposed approach.

Qualitative Analysis on OTB Benchmark
To analyze the proposed tracker in more depth, we also performed a qualitative analysis. Figure 9 compares different trackers on six typical video sequences. These trackers include the CF-based tracker DSST, the attention-based tracker MemTrack, the Siamese-based trackers SiamRPN and SiamFC, and the CNN- and CF-based tracker CF2.
These video sequences contain several common challenges faced in visual target tracking, such as scale variation (Biker, Girl2), occlusion (Biker, DragonBaby, Girl2, Lemming), out-of-view targets (Soccer), and background clutter (DragonBaby). Figure 9 shows the tracking effectiveness of our tracker when facing these challenges. Due to the introduction of the multi-channel-aware module and the adaptive hierarchical deep-feature module, our proposed tracker adapts well to these challenges compared to the other algorithms.
In addition, to validate the performance of our proposed tracker in more depth, we conducted experiments on the 11 challenges of the OTB100 dataset. Tables 1 and 2 present the results of the proposed tracker compared with other trackers on the 11 challenges. It can be seen that the proposed tracker consistently maintains an excellent performance in challenging situations, due to the introduction of the multi-channel-aware and adaptive hierarchical deep-feature learning modules. In Tables 1 and 2, SV represents scale variation, LR low resolution, OC occlusion, DF deformation, MB motion blur, FM fast motion, IR in-plane rotation, OR out-of-plane rotation, OV out-of-view, BC background clutter, and IV illumination variation.
As shown in Tables 1 and 2, more details about the proposed algorithm can be seen. In general, the proposed algorithm performs well on all 11 challenges. On all 11 challenges, it performs better than the baseline algorithm SiamFC, which directly uses pre-trained deep features to model the target, whereas we learn multi-channel-aware deep features and adaptive hierarchical deep features to obtain more discriminative features. CF2 also uses hierarchical deep features to model the target; however, the weight of each layer's contribution to the target representation is fixed in advance. In contrast, the hierarchical weights of the proposed algorithm are derived from the performance of each frame and are adaptively updated. MemTrack and MemDTC preserve the most recent appearance information of the target by introducing a memory network, and they are similar to the proposed algorithm in the LR, OC, and OV scenarios; however, there are still some gaps. It can also be seen that the proposed algorithm performs slightly worse in the IR and IV scenes, which indicates room for improvement in in-plane rotation and strong illumination-change scenarios.

Evaluation on TC-128 Benchmark
TC-128 [27] is a dataset focusing on color information, which contains 128 video sequences to test the performance of trackers. On this dataset, we compared our tracker with several other excellent trackers, including ECO [31], CREST [51], HCFTstar [53], CF2 [22], CACF [54], KCF [6], DSST [49], LOT [55], and CSK [56]. The results show that our tracker ranks second in both the precision-rate and success-rate metrics. Figure 10 shows the performance of all algorithms.
As shown in Figure 10, the success rate and precision rate of the proposed tracker reach 54.5% and 73.8%, respectively, slightly below the 55.2% and 74% reached by ECO. The reason may be that ECO combines deep features and color features, and since the TC-128 dataset is designed around the color information of objects, extracting color features benefits target modeling there. However, ECO's complex feature extraction limits its tracking speed to only 8 FPS, which cannot meet the requirements of real-time tracking, while our tracker reaches a speed of 29 FPS. Moreover, our tracker has a 5% higher success rate and a 4.6% higher precision rate than CF2, which also uses multi-layer deep features. Meanwhile, trackers based on hand-crafted features, such as KCF, CSK, and DSST, are much less effective than the trackers that use deep features. The attribute-based results are presented in Tables 3 and 4. From Tables 3 and 4, it is clear that the algorithm proposed in this paper performs well on these challenges. It also outperforms CF2, which likewise uses hierarchical deep features, in overall performance. The CREST algorithm, which uses only one layer of deep features, performs worse than our algorithm, illustrating the benefit of adaptive hierarchical deep features. However, the proposed algorithm performs relatively poorly in the two challenges of deformation and motion blur. The reason may be that rapid deformation blurs the object's appearance, so the most significant features of the target are affected; the model therefore does not learn more discriminative features, and its ability to distinguish the background is reduced. In follow-up work, we will continue to study this problem and try to achieve an improvement.

Evaluation on UAV123 Benchmark
UAV-123 [26] is a dataset consisting of low-altitude UAV-captured videos, which is fundamentally different from the videos in mainstream tracking datasets such as OTB50 and VOT2014. It contains a total of 123 video sequences and over 110k frames. Unmanned aerial vehicles (UAVs) are increasingly used in daily life, so it is of practical significance to test the proposed algorithm on this dataset. We tested our algorithm on UAV123, using the same evaluation method as the OTB dataset, against ten other algorithms: SRDCF [50], CREST [51], CF2 [22], SiamRPN [34], DSST [49], Struck [57], ECO [31], TADT [13], KCF [6], and CSK [56]. The comparison results are shown in Figure 11.
As shown in Figure 11, thanks to the proposed method, our tracker achieved a 53.9% success rate and a 76.1% precision rate on UAV-123, higher than CF2 and SRDCF, which also use deep features, and outperformed ECO, TADT, and CREST by 1.4% and 2.0%, 2.6% and 3.7%, and 5.8% and 8.3% in success and precision rates, respectively. As the UAV123 dataset contains many UAV aerial images, the tracked targets are generally small, so learning a more discriminative target feature is especially important. Compared with ECO, which uses a complex computational strategy for feature selection, the proposed algorithm can more accurately identify these small targets. Similar to ECO, TADT also performs feature reduction, designing a regression loss and a ranking loss to learn more effective target features; however, its learned features are less accurate than those of the proposed algorithm when facing smaller targets, so its tracking effect is average. Figure 11. Success and precision rates on the UAV-123 dataset.
Using end-to-end training on a large-scale image dataset and introducing a region proposal network, SiamRPN achieves a higher precision rate than our tracker. However, as it uses ordinary deep features, its performance is weaker than the proposed tracker when similar targets interfere. This can also be verified in the Lemming and Girl2 sequences in Figure 9. Similarly, trackers using hand-crafted features, e.g., KCF, Struck, and DSST, all perform worse than trackers using deep features.

Evaluation on VOT2016 Benchmark
VOT2016 [28] is a very popular dataset in the field of target tracking, in which sample coordinates are automatically annotated. It uses two metrics, accuracy and robustness, to evaluate the performance of a tracker, as these two are the least correlated of the several evaluation metrics used in target tracking, which avoids interference between them. The Expected Average Overlap (EAO) was introduced to rank the algorithms, which reflects some issues better than the OTB dataset. We used VOT-2016 to evaluate our tracker and compared it with several other trackers.
We selected 11 trackers, including TADT [13], Staple [48], SA-Siam [35], DeepSRDCF [29], MDNet [10], SRDCF [50], CF2 [22], DAT [58], SAMF [59], DSST [49], and KCF [6]. To ensure a fair comparison, the results of the other algorithms were downloaded from the VOT-2016 official website. Figure 12 shows the EAO ranking results; it can be seen that our tracker outperforms TADT, which is innovative in feature modeling, and is in first position. Table 5 shows more detailed comparison information, including the EAO, Overlap, and Failures scores, and our tracker leads in all three metrics.
From Table 5, we can see that the proposed algorithm achieves the highest EAO, which indicates its robustness. The proposed algorithm performs better than our baseline algorithm SiamFC on EAO, Overlap, and Failures, which reflects the effectiveness of the proposed multi-channel-aware deep features and adaptive hierarchical deep features. The last column of Table 5 shows that the proposed algorithm has the lowest tracking-failure rate, which means its predictions deviate least from the ground truth. The experimental results further demonstrate the effectiveness of the proposed method.

Ablation Studies
The baseline algorithm of the proposed method is SiamFC, to which we introduce a multi-channel-aware deep-feature module and an adaptive hierarchical deep-feature module. To test the effectiveness of the proposed modules, we conducted ablation experiments comparing the performance of the individual modules and the overall algorithm against the baseline tracker SiamFC.
We tested the two modules separately on OTB100. It is easy to see that the effect of a single module is not as good as that of the two modules acting together. Figure 13 compares the success rate and precision rate of these variants on the OTB100 benchmark, where the Ours-WLR and Ours-WCR variants achieved success and precision rates of 63.0% and 83.3%, and 63.0% and 83.4%, respectively. Therefore, combining the two reliability modules achieves the best performance, which is also a significant improvement over the baseline tracker SiamFC. Figure 13. Comparison of the two modules when they act separately. Ours shows the effect when the two modules work together; Ours-WCR represents the variant without the multi-channel-aware deep features; Ours-WLR represents the variant without the adaptive hierarchical deep features. SiamFC is our baseline algorithm.

Conclusions
This paper proposes a novel scheme to learn target-aware deep features, including the learning of multi-channel-aware deep features and adaptive hierarchical deep features. The proposed mechanism can focus on modeling the target appearance, effectively deal with changes in the target appearance, and suppress the interference of background information. The proposed multi-channel-aware deep-feature learning module can focus on the important information in the channels, and the proposed adaptive hierarchical deep-feature module can obtain adaptive feature-layer fusion weights. Finally, the two modules work together to enhance the discriminative ability of the tracker. We combine the proposed model with the Siamese framework and prove its effectiveness. In conclusion, this paper proposes a new approach to better utilize the feature-modeling abilities of pre-trained neural networks, and extensive experimental results on several datasets show that the proposed method performs well.
From a comparative analysis on different datasets, it can be seen that, compared with methods that use only single-layer features to model the target, using hierarchical deep features yields more discriminative target features. Compared with methods that use complex computational strategies for feature dimensionality reduction, our method is much simpler computationally and can achieve real-time performance. Compared with memory-network-based methods, the proposed method has no complicated model-update strategy and does not occupy too much memory, which also benefits the efficient use of hardware resources. However, the proposed method performs poorly in some specific scenarios. In future research, we will analyze the reasons for this and try to solve these problems.
In future research, we plan to investigate the use of meta-learning [60] methods to generate an optimal set of initialization parameters, so that the network can be trained online using the reliable target information in the first frame, allowing it to converge faster and obtain a better set of weights for the feature layers and feature channels. A more interesting plan is to enhance the feature representation using the multi-head attention mechanism proposed in the Transformer [61] to further improve performance.