Object-Aware Adaptive Convolution Kernel Attention Mechanism in Siamese Network for Visual Tracking

: As a classic framework for visual object tracking, the Siamese convolutional neural network has received widespread attention from the research community. This method uses a convolutional neural network to obtain the object features and to match them with the search area features to achieve object tracking. In this work, we observe that the contribution of each convolution kernel in the convolutional neural network for object tracking tasks is different. We propose an object-aware convolution kernel attention mechanism. Based on the characteristics of each object, the convolution kernel features are dynamically weighted to improve the expression ability of object features. The experiments performed using OTB and VOT benchmark datasets show that the performance of the tracking method fused with the convolution kernel attention mechanism is signiﬁcantly better compared with the original method. Moreover, the attention mechanism can also be integrated with other tracking frameworks as an independent module to improve the performance.


Introduction
Visual object tracking is a fundamental task in the field of computer vision. It has a wide range of application prospects in autopilot, robot navigation, drone cruise, intelligent security, human-computer interaction, etc. In the literature, researchers have proposed various visual object tracking methods. The visual tracking techniques are based on correlation filter-based methods, deep learning-based methods, and transformer-based methods. Recently, the methods based on deep learning have attracted the attention of the research community due to their advantage of end-to-end learning fashion. In particular, the object tracking methods based on Siamese convolutional neural networks have been developed rapidly due to their simple structures and efficiency. The Siamese convolutional neural network-based object tracking method uses deep convolutional features to describe objects. The deep convolutional features are extracted using multiple convolutional and activation layers. The feature maps of different layers have different characteristics. The use of features obtained using different convolution kernels to realize the effective distinction between the object and the background is a key issue for improving the effectiveness of features.
Based on the aforementioned analysis, in this work, we propose an adaptive convolution kernel attention mechanism for visual object tracking tasks. The main contributions of this paper include:

•
The paper analyzes the effects of different convolution kernels on object tracking tasks and proves the necessity of convolution kernel attention.

•
The paper analyzes the problems with the offline training-based attention mechanism in the tracking task and finds out the reason why this attention method is not suitable for object tracking.

•
The paper presents an object-aware convolution kernel attention module. The proposed module improves the specificity of object features for different individuals based on the difference between each convolution kernel feature.
The proposed attention mechanism can be used as a general convolution kernel attention module and has good versatility. It can be used as a "plug and play" module by embedding it in various object tracking methods based on convolution feature matching, thus significantly improving the tracking performance of the original method. Finally, we combine the proposed convolution kernel attention module with classic object tracking methods. We perform experiments using OTB and VOT benchmark datasets to verify that the proposed technique significantly improves the object tracking performance.

Related Works
With the development of deep learning, the visual object tracking methods based on deep learning have received widespread attention from the research community. Among them, the Siamese convolutional neural network has become a research hotspot due to its simple network structure and end-to-end training ability.
The Siamese convolutional neural network was first proposed by Chopra et al. in 2005 [1] for face similarity discrimination. Bertinetto et al. [2] used this architecture for object tracking and proposed the SiamFC tracking method. Based on this method, Jack et al. [3] embedded the correlation filter as a differentiable layer in the SiamFC architecture and proposed the CFNet tracking method. Ran et al. [4] replaced the AlexNet with a betterperforming VGG network and used region pooling to further improve the tracking performance. Wang et al. [5] added three attention modules to the template branch of the Siamese convolutional neural network to improve its ability to select object features. He et al. [6] proposed the SA-Siam method based on the original Siamese convolutional neural network and added a new group of Siamese networks. One group of networks tracks the object based on the appearance features of the image and the other group uses the semantic features of the template and the search area for tracking. The tracking results of the two groups of networks are merged to obtain the final tracking result, which further improves the robustness in complex scenarios. Gong et al. [7] proposed an online fast update method to reduce the computational complexity of the Siamese convolutional neural network. This method realizes the online fine-tuning of network weights. Among these derivative algorithms, the SiamRPN [8] method proposed by Li et al. combined the RPN module in Faster RCNN [9] with the Siamese convolutional neural network. After extracting the appearance features of the template image and searching the image, it is used as the input of the classification network in the RPN at the same time. By using the regression network, the tracking problem is transformed into a one-shot detection problem. This method is able to achieve excellent tracking performance. Its enhanced version, DaSiamRPN [10], aims to address the shortcoming of the original method, i.e., distinguishing objects of the same category, by adding interference training. This improves the network's ability to identify a single tracking object and avoid tracking failures caused by interference from similar objects of the same type.
With the development of tracking methods, the researchers have established that the deeper networks, such as ResNet, cannot achieve better results in the Siamese convolutional neural network tracking frameworks. In order to address this problem, Li et al. further studied the strict conversion invariance requirements implicit in the Siamese network tracking framework and proposed a method for applying deep networks in the Siamese neural network tracking frameworks [11]. Zhang et al. [12] proposed guidelines for the design of Siamese neural networks for object tracking. Wang et al. [13] proposed an unsupervised method for training Siamese neural networks. Zhou et al. [14] added a classification regression positioning branch in the original Siamese neural network, which improves the accuracy and robustness of the object tracking method. Zhang et al. [15] built a new type of Siamese neural network based on the inverse residual network to improve the performance of object tracking. Zeng et al. [16] used point regression as the main method to realize a coarse-fine classification network for visual tracking. Mu et al. [17] used a multilayer Siamese network to consume high-level features for correcting low-level features and enhance the object response while maintaining object details to achieve real-time UAV tracking. It is evident that the object tracking method based on the Siamese network has become an important framework in the object tracking field as it has huge development potential in object tracking technology.
In the object tracking method based on the Siamese convolutional neural network, the main focus is to optimize and improve the network architecture. There are a few studies that discuss the role of different convolution kernel features in the object tracking process.
In this work, we analyze the characteristics of different convolution kernels in the Siamese convolutional neural network. Moreover, we propose a new convolution kernel attention module to improve the performance of the object tracking method.

Convolution Kernel Feature Analysis
Usually, the convolutional neural networks have multiple convolutional layers, and each convolutional layer has multiple convolution kernels. Therefore, analyzing the response of each convolutional layer and convolution kernel can provide useful information for improving the object tracking method. First, in this work, we regard each frame in the video as an independent static image. After the image is passed through the convolutional neural network, a series of convolutional feature maps are formed. Based on the different levels of convolution feature extraction, those features are divided into low-level convolution features, middle-level convolution features, and high-level convolution features. In other words, the feature maps of different levels denote different levels of information extracted from the input image. For simplicity, in this work, we choose AlexNet in the SiameseFC tracking method to extract the convolutional features of each convolutional layer from the image. Then, we obtain the response of the object based on cross-matching, as shown in Figure 1. Since there are many channels in the high-level convolution features, only some typical features are extracted in order to highlight the key points.
tracking framework and proposed a method for applying deep networks in the Siamese neural network tracking frameworks [11]. Zhang et al. [12] proposed guidelines for th design of Siamese neural networks for object tracking. Wang et al. [13] proposed an unsu pervised method for training Siamese neural networks. Zhou et al. [14] added a classifi cation regression positioning branch in the original Siamese neural network, which im proves the accuracy and robustness of the object tracking method. Zhang et al. [15] buil a new type of Siamese neural network based on the inverse residual network to improve the performance of object tracking. Zeng et al. [16] used point regression as the main method to realize a coarse-fine classification network for visual tracking. Mu et al. [17 used a multi-layer Siamese network to consume high-level features for correcting low level features and enhance the object response while maintaining object details to achiev real-time UAV tracking. It is evident that the object tracking method based on the Siames network has become an important framework in the object tracking field as it has hug development potential in object tracking technology.
In the object tracking method based on the Siamese convolutional neural network the main focus is to optimize and improve the network architecture. There are a few stud ies that discuss the role of different convolution kernel features in the object tracking pro cess. In this work, we analyze the characteristics of different convolution kernels in th Siamese convolutional neural network. Moreover, we propose a new convolution kerne attention module to improve the performance of the object tracking method.

Convolution Kernel Feature Analysis
Usually, the convolutional neural networks have multiple convolutional layers, and each convolutional layer has multiple convolution kernels. Therefore, analyzing the re sponse of each convolutional layer and convolution kernel can provide useful information for improving the object tracking method. First, in this work, we regard each frame in th video as an independent static image. After the image is passed through the convolutiona neural network, a series of convolutional feature maps are formed. Based on the differen levels of convolution feature extraction, those features are divided into low-level convo lution features, middle-level convolution features, and high-level convolution features. In other words, the feature maps of different levels denote different levels of information extracted from the input image. For simplicity, in this work, we choose AlexNet in the SiameseFC tracking method to extract the convolutional features of each convolutiona layer from the image. Then, we obtain the response of the object based on cross-matching as shown in Figure 1. Since there are many channels in the high-level convolution features only some typical features are extracted in order to highlight the key points.  Ma et al. [18] show that the high-level features (Conv5) mainly reflect the semantic features of the input image. By matching the semantic features, the object to be tracked can be distinguished from the background area. However, since the receptive field of a single pixel in high-level convolution features is relatively large, relying only on high-level convolution features in object tracking may provide an inaccurate position, which may reduce the tracking accuracy while ensuring robustness. At the same time, the low-level features (Conv2) retain more detailed information describing the instance level information, such as the boundary and texture of the object. Ma et al. [18], Zhu et al. [17], etc. have proposed to make full use of multi-layer features in order to improve the object features and eventually object tracking performance. Based on the feature analysis of the aforementioned convolutional layers, in this work, we further explore the role of different convolution kernels in each layer of convolutional features for specific tracking tasks.
The basic assumptions of this work are as follows.

•
In the object tracking method, the objects, and scenes to be tracked are diverse.

•
In different tracking scenarios, the convolution kernel activated by the object in a convolution layer may be different.

•
During the tracking process, if all the convolution kernel features are used for feature matching, then the pertinence of the object is weakened, resulting in a lack of distinct characteristics of the object.
In order to verify these hypotheses, we use the "blurcar1", "coke", "david", "soccer", and "toy" video sequences in the benchmark set to perform experiments. The response of each object for different convolution kernels in the Conv5 convolution layer is shown in Figure 2. Ma et al. [18] show that the high-level features (Conv5) mainly reflect the s features of the input image. By matching the semantic features, the object to be can be distinguished from the background area. However, since the receptive fi single pixel in high-level convolution features is relatively large, relying only o level convolution features in object tracking may provide an inaccurate position may reduce the tracking accuracy while ensuring robustness. At the same time, level features (Conv2) retain more detailed information describing the instance formation, such as the boundary and texture of the object. Ma et al. [18], Zhu et etc. have proposed to make full use of multi-layer features in order to improve th features and eventually object tracking performance. Based on the feature analys aforementioned convolutional layers, in this work, we further explore the role of d convolution kernels in each layer of convolutional features for specific tracking ta The basic assumptions of this work are as follows.

•
In the object tracking method, the objects, and scenes to be tracked are diver • In different tracking scenarios, the convolution kernel activated by the ob convolution layer may be different.

•
During the tracking process, if all the convolution kernel features are used for matching, then the pertinence of the object is weakened, resulting in a lack of characteristics of the object. In order to verify these hypotheses, we use the "blurcar1", "coke", "david", " and "toy" video sequences in the benchmark set to perform experiments. The resp each object for different convolution kernels in the Conv5 convolution layer is sh  As shown in Figure 2, the objects and environments in different video sequen form differently for each convolution kernel in the Conv5 convolutional layer. sponse position of the "blurcar1" sequence in the third convolution kernel mat position of the object contained in the image. The "coke", "david", "soccer", an video sequences have better activation effects in the 113th, 124th, 171st, an As shown in Figure 2, the objects and environments in different video sequences perform differently for each convolution kernel in the Conv5 convolutional layer. The response position of the "blurcar1" sequence in the third convolution kernel matches the position of the object contained in the image. The "coke", "david", "soccer", and "toy" video sequences have better activation effects in the 113th, 124th, 171st, and 235th convolution kernels of Conv5 layer, respectively. It is evident from Figure 2 that the response maps of other convolution kernels are sparse and contain a lot of noise. Therefore, they are unable to reflect the location of the object in the image. In order to further analyze the contribution of different convolution kernels in object tracking tasks, we select two typical video sequences "soccer" and "jump" from the OTB benchmark dataset and analyze the matching response of the object in each convolution kernel. The response maps of each convolution kernel in the "soccer" video sequence are shown in Figure 3.
Appl. Sci. 2022, 12, x FOR PEER REVIEW convolution kernels of Conv5 layer, respectively. It is evident from Figure 2 that sponse maps of other convolution kernels are sparse and contain a lot of noise. The they are unable to reflect the location of the object in the image. In order to further a the contribution of different convolution kernels in object tracking tasks, we sele typical video sequences "soccer" and "jump" from the OTB benchmark dataset an lyze the matching response of the object in each convolution kernel. The respons of each convolution kernel in the "soccer" video sequence are shown in Figure 3.  The leftmost image in Figure 3 presents the object tracking result, and the image on the left denotes the matching response map of the object template and the area. It is evident that the quality of the object response map is relatively good, an is a significant peak at the real position of the object. However, it should be noted t final response map is the result of the common matching of all convolution kernel f in the object convolution feature. If each convolution kernel feature of the object im extracted separately and the matching response is performed, a response map of the in each convolution kernel is obtained as shown in the figure. The figure also show some convolution kernels, such as 1-17, 20, 22, 26, and 30, have significant peaks object area (in this case, the central area). The overall distribution is similar to a t mensional Gaussian distribution, i.e., the convolution kernel in this part makes t mary contribution to the final object response map. At the same time, the response of other convolution kernels, such as 18, 19, 21, 23-25, 27-29, 31, and 32, either h significant peaks or have peaks at wrong positions. These convolution kernel featu not contribute to the final response map and may degrade its quality.
Similar to the "soccer" video sequence, the response maps of the "jump" vi quence for each convolution kernel are shown in Figure 4. The leftmost image in Figure 3 presents the object tracking result, and the second image on the left denotes the matching response map of the object template and the search area. It is evident that the quality of the object response map is relatively good, and there is a significant peak at the real position of the object. However, it should be noted that the final response map is the result of the common matching of all convolution kernel features in the object convolution feature. If each convolution kernel feature of the object image is extracted separately and the matching response is performed, a response map of the object in each convolution kernel is obtained as shown in the figure. The figure also shows that some convolution kernels, such as 1-17, 20, 22, 26, and 30, have significant peaks in the object area (in this case, the central area). The overall distribution is similar to a two-dimensional Gaussian distribution, i.e., the convolution kernel in this part makes the primary contribution to the final object response map. At the same time, the response graphs of other convolution kernels, such as 18, 19, 21, 23-25, 27-29, 31, and 32, either have no significant peaks or have peaks at wrong positions. These convolution kernel features do not contribute to the final response map and may degrade its quality.
Similar to the "soccer" video sequence, the response maps of the "jump" video sequence for each convolution kernel are shown in  The response maps of the convolution kernel for the "jump" video sequence are similar to that of the "soccer" video sequence. The leftmost image in the figure denotes the result of object tracking, the second image on the left denotes the object matching response map, and the image on the right side of the figure represents the independent response map of each convolution kernel. As presented in the figure, for the "jump" video sequence, the response quality of each convolution kernel is also quite different. Among them, 2-17, 26, and 30 have better response quality and those significant peaks are in the center of the object. On the other hand, the response quality of other convolution kernels is poor. It should be noted that the convolution kernel with a good response for the "jump" video sequence is different from the "soccer" video sequence, which means that for different objects,s the convolution kernels that contribute to the object response map are different.
Based on the aforementioned analysis, it can be concluded that for different tracking objects and in different tracking environments, the response levels of different convolution kernels in a convolutional layer for the same object are inconsistent. Consequently, their contribution to the tracking task is also inconsistent. Therefore, it is impossible to adapt to all objects by modifying the convolution structure of the Siamese convolutional neural network at one time. In order to achieve stable object tracking, it is necessary that the object tracking method adaptively selects different convolution kernels for different objects, makes full use of the useful features in convolution features, suppresses inefficient features, and achieves stable matching of object features. Based on the aforementioned analysis regarding the characteristics of the convolution kernel, we propose an objectaware adaptive convolution kernel attention module.

Analysis of the Attention Mechanism in the Pre-Training Stage
A typical attention module usually consists of a sub-neural network connected to the original neural network. The attention module and the original model are trained simultaneously to obtain the optimal parameters. In this work, we present the convolution kernel-oriented "Squeeze-and-Excitation" attention mechanism [19] (referred to as the SE attention module) based on the Siamese network object tracking framework. While the Siamese network uses a series of image pairs to train the network, the SE attention module has the ability to simultaneously learn the weight of each convolution kernel in the object tracking task and improve the neural network's feature extraction of the object sensitive area. The architecture of a typical SE attention module is shown in Figure 5. The response maps of the convolution kernel for the "jump" video sequence are similar to that of the "soccer" video sequence. The leftmost image in the figure denotes the result of object tracking, the second image on the left denotes the object matching response map, and the image on the right side of the figure represents the independent response map of each convolution kernel. As presented in the figure, for the "jump" video sequence, the response quality of each convolution kernel is also quite different. Among them, 2-17, 26, and 30 have better response quality and those significant peaks are in the center of the object. On the other hand, the response quality of other convolution kernels is poor. It should be noted that the convolution kernel with a good response for the "jump" video sequence is different from the "soccer" video sequence, which means that for different objects, the convolution kernels that contribute to the object response map are different.
Based on the aforementioned analysis, it can be concluded that for different tracking objects and in different tracking environments, the response levels of different convolution kernels in a convolutional layer for the same object are inconsistent. Consequently, their contribution to the tracking task is also inconsistent. Therefore, it is impossible to adapt to all objects by modifying the convolution structure of the Siamese convolutional neural network at one time. In order to achieve stable object tracking, it is necessary that the object tracking method adaptively selects different convolution kernels for different objects, makes full use of the useful features in convolution features, suppresses inefficient features, and achieves stable matching of object features. Based on the aforementioned analysis regarding the characteristics of the convolution kernel, we propose an object-aware adaptive convolution kernel attention module.

Analysis of the Attention Mechanism in the Pre-Training Stage
A typical attention module usually consists of a sub-neural network connected to the original neural network. The attention module and the original model are trained simultaneously to obtain the optimal parameters. In this work, we present the convolution kernel-oriented "Squeeze-and-Excitation" attention mechanism [19] (referred to as the SE attention module) based on the Siamese network object tracking framework. While the Siamese network uses a series of image pairs to train the network, the SE attention module has the ability to simultaneously learn the weight of each convolution kernel in the object tracking task and improve the neural network's feature extraction of the object sensitive area. The architecture of a typical SE attention module is shown in Figure 5. Appl Figure 5. The architecture of the SE attention module.
where represents the output convolution feature by a certain layer and (•) denotes a global information compression method for convolution feature . Here, this method is implemented using global average pooling. The calculation process is expressed as follows: is a weighting process for , where represents the parameters used in the weighting process. These parameters are obtained automatically during end-to-end training of the deep learning network. When embedding the SE attention module in the Siamese network, we use ReLU and a fully connected layer to realize ex (⋅, ). Each convolution kernel (⋅,⋅) uses the scalar ′ to multiply the original convolution feature ∈ ℝ × to achieve the weighting of the feature. The (⋅,⋅) output weighted convolutional layer features are denoted by ̃. The core of the SE attention module lies in the acquisition of the weighted parameter . Under normal circumstances, the SE attention module should be combined and synchronized with the backbone network for training. The network architecture after embedding the SE attention module in the Siamese network object tracking framework is shown in Figure 6.   Figure 6. The structure of Siamese neural network with SE attention module.
During the training process, the SE attention module in the Siamese network learns the importance of each convolution kernel in the corresponding convolution layer based on back propagation. During the tracking process, the attention mechanism uses the importance weight of each convolution kernel to adjust the object convolution feature in order to improve the discrimination of the convolution feature and assist the object template correlation calculation for obtaining a more significant object response. It should be noted that the two branches of the Siamese network share the same parameters and that only one branch needs to be trained during the network training. In this work, we implement the aforementioned Siamese network architecture based on the Matconvnet framework [20] and embed the SE attention module in the Conv3, Conv4, and Conv5 convolutional layers of the SiamFC. We train the network using the same training process used by the classic Siamese network tracking method. After the training is completed, we use a Siamese network with SE attention mechanism to re-test the "soccer" video sequence and where U represents the output convolution feature by a certain layer and F sq (·) denotes a global information compression method for convolution feature U. Here, this method is implemented using global average pooling. The calculation process is expressed as follows: F ex (·, W) is a weighting process for z c , where W represents the parameters used in the weighting process. These parameters are obtained automatically during end-to-end training of the deep learning network. When embedding the SE attention module in the Siamese network, we use ReLU and a fully connected layer to realize F ex (·, W). Each convolution kernel F scale (·, ·) uses the scalar z c to multiply the original convolution feature u c ∈ R H×W to achieve the weighting of the feature. The F scale (·, ·) output weighted convolutional layer features are denoted by U. The core of the SE attention module lies in the acquisition of the weighted parameter W. Under normal circumstances, the SE attention module should be combined and synchronized with the backbone network for training. The network architecture after embedding the SE attention module in the Siamese network object tracking framework is shown in Figure 6. where represents the output convolution feature by a certain layer and (•) denotes a global information compression method for convolution feature . Here, this method is implemented using global average pooling. The calculation process is expressed as follows: is a weighting process for , where represents the parameters used in the weighting process. These parameters are obtained automatically during end-to-end training of the deep learning network. When embedding the SE attention module in the Siamese network, we use ReLU and a fully connected layer to realize ex (⋅, ). Each convolution kernel (⋅,⋅) uses the scalar ′ to multiply the original convolution feature ∈ ℝ × to achieve the weighting of the feature. The (⋅,⋅) output weighted convolutional layer features are denoted by ̃. The core of the SE attention module lies in the acquisition of the weighted parameter . Under normal circumstances, the SE attention module should be combined and synchronized with the backbone network for training. The network architecture after embedding the SE attention module in the Siamese network object tracking framework is shown in Figure 6.   Figure 6. The structure of Siamese neural network with SE attention module.
During the training process, the SE attention module in the Siamese network learns the importance of each convolution kernel in the corresponding convolution layer based on back propagation. During the tracking process, the attention mechanism uses the importance weight of each convolution kernel to adjust the object convolution feature in order to improve the discrimination of the convolution feature and assist the object template correlation calculation for obtaining a more significant object response. It should be noted that the two branches of the Siamese network share the same parameters and that only one branch needs to be trained during the network training. In this work, we implement the aforementioned Siamese network architecture based on the Matconvnet framework [20] and embed the SE attention module in the Conv3, Conv4, and Conv5 convolutional layers of the SiamFC. We train the network using the same training process used by the classic Siamese network tracking method. After the training is completed, we use a Siamese network with SE attention mechanism to re-test the "soccer" video sequence and During the training process, the SE attention module in the Siamese network learns the importance of each convolution kernel in the corresponding convolution layer based on back propagation. During the tracking process, the attention mechanism uses the importance weight of each convolution kernel to adjust the object convolution feature in order to improve the discrimination of the convolution feature and assist the object template correlation calculation for obtaining a more significant object response. It should be noted that the two branches of the Siamese network share the same parameters and that only one branch needs to be trained during the network training. In this work, we implement the aforementioned Siamese network architecture based on the Matconvnet framework [20] and embed the SE attention module in the Conv3, Conv4, and Conv5 convolutional layers of the SiamFC. We train the network using the same training process used by the classic Siamese network tracking method. After the training is completed, we use a Siamese network with SE attention mechanism to re-test the "soccer" video sequence and analyze its convolution kernel weights. The results and assigned weights of each convolution corresponding to the object are shown in Figure 7. analyze its convolution kernel weights. The results and assigned weights of each convolution corresponding to the object are shown in Figure 7.  In Figure 7, the number below each convolution kernel response map represents the weight given by the SE attention module. It is evident that the weights of the SE attention module are not clearly distinguishable. The weights of most convolution kernels are between 0.4 and 0.6 and are not reasonably allocated according to the contribution of the convolution kernel. After analyzing the training process, we observe that the main reason for the failure of the SE attention module is that its weight acquisition method does not match the object tracking task and the Siamese network training method. Please note that the Siamese network is trained using a series of image pairs. Each pair contains the object template and the object search area. In order to obtain good generalization performance, the image pairs are acquired from various scenarios in the real world, such as pedestrians, vehicles, animals, etc. The analysis presented in Part 3 shows that the convolution kernels activated by different types of objects are different. This means that during the training process of the Siamese network, the meaning of each convolutional layer is constantly changing to correspond to the training data. Therefore, in this case, the convolution kernel weights learned by the SE attention module are repeatedly adjusted due to different training samples and are unable to converge. It is noteworthy that the attention mechanism commonly used for tasks, such as object detection, does not meet the requirements of object tracking. Therefore, a new convolution kernel weighting mechanism that meets the essential characteristics of object tracking is a dire need.

Object-Aware Convolution Kernel Attention Mechanism
Based on the analysis presented in this work, it is evident that the convolution kernels activated by different objects are different. The weighted selection of convolution kernels cannot be performed by the attention module used in ordinary end-to-end training. In order to effectively address this problem, we combine the characteristics of object tracking tasks and propose a new convolution kernel attention module for each individual object. This attention mechanism is different from the attention module traditionally applied during the training phase. The attention weight acquisition is not based on the training phase of the Siamese network. Instead, one-time training is performed at the beginning of the object tracking task. (1) At the beginning of the object tracking task, the object template is generated using the object position given in the initial frame, and the object template convolution feature is obtained by using the template branch of the Siamese network. (2) At the original object location, the search area is divided in the initial frame according to the In Figure 7, the number below each convolution kernel response map represents the weight given by the SE attention module. It is evident that the weights of the SE attention module are not clearly distinguishable. The weights of most convolution kernels are between 0.4 and 0.6 and are not reasonably allocated according to the contribution of the convolution kernel. After analyzing the training process, we observe that the main reason for the failure of the SE attention module is that its weight acquisition method does not match the object tracking task and the Siamese network training method. Please note that the Siamese network is trained using a series of image pairs. Each pair contains the object template and the object search area. In order to obtain good generalization performance, the image pairs are acquired from various scenarios in the real world, such as pedestrians, vehicles, animals, etc. The analysis presented in Part 3 shows that the convolution kernels activated by different types of objects are different. This means that during the training process of the Siamese network, the meaning of each convolutional layer is constantly changing to correspond to the training data. Therefore, in this case, the convolution kernel weights learned by the SE attention module are repeatedly adjusted due to different training samples and are unable to converge. It is noteworthy that the attention mechanism commonly used for tasks, such as object detection, does not meet the requirements of object tracking. Therefore, a new convolution kernel weighting mechanism that meets the essential characteristics of object tracking is a dire need.

Object-Aware Convolution Kernel Attention Mechanism
Based on the analysis presented in this work, it is evident that the convolution kernels activated by different objects are different. The weighted selection of convolution kernels cannot be performed by the attention module used in ordinary end-to-end training. In order to effectively address this problem, we combine the characteristics of object tracking tasks and propose a new convolution kernel attention module for each individual object. This attention mechanism is different from the attention module traditionally applied during the training phase. The attention weight acquisition is not based on the training phase of the Siamese network. Instead, one-time training is performed at the beginning of the object tracking task. (1) At the beginning of the object tracking task, the object template is generated using the object position given in the initial frame, and the object template convolution feature is obtained by using the template branch of the Siamese network. (2) At the original object location, the search area is divided in the initial frame according to the size of the object search area set by the network. The candidate area branch of the Siamese network is used to obtain the convolution features of the candidate area. (3) The object template convolution feature and the candidate area convolution feature are both matched with each convolution kernel according to the template. At this time, the convolution kernel response maps with different response levels can be obtained. (4) Finally, based on the peak in each response map at the object position, we evaluate the support of the activated convolution kernel for the tracking task and set the convolution kernel weight used in this tracking. At the beginning of each tracking task, the above process is repeated, and different convolution kernel weight distributions are obtained based on different objects. The algorithm is shown in Table 1. Table 1. The proposed object-aware convolution kernel attention mechanism framework.

Input
The First Frame I of the Video, the Position B of the Object in the Image.

Output
Convolution kernel weight w.
Step 1 Based on the object position B, select the object image Z from the image I. Take the location B as the center, select the image X in the search area according to the requirements of the candidate area network.
Step 2 Get the convolution feature ϕ(Z) of the object image Z and the convolution feature ϕ(X) of the image X in the search area.
The number of channels of ϕ(Z) is equal to the number of channels of ϕ(X), denoted as C.
Step 3 for i = 1 to C Select the convolution feature ϕ i (Z) of the i-th channel of ϕ(Z) Select the convolution feature ϕ i (X) of the i-th channel of ϕ(X) where x 0 , y 0 represent the coordinates of the peak in the response map, and p x 0 ,y 0 represents the specific value of the peak. With the peak coordinate at the center, the area with radius r is called the peak area. The region outside the peak area is called the sidelobe area. µ s and σ s represent the average and standard deviation of all values in the sidelobe area, respectively. In this work, the value of psr is used to express the contribution of the convolution kernel response map for object tracking. It is used as the basis for the weight of the convolution kernel response map. It should be noted that the final response map of the Siamese network is estimated based on the forward propagation of the network. It is not convenient to directly assign the weights to each convolution kernel response map. The analysis of the computations of the Siamese network shows that directly assigning the obtained weight to the object template feature is equivalent to assigning the weight to each convolution kernel response map.
The calculation process of the response map for object matching is expressed as: where Z and X represent the object image and the search area image, respectively, ϕ(Z) denotes the convolution feature of the object image, ϕ(X) denotes the convolution feature of the search area image, and * represents the correlation operation. Essentially, the correlation operation of the multi-kernel convolution feature is the superposition of the correlation operation of each convolution kernel feature. Assuming that there are N convolution kernels, then: where ϕ i (Z) and ϕ i (X) represent the i-th convolution kernel feature of the object image and the search area image, respectively. ϕ i (Z) * ϕ i (X) denotes the correlation operation of the i-th convolution kernel feature, which is the response map of the i-th convolution kernel. Therefore, the final response map after weighting the different convolution kernel response maps is expressed as: where w i denotes the weight of each convolution kernel response map obtained using the aforementioned attention module and × represents the element-wise multiplication of the weight and the corresponding response map. Since in the Siamese network, the correlation operation of the convolution feature is essentially a convolution operation, each element of the response map f i (Z, X) in the aforementioned formula is expressed as f i (Z, X) u,v . Now, where (u, v) represents the coordinate of an element in the response map, (j, k) denotes the coordinate of an element in the convolution feature, M denotes the size of the object template convolution feature, ϕ i (Z) j,k denotes the element in the object convolution feature, ϕ i (X) j,k denotes the element in the convolution feature of the search area, and × represents multiplication by the element value. Therefore, the aforementioned expression is transformed as follows: or It is evident that (7) is equivalent to the following expression: or The aforementioned derivation shows that weighting the response maps of different convolution kernels is equivalent to weighting the features of the object template or the features of the search area. For convenience, we use the method of weighting the object template features to achieve the weighting of the convolution kernel response map.
Based on the SiamFC object tracking method, the attention method proposed above is combined with typical tracking sequences, such as "soccer" and "jump" to observe the weight assignment of each convolution kernel, as shown in Figures 8 and 9. The experimental results show that the attention module sets the weights according to the response of each convolution kernel. A high weight is assigned to a convolution kernel with a clear peak response at the object position. Similarly, a low weight is assigned to a convolution kernel without a clear object position. At the same time, since the attention module obtains weights at the beginning of the object tracking task, different convolution kernel weights can be assigned to different objects, which is more in line with the characteristics of the object tracking task. of each convolution kernel. A high weight is assigned to a convolution kernel with a clear peak response at the object position. Similarly, a low weight is assigned to a convolution kernel without a clear object position. At the same time, since the attention module obtains weights at the beginning of the object tracking task, different convolution kernel weights can be assigned to different objects, which is more in line with the characteristics of the object tracking task.  Figure 9. The proposed attention module weighting results for each convolution kernel ("jump" video sequence).

Experimental Design
In order to verify the effectiveness of the method proposed in this work, we use Ubuntu and Matconvnet [20] to implement the aforementioned convolution kernel attention model and combine it with the classic SiamFC tracking framework to form a new visual peak response at the object position. Similarly, a low weight is assigned to a convolution kernel without a clear object position. At the same time, since the attention module obtains weights at the beginning of the object tracking task, different convolution kernel weights can be assigned to different objects, which is more in line with the characteristics of the object tracking task.  Figure 9. The proposed attention module weighting results for each convolution kernel ("jump" video sequence).

Experimental Design
In order to verify the effectiveness of the method proposed in this work, we use Ubuntu and Matconvnet [20] to implement the aforementioned convolution kernel attention model and combine it with the classic SiamFC tracking framework to form a new visual

Experimental Design
In order to verify the effectiveness of the method proposed in this work, we use Ubuntu and Matconvnet [20] to implement the aforementioned convolution kernel attention model and combine it with the classic SiamFC tracking framework to form a new visual object tracking method, namely SiamFCCA. The proposed method is presented in Figure 10. In this work, the performance of the proposed SiamFCCA is tested by using the OTB [21,22] and VOT [23] benchmark datasets, and the results are compared with those of SiamFC and other existing object tracking methods. OTB and VOT are the two main benchmark datasets in the field of object tracking. By testing on the two benchmark datasets, the performance of the object tracking method can be reflected. The major implementation details of the model are as follows: object tracking method, namely SiamFCCA. The proposed method is presented in Figure 10. In this work, the performance of the proposed SiamFCCA is tested by using the OTB [21,22] and VOT [23] benchmark datasets, and the results are compared with those of SiamFC and other existing object tracking methods. OTB and VOT are the two main benchmark datasets in the field of object tracking. By testing on the two benchmark datasets, the performance of the object tracking method can be reflected. The major implementation details of the model are as follows: (1) Backbone convolutional neural network: We use AlexNet as the basis for building the Siamese network and remove the boundary expansion operations on the pooling and convolutional layers. The resolution of the object template is 127 × 127, and the size of the candidate area is twice that of the object template.
(2) Convolution kernel attention module: In this work, we calculate the convolution kernel weights of the three convolution layers, i.e., Conv3, Conv4 and Conv5, using the initial frame. These are used as the weights of all object image features and search area features in the later stages.
(3) The Siamese network still uses a similar training method used by SiamFC. During the training process, the two branches use the same parameters. Please note that the convolution kernel attention module does not participate in the training process.

Qualitative Analysis
We analyze the performance of SiamFCCA and other tracking methods on the OTB and VOT benchmark datasets. The tracking results of some classic video sequences are shown in Figure 11. (1) Backbone convolutional neural network: We use AlexNet as the basis for building the Siamese network and remove the boundary expansion operations on the pooling and convolutional layers. The resolution of the object template is 127 × 127, and the size of the candidate area is twice that of the object template.
(2) Convolution kernel attention module: In this work, we calculate the convolution kernel weights of the three convolution layers, i.e., Conv3, Conv4 and Conv5, using the initial frame. These are used as the weights of all object image features and search area features in the later stages.
(3) The Siamese network still uses a similar training method used by SiamFC. During the training process, the two branches use the same parameters. Please note that the convolution kernel attention module does not participate in the training process.

Qualitative Analysis
We analyze the performance of SiamFCCA and other tracking methods on the OTB and VOT benchmark datasets. The tracking results of some classic video sequences are shown in Figure 11. The aforementioned experimental results show that the proposed SiamFCCA performs efficiently in the presence of background clutter, low resolution, and fast-moving objects compared with the original SiamFC and other tracking methods. This mainly relies on SiamFCCA's effective extraction and fusion of object appearance features, which enhance the convolution kernel features that are conducive to expressing the object characteristics and reduce the response of interfering objects in the cluttered background. At the same time, based on the suppression of other interference peaks in the response map, the object can still correctly respond to the object position, even when it is moving rapidly. The aforementioned experimental results show that the proposed SiamFCCA performs efficiently in the presence of background clutter, low resolution, and fast-moving objects compared with the original SiamFC and other tracking methods. This mainly relies on SiamFCCA's effective extraction and fusion of object appearance features, which enhance the convolution kernel features that are conducive to expressing the object characteristics and reduce the response of interfering objects in the cluttered background. At the same time, based on the suppression of other interference peaks in the response map, the object can still correctly respond to the object position, even when it is moving rapidly.

Quantitative Analysis
In order to clearly evaluate the performance of each object tracking method, we use the quantitative indicators in the OTB and VOT benchmark datasets to test the performance of each object tracking method. The indicators include the precision plots and success plots in OTB and the expected average coverage (EAO) in VOT. We test the performance of each tracking method for all the video sequences available in the OTB benchmark set. The accuracy and success maps are shown in Figure 12. The aforementioned experimental results show that the proposed SiamFCCA performs efficiently in the presence of background clutter, low resolution, and fast-moving objects compared with the original SiamFC and other tracking methods. This mainly relies on SiamFCCA's effective extraction and fusion of object appearance features, which enhance the convolution kernel features that are conducive to expressing the object characteristics and reduce the response of interfering objects in the cluttered background. At the same time, based on the suppression of other interference peaks in the response map, the object can still correctly respond to the object position, even when it is moving rapidly.

Quantitative Analysis
In order to clearly evaluate the performance of each object tracking method, we use the quantitative indicators in the OTB and VOT benchmark datasets to test the performance of each object tracking method. The indicators include the precision plots and success plots in OTB and the expected average coverage (EAO) in VOT. We test the performance of each tracking method for all the video sequences available in the OTB benchmark set. The accuracy and success maps are shown in Figure 12.  Figure 12 shows that the overall performance of the proposed SiamFCCA on the OTB benchmark set is better compared with the improved SiamFC. Its accuracy and success rate are 6.8% and 7.1% better compared with SiamFC, respectively. In order to further  Figure 12 shows that the overall performance of the proposed SiamFCCA on the OTB benchmark set is better compared with the improved SiamFC. Its accuracy and success rate are 6.8% and 7.1% better compared with SiamFC, respectively. In order to further compare the performance of each tracking method in each sub-category of the OTB benchmark set, we estimate the precision data and success plots of each sub-category, as shown in Table 2 and Figure 13.   By analyzing the success plots and precision data, it is evident that the SiamFCCA performs better compared with SiamFC in terms of most indicators. By analyzing the experiments of each tracking method performed using the OTB benchmark set, we observe that when the SiamFCCA is integrated with the convolution kernel attention module, its overall performance improves significantly compared with the SiamFC and other tracking methods. Especially in the "In-plane Rotation", "Low Resolution" and "Out of View" sub-categories, the precision of SiamFCCA increases by 11.1%, 21.4%, and 10.1% compared with SiamFC, respectively.
We also test the proposed SiamFCCA object tracking method using the VOT benchmark set. The evaluation index is mainly reflected by the expected average coverage (EAO). The test results of each tracking benchmark set are shown in Table 3. It is evident from the test results that the tracking performance of the proposed SiamFCCA is better compared with SiamFC as the EAO score increases by 0.04 and its robustness score reduces by 14.98 compared with the original method.  [36] 0.1688 0.5225 44.0194 SSKCF [36] 0.1661 0.5231 40.3088 MOSSEca [37] 0.1405 0.3953 50.0759 KCF [24] 0.1351 0.4445 50.0994 Struck [25] 0.0960 0.4142 80.3253 DSST [38] 0.0793 0.3913 95.5587 IVT [39] 0.0761 0.3883 104.737 The aforementioned experiments show that the SiamFCCA object tracking method proposed in this work is effective. The convolution kernel attention module adjusts the weight of convolution kernel features according to the object characteristics. Therefore, when the object resolution is relatively small or there are a large number of interfering objects, the convolution features output by the model still have a high response to the object. The tracking method does not cause tracking failure due to the response of other interference features.

Conclusions
In this work, based on the different contributions of different convolution kernels to the object tracking task, the attention mechanism of the convolution kernel in the Siamese network is explored. We analyze the reasons due to which the traditional convolution kernel modules are not suitable for object tracking. In this work, we propose an objectaware convolution kernel attention module. The experiments show that the weight of the convolution kernel estimated by this module meets the requirements of the object tracking task. Finally, we combine the proposed convolution kernel attention module with the classic SiamFC tracking method and present a new tracking method named SiamFCCA.
The tests performed using the OTB and VOT benchmark datasets show that the proposed object tracking method has a significantly better performance compared with the original object tracking method.

Conflicts of Interest:
The authors declare no conflict of interest.