Fast and Accurate Visual Tracking with Group Convolution and Pixel-Level Correlation

Abstract: Visual object trackers based on Siamese networks perform well in visual object tracking (VOT); however, tracking accuracy degrades when the target undergoes fast motion, large scale changes, or occlusion. To address this problem and increase the inference speed of the tracker, this study proposes a fast and accurate Siamese-network-based visual tracker with group convolution and pixel-level correlation. The algorithm incorporates multi-layer feature information on the basis of Siamese networks. We designed a multi-scale feature aggregated channel attention block (MCA) and a global-to-local-information-fused spatial attention block (GSA), which enhance the feature extraction capability of the network. A pixel-level correlation operation matches the search region with the template region, which refines the bounding box and reduces background interference. Compared with the latest algorithms, our tracker improved the precision and success rates on the UAV123, OTB100, LaSOT, and GOT10K datasets and ran at 40 FPS, with better performance in complex scenes such as those with occlusion, illumination changes, and fast motion.


Introduction
As an active research topic in computer vision, visual object tracking has wide application prospects in security surveillance, intelligent transportation, autonomous driving, human-computer interaction, autonomous robotics, marine exploration, and military target identification and tracking. Early visual object trackers relied on correlation filtering; with the development of deep learning, convolutional neural networks have become widely used owing to their powerful feature extraction capabilities. A tracker is usually divided into three parts: a backbone network that extracts the target's features, a correlation step that matches the template features with the search features, and classification and regression sub-networks that predict the center and bounding box of the target. Siamese networks are widely used in object tracking with this structure.
SiamFC [1] first introduced Siamese networks to object tracking. In SiamFC, the template features are correlated with the search features, and the region with the largest response is taken as the target location. Since then, many works on Siamese networks in object tracking have followed. SiamRPN [2] introduced the region proposal network (RPN) structure from object detection to tracking, constructing two branches (one for the regression of the target bounding box and the other for the classification of the target), where multi-scale anchor boxes improve the performance under object scale changes. SiamRPN++ [3] addressed the poor results of deep networks, which are caused by the loss of strict translation invariance as the network deepens, and successfully used ResNet [4] and MobileNet [5] as backbone networks. SiamFC++ [6] removes the anchor boxes and adopts an anchor-free prediction style that requires no preset anchors.
In recent years, transformer structures have boomed in various fields of computer vision. TransT [7] uses a transformer structure as the correlation operation, which improves accuracy. Zhao et al. [8] used a transformer structure as the backbone network and utilized a decoder to reconstruct the target appearance within the search region so that the template is close to the search frame, rather than the search frame being directly related to the template image. In this way, the robustness of the tracker is enhanced even if the appearance of the target has changed. Gao et al. [9] proposed a one-and-a-half-stream structure that uses an adaptive token division method so that the search and template regions have self-attention and cross-attention, as in a two-stream structure, as well as advanced template interactions with the search region, as in a one-stream structure. This structure outperforms some two-stream and one-stream pipelines.
In object tracking, training datasets usually contain many videos and multiple forms of motion. Some annotations may be less accurate due to occlusion, and many samples resemble one another; thus, some trackers use data processing methods to improve the performance. Yang et al. [10] analyzed the dataset distribution in a low-level feature space and proposed a sample squeezing method to eliminate redundant samples, making the dataset more informative and increasing its diversity. Qi et al. [11] adaptively obtained a tight enclosing box for cases in which the target deforms or rotates and an ordinary bounding box cannot tightly enclose it. They also designed a classifier to determine whether the target is occluded, which helps to avoid collecting occluded samples for tracker updates and improves accuracy.
However, there are still some challenges in practical applications. Target appearance changes, illumination variation, and occlusion can affect the effectiveness of tracking.
Generally, different features of the object are extracted at different stages of the network. As shown by HDT [12], combining features from different layers improves the performance of the tracker. HDT uses an improved hedge algorithm to hedge weak trackers from each layer into a strong tracker. In this work, we perform feature fusion by concatenating features from different stages of the Siamese backbone network and fusing them with a 1 × 1 convolution, which can improve the accuracy of the algorithm. Meanwhile, to improve the detection speed, we use a group convolution for the dimensionality reduction. Compared with a normal convolution, a group convolution [13] with g groups reduces the number of parameters by a factor of g, which can speed up the operation. In the correlation stage, we use a pixel-level correlation operation as the matching method, which yields a correlation feature map with a smaller kernel size and a more diverse target representation. This reduces the interference of background clutter and preserves the target boundary and scale information, which benefits the subsequent prediction.
The main contributions of this work are as follows: (1) Feature fusion: we use not only the output feature map of the last layer for prediction but also fuse the feature maps of layers 3, 4, and 5 to produce the prediction; (2) Pixel-level correlation: the template features are decomposed into spatial features and channel features, which are matched with the search features, instead of correlating channel by channel; (3) Speed improvement: we use a group convolution for the dimensionality reduction, which reduces the number of parameters, and we reduce the use of activation functions and normalization in the backbone to speed up the detection; (4) New attention modules: we designed two new attention modules, namely, a multi-scale feature aggregated channel attention block (MCA) and a global-to-local-information-fused spatial attention block (GSA), which enable the network to focus on informative parts of the features and reduce the attention on useless parts, thus improving the performance and accuracy of the model.
The rest of this paper is organized as follows. In Section 2, we present research on object tracking based on Siamese networks published in recent years. Section 3 outlines the core of our tracker, comprising four parts that address the accuracy, light weight, and robustness of the algorithm. Section 4 is the experimental section, which presents an ablation study and a comparison of the results of different trackers on different datasets to analyze the validity of our work. Finally, we conclude the paper in Section 5.

Related Work
This section introduces the development of object tracking and some object trackers that have been reported in recent years. Object tracking algorithms can be divided into two categories: those based on correlation filtering and those based on deep learning. The methods based on correlation filtering include MOSSE [14], KCF [15], and DSST [16]. Correlation filtering introduces the convolution theorem from signal processing to object tracking and transforms the template matching problem into a correlation operation in the frequency domain. This method is fast but has only average accuracy in complex scenarios.
In recent years, with the development of deep learning technology and the establishment of large-scale datasets, object tracking algorithms based on convolutional neural networks have gradually emerged, among which Siamese network-based visual object trackers are particularly remarkable. A Siamese network consists of two sub-networks with the same structure and shared parameters and was initially used for image similarity analysis and metric learning. SINT [17] and SiamFC [1] first introduced Siamese networks to the visual object tracking field. SiamFC takes the template image and search sample as inputs, obtains the template feature map and search feature map, and then slides the template feature map over the search feature map as the correlation operation. The point with the largest response on the search feature map is considered the predicted target. SiamFC, as a fully convolutional network, has a simple structure and high tracking speed, and many subsequent works have been based on it. SiamRPN [2] introduced the RPN structure from object detection to the tracking field. One branch judges whether a region belongs to the foreground or background, and the other branch predicts the bounding box of the target. However, these algorithms only use shallow networks, and the tracking performance worsens with deep networks. SiamRPN++ [3] found that the accuracy of deep networks is reduced because strict translation invariance is broken, and that allowing the target to be shifted within a certain range near the center point during training alleviates the impact, enabling the successful application of deep networks in tracking algorithms. SiamFC++ [6] uses an anchor-free prediction head that does not set any anchor parameters, eliminating the effect of preset hyperparameters on the generalization ability of the algorithm. Some transformer structures have also been used in visual object tracking and have achieved good results.
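The sliding-window matching described for SiamFC can be illustrated with a minimal NumPy sketch. The function name `naive_xcorr`, the feature shapes, and the random inputs are illustrative assumptions and are not taken from SiamFC or from this paper.

```python
import numpy as np

def naive_xcorr(search, template):
    """Slide a (C, h, w) template over a (C, H, W) search feature map.

    Returns a single-channel response map of shape (H - h + 1, W - w + 1).
    """
    C, H, W = search.shape
    c, h, w = template.shape
    assert C == c, "template and search must have the same channel count"
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product between the template and the window at (i, j).
            out[i, j] = np.sum(search[:, i:i + h, j:j + w] * template)
    return out

rng = np.random.default_rng(0)
search = rng.normal(size=(8, 12, 12))           # hypothetical search features
template = search[:, 3:7, 5:9].copy()           # template cut out at row 3, column 5
response = naive_xcorr(search, template)
# The location of the largest response is taken as the predicted target position.
peak = np.unravel_index(np.argmax(response), response.shape)
```

Because the whole template acts as one kernel, neighbouring windows overlap heavily, which is the source of the blurred responses that pixel-level correlation is meant to avoid.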
Although these works achieved good results, their tracking accuracy decreases and their inference speed drops in the face of occlusion, object scale changes, background clutter, and other situations. In this paper, we adopt feature fusion and several simplifications for complex scenes to reduce the computational cost and improve accuracy at the same time, and we use pixel-level correlation to reduce the influence of background clutter and refine the object bounding box.

Proposed Method
In this section, we describe the network framework in detail. As shown in Figure 1, our model mainly consists of a Siamese network backbone and two sub-network detection heads for bounding box classification and regression. The Siamese backbone network is modified from ResNet50: inspired by the transformer structure, we reduce the use of activation functions and normalization and instead use channel attention [18] and spatial attention [19] modules in the classification and regression sub-networks to make the network more accurate in extracting features. Moreover, to improve the inference speed, a group convolution and a 1 × 1 convolution are used for the dimensionality reduction in the feature fusion stage; both accelerate the computation and reduce the inference time. The cross-correlation operation no longer uses depth-wise correlation [3]; instead, template features and search features are correlated with a pixel-level matching model, which can effectively reduce background clutter and allow the model to refine the object boundary and focus more on the target.
Appl. Sci. 2023, 13, x FOR PEER REVIEW

Figure 1. Network framework of the proposed tracker. The correlation symbol in the figure represents the pixel-level correlation method, which is presented in Section 3.2. The feature fusion model is presented in Section 3.3. The classification and regression sub-network using a dual-attention mechanism, CNN2, is presented in Section 3.4.

Siamese Backbone Framework
Deep convolutional neural networks have been successfully applied in the field of object tracking. Deeper backbones such as ResNet [4], ResNeXt [13], and MobileNet [5] have improved the performance of trackers. ResNet50, as a classical network, is robust and effective and is commonly used as the feature extraction backbone in trackers, usually with modifications that cater to the accuracy requirements of the tracking task.
Ren et al. [20] proposed Flow Alignment FPN (FAFPN) to align feature maps of different resolutions, solving the semantic misalignment problem that arises when fusing features of different layers. We set the strides of the conv4 and conv5 feature layers to 1 and remove the down-sampling operation so that the output resolution of the last three blocks is the same; meanwhile, to increase the receptive field, we use a dilated convolution [21], which has been proven to be effective for extracting more features. Transformers [22], as excellent model architectures, are widely used in various vision tasks. Compared with convolutional neural networks, transformers usually use fewer activation functions and normalization operations while achieving good results. Inspired by this, a similar method is applied in the backbone.
The original ResNet50 network uses a 7 × 7 convolution with a stride of 2 in the first layer, followed by max pooling, to complete a 4-fold down-sampling of the input image. A transformer divides the image into patches of the same size and feeds each patch into the network. Following this idea, we change the first layer of the network to a 4 × 4 convolution with a stride of 4, so that there is no overlap between adjacent convolution windows. Compared with the original layer, the convolutional kernel with K = 4 and S = 4 has a smaller kernel size and a larger stride. The computation and parameter numbers are shown in Equations (1) and (2):
$$\mathrm{FLOPs}_{7\times7} = 7 \times 7 \times 3 \times 64 \times \left(\frac{N}{2}\right)^{2} = 2352N^{2} \tag{1}$$
$$\mathrm{FLOPs}_{4\times4} = 4 \times 4 \times 3 \times 64 \times \left(\frac{N}{4}\right)^{2} = 192N^{2} \tag{2}$$
where N denotes the input size, and 3 and 64 are the input and output channels in the first layer of the network, leading to a significant reduction in computation.
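As a quick arithmetic check of the stem comparison, the following sketch counts weights and multiply-accumulate operations for the two first-layer designs. It assumes that the output side length is N divided by the stride, that biases are ignored, and that the max pooling of the original stem is left out of the count; the input size `N = 256` and the function names are hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def conv_macs(k, stride, c_in, c_out, n):
    """Multiply-accumulates for an n x n input, assuming an (n / stride)-sided output."""
    out_side = n // stride
    return conv_params(k, c_in, c_out) * out_side * out_side

N = 256  # hypothetical input side length
orig_params = conv_params(7, 3, 64)     # 7 x 7 kernel, 3 -> 64 channels
new_params = conv_params(4, 3, 64)      # 4 x 4 kernel, 3 -> 64 channels
orig_macs = conv_macs(7, 2, 3, 64, N)   # equals 2352 * N**2
new_macs = conv_macs(4, 4, 3, 64, N)    # equals 192 * N**2

print(orig_params, new_params)   # 9408 3072
print(orig_macs / new_macs)      # 12.25
```

Under these assumptions the patch-style stem needs roughly a third of the weights and about one twelfth of the multiply-accumulates of the 7 × 7 layer.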
Another difference between transformers and CNNs is the use of activation functions and normalization. ReLUs are widely used in CNNs as simple and efficient activation functions. GELUs, a smooth variant of ReLUs, are used in recent transformer structures, such as the Swin Transformer and BERT, and can effectively alleviate neuron death and avoid vanishing gradients. Therefore, we use GELUs [23] instead of ReLUs.
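The difference between the two activations can be seen in a few lines of standard-library Python. The sketch uses the exact GELU definition, x·Φ(x) with Φ the standard normal CDF; whether the tracker uses this exact form or a tanh approximation is not stated in the text, so that choice is an assumption.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    # ReLU outputs exactly zero for x < 0, whereas GELU keeps a small
    # negative output there and approaches the identity for large x.
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

The small non-zero output for negative inputs is what lets gradients keep flowing through units that a ReLU would switch off entirely.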
Traditional convolutional neural networks use an activation function after each convolution layer. To speed up the operation, we remove the activation function after the 3 × 3 convolution and use it only after the 1 × 1 convolution.
As for normalization, batch normalization (BN) is the most common method and is widely used in various vision tasks. However, its result depends on the batch size: training with too small a batch size converges poorly, while too large a batch size may reduce the generalization ability of the model. Group normalization (GN) [24] normalizes each sample independently of the batch and has been used in many application scenarios. We use GN instead of BN and also reduce its use to improve the inference speed. The modified ResNet50 is built from a new bottleneck (see Figure 2), and the inference speed is about 5 FPS faster.

Pixel to Global Correlation
Correlation is the most important part of object tracking: it combines the template features with the search features, and its output is fed to the classification and regression sub-networks. Unlike depth-wise correlation [3], which correlates template features with search features channel by channel, in this work, we use pixel to global correlation [25], which decomposes the template features and correlates every pixel with the search features to obtain a correlated feature map S. This correlation can effectively suppress background interference, improve the target response on the feature map, and further improve the accuracy of the target bounding box.
The process is shown in Figure 3, where the template features are first converted into spatial feature vectors $Z_s$ according to the spatial dimension. Similarly, the template features are also converted into channel feature vectors $Z_c$ according to the channel dimension. The search features $X_f$ are first correlated with the spatial feature vectors $Z_s$ to obtain feature map $S_1$ based on Equation (4):
$$S_1 = X_f * Z_s \tag{4}$$
Then, feature map $S_1$ is correlated with the channel feature vectors $Z_c$ to obtain feature map $S_2$ based on Equation (5):
$$S_2 = S_1 * Z_c \tag{5}$$
where * represents the convolution process. Feature map $S_2$ is obtained after both the channel features and spatial features of the template are correlated. Then, the classification and regression sub-networks complete the target prediction.
Naive correlation [1] and depth-wise correlation [3] use the whole template feature as a kernel to correlate with the search features, so adjacent sliding windows on the feature map produce similar responses, blurring the spatial information. As a refinement, pixel to global correlation decomposes the template into 1 × 1 feature sub-kernels along the spatial and channel dimensions to correlate with the search region, which avoids this blurring, effectively reduces background interference, and further improves the accuracy of the target bounding box.
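The two-step matching can be sketched in NumPy as two 1 × 1 convolutions whose kernels come from the template. This is one reading of the description above and of the pixel-to-global formulation in [25]; the function name, the shapes, and the absence of any normalization or activation between the two steps are assumptions, not details confirmed by the text.

```python
import numpy as np

def pixel_to_global_corr(z_f, x_f):
    """z_f: template features (C, h, w); x_f: search features (C, H, W)."""
    C, h, w = z_f.shape
    z_c = z_f.reshape(C, h * w)   # C channel kernels, each of length h*w
    z_s = z_c.T                   # h*w spatial kernels, each of length C
    # Step 1: every template pixel is a 1 x 1 kernel over the search channels.
    s1 = np.einsum("kc,cij->kij", z_s, x_f)   # (h*w, H, W)
    # Step 2: every template channel is a 1 x 1 kernel over the h*w responses.
    s2 = np.einsum("ck,kij->cij", z_c, s1)    # (C, H, W)
    return s1, s2

rng = np.random.default_rng(0)
z_f = rng.normal(size=(8, 4, 4))     # hypothetical template features
x_f = rng.normal(size=(8, 10, 10))   # hypothetical search features
s1, s2 = pixel_to_global_corr(z_f, x_f)
print(s1.shape, s2.shape)  # (16, 10, 10) (8, 10, 10)
```

Because both steps use 1 × 1 kernels, the output keeps the spatial size of the search features, so each response depends on a single search location rather than on a whole sliding window.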

Feature Fusion
To make full use of the features extracted by the backbone network and the advantages of deep networks, features from different layers are used in our feature fusion. At the same time, to speed up inference, a group convolution [13] first reduces the feature dimensions, and thereby the number of parameters, and a pointwise convolution then aggregates the features.
Group convolutions [13] have been widely applied as efficient convolution methods. Their specific process is shown in Figure 4. The input has size $C_1 \times H \times W$ and the output has size $C_2 \times H \times W$, where the three dimensions are the channel, height, and width of the feature map. The input is divided into g groups, and each group uses a convolution with a kernel size of k × k and $C_1/g$ input channels. Compared with the number of parameters of an ordinary convolution, i.e., $k \times k \times C_1 \times C_2$, the number of parameters of the group convolution is $k \times k \times C_1 \times C_2 / g$, which is 1/g of an ordinary convolution, greatly reducing the parameter redundancy. A group convolution is equivalent to decomposing the input and processing the data in parallel, which can speed up the operation. The number of parameters and FLOPs is calculated using Equations (6) and (7):
$$\mathrm{Params} = k \times k \times \frac{C_1}{g} \times \frac{C_2}{g} \times g = \frac{k^{2} C_1 C_2}{g} \tag{6}$$
$$\mathrm{FLOPs} = \frac{k^{2} C_1 C_2}{g} \times H \times W \tag{7}$$
Generally speaking, during the tracking process, there may be problems such as illumination changes and scale variation, which require the tracking task to use as much feature information as possible. It is usually considered that in the shallow layers, the network extracts fine-grained information [26] of the object, such as its color and shape, which helps to locate the object's position, and as the network deepens, it extracts the semantic information of the object. Fusing features from both shallow and deep layers helps to track the target. After correlation, the features of the three stages are concatenated, and they are fused using a pointwise convolution [27], which fuses cross-channel information quickly and efficiently.
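The 1/g saving of a group convolution can be checked with a short calculation. The layer sizes below (a 3 × 3 kernel reducing 1024 channels to 256 with 32 groups on a 25 × 25 map) are hypothetical values chosen for illustration and are not the settings used in this work; biases are ignored.

```python
def group_conv_params(k, c1, c2, g=1):
    """Weights of a k x k convolution from c1 to c2 channels split into g groups."""
    assert c1 % g == 0 and c2 % g == 0, "channels must be divisible by g"
    # Each group maps c1/g input channels to c2/g output channels.
    return g * (k * k * (c1 // g) * (c2 // g))

def group_conv_flops(k, c1, c2, h, w, g=1):
    """Multiply-accumulates on an h x w output map (same spatial size assumed)."""
    return group_conv_params(k, c1, c2, g) * h * w

k, c1, c2, g = 3, 1024, 256, 32   # hypothetical layer configuration
dense = group_conv_params(k, c1, c2)        # ordinary convolution (g = 1)
grouped = group_conv_params(k, c1, c2, g)   # group convolution
print(dense, grouped, dense // grouped)     # 2359296 73728 32
```

Setting g = 1 recovers the ordinary convolution, which makes the two counts directly comparable.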

Classification and Regression Sub-Network
The aim of an attention mechanism is to let the model learn how to allocate its attention and weight the input signal. An attention mechanism scores each dimension of the input and then weights the features according to the scores, increasing the weight of informative parts and decreasing the weight of uninformative parts, so that the network adaptively highlights the features that are important to the downstream model or task. In this work, two attention modules, namely, a channel attention module and a spatial attention module, are implemented in the classification and regression sub-network (CNN2), as shown in Figure 5. The features are first reduced in dimensionality via a group convolution [13]; then, a pointwise (PW) convolution [27] is used for feature fusion, followed by the dual channel and spatial attention module.
The multi-scale feature aggregated channel attention block (MCA) is a mechanism for tuning the network at the channel level, as shown in Figure 6. The input features are first divided into four parts, each of which is reduced to half of the original channel number via a convolution layer. Two operations are then performed independently: one directly applies global average pooling to squeeze the features to a size of 1 × 1 × C, giving a global receptive field and aggregating the global features, and applies a sigmoid activation to obtain the channel weights, which are then multiplied back to the divided features; the other applies an additional convolution layer first and then performs the same operation as the former. The four parts undergo the same operation and are concatenated, completing the attention enhancement of the channel dimension and making the network automatically focus on the important channels.
The MCA block is based on Equations (8)-(10), where F is the input, S is the split operation, Cat is the concatenate operation, δ is the activation function, $C_1$ and $C_2$ represent the convolution layers, and GAP stands for global average pooling.
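The core squeeze-and-reweight step of channel attention can be sketched in NumPy as follows. This is a deliberately simplified illustration and not the MCA block itself: the convolution layers $C_1$ and $C_2$, the second branch, and all learned weights are omitted, and the function names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(feat):
    """feat: (C, H, W). Returns the reweighted features and the channel weights."""
    squeezed = feat.mean(axis=(1, 2))   # global average pooling -> (C,)
    weights = sigmoid(squeezed)         # one weight in (0, 1) per channel
    return feat * weights[:, None, None], weights

def split_gate_concat(feat, parts=4):
    """Split along channels, gate each part, and concatenate the results."""
    chunks = np.split(feat, parts, axis=0)
    gated = [channel_gate(chunk)[0] for chunk in chunks]
    return np.concatenate(gated, axis=0)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 6))   # hypothetical feature map
out = split_gate_concat(feat)
```

Without the per-part convolutions, gating the four parts separately gives the same result as gating the whole tensor at once; in the block described above, the convolutions applied to each part are what make the split meaningful.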

The global-to-local-information-fused spatial attention block (GSA) is similar to the channel attention block, except that it weights the network along the spatial dimension, as shown in Figure 7. The same input features are divided into four parts, using two convolution layers, average pooling, and maximum pooling [28] for each feature point of the network along the channel direction to obtain four 1 × h × w feature maps. The pooling maps and convolution maps are concatenated and passed through another convolution layer to obtain weights in the spatial dimension, which are then multiplied back to the input. The two parts are then added to complete the attention enhancement of the spatial dimension, making the network focus on the more important regions. We employ the GSA block in Equations (11)-(13).
where GAP and GMP represent average pooling and maximum pooling, F is the input feature, C_1, C_2, C_3, and C_4 represent the convolution layers, and Cat is the concatenate operation.
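As a rough illustration of the spatial weighting, the sketch below pools along the channel direction and rescales every position. It is a simplification under stated assumptions: the learned convolutions C_1 to C_4 are replaced by a fixed 1 × 1 convolution with weights (0.5, 0.5) over the two pooled maps, the final addition of the two parts is omitted, and the name `spatial_gate` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_gate(features):
    """Weight every spatial position of a C x H x W nested-list feature."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    output = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            column = [features[k][i][j] for k in range(channels)]
            avg = sum(column) / channels  # average pooling along channels
            mx = max(column)              # maximum pooling along channels
            # Assumed stand-in for the learned convolution on the
            # concatenated pooled maps: fixed weights (0.5, 0.5).
            weight = sigmoid(0.5 * avg + 0.5 * mx)
            for k in range(channels):
                output[k][i][j] = features[k][i][j] * weight
    return output
```

Each position receives a single weight shared by all channels, which is what distinguishes spatial attention from the per-channel weights of the MCA block.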
After the template features are correlated with the search features (pixel-level correlation), they are fed into the classification and regression sub-networks (CNN2), which predict whether each location belongs to the object or the background, along with the bounding box of the target. As shown in Figure 8, the two sub-networks take the output of the same correlation module as their input rather than using separate correlation modules, which reduces the amount of computation and speeds up the network. The algorithm finally runs at 40 FPS, which is about 9 FPS faster than SiamCAR.

Experiments

Implementation Details
The initial model of the backbone was derived from ResNet50 [4] trained on the COCO [29] dataset, a transfer learning approach that is commonly used for network training today. We used the LaSOT [30], GOT10K [31], ImageNet VID [32], and YouTube Bounding Boxes [33] datasets as training sets. The search region was cropped to 255 × 255, and the template region was cropped to 127 × 127 for training. The initial learning rate was 0.001, and 20 training epochs were performed using stochastic gradient descent (SGD). In the first 5 epochs, the learning rate increased from 0.001 to 0.005, and in the last 15 epochs, it gradually decreased from 0.005 to 0.0005. Meanwhile, the parameters of the backbone network were frozen in the first 10 epochs, during which only the neck and output parts were trained; in the last 10 epochs, the backbone parameters were unfrozen, and the network was trained as a whole. Finally, the model was tested and evaluated on the UAV123 [34] and OTB100 [35] datasets.
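The schedule above can be sketched as follows. The end points (0.001, 0.005, 0.0005) and the epoch counts come from the text; the shapes of the curves are assumptions, since the text only says the rate "increased" and "gradually decreased". Here the warm-up is linear and the decay is geometric, and the function names are hypothetical.

```python
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 20
LR_START, LR_PEAK, LR_END = 0.001, 0.005, 0.0005
BACKBONE_FROZEN_EPOCHS = 10


def lr_at(epoch):
    """Learning rate for a zero-based epoch index in [0, 19]."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("epoch out of range")
    if epoch < WARMUP_EPOCHS:
        # Assumed linear warm-up from LR_START to LR_PEAK over 5 epochs.
        return LR_START + (LR_PEAK - LR_START) * epoch / (WARMUP_EPOCHS - 1)
    # Assumed geometric decay from LR_PEAK to LR_END over the last 15 epochs.
    decay_steps = TOTAL_EPOCHS - WARMUP_EPOCHS - 1
    return LR_PEAK * (LR_END / LR_PEAK) ** ((epoch - WARMUP_EPOCHS) / decay_steps)


def backbone_trainable(epoch):
    """The backbone is frozen for the first 10 epochs and trained afterwards."""
    return epoch >= BACKBONE_FROZEN_EPOCHS


if __name__ == "__main__":
    for epoch in range(TOTAL_EPOCHS):
        print(epoch, round(lr_at(epoch), 6), backbone_trainable(epoch))
```

A linear decay would satisfy the same end points; the geometric form is chosen here only because it spends more epochs at small learning rates.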

Ablation Study
To explore the effect of multi-layer feature fusion, ablation experiments were conducted. Table 1 shows that multi-layer feature fusion performs better than a single feature, and that fusing the three layers CONV3, CONV4, and CONV5 performs better than fusing only CONV4 and CONV5. This indicates that the features extracted at different stages of the network differ, and that fusing multi-layer features helps to improve the tracking accuracy. The correlation method based on pixel matching of the template features also improves on the channel-by-channel correlation method, with a gain of 0.8% on the UAV123 dataset. Adding the attention modules further improves the network, and using both the spatial and channel attention blocks gives the best result, with a final AUC of 65.5% on the UAV123 dataset.
Table 1. Ablation study of the proposed tracker on UAV123. L3, L4, and L5 represent CONV3, CONV4, and CONV5, respectively. DW/Pix stands for depth-wise correlation and pixel-to-global correlation.
To analyze the effect of fusing multi-layer features, we tested the model on three datasets. As shown in Table 2, using three feature maps from different convolution layers leads to the best results on all three datasets, which shows that multi-layer feature fusion helps to improve the accuracy.

Another ablation experiment was conducted to explore the attention mechanism and pixel-level correlation. As shown in Table 3, the baseline uses three convolution layers with pixel-level correlation, while MCA and GSA are the multi-scale feature aggregated channel attention block and the global-to-local-information-fused spatial attention block. Every addition improves the accuracy. When all modules are used, the tracker achieves the best performance, with an AUC of 65.5% and a precision rate of 85.2%.

Results on UAV123
UAV123 [34] is a collection of 123 high-definition videos captured by UAVs during aerial photography. It contains a variety of targets such as pedestrians, ships, planes, and cars; a variety of scenes including fields, roads, and water, with many activity styles; and occlusions, scale changes, lighting changes, and camera movements that increase the tracking challenge. The evaluation metrics include success, precision, and normalized precision. Precision is based on the center position error, using the average center position error over all frames in a sequence to evaluate the performance of the trackers. Success is the proportion of overlapping area between the predicted box and the ground-truth box; generally, the area under the curve (AUC) is used as its value.
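The two metrics can be made concrete with a short sketch. Boxes are assumed to be `(x, y, w, h)` tuples with `(x, y)` the top-left corner; the 20-pixel precision threshold and the 21 evenly spaced overlap thresholds follow common OTB-style practice and are assumptions, not values stated in this paper. All function names are hypothetical.

```python
import math


def center_error(pred, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx, gy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    return math.hypot(px - gx, py - gy)


def iou(pred, gt):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def precision_at(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= threshold for e in errors) / len(errors)


def success_auc(preds, gts, steps=21):
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)
```

Averaging the success rate over evenly spaced thresholds is a discrete approximation of the area under the success curve.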
We compared our work with other state-of-the-art trackers, including SiamRPN++ [3], Ocean [36], SiamBAN [37], and SiamGAT [38]. As shown in Figure 9, compared with SiamCAR, our tracker shows a 4.0% improvement in success and a 4.8% improvement in precision. We also compared the trackers in terms of visual attributes, including illumination changes, occlusion, scale changes, and background clutter, as shown in Figure 10. Our tracker ranks first, which shows that it can cope with illumination changes, occlusion, and scale changes.

Results on OTB100
OTB100 is a widely used object-tracking dataset. It contains 100 video sequences with attributes such as fast motion, motion blur, and low resolution. We compared our tracker with other state-of-the-art trackers including SiamCAR [39], SiamRPN++ [3], SiamBAN [37], and CFNet [40].
Figure 11 illustrates the success and precision plots of the compared trackers. Our tracker achieves better results than SiamCAR [39] and SiamBAN [37] under scale variation, out-of-plane rotation, low resolution, etc., while running at a faster speed. Our tracker obtains a success rate of 0.701 and a precision rate of 0.914. The integration of the attention and pixel-level correlation methods enables the tracker to work well in scenarios with low resolution, scale variation, etc.

Results on GOT10K and LaSOT
As a large tracking dataset, GOT10K contains more than 10,000 videos, covering more than 560 categories of moving objects and 87 motion patterns, which is more than other datasets. We tested our model on the test set. As shown in Table 4, compared with SiamCAR [39], SiamFC++ [6], and Ocean [36], our tracker achieves an AO of 60.7%, which is 1.2% better than that of SiamFC++ and generally better than that of the other trackers.

LaSOT contains 70 object categories and provides an equal number of sequences for each category to mitigate potential category bias, resulting in a collection of 1400 sequences with an average video length of 2512 frames, constituting a high-quality tracking dataset. We tested our tracker on this test set. As shown in Table 5, our tracker outperforms Ocean by 1.2% and performs better than the other trackers, which shows its effectiveness and generalizability.

Figure 12 shows that our model can track successfully in the face of size variation, occlusion, and low resolution, improving the success and precision rates. The inaccuracy of the boat tracking is due to the fixed viewpoint: as the boat travels from far to near, its size changes rapidly, so the tracker does not work well. Our model aggregates multi-layer features with different receptive fields, which reduces the accuracy degradation caused by changes in the size of the object. The person tracking inaccuracy is due to the close distance and high similarity of the two people, which results in a bounding box that contains both. Pixel-level correlation is a more refined correlation method that can refine the bounding box and diminish tracking exceptions caused by background interference. Due to the small size and fast movement of UAVs, tracking errors often occur. The attention module can enhance the feature extraction ability of the network, allowing the network to focus on important features and track successfully. Therefore, our tracker provides better accuracy than the other algorithms in different situations. Meanwhile, compared to SiamCAR's inference speed of 31 FPS, our model runs at 40 FPS, an improvement of 9 FPS, so it improves in both speed and success.

Conclusions
In this work, we propose a Siamese framework with a group convolution and pixel-level correlation for visual object tracking. It is trained end to end and uses multi-layer feature fusion and attention mechanisms to improve the feature extraction capability of the network, which works well under fast motion, occlusion, etc. We designed two attention modules, a multi-scale channel attention block (MCA) and a global-to-local spatial attention block (GSA), which enable the network to extract more meaningful features in the classification and regression sub-network. During tracking, pixel-level correlation reduces background interference and provides more refined target boundaries; it decomposes the template features along the channel and spatial dimensions and uses every pixel feature to correlate the template and search regions. Furthermore, in order to improve the inference speed, our tracker uses a group convolution, which reduces the number of

Figure 1. Illustration of our proposed framework. Section 3.1 presents the Siamese backbone network, CNN1; CONV3, CONV4, and CONV5 represent its layers 3, 4, and 5. ★ represents the pixel-level correlation method, which is presented in Section 3.2. The feature fusion model is presented in Section 3.3. The classification and regression sub-network using a dual-attention mechanism, CNN2, is presented in Section 3.4.
The modified ResNet50 uses a new bottleneck (see Figure 2), and the inference speed is about 5 FPS faster.

Figure 2. (a) Structure of the original bottleneck, using three activation functions and three normalization layers. (b) Structure of the new bottleneck, using two activation functions and one normalization layer.

Figure 3. Illustration of pixel-to-global correlation, where f_Z is the template feature and f_X is the search feature. (a) Template feature decomposition: the template feature is decomposed into feature vectors s_Z and c_Z. s_Z converts the template feature into feature vectors according to each pixel position; c_Z converts the template feature maps of each channel into feature vectors. (b) Search feature correlation: the feature vectors s_Z and c_Z are successively correlated with the search feature f_X to obtain features S_1 and S_2. S_2 is the correlation feature map combining the template and search features.

Figure 4. Feature fusion model with a group convolution and pointwise convolution: (a) denotes input features, (b) denotes group convolution, (c) denotes pointwise convolution, and (d) denotes output.

Figure 7. The global-to-local-information-fused spatial attention block (GSA).

Figure 8. Different connections between the correlation module and prediction sub-network: (a) separate correlation modules connected to the classification and regression sub-networks; (b) use of a shared correlation module.

Figure 9. (a) Overall success and precision plots of our tracker on UAV123 compared with other trackers. (b) Success plot for visual attributes. (c) Precision plot for visual attributes.

Figure 10. Comparison of success in terms of visual attributes.

Figure 11. (a) Overall success and precision plots of our tracker on OTB100 compared with other trackers. (b) Success plot for visual attributes. (c) Precision plot for visual attributes.
Appl. Sci. 2023, 13, x FOR PEER REVIEW

Figure 12. Comparisons of tracking results from different trackers. The targets include a person, a car, a boat, and a UAV, and the images present challenging attributes such as low resolution, occlusion, fast motion, and size variation. Green boxes denote the ground truth, yellow boxes are results from SiamCAR, and red boxes are the results of our model.

Table 2. Ablation study of the use of feature maps from different layers.

Table 3. Ablation study of the attention model and correlation method.

Table 4. Comparison with other trackers on the GOT10k test set.

Table 5. Comparison with other trackers on the UAV123, OTB100, and LaSOT datasets in terms of the AUC.