ACSiamRPN: Adaptive Context Sampling for Visual Object Tracking

Abstract: In visual object tracking fields, the Siamese network tracker based on the region proposal network (SiamRPN) has achieved promising tracking effects, both in speed and accuracy. However, it did not consider the relationships and differences between the long-range context information of various objects. In this paper, we add a global context block (GC block), which is lightweight and can effectively model long-range dependency, to the Siamese network part of SiamRPN so that the object tracker can better understand the tracking scene. At the same time, we propose a novel convolution module, called a cropping-inside selective kernel block (CiSK block), based on selective kernel convolution (SK convolution, a module proposed in selective kernel networks) and use it in the region proposal network (RPN) part of SiamRPN, which can adaptively adjust the size of the receptive field for different types of objects. We make two improvements to SK convolution in the CiSK block. The first improvement is that in the fusion step of SK convolution, we use both global average pooling (GAP) and global maximum pooling (GMP) to enhance global information embedding. The second improvement is that after the selection step of SK convolution, we crop out the outermost pixels of features to reduce the impact of padding operations. The experiment results show that on the OTB100 benchmark, we achieved an accuracy of 0.857 and a success rate of 0.643. On the VOT2016 and VOT2019 benchmarks, we achieved expected average overlap (EAO) scores of 0.394 and 0.240, respectively.


Introduction
Visual object tracking is one of the most basic problems in applications such as human-computer interaction, visual analysis and assisted driving systems. Its purpose is to accurately estimate the position and scale of the object in subsequent frames, according to the bounding box given in the first frame [1]. Appearance changes caused by illumination, deformation, occlusion, rotation and motion pose a great challenge. In addition, tracking speed is also very important in practical applications; generally, real-time tracking requires at least 25 Frames Per Second (FPS).
Video tracking technology has developed rapidly in the past few years. In particular, the Siamese network based on a region proposal network (SiamRPN) [2], proposed by Li et al., formulates tracking as local one-shot detection with region proposals. In this paper, we add a GC block to the template branch of SiamRPN, which can collect the long-range context information of an object; thus, a global spatial attention mechanism is introduced into the object tracker. In order to improve the adaptability of the object tracker to object scales, we design a cropping-inside selective kernel block (CiSK block) based on SKNet and replace the 3 × 3 convolutions in the RPN part of SiamRPN with CiSK blocks. Due to its multi-branch structure, a CiSK block can provide SiamRPN with a dynamic receptive field. In addition, the GAP and GMP used in the fuse step of the CiSK block enrich the channel attention information of SiamRPN. The source code of our method is available at https://github.com/linjiangxiaoxian/ACSiamRPN.

Methods
In this section, we describe in detail the ACSiamRPN framework that we propose for single object tracking. As shown in Figure 1, ACSiamRPN includes a Siamese subnet for feature extraction and an RPN subnet for bounding box prediction. There are two branches in the RPN subnet: one is responsible for foreground and background classification, and the other for proposal refinement. The whole framework can be trained end to end. The ACSiamRPN framework is modified from the original SiamRPN by adding a GC block and CiSK blocks. The GC block extracts global context information and facilitates subsequent processing. The CiSK block has a dynamic receptive field, and the cropping operation added to it alleviates the negative impact of padding on object localization. The four CiSK blocks in the RPN subnet have the same structure but do not share weights. The channel number of the features output by the CiSK blocks is kept at 256. Then, the final output is obtained through cross-correlation and 1 × 1 convolution. Details of the GC block and CiSK block are described in the remainder of this section.

Global Context Block
Figure 1. The architecture of ACSiamRPN. Video frames are input to a Siamese subnet to extract features. The extracted features are used as input by a subsequent RPN subnet to predict the object bounding box. The output of the classification branch of the RPN subnet has 2k channels, representing the foreground and background scores of k anchors. The output of the regression branch of the RPN subnet has 4k channels, representing the four correction offsets for the predicted bounding box of each of the k anchors: the offsets for the horizontal and vertical coordinates of the bounding box center and the offsets for its width and height.

In order to model the long-range context of the template frame and deepen the network's global understanding of the current tracking scene, we added a GC block [17] to the template branch of the original DaSiamRPN. Please note that long-range context here is not a temporal concept but a spatial one: it means the relationship between pixels that are far away from each other in the same frame. As shown in Figure 2, the GC block is composed of two parts, namely the context modeling part and the channel transforming part. Although the GC block is lightweight, we only added it to the template branch. The reason is that the template branch needs to be run only once, at the first frame, whereas the detection branch needs to be run at every subsequent frame, so any addition to the detection branch would affect the tracking speed.

The template branch with the GC block can provide a more stable and reliable template for subsequent frames to match.
For the context modeling part, the main function is to establish the relationship between contexts. Conventional convolution can only capture local context information, and the GAP and GMP operations are only simple statistical calculations, which cannot model global context well. The context modeling module groups the features of all positions together via weighted averaging to obtain the global context features [13] (it can be regarded as global attention pooling). In the context modeling part, the input features (C × H × W, for Channel number, Height and Width) are first convolved by a 1 × 1 kernel, and the channel number is compressed to 1 to obtain a 1 × H × W feature map. Then, the feature map is reshaped to HW × 1 × 1 and fed into a softmax function to obtain the attention weight (HW × 1 × 1). Finally, matrix multiplication is performed between the reshaped original features (C × HW) and the attention weight (HW × 1 × 1) to get the output of the context modeling part (C × 1 × 1).

For the channel transforming part, the main function is to complete information transformation and assign the context established by the context modeling part to the corresponding channel. Similar to SENet, this part uses 1 × 1 convolution to model the relationship between channels. First, the channel number is compressed to C/r (in our experiment, we set r to 4), and then the channel number is restored to C. In this way, a bottleneck is formed, and the computation and parameter amounts are reduced. Layer normalization is added to facilitate the training and optimization process, and Rectified Linear Unit (ReLU) activation is used to increase the model's non-linearity.

Finally, the output C × 1 × 1 vector is broadcast and added elementwise to the original feature map to get the final output. To summarize, the GC block can be formulated as

z_i = x_i + W_{v2} \, \mathrm{ReLU}\left( \mathrm{LN}\left( W_{v1} \sum_{j=1}^{N_p} \frac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}} \, x_j \right) \right)

where x and z represent the input and output of the GC block, and x_j, x_i, x_m and z_i are the elements of x and z. N_p is the number of elements in x. The term e^{W_k x_j} / \sum_{m=1}^{N_p} e^{W_k x_m} represents the softmax function. W_{v1} represents the weight used in the first convolution module of the channel transforming part, and W_{v2} represents the weight used in the last convolution module of the channel transforming part. ReLU represents the activation function, and LN represents the layer normalization operation.
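To make the two parts concrete, the following is a minimal NumPy sketch of the GC block computation under the shapes described above. The weight matrices w_k, w_v1 and w_v2 stand in for the 1 × 1 convolutions; their names and the simplified single-group layer normalization are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def gc_block(x, w_k, w_v1, w_v2):
    """Global context block sketch.
    x: (C, H, W) input features; w_k: (1, C); w_v1: (C/r, C); w_v2: (C, C/r)."""
    C, H, W = x.shape
    xf = x.reshape(C, H * W)                # C x HW
    logits = (w_k @ xf).ravel()             # 1x1 conv to 1 channel -> HW scores
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                      # softmax over all HW positions
    ctx = xf @ attn                         # global attention pooling -> (C,)
    t = w_v1 @ ctx                          # bottleneck: C -> C/r
    t = (t - t.mean()) / (t.std() + 1e-5)   # layer normalization (simplified)
    t = np.maximum(t, 0.0)                  # ReLU
    t = w_v2 @ t                            # restore: C/r -> C
    return x + t[:, None, None]             # broadcast elementwise add
```

Note that the added context vector is identical at every spatial position, which matches the broadcast elementwise addition described above.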

Cropping-Inside Selective Kernel Block
During object tracking, the object scale is random and may vary over time, so the receptive field has a crucial influence on tracking performance. Networks such as Inception [9] have several receptive fields due to their multiple parallel branches with different kernel sizes; however, the weight of each branch is fixed in the fusion step, so they do not adapt to objects of different scales. SKNet [13] is famous for its simplicity and efficiency: it can adaptively assign the weights of different branches according to the scales of different objects, making it suitable for tasks handling objects of random sizes, such as object tracking.
The proposed CiSK block, shown in Figure 3, was inspired by SKNet and inherits its dynamic receptive field ability. In order to better apply it to object tracking tasks, we made two main improvements to SKNet. Firstly, we think that besides average pooling, maximum pooling is another important method for gathering discriminative features. Therefore, in the fuse step of SKNet, we added an extra branch starting with GMP, forming a two-branch channel attention-generating module; the GMP branch shares weights with the GAP branch. Secondly, a padding operation is necessary in SKNet because of the same-size convolution it uses to maintain feature dimensions. However, padded values around the original feature induce a potential position bias in model training [7], and thus the prediction accuracy is expected to be degraded, especially when an object moves near the search range boundary. To address this issue, the most padding-affected elements around the feature after the select step are cropped out. The detailed working process of the CiSK block is as follows.

The input feature X is convolved by a 3 × 3 and a 1 × 1 kernel, respectively, to obtain the top-branch feature U_t and the bottom-branch feature U_b. U_t and U_b have the same spatial dimensions and channel number; their size depends on which branch they come from: 6 × 6 × 256 for the template branch and 22 × 22 × 256 for the search branch. Then, we add U_t and U_b element by element to get U = U_t + U_b. The vectors s_1 and s_2 are obtained by GAP and GMP of U, where for the c-th channel

s_{1,c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j), \quad s_{2,c} = \max_{i,j} U_c(i, j).

Then, the channel numbers of s_1 and s_2 are reduced to one-eighth of their original values by a shared fully connected layer, giving z_1 = δ(B(W s_1)) and z_2 = δ(B(W s_2)), where δ represents the ReLU activation function and B represents batch normalization. After that, two fully connected layers are applied to z_1 to recover its original dimension, so a_1 = A z_1 and b_1 = B z_1 are obtained; the same operations are applied to z_2 to obtain a_2 and b_2, where A and B are the transformation matrices used when the dimension is increased. a and b are obtained by adding a_1 and a_2 and b_1 and b_2 together, and they are normalized by a softmax function to get a_s and b_s. Then, a_s and b_s are multiplied by U_t and U_b with the broadcasting mechanism, and the products are summed up to get the feature map V = a_s · U_t + b_s · U_b, s.t. a_{s,c} + b_{s,c} = 1. Finally, the outermost pixels of each channel of V are cropped out to get the final output V_out: the size of V is 6 × 6 × 256 if sourced from the template branch or 22 × 22 × 256 if sourced from the search branch; after cropping, the size of V_out becomes 4 × 4 × 256 or 20 × 20 × 256, correspondingly.
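The fuse, select and crop steps above can be sketched in NumPy as follows. The weight names (w_reduce, w_a, w_b) are our illustrative assumptions, batch normalization is omitted for brevity, and w_reduce is shared between the GAP and GMP branches as described in the text.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cisk_fuse_select(u_t, u_b, w_reduce, w_a, w_b, crop=1):
    """CiSK fuse/select/crop sketch.
    u_t, u_b: (C, H, W) branch features; w_reduce: (C/8, C); w_a, w_b: (C, C/8)."""
    u = u_t + u_b                          # elementwise fuse
    s1 = u.mean(axis=(1, 2))               # GAP -> (C,)
    s2 = u.max(axis=(1, 2))                # GMP -> (C,)
    z1 = np.maximum(w_reduce @ s1, 0.0)    # shared FC + ReLU, C -> C/8
    z2 = np.maximum(w_reduce @ s2, 0.0)
    a = (w_a @ z1) + (w_a @ z2)            # recover dimension, sum both branches
    b = (w_b @ z1) + (w_b @ z2)
    ab = softmax(np.stack([a, b]), axis=0) # per channel, a_c + b_c = 1
    a_s, b_s = ab[0][:, None, None], ab[1][:, None, None]
    v = a_s * u_t + b_s * u_b              # select step with broadcasting
    return v[:, crop:-crop, crop:-crop]    # crop the outermost pixels
```

Because a_s + b_s = 1 per channel, feeding two identical branch features returns those features unchanged (apart from the cropped border), which is a quick sanity check of the select step.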

Implementation Details
We took AlexNet [18], pre-trained on ImageNet [19], as the backbone network for feature extraction and trained for 20 epochs in total. First, we froze the parameters of the pre-trained AlexNet and trained the other parts for 10 epochs. After that, we unfroze the last two layers of AlexNet and trained them together with the other parts of the network for another 10 epochs. The total loss is the sum of the classification loss and the standard smooth L1 loss for regression. Stochastic gradient descent (SGD) was used for optimization, and the momentum parameter was set to 0.9. During training, the learning rate was scheduled as follows: for the first 5 epochs, the learning rate increased exponentially from 0.005 to 0.01; for the remaining 15 epochs, it decreased exponentially from 0.01 to 0.0005.
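The schedule above can be written as a per-epoch function. This is a minimal sketch: the exact endpoint handling (reaching 0.01 at the last warm-up epoch and 0.0005 at the final epoch) is our assumption about how the exponential interpolation is anchored.

```python
def lr_at(epoch, warm=5, total=20, start=0.005, peak=0.01, end=0.0005):
    """Exponential warm-up then exponential decay, per 0-indexed epoch."""
    if epoch < warm:
        # geometric interpolation from start to peak over the warm-up epochs
        return start * (peak / start) ** (epoch / (warm - 1))
    # geometric interpolation from peak down to end over the remaining epochs
    t = (epoch - warm) / (total - warm - 1)
    return peak * (end / peak) ** t
```

For example, lr_at(0) gives 0.005, lr_at(4) gives 0.01, and lr_at(19) gives 0.0005.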
During inferencing, we regarded our tracker as a local one-shot detection framework, in which the bounding box in the first frame was the only exemplar. This exemplar was sampled through the template branch only once, and the template branch was pruned after that to accelerate the tracking speed [2]. Subsequent frames were sampled by searching branches and fed into the RPN subnet to get the refined proposal. In addition, our tracker was designed for short-term object tracking, so no online template update mechanism was used.
We used four datasets as training sets, namely ImageNet VID [19], ImageNet DET [19], COCO [20] and YouTube-BB [21], and used OTB100 [22], VOT2016 [23] and VOT2019 [24] to evaluate the proposed method. Before being fed into the tracker, template frame images were resized to 127 × 127, and search frame images were resized to 255 × 255.
Our tracker was implemented using the PyTorch framework with Python on an Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50 GHz and two NVIDIA GTX 1080Ti GPUs with 22 GB of memory in total.

Result on OTB100
We used the standard OTB100 benchmark, which contains 100 fully annotated real-world sequences, to evaluate the performance of our tracker. These sequences cover 11 challenge attributes, namely illumination variation (IV), deformation (DEF), motion blur (MB), out-of-plane rotation (OPR), low resolution (LR), occlusion (OCC), fast motion (FM), in-plane rotation (IPR), out-of-view (OV), background clutter (BC) and scale variation (SV). There are two evaluation criteria: one is the overlap rate of the bounding box, and the other is the center positioning error of the bounding box.
Two graphs demonstrate the performance of multiple models: precision plots of one-pass evaluation (OPE), based on the center positioning error of the bounding box, and success plots of OPE, based on the overlap rate of the bounding box. The horizontal values in these two graphs are the thresholds of the two criteria. The precision value shows the percentage of frames whose center positioning error is within a given distance threshold, and the success value shows the percentage of frames whose overlap rate exceeds a given threshold. We compared our tracker with nine state-of-the-art methods, including SiamRPN, MEEM [25], MUSTer [26], SiamFC [4], DSST [27], KCF [28], Struck [29], TLD [30] and CSK [31]. The results are as follows.
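Both OPE curves reduce to simple frame-level statistics. The following is a minimal sketch; the helper names and the 21-point threshold grid for the AUC are our assumptions, following common OTB practice.

```python
import numpy as np

def precision_at(center_errors, thresh=20.0):
    # fraction of frames whose predicted center is within `thresh` pixels
    e = np.asarray(center_errors, dtype=float)
    return float((e <= thresh).mean())

def success_auc(overlaps, thresholds=np.linspace(0.0, 1.0, 21)):
    # success rate at each overlap threshold, averaged over thresholds (AUC)
    o = np.asarray(overlaps, dtype=float)
    return float(np.mean([(o > t).mean() for t in thresholds]))
```

With per-frame center errors and overlaps for a sequence, precision_at gives the value reported in the precision plot at the 20-pixel threshold, and success_auc gives the area under the success plot.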
The number in the precision plot of OPE in Figure 4 is the precision value when the location error threshold is 20 pixels, which is the official evaluation metric of the Object Tracking Benchmark (OTB) dataset. The number in the success plot of OPE in Figure 4 is the area under the curve (AUC). As shown in Figure 4, we achieved the best results on the OTB100 benchmark, with a precision 0.5 percentage points higher and a success rate 1 percentage point higher than the baseline (SiamRPN).

As shown in Figure 5, we compared ACSiamRPN with two classic trackers (SiamFC and SiamRPN) on frames from six videos in OTB100: Skiing, MotorRolling, CarScale, Liquor, Tiger1 and Lemming. It can be seen that when the object is very small (Skiing), rotates in the plane (MotorRolling), changes scale greatly (CarScale) or is occluded (Liquor, Tiger1 and Lemming), the two classic trackers often produce inaccurate tracking results or even fail. Our tracker can handle these challenges better.
We think the main reasons are that, first, the GC block can collect long-range context information and improve a network's understanding of tracking scenes. Second, the CiSK block can adjust the receptive field adaptively according to the variation of object features during the tracking process, so it can better estimate the current scale of an object.

Result on VOT2016
Visual Object Tracking (VOT) benchmarks evaluate a tracker by applying a reset-based methodology. Whenever a tracker has no overlap with the ground truth, the tracker will be reinitialized after five frames. The major evaluation metrics of VOT benchmarks are accuracy (A), robustness (R) and expected average overlap (EAO). An excellent tracker should have high A and EAO scores but a low R score.
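The reset-based protocol can be sketched as a simple loop over per-frame overlaps. This is a minimal sketch: real VOT toolkits also exclude a burn-in window after each re-initialization when computing accuracy, which is omitted here.

```python
def count_failures(overlaps, skip=5):
    """Reset-based evaluation sketch: a failure is recorded whenever the
    tracker's overlap with the ground truth drops to zero, and the tracker
    is re-initialized `skip` frames later."""
    failures, i = 0, 0
    while i < len(overlaps):
        if overlaps[i] == 0:
            failures += 1
            i += skip  # re-initialize after five frames
        else:
            i += 1
    return failures
```

Robustness then relates to the failure count over a sequence, while accuracy averages the overlaps of the successfully tracked frames.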
We used the VOT2016 benchmark to test our tracker and compared it with nine advanced trackers. The VOT2016 public dataset is used for single object short-term tracking tasks and includes 60 video sequences. We compared the three criteria (EAO, A and R) of the different trackers; the details are shown in Table 1 and Figure 6.

Table 1. Detailed information about several published state-of-the-art trackers' performances in VOT2016. Red, blue and green represent the 1st, 2nd and 3rd best trackers, respectively.

As shown in Table 1 and Figure 6, our tracker reached 0.397 EAO, 0.601 accuracy, and 0.252 robustness. Our EAO and accuracy criteria were about 15.4% and 7.3% higher than the baseline (SiamRPN), respectively, and the robustness (failure rate) was reduced by about 3%.

Result on VOT2019
We used the VOT2019 benchmark to test our tracker and compared it with nine advanced trackers. Like VOT2016, the VOT2019 public dataset is used for single object short-term tracking tasks and includes 60 video sequences. Compared to VOT2018, VOT2019 replaced 20% of the sequences with more difficult ones. We compared the three criteria (EAO, A and R) of the different trackers; the details are shown in Table 2 and Figure 7.

Table 2. Detailed information about several published state-of-the-art trackers' performances in VOT2019. Red, blue and green represent the 1st, 2nd and 3rd best trackers, respectively.

Table 2 (continued).
Tracker          EAO      A        R
[24]             0.223    0.561    0.788
TADT [37]        0.207    0.516    0.677
CSRDCF [38]      0.201    0.496    0.632
CSRpp [24]       0.187    0.468    0.662
FSC2F [24]       0.185    0.480    0.752
ALTO [24]        0.182    0.358    0.818

Figure 7. EAO scores in the VOT2019 challenge. A larger value is better.
As shown in Table 2 and Figure 7, our tracker was ranked first in both the EAO and accuracy criteria, while the robustness ranking was slightly behind. Among them, EAO and accuracy are about 2.6% and 13.5% higher than the second-ranked tracker, respectively.
In the VOT2019 ranking list, some trackers based on Siamese networks performed better than ours, such as SiamDW_ST, SiamMask and SiamRPN++. The main reason is that they use much deeper backbone networks, such as ResNet and InceptionNet, so they can extract richer target features. The backbone network used in ACSiamRPN is a five-layer AlexNet, so our network is relatively lightweight and can achieve a higher tracking speed. We compared the performance and tracking speed of ACSiamRPN with SiamDW_ST, SiamMask and SiamRPN++; the result is shown in Figure 8. As shown in Figure 8, the speed of our tracker is much higher than that of the other Siamese network-based trackers on the VOT2019 benchmark. For example, our tracker is 57 FPS faster than SiamMask (128 vs. 71 FPS), while its EAO is only 0.042 lower (0.240 vs. 0.282).

Ablation Study
The GC block and CiSK block are the two main contributions of our model. In order to study their effectiveness, we carried out ablation experiments on VOT2016. As shown in Table 3, both the GC and CiSK blocks played a positive role. Although adding the GC block only to the template branch makes the network no longer symmetrical, during training the network adaptively adjusts the weights of the two branches to output features that are conducive to the subsequent template matching operation. In addition, the GC block can provide a better template for the tracker. EAO is a performance criterion introduced in VOT2015, which combines the raw values of accuracy and robustness into a kind of hybrid criterion. EAO measures the expected no-reset overlap of a tracker run on a short-term sequence [23]. EAO has a clear practical interpretation and provides a more reasonable measure for short-term object tracking tasks, and thus it is officially recognized as the ranking criterion by the VOT competition. As shown in Table 3, the GC block and CiSK block both contribute clearly to the improvement of the EAO, which demonstrates their effectiveness.

The cropping operation, the GAP branch and the GMP branch are the three main modifications in the CiSK block. In order to study their effectiveness, we carried out ablation experiments on VOT2016. As shown in Table 4, when the GAP and GMP branches are used at the same time, the performance of the model is better than when only a GAP or GMP branch is used (0.370 vs. 0.369/0.367 when the cropping operation is not adopted, and 0.397 vs. 0.382/0.395 when it is adopted). It can also be seen that model performance gains a considerable improvement from the cropping operation under every GAP/GMP combination (0.382 vs. 0.369, 0.395 vs. 0.367, 0.397 vs. 0.370).

Conclusions
In this paper, we proposed two lightweight and efficient modules, namely the GC block and the CiSK block, and integrated them into SiamRPN. The GC block can better model the long-range context of template frames, and the CiSK block gives the model a dynamic receptive field. We used four large-scale datasets to train our model and three mainstream benchmarks to evaluate its performance. A careful ablation study was carried out to demonstrate the positive effect of each module. Experiment results show that the proposed ACSiamRPN model has competitive performance.

Funding:
This research was funded by the Artificial Intelligence Program of Shanghai, grant number 2019-RGZN-01077.