Real-Time Object Tracking Algorithm Based on Siamese Network

: Object tracking is aimed at tracking a given target that is only speciﬁed in the ﬁrst frame. Due to the rapid movement and the interference of cluttered backgrounds, object tracking is a signiﬁcant challenging issue in computer vision. This research put forward an innovative feature pyramid and optical ﬂow estimation based on the Siamese network for object tracking, which is called SiamFP. The SiamFP jointly trains the optical ﬂow and the tracking task under the Siamese network framework. We employ the optical ﬂow network based on the pyramid correlation mapping to evaluate the movement information of the target in two contiguous frames, to increase the accuracy of the feature representation. Simultaneously, we adopt spatial attention as well as channel attention to effectively restrain the ambient noise, stress the target area, and better extract the features of the given object, so that the tracking algorithm has a higher success rate. The proposed SiamFP obtains state-of-the-art performance on OTB50, OTB2015, and VOT2016 benchmarks while exhibiting better real-time and robustness.


Introduction
Visual object tracking is regarded as one of the hottest investigation contents in the field of computer vision, which is highly concern by a large number of researchers. Visual target tracking technology has been widely used in military visual analysis [1], intelligent transportation [2], smart city [3], and many other domain. In recent years, although deep learning technology has greatly improved the robustness of tracking algorithms, the process of target tracking still faces challenges such as illumination changes, target occlusion, and motion blur. Therefore, it remains a huge challenge to invent a fast and accurate tracker.
Around 2000, beginning with the LK-tracker, although the classical algorithms and machine learning is applied to target tracking, the robustness, and accuracy of these algorithms are relatively low. From 2010 to 2016, with the proposal of MOEES [4], the correlation filtering method had become a research hotspot. Since 2016, the deep learningbased method had greatly improved the robustness and accuracy of the algorithm, such as represented by dual-network tracers represented by SINT [5] and SiamFC [6]. Though some trackers have utilized optical flow for the purpose of enhancing performance [7,8], these optical flow features have not been trained end-to-end, thus, these algorithms are unable to fully utilize the optical flow information.
In this paper, the SiamFP tracking algorithm for feature pyramid optical flow estimation is proposed, which utilizes optical flow information and pyramid features to enhance feature representation and tracking accuracy. The evaluation shows that our tracker possesses better performance on the VOT2016 benchmark and the OTB2015 benchmark. The major contributions of this work can be summarized as follows.

Related Work
This part will make a brief introduction to the associated approaches and techniques for tracking. Three aspects of in-depth feature-based tracking, Siamese-based tracking network as well as optical flow in vision tasks are highlighted.

Deep Feature-Based Tracking
In recent years, deep features have greatly improved the performance of trackers, so they have been applied by many scholars. The work based on deep convolution networks is mainly divided into two parts. Firstly, a pre-trained object recognition network is used and built on a discriminant or regression model. The tracker combines the depth features with the correlation filter. For example, for the purpose of further enhancing the SRDCF algorithm, the convolutional features were introduced by Danelljan et al. [9]. Ma et al. [10] substituted HOG (Histogram of Oriented Gradient) features by depth-wise convolutional features and fused the confidence figures which are gained by the means of percolating three layers of features separately. Dai et al. [11] put forward a tracking approach which combines depth-wise convolutional features with adaptive spatially regularized correlation filters. Another method uses classification or regression networks to introduce deep features. For example, Hong et al. [12] put forward a tracker which consists of CNN and SVM. David et al. [13] directly regress the target position by the means of training a neural network, which is considered as the first algorithm on the basis of deep learning to achieve 100 fps. The above approaches are computationally slow, difficult to track in real time, and difficult to achieve the highest performance for non-end-to-end training.

Siamese Network-Based Tracking
An end-to-end network is utilized with the intention of improving tracking performance. The approach gains excellent tracking performance through offline pre-training on mass data. Li et al. [14] proposed to predict object locations through a Region Proposal Network (RPN) [15]. The whole structure is made up of a Siamese network and RPN, while the model is trained end-to-end. Wang et al. [16] pointed out that correlation filters can be regarded as a special layer in the Siamese framework, which is able to respond to environmental changes by the means of continuously regenerating the filter. Xu et al. [17] put forward a novel algorithm on the foundation of the Siamese network, where one branch forecasts the confidence of the target location by the means of forecasting each pixel, while the other branch regresses the four edges between the specimens and the earth truth distance [18]. Gao et al. [19] proposed a Siamese attention key point network and obtained bounding boxes by the means of forecasting the coordinates of the upper left, center and lower right corners of the target.

Optical Flow for Visual Tasks
Optical flow information is broadly applied during the course of conducting computer vision tasks. Li et al. [20] and Nicola A. Piga et al. [21] applied optical flow networks for pose estimation. FlowTrack [22] trains an end-to-end optical flow network and utilities rich flow information in consecutive frames with the intention of boosting character representation and tracking performance. Zhou et al. [23] used an optical flow network for end-to-end training, which was able to forecast the movement trend of the target more precisely. Chen et al. [24] fitted optical flow characteristics to a temporally noisy turbulent environment and constructed an online tracking algorithm.

Methodology
The SiamFP algorithm proposed in this article draws on the network structure of the SiamFC algorithm. The complete course of SiamFP algorithm is displayed in Figure 1. The main process is divided into the following four steps: (a) Feature pyramid extractor [25]. Given two input images I 1 and I 2 , we generate L-level pyramids of feature representations, with the bottom (zeroth) level being the input images, i.e., C 0 t = I t . To generate feature representation at the lth layer, c l−1 t , we use layers of convolutional filters to downsample the features at the (l−1)th pyramid level, c l−1 t , by a factor of 2. (b) Optical flow network. The first and second frame images are used as the existing tracking frame and the prior frame to enter the optical flow estimation network to assess the approximate position of the target in the next frame. Then, the center of the search area is cropped in accordance with the assessed position with the intention of obtaining a more precise search area. end-to-end training, which was able to forecast the movement trend of the target more precisely. Chen et al. [24] fitted optical flow characteristics to a temporally noisy turbulent environment and constructed an online tracking algorithm.

Methodology
The SiamFP algorithm proposed in this article draws on the network structure of the SiamFC algorithm. The complete course of SiamFP algorithm is displayed in Figure 1. The main process is divided into the following four steps: (a) Feature pyramid extractor [25]. Given two input images I1 and I2, we generate L-level pyramids of feature representations, with the bottom (zeroth) level being the input images, i.e., = . To generate feature representation at the lth layer, , we use layers of convolutional filters to downsample the features at the (l−1)th pyramid level, , by a factor of 2. (b) Optical flow network. The first and second frame images are used as the existing tracking frame and the prior frame to enter the optical flow estimation network to assess the approximate position of the target in the next frame. Then, the center of the search area is cropped in accordance with the assessed position with the intention of obtaining a more precise search area. (c) Siamese network. Input template image and search image into convolutional neural network to generate template feature map and search feature map. (d) Attention network. Template features and search features obtain new template features and search features through spatial and channel attention mechanism networks, respectively, and the new features perform cross-correlation operations to obtain response graphs.

Spatial attention
Channel attention Spatial attention

Fully Convolutional Siamese Network
Recently, the fully convolutional Siamese network SiamFC algorithm has been extensively applied in various fields of target tracking. The algorithm consists of two inputs, one is the template branch, that is, the object to be tracked, and the first frame of the video sequence is usually selected as the template input to obtain the template feature map; the other is the search branch, that is, the image search area of each subsequent frame is used as the search image input to obtain the search feature map. SiamFC regards tracking as a process of template matching. That is, feature extraction, followed by cross-correlation operation on the generated feature maps, that is, the convolution computation, so as to generate a heat-map. The cross-correlation operation is as Equation (1) follows.

Fully Convolutional Siamese Network
Recently, the fully convolutional Siamese network SiamFC algorithm has been extensively applied in various fields of target tracking. The algorithm consists of two inputs, one is the template branch, that is, the object to be tracked, and the first frame of the video sequence is usually selected as the template input to obtain the template feature map; the other is the search branch, that is, the image search area of each subsequent frame is used as the search image input to obtain the search feature map. SiamFC regards tracking as a process of template matching. That is, feature extraction, followed by cross-correlation operation on the generated feature maps, that is, the convolution computation, so as to generate a heat-map. The cross-correlation operation is as Equation (1) follows.
where, Z and X indicate the template image and the search image, respectively. ϕ is the network for feature representation of templates and search areas, referring to AlexNet [26]. ϕ is a transformation operation, and the highest corresponds of the response value represents the predicted position of Z. I is the identity matrix, and b is the bias term.

End-to-End Feature Pyramid Network
The optical flow network can forecast the movement trend of the target more precisely. This paper performs end-to-end offline pre-training on the optical flow network. The network structure is displayed in Figure 2. First, based on a coarse-to-fine scheme, this paper uses pyramid features instead of ordinary images as input. Then, FlowNet2 [27] is fine-tuned, using FlowNetC as the first block followed by two FlowNetS blocks to crop the optical flow map in accordance with the target location, and finally, for the purpose of reduce the number of parameters, the image features are distorted to output 4 different scales, 64 × 64 × 2, 32 × 32 × 2, 16 × 16 × 2, 8 × 8 × 2. Based on our findings, we found that end-to-end trained networks perform outstanding than untrained networks.
where, and indicate the template image and the search image, respectively. is the network for feature representation of templates and search areas, referring to AlexNet [26].
is a transformation operation, and the highest corresponds of the response value represents the predicted position of Z. I is the identity matrix, and b is the bias term.

End-to-End Feature Pyramid Network
The optical flow network can forecast the movement trend of the target more precisely. This paper performs end-to-end offline pre-training on the optical flow network. The network structure is displayed in Figure 2. First, based on a coarse-to-fine scheme, this paper uses pyramid features instead of ordinary images as input. Then, FlowNet2 [27] is fine-tuned, using FlowNetC as the first block followed by two FlowNetS blocks to crop the optical flow map in accordance with the target location, and finally, for the purpose of reduce the number of parameters, the image features are distorted to output 4 different scales, 64 × 64 × 2, 32 × 32 × 2, 16 × 16 × 2, 8 × 8 × 2. Based on our findings, we found that end-to-end trained networks perform outstanding than untrained networks.

Attention Network
In recent years, Attention Model has been widely used in various types of deep learning tasks, such as natural language processing [28,29], image recognition [30,31] and speech recognition [32,33]. It is one of the core technologies of deep learning that deserves the most attention and in-depth understanding. The attention mechanism gathers the information of a certain part and focuses on a certain area in the image in terms of image processing. In other words, the attention mechanism gathers the most effective information of the target to better apply it in all aspects.
The attention mechanism mainly involves two parts. First, it is necessary to confirm which part should gain more attention; second, it extracts features from the critical parts with the intention of gaining significant information. Since targets in complex scenarios will be affected by background interference, appearance distortion and other problems, the attention mechanism was introduced into the task of terrorist tracking in order to extract the target in the video frame from the background, so as to focus on the target of the tracking model and reduce the background interference to a certain extent. In this paper, the attention mechanism combining channel attention and spatial attention was used, as is displayed in Figure 3.
To be specific, the channel attention module uses a feature map as a unit. Each channel of the feature map is able to be considered as a special feature detector, which uses the feature vector = , , … , obtained by the feature M ∈ * * passing the global average pooling layer as the input of the completely connected layer to maintain the adaptability of the target appearance in the deep network. ReLU is the activation function, after passing the next fully connected layer, the Sigmoid function was selected as the nonlinear activation function. So as to strengthen the non-linear relationship between the channels to promote better mutual learning, the output vector obtained = , , … ,

Attention Network
In recent years, Attention Model has been widely used in various types of deep learning tasks, such as natural language processing [28,29], image recognition [30,31] and speech recognition [32,33]. It is one of the core technologies of deep learning that deserves the most attention and in-depth understanding. The attention mechanism gathers the information of a certain part and focuses on a certain area in the image in terms of image processing. In other words, the attention mechanism gathers the most effective information of the target to better apply it in all aspects.
The attention mechanism mainly involves two parts. First, it is necessary to confirm which part should gain more attention; second, it extracts features from the critical parts with the intention of gaining significant information. Since targets in complex scenarios will be affected by background interference, appearance distortion and other problems, the attention mechanism was introduced into the task of terrorist tracking in order to extract the target in the video frame from the background, so as to focus on the target of the tracking model and reduce the background interference to a certain extent. In this paper, the attention mechanism combining channel attention and spatial attention was used, as is displayed in Figure 3. The spatial attention module uses each pixel in the feature map as a unit. For the sake of uplift the feature representational capacity of the model, the structural dependency of spatial information was first established, and then weight was assigned to each pixel in  To be specific, the channel attention module uses a feature map as a unit. Each channel of the feature map is able to be considered as a special feature detector, which uses the feature vector α = {α 1 , α 2 , . . . , α d } obtained by the feature M ∈ R w * h * d passing the global average pooling layer as the input of the completely connected layer to maintain the adaptability of the target appearance in the deep network. ReLU is the activation function, after passing the next fully connected layer, the Sigmoid function was selected as the non-linear activation function. So as to strengthen the non-linear relationship between the channels to promote better mutual learning, the output vector obtained β = {β 1 , β 2 , . . . , β d } was multiplied with the input feature for the purpose of gaining the channel attention feature map N ∈ R w * h * d .
The spatial attention module uses each pixel in the feature map as a unit. For the sake of uplift the feature representational capacity of the model, the structural dependency of spatial information was first established, and then weight was assigned to each pixel in the feature map. First, the input feature map x was convolved using a 1 × 1 convolution kernel, and then transformation functions f (x), g(x) and h(x) were used in order to transform three convolutions, respectively. As shown in Equation (2).
where, W 1 , W 2 , W 3 represent the weight of f (x), g(x) and h(x), respectively. After that, the result output by the function f (x) was transposed for matrix multiplication with the output result, and the result obtained was calculated by utilizing the Softmax function with the intention of obtaining the spatial attention map. The spatial attention is calculated by the following Equation (3).
where, a and b represent the a-th and the b-th position on the input image, respectively. After that, matrix multiplication was performed on the obtained spatial attention map and the function h(x). The feature map which is regulated by the spatial attention module is calculated as Equation (4) follows.
where, α is the weight parameter. The final output of the multi-attention mechanism is the element addition of the channel attention feature and the spatial attention feature. In this way, better feature appearance information can be acquired. As shown in Equation (5).

Implementation Details
The algorithm in this paper is implemented based on the Python programming language in the Pytorch framework on the computer hardware platform of Intel Broadwell 2.4 GHz GPU (Tesla V100) and Intel(R)Core(TM) i7-10700 CPU@2.90 GHz. Our method is divided into two training parts. The optical flow estimation network is trained offline on the ILSVRC-2015 [34] video dataset, and the pre-trained FlowNet2.0 is used as the backbone network of the optical flow estimation network. Set the number of epochs to 20, and the learning rate declines linearly from 0.01 to 0.001. The tracking network is trained on the ILSVRC-2015 and YouTube-BB [35] datasets. ILSVRC-2015 has approximately 1.3 million frames and approximately 2 million tracked objects with terrestrial real bounding boxes. YouTube-BB consist of more than 100,000 videos annotated once in every 30 frames. The pre-trained AlexNet [26] network is used as the backbone network of the tracking network, randomly selecting a pair of images from the images, cropping out the central region z, and the other one Figure crop out x. The network parameters were optimized by Stochastic Gra-dient Descent (SGD), the number of epochs was set to 50, and the learning rate decreased from 0.01 to 0.001.

Ablation Experiments
In this section, ablation experiments are conducted in the OTB benchmark to verify the effectiveness of the module and prove that the design of SiamFC is reasonable. The result analysis included the success rate and accuracy of otb50 and otb2015. As shown in Table 1. In this paper, SiamFC is used as a baseline tracker. The success rates of baseline in otb50 and otb2015 were 0.519 and 0.586, respectively, and the accuracy was 0.693 and 0.772, respectively. In baseline +PW, the optical flow estimation network is added, and the success rate and accuracy are greatly improved, which can prove that the optical flow network can more accurately predict the movement trend of the target. In baseline +PW +FP, the input image is replaced by the feature pyramid, and the success rate and accuracy are improved, indicating that the feature pyramid can suppress the target loss caused by illumination changes. In baseline +PW+FP+Att, the dual attention mechanism is added, and the success rate and accuracy have been significantly improved, indicating that the ability to use the attention mechanism to resist background interference is enhanced. Overall, each module in SiamFP can significantly improve the performance of the algorithm.

State-of-the-Art Comparison
As we all know, OTB (Object Tracking Benchmark) and VOT (Visual Object Tracking) are very comprehensive to the test of tracker, and they are universal platform for evaluating computer vision algorithm. Therefore, the proposed SiamFP is compared with the most advanced tracker on OTB and VOT benchmarks. Traditionally, the tracking speed exceeding 25 FPS is considered as real-time, and the tracking speed of SiamFP can reach 50 FPS.

Results on OTB
OTB benchmarks include OTB50 [36] and OTB2015 [37]. OTB2015 dataset contains 100 manually labeled video sequences, so the OTB2015 dataset is also called OTB100. OTB50 selects 50 difficult video sequences from OTB2015. Common evaluation metrics in the OTB benchmarks are accuracy and success rate. Accuracy refers to calculating the Center Location Error (CLE) between the model's forecast location (bounding box) and the groundtruth of the target given a threshold T. The success rate refers to the IoU between the predicted position frame of the model and the target real position frame given a threshold T. On the basis of the accuracy and success rate indicators, OTB proposed the One Pass Evaluation (OPE) robustness evaluation indicator, which refers to initializing the first frame with the position of the target in the ground-truth, and then running the tracking algorithm for the purpose of gaining the average precision and success rate, generate OPE metrics.
The comparative experiments between the SiamFP tracker and the more advanced trackers SRDCF [38], Staple [39], CFNet, SiamFC, fDSST [40], and SiamRPN have achieved leading levels in both OTB50 and OTB2015 benchmarks, and the OPE accuracy graph and the success rate graph is displayed in Figure 4. The experimental outcomes show that in the OTB50 and OTB2015 benchmarks, the tracking accuracy reaches 0.829 and 0.88, and the success rate reaches 0.604 and 0.658, respectively. Compared with the SiamFC algorithm, the precision is increased by 13.6% and 10.8% correspondingly, and the success rate is increased by 8.5% and 7.2% correspondingly. In the OTB2015 benchmark, the performance is significantly improved, especially in three complex scenes, occlusion (OCC), illumination variation (IV), and fast motion (FM). As shown in Figure 5. More results are summarized in Table 2. For the purpose of determining the effective improvement of the algorithm in this study in actual video sequences, video sequences with relatively complex scenes were selected from the OTB dataset as the visualization results, as shown in Figure 6. To sum up, the SiamFP algorithm adding optical flow network and attention mechanism has a good ability to suppress background interference.

Results on VOT
The VOT dataset is mainly derived from the Video Object Tracking Challenge (VOT, Challenge). It not only provides open-source toolkits and many labeled videos, but also provides evaluation criteria and test results for target tracking algorithms. In the VOT dataset, VOT2016 [41] is the most commonly used dataset in order to assess the performance of the algorithm. It has increased from 16 videos to 60 videos, and the difficulty has also increased. VOT2016 mainly uses three indicators to verify the performance of the target tracking algorithm. (1) Accuracy. Computational accuracy is equivalent to counting the number of overlaps between the forecast bounding box and the ground-truth bounding box. (2) Robustness. The steadiness of the tracking target is assessed by utilizing Robustness. (3) Expected Average Overlap, EAO.
The EAO curve results of the VOT2016 assessment are shown in Figure 7 and Table  3. As is revealed by the results in the figure, SiamFP ranks first among 70 trackers according to the criteria for ranking trackers by EAO score, exceeding the performance of these state-of-the-art trackers. Notably, SiamFP outperforms the second-ranked tracker SiamFC by about 15%.

Results on VOT
The VOT dataset is mainly derived from the Video Object Tracking Challenge (VOT, Challenge). It not only provides open-source toolkits and many labeled videos, but also provides evaluation criteria and test results for target tracking algorithms. In the VOT dataset, VOT2016 [41] is the most commonly used dataset in order to assess the performance of the algorithm. It has increased from 16 videos to 60 videos, and the difficulty has also increased. VOT2016 mainly uses three indicators to verify the performance of the target tracking algorithm. (1) Accuracy. Computational accuracy is equivalent to counting the number of overlaps between the forecast bounding box and the ground-truth bounding box.
(2) Robustness. The steadiness of the tracking target is assessed by utilizing Robustness.
The EAO curve results of the VOT2016 assessment are shown in Figure 7 and Table 3. As is revealed by the results in the figure, SiamFP ranks first among 70 trackers according to the criteria for ranking trackers by EAO score, exceeding the performance of these state-of-the-art trackers. Notably, SiamFP outperforms the second-ranked tracker SiamFC by about 15%. bustness. (3) Expected Average Overlap, EAO.
The EAO curve results of the VOT2016 assessment are shown in Figure 7 and Table  3. As is revealed by the results in the figure, SiamFP ranks first among 70 trackers according to the criteria for ranking trackers by EAO score, exceeding the performance of these state-of-the-art trackers. Notably, SiamFP outperforms the second-ranked tracker SiamFC by about 15%.

Conclusions
With the intention of addressing the problem which the target tracking algorithm fails to track due to fast movement and background interference, this paper proposes a feature pyramid optical flow estimation Siamese network target tracking algorithm. The algorithm jointly trains optical flow and tracking tasks under the framework of the Siamese network. On the basis of the pyramid correlation mapping, the optical flow network is utilized to assess the movement information of the target in two contiguous frames, thereby improving the accuracy of feature representation. At the same time, SiamFP utilizes spatial attention and channel attention to effectively suppress background noise and better extract target features, effectively improving the success rate of the tracking algorithm. This paper conducts comparative tests with various algorithms on the OTB50, OTB2015, and VOT2016 datasets, and conducts qualitative analysis of the video sequences. The experimental results show that the SiamFP algorithm based on SiamFC reduces the occupation of computing resources, improves the computing speed, and improves the tracking effect. The designed SiamFP algorithm has good tracking effect in various challenging scenarios and shows a good balance between accuracy and real-time performance.