Online Siamese Network for Visual Object Tracking

Offline-trained Siamese networks are not robust to the environmental complication in visual object tracking. Without online learning, the Siamese network cannot learn from instance domain knowledge and adapt to appearance changes of targets. In this paper, a new lightweight Siamese network is proposed for feature extraction. To cope with the dynamics of targets and backgrounds, the weight in the proposed Siamese network is updated in an online manner during the tracking process. In order to enhance the discrimination capability, the cross-entropy loss is integrated into the contrastive loss. Inspired by the face verification algorithm DeepID2, the Bayesian verification model is applied for candidate selection. In general, visual object tracking can benefit from face verification algorithms. Numerical results suggest that the newly developed algorithm achieves comparable performance in public benchmarks.


Introduction
As a fundamental and challenging task, visual object tracking has a variety of applications, such as smart video surveillance, autopilot, human-computer interaction and video communication [1][2][3]. In general, the goal of visual object tracking is to estimate the position and scale variation of targets in the video sequence, where its initial state is given in the first frame. Object occlusion, scale variation, illumination variation, deformation, background clutter and motion blur are the major challenges for robust visual object tracking.
To exploit the representation capabilities of CNNs, Tao et al. [9] proposed a matching function with the Siamese network to extract feature vectors, which was named the Siamese Instance Search for Tracking (SINT). This new method was trained using the contrastive loss. Similarly, in [7], a fully convolutional Siamese-like tracking algorithm (SiamFC) was proposed, which was supervised by the logistic loss.
The above two algorithms train the Siamese Networks in an offline manner. Ideally, feature vectors of the same target in different frames need to be close to each other. In order to train such a CNN model, a large set of annotated video sequences are needed. However, in practice, the amount of annotated video sequences for tracking is insufficient. For example, only 3964 groups of annotated video sequences are used in the offline training in SiamFC, which are extracted from ImageNet [16]. Without online updating, the offline learned Siamese models are susceptible to appearance changes. Different from SiamFC and SINT, the algorithms MDNet and SANet trained an offline model and updated part of it in the inference phase, and these two algorithms were supervised by logistic loss. In addition, these two algorithms have made superior performance in online tracking benchmark (OTB). However, the dataset for model training (i.e., in the offline process) in MDNet [10] and SANet [11] are not completely independent with regard to the test dataset, and the training data used in MDNet [10] and SANet [11] is not supported by the visual object tracking (VOT) [17]. At the same time, discriminative correlation filter (DCF) algorithms like ECO [12] and MCPF [13] have been proposed, and they update their tracking models with historical prediction results. Basically, the DCF methods train a regression model to indicate object's position offset, which is different from MDNet, SANet, and our proposed algorithm.
In this paper, a novel online lightweight Siamese network is proposed for visual tracking (OSNV). In the tracking inference phase, tracking results and sample candidates are collected to maintain an online training set. Unlike SiamFC and SINT, the proposed method updates the feature model with historical prediction results. Because of that, the feature model could learn from domain knowledge and adapt to appearance changes of target. By the way, the input data of the proposed Siamese network is the feature maps of VGG-M [18], not the raw image data. The framework of online Siamese Network is depicted in Figure 1. Our proposed Siamese network lies in the middle, and two groups of branches share the same configuration and weights. Right: the improved contrastive loss, which is served as the Loss Function to propagate gradients for updating the Siamese network. Dpout stands for drop-out layer, best viewed in colour.
In addition, the offline Siamese network trained by the contrastive loss [19] or logistic loss [7] has limited discrimination capability. Furthermore, it is not easy to prepare a dataset to train a Siamese network. If the cross-entropy loss is adopted [20], the feature extraction model may be overfitting due to limited training data. On the other hand, the contrastive loss and logistic loss mainly concentrate on reducing the intra-class variations while the cross-entropy loss leads to increasing the inter-class variations. In conclusion, they are complementary to each other. With these properties, the cross-entropy loss [21] is integrated with the contrastive loss to update the proposed Siamese network, which is helpful to enhance the model's discrimination.
As for candidate selection, the Bayesian verification model [22] is used. It is widely used as a matching function in the face verification task [22,23]. Instead of identifying a hyperplane to distinguish the object from the background, the Bayesian verification model, trained using expectation maximization (EM), determines a log likelihood ratio to give a similarity score for an image-pair.
The main contributions of this paper are summarized as follows: • An online Siamese network is proposed. It can learn from the domain knowledge of target and adapt to appearance changes of target; • An improved contrastive loss integrated with cross-entropy loss is introduced to update the Siamese network; • The Bayesian verification model is transferred for candidate selection. In addition, we find that the visual object tracking can benefit from face verification algorithms; • Four ablation experiments are applied to verify the effectiveness of the proposed loss function. The obtained numerical results demonstrate that the newly developed Siamese network outperforms SiamFC [7] and SINT [9], and has made a comparable performance with state-of-the-art trackers.
The rest of this paper is organized as follows. Section 2 reviews the related work. In Section 3, details of proposed Siamese network and the improved loss function are presented. Implementation details about Bayesian verification model are provided in Section 4. The numerical results are provided in Section 5. Finally, Section 6 concludes the whole paper.

Siamese Network for Visual Object Tracking
For the Siamese network, Tao et al. [9] proposed a Siamese model with a region of interest (e.g., RoiPool [24]) layer. It keeps the feature vectors with different regions the same length. Their proposed feature model learns a matching mechanism [9]. The SINT algorithm [9] samples candidates in the coming frame and the most similar one is determined by the learned matching function. However, without online updating, SINT cannot learn from the domain knowledge with specific tracking targets.
Different from SINT, the SiamFC method [7] is supervised by the logistic loss. It has an exemplar CNN branch and a search CNN branch. These two branches share the same weight and configuration. The output feature maps of the exemplar correlates with the search branch to generate a response map. The location of the maximum value within the response map indicates the position offset of target. However, without the online updating, the SiamFC cannot adapt to appearance changes of tracking targets. Similar to SiamFC, Guo [25] proposed a dynamic Siamese network, which is robust to target variation and background clutters. In the work of DCFNet [26], Wang et al. proposed a correlation filter layer, which is inserted in the Siamese network to learn the convolutional features and enlarge the training dataset simultaneously. In addition, the prediction results returned by the SiamFC are penalized by a cosine window in order to achieve a comparable performance. In our proposed algorithm, the confidence of prediction result is similarity score returned by Bayesian verification model without any window tricks.

Online Algorithms for Visual Object Tracking
Online models for visual object tracking mainly have two classes: DCF based and None-DCF based. DCF-based algorithms i.e., CF2 [8], ECO [12], MCPF [13], SRDCF [27], train correlation filters based on feature vectors extracted from pre-trained CNN models. For the None-DCF based algorithms, the representative trackers are MDNet and SANet. They train an offline model with a tracking dataset, where part of the feature model is updated in tracking. However, the dataset for model training (i.e., in the offline process) in MDNet and SANet is not completely independent with regard to the test dataset, which is not supported by the VOT-Challenge [17].

Loss Function for CNNs in Visual Tracking
In [20], Wang et al. proposed an online CNN model, which is supervised by the cross-entropy loss. The cross-entropy loss function aims at finding a hyperplane to distinguish the target from the background and increases the inter-class variations. It is always used in the classification task [4,21]. However, the cross-entropy loss is prone to overfitting due to the limited training data, which makes it not robust to appearance changes of targets. For the offline-trained models, the contrastive loss is used in [9] and the logistic loss is used in [7]. The contrastive loss and logistic loss are mainly used to reduce the intra-class variations, which make them robust to the distractor from similar objects. In this paper, we propose an improved loss function where the cross-entropy loss is integrated with the contrastive loss.

Bayesian Verification Model
Bayesian formulation has been very successful in the face verification task. In [22], Chen et al. proposed a new Joint Bayesian formulation based on the classical Bayesian face recognition method. Instead of modeling the difference of appearance between two images, they set up a joint Gaussian distribution for an image pair. The method yielded an excellent result on the challenging data set of Labeled Face in Wild (LFW) [28] with "hand-crafted features". In [23], Sun et al. constructed a DeepID2 CNN model to map pixel values into the high-dimensional feature space, which is a Siamese-like structure as well. They also trained a Bayesian model with input data extracted by the DeepID2 CNN network for face verification. Similar to DeepID2, the Bayesian verification model is implemented for candidates selection.
With Bayesian verification [22], a target is modeled by summing two independent Gaussian variables as µ and ε are latent variables, and their covariance matrices are S µ , S ε , respectively. The log likelihood ratio where P(x i , x j |H I ) is the probability that the two samples (x i , x j ) are from the same target. P(x i , x j |H I ) is Gaussian with covariance matrix: P(x i , x j |H E ) is the probability that the two samples (x i , x j ) are from different targets. The associated covariance matrix is: From Labels (4) and (5), the log likelihood ratio in Label (3) can be expressed as [22], after ignoring constant terms, where The parameters Θ = {S µ , S ε } can be learned by the EM-like algorithm [22].

Proposed Algorithm
In this section, details of the proposed Siamese network are described. Then, a brief introduction about contrastive loss and cross-entropy loss is given. After that, the improved loss function is presented. Finally, we talk about the implementation of Bayesian Verification model for candidates' selection.

Siamese Network
The proposed Siamese network for online updating consists of two branches, and they share the same configurations and weight. As shown in Figure 1, each branch of the Siamese network consists of four CNN layers. There are three types of CNNs layers including two fully convolutional layers (FC1 and FC2), a rectified linear unit layer (ReLU) and a Dropout layer (Dpout). The configurations of filters size, stride and pad within the layers FC1 and FC2 are (3 × 3 × 512 × 512, 1, 0) and (1 × 1 × 512 × 256, 1, 0), respectively. The purpose is to gradually reduce the feature dimension and remove redundancy.
To illustrate the effect of the proposed Siamese network, we collect the feature maps from conv3 in VGG-M and our proposed Siamese network. These feature maps are obtained by the first frame of 100 video sequences in OTB-2015 [29]. The histograms of the collected feature maps are depicted in Figure 2. The width of the bins is set to be 0.1. As shown in Figure 2, we can see that the amplitude frequency in the range of [0, 0.1) is 0.8114. Moreover, in the range of [0, 0.1), 94.84% amplitude values are equal to 0. Therefore, the feature maps of conv3 in VGG-M are sparse and there exists redundant information, which are not relative to visual tracking task. With our proposed Siamese network, the histogram of the obtained feature maps is depicted in Figure 2. The frequency of the amplitude value range [0, 0.1) is now 0.1119. Moreover, points with zero amplitude are removed. In other words, our proposed Siamese network effectively eliminates redundant information. Due to the training data scarcity, a lightweight Siamese network with four layers is designed. Because of online updating, the feature model may be overfitting to historical target's appearance. To address this issue, we add a drop-out layer [30] after the layer ReLU (see Figure 1). We set the drop-out rate as 0.5. For the layer FC3, it has two outputs, corresponding to the target and background. The new layer L2Dis is used to calculate the square of Euclidean distance between two feature vectors by: where x i,1 , x i,2 stand for input pair-data to update the Siamese network. The parameters in w 0 belong to layers FC1 and FC2, and f (·, w 0 ) is the embedding function referred to the Siamese network. The w 0 is updated using an online dataset during the tracking process. In the rest of this paper, we use For online model updating, we collect tracking results and sample extra candidates to maintain an online training set. During the tracking process, we only keep Q groups' tracking results before the current frame (i.e., when the current frame number P < Q), only P groups' tracking results are saved.). In addition, the data from the first frame is always kept. With the online updating, compared to SINT and SiamFC, the proposed Siamese network can learn from the domain knowledge and adapt to appearance changes of specific tracking targets.

Cross-Entropy Loss
The cross-entropy loss [21] is commonly used in the classification task. In the visual tracking task, there are two classes: target and background. The corresponding cross-entropy loss is where In addition, p( f (x i,j ), w 1 ) denotes the probability that the candidate x i,j belongs to the target. The variable y i,j stands for the true class label as depicted in Figure 1. The parameters in w 1 , w 2 are from layer FC3. The batch-size of input pair-data is M. Under the cross-entropy loss, the classification model attempts to approximate the true distribution of candidate. The classification model based on the cross-entropy loss mainly focuses on increasing the margin among candidates from different classes. As discussed in Section 1, we also want to decrease the distance among feature vectors, which are extracted from the same target. However, the cross-entropy loss does not have restriction about the intra-class variations, which is essential to the visual tracking task.

Contrastive Loss
In the metric-based learning, the contrastive loss [19] is mainly used to reduce the intra-class variations by pulling feature vectors from the same instance together, and it is given by The variable of m is a hyperparameter. As shown in Figure 1, the variable s stands for the pair label. If two candidate samples (x i,1 , x i,2 ) are from the same object (target or background), the value of variable s i is 1. Otherwise, it is 0. It is not easy to train the Siamese network supervised by the contrastive loss. For example, if we have N 1 target candidates and N 2 background candidates, the possible input sample pairs are N 1 ( It would be computationally prohibitive to use all possible sample pairs to update the Siamese network. In addition, although the contrastive loss tends to increase the inter-class variations, the obtained margin may not be satisfactory for the discrimination purpose.

Improved Contrastive Loss
In practice, the contrastive loss employed a function with a hard decision margin to separate object f (x i,1 ) from background f (x i,2 ): The partial derivative of E on f is zero in the range of (m, ∞): However, the Softmax loss behaves as a smooth function to separate object f (x i,1 ) from background f (x i,2 ): The weights of w 1 and w 2 correspond to object and background, respectively. In addition, the gradients to separate the object from background will persist: The fixed margin used in the contrastive loss to separate feature vectors can't handle challenge video sequences very well. Thus, it is not appropriate to separate the object and background only supervised by the contrastive loss. The cross-entropy loss is more suitable to increases the inter-class variations than the contrastive loss. However, the contrastive loss is good at reducing the intra-class variations. Given that, we think these two loss functions are complementary. Our proposed loss function not only increases the inter-class variations but also reduces the intra-class variations via The last item on the right-side of Equation (19) is the regularization term.

Implementation of the Bayesian Verification Model
For the implementation of the Bayesian verification model, features extracted by the Siamese network are used to update it. In the first frame of a video sequence, the proposed algorithm samples 500 positive candidates and 1000 negative candidates, respectively. Then, the Bayesian verification model will be trained by the EM-like algorithm, which is the same as [22]. In addition, the Bayesian verification model is updated frame by frame. Furthermore, it is adjusted with ten iterations in every frame. In the phase of inference, we utilize Equation (3) to compare two Siamese feature vectors extracted from an image pair and get their similarity scores to select the most relevant candidate bounding boxes. In practice, the top five results will be saved.

Implementation Details
To improve the performance under scale variation, the strategy of bounding box regression [6] is employed. The bounding box regression model is only initialized with the target state in the first frame. Then, it is used to refine the returned results from random sample candidates. For the online updating, the learning rate is η = 0.005 while the number of iterations is T = 50 in the first frame. The batch size of the input data is M = 256. In the subsequent frames, the Siamese network is adjusted with a learning rate of η = 0.003 and T = 30 iterations. The hyperparameters of m and β in the loss function (19) are both set to 1. Our proposed Siamese network is adjusted every six frames in order to prevent overfitting with the recent training samples. As for the joint Bayesian verification model, it is iterated 50 times in the first frame and 10 times in subsequent frames. The Siamese network is updated by stochastic gradient descent (SGD). The joint Bayesian verification model is updated frame by frame.

Experimental Validations
In this section, the proposed algorithm (OSNV) is evaluated on three large benchmarks, OTB-2013 [14], OTB-50 and OTB-2015 [29]. They contain 51, 50 and 100 test sequences, respectively. In addition, we test our proposed algorithm in VOT-2016 [17] and TempleColor [31]. Our proposed algorithm is implemented in MATLAB 2017a on a Dell R7300 desktop with a Nvidia TitanX GPU. The algorithm runs at 1.2405 fps on average.
The tracking algorithms are evaluated based on the distance precision and bounding box overlap with one-pass evaluation (OPE), temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE). The predicted rectangle boxes are considered to be a success if their coordinate centers are less than 20 pixels compared to the ground truth bounding boxes. The numbers in the legends of the distance precision plots are the precision value when the location error threshold is 20 pixels. For the success rate plots, they stand for the area under curve. When the overlap value is more than the overlap threshold, it is a success prediction.

Ablation Study
To evaluate the performance of the proposed improved contrastive loss and Siamese network, four additional algorithms are designed: (1) The algorithm OSNV_Log is the proposed Siamese network with the logistic loss. For the ablation algorithm OSNV_Log, the layer contrastive loss is disabled with the replacement by the logistic loss layer, as shown in Figure 1. In addition, the layer of FC3 and cross entropy (Softmax) layer are also disabled. (2) The algorithm OSNV_Sof is the proposed Siamese network with the cross-entropy loss. This is implemented by disabling the layer L2D and contrastive loss in Figure 1. (3) The algorithm OSNV_Con is the proposed Siamese network with the contrastive loss, which is implemented by disabling the layer FC3 and cross-entropy loss in Figure 1; (4) For the algorithm OSNV_PCA, the proposed Siamese network is disabled. The PCA is applied to reduce the dimension of VGG-M outputs from 3 × 3 × 512 × 1 to a 256 × 1 vector, and its length is the same as the output of the proposed Siamese network.
The ablation study results are depicted in Figure 3. Compared with the contrastive loss or cross-entropy loss, the proposed improved contrastive loss gains improvement about 1.8% on the success rate. It can also noticed that the method with logistic loss is better than the cross-entropy loss or contrastive loss. Our proposed improved contrastive loss (OSNV) outperforms the logistic loss (OSNV_Log) by 1.5%. Furthermore, when the input data is only adjusted by the PCA model, the performance has a big degradation about 4.4% compared to the OSNV. The performance degradation is attributed to discarding domain knowledge and not adapting to appearance changes of target. Basically, based on the ablation results, it can conclude that the contrastive loss and cross-entropy loss compensate for each other. More importantly, the ablation results have opened an interesting topic: Are the loss functions used in current Siamese-like algorithms suitable for model training regardless of the structures of neural networks?
The overall performance of all algorithms are displayed in Figure 4. The plot on the left side is the location error threshold with OPE. The plot on the right side is the success rate with OPE. There are 51 video sequences to be tested with the evaluation tool provided by OTB-2013 [14].
As shown in Figure 4, our proposed algorithm OSNV achieves a comparable result. Compared with the Siamese-like tracking algorithms: SiamFC_3s and SINT_noflow, the OSNV makes a performance gain about 1.2% and 3.9% on the success rate of OTB-2013, respectively. In addition, the OSNV performs not as good as the state-of-the-art CNN algorithms MDNet [10] and SANet [13]. There are two reasons: (1) the training data used by MDNet and SANet comes from the VOT dataset, which is more appropriate for tracking task. However, the data used in MDNet and SANet for backbone network training is not completely independent to test video sequences, which is not supported by the VOT-Challenge office [17]. (2) these two algorithms apply hard example mining strategy for model updating, which is not used in OSNV. As the correlation filter algorithm ECO [12], it extracts feature vectors by CNNs, Colormap, and histogram of oriented gradient (HOG). And its feature diversity is superior to OSNV. For the MCPF [13], the author combines the particle filter and correlation filter algorithms to improve the tracker's performance. However, the OSNV takes a random sample strategy, which is not effective as the particle filter.

Robustness Evaluation
To evaluate the algorithms' robustness, Wu et al. [14] introduced two new metrics: TRE (i.e., tracking starts at different frames) and SRE (i.e., some of tiny disturbance is added into the initial state of bounding box in the first frame: x + ε 1 , y + ε 2 , w + ε 3 , h + ε 4 ). For a test video sequence, TRE would generate 20 groups of test sequences about the original one with different start points. SRE is about evaluating a tracking algorithm with 12 different shifted bounding boxes in the first frame based on one annotated video sequence. With the limited GPU cards, we didn't make TRE and SRE experiments of SANet. In our machine, the tracking speed of SANet is no more than 0.00073 fps.
As Figure 5 depicted, the OSNV gets the fourth place in both temporal and spatial variations. For the evaluation metric SRE, the OSNV gains improvement about 4.2% compared with SiamFC_3s. In addition, the OSNV outperforms the algorithm of SINT_noflow about 1% with the mean success rate on the bounding box overlap in TRE. Due to the online model updating, our algorithm has made a good performance in TRE and SRE compared to SiamFC_3s and SINT_noflow.

Attribute-Based Evaluation
In [14], Wu et al. proposed to categorize the sequences by annotating them with 11 attributes. In this paper, we show comparison results on eight attributes: fast motion, background clutter, motion blur, deformation, illumination variation, in-plane rotation, occlusion, and scale variation. The success plots of TRE with nine tracking algorithms on different attributes are depicted in Figure  6. The evaluation metric is the success rate with overlap threshold on TRE. The number in every sub figure's title indicates the amount of video sequences belongs to that attribute. The performance of algorithms corresponding to those eight different attributes represents the ability of the tracking method to deal with different challenges.   The attributes of fast motion, background clutter, motion blur, deformation, in-plane rotation, and illumination are mainly relative to the feature model's representation ability. As Figure 6 depicted, our proposed algorithm outperforms SiamFC_3s and SINT_noflow on all attributes. For the illumination and scale variation, compared to MCPF, the OSNV makes gains about 0.9% and 0.8%, respectively.

Evaluation on OTB-2015
Here, we compare the proposed algorithm OSNV with nine state-of-the-art tracking algorithms on the challenge tracking benchmark OTB-2015 [29], which has 100 test video sequences, and the size of it is twice as big as OTB-2013. As shown in Figure 7, the proposed algorithm OSNV achieves a comparable result against MCPF on success plots. For the algorithm SiamFC_3s and SINT_noflow, the OSNV has demonstrated superior performance in two metrics.

Evaluation on OTB-50
As for the dataset OTB-50, it has 50 video sequences which are selected from the OTB-2015. They are more challenging compared to other 50 video sequences within the OTB-2015. Similarly, we evaluate the 10 algorithms on OTB-50 as well. And the test results are depicted as Figure 8. Compared with MCPF, SiamFC_3s, and SINT_noflow, our proposed algorithm makes a consistent improvement.   Figure 8. The OPE results of 10 tracking algorithms on OTB-50. The left side is precision plots and the right side is success plots, which are both evaluated on OTB-50.

Evaluation on VOT-2016
For the performance comparison in VOT-2016, the latest Visual Object Tracking toolkit is used. We download the performance results of trackers CCOT [35], MDNet [10], DeepSRDCF [36], SRDCF [27], TGPR [34], and HCF [37] from the VOT-2016 challenge results link. The HCF is an improved version of CF2 [8] with scale estimation. The result of SiamFC_3s is evaluated by our own through adding a vot wrapper for the source code. For the algorithm of SINT, we didn't use its result for comparison for the reason that its training data is officially not supported by the VOT-Challenge. The overall comparison results are shown in Table 1. Our proposed algorithm gets the second best performance in all three metrics among 8 trackers. Specifically, compared to MDNet, the OSNV has made a gain about 5.46% in EAO. As suggested by VOT office [17], the training data used by MDNet is coming from Imagenet [16], which is different from the original source code.

Evaluation on TempleColor
The dataset of TempleColor has 129 video sequences. With the source codes of ECO, MDNet, SANet, MCPF, SiamFC_3s, SINT_noflow, KCF, MEEM, and TGPR downloading from authors' project pages, we use the OTB [29] evaluation tool to generate the OPE comparison results in TempleColor. For the tracker TGPR, it has a bug to test the video sequence Pool_ce3 in TempleColor. As for the MDNet and SANet, they also have bugs in video sequence Pool_ce1 and Table_tennis_ce, respectively. Thus, we exclude all of them for comparison. In addition, the training dataset of SINT is not independent from TempleColor. Thus, we also exclude it. The performance results are depicted in Figure 9. Our proposed algorithm makes a gain about 2.4% compared with the SiamFC_3s.   Figure 9. The OPE results of six tracking algorithms on TempleColor. The left side is precision plots and the right side is success plots, which are both evaluated on TempleColor.

Qualitative Evaluation
We select two representative tracking algorithms to compare with the OSNV to perform qualitative results on eight challenging sequences. For each video sequence, we select five images to be shown. The performance results are shown in Figure 10. In the video sequence of BlurCar and BlurFace, the appearance of both targets doesn't change much. In addition, the algorithms of OSNV, SINT_nolfow, and SiamFC_3s could accurately estimate the trajectory of targets. However, when targets have undergone dramatic appearance changes as shown in the video sequences of BlurBody, KiteSurf, and Soccer, the performance of SINT_noflow and SiamFC_3s has big degradation. As for our proposed algorithm OSNV, it can adapt to the appearance changes of targets with online updating. In addition, from the video sequences of Bolt2, Human3, and Liquor, we can draw a conclusion that the offline trained Siamese models of SINT and SiamFC are easily disturbed by the similar objects, while the OSNV can predict the location of targets perfectly. We attribute that the OSNV can learn from the domain knowledge from the specific tracking target.

Failure Case
In practice, we find that the OSNV doesn't perform well in the severe deformation scenes. In Figure 11, we select three frames in each of the three video sequences: Diving, Ironman, and Jump. When the targets undergo severe deformation as Ironman, the OSNV lost the targets immediately. In addition, because of the online updating, the Siamese network has learned lots of incorrect knowledge. Therefore, the algorithm cannot recover from the mistake. Similarly, for the dramatic scale changes within targets, the OSNV cannot deal with that as well. In video sequences of Diving and Jump, our proposed algorithm is unable to accurately predict position and scale changes.

OSNV
Ground Truth Figure 11. Failure cases of OSNV algorithm about three video sequences: Diving, Ironman, Jump, from top to bottom.

Conclusions
In this paper, an online Siamese network and improved loss function have been introduced for visual object tracking. Compared with other offline-trained Siamese-like tracking algorithms, the new technique can learn from domain knowledge and adapt to appearance changes of the target. The improved loss function makes a significant improvement over the logistic loss, cross-entropy loss, and contrastive loss. Numerical results indicated that our proposed algorithm outperforms the offline-trained Siamese-like algorithms. However, compared to state-of-the-art trackers MDNet, SANet, and ECO, our proposed algorithm does not perform well in the case of severe deformation, which could be solved by designing a more complex Siamese network in future. In addition, the Bayesian verification model is used for candidate selection, which is different from other tracking methods. Finally, we think that there are two prospects with Siamese network for the future: (1) the offline Siamese-like algorithm with the online model updating may be a promising direction for the promotion of tracking performance; (2) the loss functions used in current Siamese-like algorithms are not suitable for model training regardless of the structure of neural networks.