Real-Time Visual Tracking with Variational Structure Attention Network

Online training frameworks based on discriminative correlation filters for visual tracking have recently shown significant improvement in both accuracy and speed. However, correlation filter-based discriminative approaches share a common problem: tracking performance degrades when the local structure of a target is distorted by the boundary effect. The shape distortion of the target is mainly caused by the circulant structure used in Fourier-domain processing, which makes the correlation filter learn from distorted training samples. In this paper, we present a structure-attention network to preserve the target structure from the distortion caused by the boundary effect. More specifically, we adopt a variational auto-encoder as a structure-attention network to generate diverse and representative target structures. We also propose two denoising criteria using a novel reconstruction loss for the variational auto-encoding framework to capture structures that remain robust even under the boundary effect. Through the proposed structure-attention framework, discriminative correlation filters can learn robust structural information of targets during online training, with enhanced discriminative performance and adaptability. Experimental results on major visual tracking benchmark datasets show that the proposed method produces better or comparable performance compared with state-of-the-art tracking methods, at a real-time processing speed of more than 80 frames per second.


Introduction
Visual tracking is one of the most widely used computer vision techniques. The goal of visual tracking is to estimate the position and scale of a specified target from a sequence of video frames. Among conventional tracking algorithms, discriminative correlation filter (DCF) approaches achieved acceptable tracking performance using low-level hand-crafted features [1][2][3][4][5][6]. However, the limited representation power of hand-crafted features makes tracking inaccurate, or even causes it to fail, on challenging sequences.
Recently, with the advent of large-scale datasets [7], convolutional neural networks (CNNs) have achieved great success in the visual tracking field. Since visual tracking requires rich representations, deep features extracted from pretrained CNN models [8][9][10][11] are widely used to replace hand-crafted features in the DCF framework. In particular, tracking-by-detection based trackers [12][13][14] have exhibited unparalleled performance by combining detection and tracking in a unified framework. However, in contrast to DCF-based trackers, they require a high computational load for target localization, which makes real-time tracking impossible. To increase the processing speed for real-time tracking, Siamese networks have recently been proposed for visual tracking applications [15][16][17]. They are trained to compare the similarity between the initial and deformed target appearances. In particular, Valmadre et al. and Wang et al. proposed adaptive tracking approaches that pre-train the DCF in conjunction with the Siamese network while maintaining the major properties of the correlation filter [16,17]. However, the approximated solution of the DCF unavoidably suffers from a boundary effect because of the circulant structure of the Fourier transform, and the unbalanced weighting mechanism of the cosine window aggravates this boundary effect, degrading tracking performance. Recent visual tracking approaches have adopted the Siamese network-based DCF of Valmadre et al. and Wang et al. to mitigate these negative effects, but they cannot guarantee a promising performance, especially in real-time tracking applications.
To tackle this issue, we propose a fast and accurate tracking method using a structure-attention network to extract rich structures that are robust to the boundary effect. We first take SiamDCF [17] as a baseline Siamese tracking network for fast and adaptive tracking. To overcome the structure distortion caused by the boundary effect of the DCF, we train the DCF online through the proposed structure-attention network so that the DCF can learn a target structure that is robust to distortion. We use a variational auto-encoder as the structure-attention network to generate diverse and representative structures of the target. In addition, we train the structure-attention network by minimizing a novel reconstruction loss function combining two denoising criteria. The two proposed denoising criteria are designed based on two properties of the DCF: the cosine-window weighting process and the shifted training samples generated by the circulant matrix. Reflecting these properties in the denoising criteria enables the structure-attention network to capture robust features even under the boundary effect. In addition, minimizing the proposed reconstruction loss, which represents the error of reconstructing both the RGB input image and the corresponding feature map, allows the structure-attention network to generate a feature map without losing details of the target structure, and to generate representative target structures. Figure 1 shows that our method can extract a robust structure of the target. The major contributions of the proposed work include:

•
We propose a structure-attention network to minimize the structure distortion caused by the boundary effect and to help the correlation filter learn the representative structure of the target during online training.

•
We propose a novel reconstruction loss and two denoising criteria for training the structure-attention network. This allows capturing robust structural features of the target even under the boundary effect, without losing detailed information of the target.

•
Experimental evaluations on various standard benchmark datasets demonstrate that our method achieves better or comparable performance compared with state-of-the-art trackers in accuracy, at real-time tracking speed.

Figure 1. Response maps of SiamDCF [17]; (c) response maps of our SiamDCF with a variational auto-encoder [18]; and (d) response maps of our method. Our method successfully removes the surrounding background clutter and focuses on the structure of the target, where the peak response value coincides with the true maximum correlation point.

Related Works
Correlation filter-based approaches have played an important role in the visual tracking field because of their computational efficiency, accuracy, and robustness. Bolme et al. proposed a minimum output sum of squared error (MOSSE) correlation filter using single-channel features for real-time video tracking [1]. Henriques et al. proposed a kernelized correlation filter (KCF) using multi-channel features and circulant matrices [2,19]. Danelljan et al. used adaptive color features in visual tracking for a richer representation of the target [20]. To increase the accuracy in tracking a scale-variant object, Danelljan et al. proposed a scale estimation filter [3]. Choi et al. proposed a feature integration framework for visual tracking [4,5]. In addition to multiple-feature integration approaches, various algorithms were proposed to solve the intrinsic problems of correlation filters. Correlation filters often suffer from the boundary effect caused by the cyclic shift used when training them. To overcome this issue, Galoogahi et al. proposed an alternating direction method of multipliers (ADMM) optimization for tracking [6], and Danelljan et al. proposed a spatial regularization method for correlation filters [21]. Chen et al. also proposed a new sparse model with a modulated template dictionary [22].
However, because of the common limitations in representing the target appearance with hand-crafted features, convolutional features, extracted by CNNs pretrained on a large-scale dataset such as ImageNet [7], have been widely used to improve the performance of correlation filter-based trackers [10,11,23]. Ma et al. adaptively trained correlation filters using the hierarchical characteristics of pretrained CNN features [10]. Qi et al. adaptively integrated multiple correlation filter responses using an adaptive hedge algorithm [11]. Danelljan et al. integrated CNN features into [21] for performance improvement [23]. To overcome the drawback of single-resolution features, Danelljan et al. proposed an implicit interpolation method to integrate multi-resolution CNN features. Recently, the tracking-by-detection framework has become one of the standard approaches for visual tracking. Different from correlation filter-based methods, tracking is performed using a classifier that distinguishes the target from the background. Hong et al. proposed a framework that combines pretrained CNNs with online SVMs to obtain target-specific saliency maps for tracking [24]. Instead of using a single classifier, Zhang et al. proposed a multi-expert restoration framework to address the drift problem during tracking [25]. Nam et al. proposed a multi-domain learning framework for tracking [12], which significantly improved the tracking performance. In spite of many attractive properties, most tracking-by-detection frameworks require high computational costs and are limited by the features extracted from a pretrained CNN.
Recently, Siamese CNN architectures have been used to compare the similarity of targets through an end-to-end framework without any online fine-tuning [15][16][17]. These approaches are very successful and show remarkable performance improvement in real-time tracking. The biggest factor in their success is the use of CNN models pre-trained specifically for tracking, rather than models pre-trained on large-scale classification datasets. Bertinetto et al. proposed a fully convolutional Siamese tracking framework and introduced correlation layers to estimate target positions [15]. Valmadre et al. improved the fully convolutional Siamese tracking framework by adding a correlation filter into the Siamese network, achieving efficient tracking with a shallower network [16]. Wang et al. proposed a similar Siamese network that is trainable online, replacing the correlation layer with a discriminative correlation filter and pre-training the Siamese network [17]. However, because of the boundary effect of correlation filters, it did not achieve a significant performance improvement compared with other Siamese network-based methods.

Proposed Method
This section presents the proposed structure-attention network and online tracking process. Figure 2 shows the overall process of the proposed tracking algorithm.

Variational Auto-Encoder
Let x denote the data, z the latent variable, and p(x|z) the distribution generating data x given latent variable z. Since the posterior p(z|x) is intractable to compute, the variational auto-encoder (VAE) utilizes q(z|x) to approximate the true posterior by optimizing the variational lower bound. The VAE maps the input data into latent variables q(z|x) via an encoder network, and then reconstructs p(x|z) from the latent variables via a decoder network. The variational lower bound, denoted as L_V, can be formulated as

L_V = −D_KL(q(z|x) ‖ p(z)) + E_{q(z|x)}[log p(x|z)],  (1)

where the first term is the Kullback-Leibler divergence (KLD) of the approximate posterior from the prior, and the second term is the expected reconstruction loss. Since the gradient of the expected reconstruction term is not straightforward to estimate, we can reparameterize z using a differentiable transformation as in [18]. We also assume that both q(z|x) and p(z) are Gaussian so that the KLD term can be analytically integrated. Hence, the standard VAE objective function can be formulated as

L_V = (1/2) Σ_{j=1}^{J} (1 + log σ_j² − µ_j² − σ_j²) + E_{q(z|x)}[log p(x|z)],  (2)

where J is the dimension of the latent variable z, and {µ, σ} are the outputs of the deterministic encoder network. The reconstruction loss can be minimized using the cross-entropy loss. More details can be found in [18].
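For concreteness, the analytic KLD term and the cross-entropy reconstruction term of the VAE objective, together with the reparameterization trick, can be sketched in a few lines of numpy. This is a minimal illustration under the Gaussian-posterior/Bernoulli-likelihood assumptions; the function and variable names are ours, not the paper's.

```python
import numpy as np

def vae_objective(x, x_recon, mu, log_var):
    """Negative ELBO for a VAE with Gaussian q(z|x) and Bernoulli p(x|z).

    x, x_recon : flattened inputs/reconstructions in [0, 1]
    mu, log_var: encoder outputs, shape (J,) for latent dimension J
    """
    # Analytic KL divergence KL(q(z|x) || N(0, I))
    kld = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Cross-entropy reconstruction loss (Bernoulli negative log-likelihood)
    eps = 1e-8
    recon = -np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))
    return kld + recon

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    """z = mu + sigma * eps: the differentiable transformation of [18]."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
```

Minimizing `vae_objective` over the encoder and decoder parameters maximizes the lower bound L_V.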

Structure Attention Network
We propose to add a variational auto-encoder (VAE) sub-network in the upper Siamese path of SiamDCF, as shown in Figure 2. The VAE is called a structure-attention network, which generates diverse and representative target structures. The encoder in the VAE subnet takes the convolutional features of the previous ((t−1)-th) frame as input and generates the latent vector z. More specifically, the encoder consists of three convolutional layers with batch normalization and ReLU activation followed by three fully-connected layers, and is considered as a nonlinear function of the convolutional features x ∈ R^{w×h×c}:

(µ, σ) = f_φ(x),  z = µ + σ ⊙ ε,  ε ∼ N(0, I),

where f_φ denotes the encoder network with parameters φ. The decoder consists of three deconvolutional layers and one convolutional layer with batch normalization and ReLU activation. The decoder takes the latent variable z as input and generates both the reconstructed feature map y ∈ R^{w×h×c} and an RGB image of size w × h × 3 using another convolution layer. Table 1 shows the details of our structure-attention network.
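A minimal PyTorch sketch of such a VAE sub-network is given below. The layer widths, latent dimension Z, and the 32 × 32 spatial size are illustrative assumptions for brevity (the paper's actual input is a 107 × 107 × 32 feature map, and the exact configuration is listed in its Table 1); the sketch only mirrors the described layout: conv/BN/ReLU encoder with fully-connected heads for (µ, log σ²), and a deconvolutional decoder with two output heads for the feature map and the RGB image.

```python
import torch
import torch.nn as nn

C, H, W, Z = 32, 32, 32, 64   # channels, spatial size, latent dim (illustrative)

class StructureAttentionVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: three conv layers with BN + ReLU, then fully-connected layers.
        self.enc = nn.Sequential(
            nn.Conv2d(C, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
        )
        flat = 256 * (H // 8) * (W // 8)
        self.fc = nn.Sequential(nn.Linear(flat, 512), nn.ReLU(),
                                nn.Linear(512, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, Z)       # heads producing (mu, log sigma^2)
        self.fc_logvar = nn.Linear(256, Z)
        # Decoder: three deconv layers, plus one conv head per output.
        self.fc_dec = nn.Linear(Z, flat)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.to_feat = nn.Conv2d(64, C, 3, padding=1)  # reconstructed feature map y
        self.to_rgb = nn.Conv2d(64, 3, 3, padding=1)   # reconstructed RGB image

    def forward(self, x):
        h = self.fc(self.enc(x).flatten(1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        d = self.dec(self.fc_dec(z).view(-1, 256, H // 8, W // 8))
        return self.to_feat(d), self.to_rgb(d), mu, logvar
```

The two decoder heads correspond to the two reconstruction targets described above: the w × h × c feature map and the w × h × 3 RGB image.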

Pre-Training
In the pre-training step, the structure-attention network is pre-trained for the following two purposes: (i) capturing robust features even under the boundary effect, and (ii) generating diverse, representative target features. To this end, we use dual-structure noises as denoising criteria, as shown in Figure 3. The proposed dual-structure noises are based on the properties of the boundary effect, the intrinsic problem of the correlation filter. The channel-wise noise consists of randomly selected channels multiplied by the inverse-cosine window. In the online tracking process, correlation filters can suppress background information by using the cosine window, and can thereby accurately learn the target appearance. However, the center-weighted mechanism of the cosine window aggravates the boundary effect of the correlation filter and makes it learn unnecessary features. In this context, the structure-attention network is trained using an inverse-cosine window as a denoising criterion, to capture robust features regardless of the center-weighted mechanism of the cosine window.
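The channel-wise noise can be sketched as follows. We interpret the "inverse-cosine window" as one minus the 2-D Hann (cosine) window, which up-weights the boundary region; this interpretation, the corruption ratio, and the function names are our assumptions.

```python
import numpy as np

def cosine_window(h, w):
    """2-D cosine (Hann) window, as used by DCF trackers to suppress boundaries."""
    return np.outer(np.hanning(h), np.hanning(w))

def channel_wise_noise(feat, ratio=0.5, rng=np.random.default_rng(0)):
    """Corrupt randomly selected channels with the inverse-cosine window.

    feat : feature map of shape (h, w, c)
    ratio: fraction of channels to corrupt (hypothetical default)
    """
    h, w, c = feat.shape
    inv_win = 1.0 - cosine_window(h, w)      # emphasizes the boundary region
    corrupted = feat.copy()
    chans = rng.choice(c, size=int(c * ratio), replace=False)
    for ch in chans:
        corrupted[:, :, ch] *= inv_win
    return corrupted
```

Training the network to reconstruct the clean map from such corrupted inputs counters the center-weighted bias of the cosine window.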
The random shift noise consists of randomly shifted rows and columns of feature vectors. Since the correlation filter is trained on training data shifted via the circulant matrix, the structural information of the target is distorted, and it is therefore necessary to capture features that remain robust under such shifts. Shifted feature vectors act like the shifted training data produced during the training of the correlation filter, and help the structure-attention network capture features that are robust to shifting.
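The random shift noise amounts to a circular roll of rows and columns, mimicking the cyclically shifted samples of the circulant matrix. A minimal sketch (the shift bound and names are hypothetical):

```python
import numpy as np

def random_shift_noise(feat, max_shift=None, rng=np.random.default_rng(0)):
    """Circularly shift rows and columns of a feature map of shape (h, w, c),
    mimicking the cyclically shifted training samples of the correlation filter."""
    h, w, _ = feat.shape
    if max_shift is None:
        max_shift = min(h, w) // 4       # hypothetical bound on the shift
    dy = rng.integers(-max_shift, max_shift + 1)
    dx = rng.integers(-max_shift, max_shift + 1)
    return np.roll(feat, shift=(dy, dx), axis=(0, 1))
```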
To preserve the details of target structures, we propose a novel reconstruction loss function. The upper path of the Siamese network takes RGB images {m_i}_{i=1}^{K} as input and produces the convolutional feature maps {x_i}_{i=1}^{K} as output. The VAE takes the feature maps with a batch size K as input. Let {x̃_i}_{i=1}^{K} denote the feature maps corrupted by the two noise structures. Different from the standard VAE, not only the latent variable z but also the input feature maps are corrupted, so the variational denoising reconstruction loss L_R can be formulated as [26]:

L_R = Σ_{i=1}^{K} E_{q_φ(z|x̃_i)}[log p_θ(x_i|z)],  (3)

where p_θ(x_i|z) represents the distribution generating the clean data given the latent variable inferred from the corrupted convolutional feature maps. However, for the VAE to reconstruct robust feature maps from the corrupted ones while also preserving detailed structural information of the target, the original target information should be reflected in the VAE. Hence, we reformulate our reconstruction loss by adding an image reconstruction term:

L_R = Σ_{i=1}^{K} E_{q_φ(z|x̃_i)}[log p_θ(x_i|z) + log p_θ(m_i|z)],  (4)

where p_θ(x_i|z) and p_θ(m_i|z) respectively represent the distributions generating the feature maps and the images given the latent variable. Different from the conventional denoising VAE criterion, we added an image reconstruction term in Equation (4). The second term makes the latent variable reflect the image structure, and allows the reconstructed feature maps to preserve details of the target structure. To approximate the true posterior more stably, we also add the regularization (KLD) term. As a result, the objective function of our structure-attention network can be formulated as:

L = Σ_{i=1}^{K} [ (1/2) Σ_{j=1}^{J} (1 + log σ_j² − µ_j² − σ_j²) + E_{q_φ(z|x̃_i)}[log p_θ(x_i|z) + log p_θ(m_i|z)] ],  (5)

where J is the dimension of the latent variable z, and µ and σ are the outputs of the encoder network with variational parameters φ that takes the corrupted feature maps x̃_i as input. Figure 4 shows the reconstructed feature maps through the pre-trained structure-attention network.
The reconstructed feature maps draw attention to representative and robust structural features of the target. Figure 5 also shows tracking results using the proposed structure-attention network. We use the intersection-over-union (IoU) together with the peak-versus-noise ratio (PNR), introduced in [27], to reveal the distribution of the correlation response map and to analyze the impact of our attention map.
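The full pre-training objective — the KLD regularizer plus the feature- and image-reconstruction terms — can be sketched per sample as follows. We approximate the negative log-likelihoods with mean-squared errors, which is our assumption for this sketch (Gaussian likelihoods up to constants); function and variable names are hypothetical.

```python
import numpy as np

def structure_attention_loss(x, x_recon, m, m_recon, mu, log_var):
    """Pre-training objective of the structure-attention network (sketch).

    x, x_recon : clean feature map and its reconstruction from the corrupted input
    m, m_recon : RGB patch and its reconstruction
    mu, log_var: encoder outputs for the *corrupted* feature map
    Negative log-likelihoods are approximated by MSE (our assumption).
    """
    # Analytic KLD between the approximate posterior and the unit-Gaussian prior
    kld = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    feat_recon = np.mean((x - x_recon) ** 2)   # denoising feature-reconstruction term
    img_recon = np.mean((m - m_recon) ** 2)    # added image-reconstruction term
    return kld + feat_recon + img_recon
```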

Online Tracking
Since the target appearance changes from frame to frame, online training is required for adaptive tracking. The standard discriminative correlation filter-based tracking method can be formulated as a ridge-regression problem:

min_w Σ_i ‖ w ⋆ x_i − y ‖² + λ ‖w‖²,  (6)

where x_i represents a set of feature maps of the training samples, y is the desired output, and λ is a regularization parameter. The solution for the desired correlation filter w can be obtained as:

ŵ = (x̂* ⊙ ŷ) / (x̂* ⊙ x̂ + λ),  (7)

where ∧ denotes the Fourier domain, * represents the complex conjugate, and ⊙ denotes the Hadamard product. Since the training samples form a circulant matrix, the computational load can be reduced. In order to prevent distortion of the structural information of the target due to the boundary effect during online training of the correlation filter, we also train on the feature maps obtained through the structure-attention network. From Equation (7), we can reformulate the correlation filter in the online process as:

ŵ = ((x̂ + ŝ)* ⊙ ŷ) / ((x̂ + ŝ)* ⊙ (x̂ + ŝ) + λ),  (8)

where x represents a feature map from the Siamese network, and s a structure feature map from the structure-attention network. The correlation filtering process in the t-th frame can be simplified as:

Â_t = (1 − η) Â_{t−1} + η ŷ ⊙ (x̂_t + ŝ_t)*,
B̂_t = (1 − η) B̂_{t−1} + η (x̂_t + ŝ_t)* ⊙ (x̂_t + ŝ_t),
ŵ_t = Â_t / (B̂_t + λ),  (9)

where t represents the frame index, η an online learning rate, and A and B respectively the cross- and auto-correlations computed on features augmented with the structure-attention feature map.
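The closed-form training, linear online update, and response computation above can be sketched in a few lines of numpy. This is a single-channel MOSSE-style illustration under our own naming and parameter assumptions (Gaussian label shape, regularization value), not the paper's implementation:

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired correlation output y: a Gaussian peak, rolled so the peak is at (0, 0)."""
    ys, xs = np.mgrid[0:h, 0:w]
    y = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma**2))
    return np.roll(y, (-(h // 2), -(w // 2)), axis=(0, 1))

def train_filter(x, s, y):
    """Cross- (A) and auto-correlation (B) terms of the closed-form DCF solution,
    with the structure-attention map s added to the feature map x."""
    xf = np.fft.fft2(x + s)
    yf = np.fft.fft2(y)
    return np.conj(xf) * yf, np.conj(xf) * xf

def update_filter(A, B, A_new, B_new, eta=0.01):
    """Linear online update of numerator/denominator with learning rate eta."""
    return (1 - eta) * A + eta * A_new, (1 - eta) * B + eta * B_new

def respond(A, B, z, s, lam=1e-4):
    """Correlation response of the filter A/(B + lam) on a search feature map z."""
    zf = np.fft.fft2(z + s)
    return np.real(np.fft.ifft2(A / (B + lam) * zf))
```

During tracking, `train_filter` is evaluated on each new frame and folded into the running numerator and denominator via `update_filter`; the target is located at the peak of `respond`.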

Experimental Results
In this section, we introduce the details of our method and evaluate our tracking algorithm on various benchmark datasets: OTB2013 [28], OTB2015 [29], and Temple-Color-128 [30]. In particular, we evaluate the effectiveness of our structure-attention network through multiple ablation studies and detailed evaluations on various sequences. In addition, all experimental results can be found at [31].

Implementation Details
The Siamese network receives a 107 × 107 × 3 image as input. The structure-attention network receives a 107 × 107 × 32 feature map as input and generates an output of the same size. In the pre-training phase, we used the Caltech-256 dataset [32] with a batch size of 64 for 50 epochs, using the Adam optimizer with a learning rate of 0.001. We implemented our algorithm in Python using the PyTorch library. In the online tracking phase, we set the regularization parameter λ and the online learning rate η to 0.0001 and 0.01, respectively. The proposed algorithm runs on a PC with an Intel Core i7 3.4 GHz CPU (Santa Clara, CA, USA), 32 GB RAM, and a GeForce GTX 1080 Ti GPU (Santa Clara, CA, USA). In our settings, the average speed is 89 FPS.

Evaluation Methodology
We compare the performance of our tracking method with twelve state-of-the-art trackers: SiamDCF [17], DSST [3], ACFN [5], SRDCF [21], SRDCFdecon [33], MEEM [34], Struck [35], SiamFC [15], CFNet [16], ADNet-fast [36], CNN-SVM [24], and TRACA [37]. We follow the evaluation protocol introduced in the standard benchmark [28]. The performance of trackers is evaluated using one-pass evaluation (OPE) with precision and success plots. Precision plots measure the percentage of frames in which the distance between the estimated location and the ground truth is under a threshold. Success plots measure the overlap ratio between estimated bounding boxes and the ground truth. We set the distance threshold to 20 pixels in precision plots and use the area under the curve (AUC) in success plots.
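The OPE metrics described above can be sketched as follows (a minimal numpy version; the variable names and the 21-point threshold sampling of the success curve are our assumptions about the standard OTB protocol):

```python
import numpy as np

def center_error(b1, b2):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    c1 = np.array([b1[0] + b1[2] / 2, b1[1] + b1[3] / 2])
    c2 = np.array([b2[0] + b2[2] / 2, b2[1] + b2[3] / 2])
    return np.linalg.norm(c1 - c2)

def iou(b1, b2):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (b1[2] * b1[3] + b2[2] * b2[3] - inter)

def precision_at(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within the threshold (20 px)."""
    errs = [center_error(p, g) for p, g in zip(preds, gts)]
    return float(np.mean(np.array(errs) <= threshold))

def success_auc(preds, gts):
    """Area under the success curve: mean success rate over IoU thresholds."""
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([np.mean(ious >= t) for t in thresholds]))
```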

Evaluation on OTB2013
We evaluate our tracking method on 50 video sequences using one-pass evaluation with distance precision and overlap success ratio. Figure 6 shows both precision and success rate on the 50 video sequences. The proposed tracker achieves state-of-the-art performance: our method performs the best in both precision and success rate, and holds a large margin in success rate over TRACA [37]. In particular, through our structure-attention network, our tracker outperforms the baseline Siamese tracker SiamDCF by a large margin.

Evaluation on OTB2015
We evaluate our algorithm on the OTB2015 [29] dataset, which contains more, and harder, videos than OTB2013 [28]; it includes 100 fully annotated video sequences. Figure 7 shows the overall results on the OTB2015 dataset. Our method achieves the best result in both precision and success rate. In particular, compared with TRACA [37], which ranked second by a small margin on the OTB2013 dataset, our method wins by a large margin in both precision and success rate. This illustrates that our method is more robust and accurate on challenging video sequences. In addition, the large margin between the proposed tracker and our baseline Siamese tracker SiamDCF [17] demonstrates that our structure-attention network can capture robust structural features of the target and can train the discriminative correlation filter adaptively even under the boundary effect. Table 2 and Figure 8 show the precision scores for 11 video attributes on the OTB2015 dataset. The proposed method achieves the best performance in seven attributes. In addition, Table 3 and Figure 9 show the success rate scores for the 11 video attributes. Our tracker achieves the best performance in eight attributes and the second-best score in Low Resolution. This clearly shows the effectiveness of our structure-attention network. Figure 10 also shows that the proposed method outperforms other trackers in success rate versus speed.

Evaluation on TempleColor-128
We compare our tracker on the TempleColor-128 [30] dataset, containing 128 video sequences, using one-pass evaluation. Figure 11 shows both distance precision and overlap success rate over all video sequences. Our tracker ranks second by a small margin in distance precision. However, while SRDCFdecon [33] has an average speed of 1 fps, our tracker runs in real time at 89 fps. Moreover, our tracker ties with SRDCFdecon for the best result in overlap success. In particular, our method outperforms our baseline tracker SiamDCF [17] by a large margin.

Ablation Study
To analyze the impact of the proposed method, we perform several ablation studies on the OTB2013, OTB2015, and TColor128 datasets. We implement six variants of our tracker: (i) Baseline is SiamDCF [17], our baseline Siamese tracking network; (ii) Ours-shift trains our structure-attention network using only shift noise; (iii) Ours-channel trains our structure-attention network using only channel-wise noise; (iv) Ours-VAE uses the standard variational auto-encoder (VAE) as the structure-attention network; (v) Ours-oneloss trains the structure-attention network using only the denoising reconstruction loss; and (vi) Ours is our complete model using both structure noises and the proposed reconstruction loss. Figure 12 shows the results on all datasets. Compared with SiamDCF, our complete model shows the best performance in both precision and success rate.

Qualitative Evaluation
We perform a qualitative evaluation of our method with five existing trackers: SiamDCF, SiamFC, CFNet, TRACA, and SRDCF. Figure 13 shows several frames from five challenging sequences of the OTB2015 dataset (Bird1, Ironman, Matrix, Shaking, and Skiing). In the Bird1 and Ironman sequences, which are among the most challenging on the OTB2015 dataset, our tracker robustly follows the target from the start to the end frame even under heavy occlusion and deformation. In Matrix and Skiing, while the compared trackers struggle with heavy scale variation and the small size of the target, our method accurately estimates the target scale. In particular, compared with our baseline tracker SiamDCF, our method shows a significant improvement in qualitative results. This demonstrates the effectiveness of our structure-attention network, which can robustly train the correlation filter even under the boundary effect. Figure 13. Qualitative comparison of our tracker with five trackers on the OTB2015 dataset (from top to bottom: Bird1, Ironman, Matrix, Shaking, and Skiing). Our tracker achieves the best visual results among the compared trackers in several challenging sequences.

Conclusions
In this paper, we presented a novel real-time tracking method based on the discriminative correlation filter with the proposed structure-attention network. To capture structural features that remain robust under the boundary effect of the correlation filter, our structure-attention network is trained with a novel reconstruction loss and dual structure noises. Using the structure-attention network, the correlation filter can learn representative and robust structural features. Extensive experiments on benchmark datasets have shown the effectiveness of our method.