Improving Object Tracking by Added Noise and Channel Attention

CNN-based trackers, especially those based on Siamese networks, have recently attracted considerable attention because of their relatively good performance and low computational cost. For many Siamese trackers, learning a generic object model from a large-scale dataset remains a challenging task. In the current study, we introduce input noise as regularization in the training data to improve the generalization of the learned model. We propose an Input-Regularized Channel Attentional Siamese (IRCA-Siam) tracker which exhibits improved generalization compared to current state-of-the-art trackers. In particular, we exploit offline learning by introducing additive noise for input data augmentation to mitigate the overfitting problem. We propose feature fusion from noisy and clean input channels, which improves target localization. Channel attention integrated with our framework helps to find more useful target features, resulting in further performance improvement. The proposed IRCA-Siam enhances the discrimination of the target from the background and improves fault tolerance and generalization. An extensive experimental evaluation on six benchmark datasets, including OTB2013, OTB2015, TC128, UAV123, VOT2016 and VOT2017, demonstrates the superior performance of the proposed IRCA-Siam tracker compared to 30 existing state-of-the-art trackers.


Introduction
Visual Object Tracking (VOT) is a promising and fundamental research area in computer vision, with applications including robotics [1], video understanding [2], video surveillance [3] and autonomous driving [4]. Given the initial state of a target object (generally specified by a bounding box) in a video, the aim of an object tracker is to estimate the spatial trajectory of the target in the upcoming frames. Despite significant progress made in the field of VOT, it remains a challenging problem owing to diverse real-world challenges such as scale variations, occlusion, background clutter, fast motion, and illumination variations.
Deep trackers benefit from pretrained deep neural networks and have shown outstanding performance [5][6][7][8][9][10]. These trackers use off-the-shelf pretrained models as backbone feature extractors; the resulting deep features provide better discrimination. The pretrained models, such as VGGNet and AlexNet, are trained on ImageNet [11] for image classification tasks. Many computer vision sub-fields employ pretrained models to benefit from transfer learning [12,13]. However, it can be observed that during tracking, these models may not fully adapt to the specific target features, and online learning may lead to overfitting [14]. Recently, deep Siamese-based trackers [10,[15][16][17] have become popular since they achieve good performance with relatively low computational cost. The main contributions of this work are as follows:

• We propose additive noise as input regularization to improve deep network generalization.
• An early feature fusion mechanism is proposed to learn a better target feature representation.
• An adaptive channel attention mechanism, integrated using a skip connection, gives more weight to the important channels compared to the less important ones.
• The robustness of the proposed tracker is evaluated on six benchmark datasets. Our experiments demonstrate better performance of the proposed tracker compared to 30 state-of-the-art methods.

Related Work
In this section, we review different deep learning methods using additive noise. We also discuss closely related tracking approaches, including deep feature-based trackers, Siamese-based trackers, and attention-based trackers. A detailed study may be found in recent surveys [6,9,32,33].

Deep Learning with Noise
Deep Neural Network (DNN) models have gained significant importance due to their improved performance on various computer vision problems such as image classification, semantic segmentation, and action recognition. However, with limited training data, networks are prone to overfitting. Dropout is an often-used method to handle the overfitting issue by randomly dropping out values in the hidden units of the network model [18]. However, it is still unclear how to select the best dropout rate and how to maximize the benefit of optimization while preventing the model from overfitting [19]. An increased dropout rate may cause information loss, especially when the target size is small, while a decreased dropout rate may not be able to avoid overfitting. Instead of using dropout, many researchers have used additive noise to handle the overfitting problem [19,34,35]. Noh et al. [19] used additive noise, in the form of a marginalized-noise regularizer, instead of the dropout approach. Bishop et al. [34] showed that the effect of additive noise is similar to Tikhonov regularization. Liu et al. [36] used a noise layer to protect their network from adversarial attacks. Fiaz et al. [6] studied the performance of trackers on noisy inputs during tracking. In contrast, we propose additive noise as input regularization to reduce the generalization error in the visual object tracking domain. The proposed regularization improves tracking performance during inference. We also verified the performance of our framework by inserting a noise layer before each convolutional layer. Experimental results showed that inserting a noise layer before each convolutional layer reduces tracking performance compared to adding noise to the input data.

Deep Feature-Based Trackers
Recently, deep learning approaches have boosted tracking performance due to their inherent representational power. However, employing deep learning in visual tracking has several limitations. For example, deep learning requires more computational resources and has a higher time complexity, and the ground truth for the reference target object is provided only in the first frame of the video. To benefit from deep learning with limited available training data, deep features have been combined with correlation filter tracking to boost performance. For instance, DeepSRDCF [5], CF2 [8], and FCNT [37] leverage deep learning by extracting deep features at multiple layers from pretrained models such as VGG [38] or AlexNet [39]. Deep features from different layers have been exploited to improve both the accuracy and robustness of visual tracking [7,23,40,41]. Bhat et al. [41] revealed that pretrained models do not always yield a performance boost, due to incompatible resolutions, unseen target objects, and increasing feature dimensions. On the other hand, deep networks can also be used as classification or regression networks for visual tracking [22,42,43]. CNN-SVM [44] employs a CNN model and performs the classification task using an SVM with a saliency map. The TSN tracker [45] used a CNN to encode temporal and spatial information for classification. MDNet [21] is a multi-domain online deep tracker that performs tracking as a classification task and captures domain-dependent information during online tracking within a particle filter framework.
An online model update is performed to adapt to different appearance variations of the target, but the tracker may lose the target under scenarios such as occlusion, deformation, or background clutter. Online learning also requires extra computational cost to update the model parameters. Although CNN-based models have fewer parameters than RNN-based models, frequent model updates incur extra computational cost; therefore, such trackers may have limited real-world applications.

Siamese Network-Based Trackers
A Siamese network comprises two parallel Convolutional Neural Network (CNN) streams that learn the similarity between input images in an embedded space and fuse them to produce an output [46]. Owing to their inherent characteristics such as accuracy and speed, Siamese networks are popular in the visual tracking community [10,[15][16][17]47]. SiameseFC [15] extracts input image features using an embedded CNN model and fuses them using a correlation layer to generate a response map. CFNet [10] is an improved version of SiameseFC that integrates a correlation filter as a differentiable layer within the template branch. GOTURN [16], on the other hand, uses a Siamese network as a feature extractor and fully connected layers to fuse the embedded features; it performs regression between two consecutive frames.
SINT [17] formulates tracking as a verification task to learn the similarity between inputs. These approaches have gained much attention due to their performance, but overfitting might occur if they are trained on small datasets. The proposed tracking algorithm enhances the discriminative ability of the Siamese tracking framework by exploring data augmentation using additive noise.

Attention Mechanism-Based Trackers
Recently, attention mechanisms have become popular owing to their improved learning capabilities. CSRDCF [48] constructs a spatial reliability map to impose constraints on correlation filters within a correlation tracking framework. AFS-Siam [49] selects discriminative kernels from different convolutional layers. Choi et al. [24] proposed ACFN, which uses spatial attention to select a subset of correlation filters for visual object tracking. RTT [50] used multi-directional recurrent filters to learn the target object appearance. A channel attention mechanism enables a tracker to learn the most critical information needed to adapt to the target appearance. However, attention mechanisms within convolutional layers have not been fully exploited. On the basis of these considerations, we introduce a channel attention mechanism to highlight the importance of discriminative features. Our technique achieves high performance by learning efficient discriminative features offline.

The Proposed Input-Regularized Channel Attentional Siamese (IRCA-Siam) Network
The overall framework of the proposed IRCA-Siam network is shown in Figure 1. Compared to previous deep trackers, IRCA-Siam exploits additive noise in the input data within a Siamese framework to mitigate the overfitting problem. We propose an early feature fusion mechanism for better target localization. We also integrate a channel attention mechanism within IRCA-Siam to highlight the more useful and discriminative features for improved tracking.

Fully Convolutional Siamese Network
The building block of the proposed framework is the SiameseFC tracker proposed by Bertinetto et al. [15]. SiameseFC formulates the tracking problem, learning a similarity map from embedded CNN models, as a cross-correlation problem within a Siamese network architecture. The embedded CNN model consists of two parallel branches, one representing the target and the other representing the search region. In visual tracking, the target template is provided in the first frame of the video as an exemplar z. The objective of SiameseFC is to find the most similar region within the search region x (larger in size than the template) in subsequent frames as:

f(z, x) = θ(z) * θ(x) + b, (1)

where * represents the cross-correlation, θ(·) denotes the embedded space, and b represents the offset of the similarity value. From Equation (1), we note that SiameseFC handles both feature representation and discriminative learning to produce a similarity map using a single function θ(·). Performing both tasks may lead to overfitting the model to the training data. We therefore propose noisy regularized feature fusion to overcome the challenges faced by SiameseFC and to improve the generalization capability of the tracker. We also highlight the importance of discriminative channel feature information.
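As an illustration, the cross-correlation in Equation (1) can be sketched in NumPy. The feature shapes and the naive sliding-window loop below are illustrative assumptions, not the paper's actual implementation (which uses a GPU correlation layer):

```python
import numpy as np

def similarity_map(theta_z, theta_x, b=0.0):
    """Slide the embedded template theta_z over the embedded search
    features theta_x and sum the element-wise products, i.e. the
    cross-correlation f(z, x) = theta(z) * theta(x) + b."""
    C, Hz, Wz = theta_z.shape
    _, Hx, Wx = theta_x.shape
    Ho, Wo = Hx - Hz + 1, Wx - Wz + 1
    resp = np.full((Ho, Wo), b, dtype=np.float64)
    for i in range(Ho):
        for j in range(Wo):
            resp[i, j] += np.sum(theta_z * theta_x[:, i:i + Hz, j:j + Wz])
    return resp
```

The response map peaks where the search-region features best match the template features, which is how the tracker localizes the target.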

Input Regularization and Feature Fusion
In the current study, a data augmentation mechanism is introduced for Siamese networks to overcome their limitations. Existing Siamese trackers suffer from low fidelity of the target representation. We propose input regularization during the training of Siamese trackers. Introducing noise into the input can be regarded as input regularization: it encourages the model to learn various aspects of the object and increases its robustness against noise during testing. The features from both branches are fused (as shown in Figure 1) such that the model can learn the target features under noise or disturbance, enhancing its accuracy in real-world noisy environments. It may be noted that during tracking, a target may be affected by noise, leading to performance degradation; the proposed feature fusion mechanism helps to overcome this limitation. We induce random Gaussian noise into the input patches to obtain noisy images with mean µ and standard deviation σ. A Gaussian noise map G = Rand_G(µ, σ²) is constructed and added to the input, where Rand_G(·) is a random number generator based on the Gaussian density function. In contrast to existing Siamese networks, the proposed model accepts four inputs, namely a target patch (z), a noisy target patch (G + z), a search patch (x), and a noisy search patch (G + x). Low-level features from the noisy and clean images are fused to encode the spatial target information for better localization.
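The construction of the four inputs can be sketched as follows; the [0, 1] value range and the clipping step are our own assumptions for the illustration:

```python
import numpy as np

def add_gaussian_noise(patch, mu=0.0, sigma=0.09, seed=None):
    """Sample a Gaussian noise map G ~ N(mu, sigma^2) with the same
    shape as the input patch and add it; pixel values are assumed to
    lie in [0, 1], so the result is clipped back into that range."""
    rng = np.random.default_rng(seed)
    g = rng.normal(mu, sigma, size=patch.shape)
    return np.clip(patch + g, 0.0, 1.0)

# The four network inputs: z, G + z, x, G + x
z = np.random.rand(127, 127, 3)   # clean template patch
x = np.random.rand(255, 255, 3)   # clean search patch
inputs = (z, add_gaussian_noise(z), x, add_gaussian_noise(x))
```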
In practice, we fuse features from the target patch and the noisy target patch as:

Z = B(z) ⊕ B(G + z), (2)

where B represents a convolutional block including a convolutional layer, a normalization layer, a rectifier layer, and a pooling layer. Similarly, features from the search and noisy search patches are fused as:

X = B(x) ⊕ B(G + x). (3)

The proposed framework is summarized as:

f(z, x) = [θ(Z) ⊕ ∆(θ(Z))] * θ(X) + b, (4)

where ∆(·) denotes the channel attention and ⊕ represents the element-wise addition operation.
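The early fusion of clean and noisy low-level features can be sketched as below. The toy block B here (a 3x3 mean filter, ReLU, and 2x2 max pooling on a single-channel patch) is a stand-in assumption; the real B is a learned convolutional block:

```python
import numpy as np

def block_B(patch):
    """Toy stand-in for block B (conv + norm + ReLU + pool): a 3x3
    mean filter, ReLU, then 2x2 max pooling. Illustrative only."""
    H, W = patch.shape
    f = np.empty((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            f[i, j] = patch[i:i + 3, j:j + 3].mean()  # 3x3 mean filter
    f = np.maximum(f, 0.0)                            # ReLU
    h, w = f.shape[0] // 2, f.shape[1] // 2           # 2x2 max pooling
    return f[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

def early_fuse(clean, noisy):
    """Early fusion of clean and noisy features, i.e.
    Z = B(z) ⊕ B(G + z) with ⊕ as element-wise addition."""
    return block_B(clean) + block_B(noisy)
```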
The channel attention network is explained in Section 3.3. During testing, we do not require a noisy template or a noisy search region; instead, we feed the same template and search region to the corresponding noisy-input branches.

Channel Attention Network
A convolutional feature channel can be considered equivalent to a specific type of visual pattern. SiameseFC treats the feature channels of both the exemplar and search branches equally, which leads to performance degradation. The proposed channel attention mechanism, in contrast, exploits the relationships among channels and assigns more weight to channels that contribute more to target discrimination and localization. The objective is to enhance the capacity of the model to capture target variations. We incorporate a channel attention mechanism in the template branch, as shown in Figure 1. There exist many channel attentional networks that calibrate channel information, such as SENet [51] and SA-Siam [52], which employ only global max-pooling and a multilayer perceptron. Choi et al. [24] proposed ACFN and used spatial attention to select a subset of correlation filters for visual object tracking. Our channel attention network, on the other hand, fuses the channel coefficients from global max-pooling and global average pooling and then forwards them to a convolutional layer. Global max-pooling exploits the finer and more distinctive target information, while global average pooling reflects the overall knowledge of the target.
The proposed channel attention mechanism is a lightweight network, as depicted in Figure 2. Its input is the output feature map θ(z) from the last convolutional layer. The network passes the input to Global Average Pooling (GAP) and Global Maximum Pooling (GMP) layers. The outputs of these layers are fused using an element-wise operation to form a Global Descriptor (GD). The GD is fed forward through a dimensionality reduction layer, a rectifier activation layer, and a dimensionality increasing layer, and then relayed to a Sigmoid activation layer to provide the final weights for the input features.
The input to the channel attention mechanism is represented as C = θ(Z) from Equation (4). The Global Descriptor (GD) is calculated using an element-wise operation (⊕) between the outputs of the GAP and GMP layers as:

GD = GAP(C) ⊕ GMP(C). (5)

The weights for the input features are computed as:

α = σ(fc_2(Relu(fc_1(GD)))), (6)

where fc_1 and fc_2 denote fully connected layers, Relu represents the rectifier layer, and σ is the Sigmoid function f(x) = 1 / (1 + e^(−x)). It is assumed that C has k feature channels such that C = [c_1, c_2, ..., c_k].
The attended feature for the k-th channel is then computed as:

ĉ_k = α_k · c_k, (7)

where α_k represents the k-th weight for channel c_k, so the final output of the channel attention is Ĉ = [ĉ_1, ĉ_2, ..., ĉ_k]. The output of the proposed channel attention is element-wise added to θ(Z) using a skip connection, as shown in the framework in Figure 1. The proposed channel attention is applied only in the template branch of our framework to exploit the target feature channels.
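The whole attention path (GAP/GMP fusion, dimensionality reduction and increase, Sigmoid weighting, and the skip connection) can be sketched in NumPy. The weight matrices W1 and W2 and the reduction ratio are our own illustrative assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(C, W1, W2):
    """Channel attention with a skip connection, as a numpy sketch.
    C:  (k, H, W) feature channels theta(Z)
    W1: (k // r, k) dimensionality-reduction weights (assumption)
    W2: (k, k // r) dimensionality-increase weights (assumption)"""
    gap = C.mean(axis=(1, 2))                 # global average pooling
    gmp = C.max(axis=(1, 2))                  # global max pooling
    gd = gap + gmp                            # global descriptor (element-wise ⊕)
    alpha = sigmoid(W2 @ np.maximum(W1 @ gd, 0.0))  # fc1 -> ReLU -> fc2 -> sigmoid
    attended = alpha[:, None, None] * C       # scale channel c_k by alpha_k
    return C + attended                       # skip connection back onto theta(Z)
```

Because each α_k lies in (0, 1), the attended map only re-weights channels, and the skip connection preserves the original features alongside the attended ones.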

Implementation Details
We train the proposed model on the GOT-10K dataset [53], which contains more than 10,000 video sequences. The proposed network accepts four input image patches. During offline training, the input size for the template and the noisy template is 127 × 127 × 3, while that for the search region and the noisy search region is 255 × 255 × 3. For the noisy images, µ is fixed at zero and σ is set to 0.09, which was obtained empirically and is discussed in Section 4.3. During data curation, we crop the input patches such that the target object resides at the center, as it reflects the most influential region for tracking performance. During training, we regularize the input using Gaussian additive noise so that the model is not distracted by noise at inference time. The model was trained offline end-to-end using stochastic gradient descent for 50 epochs. We set the momentum to 0.9 and the weight decay to 5 × 10⁻⁴, while the learning rate started at 10⁻² and was later decreased to 10⁻⁵. During training, we adopt the following loss function to update the model parameters:

L(y, g) = (1 / |δ|) Σ_{k∈δ} log(1 + exp(−y_k g_k)), (8)

where g represents the response map, y ∈ {+1, −1} denotes the ground-truth label, k indicates a position in the response, and δ indicates the set of positions in the search window on the score map. During testing, the input size for the template and the noisy template is 135 × 135 × 3, while that for the search region and the noisy search region is 263 × 263 × 3. During inference, the maximum location on the response map represents the newly estimated target location. To overcome the problem of scale variations, we constructed a pyramid over three scales (0.963, 1, 1.0375) based on the previously estimated location for the current frame and selected the best score for target scale estimation. The code was implemented in Python 3.7 and PyTorch 1.0.1, and all experiments were performed on one NVIDIA TITAN Xp GPU with an Intel i7 3.6 GHz CPU (PRIME Z370-A II) and 32 GB of memory.
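The mean logistic loss in Equation (8) reduces to a short function; this NumPy sketch operates on a flattened score map and its ±1 label map:

```python
import numpy as np

def siamese_logistic_loss(g, y):
    """Mean logistic loss over the score-map positions delta:
    L = (1 / |delta|) * sum_k log(1 + exp(-y_k * g_k)),
    with y_k in {+1, -1} and g_k the response at position k."""
    g = np.asarray(g, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    # log1p avoids precision loss for small exp(-y * g)
    return float(np.mean(np.log1p(np.exp(-y * g))))
```

A zero score gives log 2 per position, while scores with a large correct-sign margin drive the loss toward zero.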

Evaluation over OTB Datasets
We present precision and success plots for OTB2015. We compared IRCA-Siam with other state-of-the-art methods including TRACA, SRDCF, Staple, SiamTri, CFNet, SiamFC, UDT, and CNNSI. Figure 3 demonstrates that the proposed IRCA-Siam algorithm achieves better tracking performance than the other trackers. IRCA-Siam achieved 62.5% success and 83.5% precision, a gain of 3.9% and 6.3%, respectively, over the baseline SiamFC tracker. We compared our method with Siamese-based trackers including SiamTri, SiameseFC, CFNet, UDT, and CNNSI, as shown in Figure 3. These tracking approaches take two inputs, whereas our approach takes four; during training, we train our model such that it retains discriminative ability for better localization. Our method achieved 2.1% and 2.3%, 4.5% and 2.7%, and 5.3% and 4.7% superior performance in terms of precision and success, respectively, compared to the correlation filter-based trackers TRACA, SRDCF, and Staple, respectively. We also present the success scores for OTB2013 and OTB2015 in Table 1. The table also displays the average speed in Frames Per Second (FPS). It shows that MSN [64] and HASiam [61] achieved success scores above 63.0 on OTB2013; compared to these trackers, IRCA-Siam secured a superior success score of 65.3. We also observed that our algorithm surpassed the other methods on OTB2015. Furthermore, our algorithm performs tracking at 77 FPS and is therefore a real-time tracker. Although TRACA [59], SiamTri [55], Staple [58], SiamFC-lu [60], and SiameseFC [15] show higher tracking speeds than our algorithm, they are less successful on OTB2013 and OTB2015.

Challenge-Based Comparison
We present the evaluation of IRCA-Siam under various tracking challenges and compare it with other state-of-the-art methods including TRACA, SRDCF, Staple, SiamTri, CFNet, SiamFC, UDT, and CNNSI on OTB2015, in terms of success and precision, in Figures 4 and 5, respectively. IRCA-Siam showed the best success performance for the fast motion, motion blur, deformation, in-plane rotation, out-of-plane rotation, occlusion, illumination variation, and scale variation challenges. IRCA-Siam did not perform best on low-resolution videos and background clutter but ranked second with a minor difference, as shown in Figure 4. SiamTri and TRACA surpassed our method by less than 1.0% for low resolution and background clutter. Overall, however, our tracker performed best for most of the challenges in terms of success.
We present precision plots for the different challenges in Figure 5. Our algorithm showed better performance for eight challenges, including fast motion, scale variation, illumination variation, occlusion, deformation, motion blur, in-plane rotation, and low resolution. IRCA-Siam showed the second-best performance for out-of-view, low resolution, and background clutter. However, the difference between the top-ranked tracker and our method is less than 1.0%; as our approach ranked best for the remaining challenges, such a minor difference can be ignored. We notice that the other Siamese-based trackers are trained on raw images and do not perform well against the different challenges. In contrast, we train our model with regularized input such that it preserves its discriminative ability for better localization under noise at test time. This helped our method perform better for most of the challenges in terms of both success and precision, as shown in Figures 4 and 5, respectively.

Qualitative Analysis
We performed a qualitative analysis of the proposed method on the CarScale, FaceOcc1, Skiing, and Jogging-1 sequences, as shown in Figure 6. In the CarScale sequence, IRCA-Siam performed better than the others, as its bounding box encloses most of the vehicle region while the others cover less. Almost all the trackers handled the FaceOcc1 sequence successfully. However, only IRCA-Siam and TRACA succeeded in tracking the skier in the Skiing sequence. The proposed method also handled the occlusion in the Jogging-1 sequence efficiently.

Ablation Study
In this section, we investigate the effect of additive input noise and of noise layers placed before the convolutional layers during training. During testing, we provide neither input noise nor noise layers. We performed different experiments for SiameseFC and the proposed IR-Siam method, as shown in Figure 7. We also evaluated the performance of the proposed channel attention combined with additive noise, named IRCA-Siam, as shown in Figure 1. We performed the ablation study on the OTB2015 dataset and report the performance in terms of precision and success.
In our framework, noise is added to the inputs as regularization instead of using the dropout approach. Liu et al. [36] used a noise layer to protect their network from adversarial attacks; therefore, we also used noise layers before the convolutional layers to verify whether they improve the generalization error within the convolutional model θ. We illustrate both additive noise as input regularization and noise layers within the Siamese tracking framework in Figure 7. Figure 7a shows the baseline SiameseFC tracking framework. We placed a noise layer before each convolutional layer to learn noisy gradients during backpropagation; Figure 7b presents SiameseFC with a noise layer before each convolutional layer. Similarly, Figure 7c,d represent the proposed framework without channel attention, with and without noise layers, respectively. In our ablation study, we performed different experiments to show the impact of adding noise layers within the Siamese framework. First, we evaluate the performance of additive input noise. In this study, we used Salt and Pepper (S&P) and Gaussian noise as input noise. For S&P noise, we used three different probabilities (0.09, 0.05, and 0.03); similarly, we used three different σ values (0.09, 0.05, and 0.03) with mean (µ) zero for the Gaussian input noise, as shown in Figure 8. We observe that SiameseFC performed better without the addition of noise. On the other hand, our IR-Siam (without channel attention) improved the tracking performance with the addition of Gaussian noise with σ = 0.09, achieving a precision of 81.9 and a success of 61.9. We also investigated the addition of noise layers within the network architecture, adding Gaussian noise layers before the convolutional layers as shown in Figure 7. We observe that the added noise layers degrade the performance of SiameseFC as well as of our IR-Siam tracker. From Table 6, we note that IR-Siam shows a tracking improvement when noise is added to the input.
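For completeness, the S&P corruption used in the ablation can be sketched as follows; splitting the corruption probability p evenly between salt and pepper, and assuming pixel values in [0, 1], are our own illustrative choices:

```python
import numpy as np

def salt_and_pepper(patch, p=0.09, seed=None):
    """Corrupt roughly a fraction p of the pixels: half are forced to
    0.0 (pepper) and half to 1.0 (salt); values assumed in [0, 1]."""
    rng = np.random.default_rng(seed)
    out = patch.copy()
    u = rng.random(patch.shape)
    out[u < p / 2] = 0.0                      # pepper
    out[(u >= p / 2) & (u < p)] = 1.0         # salt
    return out
```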
Moreover, we find that the added channel attention module further improves tracking performance. The proposed IRCA-Siam with channel attention achieved the best precision of 83.4 and success of 62.5. The improved performance of IRCA-Siam reflects the importance of the proposed channel attention network, as it efficiently highlights the important feature channels and reduces the significance of the irrelevant ones.

Conclusions
In this work, input-noise-based regularization is proposed to improve tracking generalization. In addition, early feature fusion of noisy and clean channels is also proposed for better target localization. In the same framework, channel attention has been proposed to select more informative target features to improve tracking performance. For input-noise regularization, Gaussian noise has been added to both the template and the search patches during the training. Feature fusion is performed at low-level layers to make the tracking process more robust to noise and to improve target localization. Channel attention has been used to highlight more descriptive features and to suppress the noisy features. The proposed tracker has shown superior performance compared to 18 Siamese trackers and 12 other existing trackers. The proposed tracker has shown promising performance for fast motion, motion blur, deformation, in-plane rotation, out-of-plane rotation, occlusion, illumination variations, and scale variation challenges.