Low-Rank Multi-Channel Features for Robust Visual Object Tracking

Abstract: Kernel correlation filters (KCF) demonstrate significant potential in visual object tracking when equipped with robust descriptors. A proper selection of color and texture features can provide robustness against appearance variations. However, the use of multiple descriptors leads to a considerably large feature dimension. In this paper, we propose a novel low-rank descriptor that provides better precision and success rates than state-of-the-art trackers. We accomplish this by concatenating the magnitude component of the Overlapped Multi-oriented Tri-scale Local Binary Pattern (OMTLBP), the Robustness-Driven Hybrid Descriptor (RDHD), Histogram of Oriented Gradients (HoG), and Color Naming (CN) features. We reduce the rank of the proposed multi-channel feature to diminish the computational complexity. We formulate a Support Vector Machine (SVM) model by utilizing the circulant matrix of the proposed feature vector in the kernel correlation filter. The use of the discrete Fourier transform in the iterative learning of the SVM reduces the computational complexity of the proposed visual tracking algorithm. Extensive experimental results on the Visual Tracker Benchmark dataset show better accuracy in comparison to other state-of-the-art trackers.


Introduction
Visual tracking is the process of spatio-temporal localization of a moving object in a camera scene. Object localization has many potential applications, including human activity recognition [1], vehicle navigation [2], surveillance and security [3], and human-machine interaction [4]. Researchers continue to develop robust trackers that reduce the computational cost of visual object tracking, and several reviews of robust tracking techniques have been published [5][6][7]. The results discussed in this paper show that tracking is affected by geometric and photometric variations in the object's appearance. Visual tracking should be robust against intrinsic variations (e.g., pose, shape deformation, and scale) and extrinsic variations (e.g., background clutter, occlusion, and illumination) [8,9]. Significant efforts have been made to extract invariant features through handcrafted and deep learning methods. Deep learning approaches achieve higher accuracy; however, they require massive training data, which are often unavailable in surveillance applications. Handcrafted methods, on the other hand, only require a careful selection of discriminative features of the object's appearance.
Tracking techniques reported in the literature can generally be classified into three groups: generative, discriminative, and filter-based trackers. Generative models [10] identify the target among many sampled candidate regions through a similarity function. The test instance with the highest similarity to the appearance model among the sampled candidate regions is declared the target location. The generative tracker therefore incurs a high computational overhead. Discriminative trackers [11] use target samples to learn a classifier that can differentiate the target from its background. A discriminative tracker largely depends on positive and negative samples to update the classifier during tracking. A large sample set can make the classifier more robust; however, such a set is unavailable due to time sensitivity. During the last decade, considerable research on correlation filter-based (CFB) trackers [12] has been performed. CFB frameworks brought various improvements to the visual tracking process. Correlation filters employ the circulant matrix and the fast Fourier transform (FFT), which allow an extensive sample set to be used to train the classifier. The weighted mask of the kernelized correlation filter (KCF) [13] makes the tracker more robust to variations in the visual scene. Despite this robustness, KCF requires continuous updates of the learned kernel as the target appearance changes, and such a model update mechanism is sensitive to occlusions. The performance of a correlation filter-based tracker depends on the quality of its features. Moreover, the large dimensionality of the feature vector is a barrier to real-time operation.
In this work, we identified a feature set that improves tracking performance even in the presence of intrinsic and extrinsic variations. We fuse multiple handcrafted feature channels to obtain a response map of the target object's position in each new frame of the video. The dimensionality of the fused feature vector is reduced by selecting ten high-entropy variables from the Principal Component Analysis (PCA) output. Instead of the Euclidean distance of the feature variables, Pearson's coefficient of skewness is used as the PCA input, as shown in Figure 1. Moreover, we use the circulant matrix along with the fast Fourier transform in the kernel correlation filter to reduce the computational complexity of our tracker. Extensive experiments performed over a benchmark dataset reveal that the proposed descriptor provides considerable improvement in precision and success rate. The benchmark dataset [8] is a useful tool to evaluate the performance of visual trackers. It provides the ground-truth position of the target in each frame and contains 100 sequences labeled with 11 different attributes; each frame sequence is manually annotated with multiple challenges. The overlap ratio between the estimated and ground-truth bounding boxes describes the accuracy of the trackers.
The remainder of the paper is organized as follows. Section 2 discusses related work from the literature. Section 3 details the proposed tracker methodology. Section 4 presents the results as precision and success-rate plots, bar plots, and a comparison table. Section 5 concludes the paper.

Related Work
Visual object tracking has been extensively studied and discussed in [14,15]. Visual trackers can be grouped into single- vs. multi-object trackers, context-aware vs. context-unaware trackers, and generative vs. discriminative trackers. Single-object trackers [16] can only track one target at a time, while multi-object trackers [17] can monitor more than one object at the same instant. The discriminative models in [18,19] employ handcrafted features to train an ensemble classifier. In [20,21], tracking by detection through deep learning is studied. To deal with appearance variability, discriminative methods update their classifier at each candidate location, which results in a massive computational cost. In [22,23], sparse representation and metric learning [10] are used to build generative model-based trackers. The generative model is updated at each candidate location to avoid tracking drift. Similar to the discriminative model, the generative model also suffers from a huge computational load.
The Sparsity-based Collaborative Model (SCM) [24] and the Adaptive Structural Local-sparse Appearance model (ASLA) [25] were proposed to deal with extreme variations in target appearance. Both SCM and ASLA suffer from significant scale drift whenever the target has rotating motions or fast scale changes. The MUlti-Store Tracker (MUSTer) [26] and Multiple Experts Using Entropy Minimization (MEEM) [27] employ ensemble-based methods to solve the drift problem in online tracking. However, MEEM fails whenever identical objects appear in the visual scene.
The correlation filter-based tracker [28] attempts to minimize the sum of squared errors between the actual and desired correlation responses. In [29,30], the use of the fast Fourier transform reduced the correlation cost. The work in [28] introduced a multi-channel feature map by employing a combination of color and texture descriptors. Kernelized correlation filters were developed in [31][32][33]. The synthetic training samples generated by the circular shift operation create a boundary effect, and the training of kernelized correlation filters is severely affected by such synthetic data. In [34,35], spatial regularization is applied over the correlation filter to remove the boundary effect. The algorithm developed in [36] overcomes the boundary effect by simultaneously learning the correlation filter and its desired response map. Discriminative Correlation Filter (DCF) [37] trackers employ Fourier transforms to efficiently learn a classifier on all patches of the target neighborhood. However, DCF-based trackers also suffer from the spatial boundary effect. The Spatially Regularized Discriminative Correlation Filters (SRDCF) [35] utilize spatial regularization to eliminate the boundary effect of the DCF. SRDCF fails when the target object is hollow at the center; the filter then treats background pixels as the target, which leads to drift. The Channel and Spatial Reliability Discriminative Correlation Filters (CSRDCF) [38] include a color histogram-based segmentation mask, which excludes the background pixels. Discriminative Scale Space Tracking (DSST) [39] enhances tracking speed, but its performance is inferior to CSRDCF. SRDCFdecon [40] uses adaptive decontamination, which learns the reliability of each training sample and eliminates the influence of contaminated ones.
The minimum barrier distance (MBD) [41] was developed to mitigate the impact of the background on tracker accuracy. MBD uses a dissimilarity value to weight the extracted feature at each target position. The MBD-based tracker can precisely locate the target on all attributes of the OTB database, but it fails under low resolution and background clutter.
Correlation filters utilize many different combinations of color and texture features extracted from the patches in the search window. The multi-channel HoG feature [42] integrated with color naming features [43] provides the basis for kernelized correlation filter-based tracking methods. Handcrafted features have produced excellent results on the existing benchmark datasets; however, they perform poorly when the object's appearance varies rapidly.
Color Histogram (CH)-based handcrafted features are robust to fast object motion, but they perform poorly in the presence of illumination variations and background clutter [44]. A robust descriptor can significantly improve the performance of visual tracking. In [45], the concentration of the feature map has shown favorable performance even in the presence of target state and color variations. Recently, SVM-based support correlation filters (SCFs) [43] have increased efficiency by utilizing the circulant matrix. Moreover, the multi-channel SCF (MSCF), kernelized SCF (KSCF), and scale-adaptive KSCF (SKSCF) further improve the accuracy and efficiency of the trackers.

Proposed Method
The proposed tracker incorporates discriminant color and texture features to achieve better precision and success rates. A novel fusion-and-reduction approach is employed to reduce the computational cost. The details of the proposed multi-channel feature and the dimensionality reduction are discussed below.

Multi-Channel Feature
The image target patch in Figure 1b is described using the multi-channel features of the proposed visual tracking approach. A combination of a total of 45 channels consisting of HoG, Color Naming, RDHD [46,47], and the magnitude component of the Overlapped Multi-oriented Tri-scale Local Binary Pattern (OMTLBP) [48] is used to describe the object with better accuracy. The Felzenszwalb HoG (FHoG) feature vector [42] is extracted from the patch of each input frame shown in Figure 1a. Each patch is divided into a grid, with each grid cell of size 4 × 4. A 32-dimensional feature vector represents each cell in the grid: nine orientations described through 27 variables are used in combination with four texture variables and one truncation variable. Let f_1 = {w | w ∈ R^32} be the FHoG feature vector for each grid cell. The color names (CN) feature [43] with vocabulary size 11 is extracted from each grid cell of the input patch. Let f_2 = {x | x ∈ R^11} be the feature vector representing color naming, shown in Figure 1c.
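As a rough illustration, the core of the per-cell gradient-orientation binning can be sketched as follows. This is our own simplified sketch, not Felzenszwalb's full implementation, which additionally includes contrast-sensitive bins, block normalization, the four texture-energy variables, and the truncation variable:

```python
import numpy as np

def cell_orientation_histograms(patch, cell=4, bins=9):
    """Bin gradient orientations of a grayscale patch into per-cell
    histograms (a simplified core of the FHoG descriptor)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    # unsigned orientation in [0, pi)
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    H, W = patch.shape
    hists = np.zeros((H // cell, W // cell, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(H // cell):
        for j in range(W // cell):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            hists[i, j] = np.bincount(bin_idx[sl].ravel(),
                                      weights=mag[sl].ravel(),
                                      minlength=bins)
    return hists
```

Every pixel contributes its gradient magnitude to exactly one orientation bin of its 4 × 4 cell, so the histogram mass equals the total gradient magnitude of the patch.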
The RDHD [47] enhances the discriminative capability of the proposed multi-channel feature. The extrema responses of both first- and second-order symmetrical Gaussian derivative filters are quantized to obtain the robust features. The symmetrical Gaussian derivative filters are applied to the target patch S, as presented in the following equations.
where S_x = G_x * S, S_y = G_y * S, S_xx = G_xx * S, S_yy = G_yy * S, and S_xy = G_xy * S, while G_xx, G_yy, and G_xy represent the second-order partial derivatives of the symmetrical Gaussian function with respect to x, y, and (x, y). The symmetrical Gaussian function is shown in Equation (4). The channel D of RDHD obtained through Equation (3) is used as a channel in the proposed descriptor.
The rotation-invariant magnitude component of the Overlapped Multi-oriented Tri-scale Local Binary Pattern (OMTLBP_M^{riu2}_{P,R}) [48] is also included in the multi-channel feature to increase the robustness of the visual object descriptor. A total of P = 8 sample points and radii R = {1, 2, 3} are used to extract OMTLBP_M^{riu2}_{P,R}, as shown in Equation (5). The function χ(u, v) is 1 when u is greater than v, and 0 otherwise. In Equations (5) and (6), µ_m is the mean value of the k-th magnitude component m^k_{r,c} extracted from the segment (r, c) of the input image. We let f_4 = {z | z ∈ R^1} represent the OMTLBP_M^{riu2}_{P,R} feature vector. The multi-channel feature R described in Equation (7) is obtained by the concatenation of the FHoG, CN, RDHD [47], and OMTLBP_M^{riu2}_{P,R} features.
where the symbol || represents the concatenation operation and R denotes the final multi-channel feature.
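A single-scale (R = 1, P = 8) sketch of the magnitude component is given below; it thresholds each neighbour's magnitude difference |g_p - g_c| against the global mean µ_m, and the simple bit count stands in for the full riu2 rotation-invariant uniform mapping and tri-scale overlap of [48]:

```python
import numpy as np

def magnitude_lbp_counts(img):
    """Simplified single-scale (R=1, P=8) magnitude LBP: threshold each
    neighbour magnitude difference |g_p - g_c| against the global mean
    magnitude mu_m, then count set bits (a rotation-invariant
    simplification of the riu2 mapping)."""
    img = img.astype(float)
    c = img[1:-1, 1:-1]                       # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    mags = np.stack([np.abs(img[1 + dr:img.shape[0] - 1 + dr,
                                1 + dc:img.shape[1] - 1 + dc] - c)
                     for dr, dc in offsets])
    mu = mags.mean()                          # global mean magnitude mu_m
    return (mags >= mu).sum(axis=0)           # bit count in [0, 8]
```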
The dimensionality reduction operation is performed over the multi-channel feature to reduce the complexity of the proposed tracker.
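Before the reduction stage, the concatenation in Equation (7) simply stacks the per-cell channels. Assuming one channel each for the RDHD channel D and the OMTLBP magnitude component, the counts add up to the stated 45 (32 + 11 + 1 + 1):

```python
import numpy as np

def fuse_channels(f1, f2, f3, f4):
    """Concatenate per-cell FHoG (32), CN (11), RDHD (1), and OMTLBP_M (1)
    channel maps into the fused multi-channel feature R of Equation (7)."""
    return np.concatenate([f1, f2, f3, f4], axis=-1)
```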

Fusion and Reduction
Recently, a fusion-and-reduction approach [49][50][51], shown in Figure 1d, has been developed to reduce the dimensionality of the feature vector. Robust variables are selected from the fused feature set to increase recognition accuracy and decrease computation time. The feature vector R in Equation (7) is the fused vector. After fusion, robust variables are selected based on the entropy value of their coefficient of skewness. Let r_i be the variables in the fused feature vector R.
where M represents the total number of grid cells in the frame and ψ denotes the Euclidean distance between the variables of the fused feature. The thresholding function h, defined below, selects the minimum-distance features, and R_m denotes the resulting minimum-distance feature vector used to compute the skewness values. The threshold parameter in Equation (10) is set to 0.4; through experimentation with various values, we found that 0.4 provides the highest joint maxima for both AUC and precision when validated on the OTB-50 and OTB-2013 datasets. The symbols σ² and x̃ denote the variance and median of R_m. Principal component analysis is then applied to γ to calculate the score of each feature. Finally, the entropy values associated with each variable are sorted in descending order, and the ten highest-scoring variables are selected as the feature vector. The resulting low-rank representation is denoted by t, as shown in Figure 1e.
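The reduction stage can be loosely sketched as follows. This is a hedged approximation of the pipeline above: the distance-thresholding step h is omitted, Pearson's second skewness coefficient weights each variable before PCA, and the entropy-based ranking is approximated by each variable's variance contribution to the principal components:

```python
import numpy as np

def pearson_skewness(x):
    """Pearson's second coefficient of skewness: 3*(mean - median)/std."""
    s = x.std()
    return 0.0 if s == 0 else 3.0 * (x.mean() - np.median(x)) / s

def reduce_features(R, k=10):
    """Hedged sketch of the reduction stage: skewness-weighted variables
    are fed to PCA (via SVD) and the k variables contributing most to the
    principal components are kept."""
    skew = np.apply_along_axis(pearson_skewness, 0, R)   # one score/variable
    X = (R - R.mean(0)) * skew                           # skewness-weighted
    U, s, Vt = np.linalg.svd(X, full_matrices=False)     # PCA via SVD
    scores = (Vt ** 2 * s[:, None] ** 2).sum(axis=0)     # variable energy
    order = np.argsort(-scores)                          # highest first
    return R[:, order[:k]]
```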
Let t ∈ R^{M×N×D} denote the extracted multi-channel feature map of the input template training patch. The size of each channel is M × N variables, with a total of D = 45 channels in the feature map. The label y ∈ R^{M×N} is a symmetrical Gaussian label with dimensions equal to the feature map and σ = 0.02 √(M × N). The proposed method includes this symmetrical Gaussian label to locate the position of the target in the subsequent frame of the video. The Gaussian label is symmetric, similar in shape to a Mexican hat [52,53], as shown in Figure 1f-g. Let t̂ = F(t) and ŷ = F(y), where F(·) represents the discrete Fourier transform of the multi-channel features.
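A minimal sketch of the Gaussian label construction, using the stated σ = 0.02 √(M × N) and centring the peak on the target position:

```python
import numpy as np

def gaussian_label(M, N):
    """Symmetric Gaussian regression label y of size M x N, peaked at the
    target centre, with sigma = 0.02 * sqrt(M * N)."""
    sigma = 0.02 * np.sqrt(M * N)
    m, n = np.meshgrid(np.arange(M) - M // 2, np.arange(N) - N // 2,
                       indexing="ij")
    return np.exp(-(m ** 2 + n ** 2) / (2 * sigma ** 2))

# y_hat = F(y), the label's DFT used in training:
# y_hat = np.fft.fft2(gaussian_label(50, 50))
```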
The training process identifies the function f(t) = w^T t that minimizes the least squares error of the ridge regression shown in Equation (8). The aim is to learn the support correlation filter w, with regularization parameter λ, that minimizes this least squares error. Equation (13) gives the closed form of the minimization.
where T is the circulant matrix of the multi-channel feature vector t, T^H represents the Hermitian transpose, i.e., T^H = (T^*)^T, and T^* is the complex conjugate of T. For the i-th input frame of the video, the position of the target object is the location of the maximum response value in R_i, described in Equation (15).

Dataset and Evaluated Trackers
The performance of the proposed multi-channel feature is evaluated on the Tracker Benchmark (TB) dataset [8]. The TB contains a total of 100 annotated video sequences with eleven different attributes. The dataset includes variations in illumination, scale, and resolution; the objects suffer from deformation, occlusion, and background clutter; and the sequences include in-plane and out-of-plane rotation, as well as blurriness due to fast motion and out-of-view objects. The TB is divided into three test suites, namely OTB-2013, OTB-50, and OTB-2015. The true bounding box of the target object is given for the first frame of each sequence, and the ground-truth position of the subject in every frame is provided with the dataset to calculate accuracy.

Evaluation Procedure
The one-pass evaluation (OPE) plots [8] are used to evaluate and compare the performance of the proposed descriptor with other methods. The OPE consists of the precision and success plots used for the evaluation of our proposed tracker.
The precision vs. location error threshold plot evaluates tracking precision. Precision is the fraction of estimated target positions that lie within a given threshold distance of the ground truth. The location error is the average Euclidean distance between the object's estimated position and the annotated ground-truth position; it describes the gap in pixels only and does not account for the size or scale of the target object. Thresholds ranging from 0 to 50 pixels are used to calculate the average precision of each sequence, and the distance precision (DP) is reported at a location error of 20 pixels.
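The distance precision metric just described can be sketched as:

```python
import numpy as np

def distance_precision(est_centers, gt_centers, threshold=20):
    """Fraction of frames whose centre location error (Euclidean distance
    between estimated and ground-truth centres) is within the threshold;
    DP is conventionally reported at 20 pixels."""
    err = np.linalg.norm(np.asarray(est_centers, float)
                         - np.asarray(gt_centers, float), axis=1)
    return float(np.mean(err <= threshold))
```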
The success rate vs. overlap threshold plot is also used to evaluate our tracker. The success rate is the percentage of frames with an overlap ratio γ greater than a given threshold τ, where γ is the overlap ratio between the estimated bounding box B_est and the ground-truth bounding box B_gth, i.e., γ = |B_est ∩ B_gth| / |B_est ∪ B_gth|. The threshold τ ranges between 0 and 1. Trapezoidal integration is used to calculate the area under the curve (AUC) of the success plot. The precision and success plots in Figure 2 compare the proposed tracker with other state-of-the-art trackers, and the DP and AUC results in Figure 3 summarize the comparison more conveniently.
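The overlap ratio and the success-plot AUC can be sketched as follows; the number of threshold samples is an illustrative choice:

```python
import numpy as np

def overlap_ratio(b_est, b_gth):
    """Intersection-over-union of two [x, y, w, h] bounding boxes."""
    x1 = max(b_est[0], b_gth[0]); y1 = max(b_est[1], b_gth[1])
    x2 = min(b_est[0] + b_est[2], b_gth[0] + b_gth[2])
    y2 = min(b_est[1] + b_est[3], b_gth[1] + b_gth[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b_est[2] * b_est[3] + b_gth[2] * b_gth[3] - inter
    return inter / union

def success_auc(overlaps, n_thresh=21):
    """Success rate over overlap thresholds tau in [0, 1], integrated
    with the trapezoidal rule to give the AUC of the success plot."""
    taus = np.linspace(0, 1, n_thresh)
    s = np.array([(np.asarray(overlaps) > t).mean() for t in taus])
    return float(np.sum(np.diff(taus) * (s[1:] + s[:-1]) / 2))
```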

Parameter Setting
The experimental results were obtained on a desktop computer with a dual-core 2.2 GHz Intel CPU, 8 GB RAM, and an Nvidia GTX 750 Ti GPU. The number of scales ψ and the scale factor γ of SKSCF are 21 and 1.04, respectively. The standard deviation σ of the kernel is 0.5 for SKSCF. The optimal settings for the lower and upper thresholds (θ_l, θ_u) are (0.3, 0.6). The cell size and number of orientations of the HoG feature are set to 4 and 9, respectively. The OMTLBP is defined for 8 sample points with scale values ranging from 1 to 3. Table 1 summarizes the results of the SKSCF method for different types of kernels on 50 challenging image sequences of the TB-100 dataset [8]. Table 1 shows that SKSCF with a Gaussian kernel outperforms the polynomial and linear kernels in mean DP and AUC; however, the linear kernel can process more frames per second (FPS) than the Gaussian and polynomial kernels. In this work, we selected the Gaussian kernel with different variants of correlation filters. Eleven visual attributes classify the annotated sequences of the benchmark dataset [8]: background clutter (BC), illumination variation (IV), scale variation (SV), deformation (DEF), fast motion (FM), in-plane rotation (IPR), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-view (OV), and out-of-plane rotation (OPR).

Overall Performance
The proposed tracker was evaluated on the three benchmark test suites of the OTB dataset. The OTB-2015 test suite is used to evaluate the effect of dimensionality reduction with the Gaussian kernel. Without dimensionality reduction, the proposed SKSCF gives a mean DP of 89.1% and an AUC of 66.1% at seven frames per second. With dimensionality reduction, it provides an average DP of 88.87% and an average AUC of 65.6% at eight frames per second. Thus, dimensionality reduction lowers the mean DP by 0.04% and the AUC by 0.05% at the gain of one frame per second in speed.
The precision and success rates in Tables 2 and 3 demonstrate that the proposed tracker leads other methods in both metrics. Table 2 compares the performance of our proposed tracker with other recently developed methods on the OTB-2013 test suite. Table 2 shows that our proposed tracker has better precision on all attributes except BC, SV, and IPR. The precision of the proposed tracker is 1.9%, 3.3%, 0.6%, 1.2%, 0.2%, 1.8%, 0.8%, and 1.5% higher than the other trackers for FM, MB, DEF, IV, LR, OCC, OPR, and OV, respectively. The success rate is lower only for MB, IPR, and SV; it is 0.5%, 0.1%, 0.7%, 1.6%, 3.9%, 0.6%, 0.5%, and 0.5% higher than that of other trackers for FM, BC, DEF, IV, LR, OCC, OPR, and OV, respectively. Table 3 presents a performance comparison on the OTB-2015 test suite. It shows that our proposed tracker has 4.5%, 2%, 1.1%, 2.8%, 2.6%, 1.5%, and 1.2% higher precision than the other trackers on the FM, MB, DEF, IV, LR, OCC, and OPR attributes. The precision of the proposed tracker is 0.6%, 3.3%, 1.3%, and 1.5% lower than the best trackers for BC, IPR, SV, and OV, respectively. The success rate of the proposed tracker is 0.6%, 0.4%, 0.1%, 0.3%, 0.1%, 2.2%, and 5.3% higher for BC, MB, DEF, IV, LR, OCC, and OV, respectively, and lower for FM, IPR, OPR, and SV. The bar plots in Figure 3 compare DP and AUC on all three test suites of the TB-100 dataset. Figure 3a-c show that the DP and AUC of our proposed tracker are higher than those of other recently developed trackers when tested on OTB-2013, OTB-2015, and OTB-50, respectively.
The OPE plots in Figure 2 show a comparison of precision and success rate with the recently reported trackers.

Conclusions
We propose a robust low-rank descriptor for the kernel support correlation filter. The proposed multi-channel feature-based tracker provides favorable results on several attributes of the OTB-2013, OTB-50, and OTB-2015 test suites of the database. The low rank of the feature is obtained by employing the novel fusion-and-reduction approach. The accuracy and speed are increased by employing the circulant data matrix in the training procedure of the support vector machine. The results show that the proposed tracker outperforms recently developed trackers in the cases of deformation, illumination variation, low resolution, occlusion, and out-of-view. The distance precision (DP) of the proposed tracker at a location error of 20 pixels is 2.1%, 2.9%, and 6.6% higher than the best-performing trackers on OTB-2013, OTB-50, and OTB-2015, respectively. The AUC is 4.2%, 3%, and 3.4% higher on OTB-2013, OTB-50, and OTB-2015, respectively.
Author Contributions: F. developed the algorithm of the project, evaluated the framework, and wrote the paper. M.R. helped with formatting and the correction of grammatical mistakes. M.J.K., Y.A., and H.T. supervised the research and improved the paper.