Real-Time Visual Tracking through Fusion Features

Owing to their high speed, correlation filters for object tracking have begun to receive increasing attention. Traditional object trackers based on correlation filters typically use a single type of feature. In this paper, we attempt to integrate multiple feature types to improve the performance, and we propose a new DD-HOG fusion feature that consists of discriminative descriptors (DDs) and histograms of oriented gradients (HOG). However, fusion features, as multi-vector descriptors, cannot be used directly in prior correlation filters. To overcome this difficulty, we propose a multi-vector correlation filter (MVCF) that can directly convolve with a multi-vector descriptor to obtain a single-channel response that indicates the location of an object. Experiments on the CVPR2013 tracking benchmark, evaluated against state-of-the-art trackers, show the effectiveness and speed of the proposed method. Moreover, we show that our MVCF tracker, which uses the DD-HOG descriptor, outperforms the structure-preserving object tracker (SPOT) in multi-object tracking because of its high speed and its ability to address heavy occlusion.


Introduction
Object tracking is an important computer vision task that has many practical applications, such as security and surveillance, motion analysis, augmented reality, traffic control and human-computer interaction. A real-time visual tracking system combines software and hardware design. A digital camera captures the video. For the output video to appear smooth to the human eye, a frame rate of at least 15 frames per second (FPS) is required. A two-axis turntable pivots the camera horizontally (yaw) and vertically (pitch). Object tracking is accomplished by software on the main control computer. An interface allows the user to select a target and to see what the camera is tracking. The system attempts to keep the object in the center of its field of view at all times. The target tracking algorithm plays a decisive role in this system.
Single object tracking is the most common task within the field of computer vision. Many methods for object tracking have been proposed. Adam et al. [1] presented a part-based algorithm called FragTrack, which models the object appearance based on multiple parts of the target. Grabner et al. [2] proposed an on-line boosting algorithm (OAB) to select features for tracking. In [3], Babenko et al. adopted Multiple Instance Learning (MIL), which puts all ambiguous positive and negative samples into bags to learn a discriminative model. Kalal et al. [4] proposed a novel tracking framework (TLD) that decomposes the task into three components: tracking, learning and detection. Struck [5] presents a framework for adaptive visual object tracking based on structured output prediction. Xu et al. [6] proposed the structural local sparse appearance (ASLA) model, which exploits both partial information and spatial information. In [7], a robust tracking framework based on locality sensitive histograms is proposed. Wang et al. [8] present a novel probability continuous outlier model (PCOM) to depict the continuous outliers that occur in the linear representation model. The approach in [9] formulates the spatio-temporal relationships between the object of interest and its local context based on a Bayesian framework. In [10], Oron et al. presented Extended Lucas-Kanade (ELK), which casts the original LK algorithm as a maximum likelihood optimization. These methods rely on intensity or texture information for the image description and include complex appearance models and optimization methods. When real-time processing is required, it is difficult for most of them to meet the demand of 25 frames per second on a standard PC without parallel computing [11].
Recently, correlation filters for object tracking began to receive more attention because of their impressively high speed. Several state-of-the-art methods using correlation filters have been proposed for a variety of applications, such as object detection and recognition and object tracking. Bolme et al. [12] propose a tracker that is based on the Minimum Output Sum of Squared Error (MOSSE) filter, which is robust to variations in lighting, scale, pose, and non-rigid deformations while operating at 669 frames per second. Henriques et al. [13] provide a link to Fourier analysis using the well-established theory of circulant matrices and devise kernel classifiers with the same characteristics as correlation filters. Their tracker is called the CSK tracker. Boddeti et al. [14] propose a vector correlation filter (VCF) with HOG features and demonstrate the efficacy and speed of the proposed approach on the challenging task of multi-view car alignment. Galoogahi et al. [15] propose an extension to canonical correlation filter theory that can efficiently handle multi-channel signals. In contrast to object tracking, color descriptors have been shown to obtain excellent results for object recognition and detection [16-20]. Most early color detectors use simple color representations for image description. The linguistic study of Berlin and Kay [21] on basic color terms is one of the most influential works in color naming. In [17], the authors show that the color names (CN) learned from real-world images outperform chip-based color names in real-world applications. Danelljan et al. [22] extend the CSK tracker [13] with color names (CN), which provides superior performance for visual tracking. Recently, Henriques et al. [23] derived a new Kernelized Correlation Filter (KCF), which is the journal version of CSK and works very well with HOG features.
Traditional object trackers based on correlation filters typically use a single type of feature. In this paper, we attempt to integrate multiple feature types to improve the performance. The fusion of multiple features leads to a significant increase in performance for object detection [16,18]. In [16], the authors extend the description of the local features with color information. A boosted CN-HOG detector is proposed in [18], where CN descriptors are combined with HOGs to incorporate texture information. These investigators show that their approach can significantly improve the detection performance on the challenging PASCAL VOC datasets. Shi et al. [24] specifically show that the correct features to use are exactly those that make the tracker work best. To obtain an effective and efficient tracking algorithm, we propose a new DD-HOG fusion feature that consists of a discriminative descriptor (DD) and histograms of oriented gradients (HOG). Khan et al. [20] show that their discriminative descriptor (DD) outperforms other pure color descriptors and the color name (CN) descriptor. The DD feature has been used for object tracking [25]. In addition, Dalal and Triggs [26] proposed histograms of oriented gradients (HOG), which are widely used for object detection. However, DD-HOG is a multi-vector descriptor and therefore cannot be used directly in prior correlation filters [12-16], which have traditionally been designed for scalar or single-vector feature representations only. In this paper, we propose a multi-vector correlation filter (MVCF) to resolve this problem. A multi-vector correlation filter, interpreted literally, is made up of multiple vector correlation filters. The vector correlation filter is composed of one correlation filter. The DD-HOG feature is correlated with our multi-vector correlation filter to obtain a single-channel response, and the peak of the response indicates the target center.
A similar process can be applied to other multi-vector features. The tracker based on the multi-vector correlation filter (MVCF) comprises four main components: (1) a scale adjustment that gives all of the elements of a multi-vector feature the same size; (2) a multi-vector structure with which a multi-vector descriptor can be convolved directly; (3) an update scheme in which the multiple object appearance models are updated separately (with all of the previous frames considered); and (4) a dimensionality reduction technique that reduces the dimension of each element of the multi-vector feature independently. A quantitative evaluation is conducted on the CVPR2013 tracking benchmark [11], a comprehensive dataset that is specially designed to facilitate the evaluation of performance. Extensive experiments demonstrate that the proposed tracker based on the MVCF can outperform state-of-the-art trackers. Tracking results in the CVPR2013 benchmark are shown in Figure 1.
Figure 1. Tracking results in the CVPR2013 benchmark. We employ all 36 color sequences for evaluation. Note that there are two targets for the jogging sequence. Only the top tracker is presented in the corresponding color rectangle. The proposed tracker achieves the best performance in 26 of the 36 sequences. MVCF with the DD-HOG feature shows a significant improvement over state-of-the-art approaches using correlation filters, such as CSK with raw pixels, CN with color names and KCF with HOG. The quantitative comparison of our tracker with 10 state-of-the-art methods is reported in terms of its precision at a threshold of 20 pixels. The experimental results show that our approach outperforms state-of-the-art tracking methods.
We also show that our MVCF tracker performs well in multi-object tracking. The goal of multi-object tracking is to estimate the states of multiple objects. In complex scenes, multi-object tracking remains a challenging problem for many reasons, including frequent occlusion by other objects, similar appearances of different objects, and the need for real-time processing. In this paper, we argue that our MVCF tracker handles partial occlusion well and runs at an impressively high speed. Therefore, the MVCF tracker appears to be a good choice for multi-object tracking. Our experimental evaluations show that the MVCF tracker performs very well on the videos that are used in [27] for multiple-object tracking. We use a simple approach to tracking multiple objects, in which we run multiple instances of our MVCF tracker without spatial constraints between the objects. When tracking approximately four objects simultaneously, our algorithm runs at more than 25 fps. Therefore, the multi-vector correlation filter (MVCF) can be used as a basic framework in multi-object tracking, much as Random Forests (RFs) and Support Vector Machines (SVMs) are.
The contributions of this paper are as follows.

MVCF: We propose a new type of correlation filter, a multi-vector correlation filter (MVCF), which can directly convolve with a multi-vector descriptor. Extensive experiments demonstrate that the proposed tracker, which is based upon MVCF, can outperform state-of-the-art trackers.
Feature Selection: We select the optimal features for the multi-vector correlation filter according to how well the MVCF-based tracker performs with them. The newly proposed DD(11)-HOG fusion feature is the optimal feature for MVCF tracking. We also show that MVCF with DD(11)-HOG obtains superior performance compared with current state-of-the-art correlation filters with other features.
Multi-object Tracking: We apply our approach to multi-object tracking tasks. We demonstrate that MVCF is well suited for use as a basic framework in multi-object tracking, much as RFs and SVMs are. When tracking approximately four objects simultaneously, our algorithm runs at more than 25 fps.
The remainder of this paper is organized as follows. In Section 2, we review the CSK tracker. In Section 3, we introduce a new robust tracker that is based on a multi-vector correlation filter. In Section 4, we show our experimental results. Finally, we present our conclusions in Section 5.

The CSK Tracker
Correlation filters have shown superior performance on a number of computer vision problems. The CSK tracker [13] is based on a kernelized single-channel correlation filter and runs at hundreds of frames per second. The key to its outstanding speed is that the CSK tracker exploits the circulant structure. The CSK tracker uses scalar features, such as raw pixel values, computed from a preprocessed grayscale image patch. When the input is a 3-channel RGB color image, the intensity channel is computed using Matlab's "rgb2gray" function. In this section, we briefly describe the CSK tracker.

Training Samples and Labels
Training samples and labels are used as the inputs of the classifier. A classifier is trained using a single grayscale image patch $x$ of size $M \times N$ that is centered around the target. The patch $x$ is expressed as an $MN \times 1$ vector. The CSK tracker considers all of the cyclic shifts $x_i = P^i x$, $i \in \{(0,0), \cdots, (M-1,N-1)\}$, which are referred to as training samples. Here $P$ is the permutation matrix that cyclically shifts vectors by one element to the right (the last element wraps around). Each shifted grayscale image patch $x_i$ of size $M \times N$ is a training sample, and its corresponding confidence score is $y_i$. In a large majority of trackers, the label is a binary value. In the CSK tracker, the labels are continuous values that are computed by a Gaussian function. The confidence score equals 1 at the target location $i' = (m', n')$ and decays toward 0 as the distance increases. There are $M \times N$ locations in total, one for each training sample. The label of the $i = (m,n)$-th location is $y_i = \exp\left(-\frac{0.5}{s^2}\left((m-m')^2 + (n-n')^2\right)\right)$, where $(m,n) \in \{0,\cdots,M-1\} \times \{0,\cdots,N-1\}$ and the spatial bandwidth is $s = \sqrt{MN}/16$.
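The label construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `gaussian_labels` and the default of placing the target location $(m', n')$ at the patch center are assumptions made for the example.

```python
import numpy as np

def gaussian_labels(M, N, center=None):
    """Continuous confidence labels y for an M x N patch.

    `center` is the target location (m', n'); defaulting to the patch
    center is an assumption made for this sketch.
    """
    if center is None:
        center = (M // 2, N // 2)
    s = np.sqrt(M * N) / 16.0                  # spatial bandwidth s = sqrt(MN)/16
    m, n = np.mgrid[0:M, 0:N]                  # all locations (m, n)
    dist2 = (m - center[0]) ** 2 + (n - center[1]) ** 2
    return np.exp(-0.5 / s**2 * dist2)         # 1 at the target, decaying with distance

y = gaussian_labels(32, 48)
print(y.shape, y[16, 24])
```

Every one of the $M \times N$ cyclic shifts receives its own label, so `y` has the same shape as the patch.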

Training
A classifier is trained by finding the parameter $w$ that minimizes the cost function. The cost function minimization problem is written as
$$\min_{w} \sum_{i} \left( \langle \phi(x_i), w \rangle - y_i \right)^2 + \lambda \|w\|^2, \quad (1)$$
where $\lambda$ is a parameter for regularization, and the classifier has $M \times N$ pairs of training samples and labels. The kernel is defined as $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ is the mapping to the Hilbert space. The inputs $x_i$ are mapped to a rich high-dimensional feature space using $\phi(x_i)$. The Representer Theorem [28] then states that the cost function in Equation (1) is minimized by the solution $w = \sum_i \alpha_i \phi(x_i)$, which is a linear combination of the mapped training samples. The parameter $w$ and $\phi(x_i)$ have the same dimensionality. The online classifier coefficients $\alpha$ are updated by updating the solution $w$ over time, where the coefficients $\alpha$ are given by
$$\mathcal{F}(\alpha) = \frac{\mathcal{F}(y)}{\mathcal{F}(k^x) + \lambda}, \quad (2)$$
where $\mathcal{F}$ denotes the Fourier transform, the division is element-wise, and the vector $k^x$ has elements $k_i = \kappa(x, P^i x)$, $i = 0, \cdots, n-1$. $P$ is the permutation matrix. In summary, the solution $w$ is implicitly represented by the vector $\alpha$, which is obtained from Equation (2).
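The step from the cost function to the Fourier-domain solution can be made explicit. The following is a sketch of the standard argument used for the CSK tracker [13], assuming the regularized least-squares cost of Equation (1) and a kernel whose value is unchanged when both arguments are cyclically shifted together (this holds for the linear and Gaussian kernels). The symbols $K$ and $C(\cdot)$ are introduced here for illustration only.

```latex
% Substituting w = \sum_i \alpha_i \phi(x_i) into the cost function,
% with kernel matrix K_{ij} = \kappa(x_i, x_j):
\min_{\alpha} \; \|K\alpha - y\|^2 + \lambda\, \alpha^{T} K \alpha
\quad\Longrightarrow\quad
\alpha = (K + \lambda I)^{-1} y

% Because the samples are cyclic shifts x_i = P^i x, K is circulant,
% K = C(k^x), with generating vector k_i = \kappa(x, P^i x).
% Circulant matrices are diagonalized by the discrete Fourier transform,
% so the matrix inverse becomes an element-wise division:
\mathcal{F}(\alpha) = \frac{\mathcal{F}(y)}{\mathcal{F}(k^x) + \lambda}
```

The element-wise division is what replaces an $MN \times MN$ matrix inversion with a handful of FFTs, which is the source of the tracker's speed.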

Fast Detection
In the new frame, a set of grayscale patches $z$ of size $M \times N$ is obtained in a search region around the object location. The kernel classifier can perform detection quickly with the Fast Fourier Transform (FFT). A classifier response is computed for each single input, and all of the responses are evaluated simultaneously. The confidence map of the target center is obtained by
$$\hat{y} = \mathcal{F}^{-1}\left( \mathcal{F}(k^z) \odot \mathcal{F}(\alpha) \right), \quad (3)$$
where $k^z$ is a vector that has the elements $\hat{k}_i = \kappa(z_i, \hat{x})$, $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and the inverse Fourier transform, respectively, and $\odot$ is the element-wise product. The learned object appearance $\hat{x}$ is updated over time. The current model is computed by considering all of the previous frames. The best object location is estimated by maximizing the confidence map.
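The training and detection steps of this section can be sketched together as follows. This is a minimal single-channel illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the kernel distance is normalized by the number of pixels (as in the publicly available CSK code), and the shift convention $k_i = \kappa(a, P^i b)$ is chosen so that the response peak moves in the same direction as the target.

```python
import numpy as np

def gaussian_kernel_correlation(a, b, sigma=0.5):
    """k[i] = kappa(a, P^i b) for every cyclic shift i of b (Gaussian kernel)."""
    # Cross-correlation <a, P^i b> for all shifts at once, via the FFT.
    cross = np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))
    dist2 = np.sum(a**2) + np.sum(b**2) - 2.0 * cross
    # Dividing by a.size is an assumption borrowed from the public CSK code.
    return np.exp(-np.maximum(dist2, 0.0) / (sigma**2 * a.size))

def train(x, y, lam=1e-4, sigma=0.5):
    """Fourier-domain coefficients F(alpha) = F(y) / (F(k^x) + lambda)."""
    k = gaussian_kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x_hat, z, sigma=0.5):
    """Confidence map: inverse FFT of F(k^z) times F(alpha), element-wise."""
    k = gaussian_kernel_correlation(z, x_hat, sigma)
    return np.real(np.fft.ifft2(np.fft.fft2(k) * alphaf))

# Toy usage: train on a random patch, then detect on a cyclically shifted copy.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
m, n = np.mgrid[0:32, 0:32]
s = np.sqrt(32 * 32) / 16.0
y = np.exp(-0.5 / s**2 * ((m - 16) ** 2 + (n - 16) ** 2))   # labels peak at (16, 16)

alphaf = train(x, y)
z = np.roll(x, (3, -5), axis=(0, 1))                        # target moved by (3, -5)
response = detect(alphaf, x, z)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)   # location of the maximum of the confidence map
```

The displacement of the peak from the label center (16, 16) gives the estimated target motion; a real tracker would also apply a cosine window and operate on preprocessed patches, which are omitted here.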

Proposed Algorithm
The novelty of this paper is a real-time tracker that is based on the CSK algorithm. Our multi-vector correlation filter (MVCF) can directly use multi-vector descriptors (e.g., DD-HOG and CN-HOG). We present the details of the proposed tracking algorithm in this section.

Input of Multi-Vector Correlation Filter
Generally, a unique vector is computed to represent an image patch when only one feature is used in a correlation filter framework. For a multi-vector descriptor, an image patch is mapped to multiple image representations. All of the descriptors carry with them corresponding mappings.
Given an image $x$, each vector feature is defined as $X_i = \varphi_i(x)$, $i = 1, 2, \cdots, \nu$, where $\nu$ is the size of the set of multi-vector features. Then, $X_i$ is the $i$-th vector feature, and its corresponding mapping is $\varphi_i$.
The multi-channel $X = [X_1, X_2, \cdots, X_\nu]$ is the input of the multi-vector correlation filter. The element $X_i = [\varphi_{i,1}(x), \varphi_{i,2}(x), \cdots, \varphi_{i,D_i}(x)]$ is an $M \times N \times D_i$ tensor, where $\varphi_{i,j}(x)$ is the $M \times N$ matrix in the $j$-th channel ($1 \le j \le D_i$) of the $i$-th element $X_i$. A fixed number of channels is allotted to each element. The dimension $D$ of the input $X$ is $\sum_{i=1}^{\nu} D_i$, where the number of elements is $\nu$. The multi-vector feature $X$ as a whole is correlated with our multi-vector correlation filter.
It is necessary to consider the size of the different elements $X_i$. We use a pre-processing step to ensure that all of the elements $X_i$ have the same spatial size $M \times N$. In general, different features yield different elements $X_i$ through $\varphi_i = [\varphi_{i,1}, \varphi_{i,2}, \cdots, \varphi_{i,D_i}]$, so the dimensions of these elements are not the same; this does not affect the pre-processing. Here, we give a brief description for the CN-HOG feature. Generally, an image is represented by HOG features that are computed densely over cells, where a cell is an $8 \times 8$ non-overlapping pixel region; other $n \times n$ pixel cells are also used in practice. For the cell $c_i$, the representation is obtained by concatenation, specifically, $c_i = [CN_i, HOG_i]$. For the HOG descriptor, we first compute the intensity channel with the "rgb2gray" function, and the representation is then computed in each cell ($n \times n$ pixels). The element $X_{HOG}$ has a reduced size, $(M/n) \times (N/n)$, and its dimension is 31. A similar procedure is used to compute the element $X_{CN}$ for each cell, which results in the same size as the element $X_{HOG}$; the RGB values are mapped to an 11-dimensional color representation. The bi-vector feature $X_{CN-HOG}$ as a whole has the size $(M/n) \times (N/n)$, and the dimension of the fusion vector is 42.
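The size bookkeeping for a bi-vector feature such as CN-HOG can be illustrated with a short sketch. The text does not specify how the per-cell color representation is computed, so averaging a per-pixel map over each $n \times n$ cell is an assumption here, as is the helper name `pool_to_cells`. The HOG and color-name maps are placeholders, not real feature extractors.

```python
import numpy as np

def pool_to_cells(per_pixel, n):
    """Average an (M, N, D) per-pixel map over non-overlapping n x n cells.

    Returns an (M // n, N // n, D) map. Average pooling is an assumption
    made for this sketch; it only serves to match the HOG cell grid.
    """
    M, N, D = per_pixel.shape
    Mc, Nc = M // n, N // n
    cropped = per_pixel[:Mc * n, :Nc * n]
    return cropped.reshape(Mc, n, Nc, n, D).mean(axis=(1, 3))

M, N, n = 64, 48, 4
rng = np.random.default_rng(0)
cn_pixels = rng.random((M, N, 11))       # placeholder for a per-pixel color-name map
X_cn = pool_to_cells(cn_pixels, n)       # (M/n, N/n, 11)
X_hog = np.zeros((M // n, N // n, 31))   # placeholder for a real HOG extractor

X = [X_cn, X_hog]                        # multi-vector input with nu = 2 elements
D = sum(X_i.shape[2] for X_i in X)       # total number of channels
print([X_i.shape for X_i in X], D)
```

Keeping the elements in a list rather than concatenating them preserves the per-element structure that the later update and dimension-reduction steps operate on.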

Multi-Vector Structure
To directly use multi-vector features, such as DD-HOG, we design a novel multi-vector structure, shown in Figure 2. A multi-vector correlation filter, when interpreted literally, is composed of multiple vector correlation filters. The vector correlation filter is composed of a single correlation filter. For each feature channel, its corresponding confidence score is computed, and an input patch $x_j$ is mapped using $\varphi_{i,d}(x_j)$, where $i \in \{1, 2, \cdots, \nu\}$ and $1 \le d \le D_i$. Each input patch is collected around the target. The result $\varphi_{i,d}(x_j)$ is an $M \times N$ matrix in the $\left(d + \sum_{n=1}^{i-1} D_n\right)$-th channel of the whole filter. Equation (1) can then be expressed as
$$\min_{w_{1,1}, w_{1,2}, \cdots, w_{\nu,D_\nu}} \sum_{j} \left( \sum_{i=1}^{\nu} \sum_{d=1}^{D_i} \langle \phi(\varphi_{i,d}(x_j)), w_{i,d} \rangle - y_j \right)^2 + \lambda \sum_{i=1}^{\nu} \sum_{d=1}^{D_i} \|w_{i,d}\|^2, \quad (4)$$
where the number of vectors is $\nu$, and the number of channels of the $i$-th vector is $D_i$. For each feature channel, there is a corresponding classifier. The confidence score indicates the similarity with the target, and a sharp peak is obtained near the target location. The final confidence score is obtained by aggregating the outputs of all of the feature channels. This aggregation is embodied in the kernel computation $\kappa(x, x')$, and it is handled identically for the linear kernel and the Gaussian kernel. The inputs $x$ and $x'$, each taken as a whole vector feature, are $M \times N \times D$ tensors, where $M \times N$ is the total number of locations, and $D$ is the number of channels. The single-channel $k$ is obtained by aggregating the outputs of each feature channel; it is a matrix of size $M \times N$ with one channel. Therefore, both the solution $\alpha$ in training and the confidence score $\hat{y}$ are $M \times N$ matrices.
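The aggregation over feature channels inside the kernel computation can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: the function name `multi_vector_kernel` is hypothetical, a Gaussian kernel is assumed, and normalizing the distance by the total number of feature values follows common practice in public correlation-filter code.

```python
import numpy as np

def multi_vector_kernel(A, B, sigma=0.5):
    """Single-channel M x N kernel map from two multi-vector features.

    A and B are lists of (M, N, D_i) arrays (one per vector feature).
    Per-channel correlations are summed, so the output has one channel
    regardless of how many vectors or channels go in.
    """
    cross_f = 0.0
    energy = 0.0
    count = 0
    for a, b in zip(A, B):
        af = np.fft.fft2(a, axes=(0, 1))
        bf = np.fft.fft2(b, axes=(0, 1))
        cross_f = cross_f + np.sum(af * np.conj(bf), axis=2)   # aggregate channels
        energy += np.sum(a**2) + np.sum(b**2)
        count += a.size
    cross = np.real(np.fft.ifft2(cross_f))
    # Normalizing by the total element count is an assumption of this sketch.
    return np.exp(-np.maximum(energy - 2.0 * cross, 0.0) / (sigma**2 * count))

rng = np.random.default_rng(0)
X = [rng.standard_normal((16, 16, 11)), rng.standard_normal((16, 16, 31))]
k = multi_vector_kernel(X, X)
print(k.shape)
```

Because the output is a single $M \times N$ map, the training and detection formulas of the single-channel case can be reused unchanged.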

Updating Scheme
The scheme needs to update both the learned object appearance $\hat{x}$ and the filter coefficients $\alpha$ over time.
To update the model, all of the previous frames should be considered, from the first frame until the current frame $p$. All of the previous appearances of the target are $\{x_m \mid m = 1, \cdots, p\}$. A positive weight constant $\beta_m$ is allocated to each frame. Here $\eta$ is a learning rate parameter that sets the weight $\beta_m$. Equation (4) can then be expressed anew as
$$\min_{w_{1,1}, w_{1,2}, \cdots, w_{\nu,D_\nu}} \sum_{m=1}^{p} \beta_m \left( \sum_{j} \left( \sum_{i=1}^{\nu} \sum_{d=1}^{D_i} \langle \phi(\varphi_{i,d}(x_{j,m})), w_{i,d} \rangle - y_{j,m} \right)^2 + \lambda \sum_{i=1}^{\nu} \sum_{d=1}^{D_i} \|w_{i,d}\|^2 \right), \quad (5)$$
where the number of vectors is $\nu$, the number of channels of the $i$-th vector is $D_i$, and $\lambda$ is a parameter for regularization. In frame $m$, the confidence score that corresponds to the input image patch $x_{j,m}$ is $y_{j,m}$. The solution $\alpha$ in Equation (2) can then be expressed as
$$\mathcal{F}(\alpha) = \frac{\sum_{m=1}^{p} \beta_m \mathcal{F}(y_m) \mathcal{F}(k_m)}{\sum_{m=1}^{p} \beta_m \mathcal{F}(k_m) \left( \mathcal{F}(k_m) + \lambda \right)}. \quad (6)$$
The cost function is minimized by this $\mathcal{F}(\alpha)$. The derivation of Equation (6) is given in the Appendix.
The object appearance $\hat{x}^p = [\hat{x}^p_1, \hat{x}^p_2, \cdots, \hat{x}^p_\nu]$ is an $M \times N \times D$ tensor, where the number of vectors is $\nu$, and $\hat{x}^p_i$ is an $M \times N \times D_i$ tensor. In each new frame, both the filter and the object appearance are updated.
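One common way to realize a weighted sum over all previous frames without storing them is to keep running averages of the numerator and denominator of the Fourier-domain solution. The sketch below assumes exponentially decaying frame weights controlled by a single rate, as is done in the CN tracker [22]; the exact update equations of this paper are not reproduced, and the names `update_model`, `filter_coefficients` and `rate` are hypothetical.

```python
import numpy as np

def update_model(model, x, y, k, rate=0.2, lam=1e-4):
    """Blend the current frame into the running model.

    x : list of (M, N, D_i) arrays, the current multi-vector appearance
    y : (M, N) label map, k : (M, N) kernel map for the current frame
    `rate` plays the role of a learning rate (an assumption of this sketch).
    """
    yf, kf = np.fft.fft2(y), np.fft.fft2(k)
    num_new = yf * kf                    # F(y_m) F(k_m)
    den_new = kf * (kf + lam)            # F(k_m) (F(k_m) + lambda)
    if model is None:                    # first frame: no history to blend with
        return {"num": num_new, "den": den_new,
                "x_hat": [x_i.copy() for x_i in x]}
    return {
        "num": (1 - rate) * model["num"] + rate * num_new,
        "den": (1 - rate) * model["den"] + rate * den_new,
        # Each sub-appearance is updated separately.
        "x_hat": [(1 - rate) * old + rate * new
                  for old, new in zip(model["x_hat"], x)],
    }

def filter_coefficients(model):
    """F(alpha) as the ratio of the accumulated numerator and denominator."""
    return model["num"] / model["den"]
```

With this recursion, frame $m$ ends up with a weight proportional to $(1 - \text{rate})^{p-m}$, so recent frames dominate while older ones are never discarded entirely.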

Dimension Reduction
For the current frame $p$, the object appearance $\hat{x}^p = [\hat{x}^p_1, \hat{x}^p_2, \cdots, \hat{x}^p_\nu]$ consists of $\nu$ object sub-appearances. For each sub-appearance $\hat{x}^p_i$, $i = 1, 2, \cdots, \nu$, we apply an eigenvalue decomposition (EVD) independently, which reduces the dimension and thereby increases the speed.
For the first frame, we extract the image patch using the initial ground truth. The multi-vector $X^p = [X^p_1, X^p_2, \cdots, X^p_\nu]$ is the input of the multi-vector correlation filter described in Section 3.1, where $p$ is the current frame. For each multi-channel vector feature $X^1_i$, its corresponding covariance matrix $A^1_i$ is computed, where $i = 1, 2, \cdots, \nu$. The covariance matrix $A^1_i$ is a square matrix of size $D_i \times D_i$. Then, we perform an eigenvalue decomposition of the matrix $A^1_i$. The covariance matrix is decomposed into the form $A^1_i = Q^1_i \Sigma^1_i (Q^1_i)^T$, where $\Sigma^1_i$ is the $D_i \times D_i$ diagonal matrix of the eigenvalues and the columns of $Q^1_i$ are the corresponding eigenvectors. The projection matrix $B^1_i$ is selected as the first $D'_i$ columns of $Q^1_i$. The low-dimensional sub-appearance $\hat{x}^1_i$ is obtained by $\hat{x}^1_i = X^1_i B^1_i$, where $i = 1, 2, \cdots, \nu$, and the dimension of $\hat{x}^1_i$ is $D'_i$. The learned appearance $\hat{x}^1 = [\hat{x}^1_1, \hat{x}^1_2, \cdots, \hat{x}^1_\nu]$ is used to compute the detection scores $\hat{y}$ for the next frame. In each new frame, a covariance matrix is again prepared for the EVD, and the procedure is similar for the subsequent frames.
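The per-element reduction for the first frame can be sketched as below. This is a minimal illustration under assumptions: the helper name `reduce_element` is hypothetical, the covariance is computed from mean-centered channel vectors, and "first columns" is taken to mean the eigenvectors with the largest eigenvalues.

```python
import numpy as np

def reduce_element(X_i, d_out):
    """Project an (M, N, D_i) element onto its d_out leading eigenvectors.

    Returns the reduced (M, N, d_out) element and the D_i x d_out
    projection matrix B_i.
    """
    M, N, D = X_i.shape
    flat = X_i.reshape(-1, D)                       # one D_i-vector per location
    centered = flat - flat.mean(axis=0)             # centering is an assumption
    A = centered.T @ centered / (flat.shape[0] - 1) # D_i x D_i covariance matrix
    evals, Q = np.linalg.eigh(A)                    # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]                 # largest eigenvalues first
    B = Q[:, order[:d_out]]
    return (flat @ B).reshape(M, N, d_out), B

rng = np.random.default_rng(0)
X = [rng.standard_normal((16, 16, 11)), rng.standard_normal((16, 16, 31))]
# Each element is reduced independently; the target sizes here are arbitrary.
reduced = [reduce_element(X_i, d)[0] for X_i, d in zip(X, (4, 10))]
print([r.shape for r in reduced])
```

Reducing each element on its own keeps the color and gradient parts of the descriptor from competing for the same principal directions.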

Main Differences from CSK, CN and KCF
All types of correlation filters are designed depending on the usage of a feature. The searching for and usage of good features are a significant part of the methodology.
Traditionally, many different correlation filters were designed to be used with scalar feature (most commonly pixel value) representations only. The CSK tracker uses this traditional type of correlation filter, which is a single-channel correlation filter. The CN tracker proposes a tracking method that can handle multi-channel color feature vectors (color name descriptors). The vector correlation filter (VCF) used in the CN tracker was designed by Boddeti et al. [14] and is a multi-channel correlation filter. The journal version of CSK (called KCF) also uses the VCF and handles HOG features very well. The KCF selects the HOG features to obtain better performance.
Reference [18] shows that the simple fusion of CN and HOG obtains an outstanding performance increase for object detection. We propose a new type of fusion feature (DD-HOG) that yields a significant improvement in tracking performance. A multi-vector descriptor cannot be used directly in prior correlation filters, so we design a multi-vector correlation filter (MVCF) to overcome this difficulty. The MVCF can also readily use other multi-vector descriptors. Table 1 shows detailed information about the differences between our method and the above three methods. Here, we briefly analyze the main reason why the proposed approach is better than the previous ones. The authors of [29] find that the feature extractor plays the most important role in a tracker, whereas the observation model often brings no significant improvement. Thus, selecting the right features provides the potential for improving performance. Using the proposed fusion features can substantially improve the tracking performance of a correlation filter tracking system.

Experiments
In this section, we present qualitative and quantitative tracking results. We performed two sets of experiments to evaluate the performance of our tracker. In the first set of experiments, we selected an optimal fusion feature for our multi-vector correlation filter, used a dimensionality reduction technique for the optimal feature and compared our tracker with other existing state-of-the-art trackers. Moreover, our tracker is compared to other correlation-based methods, such as CSK [13], CN [22] and KCF [23]. In the second set of experiments, we evaluated the performance of the MVCF tracker for multi-object tracking.

Evaluation Setup
The proposed tracker is implemented in Matlab on a workstation with a 3.7 GHz processor and 8 GB RAM without sophisticated program optimization. In our approach, the parameters are fixed for all of the sequences. A Gaussian kernel is used in our tracker. The parameters of the proposed model follow [13,22,23]. We set the bandwidth of the Gaussian kernel to $\sigma = 0.5$, the spatial bandwidth to $s = \sqrt{MN}/16$ for a target of size $M \times N$, the regularization to $\lambda = 10^{-4}$, the adaptation parameter to $\mu = 0.15$, and the learning rate to $\eta = 0.2$. For all of the fusion features that use HOG, we use $4 \times 4$ pixel cells.
We use two criteria, the tracking precision and the success rate, as quantitative evaluations [11].
Precision: The center location error (CLE) is a widely used tracking evaluation measure; it is defined as the distance between the center location of the tracked target and that of the manually labeled ground truth. The precision score is the percentage of frames whose estimated location is within a given threshold distance of the ground truth. The default threshold is 20 pixels.
Success Rate: Another evaluation measure is the Pascal VOC overlap ratio (VOR). The overlap score is defined as $S = \frac{|r_t \cap r_a|}{|r_t \cup r_a|}$, where $r_t$ represents the tracked bounding box, and $r_a$ represents the ground truth bounding box. The symbols $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$ denotes the number of pixels in the region. If this overlap score is above a given threshold, then the tracking result in that frame is considered to be a success. The default threshold is 50%. The success rate is computed over all of the frames.
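Both criteria are simple to compute from per-frame bounding boxes. The sketch below assumes boxes given as `(x, y, width, height)` tuples with areas measured in continuous units rather than pixel counts; the function names are illustrative and are not taken from the benchmark toolkit.

```python
import math

def center_error(box_t, box_a):
    """Distance between the centers of the tracked and ground-truth boxes."""
    cx_t, cy_t = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    cx_a, cy_a = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    return math.hypot(cx_t - cx_a, cy_t - cy_a)

def overlap(box_t, box_a):
    """Overlap score S = |intersection| / |union| of two boxes."""
    x1 = max(box_t[0], box_a[0])
    y1 = max(box_t[1], box_a[1])
    x2 = min(box_t[0] + box_t[2], box_a[0] + box_a[2])
    y2 = min(box_t[1] + box_t[3], box_a[1] + box_a[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_a[2] * box_a[3] - inter
    return inter / union if union > 0 else 0.0

def precision(tracked, truth, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(center_error(t, a) <= threshold for t, a in zip(tracked, truth))
    return hits / len(truth)

def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap score is above the threshold."""
    hits = sum(overlap(t, a) > threshold for t, a in zip(tracked, truth))
    return hits / len(truth)

tracked = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
truth = [(0, 0, 10, 10)] * 3
print(precision(tracked, truth), success_rate(tracked, truth))
```

In the toy sequence, the second frame counts toward precision (its center is 5 pixels off) but not toward success (its overlap is one third), which shows why the two criteria are reported separately.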

Color Descriptors
We describe the color descriptors that we will use to augment the HOG feature descriptors for object tracking. In the following description, we refer to some of the published articles [16,18-20,22,26,30]. Table 2 shows a comparison of the feature dimensionality of the different color descriptors.

RGB: The standard 3-channel RGB color space, which is by far the most commonly used color space.
HSV: H is for the hue, S is for the saturation, and V is for the value. It is often more natural to think about a color in terms of its hue and saturation than in terms of the additive or subtractive color components.
YCbCr: YCbCr is approximately perceptually uniform and is used as a part of the color image pipeline in video and digital photography systems.
LAB: In the Lab color space, the dimension L is for the lightness, and A and B are for the color-opponent dimensions.
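Opponent: This representation is invariant with respect to specularities, and the image is transformed according to
$$\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}.$$
C: The C color representation adds photometric invariants with respect to shadow-shading to the opponent descriptor by normalizing by the intensity. This step is performed according to $C = \left( \frac{O_1}{O_3}, \frac{O_2}{O_3}, O_3 \right)^T$.
CN: Color names, or color attributes, are linguistic color labels that humans assign to colors in the world. Berlin and Kay [21], in a linguistic study, concluded that the English language contains eleven basic color terms: black, blue, brown, gray, green, orange, pink, purple, red, white and yellow. In this paper, we use the mapping in [17], which is automatically learned from Google images.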

CN-HOG:
Among various features, the histograms of oriented gradients (HOG) proposed by Dalal and Triggs [26] are the most commonly used features for object detection. A single-channel grayscale image/signal is mapped to a 31-dimensional image/signal representation. Generally, an image is represented by HOG features that are computed densely over cells. A cell is a non-overlapping 8×8 pixel region of the image; other n×n pixel cells are also used in practice.
A 31-dimensional vector is computed to represent each cell. Although both memory usage and time complexity rise, the discriminative ability of the HOG-based classifier increases compared with a classifier using raw pixel values. The authors of [18] incorporate 11-dimensional color names into the 31-dimensional HOG feature, which results in increased performance. Because HOG is computed per cell, the color attributes are computed for each cell by a similar procedure.

DD(11)-HOG, DD(25)-HOG:
We extend the 31-dimensional HOG vector with the discriminative descriptor (DD) vector proposed by Khan et al. [20], which obtains outstanding discriminative power in a classification problem. The discriminative descriptor (DD) is not limited to eleven color names, and the desired dimensionality can be chosen freely. The authors of [20] make the universal color descriptors available for settings with 11, 25, and 50 clusters. Our goal is to obtain a compact and discriminative descriptor, and thus, we consider only DD(11) and DD(25). DD(25) outperforms all of the other descriptors (CN, DD(11) and DD(50)) that were used in their experiment. We want to determine whether similar results can be obtained for object tracking.
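The construction of a fusion feature can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in our experiments: it assumes that a per-pixel color descriptor map and a per-cell HOG map are already available, that the color descriptor of a cell is the average over its pixels, and that both cell grids have the same size. The names `cell_average` and `fuse` are hypothetical.

```python
import numpy as np


def cell_average(pixel_features, cell=4):
    """Average a per-pixel descriptor map (H, W, D) over non-overlapping cells.

    Rows and columns that do not fill a complete cell are cropped.
    """
    h, w, d = pixel_features.shape
    hc, wc = h // cell, w // cell
    cropped = pixel_features[:hc * cell, :wc * cell]
    # Split both spatial axes into (cell index, offset inside the cell).
    blocks = cropped.reshape(hc, cell, wc, cell, d)
    return blocks.mean(axis=(1, 3))


def fuse(hog_cells, color_cells):
    """Concatenate per-cell HOG and color descriptors along the channel axis."""
    if hog_cells.shape[:2] != color_cells.shape[:2]:
        raise ValueError("HOG and color maps must share the same cell grid")
    return np.concatenate([hog_cells, color_cells], axis=-1)
```

With a 31-dimensional HOG map and an 11-dimensional color map, `fuse` returns a 42-dimensional descriptor per cell, which matches the dimensionality of CN-HOG and DD(11)-HOG.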

Evaluation on Comprehensive Benchmark
The first set of experiments is conducted on the CVPR2013 tracking benchmark [11], which is specially designed for the evaluation of the tracking performance. The tracking dataset consists of 50 fully annotated sequences, which provide a large number of scene changes and target motions. There are many challenging factors in these sequences, including illumination change, scale change, occlusion, and fast motion.

Experiment 1: Feature Selection
We first perform an experiment to select the optimal features for the multi-vector correlation filter. The performance of the color descriptors is evaluated on the task of object tracking. Table 3 shows the results on all 36 color videos of the CVPR2013 tracking benchmark. The results are reported in terms of both the precision, with a CLE threshold of 20 pixels, and the success rate, with a VOR threshold of 50%. We also provide the median frames per second (FPS). For HOG features with 4×4 pixel cells, the intensity channel is computed using Matlab's "rgb2gray" function. Table 3 also shows a comparison of the feature dimensions of the different color descriptors. The features on the top are more compact than the features on the bottom. The fusion features CN-HOG and DD(11)-HOG have a dimensionality of 42, which is significantly more compact than a fusion approach in which the HOG is computed on multiple color channels. The results clearly show that the DD(11)-HOG descriptor performs best, with a success rate of 0.68 and a precision of 0.75. In [20], the DD descriptor with 25 dimensions outperforms all of the other descriptors, including the 11-dimensional descriptor and the CN descriptor, for object detection. However, DD(11)-HOG obtains the best results in our experiment. Moreover, the fusion approach in which the HOG is computed on multiple color channels does not obtain good results. CN-HOG provides the highest speed among the ten color descriptors. The median speed of DD(11)-HOG over all 36 sequences is 72 fps. In general, MVCF provides a high speed regardless of which features are used. In summary, DD(11)-HOG is the optimal feature for the multi-vector correlation filter for tracking.
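The precision score reported above can be illustrated with a short sketch. It assumes that the CLE is the Euclidean distance between the centers of the tracked and ground truth boxes, that boxes are given as (x, y, width, height), and that a frame counts as precise when its error does not exceed the threshold; the function names are hypothetical.

```python
import numpy as np


def center_location_error(tracked, truth):
    """Per-frame distance (in pixels) between the two box centers."""
    tracked = np.asarray(tracked, dtype=np.float64)
    truth = np.asarray(truth, dtype=np.float64)
    centers_t = tracked[:, :2] + tracked[:, 2:] / 2.0
    centers_a = truth[:, :2] + truth[:, 2:] / 2.0
    return np.linalg.norm(centers_t - centers_a, axis=1)


def precision(tracked, truth, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    errors = center_location_error(tracked, truth)
    return float(np.mean(errors <= threshold))
```

A sequence in which half of the frames drift by more than 20 pixels therefore yields a precision of 0.5, regardless of how large the drift is in the failed frames.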

Experiment 2: Low-Dimensional DD(11)-HOG Feature
In simple fusion, we incorporate the 11-dimensional discriminative descriptor (DD) into a 31-dimensional HOG feature, in the same way as CN-HOG in [18]. The dimension of the fusion feature DD(11)-HOG is 42. The computational time scales linearly with the dimension of the fusion feature. In this paper, we use an adaptive dimensionality reduction technique that reduces the dimension of each vector of the multi-vector descriptor separately. The technique is applied to compress the 42-dimensional DD(11)-HOG to only nine dimensions, DD(11)$_5$-HOG$_4$, consisting of a five-dimensional DD(11) and a four-dimensional HOG. Table 4 shows the results that were obtained using DD(11)-HOG and its comparison with other features. The CSK tracker uses a single-channel correlation filter with raw pixels, the CN tracker uses a vector correlation filter with the CN descriptor, and the KCF tracker uses the vector correlation filter with the HOG descriptor. The quantitative evaluation shows that the MVCF with DD(11)-HOG obtains the best results, successfully tracks objects in almost all of the sequences in this dataset, and outperforms the other three methods, namely CSK, CN and KCF. The results also show that our MVCF with the DD(11)$_5$-HOG$_4$ feature further improves the speed without a significant loss in accuracy. If we use a greater compression ratio in the dimensionality reduction technique, both the success rate and the precision score gradually decrease. Among the five trackers, CSK provides the highest speed, and KCF obtains the second highest speed. MVCF using DD(11)-HOG is faster than the CN tracker, which also uses a color descriptor. The proposed tracker based on MVCF can outperform prior state-of-the-art trackers using correlation filters.
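The idea of reducing each vector of the fusion feature separately can be sketched as follows. This sketch substitutes a plain PCA computed from a single feature map for the adaptive technique, which is an assumption made for brevity; the function names and the default target dimensions (five for DD, four for HOG) are illustrative.

```python
import numpy as np


def pca_basis(features, n_components):
    """Mean and leading principal directions of a feature map (H, W, D)."""
    x = features.reshape(-1, features.shape[-1])
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components].T


def project(features, mean, basis):
    """Project every cell descriptor onto the given basis."""
    return (features - mean) @ basis


def compress_fusion(hog_cells, dd_cells, hog_dims=4, dd_dims=5):
    """Compress the DD and HOG parts separately, then concatenate them."""
    hog_mean, hog_basis = pca_basis(hog_cells, hog_dims)
    dd_mean, dd_basis = pca_basis(dd_cells, dd_dims)
    dd_low = project(dd_cells, dd_mean, dd_basis)
    hog_low = project(hog_cells, hog_mean, hog_basis)
    return np.concatenate([dd_low, hog_low], axis=-1)
```

Keeping one basis per feature type means that the color part and the gradient part are never mixed by the projection, so each retains its own compact subspace.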
Our tracker's overall performance compares favorably with that of state-of-the-art trackers, as presented in Table 5. Our method is ranked in first place for its overall performance on this benchmark dataset, with a success rate of 0.69 and a precision score of 0.75. Our tracker has a high speed of 90 fps. A tracker that needs to address live video streams must have a high processing speed. In general, a frame rate of at least 25 fps is required. However, the speed of many algorithms in our experiment is lower than 20 fps. Figure 3 shows the success plot over all of the 36 sequences. The results are reported at a success rate with a VOR of 50%. Our algorithm achieves the best results.

Table 5. Quantitative comparison of our tracker with 10 state-of-the-art methods over 36 challenging sequences. The results are reported in terms of both the median success rate and precision. We also provide the median frames per second (FPS).

For better evaluation and analysis of the strengths and weaknesses of the tracking approaches, the authors of [11] annotate the sequences with 11 different attributes, namely, illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC) and low resolution (LR). By annotating the attributes of each sequence, they construct subsets that have different dominant attributes, which facilitates analyzing the performances of the trackers for each challenging factor. Table 6 presents the success rate and our tracker rank for the different sequence attributes in the benchmark dataset.
Some of the trackers perform well on a few subsets. However, our method outperforms the others on most of the subsets. Our tracker achieves the best performance in 10 of the 11 subsets. On the SV subset, the ASLA method performs better than ours. The ASLA approach with scale adaptation is the best, achieving a success rate of 0.57 while the success rate of our tracker is 0.56. Even with a fixed scale, our tracker is robust to appearance variations that are introduced by scale variations. Owing to space restrictions, we only illustrate the success plots for attributes illumination variation (IV), scale variation (SV), occlusion (OCC) and deformation (DEF) as shown in Figure 4. Both Table 6 and Figure 5 demonstrate that the proposed tracker has an extraordinary ability to address heavy occlusion.

Multiple-Object Tracking
We first evaluate the performance of our tracker on the videos used in [27] for multi-object tracking. These nine videos include multiple objects, and the average length of the videos is 842 frames. We compare the performance of our MVCF tracker with the mst-SPOT tracker. The basis of the structure-preserving object tracker (SPOT) is formed by the popular Dalal-Triggs detector [26], which was obtained by training a linear SVM on HOG features. In their experiments, the mst-SPOT tracker outperforms the other baseline trackers (OAB, TLD, and no-SPOT) in almost all of the videos. We use a simple approach for tracking multiple objects that simply runs multiple instances of our MVCF tracker without spatial constraints between the objects.
We evaluate the performance of the trackers by measuring the precision and success rate of each tracker and averaging over five runs. Table 7 and Figure 6 depict the tracking results of the SPOT tracker and ours, evaluated on nine videos. In this experiment, the proposed MVCF achieves the best overall performance in terms of both the precision and the success rate. The computational complexity grows linearly in the number of objects being tracked. The speed of our algorithm to track approximately four objects simultaneously is more than 25 fps.
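The multi-object scheme described above, which runs one independent tracker instance per object, can be sketched as follows. The single-object interface (`init(frame, box)` and `update(frame)`) is a hypothetical assumption, and `StaticTracker` is only a stand-in that returns its initial box; it is not the MVCF tracker.

```python
class StaticTracker:
    """Stand-in single-object tracker that always reports its initial box."""

    def init(self, frame, box):
        self._box = box

    def update(self, frame):
        return self._box


class MultiObjectTracker:
    """Runs independent single-object trackers, one per target."""

    def __init__(self, make_tracker):
        # make_tracker is a zero-argument factory for a single-object tracker.
        self._make_tracker = make_tracker
        self._trackers = {}

    def add(self, name, frame, box):
        tracker = self._make_tracker()
        tracker.init(frame, box)
        self._trackers[name] = tracker

    def update(self, frame):
        # One update per target: the cost grows linearly with the object count.
        return {name: t.update(frame) for name, t in self._trackers.items()}
```

Because no state is shared between the instances, there are no spatial constraints between the objects, and the per-frame cost is the sum of the individual tracker costs.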

Figure 6. Performance of our MVCF tracker and the SPOT tracker on multiple-object videos. There are 29 single-object tasks in all nine videos. Trackers that successfully track the object in all of the frames of a sequence are denoted by a "√". If a tracker misses the object, then we provide the frame number, which denotes the last frame that can be tracked successfully. MVCF outperforms SPOT, and MVCF can successfully track objects in almost all of the sequences without spatial constraints between the objects. In the Carchase and Parade sequences, MVCF can track the targets until they leave the view. The speed of our algorithm to track approximately four objects simultaneously is more than 25 fps.
Moreover, we qualitatively describe the results. There are 29 single-object tasks in all nine videos. Our tracker successfully tracks objects in almost all of the sequences in this dataset. Only two objects, the gazelle in the Hunting video and Dancer 2 in the Skating video, cannot be tracked well by our tracker. The gazelle moves very fast in Hunting and undergoes significant pose changes, so it is unlikely that online appearance models are able to adapt quickly and correctly. In the Carchase and Parade sequences, MVCF can track the targets until they leave the view. The SPOT tracker fails to track some of the objects, including three players in the Basketball video, three persons in the Parade video, Singer 1 in the Shaking video and two dancers in the Skating video. This finding clearly shows that our approach delivers competitive results, even though it does not consider the spatial constraints between objects.

Conclusions
In this paper, we propose a robust tracker that is based on a multi-vector correlation filter (MVCF) that can efficiently handle multi-vector fusion features. We propose a new DD(11)-HOG fusion feature for use with our MVCF tracker, which leads to a significant increase in the performance for object tracking. Numerous experimental results and evaluations demonstrate that the proposed tracker can outperform existing state-of-the-art trackers in the literature. Moreover, we argue that our MVCF tracker has a powerful ability to address partial occlusion and can run at an impressively high speed. Therefore, MVCF is an ideal framework for multi-object tracking. We hope this work can motivate other researchers to perform more in-depth study in other computer vision tasks.
Author Contributions: Yang Ruan and Zhenzhong Wei conceived and designed the experiments; Yang Ruan performed the experiments; Zhenzhong Wei analyzed the data; Zhenzhong Wei contributed analysis tools; Yang Ruan wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.