
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license.

To tackle robust object tracking in video sensor-based applications, an online discriminative algorithm based on incremental discriminative structured dictionary learning (IDSDL-VT) is presented. In our framework, a discriminative dictionary combining positive, negative and trivial patches is designed to sparsely represent the overlapped target patches. A local update (LU) strategy is then proposed for sparse coefficient learning. To formulate the training and classification process, a multiple linear classifier group based on a K-combined voting (KCV) function is proposed. As the dictionary evolves, the models are retrained to adapt to target appearance variation in a timely manner. Qualitative and quantitative evaluations on challenging image sequences, compared with state-of-the-art algorithms, demonstrate that the proposed tracking algorithm achieves more favorable performance. We also illustrate its relay application in visual sensor networks.

Object tracking via video sensors is an important subject and has long been investigated in the computer vision community. In common usage, an object, or a target, refers to a region in the video frame detected or labeled for specific purposes. Stable and accurate tracking of objects is fundamental to many real-world applications, such as motion-based recognition, automated surveillance, visual sensor networks, video indexing, human-computer interaction, traffic monitoring and vehicle navigation.

Historically, visual trackers proposed in the early years typically kept the appearance model fixed throughout an image sequence. Recently, methods proposed to track targets while evolving the appearance model in an online manner, called online visual tracking, have become popular [

Appearance representation of the target is a basic, but important, task for visual tracking. Discrimination capability, computational efficiency and occlusion resistance are generally considered as the three main aspects in appearance modeling. For online visual tracking, the schemes can be classified into patch-based schemes (e.g., holistic gray-level image vector [

Based on the differences in object observation modeling, online visual tracking can be generally classified into generative methods (e.g., [

As an elegant working model, sparse representation has recently been extensively studied and applied in pattern recognition and computer vision, typically via ℓ_1 minimization. Mei and Ling [

Inspired by the discussions above, this paper considers the dictionary learning problem for online visual tracking and presents a visual tracking algorithm, incremental discriminative structured dictionary learning for visual tracking (IDSDL-VT), which comprises incremental discriminative structured dictionary learning and multiple linear classifiers. The workflow is shown in

Compared with the dictionary learning papers referred to above, we do not solely rely on the optimization, but focus on the dictionary design with separate learning to improve the discriminative ability of the sparse coefficients for classification. Moreover, the dictionary is built on a patch level, and thus, a spatial multi-dictionary learning structure is established. Though a number of papers based on sparse representation have appeared, few consider the dictionary learning aspect. There is a dictionary learning process in the algorithm proposed by Liu

The rest of the paper is organized as follows. In Section 2, details of the proposed structured dictionary learning algorithm are described. Details of the proposed visual tracking algorithm within the Bayesian inference framework are proposed in Section 3. Experimental results and a discussion are given in Section 4. In Section 5, concluding remarks and a possible direction for future research are provided.

We begin the description of the proposed dictionary learning algorithm, incremental discriminative structured dictionary learning (IDSDL), with the sparse appearance modeling as follows. Typically, the global appearance of an object under different illumination and viewpoint conditions is known to lie approximately in a low-dimensional subspace [

Suppose at time t, the target region is normalized to d × d pixels and vectorized as a column of size d^2 × 1. Moreover, there exists a set of templates
where λ_1 and λ_2 are regularization constants. Most of the previous tracking works apply the ℓ_1 constraint to the coefficients, which is a convex approximation to the ℓ_0 regularizer, and the sparsity-based optimization methods used by previous tracking works include basis pursuit (BP) [
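As a concrete illustration of the ℓ_1-regularized coding step, the following minimal sketch (not the paper's implementation, which relies on an elastic net solver) solves the lasso subproblem by iterative soft-thresholding (ISTA):

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_sparse_code(D, y, lam, n_iter=200):
    """Solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)           # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```

For an orthonormal dictionary, this reduces to soft-thresholding the correlations, which provides a quick sanity check of the implementation.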

In sparse representation, a dictionary refers to a matrix D = [d_1, d_2, …, d_M] ∈ ℝ^{(d2×N)×M} made up of a group of basis vectors spanning the target signal space. Given a training set of image patches,

Based on the assumption and definition described above, we present an incremental discriminative structured dictionary learning method. A structured dictionary is defined as
where I_t ∈ ℝ^{d2×d2} is an identity matrix used as a non-negativity constraint, similar to the settings in [
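A minimal sketch of the dictionary assembly, assuming vectorized patches stacked as columns; the ℓ_2 column normalization is a common sparse-coding convention assumed here rather than a detail stated above:

```python
import numpy as np

def build_structured_dictionary(pos_patches, neg_patches):
    """Stack learned positive and negative template patches (one column per
    vectorized patch) with an identity 'trivial template' block, then
    l2-normalize every column."""
    d2 = pos_patches.shape[0]                  # vectorized patch dimension
    D = np.hstack([pos_patches, neg_patches, np.eye(d2)])
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)
```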

Suppose the target location has been estimated; the positive and negative training samples can then be represented in the overlapped form by
where N^{+} and N^{−} separately refer to the numbers of positive and negative samples,
where d_t^{(j)} refers to the j-th column of the dictionary, which is projected onto the constraint set, namely the ℓ_2 unit ball, and D_{t−1} serves as a warm start for D_t.
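The column-wise local update with projection can be sketched in the style of block coordinate descent; the sufficient-statistic matrices A and B used below are assumptions of this sketch, not notation from the text:

```python
import numpy as np

def update_dictionary(D, A, B):
    """One pass of block-coordinate descent over dictionary columns.
    A accumulates sum_i a_i a_i^T and B accumulates sum_i y_i a_i^T over the
    coded samples; each column gets a closed-form update and is projected
    back onto the unit l2 ball (the constraint set). D is the warm start."""
    D = D.copy()
    for j in range(D.shape[1]):
        if A[j, j] < 1e-12:                # skip unused atoms
            continue
        u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
        D[:, j] = u / max(1.0, np.linalg.norm(u))
    return D
```

Because only the columns touched by the current samples receive non-trivial statistics, this update is naturally local, in the spirit of the LU strategy described above.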

Algorithm 1: Incremental discriminative structured dictionary learning (IDSDL). Input: the previous dictionary D_{t0−1}, the positive and negative sample patches and the regularization constants λ_1 and λ_2. The algorithm sparsely represents the positive and negative sample patches, locally updates the corresponding dictionary columns with the warm start D_{t−1} and outputs the updated dictionary D_t.

An online visual tracking problem can be interpreted as a Bayesian recursive and sequential inference task in a Markov model with hidden state variables and is further divided into the cascaded estimation of a dynamical model and an observation model [. Given the observations y_{1:t} = {y_1, y_2, …, y_t}, the posterior of the state x_t is estimated recursively as p(x_t | y_{1:t}) ∝ p(y_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}, where p(x_t | x_{t−1}) refers to the dynamical model between the two consecutive states and p(y_t | x_t) denotes the observation model.

In the context of particle filtering, typically, a set of candidates (particles) with importance weights w_t^i ∝ w_{t−1}^i p(y_t | x_t^i) p(x_t^i | x_{t−1}^i) / q(x_t | x_{1:t−1}, y_{1:t}) is maintained. When the proposal distribution is chosen as q(x_t | x_{1:t−1}, y_{1:t}) = p(x_t | x_{t−1}), the weights become the observation likelihood, p(y_t | x_t).
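The weight update and resampling step can be sketched as follows; this is a generic particle-filter step under the dynamical-model proposal, with systematic resampling as an assumed (not stated) choice:

```python
import numpy as np

def pf_step(particles, weights, likelihood):
    """Importance weighting with the dynamical model as proposal: the weight
    update reduces to the observation likelihood, followed by normalization
    and systematic resampling to combat degeneracy."""
    w = weights * likelihood(particles)
    w /= w.sum()
    n = len(w)
    positions = (np.arange(n) + np.random.rand()) / n
    idx = np.searchsorted(np.cumsum(w), positions)
    return particles[idx], np.full(n, 1.0 / n)
```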

Ideally, a dynamical model p(x_t | x_{t−1}) should be able to fully describe the variation of the target in detail, yet in most practical cases, it can only be approximately parameterized. Typically, at time t, the target state is represented by the affine parameters x_t = (x_t, y_t, θ_t, s_t, α_t, φ_t), where (x_t, y_t)^T is the 2D translation vector. In a homogeneous coordinate system,

Based on the principle above, Ross et al. model each affine parameter independently by a Gaussian distribution around its previous state, i.e., p(x_t | x_{t−1}) = N(x_t; x_{t−1}, Ψ_0), where Ψ_0 is a vector whose elements are the corresponding variances of the affine parameters.
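Sampling candidates from this Gaussian dynamical model reduces to perturbing the previous affine state; psi0 below holds the per-parameter variances as described above:

```python
import numpy as np

def propagate(state, psi0, n_particles, rng=None):
    """Draw candidate affine states from a diagonal Gaussian centred at the
    previous state; psi0 is the vector of per-parameter variances."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal((n_particles, state.size)) * np.sqrt(psi0)
    return state + noise
```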

The support vector machine (SVM) is one of the most widely used classifiers in machine learning and pattern recognition applications. It attempts to find a separating hyperplane that maximizes the margin between two classes, where the margin is defined as the distance of the closest point to the hyperplane. Given a set of instance-label pairs {(x_k, y_k)}, x_k ∈ ℝ^n, y_k ∈ {+1, −1}, the linear SVM solves min_{w,b} (1/2)‖w‖^2 + C Σ_k max(0, 1 − y_k(w^T x_k + b)).
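A toy linear SVM trained by sub-gradient descent on the hinge loss illustrates the classifier; a practical system would use an off-the-shelf solver such as LIBLINEAR, so this is only an illustrative sketch:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + C * sum_k hinge(y_k * (w.x_k + b)) by
    sub-gradient descent over full passes of the data."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                 # points inside the margin
        w -= lr * (w - C * (y[mask, None] * X[mask]).sum(axis=0))
        b += lr * C * y[mask].sum()
    return w, b
```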

We consider the discriminative observation modeling on a patch level, given the patch-based sparse coefficients.

To adapt to the variation of target appearance in a timely manner while maintaining its original invariance, a progressive classification is applied. At time t_0, the voting function is processed twice: sequentially, w_{t−1} = w_{t0−1} and w_{t−1} = w_0 are separately set. The former locates the target as an intermediate result based on its latest appearance model, while the latter locally refines the location with respect to the original appearance. Correspondingly, the dynamical modeling is also conducted twice to formulate a step-wise classification, similar to [
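The K-combined voting can be sketched as below; the number of random groups (n_groups) and the score-matrix layout are assumptions of this illustration rather than details from the text:

```python
import numpy as np

def k_combined_vote(scores, K, n_groups=20, rng=None):
    """scores: (n_classifiers, n_candidates) patch-classifier confidences.
    Randomly draw groups of K classifiers, sum their scores per candidate,
    and return the candidate index achieving the overall maximum."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_cls, _ = scores.shape
    best_val, best_idx = -np.inf, 0
    for _ in range(n_groups):
        pick = rng.choice(n_cls, size=K, replace=False)
        combined = scores[pick].sum(axis=0)
        j = int(np.argmax(combined))
        if combined[j] > best_val:
            best_val, best_idx = combined[j], j
    return best_idx
```

Single voting corresponds to using all classifiers in one fixed group, i.e., a special case of this scheme.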

Once the current target location is estimated, the model is updated accordingly. In this paper, the update process is twofold. The first part adapts the dictionary using the IDSDL algorithm proposed in the last section. Then, the positive and negative samples are drawn around the current estimated location of the target. Based on the learned results, sparse coefficients are obtained to train the SVM classifiers so that updated models are generated. The details of the IDSDL algorithm are given in the last section, while the classifier training is described as follows.

To establish an efficient discriminative model at time t, the previous model w_{t−1} is used for the current patch. For each

The proposed algorithm is summarized in Algorithm 2 based on the descriptions above.

Algorithm 2: The proposed IDSDL-VT tracking algorithm. Input: the initial frame and target region, the particle number, the sampling radii and the regularization constants λ_1 and λ_2. For each frame, candidate states are drawn by the dynamical model, their patches are sparsely coded with the learned dictionaries, and the SVM classifier group with K-combined voting estimates the target state in the two-stage (coarse, then locally refined) manner described above; finally, the dictionary D_t and the classifiers are updated. Output: the target state at each time t.

In Algorithm 2, the proposed dictionary learning algorithm is the most computationally expensive part, while the online training and classification process does not take much running time, since the efficient linear SVM is applied. The dynamical modeling process takes the least running time, owing to its straightforward formulation. To accelerate the process, we apply a C implementation of elastic net regularization proposed by Mairal

In this section, we present experiments on test videos to demonstrate the efficiency and effectiveness of the proposed algorithm.

The proposed tracking algorithm, IDSDL-VT, is implemented in MATLAB and C/C++ and runs at about 1.3 fps on a 3.4 GHz dual core PC with 8 GB of RAM. For parameter configuration, in each frame, each target region is normalized to 24 × 24, and the patch size is 12 × 12. The regularization constants, λ_1 and λ_2, in
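Cropping the overlapped patch grid from the normalized region might look as follows; the 50% overlap (step of 6 pixels) is an assumption of this sketch, since the exact stride is not stated here:

```python
import numpy as np

def extract_patches(region, patch=12, step=6):
    """Crop a grid of overlapped patches (12x12 by default) from a normalized
    target region (24x24 here) and vectorize each patch as a column."""
    h, w = region.shape
    cols = [region[r:r + patch, c:c + patch].reshape(-1)
            for r in range(0, h - patch + 1, step)
            for c in range(0, w - patch + 1, step)]
    return np.stack(cols, axis=1)
```

With a 24 × 24 region, 12 × 12 patches and a 6-pixel step, this yields a 3 × 3 grid of nine overlapped patches, each vectorized to length 144.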

To evaluate the efficiency of the proposed algorithms, nine benchmark video sequences, most of which are publicly available, are used under the challenges of lighting and scale changes, out-of-plane rotation and partial occlusion. Comparatively, the proposed tracker is evaluated against state-of-the-art algorithms, including Frag [

It should be noted that the settings of the particle number and the regularization constants above follow the setup of classical online visual tracking algorithms, for a fair performance comparison [

Qualitative analysis and discussions are provided as follows. The visual challenges include heavy occlusion, illumination change, scale change, fast motion, cluttered background, pose variation, motion blur and low contrast.

The two test sequences,

The sequences,

The sequences,

The two video sequences,

Besides qualitative evaluation, quantitative evaluation of the tracking results is also important for performance assessment. Similar to other classical works, two performance measurements are applied to compare the proposed tracker with the other reference trackers: quantitative comparisons using average center errors (CE) based on Euclidean distance and the PASCAL [
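The PASCAL overlap criterion between a tracked box and the ground-truth box can be computed as follows, with boxes given as (x, y, w, h):

```python
def overlap_rate(a, b):
    """PASCAL overlap: area(A ∩ B) / area(A ∪ B) for axis-aligned boxes
    given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)
```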

Moreover, the average center error (ACE) and average overlap rate (AOR) are defined as:
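In the usual form, with c_t and R_t the tracked center and region at frame t and c_t^G, R_t^G their ground-truth counterparts over T frames (a reconstruction of the standard definitions, not the original typesetting):

```latex
\mathrm{ACE} = \frac{1}{T}\sum_{t=1}^{T}\bigl\| c_t - c_t^{G} \bigr\|_2 ,
\qquad
\mathrm{AOR} = \frac{1}{T}\sum_{t=1}^{T}
\frac{\operatorname{area}\bigl(R_t \cap R_t^{G}\bigr)}{\operatorname{area}\bigl(R_t \cup R_t^{G}\bigr)} .
```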

To demonstrate the proposed improvement in the voting scheme, a comparison between single voting and K-combined voting is also drawn on the benchmark sequences. The settings are the same as the ones above. It can be found that the proposed algorithm with K-combined voting is better in both center error evaluation and overlap rate evaluation. Even with single voting, the proposed tracker can perform better than other classical trackers on the overlap rate in most cases. It should be noted that, on the one hand, the performance with a low

Overall, it can be concluded that the proposed tracker achieves better performance than the other state-of-the-art algorithms.

In this paper, a dictionary learning algorithm called incremental discriminative structured dictionary learning (IDSDL) is proposed to learn from positive and negative samples, joined to construct a structured dictionary with a newly established randomly permuted unit matrix for sparse representation. Each test sequence corresponds to a dictionary during the tracking process. Corresponding to the patch settings above, the selected learned dictionaries of sequence

Moreover, the average computation time per frame of the proposed IDSDL algorithm for different target normalized sizes is provided in

To demonstrate the potential application of the proposed algorithm, we evaluate its relay tracking performance in visual sensor networks. The test dataset is from the CATproject (

In order to make use of the visual information acquired as much as possible, we establish the tracking process with a shared dictionary across all the cameras, shown in

Quantitatively, we evaluate the lifecycle of the target once it is detected in one camera. In this paper, the lifecycle of a target is defined as:
L = N_e / N_c, where N_e denotes the number of frames in which the target is correctly tracked and N_c denotes the total number of frames in which the target is present in the camera view.

Vertically, it can be found from the table that the proposed algorithm with dictionary sharing achieves higher values. This is mainly because of the satisfactory online tracking performance presented above. Moreover, based on dictionary sharing, more visual information about the target can be obtained before the target enters the specific scene. Thus, the performance with dictionary sharing is better than the one without sharing. Horizontally, all the values in Camera 2 and Camera 4 are lower than their counterparts in other columns. This is due to the background modeling in Camera 2 and Camera 4. In Camera 2, there are other moving objects as the target enters the scene, and in Camera 4, there is light variation. The foreground target could not be detected correctly and in time, which leads to a relatively poor lifecycle performance. Moreover, the value with dictionary sharing in Camera 2 is not much higher than that without sharing, yet the opposite case occurs in Camera 3. This is because, due to the camera view point, the person's initial pose in Camera 2 is much different from those in the other cameras. Thus, the corresponding dictionaries learned in the other cameras could not provide much effective information about the target. It should be noted that the Kalman tracking method heavily and continuously relies on the background modeling performance. It cannot track the target until its foreground is re-detected, and due to the variation of the foreground area, the target is not correctly labeled in some frames. Comparatively, our proposed tracker only relies on the foreground information once, for location initialization, and with the dictionary sharing across the network, it achieves a better performance.

It can be found that our proposed tracker could perform more favorably than the other state-of-the-art trackers comprehensively in both qualitative and quantitative evaluation. We present some justifications here. For discriminative tracking algorithms, discrimination of the target from the background is critical. One of our contributions is the proposed IDSDL algorithm for dictionary learning. The proposed dictionary contains both a positive template set and negative samples, and during training, the LU strategy is proposed to only update its partial columns. Furthermore, the positive samples are used to update the positive part, and their negative counterparts are used to generate the learned negative part.

Confidence is also important for observation estimation. Our proposed KCV voting method combines the classifiers randomly and outputs their estimated result by a maximal scheme. Since the candidates are also generated randomly, the random combination could also be viewed as a supplementary re-sampling step from the particle filtering aspect. A limited combination of random sample points are still random, because the joint distribution of single Gaussian variables is still Gaussian. Thus, statistically speaking, it improves the estimation generalization during the tracking process. The maximal scheme is a nonlinear superposition process, and it creates more confidence points given limited candidates.

It has been shown from the experiments that our proposed tracker is currently not sufficient for real-time processing. Currently, it is mainly a MATLAB implementation with some C/C++ MEX functions. Most of the processing time is spent on the dictionary learning and classifier training parts. It is certain that the processing could be several times faster in both single camera and visual sensor network cases if all the code were rewritten in C. The running speed could be higher still if the processing were parallelized or assisted with a graphics processing unit (GPU) coprocessor, since each patch can be independently processed before KCV without interleaving. Though the proposed tracker is slower, it achieves better accuracy in the evaluation.

This paper proposes an online discriminative visual tracking algorithm based on incremental discriminative structured dictionary learning and multiple linear classification with randomly-combined voting. Both qualitative and quantitative evaluations are conducted, which demonstrate that, on challenging image sequences, the proposed tracking algorithm enjoys better performance than the state-of-the-art algorithms. It is also shown that our proposed algorithm can be applied to relay tracking with satisfactory performance in visual sensor networks. Our future work might focus on the application of the proposed dictionary learning method to other classification problems. The proposed algorithm could also be extended to multiple object tracking or the tracking of a specific class (e.g., humans or their parts) in certain application environments.

This work is supported by the NSFC (no. 61171172 and no. 61102099), the National Key Technology R&D Program (no. 2011BAK14B02) and STCSM (Science and Technology Commission of Shanghai Municipality, China) under no. 10231204002, no. 11231203102 and no. 12DZ2272600. We also give our sincere thanks to the anonymous reviewers for their comments and suggestions.

The authors declare no conflicts of interest.

Workflow of the proposed algorithm. The proposed dictionary learning method, incremental discriminative structured dictionary learning (IDSDL), is detailed in Section 2, while the proposed affine warping, support vector machine (SVM) training and classification and K-combined voting (KCV) are detailed in Section 3.

Generation of proposed dictionary, which is composed of positive and negative template patches learned by IDSDL and trivial templates. The corresponding patches are cropped separately from the positive and negative samples around the target based on different sampling radii.

Relationship between
_{t}

Single voting and random combined voting. The former scheme is a special case of the latter one with

Qualitative evaluation results of eight algorithms on challenging tested sequences

Qualitative evaluation results of eight algorithms on challenging tested sequences

Qualitative evaluation results of eight algorithms on challenging tested sequences

Qualitative evaluation results of eight algorithms on challenging tested sequences

Center error (CE) evaluation for nine video clips. The proposed algorithm is compared with seven state-of-the-art methods: Frag [

Overlap rate (OR) evaluation for nine video clips. The proposed algorithm is compared with seven state-of-the-art methods: Frag [

Dictionary of

Relay tracking evaluation establishment. We assume that the cameras are connected by the local area network (LAN) with the computing server in the back-end. The videos acquired would be transmitted to the server without any time delay.

Tracking process with a shared dictionary across all the cameras.

Coefficients of positive and negative samples in sequence

Center error (pixels) and overlap rate of the tracking methods. The best three results are in bold, italicized and underlined fonts.

| Sequence | Metric | Frag | IVT | VTD | L1T | TLD | MIL | PLS | Proposed | Single Voting |
|---|---|---|---|---|---|---|---|---|---|---|
| Occlusion 1 | CE | 5.621 | 9.175 | 11.135 | 6.500 | 17.648 | 32.260 | – | – | – |
| | OR | 0.899 | 0.845 | 0.775 | 0.876 | 0.649 | 0.594 | – | – | – |
| Occlusion 2 | CE | 15.491 | 10.408 | 11.119 | 18.588 | 14.058 | 46.186 | – | – | – |
| | OR | 0.604 | 0.588 | 0.592 | 0.493 | 0.612 | 0.471 | – | – | – |
| Caviar 1 | CE | 5.699 | 45.245 | 119.932 | 5.593 | 48.499 | 47.393 | – | – | – |
| | OR | 0.682 | 0.277 | 0.278 | 0.704 | 0.255 | 0.268 | – | – | – |
| Caviar 2 | CE | 5.569 | 8.641 | 4.724 | 8.514 | 70.269 | 32.431 | – | – | – |
| | OR | 0.557 | 0.452 | 0.671 | 0.658 | 0.255 | 0.365 | – | – | – |
| Deer | CE | 92.089 | 127.467 | 171.468 | 25.652 | 66.457 | 20.198 | – | – | – |
| | OR | 0.076 | 0.217 | 0.039 | 0.412 | 0.213 | 0.510 | – | – | – |
| Car 11 | CE | 63.922 | 2.106 | 27.055 | 33.252 | 25.113 | 43.465 | – | – | – |
| | OR | 0.086 | 0.432 | 0.435 | 0.376 | 0.175 | 0.769 | – | – | – |
| David Indoor | CE | 76.691 | 13.552 | 7.630 | 9.671 | 16.146 | 64.335 | – | – | – |
| | OR | 0.195 | 0.525 | 0.625 | 0.602 | 0.448 | 0.278 | – | – | – |
| Singer | CE | 22.034 | 8.483 | 4.571 | 32.690 | 15.171 | 14.199 | – | – | – |
| | OR | 0.341 | 0.662 | 0.703 | 0.413 | 0.337 | 0.212 | – | – | – |
| Jumping | CE | 58.448 | 36.802 | 62.988 | 92.393 | 9.894 | 60.206 | – | – | – |
| | OR | 0.138 | 0.283 | 0.080 | 0.093 | 0.527 | 0.096 | – | – | – |
| Average | CE | 38.396 | 27.969 | 16.639 | 50.012 | 35.135 | 32.359 | – | – | – |
| | OR | 0.398 | 0.538 | 0.504 | 0.555 | 0.380 | 0.430 | – | – | – |

Running time, average center error and average overlap rate of different normalized patch sizes for

| Normalized size | Time per frame | ACE | AOR |
|---|---|---|---|
| 32 × 32 | 3.704 s | 1.826 | 0.821 |
| 24 × 24 | 1.192 s | 1.907 | 0.817 |
| 16 × 16 | 0.930 s | 4.022 | 0.783 |
| 8 × 8 | 0.433 s | 25.441 | 0.406 |

The lifecycle of different methods. The best results are in bold font.

| Method | Camera 1 | Camera 2 | Camera 3 | Camera 4 | Average |
|---|---|---|---|---|---|
| Kalman | 0.624 | 0.607 | 0.705 | 0.501 | 0.609 |
| Proposed (no dictionary share) | 0.806 | 0.709 | 0.753 | 0.641 | 0.727 |
| Proposed (dictionary share) | – | – | – | – | – |