Partial Block Scheme and Adaptive Update Model for Kernelized Correlation Filters-Based Object Tracking

Abstract: In visual object tracking, dynamic environments are a major challenge, and partial occlusion and scale variation are typical problems. We present a correlation filter-based object tracker built on a discriminative model. To attenuate the influence of partial occlusion, partial sub-blocks are constructed from the original block, and each of them operates independently. Scale variation is handled with a scale space built from a feature pyramid. We also present an adaptive update model that uses a weighting function to calculate a frame-adaptive learning rate. Theoretical analysis and experimental results demonstrate that the proposed method can robustly track drastically deformed objects. Although generating the partial blocks increases the computational cost, we present a sparse update approach that reduces this cost drastically for real-time tracking. The experiments were performed on a variety of sequences, and the proposed method exhibited better performance than the state-of-the-art trackers.


Introduction
Tracking the position of objects of interest in a sequence of video frames is a fundamental problem in computer vision research. Object tracking is an integral part of computer vision and is applied in various fields including robotics, surveillance systems, motion analysis, autonomous cars, unmanned aerial vehicles (UAVs) and human computer interaction (HCI). However, object tracking is still recognized as a difficult problem since the tracking environment contains various challenging factors, such as illumination variation, scale variation, occlusion, deformation, motion blur, fast motion and rotation. These factors significantly degrade the performance of object tracking. For that reason, minimizing the influence of environmental changes is an important issue in the development of robust trackers. There are many tracking algorithms that deal with the variety of environmental changes. The state-of-the-art tracking algorithms have tried to solve the problem by analyzing the cause of environmental changes using various classification approaches.
In this paper, we present a novel correlation filter-based object tracking algorithm that focuses on solving scale variation and partial occlusion problems. The proposed algorithm is based on a discriminative model tracker with a correlation filter. The kernelized correlation filter (KCF) tracker [21] has demonstrated outstanding performance for object tracking by drastically reducing computational cost using an efficient search based on the diagonalization property of a circulant matrix and a dual correlation filter (DCF). However, the KCF tracker is sensitive to environmental changes because it does not consider partial occlusion and scale variation, which degrades its performance.
Most tracking-by-detection algorithms consider only object translation [22], but the proposed algorithm deals with scaling and partial occlusion, as well as object translation. Partial occlusion is a significant problem that degrades the performance of object detection. We propose a robust KCF-based tracker to overcome the partial occlusion problem using a partial block scheme. The partial block scheme facilitates stable object tracking, even if the object is partially occluded. A robust tracker also needs a strategy for scale estimation to deal with changes of the object size. The scale space [22] creates an image pyramid and determines the most appropriate size of an object block. We also propose an adaptive update model using a weighting function, which improves the general update model used in the original version of the kernelized correlation filter [21]. The proposed adaptive update model uses a modified sigmoid function as the weighting function, so that an appropriate learning rate is calculated for each frame.
In summary, the proposed method is developed to deal with partial occlusion, scaling, illumination variation and deformation. Experimental results demonstrated that the proposed method exhibits a better performance than existing state-of-the-art algorithms for various test videos including illumination variation, scale variation, occlusion, deformation, motion blur, fast motion and rotation. Figure 1 compares the performances of the proposed method and state-of-the-art trackers. More specifically, the Tiger2 sequence, which includes partial occlusion, substantiates that the proposed partial block scheme successfully solves the partial occlusion problem. Freeman3 and Shaking substantiate that the proposed method handles the scale variation problem well. On the other hand, existing trackers do not properly respond to occlusion and scale variation problems.
Figure 1. Comparison of the proposed method with the state-of-the-art trackers: kernelized correlation filter (KCF) [21], discriminative scale space tracker (DSST) [22], fragment-based tracker (FRAG) [15], locality sensitive histograms tracker (LSHT) [37], multiple instance learning (MIL) [6], structured output tracking with kernels (STRUCK) [5] and tracking-learning-detection (TLD) [9]. The sequences include Tiger2, Freeman3 and Shaking from OTB-100.
The composition of this paper is as follows: Section 2 describes the technical background of tracking with related works. Section 3 presents the proposed method with the object translation estimation using the partial block scheme, the object scaling estimation using scale space and the adaptive update model. After summarizing experimental results in Section 4, Section 5 concludes the paper.
The generative model-based trackers model the appearance of an object of interest and use various models to represent the object. Among generative model-based trackers, incremental visual tracking (IVT) [11] uses principal component analysis (PCA) and applies an adaptive appearance update model to withstand lighting changes and variations. However, IVT is very sensitive to partial occlusion, where the object is partially obscured by other objects. The occlusion problem was improved by applying the probability continuous outlier model (PCOM) [12] based on IVT. The visual tracking by decomposition (VTD) tracker [13] extended particle filter tracking (PFT), and the L1-minimization tracker [14] employed sparse representation. Furthermore, the fragment-based tracker (FRAG) [15] employed local patches to ensure robustness against partial occlusion, and the circulant sparse tracker (CST) [16] combined circulant and sparse representations. In addition, the multi-task sparse learning tracker (MTT) [17] and the low-rank sparse trackers [18] belong to the generative model-based trackers.
The discriminative model classifies objects and background, and it learns the model directly. The ensemble trackers [1] combine multiple weak classifiers to form an ensemble structure. The online AdaBoost (OAB) tracker [2] employed discriminative feature selection and online boosting, and the online random forest (ORF) [3] learned and classified random forests online. STRUCK [5], using the kernel [6], online support vector machine (SVM) [4] and the multiple instance learning (MIL) tracker with Haar features also belong to the discriminative model. The weighted multiple instance learning (WMIL) tracker [7] improved the weighting of positive samples in MIL and reflected the weighting of samples when learning the classifier. The correlation filter-based tracker also belongs to the discriminative model. The minimum output sum of squared error (MOSSE) tracker [19] proposed an adaptive correlation filter, and the circulant structure with kernels (CSK) tracker [20] used dense sampling with the theory of circulant matrices and the fast Fourier transform (FFT). Furthermore, the kernelized correlation filter (KCF) tracker [21] applied linear and kernel ridge regression with histogram of oriented gradients (HOG) features for high-speed tracking. However, the KCF tracker can only be used for object translation estimation. The scale estimation problem was solved by the discriminative scale space tracker (DSST) [22], which estimates the translation and scale independently.

Discriminative Correlation Filter
In recent research, discriminative classifiers are the core component of modern trackers, and the discriminative model distinguishes the object from the surrounding environment [21] to effectively track the object of interest. To distinguish between the object and the surrounding environment, discriminative model-based trackers [1][2][3][4][5][6][7][8][9][10][19][20][21][22] learn from positive samples and negative samples. In discriminative model-based trackers, negative samples are considered particularly significant; a negative sample does not completely cover the object. In contrast, a positive sample is located closer to the location of the object and contains enough information to represent the object. Generally, a large number of negative samples increases the computational cost. Most trackers [3,4,6,8,9,32] employ random sampling methods to avoid high computational cost. However, correlation filter-based trackers [20][21][22] efficiently track objects using the circulant structure [20] and FFT to incorporate all samples without iterating over them. In addition, the dual correlation filter (DCF) was proposed in the literature [21]. The DCF performs linear multi-channel filtering and achieves a performance similar to that of a nonlinear kernel with very low complexity. Thus, CSK [20], KCF [21] and DSST [22] use dense sampling with all samples; nevertheless, they can perform object tracking in real time because the computational cost is not high. Real-time processing is one of the significant components in object tracking for various vision applications.

Proposed Method
Partial occlusion and scale variation are crucial problems in object tracking. Many researchers have tried to solve them, but they are still known to be difficult. In addition, real-time operation is significant because the target of object tracking is video. A real-time tracker that solves these problems can be applied to many vision applications. Therefore, our goal is to develop real-time object tracking that takes partial occlusion and scale variation into consideration.
The proposed work is based on KCF, which consists of (i) the detection part for describing objects and (ii) the training part for learning.The updated model is used in the detection part of the next frame, and the entire process repeats to track objects continuously.
In this paper, we propose object tracking using the partial block scheme and the adaptive update model. The partial block scheme is proposed to solve the partial occlusion problem, and the adaptive update model employs the weighted learning rate. The weighting is calculated from the reliability of the response of each block with a sigmoid function. If the reliability of the response is high, we use a higher learning rate. On the other hand, if the reliability of the response is low, we use a lower learning rate. Furthermore, a sparse update is performed to reduce the increased computation due to multiple partial blocks. Figure 2 shows the block diagram of the proposed method. The proposed method can be divided into four steps as follows:
1. Partial block separation: separating the partial blocks from the whole block of an object. Partial blocks can be adjusted in size and position according to the parameters.
2. Translation estimation: calculating the responses of all blocks using a kernelized correlation filter and then selecting the translation response map.
3. Scale estimation: estimating the object scale with the scale space and calculating the scale factor.
4. Adaptive model update: updating the model with the adaptive learning rate considering the reliability of responses.

Partial Block Scheme
We propose the partial block scheme to address environmental changes such as partial occlusion, partial illumination variation and partial blurring. Partial blocks are computed for each frame from the whole block, which is divided into four parts. The partial blocks are generated from the whole block using Equation (1), where $W_m$ and $W_n$ are the height and width of the whole block, $P_m$ and $P_n$ are the height and width of the partial blocks, and $d$ is a factor that adjusts the partial blocks' size. The sizes of the partial blocks are identical.
The centers of the partial blocks are obtained with Equation (2), where $W(x_c, y_c)$ is the center position of the whole block and $B_k(x_c, y_c)$ is the center position of each block, including the whole block and the partial blocks. $k$ is the index of the blocks, and ω is the factor that adjusts the location of the partial blocks. The block indices are 0-4: $B_0$ is the whole block, and $B_1$, $B_2$, $B_3$ and $B_4$ are the partial blocks. As shown in Figure 3, the positions of the partial blocks depend on the position of the whole block and can be adjusted with the parameter ω. Furthermore, the parameter $d$ sets the partial blocks to an identical size for the convenience of calculation. We proposed the partial block scheme to deal with the partial occlusion problem. Partial occlusion can occur in any block, including the whole block. However, the proposed method is designed to track with any block that is free of partial occlusion.
If the whole block is small or the parameter $d$ is unsuitable, the generated partial blocks can become too small. Such small partial blocks can disturb object tracking. Thus, we exclude very small blocks using Equations (3) and (4).
Here, τ is the threshold that decides whether partial blocks are excluded, and $B^w_k$ is the weighting for block $k$. If $P_m$ is smaller than τ, $B^w_1$ and $B^w_2$ mark excluded blocks. Furthermore, if $P_n$ is smaller than τ, $B^w_3$ and $B^w_4$ mark excluded blocks. If none of the partial blocks is large enough, only the whole block is used; $B^w_0$ is always 1.
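The exclusion rule described above can be sketched as follows. This is a minimal illustration, assuming that an excluded block receives weight 0 and a kept block weight 1; the function name `block_weights` is hypothetical, and the default threshold of 15 pixels is the value reported in the experimental setup.

```python
def block_weights(p_m, p_n, tau=15):
    """Return weights [B^w_0, ..., B^w_4]; 1 keeps a block, 0 excludes it.

    Assumption: the weights are binary. The text only states which
    blocks are excluded and that B^w_0 is always 1.
    """
    weights = [1, 1, 1, 1, 1]          # B^w_0 (whole block) is always 1
    if p_m < tau:                      # partial-block height below threshold
        weights[1] = weights[2] = 0
    if p_n < tau:                      # partial-block width below threshold
        weights[3] = weights[4] = 0
    return weights
```

For example, `block_weights(10, 20)` keeps the whole block and blocks 3 and 4, while `block_weights(5, 5)` falls back to the whole block only.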

Translation Estimation
The kernelized correlation filter (KCF) tracker [21] is a representative tracking algorithm based on correlation filters; it is superior in terms of performance and speed. The correlation filter tracker aims to calculate a filter $h$ that minimizes the squared error between the sample data and the regression data. The KCF tracker calculates the filter $h$ from sample data $x_i$ and regression data $y_i$ with:
$$\min_h \sum_i \left( h^\top x_i - y_i \right)^2 + \lambda \|h\|^2, \tag{5}$$
where λ is the regularization parameter. In the KCF tracker, we employ the kernel trick [38] for the non-linear regression function. Thus, non-linear filters can be as fast as linear correlation filters. The kernelized version of ridge regression is defined [21] as:
$$\alpha = (K + \lambda I)^{-1} y, \tag{6}$$
where $K$ is the kernel matrix, $I$ is the identity matrix, $y$ is the vector of regression data $y_i$ and $\alpha$ is the vector that represents the filter $h$ in the dual space [20,21].
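The $n \times n$ kernel matrix can be expressed as a circulant matrix [21], which is built from a base sample as follows.
$$C(x) = \begin{bmatrix} x_1 & x_2 & x_3 & \cdots & x_n \\ x_n & x_1 & x_2 & \cdots & x_{n-1} \\ x_{n-1} & x_n & x_1 & \cdots & x_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & x_4 & \cdots & x_1 \end{bmatrix} \tag{7}$$
Because of the periodic structure, shifting the signal $n$ times reproduces the original signal. In Equation (7), the first row $\{x_1, \cdots, x_n\}$ is the base sample, and the cyclically-shifted rows are virtual samples.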
The kernel matrix can be diagonalized by the discrete Fourier transform (DFT), and the kernel ridge regression solution is defined [21] by:
$$\mathcal{F}(\alpha) = \frac{\mathcal{F}(y)}{\mathcal{F}(\kappa^{xx}) + \lambda}. \tag{8}$$
Equation (8) is a closed-form solution, which is very efficient; it uses only the fast Fourier transform (FFT) and element-wise operations [20]. The kernel correlation $\kappa$ is calculated as in [21], where $\odot$ indicates the element-wise product and $\kappa^{xx}$ is the kernel correlation of $x$ with itself; it can be computed quickly with FFT [20]. $\mathcal{F}$ denotes the Fourier transform, computed with FFT, and $\mathcal{F}^{-1}$ denotes the inverse transform.
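To make the closed-form solution concrete, the following sketch solves the kernel ridge regression in the Fourier domain for a one-dimensional signal and provides a helper to check the result against the explicit circulant system. It is only an illustration under stated assumptions: a linear kernel is used instead of the kernel of the paper, a naive DFT stands in for an FFT so that no external library is needed, and all function names are hypothetical.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out

def linear_kernel_correlation(x):
    """k[d] = inner product of x with x shifted cyclically by d."""
    n = len(x)
    return [sum(x[m] * x[(m + d) % n] for m in range(n)) for d in range(n)]

def train_alpha(x, y, lam=0.001):
    """Solve (K + lam * I) alpha = y element-wise in the Fourier domain."""
    k_hat = dft(linear_kernel_correlation(x))
    y_hat = dft(y)
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(y_hat, k_hat)]
    return [a.real for a in dft(alpha_hat, inverse=True)]

def apply_system(x, alpha, lam=0.001):
    """Multiply the explicit circulant matrix (K + lam * I) by alpha."""
    n = len(x)
    k = linear_kernel_correlation(x)
    return [sum(k[(i - j) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
            for i in range(n)]
```

Calling `apply_system(x, train_alpha(x, y))` should reproduce `y` up to floating-point error, which shows that the element-wise division replaces the matrix inversion.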
In this paper, we compute the kernel ridge regression solution for all blocks. Here, $\hat{\alpha}$ denotes the update model of $\alpha$, $R$ denotes the response map and $R_0$ is the response map for the whole block. $R_1$, $R_2$, $R_3$ and $R_4$ are the response maps for the partial blocks. Furthermore, we weight each response map and pick the suitable response map. The index for all blocks is $k$, and $k^*$ is the index of the picked response map, that is, the one that includes the highest value of all response maps. As mentioned before, $B^w_k$ is the weighting for the blocks. The position of the highest value in the picked response map gives the translated position of the object. Then, we can calculate the center position of the picked block with $\Delta x$ and $\Delta y$ by:
$$\hat{B}_k(x_c, y_c) = B_{k^*}(x_c + \Delta x,\; y_c + \Delta y).$$
We recalculate the central position of the new whole block using $\hat{B}_k(x_c, y_c)$.
Here, $\hat{W}(x_c, y_c)$ is the updated center position of the whole block, and $\eta^*$ is the scale factor. Then, in the next frame, we can obtain new partial blocks using Equation (2).
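The selection of the response map can be sketched as below. This is a hedged illustration: it assumes that each response map is a list of rows, that the block weights are multiplied with the response values before the maximum is taken, and that the displacement is measured from the center of the picked map. The function name `pick_response` is hypothetical.

```python
def pick_response(responses, weights):
    """Return (k_star, dy, dx) for the weighted response maps.

    k_star is the index of the map holding the highest weighted value;
    dy and dx are the offsets of that peak from the map center
    (an assumption about the coordinate convention).
    """
    best = None
    for k, (resp, w) in enumerate(zip(responses, weights)):
        if w == 0:                       # excluded block
            continue
        for row, values in enumerate(resp):
            for col, value in enumerate(values):
                score = w * value
                if best is None or score > best[0]:
                    best = (score, k, row, col)
    _, k_star, row, col = best
    height = len(responses[k_star])
    width = len(responses[k_star][0])
    return k_star, row - height // 2, col - width // 2
```

The returned offsets play the role of $\Delta y$ and $\Delta x$ when the center of the picked block is moved.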

Scale Estimation
The translation estimation tracks the horizontal and vertical movements of objects. Tracking only the translation of the object therefore limits the tracking performance. The DSST [22] proposed a scale space for accurate scale estimation. The scale space expresses the data in 3 dimensions; the size is $P_m \times P_n \times D$. Here, $P_m$, $P_n$ and $D$ are the height, width and dimension, respectively. We compose the image pyramid for scale estimation around $\hat{W}(x_c, y_c)$, and the pyramid consists of images of various sizes.
The scale factors $\eta_l$ range from large values to small values to compose the image pyramid. $s$ is the factor for the scale step. The index $l \in \{1, \ldots, D\}$ runs over the dimension of the scale space. After the image pyramid has been composed, we calculate the scale response as in [22].
Here, $Z_l$ denotes the object, $A_l$ is the desired output and $S$ is the kernel correlation result. The scale response β has $D$ values, and the index of the highest value gives the scale factor $\eta^*$.
The picked scale factor $\eta^*$ is multiplied by the size and center position of all the blocks.
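The scale factors and the selection of $\eta^*$ can be illustrated as follows. The exact formula for $\eta_l$ is not recoverable from the text, so this sketch assumes the symmetric exponent layout used by DSST [22], ordered from large to small as described above; the function names are hypothetical, and the defaults $s = 1.02$ and $D = 33$ are the values reported in the experimental setup.

```python
def scale_factors(step=1.02, dims=33):
    """Return dims scale factors from largest to smallest, centered on 1.

    Assumption: eta_l = step ** (half - l) for l = 0, ..., dims - 1,
    with half = (dims - 1) // 2, following the DSST convention.
    """
    half = (dims - 1) // 2
    return [step ** (half - l) for l in range(dims)]

def best_scale(factors, beta):
    """Pick the scale factor whose scale response beta is highest."""
    index = max(range(len(beta)), key=beta.__getitem__)
    return factors[index]
```

With an odd number of scales, the middle factor equals 1, so an unchanged object size is always among the candidates.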

Adaptive Update Model
In the tracking process, the attributes of objects change constantly, although most objects maintain continuity with the previous frame. We therefore propose an adaptive learning rate that is based on the reliability of the response map. The peak-to-sidelobe ratio (PSR) [19] of the response is used as the reliability of the response map; it reflects the relationship between the main lobe and the surrounding side lobes. The weighting function then converts this reliability into the adaptive learning rate, where $v_1$, $v_2$ and γ are the parameters of the weighting function and $\rho_m$ controls the maximum learning rate.
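A reliability measure of this kind can be sketched as below. The exact PSR expression of the paper is not recoverable from the text, so this is an assumption: the sketch uses the peak, mean and standard deviation of the whole response map and adds the regularization parameter ψ to the denominator. The function name `psr` is hypothetical, and the default ψ = 1/14 is the value reported in the experimental setup. The weighting function that maps the PSR to a learning rate is not sketched because its form is not given here.

```python
import statistics

def psr(response, psi=1 / 14):
    """Peak-to-sidelobe ratio of a 2-D response map (list of rows).

    Assumption: PSR = (peak - mean) / (std + psi), computed over the
    whole map; psi keeps the ratio finite for flat responses.
    """
    values = [v for row in response for v in row]
    peak = max(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return (peak - mean) / (std + psi)
```

A sharp single peak yields a higher value than a broad, blurred response, which matches the intended use as a reliability score.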
Figure 4 shows the response maps for two different values of the adaptive learning rate. The reliability of the response map determines the adaptive learning rate value. Specifically, the higher the rate value, the more the current frame influences the update model, and vice versa. The adaptive update model is defined per block, so the number of learning rates ρ depends on the number of blocks. $\hat{\alpha}_t$ is the update model for the current frame, and $\hat{\alpha}_{t-1}$ is the update model for the previous frame. $\alpha_t$ is the kernel regression solution calculated in the current frame. The adaptive update model is applied identically to the models for translation estimation and scale estimation.
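The role of the learning rate can be illustrated with a linear interpolation between the previous model and the current solution, which is the update form used by the original KCF [21]. That the proposed model has exactly this form with a per-block rate is an assumption; the function name `update_model` is hypothetical.

```python
def update_model(prev_model, solution, rate):
    """Blend the previous model with the current solution.

    Assumption: new = (1 - rate) * prev + rate * solution, element-wise,
    where rate is the adaptive learning rate of the block.
    """
    return [(1.0 - rate) * p + rate * s
            for p, s in zip(prev_model, solution)]
```

A rate of 0 keeps the previous model unchanged, and larger rates move the model toward the solution of the current frame.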
For real-time tracking, we assume that when the peak position of the response is at the center, the update model does not need to be recalculated for the next frame. We therefore employ a sparse update for real-time object tracking. The sparse update is controlled by a parameter τ and a flag δ; when δ is true, we skip the model update for efficiency. Note that this τ is the sparse update parameter and is distinct from the threshold used to exclude small partial blocks.
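The skip decision can be sketched as below. The defining equation of δ is not recoverable from the text, so this is one possible reading: δ is true when the peak lies within τ pixels of the center of the response map, measured as the larger of the two absolute offsets. The function name `skip_update` is hypothetical, and τ = 0 is the value reported in the experimental setup.

```python
def skip_update(dy, dx, tau=0):
    """Return the flag delta for the sparse update.

    Assumption: delta is True when the peak offset (dy, dx) from the
    center of the response map does not exceed tau in either direction.
    """
    return max(abs(dy), abs(dx)) <= tau

# Usage: count how many updates are skipped for a list of peak offsets.
offsets = [(0, 0), (1, 0), (0, 0), (-2, 3)]
skipped = sum(skip_update(dy, dx) for dy, dx in offsets)
```

With τ = 0, the update is skipped only when the object did not move between frames.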

Experiments
We evaluated the proposed tracker against state-of-the-art trackers such as KCF [21], DSST [22], FRAG [15], LSHT [37], MIL [6], STRUCK [5] and TLD [9] for quantitative performance evaluation. The experiment was conducted with challenging sequences in OTB-100 [39], which cover various attributes of natural sequences. Furthermore, we conducted experiments using one-pass evaluation (OPE). OPE is a general performance evaluation method used by the object tracking benchmark (OTB) [39].

Parameters and Experimental Setup
The factor $d$ that adjusts the partial blocks' size was set to two. The factor ω that adjusts the location of the partial blocks was set to 0.5. The threshold τ for excluding partial blocks was 15 pixels. The regularization parameter λ was 0.001. We used the scale step $s = 1.02$ and the dimension of the scale space $D = 33$. The regularization parameter ψ for the adaptive learning rate was 1/14. $v_1$, $v_2$ and γ were set to 10, 0.5 and 1.5. $\rho_m$ was set to 0.03 for the maximum learning rate. The sparse update parameter τ was set to zero. In addition, we used the histogram of oriented gradients (HOG) [40] as a feature to represent images. We conducted the experiments using MATLAB R2017b on an Intel Core i7-2600 3.40-GHz CPU with 16 GB RAM.

Quantitative Evaluation
We calculated the center location error (CLE) for quantitative performance evaluation; it is the Euclidean distance between the center location of the ground truth and the center location estimated by the object tracker. In the Euclidean distance, $x^n_b$ and $x^n_y$ are the center location estimated by the tracker, and $x^g_b$ and $x^g_y$ are the center location from the ground truth. N means the total number of pixels in the bounding box. If the center positions are close, the tracker obtains a lower value, which means a good performance of the tracker. A frame was counted as a success when the CLE was within the threshold of 20 pixels and as a failure otherwise. Thus, the precision depended on the CLE results and the threshold.
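The two measures can be sketched as follows, assuming that the centers are given as (x, y) pairs per frame and that the precision is the fraction of frames whose error is within the threshold. The function names are hypothetical.

```python
import math

def center_location_errors(estimated, ground_truth):
    """Euclidean distance between estimated and true centers, per frame."""
    return [math.hypot(ex - gx, ey - gy)
            for (ex, ey), (gx, gy) in zip(estimated, ground_truth)]

def precision(errors, threshold=20.0):
    """Fraction of frames whose center location error is within threshold."""
    return sum(e <= threshold for e in errors) / len(errors)
```

Sweeping `threshold` over a range of pixel values yields the points of a precision plot.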
As another measurement, the success rate was defined as:
$$\text{success rate} = \frac{|r_t \cap r_a|}{|r_t \cup r_a|},$$
where $r_t$ and $r_a$ are the bounding boxes of the ground truth and of the result estimated by the tracking algorithm. $\cap$ and $\cup$ indicate intersection and union. The function $|\cdot|$ means the number of pixels in the bounding box. The higher the success rate, the larger the overlap between the estimated bounding box and the ground truth. The success rate varied according to the threshold, and the threshold used in the experiment was 0.5. We can draw a success rate graph considering all thresholds and define the area under this graph as the area under the curve (AUC).
Then, we conducted experiments for all sequences in OTB-100. Table 1 shows the average performance evaluation results of the proposed method and the state-of-the-art trackers. The scores of the proposed method were highest for all measurements. The precision score was higher than KCF [21] and STRUCK [5], and the success rate and AUC were higher than DSST [22]. Experimental results demonstrated the effectiveness of the proposed method compared to existing trackers. The results showed that the proposed tracker was suitable for sequences including a variety of environments. The proposed tracker can be applied to a variety of vision applications such as robotics, surveillance systems, motion analysis, autonomous cars, unmanned aerial vehicles (UAVs) and human computer interaction (HCI), as mentioned above.
The precision plots and success plots for all experimental sequences with the proposed method and the state-of-the-art trackers are shown in Figure 5. The proposed method performed well across all thresholds, and the translation and scale estimation of the proposed method worked suitably. In particular, the success rate was generally higher than the precision, which means the bounding box estimated by our tracker overlapped more with the bounding box from the ground truth.
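The overlap-based measures can be sketched as below. The sketch assumes axis-aligned boxes given as (x, y, width, height), counts a frame as successful when its overlap exceeds the threshold, and approximates the AUC by averaging the success rate over 21 evenly spaced thresholds between 0 and 1; the number of thresholds and the function names are assumptions.

```python
def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(overlaps, threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)

def auc(overlaps, steps=21):
    """Mean success rate over evenly spaced thresholds in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(success_rate(overlaps, t) for t in thresholds) / steps
```

Plotting `success_rate` against the threshold gives the success plot, and `auc` summarizes that curve in a single number.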
In addition, an experiment was conducted to combine the components in various manners using KCF as a baseline algorithm. The components were the partial blocks (PB), scale pyramid (SP), adaptive update (AU) and sparse update (SU). Figure 6 shows the precision plots and success plots for each combination. Some combinations that do not consider scale variation, such as PB and AU, degraded the tracking performance, and the proposed method combining all components performed better than any other combination.
OTB-100 contains sequences with attributes such as illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC) and low-resolution (LR). Figure 7 shows the average success rate of the trackers for each attribute and demonstrates that the proposed method outperformed the existing trackers for all attributes. To assess real-time capability, we measured the frames per second (FPS) of the correlation filter-based trackers. KCF [21] and DSST [22] averaged 203 FPS and 43 FPS, respectively. The proposed method with the non-sparse update processed 35 FPS, whereas the version with the sparse update processed 46 FPS on average for 100 sequences. As a result, the proposed method was slower than KCF, but could be considered a real-time tracker. Moreover, the proposed method outperformed the baseline KCF and DSST in terms of the AUC by 14.45% and 7.25%, respectively. Figure 8 shows the tracking results for the experimental sequences in the OTB-100 database. From the top-left to bottom-right, the sequences in Figure 8 are Panda, Liquor, Freeman3, Walking2, Car1, Car24, Human8, Lemming, Box, Dog1, Coke, Vase, Skating1, CarScale, Singer1 and KiteSurf, respectively. The images contained various attributes, and the results of the proposed method were identified by a red bounding box. As intended, the proposed method was responsive to partial occlusion and scale variation.

Conclusions
In this paper, we proposed a kernelized correlation filter-based visual object tracking algorithm using the partial block scheme and adaptive update model in the scale space. The proposed method accurately estimated translation and scale using the discriminative model. The proposed adaptive update model used a weighting function, which can be expressed as a combination of sigmoid and gamma functions, to calculate the learning rate, and the sparse update reduced the computational cost of calculating partial blocks for real-time tracking.
Various experiments were conducted to measure the performance of the trackers with the OTB-100 database. Experimental results validated that the proposed method outperformed existing state-of-the-art trackers in terms of CLE, precision, success rate and AUC.

Figure 2. The block diagram of the proposed method. We separate the partial blocks from the object in the frame and perform the translation estimation and scale estimation. Finally, we perform the model update with the weighting function or skip the update. PSR, peak-to-sidelobe ratio.

In the PSR definition, the terms indexed by the block $k$ are the mean, standard deviation and peak value of the response, and ψ is a regularization parameter. The weighting function calculates the adaptive learning rate $\rho_k$.

Figure 4. The adaptive learning rate according to the response. (a) Response map with a learning rate value of 0.0055; (b) response map with a learning rate value of 0.0289.

Figure 5. The precision plots and success plots over all 100 sequences. The legend of the plots indicates the state-of-the-art trackers. (a) Precision plots; (b) success plots.

Figure 8. Comparison of the experimental results of the proposed tracker and existing trackers.

Table 1. The average performance evaluation results of the trackers and the proposed method for 100 sequences in the OTB-100 database. The bold values mean the best performance. CLE, center location error.