A Hybrid Visual Tracking Algorithm Based on SOM Network and Correlation Filter

To meet the challenges of video target tracking, a long-term visual tracking algorithm based on a self-organizing map (SOM) network and correlation filters is proposed. Objects in different videos or images often have completely different appearances; therefore, a self-organizing map neural network, which mimics the signal-processing mechanism of neurons in the human brain, is used to perform adaptive, unsupervised feature learning. At the same time, a robust target tracking method is proposed based on multiple adaptive correlation filters that memorize the target appearance. The filters in our method have different updating strategies and cooperate to carry out long-term tracking. The first is the displacement filter, a kernelized correlation filter that combines contextual characteristics to locate and track targets precisely. Second, the scale filter is used to predict changes in target scale. Finally, the memory filter maintains a long-term memory of the target appearance and judges whether tracking has failed. If tracking fails, an incrementally learned detector is used to recover the target via a sliding window. Several experiments show that our method effectively solves tracking problems such as severe occlusion, target loss and scale change, and is superior to state-of-the-art methods in efficiency, accuracy and robustness.


Introduction
Object tracking has made remarkable progress in the past two decades [1][2][3], but it remains very challenging because factors such as target deformation, sudden movement, illumination change, severe occlusion and moving out of the field of view can cause large changes in appearance. To cope with these changes, neural networks with a memory function and correlation filters are widely used in object tracking. However, existing tracking algorithms based on neural networks and adaptive models cannot maintain a long-term memory of the target appearance, and updating the model in the presence of noise may cause the tracked target to drift.
Self-organizing map neural networks and correlation filters have attracted extensive attention in image research and visual tracking [4][5][6]. Their popularity is due to three important properties. First, the SOM learns unsupervisedly, which is closer to how biological neural networks in the human brain learn. Its most important characteristic is that it self-organizes, adaptively changing network parameters and structure by automatically discovering internal rules and essential properties of the samples [7,8]. Secondly, the correlation filter computes spatial correlation efficiently through the Fourier transform, thus achieving a high tracking speed. The correlation filter considers the context of the target object, providing more discriminability than appearance models [9,10] based on the target object alone. Even if the target object is severely occluded, the correlation filter can still use context clues to infer the target location. Third, the correlation filter learning problem can integrate multi-channel features, which makes the method more robust for extracting object edge information and for handling lighting and color changes.
Existing trackers based on correlation filters have achieved certain results in target tracking, but they have some defects. These methods use a moving-average scheme to update the filter at a high frequency to deal with the time-varying target appearance. This scheme can only maintain a short-term memory of the target appearance and may give rise to tracking drift in the presence of noise. Moreover, lacking a long-term memory of the target appearance, it is difficult to recover from tracking failure after drifting. As shown in Figure 1, the classic correlation filter trackers (KCF [6], STC [24]) drift due to a noisy update at the 4th frame of the video sequence; after 5 frames of severe occlusion, tracking fails and cannot recover. These algorithms are limited to predicting the location of the target without predicting its scale, and fail to solve the problem of updating the model when tracking fails, which limits the performance and application scenarios of the tracking algorithm. The ACFLST [23] algorithm proposed a correlation filter update scheme and performs strongly in recent tests of correlation-filter-based trackers. However, its performance is still not ideal: separating the displacement filter from the scale filter is inappropriate in real application scenarios, especially in complicated scenes, because the scale changes of tracked objects are often related to their position. Zhou et al. [25] explored a scale-adaptive KCF tracking algorithm with deep feature fusion, which alleviated the occlusion problem to a certain extent. Zhang et al. [26] used KCF-based scale estimation to track aerial infrared targets, improving the decline in KCF tracking accuracy under large changes in scale and rotation of such targets.
Figure 1. Tracking results of ACFLST [23], MUSTer [27], KCF [5], STC [24], Struck [4] and TLD [28] (X: no tracking output).
Object tracking can adopt a tracking-by-detection mode, which treats tracking as repeated detection within a local search window and usually separates the target from its surrounding background with an incrementally trained classifier to achieve accurate tracking. Existing methods collect positive and negative training samples from areas around the estimated target location and update the classifier with these samples. There may be two problems with this approach. The first is sampling uncertainty: small sample errors may accumulate and cause the tracked object to drift. Many methods have been proposed to reduce sampling ambiguity; their main idea is to identify noisy training samples intelligently when updating the classifier. Examples include ensemble learning [12,29], semi-supervised learning [30], multiple instance learning (MIL) [11] and transfer learning [31]. The second problem is balancing the stability and adaptability of the appearance model update. To balance the two, Kalal et al. [28] decomposed the tracking task into three modules (TLD): tracking, learning and detection. The tracking and detection modules promote each other: the tracker's results provide additional training samples, and the detector is updated with effective strategies. The online learned detector can be used to reinitialize the tracker in the event of a tracking failure, and a similar mechanism is used in [32][33][34] to recover the target object after failure. Zhang et al. [35] used multiple classifiers with different learning rates and designed an entropy metric to fuse multiple tracking outputs.
Our proposed object tracking algorithm uses an online trained detector to reinitialize the tracker, similar in spirit to [23,28,35]; however, in our method, the detector is applied to recover the drifted object only when the memory filter response falls below a certain threshold. This helps improve the efficiency of the system at run time: given the motion continuity of the target, we do not need to apply the detector in every image frame. In addition, to improve the accuracy of target position prediction, three position filters are adopted.
This paper proposes a long-time tracking algorithm based on multiple correlation filters. Each filter adopts a different updating strategy, and the filters cooperate to carry out long-time tracking. A re-detection mechanism based on a support vector machine is also designed: once the target re-enters the field of view, our algorithm can recapture it and continue tracking. Instead of relying on a single correlation filter [23] for target location estimation, our algorithm is based on a SOM network and multiple correlation filters: the SOM is used to extract target features, three complementary location correlation filters estimate the target location, a scale filter predicts scale changes and a memory filter decides when to start recovery after a tracking failure. The work most relevant to our proposed method is the MUSTer algorithm put forward by Hong et al. [27]. Both methods track with a memory-based correlation filter. The main differences between MUSTer and our algorithm are the feature extraction method used and the model of target appearance kept in memory: MUSTer uses a pool of local features to represent the target appearance, while our memory filter models the appearance of the target as a whole. Experiments show that it is often challenging to detect a sufficient number of locally reliable feature points for matching, especially when the target object has low resolution or unclear structure. Figure 1 shows an example: in the 4th frame, since very few feature points are detected and matched, the MUSTer tracker cannot recover the object after tracking fails. Our proposed algorithm is also relatively close to the algorithm of C. Ma [23], but ours uses three displacement filters and adopts a joint tracking method; this improvement helps target positioning and yields better results.
In this paper, three displacement filters, one scale filter and one memory filter are used to solve the stability and adaptivity problems in object tracking. First, we create three displacement filters to estimate the movement of the target. These three filters model different forms of the target object and encode its deformation. To locate the target object accurately, we use SOM features to express its basic characteristics; experimental results show that this feature representation enhances the ability to distinguish the target from the surrounding background. Secondly, we use pyramid features to learn a scale filter [23], combined with the displacement filters, to accurately obtain the scale of the tracked target. Third, we create a memory filter; for each tracking result, we compute a confidence score with the memory filter to judge whether tracking has failed. Once the confidence value is lower than a given threshold, the algorithm starts an online trained SVM detector to recover the target.
The essential contribution of our research is a competent object tracking model and algorithm that effectively uses SOM features, feature pyramids and correlation filters to achieve stable and efficient object tracking. Specifically, this work makes the following three contributions: 1. We extend our original preliminary work [36] by adding pyramid features and correlation filters, and use an effective update strategy for the object detection module [37] to achieve long-time effective tracking. 2. We systematically analyze how different feature types of the tracked object and the size of the surrounding context area influence the design of the SOM network and correlation filters in complex scenes. 3. We discuss and compare the performance of this algorithm against related work [27] in detail, evaluating the algorithm with extensive tests on the OTB-50 [38] and OTB-100 [39] datasets and other challenging video sequences (VOT2020 [40], UAV123 [41], LaSOT [42] and NFS [43]).

Method Overview
The goal of the object tracking algorithm proposed in this paper is to use a SOM and multiple correlation filters to deal with the following challenges in visual tracking: (1) obvious changes in appearance over time; (2) changes in scale; (3) recovering the target from tracking failure. First, existing algorithms based on a single correlation filter [23] cannot achieve these goals, because it is tough to strike a balance between stability and adaptability using only one filter. Secondly, although much work has addressed the challenge of scale prediction [17,24,44], it is still an unresolved problem, because slight errors in scale estimation cause rapid degradation of the appearance model. Third, it remains a challenge to determine when a tracking failure occurs and to re-detect and track the target afterward. In our algorithm, we use three displacement filters at different levels, a scale filter and a memory filter to solve these problems. Figure 2 shows the construction of the correlation filters for visual tracking. The displacement filters A T1 , A T2 and A T3 model and estimate different forms of the target, the scale filter A S evaluates the scale of the tracked object, and the long-time memory filter A L keeps a long-time memory of the target appearance to estimate the confidence of every tracking result. Figure 2. SOM feature extraction and correlation filters. The translation filters A T1 , A T2 and A T3 with short-time memory adapt to the changing appearance of the target and its surrounding context. The long-time memory filter A L is conservatively learned to maintain a long-time memory of the target appearance. Figure 3 shows a schematic diagram of the tracking algorithm using the correlation filters.
The tracker is initialized in the 1st input frame: the SOM is trained on the specified object position to extract regional features, and the proposed correlation filters are learned. For subsequent frames, we first use the three displacement filters A T1 , A T2 and A T3 to obtain three target locations within the search window centered on the previous frame's position; the average of these three locations is our estimated target location. Once the estimated position is obtained, we use the scale filter A S to predict the change of the target scale, thereby determining the bounding box of the tracked target. For each tracking result, the long-time memory filter A L judges whether tracking has failed (whether the target confidence is lower than a set threshold T r ). If the tracker loses the target, the online detector is activated to recover the lost or drifting target. When the confidence of the re-detected object is greater than the update threshold T a , the long-time memory filter A L is updated first, and then A T1 , A T2 and A T3 are updated with a reasonable learning rate. Comparing our experiments with other classifiers, the support vector machine (SVM) obtains much better results than other algorithms on small training sets; SVM has excellent generalization ability and low requirements on data scale and distribution. Although the long-time memory filter A L could itself be used as a detector, it uses high-dimensional features and its computational load is large. To improve efficiency, we use an online trained SVM classifier as an additional detector. We update the detection module and the long-time memory filter A L with a reasonable learning rate, which preserves the target appearance over a long period of time.

Kernelized Correlation Filters-Based Tracker
Trackers based on correlation filters [17,45] have achieved very good results in recent evaluations [38,46]. The main idea of these works is to regress all cyclic shifts of the input features to a soft regression target, such as one generated by a Gaussian function. The cyclically shifted input features resemble densely sampled examples of the target appearance [6]. Since training a correlation filter does not require binary (hard-thresholded) samples, trackers using correlation filters effectively avoid the sampling dilemma that adversely affects most tracking-by-detection algorithms. In addition, by exploiting the redundancy in the shifted sample set, the Fast Fourier Transform (FFT) lets a large number of training samples be used efficiently to train the filter. This increase in training data helps distinguish the target from the surrounding background. This section explains in detail the derivation of the kernelized correlation filter.
Henriques et al. [6] use cyclic sampling of the target area, that is, dense sampling, to reduce the amount of calculation, which improves both efficiency and tracking accuracy. Different from the sparse sampling of other algorithms, the correlation filtering used in the proposed method does not strictly distinguish between positive and negative samples; instead, a permutation matrix cyclically shifts the target image block x. For a one-dimensional signal x = [x 1 , x 2 , . . . , x n ], the shift matrix is

P = [ [0, 0, . . . , 0, 1]; [1, 0, . . . , 0, 0]; [0, 1, . . . , 0, 0]; . . . ; [0, 0, . . . , 1, 0] ]. (1)

Applying P repeatedly chain-shifts the signal, and the shifted copies form the circulant matrix

X = C(x) = [x, Px, P 2 x, . . . , P n−1 x] T . (2)

All circulant matrices are diagonalized by the Discrete Fourier Transform (DFT):

X = F diag(x̂) F H , (3)

where F is the constant DFT matrix that transforms spatial-domain data into the frequency domain, x̂ denotes the DFT of x, and F H is the Hermitian transpose of F, also called the conjugate transpose, i.e., conjugate first and then transpose. The linear correlation filter f trained on an image block X of size M × N can be regarded as a ridge regression model that uses all cyclic shifts (horizontal and vertical) of x as training data. We assign each shifted feature a regression target score

y(m, n) = exp( −( (m − M/2)² + (n − N/2)² ) / (2 σ 0 ²) ),

where (m, n) is the shift along the horizontal and vertical directions. At the center of the target object, the score reaches its highest value y i = 1; as (m, n) moves away from the target center, the score drops quickly from 1 to 0. The kernel width σ 0 is a predefined parameter that controls the sensitivity of the scoring function.
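The circulant structure and its DFT diagonalization can be verified numerically. Below is a small NumPy sketch (1-D case; the variable names are our own) that builds C(x) explicitly, reconstructs it from the DFT of x, and checks that multiplying by C(x) amounts to a circular convolution computed with FFTs:

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# Circulant matrix C(x): column j is x cyclically shifted down by j,
# so C[i, j] = x[(i - j) mod n] and C @ y is the circular convolution x * y.
C = np.column_stack([np.roll(x, j) for j in range(n)])

# DFT matrix F, chosen so that F @ v == np.fft.fft(v)
F = np.fft.fft(np.eye(n))

# Diagonalization: C = F^{-1} diag(x_hat) F, with x_hat = F x
x_hat = np.fft.fft(x)
C_rebuilt = np.linalg.inv(F) @ np.diag(x_hat) @ F
assert np.allclose(C, C_rebuilt.real, atol=1e-8)  # C is real-valued

# Consequence (convolution theorem): circular convolution via FFT
y = rng.standard_normal(n)
assert np.allclose(C @ y, np.fft.ifft(x_hat * np.fft.fft(y)).real)
```

This diagonalization is what lets the ridge regression below be solved element-wise in the Fourier domain instead of inverting an n × n matrix.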
First, in the Fourier domain, the ridge regression solution for the circulant matrix X is

W = (X H X + λI) −1 X H y, (4)

where I is the identity matrix of size (M × N) × (M × N) and λ > 0 is a regularization term. Substituting the diagonalization of Equation (3), we obtain

X H X = F diag(x̂ * ⊙ x̂) F H ,

where the symbol ⊙ represents the Hadamard product, a matrix element-level multiplication in which elements at the same position are multiplied separately; all operations on diagonal matrices are element-level. Using the unitarity of the Fourier transform matrix, F F H = I, together with the construction rule of the circulant matrix and the convolution property of circulant matrices, Equation (4) reduces to an element-wise expression in the frequency domain. Since x̂ * and x̂ are in a conjugate relationship, each element of x̂ * ⊙ x̂ is a real number, and taking the conjugate of such a vector does not change its values. The objective function of the linear ridge regression used to train the correlation filter is

min W Σ i ( f(x i ) − y i )² + λ ||W||² , (15)

where f(X) = W T X is a linear estimator. The Fourier frequency-domain solution is

ŵ = (x̂ * ⊙ ŷ) ⊘ (x̂ * ⊙ x̂ + λ), (16)

where x̂ represents the Fourier transform of x, x̂ * is its complex conjugate, ⊙ is the Hadamard product and ⊘ denotes element-wise division. To strengthen the discriminative ability of the learned filter, Henriques et al. [5] introduced a kernel K(x, x′) = φ T (x) φ(x′), which trains the correlation filter in kernel space while keeping the computational complexity linear.
The kernelized correlation filter is computed as

f(z) = Σ i α i K(z, x i ), (17)

where α = {α i } is the dual variable of W. For shift-invariant kernels, such as RBF kernels, the dual coefficients α [20,47] can be found using the circulant structure in the Fourier domain:

α̂ = ŷ ⊘ (k̂ xx + λ), (19)

where k xx is the kernel correlation of x with itself; for a Gaussian kernel of width σ it can be evaluated in the Fourier domain as

k xx′ = exp( −(1/σ²) ( ||x||² + ||x′||² − 2 F −1 (x̂ * ⊙ x̂′) ) ). (20)

Since the algorithm only requires element-wise products, the FFT and the inverse FFT, its computational time complexity is O(n log n), where n is the number of input data.
Given a new input frame, we use the solution of Equation (19) to compute the correlation response map efficiently. We crop an image block z centered at the object position in the previous frame and then use the learned target template x̃ to compute the response map f in the Fourier domain:

f(z) = F −1 ( k̂ x̃z ⊙ α̂ ).

Finally, we search for the position of the maximum value of the response map f to locate the target.

Displacement Filter
When estimating the target position, we enlarge the input bounding box of the target object to include more context around the tracked target and provide more usable displacement features. Compared with tracking algorithms based on online learning from sparse samples [48][49][50] (random sampling around the estimated target position), our method is based on the correlation filter: the learning samples are dense, consisting of all cyclically shifted versions of the input features. This increase in training data helps distinguish the target from the background.

Scale Filter
Danelljan et al. [17] proposed a discriminative correlation filter for scale estimation. We similarly construct a feature pyramid of the target appearance centered on the estimated position and use it to train a scale filter. Unlike [17], our method does not use the predicted scale change to update the displacement filter A T . Let W × H be the size of the tracked target and S be the set of target scales. For each scale s ∈ S, an image area of size sW × sH centered at the estimated target position is cropped and rescaled to W × H. SOM features are then extracted from each sampled image block to form a multi-scale feature pyramid representation of the target. Let X s be the feature vector at scale s; the optimal scale s * of the target object is

s * = arg max s∈S max( f(X s ) ).

In tracking, our method first estimates the change in target displacement and then predicts the change in scale. This differs from existing tracking algorithms, which generally infer position and scale at the same time. For example, tracking based on particle filtering [51] uses random samples to approximate the distribution over position and scale changes, and gradient descent methods (such as Lucas-Kanade [52]) iteratively infer the locally optimal position and scale. Our algorithm breaks the tracking task into two independent subtasks, which both reduces the burden of densely evaluating the target state and avoids noisy updates of the displacement filter when the scale estimate is inaccurate.
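A sketch of the resulting scale search, assuming a helper score_fn(s) that returns the filter response map for the patch cropped at scale s (the toy response used below is a stand-in for real filter outputs):

```python
import numpy as np

def estimate_scale(score_fn, scales):
    """Pick the scale with the highest filter response:
    s* = argmax_{s in S} max f(X_s).
    `score_fn(s)` is assumed to return the response map for scale s."""
    responses = {s: score_fn(s) for s in scales}
    return max(responses, key=lambda s: responses[s].max())

# Scale set as in the paper: N = 21 pyramid levels with a 1.03 scale factor,
# i.e. s in {1.03**k : k = -10..10}.
a, N = 1.03, 21
scales = [a ** k for k in range(-(N // 2), N // 2 + 1)]

# Toy response: pretend the true scale is 1.03**2, so the response peaks there.
true_s = a ** 2
toy = lambda s: np.array([np.exp(-((s - true_s) ** 2) / 1e-3)])
assert abs(estimate_scale(toy, scales) - true_s) < 1e-9
```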
Figure 4 illustrates these strategies: random sampling by the particle filter tracker [51] (Figure 4a), iterative gradient descent as in Lucas-Kanade [52] (Figure 4b) and the correlation-filter-based decomposition into separate position and scale estimation [23] (Figure 4d). Experimental results (see the Ablation Study Section) show that the performance of our tracker is significantly better than an alternative implementation (CT-JOP) that uses the estimated scale change to update the displacement filter.

Long-Time Memory Filter
To adapt to changes in the target's appearance during tracking, the algorithm must update the pre-trained displacement filters over time. However, updating the filter by directly minimizing the output error over all tracking results would make the computational overhead very large [53,54]. The proposed algorithm instead uses a moving-average scheme to update the displacement filter:

x̂ t = (1 − η) x̂ t−1 + η x̂ t ′, (21)

α̂ t = (1 − η) α̂ t−1 + η α̂ t ′, (22)

where t is the index of the image frame, η ∈ (0, 1) is the learning rate, and x̂ t ′ and α̂ t ′ are the template and dual coefficients learned from the current frame. This method updates a position filter every frame, emphasizing model adaptation and a short-time memory of the target appearance, but only one of the three position filters is updated at a time, chosen in a cyclic order. Since this scheme is very effective at handling appearance changes, trackers using it [6,17] have achieved good performance in recent benchmark studies [38,46]. However, when the training samples are noisy, these trackers are prone to drift and cannot recover from tracking failures, because they lack a long-time memory of the target appearance. The update scheme of Equations (21) and (22) assumes that the tracking result of each frame is reliable enough to serve as a training sample for the correlation filter. This assumption does not hold in complex scenes, and such updates easily cause tracking drift. To solve this problem, we create a long-time memory filter to preserve the appearance of the target. To maintain the stability of tracking, we set a threshold T a and update the long-time memory filter conditionally: only when the target's confidence max( f(x) ) is greater than T a do we update it.
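The moving-average update and the cyclic selection over the three displacement filters can be sketched as follows (a minimal NumPy illustration; the class and its plain-array "models" are our own stand-ins for the learned filter templates and coefficients):

```python
import numpy as np

def update_filter(model, new_model, eta=0.01):
    """Moving-average update as in Equations (21)-(22):
    model_t = (1 - eta) * model_{t-1} + eta * new_model."""
    return (1.0 - eta) * model + eta * new_model

class TranslationFilterBank:
    """Three displacement filters updated in round-robin fashion:
    each frame, only one of the three models is refreshed."""
    def __init__(self, init_model, eta=0.01):
        self.models = [init_model.copy() for _ in range(3)]
        self.eta = eta
        self.turn = 0  # index of the filter to update next
    def update(self, new_model):
        i = self.turn
        self.models[i] = update_filter(self.models[i], new_model, self.eta)
        self.turn = (self.turn + 1) % 3

bank = TranslationFilterBank(np.zeros(4), eta=0.5)
bank.update(np.ones(4))  # only filter 0 moves toward the new model
assert np.allclose(bank.models[0], 0.5) and np.allclose(bank.models[1], 0.0)
```

Cycling the updates means each of the three filters retains the appearance from a different recent epoch, which is what lets them encode different forms of the target.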
The proposed algorithm uses the maximum value of the correlation response map as the confidence score, because it reflects the similarity between the tracked object and the learning template in the long-time memory correlation filter. Compared with the long-time memory method [55,56] that only uses the first frame as the target appearance, we conditionally update the long-time memory filter to improve its adaptive ability. This allows the long-time memory filter to adapt to a certain degree of time-varying target appearance.

Online Object Detector
The displacement filter A T captures the appearance of the target and is a short-time memory filter. We use contextual information around the target object to learn the filter. In order to reduce the boundary discontinuity caused by the cyclic shift, we weight each channel of the input feature by a two-dimensional cosine window. We use the SOM feature to learn the scale filter A S . Unlike the displacement filter A T , we directly extract features from the target area without considering the surrounding context, because considering the surrounding context does not provide information about the target scale change. We use a conservative learning rate to learn the long-time memory filter A L to maintain the long-time memory of the appearance of the target to determine whether tracking failure occurs.
Tracking failure is generally caused by severe occlusion or by the target moving out of the camera view. In our tracking algorithm, for each tracked target z, we use the memory filter A L to compute its confidence max( f A L (z)). Only when the confidence is lower than the predefined re-detection threshold T r do we activate the detector. This reduces the computational load during tracking and avoids running a sliding-window detector in every frame.
To ensure the operating efficiency of the system, we use an SVM as the detector instead of the long-time memory filter A L . We crop training samples at the estimated target position to train the SVM detector incrementally, assigning binary labels to these samples according to their overlap ratio [35]. To further reduce the computational workload, we only extract new training samples when the target appearance has changed. During training, a quantized color histogram is used as the feature representation: the image is converted to the CIE Lab color space and each channel is quantized to 5 bits (four equal intervals in each channel). To improve robustness against drastic illumination changes, we apply the non-parametric local rank transform [57] to the L channel.
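A quantized color histogram of the kind described can be sketched as follows (illustrative only: the input is assumed to be already converted to CIE Lab and scaled to [0, 1], and the bin count per channel is a tunable assumption rather than the paper's exact setting):

```python
import numpy as np

def quantized_color_histogram(lab_image, bins_per_channel=4):
    """L1-normalized joint histogram over quantized Lab channels.
    `lab_image` is assumed to be an HxWx3 float array in [0, 1]."""
    q = np.clip((lab_image * bins_per_channel).astype(int),
                0, bins_per_channel - 1)
    # Joint bin index over the three quantized channels
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(),
                       minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()  # L1-normalize so patches of any size compare

img = np.random.default_rng(2).random((8, 8, 3))  # stand-in for a Lab patch
h = quantized_color_histogram(img)
assert h.shape == (64,) and abs(h.sum() - 1.0) < 1e-9
```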

Method Implementation
As shown in Figure 3, the tracking algorithm proposed in this paper uses SOM features to train five correlation filters (A T1 , A T2 , A T3 , A S , A L ) for position estimation, scale estimation and long-time memory of the target appearance. We also build a re-detection module that uses an SVM detector to recover the target from tracking failures. Algorithm 1 summarizes the proposed tracking algorithm.

Algorithm 1: Object tracking algorithm based on SOM and correlation filter.
Data: starting position of the target b 0 = (x 0 , y 0 , s 0 ); filters A T1 , A T2 , A T3 , A S , A L
Result: estimated target location and scale b t = (x t , y t , s t )
1 According to the starting position b 0 , crop the image area X, extract SOM features, and train the SVM and the filters A T , A S , A L ;
2 while the video sequence is not over do
3   Calculate f A T (x) and estimate the target position (x t , y t ) in the next frame // position estimation;
4   Calculate f A S (x) and estimate the target scale s t of the next frame // scale estimation;
5   At position (x t , y t ), sample image area z according to scale s t ;
6   if max( f A L (z)) ≤ T r then
7     Start the SVM detection module;

The displacement filters A T1 , A T2 and A T3 combine context information to separate the tracked object from the background. Some methods [20,58] enlarge the target bounding box by a fixed ratio of 2.5 to include the surrounding context. Our experimental analysis shows that an appropriate increase of the context area also improves the tracking results. We initially enlarge the box by a factor of 2.8 and then consider the aspect ratio of the target bounding box. We also observe that when the target (such as a pedestrian) has a small width-to-height ratio, a smaller vertical enlargement introduces less unnecessary context in the vertical direction. For this reason, when the aspect ratio of the target is less than 0.5, we halve the enlargement in the vertical direction. To train the SVM detector, we densely sample a large window centered on the estimated target. Samples whose overlap ratio with the target position is greater than 0.5 are assigned a positive label +1; samples whose overlap ratio is less than 0.1 are assigned a negative label −1.
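The overlap-based labeling rule for the SVM training samples can be written directly (boxes are (x, y, w, h) tuples; samples with intermediate overlap are discarded, which is our reading of the rule):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_sample(sample_box, target_box, pos_thr=0.5, neg_thr=0.1):
    """+1 if overlap > 0.5, -1 if overlap < 0.1, None (discarded) otherwise."""
    o = iou(sample_box, target_box)
    if o > pos_thr:
        return +1
    if o < neg_thr:
        return -1
    return None

target = (10, 10, 20, 20)
assert label_sample((11, 11, 20, 20), target) == +1   # large overlap
assert label_sample((100, 100, 20, 20), target) == -1  # no overlap
assert label_sample((18, 18, 20, 20), target) is None  # ambiguous, discarded
```

Discarding the ambiguous middle band keeps label noise out of the incremental SVM updates.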
In this algorithm, the re-detection threshold T r is set to a low value of 0.20: when the confidence max( f A L (z)) falls below this value, the algorithm activates the SVM detection module. When the SVM detection module re-detects the target, the acceptance threshold T a is set to 0.4, and only a detection whose confidence is higher than this threshold is accepted as the target. Each detection result is retained during detection, because it is needed when relocating the target and reinitializing the tracking process. We also set the stability threshold to 0.4 and update the memory filter A L when the confidence is greater than this threshold, so as to keep a long-time memory of the target appearance. All thresholds are compared against the confidence score computed by the long-time memory filter A L . The regularization parameter is set to λ = 10 −4 . The Gaussian kernel width is set in proportion to the target size W × H: σ 0 = 0.15 × (W × H). The learning rate in Equations (21) and (22) is η = 0.01. For scale estimation, we use N = 21 feature pyramid levels and a scale factor of 1.03.
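For reference, the parameter values reported above can be collected into one configuration sketch (the dictionary keys are our own shorthand, not identifiers from the paper):

```python
# Parameter settings reported in the text, gathered in one place.
PARAMS = {
    "redetect_thr_Tr": 0.20,  # activate SVM detector below this confidence
    "accept_thr_Ta": 0.4,     # accept a re-detected target above this confidence
    "stability_thr": 0.4,     # update memory filter A_L above this confidence
    "lambda": 1e-4,           # ridge-regression regularization term
    "eta": 0.01,              # learning rate in Equations (21) and (22)
    "scale_levels_N": 21,     # feature-pyramid levels for scale estimation
    "scale_factor": 1.03,     # ratio between adjacent pyramid levels
}

def should_redetect(confidence, params=PARAMS):
    """True when the memory-filter confidence triggers the SVM detector."""
    return confidence < params["redetect_thr_Tr"]

assert should_redetect(0.19) and not should_redetect(0.21)
```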

Experiments Details
We evaluate our method using the standard visual tracking evaluation protocol, including the overlap success rate (OS), the distance precision rate (DP), and the OPE, TRE and SRE evaluation schemes. In OPE, the tracker is initialized in the first frame with the ground-truth location of the object and then run to the end of the sequence to obtain precision and success rates.
In the experiments, we followed the protocol of the benchmark study [38] and fixed the parameter values across all sequences. The tracking algorithm proposed in this paper is implemented in MATLAB. The computer environment is configured with 32 GB RAM and an Intel i7-4770 3.40 GHz CPU.

Overall Performance
In the experiments, the overlap success rate is computed at IoU = 0.5 and the distance precision rate at a threshold of 20 pixels. Table 1 shows the overlap success (OS), distance precision (DP) and average tracking speed (values marked in red are the highest; blue, the second highest). The results show that the OTB-100 dataset is more challenging than OTB-50, since all trackers perform worse on OTB-100 than on OTB-50. In Figure 5, we report quantitative results on the OTB-100 test set under the one-pass evaluation (OPE), temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE) protocols.
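The two metrics above can be computed as follows. This is a minimal sketch with boxes as (x, y, w, h) tuples; the function names are illustrative, not the benchmark toolkit's API.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def overlap_success(pred, gt, thresh=0.5):
    """Fraction of frames whose predicted box overlaps ground truth
    with IoU at or above the threshold (0.5 here)."""
    return np.mean([iou(p, g) >= thresh for p, g in zip(pred, gt)])

def distance_precision(pred, gt, thresh=20.0):
    """Fraction of frames whose center-location error is within
    the distance threshold (20 pixels here)."""
    def center(b):
        return np.array([b[0] + b[2] / 2, b[1] + b[3] / 2])
    errs = [np.linalg.norm(center(p) - center(g)) for p, g in zip(pred, gt)]
    return np.mean([e <= thresh for e in errs])
```

OS penalizes both position and scale errors through the IoU, while DP only measures center localization, which is why the two rates can rank trackers differently.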
From Table 1, it can be seen that our algorithm is superior to most current methods in overlap success rate and distance precision. In terms of overall results, our algorithm is second only to SiamBAN [15]. This is mainly because our tracker may be sensitive to the initial position given in the first frame, so inaccuracies in that position slightly affect accuracy. D3S [59] proposed a tracking algorithm with two complementary modules, GIM and GEM, to handle dynamic target changes: GIM locates the target under heavy deformation, while GEM filters the results and constrains the target position when the GIM segmentation is not unique. Although D3S can restart from tracking failures, it is less effective at handling scale changes; owing to its scale prediction, our method achieves a higher overlap success rate than D3S (78.3% vs. 67.6%). Both the SiamR-CNN [60] tracker and our method handle scale changes of the tracked target and thereby obtain better overlap accuracy than D3S. Unlike SiamR-CNN [60], we use multiple displacement filters and update them in a cyclic mode, which memorizes more object appearances and makes the tracker more effective on deforming objects. At the same time, we adopt SOM features and update the displacement filters A_T1, A_T2 and A_T3 without considering scale changes: we observed experimentally that small errors in scale estimation cause rapid degradation of these filters. In addition, our method has a better overlap success rate than the SiamR-CNN tracker: 78.3% vs. 68.4% on OTB-50, and 69.7% vs. 66.3% on OTB-100.
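The cyclic update of the three displacement filters mentioned above can be sketched as a round-robin scheme. This is an illustration under assumptions: the filter contents are placeholder arrays, and the linear-interpolation update with learning rate η = 0.01 follows the usual correlation-filter model update rather than the authors' exact code.

```python
import numpy as np

ETA = 0.01  # learning rate from the text

class CyclicFilters:
    """Three displacement filters (A_T1..A_T3) refreshed in round-robin
    order, so together they memorize appearances from different time spans."""

    def __init__(self, shape, n=3):
        self.filters = [np.zeros(shape) for _ in range(n)]
        self.next_idx = 0                 # which filter is updated next

    def update(self, new_model):
        """Update one filter for this frame and advance the cycle;
        returns the index of the filter that was updated."""
        i = self.next_idx
        self.filters[i] = (1 - ETA) * self.filters[i] + ETA * new_model
        self.next_idx = (i + 1) % len(self.filters)
        return i
```

Because each filter is touched only every third frame, the three models age at different rates, which is what lets the ensemble retain older appearances of a deforming target.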
In terms of tracking speed, our method is at an intermediate level, comparable to the SiamBAN [15], D3S [59] and PrDiMP [61] trackers. The tracking speeds of DiMP [62] and ASRCF [63] exceed 40 FPS; however, these trackers are inferior to our method in accuracy because they can neither recover from failures nor handle scale changes. Although searching with a sliding window after a tracking failure is time-consuming, we activate the detector only when the confidence value falls below the re-detection threshold T_r, so the speed of our algorithm is close to the real-time speed of video capture (20 FPS).
Under the TRE and SRE evaluation schemes, the method proposed in this paper does not perform as well as it does under OPE. This is because the TRE and SRE protocols do not fully exercise the strengths of our method. TRE decomposes a video sequence into several segments, so the importance of re-detection in long-term tracking is ignored. SRE initializes the tracker with a perturbed target position and scale; since our tracker depends on correlation filters trained to distinguish the object from the background, inaccurate spatial initialization negatively affects the filters' target localization.
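The TRE segmentation described above can be sketched as follows. This is a hypothetical illustration: the OTB benchmark conventionally uses 20 start points, and the exact segment boundaries here are an assumption, not the benchmark toolkit's code.

```python
def tre_segments(n_frames, n_segments=20):
    """Return (start, end) frame ranges for temporal robustness evaluation:
    the tracker is restarted at several later start frames and run to the
    end, so each run is shorter than the full sequence."""
    starts = [round(i * n_frames / n_segments) for i in range(n_segments)]
    # keep only segments long enough to be worth tracking
    return [(s, n_frames) for s in starts if n_frames - s > 1]
```

Because most segments are short, a tracker that drifts and never re-detects loses far less under TRE than it would over the full sequence, which is why TRE underrates long-term re-detection.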

Complicated Scenario Test
The test sequences [38] cover 11 challenging and complicated scenes, such as occlusion and out of view, all of which place higher demands on an object tracking algorithm; these complex scenarios are very useful for analyzing a tracker's behavior in all aspects. Tables 2 and 3 show the overlap success rate and distance precision results on the OTB-100 dataset in these complex scenarios (values marked in red are the highest; blue, the second highest). In terms of overlap success rate, our algorithm is superior to the other methods in most attributes: of the 11 complex scenarios, it achieves the highest score in 6 and the second highest in 3. Compared with the SiamBAN [15] tracker, our tracker achieves better performance in 5 attributes: illumination variation (by 0.3%), out-of-plane rotation (4.1%), occlusion (1.4%), deformation (3.1%) and in-plane rotation (4%). In addition, our algorithm ranks second in the out-of-view, background-clutter and fast-motion scenarios. We attribute these performance improvements mainly to two advantages. First, we decouple the model update of the displacement filters from that of the scale filter; although this may not be optimal for estimating the target state compared with the SiamR-CNN [60] tracker, it effectively avoids degradation of the displacement filters caused by inaccurate scale estimation. Second, we use the long-term correlation filter as an overall memory template to maintain the appearance of the object. SiamR-CNN [60] uses the information of the first frame and of historical frames for long-term tracking and iteratively updates the historical frame information; in the presence of obvious deformation and rotation, far fewer key points remain to identify the target object.
In this case, updating with the video-frame information may blur the historical target features and degrade tracking performance, which is why our algorithm handles these challenges better than the SiamR-CNN tracker.
In terms of distance precision, Table 3 shows that our method achieves good results in three aspects: illumination variation (78%), deformation (86.3%) and in-plane rotation (79.5%). These results demonstrate the effectiveness of our algorithm in dealing with large appearance changes in complex scenes and in recovering from tracking failures. Owing to the use of a similar re-detection module, the SiamBAN and SiamR-CNN trackers also perform very well in cases of fast motion, motion blur and low resolution.
We also compared the tracking results of the proposed algorithm with four recent object trackers (ARCF [65], PrDiMP [61], D3S [59], SiamBAN [15]) on 6 challenging image sequences; the test results are shown in Figure 6. The proposed tracking algorithm estimates the movement and scale change of the object well in these challenging sequences, which can be attributed to three reasons. First, our three displacement filters are learned on SOM features, which learn the target appearance adaptively and without supervision and play a very important role in capturing the appearance of the target object; the proposed tracker therefore achieves good tracking under illumination changes, background clutter, rotation and partial occlusion. Second, the scale filter A_S and the displacement filters A_T are updated separately, which effectively reduces degradation of the displacement filters due to inaccurate scale estimation. Third, in the case of tracking failure, the online-trained detector can effectively re-detect the target object; for example, under severe occlusion or out of view, the proposed tracker can recover tracking of the target.

Ablation Study
In order to better understand the contribution of each component of the proposed tracker, we conducted an ablation study, comparing the full SOM-and-correlation-filter tracker with four modified variants: CT-HOG, CT-NRe, CT-FSC and CT-JOP; the CT-JOP variant, like the trackers of [17] and MUSTer [27], uses joint scale-change data when updating the displacement filter. Figure 6 shows the overlap accuracy and distance accuracy on the OTB-50 dataset, with IoU = 0.5 and a distance threshold of 20 pixels. Figure 7 compares the center position error per frame of these algorithms on 4 image sequences, and Table 4 analyzes component effectiveness on OTB-50 under OPE. Generally speaking, our proposed method tracks objects accurately and stably. On the Soccer sequence in particular, our tracker drifted due to severe occlusion of the target at frame 60, but after only about 10 frames it quickly re-localized the target, a result of the effective work of the detection module. Our tracker also drifted out of the field of view at the 400th frame of the Trellis sequence, but was able to successfully re-detect the target and resume normal tracking within a short time. The CT-NRe method performs significantly better than CT-HOG, which illustrates the importance of the SOM features. Comparing CT-FSC with CT-NRe shows that the re-detection module is very important for recovering from tracking failures. In addition, our full algorithm with all components significantly outperforms the other three variants (CT-HOG, CT-NRe and CT-FSC). Since we update the displacement filters A_T1, A_T2 and A_T3 and the scale filter A_S independently, the distance precision of the CT-FSC method is only slightly reduced.
Our tracker performs significantly better than the CT-JOP method, which shows that jointly updating the displacement filter and the scale filter degrades tracking performance; it also shows that scale estimation remains a challenging problem.

Experiments on VOT2020
We evaluated our algorithm on the short-term and real-time tracking datasets of the Visual Object Tracking challenge (VOT2020) [40]. Unlike previous VOT editions, VOT2020 cancels the restart mechanism and replaces it with initialization points. Table 5 presents a comparison with the latest trackers submitted to VOT2020. The four best performers are the method in this paper, RPT [70], OceanPlus [71] and AlphaRef [72] (values marked in red are the highest; blue, the second highest). RPT, a tracker composed of a target state estimation network and an online classification network [70], has the highest EAO in VOT-ST2020, but its accuracy and robustness are not as good as ours: our method exceeds RPT by 6.2% in accuracy and 1.1% in robustness on VOT-ST2020. AlphaRef [72] performs well in VOT-RT2020, with the best EAO and accuracy, reaching 0.486 and 0.754, respectively. Our algorithm also performed well in VOT-RT2020: both its EAO and robustness ranked second, only 0.01 and 0.003 behind the top performer, and its no-reset average overlap also ranked second.

Table 5. Results for the VOT-ST2020 and VOT-RT2020 challenges. Expected average overlap (EAO), accuracy and robustness are shown. For reference, a no-reset average overlap AO [38] is shown under Unsupervised. (Values marked in red are the highest; blue, the second highest.)

Experiments on NFS
The NFS [43] dataset consists of 100 videos (380K frames) captured in real-world scenarios with high-frame-rate (240 FPS and above) cameras. The area under the curve (AUC) of each tracker is presented in Table 6 (the value marked in red is the highest, blue the second highest, and green the third). Our AUC reaches 0.591; our tracker ranks third, only 0.003 behind SiamBAN.
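The AUC score reported above is the area under the success plot: the success rate (fraction of frames whose IoU exceeds a threshold) swept over thresholds in [0, 1], which reduces to the mean success rate over the sweep. A minimal sketch, with an illustrative function name rather than the benchmark toolkit's API:

```python
import numpy as np

def auc_score(ious, n_thresh=101):
    """Area under the success plot for a list of per-frame IoU values:
    the success rate is evaluated at evenly spaced overlap thresholds
    and averaged."""
    thresholds = np.linspace(0, 1, n_thresh)
    success = [np.mean(np.asarray(ious) > t) for t in thresholds]
    return float(np.mean(success))
```

Unlike OS at a single IoU = 0.5 cutoff, the AUC rewards trackers across the whole range of overlap quality, so two trackers with equal OS can still have different AUC.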

Conclusions
This paper proposes an efficient object tracking algorithm based on SOM and correlation filters. First, three kinds of correlation filters are employed: (1) a displacement filter, (2) a scale filter and (3) a long-term memory filter. These three filters work together to capture the object's appearance and scale and to store its appearance over time, addressing the stability-adaptivity problem of tracking: acquiring the target appearance demands fast model learning and adaptivity, while maintaining long-term memory of the appearance demands a conservative learning rate and model stability. The proposed algorithm balances both requirements for robust tracking. Second, to improve localization accuracy and tracking performance, we learn the correlation filters on SOM features, and we studied the influence of the surrounding context and the learning rate on tracking efficiency so as to obtain the optimal scale of the target image region. Third, an incrementally learned online SVM detector is introduced to recover the target and explicitly handle tracking failures. Experimental results show that the algorithm outperforms other state-of-the-art methods in robustness, accuracy and efficiency.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.