Object Tracking in Satellite Videos Based on Improved Kernel Correlation Filter Assisted by Road Information

Abstract: Video satellites can stare at target areas on the Earth's surface to obtain high-temporal-resolution remote sensing videos, which makes it possible to track objects in satellite videos. However, objects in satellite videos are usually small and have little texture, and moving objects are easily occluded, which places higher demands on the tracker. To address these problems, and considering that remote sensing images contain rich road information that can be used to constrain the trajectory of an object in a satellite video, this paper proposes an improved Kernel Correlation Filter (KCF) assisted by road information to track small objects, especially when the object is occluded. Specifically, the contributions of this paper are as follows: First, the tracking confidence module is reconstructed; it integrates the peak response and the average peak correlation energy of the response map to judge more accurately whether the object is occluded. Then, an adaptive Kalman filter is designed, which adjusts the parameters of the Kalman filter according to the motion state of the object, improving the robustness of tracking and reducing the tracking drift after the object is occluded. Last but not least, an object tracking strategy assisted by road information is proposed, which searches for objects with road information as a constraint in order to locate them more accurately. With these improvements, compared with the KCF tracker, our method improves the tracking precision by 35.9% and the tracking success rate by 18.1% while running at 300 frames per second, which meets real-time requirements.


Introduction
The research on object tracking in satellite videos is an important application of video satellites. In this section, we introduce the background of this paper, the research progress of object tracking and the research advances of object tracking in satellite videos.

Background
Object tracking is one of the core tasks in the field of computer vision. It is widely used in autonomous driving, intelligent security, national defense and military fields. The goal of object tracking is to predict the position of the object in subsequent frames given the information (position and size) of the object in the initial frame. As a new way of Earth observation, video satellites can stare at the target area on the Earth's surface and record the observed information as video sequences, which contain rich dynamic information. With the continuous development of commercial remote sensing satellites, the cost of acquiring high-resolution remote sensing images has been reduced. For example, the Jilin-1 satellite constellation launched by China Chang Guang Satellite Technology Co. can obtain videos with a resolution of about 1 m at a frame rate of about 30 frames per second (FPS). These high-quality remote sensing videos have been applied in fields such as agricultural production, environmental monitoring and geographic mapping.
Current object tracking methods can be divided into generative and discriminative methods. The generative method first models the appearance representation of the object and then searches the image area based on the minimum reconstruction error criterion to obtain the tracking results of the object. The discriminative method takes the object and the surrounding background as positive and negative samples, respectively, and distinguishes them through a classifier trained online. In recent years, with the deepening of the research on object tracking, methods based on correlation filters and on Siamese networks have gradually become the two main directions in the field. The method based on a correlation filter collects a group of positive and negative samples around the object to train the correlation filter template. The filter is then correlated with multiple candidate samples in the search area of the next video frame. The sample with the largest response value is taken as the state of the object in the current frame. Finally, the filter template is updated with the new samples, and the steps above are repeated until the last frame of the video. The method based on a Siamese network recasts object tracking from adapting the network weights online through stochastic gradient descent into matching object pairs between the previous frame and the current frame. Using the same network as the feature extraction network, the feature maps of the template image and of the image to be matched are computed, and a cross-correlation operation is performed on them to obtain the object response, which is mapped back to the original image to calculate the object position in the current frame.
The above object tracking methods perform well in traditional object tracking tasks. However, compared with traditional object tracking tasks, tracking objects in satellite videos faces many challenges, including:
(1) Object size is smaller: Due to the low spatial resolution of satellite video, moving objects such as vehicles, airplanes and ships account for a very small proportion of the entire frame, sometimes less than 10 × 10 pixels. The small size also makes it difficult to obtain the appearance characteristics of the objects and to distinguish the objects from each other by shape, color, texture, etc., which places higher demands on object tracking methods, as shown in Figure 1a.
(2) Background is more complex: Due to the large field of view of satellite videos, the frames are often several or even hundreds of times the width and height of traditional surveillance videos, which results in more interference and redundant information, including the effects of background (such as Figure 1b) and objects (such as Figure 1c) that are similar to the tracked object and the effects of lighting changes (such as Figure 1d-f).
(3) Objects are easily occluded: Because video satellites observe the ground from a bird's-eye view, many of the objects in satellite videos are partially or completely occluded, as in Figure 1h-i, which makes it easier to lose objects during tracking.

Object Tracking
In 2010, Bolme et al. introduced the idea of a correlation filter into the field of object tracking for the first time and proposed the Minimum Output Sum of Squared Error (MOSSE) tracker [1]. By minimizing the mean square error between the expected output and the actual output, correlation filters with stable output are obtained for object tracking, and multiple samples of the object are used as training samples to improve the robustness of the filter template. Then, the Circulant Structure Kernels (CSK) tracker [2] addresses the sample redundancy in the MOSSE tracker by extending ridge regression with the kernel method and incorporating a regularization term into the loss function to prevent overfitting. Based on the CSK tracker, Henriques et al. extended the algorithm to multichannel features and proposed the Kernel Correlation Filter (KCF) tracker [3]. The KCF tracker uses the Histogram of Oriented Gradient (HOG) feature to improve the object description ability and uses the Gaussian kernel to improve the speed of tracking. The KCF tracker became the basis of most subsequent correlation filter trackers. The Color Names (CN) tracker [4] first applies color name features to the correlation filter, describing the object by combining CN features with grayscale features, which greatly improves the tracking performance. The Scale Adaptive Multiple Features (SAMF) tracker [5] further combines HOG and CN features to describe the structure and color information of the object and uses an exhaustive search to determine the scale of the object, which further improves the tracking effect. To improve the speed of object scale estimation, Danelljan et al. proposed the Discriminative Scale Space Tracker (DSST) [6] and the Fast Discriminative Scale Space Tracker (fDSST) [7]. On the basis of the KCF tracker, the DSST tracker introduced a scale correlation filter to estimate object scale, which improves the efficiency and accuracy of object scale estimation, and the fDSST tracker further improved the speed of scale estimation. In order to solve the boundary effect caused by cyclic shifts, Ref. [8] adds spatial regularization constraints to the objective function and presents the Spatially Regularized Discriminative Correlation Filters (SRDCF) tracker, which assigns different weights to the filter coefficients according to the spatial location of the pixels to reduce the influence of the boundary pixels. The Spatial-Temporal Regularized Correlation Filters (STRCF) tracker [9] incorporates a temporal regularization term into the SRDCF tracker to prevent model degradation and uses the Alternating Direction Method of Multipliers (ADMM) as an iterative solver to improve the tracking speed. The Continuous Convolution Operator Tracker (C-COT) [10] uses VGG-Net to extract features on the basis of the KCF tracker and introduces an interpolation operation in the continuous spatial domain. The filter parameters are solved by the preconditioned conjugate gradient method. The Efficient Convolution Operators (ECO) tracker [11] improves the C-COT tracker in three aspects: model parameters, training sample set size and update strategy, which improve the tracking accuracy and efficiency. The Group Feature Selection and Discriminative Correlation Filter (GFS-DCF) tracker [12] presents a feature channel selection mechanism that enables the algorithm to automatically select the desired features, reducing the redundancy of multidimensional features and the computational cost.
In some of the methods mentioned above, deep learning is applied only in the form of a convolutional neural network for feature extraction, so the advantages of deep learning are not fully exploited. In 2016, Siamese networks emerged; with their simple network structure and high tracking efficiency, they attracted wide attention from researchers in the field of object tracking and have become the mainstream direction of deep learning trackers. The Generic Object Tracking Using Regression Networks (GOTURN) tracker [13] trains a network that directly regresses the position and scale change of the object relative to the previous frame, which is fast but not accurate. The Siamese Instance Search for Tracking (SINT) tracker [14] uses the Siamese network search mechanism to match the search image to the template image and selects the region with the largest matching value as the tracking result. The Fully-Convolutional Siamese (SiamFC) tracker [15] introduces a fully convolutional structure based on the SINT algorithm, which can handle the inconsistency between the size of the template image and the search image, and obtains the cross-correlation of the two input images by dense sliding-window evaluation to locate the object. The SiamFC tracker does not update the template, resulting in high tracking accuracy and high tracking speed. Most of the following deep learning trackers are based on the SiamFC tracker. The Siamese Region Proposal Network (SiamRPN) tracker [16] combines a Siamese network and a region proposal network to directly obtain the position of the object, which avoids searching for the object on multiple scales and improves tracking speed. Based on the SiamRPN tracker, the SiamRPN++ tracker [17] overcomes the limitation that the convolutional network structure imposes on algorithm performance, and a residual network is used instead of AlexNet. The Siamese Box Adaptive Network (SiamBAN) [18] treats visual tracking as a parallel classification and regression problem, directly classifying objects and regressing their bounding boxes in a unified fully convolutional network. Its design without prior boxes avoids the hyperparameters associated with candidate boxes, making the tracker more flexible and versatile.

Object Tracking in Satellite Videos
With the continuous development of video satellite technology, object tracking based on satellite videos has been applied in many fields and has produced many research results. Ref. [19] combines a KCF tracker with a three-frame difference algorithm to improve the tracking accuracy when the discrimination between object and background is low, but the occlusion of the object is not considered. Ref. [20] uses velocity characteristics and inertia mechanisms to construct a specific kernel correlation filter for object tracking in satellite videos, which can better distinguish an object from the background but has limited effect in complex scenes. In 2019, the same authors further proposed a fused kernel correlation filter [21] that adaptively uses optical flow and HOG features, but this method is less robust to illumination variation. Ref. [22] presents a method based on a high-speed correlation filter, which constrains the object tracking process by using the global motion characteristics of the moving object in satellite video and corrects the trajectory of the moving object by using the Kalman filter. Ref. [23] presents an improved correlation filter based on motion estimation, which combines the Kalman filter with motion estimation to solve the problem of object occlusion during motion and to reduce the boundary effect. In addition, considering the angle change of the object in a satellite video, Ref. [24] presents a rotation-adaptive correlation filter tracker, which estimates the rotation angle of the object and rotates the feature map of each frame to the same angle accordingly, thus maintaining the stability of the feature map. However, when background jitter occurs, it is difficult for the tracker to accurately estimate the initial angle of the object, which affects the accuracy of the algorithm. In [25], a kernel correlation filter tracker based on multi-feature fusion and motion trajectory compensation is presented, which calculates the location of the object using a subpixel positioning method and introduces an adaptive Kalman filter to correct the correlation filter. However, because it uses deep learning features, the tracker is slow, and it is difficult to achieve real-time tracking. Ref. [26] fused the HOG feature and an Optical Flow (OF) feature to improve the representation of the object and introduced a disruptor-aware mechanism to weaken the influence of background noise. Ref. [27] proposed a correlation filter-based dual-flow tracker to address the problems of limited feature representation and tracking drift. Ref. [28] decomposed the rotation issue into a translation solution and proposed the RAMC tracker, which decouples the rotation and translation motion patterns to achieve adaptive angle estimation.
It can be seen that the Kalman filter is widely used to estimate the motion state of objects in satellite videos. The Kalman filter is used to improve the correlation filter so as to solve the problem of tracking failure after the object is occluded. In handling object occlusion in satellite video tracking, the satisfactory results obtained by Kalman filter prediction rest mainly on, but are not limited to, the following two assumptions: (1) the object moves in an approximately uniform straight line in the video; (2) the object is not occluded before the Kalman filter converges. However, an analysis of the objects in satellite videos shows that some objects do not meet these criteria. Some objects are occluded within the first 10 frames of the video sequence, when the Kalman filter has not yet converged. There are also objects whose direction of motion changes while they are completely occluded, at which point the Kalman filter cannot accurately predict the object location. Therefore, the tracker needs to be optimized in order to improve the robustness and accuracy of tracking.
To solve these problems, we propose an improved kernel correlation filter (KCF) assisted by road information to track small objects in satellite video. This method is mainly composed of three parts: a tracking confidence module, a motion estimation module and a module for object detection after occlusion. Specifically, the main contributions of this article are as follows:
(1) We redesigned the tracking confidence module to evaluate the tracking status of the object more reasonably and effectively and to adjust the tracking strategy in time.
(2) The error of the Kalman filter is adjusted adaptively according to the result of the tracking confidence module, which improves the accuracy of the Kalman filter in the motion estimation module for object location prediction.
(3) An object search strategy assisted by road information is proposed, which reduces the search range after the object disappears. The "Local Maximum Response" is used to evaluate the reliability of object detection, so the proposed method can locate the object quickly and accurately after the object is occluded.
(4) Experiments show that the proposed method is superior to the state-of-the-art algorithms in tracking accuracy and speed.

Kernel Correlation Filter
The proposed method is based on the KCF tracker, so it is briefly introduced here. The KCF tracker builds a training set by cyclic shifting. Suppose the base vector is

$$x = (x_1, x_2, \cdots, x_n)^T \tag{1}$$

Then one of the cyclic shifts of x can be expressed as $Qx = (x_n, x_1, \cdots, x_{n-1})^T$, which represents moving x one position to the right, where Q is the permutation matrix. By repeatedly left-multiplying by Q, the set $\{Q^u x \mid u = 0, \cdots, n-1\}$ realizes the cyclic shift of the base vector x for u times. Stacking all cyclic shifts of x gives the matrix X, so X is the cyclic matrix:

$$X = C(x) = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ x_n & x_1 & \cdots & x_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{bmatrix} \tag{2}$$

For any vector x, its cyclic matrix can be diagonalized by expression (3):

$$X = F \, \mathrm{diag}(\hat{x}) \, F^H \tag{3}$$

where $\hat{x}$ is the discrete Fourier transform of x, F represents the discrete Fourier transform matrix, and $F^H$ is the conjugate transpose of F.
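As a small illustration of the cyclic-shift construction above, the following sketch builds the shifted samples for a toy base vector in plain Python. The helper names `cyclic_shift` and `circulant` are ours and are not taken from any KCF implementation.

```python
def cyclic_shift(x, u=1):
    """Return Q^u x: the list x moved u positions to the right with wrap-around."""
    u %= len(x)
    return list(x) if u == 0 else x[-u:] + x[:-u]


def circulant(x):
    """Stack the shifts Q^0 x, ..., Q^(n-1) x as the rows of the cyclic matrix X."""
    return [cyclic_shift(x, u) for u in range(len(x))]


base = [1, 2, 3, 4]
for row in circulant(base):
    print(row)
```

Each row is the previous row moved one position to the right, so a single base sample yields n training samples without any additional image patches being extracted.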
The KCF tracker uses ridge regression to train the classifier. The main idea is to find a function $f(z) = f^T z$ that minimizes the mean square error between the outputs of all training samples and their expected outputs. The loss function is given by (4):

$$\min_{f} \sum_{i} \left( f(x_i) - y_i \right)^2 + \lambda \lVert f \rVert^2 \tag{4}$$

where $\lambda > 0$ is a regularization parameter used to prevent the model from overfitting, $x_i$ is the i-th training sample, and $y_i$ is the expected output of $x_i$. Setting the derivative of (4) with respect to f to zero gives the closed-form solution:

$$f = (X^H X + \lambda I)^{-1} X^H y \tag{5}$$

where X is the circulant matrix of all training samples, y is the expected output vector, and I is the identity matrix. Solving for the filter coefficients f directly requires many matrix operations and is computationally expensive. Using the properties of the circulant matrix, expression (3) can be substituted into the term $X^H X$ of (5), which gives:

$$X^H X = F \, \mathrm{diag}(\hat{x}^* \odot \hat{x}) \, F^H \tag{6}$$

According to the property of the Fourier transform matrix, $F^H F = I$, the solution of the filter in the frequency domain can be obtained by substituting (6) into (5):

$$\hat{f} = \frac{\hat{x}^* \odot \hat{y}}{\hat{x}^* \odot \hat{x} + \lambda} \tag{7}$$

where $\odot$ denotes the element-wise product, $\hat{f}$, $\hat{x}$ and $\hat{y}$ are the discrete Fourier transforms of f, x and y, respectively, and $\hat{x}^*$ is the complex conjugate of $\hat{x}$.
In order to improve the ability of the KCF tracker to solve nonlinear problems, the KCF tracker uses a kernel function to map the ridge regression problem from the low-dimensional space into a high-dimensional space $\phi(x)$, where the samples are classified, which solves the linear inseparability problem. Suppose the kernel function is $\kappa(x, x') = \phi^T(x)\phi(x')$, so the formula $f(z) = f^T z$ can be written as:

$$f(z) = f^T z = \sum_{i=1}^{n} \alpha_i \, \kappa(z, x_i) \tag{8}$$

For most kernel functions, such as the Gaussian kernel, the polynomial kernel and the linear kernel, the kernel matrix still has the property of a cyclic matrix. Therefore, $\alpha$ can be solved by the following formula:

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda} \tag{9}$$

where $\hat{k}^{xx}$ is the Fourier transform of the base vector $k^{xx}$ of the kernel matrix $K = C(k^{xx})$. For the Gaussian kernel $\kappa(x, x') = \exp\left(-\frac{1}{\sigma^2} \lVert x - x' \rVert^2\right)$, $k^{xx'}$ can be expressed as:

$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \lVert x \rVert^2 + \lVert x' \rVert^2 - 2 \mathcal{F}^{-1}\left( \hat{x}^* \odot \hat{x}' \right) \right) \right) \tag{10}$$

Then, the response map can be calculated by the following formula:

$$\hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha} \tag{11}$$

where $\hat{k}^{xz}$ is the Fourier transform of the kernel correlation of sample x and sample z. The position corresponding to the maximum value in the response map is the position of the object in the current frame.
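To make the training and detection steps concrete, the sketch below runs a one-dimensional kernelized correlation filter with a Gaussian kernel on a toy signal. It is an illustration under simplifying assumptions, not the tracker used in the experiments: it works on a 1-D signal instead of 2-D HOG features, uses a naive O(n²) DFT so that only the standard library is needed, and all function names and the toy values are ours.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def gaussian_correlation(x, z, sigma):
    """Gaussian kernel values for all relative cyclic shifts of x and z."""
    xf, zf = dft(x), dft(z)
    cross = dft([a.conjugate() * b for a, b in zip(xf, zf)], inverse=True)
    norms = sum(a * a for a in x) + sum(b * b for b in z)
    return [math.exp(-max(norms - 2 * c.real, 0.0) / sigma ** 2) for c in cross]


def train(x, y, sigma, lam):
    """Fourier-domain coefficients: alpha_hat = y_hat / (k_hat_xx + lambda)."""
    kf = dft(gaussian_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), kf)]


def detect(alpha_f, x, z, sigma):
    """Response map: inverse DFT of the element-wise product k_hat_xz * alpha_hat."""
    kzf = dft(gaussian_correlation(x, z, sigma))
    return [r.real for r in dft([k * a for k, a in zip(kzf, alpha_f)], inverse=True)]


n, shift = 16, 4
x = [0.0] * n
x[3:6] = [1.0, 3.0, 1.0]                                      # toy 1-D "object"
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]   # label peaked at shift 0
z = x[-shift:] + x[:-shift]                                   # object moved 4 samples right
response = detect(train(x, y, sigma=1.0, lam=1e-4), x, z, sigma=1.0)
peak = response.index(max(response))
print(peak)
```

In this toy setting the index of the largest response equals the displacement of the object between the training sample and the test sample, which is how the response map is read in the tracker.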
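In order to improve the robustness of tracking, the filter template needs to be updated:

$$\hat{x}^{\mathrm{model}}_t = (1 - \eta)\, \hat{x}^{\mathrm{model}}_{t-1} + \eta\, \hat{x}_t, \qquad \hat{\alpha}^{\mathrm{model}}_t = (1 - \eta)\, \hat{\alpha}^{\mathrm{model}}_{t-1} + \eta\, \hat{\alpha}_t \tag{12}$$

where $\hat{x}_t$ and $\hat{\alpha}_t$ are the feature and the filter coefficients obtained in frame t, the superscript "model" marks the accumulated template, and $\eta$ is the learning rate.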

Proposed Method
In order to solve the problem of tracking failure in satellite video object tracking mentioned above, an improved kernel correlation filter tracker assisted by road information is proposed in this paper. First, the tracking confidence index evaluates the tracking state of the correlation filter, and the filter template and the Kalman filter are updated when the tracking result is credible. When it is judged that the object is occluded, the Kalman filter is used to predict the position of the object. If the response at the predicted position is credible, the predicted position is taken as the position of the object in the current frame. If the response at the predicted position is not credible, the object is searched for along the road information to obtain its position after the occlusion. The overall flow of the proposed method is shown in Figure 2.
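The per-frame decision logic described above can be sketched as a small function. This is only a schematic reading of the text: the function name, the string labels and the `road_search` callable are hypothetical, and the reliability flags are assumed to come from the confidence test introduced in the next subsection.

```python
def choose_position(kcf_pos, kcf_reliable, kf_pos, kf_reliable, road_search):
    """Pick the object position for one frame and report which branch produced it."""
    if kcf_reliable:
        # Tracking is credible: keep the KCF result. The filter template and the
        # Kalman filter would be updated at this point.
        return kcf_pos, "kcf"
    if kf_reliable:
        # Occlusion suspected, but the response at the Kalman prediction is credible.
        return kf_pos, "kalman"
    found = road_search(kf_pos)  # hypothetical callable: returns a position or None
    if found is not None:
        return found, "road"
    # The occlusion has not ended yet: fall back to the Kalman prediction.
    return kf_pos, "kalman"


print(choose_position((10, 12), False, (11, 13), False, lambda p: (12, 13)))
```

Keeping the branches in this order means the more expensive road-constrained search is only triggered when both the KCF result and the Kalman prediction are judged unreliable.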

Tracking Confidence Module
The tracker based on a correlation filter generally adopts the strategy of updating the filter template frame by frame. When the object is partially or completely occluded, the learned filter template accumulates background information, which eventually leads to the loss of the object. Therefore, it is important to accurately judge the tracking status of the object to improve the tracking accuracy.
In [23], the authors use the peak response as a criterion to judge whether the tracking process is credible. The peak response can be obtained from Formula (11), and $F_{\max}$ is used here to represent the peak response of the response map. However, the peak response cannot fully reflect the oscillation degree of the response map; that is, when the object is occluded, the peak response may not be at a low level. Therefore, using only the peak response as an indicator cannot accurately determine whether the object is occluded. Wang et al. proposed the average peak correlation energy (APCE), which is used to measure the oscillation degree of the response map [29]. APCE is defined as:

$$\mathrm{APCE} = \frac{\left| y_{\max} - y_{\min} \right|^2}{\mathrm{mean}\left( \sum_{r,c} \left( y_{r,c} - y_{\min} \right)^2 \right)} \tag{13}$$

where $y_{\max}$, $y_{\min}$ and $y_{r,c}$ are the maximum value, the minimum value and the response value at (r, c) in the response map, respectively. APCE reflects the oscillation degree of the response map, while the peak response still reflects to a certain extent whether the tracking is reliable. Therefore, a combination of APCE and $F_{\max}$ is proposed here to measure the tracking confidence. The tracking confidence index (TCI) is defined in Formula (14), where $\mathrm{APCE}_t$ and $F^t_{\max}$ are the APCE and the peak response of frame t, respectively, and $\beta$ is set to 0.8. If $\mathrm{TCI}_t$ satisfies Formula (15), the tracking of frame t is considered reliable; otherwise, it is judged that the object is occluded. In (15), $\gamma$ is the tracking confidence threshold.
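A minimal sketch of the APCE computation follows, using nested lists as a stand-in for the response map. The function name `apce` and the two toy maps are ours; a flat map is given an APCE of 0 here by convention to avoid a division by zero, which is an assumption and not part of the definition in [29].

```python
def apce(response):
    """Average peak correlation energy of a 2-D response map (list of rows)."""
    values = [v for row in response for v in row]
    y_max, y_min = max(values), min(values)
    mean_energy = sum((v - y_min) ** 2 for v in values) / len(values)
    if mean_energy == 0.0:
        return 0.0  # flat map: no peak at all (convention chosen for this sketch)
    return (y_max - y_min) ** 2 / mean_energy


sharp = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
blurred = [[0.0, 0.5, 0.0],
           [0.5, 1.0, 0.5],
           [0.0, 0.5, 0.0]]
print(apce(sharp), apce(blurred))
```

Both maps have the same peak response of 1.0, yet the single sharp peak scores higher than the spread-out one, which is the situation where the peak response alone cannot separate a clean detection from a degraded one.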

Motion Estimation
When the object is occluded, key information such as the appearance characteristics of the object is lost. In this case, the position of the object can be predicted with the help of the speed, acceleration, direction and other information about the object's movement. The Kalman filter is an efficient algorithm that can optimally estimate the system state from the observed measurements. It has a simple form and a low computational cost, so it is suitable for tasks that require high real-time performance. In this paper, we use the Kalman filter to estimate the speed, position and direction of moving objects in video. The state equation and observation equation of the system can be written as:

$$X_t = A_{t,t-1} X_{t-1} + \omega_{t-1}, \qquad Z_t = H_t X_t + v_t \tag{16}$$

where $X_t$ and $Z_t$ are the state and observation vectors of the system at time t, respectively, $A_{t,t-1}$ is the state transition matrix, $H_t$ is the observation matrix, $\omega_{t-1}$ is the process noise, and $v_t$ is the observation noise. The prediction equations of the Kalman filter are:

$$X_{t,t-1} = A_{t,t-1} X_{t-1}, \qquad P_{t,t-1} = A_{t,t-1} P_{t-1} A_{t,t-1}^T + Q_{t-1} \tag{17}$$

The update equations of the Kalman filter are:

$$K_t = P_{t,t-1} H_t^T \left( H_t P_{t,t-1} H_t^T + R_t \right)^{-1}, \qquad X_t = X_{t,t-1} + K_t \left( Z_t - H_t X_{t,t-1} \right), \qquad P_t = \left( I - K_t H_t \right) P_{t,t-1} \tag{18}$$

where $X_{t,t-1}$ is the state prediction at time t, $X_{t-1}$ is the best estimate at time t − 1, $P_{t,t-1}$ is the predicted state covariance matrix at time t, $Q_{t-1}$ is the covariance matrix of the process noise $\omega_{t-1}$, $R_t$ is the covariance matrix of the measurement noise $v_t$, $K_t$ is the Kalman gain matrix, and $Z_t$ is the measurement input at time t.
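The prediction and update steps can be sketched in plain Python as follows. The concrete model is an assumption made for illustration: a constant-velocity state (x, y, vx, vy) with position-only measurements, and values of Q, R and the initial covariance chosen arbitrarily. None of these values are taken from the paper, and all helper names are ours.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b, sign=1.0):
    return [[p + sign * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def identity(n, scale=1.0):
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Assumed constant-velocity model: state (x, y, vx, vy), measurement (x, y).
A = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
H = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

def predict(X, P, Q):
    """Prediction step: propagate the state and its covariance one frame ahead."""
    return matmul(A, X), add(matmul(matmul(A, P), transpose(A)), Q)

def update(X_pred, P_pred, Z, R):
    """Update step: correct the prediction with the measurement Z."""
    S = add(matmul(matmul(H, P_pred), transpose(H)), R)
    K = matmul(matmul(P_pred, transpose(H)), inverse_2x2(S))
    innovation = add(Z, matmul(H, X_pred), sign=-1.0)
    X = add(X_pred, matmul(K, innovation))
    P = matmul(add(identity(4), matmul(K, H), sign=-1.0), P_pred)
    return X, P

X, P = [[0.0], [0.0], [0.0], [0.0]], identity(4, 100.0)
Q, R = identity(4, 0.01), identity(2, 1.0)          # illustrative values only
for t in range(1, 21):                              # object moves (2, 1) pixels per frame
    X_pred, P_pred = predict(X, P, Q)
    X, P = update(X_pred, P_pred, [[2.0 * t], [1.0 * t]], R)
X_next, _ = predict(X, P, Q)
print([round(v[0], 2) for v in X_next])
```

After 20 noise-free measurements of an object moving at a constant (2, 1) pixels per frame, the prediction for frame 21 lies close to (42, 21). This is the quantity a tracker would fall back on when the appearance-based result is judged unreliable.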
Because the change of position between adjacent video frames is small, the motion of the object can be regarded as linear motion over a short time. We therefore set the state transition matrix A and the observation matrix H as given in (19). The convergence of the Kalman filter directly affects its prediction results, and the two key parameters, the measurement noise covariance matrix R and the process noise covariance matrix Q, determine the convergence speed and convergence effect. If R is fixed, the greater Q is, the more the measured value is trusted; conversely, the smaller Q is, the more the predicted value of the model is trusted. The original Kalman filter assumes that the noise is stationary, so Q and R are set to fixed values.
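The influence of Q on how much the measurement is trusted can be seen in a scalar example. The sketch below assumes a one-dimensional state that is observed directly, so the Kalman gain reduces to a single number; the function name and the numeric values are illustrative only.

```python
def kalman_gain(p_prev, q, r):
    """Scalar Kalman gain for a one-dimensional state that is observed directly."""
    p_pred = p_prev + q            # predicted variance grows with the process noise Q
    return p_pred / (p_pred + r)   # gain in (0, 1): weight given to the measurement


for q in (0.01, 1.0, 10.0):
    print(q, round(kalman_gain(p_prev=1.0, q=q, r=1.0), 3))
```

With R fixed at 1, the gain rises from about 0.50 to about 0.92 as Q grows from 0.01 to 10, so the updated state moves closer to the measurement and further from the model prediction.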
In order to improve the prediction performance of the Kalman filter, we fix the measurement noise covariance matrix R and adaptively adjust the process noise covariance matrix Q according to the tracking state of the object. Specifically, if the object is not occluded, R and Q in the Kalman filter are set as given in (20). If the object is occluded, R and Q are set as given in (21). Adaptively adjusting the noise covariance matrix according to the state of the tracking process helps to improve the convergence speed and prediction performance of the Kalman filter. This will be explained in the experiment section.

Object Detection Based on Road Information
Remote sensing images contain abundant road information, which has important applications in emergency response, traffic management, driverless path planning and other fields. For most objects in satellite video, especially vehicles, the trajectory usually coincides with the road. Inspired by this, road information can be used to assist in tracking objects in satellite video. Therefore, an object detection and tracking method under the constraint of road information is proposed in this paper. In order to obtain the road information in the satellite image, we use the method proposed in [30] to extract the road information from the remote sensing image, as shown in Figure 3. We use the binarized road network image as a mask. If a point belongs to a road, the pixel value of the point in the road network image is 1; otherwise, it is 0. The overall idea of using road information to assist object tracking is as follows. When it is determined that the object is occluded, first stop updating the correlation filter template and use the Kalman filter to predict the position of the object. If the response map at the predicted position satisfies Formula (15), the position predicted by the Kalman filter is used as the position of the object in the current frame. If the response map at the predicted position does not satisfy Formula (15), select a range of the specified size around the position predicted by the Kalman filter. Assuming that N points within the range belong to the road, take the N points as centers and crop N image blocks with the same size as the filter template. Correlate the filter template with the N image blocks one by one to obtain N response maps and the corresponding peak response of each response map. We define the maximum of these N peak responses as the local maximum response, which is denoted $V_{\max}$. Further, in order to determine whether $V_{\max}$ corresponds to the position of the object, the detection threshold (DT) is defined as in Formula (22), where $F^t_{\max}$ is the maximum value in the response map of frame t, and $\delta$ is set to 0.5. If $V_{\max}$ satisfies (22), it is considered that the object has been successfully detected, and the position corresponding to $V_{\max}$ is used as the position of the object in the current frame. If this formula is not satisfied, it is considered that the occlusion of the object has not yet ended. In this case, the position predicted by the Kalman filter is still used as the position of the object in the current frame.
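The candidate selection under the road constraint can be sketched as follows. The sketch only covers enumerating the road pixels inside the search window and taking the largest peak response among them; the comparison with the detection threshold of Formula (22) is left to the caller. The road mask is a nested list of 0/1 values, and `peak_response_at` is a hypothetical callable standing in for "crop a block at this point, correlate it with the filter template and return the peak response".

```python
def road_candidates(road_mask, center, half_w, half_h):
    """Road pixels (x, y) inside the window around `center`, clipped to the image."""
    cx, cy = center
    rows, cols = len(road_mask), len(road_mask[0])
    points = []
    for r in range(max(0, cy - half_h), min(rows, cy + half_h + 1)):
        for c in range(max(0, cx - half_w), min(cols, cx + half_w + 1)):
            if road_mask[r][c] == 1:
                points.append((c, r))
    return points


def local_maximum_response(candidates, peak_response_at):
    """Return the candidate with the largest peak response and that response."""
    best_point, v_max = None, float("-inf")
    for point in candidates:
        value = peak_response_at(point)
        if value > v_max:
            best_point, v_max = point, value
    return best_point, v_max


mask = [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],   # a horizontal road along row 2
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
candidates = road_candidates(mask, center=(2, 1), half_w=1, half_h=1)
# Toy scorer: pretend the true object sits at x = 3 on the road.
best, v_max = local_maximum_response(candidates, lambda p: -abs(p[0] - 3))
print(candidates, best, v_max)
```

Only the three road pixels inside the 3 × 3 window are evaluated instead of all nine positions, which illustrates how the road mask reduces the number of correlation operations after the object disappears.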
It can be seen from the above that setting the range of the search area is very important. If the search area is too small, the actual position of the object may not be included in it. If the search area is too large, the traversal process reduces the tracking efficiency. As mentioned above, the prediction result of the Kalman filter is closely related to its convergence, so in order to evaluate the convergence of the Kalman filter, we use the distance between the position predicted by the Kalman filter and the position tracked by the KCF to judge whether the Kalman filter has been fitted. As shown in Formula (23), when the distance between the position predicted by the Kalman filter and the position tracked by the KCF is less than 2 pixels, the Kalman filter is considered to have been fitted.
Assume that the object is occluded at time t, the object position predicted by the Kalman filter is $(X_{KF}, Y_{KF})$, and the object speed is $(V_x, V_y)$. When the Kalman filter has been fitted, we use the speed predicted by the Kalman filter to determine the search range after the object disappears, as shown in Figure 4a. Figure 4b visually illustrates the object search range assisted by road information. The size of the search range is calculated by Formula (24), where (X, Y) is the possible position of the object and the two quantities with subscripts 1 and 2 are scale factors. If $V_x > V_y$, the first and second scale factors are set to 10 and 8, respectively; otherwise, if $V_x \le V_y$, they are set to 8 and 10, respectively.
When the Kalman filter is not fitted, the speed obtained by the Kalman filter is not reliable, so the search range is set to a fixed value chosen empirically, as given in Formula (25), and the settings of the two scale factors are the same as in Formula (24).

Experiments

Datasets and Compared Trackers
The data set used in this paper is from the Jilin-1 satellite. The objects in the video are annotated by [31]. We select eight video sequences to compare the tracking performance of our method with that of the comparison methods. Each video contains one designated object, all of which are motor vehicles. For more information about the objects, see Table 1. It should be noted that the exact position of the object after it is completely occluded is not provided, so we set the ground truth of the object in the occluded frame to (1,1,1,1) and define the occluded frame as an invalid frame. The effective frames in Table 1 are obtained by subtracting the invalid frames from the total frames, but the total frames are still used for calculation in the evaluation process. The object size refers to the size of the object in the first frame of the video sequence. The definition of the challenges follows [32] and is shown in Table 2. We choose the CSK tracker [2], CN tracker [4], KCF tracker [3], fDSST tracker [7], STRCF tracker [9], ECO tracker [11] and GFSDCF tracker [12] to compare the tracking performance with the method proposed in this paper. The CSK tracker introduces a cyclic shift strategy based on the MOSSE tracker. The CN tracker uses the color name feature to represent the object. The KCF tracker is a classical correlation filter tracker, which is also the basis of the proposed method. The fDSST tracker can estimate the scale change of the object, and the STRCF tracker adapts well to changes in the appearance of the object. The ECO tracker and the GFSDCF tracker both use deep features. We verify the performance of the proposed method by comparing it with these trackers.

Setting of Parameters
All algorithms are implemented in MATLAB 2021. The experimental environment is a Windows 11 system with a 2.90 GHz AMD R7-4800H Central Processing Unit (CPU) and an NVIDIA RTX 2060 Graphics Processing Unit (GPU). In the proposed method, the regularization factor λ is set to $10^{-4}$, the learning rate η is set to 0.012, the search area size is 2.5 times the object size, and δ in (22) is set to 0.5. Because the object in the satellite video is small, we set the cell size of the HOG feature used in our method and in the KCF tracker to 1 × 1. Because both the ECO tracker and the GFSDCF tracker use deep features, these two trackers run on the GPU.

Evaluation Metrics
In this paper, two commonly recognized metrics [33] are used to quantitatively analyze the object tracking results, namely, the tracking precision score and the tracking success score. In addition, FPS represents the number of frames that the tracker can process per second, which is used to evaluate the tracking speed of the tracker.
The tracking precision score is defined as the ratio of the number of frames whose center location error (CLE) is less than a certain threshold to the total number of frames. The center location error is calculated as
$$\mathrm{CLE} = \sqrt{(x_p - x_{gt})^2 + (y_p - y_{gt})^2}$$
where $(x_p, y_p)$ is the object center position predicted by the tracker, and $(x_{gt}, y_{gt})$ is the real position of the object center. Since the object in the satellite video is small, the threshold is set to 5 pixels, that is, if the distance between the predicted position and the real position is less than 5 pixels, the tracking is considered successful.
The tracking success score is defined as the ratio of the number of frames whose overlap rate exceeds a certain threshold to the total number of frames. The overlap rate is calculated as
$$S = \frac{|r_p \cap r_{gt}|}{|r_p \cup r_{gt}|}$$
where $r_p$ and $r_{gt}$ refer to the predicted object region and the ground-truth object region, respectively, $\cap$ and $\cup$ refer to intersection and union, respectively, and $|\cdot|$ refers to the number of pixels in the region.
The overlap rate threshold is set to 0.5, that is, if the overlap rate of the current frame is greater than 0.5, the tracking is considered successful. In the actual evaluation process, the area under the curve (AUC) score is generally used as the evaluation index; AUC refers to the area under the overlap rate curve.
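The two metrics can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in the experiments: boxes are assumed to be (x, y, w, h) tuples, and the AUC is approximated by averaging the success rate over 21 evenly spaced overlap thresholds in [0, 1], a common benchmark convention that is assumed here and not stated in the text. All function names are hypothetical.

```python
import math


def center_location_error(pred_center, gt_center):
    """Euclidean distance between predicted and ground-truth centers."""
    return math.hypot(pred_center[0] - gt_center[0], pred_center[1] - gt_center[1])


def box_center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def overlap_rate(pred_box, gt_box):
    """Intersection over union of two (x, y, w, h) boxes."""
    px, py, pw, ph = pred_box
    gx, gy, gw, gh = gt_box
    inter_w = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    inter_h = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0


def precision_score(pred_boxes, gt_boxes, threshold=5.0):
    """Fraction of frames whose CLE is below the threshold (5 pixels in the text)."""
    hits = sum(
        center_location_error(box_center(p), box_center(g)) < threshold
        for p, g in zip(pred_boxes, gt_boxes)
    )
    return hits / len(gt_boxes)


def success_score(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    hits = sum(overlap_rate(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)


def success_auc(pred_boxes, gt_boxes, steps=20):
    """Approximate AUC: mean success rate over thresholds 0, 1/steps, ..., 1 (assumed)."""
    thresholds = [i / steps for i in range(steps + 1)]
    return sum(success_score(pred_boxes, gt_boxes, t) for t in thresholds) / len(thresholds)
```

Both scores divide by the total number of frames, which matches the statement that the total frames are used in the evaluation.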

Threshold of Occlusion
As mentioned above, we judge whether the object is occluded by observing the fluctuation of the TCI. Therefore, the choice of threshold has an important impact on whether the tracker can accurately judge the tracking state. We plot the TCI of all video sequences, as shown in Figure 5.
As shown in Figure 5, the TCI values of all sequences follow an obvious unimodal distribution. Therefore, we can judge whether the object is in an occluded or non-occluded state by selecting an appropriate threshold.
To select an appropriate threshold γ, we perform a grid search over [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. The corresponding tracking precision scores and AUC scores are shown in Figure 6. The tracking precision score and the AUC score are both highest when the threshold is 0.5, so we set the occlusion threshold γ to 0.5.
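The grid search can be sketched as follows. The sketch assumes a hypothetical callable `evaluate(gamma)` that runs the tracker on all sequences with threshold gamma and returns the pair (precision score, AUC score); ranking candidates by the sum of the two scores is also an assumption, since the text only reports that both scores peak at the same threshold. The `toy_evaluate` function is a synthetic stand-in so that the example runs, and its values are not experimental results.

```python
def select_threshold(candidates, evaluate):
    """Return the candidate threshold with the largest precision + AUC (assumed criterion)."""
    return max(candidates, key=lambda gamma: sum(evaluate(gamma)))


def toy_evaluate(gamma, peak=0.5):
    # Synthetic stand-in for a full tracking run; NOT the paper's results.
    # The peak location is chosen only so that the example has a unique maximum.
    score = 1.0 - abs(gamma - peak)
    return score, score


# Candidate grid from the text: 0.1, 0.2, ..., 1.0.
candidates = [round(0.1 * k, 1) for k in range(1, 11)]
best_gamma = select_threshold(candidates, toy_evaluate)
```

In practice `toy_evaluate` would be replaced by a function that runs the tracker and computes the two scores on the eight sequences.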
The visualization of the tracking process of sequences Seq1 and Seq2 in Figure 7 shows the relationship between the tracking confidence index and the object tracking status. With a threshold of 0.5, the tracker can accurately judge whether the object is occluded: when the object is occluded, the TCI is less than the threshold; when the object is not occluded, the TCI is greater than the threshold. In a very small number of cases, the object is occluded but the TCI is still greater than the threshold, which is usually caused by illumination change or rapid movement. As can be seen from Figure 7, this situation lasts only a very short time, so its impact on the final tracking results can be ignored.
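The two quantities that the tracking confidence module integrates, the peak response and the average peak correlation energy (APCE) of the response map, can be computed as in the sketch below. The APCE here follows its commonly used definition, $|F_{max} - F_{min}|^2$ divided by the mean of $(F - F_{min})^2$ over the map. How the two quantities are combined into the TCI is defined earlier in the paper and is not reproduced here, so `is_occluded` simply takes a TCI value as given. The function names and the toy response maps are hypothetical.

```python
def peak_and_apce(response):
    """Return (peak response, APCE) for a 2-D response map given as a list of rows."""
    values = [v for row in response for v in row]
    f_max, f_min = max(values), min(values)
    mean_energy = sum((v - f_min) ** 2 for v in values) / len(values)
    apce = (f_max - f_min) ** 2 / mean_energy if mean_energy > 0 else 0.0
    return f_max, apce


def is_occluded(tci, gamma=0.5):
    """Occlusion decision from the text: occluded when TCI is below the threshold."""
    return tci < gamma


# Toy response maps: a single sharp peak versus a flat, blurred response.
sharp = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
blurred = [[0.2, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.2]]
sharp_peak, sharp_apce = peak_and_apce(sharp)
blurred_peak, blurred_apce = peak_and_apce(blurred)
```

A sharp, single-peaked response yields both a higher peak and a higher APCE than a blurred one, which is why the two quantities are informative about occlusion.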

Noise Covariance Matrix
As mentioned above, the measurement noise covariance matrix R and the process noise covariance matrix Q in the original Kalman filter are fixed. In practical applications, however, the confidence in the measured and predicted values should be dynamic. In this article, we adjust Q according to whether the object is occluded. To show the advantage of adaptively adjusting Q, we conduct a comparative experiment in which R is fixed and Q is varied. We perform a grid search over [0.01, 0.001, 0.0001, 0.00001] for Q, and the parameter setting of the proposed method is given in (20) and (21). The experimental results are shown in Figure 8 and Table 3. They show that, compared with a fixed Q, the adaptive Q adjustment strategy proposed in this paper improves the performance of the tracker.
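The idea of switching Q with the occlusion state can be illustrated with a minimal one-dimensional constant-velocity Kalman filter, applied independently to each image axis. This is a sketch under stated assumptions, not the filter defined by (20) and (21): the values `q_tracked` and `q_occluded` are placeholders taken from the search grid, which state receives the larger Q is an assumption, and skipping the measurement update during occlusion is also an assumption. The class name `AdaptiveKalman1D` is hypothetical.

```python
class AdaptiveKalman1D:
    """Constant-velocity Kalman filter on one axis with occlusion-dependent Q."""

    def __init__(self, position, r=1.0, q_tracked=1e-2, q_occluded=1e-4):
        self.p, self.v = float(position), 0.0   # state: position and velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]       # state covariance
        self.r = r                              # measurement noise variance (fixed R)
        self.q_tracked = q_tracked              # placeholder Q when the object is visible
        self.q_occluded = q_occluded            # placeholder Q when the object is occluded

    def step(self, measurement, occluded):
        q = self.q_occluded if occluded else self.q_tracked
        # Predict with F = [[1, 1], [0, 1]] and Q = q * I.
        (a, b), (c, d) = self.P
        self.p += self.v
        P = [[a + b + c + d + q, b + d], [c + d, d + q]]
        if not occluded:
            # Update with H = [1, 0]; skipped while occluded (assumption).
            s = P[0][0] + self.r
            k0, k1 = P[0][0] / s, P[1][0] / s
            y = measurement - self.p
            self.p += k0 * y
            self.v += k1 * y
            P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                 [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        self.P = P
        return self.p


# Toy run: 30 visible frames moving 2 pixels per frame, then 5 occluded frames.
kf = AdaptiveKalman1D(position=0.0)
for k in range(1, 31):
    kf.step(2.0 * k, occluded=False)
for _ in range(5):
    predicted = kf.step(None, occluded=True)
```

During the occluded frames the filter extrapolates along the learned velocity, which is the behavior that a fixed search window cannot provide.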

Quantitative Evaluation
Figure 9 shows the tracking precision score and AUC score of the proposed method and the comparison trackers on all sequences. In terms of the overall tracking precision score, the proposed method is 19.1% better than the second-ranked ECO tracker and 35.9% better than the basic KCF tracker. In terms of the overall tracking AUC score, the proposed method is 6.1% better than the second-ranked ECO tracker and 18.1% better than the basic KCF tracker.
To analyze the performance of each algorithm in different environments, we divide the eight video sequences into five categories according to the motion of the object in the satellite video. In Seq1 and Seq2, the object moves essentially along a straight line with small acceleration, which can be regarded as uniform linear motion, but the object is completely occluded during the movement. The objects in Seq3 and Seq4 can also be regarded as moving in a straight line at a constant speed, but they are not occluded during the movement, and the object in Seq4 changes dramatically in appearance because of lighting changes during the movement. In Seq5, the object is partially occluded in the first frame, and partial and full occlusion occur many times during the movement. In Seq6, the moving direction of the object changes after it is completely occluded. In Seq7 and Seq8, the object moves nonlinearly but is not occluded during the movement. These differing characteristics place high demands on the overall performance of a tracking algorithm. Figures 10 and 11 show the tracking precision and the tracking success AUC, respectively, of the proposed method and the comparison methods on the eight satellite video sequences.
In terms of the tracking precision score, the method in this paper is only 1.5% behind the first-ranked tracker in Seq4, but the tracking precision score of our method is still 2.8% higher than that of the KCF tracker. In terms of the AUC score, our method achieves the best results on Seq1, Seq2, Seq3 and Seq6. Compared with the second-ranked tracker, the AUC scores of our method on these four sequences increase by 38.3%, 1.8%, 0.9% and 33.2%, respectively. Compared with the KCF tracker, the AUC scores of our method on these four sequences increase by 40.6%, 22.4%, 0.9% and 34.2%, respectively. The AUC scores of our method on Seq4, Seq5, Seq7 and Seq8 are not the best. However, compared with the KCF tracker, the AUC scores of the proposed method on these four sequences still increase by 2.5%, 44.2%, 0.2% and 0.1%, respectively. The complete experimental results are shown in Table 4.

Qualitative Evaluation
In this subsection, we select three representative trackers and compare them qualitatively with the tracker proposed in this paper, as shown in Figure 12, where the tracked objects are all moving vehicles.
In Seq2, the object is completely occluded during tracking. Most of the comparison trackers lose the object after it is occluded, while the ECO tracker can continue to track after the object reappears. This is because the search range of the ECO tracker is larger than that of the KCF tracker, and because the occlusion in Seq2 is caused by a bridge whose scale is similar to the object size, and the occlusion time is short. This enables the ECO tracker to capture the object after it reappears. In Seq1, the object is occluded for a longer time, and the scale of the occluder is larger. Therefore, in Seq1, only the method in this paper can relocate the object and continue tracking after the object is occluded.
In Seq5, the ground truth given for the object in the initial frame is its partially occluded position and size. In the first 10 frames, the object is always partially or completely occluded. At this time, the Kalman filter cannot converge, so its prediction of the object position is not credible. Therefore, the object detection module we designed plays a key role in object search. The GFSDCF tracker uses deep features, so it has advantages in feature extraction and is strongly robust to partial occlusion. The ECO tracker is disturbed by similar objects in the last stage of tracking, which leads to tracking failure.
In the most challenging sequence, Seq6, the moving direction of the object changes after occlusion. If only the Kalman filter is used to predict the position, the predicted position deviates considerably. The proposed tracker uses the adaptive Kalman filter and road information to detect the object promptly after it reappears and achieves good tracking results. However, the proposed method does not detect the rotation and scale change of the object, so the tracker keeps the initial object box after detecting the object (as shown in frame 98 of Seq6 in Figure 12) and cannot adapt to the direction change of the object and the resulting scale change. Therefore, the AUC score of our method on Seq6 is not high.
In the sequences without occlusion, the performance of the proposed tracker is close to that of the other trackers on Seq3, Seq7 and Seq8. In Seq4, the object is deformed because of the change of illumination during motion, which can be seen clearly in frames 114 and 143 of Seq4 in Figure 12. Since our method is based on the KCF tracker and does not model the scale change of the object during tracking, the bounding box cannot closely match the real size of the object after the object is deformed. Both the ECO tracker and the GFSDCF tracker can respond to deformation, which is also reflected in Seq7 and Seq8. In addition, frame 290 of Seq4 reflects the change of brightness under the influence of illumination: the object changes from light to dark. At this time, all trackers except the proposed method exhibit tracking drift.
In general, the visualization results show that our method can handle the case in which the object is completely occluded while moving along a straight line. It can also handle unreliable tracking before the Kalman filter has converged and the case in which the moving direction of the object changes after occlusion.

Conclusions
To solve the problem of object occlusion in satellite video object tracking, we use the road information contained in remote sensing images and propose an improved kernel correlation filter assisted by road information to track small objects in satellite videos. Compared with other methods, the proposed method has the following advantages. First, the reliability of the tracking results of the kernel correlation filter is evaluated by the tracking confidence module. Then, the motion state of the object is estimated by the adaptive Kalman filter. Finally, the object detection module assisted by road information is used to locate and detect the object more accurately. In a comparison with existing object trackers on the eight video sequences obtained from the Jilin-1 satellite constellation, our tracker achieves the highest tracking precision score on seven sequences and the highest AUC score on four sequences. The results show that our method can effectively solve the problem of tracking failure caused by object occlusion. However, the rotation and scale change of the object are not considered in the proposed method, which prevents it from obtaining a good AUC score on some sequences; this problem will be handled in our future work.

Figure 1. The problems of tracking objects in satellite videos: (a) small objects; (b-f) complex background; (g-i) object is occluded.

Figure 2. The flowchart of the proposed method.TCI refers to tracking confidence index, OT and DT refer to occlusion threshold and detection threshold, respectively, and Vmax refers to the maximum response in all response maps.

Figure 3. The remote sensing image and the result of road extraction: (a) is a frame in the satellite remote sensing video; (b) is the corresponding extracted road network map.

Figure 4. Schematic diagram of object speed and search range: (a) is the remote sensing image; (b) is the road network map.The green solid line box is the object boundary box predicted by the Kalman filter, the green dotted line box is the search area, the red arrow represents the object speed along the Y axis, and the blue arrow represents the object speed along the X axis.

Figure 5. TCI value distribution of all frames.

Figure 6. Tracking accuracy score and AUC score obtained by different thresholds: (a) tracking accuracy score plots; (b) tracking AUC score plots.

Figure 8. Tracking accuracy score and AUC score obtained by different Q: (a) tracking accuracy score plots; (b) tracking AUC score plots.

Figure 9. Experimental results on the eight sequences: (a) tracking accuracy score plots; (b) tracking AUC score plots.

Figure 12. Screenshots of some tracking results. At each frame, the bounding boxes with different colors are the tracking results of the different trackers, and the green number in the top-left corner is the frame number of the current frame in the satellite videos.

Table 1. Overview of the satellite videos and the objects in the experiments.

Table 2. The definition of the challenges.
LQ (Low Quality): the image is of low quality and the object is difficult to distinguish.
ROT (Rotation): the object rotates in the video.
DEF (Deformation): non-rigid object deformation.
POC (Partial Occlusion): the object is partially occluded in the video.
TO (Tiny Object): at least one ground truth bounding box has less than 10 × 10 pixels.

Table 3. Results of comparative experiments. The bold digits denote the optimal results.

Figure 10. Precision plots of eight video sequences involving object sequences without and with occlusion. The legend in the precision plot is the corresponding precision score per object tracker.

Table 4. The object tracking results of all video sequences employing precision score, AUC score, success score and FPS to determine the best tracker. The bold digits denote the optimal results.