Hyperspectral Video Target Tracking Based on Deep Edge Convolution Feature and Improved Context Filter

Abstract: To address the problem that the performance of hyperspectral target tracking degrades in the presence of background clutter, this paper proposes a novel hyperspectral target tracking algorithm based on a deep edge convolution feature (DECF) and an improved context filter (ICF). DECF is a fusion feature obtained by convolving deep features with 3D edge features, which makes targets easier to distinguish against complex backgrounds. To reduce background clutter interference, an ICF is proposed. The ICF selects eight neighborhoods around the target as the context areas; the four areas that interfere most with the target are then regarded as negative samples to train the ICF. To reduce the tracking drift caused by target deformation, an adaptive scale estimation module, named the region proposal module, is proposed for the adaptive estimation of the target box. Experimental results show that the proposed algorithm has satisfactory tracking performance against background clutter challenges.


Introduction
As an important branch of computer vision, target tracking [1][2][3][4] is widely used in pedestrian monitoring [5,6], robot navigation [7,8], regional control [9,10], and other fields. A target tracking algorithm estimates the state of the target in each frame, given the position and size of the target in the first frame of the video sequence. Most target tracking methods based on visible-light videos use shape, appearance, and color information to track the target. However, when the colors of the target and the background are similar, tracking the moving target accurately and robustly remains a challenge.
Compared to visible images, hyperspectral images (HSIs) [11,12] contain not only spatial but also spectral information about the target, so they have a wide range of applications in fields such as ground target recognition [13] and resource exploration [14]. Due to the large data scale of HSIs, it is difficult for traditional equipment to obtain hyperspectral videos (HSVs); the development of snapshot hyperspectral sensors provides a basis for using HSVs to track targets. Recently, Uzkent et al. [15] proposed a deep kernel correlation filter (DeepHKCF) that converts the hyperspectral image to a pseudo-color image to obtain deep features, thereby ignoring the role of the spectrum. Qian et al. [16] proposed a hyperspectral target tracking method based on convolutional networks (CNHT), which only selects small cubes in the target area to train convolutional filters, ignoring the band correlation. In HSVs, the use of spectral information can improve the discrimination of targets; Xiong et al. [17] proposed a material-based hyperspectral tracker (MHT) that uses material information to distinguish targets from backgrounds of similar color. The main contributions of this paper are summarized as follows:
1. We propose a 3D edge feature-extraction method. Edge features in three directions are fused with direction-adaptive weights to form a 3D matrix, which enhances the edge information and contains spatial-spectral features.
2. We are the first to use a novel convolution fusion feature, named DECF, which is obtained by convolving the grouped deep features with the 3D edge features. DECF largely preserves semantic and spatial-spectral information and makes the target more discriminative.
3. An ICF is proposed for the first time. First, eight influence factors are calculated over the context areas. Second, the four areas corresponding to the top-ranked influence factors are regarded as negative samples to train the context filter. Finally, adaptive weights calculated from these four influence factors are used to suppress background clutter.
4. Inspired by the region proposal network (RPN), this paper proposes a new adaptive scale estimation method named the region proposal module (RPM). The target box is estimated adaptively by adjusting its length and width with the RPM.
The rest of this research paper is organized as follows: Section 2 provides an overview of the related work. Section 3 describes the proposed approach. Section 4 presents the experiments for validating and analyzing the proposed framework. Section 5 discusses the conclusions drawn from this research.

Related Work
Our proposed DC-HVT can be divided into three parts: feature extraction, correlation filter-based tracking (CF trackers), and scale estimation. We review the related methods relevant to these three parts as follows.

Feature Extraction
In order to fully extract the spatial-spectral features of HSIs, researchers have proposed various spatial-spectral feature-extraction methods. Traditional handcrafted features are typically represented by texture and shape features, such as Gabor features [22], local binary pattern (LBP) features [23], and morphological profile features [24]. Zhu et al. [25] used 3D Gabor features to extract HSI features from three angles for fusion, but this feature is only applicable to small samples. Focusing on the rotation-invariant texture structure of local spatial information in HSIs, Li et al. [26] applied LBP to HSI feature extraction for the first time, and this attempt yielded satisfactory results. In addition, feature-extraction methods based on LBP and sparse representation have also made progress. For example, Tu et al. [27] proposed a hyperspectral image classification method that combines LBP with a joint sparse representation classifier in order to fully utilize the texture features of images; this method improves the classification accuracy of hyperspectral images. With the development of deep learning [28,29], many computer vision tasks have made breakthroughs, and deep learning techniques have been widely used for HSIs. Early on, Chen et al. [30] proposed a deep belief network, but it needed to represent the spatial information as vectors before training and thus could not extract spatial information effectively. He et al. [31] built a 3D convolutional neural network (CNN) to extract the spectral and spatial information of HSIs at the same time, but the CNN model needs to convolve a fixed-size region and cannot fully adapt to geometric changes.

CF Trackers
CF was first applied in the field of target detection. In 2010, Bolme et al. [32] proposed the minimum output sum of squared error (MOSSE) algorithm, which first used CF for video target tracking. Because the MOSSE algorithm uses a small number of training samples, it is prone to problems such as overfitting. For this reason, Henriques et al. [33] proposed a tracking-by-detection model, which uses a single grayscale feature [34] and does not adapt well to complex environments. Building on this, Henriques et al. [35] replaced the single-channel feature with a multichannel gradient histogram feature. This algorithm uses image gradients to improve the tracking accuracy, but it introduces boundary effects. To overcome this problem, Galoogahi et al. [36] expanded the training sample area to reduce the boundary effects. Danelljan et al. [37] introduced spatial regularization in spatially regularized discriminative correlation filters (SRDCF) to penalize boundary regions. Different from SRDCF, Mueller et al. [38] proposed the context-aware correlation filter (CACF), which uses the target's context information as negative samples for filter training. However, the performance of this method is limited because the context area includes only four image patches.

Scale Estimation
Because CF usually uses a fixed-size window, it is easy to generate tracking drift when the target size changes. To solve this problem, Li et al. [39] used bounding boxes with multiple scales to match the target region in the previous frame and selected the bounding box with the highest similarity. Danelljan et al. [40] added a one-dimensional scale filter to the position filter to perform target localization and scale estimation separately, but this method increased the computational complexity. Danelljan et al. [41] reduced the computational effort by using a dimensionality reduction operation and QR decomposition. With the advancement of deep learning, Bertinetto et al. [42] pioneered the application of Siamese networks to target tracking, proposing the fully convolutional Siamese network (SiamFC), which calculates the similarity between the template and the search region to achieve better tracking performance. However, SiamFC does not use regression to adjust the scale of the target box and requires multiscale testing to estimate the box's size. To solve this problem, Li et al. [43,44] proposed the Siamese RPN model, which can better adapt to the scale changes of the tracked target.

Proposed Approach
To address the problems of degraded tracking accuracy and tracking drift under background clutter, DC-HVT is proposed in this paper. The proposed tracking framework is summarized in Figure 1. In the first frame, the ground truth and search region are selected manually, and this frame is used to train the template. PCA and ResNet are used to extract the deep features of the ground truth, and linear scale space theory [45][46][47] is used to extract the 3D edge features of the search region. To obtain a feature that contains more information, the above two features are convolved to obtain the DECF. The ICF is used to suppress the tracking drift caused by background clutter; the response map obtained after the ICF is used to locate the target. The RPM is used to adapt to changes in the target's size during its movement. After the location and scale of the target in the next frame are predicted, the parameters of the ICF are updated. Therefore, DC-HVT maintains good tracking performance against the background clutter challenge.

PCA Dimensionality Reduction
The input of the pretrained ResNet50 is usually a one-band grayscale image or a three-band RGB image. Because the HSIs used in this work have 16 bands, they cannot be directly input into the network. Therefore, PCA [19,20,48] is used to reduce the 16-band experimental HSIs to single-band images to meet the input requirements of the network.
Let X be the data sample such that X = (x_1, x_2, x_3, . . . , x_p), x_i ∈ R^{1×l}, where p represents the number of pixels of the HSI and l represents the number of bands (here, l = 16). The decentralized data are given as

ψ = X − X̄,

where ψ is the matrix after decentralization and X̄ is the average pixel value for each band. Furthermore, a covariance matrix K, carrying significant location information, is constructed as

K = (1/p) ψ^T ψ,

where the superscript T denotes the transpose operation. Through eigenvalue decomposition, the eigenvalues of K and the corresponding eigenvectors are obtained as

K ν = τ ν,

where ν represents an eigenvector and τ represents the corresponding eigenvalue. The eigenvalues are then sorted to obtain the largest eigenvalue τ_max and the corresponding eigenvector ν_max, and X is reduced in dimensionality as

X_p = ψ ν_max,

where X_p is the matrix after dimensionality reduction. As shown in Figure 2, after PCA dimensionality reduction, the amount of HSI data is reduced and the result is a single-band image.
Figure 2. The results after PCA dimensionality reduction.
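The PCA reduction above can be summarized in a few lines of NumPy. This is a minimal sketch, assuming the search-region cube is stored as an (H, W, 16) array; the function name is illustrative and not part of the original implementation.

```python
import numpy as np

def pca_to_single_band(hsi):
    """Reduce an (H, W, 16) hyperspectral cube to a single band via PCA.

    The function name and the (H, W, bands) layout are illustrative; only the
    first principal component is kept, as described above.
    """
    h, w, l = hsi.shape
    X = hsi.reshape(-1, l).astype(np.float64)        # p pixels x l bands
    psi = X - X.mean(axis=0, keepdims=True)          # decentralization
    K = psi.T @ psi / psi.shape[0]                   # l x l covariance matrix
    tau, nu = np.linalg.eigh(K)                      # eigenvalues in ascending order
    nu_max = nu[:, -1]                               # eigenvector of the largest eigenvalue
    Xp = psi @ nu_max                                # projection onto the first component
    return Xp.reshape(h, w)                          # single-band image
```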

Deep Features
ResNet [21] adopts identity (skip) connections, which cleverly bypass the deep network weights to achieve identity mapping [49]. The ResNet structure not only speeds up training but also ensures that the training accuracy is not affected by the increase in network depth. It also alleviates the problem of network degradation.
In this paper, ResNet50 is used to extract the deep features from the HSVs. The architecture consists of 50 layers, including 49 convolutional layers and one fully connected layer. The first stage performs input preprocessing, and the next four stages (convolutional units 2, 3, 4, and 5) consist of 3, 4, 6, and 3 bottleneck blocks, respectively.
In the field of target tracking, the spatial information is used to achieve accurate target localization, while the semantic information contained in deep features can enhance the robustness of the tracking algorithm. The spatial information is already provided by 3D edge features mentioned in Section 3.3. The deep features, extracted by res3d_branch2c in the pretrained ResNet50 network, are used to make up for the lack of semantic information.
X_p is the input of ResNet50, and E is the feature extracted by ResNet50. Figure 3 shows the first 128 channels of the deep features obtained in the experiment. It may be noted that the size of the deep features is m × n × r, where m represents the number of rows, n represents the number of columns, and r is the number of channels. Our algorithm uses the deep features extracted by res3d_branch2c, so the output feature size of this layer is 28 × 28 × 512.
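As an illustration of this step, the sketch below extracts comparable mid-level features with a pretrained ResNet50 in PyTorch. The paper's implementation uses MatConvNet and the res3d_branch2c layer; the use of torchvision's layer2, the input resizing, the channel replication, and the normalization are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# The paper's implementation uses MatConvNet; torchvision's layer2 is used here
# as a rough counterpart of res3d_branch2c (for a 224x224 input its output is
# also 28x28x512). Requires a recent torchvision with the weights enum API.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                           resnet.maxpool, resnet.layer1, resnet.layer2)

def deep_features(xp):
    """xp: (H, W) single-band image after PCA; returns a (512, 28, 28) feature map."""
    x = torch.from_numpy(xp).float()
    x = (x - x.mean()) / (x.std() + 1e-8)            # simple normalization (assumption)
    x = x[None, None].repeat(1, 3, 1, 1)             # replicate the band to 3 channels
    x = F.interpolate(x, size=(224, 224), mode='bilinear', align_corners=False)
    with torch.no_grad():
        E = stem(x)                                  # (1, 512, 28, 28)
    return E[0]
```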

3D Edge Features
As described earlier, an HSI is a three-dimensional cube consisting of two spatial dimensions and one spectral dimension. Feature-extraction methods based on RGB images [50] need to operate on the bands of HSIs independently, which ignores the relationships between bands. Moreover, HSIs have many bands and a large amount of data, so using a CNN [51] for feature extraction requires a large amount of computation.
An HSI is not only a three-dimensional data cube but also a three-dimensional discrete function, so the problem of obtaining the gradient of an HSI in three directions can be solved by computing partial derivatives of the three-dimensional discrete function. We use the derivative of the image to represent its gradient. In the spatial directions, the edge of the target is more obvious when the absolute value of the gradient increases [52,53], which facilitates target localization. In the spectral direction, the spectral curve of the target differs from that of the background, which can be observed through derivative differences. Therefore, when the target and the background are spatially similar, the derivative differences in the spectral direction help distinguish the target from the background. Hence, three-dimensional feature-extraction techniques are required. According to linear scale space theory [45][46][47], any derivative of the scale space can be computed by convolution with the corresponding derivative of the Gaussian kernel. Hence, the derivative of an HSI in each direction can be obtained by solving the derivative of the Gaussian function in the corresponding direction.
In a 3D HSI, the Gaussian function can be expressed as

G(x, y, z) = (1 / ((2π)^{3/2} σ^3)) exp(−(x^2 + y^2 + z^2) / (2σ^2)), (x, y, z) ∈ W,

where x and y represent the spatial dimensions, z denotes the spectral dimension, σ denotes the standard deviation of the normal distribution, and W denotes a w × w × w window. The first-order partial derivatives of the Gaussian function with respect to x, y, and z, denoted as G_x, G_y, and G_z, respectively, are given as

G_x = ∂G/∂x = −(x/σ^2) G(x, y, z),
G_y = ∂G/∂y = −(y/σ^2) G(x, y, z),
G_z = ∂G/∂z = −(z/σ^2) G(x, y, z).

In this paper, the search area of the HSI is denoted by H, H ∈ R^{u×v×l}, where u and v represent the width and height of the search region, respectively, and l denotes the number of bands. The first-order partial derivatives of H with respect to x, y, and z, denoted as I_x, I_y, and I_z, respectively, are given as

I_x = H * G_x, I_y = H * G_y, I_z = H * G_z,

where * denotes convolution. The first-order derivative of the Gaussian function is employed to obtain the edge detection results in the three directions. As the edge features in different directions have different effects on the image, fusion by simple weighted averaging or with static (nonadaptive) weights results in blurred edges.
This paper proposes a method to determine the fusion weights based on the change in the derivative values. The derivative of the image represents the gradient of the image. The larger the absolute value of the gradient, the more obvious the edge of the target [52,53], which facilitates locating the target. The gradient in the spectral direction represents the differences between adjacent bands, which helps identify the target when the target and the background are spatially similar. Therefore, by calculating the sum of the derivatives of each pixel along each edge direction, the proportion of the derivatives along the different edge directions can be obtained. These proportions are used as weights for the corresponding edges to realize adaptive fusion of the image features, and the adaptive weights adjust automatically when the background clutter changes. Figure 4 shows the distribution of a central pixel point and its surrounding pixels within I_x, I_y, I_z in the HSI.
This weight is related to a cube of size 3 × 3 × 3 centered on the pixel. Excluding the center pixel, whose weight is being calculated, there are 26 pixels around the center pixel in 3D space; therefore, the upper limit of the summation is set to 26. It may be noted that, toward the edges of the matrix, the neighborhood values are zero-padded. The adaptive weights of the edge features in the different directions are given as

φ_c = Σ_{j=1}^{26} |I_x(j)| / (Σ_{j=1}^{26} |I_x(j)| + Σ_{j=1}^{26} |I_y(j)| + Σ_{j=1}^{26} |I_z(j)|),
ϕ_c = Σ_{j=1}^{26} |I_y(j)| / (Σ_{j=1}^{26} |I_x(j)| + Σ_{j=1}^{26} |I_y(j)| + Σ_{j=1}^{26} |I_z(j)|),
ϑ_c = Σ_{j=1}^{26} |I_z(j)| / (Σ_{j=1}^{26} |I_x(j)| + Σ_{j=1}^{26} |I_y(j)| + Σ_{j=1}^{26} |I_z(j)|),

where φ, ϕ, and ϑ denote the fusion weights in the x-direction, y-direction, and z-direction, respectively, and I_x(j), I_y(j), and I_z(j) denote the derivative values at the j-th neighbor of the c-th element. The weighted fusion of the multi-directional edge detection results can then be denoted as

Q_c = φ_c I_x(c) + ϕ_c I_y(c) + ϑ_c I_z(c),

where Q denotes the 3D edge features of the HSI, and Q_c represents the c-th element in Q. Figure 5 shows the edge features in the 16 bands. As illustrated in Figure 5, the edges of the target are obvious and contain a lot of detailed information. In the last eight bands, the edge features of the target gradually become clearer.
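A minimal sketch of this 3D edge feature extraction is given below, assuming SciPy's Gaussian derivative filters realize I_x, I_y, and I_z; the value of σ and the helper names are illustrative, while the weights follow the 26-neighbor proportion rule described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def edge_features_3d(H, sigma=1.0):
    """Minimal sketch of the 3D edge feature with direction-adaptive fusion.

    H: (u, v, l) search-region cube. sigma and the function name are illustrative;
    the weights follow the 26-neighbor proportion rule described in the text.
    """
    H = H.astype(np.float64)
    # First-order Gaussian derivatives along x (rows), y (columns), z (bands).
    Ix = gaussian_filter(H, sigma, order=(1, 0, 0))
    Iy = gaussian_filter(H, sigma, order=(0, 1, 0))
    Iz = gaussian_filter(H, sigma, order=(0, 0, 1))

    # Sum of |derivative| over the 26 neighbors of each voxel (zero-padded borders):
    # a 3x3x3 box sum minus the center value.
    def neighbor_sum(A):
        return uniform_filter(np.abs(A), size=3, mode='constant') * 27 - np.abs(A)

    sx, sy, sz = neighbor_sum(Ix), neighbor_sum(Iy), neighbor_sum(Iz)
    total = sx + sy + sz + 1e-12
    phi, varphi, vartheta = sx / total, sy / total, sz / total   # adaptive weights
    Q = phi * Ix + varphi * Iy + vartheta * Iz                   # weighted fusion
    return Q
```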

Deep Edge Convolution Feature
To ensure that the fused image contains multiple features, the feature fusion is implemented by convolution. ResNet50 extracts the deep features of the image, and these features are used as convolution kernels to retain the relevant local features.
As shown in Figure 6, the fusion process of DECF proceeds in two stages. In stage I, the deep features are equally divided into 32 groups, each of size 28 × 28 × 16, denoted as E_i, i ∈ [1, 32]. In stage II, the edge features Q ∈ R^{u×v×l} are used as the input, and E_i is used as the bootstrap convolution kernel to convolve Q. The output of the convolution layer is given as

Z_i = Q * E_i, i = 1, 2, . . . , 32,

where Z denotes the DECF, Z_i is a channel of Z, * denotes convolution, and E_i is the i-th group of deep features. Moreover, the size of Q is equal to the size of the search area. In particular, because the size of the search area varies with the size of the target bounding box, the spatial scale of Q is also not fixed. The resulting feature map is an edge feature guided by the deep features, which preserves detailed information and keeps the target visible. Figure 7 shows the DECF after fusion, which contains both edge information and semantic information.
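The grouped convolution fusion can be sketched as follows, assuming the 512-channel deep features are split into 32 consecutive groups of 16 channels; this is an illustrative sketch rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def decf(Q, E):
    """Sketch of the DECF fusion: grouped deep features convolve the edge features.

    Q: (u, v, 16) 3D edge features of the search region (NumPy array).
    E: (512, 28, 28) deep-feature tensor; the channel ordering of the 32 groups
    of 16 channels is an assumption. A valid (unpadded) convolution is used, so
    the output is spatially smaller than the search region.
    """
    q = torch.from_numpy(Q).float().permute(2, 0, 1)[None]   # (1, 16, u, v)
    kernels = E.reshape(32, 16, 28, 28)                      # 32 kernels with 16 input channels
    with torch.no_grad():
        Z = F.conv2d(q, kernels)                             # (1, 32, u-27, v-27)
    return Z[0]
```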

Improved Context Filter
In the traditional context filter [38], the initial sample corresponding to the circulant matrix A_0 is taken as the positive sample. The four regions above, below, to the left of, and to the right of the target are regarded as the context regions and are represented as negative samples A_i. The negative samples A_i are constrained to produce a response of 0 during training, which enables the tracker to effectively discriminate the target from the background. Hence, the objective function is expressed as

min_ω ||A_0 ω − y||_2^2 + λ_1 ||ω||_2^2 + λ_2 Σ_{i=1}^{4} ||A_i ω||_2^2, (16)

where ω is the trained correlation filter, A_0 denotes the image of the target region after cyclic shifts, A_i is the image of the background region after cyclic shifts, y is the label matrix, and λ_1, λ_2 are the regularization factors. As shown in Equation (16), the context filter adopts a constrained strategy that trains the positive samples to have high response values and the negative samples to have low response values. Based on the diagonalization property of circulant matrices, the solution for ω is computed in the frequency domain as

ω̂ = (â_0^* ⊙ ŷ) / (â_0^* ⊙ â_0 + λ_1 + λ_2 Σ_{i=1}^{4} â_i^* ⊙ â_i), (17)

where a_0 represents the original image block of the target area, a_i represents the original image block of the background area, ˆ denotes the Fourier transform, (·)^* denotes the conjugate, and ⊙ denotes the element-wise product. During training, four regions around the target are selected as the background regions. It may be noted that the selected regions are not targeted and cannot effectively eliminate the background information. Moreover, the same suppression weight is applied to all of the target's contextual information, and the method does not take into account the degree of background interference on the target.
To address these problems, the ICF is proposed, which introduces an interference factor to characterize the contextual information. The interference factor is based on the curve of the filtered response map and evaluates the influence of the context on the tracking target. Furthermore, the area around the target is divided into eight regions that are comprehensively sampled and ranked based on the interference factor. The top four background regions with the highest influence on the target are selected for suppression. Then, the suppression weights are computed adaptively so that background information with stronger interference is suppressed more strongly than that with weaker interference.
As shown in Figure 8, there are eight neighborhoods, A_1 ∼ A_8, located at the top, bottom, left, right, and the four diagonal areas. Compared with the traditional context filter, we also add the four diagonal areas as negative samples. It may be noted that A_0 is the target region, which is used as the positive sample. An ideal tracking response map has a single peak at the center of the target and a smooth background area. In general, however, due to external factors such as background clutter and illumination changes, some background response values can be high, leading to tracking drift. In this regard, the interference factor β_i is used to assess the extent to which the background affects the target, and is computed as

β_i = F_max / F_i, (18)

where F_max represents the peak of the response map after correlation filtering in the A_0 region, and F_i represents the peak of the response map after correlation filtering in the A_i region. Based on Equation (18), the interference factors β_1 ∼ β_8 for the eight sampling regions shown in Figure 8 are obtained. Furthermore, β_1 ∼ β_8 are ranked in ascending order and the top four are selected; a top-ranked β_i means that its corresponding A_i interferes more with the target, so a higher weight is used to suppress that interference. Hence, the weight ζ_i is computed as

ζ_i = 1 − β_i, if 0 < β_i ≤ 1; ζ_i = 0, if β_i > 1, (19)

so that ζ_i increases monotonically as β_i decreases. As shown in Equation (19), a value of β_i > 1 indicates that the background region has little influence on the target region, and hence no weight is assigned. When 0 < β_i ≤ 1, as β_i tends to zero, the background region has a greater influence on the target region and a higher weight is assigned. Hence, the objective function in Equation (16) can be reformulated as

min_ω ||A_0 ω − y||_2^2 + λ_1 ||ω||_2^2 + Σ_{i=1}^{4} ζ_i ||A_i ω||_2^2. (20)

Solving it gives

ω̂ = (â_0^* ⊙ ŷ) / (â_0^* ⊙ â_0 + λ_1 + Σ_{i=1}^{4} ζ_i â_i^* ⊙ â_i). (21)

The response map obtained after the ICF is shown in Figure 9. As is evident, the response of the target region is distinct and the background clutter is effectively suppressed. It may be noted that the trained filter is denoted as ω̂. The regularization factor λ_1 is used to prevent overfitting; in this experiment, λ_1 was set to 0.0001. After the target position is predicted, the filter is updated as

ω̂_t = (1 − θ) ω̂_{t−1} + θ ω̂, (22)

where t denotes the t-th frame of the input image sequence and θ represents the learning rate. The larger the value of θ, the faster the filter is updated. In this experiment, we empirically set θ to 0.02.
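A single-channel sketch of the ICF training step is given below. The interference factors follow Equation (18); the piecewise weight rule and the use of a target-only filter to measure the peak responses are simplifying assumptions rather than the exact implementation.

```python
import numpy as np

def train_icf(a0, patches, y, lam1=1e-4):
    """Single-channel sketch of ICF training in the frequency domain.

    a0: target patch A0; patches: list of the eight context patches A1..A8;
    y: Gaussian label map. The peak responses are measured here with a filter
    trained on the target alone, and zeta uses an assumed piecewise rule that
    matches the behavior described in the text (zero when beta > 1, larger as
    beta tends to zero).
    """
    A0, Y = np.fft.fft2(a0), np.fft.fft2(y)
    w0 = np.conj(A0) * Y / (np.conj(A0) * A0 + lam1)          # target-only filter
    f_max = np.real(np.fft.ifft2(w0 * A0)).max()
    betas = []
    for ai in patches:
        Ai = np.fft.fft2(ai)
        f_i = np.real(np.fft.ifft2(w0 * Ai)).max()
        betas.append(f_max / (f_i + 1e-12))                   # interference factor
    top4 = np.argsort(betas)[:4]                              # four strongest interferers
    num = np.conj(A0) * Y
    den = np.conj(A0) * A0 + lam1
    for i in top4:
        zeta = max(0.0, 1.0 - betas[i])                       # assumed weight rule
        Ai = np.fft.fft2(patches[i])
        den = den + zeta * np.conj(Ai) * Ai
    return num / den                                          # trained filter (frequency domain)

def update_filter(w_prev, w_new, theta=0.02):
    """Linear interpolation update of the filter between frames."""
    return (1 - theta) * w_prev + theta * w_new
```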

Adaptive Scale Estimation
Scale changes are frequently caused by the target's movement during tracking. Even if the scale of the target box is fixed, deformation or occlusion of the target will cut off target information or introduce background information, affecting the tracking accuracy. To resolve these issues, this paper proposes an adaptive scaling method based on an RPM. The RPM-based sliding-window scaling starts from the scale of the target box in the previous frame and generates target boxes of different sizes as

S_T = (M + i·ε) × (N + j·ε), (23)

where S_T is the size of a candidate target box, and M and N represent the length and width of the target box of the previous frame, respectively. It may be noted that ε is an even number, ε ∈ [−2, 2]. Moreover, the ranges of the integers i and j are empirically set to 1, 2, and 3. Different sizes of target boxes are obtained by using Equation (23). The scale of the target box with the highest response value is chosen as the scale of the target for the current frame.
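A small sketch of the candidate-size enumeration in Equation (23) follows; the function name is illustrative, and discarding non-positive sizes is an added safeguard not stated in the text.

```python
def candidate_sizes(M, N):
    """Candidate target-box sizes for the RPM sliding-window scaling.

    M, N: length and width of the previous frame's box. epsilon ranges over the
    even numbers in [-2, 2] and i, j over 1, 2, 3, as stated above; discarding
    non-positive sizes is an added safeguard.
    """
    sizes = set()
    for eps in (-2, 0, 2):
        for i in (1, 2, 3):
            for j in (1, 2, 3):
                m, n = M + i * eps, N + j * eps
                if m > 0 and n > 0:
                    sizes.add((m, n))
    return sorted(sizes)

# Each candidate size is evaluated with the trained filter, and the size whose
# response map has the highest peak is kept as the target scale for this frame.
```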

Experimental Results and Analysis
This section discusses the experimental setup and data collection for validating the proposed algorithm. In addition, the qualitative and quantitative analyses of the proposed and state-of-the-art algorithms are also presented.

Experimental Setting
The algorithms developed and analyzed in this study are implemented in MATLAB R2021b on a workstation with an Intel(R) Core(TM) i7-12700H CPU @ 2.30 GHz, 16 GB of RAM, and an RTX 3060 GPU. The algorithm attains a processing speed of 3.5 frames per second. The MatConvNet toolkit is used to extract the deep features from the ResNet50 network.
In this research, we use six different video sequences for the performance analysis of our tracking algorithm. The sequences are all from the publicly available hyperspectral dataset in [17]. To test the ability of the algorithm against background clutter (BC), we selected four sequences with the BC challenge. To analyze the generality of the algorithm, the selected sequences also contain the illumination variation (IV), motion blur (MB), occlusion (OCC), scale variation (SV), and out-of-plane rotation (OPR) challenges. The RGB images of these video sequences are shown in Figure 10, and Table 1 presents the detailed information of the six selected video sequences.

The publicly available dataset has a total of 35 video sets, each consisting of a hyperspectral video and a visible video that are registered pixel to pixel. All video sequences are taken with a 16-band hyperspectral camera covering wavelengths of 470-620 nm (nanometers). The hyperspectral camera adopts the snapshot VIS sensor produced by IMEC, and the bandwidth of each band is around 10 nm. The camera can shoot video at up to 180 frames per second, whereas all videos in the public dataset are shot at 25 frames per second. The dataset can be downloaded from www.hsitracking.com (accessed on 1 January 2020).

As shown in Table 1, six sets of HSI sequences are used as the test sequences. The target and target size are determined manually at the initial frame. The initial size of the target is presented in the fourth row of Table 1. The second row of Table 1 indicates the number of frames in each image sequence, and the third row indicates the size of each image sequence. The fifth row of Table 1 represents the challenges faced by each sequence.
The sequence Coin consists of 149 frames containing a large number of coins, which cause BC and MB while the target moves. The sequence Fruit consists of 552 frames, in which the color of the background is similar to that of the target, causing BC and OCC during movement. The sequence Pedestrian consists of 306 frames, in which the walking person moves from the tree shade to a clearing, causing SV and IV. The sequence Kangaroo consists of 117 frames, in which the kangaroo jumps, producing SV and OPR. The sequence Drive consists of 341 frames, in which the background becomes cluttered as the man moves, causing SV and BC. The sequence Forest consists of 530 frames, in which the target is affected by OCC and BC from the trees while moving.

Qualitative Comparison
In this experiment, we compare the performance of our algorithm with that of other hyperspectral target trackers, including MFI-HVT [18], MHT [17], DeepHKCF [15], CNHT [16], and the context, edge, and RES baselines. In MFI-HVT, multiple features are used instead of a single feature. The MHT approach extracts the material information of the target by using SSHMG to distinguish targets from backgrounds of similar color. In the DeepHKCF technique, features are extracted by a trained deep convolutional network, and ROI mapping is employed to improve robustness and computational efficiency. In CNHT, features are extracted by a double-layer convolutional network to obtain discriminative information. In the RES approach, the features extracted from the dimensionally reduced image via ResNet50 are used to track the target. To verify the effectiveness of the improved context filter and the 3D edge features, two comparison algorithms, named context and edge, are used. Different from our algorithm, the context algorithm uses CACF, and the edge algorithm only uses the 3D edge features in the feature-extraction module.
The results of the proposed algorithm and seven other algorithms discussed in this paper, over the six test sequences, are summarized in Figures 11-16.
In Figure 11, the background is filled with similar coins, making it difficult to track the target accurately, and the coins are pinched and moved by the fingers throughout the sequence, causing the target to be partially obscured. DeepHKCF does not adapt well to the background clutter, and it drifts throughout the tracking process. The edge and context algorithms track robustly throughout the Coin sequence, showing good performance against the background clutter challenge.
In Figure 12, the fruit moves above the leaves, causing a change in size and making tracking difficult. MHT, MFI-HVT, and the proposed algorithm take advantage of the spectral characteristics of the target and perform well on this sequence. However, CNHT introduces too much background clutter during tracking, leading to tracking failure at frame 153.

In Figure 13, the pedestrian walks from the shadows into the sunlight, causing the pedestrian to become smaller and smaller. Methods such as MHT and DeepHKCF do not have a target-box estimation module, resulting in drifting of the target box after frame 224.
In Figure 14, there is some background interference due to the rapid jumping of the kangaroo and the similarity of the tracked kangaroo with other kangaroos. Most of the trackers perform well on this sequence as most of them use target features.
In Figure 15, the target in the Drive sequence moves over a cluttered background and deforms due to changes in direction. During tracking, the target changes frequently, so the target boxes of all trackers do not adapt well to these changes. However, at frame 599, our tracker overlaps perfectly with the ground truth.

Figure 15. Qualitative outcomes for the Drive sequence.
In Figure 16, the target walks in front of the trees, and a portion of the forest causes occlusion of the target. As a result, MFI-HVT and DeepHKCF, which use deep features, lose the target from frame 338 onward. However, our tracker is able to accurately locate the target and adapt to its changes.

Figure 16. Qualitative outcomes for the Forest sequence.

Quantitative Comparison
In this section, we compare the precision and success rate of the proposed algorithm with those of the seven other algorithms. The precision is defined as the proportion of frames in which the deviation between the center positions of the tracking and ground-truth target boxes is not higher than a certain threshold. Similarly, the success rate is defined as the proportion of frames in which the overlap between the tracking and ground-truth target boxes is not lower than a certain threshold. Tables 2 and 3 show the precision and success rate values for the eight algorithms. It can be observed that our algorithm significantly improves the tracking performance. Figure 17 shows the precision and success rate curves of the algorithms on all sequences, where a larger area under the curve represents a higher value. Figures 18-21 present the experimental findings related to BC, OCC, SV, and OPR, respectively.

As shown in Tables 2 and 3, the proposed algorithm ranks in the top two for almost all indicators. Overall, it achieves a precision of 0.941 and a success rate of 0.696, an improvement of 0.4% and 2.4%, respectively, compared with MHT, which is the current state of the art. The edge algorithm does not show strong robustness over the whole test due to its use of a single feature. Owing to the use of DECF and ICF, our algorithm yields better results than the other algorithms against the BC challenge. The performances of the proposed algorithm are summarized in Figure 18 and the tables. Specifically, our algorithm achieves 91.8% precision and 71.4% success rate against the BC challenge, which are significant improvements over MHT and the context algorithm. MFI-HVT obtains the second-highest precision of 91.7% because of its use of multiple features, and the context algorithm achieves the second-highest success rate of 70.9%. The success rate of the context algorithm is 0.5% lower than ours and its precision is 0.2% lower than ours, because its filter is not improved. Compared with the RES algorithm, the success rate of our algorithm is improved by 3.4% and the precision by 0.7%; compared with the edge algorithm, the success rate is improved by 1.3% and the precision by 0.6%. These two sets of experiments show that our algorithm with DECF is more effective than the algorithms that use a single feature alone. MFI-HVT shows a poor performance of 77.5% precision and 47.4% success rate on the OPR challenge; the overall performance indicates that the MFI-HVT algorithm does not have strong robustness. Additionally, as is evident from Figure 19, when the target is obscured, our algorithm has a success rate 1.3% lower than MHT but ranks first in terms of precision. Although the consideration of material features in MHT facilitates adaptive target recognition, it fails to estimate the scale accurately. As shown in Figure 20, our algorithm outperforms the other algorithms, owing to the adaptive scale estimation, even when the target is affected by deformation. As shown in Figure 21, when the target is affected by OPR, the precision is only 1.2% lower and the success rate only 4.3% lower than those of MHT.
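For reference, the two metrics can be computed as in the sketch below; the 20-pixel center-error threshold and the 0.5 overlap threshold are the values commonly used in tracking benchmarks and are assumptions here, since the section does not fix them.

```python
import numpy as np

def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    return np.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                    (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def overlap(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(results, gts, thresh=20):
    """Fraction of frames whose center error does not exceed the threshold."""
    return float(np.mean([center_error(r, g) <= thresh for r, g in zip(results, gts)]))

def success(results, gts, thresh=0.5):
    """Fraction of frames whose overlap is not below the threshold."""
    return float(np.mean([overlap(r, g) >= thresh for r, g in zip(results, gts)]))
```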

Conclusions
This paper proposes an algorithm based on DECF and ICF for HSV-based target tracking. The proposed DECF is composed of both the 3D edge features and the deep features of the HSIs; it can extract representations of targets whose color is similar to that of the background. The use of the ICF ensures that the tracker remains robust even under BC challenges. Extensive experiments have been conducted on different HSV sequences to demonstrate the superior performance of the proposed algorithm. In future work, the dimensionality reduction process of HSIs will be further investigated to better utilize the spectral information for extracting target features.