Siam Deep Feature KCF Method and Experimental Study for Pedestrian Tracking

The tracking of a particular pedestrian is an important issue in computer vision for guaranteeing societal safety. Due to the limited computing performance of unmanned aerial vehicle (UAV) systems, the Correlation Filter (CF) algorithm has been widely used to perform tracking tasks. However, it has a fixed template size and cannot effectively solve the occlusion problem. Thus, a tracking-by-detection framework was designed in the current research. A lightweight YOLOv3-based (You Only Look Once version 3) model with Efficient Channel Attention (ECA) was integrated into the CF algorithm to provide deep features. In addition, a lightweight Siamese CNN with Cross Stage Partial (CSP) provided representations of features learned from massive face images, allowing the target similarity in data association to be guaranteed. As a result, a Deep Feature Kernelized Correlation Filters method coupled with Siamese-CSP (Siam-DFKCF) was established to increase the tracking robustness. The experimental results show that the anti-occlusion and re-tracking performance of the proposed method was improved. The tracking accuracy, measured by Distance Precision (DP) and Overlap Precision (OP), increased to 0.934 and 0.909, respectively, on our test data.


Introduction
Since the beginning of the 21st century, the public security problems encountered by various countries have been increasing, and with the rapid development of technology, public security crises have become more hidden [1]. Fortunately, the development of computer vision has changed many traditional fields of human technological activity [2][3][4], especially security-related fields such as public security [5,6]. Installing surveillance cameras in crowded venues is an effective way to reduce crime and ensure social security [7]. To reduce the number of fixed cameras and continuously supervise moving pedestrians' misbehaviors, airborne unmanned aerial vehicle (UAV) systems with pedestrian tracking have come to be used for real-time tracking. Unlike tracking methods for traditional surveillance cameras, a UAV pedestrian tracking system requires a more lightweight network with fast running speed.
Most state-of-the-art pedestrian tracking algorithms use the tracking-by-detection framework. The framework can be considered an estimation problem composed of pedestrian detection and data association [8]. In the former, when a video sequence is obtained, the object must be detected frame by frame for the association. In the latter, the detected objects in different frames are linked together using data association strategies, which depend on features such as appearance. Yu et al. [9] showed that, in the tracking-by-detection framework, detection quality can seriously affect the performance of pedestrian tracking. Therefore, to improve the performance of pedestrian tracking, an accurate detector is necessary. Most traditional pedestrian detection algorithms [10,11] utilize sliding windows to traverse the entire image to locate object positions and then extract robust features of the image. In the current research, experiments were carefully designed and performed to study the influences of moving speed, rotating speed, and occlusion on tracking accuracy and efficiency. Thereafter, pedestrian tracking experiments were performed to verify the performance of the proposed method in real-time tracking. The proposed method was validated and is considered suitable for unmanned aerial vehicle systems due to its lightweight design. The structure of this paper is as follows: Section 2.2 introduces the object detection algorithm, Section 2.3 describes the data association method, and Sections 3 and 4 present the experimental results and the conclusions.

DFKCF Method Coupled with the SiamCSP
To port the pedestrian tracking method onto airborne UAV systems, a Siam-DFKCF method is proposed. In this method, the frame topic is published by the Usb_Cam node, which mainly publishes image frames (/img_frame) and time stamps (/time_stamp) to other nodes. A lightweight YOLOv3 node, YOLO-ECA, subscribes to the image topic (/img_frame) and the time stamp topic (/time_stamp); the pedestrian is then detected, followed by publishing of the Roi topic (/roi_initial). Thereafter, the tracking node KCF subscribes to the Roi topic, tracks the pedestrian, predicts the pedestrian location, and publishes the corresponding prediction topic (/roi_predict). A lightweight Siamese CNN node, SiamCSP, subscribes to both topics and calculates the image similarity (/similarity) between them. During tracking, a motor drive node, Gimbal_Motor, executes the tracking command. The ROS Master (Robot Operating System [34]) registers the published topics of each node, followed by subscription through the Master, as shown in Figure 1. The process of the Siam-DFKCF algorithm includes four main stages, feature extraction, feature tracking, feature re-extraction, and execution, as listed below.
(1) Feature extraction: The Usb_Cam node publishes the frame to the YOLO-ECA node, and the YOLO-ECA node detects and extracts the received frame to obtain the feature of the target (/roi_initial), which is received by the KCF node and the SiamCSP node.
(2) Feature tracking: The tracking box (/roi_predict), including both the size and the relative location, is predicted by the KCF node through correlation filter processing. The location topic is published to the Gimbal_Motor node so that the camera can track the pedestrian in real time. Thereafter, the received feature (/roi_initial) is compared with the predicted result (/roi_predict) by the SiamCSP node to obtain the similarity (/similarity).
(3) Feature re-extraction: When the image similarity (/similarity) is below the minimum threshold, the process returns to step (1) to recalculate the correlation feature model in the KCF node. Otherwise, it proceeds to the next step.
(4) Execution: The relative position of (/roi_predict) in the previous frame and the current frame is calculated and then transferred to the Gimbal_Motor node to perform tracking.
The schematic diagram of the Siam-DFKCF algorithm is shown in Figure 2.
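To make the node wiring above concrete, the following is a minimal, illustrative sketch (not the authors' code) of how the SiamCSP node could be organized in ROS. It subscribes to the detection ROI from YOLO-ECA (/roi_initial) and the prediction from the KCF node (/roi_predict), and publishes a similarity score (/similarity); the message types and the compute_similarity helper are assumptions made only for illustration.

```python
#!/usr/bin/env python
# Illustrative sketch of the SiamCSP node wiring in ROS (not the authors' code).
# It caches the template ROI from YOLO-ECA, compares each KCF prediction against it,
# and publishes a similarity score. Message types and compute_similarity() are assumptions.
import rospy
from std_msgs.msg import Float32
from sensor_msgs.msg import Image


class SiamCSPNode(object):
    def __init__(self):
        self.roi_initial = None  # latest detection ROI from YOLO-ECA
        self.sim_pub = rospy.Publisher("/similarity", Float32, queue_size=1)
        rospy.Subscriber("/roi_initial", Image, self.on_initial)
        rospy.Subscriber("/roi_predict", Image, self.on_predict)

    def on_initial(self, msg):
        # Cache the template ROI used as the reference branch of the Siamese CNN.
        self.roi_initial = msg

    def on_predict(self, msg):
        # Compare the KCF-predicted ROI against the cached template and publish the score.
        if self.roi_initial is None:
            return
        score = compute_similarity(self.roi_initial, msg)  # hypothetical helper
        self.sim_pub.publish(Float32(data=float(score)))


def compute_similarity(template_msg, predict_msg):
    # Placeholder for the SiamCSP forward pass (feature extraction + distance computation).
    return 1.0


if __name__ == "__main__":
    rospy.init_node("siamcsp")
    SiamCSPNode()
    rospy.spin()
```

Under this wiring, the KCF node would subscribe to /similarity and re-initialize its template from /roi_initial whenever the score drops below the threshold, as in stage (3) above.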

An Improved YOLOv3 Based on Efficient Channel Attention
YOLOv3 [23] is a one-stage network based on a regression method, which extracts features and directly predicts and classifies input images. Compared with two-stage networks, it does not need to generate a large number of candidate windows, and it has excellent recognition speed and detection accuracy. Specifically, the traditional YOLOv3 first resizes the input image to 416 × 416 pixels and then feeds it into deep neural networks for training. The 416 × 416 pixel image is divided into S × S grids, and each grid is responsible for predicting the pedestrian within the image. In addition, the 26 × 26 feature maps are usually fused with the 52 × 52 feature maps via up-sampling; similarly, the 13 × 13 feature maps are fused with the 26 × 26 maps via up-sampling in the traditional multi-scale feature fusion. This multi-scale feature fusion and Darknet-53 [35] constitute the feature extraction network of YOLOv3. Detection accuracy can be greatly improved by combining deep and shallow features. However, several challenges remain when YOLOv3 is applied to pedestrian detection. Therefore, a Feature Extraction Network, a Feature Pyramid Network (FPN), and a Detection Network are combined to establish an Efficient Channel Attention YOLOv3 method (YOLO-ECA), as shown in Figure 3.

(1) Feature Extraction Network
In Figure 3, the backbone feature extraction network reshapes the input image into a 608 × 608 × 3 RGB image, which is then fed to the CBM. The CBM represents a complete convolutional layer, including three operations: the Convolution operation (Conv), Batch Normalization (BN), and the Mish activation function. The resulting 128 × 152 × 152 feature is then fed to the SCSPBody feature extraction network, as shown in Figure 4. To further improve the accuracy and efficiency of the feature extraction network, SqueezeNet [36] and the Cross Stage Partial Network (CSPNet) are introduced into the backbone, forming the Squeeze Cross Stage Partial Body (SCSPBody). CSPNet is the backbone network of YOLOv4 [37] and enhances the learning capacity of the CNN, while SqueezeNet is a smaller CNN architecture that requires less communication across servers during distributed training at equivalent accuracy. Furthermore, it requires less bandwidth to export a new model from the cloud to an autonomous car or UAV and is more feasible to deploy on FPGAs and other hardware with limited memory.
SCSPBody divides the feature maps of the input feature layer into two parts and then concatenates them through a cross-stage hierarchical structure. Specifically, SCSPBody1 is composed of two Conv 3 × 3, one Conv 1 × 1, four residual units, and four fire units, as shown in Figure 4a. Firstly, the input feature layer with a size of 128 × 152 × 152 is convolved by Conv 256 × 3 × 3/2 to obtain feature maps with a size of 256 × 76 × 76. Secondly, the network is divided into two parts: one part uses Conv 128 × 3 × 3 to generate feature maps with a size of 128 × 76 × 76, and the other part uses four residual units for feature extraction. The number of channels is adjusted to 128 by Conv 128 × 1 × 1, and then four fire units are used to continue extracting features to obtain 128 × 76 × 76 feature maps. Conv-BN-LeakyReLU 16 × 1 × 1, Conv-BN-LeakyReLU 64 × 1 × 1, and Conv-BN-LeakyReLU 64 × 3 × 3 are applied in each fire unit. Finally, the 128 × 76 × 76 feature maps of the two parts are concatenated as the output layer (256 × 76 × 76) using Equation (1), where x_fire is the output feature of Conv 128 × 1 × 1 and x_Conv is the output feature of the fire unit. Similarly, SCSPBody2 is composed of two Conv 3 × 3, one Conv 1 × 1, and eight fire units, as shown in Figure 4b.
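The following PyTorch sketch illustrates the fire unit described above (a Conv-BN-LeakyReLU 16 × 1 × 1 squeeze followed by 64 × 1 × 1 and 64 × 3 × 3 expand branches concatenated to 128 channels) and the CSP-style two-branch concatenation of Equation (1). It is a simplified sketch under these assumptions, not the authors' exact network; the residual branch and the full layer ordering of SCSPBody1 are omitted.

```python
# Minimal sketch of the fire unit and the CSP-style two-branch concatenation.
import torch
import torch.nn as nn


def cbl(in_ch, out_ch, k):
    # Conv + BatchNorm + LeakyReLU block, as used inside the fire unit.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )


class FireUnit(nn.Module):
    def __init__(self, in_ch=128, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = cbl(in_ch, squeeze_ch, 1)
        self.expand1x1 = cbl(squeeze_ch, expand_ch, 1)
        self.expand3x3 = cbl(squeeze_ch, expand_ch, 3)

    def forward(self, x):
        s = self.squeeze(x)
        # Concatenating the two expand branches restores 128 output channels.
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)


class CSPConcat(nn.Module):
    """Two parallel branches (x_Conv and x_fire in Equation (1)) concatenated along channels."""
    def __init__(self, branch_a, branch_b):
        super().__init__()
        self.branch_a = branch_a
        self.branch_b = branch_b

    def forward(self, x):
        return torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)


if __name__ == "__main__":
    x = torch.randn(1, 128, 76, 76)
    block = CSPConcat(nn.Sequential(FireUnit(), FireUnit()), cbl(128, 128, 3))
    print(block(x).shape)  # torch.Size([1, 256, 76, 76])
```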
The fire units reduce the amount of computation during model training and reduce the size of the model file, which is more convenient for model saving and transmission. The operations of the squeeze channel S_1×1 and the expand channels E_1×1 and E_3×3 are defined in Equation (3) [39].
The Feature Extraction Network of the YOLO-ECA model is listed in Table 1.
(2) Feature Pyramid Network
In deep convolutional neural networks, the low-level (high-resolution) feature layers contain more detailed information, and the high-level (low-resolution) feature layers contain more semantic information. As the network deepens, the detailed information decreases while the semantic information increases. To achieve multi-scale object detection, the feature pyramid network fuses high-level semantic information with low-level detailed information from different layers, which improves feature extraction capability and the detection accuracy of small objects. To obtain multi-scale semantic information about a pedestrian, motivated by the works of [40][41][42], a feature pyramid network structure is adopted in this paper, as shown in Figure 5. The multi-scale prediction process is as follows. Firstly, the large feature layer (LFL0), medium feature layer (MFL0), and small feature layer (SFL0) are the effective feature layers extracted from the backbone network of the YOLO-ECA model. Secondly, the feature layer SFL-ECA1 is obtained after the Efficient Channel Attention (ECA) operation on feature layer SFL0, and then SFL-ECA1 is fused with feature layer MFL0 via up-sampling to generate feature layer MFL1. The feature layer MFL-ECA1 is obtained after the ECA operation, and then MFL-ECA1 is fused with feature layer LFL0 via up-sampling to obtain feature layer LFL1. Finally, feature layer LFL-ECA1 is obtained after the ECA operation on feature layer LFL1, and then LFL-ECA1 is fused with feature layer MFL1 via down-sampling to generate feature layer MFL2. The feature layer MFL-ECA2 is obtained after the ECA operation on feature layer MFL2, and then it is fused with feature layer SFL-ECA1 via down-sampling to obtain feature layer SFL2. The feature layers LFL1, MFL2, and SFL2 are connected to three CBL units for multi-scale prediction, respectively. Feature reuse is further realized by the top-down and bottom-up feature fusion strategies, which can effectively improve the prediction accuracy of pedestrians.
Motivated by the work of ECA-Net [43], adaptive attention is integrated into the FPN using the Efficient Channel Attention structure shown in Figure 6. Firstly, the feature layer (FL0) is extracted from the backbone network of the YOLO-ECA or fused by the FPN. Secondly, the feature layer FL-GAP0 is obtained after the global average pooling (GAP) operation on feature layer FL0, and then the Efficient Channel Attention weight FL-Weight is obtained by a Conv1D and Sigmoid activation. Finally, feature layer FL-ECA0 is calculated by multiplying FL0 by FL-Weight.
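A minimal PyTorch sketch of the ECA step just described is given below: global average pooling, a 1D convolution across channels, a Sigmoid to obtain FL-Weight, and a channel-wise multiplication with the input feature layer. The kernel size k is an assumed hyperparameter (ECA-Net derives it adaptively from the channel count).

```python
# Minimal sketch of the Efficient Channel Attention (ECA) operation.
import torch
import torch.nn as nn


class ECABlock(nn.Module):
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (B, C, H, W), e.g. FL0
        y = x.mean(dim=(2, 3))                   # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))            # 1D conv across channels -> (B, 1, C)
        w = self.sigmoid(y).squeeze(1)           # channel attention weights, FL-Weight
        return x * w.view(x.size(0), -1, 1, 1)   # FL-ECA0 = FL0 * FL-Weight


if __name__ == "__main__":
    fl0 = torch.randn(2, 256, 76, 76)
    print(ECABlock()(fl0).shape)  # torch.Size([2, 256, 76, 76])
```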


(3) Detection Network
The prediction box (/roi_predict) is calculated using the operations of the CBL unit and the post-processing of the anchor boxes. The parameters of the three CBL units are listed in Table 2. CBL represents a complete convolutional layer, including three operations: the Convolution operation (Conv), Batch Normalization (BN), and the Leaky Rectified Linear Unit (Leaky ReLU) activation function.

The Siamese CNN with the Cross Stage Partial
The Siamese CNN with the Cross Stage Partial (SiamCSP) neural network is established based on Siamese Net and CSP Net to identify whether the pedestrian is lost or occluded as shown in Figure 7. The SiamCSP is composed of the Feature Enhancement Network, Feature Extraction Layer, and Decision Layer. The parameters of the SiamCSP are listed in Table 3.

(1) Feature Enhancement Network
In the feature extraction layer, POSHE [44] feature enhancement is performed on the frame selection area to achieve better feature extraction on the prediction frame. It is defined in Equation (4), where n is the number of pixels in the entire region, p_x(r_j) represents the probability of level j in region x, and x represents one of the regions a, b, c, …, i.
(2) Feature Extraction Layer
To further improve the comparison precision between the initial features (/roi_initial) and the predicted features (/roi_predict), CSPNet is also introduced into the feature extraction network, as shown in Figure 8.


(3) Decision Layer
The proposed SiamCSP uses two fully connected layers as the decision layer, and the Euclidean distance is calculated and defined as the feature keyframe similarity.
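As a rough illustration of this decision step, the sketch below passes each branch's features through two fully connected layers and takes the Euclidean distance between the two outputs as the keyframe comparison measure. The layer sizes, and how this distance is mapped to the bounded similarity score used with the 0.7 threshold later in the paper, are assumptions.

```python
# Illustrative sketch of the decision layer: two shared fully connected layers per branch,
# with the Euclidean distance between the outputs used as the comparison measure.
import torch
import torch.nn as nn


class DecisionLayer(nn.Module):
    def __init__(self, in_dim=1024, hidden_dim=512, out_dim=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, feat_initial, feat_predict):
        a, b = self.fc(feat_initial), self.fc(feat_predict)
        return torch.norm(a - b, dim=1)  # Euclidean distance between the two embeddings


if __name__ == "__main__":
    layer = DecisionLayer()
    print(layer(torch.randn(1, 1024), torch.randn(1, 1024)))
```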

(4) Loss function
The cross-entropy classification loss is employed in the training stage. We denote the input pair of images as (image_i, image_j) and set up a new parameter y_ij: y_ij = 1 if image_i and image_j are of the same person, and y_ij = 0 otherwise. The contrastive loss is then adopted for training, as defined in Equation (5):

L_Siam = Σ_{i,j} [ y_ij · ρ_Siam² + (1 − y_ij) · max(margin − ρ_Siam, 0)² ]      (5)

where ρ_Siam = ‖δ(image_i) − δ(image_j)‖ is the Euclidean distance between the two feature embeddings, δ denotes the feature representation produced by the Siamese CNN, and the margin denotes the desired minimal distance between two images.
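A sketch of this pairwise loss is shown below, assuming the standard contrastive form in which same-person pairs (y_ij = 1) are pulled together and different-person pairs (y_ij = 0) are pushed apart by at least the margin. Here delta_i and delta_j stand for the Siamese feature embeddings δ(image_i) and δ(image_j).

```python
# Sketch of the pairwise contrastive loss used to train the SiamCSP (cf. Equation (5)).
import torch


def contrastive_loss(delta_i, delta_j, y, margin=1.0):
    # rho is the Euclidean distance between the two embeddings.
    rho = torch.norm(delta_i - delta_j, dim=1)
    positive = y * rho.pow(2)                                        # same person
    negative = (1.0 - y) * torch.clamp(margin - rho, min=0).pow(2)   # different person
    return (positive + negative).mean()


if __name__ == "__main__":
    a, b = torch.randn(4, 128), torch.randn(4, 128)
    labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
    print(contrastive_loss(a, b, labels))
```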

Improved Loss Function of the Siam-DFKCF Model
To better ensure the training accuracy, we introduce two additional loss functions alongside the original YOLO loss function L_yolo: the similarity loss L_Siam calculated by the SiamCSP and the tracking box loss L_KCF of the KCF. L_KCF is defined in Equation (8), and the global loss function is shown in Equation (6). In the YOLO loss function, the complete intersection over union (CIoU) [45] is adopted to replace the mean square error in the current research, as defined in Equation (7).
In Equation (6), GT and PB denote the ground truth and the predicted box. In Equation (7), ν = (4/π²) · (arctan(w_GT/h_GT) − arctan(w_PB/h_PB))², where (w, h) are the width and height of a box, ρ_CIoU is the Euclidean distance between the center points of GT and PB, and χ is the diagonal length of the smallest enclosing box that contains both GT and PB. In Equation (8), s = Size(current) and s̄ = Size(initial) denote the sizes of the KCF tracking box and the initial box, and ρ_KCF = √((x_cur − x_int)² + (y_cur − y_int)²) is the Euclidean distance between the center point of the KCF predicted box and that of the initial box.
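The sketch below illustrates how the CIoU term of Equation (7) can be computed, following the standard CIoU definition. The (x_center, y_center, w, h) box format and the α trade-off weight are assumptions; the KCF box penalty of Equation (8) is not reproduced here.

```python
# Hedged sketch of a CIoU loss term (standard definition), for illustration only.
import math
import torch


def ciou_loss(pb, gt):
    """pb, gt: (N, 4) boxes as (x_center, y_center, w, h)."""
    pb_x1, pb_y1 = pb[:, 0] - pb[:, 2] / 2, pb[:, 1] - pb[:, 3] / 2
    pb_x2, pb_y2 = pb[:, 0] + pb[:, 2] / 2, pb[:, 1] + pb[:, 3] / 2
    gt_x1, gt_y1 = gt[:, 0] - gt[:, 2] / 2, gt[:, 1] - gt[:, 3] / 2
    gt_x2, gt_y2 = gt[:, 0] + gt[:, 2] / 2, gt[:, 1] + gt[:, 3] / 2

    inter = (torch.min(pb_x2, gt_x2) - torch.max(pb_x1, gt_x1)).clamp(0) * \
            (torch.min(pb_y2, gt_y2) - torch.max(pb_y1, gt_y1)).clamp(0)
    union = pb[:, 2] * pb[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / union.clamp(min=1e-9)

    # rho^2: squared center distance; chi^2: squared diagonal of the enclosing box.
    rho2 = (pb[:, 0] - gt[:, 0]) ** 2 + (pb[:, 1] - gt[:, 1]) ** 2
    chi2 = (torch.max(pb_x2, gt_x2) - torch.min(pb_x1, gt_x1)) ** 2 + \
           (torch.max(pb_y2, gt_y2) - torch.min(pb_y1, gt_y1)) ** 2

    # nu: aspect-ratio consistency term from Equation (7).
    nu = (4 / math.pi ** 2) * (torch.atan(gt[:, 2] / gt[:, 3]) -
                               torch.atan(pb[:, 2] / pb[:, 3])) ** 2
    alpha = nu / (1 - iou + nu).clamp(min=1e-9)
    return (1 - iou + rho2 / chi2.clamp(min=1e-9) + alpha * nu).mean()


if __name__ == "__main__":
    pred = torch.tensor([[50.0, 50.0, 20.0, 40.0]])
    truth = torch.tensor([[52.0, 48.0, 22.0, 38.0]])
    print(ciou_loss(pred, truth))
```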

Dataset
In our experiment, a dataset was built using a dummy head as a substitute for a real pedestrian. Pictures were taken at a series of shooting angles, in different scenes, and under different lighting conditions. As a result, a total of 3266 images of the target were obtained. The dummy's position in each image was then labeled using the LabelImg software. 1754 images were randomly selected as the training set, and the other 1502 images were designated as the testing set.

Experiment Arrangement
During the test, a dummy head was arranged to occlude a moving dummy head in order to compare the Siam-DFKCF and the original KCF algorithms. The experimental environment was an NVIDIA 2070 Super GPU, an Intel i5-9600K CPU, and 16 GB of memory. An MIT motor was used to rotate the dummy head, while a stepper motor was used to translate the platform along a guide rail. All of the motors were controlled by an STM32F429 board. The dummy head was used to model a pedestrian, as shown in Figure 9. To verify the developed KCF and Siam-DFKCF models for detection and tracking of the moving dummy, experiments were designed and conducted. Moving speed, angular velocity, and tracking distance were studied in the current research. For comparison, the same dataset was used to train the models in the Siam-DFKCF and the original KCF algorithms. Thereafter, the tracking tests were performed under the same moving parameters, as listed in Table 4.


Experimental Analysis
• Anti-occlusion performance of the algorithm
The tracking accuracy of each frame received from the Usb_Cam node was recorded to study the anti-occlusion performance of the KCF and Siam-DFKCF algorithms. KCF cannot preserve features for a long time, and its HOG features are shallow; thus, tracking loss usually appeared for the KCF algorithm under long-term occlusion. This tracking loss was avoided by the Siam-DFKCF algorithm, since it could extract and retain the deep features of the missing dummy images, as proven by the observations during the test and compared in Figure 10a,b. The linear speed was set to 1 m/s, the angular speed was π/8 rad/s, and the tracking distance was 5 m. Dummy head A was not occluded until 3.51 s, at which point dummy head B appeared in front of dummy head A until 3.57 s, resulting in the occlusion problem. Dummy head A was tracked and marked with a blue bounding box by the two algorithms before 3.51 s. However, tracking loss was observed for both the KCF and the Siam-DFKCF algorithms when dummy head A was completely occluded. From then on, the target remained lost for the KCF algorithm, whereas dummy head A was re-tracked by the Siam-DFKCF algorithm. It is estimated that the deep features of the tracking target in the Siam-DFKCF algorithm enabled the re-tracking, which is useful for real-time tracking.
During tracking, the camera was controlled to focus on dummy head A. The yaw angular velocity of the camera changed as dummy head A moved from the left side to the right side, as compared in Figure 11. The X error and Y error represent the relative location of the moving target with respect to the image center; they were recorded as the horizontal deviation X and the vertical deviation Y in the upper right corner of Figure 10. Missed tracking could be recognized from the recorded yaw angular velocity.
Six tests were arranged to study the tracking accuracy of the two algorithms with a 0 rad/s angular speed of dummy head A. When an occlusion occurred, the blue box was lost on dummy head B for both algorithms, resulting in a decrease of the yaw angular velocity between 3.51 s and 4.11 s, as shown in Figure 11a. The yaw angular velocity was found to recover to a normal value for the Siam-DFKCF algorithm, which meant that re-tracking was successfully performed. On the contrary, the yaw angular velocity decreased to 0 for the KCF algorithm. Another six tests were arranged to study the effect of the angular speed of dummy head A, as shown in Figure 11b. Maximum angular speeds of 0.26 rad/s and 0.35 rad/s were observed for the non-rotating and π/8 rad/s rotating cases, respectively, a 34.62% increase. It is estimated that occlusion led to a decrease in the angular speed, followed by an abrupt increase of the angular speed between 3.51 s and 4.11 s.
Figure 11. Angular speed of the camera for the Siam-DFKCF and KCF algorithms ((a) angular speed of 0 rad/s, (b) angular speed of π/8 rad/s).

• Scale adaptation performance of the algorithm
The size of the bounding box was further compared to study the scale adaptation performance of the KCF and Siam-DFKCF algorithms. In the original KCF algorithm, the scale of the extracted image was always the pixel size of the initial target tracking area, while the relative scale of the target in the image changed according to the relative distance between the camera and the tracking target. If the tracking target moved toward or away from the camera, the size of the bounding box should have been adaptively changed to capture the exact features. If the size of the bounding box did not change accordingly, the extracted features would be incomplete or variable background information would be introduced, which might lead to tracking failure.
The widths of the tracking box (/roi_predict) in the six cases were compared as illustrated in Figure 12a,b. The width of the tracking box (/roi_predict) increased monotonically with velocity in the Siam-DFKCF algorithm, whereas the size of the tracking box gradually increased to a maximum value in the KCF algorithm, as shown in Figure 12b. On the other hand, when the similarity of the feature frame was lower than the threshold, the original tracking box was replaced by the detection box of YOLO-ECA to ensure a modification of the frame size in real time. Thus, the tracking box was updated and corrected more frequently in the Siam-DFKCF algorithm. It is estimated that the developed Siam-DFKCF algorithm is better suited to handling tracking losses.

To quantitatively analyze the scale adaptation performance of the two algorithms, the difference between the initial tracking box size and the final tracking box size is further discussed, as shown in Table 5. The sizes of the initial and final tracking boxes were found to be almost the same for both the KCF and Siam-DFKCF algorithms under 0 rad/s angular velocity. In detail, the relative deviations were found to be 1.72%, 4.55%, and 5.56% at speeds of 1 m/s, 2 m/s, and 3 m/s, respectively, for the Siam-DFKCF algorithm, and 3.33%, 4.76%, and 6.41% for the KCF algorithm. On the other hand, when the angular velocity increased to π/8 rad/s, the size deviations abruptly increased to 20.99%, 28.57%, and 38.75% for the KCF algorithm, whereas the size deviations of the Siam-DFKCF algorithm were reduced to 1.52%, 2.6%, and 4.44%. This implies that better scale adaptation performance was achieved by the Siam-DFKCF algorithm.


•
Loss and re-tracking of targets
The similarity was calculated in the decision layer as discussed in Section 2.3. It was found that the similarity of the keyframes began to decrease from 0 s. It is estimated that an occlusion appeared, since dummy A was occluded by dummy B, as shown in Figure 13. The similarity was related to the non-occluded part of dummy A; thus, the similarity value gradually decreased as the non-occluded part shrank step by step, and it gradually increased again when dummy A reappeared over time. The similarity kept decreasing, reaching a minimum value of 0.11 at 4.07 s, for the KCF algorithm, since the KCF algorithm has no function to retrieve the target. In contrast, the similarity increased to a maximum value of 0.99 by 4.50 s for the Siam-DFKCF algorithm, since the Siam-DFKCF algorithm could re-track the target and restore the similarity of the keyframe. It should be noted that a similarity threshold of 0.7 was used empirically in the current research to update the tracking box in the algorithm from 3.42 s; therefore, a delay of 0.42 s was observed. When dummy A rotated with an angular velocity of π/8 rad/s, the average similarity of the re-tracking keyframes could only reach a maximum of 0.89, which was less than the 0.99 of the non-rotated case. It is estimated that the threshold should be reduced adaptively to avoid mis-tracking when the moving target has a large angular velocity.

Dataset
In our experiment, a real-time video taken at a school was utilized. There were 121 frames in total, and the frame size was 1920 × 1080. The experiments were run on an Intel i5-9600K CPU at 4.30 GHz, an NVIDIA 2070 Super GPU, and 16 GB of memory.

Evaluation Criterion
We utilized the evaluation tool provided by VOT [46].

(1) Distance Precision (DP): The Center Location Error (CLE) between the Tracking Box (TB) and the Ground Truth (GT) was taken into account for calculating the location error. The CLE was calculated as the Euclidean distance between the center points of the TB and the GT:

CLE = √((x_TB − x_GT)² + (y_TB − y_GT)²)      (9)

The Distance Precision (DP) was then calculated by

DP = Num(CLE < λ_DP) / Num(total)      (10)

where Num(CLE < λ_DP) is the number of frames in which the CLE value was less than the threshold λ_DP, and Num(total) is the total number of frames in the data set.
(2) Overlap Precision (OP): The overlap between the Tracking Box (TB) and the Ground Truth (GT) was taken into account for calculating the success scores, and the Intersection over Union (IoU) was calculated by

IoU = area(TB ∩ GT) / area(TB ∪ GT)      (11)

When the IoU value was less than 50%, the position was counted as a false positive (FP); otherwise, it was counted as a true positive (TP). When the method could not produce a position, the frame was counted as a false negative (FN). This process was repeated over all frames in the data set, and the ratio of successful frames to the total number of frames was taken as the Overlap Precision (OP):

OP = Num(IoU > λ_OP) / Num(total)      (12)
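For reference, the following sketch computes these metrics over a sequence of boxes: the Center Location Error (CLE), Distance Precision (DP), IoU, and Overlap Precision (OP). Boxes are assumed to be (x, y, w, h) with (x, y) the top-left corner; the IoU threshold of 0.5 follows the text, while the pixel threshold λ_DP (e.g., 20 px) is an assumption.

```python
# Illustrative computation of CLE, DP, IoU, and OP over a tracked sequence.
import numpy as np


def center_location_error(tb, gt):
    tb_c = np.array([tb[0] + tb[2] / 2, tb[1] + tb[3] / 2])
    gt_c = np.array([gt[0] + gt[2] / 2, gt[1] + gt[3] / 2])
    return np.linalg.norm(tb_c - gt_c)


def iou(tb, gt):
    x1, y1 = max(tb[0], gt[0]), max(tb[1], gt[1])
    x2, y2 = min(tb[0] + tb[2], gt[0] + gt[2]), min(tb[1] + tb[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = tb[2] * tb[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def dp_op(tracking_boxes, ground_truths, lambda_dp=20.0, lambda_op=0.5):
    cles = [center_location_error(tb, gt) for tb, gt in zip(tracking_boxes, ground_truths)]
    ious = [iou(tb, gt) for tb, gt in zip(tracking_boxes, ground_truths)]
    dp = np.mean([c < lambda_dp for c in cles])   # fraction of frames with CLE below threshold
    op = np.mean([i > lambda_op for i in ious])   # fraction of frames with IoU above threshold
    return dp, op


if __name__ == "__main__":
    tbs = [(10, 10, 40, 80), (12, 11, 40, 80)]
    gts = [(11, 10, 42, 78), (30, 30, 40, 80)]
    print(dp_op(tbs, gts))
```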

Experiment Results
A speed of 93 FPS (frames per second) was achieved during tracking, and it was easy to track the target face because only one moving face existed in a clean scene, as shown in Figure 14a. To verify the tracking accuracy under fast motion, the person moved quickly, and the target face remained locked the whole time, as shown in Figure 14b,c. The face was recognized during the whole tracking test even when it was turned around, as shown in Figure 14d,e. Furthermore, it was also recognized when partially covered, as shown in Figure 14f. As a result, we evaluated our algorithm against various special scenarios such as face occlusion, fast face motion, and face turning, and the proposed algorithm showed strong robustness. The real-time face-tracking accuracy was therefore validated by these tracking experiments. We expect that the proposed algorithm can also be applied in other tracking areas, such as object behavior analysis and anomalous behavior detection.
The performance of the proposed method was compared with the KCF, Tracking Learning Detection (TLD) [47], Siamese Region Proposal Network (SiamRPN) [48], and Distractor-aware Siamese Region Proposal Network (DaSiamRPN) [49] algorithms. As shown in Figure 15, all algorithms performed well in the slow-moving environment from the 1st frame to the 11th frame. However, the TLD and KCF algorithms performed poorly when the pedestrian moved fast, from the 11th frame to the 66th frame. When the face was turned, from the 66th frame to the 99th frame, the KCF tracking failed due to its fixed scale, and the TLD performed poorly due to its overly large tracking box from the 66th frame onward. The performance of SiamRPN and DaSiamRPN was unsatisfactory while the face was occluded, from the 99th frame to the final frame. On the contrary, the proposed method tracked the face well from the first frame to the final frame. The Overlap Precision and Distance Precision of the methods are shown in Table 6.
The highest DP of 0.934 and the highest OP of 0.909 were achieved using the proposed method in our dataset study; meanwhile, CPU usage and GPU usage were balanced. The TLD and KCF are traditional filtering algorithms: they had high CPU usage, while their GPU usage was almost zero. On the contrary, the SiamRPN and DaSiamRPN algorithms had excellent tracking performance but high GPU occupancy, since they are deep learning algorithms. The proposed method combines the accuracy of the deep learning algorithms with the low GPU occupancy of the filtering algorithms, and a good balance was achieved between CPU and GPU computation. Overall, the face tracking algorithm designed in this paper showed good adaptability in dealing with face tracking.

Conclusions
A lightweight algorithm with high efficiency is required for an unmanned operating system. Thus, we proposed a deep feature KCF method based on YOLO-ECA and SiamCSP. In this method, the Efficient Channel Attention (ECA) block was introduced into the feature pyramid network (FPN) of YOLO to allocate resources more adaptively. The model focused on the information that was more critical to the current task, and more powerful appearance features were compared to judge similarity for the kernelized correlation filter. In this way, the defect in feature association was addressed and the tracking-by-detection framework was better exploited. Additionally, fire units were used to make the backbone feature extraction network of the YOLO model more lightweight, improving tracking efficiency and detection accuracy. Finally, two series of experiments demonstrated the effectiveness of the proposed method compared to traditional visual tracking methods. In the dummy tracking experiments, the influences of moving speed, rotating speed, and occlusion on tracking accuracy and efficiency were studied; the proposed method showed better performance in dealing with anti-occlusion, scale adaptation, and re-tracking problems. Thereafter, the method was applied to real-time pedestrian tracking, where the best DP (0.934) and OP (0.909) were achieved using the proposed method.
In general, the proposed method has good performance in terms of anti-occlusion and re-tracking. This lightweight method achieves high accuracy while maintaining high efficiency. In the future, multiple face tracking will be investigated, and we will focus on a similarity representation method for small objects with less computational burden. Finally, we expect that the proposed method can be applied to airborne UAV systems, which will be discussed in future work.
Acknowledgments: The authors would like to thank the reviewers for their valuable suggestions, which greatly improved this paper.
Funding: The work at ZJUT was supported by the National Natural Science Foundation of China (Grants No. 52175279 and 51705459) and the Natural Science Foundation of Zhejiang Province (Grant No. LY20E050022).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author upon reasonable request.