Article

Siam Deep Feature KCF Method and Experimental Study for Pedestrian Tracking

Di Tang, Weijie Jin, Dawei Liu, Jingqi Che and Yin Yang

1 College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
2 China Aerodynamics Research and Development Center, High Speed Aerodynamic Institute, Mianyang 621000, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(1), 482; https://doi.org/10.3390/s23010482
Submission received: 2 November 2022 / Revised: 1 December 2022 / Accepted: 6 December 2022 / Published: 2 January 2023
(This article belongs to the Section Intelligent Sensors)

Abstract

The tracking of a particular pedestrian is an important issue in computer vision to guarantee societal safety. Due to the limited computing performance of unmanned aerial vehicle (UAV) systems, the Correlation Filter (CF) algorithm has been widely used to perform the task of tracking. However, it has a fixed template size and cannot effectively solve the occlusion problem. Thus, a tracking-by-detection framework was designed in the current research. A lightweight YOLOv3-based (You Only Look Once version 3) model with Efficient Channel Attention (ECA) was integrated into the CF algorithm to provide deep features. In addition, a lightweight Siamese CNN with Cross Stage Partial (CSP) blocks provided feature representations learned from massive face images, guaranteeing the target similarity in data association. As a result, a Deep Feature Kernelized Correlation Filter method coupled with a Siamese-CSP network (Siam-DFKCF) was established to increase the tracking robustness. From the experimental results, it can be concluded that the anti-occlusion and re-tracking performance of the proposed method was increased. The tracking accuracy metrics, Distance Precision (DP) and Overlap Precision (OP), reached 0.934 and 0.909, respectively, on our test data.

1. Introduction

Since the beginning of the 21st century, the public security problems encountered by various countries have been increasing. With the rapid development of technology, public security crises have become more hidden [1]. Fortunately, the development of computer vision has changed many traditional fields of human technological activity [2,3,4], especially in security fields such as public security [5,6]. Installing surveillance cameras in dense venues is an effective way to reduce crime and ensure social security [7]. To reduce the number of fixed cameras and continuously supervise moving pedestrians’ misbehaviors, airborne unmanned aerial vehicle (UAV) systems with pedestrian tracking have come to be used for real-time tracking. Unlike the tracking methods used with traditional surveillance cameras, a UAV pedestrian tracking system requires a more lightweight network with fast-running abilities.
Most state-of-the-art pedestrian tracking algorithms use the tracking-by-detection framework. The framework can be considered an estimation problem composed of pedestrian detection and data association [8]. In the former, when a video sequence is obtained, the object must be detected frame by frame for the association. In the latter, the detected objects in different frames are linked together using data association strategies, which depend on features such as appearance. Yu et al. [9] showed that in the tracking-by-detection framework, detection quality can seriously affect the performance of pedestrian tracking. Therefore, to improve the performance of pedestrian tracking, an accurate detector is necessary. Most traditional pedestrian detection algorithms [10,11] utilize sliding windows to traverse the entire image to locate the positions of objects and then extract robust features of the image, such as the Scale Invariant Feature Transform (SIFT), the Histogram of Oriented Gradients (HOG), and so on. After that, the features are sent into classifiers based on traditional machine learning methods such as Adaptive Boosting (AdaBoost) [12] or Support Vector Machines (SVM) [13]. However, these algorithms rely on manually designed classification features, which are not suitable for multi-scale detection in complex backgrounds. In recent years, with the rapid development of Convolutional Neural Networks (CNNs), which can reduce the image to a form that can be easily processed without losing vital features for prediction, CNN-based image-processing methods have been widely used in object detection and classification [14,15,16]. Since then, deep detection models have been widely developed. They can be divided into two categories: one-stage networks and two-stage networks [17,18]. Regions with CNN features (RCNN) [19] is a typical two-stage object detection method which uses a selective search method to find candidate boxes. The Fast-RCNN [20] and Faster-RCNN [21] methods were proposed to further improve performance, and it was also suggested that using EdgeBoxes instead of the selective search method would reduce the time for candidate box proposals. However, the detection speed of two-stage networks is far inferior to that of one-stage networks. Nowadays, YOLO [22] is the most popular one-stage object detection method, which treats object detection as a regression task. YOLOv3 [23] predicts the coordinates of candidate bounding boxes, the object class, and the class confidence. It has been widely used in vehicle [24], pedestrian [25,26], fire [27], and even medical cell detection [28,29]. However, the huge computational cost of YOLOv3 makes it difficult to achieve real-time detection in small airborne UAV systems. Therefore, a more careful allocation of computing resources within YOLOv3 is an effective way to improve its computational efficiency.
The real-time deep tracking method has been widely used in many areas; however, its application in UAV systems is challenging because of their limited computing power. Fortunately, Correlation Filter (CF)-based tracking methods have attracted increasing interest due to their high computational speed and high operating efficiency. Dense sampling in these methods is approximated by generating a circulant matrix, in which each row denotes a vectorized sample. With such a representation, a regression model can be efficiently solved in the Fourier domain, yielding reliable regression coefficients at the expense of losing some information and a slight reduction in accuracy. Bolme et al. [30] presented a new type of correlation filter, the Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame. Subsequently, Henriques et al. proposed the Circulant Structure of Tracking-by-detection with Kernels (CSK) [31], which provided a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform. The same authors then derived the Kernelized Correlation Filter (KCF) [32], which, unlike other kernel algorithms, has the same complexity as its linear counterpart. However, the KCF trades accuracy for computation speed. It has a fixed template size, which makes it vulnerable to occlusion. Furthermore, the HOG feature used in the KCF makes it easy for the tracker to lose its target. Considering these cases, a Siamese Convolutional Neural Network [33] (Siamese CNN) is used in the proposed method. It consists of two identical convolutional neural networks, each capable of learning the hidden representation of an input vector; they work in parallel and their outputs are compared at the end, usually through a cosine distance, which gives the similarity between the two images.
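To make the circulant-matrix idea concrete, the following minimal NumPy sketch trains a single-channel correlation filter in the Fourier domain and locates the response peak, in the spirit of MOSSE/KCF [30,32]. It is an illustration only (single frame, linear kernel, no cosine window or multi-channel features), and the helper names are ours, not the papers’ implementations.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired correlation response: a 2-D Gaussian peaked at the patch centre."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def train_filter(patch, lam=1e-2):
    """Ridge regression solved element-wise in the Fourier domain (single sample, linear kernel)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(gaussian_label(*patch.shape))
    return (G * np.conj(F)) / (F * np.conj(F) + lam)   # closed-form solution per frequency

def respond(H, patch):
    """Correlate a new patch with the learned filter; the response peak gives the target shift."""
    response = np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
    return response, np.unravel_index(np.argmax(response), response.shape)

# Learn on an initial grayscale patch, then locate the (synthetically shifted) target.
patch0 = np.random.rand(64, 64)
H = train_filter(patch0)
_, peak = respond(H, np.roll(patch0, 3, axis=1))
```

The entire training and detection step reduces to element-wise operations on FFTs, which is why CF trackers remain attractive for computationally constrained UAV platforms.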
In the literature, few studies have been conducted to design a lightweight network with fast-running abilities to track a moving body even though it may be occluded. Thus, the main contribution of this paper is a Siam Deep Feature KCF method with high computational efficiency for pedestrian tracking. Dummy tracking experiments were carefully designed and performed to study the influences of moving speed, rotating speed, and occlusion on tracking accuracy and efficiency. Thereafter, pedestrian tracking experiments were performed to verify its performance in real-time tracking. The proposed method has been validated and is considered suitable for unmanned aerial vehicle systems due to its lightweight design. The structure of this paper is as follows: Section 2.2 introduces the object detection algorithm, Section 2.3 describes the data association method, and Section 3 and Section 4 present the experimental results and the conclusion.

2. Materials and Methods

2.1. DFKCF Method Coupled with the SiamCSP

To port the pedestrian tracking method onto airborne UAV systems, a Siam-DFKCF method is proposed. In this method, the frame topic is published by the Usb_Cam node, which mainly publishes image frames (/img_frame) and time stamps (/time_stamp) to other nodes. A lightweight YOLOv3 node, YOLO-ECA, subscribes to the image topic (/img_frame) and the time stamp topic (/time_stamp). The pedestrian is then detected, and the Roi topic (/roi_initial) is published. Thereafter, the tracking node KCF subscribes to the Roi topic, tracks the pedestrian, predicts the pedestrian location, and publishes the corresponding image topic (/roi_predict). After that, a lightweight Siamese CNN node, SiamCSP, subscribes to both the Roi and image topics, then calculates the image similarity (/similarity) between the previous two topics. During the tracking, a motor drive node, Gimbal_Motor, is used to execute the tracking command. The ROS (Robot Operating System [34]) Master registers the topics published by each node, and the subscribing nodes obtain them through the Master, as shown in Figure 1.
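As an illustration of the node wiring just described, the following minimal rospy sketch shows how a SiamCSP-style node could subscribe to /img_frame, /roi_initial, and /roi_predict and publish /similarity. The message types (sensor_msgs/Image, sensor_msgs/RegionOfInterest, std_msgs/Float32) and the compute_similarity() helper are assumptions rather than the paper's actual interfaces.

```python
import rospy
from sensor_msgs.msg import Image, RegionOfInterest
from std_msgs.msg import Float32

def compute_similarity(img, roi_a, roi_b):
    # Hypothetical placeholder: the real node would crop both ROIs and run the SiamCSP network.
    return 1.0

class SiamCSPNode:
    def __init__(self):
        rospy.init_node("siamcsp")
        self.initial_roi = None
        self.predicted_roi = None
        # Subscribe to the detector output and the KCF prediction.
        rospy.Subscriber("/roi_initial", RegionOfInterest, self.on_initial)
        rospy.Subscriber("/roi_predict", RegionOfInterest, self.on_predict)
        rospy.Subscriber("/img_frame", Image, self.on_frame)
        # Publish the keyframe similarity consumed by the re-extraction logic.
        self.sim_pub = rospy.Publisher("/similarity", Float32, queue_size=1)

    def on_initial(self, roi):
        self.initial_roi = roi

    def on_predict(self, roi):
        self.predicted_roi = roi

    def on_frame(self, img):
        if self.initial_roi is None or self.predicted_roi is None:
            return
        sim = compute_similarity(img, self.initial_roi, self.predicted_roi)
        self.sim_pub.publish(Float32(data=sim))

if __name__ == "__main__":
    SiamCSPNode()
    rospy.spin()
```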
The process of this Siam-DFKCF algorithm includes four main stages: Feature extraction, Feature tracking, Feature re-extraction, and Execution stage.
(1) Feature extraction: The Usb_Cam node publishes the frame to the YOLO-ECA node, and the YOLO-ECA node detects and extracts the received frame to obtain the feature of the target (/roi_initial), which is received by the KCF node and the SiamCSP node.
(2) Feature tracking: The tracking box (/roi_predict), including both size and the relative location, is predicted by the KCF node through correlation filter processing. The topic about the location is published to the Gimbal_Motor node so that the camera can track the pedestrian in real time. Thereafter, the received feature (/roi_initial) is compared with the predicted result (/roi_predict) by the SiamCSP node to obtain the similarity (/similarity).
(3) Feature re-extraction: When the image similarity (/similarity) is below the minimum threshold, go to step (1) to recalculate the correlation feature model in the KCF node. Otherwise, go to the next step.
(4) Execution stage: The relative position of (/roi_predict) in the previous frame and the current frame is calculated and then transferred to the Gimbal_Motor node to perform tracking.
The schematic diagram of the Siam-DFKCF algorithm is shown in Figure 2.
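The four stages can be condensed into the following Python sketch. The injected callables (detect, kcf_init, kcf_track, similarity, gimbal) stand in for the YOLO-ECA, KCF, SiamCSP, and Gimbal_Motor nodes of Figure 1 and are placeholders rather than real APIs; the 0.7 threshold mirrors the empirical value reported in Section 3.1.3.

```python
def track_sequence(frames, detect, kcf_init, kcf_track, similarity, gimbal, thr=0.7):
    roi_initial = None
    for frame in frames:
        if roi_initial is None:
            # (1) Feature extraction: detect the pedestrian and (re)initialise the filter.
            roi_initial = detect(frame)                         # /roi_initial
            kcf_init(frame, roi_initial)
            continue
        # (2) Feature tracking: the KCF predicts the new box from the correlation response.
        roi_predict = kcf_track(frame)                          # /roi_predict
        # (3) Feature re-extraction: low similarity means the target is lost or occluded,
        #     so the detector is queried again on the next frame.
        if similarity(frame, roi_initial, roi_predict) < thr:   # /similarity
            roi_initial = None
            continue
        # (4) Execution: command the gimbal from the predicted box position.
        gimbal(roi_predict)
```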

2.2. An Improved YOLOv3 Based on Efficient Channel Attention

YOLOv3 [23] is a one-stage network based on a regression method, which extracts features and directly predicts and classifies input images. Unlike two-stage networks, it does not need to generate a large number of candidate windows, and it has excellent recognition speed and detection accuracy. Specifically, the traditional YOLOv3 first resizes the input image to 416 × 416 pixels and then feeds it into deep neural networks for training. The 416 × 416 pixel image is divided into S × S grids, and each grid cell is responsible for predicting the pedestrians that fall within it. In addition, 26 × 26 feature maps are usually fused with 52 × 52 feature maps via up-sampling. Similarly, the 13 × 13 feature maps are fused with the 26 × 26 maps via up-sampling in the traditional multi-scale feature fusion. This multi-scale feature fusion and Darknet-53 [35] constitute the feature extraction network of YOLOv3. Detection accuracy can be greatly improved by combining deep and shallow features. However, several challenges remain when the network is applied to pedestrian detection. Therefore, the Feature Extraction Network, Feature Pyramid Network (FPN), and Detection Network are combined to establish an Efficient Channel Attention YOLOv3 method (YOLO-ECA), which is shown in Figure 3.
(1) Feature Extraction Network:
In Figure 3, the backbone feature extraction network reshapes the input image into a 608 × 608 × 3 RGB image, which is then fed to the CBM. The CBM represents a complete convolutional layer, including three operations: Convolution (Conv), Batch Normalization (BN), and the Mish activation function. The resulting 128 × 152 × 152 feature is then fed to the SCSPBody feature extraction network, as shown in Figure 4. To further improve the accuracy and efficiency of the feature extraction network, SqueezeNet [36] and the Cross Stage Partial Network (CSPNet) are introduced into the backbone of the Squeeze Cross Stage Partial Body (SCSPBody). The CSPNet is the backbone network of YOLOv4 [37], which enhances the learning capacity of a CNN, while SqueezeNet is a smaller CNN architecture that requires less communication across servers during distributed training while achieving equivalent accuracy. Furthermore, it requires less bandwidth to export a new model from the cloud to an autonomous car or UAV and is more feasible to deploy on FPGAs and other hardware with limited memory.
SCSPBody divides the feature maps of the input feature layer into two parts and then concatenates them through the cross-stage hierarchical structure. Specifically, SCSPBody1 is composed of two Conv 3 × 3, one Conv 1 × 1, four residual units, and four fire units, as shown in Figure 4a. Firstly, the input feature layer with a size of 128 × 152 × 152 is convolved by Conv 256 × 3 × 3/2 to obtain feature maps with a size of 256 × 76 × 76. Secondly, the network is divided into two parts: one part uses Conv 128 × 3 × 3 to generate feature maps with a size of 128 × 76 × 76, and the other part uses four residual units for feature extraction. The number of channels is adjusted to 128 by Conv 128 × 1 × 1, and then four fire units are used to continue extracting features to obtain 128 × 76 × 76 feature maps. Conv-BN-LeakyReLU 16 × 1 × 1, Conv-BN-LeakyReLU 64 × 1 × 1, and Conv-BN-LeakyReLU 64 × 3 × 3 are applied to each fire unit. Finally, the 128 × 76 × 76 feature maps of the two parts are concatenated as the output layer (256 × 76 × 76) using Equation (1)
$x_{\ell} = H[x_{fire}, x_{Conv}]$,
where $x_{fire}$ is the output feature of Conv 128 × 1 × 1 and $x_{Conv}$ is the output feature of the fire unit. Similarly, SCSPBody2 is composed of two Conv 3 × 3, one Conv 1 × 1, and eight fire units, as shown in Figure 4b. Firstly, the input feature layer (256 × 76 × 76) is convolved to obtain the 512 × 38 × 38 feature maps. Secondly, the network is divided into two parts: one part uses Conv 256 × 3 × 3 to generate 256 × 38 × 38 feature maps, and the other part uses Conv 128 × 1 × 1 to adjust the number of channels. Then four fire units are used to extract features to obtain 128 × 38 × 38 feature maps, which are further processed to obtain 256 × 38 × 38 feature maps. Conv-BN-LeakyReLU 16 × 1 × 1, Conv-BN-LeakyReLU 64 × 1 × 1, and Conv-BN-LeakyReLU 64 × 3 × 3 are applied to the first four fire units; Conv-BN-LeakyReLU 32 × 1 × 1, Conv-BN-LeakyReLU 128 × 1 × 1, and Conv-BN-LeakyReLU 128 × 3 × 3 are applied to the second four fire units. Finally, the 256 × 38 × 38 feature maps of the two parts are concatenated as the 512 × 38 × 38 output layer. In the residual units, the Conv 64 × 1 × 1 layer compresses the number of channels of the feature layer $\ell$, and then Conv 128 × 3 × 3 is used to enhance feature extraction and expand the number of channels. The feature layers $x_{\ell-1}$ and $H(x_{\ell-1})$ are connected by a shortcut. Finally, $x_{\ell}$ is defined in Equation (2) [38].
$x_{\ell} = H(x_{\ell-1}) + x_{\ell-1}$,
The fire units reduce the amount of computation during model training and reduce the size of the model file, which is more convenient for model saving and transmission. The operations of the Squeeze channel $S_{1\times1}$ and the Expand channels $E_{1\times1}$ and $E_{3\times3}$ are defined in Equation (3) [39].
$S_{1\times1} = E_{1\times1}/4 = E_{3\times3}/4$,
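A PyTorch sketch of a single fire unit with the channel ratio of Equation (3) is given below. It follows SqueezeNet's fire module [36], with the Conv-BN-LeakyReLU composition taken from the text; the exact wiring inside SCSPBody is an assumption, not the paper's released code.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, expand_ch):           # e.g. expand_ch = 64 -> squeeze_ch = 16
        super().__init__()
        squeeze_ch = expand_ch // 4                  # Equation (3): S_1x1 = E_1x1/4 = E_3x3/4
        def cbl(cin, cout, k):                       # Conv-BN-LeakyReLU block
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.1, inplace=True),
            )
        self.squeeze = cbl(in_ch, squeeze_ch, 1)     # squeeze channel
        self.expand1 = cbl(squeeze_ch, expand_ch, 1) # expand 1x1 channel
        self.expand3 = cbl(squeeze_ch, expand_ch, 3) # expand 3x3 channel

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1(s), self.expand3(s)], dim=1)  # 2*expand_ch channels

# e.g. the 16/64/64 fire unit of SCSPBody1 applied to a 128 x 76 x 76 feature map
y = Fire(128, 64)(torch.randn(1, 128, 76, 76))       # -> (1, 128, 76, 76)
```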
The Feature Extraction Network of the YOLO-ECA model is listed in Table 1.
(2) Feature Pyramid Network (FPN)
In deep convolutional neural networks, the low-level (high-resolution) feature layer contains more detailed information, and the high-level (low-resolution) feature layer contains more semantic information. As the network layer gradually deepens, the detailed information continues to decrease, while the semantic information continues to increase. To achieve multi-scale object detection, the feature pyramid network fuses high-level semantic information with low-level detailed information of different layers, which can improve feature extraction capabilities and the detection accuracy of small objects.
To obtain multi-scale semantic information about a pedestrian, motivated by the works of [40,41,42], a feature pyramid network structure is adopted in this paper, as shown in Figure 5. The multi-scale prediction process is as follows. Firstly, the large feature layer (LFL0), medium feature layer (MFL0), and small feature layer (SFL0) are effective feature layers extracted from the backbone network of the YOLO-ECA model. Secondly, the feature layer SFL-ECA1 is obtained after the Efficient Channel Attention (ECA) operation on feature layer SFL0, and then feature layer SFL-ECA1 is fused with feature layer MFL0 via up-sampling to generate feature layer MFL1. The feature layer MFL-ECA1 is obtained after the ECA operation, and then MFL-ECA1 is fused with feature layer LFL0 via up-sampling to obtain feature layer LFL1. Finally, feature layer LFL-ECA1 is obtained after the Efficient Channel Attention operation on feature layer LFL1, and then the feature layer LFL-ECA1 is fused with feature layer MFL1 via down-sampling to generate feature layer MFL2. The feature layer MFL-ECA2 is obtained after the ECA operation on feature layer MFL2, and then the feature layer MFL-ECA2 is fused with feature layer SFL-ECA1 via down-sampling to obtain feature layer SFL2. The feature layers LFL1, MFL2, and SFL2 are connected to three CBL units for multi-scale prediction, respectively. Feature reuse is further realized by the top-down and bottom-up feature fusion strategies, which can effectively improve the prediction accuracy for a pedestrian.
Motivated by the work of ECA-Net [43], adaptive attention is integrated into the FPN network using the Efficient Channel Attention structure shown in Figure 6. Firstly, the feature layer (FL0) is extracted from the backbone network of the YOLO-ECA or fused by the FPN. Secondly, the feature layer FL-GAP0 is obtained after the global average pooling (GAP) operation on feature layer FL0, and then the Efficient Channel Attention weight FL-Weight is obtained by a Conv1D operation and Sigmoid activation. Finally, feature layer FL-ECA0 is calculated by multiplying FL0 by FL-Weight.
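A PyTorch sketch of the ECA block in Figure 6, together with one illustrative FPN fusion step, is given below. The 1-D convolution kernel size and the omission of channel-reduction convolutions are simplifying assumptions on our part.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1-D conv across channels -> sigmoid -> rescale."""
    def __init__(self, k=3):                               # kernel size k is an assumed value
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                  # x: (B, C, H, W), i.e. FL0
        w = x.mean(dim=(2, 3))                             # FL-GAP0: global average pooling -> (B, C)
        w = torch.sigmoid(self.conv(w.unsqueeze(1)).squeeze(1))  # Conv1D + Sigmoid -> FL-Weight
        return x * w.view(x.size(0), -1, 1, 1)             # FL-ECA0 = FL0 * FL-Weight

# One illustrative fusion step of Figure 5: the ECA output of the small (deep) layer is
# up-sampled and concatenated with the medium layer (channel-reduction convs omitted).
sfl0 = torch.randn(1, 1024, 19, 19)                        # SFL0 from the backbone (Table 1)
mfl0 = torch.randn(1, 512, 38, 38)                         # MFL0 from the backbone
mfl1 = torch.cat([nn.Upsample(scale_factor=2)(ECA()(sfl0)), mfl0], dim=1)   # (1, 1536, 38, 38)
```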
(3) Detection Network
The prediction box (/roi_predict) is calculated using the operations of the CBL unit and the post-processing of anchor boxes. The parameters of the three CBL units are listed in Table 2. CBL represents a complete convolutional layer, including three operations: Convolution (Conv), Batch Normalization (BN), and the Leaky Rectified Linear Unit (LeakyReLU) activation function.

2.3. The Siamese CNN with the Cross Stage Partial

The Siamese CNN with the Cross Stage Partial (SiamCSP) neural network is established based on Siamese Net and CSP Net to identify whether the pedestrian is lost or occluded as shown in Figure 7. The SiamCSP is composed of the Feature Enhancement Network, Feature Extraction Layer, and Decision Layer. The parameters of the SiamCSP are listed in Table 3.
(1) Feature Enhancement Network
In the feature enhancement network, POSHE [44] (partially overlapped sub-block histogram equalization) feature enhancement is performed on the frame selection area to achieve better feature extraction on the prediction frame. It is defined in Equation (4)
$S_{k}^{e} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=0}^{k} p_{i}(r_{j})$,
where $n$ is the number of pixels in the entire region, $p_{i}(r_{j})$ represents the probability of level $j$ in region $i$, and $i$ ranges over the regions $a, b, c, \ldots$.
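A simplified NumPy illustration of Equation (4) is given below: the equalization mapping is built from the mean of the sub-regions' cumulative histograms. Real POSHE [44] uses partially overlapped sub-blocks with blocking-effect reduction, which this sketch deliberately omits; the grid size and level count are assumptions.

```python
import numpy as np

def averaged_cdf_mapping(image, grid=(4, 4), levels=256):
    h, w = image.shape
    bh, bw = h // grid[0], w // grid[1]
    cdfs = []
    for by in range(grid[0]):
        for bx in range(grid[1]):
            block = image[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            hist, _ = np.histogram(block, bins=levels, range=(0, levels))
            p = hist / hist.sum()              # p_i(r_j): probability of level j in region i
            cdfs.append(np.cumsum(p))          # inner sum over j = 0..k
    s = np.mean(cdfs, axis=0)                  # S_k^e: average over the regions
    return (s * (levels - 1)).astype(np.uint8) # look-up table for equalization

img = np.random.randint(0, 256, (152, 152), dtype=np.uint8)
enhanced = averaged_cdf_mapping(img)[img]      # apply the mapping pixel-wise
```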
(2) Feature Extraction Layer
To further improve the comparing precision between the initial features (/roi_initial) and the predictive features (/roi_predict), CSPNet is also introduced to the feature extraction network which is shown in Figure 8.
(3) Decision Layer
The proposed SiamCSP uses two fully connected layers as the decision layer, and the Euclidean distance is calculated and defined as the feature keyframe similarity.
(4) Loss function
The cross-entropy classification loss is employed in the training stage. We denote the input of a pair of images as $(image_i, image_j)$ and set up a new parameter $y_{ij}$. Let $y_{ij} = 1$ if $image_i$ and $image_j$ are of the same person; otherwise $y_{ij} = 0$. Then the contrastive loss is adopted for training, as defined in Equation (5)
$L_{Siam} = \sum_{i,j}\left[ y_{ij}\,\rho_{Siam}^{2} + (1 - y_{ij})\max(margin - \rho_{Siam},\, 0)^{2} \right]$
where $\rho_{Siam}^{2} = \|\delta(image_i) - \delta(image_j)\|^{2}$, $\delta$ denotes the feature representation produced by the Siamese CNN, and $margin$ denotes the desired minimal distance between two images.
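A PyTorch sketch of the contrastive loss in Equation (5) is given below; the batch-mean reduction and the random stand-in embeddings are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_i, feat_j, y, margin=1.0):
    """y = 1 for the same person, y = 0 otherwise (Equation (5))."""
    rho = F.pairwise_distance(feat_i, feat_j)                     # Euclidean distance rho_Siam
    loss = y * rho.pow(2) + (1 - y) * F.relu(margin - rho).pow(2)
    return loss.mean()

# Usage with random stand-in embeddings for a batch of image pairs:
fi, fj = torch.randn(8, 128), torch.randn(8, 128)
y = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(fi, fj, y))
```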

2.4. Improved Loss Function of the Siam-DFKCF Model

To better ensure the training accuracy, we introduce two additional loss functions into the original YOLO loss function $L_{yolo}$: the similarity loss $L_{Siam}$ calculated by the SiamCSP and the loss $L_{KCF}$ of the tracking box in the KCF, which is defined in Equation (8). The global loss function is shown in Equation (6). In the YOLO loss function, the Complete Intersection over Union (CIoU) [45] is adopted to replace the mean square error in the current research, as defined in Equation (7).
$Loss = L_{yolo} + L_{Siam} + L_{KCF}$,
$L_{CIoU} = 1 - IoU + \frac{\rho_{CIoU}^{2}(b^{PB}, b^{GT})}{\chi^{2}} + \alpha\nu$,
$L_{KCF} = \frac{1}{m}\sum_{i=1}^{m}\left[\max(s - \bar{s},\, margin) + \max(\rho_{KCF}^{2},\, margin^{2}) - margin\right]$
where $GT$ and $PB$ denote the ground truth and predicted box in Equation (7), $\nu = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{GT}}{h^{GT}} - \arctan\frac{w^{PB}}{h^{PB}}\right)^{2}$ in Equation (7), $w$ and $h$ are the width and height of the box, and $\rho_{CIoU}^{2}$ is the squared Euclidean distance between the center points of the $GT$ and $PB$. $\chi$ is the diagonal length of the smallest closure area that contains both the $GT$ and $PB$. In Equation (8), $s = Size_{current}$ and $\bar{s} = Size_{initial}$ denote the sizes of the KCF tracking box and the initial box, and $\rho_{KCF} = \sqrt{(x_{cur} - x_{int})^{2} + (y_{cur} - y_{int})^{2}}$ is the Euclidean distance between the center points of the KCF predicted box and the initial box.
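For reference, the CIoU term of Equation (7) can be computed as in the following sketch for axis-aligned boxes given as (cx, cy, w, h). It follows the standard CIoU definition [45], with alpha as the usual trade-off weight (not spelled out in the text above); box format and tensor shapes are our assumptions.

```python
import math
import torch

def ciou_loss(pb, gt, eps=1e-7):
    px, py, pw, ph = pb.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)
    # Intersection / union of the two boxes
    inter_w = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(0)
    inter_h = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(0)
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union
    # rho^2: squared centre distance; chi^2: squared diagonal of the smallest enclosing box
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    chi2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term nu and its weight alpha
    nu = (4 / math.pi ** 2) * (torch.atan(gw / (gh + eps)) - torch.atan(pw / (ph + eps))) ** 2
    alpha = nu / (1 - iou + nu + eps)
    return 1 - iou + rho2 / chi2 + alpha * nu

print(ciou_loss(torch.tensor([[50., 50., 20., 40.]]), torch.tensor([[55., 52., 22., 38.]])))
```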

3. Experiments and Results

3.1. Dummy Tracking Experiments Result

3.1.1. Dataset

In our experiment, a dataset was built using a dummy head as a substitute for a real pedestrian. Pictures were taken from a variety of shooting angles, scenes, and lighting environments. As a result, a total of 3266 images of the target were obtained. Then, the dummy's position in the images was labeled using the LabelImg software. A total of 1754 images were randomly selected as the training set, and the other 1502 images were designated as the testing set.

3.1.2. Experiment Arrangement

During the test, a second dummy head was arranged to occlude the moving dummy head in order to compare the Siam-DFKCF and the original KCF algorithms. The experimental environment consisted of an NVIDIA RTX 2070 Super GPU, an Intel i5-9600K CPU, and 16 GB of memory. An MIT motor was used to rotate the dummy head, while a stepper motor was used to translate the platform along a guide rail. All of the motors were controlled by an STM32F429 board. The dummy head was used to model a pedestrian, as shown in Figure 9. To verify the developed KCF and Siam-DFKCF models for the detection and tracking of the moving dummy, experiments were designed and conducted. Moving speed, angular velocity, and tracking distance were studied in the current research. For comparison, the same dataset was used to train the models of the Siam-DFKCF and the original KCF algorithms. Thereafter, the tracking tests were performed under the same motion parameters, as listed in Table 4.

3.1.3. Experimental Analysis

  • Anti-occlusion performance of the algorithm
The tracking accuracy of each frame received from the Usb_Camera node was recorded to study the anti-occlusion performance of the KCF and Siam-DFKCF algorithms. The KCF could not preserve features for a long time, and its HOG features were shallow; thus, tracking loss usually appeared for the KCF algorithm when suffering a long-term occlusion. This tracking loss was avoided by the Siam-DFKCF algorithm, since it could extract and retain the deep features of the missing dummy images. This was confirmed by observations during the test, as compared in Figure 10a,b. The linear speed was arranged to be 1 m/s, the angular speed was π/8 rad/s, and the tracking distance was 5 m. Dummy head A was not occluded until 3.51 s, at which point dummy head B appeared in front of dummy head A until 3.57 s, resulting in the occlusion problem. It was shown that dummy head A was tracked and marked with a blue bounding box by the two algorithms before 3.51 s. However, tracking loss was observed for both the KCF and the Siam-DFKCF algorithms when dummy head A was completely occluded. From then on, the target remained lost for the KCF algorithm. On the other hand, dummy head A was re-tracked by the Siam-DFKCF algorithm. It is estimated that the deep feature of the tracking target in the Siam-DFKCF algorithm enabled the re-tracking, which is useful for real-time tracking.
During tracking, the camera was controlled to focus on dummy head A. The yaw angular velocity of the camera changed as dummy head A moved from the left side to the right side, which is compared in Figure 11. The $X_{error}$ and $Y_{error}$ represent the relative location of the moving target with respect to the image center. They were recorded as the horizontal deviation X and vertical deviation Y shown in the upper right corner of Figure 10. Tracking loss could be recognized from the recorded yaw angular velocity. Six tests were arranged to study the tracking accuracy of the two algorithms with a 0 rad/s angular speed of dummy head A. When an occlusion occurred, the blue box was lost on dummy head B for the two algorithms, resulting in a decrease of the yaw angular velocity between 3.51 s and 4.11 s, as shown in Figure 11a. The yaw angular velocity was found to increase back to a normal value for the Siam-DFKCF algorithm, which meant a re-tracking was successfully performed. On the contrary, the yaw angular velocity was found to decrease to 0 for the KCF algorithm. Another six tests were arranged to study the effect of the angular speed of dummy head A, as shown in Figure 11b. Maximum angular speeds of 0.26 rad/s and 0.35 rad/s were observed for the non-rotating and π/8 rad/s rotating cases, corresponding to a 34.62% increment. It is estimated that occlusion led to a decrease in the angular speed, followed by an abrupt increase of the angular speed between 3.51 s and 4.11 s.
  • Scale adaptation performance of the algorithm
The size of the bounding box was further compared to study the scale adaptation performances of the KCF and Siam-DFKCF algorithms. In the original KCF algorithm, the scale of the extracted image was always the pixel size of the initial target image tracking area; therefore, the relative scale of the target in the image changed according to the relative distance between the camera and the tracking target. If the tracking target moved close or away from the camera, the size of the bounding box should be adaptively changed to capture the exact feature. However, if the size of the bounding box did not change accordingly, the extracted features would be incomplete, or variable background information would be introduced, which might lead to tracking failure.
The width sizes of the tracking box (/roi_predict) of the six cases were compared as illustrated in Figure 12a,b. It was shown that the width of the tracking box (/roi_predict) increased monotonically with velocity in the Siam-DFKCF algorithm. The size of the tracking box gradually increased to a maximum value in the KCF algorithm, as shown in Figure 12b. On the other hand, when the similarity of the feature frame was lower than the threshold, the original tracking box would be replaced by the detection box of YOLO-ECA to ensure a modification of the frame size in real time. Thus, the tracking box was updated and corrected more frequently in the Siam-DFKCF algorithm. It is estimated that the developed Siam-DFKCF algorithm is better suited to handling tracking losses.
To quantitatively analyze the scale adaptation performances of the two algorithms, the subtraction between the initial tracking box size and the final tracking box size is further discussed as shown in Table 5.
The sizes of the initial tracking box and the final tracking box were found to be almost the same for both the KCF and Siam-DFKCF algorithms under 0 rad/s angular velocity. In detail, the relative deviations were found to be 1.72%, 4.55%, and 5.56% at the speeds of 1 m/s, 2 m/s, and 3 m/s, respectively, for the Siam-DFKCF algorithm. Similarly, the relative deviations were found to be 3.33%, 4.76%, and 6.41% for the KCF algorithm. On the other hand, when the angular velocity increased to π/8 rad/s, the size deviations abruptly increased to 20.99%, 28.57%, and 38.75% for the KCF algorithm. On the contrary, the size deviations of the Siam-DFKCF algorithm were reduced to 1.52%, 2.6%, and 4.44%. This implies that better scale adaptation performance was achieved by the Siam-DFKCF algorithm.
  • Loss and re-tracking of Targets
The similarity was calculated in the decision layer as discussed in Section 2.3. It was found that the similarity of the keyframes began to decrease from 0 s. It is estimated that an occlusion appeared as dummy A was occluded by dummy B, as shown in Figure 13. The similarity was related to the non-occluded part of dummy A; thus, the similarity value gradually decreased as the non-occluded part shrank step by step. On the other hand, the similarity value gradually increased when dummy A reappeared over time. Thereafter, the similarity kept decreasing to a minimum value of 0.11 at 4.07 s for the KCF algorithm, since the KCF algorithm has no function to retrieve the target. On the other hand, the similarity increased to a maximum value of 0.99 at 4.50 s for the Siam-DFKCF algorithm, since the Siam-DFKCF algorithm could re-track the target and restore the similarity of the keyframe. It should be noted that a similarity threshold of 0.7 was used empirically in the current research to update the tracking box in the algorithm from 3.42 s; therefore, a 0.42 s delay was observed. When dummy A rotated with an angular velocity of π/8 rad/s, the average similarity of the re-tracking keyframes could only reach a maximum of 0.89, which was less than the 0.99 of the non-rotated case. It is estimated that the threshold should be reduced adaptively to avoid mis-tracking when the moving target has a large angular velocity.

3.2. Pedestrian Tracking Experiments Result

3.2.1. Dataset

In our experiment, a real-time video taken at a school was utilized. There were 121 frames in total, and the frame size was 1920 × 1080. The experiments were run on an Intel i5-9600K CPU at 4.30 GHz, an NVIDIA RTX 2070 Super GPU, and 16 GB of memory.

3.2.2. Evaluation Criterion

We utilized the evaluation tool provided by VOT [46].
(1) Distance Precision (DP):
The Center Location Error (CLE) between the Tracking Box (TB) and the Ground Truth (GT) was taken into account for calculating the error of location. The CLE was calculated by
$CLE = \sqrt{(x_{TB} - x_{GT})^{2} + (y_{TB} - y_{GT})^{2}}$,
The Distance Precision (DP) was calculated by
$DP = Num(CLE < \lambda_{DP}) / Num_{total}$,
where $Num(CLE < \lambda_{DP})$ denotes the number of frames for which the $CLE$ value was less than the threshold $\lambda_{DP}$, and $Num_{total}$ denotes the total number of frames in the whole data set.
(2) Overlap Precision (OP)
The overlap between the Tracking box (TB) and the ground truth (GT) was taken into account for calculating success scores, and the Intersection over Union (IoU) ratio was calculated by
$IoU = area(TB \cap GT) / area(TB \cup GT) > \lambda_{OP}$
When the IoU value was less than 50%, the result was counted as false (false positive, FP); when it exceeded the threshold, it was counted as true (true positive, TP); and frames in which the method could not produce a position were counted as false negatives (FN). This process was repeated for every frame in the data set, and the ratio of successful frames to the total number of frames was taken as the Overlap Precision (OP), which was calculated by
$OP = TP / (FP + TP + FN)$
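A NumPy sketch of the DP/OP computation described above is given below. The box format (x, y, w, h) and the thresholds lambda_DP = 20 pixels and lambda_OP = 0.5 are assumptions (the text only fixes the 50% IoU threshold).

```python
import numpy as np

def center_error(tb, gt):
    """CLE: Euclidean distance between the box centres."""
    return np.hypot(tb[0] + tb[2] / 2 - (gt[0] + gt[2] / 2),
                    tb[1] + tb[3] / 2 - (gt[1] + gt[3] / 2))

def iou(tb, gt):
    ix = max(0, min(tb[0] + tb[2], gt[0] + gt[2]) - max(tb[0], gt[0]))
    iy = max(0, min(tb[1] + tb[3], gt[1] + gt[3]) - max(tb[1], gt[1]))
    inter = ix * iy
    return inter / (tb[2] * tb[3] + gt[2] * gt[3] - inter)

def dp_op(tracked, ground_truth, lam_dp=20.0, lam_op=0.5):
    """tracked may contain None when the tracker produced no box (counted as FN)."""
    pairs = [(t, g) for t, g in zip(tracked, ground_truth) if t is not None]
    cle = [center_error(t, g) for t, g in pairs]
    dp = sum(c < lam_dp for c in cle) / len(ground_truth)
    tp = sum(iou(t, g) >= lam_op for t, g in pairs)
    fp = len(pairs) - tp                     # a box was produced but the overlap was too low
    fn = len(ground_truth) - len(pairs)      # no box produced for that frame
    return dp, tp / (tp + fp + fn)
```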

3.2.3. Experiment Results

A speed of 93 FPS (Frames Per Second) was achieved during tracking, and it was easy to track the target face because only one moving face existed in a clean scene, as shown in Figure 14a. To verify the tracking accuracy under fast movement, the person moved quickly, and the target face remained locked the whole time, as shown in Figure 14b,c. The face was recognized during the whole tracking test even when it was turned around, as shown in Figure 14d,e. Furthermore, it was also recognized when partially covered, as shown in Figure 14f. As a result, we evaluated our algorithm in various special scenarios such as face occlusion, fast movement, and face turning. It is shown that the proposed algorithm has strong robustness. Therefore, the real-time face-tracking accuracy was validated by these tracking experiments. We expect that the proposed algorithm can also be applied in other tracking areas, such as object behavior analysis, anomalous behavior detection, etc.
The performance of the proposed method was compared with the KCF, Tracking-Learning-Detection (TLD) [47], Siamese Region Proposal Network (SiamRPN) [48], and Distractor-aware Siamese Region Proposal Network (DaSiamRPN) [49] algorithms. As shown in Figure 15, all algorithms performed well in the slow-moving environment from the 1st frame to the 11th frame. However, the TLD and KCF algorithms performed poorly on the fast-moving pedestrian from the 11th frame to the 66th frame. When the face was turned from the 66th frame to the 99th frame, the KCF tracking failed due to its fixed scale, and the TLD performed poorly due to its overly large tracking box from the 66th frame onward. The performance of the SiamRPN and DaSiamRPN was unsatisfactory while the face was occluded from the 99th frame to the final frame. On the contrary, the proposed method tracked the face well from the first frame to the final frame. The Overlap Precision and Distance Precision of the methods are shown in Table 6.
It is shown that the highest DP of 0.934 and the highest OP of 0.909 were achieved using the proposed method on our dataset; meanwhile, CPU usage and GPU usage were balanced. The TLD and KCF are traditional filtering algorithms; they had high CPU usage, while their GPU usage was almost zero. On the contrary, the SiamRPN and DaSiamRPN algorithms had excellent performance during tracking but high GPU occupancy, because they are deep learning algorithms. The method we proposed combined the accuracy of the deep learning algorithms with the low GPU occupancy of the filtering algorithms, and a good balance was achieved between CPU and GPU computations. Overall, the tracking algorithm designed in this paper showed good adaptability in dealing with face tracking.

4. Conclusions

A lightweight algorithm with high efficiency is required for an unmanned operating system. Thus, we proposed a deep feature KCF method based on YOLO-ECA and SiamCSP. In this method, the Efficient Channel Attention (ECA) block was introduced into the feature pyramid network (FPN) of YOLO to allocate resources more adaptively. The model focused on the information that was more critical to the current task, and more powerful appearance features were compared to judge similarity for the kernelized correlation filter. In this way, the defect in feature association was solved and the tracking-by-detection framework was better exploited. Additionally, fire units were used to make the backbone feature extraction network of YOLO more lightweight, improving the tracking efficiency and detection accuracy. Finally, two series of experiments demonstrated the effectiveness of the proposed algorithm compared with traditional visual tracking methods. In the dummy tracking experiments, the influences of moving speed, rotating speed, and occlusion on tracking accuracy and efficiency were studied. It was found that the proposed method had better performance in dealing with anti-occlusion, scale adaptation, and re-tracking problems. Thereafter, the method was used in real-time pedestrian tracking, where the best DP (0.934) and OP (0.909) were achieved using the proposed method.
In general, the proposed method has good performance in terms of anti-occlusion and re-tracking. This lightweight method achieves high accuracy while maintaining high efficiency. In the future, multiple-face tracking will be investigated and discussed. We will then focus on a similarity representation method for small objects with less computational burden. Finally, we expect that the proposed method can be applied in airborne UAV systems, which will be discussed in future work.

Author Contributions

Writing, finished the experiment and this paper, W.J. and D.L.; methodology, D.T.; experiments and analysis, W.J. and D.T.; data collection and analysis, Y.Y. and D.L.; labeling the ground truth for each image in our dataset, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the reviewers for their good and valuable suggestions which improved this paper greatly. The work at ZJUT was supported by the National Natural Science Foundation of China (Grants No. 52175279, 51705459) and the Natural Science Foundation of Zhejiang Province (Grant No. LY20E050022).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors wish to thank the editor and reviewers for their suggestions, thank Di Tang for his guidance, and thank Dawei Liu, Jingqi Che, and Yin Yang for their help.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Schuurman, B. Research on terrorism, 2007–2016: A review of data, methods, and authorship. Terror. Political Violence 2020, 32, 1011–1026. [Google Scholar] [CrossRef] [Green Version]
  2. Zhang, X.; Yao, L.; Wang, X.; Monaghan, J.; Mcalpine, D.; Zhang, Y. A survey on deep learning-based non-invasive brain signals: Recent advances and new frontiers. J. Neural Eng. 2021, 18, 031002. [Google Scholar] [CrossRef]
  3. Esteva, A.; Chou, K.; Yeung, S.; Naik, N.; Madani, A.; Mottaghi, A.; Liu, Y.; Topol, E.; Dean, J.; Socher, R. Deep learning-enabled medical computer vision. NPJ Digit. Med. 2021, 4, 5. [Google Scholar] [CrossRef] [PubMed]
  4. Qayyum, A.; Qadir, J.; Bilal, M.; Al-Fuqaha, A. Secure and robust machine learning for healthcare: A survey. IEEE Rev. Biomed. Eng. 2020, 14, 156–180. [Google Scholar] [CrossRef] [PubMed]
  5. Yang, X.; Shu, L.; Chen, J.; Ferrag, M.A.; Wu, J.; Nurellari, E.; Huang, K. A survey on smart agriculture: Development modes, technologies, and security and privacy challenges. IEEE/CAA J. Autom. Sin. 2021, 8, 273–302. [Google Scholar] [CrossRef]
  6. Meneghello, F.; Calore, M.; Zucchetto, D.; Polese, M.; Zanella, A. IoT: Internet of threats? A survey of practical security vulnerabilities in real IoT devices. IEEE Internet Things J. 2019, 6, 8182–8201. [Google Scholar] [CrossRef]
  7. Li, H.; Xiezhang, T.; Yang, C.; Deng, L.; Yi, P. Secure video surveillance framework in smart city. Sensors 2021, 21, 4419. [Google Scholar] [CrossRef]
  8. Hu, W.; Li, X.; Luo, W.; Zhang, X.; Maybank, S.; Zhang, Z. Single and multiple object tracking using log-Euclidean Riemannian subspace and block-division appearance model. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2420–2440. [Google Scholar] [CrossRef]
  9. Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; Yan, J. Poi: Multiple object tracking with high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 36–42. [Google Scholar] [CrossRef] [Green Version]
  10. Jain, V.; Learned-Miller, E. Fddb: A benchmark for face detection in unconstrained settings. In UMass Amherst Technical Report UM-CS-2010-009; University of Massachusetts: Amherst, MA, USA, 2010; Volume 2. [Google Scholar]
  11. Kuo, C.H.; Huang, C.; Nevatia, R. Multi-target tracking by on-line learned discriminative appearance models. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 685–692. [Google Scholar] [CrossRef]
  12. Kim, Y.; Bang, H. Introduction to Kalman filter and its applications. In Introduction and Implementations of the Kalman Filter; IntechOpen: London, UK, 2018; pp. 1–16. [Google Scholar] [CrossRef] [Green Version]
  13. Milan, A.; Rezatofighi, S.H.; Dick, A.; Reid, I.; Schindler, K. Online multi-target tracking using recurrent neural networks. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar] [CrossRef]
  14. Wen, Q.; Luo, Z.; Chen, R.; Yang, Y.; Li, G. Deep learning approaches on defect detection in high resolution aerial images of insulators. Sensors 2021, 21, 1033. [Google Scholar] [CrossRef]
  15. Glowacz, A. Fault diagnosis of electric impact drills using thermal imaging. Measurement 2021, 171, 108815. [Google Scholar] [CrossRef]
  16. Fan, P.; Shen, H.M.; Zhao, C.; Wei, Z.; Yao, J.G.; Zhou, Z.Q.; Fu, R.; Hu, Q. Defect identification detection research for insulator of transmission lines based on deep learning. J. Phys. Conf. Ser. 2021, 1828, 012019. [Google Scholar] [CrossRef]
  17. Masita, K.L.; Hasan, A.N.; Shongwe, T. Deep learning in object detection: A review. In Proceedings of the 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 6–7 August 2020; pp. 1–11. [Google Scholar] [CrossRef]
  18. Miao, X.; Liu, X.; Chen, J.; Zhuang, S.; Fan, J.; Jiang, H. Insulator detection in aerial images for transmission line inspection using single shot multibox detector. IEEE Access 2019, 7, 9945–9956. [Google Scholar] [CrossRef]
  19. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef] [Green Version]
  20. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, USA, 7–12 December 2015; Volume 28. [Google Scholar] [CrossRef]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef] [Green Version]
  23. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  24. Li, Y.; Wang, J.; Huang, J.; Li, Y. Research on Deep Learning Automatic Vehicle Recognition Algorithm Based on RES-YOLO Model. Sensors 2022, 22, 3783. [Google Scholar] [CrossRef]
  25. Zhang, J.; Chen, X.; Li, Y.; Chen, T.; Mou, L. Pedestrian detection algorithm based on improved Yolo v3. In Proceedings of the 2021 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 29–31 July 2021; pp. 180–183. [Google Scholar] [CrossRef]
  26. Yi, Z.; Yongliang, S.; Jun, Z. An improved tiny-yolov3 pedestrian detection algorithm. Optik 2019, 183, 17–23. [Google Scholar] [CrossRef]
  27. Wilson, S.; Varghese, S.P.; Nikhil, G.A.; Manolekshmi, I.; Raji, P.G. A Comprehensive Study on Fire Detection. In Proceedings of the 2018 Conference on Emerging Devices and Smart Systems (ICEDSS), Tiruchengode, India, 2–3 March 2018; pp. 242–246. [Google Scholar] [CrossRef]
  28. Wang, Y.; Liu, Y.; Sun, M.; Zhao, X. Deep-learning-based polar-body detection for automatic cell manipulation. Micromachines 2019, 10, 120. [Google Scholar] [CrossRef] [Green Version]
  29. He, W.; Han, Y.; Ming, W.; Du, J.; Liu, Y.; Yang, Y.; Wang, L.; Wang, Y.; Jiang, Z.; Cao, C.; et al. Progress of Machine Vision in the Detection of Cancer Cells in Histopathology. IEEE Access 2022, 10, 46753–46771. [Google Scholar] [CrossRef]
  30. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar] [CrossRef]
  31. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 702–715. [Google Scholar] [CrossRef]
  32. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [Green Version]
  33. Davide, C. Siamese neural networks: An overview. In Artificial Neural Networks; Springer: Cham, Switzerland, 2021; pp. 73–94. [Google Scholar] [CrossRef]
  34. Wendt, A.; Schüppstuhl, T. Proxying ROS communications—enabling containerized ROS deployments in distributed multi-host environments. In Proceedings of the 2022 IEEE/SICE International Symposium on System Integration (SII), Virtual, 9–12 January 2022; pp. 265–270. [Google Scholar] [CrossRef]
  35. Yi, X.; Song, Y.; Zhang, Y. Enhanced darknet53 combine MLFPN based real-time defect detection in steel surface. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Shenzhen, China, 4–7 November 2020; Springer: Cham, Switzerland, 2020; pp. 303–314. [Google Scholar] [CrossRef]
  36. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv, 2016; arXiv:1602.07360. [Google Scholar] [CrossRef]
  37. Qin, Q.; Hu, W.; Liu, B. Feature projection for improved text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8161–8171. [Google Scholar] [CrossRef]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
  39. Lee, H.J.; Ullah, I.; Wan, W.; Gao, Y.; Fang, Z. Real-time vehicle make and model recognition with the residual SqueezeNet architecture. Sensors 2019, 19, 982. [Google Scholar] [CrossRef] [Green Version]
  40. Ma, Z.; Yang, X.; Zhang, Y. Driver Hand Detection Using Squeeze-and-Excitation YOLOv4 Network. In Proceedings of the 2020 2nd International Conference on Big-Data Service and Intelligent Computation, Xiamen, China, 3–5 December 2020; pp. 25–30. [Google Scholar] [CrossRef]
  41. Kolchev, A.; Pasynkov, D.; Egoshin, I.; Kliouchkin, I.; Pasynkova, O.; Tumakov, D. YOLOv4-based CNN model versus nested contours algorithm in the suspicious lesion detection on the mammography image: A direct comparison in the real clinical settings. J. Imaging 2022, 8, 88. [Google Scholar] [CrossRef] [PubMed]
  42. Xue, H.; Sun, M.; Liang, Y. ECANet: Explicit cyclic attention-based network for video saliency prediction. Neurocomputing 2022, 468, 233–244. [Google Scholar] [CrossRef]
  43. Cui, Z.; Wang, N.; Su, Y.; Zhang, W.; Lan, Y.; Li, A. ECANet: Enhanced context aggregation network for single image dehazing. Signal Image Video Process. 2022, 1–9. [Google Scholar] [CrossRef]
  44. Kim, J.Y.; Kim, L.S.; Hwang, S.H. An advanced contrast enhancement using partially overlapped sub-block histogram equalization. IEEE Trans. Circuits Syst. Video Technol. 2001, 11, 475–484. [Google Scholar] [CrossRef]
  45. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar] [CrossRef]
  46. Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Cehovin, L.; Fernandez, G.; Vojir, T.; Hager, G.; Nebehay, G.; Pflugfelder, R. The visual object tracking vot2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 11–18 December 2015; pp. 1–23. [Google Scholar] [CrossRef] [Green Version]
  47. Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 1409–1422. [Google Scholar] [CrossRef] [Green Version]
  48. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980. [Google Scholar] [CrossRef]
  49. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of ROS nodes of the Siam-DFKCF method.
Figure 2. The schematic diagram of the Siam-DFKCF method.
Figure 3. The structure of improved YOLOv3 based on the Efficient Channel Attention (YOLO-ECA).
Figure 4. The structure of Squeeze Cross Stage Partial Body (SCSPBody) ((a) SCSPBody1, (b) SCSPBody2).
Figure 5. The structure of the Feature Pyramid Network.
Figure 6. The structure of the Efficient Channel Attention block (ECA).
Figure 7. The structure of the Siamese CNN with Cross Stage Partial (SiamCSP).
Figure 8. The structure of the Cross Stage Partial (CSP) block.
Figure 9. Schematic diagram of experimental equipment.
Figure 10. The tracking process of the dummy ((a) the KCF algorithm, (b) the Siam-DFKCF algorithm).
Figure 11. Angle speed of the camera between the Siam-DFKCF and KCF ((a) angular speed of 0 rad/s, (b) angular speed of π/8 rad/s).
Figure 12. The real-time size of the tracking box ((a) angular speed of 0 rad/s, (b) angular speed of π/8 rad/s).
Figure 13. The similarity of the keyframes ((a) angular speed of 0 rad/s, (b) angular speed of π/8 rad/s).
Figure 14. Experimental results of tracking faces ((a) the 11th frame of the test video, (b) the 32nd frame, (c) the 51st frame, (d) the 79th frame, (e) the 92nd frame, (f) the 118th frame).
Figure 15. Results of different tracking algorithms.
Table 1. The Feature Extraction Network of YOLO-ECA.

Type | Filters | Size | Output
Conv | 32 | 3 × 3/2 | 32 × 304 × 304
Conv | 64 | 3 × 3/1 | 64 × 304 × 304
Conv | 64 | 3 × 3/2 | 64 × 152 × 152
Conv | 128 | 3 × 3/1 | 128 × 152 × 152
Conv | 128 | 3 × 3 |
Conv | 256 | 3 × 3/2 | 256 × 76 × 76
Conv | 128 | 1 × 1/1 |
Residual unit | Conv 64, 1 × 1/1; Conv 128, 3 × 3/1 | |
Fire unit | Conv 16, 1 × 1/1; Conv 64, 1 × 1/1; Conv 64, 3 × 3/1 | |
Conv | 256 | 3 × 3 |
Conv | 512 | 3 × 3/2 | 512 × 38 × 38
Conv | 128 | 1 × 1/1 |
Fire unit | Conv 16, 1 × 1/1; Conv 64, 1 × 1/1; Conv 64, 3 × 3/1 | |
Fire unit | Conv 32, 1 × 1/1; Conv 128, 1 × 1/1; Conv 128, 3 × 3/1 | |
Conv | 512 | 3 × 3/2 | 512 × 19 × 19
SPP | DirectConv; Max 5 × 5; Conv; Max 9 × 9; Conv; Max 13 × 13 | | 1024 × 19 × 19
Table 2. The Detection Network of YOLO-ECA.

Type | Filters | Size | Output
Large (Convolutional, BN, LeakyReLU) | 256 | 3 × 3/1 | 256 × 76 × 76
Medium (Convolutional, BN, LeakyReLU) | 512 | 3 × 3/1 | 512 × 38 × 38
Small (Convolutional, BN, LeakyReLU) | 1024 | 3 × 3/1 | 1024 × 19 × 19
Table 3. The parameters of the SiamCSP.

Type | Filters | Size | Output
Resize | | | 3 × 304 × 304
Poshe | | | 3 × 304 × 304
Conv | 64 | 3 × 3 |
Conv | 128 | 3 × 3/2 | 128 × 152 × 152
Conv | 64 | 3 × 3/1 |
Residual unit | Conv 32, 1 × 1/1; Conv 64, 3 × 3/1 | |
Conv | 64 | 3 × 3/1 |
MaxPooling | | 2 × 2 | 128 × 76 × 76
Conv | 128 | 3 × 3 |
Conv | 128 | 3 × 3 | 256 × 38 × 38
Conv | 64 | 3 × 3/1 |
Residual unit | Conv 32, 1 × 1/1; Conv 64, 3 × 3/1 | |
Conv | 128 | 3 × 3/1 |
MaxPooling | | 2 × 2/1 | 256 × 19 × 19
Conv | 512 | 3 × 3 | 512 × 19 × 19
SPP | DirectConv; Max 5 × 5; Conv; Max 9 × 9; Conv; Max 13 × 13 | | 512 × 7 × 7
FC | | |
FC & Sigmoid | | | Similarity
Table 4. The motion parameters for comparison experiments.

Number | Tracking Distance (m) | Linear Velocity (m/s) | Angular Velocity (rad/s)
1 | 5 | 1 | 0
2 | 5 | 1 | π/8
3 | 5 | 2 | 0
4 | 5 | 2 | π/8
5 | 5 | 3 | 0
6 | 5 | 3 | π/8
Table 5. The results of the tracking box of different algorithms.

Detection Method | Difference in size between the initial and final tracking box
Linear velocity (m/s) | 1 | 1 | 2 | 2 | 3 | 3
Angular velocity (rad/s) | 0 | π/8 | 0 | π/8 | 0 | π/8
Siam-DFKCF | 1.72% | 1.52% | 4.55% | 2.6% | 5.56% | 4.44%
Traditional KCF | 3.33% | 20.99% | 4.76% | 28.57% | 6.41% | 38.75%
Table 6. The tracking performance of different algorithms.

Method | DP | OP | FPS | GPU Cost | CPU Cost
TLD | 0.653 | 0.628 | 32 | 0 | 36
KCF | 0.694 | 0.645 | 241 | 0 | 33
SiamRPN | 0.868 | 0.843 | 13 | 43 | 20
DaSiamRPN | 0.917 | 0.892 | 11 | 73 | 40
Proposed method | 0.934 | 0.909 | 93 | 23 | 18

FPS: Frames Per Second.