Scale-Adaptive KCF Mixed with Deep Feature for Pedestrian Tracking

Abstract: Pedestrian tracking is an important research topic in the field of computer vision. Tracking is achieved by predicting the position of a specific pedestrian in each frame of a video. Pedestrian tracking methods include neural network-based methods and traditional template matching-based methods, such as the SiamRPN (Siamese region proposal network), the DASiamRPN (distractor-aware SiamRPN), and the KCF (kernel correlation filter). The KCF algorithm has no scale-adaptive capability and cannot effectively solve the occlusion problem, and because of the many defects of the HOG (histogram of oriented gradient) feature that the KCF uses, the tracking target is easily lost. To address these defects of the KCF algorithm, an improved KCF model, the SKCFMDF (scale-adaptive KCF mixed with deep feature) algorithm, was designed. By introducing deep features extracted by a newly designed neural network and by introducing the YOLOv3 (you only look once version 3) object detection algorithm, which was also improved for more accurate detection, the model was able to achieve scale adaptation and to effectively solve the problems of occlusion and of the defects of the HOG feature. Compared with the original KCF, the success rate of pedestrian tracking under complex conditions was increased by 36%. Compared with the mainstream SiamRPN and DASiamRPN models, it was still able to achieve a small improvement.


Introduction
Pedestrian tracking is an important research topic in the field of computer vision, and it has great application value in scenarios such as intelligent monitoring and pedestrian flow observation. Tracking is achieved by determining the location of a specific pedestrian in each frame of a video. Pedestrian tracking methods include neural network-based methods and traditional template matching-based methods. Among the neural network-based methods, the mainstream approach is to track with a Siamese neural network [1] based on the RPN (region proposal network) [2]. Bo Li and Junjie Yan proposed the Siamese region proposal network (SiamRPN) [3], which differs from the standard RPN in that it extracts candidate regions from correlation feature maps and then encodes the target appearance information from the template branch into the RPN features to distinguish the foreground from the background. However, it is still difficult for the SiamRPN to distinguish between similar objects in an image. To address the lack of model updates and of distractor suppression, Zheng Zhu and Qiang Wang proposed the DASiamRPN (distractor-aware SiamRPN) [4]. There are also pedestrian tracking methods based on template matching. SIFT (scale-invariant feature transform) features [5] can be used to describe the characteristics of pedestrians. Using this feature as a template, the location of pedestrians can be predicted by sliding window matching on the next frame of video. Joao F. Henriques et al. proposed the KCF (kernel correlation filter) algorithm [6] for pedestrian tracking. The KCF algorithm uses pedestrian image information and surrounding background image information to train a target detector to predict the position of a pedestrian in subsequent frames. However, the KCF algorithm has three flaws. The first is its scale problem: the size of the target detector never changes, whereas the size of the pedestrian target in the video changes with its distance from the camera. Thus, the algorithm inevitably tracks the target inaccurately. The second problem is the defect of the HOG (histogram of oriented gradient) feature [7] used by the KCF. The HOG feature uses gradient feature representation, so it is insensitive to pedestrian posture changes and color information, which leads to tracking errors or tracking loss during the tracking process. The third is the occlusion problem. When the pedestrian target is occluded, the detector cannot give the accurate position of the pedestrian target in the next frame of the video.
To address these shortcomings of the KCF algorithm, an improved KCF model that incorporates deep features was designed. The object detection network YOLOv3 (you only look once version 3) [8] is used for pedestrian detection, and the pedestrian image newly detected by YOLOv3 is then used as a new template to train the KCF target detector, which solves the scale change problem. When the HOG feature cannot distinguish between different pedestrians, a convolutional neural network extracts the deep features of the pedestrians for comparison, and these deep features are integrated to determine the location of the pedestrian target. When the pedestrian is occluded and the KCF loses the target, the same network compares the last deep feature stored before the occlusion with the deep features of all pedestrians detected by YOLOv3 after the occlusion ends, and the location of the pedestrian is re-determined according to similarity.
YOLOv3 runs very fast, but its non-maximum suppression (NMS) algorithm has caused many correctly predicted bounding boxes to be removed by mistake. We added the retrieval algorithm to recover the person detection box that was mistakenly removed by NMS, and we replaced NMS with Soft-NMS to further improve the accuracy. Experiments on the PASCAL VOC (Pattern Analysis, Statistical Modeling and Computational Learning, Visual Object Classes) dataset showed that YOLOv3, which uses Soft-NMS and the improved retrieval algorithm, improved the accuracy by approximately 3.1% compared to the original algorithm, while the operating speed did not change much.

KCF with Deep Feature and Adaptive Scale
With the aim of fixing these three flaws of the KCF algorithm, the SKCFMDF (scale-adaptive KCF mixed with deep feature) algorithm is proposed here. This new algorithm solves the three problems of the KCF algorithm. A schematic diagram is shown in Figure 1.

KCF Tracking Algorithm
The KCF algorithm was proposed by Joao F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista in 2014. The KCF algorithm extracts the HOG feature from the image of a tracking target, and then it takes the surrounding images as the training samples to train the target detector. After training, the Gaussian kernel function is used to calculate the correlation response between the HOG feature of the tracking target image and the HOG features of the surrounding images (the image with the highest response value gives the latest position of the tracking target in the frame), and then the algorithm uses the image with the highest response in the frame to retrain the target detector. By using the discrete Fourier transform to convert the above process from the time domain to the frequency domain, the amount of calculation can be greatly reduced.

KCF Scale Adaptation
The KCF has a scale problem. In the KCF algorithm, the scale of the extracted image is always the pixel size of the initial target image tracking area, so if the movement of the target causes the distance from the camera to change, the relative scale of the target in the image also changes. If the size of the target bounding box does not change, the extracted features will be incomplete or variable background information will be introduced, thus leading to a failure of tracking; as such, the KCF has problems with scale changes.
Because the magnitude of the target scale change is not too large, fixing the KCF's scale change problem is not difficult. In a new image frame, the target bounding box predicted by the KCF could be combined with the pedestrian bounding box detected by YOLOv3 to obtain a new scale that matches the size of the tracking target. Then, the new target box could be used as the training template of the KCF target detector so that the scale adaptation of the KCF can be realized. The process is shown in Figure 2.
Electronics 2021, 10, 536
The pedestrian detection framework we use is YOLOv3. YOLOv3 is a neural network that uses Darknet-53 as its backbone network to extract the features of an image. Because the receptive field of each layer in the network is different, YOLOv3 extracts targets of different scales in different layers; a total of three scales are identified, and the MS-COCO (Microsoft Common Objects in Context) dataset [9] was used to train the network.
For the smallest detection scale, YOLOv3 divides the image into 52 × 52 grids because the 52 × 52 size is suitable for commonly seen small-scale objects, and each grid is used as an anchor point. The input image for YOLOv3 is resized to a fixed size, so the grid size is independent of the input image. Three bounding boxes of different shapes are identified at each anchor point. For the middle scale, the model divides the image into 26 × 26 grids as anchor points because the grid size of 26 × 26 is suitable for commonly seen middle-scale objects. For the largest scale, the model divides the image into 13 × 13 grids as anchor points because the grid size of 13 × 13 is suitable for commonly seen large-scale objects. In the smallest-scale recognition, the feature maps used in the middle scale are combined with the feature maps of the largest scale, thereby increasing the accuracy of the smallest-scale recognition. In this way, targets can be recognized at three scales in a similar way to a feature pyramid network [10].
The determination of the new training template of the KCF is realized by the intersection ratio of the KCF prediction box and the detection box of YOLOv3.
As shown in the image in Figure 3a, in a pedestrian video, the pedestrian bounding box detected by the KCF is marked with a red box; in the next frame, which is the image in Figure 3b, the KCF prediction box is still marked in red. Because the woman has walked closer to the camera, the scale of the pedestrian relative to the image has increased, but the scale of the KCF red box has not changed, so the box does not completely cover the target. As the woman gets closer, the relative scale of the target continues to increase, and continuing to use the KCF alone would cause a loss of tracking. At this time, the YOLOv3 model is introduced to detect the coordinates of the people in the frame, which are marked by the blue boxes. Since multiple pedestrians appear in the image, the intersection over union (IOU) between each blue box and the red box is calculated, and the blue box with the largest IOU is selected as the new target to train the KCF target detector.
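The selection of the detection box with the largest IOU can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions: boxes are assumed to be (x_min, y_min, x_max, y_max) tuples in pixels, and the names `select_new_template` and `min_iou` are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_new_template(kcf_box: Box, detections: List[Box],
                        min_iou: float = 0.0) -> Optional[Box]:
    """Return the detection that overlaps the KCF box most, or None.

    min_iou is a hypothetical guard: a detection is accepted only if its
    IOU with the KCF box is strictly greater than this value.
    """
    best = max(detections, key=lambda d: iou(kcf_box, d), default=None)
    if best is None or iou(kcf_box, best) <= min_iou:
        return None
    return best


# The red KCF box overlaps the first blue detection, not the second.
red = (10.0, 10.0, 50.0, 90.0)
blue = [(8.0, 5.0, 60.0, 110.0), (200.0, 20.0, 240.0, 100.0)]
new_template = select_new_template(red, blue)
```

The returned box would then serve as the new training template for the KCF target detector; when no detection overlaps the KCF box at all, the sketch returns None so that the caller can fall back to another strategy.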

Pedestrian Feature Extraction Based on Convolutional Neural Network
There are flaws for the HOG feature used by the KCF. The HOG feature uses gradient feature representation, so it is not sensitive to pedestrian posture changes and color information, and the gradient feature of HOG is also sensitive to noise, which leads to tracking errors or tracking losses during the tracking process.
In order to make up for these HOG feature defects, a neural network used to learn and extract the deep features of pedestrian targets was designed and trained.
The network is a nine-layer convolutional neural network. The network structure is shown in Table 1. Since the entire network is small, the deep features of the image can be quickly extracted. The network uses convolution kernels with a size of 3 × 3 and a stride of 1, resizes the pedestrian image detected by YOLOv3 to 128 × 64 pixels, and uses an RGB three-channel image of this fixed size as its input, so the sizes of the feature maps are independent of the original picture size. The network is trained with the Adam optimization algorithm [11], and each layer uses L2 regularization and batch normalization. Additionally, each layer uses an ELU (exponential linear unit) as the activation function, so convergence is fast and training is accelerated. The number of layers in the network is derived from the receptive field formula. The network is used to extract pedestrian features, mainly distinctive texture features. Since the input pedestrian image is 128 × 64 pixels, observation of the differences between people suggests that texture features with an approximate size of 20 × 20 pixels can clearly distinguish one pedestrian from another. To distinguish people in different clothing, we derived a network with eight layers from the required receptive field size and the receptive field formula, and a fully connected layer was added to compute a 128-dimensional feature vector.
The receptive field formula is as follows.
$$L_k = L_{k-1} + (F_k - 1) \prod_{i=1}^{k-1} S_i$$
where $L_{k-1}$ is the receptive field size of the (k−1)-th layer, $F_k$ is the convolution kernel size of the current layer, and $S_i$ is the stride of the i-th layer.
According to the formula of the receptive field, the size of the receptive field in the seventh layer is 24 × 24 pixels. After the seventh convolutional layer is calculated, a feature map size of 32 × 16 is obtained, and then a 4 × 4 pooling layer is used to further remove redundant information and reduce the amount of calculation. The feature map is changed to the size of 8 × 4. A fully connected layer is used to extract a 128-dimensional feature vector as the feature representation of the pedestrian target.
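The receptive field recursion commonly used for stacks of convolutional layers, in which each layer adds (kernel size − 1) times the product of the strides of all earlier layers, can be evaluated layer by layer as in the sketch below. The layer configurations in the example are illustrative only and are not taken from Table 1, so the printed sizes are not meant to reproduce the figures quoted above; the function name `receptive_fields` is likewise hypothetical.

```python
from typing import List, Sequence


def receptive_fields(kernels: Sequence[int], strides: Sequence[int]) -> List[int]:
    """Receptive field size after each layer of a plain convolutional stack.

    Applies L_k = L_{k-1} + (F_k - 1) * prod(S_1 .. S_{k-1}), with L_0 = 1.
    """
    sizes = []
    rf = 1    # L_0: one input pixel
    jump = 1  # product of the strides of all previous layers
    for kernel, stride in zip(kernels, strides):
        rf = rf + (kernel - 1) * jump
        sizes.append(rf)
        jump *= stride
    return sizes


# Illustrative configurations (assumptions, not the network in Table 1).
all_stride_one = receptive_fields([3] * 7, [1] * 7)
with_downsampling = receptive_fields([3, 3, 3], [1, 2, 1])
```

With stride 1 throughout, each 3 × 3 layer widens the receptive field by two pixels; once a stride-2 layer appears, every later layer widens it twice as fast, which is why the stride configuration matters when choosing the number of layers for a target receptive field.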

Pedestrian Tracking Based on Fusion Metrics
When there is a long-term occlusion, the KCF algorithm loses the tracked pedestrian target. If deep feature comparison is combined with tracking, the deep features of the lost pedestrian target image can be retained. In subsequent frames, the deep features of newly appearing pedestrians are compared with them. If the similarity is greater than the set threshold, the occluded target is considered to have reappeared and can be tracked again. For example, the man with the red bounding box in Figure 4a is the tracking target, and his 128-dimensional deep feature vector is extracted by the network and stored; however, because he runs slowly, he gradually leaves the camera view, as seen in Figure 4b. In Figure 4c, he is completely out of the camera view and the KCF loses track entirely, while SKCFMDF compares the 128-dimensional vector of the target with those of the other people and finds that no one matches. Thus, in Figure 4c, the target is lost and no red bounding box is drawn. In Figure 4d, the target reappears, and SKCFMDF compares the stored target vector with that of the newly appeared man and finds the target again; the corresponding bounding box is therefore drawn in red. Because the 128-dimensional deep feature vector of the target is stored in the model permanently, there is no time limit for re-tracking the occluded target. Therefore, SKCFMDF, which combines the scale-adaptive KCF algorithm with deep feature comparison, can solve the problem of the KCF losing pedestrian targets under occlusion.
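The re-identification step after occlusion can be sketched as a nearest-match search over the pedestrians detected in the current frame. The sketch below is an illustration under assumptions: the text does not state which similarity measure is used, so cosine similarity is assumed here, and the threshold value of 0.8 and the names `cosine_similarity` and `reidentify` are hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def reidentify(stored: Sequence[float],
               candidates: Sequence[Sequence[float]],
               threshold: float = 0.8) -> Optional[int]:
    """Index of the candidate most similar to the stored target feature.

    Returns None when no candidate exceeds the (hypothetical) threshold,
    which corresponds to the target still being out of view.
    """
    best_index, best_sim = None, threshold
    for index, feature in enumerate(candidates):
        sim = cosine_similarity(stored, feature)
        if sim > best_sim:
            best_index, best_sim = index, sim
    return best_index


# Toy 3-dimensional vectors stand in for the 128-dimensional features.
target = [1.0, 0.0, 0.0]
frame_without_target = [[0.0, 1.0, 0.0]]
frame_with_target = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
```

Because the stored vector is never discarded, the same call can be repeated on every frame until a match appears, which mirrors the absence of a time limit described above.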

After the neural network for calculating the deep features is obtained, the similarity calculated from the deep features of the pedestrian image and the KCF prediction confidence are merged, so that the position of a pedestrian can be tracked more accurately. The formula of the fusion measurement method is as follows.
$$C = \lambda C_{kcf} + (1 - \lambda) D_{network}$$
where $C$ is the fused confidence, $C_{kcf}$ is the recognition confidence of the KCF, $D_{network}$ is the confidence of recognition by the neural network, and $\lambda$ is the weight of the KCF term. When the camera shakes, $\lambda$ can be set to 0.
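One form of fusion that is consistent with the description above, in which setting λ to 0 leaves only the confidence of the neural network, is a convex combination of the two confidences. The sketch below assumes that form; the weighting scheme, the function name `fused_confidence`, and the example values are assumptions for illustration rather than the definition used in the paper.

```python
def fused_confidence(c_kcf: float, d_network: float, lam: float) -> float:
    """Convex combination of the KCF and network confidences (assumed form).

    lam weights the KCF confidence; lam = 0 ignores the KCF entirely,
    which matches the suggestion for a shaking camera.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * c_kcf + (1.0 - lam) * d_network


# With a shaking camera the KCF response is ignored (lam = 0).
shaky = fused_confidence(c_kcf=0.2, d_network=0.9, lam=0.0)
balanced = fused_confidence(c_kcf=0.5, d_network=0.5, lam=0.3)
```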

Use Soft-NMS Algorithm to Filter Incorrectly Predicted Detection Boxes
Soft-NMS was proposed by Bodla et al. It works like traditional NMS except that it does not remove all the bounding boxes with high overlap values at once. Instead, it decays the confidence score of each bounding box whose overlap value is higher than or equal to the threshold.
The removal step in traditional NMS can be described as follows:
$$S_i = \begin{cases} S_i, & \mathrm{iou}(M, b_i) < N_t \\ 0, & \mathrm{iou}(M, b_i) \geq N_t \end{cases}$$
where $\mathrm{iou}(M, b_i)$ is the overlap value between the bounding box $M$ with the maximum confidence score and each of the remaining boxes $b_i$. The equation sets $S_i$, the score of bounding box $i$, by comparing the iou value with the threshold value $N_t$.
It is hard to decide from the threshold alone whether a bounding box should be removed. Soft-NMS decays the score of a bounding box when its overlap value is higher than or equal to the threshold because, given how the convolutional neural network of YOLOv3 predicts boxes, the higher the overlap value of a box, the more likely it is a duplicate prediction, that is, a false positive. Under traditional NMS, boxes whose overlap values are only slightly higher than the threshold are removed. However, some of them have high confidence scores, which means they are more likely to be correctly predicted boxes and should be kept. Thus, Soft-NMS keeps the bounding boxes whose overlap values are higher than the threshold but not high enough to almost fully overlap with the bounding box $M$, provided that their confidence scores are still high enough after decaying. The bounding boxes that almost fully overlap with the bounding box $M$ are removed because they are likely to be duplicate predictions. The removal standard of Soft-NMS is defined as
$$S_i = \begin{cases} S_i, & \mathrm{iou}(M, b_i) < N_t \\ S_i \left(1 - \mathrm{iou}(M, b_i)\right), & \mathrm{iou}(M, b_i) \geq N_t \end{cases}$$
The two cases above work as linear functions that keep or decay the scores of detected boxes. Thus, the bounding boxes far away from the bounding box $M$ are affected little or not at all, while bounding boxes that are very close to $M$ or mostly covered by it have their confidence scores greatly decreased. Finally, after all the confidence scores have been decayed, a second threshold is used to remove incorrectly predicted bounding boxes. Decaying the confidence scores alone does not remove the duplicates, so this threshold is needed to filter out the bounding boxes with low confidence scores after decaying.
Compared to traditional NMS, Soft-NMS adds no extra computation to the YOLOv3 human detection framework. In each step, every remaining bounding box needs its overlap value calculated with the bounding box that has the maximum confidence score, so the computational complexity of Soft-NMS is $O(N^2)$, where N is the number of predicted bounding boxes produced by the convolutional neural network. This is the same as traditional NMS.
Soft-NMS is a quite small component for the YOLOv3 human detection framework. It does not need to retrain the convolutional neural network of YOLOv3, so it does not cost too much work to be integrated into the YOLOv3 framework.
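A linear Soft-NMS pass of the kind described in this section, in which the score of an overlapping box is multiplied by one minus its overlap and low-scoring boxes are filtered afterwards, can be sketched as follows. This is a minimal sketch rather than the implementation used in the experiments: the box layout (x_min, y_min, x_max, y_max), the function names, and the use of 0.3 and 0.4 as default thresholds are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes: Sequence[Box], scores: Sequence[float],
                    overlap_threshold: float = 0.3,
                    score_threshold: float = 0.4) -> List[Tuple[int, float]]:
    """Linear Soft-NMS; returns (index, final score) of the boxes kept."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])
        # Scores only ever decrease, so once the best remaining score is
        # below the final threshold every other remaining box is too.
        if scores[m] < score_threshold:
            break
        kept.append((m, scores[m]))
        remaining.remove(m)
        for i in remaining:
            overlap = iou(boxes[m], boxes[i])
            if overlap >= overlap_threshold:
                scores[i] *= 1.0 - overlap  # decay instead of deleting
    return kept


# A near-duplicate is suppressed; a moderately overlapping box survives.
duplicates = soft_nms_linear(
    [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
neighbours = soft_nms_linear([(0, 0, 10, 10), (5, 0, 15, 10)], [0.99, 0.95])
```

In the second example the two boxes overlap by one third, which exceeds the overlap threshold, so traditional NMS would delete the second box outright; with linear decay its score drops to about 0.63 and it is kept.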

Use the Retrieval Algorithm to Recover the Detected Bounding Box Missed by Soft-NMS
Because Soft-NMS also decides whether to remove a detected bounding box based on its overlap value, some bounding boxes are still removed incorrectly and need to be detected. These incorrectly removed bounding boxes can be recovered through the retrieval algorithm.
In the retrieval algorithm, FHOG (Felzenszwalb histogram of oriented gradient) features are extracted from the images of a face dataset, and these features are used to train an SVM (support vector machine). Combined with a sliding window that crops the image and with NMS, the SVM can detect faces in images. In the improved YOLOv3, after all human instances have been detected, all the faces in the image are detected as well, as seen in Figure 5. Ideally, each detected face bounding box is fully inside a human bounding box. Thus, if a detected face bounding box is outside every human bounding box or only partially overlaps one, it must belong to an incorrectly removed human bounding box. All human bounding boxes removed by Soft-NMS are then searched again to find the one that fully covers the face bounding box and has the highest confidence. In Figure 5, the green bounding boxes mark detected faces that were not fully inside the human bounding boxes, so human bounding boxes missed by Soft-NMS had to exist. The missed human bounding boxes recovered by the retrieval algorithm are drawn in blue.
Here, we discuss how to judge whether a face-detected bounding box is completely inside a person-detected bounding box. X1 and Y1 are the coordinates of the upper right corner of the face-detected bounding box, X2 and Y2 are the coordinates of the lower left corner of the face-detected bounding box, M1 and N1 are the coordinates of the upper right corner of the person-detected bounding box, and M2 and N2 are the coordinates of the lower left corner of the person-detected bounding box. The face-detected bounding box is completely inside the person-detected bounding box when its horizontal extent, from X2 to X1, lies within the horizontal extent of the person-detected bounding box, from M2 to M1, and its vertical extent, between Y1 and Y2, lies within the vertical extent between N1 and N2.
If a face-detected bounding box is outside the person-detected bounding boxes or only partially overlaps them, the model searches all the originally detected bounding boxes, as they were before deletion by the Soft-NMS algorithm. The model determines which of them completely cover the face-detected bounding box and selects the one with the highest confidence. This person-detected bounding box is then retrieved and restored, so the accuracy of person detection is improved. Figure 6 shows the flow chart of the retrieval algorithm.
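The containment test and the search for a missed person box can be sketched as below. The sketch makes several assumptions that are not in the text: boxes are stored as (x_min, y_min, x_max, y_max) with the y-axis pointing down, rather than as upper-right and lower-left corners; the face boxes are taken as given, since the face detector itself is not modelled; and the names `contains` and `retrieve_missed` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max), image y-axis pointing down.
Box = Tuple[float, float, float, float]


def contains(person: Box, face: Box) -> bool:
    """True if the face box lies completely inside the person box."""
    return (person[0] <= face[0] and person[1] <= face[1]
            and face[2] <= person[2] and face[3] <= person[3])


def retrieve_missed(faces: Sequence[Box], kept: Sequence[Box],
                    removed: Sequence[Tuple[Box, float]]) -> List[Tuple[Box, float]]:
    """Recover person boxes that suppression removed by mistake.

    For each face not fully covered by a kept person box, pick the removed
    person box with the highest confidence that fully covers the face.
    """
    recovered = []
    for face in faces:
        if any(contains(person, face) for person in kept):
            continue  # this face already has an owner
        candidates = [(box, score) for box, score in removed if contains(box, face)]
        if candidates:
            recovered.append(max(candidates, key=lambda c: c[1]))
    return recovered


# One kept person, three suppressed candidates, two detected faces.
kept_people = [(0.0, 0.0, 50.0, 100.0)]
suppressed = [((40.0, 0.0, 90.0, 100.0), 0.6),
              ((45.0, 5.0, 95.0, 100.0), 0.5),
              ((200.0, 0.0, 250.0, 100.0), 0.7)]
face_boxes = [(10.0, 5.0, 30.0, 25.0), (55.0, 8.0, 75.0, 28.0)]
restored = retrieve_missed(face_boxes, kept_people, suppressed)
```

The second face extends beyond the kept person box, so the sketch restores the suppressed box with confidence 0.6, which is the highest-scoring candidate that fully covers that face.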

Train the Deep Feature Extraction Network
The Market-1501 dataset [12] was used to train a neural network for the feature extraction of pedestrian images. The dataset contains 32,668 images of 1501 pedestrians. Each pedestrian was captured by at least two cameras, and each camera took multiple pictures of each pedestrian. Since the neural network calculates a feature vector, triplet loss [13] is used as the training loss function:
$$L = \max\left(d(a, p) - d(a, n) + \mathrm{margin},\ 0\right)$$
where a, p, and n are three pedestrian images used as training data: a is a training image of a person, p is a sample image of the same person as a, and n is a sample image of a person different from a. Additionally, d(a, p) is the Euclidean distance between the deep features that the neural network calculates for a and p, and d(a, n) is the Euclidean distance between the deep features that the neural network calculates for a and n. It can be seen from Figure 7 that the training of the neural network reached 98% accuracy after 30k iterations.
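The triplet loss for a single (a, p, n) triple can be computed directly from the three feature vectors, as in the sketch below. The margin value of 0.2 and the toy two-dimensional vectors are assumptions chosen for the example; the text does not state the margin used in training.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """max(d(a, p) - d(a, n) + margin, 0) for one triple of features."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Toy 2-dimensional features stand in for the 128-dimensional vectors.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
badly_separated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the different person is farther from the anchor than the same person by at least the margin, so training only pushes on triples that are not yet separated well enough.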

Soft-NMS and Retrieval Algorithm to Improve YOLOv3
The dataset used in the experiment was PASCAL VOC 2007 [14]. The YOLOv3 weights were the pre-trained weights provided by the author on the official website, which were trained on the MS-COCO dataset. The test part of the PASCAL VOC dataset, which contains about 5000 pictures, was used to test the average accuracy of the improved YOLOv3.
In the experiment, we set the NMS overlap threshold to the default value of 0.3, which was the value found by the author to obtain the highest accuracy. For Soft-NMS, in addition to the overlap threshold, Nt, which was set to 0.3, there was also a threshold, σ, introduced by the Soft-NMS author. By comparing the object confidence with this threshold, wrongly predicted bounding boxes were finally removed. Setting this threshold too high would have removed all the detected bounding boxes, whereas setting it too low would have reduced the detection accuracy: a detected bounding box with a very high overlap rate is more likely to be a repeated detection, and with a low threshold such boxes would rarely be removed. After trying many values for this threshold on the PASCAL VOC 2007 dataset, we obtained the highest accuracy when σ was set to 0.4 (as seen in Figure 8). The input resolution of the network was set to 416. After running detection on the PASCAL VOC dataset, we calculated the detection accuracy. Compared with YOLOv3 using traditional NMS, the accuracy of YOLOv3 improved after using Soft-NMS and the retrieval algorithm, as shown in Table 2. Figure 9 illustrates the performance of Soft-NMS and the retrieval algorithm.
In Figure 9b, it can be seen that the girl in the white shirt was missed by NMS, but she was detected using Soft-NMS, as seen in Figure 9a. In Figure 9c, it can be seen that, when using the retrieval algorithm, the missed people (marked with blue boxes) were retrieved.
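To make the role of the two Soft-NMS parameters concrete, the following is a minimal sketch rather than the implementation used in the experiments. It assumes the linear score-decay variant of Soft-NMS, uses Nt = 0.3 as the overlap threshold, and treats the value 0.4 as the confidence cut-off described above. The function names and the [x1, y1, x2, y2] box format are illustrative assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms_linear(boxes, scores, nt=0.3, score_thresh=0.4):
    """Linear Soft-NMS: decay overlapping scores instead of deleting boxes.

    Returns the indices of the kept boxes (in selection order) and the
    decayed scores. nt and score_thresh follow the values in the text.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break  # every remaining box scores even lower
        keep.append(best)
        if remaining:
            rest = np.array(remaining)
            overlaps = iou(boxes[best], boxes[rest])
            # Boxes overlapping the selected one by at least nt are
            # down-weighted by (1 - IoU); the others are left unchanged.
            scores[rest] *= np.where(overlaps >= nt, 1.0 - overlaps, 1.0)
    return keep, scores

# Two heavily overlapping detections, e.g. two pedestrians standing close together.
boxes = [[0, 0, 10, 10], [4, 0, 14, 10]]
keep, new_scores = soft_nms_linear(boxes, [0.95, 0.9])
```

In this toy case the two boxes overlap by more than Nt, so traditional NMS at the same threshold would discard the second box outright, whereas the decayed score of the second box stays above the confidence cut-off and the box is kept.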

Tracking Effect Analysis
The OTB-100 (Object Tracking Benchmark 100) dataset [15] was used to test the accuracy of the model. Because the model is intended for pedestrian tracking, several videos containing pedestrians were selected for the calculations, including the "human 6 video," the "woman video," and the "girl 2 video." The success rate of the OTB benchmark was used to indicate the performance of all algorithms, and the overlap score threshold for the success rate was set to 0.5. The comparison algorithms were the KCF, ASLA (adaptive structural local sparse appearance) [16], TLD (tracking learning detection) [17], DASiamRPN, and SiamRPN. For the fusion metric, λ was set to 0.2, and the pedestrian detection confidence of YOLOv3 was set to 0.4.
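The success rate at a fixed overlap threshold can be sketched as follows. This is a simplified illustration and not the official OTB toolkit: it assumes boxes are given as (x, y, w, h) and counts a frame as successful when the overlap score (intersection over union) exceeds the threshold. The function names are hypothetical.

```python
def overlap_score(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(predicted, ground_truth, threshold=0.5):
    """Fraction of frames whose overlap score exceeds the threshold."""
    scores = [overlap_score(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(s > threshold for s in scores) / len(scores)

# Two frames: a perfect prediction and one that has drifted off the target.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 10), (8, 0, 10, 10)]
rate = success_rate(predicted, ground_truth)
```

Because the score depends on both position and box size, a tracker with a fixed-size box loses overlap when the pedestrian changes scale, even if the box center is correct.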
It can be seen from Table 3 that the success rate of our proposed model was greatly improved compared to the traditional KCF, ASLA, and TLD algorithms. Compared to the neural network-based SiamRPN and DASiamRPN methods, there was also an improvement. Table 3 also shows the FPS (frames per second) of all the algorithms. The KCF, ASLA, and TLD were run on an Intel i3-6100 CPU at 3.70 GHz. The SiamRPN, DASiamRPN, and SKCFMDF are based on neural networks, and they were run on an NVIDIA Titan X GPU.
As with the KCF, the features extracted by ASLA and TLD were not robust to pedestrian posture changes, even though they were fast to calculate. In contrast, SKCFMDF extracted deep features using a neural network, so it achieved a higher performance than the traditional algorithms.
Compared to the SiamRPN, SKCFMDF showed a large improvement on the "human 6 video," because the color of the pedestrian in that video is more similar to the background than in the other videos.
Compared to the DASiamRPN, there were no large differences for SKCFMDF in any of the three videos. This is because the DASiamRPN adds a distractor-aware module to the SiamRPN and, during tracking, updates the framework with new samples in real time.

Conclusions
Aiming to fix the scale adaptation and occlusion problems of the KCF, an improved KCF pedestrian tracking algorithm that integrates deep features, SKCFMDF, was proposed. By introducing deep features and a target detection algorithm, the accuracy of the tracking algorithm was improved. Using Soft-NMS and the retrieval algorithm, YOLOv3's detection accuracy was increased by 3.1%. By combining the YOLOv3 detection box and the KCF detection box, the scale adaptation problem of the KCF was solved. By using the deep features of the pedestrian extracted by a newly designed neural network, the occlusion problem of the KCF and the flaws of the HOG feature were addressed. Compared to the other mainstream pedestrian tracking algorithms tested, SKCFMDF achieved a higher success rate.
However, the SKCFMDF algorithm still has room for improvement. In the future, the YOLOv3 pedestrian detection framework could be replaced with a more efficient detection framework; in environments where pedestrian characteristics are obvious, a detector based on the MobileNet network [18] or YOLO-lite [19] could be used. Target detection networks based on ShuffleNet [20] and FBNet [21] could also be used to reduce the calculation time. The neural network that extracts deep features could likewise be replaced with other efficient pedestrian recognition algorithms.