Helmet-Wearing Tracking Detection Based on StrongSORT

Object detection based on deep learning is one of the most important and fundamental tasks in computer vision, and high-performance detection algorithms have been widely applied in many practical fields. For the management of workers wearing helmets in construction scenarios, this paper proposes a framework based on the YOLOv5 detection algorithm, combined with a multi-object tracking algorithm, to monitor and track in real-time video whether workers are wearing safety helmets. StrongSORT, an improved version of the DeepSORT tracking algorithm, is selected to reduce the loss of tracked objects caused by occlusion, trajectory blur, and changes in motion scale. The safety helmet dataset is trained with YOLOv5s, and the best training result is used as the weight model in the StrongSORT tracking algorithm. The experimental results show that the mAP@0.5 of all classes in the YOLOv5s model reaches 95.1% on the validation dataset, mAP@0.5:0.95 is 62.1%, and the precision of wearing a helmet is 95.7%. After the box regression loss function was changed from CIOU to Focal-EIOU, the mAP@0.5 increased to 95.4%, mAP@0.5:0.95 increased to 62.9%, and the precision of wearing a helmet increased to 96.5%, improvements of 0.3%, 0.8%, and 0.8%, respectively. StrongSORT can update object trajectories in video frames at a speed of 0.05 s per frame. Based on the improved YOLOv5s combined with the StrongSORT tracking algorithm, helmet-wearing tracking detection achieves better performance.


Introduction and Related Work
The safety of construction sites has always been a hot issue. Wearing a helmet can greatly reduce the harm of a heavy blow to the head during construction and is a basic guarantee of workers' personal safety. However, accidents caused by the lack of safety helmets during construction are still on the rise. Therefore, real-time monitoring of construction sites to ensure that workers wear safety helmets is an effective way to reduce the occurrence of safety accidents.
The task of monitoring in real time whether workers are wearing helmets consists of object detection and target tracking.
Object detection has always been a hot research direction in computer vision. In short, its main task is to classify and locate objects in specific scenes, and many excellent detection algorithms have been applied in various fields. Traditional object detection [1,2] relies on hand-crafted feature extractors. However, these traditional algorithms are slower and less precise than current algorithms based on deep learning; their generalization on test datasets is poor, and their performance in practical projects cannot meet the specified standards. As research on CNNs (convolutional neural networks) [3] continues to advance, object detection methods can be divided into two categories: two-stage and one-stage.
Two-stage algorithms are based on region proposals: region proposals are first extracted, and then a classification and regression task is performed on them. Girshick, R. et al. [4] proposed R-CNN (Region CNN) for the first time, using selective search [5] to extract region proposals and the AlexNet [6] network model as the backbone of the detector.
Combined with the above problems, this paper takes construction personnel wearing helmets as the monitoring object and processes more than 8000 images as the training dataset. YOLOv5 is selected as the detector for training, and the StrongSORT [28] tracking algorithm, with its good performance, is used to achieve target tracking, so as to meet the monitoring requirement of wearing a safety helmet in the construction setting.
This paper uses detection-based tracking, combined with better detector and tracker algorithms, to achieve the task of tracking and monitoring whether a helmet is being worn in real-time scenarios.
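As an illustration of this detection-based tracking pipeline, the following minimal Python sketch runs a trained YOLOv5 model on each video frame and hands the detections to a tracker. The torch.hub and OpenCV calls are real APIs; the StrongSORT class, its update interface, and the file paths are hypothetical placeholders standing in for an actual tracker implementation, which the paper does not specify in code.

```python
# Minimal detect-then-track loop: per-frame YOLOv5 detection feeding a tracker.
import cv2
import torch

# Load custom helmet weights via the real torch.hub interface ('best.pt' is an assumed path).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
tracker = StrongSORT()  # hypothetical tracker class; substitute a real StrongSORT implementation

cap = cv2.VideoCapture('site_video.mp4')  # assumed path to the recorded construction video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame)                      # per-frame object detection
    detections = results.xyxy[0].cpu().numpy()  # rows of [x1, y1, x2, y2, conf, cls]
    tracks = tracker.update(detections, frame)  # hypothetical: associate detections with track IDs
    for x1, y1, x2, y2, track_id in tracks:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, f'ID {int(track_id)}', (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cap.release()
```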

YOLOv5
In the detection-based tracking task, the most important step is to select an appropriate detector, as the trained detector model directly affects the quality of target trajectory tracking. The detection speed and precision of the object detector also directly affect real-time tracking of the target trajectory. YOLOv5 provides four network models, named YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. In this paper, the YOLOv5s model is selected as the detector in the tracking task.

YOLOv5s
The YOLOv5s network structure mainly consists of the Input, the Backbone network, the Neck network, and the Prediction head, as shown in Figure 1. In the Input part, the input size is set to 640 × 640 × 3, and the input images are augmented with the Mosaic method, whose main idea is to randomly crop four images from the dataset and splice them into a single training image, thereby enriching the dataset. Adaptive anchor box calculation is also introduced in the Input section, so that the optimal initial anchor box size is selected for different datasets, and adaptive image scaling is applied, reducing calculation time and improving detection performance. In the Backbone section, the BottleneckCSP network structure separates features by channel and stitches them at the prediction output, which reduces repeated calculation of feature information and enhances the ability of the CNN to learn more feature information. In the Neck section, the FPN and PANet structures are used. In the Head section, convolutions over three feature layers of different scales from the Neck produce the final prediction output.
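The Mosaic step can be illustrated with the following simplified Python sketch, which splices four source images (assumed to be at least size × size pixels) around a random center; the real YOLOv5 implementation additionally applies random crops and remaps each image's box labels.

```python
# Simplified Mosaic augmentation: four images spliced around a random center.
import random
import numpy as np

def mosaic(images, size=640):
    """Splice four HxWx3 uint8 images (each at least size x size) into one."""
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # gray padding value
    cx = random.randint(size // 4, 3 * size // 4)           # random splice center x
    cy = random.randint(size // 4, 3 * size // 4)           # random splice center y
    regions = [(0, 0, cx, cy), (cx, 0, size, cy),           # top-left, top-right
               (0, cy, cx, size), (cx, cy, size, size)]     # bottom-left, bottom-right
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        canvas[y1:y2, x1:x2] = img[:h, :w]                  # crop each source to fit its region
    return canvas
```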

Bounding Box Regression Loss
The traditional IOU (intersection over union) [33] calculates the overlap rate of the predicted box and the ground truth box, that is, the ratio of their intersection to their union. In this kind of bounding box regression, when the predicted box and the ground truth box have no intersection, the IOU loss remains at 0 and degenerates into a constant, so the regression loss of the predicted bounding box cannot be measured, and the convergence of the IOU loss is very slow. Therefore, Rezatofighi, H. et al. [34] proposed generalized intersection over union (GIOU) to solve the problems of the traditional IOU loss; however, when the predicted box and the ground truth box coincide, GIOU degenerates into the traditional IOU loss, and the above problems recur. Therefore, on the basis of GIOU, combining the proportion of the non-overlapping area and the distance between the center points of the two boxes, Zheng, Z. et al. [23] proposed DIOU (distance-IOU) loss, which handles the cases where the predicted box and the ground truth box contain each other horizontally and vertically. When the center points of the predicted box and the ground truth box coincide, DIOU degenerates into the traditional IOU form. Therefore, complete-IOU (CIOU) loss [23] was proposed, adding an aspect ratio penalty to deal with the problem of center point coincidence. As a result, CIOU takes into account both the loss of the overlapping area and the loss of the center point offset, combined with the loss of the width and height proportions. Therefore, in the YOLOv5s detection algorithm, CIOU is used for bounding box regression, and its loss function is shown in Equation (1).
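In the form given in [23], the CIOU loss is

$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IOU) + v} \tag{1}$$

where $b$ and $b^{gt}$ are the center points of the predicted and ground truth boxes, $\rho(\cdot)$ is the Euclidean distance, and $c$ is the diagonal length of the smallest box enclosing both.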

Focal-EIOU
CIOU considers various problems in the regression process of the predicted box and accelerates its convergence; however, because it penalizes the aspect ratio rather than the width and height separately, the width and height of the predicted box cannot converge simultaneously in certain proportions. The EIOU [35] loss considers these problems, and its calculation form is shown in Equation (2).
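In the form given in [35], the EIOU loss splits the aspect ratio penalty into separate width and height terms:

$$L_{EIOU} = 1 - IOU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \frac{\rho^{2}\left(w, w^{gt}\right)}{C_{w}^{2}} + \frac{\rho^{2}\left(h, h^{gt}\right)}{C_{h}^{2}} \tag{2}$$

where $C_{w}$ and $C_{h}$ are the width and height of the smallest box enclosing both boxes.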
In bounding box regression, the training samples play an important role in the convergence process. Therefore, on the basis of the EIOU loss, this paper uses Focal-EIOU [35] to replace the CIOU prediction box regression loss in the original YOLOv5 and adds the suppression factor γ, so that training samples carrying more information play a greater role in the regression process; its calculation form is shown in Equation (3).
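In the form given in [35], Focal-EIOU reweights the EIOU loss by the IOU value raised to the suppression factor $\gamma$:

$$L_{Focal\text{-}EIOU} = IOU^{\gamma}\, L_{EIOU} \tag{3}$$

so that high-quality samples with larger overlap contribute more to the regression gradient.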
After replacing the CIOU loss of the bounding box regression with the Focal-EIOU loss and retraining the YOLOv5s model, the mAP@0.5 of all classes on the validation dataset increased from 95.1% to 95.4%, mAP@0.5:0.95 increased from 62.1% to 62.9%, and the detection precision of wearing a helmet increased to 96.5%, improvements of 0.3%, 0.8%, and 0.8%, respectively.

StrongSORT
The SORT [26] algorithm for multi-object tracking combines past and current video frames, solving the prediction of object motion trajectories and the association of video frame data through the Kalman Filter [36] and the Hungarian algorithm, respectively, so as to achieve association across video frames. However, when the object is occluded, the state predicted by the Kalman Filter for the next video frame fails to match the detector's result, the trajectory tracking of that object ends, and a large number of target IDs are switched. DeepSORT [27] adds a pre-trained CNN network to SORT to save the appearance features of the last 100 frames of each trajectory, mitigating the ID switching caused by occlusion. At the same time, DeepSORT introduces cascade matching and new-trajectory confirmation to improve the optimal matching of predicted trajectories with objects in the current frame. Du, Y. et al. [28] proposed StrongSORT, which introduces two plug-and-play lightweight algorithms: AFLink and GSI. The AFLink model associates short trajectories into complete trajectories using a fully connected model without appearance information; GSI compensates for missing detections by simulating nonlinear motion, achieving more accurate positioning based on Gaussian regression without ignoring the motion information of the detected object during the regression process.
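The GSI idea can be illustrated with the following sketch, which fits a Gaussian process regression over the frames where a track was detected and interpolates its position in the missed frames; the kernel and its length scale are assumptions for illustration, not the exact settings of [28].

```python
# Gaussian-regression interpolation of a track's x-coordinate over missed frames.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

frames = np.array([[1], [2], [3], [6], [7]])        # frames where the track was detected
xs = np.array([100.0, 108.0, 118.0, 160.0, 175.0])  # observed x-coordinates (illustrative)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=10.0)).fit(frames, xs)
missing = np.array([[4], [5]])                       # occluded frames to fill in
print(gpr.predict(missing))                          # smoothed, interpolated positions
```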

Kalman Filter
In [27], the motion information of the object is described in an eight-dimensional state space consisting of the center coordinates (u, v) of the bounding box, the aspect ratio γ, the height h, and the respective velocity of each of these variables in the image coordinates. The Kalman Filter is used to predict and update the object trajectory in the next frame: the state x_{t-1} at time t − 1 is propagated according to Equation (4) to obtain the state x_t at time t, where F is the state transition matrix. The error of the state is represented by the covariance matrix P, and the state error at the next moment t is described as P_t, as shown in Equation (5), completing the prediction of the track state information for the next frame.
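Equations (4) and (5) correspond to the standard Kalman Filter prediction step:

$$\hat{x}_{t} = F x_{t-1} \tag{4}$$

$$P_{t} = F P_{t-1} F^{T} + Q \tag{5}$$

where $F$ is the state transition matrix and $Q$ is the process noise covariance.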
After completing the trajectory prediction for the next frame, the Hungarian algorithm is used to match the predicted trajectories with the objects detected in the current frame, and the Kalman Filter updates the trajectories that are successfully matched.
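As an illustration, the matching step can be reproduced with SciPy's implementation of the Hungarian algorithm; the cost values and gating threshold below are made up for the example.

```python
# Hungarian matching of predicted tracks to detections over a cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.2, 0.9, 0.8],    # cost[i, j]: distance between track i
                 [0.7, 0.1, 0.9],    # and detection j (illustrative values)
                 [0.9, 0.8, 0.3]])
track_idx, det_idx = linear_sum_assignment(cost)  # optimal one-to-one assignment
GATE = 0.5  # assumed gating threshold: reject implausible matches
matches = [(int(t), int(d)) for t, d in zip(track_idx, det_idx) if cost[t, d] <= GATE]
print(matches)  # [(0, 0), (1, 1), (2, 2)]
```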

Cascade Matching
In [26], the Kalman Filter updates the predicted next-frame trajectory and matches it to the detected objects. In [27], the trajectories predicted by the Kalman Filter are divided into a confirmed state and an unconfirmed state, and each newly predicted trajectory is initialized as unconfirmed; only after the Hungarian algorithm matches it to detections a certain number of times is it converted into the confirmed state, and trajectories in the confirmed state are matched with the boxes produced by the detector. In cascade matching, first, the set of confirmed trajectories predicted by the Kalman Filter is denoted as T and the set of detected boxes is denoted as D, and the cost matrix C of the two is calculated by Equation (6); the Mahalanobis Distance d(i, j) describes the agreement between the trajectory predicted by the Kalman Filter and the motion information of the current detection box, as shown in Equation (7), where d_j represents the j-th detection box.
(y_i, S_i) represents the projection of the i-th trajectory into the detection space. Second, matches that do not conform to the Mahalanobis Distance threshold are removed (as in Equation (8)).
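In the standard formulation of [27], which this passage follows, the cost matrix, the Mahalanobis Distance, and the gating indicator take the forms

$$c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j) \tag{6}$$

$$d^{(1)}(i, j) = \left(d_{j} - y_{i}\right)^{T} S_{i}^{-1} \left(d_{j} - y_{i}\right) \tag{7}$$

$$b_{i,j} = \mathbb{1}\left[d^{(1)}(i, j) \le t^{(1)}\right] \tag{8}$$

where $d^{(2)}$ is the appearance distance of Equation (9), $\lambda$ weights the two metrics, and $t^{(1)}$ is the gating threshold.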
Finally, according to how recently each predicted trajectory was updated, the more recently updated trajectories are matched by the Hungarian algorithm first.

AFLink
Over-reliance on the appearance feature information of the object is easily affected by noise, while pursuing high performance and detection speed by correlating global trajectory information results in complex computation and a large number of hyperparameters. AFLink directly predicts the association of two trajectories using only temporal information. In the AFLink model, two trajectories T_i and T_j are used as input, as shown in Figure 2 for the AFLink model structure, where each trajectory consists of the frame number f_k and the position information of its last 30 frames. T_i and T_j are input into the temporal module and the fusion module [28]. The temporal module extracts frame feature information, the fusion module then performs feature fusion on the extracted features of different dimensions, and finally a classifier predicts the correlation between the two trajectories. In this process, the two trajectories T_i and T_j do not interfere with each other in the temporal extraction module and the fusion module.
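The following PyTorch sketch illustrates this structure under stated assumptions: each trajectory is a 30-frame sequence of (frame, x, y) triples, the shared temporal module is a stack of 1-D convolutions, and the fused features feed a two-way classifier. The layer sizes are illustrative and not the exact architecture of [28].

```python
# Rough sketch of the AFLink idea: score whether two tracklets are the same target.
import torch
import torch.nn as nn

class AFLinkSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.temporal = nn.Sequential(          # shared temporal feature extractor
            nn.Conv1d(3, 32, kernel_size=7), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))            # pool over the time axis
        self.classifier = nn.Sequential(        # fusion + association score
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2))                   # same target / different target

    def forward(self, ti, tj):                  # ti, tj: (batch, 3, 30) tracklets
        fi = self.temporal(ti).squeeze(-1)      # the two trajectories are processed
        fj = self.temporal(tj).squeeze(-1)      # independently, as described above
        return self.classifier(torch.cat([fi, fj], dim=1))

score = AFLinkSketch()(torch.randn(1, 3, 30), torch.randn(1, 3, 30))
```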

Appearance Information
For appearance feature information, DeepSORT uses a CNN network pretrained on a pedestrian re-identification dataset to extract and save pedestrian features for the tracking task. When [27] tracks objects, it saves the features of the 100 most recent frames of each track in a feature library. When an object is not detected in a video frame, the minimum cosine distance between the feature library R_i of the i-th track and the feature f_j of the j-th detection is calculated, as shown in Equation (9).
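Equation (9) is the standard minimum cosine distance used in [27], over L2-normalized features:

$$d^{(2)}(i, j) = \min\left\{\, 1 - f_{j}^{T} f_{k}^{(i)} \;\middle|\; f_{k}^{(i)} \in R_{i} \right\} \tag{9}$$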
In StrongSORT, the CNN that extracts feature information is replaced by the stronger BoT feature extractor, which can extract more feature information about the detected object in a video frame. At the same time, the feature library extracted and saved by the CNN is replaced by a feature update strategy: the appearance state e_t^i of the i-th track in the t-th frame is updated as an exponential moving average, as Equation (10) shows.
f_t^i is the appearance embedding information of the detection matched by the current trajectory.
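Equation (10) is the exponential moving average update used in [28]:

$$e_{t}^{i} = \alpha\, e_{t-1}^{i} + (1 - \alpha)\, f_{t}^{i} \tag{10}$$

where $\alpha$ is the momentum term.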

Construction of Dataset
The dataset used for the detection and tracking experiments is the Safety Helmet Wearing Dataset. It contains more than 8000 images of workers wearing and not wearing helmets in various construction scenarios, including complex scenes with overlapping tasks, dark scenes, character occlusion, and other conditions. At the same time, some negative samples without helmets were added to the dataset to increase the difficulty of detection. The dataset samples are consistent with real-time monitoring of actual construction scenes, so this paper selects the Safety Helmet Wearing Dataset for this detection and tracking task. The dataset is processed into three categories: the target helmet to be detected, samples not wearing a helmet, and the head as negative-sample interference. Figure 3a-d show positive and negative samples, night samples, character occlusion, and negative samples, respectively. We also recorded a short video at our school construction site as a real-time detection video to test our model.

Experimental Environment
The specific training environment of our experiment is shown in Table 1. In this experiment, the mean Average Precision (mAP) and Frames Per Second (FPS) are used as the evaluation criteria for the YOLOv5s model. The speed at which video frames are processed and the frequency at which target IDs are switched are used as the metrics for the StrongSORT tracking model.
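For reference, AP is the area under the precision-recall curve of one class, and mAP averages it over the classes:

$$AP = \int_{0}^{1} p(r)\, dr, \qquad mAP = \frac{1}{N} \sum_{c=1}^{N} AP_{c}$$

where $p(r)$ is the precision at recall $r$ and $N$ is the number of classes; mAP@0.5 evaluates AP at an IOU threshold of 0.5, while mAP@0.5:0.95 averages AP over IOU thresholds from 0.5 to 0.95 in steps of 0.05.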

The Results of YOLOv5s
In the experiment, the number of epochs is set to 300 and the batch size to 8. The results of training on the helmet dataset are shown in Figure 4a,b. As shown in Figure 4a, the training precision of wearing a helmet reaches 95.7%, the precision of not wearing a helmet reaches 94.4%, and the mAP@0.5 of all classes on the validation dataset reaches 95.1%. Figure 4b shows that there are no false detections or missed detections. The loss during training can be seen in Figure 5: after the class loss on the validation dataset is trained for 200 epochs, the loss begins to converge and remains at 0.0015 until the 300 epochs of iterations are completed. As can be seen in Figure 6, the classification loss during training can be reduced to 0.00018. At the same time, we used YOLOv5s's best weights to detect the real-time recorded construction site video. Figure 7 shows the results of missed detection and full detection in different video frames. In Figure 7a, a small number of occluded objects cannot be detected. In Figure 7b, all objects are detected, and the detection of video frames reaches an average inference speed of 16.2 ms; the average detection processing speed of each frame is 0.015 s.

Tracking Results of StrongSORT
After using the YOLOv5s model to train on the dataset, target persons wearing helmets can basically be detected. The trained detector is combined with StrongSORT to achieve real-time target tracking. As shown in Figure 8, the number in the upper left corner is the unique identification (ID) of the target person. Figure 8a-c show the tracking of the same target person in different video frames by StrongSORT; there is no target ID switching and no false detection in real-time monitoring and tracking. Taking the target person whose ID is 17 as an example, although the target is occluded for a long time in different video frames, the ID of the target person is never switched, which is a good tracking result. In Figure 8c, the target without a helmet is not marked as a positive sample, and no false detection occurs.
Using StrongSORT to achieve tracking, a processing speed of 26.5 ms per frame is achieved. At the same time, the average detection speed of YOLOv5s is 0.017 s, and the average speed at which StrongSORT updates the video frame trajectories is 0.05 s.

Detector Comparison Experiment
In this paper, the two-stage detection models Faster R-CNN + FPN and Cascade Mask R-CNN + FPN and the one-stage detection model YOLOv3 + SPP are compared with the improved YOLOv5 detector, and each model uses pretrained weights during training. Under the same conditions, the above-mentioned Safety Helmet Wearing Dataset is used for training, and mAP@0.5, FPS, and the saved weight model size are used as the metrics for evaluating the detectors. The experimental results are shown in Table 2: the detection model proposed in this paper reaches 95.4% mAP@0.5 across all categories on the validation dataset, and the precision of wearing helmets reaches 96.5%, as illustrated in Figure 9. The inference speed reaches 100 images per second (FPS), and the model weights are only 14.4 MB, which is better than the other detection models. The original YOLOv5s network weights were 14.5 MB; the weight model size changed little after changing the prediction box regression loss.

Tracker Comparison Experiment
In addition, the DeepSORT [27] and StrongSORT [28] algorithms are selected for comparison in this paper. Each is combined with the YOLOv5s + Focal-EIOU detector for comparative experiments, using target ID switching under character occlusion and FPS as the evaluation indicators. As shown in Table 3, StrongSORT processes tracking and detection faster than DeepSORT, reaching 37 frames per second (about 26.5 ms per frame). In the experimental results of YOLOv5s combined with DeepSORT, the target ID switches frequently after the target is occluded, as shown in Figure 10: Figure 10a,c show the same target at different frame moments, whose ID switched from 858 to 939 because the target person was obscured, while in Figure 10b the ID is lost because the target is obscured. With YOLOv5s combined with StrongSORT, for the same target person there is no ID switching or ID loss when the target is obscured; the ID remains 223 throughout, as shown in Figure 10d,e.

Conclusions
In this paper, the object detection model YOLOv5s is combined with the tracking algorithm StrongSORT [28] to realize the tracking of helmet wearing. According to the comparative experiments, the YOLOv5s model is the most suitable choice in terms of detection speed and detection precision. In addition, in the tracking comparison experiment, StrongSORT [28] has a faster processing speed than DeepSORT [27], and the target ID is not lost or switched due to problems such as long-term occlusion and large changes in motion scale. At the same time, the speed of processing detection and tracking also achieves good results. In future work, we will explore how to apply this work to embedded terminal applications.