Multi-Scale Safety Helmet Detection Based on SAS-YOLOv3-Tiny

Abstract: In practical safety helmet detection scenarios, the lightweight algorithm You Only Look Once (YOLO) v3-tiny is easy to deploy on embedded devices because its number of parameters is small. However, its detection accuracy is relatively low, which makes it unsuitable for detecting multi-scale safety helmets. A safety helmet detection algorithm (named SAS-YOLOv3-tiny) is proposed in this paper to balance detection accuracy and model complexity. A lightweight Sandglass-Residual (SR) module based on depthwise separable convolution and a channel attention mechanism is constructed to replace the original convolution layers, and convolution layers of stride two replace the max-pooling layers to obtain more informative features and improve detection performance while reducing the number of parameters and the amount of computation. Instead of two-scale feature prediction, three-scale feature prediction is used to further improve the detection of small objects. In addition, an improved spatial pyramid pooling (SPP) module is added to the feature extraction network to extract local and global features with rich semantic information. Complete-Intersection over Union (CIoU) loss is also introduced to improve the loss function and thereby the positioning accuracy. Results on the self-built helmet dataset show that the improved algorithm is superior to the original algorithm. Compared with the original YOLOv3-tiny, SAS-YOLOv3-tiny significantly improves all metrics (including Precision (P), Recall (R), mean Average Precision (mAP) and F1) at the expense of only a minor speed reduction while keeping fewer parameters and less computation. Meanwhile, the SAS-YOLOv3-tiny algorithm shows accuracy advantages over lightweight object detection algorithms, and its speed is faster than that of heavyweight models.


Introduction
Driving a motorcycle or an electric two-wheeler without a safety helmet leads to a high mortality rate in accidents. However, many riders still take chances, so helmet wearing must rely on compulsory supervision by traffic police to attract people's attention. At present, traffic management departments supervise whether riders wear helmets in two main ways. In general, traffic police check traffic surveillance videos manually. Alternatively, traffic police stop and check drivers and passengers on the road. These methods require substantial human and material resources and still miss violations. Detecting whether people riding motorcycles and two-wheelers wear safety helmets is therefore crucial for intelligent traffic management and has significant research value. With the development of artificial intelligence, intelligent systems based on automatic image detection have been intensively studied and applied in many fields.
The task of object detection is to locate and classify objects in a given image. The uncertainty of object type and number, the diversity of object scales and interference from the external environment all affect this task to different degrees. Object detection algorithms based on convolutional neural networks are mainly divided into two categories: anchor-based and anchor-free. Anchor-based algorithms are further divided into two types: two-stage and one-stage. Experimental studies have found that general object detection algorithms can be applied to the safety helmet detection task. However, in complex scenarios, small-scale objects are often occluded and densely distributed. Remote small-scale safety helmets and hats with low resolution and blurry pixels carry less characteristic information, which leads to missed detections. In addition, it is challenging to balance accuracy and complexity in general object detection algorithms, and the imbalance between the two makes them difficult to deploy on mobile devices. Even though YOLOv3 is a widely used object detection algorithm with good recognition speed and detection accuracy, combining methods such as residual networks, feature pyramids and multi-feature fusion, it has many parameters, a large amount of computation and a large model size. Hence, it is challenging to port the model to embedded applications where computing power and storage space are limited. YOLOv3-tiny, based on YOLOv3, is a lightweight object detection network suitable for embedded platforms, but its detection accuracy is low. In this paper, SAS-YOLOv3-tiny is proposed to balance detection accuracy and speed on a self-built helmet dataset.
Aiming to improve detection performance while reducing the number of parameters and the amount of calculation, a Sandglass-Residual module based on depthwise separable convolution and a channel attention mechanism is constructed to replace the traditional convolution layers, while convolution layers of stride two are used in the backbone to replace the max-pooling layers, which can extract informative, high-dimensional features. A three-scale feature prediction method is introduced into the network structure of SAS-YOLOv3-tiny in place of the original two-scale feature prediction to obtain more accurate location information for small objects. An improved spatial pyramid pooling module is applied to further enhance feature extraction. CIoU is used in the loss function to improve location accuracy. Our algorithm achieved an mAP of 81.6% on the validation set and 80.3% on the test set, with an average detection time of 3.2 ms per image in an actual traffic environment.
The rest of the paper is organized as follows. Section 2 will explain the principles of the original algorithm YOLOv3-tiny. Section 3 will describe the innovation points of the improved algorithm (SAS-YOLOv3-tiny) in detail. Section 4 will show some experimental results and analyze them. Finally, in Section 5, this paper will be summarized and some future works will be proposed.

The Principles of YOLOv3-Tiny
In this section, we will mainly introduce the principles of YOLOv3-tiny. In Section 2.1, the network architecture of YOLOv3-tiny will be defined in detail. In Section 2.2, the principle of bounding box prediction will be explained. The above principles lay a solid foundation for the improved algorithm in Section 3.

Network Architecture of YOLOv3-Tiny
YOLOv3-tiny is a simplified version of YOLOv3, which replaces YOLOv3's backbone network (named Darknet53) with seven convolution layers with kernel size 3 × 3 and six max-pooling layers with stride 2. The idea of FPN is adopted to integrate low-resolution and high-resolution feature maps. YOLOv3-tiny uses the last two downsampled feature maps, of size 28 × 28 × 256 and 14 × 14 × 1024, to predict objects. The reason is that the feature map of size 14 × 14 × 1024 contains abstract, high-level semantic information, while the feature map of size 28 × 28 × 256 carries more detailed, lower-level location information; fusing them yields feature maps containing both semantic and positional information. Specifically, the input image of size 448 × 448 × 3 is processed through the backbone network and a convolution operation, producing a feature map of size 14 × 14 × 1024. One part of this result is processed through further convolutions and used to output predictions on the current feature map; the other part passes through a convolution layer and an up-sampling operation, and is then fused with the corresponding earlier feature map of size 28 × 28 × 256. These operations yield a feature map of size 28 × 28 × 384, which is processed by further convolutions and then used for prediction. At scale y1, the feature map downsampled by 32× is used to detect larger objects. At scale y2, the feature map downsampled by 16× is responsible for detecting smaller objects. The structure of the YOLOv3-tiny network is shown in Figure 1.
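The fusion step above (the 14 × 14 map up-sampled and concatenated with the 28 × 28 × 256 map to give 28 × 28 × 384) can be sketched as a shape check; the 128-channel route map after the convolution layer is an assumption consistent with the 384-channel result:

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbor 2x up-sampling on a (C, H, W) feature map
    return x.repeat(2, axis=1).repeat(2, axis=2)

route = np.zeros((128, 14, 14))  # 14x14 map after the convolution layer (128 ch assumed)
skip = np.zeros((256, 28, 28))   # earlier 28x28x256 feature map from the backbone
fused = np.concatenate([upsample2x(route), skip], axis=0)
print(fused.shape)  # (384, 28, 28), matching the 28x28x384 map used at scale y2
```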

Bounding Box Prediction
YOLOv3 continued to employ the K-means clustering of YOLOv2 to determine the prior boxes, which drew on the anchor box mechanism of the RPN in Faster R-CNN. The K-means clustering algorithm in YOLOv3-tiny obtained K prior boxes on the Common Objects in Context (COCO) dataset according to the annotated ground truth boxes, which improves detection accuracy and speed. Joseph Redmon et al. modified the clustering distance in the K-means algorithm [12]. As shown in Formula (1), it is defined by IOU: the larger the IOU, the closer the distance between the two bounding boxes.
d(box, centroid) = 1 − IoU(box, centroid) (1)

In Formula (1), d(box, centroid) represents the clustering distance, centroid represents the box selected as the cluster center by the algorithm, box represents the other bounding boxes and IOU represents the ratio of the intersection area of the two boxes to their union area. Even though many prior boxes can guarantee detection quality, too many greatly affect the efficiency of the algorithm. YOLOv3-tiny uses six prior boxes. The correspondence between feature maps and prior boxes is as follows: the feature maps of size 14 and 28 correspond to [(81,82); (135,169); (344,319)] and [(10,14); (23,27); (37,58)], respectively. Generally, large feature maps have small receptive fields, which are very sensitive to small-scale objects, and thus are assigned small prior boxes. Conversely, small feature maps have large receptive fields, which are suitable for detecting large objects, and thus are assigned large prior boxes.
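The clustering distance in Formula (1) can be sketched as follows for the width-height pairs used in YOLO anchor clustering, where boxes are aligned at a common corner before computing IoU (a standard simplification, not a quote of the authors' code):

```python
def wh_iou(box, centroid):
    # boxes as (w, h) pairs, aligned at a common corner as in YOLO anchor clustering
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_distance(box, centroid):
    # Formula (1): the larger the IoU, the smaller the clustering distance
    return 1.0 - wh_iou(box, centroid)
```

For example, a box identical to its centroid has distance 0, while the anchors (10,14) and (23,27) have IoU 140/621 and hence a large distance.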
The final predicted bounding box coordinates of the YOLOv3-tiny network can be obtained by Formulas (2) and (3), and the bounding box prediction schematic is shown in Figure 2.

b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y (2)

b_w = p_w · e^(t_w), b_h = p_h · e^(t_h) (3)

In Formulas (2) and (3), b_x and b_y are the coordinates of the center point of the predicted bounding box; b_w and b_h represent its width and height, respectively; t_x and t_y represent the offset between the object center point and the upper-left corner of the grid cell; t_w and t_h represent the offsets of the width and height of the predicted bounding box, respectively; c_x and c_y represent the offset of the grid cell relative to the upper-left corner of the feature map; p_w and p_h are the width and height of the prior box, respectively. The sigmoid function σ is used to constrain values to the range (0, 1) and keep the offset of the object center within the corresponding grid cell, ensuring it does not go out of bounds. The confidence is divided into two parts: one is the probability that an object exists, denoted by Pr(object) (if the object exists, Pr(object) = 1, otherwise it is 0), while the other is the accuracy of the predicted bounding box, as shown in Formula (4).

C_conf = Pr(class_i | object) × Pr(object) × IOU_pred^truth = Pr(class_i) × IOU_pred^truth (4)
In Formula (4), C_conf represents the confidence score of a specific category for each box; Pr(class_i | object) represents the conditional probability of class i given that an object is present in the grid cell (i = 1, 2, . . . , C); Pr(object) × IOU_pred^truth represents the confidence score; IOU_pred^truth represents the intersection over union of the ground truth box and the prediction box.
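A minimal sketch of the decoding in Formulas (2) and (3), assuming the grid-cell offsets and prior box sizes are given in the same units:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # Formulas (2) and (3): sigmoid keeps the center inside its grid cell,
    # exp scales the prior box (pw, ph) to the predicted width and height
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With zero offsets, the prediction sits at the center of its grid cell with exactly the prior box size: decode_box(0, 0, 0, 0, 3, 4, 81, 82) returns (3.5, 4.5, 81.0, 82.0).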

SAS-YOLOv3-Tiny Algorithm
The original YOLOv3 algorithm has a considerable computation cost and many parameters, which make it unsuitable for deployment on mobile devices. Therefore, YOLOv3 does not satisfy specific application domains such as helmet detection. Even though YOLOv3-tiny can meet practical needs in terms of computation amount and number of parameters, it is not as accurate as YOLOv3 due to model compression. To further reduce the number of parameters and the amount of calculation, the Sandglass-Residual module will be proposed in Section 3.1. Meanwhile, the channel attention mechanism will be fused into the Sandglass-Residual module to extract more valuable features. In Section 3.2, the improved SPP module will be introduced into the SAS-YOLOv3-tiny network architecture to obtain local and global features. In Section 3.3, we will show the overall network architecture of SAS-YOLOv3-tiny, which utilizes three-scale feature prediction to improve detection performance on small-scale objects. CIoU loss will be applied to the original loss function to improve position accuracy in Section 3.4.

Sandglass-Residual Module Based on Channel Attention Mechanism
The inverted residual module of MobileNetv2 [23] places the shortcut on the low-dimensional representations. Feature compression causes problems: optimization is complicated and the gradient is prone to oscillation, affecting the convergence of the model. MobileNeXt [24] proposes a new sandglass bottleneck module to solve the inverted residual module's problem, placing the shortcut on the high-dimensional representations. These operations retain the advantages of fast convergence and training on high-dimensional features while exploiting the computational advantages of depthwise separable convolution. In general, the parameters and calculation amount of traditional convolution increase significantly with the number of convolution layers, so the conventional convolution is replaced with depthwise separable convolution to reduce model complexity; it is split into two parts, depthwise convolution and pointwise convolution. We assume that the size of the input feature map is D_F × D_F × M. In the depthwise convolution operation, the size of the convolution kernel is D_K × D_K × 1 and its number is M. In the pointwise convolution operation, the size of the convolution kernel is 1 × 1 × M and its number is N. The computation amount of standard convolution is D_K × D_K × M × N × D_F × D_F, while that of depthwise separable convolution is D_K × D_K × M × D_F × D_F + M × N × D_F × D_F. Comparing the two, the computation of depthwise separable convolution is reduced to 1/N + 1/D_K² of the standard convolution. In our work, the Sandglass-Residual module based on this lightweight idea is constructed in the feature extraction process, ensuring that more information is passed from the bottom to the top and that gradient propagation is facilitated. The specific operations are as follows. In the high-dimensional space, two depthwise convolutions with kernel size 3 × 3 are performed, which can encode more spatial information.
The point convolution with kernel size of 1 × 1 is utilized to reduce and increase channels' dimensions and encode information between channels. The first depthwise convolution and the last point convolution use nonlinear activation functions. In contrast, the first point convolution and the final depthwise convolution directly perform linear output to avoid information loss. The parameters of the Sandglass-Residual module are shown in Table 1.
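The computation comparison above can be checked numerically; the layer sizes below (D_K = 3, M = N = 256, D_F = 14) are illustrative assumptions:

```python
def standard_conv_cost(dk, m, n, df):
    # multiply-adds of a dk x dk standard convolution, m -> n channels, df x df output
    return dk * dk * m * n * df * df

def separable_conv_cost(dk, m, n, df):
    # depthwise part (dk x dk x 1, m kernels) plus pointwise part (1 x 1 x m, n kernels)
    return dk * dk * m * df * df + m * n * df * df

ratio = separable_conv_cost(3, 256, 256, 14) / standard_conv_cost(3, 256, 256, 14)
print(ratio)  # equals 1/N + 1/Dk^2 = 1/256 + 1/9, about 0.115
```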

When the YOLOv3-tiny algorithm is applied to a real-world scenario dataset, all objects in the image are treated equally. If greater weight is assigned to the features of the object area, the weighted feature maps will be conducive to detecting far-distance and small-scale safety helmets, which can improve detection accuracy without introducing too many parameters. The Squeeze-Excitation (SE) channel attention module in SENet [25] gives different weights to different channels in the feature maps of the convolutional neural network, making the network pay more attention to the channels with higher weights.
Thus, it can enhance the learning ability of the network, and its specific operations are as follows. The feature map of size H × W × C is compressed into a vector of size 1 × 1 × C by a squeeze operation (i.e., global average pooling). Then the weights of the different channels are obtained by an excitation operation (i.e., two fully connected layers), and finally, the feature weighting operation is carried out on the obtained feature maps. After these operations, the attention feature maps are produced. All channels of the feature maps generated by the Sandglass-Residual module above are treated equally, which causes some essential features to be overlooked, so the obtained features are not conducive to detecting hard-to-distinguish objects. Therefore, in this paper, channel attention is introduced into the Sandglass-Residual module to extract informative features, adjusting the feature relationships within the network through squeeze and excitation operations. Its structure is shown in Figure 3. Compared with the original SR block, the Sandglass-Residual module based on Squeeze-Excitation channel attention enhances the network's nonlinear characteristics, which can improve the model's generalization ability without changing the output dimension. The subsequent ablation experiments prove that the Sandglass-Residual module based on Squeeze-Excitation channel attention improves detection performance.
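A minimal NumPy sketch of the squeeze-excitation operations described above (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise reweighting); the weight shapes assume a reduction from C to C/r channels:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    # x: (C, H, W) feature map; w1: (C//r, C); w2: (C, C//r)
    z = x.mean(axis=(1, 2))                   # squeeze: global average pooling -> (C,)
    s = np.maximum(w1 @ z + b1, 0.0)          # excitation FC1 + ReLU -> (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))  # excitation FC2 + sigmoid -> (C,)
    return x * s[:, None, None]               # channel-wise reweighting
```

The output has the same shape as the input, so the block can be dropped into the Sandglass-Residual module without changing downstream dimensions.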

Improved Spatial Pyramid Pooling Module
To obtain contextual semantic information from different receptive fields and further improve the detection accuracy of the model, an improved spatial pyramid pooling (SPP) module is added to the improved backbone network. Traditional spatial pyramid pooling [2] solves the problem that the input of the fully connected layer must be a fixed-length feature vector, allowing a network to accept images of any size without cropping or scaling. The spatial pyramid pooling module in this paper integrates multi-scale local feature information with global feature information to obtain richer feature representations, as shown in Figure 4.

After passing through the improved SPP module, the feature map's size stays the same, realized by pooling with stride one and appropriate padding. Specifically, the final feature map of size 14 × 14 × 1024 extracted from the backbone network already contains rich semantic information. Three max-pooling operations with kernel sizes 5 × 5, 9 × 9 and 13 × 13 and stride 1 are then applied to obtain three feature maps, which are concatenated with the input feature map of size 14 × 14 × 1024 along the channel dimension to produce an output feature map of size 14 × 14 × 4096. The experiments show that adding the improved SPP module after the backbone network extracts rich features and improves the detection effect.
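The improved SPP module described above can be sketched as stride-1 max pooling with 'same' padding at three kernel sizes, concatenated with the input along the channel axis (so C channels become 4C):

```python
import numpy as np

def maxpool_same(x, k):
    # stride-1 max pooling with 'same' padding on a (C, H, W) feature map
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(x, kernels=(5, 9, 13)):
    # concatenate the input with its pooled versions along the channel axis
    return np.concatenate([x] + [maxpool_same(x, k) for k in kernels], axis=0)
```

For a 14 × 14 × 1024 input this yields 14 × 14 × 4096, matching the output size stated above.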

Network Architecture of SAS-YOLOv3-Tiny
To address the low detection accuracy and high miss rate of YOLOv3-tiny on small objects such as helmets, we improved the original network. The network structure of SAS-YOLOv3-tiny is shown in Figure 5. The backbone network of SAS-YOLOv3-tiny is constructed by combining the previously described Sandglass-Residual module based on Squeeze-Excitation channel attention with the improved SPP module based on spatial pyramid pooling. Specifically, in Figure 5, the dashed part is the feature extraction part of the backbone network, in which the five brown DBLs in the middle of the backbone are 1 × 1 convolution layers of stride 2 that replace the max-pooling layers to perform down-sampling and change the number of channels. The five Sandglass-Residual blocks in the middle of the backbone are Sandglass-Residual modules based on Squeeze-Excitation channel attention that replace the standard convolution layers behind the max-pooling layers of the original backbone network. Furthermore, to make the network more robust, we add the improved SPP module at the end of the backbone network to fully extract local and global features.

Simultaneously, we improved the method of multi-scale feature fusion. Based on the original network's two-scale feature prediction, a downsampled feature map of size 56 × 56 × 128 is used to form three-scale feature prediction to further improve object detection accuracy.
In addition to being used for prediction, the feature map of size 28 × 28 × 384 obtained after a convolution operation continues through a convolution layer and an up-sampling layer, and is then concatenated with the feature map of size 56 × 56 × 128 to perform prediction. At scale y3, the feature map downsampled by 8× is used to detect small objects, because it retains more detailed features and location information of small objects. In the improved algorithm, nine prior boxes are used instead of six, and the corresponding relationship between feature maps and prior boxes is as follows.

Improved Loss Function
Recently, in terms of bounding box regression, IOU-based loss optimizations have replaced earlier regression losses (MSE loss, L1-Smooth loss, etc.). One of the most commonly used evaluation criteria for object detection algorithms is intersection over union (IOU), the ratio of the overlap area of the ground truth box and the prediction box to their union area, as shown in Formula (5).

IOU = |A ∩ B| / |A ∪ B| (5)

In Formula (5), A = (x, y, w, h) represents the prediction box and B = (x^gt, y^gt, w^gt, h^gt) represents the ground truth box.
Even though IOU can reflect the agreement between the prediction box and the ground truth box, it only provides a useful signal when the bounding boxes overlap and gives no adjustment gradient for non-overlapping boxes. The concept of IOU is based on a ratio, so it is insensitive to object scale. In this paper, the traditional MSE regression loss is replaced with CIoU [26], whose detection effect is better suited to real scenes. It inherits the advantages of Generalized Intersection over Union (GIoU) [27] and Distance-IoU (DIoU) [28]: it considers not only the distance and overlap ratio but also the scale and the aspect ratio between the prediction box and the ground truth box, so it performs bounding box regression better. The loss function of SAS-YOLOv3-tiny is shown in Formula (6), which is divided into three parts: loss_CIoU represents the regression loss, loss_obj represents the confidence loss and loss_class represents the category loss, as shown in Formulas (7)-(9).

LOSS = loss_CIoU + loss_obj + loss_class (6)

loss_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αν (7)

In Formula (7), b and b^gt represent the center points of the prediction box and the ground truth box, respectively. ρ²(b, b^gt) represents the squared Euclidean distance between the two center points, c represents the diagonal distance of the smallest enclosing region that contains both the prediction box and the ground truth box, α is the weight parameter and ν is used to measure the similarity of aspect ratios. In Formulas (8) and (9), K × K represents the size of the final feature map to be detected; I_ij^obj is used to determine whether the j-th prior box in the i-th grid is responsible for the object: if it is responsible, its value is 1; otherwise, its value is 0.
The weight coefficients λ coord and λ noobj are set at 5 and 0.5, respectively, which are used to offset the imbalance between positive and negative samples.
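A sketch of the CIoU loss in Formula (7), with boxes given as (center x, center y, width, height); the small epsilon in α is a numerical-stability assumption, not part of the original formula:

```python
import math

def ciou_loss(pred, gt):
    # boxes given as (center_x, center_y, w, h)
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # corner coordinates of both boxes
    p1, p2 = (px - pw / 2, py - ph / 2), (px + pw / 2, py + ph / 2)
    g1, g2 = (gx - gw / 2, gy - gh / 2), (gx + gw / 2, gy + gh / 2)
    iw = max(0.0, min(p2[0], g2[0]) - max(p1[0], g1[0]))
    ih = max(0.0, min(p2[1], g2[1]) - max(p1[1], g1[1]))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / union
    # squared center distance and diagonal of the smallest enclosing box
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    cw = max(p2[0], g2[0]) - min(p1[0], g1[0])
    ch = max(p2[1], g2[1]) - min(p1[1], g1[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term and its weight
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike plain IOU loss, the distance term still provides a gradient when the boxes do not overlap: identical boxes give a loss of 0, while disjoint boxes give a loss greater than 1.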

Experiments and Results Analysis
In Section 4, some experiments and results analysis will be explained in detail. The basic information of safety helmet detection dataset and evaluation criteria of detection effect will be introduced in Section 4.1. Then, we will explain the experimental progress and do a result analysis in Section 4.2. There are four subsections in Section 4.2. In Section 4.2.1, we will describe the training setting. In Section 4.2.2, we will do ablation experiments to prove the effectiveness of each scheme. In Section 4.2.3, we will conduct the comparison of results with other state-of-the-art detection models. In Section 4.2.4, we will show the detection results of some samples under different detection models.

Dataset
A dataset is crucial for deep learning-based object detection algorithms. In our work, a safety helmet dataset of 7656 images was built by searching the Internet, taking photos with cameras and using web crawlers, and it was produced in VOC format. The labelImg software was used to label the collected images, and the annotated coordinate information was saved as XML files. There were four object categories: helmet (wearing a safety helmet for two-wheelers), cap (wearing a non-protective hat), Nowear (wearing nothing) and safety-cap (wearing an industrial helmet). The dataset was then randomly divided into training, validation and test sets at an 8:1:1 ratio, giving 6063 training samples, 827 validation samples and 766 test samples. Specifically, the training set was used to train the parameters of the neural network; the validation set was used to evaluate the current model after each epoch; and the test set, which did not participate in training at all, was used to measure the model's final generalization performance.
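The random 8:1:1 split can be sketched as follows; the function name and fixed seed are illustrative, and the exact per-split counts depend on the shuffling, so they may differ slightly from the paper's.

```python
import random

def split_dataset(image_paths, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split image paths into train/val/test at an 8:1:1 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    # Remaining samples go to the test set so nothing is dropped
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```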

Evaluation Criteria
The quality of the detection effect needs a standard to evaluate it, so the following evaluation criteria are introduced.
(1) Precision and Recall are defined in Formula (10), and F1, the harmonic mean of Precision and Recall, is defined in Formula (11):

Precision = TP/(TP + FP), Recall = TP/(TP + FN) (10)
F1 = 2 × Precision × Recall/(Precision + Recall) (11)

In Formula (10), TP (true positives) is the number of correctly detected objects, FP (false positives) is the number of incorrect detections and FN (false negatives) is the number of missed objects.
(2) Average Precision (AP) and Mean Average Precision (mAP) are defined in Formula (12):

AP = ∫₀¹ P(R) dR, mAP = (1/N) Σᵢ APᵢ (12)

In Formula (12), N represents the number of object categories. In general, an increase in Recall is often accompanied by a decrease in Precision. To better balance the two, the P-R curve is introduced, and the area under it is the AP value of a specific category.
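The metrics above can be computed directly from detection counts and a P-R curve; a minimal sketch follows, where the simple step-sum approximation of the area under the P-R curve is our choice for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = their harmonic mean."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recalls, precisions):
    """AP as the area under the P-R curve, approximated as a step sum.

    `recalls` must be sorted ascending; the (recall, precision) pairs come
    from sweeping the confidence threshold over the detector's outputs.
    """
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values over the N object categories."""
    return sum(ap_per_class) / len(ap_per_class)
```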

Training Setting
The experimental platform was an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz and an NVIDIA GeForce RTX 2080Ti GPU. The algorithm was implemented in Python 3.8 with the deep learning framework PyTorch 1.6.0 on Ubuntu 18.04, with the other dependent libraries configured accordingly. Generally speaking, there are two ways to train a model: from random initial weights or from pre-training weights. To compare the different modification methods fairly, this paper used the first method: the SAS-YOLOv3-tiny network was trained from scratch on the self-built dataset. To ensure a fair comparison, we also retrained YOLOv3-tiny, YOLOv3 and YOLOv4 [29] in the same experimental environment and compared the resulting models with the improved algorithm on the validation set and the test set. The experimental parameters were set as follows: the batch size was 4; 140 epochs were trained; a cosine learning rate schedule was used, decaying the learning rate from 0.01 to 0.0005; momentum was set to 0.937; and weight decay was set to 0.000484. In addition, a multi-scale training strategy was adopted to improve the detection effect for images of different input resolutions: in each iteration, the input size was selected from {320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640}.
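The cosine learning rate schedule and multi-scale size selection described above can be sketched as follows; the paper does not give its exact decay formula, so the standard cosine annealing form below is an assumption.

```python
import math
import random

# Multi-scale training resolutions listed in the paper
SCALES = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640]

def cosine_lr(epoch, total_epochs=140, lr_max=0.01, lr_min=0.0005):
    """Standard cosine annealing from lr_max down to lr_min over the run."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def pick_input_size(rng=random):
    """Randomly pick one of the multi-scale training resolutions per iteration."""
    return rng.choice(SCALES)
```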

Ablation Experiments
In this section, to better understand the influence of each improved method on the detection effect, ablation experiments are carried out on the self-built helmet validation set. First, we present each of our schemes in Table 2. Then, in Table 3, we compare the different modification schemes based on YOLOv3-tiny in terms of P, R, F1, mAP, weight size, total parameters and the average time to detect a single image (detection time), and comprehensively analyze how each improvement point promotes performance. Finally, we demonstrate the effectiveness of each improvement point by presenting a training curve for each scheme. The different schemes are shown in Table 2. We used the YOLOv3-tiny algorithm as the baseline. Specifically, in the Scheme SR, the Sandglass-Residual (SR) module replaced the original convolution layers, and the max-pooling layers were replaced with convolution layers of stride two. In the Scheme SR-3s, a three-scale prediction method was adopted on the basis of the Scheme SR. In the Scheme SR-3s-SPP, to further improve the detection effect, the improved SPP was added on the basis of the Scheme SR-3s. In the Scheme SR-3s-SPP-SE, the Squeeze-Excitation (SE) channel attention mechanism was integrated into the Sandglass-Residual module to extract more representative features on the basis of the Scheme SR-3s-SPP. In the Scheme SR-3s-SPP-SE-CIoU, we used CIoU loss on the basis of the Scheme SR-3s-SPP-SE. The combination of the five improvements formed our final algorithm; in other words, the last Scheme SR-3s-SPP-SE-CIoU is our improved algorithm, SAS-YOLOv3-tiny.
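As an illustration of the channel attention integrated in the Scheme SR-3s-SPP-SE, a minimal PyTorch sketch of a Squeeze-Excitation block follows; the reduction ratio of 16 is a common default, not a value taken from the paper, and the block's exact placement inside the SR module is not reproduced here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-Excitation channel attention: squeeze spatial information with
    global average pooling, learn per-channel gates with two FC layers and a
    sigmoid, then rescale the input channels by those gates."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),                            # excitation: gates in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates                             # output shape equals input shape
```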
The ablation results of the different models on the validation set are shown in Table 3. From Table 3, we can see that the values of P, R, mAP and F1 are low for the original YOLOv3-tiny algorithm, and the weight size and total parameters still have room for improvement. Compared with the original algorithm, the improved YOLOv3-tiny based on the Sandglass-Residual module made the network more lightweight: because the Scheme SR is based on depthwise separable convolution, it reduced the number of parameters and the computation amount, cutting the weight file size and the number of parameters by nearly half. In addition, owing to placing the shortcut on the high-dimensional representations, the SR module could extract rich features, increasing R by 4.6%, mAP by 4.5% and F1 by 1.4% while keeping the detection speed almost unchanged. The Scheme SR-3s changed two-scale feature prediction into three-scale feature prediction, incorporating shallow features with sufficient location information and increasing R, mAP and F1 by 1.7%, 2% and 0.9%, respectively. The improved SPP module introduced in the Scheme SR-3s-SPP extracted features with different receptive fields, further improving P, R and F1 by 0.7%, 0.8% and 0.8%, respectively. In the Scheme SR-3s-SPP-SE, the channel attention mechanism introduced into the Sandglass-Residual module focused on useful features, improving P, mAP and F1 by 2.0%, 1.1% and 1.0%, respectively. Finally, the CIoU loss used in the Scheme SR-3s-SPP-SE-CIoU promoted positioning accuracy, improving P by nearly 1%. Owing to the combination of these improvements, SAS-YOLOv3-tiny had advantages over the original YOLOv3-tiny in both model performance and complexity.
Specifically, it improved P by 2.5%, R by 6.9%, mAP by 7.9% and F1 by 4.5% over the original algorithm on the validation set, and it had fewer parameters than the original algorithm at a sacrifice of only 0.8 ms.
To further demonstrate the effectiveness of the different schemes, we present the training curves for the six groups of experiments. Two critical performance indicators are mAP and F1; their curves for the different models are shown in Figure 6a,b. The horizontal axis in Figure 6a,b represents the training time, while the vertical axes represent the values of F1 and mAP, respectively. YOLOv3-tiny denotes the training curves of the original algorithm, whose mAP and F1 values are the lowest. SR denotes the training results of the Scheme SR, where the main reason for the performance gain is the Sandglass-Residual module. SR-3s denotes the Scheme SR-3s, in which the Sandglass-Residual module and three-scale feature prediction are applied simultaneously, further improving mAP and F1. SR-3s-SPP denotes the Scheme SR-3s-SPP, which additionally employs the SPP module; its curves show that training converges more easily and the results are more robust. SR-3s-SPP-SE denotes the Scheme SR-3s-SPP-SE, in which the channel attention mechanism is added on top of the previous three improvements, and the trained model reaches a better level. SR-3s-SPP-SE-CIoU denotes the final scheme, in which CIoU loss is added, showing that CIoU promotes positioning accuracy. As shown in Figure 6a,b, the final model is clearly better than the original algorithm.

Result Comparison with Other Detection Models
To evaluate each algorithm's generalization performance, we compare and analyze the evaluation indexes of the different algorithms on the test set, as shown in Table 4. From Table 4, we can draw the following conclusions. On the test set, compared with YOLOv3-tiny, SAS-YOLOv3-tiny improves R from 75.4% to 80.9%, improves mAP from 74.6% to 80.3%, improves F1 from 72.9% to 75.2% and reduces the size of the weight file from 69.5 MB to 46.9 MB, which mainly benefits from the SR module based on channel attention, the SPP module, the three-scale prediction method and the CIoU loss. Compared with the latest YOLOv4-tiny, SAS-YOLOv3-tiny improves R from 80.0% to 80.9% and mAP from 78.9% to 80.3%, but P and F1 decrease to some extent; the main reason is that the Cross Stage Partial network (CSPNet) [30] idea applied in YOLOv4-tiny strengthens its feature representation. Compared with YOLOv3 and YOLOv4, SAS-YOLOv3-tiny has tremendous advantages in the number of parameters and speed, although its accuracy is inferior to theirs; the main reason for the lower accuracy lies in its fewer parameters and smaller computational burden.

Detection Results under Application Scenarios
To show that the improved algorithm is more suitable for natural complex scenes in terms of accuracy, we present the detection results of some test images in Figure 7. For small-scale objects, occluded objects and dense objects, SAS-YOLOv3-tiny is superior to the YOLOv3-tiny algorithm. As can be seen from the first and second sets of images, SAS-YOLOv3-tiny and the latest YOLOv4-tiny detect all objects, but YOLOv3-tiny misses an ordinary object. As can be seen from the third set of images, SAS-YOLOv3-tiny detects all objects while YOLOv4-tiny misses a helmet object and YOLOv3-tiny detects some of the objects incorrectly. For detecting small objects at long distances, SAS-YOLOv3-tiny performs better than YOLOv3-tiny and YOLOv4-tiny. In the last set of images, SAS-YOLOv3-tiny and YOLOv4-tiny detect the standard objects, but both miss objects when a man deliberately lowers his head. As these test images show, the improved algorithm is superior to the original algorithm and sometimes even achieves a better detection effect than the latest YOLOv4-tiny.

Conclusions
In this paper, the SAS-YOLOv3-tiny algorithm is proposed to address the low detection accuracy of the original lightweight algorithm YOLOv3-tiny. Even though YOLOv3-tiny has a fast speed and few parameters, its detection accuracy needs to be improved. First, the lightweight Sandglass-Residual module based on depthwise separable convolution and a channel attention mechanism was constructed to replace the original convolution layers, while the max-pooling layers were replaced with convolution layers of stride two, which reduced the number of parameters and improved detection performance. Furthermore, the detection performance was further improved by three-scale feature prediction. Next, the improved spatial pyramid pooling module was merged behind the backbone network to extract expressive features. Finally, we used CIoU loss to improve the loss function, which also improved the localization effect. On the validation set, SAS-YOLOv3-tiny raised P from 70.7% to 73.2%, R from 73.3% to 80.2%, mAP from 73.7% to 81.6% and F1 from 71.9% to 76.4%. On the test set, SAS-YOLOv3-tiny generalized well: it performed better than the original YOLOv3-tiny at the expense of only 0.7 ms of speed and was comparable to YOLOv4-tiny in detection accuracy; compared with the heavyweight algorithms YOLOv3 and YOLOv4, it had a great advantage in speed although its detection accuracy was not as good as theirs. The experimental results and contrast curves reveal that the improved methods strengthen the detection effect. Future work will expand the safety helmet dataset built in this paper and further improve detection accuracy while keeping the number of parameters low and the speed high.

Informed Consent Statement: Not applicable.
Data Availability Statement: Some or all data, models or code generated or used during the study are available from the corresponding author by request.

Conflicts of Interest: The authors declare no conflict of interest.