1. Introduction
Construction site safety is more important than ever as the industry revitalizes and more infrastructure needs to be built. Accidents can be prevented by using personal protective equipment [
1]. Helmets are the most important personal protective equipment to protect workers from falling objects [
2], and it is legally mandatory at construction sites around the world to wear them [
3]. However, wearing a hard hat tends to be neglected due to discomfort and a weak sense of safety. Therefore, checking whether workers are wearing helmets is very important for their safety and can raise the level of safety management. Existing helmet-wearing inspections at construction sites rely on surveillance image inspection and manned patrols [
4]. However, this method requires much time and effort, and a human monitor may misjudge due to fatigue, since the examiner must stare at the screen for a long time. Accordingly, image analysis techniques for detecting helmets at construction sites are developing rapidly with the help of new technologies and sensors.
Although the number of deaths from industrial accidents has decreased compared to the past, the death rate in the construction industry is still high, and more than half of all deaths occur in the construction industry. Not wearing a helmet at a construction site can lead to a fatal accident. Therefore, to prevent such fatal accidents, a system that recognizes and detects whether a helmet is worn at a construction site is required. Advances in computer technology have made it possible to train large-scale deep neural networks by applying GPUs for massively parallel computing [
5,
6].
Object detection is an essential capability of computer vision solutions. It has gained attention over the last few years, for example through the core components of the parallel Self-Organizing Map (SOM) used for the classification of meteorological radar images [
7]. Miller et al. used the congealing process, which minimizes the summed component-wise (pixel-wise) entropies over a continuous set of transforms of the image data, to demonstrate a procedure for effectively bringing test data into correspondence with the resulting data-defined model [
8]. In the field of object detection, a series of deep learning-based methods have been developed, and Convolutional Neural Networks (CNNs) are the most widely used because of their excellent high-level feature extraction. As a result, they are gradually replacing conventional detection methods in image analysis [
9]. There are two main approaches to CNN-based object detection. The first is the two-stage detector, which first extracts a set of candidate regions where objects may be and then applies the CNN detector for object classification and localization. Representative examples include the Region-based Convolutional Neural Network (R-CNN) [
10] and improved networks such as Fast R-CNN [
11] and Faster R-CNN [
12]. On the other hand, the single-stage detector treats object detection as a regression problem, directly predicting class probabilities and bounding box coordinates according to CNN features. Representative networks are the single-shot multibox detector (SSD) [
13], You Only Look Once (YOLO) [
14], and the improved network [
15]. The development of CNN-based detectors has motivated deep learning-based hard hat-wearing detection methods [
16,
17], and many researchers consider deep learning-based methods as essential measures to solve construction safety management problems [
18].
We previously developed an automatic detector of workers without helmets based on Faster R-CNN, achieving an accuracy of 90.1% to 98.4% in various scenarios [
19]. However, Faster R-CNN cannot meet the real-time requirement, as it takes about 0.2 s per image [
20]. To improve Faster R-CNN, we used multi-scale training, an augmented anchor strategy, and online hard-example mining. The safety helmet detection accuracy was finally improved by 7% compared to the existing algorithm, but the operating speed did not improve. Recently, many researchers have been working on single-stage detectors for hard hat detection tasks. Shi et al. [
21] extracted multi-scale feature maps using the image pyramid structure and combined them with YOLOv3. In the research of Wu et al. [
22], instead of the original backbone of YOLOv3, a densely connected convolutional network [
23] was adopted, resulting in better detection results with the same detection time. Shen et al. [
24] obtained a face-to-helmet regression model after detecting hard hats based on the first stage face detector [
25]. Li et al. [
2] chose the SSD algorithm to meet the real-time requirement and added MobileNet [
26] to reduce the computational load. Wang et al. [
4] proposed a new objective function to improve YOLOv3 and applied it to helmet detection. Single-stage detectors generally have lower accuracy than two-stage detectors but provide higher throughput [
27]. YOLOv4 balances both speed and accuracy among object detection models, making it suitable for real-time object detection [
28]. All the above studies show that developing a deep learning-based hard hat-wearing detection method can help reduce human and material resources, prevent omissions and false positives caused by human factors, and lay the foundation for the next step.
This paper proposes an image super-resolution improved network based on YOLOv4 to solve the small helmet detection problem. The paper makes the following specific contributions:
To solve the small-object detection problem, we improve the object detection accuracy of YOLOv4 by increasing the resolution of low-resolution photos through the ISR module, achieving high performance in our model.
To improve feature extraction, we propose the CSP1-N network as the backbone feature extraction network.
To enable a detailed feature fusion process during training, we propose the CSP2-N network in the neck.
To better learn non-linear features, we adopt the Hard-Swish activation to improve the model.
The structure of this paper is organized as follows.
Section 2 introduces the related works of safety helmet detection.
Section 3 introduces the algorithm of the proposed model, and
Section 4 presents the experimental environment, training details, and analysis of the results. In
Section 5, we provide our conclusions.
3. Real-Time ISR-YOLOv4-Based Small Object Detection
Following Yu and Zhang and Wang et al., the small object detection algorithm proposed in this paper is divided into an input, a backbone network, a neck network, and a head. After image super-resolution processing is performed on the input to extract small objects, the backbone extracts small object features from the image, the neck fuses multi-scale features, and the head uses multi-scale feature maps to detect targets and determine their locations [
35,
37]. Since the major structure of small object detection is divided into these four parts, we adopted this structure for safety helmet detection by improving the backbone, neck, and prediction parts. The structure of the algorithm is shown in
Figure 2.
3.1. Image Super Resolution (ISR)
The proposed small object detection algorithm proceeds through the input, backbone, neck, and head output. The input performs ISR processing, the backbone extracts small object features from the image, the neck fuses multi-scale features, and the prediction head performs detection using multi-scale feature maps. The algorithm is shown in
Figure 2. The ISR module was added to the input to capture the local details of small targets; it relies on texture extraction for both image enhancement and image identification. In the backbone network, blocks in Darknet53 are densely connected in the same way as the layers in DenseNet [
23]. We use this connectivity to train deeper network structures and to reuse the feature maps learned at different levels. This connection avoids overfitting while using fewer parameters than other networks. The neck maintains the PANet structure and the original spatial pyramid pooling structure. PANet is the feature fusion module of this part, combining features of different scales. The spatial pyramid module is a structure added to the neck to enlarge the receptive field of the network. YOLOv4 [
28] was selected as the head, and a loss function consisting of the foreground-background balanced bounding box regression loss, the confidence loss, and the classification loss was adopted to improve the accuracy of small object detection.
The ISR module input is split into two parts, content and local textures, as shown in
Figure 3. The content is first extracted with a content extractor, and sub-pixel convolution is then used to double the resolution of the content feature. The texture extractor connects the two parts to the output while selecting a trusted local texture from the base and reference features and denoising the reference feature [
37]. P0 represents the output of the image super-resolution module and is defined as:

P0 = T(D) ⊕ U2(C(A))

where D is the local texture input, A is the content input, T(·) is the texture extractor, C(·) is the content extractor, U2(·) indicates the two-times up-scaling via sub-pixel convolution, and ⊕ indicates feature stitching. Both the content extractor and the texture extractor consist of residual blocks. The method uses sub-pixel convolution to perform the spatial up-scaling of the content features of the underlying input.
To increase the width and height of the feature map, sub-pixel convolution transfers pixels from the channel dimension [
37]. The features generated by the convolutional layer are expressed as F ∈ R^(H × W × r²C).
The pixel shuffling operation of sub-pixel convolution rearranges the features to rH × rW × C [
37]. This operation is mathematically defined as:

PS(F)_(x, y, c) = F_(⌊x/r⌋, ⌊y/r⌋, C·r·(y mod r) + C·(x mod r) + c)

where PS(F)_(x, y, c) represents the pixel of the output feature at coordinates (x, y, c) after the pixel shuffling operation, and r is the upscaling factor. In the ISR module, r = 2 is used, so the spatial size is doubled. The texture input D and the content input A are sent to the texture extractor, which makes the extraction of small objects highly reliable. Adding textures and content on an element-by-element basis allows the output to incorporate semantic and local information from the inputs and references. Thus, P0 has a similar meaning to the trusted texture selected from the shallow feature D combined with the deeper-level content feature A.
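The pixel shuffling step above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function name and the nested-list feature representation (indexed [h][w][channel]) are assumptions.

```python
def pixel_shuffle(features, r):
    """Rearrange an H x W x (r^2 * C) feature map into rH x rW x C,
    following the standard ESPCN channel-to-space mapping."""
    H = len(features)
    W = len(features[0])
    C = len(features[0][0]) // (r * r)
    out = [[[0.0] * C for _ in range(r * W)] for _ in range(r * H)]
    for x in range(r * H):
        for y in range(r * W):
            for c in range(C):
                # Each output pixel is drawn from the channel dimension
                # of the corresponding low-resolution location.
                ch = C * r * (y % r) + C * (x % r) + c
                out[x][y][c] = features[x // r][y // r][ch]
    return out
```

For example, a 1 × 1 map with four channels and r = 2 becomes a 2 × 2 map with one channel, with each channel moved to its own spatial position.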
3.2. Backbone Network
We add residual modules to YOLOv4 to reduce the parameters and improve the network's learning ability. Yu and Zhang used CSPDarkNet53 as a face mask-wearing detection module. The residual unit can be expressed as follows: first a 1 × 1 convolution, then a 3 × 3 convolution, after which the outputs of both branches of the module are added with weights. The weights retain dimensional information, and the goal is to augment the information in the feature layers [
35]. We used this method differently in helmet detection: by maintaining the first and last CSP connections of each extra residual network, inter-edges are added between every two adjacent extra blocks to provide cross-layer separation of the gradient flow and accelerate forward propagation through the deeply repeated extra blocks, alleviating the vanishing gradients that occur in between. After the image feature layer set is input to CSPDarkNet53, convolutional downsampling continues in order to obtain richer information. Therefore, the three layers at the end of the backbone carry the best semantic information, and these last three layers are chosen as input to the SPPNet. The network structure of CSPDarkNet53 is shown in
Figure 4.
In this paper, the CSPDarknet53 of YOLOv4 was changed to a CSP1-N module for increased performance. YOLOv4 uses residual networks to lower the computing performance requirements of the algorithm, and the memory requirements are partially improved with the CSP1-N module.
Compared to CSPDarkNet53 in
Figure 4, it is an improved network using an H-Swish function [
38], as shown in the following equation:

H-Swish(x) = x · ReLU6(x + 3)/6, where ReLU6(x) = min(max(0, x), 6).
Since the Swish function [
39] contains a sigmoid, it has higher computational cost than the ReLU function but better accuracy. The H-Swish function approximates Swish while reducing the model runtime and attenuating the gradient error, as shown in past work [
40]. It also improves the model's object detection accuracy by segmenting the input feature layer into blocks in CSP1-N. As shown in
Figure 5, it is used as the residual edge of the convolution operation.
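The Hard-Swish activation given above is simple piecewise arithmetic and can be sketched directly; the function names below are illustrative.

```python
def relu6(x):
    # ReLU clipped at 6, as used in mobile-friendly networks.
    return min(max(0.0, x), 6.0)

def hard_swish(x):
    # H-Swish(x) = x * ReLU6(x + 3) / 6: a piecewise-linear
    # approximation of Swish(x) = x * sigmoid(x) that avoids
    # computing an exponential.
    return x * relu6(x + 3.0) / 6.0
```

For x ≥ 3 the function is the identity, for x ≤ -3 it is zero, and in between it smoothly interpolates, e.g. hard_swish(1.0) = 4/6 ≈ 0.667.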
3.3. Neck Network
Convolutional neural networks require input images of the same size. In conventional convolutional neural networks, fixed-size inputs are obtained through cropping and warping operations. There is also work in which YOLOv4 uses multi-scale local features through SPPNet to relax the demand for a fixed input size [
41]. By adding the CSP2-N module to the PANet structure to combine complete feature information at multiple scales, Yu and Zhang improved the model's accuracy and functionality. CSP2-N is shown in
Figure 6. The neck network of YOLOv4 adopts common convolution operations, whereas CSPNet has advantages such as excellent learning ability, reduced computing bottlenecks, and reduced memory cost. By improving the CSPNet network module based on YOLOv4, the feature fusion capability of the network can be further strengthened [
35]. However, in our experiment, we add the CBL-Resunit structure to the information delivery networks to optimize the connectivity of the neck; its main role is to use the cross-layer connectivity of ResNet to let information flow through multiple paths in the FPN and PAN. Semantic and localization information are effectively fused through the different pathways to improve image processing. At the neck, SPPNet performs feature fusion between different backbone layers by combining the bottom-up deep localization features, and in PANet, the top-down path for detailed features is implemented through this fusion operation. This provides more useful features for the prediction network. The CSP2-N network is shown in
Figure 6.
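The spatial pyramid pooling mentioned above concatenates a feature map with several max-pooled versions of itself computed at stride 1 with "same" padding. A minimal single-channel sketch follows; the default kernel sizes (5, 9, 13) match the usual YOLOv4 SPP block, and the function names are illustrative.

```python
def max_pool_same(fmap, k):
    """Max-pool a 2-D map with kernel k, stride 1, 'same' padding,
    so the output keeps the input's spatial size."""
    H, W = len(fmap), len(fmap[0])
    p = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            window = [fmap[a][b]
                      for a in range(max(0, i - p), min(H, i + p + 1))
                      for b in range(max(0, j - p), min(W, j + p + 1))]
            out[i][j] = max(window)
    return out

def spp(fmap, kernels=(5, 9, 13)):
    # Concatenate the input with its pooled versions along the
    # channel axis, enlarging the effective receptive field.
    return [fmap] + [max_pool_same(fmap, k) for k in kernels]
```

Because every branch keeps the spatial size, the outputs can be stacked channel-wise without cropping.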
3.4. Tiling Images
Tiling effectively zooms the detector in on small objects while maintaining the small input resolution needed to run fast inference. If tiling is used during training, it is important to remember to also tile the images at inference time for more accurate results. This preserves the zoomed perspective, so that objects at inference are sized similarly to the objects seen during training. Consider a model trained to detect helmets in construction site photos. In
Figure 7, the model was trained with tiling to better recognize helmets, given their small size relative to the large source image. If tiling is not used at inference, the model detects fittings and other large shapes instead of the helmets it was trained to detect.
Therefore, we tiled the image before running the inference.
Figure 8 shows how tiling magnifies parts of the image and makes the helmet easier for the model to detect.
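The tiling step reduces to computing crop boxes that cover the source image. The helper below is a hypothetical sketch (the paper does not specify a tile size or overlap; the 416-pixel tile in the example is an assumption); edge tiles are shifted inward so every tile is full-sized.

```python
def tile_boxes(width, height, tile, overlap=0):
    """Return (x0, y0, x1, y1) crop boxes of side `tile` covering a
    width x height image, optionally overlapping by `overlap` pixels."""
    step = tile - overlap
    last_x, last_y = max(width - tile, 0), max(height - tile, 0)
    xs = list(range(0, last_x + 1, step))
    ys = list(range(0, last_y + 1, step))
    # Add a final tile flush with the right/bottom edge if needed.
    if xs[-1] != last_x:
        xs.append(last_x)
    if ys[-1] != last_y:
        ys.append(last_y)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]
```

For a 1000 × 800 photo and 416-pixel tiles, this yields six crops; each crop would be run through the detector, and the resulting boxes offset back into source-image coordinates.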
3.5. ISR-YOLOv4 Network Structure
The improved network model uses three CSP1-N networks in the feature extraction network of the backbone, as shown in
Figure 9, and each CSP1-N network has N residual units. In this paper, to reduce the computational requirements, the residual modules are connected in series as combinations of N residual units. This method replaces two 3 × 3 convolution operations with a 1 × 1, 3 × 3, 1 × 1 convolution module. The first 1 × 1 convolutional layer reduces the parameters by cutting the number of channels by approximately 50%. The 3 × 3 convolutional layer improves feature extraction while reusing the reduced number of channels. Finally, the 1 × 1 convolution operation restores the channels of the 3 × 3 convolution layer's output, so the alternative convolution module is efficient, extracts features with high accuracy, and reduces the computing performance requirements.
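The parameter savings of the 1 × 1, 3 × 3, 1 × 1 bottleneck over two stacked 3 × 3 convolutions can be verified with simple weight-count arithmetic. The sketch below is illustrative (bias terms are ignored and the 256-channel example is an assumption, not a figure from the paper).

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def plain_block(c):
    # Two stacked 3 x 3 convolutions keeping c channels throughout.
    return 2 * conv_params(c, c, 3)

def bottleneck_block(c):
    # 1 x 1 halves the channels, 3 x 3 works at c // 2,
    # and a final 1 x 1 restores the original c channels.
    mid = c // 2
    return (conv_params(c, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, c, 1))
```

With c = 256, the plain block needs 1,179,648 weights versus 212,992 for the bottleneck, roughly a 5.5× reduction.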
5. Conclusions
In this paper, we proposed an improved network based on YOLOv4 to solve the helmet detection problem, and the efficiency and robustness of the model were verified through comparative studies with other object detection algorithms. First, we improved object detection accuracy by increasing the resolution of low-resolution photos through the ISR module. The backbone feature extraction network was improved through the CSP1-N module, and CSP2-N was used in the neck so that the model can handle parallel feature fusion during learning. Further, to improve the model's learning of non-linear functions, the H-Swish activation function was added.
As a result of the experiments, our method showed the best detection accuracy on safety helmet detection compared to other algorithms. In addition, the algorithm reduces the model's training cost and complexity, allowing the model to be deployed on medium-sized devices and used in other industries where helmet-wearing decisions are required.
However, in this study, there remain problems of insufficient feature extraction for difficult-to-detect samples and of missed and false-positive cases, and it is difficult to collect large amounts of data under the Personal Information Protection Act. In addition, the type of helmet still cannot be identified. Therefore, the next step should be to extend the dataset to cover helmet types, extend the method to more object detection tasks, and further improve the current model.