1. Introduction
People are essential resources in the construction industry, so their safety should be a top priority. Accidents occasionally occur during the construction of large projects, making construction sites among the most dangerous places to work. Workers can be injured by vehicles such as cars and trucks or by objects falling from a height, and such injuries can be fatal, causing great harm to the worker’s family and to the builder. Therefore, appropriate protective measures should be taken to address this risk. According to the U.S. Bureau of Labor Statistics (BLS) 2013 Workplace Injury Dataset, approximately 100 fatal and 7300 non-fatal head injuries were caused by falling objects on the job each year in the United States between 2003 and 2012. Many of these injuries could have been prevented if workers had consistently worn safety helmets. Wearing a helmet is an effective protective measure to minimize the risk of traumatic brain injury when a falling object strikes a person’s head: helmets reduced the impact of heavy objects (concrete blocks) falling from a height of 1.83 m by 75% to 90% compared to unprotected heads [1]. Therefore, introducing automated monitoring technology and regular site monitoring is essential to reduce the risk of accidents.
In recent years, many researchers have devoted their efforts to detecting workers in the field. Various traditional techniques have been tried to detect the location and activity of workers. Most studies have used classical computer vision and machine learning techniques, such as the histogram of oriented gradients (HoG), background subtraction, and color segmentation, to detect workers’ helmets [2,3,4,5,6]. Most of these methods are computationally expensive, and their accuracy is low.
With the development of deep learning, using convolutional neural networks (CNNs) to locate and monitor workers has become very popular. Compared to traditional detection, CNNs have improved accuracy dramatically. Many advanced algorithms such as SSD [7], YOLO [8], Faster R-CNN [9], and EfficientNet [10] can detect features very accurately, and these methods can easily detect objects in real-time video streams. CNN-based deep learning methods have several advantages over traditional methods. First, they can detect both static and moving objects, whereas traditional methods often rely on object movement. Second, they can detect workers in different poses, while traditional methods sometimes fail when workers’ poses change. However, most existing helmet detection algorithms are not end-to-end; these methods first detect workers and then determine the presence of helmets within the predicted box. Such detection has a logical flaw: a helmet is counted as a positive sample as long as it appears within the predicted box, even if it is not actually being worn, which is the opposite of what monitoring requires.
On the other hand, existing object detection networks are generally complex, with large numbers of parameters, which leads to high computational costs and makes it challenging to meet real-time detection requirements. This paper focuses on detecting workers’ helmets accurately and quickly.
In this paper, the research objective is to identify whether all the people on a construction site are wearing helmets. However, mainstream object detection methods are either computationally expensive or insufficiently accurate. To solve this problem, we designed an end-to-end one-stage convolutional neural network that achieves high accuracy while significantly reducing computational cost, meeting the real-time performance required for real-world detection.
The main contributions of this paper are as follows:
The proposed helmet-wearing detection network utilizes GhostNet, a lightweight network, as the backbone feature extraction network. GhostNet’s cheap operations make the overall model lighter while ensuring efficient automatic feature extraction.
In the feature processing stage, we designed a multi-scale segmentation and feature fusion network (MSFFN) to improve the algorithm’s robustness in detecting objects at different scales. Meanwhile, the feature fusion design enriches the diversity of helmet features, which benefits helmet detection accuracy under distance changes, viewpoint changes, and occlusion.
Our proposed lightweight residual convolutional attention network version 2 (LRCA-Netv2) improves the spatial attention module of LRCA-Net, proposed in our previous work. The main idea is to fuse features aggregated along the horizontal and vertical directions and then apply attention weights separately; this operation establishes dependencies between more distant features using precise location information. It yields a clear performance improvement over the previous module.
The mAP and FPS of the proposed lightweight helmet-wearing detection network, evaluated on the combined dataset, reach 93.5% and 42, respectively, improving on other methods in both execution speed and accuracy.
2. Related Work
Traditionally, helmet detection is a tedious task: before the helmet is detected, the person is usually detected first. Du et al. (2011) [11] detected helmets using a Haar-like feature algorithm. The Haar-like features first detect possible face regions; then, above the face region, the helmet is detected using color detection. An AdaBoost classifier was adopted for face and non-face classification, which helps to remove outliers.
Similarly, Silva et al. (2013) [12] proposed a “hybrid descriptor for feature extraction”, a traditional machine learning approach employed to detect the helmets of motorcycle riders. Their algorithm was based on a local binary pattern (LBP), a histogram of oriented gradients, and a circular Hough transform descriptor combined into a hybrid descriptor. SVM and random-forest classifiers were used for vehicle classification.
Later, Zhu et al. (2016) [13] presented an algorithm for detecting workers in live video streams. Their vision-based method facilitates monitoring construction site workers to detect helmet wearing, per safety regulation requirements. Histogram of oriented gradients (HoG) features, combined with pair-wise matching, were used to detect workers with helmets. This method relies on background subtraction to speed up the detection of human bodies, and it allows a helmet of any color to be easily detected. Similarly, Shrestha et al. (2015) [14] developed a tool that automatically detects workers who are not wearing helmets in real-time video. The procedure uses a Haar-like feature detection algorithm to detect the worker’s face. Once a face is detected, this information is sent to the helmet detection algorithm, which combines edge detection (exploiting the semi-circular appearance of a helmet) with color detection to find the helmet on the worker’s head.
Another vision-based method was proposed by Park et al. (2015) [15]. The authors used a traditional machine learning approach with a dataset of true and false matches of helmet images. The method consists of background subtraction followed by SVM classification, dilation, erosion, and rectangle fitting. The authors extended the HoG technique to detect helmets directly, so helmets of different colors can be detected easily; the helmet is sought in the upper part of the rectangle that encloses the human. For training data, the helmets were annotated manually. Almost the same technique was adopted by Rubaiyat et al. (2016) [2] with a different image dataset, though they used only HoG and the circular Hough transform with an SVM in their approach.
Mneymneh et al. (2017) [16] used SURF features combined with cascaded filters (HoG, Haar-like, and LBP) to detect workers with hard hats; the SURF detector also performed very well. Li et al. (2017) [3] used the ViBe background modeling algorithm combined with HoG and SVM to detect helmeted workers in substations. The ViBe algorithm extracts workers from complex backgrounds, and HoG + SVM determines whether workers are wearing helmets. The method achieves an accuracy of 89% on test data.
In all, traditional methods have performed well, but they share some unavoidable problems. Most of the methods above rely on HoG, color segmentation, and SVM, and they tend to fail under illumination changes, occlusions, color variations, non-circular objects, and cluttered backgrounds. The advantage of traditional methods is that they do not require much data to train; however, their execution speed and accuracy are limited.
To address the limitations of traditional methods, more robust helmet detection is needed. Developing suitable CNN algorithms based on data from construction sites is a feasible solution to this problem.
In such attempts, Jiang et al. (2016) [17] tried an artificial neural network (ANN) combined with statistical features to detect helmets in low-resolution image data. The algorithm comprised a local binary pattern (LBP) and a gray-level co-occurrence matrix (GLCM) aided by a back-propagation artificial neural network; GLCM is an image descriptor that characterizes image texture. It was observed that ANNs could generalize from experience when applied to new, unknown input data. Following ANN models, CNN modeling came into existence. CNNs were a groundbreaking development in deep learning, gaining popularity in recent years, and many CNN variants have since been developed to enhance efficiency.
Fang et al. (2018) [18] used the Faster R-CNN method to detect non-helmet-wearing workers. Faster R-CNN has a moderate processing time and higher precision than older methods. However, Faster R-CNN is a two-stage algorithm, so its speed is lower than that of competitive one-stage methods; a compromise must therefore be made between speed and accuracy, as Faster R-CNN is slow but highly accurate. On the subject of speed, YOLO was developed to overcome the speed issues of Faster R-CNN. Bo et al. (2019) [19] applied YOLOv3 with a DarkNet-53 backbone to detect unsafe actions conducted by workers on-site. The YOLO algorithm can easily be adapted to real-time surveillance systems to detect helmets quickly; according to this study, the accuracy was 96.6%. The key idea behind YOLO is that it is a one-stage algorithm: it solves detection directly as regression, without a separate stage to generate region-of-interest (RoI) proposals.
Similarly, Wu et al. (2019) [20] used an improved YOLOv3 with a DenseNet backbone to identify whether workers wear a safety harness while working. Upon implementation, the YOLO-DenseNet backbone also detected occluded objects well. YOLO has many versions and can be modified into new algorithms; the version used in that work was a modified YOLOv3 that was 2.44% better than the original YOLOv3.
The Single Shot MultiBox Detector (SSD) is worth mentioning when discussing one-stage algorithms. SSD’s success lies in its ability to handle different object sizes via multi-scale feature maps, and its architecture is simple and efficient. Due to SSD’s simplicity and frequent use, Long et al. (2019) [21] adopted it to build a safety head protection detection model. An attractive aspect of SSD is that it uses fewer GPU resources and performs better than the previously mentioned methods; the proposed method achieved a real-time speed of up to 21 fps.
A comparative study was conducted by Nath et al. (2020) [22]. The authors built and compared three models using the YOLOv3 architecture to check the speed and efficiency of each. The first model detects the worker and helmet separately and then passes the information to a machine learning classifier. The second model detects workers with helmets simultaneously in a single CNN framework. The third model uses cropped images, which a CNN classifier then classifies as helmet or non-helmet wearers. The second approach achieved the best results with 72% accuracy at 11 fps; the third approach was second-best with 68% accuracy; the first model was the fastest, processing 13 fps with 63% accuracy.
With the update of the YOLO series, Hayat and Morgado-Dias (2022) [23] proposed an automatic safety helmet detection system based on YOLOv5x, with good detection capability for smaller objects and for objects in low-light images.
Recently, Wang et al. (2020) [24] built a model based on a lightweight CNN to detect helmets in real time. A depth-wise separable MobileNet model was used for detection: in depth-wise convolution, a single filter is applied to each input channel, and a pointwise 1 × 1 convolution then combines the outputs of the depth-wise convolution. MobileNet, adopted as the backbone, enables fast multi-scale feature map generation, meaning the network can detect small objects and deal with occlusions. The average precision of the proposed method was up to 89.4%, with a running speed of 62 FPS.
In contrast, representative works such as SCRDet [25] proposed a sampling fusion network that fuses multi-layer features with adequate anchor sampling to improve the detection of objects that are small, arbitrarily oriented, and densely distributed. Building on this, SCRDet++ [26] highlights the features of small objects and reduces background interference through an instance-level feature-map denoising module, and improves rotation detection by incorporating an IoU constant factor into the Smooth L1 loss. This line of work provides excellent ideas for detecting dense, small objects.
In this paper, we propose a new one-stage lightweight automatic helmet-wearing detection algorithm to address the limitations of previous studies. The method’s superiority is demonstrated on a publicly available dataset, contributing to industrial application development and to lightweight neural network research.
3. Methodologies
Figure 1 shows the network framework of the proposed helmet-wearing detection algorithm, which consists of three main components: a backbone feature extraction network, a multi-scale segmentation and feature fusion network, and an improved lightweight residual convolutional attention network, namely LRCA-Netv2.
GhostNet is used as the backbone feature extraction network to keep the overall network lightweight. The multi-scale segmentation and feature fusion network (MSFFN) improves the algorithm’s robustness in detecting objects at different scales, and the feature fusion design further facilitates helmet detection. In addition, the spatial attention module of LRCA-Net is improved: the original 7 × 7 convolution, which cannot synthesize global information in the spatial dimension, is abandoned. Instead, features aggregated along the horizontal and vertical directions are fused and then given attention weights separately, which attends to global features more effectively; precise location information is used to establish dependencies between more distant features. Experimental results show that this design significantly improves helmet detection accuracy while greatly reducing computational cost.
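Before detailing each component, the data flow just described can be summarized in a short sketch (the function and component names are placeholders of our own, and the per-scale prediction heads follow the usual three-scale one-stage pattern, which is an assumption rather than a detail stated in this section):

```python
# Hypothetical top-level wiring of the described pipeline.
def detect(image, backbone, msffn, lrca_netv2, heads):
    c4, c5, c6 = backbone(image)           # GhostNet: features at three scales
    p4, p5, p6 = msffn(c4, c5, c6)         # multi-scale split + feature fusion
    refined = [lrca_netv2(p) for p in (p4, p5, p6)]      # attention re-weighting
    return [head(f) for head, f in zip(heads, refined)]  # per-scale predictions
```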
3.1. Design of Backbone Network
Due to the application device’s limited memory and computational resources, the efficiency and light weight of detection algorithms designed for the helmet-wearing problem are crucial. The research focuses on making the network less computationally intensive while preserving accuracy. The overall computation of a neural network mainly depends on the number of parameters in the backbone, so choosing a lightweight backbone network is essential. GhostNet [27] (Han et al., 2020) generates some feature maps by applying cheap transformations to others, so that each generated map can be considered a phantom of an intrinsic one. Because phantom feature maps are produced by the cheap operations of the Ghost module, the same number of feature maps can be generated with fewer parameters than a standard convolutional layer, requiring fewer computational resources. We chose GhostNet as the backbone network because it improves the execution speed of models within existing neural network structures. The backbone was built from two Ghost bottleneck modules in series, with strides 1 and 2, as shown in Figure 2.
When the stride equals 1, no height or width compression is applied to the input feature layer, and a residual structure is used to optimize the network. When the stride equals 2, a depth-wise convolution is added in the middle of the residual structure to compress the height and width of the input feature layer. A backbone designed this way reduces the number of parameters in the network while progressively deepening the feature layers. Starting from the feature map produced by one standard convolution, the features can be compressed and deepened by one stride-2 block followed by multiple stride-1 blocks. We took the last three convolutional layers, namely Conv4, Conv5, and Conv6, with shapes (52, 52, 40), (26, 26, 112), and (13, 13, 160), respectively. These three feature layers carry multi-scale information at three sizes, suitable for objects both near and far, and were fed into our proposed multi-scale segmentation and feature fusion network (MSFFN) for the helmet detection task.
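As an illustration of the cheap operation, the following is a minimal PyTorch sketch of a Ghost module with a ratio of 2 (our own simplification of [27]; the class name and hyper-parameters are hypothetical). Half of the output channels come from an ordinary convolution, and the other half are phantom maps produced by an inexpensive depth-wise convolution:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module sketch (ratio 2): concatenates intrinsic feature maps
    from a standard convolution with 'phantom' maps generated cheaply by a
    depth-wise convolution over those intrinsic maps."""
    def __init__(self, in_ch: int, out_ch: int, cheap_kernel: int = 3):
        super().__init__()
        primary_ch = out_ch // 2  # assumes out_ch is even
        self.primary = nn.Sequential(           # costly standard convolution
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(             # cheap depth-wise transform
            nn.Conv2d(primary_ch, primary_ch, cheap_kernel,
                      padding=cheap_kernel // 2, groups=primary_ch, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                      # intrinsic maps
        return torch.cat([y, self.cheap(y)], 1)  # intrinsic + phantom maps
```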
3.2. Multi-Scale and Feature Fusion Network
By analyzing the samples in the helmet dataset under the 13 × 13, 26 × 26, and 52 × 52 multi-scale segmentations, the object as a whole can be captured as a feature, whether a 3 × 3 or a 5 × 5 convolutional kernel is used. Since the majority of convolutional kernels in the whole network have a size of 3 × 3, the typical size of a helmet wearer best matches the 26 × 26 segmentation, as shown in Figure 3; therefore, when designing the feature fusion network, more attention should be paid to the 26 × 26 feature layers.
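For intuition, the three segmentation scales imply the following grid cell sizes; the 416 × 416 input resolution is inferred from the feature shapes given in Section 3.1 (416/8 = 52, 416/16 = 26, 416/32 = 13):

\[
\frac{416}{13} = 32\ \text{px}, \qquad
\frac{416}{26} = 16\ \text{px}, \qquad
\frac{416}{52} = 8\ \text{px per grid cell},
\]

so a typical helmet wearer spans only a few cells at the 26 × 26 scale, which is why a 3 × 3 kernel there covers the object well.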
In the feature fusion process, the Conv5 module of shape (26, 26, 112) is first fed into spatial pyramid pooling (SPP), as in Figure 4. SPP first halves the input channels with the Conv×5 module and then performs max pooling with kernel sizes of 26, 13, and 5, respectively, where the padding is adapted to each kernel size. The results of the three max pooling operations are concatenated with the unpooled data, and the number of combined channels is then restored to match the input. Conv4_1 of shape (52, 52, 128), Conv5_1 of shape (26, 26, 256), and Conv6_1 of shape (13, 13, 512) are obtained, respectively. Next, Conv4_1 and Conv6_1 are downsampled and upsampled, respectively, and each is combined with Conv5_1 to generate Conv5_2 and Conv5_3. Conv5_2 and Conv5_3 are then fused to obtain the new feature layer Conv5_4.
We designed the SPP module shown in Figure 4 to fuse local and global features of the feature map. Because the largest pooling kernel has the same size as the input feature map, this structure enriches the expressiveness of the feature map and helps when object sizes in the images to be detected differ greatly. Subsequent ablation experiments show that adding the SPP module effectively improves detection accuracy. Combined with the MSFFN design, the multi-scale feature layers deepen the feature network and further enrich the diversity of shapes. Finally, the attention module is added to the three obtained feature layers of different scales: Conv4_1, Conv5_4, and Conv6_1.
3.3. Improved Lightweight Residual Convolutional Attention Network
In our previous work, we replaced the fully connected layer in the channel attention module of CBAM with 1D convolution. We added the residual structure to obtain the improved network LRCA-Net [
28] (Liang et al., 2022) so that the modified module can effectively capture the information of cross-channel interaction and reduce the overall number of parameters of the module to achieve overall efficiency.
However, by analyzing the spatial attention module of LRCA-Net (whose structure is shown in Figure 5), we find that there is still room for improvement. The operation process is shown in Equation (1): max pooling and average pooling are applied, respectively, and the two resulting features are concatenated into a feature of shape H × W × 2, which is passed through a standard convolution layer of size 7 × 7 and a sigmoid to produce the spatial-refined feature F″:

\[
F'' = A_s(F') \otimes F' = \sigma\!\left(k^{7\times7}\big([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')]\big)\right) \otimes F' \tag{1}
\]

where A_s denotes the spatial attention module, F′ denotes the channel-refined feature, F″ denotes the spatial-refined feature, σ denotes the sigmoid function, and k^{7×7} represents a convolution with kernel size 7 × 7.
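For reference, Equation (1) corresponds to the following PyTorch sketch (our own rendering of the CBAM-style module, with hypothetical names):

```python
import torch
import torch.nn as nn

class SpatialAttentionV1(nn.Module):
    """Original spatial attention of Equation (1): channel-wise average and
    max pooling give an H x W x 2 map, a 7x7 convolution and sigmoid produce
    the attention map, which re-weights the channel-refined feature F'."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, f):                    # f: (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)    # average pooling over channels
        mx, _ = f.max(dim=1, keepdim=True)   # max pooling over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * attn                      # spatial-refined feature F''
```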
The drawback of this spatial attention module is obvious: a direct 7 × 7 standard convolution on the feature map cannot yield a feature map with integrated global information. As shown in Figure 6, for a feature map of size n × n, two 7 × 7 convolutions at distant positions lack any perceptual relationship with each other.
To solve this problem, we improved the spatial attention module, as shown in Figure 7. First, for the input feature map F′ of shape H × W × C, we perform average pooling along the X and Y axes, respectively, as in Equations (2) and (3):

\[
X(F')_h = \frac{1}{W} \sum_{0 \le i < W} F'(h, i) \tag{2}
\]

\[
Y(F')_w = \frac{1}{H} \sum_{0 \le j < H} F'(j, w) \tag{3}
\]

where W is the width of the feature map, H is the height of the feature map, and F′(i, j) is the feature map value at position (i, j).
The aggregated features obtained along the two directions are X(F′) with shape H × 1 × C and Y(F′) with shape 1 × W × C, respectively. The two features are concatenated and subjected to a 1 × 1 convolution operation to obtain the integrated feature f, as in Equation (4). This perception capability provides the correlation between features in the same space and accurately retains their location information, which helps to locate the object more accurately:

\[
f = \mathrm{ReLU}\!\left(k^{1\times1}\big([X(F'),\ Y(F')]\big)\right) \tag{4}
\]

where [,] represents the concatenate operation, ReLU denotes the normalization and activation function, and f is the integrated feature.
After splitting f along the spatial dimension into f^H and f^W and passing each part through a 1 × 1 convolution followed by the sigmoid function, two sets of attention weights, A_s(H) with shape H × 1 × C and A_s(W) with shape 1 × W × C, are obtained as in Equations (5) and (6):

\[
A_s(H) = \sigma\!\left(k^{1\times1} f^{H}\right) \tag{5}
\]

\[
A_s(W) = \sigma\!\left(k^{1\times1} f^{W}\right) \tag{6}
\]
The obtained attention weights are combined with the input features to obtain the spatial-refined feature F″, as in Equation (7):

\[
F'' = F' \otimes A_s(H) \otimes A_s(W) \tag{7}
\]
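Equations (2)–(7) translate into the following PyTorch sketch (our own reconstruction; the internal channel width of the 1 × 1 convolutions is an assumption, as the exact reduction ratio is not stated in this section):

```python
import torch
import torch.nn as nn

class SpatialAttentionV2(nn.Module):
    """Improved spatial attention: directional average pooling (Eqs. 2-3),
    1x1 fusion of the concatenated descriptors (Eq. 4), per-direction
    sigmoid gates (Eqs. 5-6), and re-weighting of F' (Eq. 7)."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.ReLU(inplace=True))
        self.attn_h = nn.Conv2d(channels, channels, 1)
        self.attn_w = nn.Conv2d(channels, channels, 1)

    def forward(self, f):                     # f: (B, C, H, W)
        x_desc = f.mean(dim=3, keepdim=True)  # Eq. (2): (B, C, H, 1)
        y_desc = f.mean(dim=2, keepdim=True)  # Eq. (3): (B, C, 1, W)
        # Eq. (4): concatenate along the spatial axis, then 1x1 conv + ReLU.
        mixed = self.mix(torch.cat([x_desc, y_desc.transpose(2, 3)], dim=2))
        f_h, f_w = torch.split(mixed, [f.size(2), f.size(3)], dim=2)
        a_h = torch.sigmoid(self.attn_h(f_h))                  # Eq. (5)
        a_w = torch.sigmoid(self.attn_w(f_w.transpose(2, 3)))  # Eq. (6)
        return f * a_h * a_w                                   # Eq. (7)
```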
Replacing the previous method with this new spatial attention module yields the improved attention mechanism LRCA-Netv2, whose overall structure is shown in Figure 8.
The idea proposed in this paper focuses on the spatial dimension: features aggregated along the horizontal and vertical directions are combined and then given attention weights separately, which attends to global features more effectively. Whereas the previous spatial attention relied solely on convolution, the new design uses precise location information to establish dependencies among more distant features.
5. Conclusions
To reduce the risk of head trauma to workers during construction work at high-risk workplaces such as construction sites, it is critical to develop an algorithm that can automatically and robustly detect helmet wear.
In this paper, we designed a novel one-stage lightweight end-to-end convolutional neural network aimed at identifying whether people on a construction site are wearing helmets. The algorithm achieves high accuracy while significantly reducing computational costs and meets the real-time performance required for real-world detection. The designed network first utilizes GhostNet, a lightweight network, as the backbone feature extraction network; its cheap operations make the model lighter overall while ensuring efficient automatic feature extraction. Secondly, we designed a multi-scale segmentation and feature fusion network (MSFFN) in the feature-processing stage to improve the algorithm’s robustness in detecting objects at different scales.
Meanwhile, the design of the feature fusion network enriches the diversity of helmet features, which improves the accuracy of helmet detection under distance changes, viewpoint changes, and occlusion. For the attention module, we proposed LRCA-Netv2, an improved version of LRCA-Net that yields a clear performance improvement over its predecessor. Finally, the mAP and FPS of the proposed lightweight helmet-wearing detection network evaluated on the combined dataset reached 93.5% and 42, respectively; our model performs excellently compared to other methods. This work provides new ideas for improving existing helmet-wearing detection algorithms and for lightweight model design. Future work will expand the variety of construction site detection objects, especially small objects, and pursue directions such as improving the loss function, designing adaptive modules, and combining helmet detection with tracking techniques for monitoring other objects.