Safety Helmet Detection Based on YOLOv5 Driven by Super-Resolution Reconstruction

High-resolution image transmission is required in safety helmet detection problems in the construction industry, which makes it difficult for existing image detection methods to achieve high-speed detection. To overcome this problem, a novel super-resolution (SR) reconstruction module is designed to improve the resolution of images before the detection module. In the super-resolution reconstruction module, the multichannel attention mechanism module is used to improve the breadth of feature capture. Furthermore, a novel CSP (Cross Stage Partial) module of YOLO (You Only Look Once) v5 is presented to reduce information loss and gradient confusion. Experiments are performed to validate the proposed algorithm. The PSNR (peak signal-to-noise ratio) of the proposed module is 29.420, and the SSIM (structural similarity) reaches 0.855. These results show that the proposed model works well for safety helmet detection in construction industries.


Introduction
The construction industry is one of the most prone to safety accidents. Therefore, it is of great practical significance to study safety guarantees in this field. Over the past 20 years, it has experienced a decline in accident rates [1].
Head injuries can easily lead to disability [2]. Reducing head injuries is therefore the primary concern in ensuring personnel safety in the industry, and safety helmets are widely used for this purpose. The impact resistance of safety helmets can disperse the impact of rocks. Thus, many industrial regulations require workers to wear safety helmets while working. However, workers often do not wear safety helmets as required because of a lack of safety awareness, and safety accidents are frequent as a result. According to relevant research statistics, 47.3% of people with head injuries in construction site accidents did not wear safety helmets [3]. Thus, it is very important to strengthen the supervision and management of workers. At present, the management of safety helmet wearing on most construction sites still requires manual monitoring. However, the efficiency of manual monitoring is very low because of the large working area and the large flow of personnel. With the development of science and technology, video surveillance has become more and more popular, and it is a vital part of safety helmet detection. Hence, the optimization of monitoring systems is widely studied [4,5]. Traditional video surveillance is mainly used for continuous monitoring. However, the final judgement still relies on human decisions, and the degree of automation is insufficient. Intelligent algorithms are one way to enhance automation. They are widely used in image processing, prediction, robotics and so on [6][7][8][9][10][11]. Currently, evolutionary algorithms and deep learning are two important classes of intelligent systems [12][13][14]. Of these, deep learning is widely used in image processing because of its strong learning ability [15][16][17], and it can be combined with video surveillance to solve the problems of traditional methods [18].
The main contributions of this paper are as follows.
(1) A novel super-resolution (SR) reconstruction module is designed to improve the resolution of the image before the detection module. Compared with existing methods [23,24], this method reduces the influence of high-resolution image transmission on detection speed in the construction industry.
(2) A novel CSP (Cross Stage Partial) module of YOLO (You Only Look Once) v5 is presented to reduce information loss and gradient confusion.
(3) Based on the proposed SR reconstruction network and YOLOv5 network, a novel end-to-end safety helmet detection model is proposed, which reaches an average precision (AP) of 79.1%.
(4) More than 13,000 images are collected for safety helmet detection on construction sites.
The organization of the rest of this paper is as follows. Section 2 introduces the work related to target detection and SR reconstruction. Section 3 shows the structure and the details of the Dual-Channel Residual SR reconstruction model and the improved YOLOv5 model. The experimental details and results are given in Section 4. The conclusions are given in Section 5.

Related Work
Traditional safety helmet detection methods choose features manually for target detection. They have strong subjectivity, poor generalization ability and limitations in engineering applications. With the continuous development of deep-learning algorithms, researchers are applying deep-learning algorithms to target detection and image SR reconstruction.

Target Detection
At present, research on target detection algorithms covers two-stage and one-stage algorithms. Two-stage detection algorithms generate a series of candidate boxes as samples and then classify the samples through a convolutional neural network. This kind of detection method has higher accuracy but slower speed. Girshick et al. [25] proposed the region convolutional neural network (R-CNN), which was followed by the fast regions with CNN (Fast R-CNN) [26] and faster regions with CNN (Faster R-CNN) [27] algorithms.
Sensors 2023, 23, 1822
A one-stage detection algorithm directly regresses the category probability and position coordinate values of objects through a backbone network without using a region proposal network (RPN). This kind of detection method sacrifices detection precision but improves detection speed. In 2016, Liu et al. [28] introduced the multiscale detection method and proposed the SSD (single shot multibox detection) algorithm, which improved the detection accuracy. Redmon et al. [29][30][31] proposed YOLOv1, YOLOv2 and YOLOv3. The YOLOv1 network model abstracted the target detection task into a regression problem for the first time, which greatly sped up target recognition. The YOLOv2 network model introduced a new basic model named darknet-19 based on YOLOv1 to realize end-to-end training. Compared with YOLOv1, the YOLOv2 network model is more accurate and faster and supports more target categories. YOLOv3 introduced the feature pyramid network (FPN) algorithm, adopted the new basic model darknet-53 and integrated three feature layers of different sizes for detection tasks. It improved detection speed and accuracy, especially the detection performance for small targets. Bochkovskiy et al. [32] proposed YOLOv4. This detection network takes CSP darknet-53 as the backbone network and uses the PANet path aggregation algorithm. As a result, it improved the detection accuracy of the model. In 2020, Jocher et al. [33] proposed YOLOv5. This network model adds a focus structure to the backbone network of YOLOv4 to obtain a balance between detection speed and accuracy. Carion et al. [23] proposed DETR for end-to-end object detection and brought transformers into the object detection field. Recently, Wang et al. [34] proposed YOLOv7, which has achieved better accuracy and speed than YOLOv5.

SR Reconstruction
The image SR reconstruction algorithm is used to recover high-resolution images from one or more low-resolution images. Dong et al. [19] proposed the SR convolutional neural network (SRCNN). SRCNN effectively improves the results of image SR reconstruction compared with traditional image SR algorithms. However, the network is relatively simple, and it converges slowly. In subsequent research, researchers added a residual structure to the convolution network to address these problems. Kim et al. [23] proposed the VDSR network and increased the number of layers of the CNN to 20. The residual structure and CNN are embedded into image SR reconstruction, and the image reconstruction result is improved. Li et al. [20] proposed a multiscale residual network (MSRN). This network includes image multiscale features in the residual structure to further improve the image reconstruction result. Zhang et al. [35] proposed the residual channel attention network (RCAN). This network applies a channel attention mechanism to the image SR problem and achieves a better reconstruction effect than previous algorithms. Lu et al. [36] presented a novel recursive unit for SR reconstruction that forces models to learn more details through learnable up-sampling methods. Liu et al. [37] proposed an attention-based approach to discriminate between texture areas and smooth areas.

Materials and Methods
In this paper, the proposed safety helmet detection model is designed based on a Dual-Channel Residual SR reconstruction module and an improved YOLOv5 module. The overall architecture is given in Figure 1, where $I_{LR}$ denotes the input image features and $I_{SR}$ denotes the reconstructed image features. The two submodules in this figure are described as follows.

Dual-Channel Residual SR Reconstruction Module
The SR reconstruction module consists of three modules: a shallow feature extraction module, a depth nonlinear feature mapping module and an up-sampling reconstruction module. The specific structure is shown in Figure 2.

Figure 1. Overall architecture of the proposed model (labels: $I_{LR}$, $I_{SR}$, feature image, Dual-Channel Residual SR reconstruction module, improved YOLOv5 module).

The shallow feature-extraction module is an ordinary convolution layer. As shown in Figure 2, the shallow extracted feature $F_L$ is obtained from the input original low-resolution image $I_{LR}$ through this module.
The depth nonlinear feature-mapping module is composed of several dual-channel pixel-channel attention blocks (DCPCABs). Each DCPCAB is shown in Figure 3. In this figure, the main channel of the DCPCAB module is composed of several pixel attention blocks (PABs), a channel attention (CA) block and a convolution layer. In this paper, the number of PABs in DCPCAB is two. The auxiliary channel is composed of two convolution layers and an adaptive structured convolution block.

The architecture of PAB is shown in Figure 4. In this figure, $F_{in}$ is the input feature, and it is put into three branches x, y and z. One convolution layer with a 1 × 1 kernel size is adopted in branch x to reduce the input feature $F_{in}$ to the output feature $F_x$. In branch y, the input features are first fed into a convolution layer with a 1 × 1 kernel size for dimension reduction and then put into a convolution layer with a 3 × 3 kernel size for feature extraction to obtain the output feature $F_y$. In branch z, the input feature is first fed into a convolution layer for dimension reduction. The reduced-dimension feature is input into the pixel attention (PA) mechanism network for pixel-level feature weighting, which yields $F_{PA}$; the PA network uses the sigmoid activation function $\delta$. A convolution layer is then adopted for feature extraction to obtain the output feature of branch z, given as

$$F_z = \mathrm{conv}_{3\times3}(F_{PA}),$$

where $\mathrm{conv}_{3\times3}$ represents the convolution operation with a 3 × 3 kernel size. The output features of the three branches are combined through a concat operation, $\mathrm{concat}(F_x, F_y, F_z)$, where concat means the operation of channel merging, which is used to merge the three features.

Figure 4. The PAB architecture.
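The pixel-level weighting in branch z can be illustrated with a small sketch. It assumes a commonly used form of pixel attention, in which a 1 × 1 convolution followed by the sigmoid function produces one weight per pixel and channel, and this weight multiplies the feature itself. The exact form used in the PA network is not stated here, so the function name `pixel_attention` and its `weight` and `bias` arguments are hypothetical.

```python
import math


def sigmoid(value):
    """Logistic function, written delta in the text."""
    return 1.0 / (1.0 + math.exp(-value))


def pixel_attention(feature, weight, bias):
    """Gate every value of a [C][H][W] feature map with its own attention weight.

    Assumed form (not confirmed by the paper):
        out = feature * sigmoid(conv1x1(feature))
    where the 1x1 convolution is a C x C matrix `weight` plus a length-C `bias`.
    """
    channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    output = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for row in range(height):
        for col in range(width):
            pixel = [feature[c][row][col] for c in range(channels)]
            for out_c in range(channels):
                # A 1x1 convolution is a per-pixel linear map over channels.
                logit = bias[out_c] + sum(
                    weight[out_c][in_c] * pixel[in_c] for in_c in range(channels)
                )
                output[out_c][row][col] = pixel[out_c] * sigmoid(logit)
    return output
```

With all weights and biases set to zero, every attention weight equals 0.5, so the output is half of the input. This gives a quick sanity check of the gating behaviour.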
A DCPCAB module in Figure 3 contains multiple PABs. The output $F_{on}$ of each PAB is calculated iteratively, block by block. The CA output in Figure 3 is obtained using the global average pooling operation $F_{GAP}$ and the pointwise multiplication operation $\otimes$.
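The role of global average pooling and pointwise multiplication in the CA block can be sketched as follows. The sketch assumes the channel attention form popularized by residual channel attention networks: pool each channel to one scalar, pass the result through two small linear layers with ReLU and sigmoid, and rescale each channel. The name `channel_attention` and the weight matrices `w_down` and `w_up` are hypothetical, and the layout used in the paper may differ.

```python
import math


def channel_attention(feature, w_down, w_up):
    """Rescale each channel of a [C][H][W] map by a weight in (0, 1).

    Assumed form (not confirmed by the paper):
        s = sigmoid(w_up @ relu(w_down @ GAP(feature))),  out_c = s_c * feature_c
    """
    # Global average pooling: one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]
    # Channel reduction followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    # Channel expansion followed by sigmoid gives one scale per channel.
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_up
    ]
    # Pointwise multiplication of every channel with its scale.
    return [
        [[scale * value for value in row] for row in channel]
        for scale, channel in zip(scales, feature)
    ]
```

Unlike pixel attention, this produces a single weight per channel, so all pixels of a channel are scaled by the same factor.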
The main channel output $F_v$ in Figure 3 is computed from $F_{IN}$, the input of the DCPCAB. The SR reconstruction operation tends to blur or even deform the edge information of the original images. The auxiliary channel module is introduced to broaden the width of the whole network to solve these problems. Adaptive structured convolution blocks are added in the auxiliary channel. These blocks adapt their expansion rates to the image size. They make the whole depth nonlinear feature-mapping submodule focus on the extraction of high-frequency features of the image. The auxiliary channel is built on the expansion convolution operation $F_{DC}$, where rate refers to the expansion rate. The final output $F_D$ in Figure 3 combines the outputs of the main and auxiliary channels. The output $F_{out}$ of the module in Figure 2 is calculated from $F_{DN}$, the output of the final DCPCAB. The up-sampling reconstruction module in Figure 2 can be given by

$$I_{SR} = F_{up}(F_{out}),$$

where $F_{up}$ represents the up-sampling operation and $I_{SR}$ represents the result of the up-sampling reconstruction module.
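The effect of the expansion rate can be illustrated in one dimension. The sketch assumes that the expansion convolution is what is usually called a dilated convolution, in which the kernel taps are spaced `rate` samples apart. The helper names below are hypothetical, and the adaptive choice of rate from the image size is not modelled.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def dilated_conv1d(signal, kernel, rate):
    """Valid-mode 1D dilated convolution (no padding, stride 1)."""
    span = effective_kernel_size(len(kernel), rate)
    return [
        sum(kernel[k] * signal[start + k * rate] for k in range(len(kernel)))
        for start in range(len(signal) - span + 1)
    ]
```

A 3-tap kernel with rate 2 covers 5 samples, so the receptive field grows without adding parameters. This is one reason such blocks are attractive for widening a network cheaply.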
To optimize the proposed SR reconstruction network, a loss function is adopted as

$$L = \frac{1}{k}\sum_{i=1}^{k}\left\| I_{SR}^{i} - I_{HR}^{i} \right\|_1,$$

where $\|\cdot\|_1$ means the L1 norm, $k$ represents the number of training pictures and $I_{HR}^{i}$ represents the high-resolution image corresponding to $I_{SR}^{i}$. In this paper, the final loss of the best results is 0.0034, which is the mean absolute error (MAE) between the reconstructed image and the high-resolution image. The receptive field and the speed of the SR reconstruction models can be improved by the dual-channel residual structure. The number of parameters in the SR reconstruction model can be reduced by the PCAB structure. In other words, the PCAB structure can make the model more lightweight.
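A minimal sketch of this loss is given below. Images are represented as flat lists of pixel values, and the absolute error is additionally divided by the number of pixels so that the result is a per-pixel MAE. That normalization is an assumption made to match the reported interpretation of the loss value, and the function name `l1_loss` is hypothetical.

```python
def l1_loss(reconstructed_batch, reference_batch):
    """Mean absolute error averaged over k image pairs (images are flat lists)."""
    total = 0.0
    for reconstructed, reference in zip(reconstructed_batch, reference_batch):
        # Per-image L1 norm of the difference, normalized by the pixel count.
        total += sum(abs(a - b) for a, b in zip(reconstructed, reference)) / len(reference)
    # Average over the k images in the batch.
    return total / len(reference_batch)
```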
Remark 1: Here, the proposed Dual-Channel Residual SR reconstruction model is compared with SRCNN and SRGAN. Neither of these models considers the receptive field or the speed of the SR reconstruction model. SRCNN effectively improves the results of image SR reconstruction compared with traditional image SR algorithms. However, the network is relatively simple, and it converges slowly. SRGAN [38] focuses on restoring fine-grained texture details. To improve the receptive field and the speed of the SR reconstruction model, the dual-channel structure is used in the proposed model. The number of parameters in the SR reconstruction model can be reduced by the PCAB structure. When other architectures are used instead of the dual-channel structure, the number of parameters to be trained is the product of multiple dimensions, whereas the dual-channel structure can halve the number of parameters. In other words, the PCAB structure can make the model more lightweight.

The Improved YOLOv5 Module
YOLOv5 is an improved algorithm based on YOLOv4, proposed by Ultralytics LLC. It offers excellent detection accuracy and speed among single-stage detection networks. YOLOv5 performs well on the Pascal visual object classes (Pascal VOC) and common objects in context (COCO) target detection tasks, so it is selected as the detection network.
The YOLOv5 network structure is divided into four parts: the input port, backbone network, neck part and prediction part. The structure is shown in Figure 5 [33]. The input port is used to mosaic random images to enrich the datasets, calculate the adaptive anchor frame and zoom images adaptively. The backbone mainly adopts a focus structure and a cross-stage partial (CSP) structure to obtain features. The focus in the backbone is used to slice the input image data. The structure combining three multiscale pooling layers is used to improve the receptive field of the network while minimizing the loss of speed. It helps the network to extract the important image features, reduce the image loss caused by early image processing and further improve the detection accuracy of the model. The structures of CSP and CBL are shown in Figure 6.
The CSP1_X module is improved in two parts, as shown in Figure 7. The original CSP structure of YOLOv5 can lead to problems such as information loss and gradient confusion. Therefore, we use the LSandGlass module, built on 3 × 3 depthwise convolution layers, to replace the Res unit residual module in YOLOv5. LSandGlass differs from the bottleneck structure in where the depthwise spatial convolution is placed: the 3 × 3 depthwise (Dwise) convolution layers are moved to both ends of the residual path, where the representation is high-dimensional, and the CBL blocks are placed in the middle. The two depthwise convolutions can encode more spatial information and let more gradients propagate across multiple layers, thus reducing information loss.
Considering that such processing can increase the overall computation, the Ghost module is used to replace the CBL module of the bottleneck module in CSP. This scheme is adopted to reduce the computation and to compress the model size compared with the original 3 × 3 standard convolution.
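The saving obtained by replacing a standard convolution can be made concrete with a parameter count. The sketch follows the usual description of a Ghost module: a primary convolution produces a fraction of the output maps, and cheap depthwise operations generate the rest. The ratio of 2 and the 3 × 3 cheap kernel are illustrative defaults, not values taken from this paper. Biases and normalization layers are ignored, and both function names are hypothetical.

```python
def standard_conv_params(in_channels, out_channels, kernel_size):
    """Weights of a bias-free standard convolution."""
    return in_channels * out_channels * kernel_size * kernel_size


def ghost_module_params(in_channels, out_channels, kernel_size, ratio=2, cheap_kernel=3):
    """Weights of a bias-free Ghost module (out_channels must be divisible by ratio)."""
    # Primary convolution produces the intrinsic feature maps.
    intrinsic = out_channels // ratio
    primary = in_channels * intrinsic * kernel_size * kernel_size
    # Cheap depthwise operations generate the remaining maps from the intrinsic ones.
    cheap = intrinsic * (ratio - 1) * cheap_kernel * cheap_kernel
    return primary + cheap
```

For 64 input channels, 128 output channels and a 3 × 3 kernel, the standard layer has 73,728 weights and the Ghost module 37,440, roughly half.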
The neck module of YOLOv5 uses the structure of feature pyramid networks (FPN) and path aggregation networks (PAN). The prediction module contains the bounding box loss function and the non-maximum suppression (NMS) function. YOLOv5 uses the binary cross entropy loss function to calculate the loss of category probability and target confidence score. In the experiment, CIOU loss is selected as the bounding box loss function. The related formulas are given as [39]

$$L_{CIoU} = 1 - IoU + \frac{d_1^2}{d_2^2} + \alpha v,$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{g}}{h^{g}} - \arctan\frac{w}{h}\right)^2,$$

$$\alpha = \frac{v}{(1 - IoU) + v},$$

where $IoU$ is the intersection over union of the prediction box and the target box, $d_1$ represents the Euclidean distance between the centre points of the prediction box and the target box, and $d_2$ represents the diagonal length of the minimum rectangle enclosing both boxes. $w^{g}/h^{g}$ represents the aspect ratio of the target frame, and $w/h$ represents the aspect ratio of the predicted frame.
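The CIOU loss for a single pair of boxes can be sketched as follows. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2) with positive width and height. A small `eps` term is added to avoid division by zero, which is an implementation choice rather than part of the definition, and the function name `ciou_loss` is hypothetical.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    # d1^2: squared distance between the two box centres.
    d1_sq = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    # d2^2: squared diagonal of the smallest rectangle enclosing both boxes.
    d2_sq = (max(px2, tx2) - min(px1, tx1)) ** 2 + (max(py2, ty2) - min(py1, ty1)) ** 2
    # v: aspect-ratio consistency term; alpha: its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((tx2 - tx1) / (ty2 - ty1)) - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + d1_sq / (d2_sq + eps) + alpha * v
```

For two 2 × 2 boxes offset by one unit in each direction, IoU is 1/7, the centre term is 2/18 and the aspect-ratio term vanishes, so the loss is 61/63. Identical boxes give a loss of approximately zero.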
Remark 2: Here, the improved YOLOv5 model is compared with the original YOLOv5 model. The original YOLOv5 model did not consider problems such as information loss and gradient confusion. The proposed YOLOv5 model uses the LSandGlass module to replace the Res unit residual module in YOLOv5 to solve these problems. Considering that such processing can increase the overall computation, the Ghost module is used to replace the CBL module to solve it.

Experimental Setup
First, we collected the image datasets ourselves; all images come from construction sites and depict safety helmets. We first obtained videos from construction sites and then used VoTT to extract images from the videos. Each video is about 14 s long, and we sampled 7 frames per second to obtain the experimental images. The datasets contain about 13,000 images, and the resolution of each image is 610 × 480. Then, we used pure interpolation to downscale the images by a factor of 2 to obtain the low-resolution inputs.
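The data preparation can be sketched in two small steps: counting the frames sampled per video and producing a 2× low-resolution image. The interpolation kernel is not specified, so averaging each 2 × 2 block is used here purely as a stand-in. Images are single-channel nested lists, and both function names are hypothetical.

```python
def frames_per_video(duration_seconds, frames_per_second):
    """Number of frames sampled from one video."""
    return int(duration_seconds * frames_per_second)


def downsample_2x(image):
    """Halve a [H][W] image by averaging each 2x2 block (assumed interpolation)."""
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]
```

Under these assumptions, a 14 s video sampled at 7 frames per second gives 98 frames, and a 610 × 480 image becomes 305 × 240.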
To realize fast and reliable results, the entire method was implemented on a workstation equipped with two NVIDIA TITANRTX GPUs and an Intel i9 CPU.
All coding work was based on Python 3.7 and PyTorch 1.7. The initial learning rates of the SR and detection models were 0.0001 and 0.01, respectively. Both the SR and detection models were trained for 100 epochs. The prediction times of SR and detection were both 100. The kernels of the SRCNN were 1 × 1, 5 × 5 and 9 × 9. The number of residual block layers for the generator in SRGAN was 16, and the weights of the loss function for SRGAN were 1, 1 and 1.
The Adam optimizer was applied with a momentum of 0.9, and the batch size was 32. The training and test datasets were collected by CSCEC-2020Z-10.

Metrics
The structural similarity (SSIM) and the peak signal-to-noise ratio (PSNR) are used to measure the quality of the reconstructed images. SSIM measures how well the structural information of the original image is preserved in the SR reconstructed image, and PSNR measures the pixel-level error between the original image and the SR reconstructed image.
PSNR is given by

$$PSNR = 10\log_{10}\left(\frac{255^2}{MSE}\right),$$

where MSE is the mean square error between the two images. SSIM is defined as

$$SSIM(I_0, I_1) = \frac{(2 m_0 m_1 + c_1)(2\sigma_{01} + c_2)}{(m_0^2 + m_1^2 + c_1)(\sigma_0^2 + \sigma_1^2 + c_2)},$$

where $I_0$ and $I_1$ are the original and the reconstructed high-resolution images, $m$ is the mean, $\sigma^2$ is the variance and $\sigma_{01}$ is the covariance of the two images. $c_1$ and $c_2$ are both constants. In this paper, $c_1$ is set as $(0.01 \times 255)^2$, and $c_2$ is set as $(0.03 \times 255)^2$.
Precision (P), Recall (R) and Average Precision (AP) are used to measure the detection tasks. Precision describes the ratio of correctly predicted positive examples to all predicted positive examples, and it is calculated by

$$P = \frac{TP}{TP + FP},$$

where P, TP and FP indicate the precision, true positives and false positives, respectively.
Recall describes how many of the positive samples were detected in the prediction, and it is given by

$$R = \frac{TP}{TP + FN},$$

where FN indicates false negatives. Average Precision synthesizes P and R. It is calculated from the area under the precision-recall curve and can be given by

$$AP = \int_0^1 P(R)\,dR.$$

TN indicates true negatives, which do not enter these three metrics.
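The five metrics can be sketched in a few lines. Two simplifications are assumed. SSIM is computed from whole-image statistics rather than over sliding windows, and AP is computed as the plain trapezoidal area under a precision-recall curve. Detection toolkits, including YOLOv5, typically apply extra interpolation before integrating. Images are flat lists of 8-bit pixel values, and all function names are hypothetical.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)


def ssim_global(original, reconstructed, max_value=255.0):
    """SSIM from whole-image statistics (no sliding window)."""
    n = len(original)
    m0, m1 = sum(original) / n, sum(reconstructed) / n
    var0 = sum((a - m0) ** 2 for a in original) / n
    var1 = sum((b - m1) ** 2 for b in reconstructed) / n
    cov = sum((a - m0) * (b - m1) for a, b in zip(original, reconstructed)) / n
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    return ((2 * m0 * m1 + c1) * (2 * cov + c2)) / (
        (m0 ** 2 + m1 ** 2 + c1) * (var0 + var1 + c2)
    )


def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)


def average_precision(recalls, precisions):
    """Trapezoidal area under a precision-recall curve sorted by increasing recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area
```

A reconstructed image identical to the original gives an infinite PSNR and an SSIM of 1, which is a convenient check before running on real data.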

SR Reconstruction Experiments
The SR reconstruction model is trained on approximately 80% of the 13,000 original images, validated on about 10% of the images and tested on the remaining 10%. Examples of the input images are shown in Figure 8. The right three subfigures show the low-resolution input images, and the left three subfigures are the output images. The different super-resolution reconstruction structures are shown in Table 1, and the experimental results are shown in Table 2. Examples of the results are shown in Figure 9. It can be observed that the PSNR of the proposed method improved by 16.22% compared with SRCNN and by 5.36% compared with SRGAN. This means that the reconstructed images using the proposed method are closer to the original images. The SSIM of the proposed method improved by 9.76% compared with SRCNN and by 4.27% compared with SRGAN. These results mean that the proposed method can extract more image-structural information for human eyes.
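The 80/10/10 partition can be sketched as a seeded shuffle followed by slicing. The fixed seed and the function name `split_dataset` are assumptions for illustration; how the images were actually assigned to the three subsets is not stated.

```python
import random


def split_dataset(items, train_fraction=0.8, val_fraction=0.1, seed=0):
    """Shuffle reproducibly and split into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_val = round(len(shuffled) * val_fraction)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

For 13,000 images this gives 10,400 training, 1,300 validation and 1,300 test images, with no image shared between subsets.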

Safety Helmet Detection Experiments
To verify the advantage of the proposed model in safety helmet detection, we compare its Precision, Recall and AP with those of other models. Each SR reconstruction method is paired with both the improved YOLOv5 and the original YOLOv5. The images are first processed by the different SR reconstruction methods and then by the different YOLO methods.
From Table 3, it can be observed that the proposed model obtains a larger value of Precision than the other models. This means that the proposed model has a better ability to identify safety helmets. It can also be observed that the proposed model obtains the largest value of Recall among the models. This means the proposed model's ability to find all the safety helmets is the best. The AP value of the proposed model improved by 25.96% compared with SRCNN+YOLOv5 and by 11.10% compared with SRGAN+YOLOv5. Furthermore, when both use the Dual-Channel Residual SR reconstruction module, the AP value of the improved YOLOv5 is approximately 0.64% higher than that of YOLOv5. This suggests that the LSandGlass module can realize better detection results than the Res module. Moreover, the original YOLOv5 is more affected by image resolution. The proposed SR reconstruction module with the improved YOLOv5 improved by 23.07% compared with SRCNN with the improved YOLOv5 and by 9.25% compared with SRGAN with the improved YOLOv5. The proposed SR reconstruction module with the original YOLOv5 improved by 25.16% compared with SRCNN with the original YOLOv5 and by 10.39% compared with SRGAN with the original YOLOv5. These results show that the improved YOLOv5 has better robustness in detection tasks.
As shown in Figure 10, when the features of safety helmets are obvious in the image, the proposed model recognizes the safety helmets very well. The results in (b), (d) and (f) are obtained by the same detection method and different SR reconstruction methods. The number of valid detection boxes in (b) is much greater than that in (d) and (f). The Precision in (b) is also larger than those in (d) and (f). These results mean that the proposed SR reconstruction method obtains better performance than SRCNN and SRGAN. Comparing (a) with (b), (c) with (d) and (e) with (f), the valid detection boxes in (b), (d) and (f) are more numerous than those in (a), (c) and (e). These results show that the improved YOLOv5 achieves a higher recognition ratio of safety helmets; furthermore, the proposed method can obtain more accurate positioning and higher recognition precision for safety helmet detection. This means that the improved YOLOv5 is superior to the original YOLOv5. Overall, the results indicate that the proposed model can achieve better detection results than other models.
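The percentage comparisons in this section can be reproduced with a one-line helper, assuming they are relative improvements over the baseline value rather than differences in percentage points. The function name and the numbers in the usage note are hypothetical and are not taken from Table 3.

```python
def relative_improvement(new_value, baseline):
    """Relative gain of new_value over baseline, in percent."""
    return (new_value - baseline) / baseline * 100.0
```

For example, a hypothetical metric rising from 50.0 to 55.0 corresponds to a relative improvement of 10%, although the absolute gain is 5 points.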

Discussion and Conclusions
A novel safety helmet detection model is presented to implement super-resolution reconstruction-driven safety helmet detection. At construction sites, the images collected need to be transmitted to the terminal for detection. The resolution of the images is lowered to make transmission faster, which can reduce the detection accuracy. A novel detection model is proposed to overcome this problem. It consists of two modules. First, the SR reconstruction module is used to improve the image quality. Then, a novel YOLOv5 module is used as the detection module to complete the helmet detection. The two modules are trained separately but tested together on the proposed datasets. The experimental results show that the proposed SR module can increase the PSNR value while maintaining a consistent SSIM value compared with some existing SR reconstruction methods. This demonstrates the superiority of the proposed model. Based on the current results, the proposed model is a feasible tool for safety helmet detection. It can be easily used in construction monitoring or traffic safety monitoring. This paper mainly uses individual models on specific tasks and combines the models to achieve the whole task. In the future, we will pursue an integrated design of SR reconstruction and YOLOv5 to reduce design redundancy. At the same time, we will implement a lightweight model and improve the computational efficiency. In addition, we will consider the noise in images coming from industrial sites in future research.