Personal Protective Equipment Detection: A Deep-Learning-Based Sustainable Approach

Abstract: Personal protective equipment (PPE) increases worker safety by reducing the probability and severity of injuries and fatal incidents at construction, chemical, and other hazardous sites. PPE is widely required to provide a satisfactory level of safety, not only against accidents at such sites but also against chemical hazards. However, whether through negligence or for other reasons, workers occasionally fail to comply with the regulations on wearing the equipment. Since manual monitoring is laborious and error-prone, the situation demands intelligent monitoring systems that offer automated, real-time, and accurate detection of PPE compliance. As a solution, this study investigates Deep Learning and Computer Vision to offer near real-time and accurate PPE detection. The four colored hardhats, vest, safety glass (CHVG) dataset was used to train and evaluate the proposed model. The solution can detect eight classes, namely red, blue, white, and yellow helmets, head, person, vest, and glass. A two-stage detector based on the Fast-Region-based Convolutional Neural Network (RCNN) was trained on 1699 annotated images. The proposed model achieved an acceptable mean average precision (mAP) of 96% compared with state-of-the-art studies in the literature. The proposed study is a potential contribution towards the avoidance and prevention of fatal and non-fatal industrial incidents by means of real-time PPE detection.


Introduction
Construction sites are, most of the time, dangerous workplaces where workers may get injured or fall. According to Kang et al. [1], 70% of the fall incidents in 2017 were due to workers not wearing proper personal protective equipment (PPE) at the sites. The study comprehensively reviewed the incidents that occurred in the USA alone over one and a half decades, as listed in the Occupational Safety and Health Administration (OSHA) database. In total, 20,997 incidents were reported in the construction industry alone. The incidents were grouped into four categories: fall, struck by, caught in or between, and electrocution. Falls were especially significant: 80% of them occurred from building heights of less than 30 feet, and only 11% of the fall victims were wearing proper PPE. In the following year, it was found that 85% of the inspected fatal cases were caused by not wearing PPE [2]. These statistics are alarming, and a proper detection system is needed for the timely detection and possible prevention of such fatal incidents. In the Kingdom of Saudi Arabia in particular, construction, petroleum, and similar industries are very common, and hence the need for PPE is significant.
Nowadays, appreciable efforts are being made to enhance the safety of workers, which also has a large positive impact on the reputation of construction agencies, since wearing PPE could reduce the chances of falling by 30% [3]. One of these efforts is the development of monitoring systems for PPE wear during working hours [4]. However, some workers may temporarily not wear the provided PPE for various reasons, or may refuse to use the safety equipment because of a lack of awareness, which could result in fatal and non-fatal injuries. According to [4], there are several factors behind PPE noncompliance. The authors conducted a comprehensive study to investigate those factors using fuzzy k-means clustering. As an outcome, they found sixteen factors in general. The top-ranked factors are inadequate safety supervision, poor risk perception, lack of climate adaptation, lack of safety training, and lack of management support.
In general, the capabilities of Artificial Intelligence (AI) have been exploited to introduce automated, advanced, and cost-effective monitoring systems that can recognize PPE and workers and decide whether a worker is complying with the safety regulations, thereby supporting the employee's safety. Specifically, machines can detect and classify objects by adopting the technologies of Computer Vision and Deep Learning (DL) [5]. DL is a subfield of machine learning (ML) that mimics the human brain, and one of its notable capabilities is its ability to learn and improve by itself. The power of computer vision originates from Convolutional Neural Networks (CNNs), which automatically perform feature extraction for the targeted objects. To support the model with decision-making abilities, the extracted features are fed into the DL model. Moreover, transfer learning can help raise the reliability of the solution by applying the knowledge of previously trained models to related problems [5]. There are two main categories of detectors. One-stage detectors perform both the localization and the recognition of the desired objects in the same phase, so near real-time detection can be achieved. The YOLO (You Only Look Once) family is a well-known one-stage detector. Meanwhile, the Region-based Convolutional Neural Network (RCNN) family is a popular two-stage detector that achieves accurate and reliable results by executing the object detection procedure in two phases. Object localization takes place in the first phase, which proposes regions that are likely to contain objects. In the second phase, objects are identified from the features extracted from the localized regions. It is worth mentioning that both types of detectors aim to detect the targeted objects, but there is a noticeable trade-off between real-time detection and accuracy [6].
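The two-phase control flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 6 × 6 integer "image", the grid-cell proposal rule, and the pixel-counting classifier are hypothetical stand-ins for a learned region proposal network and a learned CNN classifier, and they are not the model used in this study.

```python
from collections import Counter

# Toy "image": 0 = background, 1 = helmet pixels, 2 = vest pixels (hypothetical encoding).
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]
LABELS = {1: "helmet", 2: "vest"}


def crop(image, box):
    """Return the pixels inside box = (x1, y1, x2, y2), with x2 and y2 exclusive."""
    x1, y1, x2, y2 = box
    return [value for row in image[y1:y2] for value in row[x1:x2]]


def propose_regions(image, cell=3):
    """Phase 1 (localization): keep grid cells that contain any non-background pixel."""
    height, width = len(image), len(image[0])
    boxes = []
    for y in range(0, height, cell):
        for x in range(0, width, cell):
            box = (x, y, min(x + cell, width), min(y + cell, height))
            if any(crop(image, box)):
                boxes.append(box)
    return boxes


def classify_region(image, box):
    """Phase 2 (recognition): label a region by its most common object pixel."""
    pixels = crop(image, box)
    counts = Counter(value for value in pixels if value != 0)
    value, count = counts.most_common(1)[0]
    return LABELS[value], count / len(pixels)


def two_stage_detect(image, min_score=0.3):
    """Run phase 1, then phase 2 on every proposed region."""
    detections = []
    for box in propose_regions(image):
        label, score = classify_region(image, box)
        if score >= min_score:
            detections.append((box, label, round(score, 2)))
    return detections


print(two_stage_detect(IMAGE))
```

A one-stage detector would instead predict boxes and class labels in a single pass over the image, which is where its speed advantage comes from.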
In the current context, PPE detection is a major and challenging problem in industry, and its accurate and real-time detection can play an important role in worker safety [5]. PPE detection is especially crucial in the Kingdom of Saudi Arabia, which is rich in several types of industries. The current study can be a potential solution to the issues mentioned in [5] in the sense that management can be notified about PPE non-compliance, and reasonable actions can be taken to avoid fatal or non-fatal incidents.
The rest of the paper is organized as follows: Section 2 presents a comprehensive review of the literature. Section 3 presents the dataset description, and the methodology is presented in Section 4. Section 5 contains the results and discussion, while Section 6 concludes the paper.

Literature Review
With an emphasis on the identification of Personal Protective Equipment (PPE) using computer vision techniques, this literature review offers a thorough examination of investigations carried out during the previous five years. The reviewed papers are arranged according to how well they contribute to our understanding of PPE detection in the manufacturing sector. The chosen papers present important developments in PPE detection, such as the application of deep learning algorithms, transfer learning models, and optimization methods. By analyzing these studies, we intend to identify the gaps in the body of knowledge and lay the groundwork for the creation of a PPE-detection system that is both efficient and specific to the manufacturing industry. This is potentially beneficial in terms of the prevention and avoidance of incidents.
In a study by Chen and Demachi [7], a vision-based approach for monitoring PPE in a nuclear power station was proposed. The experiment was conducted on an annotated dataset of 3808 images, which were collected from the real world using a webcam and from the web using a web crawler tool. A different dataset, also gathered manually, was used in the testing phase. Two objects were targeted to be localized and classified by the detection model: hard hat and full-face mask. In addition, the distance and the posture of the workers contributed to achieving the aim of the study. The one-stage detection model YOLOv3 was trained on the combined dataset in two stages: the first stage froze the last convolutional layer in Darknet-53, and the second stage unfroze all the convolutional layers to carry out the fine-tuning process. Moreover, different learning rates were used for the two training stages, and the Adam optimizer and a batch size of 8 were adopted to achieve reliable accuracy. The developed model could detect whether workers wear proper PPE or not with a precision of 97.64% and a recall of 93.11%, at a speed of 7.96 frames per second (FPS). In the future, the authors intend to augment the dataset to increase the accuracy and to deploy the model on the on-site surveillance system.
A study by Delhi et al. [8] exploited deep learning, specifically computer vision, to ensure a maximum level of safety at construction sites by detecting PPE in real time. The study was conducted on a dataset that the researchers collected manually and through web scraping. About 2500 images were used to train the model with four target classes: NOT SAFE, SAFE, NoHardHat, and NoJacket. YOLOv3 was trained on the data samples after an augmentation step consisting of flipping and left and right rotation by 30 degrees, which improves the robustness and generalization of the model. The resulting dataset was randomly split into 90%, 8%, and 2% for training, validation, and testing, respectively. The model achieved an mAP, recall, and F1-score of 97% on the test data.
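As an illustration of the random 90%/8%/2% split described above, the following is a minimal sketch. The function name, the fixed seed, and the use of integer indices in place of image files are assumptions made for the example; the exact procedure used in [8] is not specified here.

```python
import random


def split_dataset(items, fractions=(0.90, 0.08, 0.02), seed=0):
    """Shuffle items and cut them into train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = round(len(items) * fractions[0])
    n_val = round(len(items) * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical stand-in for about 2500 image files.
train, val, test = split_dataset(range(2500))
print(len(train), len(val), len(test))
```

With 2500 items, a 2% test share amounts to only 50 images, which is worth keeping in mind when interpreting metrics reported on such a split.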
In Wang et al.'s study [9], the authors addressed the issue of workers' safety by using deep neural networks adapted for real-time object detection in order to ensure workers' compliance with safety measures. Accordingly, they proposed three deep-learning detectors based on the YOLO architectures YOLOv3, YOLOv4, and YOLOv5 for six classes. The data used comprised a high-quality dataset, CHV, which contained 1330 images [10] divided into six classes: helmets in four colors, person, and vest. As the experimental results demonstrated, YOLOv5x achieved the best mAP of 86.55%, while YOLOv5s achieved the fastest GPU performance of 52 FPS. A study by Hayat and Morgado-Dias [11] presents a deep learning technique to detect construction site workers' heads and helmets in real time. Several versions of the popular deep learning architecture YOLO were explored in this paper, namely YOLOv3, YOLOv4, and YOLOv5x. The authors used the public dataset by MakeML [12] for the implementation of the model, where 3000 instances were used for training and 1000 instances for testing. Only the "Head" and "Helmet" classes were used in this study. For image pre-processing, a power-law transformation [13] was applied to the images to improve the lighting and contrast issues in the data. The best performing model was YOLOv5x, with 92%, 92.4%, 89.2%, and 90.8% for the accuracy, precision, recall, and F1-score, respectively.
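The F1-score is the harmonic mean of precision and recall, so the figures reported for YOLOv5x in [11] can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard definition of F1; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for YOLOv5x in [11].
f1 = f1_score(0.924, 0.892)
print(f"{100 * f1:.1f}%")  # 90.8%
```

The computed value agrees with the reported F1-score of 90.8%.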
Ma et al. [14] proposed a combination detection algorithm for PPE using the portable YOLOv4 model. The dataset was made up of about 25,000 samples taken from security footage of a building construction site. The data were divided unevenly into six classes and separated into a training set and a test set. Two algorithms, YOLOv4 and YOLOv4-Tiny, were used. Pruning the model with the original dataset was the fine-tuning and optimization strategy used to increase accuracy. The best outcome was obtained with CLSlim YOLOv4, which had an mAP loss of only 2.1%, increased the inference speed by 1.8 times, and compressed the model parameters by 98.2%. The major conclusions emphasized the efficiency of the channel and layer pruning method (CLSlim) in lowering computational power usage and enhancing the detection speed. Further work is advised to investigate merging CLSlim with other lightweight strategies to speed up model inference even more and to find better lightweight model techniques for mobile terminals with constrained resources. In [15], Gallo et al. proposed a system to detect PPE in unsafe industrial areas, implemented by analyzing a video stream using deep neural networks. To detect whether workers are wearing the equipment or not, five models were trained: YOLOv4, YOLOv4-Tiny, SSD MobileNet, CenterNet, and EfficientDet. Three datasets were used: the first is a publicly available dataset [16], which contains 7035 images; the other two were collected in controlled settings and contained 215 and 236 images, respectively. YOLOv4-Tiny, which achieved an mAP of 86%, was deployed in the system due to its detection speed.
Nath, Behzadan, and Paal proposed intelligence-based solutions to address construction fatalities caused by brain injuries and collisions [17]. The study proposed three DL approaches built on the YOLO architecture to verify PPE compliance. The first approach consists of detecting workers, their hats, and their vests, and then using a machine learning model (e.g., neural networks and decision trees) to verify whether each worker is appropriately wearing his or her hat or vest. In the second approach, a convolutional neural network (CNN) framework is used to detect individual workers and verify PPE compliance at the same time. The third approach consists of first detecting only the workers in the input image. The workers are then cropped and classified by CNN-based classifiers, for example, VGG-16, ResNet-50, and Xception, based on the presence of personal protective equipment. The models are trained on an in-house image dataset that was collected through crowdsourcing and web mining. The dataset, named Pictor-v3, contains 1500 annotated images and 4700 instances of workers wearing varying combinations of personal protective equipment [18]. According to the analysis, the second approach produces the best results in real-world environments, with an mAP of 72.3%. On a laptop computer, it is capable of processing 11 FPS, which enables it to perform real-time detection, as well as run on lightweight mobile devices.
Another study by Torrse et al. proposed a new cognitive safety analysis component for a monitoring system [19]. In this study, the proposed system is employed to detect the proper use of personal protective equipment in real time based on data captured by CCTV cameras. A deep learning algorithm was used to detect objects in the system. The resulting YOLOv4-based system achieved an mAP of 80.19% at a real-time frame rate of 80 FPS.
In a study by Benedetto et al. [20], the detection of safety equipment was implemented using computer-generated images of individuals with protective equipment, such as ear protection, a mask, a helmet, and a vest, and of individuals without the protective equipment. The paper employed transfer learning techniques with YOLO and Faster R-CNN, where both algorithms were pre-trained on Microsoft's COCO dataset. Furthermore, three variations of fine-tuning were tested: the pre-trained models were fine-tuned with synthetic images only, real images only, or both synthetic and real images. The synthetic data used in this paper were created by the authors using Rage Plugin Hook [21], which works as an API for the Grand Theft Auto V (GTA-V) computer game to generate specific realistic images from the game. The total number of virtual images collected was 126,900 for training and 350 for testing. The results of the experiments verified that the models achieved higher scores when tested on virtual data but scored lower on real data. The highest performing model on real data was Faster R-CNN fine-tuned on both synthetic and real images, which scored an overall mAP of 77.1%, with the highest-scoring classes being "High-Visibility Vest" and "Person", with mAPs of 81.5% and 86.6%, respectively.
Isailovic et al. [22] addressed the detection of head-mounted protection gear by extracting the Region of Interest (ROI) around the head and experimenting with several pre-trained computer vision architectures, such as YOLOv5, MobileNetv2 SSD, and Faster R-CNN. The dataset contains a total of 12,682 images with 12 categories of personal protective equipment. It was collected by the authors from the University of Belgrade's Faculty of Medicine in Serbia and from public sources, such as Roboflow [15] and Pictor PPE [17]. Among all the models tested, YOLOv5 achieved the highest scores, with 92% precision and 61.1% recall; the "Hardhats" category was the best in the YOLOv5 model, with 100% precision and 96% recall.
Cengil et al. [23] focused on the detection of hard helmets using a one-stage object detector algorithm improved by the authors. YOLOv5 was used as the base architecture for the model, with the main enhancement being in the feature extraction step. The performance of ShuffleNetv2 and MobileNetv3 as the backbone of the architecture was compared to find the model that increased the efficiency and speed of the network. The study classifies three objects in the image, "Helmet", "Head", and "Person", which are included in Roboflow's "Hard Hat" dataset [16]. This includes a total of 7041 images that were split into training, test, and validation data. The YOLOv5 model with ShuffleNetv2 as the backbone had a higher precision than the one with MobileNetv3, at 94.2%, but both backbone models achieved the same recall of 91%. The authors believe that the creation and study of comprehensive datasets with more classes will be beneficial for future work. Kisaezehra et al. [24] tackled the issue of construction site safety by using a deep neural network to detect whether a worker is wearing a hat. YOLOv5 was used with multiple network sizes (nano, small, medium, large, and extra-large) to find the model best suited to the objective of the study, which was to increase the detection accuracy and speed for helmets and non-helmets. The dataset was gathered by Northeastern University in China and is available in Harvard Dataverse [25]; it includes 7063 images of construction workers with and without helmets in multiple locations and poses. The results of the experiments showed a clear improvement in the scores when using the YOLOv5 extra-large network (YOLOv5x), as it scored an mAP50 of 95.8%, a precision of 93.9%, a recall of 91.2%, and an F1-score of 92.5%. However, when comparing the speed of the models, the YOLOv5 nano network (YOLOv5n) was the fastest for both image and video detection.
The objective of the research conducted by Lo, Lin, and Hung [26] was to create deep learning algorithms for the real-time detection of PPE compliance. A large collection of 11,000 photos made up the dataset used for training and evaluation. The study used the YOLOv3, YOLOv4, and YOLOv7 algorithms. Data pre-processing techniques, including data augmentation such as flipping, cropping, noise injection, and color space transformations, were used to address the challenges of limited data and overfitting. The YOLOv7 model produced the best results, outperforming the others in terms of precision and detection speed, with an mAP of 97.29% at 25.02 FPS. The primary conclusions of the study focused on the creation of a deep learning system with an excellent accuracy of 97.5% for real-time PPE compliance detection. By guaranteeing proper PPE usage, this algorithm has the potential to improve workplace safety. Further, the authors advise broadening the PPE dataset to include a variety of situations, improving the detection of overlapping items, increasing the number of negative training examples, improving data collection techniques, and covering additional necessary PPE types. Despite the use of data augmentation techniques, it should be noted that most of the photos in the collection were taken under high brightness and average weather conditions, which may limit the dataset's effectiveness.
A study by Lee, Yeo-Reum, et al. [27] proposed a platform based on computer vision to monitor the proper use of PPE on construction sites. A collection of images with a resolution of 1280 × 720 was obtained from multiple sources, such as Google Images, surveillance cameras, and smartphones at the construction sites. Then, the dataset with the three target categories Person, Hardhat, and Safety vest was split into 1031 samples for training and 257 samples for testing. The dataset was used to train the state-of-the-art pixel-based PPE detection model, YOLACT. In addition, DeepSORT is part of the platform to enable object tracking. As a result of the study, the YOLACT model obtained an mAP50 of 66.4, while the DeepSORT algorithm achieved an accuracy of 91.3% in determining the worker's wearing status. In future work, the authors will extend the ability of the model to identify possible hazard situations by finding relationships between the construction site and the workers. Ferdous and Ahsan [28] identified and discussed the issues of workers' safety at the construction site and introduced an approach that adopted some of the capabilities of computer vision. To attain the goal of providing the highest level of protection, the YOLOX-m model was trained to achieve the automatic detection of various PPE types. A novel dataset with 1699 images called CHVG was used to teach the model how to recognize and differentiate between eight classes, namely, four colored hardhats, vests, safety glasses, person body, and person head. The dataset's instances were augmented by applying geometric changes, such as rotation by −10 to 10 degrees, scaling, and translation, and photometric changes, such as adjusting the brightness, HSV, and contrast. To enable the detection algorithm to perform under conditions such as haze, rain, and low light, new images were generated by manually written algorithms that mimic these environmental conditions. The YOLOX-m model outperformed the other algorithms by achieving an mAP of 89.84%.
A study by Ke et al. [29] aimed to improve safety by introducing a deep learning real-time monitoring system for PPE. The YOLOv5 object detection algorithm was adopted to attain the target of the study at a satisfactory speed. The proposed detection model was trained on samples from the FZU-PPE dataset, which consists of 18,767 annotated images covering four different types of PPE, specifically helmets, masks, and safety wear. To perform the experiments, the dataset was divided into a training set with 13,334 images and a testing set with 5433 images. Moreover, the trained algorithm was optimized to decrease the computational resources by 32% and the training parameters by 25%. As a result, the algorithm attained a detection speed of 105 FPS and an mAP of 84.2%. Future improvements include enhancing the FZU-PPE dataset by adding images captured from real construction sites, as well as increasing the detection accuracy of the developed model. Saudi et al. [30] aimed to ensure the safety of construction workers by detecting three types of PPE: boots, hardhats, and vests; if all three types were detected, the worker would be labeled "safe". To accomplish this goal, the region-based Convolutional Neural Network (R-CNN) architecture was used, which provides more accurate results through two-stage detection. The authors trained this model with data taken from the MIT database [31], which includes 1129 construction site images, and evaluated the model with 333 images that they collected themselves. The final model achieved an accuracy of 70%, and the authors expect to improve upon this result in the future by using image processing techniques, such as image resizing and sharpening, along with applying momentum optimization during training.
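The labeling rule described for [30], in which a worker is "safe" only if boots, a hardhat, and a vest are all detected, can be sketched as a simple set comparison. The class names, the "not safe" label, and the function name below are assumptions made for illustration and are not taken from the original implementation.

```python
REQUIRED_PPE = frozenset({"boots", "hardhat", "vest"})  # assumed class names


def label_worker(detected_classes, required=REQUIRED_PPE):
    """Return ("safe", []) if every required item was detected, else ("not safe", missing items)."""
    missing = sorted(required - set(detected_classes))
    return ("safe", []) if not missing else ("not safe", missing)


print(label_worker(["hardhat", "vest", "boots"]))  # ('safe', [])
print(label_worker(["hardhat", "vest"]))           # ('not safe', ['boots'])
```

Returning the missing items alongside the label makes it straightforward to report which piece of equipment triggered a non-compliance notification.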
Mneymneh et al. [32] applied object detection techniques to detect whether construction workers are wearing safety hardhats or not, in order to prevent accidents at industrial sites. Three methods were used: a feature extraction and matching method, a template matching method, and a cascade classification method based on Histogram of Oriented Gradients (HOG) features, Local Binary Pattern (LBP) features, and Haar-like features. A total of 239 images acquired from construction areas were used for training, with the hardhat present in 75 images and absent in the remaining 164 images. The cascade classifier proved to be the best suited for different scenarios and for real-time detection. Ji et al. [33] introduced a model called RFA-YOLO, which combined residual feature augmentation (RFA) and YOLOv4 for detecting the PPE of offshore drilling platform workers. The detection process of the proposed solution was executed in three stages. The first stage localizes and classifies the person, helmet, and workwear using the RFA-YOLO model. The resulting bounding boxes are passed to the second stage, which performs the position feature construction. The third stage determines whether the identified helmet or workwear is being worn by the person or not. To train and test the model, the Offshore Drilling Platform Dataset (ODPD) was used. This dataset consists of three parts: the annotated Object Detection Dataset (ODD) with 10,000 images for three classes, the Feature Classification Dataset (FCD), which contains 6600 samples, and the labeled PPE Dataset (PPED), which has 2000 images for four classes. The approach achieved an accuracy of 93.1% at a speed of 13 FPS.
Karlsson et al. [34] aimed to develop a facial recognition and PPE-detection system that could be applied at entry points to restricted locations to make sure that only individuals who are properly equipped are permitted access. The intervention entailed testing a camera-based system that could identify PPE usage. The effectiveness of PPE detection at distances of 3 and 5 m was evaluated. The dataset used for training and evaluation was collected from the Kaggle website [35]. For PPE detection, the Faster R-CNN algorithm was used. Data augmentation through adjustments to image brightness, contrast, blur, and sharpness was used in fine-tuning and optimization. Under controlled settings, such as an airlock chamber, the best result was an mAP of 99% at 3 m and 89% at 5 m. The key conclusions focused on the successful creation of a system that can precisely identify PPE use by employing a camera, especially in collaborative settings. To overcome issues with small items or occlusions caused by body parts, data augmentation and body cropping were adopted. For future work, it was suggested to put the system in place at entry control points where workers must present themselves to enter restricted areas.
Xiong and Tang [36] implemented a framework to detect PPE compliance using position-guided anchoring. The Personal Protective Equipment Dataset (CPPE), which was created by the authors, was used to carry out the study; it contains 932 images with 9428 worker instances. First, the position detector identifies workers in a workplace environment, and then the anchors of worker positions focus the algorithm on the intended body regions to detect PPE. Then, two CNN-based classification algorithms were trained to determine whether the regions contained hardhats or vests. The proposed method achieved an F-score of 97% for hardhat detection and an F-score of 95% for vest detection. Computational complexity is a limitation in this study, since the networks can be simplified to perform a narrowed detection of a specific PPE object. Wang et al. [37] provided a method to prevent construction safety issues by tracking worker locations, checking for equipment, and predicting potential hazards using computer vision. The dataset consists of images extracted from surveillance videos of various construction workplaces. The model used to track the workers and the gear is R-CNN; a second CNN-based model then determines their routes. Then, the relationship between the gear and the workers is analyzed to decide whether there is a threat and whether an alarm should be generated. The model achieved a 92% mAP score and a 95% AP score in detecting both workers and gear, and the analysis of the workers' safety position achieved an 87% precision score.
In a study [38], Wang et al. proposed a CNN-based detection method to determine whether workers are wearing hardhats or not and to alert them. For the proposed technique, MobileNet was used as the backbone for real-time detection, and a top-down unit was used to improve the extraction of features. In addition, a residual block was included to improve the classification. The dataset [24] used in the study was developed by the authors; it includes workers on construction sites with and without hardhats. The proposed method was successful compared to other object-detection models, and it achieved an AP of 87% for the hardhat negative instances and an AP of 89% for the positive instances, while running at 62 FPS. Li et al. [39] developed a real-time method to detect hardhats in construction areas, to prevent working in dangerous environments. The dataset was obtained from the construction areas' surveillance system and via a web crawler, and it consists of 3261 images. The algorithm used for the detection of hardhats was SSD-MobileNet, which resulted in 95% precision and 77% recall. The limitation of the study was the low accuracy caused by the quality of the dataset, since the images had complex backgrounds and the hardhats were obscure in some of them, for perambulatory employees in power substations.
Li et al. [40] developed a safety-helmet-detection framework that makes use of computer vision, machine learning, and image-processing techniques. The objective was to create a workable system for detecting safety helmet usage. The dataset contained pedestrian information from power substations, as well as the INRIA person dataset. Moving object segmentation was handled by the ViBe algorithm, while precise human detection was handled using the C4 pedestrian detection method and a cascade classifier. The analysis of ten videos revealed a mean pedestrian classification accuracy of about 84.2%. The proposed method outperformed existing approaches, with an area under the curve (AUC) of 94.13% compared to the method employing HOG feature extraction and an SVM classifier, which had an AUC of 89.20%. Based on head location, color space transformation, and the differentiation of color features, the suggested framework demonstrated promise in accurately identifying safety helmet use. Future work will target strengthening the system's robustness and practicality by expanding it to handle complex scenarios with numerous pedestrians and difficult surroundings. Vukicevic et al. [41] proposed a solution to detect PPE in industrial facilities. Several models were proposed, including MobileNetV2, DenseNet, and ResNet, which had previously been trained on the ImageNet dataset. In terms of performance, all models achieved similar results, but MobileNetV2 was the best choice due to its lower computational requirements. The dataset used in this solution was a combination of web-mined images and public PPE datasets.
An approach proposed by Le and Si aims to ensure worker safety and mitigate the risks of dangerous accidents [42]. To monitor workers' use of personal protective equipment, they developed a fully automated vision-based system. In the first stage, PPE is detected and, in the second, faces are detected and recognized. The primary purpose of PPE detection is to detect the presence of the required PPE, whereas the purpose of face detection and recognition is to determine the identity of the workers. For six major PPE types, the detection accuracy is up to 98%, while for face detection and recognition, it is up to 96%. The results demonstrated that the system can detect personal protective equipment and recognize faces in real time with high precision and recall. The researchers plan to expand the dataset in the future by adding equipment in different conditions. The code will also be optimized to increase the speed of PPE and face detection.
A system was developed by Maior et al. for automatically detecting PPE in industrial sites by utilizing computer vision and machine learning [43]. The dataset consisted of 731 helmet photos from ImageNet, and the YOLOv2 algorithm was used. The major conclusions showed the potential of computer vision and machine learning for identifying PPE usage through real-time video streaming. The study stressed the significance of this technology in maintaining PPE compliance, avoiding accidents, and fostering a safer atmosphere. Suggested future work includes expanding the model to detect different PPE kinds simultaneously, improving the alarm system, and using real surveillance footage for more precise and accurate PPE monitoring in industrial settings.
Vibhuti et al. [44] created a transfer learning model-based automated method to detect people who are not wearing masks in public settings during the COVID-19 pandemic. Many deep learning models were used in the intervention, including InceptionV3, Xception, MobileNet, MobileNetV2, VGG16, and ResNet50. The Simulated Masked Face Dataset (SMFD) [45] was the dataset used for testing and training. The pre-trained InceptionV3 model was optimized using the fine-tuning technique. Using the SMFD dataset [46], the best outcomes were an accuracy and specificity of 100% for testing and 99.92% during training. The major conclusions emphasized the efficiency of the suggested transfer learning model in automating non-mask wearer identification with high accuracy. According to future work suggestions, larger datasets should be used, the system should be expanded to categorize different types of masks, and a facial recognition system should be put in place to support human identification while wearing masks.
Table 1 contains a comprehensive summary of the literature review, including the techniques used, the type and source of each dataset, and the major achievements of the studies.

Gap Analysis
The body of research on PPE identification using computer vision methods offers insightful knowledge and advances in the field. Innovative methods to automate the detection and monitoring of PPE compliance in diverse situations have been proposed in several studies. In [24], for instance, a safety helmet identification framework based on computer vision and machine learning was proposed, and its effectiveness in determining whether workers on foot are wearing safety helmets was shown. It achieved promising outcomes by accurately identifying workers using the YOLOv5 architecture. Several studies revealed that there is a need for improvement in the performance of detection systems based on various parameters. Moreover, several industries still lack PPE-detection procedures to monitor compliance and avoid incidents, and the proposed study can potentially fill this gap. Deep learning algorithms have also been investigated in another area of research for real-time PPE compliance verification. In Lo et al. [26], a system based on the YOLOv3, YOLOv4, and YOLOv7 models was built, and it successfully detected PPE usage in large datasets with a high degree of accuracy. Moreover, initiatives have been undertaken to raise the effectiveness of PPE-detection systems. A lightweight network model based on YOLOv4 was given in Galo et al. [15], which significantly reduced the number of model parameters while maintaining good detection accuracy. In Ma et al. [14], the YOLOv4 network's detection speed was improved and processing power consumption was decreased via channel and layer pruning approaches, which led to notable performance gains.
Even though this research has significantly advanced the field, there are still certain gaps that require filling. One obvious gap is the scant investigation of PPE detection in factory settings, particularly in industries with chemical risks. Most current studies concentrate on generic circumstances, such as construction sites, public locations, and surveillance cameras. However, the needs and difficulties associated with PPE detection in manufacturing settings, which may involve specialized tools, hazardous working environments, and strict safety regulations, have not been thoroughly studied. Consequently, the goal of our work is to close this crucial gap by creating a thorough PPE-detection system that is specifically designed for the industrial sector, with a focus on chemical hazards. For accurately detecting and tracking the use of PPE, such as helmets, safety glasses, gloves, and other pertinent clothing, we will make use of computer vision techniques, machine learning algorithms, and optimization tactics. The ongoing research will consider the unique issues and demands of this industry by concentrating on production situations, such as the presence of machinery, fluctuating lighting conditions, and the necessity for real-time monitoring. By offering functional and efficient solutions for PPE compliance detection in industrial facilities dealing with chemical risks and allied industries, we hope to contribute to the body of current literature. Especially in Saudi Arabia, an industrial country, this ongoing study can potentially contribute towards the kingdom's Vision 2030 goals, where computing technologies can support the industrial revolution and worker wellbeing.

Dataset Description
A state-of-the-art public dataset comprising PPE images was obtained from [46]. The raw data were created under the name CHVG, which consists of eight classes: four different colors of hardhats (white, blue, red, and yellow), a person's head, vest, body (person), and safety glass. The total number of images we obtained after data cleaning and preprocessing was 1189. Table 2 contains the number of instances for each class. Among the 1189 objects in the dataset, there are 40% persons, 18.25% vests, 4.28% glass, 6.05% heads, 10% red, 12.53% yellow, 4.54% blue, and 4.28% white instances. The dataset is divided into three subsets: train, test, and validation. The training set is used in the proposed model's training process, the test set serves as the basis for the evaluation of the model, and the validation set serves as a check to ensure that the training is proceeding as planned. The features used by the visual object-detection model are listed in Table 3. There are eight features, each explained in terms of its use in the model. The description shows that the features represent the bounding-box coordinates, the class, the file name, and the image's dimensions in terms of width and height.
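The three-way division described above can be sketched as a seeded shuffle followed by two cuts. This is only an illustration: the function name `split_dataset`, the default fractions, and the file names are assumptions and are not taken from the study.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    The fractions are illustrative assumptions, not the study's values.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical file names, for illustration only.
image_names = [f"img_{i:04d}.jpg" for i in range(100)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))
```

Splitting by image rather than by object keeps every annotation of one image in the same subset, which avoids leaking near-identical content between training and evaluation.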

Methodology
To develop an accurate CNN-based object-detection model for PPE, the team followed a systematic methodology. A CNN is a specialized neural network designed to process data with a grid-like structure, such as images. Figure 1 illustrates a CNN architecture. The main building blocks of this network are the convolutional layers, which perform algebraic operations to extract the desired features of the target objects [47]. The subsequent steps of the methodology are explained one by one.

Data Preprocessing
After collecting the data from several open sources and public repositories, it was necessary to preprocess and unify them, so that coherent data could be provided to the model. In the current study, since the data mainly comprise images containing various PPE features, image filtering, denoising, scaling, and finally augmentation were applied to produce a homogeneous set of images.
As seen in Figure 2, the process started with collecting the dataset and dividing it into training, validation, and testing sets, with 430 images for training, 172 for validation, and 115 for testing. Next, the data were augmented using the Albumentations library [48] to increase the performance of the model, with a few augmentation techniques, such as image flipping, image scaling, mosaic, and hue-saturation-value (HSV) alteration.
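For object detection, a geometric augmentation such as flipping has to move the bounding boxes together with the pixels. The sketch below shows that geometry in plain Python for a horizontal flip. It is not the Albumentations API; the helper names and the `[xmin, ymin, xmax, ymax]` pixel box format are assumptions made for illustration.

```python
def hflip_image(image):
    """Mirror an image, given as a list of pixel rows, left to right."""
    return [list(reversed(row)) for row in image]


def hflip_boxes(boxes, image_width):
    """Mirror (xmin, ymin, xmax, ymax) boxes about the vertical centre line.

    The old right edge becomes the new left edge, so xmin and xmax swap roles.
    """
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        flipped.append((image_width - xmax, ymin, image_width - xmin, ymax))
    return flipped


# A hypothetical helmet box in a 640-pixel-wide image.
boxes = [(10, 20, 110, 220)]
print(hflip_boxes(boxes, image_width=640))
```

Applying the flip twice returns the original boxes, which is a convenient sanity check for any box-aware augmentation.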

Model Training, Validation, and Evaluation
For the training of the proposed model, both single-stage and two-stage detectors were investigated to find a fitting architecture for PPE detection. Namely, YOLOv5 and Faster RCNN with the ResNet50 backbone were examined in this regard. The YOLOv5 single-stage detector consists of a CSPDarknet53 backbone, a feature pyramid network (FPN), and a YOLOv3 detection head. The two-stage detector Faster RCNN consists of two main components: a region proposal network (RPN) and a region-based detector. The RPN generates candidate object regions by examining image regions at different scales and aspect ratios, and the region-based detector then refines the regions and classifies them. More specifically, in this model a ResNet50 network is used to extract features from the data before passing them to the RPN and to the region-based detector, which uses the Fast RCNN network. This architecture allows Faster RCNN to deliver precise and accurate detection results at better speed than earlier region-based detectors.
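The idea of examining regions "at different scales and aspect ratios" can be illustrated with a minimal anchor-box generator. The scales and ratios below are assumed example values, not settings reported in the study, and `make_anchors` is a hypothetical helper rather than part of any library.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (xmin, ymin, xmax, ymax) centred on (cx, cy).

    Each anchor keeps an area of scale * scale, and ratio = height / width.
    The default scales and ratios are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


# Nine candidate regions around the centre of a 640 x 640 image.
for anchor in make_anchors(320, 320):
    print([round(v, 1) for v in anchor])
```

In a full RPN this set is replicated at every position of the backbone feature map, and the network then scores each anchor and regresses offsets to it.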
Since the number of instances in the dataset is small, models pretrained on the COCO dataset were imported using the PyTorch framework. The images in the dataset were then resized to 640 × 640 pixels to be fed into the network, and the weights were fine-tuned. The models were trained for 20 epochs (the number of times the network iterates over the data) with a batch size of 16.
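Resizing an image to a fixed input size also changes the coordinates of its annotations. A minimal sketch of that bookkeeping follows, assuming a plain stretch to 640 × 640 without padding and boxes stored as `(xmin, ymin, xmax, ymax)` in pixels. Both points are assumptions, since the source does not state how the aspect ratio was handled.

```python
def resize_boxes(boxes, orig_size, new_size=(640, 640)):
    """Rescale pixel boxes when an image is stretched to new_size.

    orig_size and new_size are (width, height) tuples. The image is assumed
    to be stretched independently along x and y, with no letterbox padding.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    scale_x = new_w / orig_w
    scale_y = new_h / orig_h
    return [(xmin * scale_x, ymin * scale_y, xmax * scale_x, ymax * scale_y)
            for xmin, ymin, xmax, ymax in boxes]


# A hypothetical vest box in a 1280 x 320 source image.
print(resize_boxes([(100, 50, 300, 150)], orig_size=(1280, 320)))
```

If the aspect ratio were preserved with padding instead, a single scale factor plus an offset would replace the two independent factors used here.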
Because training DL neural networks requires GPU resources, the Google Colab platform was used with its Nvidia T4 Tensor Core GPU. The process of training and evaluating the model was repeated until the best performance was reached. For evaluation, the validation data were assessed with the Recall, Precision, and mAP50 metrics. In the study, mAP50 is used for the benchmark evaluation; it is the mean of the per-class average precision, where a detected bounding box counts as correct when its Intersection over Union (IoU) with the true bounding box is at least 50% [49]. Furthermore, the model's speed was evaluated by measuring the inference time in seconds.
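The evaluation metrics can be made concrete with a small sketch for a single class and a single image. It uses greedy matching of detections to ground-truth boxes at an IoU threshold of 0.5 and an all-point interpolated area under the precision-recall curve. This is one common convention and is assumed here, not taken from the study's evaluation code.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thr=0.5):
    """AP for one class; detections is a list of (score, box) pairs."""
    if not ground_truth:
        raise ValueError("at least one ground-truth box is required")
    matched = set()
    is_true_positive = []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box can be claimed only once
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            is_true_positive.append(True)
        else:
            is_true_positive.append(False)
    # Precision and recall after each detection in the ranked list.
    precisions, recalls, true_positives = [], [], 0
    for rank, flag in enumerate(is_true_positive, start=1):
        true_positives += flag
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))
    # Area under the precision-recall curve with a monotone envelope.
    ap, previous_recall = 0.0, 0.0
    for i, recall in enumerate(recalls):
        ap += (recall - previous_recall) * max(precisions[i:])
        previous_recall = recall
    return ap, precisions, recalls


def mean_average_precision(ap_per_class):
    """mAP is the plain mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

With two ground-truth boxes and three ranked detections of which the first and third are correct, the precision sequence is 1, 1/2, 2/3 and the recall sequence is 1/2, 1/2, 1, so the interpolated AP works out to 1/2 · 1 + 1/2 · 2/3 = 5/6.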
Finally, after the best performing architecture was selected, the final model was tested using various images in different environments to estimate the model's performance in real-life use cases. Based on a comprehensive session of training, testing, and validation (as explained), the proposed model is promising in terms of various performance metrics and is suitable for real-time environments owing to its relatively short detection time. Further details on the performance are provided in the subsequent sections.
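Detection time per image can be estimated by timing repeated calls to the model after a short warm-up. The sketch below is framework-agnostic: `predict` stands for any callable that runs the detector on one input, and both the helper name and the warm-up count are assumptions.

```python
import time


def mean_inference_time(predict, inputs, warmup=2):
    """Return the average wall-clock seconds per call of predict over inputs."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    # Warm-up calls are excluded so one-off setup costs do not skew the mean.
    for sample in inputs[:warmup]:
        predict(sample)
    start = time.perf_counter()
    for sample in inputs:
        predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(inputs)


# Stand-in workload instead of a real detector, for illustration only.
def fake_detector(image):
    return sum(range(10_000))


print(mean_inference_time(fake_detector, list(range(20))))
```

When the model runs on a GPU through PyTorch, the device should be synchronized (for example with `torch.cuda.synchronize()`) before each clock reading, because GPU kernels are launched asynchronously.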

Results and Discussion
The final PPE-detection model is discussed in this section. To evaluate the model, Precision, Recall, and mAP are applied, since the literature review showed these to be the most widely used metrics in similar studies [50][51][52]. Precision is the fraction of detected objects that are correct, while recall is the fraction of true objects that are detected [49]. The average precision of a class is the area under its Precision-Recall curve, and the mAP score is the mean of this value over all classes. Further, mAP50 counts a detection as correct when its intersection over union (IoU) with the true bounding box is at least 50% [49].
Table 4 compares the proposed Faster RCNN model with the YOLOv5 model; both were trained on the same dataset with the same hyperparameters and dataset classes. YOLOv5 scored 63.9% for the mAP50, 22.4% lower than Faster RCNN. As the Precision score was also higher for the Faster RCNN model, it was chosen for the study, with YOLOv5 as the runner-up. The detailed results of the chosen Faster RCNN model can be found in Table 5. Using the test data, the trained Faster RCNN model was used to classify the eight classes: head, person, glass, vest, red, yellow, white, and blue. The Faster RCNN model achieved an overall mAP50 score of 96%, and a high percentage for both precision and recall of approximately 50%. As a result, the model was able to identify and localize objects with a relatively high level of accuracy. The mAP is a good measure of the sensitivity of the neural network; a good mAP therefore indicates a model that is stable and consistent across different confidence thresholds. Figure 3 presents the convergence of Faster RCNN, which tapers off after 20 epochs.
Table 6 shows the comparison between the highest performing model observed in the literature review and the proposed model. It is apparent that the proposed model shows a great improvement in the inference time, at only 0.17 s. The reason the proposed Faster RCNN outperforms YOLOX-m is that it is configured entirely from components chosen carefully by the authors, and each layer is optimized in terms of those components, for instance, the sequence of convolutional layers with ReLU, pooling, flatten, fully connected, and SoftMax layers.
As far as the limitations of the study are concerned, the model's effectiveness is mainly subject to the provided PPE dataset, though it is a widely used dataset and covers the most common features of industrial PPE. Nonetheless, any significant change, such as a change in color or helmet shape, may lead to misclassification.
Further, it is strongly recommended that the use of the study's findings be supervised so as to address the factors contributing to PPE non-compliance, especially inadequate safety supervision, a lack of safety training, and a lack of management support. As a suggestion, the system may be added to the organizational risk assessment and management plan (RAMP), and the risk management officer (RMO) should take responsibility for appropriate action whenever the proposed system detects PPE non-compliance among the employees in a real-time environment.

Conclusions
This study was conducted in the hope of utilizing Artificial Intelligence technologies, specifically the field of Computer Vision, to aid in the safety of human beings and further increase the positive impact of AI in industry. Many factors have been discussed throughout this paper that reinforce the importance of PPE in work environments and show the necessity of the final developed model. The proposed study is a potential contribution towards mitigating the factors highlighted in the introduction by means of the real-time detection of PPE noncompliance. Consequently, the management can take action to prevent incidents. This may help in nurturing a risk management culture in the organization, where everyone, including workers, supervisors, and the administration, plays their role. The study has significant industrial implications and focuses on social sustainability, the well-being of workers, and company savings. In the proposed study, eight types of PPE segments were detected using the final model with the Faster RCNN architecture and the ResNet50 backbone. An mAP50 of 86% was achieved with the validation data, with an inference time of only 0.17 s, which is faster than the benchmarks. Moreover, the model's performance can be enhanced by increasing the number of training samples, which will improve the detection accuracy as well. In the future, the authors seek an opportunity to deploy the model on image-capturing hardware to test it in a real-life environment. The authors also intend to investigate various other deep learning and transfer learning approaches to further fine-tune the results.

Table 1. Summary of literature review.

Table 2. Class counts of the PPE.

Table 3. Object detection model's data features.

Table 4. Comparison between Faster RCNN and YOLOv5 for the PPE Detection Model.

Table 5. Results of the PPE-detection model.