1. Introduction
Construction sites are hazardous environments that expose workers to numerous risks, including physical injuries, falls, and exposure to hazardous materials. However, compliance with Personal Protective Equipment (PPE) regulations is often inconsistent: some workers neglect to wear PPE, whether unintentionally or deliberately, and site supervisors may fail to enforce its use. Traditional methods of monitoring compliance, such as time-consuming manual inspections, are prone to mistakes and difficult to apply effectively to larger or more complex construction sites [1].
Two major concerns in the construction industry are ensuring the safety of construction site workers and maintaining compliance with all safety protocols at the workplace. Due to the inherently dangerous nature of construction sites, workers are at risk of injuries from falls, being struck by heavy objects, and exposure to hazardous materials. Workers must wear PPE such as helmets, safety vests, and protective glasses to reduce these risks. PPE is a fundamental line of defense that protects workers from severe injuries [2].
However, even when PPE is provided, ensuring compliance remains a significant challenge because workers may not consistently wear or use the equipment correctly. Some workers may forget to wear safety gear, while others might wear it incorrectly. This lack of compliance increases the probability of accidents and reduces the effectiveness of safety measures. Traditionally, site supervisors and safety officers manually check PPE compliance, but this approach is time-consuming, prone to human error, and difficult to scale for large or complex construction sites [3].
With recent developments in deep learning and computer vision, there is an opportunity to automate real-time detection of PPE compliance. Using computer vision, the proposed PPE detector for construction sites leverages the deep learning model YOLO11 to accurately detect safety gear, including helmets, safety vests, and protective glasses. The system monitors workers in real time to ensure compliance with safety regulations and protocols. By providing continuous monitoring and instant feedback, the system enables construction managers and safety officers to monitor compliance more effectively, thereby reducing workplace accidents and enhancing overall site safety. Two main types of PPE detection algorithms are currently in wide use: transformer-based and Convolutional Neural Network (CNN)-based models.
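To make the detection pipeline concrete, the following minimal sketch shows how a YOLO11 model could be queried for PPE classes on a single camera frame using the Ultralytics Python API. The weights file, image path, and class names are illustrative placeholders rather than the exact artifacts of this study.

```python
from ultralytics import YOLO

# Load a YOLO11 model fine-tuned for PPE classes; "ppe_yolo11.pt" is a
# hypothetical checkpoint name, not the study's released weights.
model = YOLO("ppe_yolo11.pt")

# Run inference on a single frame; conf filters out low-confidence boxes.
results = model("site_camera_frame.jpg", conf=0.25)

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]    # e.g., "helmet", "vest", "glass"
        confidence = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding-box corners in pixels
        print(f"{cls_name}: {confidence:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

In a deployed system the same call would be run on a video stream, and the per-class detections would feed the compliance logic described later in this article.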
The prime focus of this research is the heavy construction industry in the Kingdom of Saudi Arabia, which spans residential buildings, large skyscrapers, and the ambitious NEOM urban development initiative, including LINE, a future smart city [4]. Because most construction sites are located in remote areas, PPE compliance is frequently ignored, and only a limited number of studies have addressed automated inspection of sites ranging from medium to very large scales, particularly with respect to the Kingdom's requirements and environmental factors. Accordingly, the current study investigates an augmented dataset with YOLO11 to detect PPE compliance in the construction industry of the Kingdom of Saudi Arabia.
The major contributions of this work are as follows: first, an automated system is developed that uses the latest computer vision and deep learning approaches to monitor PPE compliance in real time on a state-of-the-art dataset. Second, the system detects whether critical PPE is being used correctly, ensuring that all workers follow safety protocols. Third, by implementing this solution, the research aims to reduce the risk of injuries and fatalities on construction sites and promote a safer working environment.
2. Review of Literature
This review explores various studies on AI-powered worker safety monitoring for Personal Protective Equipment (PPE). Each review discusses the methods employed, the datasets investigated, the evaluation metrics, and the research findings. The review is organized into four subsections: the major PPE datasets used for detection, transformer-based approaches, CNN-based approaches, and the PPE compliance detection systems reported in the literature.
2.1. PPE Detection Datasets
Chang et al. [1] conducted a study on a monitoring system for classifying PPE using computer vision. The research focused on using multiple cameras to cover wider areas. The study employed two technologies to monitor worker safety: worker re-identification (ReID), which tracks individual workers across camera footage, and PPE classification, which uses object recognition to detect and categorize whether workers are wearing the necessary safety gear. A dataset covering four different viewpoints of house construction in Hong Kong was used. Forty workers were identified from 6245 images; twenty-four were used for training re-identification, and sixteen for testing. PPE classification assigns workers to four categories: workers with a helmet and vest (WHV), workers with a vest only (WV), workers with a helmet only (WH), and workers without a helmet and vest (W). The research concluded that the combined approach boosted accuracy, with worker identification improving by 4% and PPE detection improving by 13%.
Han and Zeng [2] described the detection of safety helmets on construction sites using deep learning techniques. The research employed YOLOv5 as the primary algorithm, incorporating a four-scale detection method. The dataset used contains construction site images collected from the Internet. The main evaluation metrics are the precision-recall curve, mean average precision (mAP), and mean detection time. The study focused on detecting whether workers were wearing helmets. The proposed approach achieved 92.2% mAP, outperforming the baseline YOLOv5, which achieved 85.9%, and took 3 ms to process a video frame at 640 × 640 resolution.
Research by Hayat and Morgado-Dias [3] proposed detecting safety helmets on construction sites using deep learning, employing the YOLOv3, YOLOv4, and YOLOv5 algorithms for real-time detection. The dataset used is the Hat Worker Image Dataset published by MakeML, which contains 5000 images. The best result was a 92.44% mAP using YOLOv5, which performed well at detecting objects in both high- and low-light conditions.
2.2. Transformer-Based PPE Detection Algorithms
A vision-based framework for real-time detection of PPE on construction sites was proposed by Lee et al. [5]. The framework uses the YOLACT instance segmentation algorithm, a lightweight model suitable for real-time applications, to detect PPE at the pixel level. Additionally, the framework utilizes DeepSORT for object tracking to monitor workers on construction sites. The dataset is a combination of multiple public datasets, including AIM, MOCS, and ACID, some of which contain CCTV and smartphone images, while others are sourced from open resources. The framework achieved 91.3% accuracy and 66.4% mAP.
A real-time PPE detection algorithm using deep learning by Lo et al. [6] focused on detecting workers' vests and helmets in videos and images. The study used YOLOv3, YOLOv4, and YOLOv7. The dataset is a custom dataset created by the study's developers, containing images sourced from the web and from cameras. The augmented dataset comprises 11,000 images and 88,725 labels of PPE from various construction areas. The study categorized workers into four classifications: no hat, hat, no high-visibility vest, and high-visibility vest, and achieved an mAP of 97%.
Ferdous et al. [7] proposed a YOLO-based PPE detector model for construction sites. The study uses the CHVG dataset, which comprises eight classes: four coloured hard hats, vests, safety glasses, a person's body, and a person's head. In this paper, the YOLOX architecture, an anchor-free variant of the YOLO family, is employed. The study demonstrated that the YOLOX-m model achieved the highest mAP among YOLOX versions, at 89.84%. Additionally, the paper addresses some potentially challenging conditions that the system might encounter, such as rain, haze, and low-light images, by artificially adding these effects to some images to test the model's robustness.
2.3. CNN-Based PPE Detection
Isailovic et al. [8] used deep learning-based object detection to ensure compliance with industrial PPE. The proposed pipeline integrates head region-of-interest estimation with a PPE detection system, using a dataset containing 12 different PPE types, integrated with public datasets. Three deep learning architectures, Faster R-CNN, MobileNetV2-SSD, and YOLOv5, were used. The results show that YOLOv5 achieves superior performance, with a slight advantage over the alternatives, achieving a precision of 0.920 ± 0.147 and a recall of 0.611 ± 0.287. Nath et al. [9] proposed a deep learning-based real-time site safety system for PPE detection. The study demonstrates three deep learning models from the YOLO family. The first method involves detecting separate components of workers' PPE, such as hats and vests, and combining them in the model to recognize and verify the correct PPE used on construction sites. The second method uses a single Convolutional Neural Network (CNN) to detect and verify each worker's PPE compliance. The third method involves detecting workers and then classifying and verifying them based on their PPE attire using CNN-based classifiers. The second method achieves the highest performance, with an mAP of 72.3%.
Gallo et al. [10] presented a detailed study of innovative systems for PPE detection in industrial environments. The proposed system performs deep learning-based PPE detection at the edge: its purpose is to enhance worker safety in industrial environments by analyzing video from surveillance cameras and triggering an alarm when a worker who is not compliant with PPE is detected. A system prototype was developed using a Raspberry Pi and tested with five pre-trained Convolutional Neural Networks (CNNs) for PPE detection. The evaluation compares the classification and inference latency of the CNNs with YOLO, showing promising results.
In a study on workers' safety regulation compliance using a spatiotemporal graph convolutional network, Lee et al. [11] focus on detecting compliance with safety regulations by analyzing workers' sequential movements in video footage. The training dataset used in this paper is not described in detail. The study used OpenPose to extract human poses from the footage, and a spatial–temporal graph convolutional network (ST-GCN) was employed to predict whether a worker was wearing PPE. An average F1-score of 0.827 was achieved.
A study on generic industrial PPE compliance using deep learning was proposed by Vukicevic et al. [12]. The dataset used in this paper was collected from public PPE datasets (400 images from the Pictor PPE dataset and 5200 from the Roboflow dataset) and web-mined images, totaling 15,728 cropped images. The deep learning models used are Inception_v3, DenseNet, SqueezeNet, VGG19, ResNet, and MobileNetV2, with overall accuracies of 0.92, 0.95, 0.87, 0.93, 0.95, and 0.95, respectively. In conclusion, DenseNet, MobileNetV2, and ResNet were the best performers for predicting PPE compliance; however, the study favored MobileNetV2 due to its lower computational requirements.
Another study by Delhi et al. [13] examined computer vision-based deep learning techniques for detecting PPE compliance on construction sites. CNN and YOLOv3 models were trained on a dataset of 2509 web-based construction site videos. Results show that the model achieved 96% mAP, 0.96 recall, and 0.96 F1-score.
The study by Azizi et al. [14] compared two machine learning-based approaches for detecting PPE on construction sites, focusing on robustness and timely detection. The test set used four videos from realistic construction sites. The algorithms tested are Faster R-CNN employing ResNet-50 and Few-Shot Object Detection (FsDet). Results showed that Faster R-CNN achieved a mean accuracy of 82%, while FsDet achieved 58%. In conclusion, the Faster R-CNN results are significantly better than FsDet's for predicting PPE compliance in various environments.
The study by Li et al. [15] focuses on inspecting worker PPE using deep learning. The OpenPose algorithm was applied to 1200 online video clips containing safe and unsafe behaviour. From this process, 1604 images were collected, and an additional 1604 images were obtained by horizontally mirroring the originals, totalling 3208 images in the training set. Additionally, the YOLOv5 model was trained to detect objects in images. A one-dimensional CNN was trained on 600 videos and tested on 600, achieving an accuracy of 0.9467 in experimental scenarios.
Li et al. [16] reviewed the behaviors of construction employees under monitoring. The reviewed datasets encompass various lighting conditions, whose challenges CNNs effectively address; Faster R-CNN and YOLO methodologies have yielded satisfactory outcomes in scenarios with diverse lighting conditions. The distribution of publication years indicates rapid development within the field, with continual updates to technical methodologies. Ahmed et al. [17] employed deep learning for PPE detection. The study compared two deep learning-based algorithms primarily used for computer vision tasks, namely YOLOv5 and Faster R-CNN with a ResNet50 backbone, on CHVG, a publicly available dataset. Data pre-processing techniques, including image flipping, filtering, and augmentation, were applied, and recall, precision, and mAP were used to evaluate the models. The study concluded that YOLOv5 achieved an mAP50 of 0.68, while Faster R-CNN achieved an mAP50 of 0.96, making it the best-performing model. Li et al. [18] investigated the application of deep learning for detecting safety helmets. The automated monitoring approach offers a means to oversee construction workers and ensure compliance with safety helmet regulations on construction sites. A public dataset comprising 3261 images of safety helmets was investigated, and the CNN-based SSD-MobileNet algorithm was used for model training.
Huang et al. [19] proposed an enhanced YOLOv3 model for helmet detection. The experimental dataset comprises surveillance videos captured at various times and angles on real construction sites. By selecting every fourth image from a sequence of screenshots, a diverse set of 13,000 images depicting significant variation was curated for helmet detection. Data samples were evenly distributed to enable the model to learn features effectively across different images, accounting for factors such as weather conditions and viewing distance. Experimental results reveal that Faster R-CNN achieves the highest mAP at 94.3%, followed closely by the enhanced detection algorithm at 93.1%, which offers better speed.
2.4. PPE Detection Systems
Rahman et al. [20] presented a YOLO11-based approach for generic PPE compliance detection in an industrial setting. The approach was evaluated on an augmented dataset compiled from various open sources as well as manually collected images. The study reported promising results, with an mAP50 of 95.5% for real-time applications covering a wide range of PPE. The study presented by Marquez et al. [21] proposes a novel enhancement of workplace safety for field workers by integrating edge computing with smart protective gear. The study presents a wearable gadget system consisting of belts, bracelets, and helmets equipped with AI-powered sensors. The proposed system safeguards workers' safety and integrity by detecting anomalies in their environment and notifying them early.
Airton and Gonzalez examined how well vision-based artificial intelligence systems monitor and ensure PPE compliance in various fields [22]. The study's primary goal is to investigate the use of artificial intelligence (AI) algorithms that can adapt to various environmental factors, thereby enhancing the precision and reliability of PPE detection.
A study by Balakreshnan et al. [23] proposes an AI-powered automated system for PPE compliance and monitoring that can be used in a wide range of work scenarios. Using a combination of cloud-based and on-premises analytics, the system provides continuous real-time monitoring and compliance checks. It is notable for its ability to generate automated reports and warnings, helping organizations reduce risk and comply with strict safety regulations, and it can initiate various control actions in response to safety violations. The study by Pooja and Preeti [24] suggests that AI may help ensure mask-wearing during the COVID-19 pandemic, thereby protecting public health and well-being. The authors propose automating mask identification using a computer vision system, a crucial step for public safety; the system can identify individuals who are not wearing masks and issue warnings. A study by Muanme et al. [25] describes an AI system that uses YOLO techniques to improve PPE compliance in industrial settings. This method ensures that employees have the appropriate equipment before gaining access to hazardous areas, thereby ensuring the safety of both workers and visitors.
Table 1 provides a summary of the reviewed literature.
Following a detailed literature review, several observations have been made and a research gap identified.
Firstly, no noteworthy studies have been conducted that focus on PPE detection in the Saudi Arabian environment. At the same time, Saudi Arabia is among the countries with a robust construction industry that encompasses a wide range of construction types, from residential buildings to skyscrapers and smart cities such as LINE, a project by NEOM [4]. Moreover, most construction sites are in remote areas, and PPE compliance is often not enforced due to the lack of smart detection systems, leading to unsafe conditions. The situation demands smart systems for automated detection of PPE compliance.
Secondly, most studies have used the YOLO architecture, but none have used the most recent versions, such as YOLOv10 or YOLO11, from a Saudi Arabian construction industry perspective. This gap, together with the YOLO family's promising performance in the area, provides two reasons to adopt YOLO11.
Thirdly, recent studies are focusing on generic PPE compliance detection, which may not be very effective in a construction site's specific environment [20]. This situation suggests a need to develop more algorithms that can achieve higher accuracy than the existing ones.
Finally, the CHVG dataset used in studies [6,17] is imbalanced and requires further improvement to achieve higher accuracy. By addressing these gaps, this research aims to discover and apply previously unexplored computer vision techniques, thereby balancing the dataset to improve accuracy.
4. Experimental Results and Analysis
This section outlines the performance achieved by the algorithms tested in this research and compares them with similar research that used the same dataset.
Table 3 compares different models employing the same dataset, CHVG. Regarding mAP50, the proposed scheme outperformed the scheme in [7] employing YOLOX. As far as the precision, recall, and F1-score values are concerned, the proposed scheme offers a marginal improvement over the scheme in [7], but with a significantly lower inference time of 7.3 ms, compared to 150 ms. Moreover, the performance of the proposed scheme was slightly better than that of the scheme in [17], with a 0.9% difference in mAP50. However, in terms of precision, recall, and F1-score, the proposed scheme outperformed the study [17] by 22.89%, 12.75%, and 17.35%, respectively. Finally, in terms of processing time, the proposed scheme was significantly faster than [17], with a time of 7.3 ms compared to 170 ms. Accuracy was not reported by any of the compared schemes and is therefore not included in the comparison.
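For reference, the metrics compared above follow their standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and AP_c is the average precision of class c computed at an Intersection-over-Union (IoU) threshold of 0.5:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{mAP50} = \frac{1}{N}\sum_{c=1}^{N} AP_c^{\,\mathrm{IoU}=0.5}
```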
Additionally, the proposed scheme is compared with a recent study that employed YOLO11 on the PPE dataset [20]. The proposed scheme achieved a 0.17-point improvement in mAP50 with an identical inference time. However, in terms of F1-score, precision, and recall, the scheme in [20] performed marginally better than the proposed scheme.
Figure 4 presents the training graph for the YOLO11 model for mAP50 and mAP50-95. It is apparent that mAP steadily improves over the training period, reflecting the model’s growing ability to distinguish between PPE-compliant and non-compliant classes.
Additionally, Table 4 presents the results of each model, including the optimizer and other hyperparameters employed to achieve the results of this study. The hyperparameters of each model were optimized before the results were obtained. As demonstrated, YOLO11x obtained the best mAP50 and mAP50-95 results, 96.9% and 70.9%, respectively, with a batch size of 16 and 21 epochs using the NAdam optimizer and a learning rate of 0.00001, followed by the same model with a batch size of 20 and 87 epochs at 91% and 60.3%, respectively. It can be deduced that the batch size plays a critical role in relation to the number of epochs, while the other parameters remain the same.
YOLO11l, with 100 epochs and a batch size of 28, obtained 90.7% and 58.7%, respectively. YOLOv10s followed the aforementioned models with 75 epochs, a batch size of 16, the auto optimizer, and a learning rate of 0.01, achieving 87.3% and 54.4%, respectively. Finally, YOLOv10n, with 25 epochs and a batch size of 32, obtained relatively poor performance at 77% and 47.3%, respectively, with all other parameters kept the same. Furthermore, the table shows that increasing the batch size and lowering the momentum for YOLO11x does not improve the model's performance.
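As an illustration, the best-performing YOLO11x configuration in Table 4 (batch size 16, 21 epochs, NAdam, learning rate 0.00001) could be reproduced roughly as sketched below with the Ultralytics training API; the dataset YAML file name and any settings not listed in the table (such as image size) are assumptions, not the study's exact setup.

```python
from ultralytics import YOLO

# Start from pretrained YOLO11x weights and fine-tune on the PPE dataset.
model = YOLO("yolo11x.pt")

model.train(
    data="chvg_ppe.yaml",   # hypothetical dataset definition file for CHVG-based data
    epochs=21,              # epoch count of the best run in Table 4
    batch=16,               # batch size of the best run
    imgsz=640,              # assumed input resolution
    optimizer="NAdam",      # optimizer reported for the best run
    lr0=1e-5,               # initial learning rate reported for the best run
)

# Evaluate on the validation split; map50 and map correspond to mAP50 and mAP50-95.
metrics = model.val()
print(metrics.box.map50, metrics.box.map)
```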
Moreover, data augmentation, including adjusting brightness and flipping images, helped increase the model's performance. These results are also demonstrated in Figure 5, where all five models are compared for mAP50 and mAP50-95.
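The brightness and flipping augmentations mentioned above can be illustrated with a simple offline sketch using OpenCV; the file names and brightness factors are placeholders rather than the exact values used in this study.

```python
import cv2

image = cv2.imread("worker.jpg")

# Horizontal flip: mirrors the image left-to-right (flipCode=1).
flipped = cv2.flip(image, 1)

# Brightness adjustment: scale pixel intensities by alpha and clip to [0, 255].
brighter = cv2.convertScaleAbs(image, alpha=1.3, beta=0)   # roughly 30% brighter
darker = cv2.convertScaleAbs(image, alpha=0.7, beta=0)     # roughly 30% darker

for name, aug in [("flip", flipped), ("bright", brighter), ("dark", darker)]:
    cv2.imwrite(f"worker_{name}.jpg", aug)
```

When such augmentations are applied to a detection dataset, the bounding-box labels must be mirrored along with the flipped images.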
From the graph in Figure 6 for the best model (YOLO11x), the highest mAP50 is achieved by the class 'person', which contains the highest number of instances. The lowest mAP50 is 0.884, achieved by the class 'glass', which contains the lowest number of instances compared to the other classes prior to balancing.
Figure 7 shows the F1-score confidence curve. It reveals the same outcome as in Figure 2: the person class obtained the highest confidence, while the glass class received the lowest. The best model achieved an F1-score of 0.90 at a confidence threshold of 0.424. This score represents a good trade-off between precision and recall, and the figure illustrates the model's performance at various confidence levels.
The normalized confusion matrix is depicted in Figure 8.
Figure 9 shows examples of the model's predicted classes, along with their bounding boxes and confidence scores. The closer the value is to 1, the more confident the model is in the predicted class.
Figure 10, Figure 11 and Figure 12 show the proposed PPE-EYE model deployed in a user interface (UI) to demonstrate how the model can be deployed in real systems. Although the system detects all PPE classes on which the model was trained, the UI displays a single bounding box per worker that conveys two states: a green box indicates no violation, whereas a red box indicates a violation, in which case an alert with further details is sent to the incident section (Figure 11). Moreover, a metrics section summarizes the violations and other relevant information graphically (Figure 12).
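A minimal sketch of the green/red decision logic described above is given below: a worker's box is flagged as a violation when any required PPE class is not detected on that worker. The required class set, the overlap rule, and the thresholds are illustrative assumptions rather than the exact implementation used in PPE-EYE.

```python
# Assumed set of PPE classes that every worker must wear.
REQUIRED_PPE = {"helmet", "vest", "glass"}

def overlap_ratio(ppe_box, worker_box):
    """Fraction of the PPE box that lies inside the worker box; boxes are (x1, y1, x2, y2)."""
    x1, y1 = max(ppe_box[0], worker_box[0]), max(ppe_box[1], worker_box[1])
    x2, y2 = min(ppe_box[2], worker_box[2]), min(ppe_box[3], worker_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    ppe_area = (ppe_box[2] - ppe_box[0]) * (ppe_box[3] - ppe_box[1])
    return inter / ppe_area if ppe_area else 0.0

def check_violation(worker_box, detections, min_overlap=0.5):
    """detections: list of (class_name, (x1, y1, x2, y2)) tuples for one frame."""
    found = {cls for cls, box in detections
             if cls in REQUIRED_PPE and overlap_ratio(box, worker_box) >= min_overlap}
    missing = REQUIRED_PPE - found
    return ("red" if missing else "green"), missing

# Example: glasses are not detected on this worker, so the box turns red
# and an alert would be sent to the incident section.
color, missing = check_violation(
    (0, 0, 100, 200),
    [("helmet", (20, 0, 60, 40)), ("vest", (10, 60, 90, 140))],
)
print(color, missing)  # -> red {'glass'}
```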
In the feedback mechanism, misclassified cases are treated as new examples for the existing system and are used to retrain the model every twenty-four hours for continuous improvement.
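The twenty-four-hour feedback loop could be organized roughly as in the following sketch, where corrected misclassified frames are queued during the day and merged into the training set before a short fine-tuning run; the directory layout, schedule handling, and fine-tuning settings are assumptions for illustration only.

```python
import shutil
import time
from pathlib import Path

from ultralytics import YOLO

FEEDBACK_DIR = Path("feedback_queue")        # corrected misclassified samples land here
TRAIN_IMAGES = Path("dataset/images/train")  # hypothetical dataset layout

def merge_feedback_and_retrain(weights="ppe_yolo11.pt", data="chvg_ppe.yaml"):
    """Move queued feedback images into the training set and fine-tune briefly."""
    new_samples = list(FEEDBACK_DIR.glob("*.jpg"))
    if not new_samples:
        return  # nothing misclassified today, skip retraining
    for img in new_samples:
        # The corresponding label files would be moved alongside the images.
        shutil.move(str(img), str(TRAIN_IMAGES / img.name))
    model = YOLO(weights)
    model.train(data=data, epochs=5, batch=16)  # short fine-tuning pass

# Simple scheduler: retrain once every twenty-four hours.
while True:
    merge_feedback_and_retrain()
    time.sleep(24 * 60 * 60)
```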
6. Conclusions
In conclusion, monitoring PPE is a crucial subject that requires improvement to ensure a safe and healthy environment for employees in this field. This article discusses the implementation of an AI-based solution, PPE-EYE, to address this problem, utilizing the latest member of the YOLO family (YOLO11), the CHVG dataset, and data augmentation and balancing techniques. The YOLO11x model achieved a 0.969 mAP50 score, the best result in this research experiment, compared to the earlier YOLOX version, which achieved only 0.8984 mAP50. Furthermore, based on the software prototype developed in the academic environment, the current research model has an inference time of 0.0073 s, compared to Faster R-CNN, which required 0.17 s. Future improvements include balancing the dataset using up-sampling techniques and other data augmentation techniques such as rotation, scaling, and contrast adjustment. Additionally, adding more classes to the model will enhance its ability to detect a broader range of real-world PPE, including gloves and other tools. Moreover, adding images from multiple environments, such as indoor and outdoor settings, during both day and night, will enhance the model's diversity across different environments.