A Detailed Comparative Analysis of You Only Look Once-Based Architectures for the Detection of Personal Protective Equipment on Construction Sites

For practitioners and researchers, construction safety is a major concern. The construction industry is among the world’s most dangerous industries, with a high number of accidents and fatalities. Workers in the construction industry are still exposed to safety risks even after conducting risk assessments. The use of personal protective equipment (PPE) is essential to help reduce the risks to laborers and engineers on construction sites. Developments in the field of computer vision and data analytics, especially using deep learning algorithms, have the potential to address this challenge in construction. This study developed several models to enhance the safety compliance of construction workers with respect to PPE. Through the utilization of convolutional neural networks (CNNs) and the application of transfer learning principles, this study builds upon the foundational YOLO-v5 and YOLO-v8 architectures. The resultant model excels in predicting six key categories: person, vest, and four helmet colors. The developed model is validated using a high-quality CHV benchmark dataset from the literature. The dataset is composed of 1330 images and manages to account for a real construction site background, different gestures, varied angles and distances, and multi-PPE. Consequently, the comparison among the ten models of YOLO-v5 (You Only Look Once) and five models of YOLO-v8 showed that YOLO-v5x6’s running speed in analysis was faster than that of YOLO-v5l; however, YOLO-v8m stands out for its higher precision and accuracy. Furthermore, YOLO-v8m has the best mean average precision (mAP), with a score of 92.30%, and the best F1 score, at 0.89. Significantly, the attained mAP reflects a substantial 6.64% advancement over previous related research studies. Accordingly, the proposed research has the capability of reducing and preventing construction accidents that can result in death or serious injury.


Introduction
The construction industry is considered one of the riskiest fields of work. Compared to workers in other industries, construction workers have twice as high a risk of injury. Construction sites are well known for their high volume of activity, large machinery, frequent incidents, and numerous risks, all of which call for the careful consideration and application of safety precautions. Personal protective equipment (PPE) is the main line of defense against any threats that workers may encounter during their presence on construction sites. Manual inspections, which may be laborious and prone to human error, are a major component of traditional ways of guaranteeing PPE compliance. A study conducted by Kang et al. [1] found that 70% of fall accidents in 2017 occurred because workers were not wearing PPE on site. Further, statistics [2][3][4] show that there is a significant risk of worker fatalities and injuries in the construction sector. In addition, the number of worker accidents in the construction industry is continuously increasing, which is alarming and points to the urgency of developing safety tracking systems for construction sites. For example, research by the Korea Occupational Safety and Health Agency states that among all industries, the construction sector has the second-highest rate of occupational accidents/injuries (25.5%) and the highest rate of fatalities (46.7%) [5].
Alongside human efforts in manual and visual inspections, computer vision techniques have been developed and have progressed. This development is evident in automated PPE detection systems, which offer more options nowadays than ever before. Such detection systems are a viable way to automate PPE recognition on building sites, improving safety and lessening the workload for engineers and site managers. Hence, significant efforts are currently being made to improve worker safety, which also greatly benefits construction companies because PPE can reduce the probability and severity of falling accidents. Creating recognition and monitoring systems for PPE used during working hours is one of the targeted initiatives. For this reason, Ferdous and Ahsan [6] created a YOLO-based architecture model for the recognition of workers wearing PPE on construction sites. Consequently, artificial intelligence (AI) capabilities can be adapted to create reasonably priced automation solutions for the construction industry, such as monitoring systems that can identify workers and PPE and determine whether or not they are adhering to safety requirements.
The major objective of this research is to exploit AI's capabilities to create a PPE detection system by using a YOLO-based architecture. This aim can be achieved by attaining the following subtargets: (1) evaluating the accuracy of YOLO-based architectures in creating a PPE detection system; (2) comparing the models' performance metrics, such as precision, recall, and mAP; (3) exploring trade-offs between the speed and accuracy of different YOLO architectural models; and (4) proposing future recommendations for optimizing PPE detection systems in real-world applications.
The outline of this research study is as follows. Section 2 reviews previous research on the adaptation of computer vision through detection and recognition systems in fields such as the construction industry. Section 3 describes the steps that were taken during the preparation of this study; it also incorporates the research methodology framework. Section 4 illustrates the evolution of YOLO across the years, in addition to describing the framework used in the implementation of the YOLO model. Section 5 presents the dataset used in the training, validation, and testing of different YOLO models, as well as the code used for the training of the YOLO model. Section 6 highlights the performance evaluation metrics that serve as the basis of comparison between the different YOLO models. Section 7 presents the results of comparing the YOLO models in terms of the performance evaluation metrics (recall, precision, F1 score, and mAP). Section 8 summarizes the whole study and provides readers with recommendations for future YOLO training to be more beneficial in the construction field.

Literature Review
Preserving the safety of construction workers during their presence on construction sites is the main aim of this study. This can be achieved by reducing the probability and/or severity of construction incidents. According to recent studies, as mentioned in the introduction, construction workers' safety is highly dependent on them wearing proper personal protective equipment (PPE) during their presence on construction sites. Hence, this study was conducted to continue work on recognizing PPE by using computer vision applications. Accordingly, this literature review covers a number of recent studies on such computer vision techniques, including investigations conducted within the latest six years. The studies reviewed in this paper mainly concentrate on the progression of PPE recognition using computer vision techniques such as convolutional neural networks (CNNs) and the application of transfer learning principles. In addition, by examining these studies, a research gap can be identified, as well as recommendations for future research on the development of an effective PPE detection system tailored specifically for the construction field. This would be advantageous in terms of incident reduction and avoidance.
In this regard, the Web of Science (WoS), which incorporates a significant number of high-impact papers, is the most widely used platform for databases of the scientific literature. Thus, researchers often use this database to collect precise data for bibliometric analyses [7,8]. Consequently, the literature reviewed in this research was obtained from the WoS database. To find the desired papers in the database, a variety of search criteria were used, such as (construction worker) AND ((safety) OR (risk) OR (health)) AND ((machine learning) OR (deep learning) OR (computer vision) OR (vision-based)).
In a study conducted by Delhi et al. [9], the researchers applied deep-learning-based computer vision to recognize PPE on construction sites in real time. The researchers collected the dataset manually, in addition to applying web scraping. The dataset contained around 2500 images that were classified into four classes, as follows: NOHARDHAT, NOJACKET, SAFE, and NOT SAFE. YOLO-v3 was trained on that dataset after a data augmentation step, which gives the model resilience and generalization by performing flipping along with rotation to the left and to the right by an angle of 30 degrees. Further, the dataset was randomly split into 90%, 8%, and 2% segments for training, validation, and testing, respectively. Based on the tested data, the model achieved an mAP and F1 score of 97%.
Deep learning neural networks were applied in the research of Wang et al. [10] for the real-time detection and recognition of objects to address the problem of worker safety by making sure that employees followed safety protocol. They suggested applying YOLO-v3, YOLO-v4, and YOLO-v5, which are deep-learning-based detectors of the YOLO architecture. They used data from a high-quality dataset called CHV. These data incorporated 1330 images extracted from Wang et al.'s [11] dataset and broken down into six categories: person, vest, and helmets with four colors. The research results showed that YOLO-v5s had the fastest GPU performance of 52 FPS, while YOLO-v5x had the best mAP of 86.55%. A cognitive analysis of safety measures for a monitoring system was proposed by Torrse et al. in another study [12]. The system was used to instantly determine whether personal protective equipment is being used appropriately based on data gathered by CCTV cameras. Further, the system employed a deep learning algorithm to identify objects. The study resulted in the creation of a YOLO-v4 system that could achieve an 80.19% mAP at 80 frames per second in real time. Most of the current deep learning detectors had limitations with far-away objects and close-range objects [13,14]. YOLO models perform with a higher accuracy than other detection models.
A similar study by Hayat and Morgado Dias [15] adopted a real-time deep learning method to identify the heads and helmets of construction site workers. This paper investigated three different iterations of the well-known deep learning architecture YOLO: YOLO-v3, YOLO-v4, and YOLO-v5x. The model was implemented by the authors using the public dataset made available by Make ML [16]. A total of 3000 instances were used for training, and 1000 instances were used for testing. Furthermore, in this study, only "Head" and "Helmet" were used as classes. Power-law transformation [17] was used for image preprocessing so as to improve the contrast and lighting of the data. With accuracy, precision, recall, and F1 scores of 92%, 92.4%, 89.2%, and 90.8%, respectively, the YOLO-v5x model gave the best accuracy and, hence, the best performance.
Gallo et al. suggested a system in [18] to recognize personal protective equipment (PPE) in hazardous industrial areas. Deep neural networks were used for the system's analysis of a video stream. Several models (YOLO-v4, YOLO-v4-Tiny, SSD, CenterNet, EfficientDet, and MobileNet) were trained to determine whether or not the workers are implementing the safety measures by wearing safety equipment. The authors utilized three datasets: two were collected under controlled conditions and incorporated 215 and 236 images, respectively; the third one is an available public dataset [19] with 7035 images. Because of its rapid detection speed, YOLO-v4-Tiny was implemented in the system, achieving an mAP of 86%. Further, by using the INRIA person dataset [20], Li et al. [21] were able to train an autonomous safety-helmet-wearing recognition system. Furthermore, a safety helmet detection model was suggested by Wang et al. [22], trained using 10,000 photos taken on construction sites by 10 distinct surveillance cameras. Geng et al. [23] presented an enhanced helmet recognition method based on an unbalanced dataset of 7581 photos, the majority of which included a person wearing a helmet against a complicated backdrop. Testing it on 689 photos resulted in a label confidence of 0.982.
A transfer learning model-based automated technique was developed by Vibhuti et al. [24] to identify individuals who were not wearing masks in public during the COVID-19 epidemic. InceptionV3, ResNet50, VGG16, MobileNet, MobileNetV2, and Xception were among the deep learning models that were employed in the intervention. Training, testing, and validation were conducted using the Simulated Masked Face dataset (SMFD) [25]. Through the use of a fine-tuning strategy, the pretrained InceptionV3 model was developed and optimized. The best results, obtained with the SMFD dataset [26], were 100% accuracy and specificity in testing and 99.92% in training. The main outcomes highlighted the excellent accuracy of automated non-mask-wearer recognition achieved by the proposed transfer learning model.
Despite their contributions, the above-mentioned studies did not provide a detailed comparative analysis of the performance of the different YOLO-based architecture models. This is the research gap that is addressed in this research by applying a detailed comparative performance analysis between the different YOLO models.

Research Methodology
In developing this research, we went through different phases to reach the optimum YOLO model for detecting PPE, as shown in Figure 1. The Web of Science (WoS) database was employed in this research to compile an extensive collection for the literature review to guide and inform our investigation. Moreover, in order to guarantee relevant articles and lay the groundwork for an extensive literature review, the search criteria were carefully crafted. After gaining knowledge from previous research and studying its limitations, our goal was to build up different YOLO models. This was achieved using Google Colab (accessed on 11 November 2023). The preparation of the dataset, comprising both images and annotations, was an important process to ensure that the results would be reliable. Subsequently, the YOLO models were trained and validated. The model testing was performed by calculating performance evaluation metrics such as precision, recall, F1 score, and mAP. In the final stages, a comparative analysis of different YOLO models was conducted by analyzing the results to draw meaningful conclusions and contribute to the evolving field of PPE detection.

Framework
In this study, YOLO-v5 and YOLO-v8 with their different versions were used as the primary models for class detection. YOLO was first introduced to the computer vision field in 2015 by Joseph Redmon et al. [27] in a paper entitled "You Only Look Once: Unified, Real-Time Object Detection". From 2015 to 2023, YOLO teams continued developing YOLO models and versions from one year to another. Figure 2 shows the evolution of YOLO across the years. YOLO is a one-stage, single-shot detector. It uses a convolutional neural network (CNN) to process an image and makes a single pass over the entire input image to make a prediction for the targeted classes. Different YOLO models contain different architectures, but all of them share the same structure, consisting of three main parts: backbone, neck, and prediction. Figure 3 shows the framework of PPE detection based on YOLO models. The backbone helps to produce visual features with different shapes and types by using the convolutional neural network. The neck is a set of layers which mix and combine image features in order to pass them to the next prediction step. The prediction stage takes the input from the neck stage in order to perform classification for the targeted classes.

YOLO History
YOLO is a powerful real-time detection model which was first introduced in 2015 by Joseph Redmon et al. [27]. Later, in 2018, Joseph Redmon [28] upgraded YOLO-v1 to YOLO-v3, which was faster. YOLO is a one-stage, single-shot detector that makes a single pass on an input image to make a prediction for the targeted classes. Different versions of YOLO are faster than two-phase object detection models. Two-phase object detection uses two phases for detection: the first phase generates a pool of candidate object locations, and the second phase evaluates these candidates to make a final decision regarding the targeted classes. YOLO-v3 is faster than two-phase object detection models such as the R-CNN and Fast R-CNN models [28]; it is reported to be 1000 times faster than R-CNN and 100 times faster than Fast R-CNN.
In January 2023, Ultralytics released YOLO-v8. YOLO-v8 contains five different models. The models of YOLO-v8 produce more efficient output while using an equivalent number of parameters. Table 2 shows YOLO-v8's different models and their characteristics [32].
Figure 4 illustrates a comparison between different YOLO models trained at an image resolution of 640 [30].

Dataset Description
A public dataset containing PPE images was obtained from [10]. The dataset, named the CHV dataset, contains images of vests, colored helmets (blue, red, white, and yellow), and persons. The CHV dataset contains photos from real construction site conditions, unlike other datasets which contain images with backgrounds that are not from construction sites. The CHV dataset contains 1330 images with 9209 instances in total. The dataset covers different gestures (e.g., standing and bending), different angles (e.g., front, back, left, right, up), and different distances (e.g., far away and close distance). Figure 5 shows the image distribution into training, testing, and validation sets.
The training set is used in the training process of the model; the testing set is used as the base for evaluation of the model; and the validation set is used to ensure that the model is predicting results as planned. Figure 6 shows the percentage of each class in the training, testing, and validation sets.

Dataset Processing
The training process was performed on the cloud platform Google Colab, with an Nvidia K80/T4 GPU, 16 GB of GPU memory, and a performance of 4.1 TFLOPS/8.1 TFLOPS. The CHV dataset [11] is divided into three subsets: training, validation, and testing. Figure 7 shows the code written on Google Colab in order to train, validate, and test by cloning different YOLO models from [30,32]. The models were trained for 50 epochs.
Eng 2024, 5, 355
The memory and computation required in a single training step increase greatly when a large number of images are sent to a YOLO model concurrently during training. In order to handle this, datasets are usually split up into smaller batches, with "n" photos in each batch, and the model is trained batch by batch. The size of the input image was 640 × 640, with a batch size of 16 photos. After training on all batches, the results of each batch are stored. The memory consumption increases as the batch size increases. With a batch size of 16, a momentum value of 0.937 was used in the model. Optimization was carried out using the stochastic gradient descent (SGD) optimizer during the training process. The learning rate was set to an initial value of 0.01 and then adjusted periodically using a warm-up approach, with a weight decay of 0.0005.
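The training settings reported above can be collected in one place, as in the following minimal sketch. It assumes the Ultralytics Python package is used for training; the original notebook (Figure 7) is not reproduced here, so the `TrainConfig` class, the `train` helper, and any weight or dataset file names passed to it are hypothetical. The argument names `epochs`, `imgsz`, `batch`, `optimizer`, `lr0`, `momentum`, and `weight_decay` follow the Ultralytics training interface.

```python
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text (illustrative container)."""
    epochs: int = 50              # training epochs
    imgsz: int = 640              # input images are 640 x 640
    batch: int = 16               # photos per batch
    optimizer: str = "SGD"        # stochastic gradient descent
    lr0: float = 0.01             # initial learning rate
    momentum: float = 0.937       # SGD momentum
    weight_decay: float = 0.0005  # weight decay


def batches_per_epoch(num_train_images: int, batch_size: int) -> int:
    """Number of batches needed to see every training image once."""
    return math.ceil(num_train_images / batch_size)


def train(weights: str, data_yaml: str, cfg: TrainConfig):
    """Hypothetical helper: fine-tune pretrained weights on a dataset YAML.

    Assumes `pip install ultralytics`; imported lazily so that the rest of
    this sketch runs without the package installed.
    """
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **asdict(cfg))
```

For example, a training split of 1000 images (a placeholder, since the exact split sizes are given only in Figure 5) would be processed in `batches_per_epoch(1000, 16)` batches per epoch.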
A basic metric to measure the performance of object detection algorithms is intersection over union (IOU). IOU is the ratio of the area of overlap to the area of union of two boxes, the ground truth box (TB) and the detection box (DB). It is calculated using Equation (1) [33].
$$\mathrm{IOU} = \frac{\mathrm{area}(TB \cap DB)}{\mathrm{area}(TB \cup DB)} \qquad (1)$$
Figure 8 shows the relationship between the ground truth box (TB) and the detection box (DB).
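The IOU definition can be made concrete with a short sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; this box format and the function name `iou` are illustrative choices, not taken from the original code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, so their IOU is 1 / (4 + 4 − 1) = 1/7.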
After calculating an IOU, the confusion matrix criteria are applied using true positive (TP), false positive (FP), and false negative (FN). These basic concepts are described to aid in understanding the following equations. True positive (TP) is the correct detection of a ground truth bounding box [33]. False positive (FP) is the incorrect detection of a nonexistent object [33]. False negative (FN) is an undetected ground truth bounding box [33]. In object detection, true negative (TN) results do not apply, as there are an infinite number of bounding boxes that should not be detected.
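One common way to turn these definitions into counts for a single class is sketched below. The greedy, confidence-ordered matching is an assumption made for illustration (the exact matching rule used by the YOLO validation code may differ), the 0.5 threshold is the one this study uses to accept a detection as a TP, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(ground_truths, detections, iou_threshold=0.5):
    """Count TP, FP, FN for one class.

    ground_truths: list of boxes; detections: list of (box, confidence).
    Each ground truth box can be matched by at most one detection.
    """
    matched = set()
    tp = fp = 0
    # Visit the most confident detections first.
    for box, _conf in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # correct detection of a ground truth box
            tp += 1
        else:
            fp += 1                # detection with no matching ground truth
    fn = len(ground_truths) - len(matched)  # undetected ground truth boxes
    return tp, fp, fn
```

No TN count is produced, in line with the observation that true negatives do not apply to object detection.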
One of the most important and difficult steps in machine learning is choosing appropriate metrics for performance evaluation. ROC curves, F1 score, precision, accuracy, and recall are frequently used metrics for comparison between different models [34,35]. However, they are not suitable for all datasets [36], especially when the positive and negative datasets are imbalanced [37]. Since accuracy and ROC curves do not accurately reflect the true classification performance of rare classes, they can be useless performance measures in unbalanced datasets [38,39]. In the proposed analysis, the precision, recall, F1 score, and mean average precision (mAP) were used as the evaluation metrics to perform comparison between YOLO's different models. Precision is the ability of a model to identify only the relevant objects [33]. Precision shows the percentage of correct positive predictions among all detections [40], as shown in Equation (2).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
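Recall is the ability of the model to find all the relevant cases [33]. Recall shows the percentage of true positives among all ground truths [41], as shown in Equation (3).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
Moreover, the F1 score is the harmonic mean of precision and recall, as shown in Equation (4).
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$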
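As a worked illustration of the precision, recall, and F1 definitions, the sketch below computes all three from TP, FP, and FN counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    # Precision: share of detections that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: share of ground truth boxes that were detected.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

With the illustrative counts TP = 8, FP = 2, and FN = 2, precision and recall are both 8/10 = 0.8, so the F1 score is also 0.8.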
Additionally, the most common metric used to measure the accuracy of the detection is mean average precision (mAP). The mAP measures the accuracy of object detectors over all classes, not only a specific class. It is the score achieved by comparing the detected bounding box to the ground truth bounding box. If the IOU is greater than or equal to 50%, the detection is counted as a TP. The formula of the mAP is given in Equation (5).
$$\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} AP_k \qquad (5)$$
where $AP_k$ is the average precision of class k and n represents the number of classes. In this study, n = 6 (person, vest, and four colored helmets).
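The averaging step of the mAP can be sketched as follows. The per-class AP values used in the usage note are placeholders chosen only to make the arithmetic easy to follow; they are not results from this study, and the function name is hypothetical.

```python
def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values.

    ap_per_class: mapping from class name to AP_k in [0, 1].
    """
    if not ap_per_class:
        raise ValueError("at least one class is required")
    n = len(ap_per_class)  # number of classes (n = 6 in this study)
    return sum(ap_per_class.values()) / n
```

For six classes with placeholder AP values of 0.9, 0.8, 1.0, 0.7, 0.6, and 0.8, the sum is 4.8 and the mAP is 4.8 / 6 = 0.8.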
In addition to the above metrics, further measures such as the number of model layers, floating-point operations (FLOPs), and frames per second (FPS) were used to evaluate the performance and efficiency of the YOLO models. The complexity of the model is measured by FLOPs, which express the number of computations performed by the model. FPS represents the number of frames processed per second. These metrics aid in the comprehension of variables including inference speed, computational cost, and model complexity.
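FPS can be estimated by timing repeated inference calls, as in the sketch below. The `infer` callable stands in for a model's prediction function and is hypothetical; the warm-up pass is a common precaution assumed here, not a step described in the text, and the measurement procedure actually used in this study may differ.

```python
import time


def measure_fps(infer, frames, warmup=1):
    """Average frames per second of `infer` over `frames`.

    infer: callable that processes a single frame.
    frames: list of inputs; the first `warmup` frames are run once, untimed,
    so that one-off start-up costs do not distort the measurement.
    """
    for frame in frames[:warmup]:
        infer(frame)

    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start

    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

The reciprocal of the result is the average latency per frame, which is sometimes easier to compare against a real-time budget.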

YOLO Models Results Comparison
After running the model, precision × recall curves were extracted from the model. For a precision × recall curve, the accuracy of the model increases when the precision remains high as the recall increases. Therefore, the curves which are closer to the upper-right corner indicate higher performance. The precision × recall curves for YOLO's different models are presented in Figure 9.
To calculate the precision, recall, F1 score, and mean average precision (mAP), the TP, FP, and FN counts must first be extracted from each model after validation. Table 3 shows the TP, FP, and FN for the ten YOLO-v5 models, and Table 4 shows them for the five YOLO-v8 models. For the person class, YOLO-v5m6 scored the highest TP, which raises both precision and recall, whereas YOLO-v5n scored the highest FN, which lowers recall. For the vest class, YOLO-v8s scored the highest TP, while YOLO-v5s6 scored the highest FN. For the blue and red helmet classes, YOLO-v5n6 and YOLO-v8s scored the highest TP, while YOLO-v5n scored the highest FN. Other comparisons can be deduced from Tables 3 and 4.
Once TP, FP, and FN are obtained, the precision, recall, F1 score, and mAP are calculated by applying Equations (2)-(5). Tables 5-8 show a comparative analysis between YOLO's different models. Figures 10-13 elaborate the results shown in Tables 5-9.
The mAP comparative analysis for the various YOLO models indicates a consistent level of performance across different configurations. YOLO-v8m leads the group with the highest mAP of 0.92, closely followed by YOLO-v8x and YOLO-v8l, both achieving an mAP of 0.91. YOLO-v5x, YOLO-v5l6, YOLO-v5l, and YOLO-v5m share a common mAP of 0.90, highlighting their comparable precision in object detection. YOLO-v5s, YOLO-v8n, and YOLO-v8s also delivered solid performance, with mAP values ranging from 0.87 to 0.89. YOLO-v5n6 and YOLO-v5s6 exhibited slightly lower mAPs at 0.8 and 0.85, respectively. Overall, these YOLO models show reliable and competitive mAP scores, so users can choose among them based on specific application requirements and computational considerations.
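The metric calculation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard definitions of precision, recall, and F1 score, and takes mAP as the mean of the per-class average precision values; Equations (2)-(5) are not reproduced in this section, so the correspondence is an assumption. The counts and AP values below are invented for illustration and are not taken from Tables 3 and 4, and all names (`Counts`, `f1_score`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Counts:
    """Per-class detection counts from a validation run (illustrative)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    """TP / (TP + FP); returns 0.0 when there are no detections."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    """TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1_score(c: Counts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class: Mapping[str, float]) -> float:
    """Mean of the per-class average precision (AP) values."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical counts for a single class, not taken from the paper's tables.
example = Counts(tp=90, fp=10, fn=30)
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"f1={f1_score(example):.3f}")

# Hypothetical per-class AP values.
print(f"mAP={mean_average_precision({'person': 0.8, 'vest': 1.0}):.3f}")
```

The sketch makes the dependence noted in the text explicit: a higher TP raises both precision and recall, while a higher FN only enters the recall denominator.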
Considering the number of model layers, YOLO-v5x6 has the most layers (416), indicating a more complex architecture. YOLO-v5s and YOLO-v5n have the fewest layers (157), suggesting a simpler architecture. Regarding FPS, YOLO-v5n6 has the highest value (87.72), indicating the fastest frame processing. YOLO-v5s and YOLO-v5s6 also have high FPS values, suggesting good real-time performance. YOLO-v8x has the lowest FPS (20.37), indicating the slowest frame processing. Concerning the models' parameters, YOLO-v5x6 has the highest number of parameters (140.02 million), signifying a more complex model. YOLO-v5n has the lowest number of parameters (1.77 million), indicating a simpler model.
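As a rough illustration of how an FPS figure such as those above can be obtained, the following sketch divides the number of processed frames by the elapsed wall-clock time. The paper's actual measurement procedure is not described in this section, so this is an assumption; `measure_fps`, `infer`, and the injectable `clock` parameter are hypothetical names, and the stand-in inference function does no real detection.

```python
import time
from typing import Callable, Iterable


def measure_fps(
    infer: Callable[[object], object],
    frames: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return frames processed per second for a given inference callable."""
    start = clock()
    count = 0
    for frame in frames:
        infer(frame)  # run the detector on one frame
        count += 1
    elapsed = clock() - start
    if count == 0 or elapsed <= 0:
        raise ValueError("need at least one frame and a positive elapsed time")
    return count / elapsed


def dummy_infer(frame: object) -> object:
    """Stand-in for a detector call; simulates a small per-frame cost."""
    time.sleep(0.001)
    return frame


if __name__ == "__main__":
    fps = measure_fps(dummy_infer, range(50))
    print(f"measured FPS: {fps:.1f}")
```

Passing the clock as a parameter keeps the function deterministic under test; in practice a few warm-up frames are usually excluded so that model loading does not distort the figure.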
In addition to the above results, a comparison of computational cost across the models showed that YOLO-v8x has the highest FLOPS (214 million), both overall and among the YOLO-v8 models, indicating the highest computational load. YOLO-v5x has the highest FLOPS among the YOLO-v5 models, likewise indicating a higher computational load. YOLO-v5n and YOLO-v5n6 have the lowest FLOPS (4.20 million), suggesting the lowest computational load. Accordingly, YOLO-v5n6 stands out for its high FPS and low FLOPS, indicating good real-time performance and computational efficiency. YOLO-v5x6, while having a high number of parameters, has a lower FPS, suggesting a trade-off between complexity and processing speed. The other models fall in between, offering a range of choices based on specific requirements for model complexity, real-time performance, and computational load. Beyond these trade-offs, the main outcome of this detailed analysis is that YOLO-v8m improves mAP by 6.64% over previous related research studies, making it more reliable and accurate.

Conclusions
To sum up, construction workers face a high risk of injury. Most construction accidents were found to result from workers' negligence in wearing personal protective equipment (PPE). Hence, researchers agree that PPE is the main line of defense against the threats that workers may encounter on construction sites, as per Delhi et al. [9]. Therefore, the YOLO-based architecture was considered as an AI application for monitoring workers' safety on construction sites by detecting workers who are not wearing PPE.
In light of the above, and after experimenting with the CHV dataset on ten YOLO-v5 models and five YOLO-v8 models, the precision, recall, F1 score, mean average precision (mAP), number of layers, number of parameters, FPS, and FLOPS were calculated to compare the different YOLO models. YOLO-v8m improves mAP by 6.64% over previous related research studies, making it more suitable for applications where detection performance is the deciding measure. These findings collectively help to clarify the strengths and capabilities of each YOLO-based architecture in the context of PPE detection on construction sites, providing valuable insights for the development and deployment of computer vision solutions in occupational safety applications.
In this regard, one drawback of the different YOLO models is their difficulty in identifying small objects at a long distance, especially when large objects are nearby. This might be resolved by changing or further developing the architecture. Additionally, although the model was extended to include six distinct classes, it may be expanded further to detect more PPE items, such as masks, glasses, and gloves. Regarding future research, our studies will be expanded to optimize each model and enhance its performance by gathering original datasets from different actual construction sites, which could further support the model training process and lead to more accurate predictions.

Figure 1. Framework of the proposed research methodology.

Figure 2. Evolution of YOLO across the years.

Figure 5. Image distribution into training, testing, and validation sets.

Figure 6. Percent of each class in training, testing, and validation sets.
The models were trained for 50 epochs. The number of weights a YOLO model learns in a single epoch is greatly increased when a large number of images are sent to the model concurrently during training. To handle this, datasets are usually split into smaller batches, with "n" photos in each batch, and the model is trained batch by batch. The input image size was 640 × 640, with a batch size of 16 photos. After training on all batches, the results of each batch are stored. The memory consumption increases as the number of batches increases. With a batch size of 16, a momentum value of 0.937 was implemented in the model. Optimization
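The batch-by-batch scheme described above can be sketched as follows. This is a minimal illustration only: the dictionary keys, the helper names, and the image count of 100 are hypothetical and do not reflect the actual size of the CHV training split or any specific library's configuration format. Only the values 50, 640, 16, and 0.937 come from the text.

```python
import math
from typing import Iterator, List, Sequence

# Hyperparameters stated in the text; the key names are hypothetical.
TRAIN_CONFIG = {
    "epochs": 50,
    "image_size": 640,   # input images are 640 x 640
    "batch_size": 16,
    "momentum": 0.937,
}


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


# Hypothetical file list; the count of 100 is for illustration only.
image_paths = [f"img_{i:04d}.jpg" for i in range(100)]
batches = list(iter_batches(image_paths, TRAIN_CONFIG["batch_size"]))

print(f"batches per epoch: {len(batches)}")
print(f"size of last batch: {len(batches[-1])}")
print(f"total steps: {TRAIN_CONFIG['epochs'] * len(batches)}")
```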

Figure 8. Relationship between ground truth box (TB) and detection box (DB).

Figure 10. Precision comparative analysis for YOLO's different models.

Figure 11. Recall comparative analysis for YOLO's different models.

Figure 13. mAP comparative analysis for YOLO's different models.

Table 5. Precision comparative analysis for YOLO's different models.

Table 6. Recall comparative analysis for YOLO's different models.

Table 7. F1 score comparative analysis for YOLO's different models.

Table 8. mAP comparative analysis for YOLO's different models.

Table 9. Complexity, FPS, and computational cost for YOLO's different models.