GBSG-YOLOv8n: A Model for Enhanced Personal Protective Equipment Detection in Industrial Environments

Abstract: The timely and accurate detection of whether or not workers in an industrial environment are correctly wearing personal protective equipment (PPE) is paramount for worker safety. However, current PPE detection faces multiple inherent challenges, including complex backgrounds, varying target size ranges, and relatively low accuracy. In response to these challenges, this study presents a novel PPE safety detection model based on YOLOv8n, called GBSG-YOLOv8n. First, the global attention mechanism (GAM) is introduced to enhance the feature extraction capability of the backbone network. Second, the path aggregation network (PANet) structure is optimized in the Neck network, strengthening the model’s feature learning ability and achieving multi-scale feature fusion, further improving detection accuracy. Additionally, a new SimC2f structure has been designed to handle image features and more effectively improve detection efficiency. Finally, GhostConv is adopted to optimize the convolution operations, effectively reducing the model’s computational complexity. Experimental results demonstrate that, compared to the original YOLOv8n model, the proposed GBSG-YOLOv8n model achieved a 3% improvement in the mean Average Precision (mAP), with a significant reduction in model complexity. This validates the model’s practicality in complex industrial environments, enabling a more effective detection of workers’ PPE usage and providing reliable protection for achieving worker safety. This study emphasizes the significant potential of computer vision technology in enhancing worker safety and provides a robust reference for future research regarding industrial safety.


Introduction
As the global industrial scale continues to expand, industrial safety incidents have become a severe global challenge, posing a serious threat to life, property, and the sustainable development of society. In an industrial environment, ensuring workers' safety is paramount. However, with the increasing number of workers and limited supervisory personnel, it becomes challenging to effectively oversee the safety conditions of many workers. Furthermore, insufficient safety awareness among workers significantly increases the risk of industrial safety incidents [1]. As of 28 July 2023, a total of 12,070 production safety incidents had occurred in China in 2023 alone, resulting in 10,527 fatalities. Therefore, more effective measures are urgently needed to ensure the safety of workers.
Wearing PPE [2] effectively reduces hazards and mitigates safety risks in industrial environments. PPE includes safety helmets, safety goggles, safety vests, face masks, etc. [3]. Safety helmets contribute to head protection, preventing injuries from falling objects and electrical hazards [4]. Safety goggles reduce the risk of eye injuries [5]. Safety vests enhance worker visibility, reducing the risk of collisions with machinery and other objects [6]. Also, face masks are crucial in blocking harmful substances from being inhaled and protecting the respiratory system [7]. However, ensuring that workers wear PPE and use it correctly is a complex task. Therefore, the timely and accurate detection of PPE usage is of paramount importance [8].
In early industrial environments, traditional methods were typically employed to ascertain whether workers were correctly wearing PPE in compliance with existing regulations. These methods required factory supervisory personnel to routinely engage in manual observations and patrols, to manually scrutinize surveillance videos to detect workers' unsafe behaviors, and to require workers to conduct daily self-examinations and report any issues. While these conventional methods contribute to the monitoring of PPE use to some extent, they suffer from a series of issues: (1) regulatory personnel are susceptible to external interference, which may lead to oversight and incorrect judgments; (2) subjective factors, such as emotions and psychological states, can affect the objectivity of judgment; (3) manually reviewing all monitoring images is a highly labor-intensive task; and (4) workers may lack sufficient safety awareness. Therefore, to solve these problems, it has become particularly urgent to realize the automation and intelligence of PPE inspection in industrial environments.
Initially, researchers primarily focused on implementing PPE detection using sensor technology. However, these technologies require expensive equipment, increasing the cost of industrial production and posing potential health risks to workers. As technology advances, the application of artificial intelligence in industrial automation is increasingly prevalent. Before the development of deep learning technology, the principal approach to detecting correct PPE usage by workers involved image processing and machine learning. However, these methods did not perform well in complex scenarios with many interferences and could only identify limited types of PPE. As the demand for industrial production increased, the variety of PPE also gradually expanded. The emergence of deep learning technology has led to the use of techniques such as object detection in the field of PPE detection. YOLO, recognized for its high detection accuracy and fast processing speed, has emerged as a prominent representative in object detection. In recent years, YOLO has undergone multiple iterations [9][10][11][12][13][14] and has been widely applied in various domains [15][16][17][18][19]. It has also begun to be used for PPE detection in industrial environments. However, the current research primarily focuses on construction sites and covers only a limited range of PPE types. With the increasing demands of industrial production, the variety of PPE required by workers is constantly expanding, presenting new detection challenges.
Hence, this study proposes the GBSG-YOLOv8n model, built upon the YOLOv8n framework, for swift and precise PPE detection in industrial settings. The key contributions of this study include the following:

• To enhance the PPE detection model's performance, we have established a new dataset called PPES, which comprises many images captured by cameras in industrial settings, providing ample data resources for research.
• By introducing GAM and embedding it into the model's backbone network, we enhance the focus on PPE targets, suppress interference from non-target background information, and significantly improve the feature extraction capability of the backbone network.
• To effectively integrate feature information from different scales and prevent the loss of PPE feature details, we optimized the PANet structure within the Neck network. This optimization facilitated efficient bidirectional cross-scale connections and feature-weighted fusion, further enhancing detection accuracy.
• We have designed the SimC2f structure to enhance the performance of the C2f module, resulting in the more efficient processing of image features and an improvement in overall detection efficiency.
• To meet the requirements of real-time PPE detection and a lightweight model, we use GhostConv to optimize the convolution operation in the backbone network, which significantly reduces the model's computation and parameter count while ensuring high detection accuracy.
The subsequent sections of this paper are organized as follows: Section 2 offers a concise introduction to the pertinent background knowledge. Section 3 presents the GBSG-YOLOv8n model. In Section 4, we substantiate the effectiveness of GBSG-YOLOv8n through comprehensive experiments and analyses. Finally, Sections 5 and 6 are dedicated to discussing and summarizing our work.

Related Work
Currently, PPE detection methods fall into two main categories: sensor-based and computer vision-based. Sensor-based methods use installed sensors to analyze signals and assess proper PPE usage by workers. Kelm et al. [20] utilized RFID sensor technology, deployed at the entrances and exits of workplaces, to inspect the safety of PPE. In addition, Bauk et al. [21] proposed an RFID worker safety model tailored to specific workplace requirements, embedding RFID in workers' PPE. Furthermore, Dong et al. [22] introduced a method for automated remote monitoring and evaluation of PPE by integrating pressure sensors and positioning technology. They also developed a real-time locating system (RTLS) to track workers' positions, ensuring the proper use of PPE. Additionally, Hayward et al. [23] presented a PPE access control system prototype that integrates PPE with indoor and outdoor personnel location monitoring systems to ensure that employees and visitors wear PPE correctly. While sensor-based methods assist in detecting the wearing of PPE by workers, these methods incur substantial costs and require intricate deployment and maintenance. Additionally, they are restricted to particular environments and object categories, which diminishes detection accuracy.
As technology advances, contemporary industrial production is shifting toward higher speeds, greater precision, and automation. Computer vision-based methods are gradually emerging in PPE detection. These methods fall into two categories: the approach combining image processing with machine learning [24], mainly used for solving issues like image segmentation, feature extraction, and classification, and the approach that performs tasks such as target detection and image generation with the help of deep learning techniques [25]. In traditional methods, it is typical to employ image processing techniques to initially locate areas of interest, then extract image features and utilize machine learning methods to train classifiers to determine whether these areas contain PPE. Li et al. [26] utilized a background modeling algorithm to detect moving objects in the field of view of surveillance cameras. After identifying regions of interest linked to motion, they employed the histogram of oriented gradient (HOG) method to extract features. Subsequently, they used the extracted HOG features to train a support vector machine (SVM) for worker classification. Wu et al. [27] presented a color-based hybrid descriptor that combines local binary patterns (LBP), Hu moment invariants (HMI), and color histograms (CH) to capture a wide range of colors. They then implemented a hierarchical support vector machine (H-SVM) for feature classification, facilitating safety helmet detection. Although traditional methods combining image processing and machine learning have been widely applied, the need for manual feature design and extraction poses challenges, especially when dealing with large-scale data, which may lead to issues related to computational resources and memory limitations.
The continuous advancement of deep learning technology has largely addressed these challenges, prompting numerous researchers to incorporate techniques like object detection into PPE detection. Girshick et al. [28] introduced the concept of R-CNN, laying the foundation for deep-learning object detection. Subsequently, object detection algorithms based on candidate boxes were proposed and applied in the PPE detection field. Zhang et al. [29] proposed a safety management framework for on-site construction utilizing computer vision and real-time positioning systems. They employed Fast R-CNN to analyze image data from on-site cameras, enabling object detection and classification and assessing proper PPE usage by workers. Additionally, Fan et al. [30] explored the mechanisms and effectiveness of different target detection algorithms in the context of helmet detection. The findings highlighted the remarkable accuracy achieved by Faster R-CNN. They further improved the helmet detection algorithm by integrating models, resulting in enhanced detection capabilities. Although these two-stage object detection algorithms excel in accuracy, they must first locate candidate regions in the image and then employ a classifier to categorize each candidate region. These two steps individually require substantial computational resources, resulting in relatively lower efficiency for two-stage algorithms. In contrast, by using a CNN for end-to-end processing, one-stage object detection algorithms directly generate multiple bounding boxes in the image and predict classification probabilities for each bounding box. This approach achieves simultaneous localization and classification, reducing the demand for computational resources and enhancing efficiency. The SSD [31] and YOLO [32] algorithms are notable examples, known for their capability to achieve higher detection accuracy and faster processing speeds. Han et al. [33] introduced an object detection algorithm based on a cross-layer attention mechanism and multi-scale perception. This approach effectively detects the use of safety helmets, building upon the foundation of the SSD algorithm, and it demonstrates a significant improvement in accuracy. Wang et al. [34] utilized the YOLOv3 model at a construction site to detect workers and the presence of safety helmets. Jiang et al. [35] enhanced YOLOv3 by incorporating squeeze-and-excitation (SE) blocks between convolution layers in Darknet53, substituting the mean squared error (MSE) with GIoU loss, and employing focal loss to mitigate the significant foreground-background class imbalance issue, thereby more effectively achieving the real-time monitoring of mask-wearing. Ji et al. [36] introduced a residual feature enhancement module based on YOLOv4, reducing the loss of valuable information in high-level feature maps, enhancing object detection accuracy, and enabling the timely detection of workers who are not wearing safety helmets or clothing in industrial environments. Wang et al. [37] tested the performance of YOLOv3, YOLOv4, and YOLOv5 on a custom dataset. The findings demonstrate that the YOLOv5 model outperformed the others. Zhang et al. [38] introduced shallow detection heads tailored for small object detection within the YOLOv5 algorithm. These heads are combined with SENet channel attention modules to effectively condense global spatial information. Additionally, they added a denoising module to the backbone network to ensure feature clarity and accuracy, significantly improving helmet detection accuracy. Tai et al. [39] introduced a new dynamic anchor box mechanism based on YOLOv5 for safety helmet detection, improving the model's accuracy in handling target changes. Sun et al. [40] integrated the MCA module into YOLOv5 to obtain more comprehensive feature map data. By employing strategies such as sparse training and channel pruning, they notably improved safety helmet detection performance. Ali et al. [41] assessed the performance of different YOLOv5 and YOLOv7 versions in detecting students' PPE compliance in a laboratory setting using a self-created safety-related dataset. Based on YOLOv7, Wang et al. [42] improved computational efficiency by introducing the CPC structure and combining it with the SA mechanism, enabling the model to concentrate on localized image information at a reduced computational cost, enhancing accuracy, and better meeting the need for the real-time detection of masks in complex scenarios. These studies have demonstrated the effectiveness of deep learning in the field of PPE detection. Table 1 displays models frequently employed in PPE detection, including Faster R-CNN, SSD, YOLOv3, YOLOv4, YOLOv5, and YOLOv7.
In summary, many scholars have conducted extensive research on PPE detection. While specific achievements have been made, the diverse on-site conditions and challenges of detecting small targets limit the field, necessitating high adaptability and generalization. Trim marks are often subject to background interference, and overlapping targets make accurate detection and recognition more complex. Therefore, the primary objective of this study is to address the issues mentioned earlier and to propose a high-performance method for the rapid and accurate detection of PPE in an industrial setting.

YOLOv8n Model Analysis
YOLOv8 represents the most recent advancement in the YOLO series of object detection algorithms. It excels at swiftly and precisely detecting objects in images, determining their positions, and classifying their categories by learning object characteristics and shapes. Depending on the network's depth and width, YOLOv8 can be divided into YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Given the need for PPE object detection under complex and dynamic field conditions and the challenges of detecting small objects, this study has selected the YOLOv8n network from the YOLOv8 series, which has fewer parameters while maintaining high accuracy. The YOLOv8n model's detection network consists of four components, as illustrated in Figure 1: Input, Backbone, Neck, and Head.
The primary role of the Input network is to receive, preprocess, and transmit input images for feature extraction and object detection. It employs Mosaic data augmentation to combine multiple images into a single input sample, enhancing data diversity. Furthermore, input images are resized to a uniform dimension so that the model processes images of a consistent size.
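The Input stage described above can be illustrated with a deliberately simplified sketch. It stitches four equally sized grayscale images into a 2x2 mosaic and then resizes the result with nearest-neighbor sampling. The function names `mosaic_2x2` and `resize_nearest` are hypothetical, and the sketch omits the random cropping, random placement, and bounding-box remapping that practical Mosaic augmentation performs; it is not the YOLOv8n implementation.

```python
from typing import List

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def mosaic_2x2(tiles: List[Image]) -> Image:
    """Stitch four equally sized images into a single 2x2 mosaic (simplified)."""
    if len(tiles) != 4:
        raise ValueError("mosaic_2x2 expects exactly four images")
    h, w = len(tiles[0]), len(tiles[0][0])
    for tile in tiles:
        if len(tile) != h or any(len(row) != w for row in tile):
            raise ValueError("all images must share the same size")
    # Top half: image 0 next to image 1; bottom half: image 2 next to image 3.
    top = [tiles[0][r] + tiles[1][r] for r in range(h)]
    bottom = [tiles[2][r] + tiles[3][r] for r in range(h)]
    return top + bottom


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Resize to a uniform size using nearest-neighbor sampling."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


if __name__ == "__main__":
    # Four constant 2x2 "images" with pixel values 1, 2, 3, and 4.
    tiles = [[[v, v], [v, v]] for v in (1, 2, 3, 4)]
    combined = mosaic_2x2(tiles)
    print(combined)
    print(resize_nearest(combined, 2, 2))
```

In a real pipeline, the annotations of each source image would have to be shifted and clipped together with the pixels, which this sketch does not attempt.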
The Backbone network conducts feature extraction, converting input images into multi-scale feature maps through deep convolutional neural networks and delivering crucial data for subsequent object detection. YOLOv8n incorporates an improved CSPDarknet53 as its backbone network. Unlike traditional backbone networks that use cross-stage partial (CSP) modules, YOLOv8n replaces them with the C2f module. This module connects through gradient splitting, maintaining the network's lightweight characteristics while enriching information exchange during feature extraction. Finally, the SPPF module pools the input feature maps into fixed-size feature maps to achieve adaptive output size.
The primary role of the Neck network is to amalgamate features across various scales to produce a feature pyramid. It utilizes the PANet structure [43], comprising the feature pyramid network (FPN) [44] and the path aggregation network (PAN). FPN acquires feature maps from the convolutional neural network, constructs a feature pyramid, and, employing a top-down approach, combines multi-scale features by up-sampling the coarser-granularity feature maps, thus achieving multi-scale feature fusion. To more effectively retain target position information, the PAN module supplements FPN by adopting a bottom-up structure, fusing feature maps from different levels through convolutional layers, further enhancing detection performance.
The Head network serves as the ultimate prediction component, obtaining category and position information for objects of varying sizes from feature maps of diverse scales. This network uses a decoupled head structure, separating the classification and detection heads. It also adopts the anchor-free concept, eliminating the need for predefined anchors. Instead, it learns various object shapes and sizes to better adapt to object detection tasks in different scenarios.

Improved Model
The overall architecture of YOLOv8n has been retained, while some of its modules have been improved or replaced. The resulting model is GBSG-YOLOv8n, and this section provides a detailed introduction to it. The network structure of GBSG-YOLOv8n is shown in Figure 2.

Global Attention Mechanism
In PPE safety detection, challenges arise from complex backgrounds and the potential oversight of small targets. Thus, it is vital to enhance the model's feature extraction abilities. Attention mechanisms help the model emphasize essential input information while filtering out less relevant material. In this study, we integrate GAM [45] into the backbone network to magnify the focus on PPE-specific features, strengthening the backbone network's ability to extract critical information. This enhancement ultimately results in improved detection performance and accuracy. The GAM structure is depicted in Figure 3.
The GAM comprises two distinct submodules, the channel attention submodule and the spatial attention submodule [46], designed for channel and spatial attention operations. The primary aim of the channel attention submodule is to augment interactions among global features. It initially preserves information across three dimensions via a 3D permutation operation. Subsequently, it enhances cross-dimensional channel-spatial dependencies by employing a two-layer MLP structure. Unlike the channel attention submodule, the spatial attention submodule is dedicated to improving the model's focus on spatial information. It commences with max-pooling and average-pooling operations on the input feature maps, followed by the fusion of these two pooling outcomes. The merged feature map is then subjected to convolution and processed with a Sigmoid activation function. The collaborative operation of these two submodules strengthens the interdependence between global and spatial features, allowing our model to more effectively concentrate on the requisite feature information, ultimately leading to improved performance.
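As a rough illustration of the spatial attention steps as this section describes them (pooling, fusion, convolution, Sigmoid), the sketch below gates a small feature map in plain Python. It is written under several assumptions that are not taken from the paper: pooling is done across the channel axis, the convolution is reduced to a 1x1 kernel with the hypothetical weights `w_max`, `w_avg`, and `bias`, and the function name `spatial_attention` is made up for this example. The reference GAM implementation [45] may differ in these details.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def spatial_attention(
    x: FeatureMap, w_max: float = 1.0, w_avg: float = 1.0, bias: float = 0.0
) -> FeatureMap:
    """Gate every spatial location with a Sigmoid of fused pooled statistics.

    Assumption: the convolution over the two pooled maps is a 1x1 kernel
    (w_max, w_avg, bias); practical modules usually use a larger kernel.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            values = [x[k][i][j] for k in range(c)]
            pooled_max = max(values)      # max-pooling across channels
            pooled_avg = sum(values) / c  # average-pooling across channels
            gate = sigmoid(w_max * pooled_max + w_avg * pooled_avg + bias)
            for k in range(c):
                out[k][i][j] = x[k][i][j] * gate
    return out


if __name__ == "__main__":
    # Two channels, one row, two columns.
    features = [[[0.0, 2.0]], [[0.0, 4.0]]]
    print(spatial_attention(features))
```

Locations with strong responses receive a gate close to 1, whereas weak locations are scaled down, which is the suppression of background information that the text attributes to the attention mechanism.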

Bidirectional Feature Pyramid Network
Various challenges are typically encountered in PPE target detection, including variations in the distance between workers and cameras, leading to inconsistent image resolutions and a wide range of PPE target sizes, ranging from small items, like safety goggles, to larger ones, like safety vests. Consequently, effectively integrating this diverse information for multi-scale targets remains challenging.
To tackle this problem, YOLOv8n utilizes the PANet structure to build a feature pyramid, facilitating the fusion of multi-scale feature information and enhancing target detection performance. Although the PANet structure effectively integrates features of different scales through top-down and bottom-up information propagation, improving its ability to detect and handle changes in scale for various target sizes, it nonetheless exhibits some inherent problems. For instance, in the PANet structure, nodes with only one input edge and those lacking feature fusion exist. This can lead to unbalanced information transmission and an increase in additional parameters and computational burden. In response to these issues, this study introduces the BiFPN structure [47] as a new Neck component. The BiFPN structure is a deep learning network architecture specifically designed for object detection tasks, and it improves the PANet structure. The BiFPN structure removes redundant nodes from the top and bottom layers of the PANet structure. It incorporates various connection methods, such as horizontal, vertical, and cross-scale connections, to enhance feature fusion efficiency. This enables the model to better adapt to multi-scale targets and reduces the risk of losing crucial information. Furthermore, the BiFPN structure can be stacked multiple times as needed, further enhancing the fusion of multi-scale features without introducing excessive redundant parameters and computational burden. These improvements significantly boost the model's performance in multi-scale object detection tasks. The structure is shown in Figure 4.
Electronics 2023, 12, 4628 9 of 24
Compared to the PANet structure, the BiFPN structure possesses a more robust feature fusion capability, effectively integrating more feature information without increasing the computational burden. Furthermore, its distinctive skip connection structure proficiently addresses the issue of feature loss, as illustrated in Figure 5. In cases where the Neck network's initial input and output nodes coincide within the same layer, the supplementary edges are consolidated to minimize spatial information loss.
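The weighted feature fusion used in BiFPN [47] takes the form

$$O = \sum_{i} \frac{\omega_i}{\varepsilon + \sum_{j} \omega_j} \cdot I_i$$

where $I_i$ represents the input features, $O$ is the output features, and $\omega_i$ and $\omega_j$ are learnable weights. A ReLU activation is applied to each learnable weight so that $\omega_i \ge 0$, which keeps every normalized weight within the range [0, 1], and $\varepsilon = 0.0001$ is a small value added to ensure output stability.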
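The weighted fusion just described can be sketched in a few lines. The example below assumes each input feature is flattened to a list of floats of equal length; the function name `fast_normalized_fusion` and the flat representation are illustrative choices, not taken from the paper. ReLU is applied to the raw weights before normalization, and `eps` plays the role of ε = 0.0001.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    inputs: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fuse same-shaped features with non-negative, normalized weights."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is required per input feature")
    length = len(inputs[0])
    if any(len(feature) != length for feature in inputs):
        raise ValueError("all input features must have the same length")
    # ReLU keeps each weight non-negative, so every normalized weight is in [0, 1].
    relu_weights = [max(0.0, w) for w in weights]
    denom = eps + sum(relu_weights)
    return [
        sum(w * feature[k] for w, feature in zip(relu_weights, inputs)) / denom
        for k in range(length)
    ]


if __name__ == "__main__":
    low_level = [1.0, 2.0]
    high_level = [3.0, 4.0]
    # Equal weights give (approximately) the element-wise mean.
    print(fast_normalized_fusion([low_level, high_level], [1.0, 1.0]))
    # A negative raw weight is clipped to zero, so that input is ignored.
    print(fast_normalized_fusion([low_level, high_level], [1.0, -5.0]))
```

Compared with a softmax over the weights, this normalization avoids the exponential while still bounding each contribution, which is the usual motivation given for it in the BiFPN literature.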
In this study, we substituted the PANet structure in the model with the BiFPN structure to enhance the transmission and extraction of multi-scale PPE target features. This improvement aims to boost the PPE detection performance for targets of various scales, consequently elevating the model's mAP.

SimC2f Design
A significant improvement of YOLOv8 is the substitution of the original C3 module with the C2f module. The C2f module aims to augment the model's feature extraction capabilities without adding to the model's complexity, thus further enhancing its overall performance. The design of the C2f module is inspired by ELAN. Typically, once a network reaches a certain depth, adding more convolutional blocks yields diminishing accuracy gains, and the model's convergence deteriorates. To address this issue, the ELAN module increases the longest gradient path within the residual blocks and analyzes both the short and long gradient paths in each layer. Especially in very deep networks, this enables better control over the gradient paths, facilitating improved feature learning and overall model performance. Based on this foundation, enhancements were applied to the C3 module to achieve a richer gradient flow while preserving a lightweight design. The ELAN and C2f structures are shown in Figures 6 and 7.
The C2f module demonstrates relatively high accuracy in multi-object detection tasks. However, it exhibits some limitations. First, it employs fixed weights for feature fusion, which limits the adaptability to diverse targets and scenarios, resulting in inflexibility in weight assignment. Second, the simple feature summation operation may lead to information loss, particularly when addressing small targets or complex scenes, which may not adequately retain positional information and target details. In addition, for high-resolution feature maps, the C2f module requires many computations and parameters, increasing the model's complexity. To overcome these limitations in the C2f module, we designed the new SimC2f structure, which introduces the SimAM attention module [48] in C2f to assign unique weights to each neuron, thus providing a more flexible method of weight assignment. The SimC2f structure is shown in Figure 8.
The primary advantage of SimAM is its ability to allocate a unique weight to each neuron without introducing additional parameters. This capability enables the network to better adapt to various targets and scenarios, ultimately enhancing the model's performance and efficiency.
The SimAM assesses neuron importance using principles derived from neuroscience.In neuroscience, information-rich neurons typically display distinct activation patterns from those of neighboring neurons, often inhibiting the surrounding neurons, a phenomenon known as the spatial inhibitory effects.Hence, neurons with spatial inhibitory effects are considered to have higher significance.The SimAM, drawing inspiration from spatial inhibition, assesses the significance of each neuron through an analysis of linear separa- The primary advantage of SimAM is its ability to allocate a unique weight to each neuron without introducing additional parameters.This capability enables the network to better adapt to various targets and scenarios, ultimately enhancing the model's performance and efficiency.
SimAM assesses neuron importance using principles derived from neuroscience. In neuroscience, information-rich neurons typically display activation patterns distinct from those of neighboring neurons and often inhibit the surrounding neurons, a phenomenon known as spatial inhibition. Hence, neurons that exhibit spatial inhibitory effects are considered more significant. Drawing inspiration from spatial inhibition, SimAM assesses the significance of each neuron by analyzing the linear separability between the target neuron and the other neurons. This evaluation involves defining an energy function for each neuron, as depicted in Equation (2).
The above equation has a closed-form analytical solution, in which the terms given by Equations (5) and (6) appear. The minimum energy is then given by Equation (7).
According to Equation (7), a lower energy value signifies a more pronounced distinction between neuron t and the other neurons, indicating higher importance. This entire process can be represented by Equation (8).
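For reference, the quantities discussed in this passage can be written out as below. This is a sketch that follows the formulation in the SimAM paper [48] rather than a transcription of the typeset equations of this study: the mapping onto Equations (2)-(8) is an assumption, M denotes the number of neurons in one channel, t is the target neuron, x_i are the other neurons in that channel, λ is a regularization coefficient, and the hatted mean and variance are taken over all neurons in the channel.

```latex
\begin{align}
e_t(w_t, b_t, \mathbf{y}, x_i) &= \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1-(w_t x_i+b_t)\bigr)^2
  + \bigl(1-(w_t t+b_t)\bigr)^2 + \lambda w_t^2 \tag{2} \\
w_t &= -\frac{2\,(t-\mu_t)}{(t-\mu_t)^2 + 2\sigma_t^2 + 2\lambda} \tag{3} \\
b_t &= -\frac{1}{2}\,(t+\mu_t)\,w_t \tag{4} \\
\mu_t &= \frac{1}{M-1}\sum_{i=1}^{M-1} x_i \tag{5} \\
\sigma_t^2 &= \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i-\mu_t)^2 \tag{6} \\
e_t^{*} &= \frac{4\,(\hat{\sigma}^2+\lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda} \tag{7} \\
\tilde{X} &= \operatorname{sigmoid}\!\left(\frac{1}{E}\right) \odot X \tag{8}
\end{align}
```

Here E collects the minimum energies of all neurons. Because the minimum energy is smaller for more distinctive neurons, its reciprocal grows with importance, and the sigmoid keeps the resulting weights bounded.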
Industrial settings often feature intricate backgrounds and diverse objects. To achieve precise PPE compliance detection, it is imperative to enhance the model's feature perception and representation capabilities. Consequently, we incorporated SimAM into the C2f module, enhancing the model's capacity to perceive features of small targets without introducing extra parameters. This substantially improved detection performance and accuracy.
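The parameter-free weighting can be illustrated with a minimal pure-Python sketch for a single channel. It assumes the weighting used in the public SimAM reference code [48], namely sigmoid((t - mean)^2 / (4 (var + lam)) + 0.5) with the variance normalized by the number of neurons minus one. The function names and the default value of lam are hypothetical, and how the block is wired into C2f in this study is not shown.

```python
import math


def simam_weights(channel, lam=1e-4):
    """Return one attention weight per neuron of a single channel.

    channel: list of rows (lists of floats). lam is an assumed regularizer.
    """
    values = [v for row in channel for v in row]
    m = len(values)
    mean = sum(values) / m
    # Variance normalized by (m - 1), as in the reference implementation.
    var = sum((v - mean) ** 2 for v in values) / max(m - 1, 1)
    weights = []
    for row in channel:
        w_row = []
        for t in row:
            # Reciprocal of the minimum energy: larger for distinctive neurons.
            inv_energy = (t - mean) ** 2 / (4.0 * (var + lam)) + 0.5
            w_row.append(1.0 / (1.0 + math.exp(-inv_energy)))
        weights.append(w_row)
    return weights


def simam_refine(channel, lam=1e-4):
    """Scale every neuron by its weight; no learnable parameters are involved."""
    weights = simam_weights(channel, lam)
    return [[t * w for t, w in zip(row, w_row)]
            for row, w_row in zip(channel, weights)]
```

A neuron that stands out from the rest of its channel receives a larger weight than its neighbors, while a perfectly flat channel yields the same weight everywhere.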

GhostConv
In industrial settings, high precision and real-time capability are critical for PPE detection. Timely issue identification is essential to effectively reduce safety risks. Although various enhancements have substantially improved the accuracy and performance of PPE detection, challenges remain, particularly with regard to real-time operation and lightweight requirements. To tackle these issues, this study uses GhostConv [49] in place of conventional convolutions in the network backbone.
The GhostConv process involves two steps. First, a traditional convolution generates feature maps with fewer channels, using relatively few computational resources. Second, a series of simple linear operations is applied to each of these feature maps to obtain additional feature maps. Finally, the two sets of feature maps are concatenated to form the final feature map.
In Figure 9b, the input is a feature map X, which undergoes an initial standard convolution operation to produce Y:
$$Y = X * f$$
Here, $X \in \mathbb{R}^{C \times H \times W}$, with C denoting the number of channels, H the height, and W the width. The symbol * denotes the convolution operation, and f represents the convolution filter for this layer.
Next, the feature maps of each channel are used to generate the Ghost feature maps $Y_{ij}$ using the $\Phi_{i,j}$ operation.
Finally, the feature maps retained through the identity mapping are concatenated with the Ghost feature maps to yield the final feature map. GhostConv eliminates redundant information in feature map fusion, enhancing model performance and significantly improving the inference speed of the network model. In contrast to traditional convolution methods, GhostConv substantially reduces the cost of learning unnecessary features through cost-effective linear operations, achieving superior performance using the same computational resources.
We replace the traditional convolutions of the original model with GhostConv. Although the accuracy decreases slightly, the model becomes more lightweight, which helps us detect problems and reduce safety risks more quickly.
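The saving can be made concrete by counting convolution weights. The sketch below is illustrative only: it assumes that half of the output channels come from an ordinary convolution and the other half from a 5×5 depthwise convolution applied to that first half, a configuration commonly used for GhostConv in YOLO code bases. This study does not state these settings, so the ratio, the kernel size of the cheap operation, and the function names are assumptions. Bias and normalization parameters are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (no bias)."""
    return (c_in // groups) * c_out * k * k


def ghost_conv_params(c_in, c_out, k, cheap_k=5):
    """Weight count of a GhostConv-style layer under the stated assumptions."""
    primary = c_out // 2  # channels produced by the ordinary convolution
    # Step 1: ordinary convolution producing only half of the output channels.
    step1 = conv_params(c_in, primary, k)
    # Step 2: cheap depthwise operation generating the remaining "ghost" half.
    step2 = conv_params(primary, primary, cheap_k, groups=primary)
    return step1 + step2


standard = conv_params(64, 128, 3)
ghost = ghost_conv_params(64, 128, 3)
print(standard, ghost, round(standard / ghost, 2))
```

For a hypothetical layer with 64 input channels, 128 output channels, and a 3×3 kernel, the GhostConv-style layer needs roughly half the weights of the standard convolution.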

Experimental Datasets
In an industrial environment, the timely and precise detection of whether workers are wearing PPE is essential to ensure their safety. This allows us to take necessary measures promptly to mitigate potential risks. However, as of now, no publicly accessible dataset encompasses the many types of PPE. Therefore, this research focuses on detecting the usage of PPE by workers in an industrial environment. We assembled a dataset comprising four PPE categories (safety helmets, safety goggles, safety vests, and masks) by gathering pertinent images. This dataset primarily originates from two sources. The first source consists of close-up video images captured by temporarily deployed cameras. The second source includes wide-angle video images captured by surveillance cameras in various factory workshops. These HIKVISION brand surveillance cameras all feature a 4-megapixel resolution. Images were sampled at a rate of one frame every 5 s and were collected from multiple distinct industrial settings. A sampling example is shown in Figure 10.
After data collection, we cleaned and filtered the data and applied data augmentation techniques. These techniques encompassed independent object cropping, horizontal flipping, exposure adjustment, and the introduction of Gaussian noise. This entire process generated 4000 images, comprising both original and augmented variants. Subsequently, we partitioned these samples into training, validation, and testing sets at an 8:1:1 ratio, which accounted for 3200, 400, and 400 images, respectively. Before training, we used the LabelImg annotation software to generate text files containing image paths, annotated regions, and label types. The number of labels in each category and the distribution of the x, y coordinates of the center points of the target boxes are shown in Figure 11.
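The 8:1:1 partition can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name, the fixed seed, and the file names are hypothetical, and the text does not say whether the split was random.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 4000 collected images.
images = [f"img_{i:04d}.jpg" for i in range(4000)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

With 4000 items, the three subsets contain 3200, 400, and 400 images, matching the counts reported above.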

Experimental Environments
All experiments in this study were performed on a Linux-based computer equipped with a 12th Gen Intel(R) Core(TM) i9-12900K CPU and an NVIDIA GeForce RTX 4090 GPU boasting 32 GB of VRAM. The software environment was based on Python 3.8, utilizing PyTorch 2.0 as the development framework. The batch size and the number of epochs were set to 32 and 300, respectively.

Evaluation Metrics
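We selected metrics such as precision, recall, F1 score, and mAP to assess the detection model's performance. The specific calculation formulas are as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
TP, FP, FN, and TN stand for true positives, false positives, false negatives, and true negatives, respectively. P(R) denotes precision as a function of recall, and N is the number of categories. AP signifies the average detection precision for an individual category, whereas mAP signifies the average detection precision across all categories.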
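A minimal sketch of how these metrics can be computed from detection counts is given below. The function names and the example AP values are hypothetical, and the AP routine uses all-point interpolation of the precision-recall curve, which is one common convention and an assumption here, since the text does not specify the interpolation scheme.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (recalls in ascending order)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-category AP values, for illustration only.
example_aps = {"helmet": 0.9, "goggles": 0.8, "vest": 0.95, "mask": 0.85}
print(mean_average_precision(example_aps))
```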
To assess the model's lightweight characteristics, we used four key evaluation metrics: parameters, FLOPs, weight, and inference time. The number of parameters directly determines the model's complexity: a higher parameter count places greater demands on computational resources for training and inference. FLOPs quantifies the floating-point operations performed during the model's inference, and higher FLOPs values typically signify a greater need for computational resources during inference. Weight measures the size of the model's weight file, which is closely tied to storage and transmission efficiency; a smaller weight file indicates a lighter model. Inference time denotes the duration required for the model to process a single image. Shorter inference times are paramount for real-time applications, reflecting the model's ability to perform object detection quickly.
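Two of these quantities can be estimated with a few lines of code. The sketch below is hypothetical: it approximates the weight size as the parameter count times the bytes per parameter (4 bytes assumes 32-bit floats, and a real checkpoint file may also hold other data), and it measures mean per-image latency for any callable named infer, which stands in for a model's forward pass.

```python
import time


def weight_size_mb(num_params, bytes_per_param=4):
    """Approximate weight size in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


def mean_inference_time_ms(infer, images, warmup=2):
    """Average wall-clock time per image, in milliseconds."""
    for image in images[:warmup]:
        infer(image)  # warm-up calls are not timed
    start = time.perf_counter()
    for image in images:
        infer(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


# Illustration with a placeholder parameter count and a dummy "model".
print(weight_size_mb(3_000_000))
print(mean_inference_time_ms(lambda image: sum(image), [[1, 2, 3]] * 10))
```

On a GPU, timing code of this kind would also need to synchronize the device before reading the clock; that step is omitted here.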

Performance Analysis of the GBSG-YOLOv8n Model
We trained the YOLOv8n and GBSG-YOLOv8n models separately on the same training dataset and obtained the experimental results shown in Tables 2 and 3. The accuracy results show that the original YOLOv8n model achieved a mAP of 87.8%, while the GBSG-YOLOv8n model reached a mAP of 90.8%, an improvement of 3 percentage points. Furthermore, compared to the original YOLOv8n, the GBSG-YOLOv8n model showed a 3.1% improvement in precision, a 2.6% increase in recall, and a 2.8% increase in the F1 score. These results indicate that our proposed GBSG-YOLOv8n model shows a distinct advantage, particularly in scenarios with complex backgrounds and significant variations in target sizes.
For a more intuitive representation of the detection performance, we plotted the precision-recall curves for YOLOv8n and GBSG-YOLOv8n, which are presented in Figure 12. These curves show that the GBSG-YOLOv8n model significantly outperforms the original YOLOv8n model.
Table 3 reveals significant improvements in GBSG-YOLOv8n compared to the original YOLOv8n model in four critical metrics: parameters, FLOPs, weight, and inference time. These results affirm the lightweight nature of GBSG-YOLOv8n for PPE detection when contrasted with YOLOv8n. Specifically, GBSG-YOLOv8n has fewer parameters, lower memory consumption, and faster computation. Furthermore, its more compact model size minimizes storage space requirements. These characteristics establish GBSG-YOLOv8n as a practical and efficient choice, particularly well-suited for PPE detection in industrial settings.

Ablation Experiment
We conducted ablation experiments to validate the effectiveness of each enhancement to the original method. Each experimental configuration was trained and validated on the same PPES dataset, and the results are presented in Table 4. The data in Table 4 demonstrate that introducing GAM into the backbone network enhanced the capability to extract essential information, resulting in a 1.4% improvement in mAP compared to the original YOLOv8n. Replacing the PANet structure in the Neck network with the BiFPN structure improved the model's feature learning ability by effectively integrating feature information across multiple scales, leading to a 2.6% improvement in mAP compared to the original YOLOv8n. The SimC2f structure achieved adaptive feature fusion, enhancing the extraction of valuable information and resulting in a 3.3% improvement in mAP. Although replacing traditional convolutions in the backbone network with GhostConv led to a slight decrease in accuracy, the overall accuracy remained significantly higher than that of the original model. Additionally, four key metrics, namely parameters, FLOPs, weight, and inference time, all exhibited substantial reductions, signifying a more streamlined and lightweight model that boosts computational efficiency and reduces resource requirements. The experimental results confirm that each improvement introduced in this study enhanced performance compared to that of the original YOLOv8n model. For a more intuitive presentation of these findings, we created bar charts illustrating the results, which are shown in Figure 13.
These bar charts show that the detection accuracy has significantly improved and that the complexity has noticeably decreased. This further validates the performance of the proposed GBSG-YOLOv8n model for real-time PPE detection.

Experiments Comparing GBSG-YOLOv8n to Other Models
To conduct a more in-depth evaluation of the effectiveness of the GBSG-YOLOv8n model, we performed comparative experiments using mainstream algorithms, and Table 5 presents the results regarding PPE detection performance. The analysis of these results shows that, as the model versions are upgraded, the detection performance gradually improves in terms of precision, recall, F1 score, and mAP, whereas model complexity varies from model to model. SSD adopts end-to-end training and multi-scale detection, showing significant improvements in various aspects of performance compared to Faster R-CNN. The YOLO series introduced strategies such as single-stage detection, multi-scale detection, and the prediction of multiple bounding boxes, all while reducing model complexity. YOLO also offers considerable performance improvements over Faster R-CNN and SSD. YOLOv5 incorporated CSPDarkNet53 as its backbone, along with the focus module and the PANet structure, enhancing detection accuracy and performance. YOLOv7 introduced E-ELAN and a deeper network structure, achieving greater speed and accuracy, but increasing model computation and parameters due to the more complex network. In contrast, our proposed model, GBSG-YOLOv8n, builds upon YOLOv8n and significantly enhances performance by introducing GAM into the backbone network, optimizing the PANet structure, utilizing the SimC2f structure, and incorporating GhostConv. GBSG-YOLOv8n surpasses the other mainstream models in precision, recall, F1 score, and mAP. Additionally, it significantly outperforms the other models in terms of parameters, FLOPs, and weight, demonstrating its lightweight nature. This reaffirms the ability of GBSG-YOLOv8n to provide reliable support for PPE detection in industrial environments. To present these results more clearly, we created a series of model comparison charts (refer to Figure 14).

Practical Applications of GBSG-YOLOv8n in Industrial Environments
To apply the GBSG-YOLOv8n model in an industrial environment, we deployed it on the cloud server of a PPE safety monitoring system. We designed and constructed this system using a microservices architecture; its primary task is to monitor whether workers in industrial settings are wearing PPE. In the factory, we deployed multiple network cameras at various locations to continuously monitor the PPE usage of workers in real time. For instance, at the entrance points of the factory (marked with red boxes in Figure 15a), the network cameras transmit real-time images of workers over the network to the cloud server, where the deployed GBSG-YOLOv8n model performs detection. If a worker's PPE complies with the requirements (as shown in Figure 15c), the system displayed on the large screen records the worker's information, obtained through the access control system, together with the entry time, and the worker is granted access to the factory. If a worker's PPE does not meet the requirements (as shown in Figure 15d), the system likewise records the worker's information and entry time, triggers an alarm, and logs the violation. The entire system deployment is illustrated in Figure 15b. Practical applications in industrial settings demonstrate that the use of the GBSG-YOLOv8n model has a positive impact on enhancing industrial safety.
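The decision logic described above can be sketched as follows. This is a hypothetical illustration rather than the deployed system: the label strings, the record fields, and the assumption that all four PPE categories are required at an entrance are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed set of required items; the real site policy may differ.
REQUIRED_PPE = frozenset({"helmet", "goggles", "vest", "mask"})


@dataclass
class EntryRecord:
    worker_id: str          # obtained from the access control system
    entry_time: datetime
    missing: tuple          # required items that were not detected
    alarm: bool             # True when a violation must be raised and logged


def evaluate_entry(worker_id, detected_labels, entry_time, required=REQUIRED_PPE):
    """Compare detected PPE labels with the required set and build a record."""
    missing = tuple(sorted(set(required) - set(detected_labels)))
    return EntryRecord(worker_id, entry_time, missing, alarm=bool(missing))


ok = evaluate_entry("w-001", ["helmet", "goggles", "vest", "mask"],
                    datetime(2023, 1, 1, 8, 0))
bad = evaluate_entry("w-002", ["helmet", "vest", "mask"],
                     datetime(2023, 1, 1, 8, 5))
print(ok.alarm, bad.alarm, bad.missing)
```

Every entry is recorded in both cases; only a non-empty list of missing items raises the alarm and marks the record as a violation.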
While our model has been widely applied in industrial environments and has performed excellently in most cases, certain limitations emerged through multiple rounds of experimental testing and practical application. We have observed that, in specific situations, the viewing angle determined by the relative positions of the worker and the image acquisition equipment may cause our model to erroneously identify eyeglasses as safety goggles.
Figure 16 shows a series of real-time actions of workers in the detection field. The results for Figure 16a,b,d are correct. However, as shown in Figure 16c, when a worker wearing an orange safety vest turns his/her whole body to the other side and is observed from a side angle, our model erroneously classifies the ordinary glasses worn by the middle worker as safety goggles. Although eyeglasses and safety goggles may appear very similar, they differ significantly in purpose and nature. To address this limitation, we will focus in the future on expanding the scenarios in the PPES dataset to enhance the model's generalization capability.

Discussion
Compared to other visual detection tasks, detecting whether industrial workers are wearing PPE presents a unique set of challenges. First, the diverse types of PPE that require detection, with substantial differences in size, render the target detection task quite challenging. Moreover, within the complex and ever-changing industrial environment, various potential sources of interference further increase the complexity of detection. If workers fail to wear PPE correctly and this is not detected promptly, it may lead to severe safety issues, even tragic consequences. Hence, the accurate and timely detection of PPE compliance is vital to ensure worker safety and mitigate potential industrial risks. As a replacement for traditional manual inspections, computer vision technology has proven to be an efficient solution, offering automated detection methods. With it, we can accurately identify potential issues at an early stage, allowing for early warnings to mitigate potential dangers. In addition, this approach significantly reduces false alarms and omissions, enhancing detection accuracy and reliability.
In this study, we selected the YOLOv8n model from the YOLOv8 family due to its reduced parameter size and heightened detection accuracy. However, recognizing the task's complexity and the specific limitations of YOLOv8n, we proposed the GBSG-YOLOv8n model to detect whether workers correctly wear PPE in an industrial environment. By refining the backbone and Neck networks, we achieved a 3% improvement in mAP while making the model more lightweight.
We also conducted experiments on the same PPES dataset to compare the performance of the GBSG-YOLOv8n model with that of transformer-based models and other high-performing models. The specific results are shown in Table 6. These results once again highlight the performance of our proposed GBSG-YOLOv8n model in target detection. This performance equips the model for various potential applications in industrial production environments. The GBSG-YOLOv8n model enables us to quickly and accurately detect whether workers are correctly wearing PPE, effectively reducing workplace safety risks.
Furthermore, our research holds significant value not only in ensuring worker safety in industrial environments, but also in its applicability across various potential domains. In the medical field, our PPE detection technology can ensure that healthcare workers and patients correctly wear appropriate PPE, such as masks and protective gowns, especially when dealing with infectious patients. Healthcare-associated infections are a severe concern in medical facilities, which gives our research significant potential for reducing infection risks. In the field of transportation, police and traffic management personnel can also benefit from our PPE detection technology, which helps ensure their use of safety vests, helmets, and other necessary PPE, reducing the risk of road traffic accidents. This is crucial for improving traffic safety and reducing accident rates. In the military and emergency services sectors, PPE detection technology also plays a vital role in ensuring that soldiers and rescue personnel wear PPE correctly, helping to protect their safety. Finally, our research can be applied to environmental monitoring to ensure that researchers and workers wear appropriate PPE when handling hazardous substances or working in contaminated environments, thus reducing environmental pollution and occupational risks.
Therefore, our PPE detection technology shows broad prospects for application in various fields. It not only enhances safety in workplaces and other specific environments, but also helps to reduce the risk of accidents. This is crucial for individual protection and improves societal safety and health.

Conclusions
In this study, we propose a new and improved PPE detection model, called GBSG-YOLOv8n, and construct a dedicated PPES dataset to better meet the challenges of PPE detection in industrial environments. First, we overcome the limitations in extracting PPE target features by introducing the GAM, which retains channel and spatial information, enhances cross-dimensional interactions, and significantly improves the feature extraction capabilities of the backbone network, notably enhancing detection performance. Second, we optimize the fusion of multi-scale target information by replacing the original PANet structure with the BiFPN structure, which effectively integrates feature information from different scales, prevents the loss of PPE feature information, and improves detection accuracy. Third, the SimAM attention mechanism is introduced into the C2f module to form the proposed SimC2f structure. This enhancement enables the model to process image features more efficiently, resulting in a notable improvement in detection efficiency. Finally, GhostConv is used to replace the traditional convolution in the backbone network, which reduces model complexity and makes the model more lightweight while maintaining detection accuracy.
The experimental results demonstrate the strong performance of the PPE detection model proposed in this study. Compared with the original YOLOv8n model, it achieves a 3% improvement in mAP with a significant reduction in model complexity, and it also offers substantial advantages over mainstream models. This model not only satisfies the demands of real-time safety monitoring in industrial settings, but also provides significant value in safeguarding workers and mitigating potential industrial hazards. In future research, we plan to continue enhancing the model's performance, making it applicable to more complex and diverse scenarios, expanding its utility to broader domains, and deploying it smoothly across multiple systems.

Figure 5. The architecture of BiFPN.

Due to the varying resolutions of input feature maps, BiFPN assigns weights to each additional feature layer during feature fusion, adapting them based on their contribution to the network. The model training process emphasizes learning the features with significant weight allocations and performing multi-scale feature fusion through multiple iterations. The weighted formula for BiFPN is as follows:
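Assuming the fast normalized fusion of the original BiFPN design, the weighting can be written as O = sum_i (w_i / (eps + sum_j w_j)) * I_i, where I_i are the input feature maps resized to a common resolution, w_i are learnable weights kept non-negative by a ReLU, and eps is a small constant for numerical stability. The following is a minimal sketch of that weighting on flattened feature maps. The function name, the list-based representation, and the value eps = 1e-4 are illustrative assumptions, not details taken from this paper.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors with normalized, non-negative weights.

    features: list of equal-length lists of floats (hypothetical flattened
              feature maps already resized to a common resolution).
    weights:  one scalar weight per feature map (learned during training
              in a real network; plain floats here).
    eps:      small constant that avoids division by zero (assumed value).
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature map")
    if len({len(f) for f in features}) != 1:
        raise ValueError("all feature maps must have the same length")

    # ReLU keeps every weight non-negative so the normalization is stable.
    relu_w = [max(0.0, w) for w in weights]
    denom = eps + sum(relu_w)

    # Weighted sum of the inputs, element by element.
    size = len(features[0])
    return [
        sum(w / denom * f[j] for w, f in zip(relu_w, features))
        for j in range(size)
    ]


if __name__ == "__main__":
    p_high = [1.0, 2.0]   # hypothetical feature map from one level
    p_low = [3.0, 4.0]    # hypothetical feature map from another level
    print(fast_normalized_fusion([p_high, p_low], [1.0, 1.0]))
```

With equal weights the result is close to the element-wise mean of the inputs. A weight driven to zero or below removes its input from the fusion, which is how the network can learn to favor the feature levels that contribute most.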

Figure 11. (a) PPE category diagram; (b) distribution plot of x and y coordinates.

Figure 13. Ablation experiment results: (a) bar chart for accuracy; (b) bar chart for parameters; (c) bar chart for FLOPs; (d) bar chart for weight; (e) bar chart for inference time.

Figure 14. Comparative experimental results: (a) line graph showing accuracy; (b) bar chart for parameters; (c) bar chart for FLOPs; (d) bar chart for weight.

Figure 15. (a) Factory layout; (b) system deployment diagram; (c) system display interface for workers wearing PPE correctly; (d) system display interface for workers wearing PPE incorrectly.

Figure 16. (a-d) Real-time monitoring images of the site.

Table 1. Comparison analysis of common models in the PPE field.

Table 2. Comparison of model accuracy.

Table 3. Comparison of model complexity.

Table 6. Comparison of performance results.