Article

Lightweight Helmet-Wearing Detection Algorithm Based on StarNet-YOLOv10

by Hongli Wang 1, Qiangwen Zong 2, Yang Liao 2, Xiao Luo 2, Mingzhi Gong 2, Zhenyao Liang 1, Bin Gu 2,* and Yong Liao 3,*

1 Yunnan Chihong Zinc & Germanium Co., Ltd., Qujing 655000, China
2 Three Gorges High-Tech Information Technology Co., Ltd., Yichang 443002, China
3 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
* Authors to whom correspondence should be addressed.
Processes 2025, 13(4), 946; https://doi.org/10.3390/pr13040946
Submission received: 26 February 2025 / Revised: 15 March 2025 / Accepted: 19 March 2025 / Published: 22 March 2025

Abstract:
The safety helmet is an essential piece of equipment for construction workers and plays an important role in protecting their lives. However, many construction workers still neglect to wear helmets. Real-time, high-precision intelligent detection of helmet wearing by construction workers is therefore crucial. To this end, this paper proposes a lightweight helmet-wearing detection algorithm based on StarNet-YOLOv10. Firstly, the StarNet network structure replaces the backbone of the original YOLOv10 model while the original Spatial Pyramid Pooling Fast (SPPF) and Partial Self-attention (PSA) parts are retained. Secondly, the C2f module in the neck network is optimised by combining the PSA attention module with the GhostBottleneckv2 module, which improves the extraction of feature information and the expressive ability of the model. Finally, the head network is optimised by introducing the Large Separable Kernel Attention (LSKA) mechanism to improve the detection accuracy and efficiency of the detection head. The experimental results show that, compared with the existing Faster R-CNN, YOLOv5s, YOLOv7-tiny, YOLOv8, YOLOv9, and original YOLOv10 models, the StarNet-YOLOv10 model proposed in this paper offers considerable improvements in precision, recall, mean average precision (mAP), computational cost, and frame rate, achieving a precision of 83.36%, a recall of 81.17%, and an mAP of 78.66%. Meanwhile, compared with the original YOLOv10 model, the proposed model improves precision by 1.7%, recall by 1.62%, and mAP by 4.43%. The proposed model therefore meets the detection requirements for helmet wearing and can effectively reduce the safety hazards caused by not wearing helmets on construction sites.

1. Introduction

With the rapid development of urbanisation, construction safety in the building industry and related industries is a growing concern. As a necessary component of personal protective equipment, the safety helmet can effectively reduce the risk of head injury and is one of the key measures to protect workers’ lives [1]. However, in the actual production environment, owing to poor supervision and workers’ weak safety awareness, the violation of not wearing a helmet is common, which poses a serious hidden danger to construction safety. Traditional manual monitoring is not only inefficient but also unable to provide all-weather, all-round real-time supervision, and therefore cannot meet the strict requirements of safety management on smart construction sites.
Target detection technology plays an important role in smart construction site safety monitoring, and the most widely used approach is the YOLO (You Only Look Once) algorithm, first proposed by Joseph Redmon et al. in 2015 [2] and known for its fast detection speed and end-to-end real-time processing capability. It avoids the complex candidate box extraction and multiple detection steps of traditional methods by transforming the target detection task into a single regression problem, predicting the bounding boxes and class probabilities of targets directly from the image pixels. The core strength of YOLO lies in its concise network structure and high computational efficiency, which enable it to detect multiple targets quickly and accurately in real-time scenarios. From the initial YOLOv1 to today’s YOLOv11, the algorithm has been continuously optimised in structure, algorithm, and detection performance, further improving detection accuracy and robustness, and it is widely used in real-time surveillance, autonomous driving, and robot vision.
The YOLOv10 algorithm was proposed by Wang et al. [3] in 2024. As a newer version of the YOLO family, it substantially optimises the network structure, loss function, and other modules on the basis of YOLOv8 and YOLOv9, improving the model’s ability to detect small and dense targets while making detection more efficient, real-time capable, and lighter, and it has been increasingly applied to target detection in various fields. For example, study [4] applies YOLOv10 to electric vehicle helmet-wearing detection and proposes a target detection algorithm based on an improved YOLOv10n, which introduces the BIMAFPN feature pyramid structure and replaces the CIoU loss function of YOLOv10 with an Inner-Wise-MPDIoU loss function, improving detection accuracy while making the model considerably lighter. Study [5] applies YOLOv10 to road safety detection and proposes a vehicle and pedestrian detection method based on YOLOv10s, which introduces the Compact Inverted Block (CIB) module and the Partial Self-attention (PSA) mechanism to improve detection accuracy and overall detection performance. Study [6] applies YOLOv10 to food detection and proposes a lightweight yellow cauliflower grading detection model based on an improved YOLOv10, which designs a new AKVanillaNet backbone and embeds the DySnakeConv module into the C2f structure, greatly enhancing feature extraction and achieving accurate detection and grading of cauliflowers. Study [7] applies YOLOv10 to the biomedical field and proposes a multi-focus cell image fusion algorithm based on an improved YOLOv10, which replaces the convolutional layers of the YOLOv10 backbone and neck with an Adaptive Feature Extractor (AFE) module and adds a Hybrid Feature Aggregation Block (HFAB) to the neck to improve the detection accuracy for small targets. Study [8] applies YOLOv10 to helmet detection and proposes a small target detection algorithm for underground helmets based on REIW-YOLOv10n, designing two new structures, RepNMSC and ERepGFPN, which use cross-layer connections to process high-level semantic information and low-level spatial information with equal priority, greatly improving the accuracy of small target detection. Study [9] proposes a helmet-wearing detection algorithm based on YOLOv8-ADSC, which uses Adaptive Spatial Feature Fusion (ASFF) and Deformable Convolutional Network version 2 (DCNv2) to augment the detection head so that the network can capture multi-scale information about the target more efficiently; a lightweight Content-Aware Reassembly of Features (CARAFE) up-sampling module is also used to enlarge the receptive field, reduce the information loss caused by up-sampling, and improve the accuracy and robustness of target detection. However, these improved YOLOv10-based models are still not lightweight enough, and although they can meet the requirements of small target detection, they are less effective for dynamic monitoring and multi-target, multi-scale detection.
Therefore, based on the above research, this paper applies the YOLOv10 algorithm to helmet-wearing detection on construction sites and proposes an intelligent detection algorithm based on StarNet-YOLOv10. The algorithm uses the StarNet network to replace the backbone of YOLOv10, which makes the model considerably lighter, and also introduces the PSA attention mechanism and the GhostBottleneckv2 module to optimise the C2f module, as well as the LSKA attention mechanism to optimise the detection head, further enhancing target detection performance.
The remainder of the paper is organised as follows: Section 2 presents the overall structure of the proposed StarNet-YOLOv10-based helmet-wearing detection algorithm and the optimised functional modules; Section 3 presents and analyses the experiments and results, in which the proposed algorithm and representative algorithms are compared experimentally on the performance indicators; and Section 4 summarises the paper and outlines future work.

2. StarNet-YOLOv10 Model

2.1. Overall Model Design

The overall structure of the StarNet-YOLOv10-based lightweight helmet-wearing detection algorithm designed in this paper is shown in Figure 1. Firstly, the StarNet network structure is introduced to replace the backbone of the original YOLOv10 model, while the original SPPF and PSA parts are retained. Then, the PSA attention mechanism and the GhostBottleneckv2 module are introduced to optimise the C2f module in the neck network, improving the feature extraction ability and extracting richer feature information. Finally, the LSKA attention mechanism is introduced in the detection head to further improve the detection performance of the model.

2.2. StarNet-Based Backbone Network

StarNet [10] is an efficient neural network architecture based on the star operation, whose core idea is to map low-dimensional features into a high-dimensional nonlinear feature space through element-wise multiplication, thereby significantly improving the expressive ability of the model while maintaining computational efficiency. The specific structure is shown in Figure 2. The input image first undergoes initial feature extraction in the first convolutional layer, which is followed by batch normalisation (BN) and the ReLU6 activation function to accelerate training and introduce nonlinearity. More complex feature information is then gradually extracted through four further stages, each of which contains a convolutional layer and a number of Star Blocks; the specific structure of the Star Block, also shown in Figure 2, comprises two depthwise separable convolutions and three fully connected layers. Finally, Global Average Pooling (GAP) or a Fully Connected (FC) layer is used to integrate the features in preparation for classification. In each of the four stages, two linearly transformed features are fused by element-wise multiplication to generate a new high-dimensional feature space. This operation is similar to a polynomial kernel function: it achieves high-dimensional feature mapping in a low-dimensional space, avoiding the traditional approach of enhancing expressive power by increasing the network width (number of channels), improving the feature extraction and expressive power of the model, and significantly reducing computational overhead to achieve a lightweight design. In a single-layer neural network, StarNet combines the weight matrix and the bias into a single entity, denoted as w = [W, B], where W denotes the weight component and B the bias term. Correspondingly, the input vector X is augmented with a constant term (usually 1) into x = [X, 1]. In this way, StarNet implements the star operation, which can be expressed by
$$ w_1^{\mathrm{T}} x \times w_2^{\mathrm{T}} x = \left( \sum_{i=1}^{d+1} w_1^{i} x^{i} \right) \times \left( \sum_{j=1}^{d+1} w_2^{j} x^{j} \right) = \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_1^{i} w_2^{j} x^{i} x^{j} = \alpha_{(1,1)} x^{1} x^{1} + \cdots + \alpha_{(d+1,d+1)} x^{d+1} x^{d+1} $$

$$ \alpha_{(i,j)} = \begin{cases} w_1^{i} w_2^{j}, & i = j \\ w_1^{i} w_2^{j} + w_1^{j} w_2^{i}, & i \ne j \end{cases} $$
where i and j index the channels and α denotes the coefficient of each term. According to the tests in [9], the number of parameters of the improved YOLOv10 model using StarNet is reduced from 16.5 M to 9.7 M, a reduction of about 41.2%, and the amount of computation is reduced from 64.0 GFLOPs to 29.9 GFLOPs, a reduction of about 53.1%; at the same time, the model inference time is reduced from 3.9 ms to 3.7 ms, a reduction of about 5.12%. StarNet can therefore meet the requirement of being lightweight while ensuring computational efficiency. Meanwhile, since StarNet mainly performs feature extraction and recognition through star operations, it does not need to rely excessively on depthwise separable convolutional kernels for feature extraction, so it can be applied to feature extraction in complex environments better than backbone networks such as MobileNet that mainly rely on convolutional kernel operations. In addition, StarNet achieves cross-layer feature fusion through an improved path aggregation network, which improves the detection accuracy of small targets compared with ShuffleNet’s channel shuffling operation.
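As a concrete illustration of the star operation described above, the following PyTorch sketch (our own simplified example, not the official StarNet implementation) fuses two linearly transformed branches of a feature map by element-wise multiplication inside a Star-Block-style unit; the layer sizes and kernel choices are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    """Simplified Star Block: two pointwise linear branches fused by
    element-wise multiplication (the 'star' operation). This is a
    hypothetical simplification of the structure described in the text."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        # depthwise convolution for local feature mixing
        self.dw = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        # two parallel linear (1x1 conv) transforms of the same input
        self.f1 = nn.Conv2d(channels, hidden, 1)
        self.f2 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.ReLU6()
        # project back to the original channel dimension
        self.g = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        identity = x
        x = self.dw(x)
        # star operation: element-wise product of two linear transforms,
        # implicitly mapping features to a high-dimensional space
        x = self.act(self.f1(x)) * self.f2(x)
        x = self.g(x)
        return x + identity  # residual connection

if __name__ == "__main__":
    block = StarBlock(64)
    out = block(torch.randn(1, 64, 80, 80))
    print(out.shape)  # torch.Size([1, 64, 80, 80])
```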

2.3. PSAGhost_C2f Module Design

PSA is an efficient attention module that can be widely applied to tasks such as image classification and target detection [11]. The PSA module forms a pyramidal feature map by concatenating the convolution results of convolution kernels of different sizes and then applies the attention mechanism to this feature map to extract richer feature information; its specific structure and flow are shown in Figure 3.
According to Figure 3, it can be seen that PSA is mainly divided into four steps. Firstly, the input 3D feature images are processed by the multiscale convolution module, namely the SPC module. The main processing is as follows: assuming the input is X, the input is first split into S parts:
$$ \mathrm{Split}(X) = \left[ X_0, X_1, \ldots, X_{S-1} \right] $$
The corresponding features are then extracted separately, and the extracted multi-scale features are spliced through Concat:
$$ F_i = \mathrm{Conv}\left( k_i \times k_i, G_i \right)\left( X_i \right), \quad i = 0, 1, \ldots, S-1 $$
$$ F = \mathrm{Concat}\left( F_0, F_1, \ldots, F_{S-1} \right) $$
Then, channel weighting is applied to each group of convolved feature maps by the SE (Squeeze-and-Excitation) module. The multi-scale channel attention vectors are then recalibrated using Softmax to obtain new attention weights that capture the interaction between multi-scale channels, and these weights are used for the weighted fusion of the multi-scale feature maps. Finally, the recalibrated weights and the corresponding feature maps are multiplied element-wise, and the output is a feature map weighted by attention over the multi-scale feature information. This can be expressed as follows:
$$ Z_i = \mathrm{SEWeight}\left( F_i \right), \quad i = 0, 1, \ldots, S-1 $$
$$ \mathrm{Attention} = \mathrm{Softmax}(Z) $$
Therefore, compared with common attention modules such as SENet and CBAM, PSA reduces computation and memory occupation through its polarisation filtering and enhancement design while improving the robustness of the model. Meanwhile, PSA considers both channel attention and spatial attention and introduces a multi-scale convolution module, which captures spatial information at different scales to expand the feature space, thus improving feature extraction and the representational ability of the model.
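To make the four processing steps above concrete, the sketch below is a simplified PyTorch re-implementation of a Pyramid-Split-Attention-style module under our own assumptions (the group count, kernel sizes, and reduction ratio are illustrative, not the authors' exact configuration): the channels are split into S groups, each group is convolved with a different kernel size, an SE branch produces per-group channel weights, and Softmax recalibrates those weights across scales before re-weighting the features.

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Squeeze-and-Excitation branch producing per-channel weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(self.pool(x))

class PSAModule(nn.Module):
    """Simplified Pyramid Split Attention: split channels into S groups,
    extract multi-scale features, weight each group with SE, then
    recalibrate the weights across groups with Softmax."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.s = len(kernel_sizes)
        assert channels % self.s == 0
        group_ch = channels // self.s
        self.convs = nn.ModuleList(
            nn.Conv2d(group_ch, group_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.se = SEWeight(group_ch)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        b, c, h, w = x.shape
        groups = torch.chunk(x, self.s, dim=1)                    # Split(X)
        feats = [conv(g) for conv, g in zip(self.convs, groups)]  # F_i
        feats = torch.stack(feats, dim=1)                         # (b, S, c/S, h, w)
        weights = torch.stack(
            [self.se(f) for f in torch.unbind(feats, dim=1)], dim=1
        )                                                         # Z_i
        weights = self.softmax(weights)                           # recalibrate across scales
        out = feats * weights                                     # element-wise re-weighting
        return out.reshape(b, c, h, w)

if __name__ == "__main__":
    psa = PSAModule(64)
    print(psa(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```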
In addition, we introduce the GhostBottleneckv2 [12] structure from GhostNetv2 in place of the Bottleneck module in the C2f module. Compared with GhostBottleneckv1, GhostBottleneckv2 is an improved inverted residual bottleneck that mainly introduces the Decoupled Fully Connected (DFC) attention module for optimisation and improves the expressive ability of the model; its general structure is shown in Figure 4. The input features are first processed by a Ghost module and a DFC attention module, then passed through normalisation and an activation function, and the two branches are multiplied element-wise to emphasise important information. The features are then processed by a depthwise separable convolution and another Ghost module to further enhance the feature representation, and finally the output is connected to the input by a residual connection, which alleviates the gradient vanishing problem and facilitates the training of deeper networks. Therefore, the introduction of GhostBottleneckv2 not only improves the feature extraction and representational ability of the model but also reduces its computational cost, achieving the goal of being lightweight.
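The following minimal PyTorch sketch illustrates the Ghost module idea that GhostBottleneckv2 builds on; it is a hypothetical simplification, and the full GhostBottleneckv2 additionally includes the DFC attention branch, the depthwise separable convolution, and the residual connection described above. Half of the output channels come from an ordinary 1 × 1 convolution, and the other half are generated by a cheap depthwise operation.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module: a set of 'intrinsic' feature maps from a
    standard 1x1 convolution, plus 'ghost' maps generated by a cheap
    depthwise convolution, concatenated along the channel dimension."""
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2, dw_kernel: int = 3):
        super().__init__()
        init_ch = out_ch // ratio          # intrinsic channels
        cheap_ch = out_ch - init_ch        # ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

if __name__ == "__main__":
    ghost = GhostModule(32, 64)
    print(ghost(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```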
Based on the above analysis, the overall structure of the final PSAGhost_C2f module is shown in Figure 5. The input feature map is first processed by a CBS module and then split and processed in parallel by multiple GhostBottleneckv2 networks; the extracted features are then merged and refined by the PSA attention module, and finally passed through another CBS module to obtain the processed output feature data.

2.4. LSKA Detection Head Design

LSKA (Large Separable Kernel Attention) [13] is a new type of attention module that inherits the design of the LKA (Large Kernel Attention) mechanism and greatly reduces computational complexity and memory requirements by decomposing the traditional two-dimensional weighted convolutional kernel into two cascaded one-dimensional separable convolutional kernels. Compared with LKA, LSKA exhibits a better speed/accuracy balance across different kernel sizes. Even as the kernel size increases, the inference speed degradation of the LSKA model is significantly lower than that of the LKA model, which suggests that LSKA can capture a wider range of image features while maintaining efficient computation and delivering similar or better performance; its specific structural design is illustrated in Figure 6.
As shown in Figure 6, the LSKA module first decomposes a K × K convolution into a (2d − 1) × (2d − 1) depthwise convolution, a K/d × K/d depthwise dilated convolution, and a 1 × 1 convolution; it then decomposes the 2D kernels of the depthwise convolution and the depthwise dilated convolution into 1D horizontal and vertical kernels, and finally connects the decomposed kernels in series. Compared with the traditional approach of enlarging the receptive field by increasing the convolution kernel size, LSKA replaces each 2D kernel with a pair of cascaded 1D kernels while introducing large-scale separable kernels. As a result, LSKA effectively captures important information when dealing with key image features such as edges, textures, and shapes, so the model can maintain high detection efficiency and accuracy at low computational complexity.
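The decomposition just described can be sketched in PyTorch as follows (an illustrative re-implementation under our own assumptions about the kernel size K and dilation d, not the authors' released code): each 2D depthwise kernel is replaced by a cascade of 1D horizontal and vertical depthwise kernels, a dilated pair captures the long-range context, and a 1 × 1 convolution produces the attention map that gates the input.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Simplified Large Separable Kernel Attention: 1D horizontal/vertical
    depthwise kernels replace each 2D depthwise kernel, followed by a 1x1
    convolution; the result gates the input features. k and d are assumed
    example values, not the paper's exact configuration."""
    def __init__(self, channels: int, k: int = 23, d: int = 3):
        super().__init__()
        local = 2 * d - 1        # local depthwise kernel size
        dilated = k // d         # dilated depthwise kernel size
        self.dw_h = nn.Conv2d(channels, channels, (1, local),
                              padding=(0, local // 2), groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (local, 1),
                              padding=(local // 2, 0), groups=channels)
        self.dwd_h = nn.Conv2d(channels, channels, (1, dilated), dilation=(1, d),
                               padding=(0, (dilated // 2) * d), groups=channels)
        self.dwd_v = nn.Conv2d(channels, channels, (dilated, 1), dilation=(d, 1),
                               padding=((dilated // 2) * d, 0), groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)  # 1x1 pointwise convolution

    def forward(self, x):
        attn = self.dw_v(self.dw_h(x))        # local context (separable 1D)
        attn = self.dwd_v(self.dwd_h(attn))   # long-range context (dilated 1D)
        attn = self.pw(attn)
        return x * attn                        # attention gating of the input

if __name__ == "__main__":
    lska = LSKA(64)
    print(lska(torch.randn(1, 64, 20, 20)).shape)  # torch.Size([1, 64, 20, 20])
```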
Therefore, compared with the original YOLOv10 detection head, applying the LSKA module to the detection head enhances the model’s detection capability for targets of different scales. At the same time, LSKA significantly reduces the number of parameters and the amount of computation while maintaining performance comparable to that of traditional large-kernel attention mechanisms, which effectively achieves a lightweight detection model.

3. Experiment and Analysis of Results

3.1. Experimental Environment and Data Preparation

The hardware environment for the experiment and the hyperparameter settings for training are shown in Table 1. The input image size is uniformly adjusted to 640 × 640, and, to allow the features of the input images to be fully extracted and recognised for classification, we select a batch size of 32, 200 training epochs, and a learning rate of 0.0001. At the same time, to prevent overfitting and speed up convergence, we set the patience (maximum number of tolerated epochs without improvement) to 50 and the rotation augmentation angle (degrees) to 25, so as to maximise the training effect.
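For reference, a training call of the following form would reproduce the hyperparameters in Table 1 if an Ultralytics-style training interface is used; the model and dataset YAML file names below are placeholders, not files provided by this paper.

```python
from ultralytics import YOLO

# "starnet-yolov10.yaml" and "shwd.yaml" are hypothetical placeholder names
# for the modified model definition and the SHWD dataset configuration.
model = YOLO("starnet-yolov10.yaml")
model.train(
    data="shwd.yaml",
    imgsz=640,        # input image size (Table 1: imgsz)
    epochs=200,       # Table 1: epochs
    batch=32,         # Table 1: batch
    lr0=0.0001,       # Table 1: initial learning rate
    patience=50,      # Table 1: early-stopping tolerance
    degrees=25,       # Table 1: rotation augmentation angle
    optimizer="Adam", # Table 1: optimiser
)
```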
The experimental dataset comes from the publicly available Safety Helmet Wearing Dataset (SHWD) [14] and contains a total of 7581 images of varying sizes, some of which are shown in Figure 7. During model training, the dataset is divided into training, validation, and test sets in a ratio of 8:1:1. The trained model is then evaluated and its performance analysed, after which it is tested in different real-life site environments and its performance indicators are analysed based on the actual test results. After the final tests, the model is deployed to the project platform for real-time safety monitoring of the site environment.
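An 8:1:1 split of the image list can be produced with a few lines of Python; the directory layout and file names below are placeholders used only for illustration.

```python
import random
from pathlib import Path

# Hypothetical 8:1:1 split of the SHWD image list; paths are placeholders.
images = sorted(Path("SHWD/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    # one image path per line, as expected by typical YOLO data configurations
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```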

3.2. Performance Indicators Selection

For the performance evaluation of this model, we use precision, recall, mean average precision (mAP), computational cost (giga floating-point operations, GFLOPs), and frames per second (FPS) as the evaluation metrics. The expressions for each metric are
$$ \mathrm{Precision} = \frac{TP}{TP + FP} $$

$$ \mathrm{Recall} = \frac{TP}{TP + FN} $$

$$ AP = \int_{0}^{1} P(R)\,\mathrm{d}R $$

$$ mAP = \frac{\sum_{i=1}^{K} AP_i}{K} $$

$$ FPS = \frac{1000}{pr + inf + pos} $$
where TP denotes the number of positive samples correctly detected, FP denotes the number of negative samples predicted as positive, and FN denotes the number of positive samples not detected. AP_i is the average precision of the ith category, and K is the number of categories. pr, inf, and pos denote the per-image preprocessing, inference, and post-processing times (in milliseconds), respectively.
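The metrics can be computed directly from these definitions; the short sketch below shows the corresponding calculations, with the per-image timings in the example chosen arbitrarily for illustration rather than measured from the model.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def mean_average_precision(ap_per_class: list[float]) -> float:
    # mAP: mean of the per-class average precision values
    return sum(ap_per_class) / len(ap_per_class)

def fps(pre_ms: float, inf_ms: float, post_ms: float) -> float:
    # 1000 ms divided by the total per-image latency
    return 1000.0 / (pre_ms + inf_ms + post_ms)

# Example with assumed per-image timings (ms): 1.2 preprocessing,
# 8.0 inference, 1.2 post-processing -> roughly 96 FPS
print(round(fps(1.2, 8.0, 1.2), 1))
```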

3.3. Analysis of Indicator Results

After simulating and training the model, we compare its detection performance with that of commonly used construction site helmet detection algorithms, namely Faster R-CNN, YOLOv5s, YOLOv7-tiny, YOLOv8, YOLOv9, and the original YOLOv10; the performance comparison results are shown in Table 2.
Based on the results, the StarNet-YOLOv10 model proposed in this paper improves precision by 9.15%, recall by 8.54%, and mAP by 9.14% compared with Faster R-CNN; precision by 12.15%, recall by 9.49%, and mAP by 11.71% compared with YOLOv5s; precision by 7%, recall by 7.05%, and mAP by 8.33% compared with YOLOv7-tiny; precision by 3.13%, recall by 2.83%, and mAP by 7.48% compared with YOLOv8; precision by 2.68%, recall by 4.76%, and mAP by 7.8% compared with YOLOv9; and precision by 1.7%, recall by 1.62%, and mAP by 4.43% compared with the original YOLOv10. At the same time, it shows large improvements over the original YOLOv10 in both GFLOPs and FPS, reducing GFLOPs by 56.02% and improving FPS by nearly 10%. These improvements therefore substantially increase the model’s detection performance and lightness, making it well suited to site safety detection. Moreover, since the images in the dataset vary in size, the results also show that the model can accurately detect input images of different sizes, indicating high robustness and generalisation. Finally, by reducing the amount of computation and the number of parameters, the detection speed is greatly improved, making the model more responsive and better suited to scenarios with high real-time requirements.

3.4. Ablation Experiment

In order to verify the effectiveness of the improved module in this paper, YOLOv10 is used as a reference model for the ablation experiments, and the results obtained are shown in Table 3 (√ indicates selected, — indicates unselected).
According to the results of these comparison experiments, introducing StarNet alone as the backbone improves precision by 1.12%, recall by 0.88%, and mAP by 2.11% compared with the original YOLOv10 model; introducing the PSA attention module alone to optimise the C2f module improves precision by 0.66%, recall by 0.61%, and mAP by 1.26%; and using the LSKA attention module alone to optimise the detection head improves precision by 0.49%, recall by 0.55%, and mAP by 1.15%. These comparisons show that all three optimisations are meaningful and each improves the detection performance of the model.
At the same time, we also carried out multi-group combination experiments. According to the results, introducing the StarNet backbone and PSA_C2f together improves precision by 1.31%, recall by 1.34%, and mAP by 3.18% compared with the original YOLOv10 model; introducing PSA_C2f and the LSKA detection head together improves precision by 0.52%, recall by 0.71%, and mAP by 1.90%; introducing the StarNet backbone and the LSKA detection head together improves precision by 1.11%, recall by 1.31%, and mAP by 3.02%; and, finally, when all the modules are introduced, the model improves precision by 1.7%, recall by 1.62%, and mAP by 4.43% compared with the original YOLOv10 model. It can be seen that each module contributes, and the performance improves further with each additional module. Meanwhile, in terms of the performance gains of the individual modules, the introduction of the StarNet backbone network is the most helpful for improving detection performance, which also demonstrates the necessity of introducing it.

3.5. Image Analysis Results

We trained the image dataset using the model proposed in this paper, and the detection results obtained are shown in Figure 8.
According to the detection results in Figure 8, the helmet detection model based on StarNet-YOLOv10 designed in this paper performs well on actual construction sites, especially in crowded areas: regardless of how many people are not wearing helmets, the model can detect them accurately. It can also identify and detect helmets correctly when crowds or other objects partially obscure them, such as the umbrella obscuring the helmet in Figure 8c and the denser crowd obscuring helmets in Figure 8b. Moreover, the model does not need to rely on face recognition; as shown in Figure 8a, detecting the head alone is sufficient to determine whether a helmet is worn. The detection efficiency is high, and the model has high practical and application value.
At the same time, we used the same dataset to compare the proposed model with the YOLOv10 model before optimisation; the comparison test results are shown in Figure 9.
According to the comparison results in Figure 9, when facing heavier occlusion, the detection performance of the StarNet-YOLOv10-based helmet detection model designed in this paper is clearly better than that of the YOLOv10 model before optimisation. For example, as shown in Figure 9a, in the case of occlusion by the crowd, the model designed in this paper detects people without helmets with higher confidence and better recognition and classification performance. In Figure 9b, where people wearing helmets are detected in the presence of large buildings, the detection results show that the proposed model again identifies helmet wearers with higher confidence and better recognition and classification performance. The experimental results therefore show that the StarNet-YOLOv10-based helmet detection model offers a considerable improvement in detection performance over the original YOLOv10 and can be applied well to the detection of helmet wearing on actual construction sites. Finally, we deployed the proposed StarNet-YOLOv10 algorithm on an NVIDIA Jetson Nano and successfully realised real-time monitoring of safety helmets in construction site environments, achieving 15 W low-power operation, a frame rate of 80–90 FPS, and graphics memory consumption reduced to 1.5 GB.

4. Summary and Future Work

4.1. Summary

In this paper, a lightweight helmet-wearing detection algorithm based on StarNet-YOLOv10 is proposed. Firstly, the StarNet network structure was introduced to replace the backbone of the original YOLOv10 model while retaining the original SPPF and PSA parts. Then, the PSA attention mechanism and the GhostBottleneckv2 module were introduced to optimise the C2f module in the neck network, improving the feature extraction ability and extracting richer feature information. Finally, the LSKA attention mechanism was introduced in the detection head to further improve the detection performance of the model. The extensive experimental results show that, compared with YOLOv5s, YOLOv8, and the other mainstream helmet detection models considered, the proposed StarNet-YOLOv10 model significantly improves precision, recall, mean average precision, computational cost, and frame rate, achieving a precision of 83.36%, a recall of 81.17%, and a mean average precision of 78.66%. Meanwhile, compared with the original YOLOv10 model, the proposed model improves precision by 1.7%, recall by 1.62%, and mAP by 4.43%. The model therefore achieves very high detection accuracy with a smaller number of parameters and less computation, meets today’s requirements for helmet-wearing detection in complex construction environments oriented towards intelligent construction sites, and has high practical and application value.

4.2. Future Work

Because the training dataset lacks more extreme scenarios, such as storms and other extreme weather events including typhoons and snowstorms, more such scenarios will be collected for training in subsequent work to improve detection accuracy in extreme conditions and achieve higher application value. At the same time, an even more lightweight structure can be adopted to improve the real-time performance and computational speed of the model in order to achieve faster and more accurate helmet-wearing detection. In addition, differences in the colour, texture, and shape of different styles of helmets may challenge the model’s detection and even lead to misclassification, which remains a problem to be solved in practical deployment. Next, the loss function and model structure will be further optimised to achieve system deployment in real, complex site environments.

Author Contributions

Conceptualisation, H.W. and Y.L. (Yang Liao); methodology, Q.Z. and Y.L. (Yang Liao); software, Y.L. (Yang Liao) and X.L.; validation, Y.L. (Yong Liao); formal analysis, M.G.; investigation, Z.L. and B.G.; data curation, Y.L. (Yang Liao); writing—original draft preparation, Y.L. (Yong Liao); writing—review and editing, Y.L. (Yong Liao); visualisation, Y.L. (Yong Liao); supervision, H.W., Q.Z. and Y.L. (Yong Liao). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Yunnan Chihong Zinc & Germanium Co., Ltd. Engineering Project Management System (CHXZ-FZ-08-202311-0160).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Hongli Wang and Zhenyao Liang were employed by the company Yunnan Chihong Zinc & Germanium Co., Ltd., and authors Qiangwen Zong, Yang Liao, Xiao Luo, Mingzhi Gong and Bin Gu were employed by the company Three Gorges High-Tech Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhang, X. Research on Analysis of Influencing Factors of Building Construction Safety Management and Intelligent Control. Ph.D. Thesis, Zhejiang University, Hangzhou, China, 2021. [Google Scholar]
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640. [Google Scholar]
  3. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  4. Zhou, X.; Wang, K.; Zhou, X.; Han, J. Electric Vehicle Helmet Wearing Detection Algorithm Based on Improved YOLOv10n. Electron. Meas. Technol. 2025, 1–12. Available online: http://kns.cnki.net/kcms/detail/11.2175.tn.20250206.1353.030.html (accessed on 17 February 2025).
  5. Shen, J.; Zeng, Q.; Gao, Y.; Deng, N. Vehicle and Pedestrian Detection Method Based on YOLOv10s. Comput. Knowl. Technol. 2024, 20, 25–27,39. [Google Scholar]
  6. Jin, X.; Liang, X.; Deng, P. A Lightweight Yellow Cauliflower Grading Detection Model Based on Improved YOLOv10. Intell. Agric. 2024, 6, 108–118. [Google Scholar]
  7. Chen, M.; Wang, J.; Wang, L.; Zhang, Q.; Xu, W.; Chen, W. Multi-Focus Cell Image Fusion Method Based on Target Recognition. Adv. Lasers Optoelectron. 2025, 1–19. Available online: http://kns.cnki.net/kcms/detail/31.1690.TN.20250213.1003.006.html (accessed on 17 February 2025).
  8. Gao, L.P.; Zhou, M.R.; Hu, F.; Bian, K.; Chen, Y. Small Target Detection Algorithm for Underground Helmet Based on REIW-YOLOv10n. Coal Sci. Technol. 2025, 1–13. Available online: http://kns.cnki.net/kcms/detail/11.2402.td.20240919.1902.003.html (accessed on 17 February 2025).
  9. Wang, J.; Sang, B.; Zhang, B.; Liu, W. A Safety Helmet Detection Model Based on YOLOv8-ADSC in Complex Working Environments. Electronics 2024, 13, 4589. [Google Scholar] [CrossRef]
  10. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2024; pp. 5694–5703. [Google Scholar]
  11. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An Efficient Pyramid Split Attention Block on Convolutional Neural Network. arXiv 2021, arXiv:2105.14447. [Google Scholar]
  12. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance Cheap Operation with Long-Range Attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  13. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar]
  14. Zhang, G.; Zhou, J.; Ma, G.; He, H. Lightweight Safety Helmet Wearing Detection Algorithm of Improved YOLOv8. Electron. Meas. Technol. 2024, 45, 147–154. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [PubMed]
  16. Hu, Y.; Hu, S.; Liu, S.; Song, X. Helmet Wear Detection Based on YOLOv5s Algorithm. Inf. Technol. 2025, 1, 61–67. [Google Scholar]
  17. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  18. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 10–12 March 2024; pp. 1–6. [Google Scholar]
  19. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2024; pp. 1–21. [Google Scholar]
Figure 1. Overall structural design of the StarNet-YOLOv10 model.
Figure 2. Schematic diagram of StarNet’s structure.
Figure 3. PSA module flowchart.
Figure 4. GhostBottleneckv2 structure diagram.
Figure 5. PSAGhost_C2f module flowchart.
Figure 6. LSKA detection head schematic structure.
Figure 7. Partial image dataset presentation.
Figure 8. Detection results. Subfigures (a–d) show detection results for crowded and occluded scenarios.
Figure 9. Comparison of test results between the proposed model and the YOLOv10 model. Subfigure (a) shows the comparison results in a crowded area, and subfigure (b) shows the comparison results in an occluded area.
Table 1. Experimental environment and parameter settings.

Device and Parameters | Value
Operating system | Windows 11
CPU | Intel i5-12600KF
Memory | 16 GB
Graphics card | NVIDIA RTX 4060 Ti
Deep learning framework | PyTorch 2.6
Epochs | 200
Patience | 50
Batch size | 32
IoU threshold | 0.5
Image size (imgsz) | 640
Rotation augmentation (degrees) | 25
Initial learning rate (lr0) | 0.0001
Optimiser | Adam
Table 2. Comparative results of testing performance.

Model | Precision | Recall | mAP | GFLOPs | FPS
Faster R-CNN [15] | 74.21% | 72.63% | 69.52% | 32.8 | 76.5
YOLOv5s [16] | 71.21% | 71.68% | 66.95% | 14.6 | 72.3
YOLOv7-tiny [17] | 76.36% | 74.12% | 70.33% | 18 | 78
YOLOv8 [18] | 80.23% | 78.34% | 71.18% | 34.2 | 80.1
YOLOv9 [19] | 80.68% | 76.41% | 70.86% | 56.7 | 78.3
YOLOv10 [3] | 81.66% | 79.55% | 74.23% | 24.1 | 87.9
StarNet-YOLOv10 | 83.36% | 81.17% | 78.66% | 10.6 | 96.4
Table 3. Comparative results of ablation experiments.

StarNet | PSA_C2f | LSKA | Precision | Recall | mAP
√ | — | — | 82.78% | 80.43% | 76.34%
— | √ | — | 82.32% | 80.16% | 75.49%
— | — | √ | 82.15% | 80.10% | 75.38%
√ | √ | — | 82.97% | 80.89% | 77.41%
— | √ | √ | 82.18% | 80.26% | 76.13%
√ | — | √ | 82.77% | 80.86% | 77.25%
√ | √ | √ | 83.36% | 81.17% | 78.66%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

