Research and Optimization of a Lightweight Refined Mask-Wearing Detection Algorithm Based on an Attention Mechanism

Abstract: To address the current problems of the incomplete classification of mask-wearing detection data, the missed detection of small targets, and the insufficient feature extraction capabilities of lightweight networks dealing with complex faces, this paper presents a lightweight mask-wearing detection method with an attention mechanism. This study incorporated an "incorrect_mask" category into the dataset to address the incomplete classification. Additionally, the YOLOv4-tiny model was enhanced with an extra prediction feature layer and feature fusion, expanding the detection scale range and improving the performance on small targets. A CBAM attention module was then introduced into the feature enhancement network, which re-screened the feature information of the region of interest to retain important feature information and improve the feature extraction capabilities. Finally, a focal loss function and an improved mosaic data enhancement strategy were used to enhance the target classification performance. The experimental results of classifying three objects demonstrate that the lightweight model's detection speed was not compromised while achieving a 2.08% increase in the average classification precision, which was only 0.69% lower than that of the YOLOv4 network. Therefore, this approach effectively improves the mask-wearing detection performance of the lightweight network.


Introduction
Since the emergence of the novel coronavirus, wearing a mask has become a norm to prevent the virus from spreading [1,2]. Recently, automatic detection has become the mainstream method for monitoring mask wearing, primarily due to the rapid advancements in artificial intelligence technology. Existing detection algorithms mainly fall into two categories: two-stage algorithms such as R-CNN [3] and Fast R-CNN [4], and one-stage algorithms such as SSD [5] and the YOLO [6–10] series. These algorithms make it possible to automate the monitoring of mask wearing with artificial intelligence technology.
Among mask datasets, the RMFD [11] dataset proposed by Wuhan University is a valuable resource for recognition tasks using masked-face data. However, it does not include a category for masks worn incorrectly in real-world environments. An alternative dataset, the IMFD [12], covers incorrectly worn masks but was synthesized by an algorithm and may not perform well in real-world detection tasks. Vrigkas et al. [13] proposed the publicly available annotated image database "FaceMask", which performs well. However, the dataset only provides a binary classification into "Mask" and "No_Mask" categories.
Regarding mask-wearing detection algorithms, Niu et al. [14] incorporated an attention mechanism into the RetinaFace [15] face detection algorithm, improving mask-wearing detection performance. However, the algorithm's detection speed is relatively low, making it difficult to meet real-time requirements. Wang et al. [16] improved the YOLOv3 algorithm by introducing the spatial pyramid pooling structure [17] for mask detection, achieving a small improvement in detection accuracy and speed. However, the algorithm struggles with complex mask detection tasks because the backgrounds of its datasets are relatively uniform, and the detection speed is still unsatisfactory. Zhu et al. [18] integrated the spatial pyramid pooling structure and the path aggregation network [19] into the YOLOv4-tiny algorithm to increase the detection accuracy for mask detection tasks. However, after adding the two structures, the algorithm suffered many missed detections of small targets in complex mask detection scenarios as well as decreased speed. Kocacinar et al. [20] proposed a lightweight facial mask recognition system deployed on mobile devices for face mask detection and identity recognition, achieving 99.96% and 82.65% accuracy, respectively. However, this algorithm only utilized individual images and may not perform as effectively in crowded public spaces. Wei et al. [21] proposed a Mask-YOLO algorithm based on YOLOv3, which effectively addresses the decrease in detection accuracy caused by occlusion, density, and small-scale objects by introducing channel attention and the complete intersection over union (CIoU) loss. However, the algorithm still suffers from missed and false detections of small targets, and the established data samples are relatively limited. Duan et al. [22] presented an RMPC algorithm for mask-wearing detection based on YOLOv7, which enhances the model's accuracy by fully integrating feature information through a changed stacking order of the max pooling and convolution layers. Nevertheless, this method increases the model's parameter count and reduces the network's detection speed. To effectively reduce the spread of COVID-19, Endris et al. [23] established a dataset containing improperly worn masks and trained and tested it on the YOLOX [24] algorithm, achieving good detection performance. However, most of the data in the dataset were artificially synthesized and did not reflect real-world mask-wearing detection. Bo et al. [25] replaced the backbone feature extraction network of YOLOv3 with ShuffleNetv2 [26] and introduced the SKNet [27] attention mechanism in the feature enhancement network. This improved the detection speed at the cost of 1.01% of the detection accuracy, and the speed gain was only 34 FPS. Zhang et al. [28] used MobileNetV2 [29] as the feature extraction network based on YOLOv2, which simplified the network model and improved the training speed. However, the division of the dataset was incomplete and the number of parameters remained excessive, resulting in suboptimal detection speed.
To effectively and efficiently detect people wearing masks in public places, suitable data and efficient algorithms are required. However, current mask datasets lack a category for incorrectly worn masks in real-world public environments. Among detection algorithms, complex target detection models such as YOLOv3 improve detection accuracy but are slow and unsuitable for deployment on mobile devices. Lightweight networks such as YOLOv4-tiny have fast detection speeds but may suffer from missed small targets and insufficient feature extraction in complex scenarios. This study presents an enhanced lightweight detection approach based on YOLOv4-tiny to address these issues. First, a new dataset was constructed by segmenting the mask dataset into three categories: not wearing a mask, wearing a mask incorrectly, and wearing a mask correctly. Second, to improve the detection capability of YOLOv4-tiny for small targets, an additional prediction feature layer was incorporated through feature fusion, which expands the detection scale and increases the receptive field. Additionally, to improve the feature extraction ability of the model and increase its attention toward facial regions, we introduced a convolutional block attention module (CBAM) [30] in the enhanced feature extraction network. Finally, the focal loss [31] function and an improved mosaic data enhancement strategy were used to boost the target classification accuracy.

YOLOv4-Tiny Algorithm
YOLOv4-tiny [32] is a series of simplifications based on YOLOv4 that removes the spatial pyramid pooling structure and the path aggregation network, resulting in a network with one-tenth the number of parameters of YOLOv4. By employing CSPDarknet53-Tiny as the backbone feature extraction network and replacing the Mish [33] activation function with the LeakyReLU [34] activation function, YOLOv4-tiny simplifies its network structure while increasing the detection speed. YOLOv4-tiny utilizes the feature pyramid network [35], which reduces the network parameters compared with YOLOv4's PAN network. Figure 1 illustrates the network structure of YOLOv4-tiny.
The enhanced feature extraction network of YOLOv4-tiny is simple and significantly reduces the number of model parameters. However, the network only outputs two prediction feature layers, with sizes of 26 × 26 × 256 and 13 × 13 × 512. Because the model has only two prediction feature layers, it cannot retain the semantic information of shallow feature layers, which lowers the detection resolution of objects. This affects the detection of small targets and may also cause inaccurate boundary localization for large targets. At the same time, the network has a small receptive field, meaning it can only perceive minor local information in an image and cannot obtain a broader global context. This reduces the network's ability to recognize and classify objects. Figure 2 shows the FPN structure in YOLOv4-tiny.
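As a point of reference for the activation swap mentioned above, the following minimal NumPy sketch contrasts the Mish [33] and LeakyReLU [34] activations; the negative-slope value of 0.1 is an illustrative assumption, not a value reported in this paper.

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(softplus(x)) [33]; smooth, but more costly to evaluate
    return x * np.tanh(np.log1p(np.exp(x)))

def leaky_relu(x, alpha=0.1):
    # LeakyReLU [34]: keeps a small slope for negative inputs, cheap to compute
    return np.where(x >= 0, x, alpha * x)

x = np.linspace(-4.0, 4.0, 9)
print(np.round(mish(x), 3))        # smooth, slightly negative for x < 0
print(np.round(leaky_relu(x), 3))  # piecewise linear
```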

CBAM Attention Module
The convolutional block attention module (CBAM) combines channel and spatial attention modules to process incoming feature layers effectively. The channel attention mechanism separately applies global average pooling and global maximum pooling operations. The resulting pooled features are then passed through a fully connected layer, summed, and normalized to obtain a weight for each channel of the original input feature layer. Finally, each channel of the original input feature layer is multiplied by its corresponding weight. This can be regarded as a form of weighting that enhances the responses of useful feature channels while suppressing the responses of useless channels, thereby improving the feature extraction capability of the model.
The spatial attention mechanism computes the maximum and average values across channels at each feature point of the input feature layer. These values are stacked and passed through a 1 × 1 convolutional layer for channel adjustment. The resulting weights are applied to the input feature layer by multiplication, producing a feature map with spatial weights. This can be regarded as a weighting of the spatial dimensions: the responses of useful local features are amplified by multiplying the attention weights with the original feature maps, while useless background noise is suppressed. Consequently, this mechanism significantly improves the detection accuracy of the model. Figure 3 displays the structure of this mechanism.
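The following is a minimal tf.keras sketch of a CBAM-style block matching the description above: channel attention built from global average/max pooling and a shared fully connected layer, then spatial attention built from stacked per-pixel mean/max maps and a 1 × 1 convolution (as described in this paper; other CBAM variants use a larger kernel). The reduction ratio and layer choices are illustrative assumptions, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(x, reduction=8):
    """CBAM-style attention as described above (a sketch, not the paper's code)."""
    channels = x.shape[-1]

    # ---- Channel attention: global avg/max pooling -> shared MLP -> sigmoid ----
    shared_fc1 = layers.Dense(channels // reduction, activation="relu")
    shared_fc2 = layers.Dense(channels)
    avg_pool = layers.GlobalAveragePooling2D()(x)                 # (B, C)
    max_pool = layers.GlobalMaxPooling2D()(x)                     # (B, C)
    channel_att = layers.Activation("sigmoid")(
        layers.Add()([shared_fc2(shared_fc1(avg_pool)),
                      shared_fc2(shared_fc1(max_pool))]))
    channel_att = layers.Reshape((1, 1, channels))(channel_att)
    x = layers.Multiply()([x, channel_att])                       # re-weight channels

    # ---- Spatial attention: per-pixel mean/max over channels -> 1x1 conv ----
    avg_map = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_map = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    spatial_att = layers.Conv2D(1, kernel_size=1, padding="same",
                                activation="sigmoid")(
        layers.Concatenate()([avg_map, max_map]))
    return layers.Multiply()([x, spatial_att])                    # re-weight positions

# standalone shape check on a 26x26x256 feature layer
inputs = layers.Input(shape=(26, 26, 256))
model = tf.keras.Model(inputs, cbam_block(inputs))
```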

Data Enhancement
In the field of image recognition, common data augmentation techniques include flipping, scaling and color space manipulation. Mosaic data augmentation, on the other hand, involves randomly flipping, scaling and applying color space manipulation to four images before stitching them together to form a composite image. The advantage of this data enhancement is that it improves the robustness and the generalization ability of the model. By generating large images containing diverse scenes and objects, the model must learn how to differentiate and detect objects in various regions and precisely classify and identify them during prediction, thus adapting better to complex real-world scenarios. Additionally, this can reduce the model's training time and computational overhead, resulting in more efficient and effective detection results for practical applications. Figure 4 displays a picture after mosaic data enhancement.
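Below is a minimal NumPy sketch of the stitching step of mosaic augmentation described above: four images are randomly flipped, resized, and placed in the four quadrants around a random center point. Color-space jitter and bounding-box remapping are omitted, and the 416 × 416 output size is an assumption chosen to match a typical YOLOv4-tiny input, not a value taken from this paper.

```python
import numpy as np

def mosaic(images, out_size=416):
    """Stitch four (H, W, 3) uint8 images into one mosaic image (a sketch)."""
    assert len(images) == 4
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # random split point so quadrant sizes vary between generated samples
    cx = np.random.randint(int(0.3 * out_size), int(0.7 * out_size))
    cy = np.random.randint(int(0.3 * out_size), int(0.7 * out_size))
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        if np.random.rand() < 0.5:              # random horizontal flip
            img = img[:, ::-1]
        h, w = y1 - y0, x1 - x0
        # nearest-neighbour resize via index sampling (avoids extra dependencies)
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y0:y1, x0:x1] = img[ys][:, xs]
    return canvas

# example: stitch four random 300x400 patches into one composite image
patches = [np.random.randint(0, 256, (300, 400, 3), dtype=np.uint8) for _ in range(4)]
composite = mosaic(patches)
```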

YOLOv4-Tiny Mask Detection Algorithm Optimization
The YOLOv4-tiny algorithm has been extensively applied in object detection due to its lightweight design, which enables it to operate efficiently on resource-constrained devices. However, as this algorithm sacrifices some accuracy, it faces challenges such as missed small targets and an insufficient feature extraction capability. To address these issues, this paper proposes an improved YOLOv4-tiny algorithm, optimized from the perspectives of algorithm structure and training strategy while maintaining its detection speed and accuracy. Specifically, we introduced an additional feature layer on top of the two existing prediction feature layers of the YOLOv4-tiny algorithm, adopted a bottom-up feature fusion method, and integrated five CBAM attention modules into the feature enhancement network. Additionally, we replaced the original confidence loss function with a focal loss function and used an improved mosaic data augmentation strategy to train the model and improve its detection performance.

Improving the Small-Target Detection Capability
The YOLOv4-tiny model only performs target detection on feature maps that have undergone 16× and 32× down-sampling. Each down-sampling operation halves a feature map's size, reducing the resolution and losing feature information for small targets. As the number of down-sampling layers increases, the feature map size decreases and less and less information remains. This limits the information available for small targets, leading to missed detections, particularly in complex mask detection scenarios with many easily missed small targets. We propose expanding the detection scale range for small targets by adding a prediction feature layer to the original network after the 8× down-sampling stage. To further enhance the network's small-target detection ability, we incorporated the FPN structure by merging predicted feature maps from three scales to enrich the feature information. The modified network is called the YOLOv4-tiny-3 model, and its structure is illustrated in Figure 5.
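The following tf.keras sketch illustrates the three-scale idea described above, adding a 52 × 52 prediction feature layer (from 8× down-sampling) to the existing 26 × 26 and 13 × 13 layers and fusing them FPN-style through upsampling and concatenation. The channel counts, the head layout, and the top-down fusion order are illustrative assumptions for a 416 × 416 input; they are not the authors' exact YOLOv4-tiny-3 configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_leaky(x, filters, kernel_size):
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.1)(x)

def three_scale_neck(p3, p4, p5, num_anchors=3, num_classes=3):
    """Fuse 13x13, 26x26 and the added 52x52 feature layers into three heads."""
    head_ch = num_anchors * (5 + num_classes)   # box (4) + objectness (1) + classes

    x5 = conv_bn_leaky(p5, 256, 1)
    out_13 = layers.Conv2D(head_ch, 1)(conv_bn_leaky(x5, 512, 3))

    # upsample deep features and fuse with the 26x26 layer
    x4 = layers.Concatenate()([layers.UpSampling2D(2)(conv_bn_leaky(x5, 128, 1)), p4])
    out_26 = layers.Conv2D(head_ch, 1)(conv_bn_leaky(x4, 256, 3))

    # upsample again and fuse with the newly added 52x52 layer (small targets)
    x3 = layers.Concatenate()([layers.UpSampling2D(2)(conv_bn_leaky(x4, 64, 1)), p3])
    out_52 = layers.Conv2D(head_ch, 1)(conv_bn_leaky(x3, 128, 3))
    return out_52, out_26, out_13

# shape check with dummy backbone outputs (8x, 16x, 32x down-sampling of a 416x416 input)
p3 = layers.Input((52, 52, 128))
p4 = layers.Input((26, 26, 256))
p5 = layers.Input((13, 13, 512))
model = tf.keras.Model([p3, p4, p5], three_scale_neck(p3, p4, p5))
```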

Introducing the Attention Mechanism
Despite the improved detection speed and reduced parameter count achieved by YOLOv4-tiny, its feature extraction capability is drastically diminished. Furthermore, adding a layer of large-scale predictive feature maps to the YOLOv4-tiny algorithm introduces additional semantic information, making it easier for the model to mistake targets for similar objects and potentially leading to misidentified background regions, which affects the overall detection accuracy of the model. We propose introducing the CBAM attention module into YOLOv4-tiny's feature enhancement network to address the reduced feature extraction capability. The CBAM module enhances the feature map by assigning attention weights in both the channel and spatial dimensions, thereby changing the weight distribution of the original features to enhance effective features. It helps the model better understand the spatial and channel relationships of the target, leading to improved recognition while suppressing false detections in the background. Overall, this improves the detection accuracy and the robustness of the model. For this paper, we added five CBAM modules to the feature enhancement network of the YOLOv4-tiny-3 algorithm to iteratively change the weight distribution of the original feature map, enhance attention to face regions, and improve network detection accuracy. The network framework of the algorithm is displayed in Figure 6.
Figure 6. The network structure of the algorithm in this paper.

Training Strategy Optimization
The mask dataset used in this paper was highly unbalanced, with a significantly higher number of targets not wearing masks than targets wearing masks incorrectly. This posed a risk of the model overfitting the "not wearing masks" category and underfitting the "wearing masks incorrectly" category during training. Furthermore, the dataset contained targets of different complexities within the same category, which can bias a prediction model towards easy-to-classify samples and decrease the performance on hard-to-classify ones, thereby reducing its classification ability. To address this issue, we replaced the original confidence loss function with a focal loss function. This function is based on the cross-entropy loss function with an additional category weight coefficient and a modulating factor, improving the model's ability to identify and generalize over minority categories, focus more on hard-to-classify samples, and reduce the influence of easy-to-classify ones during training.
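A minimal TensorFlow sketch of a binary focal loss for the confidence branch, matching the description above (cross-entropy weighted by a category coefficient alpha and a modulating factor (1 − p_t)^gamma), is given below. The defaults alpha = 0.25 and gamma = 2 are the common values from the focal loss paper [31], not values reported in this paper.

```python
import tensorflow as tf

def focal_confidence_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples, up-weights the rare class."""
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)    # prob of the true class
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)  # category weight
    modulating = tf.pow(1.0 - p_t, gamma)                      # focus on hard samples
    return tf.reduce_mean(-alpha_t * modulating * tf.math.log(p_t))

# example: a confident positive, an easy negative, and a hard positive prediction
y_true = tf.constant([1.0, 0.0, 1.0])
y_pred = tf.constant([0.9, 0.2, 0.4])
print(float(focal_confidence_loss(y_true, y_pred)))
```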
To further enrich the image backgrounds, improve model robustness and enhance training efficiency, we propose an improved mosaic data enhancement method in this paper. This method synthesizes the four original images into six images that are then sent to the model for training. Given that the images generated by the mosaic method differ from the real distribution of natural images, we set a 50% probability of using mosaic data enhancement in each step and only used it in the first 70% of the epochs. This approach enriched the image backgrounds and preserved a realistic distribution of images while improving model robustness. Figure 7 displays a picture after improved mosaic data enhancement.
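A small sketch of the training schedule described above: mosaic augmentation is applied with 50% probability at each step and only during the first 70% of the training epochs. The function and parameter names are illustrative, not taken from the authors' code.

```python
import random

def use_mosaic(epoch, total_epochs, prob=0.5, epoch_fraction=0.7):
    """Decide per training step whether to apply mosaic augmentation."""
    if epoch >= total_epochs * epoch_fraction:   # later epochs: real image distribution only
        return False
    return random.random() < prob                # 50% of the remaining steps

# e.g. inside the data pipeline at epoch 10 of 100:
# batch = mosaic(sampled_images) if use_mosaic(10, 100) else sampled_images
```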

Datasets
The lack of well-developed public datasets for mask detection in public space scenarios, particularly for incorrectly worn masks, poses a challenge for current research in this area. To address this issue, we collated and constructed a three-category dataset consisting of correctly worn masks (face_mask), faces without masks (face), and improperly worn masks (incorrect_mask). First, we selected 7393 images of targets without masks and 5094 images of targets with correctly worn masks from CMFD and RMFD. Additionally, we obtained 2000 artificially synthesized images of incorrectly worn masks from the IMFD dataset. To augment the real data of incorrectly worn masks in public places, we captured numerous photos containing incorrectly worn masks and, after screening and sorting, obtained 2625 high-quality, diverse and realistic images of incorrectly worn masks. All the above data were sorted and labeled to generate a complete mask dataset comprising 17,112 images. Images containing face targets included many small objects, and the backgrounds of the photographs were complex.

Experimental Environment
The experimental environment, based on Keras, is presented in Table 1.

Experimental Protocols
To test the algorithm's effectiveness, we validated it experimentally from the following aspects: the P-R curve, the precision rate, the recall rate, the AP value, the detection speed, the actual scene effect, and an ablation experiment. A P-R curve provides an intuitive reflection of a classifier's performance, while the precision and recall rates measure accuracy and completeness. The AP value is a comprehensive evaluation index combining the precision and recall rates. The detection speed is critical to a classifier's real-time performance, and the actual scene effect measures its practicality. An ablation experiment validated the improvements implemented in this paper. The AP and mAP values are calculated as follows: each category's AP value is the area under its precision-recall curve P(R), and the mAP is the average AP value over all classes, where N represents the number of categories; in this study, N was set to 3.
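In symbols, the standard definitions consistent with the description above are (with TP, FP and FN denoting true positives, false positives and false negatives):

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \quad N = 3
```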

Comparison of P-R Curves
We compared the P-R curves of our proposed algorithm, YOLOv4-tiny, YOLOv4-tiny-3 and the algorithm of [18] for face target detection under the same experimental conditions and methodology, as shown in Figure 8. The results indicate that the YOLOv4-tiny-3 algorithm outperforms YOLOv4-tiny and the algorithm of [18] on unmasked targets (face), based on the P-R curve and the area it encloses with the axes, which represents the AP value. This shows that, after adding a prediction feature layer to expand the detection scale, the algorithm retains the feature information of small targets after down-sampling, which improves small-target recognition and enhances network detection accuracy. Our proposed algorithm also outperformed the YOLOv4-tiny-3 algorithm in terms of accuracy on face targets. By using an attention mechanism within the feature enhancement network, our algorithm filters out non-critical target information and concentrates on the vital face region, improving the network's ability to detect faces with increased accuracy. On correctly worn mask targets (face_mask) and incorrectly worn mask targets (incorrect_mask), our proposed algorithm performed in line with the other two algorithms. This was primarily because the YOLOv4-tiny algorithm already detects these targets with high accuracy, leaving little room for significant improvement.

Comparison of Precision, Recall and AP Values
The precision rate, recall rate and AP values of the algorithm in this paper, YOLOv4, YOLOv4-tiny, YOLOv4-tiny-3 and the algorithm of [18] for the face, face_mask and incorrect_mask targets are presented in Table 2. Compared with the YOLOv4-tiny algorithm, our proposed algorithm increased the precision rate on face targets by 6.09%, from 86.21% to 92.30%, with a small increase in the precision rate on the other targets. Additionally, the AP value increased from 77.64% to 80.93%, an increase of 3.29%, and remained at the same level for the other targets. Compared with the YOLOv4-tiny-3 algorithm, our proposed algorithm increased the precision rate on face targets by 0.88%, from 92.30% to 93.18%. The AP value increased from 80.93% to 83.01%, an increase of 2.08%, and the AP values for the other targets showed small increases. Compared with the algorithm of [18], the algorithm in this paper improved the AP values for each objective by 4.36%, 0.68% and 1.02%, respectively. Compared with YOLOv4, our proposed algorithm showed differences of 2.04% and 0.32% in the AP values for the face and incorrect_mask targets, respectively, while for the face_mask target our algorithm achieved a relatively higher AP value. This was because face targets were more abundant in our dataset and their image backgrounds were more complex; a deeper network structure such as YOLOv4 can improve detection performance under complex backgrounds, whereas for relatively simple data its improvement is insignificant. Therefore, adding a prediction feature layer on the basis of YOLOv4-tiny enhanced the network's detection precision by solving the problem of many small targets being missed in complex scenes. Additionally, we introduced the CBAM attention mechanism within the feature enhancement network to mitigate the lightweight network's weak feature extraction ability, which further improved the network's detection accuracy.
Table 3 presents the mAP, the FPS and the number of model parameters for each network, comparing YOLOv4, YOLOv4-tiny, YOLOv4-tiny-3, the algorithm of [18] and our proposed algorithm. Our results indicate that, compared to the YOLOv4-tiny algorithm, the mAP of the YOLOv4-tiny-3 algorithm increased by 1%, from 90.97% to 91.97%. The algorithm in this paper further increased the mAP value by 1.08% compared to the YOLOv4-tiny-3 algorithm and by 2.08% compared to the YOLOv4-tiny algorithm, with only an 8.21 FPS decrease in detection speed. The difference in mAP compared to the YOLOv4 algorithm was only 0.69%, while the number of parameters decreased by 90.27% and the detection speed increased from 11.29 FPS to 70.22 FPS, an increase of 58.93 FPS. This is because the YOLOv4 network has more layers and a more complex model, providing a stronger ability to extract features from complex face data; however, having more convolutional layers leads to a significant decrease in detection speed. Compared with the algorithm of [18], the mAP value improved by 2.01% and the detection speed increased from 67.20 FPS to 70.22 FPS. These results demonstrate that our proposed algorithm significantly enhances detection accuracy in mask-wearing detection scenarios using a lightweight network while maintaining a fast detection speed.
Figure 9 illustrates a comparison of the detection outcomes of YOLOv4-tiny and our proposed algorithm for various scene images, demonstrating the effectiveness of our algorithm in real-world scenarios. The results indicate that the YOLOv4-tiny algorithm missed small targets in images (a) and (b), whereas this paper's algorithm detected them in images (d) and (e) thanks to the additional large-scale prediction feature layer and FPN feature fusion. These enhancements enabled the model to retain small-target feature information and obtain a broader range of contextual semantic details, resulting in improved detection capabilities.
Additionally, image (c) shows that the YOLOv4-tiny algorithm incorrectly detected an improperly masked target as correctly masked, while image (f) demonstrates that this paper's algorithm detected all targets accurately. This improvement resulted from integrating an attention mechanism into the enhanced feature extraction network of YOLOv4-tiny, along with the use of the focal loss and an improved mosaic data enhancement strategy, which effectively strengthened the model's feature extraction ability.

Ablation Experiments
To validate the effectiveness of the improved strategies for the YOLOv4-tiny algorithm and its training, we designed several groups of ablation experiments, whose results are presented in Table 4. A "√" in the table indicates the use of the structural modification or training strategy, while "×" indicates that the operation was not performed.
The results show that in the first group of experiments, adding one prediction feature layer to the YOLOv4-tiny algorithm resulted in an mAP value of 91.25%, indicating that the algorithm retained more small-target semantic information and that further feature fusion improved its detection ability for small targets. In the second group of experiments, the attention module was introduced into the enhanced feature extraction network of the YOLOv4-tiny algorithm, resulting in an mAP value of 91.57%, indicating an improved feature extraction ability.
Groups 3 and 4 added the focal loss and the improved mosaic data enhancement strategy, respectively, on the basis of groups 1 and 2. The mAP values improved by 0.72% and 0.85%, respectively, indicating that this paper's training strategy was effective and alleviated the negative impact of the dataset imbalance. Group 5 incorporated all the improved strategies, resulting in an mAP of 93.05%.

Conclusions
The data classification for mask-wearing detection is incomplete, and lightweight networks have limitations in detecting small targets and extracting features. This paper introduces a novel approach for lightweight mask-wearing detection based on an attention mechanism. The mask dataset was subdivided, and a category for incorrectly worn masks was added alongside the data for unworn and correctly worn masks. Our proposed algorithm expanded the detection scale range of YOLOv4-tiny and fused deep feature information with shallow semantic information to improve the accuracy of small-target detection. Additionally, to enhance feature extraction in lightweight networks, our approach incorporated five attention modules into the YOLOv4-tiny feature enhancement network for the more efficient screening of feature information in regions of interest. Training and testing were conducted on the three-category dataset, and the results showed that, compared with YOLOv4-tiny, this paper's algorithm achieved a 6.97% improvement in precision for face targets, reached a high mAP of 93.05%, and maintained a detection speed of 70.22 FPS. Compared with the YOLOv4 algorithm, the mAP differed by only 0.69%, but the detection speed improved significantly. Unfortunately, there is still a gap in detection accuracy between our proposed algorithm and YOLOv4, and we could not achieve a significant improvement in detecting incorrectly and correctly masked targets due to the dataset's imbalance. Moreover, we only tested the algorithm's detection speed on a personal portable laptop; we did not deploy it on mobile edge devices or realistically simulate a scenario of mask detection using mobile devices in public places. In our future work, we will continue to expand our dataset and improve the network structure, aiming to enhance the detection accuracy of the network while ensuring the detection speed, by optimizing our model and incorporating more advanced techniques to improve the performance of our system.
Funding: This research was funded by the Natural Science Foundation of Hubei Province, China (grant number 2021CFB584).

Figure 8. P-R curves for different algorithms with different targets.

Figure 9. Comparison of YOLOv4-tiny and our algorithm for partial-scene mask detection.

Table 2. Precision, recall and AP values for different targets of different networks.

Table 3. mAP, FPS and the number of parameters for different networks.

Table 4. Comparison of ablation experiment results.