LighterFace Model for Community Face Detection and Recognition

Abstract: This research proposes a face detection algorithm named LighterFace, which is aimed at enhancing detection speed to meet the demands of real-time community applications. Two pre-trained convolutional neural networks are combined, namely Cross Stage Partial Network (CSPNet) and ShuffleNetv2. Connecting the optimized network with the Global Attention Mechanism (GAMAttention) compensates for the accuracy loss caused by optimizing the network structure. Additionally, the learning rate of the detection model is dynamically updated using the cosine annealing method, which enhances the convergence speed of the model during training. This paper analyzes the training of the LighterFace model on the WiderFace dataset and a custom community dataset, aiming to classify faces in real-life community settings. Compared to the mainstream YOLOv5 model, LighterFace reduces computational demands by 85.4% while achieving a 66.3% increase in detection speed and attaining 90.6% accuracy in face detection. It is worth noting that LighterFace generates high-quality cropped face images, providing valuable inputs for subsequent face recognition models such as DeepID. Additionally, the LighterFace model is specifically designed to run on edge devices with lower computational capabilities. Its real-time performance on a Raspberry Pi 3B+ validates the results.


Introduction
Currently, community security is a growing concern. Communities have a large number of people coming in and out every day, and it is very common for strangers to blend into the crowd. To prevent them from committing illegal acts and to facilitate management, an efficient supervision system is necessary. A face detection algorithm integrated into surveillance cameras facilitates real-time identification of targeted faces. However, the development of face detection technology raises critical questions about safeguarding personal privacy. It is imperative to ensure that the processing of face images and associated data does not infringe upon individual privacy rights. Edge computing emerges as a viable solution to these concerns, as it allows face information to be processed without relying on internet connectivity, an approach that maximizes the protection of personal privacy and security. Efficiently processing image information for face detection with limited local computing resources, however, is a complex task. Using these constrained resources effectively requires strategic solutions that balance accurate face detection against the need to uphold privacy rights.
Over the past decade, a great deal of research has been devoted to designing people identification systems that are both efficient and economical. These systems aim to swiftly and accurately identify individuals, promptly alerting a remote monitoring point. Face detection serves as a crucial prerequisite for effective face recognition, with the accuracy of face detection directly influencing the overall face recognition rate. While deep learning-based face detection algorithms have shown superior performance compared to traditional methods, challenges persist, particularly in accurately detecting small-scale and heavily occluded faces. SSH (Single Stage Headless face detector) improves upon SSD [10], presenting a multi-branch approach within VGG-net [11] to detect multi-scale faces. This approach addresses limitations in face recognition accuracy, particularly in challenging scenarios. FDDB introduces Light-Head Faster R-CNN as a strategic enhancement for face detection performance. By incorporating multi-scale training and testing along with deformable Convolutional Neural Networks (CNN) [12], this method aims to achieve improved accuracy and efficiency within the FDDB framework. Face R-FCN [13], building upon R-FCN, employs smaller location-sensitive RoI (Region of Interest) pooling kernels and additional small anchors. The method also integrates regular average pooling with location-sensitive average pooling, contributing to enhanced face detection accuracy. The Focusing Attention Network (FAN) introduces an attention mechanism into face detection through anchor-level attention, which is particularly beneficial for improving recall on occluded faces while maintaining a low false alarm rate. PyramidBox addresses hard face detection cases by proposing low-level feature pyramid networks [14], PyramidAnchors, and context-sensitive prediction modules. Moreover, it introduces a data anchor sampling method to augment training samples at different scales, emphasizing the importance of contextual information in face detection. These advancements collectively represent a continuous effort to refine face detection algorithms, aiming to overcome challenges and improve accuracy, especially in scenarios involving small-scale faces and occlusions.
Simplifying the face detection model to reduce computation and inference time is a key issue when meeting the deployment conditions of embedded terminals. At the same time, algorithmic accuracy must be preserved for practical applications. Current mainstream methods for achieving lightweight models include pruning [15], knowledge distillation [16], and optimizing the neural network structure. Pruning and knowledge distillation focus on post-training optimization of the model structure, while optimizing the neural network structure involves direct training of a lightweight network. In this paper, the design approach is rooted in the notion of creating an efficient and accurate face detection model suitable for deployment on embedded terminals. Historically, the R-CNN algorithm [17] played a pivotal role in practical target detection through CNNs. Subsequent advancements, such as Fast R-CNN [18] and Faster R-CNN [19], addressed its shortcomings and improved detection speed. The evolution of techniques like FCN [20] and Mask R-CNN [21] contributed to the maturation of image segmentation methods. Noteworthy lightweight algorithms, including MobileNet-SSD [22,23] and You Only Look Once (YOLO) [24][25][26][27], have simplified object detection on mobile terminals without necessitating the use of cloud servers.
Although existing algorithms have made remarkable progress in face detection, there remains a need to maintain detection accuracy while reducing computational cost. To overcome the challenges posed by limited computational power and inadequate real-time performance, our research leverages advanced deep learning techniques. By integrating the state-of-the-art lightweight network ShuffleNetv2 with CSPNet, a network that performs well on classification tasks, our proposed LighterFace model aims to enable real-time deployment of face detection models on edge devices. We carried out ablation and comparison experiments, and the results demonstrate that LighterFace runs on a Raspberry Pi 3B+ with an accuracy of 90.6% and a detection time of 1543 ms.

Materials and Methods
In this section, the proposed face recognition algorithm is introduced through the following two components: face detection and face recognition. CSPNet addresses the issue of redundant gradient information in the neural network backbone by integrating gradient changes into the feature maps from start to finish. This approach reduces the model's parameter count and floating-point operations (FLOPs) while maintaining inference speed and accuracy, ultimately reducing the model size. ShuffleNetv2 reduces the model's parameter count and FLOPs by employing techniques such as pointwise group convolution and channel shuffling. The face detection model LighterFace, presented in this paper, further optimizes the network structure by combining these two networks. To compensate for the loss in detection accuracy, GAMAttention is integrated into the face detection model. As a result, the model's structure is streamlined without sacrificing a significant amount of detection accuracy. The face detection phase involves identifying and cropping the faces within an image or video stream. The quality of face detection significantly influences the overall accuracy of the subsequent face recognition process.

Face Detection Model
LighterFace is formulated and optimized using the Cross Stage Partial Network (CSPNet). The model architecture is illustrated in Figure 1. The representative detection model built on CSPNet is YOLOv5, so YOLOv5 is employed as a control group in this section to highlight the optimizations achieved by LighterFace more distinctly. In the LighterFace architecture, the CBRM module is used in the initial phase, and the feature extraction module incorporates the Shuffle Block and the Global Attention Mechanism (GAMAttention).

CBRM Module Structure
In the YOLO model, the original authors employ the Stem module as the initial component of the network. Its primary function is to perform a sequence of convolutional and pooling operations on the input image, extracting the initial feature representation. This helps subsequent layers of the network learn and represent the semantic features of the image more effectively. The structure of the Stem module is illustrated in Figure 2a.
The Stem module first applies a 3 × 3 convolution for initial feature extraction and then sends the result to two branches. One branch applies two further convolutions to extract additional image features; the other applies MaxPool, which significantly reduces the complexity of the image features. The features of the two branches are then fused, and finally a 1 × 1 convolution adjusts the number of channels, which facilitates further feature extraction by the subsequent backbone network.

According to the four guidelines proposed in ShuffleNetv2 for reducing computation, listed below, there is still room to improve the performance of this module [28]:
i. Equal channel width minimizes Memory Access Cost (MAC);
ii. Excessive group convolution increases MAC;
iii. Network fragmentation reduces the degree of parallelism;
iv. Element-wise operations are non-negligible.
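Guideline i can be made concrete with the MAC expression that the ShuffleNetv2 paper gives for a 1 × 1 convolution, MAC = hw(c_in + c_out) + c_in·c_out, where h × w is the feature-map size. The sketch below is only an illustration: the feature-map size and the channel pairs are arbitrary values chosen so that the FLOPs (hw·c_in·c_out) stay fixed while the channel ratio changes, and the function names are not taken from the paper.

```python
def conv1x1_mac(h, w, c_in, c_out):
    """Memory access cost of a 1x1 convolution: input map + output map + weights."""
    return h * w * (c_in + c_out) + c_in * c_out


def conv1x1_flops(h, w, c_in, c_out):
    """Multiply-accumulate count of the same 1x1 convolution."""
    return h * w * c_in * c_out


# Illustrative values (assumptions): every pair has the same c_in * c_out,
# so the FLOPs are identical and only the channel ratio differs.
H = W = 56
CHANNEL_PAIRS = [(128, 128), (64, 256), (32, 512)]

for c_in, c_out in CHANNEL_PAIRS:
    flops = conv1x1_flops(H, W, c_in, c_out)
    mac = conv1x1_mac(H, W, c_in, c_out)
    print(f"c_in={c_in:3d} c_out={c_out:3d} FLOPs={flops} MAC={mac}")
```

With the FLOPs held constant, the balanced pair gives the smallest MAC, and the cost grows as the input and output widths diverge.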
The CBRM module is proposed based on the above four criteria, as shown in Figure 2b. Compared to the Stem module, it eliminates the branching structure and reduces the number of convolutional layers. The color image is divided into three monochrome layers and fed into the CBRM module, which extracts the image features directly with a 3 × 3 convolution and then connects to MaxPool in series to minimize additional computation, in line with Rule ii and Rule iii.

ShuffleBlock
ShuffleNet is a computationally efficient CNN designed for mobile devices. It proposes Channel Shuffle, as shown in Figure 3, where group convolution (GConv) divides the features into three groups. After the features are extracted by the convolution operation, the features in different groups do not communicate with each other. The Channel Shuffle operator is introduced to mix the features of different groups evenly, ensuring that information can flow between groups and improving detection accuracy.
In the YOLO model, the authors use the C3 module as an essential part of the backbone network. The structure of the C3 module is shown in Figure 4. It is faster and more accurate than the CSPBottleneck, and it can further extract feature information and increase the depth and width of the network for better learning of semantic features in images. However, the multiple separate convolutions it uses leave room for further lightweighting. According to Rule i of ShuffleNetv2, the higher the number of channels, the more significant the gap between the number of input channels and the number of output channels, which results in the C3 module running at a less-than-optimal speed on a CPU or ARM.
The Shuffle Block network at stride = 1 is shown in Figure 5a. The block does not use grouped convolutional modules, which conforms to Rule ii. The feature channels are divided into two branches at the start of the unit: one branch retains the original features, while the other extracts features in series using two convolutional modules and one DepthWiseConv (DWConv) module, all with equal input and output channels. DWConv is less computationally intensive than a normal Conv, which reduces the computational complexity of the overall model, in accordance with Rule i and Rule iii. To avoid the computation time of Add and other element-wise operations, the features of the two branches are fused using Concat, in accordance with Rule iv. Finally, the feature information of each channel is fused by Channel Shuffle.
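The split, Concat, and Channel Shuffle flow described above can be sketched without any deep learning framework by tracking channel labels in a plain list. This is a minimal illustration, not the paper's implementation: `channel_shuffle`, `shuffle_block_stride1`, and the `branch` callable (a stand-in for the conv, DWConv, conv sequence) are hypothetical names, and the choice of two groups for the final shuffle is an assumption based on the two-branch structure.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each output neighbourhood mixes all groups.

    Equivalent to reshaping to (groups, n), transposing to (n, groups), and flattening.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


def shuffle_block_stride1(channels, branch):
    """Channel-level view of the stride-1 block: split, process one half, Concat, shuffle."""
    half = len(channels) // 2
    kept = channels[:half]               # branch that retains the original features
    processed = branch(channels[half:])  # stand-in for conv -> DWConv -> conv
    return channel_shuffle(kept + processed, groups=2)  # Concat instead of Add


if __name__ == "__main__":
    print(channel_shuffle(list(range(6)), groups=3))  # [0, 2, 4, 1, 3, 5]
    mark = lambda xs: [x.upper() for x in xs]         # marks channels as "processed"
    print(shuffle_block_stride1(["a0", "a1", "b0", "b1"], mark))
```

After the shuffle, retained and processed channels alternate, so the next block's split hands each branch a mix of both.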
Information 2024, 15, x FOR PEER REVIEW
The Shuffle Block network at stride = 2 is depicted in Figure 5b. The primary difference from Figure 5a is the inclusion of depthwise convolutions in the branch that retains the original features. This ensures that no critical information is lost during the feature extraction process.
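The claim that DWConv is cheaper than a normal convolution, which motivates its use in both Shuffle Block variants, can be checked with a quick count of multiply-accumulate operations. The sketch assumes equal input and output spatial size (stride 1 with padding) and ignores bias terms; the layer dimensions are illustrative values, not figures from the paper.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates of a standard k x k convolution on an h x w output map."""
    return h * w * k * k * c_in * c_out


def dwconv_macs(h, w, k, channels):
    """Multiply-accumulates of a depthwise k x k convolution (one filter per channel)."""
    return h * w * k * k * channels


if __name__ == "__main__":
    h = w = 40                 # illustrative feature-map size (assumption)
    k, channels = 3, 116       # illustrative kernel size and width (assumption)
    standard = conv_macs(h, w, k, channels, channels)
    depthwise = dwconv_macs(h, w, k, channels)
    print(standard, depthwise, standard // depthwise)
```

For equal input and output widths, the standard convolution costs exactly `channels` times as many multiply-accumulates as the depthwise one.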

GAMAttention
This module [29] serves to mitigate information loss and enhance global dimensional interaction features. The overall process is defined by
$$F_2 = M_c(F_1) \otimes F_1, \qquad F_3 = M_s(F_2) \otimes F_2,$$
where $F_1 \in \mathbb{R}^{C \times H \times W}$ is the input to the GAM, the intermediate state $F_2$ and the output $F_3$ are obtained through the channel attention module $M_c$ and the spatial attention module $M_s$, respectively, each followed by element-wise multiplication ($\otimes$). The overall structure is shown in Figure 6.

The channel attention module employs a 3D arrangement to preserve information across three dimensions. Subsequently, it amplifies the cross-dimensional channel-spatial dependence by employing a two-layer multilayer perceptron (MLP). The structure of the channel attention module is shown in Figure 7.
In the spatial attention module, this paper chooses not to use max-pooling for spatial information extraction. Instead, it employs two convolutional layers to process the output of the channel attention module, in line with Rule iv of ShuffleNetv2. An influence factor is used in this paper to configure the weights of the channel and spatial attention modules, preventing dispersion in the training results. The structure of the spatial attention module is depicted in Figure 8.
The loss function is defined in terms of the following quantities. IOU is the ratio of the overlapping area of the predicted and labeled boxes, $\mathrm{IOU} = S_1 / S_2$, where $S_1$ is the overlapping area of the two boxes and $S_2$ is their total area. The term $\alpha v$ measures the similarity of the width-to-height ratios of the two boxes. $W_l$ and $H_l$ are the width and height of the labeling box, $W_p$ and $H_p$ are the width and height of the prediction box, $\rho^2$ is the squared distance between the centroids, and $c^2$ is the squared diagonal length of the minimum enclosing rectangle of the two boxes.
As shown in Figure 9, the loss function not only takes into account the distance between the detection frame and the labeling frame but also considers the relative proportion of the two rectangular frames. This contributes to faster convergence during training and better overall performance, leading to an enhanced detection effect.
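The quantities defined for the loss (IOU, the centroid distance ρ², the enclosing-box diagonal c², and the aspect-ratio term αv) match those of the Complete-IoU (CIoU) loss. Assuming that is the intended loss, a minimal sketch follows; the function name `ciou_loss` and the corner-format boxes `(x1, y1, x2, y2)` are illustrative choices rather than details taken from the paper, and both boxes are assumed to have positive width and height.

```python
import math


def ciou_loss(pred, label):
    """CIoU loss, 1 - IOU + rho^2 / c^2 + alpha * v, for boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    lx1, ly1, lx2, ly2 = label
    w_p, h_p = px2 - px1, py2 - py1   # prediction box width and height
    w_l, h_l = lx2 - lx1, ly2 - ly1   # labeling box width and height

    # IOU = S1 / S2: overlap area over total (union) area
    inter_w = max(0.0, min(px2, lx2) - max(px1, lx1))
    inter_h = max(0.0, min(py2, ly2) - max(py1, ly1))
    s1 = inter_w * inter_h
    s2 = w_p * h_p + w_l * h_l - s1
    iou = s1 / s2

    # rho^2: squared distance between the two box centres
    rho2 = ((px1 + px2) / 2 - (lx1 + lx2) / 2) ** 2 + ((py1 + py2) / 2 - (ly1 + ly2) / 2) ** 2

    # c^2: squared diagonal of the smallest rectangle enclosing both boxes
    c2 = (max(px2, lx2) - min(px1, lx1)) ** 2 + (max(py2, ly2) - min(py1, ly1)) ** 2

    # v: aspect-ratio consistency; alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(w_l / h_l) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(ciou_loss((0, 0, 2, 2), (3, 3, 5, 5)))  # disjoint boxes
```

Unlike a plain IOU loss, the centre-distance term still gives a useful signal when the boxes do not overlap at all.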

Face Feature Extraction DeepID
High-accuracy face detection is crucial for the overall face recognition system, providing multiple advantages that enhance the effectiveness and reliability of face recognition technology. LighterFace expedites the face recognition process by swiftly identifying and localizing faces in images or video streams. By reducing the computational load in subsequent stages of the face recognition pipeline, LighterFace enhances overall efficiency. Building on this foundation, we further integrate DeepID with face detection, with the goal of enhancing both the accuracy and speed of the face recognition system. This integration aims to make full use of the detection capabilities, contributing to an overall improvement in the performance of the face recognition system.

Network Infrastructure
DeepID is an efficient approach for extracting face features using a deep convolutional network. Figure 10 shows the feature extraction process. The convolutional network learns to classify all the faces available for training based on their identity and activates the corresponding features through neurons in a hidden layer. Each convolutional layer takes the face features from the previous layer as input, with local low-level features extracted at the bottom. The number of features decreases progressively along the feature extraction cascade, while increasingly global and high-level features are formed at the top layer. A highly compact 160-dimensional face feature vector is obtained at the end of the cascade, which contains rich identity information and directly predicts a much larger number of identity classes.

Face Privacy Protection
Amid growing public concerns about the privacy of face datasets, protecting face privacy has become an essential consideration in face recognition technology. In this paper, facial images are transformed into 160-dimensional face feature vectors using DeepID, and these vectors are then stored in the database. Notably, the feature vectors are associated with the names of individuals in the community, so the database does not link names directly to the original face images. This approach ensures that the feature vectors cannot be reversed or inverted to reveal the original images, thereby significantly reducing the risk of privacy leakage for residents in the community.
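One way to picture this storage scheme is a name-to-vector table that never holds the images themselves. The sketch below is purely illustrative: the paper does not state how a query vector is matched against the stored vectors, so the cosine-similarity comparison, the `0.8` threshold, the toy 4-dimensional vectors (standing in for the 160-dimensional features), and the names `cosine_similarity` and `identify` are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query, database, threshold=0.8):
    """Return the stored name whose vector best matches `query`, or None below threshold."""
    best_name, best_score = None, threshold
    for name, vector in database.items():
        score = cosine_similarity(query, vector)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Only names and feature vectors are stored; no face images are kept.
    database = {
        "resident_a": [1.0, 0.0, 0.0, 0.0],
        "resident_b": [0.0, 1.0, 0.0, 0.0],
    }
    print(identify([0.9, 0.1, 0.0, 0.0], database))  # close to resident_a
    print(identify([0.5, 0.5, 0.5, 0.5], database))  # no confident match
```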

Experiment and Discussion
The overall network framework is implemented and trained using PyTorch 1.10.0+cu102 on a server configured with an Intel Xeon Silver 4210R processor and an NVIDIA Quadro RTX 5000 graphics card. The edge device used in real-world community applications is the Raspberry Pi 3B+, which is also the ARM configuration mentioned in the experimental section later in the paper.
The research results in this paper are mainly evaluated on two aspects: (1) model accuracy and (2) model running speed. AP@0.5, the number of model parameters, the amount of model computation, and the average speed of detecting test set images are chosen as the evaluation indexes of the detection module. In the formulas for these indexes, TP is the number of samples correctly classified as positive by the model, FP is the number of samples incorrectly classified as positive, TN is the number of samples correctly classified as negative, and FN is the number of samples incorrectly classified as negative. IOU determines the positive and negative classes in image detection: in this paper, a detection with IOU > 0.5 is judged as a positive class, and otherwise as a negative class.
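As a rough illustration of how these counts feed the accuracy metric, the sketch below computes precision, recall, and an average precision from a ranked list of detections. It assumes the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN) and a simple non-interpolated area under the precision-recall curve; the exact AP computation used in the paper is not specified here, and the function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of labeled positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(is_tp, num_labeled):
    """Area under the precision-recall curve (rectangle rule, no interpolation).

    is_tp: one flag per detection, sorted by descending confidence; True when
           the detection matches a labeled face with IOU > 0.5.
    num_labeled: total number of labeled faces (TP + FN).
    """
    if num_labeled == 0:
        return 0.0
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        current_recall = tp / num_labeled
        ap += (current_recall - prev_recall) * precision(tp, fp)
        prev_recall = current_recall
    return ap


if __name__ == "__main__":
    print(precision(8, 2), recall(8, 8))
    print(average_precision([True, False, True], num_labeled=2))
```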

Datasets
The two datasets used to train the model in this study are WiderFace and a self-built dataset.
This paper uses the face detection benchmark dataset WiderFace (WF), a publicly available dataset with a total of 32,203 images labeled with 393,703 faces. The training, validation, and test sets are split in the ratio of 4:1:5, following the official instructions provided by WiderFace.
This research also creates a face detection dataset for testing, Community Face (CF), shown in Figure 11. This dataset is used to show that the model in this paper maintains leading performance across different datasets; its images come from the web and from real community photos. The WiderFace dataset lacks data on Asian faces. To address this gap, we specifically selected individuals entering and leaving communities in real-life projects, focusing solely on Asians. We annotated only real faces, excluding distractions such as faces on billboards or packaging boxes. The dataset contains 9240 images labeled with 11,962 faces. The ratio of the training set, validation set, and test set is 6:2:2. Weather, lighting, and occlusion can blur the faces in images. To ensure that the trained model also detects blurred faces well, the dataset includes a total of 902 images taken in rainy and night-time conditions.

Experimental Setup and Technical Details
To facilitate comparison, this paper reproduces several mainstream face detection algorithms, the original YOLOv5, and the improved algorithm based on ShuffleNet.
The batch size affects both the degree of optimization and the speed of the model, and it also determines the memory usage of the CPU or GPU. Given the relatively standard training configuration, the batch size is set to eight to reduce memory usage. The number of epochs is set to 100, a common fixed value chosen by many models, as it allows the model to be trained thoroughly until it reaches a stable state. To avoid stopping at a local optimum and to move toward the global optimum, the cosine annealing algorithm is used to set a dynamic learning rate, with the following formula:

$$lr = lr_{min} + \frac{1}{2}\left(lr_0 - lr_{min}\right)\left(1 + \cos\left(\frac{epoch}{epochs}\,\pi\right)\right)$$

where $lr$ is the new learning rate, $lr_0$ is the initial learning rate, $lr_{min}$ is the minimum learning rate, $epoch$ is the index of the current training epoch, and $epochs$ is the total number of training epochs. A deep learning network is trained mainly by gradient descent, which searches for a set of parameters that minimizes the structural risk. The learning rate is a very important hyperparameter in this process, since it controls how strongly the network weights are adjusted along the gradient of the loss function. The lower the learning rate, the slower the loss function changes. A low learning rate makes it less likely that the model skips over a local minimum, but it also means that the model takes longer to converge. The higher the learning rate, the faster the loss function changes, but the model tends to miss local minima. The cosine annealing algorithm helps the model step out of local minima during training and continue toward the global optimal solution. For comparison, the learning rate was configured once with the cosine annealing method and once manually fixed at 0.1. The experimental results are depicted in Figure 12.
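The schedule can be sketched in a few lines of Python. This assumes the usual closed form of cosine annealing, lr_min + 0.5 * (lr_0 - lr_min) * (1 + cos(pi * epoch / epochs)), written with the symbols defined above. The initial and minimum rates in the usage lines are hypothetical values chosen only for illustration.

```python
import math


def cosine_annealing_lr(epoch: int, epochs: int, lr0: float, lr_min: float = 0.0) -> float:
    """Learning rate at `epoch` for a single cosine decay from lr0 down to lr_min."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * epoch / epochs))


# Hypothetical settings: 100 epochs, decaying from 0.1 to 0.001.
schedule = [cosine_annealing_lr(e, 100, lr0=0.1, lr_min=0.001) for e in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts at lr0, passes through the midpoint of lr0 and lr_min halfway through training, and ends at lr_min. In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` set to the number of epochs and `eta_min` set to the minimum rate is expected to follow the same curve.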

Face Detection Accuracy and Speed
Face detection is a crucial prerequisite for the stable execution of face recognition. The face detection algorithm LighterFace proposed in this paper is compared with other recent face detection algorithms, including TinaFace, RetinaFace, LFFD, and YOLOv5; the comparison results are shown in Table 1. The dataset used is WiderFace, which is the more widely recognized benchmark. For a better comparison, this paper uses AP@0.5 and the detection speed of the model on CPU and ARM to form Table 2. By optimizing the network structure of the feature extraction module, the detection speed of LighterFace is greatly improved. This paper uses CSPNet as the backbone, which has lower computational complexity than ResNet and LFFDNet. The final model's accuracy and detection speed are better than those of YOLOv5 when using the same backbone.

Ablation Experiment
In this subsection, to better show the improvements in LighterFace, we conducted ablation experiments on Shuffle Block, CBAM, and GAMAttention. In addition to the standard evaluation metric of average precision at IoU = 0.5 (AP@0.5) on the WiderFace dataset, this paper uses the more stringent average precision over IoU = 0.5:0.05:0.95 (AP@0.5:0.95), so that model performance can be reviewed from multiple perspectives. This paper evaluates several different settings on the WiderFace validation set and the self-made dataset, focusing on their AP and running speed. The detection performance is shown in Table 3, and the detection speed and convergence speed comparisons are shown in Table 2 and Figure 13, respectively.
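The notation IoU = 0.5:0.05:0.95 means that AP is computed at ten IoU thresholds and then averaged. A minimal sketch of that averaging step is given below; the function names are hypothetical, and the per-threshold AP values in the usage lines are made-up placeholders, not results from this paper.

```python
from typing import List, Sequence


def iou_thresholds(start: float = 0.5, stop: float = 0.95, step: float = 0.05) -> List[float]:
    """IoU thresholds start, start + step, ..., stop (both ends inclusive)."""
    n = round((stop - start) / step) + 1
    # Rounding to two decimals removes floating-point drift such as 0.6500000000000001.
    return [round(start + i * step, 2) for i in range(n)]


def mean_ap(ap_per_threshold: Sequence[float]) -> float:
    """AP@0.5:0.95 as the plain mean of the AP values at each IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


thresholds = iou_thresholds()
# Placeholder AP values, one per threshold, for illustration only.
placeholder_ap = [0.90, 0.88, 0.85, 0.80, 0.74, 0.66, 0.55, 0.41, 0.24, 0.07]
print(len(thresholds), mean_ap(placeholder_ap))
```

Because AP usually falls as the IoU threshold rises, AP@0.5:0.95 is lower than AP@0.5 for the same model, which is why it is the stricter of the two metrics.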

Shuffle Block denotes the variant that replaces CSPNet with Shuffle Block. After the replacement, the computational complexity and model size are significantly reduced: the number of parameters is reduced by 87.9%, GFLOPs are reduced by 88.6%, accuracy is reduced by 4.5%, and detection speed is increased by 46.7%. The architectures of both networks are otherwise the same. LighterFace adds the attention module interspersed in ShuffleNet, and all other settings remain unchanged.
After the addition of GAMAttention, the number of parameters is reduced by 81.1% and GFLOPs are reduced by 85.4% compared to CSPNet, while the accuracy is improved by 2.4% and the speed is slowed down by 5.3% compared to ShuffleNet.

Community Detection Scenarios
The objective of the experiment is to underscore the importance of face detection and recognition in preventing unauthorized individuals from intruding or damaging public equipment. Additionally, the experiment identifies and marks specific areas within a real community as potentially dangerous zones. These areas include the community entrance, the community exit, the entrance of residential buildings, and the vicinity of the electric box. The overall layout of the experimental environment is shown in Figure 14.
When a face appears in the surveillance area, the face detection program is activated. The detected face is passed to DeepID to extract a face feature vector, which is then compared with the facial data stored in the database to determine whether the individual is a stranger. Detection and identification are time-sensitive and must run in real time to avoid missing the optimal warning window. When a stranger is detected, a warning is immediately sent to the community managers so that they can monitor potential intruders. This proactive strategy is designed to prevent incidents such as burglary and vandalism by giving community managers enough time to stay alert to potential threats and respond promptly and appropriately.
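The comparison step can be sketched as a nearest-match search over stored feature vectors. The text does not state which similarity measure or decision threshold is used with DeepID, so the cosine similarity, the 0.6 threshold, the function names, and the toy three-dimensional vectors below are all assumptions made for illustration.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_face(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    threshold: float = 0.6,  # assumed value; it would need tuning on real data
) -> Tuple[Optional[str], float]:
    """Return (person_id, score) for the best match, or (None, score) for a stranger."""
    best_id: Optional[str] = None
    best_score = -1.0
    for person_id, reference in database.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_id, best_score = person_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Toy database of registered residents (hypothetical vectors).
residents = {"resident_a": [1.0, 0.0, 0.0], "resident_b": [0.0, 1.0, 0.0]}
print(match_face([0.9, 0.1, 0.0], residents))  # matches resident_a
print(match_face([0.0, 0.0, 1.0], residents))  # no match, treated as a stranger
```

A result of `None` is the case that would trigger the warning to the community managers.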
The experiment was conducted over four rounds, encompassing various activities, as outlined in Table 4. In the initial phase of the experiment, the algorithm's performance was tested, with LighterFace being compared against current mainstream face detection methods.

A1: Appeared near the entrance to the complex
A2: Appeared near the neighborhood exit
A3: Appeared near the entrance to a residential building
A4: Appeared near the powerhouse

LighterFace in the Monitoring Area
After clearly defining the community detection scene, we conducted comparative experiments with YOLOv5s, LFFD, YOLOv5n, and LighterFace to assess their face detection accuracy and speed in the four identified activities. The primary objective is to verify and compare the performance of these models. Subsequently, 100 registered faces from each identified activity were selected for testing. The face detector is used to crop face images, and DeepID extracts face feature vectors from these crops. The resulting feature vectors are then compared with the facial data stored in the face database to evaluate the face recognition function.
Table 5 presents the average time and accuracy results for face detection across all algorithms, along with the accuracy of face recognition for the three datasets. Notably, LighterFace achieves face detection within 1700 ms in real application scenarios on a low-computing chip. The detection accuracy is consistently maintained at approximately 90%. This level of detection accuracy provides a solid foundation for subsequent face recognition, and the correct rate of the face recognition function is likewise sustained at around 90%.

Conclusions
In this paper, we have proposed a real-time monitoring system for tracking people entering and exiting a community using a face detection model. The objective was to prevent strangers from intruding into the community, thereby enhancing overall security. The key innovation lies in a lightweight model with reduced computational complexity, which can be deployed on embedded devices with low computing power while maintaining good detection accuracy and speed. LighterFace incorporates a customized ShuffleBlock as the primary feature extraction module within the CSPNet network. Through structural adjustments, the computation was reduced from 15.8 to 2.3 GFLOPs, an 85.4% reduction in computational workload. On the CPU platform, the detection time improved from 69.66 ms per frame to 39.5 ms per frame, a 43.3% speed improvement measured as the reduction in per-frame time. On the ARM platform, the detection time improved from 4578 ms per frame to 1543 ms per frame, a 66.3% speed improvement by the same measure. Evaluation on the WiderFace dataset and a community face dataset shows that LighterFace maintained high detection accuracy (AP@0.5) while prioritizing speed. To validate real-world applicability, various models were deployed on a Raspberry Pi 3B+ for testing. The results demonstrate that LighterFace can effectively support face detection on ARM chips with low computational power. Furthermore, its compatibility with mainstream older cameras allows it to be rolled out as an iterative software upgrade, confirming LighterFace's speed under practical conditions.
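The percentages quoted in this section can be checked directly from the before and after figures. The sketch below treats each percentage as the relative reduction (before - after) / before, which reproduces the reported 85.4%, 43.3%, and 66.3%; the helper names are hypothetical. It also prints the equivalent throughput ratio, since a reduction in per-frame time is not the same number as a speedup factor.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def speedup_factor(before: float, after: float) -> float:
    """How many times faster the new per-frame time is than the old one."""
    return before / after


# Figures taken from the text: computation, CPU ms per frame, ARM ms per frame.
print(round(percent_reduction(15.8, 2.3), 1))    # 85.4
print(round(percent_reduction(69.66, 39.5), 1))  # 43.3
print(round(percent_reduction(4578, 1543), 1))   # 66.3
print(round(speedup_factor(69.66, 39.5), 2))     # 1.76
print(round(speedup_factor(4578, 1543), 2))      # 2.97
```

Read this way, the 66.3% figure on the ARM platform corresponds to processing frames roughly three times as fast as the baseline.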
With the proposed approach, a community can deploy the model on cameras equipped with low-computing-power chips, which is expected to help reduce potential community security risks.

Figure 3. The Channel Shuffle schematic (LighterFace). The data from different channels (red, blue, green) are not interconnected. Channel Shuffle can extract data from different channels and rearrange them to establish connections.

Figure 4. The C3 modular architecture (comparison group), which consists of three standard convolutional layers and multiple Bottleneck modules.

Figure 5.

Figure 6. GAMAttention module architecture. GAMAttention achieves this by performing element-wise multiplication with both the Channel Attention Module and the Spatial Attention Module.

Figure 7. The schematic diagram of the channel attention module, which uses 3D permutation to retain information across three dimensions.

Figure 8. The schematic diagram of the spatial attention module. To focus on spatial information, we use two convolutional layers for spatial information fusion.

2.1.4. Loss Function
This paper uses the loss function, as shown, to improve the stability of training and convergence speed.

Figure 9. Schematic diagram of loss function parameters: the width ($W_l$) and height ($H_l$) of the labeling box, the width ($W_p$) and height ($H_p$) of the prediction box, the distance between the centroids ($\rho^2$), and the diagonal length of the minimum enclosing rectangle of the two boxes ($c^2$). The derivation of the formula is shown.

Figure 11. Community face dataset (CF). A mosaic has been created for the community.

In Figure 12, the red line represents the loss curve of training for 100 epochs with the learning rate fixed at 0.1, while the blue line uses CosineAnnealingLR. Loss is one of the crucial metrics for assessing model performance; smaller loss values indicate a smaller discrepancy between the model's predictions and the true labels. The graph shows that the training results with CosineAnnealingLR are superior to those obtained with a fixed learning rate. When the learning rate is fixed, the model may become trapped in local minima.

Figure 12. Loss comparison of fixed learning rate and cosine annealing algorithms.

Figure 13. Comparison of accuracy convergence speed for different extracted features.

Table 1. Comparison of performance and detection speed of different detection algorithms.

Table 2. Comparison of detection speed of different feature extraction modules with the same architecture.

Table 3. Comparison of detection performance of different feature extraction modules with the same architecture.

Table 4. Definitions of the 4 identified activities during the test.

Table 5. LighterFace application in the monitoring area.