Face Detection Based on DF-Net

Abstract: Face data have found increasingly widespread applications in daily life. To efficiently and accurately extract face information from input images, this paper presents a DF-Net-based face detection approach. A lightweight facial feature extraction neural network based on the MobileNet-v2 architecture is designed and implemented. By incorporating multi-scale feature fusion and spatial pyramid modules, the system achieves face localization and extraction across multiple scales. The proposed network is trained on the open-source face detection dataset WiderFace. Hyperparameters such as the bottleneck coefficient and the quality factor are discussed. Comparative experiments with other commonly used networks are carried out in terms of model size, processing speed, and extraction accuracy. Experimental results affirm the efficacy and robustness of this method, especially for challenging facial poses.


Introduction
With the continuous development of technology, an increasing number of fields use facial data with authentic scale information, which offers higher robustness and richer detail. These applications span film production, facial recognition, virtual reality, and medicine [1][2][3][4]. As the use of facial data expands, face detection remains the first and most critical step. In any facial application system, the accuracy and speed of face detection directly affect the performance of the overall system [5].
Face detection research can be divided into two directions [6]. The first is the traditional approach, which relies on manually designed features. For instance, the Viola-Jones method [7] employs Haar features (linear, edge, center, and diagonal features). However, traditional detection algorithms are time-consuming and labor-intensive because features must be designed by hand, and their feature representation capability is limited. In complex environments, they often lack robust detection performance. Since the breakthrough of deep convolutional neural networks (CNNs) in 2012 [8], led by Hinton and others, more and more researchers have studied and extended this technology, and face detection has advanced significantly with the advent of deep learning. Deep-learning-based face detection algorithms fall into two main categories: (1) two-stage methods, which first generate candidate regions and then use convolutional neural networks to predict the targets; these methods are known for high accuracy but tend to be slower; and (2) single-stage methods, which predict targets directly with a neural network and strike a balance between speed and accuracy. Nevertheless, face detection is still affected by factors such as environmental conditions and occlusion, so achieving both speed and accuracy remains challenging.
To address the cumbersome deployment and slow inference of current face detection networks, this paper presents a face extraction algorithm named Detection Face Net (DF-Net). It designs and implements a lightweight face extraction neural network based on the MobileNet-v2 architecture. The network is equipped with multi-scale feature cascade and spatial pyramid modules, which together yield efficient and accurate face detection.
The remaining sections are structured as follows: Section 2 describes the architecture of DF-Net and the details of its constituent modules. Section 3 verifies the proposed approach experimentally and compares its results with those of other methods, which supports the efficacy of the proposed algorithm. Finally, Section 4 concludes the paper.

Related Work

Face Detection Method
Because face detection plays a pivotal role, researchers have proposed a range of related algorithms. Early face detection algorithms mostly relied on traditional feature extraction and classifier training. For instance, Viola and Jones introduced a face detection algorithm in 2001 that could detect frontal faces, although its effectiveness on side profiles was limited [9]. Felzenszwalb et al. [10][11][12] presented a part-based object detection algorithm, the Deformable Part Model (DPM), in 2008. While it can detect faces of varying orientations and poses, the algorithm's complexity results in long runtimes. With the progress of deep learning in computer vision and the advances of convolutional neural networks on ImageNet classification tasks [13][14][15], neural networks have progressively become the mainstream technology for object detection [16,17]. One noteworthy approach, the cascade CNN, blended traditional techniques with deep learning [18]. It built on the Viola-Jones algorithm [19] and enhanced the classifier with convolutional networks to obtain robust face detection results. The Multi-Task Convolutional Neural Network (MTCNN) extended the cascade CNN concept, employing multiple cascaded convolutional neural networks for face detection [20]. This method, while effective, is difficult to deploy because of its multi-cascade architecture. Face RCNN, an evolution of Faster RCNN proposed by Wang et al., further refined face detection [21]. By introducing online hard example mining and multi-scale training, it significantly improved detection performance. Nonetheless, the additional modules somewhat compromised inference speed. Researchers have subsequently introduced various approaches to improve detection speed, such as the YOLO series (YOLOv6 [22], YOLOv7 [23]), RetinaFace [24], and more. Li et al. [25] proposed an improved anchor box matching method by integrating new data augmentation techniques and anchor design strategies into a dual-camera face detector, which provides better initialization for the regressor and consequently enhances face detection performance. Qi et al. [26], building upon YOLOv5, improved detection performance by using the Wing loss function and replacing the Focus module in the backbone with the StemBlock module. While these methods improve detection speed, this often comes at the cost of reduced accuracy.
The introduction of deep learning has significantly improved the effectiveness of face detection, and deep learning has become the mainstream approach, with widespread applications in various domains.

Multi-scale Feature Fusion Module
Multi-scale feature fusion is an important research direction in computer vision. It aims to combine image features from different scales effectively to improve image analysis and understanding. With the advancement of deep learning, architectures such as convolutional neural networks (CNNs) have come to dominate computer vision tasks. Deep networks can automatically learn multi-scale features from data, but how to fuse features from different levels remains a research focus. To address the accuracy loss caused by CNNs requiring fixed input image sizes, Kaiming He et al. introduced pyramid pooling [27]. By incorporating pyramid pooling layers into CNNs, features can be pooled at different scales, thereby achieving multi-scale information fusion. Multi-scale feature fusion holds significant practical value. For example, Qian Wang et al. combined deep CNNs and multi-scale feature fusion to propose a method for detecting multiple classes of 3D objects [28], which enables the detection of various objects of interest within a single framework. Han et al. introduced a convolutional neural network called MKFF-CNN [29], which combines multi-scale kernels with feature fusion and recognizes gestures for human-computer interaction. In a similar vein, Chen et al. devised a multi-scale fusion model named MSF-CNN [30], which is used to train a face detection system and achieves accurate face detection. Later, Lin et al. integrated pyramid structures with multi-scale feature fusion, resulting in the Feature Pyramid Network (FPN) [31]. FPN combines low-level and high-level features to create an object detection system that excels in accuracy, localization, and detection speed. Given these advantages, this paper adopts the FPN module for face detection.
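As an illustration of the pyramid pooling idea described above, the following minimal Python sketch pools a single-channel feature map of arbitrary size into a fixed-length vector. The function name spatial_pyramid_pool and the pyramid levels (1, 2, 4) are assumptions chosen for illustration; they are not taken from [27] or from DF-Net.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into a fixed-length vector.

    For each level n, the map is split into an n x n grid of bins and the
    maximum of each bin is kept. The output length is sum(n * n over levels)
    and does not depend on the input height or width.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # floor/ceil bin edges guarantee that every bin is non-empty.
            r0 = math.floor(i * height / n)
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = math.floor(j * width / n)
                c1 = math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two inputs of different sizes yield vectors of the same length.
small = [[r * 5 + c for c in range(5)] for r in range(5)]
large = [[(r * 13 + c) % 7 for c in range(13)] for r in range(8)]
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))  # 21 21
```

Because the number of bins is fixed by the levels rather than by the input size, a layer of this kind lets the layers that follow accept inputs of varying resolution.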

DF-Net Network Design
To improve the inference speed of the face detection model, this paper introduces a lightweight face detection algorithm. Adopting MobileNet-v2 as the backbone optimizes the inference speed of the entire network and enables real-time face detection and extraction. The algorithm is executed on a CPU with test images at a resolution of 1280 × 1240 pixels and achieves a processing speed of 57 fps (frames per second), attaining real-time performance. Figure 1 depicts the overall architecture of the network. DF-Net mainly comprises the MobileNet-v2 backbone network, a multi-scale feature cascade module, a spatial pyramid module, and a combined loss function.

Backbone Network
The role of the backbone network is to extract a sequence of high-dimensional features from the input data. Because the backbone carries most of the feature extraction workload and has high dimensionality and depth, its processing speed directly influences the performance of the overall network. To meet the real-time requirements of face detection, this paper employs MobileNet-v2 as the backbone network, leveraging its depthwise separable convolutions to keep the entire face detection algorithm real-time. MobileNet-v2 was chosen over MobileNet-v3 deliberately. The main goal of this work is to maintain a small model size while achieving real-time performance and deployability on mobile devices, two properties that are also required in many practical engineering applications. However, MobileNet-v3 is generally more complex than MobileNet-v2, with larger model sizes and higher computational costs. Since the tasks in this paper do not demand very high performance and computing resources are limited, MobileNet-v2 is used because it provides a better trade-off between speed and model size. Table 1 compares the MobileNet-v2 backbone used in DF-Net with the conventional MobileNet. This paper omits the fully connected layer at the end of MobileNet. In the table, "Input" represents the dimensions of the input feature map (image height, width, and channels). "Conv" and "Depthwise Conv" denote traditional convolution and depthwise separable convolution, respectively. "C" is the number of channels of the convolution or depthwise separable convolution, "n" is the number of repetitions of the current layer, and "S" is the stride. Because the entire backbone network employs depthwise separable convolutions, it attains a fast processing speed. Additionally, MobileNet-v2's bottleneck structure is adjustable, allowing its scaling factor to be tuned to specific requirements.
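To make the speed argument concrete, the sketch below compares the weight count of a standard convolution with that of a depthwise separable convolution (a depthwise k × k filter per input channel followed by a 1 × 1 pointwise convolution). Bias terms are ignored, and the kernel size and channel counts in the usage lines are illustrative assumptions rather than values from Table 1.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes the channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 32 input channels, 64 output channels.
std = standard_conv_params(3, 32, 64)
sep = depthwise_separable_params(3, 32, 64)
print(std, sep)               # 18432 2336
print(round(sep / std, 3))    # 0.127
```

The ratio equals 1/c_out + 1/k², so for 3 × 3 kernels the separable form needs roughly one eighth to one ninth of the weights of the standard form, and the multiply-accumulate count per output position shrinks by the same factor.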

The Multi-Scale Feature Cascade Module
Faces vary in size, position, and feature attributes, so the algorithm must be capable of multi-scale processing. Accordingly, three output feature maps of different scales are taken from the backbone network and used as inputs, with each scale capturing face information at a different magnitude. This approach also extends the network's receptive field over faces, thereby improving the accuracy of facial information extraction.
In a deep convolutional neural network, the convolution kernel scans the entire feature map when producing the next feature map, regardless of whether the stride is 1 or 2. This traversal creates a problem. During convolution, targets occupying more pixels inherently receive better feature representation than those occupying fewer pixels, so the next feature map tends to emphasize features of spatially larger targets. Furthermore, a deep convolutional neural network involves many convolution operations, each of which can lose some information, especially for smaller targets. Notably, convolution with a stride of 2 tends to retain pixels from larger targets while discarding those from smaller ones. Multi-scale feature extraction across the feature maps is therefore a pivotal task for the network.
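The effect of repeated stride-2 convolutions on small targets can be seen with simple arithmetic. The sketch below estimates how many feature-map cells a square face spans after a given number of stride-2 stages; the face sizes and the choice of five stages (an overall stride of 32) are illustrative assumptions, not measurements from this paper.

```python
def footprint_after_downsampling(size_px, num_stride2_stages):
    """Approximate side length, in feature-map cells, of a square target."""
    return size_px / (2 ** num_stride2_stages)


# A small and a large face after 3, 4, and 5 stride-2 stages.
for stages in (3, 4, 5):
    small = footprint_after_downsampling(16, stages)
    large = footprint_after_downsampling(256, stages)
    print(stages, small, large)
```

Under these assumptions a 16-pixel face covers only half a cell at an overall stride of 32, whereas a 256-pixel face still covers 8 cells, which is why shallower, higher-resolution feature maps are needed for small faces.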
Figure 2 depicts the multi-scale feature cascade module. First, three output feature maps of different scales, denoted FeatureMap1, FeatureMap2, and FeatureMap3, are extracted from the output of the backbone network. After the high-dimensional feature extraction performed by the backbone network, each of the three feature maps holds information specific to its own scale. FeatureMap1's resolution is adjusted through linear interpolation to match FeatureMap2, ensuring that their feature information at different scales does not intermix. After this alignment, FeatureMap1 and FeatureMap2 are concatenated along the channel dimension, and the merged feature map is passed through a 1 × 1 convolution that handles the channel transformation. The parameters of this convolution are learned by the network. FeatureMap3 and FeatureMap2 are cascaded in the same way. FeatureMap3 first undergoes a convolution operation followed by a stride-2 convolution. The latter matches FeatureMap3's resolution to that of the cascaded FeatureMap2, enabling channel-wise fusion without intermixing feature information across scales. After this fusion, another 1 × 1 convolution adjusts the number of channels. The resulting FeatureMap3 contains feature information from three distinct scales, mirroring the process for FeatureMap1 and FeatureMap2. Consequently, after the three feature maps pass through the multi-scale feature cascade module, the output feature maps collectively carry feature information from diverse scales. Additionally, to meet the network's inference speed requirements, all convolutions within the multi-scale feature cascade module are replaced with depthwise separable convolutions, which reduces computational overhead and increases speed.
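The FeatureMap1 and FeatureMap2 fusion step described above (interpolate, concatenate channels, apply a 1 × 1 convolution) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes FeatureMap1 is the coarser map, uses corner-aligned bilinear interpolation, represents a feature map as a list of 2-D channels, and takes the 1 × 1 weights as given, whereas in DF-Net they are learned. All function names are hypothetical.

```python
def resize_bilinear(channel, out_h, out_w):
    """Resize one 2-D channel with corner-aligned bilinear interpolation."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for i in range(out_h):
        # Fractional source row for output row i.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
            bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def pointwise_conv(channels, weights):
    """1 x 1 convolution: each output channel is a weighted sum of inputs."""
    height, width = len(channels[0]), len(channels[0][0])
    return [[[sum(wk * ch[r][c] for wk, ch in zip(w_out, channels))
              for c in range(width)]
             for r in range(height)]
            for w_out in weights]


def cascade(coarse, fine, weights):
    """Upsample `coarse` to the size of `fine`, concatenate, then mix channels."""
    height, width = len(fine[0]), len(fine[0][0])
    upsampled = [resize_bilinear(ch, height, width) for ch in coarse]
    merged = upsampled + fine  # channel-wise concatenation
    return pointwise_conv(merged, weights)


# One-channel toy maps: a 2 x 2 coarse map fused into a 3 x 3 fine map.
feature_map1 = [[[0.0, 2.0], [4.0, 6.0]]]
feature_map2 = [[[1.0] * 3 for _ in range(3)]]
fused = cascade(feature_map1, feature_map2, weights=[[1.0, 1.0]])
print(fused[0])  # [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
```

The FeatureMap3 branch would follow the same pattern with a stride-2 convolution in place of the interpolation step.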

The Feature Pyramid Module
The convolution operation in a convolutional neural network is a weighted summation between a sliding window and the feature map. Consequently, the size of the convolution kernel determines how much of the feature map each convolution operation can draw on. When a 3 × 3 convolution kernel traverses the feature map, each value of the output feature map is a high-dimensional feature obtained by weighted summation over a 3 × 3 region of the input feature map. Likewise, for 5 × 5 or 7 × 7 convolutions, the features in the output feature map represent 5 × 5 or 7 × 7 regions of the input feature map; this region is the receptive field of the convolution kernel. Extracting features over neighborhoods of different sizes therefore requires convolution kernels of different sizes, which correspond to distinct receptive fields. When the target in the current feature map is relatively large, the small receptive field of a small convolution kernel may not adequately cover the target's characteristics. Conversely, large convolution kernels may fail to capture fine target details. Faces, for instance, contain both fine details such as the eyes, nose, and mouth and global information about the overall facial structure. Consequently, relying on a single kernel size to extract facial features would fail to capture all pertinent information.
To capture target features ranging from fine details to global context, this paper employs the feature spatial pyramid structure depicted in Figure 3. Within this structure, three convolution kernels with distinct receptive fields, namely 3 × 3, 5 × 5, and 7 × 7, are employed. Because convolution operations can lose some of the original information, the outputs of these three convolutions are concatenated with the original input feature map along the channel dimension. This approach fuses the feature extraction results from diverse receptive fields. While larger kernels expand the receptive field, they also introduce more parameters and computation. Stacking multiple smaller convolutions can therefore cover the same receptive field as a larger convolution with fewer parameters. Alternatively, dilated convolutions can enlarge the receptive field without increasing the parameter count.
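The trade-off between kernel size, receptive field, and parameter count can be checked with a short calculation. The sketch below computes the receptive field of stride-1 convolutions applied in sequence (optionally dilated) and the weight count of a k × k convolution with equal input and output channels. The channel count of 16 is an illustrative assumption, and bias terms are ignored.

```python
def stacked_receptive_field(kernel_sizes, dilations=None):
    """Receptive field of stride-1 convolutions applied one after another."""
    dilations = dilations or [1] * len(kernel_sizes)
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # each layer widens the field by (k - 1) * dilation
    return rf


def conv_weights(k, channels):
    """Weights of a k x k convolution with `channels` inputs and outputs."""
    return k * k * channels * channels


# Two stacked 3 x 3 convolutions see a 5 x 5 region, three see 7 x 7.
print(stacked_receptive_field([3, 3]), stacked_receptive_field([3, 3, 3]))  # 5 7

# A single 3 x 3 convolution with dilation 2 or 3 also reaches 5 x 5 or 7 x 7.
print(stacked_receptive_field([3], [2]), stacked_receptive_field([3], [3]))  # 5 7

# Weight counts at 16 channels: one 5 x 5 versus two 3 x 3 convolutions.
print(conv_weights(5, 16), 2 * conv_weights(3, 16))  # 6400 4608
```

At the same channel width, three stacked 3 × 3 convolutions use 27C² weights against 49C² for a single 7 × 7 convolution, while a dilated 3 × 3 convolution keeps the 9C² weights of the undilated one.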

Definition of Loss Function
Equation (1) represents the loss function adopted by DF-Net. This loss function consists of three components. First, the classification loss determines whether an object is a face. Second, the regression loss measures the accuracy of the face detection box. Third, the facial feature point regression loss contributes to the precise localization of the face detection box.
The classification loss, denoted Loss_class, determines whether an entity is a face. Here, p is the network's predicted value and y is the true value from the dataset. The binary cross-entropy loss function is applied for this classification, which helps the network distinguish a face from its surroundings by minimizing the cross-entropy loss. For the regression loss of the face detection bounding box, an IoU (Intersection over Union) loss function is employed. Here, A is the face detection box predicted by the network, and B is the ground-truth face detection box. The network minimizes the disparity between the predicted and actual boxes by increasing the intersection-over-union ratio between the two bounding boxes, so that the prediction gradually converges towards the ground-truth face detection box.
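The two loss terms described above can be sketched for a single prediction as follows. This is an illustrative sketch, not a reproduction of Equation (1): it assumes boxes are given as (x1, y1, x2, y2) corner coordinates and uses the common 1 − IoU form of the IoU loss, which may differ from the exact form used in DF-Net. The function names are hypothetical.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(box_a, box_b):
    """Assumed form: 1 - IoU, which is 0 for a perfect match."""
    return 1.0 - iou(box_a, box_b)


predicted, ground_truth = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou(predicted, ground_truth))        # overlap 1, union 7
print(iou_loss(ground_truth, ground_truth))  # 0.0
print(binary_cross_entropy(0.9, 1))
```

Minimizing the IoU loss in this form is equivalent to maximizing the overlap ratio between the predicted box A and the ground-truth box B.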

Experimental Environment
The training environment for this paper is outlined in Table 2. The hardware comprises an Intel Core i5-11260H CPU, an NVIDIA RTX 3050 GPU, and 32 GB of memory. The algorithm is implemented in Python using the PyTorch deep learning framework. In the subsequent experiments and algorithm tests, the algorithm is deployed and executed on the CPU.

WiderFace Dataset
The WiderFace dataset is employed in this paper. It was released in 2015 by the Chinese University of Hong Kong [32] and offers a more comprehensive and inclusive classification of facial images. With nearly 400,000 annotated face instances, it covers 61 categories that capture diverse facial attributes. For example, the "Scale" category contains multiple faces within a larger scene, while the "Occlusion" category consists solely of occluded faces. Beyond facial classification, the WiderFace dataset also includes facial feature points: five salient points consisting of the two eye pupils, the nose tip, and the two mouth corners. For the dataset division, 90% of the data are allocated for training, while the remaining 10% are used as the test set.
Because the dataset includes facial feature point information, these points can be incorporated into the detected faces as a supplementary constraint on the face detection process. This requires a quality factor, denoted α, which takes one of the values 0.25, 0.5, 0.75, and 1. Restricting α to these values prevents the auxiliary term from interfering excessively with the core face detection loss, which remains dominant. The auxiliary loss thus enforces the facial feature point constraint within the overall face detection algorithm.
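One plausible way to read the role of α is as a weight on the facial feature point term of the combined loss. The sketch below shows that reading only; the additive form, the function name total_loss, and the example loss values are assumptions for illustration and are not confirmed by the text of Equation (1).

```python
def total_loss(loss_class, loss_box, loss_landmark, alpha):
    """Assumed combination: detection terms plus an alpha-weighted landmark term."""
    return loss_class + loss_box + alpha * loss_landmark


# Hypothetical per-batch loss values, swept over the four candidate alphas.
for alpha in (0.25, 0.5, 0.75, 1.0):
    print(alpha, total_loss(0.2, 0.3, 0.4, alpha))
```

Under this reading, values of α no greater than 1 keep the landmark term from outweighing the classification and box regression terms.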

Network Training
During training, the images in the training set are resized to a uniform size. To preserve the texture and contextual details of the images, they are padded with gray pixels, so that the image content remains intact while the sizes become uniform. Training uses the Adam optimizer and a MobileNet-v2 backbone pre-trained on ImageNet. The initial learning rate is 0.001 and is reduced by a factor of 10 after every 50 training iterations. The training batch size is 5. DF-Net is trained for 150 iterations. Figure 4 shows the loss curve during training; the network converges within approximately 120 iterations.
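The learning rate schedule described above is a step decay, which can be written as a one-line function. The sketch below uses the stated base rate (0.001), step size (50 iterations), and decay factor (10); the function name and the zero-based iteration index are assumptions.

```python
def learning_rate(iteration, base_lr=0.001, step=50, gamma=0.1):
    """Step decay: multiply the rate by gamma after every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


# Rates at the boundaries of the three 50-iteration stages (iterations 0 to 149).
for it in (0, 49, 50, 99, 100, 149):
    print(it, f"{learning_rate(it):.0e}")
```

Over the 150 training iterations this yields three stages with rates of 1e-3, 1e-4, and 1e-5, which mirrors the behavior of a standard step-decay scheduler in common deep learning frameworks.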

Results and Analysis
DF-Net uses both the Multi-Scale Feature Cascade Module and the Feature Pyramid Module. The Multi-Scale Feature Cascade Module lets feature maps of different scales pass through it, producing output feature maps that carry feature information from various scales. This greatly enriches the semantic information of the feature maps, making it easier to obtain accurate facial information. The Feature Pyramid Module enables multi-scale detection: because faces in different images may have different scales, it allows the detector to perform face detection at multiple scales and to recognize faces regardless of their distance or scale within the image. This significantly enhances the robustness and accuracy of face detection. To validate the roles of these two modules, we removed each module individually and then trained and tested on the WiderFace dataset. Comparing the detection results with the ground truth showed that removing either module decreased accuracy by approximately 2% to 3%. This experimental validation confirms that both the Multi-Scale Feature Cascade Module and the Feature Pyramid Module contribute to the accuracy of face detection.

To further enhance the model's performance, MobileNet-v1 and MobileNet-v2 are compared as backbone networks, as shown in Table 3. In MobileNet-v2, the bottleneck structure allows the channel count transformation ratio, referred to as the bottleneck coefficient, to be adjusted. This coefficient is explored at values of 0.25, 0.5, 0.75, and 1.
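To make the effect of the bottleneck coefficient concrete, the sketch below assumes that it acts like the width multiplier commonly used with MobileNet-v2, scaling the per-stage channel counts and rounding them to a multiple of 8. The exact rule used in DF-Net is not given here, so the helper names and the rounding are assumptions; the base channel list is that of the standard MobileNet-v2 stages.

```python
# Output channels of the standard MobileNet-v2 stages (stem + 7 bottleneck groups).
BASE_CHANNELS = [32, 16, 24, 32, 64, 96, 160, 320]


def make_divisible(value, divisor=8):
    """Round a channel count to a multiple of `divisor`, never below it.

    If rounding down would lose more than 10% of the channels,
    the next multiple up is used instead.
    """
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scaled_channels(coefficient, base=BASE_CHANNELS):
    """Channel counts after applying an assumed bottleneck coefficient."""
    return [make_divisible(c * coefficient) for c in base]
```

With these assumptions a coefficient of 0.25 shrinks the widest stage from 320 to 80 channels, which shows why smaller coefficients reduce model size at the cost of representational capacity.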
Notably, both backbones are trained and tested from scratch on the WiderFace dataset. Neither is pre-trained on ImageNet, and identical parameters are used: 50 training iterations, a learning rate of 0.001, and the same loss function. Table 3 indicates that although the bottleneck structure can reduce model size, the channel transformation it entails may lead to information loss. Since the full network in this paper already meets real-time requirements, a bottleneck coefficient of 1 is adopted in the algorithm.
In the formulation of the loss function, this paper applies a quality factor to the auxiliary loss function, which uses facial feature point information to constrain facial attributes. Because the facial feature point loss plays only an auxiliary role, the value of the quality factor has to be determined experimentally. As with the bottleneck coefficient, the quality factor is set to 0.25, 0.5, 0.75, and 1. Table 4 reports the evaluation of DF-Net under these quality factors. All other parameters are held fixed, with the bottleneck coefficient set to 0.25 to speed up training. The results in Table 4 show that a smaller quality factor yields higher face detection accuracy. This is mainly because the auxiliary facial feature point loss then accounts for a smaller share of the total loss, allowing the network to prioritize the core task of face detection. With its weight reduced, the facial feature point loss serves as a supplementary constraint on face detection.
With these parameters fixed, DF-Net is evaluated. Because the network detects a single class (face), the evaluation relies solely on Average Precision (AP), a standard metric in object detection. As shown in Table 5, DF-Net is compared with other well-known face detection networks: MTCNN, Faster-RCNN, and RetinaFace. The comparison is conducted under consistent conditions, using the same training dataset and the same learning rate. Face detection performance is evaluated on the three difficulty levels of the dataset (easy, medium, and hard) by comparing the detected faces with the ground truth labels. Table 5 shows that DF-Net achieves an accuracy of 90.15% on the easy subset, 85.63% on the medium subset, and 74.89% on the hard subset. Compared with the other three methods, our approach significantly improves the accuracy of face detection. In addition, DF-Net reaches a processing speed of 57 fps with a model size of 4.34 M. It therefore delivers real-time performance while remaining compact, making it well suited for deployment on mobile devices.
Figure 5 shows detection results of the proposed method on faces wearing masks or sunglasses, faces in profile, and faces partially covered by objects. The detections are accurate in all of these cases.
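For readers unfamiliar with the metric, the following is a minimal sketch of a non-interpolated Average Precision for a single class. It is an illustration only: the function names, the box format `(x1, y1, x2, y2)`, and the IoU threshold of 0.5 are assumptions, and the official WiderFace evaluation protocol differs in its details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one class.

    detections: list of (score, box); ground_truth: list of boxes.
    Detections are visited in order of decreasing score, and each ground
    truth box can be matched by at most one detection.
    """
    matched, hits = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best, best_iou = None, iou_threshold
        for i, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if i not in matched and overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
        hits.append(best is not None)
    # Sum the precision at the rank of every true positive.
    ap, true_pos = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_pos += 1
            ap += true_pos / rank
    return ap / len(ground_truth) if ground_truth else 0.0
```

A detection counts as correct only if it overlaps an unmatched ground truth face by at least the threshold, so both missed faces and false alarms lower the score.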


Conclusions
To address the cumbersome deployment and slow inference of contemporary face detection systems, this study introduces a face detection algorithm based on DF-Net. To speed up inference, MobileNet-v2 is employed as the backbone. In addition, a multi-scale feature cascade module and a spatial pyramid module enable comprehensive multi-scale feature extraction from the feature maps. The algorithm is trained on the publicly available WiderFace dataset and then evaluated on a separate test set. The network is examined in terms of model size, processing speed, and detection accuracy. A comparison with three other classic face detection networks reveals a significant improvement in face detection accuracy with DF-Net. Furthermore, the network hyperparameters are explored experimentally, affirming the efficacy of the proposed algorithm. Together, these results demonstrate rapid and accurate face detection. DF-Net offers real-time performance while keeping a compact model size, making it suitable for deployment on mobile devices. These two advantages align well with practical engineering applications. DF-Net can be applied in scenarios such as pedestrian detection in autonomous driving and facial payment in mobile transactions. These scenarios often require face detection on platforms with limited memory and computing capabilities and demand low-latency, real-time responsiveness. Hence, this method holds substantial application potential. In future work, we will further analyze facial features once the facial regions have been extracted.

Figure 1. The overall framework of DF-Net.


Electronics 2023, 12

Figure 5. Face detection results on the WiderFace dataset. Green boxes mark the detected faces.


Table 3. DF-Net comparison of different backbones.

Table 4. DF-Net with different quality factors.

Table 5. Comparison of DF-Net with other networks.