Improved Face Detection Method via Learning Small Faces on Hard Images Based on a Deep Learning Approach

Most face recognition and face analysis systems start with face detection. Early techniques, such as Haar cascades and histograms of oriented gradients, rely mainly on features designed by hand from particular images, and they do not generalize well to images taken in uncontrolled (in-the-wild) conditions. The rapid progress of deep learning in computer vision has since produced a number of deep learning-based face detection frameworks, many of which have significantly improved accuracy in recent years. Nevertheless, detecting small, blurred, or partially occluded faces under large variations in scale and pose in uncontrolled conditions has been studied for many years and is still not entirely resolved. In this paper, we propose a single-stage face detector built on a RetinaNet baseline to handle this challenging face detection problem. We made network improvements that boosted detection speed and accuracy. In the experiments, we used two popular datasets, WIDER FACE and FDDB. Specifically, on the WIDER FACE benchmark, our proposed method achieves an AP of 41.0 at a speed of 11.8 FPS with a single-scale inference strategy and an AP of 44.2 with a multi-scale inference strategy, which are competitive results among one-stage detectors. We trained our model using the PyTorch framework and obtained an accuracy of 95.6% for successfully detected faces. The experimental results, assessed with standard performance evaluation metrics, show that the proposed model achieves better detection and recognition results than the compared methods.


Introduction
Face detection, the objective of which is to locate faces so that information can be extracted from facial images, is one of the most well-known, essential, and practical problems in computer vision. It is an essential first step in face verification [1], face identification and clustering [2], facial landmark detection [3], facial attribute classification [4], face alignment [5], and face tracking [6]. In 2001, Viola and Jones introduced an object detection framework [7]. Over the past ten years, numerous face detection techniques have been developed. Despite notable advancements over the last couple of decades, accurate and efficient face detection in the wild remains a challenge. Detection performance depends on variations in pose, occlusion, scale, lighting, image degradation, facial expression, and other factors. Face detection also differs from generic object detection: it has smaller aspect-ratio variations but considerably larger scale variations, ranging from several pixels to thousands of pixels.

Related Work
In this section, we discuss a number of methods that have been used for face detection and recognition over the years, from early techniques to modern ones. Various techniques for face detection and identification have emerged during the last few years, and all face detection techniques can be grouped under four primary object detection algorithms [33]. Early face detection was studied extensively in computer vision research [34]. Despite many extensive studies in recent years based on the technique proposed by Viola and Jones [7], practical results on face detection remain insufficient. The VJ framework [7] was one of the first to achieve real-time face detection by applying rectangular Haar-like features to a cascaded AdaBoost classifier. However, these methods are not trained end-to-end: feature learning and classifier training are performed separately. As a result, they reach a good running speed but not a satisfactory accuracy. SVMs [35] can also be trained for face detection, for example on Haar wavelet features. Training on Haar wavelet features extracted from positive and negative examples helps to distinguish the two classes; however, the features were too weak to capture faces in varied poses, which resulted in poor classifier performance and inconclusive results. Haoxiang Li et al. [36] proposed a CNN cascade operating at different resolutions, where the background area was discarded in the fast, low-resolution stages, and a small number of difficult candidates were carefully evaluated in the final, high-resolution stage. Applying CNN-based calibration after each detection stage in the cascade improved localization and reduced the number of candidates for the later stages. The position of the detection window is adjusted using the output of each calibration stage, and the adjusted window serves as the input to the following stage. CNN-based face detectors inherit many achievements from existing generic object detection techniques. They fall into two categories: two-stage approaches (Faster R-CNN [31,37]) and one-stage approaches (SSD [18,26] and RetinaNet [14,17]).
The two-stage approach uses a "propose and refine" mechanism that gives highly accurate localization. In contrast, the one-stage approach densely samples face positions and scales, which leads to a strong imbalance between positive and negative samples during training. Sampling [38] and reweighting [13] techniques are widely used to reduce this imbalance. Compared to the two-stage approach, the one-stage approach is more efficient and has a higher recall, but it carries the risk of a higher false-positive rate and less accurate localization.
Recently, MTCNN [21] performed face detection using a sliding-window method and relied on an image pyramid. HR [39] is a multilevel variant of MTCNN that also requires an image pyramid. The image pyramid has several drawbacks: it is slow and has a high memory cost. It considers HF when designing the detection branch and provides an anchor-matching technique to improve the hit rate. In [40], Zhu et al. focused on small-face detection by providing powerful anchor generation and matching techniques. We conclude that anchor-related techniques are important for face detection. Subsequently, S3FD [15] and PyramidBox [17] enhanced the backbone with a low-level feature pyramid network (LFPN) to improve detection at various scales. SSH [23] builds three detection modules that work together with a context module for scale-invariant face detection. DSFD [28] features a feature enhancement module, early-layer supervision, and an improved anchor matching procedure for better initialization. S3FD, PyramidBox, SSH, and DSFD use VGG16 as the backbone, which results in a large model size and inefficient computation. FaceBoxes [41] drastically shrinks the size of the input face image in order to run the face detector in real time: after four layers consisting of two convolution layers and two pooling layers, a large stride of 32 is reached. FaceBoxes is fast but has limited accuracy because it cannot detect small faces. For face detection, we use the RetinaNet architecture, as mentioned earlier. All identified faces are recognized using the region offering network (RON), the high feature generation pyramid (HFGP), and the low feature generation pyramid (LFGP). We trained our model using the PyTorch framework, which provided an accuracy of 95.6% for successfully detected faces.

Proposed Face Detection Method
The proposed method builds on the RetinaNet deep learning framework, an advanced design for generic object detection. It has two essential parts: (1) a region offering network (RON) that compiles a list of region proposals likely to contain faces, and (2) a prediction branch that classifies a region of the image as a face and refines the boundaries of these regions. The two parts share parameters in the convolution layers used for feature extraction, which allows the model to detect faces at a competitive speed. In this work, we aim to improve the recall and accuracy of face detection using the RetinaNet architecture, and we train our face detection model following the proposed system shown in Figure 1. First, we trained the RetinaNet model [14] on the WIDER FACE dataset [9]. We then tested the pre-trained model on the same dataset to generate hard negatives. In the second step of our training procedure, these hard negative examples are fed into the network. By training on these hard negative samples, the resulting model produces fewer false positives. We used the FDDB dataset [42] to further fine-tune the model. However, as this dataset contains only 5171 faces in 2845 images, fine-tuning on it alone may not be a wise choice. We therefore first pre-trained our model on the WIDER FACE dataset, a considerably larger face dataset with much more challenging cases, before fine-tuning it on FDDB. Additionally, we used multi-scale training during the final fine-tuning stage. We adopted an end-to-end training methodology similar to that of RetinaNet because of its effectiveness and simplicity. As a final optional step, we transformed the obtained detection bounding boxes into rectangular regions of human faces. In the following, we discuss the five key steps of our solution in detail.

Feature Extraction-Region Offering Network
Our network consists of three parts. First, to create the base feature, the high feature generation pyramid (HFGP) combines shallow and deep features. For instance, conv4_3 and conv5_3 of ResNet provide multilevel semantic information for the feature maps. Second, low feature generation pyramids (LFGPs) and convolution layers are stacked alternately. In particular, each LFGP generates low-level feature maps at scales different from that of the HFGP. The convolution layers combine the base feature with the largest output feature map of the preceding pyramid layers, and the combined feature maps are supplied to the next convolution layer. These convolution layers learn the features from the pyramid layers, starting from the base feature $F_{base}$. The output multi-scale features are calculated as follows:

$$[f_1^l, f_2^l, \ldots, f_i^l] = \begin{cases} T_l(F_{base}), & l = 1 \\ T_l\big(P(F_{base}, f_i^{l-1})\big), & l = 2, \ldots, L \end{cases}$$

where $F_{base}$ denotes the base feature, $f_i^l$ denotes the feature with the i-th scale within the l-th LFGP, $L$ denotes the number of LFGPs, $T_l$ denotes the l-th LFGP processing, and $P$ denotes HFGP processing. Third, the Prediction Step (PS) aggregates the multi-level, multi-scale features by means of a scale-wise feature concatenation operation and a channel-wise attention mechanism.

High Feature Generation Pyramid (HFGP)
HFGP fuses features from different levels of our network, which is essential for creating the final multi-level feature pyramid. It uses 1 × 1 convolution layers to compress the channels of the input features and a concatenation operation to combine these feature maps. In particular, because HFGP takes feature maps with different scales in the backbone as input, it applies one upsampling operation to rescale the deep features to the same scale before the concatenation operation. Building HFGP on deeper backbone features results in stronger detection, and higher-resolution inputs bring about better feature extraction and better performance on small objects.

Low Feature Generation Pyramid (LFGP)
LFGP differs from HFGP and from the feature pyramid used in RetinaNet. The pyramid network consists of a chain of 3 × 3 convolution layers with stride 2. The subsequent convolution layers then use the outputs of those layers as their reference set of feature maps, whereas the standard pyramid built on the backbone selects the final layer at every level. To enhance the learning ability and maintain feature smoothness, we also added 1 × 1 convolution layers after the upsampling and element-wise sum operations in the top convolution branch.
The outputs of every convolution layer in the HFGP and LFGP are combined to provide the multi-scale features of the current level. Overall, the outputs of the stacked LFGPs form multi-level, multi-scale features: the front LFGPs provide shallow-level features, the middle LFGPs provide medium-level features, and the back LFGPs provide deep-level features.

Prediction Step
The Prediction Step (PS) aims to combine the multi-level and multi-scale features produced by the LFGPs and HFGP into a single feature pyramid. The initial step in the PS is to concatenate features of the same scale along the channel dimension. The aggregated feature pyramid can be expressed as $X = [X_1, X_2, \ldots, X_i]$, where $X_i = \mathrm{Concat}(x_i^1, x_i^2, \ldots, x_i^L) \in \mathbb{R}^{W_i \times H_i \times C}$ refers to the features of the i-th largest scale (here $x_i^l$ is the i-th scale feature of the l-th LFGP). Every scale inside the aggregated pyramid therefore includes features from multiple depth levels. However, simple concatenation operations are not sufficiently adaptive. For the prediction head attached to every feature, one 3 × 3 convolution is shared by all three branches, after which every branch applies its own 3 × 3 convolution in parallel. Our prediction head is extremely lightweight and fast compared with that of RetinaNet. Additionally, we trained class prediction using OHEM [43] with a negative-to-positive ratio of 3:1 and softmax cross-entropy over c positive labels and one background label. As a result, unlike RetinaNet, we did not use focal loss, which we found to make no significant difference in our case.
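To make the 3:1 negative-to-positive sampling concrete, the following is a minimal sketch of hard example selection in the spirit of OHEM. It assumes that per-anchor classification losses have already been computed; the function name `select_ohem_indices` and its arguments are hypothetical and are not taken from the authors' implementation.

```python
def select_ohem_indices(losses, is_positive, neg_pos_ratio=3):
    """Keep all positive anchors plus the highest-loss negatives.

    losses:       per-anchor classification loss values (assumed precomputed)
    is_positive:  per-anchor booleans, True when the anchor matches a face
    Returns the sorted indices of the anchors that contribute to the loss.
    """
    positives = [i for i, flag in enumerate(is_positive) if flag]
    negatives = [i for i, flag in enumerate(is_positive) if not flag]
    # At most neg_pos_ratio negatives are kept for every positive anchor.
    num_negatives = min(len(negatives), neg_pos_ratio * len(positives))
    # "Hard" negatives are the background anchors with the largest loss.
    hardest = sorted(negatives, key=lambda i: losses[i], reverse=True)
    return sorted(positives + hardest[:num_negatives])


losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.9, 0.3]
is_positive = [True, False, False, False, False, False, False]
print(select_ohem_indices(losses, is_positive))
```

With one positive anchor, only the three background anchors with the largest loss are kept, so the many easy negatives do not dominate the gradient.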

Concatenation
To encourage features to focus on the channels where they are most beneficial, we propose a channel-wise attention module. Following the PS block, we use global average pooling in the compression (squeeze) step to generate channel-wise statistics $z \in \mathbb{R}^{C}$. To fully capture channel-wise dependencies, the following excitation stage learns the attention mechanism using two layers:

$$s = \sigma\big(W_2\,\delta(W_1 z)\big)$$

where $\delta$ is the ReLU function, $\sigma$ is the sigmoid function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and $r$ is the reduction ratio ($r = 16$ in our experiments). The final output is obtained by reweighting the input $F$ with the activation $s$:

$$\tilde{F}^{c} = s_c \cdot F^{c}$$

where $F^{c}$ is the c-th channel of $F$; each of the features is enhanced or weakened by this rescaling operation.
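The channel attention step can be illustrated with a small, dependency-free sketch. It assumes the standard squeeze-and-excitation ordering (ReLU after the first weight matrix, sigmoid after the second) and uses plain nested lists in place of tensors; the function name `channel_attention` and the toy weights are illustrative assumptions, not the authors' code.

```python
import math


def channel_attention(feature, w1, w2):
    """Reweight each channel of `feature` by a learned scalar in (0, 1).

    feature: list of C channels, each an H x W nested list
    w1:      (C/r) x C weight matrix, w2: C x (C/r) weight matrix
    Returns (s, rescaled_feature).
    """
    # Squeeze: global average pooling gives one statistic per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: hidden = ReLU(W1 z), then s = sigmoid(W2 hidden).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Rescale: every value in channel c is multiplied by s[c].
    rescaled = [[[s_c * v for v in row] for row in ch]
                for s_c, ch in zip(s, feature)]
    return s, rescaled


feature = [[[1.0, 3.0]], [[2.0, 2.0]]]   # C = 2 channels, each 1 x 2
w1 = [[0.5, 0.5]]                        # reduces 2 channels to 1 (r = 2 here)
w2 = [[0.0], [1.0]]                      # expands back to 2 channels
s, out = channel_attention(feature, w1, w2)
print(s, out)
```

In this toy case the first channel receives a weight of 0.5 and is attenuated, while the second receives a weight close to 0.88 and is largely preserved.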

Implementation and Results
In this section, we present experiments on the challenging dataset from the WIDER FACE bounding box detection challenge. We followed the WIDER FACE protocol, in which the faces in the images cover different detection difficulties, such as occlusions, hard poses, out-of-focus faces, and low resolution. For comparison with state-of-the-art techniques, we report results on the test-dev split of open-access face datasets, which is publicly available, labelled, and does not require the use of an evaluation server. Then, we describe the results of the ablation study evaluated on the minimal split for convenience.

Implementation Details
We trained our model using the PyTorch framework [44]. ResNet-50, pre-trained on ImageNet, was selected as the backbone of our CNN. The WIDER FACE training and validation sets were used as the training data in the first stage. We assign a difficulty value to each ground-truth annotation in accordance with the levels shown in Table 1. Specifically, all faces are initialized with a difficulty of zero. If a face satisfies a condition listed in Table 1, we add the corresponding difficulty value. Annotations whose difficulty values are greater than 2 are ignored, and all images containing more than 1000 annotations are also discarded, as in our previous studies [45][46][47][48][49]. On this dataset, the pre-trained ResNet architecture was trained for 200 iterations with a learning rate of 0.0001. In this training procedure, the images were first resized while maintaining the original aspect ratio: the shorter side was rescaled to 600, and the longer side was capped at 1000. Horizontal flipping was used for data augmentation. For the region offering network branch, 12 anchors were employed in the training process, covering four scales (64 × 64, 128 × 128, 256 × 256, and 512 × 512) and three aspect ratios (1:1, 1:2, and 2:1). Following non-maximum suppression (NMS), 2000 region offers are retained. In the second step, the aforementioned dataset is fed into the network.
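The 12 anchors arise as the product of four scales and three aspect ratios. The sketch below enumerates them under two assumptions that the text does not state: each ratio is read as width:height, and each anchor keeps the area of its square base size. The function name `make_anchors` is hypothetical.

```python
def make_anchors(scales=(64, 128, 256, 512), ratios=((1, 1), (1, 2), (2, 1))):
    """Return (width, height) pairs, one per scale and aspect ratio.

    Assumption: a ratio (rw, rh) means width:height, and every anchor keeps
    an area of scale * scale regardless of its aspect ratio.
    """
    anchors = []
    for scale in scales:
        for rw, rh in ratios:
            # Solve width / height = rw / rh with width * height = scale^2.
            width = scale * (rw / rh) ** 0.5
            height = scale * (rh / rw) ** 0.5
            anchors.append((width, height))
    return anchors


for width, height in make_anchors():
    print(f"{width:7.1f} x {height:7.1f}")
```

Running the sketch prints 12 shapes, for example 64.0 × 64.0 for the smallest square anchor and roughly 45.3 × 90.5 for its 1:2 counterpart.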
The "hard negatives" are those output regions whose confidence scores are above 0.8 while their IoU with any ground-truth annotation is below 0.5. The hard negative mining procedure was then run for 150 iterations with a fixed learning rate of 0.0001, during which we ensured that those hard negatives were selected together with the other sampled regions. To produce our final detection model, we fine-tuned the resulting model on the FDDB dataset. We performed a series of five-fold cross-validation experiments to examine the detection performance of our model on FDDB. In addition to horizontal flipping, we randomly resized each face image before feeding it into the network.
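The hard-negative criterion can be written as a short filter. The sketch below reads "IoU with any ground-truth annotation below 0.5" as "IoU below 0.5 with every ground-truth box", which is an interpretation rather than something the text states explicitly; boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hard_negatives(detections, gt_boxes, score_thr=0.8, iou_thr=0.5):
    """Keep confident detections that do not overlap any ground-truth face.

    detections: list of (box, score) pairs produced by the stage-one model
    """
    return [(box, score) for box, score in detections
            if score > score_thr
            and all(iou(box, gt) < iou_thr for gt in gt_boxes)]


gt = [(0, 0, 10, 10)]
dets = [((0, 0, 10, 10), 0.95),     # true positive: overlaps the face
        ((20, 20, 30, 30), 0.90),   # confident but far from any face
        ((20, 20, 30, 30), 0.50)]   # low confidence, ignored
print(hard_negatives(dets, gt))
```

Only the confident detection that misses every annotated face is returned; such regions are the ones fed back into training in the second step.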
We resized every face image such that its shorter side was one of 480, 600, or 750 pixels. In addition, we ensured that the longer side did not exceed 1250, similar to the policy adopted in the first step. During the training process, we used a feature concatenation approach to combine the features pooled from the conv3_3, conv4_3, and conv5_3 layers. The scale applied after the features are concatenated can be either refined or fixed. For both the training and test stages, we applied a fixed scale of 4700 to the entire blob. We obtained our final model after 80 iterations of fine-tuning with a fixed learning rate of 0.001.
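The resizing rule (a target shorter side with a cap on the longer side) can be sketched as follows. The helper names are hypothetical, and rounding to whole pixels is an assumption; the same function covers the first stage with a shorter side of 600 and a cap of 1000.

```python
import random


def resized_shape(width, height, short_side, max_long_side):
    """Scale so the shorter side equals short_side, unless the longer side
    would then exceed max_long_side, in which case the cap decides."""
    scale = short_side / min(width, height)
    if max(width, height) * scale > max_long_side:
        scale = max_long_side / max(width, height)
    return round(width * scale), round(height * scale)


def random_multiscale_shape(width, height, short_sides=(480, 600, 750),
                            max_long_side=1250, rng=random):
    """Pick one of the fine-tuning scales at random for this image."""
    return resized_shape(width, height, rng.choice(short_sides), max_long_side)


print(resized_shape(800, 600, 600, 1000))    # first-stage rule
print(resized_shape(2000, 500, 600, 1000))   # the cap on the longer side wins
print(random_multiscale_shape(1024, 768))    # one of the three fine-tuning scales
```

For a very wide image such as 2000 × 500, the cap takes precedence, so the shorter side ends up below the nominal target.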
Next, in the test phase, a query face image is first resized using the same method as in the first stage [50,51]. In the region offer generation stage, the region offering network branch generates 100 region offers for every facial image. A selected region is considered a face if its classification confidence score is greater than 0.8. The non-maximum suppression threshold in our study was set to 0.3. For analysis purposes, we also output all region offers with confidence scores greater than 0.001 in our experiments.
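The test-time decision rule can be sketched as score thresholding followed by greedy NMS. The order of the two steps is an assumption, since the text gives only the thresholds (0.8 for confidence and 0.3 for NMS); the function names are hypothetical, and boxes are (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect_faces(proposals, score_thr=0.8, nms_thr=0.3):
    """Keep confident proposals and suppress overlapping duplicates.

    proposals: list of (box, score) pairs from the proposal stage
    Returns the surviving (box, score) pairs, highest score first.
    """
    candidates = sorted((p for p in proposals if p[1] > score_thr),
                        key=lambda p: p[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Greedy NMS: drop a box that overlaps an already kept, higher-scoring box.
        if all(iou(box, kept_box) <= nms_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept


proposals = [((0, 0, 10, 10), 0.95), ((1, 1, 11, 11), 0.90),
             ((50, 50, 60, 60), 0.85), ((100, 100, 110, 110), 0.50)]
print(detect_faces(proposals))
```

The second proposal overlaps the first with an IoU of about 0.68 and is suppressed, and the last proposal falls below the confidence threshold, leaving two detected faces.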
We trained the network on a computer with two NVIDIA Titan X GPUs, CUDA 9.2, and cuDNN 7.1.4, and also on an NVIDIA Tesla V100 to obtain results more quickly. The batch size was set to 32. On the NVIDIA Titan Xp with 12 GB of memory, training performance is limited if the batch size on a single GPU is less than 5.
For our model with the VGG-16 backbone, the complete training takes three and six days for input sizes of 320 × 320 and 512 × 512, respectively; with the ResNet-101 backbone, training at 512 × 512 takes five days.

Inference Speed
We also compared the inference speed of our model with that of state-of-the-art techniques. Because its FC layers have been removed, the reduced VGG-16 is a light backbone that extracts base features quickly. With the batch size set to 1, we computed the inference time per image by summing the CNN and NMS run times over 1000 face images and dividing by 1000. We propose a fast version with an input size of 320 × 320 and a standard version with an input size of 512 × 512, both built on the reduced VGG-16. Thanks to the PyTorch optimization, our model can quickly produce accurate results. As Table 2 demonstrates, the method benefits from the advantages of one-stage detection, and its multilevel structure yields a very clear and favorable speed-accuracy curve when compared to other approaches. Additionally, we reproduced and tested the speed of other techniques on our device for comparison. Table 3 gives a thorough comparison with the few published competitive methods on the WIDER FACE benchmark. To further demonstrate the efficiency of the proposed deep learning-based face detection method, we randomly selected qualitative face detection results for various situations, as shown in Table 3. They show how well the proposed model can identify and locate difficult cases, such as faces that are heavily occluded, in unusual poses, or under difficult illumination. A few false negatives are included; they correspond to challenging cases such as small, blurry, and heavily occluded faces. Table 3 lists the test results for the proposed model in ten distinct configurations. The batch size was set to 1, and only one NVIDIA Titan X PASCAL was used. The FDDB test-dev split was used for the testing. Additional statistical results were taken from the references.
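The per-image timing protocol (total CNN plus NMS time over the test images, divided by the number of images) can be sketched as below. The callables `run_cnn` and `run_nms` are placeholders for the detector's forward pass and its post-processing, not real APIs; with GPU execution one would typically also synchronize the device before reading the clock, which this CPU-only sketch omits.

```python
import time


def mean_inference_time(run_cnn, run_nms, images):
    """Average wall-clock time per image for the CNN pass plus NMS.

    run_cnn and run_nms are placeholder callables (hypothetical); each image
    is processed alone, mirroring a batch size of 1.
    """
    total = 0.0
    for image in images:
        start = time.perf_counter()
        raw_detections = run_cnn(image)
        run_nms(raw_detections)
        total += time.perf_counter() - start
    return total / len(images)


def frames_per_second(mean_seconds):
    """Convert a mean per-image time into frames per second."""
    return 1.0 / mean_seconds


# Dummy stand-ins so the sketch runs without a trained model.
seconds = mean_inference_time(lambda img: [img], lambda dets: dets, list(range(1000)))
print(f"{seconds * 1e6:.2f} microseconds per image")
print(frames_per_second(0.1))
```

Under this protocol, a reported speed of 11.8 FPS corresponds to roughly 85 ms per image.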
Note that our proposed model with a VGG backbone achieves an AP of 38.9, outperforming competing object detectors that have extremely robust backbones and sizable inputs. For instance, the AP of the deformable R-FCN was 37.5, and the AP of R-CNN with FPN was 36.2. The single-scale version with ResNet-101 has an AP of 38.8, which is comparable to modern two-stage detectors such as Mask R-CNN. Combining the model with ResNet-101 improved our results. Additionally, owing to PyTorch's optimization, it can operate at 15.8 frames per second (FPS). RefineDet obtains an AP of 41.8 by combining the advantages of one-stage and two-stage detectors, whereas CornerNet proposes key point regression for detection and obtains an AP of 42.1. In contrast, our proposed method, which achieves an AP of 44.2 and outperforms all one-stage detectors, is based on the regression algorithm of the original SSD and supports multi-scale, multi-level features. Because most approaches do not compare the speed of multi-scale inference strategies, owing to the range of tools and methods used, we only evaluated the speed of the single-scale inference method. We also compare against state-of-the-art one-stage and two-stage detectors in terms of model size, which indicates that the improvement of the proposed model is not solely due to increased depth or added parameters. Mask R-CNN with ResNeXt-101-32 × 8d-FPN has 205 M parameters, whereas CornerNet with Hourglass has 201 M parameters. In contrast, the proposed model with VGG contains only 147 M parameters. In terms of depth, it is not dominant either. The experimental results indicate that our improved face detection method accurately detects face regions. In addition, our method works effectively even when there are multiple faces in the frame sequences, as shown in Figures 2-7.

Evaluation Metrics
In our previous studies [52][53][54][55][56][57][58], we computed metrics such as the F-measure (FM), precision, and recall. The FM is the harmonic mean of precision and recall and balances the two rates. The precision is the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations. The recall is the ratio of the number of correctly predicted positive observations to the total number of observations in the actual class, as indicated in Equation (4). The following equations can be used to calculate the average precision and recall rates of face recognition methods:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

where TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes the number of false negatives. The FM is calculated using Equation (5), which considers both the precision and recall:

$$FM = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$

The average FM, recall, and precision of the proposed method were 95.6%. False detections occurred in 4.4% of cases and were caused by poor lighting or low-quality images. Wearing facial masks has been indispensable during the COVID-19 pandemic and, at the same time, has made facial recognition more difficult [59]. The model accuracy ranges between 0 and 1, and the metric scores reach their best values at 1. An evaluation of our method and other recently published face detection and recognition methods is presented in Table 4. Furthermore, we assessed the false positive results of the selected approaches. As seen in Figure 8, the proposed method had the lowest error rate. Additionally, the highly efficient multi-scale inference strategy significantly reduced face detection and classification errors. Overfitting was a major concern during training, as it affects nearly all deep learning models. We tried to reduce the risk of overfitting by using data augmentation to enlarge the training data and by applying feature selection techniques that keep the best features and remove unnecessary ones [60][61][62][63][64].
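As a small worked example of the precision, recall, and F-measure definitions above, the sketch below computes the three metrics from detection counts. The counts in the usage line are invented for illustration and are not results from this study.

```python
def precision_recall_fm(tp, fp, fn):
    """Compute precision, recall, and F-measure from detection counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fm = 2 * precision * recall / (precision + recall)
    return precision, recall, fm


# Illustrative counts only: 90 faces found, 10 false alarms, 30 faces missed.
precision, recall, fm = precision_recall_fm(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} FM={fm:.3f}")
```

With these counts, the precision is 0.9, the recall is 0.75, and the FM is about 0.818, which lies between the two rates but closer to the lower one, as expected of a harmonic mean.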

Conclusions
In this work, we introduced a novel deep learning-based face detection technique. It essentially includes two components: first, a region offering network (RON) for producing a list of region proposals that, in all likelihood, include faces or regions of interest (RoIs); and second, a prediction network for classifying a region of the image as a face and refining the bounds of these regions. These components share parameters in the feature extraction convolution layers, enabling the architecture to perform face detection at a competitive speed. In the experiments, we used two popular datasets, WIDER FACE and FDDB, with WIDER FACE used to train our model. Specifically, on the WIDER FACE benchmark, our proposed method achieves an AP of 41.0 at a speed of 11.8 FPS with a single-scale inference strategy and an AP of 44.2 with a multi-scale inference strategy, which are competitive results among one-stage detectors. We trained our model using the PyTorch framework, which provided an accuracy of 95.6% for successfully detected faces. The results show that our method is a strong choice for face detection because it achieves higher accuracy with a small model size and efficient computation.
Future tasks include solving blurry image problems under dark conditions and increasing the accuracy of the approach. We plan to develop a small real-time model with a reliable landmark-based face emotion recognition performance employing a variety of datasets in 3D CNN, 3D U-Net, and YOLOv environments.