An Efficient Multi-Scale Anchor Box Approach to Detect Partial Faces from a Video Sequence

Abstract: In recent years, face detection has received considerable attention in the field of computer vision, using both traditional machine learning and deep learning techniques. Deep learning is used to build the most recent and powerful face detection algorithms. However, partial face detection has yet to achieve remarkable performance. Faces are partially occluded by hair, hats, glasses, hands, and mobile phones, or are captured from side angles; fewer facial features can be identified in such images. In this paper, we present a deep convolutional neural network face detection method using an anchor box selection strategy. We limited the number of anchor boxes and scales, choosing only those relevant to face shapes. The proposed model was trained and tested on a popular and challenging face detection benchmark, the Face Detection Dataset and Benchmark (FDDB), and can also detect partially covered faces with better accuracy and precision. Extensive experiments were performed, with evaluation metrics including accuracy, precision, recall, F1 score, inference time, and frames per second (FPS). The results show that the proposed model detects faces, including those with occluded features, more precisely than other state-of-the-art approaches, achieving 94.8% accuracy and 98.7% precision on the FDDB dataset at 21 FPS.


Introduction
In computer vision, face detection has been a major focus for many years. The main aim of a face detection system is to locate the face in an image with its bounding box. Face detection is the prior step of important applications such as face recognition, face analysis, face mask detection, face tracking, and face alignment. Viola and Jones performed the pioneering face detection work by proposing the Haar-cascade feature extraction method [1]. Big data and high-performance computing systems have helped deep learning achieve remarkable results in many applications, including natural language processing, manufacturing, computer vision, healthcare, and speech recognition. Deep convolutional neural network (DCNN)-based methods have proven to be more effective than conventional methods for object detection. As a result, researchers have started applying DCNN methods to face detection.
Face detection approaches based on deep learning can be divided into two groups: two-stage object detectors and single-stage object detectors. Two-stage object detectors include Faster R-CNN [2], and single-stage object detectors mainly include YOLO [3] and SSD [4]. In two-stage object detection systems, region proposals are generated in the first stage and then used to recognize the object and to find the bounding box coordinates. Though these models are slower than single-stage models, they have proven to achieve better accuracy. On the other hand, single-stage anchor-based techniques [5][6][7] use regular, dense anchors over a wide range of scales and aspect ratios to identify faces. In these approaches, face detection with a bounding box is performed in a single pipeline of the network.
It is difficult to enhance the performance of single-stage face detection models, especially for partial or occluded faces. Detecting faces in real-world images is challenging because occluded faces present only partial features. Partial features result from parts of the face being hidden, e.g., by glasses, masks, hair, scarves, caps, other parts of the body, or side angles. Real-life examples of partial faces are presented in Figure 1. Recent DCNN models for face detection use image pyramid concepts to extract fine-grained features of faces [8][9][10]. First, an image pyramid is built; then, it is passed to the subsequent layers of the network. The computational cost of such networks is high due to complex training, so the detection speed from video is reduced. These models achieve good results whenever the face is entirely visible. However, when the face is partially covered, they show lower accuracy, as the features cannot be accurately captured and fully utilized in the network.
To overcome this problem, in this study, we utilized a single-stage deep convolutional neural network face detection pipeline with anchor boxes. The initial stages of the pipeline are convolutional layers with max pooling, which extract facial features. These features are then passed to the anchor box layer. We used eight anchor boxes and two scales, rather than the three scales common in regular object detection methods, to reduce the number of unnecessary anchor boxes. To lessen the computational complexity and to detect partial faces in images and videos, we employed a special anchor box selection strategy.
The main contributions of this paper are summarized as follows: i. We propose a deep convolutional neural network for partial face detection using an anchor box selection strategy on the FDDB dataset; ii. We utilize the class existence probability of anchor proposals to classify the partial features of faces; iii. We consider eight anchor boxes and two scales to avoid extra computation, with anchor box sizes and scales chosen from facial subparts and shapes; iv. The proposed method shows a balanced performance between precision and detection speed; v. The proposed method was examined on the FDDB dataset with partial face examples, and the results were compared with other state-of-the-art face detection methods.
The paper is organized as follows: Section 2 provides an overview of the related methods. Section 3 explains the methodology. Extensive experimental analysis is presented in Section 4. Finally, Section 5 concludes the paper and shows future directions.

Related Work
Partial face detection is a challenging object detection problem, as it requires locating each part of the face in the image. The face is a biometric human trait that contains vital information about an individual's identity. As a result, precise face detection is the first stage in various applications such as face recognition, face image retrieval, facial tracking, and video surveillance. The detected face is indicated with a bounding box. Face detection must be robust: faces must be detectable under different conditions, including different angles, lighting, makeup, age, glasses, hats, etc. Face detection is accomplished using two approaches: handcrafted face detectors and deep-learning-based face detectors.

Handcrafted Face Detection Methods
Handcrafted face detection techniques were utilized in a wide range of computer vision applications. The pioneering work in the face detection field is that of Viola and Jones. Classical face detection approaches are effective in real-time performance; however, their detections are not robust under all conditions. Histograms of oriented gradients (HOGs) [11] and local binary patterns (LBPs) [12] are used as feature extraction methods for face detection and have shown promising outcomes [13,14].
A taxonomy of face detection methods is presented in [15], in which they are divided into two primary classes, namely, feature-based and image-based approaches. Feature-based approaches are applicable when motion and color are present in images or videos and can provide visual cues to focus attention in situations where a multi-resolution window cannot be scanned. By contrast, image-based approaches are the most suitable for greyscale images. All of these techniques are computationally costly, since they depend on scanning multi-resolution windows to locate faces of all sizes. The deformable part model (DPM) is also a promising face detection method [16]. This approach utilizes the relationships between deformable facial parts to detect faces. However, face detectors that use classical machine learning have not shown efficient performance in complex situations.

Deep-Learning-Based Face Detectors
CNN-based face detectors have attained the highest performance in the field of face detection using deep learning techniques. It is possible to attain high accuracy and efficiency concurrently by training a sequence of CNN models using the cascade CNN approach [17,18]. Yu et al. proposed an intersection-over-union (IoU) loss to decrease the gap between the IoUs and annotations, improving location accuracy [19]. Hao et al. focused on the detection of normalized faces by applying a scale proposal stage within a network to zoom in and out of the input image [20]. Yang et al. scored the facial parts according to their spatial structure, in order to detect faces in obstructive and uncontrolled conditions [21]. The decision tree classification approach is used to detect faces in the LDCF+ system [22]. Hu et al. sought to detect small faces with various scales [23]. Shi et al. applied a cascade-style structure to detect rotated faces [24]. Spatial attention modules with specific scales were employed by Song et al. to estimate face locations in images [25]. Many state-of-the-art face detectors are adapted from generic object detection methods. Face R-CNN [26], Face R-FCN [27], and FDNet [28] apply two-stage object detection methods such as Faster R-CNN and R-FCN [29], with some specific strategies to perform face detection. Tian et al. proposed an improved feature fusion pyramid with segmentation to detect hard faces [30]. Wang et al. built on RetinaNet [31] using an attention mechanism in anchor boxes [32]. Zhang et al. combined higher-level and lower-level features to detect faces in various conditions [33]; however, performance suffered due to aggregation. Zhang et al. utilized the SRN detector [34], with an attention technique and DenseNet-121 as the backbone of the proposed model [35]. Small faces were detected by extending the FPN module with a receptive module in DEFace [36].
The feature hierarchy encoder-decoder network (FHEDN) is a single-stage detection network that applies context-prior features of faces [37]. Li et al. employed a pure convolutional neural network without anchor boxes for face detection [38].

Methodology
In this section, the proposed methodology is explained. The main aim of the proposed work was to detect partially covered faces of varying sizes. The proposed network was developed as a single-stage, end-to-end network. A batch of randomly occluded and occlusion-free face images from the FDDB dataset was taken as input, and features were generated, which were then utilized to decode the faces.

Proposed Work
In our proposed work, we utilized 22 convolution layers and 5 max-pooling layers. The proposed work is presented in Figure 2. The pipeline was divided into two parts, namely, feature extraction and object detection. In the feature extraction pipeline, the features of the input image were extracted using the first 16 convolution layers. The input image resolution was kept at 608 × 608 pixels for training. In each convolution layer, 3 × 3 and 1 × 1 kernel sizes were used. Additionally, a 2 × 2 max-pooling layer was applied at the end of each convolution block to downsample the image while keeping the important facial features. After each pooling phase, the number of 3 × 3 and 1 × 1 filters was doubled. The remaining six convolution layers were used in the second part of the pipeline to detect faces.

Anchor box sizes are critical for end-to-end detection methods. With general, fixed-size boxes, anchors are made to detect objects of various sizes. However, these general-sized boxes do not necessarily work for other object detection tasks such as face detection. Bounding boxes of faces are mainly squares or vertical rectangles, and some partially covered faces are challenging to detect and distinguish. Thus, creating anchor boxes whose sizes closely match the ground truth improves detection accuracy. As a result, multi-scale anchor boxes were used in the proposed work. A 19 × 19 grid was employed on the features of the input image. We utilized 8 different sizes of anchor boxes, which are depicted in Figure 3 with an example. The sizes of the anchor boxes were (32, 32), (78, 88), (94, 40), (128, 128), (172, 210), (300, 100), (284, 334), and (512, 512). These anchor boxes were selected based on the input image resolution and facial parts, as shown in Figure 4.
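As a rough illustration of the scale of the search, the eight anchor sizes can be tiled over the 19 × 19 feature grid of a 608 × 608 input (stride 608 / 19 = 32). This is a sketch, not the authors' code; in particular, the way the two scales (1:1 and 2:1) combine with the eight base sizes is our reading of the paper.

```python
# Sketch (not the authors' code): tiling the eight anchor sizes over the
# 19 x 19 feature grid of a 608 x 608 input. Stride = 608 / 19 = 32 px.
ANCHOR_SIZES = [(32, 32), (78, 88), (94, 40), (128, 128),
                (172, 210), (300, 100), (284, 334), (512, 512)]
SCALES = [(1, 1), (1, 2)]  # 1:1 (square) and 2:1 (vertical rectangle)

def grid_anchors(grid=19, stride=32):
    """Return (x_center, y_center, w, h) anchors for every grid cell."""
    anchors = []
    for gy in range(grid):
        for gx in range(grid):
            cx = (gx + 0.5) * stride   # cell centre in input-image pixels
            cy = (gy + 0.5) * stride
            for w, h in ANCHOR_SIZES:
                for sw, sh in SCALES:   # assumed scale application
                    anchors.append((cx, cy, w * sw, h * sh))
    return anchors

anchors = grid_anchors()
# 19 * 19 cells x 8 sizes x 2 scales = 5776 candidate anchors
assert len(anchors) == 5776
```

This candidate count motivates the selection strategy described later: most of these anchors carry no useful facial information.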
Our model detected the smallest faces at a size of 32 × 32 because the smallest anchor box was 32 × 32. Faces are generally square or vertical-rectangle shaped; thus, we used two anchor box scales, 1:1 and 2:1, to reflect square and vertical rectangle shapes. The 1:2 scale was not considered, as the shape of a face cannot be a horizontal rectangle. This is presented in Figure 5. The number of proposals per grid cell was calculated from Equation (1). In our work, eight anchor boxes, two scales, one class existence probability (Pc), four bounding box coordinates, and one class were considered; according to Equation (1), 96 proposals were generated per grid cell in the final convolution layer. To reduce unnecessary bounding box computation, the concept of a selected multi-scale anchor box was introduced. The sizes and scales of the anchor boxes were varied based on the features and shapes of the faces.

#proposals per grid = (#anchor boxes × #scales) × (Pc + #coordinates + #class) (1)

The bounding box of each face was predicted using regression on each proposal. An IoU threshold of 0.40 was applied to select the correct overlapping bounding box, retaining the box with the highest probability via the non-max suppression technique.
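The arithmetic of Equation (1) and the greedy non-max suppression step described above can be sketched as follows. This is an illustrative sketch under our own box convention (corner coordinates), not the authors' implementation.

```python
def proposals_per_grid(n_anchors=8, n_scales=2, n_coords=4, n_classes=1):
    # Equation (1): (#anchor boxes x #scales) x (Pc + #coordinates + #class)
    return (n_anchors * n_scales) * (1 + n_coords + n_classes)

assert proposals_per_grid() == 96  # matches the value stated in the text

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.40):
    """Greedy NMS: keep the highest-probability box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

For example, two near-identical boxes with IoU above 0.40 collapse to the higher-scoring one, while a distant third box survives.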

Anchor Box Selection Approach
Anchor boxes were utilized for partial face detection and provided the feature maps for the final convolution layer. Not every anchor box contains enough information to detect a partially covered face; nonetheless, a prediction score was computed for every anchor box, which increased execution time during both training and testing. As a result, real-time detection became time consuming and the frame rate decreased. An effective anchor box selection strategy was employed to mitigate this problem. Usually, the important feature information is not present in every anchor box, so such anchor boxes can be excluded from further processing in the pipeline. The strategy for avoiding unnecessary anchor boxes is as follows: the anchor boxes are arranged in descending order of size per grid cell; then, if relevant information is absent from a larger anchor box, the smaller anchor boxes within the same grid cell are ignored. The Canny edge detection algorithm [39] was applied to validate the information present in a given anchor box. This strategy was applied at the end of the convolution block and resulted in a significant reduction in both computational and memory expense.
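The descending-size skip rule can be sketched as below. To keep the example NumPy-only, a crude gradient-magnitude test stands in for the Canny edge detector the paper uses; everything here is illustrative, not the authors' code.

```python
import numpy as np

def has_edges(patch, thresh=30.0):
    """Crude stand-in for Canny: does the patch contain strong gradients?"""
    gy, gx = np.gradient(patch.astype(float))
    return float(np.hypot(gx, gy).max()) > thresh

def select_anchors(image, cell_anchors):
    """cell_anchors: (x1, y1, x2, y2) boxes belonging to ONE grid cell.
    Sort them in descending order of area; once a larger box shows no
    edge information, skip all smaller boxes in the same cell."""
    by_area = sorted(cell_anchors,
                     key=lambda b: (b[2] - b[0]) * (b[3] - b[1]),
                     reverse=True)
    selected = []
    for x1, y1, x2, y2 in by_area:
        patch = image[y1:y2, x1:x2]
        if patch.size == 0 or not has_edges(patch):
            break  # smaller boxes in this cell are ignored
        selected.append((x1, y1, x2, y2))
    return selected
```

On a flat (featureless) region, no anchors survive, so no prediction scores need to be computed for that cell; on a region containing a face-like structure, the informative boxes are kept.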
Another strategy for partial face detection used the anchor box class existence probability (Pc). In our work, the anchor boxes comprised only faces; the features of faces are depicted in Figure 4. The face is divided into three main parts: upper, middle, and lower. The upper part consists of the eyebrows, eyes, and forehead. The middle part consists of the nose and the left and right cheeks, along with the eyes and half of the forehead. The lower part primarily consists of the lips and chin. Anchor boxes in which partial features of the face are present have a higher Pc than other anchor boxes under occluded conditions. During the detection of partial facial features, the distance between the bounding box location and the anchor box location is large compared with regular object bounding boxes. Small anchor boxes may ignore certain ground truth boxes if the distance between them is large. Thus, the IoU threshold was reduced from 0.5 to 0.4 to alleviate this problem.
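The effect of relaxing the matching threshold can be made concrete with a small numeric example (boxes are hypothetical, in (x1, y1, x2, y2) form): a displaced anchor that would be rejected at IoU 0.5 is still matched at 0.4.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

gt = (0, 0, 40, 40)        # ground-truth box around a partially visible face
anchor = (16, 0, 56, 40)   # anchor displaced sideways by the occlusion

score = iou(gt, anchor)    # 960 / 2240, roughly 0.43
assert 0.40 <= score < 0.50  # matched only under the relaxed 0.4 threshold
```

At a 0.5 threshold this anchor would be discarded and the partial face missed, which is the failure mode the lowered threshold is meant to avoid.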

Experimental Analysis
In this section, the performance of the proposed work is evaluated with extensive experiments on the FDDB dataset.

Dataset
The Face Detection Dataset and Benchmark (FDDB) [40] comprises annotated faces and is a subset of the Faces in the Wild dataset, including greyscale and color images. The FDDB images were obtained from Yahoo! News. The dataset comprises 2845 images with 5171 face annotations at different resolutions. Images in this dataset contain various challenges, notably faces at side angles, multiple expressions, scale variation, illumination, and occlusion. Samples of the FDDB dataset are depicted in Figure 6. In the FDDB dataset, faces are annotated as elliptical regions, which were converted into rectangular or square regions before training the model.
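The ellipse-to-rectangle conversion mentioned above can be done by computing the tight axis-aligned bounding box of the rotated ellipse. FDDB annotations list (major_axis_radius, minor_axis_radius, angle, center_x, center_y); the angle convention assumed here (rotation from the x-axis, in radians) is an assumption of this sketch, not taken from the paper.

```python
import math

def ellipse_to_rect(ra, rb, theta, cx, cy):
    """Tight axis-aligned bounding box of a rotated ellipse.
    ra, rb: semi-major / semi-minor axes; theta: rotation in radians
    (assumed measured from the x-axis). Returns (x1, y1, x2, y2)."""
    half_w = math.hypot(ra * math.cos(theta), rb * math.sin(theta))
    half_h = math.hypot(ra * math.sin(theta), rb * math.cos(theta))
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# Unrotated ellipse: the box spans 2*ra horizontally and 2*rb vertically.
rect = ellipse_to_rect(50, 30, 0.0, 100, 100)
assert rect == (50.0, 70.0, 150.0, 130.0)
```

Rotating the same ellipse by 90 degrees swaps the box's width and height, as expected for upright versus sideways faces.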

Experimental Setup
Experiments were performed on the proposed model using a machine with 32 GB RAM, a 2 TB hard disk, an Intel Core i7 8th-generation processor, an NVIDIA Titan Xp 12 GB GPU, and a 64-bit Windows operating system. Experiments were implemented in Python 3.7 with the OpenCV, Keras, and TensorFlow libraries.

Discussion
In our experiments, we considered an 80:20 training and testing dataset split. The model was trained for 50 epochs, with a batch size of 128, an initial learning rate of 0.0001, and a momentum of 0.9. Figure 7 shows the performance of the proposed work against state-of-the-art face detection algorithms [41,42]. We compared our method with other methods in terms of average precision (AP), which targets the detection of faces in the FDDB dataset. The proposed method outperformed the other methods, with 98.7% AP. From this finding, we could confirm that the proposed work is suitable and more accurate for detecting partially covered faces.

Table 1 shows the results and performance analysis of the proposed work against other state-of-the-art object detection methods at various image resolutions. We compared our method with the other single-stage and two-stage object detection approaches in the table using the accuracy, precision, recall, F1 score, and inference time evaluation metrics. Our method achieved the best accuracy, precision, and F1 score at all resolutions. However, the recall of the proposed method at resolutions of 416 × 416 and 480 × 480 is 91.6% and 94.8%, slightly lower than that of Faster R-CNN (91.8%) and YOLOv4 (95.3%), respectively; the difference in recall values is small. The results in the table show that the proposed method maintains accuracy, precision, and recall along with the F1 score. Of the three resolutions, our model achieved the highest performance at 608 × 608. We analyzed the inference time on the test images of the FDDB dataset, executed on an NVIDIA Titan Xp GPU. The inference time results show that YOLOv1 attained the minimum inference time in ms; however, our model also performed well and reported the second-lowest inference time. YOLOv1 is a single-stage object detection method and has only a 7 × 7 grid and two anchor boxes per grid cell.
In contrast, our work has a 19 × 19 grid and eight anchor boxes. However, we introduced an anchor selection strategy to reduce the computation time, and this is evident from the inference time results compared with the other detection algorithms, except YOLOv1. Faster R-CNN took the longest inference time because it is a two-stage object detection algorithm.

Table 2 shows the performance analysis of our method at different IoU thresholds. We achieved 98.7% AP at 0.4 IoU. This IoU was selected to detect partial facial features. Comparatively, the space occupied by faces is small in real-life captured images. When partial facial features are detected, the distance between the bounding box and anchor box locations is substantial in comparison with conventional object bounding boxes, and small anchor boxes may ignore certain ground truth boxes if the distance between them is large. Thus, the proposed method performed well, accurately detecting both fully visible and partially visible faces. Detection results can be seen in Figure 8.

Figure 8 presents the detection results of the proposed method: on the FDDB dataset in the first row (a); on our own samples in the second row (b); and on samples of the MAFA dataset [43] in the last row (c). The images presented in Figure 1 were fed into the model, and the detection results are shown in the second row of Figure 8. These second-row samples and the MAFA samples were tested to show the robustness of the proposed method; they were provided neither during the training nor the testing phase of the model, but rather come from other distributions. The detection results in Figure 8 also illustrate that the model is able to detect faces in various conditions, including different poses, occlusion due to masks, scarves, hands, and mobile phones, and blurred faces.
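For reference, the metrics reported in Table 1 relate to raw detection counts as sketched below. The counts used here are purely illustrative and are not taken from the paper.

```python
# Minimal sketch of the Table 1 metrics (illustrative, not the authors'
# evaluation code). tp/fp/fn are counts of true-positive, false-positive,
# and false-negative detections over the test split.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts: 948 faces found correctly, 12 false alarms,
# 52 faces missed.
tp, fp, fn = 948, 12, 52
p, r, f = precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn)
```

F1 is the harmonic mean of precision and recall, which is why a method must keep both high to lead the F1 column.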
We assessed the real-time computational speed of the proposed network, trained on the FDDB dataset, on an NVIDIA Titan Xp 12 GB GPU. We compared the running speed of our method with other state-of-the-art object detection methods, and the results are presented in Figure 9. We used real-time video from a web camera as input to evaluate the real-time performance and compared the proposed model with the other detection approaches. Our method obtained the second-highest frames per second (FPS) at all resolutions, due to the anchor box selection strategy. YOLOv1 achieved the highest FPS due to less computation, and Faster R-CNN attained the lowest FPS because of its two-stage network.
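The relationship between per-frame inference time and FPS is a simple reciprocal; the sketch below also shows one way FPS could be measured over any callable model. The 47.6 ms figure is derived from the reported 21 FPS, not quoted from the paper.

```python
import time

def fps_from_ms(inference_ms):
    """Convert per-frame inference time in milliseconds to frames/second."""
    return 1000.0 / inference_ms

# e.g. 21 FPS at 608 x 608 implies roughly 1000 / 21 ~= 47.6 ms per frame
fps = fps_from_ms(47.6)

def measure_fps(infer, frames=100):
    """Wall-clock FPS of an arbitrary per-frame inference callable."""
    start = time.perf_counter()
    for _ in range(frames):
        infer()
    return frames / (time.perf_counter() - start)
```

In practice a few warm-up frames are usually discarded before timing, since the first GPU calls include one-off initialization cost.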

Conclusions
In this paper, a single-stage, deep convolutional neural network-based face detection method was presented for detecting partial faces with different occlusions. We applied an anchor box selection strategy to reduce computation time. Furthermore, we utilized the class existence probability to identify parts of faces from small anchor boxes. Additionally, we investigated the effect of implementation factors such as the IoU threshold on detection performance. Finally, we compared the average precision (AP) of the proposed work with other face detection algorithms on the FDDB dataset. Our network was also assessed against existing object detection models on the FDDB dataset at different resolutions. The experimental findings show that our model exhibited impressive results, with 98.7% precision, and had a better inference speed. The proposed model was also evaluated on video and attained 21 FPS, 26.5 FPS, and 32 FPS at 608 × 608, 480 × 480, and 416 × 416 resolutions, respectively.
Our proposed method shows a balance between accuracy and speed. In future work, this model can be used as a base to detect faces in various real-life applications for which face detection is the first phase, such as security monitoring, face recognition, face mask detection, forensic investigations, attendance monitoring, and face tracking.
Funding: Funding was provided by Symbiosis International University, Pune, India.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.