Human Face Detection Techniques: A Comprehensive Review and Future Research Directions

Face detection, which is an effortless task for humans, is complex to perform on machines. The recent proliferation of computational resources is paving the way for rapid advancement of face detection technology. Many astutely developed algorithms have been proposed to detect faces. However, little attention has been paid to making a comprehensive survey of the available algorithms. This paper provides a fourfold discussion of face detection algorithms. First, we explore a wide variety of the available face detection algorithms in five aspects, including history, working procedure, advantages, limitations, and use in fields other than face detection. Secondly, we include a comparative evaluation among the different algorithms of each single method. Thirdly, we provide detailed comparisons among all the algorithms reviewed, to give an all-inclusive outlook. Lastly, we conclude this study with several promising research directions to pursue. Earlier survey papers on face detection algorithms are limited to just technical details and popularly used algorithms. In our study, however, we cover detailed technical explanations of face detection algorithms and various recent sub-branches of the neural network. We present detailed comparisons among the algorithms, both overall and within sub-branches. We provide the strengths and limitations of these algorithms and a novel literature survey that includes their use besides face detection.


Introduction
Face detection is a computer vision problem that involves finding faces in images. It is also the initial step for many face-related technologies, for instance, face verification, face modeling, head pose tracking, gender and age recognition, facial expression recognition, and many more.
Face detection is a trivial task for humans, which we can perform naturally with almost no effort. However, the task is complicated to perform via machines and requires many computationally complex steps to be undertaken. Recent developments in computational technologies have accelerated research in face detection. As such, many algorithms and methods for detecting faces have been proposed. Even so, little attention has been given to making a robust and updated survey of these face detection methods.
There have been some survey works referring to face detection methods. Ismail et al. [1] conducted a survey on face detection techniques in 2009. In their survey, four issues for face detection (size and types of database, illumination tolerance, facial expression variations, and pose variations) were dealt with, along with reviews of several face detection methods.
Table 1. Comparison of our work with other similar survey papers.

In this survey, we present a structured classification of the related literature. The literature on face detection algorithms is very diverse; therefore, structuring the relevant works in a systematic way is not a trivial task. The following are some of the contributions of this paper:
• Different face detection algorithms are reviewed in five parts, including history, working principle, advantages, limitations, and use in fields other than face detection.
• Many face detection algorithms are reviewed, such as different statistical and neural network approaches, which were neglected in the earlier literature but have recently gained popularity because of hardware developments.
• Systematic differences between the algorithms within each single method are shown.
• A comprehensive comparison of all the methods is presented.
• A list of research challenges in face detection, with further research directions to pursue, is given.
This paper is directed to anyone who wants to learn about the different branches of face detection algorithms. There is no perfect algorithm to use as a face detection method. However, the comparisons in this paper will help in choosing which algorithm to use depending on the specific problems and challenges. The description of each algorithm will aid in gaining a clear understanding of that particular process. Additionally, the history, advantages, and limitations described for each algorithm will assist in deciding which algorithm is best suited for the task at hand.
Face detection is one of the most popular computer vision problems and involves finding faces in digital images. In recent times, face detection techniques have advanced from conventional computer vision techniques toward more sophisticated machine learning (ML) approaches. The main step of face detection technology is finding the area in an image where a face or faces are. The main challenges in face detection are occlusion, illumination, and complex backgrounds. A wide variety of algorithms have been proposed to combat these challenges. Basically, the available algorithms are divided into two main categories: feature-based and image-based approaches. While feature-based approaches find features (image edges, corners, and other structures well localized in two dimensions), image-based approaches depend largely on image scanning based on windows or sub-frames.
The rest of this paper is organized as follows: In Section 2, we briefly explain the feature-based approaches, as shown in Figure 1. Section 3 provides an overview of the image-based approaches. Section 4 provides a robust comparison of the face detection algorithms. Section 5 outlines the research challenges in face detection and further research ideas to pursue. Finally, the conclusion is presented in Section 6.

Feature-Based Approaches
Feature-based approaches are further divided into three sub-fields, as shown in Figure 1. Active shape models deal with complex and non-rigid shapes by iteratively deforming to fit a given example. In low level analysis, segmentation is performed using pixel information and is typically concerned with individual components of a face. Feature analysis, on the other hand, organizes facial features into a global perspective, taking the facial geometry into account.
Snakes
Snakes (active contours) detect a boundary by evolving a closed curve C to minimize an energy function
E(C) = E_internal(C) + E_external(C), (1)
where E_internal(C) and E_external(C) are the internal and external energy functions, respectively. E_internal(C) depends on the shape of the curve, while E_external(C) depends on the image intensities (edges). The initialized snake iteratively evolves to minimize E(C).
In the minimization of the energy function, the internal energy enables the snake to shrink or expand, whereas the external energy makes the curve fit nearby image edges in a state of equilibrium. Elastic energy [12] is commonly used as the internal energy, while the external energy relies on image features. The energy minimization process is computationally demanding; this is why, for faster convergence, fast iteration methods based on greedy algorithms were employed in [13].
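The greedy minimization idea can be sketched in a few lines. The following is a minimal, illustrative implementation on a synthetic edge map (a bright ring standing in for a face boundary); the energy weights, the ring image, and the parameter values are assumptions for this demonstration, not values from the cited works.

```python
import numpy as np

def greedy_snake_step(points, edge_map, alpha=0.5, beta=0.05, gamma=5.0):
    """One greedy pass: each contour point moves to the neighboring pixel that
    minimizes internal energy (continuity + curvature) plus external energy
    (negative edge strength)."""
    n = len(points)
    # Average spacing between consecutive points, used by the continuity term.
    mean_d = np.mean(np.linalg.norm(np.diff(points, axis=0, append=points[:1]), axis=1))
    new_pts = points.copy()
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for i in range(n):
        prev_pt, next_pt = new_pts[i - 1], points[(i + 1) % n]
        best, best_e = points[i], np.inf
        for dy, dx in offsets:
            cand = points[i] + (dy, dx)
            y, x = int(cand[0]), int(cand[1])
            if not (0 <= y < edge_map.shape[0] and 0 <= x < edge_map.shape[1]):
                continue
            e_cont = (np.linalg.norm(cand - prev_pt) - mean_d) ** 2  # elasticity
            e_curv = np.sum((prev_pt - 2 * cand + next_pt) ** 2)     # bending
            e_img = -edge_map[y, x]                                  # edge attraction
            e = alpha * e_cont + beta * e_curv + gamma * e_img
            if e < best_e:
                best_e, best = e, cand
        new_pts[i] = best
    return new_pts

# Toy "face boundary": a soft ring of strong edge response, radius 20, center (50, 50).
yy, xx = np.mgrid[0:100, 0:100]
r = np.sqrt((yy - 50.0) ** 2 + (xx - 50.0) ** 2)
edge_map = np.exp(-((r - 20.0) ** 2) / 50.0)

# Initialize the snake as a circle of radius 26 and let it contract onto the ring.
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
snake = np.column_stack([50 + 26 * np.sin(t), 50 + 26 * np.cos(t)])
for _ in range(30):
    snake = greedy_snake_step(snake, edge_map)
mean_radius = np.mean(np.linalg.norm(snake - [50, 50], axis=1))
print(mean_radius)  # contracts toward the ring radius of 20
```

Note how the three energy terms mirror Equation (1): the continuity and curvature terms form the internal energy, while the edge term is the external energy.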
Snakes are autonomous and self-adapting in their search for a minimal energy state [57]. They can also be made sensitive to image scale by incorporating Gaussian smoothing in the image energy function [58]. Snakes are also relatively insensitive to noise and other ambiguities in images because the integral operator used to compute both the internal and external energy functions is an inherent noise filter. The snakes algorithm works efficiently in real-time scenarios [59]. Furthermore, snakes are easy to manipulate because the external image forces behave in an intuitive manner [60,61].
Snakes are generally capable of determining the boundaries of features, but they have several limitations [62]: the contours often become trapped on false image features, and they are not particularly suitable for extracting non-convex features [63]. Their accuracy is governed by the convergence criteria used in the energy minimization technique; tighter convergence criteria are required for higher accuracy, which results in longer computation times. Snakes are also space consuming, and the Viterbi algorithm used for minimization trades space for time, which means that they require a lot of processing time [64].
Mostafa et al. [65] used the snakes algorithm to extract buildings automatically from urban aerial images. Instead of using the snakes' traditional way of computing the weight coefficients of the energy function, a user emphasis method was employed. A combination of a genetic algorithm (GA) and snakes was used to calculate the parameters. This combination resulted in needing fewer operators and improved the speed. However, the detection accuracy was only good for detecting a single building, and the method faced problems in detecting building blocks. Fang et al. presented a snake model for tracking multiple objects [66]. The model can track more than one object by splitting and connecting contours. This topology-independent technique allows it to detect any number of objects without specifying an exact number. Saito et al. also combined GA with snakes to detect eyeglasses on a human face [67]. GA was used to find the parameters of the snakes in this case as well. The method faced problems in detecting asymmetric glasses.

Deformable Template Matching (DTM)
The DTM model is classified as an ASM because it actively deforms preset boundaries to fit a given face. Along with the face boundary, the extraction of other facial features, such as the eyes, mouth, eyebrows, nose, and ears, is a pivotal task in the face detection process. The concept of snakes was taken one step further by Yuille et al. [14] in 1992 by integrating information about the eyes as a global feature for a better extraction process. The conventional template-based approach is convenient for rigid face shapes but suffers from problems in detecting faces with various shapes. DTM, therefore, comes with a solution by adjusting to the different shapes of the face; DTM is particularly competent with non-rigid face shapes.
DTM works by forming deformable shapes of the face, which are achieved with predefined shapes [68]. The shapes can be made in two ways: (a) polygonal templates (PT) and (b) hierarchical templates (HT). In PT, as depicted in Figure 2, a face is formed with a number of triangles, where every triangle is deformed to contort the overall version of the face [15]. On the other hand, HT functions by creating a tree shape, as described in Figure 3 [16]. Suppose that we want to find the structure of a curve from 'a' to 'b', as shown in Figure 3. A binary tree from 'a' to 'b' is built, and the process is started by selecting the midpoint 'c'. With the midpoint 'c', the two halves of the curve are described recursively. The other sub-nodes of the tree are made in the same way by finding midpoints. Here, every node in the tree describes a midpoint relative to other neighboring points. Sub-trees describe sub-curves, thereby giving the local relative positions and explaining the local curvature. By adding a little noise at every node, we can recursively reconstruct the local curvature to fit any given sample and thus fit the global shape. This deformation process is driven by steepest gradient descent minimization of a combination of energy terms.
The combined energy is
E = E_i + E_v + E_p + E_e + E_internal, (2)
where, in Equation (2), the energy due to image brightness, valleys, peaks, and edges, and the internal energy, are denoted by E_i, E_v, E_p, E_e, and E_internal, respectively.
Figure 3. Formation of the binary shape tree in building HT: (a) the distance from 'a' to 'b' is marked with the middle points in the deformation process; (b) the resulting hierarchical tree of the deformation process. Each node represents a local curvature that can be used to fit locally to any given model by adding noise recursively; hence, this deforms the initialization to the global shape of the given model [16].
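The recursive midpoint decomposition used to build an HT can be illustrated directly. The sketch below encodes a curve as a binary tree of midpoint offsets and reconstructs it from the tree; the quarter-circle test curve and the dictionary-based tree are illustrative choices, not the exact representation of [16].

```python
import numpy as np

def build_tree(curve, lo, hi):
    """Recursively encode curve[lo..hi] as a binary tree of midpoint offsets.
    Each node stores the displacement of the curve's midpoint from the
    straight-line midpoint of its two endpoints (the local curvature)."""
    if hi - lo < 2:
        return None
    mid = (lo + hi) // 2
    offset = curve[mid] - (curve[lo] + curve[hi]) / 2.0
    return {"offset": offset,
            "left": build_tree(curve, lo, mid),
            "right": build_tree(curve, mid, hi)}

def reconstruct(tree, a, b, noise=0.0, rng=None):
    """Rebuild the curve between endpoints a and b from the offset tree,
    optionally perturbing each node ('adding a little noise') to deform
    the stored shape toward a new example."""
    if tree is None:
        return [a]
    jitter = rng.normal(0.0, noise, size=2) if rng is not None else 0.0
    mid = (a + b) / 2.0 + tree["offset"] + jitter
    left = reconstruct(tree["left"], a, mid, noise, rng)
    right = reconstruct(tree["right"], mid, b, noise, rng)
    return left + right

# Encode a quarter-circle arc (17 points), then reconstruct it exactly (noise = 0).
t = np.linspace(0, np.pi / 2, 17)
curve = np.column_stack([np.cos(t), np.sin(t)])
tree = build_tree(curve, 0, len(curve) - 1)
rebuilt = reconstruct(tree, curve[0], curve[-1]) + [curve[-1]]
print(np.allclose(rebuilt, curve))  # True: the tree encodes the shape losslessly
```

With a nonzero `noise` and an `rng`, the same tree generates deformed variants of the stored curve, which is the mechanism HT uses to fit a given sample.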
DTM combines local information with global information and thus ensures better extraction. Furthermore, it accommodates any type of shape in the given data [14] and can be employed in real time [69]. However, the weights of the energy terms are troublesome to tune, and the consecutive execution of the minimization process costs excessive processing time. The method is also sensitive to the initialization position; for instance, the midpoint 'c' needs to be initialized in HT.
Kluge and Lakshmanan utilized DTM in lane detection [70]. The use of DTM in detecting lanes allows likelihood of image shape (LOIS) to detect lanes without the need for thresholding and edge-non-edge classification. The detection process depends wholly on the intensity gradient information. However, the process is limited to a smaller dataset. Jolly et al. applied DTM in vehicle segmentation [71] and presented a novel likelihood probability function, in which the inside and outside of the template are calculated using directional edge-based terms. However, the process does not cover common vehicle types such as trucks, buses, and trailers. Moni and Ali applied DTM combined with GA in object detection [72]. This combination solves the problem of optimal placement of objects in DTM. The term fitness was introduced to calculate how well a random deformation fits the target shape without excessive deformation of the object. The fitness function for an object with various rotations, translations, scales, or deformations was found to be harder to calculate.

Deformable Part Model (DPM)
DPM uses the pictorial structure for face detection, which was first proposed by Fischler and Elschlager [73] in 1973. DPM is commonly employed in the detection of human faces [17,74] as well as in the detection of faces in comics [18]. DPM is a training-based model. In DPM, a face mask is formed by modeling discrete parts (eyes, nose, etc.) individually, where each of these parts is rigid. A set of geometric constraints is placed between these parts (typically describing the distance between the eyes and nose, etc.), which can be imagined as springs, as shown in Figure 4. The intuition here is that the object parts change in appearance approximately linearly in some feature space, and the springs between these parts constrain their locations to be consistent with the observed deformations. To fit the given data, the model is moved onto an image and stretched in different ways, trying to find a placement that does not put too much pressure on the springs while explaining the image data underneath the model.
Figure 4. (a) The face mask, where the part and root filters, which are rigid, are connected with geometric constraints imagined as springs [73]; (b) the pictorial structure projected onto a real human face, with a clear indication of the part and root filters [17].
The pictorial structure consists of two kinds of filters: part filters and a root filter, as shown in Figure 4. The part filters adapt to the articulation of the face; to be precise, the parts themselves do not change, but the distances between the parts deform to match the given image.
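The spring intuition can be made concrete with a toy scoring function. The sketch below scores a candidate window as the root response plus, for each part, the best part response minus a quadratic deformation ("spring") cost; the response maps, anchors, and weights are hypothetical values for illustration (real DPM implementations compute this maximization efficiently with distance transforms).

```python
import numpy as np

def best_part_score(part_resp, anchor, defo_w):
    """For one part: the maximum over all placements of (response - spring cost),
    where the spring cost grows quadratically with displacement from the anchor."""
    H, W = part_resp.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cost = defo_w * ((ys - anchor[0]) ** 2 + (xs - anchor[1]) ** 2)
    return np.max(part_resp - cost)

# Hypothetical 8x8 response maps for two parts (e.g., the eyes) of a face model.
rng = np.random.default_rng(0)
left_eye = rng.normal(0, 0.1, (8, 8)); left_eye[2, 2] = 1.0    # strong hit at its anchor
right_eye = rng.normal(0, 0.1, (8, 8)); right_eye[2, 6] = 1.0  # hit one pixel off-anchor

root_score = 0.5  # rigid root filter response at this window (assumed)
total = root_score + best_part_score(left_eye, (2, 2), 0.2) \
                   + best_part_score(right_eye, (2, 5), 0.2)
print(total > root_score)  # well-placed part responses raise the window score
```

The off-anchor right eye still contributes, but its contribution is discounted by the spring cost, which is exactly how the model tolerates articulation without letting parts drift arbitrarily.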
DPM performs well in terms of detecting various shapes of faces, as it detects faces efficiently in a real-time environment [75]. Additionally, it easily detects faces with various poses and can work with variations caused by different viewpoints and illuminations [76]. However, DPM faces difficulties, such as speed bottleneck or slowness [77], and has issues in extending to new object or face categories.
Yang and Ramanan employed DPM in object detection and pose estimation [78]. The method brings forth a general framework for two things: modeling the co-occurrence relations between a mixture of parts, and drawing classic spatial relations linking the positions of the parts. However, the method has detection problems in images with more orientations because of its few mixture components. Yan et al. proposed a multi-pedestrian detection method using DPM [79]. The proposed method improves the detection of pedestrians in crowded environments with heavy occlusion. To handle occlusion, Yan et al. utilized a two-layer representation. On the first layer, DPM was employed to represent the part and global appearance of the pedestrians; this approach had problems with partially visible pedestrians. The second layer was formed by appending the subclass weights of the body, and this mixture model handles various occlusions well. In an updated paper, Yan et al. [80] solved the problem of multi-resolution imaging in pedestrian detection with DPM. This updated model produced false positive results around vehicles; this particular problem was solved by constructing a context model to reduce them, based on the pedestrian-vehicle relationship.

Point Distribution Model (PDM)
PDM is a shape description technique in which the shape of a face is described by points. PDM was first introduced by Cootes et al. [19] in 1992; however, the initial face PDM was devised by Lanitis et al. [20]. PDM relies on landmark points: a landmark point is an annotated point placed on a given shape in the training set images. The shape of a face in PDM is formed by planting landmark points on the face shape in each training image. The model is generally built with a global face shape containing the formations of the eyes, ears, nose, and other elements of a face, as shown in Figure 5. In the training stage, a number of training samples are taken, where every image holds a number of points, building a shape for each sample. The shapes are then rigidly aligned to calculate the mean and covariance matrix. During the fitting of a PDM onto a face, the mean shape is positioned within reach of the face. Accordingly, a search strategy, named the gray-scale search strategy, is performed to deform the shape to the given face. In the process, the training set controls the deformation according to the way the information is modeled within it.
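The statistical core of a PDM, a mean shape plus the principal modes of the landmark covariance, can be sketched as follows; the random landmark data stand in for manually annotated training shapes, and the rigid alignment step is omitted for brevity.

```python
import numpy as np

# Hypothetical training set: 50 noisy copies of a 10-landmark base shape stand in
# for manually annotated face shapes (a real PDM would rigidly align them first).
rng = np.random.default_rng(1)
base = rng.uniform(0.0, 1.0, (10, 2))
shapes = base + rng.normal(0.0, 0.02, (50, 10, 2))

X = shapes.reshape(50, -1)            # each shape flattened to a 20-d vector
mean_shape = X.mean(axis=0)
cov = np.cov(X - mean_shape, rowvar=False)

# Principal modes of variation: eigenvectors of the landmark covariance matrix.
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
P = evecs[:, order[:4]]               # keep the 4 strongest deformation modes

# Any plausible new shape is approximated as mean + P @ b for small mode weights b.
b = np.array([0.01, 0.0, 0.0, 0.0])
new_shape = (mean_shape + P @ b).reshape(10, 2)
print(new_shape.shape)  # (10, 2)
```

Constraining the mode weights `b` to small values is what keeps the fitted shape "face-like": the training set controls the deformation, as described above.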
PDM reduces the computation time for searching face features, as it blueprints the features while building the global face [81]. Additionally, the difficulty posed by occlusion of any face feature is reduced by combining compact global face information with the features, since the remaining information of the face compensates for the occluded area [21]. PDM can be fitted to numerous face shapes and provides a compact structure of a face. However, the process of building the training set by pointing out the landmarks of the face and facial features is unavoidable drudgery and, in many cases, prone to errors. Furthermore, the control point movements are restricted to straight lines, so the line of action is linear.
Edwards et al. [82] used PDM in the human recognition, tracking, and detection process. Variables such as pose, lighting conditions, and expressions were handled up to par. To avoid the need for a large dataset, decoupled dynamic variation models for each class were proposed, while the initial approximate decoupling was allowed to be updated during a sequence. Over and above that, PDM is sometimes put to use for searching three-dimensional (3D) volume data. Comparisons among the different ASMs are presented in Table 2.

Low Level Analysis (LLA)
LLA extracts the low-level descriptions that are usually available in an image. LLA is not concerned with the type of object, nor even the perspective of the viewer. In an image, a large number of independent descriptors can be available, such as edges, color information, etc. For instance, if we look at an image containing a face shape, the LLA descriptors would signify where the edges of the face are, the different color variations of the face and image, etc. Since the descriptors are associated with the image, LLA descriptors apply all over the image and not just to the face structure. LLA can be classified into four subcategories: motion, color, gray information, and edges.

Motion
A number of continuous image frames or a video sequence is the primary requirement for motion-based face detection. Moving targets and objects provide valuable information, which can be used in detecting faces. Two leading ways to detect visual motion are moving image contours and frame variance analysis. In frame variance analysis, the moving foreground is identified against any type of background. Moving parts that contain a face are discerned by thresholding the accumulated frame difference [22,23]. Along with the face region, face features can also be extracted in this way [24,25]. Sometimes, the position referenced to the eye pair is taken into account while measuring the frame difference [25]. Taking contours into account yields better results than frame variance analysis [26]. McKenna et al. [26] applied a spatio-temporal Gaussian filter to detect moving face boundaries.
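Frame variance analysis in its simplest form is just a thresholded frame difference, as in the following sketch; the threshold value and the toy frames are illustrative.

```python
import numpy as np

def moving_region_mask(frame_prev, frame_curr, thresh=25):
    """Flag moving foreground pixels by thresholding the absolute
    inter-frame intensity difference."""
    diff = np.abs(frame_curr.astype(int) - frame_prev.astype(int))
    return diff > thresh

prev_f = np.zeros((4, 4), dtype=np.uint8)      # static background frame
curr = prev_f.copy(); curr[1:3, 1:3] = 200     # a bright 2x2 patch "moves in"
print(moving_region_mask(prev_f, curr).sum())  # 4 pixels flagged as moving
```

In a face detector, the flagged region would then be passed to a face/feature verification step, since motion alone cannot distinguish a face from any other moving object.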
Another type of advanced motion analysis is optical flow analysis. To detect faces, short-range and sensitive motions need to be taken into account. Optical flow analysis relies on accurately estimating the apparent brightness velocity. The face motion is detected first, and the information is then used to distinguish the face [83]. Lee et al. [83] modified the algorithm introduced by Schunck [84] and proposed a line clustering algorithm, where moving regions of a face are obtained by thresholding the image velocity.
Motion analysis provides robust and precise tracking [25,85]. Moreover, the analysis works on a reduced search space, as it largely focuses on movement, and is very competent in a real-time environment [86]. However, the system is incapable of detecting eyes if the major axis is not perpendicular to the line connecting the eye centers [22]. Additionally, faces with beards may produce false results.

Color Information
Human skin color information prominently forms a skin color cluster, which paves the way for faster face detection; the main reason is that color can be processed quickly. Skin color-based face detection was popularly used by Kovac et al. in 2003 [27], Lik et al. in 2010 [28], Ban et al. in 2013 [29], Hewa et al. in 2015 [30], and many others.
There are several color models being used in face detection. Among them, the following are the significant ones: the red, green, blue (RGB) model; the hue, saturation, intensity (HSI) model; and the luminance, in-phase, quadrature (YIQ) model. In the RGB model, all the existing colors are aggregated using the basic red, green, and blue colors. As the three basic colors are amalgamated to build a color, every color has specific values of red, green, and blue. In order to detect a face structure, the pixel values corresponding to a face, which represent the maximum likelihood, are deduced. The HSI model shows superior performance compared to the other color models, giving the color clusters of face features a larger variance. This allows HSI to be used in detecting such human face features as the lips, eyes, and eyebrows. Lastly, the YIQ model works by converting RGB colors to the YIQ representation. The conversion shows a discrepancy between the face and the background, which allows face detection in a natural environment.
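As an example of RGB-based skin classification, the sketch below applies simple per-pixel rules of the kind popularized by Kovac et al. [27] for uniform daylight; the exact thresholds are reproduced from the commonly cited rule and should be treated as illustrative rather than definitive.

```python
import numpy as np

def skin_mask_rgb(img):
    """Classify pixels as skin with simple RGB threshold rules (uniform-daylight
    variant after Kovac et al. [27]; thresholds are illustrative)."""
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)
    return ((r > 95) & (g > 40) & (b > 20)      # bright enough channels
            & (mx - mn > 15)                    # enough color spread (not gray)
            & (np.abs(r - g) > 15)              # red clearly separated from green
            & (r > g) & (r > b))                # red dominates for skin tones

# A skin-toned pixel next to a gray background pixel.
img = np.array([[[200, 140, 120], [128, 128, 128]]], dtype=np.uint8)
print(skin_mask_rgb(img))  # [[ True False]]
```

The gray pixel fails the color-spread tests, which is precisely how such rules suppress achromatic backgrounds; their known weakness, sensitivity to luminance change, follows from the fixed thresholds.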
Color processing is much faster compared to other facial feature processing. Additionally, the color orientation is invariant under certain lighting conditions. Nonetheless, color information is sensitive to luminance change, and different cameras produce significantly different color values. For side-viewed faces, the algorithm yields low accuracy [87].
Dong et al. [88] proposed color processing in color tattoo segmentation. A skin color model was implemented in the LAB (L, lightness; A, the red/green coordinate; B, the yellow/blue coordinate) color space. The main goal was to reduce isolated noise, which makes the color tattoo area smooth. The model cannot handle variation in lighting conditions. Chang and Sun proposed two novel skin color-based models for detecting hands [89]. The first model is constructed by amalgamating Cr information, which provides more representative and low-noise information compared to other methods. The second one is built directly in accordance with certain regions on the invariant surface. This method classifies actual skin color placed on a whiteboard efficiently but suffers in classifying different skin colors from around the world. Tan et al. [90] presented a skin color-based gesture segmentation model. The model detects the boundary points of the skin area and then utilizes the least squares method to elliptically fit the border points. Finally, the model computes an elliptical model of the skin color distribution. This model by Tan et al. produces a high false acceptance rate. Huang et al. [91] proposed a skin color-based eye detection method. The method performs color conversion as the starting step, which allows it to handle variable lighting conditions well. However, the approach only works on a defined size of 213 × 320 pixels. Cosatto and Graf [92] proposed a skin color-based approach to produce an almost photo-realistic talking head animation system. Each frame was analyzed using two different algorithms: color segmentation followed by texture segmentation. In the color segmentation step, hue is split into a span of background colors and a range of hair or skin colors. Manual sampling is done to define the ranges. The same features are extracted from the blob after a thresholding and component-connecting process.
A combination of these texture and color models is used to locate the mouth, eyes, eyebrows, etc. Yoo and Oh [93] presented a face segmentation method using skin color information. The model tracks human faces using a chromatic histogram and a histogram backpropagation algorithm. However, the method failed at adaptive determination of faces while zooming in and out.

Gray Information
In grayscale images, each pixel represents only an amount of light. In other words, every pixel contains only intensity information, which is described as gray information. There are only two basic colors, black and white, with many shades of gray in between [31]. Generally, the face shape, edges, and features are darker compared to their surrounding regions. This dichotomy can be used to delineate various facial parts and faces from the image background or noise.
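Obtaining gray information from a color image is a single weighted sum per pixel; the sketch below uses the widely used ITU-R BT.601 luma weights as one common choice.

```python
import numpy as np

def to_gray(rgb):
    """Collapse a 3-channel RGB image to single-channel intensity using the
    common ITU-R BT.601 luma weights."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

rgb = np.zeros((2, 2, 3)); rgb[..., 1] = 100.0  # pure green image
print(to_gray(rgb)[0, 0])  # ≈ 58.7
```

The unequal weights reflect the eye's differing sensitivity to the three channels; the result is the single intensity value per pixel on which the techniques in this subsection operate.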
Gray information processing is two-dimensional (2D), while color information processing is 3D; therefore, it is computationally less complex and requires less processing time. However, gray information processing is less efficient, and the signal-to-noise ratio is not up to par.
Cao and Huan [94] implemented gray information processing in multi-target tracking. Based on prior information, a gray-level likelihood function is built and then introduced into the generalized labeled multi-Bernoulli (GLMB) algorithm. This multi-target tracking system works well in a cluttered environment. Wang et al. [95] proposed a self-adaptive image enhancing algorithm based on gray-scale power transformation, which solved the problem of excessive darkness and brightness in gray-scale images. The main advantages of the method are sharpness improvement, adaptive brightness adjustment, and the self-adaptive selection of transformation coefficients. Gray information processing was employed in digital watermarking by Liu and Ying [96]. The spread spectrum principle is used to improve the robustness of the watermarking. As a result, the model shows good robustness to arbitrary noise attacks, cutting, and Joint Photographic Experts Group (JPEG) compression. Bukhari et al. [97] presented a novel approach for curled text line information extraction without the need for post-processing, based on gray information processing. In the text line extraction process, a major challenge is binarization noise; the method works well in mitigating its effect. Patel and Parmar [98] implemented gray information processing in image retrieval. The model adds color to grayscale images, and the resultant images achieve a matched pixel colorization accuracy of 92%.
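The gray-scale power (gamma) transform underlying enhancement methods such as [95] can be sketched as follows; the fixed gamma here is an illustrative stand-in for the self-adaptive coefficient selection of the cited work.

```python
import numpy as np

def power_transform(gray, gamma, c=1.0):
    """Gray-scale power (gamma) transform s = c * r**gamma on normalized
    intensities. gamma < 1 brightens dark images; gamma > 1 darkens bright ones."""
    r = gray.astype(float) / 255.0
    return np.clip(255.0 * c * r ** gamma, 0, 255).astype(np.uint8)

dark = np.array([[10, 40, 80]], dtype=np.uint8)
print(power_transform(dark, 0.5))  # dark pixels lifted toward the mid-range
```

Because the transform acts on each intensity independently, it preserves the relative ordering of gray levels while redistributing them, which is why it can fix over-dark or over-bright images without destroying edge structure.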

Edge
Edge representation was one of the earliest techniques in computer vision. A sharp change in image brightness is considered an edge. Sakai et al. [32] implemented edge information in face detection in 1972. Analyzing line drawings to detect face features, and eventually the full face, was employed effectively by Sakai et al. Based on the works of Sakai et al., Craw et al. [33] developed a human head outline detection method which used a hierarchical framework. More recent applications of edge-based face detection can be found in [34][35][36][37].
The edges in an image are assigned labels, which are matched to a predefined model for accurate detection. Many different filters and operators are implemented to detect the edges in an image.
• Sobel operator: The Sobel operator is the most commonly used operator [99][100][101]. It works by computing an approximation of the gradient of the image intensity function.
• Marr-Hildreth edge operator: The Marr-Hildreth edge operator [102] works by convolving the image with the Laplacian of Gaussian function. Zero crossings are then detected in the filtered results to obtain the edges.
• Steerable filter: Steerable filtering [103] is performed in three steps: edge detection, filter orientation of the detected edges, and tracking of neighboring edges.
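As an illustration of the first operator above, the following sketch computes the Sobel gradient magnitude by direct convolution; a real implementation would use an optimized filtering routine, and the step-edge test image is illustrative.

```python
import numpy as np

def sobel_gradient(img):
    """Approximate the image intensity gradient magnitude with the 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                         # vertical gradient
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return np.hypot(gx, gy)  # gradient magnitude: large at edges

# Vertical step edge: the magnitude peaks along the boundary columns only.
img = np.zeros((5, 6)); img[:, 3:] = 1.0
mag = sobel_gradient(img)
print(mag)
```

Flat regions on either side of the step produce zero response, while the columns straddling the step produce the maximal response, which is the behavior an edge-based face detector thresholds to obtain edge maps.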
In an edge-based face detection system, a face can be detected using a minimal amount of scanning [104] and, in addition, the system is relatively robust and cost effective. In spite of this, edge-based face detection is not suitable for noisy images, as it does not examine edges at all scales.
Chen et al. [105] applied edge detection in laminated wood edge cutting. A Canny operator was employed to detect edges, and defect detection was performed using a pattern recognition method. However, the proposed model needed position adjustment of the wood. Zhang and Zhao utilized an edge detection system in automatic video object segmentation [106]. The framework finds the current frame's moving edge by taking into account the background edge map and the preceding frame's moving edge. When a moving background is present, the model returns poor segmentation. Liu and Tang employed the artificial bee colony (ABC) algorithm to search for globally optimized points in edge detection [107]. The search for neighboring edge points is improved based on these globally optimized points. Fan et al. extended the SUSAN operator to detect edges in a moving target detection process [108]. The frame difference for moving target detection was gathered after detecting edges. The technique effectively counterbalances the overlapping and empty hole problems of a single detection algorithm. However, the method relies heavily on the selection of the gray threshold, binarization threshold and geometry threshold. Yousef et al. [109] employed edge detection in conscious machine building. The framework is a bio-inspired model, which utilizes a linear summation of the decisions made by previous kernels. This summing operator facilitates much better edge detection quality at the price of imposing a high computational cost. Table 3 lists some similarities and differences between different LLA.

Feature Analysis
FA theorizes that the human face has features that function as detectors, observing individual characteristics or features we can locate on the face. LLA sometimes detects noise (background objects) as faces, which can be solved using analysis of high level features.
Geometrical face analysis was employed rigorously to find the actual face structure, which low level analysis could only obtain ambiguously. The geometric shape information can be put into application in two ways: the first is positioning face features by their relative positions, and the other is flexible face structures.

Feature Searching
Feature searching techniques employ a rather conventional strategy: a notable face feature is searched for first, then other less notable face features are found by referencing the eminent features already found. Among the literature surveyed, the eye pair, face axis, facial outline, and body below the head are some features used as reference.
Face Outline: One of the best examples of feature searching is finding a face structure by referencing the outline of a face. The algorithm was presented by De Silva et al. [38]. This algorithm starts by searching for the forehead of a face [33]. After the forehead is found, the algorithm searches for the eye pair, which presents as a sudden high variation in densities [32]. The forehead and the eye pair are then taken as reference points, and other less notable face features are searched for according to these reference points.
The algorithm presented by De Silva et al. was reported to have an accuracy of 82% for facial images with faces rotated less than ±30° on a plain background. The algorithm can detect faces of various races. However, the facial outline-based algorithm cannot detect faces wearing glasses, or faces with hair covering the forehead.
Eye Reference: Taking the eye pair directly as a reference point in searching for faces in images was proposed by Jeng et al. [39]. The algorithm first searches for possible eye pair locations in images. The images used as input to the algorithm are pre-processed binary images. The next step in the algorithm is similar to the outline-based approach: it searches for other facial features, such as the mouth, nose, etc., relative to the position of the eye pair. The face features have distinct functions weighted by their density.
This algorithm showed an 86% detection rate. However, the image dataset needed to be assembled in controlled imaging surroundings [39]. Moreover, the backgrounds of the images in the dataset must be uncluttered for detection.
Eye Movement: By taking into account the eye movement of the normal human vision system, Herpers et al. [40] proposed GAZE. The GAZE algorithm detects the most salient feature in the image, which is eye movement. A rough saliency representation is constructed, using a multi-orientation Gaussian filter. The rough saliency map is then used to locate the most important feature, which has maximum saliency. Next, the saliency of the extracted area is suppressed, and the next most promising area is enhanced for the next iteration. The remaining facial features are perceived in later iterations.
With only the first three iterations, Herpers et al. [40] were able to detect moving eyes with 98% accuracy. Different orientations and modest illuminance fluctuations have no effect on the detection rate. Tilted faces and variations in face size have no effect on accuracy, and the algorithm is independent of any measurements of face features, as shown in the performance results in [40].
Feature searching can further be classified into two categories: the Viola-Jones algorithm and local binary pattern (LBP).

Viola-Jones Algorithm
Viola and Jones came up with an object detection framework in 2001 [41,42]. Although the algorithm could detect a diverse class of objects, its main purpose was to solve the problem of face detection, which it achieved with high speed and high detection accuracy. The algorithm solved the problems of real-time face detection, such as slowness and computational complexity.
The Viola-Jones algorithm functions in two steps: training and detection. In the detection stage, the image is converted into grayscale. The algorithm then finds the face on the grayscale image, using a box search throughout the image, and afterwards maps the location back to the colored image. For searching the face in a grayscale image, Haar-like features are used [43]. All human faces consist of the same features, and Haar-like features exploit this similarity through three types of Haar features for the face, namely edge, line and four-sided features. With the help of these, a value for each feature of the face is calculated. An integral image is made out of them and compared very quickly, as shown in Figure 6. The integral image is what makes this model faster, because it reduces the computation cost by reducing the number of array references, as shown in Figure 7. In the second stage, a boosting algorithm, the AdaBoost learning algorithm, is employed to select a small number of prominent features out of a large set to make the detection efficient. A simplified version was delineated by Viola and Jones in 2003 [44]. Finally, a cascaded classifier is assigned to quickly reject non-face images in which the prominent facial features selected by boosting are absent, as shown in Figure 8. Figure 7 illustrates the saving: (a) with the conventional method, summing the 6 × 6 rectangle (total 72) uses all 36 array references; (b) with the integral image of the input image, the value is calculated from the four corner positions 1, 2, 3, 4 (marked with circles) as 128 + 8 − 32 − 32 = 72, which is the same as the real value, using only 4 array references instead of 36 [41].
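The four-reference rectangle sum described above can be sketched as follows; the helper names are ours, and the two-rectangle feature at the end is only one of the Haar-like feature types the algorithm uses:

```python
import numpy as np

def integral_image(img):
    """Integral image: ii[y, x] = sum of all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of any rectangle using only 4 array references into the integral image."""
    padded = np.pad(ii, ((1, 0), (1, 0)))  # pad so top/left borders work
    y0, x0, y1, x1 = top, left, top + height, left + width
    return padded[y1, x1] - padded[y0, x1] - padded[y1, x0] + padded[y0, x0]

img = np.arange(36).reshape(6, 6)
ii = integral_image(img)

# A two-rectangle (edge) Haar-like feature: left half minus right half.
left_half = rect_sum(ii, 0, 0, 6, 3)
right_half = rect_sum(ii, 0, 3, 6, 3)
haar_edge = left_half - right_half
```

However large the rectangle, the cost stays at four lookups, which is the source of the algorithm's speed.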
The Viola-Jones algorithm has high detection accuracy, reported to be 95% at the time of release. A very recent study by Jamal et al. [110] reported 97.88% face detection accuracy. The algorithm is the most widely used for face detection because of its short computation time [111], and it is extremely successful for its low false positive rate. The algorithm was 15% quicker than the existing algorithms at the time of release. However, the algorithm can only detect the frontal side of the face, and it requires an intensely long training time. Training with a limited number of classifiers can produce far less accurate results, so a number of sub-windows were given more attention [112]. Rahman and Zayed presented the Viola-Jones algorithm for detecting ground penetrating radar (GPR) profiles in bridge decks [113]. The method is utilized in the detection of hyperbolic regions acquired from GPR scans. The framework lacks the different clustering approaches that would be suitable for this purpose. Huang et al. [114] proposed an improved Viola-Jones algorithm to detect faces in the Microsoft Hololens. In comparison with FACE API, local detection is 4 times faster, and 20 times faster via network, but it can handle a rotation of only 45°. Winarno et al. [115] built a face counter to count faces in an image, using the Viola-Jones algorithm. The model resulted in poor accuracy on images with low light intensity. Kirana et al. [116] proposed the Viola-Jones algorithm for the application of emotion recognition. The model was made for learning environments, such as a school. One predicament of the model is that it only works on forward-facing faces. Kirana et al. [117] extended the emotion recognition model [116] for fisher face type images. Feature extraction was performed using PCA and linear discriminant analysis, and then combined with Viola-Jones for the detection process.
Nonetheless, this combined framework is 15 times slower than the original Viola-Jones algorithm. Additionally, Hasan et al. [118] implemented the Viola-Jones algorithm in drowsiness detection, a core problem in the brain-computer interface paradigm. The method is based on an eye detection technique: the decision of drowsiness is made depending on the eye state (open or closed). Saleque et al. [119] employed the Viola-Jones algorithm in detecting Bengali license plates on motor vehicles. The framework detects single vehicle license plates with 100% accuracy, but suffers a reduction in accuracy when detecting multiple license plates.
Local Binary Pattern (LBP)
LBP was mainly proposed for monochrome still images. LBP is based on texture analysis of images. The texture analysis model was first proposed in 1990 [45]. However, LBP was first described by Ojala et al. in 1994 [46]. LBP works robustly as a texture descriptor and was found to provide a significant performance boost when combined with the histogram of oriented gradients (HOG) [47]. A myriad of LBP variants have been reported, for instance, spatial temporal LBP (STLBP), center symmetric local binary patterns (CS-LBP), spatial color binary patterns (SCBP), opponent color local binary pattern (OC-LBP), double local binary pattern (DLBP), uniform local binary pattern (ULBP), local SVD binary pattern (LSVD-BP), etc. [120].
LBP looks at nine pixels of an image at a time (to be exact, a 3 × 3 matrix) and is particularly interested in the central pixel. LBP compares the central pixel (cp) with each neighboring pixel (np) and assigns 0 for np < cp and 1 for np > cp in the corresponding neighbor position. It then turns the eight binary np values into one single byte, which corresponds to an LBP code or decimal number. This is done by multiplying the matrix component-wise with an eight-bit representative matrix, as shown in Figure 9. This decimal number is used in the training process. We are essentially interested in edges; in the image, a transition from 1 to 0, and vice versa, represents a change in brightness. These changes are the edge descriptors. When we look at a whole image, we look for comparisons or changes in pixels or brightness, and thus the edges are obtained.

Figure 9. The process of calculating the LBP code. Every neighboring pixel in the 3 × 3 window is thresholded with np < cp and np > cp to produce the binary comparison 3 × 3 matrix [45]. The binary matrix is then multiplied component-wise with the eight-bit representative 3 × 3 matrix [46] and summed up to generate the LBP code, i.e., the decimal number representation [121].
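The LBP code computation can be sketched for a single 3 × 3 patch; note that we map ties (np = cp) to 1 by convention, and the clockwise bit ordering below is one common choice rather than the only one:

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the central pixel of a 3x3 patch.

    Neighbors are thresholded against the center (1 if neighbor >= center)
    and read clockwise from the top-left as one 8-bit number.
    """
    center = patch[1, 1]
    # Clockwise order: the eight neighbors around the center.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    return sum(bit << (7 - i) for i, bit in enumerate(bits))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = lbp_code(patch)
```

Sliding this over an image and histogramming the resulting codes gives the texture descriptor used for training.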
LBP is tolerant of monochromatic illumination changes because LBP only compares neighboring pixels; a change in illumination shifts the pixel values but does not change the outcome of the comparisons [122]. LBP is mostly popular for its computational simplicity and fast performance. LBP can detect a moving object by subtracting the background, and has high discriminating power with a low false detection rate [123]. The algorithm yields the same detection accuracy on offline images and in real-time operation [124]. However, LBP is not invariant to rotation and has high computational complexity. LBP uses only pixel differences, while ignoring magnitude information, and it is not sensitive to minor adjustments in the image.
Rahim et al. developed a face recognition system making use of feature extraction with LBP [125]. The model achieves 100% accuracy but does not work in real time. Yeon [126] implemented a face detection framework using LBP features. The method performed faster when compared with the Haar-like feature extraction method. Priya et al. [121] utilized LBP's advantage of micro-pattern description and addressed the problem of identical twin detection. LBP was also used in surface defect detection by Liu and Xue [127]. The proposed model is an improvement of the original LBP, called gradient LBP. The model exploits image sub-blocks to reduce the dimensionality of the LBP data matrix. Finally, it exploits the non-continuity of the pixels in the local area to determine the defect area. However, this method suffers from noise influence when employed on the image as a whole. Varghese et al. presented an extended version of LBP, named modified LBP (MOD-LBP), and showed its application in the level identification of brain MR images [128]. When compared with the original LBP, using histogram-based features, Varghese et al. found MOD-LBP to be two times better. Moreover, LBP was used in pose estimation by Solai and Raajan [129]. Their model divides the estimation into five parts: front, left, right, up and down. The pose is determined according to the pitch, yaw and roll angles of the face image. Nurzynska and Smolka implemented LBP in smile veracity recognition [130]. The model efficiently classified posed smiles and spontaneous expressions. The feature vector was calculated using uniform LBP, and a support vector machine (SVM) was used as the classifier. Zhang et al. [131] proposed a fusion approach, combining histogram of oriented gradients and uniform LBP features on blocks, to recognize hand gestures. In the fusion, the histogram of oriented gradients features depict the hand shape, while the LBP features epitomize the hand texture.
However, the fusion yields poor results for complex backgrounds, and the speed is slow. Furthermore, LBP was implemented in script identification of handwritten documents by Rajput and Ummapure [132]. Handwritten scripts written in English, Hindi, Kannada, Malayalam, Telugu and Urdu were taken into account. Nearest neighbor and SVM were used as classifiers. LBP was used to extract features from blocks of images, with a defined block size of 512 × 512 pixels. However, the model did not recognize word level images.
AdaBoost
AdaBoost, short for "Adaptive Boosting", is the first practical boosting algorithm; Freund and Schapire introduced it in 1996 [48]. AdaBoost mainly focuses on classification and regression problems. The main objective of this algorithm is to adapt to hard-to-classify features. The algorithm functions by combining weak classifiers to originate a strong classifier. The algorithm changes the weights of different instances: to be exact, it puts more weight on hard-to-classify instances and less on those already sorted out. This is how the algorithm develops a better functioning classifier.
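The re-weighting loop described above can be sketched with one-dimensional threshold "stumps" as the weak classifiers; this toy version only illustrates the mechanics and is not the exact formulation of [48]:

```python
import numpy as np

def train_adaboost(x, y, rounds):
    """Toy AdaBoost on 1-D data with threshold 'stumps' as weak classifiers.

    Each round picks the stump with the lowest weighted error, then
    re-weights the samples so misclassified ones count more next round.
    """
    n = len(x)
    weights = np.full(n, 1.0 / n)
    ensemble = []  # list of (alpha, threshold, polarity)
    thresholds = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2.0
    for _ in range(rounds):
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                pred = np.where(x > t, polarity, -polarity)
                err = weights[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, t, polarity, pred)
        err, t, polarity, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)        # weak learner's vote
        weights *= np.exp(-alpha * y * pred)         # boost misclassified samples
        weights /= weights.sum()
        ensemble.append((alpha, t, polarity))
    return ensemble

def predict(ensemble, x):
    score = sum(a * np.where(x > t, p, -p) for a, t, p in ensemble)
    return np.sign(score)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([-1, -1, -1, 1, 1, 1])
model = train_adaboost(x, y, rounds=3)
```

The final strong classifier is a weighted vote of the stumps, with each weight alpha reflecting how accurate that stump was on the re-weighted data.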
AdaBoost achieves a high degree of precision. It has seen a wide variety of successes in many fields, including image processing. The algorithm can attain almost equivalent classification results with small amounts of adjustment. As it combines weak classifiers, a wide variety of weak classifiers can be used to generate one strong classifier [133]. However, AdaBoost requires an enormous amount of training time [134]. It can also be sensitive to noisy background images, and it currently does not support null rejection.
Peng et al. [135] proposed two extended versions of an AdaBoost-based fault diagnosis system. One version, named gentle AdaBoost, was employed in fault diagnosis for the first time by Peng et al. and was used for binary classification; for multi-class classification, another version, named AdaBoost multi-class hamming trees (AdaBoost.MH), was used. However, both of the extended versions cannot deal with imbalanced data. Aleem et al. [136] utilized AdaBoost in software bug count prediction. Yadahalli and Nighot applied AdaBoost in intrusion detection [137]. The main goal of the system is to reduce false alarms and improve the rate of detection. Bin and Lianwen proposed a two-stage boosting-based scheme along with a conventional distance-based classification algorithm [138]. The application was to recognize similar characters in handwritten Chinese. Finally, the model was compared with AdaBoost, and AdaBoost outperformed the model. Selvathi and Selvaraj proposed an automatic method for brain tumor tissue segmentation and classification on magnetic resonance imaging (MRI) [139]. A combination of random forest and modified AdaBoost was presented. To extract tumor tissue texture, curvelet and wavelet transforms were employed. However, this framework is limited to detecting brain tumors only. Lu et al. [140] proposed an improved ensemble algorithm, AdaBoost-GA, a combination of AdaBoost and a genetic algorithm (GA), for cancer gene expression data classification. The model was designed to improve the diversity of base classifiers and enhance the integration process. In AdaBoost-GA, Lu et al. introduced a decision group to improve the diversity of the classifiers, but the dimension of the decision group was not increased, in order to attain the highest accuracy.

Gabor Feature
The Gabor filter, named after Dennis Gabor, is extensively used for edge detection [49]. It is a linear filter used for texture analysis of an image. Gabor features are constructed by applying Gabor filters on images [50]. Images have smooth regions interrupted by abrupt changes in contrast, called edges. These abrupt changes usually carry the most prominent information in an image and hence can indicate the presence of a face. The Fourier transform is prominent in change analysis but is not efficient in dealing with abrupt changes, since Fourier transforms are represented by sine waves oscillating in infinite time and space. For image analysis, however, something is needed that can be localized in finite time and space. This is why the Gabor wavelet is utilized, which is a rapidly decaying oscillation with zero mean [55]. To detect edges, the wavelet is swept over an image, initially positioned at a random location; if no edges are detected, the wavelet is evaluated at a different random position [51].
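A Gabor kernel, i.e., a sinusoid localized by a Gaussian envelope, can be sketched as follows; the parameterization (wavelength, orientation, envelope width) is one common convention, and the values are illustrative:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor filter: a cosine carrier under a Gaussian envelope."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter orientation theta.
    x_theta = xs * np.cos(theta) + ys * np.sin(theta)
    y_theta = -xs * np.sin(theta) + ys * np.cos(theta)
    envelope = np.exp(-(x_theta ** 2 + y_theta ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_theta / wavelength)
    return envelope * carrier

kernel = gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0)
```

Convolving an image with a bank of such kernels at several orientations and scales yields the Gabor features discussed below; the envelope is what gives the filter its finite, localized support.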
Using the Gabor feature, impressive face detection results were reported by the dynamic link architecture (DLA) [141], elastic bunch graph matching (EBGM) [142], Gabor Fisher classifier (GFC) [143], and AdaBoosted Gabor Fisher classifier (AGFC) [144]. Gabor feature analysis works well with magnitudes [145]. A large amount of information can be gathered from local image regions [146,147]. Gabor feature analysis is found to be invariant to rotation, illumination and scale [148][149][150][151]. However, Gabor feature analysis suffers from time complexity and image quality issues.
Gabor feature analysis was employed in gait analysis by Li et al. [152]. Human gait was classified into seven components. Two types of gait recognition processes were performed: one based on the entire gait outline, and another based on certain combinations of components. Two applications were proposed, depending on the analysis: human identification and gender recognition. The model cannot extract dynamic properties of the walking sequence. Zhang et al. [153] proposed an improved version of Gabor feature analysis, named local Gabor binary pattern histogram sequence (LGBPHS), for face representation and recognition. The model does not need any training because of its non-statistical approach.
LGBPHS comprises many parts of the histogram, corresponding to different face components at various orientations and scales. However, the framework does not handle pose and occlusion variation well enough. Cheng et al. [154] implemented Gabor feature analysis in facial expression recognition. The method is based on Gabor wavelet phase features. Conventional Gabor transformation processes utilize the Gabor amplitude as a feature. Yet, Gabor amplitude features change little as the spatial location varies, while the phase can change quickly with position. The presented model uses the intense texture characteristics gathered from phase information to detect facial expression. Priyadharshini et al. [155] compared Gabor and Log Gabor in vehicle recognition and demonstrated the superiority of Log Gabor. Yakun Zhang et al. [156] presented a solution to the parameter adjustment of the Gabor filter, with an application in finger vein detection. The model was named the adaptive learning Gabor filter. Additionally, a new solution for texture recognition, combining gradient descent and the convolution processing of the Gabor filter, was proposed. However, the soft features available in finger veins were ignored in the process. Gabor feature analysis was utilized in fabric detection by Han and Zhang [157]. A genetic algorithm (GA) was proposed to jointly determine the optimal parameters of the Gabor filter, based on a defect-free fabric image. The model yields positive results on defects of various shapes, sizes, and types. Rahman et al. applied Gabor feature analysis in the detection of the pectoral muscle boundary [158]. The model tunes the Gabor filter in the direction of the muscle boundary on the region of interest (ROI) containing the pectoral muscle. After that, it calculates the magnitude and phase responses.
The calculated responses, together with edge connection and region merging, are then utilized for the detection process.

Constellation Analysis
A constellation is a cluster of similar things. In constellation analysis, a facial feature group is formed to search for a face in an image [52,53]. The approach is free from rigidity, which is why it can detect faces in images with noisy backgrounds. Most of the algorithms reviewed above failed to perform face detection in images with a complex background. Using facial features, a face constellation model solved this very problem easily.
Various types of face constellations have been proposed by numerous researchers. We discuss three of them: the statistical shape theory of Burl et al. [54], the probabilistic shape model of Yow and Cipolla [56], and graph matching. Statistical shape theory has a success rate of 84%, and it can operate smoothly with missing features. The algorithm properly handles problems originating from rotation, scale and translation up to a certain magnitude. However, a significant rotation of the subject's head causes severe problems in detection. The probabilistic shape model, on the other hand, reduces the detection of invalid features in noisy images and achieves 92% accuracy. The algorithm handles minor variations in viewpoint, scale and orientation; additionally, eyeglasses and missing features do not cause any problems. Lastly, graph matching can perform face detection in an automatic system and has higher detection accuracy.
Constellation analysis has been effectively used in telecommunication [159], diagnostic monitoring [160] and in autonomous satellite monitoring [161]. The similarities and differences among different FA are epitomized in Table 4.

Image-Based Approaches
Detecting faces on more cluttered backgrounds paved the way for most image-based approaches. Most image-based face detection techniques work by window-based scanning. The window is scanned pixel by pixel to classify face and non-face regions. Typically, the methods in image-based approaches vary in terms of scanning window, step size, iteration number and sub-sampling rate to produce a more efficient approach. Image-based approaches are the most recent techniques to have emerged in face detection and are classified into three major fields: neural networks, linear subspace methods and statistical approaches, as shown in Figure 10.
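The window-based scanning common to image-based approaches can be sketched as below; the mean-intensity "classifier" is only a stand-in for a trained face/non-face classifier, and the window size and step are illustrative:

```python
import numpy as np

def sliding_windows(image, window, step):
    """Yield (top, left, patch) for every window position, scanned row by row."""
    h, w = image.shape
    for top in range(0, h - window + 1, step):
        for left in range(0, w - window + 1, step):
            yield top, left, image[top:top + window, left:left + window]

def scan(image, classifier, window=4, step=2):
    """Collect every window position the classifier labels as a face."""
    return [(t, l) for t, l, patch in sliding_windows(image, window, step)
            if classifier(patch)]

# Stand-in classifier: call a window a "face" if its mean intensity is high.
image = np.zeros((8, 8))
image[2:6, 2:6] = 200.0  # bright "face" region
detections = scan(image, classifier=lambda p: p.mean() > 100.0)
```

Varying the window size, step and sub-sampling rate, as the text notes, trades detection granularity against the number of classifier evaluations.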

Neural Network
Neural network algorithms are inspired by the biological neural network of the human brain. Neural networks take in data and train themselves to recognize patterns (for face detection, the face pattern). The networks then predict the output for a new set of similar faces. Neural networks can be subdivided into the artificial neural network (ANN), decision-based neural network (DBNN) and fuzzy neural network (FNN).

Artificial Neural Network (ANN)
Like the biological human brain, ANN is based on a collection of connected nodes, called artificial neurons. The ability to learn patterns in data enables ANN to produce better results as more data become available. Several types of ANN are available. The most popularly used ANNs in face detection are described below.


Retinal Connected Neural Network (RCNN)
A neural network based on the retinal connection of human eyes was proposed by Rowley, Baluja and Kanade in 1998 [162]. The proposed ANN was named the retinal connected neural network (RCNN). RCNN takes a small-scale frame of the main image and analyzes whether the frame contains a face, as shown in Figure 11. RCNN applies a neural-network-based filter to the image, and a temporary arbitrator is used to merge the outputs into a single node. The input image is searched thoroughly, applying frames of different scales to search for face content. The output node, with the help of the arbitrator, eliminates overlapping features and combines the face features gathered from filtering. RCNN can handle a wide variety of images with different poses and rotations. When using RCNN, the methodology can be made more or less conservative depending on the arbitration heuristics or thresholds used. The algorithm reports an acceptable number of false positives. However, the procedure is complex to implement and can only detect frontal faces looking at the camera.

Figure 11. Schematic diagram of RCNN for face detection [162]. The original image is sub-sampled into 20 × 20 image windows. All the extracted image windows are passed through illumination correction and histogram equalization. The resulting image works as the network input. The network classifies the image as belonging to the face or non-face class.

Feed Forward Neural Network (FFNN)
The feed forward neural network (FFNN), also known as the multilayer perceptron, is considered to be the simplest form of ANN. The network evolved from the perceptron, developed by Frank Rosenblatt in 1958 [163]; the perceptron models how the brain stores and organizes information. Information, or, for images, the face feature information, moves from input nodes to output nodes via hidden layers. Hidden layers assign weights to face features during the training process, as shown in Figure 12. In the detection stage, the weights are compared to report a result for a given image. FFNN can handle large tasks, although accuracy is higher only on training samples.
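The forward flow from input nodes through a hidden layer to the output node can be sketched as a two-layer network; the sigmoid activation, layer sizes, and random weights are illustrative choices rather than any specific published architecture:

```python
import numpy as np

def forward(x, w_hidden, w_out):
    """One forward pass of a two-layer FFNN: input -> hidden -> output.

    Information flows strictly from the input nodes toward the output
    node; each layer applies its learned weights and a sigmoid.
    """
    hidden = 1.0 / (1.0 + np.exp(-x @ w_hidden))   # hidden-layer activations
    return 1.0 / (1.0 + np.exp(-hidden @ w_out))   # output score in (0, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # a 3-value "feature vector"
w_hidden = rng.normal(size=(3, 4))  # input -> hidden weights
w_out = rng.normal(size=4)          # hidden -> output weights
score = forward(x, w_hidden, w_out)
```

In a face detector, x would be the extracted face features and the output score would be thresholded into a face/non-face decision.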
Ertugrul et al. employed FFNN in estimating the short-term power load of a small house [194]. The proposed model is a randomized FFNN. A small grid dataset was used in the evaluation and validation process. However, total accuracy dropped because of the bias in the output layer. Chaturvedi et al. [195] applied FFNN and the Izhikevich neuron model to the handwritten pattern recognition of digits and special characters. A comparison between the two showed that, by adjusting synaptic weights and threshold values, the input patterns can achieve the same firing rate. Additionally, FFNN was implemented in image denoising by Saikia and Sarma [196]. The proposed method is a combination of FFNN and a multi-level discrete cosine transform. The fusion manages speckle noise, a kind of multiplicative noise generated in images. Mikaeil et al. proposed FFNN for low latency traffic estimation [197]. The framework shows significant improvement in utilization performance, upstream delay and packet loss handling. Dhanaseely et al. [164] presented FFNN and the cascade neural network (CASNN) in face recognition. Feature extraction was performed using PCA, and the Olivetti Research Lab (ORL) database was utilized. Both models were compared after the recognition process, and CASNN was found to be better in this scenario.

Figure 12. Schematic diagram of a three-layer FFNN for face detection [195]. The image data enter the neural network; a random weight is initialized for every face feature and then adjusted to recognize faces.

Back Propagation Neural Network (BPNN)
The origin of BPNN is often misreported. Steinbuch proposed a learning matrix in 1963 [164], which is one of the earliest works related to BPNN [198]. A modern updated version was developed by Seppo Linnainmaa in 1975 [199]; this version is also called reverse mode automatic differentiation (RMAD). BPNN came to attention after the release of a paper by Rumelhart, Hinton and Williams in 1986 [165]. BPNN implements a scheme called "learning by example": it propagates the error at the output back towards the input to adjust the weights in the hidden layers for more accurate output. A number of face features are used as input in the training stage. Weights are assigned to every feature, and the network output is compared with the target to compute the error. If the error rate is high, the weight values are adjusted on the next attempt and the comparison with the target is repeated. Thus, weights yielding minimal error are generated and employed in the detection stage on new images, where the face features are evaluated using the learned weights.
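The "learning by example" loop, a forward pass, the error at the output, and a backward weight adjustment, can be sketched for a single training example; the squared-error loss, learning rate and layer sizes are illustrative choices:

```python
import numpy as np

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One training step: forward pass, output error, then the error is
    propagated backwards to adjust both weight layers."""
    hidden = 1.0 / (1.0 + np.exp(-x @ w_hidden))
    out = 1.0 / (1.0 + np.exp(-hidden @ w_out))
    # Delta terms follow the chain rule backwards from the squared error.
    delta_out = (out - target) * out * (1.0 - out)
    delta_hidden = delta_out * w_out * hidden * (1.0 - hidden)
    w_out = w_out - lr * delta_out * hidden
    w_hidden = w_hidden - lr * np.outer(x, delta_hidden)
    return w_hidden, w_out, float((out - target) ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=3)
w_h, w_o = rng.normal(size=(3, 4)), rng.normal(size=4)
errors = []
for _ in range(50):
    w_h, w_o, err = backprop_step(x, 1.0, w_h, w_o)
    errors.append(err)
```

Repeating the step drives the output toward the target, which is the error-minimizing weight adjustment the text describes; the local-minima weakness noted below arises because this descent can stall in a poor basin.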
BPNN is fast, easy to program and simple to implement. The algorithm does not need any special mention of the features of the function to be learned. BPNN is also flexible, without the need for any prior knowledge about the network. Furthermore, the algorithm has no input parameters, except the input number. However, BPNN faces the major disadvantage of getting stuck in local minima.
Chanda et al. [200] applied BPNN in plant disease identification and classification. In order to fight overfitting and local optima, the framework utilizes BPNN to obtain the weight coefficients and particle swarm optimization (PSO) for optimization. The model implements five pre-processing steps: resizing, contrast enhancement, green pixel masking, and color model transformation. Finally, image segmentation is performed to classify the diseased portions of the plant. However, the model faces problems in choosing the initial parameter values of PSO. Yu et al. [201] implemented BPNN in tooth decay diagnosis. The model takes as input X-ray images of the patient's teeth. Normalized autocorrelation coefficients were employed to classify decayed and normal teeth. Additionally, BPNN was used in data pattern recognition by Dilruba et al. [202]. The model aims at finding the match ratio of training patterns to testing patterns. Two types were counted as a match: one is an exact match and the other is an almost similar pattern. Li et al. utilized BPNN in building a ship equipment fault grade assessment model [203]. Three types of BPNN were taken into account: the gradient descent back propagation algorithm, the momentum gradient descent back propagation algorithm and the Levenberg-Marquardt backpropagation algorithm. To quantify the initial weight values of the neural network, GA is employed. Finally, a comparison among the three BPNN shows that Levenberg-Marquardt backpropagation outperforms the other two. Furthermore, BPNN was employed in the analysis of an intrusion detection system by Jaiganesh et al. [204]. The framework analyses user behavior and classifies it as normal or attack. The model yields poor attack detection accuracy.

Radial Basis Function Neural Network (RBFNN)
The radial basis function neural network (RBFNN) was presented by Broomhead and Lowe in 1988 [166,205]. RBFNN is structurally similar to BPNN, comprising input, hidden and output layers. However, RBFNN is strictly limited to a single hidden layer, named the feature vector. For mapping, or neuron activation, RBFNN makes use of the Gaussian potential function.
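The two-stage training noted below (fix the Gaussian centers, then solve for the linear output weights) can be sketched as follows; the centers, width and target function are illustrative assumptions:

```python
# Minimal RBFNN sketch: Gaussian hidden units, output weights by least squares.
import numpy as np

def rbf_design(X, centers, width):
    # Each column is one Gaussian hidden unit evaluated on all inputs.
    d2 = (X[:, None] - centers[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

X = np.linspace(0, 2 * np.pi, 200)
y = np.sin(X)

centers = np.linspace(0, 2 * np.pi, 12)      # stage 1: fix the centers
width = 0.6                                  # assumed Gaussian width
Phi = rbf_design(X, centers, width)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # stage 2: linear output weights

y_hat = Phi @ w
max_err = np.max(np.abs(y_hat - y))
```

Too few centers make the fit fail and too many make neighboring Gaussians overlap, mirroring the hidden-layer sizing issue discussed below.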
In RBFNN, computations are relatively easy [167]. The network can be trained using the first two stages of the training algorithm. The network possesses the property of best approximation [206]. The ANN shows easy design, strong tolerance to input noise, online learning ability and good generalization [207]. Additionally, RBFNN offers a flexible control system. Despite these strengths, an inadequate number of neurons in the hidden layer results in the failure of the system [208], while too many neurons can cause overlapping in RBFNN.
Karayiannis and Xiong implemented an extended version of RBFNN, named the cosine RBFNN, to identify uncertainty in data classification [209]. The model was built by expanding the concepts behind the design and training of quantum neural networks (QNN), which are capable of detecting uncertainty in data classification by themselves. This method yields a learning algorithm that fits the cosine RBFNN. In the field of data mining, RBFNN was employed by Zhou et al. [210]. To speed up the learning process, a two-stage learning technique was used, and to increase the output accuracy of the RBFNN, an error correlation algorithm was proposed. Static and dynamic hidden layer architectures were suggested to build a better structure of hidden layers. Venkateswarlu et al. [211] applied RBFNN in speech recognition. The framework is suitable for recognizing isolated words, and word recognition was performed in a speaker-dependent mode. Compared to the multilayer perceptron (MLP), it improves efficiency significantly. Guangying and Yue employed RBFNN in electrocardiogram analysis [212]. For the construction of the RBFNN, a new algorithm was introduced. The proposed model generalizes well on the given input and shows great efficiency in electrocardiogram (ECG) feature extraction. However, the model considers only two basic cardiac actions, even though cardiac activity is far more complex.

Rotation Invariant Neural Network (RINN)
Rowley, Baluja and Kanade proposed the rotation invariant neural network (RINN) in 1997 [168]. Conventional algorithms are restricted to detecting frontal faces only, while RINN can detect faces at any angle of rotation. The RINN system consists of multiple networks. First, a network called the router network processes each input window to estimate its orientation. The router then rotates the window upright and passes it to one or more detector networks. The detector networks process the image plane to search for a face.
RINN can handle an image at any degree of rotation. RINN displays a higher classification performance [169]. Even so, RINN can learn only a limited number of features and performs well only with a small number of training sets. RINN was implemented in coin recognition [213] and in estimating the rotation angle of any object in an image [214]. Additionally, RINN was also exploited in pattern recognition [215].

Fast Neural Network (FNN)
The fast neural network (FNN), which reduces the computation time of the neural network, was first presented by Hazem El-Bakry et al. in 2002 [170]. FNN is very fast at computing and detecting human faces in an image plane. FNN works by dividing an image into sub-images. Each sub-image is then searched for one or more faces, using the fast ANN or FNN. A high speed in detecting faces was reported when using FNN.
FNN reduces the computation steps in detecting a face [170]. In FNN, the problem of sub-image centering and normalization in the Fourier space is solved. FNN is a high-speed neural network, and parallel processing is implemented in the system. Yet, FNN is reported to be computationally expensive. FNN can be implemented in object detection besides face detection [170].

Polynomial Neural Network (PNN)
Huang et al. presented the polynomial neural network (PNN)-based face detection technique in 2003 [171]. PNN was originally proposed by Ivakhnenko in 1971 [172,216]. The algorithm is also known as the group method of data handling (GMDH). The GMDH neuron has two inputs and one output, which is a quadratic combination of the two inputs.
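A single GMDH neuron of the kind described above can be sketched as follows; the data and coefficients are invented for illustration:

```python
# Sketch of one GMDH neuron: the output is a quadratic combination of two
# inputs, with its six coefficients fitted by least squares on toy data.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(-1, 1, 100)
x2 = rng.uniform(-1, 1, 100)
# Hypothetical target that is itself a quadratic in (x1, x2).
target = 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + 3 * x1**2 - 2 * x2**2

# Design matrix of the six quadratic terms used by a GMDH neuron.
Phi = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)
pred = Phi @ coef
max_err = np.max(np.abs(pred - target))
```

Because the target lies exactly in the span of the quadratic terms, the neuron reproduces it; in GMDH proper, layers of such neurons are grown and pruned by validation error.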
To detect a face, a frame that slides over the image is introduced, and the detector labels the frames that contain a face. The test image is divided into variable scales to examine faces of numerous sizes and shapes; the dividing process, in effect, re-scales the input image into a standard frame. To overcome overlaps due to the re-scaling and multiple faces in the detection region, the outputs are arbitrated. The lighting conditions are corrected by fitting an optimal plane that causes minimal error, and the pixel intensities are normalized to compose a feature vector of 368 measurements. The classifier PNN, which has a single output, labels a window as face or non-face. The complexity is reduced by PCA, which also helps to improve efficiency.
PNN can handle images with a cluttered background. Additionally, the algorithm is reported to have a high detection rate and a low false positive rate in images with both simple and complex backgrounds. However, the algorithm is reported to suffer from overlaps in the output images due to re-scaling. The algorithm also struggles to detect faces in images containing a large number of faces.
PNN was implemented in signal processing with a chaotic background by Gardner [217]. The model generates a global prediction of the chaotic background, which is then subtracted to improve the signal. The ridge PNN (RPNN), an extended version of PNN, is a non-linear prediction model developed by Ghazali et al. [218] to forecast future patterns of financial time series. In the same paper, the model was extended to another version, the dynamic ridge polynomial neural network (DRPNN), which differs from the feed-forward RPNN only by a feedback connection. Over and above that, PNN was put into action in gesture learning and recognition by Zhiqi [219]. The activation function is a Chebyshev polynomial, and the weights of the neural network are obtained using a direct approach based on the pseudo-inverse. The procedure cuts down on training time and boosts precision and generalization. Furthermore, PNN was employed in the modeling of switched reluctance motors by Vejian et al. [220]. The model simulates the flux linkage and torque characteristics of switched reluctance motors mathematically. Its most appealing aspect is that it is self-adaptive and does not depend on a priori mathematical models.

Convolutional Neural Network (CNN)
There is much debate over who first presented the convolutional neural network (CNN). Most literature reviews and studies refer to papers by LeCun et al. from 1988 to 1998 [173,221]. CNN has been used in many research works on face detection. Some early contributions are the proposed works of Lawrence in 1996 [222] and 1997 [223] and Matsugu in 2003 [174].
CNN has a structure broadly similar to FFNN, but among its hidden layers are convolutional layers, from which the network takes its name. In a convolutional layer, the input is convolved with a set of filters and then passed to the next layer; these filters must be defined for each convolutional layer. In summary, the training stage trains the network, and the best weight values, or filters, are saved for detection. The network is trained with the usual backpropagation gradient descent procedure [222]. In the detection stage, the filters are scanned over the image to find patterns, which can be edges, shapes or colors. For a completely new task, CNN is a very good feature extractor. Additionally, CNN shows very high computational efficiency and high accuracy. However, CNN requires a big dataset for proper training, and it is reported to be slow and to carry a high computational cost.
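The scanning of a filter over an image to produce a pattern-response map, the core operation above, can be sketched as follows; the image and the hand-written edge filter are toy assumptions (in a trained CNN, filter values are learned):

```python
# Minimal "valid" 2-D convolution: a 3x3 filter slides over the image,
# producing a feature map that responds where the pattern occurs.
import numpy as np

def conv2d_valid(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge filter responds where intensity changes from left to right.
image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # right half bright
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
fmap = conv2d_valid(image, kernel)
```

The feature map is large only at the columns straddling the intensity edge and zero elsewhere, which is what "finding a pattern" means at this level.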
Besides face detection, CNN was also employed in improving bug localization by Xiao and Keung [224]. Bug reports and source files were reviewed on a character-by-character basis rather than a word-by-word basis. To extract features, a character-level CNN was used, and the output was fed into a recurrent neural network (RNN) encoder-decoder. However, no fine-tuning was performed on the model. Mahajan et al. applied CNN to the prediction of faults in gas chromatographs [225]. In the model, a fault is predicted from abnormalities in the pattern of the gas chromatogram. Shoulder-top, negative-peak, and good-peak faults were all successfully established. Nonetheless, the model training lacked an adequate dataset. Hu and Lee proposed a new time transition layer that models variable temporal convolution kernel depths, using an improved 3D CNN [226]. The model was further improved by adding an extended DenseNet architecture with 3D filters and pooling kernels, along with a cost-effective method of transferring pre-trained data from a 2D CNN to a random 3D CNN for appropriate weight initialization. CNN was employed in node identification in wireless networks by Shen and Wang [227]. The dimension of every node was reduced using PCA, and local features were extracted using a two-layer CNN. The model was optimized with stochastic gradient descent, and a softmax output layer was employed for the final decision. The model performs poorly on larger-scale networks. Shalini et al. conducted a sentiment analysis of Indian languages using CNN [228]. The dataset contained Telugu and Bengali text, classified as positive, negative or neutral. The model was implemented using just one hidden layer. However, it had a low cross-validation accuracy for the Telugu data.

Decision-Based Neural Network (DBNN)
Kung et al. presented an eminent face detection algorithm based on the decision-based neural network (DBNN) in 1995 [175]. DBNN uses a static process for still images and a temporal strategy for video. In the training stage, the face pattern is aligned so that the eye plane is horizontal and the distance between the eyes is constant. A 16 × 16 pixel Sobel edge map is assembled from images containing either a face or a non-face and used as the input to the DBNN. The sub-images are processed to find face shapes in them, and each detected face pattern is located in the sub-image of the main frame to which it corresponds. When the whole image is considered, only the sub-image found to contain a face is reported as the desired location of the face.
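The Sobel edge map used as the DBNN input can be sketched as follows; the synthetic 16 × 16 image below stands in for the face/non-face windows (the real system feeds the resulting map to the network):

```python
# Sobel edge map: gradient magnitude from two 3x3 Sobel kernels.
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_edge_map(img):
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros_like(gx)
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i+3, j:j+3]
            gx[i, j] = np.sum(patch * SOBEL_X)   # horizontal gradient
            gy[i, j] = np.sum(patch * SOBEL_Y)   # vertical gradient
    return np.hypot(gx, gy)                      # gradient magnitude

img = np.zeros((16, 16))
img[:, 8:] = 1.0              # toy window with a vertical intensity edge
edges = sobel_edge_map(img)
```

Feeding edge maps rather than raw pixels makes the network's input largely insensitive to absolute brightness, which is part of the appeal of this preprocessing.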
DBNN is very efficient in computational performance and time [229]. The hierarchical structure of DBNN provides a better understanding of structural richness. Furthermore, DBNN is reported to have high recognition accuracy, and its processing speed is very high (less than 0.2 s). However, the detection rate is high only when the facial orientation is between −15 and 15 degrees.
Kung et al. applied DBNN in palm recognition [175]. The model classified palms as one being from the database or an intruder. The main drawback is the use of a small dataset.
The application of DBNN to image and signal classification tasks was proposed by Kung and Taur [229]. The model fuses the perceptron learning rule with a hierarchical non-linear network structure. A sub-clustering hierarchical DBNN was explored for static models, and a fixed-weight low-pass filter was used for temporal prediction models. Golomb et al. implemented DBNN in gender classification [230]. The model does not necessitate function selection and optimization beforehand, but it requires complex calculations for a simplistic classification.

Fuzzy Neural Network (FNN)
An intelligent system that combines the human-like reasoning style of a fuzzy system with the learning and connection-establishing structure of a neural network is known as neuro-fuzzy hybridization, eminently called the fuzzy neural network (FNN). FNN has been employed in face detection by many researchers. To begin with, pre-processed 20 × 20 frames, either containing a face or not, are assigned fuzzy membership degrees. These membership degrees are then used as input to the neural network, which is trained on them using error backpropagation. When the training is over, an evaluation is run over the network, which defines the degree to which a given window contains a face. If a frame is labeled as having a face, post-processing is then carried out.
FNN is reported to have a higher accuracy compared to other neural networks [176]. FNN requires fewer hidden neurons and can handle noisy backgrounds. Despite this, the FNN system requires linguistic rules as prior knowledge instead of learning by examples [234].
Kandel et al. [235] proposed a more adaptive FNN for pattern recognition. The model applies GA to the Kwan-Cai FNN. The number of fuzzy neurons is lowered, and the recognition rates are enhanced, using a self-organizing learning algorithm based on GA. However, only English letters and Arabic numerals were used to evaluate the model. Imasaki et al. [236] utilized FNN to fine-tune elevator performance. The FNN-trained system is capable of adapting to a variety of traffic circumstances and handles long-term shifts in traffic situations. Even so, the system must be customized to meet the needs of the users. Furthermore, FNN was employed in building the control system of automatic train operation by Sekine et al. [237], who suggested a fuzzy neural network control scheme with two degrees of freedom. The model depicts the use of fuzzy rules both before and after the process begins and decreases the number of fuzzy rules that must be used. Despite changes in the control purpose and complex characteristics, automated operation continues to function well. Lin et al. presented an extended FNN, named the interactively recurrent self-evolving fuzzy neural network (IRSFNN), for identifying and predicting dynamic systems [238]. To maximize the benefits of local and global feedback, a novel recurrent structure with interaction feedback was introduced, and a variable-dimensional Kalman filter algorithm was used to tune the IRSFNN. Xu et al. [239] applied FNN in pulse pattern recognition. Following traditional Chinese pulse diagnosis (TCPD) theory, the model used FNN as a classifier of pulse patterns. The model has a high level of accuracy. Nonetheless, it is unable to detect complex pulse patterns, due to the limitations of the pulse dataset. Comparisons among different NN are listed in Table 5.

Linear Subspace
The linear subspace is a vector space that is a subset of a larger vector space. In image processing terms, the smaller portion of a frame is called a subspace. Linear subspace methods are classified into four groups, i.e., eigenfaces, probabilistic eigenspaces, Fisherfaces and tensorfaces. Sirovich and Kirby first proposed the use of eigenfaces in face analysis [178,240], which was implemented in face recognition by Turk and Pentland [23,179].
The main goal of face detection is finding the faces in a given input image. The input images are usually highly noisy, with noise created by pose, rotation, lighting conditions and other variations. Despite the noise, some patterns exist in an image: an image containing a face usually consists of patterns due to the presence of facial objects (eyes, noses, etc.). The characteristic components capturing these facial features are called eigenfaces and are usually obtained from an image by PCA. Using PCA, eigenfaces corresponding to these features are constructed from a training set.
By combining all the eigenfaces in the right proportions, the original face can be rebuilt. Each eigenface represents a single feature of a face, and all the eigenfaces may or may not be present in an image. If a feature is present in an image, the proportion of that feature in the sum of eigenfaces will be higher. Therefore, a sum of all the weighted eigenfaces represents a full face image, where each weight defines the proportion in which a feature is present. That is, the reconstructed original image equals a sum of all the eigenfaces, each with a certain weight. This is how a face is extracted from an image using the weighted eigenfaces.
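The weighted-sum reconstruction described above can be sketched with PCA on synthetic data; the random vectors below merely stand in for flattened face images:

```python
# Eigenface sketch: eigenfaces from PCA on a training set, and an image
# rebuilt as the mean face plus a weighted sum of eigenfaces.
import numpy as np

rng = np.random.default_rng(2)
train = rng.random((20, 64))             # 20 flattened 8x8 "images" (toy data)

mean_face = train.mean(axis=0)
centered = train - mean_face
# Rows of Vt are the eigenfaces (principal axes of the training set).
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 10
eigenfaces = Vt[:k]

test_img = train[0]
weights = eigenfaces @ (test_img - mean_face)    # project onto eigenfaces
recon = mean_face + weights @ eigenfaces         # weighted sum of eigenfaces
err_k = np.linalg.norm(recon - test_img)

# Using all components reproduces the image up to numerical precision.
full_w = Vt @ (test_img - mean_face)
err_full = np.linalg.norm(mean_face + full_w @ Vt - test_img)
```

Keeping only the first k eigenfaces trades reconstruction error for the data compression mentioned below.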
The eigenface approach requires no knowledge of the geometry and reflectance of faces. Furthermore, data compression is achieved by the low-dimensional subspace representation. However, the approach is very sensitive to the scaling of the image. Additionally, the learning or training stage is very time consuming, and the approach is efficient only when the face classes are of larger dimension compared to the face spaces.
Along with face detection, the method was used in speaker identification by Islam et al. [241]. The model categorizes speakers based on the content of their voice. For feature extraction, the Fourier spectrum and PCA approaches were used, and classification was then done using eigenfaces. The system recognizes the speaker from every word spoken. The system's biggest flaw is that it does not operate in real time. Zhan et al. [242] used the eigenface approach for real-time face authentication. The main contribution of this method is a 3D real-time face recognition system. The intensity and depth maps, which are continuously collected using a correlation image sensor (CIS), are used for classification. A complex-valued eigenface is added to the traditional eigenface. Many potential applications, such as face emotion detection and attractiveness detection, were not discussed in the paper.

Probabilistic Eigenspaces
The eigenface approach for face detection obtained impressive results in a constrained environment; hence, the technique performs well only on rigid faces. On the other hand, probabilistic eigenspaces, proposed by Moghaddam and Pentland [180,243], implement a probabilistic similarity measure based on a parametric estimate of the probability density.
The method is reported to handle a much higher degree of occlusion [181]. Furthermore, the probability distribution of the reconstruction error of each class was employed, and the distribution of the class members in the eigenspace was taken into consideration [244].

Fisherfaces
Belhumeur, Hespanha and Kriegman proposed Fisherfaces for face detection in 1997 [182]. One key problem in face detection is finding the proper data representation. PCA solves this problem by finding eigenfaces, so the subspace in an image representing most of the variance, or the face, can be described by eigenfaces. However, similarity within the face subspace is not clearly defined by eigenfaces. In such situations, a subspace is required that gathers members of the same class in one spot and keeps dissimilar classes far apart. The process that achieves this is called discriminant analysis (DA), the most popular form being linear discriminant analysis (LDA). LDA is implemented to search for a facial subspace, which is called the Fisherface.
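The Fisher criterion underlying LDA can be sketched on two synthetic classes (toy 2-D points, not face data): the discriminant direction maximizes between-class scatter relative to within-class scatter.

```python
# Two-class Fisher discriminant sketch: w = Sw^{-1} (m1 - m2).
import numpy as np

rng = np.random.default_rng(3)
c1 = rng.normal([0, 0], 0.5, (50, 2))    # class 1 around the origin
c2 = rng.normal([3, 3], 0.5, (50, 2))    # class 2 well separated

m1, m2 = c1.mean(0), c2.mean(0)
# Within-class scatter matrix: spread of each class around its own mean.
Sw = (c1 - m1).T @ (c1 - m1) + (c2 - m2).T @ (c2 - m2)
w = np.linalg.solve(Sw, m1 - m2)         # Fisher direction

# Project both classes onto w and split at the midpoint of the means.
p1, p2 = c1 @ w, c2 @ w
threshold = (p1.mean() + p2.mean()) / 2
acc = (np.sum(p1 > threshold) + np.sum(p2 < threshold)) / 100
```

Unlike the PCA direction, which would chase overall variance, this direction is chosen purely for class separation, which is the point of Fisherfaces.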
The algorithm is very useful when facial images have large variations in illumination and facial expression. Additionally, the error rate in detecting faces with glasses is very small compared to eigenfaces, and Fisherfaces require less computation time than eigenfaces. However, Fisherfaces depend heavily on the input data. Jin et al. used Fisherfaces in automatic modulation recognition of digital signals [245]. The model focused on reducing dimensionality based on Fisherfaces and used a combination of the cyclic spectrum and k-nearest neighbor to recognize nine different types of modulation signals. Du et al. combined Fisherfaces and the fuzzy iterative self-organizing technique to recognize gender from human faces [246]. Fisherfaces were used to extract relevant attributes, which were then clustered using the fuzzy iterative self-organizing technique, and the fuzzy nearest neighbor method was used for classification. Additionally, the algorithm was utilized in classifying facial expressions by Hegde et al. [247]. The output of different block sizes was analyzed using the Gaussian adaptive threshold process, and Fisherfaces were employed to detect different human emotions. The model also used eigenfaces and the local binary pattern histogram (LBPH) to reduce differences within classes in an image, such as varying lighting conditions. The process maximizes the mean difference between groups, allowing it to accurately distinguish between individuals.

Tensorfaces
Tensorfaces was presented by Vasilescu and Terzopoulos in 2002 [183,184]. Tensorfaces is a multilinear approach in which a tensor is a generalization of a matrix to a multidimensional basis. An image is dominated by multiple factors, such as structure, illumination and viewpoint, and solving for these factors over an ensemble of images lies in the domain of multilinear algebra. Within this mathematical framework, a higher-dimensional tensor is used to represent an image ensemble. To find faces by decomposing the images, an extension of singular value decomposition (SVD), named N-mode SVD, is employed. The N-mode SVD yields tensorfaces from an image.
Tensorfaces can be implemented as a unified framework for solving several computer vision problems. Furthermore, the performance of tensorfaces is reported to be significantly better compared to eigenfaces. Table 6 lists comparisons among the different LSM.

Statistical Approaches
Among the various face detection methods, statistical approaches are the most intensively studied. The major sub-areas in the field are principal component analysis (PCA), support vector machine (SVM), discrete cosine transform (DCT), locality preserving projection (LPP) and independent component analysis (ICA). PCA was introduced by Karl Pearson in 1901 [185] and was later advanced and named by Harold Hotelling in 1933 [186]. Depending on the field of application, PCA is known by several names, such as proper orthogonal decomposition (POD) in mechanical engineering, the Karhunen-Loeve transform (KLT) in signal processing, and so on. Turk and Pentland first implemented PCA in face detection.
PCA compresses a large amount of data into a few components that capture the essence of the real data. Mathematically, PCA uses an orthogonal transformation to convert a set of M possibly correlated variables into a set of K uncorrelated variables, called principal components. The components are gathered from the training set, and only the first few are retained, while the others are rejected. The obtained components are also called eigenfaces. Detection of a face is performed by projecting a test image onto the subspace spanned by the eigenfaces.
PCA performs very well in a constrained environment, and it is reported to be faster than other statistical approaches. An improved statistical PCA (SPCA) is reported to have a high recognition rate and simple computation [248,249]. However, PCA relies on linear assumptions and is sensitive to the scale of the variables.
PCA was also employed in cancer molecular pattern discovery by Han [250]. Han introduced a non-negative variant of PCA (NPCA), which adds non-negative constraints to PCA, and employed SVM as the classifier; combining the two yields the NPCA-SVM model. Under a regular Gaussian kernel, the model overcomes the overfitting associated with SVM-based learning machines. However, convergence issues can arise as a result of the fixed step size used in the model. Additionally, PCA was utilized in water quality monitoring [251]. Bingbing [252] implemented PCA in noise removal for speech denoising. Using dynamic embedded technology, the architecture produced an embedded matrix, whose principal components were then extracted using PCA. Finally, the model rebuilt the denoised speech from the low-order principal components, discarding the noisy high-order components. Tarvainen et al. [253] applied PCA to cardiac autonomic neuropathy. Instead of conforming to a small range, the model used multi-dimensional heart rate variability (HRV); PCA was then used for dimensionality reduction, allowing the model to capture the majority of the detail in the original multi-dimensional data. Ying et al. [254] used PCA in building a model of fire accidents involving electric bicycles. The main drawback is the limited amount of data available from actual fire incidents.

Support Vector Machine (SVM)
SVM is a modification of the generalized portrait algorithm, which was introduced by Vapnik and Lerner in 1963 [187] and further developed by Vapnik and Chervonenkis in 1964 [255]. The form of SVM most widely used today was proposed by Boser, Guyon and Vapnik in 1992 [256]. SVM has been used in face detection by many researchers in recent times [188,257–261].
In the training stage, features are extracted using PCA, the histogram of oriented gradients (HOG) or other feature extraction algorithms. Using these data, SVM is trained to classify between a face and a non-face by drawing a hyperplane between the classes, as shown in Figure 13. In the detection stage, features of the received frame are extracted and compared with the trained images, and the frame is put into the class of either a face or a non-face.
Figure 13. Schematic diagram of SVM [256]. The hyperplane divides the face and non-face data into different classes. Data points closer to the hyperplane are support vectors. These points help to draw the margin, which places the hyperplane equally distant from both classes so that the data are clearly separated.
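A minimal linear SVM fitting such a hyperplane can be sketched as follows; the 2-D data are synthetic, and subgradient descent on the primal hinge loss stands in for the solvers used in practice:

```python
# Linear SVM sketch: primal hinge loss with L2 regularization,
# trained by subgradient descent on toy separable data.
import numpy as np

rng = np.random.default_rng(4)
# Two linearly separable classes with labels +1 / -1.
X = np.vstack([rng.normal([2, 2], 0.4, (40, 2)),
               rng.normal([-2, -2], 0.4, (40, 2))])
y = np.array([1] * 40 + [-1] * 40)

w = np.zeros(2); b = 0.0
lam, lr = 0.01, 0.05
for _ in range(500):
    margins = y * (X @ w + b)
    viol = margins < 1               # points inside the margin or misclassified
    # Subgradient of  lam/2 ||w||^2 + mean(max(0, 1 - y(wx+b))).
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

pred = np.sign(X @ w + b)
acc = np.mean(pred == y)
```

Only the margin-violating points contribute to the update, which is the optimization-side counterpart of the support vectors in the figure.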
SVM is reported to be very effective with higher dimensional data. Furthermore, SVM models generalize well in practice; thus, the risk of over-fitting is quite small in SVM. SVM is reported to be memory efficient as well. However, SVM is not suitable for large datasets and works poorly with noisy image datasets [262].
SVM was effectively utilized in protein structure prediction by Wang et al. [263] and in physical activity recognition by Mamun et al. [264]. Moreover, SVM was reported to be successful in breast cancer diagnosis by Gao and Li [265]. The model analyzed data using various kernel functions and SVM parameters; the radial basis function kernel and the polynomial kernel achieved the highest accuracy. Nasien et al. [266] applied SVM in handwriting recognition. The skeleton of a character was extracted using a thinning algorithm, and a heuristic was used to generate the Freeman chain code (FCC) that accurately represents the characters. The model yielded high accuracy. The proposed system, however, relied solely on the National Institute of Standards and Technology (NIST) database, which contains low-quality samples and broken bits. Gao et al. [267] implemented SVM in intrusion detection, employing GA to optimize the SVM parameters. Menori and Munir combined blind steganalysis and SVM to detect hidden messages and estimate their length [268]. Among the different kernels used in the system, the polynomial kernel performs best.

Discrete Cosine Transform (DCT)
DCT was first proposed by Nasir Ahmed et al. in 1974 [189]. DCT was invented to perform the task of image compression [269,270]. DCT was used in face detection and recognition by Ziad et al. in 2001 [271], Aman et al. in 2011 [190], Surya et al. in 2012 [272], and so on.
The position of the eyes in an image needs to be entered manually. This is not considered a major drawback of the algorithm, as it can be paired with a localization system [190]. After the system receives an image and the eye coordinates in it, geometric and illumination normalization is performed. Then, the DCT of the normalized face is computed, and a certain subset of DCT coefficients describing the face is retained as a feature vector. This subset of DCT coefficients holds the highest variance of the image, lying in the low to mid frequencies. To detect and recognize a face, the system compares this feature vector with the feature vectors of the database faces.
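The coefficient-selection step described above can be sketched as follows; the 8 × 8 block and the square low-frequency corner are illustrative assumptions (implementations typically traverse the coefficients in zig-zag order):

```python
# 2-D DCT-II sketch: transform a block and keep a low-frequency subset
# as the feature vector.
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT-II basis: C @ x transforms a length-N signal.
    C = np.zeros((N, N))
    for k in range(N):
        for n in range(N):
            C[k, n] = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    C[0] *= np.sqrt(1 / N)
    C[1:] *= np.sqrt(2 / N)
    return C

def dct2(block):
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

block = np.full((8, 8), 5.0)     # constant block: all energy at (0, 0)
coeffs = dct2(block)

# Low-to-mid frequency subset retained as the feature vector.
features = coeffs[:3, :3].flatten()
```

For the constant block, all energy lands in the DC coefficient, illustrating why the low-frequency corner captures the smooth, high-variance content of a face image.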
DCT was reported to show significant improvement in detection rates due to normalization and to be computationally less expensive compared to the Karhunen-Loeve transform (KLT) [190]. In addition, DCT provides a simpler way to deal with 3D facial distortions and produces rich face descriptors [273]. Even so, some decisions are required in DCT regarding quantization (ignoring high-frequency components). DCT has also been used in audio signal processing, fingerprint detection, palm print detection, data compression, medical technology, and wireless technology [274].

Locality Preserving Projection (LPP)
He and Niyogi proposed LPP in 2004 [191]. The algorithm can be used as a substitute for PCA. LPP was developed to preserve locality structure, which makes it fast, since pattern recognition algorithms explore the nearest patterns.
In LPP, as in PCA, a face subspace is sought that usually has lower dimensions than the image space [275]. The original image undergoes scale and orientation normalization, performed in such a way that the two eyes are aligned at the same position. Then, the image is cropped to 32 × 32 pixels, and each image is represented by a 1024-dimensional vector with 256 gray levels per pixel. A training set of six images per individual was used by He [191]. The training samples were used to learn a projection, and the test images were projected into the reduced image space.
LPP is reported to be fast and suitable for practical applications. LPP preserves local structures, and its error rate is far lower compared to LDA and PCA [191]. In spite of these facts, the graph construction of LPP is sensitive to noise and outliers, which is a major drawback [276].
Fu et al. [277] applied LPP in video summarization. A novel distance formula, equal to the Euclidean distance in terms of norm, was suggested, and the time-embedding two-dimensional locality preserving projection (TE-2DLPP), built on the new distance, offers better time efficiency. The proposed method generates a video summary that automatically includes the majority of the video's contents. Guo et al. [278] proposed a palmprint recognition system based on an extended LPP, named kernel LPP (KLPP). KLPP retains the local structure of the palm print image space while capturing non-linear correlations between pixels; the non-linear correlations mitigate variation in lighting conditions, and preserving the local structure improves classification accuracy. LPP was implemented in visual tracking by Zhao et al. [279], who proposed an extended version of LPP, direct orthogonal locality preserving projections (DOLPP), based on orthogonal locality preserving projections (OLPP). The aim of OLPP is to find a set of orthogonal basis vectors for the Laplace-Beltrami operator eigenfunctions; DOLPP computes the orthogonal basis explicitly and has higher discrimination power than LPP. Patel et al. [280] implemented LPP in scene change detection. The model overcomes the problem of missing gradual changes in PCA-based scene change detection methods and can deal with both sudden and incremental changes. Camera actions, such as zooming in and out, on the other hand, trigger a small number of false alarms. Li et al. [281] applied LPP to fault diagnosis in industrial processes. Unlike traditional fault diagnostic methods, LPP attempts to map close points in the original space to close points in the low-dimensional space; as a result, LPP is able to determine the underlying geometrical structure of the manifold. In comparison to PCA, the model achieves a better accuracy.

Independent Component Analysis (ICA)
Jeanny Herault and Bernard Ans proposed an early methodology for ICA in 1984 [192], which was popularized by a paper written by Pierre Comon in 1994 [282]. ICA was implemented in face detection by Deniz et al. in 2001 [283], Marian Bartlett et al. in 2002 [284], Zaid Aysseri in 2015 [193], and many other researchers.
While PCA finds structure by maximizing variance, ICA strives to maximize the independence of the features. ICA attempts to find a linear transformation of the feature space into a new feature space such that each of the new features is mutually independent of the others, and the mutual information between the features of the original feature space and the new feature space is as high as possible.
ICA outperforms PCA in several ways. ICA is sensitive to higher-order statistics of the data, while PCA considers only variance, a second-order statistic. ICA also yields a better probabilistic model of the data than PCA. Moreover, the ICA algorithm is iterative [285]. Regardless of these advantages, ICA has difficulty handling large amounts of data. On top of that, ICA is reported to have difficulty ordering the source vectors.
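The iterative nature of ICA can be illustrated with the widely used FastICA fixed-point scheme (one common variant; not necessarily the algorithm used in the works cited below). Two independent sources are mixed linearly, the mixtures are whitened, and each unmixing direction is found by repeated application of the tanh-based update with deflation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent sources: a sine wave and uniform noise.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), rng.uniform(-1, 1, 2000)]
A = np.array([[1.0, 0.5], [0.5, 1.0]])   # mixing matrix
X = S @ A.T                              # observed mixtures

# Centre and whiten, so the unmixing matrix can be taken orthogonal.
X = X - X.mean(0)
cov = np.cov(X, rowvar=False)
d, E = np.linalg.eigh(cov)
Xw = X @ E @ np.diag(d ** -0.5) @ E.T

# FastICA fixed-point iteration with the tanh non-linearity,
# extracting one component at a time (deflation).
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        g = np.tanh(Xw @ w)
        w_new = (Xw * g[:, None]).mean(0) - (1 - g ** 2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # decorrelate from found rows
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1) < 1e-9
        w = w_new
        if converged:
            break
    W[i] = w

S_est = Xw @ W.T   # recovered sources (up to sign, order, and scale)
```

Each recovered component matches one original source up to sign and scale, which is exactly the ordering ambiguity mentioned above: ICA gives no canonical order for the source vectors.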
Brown et al. employed ICA in optical imaging of neurons [286]. The main drawback of the system is that the number of sources (neurons and artifacts) must be less than or equal to the number of simultaneous recordings. Back et al. [287] used ICA to analyze three years of daily returns of the 28 largest Japanese stocks on the Tokyo Stock Exchange. The independent components were divided into two groups: infrequent but large shocks and frequent but small shocks. ICA identified underlying structure in the data that PCA misses, bringing a new perspective to the problem of understanding the mechanisms that drive stock market data. ICA could also be very useful in many other financial areas, such as risk management and asset management, where it has so far been largely unexplored. Furthermore, Hyvarinen et al. implemented ICA in mobile phone communication [288]. Delorme et al. [289] utilized ICA in electroencephalogram (EEG) data analysis, applying three forms of ICA: Infomax, second-order blind identification (SOBI), and FastICA. Because each form optimizes a different criterion, each detects different types of artifacts; the methods involving spectral thresholding were the most effective of those studied. Similarities and differences between the different SA are presented in Table 7.

Comparisons
Face detection technology faces some major challenges that reduce accuracy and detection rate: face occlusion, odd expressions, scale variance, pose variance, complex backgrounds, low resolution, and too many faces in an image. Different algorithms combat these challenges in various ways to increase accuracy and detection rates. The algorithms available to this day have performance variations as well as strengths and weaknesses in detecting faces. Some of them face the problem of over-fitting, while others are computationally very efficient. A robust comparison among the algorithms reviewed is presented in Table 8.

Future Research Direction
In this section, we describe challenging issues in face detection that need to be addressed in the future.

Face Masks and Face Shields
The recent pandemic situation around the world caused by the coronavirus disease of 2019 (COVID-19) compels people to wear masks or face shields almost all the time outside of the home. This sudden development has caused problems for face detection deployed in surveillance, payment verification systems, etc. Faces covered with masks and shields reduce the accuracy of face detection systems, which makes this an interesting topic for further research; the main goal would be detecting faces despite face masks and face shields. Additionally, the COVID-19 situation requires everyone to wear masks at all times outdoors, in hospitals, offices, etc. A monitoring system that uses face detection to classify people as wearing or not wearing face masks in any situation, to help ensure that everyone wears one, is another interesting topic of research.

Fusion of Algorithms
Most face detection algorithms suffer from low accuracy in constrained and occluded environments. An interesting idea for further research is combining part-based models with different boosting algorithms. Boosting algorithms produce strong feature descriptors and thus minimize noise in an image. NN models can extract features from an image very efficiently. Another avenue of research in face detection, therefore, would be to combine these feature descriptors with different efficient classifiers, such as SVM, K-Nearest Neighbor, etc.
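The descriptor-plus-classifier fusion suggested above can be sketched as a two-stage pipeline. Everything here is a toy illustration: the gradient-orientation descriptor, the 1-nearest-neighbor classifier, and the synthetic "face-like" patches are our own stand-ins for a real boosted or NN feature extractor and a real classifier such as SVM.

```python
import numpy as np

def grad_hist(img, bins=8):
    """Toy gradient-orientation histogram descriptor (HOG-like, illustrative only)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi          # fold orientations into [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)         # normalise for illumination invariance

def nn_classify(x, train_X, train_y):
    """1-nearest-neighbour classifier over descriptor space (stand-in for SVM, etc.)."""
    d = np.linalg.norm(train_X - x, axis=1)
    return train_y[np.argmin(d)]

# Toy data: structured patches (strong diagonal gradient) vs flat noise patches.
rng = np.random.default_rng(1)
faces = [np.add.outer(np.arange(16), np.arange(16)) + rng.normal(0, 0.3, (16, 16))
         for _ in range(10)]
noise = [rng.normal(0, 1, (16, 16)) for _ in range(10)]
X = np.array([grad_hist(p) for p in faces + noise])
y = np.array([1] * 10 + [0] * 10)

# Classify an unseen structured patch via its descriptor.
query = np.add.outer(np.arange(16), np.arange(16)) + rng.normal(0, 0.3, (16, 16))
label = nn_classify(grad_hist(query), X, y)
```

The point of the sketch is the separation of concerns: the descriptor stage absorbs image noise, and any efficient classifier can then operate in the compact descriptor space.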

Energy Efficient Algorithms
Digital photo management tools, such as Google Photos, Apple's iPhoto, etc., and most mobile phones and digital cameras have built-in face detection systems. Computationally expensive algorithms, such as CNN, BPNN, etc., not only occupy computational resources, but also increase energy consumption in devices, making these day-to-day face detection systems highly energy consuming. This calls for an interesting line of research into designing face detection algorithms with very low energy consumption.

Use of Contextual Information
Human body parts provide a strong indication of where a human face can be. Partially visible and out-of-focus faces in an image are hard to detect and mostly result in false negatives. Taking human body parts, or contextual information more generally, into account when detecting human faces is a very promising field of research to pursue.

Adaptive and Simulated Face Detection System
Face detection algorithms are highly dependent on training data, and there is a shortage of datasets for face detection. Available datasets are prone to data corruption, and limited datasets make it difficult to fit a model to a new detection environment. Variance in illumination, human race, color, and other parameters increases false positives significantly. An interesting idea to minimize these problems and increase accuracy is generating simulated data to cover any given situation. Additionally, an adaptive system that incorporates new types of data without needing to be trained from scratch is a very interesting line of research to pursue.

Faster Face Detection Systems
Recent face detection algorithms, such as SVM, the Viola-Jones algorithm, etc., are fast. However, auto-focus systems in digital cameras and mobile phones require even faster face detectors. For faster detection, one possible approach could be parallel processing of portions of a target image on multiple nodes. Additionally, future research needs to shed further light on how to manage efficient parallel processing closer to the end devices. Note that edge computing technology is paving the way for catering to latency-stringent applications [290,291]. Future research should investigate how fast face detection can be realized using edge computing technology.
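The tile-parallel idea above can be sketched as follows. The `detect_faces` stub is hypothetical (a real system would call an actual per-tile detector), and the overlap between strips is there so that a face straddling a tile boundary is still seen whole by at least one worker.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_faces(tile):
    """Hypothetical per-tile detector stub: returns bounding boxes found in
    one tile. In practice this would invoke any single-image face detector."""
    x0, y0, w, h = tile
    return []  # no detections in this stub

def split_into_tiles(width, height, n, overlap=32):
    """Split an image into n vertical strips with a small overlap so that
    faces straddling a strip boundary are not missed."""
    step = width // n
    return [(max(i * step - overlap, 0), 0,
             min(step + 2 * overlap, width), height) for i in range(n)]

def parallel_detect(width, height, n_workers=4):
    tiles = split_into_tiles(width, height, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(detect_faces, tiles)
    # Merge per-tile detections; a real system would also de-duplicate
    # boxes detected twice inside the overlapping regions.
    return [box for boxes in results for box in boxes]

boxes = parallel_detect(1920, 1080)
```

On edge hardware the same structure applies, with each tile dispatched to a nearby node instead of a local thread; the merge-and-deduplicate step then becomes the latency-critical part.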

Conclusions
In this paper, we have presented an extensive survey of face detection methods, dividing them mainly into feature-based and image-based approaches. Among the sub-areas, NNs are very high-performing algorithms and the newest in face detection technology. Feature-based approaches are highly applicable for real-time detection, while image-based approaches perform very well on gray-scale images. Most of the algorithms used in face detection date back to the last century. While ASM models were extensively used for face detection earlier, NN models have been gaining popularity recently with hardware developments. At the same time, SA models are computationally less expensive and faster.
Besides face detection, the algorithms reviewed have been used extensively in fault diagnosis, EEG data analysis, various pattern recognition tasks, etc., and we have discussed the other fields where they have been applied successfully. The main challenges that all the algorithms face are occlusion, complex backgrounds, illumination, and orientation. Most of the algorithms explained in this survey cope reasonably with these challenges; the algorithms that have recently gained popularity, such as NNs, deal quite well with them. No single algorithm is best in all cases, and hence none can be universally recommended. The best way to choose an algorithm is to understand the problem to be dealt with and select the algorithm best suited to solving it. The performance of an algorithm depends on various factors, and more than one algorithm can be combined into a hybrid. This study, however, finds that almost every face detection algorithm yields false positives. Though face detection is currently used for tagging, auto-focusing, surveillance, etc., this dynamic will shift toward more critical applications, such as payment verification, security, healthcare, criminal identification, fake identity detection, etc., where false positives will cause serious problems. Understanding these pivotal processes and implementing more accurate face detectors is critically important if we are to reach the full potential of this technology.

Conflicts of Interest:
The authors declare no conflict of interest.