Face Image Analysis Using Machine Learning: A Survey on Recent Trends and Applications

Human face image analysis using machine learning is an important element in computer vision. The human face conveys information such as age, gender, identity, emotion, race, and attractiveness to both humans and computer systems. Over the last ten years, face analysis methods using machine learning have received immense attention due to their diverse applications. Despite the many methods reported, face image analysis still represents a complicated challenge, particularly for images obtained in 'in the wild' conditions. This survey presents a comprehensive review of methods in both controlled and uncontrolled conditions. Our work illustrates both the merits and demerits of each previously proposed method, starting from seminal works on face image analysis and ending with the latest ideas exploiting deep learning frameworks. We compare the performance of previous methods on standard datasets and also present some promising future directions on the topic.


Introduction
Human face image analysis is a fundamental and challenging problem in computer vision (CV). Face analysis plays a main role in different real-world applications, such as animation, surveillance, identity verification, forensics, medical diagnosis, human-computer interaction, and so on [1][2][3]. Some research addresses the various face analysis tasks (face, age, gender, race, head pose, etc.) as individual research problems [4][5][6][7][8][9][10][11]. However, CV researchers also report that all face analysis problems are related to each other and can assist each other if addressed jointly rather than as individual tasks. Despite tremendous research work on the topic, face analysis is still an arduous task due to various factors such as changes in visual angle, facial appearance, facial expressions, background, and so on. In particular, it becomes more complicated when tackled in uncontrolled and wild conditions. (Note that we are not considering facial recognition in our current study, since it is a much-explored area, and many good research and survey papers have already been published [12][13][14][15][16][17][18][19].) Face image analysis has several applications; some of these are listed below.
• Surveillance: Face analysis and tracking is widely used for surveillance purposes by CV researchers. A surveillance system presented in [20] uses the attention of people in a particular scene as an indication of exciting events. Similarly, the work proposed in [21] captures the visual attention of people by using various fixed surveillance cameras.
• Targeted advertisement: One very interesting application of face analysis is targeted advertisement, and several works have been proposed on the subject by CV researchers. For instance, Smith et al. [22] present a system that tracks the focus of attention of people and counts the number of subjects looking at outdoor advertisements. This work also has implications for human behavior analysis and cognitive science. Some recent work on targeted advertisement using CV and machine learning (ML) can be explored in [23][24][25].
• Social Behaviour Analysis: The human face is tracked and used in intelligent rooms to monitor and observe the participants' activities. The visual focus of participants' attention is judged particularly through head tracking [26][27][28][29][30]. Such a system follows the speaking direction of individuals and also provides information about gestures in a meeting. The semantic cues obtained are later transcribed with the conversations and intentions of all participants, which further provides searchable indexes for future use. Some excellent works that use human face tracking in workplaces and other meetings can be explored in [31][32][33][34][35].
• Driving Safety: Face analysis also plays a key role in ensuring driver safety. Some researchers designed driver-monitoring systems by installing a camera inside the car and tracking eyebrow and eyelid movements for signs of fatigue [36][37][38]. Such a system can also alert the driver, for instance, when there is a danger of an accident involving pedestrians [39]. Blind spots are detected while driving in the method proposed in [40], helping the driver to change the vehicle's direction. Another method proposed in [41] combines head localization and head pose estimation (HPE) information to estimate a pedestrian's path, helping drivers make vital decisions while driving.
• Estimation of face, expression, gender, age, and race: In the context of CV, human face analysis acquires high-level knowledge about a facial image. Face images convey several pieces of information, such as who the subject is, the gender and race of the person in the photograph, whether he/she is sad or happy, and what angle they are looking at. In all these tasks, facial analysis infers knowledge from a face image. Some human face analysis problems are multidisciplinary, as well as intrinsically related to human science. However, very few research works combine various face image analysis tasks in a single unified framework. Our current paper reports works that combine at least two face image analysis tasks in one model. Estimating age, race, gender, expression, and head pose are the most important problems in face analysis, with further applications in forensics, the entertainment industry, cosmetology, security controls, and so on [42][43][44][45].
Despite immense research efforts, face analysis still faces challenges due to several factors such as complex and changing facial expressions, occlusions, noise, illumination changes, and so on. To improve the recognition accuracy of a face analysis system, it is useful to correlate face analysis tasks with each other: for instance, males may have beards or mustaches, whereas females and children do not. Previous research also reveals that most face analysis tasks are linked with each other; for example, research reported in the 19th century by Wollaston [46] suggests that head orientation is strongly and intrinsically linked with perceived gaze. The effect is demonstrated by the author with a picture, as can be seen in Figure 1, where two views of one person are taken at two angles. Although the eyes are the same, the gaze direction perceived by the human brain is dictated by the two totally different head directions. The intrinsic relationship between various face image analysis tasks was later confirmed in [47], which claims that gaze estimation is a combination of HPE and eye location estimation. That research claims that head position provides a very strong indication of gaze orientation, particularly when the eyes are not visible, for instance, in case of occlusion or poor image resolution. The intrinsic relation between various face parts is also confirmed in the latest research work reported in [48][49][50][51][52]. Research reported in [49][50][51][52] suggests that the mutual relationship between different face parts can be exploited to address several tasks in a single framework. The work proposed in [53] segments a face image into six different face parts, which are later used for HPE in [49,50]. Similarly, gender recognition is combined with other tasks using the same strategy in [52].
A single framework is proposed in later research [51], combining gender recognition, race classification, HPE, and facial expression recognition. In short, CV researchers and psychology literature strongly suggest that face analysis tasks can be addressed in a better way if combined in a single unified framework. This paper aims to present the reader with a single review article where different methods are reported combining several face analysis tasks into one model.

Contributions and Paper Organization
To the best of our knowledge, this is the first attempt to combine various face analysis tasks in a single comprehensive review article. Survey papers for each of these tasks (race, age, gender, expression, and HPE) are individually available [54][55][56]; however, there is a clear need for a single paper covering all of these demographic components. We specifically focus on works published in the last 10 years. Moreover, a shift has recently been noticed in the state of the art (SOA) from conventional ML to deep learning (DL)-based methods, which justifies the need for an up-to-date review article. The main contributions of this paper are summarized as follows:
• We present a detailed survey on different human face image analysis tasks. This provides researchers with a recent, up-to-date overview of the SOA technology. We also introduce a taxonomy of the existing methods for these tasks, add a detailed discussion of each technology's characteristics, and explicitly elaborate on the open problems faced by existing face analysis technology.
• We provide a list of all the publicly available datasets for each of the five face analysis tasks. We describe how these datasets were collected and their characteristics, including the number of images, the subjects involved, sex information, environmental diversity, and other details.
• We conduct a detailed study of the methods for each of the five tasks, in both indoor and outdoor conditions. We summarize the results obtained with each method, present a concrete summary of the results for each task, provide a critical review, and point to some future directions. We also dedicate attention to the various sensors used for data collection and discuss how ground truth data is collected and labeled.
The rest of this work is arranged as follows: in Section 3, we discuss already existing datasets (DBs) used to evaluate face analysis. Recent algorithms within face analysis are reviewed in detail in Section 4. How ground truth data is created for these face analysis tasks is discussed in Section 5. We present a comparison of various methods with their reported results in Section 6. We also discuss the open challenges in face analysis with some recommendations in Section 6. Finally, we present the conclusion of the paper in Section 7.

Datasets
The performance of a face analysis system is measured with a DB available for experiments. A detailed list of the available DBs for all the tasks is presented in this section, and a comprehensive overview is given in Table 1. We include only those DBs that address at least two face analysis tasks. Following the progress of each face analysis task, the DBs have evolved over the last couple of years; specifically, complexity in images and backgrounds is now considered.
• UNISA-Public [57]: UNISA-Public is collected in real-time and unconstrained conditions. The DB consists of 406 face images taken from 58 individuals. The DB does not provide cropped faces, so a face localization algorithm is needed before the images can be used for gender classification. The data is collected in a building with a camera fixed at the entrance, so lighting conditions do not change significantly. Different poses and facial expressions are present, as the participants were not told about the data collection process beforehand. Some motion blur can be seen due to sudden movements of the individuals.
• Adience [58]: This is a recent DB released for face analysis tasks, including age and gender classification. It is a challenging DB collected in an outdoor environment, with all images captured through smartphone devices. The data set is much more challenging, as different pose variations are included along with changing lighting conditions. The DB is large, with around 25,580 face images from 2284 participants. The exact age of each participant is not given; instead, each subject is assigned to an age group, so the DB can be used for age classification but not age estimation. The DB can be freely downloaded from the Open University of Israel.
• IMDB [59]: The IMDB DB contains images of celebrities collected from the IMDb website.
The DB is partitioned into two parts, namely IMDB and WIKI. The first part consists of 460,723 face images, whereas WIKI consists of 62,328. These images are labeled with both gender and age, but there are some errors in the ground truth annotation. In fact, the authors assume that each face image belongs to the listed celebrity and then automatically annotate it with the gender declared in the profile; this assumption results in errors in the ground truth annotation.
• VGGFace [60]: This DB was explicitly built for face recognition but was later used for additional face analysis tasks. It is large enough to train a DL framework. The DB was gathered in a very inexpensive way through Google search, yielding a huge quantity of weakly annotated face images, which were then filtered and annotated manually through a fast and inexact process. VGGFace2 is an extension of the VGGFace DB.
• FERET [69]: This DB was introduced early on for a number of face analysis tasks, including face recognition, gender classification, and HPE. It is a simple DB collected in constrained laboratory conditions. It is medium-sized, with 14,126 images of 1199 subjects. The DB is somewhat challenging, as variations in facial expressions are present and the lighting conditions are not uniform.
• CAS-PEAL [70]: CAS-PEAL has more than 100,000 images, with a sufficiently large number of participants (more than 1000). The DB is not balanced, having 595 male and 445 female candidates. The authors considered yaw and pitch angles in the ranges ±45° and ±30°, respectively. The DB is simple to use in experiments, as less complexity is included in the data collection and the number of poses is limited.
• LFW [71]: This is a comparatively challenging DB, as most of the images are collected in very unconstrained environmental conditions.
The total number of subjects in the DB is 5749, whereas the total number of images is 13,233.

Methods
It is not easy to organize all research work on the topic into a single taxonomy. In the proposed work, we do not follow or claim a specific taxonomy; instead, we label each system by the fundamental method that underpins its specific implementations. We discuss various types of methods and also present references where these approaches have been applied to a specific task, covering both the advantages and disadvantages of each method. A summary of all these methods is presented in Figure 2.

Appearance Methods
These methods assume that a specific correlation exists between a face analysis task (e.g., the 3D head pose) and the properties of the 2D face image. To learn this relationship, experiments are performed by training classifiers. Each face image is treated as a one-dimensional vector, and certain visual features are extracted; this statistical information is then used for human face analysis. Some methods that use appearance-based approaches are explored in [76][77][78][79].
Appearance-based methods are straightforward approaches suitable for both bad- and good-quality images. Only positive training data is needed; negative examples are not required. Expanding these methods is also easy, allowing the framework to adapt to changes when required.
Along with the above merits, these models also have some serious weaknesses. The face region must already be detected by the system, and if face localization is not correct, drastically poor results are obtained. The amount of training data needed is large. When variations in local face appearance occur (for example, occlusions, subjects with facial hair, or glasses), the algorithm has no mechanism to cope with them. Lastly, pair-wise similarity is another significant issue faced by these methods; although researchers have developed some algorithms to handle this problem, the reported results are still not convincing.
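As a concrete illustration of the appearance-based pipeline described above, the sketch below flattens each image into a one-dimensional vector and classifies a new face by its distance to per-class mean vectors. The data and the nearest-class-mean classifier are toy choices for illustration, not a method from the surveyed literature.

```python
# Minimal appearance-based sketch (hypothetical data): each face image is
# flattened into a 1D vector and compared against per-class mean vectors
# learned from training examples.

def flatten(image):
    """Flatten a 2D image (list of rows) into a 1D feature vector."""
    return [pixel for row in image for pixel in row]

def train_class_means(images, labels):
    """Compute the mean appearance vector for every class label."""
    sums, counts = {}, {}
    for img, lab in zip(images, labels):
        vec = flatten(img)
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def classify(image, class_means):
    """Assign the label whose mean vector is closest (squared Euclidean)."""
    vec = flatten(image)
    def dist(mean):
        return sum((a - b) ** 2 for a, b in zip(vec, mean))
    return min(class_means, key=lambda lab: dist(class_means[lab]))

# Toy example: two 2x2 "images" per class (bright vs. dark faces).
train = [[[200, 210], [205, 215]], [[190, 200], [195, 205]],
         [[20, 30], [25, 35]], [[10, 20], [15, 25]]]
labels = ["bright", "bright", "dark", "dark"]
means = train_class_means(train, labels)
print(classify([[180, 190], [185, 195]], means))  # → bright
```

Note how the whole image is the feature: if the face crop is wrong, the flattened vector no longer corresponds to a face, which is exactly the localization sensitivity discussed above.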

Geometric Methods
These are also known as model-based approaches. These techniques need certain key points such as eyebrows, eyes, lips, nose, and so on. A feature vector of local key points is extracted from these specific locations, and an algorithm is developed based on the relative positions and mutual relationships of the key points. As with appearance-based methods, a human specifies the particular face task.
The literature reports the exploitation of different facial key points for human face analysis. Most methods use key points in the eyes [1,77], while others use distances between multiple face parts, such as the intra-ocular distance. The tip of the nose is used as a discriminative cue for HPE in [80]. Due to the complex geometry of hair and the difficulty of extracting it, hair was not previously used for face analysis; however, some recent methods report good results in HPE [81,82]. The processing time of geometric methods is much lower [83], since very few facial features are extracted and the feature vector is small. The work reported in [84] uses nostril, eye, cheek, and chin information. Another method proposed in [85] is used as a front-end face analysis framework for face recognition and head pose prediction. The method proposed in [86] uses a combination of different features, which increases the processing time of the framework; the computational cost increases in exchange for an improvement in the overall precision of the system.
The main advantage of these methods is that the extracted key points are robust to certain transformations such as translation and rotation. However, the extraction and detection of these key points is itself a big challenge in CV. In previous literature, active shape models (ASMs) [87] were adopted for key point localization. However, ASM fails to locate key points in some cases (for example, when there is strong variation in lighting conditions, a complex facial expression, or occlusion). If the image resolution is poor or in far-field imagery conditions, extraction of these key points is almost impossible.
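To make the geometric idea concrete, the following sketch builds a small feature vector from a handful of 2D key points, normalizing distances by the intra-ocular distance for scale robustness. The landmark names and coordinates are hypothetical, not a standard annotation scheme.

```python
import math

# Illustrative geometric-feature sketch: distances between assumed facial
# key points, normalized by the intra-ocular distance so the features are
# invariant to image scale (and to translation by construction).

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def geometric_features(landmarks):
    """Return scale-normalized distances between selected key points.

    `landmarks` maps assumed names such as 'left_eye', 'right_eye',
    'nose_tip', 'mouth' to (x, y) coordinates.
    """
    iod = euclidean(landmarks["left_eye"], landmarks["right_eye"])
    eye_mid = ((landmarks["left_eye"][0] + landmarks["right_eye"][0]) / 2,
               (landmarks["left_eye"][1] + landmarks["right_eye"][1]) / 2)
    return [
        euclidean(eye_mid, landmarks["nose_tip"]) / iod,           # eyes -> nose
        euclidean(landmarks["nose_tip"], landmarks["mouth"]) / iod,  # nose -> mouth
    ]

pts = {"left_eye": (30, 40), "right_eye": (70, 40),
       "nose_tip": (50, 60), "mouth": (50, 80)}
print(geometric_features(pts))  # → [0.5, 0.5]
```

The resulting vector is tiny compared to a flattened image, which is why geometric methods are fast; the hard part, as noted above, is obtaining the key points reliably in the first place.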

Regression Methods
In regression methods, a face analysis task is estimated by learning a mapping from the face image space to the specific task's output. These methods have drawbacks (e.g., it cannot be assured that a regression tool will efficiently learn a proper mapping). High dimensionality also creates a challenge in some cases, which can be mitigated by using low-dimensional representations such as localized gradient histograms [39], often combined with support vector regressors (SVRs) [88]. Regression methods can be easily applied to such low-dimensional features (e.g., [79]).
One of the main non-linear regression tools is the neural network. Such methods consist of multilayer perceptrons with many feed-forward cells arranged in different layers [89,90]. Backpropagation is used for training the perceptrons: the classification error is propagated backward through the network, updating both its weights and biases. Some regression-based methods for head pose prediction are explored in [91,92].
Local linear mapping is another type of regression-based neural network method, consisting of multiple linear maps, as in [93]. In these methods, a weight matrix is learned for each map by presenting the input training data and comparing it to the map's centroid; at test time, a nearest neighbor search is performed, followed by the regression step. For more details, see the work reported in [94].
One main advantage of these methods is their low processing time, and they are straightforward to implement. Similarly, these methods can be regularized easily, avoiding over-fitting problems. Stochastic gradient descent (SGD) can be used to update regression methods when more data is added. However, a regression-based method's performance drops significantly if the labeled images are not appropriately annotated. Robustness to such annotation errors can be improved through distortion and invariance approaches, as in [95].
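The regression idea above can be sketched in a few lines: a linear mapping from a low-dimensional feature to a head-pose angle, trained with SGD on squared error. The feature and angle values are toy data, not taken from any surveyed DB.

```python
# Minimal regression sketch: learn a linear mapping from a low-dimensional
# feature vector to a head-pose angle with stochastic gradient descent.
# Data and dimensions are toy values for illustration only.

def sgd_linear_regression(X, y, lr=0.01, epochs=500):
    """Fit y ≈ w·x + b by minimizing squared error, one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - target           # gradient of 0.5 * err**2
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Toy data: feature = [normalized nose offset], target = yaw in degrees.
X = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
y = [-45.0, -22.5, 0.0, 22.5, 45.0]
w, b = sgd_linear_regression(X, y)
print(round(predict(w, b, [0.5]), 1))  # ≈ 22.5
```

Because each SGD step uses one sample, the same loop can keep updating the model as new annotated data arrives, which is the incremental-update property mentioned above.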

Influence Modeling
For the basic concept of influence-based modeling, see [96,97]. This model presents the idea that individuals influence each other in a system and predicts how an actor affects a partner. When the initial state of one actor relative to a location is already known, the outcome is estimated more accurately. Since a single task may provide useful information for other face tasks, all these tasks influence each other; therefore, several tasks can be handled in one framework.
Inspired by the influence-based modeling idea, some works are presented in [51,52]. These works address several face analysis tasks, including HPE, race, gender, and age estimation in a single model. The methods proposed in [51,52] use face segmentation information provided by a prior model developed in [53]. These methods do not extract landmarks information or high-dimensionality data, but instead perform face segmentation as a prior step.
Some excellent results are claimed by the authors of these methods, particularly for face analysis. However, one main drawback of these methods is the need for a manually labeled face DB, whose creation is a time-consuming and laborious task. Moreover, the computational cost of these models is also higher than that of competing algorithms.

Model-Based 3D Registration
In these methods, measured data is registered with a specific reference model. Meyer et al. [98] proposed a method that predicts head pose through registration of a morphable model with depth data. Similarly, 3D reconstruction and a morphable model are used by Yu et al. [99]. Another work in [100] presents a model-fitting process in which a subject's face is modeled in 3D with depth data. Papazov et al. [101] propose a system for HPE, landmark localization, and gender classification; the authors collected data using a depth sensor and a triangular patch that encodes the shape of the 3D surface. Similarly, Jeni et al. [102] perform registration by training a cascade regression framework with face scan images. Baltrusaitis et al. [103] combined depth and RGB data to regress an HPE framework using a random forest classifier. Compared to other approaches, model-based registration methods are less explored by computer vision researchers. Some recent methods for face image analysis using 3D registration can be seen in [104][105][106].
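The core alignment step shared by these registration methods can be illustrated with the Kabsch algorithm, which recovers the rigid rotation and translation aligning measured 3D points with corresponding points on a reference model. Known point correspondences are assumed here; real systems must establish them (e.g., via ICP-style matching), and none of the surveyed papers is claimed to use exactly this sketch.

```python
import numpy as np

# Kabsch sketch: given corresponding 3D point sets, recover the rigid
# rotation R and translation t such that R @ P + t ≈ Q.

def kabsch(P, Q):
    """P, Q: (3, N) arrays of corresponding 3D points."""
    p_mean = P.mean(axis=1, keepdims=True)
    q_mean = Q.mean(axis=1, keepdims=True)
    H = (Q - q_mean) @ (P - p_mean).T          # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = q_mean - R @ p_mean
    return R, t

# Toy check: rotate a point set by 30 degrees about z and recover the pose.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
P = np.array([[0.0, 1.0, 0.0, 2.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
Q = R_true @ P + np.array([[1.0], [2.0], [3.0]])
R, t = kabsch(P, Q)
print(np.allclose(R, R_true))  # True
```

In an HPE setting, the recovered rotation directly encodes the head pose relative to the reference model, which is why registration and pose estimation go hand in hand in this family of methods.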

Hybrid Methods
Hybrid methods were introduced in [107] for the first time. The central idea is to combine several of the aforementioned methods into a single framework to perform face analysis tasks. Appearance- and geometric-based methods are combined in [108], where global and local features are exploited and supplement each other in face analysis.
A method known as HyperFace [109] extracts features through CNNs and then performs face detection, gender recognition, landmarks localization, and head pose tracking and estimation. KEPLER [110] is another well-known hybrid method to address face analysis tasks. KEPLER learns both global and local features to explore some structural face dependencies. Some other recent hybrid methods can be seen in [97,109,111,112].
Since hybrid approaches obtain information from different cues through different methods and later fuse the estimates produced independently by each system, prediction accuracy increases. Moreover, the limitations of one specific method, such as initialization requirements or drift, can be overcome by another. To address face analysis tasks, data-mining methods are used in [113]. Similarly, gender recognition is explored in [114] through face segmentation. Extreme learning machines and CNNs are explored by Mingxing et al. [115] to address face analysis; the authors explored age and gender classification in their proposed work.

Deep Learning Methods
With the transition from conventional ML to the recent DL, several limitations and drawbacks of conventional ML have been mitigated. Significantly improved results are reported with the introduction of DL methods in various visual recognition tasks. DL approaches, specifically convolutional neural networks (CNNs), outperform methods based on conventional features. The method proposed in [116] does not use landmark information but extracts features through CNNs and then trains a classification tool. Similarly, CNNs are combined with a regression loss function in QuatNet [117].
The performance of conventional ML methods on face analysis tasks was satisfactory to some extent on images obtained in controlled lab conditions. However, when traditional ML models are exposed to images collected in wild conditions, a significant drop in performance is noticed. On the other hand, DL-based methods perform very well on these DBs that are collected in unconstrained conditions [118].
Gozde et al. [119] introduced a deep learning-based system for face analysis to measure customer interest, which, according to the authors, is one of the most exciting and innovative trends in the area. In the initial stage, the system detects customers whose heads and faces are oriented towards an advertisement. Then, facial components are extracted for various tasks such as head pose estimation, expression recognition, and age estimation. Other papers on face analysis using deep learning include [120][121][122][123].
Although many improvements have been noticed in the performance of face analysis tasks with these DL-based methods, their use is still sporadic. Since these approaches are recent, there is still a need to establish their specific and complete potential in this domain.
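The basic CNN operation underlying these methods can be sketched as a single convolution followed by a ReLU non-linearity. Real networks learn many such filters from data; the edge filter below is hand-set purely for illustration.

```python
import numpy as np

# Minimal CNN building block: one 2D convolution (valid padding) with a
# hand-set filter, followed by ReLU. Trained networks stack many learned
# filters like this one.

def conv2d_valid(image, kernel):
    """2D cross-correlation with 'valid' padding (no border handling)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# This filter responds where intensity increases from left to right.
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
image = np.array([[0.0, 0.0, 9.0, 9.0],
                  [0.0, 0.0, 9.0, 9.0],
                  [0.0, 0.0, 9.0, 9.0]])
feature_map = relu(conv2d_valid(image, edge_filter))
print(feature_map)  # strong response only at the 0 -> 9 boundary column
```

Stacking such filtered-and-rectified maps, with learned filters instead of hand-set ones, is what lets CNN-based face analysis methods cope with the unconstrained imagery where conventional features fail.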

Annotation Type and Processes
In the CV context, ground truth data consists of a set of images with labels. The labels can be added by a human or in some automatic way, depending entirely on the complexity of the specific problem. These labels usually include interest points, histograms, corners, shapes, and feature descriptors from a model. For the training phase, ground truth data may be positive or negative; negative images in the training data are used to generate false matches, which helps in building the model.
We argue that creating ground truth data may not be a cutting-edge research area; however, it is still as important as any proposed algorithm for CV tasks. No algorithm can be verified and assessed accurately unless the ground truth data has been prepared such that the image contents and regions of interest are selected effectively: better analysis is only possible with better ground truth data. Ground truth data and its preparation depend upon the task to be addressed; for instance, in 3D image reconstruction, the attributes of the ground truth data must be recognized accurately for each task. Creating ground truth data for some tasks, such as gender, race, and expression classification, is easy. The labeling method can be either human annotation or machine-automated annotation. Ground truth data is created using human annotation for race, age, gender, and expression recognition. However, creating ground truth data for HPE is tricky; therefore, we present some details in the following paragraphs.
Compared to the above-mentioned face analysis tasks, creating ground truth data for HPE is not easy. The earliest method to create ground truth is manual annotation, in which a human assigns a specific label to each image. Creating ground truth data through this method is easy for smaller DBs; however, as the size of the DB increases, it becomes a time-consuming task, and a higher probability of human error exists. For example, Pointing'04 is collected by asking each participant to look at points marked on a wall in the measurement room. Such methods assume that each participant accurately points their head at the marker and that each subject's head is always in the exact physical position, which is practically impossible to guarantee.
Along with the above manual annotation, HPE DBs are also established through synthetic data, as in [70,124]. Typically, a head model is placed in a virtual scene, and a camera is moved on the surface of a sphere whose center coincides with the head model's center. Creating ground truth data through this synthetic method also has drawbacks: for example, the model may show only a neutral expression, or the background and parts of the head may be missing. Both drawbacks make the assessment of algorithms working in real-world conditions difficult. An image showing the creation of ground truth data for HPE is shown in Figure 3. In some other methods, such as [125], a laser pointer is attached to each subject's head, which helps to pinpoint every discrete physical location. Other DBs are annotated with magnetic sensors attached to participants' heads; for example, the BU [126,127] DB is created through this method. The magnetic sensor method produces nearly accurate ground truth data; however, the sensors are very susceptible to small metal objects present in the nearby environment.
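The synthetic setup described above, with a camera moved on a sphere centered on the head model, can be sketched as follows. The angle ranges mirror those mentioned earlier for CAS-PEAL and are otherwise arbitrary; by construction, every rendered view would carry an exact pose label.

```python
import math

# Sketch of synthetic HPE ground-truth generation: camera positions are
# sampled on a sphere centered on the head model, so the yaw/pitch label
# of each rendered image is known exactly by construction.

def camera_position(yaw_deg, pitch_deg, radius, center=(0.0, 0.0, 0.0)):
    """3D camera position for given yaw/pitch angles (degrees) and radius."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    x = center[0] + radius * math.cos(pitch) * math.sin(yaw)
    y = center[1] + radius * math.sin(pitch)
    z = center[2] + radius * math.cos(pitch) * math.cos(yaw)
    return (x, y, z)

# Sample a grid of poses, e.g. yaw in ±45° and pitch in ±30° at 15° steps;
# each position is stored together with its exact pose label.
samples = [((yaw, pitch), camera_position(yaw, pitch, radius=1.0))
           for yaw in range(-45, 46, 15)
           for pitch in range(-30, 31, 15)]
print(len(samples))  # 35 poses: 7 yaw values x 5 pitch values
```

The labels are exact because they are inputs to the renderer rather than measurements, which is precisely why synthetic DBs avoid the annotation errors of manual and pointing-based protocols.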

Discussion: Open Areas of Research and Methods Comparison
This section includes a discussion on the results obtained using SOA DBs and some promising future directions for research on the topic.

Comparative Assessment of Reported Results
Research papers and methods reported in this paper are selected from the last 10 years' (2012-2022) work. Table 2 summarizes the results of all five face analysis tasks on SOA DBs. Similarly, Table 3 presents a summary of the latest research work on the topic and its yearly development. From both Tables and the reported methods, we draw some conclusions, which are as follows:
• Previous methods reported their results using the classification rate (CR) for age, gender, expression, and race classification, and we also compare and present the results for these four tasks with CR. CR is a single metric defined as the ratio of correctly identified images to the total number of images. Mathematically, this can be written as follows:

CR = (number of correctly classified images / total number of images) × 100 (1)

Along with CR, two other informative metrics are used for evaluating an HPE framework: pose estimation accuracy (P_ea) and mean absolute error (M_ae). M_ae is comparatively more common than P_ea, as it provides a single value that gives an easy insight into the overall performance of an HPE system. On the other hand, P_ea depends purely on the poses used, and hence provides comparatively less information for evaluating the performance of a system. Sometimes, confusion matrices (C_mat) are also provided with results. A C_mat is in tabular form, where rows are indexed by the original class and columns by the estimated class; it gives a deep visual insight into the classification errors across multiple classes. However, in this paper, we are not considering C_mat and P_ea, as very few papers reported these metrics, so a proper comparison cannot be made. Mathematically, M_ae can be represented as follows:

M_ae = (1/N) Σ_{i=1}^{N} |Y_i − Ŷ_i| (2)

In Equation (2), N represents the number of test samples, Y_i the ground truth value, and Ŷ_i the estimated value for the ith sample.
• Face image analysis is a hot topic in CV. Table 2 shows more details on the performance reported for each method, and Table 3 shows a summary of the face analysis work done from 2012 to 2022.
It is clear from Table 2 that improvements in CR and M_ae values have been brought about gradually. A quick look at the results in Table 2 reveals that the performance on the HPE, gender recognition, and age classification tasks is not the same for traditional ML and the newly introduced DL methods. From [50][51][52], it is clear that influence-based modeling methods perform better than conventional ML-based methods. Moreover, in some cases, influence-based methods even perform better than DL-based methods. Therefore, we believe that a much better understanding of DL algorithms and their application to face analysis tasks is needed. The performance of influence-based modeling on simple DBs acquired in indoor conditions is good. However, when influence-based methods are applied to DBs acquired in open, uncontrolled conditions, a significant drop in performance can be seen. DL methods show much improved results on challenging DBs (for instance, AFLW [66]). The results of traditional ML methods on AFLW are very poor, whereas DL has shown much better performance on the same set of images and the same DBs. As far as the performance of traditional ML algorithms is concerned, a mixed response across face analysis tasks can be seen in Table 2. Hybrid models show much better results, as is clear from Table 2.
• Existing methods for face analysis do not define any specific experimental protocol that can be used as a standard for experimental validation. Consequently, different authors use their own experimental setups to validate their methods. Most researchers use 10-fold or 5-fold cross-validation. The results summarized in Table 2 therefore come from different experimental setups and validation methods, and results reported on the same DB may use different validation protocols. For example, Geng et al. [128] use 5-fold cross-validation for Pointing'04, whereas Khan et al. [52] use a subset of the same DB.
Therefore, the results presented in Table 2 can be used as a summary, but with a warning to avoid drawing concrete conclusions from them.
• From the results in Table 2, it can be seen that the performance of most methods on the Adience DB is significantly lower than on the other DBs. Adience is a large DB collected in unconstrained conditions. The results in Table 2 highlight the fact that the difficulty level of the Adience DB is still high, and more research is needed to make face analysis technology applicable to images collected in uncontrolled conditions. Moreover, the quality and type of available ground truth data are diverse. The gradual development of ground truth collection methods can be seen in Section 5. The widely employed manual labeling methods introduce labeling errors, which newer methods reduce. Synthetic DBs, by contrast, are comparatively simple to create and carry a lower risk of labeling errors.
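The metrics discussed above (CR, M_ae, and C_mat) can be sketched in a few lines of plain Python. This is a minimal illustration, not any specific paper's evaluation code; the label and pose values below are hypothetical.

```python
def classification_rate(y_true, y_pred):
    """CR: ratio of correctly classified samples to total samples (Equation (1))."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """M_ae: mean absolute difference between ground truth and estimate (Equation (2))."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, num_classes):
    """C_mat: rows indexed by ground-truth class, columns by estimated class."""
    mat = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        mat[t][p] += 1
    return mat

# Hypothetical gender labels (0/1) and head-pose yaw angles (degrees)
gender_true, gender_pred = [0, 1, 1, 0], [0, 1, 0, 0]
pose_true, pose_pred = [0, 15, 30, 45], [5, 15, 25, 50]
print(classification_rate(gender_true, gender_pred))  # 0.75
print(mean_absolute_error(pose_true, pose_pred))      # 3.75
```

Note how C_mat preserves per-class error structure that CR collapses into a single number, which is why it gives deeper insight when many classes are involved.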

Benchmark DB Development
The development and evaluation of an efficient face analysis framework require a sophisticated DB for experiments. CV researchers have reported various DBs for all face analysis tasks, ranging from a very simple one-color flat background (Pointing'04) to complex scenarios (LFW). Although the number of DBs is increasing gradually, as can be seen from Table 1 and Table 3, none of them addresses all face analysis tasks in a single DB. It is also clear from the discussion of the reported DBs that none can be used on its own for all face analysis tasks, as no DB covers all the modalities of a human face image. Secondly, it is also clear from Table 1 that the number of DBs addressing facial expressions is comparatively small: very few DBs provide data for facial expression and race classification. Compared to these two, the remaining three areas (i.e., HPE, gender, and age classification) are more explored.
The number of images or videos in most DBs is comparatively low for DL-based experiments (with a few exceptions, e.g., Adience). Another challenge is that these DBs do not provide multiple sessions per participant or information about the time lapses between sessions. Diversity in lighting conditions, recording environments, ethnicity, gender, and viewing angle is another aspect not addressed by researchers. Therefore, a preliminary step should be to contribute a DB that caters to all the modalities, including session-spanning information, ethnicity, age, gender, and facial expression.

Research on 3D Faces Is Required
As can be observed, most of the research conducted on face analysis tasks uses 2D images; few papers report on 3D face images. We believe that exploring face analysis through 3D images will be an exciting topic. Most current 3D work uses synthetic faces. With 2D images, valuable information is often lost; for example, the exact 3D positions of the nose and chin can provide cues that 2D projections discard. The identity of a person is recognized using face image analysis methods in [179], where a learned pose-invariant multiview 3D recognition approach is used to address the problem. The authors used four datasets: UMB-DB, GavabDB, Bosphorus, and the FRGC database. Generative 3D face models are applied for the pose-invariant illumination method in [179]; according to the authors, the proposed 3D face model also fits 2D as well as 3D images under various conditions and sensors. Three-dimensional face analysis could also open further research and application domains, such as face beautification, plastic surgery, and so on.

Knowledge Transfer and Data Augmentation: Possible Areas of Exploration
Looking at new trends in other CV tasks, we expect that face analysis will move more and more towards new DL methods. Since DL methods face training problems due to limited ground truth data, accurate knowledge transfer (KT) [180] and supervision learning [181] are possible options to research. Other possible areas of improvement are data augmentation [182] and foveated architecture methods [183]. Data augmentation can overcome the limited-data scenario of DL architectures. We also note that a comparatively less researched domain for KT in DL is heterogeneous domain adaptation. KT is extremely helpful in transferring knowledge from the training to the testing phase when the attributes differ somewhat, and it minimizes the labeling effort for training data: as data from another domain is utilized, the labeling task is reduced. Considering new developments in DL methods, the next probable keywords are temporal pooling, LSTMs, optical flow frames, and 3D convolution for human face image analysis. Even though some of the aforementioned methods are already being explored by researchers, more research is needed to improve the performance on these tasks.
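The data augmentation idea above can be illustrated with a toy sketch: generating flipped and randomly cropped variants of a face image to enlarge a limited training set. This is a pure-Python illustration on a nested-list "image"; the function names are our own, and real pipelines would use a library such as torchvision or imgaug on tensors.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image (list of rows) left-to-right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Take a random crop_h x crop_w sub-window (typically resized back for training)."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - crop_h + 1)
    left = rng.randrange(w - crop_w + 1)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, rng=random):
    """One stochastic augmentation pass: flip with probability 0.5, then crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, len(image) - 1, len(image[0]) - 1, rng)

# Toy 3x4 "face image" of pixel intensities
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
augmented = augment(img)  # a 2x3 flipped-and/or-cropped variant
```

Because each pass is stochastic, repeated calls on the same image yield distinct training samples, which is precisely how augmentation compensates for small ground-truth DBs.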

Summary and Concluding Remarks
Human face image analysis is an essential step in various face-related applications, since it provides rich information about a subject's intent, motivation, attention, and so on. Despite the extensive research on the topic, particularly in the last 10 years, face image analysis is still very challenging, particularly for data collected in unconstrained conditions. We also investigate aspects of existing solutions: first, we review SOA methods based on hand-crafted representations, and then we move to the recently introduced DL architectures. We present an analysis of the SOA results obtained so far on the topic. Finally, we identify several promising open problems and present possible future directions; for example, we expect to see more evaluations of DL techniques on the most challenging DBs (i.e., those collected in uncontrolled environmental conditions). Another interesting direction is the combination of DL and influence modeling; an additional possibility is the extension of influence modeling with geometric modeling, in which feature extraction and classification are exploited with DL architectures. We believe that this survey, the DBs, and all the methods and algorithms mentioned here can help the research community working on the topic to improve current SOA performance and inspire new research directions.

Conflicts of Interest:
The authors declare no conflict of interest.