Feature Selection on 2D and 3D Geometric Features to Improve Facial Expression Recognition

An essential aspect of the interaction between people and computers is the recognition of facial expressions. A key issue in this process is to select relevant features to classify facial expressions accurately. This study examines the selection of optimal geometric features to classify six basic facial expressions: happiness, sadness, surprise, fear, anger, and disgust. Inspired by the Facial Action Coding System (FACS) and the Moving Picture Experts Group 4 standard (MPEG-4), an initial set of 89 features was proposed. These features are normalized distances and angles in 2D and 3D computed from 22 facial landmarks. To select a minimum set of features with maximum classification accuracy, two selection methods and four classifiers were tested. The first selection method, principal component analysis (PCA), obtained 39 features. The second selection method, a genetic algorithm (GA), obtained 47 features. The experiments ran on the Bosphorus and UIBVFED data sets with 86.62% and 93.92% median accuracy, respectively. Our main finding is that the reduced feature set obtained by the GA is the smallest in comparison with other methods of comparable accuracy. This has implications for reducing recognition time.


Introduction
People use cognitive mechanisms to recognize emotions during the communication process. One such mechanism is understanding non-verbal human behavior. This has been investigated since 1872, beginning with Darwin's study involving a cognitive analysis of facial expressions [1]. In 1978, Suwa et al. presented the first attempt to automatically analyze facial expressions [2]. In the same year, Ekman and Friesen proposed a standard called the Facial Action Coding System (FACS), composed of 44 facial Action Units (AUs) describing all facial movements. Likewise, Ekman proposed the six basic universal emotions: happiness, sadness, surprise, fear, anger, and disgust [2,3].
The first step of a facial expression recognition (FER) system includes different tasks. Sometimes it includes preprocessing to reduce noise, image enhancement, and normalization; however, the most important task is face detection [6]. Once this is done, features are located on the face. Common techniques in this step are masking, scaling, converting to grayscale, and the detection of landmarks and regions [4,7]. Considering the type of data, some works take only an image as input and are therefore static [8][9][10]. In contrast, the dynamic type uses a temporal sequence of images [8,11].

• A set of geometric features is proposed, evaluated, and compared. These features are derived from the 2D and 3D geometry of the human face using angles and normalized distances between facial landmark points.
• To obtain relevant features, two methods are implemented and compared: PCA, as a common technique, and a GA, as a new proposal.
• The performances of four classifiers (k-NN; E-KNN, an ensemble subspace k-nearest neighbor classifier; SVM3, an SVM with a cubic kernel; and SVM2, an SVM with a quadratic kernel) are compared using our features.
• A comparative study is presented. Our proposal compares favorably in accuracy with other works in the literature that use static images and the Bosphorus database, while greatly lowering the number of features used and thus reducing the computational cost of classification.
The rest of the paper is organized as follows. Section 2 describes material and methods, specifically, details regarding the data acquisition, feature extraction, feature selection, and classification. Section 3 presents our experiments and results. Finally, Sections 4 and 5 show the discussion and the research findings.
In the data acquisition stage, instances of facial expressions are collected from the Bosphorus or UIBVFED database. Then, feature extraction is carried out and, from 3D landmarks on a human face, 89 geometric features are determined. In the third stage, relevant features are selected from the original feature set through two methods: PCA and a GA.
Finally, a support vector machine with a cubic kernel is applied to the original feature set and the reduced feature sets to classify the six basic facial expressions. The expressions are (i) anger, (ii) disgust, (iii) fear, (iv) happiness, (v) sadness, and (vi) surprise. Throughout this paper, these facial expressions are abbreviated as AN, DI, FE, HA, SA, and SU, respectively.

Data Acquisition
This step involves acquiring the facial information. In traditional FER approaches, the algorithms are trained and tested on a database. In unconstrained environments, efficacy depends on the acquired image data and is limited by issues such as occlusions, pose variation, and illumination [6].
For this study, we only consider databases in which a single face appears in each image and 3D landmarks are identified. We extracted the information from two different datasets: (a) Bosphorus and (b) UIBVFED. Table 1 shows the characteristics of these two databases [20,21]. The Bosphorus database is commonly used in 3D facial expression recognition. It was developed in 2008 at Boğaziçi University in Turkey [22]. The database includes a set of 4652 3D facial scans with manually defined landmarks collected from 105 subjects: 60 men and 45 women. The scans were taken under different occlusion conditions and various poses. Furthermore, every subject was instructed to express the six basic emotions, the neutral state, and various AUs. Each scan covers just one pose or expression. The scans were acquired using a 3D sensor (InSpeck Mega Capturor II 3D) with a depth resolution of 0.3 mm (X axis), 0.3 mm (Y axis), and 0.4 mm (Z axis) [22,24].
For our experiments, 424 scans (instances) were taken. Specifically, Table 2 shows the number of instances obtained per facial expression (surprise, sadness, happiness, fear, disgust, and anger). The selected scans correspond to faces showing the six universal facial expressions. All these images are of frontal faces, with no pose variation and without occlusions.
The Virtual Facial Expression Dataset (UIBVFED) was developed using Autodesk Character Generator and consists of 640 images and their respective sets of landmarks [23]. Each set corresponds to 51 points in 3D space (Figure 2). In total, it contains 20 characters from different ethnicities, 10 men and 10 women, with ages between 20 and 80 years. Each avatar includes 32 expressions grouped into the six universal expressions [23].
In our experiments we employed one expression from each of the six universal facial expressions and all the corresponding data from the avatars. In addition, to obtain the same landmarks from this database, we interpolated three landmarks and removed others that were not needed, as illustrated in Figure 3.

Feature Extraction
Emotions are expressed through movements in certain regions of the face, which can be parameterized based on muscle movements. Up to now, two major facial parameterization systems have been developed. The first is the Facial Action Coding System (FACS) developed by Ekman and Friesen and the second is a component of the MPEG-4 standard, the Face Animation Parameters (FAP) [25].
FACS is an anatomical system to determine facial behaviors through changes in the muscles. These changes are denominated Action Units (AUs). The system contains 44 AUs; nevertheless, the six basic facial expressions can be represented using 18 AUs only [26,27].
The second system is part of MPEG-4. This standard includes the Facial Definition Parameters (FDP), which specify the size and the shape of the face, and the Facial Animation Parameters (FAP), which represent every facial expression. FAP are expressed in terms of Facial Animation Parameter Units (FAPUs). FAPUs are used to keep proportionality and are defined as distances between some key points in a neutral state [28,29]. According to the authors of [28], the measurements (Figure 4) are defined as follows.
• IRISDO: the distance between the upper and the lower eyelids, i.e., the approximate iris diameter.
• ESO: the distance between the eye pupils, i.e., eye separation.
• ENSO: the distance between the center of ESO and below the nostrils, i.e., eye-nose separation.
Furthermore, the MPEG-4 system specifies the movements related to the six basic facial expressions; Table 3 explains these according to [29]. Based on the two previous systems (FACS and MPEG-4), we defined a set of features using 22 of the 24 landmarks included in the Bosphorus database to represent the six basic facial expressions. Figure 5 shows these landmarks. In total, 89 features were defined: 27 angles and 19 distances in 3D, and 20 angles and 23 distances in 2D. Table 4 presents these 89 features with an identifier assigned to each one. The Euclidean distance (Equation (1)) was employed to compute distances between landmarks. Then, all distances were normalized using the distance between landmarks 8 and 9 (see Figures 4 and 5).

The angles were calculated using the cosine of the angle between two vectors defined by three landmark points, as shown in Equation (2), where p_i are the X-Y-Z coordinates of the landmark for the 3D features and the X-Y coordinates of the landmark for the 2D features.
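As a concrete sketch of these two computations, the helpers below implement the normalized distance and the angle measurement, assuming Equation (1) is the standard Euclidean distance and Equation (2) the standard cosine rule for the angle at the middle point; the landmark coordinates passed in are hypothetical examples.

```python
import numpy as np

def normalized_distance(p1, p2, ref1, ref2):
    """Euclidean distance between two landmarks (Equation (1)),
    normalized by a reference distance (here, between landmarks 8 and 9)."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    ref = np.linalg.norm(np.asarray(ref1, dtype=float) - np.asarray(ref2, dtype=float))
    return d / ref

def angle_at(p1, p2, p3):
    """Angle in degrees at landmark p2, between vectors p2->p1 and p2->p3
    (Equation (2)); works for both 2D (X-Y) and 3D (X-Y-Z) coordinates."""
    v1 = np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float)
    v2 = np.asarray(p3, dtype=float) - np.asarray(p2, dtype=float)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```

For example, `angle_at((0, 0), (1, 0), (1, 1))` yields 90 degrees; clipping the cosine guards against floating-point values slightly outside [-1, 1].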

Feature Selection
An important component in data mining and machine learning is feature selection. It is a difficult task because many features are used to obtain information; however, not all of them are essential, and irrelevant features can reduce classification performance. The complexity of the problem depends on the number of features: the total number of possible solutions is 2^n for a dataset with n features. For this type of combinatorial optimization problem, evolutionary computation approaches are a good option [30].
Feature selection is a process where a subset of relevant features are selected from the original set, and it helps to choose the relevant features (removing irrelevant or redundant features) to understand the data, improve the predicted performance, and reduce computational requirements and the time of training [30,31].
A variety of techniques have been applied to feature selection, and they are grouped in different ways as described below. Dimensionality reduction algorithms based on projection (e.g., Fisher linear discriminant or principal component analysis) or compression (e.g., using information theory) modify the original representation of the features. In contrast, feature selection algorithms only select a subset of them. These algorithms can be classified into two groups: filter and wrapper methods. The first class uses general characteristics of the data to select a feature subset without using any learning algorithm. The second class employs a learning algorithm and uses its performance as the evaluation criterion. A newer class of feature selection algorithms integrates the theory of sparse representation [31].
In this research, two feature selection methods are evaluated: PCA, employing a traditional implementation, and a GA of our own design. The GA operates as a wrapper feature selection method using a support vector machine.

Principal Component Analysis (PCA) for Feature Selection
Principal Component Analysis (PCA) is a statistical technique widely used for dimensionality reduction. PCA converts the original set of variables into a smaller set of uncorrelated variables called principal components. These components provide the most representative information of the original set [32]. Algorithm 1 explains the steps performed in PCA to obtain the reduced feature set.

Genetic Algorithm (GA) for Feature Selection
Genetic Algorithms (GAs) were the first evolutionary computation techniques employed in feature selection. They have been applied in different areas, including image processing and, specifically, face recognition [30].
GAs are adaptive heuristic search and optimization methods based on evolutionary ideas from natural selection, such as selection mechanisms, recombination, and mutation [33]. In this research, a GA selected a subset of features.
In the algorithm, every individual in the population represents a possible subset of features. An individual is defined using a binary vector (0 = absence, 1 = presence) of m genes (Figure 6), where m is the number of features, in this case 89. The average accuracy using 10-fold cross-validation is the fitness function. The GA was implemented with the following main loop: select the best two out of five randomly chosen individuals; recombine each pair of parents with one-point crossover; apply simple mutation; evaluate each candidate solution; use elitism to keep the best individual; and update the population for the next generation.
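The loop above can be sketched as a wrapper around a cubic-kernel SVM. This is an illustrative sketch rather than the paper's implementation: the population size, mutation probability, and number of generations are placeholder values, and scikit-learn's default cubic polynomial kernel stands in for the Matlab SVM3.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Mean 10-fold CV accuracy of a cubic-kernel SVM on the selected features."""
    if not mask.any():
        return 0.0
    clf = SVC(kernel="poly", degree=3)
    return cross_val_score(clf, X[:, mask], y, cv=10).mean()

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.02):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5            # binary chromosomes
    fits = np.array([fitness(ind, X, y) for ind in pop])
    for _ in range(generations):
        children = []
        for _ in range(pop_size - 1):                # elitism keeps one slot
            # tournament: best two of five randomly chosen individuals
            idx = rng.choice(pop_size, 5, replace=False)
            pa, pb = idx[np.argsort(fits[idx])[-2:]]
            cut = rng.integers(1, n)                 # one-point crossover
            child = np.concatenate([pop[pa, :cut], pop[pb, cut:]])
            child ^= rng.random(n) < p_mut           # simple bit-flip mutation
            children.append(child)
        elite = pop[np.argmax(fits)]                 # elitism: carry best forward
        pop = np.vstack([elite] + children)
        fits = np.array([fitness(ind, X, y) for ind in pop])
    best = pop[np.argmax(fits)]
    return best, fits.max()
```

The returned boolean mask plays the role of the binary chromosome in Figure 6: `X[:, mask]` keeps only the selected columns of the feature matrix.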

Classification
In our research, four classifiers were employed to distinguish the facial expressions: SVM3, SVM2, k-NN fine, and E-KNN. The simplest case of SVM classification is a binary learner that finds the optimal hyperplane separating the points of the two classes. There are generalizations of SVM that use nonlinear functions to separate the two classes. In this case, it is common to use kernel functions such as a polynomial kernel, K(x, y) = (1 + x·y)^d, for some positive integer d, where x and y are feature vectors. In Matlab, SVM2 uses d = 2, whereas SVM3 employs d = 3. There are different approaches to build a multiclass classifier from binary classifiers when C > 2, where C is the number of classes. One such approach is "one-versus-one", where C(C − 1)/2 binary SVM classifiers are built for all possible pairs of classes; a new test point is assigned to the class that receives the most votes from the binary classifiers [34]. This is the approach used in Matlab for multiclass SVM [35].
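For reference, equivalent cubic- and quadratic-kernel classifiers can be instantiated outside Matlab, e.g., in scikit-learn, which also trains the one-versus-one scheme internally for multiclass problems. The gamma and coef0 values below are chosen to reproduce K(x, y) = (1 + x·y)^d and are an assumption about the Matlab defaults.

```python
from sklearn.svm import SVC

# Cubic (SVM3) and quadratic (SVM2) polynomial-kernel SVMs; gamma=1 and
# coef0=1 give K(x, y) = (1 + x·y)^d.  For C > 2 classes, SVC trains
# C(C-1)/2 one-versus-one binary classifiers internally and the
# predicted label is decided by voting.
svm3 = SVC(kernel="poly", degree=3, gamma=1, coef0=1)
svm2 = SVC(kernel="poly", degree=2, gamma=1, coef0=1)
```

Both objects expose the usual `fit`/`predict` interface, so they can be dropped into the cross-validation experiments unchanged.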
On the other hand, a k-Nearest Neighbor (k-NN) algorithm assigns a test pattern, x, to the most representative class among its k nearest neighbors [36]. In Matlab, k-NN fine corresponds to a classification with k = 1. Ensemble classifiers are methods that combine the output of multiple weak classifiers to produce a stronger classifier, for example, through a weighted combination [36]. The combination method to be used depends on whether the problem has two classes or more. Matlab provides the subspace method, which works for two or more classes and can use either discriminant analysis or k-NN as weak classifiers. In our experiments, we used ensemble classifiers with the subspace combination method and k-NN weak classifiers. The subspace algorithm is a random algorithm with the following parameters: m, the number of dimensions to sample in each weak classifier; d, the number of dimensions or features in the data; and n, the number of weak classifiers in the ensemble. In our experiments, the Matlab default values were used: n = 200 and m = round(d/2). According to the authors of [35], the random subspace algorithm performs the following steps: "(1) Choose without replacement a random set of m predictors from the d possible values. (2) Train a weak learner using just the m chosen predictors. (3) Repeat steps 1 and 2 until there are n weak learners. (4) Predict by taking an average of the score prediction of the weak learners, and classify the category with the highest average score".

Experiments and Results
For the experiments we used two databases: the Bosphorus database (Section 2.1.1) and the UIBVFED database (Section 2.1.2). The motivation for the experiments is to test the classification accuracy of the original feature set (see Section 2.2) and compare it with two selected feature sets. One reduced feature set is obtained using PCA (see Section 2.3.1). The second reduced feature set is obtained using our proposed GA (see Section 2.3.2). The following experiments were performed.

1. Assessment of the classification accuracy of the original feature set (see Section 2.2) using the Bosphorus database.
2. Selection of a reduced feature set using PCA and assessment of classification accuracy using the Bosphorus database.
3. Selection of a reduced feature set using our GA and assessment of classification accuracy using the Bosphorus database.
4. Assessment of the classification accuracy of the reduced feature set using GA on the UIBVFED database.
In all experiments with the Bosphorus dataset the classes were balanced to 104 instances using the SMOTE algorithm.
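SMOTE balances a minority class by interpolating between each chosen sample and one of its nearest same-class neighbors. A minimal sketch of this basic scheme follows; it illustrates the interpolation idea only and is not the exact implementation used in the experiments.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating each chosen sample with one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)          # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority sample
        j = rng.choice(neighbors[i])         # pick one of its k neighbors
        t = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Balancing a class with, say, 10 instances up to 104 would call `smote(X_min, 94)` and append the synthetic rows to the training set.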

Original Feature Set and Performance Evaluation
For the first experiment, 89 features were processed: 27 angles and 19 distances in 3D, and 20 angles and 23 distances in 2D. These features were extracted according to the experimental setting explained in Section 2.2.
To evaluate the accuracy and find the best classifier for our data, four classifiers (E-KNN, k-NN, and SVM with cubic and quadratic kernels) were employed. These experiments were tested using the 10-fold cross-validation technique. Table 5 shows the results. In general, all methods are above 80% accuracy, but it is important to note that SVM3 reported the highest performance in median, mean, maximum, and minimum. On the other hand, the lowest standard deviation was obtained by SVM2, although all classifiers are under 0.6. Table 6 shows the percentage per emotion for the best classifier, SVM3. "Happiness" reported the highest accuracy and "fear" the lowest.
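The per-classifier statistics reported in Table 5 (median, mean, maximum, minimum, and standard deviation over the ten folds) can be computed with a helper like the following; the classifier settings are illustrative stand-ins for the Matlab models, not their exact configurations.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def cv_summary(clf, X, y, folds=10):
    """Median, mean, max, min, and standard deviation of per-fold accuracy."""
    scores = cross_val_score(clf, X, y, cv=folds)
    return {"median": float(np.median(scores)), "mean": float(scores.mean()),
            "max": float(scores.max()), "min": float(scores.min()),
            "std": float(scores.std())}

# Illustrative stand-ins for the classifiers compared in Table 5.
classifiers = {
    "SVM3": SVC(kernel="poly", degree=3, gamma=1, coef0=1),
    "SVM2": SVC(kernel="poly", degree=2, gamma=1, coef0=1),
    "k-NN": KNeighborsClassifier(n_neighbors=1),  # Matlab's "fine" k-NN
}
```

Running `cv_summary` over each entry of `classifiers` reproduces one row of a Table 5-style comparison per model.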

Feature Selection Using PCA
From the previous experiments, it was noticed that SVM3 reported the highest accuracy on average. Therefore, subsequent experiments were performed using SVM3 as the only classifier.
For these experiments, PCA was used to reduce the feature set. It can be seen from Table 7 that our feature set was reduced from 89 features to 21, 27, and 39 through PCA by varying the percentage of retained variance. In this preliminary test, one execution of SVM3 was performed per reduced feature set. PCA with 99% of the variance obtained the highest accuracy (81.25%), employing 39 features. To evaluate the best PCA reduction, we employed 10-fold cross-validation with SVM3. Table 8 shows the results, and Table 9 shows the confusion matrix. It can be noticed that "happiness" reports the best accuracy and "fear" the lowest, followed by "disgust".
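Selecting the number of principal components by a retained-variance threshold, as in Table 7, can be sketched as follows. Note that PCA returns component scores, i.e., linear combinations of all the original features, rather than a subset of them.

```python
from sklearn.decomposition import PCA

def reduce_with_pca(X, variance=0.99):
    """Keep the fewest principal components whose cumulative explained
    variance reaches the given threshold (e.g., 0.99 retained 39
    components in the experiments above)."""
    pca = PCA(n_components=variance)  # a float in (0, 1) selects by variance
    X_reduced = pca.fit_transform(X)
    return X_reduced, pca.n_components_
```

Calling `reduce_with_pca(X, 0.95)` and `reduce_with_pca(X, 0.99)` on the same matrix shows how the component count grows with the variance threshold.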

Feature Selection Using GA
Similarly to the experiments with PCA, a GA was used to select a subset of features. The GA was executed 30 times to evaluate its performance, using SVM3 only. Table 10 shows the best and worst fits from these 30 executions. The best fit achieved 89.58% accuracy on average using 47 features, whereas the worst fit obtained 87.82% accuracy on average using 46 features. In terms of the number of features, the best fit has only one more feature than the worst fit. Nevertheless, the best fit uses more 3D features than the worst fit: 26 versus 21. Focusing on the best fit, it can be seen from Figure 7 that after the 167th generation, the fitness function reached 89.58% accuracy on average. The reduced feature set obtained with the GA in the best fit is composed of 47 features: 14 angles and 12 distances in 3D and 8 angles and 13 distances in 2D (Table 11). As in the previous experiment, we tested the feature set using 10-fold cross-validation and SVM3. The performance measures are reported in Table 12.
The confusion matrix (Table 13) shows the accuracy per emotion and we noticed that "happiness" has the best accuracy and "fear" the lowest accuracy.

Evaluation on UIBVFED Dataset
To evaluate the performance of the reduced feature set obtained with the GA, we trained and tested on the UIBVFED database. For this experiment, 10-fold cross-validation and SVM3 were employed.
The results obtained on this database (Table 14) were better than those obtained with the Bosphorus database.
Regarding the accuracy per facial expression, Table 15 presents the confusion matrix. It can be noticed that the "happiness", "fear", and "anger" expressions obtained perfect accuracy (100%), and "sadness" and "surprise" the lowest. Table 16 shows the accuracy for every feature set. The best result is obtained with the GA on the UIBVFED dataset, with the highest accuracy and the lowest number of features, followed by the experiment under the same conditions with the Bosphorus database. Moreover, from the previous experiments and the confusion matrices, we noticed that with the Bosphorus database the best performance per emotion in all experiments is obtained for "happiness", with "sadness" in second place, while the lowest accuracy is obtained by "fear". In contrast, for the UIBVFED database, "fear" achieved perfect accuracy.

Number of Features
Regarding the number of features in each feature set, it can be concluded that PCA reduced the original feature set by 56.2% (i.e., from 89 to 39 features). However, the mean accuracy achieved with SVM3 decreased by 3.91 percentage points (from 85.11% with the original feature set to 81.20% with the PCA-reduced set). On the other hand, the GA reduced the original feature set by 47.2% (from 89 to 47 features) and increased the mean accuracy achieved with SVM3 by 1.51 percentage points (from 85.11% with the original feature set to 86.62% with the GA-reduced set).
Specifically comparing the original feature set with the reduced feature set obtained via the GA, it can be seen from Table 17 that the highest reductions were in 2D angles (60%) and 3D angles (48.15%), for an overall reduction of 47.20%. An advantage of using fewer features is a decrease in execution time, as shown in the last column of Table 17. The classification time of a new instance using the features obtained with the GA (47 features) is approximately 30% less than using the original feature set (89 features). Execution time was measured on an Intel i7-4700MQ processor.
Moreover, it can be noticed from Figures 8-11 that the GA discarded redundant features and kept only those necessary for facial expression recognition, taking advantage of the symmetry of most human faces.

Comparison of Our Results with Previous Studies
As explained earlier, this research used instances of faces from the Bosphorus database so that our results could be compared with previous studies.

Comparison on Bosphorus Dataset with Handcrafted Features
The first three works were selected because they use 3D information and the same database to extract the features. Our highest median accuracy case obtained better average performance than Salahshoor and Faez [9] and Ujir [10]. Moreover, our proposal employs fewer features than theirs. Conversely, our proposal, in its highest median accuracy case, is not as accurate as Zhang et al. [11]. However, it is important to remark that the conditions are not the same: Zhang et al. used a dynamic approach. Compared to this last study, our proposal provides better classification for the happiness, fear, disgust, and sadness emotions, and uses 47 features instead of the 64 features employed on average by Zhang et al. (Table 18).

Comparison on Bosphorus Dataset with Deep Features
As mentioned before, our work focuses on the selection of features. For this reason, we also compare with deep learning (DL) approaches, where the algorithms learn the features. We only compare with DL works that use the Bosphorus dataset. The authors of [37,38] use the same database and type of data. Table 19 shows the comparison. In terms of accuracy, our method achieves better performance, and the meaning of our features is clear. The other datasets used in this study (Table 20) were chosen because they focus on selecting relevant features. For example, Gaulart et al. [39] used PCA for dimensionality reduction and Fast Neighbor Component Analysis (FNCA) for feature selection; in our case, the GA performed both tasks. In terms of average accuracy compared with the above study, we achieved greater accuracy (93.92% versus 89.98%) and just 4.55% less than the best accuracy reported by Oh and Kim [40]. However, our method used fewer features, which is an advantage not only for training the model (lower time and computational cost) but also for real-time applications.

Conclusions and Future Work
In this paper, we proposed a cognitive system for emotion recognition whose main difference lies in the selection of features. From this selection, we obtained a novel, reduced set of geometric features in 2D and 3D to classify the six basic facial expressions: happiness, sadness, surprise, fear, anger, and disgust. The original feature set was reduced using PCA and a GA. Furthermore, four classifiers (cubic and quadratic SVM, k-NN, and E-KNN) were applied to the original feature set. The best performance was obtained through the GA using SVM3 with 10-fold cross-validation, which reduced the number of features by 47.20% (from 89 to 47) and obtained 93.92% and 95.83% as the highest mean and maximum accuracy, respectively, when tested on the UIBVFED dataset. It is important to remark that the GA was able to discard redundant features and detect relevant features based on the symmetry present in most human faces. Finally, the key contribution of this research is that emotions can be recognized with 93.92% accuracy using only 47 features. This proposal might be employed to improve the computational cognitive mechanisms used to infer emotions in a static image; a computer could then adapt its interaction with people based on the detected emotion. As future work, we will investigate the value of the proposed descriptors on more datasets and possibly explore micro-expressions.

Funding: Vianney Perez-Gomez received a scholarship from CONACYT for her graduate studies.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript.