A Comparative Study of Local Descriptors and Classifiers for Facial Expression Recognition

Antoine Badi Mame; Jules-Raymond Tapamo

doi:10.3390/app122312156

Discipline of Electrical, Electronic and Computer Engineering, University of KwaZulu-Natal, Durban 4041, South Africa

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(23), 12156; https://doi.org/10.3390/app122312156

This article belongs to the Special Issue Research on Facial Expression Recognition

Abstract

Facial Expression Recognition (FER) is a growing area of research due to its numerous applications in market research, video gaming, healthcare, security, e-learning, and robotics. One of the most common frameworks for recognizing facial expressions is to extract facial features from an image and classify them as one of several prototypic expressions. Despite recent advances, developing robust facial expression descriptors remains a challenging task. This study aimed to analyze the performances of various local descriptors and classifiers in the FER problem. Several experiments were conducted under different settings, such as varied extraction parameters, different numbers of expressions, and two datasets, to discover the best combinations of local descriptors and classifiers. Of all the considered descriptors, HOG (Histogram of Oriented Gradients) and ALDP (Angled Local Directional Patterns) were among the most promising, while SVM (Support Vector Machines) and MLP (Multi-Layer Perceptron) were the best among the considered classifiers. The results obtained suggest that conventional FER approaches are still comparable to state-of-the-art methods based on deep learning.

Keywords: facial expression recognition; feature extraction; local descriptors; classification

1. Introduction

Facial Expression Recognition (FER) has a number of different applications, including service robots [1], driver fatigue monitoring [2], mood prediction [3], sentiment analysis in customer reviews [4], and many others. For humans, recognizing and interpreting facial expressions comes naturally and is a basic form of communication. In some cases, facial expressions are the only means of communication, such as in newborns and unconscious patients in the ICU [5]. However, facial expression recognition is still challenging for computer systems due to factors such as camera view angles, occlusions, image noise, and changes in scene illumination. All these issues must be accounted for when developing a robust system for a given application. In the past decade, much work has been done to recognize a group of expressions, which are often related to prototypical facial expressions associated with basic emotions [6]. A FER system usually operates in three stages: in the first stage, a face region is located, then facial features are extracted in the second stage, and finally, an expression is classified in the third stage.

FER systems can be categorized as either frame-based or sequence-based, depending on whether a static image or a video sequence is used to extract facial features for classification [7]. In frame-based approaches, only spatial information, such as the appearance and geometry of the facial image, can be used to describe the expression. In contrast, the sequence-based approach can extract spatial and temporal features to describe the evolution of expressions during a video sequence. Frame-based approaches are preferred due to their simplicity and because they make no assumption about how the expression evolves (an assumption made by many sequence-based approaches). Therefore, much attention has been given to developing FER systems based on static images. In this category, we can identify two main trends in the development of FER systems. On the one hand, there are FER systems based on the extraction and classification of handcrafted features, and on the other, there are approaches based on deep learning. Handcrafted techniques rely on domain-specific knowledge about the human face, such as facial muscle deformations and the movement of facial components, such as raising the eyes and the opening/closing of the mouth during emotion elicitation.

Within the handcrafted techniques, the geometrical features were developed to describe the shape and location of facial components [8]. Some work was also undertaken to reconstruct Action Units and interpret their combinations as facial expressions [9]. Geometric-based FER systems were difficult to develop because they required the accurate labeling of Action Units [6] and landmarks, which is time-consuming and tedious. During this time, appearance-based FER systems were being developed, giving rise to local descriptors for facial expression representation. Local descriptors were used to extract texture information such as edges, corners, and spots that make up facial expressions and achieved the same or better results while being less tedious to develop than the geometric features.

Recently, deep neural networks have been studied for facial expression recognition [10,11]. Unlike traditional techniques, deep learning includes little knowledge about facial structure and appearance. Deep learning systems automatically detect and extract facial features based on a series of layers, such as convolution, pooling, and fully-connected layers. This field is attracting more and more attention due to the improved recognition capacity offered by deep neural networks for static and sequence-based recognition. Convolutional Neural Networks (CNNs) are the basic model for extracting discriminative spatial features from face images, while Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are suitable for characterizing the spatio-temporal features of sequences [12].

In the literature, there have been a few attempts to compare the different methods based on local descriptors and traditional classifiers for facial expression recognition. In 2018, Turan and Lam [13] studied 27 local descriptors under different conditions, such as varying image resolutions and numbers of sub-regions. Two classifiers were used to recognize different numbers of expressions on four facial expression databases. Their results were comparable to the state-of-the-art deep learning approaches on well-known datasets. Slimani et al. [14] studied the independent performance of 46 LBP variants for facial expression recognition. A single classification technique (i.e., SVM) was used to classify seven expressions. The study showed that several local descriptors, initially proposed for other classification problems, outperformed the state of the art on four databases.

This study investigates the performances of six local descriptors and four machine-learning techniques for automatic facial expression recognition. The novelty of this approach is in using a variable sub-region size and a variable number of histogram bins to optimize the performance of histogram-based local descriptors. Face registration and feature vector normalization also contributed significantly to the results obtained in this work compared to previous works. Many experiments were conducted under different settings, such as varied extraction parameters, different numbers of expressions, and two datasets, to discover the best combinations of local descriptors and classifiers for facial expression recognition. The rest of this paper is organized as follows: Section 2 reviews local descriptors and classifiers used in facial expression recognition, Section 3 presents the proposed approach, Section 4 describes the datasets, Section 5 presents the experimental results and discussion, and Section 6 concludes the paper.

2. Related Works

The following lines describe the various local descriptors and classifiers used in our experiments in more detail.

2.1. Local Descriptors

2.1.1. Local Binary Patterns

The Local Binary Patterns (LBP) descriptor describes the spatial structure of a local patch containing a center pixel $x_{c}$ surrounded by $p$ equally-spaced neighbors [15]. Given a texture image, $I$, the LBP patterns of each pixel can be computed by assigning a binary code to the center pixel, $x_{c}$, of a $3 \times 3$ patch with eight neighbors and computing an LBP histogram used to characterize the image. In [16], a complete version of LBP was developed to allow for a variable number of neighbors $p$ located on a circular radius $r$. The LBP pattern is defined as

LBP_{r, p} = \sum_{j = 0}^{p - 1} S(x_{j} - x_{c}) \, 2^{j}

(1)

where $S(A)$ is defined as

S(A) = \begin{cases} 1 & \text{if } A \geq 0 \\ 0 & \text{otherwise} \end{cases}

(2)
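As an illustration of Equations (1) and (2), the following minimal sketch computes the LBP code of a $3 \times 3$ patch ($r = 1$, $p = 8$) and the LBP histogram of a small image. The neighbor ordering (clockwise from the top-left pixel) and the function names are assumptions made for this example; other orderings only permute the bits of the code.

```python
import numpy as np

def lbp_code(patch):
    """Return the LBP code of the center pixel of a 3x3 patch (r = 1, p = 8)."""
    patch = np.asarray(patch, dtype=np.int64)  # avoid unsigned overflow on subtraction
    center = patch[1, 1]
    # Assumed neighbor order: clockwise, starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for j, (row, col) in enumerate(offsets):
        if patch[row, col] - center >= 0:  # S(x_j - x_c) from Equation (2)
            code += 2 ** j                 # weight 2^j from Equation (1)
    return code

def lbp_histogram(image):
    """Histogram of the LBP codes of all interior pixels of a gray-scale image."""
    image = np.asarray(image)
    hist = np.zeros(256, dtype=np.int64)
    for row in range(1, image.shape[0] - 1):
        for col in range(1, image.shape[1] - 1):
            hist[lbp_code(image[row - 1:row + 2, col - 1:col + 2])] += 1
    return hist
```

For instance, a uniform patch yields the code 255 because every difference $x_{j} - x_{c}$ is zero and $S(0) = 1$.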

Since its introduction as a texture-based descriptor in 2002, LBP has undergone several improvements to produce a more robust descriptor. For example, Shan et al. [17] improved LBP by dividing the face image into sub-regions of various sizes and positions, and the AdaBoost algorithm was used to learn the most discriminative LBP histograms. The so-called Boosted-LBP features were then classified as expressions using machine learning techniques such as template matching, Support Vector Machines, Linear Discriminant Analysis, and linear programming. LBP-TOP extends LBP to encode both temporal and spatial changes [18]. This method extracts LBP patterns from three orthogonal planes: the spatial plane, the vertical spatio-temporal plane, and the horizontal spatio-temporal plane. In [19], a kernel-based manifold learning method called kernel discriminant isometric mapping is proposed to reduce the dimension of LBP features. The features are then classified using the nearest neighbor classifier. Recently, Guo et al. [20] proposed the Extended Local Binary Patterns on Three Orthogonal Planes (ELBPTOP) for spontaneous micro-expression recognition, and in [21], a smaller LBP feature vector that is also resistant to noise is presented. The latter descriptor considers the four neighbors and the diagonal neighbors separately and uses an adaptive window and averaging in radial directions to improve the feature extraction.

2.1.2. Compound Local Binary Patterns

Compound Local Binary Patterns (CLBP) extend the LBP descriptor by encoding both the magnitude and the sign of the differences between a center (or threshold) pixel and its $P$ neighbors [22]. Unlike LBP, which replaces each gray value with a $P$-bit code, CLBP assigns a $2P$-bit code. If $i_{c}$ is the gray value of the center pixel, $i_{p}$ is the gray value of the $p$-th neighbor, and $M_{avg}$ is the average magnitude of the difference between $i_{p}$ and $i_{c}$ in a local neighborhood, the CLBP code is defined as

f_{c}(x, y) = \sum_{p = 0}^{P - 1} s(i_{p}, i_{c}) \, 2^{2p}

(3)

where $s(i_{p}, i_{c})$ is defined as

s(i_{p}, i_{c}) = \begin{cases} 00 & \text{if } i_{p} - i_{c} < 0 \text{ and } |i_{p} - i_{c}| \leq M_{avg} \\ 01 & \text{if } i_{p} - i_{c} < 0 \text{ and } |i_{p} - i_{c}| > M_{avg} \\ 10 & \text{if } i_{p} - i_{c} \geq 0 \text{ and } |i_{p} - i_{c}| \leq M_{avg} \\ 11 & \text{otherwise} \end{cases}

The code is then split into two sub-CLBP patterns by concatenating the bit values at positions $(1, 2, 5, 6, \dots, 2P - 3, 2P - 2)$ and $(3, 4, 7, 8, \dots, 2P - 1, 2P)$, respectively, of the original CLBP code. The two sub-CLBP codes are treated as separate $P$-bit codes, and a histogram is computed for each. Finally, the two histograms are combined to form a feature vector.
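The encoding and the split into two sub-patterns can be sketched as follows for a $3 \times 3$ patch ($P = 8$). This is only an illustration under stated assumptions: the neighbor order (clockwise from the top-left pixel), the convention that bit position 1 is the least significant bit, and the name `clbp_codes` are not specified in the text.

```python
import numpy as np

# Assumed neighbor order: clockwise, starting at the top-left pixel.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def clbp_codes(patch):
    """Return (full 2P-bit code, first sub-code, second sub-code) for a 3x3 patch."""
    patch = np.asarray(patch, dtype=np.float64)
    diffs = [patch[r, c] - patch[1, 1] for r, c in RING]
    m_avg = sum(abs(d) for d in diffs) / len(diffs)  # average magnitude M_avg
    pairs = []
    for d in diffs:
        sign_bit = 1 if d >= 0 else 0
        magnitude_bit = 1 if abs(d) > m_avg else 0
        pairs.append(2 * sign_bit + magnitude_bit)   # 2-bit value s(i_p, i_c)
    # Equation (3): each neighbor contributes its 2-bit value at weight 2^(2p).
    code = sum(pair << (2 * p) for p, pair in enumerate(pairs))
    # Bits (1, 2, 5, 6, ...) belong to neighbors 0, 2, 4, 6 and
    # bits (3, 4, 7, 8, ...) to neighbors 1, 3, 5, 7 (assumed LSB-first numbering).
    sub_1 = sum(pair << (2 * i) for i, pair in enumerate(pairs[0::2]))
    sub_2 = sum(pair << (2 * i) for i, pair in enumerate(pairs[1::2]))
    return code, sub_1, sub_2
```

Each sub-code is an 8-bit value, so each of the two histograms has at most 256 bins before any re-binning.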

2.1.3. Local Directional Patterns

Local Directional Patterns (LDP) is based on the edge responses in a local neighborhood [23]. When applied to a texture image, the descriptor computes eight directional edge responses at each pixel and encodes the responses as an 8-bit binary code using the relative strengths of the edge responses. Specifically, the edge responses are computed using Kirsch edge masks (see Figure 1). To form a binary code, the bits of the $k$ most significant responses are set to 1 while the remaining $8 - k$ bits are set to 0. In summary, the LDP code is defined as

LDP_{k} = \sum_{j = 0}^{7} B(m_{j} - m_{k}) \, 2^{j}

(4)

where $m_{j}$ is the $j$-th directional response, $m_{k}$ is the $k$-th most significant directional response, and $B(a)$ is defined as

B(a) = \begin{cases} 1 & \text{if } a \geq 0 \\ 0 & \text{otherwise} \end{cases}

(5)

Figure 1. Kirsch edge masks.
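A minimal sketch of the LDP code of Equation (4) is given below. It generates the eight standard Kirsch masks (assumed to match Figure 1, starting with the East mask and rotating in 45° steps) and ranks the responses by absolute value, which is an assumption since the text only refers to the "most significant" responses. The function names are illustrative.

```python
import numpy as np

# Border positions of a 3x3 mask, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def kirsch_masks():
    """Return the eight Kirsch masks: East first, then rotated in 45-degree steps."""
    masks = []
    for i in range(8):
        mask = np.full((3, 3), -3, dtype=np.int64)
        mask[1, 1] = 0
        for t in range(3):  # three adjacent border cells hold the value 5
            mask[RING[(2 - i + t) % 8]] = 5
        masks.append(mask)
    return masks

def ldp_code(patch, k=3):
    """LDP code of the center pixel of a 3x3 patch, following Equation (4)."""
    patch = np.asarray(patch, dtype=np.int64)
    # Assumption: significance is measured by the absolute edge response.
    responses = np.array([abs(int(np.sum(mask * patch))) for mask in kirsch_masks()])
    m_k = np.sort(responses)[-k]  # k-th most significant response
    code = 0
    for j, m_j in enumerate(responses):
        if m_j - m_k >= 0:        # B(m_j - m_k) from Equation (5)
            code += 2 ** j
    return int(code)
```

Note that because $B(a) = 1$ when $a = 0$, responses tied with the $k$-th strongest one also set their bits, so a patch with ties can produce more than $k$ ones.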

2.1.4. Angled Local Directional Patterns

Angled Local Directional Patterns (ALDP) were introduced by Shabat and Tapamo [24] to improve upon the LDP descriptor by addressing two of its drawbacks: (1) the static choice of the most significant bits and (2) the fact that the value of the center pixel is ignored. To address these issues, the ALDP descriptor is extracted by first generating the Kirsch mask responses $(m_{0}, \dots, m_{7})$, just like LDP, and then computing the angular vector components $(p_{0}, \dots, p_{7})$ along four angles (0°, 45°, 90°, and 135°). One of the drawbacks of ALDP is the large size of the feature vector. Considering all eight angular vector components, the basic histogram size is 256, unlike LDP, which has only 56 possible codes. More details about the ALDP descriptor can be found in [24].

2.1.5. Weber’s Local Descriptor

Weber’s Local Descriptor (WLD) was inspired by Weber’s law, which states that the human perception of a pattern depends not only on the change in stimulus (such as illumination variation) but also on the original intensity of the stimulus [25]. WLD consists of a differential excitation ($ξ$) and a gradient orientation ($θ$). A feature vector is obtained by rearranging the differential excitations several times into subgroups and then creating histograms of those subgroups. Finally, the histograms are reordered and concatenated into a single feature vector for classification. Given the texture image $G$, a gray-level $x_{c}$, and $p$ neighbors, the differential excitation and gradient orientation components of WLD are defined by Equations (6) and (7). More details about the implementation can be found in [25].

ξ(x_{c}) = \tan^{-1} \left( \frac{V_{s}^{00}}{V_{s}^{01}} \right) = \tan^{-1} \left[ \sum_{i = 0}^{p - 1} \left( \frac{x_{i} - x_{c}}{x_{c}} \right) \right]

(6)

The second component of WLD is the gradient orientation $θ(x_{c})$, which is defined as

θ(x_{c}) = \tan^{-1} \left[ \frac{V_{s}^{11}}{V_{s}^{10}} \right] = \tan^{-1} \left[ \frac{x_{5} - x_{1}}{x_{7} - x_{3}} \right]

(7)
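The two WLD components can be sketched for a single $3 \times 3$ patch as follows. The neighbor indexing ($x_{0}$ at the top-left corner, proceeding clockwise) is an assumption, as is the handling of a zero denominator in Equation (7), where the limiting value $\pm π / 2$ is used. The sketch also assumes a non-zero center intensity $x_{c}$.

```python
import numpy as np

# Assumed neighbor order x_0 ... x_7: clockwise, starting at the top-left pixel.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def wld_components(patch):
    """Return (differential excitation, gradient orientation) of a 3x3 patch."""
    patch = np.asarray(patch, dtype=np.float64)
    center = patch[1, 1]  # assumed to be non-zero
    neighbors = np.array([patch[r, c] for r, c in RING])
    # Equation (6): arctangent of the summed relative intensity differences.
    excitation = np.arctan(np.sum((neighbors - center) / center))
    # Equation (7): ratio of the two neighbor differences.
    v11 = neighbors[5] - neighbors[1]
    v10 = neighbors[7] - neighbors[3]
    if v10 == 0:
        orientation = np.sign(v11) * np.pi / 2  # limiting value, an assumption
    else:
        orientation = np.arctan(v11 / v10)
    return float(excitation), float(orientation)
```

A uniform patch gives zero excitation and zero orientation, reflecting the absence of any local intensity change.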

2.1.6. Histogram of Oriented Gradients

Dalal and Triggs [26] originally proposed the Histogram of Oriented Gradients (HOG) descriptor to tackle object detection. The HOG descriptor counts the occurrences of gradient orientations in a local sub-region of an image. It is applied by computing the gradient directions over the pixels of small sub-regions called cells. Subsequently, the histogram of the gradient directions is used as a feature. HOG is especially powerful because it applies block normalization schemes (e.g., L2-norm and L2-Hys) to the histogram. In block normalization, a window is moved over the input image, and the HOGs of the cells within the window are normalized as a group. The concatenation of normalized histograms is treated as the feature vector. Given a texture image $L$ and a sub-region size of $N \times N$ pixels, the gradient orientation $θ_{x, y}$ of a pixel located at $(x, y)$ can be computed by

θ_{x, y} = \tan^{-1} \left( \frac{L(x, y + 1) - L(x, y - 1)}{L(x + 1, y) - L(x - 1, y)} \right)

(8)
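The following sketch computes an orientation histogram for one cell using the central differences of Equation (8), followed by a plain L2 normalization. Several choices here are assumptions rather than details from the text: unsigned orientations in [0°, 180°), nine bins, magnitude-weighted votes (as in Dalal and Triggs' formulation), and the function names.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations in a cell."""
    cell = np.asarray(cell, dtype=np.float64)  # indexed as cell[y, x]
    # Central differences on interior pixels, as in Equation (8).
    gy = cell[2:, 1:-1] - cell[:-2, 1:-1]  # L(x, y + 1) - L(x, y - 1)
    gx = cell[1:-1, 2:] - cell[1:-1, :-2]  # L(x + 1, y) - L(x - 1, y)
    magnitude = np.hypot(gx, gy)
    # arctan2 avoids division by zero; orientations are folded into [0, 180).
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    return hist

def l2_normalize(vector, eps=1e-6):
    """Plain L2 normalization of a histogram (or of a block of histograms)."""
    vector = np.asarray(vector, dtype=np.float64)
    return vector / np.sqrt(np.sum(vector ** 2) + eps ** 2)
```

For a horizontal intensity ramp all votes fall into the first bin (0°), while a vertical ramp puts them into the bin containing 90°.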

2.2. Classifiers

2.2.1. Support Vector Machines

Support Vector Machines (SVM) have become the standard for many FER approaches due to advantages such as fast training and no direct probability estimation [8,27,28]. SVM is a machine learning algorithm that classifies input features by defining a separating hyperplane. Given $l$ training vectors $x_{1}, x_{2}, \dots, x_{l}$ and the corresponding labels $y_{i} \in \{-1, 1\}$, the primal optimization problem is defined as

\min_{w, b, ξ} \left\{ \frac{1}{2} w^{⊤} w + C \sum_{i = 1}^{l} ξ_{i} \right\}

(9)

\text{subject to } y_{i} (w^{⊤} ϕ(x_{i}) + b) \geq 1 - ξ_{i}, \quad ξ_{i} \geq 0, \quad i = 1, \dots, l

where $ξ_{i}$ is the slack variable representing the misclassification error for the $i$-th training vector; $ξ$ collects these slack variables; $w$ is the normal vector to the hyperplane; $b / ∥w∥$ represents the offset of the hyperplane from the origin along the normal vector $w$ ($∥ \cdot ∥$ being the norm operator); $ϕ(x_{i})$ is the feature map associated with the kernel function, mapping $x_{i}$ into a higher-dimensional space; and $C$ is the regularization parameter. Scikit-learn’s SVM implementation [29], which is based on the LIBSVM library, is used in our experiments.
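To make the role of the slack variables concrete, the sketch below evaluates the objective of Equation (9) for a given hyperplane. It assumes a linear kernel, i.e., $ϕ(x) = x$, and uses the standard soft-margin fact that the smallest feasible slack is $ξ_{i} = \max(0, 1 - y_{i}(w^{⊤} x_{i} + b))$. The function name is illustrative, and this is not a solver.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Evaluate the primal SVM objective for a linear kernel (phi(x) = x)."""
    w = np.asarray(w, dtype=np.float64)
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    margins = y * (X @ w + b)
    # Smallest slack satisfying y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0.
    slack = np.maximum(0.0, 1.0 - margins)
    objective = 0.5 * float(w @ w) + C * float(np.sum(slack))
    return objective, slack
```

A sample with zero slack lies on or outside the margin, a slack between 0 and 1 indicates a correctly classified sample inside the margin, and a slack above 1 indicates a misclassified sample. A larger $C$ penalizes these violations more heavily.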

2.2.2. Naïve Bayes Classifier

The Naïve Bayes (NB) classifier assigns a label $y \in \{y_{1}, y_{2}, \dots, y_{M}\}$ to a feature vector $X = (x_{1}, x_{2}, \dots, x_{N})$ based on the maximum likelihood framework defined by

\hat{y} = \underset{y}{\arg\max} \, P(X | y)

(10)

By assuming that the features in $X$ are mutually independent given a class label $y$, Equation (10) reduces to:

\hat{y} = \underset{y}{\arg\max} \prod_{i = 1}^{N} P(x_{i} | y)

(11)

The different Naïve Bayes classifiers differ in how they model the probability distribution of the features given a class label, $P(x_{i} | y)$. The Gaussian Naïve Bayes classifier is the most common and assumes a Gaussian distribution as defined in Equation (12). The Cauchy distribution has also been investigated for FER [30].

P(x_{i} | y) = \frac{1}{\sqrt{2 π σ_{y}^{2}}} \exp \left( - \frac{(x_{i} - μ_{y})^{2}}{2 σ_{y}^{2}} \right)

(12)

The NB classifier is a standard due to its simplicity and good performance in many classification problems. In this study, scikit-learn’s [29] implementation of the Gaussian Naïve Bayes algorithm was used.
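The decision rule of Equations (10) to (12) can be sketched in a few lines of NumPy. This is not scikit-learn's implementation: it follows the maximum likelihood rule as written (no class priors), estimates a separate mean and variance for each feature and class (an assumption about the notation $μ_{y}$, $σ_{y}$), and works with log-likelihoods to avoid numerical underflow of the product in Equation (11). The function names are illustrative.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class, per-feature means and variances."""
    X, y = np.asarray(X, dtype=np.float64), np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small constant keeps the variances strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, means, variances

def predict_gaussian_nb(X, classes, means, variances):
    """Pick the class maximizing the sum of log P(x_i | y) over all features."""
    X = np.asarray(X, dtype=np.float64)
    diff = X[:, None, :] - means[None, :, :]  # shape: (samples, classes, features)
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                             + diff ** 2 / variances[None, :, :]).sum(axis=2)
    return classes[np.argmax(log_likelihood, axis=1)]
```

Taking the logarithm turns the product over features into a sum, which leaves the arg max unchanged.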

2.2.3. K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a non-parametric method that classifies a new sample based on a majority vote system: a new data point is assigned to the class with the most representatives within the k nearest neighbors of that point. The neighbors are taken from the training set consisting of sample points for which the class is known.

Despite being the simplest machine learning algorithm, KNN has achieved encouraging results in the FER problem. Sohail and Bhattacharya [31] developed an approach based on eleven feature points representing the principal muscle actions and the KNN classifier to recognize six basic facial expressions, achieving an average accuracy of 90.76%. Panchal and Pushpalatha [32] proposed a new feature descriptor based on LBP and Asymmetric Region LBP and classification was done using KNN. When classifying seven expressions, an overall recognition accuracy of 95.1% was achieved on the JAFFE database.

It should be noted that the KNN algorithm does not construct a general internal model but simply stores training data instances. Hence, no explicit training is required, and the training cost is essentially zero. However, finding the k nearest neighbors becomes computationally expensive when classifying high-dimensional features. In this study, scikit-learn’s implementation of the KNN classifier was used [29].
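The majority vote can be illustrated with a brute-force sketch. The Euclidean distance and the function name are assumptions made for this example; all the work happens at prediction time because the stored training data is the model.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=np.float64)
    x_new = np.asarray(x_new, dtype=np.float64)
    # Assumption: Euclidean distance between feature vectors.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Each prediction computes the distance to every training sample, which illustrates why the cost grows with both the number of samples and the feature dimensionality.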

2.2.4. Multi-Layer Perceptron

Multi-Layer Perceptron (MLP) is the traditional basic neural network that learns a non-linear model to map training features $X = (x_{1}, x_{2}, \dots, x_{m})$ to output targets $y$ [29]. The model comprises an input layer, one or more hidden layers, and an output layer. Each layer is made up of neurons that transform the values of the previous layer with a weighted linear summation $w_{1} x_{1} + w_{2} x_{2} + \dots + w_{m} x_{m}$ followed by a non-linear activation function $g(\cdot) : R \to R$.
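The layer-by-layer transformation can be sketched as a forward pass. The ReLU activation, the bias terms, and the linear output layer are assumptions made for this example; the text only specifies a weighted summation followed by a non-linear function $g$.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass: weighted sum plus bias, then ReLU on all hidden layers."""
    activation = np.asarray(x, dtype=np.float64)
    for layer, (W, b) in enumerate(zip(weights, biases)):
        z = np.asarray(W, dtype=np.float64) @ activation + np.asarray(b, dtype=np.float64)
        is_output_layer = layer == len(weights) - 1
        # Assumption: g is ReLU for hidden layers; the output layer stays linear.
        activation = z if is_output_layer else np.maximum(0.0, z)
    return activation
```

Training consists of adjusting `weights` and `biases` so that the outputs match the targets, which this sketch does not cover.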

The MLP neural network has shown encouraging performances in recognizing facial expressions of emotion. Dino and Abdulrazzaq [33] proposed a system for classifying facial expressions using the Viola-Jones algorithm for face detection, the HOG descriptor for feature extraction, PCA for dimensionality reduction, and MLP for classification. They reported an average accuracy of 82.97% on the CK+ database when classifying eight basic expressions. Boughrara et al. [34] developed a constructive training algorithm for the MLP applied to FER. Unlike most traditional algorithms, which fix the structure of the neural network before training, the proposed constructive training algorithm learns the architecture of the network and performs the learning process simultaneously. The approach uses the Perceived Facial Images (PFI) in eight directions to extract features from a facial image, and the predicted expression is obtained by a fusion of the neural networks corresponding to the eight directions. The main drawback of MLP is that it is often time-consuming and tedious to determine a proper structure that can guarantee convergence and avoid over-fitting.

3. An Approach for Optimized Facial Expression Recognition

A simple and effective method for facial expression recognition based on local descriptors and conventional classifiers was defined by Slimani et al. [14]. The method involves dividing the face region into several non-overlapping regions and extracting local descriptors from each sub-region. The histogram of each sub-region is calculated separately and concatenated to form a single feature vector for classification. The current work extends Slimani et al.’s [14] approach by optimizing the feature extraction process. Figure 2 summarizes the proposed approach for optimized facial expression recognition, consisting of two phases: a training phase and a testing phase. In the training phase, images are pre-processed using face detection and registration. Next, a local descriptor is applied to the registered faces to extract facial features. Subsequently, feature vectors are generated by dividing a face region into several sub-regions and calculating the histograms of each sub-region. The feature vectors are normalized based on the type of local descriptor used (see Section 3.2). To optimize the process of feature vector calculation, the extraction parameters (i.e., the sub-region size and the number of histogram bins) are varied, producing several feature sets. Then, the best extraction parameter settings are found by comparing the classification performances produced by each feature set. Finally, the best feature set and corresponding expressions are used to train a machine-learning model.

Figure 2. The proposed facial expression recognition method.

In the testing phase, images are processed, and the feature vectors are generated based on the best extraction parameter values obtained from the training phase. After that, the feature vectors are classified as facial expressions using the trained classifier. Additionally, the classifier’s performance is evaluated by 10-fold cross-validation. The novelty of this approach is in using a variable sub-region size and a variable number of histogram bins to optimize the performance of each local descriptor.

3.1. Face Detection and Registration

Faces were detected using a frontal face detector implemented by the OpenCV library for the Python programming language (OpenCV is available at https://pypi.org/project/opencv-contrib-python/, accessed on 24 July 2021). The faces were then registered, giving them a predefined pose, shape, and size. First, an input image was converted to gray-scale, and then Contrast Limited Adaptive Histogram Equalization (CLAHE) was applied using Equation (13). Then, a face blob was obtained by performing a thresholding operation followed by filtration techniques. Next, an ellipse-fitting algorithm (also implemented by the Python OpenCV library) estimated the shape and angle of inclination of the face blob. The ellipse’s angle was used to rotate the face to a vertical position. Because this process is not always perfect, the eye locations were used to rotate the image a second time. Finally, an elliptical crop followed by a rectangular crop was employed to produce an image where the eyes are in a predefined position relative to the sides of the image. Figure 3 illustrates face detection and registration.

g_{i, j} = \left\lfloor (L - 1) \sum_{n = 0}^{I_{i, j}} p_{n} \right\rfloor

(13)

where $p_{n}$ is defined as

p_{n} = \frac{\sum_{i, j} A(I_{i, j} = n)}{M \times N}, \quad n = 0, 1, \dots, L - 1; \quad 0 \leq i < M; \quad 0 \leq j < N

and $A(x)$ is defined as

A(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{otherwise} \end{cases}
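Equation (13) can be sketched directly in NumPy as shown below. The sketch applies the mapping to the whole image, exactly as the equation is written; the tiling and contrast clipping that CLAHE adds (available in OpenCV through `cv2.createCLAHE`) are not reproduced here. The function name and the default of 256 gray levels are assumptions.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Map each gray level I[i, j] to floor((L - 1) * sum_{n <= I[i, j]} p_n)."""
    image = np.asarray(image, dtype=np.int64)
    # counts[n] is the number of pixels with value n, i.e., p_n * (M * N).
    counts = np.bincount(image.ravel(), minlength=levels)
    cumulative = np.cumsum(counts)
    # Integer floor division gives the floor in Equation (13) without rounding errors.
    mapping = ((levels - 1) * cumulative) // image.size
    return mapping[image]
```

For example, with four gray levels the image `[[0, 0], [1, 3]]` has cumulative probabilities 0.5, 0.75, 0.75, and 1.0, so the levels 0, 1, and 3 map to 1, 2, and 3, respectively.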

Figure 3. Face detection and registration.

3.2. Feature Extraction and Feature Vector Calculation

In this study, six local descriptors were investigated: Local Binary Patterns (LBP), Compound Local Binary Patterns (CLBP), Local Directional Patterns (LDP), Angled Local Directional Patterns (ALDP), Weber’s Local Descriptor (WLD), and Histogram of Oriented Gradients (HOG). Feature extraction and feature vector calculation are key stages of this FER method. Although the two processes were portrayed separately in the preamble of this section, they are closely interlinked. The method used in this study divides the face image into several equally sized non-overlapping sub-regions, and a local descriptor is applied to each sub-region, producing several histograms. The sub-region histograms are calculated separately and normalized such that the frequencies sum to one (this rule is applied to all the considered descriptors except the HOG descriptor, where the blocks are normalized in groups using L2-Hys [26]). The normalized histograms are then concatenated to form the final 1D feature vector. Suppose $I_{j}$ is a $W_{S} \times W_{S}$ matrix of intensities representing the $j$-th sub-region in the image, and the function $D(x, y)$ computes the descriptor at a position $(x, y)$ in $I_{j}$. Then the histogram associated with this sub-region is defined as

H_{j} = \{h_{i}\}_{i = 0, 1, \dots, N_{b} - 1}

(14)

h_{i} = \sum_{x = 0}^{W_{S} - 1} \sum_{y = 0}^{W_{S} - 1} δ(D(x, y), i)

δ(x, i) = \begin{cases} 1, & \text{if } L(i) \leq x < U(i) \\ 0, & \text{otherwise} \end{cases}, \quad L(i) = i \times \frac{n}{N_{b}}, \quad U(i) = (i + 1) \times \frac{n}{N_{b}},

where $n$ is the number of possible descriptor values, and $N_{b}$ is the number of histogram bins.

The final feature vector is given by

H = \left\{ \frac{H_{0}}{(W_{S})^{2}}, \frac{H_{1}}{(W_{S})^{2}}, \dots, \frac{H_{m - 1}}{(W_{S})^{2}} \right\}

(15)

where $m$ is the number of sub-regions.
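Equations (14) and (15) can be sketched as follows for a descriptor map that has already been computed (for example, an image of LBP codes). The function name, the default of 256 possible descriptor values, and the choice to discard pixels that do not fill a complete sub-region are assumptions made for this example.

```python
import numpy as np

def feature_vector(descriptor_map, region_size, n_bins, n_values=256):
    """Concatenate the normalized histograms of non-overlapping square sub-regions."""
    descriptor_map = np.asarray(descriptor_map)
    # Assumption: leftover rows/columns that do not fill a sub-region are discarded.
    n_rows = descriptor_map.shape[0] // region_size
    n_cols = descriptor_map.shape[1] // region_size
    histograms = []
    for r in range(n_rows):
        for c in range(n_cols):
            region = descriptor_map[r * region_size:(r + 1) * region_size,
                                    c * region_size:(c + 1) * region_size]
            # Bin i covers [i * n / N_b, (i + 1) * n / N_b), as in the definition of delta.
            hist, _ = np.histogram(region, bins=n_bins, range=(0, n_values))
            histograms.append(hist / region_size ** 2)  # frequencies sum to one
    return np.concatenate(histograms)
```

With a 100 × 100 descriptor map, 13 × 13 sub-regions, and 16 bins, this produces 49 histograms and a feature vector of length 49 × 16 = 784.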

3.3. Feature Set Evaluation

In this step, a model is trained with different feature sets, and the best feature set is selected based on its classification performance. The performances were ranked using the average recall during 10-fold cross-validation. Then, the best feature set and the corresponding best parameter setting are identified. Algorithm 1 defines the process of feature set evaluation in more detail.

Algorithm 1 Feature Set Evaluation

Inputs:

$(F_{1}, \dots, F_{m})$: feature sets

$(p_{1}, \dots, p_{m})$: extraction parameters for each feature set

Outputs:

$F_{best}$: best feature set

$p_{best}$: best extraction parameters

$F_{best} \leftarrow F_{1}$ // Initialize the best feature set
$p_{best} \leftarrow p_{1}$ // Initialize the best parameters
$model \leftarrow NewModel()$ // Construct a new model, e.g., SVM, KNN, etc.
$maxScore \leftarrow CrossValidation(model, F_{1})$ // Obtain a cross-validation score
for $i$ in $(2, 3, \dots, m)$ do // Train new models with the remaining feature sets
    $model \leftarrow NewModel()$
    $score \leftarrow CrossValidation(model, F_{i})$
    if $score > maxScore$ then // Find the maximum score
        $F_{best} \leftarrow F_{i}$
        $p_{best} \leftarrow p_{i}$
        $maxScore \leftarrow score$
    end if
end for
return $F_{best}, p_{best}$

3.4. Facial Expression Classification

This research is focused on the recognition of prototypic facial expressions of emotion from facial images. The classification stage aims at training a machine learning model using labeled feature vectors extracted from the facial images. Once trained, the model can make predictions on new data. The current work considers four classification methods: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naïve Bayes (NB) (this work used scikit-learn’s [29] implementation of the Gaussian Naïve Bayes classifier), and Multi-Layer Perceptron (MLP). Table 1 gives the hyper-parameter settings used to train each classifier.

Table 1. Classifier hyper-parameters.

4. Datasets

This study conducted experiments using two well-known FER datasets: the Extended Cohn–Kanade (CK+) database and the Radboud Faces Database (RFD).

CK+ gathers facial expression images of subjects from various ethnicities, ages, and genders. This database consists of 593 image sequences from 123 subjects. The database contains posed and non-posed images taken under lab-controlled conditions. To form the first dataset for our study, we followed the guidelines in [35] by selecting our samples as follows: the last images in the sequences for anger, disgust, and happiness; the last and fourth images for the first 68 sequences related to surprise; the last and fourth from last images for the sequences of fear and sadness. A second dataset was created by adding images of the neutral expression. The number of images collected amounted to 347 and 407 for the first and the second datasets, respectively.

The Radboud Faces Database contains pictures of 67 subjects (adults and children) displaying eight facial expressions of emotion. The subjects are of Caucasian ethnicity, and the images are taken in different gaze directions and head orientations [36]. Three subsets were created: the first subset consisted of 402 images labeled with six expressions, including anger, disgust, fear, happiness, sadness, and surprise; the second subset was formed by augmenting the first set with 67 images labeled as contempt, and the third subset was formed by adding 67 images labeled as neutral to the second set. The total number of image samples was 536. Table 2 summarizes the number of samples and expressions in each dataset.

Table 2. Summary of the datasets.

5. Experimental Results and Discussion

5.1. Performance Analysis for Varying the Number of Histogram Bins

In this experiment, the face regions were first detected and aligned using a face registration algorithm that automatically positions the eyes and crops the face image to discard background pixels. Then, the aligned face images were resized to a resolution of 100 × 100 pixels. By setting the size of the sub-regions to 13 × 13 pixels (i.e., 49 sub-regions in total), feature vectors were extracted from each image. The number of histogram bins was varied to reduce the feature vector length while achieving a high recognition rate. Note that for WLD, the histogram bins were set as $M = 2$, $S = 2$, and $T$ was varied from 2 to 24. Figure 4 gives the results (results are given as the average accuracy during 10-fold cross-validation) obtained when the number of histogram bins is varied for different descriptor and classifier combinations. We observe different trends depending on the type of classifier used. When SVM and MLP classifiers are used, the classification performance either improves or stays the same as the number of bins increases. A similar result is seen with the KNN classifier, although the large size of the features hurts the recognition rates for some descriptors (e.g., with the LDP descriptor, the performance decreases as the number of bins increases). On the other hand, when using the NB classifier, the recognition rates decrease as the number of bins rises, regardless of the type of features used.

Figure 4. FER results for different numbers of bins on the RFD dataset with 6 expressions (a) WLD results; (b) LDP results; (c) ALDP results; (d) LBP results; (e) CLBP results; (f) HOG results.

5.2. Performance Analysis for Varying the Sub-Region Size

A sub-region can be represented by a square of size l × l, where l varies from 7 to 25 in steps of 3. Figure 5 gathers the results for varying the sub-region size on the RFD dataset. As observed in Figure 5, the larger the sub-region size (or, equivalently, the smaller the number of sub-regions), the better the recognition rates. This trend is observed for the other classifier and descriptor combinations as well. It is also worth noting that no single sub-region size works best for all descriptor and classifier combinations. However, our results suggest that sub-regions of 10 × 10 and 13 × 13 pixels give some of the highest recognition rates. Based on the results from the previous two experiments (see Section 5.1 and Section 5.2), the optimum parameter values were selected. Table 3 gives the parameter values used in the next series of experiments.

Figure 5. FER results for different sub-regions sizes on RFD with 6 expressions (a) SVM results; (b) NB results; (c) KNN results; (d) MLP results.

Table 3. Best parameters for each combination of classifier and descriptor (number of bins, block size).
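As a rough illustration of how the sub-region size controls the feature vector length, the sketch below counts the sub-regions for each tested value of l. It assumes non-overlapping tiles on the 100 × 100 face image with incomplete border tiles discarded, which is consistent with the 49 sub-regions reported for 13 × 13 pixels; the 56-bin setting is used only as an example value, and the helper names are hypothetical.

```python
IMAGE_SIZE = 100  # aligned faces are resized to 100 x 100 pixels


def num_sub_regions(block_size, image_size=IMAGE_SIZE):
    """Number of complete, non-overlapping l x l tiles (assumed tiling)."""
    per_side = image_size // block_size
    return per_side * per_side


def feature_length(block_size, n_bins):
    """Length of the concatenated histogram vector for one image."""
    return num_sub_regions(block_size) * n_bins


# l varies from 7 to 25 in steps of 3, as in this experiment.
for l in range(7, 26, 3):
    print(f"l = {l:2d}: {num_sub_regions(l):3d} sub-regions, "
          f"{feature_length(l, 56):5d} features with 56 bins")
```

Under these assumptions, moving from 7 × 7 to 13 × 13 sub-regions reduces the number of histograms per image from 196 to 49.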

5.3. Performance Analysis of the Classifiers

Figure 6 summarizes the best results obtained when varying the sub-region sizes on the CK+ dataset with six expressions. We observe that the best recognition rates (95–98%) were achieved using the SVM and MLP classifiers, while the lowest accuracies (73–91%) were produced by the NB and KNN classifiers. As reported in Figure 7, a similar trend is observed on the RFD dataset with six expressions. To further analyze the performances of the SVM and MLP classifiers, the previous experiments were repeated on the other datasets with varying numbers of expressions. Table 4 gathers the results obtained when the number of expressions was varied. For the CK+ dataset, the expressions were anger, disgust, fear, happiness, sadness, and surprise in the six-class problem, and the seven-class problem added the neutral expression to this list. For the RFD dataset, the same six expressions were used in the six-class problem, the seven-class problem added the expression of contempt, and the eight-class problem further added the neutral expression. Considering the six-class problem, SVM and MLP achieve similar classification rates (above 95%) on the two datasets. Although their performances are very similar, the MLP classifier achieved slightly better recognition rates than SVM for most descriptors on CK+ (WLD, LBP, CLBP, and HOG in Table 4). Both classifiers are negatively affected by the introduction of the neutral expression. The largest performance drop is seen in the ALDP + SVM combination on CK+, where the accuracy drops by 6.3 percentage points (from 97.5% to 91.2%). The recognition rates of both classifiers do not change much when the contempt expression is introduced in the RFD dataset with seven expressions.

Figure 6. FER results for the best of sub-regions on the CK+ dataset (6 expressions).

Figure 7. FER results for the best of sub-regions on the RFD dataset (6 expressions).

Table 4. Comparison of recognition rates on the CK+ and RFD datasets with a varying number of expressions.
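The effect of adding the neutral expression on CK+ can be checked directly from the values in Table 4. The short sketch below copies the six-class and seven-class CK+ accuracies from that table and computes the drop for each combination; the variable names are illustrative only.

```python
# (CK+ 6-class, CK+ 7-class) accuracies in percent, copied from Table 4.
ck_plus = {
    "WLD + SVM": (96.0, 92.4),
    "LDP + SVM": (96.1, 93.3),
    "ALDP + SVM": (97.5, 91.2),
    "LBP + SVM": (96.4, 93.0),
    "CLBP + SVM": (95.3, 93.0),
    "HOG + SVM": (97.4, 93.4),
    "WLD + MLP": (96.3, 94.2),
    "LDP + MLP": (95.9, 93.2),
    "ALDP + MLP": (96.7, 92.7),
    "LBP + MLP": (97.2, 93.1),
    "CLBP + MLP": (97.5, 93.5),
    "HOG + MLP": (98.3, 95.1),
}

# Drop in percentage points when the neutral expression is added.
drops = {name: round(six - seven, 1) for name, (six, seven) in ck_plus.items()}

largest = max(drops, key=drops.get)
for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{name:11s} {drop:.1f}")
print("largest drop:", largest)
```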

5.4. Analyzing the Computational Costs

This section examines the computational costs of using different combinations of descriptors and classifiers. The extraction time of each local descriptor was measured by extracting a feature vector with a sub-region size of 13 × 13 pixels and 56-bin histograms. The same parameter setting was used for all descriptors to make the comparison fair. On the other hand, when measuring the prediction time, the optimum parameters (see Table 3) were used to extract feature vectors before making predictions on those feature vectors. Table 5 and Table 6 give the extraction and prediction times, respectively, where each value is the average time over 500 iterations. The results show that the best extraction time was achieved by WLD (0.54 s), followed by HOG (1.72 s). The longest times are seen for LBP (5.69 s) and CLBP (12.49 s). NB was the fastest among the considered classifiers, with an average prediction time of 197 ms, followed by MLP, which took an average of 438 ms. SVM came third (1435 ms), and KNN was the slowest (2975 ms).

Table 5. Extraction times (in seconds).

Table 6. Prediction times of various classifiers (times are given in milliseconds).
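Averaging a measurement over many iterations, as done for Tables 5 and 6, can be sketched with a small timing helper. This is only an illustration of the averaging idea: the helper `average_time` and the placeholder workload `dummy_extract` are hypothetical and are not the code used to produce the reported timings.

```python
import time


def average_time(fn, iterations=500):
    """Return the mean wall-clock time of fn() in seconds over `iterations` calls."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def dummy_extract():
    # Hypothetical stand-in for a descriptor extraction or prediction call.
    return sum(i * i for i in range(1000))


mean_seconds = average_time(dummy_extract)
print(f"{mean_seconds * 1000:.3f} ms per call (mean of 500 calls)")
```

Averaging over repeated calls reduces the influence of one-off delays such as operating system scheduling on the reported figure.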

The time costs for training each classifier using different descriptors were also measured. We used the same extraction parameters for all the considered classifiers to ensure a fair comparison. We set the sub-region size to 19 × 19 pixels and the number of bins to 48. The features were extracted from the RFD dataset with six expressions. As seen in Table 7, KNN and NB had the fastest training times (a few milliseconds), followed by SVM, which took 29 ms on average. MLP was the slowest, taking about 5.8 s on average and up to about 7.5 s during training. All our experiments were performed on a Windows 10 PC with an AMD Ryzen 7 (3700U) processor and 12 GB of RAM.

Table 7. Training times of various classifiers (times are given in milliseconds).
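The classifier rankings discussed in this section follow from the per-descriptor values in Tables 6 and 7. As a small worked check, the sketch below copies those values and ranks the classifiers by their mean time; the dictionary and function names are illustrative.

```python
from statistics import mean

# Times in milliseconds, copied from Tables 6 and 7.
# Column order: WLD, LDP, ALDP, LBP, CLBP, HOG.
prediction_ms = {
    "SVM": [1300, 530, 1240, 1050, 2520, 1970],
    "NB": [150, 240, 180, 200, 200, 210],
    "KNN": [3010, 500, 3260, 3560, 6360, 1160],
    "MLP": [310, 140, 460, 460, 680, 580],
}
training_ms = {
    "SVM": [41, 30, 22, 26, 39, 14],
    "NB": [9, 6, 6, 8, 6, 4],
    "KNN": [1, 1, 1, 1, 1, 1],
    "MLP": [6520, 5680, 7489, 5991, 6234, 2836],
}


def rank_by_mean(times):
    """Return classifier names sorted from fastest to slowest mean time."""
    return sorted(times, key=lambda name: mean(times[name]))


print("prediction:", rank_by_mean(prediction_ms))
print("training:  ", rank_by_mean(training_ms))
```

The two rankings differ: MLP is second fastest at prediction but slowest to train, while KNN shows the opposite pattern.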

5.5. Comparison with the State-of-the-Art

Several local descriptors have been proposed in recent years to tackle the FER problem [33,37,38]. Table 8 compares the performances of various existing FER systems with our best combinations of local descriptors and classifiers. The table shows that several local descriptors perform as well as deep neural networks. Furthermore, Shokrani et al. [39] proposed combining the Pyramid Histogram of Oriented Gradients (PHOG) descriptor with a KNN classifier, achieving an average accuracy of 100% on the RFD dataset (see Table 8). Results such as these are difficult to compare with other approaches because they depend on the selection of the training and testing images. More recent studies prefer 10-fold cross-validation as a more reliable measure of the accuracy of FER classifiers because every sample is used for testing exactly once and for training in the other nine folds. Hence, a high accuracy on a single 67–33% split, as in [39], is not always a good indicator of performance.

Table 8. Comparison with state-of-the-art methods.

It is clear from the table that our approach using HOG and CLBP performs as well as or better than other approaches. Although some existing approaches have produced higher accuracies, they generally produce larger feature vectors. The proposed method finds the best balance between a small feature vector and high accuracy by exploring several feature sets. Our findings reveal that a small feature vector can achieve the same accuracy as a large feature vector if the extraction parameters are appropriately set, as seen in Figure 4. This finding could greatly benefit existing methods with high dimensionality and large memory footprints. It should be noted that, unlike many recent approaches, this study did not use any dimensionality reduction or subspace-learning methods and still achieved competitive results.
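The difference between 10-fold cross-validation and a single hold-out split, discussed earlier in this section, can be illustrated with a minimal index-splitting sketch. The function `k_fold_indices` is a hypothetical helper written for this illustration; it omits shuffling and class stratification, which a practical evaluation would normally include.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


n = 347  # number of CK+ images in the six-class subset (Table 2)
test_counts = [0] * n
train_counts = [0] * n
for train, test in k_fold_indices(n):
    for i in test:
        test_counts[i] += 1
    for i in train:
        train_counts[i] += 1

# Every sample is tested exactly once and used for training in the other folds.
print(set(test_counts), set(train_counts))
```

In a single 67–33% split, by contrast, about two thirds of the samples are never tested, so the reported accuracy depends on which images happen to fall in the test set.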

6. Conclusions

This study evaluated different combinations of local descriptors and classifiers for facial expression recognition. The experiments comprehensively evaluated six descriptors and four classifiers on two well-known datasets, CK+ and RFD. Notably, the appropriate choice of extraction parameters improved several descriptors’ performances. We can single out HOG as the descriptor that achieved the most consistent performance across all the datasets. The computational costs of each classifier and descriptor were also studied, and we found that the NB and MLP classifiers provided the fastest predictions. However, MLP can be time-consuming during training compared to the other classifiers. When evaluating the classifiers’ recognition accuracies, we found that SVM and MLP achieved the best results.

Out of the six considered descriptors, HOG and ALDP are some of the most promising, while SVM and MLP rank at the top of the considered classifiers. We also found that SVM was faster to train and required little model tuning compared to MLP. On the other hand, MLP is notably faster at prediction, which could be an advantage in real-time applications.

Author Contributions

Conceptualization, A.B.M. and J.-R.T.; methodology, A.B.M. and J.-R.T.; software, A.B.M.; validation, A.B.M. and J.-R.T.; formal analysis, A.B.M. and J.-R.T.; investigation, A.B.M. and J.-R.T.; resources, A.B.M. and J.-R.T.; data curation, A.B.M. and J.-R.T.; writing (original draft preparation), A.B.M.; writing (review and editing), A.B.M. and J.-R.T.; visualization, J.-R.T.; supervision, J.-R.T.; project administration, J.-R.T.; funding acquisition, J.-R.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Extended Cohn–Kanade dataset (CK+) that supports the findings of this study is available at the University of Pittsburgh at http://www.jeffcohn.net/resources (accessed on 24 July 2021) with the permission of Jeffrey Cohn. The Radboud Faces Database (RFD) dataset that supports the findings of this study is available from the Radboud University Nijmegen at https://rafd.socsci.ru.nl/RaFD2/RaFD?p=main (accessed on 9 March 2022) with the permission of Oliver Langner.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

Notations	Description
ALDP	Angled Local Directional Patterns
CK+	Extended Cohn–Kanade
CLBP	Compound Local Binary Patterns
CNN	Convolutional Neural Networks
ELBPTOP	Extended Local Binary Patterns on Three Orthogonal Planes
FER	Facial Expression Recognition
HOG	Histogram of Oriented Gradients
KNN	K-Nearest Neighbors
LBP	Local Binary Patterns
LDP	Local Directional Patterns
LSTM	Long Short-Term Memory
MLP	Multi-Layer Perceptron
NB	Naïve Bayes
RFD	Radboud Faces Database
RNN	Recurrent Neural Network
SVM	Support Vector Machines
WLD	Weber’s Local Descriptor

References

1. Rattanyu, K.; Ohkura, M.; Mizukawa, M. Emotion monitoring from physiological signals for service robots in the living space. In Proceedings of the ICCAS 2010, Gyeonggi-do, Republic of Korea, 27–30 October 2010; pp. 580–583.
2. Patel, M.; Lal, S.K.; Kavanagh, D.; Rossiter, P. Applying neural network analysis on heart rate variability data to assess driver fatigue. Expert Syst. Appl. 2011, 38, 7235–7242.
3. Yannakakis, G.N.; Hallam, J. Real-time game adaptation for optimizing player satisfaction. IEEE Trans. Comput. Intell. Games 2009, 1, 121–133.
4. Garbas, J.U.; Ruf, T.; Unfried, M.; Dieckmann, A. Towards robust real-time valence recognition from facial expressions for market research applications. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2–5 September 2013; pp. 570–575.
5. Prkachin, K.M. Assessing pain by facial expression: Facial expression as nexus. Pain Res. Manag. 2009, 14, 53–58.
6. Friesen, E.; Ekman, P. Facial action coding system: A technique for the measurement of facial movement. Palo Alto 1978, 3, 5.
7. Ko, B.C. A brief review of facial emotion recognition based on visual information. Sensors 2018, 18, 401.
8. Saeed, A.; Al-Hamadi, A.; Niese, R.; Elzobi, M. Frame-based facial expression recognition using geometrical features. Adv. Hum. Comput. Interact. 2014, 2014, 408953.
9. Poursaberi, A.; Noubari, H.A.; Gavrilova, M.; Yanushkevich, S.N. Gauss–Laguerre wavelet textural feature fusion with geometrical information for facial expression identification. EURASIP J. Image Video Process. 2012, 2012, 17.
10. Ding, H.; Zhou, S.K.; Chellappa, R. Facenet2expnet: Regularizing a deep face recognition net for expression recognition. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 118–126.
11. Kanan, H.R.; Ahmady, M. Recognition of facial expressions using locally weighted and adjusted order Pseudo Zernike Moments. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba Science City, Japan, 11–15 November 2012; pp. 3419–3422.
12. Mellouk, W.; Handouzi, W. Facial emotion recognition using deep learning: Review and insights. Procedia Comput. Sci. 2020, 175, 689–694.
13. Turan, C.; Lam, K.M. Histogram-based local descriptors for facial expression recognition (FER): A comprehensive study. J. Vis. Commun. Image Represent. 2018, 55, 331–341.
14. Slimani, K.; Kas, M.; El Merabet, Y.; Ruichek, Y.; Messoussi, R. Local feature extraction based facial emotion recognition: A survey. Int. J. Electr. Comput. Eng. 2020, 10, 4080.
15. Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59.
16. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
17. Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816.
18. Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928.
19. Zhao, X.; Zhang, S. Facial expression recognition based on local binary patterns and kernel discriminant isomap. Sensors 2011, 11, 9573–9588.
20. Guo, C.; Liang, J.; Zhan, G.; Liu, Z.; Pietikäinen, M.; Liu, L. Extended local binary patterns for efficient and robust spontaneous facial micro-expression recognition. IEEE Access 2019, 7, 174517–174530.
21. Kola, D.G.R.; Samayamantula, S.K. A novel approach for facial expression recognition using local binary pattern with adaptive window. Multimed. Tools Appl. 2021, 80, 2243–2262.
22. Ahmed, F.; Hossain, E.; Bari, A.H.; Shihavuddin, A. Compound local binary pattern (CLBP) for robust facial expression recognition. In Proceedings of the 2011 IEEE 12th International Symposium on Computational Intelligence and Informatics (CINTI), Budapest, Hungary, 21–22 November 2011; pp. 391–395.
23. Jabid, T.; Kabir, M.H.; Chae, O. Robust facial expression recognition based on local directional pattern. ETRI J. 2010, 32, 784–794.
24. Shabat, A.M.; Tapamo, J.R. Angled local directional pattern for texture analysis with an application to facial expression recognition. IET Comput. Vis. 2018, 12, 603–608.
25. Chen, J.; Shan, S.; He, C.; Zhao, G.; Pietikainen, M.; Chen, X.; Gao, W. WLD: A robust image local descriptor. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1705–1720.
26. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893.
27. Ghimire, D.; Jeong, S.; Lee, J.; Park, S.H. Facial expression recognition based on local region specific features and support vector machines. Multimed. Tools Appl. 2017, 76, 7803–7821.
28. Revina, I.M.; Emmanuel, W.S. Face expression recognition using LDN and dominant gradient local ternary pattern descriptors. J. King Saud Univ.-Comput. Inf. Sci. 2021, 33, 392–398.
29. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
30. Sebe, N.; Lew, M.S.; Cohen, I.; Garg, A.; Huang, T.S. Emotion recognition using a cauchy naive bayes classifier. In Proceedings of the Object Recognition Supported by User Interaction for Service Robots, Quebec City, QC, Canada, 11–15 August 2002; Volume 1, pp. 17–20.
31. Sohail, A.S.M.; Bhattacharya, P. Classification of facial expressions using k-nearest neighbor classifier. In Proceedings of the International Conference on Computer Vision/Computer Graphics Collaboration Techniques and Applications, Rocquencourt, France, 28–30 March 2007; pp. 555–566.
32. Panchal, G.; Pushpalatha, K. A local binary pattern based facial expression recognition using K-nearest neighbor (KNN) search. Int. J. Eng. Res. Technol. 2017, 6, 525–530.
33. Dino, H.I.; Abdulrazzaq, M.B. Facial expression classification based on SVM, KNN and MLP classifiers. In Proceedings of the 2019 International Conference on Advanced Science and Engineering (ICOASE), Duhok, Iraq, 2–4 April 2019; pp. 70–75.
34. Boughrara, H.; Chtourou, M.; Ben Amar, C.; Chen, L. Facial expression recognition based on a mlp neural network using constructive training algorithm. Multimed. Tools Appl. 2016, 75, 709–731.
35. Carcagnì, P.; Del Coco, M.; Leo, M.; Distante, C. Facial expression recognition and histograms of oriented gradients: A comprehensive study. SpringerPlus 2015, 4, 645.
36. Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D.H.; Hawk, S.T.; Van Knippenberg, A. Presentation and validation of the Radboud Faces Database. Cogn. Emot. 2010, 24, 1377–1388.
37. Yaddaden, Y.; Adda, M.; Bouzouane, A. Facial Expression Recognition using Locally Linear Embedding with LBP and HOG Descriptors. In Proceedings of the 2020 2nd International Workshop on Human-Centric Smart Environments for Health and Well-being (IHSH), Boumerdes, Algeria, 9–10 February 2021; pp. 221–226.
38. Alphonse, A.S.; Dharma, D. Novel directional patterns and a Generalized Supervised Dimension Reduction System (GSDRS) for facial emotion recognition. Multimed. Tools Appl. 2018, 77, 9455–9488.
39. Shokrani, S.; Moallem, P.; Habibi, M. Facial emotion recognition method based on Pyramid Histogram of Oriented Gradient over three direction of head. In Proceedings of the 2014 4th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 29–30 October 2014; pp. 215–220.
40. Lekdioui, K.; Messoussi, R.; Ruichek, Y.; Chaabi, Y.; Touahni, R. Facial decomposition for expression recognition using texture/shape descriptors and SVM classifier. Signal Process. Image Commun. 2017, 58, 300–312.
41. Xie, S.; Hu, H. Facial expression recognition with FRR-CNN. Electron. Lett. 2017, 53, 235–237.

Figure 1. Kirsch edge masks.

Figure 2. The proposed facial expression recognition method.

Figure 3. Face detection and registration.

Table 1. Classifier hyper-parameters.

Classifier	Parameters
SVM	kernel = Radial Basis Function, C = 1000, gamma = 0.05
NB	Automatically selected
KNN	k = 50
MLP	solver = adam, number of passes over the training data = 200, learning rate = 0.001

Table 2. Summary of the datasets.

Dataset	Database Name	Number of Images	Number of Classes
DS1	CK+	347	6
DS2	CK+	407	7
DS3	RFD	402	6
DS4	RFD	469	7
DS5	RFD	536	8

Table 3. Best parameters for each combination of classifier and descriptor (number of bins, block size).

Descriptor	SVM	NB	KNN	MLP
WLD	(88, 19)	(8, 19)	(96, 10)	(88, 13)
LDP	(48, 7)	(12, 10)	(8, 10)	(48, 13)
ALDP	(112, 13)	(16, 16)	(112, 13)	(128, 7)
LBP	(96, 19)	(32, 19)	(128, 13)	(128, 10)
CLBP	(248, 10)	(24, 10)	(248, 13)	(216, 10)
HOG	(50, 13)	(6, 7)	(9, 10)	(50, 13)

Table 4. Comparison of recognition rates on the CK+ and RFD datasets with a varying number of expressions.

Descriptor + Classifier	CK+ 6-Class	CK+ 7-Class	RFD 6-Class	RFD 7-Class	RFD 8-Class
WLD + SVM	96.0	92.4	96.7	95.9	91.5
LDP + SVM	96.1	93.3	97.0	96.8	94.0
ALDP + SVM	97.5	91.2	97.0	96.8	94.1
LBP + SVM	96.4	93.0	96.6	97.0	93.0
CLBP + SVM	95.3	93.0	97.5	97.0	93.5
HOG + SVM	97.4	93.4	96.6	97.0	93.8
WLD + MLP	96.3	94.2	96.9	95.9	91.9
LDP + MLP	95.9	93.2	96.9	96.2	93.4
ALDP + MLP	96.7	92.7	95.4	95.3	93.0
LBP + MLP	97.2	93.1	96.7	96.4	94.0
CLBP + MLP	97.5	93.5	96.8	97.4	93.1
HOG + MLP	98.3	95.1	97.4	96.4	94.0

Table 5. Extraction times (in seconds).

WLD	LDP	ALDP	LBP	CLBP	HOG
0.54	3.79	3.9	5.69	12.49	1.72

Table 6. Prediction times of various classifiers (times are given in milliseconds).

Classifier	WLD	LDP	ALDP	LBP	CLBP	HOG	Average
SVM	1300	530	1240	1050	2520	1970	1435
NB	150	240	180	200	200	210	197
KNN	3010	500	3260	3560	6360	1160	2975
MLP	310	140	460	460	680	580	438

Table 7. Training times of various classifiers (times are given in milliseconds).

Classifier	WLD	LDP	ALDP	LBP	CLBP	HOG	Average
SVM	41	30	22	26	39	14	29
NB	9	6	6	8	6	4	6
KNN	1	1	1	1	1	1	1
MLP	6520	5680	7489	5991	6234	2836	5792

Table 8. Comparison with state-of-the-art methods.

Database	Ref	Year	Features	Samples	Classifier	No. Classes	Accuracy (Measure)
CK+	[40]	2017	LTP + HOG	610	SVM	7	96% (10-fold)
CK+	[38]	2018	MRDTP + MRDNP	1281	ELM-RBF	7	98.4% (10-fold)
CK+	[41]	2017	Deep	927	FRR-CNN	6	92.06% (10-fold)
CK+	[24]	2017	ALDP	-	SVM	7	97% (80–20%)
CK+	[10]	2017	Deep	927	FN2EN	6	98.6% (10-fold)
CK+	[33]	2019	HOG	634	MLP	8	82.97% (10-fold)
RaFD	[39]	2014	PHOG	630	KNN	7	100% (67–33%)
RaFD	[11]	2012	PZM	-	KNN	6	94.51% (70–30%)
RaFD	[37]	2020	HOG + LLE	469	SVM	6	93.54% (10-fold)
CK+	Proposed	2022	HOG	347	MLP	6	98.4% (10-fold)
CK+	Proposed	2022	ALDP	347	SVM	6	97.5% (10-fold)
RaFD	Proposed	2022	CLBP	402	SVM	6	97.5% (10-fold)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
