Article

Comparing Approaches for Explaining DNN-Based Facial Expression Classifications

1 Informatics Institute, University of Amsterdam, 1090 GE Amsterdam, The Netherlands
2 Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, The Netherlands
* Author to whom correspondence should be addressed.
Algorithms 2022, 15(10), 367; https://doi.org/10.3390/a15100367
Submission received: 10 August 2022 / Revised: 28 September 2022 / Accepted: 29 September 2022 / Published: 3 October 2022
(This article belongs to the Special Issue Machine Learning in Pattern Recognition)

Abstract: Classifying facial expressions is a vital part of developing systems capable of aptly interacting with users. In this field, the use of deep-learning models has become the standard. However, the inner workings of these models are unintelligible, which is an important issue when deploying them to high-stakes environments. Recent efforts to generate explanations for emotion classification systems have focused on this type of model. In this work, an alternative way of explaining the decisions of a more conventional model based on geometric features is presented. We develop a geometric-features-based deep neural network (DNN) and a convolutional neural network (CNN). Ensuring a sufficient level of predictive accuracy, we analyze explainability using both objective quantitative criteria and a user study. Results indicate that the explanations approximate the DNN well in terms of fidelity and accuracy. The user study shows that the explanations increase participants’ understanding of the DNN and that they are preferred over the more commonly used explanations for the CNN. All scripts used in the study are publicly available.

1. Introduction

The field of affective computing is concerned with providing computers with the ability to examine and understand human affects and to form their own human-like affects [1]. These notions are essential for creating empathetic computers that can interact appropriately with users, e.g., in situations such as (mental) health care, education, caring for the elderly, etc. One of the key elements of affective computing is emotion recognition. Emotions can be recognized using acoustic, visual and linguistic modalities [1]. In the visual modality, emotions are mainly recognized from faces, a task known as Facial Expression Recognition (FER), which is the area of focus for this research. In the context of using visual information (i.e., images and videos), FER can be categorized into two different approaches: conventional and deep learning [2]. Traditionally, FER was performed in three main steps: component detection, feature extraction, and finally emotion classification.
More recently, there has been a surge in the usage of deep-learning models for image- and video-based tasks, including but not limited to FER. The use of convolutional neural networks (CNNs) in particular has increased immensely [3], and they have dramatically outperformed conventional models [4]. With these models, the network extracts features on its own, so images can be fed into the network directly instead of features being extracted beforehand. This means such models are not limited to human-extracted features.
Because CNNs are so powerful, they have become the dominant approach for state-of-the-art methods in affective computing and FER [2]. However, these models are also extremely opaque. CNNs belong to the class of black-box models, which means they cannot be interpreted by humans. Even if one inspects all the internal components of such a model, one still cannot comprehend what abstractions the model has learned and why it makes the decisions it makes. This is a problem, as Artificial Intelligence (AI) models are becoming more and more involved in our daily lives. Especially in high-stakes environments (e.g., the legal system, education, mental health care), it is important that we understand the actual decision-making mechanism, as the model’s decisions can have far-reaching consequences [5,6]. Moreover, model transparency is an important factor for building trust and technology adoption [7,8].
This is where the field of explainable artificial intelligence (XAI) comes into play. The broad goal of XAI is to make models more interpretable for humans [9]. In the context of affective computing and FER, researchers have started implementing XAI methods for the models they use [10,11] and even challenges for explainable affective computing have been organized [12,13].
In this study, we explore the interpretability of models based on geometric features. Furthermore, we compare the interpretability of such a model with that of a state-of-the-art CNN. We attempt this by constructing two models: a deep neural network (DNN) based on geometric features extended from [14] and a CNN using transfer learning on a pre-trained model developed in [15]. Both models are trained on images of facial expressions and perform an emotion classification task. We generate explanations for both the DNN and the CNN model. We evaluate the quality of the DNN explanations using several XAI measures and compare these explanations with the explanations for the CNN. Moreover, the explanations are assessed and compared via a user study. In short, the contributions of this work can be summarised as follows:
1. Developing a new method of visually and textually explaining DNN predictions based on geometric features.
2. Making a direct comparison between the interpretability of a CNN and a DNN trained for an emotion classification task.
3. Performing a user study to evaluate and compare the quality of the explanations.
This study is organized as follows: first, we discuss background and related literature in Section 2. Next, in Section 3, we explain how the explanations and the user study are constructed. Subsequently, the experiments for developing the structures of both models and the user study are explicated in Section 4. After that, we discuss the results of these experiments in Section 5. Lastly, Section 6 concludes.

2. Background and Related Work

2.1. Background on Explainable AI

The field of explainable AI aims to make models more understandable for humans. This is important when we let AI models make important decisions. Consider a classic example: a model used by a bank to decide whether or not to grant a person a loan. If someone is denied the loan, that person would naturally like to know why they did not get it and what they need to change in order to obtain it (a counter-factual), based on the ‘right to explanation’ [16]. Additionally, there are ethical concerns: e.g., does the model look at racial features? If the bank uses a very complex model, we cannot know this just by looking at the model on its own. We need explanations to gain the insights we want.
The term that stands at the centre of XAI is interpretability. There is little consensus on an exact definition for this. We use the same definition as [17] in terms of machine learning systems: ‘the ability to explain or to present in understandable terms to a human’. The main problem with complex state-of-the-art models such as CNNs is that their interpretability is very low. These models belong to the class of black-box models: their internal workings are unintelligible.
As for measures to evaluate the quality of explanations, there are no formal methods for this yet [9,18]. We can make a distinction between measures that make use of human participants and those that do not [17].
Measures that do not depend on human evaluation can be calculated automatically. One such property is fidelity or faithfulness: how well does the explanation approximate the original model [19,20]? This is measured in terms of accuracy, but with respect to the original model’s predictions instead of the ground truth. This is an important property, since an explanation that does not approximate the model well tells us nothing about the original model and is effectively useless.
The fidelity measure is distinct from plausibility: how convincing is the explanation for humans [20]? These two should not be mixed up, as they represent different things and should be calculated in different manners. Plausibility should be measured based on human evaluation, whereas fidelity should not. Both the fidelity and plausibility measures will be used in this study.
Achieving interpretability can be done via two distinct paths: explaining an existing, black-box model (post-hoc explaining) or designing models that are inherently interpretable [6,21]. In this study, we look at post-hoc explanations that are thus generated after the models have been constructed. Note that there are people who favour designing inherently interpretable models rather than post-hoc explanations, particularly in high-stakes applications [5].
Furthermore, there is a distinction between model-specific and model-agnostic explanation methods [18]. Model-specific explanations are those that only work for a specific type of model, e.g., only for CNNs, whereas model-agnostic methods work for any type of model. The latter approach treats the model essentially as a black box and does not need to look at any of the internal workings.
The last distinction is between local and global explanations [18]. Global explanations attempt to explain the whole model, whereas local explanations explain a subset of the data or even a single data point. Multiple local explanations can also be used to approximate a global explanation. In this study we solely make use of local explanations on single data instances. Next, we discuss the two methods that are used as a basis to generate explanations in this study.

2.1.1. SHAP

SHAP (SHapley Additive exPlanation) [22] is a widely used method for generating explanations. Its main goal is to calculate the contribution of each feature to the prediction, thus explaining what features are the most important for a prediction. This is done using Shapley values, which have their foundation in game theory [23]. In short, the importance of a feature f_i is calculated using a weighted average of the difference in prediction f(S ∪ {f_i}) − f(S), where S is a subset of the original feature set and the values of the complement set are assumed missing. SHAP also comes with some desirable properties: local accuracy (fidelity), missingness, and consistency [22].
We chose to use the SHAP method–more specifically the model-agnostic version of SHAP, KernelSHAP–over, e.g., LIME (Local Interpretable Model-Agnostic Explanations) [24], since the KernelSHAP implementation extends the heuristically driven LIME but includes the desirable properties of SHAP. KernelSHAP is a model-agnostic, post-hoc method for generating explanations. It can thus work on any pre-made model.
SHAP can also be used for explaining CNNs, where the input does not consist of distinct features, but an image. In that case, SHAP groups pixels together as so-called `super-pixels’ and calculates the values with these super-pixels as features. However, this approach is computationally much more expensive than gradient based methods, e.g., Grad-CAM, which is described below.
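To make the procedure concrete, the following is a minimal sketch of how KernelSHAP could be applied to a feature-based classifier such as the geometric-features DNN. The variable names (dnn, X_train, X_test), the background sample size and the nsamples budget are illustrative assumptions, not the exact configuration used in this study.

```python
import numpy as np
import shap

# Assumed inputs: a trained Keras model `dnn` and standardized feature
# matrices `X_train`, `X_test` (n_samples x 40 geometric features).
# KernelSHAP only needs a prediction function and background data,
# so it treats the DNN as a black box.
background = shap.sample(X_train, 100)               # background set used for "missing" features
explainer = shap.KernelExplainer(dnn.predict, background)

# SHAP values for one test instance: a list with one array per class,
# each holding the contribution of every geometric feature.
shap_values = explainer.shap_values(X_test[0:1], nsamples=500)
predicted_class = int(np.argmax(dnn.predict(X_test[0:1]), axis=1)[0])
contributions = shap_values[predicted_class][0]      # per-feature contributions for the predicted class
```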

2.1.2. Grad-CAM

Another prevalent method for explaining CNNs is Grad-CAM (Gradient-weighted Class Activation Maps) [25]. It is a method to visualize class activation maps of the CNN. With that, one can see where the model is `looking’. Grad-CAM works on the last convolutional layer of the model and uses the gradients that go into that layer (dependent on the target concept one wants to show an explanation for). Based on this, a heatmap is generated, which can be superimposed on the original image, thus showing what parts of the image activated the network and what the model based its decision on. Grad-CAM is a model-specific, post-hoc method. It is specifically made for explaining CNNs.
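As an illustration of the mechanism described above, the following is a hedged sketch of a manual Grad-CAM computation for a Keras CNN; the layer name and the final upsampling/overlay step are left to the caller. The actual explanations in this study were generated with the implementation from [34].

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, last_conv_layer_name):
    """Minimal Grad-CAM sketch: heatmap of the last conv layer for one class."""
    # Model mapping the input image to the last conv activations and the predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    # Gradients of the class score w.r.t. the conv feature maps.
    grads = tape.gradient(class_score, conv_out)
    # Global-average-pool the gradients to get one weight per channel.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of the feature maps, followed by ReLU.
    cam = tf.nn.relu(tf.einsum("bijc,bc->bij", conv_out, weights))[0]
    cam /= (tf.reduce_max(cam) + 1e-8)                # normalize to [0, 1]
    return cam.numpy()                                # caller upsamples and overlays on the image
```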

2.2. XAI in Affective Computing and FER

Like in other AI research areas, there has been an increase in research into explaining models in affective computing and FER as well. This research has been focused on CNNs for the most part, as this is the type of model that is the most prevalent in contemporary research.
In [10], Weitz et al. explained a CNN model that distinguishes pain from other emotions, such as happiness or disgust. For this, they used the XAI method Layer-wise Relevance Propagation (LRP) [26]. They found that while this gives some insights into the model’s decisions, it is not distinctive enough.
A challenge on explainability in computer vision was proposed in 2017 by Escalante et al. [12]. The main target of this challenge was to make an explainable model that examines videos of job candidates and gives a first impression in terms of the big five personality traits. An example submission is [27], where class activations and action units are used for explaining the predictions of their CNN model.
Both [28,29] use Shapley values to explain their models on sentiment analysis, although this analysis is in a different context than FER. Prajod et al. used LRP saliency maps to investigate whether a network has learned concepts (in this case action units), especially in the case where a network originally trained for emotion recognition is used as a base for transfer-learning a model to recognize pain [30].
Gund and Bharadwaj et al. propose a technique for extracting influential landmarks in [11]. They do this in the context of moving faces and use a CNN for the emotion classification. Then, class activation maps are used to find influential regions and from these, landmarks are extracted that are based on action units.

3. Proposed Method

We construct two different types of black-box models: a DNN and a CNN. Both models work on the same dataset and perform a facial expression classification task. The models are first optimized for this particular task and then we generate explanations detailing why the model made certain decisions. Finally, we evaluate these explanations by calculating the fidelity (in case of the DNN), comparing them and performing a user study. The two complete pipelines can be seen in Figure 1.

3.1. Geometric Features-Based DNN Modeling

3.1.1. Geometric Feature Extraction

The feature set used for the DNN is based on the geometric features from a previous study on an emotion classification task [14]. In the original study, Kaya et al. aligned and extracted landmarks from the images using Xiong and De la Torre’s Supervised Descend Method [31]. Through this approach 49 fitted landmarks are obtained. From these landmarks, geometric features can be calculated. Geometric features quantify the geometric configurations that are constructed from elements such as points (as in this case), lines, etc. and can represent e.g., distances, areas, angles.
Originally, there were 23 hand-crafted geometric features. We extended this set to include the slopes of the left and right eyebrows. Some features were originally averaged over the left and right parts of the face, which we reversed: the averaged features are less expressive, and separate features for both sides of the face are more useful when explaining the model’s decisions, particularly on posed faces. Eventually, we ended up with a set of 40 features. For further details of the geometric features, see Appendix C.
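To illustrate how such features are derived from the fitted landmarks, the sketch below computes a few example features (eye aspect ratio, eyebrow slope, normalized mouth width). The helper functions and the exact landmark indices are illustrative; the full feature definitions are listed in Table A2.

```python
import numpy as np

def distance(p, q):
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def slope_angle(p, q):
    """Angle (degrees) of the line p->q with the horizontal axis."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return float(np.degrees(np.arctan2(dy, dx)))

def eye_aspect_ratio(eye_pts):
    """Width/height ratio of an eye described by its landmark points."""
    xs, ys = eye_pts[:, 0], eye_pts[:, 1]
    return (xs.max() - xs.min()) / (ys.max() - ys.min() + 1e-8)

def extract_features(landmarks, face_height):
    # `landmarks` is a (49, 2) array of fitted landmark coordinates.
    # Indices follow Table A2 / Figure A1 and are treated here as illustrative.
    return {
        "eye_aspect_ratio_left": eye_aspect_ratio(landmarks[19:25]),
        "eyebrow_slope_left": slope_angle(landmarks[0], landmarks[4]),
        "mouth_width": distance(landmarks[31], landmarks[37]) / face_height,
    }
```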

3.1.2. SHAP-Based Explanation Generation

After the DNN has been constructed and trained on the data, we can generate explanations for its decisions. We do so using SHAP. For each image, the SHAP value of each feature is calculated. Next, we take the n features with the highest absolute value, i.e., the most important features. Each of those geometric features corresponds to the landmarks it was originally calculated from, and with those, the feature can be plotted on the face. Since we have different types of geometric features, the features are displayed as a line, an angle, an ellipse, or an aspect ratio. See Figure 2 for examples of how geometric features of different types are visualized. The features are coloured according to their SHAP value, from yellow to red: the redder the colour, the more important the feature was for the model’s decision.
Accompanying these visualizations, we generate textual explanations. In these texts, we mention what the model’s prediction for the image is and whether that is correct. If the prediction is incorrect, the true label is given. Furthermore, we list the names of the features that are plotted on the face in order of their importance. The sum of the SHAP values of the displayed features is calculated as a percentage of the sum of the SHAP values of all features and reported in the textual explanation.
We evaluate this method quantitatively by calculating both its fidelity and explanation accuracy. The fidelity is calculated by constructing new data points where only the top n features keep their original values and all other features are set to the training set average value for that feature (note that in case of feature value standardization, this value can be zero; note also that this is the way model-agnostic SHAP handles the missing attributes). We then let the model predict the class for this newly created data point and compare this to the model’s prediction of the original data point. The percentage of predictions that stay the same as the original prediction is the fidelity score.
The explanation accuracy score is calculated in a similar fashion, only now the new predictions are not evaluated against the model’s original prediction, but against the ground truth. Note that this is a distinct measure from the model’s accuracy, which is simply the percentage of correctly classified examples. To avoid confusion, the accuracy measure for explanations shall be called the “explanation accuracy” from this point onward. Furthermore, we calculate the relative cumulative SHAP weight for the top n features by dividing the sum of the SHAP values of those features by the sum of all features’ SHAP values. A plot of the aforementioned measures helps the analysis of fidelity convergence.
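The following sketch summarizes the fidelity and explanation accuracy computation described above. It assumes the SHAP values are stored as one array per class (as returned by KernelSHAP) and that `baseline` holds the training set mean of each feature (zeros for standardized features); these assumptions mirror the description but are not the authors' exact code.

```python
import numpy as np

def fidelity_and_accuracy(model, X, y_true, shap_values, n, baseline):
    """Keep only the top-n SHAP features per instance; replace the rest by the
    training-set mean (`baseline`), then compare the new predictions with the
    original predictions (fidelity) and with the ground truth (explanation accuracy)."""
    orig_pred = np.argmax(model.predict(X), axis=1)
    X_masked = np.tile(baseline, (len(X), 1))
    for i in range(len(X)):
        # Rank features by absolute SHAP value for the predicted class.
        top = np.argsort(-np.abs(shap_values[orig_pred[i]][i]))[:n]
        X_masked[i, top] = X[i, top]
    new_pred = np.argmax(model.predict(X_masked), axis=1)
    fidelity = np.mean(new_pred == orig_pred)       # agreement with the model
    expl_accuracy = np.mean(new_pred == y_true)     # agreement with the ground truth
    return fidelity, expl_accuracy
```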

3.2. End-to-End CNN Modeling

For the CNN, we make use of a pre-trained model, since the used dataset is rather small and CNNs are very prone to overfitting. The model we use as a base was originally created by Dresvyanskiy et al. in [15]. They took a CNN model that was pre-trained on the VGGFace2 dataset [32]–which is mainly used for face recognition–and then further fine-tuned it on the Aff-Wild dataset [33]. Their model uses the same discrete seven emotions as we do, so no further alterations to the model’s architecture were needed. We then fine-tune their model on the KDEF dataset. Thereafter, we generate explanations to gain insights into the model’s decision making using Grad-CAM (implemented by [34]) and SHAP.

3.3. User Survey

In order to evaluate the plausibility of both models’ explanations, we perform a user study where participants answer questions on their understanding of and trust in the models. The user study consists of two main parts: evaluating the geometric features-based explanations and comparing the explanations for the DNN and the CNN. All questions on the explanations are posed in the form of statements together with a Likert item [35] from one to five (i.e., strongly disagree, disagree, neither agree nor disagree, agree, strongly agree). The questions can be found in Section 4.4.2. The introductory texts and consent form can be found in Appendix B.
In the first part, the participants first see five example images from the test set, showing the original image and the model’s probability distribution. The number of examples where the model made a wrong prediction is chosen in accordance with the final model’s test set accuracy. After examining these images, they answer ten questions regarding their understanding of and trust in the model.
After answering these questions, a new batch of five example images is shown. This time the participants see five different images from the test set, but with the visual explanation images (i.e., the most important geometric features plotted on the face) and the accompanying textual explanation. Afterwards, they answer the same ten questions as with the previous batch.
For the second part, the participants are shown seven example images (one for each class) where the explanations for the DNN and the CNN predictions on the same image are shown side-by-side. Also shown are the probability distributions for each model prediction. Thereafter, the participants answer ten questions where they compare both models and their respective explanations.

4. Experimental Setting

4.1. Dataset

The dataset used for all models is the Karolinska Directed Emotional Faces (KDEF) dataset [36]. It consists of 4900 images of human faces displaying seven different emotional expressions. The expressions consist of the six basic emotions–anger, disgust, fear, happiness, sadness, surprise–as defined by Ekman [37], extended with neutral.
In total, there were 70 participants (35 male, 35 female) who were each photographed twice for each of the expressions from five different angles (full left profile, half left profile, straight, half right profile, full right profile). Participants’ faces were centered on a grid such that their eyes and mouths were positioned at fixed coordinates.
We omitted all pictures with the full left profile and full right profile orientation, since those poses are much more difficult to classify and the goal of this research is not to develop the most all-round facial emotion classifier. Ultimately, we ended up with 1509 training set images (subject IDs 12–29), 504 validation set images (IDs 30–35) and 923 test set images (IDs 01–11). This split is used for all models. The class distribution was balanced.
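A subject-disjoint split of this kind can be expressed compactly; the sketch below assumes KDEF-style filenames in which characters 3 and 4 encode the subject ID, which is an assumption about the file naming rather than part of the method itself.

```python
def split_of(filename):
    """Assign a KDEF image to train/val/test based on its subject ID."""
    subject_id = int(filename[2:4])      # e.g. "AF01ANS.JPG" -> 1 (assumed naming scheme)
    if 12 <= subject_id <= 29:
        return "train"                   # 1509 images
    if 30 <= subject_id <= 35:
        return "val"                     # 504 images
    return "test"                        # IDs 01-11, 923 images
```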

4.2. DNN-Based System Development

For the DNN, we use the 40 geometric features extracted from the images as input. The features are standardized (z-normalized) using mean and standard deviation statistics estimated from the training set.
We construct a feed-forward neural network with the last layer being a dense layer with seven neurons using softmax activation. Between hidden layers, ReLU (Rectified Linear Unit) activation is used to obtain non-linearity. During training, we used the Adam optimizer [38].
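As a rough illustration of this architecture, the following Keras sketch builds such a feed-forward network. The number of hidden layers, layer sizes, dropout and regularization values are placeholders; the actual values were selected by the hyperparameter tuning described next (see Appendix A).

```python
import tensorflow as tf

def build_dnn(n_features=40, n_classes=7, hidden=(352, 256), dropout=0.3, l2=0.01):
    # Feed-forward network over the standardized geometric features.
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for units in hidden:
        model.add(tf.keras.layers.Dense(
            units, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(l2)))
        model.add(tf.keras.layers.Dropout(dropout))
    # Final dense layer with seven neurons and softmax activation.
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="sparse_categorical_crossentropy",  # assumes integer labels
                  metrics=["accuracy"])
    return model
```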

4.2.1. Hyperparameter Tuning

The optimal architecture of the network is found by means of hyperparameter tuning using the Keras Tuner [39] with the included Hyperband algorithm [40], which has been shown to be more efficient than Random Search [41] and the commonly used Grid Search, which is known to suffer from the curse of dimensionality. Each architecture is evaluated on the validation set accuracy, where accuracy is defined as the percentage of correctly classified data points.
On top of the standard Hyperband algorithm, we add an early stopping callback with patience 1 to the search, which in this case means that a configuration stops training once its validation set loss does not decrease for one epoch. This increases efficiency and decreases overfitting. The following hyperparameter configurations were tested (a sketch of the tuner setup is given after the list):
  • Number of hidden layers: 1–4.
  • Number of neurons per layer: 32–512, in steps of 32.
  • l2 regularization [42]: {0.1, 0.001, 0.0001}.
  • Dropout [43] after each hidden layer: 0–0.9, in steps of 0.1.
  • Learning rate: {0.1, 0.01, 0.001, 0.0001}.
The optimal hyperparameter settings for the final DNN models can be found in Appendix A.
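Below is a minimal sketch of how this search could be set up with the Keras Tuner; the search space follows the list above, while the max_epochs budget and the fit arguments are illustrative assumptions.

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(40,))])
    for i in range(hp.Int("n_layers", 1, 4)):                       # 1-4 hidden layers
        model.add(tf.keras.layers.Dense(
            hp.Int(f"units_{i}", 32, 512, step=32), activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(
                hp.Choice(f"l2_{i}", [0.1, 0.001, 0.0001]))))
        model.add(tf.keras.layers.Dropout(hp.Float(f"dropout_{i}", 0.0, 0.9, step=0.1)))
    model.add(tf.keras.layers.Dense(7, activation="softmax"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Choice("lr", [0.1, 0.01, 0.001, 0.0001])),
        loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.Hyperband(build_model, objective="val_accuracy", max_epochs=50)  # budget assumed
stop_early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=1)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val), callbacks=[stop_early])
```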

4.2.2. Splitting on Pose

We explored whether it would be worthwhile to split the data on pose and train a model per subset, i.e., a separate dataset and model for half left, frontal, and half right. To this end, we trained four different models: one for each pose and one for the complete dataset. All the models’ hyperparameters were optimized separately.
We then compared the validation set accuracy of the model trained on the complete dataset with the concatenated accuracy of the three other models. Concatenated accuracy is calculated by counting the number of correct predictions across all three models on their respective validation datasets and dividing this amount by the total number of instances in the complete validation dataset.
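In code, the concatenated accuracy is simply a pooled ratio, as the short sketch below shows for three (correct, total) pairs.

```python
def concatenated_accuracy(results):
    # `results` is a list of (n_correct, n_total) pairs, one per pose-specific model,
    # evaluated on that model's own validation subset.
    correct = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return correct / total
```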
To develop a complete pipeline using such a split on pose, one would also need a pose classifier that automatically determines an image’s face orientation.

4.2.3. Feature Selection

The complete geometric feature set consists of 40 features, but this set could contain redundant features. The goal of feature selection (FS) is to obtain a compact subset of features that describes the dataset, eliminating irrelevant or noisy features [44]. Redundant features are those that provide no extra information to the model (i.e., the feature is not needed for correct classification of the data points), but such features can add noise and thus introduce bias into the model. This affects how well the model generalizes and hence the performance on unseen data. On top of that, the smaller the feature set, the lower the computation time.
In order to eliminate redundant features from the complete feature set, we used several feature selection algorithms to select a feature subset, trained and tuned models with that subset and compared validation set accuracy with the no-feature selection performance baseline.
The first technique we tested is forward sequential feature selection (FSFS), implemented in Scikit-learn [45]. This is a wrapper method, which means it uses the model as a black box predictor and evaluates the performance on a certain feature subset. FSFS starts with an empty set of features and at each iteration it adds the feature that yields the highest performance gain. This continues until the specified amount of features is reached.
Furthermore, we tested recursive feature elimination (RFE), proposed by Guyon et al. [44], also implemented in Scikit-learn. RFE is an iterative process consisting of three steps: train a model, compute the ranking of the features and finally eliminate the feature with the lowest ranking. This process continues until the set of features is reduced to a certain amount. In our case, we used a logistic regression model to estimate the feature ranking, since that can be taken directly from the model’s coefficients.
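Both selectors are available in Scikit-learn; the sketch below shows how they might be invoked. The logistic-regression estimator inside FSFS, the number of selected features and the cross-validation setting are illustrative stand-ins rather than the exact configuration used here.

```python
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)

# Forward sequential feature selection (wrapper method): start from an empty set
# and greedily add the feature with the highest performance gain.
fsfs = SequentialFeatureSelector(
    estimator, n_features_to_select=30, direction="forward", cv=3)
fsfs.fit(X_train, y_train)
selected_fsfs = fsfs.get_support()        # boolean mask over the 40 features

# Recursive feature elimination, ranked by the logistic-regression coefficients.
rfe = RFE(estimator, n_features_to_select=30)
rfe.fit(X_train, y_train)
selected_rfe = rfe.get_support()
```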
Another technique we tried was picking the n features with the highest global SHAP value (i.e., the most important features over the entire dataset) as the feature subset. For this, we first calculate the SHAP score for each feature by summing the absolute SHAP score for each feature for each data point across all seven classes. Then we rank the features according to this total score and take the top n features.
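A compact sketch of this global ranking, reusing the per-class SHAP arrays from the KernelSHAP example above, could look as follows.

```python
import numpy as np

def top_n_by_global_shap(shap_values, n):
    # shap_values: list of (n_samples, n_features) arrays, one per class.
    # Sum |SHAP| over all instances and all seven classes, then keep the top n.
    global_importance = sum(np.abs(sv).sum(axis=0) for sv in shap_values)
    return np.argsort(-global_importance)[:n]     # indices of the n most important features
```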
The final technique consisted of hand-picking feature subsets based on domain knowledge. One can argue that features belonging to the left side of the face are more important to the half left model than for the half right model, and vice versa. Therefore, we constructed two feature subsets. Both contained the features that do not correspond to a particular side of the face (e.g., mouth width) and all features that correspond to either the left or the right side of the face. This last method is only tested on the half left model and the half right model, as the frontal model does not necessarily benefit from excluding features belonging to a particular side.
For each feature selection method, we test subsets with 5 to 35 features in steps of 5. Ultimately, for each model we pick the feature subset that yields the highest validation set accuracy.

4.3. CNN-Based System Development

4.3.1. Preprocessing

In the study where the base model was developed, the original images were detected and aligned using RetinaFace [46]. To provide the model with images as similar as possible to those it was trained on, we use the same alignment method for the KDEF dataset, using the implementation from [47]. Note that this is a different alignment method than the one used for the geometric features-based DNN. For the DNN, we needed not only a method that aligns the images, but also one that extracts the landmarks from the images in order to calculate the geometric features, whereas this is not needed for the CNN. On top of face extraction, the resulting images are also resized to 224 by 224 pixels.
For the CNN, we do not split the dataset on pose, as this would decrease the dataset size even further, which would make the model more prone to overfitting.

4.3.2. Fine-Tuning

For the fine-tuning, we do not change anything about the original architecture of the model, as that model works with the same classes as we do. We do add a data augmentation layer to artificially increase the dataset size in order to reduce overfitting. The data augmentation consists of randomly rotating, shearing, zooming and horizontally flipping the images. We also added l2 regularization [42] to the layers.
We freeze the n bottom layers of the model and train only the unfrozen top layers. Again, we add an early stopping callback with various patience values. As learning rate/optimizer, we test both Adam and Stochastic Gradient Descent (SGD) with an exponential decay learning schedule.
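The sketch below outlines this fine-tuning setup in Keras, using the augmentation ranges reported in Section 5.3. Here `base_model` stands for the pre-trained network from [15], and the number of unfrozen layers, the loss and the batch size are illustrative assumptions.

```python
import tensorflow as tf

# Data augmentation with the ranges reported in Section 5.3.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=50, shear_range=0.5, zoom_range=0.5, horizontal_flip=True)

# Freeze all but the top three layers (the feed-forward classification layers).
for layer in base_model.layers[:-3]:
    layer.trainable = False

# SGD with an exponentially decaying learning rate.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.9)
base_model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule),
    loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
# base_model.fit(datagen.flow(X_train, y_train, batch_size=32),
#                validation_data=(X_val, y_val), callbacks=[early_stop])
```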

4.4. User Survey Construction

4.4.1. Environment

All participants answered the questions independently on a computer. The user study was made using Google Forms. Before answering any questions, participants were informed of the nature of the study, what their task consisted of and what their answers could be used for. They had to give their consent to their answers being used in a research study before they could carry on answering questions. The complete questionnaire can be found at https://github.com/kayatb/GeomExp (accessed on 1 September 2022).
The group of participants consisted of 12 people. Every participant completed high school or a form of higher education. Most participants rated their level of knowledge on AI as neutral or better (on a scale of 1–5).

4.4.2. Questions

With the help of the user study, we want to quantify several qualities of the geometric features-based explanations. On top of that, we want to compare those explanations with the ones for the CNN on a human-level. All questions are answered via a Likert item from 1 (Strongly Disagree) to 5 (Strongly Agree).
In [8], Hoffman et al. propose several checklists to evaluate the goodness, satisfaction and trust of explanations generated for AI systems. Goodness refers to how good an explanation is, determined by factors such as clarity and precision. Satisfaction is defined as: “the degree to which users feel that they understand the AI system or process being explained to them” [8]. Several of their proposed questions are used in the first set of questions for the user study.
The System Usability Scale (SUS) [48] is a widely used tool to measure the usability of a system. This scale can be adapted to refer to a system’s explanatory power instead of referring to a system’s usability. As is done in [49], where Holzinger et al. propose the System Causability Scale (SCS), which extends SUS to measure the quality of explanations in terms of causability, we extend SUS to be used in our user study. Several questions in the first question set are based on the SUS. Ultimately, the first question set consists of the following:
1. The output representations help me understand how the model works. (adapted from [8])
2. The output representations of how the model works are satisfying. (adapted from [8])
3. The output representations are sufficiently detailed. (adapted from [8])
4. The output representations let me know how confident the model is for individual predictions.
5. The output representations let me know how trustworthy the model is. (adapted from [8])
6. I found the output representations unnecessarily complex. (adapted from [48])
7. I think I would need an expert to give me additional explanations. (adapted from [48])
8. The outputs of the model are very predictable. (adapted from [8])
9. The model can perform the task better than a novice human. (adapted from [8])
10. I am confident in the model. I believe it works well. (adapted from [8])
The second set of questions is partly extended from the first set. Instead of referring to the explainability of a single model, these questions make a comparison between two models. Again, the questions deal with goodness, satisfaction and trust, but this time in terms of which explanation the user finds better on several aspects. Again, we took into account questions regarding the intelligibility, complexity, level of detail and trust in the explanations. In these questions, there is a consistent reference to “model 1” and “model 2”. In all cases, model 1 refers to the CNN and model 2 refers to the DNN. The second question set consists of the following:
1. The explanations for model 1 are more understandable than those for model 2.
2. I trust model 1 more than model 2.
3. I would prefer the explanations of model 1 over those for model 2.
4. The explanations for model 1 are more detailed than those for model 2.
5. The explanations for model 1 are clearer on the model’s accuracy than those for model 2.
6. The explanations for model 1 reflect the model’s confidence on each prediction better than those of model 2.
7. Model 1’s explanations are more unnecessarily complex than those of model 2.
8. The explanations for model 1 were more precise than those for model 2.
9. I would follow model 1’s advice over that of model 2.
10. The outputs of model 1 were more predictable than those of model 2.

4.4.3. Hypotheses

The hypotheses for questions 1, 2, 3, 5, 6, 8, 9, and 10 from the first question set are that the examples with explanations receive higher scores than the examples without, indicating that showing the explanations increases understanding of and trust in the model. The hypothesis for question 4 is that the score will be lower for the second batch, since the probability distributions are shown in the first batch but not in the second. This is done on purpose, so that we can check whether participants answered the questions seriously. The hypothesis for question 7 is likewise that the score after the second batch is lower than after the first, since this question is phrased in the reverse manner from the others.
For question set 2, the hypotheses for questions 1, 2, 3, 4, 7, 8, 9, and 10 are that the score is below the expected median score of 3, which sits right in the centre of the five-point scale we use, indicating that explanations for model 1 (the CNN) score lower than the explanations for model 2 (the DNN). For questions 5 and 6 we performed a two-sided test, since we present the probability distributions for both models, which can be used for answering these questions.
We will thus use one-sided tests everywhere, apart from questions 5 and 6 from question set 2, since for the other questions we only want to test whether the explanations for the DNN increase understanding and are preferred over the explanations for the CNN.

4.4.4. Statistical Tests

The questions in the user study are stated in the form of Likert items. The data obtained from Likert items are generally seen as ordinal. This means the responses have an order, but the distance between values is not necessarily equal. That is why some argue that parametric tests such as the t-test or ANOVA cannot be used and that a non-parametric test should instead be used to determine the statistical significance of the results [50]. However, there is discussion surrounding this, see for example [51]. For the analysis of the user study results, we have decided to use non-parametric tests.
For the first and second batch of questions (i.e., question set 1), we used the Wilcoxon signed-rank test [52], which works on two related samples; in this case the first batch of examples without any explanations and the second batch with the explanations given. Both batches use the same questions, so we will perform a test of significance on each pair of questions. The null hypothesis for such a test is that both samples are taken from the same distribution.
The third batch of questions (i.e., question set 2) stands on its own. We have thus used the Mann–Whitney U test [53] to test the significance. This test works on two unrelated samples. For the second sample we use the expected outcome/median for each question: 3. This can be compared to a one-sample t-test. The null hypothesis here is that for any randomly selected scores s1 from sample 1 and s2 from sample 2, it holds that Pr(s1 > s2) = Pr(s2 > s1).
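Both tests are available in SciPy; the sketch below shows how they might be invoked for one question. The variable names and the one-sided alternatives are assumptions that follow the hypotheses stated above.

```python
import numpy as np
from scipy import stats

# batch1 and batch2: Likert responses (1-5) to the same question before and
# after the explanations were shown; one-sided Wilcoxon signed-rank test for
# the hypothesis that batch2 scores are higher than batch1 scores.
w_stat, w_p = stats.wilcoxon(batch1, batch2, alternative="less")

# Question set 2: compare the responses against the expected median of 3 with
# a Mann-Whitney U test (one- or two-sided depending on the question).
expected = np.full(len(responses), 3)
u_stat, u_p = stats.mannwhitneyu(responses, expected, alternative="less")
```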

5. Results

5.1. Experimental Results for Geometric Features-Based DNN Models

5.1.1. Comparative Results Using Pose-Based Models

In Table 1, we can see that the concatenated accuracy score of the three pose-specific models is higher than the accuracy score of the model trained on the whole dataset. The separate validation set accuracy scores of the models trained on a single pose are also all higher than that of the no-split model.

5.1.2. Feature Selection Results

From Table 2, we can see that the frontal model has the highest performance with a subset of 30 features, decided by RFE. The half left model has three optimal feature sets consisting of 24 or 25 features. We decided to take the feature set picked by FSFS, as the training set accuracy was closest to the validation set accuracy for that set, indicating less overfitting than with the other two sets. For the half right model, a subset of 30 features picked by FSFS yields the best results.
For the full hyperparameter configurations of the final DNN models see [54].

5.2. Geometric Feature Explanation Results

In Figure 3, we can see a steady increase in fidelity until a fidelity of 1.0 is reached when all features are used (as should be expected). A fidelity of 0.8 is first reached with 9, 5, and 6 features for the frontal, half left and half right models, respectively. These numbers of features yield explanation accuracy scores of 0.7013, 0.6639 and 0.6943 on the test set, respectively. In all three plots, the slope decreases as the number of features increases, showing a convergence trend. The explanation accuracy is equal to the model’s final test set accuracy when all features are used (again, as expected).

5.3. Experimental Results for CNN Models

For the CNN model, we trained the final model with the following data augmentation settings:
  • Rotation range: 50
  • Shear range: 0.5
  • Zoom range: 0.5
  • Horizontal flip
These settings give quite aggressive data augmentation, but this was necessary to combat the tendency of the model to overfit on the relatively small dataset.
Furthermore, we used the SGD optimizer with an exponentially decaying learning rate schedule with an initial value of 0.01, a decay rate of 0.9 and a decay step size of 10,000 (i.e., the learning rate decreases after this many steps). The regularization value was set to 0.01 for each layer and the early stopping patience was 2. We only unfroze the top three layers for fine-tuning; the rest remained as they were in the base model. These top three layers are the feed-forward classification layers.
This configuration obtained a training set accuracy of 0.9708 and a validation set accuracy of 0.7778.

5.4. Comparing Explanations for the DNN and CNN

In Figure 4, example explanations of the CNN using Grad-CAM (second row) and DNN (third row) can be seen. We decided to omit the explanations generated for the CNN using SHAP. These explanations were of a lower quality than those of Grad-CAM, since there was a lot of noise in the explanations with seemingly random pixels coloured. It should also be noted that explanations for the CNN generated by SHAP take dramatically longer to compute than those generated by Grad-CAM.
To exemplify the textual explanation, the top left visual DNN explanation is accompanied by the following text:
This person’s emotion is classified as DISGUST. This classification is CORRECT. The following five features, listed from most important to less important, contributed for 36.6% to the decision:
1. Left eye aspect ratio (ratio between eye width and eye height).
2. Angle from bottom mouth to left upper mouth.
3. Angle from left mouth corner to top of the mouth.
4. Distance between the centre of the left eye and the left inner eyebrow.
5. Left lower eye outer angle.
For the geometric features explanations, we decided to visualize five features, which seems like the minimum given the fidelity scores of these explanations. This could easily be extended to show more features.

5.5. User Study Results

For determining the significance of the results, we take α = 0.05 for all questions. In Table 3, the results of the Wilcoxon signed-rank test on the question pairs from question set 1 can be found. Scores for questions 1, 3, and 4 show a significant difference between the first and second question batch. Question 4 was a control question, and the significantly lower scores for this question show that participants filled in the questions seriously and that the posterior distributions provide significant information about prediction confidence. The significant difference for question 1 shows that the participants’ understanding of the model increased after they saw the explanations. The participants also found the explanations sufficiently detailed, as indicated by the significant difference for question 3.
Question 6 does not have a significantly higher score, which is a positive outcome. This indicates that the participants did not find the explanations unnecessarily complex compared to having no explanation. Questions 5, 9, and 10 regard the trust in the model’s performance. None of these questions show a significant increase in score after the participants saw the explanations. In line with the literature [55], this shows that building trust in a technology/model is not easy and may require longitudinal exposure to or testing of the technology.
The results of the Mann–Whitney U test on question set 2, where the explanations for the CNN and those for the DNN are compared, can be found in Table 4. The scores for questions 1, 3, 4, 7 and 8 are significantly below the median. Questions 5 and 6 do not show a significant difference in scores in either direction, which is to be expected. After all, the probability distributions for both models were given in all examples. Like before, these can be used as control questions.
The scores for questions 1, 4 and 8 are significantly lower, which shows that the participants found the explanations for model 2 (the DNN with geometric features) more understandable, detailed, and precise than those for model 1 (the CNN). Moreover, the participants would also prefer the explanations for model 2 over those for model 1, as indicated by the significantly lower score for question 3. There is a significantly lower score for question 7 as well, which indicates that the participants found the explanations for model 2 to be more complex than those for model 1. Even so, it seems they would still prefer the more complex explanations. For questions 2 and 9, we see the same pattern as in the previous tests: the questions regarding trust in the models do not show a significantly lower score. Likewise, there is no significant difference regarding the predictability of the model outputs (question 8 in the first set and question 10 in the second).

5.6. Discussion

Our main goal in this paper was to compare alternative DNN-based approaches in terms of their predictive performance and explainability using both objective quantitative measures and a user study. However, these explanations may well be presented together, in case these two models are combined at the decision level. Since the geometric features-based DNN and appearance-based CNN models use alternative, complementary representations, their decision fusion is likely to improve the predictive performance [14,56]. To further improve the predictive performance using the DNN and CNN in an explainable manner, we experimented with simple weighted fusion. Here, to fuse the posterior probabilities of the two models, a fusion weight γ ∈ [0, 1] is optimized on the validation set in steps of 0.1. We found the best validation set accuracy of 0.8392 with γ = 0.5, which actually degenerates to unweighted score fusion. Using this setting, we reach a test set accuracy of 0.8418, which advances the state-of-the-art on this dataset using three poses (see Table 5).
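The weighted score fusion itself is a one-line operation; the sketch below shows how γ could be tuned on validation data. The variable names (p_dnn_val, p_cnn_val, y_val) are placeholders for the two models' posterior probabilities and the validation labels.

```python
import numpy as np

def fuse_and_evaluate(p_dnn, p_cnn, y_true, gamma):
    # p_dnn and p_cnn: (n_samples, 7) softmax outputs of the two models on the same images.
    p_fused = gamma * p_dnn + (1.0 - gamma) * p_cnn
    return np.mean(np.argmax(p_fused, axis=1) == y_true)

# Optimize the fusion weight on the validation set in steps of 0.1.
best_gamma = max(np.arange(0.0, 1.01, 0.1),
                 key=lambda g: fuse_and_evaluate(p_dnn_val, p_cnn_val, y_val, g))
```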
The final test set accuracy scores of the models and a few recent models on the KDEF set can be seen in Table 5. The state-of-the-art models were chosen by looking at recent research (published after 2018), where the KDEF dataset was used. It should be noted these do not necessarily use the same train-validation-test set split as we have used in this study.
In [59], a combination of image pre-processing and different CNN models was used. The preprocessing consisted of converting the images to greyscale, data augmentation, cropping the images using Haar feature-based cascade classifiers [60], and downsampling the images to reduce memory usage. The CNN that obtained the best results on the KDEF dataset was the model fine-tuned over the model initially trained by Yu and Zhang [61]. To obtain the result as reported in Table 5, they use only the frontal and half rotated images, as we have done.
The result in [57] was obtained by detecting the faces using the method from [60] as well, then segmenting the image into four parts (i.e., right eye, left eye, nose, mouth). Next, they extracted features from those segments using Gabor filters and fed these features into a K-nearest neighbours model for classification. They used only the frontal oriented images from the KDEF set.
In [58], a CNN model was trained only on the frontal images of the KDEF set. The faces were extracted from the images using the technique from [60], like the other examples. They trained CNN models with different architectures and used Grad-CAM and saliency maps to compare the different models on explainability.
A simple decision fusion of our experimented models outperformed the state-of-the-art models, even though this was not the main target of the study. The geometric features-based DNN performs somewhat better than the CNN model (based on the combined score of the three DNN models). This could partly be attributed to the size of the dataset: the CNN would be more likely to outperform the DNN if a more substantial amount of data had been used. A future work in this direction would be to use a multi-stage pretraining approach for the CNN, as done in [15], and to include other FER corpora for training the DNN.
For the geometric feature explanations, we achieve good results on the fidelity scores, with only approximately 25–30% of the complete feature set necessary to give a fidelity score of above 0.8 (i.e., 80% of the predictions stay the same when using only these feature values). The explanations approximate the model to a high extent.
For the CNN, explanations using Grad-CAM can be generated, which are based on the gradients inside the model when predicting an image. However, we cannot calculate measures such as fidelity and accuracy for these explanations in a precise and straightforward manner, due to the nature of the calculations with which these explanations are constructed. Consequently, we do not have a direct measure of how well the explanations actually approximate the CNN model. Even though we generate explanations that seem to show where the model looks, it is not certain that this is actually what the model bases its decisions on.
This issue is not present in our explanations for the geometric features-based DNN. We have calculated fidelity and accuracy scores for these explanations, which show the explanations stay true to the model to a high degree. Thus, the explanations indeed say something about the decision-making process of the model.
Furthermore, the explanations generated for the DNN seem more precise than those for the CNN. By their nature, geometric features are inherently interpretable. Therefore, visualizing them and giving their names should be enough to know what the model sees. For the CNN, this is not the case. As in the leftmost example images from Figure 4, we can see that the model roughly looks at the mouth, but what it sees about the mouth is still hard to grasp. With the geometric features, we can pinpoint exactly that the model is looking at, e.g., mouth width or lip curvature.
Finally, the results obtained from the user study indicate that the explanations for the (geometric features-based) DNN increase participants’ understanding of the underlying model. They also found those explanations better than the more frequently used heatmap explanations for CNNs on points such as understandability, preciseness and level of detail. Even though the DNN explanations were found to be more complex, the participants would still prefer those explanations. However, the trust in the models was not increased by the DNN explanations, nor was there a significant difference between the trust in the DNN and the CNN.

6. Conclusions

In this work, we have developed an alternative way of explaining a model that classifies facial expressions. In particular, this method displays the most important geometric features (as calculated by SHAP) plotted on the original images. We developed both geometric features-based DNN models and a CNN model. On the small KDEF dataset used here, the DNN models outperform the CNN model, although examples can be found in the literature of models that outperform all of our models.
We argue that the more conventional methods for FER are better explainable than the state-of-the-art CNNs, as can be seen from the explanations we developed and compared with methods for visualizing the decision process of CNNs. This might not matter for mundane tasks, but for high-stakes decision processes it is vital that we can understand what the model is doing. We believe it would be better, in critical areas, to focus on developing and using interpretable models, or models that are easier to explain, rather than the prevalent black-box models.
The geometric feature explanations in this study are based on a DNN, which admittedly is a black-box model as well. However, the explanations we have developed here are not limited to such a model and can be generalized to other types; the explanations are model-agnostic given the geometric feature set. In other words, any model that can make use of geometric features could use the explanations. An intrinsically interpretable logistic regression model could also visualize its decisions in the same way. More recently developed interpretable models that can perform more on-par with deep-learning models such as [21] can also use this technique to visualize their decisions.
This study can be extended by using a more extensive dataset with fewer posed images than the KDEF dataset and seeing how the geometric features perform in such an environment. Furthermore, a dataset with more emotions than these seven could be tested. Finally, the deployed CNN can benefit from multi-stage fine-tuning as in [15].
Lastly, the user study in particular was limited in this work and can be extended in quite a few ways; further research should be put into the human evaluation of the proposed method. In conclusion, we have explored and laid the groundwork for a different way of explaining a system for FER. The first results regarding the quality and plausibility of these explanations look promising. Questions on several points still exist and are open for further examination.

Author Contributions

K.t.B.: Conceptualization, methodology, scripting, experimental validation, writing—original draft preparation, visualization; H.K.: conceptualization, methodology, analysis of results, supervision, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

At the beginning of the user study conducted to evaluate the DNN/CNN explanations, the participants digitally provided an informed consent for their evaluations to be included in the study. See Appendix B.2 for details.

Data Availability Statement

KDEF dataset is available for research studies from the data owners: https://www.kdef.se/ (accessed on 1 December 2021). Scripts to reproduce this work are available at https://github.com/kayatb/GeomExp (accessed on 1 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AIArtificial Intelligence
CNNConvolutional Neural Network
DNNDeep Neural Network
FERFacial Emotion Recognition
FRFrontal
FSFeature Selection
FSFSForward Sequential Feature Selection
Grad-CAMGradient-weighted Class Activation Mapping
HLHalf Left
HRHalf Right
KDEFKarolinska Directed Emotional Faces
LIMELocal Interpretable Model-Agnostic Explanations
LRPLayer-wise Relevance Propagation
ReLURectified Linear Unit
RFERecursive Feature Elimination
SCSSystem Causability Scale
SGDStochastic Gradient Descent
SHAPSHapley Additive exPlanation
SUSSystem Usability Scale
XAIeXplainable Artificial Intelligence

Appendix A. Final DNN Configurations

In Table A1, the final configurations of all DNN models used for the reported results can be found.
Table A1. Final configurations of the DNN models after splitting on pose, finding the best feature subset using feature selection and hyperparameter tuning. HL: Half Left, HR: Half Right.
Hyperparameter | Frontal | HL | HR
Number of hidden layers | 2 | 1 | 1
Learning rate | 0.001 | 0.001 | 0.001
No. neurons hidden layer 1 | 352 | 64 | 352
Regularisation rate hidden layer 1 | 0.01 | 0.01 | 0.001
Dropout hidden layer 1 | 0.3 | 0 | 0.6
No. neurons hidden layer 2 | 256 | - | -
Regularisation rate hidden layer 2 | 0.01 | - | -
Dropout hidden layer 2 | 0.8 | - | -

Appendix B. User Study

Appendix B.1. Research Description

This research is done in the context of my Bachelor thesis Artificial Intelligence at Utrecht University. In this research, I want to analyze and evaluate new ways of explaining uninterpretable machine learning models. The purpose of this survey is to quantify the quality of automatic explanations (e.g., in terms of clarity and plausibility) generated for two types of Deep Neural Network models trained to predict facial expressions.
Your task is to evaluate and compare different explanation methods for two machine learning models. This will be done using closed questions. No personal data is required or being collected. The survey takes 6-8 min to complete.

Appendix B.2. Consent Form

The participant states:
  • I voluntarily agree to participate in the research project.
  • I agree that I will not be paid for my participation.
  • I have been informed of the nature of the research project.
  • I understand that statistical data gathered from this survey can be used in a scientific publication.
  • I understand that my participation will remain anonymous.
  • I agree that my data can be shared with other researchers to answer possible other research questions.

Appendix B.3. General Questions

What is the highest degree or level of school you have completed?
  • No degree
  • Elementary school
  • High school
  • MBO
  • HBO
  • Bachelor’s degree
  • Master’s degree
  • Doctorate degree
I am very knowledgeable on the subject of Artificial Intelligence (AI).
  • 1—Strongly disagree
  • 2—Disagree
  • 3—Neutral
  • 4—Agree
  • 5—Strongly agree

Appendix C. All Geometric Features

The complete set of geometric features, which are based on the landmarks visualized in Figure A1, as used in the DNN models is displayed in Table A2. These features were extended from [14].
Table A2. The full geometric feature set. * indicates the feature was added here and not used in the original paper. Distance-based features are normalized by face height. Landmark numbers can also be found in Figure A1.
Feature # | Description | Landmarks | Feature Type
1 | Eye aspect ratio (L) | [19, 24] | Distance
2 | Eye aspect ratio (R) | [25, 30] | Distance
3 | Mouth aspect ratio | [31, 34, 37, 40] | Distance
4 | Upper lip angle (L) | [31, 34] | Angle
5 | Upper lip angle (R) | [34, 37] | Angle
6 | Nose tip—mouth corner angle (L) | [16, 31] | Angle
7 | Nose tip—mouth corner angle (R) | [16, 37] | Angle
8 | Lower lip angle (L) | [31, 41] | Angle
9 | Lower lip angle (R) | [37, 39] | Angle
10 | Eyebrow slope (L) | [0, 4] | Angle
11 | Eyebrow slope (R) | [5, 9] | Angle
12 | Lower eye outer angles (L) | [19, 24] | Angle
13 | Lower eye inner angles (L) | [22, 23] | Angle
14 | Lower eye outer angles (R) | [28, 29] | Angle
15 | Lower eye inner angles (R) | [25, 30] | Angle
16 | Mouth corner—mouth bottom angle (L) | [31, 40] | Angle
17 | Mouth corner—mouth bottom angle (R) | [37, 40] | Angle
18 | Upper mouth angles (L) | [33, 40] | Angle
19 | Upper mouth angles (R) | [35, 40] | Angle
20 | Curvature of lower-outer lips (L) | [31, 41, 42] | Curvature
21 | Curvature of lower-outer lips (R) | [37, 38, 39] | Curvature
22 | Curvature of lower-inner lips (L) | [31, 40, 41] | Curvature
23 | Curvature of lower-inner lips (R) | [37, 39, 40] | Curvature
24 | Bottom lip curvature | [31, 37, 40] | Curvature
25 | Mouth opening/mouth width | [43–48] | Distance
26 | Mouth up/down | [34, 40, 44] | Distance
27 | Eye—middle eyebrow distance (L) | [0, 4, 19, 22] | Distance
28 | Eye—middle eyebrow distance (R) | [5, 9, 25, 28] | Distance
29 | Eye—inner eyebrow distance (L) | [4, 19, 22] | Distance
30 | Eye—inner eyebrow distance (R) | [5, 25, 28] | Distance
31 | Inner eye—eyebrow centre (L) | [2, 22] | Distance
32 | Inner eye—eyebrow centre (R) | [7, 25] | Distance
33 | Inner eye—mouth top distance (L) | [22, 34] | Distance
34 | Inner eye—mouth top distance (R) | [25, 34] | Distance
35 | Mouth width | [31, 37] | Distance
36 | Mouth height | [34, 40] | Distance
37 | Upper mouth height | [34, 44, 47] | Distance
38 | Lower mouth height | [40, 44, 47] | Distance
39 | Outer mid eyebrow slope (L) * | [0, 2] | Slope
40 | Outer mid eyebrow slope (R) * | [7, 9] | Slope
Figure A1. All landmarks with their corresponding numbers annotated. The numbers correspond to the numbers in Table A2. Example image is AF01AFS from the KDEF dataset.
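To make the definitions in Table A2 concrete, the sketch below computes three representative features from a (49, 2) array of landmark coordinates indexed as in Figure A1. The exact point pairings and normalization details are assumptions for illustration only; the released scripts contain the definitive implementations.

```python
import numpy as np

# Illustrative sketch (not the authors' exact code) of three features from
# Table A2, given `lm`, a (49, 2) array of landmark (x, y) coordinates
# indexed as in Figure A1, and the face height in pixels.

def eye_aspect_ratio_left(lm):
    # Feature 1: vertical over horizontal extent of the left eye
    # (landmarks 19-24). The exact point pairing is an assumption.
    eye = lm[19:25]
    width = eye[:, 0].max() - eye[:, 0].min()
    height = eye[:, 1].max() - eye[:, 1].min()
    return height / (width + 1e-8)

def eyebrow_slope_left(lm):
    # Feature 10: angle (degrees) of the line through the outer (0) and
    # inner (4) left-eyebrow landmarks.
    dx, dy = lm[4] - lm[0]
    return np.degrees(np.arctan2(dy, dx))

def mouth_width(lm, face_height):
    # Feature 35: distance between the mouth corners (31, 37),
    # normalized by face height as stated in the Table A2 caption.
    return np.linalg.norm(lm[37] - lm[31]) / face_height
```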

References

  1. Tao, J.; Tan, T. Affective computing: A review. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Beijing, China, 22–24 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 981–995.
  2. Ko, B.C. A brief review of facial emotion recognition based on visual information. Sensors 2018, 18, 401.
  3. Wang, W.; Yang, Y.; Wang, X.; Wang, W.; Li, J. Development of convolutional neural network and its application in image classification: A survey. Opt. Eng. 2019, 58, 040901.
  4. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
  5. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
  6. Letham, B.; Rudin, C.; McCormick, T.H.; Madigan, D. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat. 2015, 9, 1350–1371.
  7. Weitz, K.; Schiller, D.; Schlagowski, R.; Huber, T.; André, E. ‘Do you trust me?’ Increasing user-trust by integrating virtual agents in explainable AI interaction design. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, Paris, France, 2–5 July 2019; pp. 7–9.
  8. Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Metrics for explainable AI: Challenges and prospects. arXiv 2018, arXiv:1812.04608.
  9. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
  10. Weitz, K.; Hassan, T.; Schmid, U.; Garbas, J. Towards explaining deep learning networks to distinguish facial expressions of pain and emotions. In Forum Bildverarbeitung; Institut für Industrielle Informationstechnik (IIIT): Karlsruhe, Germany, 2018; pp. 197–208.
  11. Gund, M.; Bharadwaj, A.R.; Nwogu, I. Interpretable Emotion Classification Using Temporal Convolutional Models. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6367–6374.
  12. Escalante, H.J.; Guyon, I.; Escalera, S.; Jacques, J.; Madadi, M.; Baró, X.; Ayache, S.; Viegas, E.; Güçlütürk, Y.; Güçlü, U.; et al. Design of an explainable machine learning challenge for video interviews. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3688–3695.
  13. Escalante, H.J.; Kaya, H.; Salah, A.A.; Escalera, S.; Güçlütürk, Y.; Güçlü, U.; Baró, X.; Guyon, I.; Jacques, J.C.; Madadi, M.; et al. Modeling, Recognizing, and Explaining Apparent Personality from Videos. IEEE Trans. Affect. Comput. 2022, 13, 894–911.
  14. Kaya, H.; Gürpinar, F.; Afshar, S.; Salah, A.A. Contrasting and combining least squares based learners for emotion recognition in the wild. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 459–466.
  15. Dresvyanskiy, D.; Ryumina, E.; Kaya, H.; Markitantov, M.; Karpov, A.; Minker, W. End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild. Multimodal Technol. Interact. 2022, 6, 11.
  16. Selbst, A.D.; Powles, J. Meaningful information and the right to explanation. Int. Data Priv. Law 2017, 7, 233–242. Available online: https://academic.oup.com/idpl/articlepdf/7/4/233/22923065/ipx022.pdf (accessed on 1 May 2022).
  17. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
  18. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
  19. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 2018, 51, 1–42.
  20. Jacovi, A.; Goldberg, Y. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Online, 2020; pp. 4198–4205.
  21. Nori, H.; Jenkins, S.; Koch, P.; Caruana, R. InterpretML: A unified framework for machine learning interpretability. arXiv 2019, arXiv:1909.09223.
  22. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA; Volume 30, pp. 4768–4777.
  23. Shapley, L.S. A value for n-person games. In Class. Game Theory; Kuhn, H.W., Ed.; Princeton University Press: Princeton, NJ, USA, 1997; pp. 69–79.
  24. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
  25. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  26. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 2015, 10, 1–46.
  27. Ventura, C.; Masip, D.; Lapedriza, A. Interpreting CNN models for apparent personality trait regression. In Proceedings of the CVPR Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 55–63.
  28. Bobek, S.; Tragarz, M.M.; Szelażek, M.; Nalepa, G.J. Explaining Machine Learning Models of Emotion Using the BIRAFFE Dataset. In Artificial Intelligence and Soft Computing; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 290–300.
  29. Liew, W.S.; Loo, C.K.; Wermter, S. Emotion Recognition Using Explainable Genetically Optimized Fuzzy ART Ensembles. IEEE Access 2021, 9, 61513–61531.
  30. Prajod, P.; Schiller, D.; Huber, T.; André, E. Do Deep Neural Networks Forget Facial Action Units?–Exploring the Effects of Transfer Learning in Health Related Facial Expression Recognition. arXiv 2021, arXiv:2104.07389.
  31. Xiong, X.; De la Torre, F. Supervised Descent Method and Its Application to Face Alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 532–539.
  32. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 67–74.
  33. Kollias, D.; Tzirakis, P.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; Schuller, B.; Kotsia, I.; Zafeiriou, S. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vis. 2019, 127, 907–929.
  34. Korobov, M.; Lopuhin, K. ELI5. 2016. Available online: https://eli5.readthedocs.io/en/latest/ (accessed on 1 May 2022).
  35. Likert, R. A technique for the measurement of attitudes. Arch. Psychol. 1932, 140, 55.
  36. Lundqvist, D.; Flykt, A.; Öhman, A. The Karolinska Directed Emotional Faces-KDEF; CD ROM from Department of Clinical Neuroscience, Psychology Section; Karolinska Institutet: Solna, Sweden, 1998; ISBN 91-630-7164-9.
  37. Ekman, P. Basic emotions. Handb. Cogn. Emot. 1999, 98, 16.
  38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  39. Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 1 May 2022).
  40. Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 2017, 18, 6765–6816.
  41. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
  42. Ng, A.Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 78.
  43. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
  44. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422.
  45. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  46. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5203–5212.
  47. Zheng, E. Batch Face. 2020. Available online: https://github.com/elliottzheng/batch-face (accessed on 1 May 2022).
  48. Brooke, J. SUS: A quick and dirty usability scale. Usability Eval. Ind. 1995, 189, 6.
  49. Holzinger, A.; Carrington, A.; Müller, H. Measuring the quality of explanations: The system causability scale (SCS). In KI-Künstliche Intelligenz; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–6.
  50. Jamieson, S. Likert scales: How to (ab)use them? Med. Educ. 2004, 38, 1217–1218.
  51. Norman, G. Likert scales, levels of measurement and the “laws” of statistics. Adv. Health Sci. Educ. 2010, 15, 625–632.
  52. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83.
  53. Mann, H.B.; Whitney, D.R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 1947, 18, 50–60.
  54. ter Burg, K. Explaining DNN Based Facial Expression Classifications. BSc Thesis, Utrecht University, Utrecht, The Netherlands, 2021.
  55. Davis, B.; Glenski, M.; Sealy, W.; Arendt, D. Measure Utility, Gain Trust: Practical Advice for XAI Researchers. In Proceedings of the 2020 IEEE Workshop on TRust and EXpertise in Visual Analytics (TREX), Salt Lake City, UT, USA, 25 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–8.
  56. Toisoul, A.; Kossaifi, J.; Bulat, A.; Tzimiropoulos, G.; Pantic, M. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nat. Mach. Intell. 2021, 3, 42–50.
  57. Mahmud, F.; Islam, B.; Hossain, A.; Goala, P.B. Facial region segmentation based emotion recognition using K-nearest neighbors. In Proceedings of the International Conference on Innovation in Engineering and Technology (ICIET), Dhaka, Bangladesh, 27–28 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5.
  58. Kandeel, A.A.; Abbas, H.M.; Hassanein, H.S. Explainable Model Selection of a Convolutional Neural Network for Driver’s Facial Emotion Identification. In ICPR Workshops and Challenges; Del Bimbo, A., Cucchiara, R., Sclaroff, S., Farinella, G.M., Mei, T., Bertini, M., Escalante, H.J., Vezzani, R., Eds.; Springer: Cham, Switzerland, 2021; pp. 699–713.
  59. Puthanidam, R.V.; Moh, T.S. A Hybrid approach for facial expression recognition. In Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, Langkawi, Malaysia, 5–7 January 2018; pp. 1–8.
  60. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 1, p. I.
  61. Yu, Z.; Zhang, C. Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 435–442.
Figure 1. The two constructed pipelines. The top pipeline illustrates the process of the geometric features-based DNN and the bottom one illustrates the CNN process. The example image used in the pipeline is AF01AFS from the KDEF dataset.
Figure 2. Examples of how geometric features are displayed based on the landmarks they originate from. From left to right, top to bottom, the geometric features displayed are: lower eye outer angles (L), eye aspect ratio (R), bottom lip curvature, outer mid eyebrow slope (L), eye—inner eyebrow distance (L), mouth opening/mouth width.
Figure 3. Fidelity and accuracy scores per top n SHAP features for the frontal, half left and half right model. Also given is the proportion of the SHAP weight of the top n features with regard to the total weight of all features. All scores are calculated on the test set.
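As a rough illustration of how per-feature SHAP weights such as those summarized in Figure 3 can be obtained with the shap package [22], consider the sketch below. The explainer choice (a model-agnostic KernelExplainer), the background sample, and the aggregation into a top-n weight proportion are assumptions for illustration; computing the fidelity and accuracy curves additionally requires re-querying the models with only the top-n features retained, which depends on the masking strategy in the released scripts and is not reproduced here.

```python
import numpy as np
import shap  # SHapley Additive exPlanations [22]

# Sketch only: `model` is a trained DNN on geometric features, `X_background`
# a small sample of training rows, `X_test` the (possibly subsampled) test rows.
# KernelExplainer is slow, so it is typically run on a subset of the data.
explainer = shap.KernelExplainer(model.predict, X_background)
shap_values = explainer.shap_values(X_test)  # older shap releases: one array per class

# Global importance per feature: mean absolute SHAP value over classes and samples.
importance = np.mean([np.abs(sv) for sv in shap_values], axis=(0, 1))

# Proportion of total SHAP weight carried by the top-n features (cf. Figure 3).
n = 10
top_n = np.argsort(importance)[::-1][:n]
weight_proportion = importance[top_n].sum() / importance.sum()
```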
Figure 4. Example explanations for the CNN using Grad-CAM (second row) and for the DNN (third row), covering all seven emotions. The same images are used for both methods; their KDEF codes are, from left to right: AF07DIHR, AM11AFHL, AF04NES, AF03ANHR, AF01SAS, AM01HAS, BM03SUHL. The top and bottom rows give the corresponding probability distributions for the images shown.
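For reference, a Grad-CAM heatmap like those in the second row of Figure 4 can be produced for a convolutional Keras model along the following lines. The layer name, preprocessing, and model structure are placeholders; the sketch follows the standard Grad-CAM recipe [25] rather than the paper's exact implementation.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name="last_conv"):
    # Map the input to (activations of the chosen conv layer, predictions).
    # Assumes a functional Keras model; `conv_layer_name` is a placeholder.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    # Gradient of the class score w.r.t. the conv feature maps.
    grads = tape.gradient(class_score, conv_out)
    # Channel weights: global average of the gradients over the spatial axes.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of feature maps, ReLU, then normalise to [0, 1].
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1))[0]
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()  # upsample to the input size before overlaying on the face
```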
Table 1. Validation set accuracy comparison of the pose-based and non-pose based models. Overall represents the accuracy score obtained from concatenating the predictions from pose-based models.
Model | Accuracy
Frontal | 0.780
Half Left | 0.790
Half Right | 0.828
Overall pose-based | 0.794
No pose split | 0.756
Table 2. The highest validation set accuracy per algorithm for each pose model, together with the number of selected features. Handpicked means manually excluding right- and left-oriented features for the left and right poses, respectively. Best accuracy per pose: RFE for Frontal, a three-way tie between FSFS, RFE and Handpicked for Half Left, and FSFS for Half Right.
Algorithm | Frontal # Feats. | Frontal Accuracy | HL # Feats. | HL Accuracy | HR # Feats. | HR Accuracy
FSFS | 25 | 0.7857 | 25 | 0.8070 | 30 | 0.8448
RFE | 30 | 0.7976 | 25 | 0.8070 | 20 | 0.8276
SHAP | 35 | 0.7857 | 35 | 0.7895 | 35 | 0.8362
Handpicked | - | - | 24 | 0.8070 | 24 | 0.8017
No FS | 40 | 0.7798 | 40 | 0.7895 | 40 | 0.8276
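The FSFS and RFE rows of Table 2 correspond to standard wrapper methods available in scikit-learn [45]. The sketch below shows how such subsets can be selected; it uses a logistic-regression stand-in estimator and placeholder arrays X_train and y_train, whereas the paper wraps the selection around its DNN models, so the accuracies in Table 2 will not be reproduced by this code.

```python
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Stand-in estimator; X_train (n_samples, 40 features) and y_train are placeholders.
est = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination down to 30 features (cf. the Frontal RFE cell).
rfe = RFE(est, n_features_to_select=30).fit(X_train, y_train)

# Forward Sequential Feature Selection of 25 features (cf. the FSFS rows).
fsfs = SequentialFeatureSelector(
    est, n_features_to_select=25, direction="forward").fit(X_train, y_train)

X_train_rfe = rfe.transform(X_train)    # reduced training matrices
X_train_fsfs = fsfs.transform(X_train)
```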
Table 3. Results of the Wilcoxon signed-rank test on question pairs from question set 1. Question numbers are the same as in Section 4.4.2. H_a refers to the alternative hypothesis: whether the answers from the second sample would be greater than or less than those from the first sample. Reported are the W- and p-values obtained from the tests. * indicates p < α.
Question # | H_a | W-Value | p-Value
1 | greater | 0.0 | 0.001 *
2 | greater | 6.5 | 0.100
3 | greater | 4.5 | 0.015 *
4 | less | 39.5 | 0.021 *
5 | greater | 16.5 | 0.415
6 | greater | 0.0 | 0.118
7 | less | 23.0 | 0.061
8 | greater | 14.5 | 0.198
9 | greater | 2.0 | 0.718
10 | greater | 9.0 | 0.327
Table 4. Results of the Mann–Whitney U test on questions from question set 2. Question numbers are the same as used in Section 4.4.2. H_a refers to whether the alternative hypothesis was that model 1 was evaluated as worse than model 2 (less) or a two-sided test (unequal). * indicates p < α.
Question # | H_a | U-Value | p-Value
1 | less | 42.0 | 0.031 *
2 | less | 66.0 | 0.364
3 | less | 42.0 | 0.031 *
4 | less | 18.0 | 0.0003 *
5 | unequal | 48.0 | 0.105
6 | unequal | 54.0 | 0.154
7 | less | 36.0 | 0.011 *
8 | less | 24.0 | 0.001 *
9 | less | 66.0 | 0.364
10 | less | 9.0 | 0.327
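Both tests reported in Tables 3 and 4 are available in SciPy. The sketch below shows how one-sided variants of the Wilcoxon signed-rank test [52] and the Mann–Whitney U test [53] can be run; the arrays are placeholders for the Likert responses collected in the user study, and the pairing of samples with questions follows Section 4.4.2.

```python
from scipy.stats import wilcoxon, mannwhitneyu

# Placeholder arrays of Likert responses per question.
# Wilcoxon signed-rank (Table 3): with alternative="greater", SciPy tests
# whether the paired differences x - y are shifted above zero.
w_stat, p_w = wilcoxon(x, y, alternative="greater")

# Mann-Whitney U (Table 4): with alternative="less", SciPy tests whether the
# first group tends to give lower ratings than the second.
u_stat, p_u = mannwhitneyu(group_model1, group_model2, alternative="less")
```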
Table 5. The test set accuracy scores for the final DNN models (and their concatenated score) and the CNN. Included are three state-of-the-art scores on the KDEF dataset.
Model | Test Set Accuracy
Mahmud et al. [57] (FR) | 0.8602
Kandeel et al. [58] (FR) | 0.8888
GEO-DNN Frontal (FR) | 0.8117
GEO-DNN Half Left (HL) | 0.7395
GEO-DNN Half Right (HR) | 0.7913
GEO-DNN Combined (FR, HR, HL) | 0.7832
CNN (FR, HR, HL) | 0.7595
Fusing DNN & CNN (FR, HR, HL) | 0.8418
Puthanidam and Moh [59] (FR, HR, HL) | 0.8086
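The "Fusing DNN & CNN" row of Table 5 combines the outputs of the two classifiers. As a hedged illustration only, a simple score-level fusion that averages the two models' class-probability vectors could look as follows; the fusion rule actually used in the paper is described in the main text and may differ.

```python
import numpy as np

# Hedged sketch of score-level fusion: mix the per-class probabilities of the
# geometric-feature DNN and the CNN, then take the arg-max per image.
def fuse_predictions(probs_dnn, probs_cnn, weight_dnn=0.5):
    probs = weight_dnn * probs_dnn + (1.0 - weight_dnn) * probs_cnn
    return probs.argmax(axis=1)  # predicted emotion index per image
```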
