Explainable Depression Detection Based on Facial Expression Using LSTM on Attentional Intermediate Feature Fusion with Label Smoothing

Machine learning can be used as a fast pre-diagnosis approach to mitigate the effects of Major Depressive Disorder (MDD). The objective of this research is to detect depression using a set of important facial features extracted from interview videos, e.g., head rotation radians, gaze angles, and action unit intensities. The model is based on an LSTM with an attention mechanism and combines these features using an intermediate fusion approach. Label smoothing was applied to further improve the model's performance. Unlike other black-box models, integrated gradients are presented as the model explanation, showing the important features for each patient. The experiment was conducted on 474 video samples collected at Chulalongkorn University, divided into 134 depressed and 340 non-depressed cases. The results show that our model achieves the best performance, with an 88.89% F1-score, 87.03% recall, 91.67% accuracy, and 91.40% precision. Moreover, the model captures important features of depression, including head turning, unfocused gaze, slow eye movement, no smiles, frowning, grumbling, and scowling, which express a lack of concentration, social disinterest, and negative feelings, consistent with assumptions in depressive theories.


Introduction
Today, Major Depressive Disorder (MDD) is a widespread condition that affects people all over the world. The illness can lead to suicide and other life-threatening situations. Furthermore, the effects of COVID-19 have made this problem worse [1][2][3][4][5]; during the epidemic, the number of depressed patients rose rapidly. A fast diagnosis and appropriate care are essential to reducing the risk of life-threatening depression. However, the majority of people are unaware of their disease and unable to access appropriate medical care, and access to medical staff and therapy is insufficient for the population as a whole. Artificial intelligence can therefore be used to support medical care as a primary decision tool or as a decision-support tool.
Clinical interviews [6] are one technique for diagnosing depression. The diagnostic procedure includes a list of questions that assess anhedonia (the inability to experience pleasure); anergia (a persistent feeling of being run down); concentration; appetite; sleep; guilt; and suicidality. During interviews, the patient's expression, posture, voice tone, and the content of their answers are scrutinized. Using facial expression and body language, a psychiatrist can identify depression. Similar to a psychiatrist's observation, artificial intelligence is capable of recognizing facial expressions and body language without the usual ambiguity or language barrier. As a result, the focus of this study is solely on facial expression.
Facial expression can be described by the Facial Action Coding System (FACS) [7], which encodes the relationship between muscle use and emotion. An example of an action unit in this coding system is shown in Figure 1. The system describes not only facial emotion but also head posture and eye contact, which reflect entire facial movements. Its ability to convert video input into numerical data makes it very useful for machine learning models. OpenFace [8] is an open-source tool that can recognize facial expressions encoded as action units; it can estimate head posture, identify facial action units, and estimate eye gaze. We use this tool to extract head posture, facial action units, and eye gaze as input for machine learning models. Compared to raw video, the extracted data is lightweight, so the resulting machine learning model is compact and portable. In order to let non-psychiatrists quickly identify depression, we plan to use a machine learning model that mimics how depression is diagnosed through the observation of facial expressions.
To address this issue, we employed the tools listed above to develop machine learning models that accurately identify depression for medical professionals, without interference from privacy concerns or language barriers. Additionally, we applied an explainable approach to the model in order to make its essential features more comprehensible to humans and to increase our understanding of the association between depression and facial expression. Our main contributions are as follows:

•
In order to choose the best-performing model, we evaluated three types of fusion modeling techniques using the transformer, window block LSTM, and Bi-LSTM models. In addition, we used feature selection and label smoothing techniques to improve model accuracy.

•
Data used in this study are extracted from clinical interview videos using the OpenFace tool in time-series format. Data are classified as depressed or non-depressed.

•
We apply integrated gradients [9] over the model and sort the significant features in order to describe the main characteristics of facial expression that can detect depression.
The rest of the paper is structured as follows: Section 2 summarizes related work. Section 3 explains how the data set was prepared and describes the proposed methods. Section 4 presents the results and discussion. Finally, Section 5 concludes the study.

Related Work
Depression and facial expression are linked through the brain and nervous system. This relationship has been studied using electroencephalographic (EEG) studies [10]. The findings demonstrate that EEG analysis can identify the pattern of muscle use when a person displays an emotion on their face, such as a grin, rage, or sadness, in the same way as experiments that make use of functional magnetic resonance imaging (fMRI). Two fMRI investigations [11,12] examine how depressed and non-depressed patients respond to happy and sad faces, respectively. The results show that depressed patients' brains react differently from healthy individuals' to sad faces compared to happy faces. Although people rarely exhibit their emotions in everyday situations, during clinical interviews they frequently use their faces to convey their true feelings, because speech and facial expression go together [13]. Moreover, people can tell the difference between depressed patients and healthy people just by looking at them [14]. In a similar vein, artificial intelligence can spot depression while subjects perform cognitive tasks [15]; the observed cues are reduced mouth or eye movements during the task. There is thus evidence from earlier studies that people or machine learning models can identify depression by studying particular facial expressions.
The use of artificial intelligence in diagnosing psychiatric diseases is currently expanding. The input information and methods used to diagnose psychological problems can be divided into three and five categories, respectively [16]. MRI, EEG, and kinesics diagnosis (which includes behavioral, facial, and other physical data) are the three primary inputs. Bayesian models, logistic regression, decision trees, support vector machines, and deep learning are the five strategies. In this study, we used deep learning approaches (LSTM and transformer models) to improve the performance of kinesics diagnosis, using facial expression, head pose, and eye gaze to diagnose depression.
As shown in Table 1, numerous methods have been proposed in recent years for identifying depression or categorizing its severity. The majority of them make use of every human behavior modality, such as video, voice, and speech content text, as input, passing through multiple models to improve performance. The Detecting Depression with AI Sub-Challenge (DDS) of the Audio/Visual Emotion Challenge and Workshop (AVEC 2019) [17] is a well-known challenge that explores depression identification. This sub-challenge included the E-DAIC data set with extracted facial feature data, voice, and speech content text. The winner of this sub-challenge is a multi-level attention network utilizing text, audio, and video for depression prediction [18]; using a multi-model of three modalities, it obtained a concordance correlation coefficient (CCC) of 0.67. With a CCC of 0.733, a multi-transformer model has also been applied to the E-DAIC data set [19]. Its inputs are voice and facial features, and the proposed approach combines PHQ-8 regression and five-level PHQ-8 classification for multi-task learning. The DAIC-WOZ data set was used as input in proposed algorithms that concentrate only on facial features. One proposed strategy focuses on minimizing the number of AUs in a feed-forward neural network (FFNN) by employing particle swarm optimization (PSO) [20] to choose the best AU predictors. The most accurate predictors are AU04, AU06, AU09, AU10, AU15, AU25, AU26, AU12, AU23, AU28, and AU45, with an accuracy rate of 97.83%. Another proposed method reduced and selected facial features from the DAIC-WOZ data set using the Fisher Discriminant Ratio (FDR) and Incremental Linear Discriminant Analysis (ILDA) [21]; its best performance on the DAIC-WOZ data set is an F1-score of 0.805. Based on facial features, movement intensity, speech, and text, studies of posttraumatic stress disorder (PTSD) and major depressive disorder (MDD) have been conducted on a private data set of 81 patients [22]. According to the results, MDD classification reached 0.86 accuracy, while PTSD classification reached 0.9 accuracy.
The limitations of all-modality studies are privacy concerns and language barriers. The most restrictive aspect is the language barrier: even if such a model performs well, it cannot be used by many people worldwide because of the numerous languages involved. In addition, researchers find it difficult to acquire large data sets and are unable to share them with each other or obtain them from public sources because of privacy concerns. Due to these restrictions, such machine learning models are difficult to use and enhance. By focusing on facial expression modalities extracted with OpenFace version 2, we may overcome both the language barrier and the privacy concerns.

Materials and Methods
The proposed methods are shown overall in Figure 2. In feature extraction, we first extract time-series feature files from the videos, and then in feature selection, we choose a subset of the features. To evaluate performance, the selected features are run through multiple fusion models with label smoothing. The best-performing model is then passed to integrated gradients for the visualization of key features.

Input and Target Preparation
We used input normalization to enhance the network's convergence properties. The longest input time series is no more than 11 min. Since 11 min × 60 s × 30 frames per second = 19,800 frames, the number of timesteps is 19,800. To fit this length, inputs shorter than 11 min were zero-padded.
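The padding step described above can be sketched as follows; this is a minimal illustration, assuming each clip arrives as a NumPy array of shape (timesteps, num_features), which is not specified in the paper:

```python
import numpy as np

MAX_TIMESTEPS = 11 * 60 * 30  # 11 min x 60 s x 30 fps = 19,800 frames

def pad_sequence(features: np.ndarray) -> np.ndarray:
    """Zero-pad a (timesteps, num_features) array to MAX_TIMESTEPS rows."""
    timesteps, num_features = features.shape
    padded = np.zeros((MAX_TIMESTEPS, num_features), dtype=features.dtype)
    padded[:timesteps] = features  # original frames first, zeros after
    return padded

# e.g. a 2-minute clip with the 40 selected features
clip = np.random.rand(2 * 60 * 30, 40)
print(pad_sequence(clip).shape)  # (19800, 40)
```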
Figure 3 shows the diversity of the data set as estimated by Deepface [23]. Because of the privacy protection policy of the data collection protocol, the private information of participants is not collected; therefore, we used Deepface to estimate these statistics. The raw data includes depression levels of normal, mild, moderate, and severe. The levels are classified by the Thai version of the 7-item Hamilton Depression Rating Scale (HAMD-7) [24], translated and reviewed by five psychiatrists. The normal level means no depressive symptoms; people can carry out their daily activities as usual. The mild level means depressive symptoms are noticeable and interfere with daily activity, for instance insomnia, weight loss, and irritability. The moderate level means depressive symptoms cause problems with self-esteem, productivity, sensitivities, excessive worrying, and feelings of worthlessness. Symptoms of the severe level include hallucinations and suicidal thoughts or behaviors. Because facial features provide only a slight depression cue, we chose to distinguish between depressed and non-depressed rather than classifying the four depressive levels. In order to frame the task as binary classification, we define normal and mild as non-depressed and moderate and severe as depressed. The raw data contains 106 normal, 234 mild, 112 moderate, and 22 severe samples. It is divided into three independent data sets for training, development, and testing in the ratio 80:10:10. Data balancing is achieved by duplicating samples of the minority class in the training data set, as shown in Table 2.
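The binary mapping above can be expressed directly; summing the per-level counts recovers the 340 non-depressed and 134 depressed samples reported in the abstract:

```python
from collections import Counter

# HAMD-7 level -> binary class (0 = non-depressed, 1 = depressed)
LEVEL_TO_CLASS = {"normal": 0, "mild": 0, "moderate": 1, "severe": 1}

raw_counts = {"normal": 106, "mild": 234, "moderate": 112, "severe": 22}

binary_counts = Counter()
for level, n in raw_counts.items():
    binary_counts[LEVEL_TO_CLASS[level]] += n

print(dict(binary_counts))  # {0: 340, 1: 134}
```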

Features Extraction
The feature extraction process, from videos to the extracted feature files used as the model's input, is shown in Figure 4. Participants must respond to numerous questions in a clinical interview, but each video records only one participant's response, so there are multiple videos per participant. The OpenFace tool extracts features from the participant videos. After extraction, we join the time-series feature files from the same participant so that they can be used as model input. The time-series features contain head posture, gaze, and action units.

Data Preprocessing

Features Selection
The extracted features include the head posture, with three rotational features (pitch, roll, and yaw) and three location features (x, y, and z axes); the gaze, with six vector features (x, y, and z axes for both eyes) and two radian features (angles x and y); and the action units, with 18 presence features and 17 intensity features. The total number of extracted features is 49. However, similar features can be removed in order to eliminate redundant data and reduce model size.
Just as the gaze feature has vector and radian feature groups, the head posture feature has rotation and location feature groups that repeat the same information in different units, as shown in Figure 5. To compare them, we experimented with each group using a single model. The findings, shown in Table 3, indicate that the head posture rotation group and the gaze radian group perform well in classifying depression. The model input is therefore chosen from the head posture rotation and gaze radian features. The action units have presence and intensity feature groups, which use distinct estimation models, so both groups are kept. In summary, the three location features of head posture and the six vector features of gaze are eliminated; out of 49 features, 40 are selected.

Label Smoothing

Label smoothing [25] is a regularization technique that can improve the performance of machine learning models, particularly in the training of neural networks for classification tasks. Label smoothing can prevent a model from becoming overconfident in its predictions when the true distribution of the data is not clear-cut, by introducing a small amount of uncertainty into the labels. This method is well suited to improving our model, since depressive syndrome is not clearly determinable, especially at the mild and moderate levels. The formula for label smoothing is shown in Equation (1):

y_k^LS = y_k (1 − α) + α / K, (1)

where y_k^LS is the soft label, y_k is the hard label, α is a label-smoothing factor in the range 0 to 1, and K is the number of classes. In binary classification, the hard labels are [0, 1], where 0 denotes false and 1 denotes true; for example, with α = 0.1 and K = 2, the hard labels [0, 1] become the soft labels [0.05, 0.95]. To reduce computation time, we chose to apply label smoothing with α from 0 to 0.9 in 0.1 increments solely on the best model.

Model Architecture
We chose fusion techniques to merge specific facial features in order to enhance the performance of the facial-feature model. There are various possible combinations of fusion techniques; the three basic techniques used in this study are early fusion, intermediate fusion, and late fusion. The Window Block LSTM, Transformer, and Bi-LSTM models are assigned to each technique for assessing performance.

Early Fusion Model Architecture
Early fusion models combine features using a concatenate layer, which passes all 40 features to a single model. The layers are shown in Figure 6A-C for Bi-LSTM, Window Block LSTM, and Transformer, respectively. Following the concatenate layer, the Bi-LSTM model has a Bi-LSTM layer with 128 hidden units, an attention layer, and feed-forward layers with 64, 32, and 1 units. Because data are collected at 30 frames per second, the Window Block LSTM model first passes through a reshape layer that reshapes the input to (sample number, 660, 30, 40). The following layers are a time-distributed LSTM with 64 hidden units, time-distributed attention, and feed-forward layers with 32, 16, and 1 units. The Transformer model uses 1 transformer block, a head size of 512, 1 attention head, and a feed-forward dimension of 2048. Because the time series is too long for the transformer, an average pooling layer that averages every 15 frames is employed to shorten the time-series range before it enters the model.
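The window-block reshape can be checked with a toy tensor: 19,800 frames split exactly into 660 one-second blocks of 30 frames each.

```python
import numpy as np

x = np.zeros((4, 19800, 40))         # (samples, timesteps, features)
windows = x.reshape(4, 660, 30, 40)  # 660 windows x 30 frames, since 19800 = 660 * 30
print(windows.shape)  # (4, 660, 30, 40)
```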

Intermediate Fusion Model Architecture
In the intermediate fusion model, each feature group passes through its own layer and attention layer before the concatenate layer and feed-forward layers. Figure 7A-C display the three model types. In the Bi-LSTM model, the Bi-LSTM layers for Pose, Gaze, AU_r, and AU_c have 64, 16, 128, and 128 hidden units, respectively, followed by feed-forward layers with 128, 64, and 1 units. For the Window Block LSTM, the LSTM layers for Pose, Gaze, AU_r, and AU_c have 32, 32, 64, and 64 hidden units, respectively, followed by feed-forward layers with 128, 64, 32, 16, and 1 units. For each feature group, the transformer model is the same as in the early fusion model; the difference is that a concatenate layer is applied before the feed-forward layer.
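The intermediate-fusion idea — encode each feature group separately with its own attention, then concatenate — can be sketched in plain NumPy. This is a conceptual stand-in, not the paper's implementation: a linear-plus-tanh encoder replaces the Bi-LSTM layers, the weights are random placeholders, and only the hidden sizes (64, 16, 128, 128) follow the Bi-LSTM variant described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, hidden):
    """One encoder per feature group: linear projection + tanh, then softmax
    attention pooling over time (a stand-in for Bi-LSTM + attention)."""
    W = rng.standard_normal((in_dim, hidden)) * 0.1
    a = rng.standard_normal(hidden) * 0.1
    def encode(x):                        # x: (T, in_dim)
        h = np.tanh(x @ W)                # (T, hidden)
        scores = h @ a                    # (T,) attention scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax over time
        return w @ h                      # (hidden,) attention-pooled summary
    return encode

# feature groups: (input dim, hidden units) as in the intermediate Bi-LSTM model
groups = {"pose": (3, 64), "gaze": (2, 16), "au_r": (17, 128), "au_c": (18, 128)}
encoders = {name: make_encoder(d, h) for name, (d, h) in groups.items()}

T = 300  # e.g. 10 s at 30 fps
inputs = {name: rng.standard_normal((T, d)) for name, (d, _) in groups.items()}

# intermediate fusion: encode each group separately, then concatenate
fused = np.concatenate([encoders[name](inputs[name]) for name in groups])
w_out = rng.standard_normal(fused.size) * 0.05
prob = 1.0 / (1.0 + np.exp(-(fused @ w_out)))  # sigmoid head (feed-forward stack omitted)
print(fused.shape)  # (336,)
```

The fused vector has 64 + 16 + 128 + 128 = 336 dimensions; in the actual model, the feed-forward layers after the concatenate layer map this summary to the binary prediction.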

Integrated Gradient Explanation
Integrated Gradients (IG) [9] is an explainability technique used to understand the relationship between input features and the predictions of machine learning models, particularly neural networks. Although IG is usually utilized in image classification to visualize important input features, we selected IG to interpret a time-series binary classification model because IG can be applied to a machine learning model without modifying the original deep neural network.
The implementation of IG for the time-series binary classification model is the same as for image classification. First, the path from a baseline time-series input (a zero-initialized time series) to the actual time-series input is divided into equally spaced intermediate steps. For each intermediate step, the gradient of the model's output with respect to the input is computed. The computed gradients over all the intermediate steps are then integrated, and the integrated gradient for each input feature indicates the contribution of that feature to the final prediction. In this way, the important input features of the time series are revealed. However, a time series cannot be visualized as intuitively as an image with an IG mapping, so we calculate the mean of the IG values per feature to visualize the impact of each feature, and the absolute mean to rank the impacts.
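The steps above can be sketched with a toy differentiable model (a sigmoid over a weighted sum, whose gradient is analytic, standing in for a trained network with backpropagation). A useful sanity check is IG's completeness property: the attributions sum to F(x) − F(baseline).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w):
    """Toy classifier over a (T, d) time series: sigmoid of a weighted sum."""
    return sigmoid(np.sum(x * w))

def model_grad(x, w):
    """Analytic gradient of the toy model w.r.t. every input element."""
    p = model(x, w)
    return p * (1.0 - p) * w

def integrated_gradients(x, w, steps=200):
    baseline = np.zeros_like(x)                # all-zero time series, as in the paper
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint Riemann rule on the path
    grads = np.mean([model_grad(baseline + a * (x - baseline), w)
                     for a in alphas], axis=0)
    return (x - baseline) * grads              # (T, d) per-element attributions

rng = np.random.default_rng(1)
x = rng.standard_normal((30, 4))   # 1 s of 4 features at 30 fps
w = rng.standard_normal((30, 4)) * 0.1

ig = integrated_gradients(x, w)
feature_impact = ig.mean(axis=0)                  # signed per-feature impact
order = np.argsort(np.abs(feature_impact))[::-1]  # importance ranking

# completeness: attributions sum to F(x) - F(baseline)
print(np.isclose(ig.sum(), model(x, w) - model(np.zeros_like(x), w), atol=1e-3))
```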

Results and Discussion
Table 4 displays the results of all fusion strategies for the experimental models. The Window Block LSTM experiment using the intermediate fusion strategy gave the best overall results, with an accuracy of 0.8958. Since there is human error in depression analysis, particularly at the mild and moderate levels where depressed and non-depressed individuals are difficult to distinguish, the Window Block LSTM and Bi-LSTM models that employ intermediate fusion were also tested with label smoothing. The results in Table 5 demonstrate that both models perform better when label smoothing is configured properly. With (0.3, 0.7) label smoothing, the accuracy of the Bi-LSTM model is 0.9167 with zero false positives. The highest F1-score is achieved by the Window Block LSTM with (0.05, 0.95) label smoothing, which also has an accuracy of 0.9167. To ensure that our Window Block LSTM performs well on other data sets, we tested the model with additional data, as shown in Appendix A.
To evaluate our model using solely facial features, we experimented with the Bi-LSTM [18] and transformer [19] baseline models on our data set. As shown in Table 6, our Bi-LSTM and Window Block LSTM models perform better than the baselines. Our Window Block LSTM model surpasses the baseline with an accuracy of 91.67% and a macro F1-score of 88.89%, similar to our Bi-LSTM, which has an accuracy of 91.67% and a macro F1-score of 88.21%. Important key features that represent depressive symptoms are shown in Figure 9. The four main feature groups are separated for clear visualization; each contains two graphs that represent the important key features of depression or non-depression and their order of importance. Panels (A), (C), (E), and (G) display the impact on the output of pose, gaze, action unit regression, and action unit classification, respectively; panels (B), (D), (F), and (H) display the corresponding feature importance. Finally, a summary of all features is shown in Figure 10, containing two graphs that represent the important key features and their order of importance. Red refers to a negative effect (tends toward non-depressive), and blue refers to a positive effect (tends toward depressive).
The crucial pose features are pitch, roll, and yaw, in that order, as shown in Figure 9B. The movement of the head pose features is shown in Figure 11. Figure 9A shows that pitch and roll lead the model to predict non-depression, because head-nodding and head-tilting indicate high energy and openness to social interaction [26][27][28][29]. On the other hand, yaw contributes to depression, because head turning indicates a lack of concentration, loss of social interest, and withdrawal.
The significant gaze features are looking up or down and looking left or right, as shown in Figure 9D. The movement of the eye gaze features is shown in Figure 12. Figure 9C shows that both gaze features contribute to depression, because looking around means patients are unfocused and absent-minded. Following previous studies [15], the reduction in eye movement can be observed by a model to determine depressive symptoms. Moreover, a nonspecific gaze and lack of eye contact can be classified as gestures of depression [26][27][28].
Figure 9E,F show the impact on depression or non-depression of the action unit regression features and their order of importance, respectively. The movement of the action units is shown in Figure 1. The obvious features related to depression are AU26 jaw drop, AU20 lip stretcher, and AU07 lid tightener, which represent grumbling, frowning, and scowling faces; all three express negative feelings and social disinterest [28]. On the other hand, the features related to non-depression are AU06 cheek raiser, AU25 lips part, AU14 dimpler, and AU12 lip corner puller, which represent the posture of talking and smiling, indicating social interest and happiness. The impact on depression or non-depression of the action unit classification features and their order of importance are shown in Figure 9G,H, respectively. In the same direction as action unit regression, AU07 lid tightener and AU26 jaw drop are related to depression. Conversely, while AU25 lips part contributes to non-depression in action unit regression, in action unit classification it contributes to depression, because this action unit appears when people talk; in the same direction as [28], silence and speaking can be classified as depression or non-depression depending on speech content. The features related to non-depression are AU23 lip tightener, AU12 lip corner puller, AU45 blink, and AU09 nose wrinkler, which represent expressions of social interest: pursing the lips, smiling, and blinking, for instance.
Finally, the summary of all features is shown in Figure 10. The important features are action unit classification, action unit regression, head pose, and gaze, in that order. Both action unit classification and regression are indicators for classifying depression, because facial expression can be detected from them, as in human observation [14]. Head pose and gaze are indicators of concentration and social interest. Therefore, depression can be detected by these four main explainable feature groups.
The explainable model is helpful for understanding how the model works. The insight we found is that depressive patients show a lack of concentration, social disinterest, and negative feelings. They are inclined to turn their heads, gaze in no specific direction, and move their eyes more slowly than normal people. They do not smile and constantly grumble, frown, and scowl.

Conclusions
Our paper aimed to improve model accuracy by experimenting with three types of fusion modeling techniques using the transformer, window block LSTM, and Bi-LSTM models, together with feature selection and label smoothing. The best model for depression detection is the intermediate window block LSTM model with appropriate label smoothing (0.9, 0.1). The model obtains an 88.89% F1-score, 87.03% recall, 91.67% accuracy, and 91.40% precision. In addition, the model achieves 100% accuracy for the mild and severe classes. The advantages of this model over the other experimental models are that window blocks can be trained faster than Bi-LSTM models and can handle lengthy time-series data. After running the integrated gradient technique on a test data set, the model can represent the significant facial features in an easily understood form. This may prompt the creation of a tool that concentrates on the important parts of videos in order to speed up the diagnosis process for psychiatrists.

Figure 1 .
Figure 1. Example of action units.

Figure 3 .
Figure 3. Gender and age diversity of the data set.

Figure 6 .
Figure 6. Early fusion model architecture. (A) Early Bi-LSTM fusion model architecture. (B) Early window block LSTM fusion model architecture. (C) Early transformer fusion model architecture.

Figure 7 .
Figure 7. Intermediate fusion model architecture. (A) Intermediate Bi-LSTM fusion model architecture. (B) Intermediate window block LSTM fusion model architecture. (C) Intermediate transformer fusion model architecture. Green blocks refer to the Figure 6A Bi-LSTM and attention layers. Yellow blocks refer to the Figure 6B reshape, time-distributed LSTM, and time-distributed attention layers. Gray blocks refer to the Figure 6C average pooling 1D and transformer layers.

Late Fusion Model Architecture

The late fusion model uses a single model for each feature and an aggregate layer that averages the prediction values across every attribute before reaching a conclusion. The diagram of the late fusion model is displayed in Figure 8. Before the feed-forward layers, the configuration of the model is the same as in the intermediate fusion model; the difference is a power of two in the units of the feed-forward layers.

Figure 9 .
Figure 9. (A) Pose impact on model output, (B) Pose impact on model output magnitude, (C) Gaze impact on model output, (D) Gaze impact on model output magnitude, (E) AUr impact on model output, (F) AUr impact on model output magnitude, (G) AUc impact on model output, (H) AUc impact on model output magnitude. Red refers to a negative effect (tends toward non-depressive), and blue refers to a positive effect (tends toward depressive).

Figure 10 .
Figure 10. (A) Positive/negative impact of all features. (B) Absolute impact (magnitude) of all features. Red refers to a negative effect (tends toward non-depressive), and blue refers to a positive effect (tends toward depressive).

Table 1 .
Related works on depression prediction. * CCC refers to the concordance correlation coefficient.

Table 3 .
Preliminary experiment with each feature. Highlighted numbers refer to the winners.

Table 4 .
Fusion modality result comparison. Highlighted numbers refer to the winners.

Table 5 .
Intermediate fusion with label smoothing result comparison. Highlighted numbers refer to the winners.

Table 6 .
Baseline and our result comparison. Highlighted numbers refer to the winners.

Table A1 .
Additional data set results comparison. Highlighted numbers refer to the winners.